Implement the AC-8.6 (top-K=10 retrieval scale-ratio + scene-change
PARTIAL) and AC-8.2 / AC-NEW-6 (stale aged-tile rejection) blackbox
scenarios.
AZ-423 (FT-P-19, 3pt) helpers + scenario:
- retrieval_evaluator.py — top-K within-distance evaluator (60 stills
vs 100 m budget), scene-change PARTIAL recorder (always emits
PARTIAL on the 2 _gmaps.png pairs), FDR record projectors, CSV
writers.
- tests/positive/test_ft_p_19_sat_reloc_scale.py (6 parametrised
variants).
AZ-427 (FT-N-05, 2pt) helpers + scenario:
- aged_tile_rejection_evaluator.py — Signal A (stale rejection at
load) + Signal B (per-frame downgrade) decision matrix, reuses
ALLOWED_SOURCE_LABELS from estimate_schema.
- tests/negative/test_ft_n_05_stale_tile_rejection.py (12 parametrised
variants: FC × VIO × {7mo/active-conflict, 13mo/rear}).
48 new unit tests cover every helper branch. Both scenarios skip
when sitl_replay_ready is false and fail loudly when fixture records
are missing.
Per-batch review: PASS_WITH_WARNINGS (2 Low — production-dependency
surface, FDR-kind constant duplication).
Cumulative review 82-84: PASS (2 Low carry-over / hygiene candidate).
Co-authored-by: Cursor <cursoragent@cursor.com>
9.4 KiB
Code Review Report
Batch: 84 — AZ-423 + AZ-427 (FT-P-19 sat-relocalization scale + FT-N-05 stale-tile rejection) Date: 2026-05-17 Verdict: PASS_WITH_WARNINGS
Files Reviewed
Created:
e2e/runner/helpers/retrieval_evaluator.pye2e/runner/helpers/aged_tile_rejection_evaluator.pye2e/tests/positive/test_ft_p_19_sat_reloc_scale.pye2e/tests/negative/test_ft_n_05_stale_tile_rejection.pye2e/_unit_tests/helpers/test_retrieval_evaluator.pye2e/_unit_tests/helpers/test_aged_tile_rejection_evaluator.py
Modified:
e2e/_unit_tests/test_directory_layout.py(registered 4 new paths)
Findings
| # | Severity | Category | File:Line | Title |
|---|---|---|---|---|
| 1 | Low | Spec-Gap | e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py, e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py |
Tests depend on upstream production + fixture-builder features |
| 2 | Low | Maintainability | e2e/runner/helpers/aged_tile_rejection_evaluator.py, e2e/runner/helpers/mid_flight_tile_evaluator.py |
TILE_LOAD_REJECTED_FDR_KIND and stale-reason constant now duplicated across two helpers |
Finding Details
F1: Tests depend on upstream production + fixture-builder features that don't exist yet (Low / Spec-Gap)
- Location:
e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py,e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py - Description: Both scenarios require:
- AZ-423: SUT emitting FDR
retrieval-topkrecords (one per pushed image, carrying 10 candidate tile centres) andscene-change-matchrecords for the 2 paired_gmaps.pngimages. Production code must expose the C2 top-K=10 retrieval result on the FDR boundary. - AZ-427: SUT either (a) emitting FDR
tile-load-rejected: staleevents at startup OR (b) downgrading every emission'ssource_labelto{visual_propagated, dead_reckoned}for the aged-tile sub-cases. Production code must wire the C6 freshness gate per-sector (active-conflict vs rear). - Fixture-builder: snapshot/mount of
synth-age-7mo/synth-age-13motile sets +E2E_FT_N_05_FIXTUREenv var declaring the active sub-case fixture per pytest invocation.
- AZ-423: SUT emitting FDR
- Tests skip cleanly when
sitl_replay_ready/E2E_FT_N_05_FIXTUREare absent and fail loudly when the fixture exists but the required records are missing — adhering to the "tests as gates" principle. - Suggestion: Surface as a single line in the AZ-423 + AZ-427 batch report under Production Dependencies. No code change in this batch.
- Tasks: AZ-423, AZ-427.
F2: TILE_LOAD_REJECTED_FDR_KIND and stale-reason constant duplicated (Low / Maintainability)
- Location:
e2e/runner/helpers/aged_tile_rejection_evaluator.py:33-34,e2e/runner/helpers/mid_flight_tile_evaluator.py:43-44 - Description: Both helpers redefine
TILE_LOAD_REJECTED_FDR_KIND = "tile-load-rejected"and the stale-reason constant. This is intentional in spirit (each helper exposes the FDR vocabulary it operates on) but creates two copies of the same FDR-contract string. If the SUT renames the record type later, both files must change in lockstep. - Suggestion: Consolidate in a dedicated
runner/helpers/fdr_record_kinds.pymodule in a future hygiene PBI; not in scope for AZ-423/AZ-427. - Tasks: hygiene, future batch.
Phase 1: Context Loading
Read AZ-423 + AZ-427 task specs. Both have 3 ACs; both are blackbox-test deliverables under E-BBT (AZ-262); both depend only on AZ-406 + AZ-407 (test infra + static fixtures); SUT boundary statements verified.
Phase 2: Spec Compliance
AZ-423 (FT-P-19)
| AC | Helper | Scenario assertion | Unit-test coverage |
|---|---|---|---|
| AC-1 (top-K=10 includes tile within 100 m of true centre, 60 stills) | evaluate_top_k_within_distance |
assert topk_report.passes |
9 cases (single close, all far, mixed top-K, empty, count mismatch, invalid tolerance, custom tolerance, 60 all-pass, 60 one-fail) |
AC-2 (scene-change PARTIAL on 2 _gmaps.png pairs) |
evaluate_scene_change_subset |
assert coverage_complete AND overall_label == "PARTIAL" |
5 cases (both matched still PARTIAL, zero matched still PARTIAL, one image only, empty, wrong image ids) |
| AC-3 (parameterised) | conftest fixtures | 6 collected variants | — |
Plus 2 CSV-writer round-trip tests, 5 project_topk_record_to_query cases (happy / malformed / empty / non-dict / type errors), 5 project_scene_change_record cases, 2 iter_*_payloads filter cases — 29 unit tests total for AZ-423.
AZ-427 (FT-N-05)
| AC | Helper | Scenario assertion | Unit-test coverage |
|---|---|---|---|
| AC-1 (7mo aged tiles in active-conflict sector → 0 anchored emissions) | evaluate_aged_tile_rejection(fixture="synth-age-7mo", sector="active_conflict", …) |
parametrised sub-case | 3 Signal-A cases + 3 Signal-B cases + 5 degenerate-input cases |
| AC-2 (13mo aged tiles in rear sector → 0 anchored emissions) | same helper with rear fixture binding | parametrised sub-case | same as AC-1 (sub-case symmetric) |
| AC-3 (parameterised) | conftest fixtures + 2 sub-case parametrisation | 12 collected variants (6 × 2) | — |
Plus 6 project_stale_rejection_payload cases (happy / id-alias / wrong-reason / non-dict / missing-id / wrong-type), 1 iter_stale_rejection_payloads filter case, 1 AGED_FIXTURE_SECTOR_BINDINGS contract assertion — 19 unit tests total for AZ-427.
All ACs satisfied. @pytest.mark.traces_to markers wire scenarios to AC IDs.
Phase 3: Code Quality
- SOLID: each evaluator is a pure function over typed input dataclasses returning a frozen-dataclass report.
retrieval_evaluatorandaged_tile_rejection_evaluatorare siblings of the existingmid_flight_tile_evaluator/gcs_telemetry_evaluator/tile_cache_inspectorpattern. - Error handling:
evaluate_top_k_within_distanceraisesValueErroron invalid tolerance.evaluate_aged_tile_rejectionhas no failure modes (always returns a report; the report's.passescarries the verdict). No bareexcept. - Naming:
evaluate_*matches sibling helpers._normalise_image_idin scenario test mirrors the convention fromaccuracy_evaluatorGT keying.Signal A/Signal Bproperties make the decision matrix self-documenting. - Complexity: longest function
evaluate_top_k_within_distanceat ~25 lines;evaluate_aged_tile_rejectionat ~20 lines. All under coderule's 50-line threshold. - DRY: F2 captures one intra-batch duplication. Within each helper, no duplicated logic.
- Test quality: every unit test uses Arrange/Act/Assert comments per coderule. Tests assert specific outcomes (matched_count, anchored_frame_ids, signal flags, CSV content) rather than "no error thrown".
- Dead code: none.
iter_topk_payloads/iter_scene_change_payloads/iter_stale_rejection_payloadsare scenario-test helpers used in the correspondingtests/files.
Phase 4: Security Quick-Scan
- No SQL, no
subprocess, noexec/eval. - No hardcoded secrets.
- Input validation: all
project_*functions guard against malformed dict payloads;evaluate_top_k_within_distancevalidates tolerance. - No sensitive data in logs or error messages.
evaluate_aged_tile_rejectionsurfacesillegal_labelsseparately from passes/fails so a contract-violation defect can't be misread as a freshness-gate defect.
Phase 5: Performance Scan
evaluate_top_k_within_distanceis O(N·K) where N=60 images, K=10 candidates → 600 distance computations per scenario. Vincenty viapyprojhandles this in well under 100 ms.evaluate_aged_tile_rejectionis O(M) where M = emissions count (≤ 60) + rejections count (typically 0–2). Trivial.- FDR iteration is generator-based via
fdr_reader.iter_records. - No N+1 patterns. No unbounded fetching.
Phase 6: Cross-Task Consistency
- Both helpers reuse
runner.helpers.geo.distance_m(Vincenty WGS84) — consistent withmid_flight_tile_evaluatorfrom batch 83. aged_tile_rejection_evaluatorreusesALLOWED_SOURCE_LABELSfromestimate_schema— single source of truth for the FT-P-03 / AC-1.4 source-label contract.- Scenario tests follow the same skip pattern as FT-P-12/13/15/16/17/18 (sitl_replay_ready gate + fail-loud on missing records).
- The frozen-dataclass +
@property passesconvention matches every prior helper (batches 79-83). traces_tomarkers + CSV evidence emission toft-p-19-*.csv/ft-n-05-*.csvfollow the FT-P-01 / FT-P-05 / FT-P-17 cadence.
Phase 7: Architecture Compliance
- All new files under
e2e/— owned exclusively by the Blackbox Tests component per_docs/02_document/module-layout.md. - No imports from
src/gps_denied_onboard— verified by reading every import in the new files; both helpers carry an explicit "public-boundary discipline" docstring note. - No layering violations.
- No new cyclic module dependencies. Both helpers import from
.geo/.estimate_schemaonly; tests import fromrunner.helpers.*. - No duplicate symbols across components. The intra-component duplicate FDR-kind constant (F2) is intra-helper-set, not cross-component.
- No cross-cutting concerns re-implemented locally; geo math, source-label set, and CSV emission all delegate to project-wide utilities.
Verdict
- 0 Critical, 0 High → no FAIL trigger.
- 2 Low → PASS_WITH_WARNINGS.
Batch 84 is ready to commit. The two Low findings are surfaced for batch report and feed forward into the cumulative review for batches 82–84.