# Code Review Report **Batch**: 84 — AZ-423 + AZ-427 (FT-P-19 sat-relocalization scale + FT-N-05 stale-tile rejection) **Date**: 2026-05-17 **Verdict**: PASS_WITH_WARNINGS ## Files Reviewed **Created**: - `e2e/runner/helpers/retrieval_evaluator.py` - `e2e/runner/helpers/aged_tile_rejection_evaluator.py` - `e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py` - `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py` - `e2e/_unit_tests/helpers/test_retrieval_evaluator.py` - `e2e/_unit_tests/helpers/test_aged_tile_rejection_evaluator.py` **Modified**: - `e2e/_unit_tests/test_directory_layout.py` (registered 4 new paths) ## Findings | # | Severity | Category | File:Line | Title | |---|----------|----------|-----------|-------| | 1 | Low | Spec-Gap | `e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py`, `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py` | Tests depend on upstream production + fixture-builder features | | 2 | Low | Maintainability | `e2e/runner/helpers/aged_tile_rejection_evaluator.py`, `e2e/runner/helpers/mid_flight_tile_evaluator.py` | `TILE_LOAD_REJECTED_FDR_KIND` and stale-reason constant now duplicated across two helpers | ### Finding Details **F1: Tests depend on upstream production + fixture-builder features that don't exist yet** (Low / Spec-Gap) - Location: `e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py`, `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py` - Description: Both scenarios require: * **AZ-423**: SUT emitting FDR `retrieval-topk` records (one per pushed image, carrying 10 candidate tile centres) and `scene-change-match` records for the 2 paired `_gmaps.png` images. Production code must expose the C2 top-K=10 retrieval result on the FDR boundary. * **AZ-427**: SUT either (a) emitting FDR `tile-load-rejected: stale` events at startup OR (b) downgrading every emission's `source_label` to `{visual_propagated, dead_reckoned}` for the aged-tile sub-cases. Production code must wire the C6 freshness gate per-sector (active-conflict vs rear). * **Fixture-builder**: snapshot/mount of `synth-age-7mo` / `synth-age-13mo` tile sets + `E2E_FT_N_05_FIXTURE` env var declaring the active sub-case fixture per pytest invocation. - Tests **skip cleanly** when `sitl_replay_ready` / `E2E_FT_N_05_FIXTURE` are absent and **fail loudly** when the fixture exists but the required records are missing — adhering to the "tests as gates" principle. - Suggestion: Surface as a single line in the AZ-423 + AZ-427 batch report under Production Dependencies. No code change in this batch. - Tasks: AZ-423, AZ-427. **F2: `TILE_LOAD_REJECTED_FDR_KIND` and stale-reason constant duplicated** (Low / Maintainability) - Location: `e2e/runner/helpers/aged_tile_rejection_evaluator.py:33-34`, `e2e/runner/helpers/mid_flight_tile_evaluator.py:43-44` - Description: Both helpers redefine `TILE_LOAD_REJECTED_FDR_KIND = "tile-load-rejected"` and the stale-reason constant. This is intentional in spirit (each helper exposes the FDR vocabulary it operates on) but creates two copies of the same FDR-contract string. If the SUT renames the record type later, both files must change in lockstep. - Suggestion: Consolidate in a dedicated `runner/helpers/fdr_record_kinds.py` module in a future hygiene PBI; not in scope for AZ-423/AZ-427. - Tasks: hygiene, future batch. ## Phase 1: Context Loading Read AZ-423 + AZ-427 task specs. Both have 3 ACs; both are blackbox-test deliverables under E-BBT (AZ-262); both depend only on AZ-406 + AZ-407 (test infra + static fixtures); SUT boundary statements verified. ## Phase 2: Spec Compliance ### AZ-423 (FT-P-19) | AC | Helper | Scenario assertion | Unit-test coverage | |----|--------|--------------------|--------------------| | AC-1 (top-K=10 includes tile within 100 m of true centre, 60 stills) | `evaluate_top_k_within_distance` | `assert topk_report.passes` | 9 cases (single close, all far, mixed top-K, empty, count mismatch, invalid tolerance, custom tolerance, 60 all-pass, 60 one-fail) | | AC-2 (scene-change PARTIAL on 2 `_gmaps.png` pairs) | `evaluate_scene_change_subset` | `assert coverage_complete AND overall_label == "PARTIAL"` | 5 cases (both matched still PARTIAL, zero matched still PARTIAL, one image only, empty, wrong image ids) | | AC-3 (parameterised) | conftest fixtures | 6 collected variants | — | Plus 2 CSV-writer round-trip tests, 5 `project_topk_record_to_query` cases (happy / malformed / empty / non-dict / type errors), 5 `project_scene_change_record` cases, 2 `iter_*_payloads` filter cases — 29 unit tests total for AZ-423. ### AZ-427 (FT-N-05) | AC | Helper | Scenario assertion | Unit-test coverage | |----|--------|--------------------|--------------------| | AC-1 (7mo aged tiles in active-conflict sector → 0 anchored emissions) | `evaluate_aged_tile_rejection(fixture="synth-age-7mo", sector="active_conflict", …)` | parametrised sub-case | 3 Signal-A cases + 3 Signal-B cases + 5 degenerate-input cases | | AC-2 (13mo aged tiles in rear sector → 0 anchored emissions) | same helper with rear fixture binding | parametrised sub-case | same as AC-1 (sub-case symmetric) | | AC-3 (parameterised) | conftest fixtures + 2 sub-case parametrisation | 12 collected variants (6 × 2) | — | Plus 6 `project_stale_rejection_payload` cases (happy / id-alias / wrong-reason / non-dict / missing-id / wrong-type), 1 `iter_stale_rejection_payloads` filter case, 1 `AGED_FIXTURE_SECTOR_BINDINGS` contract assertion — 19 unit tests total for AZ-427. All ACs satisfied. `@pytest.mark.traces_to` markers wire scenarios to AC IDs. ## Phase 3: Code Quality - **SOLID**: each evaluator is a pure function over typed input dataclasses returning a frozen-dataclass report. `retrieval_evaluator` and `aged_tile_rejection_evaluator` are siblings of the existing `mid_flight_tile_evaluator` / `gcs_telemetry_evaluator` / `tile_cache_inspector` pattern. - **Error handling**: `evaluate_top_k_within_distance` raises `ValueError` on invalid tolerance. `evaluate_aged_tile_rejection` has no failure modes (always returns a report; the report's `.passes` carries the verdict). No bare `except`. - **Naming**: `evaluate_*` matches sibling helpers. `_normalise_image_id` in scenario test mirrors the convention from `accuracy_evaluator` GT keying. `Signal A` / `Signal B` properties make the decision matrix self-documenting. - **Complexity**: longest function `evaluate_top_k_within_distance` at ~25 lines; `evaluate_aged_tile_rejection` at ~20 lines. All under coderule's 50-line threshold. - **DRY**: F2 captures one intra-batch duplication. Within each helper, no duplicated logic. - **Test quality**: every unit test uses Arrange/Act/Assert comments per coderule. Tests assert specific outcomes (matched_count, anchored_frame_ids, signal flags, CSV content) rather than "no error thrown". - **Dead code**: none. `iter_topk_payloads` / `iter_scene_change_payloads` / `iter_stale_rejection_payloads` are scenario-test helpers used in the corresponding `tests/` files. ## Phase 4: Security Quick-Scan - No SQL, no `subprocess`, no `exec`/`eval`. - No hardcoded secrets. - Input validation: all `project_*` functions guard against malformed dict payloads; `evaluate_top_k_within_distance` validates tolerance. - No sensitive data in logs or error messages. - `evaluate_aged_tile_rejection` surfaces `illegal_labels` separately from passes/fails so a contract-violation defect can't be misread as a freshness-gate defect. ## Phase 5: Performance Scan - `evaluate_top_k_within_distance` is O(N·K) where N=60 images, K=10 candidates → 600 distance computations per scenario. Vincenty via `pyproj` handles this in well under 100 ms. - `evaluate_aged_tile_rejection` is O(M) where M = emissions count (≤ 60) + rejections count (typically 0–2). Trivial. - FDR iteration is generator-based via `fdr_reader.iter_records`. - No N+1 patterns. No unbounded fetching. ## Phase 6: Cross-Task Consistency - Both helpers reuse `runner.helpers.geo.distance_m` (Vincenty WGS84) — consistent with `mid_flight_tile_evaluator` from batch 83. - `aged_tile_rejection_evaluator` reuses `ALLOWED_SOURCE_LABELS` from `estimate_schema` — single source of truth for the FT-P-03 / AC-1.4 source-label contract. - Scenario tests follow the same skip pattern as FT-P-12/13/15/16/17/18 (sitl_replay_ready gate + fail-loud on missing records). - The frozen-dataclass + `@property passes` convention matches every prior helper (batches 79-83). - `traces_to` markers + CSV evidence emission to `ft-p-19-*.csv` / `ft-n-05-*.csv` follow the FT-P-01 / FT-P-05 / FT-P-17 cadence. ## Phase 7: Architecture Compliance - All new files under `e2e/` — owned exclusively by the Blackbox Tests component per `_docs/02_document/module-layout.md`. - **No imports from `src/gps_denied_onboard`** — verified by reading every import in the new files; both helpers carry an explicit "public-boundary discipline" docstring note. - No layering violations. - No new cyclic module dependencies. Both helpers import from `.geo` / `.estimate_schema` only; tests import from `runner.helpers.*`. - No duplicate symbols across components. The intra-component duplicate FDR-kind constant (F2) is intra-helper-set, not cross-component. - No cross-cutting concerns re-implemented locally; geo math, source-label set, and CSV emission all delegate to project-wide utilities. ## Verdict - 0 Critical, 0 High → no FAIL trigger. - 2 Low → **PASS_WITH_WARNINGS**. Batch 84 is ready to commit. The two Low findings are surfaced for batch report and feed forward into the cumulative review for batches 82–84.