[AZ-423] [AZ-427] Add FT-P-19 + FT-N-05 blackbox tests

Implement the AC-8.6 (top-K=10 retrieval scale-ratio + scene-change PARTIAL) and AC-8.2 / AC-NEW-6 (stale aged-tile rejection) blackbox scenarios. AZ-423 (FT-P-19, 3pt) helpers + scenario: - retrieval_evaluator.py — top-K within-distance evaluator (60 stills vs 100 m budget), scene-change PARTIAL recorder (always emits PARTIAL on the 2 _gmaps.png pairs), FDR record projectors, CSV writers. - tests/positive/test_ft_p_19_sat_reloc_scale.py (6 parametrised variants). AZ-427 (FT-N-05, 2pt) helpers + scenario: - aged_tile_rejection_evaluator.py — Signal A (stale rejection at load) + Signal B (per-frame downgrade) decision matrix, reuses ALLOWED_SOURCE_LABELS from estimate_schema. - tests/negative/test_ft_n_05_stale_tile_rejection.py (12 parametrised variants: FC × VIO × {7mo/active-conflict, 13mo/rear}). 48 new unit tests cover every helper branch. Both scenarios skip when sitl_replay_ready is false and fail loudly when fixture records are missing. Per-batch review: PASS_WITH_WARNINGS (2 Low — production-dependency surface, FDR-kind constant duplication). Cumulative review 82-84: PASS (2 Low carry-over / hygiene candidate). Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 13:01:14 +00:00 · 2026-05-17 15:43:06 +03:00
parent a22028087f
commit f25cae4a82
13 changed files with 2005 additions and 3 deletions
@@ -0,0 +1,120 @@
+# Code Review Report
+
+**Batch**: 84 — AZ-423 + AZ-427 (FT-P-19 sat-relocalization scale + FT-N-05 stale-tile rejection)
+**Date**: 2026-05-17
+**Verdict**: PASS_WITH_WARNINGS
+
+## Files Reviewed
+
+**Created**:
+- `e2e/runner/helpers/retrieval_evaluator.py`
+- `e2e/runner/helpers/aged_tile_rejection_evaluator.py`
+- `e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py`
+- `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py`
+- `e2e/_unit_tests/helpers/test_retrieval_evaluator.py`
+- `e2e/_unit_tests/helpers/test_aged_tile_rejection_evaluator.py`
+
+**Modified**:
+- `e2e/_unit_tests/test_directory_layout.py` (registered 4 new paths)
+
+## Findings
+
+| # | Severity | Category | File:Line | Title |
+|---|----------|----------|-----------|-------|
+| 1 | Low | Spec-Gap | `e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py`, `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py` | Tests depend on upstream production + fixture-builder features |
+| 2 | Low | Maintainability | `e2e/runner/helpers/aged_tile_rejection_evaluator.py`, `e2e/runner/helpers/mid_flight_tile_evaluator.py` | `TILE_LOAD_REJECTED_FDR_KIND` and stale-reason constant now duplicated across two helpers |
+
+### Finding Details
+
+**F1: Tests depend on upstream production + fixture-builder features that don't exist yet** (Low / Spec-Gap)
+- Location: `e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py`, `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py`
+- Description: Both scenarios require:
+  * **AZ-423**: SUT emitting FDR `retrieval-topk` records (one per pushed image, carrying 10 candidate tile centres) and `scene-change-match` records for the 2 paired `_gmaps.png` images. Production code must expose the C2 top-K=10 retrieval result on the FDR boundary.
+  * **AZ-427**: SUT either (a) emitting FDR `tile-load-rejected: stale` events at startup OR (b) downgrading every emission's `source_label` to `{visual_propagated, dead_reckoned}` for the aged-tile sub-cases. Production code must wire the C6 freshness gate per-sector (active-conflict vs rear).
+  * **Fixture-builder**: snapshot/mount of `synth-age-7mo` / `synth-age-13mo` tile sets + `E2E_FT_N_05_FIXTURE` env var declaring the active sub-case fixture per pytest invocation.
+- Tests **skip cleanly** when `sitl_replay_ready` / `E2E_FT_N_05_FIXTURE` are absent and **fail loudly** when the fixture exists but the required records are missing — adhering to the "tests as gates" principle.
+- Suggestion: Surface as a single line in the AZ-423 + AZ-427 batch report under Production Dependencies. No code change in this batch.
+- Tasks: AZ-423, AZ-427.
+
+**F2: `TILE_LOAD_REJECTED_FDR_KIND` and stale-reason constant duplicated** (Low / Maintainability)
+- Location: `e2e/runner/helpers/aged_tile_rejection_evaluator.py:33-34`, `e2e/runner/helpers/mid_flight_tile_evaluator.py:43-44`
+- Description: Both helpers redefine `TILE_LOAD_REJECTED_FDR_KIND = "tile-load-rejected"` and the stale-reason constant. This is intentional in spirit (each helper exposes the FDR vocabulary it operates on) but creates two copies of the same FDR-contract string. If the SUT renames the record type later, both files must change in lockstep.
+- Suggestion: Consolidate in a dedicated `runner/helpers/fdr_record_kinds.py` module in a future hygiene PBI; not in scope for AZ-423/AZ-427.
+- Tasks: hygiene, future batch.
+
+## Phase 1: Context Loading
+
+Read AZ-423 + AZ-427 task specs. Both have 3 ACs; both are blackbox-test deliverables under E-BBT (AZ-262); both depend only on AZ-406 + AZ-407 (test infra + static fixtures); SUT boundary statements verified.
+
+## Phase 2: Spec Compliance
+
+### AZ-423 (FT-P-19)
+
+| AC | Helper | Scenario assertion | Unit-test coverage |
+|----|--------|--------------------|--------------------|
+| AC-1 (top-K=10 includes tile within 100 m of true centre, 60 stills) | `evaluate_top_k_within_distance` | `assert topk_report.passes` | 9 cases (single close, all far, mixed top-K, empty, count mismatch, invalid tolerance, custom tolerance, 60 all-pass, 60 one-fail) |
+| AC-2 (scene-change PARTIAL on 2 `_gmaps.png` pairs) | `evaluate_scene_change_subset` | `assert coverage_complete AND overall_label == "PARTIAL"` | 5 cases (both matched still PARTIAL, zero matched still PARTIAL, one image only, empty, wrong image ids) |
+| AC-3 (parameterised) | conftest fixtures | 6 collected variants | — |
+
+Plus 2 CSV-writer round-trip tests, 5 `project_topk_record_to_query` cases (happy / malformed / empty / non-dict / type errors), 5 `project_scene_change_record` cases, 2 `iter_*_payloads` filter cases — 29 unit tests total for AZ-423.
+
+### AZ-427 (FT-N-05)
+
+| AC | Helper | Scenario assertion | Unit-test coverage |
+|----|--------|--------------------|--------------------|
+| AC-1 (7mo aged tiles in active-conflict sector → 0 anchored emissions) | `evaluate_aged_tile_rejection(fixture="synth-age-7mo", sector="active_conflict", …)` | parametrised sub-case | 3 Signal-A cases + 3 Signal-B cases + 5 degenerate-input cases |
+| AC-2 (13mo aged tiles in rear sector → 0 anchored emissions) | same helper with rear fixture binding | parametrised sub-case | same as AC-1 (sub-case symmetric) |
+| AC-3 (parameterised) | conftest fixtures + 2 sub-case parametrisation | 12 collected variants (6 × 2) | — |
+
+Plus 6 `project_stale_rejection_payload` cases (happy / id-alias / wrong-reason / non-dict / missing-id / wrong-type), 1 `iter_stale_rejection_payloads` filter case, 1 `AGED_FIXTURE_SECTOR_BINDINGS` contract assertion — 19 unit tests total for AZ-427.
+
+All ACs satisfied. `@pytest.mark.traces_to` markers wire scenarios to AC IDs.
+
+## Phase 3: Code Quality
+
+- **SOLID**: each evaluator is a pure function over typed input dataclasses returning a frozen-dataclass report. `retrieval_evaluator` and `aged_tile_rejection_evaluator` are siblings of the existing `mid_flight_tile_evaluator` / `gcs_telemetry_evaluator` / `tile_cache_inspector` pattern.
+- **Error handling**: `evaluate_top_k_within_distance` raises `ValueError` on invalid tolerance. `evaluate_aged_tile_rejection` has no failure modes (always returns a report; the report's `.passes` carries the verdict). No bare `except`.
+- **Naming**: `evaluate_*` matches sibling helpers. `_normalise_image_id` in scenario test mirrors the convention from `accuracy_evaluator` GT keying. `Signal A` / `Signal B` properties make the decision matrix self-documenting.
+- **Complexity**: longest function `evaluate_top_k_within_distance` at ~25 lines; `evaluate_aged_tile_rejection` at ~20 lines. All under coderule's 50-line threshold.
+- **DRY**: F2 captures one intra-batch duplication. Within each helper, no duplicated logic.
+- **Test quality**: every unit test uses Arrange/Act/Assert comments per coderule. Tests assert specific outcomes (matched_count, anchored_frame_ids, signal flags, CSV content) rather than "no error thrown".
+- **Dead code**: none. `iter_topk_payloads` / `iter_scene_change_payloads` / `iter_stale_rejection_payloads` are scenario-test helpers used in the corresponding `tests/` files.
+
+## Phase 4: Security Quick-Scan
+
+- No SQL, no `subprocess`, no `exec`/`eval`.
+- No hardcoded secrets.
+- Input validation: all `project_*` functions guard against malformed dict payloads; `evaluate_top_k_within_distance` validates tolerance.
+- No sensitive data in logs or error messages.
+- `evaluate_aged_tile_rejection` surfaces `illegal_labels` separately from passes/fails so a contract-violation defect can't be misread as a freshness-gate defect.
+
+## Phase 5: Performance Scan
+
+- `evaluate_top_k_within_distance` is O(N·K) where N=60 images, K=10 candidates → 600 distance computations per scenario. Vincenty via `pyproj` handles this in well under 100 ms.
+- `evaluate_aged_tile_rejection` is O(M) where M = emissions count (≤ 60) + rejections count (typically 0–2). Trivial.
+- FDR iteration is generator-based via `fdr_reader.iter_records`.
+- No N+1 patterns. No unbounded fetching.
+
+## Phase 6: Cross-Task Consistency
+
+- Both helpers reuse `runner.helpers.geo.distance_m` (Vincenty WGS84) — consistent with `mid_flight_tile_evaluator` from batch 83.
+- `aged_tile_rejection_evaluator` reuses `ALLOWED_SOURCE_LABELS` from `estimate_schema` — single source of truth for the FT-P-03 / AC-1.4 source-label contract.
+- Scenario tests follow the same skip pattern as FT-P-12/13/15/16/17/18 (sitl_replay_ready gate + fail-loud on missing records).
+- The frozen-dataclass + `@property passes` convention matches every prior helper (batches 79-83).
+- `traces_to` markers + CSV evidence emission to `ft-p-19-*.csv` / `ft-n-05-*.csv` follow the FT-P-01 / FT-P-05 / FT-P-17 cadence.
+
+## Phase 7: Architecture Compliance
+
+- All new files under `e2e/` — owned exclusively by the Blackbox Tests component per `_docs/02_document/module-layout.md`.
+- **No imports from `src/gps_denied_onboard`** — verified by reading every import in the new files; both helpers carry an explicit "public-boundary discipline" docstring note.
+- No layering violations.
+- No new cyclic module dependencies. Both helpers import from `.geo` / `.estimate_schema` only; tests import from `runner.helpers.*`.
+- No duplicate symbols across components. The intra-component duplicate FDR-kind constant (F2) is intra-helper-set, not cross-component.
+- No cross-cutting concerns re-implemented locally; geo math, source-label set, and CSV emission all delegate to project-wide utilities.
+
+## Verdict
+
+- 0 Critical, 0 High → no FAIL trigger.
+- 2 Low → **PASS_WITH_WARNINGS**.
+
+Batch 84 is ready to commit. The two Low findings are surfaced for batch report and feed forward into the cumulative review for batches 82–84.