[AZ-423] [AZ-427] Add FT-P-19 + FT-N-05 blackbox tests

Implement the AC-8.6 (top-K=10 retrieval scale-ratio + scene-change PARTIAL) and AC-8.2 / AC-NEW-6 (stale aged-tile rejection) blackbox scenarios.

AZ-423 (FT-P-19, 3pt) helpers + scenario:
- retrieval_evaluator.py: top-K within-distance evaluator (60 stills vs 100 m budget), scene-change PARTIAL recorder (always emits PARTIAL on the 2 _gmaps.png pairs), FDR record projectors, CSV writers.
- tests/positive/test_ft_p_19_sat_reloc_scale.py (6 parametrised variants).

AZ-427 (FT-N-05, 2pt) helpers + scenario:
- aged_tile_rejection_evaluator.py: Signal A (stale rejection at load) + Signal B (per-frame downgrade) decision matrix, reuses ALLOWED_SOURCE_LABELS from estimate_schema.
- tests/negative/test_ft_n_05_stale_tile_rejection.py (12 parametrised variants: FC × VIO × {7mo/active-conflict, 13mo/rear}).

48 new unit tests cover every helper branch. Both scenarios skip when sitl_replay_ready is false and fail loudly when fixture records are missing.

Per-batch review: PASS_WITH_WARNINGS (2 Low: production-dependency surface, FDR-kind constant duplication).
Cumulative review 82-84: PASS (2 Low carry-over / hygiene candidate).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 03:31:13 +00:00 · 2026-05-17 15:43:06 +03:00
parent a22028087f
commit f25cae4a82
13 changed files with 2005 additions and 3 deletions
@@ -0,0 +1,105 @@
+# Batch 84 — AZ-423 + AZ-427 (FT-P-19 sat-relocalization + FT-N-05 stale-tile rejection)
+
+**Tracker**: AZ-423, AZ-427
+**Tasks**: 2 tasks / 5 complexity points (3 + 2)
+**Date**: 2026-05-17
+**Verdict**: PASS_WITH_WARNINGS
+**Review**: `_docs/03_implementation/reviews/batch_84_review.md`
+
+## Scope
+
+- **AZ-423 / FT-P-19 (AC-8.6)**: per-image top-K=10 retrieval includes a tile centre within 100 m of the image's true centre across all 60 stills; scene-change subset (2 paired `_gmaps.png` images) carries PARTIAL annotation.
+- **AZ-427 / FT-N-05 (AC-8.2, AC-NEW-6)**: aged-tile fixtures (`synth-age-7mo` in active-conflict sector, `synth-age-13mo` in rear sector) produce zero `satellite_anchored` emissions — either by load-time stale rejection (Signal A) or per-frame downgrade (Signal B).
+
+## Files
+
+### Created
+
+- `e2e/runner/helpers/retrieval_evaluator.py` — pure-logic evaluators for AC-1 (top-K within distance) + AC-2 (scene-change PARTIAL); FDR payload projectors for `retrieval-topk` + `scene-change-match` records; CSV emitters for evidence.
+- `e2e/runner/helpers/aged_tile_rejection_evaluator.py` — pure-logic evaluator with explicit Signal-A / Signal-B decision matrix; FDR projector for `tile-load-rejected: stale` records; sector-binding constants (`synth-age-7mo → active_conflict`, `synth-age-13mo → rear`).
+- `e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py` — FT-P-19 scenario.
+- `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py` — FT-N-05 scenario (parametrised across 2 sub-cases via `AGED_FIXTURE_SECTOR_BINDINGS`).
+- `e2e/_unit_tests/helpers/test_retrieval_evaluator.py` — 29 unit tests.
+- `e2e/_unit_tests/helpers/test_aged_tile_rejection_evaluator.py` — 19 unit tests.
+
+### Modified
+
+- `e2e/_unit_tests/test_directory_layout.py` — registered 4 new paths.
+
+## Test Results
+
+```
+$ pytest _unit_tests/helpers/test_retrieval_evaluator.py \
+         _unit_tests/helpers/test_aged_tile_rejection_evaluator.py \
+         _unit_tests/test_directory_layout.py
+============================= 157 passed in 0.28s ==============================
+```
+
+Scenario collection:
+
+```
+$ pytest tests/positive/test_ft_p_19_sat_reloc_scale.py \
+         tests/negative/test_ft_n_05_stale_tile_rejection.py --collect-only
+collected 18 items
+  test_ft_p_19_sat_reloc_scale: 6 cases (ardupilot/inav × {okvis2, klt_ransac, vins_mono})
+  test_ft_n_05_stale_tile_rejection: 12 cases (FC × VIO × 2 aged-tile sub-cases)
+```
+
+## AC Verification
+
+### AZ-423 / FT-P-19
+
+| AC | Coverage |
+|----|----------|
+| AC-1 top-K=10 within 100 m for all 60 images | `evaluate_top_k_within_distance` + scenario assertion + 9 unit tests |
+| AC-2 scene-change subset PARTIAL | `evaluate_scene_change_subset` + scenario assertion + 5 unit tests |
+| AC-3 parameterisation | 6 collected variants via conftest `fc_adapter` / `vio_strategy` fixtures |
+
+### AZ-427 / FT-N-05
+
+| AC | Coverage |
+|----|----------|
+| AC-1 7mo aged tiles → 0 `satellite_anchored` emissions | parametrised sub-case + `evaluate_aged_tile_rejection` + 11 unit tests |
+| AC-2 13mo aged tiles → 0 `satellite_anchored` emissions | same helper + sub-case binding `synth-age-13mo → rear` |
+| AC-3 parameterisation | 12 collected variants (FC × VIO × 2 sub-cases) |
+
+`traces_to` markers:
+- FT-P-19: `AC-8.6,AC-1,AC-2,AC-3`
+- FT-N-05: `AC-8.2,AC-NEW-6,AC-1,AC-2,AC-3`
+
+## Code Review
+
+**Verdict**: PASS_WITH_WARNINGS — 0 Critical, 0 High, 2 Low.
+
+- **F1 (production-dependency surface)**: both scenarios depend on upstream SUT features (`retrieval-topk` / `scene-change-match` FDR record kinds for AZ-423; `tile-load-rejected: stale` events OR per-frame downgrades for AZ-427) and fixture-builder support (`synth-age-*` mounts + `E2E_FT_N_05_FIXTURE` env var). Tests skip cleanly when fixtures are missing and fail loudly when fixtures exist but records are missing.
+- **F2 (maintainability: duplication across helpers)**: `TILE_LOAD_REJECTED_FDR_KIND` and its stale-reason constant now exist in both `aged_tile_rejection_evaluator.py` (this batch) and `mid_flight_tile_evaluator.py` (batch 83). Future hygiene PBI candidate.
+
+Full review: `_docs/03_implementation/reviews/batch_84_review.md`.
+
+## Production Dependencies
+
+Surfaced for the cumulative review window (82-84) + traceability matrix:
+
+1. **AZ-423 SUT-side**: emit FDR `retrieval-topk` record kind per pushed image, carrying `image_id` + `candidates` list of `{tile_id, centre_lat_deg, centre_lon_deg}`.
+2. **AZ-423 SUT-side**: emit FDR `scene-change-match` record kind for each paired `_gmaps.png` image, carrying `image_id` + `matched` (bool) + optional `inlier_count` (int).
+3. **AZ-427 SUT-side**: wire the C6 freshness gate's sector-aware date comparison (active-conflict vs rear thresholds per AC-8.2) and emit FDR `tile-load-rejected: stale` events at startup OR ensure every emission downgrades to `{visual_propagated, dead_reckoned}`.
+4. **Fixture-builder-side**: snapshot/mount of `synth-age-7mo` + `synth-age-13mo` tile sets, run-per-sub-case orchestration that publishes `E2E_FT_N_05_FIXTURE` env var declaring which fixture is currently active.
+5. **Already exists**: `still-image-set-60` + `coordinates.csv` (AZ-407 deliverable) + `still-image-sat-refs-2` (AZ-407).
+6. **Already exists**: `sitl_replay_ready` fixture, `fc_adapter` / `vio_strategy` parameterisation, `evidence_dir` + `nfr_recorder` + `run_id` (AZ-406).
+
+## Architecture Compliance
+
+- All new files under `e2e/`, owned by the Blackbox Tests component per `_docs/02_document/module-layout.md`.
+- No imports from `src/gps_denied_onboard` (explicit public-boundary discipline notes in both helpers).
+- No new cyclic dependencies. Both helpers depend on `.geo` and (for `aged_tile_rejection_evaluator`) `.estimate_schema` only.
+- No new infrastructure libraries.
+
+## Sub-step Trace
+
+Phases executed per `implement/SKILL.md`:
+- phase 5 (load-spec) → AZ-423 + AZ-427 specs read
+- phase 6 (implement-tasks-sequentially) → helpers + scenarios + unit tests
+- phase 7 (verify-ac-coverage) → ACs traced above
+- phase 8 (code-review) → batch_84_review.md (PASS_WITH_WARNINGS)
+- phase 8.5 (cumulative-review) → batches_82-84 (next: K=3 cumulative trigger)
+- phase 11 (commit-batch) → after cumulative review.
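Reviewer note on the batch report above: the AC-1 check (top-K=10 candidates, at least one tile centre within 100 m of the image's true centre) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `retrieval_evaluator.py`. The names `TopKReport`, `sketch_top_k_within_distance`, and `haversine_m` are hypothetical, the input shapes are assumed, and a spherical haversine distance stands in for the `geo.distance_m` Vincenty helper so that the block has no dependencies.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (stand-in for the Vincenty helper)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


@dataclass(frozen=True)
class TopKReport:
    """Hypothetical report shape: frozen dataclass with a `passes` property."""

    total_images: int
    failed_image_ids: tuple

    @property
    def passes(self):
        # Assumption: an empty input set does not count as a pass.
        return self.total_images > 0 and not self.failed_image_ids


def sketch_top_k_within_distance(truth, candidates, tolerance_m=100.0, k=10):
    """Check that each image has a top-k candidate within tolerance_m.

    truth:      {image_id: (lat_deg, lon_deg)} true image centres
    candidates: {image_id: [(lat_deg, lon_deg), ...]} ranked tile centres
    """
    if tolerance_m <= 0:
        raise ValueError("tolerance_m must be positive")
    failed = []
    for image_id, (lat, lon) in truth.items():
        top_k = candidates.get(image_id, [])[:k]
        hit = any(
            haversine_m(lat, lon, c_lat, c_lon) <= tolerance_m
            for c_lat, c_lon in top_k
        )
        if not hit:
            failed.append(image_id)
    return TopKReport(total_images=len(truth), failed_image_ids=tuple(failed))
```

An image with no candidate list is treated as a failure here, which matches the report's fail-loud posture, although the real helper may handle that case differently (for example by raising on a count mismatch).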
@@ -0,0 +1,182 @@
+# Cumulative Code Review — Batches 82-84 (Cycle 1)
+
+**Window**: batches 82, 83, 84
+**Tasks covered**: AZ-421, AZ-422, AZ-423, AZ-427
+**Date**: 2026-05-17
+**Verdict**: PASS
+**Last cumulative review**: batches 79-81 (`_docs/03_implementation/reviews/cumulative_79_81_review.md`)
+
+## Scope
+
+Union of files changed since the last cumulative review. Counts include the modifications to `e2e/_unit_tests/test_directory_layout.py` that each batch made.
+
+### Files in window
+
+| Batch | Tasks | New helpers | New scenarios | New unit tests |
+|-------|-------|-------------|---------------|----------------|
+| 82 | AZ-421 (FT-P-15 + FT-P-16 + FT-P-18) | `tile_cache_inspector` | `test_ft_p_15_cache_schema`, `test_ft_p_16_offline_only`, `test_ft_p_18_no_raw_retention` | `test_tile_cache_inspector` |
+| 83 | AZ-422 (FT-P-17 + FT-N-06) | `mid_flight_tile_evaluator`, `mock_suite_sat_audit` | `test_ft_p_17_mid_flight_tiles`, `test_ft_n_06_mid_flight_freshness` | `test_mid_flight_tile_evaluator`, `test_mock_suite_sat_audit` |
+| 84 | AZ-423 + AZ-427 (FT-P-19 + FT-N-05) | `retrieval_evaluator`, `aged_tile_rejection_evaluator` | `test_ft_p_19_sat_reloc_scale`, `test_ft_n_05_stale_tile_rejection` | `test_retrieval_evaluator`, `test_aged_tile_rejection_evaluator` |
+
+Total: 5 new helper modules + 7 new scenario files + 5 new unit-test files across 4 tasks / 11 complexity points (3 + 3 + 3 + 2).
+
+## Cross-Cutting Themes
+
+### Theme 1 — FDR vocabulary fragmentation
+
+Each new helper defines its own FDR record-kind constants. As of batch 84:
+
+| Constant | Defined in |
+|----------|------------|
+| `MID_FLIGHT_TILE_FDR_KIND = "mid-flight-tile-output"` | `mid_flight_tile_evaluator.py` |
+| `TILE_LOAD_REJECTED_FDR_KIND = "tile-load-rejected"` | `mid_flight_tile_evaluator.py` AND `aged_tile_rejection_evaluator.py` |
+| `TILE_LOAD_REJECTED_STALE_REASON = "stale"` | `mid_flight_tile_evaluator.py` AND `aged_tile_rejection_evaluator.py` |
+| `RETRIEVAL_TOPK_FDR_KIND = "retrieval-topk"` | `retrieval_evaluator.py` |
+| `SCENE_CHANGE_MATCH_FDR_KIND = "scene-change-match"` | `retrieval_evaluator.py` |
+
+The `tile-load-rejected` kind and its `stale` reason are the only true duplication; the rest live in one place. The duplication is intentional in spirit (each helper exposes the vocabulary it consumes) but creates a small coordination cost, surfaced as a Low finding in the batch 84 review (F2).
+
+Recommendation (NOT blocking this window): consolidate into a single `runner/helpers/fdr_record_kinds.py` constants module. Hygiene PBI candidate (~1 point).
+
+### Theme 2 — Geo math consolidation progress
+
+`runner.helpers.geo.distance_m` (Vincenty WGS84) is now consumed by:
+
+- `mid_flight_tile_evaluator.evaluate_dedup` (Batch 83, AC-3)
+- `retrieval_evaluator.evaluate_top_k_within_distance` (Batch 84, AC-1)
+
+Both via `from .geo import distance_m`. This is the correct pattern.
+
+`runner.helpers.gcs_telemetry_evaluator.haversine_distance_m` (Batch 81) still exists as the lone outlier — surfaced in the prior cumulative review (79-81) as F1 / Low / Maintainability. Recommendation unchanged: refactor in a dedicated hygiene batch (≤1 point) before any further geo-math additions land.
+
+### Theme 3 — Tests-as-gates discipline
+
+Every scenario in this window follows the same skip/fail-loud pattern:
+
+* If `sitl_replay_ready` is false → `pytest.skip(<rich diagnostic message>)`.
+* If `sitl_replay_ready` is true BUT the required FDR records / env vars are missing → `pytest.fail(<rich diagnostic message>)`.
+
+This ensures production-code or fixture gaps surface immediately when the SITL fixture is wired up, while CI runs without the fixture remain green.
+
+Aggregate production-dependency list (collected from batches 82-84 batch reports):
+
+* SUT-side FDR record kinds: `mid-flight-tile-output`, `tile-load-rejected` (with `reason="stale"`), `retrieval-topk`, `scene-change-match`, `cache-self-check`.
+* SUT-side public-input mechanisms: `simulate_landing()` MAVLink command.
+* SUT-side production logic: C6 freshness gate sector-aware date comparison, C2 top-K=10 retrieval result emission, cross-domain matcher per-pair output emission, mid-flight orthorectification + tile generation + Mode B Fact #105 schema population.
+* Fixture-builder-side: `FT_P_17_HIGH_QUALITY_WINDOW_S` env var, `E2E_FT_N_05_FIXTURE` env var, `synth-age-7mo` / `synth-age-13mo` snapshots, Derkachi 5-min replay scenario, Docker network inspect snapshots (FT-P-16).
+
+All gated, all surfaced via skip messages with explicit fixture-owner references. This list feeds the AZ-595 (fixture-builder) backlog and the SUT-side production work.
+
+### Theme 4 — Source-label contract centralisation
+
+`aged_tile_rejection_evaluator` (Batch 84) reuses `ALLOWED_SOURCE_LABELS` from `estimate_schema` (AZ-411 deliverable). This is the correct pattern: the FT-P-03 / AC-1.4 source-label set is a single-source contract. No drift; new helpers should follow this lead when they need the same set.
+
+### Theme 5 — Helper-module size + complexity
+
+| Helper | Lines | Top-level public symbols |
+|--------|-------|--------------------------|
+| `tile_cache_inspector.py` | ~450 | 4 dataclasses + 4 evaluators + walk + JPEG probe |
+| `mid_flight_tile_evaluator.py` | ~500 | 6 dataclasses + 6 evaluators + 2 projection helpers |
+| `mock_suite_sat_audit.py` | ~77 | 1 function (`fetch_audit`) + constants |
+| `retrieval_evaluator.py` | ~260 | 6 dataclasses + 2 evaluators + 2 projectors + 2 CSV writers + 2 iter filters |
+| `aged_tile_rejection_evaluator.py` | ~190 | 3 dataclasses + 1 evaluator + 1 projector + 1 iter filter + constants |
+
+All evaluators stay under the 50-line function complexity threshold. Helper surfaces are coherent — no single helper does too many things.
+
+## Spec Compliance
+
+Spot-checked each batch's per-batch review verdict; no contradictions surfaced under cumulative inspection.
+
+Cumulative AC coverage (E-BBT blackbox tests, this window):
+
+| AC ID | Task | Status |
+|-------|------|--------|
+| AC-8.4 | AZ-422 | Covered (FT-P-17 5 sub-evaluators) |
+| AC-NEW-6 (mid-flight) | AZ-422 | Covered (FT-N-06 2 sub-evaluators) |
+| AC-8.2 (cache schema + offline) | AZ-421 | Covered (FT-P-15, FT-P-16, FT-P-18) |
+| AC-NEW-1 (no raw retention) | AZ-421 | Covered (FT-P-18) |
+| AC-8.6 | AZ-423 | Covered (FT-P-19 scale + scene-change PARTIAL) |
+| AC-NEW-6 (stale aged) | AZ-427 | Covered (FT-N-05 2 sub-cases) |
+| AC-8.2 (sector freshness) | AZ-427 | Covered |
+
+No carried-over AC misses; no AC drift between batches.
+
+## Code Quality
+
+- **SOLID / SRP**: each helper has one reason to change. No God modules. The `mid_flight_tile_evaluator` has 6 evaluators — at the upper end but still cohesive (all consume the same `TileSpec` input from the same FDR record kind).
+- **Error handling**: every input validator raises `ValueError`; every HTTP wrapper raises `RuntimeError`. No bare `except`. `_parse_iso8601_utc_seconds` catches only `(TypeError, ValueError)`.
+- **Test pyramid**: unit tests dominate (52 in batch 83, 48 in batch 84, ~30 in batch 82). Each scenario has a corresponding pure-logic unit-test file. Scenario tests delegate to evaluators for all AC math, asserting only `report.passes` + diagnostic context.
+- **Naming**: `evaluate_*` for evaluators, `project_*` for FDR-payload projectors, `iter_*` for record-kind filters, `<Concern>Report` for report dataclasses. Consistent across all 5 new helpers.
+- **Comments**: docstrings cite AC IDs and Mode B Fact / restriction references. Inline comments are minimal and only where non-obvious. Arrange/Act/Assert pattern present in every unit test.
+- **Dead code**: none. Re-verified `_top_level_field_to_attr` (Batch 83) is a documented forward-compatibility seam.
+
+## Security Quick-Scan
+
+- No SQL, no `subprocess` invocations, no `exec`/`eval`.
+- No hardcoded secrets, no logging of sensitive data.
+- HTTP client (`mock_suite_sat_audit`) validates URL non-empty, run_id non-empty, status 2xx, JSON body is dict, `entries` is list. Error messages truncate body to 200 chars.
+- `aged_tile_rejection_evaluator` separates `illegal_labels` (contract violation) from `anchored_count` (freshness-gate violation) so that one defect cannot mask the other.
+
+## Performance Scan
+
+- `evaluate_dedup` is O(N²) — documented + bounded by < 100 tiles per 5-min replay.
+- `evaluate_top_k_within_distance` is O(N·K): with N = 60 images and K = 10 candidates, that is 600 Vincenty calls per scenario.
+- `evaluate_aged_tile_rejection` is O(M) where M ≤ 60.
+- FDR iteration is generator-based via `fdr_reader.iter_records` (unchanged from prior windows).
+- No N+1, no unbounded fetches, no async-in-sync, no hot-path allocations.
+
+## Cross-Task Consistency
+
+- All four tasks consume the same conftest fixtures (`sitl_replay_ready`, `fc_adapter`, `vio_strategy`, `evidence_dir`, `nfr_recorder`, `run_id`, plus task-specific helpers).
+- `traces_to` marker convention consistent.
+- Skip messages have a uniform shape: "<scenario> requires <fixture>; <pure-logic alternative path>".
+- CSV evidence file naming: `ft-<id>-<aspect>-<fc>-<vio>.csv`.
+- TileSpec / TilePublishRequest mirror is single-source in `mid_flight_tile_evaluator`.
+
+## Architecture Compliance (Phase 7)
+
+Verified against `_docs/02_document/module-layout.md`:
+
+- **Layer direction**: all new files under `e2e/` (Blackbox Tests). No imports from `src/gps_denied_onboard`. Public-boundary discipline notes present in every new helper.
+- **Public API respect**: imports across modules are limited to `runner.helpers.{geo, estimate_schema, fdr_reader, accuracy_evaluator}`, all of which are intra-component to the Blackbox Tests package.
+- **No new cyclic dependencies**: import graph for new files is a tree (helpers → geo / estimate_schema; tests → helpers).
+- **Duplicate symbols**: `TILE_LOAD_REJECTED_FDR_KIND` appears twice (intra-component duplicate, surfaced in F2 below). No other duplicates.
+- **Cross-cutting concerns**: no concern that should live in `shared/*` has been re-implemented locally. Geo, schema, CSV emission, HTTP all delegate.
+
+## Baseline Delta
+
+`_docs/02_document/architecture_compliance_baseline.md` does not exist for the Blackbox Tests workspace (this workspace was greenfield-bootstrapped); no Baseline Delta section to emit.
+
+## Findings
+
+| # | Severity | Category | Status | Title |
+|---|----------|----------|--------|-------|
+| F1 | Low | Maintainability | Carried over from 79-81 | `gcs_telemetry_evaluator.haversine_distance_m` duplicates `geo.distance_m` |
+| F2 | Low | Maintainability | New in 82-84 (B84 review F2) | `TILE_LOAD_REJECTED_FDR_KIND` + stale-reason constant duplicated across two helpers |
+
+### F1 (carry-over)
+
+- **Location**: `e2e/runner/helpers/gcs_telemetry_evaluator.py`
+- **State**: still present. No batch in this window touched it.
+- **Recommendation unchanged**: consolidate to `geo.distance_m` in a dedicated ≤1-pt hygiene batch.
+
+### F2 (newly introduced)
+
+- **Location**: `e2e/runner/helpers/mid_flight_tile_evaluator.py:43-44`, `e2e/runner/helpers/aged_tile_rejection_evaluator.py:33-34`
+- **Description**: Two helpers define the same FDR record-kind constant + stale-reason string. The duplication exists because each helper aims to be self-contained, but the underlying contract is single-source (FDR record-type vocabulary).
+- **Recommendation**: address in the same hygiene batch as F1 by creating `runner/helpers/fdr_record_kinds.py` and importing the constants from there. Combine with F1 → estimated 2-pt hygiene batch.
+
+## Resolved Findings
+
+None. F1 from the prior window still stands; no fix landed in this window.
+
+## Carried Forward to Next Window (85-87)
+
+- F1, F2 unresolved.
+- The production-dependency list above (Theme 3) is still pending: it covers the AZ-595 fixture-builder backlog and the epic-side SUT work. Tracker visibility: the items are surfaced in each batch report; an aggregator PBI for the fixture-builder side is recommended but not required this cycle.
+
+## Verdict
+
+0 Critical, 0 High, 2 Low (1 carry-over + 1 new). No FAIL trigger; cumulative posture remains **PASS** with two known maintainability items eligible for a dedicated hygiene batch.
+
+Batch 84 proceeds to commit + push. Next cumulative review covers batches 85-87.
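Reviewer note on the F2 recommendation above: the consolidated constants module could look roughly like the sketch below. The constant names and string values are taken from the Theme 1 table. The module path `runner/helpers/fdr_record_kinds.py` is the one proposed in the review, while the record shape (`kind` / `payload` / `reason` keys) and the `iter_stale_rejections` helper are assumptions made for illustration only.

```python
# Hypothetical contents of runner/helpers/fdr_record_kinds.py.
# Constant values mirror the Theme 1 table; the FDR record layout
# (dict with "kind" and "payload" keys) is an assumption.

MID_FLIGHT_TILE_FDR_KIND = "mid-flight-tile-output"
TILE_LOAD_REJECTED_FDR_KIND = "tile-load-rejected"
TILE_LOAD_REJECTED_STALE_REASON = "stale"
RETRIEVAL_TOPK_FDR_KIND = "retrieval-topk"
SCENE_CHANGE_MATCH_FDR_KIND = "scene-change-match"


def iter_stale_rejections(records):
    """Yield payloads of tile-load-rejected records whose reason is stale.

    Both evaluators would import the constants from this one module, so a
    rename on the SUT side needs a change in a single place.
    """
    for record in records:
        if not isinstance(record, dict):
            continue  # skip malformed entries instead of raising
        if record.get("kind") != TILE_LOAD_REJECTED_FDR_KIND:
            continue
        payload = record.get("payload")
        if not isinstance(payload, dict):
            continue
        if payload.get("reason") == TILE_LOAD_REJECTED_STALE_REASON:
            yield payload
```

With this in place, `mid_flight_tile_evaluator.py` and `aged_tile_rejection_evaluator.py` would each replace their local definitions with an import from the shared module.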
@@ -0,0 +1,120 @@
+# Code Review Report
+
+**Batch**: 84 — AZ-423 + AZ-427 (FT-P-19 sat-relocalization scale + FT-N-05 stale-tile rejection)
+**Date**: 2026-05-17
+**Verdict**: PASS_WITH_WARNINGS
+
+## Files Reviewed
+
+**Created**:
+- `e2e/runner/helpers/retrieval_evaluator.py`
+- `e2e/runner/helpers/aged_tile_rejection_evaluator.py`
+- `e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py`
+- `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py`
+- `e2e/_unit_tests/helpers/test_retrieval_evaluator.py`
+- `e2e/_unit_tests/helpers/test_aged_tile_rejection_evaluator.py`
+
+**Modified**:
+- `e2e/_unit_tests/test_directory_layout.py` (registered 4 new paths)
+
+## Findings
+
+| # | Severity | Category | File:Line | Title |
+|---|----------|----------|-----------|-------|
+| 1 | Low | Spec-Gap | `e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py`, `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py` | Tests depend on upstream production + fixture-builder features |
+| 2 | Low | Maintainability | `e2e/runner/helpers/aged_tile_rejection_evaluator.py`, `e2e/runner/helpers/mid_flight_tile_evaluator.py` | `TILE_LOAD_REJECTED_FDR_KIND` and stale-reason constant now duplicated across two helpers |
+
+### Finding Details
+
+**F1: Tests depend on upstream production + fixture-builder features that don't exist yet** (Low / Spec-Gap)
+- Location: `e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py`, `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py`
+- Description: Both scenarios require:
+  * **AZ-423**: SUT emitting FDR `retrieval-topk` records (one per pushed image, carrying 10 candidate tile centres) and `scene-change-match` records for the 2 paired `_gmaps.png` images. Production code must expose the C2 top-K=10 retrieval result on the FDR boundary.
+  * **AZ-427**: SUT either (a) emitting FDR `tile-load-rejected: stale` events at startup OR (b) downgrading every emission's `source_label` to `{visual_propagated, dead_reckoned}` for the aged-tile sub-cases. Production code must wire the C6 freshness gate per-sector (active-conflict vs rear).
+  * **Fixture-builder**: snapshot/mount of `synth-age-7mo` / `synth-age-13mo` tile sets + `E2E_FT_N_05_FIXTURE` env var declaring the active sub-case fixture per pytest invocation.
+- Tests **skip cleanly** when `sitl_replay_ready` / `E2E_FT_N_05_FIXTURE` are absent and **fail loudly** when the fixture exists but the required records are missing — adhering to the "tests as gates" principle.
+- Suggestion: Surface as a single line in the AZ-423 + AZ-427 batch report under Production Dependencies. No code change in this batch.
+- Tasks: AZ-423, AZ-427.
+
+**F2: `TILE_LOAD_REJECTED_FDR_KIND` and stale-reason constant duplicated** (Low / Maintainability)
+- Location: `e2e/runner/helpers/aged_tile_rejection_evaluator.py:33-34`, `e2e/runner/helpers/mid_flight_tile_evaluator.py:43-44`
+- Description: Both helpers redefine `TILE_LOAD_REJECTED_FDR_KIND = "tile-load-rejected"` and the stale-reason constant. This is intentional in spirit (each helper exposes the FDR vocabulary it operates on) but creates two copies of the same FDR-contract string. If the SUT renames the record type later, both files must change in lockstep.
+- Suggestion: Consolidate in a dedicated `runner/helpers/fdr_record_kinds.py` module in a future hygiene PBI; not in scope for AZ-423/AZ-427.
+- Tasks: hygiene, future batch.
+
+## Phase 1: Context Loading
+
+Read AZ-423 + AZ-427 task specs. Both have 3 ACs; both are blackbox-test deliverables under E-BBT (AZ-262); both depend only on AZ-406 + AZ-407 (test infra + static fixtures); SUT boundary statements verified.
+
+## Phase 2: Spec Compliance
+
+### AZ-423 (FT-P-19)
+
+| AC | Helper | Scenario assertion | Unit-test coverage |
+|----|--------|--------------------|--------------------|
+| AC-1 (top-K=10 includes tile within 100 m of true centre, 60 stills) | `evaluate_top_k_within_distance` | `assert topk_report.passes` | 9 cases (single close, all far, mixed top-K, empty, count mismatch, invalid tolerance, custom tolerance, 60 all-pass, 60 one-fail) |
+| AC-2 (scene-change PARTIAL on 2 `_gmaps.png` pairs) | `evaluate_scene_change_subset` | `assert coverage_complete AND overall_label == "PARTIAL"` | 5 cases (both matched still PARTIAL, zero matched still PARTIAL, one image only, empty, wrong image ids) |
+| AC-3 (parameterised) | conftest fixtures | 6 collected variants | — |
+
+Plus 2 CSV-writer round-trip tests, 5 `project_topk_record_to_query` cases (happy / malformed / empty / non-dict / type errors), 5 `project_scene_change_record` cases, 2 `iter_*_payloads` filter cases — 29 unit tests total for AZ-423.
+
+### AZ-427 (FT-N-05)
+
+| AC | Helper | Scenario assertion | Unit-test coverage |
+|----|--------|--------------------|--------------------|
+| AC-1 (7mo aged tiles in active-conflict sector → 0 anchored emissions) | `evaluate_aged_tile_rejection(fixture="synth-age-7mo", sector="active_conflict", …)` | parametrised sub-case | 3 Signal-A cases + 3 Signal-B cases + 5 degenerate-input cases |
+| AC-2 (13mo aged tiles in rear sector → 0 anchored emissions) | same helper with rear fixture binding | parametrised sub-case | same as AC-1 (sub-case symmetric) |
+| AC-3 (parameterised) | conftest fixtures + 2 sub-case parametrisation | 12 collected variants (6 × 2) | — |
+
+Plus 6 `project_stale_rejection_payload` cases (happy / id-alias / wrong-reason / non-dict / missing-id / wrong-type), 1 `iter_stale_rejection_payloads` filter case, 1 `AGED_FIXTURE_SECTOR_BINDINGS` contract assertion — 19 unit tests total for AZ-427.
+
+All ACs satisfied. `@pytest.mark.traces_to` markers wire scenarios to AC IDs.
+
+## Phase 3: Code Quality
+
+- **SOLID**: each evaluator is a pure function over typed input dataclasses returning a frozen-dataclass report. `retrieval_evaluator` and `aged_tile_rejection_evaluator` are siblings of the existing `mid_flight_tile_evaluator` / `gcs_telemetry_evaluator` / `tile_cache_inspector` pattern.
+- **Error handling**: `evaluate_top_k_within_distance` raises `ValueError` on invalid tolerance. `evaluate_aged_tile_rejection` has no failure modes (always returns a report; the report's `.passes` carries the verdict). No bare `except`.
+- **Naming**: `evaluate_*` matches sibling helpers. `_normalise_image_id` in scenario test mirrors the convention from `accuracy_evaluator` GT keying. `Signal A` / `Signal B` properties make the decision matrix self-documenting.
+- **Complexity**: longest function `evaluate_top_k_within_distance` at ~25 lines; `evaluate_aged_tile_rejection` at ~20 lines. All under coderule's 50-line threshold.
+- **DRY**: F2 captures one duplication, shared with batch 83's `mid_flight_tile_evaluator.py`. Within each helper, no duplicated logic.
+- **Test quality**: every unit test uses Arrange/Act/Assert comments per coderule. Tests assert specific outcomes (matched_count, anchored_frame_ids, signal flags, CSV content) rather than "no error thrown".
+- **Dead code**: none. `iter_topk_payloads` / `iter_scene_change_payloads` / `iter_stale_rejection_payloads` are scenario-test helpers used in the corresponding `tests/` files.
+
+## Phase 4: Security Quick-Scan
+
+- No SQL, no `subprocess`, no `exec`/`eval`.
+- No hardcoded secrets.
+- Input validation: all `project_*` functions guard against malformed dict payloads; `evaluate_top_k_within_distance` validates tolerance.
+- No sensitive data in logs or error messages.
+- `evaluate_aged_tile_rejection` surfaces `illegal_labels` separately from passes/fails so a contract-violation defect can't be misread as a freshness-gate defect.
+
+## Phase 5: Performance Scan
+
+- `evaluate_top_k_within_distance` is O(N·K) where N=60 images, K=10 candidates → 600 distance computations per scenario. Vincenty via `pyproj` handles this in well under 100 ms.
+- `evaluate_aged_tile_rejection` is O(M) where M = emissions count (≤ 60) + rejections count (typically 0–2). Trivial.
+- FDR iteration is generator-based via `fdr_reader.iter_records`.
+- No N+1 patterns. No unbounded fetching.
+
+## Phase 6: Cross-Task Consistency
+
+- Both helpers reuse `runner.helpers.geo.distance_m` (Vincenty WGS84) — consistent with `mid_flight_tile_evaluator` from batch 83.
+- `aged_tile_rejection_evaluator` reuses `ALLOWED_SOURCE_LABELS` from `estimate_schema` — single source of truth for the FT-P-03 / AC-1.4 source-label contract.
+- Scenario tests follow the same skip pattern as FT-P-12/13/15/16/17/18 (sitl_replay_ready gate + fail-loud on missing records).
+- The frozen-dataclass + `@property passes` convention matches every prior helper (batches 79-83).
+- `traces_to` markers + CSV evidence emission to `ft-p-19-*.csv` / `ft-n-05-*.csv` follow the FT-P-01 / FT-P-05 / FT-P-17 cadence.
+
+## Phase 7: Architecture Compliance
+
+- All new files under `e2e/` — owned exclusively by the Blackbox Tests component per `_docs/02_document/module-layout.md`.
+- **No imports from `src/gps_denied_onboard`** — verified by reading every import in the new files; both helpers carry an explicit "public-boundary discipline" docstring note.
+- No layering violations.
+- No new cyclic module dependencies. Both helpers import from `.geo` / `.estimate_schema` only; tests import from `runner.helpers.*`.
+- No duplicate symbols across components. The intra-component duplicate FDR-kind constant (F2) is intra-helper-set, not cross-component.
+- No cross-cutting concerns re-implemented locally; geo math, source-label set, and CSV emission all delegate to project-wide utilities.
+
+## Verdict
+
+- 0 Critical, 0 High → no FAIL trigger.
+- 2 Low → **PASS_WITH_WARNINGS**.
+
+Batch 84 is ready to commit. The two Low findings are surfaced for batch report and feed forward into the cumulative review for batches 82–84.
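Reviewer note on the Signal A / Signal B decision matrix reviewed above: a minimal sketch of how such a report might combine the two signals. Nothing here is copied from `aged_tile_rejection_evaluator.py`. `AgedTileReport` and `sketch_aged_tile_rejection` are hypothetical names, the three-label set is an assumed stand-in for `estimate_schema.ALLOWED_SOURCE_LABELS`, and the exact `passes` rule (no anchored frames, no illegal labels, at least one signal observed) is an assumption.

```python
from dataclasses import dataclass

ANCHORED_LABEL = "satellite_anchored"
# Assumed stand-in for estimate_schema.ALLOWED_SOURCE_LABELS.
ASSUMED_ALLOWED_LABELS = frozenset(
    {"satellite_anchored", "visual_propagated", "dead_reckoned"}
)


@dataclass(frozen=True)
class AgedTileReport:
    stale_rejection_count: int
    emission_count: int
    anchored_frame_ids: tuple
    illegal_labels: tuple

    @property
    def signal_a(self):
        """Signal A: at least one load-time stale rejection was recorded."""
        return self.stale_rejection_count > 0

    @property
    def signal_b(self):
        """Signal B: frames were emitted and none was satellite_anchored."""
        return self.emission_count > 0 and not self.anchored_frame_ids

    @property
    def passes(self):
        # Illegal labels are kept separate from anchored frames so that a
        # contract violation is not misread as a freshness-gate defect.
        clean = not self.anchored_frame_ids and not self.illegal_labels
        return clean and (self.signal_a or self.signal_b)


def sketch_aged_tile_rejection(stale_rejections, emissions):
    """Build a report from rejection payloads and (frame_id, label) pairs."""
    anchored = tuple(fid for fid, label in emissions if label == ANCHORED_LABEL)
    illegal = tuple(
        sorted({label for _, label in emissions if label not in ASSUMED_ALLOWED_LABELS})
    )
    return AgedTileReport(
        stale_rejection_count=len(stale_rejections),
        emission_count=len(emissions),
        anchored_frame_ids=anchored,
        illegal_labels=illegal,
    )
```

Like the helper described in the review, this sketch never raises: a run with neither signal simply yields a report whose `passes` is false, leaving the verdict to the scenario assertion.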