mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 13:01:14 +00:00
[AZ-423] [AZ-427] Add FT-P-19 + FT-N-05 blackbox tests
Implement the AC-8.6 (top-K=10 retrieval scale-ratio + scene-change
PARTIAL) and AC-8.2 / AC-NEW-6 (stale aged-tile rejection) blackbox
scenarios.
AZ-423 (FT-P-19, 3pt) helpers + scenario:
- retrieval_evaluator.py — top-K within-distance evaluator (60 stills
vs 100 m budget), scene-change PARTIAL recorder (always emits
PARTIAL on the 2 _gmaps.png pairs), FDR record projectors, CSV
writers.
- tests/positive/test_ft_p_19_sat_reloc_scale.py (6 parametrised
variants).
AZ-427 (FT-N-05, 2pt) helpers + scenario:
- aged_tile_rejection_evaluator.py — Signal A (stale rejection at
load) + Signal B (per-frame downgrade) decision matrix, reuses
ALLOWED_SOURCE_LABELS from estimate_schema.
- tests/negative/test_ft_n_05_stale_tile_rejection.py (12 parametrised
variants: FC × VIO × {7mo/active-conflict, 13mo/rear}).
48 new unit tests cover every helper branch. Both scenarios skip
when sitl_replay_ready is false and fail loudly when fixture records
are missing.
Per-batch review: PASS_WITH_WARNINGS (2 Low — production-dependency
surface, FDR-kind constant duplication).
Cumulative review 82-84: PASS (2 Low carry-over / hygiene candidate).
Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
@@ -0,0 +1,120 @@
|
||||
# Code Review Report
|
||||
|
||||
**Batch**: 84 — AZ-423 + AZ-427 (FT-P-19 sat-relocalization scale + FT-N-05 stale-tile rejection)
|
||||
**Date**: 2026-05-17
|
||||
**Verdict**: PASS_WITH_WARNINGS
|
||||
|
||||
## Files Reviewed
|
||||
|
||||
**Created**:
|
||||
- `e2e/runner/helpers/retrieval_evaluator.py`
|
||||
- `e2e/runner/helpers/aged_tile_rejection_evaluator.py`
|
||||
- `e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py`
|
||||
- `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py`
|
||||
- `e2e/_unit_tests/helpers/test_retrieval_evaluator.py`
|
||||
- `e2e/_unit_tests/helpers/test_aged_tile_rejection_evaluator.py`
|
||||
|
||||
**Modified**:
|
||||
- `e2e/_unit_tests/test_directory_layout.py` (registered 4 new paths)
|
||||
|
||||
## Findings
|
||||
|
||||
| # | Severity | Category | File:Line | Title |
|
||||
|---|----------|----------|-----------|-------|
|
||||
| 1 | Low | Spec-Gap | `e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py`, `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py` | Tests depend on upstream production + fixture-builder features |
|
||||
| 2 | Low | Maintainability | `e2e/runner/helpers/aged_tile_rejection_evaluator.py`, `e2e/runner/helpers/mid_flight_tile_evaluator.py` | `TILE_LOAD_REJECTED_FDR_KIND` and stale-reason constant now duplicated across two helpers |
|
||||
|
||||
### Finding Details
|
||||
|
||||
**F1: Tests depend on upstream production + fixture-builder features that don't exist yet** (Low / Spec-Gap)
|
||||
- Location: `e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py`, `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py`
|
||||
- Description: Both scenarios require:
|
||||
* **AZ-423**: SUT emitting FDR `retrieval-topk` records (one per pushed image, carrying 10 candidate tile centres) and `scene-change-match` records for the 2 paired `_gmaps.png` images. Production code must expose the C2 top-K=10 retrieval result on the FDR boundary.
|
||||
* **AZ-427**: SUT either (a) emitting FDR `tile-load-rejected: stale` events at startup OR (b) downgrading every emission's `source_label` to `{visual_propagated, dead_reckoned}` for the aged-tile sub-cases. Production code must wire the C6 freshness gate per-sector (active-conflict vs rear).
|
||||
* **Fixture-builder**: snapshot/mount of `synth-age-7mo` / `synth-age-13mo` tile sets + `E2E_FT_N_05_FIXTURE` env var declaring the active sub-case fixture per pytest invocation.
|
||||
- Tests **skip cleanly** when `sitl_replay_ready` / `E2E_FT_N_05_FIXTURE` are absent and **fail loudly** when the fixture exists but the required records are missing — adhering to the "tests as gates" principle.
|
||||
- Suggestion: Surface as a single line in the AZ-423 + AZ-427 batch report under Production Dependencies. No code change in this batch.
|
||||
- Tasks: AZ-423, AZ-427.
|
||||
|
||||
**F2: `TILE_LOAD_REJECTED_FDR_KIND` and stale-reason constant duplicated** (Low / Maintainability)
|
||||
- Location: `e2e/runner/helpers/aged_tile_rejection_evaluator.py:33-34`, `e2e/runner/helpers/mid_flight_tile_evaluator.py:43-44`
|
||||
- Description: Both helpers redefine `TILE_LOAD_REJECTED_FDR_KIND = "tile-load-rejected"` and the stale-reason constant. This is intentional in spirit (each helper exposes the FDR vocabulary it operates on) but creates two copies of the same FDR-contract string. If the SUT renames the record type later, both files must change in lockstep.
|
||||
- Suggestion: Consolidate in a dedicated `runner/helpers/fdr_record_kinds.py` module in a future hygiene PBI; not in scope for AZ-423/AZ-427.
|
||||
- Tasks: hygiene, future batch.
|
||||
|
||||
## Phase 1: Context Loading
|
||||
|
||||
Read AZ-423 + AZ-427 task specs. Both have 3 ACs; both are blackbox-test deliverables under E-BBT (AZ-262); both depend only on AZ-406 + AZ-407 (test infra + static fixtures); SUT boundary statements verified.
|
||||
|
||||
## Phase 2: Spec Compliance
|
||||
|
||||
### AZ-423 (FT-P-19)
|
||||
|
||||
| AC | Helper | Scenario assertion | Unit-test coverage |
|
||||
|----|--------|--------------------|--------------------|
|
||||
| AC-1 (top-K=10 includes tile within 100 m of true centre, 60 stills) | `evaluate_top_k_within_distance` | `assert topk_report.passes` | 9 cases (single close, all far, mixed top-K, empty, count mismatch, invalid tolerance, custom tolerance, 60 all-pass, 60 one-fail) |
|
||||
| AC-2 (scene-change PARTIAL on 2 `_gmaps.png` pairs) | `evaluate_scene_change_subset` | `assert coverage_complete AND overall_label == "PARTIAL"` | 5 cases (both matched still PARTIAL, zero matched still PARTIAL, one image only, empty, wrong image ids) |
|
||||
| AC-3 (parameterised) | conftest fixtures | 6 collected variants | — |
|
||||
|
||||
Plus 2 CSV-writer round-trip tests, 5 `project_topk_record_to_query` cases (happy / malformed / empty / non-dict / type errors), 5 `project_scene_change_record` cases, 2 `iter_*_payloads` filter cases — 29 unit tests total for AZ-423.
|
||||
|
||||
### AZ-427 (FT-N-05)
|
||||
|
||||
| AC | Helper | Scenario assertion | Unit-test coverage |
|
||||
|----|--------|--------------------|--------------------|
|
||||
| AC-1 (7mo aged tiles in active-conflict sector → 0 anchored emissions) | `evaluate_aged_tile_rejection(fixture="synth-age-7mo", sector="active_conflict", …)` | parametrised sub-case | 3 Signal-A cases + 3 Signal-B cases + 5 degenerate-input cases |
|
||||
| AC-2 (13mo aged tiles in rear sector → 0 anchored emissions) | same helper with rear fixture binding | parametrised sub-case | same as AC-1 (sub-case symmetric) |
|
||||
| AC-3 (parameterised) | conftest fixtures + 2 sub-case parametrisation | 12 collected variants (6 × 2) | — |
|
||||
|
||||
Plus 6 `project_stale_rejection_payload` cases (happy / id-alias / wrong-reason / non-dict / missing-id / wrong-type), 1 `iter_stale_rejection_payloads` filter case, 1 `AGED_FIXTURE_SECTOR_BINDINGS` contract assertion — 19 unit tests total for AZ-427.
|
||||
|
||||
All ACs satisfied. `@pytest.mark.traces_to` markers wire scenarios to AC IDs.
|
||||
|
||||
## Phase 3: Code Quality
|
||||
|
||||
- **SOLID**: each evaluator is a pure function over typed input dataclasses returning a frozen-dataclass report. `retrieval_evaluator` and `aged_tile_rejection_evaluator` are siblings of the existing `mid_flight_tile_evaluator` / `gcs_telemetry_evaluator` / `tile_cache_inspector` pattern.
|
||||
- **Error handling**: `evaluate_top_k_within_distance` raises `ValueError` on invalid tolerance. `evaluate_aged_tile_rejection` has no failure modes (always returns a report; the report's `.passes` carries the verdict). No bare `except`.
|
||||
- **Naming**: `evaluate_*` matches sibling helpers. `_normalise_image_id` in scenario test mirrors the convention from `accuracy_evaluator` GT keying. `Signal A` / `Signal B` properties make the decision matrix self-documenting.
|
||||
- **Complexity**: longest function `evaluate_top_k_within_distance` at ~25 lines; `evaluate_aged_tile_rejection` at ~20 lines. All under coderule's 50-line threshold.
|
||||
- **DRY**: F2 captures one intra-batch duplication. Within each helper, no duplicated logic.
|
||||
- **Test quality**: every unit test uses Arrange/Act/Assert comments per coderule. Tests assert specific outcomes (matched_count, anchored_frame_ids, signal flags, CSV content) rather than "no error thrown".
|
||||
- **Dead code**: none. `iter_topk_payloads` / `iter_scene_change_payloads` / `iter_stale_rejection_payloads` are scenario-test helpers used in the corresponding `tests/` files.
|
||||
|
||||
## Phase 4: Security Quick-Scan
|
||||
|
||||
- No SQL, no `subprocess`, no `exec`/`eval`.
|
||||
- No hardcoded secrets.
|
||||
- Input validation: all `project_*` functions guard against malformed dict payloads; `evaluate_top_k_within_distance` validates tolerance.
|
||||
- No sensitive data in logs or error messages.
|
||||
- `evaluate_aged_tile_rejection` surfaces `illegal_labels` separately from passes/fails so a contract-violation defect can't be misread as a freshness-gate defect.
|
||||
|
||||
## Phase 5: Performance Scan
|
||||
|
||||
- `evaluate_top_k_within_distance` is O(N·K) where N=60 images, K=10 candidates → 600 distance computations per scenario. Vincenty via `pyproj` handles this in well under 100 ms.
|
||||
- `evaluate_aged_tile_rejection` is O(M) where M = emissions count (≤ 60) + rejections count (typically 0–2). Trivial.
|
||||
- FDR iteration is generator-based via `fdr_reader.iter_records`.
|
||||
- No N+1 patterns. No unbounded fetching.
|
||||
|
||||
## Phase 6: Cross-Task Consistency
|
||||
|
||||
- Both helpers reuse `runner.helpers.geo.distance_m` (Vincenty WGS84) — consistent with `mid_flight_tile_evaluator` from batch 83.
|
||||
- `aged_tile_rejection_evaluator` reuses `ALLOWED_SOURCE_LABELS` from `estimate_schema` — single source of truth for the FT-P-03 / AC-1.4 source-label contract.
|
||||
- Scenario tests follow the same skip pattern as FT-P-12/13/15/16/17/18 (sitl_replay_ready gate + fail-loud on missing records).
|
||||
- The frozen-dataclass + `@property passes` convention matches every prior helper (batches 79-83).
|
||||
- `traces_to` markers + CSV evidence emission to `ft-p-19-*.csv` / `ft-n-05-*.csv` follow the FT-P-01 / FT-P-05 / FT-P-17 cadence.
|
||||
|
||||
## Phase 7: Architecture Compliance
|
||||
|
||||
- All new files under `e2e/` — owned exclusively by the Blackbox Tests component per `_docs/02_document/module-layout.md`.
|
||||
- **No imports from `src/gps_denied_onboard`** — verified by reading every import in the new files; both helpers carry an explicit "public-boundary discipline" docstring note.
|
||||
- No layering violations.
|
||||
- No new cyclic module dependencies. Both helpers import from `.geo` / `.estimate_schema` only; tests import from `runner.helpers.*`.
|
||||
- No duplicate symbols across components. The intra-component duplicate FDR-kind constant (F2) is intra-helper-set, not cross-component.
|
||||
- No cross-cutting concerns re-implemented locally; geo math, source-label set, and CSV emission all delegate to project-wide utilities.
|
||||
|
||||
## Verdict
|
||||
|
||||
- 0 Critical, 0 High → no FAIL trigger.
|
||||
- 2 Low → **PASS_WITH_WARNINGS**.
|
||||
|
||||
Batch 84 is ready to commit. The two Low findings are surfaced for batch report and feed forward into the cumulative review for batches 82–84.
|
||||
Reference in New Issue
Block a user