Files
gps-denied-onboard/_docs/03_implementation/reviews/batch_84_review.md
T
Oleksandr Bezdieniezhnykh f25cae4a82 [AZ-423] [AZ-427] Add FT-P-19 + FT-N-05 blackbox tests
Implement the AC-8.6 (top-K=10 retrieval scale-ratio + scene-change
PARTIAL) and AC-8.2 / AC-NEW-6 (stale aged-tile rejection) blackbox
scenarios.

AZ-423 (FT-P-19, 3pt) helpers + scenario:
- retrieval_evaluator.py — top-K within-distance evaluator (60 stills
  vs 100 m budget), scene-change PARTIAL recorder (always emits
  PARTIAL on the 2 _gmaps.png pairs), FDR record projectors, CSV
  writers.
- tests/positive/test_ft_p_19_sat_reloc_scale.py (6 parametrised
  variants).

AZ-427 (FT-N-05, 2pt) helpers + scenario:
- aged_tile_rejection_evaluator.py — Signal A (stale rejection at
  load) + Signal B (per-frame downgrade) decision matrix, reuses
  ALLOWED_SOURCE_LABELS from estimate_schema.
- tests/negative/test_ft_n_05_stale_tile_rejection.py (12 parametrised
  variants: FC × VIO × {7mo/active-conflict, 13mo/rear}).

48 new unit tests cover every helper branch. Both scenarios skip
when sitl_replay_ready is false and fail loudly when fixture records
are missing.

Per-batch review: PASS_WITH_WARNINGS (2 Low — production-dependency
surface, FDR-kind constant duplication).
Cumulative review 82-84: PASS (2 Low carry-over / hygiene candidate).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-17 15:43:06 +03:00

9.4 KiB
Raw Blame History

Code Review Report

Batch: 84 — AZ-423 + AZ-427 (FT-P-19 sat-relocalization scale + FT-N-05 stale-tile rejection) Date: 2026-05-17 Verdict: PASS_WITH_WARNINGS

Files Reviewed

Created:

  • e2e/runner/helpers/retrieval_evaluator.py
  • e2e/runner/helpers/aged_tile_rejection_evaluator.py
  • e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py
  • e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py
  • e2e/_unit_tests/helpers/test_retrieval_evaluator.py
  • e2e/_unit_tests/helpers/test_aged_tile_rejection_evaluator.py

Modified:

  • e2e/_unit_tests/test_directory_layout.py (registered 4 new paths)

Findings

# Severity Category File:Line Title
1 Low Spec-Gap e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py, e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py Tests depend on upstream production + fixture-builder features
2 Low Maintainability e2e/runner/helpers/aged_tile_rejection_evaluator.py, e2e/runner/helpers/mid_flight_tile_evaluator.py TILE_LOAD_REJECTED_FDR_KIND and stale-reason constant now duplicated across two helpers

Finding Details

F1: Tests depend on upstream production + fixture-builder features that don't exist yet (Low / Spec-Gap)

  • Location: e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py, e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py
  • Description: Both scenarios require:
    • AZ-423: SUT emitting FDR retrieval-topk records (one per pushed image, carrying 10 candidate tile centres) and scene-change-match records for the 2 paired _gmaps.png images. Production code must expose the C2 top-K=10 retrieval result on the FDR boundary.
    • AZ-427: SUT either (a) emitting FDR tile-load-rejected: stale events at startup OR (b) downgrading every emission's source_label to {visual_propagated, dead_reckoned} for the aged-tile sub-cases. Production code must wire the C6 freshness gate per-sector (active-conflict vs rear).
    • Fixture-builder: snapshot/mount of synth-age-7mo / synth-age-13mo tile sets + E2E_FT_N_05_FIXTURE env var declaring the active sub-case fixture per pytest invocation.
  • Tests skip cleanly when sitl_replay_ready / E2E_FT_N_05_FIXTURE are absent and fail loudly when the fixture exists but the required records are missing — adhering to the "tests as gates" principle.
  • Suggestion: Surface as a single line in the AZ-423 + AZ-427 batch report under Production Dependencies. No code change in this batch.
  • Tasks: AZ-423, AZ-427.

F2: TILE_LOAD_REJECTED_FDR_KIND and stale-reason constant duplicated (Low / Maintainability)

  • Location: e2e/runner/helpers/aged_tile_rejection_evaluator.py:33-34, e2e/runner/helpers/mid_flight_tile_evaluator.py:43-44
  • Description: Both helpers redefine TILE_LOAD_REJECTED_FDR_KIND = "tile-load-rejected" and the stale-reason constant. This is intentional in spirit (each helper exposes the FDR vocabulary it operates on) but creates two copies of the same FDR-contract string. If the SUT renames the record type later, both files must change in lockstep.
  • Suggestion: Consolidate in a dedicated runner/helpers/fdr_record_kinds.py module in a future hygiene PBI; not in scope for AZ-423/AZ-427.
  • Tasks: hygiene, future batch.

Phase 1: Context Loading

Read AZ-423 + AZ-427 task specs. Both have 3 ACs; both are blackbox-test deliverables under E-BBT (AZ-262); both depend only on AZ-406 + AZ-407 (test infra + static fixtures); SUT boundary statements verified.

Phase 2: Spec Compliance

AZ-423 (FT-P-19)

AC Helper Scenario assertion Unit-test coverage
AC-1 (top-K=10 includes tile within 100 m of true centre, 60 stills) evaluate_top_k_within_distance assert topk_report.passes 9 cases (single close, all far, mixed top-K, empty, count mismatch, invalid tolerance, custom tolerance, 60 all-pass, 60 one-fail)
AC-2 (scene-change PARTIAL on 2 _gmaps.png pairs) evaluate_scene_change_subset assert coverage_complete AND overall_label == "PARTIAL" 5 cases (both matched still PARTIAL, zero matched still PARTIAL, one image only, empty, wrong image ids)
AC-3 (parameterised) conftest fixtures 6 collected variants

Plus 2 CSV-writer round-trip tests, 5 project_topk_record_to_query cases (happy / malformed / empty / non-dict / type errors), 5 project_scene_change_record cases, 2 iter_*_payloads filter cases — 29 unit tests total for AZ-423.

AZ-427 (FT-N-05)

AC Helper Scenario assertion Unit-test coverage
AC-1 (7mo aged tiles in active-conflict sector → 0 anchored emissions) evaluate_aged_tile_rejection(fixture="synth-age-7mo", sector="active_conflict", …) parametrised sub-case 3 Signal-A cases + 3 Signal-B cases + 5 degenerate-input cases
AC-2 (13mo aged tiles in rear sector → 0 anchored emissions) same helper with rear fixture binding parametrised sub-case same as AC-1 (sub-case symmetric)
AC-3 (parameterised) conftest fixtures + 2 sub-case parametrisation 12 collected variants (6 × 2)

Plus 6 project_stale_rejection_payload cases (happy / id-alias / wrong-reason / non-dict / missing-id / wrong-type), 1 iter_stale_rejection_payloads filter case, 1 AGED_FIXTURE_SECTOR_BINDINGS contract assertion — 19 unit tests total for AZ-427.

All ACs satisfied. @pytest.mark.traces_to markers wire scenarios to AC IDs.

Phase 3: Code Quality

  • SOLID: each evaluator is a pure function over typed input dataclasses returning a frozen-dataclass report. retrieval_evaluator and aged_tile_rejection_evaluator are siblings of the existing mid_flight_tile_evaluator / gcs_telemetry_evaluator / tile_cache_inspector pattern.
  • Error handling: evaluate_top_k_within_distance raises ValueError on invalid tolerance. evaluate_aged_tile_rejection has no failure modes (always returns a report; the report's .passes carries the verdict). No bare except.
  • Naming: evaluate_* matches sibling helpers. _normalise_image_id in scenario test mirrors the convention from accuracy_evaluator GT keying. Signal A / Signal B properties make the decision matrix self-documenting.
  • Complexity: longest function evaluate_top_k_within_distance at ~25 lines; evaluate_aged_tile_rejection at ~20 lines. All under coderule's 50-line threshold.
  • DRY: F2 captures one intra-batch duplication. Within each helper, no duplicated logic.
  • Test quality: every unit test uses Arrange/Act/Assert comments per coderule. Tests assert specific outcomes (matched_count, anchored_frame_ids, signal flags, CSV content) rather than "no error thrown".
  • Dead code: none. iter_topk_payloads / iter_scene_change_payloads / iter_stale_rejection_payloads are scenario-test helpers used in the corresponding tests/ files.

Phase 4: Security Quick-Scan

  • No SQL, no subprocess, no exec/eval.
  • No hardcoded secrets.
  • Input validation: all project_* functions guard against malformed dict payloads; evaluate_top_k_within_distance validates tolerance.
  • No sensitive data in logs or error messages.
  • evaluate_aged_tile_rejection surfaces illegal_labels separately from passes/fails so a contract-violation defect can't be misread as a freshness-gate defect.

Phase 5: Performance Scan

  • evaluate_top_k_within_distance is O(N·K) where N=60 images, K=10 candidates → 600 distance computations per scenario. Vincenty via pyproj handles this in well under 100 ms.
  • evaluate_aged_tile_rejection is O(M) where M = emissions count (≤ 60) + rejections count (typically 02). Trivial.
  • FDR iteration is generator-based via fdr_reader.iter_records.
  • No N+1 patterns. No unbounded fetching.

Phase 6: Cross-Task Consistency

  • Both helpers reuse runner.helpers.geo.distance_m (Vincenty WGS84) — consistent with mid_flight_tile_evaluator from batch 83.
  • aged_tile_rejection_evaluator reuses ALLOWED_SOURCE_LABELS from estimate_schema — single source of truth for the FT-P-03 / AC-1.4 source-label contract.
  • Scenario tests follow the same skip pattern as FT-P-12/13/15/16/17/18 (sitl_replay_ready gate + fail-loud on missing records).
  • The frozen-dataclass + @property passes convention matches every prior helper (batches 79-83).
  • traces_to markers + CSV evidence emission to ft-p-19-*.csv / ft-n-05-*.csv follow the FT-P-01 / FT-P-05 / FT-P-17 cadence.

Phase 7: Architecture Compliance

  • All new files under e2e/ — owned exclusively by the Blackbox Tests component per _docs/02_document/module-layout.md.
  • No imports from src/gps_denied_onboard — verified by reading every import in the new files; both helpers carry an explicit "public-boundary discipline" docstring note.
  • No layering violations.
  • No new cyclic module dependencies. Both helpers import from .geo / .estimate_schema only; tests import from runner.helpers.*.
  • No duplicate symbols across components. The intra-component duplicate FDR-kind constant (F2) is intra-helper-set, not cross-component.
  • No cross-cutting concerns re-implemented locally; geo math, source-label set, and CSV emission all delegate to project-wide utilities.

Verdict

  • 0 Critical, 0 High → no FAIL trigger.
  • 2 Low → PASS_WITH_WARNINGS.

Batch 84 is ready to commit. The two Low findings are surfaced for batch report and feed forward into the cumulative review for batches 8284.