mirror of https://github.com/azaion/gps-denied-onboard.git synced 2026-06-21 15:11:12 +00:00

Files

T

Oleksandr Bezdieniezhnykh f25cae4a82 [AZ-423] [AZ-427] Add FT-P-19 + FT-N-05 blackbox tests

Implement the AC-8.6 (top-K=10 retrieval scale-ratio + scene-change
PARTIAL) and AC-8.2 / AC-NEW-6 (stale aged-tile rejection) blackbox
scenarios.

AZ-423 (FT-P-19, 3pt) helpers + scenario:
- retrieval_evaluator.py — top-K within-distance evaluator (60 stills
  vs 100 m budget), scene-change PARTIAL recorder (always emits
  PARTIAL on the 2 _gmaps.png pairs), FDR record projectors, CSV
  writers.
- tests/positive/test_ft_p_19_sat_reloc_scale.py (6 parametrised
  variants).

AZ-427 (FT-N-05, 2pt) helpers + scenario:
- aged_tile_rejection_evaluator.py — Signal A (stale rejection at
  load) + Signal B (per-frame downgrade) decision matrix, reuses
  ALLOWED_SOURCE_LABELS from estimate_schema.
- tests/negative/test_ft_n_05_stale_tile_rejection.py (12 parametrised
  variants: FC × VIO × {7mo/active-conflict, 13mo/rear}).

48 new unit tests cover every helper branch. Both scenarios skip
when sitl_replay_ready is false and fail loudly when fixture records
are missing.

Per-batch review: PASS_WITH_WARNINGS (2 Low — production-dependency
surface, FDR-kind constant duplication).
Cumulative review 82-84: PASS (2 Low carry-over / hygiene candidate).

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-17 15:43:06 +03:00

9.4 KiB

Raw Blame History

Code Review Report

Batch: 84 — AZ-423 + AZ-427 (FT-P-19 sat-relocalization scale + FT-N-05 stale-tile rejection) Date: 2026-05-17 Verdict: PASS_WITH_WARNINGS

Files Reviewed

Created:

e2e/runner/helpers/retrieval_evaluator.py
e2e/runner/helpers/aged_tile_rejection_evaluator.py
e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py
e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py
e2e/_unit_tests/helpers/test_retrieval_evaluator.py
e2e/_unit_tests/helpers/test_aged_tile_rejection_evaluator.py

Modified:

e2e/_unit_tests/test_directory_layout.py (registered 4 new paths)

Findings

#	Severity	Category	File:Line	Title
1	Low	Spec-Gap	`e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py`, `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py`	Tests depend on upstream production + fixture-builder features
2	Low	Maintainability	`e2e/runner/helpers/aged_tile_rejection_evaluator.py`, `e2e/runner/helpers/mid_flight_tile_evaluator.py`	`TILE_LOAD_REJECTED_FDR_KIND` and stale-reason constant now duplicated across two helpers

Finding Details

F1: Tests depend on upstream production + fixture-builder features that don't exist yet (Low / Spec-Gap)

Location: e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py, e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py
Description: Both scenarios require:
- AZ-423: SUT emitting FDR retrieval-topk records (one per pushed image, carrying 10 candidate tile centres) and scene-change-match records for the 2 paired _gmaps.png images. Production code must expose the C2 top-K=10 retrieval result on the FDR boundary.
- AZ-427: SUT either (a) emitting FDR tile-load-rejected: stale events at startup OR (b) downgrading every emission's source_label to {visual_propagated, dead_reckoned} for the aged-tile sub-cases. Production code must wire the C6 freshness gate per-sector (active-conflict vs rear).
- Fixture-builder: snapshot/mount of synth-age-7mo / synth-age-13mo tile sets + E2E_FT_N_05_FIXTURE env var declaring the active sub-case fixture per pytest invocation.
Tests skip cleanly when sitl_replay_ready / E2E_FT_N_05_FIXTURE are absent and fail loudly when the fixture exists but the required records are missing — adhering to the "tests as gates" principle.
Suggestion: Surface as a single line in the AZ-423 + AZ-427 batch report under Production Dependencies. No code change in this batch.
Tasks: AZ-423, AZ-427.

F2: TILE_LOAD_REJECTED_FDR_KIND and stale-reason constant duplicated (Low / Maintainability)

Location: e2e/runner/helpers/aged_tile_rejection_evaluator.py:33-34, e2e/runner/helpers/mid_flight_tile_evaluator.py:43-44
Description: Both helpers redefine TILE_LOAD_REJECTED_FDR_KIND = "tile-load-rejected" and the stale-reason constant. This is intentional in spirit (each helper exposes the FDR vocabulary it operates on) but creates two copies of the same FDR-contract string. If the SUT renames the record type later, both files must change in lockstep.
Suggestion: Consolidate in a dedicated runner/helpers/fdr_record_kinds.py module in a future hygiene PBI; not in scope for AZ-423/AZ-427.
Tasks: hygiene, future batch.

Phase 1: Context Loading

Read AZ-423 + AZ-427 task specs. Both have 3 ACs; both are blackbox-test deliverables under E-BBT (AZ-262); both depend only on AZ-406 + AZ-407 (test infra + static fixtures); SUT boundary statements verified.

Phase 2: Spec Compliance

AZ-423 (FT-P-19)

AC	Helper	Scenario assertion	Unit-test coverage
AC-1 (top-K=10 includes tile within 100 m of true centre, 60 stills)	`evaluate_top_k_within_distance`	`assert topk_report.passes`	9 cases (single close, all far, mixed top-K, empty, count mismatch, invalid tolerance, custom tolerance, 60 all-pass, 60 one-fail)
AC-2 (scene-change PARTIAL on 2 `_gmaps.png` pairs)	`evaluate_scene_change_subset`	`assert coverage_complete AND overall_label == "PARTIAL"`	5 cases (both matched still PARTIAL, zero matched still PARTIAL, one image only, empty, wrong image ids)
AC-3 (parameterised)	conftest fixtures	6 collected variants	—

Plus 2 CSV-writer round-trip tests, 5 project_topk_record_to_query cases (happy / malformed / empty / non-dict / type errors), 5 project_scene_change_record cases, 2 iter_*_payloads filter cases — 29 unit tests total for AZ-423.

AZ-427 (FT-N-05)

AC	Helper	Scenario assertion	Unit-test coverage
AC-1 (7mo aged tiles in active-conflict sector → 0 anchored emissions)	`evaluate_aged_tile_rejection(fixture="synth-age-7mo", sector="active_conflict", …)`	parametrised sub-case	3 Signal-A cases + 3 Signal-B cases + 5 degenerate-input cases
AC-2 (13mo aged tiles in rear sector → 0 anchored emissions)	same helper with rear fixture binding	parametrised sub-case	same as AC-1 (sub-case symmetric)
AC-3 (parameterised)	conftest fixtures + 2 sub-case parametrisation	12 collected variants (6 × 2)	—

Plus 6 project_stale_rejection_payload cases (happy / id-alias / wrong-reason / non-dict / missing-id / wrong-type), 1 iter_stale_rejection_payloads filter case, 1 AGED_FIXTURE_SECTOR_BINDINGS contract assertion — 19 unit tests total for AZ-427.

All ACs satisfied. @pytest.mark.traces_to markers wire scenarios to AC IDs.

Phase 3: Code Quality

SOLID: each evaluator is a pure function over typed input dataclasses returning a frozen-dataclass report. retrieval_evaluator and aged_tile_rejection_evaluator are siblings of the existing mid_flight_tile_evaluator / gcs_telemetry_evaluator / tile_cache_inspector pattern.
Error handling: evaluate_top_k_within_distance raises ValueError on invalid tolerance. evaluate_aged_tile_rejection has no failure modes (always returns a report; the report's .passes carries the verdict). No bare except.
Naming: evaluate_* matches sibling helpers. _normalise_image_id in scenario test mirrors the convention from accuracy_evaluator GT keying. Signal A / Signal B properties make the decision matrix self-documenting.
Complexity: longest function evaluate_top_k_within_distance at ~25 lines; evaluate_aged_tile_rejection at ~20 lines. All under coderule's 50-line threshold.
DRY: F2 captures one intra-batch duplication. Within each helper, no duplicated logic.
Test quality: every unit test uses Arrange/Act/Assert comments per coderule. Tests assert specific outcomes (matched_count, anchored_frame_ids, signal flags, CSV content) rather than "no error thrown".
Dead code: none. iter_topk_payloads / iter_scene_change_payloads / iter_stale_rejection_payloads are scenario-test helpers used in the corresponding tests/ files.

Phase 4: Security Quick-Scan

No SQL, no subprocess, no exec/eval.
No hardcoded secrets.
Input validation: all project_* functions guard against malformed dict payloads; evaluate_top_k_within_distance validates tolerance.
No sensitive data in logs or error messages.
evaluate_aged_tile_rejection surfaces illegal_labels separately from passes/fails so a contract-violation defect can't be misread as a freshness-gate defect.

Phase 5: Performance Scan

evaluate_top_k_within_distance is O(N·K) where N=60 images, K=10 candidates → 600 distance computations per scenario. Vincenty via pyproj handles this in well under 100 ms.
evaluate_aged_tile_rejection is O(M) where M = emissions count (≤ 60) + rejections count (typically 0–2). Trivial.
FDR iteration is generator-based via fdr_reader.iter_records.
No N+1 patterns. No unbounded fetching.

Phase 6: Cross-Task Consistency

Both helpers reuse runner.helpers.geo.distance_m (Vincenty WGS84) — consistent with mid_flight_tile_evaluator from batch 83.
aged_tile_rejection_evaluator reuses ALLOWED_SOURCE_LABELS from estimate_schema — single source of truth for the FT-P-03 / AC-1.4 source-label contract.
Scenario tests follow the same skip pattern as FT-P-12/13/15/16/17/18 (sitl_replay_ready gate + fail-loud on missing records).
The frozen-dataclass + @property passes convention matches every prior helper (batches 79-83).
traces_to markers + CSV evidence emission to ft-p-19-*.csv / ft-n-05-*.csv follow the FT-P-01 / FT-P-05 / FT-P-17 cadence.

Phase 7: Architecture Compliance

All new files under e2e/ — owned exclusively by the Blackbox Tests component per _docs/02_document/module-layout.md.
No imports from src/gps_denied_onboard — verified by reading every import in the new files; both helpers carry an explicit "public-boundary discipline" docstring note.
No layering violations.
No new cyclic module dependencies. Both helpers import from .geo / .estimate_schema only; tests import from runner.helpers.*.
No duplicate symbols across components. The intra-component duplicate FDR-kind constant (F2) is intra-helper-set, not cross-component.
No cross-cutting concerns re-implemented locally; geo math, source-label set, and CSV emission all delegate to project-wide utilities.

Verdict

0 Critical, 0 High → no FAIL trigger.
2 Low → PASS_WITH_WARNINGS.

Batch 84 is ready to commit. The two Low findings are surfaced for batch report and feed forward into the cumulative review for batches 82–84.

9.4 KiB Raw Blame History Unescape Escape