Oleksandr Bezdieniezhnykh 29ac16cfcb [AZ-409] [AZ-412] [AZ-413] Batch 70: FT-P-01/04/05/06 scenarios
AZ-409 (3pt) — FT-P-01 still-image frame-center accuracy:
- accuracy_evaluator.py: GT loader + Vincenty error + AC-2/AC-3 pass-counts
- test_ft_p_01_still_image_accuracy.py: scenario gated on frame_source_replay
  + sitl_observer NotImplementedError; AC-4 timeout discipline

AZ-412 (3pt) — FT-P-04 Derkachi f2f registration >=95% on normal segments:
- registration_classifier.py: accel-derived attitude + overlap heuristic
  + success ratio with AC-3 sharp-turn exclusion
- test_ft_p_04_derkachi_f2f_registration.py: scenario gated on
  frame_source_replay + imu_replay + fdr_reader

AZ-413 (3pt) — FT-P-05 + FT-P-06 cross-domain MRE budgets:
- mre_evaluator.py: per-image budget (strict <2.5px) + 95th-percentile
  via numpy linear interp + combined report
- test_ft_p_05_sat_anchor.py: cross-domain scenario, reuses
  accuracy_evaluator for geodesic join
- test_ft_p_06_mre_budgets.py: pure piggyback on FT-P-04 + FT-P-05 CSV
  evidence; skips when either upstream CSV missing

Tests: 325 unit tests pass (+77 vs batch 69).
Reports: batch_70_report.md, batch_70_review.md (PASS).
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-16 18:10:46 +03:00


Code Review Report

Batch: 70 (AZ-409, AZ-412, AZ-413)
Date: 2026-05-16
Verdict: PASS

Findings

(none)

Findings Sweep

Phase 1 — Context Loading

Loaded specs AZ-409_ft_p_01_still_image_accuracy.md, AZ-412_ft_p_04_derkachi_f2f_registration.md, AZ-413_ft_p_05_06_sat_anchor_mre.md, _docs/00_problem/input_data/expected_results/results_report.md (authoritative Pass/Fail Rules), plus the existing geo.py, anchor_pair_detector.py, estimate_schema.py helpers for pattern re-use.

Phase 2 — Spec Compliance

AZ-409 (FT-P-01)

| AC | Test | Status |
|---|---|---|
| AC-1 (per-image distance computed) | test_evaluate_all_pass_yields_overall_pass, test_evaluate_full_timeout_run_produces_zero_pass_counts | Covered |
| AC-2 (≥48/60 within 50 m) | test_evaluate_boundary_threshold_holds, test_evaluate_below_50m_threshold_fails_overall | Covered |
| AC-3 (≥30/60 within 20 m) | test_evaluate_boundary_threshold_holds, test_evaluate_below_20m_threshold_fails_overall | Covered |
| AC-4 (timeout discipline) | test_compute_per_image_timeout_sets_inf_and_false_flags, test_evaluate_missing_estimate_recorded_as_timeout | Covered |
| AC-5 (parametrization 6 variants) | Verified via pytest --collect-only (6 variants collected) | Covered |
| Runtime push-to-SITL end-to-end | gated by _harness_helpers_implemented on frame_source_replay + sitl_observer | NOT COVERED (harness-loop, same pattern as batch 69 AZ-410/AZ-411) |
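
The AC-2 / AC-3 pass-count rules and the AC-4 timeout handling can be illustrated with a minimal sketch. This is not the code in accuracy_evaluator.py: the names `AccuracyReport` and `evaluate_errors` are hypothetical, and treating "within 50 m" as inclusive (`<=`) is an assumption. Only the thresholds (48/60 at 50 m, 30/60 at 20 m) and the use of infinity for a timed-out image come from the report.

```python
import math
from dataclasses import dataclass

# Thresholds quoted from the AC text; constant names are hypothetical.
MIN_WITHIN_50M = 48
MIN_WITHIN_20M = 30


@dataclass(frozen=True)
class AccuracyReport:
    within_50m: int
    within_20m: int
    passed: bool


def evaluate_errors(errors_m):
    """Count images within 50 m and 20 m of ground truth.

    A timed-out image is represented as math.inf, so it can never
    satisfy either threshold. Inclusive comparison is an assumption.
    """
    within_50 = sum(1 for e in errors_m if e <= 50.0)
    within_20 = sum(1 for e in errors_m if e <= 20.0)
    passed = within_50 >= MIN_WITHIN_50M and within_20 >= MIN_WITHIN_20M
    return AccuracyReport(within_50, within_20, passed)


# Boundary run: 30 tight fixes, 18 loose fixes, 12 timeouts.
report = evaluate_errors([10.0] * 30 + [40.0] * 18 + [math.inf] * 12)
```

Representing a timeout as infinity keeps the counting logic free of special cases: a run where every image times out simply yields zero pass-counts.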

AZ-412 (FT-P-04)

| AC | Test | Status |
|---|---|---|
| AC-1 (classification reproducibility) | test_classify_frames_is_reproducible_ac1 (uses real Derkachi data_imu.csv first 100 rows) | Covered |
| AC-2 (success ratio ≥ 0.95) | test_compute_success_ratio_perfect_run_passes, test_compute_success_ratio_at_95_pct_passes, test_compute_success_ratio_below_95_pct_fails | Covered |
| AC-3 (sharp-turn frames excluded from denominator) | test_classify_frames_excludes_sharp_roll, test_compute_success_ratio_excludes_sharp_turn_from_denominator_ac3, test_compute_success_ratio_handles_missing_metric_separately | Covered |
| AC-4 (parametrization 6 variants) | Verified via pytest --collect-only | Covered |
| Runtime full Derkachi replay | gated by _harness_helpers_implemented on frame_source_replay, imu_replay, fdr_reader | NOT COVERED (harness-loop) |
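
The AC-2 / AC-3 interaction (a 0.95 success ratio computed over normal segments only) can be sketched as follows. The `FrameResult` record and the return shape are hypothetical. Tallying frames without a metric separately and leaving them out of the denominator is an assumption of this sketch; the real registration_classifier may account for them differently.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SUCCESS_RATIO = 0.95  # from AC-2


@dataclass(frozen=True)
class FrameResult:
    sharp_turn: bool            # True when the frame falls in a sharp-turn segment
    registered: Optional[bool]  # None when no registration metric was recorded


def compute_success_ratio(frames):
    """Return (ratio, missing_count, passed) over normal-segment frames.

    Sharp-turn frames are dropped from the denominator (AC-3).
    Frames with no metric are counted separately (assumption).
    """
    normal = [f for f in frames if not f.sharp_turn]
    missing = sum(1 for f in normal if f.registered is None)
    scored = [f for f in normal if f.registered is not None]
    if not scored:
        raise ValueError("no scored normal-segment frames")
    ratio = sum(1 for f in scored if f.registered) / len(scored)
    return ratio, missing, ratio >= MIN_SUCCESS_RATIO


frames = (
    [FrameResult(False, True)] * 95
    + [FrameResult(False, False)] * 5
    + [FrameResult(True, False)] * 20  # sharp-turn failures must not count
)
ratio, missing, passed = compute_success_ratio(frames)
```

In this example the 20 failed sharp-turn frames do not lower the ratio, so the run sits exactly on the 0.95 boundary and passes.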

AZ-413 (FT-P-05 + FT-P-06)

| AC | Test | Status |
|---|---|---|
| AC-1 (per-image MRE captured) | test_evaluate_per_image_budget_all_pass (covers the captured-list path); test_write_cross_domain_csv_round_trip (CSV column shape) | Covered |
| AC-2 (cross-domain MRE < 2.5 px, all 60) | test_evaluate_per_image_budget_single_fail_fails_overall, test_evaluate_per_image_budget_above_boundary_fails (strict < 2.5 boundary explicitly tested) | Covered |
| AC-3 (accuracy alongside MRE) | Delegated to accuracy_evaluator (already covered by AZ-409 tests); FT-P-05 scenario wires both via evaluate() | Covered by reuse |
| AC-4 (95th-percentile budgets) | test_evaluate_p95_uses_numpy_linear_interpolation, test_evaluate_combined_p95_both_pass, test_evaluate_combined_p95_fails_when_frame_to_frame_fails, test_evaluate_combined_p95_fails_when_cross_domain_fails | Covered |
| AC-5 (parametrization 6 variants per scenario file) | Verified via pytest --collect-only: 12 items, FT-P-05 (6) + FT-P-06 (6) | Covered |
| Runtime push-to-SITL end-to-end | gated by _harness_helpers_implemented on frame_source_replay, sitl_observer, fdr_reader | NOT COVERED (harness-loop) |
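
The strict per-image budget in AC-2 can be sketched in a few lines. The function name and return shape below are hypothetical; only the 2.5 px limit and the strict `<` comparison come from the report.

```python
MRE_BUDGET_PX = 2.5  # from AC-2; strict upper bound


def check_per_image_budget(mre_px):
    """Return (passed, failing_indices) for a list of per-image MRE values.

    `not value < budget` is used instead of `value >= budget` so that
    a NaN reading is also reported as a failure.
    """
    failing = [i for i, value in enumerate(mre_px) if not value < MRE_BUDGET_PX]
    return len(failing) == 0, failing


# A single reading at exactly 2.5 px fails the whole run.
readings = [2.4] * 60
readings[3] = 2.5
passed, failing = check_per_image_budget(readings)
```

Because every one of the 60 images must be under budget, one failing index is enough to fail the scenario.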

No Spec-Gap findings.

Phase 3 — Code Quality

  • SRP respected per task:
    • accuracy_evaluator owns geodesic distance + pass-count rules only.
    • registration_classifier owns attitude derivation + overlap heuristic + success ratio only.
    • mre_evaluator owns per-image budget + p95 budget only.
  • Error handling consistent: every loader raises FileNotFoundError on missing input and ValueError on header/column drift (matches the AZ-410 / AZ-411 helper pattern).
  • Naming: dataclass + function names follow the project's snake_case / CamelCase convention.
  • Complexity: longest function is classify_frames at ~50 lines (linear pipeline). All others under 30.
  • Tests assert behaviour, not just "no exception": geodesic round-trips against real distances, boundary conditions (exactly 48/60, exactly 0.95 ratio, exactly 2.5 px) are explicitly tested.
  • Spec drift guard: each helper has a test_constants_match_spec test that fails if the public constants drift from the AC text (catches a renamer that touches code but forgets the spec).
  • Boundary strictness: AC-2 of FT-P-05 says "MRE < 2.5 px"; the helper uses strict < and the test test_evaluate_per_image_budget_single_fail_fails_overall proves a 2.5 px reading FAILS. Pinning the boundary in a test removes any doubt about whether a reading of exactly 2.5 px passes.
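
The loader error-handling convention described above (FileNotFoundError for a missing input, ValueError for header drift) might look roughly like this. The function name `load_mre_csv` and the `frame_id` column are hypothetical; `mre_px` is the only column name the report mentions.

```python
import csv
from pathlib import Path

# "mre_px" appears in the report; "frame_id" is a hypothetical companion column.
REQUIRED_COLUMNS = ("frame_id", "mre_px")


def load_mre_csv(path):
    """Read (frame_id, mre_px) rows, failing loudly on bad input."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"evidence CSV not found: {path}")
    with path.open(newline="") as fh:
        reader = csv.DictReader(fh)
        header = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing:
            raise ValueError(f"{path}: missing column(s) {missing}")
        # int() / float() raise ValueError on non-numeric strings.
        return [(int(row["frame_id"]), float(row["mre_px"])) for row in reader]
```

Validating the header before reading any row means a schema change surfaces as one clear message rather than a KeyError partway through the file.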

Phase 4 — Security

No SQL, no subprocess, no credentials. CSV loaders validate header columns explicitly; numeric coercion via float() / int() raises on garbage input.

Phase 5 — Performance

  • All three helpers operate on per-flight-sized data (60 images, ≤14700 frames, ≤4900 IMU rows). Pure-Python loops are fine.
  • mre_evaluator.evaluate_p95 uses numpy.percentile (vectorised).
  • No new I/O patterns beyond CSV read/write.
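
For reference, the 95th percentile with linear interpolation can be written out by hand and compared with numpy.percentile, whose default method is linear interpolation between order statistics. The function names are hypothetical, and the budget value itself is not stated in this report, so it is omitted here.

```python
import math

import numpy as np


def p95_manual(values):
    """95th percentile by linear interpolation between order statistics."""
    if not values:
        raise ValueError("p95 of an empty sequence is undefined")
    ordered = sorted(values)
    rank = 0.95 * (len(ordered) - 1)       # fractional index into the sorted data
    lo = math.floor(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (rank - lo) * (ordered[hi] - ordered[lo])


def p95_numpy(values):
    """Same statistic via numpy's default (linear) percentile method."""
    return float(np.percentile(np.asarray(values, dtype=float), 95))


sample = [0.0, 1.0, 2.0, 3.0]
manual = p95_manual(sample)   # rank 2.85 falls between 2.0 and 3.0
vectorised = p95_numpy(sample)
```

For the four-element sample the fractional rank is 0.95 × 3 = 2.85, so the result lies 85% of the way from 2.0 to 3.0, which is approximately 2.85.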

Phase 6 — Cross-Task Consistency

  • API stability: the three new helpers follow the same shape as AZ-410's anchor_pair_detector and AZ-411's estimate_schema: typed @dataclass(frozen=True) records, a load_… reader, an evaluate(…) / compute_… core, and a write_csv_evidence emitter. The FT-P-05 scenario reuses accuracy_evaluator.evaluate() (AZ-409) to compute per-image error_m, which shows the cross-task consistency in practice.
  • No duplicate symbols across batches: each helper module owns disjoint public names; the only shared dependency is runner.helpers.geo.distance_m.
  • Scenario-file skip pattern: all 4 new scenario files (test_ft_p_01_*, test_ft_p_04_*, test_ft_p_05_*, test_ft_p_06_*) reuse the _harness_helpers_implemented gate pattern from batch 69. Consistent.
  • Within-batch dep (AZ-413 → AZ-412): FT-P-06 reads FT-P-04's CSV (the f2f MRE column). The mre_evaluator's load_frame_to_frame_csv explicitly validates that the mre_px column is present; if absent (FT-P-04 evidence not yet carrying MRE), FT-P-06 fails with a clear message pointing at the SUT contract (AC-NEW-3 FDR schema). This is the safest failure mode for an inter-task dep.
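
The scenario-file skip gate mentioned above is driven by helpers that still raise NotImplementedError. A minimal sketch of such a gate follows; the signature (a list of zero-argument probe callables) is an assumption and may not match the real _harness_helpers_implemented.

```python
def harness_helpers_implemented(*probes):
    """Return True only if every probe runs without NotImplementedError.

    Each probe is a zero-argument callable that exercises one harness
    helper (hypothetical interface). Other exceptions are deliberately
    not swallowed, so a genuinely broken helper still surfaces.
    """
    for probe in probes:
        try:
            probe()
        except NotImplementedError:
            return False
    return True


def ready_helper():
    return None


def pending_helper():
    raise NotImplementedError("helper not landed yet")


gate_open = harness_helpers_implemented(ready_helper)
gate_closed = harness_helpers_implemented(ready_helper, pending_helper)
```

In a scenario body the result would typically feed a call to pytest.skip with a reason naming the missing helper, so the skipped runtime path stays visible in the test report.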

Phase 7 — Architecture Compliance

  1. Layer direction: every new file under e2e/**. The test_no_sut_imports.py invariant (passes after the run) confirms zero gps_denied_onboard imports across all 14 new files.
  2. Public API respect: only public names imported across modules (runner.helpers.{geo,accuracy_evaluator,mre_evaluator} etc.). No leading-underscore cross-module imports.
  3. No new cyclic dependencies: import graph:
    • accuracy_evaluator → geo
    • registration_classifier → (none)
    • mre_evaluator → (numpy + stdlib)
    • tests.positive.test_ft_p_01_* → accuracy_evaluator
    • tests.positive.test_ft_p_04_* → registration_classifier
    • tests.positive.test_ft_p_05_* → accuracy_evaluator + mre_evaluator
    • tests.positive.test_ft_p_06_* → mre_evaluator
     Linear DAG.
  4. Duplicate symbols across components: none.
  5. Cross-cutting concerns: pytest plugin registration unchanged from batch 69 (the new helpers don't need a plugin — they're called from scenario test bodies).
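
An invariant like the one test_no_sut_imports.py enforces can be checked statically with the standard-library ast module. The sketch below is illustrative only: the function name is hypothetical and the real test may be implemented differently. The package name gps_denied_onboard is taken from the report.

```python
import ast

FORBIDDEN_ROOT = "gps_denied_onboard"  # SUT package named in the report


def forbidden_imports(source, forbidden=FORBIDDEN_ROOT):
    """Return the imported module names in `source` that belong to `forbidden`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        # Match the package itself or any submodule, but not look-alike names.
        hits.extend(
            n for n in names if n == forbidden or n.startswith(forbidden + ".")
        )
    return hits


clean = forbidden_imports("import numpy\nfrom runner.helpers import geo\n")
dirty = forbidden_imports(
    "import gps_denied_onboard.nav\nfrom gps_denied_onboard import core\n"
)
```

Parsing the source instead of importing it means the check also catches imports placed inside functions and never executes any e2e or SUT code.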

No Architecture findings.

Baseline delta section omitted (no architecture_compliance_baseline.md for this project).

AC Test Coverage Summary

| Task | ACs Covered (unit) | NOT COVERED (harness-loop) | Test File |
|---|---|---|---|
| AZ-409 | 1, 2, 3, 4, 5 | Runtime push-to-SITL end-to-end | test_accuracy_evaluator.py (20 tests) |
| AZ-412 | 1, 2, 3, 4 | Runtime full Derkachi replay | test_registration_classifier.py (26 tests) |
| AZ-413 | 1, 2, 3, 4, 5 | Runtime push-to-SITL end-to-end | test_mre_evaluator.py (22 tests) |

Verdict: PASS

No Critical, High, Medium, or Low findings. The unit-test layer is complete and consistent across the three tasks; runtime end-to-end paths are correctly gated and documented as harness-loop ACs, pending the upstream frame_source_replay / sitl_observer / fdr_reader / imu_replay helpers landing.

Auto-Fix Attempts: 0

No failures — auto-fix gate not entered.

Stuck Agents: 0

None.