AZ-409 (3pt) - FT-P-01 still-image frame-center accuracy:
- accuracy_evaluator.py: GT loader + Vincenty error + AC-2/AC-3 pass-counts
- test_ft_p_01_still_image_accuracy.py: scenario gated on frame_source_replay + sitl_observer NotImplementedError; AC-4 timeout discipline
AZ-412 (3pt) - FT-P-04 Derkachi f2f registration >=95% on normal segments:
- registration_classifier.py: accel-derived attitude + overlap heuristic + success ratio with AC-3 sharp-turn exclusion
- test_ft_p_04_derkachi_f2f_registration.py: scenario gated on frame_source_replay + imu_replay + fdr_reader
AZ-413 (3pt) - FT-P-05 + FT-P-06 cross-domain MRE budgets:
- mre_evaluator.py: per-image budget (strict <2.5px) + 95th-percentile via numpy linear interp + combined report
- test_ft_p_05_sat_anchor.py: cross-domain scenario, reuses accuracy_evaluator for geodesic join
- test_ft_p_06_mre_budgets.py: pure piggyback on FT-P-04 + FT-P-05 CSV evidence; skips when either upstream CSV missing
Tests: 325 unit tests pass (+77 vs batch 69).
Reports: batch_70_report.md, batch_70_review.md (PASS).
Co-authored-by: Cursor <cursoragent@cursor.com>
Code Review Report
Batch: 70 (AZ-409, AZ-412, AZ-413)
Date: 2026-05-16
Verdict: PASS
Findings
(none)
Findings Sweep
Phase 1 — Context Loading
Loaded specs AZ-409_ft_p_01_still_image_accuracy.md,
AZ-412_ft_p_04_derkachi_f2f_registration.md,
AZ-413_ft_p_05_06_sat_anchor_mre.md,
_docs/00_problem/input_data/expected_results/results_report.md
(authoritative Pass/Fail Rules), plus the existing geo.py,
anchor_pair_detector.py, estimate_schema.py helpers for pattern
re-use.
Phase 2 — Spec Compliance
AZ-409 (FT-P-01)
| AC | Test | Status |
|---|---|---|
| AC-1 (per-image distance computed) | test_evaluate_all_pass_yields_overall_pass, test_evaluate_full_timeout_run_produces_zero_pass_counts | Covered |
| AC-2 (≥48/60 within 50 m) | test_evaluate_boundary_threshold_holds, test_evaluate_below_50m_threshold_fails_overall | Covered |
| AC-3 (≥30/60 within 20 m) | test_evaluate_boundary_threshold_holds, test_evaluate_below_20m_threshold_fails_overall | Covered |
| AC-4 (timeout discipline) | test_compute_per_image_timeout_sets_inf_and_false_flags, test_evaluate_missing_estimate_recorded_as_timeout | Covered |
| AC-5 (parametrization 6 variants) | Verified via pytest --collect-only (6 variants collected) | Covered |
| Runtime push-to-SITL end-to-end | gated by _harness_helpers_implemented on frame_source_replay + sitl_observer | NOT COVERED (harness-loop, same pattern as batch 69 AZ-410/AZ-411) |
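
The AC-2 / AC-3 pass-count rule and the AC-4 timeout handling in the table above can be sketched roughly as follows. This is a minimal illustration, not the code in accuracy_evaluator.py: the names `AccuracyReport` and `evaluate_pass_counts` are hypothetical, and treating "within 50 m" as an inclusive comparison is an assumption the review does not confirm.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence

# Thresholds and minimum counts quoted from AC-2 / AC-3; constant names are hypothetical.
THRESHOLD_50M = 50.0
THRESHOLD_20M = 20.0
MIN_PASS_50M = 48
MIN_PASS_20M = 30


@dataclass(frozen=True)
class AccuracyReport:
    within_50m: int
    within_20m: int
    timeouts: int
    overall_pass: bool


def evaluate_pass_counts(errors_m: Sequence[Optional[float]]) -> AccuracyReport:
    """Count images inside each threshold; None marks a missing estimate."""
    # A missing estimate becomes +inf, so it can never satisfy a threshold.
    resolved = [math.inf if e is None else e for e in errors_m]
    # Assumption: the comparison is inclusive (<=); the spec may say otherwise.
    within_50 = sum(1 for e in resolved if e <= THRESHOLD_50M)
    within_20 = sum(1 for e in resolved if e <= THRESHOLD_20M)
    timeouts = sum(1 for e in resolved if math.isinf(e))
    return AccuracyReport(
        within_50m=within_50,
        within_20m=within_20,
        timeouts=timeouts,
        overall_pass=within_50 >= MIN_PASS_50M and within_20 >= MIN_PASS_20M,
    )
```

Mapping a timeout to infinity keeps the counting logic free of special cases: a timed-out image simply fails both thresholds while still being reported separately.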
AZ-412 (FT-P-04)
| AC | Test | Status |
|---|---|---|
| AC-1 (classification reproducibility) | test_classify_frames_is_reproducible_ac1 (uses real Derkachi data_imu.csv first 100 rows) | Covered |
| AC-2 (success ratio ≥ 0.95) | test_compute_success_ratio_perfect_run_passes, test_compute_success_ratio_at_95_pct_passes, test_compute_success_ratio_below_95_pct_fails | Covered |
| AC-3 (sharp-turn frames excluded from denominator) | test_classify_frames_excludes_sharp_roll, test_compute_success_ratio_excludes_sharp_turn_from_denominator_ac3, test_compute_success_ratio_handles_missing_metric_separately | Covered |
| AC-4 (parametrization 6 variants) | Verified via pytest --collect-only | Covered |
| Runtime full Derkachi replay | gated by _harness_helpers_implemented on frame_source_replay, imu_replay, fdr_reader | NOT COVERED (harness-loop) |
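
The interaction between AC-2 and AC-3 (sharp-turn frames leave the denominator before the 0.95 ratio is checked) can be illustrated with a short sketch. The record type `FrameResult` and the function below are hypothetical stand-ins, not the registration_classifier API; how the real helper flags a sharp turn is not shown here.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

MIN_SUCCESS_RATIO = 0.95  # from AC-2; the constant name is hypothetical


@dataclass(frozen=True)
class FrameResult:
    frame_id: int
    is_sharp_turn: bool  # assumed to come from the attitude-based classifier
    registered: bool     # True when frame-to-frame registration succeeded


def success_ratio(frames: Sequence[FrameResult]) -> Tuple[float, bool]:
    """Return (ratio, passed) computed over normal-segment frames only."""
    # AC-3: sharp-turn frames are removed from the denominator entirely.
    normal = [f for f in frames if not f.is_sharp_turn]
    if not normal:
        raise ValueError("no normal-segment frames to evaluate")
    successes = sum(1 for f in normal if f.registered)
    ratio = successes / len(normal)
    return ratio, ratio >= MIN_SUCCESS_RATIO
```

With 95 successful and 5 failed normal frames plus 10 failed sharp-turn frames, the ratio is 95/100 and the run passes; without the exclusion it would be 95/110 and fail.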
AZ-413 (FT-P-05 + FT-P-06)
| AC | Test | Status |
|---|---|---|
| AC-1 (per-image MRE captured) | test_evaluate_per_image_budget_all_pass (covers the captured-list path); test_write_cross_domain_csv_round_trip (CSV column shape) | Covered |
| AC-2 (cross-domain MRE < 2.5 px, all 60) | test_evaluate_per_image_budget_single_fail_fails_overall, test_evaluate_per_image_budget_above_boundary_fails (strict < 2.5 boundary explicitly tested) | Covered |
| AC-3 (accuracy alongside MRE) | Delegated to accuracy_evaluator (already covered by AZ-409 tests); FT-P-05 scenario wires both via evaluate() | Covered by reuse |
| AC-4 (95th-percentile budgets) | test_evaluate_p95_uses_numpy_linear_interpolation, test_evaluate_combined_p95_both_pass, test_evaluate_combined_p95_fails_when_frame_to_frame_fails, test_evaluate_combined_p95_fails_when_cross_domain_fails | Covered |
| AC-5 (parametrization 6 variants per scenario file) | Verified via pytest --collect-only: 12 items across FT-P-05 (6) + FT-P-06 (6) | Covered |
| Runtime push-to-SITL end-to-end | gated by _harness_helpers_implemented on frame_source_replay, sitl_observer, fdr_reader | NOT COVERED (harness-loop) |
No Spec-Gap findings.
Phase 3 — Code Quality
- SRP respected per task:
  - accuracy_evaluator owns geodesic distance + pass-count rules only.
  - registration_classifier owns attitude derivation + overlap heuristic + success ratio only.
  - mre_evaluator owns per-image budget + p95 budget only.
- Error handling consistent: every loader raises FileNotFoundError on missing input and ValueError on header/column drift (matches the AZ-410 / AZ-411 helper pattern).
- Naming: dataclass + function names follow the project's snake_case / CamelCase convention.
- Complexity: longest function is classify_frames at ~50 lines (linear pipeline). All others under 30.
- Tests assert behaviour, not just "no exception": geodesic round-trips against real distances, boundary conditions (exactly 48/60, exactly 0.95 ratio, exactly 2.5 px) are explicitly tested.
- Spec drift guard: each helper has a test_constants_match_spec test that fails if the public constants drift from the AC text (catches a renamer that touches code but forgets the spec).
- Boundary strictness: AC-2 of FT-P-05 says "MRE < 2.5 px"; the helper uses strict < and the test test_evaluate_per_image_budget_single_fail_fails_overall proves a 2.5 px reading FAILS. This is the kind of boundary the spec would otherwise be ambiguous on.
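
The strict per-image boundary and the 95th-percentile computation can be sketched together. The function names below are hypothetical, the p95 budget value is left as a caller-supplied parameter because the review does not state it, and using a strict comparison for the p95 budget is an assumption. `numpy.percentile` interpolates linearly by default, which matches the "numpy linear interp" wording.

```python
from typing import Sequence

import numpy as np

PER_IMAGE_BUDGET_PX = 2.5  # from AC-2 of FT-P-05; the constant name is hypothetical


def per_image_budget_pass(mre_px: Sequence[float]) -> bool:
    """Every image must be strictly below the budget; exactly 2.5 px fails."""
    return all(v < PER_IMAGE_BUDGET_PX for v in mre_px)


def p95(mre_px: Sequence[float]) -> float:
    """95th percentile using numpy's default linear interpolation."""
    values = np.asarray(mre_px, dtype=float)
    if values.size == 0:
        raise ValueError("cannot compute p95 of an empty sample")
    return float(np.percentile(values, 95))


def p95_within_budget(mre_px: Sequence[float], budget_px: float) -> bool:
    # Assumption: strict comparison, mirroring the per-image rule.
    return p95(mre_px) < budget_px
```

For the sample `[1, 2, 3, 4, 5]` the interpolated rank is 0.95 × 4 = 3.8, so the p95 lands between the fourth and fifth values at 4.8.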
Phase 4 — Security
No SQL, no subprocess, no credentials. CSV loaders validate header columns explicitly; numeric coercion via float() / int() raises on garbage input.
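
A loader following this validation pattern might look like the sketch below. It is illustrative only: the column names `image_id` and `mre_px` and the function name `load_rows` are assumptions chosen for the example, not the actual helper signatures.

```python
import csv
from pathlib import Path
from typing import List, Tuple, Union

# Hypothetical required header; the real helpers define their own columns.
REQUIRED_COLUMNS = ("image_id", "mre_px")


def load_rows(path: Union[str, Path]) -> List[Tuple[int, float]]:
    """Read (image_id, mre_px) pairs, failing loudly on missing or drifted input."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"missing input CSV: {csv_path}")
    with csv_path.open(newline="") as fh:
        reader = csv.DictReader(fh)
        header = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing:
            raise ValueError(f"{csv_path}: missing column(s) {missing}")
        # int() / float() raise ValueError on non-numeric text.
        return [(int(row["image_id"]), float(row["mre_px"])) for row in reader]
```

Checking the header before reading any row means a renamed column produces one clear error instead of a KeyError deep inside the evaluation loop.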
Phase 5 — Performance
- All three helpers operate on per-flight-sized data (60 images, ≤14700 frames, ≤4900 IMU rows). Pure-Python loops are fine.
- mre_evaluator.evaluate_p95 uses numpy.percentile (vectorised).
- No new I/O patterns beyond CSV read/write.
Phase 6 — Cross-Task Consistency
- API stability: the three new helpers share the same shape pattern as AZ-410's anchor_pair_detector and AZ-411's estimate_schema: typed @dataclass(frozen=True) records, a load_… reader, an evaluate(…) / compute_… core, a write_csv_evidence emitter. The FT-P-05 scenario reuses accuracy_evaluator.evaluate() (AZ-409) to compute per-image error_m, which demonstrates the cross-task consistency in action.
- No duplicate symbols across batches: each helper module owns disjoint public names; the only shared dependency is runner.helpers.geo.distance_m.
- Scenario-file skip pattern: all 4 new scenario files (test_ft_p_01_*, test_ft_p_04_*, test_ft_p_05_*, test_ft_p_06_*) reuse the _harness_helpers_implemented gate pattern from batch 69. Consistent.
- Within-batch dep (AZ-413 → AZ-412): FT-P-06 reads FT-P-04's CSV (the f2f MRE column). The mre_evaluator's load_frame_to_frame_csv explicitly validates that the mre_px column is present; if absent (FT-P-04 evidence not yet carrying MRE), FT-P-06 fails with a clear message pointing at the SUT contract (AC-NEW-3 FDR schema). This is the safest failure mode for an inter-task dep.
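
The scenario-file skip pattern mentioned above can be approximated as follows. The real _harness_helpers_implemented gate is not shown in this review, so the function `harness_helpers_ready`, the stub, and the probe-callable convention are all hypothetical; the sketch only illustrates gating a scenario on helpers that still raise NotImplementedError.

```python
from typing import Callable, Iterable


def harness_helpers_ready(probes: Iterable[Callable[[], object]]) -> bool:
    """Return False as soon as any upstream helper is still a stub."""
    for probe in probes:
        try:
            probe()
        except NotImplementedError:
            return False
    return True


def frame_source_replay_stub() -> None:
    # Stand-in for an upstream helper that has not landed yet.
    raise NotImplementedError("frame_source_replay is not implemented")


def run_scenario(probes: Iterable[Callable[[], object]]) -> str:
    # In a pytest scenario this branch would call pytest.skip(...) instead.
    if not harness_helpers_ready(probes):
        return "skipped"
    return "ran"
```

Only NotImplementedError is caught, so a helper that exists but fails for another reason still surfaces as an error rather than a silent skip.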
Phase 7 — Architecture Compliance
- Layer direction: every new file under e2e/**. The test_no_sut_imports.py invariant (passes after the run) confirms zero gps_denied_onboard imports across all 14 new files.
- Public API respect: only public names imported across modules (runner.helpers.{geo,accuracy_evaluator,mre_evaluator} etc.). No leading-underscore cross-module imports.
- No new cyclic dependencies: import graph:
  - accuracy_evaluator → geo
  - registration_classifier → (none)
  - mre_evaluator → (numpy + stdlib)
  - tests.positive.test_ft_p_01_* → accuracy_evaluator
  - tests.positive.test_ft_p_04_* → registration_classifier
  - tests.positive.test_ft_p_05_* → accuracy_evaluator + mre_evaluator
  - tests.positive.test_ft_p_06_* → mre_evaluator

  Linear DAG.
- Duplicate symbols across components: none.
- Cross-cutting concerns: pytest plugin registration unchanged from batch 69 (the new helpers don't need a plugin; they're called from scenario test bodies).
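
An import invariant of the kind test_no_sut_imports.py enforces could be checked with the standard `ast` module, as in the sketch below. How the real test is implemented is not stated in this review; the function name `forbidden_imports` and the choice to ignore relative imports are assumptions.

```python
import ast
from typing import List

FORBIDDEN_ROOT = "gps_denied_onboard"  # SUT package named in the review


def forbidden_imports(source: str) -> List[str]:
    """Return every absolute import in `source` rooted at the SUT package."""
    found: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            names = [node.module or ""]
        else:
            continue
        found.extend(n for n in names if n.split(".")[0] == FORBIDDEN_ROOT)
    return found
```

Parsing the syntax tree rather than grepping the text avoids false positives from comments or string literals that merely mention the package name.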
No Architecture findings.
Baseline delta section omitted (no architecture_compliance_baseline.md for this project).
AC Test Coverage Summary
| Task | ACs Covered (unit) | NOT COVERED (harness-loop) | Test File |
|---|---|---|---|
| AZ-409 | 1, 2, 3, 4, 5 | Runtime push-to-SITL end-to-end | test_accuracy_evaluator.py (20 tests) |
| AZ-412 | 1, 2, 3, 4 | Runtime full Derkachi replay | test_registration_classifier.py (26 tests) |
| AZ-413 | 1, 2, 3, 4, 5 | Runtime push-to-SITL end-to-end | test_mre_evaluator.py (22 tests) |
Verdict: PASS
No Critical, High, Medium, or Low findings. The unit-test layer is complete
and consistent across the three tasks; runtime end-to-end paths are
correctly gated and documented as harness-loop ACs pending the upstream
frame_source_replay / sitl_observer / fdr_reader / imu_replay
helpers landing.
Auto-Fix Attempts: 0
No failures — auto-fix gate not entered.
Stuck Agents: 0
None.