# Code Review Report

**Batch**: 70 — AZ-409, AZ-412, AZ-413
**Date**: 2026-05-16
**Verdict**: PASS

## Findings

(none)

## Findings Sweep

### Phase 1 — Context Loading

Loaded specs `AZ-409_ft_p_01_still_image_accuracy.md`,
`AZ-412_ft_p_04_derkachi_f2f_registration.md`,
`AZ-413_ft_p_05_06_sat_anchor_mre.md`,
`_docs/00_problem/input_data/expected_results/results_report.md`
(authoritative Pass/Fail Rules), plus the existing `geo.py`,
`anchor_pair_detector.py`, `estimate_schema.py` helpers for pattern
re-use.

### Phase 2 — Spec Compliance

**AZ-409 (FT-P-01)**

| AC | Test | Status |
|----|------|--------|
| AC-1 (per-image distance computed) | `test_evaluate_all_pass_yields_overall_pass`, `test_evaluate_full_timeout_run_produces_zero_pass_counts` | Covered |
| AC-2 (≥48/60 within 50 m) | `test_evaluate_boundary_threshold_holds`, `test_evaluate_below_50m_threshold_fails_overall` | Covered |
| AC-3 (≥30/60 within 20 m) | `test_evaluate_boundary_threshold_holds`, `test_evaluate_below_20m_threshold_fails_overall` | Covered |
| AC-4 (timeout discipline) | `test_compute_per_image_timeout_sets_inf_and_false_flags`, `test_evaluate_missing_estimate_recorded_as_timeout` | Covered |
| AC-5 (parametrization 6 variants) | Verified via `pytest --collect-only` — 6 variants collected | Covered |
| Runtime push-to-SITL end-to-end | gated by `_harness_helpers_implemented` on `frame_source_replay` + `sitl_observer` | NOT COVERED (harness-loop, same pattern as batch 69 AZ-410/AZ-411) |

**AZ-412 (FT-P-04)**

| AC | Test | Status |
|----|------|--------|
| AC-1 (classification reproducibility) | `test_classify_frames_is_reproducible_ac1` (uses real Derkachi data_imu.csv first 100 rows) | Covered |
| AC-2 (success ratio ≥ 0.95) | `test_compute_success_ratio_perfect_run_passes`, `test_compute_success_ratio_at_95_pct_passes`, `test_compute_success_ratio_below_95_pct_fails` | Covered |
| AC-3 (sharp-turn frames excluded from denominator) | `test_classify_frames_excludes_sharp_roll`, `test_compute_success_ratio_excludes_sharp_turn_from_denominator_ac3`, `test_compute_success_ratio_handles_missing_metric_separately` | Covered |
| AC-4 (parametrization 6 variants) | Verified via `pytest --collect-only` | Covered |
| Runtime full Derkachi replay | gated by `_harness_helpers_implemented` on `frame_source_replay`, `imu_replay`, `fdr_reader` | NOT COVERED (harness-loop) |

**AZ-413 (FT-P-05 + FT-P-06)**

| AC | Test | Status |
|----|------|--------|
| AC-1 (per-image MRE captured) | `test_evaluate_per_image_budget_all_pass` (covers the captured-list path); `test_write_cross_domain_csv_round_trip` (CSV column shape) | Covered |
| AC-2 (cross-domain MRE < 2.5 px, all 60) | `test_evaluate_per_image_budget_single_fail_fails_overall`, `test_evaluate_per_image_budget_above_boundary_fails` (strict < 2.5 boundary explicitly tested) | Covered |
| AC-3 (accuracy alongside MRE) | Delegated to `accuracy_evaluator` (already covered by AZ-409 tests); FT-P-05 scenario wires both via `evaluate()` | Covered by reuse |
| AC-4 (95th-percentile budgets) | `test_evaluate_p95_uses_numpy_linear_interpolation`, `test_evaluate_combined_p95_both_pass`, `test_evaluate_combined_p95_fails_when_frame_to_frame_fails`, `test_evaluate_combined_p95_fails_when_cross_domain_fails` | Covered |
| AC-5 (parametrization 6 variants per scenario file) | Verified via `pytest --collect-only` — 12 items between FT-P-05 (6) + FT-P-06 (6) | Covered |
| Runtime push-to-SITL end-to-end | gated by `_harness_helpers_implemented` on `frame_source_replay`, `sitl_observer`, `fdr_reader` | NOT COVERED (harness-loop) |

No Spec-Gap findings.

### Phase 3 — Code Quality

- **SRP** respected per task:
  - `accuracy_evaluator` owns geodesic distance + pass-count rules only.
  - `registration_classifier` owns attitude derivation + overlap heuristic + success ratio only.
  - `mre_evaluator` owns per-image budget + p95 budget only.
- **Error handling** consistent: every loader raises `FileNotFoundError` on missing input and `ValueError` on header/column drift (matches the AZ-410 / AZ-411 helper pattern).
- **Naming**: dataclass + function names follow the project's snake_case / CamelCase convention.
- **Complexity**: longest function is `classify_frames` at ~50 lines (linear pipeline). All others under 30.
- **Tests assert behaviour**, not just "no exception": geodesic round-trips against real distances, boundary conditions (exactly 48/60, exactly 0.95 ratio, exactly 2.5 px) are explicitly tested.
- **Spec drift guard**: each helper has a `test_constants_match_spec` test that fails if the public constants drift from the AC text (catches a renamer that touches code but forgets the spec).
- **Boundary strictness**: AC-2 of FT-P-05 says "MRE < 2.5 px"; the helper uses strict `<` and the test `test_evaluate_per_image_budget_single_fail_fails_overall` proves a 2.5 px reading FAILS. This is the kind of boundary the spec would otherwise be ambiguous on.

### Phase 4 — Security

No SQL, no subprocess, no credentials. CSV loaders validate header columns explicitly; numeric coercion via `float()` / `int()` raises on garbage input.

### Phase 5 — Performance

- All three helpers operate on per-flight-sized data (60 images, ≤14700 frames, ≤4900 IMU rows). Pure-Python loops are fine.
- `mre_evaluator.evaluate_p95` uses `numpy.percentile` (vectorised).
- No new I/O patterns beyond CSV read/write.

### Phase 6 — Cross-Task Consistency

- **API stability**: the three new helpers share the same shape pattern as AZ-410's `anchor_pair_detector` and AZ-411's `estimate_schema` — typed `@dataclass(frozen=True)` records, a `load_…` reader, an `evaluate(…)` / `compute_…` core, a `write_csv_evidence` emitter. The FT-P-05 scenario reuses `accuracy_evaluator.evaluate()` (AZ-409) to compute per-image error_m → demonstrates the cross-task consistency in action.
- **No duplicate symbols across batches**: each helper module owns disjoint public names; the only shared dependency is `runner.helpers.geo.distance_m`.
- **Scenario-file skip pattern**: all 4 new scenario files (`test_ft_p_01_*`, `test_ft_p_04_*`, `test_ft_p_05_*`, `test_ft_p_06_*`) reuse the `_harness_helpers_implemented` gate pattern from batch 69. Consistent.
- **Within-batch dep (AZ-413 → AZ-412)**: FT-P-06 reads FT-P-04's CSV (the f2f MRE column). The mre_evaluator's `load_frame_to_frame_csv` explicitly validates that the `mre_px` column is present; if absent (FT-P-04 evidence not yet carrying MRE), FT-P-06 fails with a clear message pointing at the SUT contract (AC-NEW-3 FDR schema). This is the safest failure mode for an inter-task dep.

### Phase 7 — Architecture Compliance

1. **Layer direction**: every new file under `e2e/**`. The `test_no_sut_imports.py` invariant (passes after the run) confirms zero `gps_denied_onboard` imports across all 14 new files.
2. **Public API respect**: only public names imported across modules (`runner.helpers.{geo,accuracy_evaluator,mre_evaluator}` etc.). No leading-underscore cross-module imports.
3. **No new cyclic dependencies**: import graph:
   - `accuracy_evaluator` → `geo`
   - `registration_classifier` → (none)
   - `mre_evaluator` → (numpy + stdlib)
   - `tests.positive.test_ft_p_01_*` → `accuracy_evaluator`
   - `tests.positive.test_ft_p_04_*` → `registration_classifier`
   - `tests.positive.test_ft_p_05_*` → `accuracy_evaluator` + `mre_evaluator`
   - `tests.positive.test_ft_p_06_*` → `mre_evaluator`
   Linear DAG.
4. **Duplicate symbols across components**: none.
5. **Cross-cutting concerns**: pytest plugin registration unchanged from batch 69 (the new helpers don't need a plugin — they're called from scenario test bodies).

No Architecture findings.

Baseline delta section omitted (no `architecture_compliance_baseline.md` for this project).

## AC Test Coverage Summary

| Task | ACs Covered (unit) | NOT COVERED (harness-loop) | Test File |
|------|---------------------|----------------------------|-----------|
| AZ-409 | 1, 2, 3, 4, 5 | Runtime push-to-SITL end-to-end | `test_accuracy_evaluator.py` (20 tests) |
| AZ-412 | 1, 2, 3, 4 | Runtime full Derkachi replay | `test_registration_classifier.py` (26 tests) |
| AZ-413 | 1, 2, 3, 4, 5 | Runtime push-to-SITL end-to-end | `test_mre_evaluator.py` (22 tests) |

## Verdict: PASS

No Critical, High, Medium, or Low findings. Unit-test layer is complete
and consistent across the three tasks; runtime end-to-end paths are
correctly gated and documented as hardware-loop ACs pending the upstream
`frame_source_replay` / `sitl_observer` / `fdr_reader` / `imu_replay`
helpers landing.

## Auto-Fix Attempts: 0

No failures — auto-fix gate not entered.

## Stuck Agents: 0

None.