[AZ-409] [AZ-412] [AZ-413] Batch 70: FT-P-01/04/05/06 scenarios

AZ-409 (3pt) — FT-P-01 still-image frame-center accuracy:
- accuracy_evaluator.py: GT loader + Vincenty error + AC-2/AC-3 pass-counts
- test_ft_p_01_still_image_accuracy.py: scenario gated on frame_source_replay + sitl_observer NotImplementedError; AC-4 timeout discipline

AZ-412 (3pt) — FT-P-04 Derkachi f2f registration >=95% on normal segments:
- registration_classifier.py: accel-derived attitude + overlap heuristic + success ratio with AC-3 sharp-turn exclusion
- test_ft_p_04_derkachi_f2f_registration.py: scenario gated on frame_source_replay + imu_replay + fdr_reader

AZ-413 (3pt) — FT-P-05 + FT-P-06 cross-domain MRE budgets:
- mre_evaluator.py: per-image budget (strict <2.5px) + 95th-percentile via numpy linear interp + combined report
- test_ft_p_05_sat_anchor.py: cross-domain scenario, reuses accuracy_evaluator for geodesic join
- test_ft_p_06_mre_budgets.py: pure piggyback on FT-P-04 + FT-P-05 CSV evidence; skips when either upstream CSV missing

Tests: 325 unit tests pass (+77 vs batch 69).
Reports: batch_70_report.md, batch_70_review.md (PASS).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 01:41:13 +00:00 · 2026-05-16 18:10:46 +03:00
parent 702a0c0ff3
commit 29ac16cfcb
17 changed files with 3013 additions and 1 deletions
@@ -0,0 +1,209 @@
+# Batch 70 Report — Test Implementation (cycle 1, batch 4 of test phase)
+
+**Batch**: 70
+**Date**: 2026-05-16
+**Context**: Test implementation (greenfield Step 10 — Implement Tests)
+**Tasks**: AZ-409 (3pt), AZ-412 (3pt), AZ-413 (3pt) — 9 cp / 3 tasks
+**Cycle**: 1
+**Verdict**: COMPLETE — PASS (self-reviewed; see `reviews/batch_70_review.md`)
+
+## Summary
+
+Three tasks covering four pure-positive scenarios (FT-P-01/04/05/06) on
+the same Derkachi + still-image fixtures that AZ-407 / AZ-408 set up.
+Each task follows the now-established batch-69 pattern:
+
+* A pure-logic helper module under `e2e/runner/helpers/` (everything the
+  scenario needs except docker-bound replay + observation).
+* A scenario file under `e2e/tests/positive/` parameterized across
+  `(fc_adapter, vio_strategy)` and skip-gated on upstream helper
+  `NotImplementedError` (auto-activates when the harness lands).
+* A unit-test file under `e2e/_unit_tests/helpers/` that drives the
+  helper directly with synthetic + real-fixture data.
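The skip-gating described in the pattern above can be sketched roughly as follows. This is a minimal illustration, not the batch's actual scenario code: `require`, `scenario`, and the stand-in `replay_image_directory` body are hypothetical, and `unittest.SkipTest` is used only so the sketch runs on the standard library (a pytest scenario would more likely call `pytest.skip`).

```python
import unittest


def replay_image_directory(path):
    """Stand-in for an upstream helper that has not landed yet (hypothetical)."""
    raise NotImplementedError("harness not available")


def require(helper, *args, **kwargs):
    """Call an upstream helper and turn NotImplementedError into a skip."""
    try:
        return helper(*args, **kwargs)
    except NotImplementedError as exc:
        reason = f"{helper.__name__} not implemented: {exc}"
        raise unittest.SkipTest(reason) from exc


def scenario():
    # Hypothetical fixture path, for illustration only.
    frames = require(replay_image_directory, "fixtures/still_images")
    # Everything below runs only once the helper stops raising.
    assert frames
```

Because the gate keys on the exception rather than on a flag, the scenario starts exercising its assertions as soon as the upstream helper returns a value, with no edit to the scenario file.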
+
+### AZ-409 — FT-P-01 still-image frame-center accuracy (3pt)
+
+* **`runner/helpers/accuracy_evaluator.py`** — `load_gt_coordinates`
+  parses `_docs/00_problem/input_data/coordinates.csv`; `evaluate`
+  joins by `image_id`, computes Vincenty geodesic distance via
+  `geo.distance_m`, and produces per-image + aggregate reports. The
+  three thresholds are exposed as module constants
+  (`PASS_COUNT_50M_REQUIRED=48`, `PASS_COUNT_20M_REQUIRED=30`,
+  `TOTAL_IMAGES_REQUIRED=60`) so a future spec change has exactly one
+  place to flip. `AggregateReport.overall_pass` is the boolean the
+  scenario asserts.
+* **`tests/positive/test_ft_p_01_still_image_accuracy.py`** — pytest
+  scenario, gated on `frame_source_replay.replay_image_directory` +
+  `sitl_observer.get_observer`. Pushes one image at a time with a 5 s
+  per-image timeout; timeouts are recorded as `(inf, inf)` and propagate
+  to `pass_50m=false`, `pass_20m=false`, `error_m=inf` per AC-4.
+* **20 unit tests** in `test_accuracy_evaluator.py`.
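As a rough sketch of how the pass-counts and the timeout rule interact, the aggregate step might look like the following. The three constants are the ones quoted above; the inclusive `<=` comparisons, the requirement that the total equals 60, and the `aggregate` function itself are assumptions made for illustration and may differ from `accuracy_evaluator.py`.

```python
import math
from dataclasses import dataclass

# Thresholds as quoted in the report.
PASS_COUNT_50M_REQUIRED = 48
PASS_COUNT_20M_REQUIRED = 30
TOTAL_IMAGES_REQUIRED = 60


@dataclass(frozen=True)
class AggregateReport:
    total: int
    pass_50m: int
    pass_20m: int

    @property
    def overall_pass(self) -> bool:
        # Assumption: all three conditions must hold together.
        return (
            self.total == TOTAL_IMAGES_REQUIRED
            and self.pass_50m >= PASS_COUNT_50M_REQUIRED
            and self.pass_20m >= PASS_COUNT_20M_REQUIRED
        )


def aggregate(errors_m):
    """Count images within 50 m and 20 m; an inf error (timeout) fails both."""
    errors = list(errors_m)
    return AggregateReport(
        total=len(errors),
        pass_50m=sum(e <= 50.0 for e in errors),  # assumed inclusive
        pass_20m=sum(e <= 20.0 for e in errors),  # assumed inclusive
    )
```

With 30 images at 10 m, 18 at 40 m, and 12 timeouts recorded as `math.inf`, the counts land exactly on 48 and 30, so `overall_pass` is true; one more timeout in place of a 40 m image drops the 50 m count to 47 and the report fails.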
+
+### AZ-412 — FT-P-04 Derkachi frame-to-frame registration ≥95 % (3pt)
+
+* **`runner/helpers/registration_classifier.py`** — derives bank +
+  pitch from SCALED_IMU2 accelerometer (spec-mandated; AC-1 prohibits
+  internal SUT attitude). The classifier expands each 10 Hz IMU row
+  into 3 video-frame indices (30 fps / 10 Hz = 3), classifies each
+  frame as normal iff bank/pitch ∈ ±10° AND inferred prior-frame
+  overlap ≥40 %, then exposes a `compute_success_ratio(classifications,
+  registration_success_by_frame)` that returns a typed `SuccessReport`
+  with `excluded_by_{attitude,overlap,missing_metric}` counts so AC-3
+  diagnostics survive in the run report.
+* **Inferred-overlap heuristic** — translation = horizontal velocity ×
+  (1/30 s); overlap = `1 - translation / ground_footprint_m` clamped to
+  [0, 1]; default ground footprint = 147 m (from the `camera_info.md`
+  values: 2 × ~141 m altitude × tan(~55° HFOV / 2) ≈ 147 m). The heuristic
+  is explicitly an upper bound;
+  the docstring records the assumption so a future calibration change has
+  the tunable in one place.
+* **`tests/positive/test_ft_p_04_derkachi_f2f_registration.py`** —
+  gated on `frame_source_replay`, `imu_replay`, `fdr_reader`. Reads
+  per-frame `registration_success` from `frame_metric` FDR records;
+  emits `ft-p-04-{fc_adapter}-{vio_strategy}.csv`; asserts AC-2.
+* **26 unit tests** in `test_registration_classifier.py` (including
+  attitude round-trips for ±30° roll/pitch, the reproducibility check
+  on the real first 100 IMU rows, and the boundary ratio cases).
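The classification rule above can be sketched as below. The numeric defaults come from the report; the function names, the sign convention (Z-down body frame with level flight reading `(0, 0, -g)`), the inclusive ±10° bound, and the row-to-frame mapping `3i, 3i+1, 3i+2` are assumptions for illustration and are not taken from `registration_classifier.py`.

```python
import math

GROUND_FOOTPRINT_M = 147.0  # report default
FRAME_DT_S = 1.0 / 30.0     # 30 fps video
FRAMES_PER_IMU_ROW = 3      # 30 fps / 10 Hz
ATTITUDE_LIMIT_DEG = 10.0
MIN_OVERLAP = 0.40


def footprint_m(altitude_m, hfov_deg):
    """Ground width seen by a nadir camera: 2 * h * tan(HFOV / 2)."""
    return 2.0 * altitude_m * math.tan(math.radians(hfov_deg) / 2.0)


def attitude_deg(ax, ay, az):
    """Roll and pitch from accelerometer axes (assumed Z-down convention).

    Only ratios are used, so the accelerometer units cancel.
    """
    roll = math.atan2(-ay, -az)
    pitch = math.atan2(ax, math.hypot(ay, az))
    return math.degrees(roll), math.degrees(pitch)


def inferred_overlap(speed_mps, footprint=GROUND_FOOTPRINT_M):
    """1 - translation / footprint, clamped to [0, 1]."""
    translation_m = speed_mps * FRAME_DT_S
    return min(1.0, max(0.0, 1.0 - translation_m / footprint))


def is_normal(ax, ay, az, speed_mps):
    """Normal iff roll and pitch are within the limit and overlap is enough."""
    roll, pitch = attitude_deg(ax, ay, az)
    return (
        abs(roll) <= ATTITUDE_LIMIT_DEG
        and abs(pitch) <= ATTITUDE_LIMIT_DEG
        and inferred_overlap(speed_mps) >= MIN_OVERLAP
    )


def frame_indices(imu_row_index):
    """Video-frame indices covered by one 10 Hz IMU row (assumed mapping)."""
    start = imu_row_index * FRAMES_PER_IMU_ROW
    return [start + k for k in range(FRAMES_PER_IMU_ROW)]
```

Under these assumptions `footprint_m(141, 55)` comes out at roughly 146.8 m, which rounds to the 147 m default. At ordinary flight speeds the translation per frame is well under a metre, so the translation-only overlap stays close to 1, consistent with the report describing the heuristic as an upper bound.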
+
+### AZ-413 — FT-P-05 + FT-P-06 cross-domain MRE budgets (3pt)
+
+* **`runner/helpers/mre_evaluator.py`** — three independent reports:
+  `PerImageBudgetReport` (FT-P-05 AC-2: every MRE < 2.5 px, strict <),
+  `P95Report` (single-domain p95 < budget), `CombinedP95Report` (FT-P-06
+  AC-4: both domains pass). The 95th percentile uses
+  `numpy.percentile(..., method='linear')` — exactly what the spec
+  mandates. `load_frame_to_frame_csv` raises `ValueError` if the
+  FT-P-04 CSV lacks an `mre_px` column (forces the failure to surface
+  at the SUT-contract layer rather than silently passing).
+* **`tests/positive/test_ft_p_05_sat_anchor.py`** — gated scenario that
+  pushes the 60 images, joins MRE with GT-error via
+  `accuracy_evaluator.evaluate`, emits `ft-p-05.csv`, asserts AC-2 + AC-3.
+* **`tests/positive/test_ft_p_06_mre_budgets.py`** — pure piggyback that
+  reads `ft-p-04-*.csv` + `ft-p-05-*.csv` from the same run and asserts
+  AC-4. Skips (does NOT fail) if either upstream CSV is missing — that
+  failure mode is the FT-P-04 / FT-P-05 scenario's responsibility.
+* **22 unit tests** in `test_mre_evaluator.py`.
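The two budget checks above can be illustrated with a small standard-library sketch. `percentile_linear` re-implements the linear-interpolation rule in plain Python so the example runs without numpy; the function names and the `budget` parameters are hypothetical, and the single-domain p95 budget values are not given in this report, so they are left as arguments.

```python
import math

MRE_BUDGET_PX = 2.5  # per-image budget quoted in the report


def percentile_linear(values, q):
    """q-th percentile with linear interpolation between closest ranks."""
    data = sorted(values)
    if not data:
        raise ValueError("percentile of an empty sequence")
    rank = (len(data) - 1) * q / 100.0
    lo = math.floor(rank)
    hi = math.ceil(rank)
    return data[lo] + (data[hi] - data[lo]) * (rank - lo)


def per_image_pass(mre_px, budget=MRE_BUDGET_PX):
    """Strict '<': a reading exactly at the budget fails.

    Note: an empty input passes vacuously in this sketch.
    """
    return all(m < budget for m in mre_px)


def p95_pass(mre_px, budget):
    """Single-domain check: p95 strictly below the budget."""
    return percentile_linear(mre_px, 95) < budget


def combined_pass(f2f_mre_px, sat_mre_px, f2f_budget, sat_budget):
    """Combined check: both domains must pass their own p95 budget."""
    return p95_pass(f2f_mre_px, f2f_budget) and p95_pass(sat_mre_px, sat_budget)
```

For `[1, 2, 3, 4, 5]` the 95th-percentile rank is 3.8, so the result interpolates 80 % of the way from 4 to 5 and lands at about 4.8. A list containing a reading of exactly 2.5 px fails `per_image_pass`, matching the strict boundary described above.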
+
+## Files added / modified
+
+### Added (10)
+
+AZ-409:
+* `e2e/runner/helpers/accuracy_evaluator.py`
+* `e2e/tests/positive/test_ft_p_01_still_image_accuracy.py`
+* `e2e/_unit_tests/helpers/test_accuracy_evaluator.py`
+
+AZ-412:
+* `e2e/runner/helpers/registration_classifier.py`
+* `e2e/tests/positive/test_ft_p_04_derkachi_f2f_registration.py`
+* `e2e/_unit_tests/helpers/test_registration_classifier.py`
+
+AZ-413:
+* `e2e/runner/helpers/mre_evaluator.py`
+* `e2e/tests/positive/test_ft_p_05_sat_anchor.py`
+* `e2e/tests/positive/test_ft_p_06_mre_budgets.py`
+* `e2e/_unit_tests/helpers/test_mre_evaluator.py`
+
+### Modified (2)
+
+* `e2e/_unit_tests/test_directory_layout.py` — added 3 helper paths and
+  4 scenario paths (the FT-P-01/04/05/06 scenarios; FT-P-02 + FT-P-03/14
+  were added in batch 69).
+* `_docs/_autodev_state.md` — batch 70 pointer.
+
+## Spec / module-layout drift notes
+
+* **AZ-409 AC-5 says "four times" (the 4-variant matrix);** the conftest
+  currently parameterises `(fc_adapter, vio_strategy)` as 2 × 3 = 6
+  variants (`vins_mono` was added in AZ-406 alongside `okvis2` and
+  `klt_ransac`). AC-5 cites "the conftest's `(fc_adapter, vio_strategy)`
+  parameterization" first and gives the 4-variant list only as an example,
+  so the conftest is authoritative. No code change needed; flagged here so
+  the audit trail records the discrepancy.
+* **AZ-412 / AZ-413 same observation** — both ACs say "per
+  parameterization" without pinning a count; the conftest's 6-variant
+  matrix is what runs.
+* **AZ-412 attitude convention** — the helper docstring records the
+  Z-down + accel-decomposition assumption explicitly (the SCALED_IMU2
+  wire format doesn't ship attitude). Roll/pitch ±30° round-trips are
+  tested to confirm the decomposition.
+* **AZ-412 ground footprint** — default 147 m is derived from
+  `camera_info.md` (~141 m alt, ~55° HFOV). Recorded as a module
+  constant + classifier kwarg so a future re-calibration touches one
+  place.
+* **AZ-413 strict `<` boundary** — AC-2 says "MRE < 2.5 px"; the helper
+  uses `<` (not `≤`), and the unit test
+  `test_evaluate_per_image_budget_single_fail_fails_overall` proves a
+  2.5 px reading FAILS. Removes the boundary ambiguity.
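To make the 2 × 3 = 6 count in the first drift note concrete, the matrix can be sketched as a Cartesian product. The three strategy names are the ones listed above; the two adapter names are placeholders, since this report does not name them, and `csv_name` simply applies the FT-P-04 file-name pattern quoted earlier.

```python
from itertools import product

# Placeholder adapter names: the report only implies that there are two.
FC_ADAPTERS = ["fc_adapter_a", "fc_adapter_b"]
# Strategy names as listed in the report.
VIO_STRATEGIES = ["okvis2", "klt_ransac", "vins_mono"]

# Every (fc_adapter, vio_strategy) pair, adapters varying slowest.
VARIANTS = list(product(FC_ADAPTERS, VIO_STRATEGIES))


def csv_name(fc_adapter, vio_strategy):
    """Per-variant evidence file, following the ft-p-04 naming pattern."""
    return f"ft-p-04-{fc_adapter}-{vio_strategy}.csv"
```

Deriving the variants from the two lists means that adding a strategy grows the matrix in one place, which is the property that makes the conftest, rather than a hard-coded count in an AC, the natural source of truth.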
+
+## Test Results
+
+### Focused tests (Step 6.4)
+
+`pytest e2e/_unit_tests/` — **325 passed in 172.07s** (was 248 at end
+of batch 69; +77 new tests across this batch).
+
+Breakdown of new tests:
+
+* AZ-409 — 20 tests
+* AZ-412 — 26 tests
+* AZ-413 — 22 tests
+* AZ-409/412/413 directory_layout entries — 9 new parametrize cases
+
+Scenario collection: 6 scenario files × parametrize matrix yields 42
+collected items in `e2e/tests/positive/` (all 4 new scenario files plus
+the 2 from batch 69). Every scenario file remains correctly skip-gated;
+no premature activation.
+
+### No full-project pytest run
+
+Per the implement skill's Test-Run Cadence, Step 16 owns the only
+full-project suite invocation; batches run focused tests only.
+
+## AC Test Coverage
+
+See `reviews/batch_70_review.md` for the per-AC traceability table. In
+summary: every unit-testable AC is covered; every runtime-only AC
+(end-to-end harness loop) is documented as gated and auto-activating
+when the upstream helpers land.
+
+## Code Review Verdict
+
+Self-reviewed — PASS. See `reviews/batch_70_review.md` for the full
+sweep (no Critical / High / Medium / Low findings).
+
+## Auto-Fix Attempts
+
+0. No code-review failures — auto-fix gate was not entered.
+
+## Stuck Agents
+
+None.
+
+## Deferred follow-ups
+
+Unchanged from batch 69 (same list, same owners):
+
+* `runner.helpers.frame_source_replay.FrameSourceReplayer.{replay_video,
+  replay_image_directory}` — owned by AZ-441.
+* `runner.helpers.fdr_reader.iter_records` — owned by AZ-441.
+* `runner.helpers.imu_replay.ImuReplayer.replay` — owned by AZ-407
+  per scaffold docstring (not landed yet).
+* `runner.helpers.sitl_observer.get_observer` — owned by AZ-416 / AZ-417.
+* `runner.helpers.mavproxy_tlog_reader.iter_messages` — owned by AZ-416.
+
+This batch did not introduce any new debt.
+
+## Next Batch
+
+Batch 71 candidate set (all are 3pt scenario tasks unblocked by this
+batch's helpers + existing AZ-407 fixtures):
+
+* AZ-414 (FT-P-07 + FT-N-02 — sharp-turn behaviour)
+* AZ-415 (FT-P-08 — multi-segment relocalisation)
+* AZ-418 (FT-P-10 — smoothing lookback)
+
+Likely composition: ~9 cp across 3 tasks, same shape as batches 69–70.
+
+The next milestone after batches 71–72 will be the K=3 cumulative
+review covering batches 70, 71, 72 (the current `last_cumulative_review`
+is `batches_67-69`).