Oleksandr Bezdieniezhnykh 29ac16cfcb [AZ-409] [AZ-412] [AZ-413] Batch 70: FT-P-01/04/05/06 scenarios
AZ-409 (3pt) — FT-P-01 still-image frame-center accuracy:
- accuracy_evaluator.py: GT loader + Vincenty error + AC-2/AC-3 pass-counts
- test_ft_p_01_still_image_accuracy.py: scenario gated on frame_source_replay
  + sitl_observer NotImplementedError; AC-4 timeout discipline

AZ-412 (3pt) — FT-P-04 Derkachi f2f registration >=95% on normal segments:
- registration_classifier.py: accel-derived attitude + overlap heuristic
  + success ratio with AC-3 sharp-turn exclusion
- test_ft_p_04_derkachi_f2f_registration.py: scenario gated on
  frame_source_replay + imu_replay + fdr_reader

AZ-413 (3pt) — FT-P-05 + FT-P-06 cross-domain MRE budgets:
- mre_evaluator.py: per-image budget (strict <2.5px) + 95th-percentile
  via numpy linear interp + combined report
- test_ft_p_05_sat_anchor.py: cross-domain scenario, reuses
  accuracy_evaluator for geodesic join
- test_ft_p_06_mre_budgets.py: pure piggyback on FT-P-04 + FT-P-05 CSV
  evidence; skips when either upstream CSV missing

Tests: 325 unit tests pass (+77 vs batch 69).
Reports: batch_70_report.md, batch_70_review.md (PASS).
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-16 18:10:46 +03:00


Batch 70 Report — Test Implementation (cycle 1, batch 4 of test phase)

Batch: 70
Date: 2026-05-16
Context: Test implementation (greenfield Step 10 — Implement Tests)
Tasks: AZ-409 (3pt), AZ-412 (3pt), AZ-413 (3pt) — 9 cp / 3 tasks
Cycle: 1
Verdict: COMPLETE — PASS (self-reviewed; see reviews/batch_70_review.md)

Summary

Three pure-positive scenarios on the same Derkachi + still-image fixtures that AZ-407 / AZ-408 set up. Each follows the now-established batch-69 pattern:

  • A pure-logic helper module under e2e/runner/helpers/ (everything the scenario needs except docker-bound replay + observation).
  • A scenario file under e2e/tests/positive/ parameterized across (fc_adapter, vio_strategy) and skip-gated on upstream helper NotImplementedError (auto-activates when the harness lands).
  • A unit-test file under e2e/_unit_tests/helpers/ that drives the helper directly with synthetic + real-fixture data.
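The skip-gating pattern in the list above can be sketched roughly as follows. This is an illustration only, not the repository's conftest: `replay_image_directory` is a local stand-in stub, `call_or_skip` and `test_scenario` are hypothetical names, and the `fc_adapter` values are placeholders (only the three `vio_strategy` names come from this report).

```python
import pytest

# Placeholder adapter names; the real conftest defines the actual matrix.
FC_ADAPTERS = ["fc_a", "fc_b"]
VIO_STRATEGIES = ["okvis2", "klt_ransac", "vins_mono"]


def replay_image_directory(path):
    """Stand-in for an upstream helper that has not landed yet."""
    raise NotImplementedError("harness not implemented")


def call_or_skip(helper, *args, **kwargs):
    """Run an upstream helper; turn NotImplementedError into a pytest skip."""
    try:
        return helper(*args, **kwargs)
    except NotImplementedError as exc:
        pytest.skip(f"upstream helper not available: {exc}")


@pytest.mark.parametrize("vio_strategy", VIO_STRATEGIES)
@pytest.mark.parametrize("fc_adapter", FC_ADAPTERS)
def test_scenario(fc_adapter, vio_strategy):
    # Skips today; starts running once the stub is replaced by a real helper.
    frames = call_or_skip(replay_image_directory, "images/")
    assert frames is not None
```

Because the gate reacts to the exception rather than to a flag, a scenario written this way needs no edit when the upstream helper is implemented.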

AZ-409 — FT-P-01 still-image frame-center accuracy (3pt)

  • runner/helpers/accuracy_evaluator.py: load_gt_coordinates parses _docs/00_problem/input_data/coordinates.csv; evaluate joins by image_id, computes Vincenty geodesic distance via geo.distance_m, and produces per-image + aggregate reports. The three thresholds are exposed as module constants (PASS_COUNT_50M_REQUIRED=48, PASS_COUNT_20M_REQUIRED=30, TOTAL_IMAGES_REQUIRED=60) so a future spec change has exactly one place to flip. AggregateReport.overall_pass is the boolean the scenario asserts.
  • tests/positive/test_ft_p_01_still_image_accuracy.py — pytest scenario, gated on frame_source_replay.replay_image_directory + sitl_observer.get_observer. Pushes one image at a time with a 5 s per-image timeout; timeouts are recorded as (inf, inf) and propagate to pass_50m=false, pass_20m=false, error_m=inf per AC-4.
  • 20 unit tests in test_accuracy_evaluator.py.
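A minimal sketch of the aggregate pass logic, working from per-image errors that have already been computed. The three constants are the ones quoted above; `AggregateSketch` and `aggregate` are hypothetical names, and the inclusive bounds and the exact-total check are assumptions that this report does not confirm.

```python
import math
from dataclasses import dataclass

# Thresholds quoted in the report.
PASS_COUNT_50M_REQUIRED = 48
PASS_COUNT_20M_REQUIRED = 30
TOTAL_IMAGES_REQUIRED = 60


@dataclass(frozen=True)
class AggregateSketch:
    total: int
    pass_50m: int
    pass_20m: int

    @property
    def overall_pass(self) -> bool:
        # Assumption: the total must match exactly and both counts must be met.
        return (
            self.total == TOTAL_IMAGES_REQUIRED
            and self.pass_50m >= PASS_COUNT_50M_REQUIRED
            and self.pass_20m >= PASS_COUNT_20M_REQUIRED
        )


def aggregate(errors_m):
    """Count images within 50 m and 20 m; math.inf (a timeout) passes neither."""
    # Assumption: the bounds are inclusive (<=).
    return AggregateSketch(
        total=len(errors_m),
        pass_50m=sum(1 for e in errors_m if e <= 50.0),
        pass_20m=sum(1 for e in errors_m if e <= 20.0),
    )
```

Recording a timeout as `math.inf` means it needs no special case: it simply fails both comparisons.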

AZ-412 — FT-P-04 Derkachi frame-to-frame registration ≥95 % (3pt)

  • runner/helpers/registration_classifier.py — derives bank + pitch from SCALED_IMU2 accelerometer (spec-mandated; AC-1 prohibits internal SUT attitude). The classifier expands each 10 Hz IMU row into 3 video-frame indices (30 fps / 10 Hz = 3), classifies each frame as normal iff bank/pitch ∈ ±10° AND inferred prior-frame overlap ≥40 %, then exposes a compute_success_ratio(classifications, registration_success_by_frame) that returns a typed SuccessReport with excluded_by_{attitude,overlap,missing_metric} counts so AC-3 diagnostics survive in the run report.
  • Inferred-overlap heuristic — translation = horizontal velocity × (1/30 s); overlap = 1 - translation / ground_footprint_m clamped to [0, 1]; default ground footprint = 147 m (derived from the camera_info.md values of ~141 m altitude and ~55° HFOV: 2 × 141 m × tan(27.5°) ≈ 147 m). The heuristic is explicitly an upper bound; the docstring records the assumption so a future calibration change has the tunable in one place.
  • tests/positive/test_ft_p_04_derkachi_f2f_registration.py — gated on frame_source_replay, imu_replay, fdr_reader. Reads per-frame registration_success from frame_metric FDR records; emits ft-p-04-{fc_adapter}-{vio_strategy}.csv; asserts AC-2.
  • 26 unit tests in test_registration_classifier.py (including attitude round-trips for ±30° roll/pitch, the reproducibility check on the real first 100 IMU rows, and the boundary ratio cases).
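The classification steps above can be sketched as follows. This is a simplified illustration, not the helper itself: the function names are hypothetical, the accelerometer sign convention (forward-right-down body frame, level vehicle reads (0, 0, -g)) is an assumption, and the inclusive ±10° and ≥40 % comparisons are one reading of the text.

```python
import math

FPS = 30.0
IMU_HZ = 10.0
FRAMES_PER_IMU_ROW = int(FPS / IMU_HZ)  # 3 video frames per IMU row
GROUND_FOOTPRINT_M = 147.0
MAX_TILT_DEG = 10.0
MIN_OVERLAP = 0.40


def attitude_from_accel(ax, ay, az):
    """Return (roll_deg, pitch_deg) from body-frame specific force.

    Assumes a forward-right-down body frame in which a level,
    unaccelerated vehicle reads (0, 0, -g).
    """
    roll = math.atan2(-ay, -az)
    pitch = math.atan2(ax, math.hypot(ay, az))
    return math.degrees(roll), math.degrees(pitch)


def inferred_overlap(speed_mps, footprint_m=GROUND_FOOTPRINT_M):
    """Overlap with the previous frame, clamped to [0, 1]."""
    translation_m = speed_mps / FPS
    return min(1.0, max(0.0, 1.0 - translation_m / footprint_m))


def classify_row(row_index, ax, ay, az, speed_mps):
    """Map one IMU row to {frame_index: is_normal} for its video frames."""
    roll, pitch = attitude_from_accel(ax, ay, az)
    normal = (
        abs(roll) <= MAX_TILT_DEG
        and abs(pitch) <= MAX_TILT_DEG
        and inferred_overlap(speed_mps) >= MIN_OVERLAP
    )
    first = row_index * FRAMES_PER_IMU_ROW
    return {first + k: normal for k in range(FRAMES_PER_IMU_ROW)}


def success_ratio(classifications, registration_success_by_frame):
    """Success ratio over normal frames that have a metric; None if there are none."""
    considered = [
        registration_success_by_frame[f]
        for f, normal in classifications.items()
        if normal and f in registration_success_by_frame
    ]
    return sum(considered) / len(considered) if considered else None
```

Frames that are not normal, or that have no metric, drop out of the denominator instead of counting as failures, which is what keeps sharp-turn segments from dragging the ratio down.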

AZ-413 — FT-P-05 + FT-P-06 cross-domain MRE budgets (3pt)

  • runner/helpers/mre_evaluator.py — three independent reports: PerImageBudgetReport (FT-P-05 AC-2: every MRE < 2.5 px, strict <), P95Report (single-domain p95 < budget), CombinedP95Report (FT-P-06 AC-4: both domains pass). The 95th percentile uses numpy.percentile(..., method='linear') — exactly what the spec mandates. load_frame_to_frame_csv raises ValueError if the FT-P-04 CSV lacks an mre_px column (forces the failure to surface at the SUT-contract layer rather than silently passing).
  • tests/positive/test_ft_p_05_sat_anchor.py — gated scenario that pushes the 60 images, joins MRE with GT-error via accuracy_evaluator.evaluate, emits ft-p-05.csv, asserts AC-2 + AC-3.
  • tests/positive/test_ft_p_06_mre_budgets.py — pure piggyback that reads ft-p-04-*.csv + ft-p-05-*.csv from the same run and asserts AC-4. Skips (does NOT fail) if either upstream CSV is missing — that failure mode is the FT-P-04 / FT-P-05 scenario's responsibility.
  • 22 unit tests in test_mre_evaluator.py.
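A rough sketch of the three MRE checks described above, assuming plain sequences of MRE values in pixels. The function names are hypothetical, the p95 budgets are left as parameters because this report does not state them, and treating an empty input as a failure is an assumption.

```python
import numpy as np

PER_IMAGE_BUDGET_PX = 2.5


def per_image_budget_pass(mre_px):
    """True only if every MRE is strictly below 2.5 px (2.5 px itself fails)."""
    values = np.asarray(mre_px, dtype=float)
    # Assumption: an empty input is treated as a failure rather than a pass.
    return bool(values.size > 0 and np.all(values < PER_IMAGE_BUDGET_PX))


def p95(mre_px):
    """95th percentile with linear interpolation between order statistics."""
    values = np.asarray(mre_px, dtype=float)
    return float(np.percentile(values, 95, method="linear"))


def combined_pass(f2f_mre_px, sat_mre_px, f2f_budget_px, sat_budget_px):
    """Both domains must have a p95 strictly below their own budget."""
    return p95(f2f_mre_px) < f2f_budget_px and p95(sat_mre_px) < sat_budget_px
```

The `method` keyword of `numpy.percentile` requires NumPy 1.22 or later; older releases expose the same behaviour through the `interpolation` keyword.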

Files added / modified

Added (10)

AZ-409:

  • e2e/runner/helpers/accuracy_evaluator.py
  • e2e/tests/positive/test_ft_p_01_still_image_accuracy.py
  • e2e/_unit_tests/helpers/test_accuracy_evaluator.py

AZ-412:

  • e2e/runner/helpers/registration_classifier.py
  • e2e/tests/positive/test_ft_p_04_derkachi_f2f_registration.py
  • e2e/_unit_tests/helpers/test_registration_classifier.py

AZ-413:

  • e2e/runner/helpers/mre_evaluator.py
  • e2e/tests/positive/test_ft_p_05_sat_anchor.py
  • e2e/tests/positive/test_ft_p_06_mre_budgets.py
  • e2e/_unit_tests/helpers/test_mre_evaluator.py

Modified (2)

  • e2e/_unit_tests/test_directory_layout.py — added 3 helper paths and 4 scenario paths (the FT-P-01/04/05/06 scenarios; FT-P-02 + FT-P-03/14 were added in batch 69).
  • _docs/_autodev_state.md — batch 70 pointer.

Spec / module-layout drift notes

  • AZ-409 AC-5 says "four times" (the 4-variant matrix); the conftest currently parameterises (fc_adapter, vio_strategy) as 2 × 3 = 6 variants (vins_mono was added in AZ-406 alongside okvis2 and klt_ransac). AC-5 reads "the conftest's (fc_adapter, vio_strategy) parameterization" first, with the 4-variant list as an example — so the conftest is authoritative. No code change needed; flagged here so the audit trail sees the discrepancy.
  • AZ-412 / AZ-413 same observation — both ACs say "per parameterization" without pinning a count; the conftest's 6-variant matrix is what runs.
  • AZ-412 attitude convention — the helper docstring records the Z-down + accel-decomposition assumption explicitly (the SCALED_IMU2 wire format doesn't ship attitude). Roll/pitch ±30° round-trips are tested to confirm the decomposition.
  • AZ-412 ground footprint — default 147 m is derived from camera_info.md (~141 m alt, ~55° HFOV). Recorded as a module constant + classifier kwarg so a future re-calibration touches one place.
  • AZ-413 strict < boundary — AC-2 says "MRE < 2.5 px"; the helper uses < (not ≤), and the unit test test_evaluate_per_image_budget_single_fail_fails_overall proves a 2.5 px reading FAILS. Removes the boundary ambiguity.
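The 147 m default noted above is reproduced by the usual flat-ground footprint relation for a nadir-pointing camera, footprint = 2 · altitude · tan(HFOV / 2). The snippet below is a worked check under that assumption; `ground_footprint_m` is a hypothetical name, not the module's constant or API.

```python
import math


def ground_footprint_m(altitude_m, hfov_deg):
    """Horizontal ground footprint of a nadir-pointing camera over flat terrain."""
    return 2.0 * altitude_m * math.tan(math.radians(hfov_deg) / 2.0)


# Values quoted from camera_info.md in this report: ~141 m altitude, ~55 deg HFOV.
footprint = ground_footprint_m(141.0, 55.0)
print(f"{footprint:.1f} m")
```

With these inputs the result rounds to 147 m, so re-deriving the default after a calibration change is a one-line calculation.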

Test Results

Focused tests (Step 6.4)

pytest e2e/_unit_tests/ reported 325 passed in 172.07s (was 248 at end of batch 69; +77 new tests across this batch).

Breakdown of new tests:

  • AZ-409 — 20 tests
  • AZ-412 — 26 tests
  • AZ-413 — 22 tests
  • AZ-409/412/413 directory_layout entries — 9 new parametrize cases

Scenario collection: 6 scenario files × parametrize matrix yields 42 collected items in e2e/tests/positive/ (all 4 new scenario files plus the 2 from batch 69). Every scenario file remains correctly skip-gated; no premature activation.

No full-project pytest run

Per the implement skill's Test-Run Cadence, Step 16 owns the only full-project suite invocation; batches run focused tests only.

AC Test Coverage

See reviews/batch_70_review.md for the per-AC traceability table. In summary: every unit-testable AC is covered; every runtime-only AC (end-to-end harness loop) is documented as gated and auto-activating when the upstream helpers land.

Code Review Verdict

Self-reviewed — PASS. See reviews/batch_70_review.md for the full sweep (no Critical / High / Medium / Low findings).

Auto-Fix Attempts

  1. No code-review failures — auto-fix gate was not entered.

Stuck Agents

None.

Deferred follow-ups

Unchanged from batch 69 (same list, same owners):

  • runner.helpers.frame_source_replay.FrameSourceReplayer.{replay_video, replay_image_directory} — owned by AZ-441.
  • runner.helpers.fdr_reader.iter_records — owned by AZ-441.
  • runner.helpers.imu_replay.ImuReplayer.replay — owned by AZ-407 per scaffold docstring (not landed yet).
  • runner.helpers.sitl_observer.get_observer — owned by AZ-416 / AZ-417.
  • runner.helpers.mavproxy_tlog_reader.iter_messages — owned by AZ-416.

This batch did not introduce any new debt.

Next Batch

Batch 71 candidate set (all are 3pt scenario tasks unblocked by this batch's helpers + existing AZ-407 fixtures):

  • AZ-414 (FT-P-07 + FT-N-02 — sharp-turn behaviour)
  • AZ-415 (FT-P-08 — multi-segment relocalisation)
  • AZ-418 (FT-P-10 — smoothing lookback) — 3pt

Likely composition: ~9 cp across 3 tasks, same shape as batches 69-70.

The next milestone after batches 71-72 will be the K=3 cumulative review covering batches 70, 71, 72 (the current last_cumulative_review is batches_67-69).