[AZ-409] [AZ-412] [AZ-413] Batch 70: FT-P-01/04/05/06 scenarios

AZ-409 (3pt) — FT-P-01 still-image frame-center accuracy:
- accuracy_evaluator.py: GT loader + Vincenty error + AC-2/AC-3 pass-counts
- test_ft_p_01_still_image_accuracy.py: scenario gated on frame_source_replay + sitl_observer NotImplementedError; AC-4 timeout discipline

AZ-412 (3pt) — FT-P-04 Derkachi f2f registration >=95% on normal segments:
- registration_classifier.py: accel-derived attitude + overlap heuristic + success ratio with AC-3 sharp-turn exclusion
- test_ft_p_04_derkachi_f2f_registration.py: scenario gated on frame_source_replay + imu_replay + fdr_reader

AZ-413 (3pt) — FT-P-05 + FT-P-06 cross-domain MRE budgets:
- mre_evaluator.py: per-image budget (strict <2.5px) + 95th-percentile via numpy linear interp + combined report
- test_ft_p_05_sat_anchor.py: cross-domain scenario, reuses accuracy_evaluator for geodesic join
- test_ft_p_06_mre_budgets.py: pure piggyback on FT-P-04 + FT-P-05 CSV evidence; skips when either upstream CSV missing

Tests: 325 unit tests pass (+77 vs batch 69).
Reports: batch_70_report.md, batch_70_review.md (PASS).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 22:31:12 +00:00 · 2026-05-16 18:10:46 +03:00
parent 702a0c0ff3
commit 29ac16cfcb
17 changed files with 3013 additions and 1 deletions
@@ -0,0 +1,88 @@
+# FT-P-01 — Still-image satellite-anchor frame-center accuracy
+
+**Task**: AZ-409_ft_p_01_still_image_accuracy
+**Name**: FT-P-01 still-image set-60 frame-center accuracy (AC-1.1, AC-1.2)
+**Description**: Implement the FT-P-01 blackbox scenario — push 60 still images one at a time, observe outbound `GPS_INPUT` (AP) / `MSP2_SENSOR_GPS` (iNav) at SITL, compute Vincenty geodesic distance to GT, assert pass-count rules from `expected_results/results_report.md`.
+**Complexity**: 3 points
+**Dependencies**: AZ-406, AZ-407
+**Component**: Blackbox Tests / Positive (epic AZ-262)
+**Tracker**: AZ-409
+**Epic**: AZ-262 (E-BBT)
+
+## Problem
+
+The canonical "frame-center geolocation" pipeline accuracy must be validated against the 60-image GT set under the AC-1.1 / AC-1.2 budgets — without this scenario the project has no honest measurement of its satellite-anchor accuracy on still images.
+
+## Outcome
+
+- A pytest scenario at `e2e/tests/positive/test_ft_p_01_still_image_accuracy.py` parameterized by `(fc_adapter, vio_strategy)`.
+- Test pushes each `AD0000NN.jpg` to the SUT's frame source, waits up to 5 s for the corresponding outbound message at the SITL listener, computes Vincenty distance to the `coordinates.csv` GT row.
+- Test emits `e2e-results/run-${RUN_ID}/ft-p-01.csv` (one row per image: `image_id, gt_lat, gt_lon, est_lat, est_lon, error_m, pass_50m, pass_20m`).
+- Aggregate pass criteria: `pass_count(error≤50m) ≥ 48` AND `pass_count(error≤20m) ≥ 30`.
+
+## Scope
+
+### Included
+- pytest test method per `(fc_adapter, vio_strategy)` parameterization.
+- Frame-source-replay helper invocation (one image at a time).
+- SITL observer for AP `GPS_INPUT` reception AND iNav `MSP2_SENSOR_GPS` reception.
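+- Vincenty geodesic computation (use `geopy` or `scipy` per the runner's pinned deps; note that geopy 2.x ships only the Karney-based `geodesic` and `scipy` has no geodesic routine, so a local Vincenty implementation may be required).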
+- Per-image CSV emission for evidence; aggregate assertion.
+
+### Excluded
+- The 60 images themselves — bind-mounted from `_docs/00_problem/input_data/`.
+- The tile-cache-fixture build — owned by AZ-407.
+- Per-image MRE (cross-domain matcher quality) — owned by FT-P-05 (AZ-413).
+- Schema validation of the outbound message — owned by FT-P-03 (AZ-411).
+
+## Acceptance Criteria
+
+**AC-1: per-image distance computed**
+Given the SUT cold-started with `tile-cache-fixture` mounted
+When the test pushes each `AD0000NN.jpg` and reads the next outbound message at SITL within 5 s
+Then a per-image error row is recorded in `ft-p-01.csv` for all 60 images.
+
+**AC-2: pass-count rule (50 m budget)**
+Given all 60 per-image distances
+Then `pass_count(error_m ≤ 50) ≥ 48` (i.e. ≥80 % of 60 images per AC-1.1).
+
+**AC-3: pass-count rule (20 m budget)**
+Given all 60 per-image distances
+Then `pass_count(error_m ≤ 20) ≥ 30` (i.e. ≥50 % of 60 images per AC-1.2).
+
+**AC-4: timeout discipline**
+Given an image is pushed
+When no outbound message arrives at SITL within 5 s
+Then the test records `error_m=∞`, `pass_50m=False`, `pass_20m=False` for that image; the run continues. Final aggregate may still PASS if the total pass counts meet the thresholds.
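The row and aggregate rules in AC-2 through AC-4 can be sketched as below. This is a minimal illustration, not the contents of `accuracy_evaluator.py`: the names `ImageResult`, `score_image`, and `aggregate_passes` are hypothetical, and a timeout is assumed to be represented by `None`.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ImageResult:
    """One evidence row (hypothetical container; column names follow ft-p-01.csv)."""
    image_id: str
    error_m: float
    pass_50m: bool
    pass_20m: bool


def score_image(image_id: str, error_m: Optional[float]) -> ImageResult:
    """Score one image; error_m=None is assumed to mean the 5 s timeout expired."""
    if error_m is None:
        error_m = math.inf  # AC-4: a timeout is recorded as an infinite error
    # inf <= 50.0 is False, so timeouts fail both budgets automatically.
    return ImageResult(image_id, error_m, error_m <= 50.0, error_m <= 20.0)


def aggregate_passes(results: Iterable[ImageResult]) -> bool:
    """AC-2 and AC-3: at least 48 images within 50 m and at least 30 within 20 m."""
    rows = list(results)
    within_50 = sum(r.pass_50m for r in rows)
    within_20 = sum(r.pass_20m for r in rows)
    return within_50 >= 48 and within_20 >= 30
```

Because a timeout only contributes a failing row, a run with up to 12 timeouts can still pass if every other image meets the budgets.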
+
+**AC-5: parameterization**
+Given the conftest's `(fc_adapter, vio_strategy)` parameterization
+Then the scenario runs four times by default — `(ardupilot, okvis2)`, `(ardupilot, klt_ransac)`, `(inav, okvis2)`, `(inav, klt_ransac)` — and emits one CSV row per parameterization in `report.csv`.
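The four default combinations in AC-5 are the Cartesian product of two adapters and two strategies. The sketch below only builds that matrix; the constant and function names are hypothetical, and how the real conftest feeds it to pytest is not shown in this spec.

```python
from itertools import product
from typing import List, Tuple

# Values taken from AC-5; the constant names themselves are illustrative.
FC_ADAPTERS = ("ardupilot", "inav")
VIO_STRATEGIES = ("okvis2", "klt_ransac")


def parameter_matrix() -> List[Tuple[str, str]]:
    """Return every (fc_adapter, vio_strategy) pair in AC-5 order."""
    return list(product(FC_ADAPTERS, VIO_STRATEGIES))


def param_id(fc_adapter: str, vio_strategy: str) -> str:
    """Readable test id, e.g. for the ids= argument of a parametrize call."""
    return f"{fc_adapter}-{vio_strategy}"
```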
+
+## System Under Test Boundary
+
+This is an end-to-end scenario. The test ONLY interacts with the SUT through public boundaries.
+
+- **Allowed inputs**: frame-source push, FC inbound IMU stream replay (passive — no modification by the test).
+- **Allowed observation**: SITL-side `GPS_INPUT` / `MSP2_SENSOR_GPS` receipt; FDR archive post-run if needed for diagnostic evidence.
+- **Forbidden**: importing any SUT module, monkeypatching internal SUT state, stubbing C1-C5 components or C6 tile cache or C7 inference runtime, reading SUT internal Python/C++ buffers.
+- **External-system stubs allowed**: `ardupilot-plane-sitl`, `inav-sitl` (real SITLs — observe only), the `tile-cache-fixture` mount (a public on-disk schema), the FDR filesystem.
+- If a SUT module isn't yet implemented, the scenario MUST fail/block as missing product implementation; it must NOT pass by replacing that module with a test stub.
+
+## Constraints
+
+- Reference data: `_docs/00_problem/input_data/expected_results/results_report.md` (the rule) + `expected_results/position_accuracy.csv` (per-image budget flags).
+- Geodesic comparison: Vincenty (NOT haversine) — required by `expected-results.md` `numeric_tolerance` semantics.
+- Per-image CSV must include the image_id used in `coordinates.csv` (1-based `AD000001` style).
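For reference, the Vincenty inverse solution on the WGS-84 ellipsoid can be written with the standard library alone. This is a sketch of the textbook iteration, not the code in `accuracy_evaluator.py`; the function name, iteration cap, and tolerance are assumptions.

```python
import math

# WGS-84 ellipsoid constants.
WGS84_A = 6378137.0                 # semi-major axis, metres
WGS84_F = 1 / 298.257223563         # flattening
WGS84_B = (1 - WGS84_F) * WGS84_A   # semi-minor axis, metres


def vincenty_distance_m(lat1, lon1, lat2, lon2, max_iter=200, tol=1e-12):
    """Geodesic distance in metres between two lat/lon points given in degrees."""
    if lat1 == lat2 and lon1 == lon2:
        return 0.0
    # Reduced latitudes.
    u1 = math.atan((1 - WGS84_F) * math.tan(math.radians(lat1)))
    u2 = math.atan((1 - WGS84_F) * math.tan(math.radians(lat2)))
    sin_u1, cos_u1 = math.sin(u1), math.cos(u1)
    sin_u2, cos_u2 = math.sin(u2), math.cos(u2)
    big_l = math.radians(lon2 - lon1)
    lam = big_l
    for _ in range(max_iter):
        sin_lam, cos_lam = math.sin(lam), math.cos(lam)
        sin_sigma = math.hypot(cos_u2 * sin_lam,
                               cos_u1 * sin_u2 - sin_u1 * cos_u2 * cos_lam)
        if sin_sigma == 0.0:
            return 0.0  # coincident points
        cos_sigma = sin_u1 * sin_u2 + cos_u1 * cos_u2 * cos_lam
        sigma = math.atan2(sin_sigma, cos_sigma)
        sin_alpha = cos_u1 * cos_u2 * sin_lam / sin_sigma
        cos2_alpha = 1 - sin_alpha ** 2
        # cos2_alpha is zero on an equatorial line; the 2*sigma_m term vanishes.
        cos_2sm = (cos_sigma - 2 * sin_u1 * sin_u2 / cos2_alpha) if cos2_alpha else 0.0
        c = WGS84_F / 16 * cos2_alpha * (4 + WGS84_F * (4 - 3 * cos2_alpha))
        lam_new = big_l + (1 - c) * WGS84_F * sin_alpha * (
            sigma + c * sin_sigma * (cos_2sm + c * cos_sigma * (2 * cos_2sm ** 2 - 1)))
        converged = abs(lam_new - lam) < tol
        lam = lam_new
        if converged:
            break
    else:
        raise ValueError("Vincenty did not converge (near-antipodal points)")
    usq = cos2_alpha * (WGS84_A ** 2 - WGS84_B ** 2) / WGS84_B ** 2
    big_a = 1 + usq / 16384 * (4096 + usq * (-768 + usq * (320 - 175 * usq)))
    big_b = usq / 1024 * (256 + usq * (-128 + usq * (74 - 47 * usq)))
    d_sigma = big_b * sin_sigma * (cos_2sm + big_b / 4 * (
        cos_sigma * (2 * cos_2sm ** 2 - 1)
        - big_b / 6 * cos_2sm * (4 * sin_sigma ** 2 - 3) * (4 * cos_2sm ** 2 - 3)))
    return WGS84_B * big_a * (sigma - d_sigma)
```

At the 20 m and 50 m budgets used here, the iteration converges in a handful of steps; the convergence failure branch only matters for near-antipodal pairs, which cannot occur between an estimate and its GT row in practice.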
+
+## Risks & Mitigation
+
+**Risk: SITL round-trip latency variance can push a passing image past the 5 s timeout**
+- *Mitigation*: the timeout is fixed at 5 s per image, so the worst case is 60 × 5 s = 300 s, which matches the scenario's `Max execution time` budget of 5 min. This scenario records timeouts as failures and continues; if timeout-driven failures dominate, NFT-PERF-01 is the place to investigate latency.
+
+## Document Dependencies
+
+- `_docs/02_document/tests/blackbox-tests.md` § FT-P-01
+- `_docs/00_problem/input_data/expected_results/results_report.md` (Pass/Fail Rules)
+- `_docs/00_problem/input_data/expected_results/position_accuracy.csv` (per-image flags)
+- `_docs/02_document/tests/test-data.md` § Expected Results Mapping § Position accuracy (FT-P-01 row)
@@ -0,0 +1,69 @@
+# FT-P-04 — Derkachi frame-to-frame registration success rate
+
+**Task**: AZ-412_ft_p_04_derkachi_f2f_registration
+**Name**: FT-P-04 frame-to-frame registration ≥95 % on normal Derkachi segments (AC-2.1a)
+**Description**: Implement FT-P-04 — full Derkachi replay; SUT exposes per-frame registration-success metric (via `NAMED_VALUE_FLOAT` or FDR per AC-NEW-3); compute success ratio over normal segments only (nadir ±10° bank/pitch, ≥40 % inferred prior-frame overlap).
+**Complexity**: 3 points
+**Dependencies**: AZ-406, AZ-407
+**Component**: Blackbox Tests / Positive (epic AZ-262)
+**Tracker**: AZ-412
+**Epic**: AZ-262 (E-BBT)
+
+## Problem
+
+The frame-to-frame registration success rate on "normal" segments is a direct measurement of the C1 VIO + C3 matcher quality in nominal conditions. AC-2.1a requires ≥95 %; without this scenario the project has no honest measurement of that rate.
+
+## Outcome
+
+- pytest scenario at `e2e/tests/positive/test_ft_p_04_derkachi_f2f_registration.py`.
+- Replays the full Derkachi fixture; reads per-frame registration-success metric (boolean per frame).
+- "Normal" segment classification: nadir bank/pitch within ±10° (estimated from `SCALED_IMU2`-derived attitude in `data_imu.csv`); ≥40 % inferred prior-frame overlap (heuristic from frame-to-frame translation magnitude).
+- Computes success ratio over normal segments only.
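A possible shape for the "normal" classification is sketched below. It is not the logic in `registration_classifier.py`. Several things are assumptions: the accelerometer is taken to report specific force in forward-right-down body axes (level flight reads roughly `(0, 0, -1 g)`), overlap is approximated as `1 - translation / footprint`, and all function and parameter names are hypothetical.

```python
import math


def accel_to_bank_pitch_deg(xacc, yacc, zacc):
    """Estimate bank and pitch (degrees) from one accelerometer sample.

    Assumes FRD body axes and a specific-force reading, so only the direction
    of the vector matters and any consistent unit (e.g. mG) works.
    """
    bank = math.degrees(math.atan2(-yacc, -zacc))
    pitch = math.degrees(math.atan2(xacc, math.hypot(yacc, zacc)))
    return bank, pitch


def inferred_overlap(translation_m, footprint_m):
    """Heuristic overlap fraction from frame-to-frame translation (assumed model)."""
    if footprint_m <= 0:
        raise ValueError("footprint_m must be positive")
    return max(0.0, 1.0 - abs(translation_m) / footprint_m)


def is_normal_frame(xacc, yacc, zacc, translation_m, footprint_m,
                    max_tilt_deg=10.0, min_overlap=0.40):
    """True when bank and pitch are within ±10° and overlap is at least 40 %."""
    bank, pitch = accel_to_bank_pitch_deg(xacc, yacc, zacc)
    tilt_ok = abs(bank) <= max_tilt_deg and abs(pitch) <= max_tilt_deg
    return tilt_ok and inferred_overlap(translation_m, footprint_m) >= min_overlap
```

One caveat with any accelerometer-only estimate: in a coordinated turn the specific force stays aligned with the body z axis, so the estimated bank stays near zero even when the true bank is large.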
+
+## Scope
+
+### Included
+- Full-replay test method.
+- Normal-segment derivation from `data_imu.csv` (the test computes attitude from `SCALED_IMU2` per AC-2.1a).
+- Per-frame registration-success metric ingestion (via `NAMED_VALUE_FLOAT` listener OR post-run FDR read).
+- Success-ratio computation + assertion.
+
+### Excluded
+- Sharp-turn segments — exercised separately by FT-N-02 (AZ-414) and explicitly excluded from this denominator.
+- MRE budgets — owned by FT-P-05 / FT-P-06 (AZ-413).
+- Cross-domain (UAV → satellite) success — owned by FT-P-05.
+
+## Acceptance Criteria
+
+**AC-1: normal-segment classification reproducibility**
+Given the same Derkachi `data_imu.csv`
+Then the same set of frames is classified "normal" across two runs of the test.
+
+**AC-2: success ratio meets AC-2.1a budget**
+Given the SUT exposes per-frame registration-success
+Then `success_ratio_over_normal_segments ≥ 0.95`.
+
+**AC-3: success-ratio computation excludes sharp-turn frames**
+Given the per-frame attitude exceeds ±10° bank or pitch (sharp-turn region)
+Then those frames are excluded from the denominator; the test reports `excluded_frame_count` separately for diagnostic clarity.
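The denominator rule in AC-2 and AC-3 can be illustrated as follows. This is a sketch under the assumption that each frame has already been reduced to an `(is_normal, registered)` pair; `RatioReport` and `success_ratio_over_normal` are hypothetical names.

```python
from typing import Iterable, NamedTuple, Tuple


class RatioReport(NamedTuple):
    success_ratio: float
    normal_frame_count: int
    excluded_frame_count: int


def success_ratio_over_normal(frames: Iterable[Tuple[bool, bool]]) -> RatioReport:
    """Compute the registration success ratio over normal frames only.

    frames yields (is_normal, registered) pairs; non-normal frames are counted
    in excluded_frame_count and never enter the denominator.
    """
    normal = excluded = successes = 0
    for is_normal, registered in frames:
        if not is_normal:
            excluded += 1
            continue
        normal += 1
        successes += bool(registered)
    if normal == 0:
        # An empty denominator is an error rather than a silent pass.
        raise ValueError("no normal frames to evaluate")
    return RatioReport(successes / normal, normal, excluded)
```

Raising on an empty denominator is a design choice made for this sketch: a replay in which every frame is excluded should not be able to satisfy the ≥ 0.95 assertion vacuously.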
+
+**AC-4: parameterization**
+Given the conftest's `(fc_adapter, vio_strategy)` parameterization
+Then the scenario runs once per parameterization.
+
+## System Under Test Boundary
+
+End-to-end through public boundaries.
+
+- **Allowed**: per-frame metric exposure via `NAMED_VALUE_FLOAT` (a public MAVLink message) OR via post-run FDR archive read.
+- **Forbidden**: importing C1 VIO internal state, monkeypatching C3 matcher pass/fail return, stubbing C2 retrieval to force successes.
+
+## Constraints
+
+- The per-frame registration-success metric is part of the SUT's documented public output (per AC-NEW-3 FDR schema). If it isn't there yet, the scenario fails — it does NOT compensate by inferring from another signal.
+- Normal-segment derivation uses ONLY `data_imu.csv`, not internal SUT state.
+
+## Document Dependencies
+
+- `_docs/02_document/tests/blackbox-tests.md` § FT-P-04
+- `_docs/02_document/tests/test-data.md` § Image processing quality (FT-P-04 row)
@@ -0,0 +1,72 @@
+# FT-P-05 + FT-P-06 — Satellite-anchor cross-domain registration + MRE budgets
+
+**Task**: AZ-413_ft_p_05_06_sat_anchor_mre
+**Name**: Cross-domain matcher registration + Mean Reprojection Error budgets (AC-2.1b, AC-2.2)
+**Description**: Implement FT-P-05 (satellite-anchor cross-domain registration with MRE < 2.5 px and accuracy budget) and FT-P-06 (frame-to-frame MRE < 1.0 px and cross-domain MRE < 2.5 px at 95th percentile). FT-P-06 piggybacks on FT-P-04 + FT-P-05 runs and only adds the MRE 95th-percentile assertion.
+**Complexity**: 3 points
+**Dependencies**: AZ-406, AZ-407, AZ-412 (FT-P-06 reuses FT-P-04 MRE collection)
+**Component**: Blackbox Tests / Positive (epic AZ-262)
+**Tracker**: AZ-413
+**Epic**: AZ-262 (E-BBT)
+
+## Problem
+
+Two ACs bind matcher quality: AC-2.1b (cross-domain, UAV → satellite tile, registration succeeds with MRE in budget) and AC-2.2 (95th-percentile MRE < 1.0 px frame-to-frame, < 2.5 px cross-domain). Both are direct measurements of C3 / C3.5 quality and are required to validate the matcher choice.
+
+## Outcome
+
+- pytest scenarios at `e2e/tests/positive/test_ft_p_05_sat_anchor.py` (FT-P-05) and a small piggyback assertion in `e2e/tests/positive/test_ft_p_06_mre_budgets.py` (FT-P-06).
+- FT-P-05: pushes each `still-image-set-60` image; reads per-frame MRE (via `NAMED_VALUE_FLOAT` or FDR); aggregates per-image accuracy AND MRE distribution; asserts MRE < 2.5 px for all images, ≥80 % within 50 m of GT, ≥50 % within 20 m of GT.
+- FT-P-06: depends on the runs of FT-P-04 (frame-to-frame MRE) and FT-P-05 (cross-domain MRE); aggregates by domain; asserts 95th-percentile MRE < 1.0 px frame-to-frame, < 2.5 px cross-domain.
+- CSV evidence: `e2e-results/run-${RUN_ID}/ft-p-05.csv` (one row per image: `image_id, est_lat, est_lon, error_m, mre_px, pass_50m, pass_20m, pass_mre`).
+
+## Scope
+
+### Included
+- FT-P-05 test method.
+- FT-P-06 piggyback method that reads FT-P-04 + FT-P-05 evidence and adds the 95th-percentile assertion.
+- Per-image MRE retrieval via `NAMED_VALUE_FLOAT` or post-run FDR archive read.
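The piggyback step for FT-P-06 amounts to reading an MRE column from two upstream evidence files and declining to run when either is missing. The sketch below assumes both files expose a `mre_px` column (the spec only names that column for `ft-p-05.csv`); the function names and the FT-P-04 evidence file are hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List, Optional


def load_mre_column(csv_path: Path, column: str = "mre_px") -> Optional[List[float]]:
    """Return the MRE values from one evidence CSV, or None if the file is absent."""
    if not csv_path.is_file():
        return None
    with csv_path.open(newline="") as handle:
        return [float(row[column]) for row in csv.DictReader(handle)]


def collect_evidence(f2f_csv: Path, cross_csv: Path) -> Optional[Dict[str, List[float]]]:
    """Gather both MRE distributions; None signals that the caller should skip."""
    frame_to_frame = load_mre_column(f2f_csv)
    cross_domain = load_mre_column(cross_csv)
    if frame_to_frame is None or cross_domain is None:
        return None
    return {"frame_to_frame": frame_to_frame, "cross_domain": cross_domain}
```

A test built on this would skip when `collect_evidence` returns `None` and otherwise apply the percentile assertion to each list.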
+
+### Excluded
+- The 60-image accuracy-only assertion (AC-1.1, AC-1.2) — owned by FT-P-01 (AZ-409).
+- The Derkachi frame-to-frame success ratio (AC-2.1a) — owned by FT-P-04.
+- C3.5 conditional-refiner-specific assertions — out of scope; AC-2.1b is on the matcher pipeline as a whole.
+
+## Acceptance Criteria
+
+**AC-1: per-image MRE captured**
+Given each `still-image-set-60` image
+Then the test records the per-frame MRE in `ft-p-05.csv`.
+
+**AC-2: cross-domain MRE budget (per image)**
+Given each per-image MRE
+Then `mre_px < 2.5` for all 60 images (per AC-2.1b "all images").
+
+**AC-3: accuracy budget alongside MRE**
+Given the same 60 images
+Then ≥80 % satisfy `error_m ≤ 50` AND ≥50 % satisfy `error_m ≤ 20` (matches FT-P-01 thresholds; this assertion may be loosened to "implied by FT-P-01" if FT-P-01 already passes in the same run).
+
+**AC-4: 95th-percentile MRE budget (FT-P-06)**
+Given FT-P-04 + FT-P-05 evidence
+Then `MRE_95th_percentile_frame_to_frame < 1.0 px` AND `MRE_95th_percentile_cross_domain < 2.5 px`.
+
+**AC-5: parameterization**
+Given the conftest's `(fc_adapter, vio_strategy)` parameterization
+Then both test files run per parameterization.
+
+## System Under Test Boundary
+
+End-to-end through public boundaries.
+
+- **Allowed**: `NAMED_VALUE_FLOAT` MRE exposure OR post-run FDR archive read.
+- **Forbidden**: importing C3 / C3.5 matcher state, stubbing the matcher to force a specific MRE.
+
+## Constraints
+
+- The per-frame MRE must be exposed as a documented public artifact (`NAMED_VALUE_FLOAT` key or FDR field); if absent, the test fails.
+- The 95th percentile is computed by linear interpolation between the two adjacent ranks (numpy's default percentile method) and compared against the budget with a strict `<`.
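The linear-interpolation rule can be spelled out in plain Python. This sketch is intended to mirror what numpy's default percentile method computes, without depending on numpy. It is not the code in `mre_evaluator.py`, and the function names are hypothetical.

```python
import math
from typing import Sequence


def percentile_linear(values: Sequence[float], q: float) -> float:
    """q-th percentile with linear interpolation between adjacent ranks."""
    if not values:
        raise ValueError("cannot take a percentile of an empty sequence")
    ordered = sorted(values)
    # Fractional rank in [0, n - 1]; its neighbours are the two adjacent ranks.
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def p95_within_budget(values: Sequence[float], budget_px: float) -> bool:
    """Strict comparison: a 95th percentile equal to the budget does not pass."""
    return percentile_linear(values, 95.0) < budget_px
```

For example, `percentile_linear([1, 2, 3, 4], 95)` falls at fractional rank 2.85, between the values 3 and 4.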
+
+## Document Dependencies
+
+- `_docs/02_document/tests/blackbox-tests.md` § FT-P-05, § FT-P-06
+- `_docs/02_document/tests/test-data.md` § Image processing quality