[AZ-409] [AZ-412] [AZ-413] Batch 70: FT-P-01/04/05/06 scenarios

AZ-409 (3pt) — FT-P-01 still-image frame-center accuracy:
- accuracy_evaluator.py: GT loader + Vincenty error + AC-2/AC-3 pass-counts
- test_ft_p_01_still_image_accuracy.py: scenario gated on frame_source_replay
  + sitl_observer NotImplementedError; AC-4 timeout discipline

AZ-412 (3pt) — FT-P-04 Derkachi f2f registration >=95% on normal segments:
- registration_classifier.py: accel-derived attitude + overlap heuristic
  + success ratio with AC-3 sharp-turn exclusion
- test_ft_p_04_derkachi_f2f_registration.py: scenario gated on
  frame_source_replay + imu_replay + fdr_reader

AZ-413 (3pt) — FT-P-05 + FT-P-06 cross-domain MRE budgets:
- mre_evaluator.py: per-image budget (strict <2.5px) + 95th-percentile
  via numpy linear interp + combined report
- test_ft_p_05_sat_anchor.py: cross-domain scenario, reuses
  accuracy_evaluator for geodesic join
- test_ft_p_06_mre_budgets.py: pure piggyback on FT-P-04 + FT-P-05 CSV
  evidence; skips when either upstream CSV missing

Tests: 325 unit tests pass (+77 vs batch 69).
Reports: batch_70_report.md, batch_70_review.md (PASS).
Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-16 18:10:46 +03:00
parent 702a0c0ff3
commit 29ac16cfcb
17 changed files with 3013 additions and 1 deletions
@@ -0,0 +1,88 @@
# FT-P-01 — Still-image satellite-anchor frame-center accuracy
**Task**: AZ-409_ft_p_01_still_image_accuracy
**Name**: FT-P-01 still-image set-60 frame-center accuracy (AC-1.1, AC-1.2)
**Description**: Implement the FT-P-01 blackbox scenario — push 60 still images one at a time, observe outbound `GPS_INPUT` (AP) / `MSP2_SENSOR_GPS` (iNav) at SITL, compute Vincenty geodesic distance to GT, assert pass-count rules from `expected_results/results_report.md`.
**Complexity**: 3 points
**Dependencies**: AZ-406, AZ-407
**Component**: Blackbox Tests / Positive (epic AZ-262)
**Tracker**: AZ-409
**Epic**: AZ-262 (E-BBT)
## Problem
The canonical "frame-center geolocation" pipeline accuracy must be validated against the 60-image GT set under the AC-1.1 / AC-1.2 budgets — without this scenario the project has no honest measurement of its satellite-anchor accuracy on still images.
## Outcome
- A pytest scenario at `e2e/tests/positive/test_ft_p_01_still_image_accuracy.py` parameterized by `(fc_adapter, vio_strategy)`.
- Test pushes each `AD0000NN.jpg` to the SUT's frame source, waits up to 5 s for the corresponding outbound message at the SITL listener, computes Vincenty distance to the `coordinates.csv` GT row.
- Test emits `e2e-results/run-${RUN_ID}/ft-p-01.csv` (one row per image: `image_id, gt_lat, gt_lon, est_lat, est_lon, error_m, pass_50m, pass_20m`).
- Aggregate pass criteria: `pass_count(error≤50m) ≥ 48` AND `pass_count(error≤20m) ≥ 30`.
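The per-image evidence file described above could be emitted along these lines. This is a minimal sketch: the column order follows the list in this section, while `ImageResult`, `make_row`, and `write_rows` are hypothetical names and not taken from the repository.

```python
import csv
import io
from dataclasses import astuple, dataclass, fields


@dataclass
class ImageResult:
    """One ft-p-01.csv row; field order mirrors the documented columns."""
    image_id: str
    gt_lat: float
    gt_lon: float
    est_lat: float
    est_lon: float
    error_m: float
    pass_50m: bool
    pass_20m: bool


def make_row(image_id, gt, est, error_m):
    """Derive both budget flags from the geodesic error (hypothetical helper)."""
    return ImageResult(image_id, gt[0], gt[1], est[0], est[1],
                       error_m, error_m <= 50.0, error_m <= 20.0)


def write_rows(rows, stream):
    """Write a header line followed by one CSV line per image."""
    writer = csv.writer(stream)
    writer.writerow([f.name for f in fields(ImageResult)])
    for row in rows:
        writer.writerow(astuple(row))


buffer = io.StringIO()
write_rows([make_row("AD000001", (50.0, 36.0), (50.0001, 36.0001), 12.5)], buffer)
```

In a real run the stream would be the opened `e2e-results/run-${RUN_ID}/ft-p-01.csv` file rather than an in-memory buffer.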
## Scope
### Included
- pytest test method per `(fc_adapter, vio_strategy)` parameterization.
- Frame-source-replay helper invocation (one image at a time).
- SITL observer for AP `GPS_INPUT` reception AND iNav `MSP2_SENSOR_GPS` reception.
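- Vincenty geodesic computation (use a pinned dependency that actually provides Vincenty's inverse formula, or a small in-repo implementation; `geopy` ≥ 2.0 exposes only Karney's `geodesic`, and `scipy` has no ellipsoidal geodesic solver).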
- Per-image CSV emission for evidence; aggregate assertion.
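For reference, Vincenty's inverse formula on the WGS-84 ellipsoid can be sketched in pure Python as follows. The function name and signature are assumptions for illustration; the actual `accuracy_evaluator.py` may be structured differently.

```python
import math

WGS84_A = 6378137.0                # semi-major axis, metres
WGS84_F = 1 / 298.257223563        # flattening
WGS84_B = (1 - WGS84_F) * WGS84_A  # semi-minor axis, metres


def vincenty_distance_m(lat1, lon1, lat2, lon2, tol=1e-12, max_iter=200):
    """Ellipsoidal distance in metres between two WGS-84 points (degrees)."""
    u1 = math.atan((1 - WGS84_F) * math.tan(math.radians(lat1)))
    u2 = math.atan((1 - WGS84_F) * math.tan(math.radians(lat2)))
    big_l = math.radians(lon2 - lon1)
    sin_u1, cos_u1 = math.sin(u1), math.cos(u1)
    sin_u2, cos_u2 = math.sin(u2), math.cos(u2)

    lam = big_l
    for _ in range(max_iter):
        sin_lam, cos_lam = math.sin(lam), math.cos(lam)
        sin_sigma = math.hypot(cos_u2 * sin_lam,
                               cos_u1 * sin_u2 - sin_u1 * cos_u2 * cos_lam)
        if sin_sigma == 0.0:
            return 0.0  # coincident points
        cos_sigma = sin_u1 * sin_u2 + cos_u1 * cos_u2 * cos_lam
        sigma = math.atan2(sin_sigma, cos_sigma)
        sin_alpha = cos_u1 * cos_u2 * sin_lam / sin_sigma
        cos_sq_alpha = 1.0 - sin_alpha ** 2
        # cos_sq_alpha is zero only along the equator, where this term vanishes.
        cos_2sm = (cos_sigma - 2 * sin_u1 * sin_u2 / cos_sq_alpha
                   if cos_sq_alpha else 0.0)
        c = WGS84_F / 16 * cos_sq_alpha * (4 + WGS84_F * (4 - 3 * cos_sq_alpha))
        lam_next = big_l + (1 - c) * WGS84_F * sin_alpha * (
            sigma + c * sin_sigma * (cos_2sm + c * cos_sigma * (2 * cos_2sm ** 2 - 1)))
        converged = abs(lam_next - lam) < tol
        lam = lam_next
        if converged:
            break
    else:
        raise ValueError("Vincenty iteration did not converge (near-antipodal points)")

    u_sq = cos_sq_alpha * (WGS84_A ** 2 - WGS84_B ** 2) / WGS84_B ** 2
    a_coef = 1 + u_sq / 16384 * (4096 + u_sq * (-768 + u_sq * (320 - 175 * u_sq)))
    b_coef = u_sq / 1024 * (256 + u_sq * (-128 + u_sq * (74 - 47 * u_sq)))
    delta_sigma = b_coef * sin_sigma * (cos_2sm + b_coef / 4 * (
        cos_sigma * (2 * cos_2sm ** 2 - 1)
        - b_coef / 6 * cos_2sm * (4 * sin_sigma ** 2 - 3) * (4 * cos_2sm ** 2 - 3)))
    return WGS84_B * a_coef * (sigma - delta_sigma)
```

At the tens-of-metres scale of the 50 m and 20 m budgets the iteration converges within a few steps; the non-convergence branch only matters for near-antipodal pairs, which this scenario should never produce.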
### Excluded
- The 60 images themselves — bind-mounted from `_docs/00_problem/input_data/`.
- The tile-cache-fixture build — owned by AZ-407.
- Per-image MRE (cross-domain matcher quality) — owned by FT-P-05 (AZ-413).
- Schema validation of the outbound message — owned by FT-P-03 (AZ-411).
## Acceptance Criteria
**AC-1: per-image distance computed**
Given the SUT cold-started with `tile-cache-fixture` mounted
When the test pushes each `AD0000NN.jpg` and reads the next outbound message at SITL within 5 s
Then a per-image error row is recorded in `ft-p-01.csv` for all 60 images.
**AC-2: pass-count rule (50 m budget)**
Given all 60 per-image distances
Then `pass_count(error_m ≤ 50) ≥ 48` (i.e. ≥80 % of 60 images per AC-1.1).
**AC-3: pass-count rule (20 m budget)**
Given all 60 per-image distances
Then `pass_count(error_m ≤ 20) ≥ 30` (i.e. ≥50 % of 60 images per AC-1.2).
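The two pass-count rules (AC-2 and AC-3) reduce to a pair of counts over the 60 per-image errors. The sketch below uses hypothetical function names; the thresholds 48 and 30 come from the criteria above.

```python
import math


def pass_counts(errors_m):
    """Return (count within 50 m, count within 20 m) for per-image errors."""
    within_50 = sum(1 for e in errors_m if e <= 50.0)
    within_20 = sum(1 for e in errors_m if e <= 20.0)
    return within_50, within_20


def aggregate_passes(errors_m, min_within_50=48, min_within_20=30):
    """Both budgets must hold; an infinite error never counts as a pass."""
    within_50, within_20 = pass_counts(errors_m)
    return within_50 >= min_within_50 and within_20 >= min_within_20


# 30 images within 20 m, 18 more within 50 m, 12 timeouts: exactly on both thresholds.
borderline = [10.0] * 30 + [40.0] * 18 + [math.inf] * 12
```

Because the rule counts images rather than taking a percentage of received messages, a run with up to 12 missing estimates can still pass, provided every other image lands within budget.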
**AC-4: timeout discipline**
Given an image is pushed
When no outbound message arrives at SITL within 5 s
Then the test records `error_m=∞`, `pass_50m=False`, `pass_20m=False` for that image; the run continues. Final aggregate may still PASS if the total pass counts meet the thresholds.
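One way to express the timeout discipline is to let the SITL observer feed a queue and treat an empty queue after 5 s as a missing estimate. This is a sketch under that assumption: the queue-based observer, `await_estimate`, and `score_image` are hypothetical and not part of the documented harness.

```python
import math
import queue


def await_estimate(inbox, timeout_s=5.0):
    """Return the next outbound estimate, or None if none arrives in time."""
    try:
        return inbox.get(timeout=timeout_s)
    except queue.Empty:
        return None


def score_image(estimate, gt, distance_fn):
    """Score one image; a timeout is recorded as an infinite error, not raised."""
    if estimate is None:
        return {"error_m": math.inf, "pass_50m": False, "pass_20m": False}
    error_m = distance_fn(gt, estimate)
    return {"error_m": error_m,
            "pass_50m": error_m <= 50.0,
            "pass_20m": error_m <= 20.0}


inbox = queue.Queue()
timed_out = score_image(await_estimate(inbox, timeout_s=0.01), (50.0, 36.0),
                        lambda gt, est: 0.0)
```

Returning a row instead of raising keeps the loop over the remaining images running, which is what allows the final aggregate to be evaluated over all 60 rows.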
**AC-5: parameterization**
Given the conftest's `(fc_adapter, vio_strategy)` parameterization
Then the scenario runs four times by default — `(ardupilot, okvis2)`, `(ardupilot, klt_ransac)`, `(inav, okvis2)`, `(inav, klt_ransac)` — and emits one CSV row per parameterization in `report.csv`.
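The four default combinations are the Cartesian product of the two adapter names and the two VIO strategy names listed above. The sketch below only builds the matrix; how the project's conftest actually wires it into pytest is not shown in this document, so the `parametrize` call in the comment is an assumption.

```python
from itertools import product

FC_ADAPTERS = ("ardupilot", "inav")
VIO_STRATEGIES = ("okvis2", "klt_ransac")

# Cartesian product in the same order as the AC-5 list.
PARAMS = list(product(FC_ADAPTERS, VIO_STRATEGIES))
IDS = [f"{fc}-{vio}" for fc, vio in PARAMS]

# A conftest could feed this to pytest, for example:
#   @pytest.mark.parametrize("fc_adapter,vio_strategy", PARAMS, ids=IDS)
```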
## System Under Test Boundary
This is an end-to-end scenario. The test ONLY interacts with the SUT through public boundaries.
- **Allowed inputs**: frame-source push, FC inbound IMU stream replay (passive — no modification by the test).
- **Allowed observation**: SITL-side `GPS_INPUT` / `MSP2_SENSOR_GPS` receipt; FDR archive post-run if needed for diagnostic evidence.
- **Forbidden**: importing any SUT module, monkeypatching internal SUT state, stubbing C1-C5 components or C6 tile cache or C7 inference runtime, reading SUT internal Python/C++ buffers.
- **External-system stubs allowed**: `ardupilot-plane-sitl`, `inav-sitl` (real SITLs — observe only), the `tile-cache-fixture` mount (a public on-disk schema), the FDR filesystem.
- If a SUT module isn't yet implemented, the scenario MUST fail/block as missing product implementation; it must NOT pass by replacing that module with a test stub.
## Constraints
- Reference data: `_docs/00_problem/input_data/expected_results/results_report.md` (the rule) + `expected_results/position_accuracy.csv` (per-image budget flags).
- Geodesic comparison: Vincenty (NOT haversine) — required by `expected-results.md` `numeric_tolerance` semantics.
- Per-image CSV must include the image_id used in `coordinates.csv` (1-based `AD000001` style).
## Risks & Mitigation
**Risk: SITL round-trip latency variance can push a passing image past the 5 s timeout**
- *Mitigation*: timeout is 5 s per image (matches scenario's `Max execution time` budget of 5 min for 60 images). If timeout-driven failures dominate, NFT-PERF-01 is the place to investigate latency; this scenario records timeouts as failures and continues.
## Document Dependencies
- `_docs/02_document/tests/blackbox-tests.md` § FT-P-01
- `_docs/00_problem/input_data/expected_results/results_report.md` (Pass/Fail Rules)
- `_docs/00_problem/input_data/expected_results/position_accuracy.csv` (per-image flags)
- `_docs/02_document/tests/test-data.md` § Expected Results Mapping § Position accuracy (FT-P-01 row)
@@ -0,0 +1,69 @@
# FT-P-04 — Derkachi frame-to-frame registration success rate
**Task**: AZ-412_ft_p_04_derkachi_f2f_registration
**Name**: FT-P-04 frame-to-frame registration ≥95 % on normal Derkachi segments (AC-2.1a)
**Description**: Implement FT-P-04 — full Derkachi replay; SUT exposes per-frame registration-success metric (via `NAMED_VALUE_FLOAT` or FDR per AC-NEW-3); compute success ratio over normal segments only (nadir ±10° bank/pitch, ≥40 % inferred prior-frame overlap).
**Complexity**: 3 points
**Dependencies**: AZ-406, AZ-407
**Component**: Blackbox Tests / Positive (epic AZ-262)
**Tracker**: AZ-412
**Epic**: AZ-262 (E-BBT)
## Problem
The frame-to-frame registration success rate on "normal" segments is a direct measurement of the C1 VIO + C3 matcher quality in nominal conditions. AC-2.1a requires ≥95 % — without this scenario the project has no honest measurement.
## Outcome
- pytest scenario at `e2e/tests/positive/test_ft_p_04_derkachi_f2f_registration.py`.
- Replays the full Derkachi fixture; reads per-frame registration-success metric (boolean per frame).
- "Normal" segment classification: nadir bank/pitch within ±10° (estimated from `SCALED_IMU2`-derived attitude in `data_imu.csv`); ≥40 % inferred prior-frame overlap (heuristic from frame-to-frame translation magnitude).
- Computes success ratio over normal segments only.
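The normal-segment rule can be sketched as below. Several things here are assumptions rather than facts from the spec: the accelerometer is taken to be in forward-right-down body axes and to read roughly `(0, 0, -1000)` mG when level, the overlap heuristic is a simple linear model of translation against ground footprint, and all function names are hypothetical. Accelerometer-only attitude is also only a gravity-direction estimate, so it degrades during sustained manoeuvres.

```python
import math


def accel_to_roll_pitch_deg(xacc, yacc, zacc):
    """Gravity-direction roll/pitch in degrees (assumed FRD axes, level = (0, 0, -g))."""
    roll = math.atan2(-yacc, -zacc)
    pitch = math.atan2(xacc, math.hypot(yacc, zacc))
    return math.degrees(roll), math.degrees(pitch)


def overlap_from_translation(translation_m, footprint_m):
    """Hypothetical heuristic: overlap shrinks linearly with frame-to-frame travel."""
    return max(0.0, 1.0 - abs(translation_m) / footprint_m)


def is_normal_frame(xacc, yacc, zacc, overlap_fraction,
                    max_tilt_deg=10.0, min_overlap=0.40):
    """Normal means bank and pitch within the tilt limit and enough overlap."""
    roll, pitch = accel_to_roll_pitch_deg(xacc, yacc, zacc)
    return (abs(roll) <= max_tilt_deg
            and abs(pitch) <= max_tilt_deg
            and overlap_fraction >= min_overlap)


# A 20 degree bank under the assumed sign convention, in mG.
banked = (0.0, -1000.0 * math.sin(math.radians(20.0)),
          -1000.0 * math.cos(math.radians(20.0)))
```

Because the classification is a pure function of `data_imu.csv` rows and the translation estimate, two runs over the same fixture yield the same frame set.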
## Scope
### Included
- Full-replay test method.
- Normal-segment derivation from `data_imu.csv` (the test computes attitude from `SCALED_IMU2` per AC-2.1a).
- Per-frame registration-success metric ingestion (via `NAMED_VALUE_FLOAT` listener OR post-run FDR read).
- Success-ratio computation + assertion.
### Excluded
- Sharp-turn segments — exercised separately by FT-N-02 (AZ-414) and explicitly excluded from this denominator.
- MRE budgets — owned by FT-P-05 / FT-P-06 (AZ-413).
- Cross-domain (UAV → satellite) success — owned by FT-P-05.
## Acceptance Criteria
**AC-1: normal-segment classification reproducibility**
Given the same Derkachi `data_imu.csv`
Then the same set of frames is classified "normal" across two runs of the test.
**AC-2: success ratio meets AC-2.1a budget**
Given the SUT exposes per-frame registration-success
Then `success_ratio_over_normal_segments ≥ 0.95`.
**AC-3: success-ratio computation excludes sharp-turn frames**
Given the per-frame attitude exceeds ±10° bank or pitch (sharp-turn region)
Then those frames are excluded from the denominator; the test reports `excluded_frame_count` separately for diagnostic clarity.
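The denominator rule in AC-3 can be illustrated with a short sketch. The input shape, a sequence of `(is_normal, registered)` boolean pairs, and the function name are assumptions made for the example.

```python
def success_ratio(frames):
    """Return (ratio over normal frames, excluded_frame_count).

    frames: sequence of (is_normal, registered) boolean pairs, one per frame.
    """
    frames = list(frames)
    normal = [registered for is_normal, registered in frames if is_normal]
    excluded_frame_count = len(frames) - len(normal)
    if not normal:
        raise ValueError("no normal frames: success ratio is undefined")
    return sum(normal) / len(normal), excluded_frame_count


# 20 normal frames with one failure, plus 5 sharp-turn frames that all failed.
sample = [(True, True)] * 19 + [(True, False)] + [(False, False)] * 5
ratio, excluded = success_ratio(sample)
```

The five failed sharp-turn frames do not lower the ratio, and they stay visible through `excluded_frame_count`.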
**AC-4: parameterization**
Given the conftest's `(fc_adapter, vio_strategy)` parameterization
Then the scenario runs once per parameterization.
## System Under Test Boundary
End-to-end through public boundaries.
- **Allowed**: per-frame metric exposure via `NAMED_VALUE_FLOAT` (a public MAVLink message) OR via post-run FDR archive read.
- **Forbidden**: importing C1 VIO internal state, monkeypatching C3 matcher pass/fail return, stubbing C2 retrieval to force successes.
## Constraints
- The per-frame registration-success metric is part of the SUT's documented public output (per AC-NEW-3 FDR schema). If it isn't there yet, the scenario fails — it does NOT compensate by inferring from another signal.
- Normal-segment derivation uses ONLY `data_imu.csv`, not internal SUT state.
## Document Dependencies
- `_docs/02_document/tests/blackbox-tests.md` § FT-P-04
- `_docs/02_document/tests/test-data.md` § Image processing quality (FT-P-04 row)
@@ -0,0 +1,72 @@
# FT-P-05 + FT-P-06 — Satellite-anchor cross-domain registration + MRE budgets
**Task**: AZ-413_ft_p_05_06_sat_anchor_mre
**Name**: Cross-domain matcher registration + Mean Reprojection Error budgets (AC-2.1b, AC-2.2)
**Description**: Implement FT-P-05 (satellite-anchor cross-domain registration with MRE < 2.5 px and accuracy budget) and FT-P-06 (frame-to-frame MRE < 1.0 px and cross-domain MRE < 2.5 px at 95th percentile). FT-P-06 piggybacks on FT-P-04 + FT-P-05 runs and only adds the MRE 95th-percentile assertion.
**Complexity**: 3 points
**Dependencies**: AZ-406, AZ-407, AZ-412 (FT-P-06 reuses FT-P-04 MRE collection)
**Component**: Blackbox Tests / Positive (epic AZ-262)
**Tracker**: AZ-413
**Epic**: AZ-262 (E-BBT)
## Problem
Two ACs bind the cross-domain matcher (UAV → satellite tile) quality: AC-2.1b (registration succeeds with MRE in budget) and AC-2.2 (95th-percentile MRE < 1.0 px frame-to-frame, < 2.5 px cross-domain). Both are direct measurements of C3 / C3.5 quality and are required to validate the matcher choice.
## Outcome
- pytest scenarios at `e2e/tests/positive/test_ft_p_05_sat_anchor.py` (FT-P-05) and a small piggyback assertion in `e2e/tests/positive/test_ft_p_06_mre_budgets.py` (FT-P-06).
- FT-P-05: pushes each `still-image-set-60` image; reads per-frame MRE (via `NAMED_VALUE_FLOAT` or FDR); aggregates per-image accuracy AND MRE distribution; asserts MRE < 2.5 px for all images, ≥80 % within 50 m of GT, ≥50 % within 20 m of GT.
- FT-P-06: depends on the runs of FT-P-04 (frame-to-frame MRE) and FT-P-05 (cross-domain MRE); aggregates by domain; asserts 95th-percentile MRE < 1.0 px frame-to-frame, < 2.5 px cross-domain.
- CSV evidence: `e2e-results/run-${RUN_ID}/ft-p-05.csv` (one row per image: `image_id, est_lat, est_lon, error_m, mre_px, pass_50m, pass_20m, pass_mre`).
## Scope
### Included
- FT-P-05 test method.
- FT-P-06 piggyback method that reads FT-P-04 + FT-P-05 evidence and adds the 95th-percentile assertion.
- Per-image MRE retrieval via `NAMED_VALUE_FLOAT` or post-run FDR archive read.
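The piggyback step could collect its inputs roughly as follows. The `mre_px` column is documented for `ft-p-05.csv`; the file name `ft-p-04.csv` and its `mre_px` column are assumptions here, as are the helper names. Returning `None` lets the caller decide to skip, for example with `pytest.skip`.

```python
import csv
from pathlib import Path


def load_mre_column(csv_path, column="mre_px"):
    """Return the MRE column as floats, or None when the file does not exist."""
    path = Path(csv_path)
    if not path.is_file():
        return None
    with path.open(newline="") as handle:
        return [float(row[column]) for row in csv.DictReader(handle)]


def collect_upstream_mre(run_dir):
    """Gather both upstream MRE series; None signals that the test should skip."""
    frame_to_frame = load_mre_column(Path(run_dir) / "ft-p-04.csv")  # assumed name
    cross_domain = load_mre_column(Path(run_dir) / "ft-p-05.csv")
    if frame_to_frame is None or cross_domain is None:
        return None
    return {"frame_to_frame": frame_to_frame, "cross_domain": cross_domain}
```

Keeping the two domains in separate lists matches the requirement to aggregate by domain before applying the two different budgets.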
### Excluded
- The 60-image accuracy-only assertion (AC-1.1, AC-1.2) — owned by FT-P-01 (AZ-409).
- The Derkachi frame-to-frame success ratio (AC-2.1a) — owned by FT-P-04.
- C3.5 conditional-refiner-specific assertions — out of scope; AC-2.1b is on the matcher pipeline as a whole.
## Acceptance Criteria
**AC-1: per-image MRE captured**
Given each `still-image-set-60` image
Then the test records the per-frame MRE in `ft-p-05.csv`.
**AC-2: cross-domain MRE budget (per image)**
Given each per-image MRE
Then `mre_px < 2.5` for all 60 images (per AC-2.1b "all images").
**AC-3: accuracy budget alongside MRE**
Given the same 60 images
Then ≥80 % satisfy `error_m ≤ 50` AND ≥50 % satisfy `error_m ≤ 20` (matches FT-P-01 thresholds; this assertion may be loosened to "implied by FT-P-01" if FT-P-01 already passes in the same run).
**AC-4: 95th-percentile MRE budget (FT-P-06)**
Given FT-P-04 + FT-P-05 evidence
Then `MRE_95th_percentile_frame_to_frame < 1.0 px` AND `MRE_95th_percentile_cross_domain < 2.5 px`.
**AC-5: parameterization**
Given the conftest's `(fc_adapter, vio_strategy)` parameterization
Then both test files run per parameterization.
## System Under Test Boundary
End-to-end through public boundaries.
- **Allowed**: `NAMED_VALUE_FLOAT` MRE exposure OR post-run FDR archive read.
- **Forbidden**: importing C3 / C3.5 matcher state, stubbing the matcher to force a specific MRE.
## Constraints
- The per-frame MRE must be exposed as a documented public artifact (NAMED_VALUE_FLOAT key or FDR field) — if absent, the test fails.
- The 95th percentile is computed strictly, using linear interpolation between the two adjacent ranks (NumPy's default percentile method).
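The interpolation rule can be written out without NumPy to make the arithmetic explicit. This is a sketch of the same linear method, with hypothetical function names; the commit message indicates the real evaluator uses NumPy directly.

```python
import math


def percentile_linear(values, q):
    """q-th percentile with linear interpolation between adjacent ranks."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0   # fractional index into the sorted data
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def mre_budgets_ok(frame_to_frame_px, cross_domain_px):
    """Strict comparisons: a percentile exactly on the budget does not pass."""
    return (percentile_linear(frame_to_frame_px, 95) < 1.0
            and percentile_linear(cross_domain_px, 95) < 2.5)
```

For five samples `[1, 2, 3, 4, 5]` the fractional rank is 3.8, so the result lies 80 % of the way from 4 to 5.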
## Document Dependencies
- `_docs/02_document/tests/blackbox-tests.md` § FT-P-05, § FT-P-06
- `_docs/02_document/tests/test-data.md` § Image processing quality