[AZ-699] Real-flight validation runner + Markdown accuracy report

New e2e test runs gps-denied-replay --auto-trim against the real derkachi.tlog + flight video + AZ-702 calibration, computes the horizontal-error distribution (mean/p50/p95/p99 + 10/25/50/100 m threshold-hit share), writes _docs/06_metrics/real_flight_ validation_{date}.md, and asserts honest PASS/FAIL with no @xfail mask. AZ-404's 1-min test is untouched (sibling, not replacement). Extends gps_compare.py with HorizontalErrorDistribution + percentile_sorted (numpy-equivalent linear interpolation). New test helper _report_writer.py renders the canonical Markdown schema documented as FT-P-20 in blackbox-tests.md. 16 new unit tests pin distribution arithmetic, verdict gate, failure-message templating (references calibration acquisition method per AC-3), and report layout. 129 passed in focused regression, 3 skipped (real video / Tier-2 prerequisites). Zero new mypy --strict errors. Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 13:21:13 +00:00 · 2026-05-20 16:53:48 +03:00
parent f5366bbca1
commit dcde602f61
9 changed files with 1261 additions and 2 deletions
@@ -0,0 +1,153 @@
+# Real-flight validation runner + accuracy report
+
+**Task**: AZ-699_real_flight_validation_runner
+**Name**: Run estimator against real Derkachi tlog + video; compute honest accuracy metrics; write report
+**Description**: New e2e test `test_derkachi_real_tlog.py` that feeds the real `derkachi.tlog` (not the synth) into the replay pipeline, compares the JSONL output against the binary-tlog GPS truth (from AZ-697), and writes a structured Markdown accuracy report. Flips AC-3 from `@xfail` to a real PASS/FAIL verdict.
+**Complexity**: 3 points
+**Dependencies**: AZ-697
+**Component**: Blackbox Tests (epic AZ-696)
+**Tracker**: AZ-699
+**Epic**: AZ-696
+
+## Problem
+
+`tests/e2e/replay/test_derkachi_1min.py::test_ac3_within_100m_80pct_of_ticks`
+is permanently `@xfail`. Even when the test runs (Jetson Tier-2), the
+result is hidden — we have no honest measurement of estimator accuracy
+against a real flight. The cycle-1 retrospective (`_docs/06_metrics/retro_2026-05-20.md`)
+flagged this as the highest-impact open verification.
+
+The two contributors:
+1. Synth tlog (compares estimator to itself) — fixed by AZ-697.
+2. Unknown camera intrinsics — addressed by AZ-702 (T6, factory sheet).
+
+This task wires the real tlog + the calibration into a new test and
+produces the honest verdict + a structured report.
+
+## Outcome
+
+- A new test runs the full `gps-denied-replay` against `derkachi.tlog` +
+  `flight_derkachi.mp4` + `khp20s30_factory.json` (or the current
+  fallback) and reports honest accuracy metrics.
+- A structured report at `_docs/06_metrics/real_flight_validation_{YYYY-MM-DD}.md`
+  contains mean / p50 / p95 / p99 horizontal error, % within {10, 25, 50, 100} m,
+  vertical error stats, and notes the calibration assumption.
+- AC-3 emits a real PASS or honest FAIL verdict (no `@xfail` mask).
+
+## Scope
+
+### Included
+- New test `tests/e2e/replay/test_derkachi_real_tlog.py` parallel to the existing 1-min test but using the binary tlog.
+- Metric helpers (mean/p50/p95/p99 percentile + threshold-hit counters) live in `src/gps_denied_onboard/helpers/gps_compare.py` (extends AZ-697).
+- Report writer `tests/e2e/replay/_report_writer.py` (test helper, not production code).
+- Updated `_docs/06_metrics/real_flight_validation_{date}.md` artifact format documented in `_docs/02_document/tests/blackbox-tests.md`.
+
+### Excluded
+- Map visualization — AZ-700.
+- HTTP API — AZ-701.
+- Camera calibration acquisition — AZ-702 (this task ships with whatever calibration is current).
+- Editing the existing `test_derkachi_1min.py` (new test runs alongside).
+
+## Acceptance Criteria
+
+**AC-1: Real PASS/FAIL verdict (no mask)**
+Given the new test on Tier-2 Jetson
+When `pytest tests/e2e/replay/test_derkachi_real_tlog.py -m tier2` runs
+Then the result is PASS or FAIL — no `@xfail`, no `@skip`
+
+**AC-2: Structured report written**
+Given a successful invocation
+When the test finishes
+Then `_docs/06_metrics/real_flight_validation_{YYYY-MM-DD}.md` exists with all required metrics in a Markdown table
+
+**AC-3: FAIL message attributes calibration uncertainty**
+Given the test fails the 80 %/100 m gate
+When the failure message renders
+Then it references the calibration acquisition method (factory-sheet per AZ-702) and the residual budget
+
+**AC-4: Existing 1-min test untouched**
+Given the cycle-1 test `test_ac3_within_100m_80pct_of_ticks`
+When all changes land
+Then the existing `@xfail` test still exists and runs (we add, don't replace)
+
+## Non-Functional Requirements
+
+**Performance**
+- The new test must complete within the existing Jetson Tier-2 wall budget (≤ 15 min for a 60 s clip; report longer for longer clips).
+
+## Unit Tests
+
+| AC Ref | What to Test | Required Outcome |
+|--------|-------------|-----------------|
+| AC-2 | Report writer with mock metrics | Markdown contains every required row |
+| AC-3 | Failure message templating | Contains "calibration: factory-sheet" + budget |
+
+## Blackbox Tests
+
+| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
+|--------|------------------------|-------------|-------------------|----------------|
+| AC-1 | Real derkachi.tlog + video + KHP20S30 calibration | Full replay + accuracy gate | PASS or FAIL (honest) | Perf ≤ 15 min |
+| AC-2 | After AC-1 run | Report file existence + contents | Structured report on disk | — |
+
+## Constraints
+
+- The new test MUST use the existing `gps-denied-replay` console-script — no inlined estimator invocation.
+- The report MUST be Markdown (not HTML/JSON) so it lives alongside other `_docs/06_metrics/` artifacts.
+- Skipping in CI when `RUN_REPLAY_E2E=0` is allowed (matches existing pattern); the test MUST run when the env var is set.
+
+## Risks & Mitigation
+
+**Risk 1: Honest FAIL exposes a true product gap**
+- *Risk*: The estimator may legitimately fail the 100 m/80 % gate even with correct calibration. Derkachi is cruise altitude with limited VPR anchor diversity.
+- *Mitigation*: That's the goal — honest measurement. Surface the gap; downstream cycles can tighten.
+
+**Risk 2: tlog format edge cases**
+- *Risk*: Real tlogs may carry non-standard system IDs, dialect mismatches, or corrupt segments.
+- *Mitigation*: AZ-697's AC-3 / AC-4 cover this at the truth-extractor level; this task only consumes the result.
+
+---
+
+## Implementation Notes (Batch 100 — Cycle 2)
+
+**Status**: In Testing (Jira AZ-699).
+
+### Files changed
+
+Production:
+- `src/gps_denied_onboard/helpers/gps_compare.py` — extended with `HorizontalErrorDistribution` DTO, `horizontal_error_distribution(emissions, ground_truth)` single-walk aggregator, and `percentile_sorted(values, pct)` linear-interpolation helper (numpy-equivalent). Re-exports added to `helpers/__init__.py`.
+
+Test helpers (under `tests/`, not production):
+- `tests/e2e/replay/_report_writer.py` — `render_report`, `format_failure_message`, `verdict_passes_ac3`, plus `ReportContext` DTO and the `AC3_GATE_THRESHOLD_M` / `AC3_GATE_PCT` constants.
+
+Tests:
+- `tests/e2e/replay/test_derkachi_real_tlog.py` — the AZ-699 e2e runner (skipped without real video + `RUN_REPLAY_E2E=1`; honest PASS/FAIL when prerequisites met).
+- `tests/unit/test_az699_report_writer.py` — 16 unit tests covering percentile arithmetic, distribution aggregator, verdict gate, failure-message templating, and report layout.
+
+Documentation:
+- `_docs/02_document/tests/blackbox-tests.md` — new entry **FT-P-20** documenting the artefact schema for `_docs/06_metrics/real_flight_validation_{YYYY-MM-DD}.md`.
+
+### AC coverage
+
+| AC | Test / Artefact | Result |
+|----|-----------------|--------|
+| AC-1 | `test_az699_real_flight_validation_emits_verdict_and_report` | SKIPPED on dev (real video missing); ready to run on Tier-2 Jetson with `RUN_REPLAY_E2E=1` + real video. NO `@xfail` mask. |
+| AC-2 | `test_render_report_contains_all_required_rows_on_pass`, `test_render_report_marks_failure_when_below_gate`, `test_render_report_includes_vertical_when_available` | PASS |
+| AC-3 | `test_failure_message_references_calibration_method_factory_sheet`, `test_failure_message_references_calibration_method_placeholder` | PASS |
+| AC-4 | `tests/e2e/replay/test_derkachi_1min.py` untouched (verified by diff scope: this batch added a sibling file, did not modify the existing one) | PASS |
+
+### Test results
+
+129 passed, 3 skipped (all documented prerequisites: AZ-699 e2e wants real video + Tier-2; AZ-698 AC-5 wants real video; AZ-399 AC-1 wants 500 MB tlog) in the focused regression slice. Full unit suite: 2219 passed, 1 pre-existing unrelated failure in `tests/unit/c12_operator_orchestrator/test_cli_console_script.py::test_cold_start_under_500ms_p99` (CLI cold-start NFR, 8/11 samples > 700 ms; touches none of AZ-697/698/699's code paths).
+
+### Strict typing
+
+`mypy --strict` on the three new modules (`helpers/gps_compare.py`, `helpers/__init__.py`, `tests/e2e/replay/_report_writer.py`): **Success: no issues found in 3 source files.**
+
+### Known limitations
+
+- AC-1 (the actual real-flight run on Tier-2 Jetson) cannot execute in this dev environment. The test is wired, gated cleanly, and ready — drop a real `flight_derkachi.mp4` (> 1 MB) into `_docs/00_problem/input_data/flight_derkachi/`, set `RUN_REPLAY_E2E=1`, and run on Tier-2 to produce the verdict + report.
+- Calibration acquisition method is read from the JSON's `acquisition_method` field. AZ-702 ships `factory-sheet`; any other value (or absence) is labelled `unknown` in the failure message.
+
+### Design note: helper module location
+
+`gps_compare.py` lives under `src/gps_denied_onboard/helpers/` (production) because both the AZ-699 test and the future AZ-701 HTTP-API path (T5) need it. `_report_writer.py` lives under `tests/e2e/replay/` because it is purely a test artefact — promoting it would invite production code to import a test helper, violating the file ownership rule documented in `_docs/02_document/module-layout.md`.