[AZ-699] Real-flight validation runner + Markdown accuracy report

New e2e test runs gps-denied-replay --auto-trim against the real derkachi.tlog + flight video + AZ-702 calibration, computes the horizontal-error distribution (mean/p50/p95/p99 + 10/25/50/100 m threshold-hit share), writes _docs/06_metrics/real_flight_ validation_{date}.md, and asserts honest PASS/FAIL with no @xfail mask. AZ-404's 1-min test is untouched (sibling, not replacement). Extends gps_compare.py with HorizontalErrorDistribution + percentile_sorted (numpy-equivalent linear interpolation). New test helper _report_writer.py renders the canonical Markdown schema documented as FT-P-20 in blackbox-tests.md. 16 new unit tests pin distribution arithmetic, verdict gate, failure-message templating (references calibration acquisition method per AC-3), and report layout. 129 passed in focused regression, 3 skipped (real video / Tier-2 prerequisites). Zero new mypy --strict errors. Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-21 10:21:13 +00:00 · 2026-05-20 16:53:48 +03:00
parent f5366bbca1
commit dcde602f61
9 changed files with 1261 additions and 2 deletions
@@ -593,3 +593,82 @@ All tests run from the `e2e-runner` container against the SUT through public bou

 **Expected outcome**: All mid-flight tiles current-timestamped and fresh.
 **Max execution time**: 6 min.
+
+---
+
+### FT-P-20: Real-flight validation runner — honest verdict + Markdown accuracy report
+
+**Summary**: Runs the full `gps-denied-replay` against the **real** Derkachi binary tlog + flight video + AZ-702 factory-sheet camera calibration, computes the per-emission horizontal-error distribution, and writes a structured Markdown accuracy report. Replaces the AZ-404 `@xfail` mask on AC-3 with a real PASS/FAIL.
+**Traces to**: AZ-699 AC-1..AC-3 (epic AZ-696 AC-3 — the 100 m / 80 % gate).
+**Category**: Position Accuracy
+
+**Preconditions**:
+- `_docs/00_problem/input_data/flight_derkachi/derkachi.tlog` (real binary, multi-flight).
+- `_docs/00_problem/input_data/flight_derkachi/flight_derkachi.mp4` (real recording, > 1 MB; the placeholder used by AZ-404 does not satisfy this gate).
+- `_docs/00_problem/input_data/flight_derkachi/khp20s30_factory.json` (AZ-702 calibration).
+- `gps-denied-replay` console-script installed.
+- `RUN_REPLAY_E2E=1` (matches the existing AZ-404 gate).
+
+**Input data**: real `derkachi.tlog` covers up to three sorties; the AZ-698 segmenter + `--auto-trim` locates the matching flight automatically.
+
+**Steps**:
+
+| Step | Consumer Action | Expected System Response |
+|------|----------------|------------------------|
+| 1 | Invoke `gps-denied-replay --auto-trim ...` with real fixtures | Subprocess exits 0 within the 15-min NFR budget |
+| 2 | Parse JSONL emissions; pair each with the nearest-in-time ground-truth row (binary-tlog GPS via AZ-697) | Distribution computed: count, mean, p50, p95, p99, threshold-hit share at 10/25/50/100 m |
+| 3 | Render the Markdown accuracy report and write `_docs/06_metrics/real_flight_validation_{YYYY-MM-DD}.md` | Report exists with header, run context, horizontal-error stats, threshold-hit table, and (when available) vertical-error stats |
+| 4 | Evaluate the AC-3 gate: ≥ 80 % within 100 m | Verdict is PASS or honest FAIL — no `@xfail` mask |
+| 5 | On FAIL, surface a failure message referencing the calibration acquisition method (factory-sheet / placeholder / unknown) and the residual budget | Operator can attribute the failure without re-reading the source |
+
+**Expected outcome**: PASS when the estimator meets the epic AC-3 gate; honest FAIL otherwise. The Markdown report is the durable artefact (consumed by the cycle retrospective and downstream tuning work).
+
+**Max execution time**: 15 min (matches AZ-699 NFR for a single Tier-2 Jetson run).
+
+**Report artefact schema** (canonical, produced by `tests/e2e/replay/_report_writer.py`):
+
+```markdown
+# Real-flight validation — YYYY-MM-DD
+
+**Verdict**: PASS | FAIL (AC-3 gate: ≥ 80 % within 100 m)
+
+## Run context
+
+- Tlog: `<path>`
+- Video: `<path>`
+- Calibration acquisition method: factory-sheet | placeholder | unknown
+- Clip duration: <float> s
+- Emissions consumed: <int>
+- Ground-truth pairings: <int>
+
+## Horizontal error (metres)
+
+| Statistic | Value |
+| --------- | ----- |
+| Mean | <float> |
+| p50  | <float> |
+| p95  | <float> |
+| p99  | <float> |
+
+## Threshold-hit share
+
+| Threshold (m) | Hit share (%) |
+| ------------- | ------------- |
+| 10  | <float> |
+| 25  | <float> |
+| 50  | <float> |
+| 100 | <float> |
+
+## Vertical error (metres)
+
+| Statistic | Value |
+| --------- | ----- |
+| Mean    | <float> |
+| p50     | <float> |
+| p95     | <float> |
+| Samples | <int>   |
+```
+
+The Vertical-error section is replaced by `_No emissions carried a comparable altitude — vertical stats skipped._` when none of the JSONL rows carry an `alt_m` field comparable to the ground-truth altitude.
+
+**Skip semantics**: AZ-699 distinguishes between *missing-prerequisite skip* (cleanly skipped with the missing file's path) and *test-cannot-resolve mask* (`@xfail` — explicitly forbidden by AZ-699 AC-1). The AZ-404 1-min test's `@xfail` on AC-3 is unchanged (AZ-699 AC-4 is "add a new test, don't replace") — FT-P-20 is the honest replacement that runs alongside it.