mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-21 10:21:13 +00:00
[AZ-699] Real-flight validation runner + Markdown accuracy report
New e2e test runs gps-denied-replay --auto-trim against the real
derkachi.tlog + flight video + AZ-702 calibration, computes the
horizontal-error distribution (mean/p50/p95/p99 + 10/25/50/100 m
threshold-hit share), writes _docs/06_metrics/real_flight_
validation_{date}.md, and asserts honest PASS/FAIL with no @xfail
mask. AZ-404's 1-min test is untouched (sibling, not replacement).
Extends gps_compare.py with HorizontalErrorDistribution +
percentile_sorted (numpy-equivalent linear interpolation). New
test helper _report_writer.py renders the canonical Markdown
schema documented as FT-P-20 in blackbox-tests.md.
16 new unit tests pin distribution arithmetic, verdict gate,
failure-message templating (references calibration acquisition
method per AC-3), and report layout. 129 passed in focused
regression, 3 skipped (real video / Tier-2 prerequisites).
Zero new mypy --strict errors.
Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
@@ -593,3 +593,82 @@ All tests run from the `e2e-runner` container against the SUT through public bou
|
||||
|
||||
**Expected outcome**: All mid-flight tiles current-timestamped and fresh.
|
||||
**Max execution time**: 6 min.
|
||||
|
||||
---
|
||||
|
||||
### FT-P-20: Real-flight validation runner — honest verdict + Markdown accuracy report
|
||||
|
||||
**Summary**: Runs the full `gps-denied-replay` against the **real** Derkachi binary tlog + flight video + AZ-702 factory-sheet camera calibration, computes the per-emission horizontal-error distribution, and writes a structured Markdown accuracy report. Replaces the AZ-404 `@xfail` mask on AC-3 with a real PASS/FAIL.
|
||||
**Traces to**: AZ-699 AC-1..AC-3 (epic AZ-696 AC-3 — the 100 m / 80 % gate).
|
||||
**Category**: Position Accuracy
|
||||
|
||||
**Preconditions**:
|
||||
- `_docs/00_problem/input_data/flight_derkachi/derkachi.tlog` (real binary, multi-flight).
|
||||
- `_docs/00_problem/input_data/flight_derkachi/flight_derkachi.mp4` (real recording, > 1 MB; the placeholder used by AZ-404 does not satisfy this gate).
|
||||
- `_docs/00_problem/input_data/flight_derkachi/khp20s30_factory.json` (AZ-702 calibration).
|
||||
- `gps-denied-replay` console-script installed.
|
||||
- `RUN_REPLAY_E2E=1` (matches the existing AZ-404 gate).
|
||||
|
||||
**Input data**: real `derkachi.tlog` covers up to three sorties; the AZ-698 segmenter + `--auto-trim` locates the matching flight automatically.
|
||||
|
||||
**Steps**:
|
||||
|
||||
| Step | Consumer Action | Expected System Response |
|
||||
|------|----------------|------------------------|
|
||||
| 1 | Invoke `gps-denied-replay --auto-trim ...` with real fixtures | Subprocess exits 0 within the 15-min NFR budget |
|
||||
| 2 | Parse JSONL emissions; pair each with the nearest-in-time ground-truth row (binary-tlog GPS via AZ-697) | Distribution computed: count, mean, p50, p95, p99, threshold-hit share at 10/25/50/100 m |
|
||||
| 3 | Render the Markdown accuracy report and write `_docs/06_metrics/real_flight_validation_{YYYY-MM-DD}.md` | Report exists with header, run context, horizontal-error stats, threshold-hit table, and (when available) vertical-error stats |
|
||||
| 4 | Evaluate the AC-3 gate: ≥ 80 % within 100 m | Verdict is PASS or honest FAIL — no `@xfail` mask |
|
||||
| 5 | On FAIL, surface a failure message referencing the calibration acquisition method (factory-sheet / placeholder / unknown) and the residual budget | Operator can attribute the failure without re-reading the source |
|
||||
|
||||
**Expected outcome**: PASS when the estimator meets the epic AC-3 gate; honest FAIL otherwise. The Markdown report is the durable artefact (consumed by the cycle retrospective and downstream tuning work).
|
||||
|
||||
**Max execution time**: 15 min (matches AZ-699 NFR for a single Tier-2 Jetson run).
|
||||
|
||||
**Report artefact schema** (canonical, produced by `tests/e2e/replay/_report_writer.py`):
|
||||
|
||||
```markdown
|
||||
# Real-flight validation — YYYY-MM-DD
|
||||
|
||||
**Verdict**: PASS | FAIL (AC-3 gate: ≥ 80 % within 100 m)
|
||||
|
||||
## Run context
|
||||
|
||||
- Tlog: `<path>`
|
||||
- Video: `<path>`
|
||||
- Calibration acquisition method: factory-sheet | placeholder | unknown
|
||||
- Clip duration: <float> s
|
||||
- Emissions consumed: <int>
|
||||
- Ground-truth pairings: <int>
|
||||
|
||||
## Horizontal error (metres)
|
||||
|
||||
| Statistic | Value |
|
||||
| --------- | ----- |
|
||||
| Mean | <float> |
|
||||
| p50 | <float> |
|
||||
| p95 | <float> |
|
||||
| p99 | <float> |
|
||||
|
||||
## Threshold-hit share
|
||||
|
||||
| Threshold (m) | Hit share (%) |
|
||||
| ------------- | ------------- |
|
||||
| 10 | <float> |
|
||||
| 25 | <float> |
|
||||
| 50 | <float> |
|
||||
| 100 | <float> |
|
||||
|
||||
## Vertical error (metres)
|
||||
|
||||
| Statistic | Value |
|
||||
| --------- | ----- |
|
||||
| Mean | <float> |
|
||||
| p50 | <float> |
|
||||
| p95 | <float> |
|
||||
| Samples | <int> |
|
||||
```
|
||||
|
||||
The Vertical-error section is replaced by `_No emissions carried a comparable altitude — vertical stats skipped._` when none of the JSONL rows carry an `alt_m` field comparable to the ground-truth altitude.
|
||||
|
||||
**Skip semantics**: AZ-699 distinguishes between *missing-prerequisite skip* (cleanly skipped with the missing file's path) and *test-cannot-resolve mask* (`@xfail` — explicitly forbidden by AZ-699 AC-1). The AZ-404 1-min test's `@xfail` on AC-3 is unchanged (AZ-699 AC-4 is "add a new test, don't replace") — FT-P-20 is the honest replacement that runs alongside it.
|
||||
|
||||
Reference in New Issue
Block a user