mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 19:21:13 +00:00
[AZ-699] Real-flight validation runner + Markdown accuracy report
New e2e test runs gps-denied-replay --auto-trim against the real
derkachi.tlog + flight video + AZ-702 calibration, computes the
horizontal-error distribution (mean/p50/p95/p99 + 10/25/50/100 m
threshold-hit share), writes _docs/06_metrics/real_flight_
validation_{date}.md, and asserts honest PASS/FAIL with no @xfail
mask. AZ-404's 1-min test is untouched (sibling, not replacement).
Extends gps_compare.py with HorizontalErrorDistribution +
percentile_sorted (numpy-equivalent linear interpolation). New
test helper _report_writer.py renders the canonical Markdown
schema documented as FT-P-20 in blackbox-tests.md.
16 new unit tests pin distribution arithmetic, verdict gate,
failure-message templating (references calibration acquisition
method per AC-3), and report layout. 129 passed in focused
regression, 3 skipped (real video / Tier-2 prerequisites).
Zero new mypy --strict errors.
Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
@@ -1,106 +0,0 @@
|
||||
# Real-flight validation runner + accuracy report
|
||||
|
||||
**Task**: AZ-699_real_flight_validation_runner
|
||||
**Name**: Run estimator against real Derkachi tlog + video; compute honest accuracy metrics; write report
|
||||
**Description**: New e2e test `test_derkachi_real_tlog.py` that feeds the real `derkachi.tlog` (not the synth) into the replay pipeline, compares the JSONL output against the binary-tlog GPS truth (from AZ-697), and writes a structured Markdown accuracy report. Flips AC-3 from `@xfail` to a real PASS/FAIL verdict.
|
||||
**Complexity**: 3 points
|
||||
**Dependencies**: AZ-697
|
||||
**Component**: Blackbox Tests (epic AZ-696)
|
||||
**Tracker**: AZ-699
|
||||
**Epic**: AZ-696
|
||||
|
||||
## Problem
|
||||
|
||||
`tests/e2e/replay/test_derkachi_1min.py::test_ac3_within_100m_80pct_of_ticks`
|
||||
is permanently `@xfail`. Even when the test runs (Jetson Tier-2), the
|
||||
result is hidden — we have no honest measurement of estimator accuracy
|
||||
against a real flight. The cycle-1 retrospective (`_docs/06_metrics/retro_2026-05-20.md`)
|
||||
flagged this as the highest-impact open verification.
|
||||
|
||||
The two contributors:
|
||||
1. Synth tlog (compares estimator to itself) — fixed by AZ-697.
|
||||
2. Unknown camera intrinsics — addressed by AZ-702 (T6, factory sheet).
|
||||
|
||||
This task wires the real tlog + the calibration into a new test and
|
||||
produces the honest verdict + a structured report.
|
||||
|
||||
## Outcome
|
||||
|
||||
- A new test runs the full `gps-denied-replay` against `derkachi.tlog` +
|
||||
`flight_derkachi.mp4` + `khp20s30_factory.json` (or the current
|
||||
fallback) and reports honest accuracy metrics.
|
||||
- A structured report at `_docs/06_metrics/real_flight_validation_{YYYY-MM-DD}.md`
|
||||
contains mean / p50 / p95 / p99 horizontal error, % within {10, 25, 50, 100} m,
|
||||
vertical error stats, and notes the calibration assumption.
|
||||
- AC-3 emits a real PASS or honest FAIL verdict (no `@xfail` mask).
|
||||
|
||||
## Scope
|
||||
|
||||
### Included
|
||||
- New test `tests/e2e/replay/test_derkachi_real_tlog.py` parallel to the existing 1-min test but using the binary tlog.
|
||||
- Metric helpers (mean/p50/p95/p99 percentile + threshold-hit counters) live in `src/gps_denied_onboard/helpers/gps_compare.py` (extends AZ-697).
|
||||
- Report writer `tests/e2e/replay/_report_writer.py` (test helper, not production code).
|
||||
- Updated `_docs/06_metrics/real_flight_validation_{date}.md` artifact format documented in `_docs/02_document/tests/blackbox-tests.md`.
|
||||
|
||||
### Excluded
|
||||
- Map visualization — AZ-700.
|
||||
- HTTP API — AZ-701.
|
||||
- Camera calibration acquisition — AZ-702 (this task ships with whatever calibration is current).
|
||||
- Editing the existing `test_derkachi_1min.py` (new test runs alongside).
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
**AC-1: Real PASS/FAIL verdict (no mask)**
|
||||
Given the new test on Tier-2 Jetson
|
||||
When `pytest tests/e2e/replay/test_derkachi_real_tlog.py -m tier2` runs
|
||||
Then the result is PASS or FAIL — no `@xfail`, no `@skip`
|
||||
|
||||
**AC-2: Structured report written**
|
||||
Given a successful invocation
|
||||
When the test finishes
|
||||
Then `_docs/06_metrics/real_flight_validation_{YYYY-MM-DD}.md` exists with all required metrics in a Markdown table
|
||||
|
||||
**AC-3: FAIL message attributes calibration uncertainty**
|
||||
Given the test fails the 80 %/100 m gate
|
||||
When the failure message renders
|
||||
Then it references the calibration acquisition method (factory-sheet per AZ-702) and the residual budget
|
||||
|
||||
**AC-4: Existing 1-min test untouched**
|
||||
Given the cycle-1 test `test_ac3_within_100m_80pct_of_ticks`
|
||||
When all changes land
|
||||
Then the existing `@xfail` test still exists and runs (we add, don't replace)
|
||||
|
||||
## Non-Functional Requirements
|
||||
|
||||
**Performance**
|
||||
- The new test must complete within the existing Jetson Tier-2 wall budget (≤ 15 min for a 60 s clip; report longer for longer clips).
|
||||
|
||||
## Unit Tests
|
||||
|
||||
| AC Ref | What to Test | Required Outcome |
|
||||
|--------|-------------|-----------------|
|
||||
| AC-2 | Report writer with mock metrics | Markdown contains every required row |
|
||||
| AC-3 | Failure message templating | Contains "calibration: factory-sheet" + budget |
|
||||
|
||||
## Blackbox Tests
|
||||
|
||||
| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|
||||
|--------|------------------------|-------------|-------------------|----------------|
|
||||
| AC-1 | Real derkachi.tlog + video + KHP20S30 calibration | Full replay + accuracy gate | PASS or FAIL (honest) | Perf ≤ 15 min |
|
||||
| AC-2 | After AC-1 run | Report file existence + contents | Structured report on disk | — |
|
||||
|
||||
## Constraints
|
||||
|
||||
- The new test MUST use the existing `gps-denied-replay` console-script — no inlined estimator invocation.
|
||||
- The report MUST be Markdown (not HTML/JSON) so it lives alongside other `_docs/06_metrics/` artifacts.
|
||||
- Skipping in CI when `RUN_REPLAY_E2E=0` is allowed (matches existing pattern); the test MUST run when the env var is set.
|
||||
|
||||
## Risks & Mitigation
|
||||
|
||||
**Risk 1: Honest FAIL exposes a true product gap**
|
||||
- *Risk*: The estimator may legitimately fail the 100 m/80 % gate even with correct calibration. Derkachi is cruise altitude with limited VPR anchor diversity.
|
||||
- *Mitigation*: That's the goal — honest measurement. Surface the gap; downstream cycles can tighten.
|
||||
|
||||
**Risk 2: tlog format edge cases**
|
||||
- *Risk*: Real tlogs may carry non-standard system IDs, dialect mismatches, or corrupt segments.
|
||||
- *Mitigation*: AZ-697's AC-3 / AC-4 cover this at the truth-extractor level; this task only consumes the result.
|
||||
Reference in New Issue
Block a user