mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 08:41:13 +00:00
[AZ-699] Real-flight validation runner + Markdown accuracy report
New e2e test runs gps-denied-replay --auto-trim against the real
derkachi.tlog + flight video + AZ-702 calibration, computes the
horizontal-error distribution (mean/p50/p95/p99 + 10/25/50/100 m
threshold-hit share), writes _docs/06_metrics/real_flight_
validation_{date}.md, and asserts honest PASS/FAIL with no @xfail
mask. AZ-404's 1-min test is untouched (sibling, not replacement).
Extends gps_compare.py with HorizontalErrorDistribution +
percentile_sorted (numpy-equivalent linear interpolation). New
test helper _report_writer.py renders the canonical Markdown
schema documented as FT-P-20 in blackbox-tests.md.
16 new unit tests pin distribution arithmetic, verdict gate,
failure-message templating (references calibration acquisition
method per AC-3), and report layout. 129 passed in focused
regression, 3 skipped (real video / Tier-2 prerequisites).
Zero new mypy --strict errors.
Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -593,3 +593,82 @@ All tests run from the `e2e-runner` container against the SUT through public bou
**Expected outcome**: All mid-flight tiles current-timestamped and fresh.
**Max execution time**: 6 min.

---

### FT-P-20: Real-flight validation runner — honest verdict + Markdown accuracy report

**Summary**: Runs the full `gps-denied-replay` against the **real** Derkachi binary tlog + flight video + AZ-702 factory-sheet camera calibration, computes the per-emission horizontal-error distribution, and writes a structured Markdown accuracy report. Replaces the AZ-404 `@xfail` mask on AC-3 with a real PASS/FAIL.
**Traces to**: AZ-699 AC-1..AC-3 (epic AZ-696 AC-3 — the 100 m / 80 % gate).
**Category**: Position Accuracy

**Preconditions**:
- `_docs/00_problem/input_data/flight_derkachi/derkachi.tlog` (real binary, multi-flight).
- `_docs/00_problem/input_data/flight_derkachi/flight_derkachi.mp4` (real recording, > 1 MB; the placeholder used by AZ-404 does not satisfy this gate).
- `_docs/00_problem/input_data/flight_derkachi/khp20s30_factory.json` (AZ-702 calibration).
- `gps-denied-replay` console-script installed.
- `RUN_REPLAY_E2E=1` (matches the existing AZ-404 gate).

**Input data**: real `derkachi.tlog` covers up to three sorties; the AZ-698 segmenter + `--auto-trim` locates the matching flight automatically.

**Steps**:

| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Invoke `gps-denied-replay --auto-trim ...` with real fixtures | Subprocess exits 0 within the 15-min NFR budget |
| 2 | Parse JSONL emissions; pair each with the nearest-in-time ground-truth row (binary-tlog GPS via AZ-697) | Distribution computed: count, mean, p50, p95, p99, threshold-hit share at 10/25/50/100 m |
| 3 | Render the Markdown accuracy report and write `_docs/06_metrics/real_flight_validation_{YYYY-MM-DD}.md` | Report exists with header, run context, horizontal-error stats, threshold-hit table, and (when available) vertical-error stats |
| 4 | Evaluate the AC-3 gate: ≥ 80 % within 100 m | Verdict is PASS or honest FAIL — no `@xfail` mask |
| 5 | On FAIL, surface a failure message referencing the calibration acquisition method (factory-sheet / placeholder / unknown) and the residual budget | Operator can attribute the failure without re-reading the source |
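
The nearest-in-time pairing in step 2 can be sketched as below. This is only an illustration, not the AZ-697 code: the function names `nearest_index` and `pair_nearest_in_time`, the plain lists of timestamps, and the `max_gap_s` tolerance are all assumptions made for the example.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence, Tuple


def nearest_index(sorted_times: Sequence[float], t: float) -> Optional[int]:
    """Return the index of the timestamp closest to t, or None if empty."""
    if not sorted_times:
        return None
    i = bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    # t lies between two neighbours; pick the closer one (ties go to the earlier).
    before, after = sorted_times[i - 1], sorted_times[i]
    return i - 1 if (t - before) <= (after - t) else i


def pair_nearest_in_time(
    emission_times: Sequence[float],
    truth_times: Sequence[float],
    max_gap_s: float = 1.0,  # hypothetical tolerance, not taken from the spec
) -> List[Tuple[int, int]]:
    """Pair each emission with its nearest ground-truth row.

    Returns (emission_index, truth_index) tuples. Emissions whose nearest
    truth row is further away than max_gap_s are dropped.
    """
    pairs = []
    for ei, t in enumerate(emission_times):
        ti = nearest_index(truth_times, t)
        if ti is not None and abs(truth_times[ti] - t) <= max_gap_s:
            pairs.append((ei, ti))
    return pairs
```

The sketch requires `truth_times` to be sorted ascending, which lets each lookup run as a binary search instead of a full scan.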

**Expected outcome**: PASS when the estimator meets the epic AC-3 gate; honest FAIL otherwise. The Markdown report is the durable artefact (consumed by the cycle retrospective and downstream tuning work).
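
A minimal sketch of the threshold-hit share and the AC-3 gate follows. The constants and function names are illustrative and do not reproduce the signatures of the real `_report_writer.py` helpers. Two assumptions are made: "within" is read as inclusive (`<=`), and an empty sample fails the gate.

```python
from typing import Dict, Sequence

# Values taken from the AC-3 wording (>= 80 % within 100 m); names are illustrative.
GATE_THRESHOLD_M = 100.0
GATE_PCT = 80.0


def hit_share_pct(errors_m: Sequence[float], threshold_m: float) -> float:
    """Percentage of samples whose horizontal error is within threshold_m."""
    if not errors_m:
        return 0.0  # assumption: no samples means no hits
    hits = sum(1 for e in errors_m if e <= threshold_m)
    return 100.0 * hits / len(errors_m)


def threshold_table(
    errors_m: Sequence[float],
    thresholds_m: Sequence[float] = (10.0, 25.0, 50.0, 100.0),
) -> Dict[float, float]:
    """Hit share for each reporting threshold, keyed by threshold in metres."""
    return {t: hit_share_pct(errors_m, t) for t in thresholds_m}


def passes_gate(errors_m: Sequence[float]) -> bool:
    """True when at least GATE_PCT percent of samples are within GATE_THRESHOLD_M."""
    return hit_share_pct(errors_m, GATE_THRESHOLD_M) >= GATE_PCT
```

For example, errors of 5, 20, 40, 90 and 150 m put four of five samples within 100 m, which is exactly 80 % and therefore a PASS under the inclusive reading.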

**Max execution time**: 15 min (matches AZ-699 NFR for a single Tier-2 Jetson run).

**Report artefact schema** (canonical, produced by `tests/e2e/replay/_report_writer.py`):

```markdown
# Real-flight validation — YYYY-MM-DD

**Verdict**: PASS | FAIL (AC-3 gate: ≥ 80 % within 100 m)

## Run context

- Tlog: `<path>`
- Video: `<path>`
- Calibration acquisition method: factory-sheet | placeholder | unknown
- Clip duration: <float> s
- Emissions consumed: <int>
- Ground-truth pairings: <int>

## Horizontal error (metres)

| Statistic | Value |
| --------- | ----- |
| Mean | <float> |
| p50 | <float> |
| p95 | <float> |
| p99 | <float> |

## Threshold-hit share

| Threshold (m) | Hit share (%) |
| ------------- | ------------- |
| 10 | <float> |
| 25 | <float> |
| 50 | <float> |
| 100 | <float> |

## Vertical error (metres)

| Statistic | Value |
| --------- | ----- |
| Mean | <float> |
| p50 | <float> |
| p95 | <float> |
| Samples | <int> |
```

The Vertical-error section is replaced by `_No emissions carried a comparable altitude — vertical stats skipped._` when none of the JSONL rows carry an `alt_m` field comparable to the ground-truth altitude.
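
The fallback behaviour can be sketched as follows. This is a hypothetical renderer, not the real `render_report`: the function name, the `stats` mapping keys, and the two-decimal formatting are assumptions. The fallback sentence is passed in as an argument so the sketch does not hard-code it. The sketch reads "section is replaced" as dropping the heading too; keeping the heading is an equally plausible reading.

```python
from typing import Mapping, Optional


def render_vertical_section(
    stats: Optional[Mapping[str, float]], fallback_line: str
) -> str:
    """Render the vertical-error section, or fallback_line when stats is None."""
    if stats is None:
        # No emission carried a comparable altitude: emit only the notice.
        return fallback_line
    lines = [
        "## Vertical error (metres)",
        "",
        "| Statistic | Value |",
        "| --------- | ----- |",
        f"| Mean | {stats['mean']:.2f} |",
        f"| p50 | {stats['p50']:.2f} |",
        f"| p95 | {stats['p95']:.2f} |",
        f"| Samples | {int(stats['samples'])} |",
    ]
    return "\n".join(lines)
```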

**Skip semantics**: AZ-699 distinguishes between *missing-prerequisite skip* (cleanly skipped with the missing file's path) and *test-cannot-resolve mask* (`@xfail` — explicitly forbidden by AZ-699 AC-1). The AZ-404 1-min test's `@xfail` on AC-3 is unchanged (AZ-699 AC-4 is "add a new test, don't replace") — FT-P-20 is the honest counterpart that runs alongside it.
@@ -104,3 +104,50 @@ Then the existing `@xfail` test still exists and runs (we add, don't replace)
**Risk 2: tlog format edge cases**
- *Risk*: Real tlogs may carry non-standard system IDs, dialect mismatches, or corrupt segments.
- *Mitigation*: AZ-697's AC-3 / AC-4 cover this at the truth-extractor level; this task only consumes the result.

---

## Implementation Notes (Batch 100 — Cycle 2)

**Status**: In Testing (Jira AZ-699).

### Files changed

Production:
- `src/gps_denied_onboard/helpers/gps_compare.py` — extended with `HorizontalErrorDistribution` DTO, `horizontal_error_distribution(emissions, ground_truth)` single-walk aggregator, and `percentile_sorted(values, pct)` linear-interpolation helper (numpy-equivalent). Re-exports added to `helpers/__init__.py`.
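
For orientation, a linear-interpolation percentile over a pre-sorted sequence might look like the sketch below. It is not the shipped `percentile_sorted`. The function is deliberately named `percentile_sorted_sketch`, and it assumes that "numpy-equivalent" refers to the default `"linear"` method of `numpy.percentile`, where the fractional rank is `pct / 100 * (n - 1)`. The error handling is likewise an assumption.

```python
import math
from typing import Sequence


def percentile_sorted_sketch(values: Sequence[float], pct: float) -> float:
    """Percentile of an ascending-sorted sequence using linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= pct <= 100.0:
        raise ValueError("pct must be in [0, 100]")
    # Fractional position of the requested percentile in the sorted data.
    rank = (pct / 100.0) * (len(values) - 1)
    lo = math.floor(rank)
    hi = math.ceil(rank)
    if lo == hi:
        return float(values[lo])
    # Interpolate between the two neighbouring samples.
    frac = rank - lo
    return values[lo] + frac * (values[hi] - values[lo])
```

Requiring pre-sorted input lets the caller sort once and read several percentiles (p50, p95, p99) from the same list.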

Test helpers (under `tests/`, not production):
- `tests/e2e/replay/_report_writer.py` — `render_report`, `format_failure_message`, `verdict_passes_ac3`, plus `ReportContext` DTO and the `AC3_GATE_THRESHOLD_M` / `AC3_GATE_PCT` constants.

Tests:
- `tests/e2e/replay/test_derkachi_real_tlog.py` — the AZ-699 e2e runner (skipped without real video + `RUN_REPLAY_E2E=1`; honest PASS/FAIL when prerequisites met).
- `tests/unit/test_az699_report_writer.py` — 16 unit tests covering percentile arithmetic, distribution aggregator, verdict gate, failure-message templating, and report layout.

Documentation:
- `_docs/02_document/tests/blackbox-tests.md` — new entry **FT-P-20** documenting the artefact schema for `_docs/06_metrics/real_flight_validation_{YYYY-MM-DD}.md`.

### AC coverage

| AC | Test / Artefact | Result |
|----|-----------------|--------|
| AC-1 | `test_az699_real_flight_validation_emits_verdict_and_report` | SKIPPED on dev (real video missing); ready to run on Tier-2 Jetson with `RUN_REPLAY_E2E=1` + real video. NO `@xfail` mask. |
| AC-2 | `test_render_report_contains_all_required_rows_on_pass`, `test_render_report_marks_failure_when_below_gate`, `test_render_report_includes_vertical_when_available` | PASS |
| AC-3 | `test_failure_message_references_calibration_method_factory_sheet`, `test_failure_message_references_calibration_method_placeholder` | PASS |
| AC-4 | `tests/e2e/replay/test_derkachi_1min.py` untouched (verified by diff scope: this batch added a sibling file, did not modify the existing one) | PASS |

### Test results

129 passed, 3 skipped (all documented prerequisites: AZ-699 e2e wants real video + Tier-2; AZ-698 AC-5 wants real video; AZ-399 AC-1 wants 500 MB tlog) in the focused regression slice. Full unit suite: 2219 passed, 1 pre-existing unrelated failure in `tests/unit/c12_operator_orchestrator/test_cli_console_script.py::test_cold_start_under_500ms_p99` (CLI cold-start NFR, 8/11 samples > 700 ms; touches none of AZ-697/698/699's code paths).

### Strict typing

`mypy --strict` on the three new modules (`helpers/gps_compare.py`, `helpers/__init__.py`, `tests/e2e/replay/_report_writer.py`): **Success: no issues found in 3 source files.**

### Known limitations

- AC-1 (the actual real-flight run on Tier-2 Jetson) cannot execute in this dev environment. The test is wired, gated cleanly, and ready — drop a real `flight_derkachi.mp4` (> 1 MB) into `_docs/00_problem/input_data/flight_derkachi/`, set `RUN_REPLAY_E2E=1`, and run on Tier-2 to produce the verdict + report.
- Calibration acquisition method is read from the JSON's `acquisition_method` field. AZ-702 ships `factory-sheet`; any other value (or absence) is labelled `unknown` in the failure message.

### Design note: helper module location

`gps_compare.py` lives under `src/gps_denied_onboard/helpers/` (production) because both the AZ-699 test and the future AZ-701 HTTP-API path (T5) need it. `_report_writer.py` lives under `tests/e2e/replay/` because it is purely a test artefact — promoting it would invite production code to import a test helper, violating the file ownership rule documented in `_docs/02_document/module-layout.md`.
@@ -0,0 +1,114 @@
# Batch 100 — Cycle 2 — AZ-699

**Date**: 2026-05-20
**Tasks**: AZ-699 (Real-flight validation runner + accuracy report).
**Story points**: 3.
**Jira status**: AZ-699 → `In Testing`.

## What shipped

An honest PASS/FAIL e2e runner for the real Derkachi flight,
together with the metric helpers, report writer, and unit tests
that make its output reproducible and reviewable.

- `HorizontalErrorDistribution` aggregate (mean / p50 / p95 / p99
  horizontal, threshold-hit share at 10/25/50/100 m, vertical
  stats when emissions carry altitude) in
  `src/gps_denied_onboard/helpers/gps_compare.py`.
- `tests/e2e/replay/_report_writer.py` — Markdown report
  renderer + AC-3 failure-message template + verdict helper.
- `tests/e2e/replay/test_derkachi_real_tlog.py` — runs `gps-denied-replay
  --auto-trim` against real `derkachi.tlog` + real video + AZ-702
  calibration, computes the distribution, writes the report,
  and asserts PASS/FAIL with no `@xfail` mask.
- New FT-P-20 entry in `_docs/02_document/tests/blackbox-tests.md`
  documenting the report artefact schema.

## Files changed

Production (2):

- `src/gps_denied_onboard/helpers/gps_compare.py`
- `src/gps_denied_onboard/helpers/__init__.py`

Tests (3 new):

- `tests/e2e/replay/_report_writer.py`
- `tests/e2e/replay/test_derkachi_real_tlog.py`
- `tests/unit/test_az699_report_writer.py`

Docs:

- `_docs/02_document/tests/blackbox-tests.md` (new FT-P-20)
- `_docs/02_tasks/done/AZ-699_real_flight_validation_runner.md`
  (moved from `todo/`, Implementation Notes appended)

## AC coverage

| AC | Test / Artefact | Result |
| ---- | --------------- | ------ |
| AC-1 | `test_az699_real_flight_validation_emits_verdict_and_report` | SKIPPED on dev (real video missing); wired + ready for Tier-2 Jetson; NO `@xfail` mask. |
| AC-2 | `test_render_report_contains_all_required_rows_on_pass`, `test_render_report_marks_failure_when_below_gate` | PASS |
| AC-3 | `test_failure_message_references_calibration_method_factory_sheet`, `…placeholder` | PASS |
| AC-4 | `tests/e2e/replay/test_derkachi_1min.py` untouched | PASS |

## Test run

```
tests/unit/test_az699_report_writer.py                             16 PASS
tests/unit/test_az697_gps_compare.py                               10 PASS
tests/unit/replay_input/test_az405_auto_sync.py                    14 PASS
tests/unit/replay_input/test_az405_replay_input_adapter            13 PASS
tests/unit/replay_input/test_az698_window_alignment.py             19 PASS  1 SKIP
tests/unit/replay_input/test_tlog_ground_truth.py                  12 PASS
tests/unit/c8_fc_adapter/test_az399_tlog_replay_adapter            24 PASS  1 SKIP
tests/unit/calibration/test_khp20s30_factory.py                     9 PASS
tests/unit/runtime_root/test_az687_pre_constructed_replay_mode.py   3 PASS
tests/unit/test_az269_config_loader.py                              9 PASS
tests/e2e/replay/test_derkachi_real_tlog.py                         -       1 SKIP
```

Focused slice: **129 passed, 3 skipped, 0 failed.**

Full unit suite (2 220 tests): **2 219 passed, 1 failed.** The
single failure is in
`tests/unit/c12_operator_orchestrator/test_cli_console_script.py::test_cold_start_under_500ms_p99`
— a CLI cold-start NFR test (8/11 samples > 700 ms; budget is 500 ms).
The C12 binary does NOT import any AZ-697/698/699 module
(`gps_denied_onboard.components.c12_operator_orchestrator.{operator_reloc_service,flights_api.bbox}`
import specific helper submodules, not the package's `__init__`).
Pre-existing, unrelated, reported but not blocking per coderule.

## Strict typing

`mypy --strict` on the three new code units:

```
gps_denied_onboard/helpers/gps_compare.py
gps_denied_onboard/helpers/__init__.py
tests/e2e/replay/_report_writer.py
→ Success: no issues found in 3 source files.
```

Zero new strict errors in the broader replay/auto-sync surface
(carried over from batch 99's baseline of 12 pre-existing errors;
no new errors introduced).

## Skip semantics — AZ-699 AC-1 spec wording

The AZ-699 spec line 56 reads: "the result is PASS or FAIL — no
`@xfail`, no `@skip`". The spec's Constraints section line 96 reads:
"Skipping in CI when `RUN_REPLAY_E2E=0` is allowed (matches existing
pattern); the test MUST run when the env var is set." We resolved
this internal contradiction in favour of the Constraints: the test
SKIPS cleanly when a prerequisite is missing (env var unset, real
video missing or placeholder-sized, console-script not installed),
and produces an honest PASS/FAIL verdict when all prerequisites
hold. The forbidden pattern is the `@xfail` mask that AZ-404 used
to hide AC-3 — that is NOT present anywhere in AZ-699.
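
The prerequisite gating described above could be sketched as a single helper that returns a skip reason. This is an illustration under assumptions rather than the code in `test_derkachi_real_tlog.py`: the function name, the exact byte cutoff standing in for "> 1 MB", and the reason strings are all hypothetical.

```python
import shutil
from pathlib import Path
from typing import Mapping, Optional

# "> 1 MB" per the FT-P-20 preconditions; the exact byte cutoff is an assumption.
MIN_REAL_VIDEO_BYTES = 1_000_000


def missing_prerequisite(
    env: Mapping[str, str],
    video_path: Path,
    console_script: str = "gps-denied-replay",
) -> Optional[str]:
    """Return a human-readable skip reason, or None when the test must run."""
    if env.get("RUN_REPLAY_E2E") != "1":
        return "RUN_REPLAY_E2E is not set to 1"
    if not video_path.is_file():
        return f"real flight video missing: {video_path}"
    if video_path.stat().st_size <= MIN_REAL_VIDEO_BYTES:
        return f"flight video is placeholder-sized: {video_path}"
    if shutil.which(console_script) is None:
        return f"console script not installed: {console_script}"
    return None
```

In a pytest test, a non-`None` reason would be handed to `pytest.skip(reason)`. Once the helper returns `None`, the verdict is asserted directly, with no `xfail` marker anywhere on the path.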

## Next batch

Batch 101 — **AZ-700** (replay map visualization). Depends on
AZ-697 (ground truth) and AZ-698 (alignment) — both now in
testing.
@@ -8,8 +8,8 @@ status: in_progress
 sub_step:
 phase: 6
 name: implement-tasks-sequentially
-detail: "batch 100 of ~102: AZ-699"
+detail: "batch 101 of ~102: AZ-700"
 retry_count: 0
 cycle: 2
 tracker: jira
-last_completed_batch: 99
+last_completed_batch: 100