gps-denied-onboard/_docs/02_document/tests/blackbox-tests.md
Oleksandr Bezdieniezhnykh 940066bee2 chore: WIP pre-implement
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-26 17:09:13 +03:00

# Blackbox Tests
All tests run from the `e2e-runner` container against the SUT through public boundaries only (frame source, FC inbound stream, tile cache mount, FC outbound observed via SITL, GCS observed via mavproxy-listener, FDR via post-run filesystem read). Two FC adapters parameterize every test that touches the FC contract: `ardupilot` and `inav`. Two `VioStrategy` modes parameterize Tier-1 product correctness tests: `okvis2` (production-default) and `klt_ransac` (mandatory simple-baseline). `vins_mono` is parameterized only when the research build is under test.
## Positive Scenarios
### FT-P-01: Still-image satellite-anchor frame-center accuracy
**Summary**: Validates the canonical satellite-anchor frame-center geolocation pipeline against the 60-image GT set.
**Traces to**: AC-1.1, AC-1.2
**Category**: Position Accuracy
**Preconditions**:
- `tile-cache-fixture` mounted at `/var/azaion/tile-cache`.
- SUT cold-started with no prior state; configured for the FC adapter under test.
**Input data**: `still-image-set-60` (per `test-data.md`).
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | For each image `AD0000NN.jpg` in order, write the frame to the SUT's frame-source path and wait up to 5 s for the corresponding outbound `GPS_INPUT` (AP) / `MSP2_SENSOR_GPS` (iNav) message at the SITL listener | One outbound message per input image; payload includes WGS84 lat/lon |
| 2 | Compute Vincenty geodesic distance between estimated lat/lon and `coordinates.csv` GT row for that image | Per-image error ≤ 50 m for ≥80% of images, ≤ 20 m for ≥50% |
| 3 | Capture per-image error to `e2e-results/run-${RUN_ID}/ft-p-01.csv` | CSV produced with one row per image |
**Expected outcome**: aggregate `pass_count(error≤50m) ≥ 48` AND `pass_count(error≤20m) ≥ 30` (matching the rule in `expected_results/results_report.md`).
**Max execution time**: 5 min (60 images × ~5 s including SITL round trip).
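The aggregate rule above can be sketched as follows. This is a minimal illustration, not the runner's implementation: it uses a haversine great-circle distance as a simpler stand-in for the Vincenty geodesic named in step 2 (the two differ by well under a metre at these ranges), and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; haversine stand-in, NOT Vincenty


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def ft_p_01_gate(errors_m):
    """Aggregate rule for the 60-image set: >= 48 within 50 m AND >= 30 within 20 m."""
    within_50 = sum(1 for e in errors_m if e <= 50.0)
    within_20 = sum(1 for e in errors_m if e <= 20.0)
    return within_50 >= 48 and within_20 >= 30
```

The counts 48 and 30 are the 80% and 50% shares of the 60-image set, written as integers to avoid floating-point comparisons at the boundary.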
---
### FT-P-02: Derkachi VIO drift between satellite anchors
**Summary**: Validates cumulative drift between consecutive satellite-anchored fixes during the Derkachi flight replay.
**Traces to**: AC-1.3
**Category**: Position Accuracy
**Preconditions**:
- `tile-cache-fixture` mounted (covers Derkachi route).
- SUT cold-started; FC adapter under test connected via SITL; `data_imu.csv` replayed at 10 Hz into FC IMU stream.
**Input data**: `derkachi-fixture` video at 30 fps + IMU CSV at 10 Hz.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Start synchronized video + IMU replay (3 video frames per IMU row) | SUT begins emitting estimates at the SUT's runtime cadence |
| 2 | At each frame whose outbound estimate carries `source_label = satellite_anchored`, record the propagated centre estimate of the prior visual-only segment AND the new anchor centre | Two values per anchor pair captured |
| 3 | Compute per-anchor-pair drift = ‖propagated_centre − next_anchor_centre‖. Bin by `last_satellite_anchor_age_ms`. | Bins populated; CSV emitted |
**Expected outcome**: Across all anchor pairs, at least 95% satisfy `drift < 100 m` (visual-only) AND `drift < 50 m` (when CombinedImuFactor IMU fusion is active in C5). Drift distribution monotonically grows with anchor age, with no anomalous spike.
**Max execution time**: 10 min (8 min replay + parsing).
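A minimal sketch of the drift computation and the 95% gate, assuming both centres have already been projected into a local metric frame (east, north in metres). The function names and the 5000 ms bin width are hypothetical choices for illustration.

```python
import math


def drift_m(propagated_xy, anchor_xy):
    """Euclidean distance between the propagated centre and the new anchor centre."""
    return math.hypot(propagated_xy[0] - anchor_xy[0], propagated_xy[1] - anchor_xy[1])


def drift_gate(drifts_m, limit_m=100.0, min_percent=95):
    """True when at least min_percent of anchor pairs have drift strictly below limit_m."""
    if not drifts_m:
        return False
    ok = sum(1 for d in drifts_m if d < limit_m)
    return ok * 100 >= min_percent * len(drifts_m)


def bin_by_anchor_age(pairs, bin_ms=5000):
    """Group (last_satellite_anchor_age_ms, drift_m) pairs into fixed-width age bins."""
    bins = {}
    for age_ms, d in pairs:
        bins.setdefault(age_ms // bin_ms * bin_ms, []).append(d)
    return dict(sorted(bins.items()))
```

Calling `drift_gate(drifts, limit_m=50.0)` covers the IMU-fusion case with the tighter limit.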
---
### FT-P-03: Estimate output schema and source-label semantics
**Summary**: Validates the SUT's outbound estimate carries every required field with correct types and the source label is one of the three allowed values.
**Traces to**: AC-1.4, AC-4.3
**Category**: Position Accuracy / FC Contract
**Preconditions**:
- One image from `still-image-set-60` already loaded into the cache fixture.
- SUT cold-started.
**Input data**: any single image (default `AD000001.jpg`).
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Push the image to the frame source | SUT emits one outbound `GPS_INPUT` (AP) / `MSP2_SENSOR_GPS` (iNav) AND one out-of-band channel message (MAVLink `STATUSTEXT` or `NAMED_VALUE_FLOAT` per AC-4.3) carrying the source label |
| 2 | Read the SITL-side fields | Schema match: `lat`, `lon`, `cov_semi_major_m`, `last_satellite_anchor_age_ms` present and well-typed |
| 3 | Read the out-of-band label channel | Label ∈ `{satellite_anchored, visual_propagated, dead_reckoned}` |
**Expected outcome**: Schema check passes AND label is in the allowed set.
**Max execution time**: 30 s.
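The schema and label checks in steps 2 and 3 might look like the sketch below. The field names come from the table; the Python types assigned to them are assumptions, since the spec only says "well-typed", and `check_estimate` is a hypothetical helper.

```python
ALLOWED_LABELS = {"satellite_anchored", "visual_propagated", "dead_reckoned"}

# Assumed types: the spec lists the fields but does not pin their types.
REQUIRED_FIELDS = {
    "lat": float,
    "lon": float,
    "cov_semi_major_m": float,
    "last_satellite_anchor_age_ms": int,
}


def check_estimate(fields, label):
    """Return a list of problems; an empty list means the estimate passes."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in fields:
            problems.append(f"missing field: {name}")
        elif isinstance(fields[name], bool) or not isinstance(fields[name], expected):
            # bool is rejected explicitly because it is a subclass of int in Python.
            problems.append(f"wrong type for {name}: {type(fields[name]).__name__}")
    if label not in ALLOWED_LABELS:
        problems.append(f"label not allowed: {label}")
    return problems
```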
---
### FT-P-04: Derkachi frame-to-frame registration success rate
**Summary**: Validates frame-to-frame registration succeeds for ≥95% of "normal" segments of the Derkachi flight.
**Traces to**: AC-2.1a
**Category**: Image Processing
**Preconditions**:
- SUT cold-started; FC adapter and VioStrategy both parameterized.
**Input data**: `derkachi-fixture` (full duration). "Normal" segments derived per AC-2.1a: nadir ±10° bank/pitch (estimated from `SCALED_IMU2`-derived attitude), ≥40% inferred prior-frame overlap (heuristic from frame-to-frame translation magnitude).
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Replay the Derkachi fixture | SUT emits per-frame registration-success metric (exposed via `NAMED_VALUE_FLOAT` or in FDR per AC-NEW-3) |
| 2 | After replay, compute success-ratio over normal segments only | Success ratio ≥ 0.95 |
**Expected outcome**: ≥95% on normal segments. Sharp-turn segments (excluded from this denominator) are exercised separately by FT-N-02.
**Max execution time**: 12 min.
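A sketch of the "normal segment" denominator, assuming each frame record has already been annotated with attitude and overlap estimates. The dictionary keys (`bank_deg`, `pitch_deg`, `overlap`, `registered`) are hypothetical names chosen for illustration.

```python
def is_normal(frame):
    """AC-2.1a 'normal' frame: within 10 degrees of nadir and at least 40% overlap."""
    return (
        abs(frame["bank_deg"]) <= 10.0
        and abs(frame["pitch_deg"]) <= 10.0
        and frame["overlap"] >= 0.40
    )


def normal_success_ratio(frames):
    """Registration success ratio over normal frames only; sharp turns are excluded."""
    normal = [f for f in frames if is_normal(f)]
    if not normal:
        raise ValueError("no normal frames in replay")
    return sum(1 for f in normal if f["registered"]) / len(normal)
```

Frames that fail the attitude or overlap test never enter the denominator, so failed registrations during a sharp turn do not count against the 0.95 threshold.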
---
### FT-P-05: Satellite-anchor cross-domain registration
**Summary**: Validates the satellite-anchor (UAV→satellite cross-domain) matcher succeeds with the cross-domain MRE budget.
**Traces to**: AC-2.1b, AC-2.2
**Category**: Image Processing
**Preconditions**:
- `tile-cache-fixture` includes the still-image footprints.
- SUT cold-started.
**Input data**: `still-image-set-60` plus `still-image-sat-refs-2` (for the 2 images with paired `_gmaps.png`).
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | For each still image, push to frame source | One satellite-anchor result per image |
| 2 | Read per-frame MRE (via FDR or `NAMED_VALUE_FLOAT`) | MRE recorded |
| 3 | Aggregate per-image accuracy AND MRE distribution | All images: MRE < 2.5 px; ≥80% within 50 m of GT; ≥50% within 20 m of GT |
**Expected outcome**: AC-1.1, AC-1.2, AC-2.1b, AC-2.2 all satisfied.
**Max execution time**: 5 min.
---
### FT-P-06: Mean Reprojection Error budgets (frame-to-frame + cross-domain)
**Summary**: Validates the two MRE budgets are honored.
**Traces to**: AC-2.2
**Preconditions**: Same as FT-P-04 + FT-P-05.
**Input data**: `derkachi-fixture` (frame-to-frame MRE) + `still-image-set-60` (cross-domain MRE).
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Run FT-P-04 and FT-P-05 in sequence; collect per-frame MRE from both runs | MRE values captured |
| 2 | Aggregate by domain (frame-to-frame vs satellite-anchored) | Distribution per domain |
**Expected outcome**: Frame-to-frame MRE < 1.0 px (95th percentile); cross-domain MRE < 2.5 px (95th percentile).
**Max execution time**: piggybacks on FT-P-04 / FT-P-05.
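The two percentile budgets could be evaluated as in this sketch. The spec does not say which percentile definition applies, so the nearest-rank method used here is an assumption, as are the function names.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def mre_budgets_ok(frame_to_frame_px, cross_domain_px):
    """Frame-to-frame p95 < 1.0 px AND cross-domain p95 < 2.5 px."""
    return (
        percentile_nearest_rank(frame_to_frame_px, 95) < 1.0
        and percentile_nearest_rank(cross_domain_px, 95) < 2.5
    )
```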
---
### FT-P-07: Sharp-turn recovery via satellite reference
**Summary**: Validates that frames during sharp turns may fail frame-to-frame but recover via satellite-reference re-localization.
**Traces to**: AC-3.2
**Preconditions**:
- Sharp-turn segment of the Derkachi flight identified by gyro_z spikes in `SCALED_IMU2`. (If Derkachi has no sharp turn meeting AC-3.2 thresholds, fall back to a synthetic gyro overlay; flag in FDR.)
**Input data**: `derkachi-fixture` filtered to sharp-turn segment(s).
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Replay sharp-turn segment | SUT emits `source_label = visual_propagated` or `dead_reckoned` during turn |
| 2 | After turn, observe next satellite-anchor attempt | Recovery: `source_label = satellite_anchored` returns within 3 frames of turn end; drift ≤ 200 m, heading change handled |
**Expected outcome**: Recovery within 3 frames; <200 m drift; <70° heading change handled.
**Max execution time**: 5 min (per turn segment, multiple per replay).
---
### FT-P-08: Multi-segment satellite-reference re-localization
**Summary**: Validates ≥3 disconnected segments per flight handled via satellite-reference re-localization.
**Traces to**: AC-3.3
**Preconditions**:
- `multi-segment-derkachi` synthetic fixture generated with 3+ blackout windows.
**Input data**: `multi-segment-derkachi`.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Replay with injected blackout windows | SUT emits `dead_reckoned` during each blackout |
| 2 | At end of each blackout, observe re-localization | `source_label` returns to `satellite_anchored` within 3 frames; trajectory continuity preserved (no >100 m jump) |
**Expected outcome**: All 3+ segments re-localized successfully; no trajectory jump exceeds 100 m.
**Max execution time**: 12 min.
---
### FT-P-09-AP: ArduPilot Plane GPS_INPUT contract conformance + signing
**Summary**: Validates `GPS_INPUT` reaches AP SITL, AP EKF accepts it as primary GPS, and MAVLink 2.0 message signing handshake completes per D-C8-9.
**Traces to**: AC-4.3 (AP), D-C8-9, AC-NEW-2 (precondition)
**Category**: FC Contract / Security
**Preconditions**:
- `ardupilot-plane-sitl` running with `GPS_TYPE=14`.
- `mavlink-passkey` loaded as Docker secret into SUT.
**Input data**: `derkachi-fixture` (any 60 s segment).
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Start SUT with FC adapter `ardupilot` | Signing handshake completes within 5 s; signed channel established |
| 2 | Replay 60 s of Derkachi | SUT emits signed `GPS_INPUT` at the configured rate |
| 3 | Read AP `EK3_SRC1_POSXY` parameter via MAVPROXY | Value reads `3` (GPS source) |
| 4 | Read AP-side GPS health via `GPS_RAW_INT` | Fix type ≥ 3 (3D fix), HDOP within nominal |
**Expected outcome**: Signing handshake succeeds; AP EKF on GPS source-set; GPS_RAW_INT shows healthy fix.
**Max execution time**: 90 s.
---
### FT-P-09-iNav: iNav MSP2_SENSOR_GPS contract conformance
**Summary**: Validates `MSP2_SENSOR_GPS` reaches iNav SITL and iNav GPS provider accepts it as the sole source.
**Traces to**: AC-4.3 (iNav)
**Category**: FC Contract
**Preconditions**:
- `inav-sitl` running with GPS provider configured to MSP per `docs/SITL/SITL.md`.
**Input data**: `derkachi-fixture` (any 60 s segment).
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Start SUT with FC adapter `inav` | TCP connection to `inav-sitl:5760` established |
| 2 | Replay 60 s of Derkachi | SUT emits `MSP2_SENSOR_GPS` (ID 0x1F03) frames at 5 Hz |
| 3 | Read iNav GPS state via MSP query | `gpsSol.fixType` ≥ 3, `gpsSol.numSat` reflects emitted value, provider=MSP |
**Expected outcome**: iNav GPS state reflects emitted frames; no fallback to internal GPS.
**Max execution time**: 90 s.
---
### FT-P-10: GTSAM smoothing-loop look-back accuracy (IT-11)
**Summary**: Validates the smoothing-loop's past-keyframe pose estimates improve over raw single-shot estimates (Mode B Fact #107). NOT validated as FC-side retroactive correction.
**Traces to**: AC-4.5 (revised scope per Mode B), Mode B Fact #107
**Category**: Position Accuracy / Internal smoothing
**Preconditions**:
- SUT cold-started; FDR enabled.
**Input data**: `derkachi-fixture` full replay.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Replay Derkachi end-to-end | FDR contains per-keyframe (a) raw single-shot pose at first emission, (b) smoothed pose at iSAM2 convergence |
| 2 | After replay, parse FDR; for each past keyframe compute distance(raw, GT) and distance(smoothed, GT) | Per-keyframe pair extracted |
| 3 | Aggregate across keyframes | smoothed_error < raw_error for ≥80% of keyframes; mean improvement ≥ 5 m |
**Expected outcome**: Internal smoothing improves past-keyframe accuracy; FC-side retroactive correction NOT exercised (out of scope per Mode B revision A6).
**Max execution time**: 12 min.
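Step 3's aggregation can be sketched as below, assuming the per-keyframe distances to ground truth have already been extracted from the FDR as two equally ordered lists. The helper name is hypothetical.

```python
def smoothing_gate(raw_err_m, smoothed_err_m):
    """Smoothed error beats raw error on >= 80% of keyframes AND mean gain >= 5 m."""
    pairs = list(zip(raw_err_m, smoothed_err_m))
    if not pairs:
        return False
    improved = sum(1 for raw, smoothed in pairs if smoothed < raw)
    mean_gain_m = sum(raw - smoothed for raw, smoothed in pairs) / len(pairs)
    return improved * 100 >= 80 * len(pairs) and mean_gain_m >= 5.0
```

Both conditions must hold: a run where every keyframe improves by only 2 m fails the mean-improvement clause even though the share clause passes.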
---
### FT-P-11: Cold-start initialization from FC EKF
**Summary**: Validates SUT initialization from FC EKF's last valid GPS + IMU-extrapolated position at GPS denial.
**Traces to**: AC-5.1
**Category**: Startup
**Preconditions**:
- `cold-boot-fixture` provides a frozen FC pose snapshot.
- SUT not running.
**Input data**: `cold-boot-fixture`.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Start `ardupilot-plane-sitl` (or `inav-sitl`) with the frozen-pose snapshot loaded | SITL EKF reflects the snapshot pose |
| 2 | Start SUT | SUT queries FC EKF; reads pose; initializes |
| 3 | Push first nav-camera frame | First outbound estimate's lat/lon within ±50 m of the FC EKF snapshot pose |
**Expected outcome**: First emitted estimate uses FC EKF's pose as prior, within ±50 m tolerance.
**Max execution time**: 60 s.
---
### FT-P-12: GCS downsample at 1-2 Hz
**Summary**: Validates position estimates + confidence stream to the GCS (via `mavproxy-listener`) at 1-2 Hz.
**Traces to**: AC-6.1
**Category**: GCS / Telemetry
**Preconditions**:
- `mavproxy-listener` running and capturing to `.tlog`.
**Input data**: `derkachi-fixture` 60 s segment.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Start replay | SUT emits to FC at runtime cadence (~3 Hz) AND to GCS at 1-2 Hz |
| 2 | After replay, parse `.tlog` for SUT-emitted GCS messages over the 60 s window | GCS rate within [1, 2] Hz inclusive |
**Expected outcome**: GCS-side rate observed in [1, 2] Hz over the window.
**Max execution time**: 90 s.
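A sketch of the rate check in step 2, assuming the `.tlog` has already been parsed into a list of SUT-emitted message timestamps in seconds. Parsing is out of scope here, and the function names are hypothetical.

```python
def gcs_rate_hz(timestamps_s, window_start_s, window_s=60.0):
    """Mean message rate over [window_start_s, window_start_s + window_s)."""
    window_end_s = window_start_s + window_s
    count = sum(1 for t in timestamps_s if window_start_s <= t < window_end_s)
    return count / window_s


def gcs_rate_ok(timestamps_s, window_start_s, window_s=60.0):
    """True when the observed rate lies in the inclusive [1, 2] Hz band."""
    return 1.0 <= gcs_rate_hz(timestamps_s, window_start_s, window_s) <= 2.0
```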
---
### FT-P-13: GCS command path (operator re-loc hint)
**Summary**: Validates that GCS-originated commands (via standard MAVLink) can carry operator re-loc hints to the SUT.
**Traces to**: AC-6.2
**Category**: GCS / Telemetry
**Preconditions**:
- `mavproxy-listener` configured to send commands.
- SUT in `dead_reckoned` state (e.g. mid-blackout from FT-N-03 setup).
**Input data**: synthesized `STATUSTEXT` carrying re-loc hint from MAVPROXY.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | While SUT is in `dead_reckoned`, send re-loc-hint STATUSTEXT from MAVPROXY | SUT acknowledges the hint via FDR log entry |
| 2 | Push next nav-camera frame after hint | Next satellite-anchor attempt uses hint as a search prior |
**Expected outcome**: Hint received; next anchor attempt biases search; no rejection.
**Max execution time**: 60 s.
---
### FT-P-14: WGS84 output coordinate system
**Summary**: Validates that output coordinates are WGS84 latitude/longitude, encoded per the ArduPilot/iNav GPS convention as integers in units of 1e-7 degrees (degE7).
**Traces to**: AC-6.3
**Preconditions**: any FT-P-01 / FT-P-02 run.
**Input data**: any.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Capture one outbound `GPS_INPUT` / `MSP2_SENSOR_GPS` from SITL | Lat/lon present; values in valid WGS84 range; scaled per protocol convention |
**Expected outcome**: Coordinates parse as WGS84 within Earth bounds.
**Max execution time**: 30 s.
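A minimal sketch of the range check, assuming the captured message exposes latitude and longitude as integers scaled by 1e7. The helper names are hypothetical.

```python
def dege7_to_deg(raw):
    """Convert a degrees-times-1e7 integer to decimal degrees."""
    return raw / 1e7


def is_valid_wgs84(lat_e7, lon_e7):
    """True when the decoded position lies within Earth bounds."""
    lat, lon = dege7_to_deg(lat_e7), dege7_to_deg(lon_e7)
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0
```

A value sent in plain degrees by mistake (for example `50` instead of `500820000`) still passes the bounds check, so a stricter variant would also compare the decoded position against the expected operating area.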
---
### FT-P-15: Tile cache schema and resolution floor
**Summary**: Validates the tile cache manifest carries every required field and tiles meet the ≥0.5 m/px floor.
**Traces to**: AC-8.1, RESTRICT-SAT-2 (manifest schema)
**Preconditions**: `tile-cache-fixture` mounted.
**Input data**: tile cache.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | The SUT exposes a one-time cache-load self-check at startup; observe via FDR | Each tile manifest entry has CRS, tile matrix, dimension, lat-adjusted m/px, capture date, source, compression |
| 2 | Inspect m/px values | All ≥ 0.5 m/px; reject below floor |
**Expected outcome**: All loaded tiles pass schema check and resolution floor.
**Max execution time**: 30 s.
---
### FT-P-16: Pre-loaded cache (offline-only interface)
**Summary**: Validates the SUT loads tiles from the local cache only, with no in-flight Service calls.
**Traces to**: AC-8.3, RESTRICT-SAT-1
**Preconditions**: `tile-cache-fixture` mounted; `e2e-net` `internal: true` enforced (no internet egress).
**Input data**: `derkachi-fixture` 60 s segment.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Start replay | SUT serves tiles from `/var/azaion/tile-cache` only |
| 2 | Observe network egress counter on `gps-denied-onboard` container | All egress to non-`e2e-net` destinations is 0 (paired with NFT-SEC-02) |
**Expected outcome**: 0 external egress; replay completes against local cache.
**Max execution time**: 90 s.
---
### FT-P-17: Mid-flight tile generation
**Summary**: Validates the SUT continuously orthorectifies nav-camera frames into basemap-projected tiles, deduplicates them, and stores them locally for landing-time upload.
**Traces to**: AC-8.4
**Preconditions**: empty `mid-flight-tile-output` directory in the FDR volume; mock-suite-sat-service running.
**Input data**: `derkachi-fixture` 5 min segment.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Start replay | SUT generates and writes tiles to FDR's `mid-flight-tile-output/` |
| 2 | After replay, read tiles | ≥1 tile per ~3 s of high-quality nav frames; each tile carries quality metadata sufficient for the Service voting layer (per Mode B Fact #105) |
| 3 | Simulate landing event; SUT uploads to `mock-suite-sat-service` | Mock service receives all tiles with HTTP 202 |
**Expected outcome**: Tiles produced + deduplicated + uploaded with quality metadata.
**Max execution time**: 8 min.
---
### FT-P-18: No raw nav/AI-cam frame retention (storage policy)
**Summary**: Validates that no raw nav-camera or AI-camera frames are retained except the ≤0.1 Hz failed-tile-gen thumbnail log.
**Traces to**: AC-8.5
**Preconditions**: `derkachi-fixture` replay just completed.
**Input data**: post-replay state of FDR + tile cache.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Walk the FDR + tile cache for any file matching nav-camera raw-frame pattern (JPEG/RAW with original dimensions) | Only the failed-tile-gen thumbnail log files present (≤0.1 Hz cadence) |
| 2 | Verify thumbnail log is bounded by AC-NEW-3 FDR budget | Total thumbnail log < 1 GB over 8 h (NFT-LIM-02 cross-check) |
**Expected outcome**: 0 unauthorized raw frames retained.
**Max execution time**: 30 s (filesystem walk).
---
### FT-P-19: Satellite relocalization scale-ratio + scene-change
**Summary**: Validates UAV-frame ground footprint at deployment altitude is retrievable from cache regardless of internal tiling. Scene-change subset is reduced-confidence (PARTIAL — see traceability matrix).
**Traces to**: AC-8.6 (scale-ratio FULL; scene-change PARTIAL)
**Preconditions**: `tile-cache-fixture` mounted with multi-zoom-level coverage.
**Input data**: `still-image-set-60` (scale-ratio); the 2 paired `_gmaps.png` images (scene-change subset).
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | For each still image, query cache top-K=10 retrieval | Top-K result includes a tile whose centre is within 100 m of the image's true centre (scale-ratio satisfied) |
| 2 | For the 2 paired images, run cross-domain matcher against the `_gmaps.png` reference | Scale-ratio match succeeds; scene-change behavior recorded (PARTIAL — full coverage requires a labeled change-pair dataset, deferred under D-PROJ-3) |
**Expected outcome**: Scale-ratio passes for 60/60; scene-change recorded as PARTIAL.
**Max execution time**: 5 min.
---
## Negative Scenarios
### FT-N-01: 350 m outlier injection tolerance
**Summary**: Validates the system tolerates up to 350 m outliers between two consecutive frames with airframe tilt up to ±20°.
**Traces to**: AC-3.1, RESTRICT-CAM-1 (nadir camera, tilt limits)
**Preconditions**: SUT running on `derkachi-fixture`; `outlier-injection-derkachi` injector primed in `medium` density.
**Input data**: `outlier-injection-derkachi` (medium).
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Start replay with injector active (every 10th frame replaced by far-away tile crop) | SUT detects outlier; rejects from anchor; estimate continues from prior valid state |
| 2 | Compare per-frame outbound estimate vs GT for non-outlier frames | Error_after_outlier ≤ error_before_outlier + 50 m; covariance grows monotonically across the outlier event |
**Expected outcome**: Outliers rejected; estimate degrades at most by 50 m drift; covariance monotonic.
**Max execution time**: 12 min.
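The two assertions in step 2 can be sketched as follows. "Grows monotonically" is read here as non-decreasing, which is an assumption, and the function name is hypothetical.

```python
def outlier_tolerated(error_before_m, error_after_m, cov_series_m, max_extra_m=50.0):
    """Error degrades by at most max_extra_m and covariance never shrinks across the event."""
    bounded = error_after_m <= error_before_m + max_extra_m
    monotonic = all(later >= earlier for earlier, later in zip(cov_series_m, cov_series_m[1:]))
    return bounded and monotonic
```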
---
### FT-N-02: Sharp-turn frame-to-frame failure expected
**Summary**: Negative twin of FT-P-07 — validates that during a sharp turn, frame-to-frame may LEGITIMATELY fail, and the system labels accordingly.
**Traces to**: AC-3.2 (negative path)
**Preconditions**: Same as FT-P-07.
**Input data**: sharp-turn segment of Derkachi (or synthetic gyro overlay).
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Replay sharp-turn segment | During turn frames: `source_label` ∈ `{visual_propagated, dead_reckoned}`; covariance grows |
| 2 | After turn, observe label transition | Label returns to `satellite_anchored` once next anchor succeeds |
**Expected outcome**: Sharp-turn frames correctly mark themselves as not-satellite-anchored; recovery exercised in FT-P-07.
**Max execution time**: 5 min.
---
### FT-N-03: Extended outage triggers operator re-loc request
**Summary**: Validates that on ≥3 consecutive frames AND ≥2 s without estimate, the SUT requests operator re-loc via telemetry and continues dead-reckoned propagation.
**Traces to**: AC-3.4
**Preconditions**: `derkachi-fixture` + 3-frame outage injector primed.
**Input data**: synthetic outage on Derkachi.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Trigger 3-consecutive-frame failure (corrupt frames) | SUT fails to produce estimates for 3+ frames |
| 2 | Wait ≥2 s | STATUSTEXT containing `OPERATOR_RELOC_REQUEST` emitted to `mavproxy-listener` |
| 3 | During outage, observe FC outbound | Estimates labeled `dead_reckoned` continue; FC uses last-known + IMU extrapolation |
**Expected outcome**: Re-loc request emitted; dead-reckoned estimates continue.
**Max execution time**: 60 s.
---
### FT-N-04: Visual blackout + spoofed GPS combined failsafe
**Summary**: Validates the AC-3.5 + AC-NEW-8 combined failsafe: switch label, reject spoof, propagate from last trusted state, monotonic covariance, STATUSTEXT.
**Traces to**: AC-3.5, AC-NEW-8
**Preconditions**: `blackout-spoof-derkachi` injector primed for 5 s, 15 s, 35 s windows; FC inbound stream patched to inject spoofed GPS.
**Input data**: `blackout-spoof-derkachi` (each window run as a sub-case).
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Begin blackout window AND inject spoofed GPS in same temporal window | Within ≤1 frame OR ≤400 ms: `source_label = dead_reckoned`; spoofed GPS rejected from estimator input; covariance grows monotonically |
| 2 | Observe `horiz_accuracy` field in outbound `GPS_INPUT` (AP) | `horiz_accuracy` ≥ 95% covariance semi-major axis (no under-reporting) |
| 3 | Observe GCS stream | `VISUAL_BLACKOUT_IMU_ONLY` STATUSTEXT at 1-2 Hz throughout blackout |
| 4 | For 35 s window only | Per AC-NEW-8: when the 95% covariance crosses 100 m → fix-quality degraded; when it crosses 500 m OR the blackout exceeds 30 s → `horiz_accuracy=999.0` AND `VISUAL_BLACKOUT_FAILSAFE` STATUSTEXT |
| 5 | End blackout; restore FC GPS-health | Recovery only after FC GPS-health stable + non-spoofed for ≥10 s AND a visual/satellite consistency check succeeds |
**Expected outcome**: All five steps' assertions pass for each window (step 4 applies to the 35 s window only).
**Max execution time**: 5 min (3 windows × ~90 s each).
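The AC-NEW-8 thresholds in step 4 can be read as a small decision function. This sketch treats "crosses" as strictly greater than, which is an assumption; the state names and helper functions are hypothetical.

```python
FAILSAFE_HORIZ_ACCURACY = 999.0


def blackout_fix_state(cov95_m, blackout_s):
    """Classify the blackout state from 95% covariance (m) and elapsed blackout time (s)."""
    if cov95_m > 500.0 or blackout_s > 30.0:
        return "failsafe"  # expect horiz_accuracy=999.0 and VISUAL_BLACKOUT_FAILSAFE
    if cov95_m > 100.0:
        return "degraded"  # expect a degraded fix quality
    return "nominal"


def expected_horiz_accuracy(cov95_m, blackout_s):
    """Lower bound the test expects for the reported horiz_accuracy field."""
    if blackout_fix_state(cov95_m, blackout_s) == "failsafe":
        return FAILSAFE_HORIZ_ACCURACY
    return cov95_m  # step 2: no under-reporting relative to the 95% covariance
```

Under this reading only the 35 s window can reach the failsafe state through elapsed time alone; the 5 s and 15 s windows reach it only if covariance exceeds 500 m.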
---
### FT-N-05: Stale-tile rejection on freshness violation
**Summary**: Validates that tiles violating AC-8.2 freshness window are rejected (or downgraded so they cannot produce a `satellite_anchored` label).
**Traces to**: AC-8.2, AC-NEW-6
**Preconditions**: `synth-age-tile-set` (`synth-age-7mo` for active-conflict, `synth-age-13mo` for rear) mounted instead of fresh fixture.
**Input data**: `still-image-set-60` against the aged cache.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Replay against `synth-age-7mo` (configure SUT for active-conflict sector) | SUT either rejects load OR loads but never emits `satellite_anchored` from these tiles |
| 2 | Replay against `synth-age-13mo` (configure SUT for rear sector) | Same: reject or non-`satellite_anchored` only |
**Expected outcome**: 0 frames emit `satellite_anchored` from aged tiles.
**Max execution time**: 5 min.
---
### FT-N-06: Mid-flight tile freshness (current-timestamped)
**Summary**: Validates that mid-flight-generated tiles are timestamped as current and treated as fresh per AC-NEW-6.
**Traces to**: AC-NEW-6 (positive sub-case)
**Preconditions**: empty `mid-flight-tile-output`.
**Input data**: `derkachi-fixture` 5 min segment.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Start replay | SUT generates mid-flight tiles |
| 2 | Inspect each generated tile's manifest entry | `capture_date` is within ±60 s of generation wall-clock; treated as fresh by the freshness gate |
**Expected outcome**: All mid-flight tiles current-timestamped and fresh.
**Max execution time**: 6 min.
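The ±60 s check in step 2 might be written as below, assuming `capture_date` has been parsed from the manifest into a timezone-aware `datetime`. The manifest's on-disk format is not specified here, and the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_current_timestamped(capture_date, generated_at, tolerance_s=60):
    """True when capture_date is within +/- tolerance_s of the generation wall-clock."""
    return abs(capture_date - generated_at) <= timedelta(seconds=tolerance_s)
```

Both arguments must be timezone-aware (or both naive); Python raises `TypeError` when subtracting a naive datetime from an aware one, which conveniently surfaces manifests written without a timezone.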
---
### FT-P-20: Real-flight validation runner — honest verdict + Markdown accuracy report
**Summary**: Runs the full `gps-denied-replay` against the **real** Derkachi binary tlog + flight video + AZ-702 factory-sheet camera calibration, computes the per-emission horizontal-error distribution, and writes a structured Markdown accuracy report. Replaces the AZ-404 `@xfail` mask on AC-3 with a real PASS/FAIL.
**Traces to**: AZ-699 AC-1..AC-3 (epic AZ-696 AC-3 — the 100 m / 80 % gate).
**Category**: Position Accuracy
**Preconditions**:
- `_docs/00_problem/input_data/flight_derkachi/derkachi.tlog` (real binary, multi-flight).
- `_docs/00_problem/input_data/flight_derkachi/flight_derkachi.mp4` (real recording, > 1 MB; the placeholder used by AZ-404 does not satisfy this gate).
- `_docs/00_problem/input_data/flight_derkachi/khp20s30_factory.json` (AZ-702 calibration).
- `gps-denied-replay` console-script installed.
- `RUN_REPLAY_E2E=1` (matches the existing AZ-404 gate).
**Input data**: real `derkachi.tlog` covers up to three sorties; the AZ-698 segmenter + `--auto-trim` locates the matching flight automatically.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Invoke `gps-denied-replay --auto-trim ...` with real fixtures | Subprocess exits 0 within the 15-min NFR budget |
| 2 | Parse JSONL emissions; pair each with the nearest-in-time ground-truth row (binary-tlog GPS via AZ-697) | Distribution computed: count, mean, p50, p95, p99, threshold-hit share at 10/25/50/100 m |
| 3 | Render the Markdown accuracy report and write `_docs/06_metrics/real_flight_validation_{YYYY-MM-DD}.md` | Report exists with header, run context, horizontal-error stats, threshold-hit table, and (when available) vertical-error stats |
| 4 | Evaluate the AC-3 gate: ≥ 80 % within 100 m | Verdict is PASS or honest FAIL — no `@xfail` mask |
| 5 | On FAIL, surface a failure message referencing the calibration acquisition method (factory-sheet / placeholder / unknown) and the residual budget | Operator can attribute the failure without re-reading the source |
**Expected outcome**: PASS when the estimator meets the epic AC-3 gate; honest FAIL otherwise. The Markdown report is the durable artefact (consumed by the cycle retrospective and downstream tuning work).
**Max execution time**: 15 min (matches AZ-699 NFR for a single Tier-2 Jetson run).
**Report artefact schema** (canonical, produced by `tests/e2e/replay/_report_writer.py`):
```markdown
# Real-flight validation — YYYY-MM-DD
**Verdict**: PASS | FAIL (AC-3 gate: ≥ 80 % within 100 m)
## Run context
- Tlog: `<path>`
- Video: `<path>`
- Calibration acquisition method: factory-sheet | placeholder | unknown
- Clip duration: <float> s
- Emissions consumed: <int>
- Ground-truth pairings: <int>
## Horizontal error (metres)
| Statistic | Value |
| --------- | ----- |
| Mean | <float> |
| p50 | <float> |
| p95 | <float> |
| p99 | <float> |
## Threshold-hit share
| Threshold (m) | Hit share (%) |
| ------------- | ------------- |
| 10 | <float> |
| 25 | <float> |
| 50 | <float> |
| 100 | <float> |
## Vertical error (metres)
| Statistic | Value |
| --------- | ----- |
| Mean | <float> |
| p50 | <float> |
| p95 | <float> |
| Samples | <int> |
```
The Vertical-error section is replaced by `_No emissions carried a comparable altitude — vertical stats skipped._` when none of the JSONL rows carry an `alt_m` field comparable to the ground-truth altitude.
**Skip semantics**: AZ-699 distinguishes between *missing-prerequisite skip* (cleanly skipped with the missing file's path) and *test-cannot-resolve mask* (`@xfail` — explicitly forbidden by AZ-699 AC-1). The AZ-404 1-min test's `@xfail` on AC-3 is unchanged (AZ-699 AC-4 is "add a new test, don't replace") — FT-P-20 is the honest replacement that runs alongside it.
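Steps 2 and 4 of FT-P-20 can be sketched as follows. This is an illustration under stated assumptions, not the contents of `_report_writer.py`: ground-truth timestamps are assumed sorted, "within 100 m" is read as less than or equal, and all function names are hypothetical.

```python
import bisect


def nearest_gt_index(gt_times_s, t):
    """Index of the ground-truth row nearest in time to t (gt_times_s sorted ascending)."""
    i = bisect.bisect_left(gt_times_s, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(gt_times_s)]
    return min(candidates, key=lambda j: abs(gt_times_s[j] - t))


def hit_share_pct(errors_m, threshold_m):
    """Percentage of emissions whose horizontal error is within threshold_m."""
    return 100.0 * sum(1 for e in errors_m if e <= threshold_m) / len(errors_m)


def ac3_verdict(errors_m):
    """Epic AC-3 gate: at least 80% of emissions within 100 m."""
    return "PASS" if hit_share_pct(errors_m, 100.0) >= 80.0 else "FAIL"
```

The same `hit_share_pct` helper fills the 10/25/50/100 m rows of the threshold-hit table in the report.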
---
### FT-P-21: End-to-end orchestrator pipeline from `(tlog, video, calibration)` only
**Summary**: Validates the full 7-step Epic AZ-835 pipeline — given only `(tlog, video, calibration)`, the system auto-extracts a `RouteSpec` (AZ-836), posts it to the real satellite-provider (AZ-838), builds the C6 FAISS index via the route-driven `operator_pre_flight_setup` fixture (AZ-839, supersedes the AZ-777 Phase 3 bbox-seeded placeholder), runs the airborne replay pipeline, and emits a horizontal-error verdict report. No operator hand-curation between steps. Closes the Epic AZ-835 narrative: "give it a tlog + video + calibration, and the system does everything else."
**Traces to**: AZ-840 AC-1..AC-8 (epic AZ-835 narrative); supplementary product-AC coverage on AC-1.1, AC-1.2, AC-8.3 (pre-loaded cache realised from route, not bbox).
**Category**: End-to-end Integration + Position Accuracy
**Preconditions**:
- Tier-2 Jetson with `@pytest.mark.tier2` + `RUN_REPLAY_E2E=1` env (explicit skip-reason naming the missing env var — no silent skip per AZ-840 AC-6).
- Real `satellite-provider` + `satellite-provider-postgres` services running in `docker-compose.test.jetson.yml` (lineage AZ-688 / AZ-691 / AZ-692; cycle-3 AZ-777 Phase 1 adapted C11 to the real `POST /api/satellite/tiles/inventory` + `GET /tiles/{z}/{x}/{y}` endpoints).
- `tests/e2e/replay/conftest.py::operator_pre_flight_setup` from AZ-839 (route-driven C6 population, supersedes the AZ-777 Phase 3 placeholder).
- `_docs/00_problem/input_data/flight_derkachi/derkachi.tlog` + `flight_derkachi.mp4` (real binary + real video >1 MB).
- `_docs/00_problem/input_data/flight_derkachi/khp20s30_factory.json` (AZ-702 factory-sheet camera calibration).
- `gps-denied-replay` console-script installed in the e2e-runner image (AZ-604).
- AZ-776 (eskf open-loop composition profile) landed; AZ-848 — Jetson `eskf_out_of_order` regression — currently blocks the heavy-AC path on Jetson, so FT-P-21 produces its first honest verdict once AZ-848 lands.
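
A precondition gate for the environment variable and input files listed above might look like the sketch below. The helper name, the reason strings, and the per-file minimum-size mapping are hypothetical; only `RUN_REPLAY_E2E` and the ">1 MB" requirement come from the preconditions.

```python
from pathlib import Path


def replay_skip_reason(env, required_files):
    """Return an explicit skip reason, or None when all prerequisites hold.

    env            -- mapping of environment variables (e.g. os.environ)
    required_files -- mapping of path -> minimum size in bytes (hypothetical shape)
    """
    if env.get("RUN_REPLAY_E2E") != "1":
        return "RUN_REPLAY_E2E=1 not set in environment"
    for path, min_bytes in required_files.items():
        p = Path(path)
        if not p.is_file():
            # Name the missing file so the skip is never silent.
            return f"missing prerequisite file: {p}"
        if p.stat().st_size < min_bytes:
            return f"prerequisite file smaller than {min_bytes} bytes: {p}"
    return None
```

In a pytest fixture the returned string would typically be passed to `pytest.skip(reason)`, so the missing prerequisite appears by name in the test report.
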
**Input data**: real `derkachi.tlog`, real `flight_derkachi.mp4`, factory-sheet camera calibration. AZ-836's `extract_route_from_tlog(tlog, max_waypoints=10)` derives the `RouteSpec` from the tlog itself; no operator authoring required.
**Steps**:
| Step | Consumer Action | Expected System Response |
|------|----------------|------------------------|
| 1 | Active-flight cut + tlog/video sync via AZ-405's `tlog_video_adapter` | Active segment located; tlog↔video offset resolved (`replay.compose_root.ready` logs `auto_sync_used` as `true` or `false`; AC-8 escape hatch honored). |
| 2 | On-fly frame + IMU extraction via `VideoFileFrameSource` + `TlogReplayFcAdapter` | Frame and IMU streams co-aligned per AZ-697 ground-truth invariants. |
| 3 | `extract_route_from_tlog(tlog, max_waypoints=10)` → `RouteSpec` | Route materially follows tlog trajectory; waypoints inside the Derkachi bbox (lat 50.0808..50.0832, lon 36.1070..36.1134) per AZ-836 AC-1. |
| 4 | `operator_pre_flight_setup` posts route via `SatelliteProviderRouteClient.seed_route`; satellite-provider downloads Google Maps tiles into C6 | Route registered; `mapsReady=true` within poll budget; `tile_count > 0`; warm fixture re-invocation within the same compose session ≤ 30 s (AZ-839 AC-2). |
| 5 | C10 `DescriptorBatcher` builds the FAISS HNSW NetVLAD index from the populated C6 | Three sidecar files (`.index` + `.sha256` + `.meta.json`) pass the AZ-306 triple-consistency check; tamper test raises `IndexUnavailableError` (AZ-839 AC-6). |
| 6 | Invoke airborne `gps-denied-replay` against the populated cache + tlog/video/calibration | Subprocess runs the per-frame loop end-to-end; emits JSONL outputs (currently blocked by AZ-848 — `eskf_out_of_order` at frame 3 fails the binary with exit 1 deterministically on the Derkachi 1-min clip). |
| 7 | Compute horizontal-error distribution via `helpers/accuracy_report.py` + `helpers/gps_compare.py`; write verdict report | `_docs/06_metrics/real_flight_validation_<YYYY-MM-DD>.md` exists with the honest distribution (PASS or FAIL on the AZ-696 100 m / 80 % gate — verdict emitted **regardless** of PASS/FAIL per AZ-840 AC-2). |
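
The step 3 bbox assertion can be illustrated with a small check. The sketch assumes waypoints are available as `(lat, lon)` tuples; the real `RouteSpec` shape is not specified here, and the helper name is hypothetical.

```python
# Derkachi bbox from step 3: (lat_min, lat_max, lon_min, lon_max).
DERKACHI_BBOX = (50.0808, 50.0832, 36.1070, 36.1134)


def waypoints_outside_bbox(waypoints, bbox=DERKACHI_BBOX):
    """Return the (lat, lon) waypoints that fall outside the bbox (edges inclusive)."""
    lat_min, lat_max, lon_min, lon_max = bbox
    return [
        (lat, lon)
        for lat, lon in waypoints
        if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max)
    ]
```

An empty result means every waypoint lies inside the bbox; a non-empty result lists the offenders for the assertion message.
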
**Expected outcome**: Verdict report exists with the honest horizontal-error distribution. Test PASSes iff the run meets the AZ-696 100 m / 80 % gate (≥ 80 % of ticks within 100 m of ground truth). Mid-pipeline failures (e.g., satellite-provider rejection at step 4, sidecar mismatch at step 5, ESKF divergence at step 6) fail LOUD with a clear error pointing at the failing step — no silent skip past a failure (AZ-840 AC-5).
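
The PASS/FAIL rule stated above reduces to a share comparison. A minimal sketch, assuming per-tick horizontal errors in metres and an inclusive comparison at the threshold (the function name is hypothetical):

```python
def meets_accuracy_gate(errors_m, threshold_m=100.0, min_share=0.80):
    """Return (passed, share) for the 100 m / 80 % style gate.

    An empty run raises instead of passing, so a pipeline that emitted
    no ticks cannot be mistaken for a success.
    """
    if not errors_m:
        raise ValueError("no ticks to evaluate")
    within = sum(1 for e in errors_m if e <= threshold_m)
    share = within / len(errors_m)
    return share >= min_share, share
```

The verdict report is written in both cases; only the test outcome depends on the returned flag.
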
**Max execution time**: 15 min wall-clock on the Derkachi clip (AZ-840 AC-4 soft target for first delivery; hard NFR set after first honest measurement is recorded in the verdict report).
**Relationship to existing tests**:
- FT-P-20 (AZ-699 real-flight runner) is preserved (AZ-840 AC-7): FT-P-21 reuses its verdict-report-writing path through `_report_writer.py` rather than superseding it. Either the two tests run side by side, or AZ-699's runner is wrapped by AZ-840's orchestrator with the verdict-writing path preserved.
- FT-P-15 + FT-P-16 (pre-loaded cache, AC-8.3) remain the canonical bbox-fixture tests; FT-P-21 is the route-driven supplementary test that exercises the same end-state (populated C6) via the production C11→satellite-provider path.
**Implemented as**: `tests/e2e/replay/test_az835_e2e_real_flight.py` (per AZ-840). Unit-tested orchestration helper: `tests/e2e/replay/test_e2e_orchestrator_unit.py` (17 tests covering parameter validation + error propagation between the 7 orchestration steps).
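
Error propagation between orchestration steps, as exercised by the unit tests, could follow a pattern like the one below. This is an illustrative sketch only: `PipelineStepError` and `run_steps` are hypothetical names, not the orchestrator's actual API.

```python
class PipelineStepError(RuntimeError):
    """Raised when a step fails; carries the failing step's index and name."""

    def __init__(self, index, name, cause):
        super().__init__(f"step {index} ({name}) failed: {cause!r}")
        self.index = index
        self.name = name


def run_steps(steps, state):
    """Run (name, fn) steps in order, threading state through each fn.

    The first failure stops the pipeline and is re-raised with the step
    number attached, so no later step runs past a failure.
    """
    for index, (name, fn) in enumerate(steps, start=1):
        try:
            state = fn(state)
        except Exception as exc:
            raise PipelineStepError(index, name, exc) from exc
    return state
```
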