gps-denied-onboard/_docs/02_document/tests/blackbox-tests.md
Oleksandr Bezdieniezhnykh dcde602f61 [AZ-699] Real-flight validation runner + Markdown accuracy report
New e2e test runs gps-denied-replay --auto-trim against the real
derkachi.tlog + flight video + AZ-702 calibration, computes the
horizontal-error distribution (mean/p50/p95/p99 + 10/25/50/100 m
threshold-hit share), writes _docs/06_metrics/real_flight_
validation_{date}.md, and asserts honest PASS/FAIL with no @xfail
mask. AZ-404's 1-min test is untouched (sibling, not replacement).

Extends gps_compare.py with HorizontalErrorDistribution +
percentile_sorted (numpy-equivalent linear interpolation). New
test helper _report_writer.py renders the canonical Markdown
schema documented as FT-P-20 in blackbox-tests.md.

16 new unit tests pin distribution arithmetic, verdict gate,
failure-message templating (references calibration acquisition
method per AC-3), and report layout. 129 passed in focused
regression, 3 skipped (real video / Tier-2 prerequisites).
Zero new mypy --strict errors.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-20 16:53:48 +03:00


Blackbox Tests

All tests run from the e2e-runner container against the SUT through public boundaries only (frame source, FC inbound stream, tile cache mount, FC outbound observed via SITL, GCS observed via mavproxy-listener, FDR via post-run filesystem read). Two FC adapters parameterize every test that touches the FC contract: ardupilot and inav. Two VioStrategy modes parameterize Tier-1 product correctness tests: okvis2 (production-default) and klt_ransac (mandatory simple-baseline). vins_mono is parameterized only when the research build is under test.

Positive Scenarios

FT-P-01: Still-image satellite-anchor frame-center accuracy

Summary: Validates the canonical satellite-anchor frame-center geolocation pipeline against the 60-image GT set. Traces to: AC-1.1, AC-1.2 Category: Position Accuracy

Preconditions:

  • tile-cache-fixture mounted at /var/azaion/tile-cache.
  • SUT cold-started with no prior state; configured for the FC adapter under test.

Input data: still-image-set-60 (per test-data.md).

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | For each image AD0000NN.jpg in order, write the frame to the SUT's frame-source path and wait up to 5 s for the corresponding outbound GPS_INPUT (AP) / MSP2_SENSOR_GPS (iNav) message at the SITL listener | One outbound message per input image; payload includes WGS84 lat/lon |
| 2 | Compute Vincenty geodesic distance between estimated lat/lon and coordinates.csv GT row for that image | Per-image error ≤ 50 m for ≥80% of images, ≤ 20 m for ≥50% |
| 3 | Capture per-image error to e2e-results/run-${RUN_ID}/ft-p-01.csv | CSV produced with one row per image |

Expected outcome: aggregate pass_count(error≤50m) ≥ 48 AND pass_count(error≤20m) ≥ 30 (matching the rule in expected_results/results_report.md). Max execution time: 5 min (60 images × ~5 s including SITL round trip).
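
The aggregate rule above can be sketched as follows. This is a minimal illustration, not the project's implementation: `haversine_m` is a spherical great-circle distance used here as a simpler stand-in for the Vincenty geodesic named in step 2, and both function names and the mean-radius constant are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation (assumption)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres. Stand-in for Vincenty, not equivalent to it."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def ft_p_01_gate(errors_m):
    """Apply the FT-P-01 rule: at least 48 errors within 50 m AND at least 30 within 20 m."""
    within_50 = sum(e <= 50.0 for e in errors_m)
    within_20 = sum(e <= 20.0 for e in errors_m)
    return within_50 >= 48 and within_20 >= 30
```

Counting against the absolute thresholds (48 and 30) rather than a share of received messages means an image that produced no outbound message counts as a miss.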


FT-P-02: Derkachi VIO drift between satellite anchors

Summary: Validates cumulative drift between consecutive satellite-anchored fixes during the Derkachi flight replay. Traces to: AC-1.3 Category: Position Accuracy

Preconditions:

  • tile-cache-fixture mounted (covers Derkachi route).
  • SUT cold-started; FC adapter under test connected via SITL; data_imu.csv replayed at 10 Hz into FC IMU stream.

Input data: derkachi-fixture video at 30 fps + IMU CSV at 10 Hz.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Start synchronized video + IMU replay (3 video frames per IMU row) | SUT begins emitting estimates at the SUT's runtime cadence |
| 2 | At each frame whose outbound estimate carries source_label = satellite_anchored, record the propagated centre estimate of the prior visual-only segment AND the new anchor centre | Two values per anchor pair captured |
| 3 | Compute per-anchor-pair drift = ‖propagated_centre − next_anchor_centre‖. Bin by last_satellite_anchor_age_ms. | Bins populated; CSV emitted |

Expected outcome: Across all anchor pairs, at least 95% satisfy drift < 100 m (visual-only) AND drift < 50 m (when CombinedImuFactor IMU fusion is active in C5). Drift distribution monotonically grows with anchor age, with no anomalous spike. Max execution time: 10 min (8 min replay + parsing).
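
A minimal sketch of the drift arithmetic, assuming both centres have already been projected into a local metric frame (east, north in metres). The function names and the 10 s bin width are hypothetical, and "monotonically grows" is read here as non-decreasing mean drift per age bin.

```python
import math


def drift_m(propagated_xy, anchor_xy):
    """Euclidean distance between the propagated centre and the new anchor centre."""
    return math.hypot(propagated_xy[0] - anchor_xy[0], propagated_xy[1] - anchor_xy[1])


def share_below(drifts_m, limit_m):
    """Fraction of anchor pairs whose drift is strictly below the limit."""
    if not drifts_m:
        raise ValueError("no anchor pairs captured")
    return sum(d < limit_m for d in drifts_m) / len(drifts_m)


def mean_drift_by_age_bin(samples, bin_ms=10_000):
    """samples: (last_satellite_anchor_age_ms, drift_m) pairs; returns means ordered by bin."""
    bins = {}
    for age_ms, drift in samples:
        bins.setdefault(age_ms // bin_ms, []).append(drift)
    return [sum(values) / len(values) for _, values in sorted(bins.items())]


def is_non_decreasing(values):
    """True when each value is at least as large as the one before it."""
    return all(b >= a for a, b in zip(values, values[1:]))
```

The gate would then be `share_below(drifts, 100.0) >= 0.95` for the visual-only case, with the limit swapped to 50.0 when IMU fusion is active.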


FT-P-03: Estimate output schema and source-label semantics

Summary: Validates the SUT's outbound estimate carries every required field with correct types and the source label is one of the three allowed values. Traces to: AC-1.4, AC-4.3 Category: Position Accuracy / FC Contract

Preconditions:

  • One image from still-image-set-60 already loaded into the cache fixture.
  • SUT cold-started.

Input data: any single image (default AD000001.jpg).

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Push the image to the frame source | SUT emits one outbound GPS_INPUT (AP) / MSP2_SENSOR_GPS (iNav) AND one out-of-band channel message (MAVLink STATUSTEXT or NAMED_VALUE_FLOAT per AC-4.3) carrying the source label |
| 2 | Read the SITL-side fields | Schema match: lat, lon, cov_semi_major_m, last_satellite_anchor_age_ms present and well-typed |
| 3 | Read the out-of-band label channel | Label ∈ {satellite_anchored, visual_propagated, dead_reckoned} |

Expected outcome: Schema check passes AND label is in the allowed set. Max execution time: 30 s.
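
The schema and label checks could look roughly like this. The field names and the three labels come from the steps above; the Python types assigned to each field and the `schema_errors` helper are assumptions made for illustration, since the document only says "well-typed".

```python
ALLOWED_LABELS = frozenset({"satellite_anchored", "visual_propagated", "dead_reckoned"})

# Expected types are an assumption; adjust to the decoded message representation.
REQUIRED_FIELDS = {
    "lat": float,
    "lon": float,
    "cov_semi_major_m": float,
    "last_satellite_anchor_age_ms": int,
}


def schema_errors(estimate, label):
    """Return a list of human-readable problems; an empty list means the check passes."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in estimate:
            errors.append(f"missing field: {name}")
            continue
        value = estimate[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            errors.append(f"wrong type for {name}: {type(value).__name__}")
    if label not in ALLOWED_LABELS:
        errors.append(f"label not allowed: {label}")
    return errors
```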


FT-P-04: Derkachi frame-to-frame registration success rate

Summary: Validates frame-to-frame registration succeeds for ≥95% of "normal" segments of the Derkachi flight. Traces to: AC-2.1a Category: Image Processing

Preconditions:

  • SUT cold-started; FC adapter and VioStrategy both parameterized.

Input data: derkachi-fixture (full duration). "Normal" segments derived per AC-2.1a: nadir ±10° bank/pitch (estimated from SCALED_IMU2-derived attitude), ≥40% inferred prior-frame overlap (heuristic from frame-to-frame translation magnitude).

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Replay the Derkachi fixture | SUT emits per-frame registration-success metric (exposed via NAMED_VALUE_FLOAT or in FDR per AC-NEW-3) |
| 2 | After replay, compute success-ratio over normal segments only | Success ratio ≥ 0.95 |

Expected outcome: ≥95% on normal segments. Sharp-turn segments (excluded from this denominator) are exercised separately by FT-N-02. Max execution time: 12 min.
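
As a sketch of how the denominator is restricted to "normal" frames, assuming per-frame attitude and overlap estimates are already available. `FrameRecord` and its fields are hypothetical names; only the ±10° and 40% limits come from the input-data definition above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameRecord:
    bank_deg: float    # estimated from SCALED_IMU2-derived attitude
    pitch_deg: float
    overlap: float     # inferred prior-frame overlap, 0.0 to 1.0
    registered: bool   # frame-to-frame registration succeeded


def is_normal(frame):
    """Normal segment per AC-2.1a: within ±10° of nadir and at least 40% overlap."""
    return abs(frame.bank_deg) <= 10.0 and abs(frame.pitch_deg) <= 10.0 and frame.overlap >= 0.40


def success_ratio(frames):
    """Registration success ratio over normal frames only."""
    normal = [f for f in frames if is_normal(f)]
    if not normal:
        raise ValueError("no normal frames in replay")
    return sum(f.registered for f in normal) / len(normal)
```

Sharp-turn frames fail `is_normal` and so never enter the denominator, which matches the exclusion described in the expected outcome.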


FT-P-05: Satellite-anchor cross-domain registration

Summary: Validates the satellite-anchor (UAV→satellite cross-domain) matcher succeeds with the cross-domain MRE budget. Traces to: AC-2.1b, AC-2.2 Category: Image Processing

Preconditions:

  • tile-cache-fixture includes the still-image footprints.
  • SUT cold-started.

Input data: still-image-set-60 plus still-image-sat-refs-2 (for the 2 images with paired _gmaps.png).

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | For each still image, push to frame source | One satellite-anchor result per image |
| 2 | Read per-frame MRE (via FDR or NAMED_VALUE_FLOAT) | MRE recorded |
| 3 | Aggregate per-image accuracy AND MRE distribution | All images: MRE < 2.5 px; ≥80% within 50 m of GT; ≥50% within 20 m of GT |

Expected outcome: AC-1.1, AC-1.2, AC-2.1b, AC-2.2 all satisfied. Max execution time: 5 min.


FT-P-06: Mean Reprojection Error budgets (frame-to-frame + cross-domain)

Summary: Validates the two MRE budgets are honored. Traces to: AC-2.2

Preconditions: Same as FT-P-04 + FT-P-05.

Input data: derkachi-fixture (frame-to-frame MRE) + still-image-set-60 (cross-domain MRE).

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Run FT-P-04 and FT-P-05 in sequence; collect per-frame MRE from both runs | MRE values captured |
| 2 | Aggregate by domain (frame-to-frame vs satellite-anchored) | Distribution per domain |

Expected outcome: Frame-to-frame MRE < 1.0 px (95th percentile); cross-domain MRE < 2.5 px (95th percentile). Max execution time: piggybacks on FT-P-04 / FT-P-05.


FT-P-07: Sharp-turn recovery via satellite reference

Summary: Validates that frames during sharp turns may fail frame-to-frame but recover via satellite-reference re-localization. Traces to: AC-3.2

Preconditions:

  • Sharp-turn segment of the Derkachi flight identified by gyro_z spikes in SCALED_IMU2. (If Derkachi has no sharp turn meeting AC-3.2 thresholds, fall back to a synthetic gyro overlay; flag in FDR.)

Input data: derkachi-fixture filtered to sharp-turn segment(s).

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Replay sharp-turn segment | SUT emits source_label = visual_propagated or dead_reckoned during turn |
| 2 | After turn, observe next satellite-anchor attempt | Recovery: source_label = satellite_anchored returns within 3 frames of turn end; drift ≤ 200 m, heading change handled |

Expected outcome: Recovery within 3 frames; <200 m drift; <70° heading change handled. Max execution time: 5 min (per turn segment, multiple per replay).


FT-P-08: Multi-segment satellite-reference re-localization

Summary: Validates ≥3 disconnected segments per flight handled via satellite-reference re-localization. Traces to: AC-3.3

Preconditions:

  • multi-segment-derkachi synthetic fixture generated with 3+ blackout windows.

Input data: multi-segment-derkachi.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Replay with injected blackout windows | SUT emits dead_reckoned during each blackout |
| 2 | At end of each blackout, observe re-localization | source_label returns to satellite_anchored within 3 frames; trajectory continuity preserved (no >100 m jump) |

Expected outcome: All 3+ segments re-localized successfully; no trajectory jump exceeds 100 m. Max execution time: 12 min.


FT-P-09-AP: ArduPilot Plane GPS_INPUT contract conformance + signing

Summary: Validates GPS_INPUT reaches AP SITL, AP EKF accepts it as primary GPS, and MAVLink 2.0 message signing handshake completes per D-C8-9. Traces to: AC-4.3 (AP), D-C8-9, AC-NEW-2 (precondition) Category: FC Contract / Security

Preconditions:

  • ardupilot-plane-sitl running with GPS_TYPE=14.
  • mavlink-passkey loaded as Docker secret into SUT.

Input data: derkachi-fixture (any 60 s segment).

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Start SUT with FC adapter ardupilot | Signing handshake completes within 5 s; signed channel established |
| 2 | Replay 60 s of Derkachi | SUT emits signed GPS_INPUT at the configured rate |
| 3 | Read AP EK3_SRC1_POSXY parameter via MAVPROXY | Value reads 3 (GPS source) |
| 4 | Read AP-side GPS health via GPS_RAW_INT | Fix type ≥ 3 (3D fix), HDOP within nominal |

Expected outcome: Signing handshake succeeds; AP EKF on GPS source-set; GPS_RAW_INT shows healthy fix. Max execution time: 90 s.


FT-P-09-iNav: iNav MSP2_SENSOR_GPS contract conformance

Summary: Validates MSP2_SENSOR_GPS reaches iNav SITL and iNav GPS provider accepts it as the sole source. Traces to: AC-4.3 (iNav) Category: FC Contract

Preconditions:

  • inav-sitl running with GPS provider configured to MSP per docs/SITL/SITL.md.

Input data: derkachi-fixture (any 60 s segment).

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Start SUT with FC adapter inav | TCP connection to inav-sitl:5760 established |
| 2 | Replay 60 s of Derkachi | SUT emits MSP2_SENSOR_GPS (ID 0x1F03) frames at 5 Hz |
| 3 | Read iNav GPS state via MSP query | gpsSol.fixType ≥ 3, gpsSol.numSat reflects emitted value, provider=MSP |

Expected outcome: iNav GPS state reflects emitted frames; no fallback to internal GPS. Max execution time: 90 s.


FT-P-10: GTSAM smoothing-loop look-back accuracy (IT-11)

Summary: Validates the smoothing-loop's past-keyframe pose estimates improve over raw single-shot estimates (Mode B Fact #107). NOT validated as FC-side retroactive correction. Traces to: AC-4.5 (revised scope per Mode B), Mode B Fact #107 Category: Position Accuracy / Internal smoothing

Preconditions:

  • SUT cold-started; FDR enabled.

Input data: derkachi-fixture full replay.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Replay Derkachi end-to-end | FDR contains per-keyframe (a) raw single-shot pose at first emission, (b) smoothed pose at iSAM2 convergence |
| 2 | After replay, parse FDR; for each past keyframe compute distance(raw, GT) and distance(smoothed, GT) | Per-keyframe pair extracted |
| 3 | Aggregate across keyframes | smoothed_error < raw_error for ≥80% of keyframes; mean improvement ≥ 5 m |

Expected outcome: Internal smoothing improves past-keyframe accuracy; FC-side retroactive correction NOT exercised (out of scope per Mode B revision A6). Max execution time: 12 min.
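
The step-3 aggregation reduces to two numbers, sketched below. `smoothing_gate` is a hypothetical helper; it assumes the raw and smoothed error lists are aligned per keyframe and that "mean improvement" is the mean of raw minus smoothed error over all keyframes, including those that got worse.

```python
def smoothing_gate(raw_errors_m, smoothed_errors_m):
    """True when smoothing helps at least 80% of keyframes and by at least 5 m on average."""
    if not raw_errors_m or len(raw_errors_m) != len(smoothed_errors_m):
        raise ValueError("need one raw and one smoothed error per keyframe")
    pairs = list(zip(raw_errors_m, smoothed_errors_m))
    improved_share = sum(smoothed < raw for raw, smoothed in pairs) / len(pairs)
    mean_improvement_m = sum(raw - smoothed for raw, smoothed in pairs) / len(pairs)
    return improved_share >= 0.80 and mean_improvement_m >= 5.0
```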


FT-P-11: Cold-start initialization from FC EKF

Summary: Validates SUT initialization from FC EKF's last valid GPS + IMU-extrapolated position at GPS denial. Traces to: AC-5.1 Category: Startup

Preconditions:

  • cold-boot-fixture provides a frozen FC pose snapshot.
  • SUT not running.

Input data: cold-boot-fixture.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Start ardupilot-plane-sitl (or inav-sitl) with the frozen-pose snapshot loaded | SITL EKF reflects the snapshot pose |
| 2 | Start SUT | SUT queries FC EKF; reads pose; initializes |
| 3 | Push first nav-camera frame | First outbound estimate's lat/lon within ±50 m of the FC EKF snapshot pose |

Expected outcome: First emitted estimate uses FC EKF's pose as prior, within ±50 m tolerance. Max execution time: 60 s.


FT-P-12: GCS downsample at 1-2 Hz

Summary: Validates position estimates + confidence stream to the GCS (via mavproxy-listener) at 1-2 Hz. Traces to: AC-6.1 Category: GCS / Telemetry

Preconditions:

  • mavproxy-listener running and capturing to .tlog.

Input data: derkachi-fixture 60 s segment.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Start replay | SUT emits to FC at runtime cadence (~3 Hz) AND to GCS at 1-2 Hz |
| 2 | After replay, parse .tlog for SUT-emitted GCS messages over the 60 s window | GCS rate within [1, 2] Hz inclusive |

Expected outcome: GCS-side rate observed in [1, 2] Hz over the window. Max execution time: 90 s.
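
One way to compute the observed rate, assuming the .tlog has already been parsed into a list of message timestamps in seconds. Both helper names are hypothetical, and the rate is taken as a simple mean over the window rather than a per-second check.

```python
def mean_rate_hz(timestamps_s, window_start_s, window_s):
    """Mean message rate over the half-open window [start, start + window)."""
    window_end_s = window_start_s + window_s
    in_window = [t for t in timestamps_s if window_start_s <= t < window_end_s]
    return len(in_window) / window_s


def gcs_rate_ok(timestamps_s, window_start_s=0.0, window_s=60.0):
    """True when the mean rate lies in [1, 2] Hz inclusive."""
    return 1.0 <= mean_rate_hz(timestamps_s, window_start_s, window_s) <= 2.0
```

A mean-rate check would not catch a stream that bursts and then stalls; a stricter variant could apply the same bound to each one-second or five-second sub-window.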


FT-P-13: GCS command path (operator re-loc hint)

Summary: Validates that GCS-originated commands (via standard MAVLink) can carry operator re-loc hints to the SUT. Traces to: AC-6.2 Category: GCS / Telemetry

Preconditions:

  • mavproxy-listener configured to send commands.
  • SUT in dead_reckoned state (e.g. mid-blackout from FT-N-03 setup).

Input data: synthesized STATUSTEXT carrying re-loc hint from MAVPROXY.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | While SUT is in dead_reckoned, send re-loc-hint STATUSTEXT from MAVPROXY | SUT acknowledges the hint via FDR log entry |
| 2 | Push next nav-camera frame after hint | Next satellite-anchor attempt uses hint as a search prior |

Expected outcome: Hint received; next anchor attempt biases search; no rejection. Max execution time: 60 s.


FT-P-14: WGS84 output coordinate system

Summary: Validates output coordinates are in WGS84 (latitude/longitude in degrees as per ArduPilot/iNav GPS convention scaled to 1e-7). Traces to: AC-6.3

Preconditions: any FT-P-01 / FT-P-02 run.

Input data: any.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Capture one outbound GPS_INPUT / MSP2_SENSOR_GPS from SITL | Lat/lon present; values in valid WGS84 range; scaled per protocol convention |

Expected outcome: Coordinates parse as WGS84 within Earth bounds. Max execution time: 30 s.
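
A small sketch of the range check, assuming the captured lat/lon fields are the raw integers scaled by 1e-7 degrees mentioned in the summary. The helper names are hypothetical and the sample values in the usage note are illustrative, not taken from a fixture.

```python
def decode_deg_e7(raw):
    """Convert a 1e-7-degree integer field to decimal degrees."""
    return raw / 1e7


def is_valid_wgs84(lat_e7, lon_e7):
    """True when the decoded pair falls within Earth bounds."""
    lat = decode_deg_e7(lat_e7)
    lon = decode_deg_e7(lon_e7)
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0
```

For example, `is_valid_wgs84(500000000, 360000000)` accepts a point at 50° N, 36° E, while a latitude field that was accidentally left in plain degrees times 1e8 would decode outside ±90° and be rejected.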


FT-P-15: Tile cache schema and resolution floor

Summary: Validates the tile cache manifest carries every required field and tiles meet the ≥0.5 m/px floor. Traces to: AC-8.1, RESTRICT-SAT-2 (manifest schema)

Preconditions: tile-cache-fixture mounted.

Input data: tile cache.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | The SUT exposes a one-time cache-load self-check at startup; observe via FDR | Each tile manifest entry has CRS, tile matrix, dimension, lat-adjusted m/px, capture date, source, compression |
| 2 | Inspect m/px values | All ≥ 0.5 m/px; reject below floor |

Expected outcome: All loaded tiles pass schema check and resolution floor. Max execution time: 30 s.


FT-P-16: Pre-loaded cache (offline-only interface)

Summary: Validates the SUT loads tiles from the local cache only, with no in-flight Service calls. Traces to: AC-8.3, RESTRICT-SAT-1

Preconditions: tile-cache-fixture mounted; e2e-net internal: true enforced (no internet egress).

Input data: derkachi-fixture 60 s segment.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Start replay | SUT serves tiles from /var/azaion/tile-cache only |
| 2 | Observe network egress counter on gps-denied-onboard container | All egress to non-e2e-net destinations is 0 (paired with NFT-SEC-02) |

Expected outcome: 0 external egress; replay completes against local cache. Max execution time: 90 s.


FT-P-17: Mid-flight tile generation

Summary: Validates the SUT continuously orthorectifies nav-camera frames into basemap-projected tiles, deduplicates them, and stores them locally for landing-time upload. Traces to: AC-8.4

Preconditions: empty mid-flight-tile-output directory in the FDR volume; mock-suite-sat-service running.

Input data: derkachi-fixture 5 min segment.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Start replay | SUT generates and writes tiles to FDR's mid-flight-tile-output/ |
| 2 | After replay, read tiles | ≥1 tile per ~3 s of high-quality nav frames; each tile carries quality metadata sufficient for the Service voting layer (per Mode B Fact #105) |
| 3 | Simulate landing event; SUT uploads to mock-suite-sat-service | Mock service receives all tiles with HTTP 202 |

Expected outcome: Tiles produced + deduplicated + uploaded with quality metadata. Max execution time: 8 min.


FT-P-18: No raw nav/AI-cam frame retention (storage policy)

Summary: Validates that no raw nav-camera or AI-camera frames are retained except the ≤0.1 Hz failed-tile-gen thumbnail log. Traces to: AC-8.5

Preconditions: derkachi-fixture replay just completed.

Input data: post-replay state of FDR + tile cache.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Walk the FDR + tile cache for any file matching nav-camera raw-frame pattern (JPEG/RAW with original dimensions) | Only the failed-tile-gen thumbnail log files present (≤0.1 Hz cadence) |
| 2 | Verify thumbnail log is bounded by AC-NEW-3 FDR budget | Total thumbnail log < 1 GB over 8 h (NFT-LIM-02 cross-check) |

Expected outcome: 0 unauthorized raw frames retained. Max execution time: 30 s (filesystem walk).


FT-P-19: Satellite relocalization scale-ratio + scene-change

Summary: Validates UAV-frame ground footprint at deployment altitude is retrievable from cache regardless of internal tiling. Scene-change subset is reduced-confidence (PARTIAL — see traceability matrix). Traces to: AC-8.6 (scale-ratio FULL; scene-change PARTIAL)

Preconditions: tile-cache-fixture mounted with multi-zoom-level coverage.

Input data: still-image-set-60 (scale-ratio); the 2 paired _gmaps.png images (scene-change subset).

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | For each still image, query cache top-K=10 retrieval | Top-K result includes a tile whose centre is within 100 m of the image's true centre (scale-ratio satisfied) |
| 2 | For the 2 paired images, run cross-domain matcher against the _gmaps.png reference | Scale-ratio match succeeds; scene-change behavior recorded (PARTIAL: full coverage requires a labeled change-pair dataset, deferred under D-PROJ-3) |

Expected outcome: Scale-ratio passes for 60/60; scene-change recorded as PARTIAL. Max execution time: 5 min.


Negative Scenarios

FT-N-01: 350 m outlier injection tolerance

Summary: Validates the system tolerates up to 350 m outliers between two consecutive frames with airframe tilt up to ±20°. Traces to: AC-3.1, RESTRICT-CAM-1 (nadir camera, tilt limits)

Preconditions: SUT running on derkachi-fixture; outlier-injection-derkachi injector primed in medium density.

Input data: outlier-injection-derkachi (medium).

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Start replay with injector active (every 10th frame replaced by far-away tile crop) | SUT detects outlier; rejects from anchor; estimate continues from prior valid state |
| 2 | Compare per-frame outbound estimate vs GT for non-outlier frames | Error_after_outlier ≤ error_before_outlier + 50 m; covariance grows monotonically across the outlier event |

Expected outcome: Outliers rejected; estimate degrades at most by 50 m drift; covariance monotonic. Max execution time: 12 min.
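
The two step-2 assertions can be expressed as one predicate per outlier event. This is a sketch under assumptions: `outlier_tolerated` is a hypothetical helper, the covariance series is the semi-major axis sampled across the event, and "grows monotonically" is interpreted as non-decreasing.

```python
def outlier_tolerated(error_before_m, error_after_m, cov_semi_major_m):
    """Check the 50 m degradation bound and non-decreasing covariance for one outlier event."""
    bounded = error_after_m <= error_before_m + 50.0
    monotonic = all(later >= earlier
                    for earlier, later in zip(cov_semi_major_m, cov_semi_major_m[1:]))
    return bounded and monotonic
```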


FT-N-02: Sharp-turn frame-to-frame failure expected

Summary: Negative twin of FT-P-07 — validates that during a sharp turn, frame-to-frame may LEGITIMATELY fail, and the system labels accordingly. Traces to: AC-3.2 (negative path)

Preconditions: Same as FT-P-07.

Input data: sharp-turn segment of Derkachi (or synthetic gyro overlay).

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Replay sharp-turn segment | During turn frames: source_label ∈ {visual_propagated, dead_reckoned}; covariance grows |
| 2 | After turn, observe label transition | Label returns to satellite_anchored once next anchor succeeds |

Expected outcome: Sharp-turn frames correctly mark themselves as not-satellite-anchored; recovery exercised in FT-P-07. Max execution time: 5 min.


FT-N-03: Extended outage triggers operator re-loc request

Summary: Validates that on ≥3 consecutive frames AND ≥2 s without estimate, the SUT requests operator re-loc via telemetry and continues dead-reckoned propagation. Traces to: AC-3.4

Preconditions: derkachi-fixture + 3-frame outage injector primed.

Input data: synthetic outage on Derkachi.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Trigger 3-consecutive-frame failure (corrupt frames) | SUT fails to produce estimates for 3+ frames |
| 2 | Wait ≥2 s | STATUSTEXT containing OPERATOR_RELOC_REQUEST emitted to mavproxy-listener |
| 3 | During outage, observe FC outbound | Estimates labeled dead_reckoned continue; FC uses last-known + IMU extrapolation |

Expected outcome: Re-loc request emitted; dead-reckoned estimates continue. Max execution time: 60 s.
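
To make the "≥3 consecutive frames AND ≥2 s" trigger concrete, here is a sketch that scans a frame timeline for the first moment both conditions hold. The function name and the (timestamp, has_estimate) tuple shape are hypothetical, and the gap is measured from the last frame that produced an estimate, which is one possible reading of AC-3.4.

```python
def first_reloc_request_time(frames, min_failed_frames=3, min_gap_s=2.0):
    """Return the timestamp at which the re-loc request becomes due, or None.

    frames: iterable of (timestamp_s, has_estimate) tuples in time order.
    """
    failed_run = 0
    gap_start_s = None  # time of the last good estimate, or of the first frame seen
    for timestamp_s, has_estimate in frames:
        if gap_start_s is None:
            gap_start_s = timestamp_s
        if has_estimate:
            failed_run = 0
            gap_start_s = timestamp_s
            continue
        failed_run += 1
        if failed_run >= min_failed_frames and timestamp_s - gap_start_s >= min_gap_s:
            return timestamp_s
    return None
```

A test harness could compare this predicted time against the arrival time of the OPERATOR_RELOC_REQUEST STATUSTEXT at mavproxy-listener.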


FT-N-04: Visual blackout + spoofed GPS combined failsafe

Summary: Validates the AC-3.5 + AC-NEW-8 combined failsafe: switch label, reject spoof, propagate from last trusted state, monotonic covariance, STATUSTEXT. Traces to: AC-3.5, AC-NEW-8

Preconditions: blackout-spoof-derkachi injector primed for 5 s, 15 s, 35 s windows; FC inbound stream patched to inject spoofed GPS.

Input data: blackout-spoof-derkachi (each window run as a sub-case).

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Begin blackout window AND inject spoofed GPS in same temporal window | Within ≤1 frame OR ≤400 ms: source_label = dead_reckoned; spoofed GPS rejected from estimator input; covariance grows monotonically |
| 2 | Observe horiz_accuracy field in outbound GPS_INPUT (AP) | horiz_accuracy ≥ 95% covariance semi-major axis (no under-reporting) |
| 3 | Observe GCS stream | VISUAL_BLACKOUT_IMU_ONLY STATUSTEXT at 1-2 Hz throughout blackout |
| 4 | For 35 s window only | Per AC-NEW-8: when 95% covariance crosses 100 m → fix-quality degraded; when crosses 500 m OR blackout exceeds 30 s → horiz_accuracy=999.0 AND VISUAL_BLACKOUT_FAILSAFE STATUSTEXT |
| 5 | End blackout; restore FC GPS-health | Recovery only after FC GPS-health stable + non-spoofed for ≥10 s AND a visual/satellite consistency check succeeds |

Expected outcome: The assertions of all five steps pass for each window (step 4 applies to the 35 s window only). Max execution time: 5 min (3 windows × ~90 s each).
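
The AC-NEW-8 escalation in step 4 can be modelled as a small state function, which is handy for computing the expected state at each sample of a blackout window. The function names and state strings are hypothetical; "crosses" and "exceeds" are read as strictly greater than.

```python
FAILSAFE_HORIZ_ACCURACY = 999.0  # sentinel value named in step 4


def blackout_state(cov95_m, blackout_s):
    """Expected escalation state for a 95% covariance (metres) and blackout age (seconds)."""
    if cov95_m > 500.0 or blackout_s > 30.0:
        return "failsafe"
    if cov95_m > 100.0:
        return "degraded"
    return "nominal"


def min_expected_horiz_accuracy(cov95_m, blackout_s):
    """Lower bound the outbound horiz_accuracy must meet (no under-reporting, step 2)."""
    if blackout_state(cov95_m, blackout_s) == "failsafe":
        return FAILSAFE_HORIZ_ACCURACY
    return cov95_m
```

With these, only the 35 s window can reach `"failsafe"` through the duration branch; the 5 s and 15 s windows reach it only if the covariance itself passes 500 m.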


FT-N-05: Stale-tile rejection on freshness violation

Summary: Validates that tiles violating AC-8.2 freshness window are rejected (or downgraded so they cannot produce a satellite_anchored label). Traces to: AC-8.2, AC-NEW-6

Preconditions: synth-age-tile-set (synth-age-7mo for active-conflict, synth-age-13mo for rear) mounted instead of fresh fixture.

Input data: still-image-set-60 against the aged cache.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Replay against synth-age-7mo (configure SUT for active-conflict sector) | SUT either rejects load OR loads but never emits satellite_anchored from these tiles |
| 2 | Replay against synth-age-13mo (configure SUT for rear sector) | Same: reject or non-satellite_anchored only |

Expected outcome: 0 frames emit satellite_anchored from aged tiles. Max execution time: 5 min.


FT-N-06: Mid-flight tile freshness (current-timestamped)

Summary: Validates that mid-flight-generated tiles are timestamped as current and treated as fresh per AC-NEW-6. Traces to: AC-NEW-6 (positive sub-case)

Preconditions: empty mid-flight-tile-output.

Input data: derkachi-fixture 5 min segment.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Start replay | SUT generates mid-flight tiles |
| 2 | Inspect each generated tile's manifest entry | capture_date is within ±60 s of generation wall-clock; treated as fresh by the freshness gate |

Expected outcome: All mid-flight tiles current-timestamped and fresh. Max execution time: 6 min.


FT-P-20: Real-flight validation runner — honest verdict + Markdown accuracy report

Summary: Runs the full gps-denied-replay against the real Derkachi binary tlog + flight video + AZ-702 factory-sheet camera calibration, computes the per-emission horizontal-error distribution, and writes a structured Markdown accuracy report. Reports a real PASS/FAIL on AC-3 where the AZ-404 1-min test carries an @xfail mask; the AZ-404 test itself is left unchanged. Traces to: AZ-699 AC-1..AC-3 (epic AZ-696 AC-3, the 100 m / 80 % gate). Category: Position Accuracy

Preconditions:

  • _docs/00_problem/input_data/flight_derkachi/derkachi.tlog (real binary, multi-flight).
  • _docs/00_problem/input_data/flight_derkachi/flight_derkachi.mp4 (real recording, > 1 MB; the placeholder used by AZ-404 does not satisfy this gate).
  • _docs/00_problem/input_data/flight_derkachi/khp20s30_factory.json (AZ-702 calibration).
  • gps-denied-replay console-script installed.
  • RUN_REPLAY_E2E=1 (matches the existing AZ-404 gate).

Input data: real derkachi.tlog covers up to three sorties; the AZ-698 segmenter + --auto-trim locates the matching flight automatically.

Steps:

| Step | Consumer Action | Expected System Response |
| ---- | --------------- | ------------------------ |
| 1 | Invoke gps-denied-replay --auto-trim ... with real fixtures | Subprocess exits 0 within the 15-min NFR budget |
| 2 | Parse JSONL emissions; pair each with the nearest-in-time ground-truth row (binary-tlog GPS via AZ-697) | Distribution computed: count, mean, p50, p95, p99, threshold-hit share at 10/25/50/100 m |
| 3 | Render the Markdown accuracy report and write _docs/06_metrics/real_flight_validation_{YYYY-MM-DD}.md | Report exists with header, run context, horizontal-error stats, threshold-hit table, and (when available) vertical-error stats |
| 4 | Evaluate the AC-3 gate: ≥ 80 % within 100 m | Verdict is PASS or honest FAIL, with no @xfail mask |
| 5 | On FAIL, surface a failure message referencing the calibration acquisition method (factory-sheet / placeholder / unknown) and the residual budget | Operator can attribute the failure without re-reading the source |

Expected outcome: PASS when the estimator meets the epic AC-3 gate; honest FAIL otherwise. The Markdown report is the durable artefact (consumed by the cycle retrospective and downstream tuning work).

Max execution time: 15 min (matches AZ-699 NFR for a single Tier-2 Jetson run).
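
The distribution and verdict in steps 2 and 4 can be sketched as below. The name `percentile_sorted` appears in the commit message, but the signature shown here is an assumption, as are the `summarize` and `verdict` helpers; the interpolation follows the linear rule that NumPy's percentile uses by default, and "within 100 m" is read as less than or equal to.

```python
def percentile_sorted(sorted_values, q):
    """Linear-interpolated percentile of an ascending list; q is in [0, 100]."""
    if not sorted_values:
        raise ValueError("empty sample")
    if not 0.0 <= q <= 100.0:
        raise ValueError("q must be within [0, 100]")
    rank = (len(sorted_values) - 1) * q / 100.0
    lower = int(rank)  # floor, since rank is non-negative
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def summarize(errors_m, thresholds_m=(10, 25, 50, 100)):
    """Count, mean, p50/p95/p99 and threshold-hit share of horizontal errors."""
    ordered = sorted(errors_m)
    count = len(ordered)
    if count == 0:
        raise ValueError("no ground-truth pairings")
    return {
        "count": count,
        "mean": sum(ordered) / count,
        "p50": percentile_sorted(ordered, 50),
        "p95": percentile_sorted(ordered, 95),
        "p99": percentile_sorted(ordered, 99),
        "hit_share": {t: sum(e <= t for e in ordered) / count for t in thresholds_m},
    }


def verdict(summary):
    """AC-3 gate: at least 80 % of emissions within 100 m."""
    return "PASS" if summary["hit_share"][100] >= 0.80 else "FAIL"
```

Raising on an empty sample keeps a run with zero pairings from silently producing a verdict, which fits the no-mask intent of the test.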

Report artefact schema (canonical, produced by tests/e2e/replay/_report_writer.py):

# Real-flight validation — YYYY-MM-DD

**Verdict**: PASS | FAIL (AC-3 gate: ≥ 80 % within 100 m)

## Run context

- Tlog: `<path>`
- Video: `<path>`
- Calibration acquisition method: factory-sheet | placeholder | unknown
- Clip duration: <float> s
- Emissions consumed: <int>
- Ground-truth pairings: <int>

## Horizontal error (metres)

| Statistic | Value |
| --------- | ----- |
| Mean | <float> |
| p50  | <float> |
| p95  | <float> |
| p99  | <float> |

## Threshold-hit share

| Threshold (m) | Hit share (%) |
| ------------- | ------------- |
| 10  | <float> |
| 25  | <float> |
| 50  | <float> |
| 100 | <float> |

## Vertical error (metres)

| Statistic | Value |
| --------- | ----- |
| Mean    | <float> |
| p50     | <float> |
| p95     | <float> |
| Samples | <int>   |

The Vertical-error section is replaced by _No emissions carried a comparable altitude — vertical stats skipped._ when none of the JSONL rows carry an alt_m field comparable to the ground-truth altitude.

Skip semantics: AZ-699 distinguishes a missing-prerequisite skip (cleanly skipped, reporting the missing file's path) from a test-cannot-resolve mask (@xfail, explicitly forbidden by AZ-699 AC-1). The AZ-404 1-min test's @xfail on AC-3 is unchanged (AZ-699 AC-4 is "add a new test, don't replace"); FT-P-20 is the honest counterpart that runs alongside it.