Oleksandr Bezdieniezhnykh 1f634c2604
Update demo replay validation and testing documentation
- Modified the autodev state to reflect the current testing phase and details of the new `jetson-e2e` tests.
- Enhanced the "How to Test" documentation to provide clearer instructions on the demo replay validation process, including video and tlog alignment steps.
- Updated architectural documentation to include the new demo replay operator flow and its dependencies.
- Documented the removal of deprecated auto-sync features and clarified the operator-facing UI for replay validation.
- Added new entries in the dependencies table for upcoming tasks related to the demo replay flow.

These changes improve clarity and usability for operators and developers working with the demo replay system.
2026-06-20 11:24:43 +03:00


E2E replay tests (AZ-404 + AZ-835 + cycle-4)

End-to-end regression suite for the gps-denied-replay console-script (AZ-402). Two distinct entry points live here:

| Entry point | Source | Coverage |
|---|---|---|
| AZ-265 / AZ-404: 60 s Derkachi clip with synthetic tlog | test_derkachi_1min.py | Original AC-1..AC-10 of the replay epic; runs on Tier-1 + Tier-2 |
| AZ-835 / AZ-840: full (tlog, video, calibration) orchestrator | test_az835_e2e_real_flight.py | Tier-2 only; closes the real-flight validation loop end-to-end (extract → seed → FAISS → run → verdict) |

The cycle-4 replay-input redesign (AZ-894 / AZ-895 / AZ-896) replaces the tlog auto-sync surface with a CSV-driven path; the AZ-265 suite is the regression net that catches drift in the legacy path during the deprecation window. See replay_protocol.md Invariants 12-14 for the authoritative contract.

How to run

AZ-404 Derkachi 60 s suite (Tier-1 + Tier-2)

# In a fresh venv with the package installed:
RUN_REPLAY_E2E=1 pytest tests/e2e/replay/test_derkachi_1min.py -v

Without RUN_REPLAY_E2E=1 the heavy tests skip cleanly. The unconditional tests (the AC-4a mode-agnosticism AST scan, the AC-7 skip-gate self-check, and the helper tests in test_helpers.py) still run.

AZ-835 orchestrator test — full (tlog, video, calibration) loop (Tier-2 only)

Closes Epic AZ-835's narrative: given a real-flight .tlog + the matching nadir video + camera calibration, the orchestrator runs the 7-step pipeline end-to-end and writes a verdict report.

Required inputs (in-repo for the Derkachi reference fixture, except where noted):

  • .tlog: pymavlink binary log from a real flight. Reference fixture: _docs/00_problem/input_data/flight_derkachi/data_imu.csv (the canonical CSV from which _tlog_synth.py reconstructs the tlog) plus the synthesised tlog the conftest emits at session start.
  • Nadir video: _docs/00_problem/input_data/flight_derkachi/*.mp4 (a large asset that is not always present in a workstation clone; pull it from the Jetson e2e harness or Git LFS if absent).
  • Calibration: tests/fixtures/calibration/adti26.json (factory-sheet approximation for the Topotek KHP20S30; real intrinsics still TBD).

Tier-2 invocation (Jetson):

ssh jetson-e2e
cd /workspace/gps-denied-onboard
export RUN_REPLAY_E2E=1
pytest tests/e2e/replay/test_az835_e2e_real_flight.py -v --tb=short -m tier2

AZ-962: docker-compose.test.jetson.yml exports GPS_DENIED_OPERATOR_CONFIG_PATH=/opt/configs/operator_replay.yaml automatically and bind-mounts ./configs:/opt/configs:ro, so GPS_DENIED_OPERATOR_CONFIG_PATH does not need to be exported manually when running through scripts/run-tests-jetson.sh. The YAML at configs/operator_replay.yaml declares the four blocks the fixture requires (c6 / c7 / c10 / c11); secrets (SATELLITE_PROVIDER_API_KEY) flow in from .env.test via the loader's ENV_KEY_MAP. c10_provisioning.backbones is intentionally empty pending AZ-964, so the orchestrator test will SKIP at the "no backbones" gate until AZ-964 lands.
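A minimal sketch of the two config behaviours described above, assuming the YAML has already been parsed into a dict: overlaying secrets from the environment through an ENV_KEY_MAP-style table, and deriving the "no backbones" skip reason. Only c10_provisioning.backbones and SATELLITE_PROVIDER_API_KEY come from this README; the target path c6_satellite.api_key and the prefix-matching of block names are hypothetical, and the real loader may work differently.

```python
from typing import Dict, Mapping, Optional

# Hypothetical: env-var name -> dotted config path. The real ENV_KEY_MAP in the
# operator-config loader may target different keys.
ENV_KEY_MAP: Dict[str, str] = {"SATELLITE_PROVIDER_API_KEY": "c6_satellite.api_key"}

# The four blocks the fixture requires, matched by name prefix (c6_*, c7_*, ...).
REQUIRED_BLOCK_PREFIXES = ("c6", "c7", "c10", "c11")


def overlay_secrets(cfg: dict, env: Mapping[str, str]) -> dict:
    """Copy each mapped env var into cfg at its dotted path, creating parents."""
    for env_key, dotted in ENV_KEY_MAP.items():
        if env_key not in env:
            continue  # secret absent: leave any YAML value untouched
        *parents, leaf = dotted.split(".")
        node = cfg
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = env[env_key]
    return cfg


def config_skip_reason(cfg: dict) -> Optional[str]:
    """Return why the orchestrator test should SKIP, or None if it can run."""
    present = {key.split("_", 1)[0] for key in cfg}
    for prefix in REQUIRED_BLOCK_PREFIXES:
        if prefix not in present:
            return f"missing {prefix} block"
    backbones = cfg.get("c10_provisioning", {}).get("backbones") or []
    if not backbones:
        return "no backbones"  # expected until AZ-964 lands
    return None
```

In a harness this would typically run as overlay_secrets(yaml.safe_load(...), os.environ), with a non-None reason passed to pytest.skip().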

The bundled local-development entry point is scripts/run-tests-jetson.sh, which handles the SSH alias + rsync + remote pytest invocation. See _docs/02_document/tests/tier2-jetson-testing.md for the harness contract.

Skip gates (in evaluation order):

  1. @pytest.mark.tier2 — the per-suite Tier-2 plugin gates this off on dev macOS / Tier-1 Docker (matches the AZ-839 / AZ-699 contract).
  2. RUN_REPLAY_E2E not in {1, true, yes, on}.
  3. gps-denied-replay console-script not on PATH.
  4. Real Derkachi video missing or placeholder-sized.
  5. operator_pre_flight_setup fixture itself skipped — the downstream consumer inherits the SKIP automatically (pytest's fixture-skip propagation).
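The ordering above can be sketched as a single function that returns the first failing gate. This is an illustration, not the plugin's code: the is_tier2 flag stands in for the tier2 marker plugin, the case-insensitive env match and the 1 MiB placeholder cut-off are assumptions, and gate 5 needs no code because pytest propagates a skipped fixture's SKIP on its own.

```python
import shutil
from pathlib import Path
from typing import Callable, Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}
# Hypothetical cut-off: anything below 1 MiB is treated as a placeholder file.
MIN_VIDEO_BYTES = 1 << 20


def first_skip_reason(
    is_tier2: bool,
    env: Mapping[str, str],
    video: Path,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Check gates 1-4 in README order and return the first failing one."""
    if not is_tier2:
        return "gate 1: not a Tier-2 host"
    # Case-insensitive match is an assumption about the real gate.
    if env.get("RUN_REPLAY_E2E", "").strip().lower() not in TRUTHY:
        return "gate 2: RUN_REPLAY_E2E not truthy"
    if which("gps-denied-replay") is None:
        return "gate 3: gps-denied-replay not on PATH"
    if not video.is_file() or video.stat().st_size < MIN_VIDEO_BYTES:
        return "gate 4: Derkachi video missing or placeholder-sized"
    # Gate 5 needs no code: pytest propagates a skipped fixture's SKIP.
    return None
```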

Expected runtime on Tier-2 Jetson AGX Orin (cold cache): ≤ 8 min end-to-end (≤ 5 min C3 fixture cold-start budget + ≤ 3 min for the replay + verdict compute). Warm-cache reinvocations within the same compose session: ≤ 60 s.

Verdict report location: _docs/06_metrics/real_flight_validation_<YYYY-MM-DD>.md. The report is emitted ALWAYS, regardless of PASS / FAIL on the AZ-696 AC-3 threshold (≥ 80 % of emissions within 100 m of tlog ground truth). The success criterion at the fixture level is "honest report exists with distribution data", not "PASS". The PASS / FAIL line of the report itself is the operator-facing answer to "did this flight clip localise within the threshold".
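A minimal sketch of the report contract, assuming per-emission horizontal errors in metres have already been computed: the lines are produced on every run and PASS / FAIL is only one of them. The function names and the chosen statistics (count, in-threshold share, median) are illustrative; the real report layout is not specified here, and counting exactly 100 m as "within" is an assumption.

```python
import statistics
from datetime import date
from typing import List, Sequence

THRESHOLD_M = 100.0       # AZ-696 AC-3: within 100 m of tlog ground truth...
REQUIRED_FRACTION = 0.80  # ...for at least 80 % of emissions


def report_path(day: date) -> str:
    """Dated report location under _docs/06_metrics/."""
    return f"_docs/06_metrics/real_flight_validation_{day.isoformat()}.md"


def verdict_lines(errors_m: Sequence[float]) -> List[str]:
    """Build report lines for every run; PASS/FAIL is reported, never raised."""
    n = len(errors_m)
    within = sum(1 for e in errors_m if e <= THRESHOLD_M)
    fraction = within / n if n else 0.0
    verdict = "PASS" if n and fraction >= REQUIRED_FRACTION else "FAIL"
    median = statistics.median(errors_m) if n else float("nan")
    return [
        f"emissions: {n}",
        f"within {THRESHOLD_M:.0f} m: {within} ({fraction:.1%})",
        f"median error: {median:.1f} m",
        f"verdict: {verdict}",
    ]
```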

Imagery source license attribution (dev/research use only)

The Jetson e2e harness's satellite-provider instance downloads tiles from the Google Maps satellite layer (mt0..mt3.google.com/vt/lyrs=s), governed by Google Maps Platform Terms of Service. Every tile served by the harness carries the "Imagery © Google" attribution string.

This is dev/research use only. Production deployment of the gps-denied-onboard companion against a Google-Maps-sourced satellite-provider requires either a Google Maps Platform licensing review or migration to a true CC-BY satellite source on the parent-suite side (parent-suite ticket TBD; see _docs/02_document/architecture.md § satellite-provider integration). The onboard-side seed scripts (tests/fixtures/derkachi_c6/seed_region.py, seed_route.py) propagate the attribution into the test fixture's metadata; do not remove it.
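One way to keep that rule from regressing is a guard run against the seeded fixture metadata. This is a minimal sketch; the metadata key name "attribution" is hypothetical and should be adapted to whatever seed_region.py / seed_route.py actually write.

```python
ATTRIBUTION = "Imagery © Google"


def check_attribution(metadata: dict, key: str = "attribution") -> None:
    """Raise if seeded fixture metadata no longer carries the Google attribution.

    The default key name is hypothetical, not taken from the seed scripts.
    """
    value = str(metadata.get(key, ""))
    if ATTRIBUTION not in value:
        raise AssertionError(f"tile attribution missing from fixture metadata: {value!r}")
```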

Fixture state

| Artifact | Status | Source |
|---|---|---|
| flight_derkachi.mp4 | available | _docs/00_problem/input_data/flight_derkachi/ |
| data_imu.csv | available | same dir; 4900 rows at 10 Hz over 489.9 s |
| Synthetic tlog | generated at fixture time | _tlog_synth.py reproduces a pymavlink .tlog from the CSV (the original tlog is not in-repo; the CSV was its export) |
| Camera calibration | placeholder (tests/fixtures/calibration/adti26.json) | The real Topotek KHP20S30 intrinsics are unknown per camera_info.md. AC-3 accuracy depends on this. |
| Operator pre-flight rehearsal | blocked | tests/fixtures/mock-suite-sat-service/ is a bootstrap stub (only GET /healthz); AC-8 skips until the full D-PROJ-2 contract lands. |

Clip range

The first 60 s of the Derkachi flight (Time=0.0 → Time=60.0). The take-off region exercises the AZ-405 IMU-take-off auto-sync detector; the cruise region that follows stresses the satellite-anchor + VIO drift-correction path. To change the trim, edit _CLIP_START_S and _CLIP_END_S in conftest.py.
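A minimal sketch of the trim, assuming data_imu.csv has a Time column in seconds (as the Time=0.0 → Time=60.0 notation suggests) and that both ends of the window are inclusive; the real conftest.py may use a half-open window or a different reader. Under those assumptions, a 10 Hz CSV keeps 601 rows in the 0.0 to 60.0 s window.

```python
import csv
from typing import Iterable, Iterator, List

_CLIP_START_S = 0.0   # mirrors the conftest.py constant names above
_CLIP_END_S = 60.0


def trim_rows(
    rows: Iterable[dict],
    start_s: float = _CLIP_START_S,
    end_s: float = _CLIP_END_S,
) -> Iterator[dict]:
    """Yield CSV rows whose Time lies in [start_s, end_s] (inclusive, assumed)."""
    for row in rows:
        t = float(row["Time"])
        if start_s <= t <= end_s:
            yield row


def load_clip(csv_path: str) -> List[dict]:
    """Read data_imu.csv and keep only the clip window."""
    with open(csv_path, newline="") as fh:
        return list(trim_rows(csv.DictReader(fh)))
```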

Expected runtime (Tier-2)

| Test | Expected wall clock |
|---|---|
| AC-1 (--pace asap) | ≤ 30 s (Tier-2 only) |
| AC-2 schema match | piggybacks on AC-1 (Tier-2 only) |
| AC-5 determinism | 2 × asap runs (≤ 60 s total; Tier-2 only) |
| AC-6a realtime | 60 s ± 3 s (Tier-2 only) |
| AC-6b asap | ≤ 30 s (Tier-2 only) |
| Total suite | ~6 min wall clock on Tier-2; skips on Mac |

The AC-1 / AC-2 / AC-5 tests share --pace asap runs but each fixture invocation produces a fresh output file, so they do not short-circuit each other (preserves AC-5's two-runs-diff guarantee).
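A minimal sketch of the two-runs-diff check, assuming each --pace asap invocation writes its JSONL output to its own file: byte equality is the pass condition, and a truncated unified diff makes the first divergence easy to bisect. The function name and the 20-line cap are illustrative, not the test's actual helper.

```python
import difflib
from pathlib import Path
from typing import Optional


def determinism_diff(run_a: Path, run_b: Path, max_lines: int = 20) -> Optional[str]:
    """Return None if two outputs are byte-identical, else a (possibly empty) diff."""
    if run_a.read_bytes() == run_b.read_bytes():
        return None
    diff = difflib.unified_diff(
        run_a.read_text().splitlines(),
        run_b.read_text().splitlines(),
        fromfile=str(run_a),
        tofile=str(run_b),
        lineterm="",
        n=1,  # one line of context around each divergence
    )
    return "\n".join(list(diff)[:max_lines])
```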

AC matrix

| AC | Test | State |
|---|---|---|
| AC-1: exit 0 + JSONL count match | test_ac1_exits_0_jsonl_count_match | xfail (AZ-963: open-loop ESKF) |
| AC-2: JSONL schema match | test_ac2_jsonl_schema_match | Tier-2 (Jetson only) |
| AC-3: ≤ 100 m for 80 % of ticks | test_ac3_within_100m_80pct_of_ticks | xfail (AZ-963: open-loop ESKF) |
| AC-4a: mode-agnosticism AST scan | test_ac4_mode_agnosticism_ast_scan | unconditional |
| AC-4b: encoder byte-equality | test_ac4_encoder_byte_equality | skip (waiting on AZ-558) |
| AC-5: determinism | test_ac5_determinism_two_runs_diff | xfail (AZ-963: open-loop ESKF) |
| AC-6a: realtime 60 s ± 5 % | test_ac6_pace_realtime_60s_within_5pct | xfail (AZ-963: open-loop ESKF) |
| AC-6b: asap ≤ 30 s | test_ac6_pace_asap_under_30s | xfail (AZ-963: open-loop ESKF) |
| AC-7: skip-gate self-check | test_ac7_skip_gate_consistent_with_env_var | unconditional |
| AC-8: operator workflow rehearsal | test_ac8_operator_workflow | skip (waiting on D-PROJ-2 mock) |
| AC-9: helper L2 correctness | test_helpers.py::test_ac9_l2_* | unconditional |
| AC-10: README accuracy | this file | live |
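A minimal sketch of the AC-6 pacing assertions, assuming the test times the whole gps-denied-replay invocation: realtime must land within 60 s ± 5 % (± 3 s) and asap must finish within 30 s. The helper names are hypothetical; time.perf_counter is used because it is monotonic and unaffected by system clock adjustments.

```python
import time
from typing import Callable, Tuple

REALTIME_TARGET_S = 60.0
REALTIME_TOL = 0.05       # AC-6a: ± 5 % of 60 s, i.e. ± 3 s
ASAP_BUDGET_S = 30.0      # AC-6b


def timed(fn: Callable[[], object]) -> Tuple[object, float]:
    """Run fn and return (result, elapsed seconds) from a monotonic clock."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def realtime_ok(wall_s: float, target_s: float = REALTIME_TARGET_S,
                tol: float = REALTIME_TOL) -> bool:
    """True if wall_s is within tol (a fraction) of target_s."""
    return abs(wall_s - target_s) <= tol * target_s


def asap_ok(wall_s: float, budget_s: float = ASAP_BUDGET_S) -> bool:
    """True if the asap run finished inside its budget."""
    return wall_s <= budget_s
```

Widening the realtime window to ± 5 s on a 60 s clip corresponds to tol=5/60.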

Failure-mode cookbook

| Symptom | Likely cause | Fix |
|---|---|---|
| gps-denied-replay console-script not on PATH | package not installed in the test venv | pip install -e . |
| AC-1 line count off by > 5 % | tlog synthesizer drifted from the CSV | regenerate by re-running the test (the synthesizer is deterministic; non-determinism would be a real bug) |
| AC-3 fails at ~0 % even with calibration | wrong intrinsics OR wrong WGS84 ground-truth source | verify the GLOBAL_POSITION_INT columns are still the AC-3 reference (per flight_derkachi/README.md), then re-derive ground truth |
| AC-5 determinism violated | non-deterministic float ordering in the C5 estimator OR a clock leaked into the runtime | bisect via git log against the C5 / clock modules |
| AC-6 realtime drifts on shared CI | shared-runner contention; the spec allows widening to ± 5 s | adjust the _HEAVY_SKIP boundary if it persists |
| tlog missing required messages | _tlog_synth.py lost a message group | check _REQUIRED_MESSAGE_GROUPS in tlog_replay_adapter.py against the synth output |
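For the last row, the comparison can be sketched as a set check. The groups below are a hypothetical stand-in for _REQUIRED_MESSAGE_GROUPS (each group is satisfied when any of its MAVLink message types appears); the real groups in tlog_replay_adapter.py may differ.

```python
from typing import FrozenSet, Iterable, List, Tuple

# Hypothetical stand-in for _REQUIRED_MESSAGE_GROUPS.
REQUIRED_MESSAGE_GROUPS: Tuple[FrozenSet[str], ...] = (
    frozenset({"GLOBAL_POSITION_INT"}),
    frozenset({"ATTITUDE"}),
    frozenset({"RAW_IMU", "SCALED_IMU", "SCALED_IMU2"}),
)


def missing_groups(seen_types: Iterable[str]) -> List[FrozenSet[str]]:
    """Return the required groups with no matching message type in seen_types."""
    seen = set(seen_types)
    return [group for group in REQUIRED_MESSAGE_GROUPS if not group & seen]
```

With pymavlink installed, seen_types can be gathered by opening the synth output with mavutil.mavlink_connection(path), calling recv_match() until it returns None, and recording msg.get_type() for each message.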

Files

tests/e2e/replay/
├── README.md                   ← this file
├── __init__.py                 ← package marker + module-level docstring
├── _helpers.py                 ← parse_jsonl, l2_horizontal_m, match_percentage,
│                                  CapturingMavlinkTransport, GroundTruthRow
├── _tlog_synth.py              ← CSV → tlog generator
├── conftest.py                 ← derkachi_replay_inputs, replay_runner,
│                                  operator_pre_flight_setup fixtures
├── test_helpers.py             ← unit tests for _helpers (unconditional)
└── test_derkachi_1min.py       ← AC-1..AC-8 + AC-7 skip gate + AC-4a AST scan
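For readers extending the helpers, a minimal sketch of what parse_jsonl, l2_horizontal_m and GroundTruthRow might look like. These are illustrative stand-ins, not the contents of _helpers.py: the GroundTruthRow fields are hypothetical, and the distance uses an equirectangular approximation, which is reasonable at the ≤ 100 m scale AC-3 checks (the real helper may use haversine or a projected frame).

```python
import json
import math
from dataclasses import dataclass
from typing import List

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius


@dataclass(frozen=True)
class GroundTruthRow:
    """Hypothetical fields: one ground-truth fix from GLOBAL_POSITION_INT."""
    t_s: float
    lat_deg: float
    lon_deg: float


def parse_jsonl(text: str) -> List[dict]:
    """Parse one JSON object per non-blank line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def l2_horizontal_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Horizontal distance via a local equirectangular (flat-Earth) projection."""
    mid_lat = math.radians((lat1 + lat2) / 2.0)
    dx = math.radians(lon2 - lon1) * math.cos(mid_lat) * EARTH_RADIUS_M
    dy = math.radians(lat2 - lat1) * EARTH_RADIUS_M
    return math.hypot(dx, dy)
```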

Follow-up work

  • AZ-963 — five Derkachi ACs (AC-1, AC-3, AC-5, AC-6a, AC-6b) are xfail until a reference C6 tile cache exists (resolution path: AZ-777 / AZ-974).
  • Real Topotek KHP20S30 calibration — needed for AC-3 accuracy even after AZ-777 lands (the threshold is ≤100 m for 80 % of ticks).
  • AZ-558 — closes AC-4b (route C8 encoders through MavlinkTransport).
  • D-PROJ-2 mock-suite-sat-service — unblocks AC-8 (operator workflow rehearsal).

Epic AZ-835 ticket map

The Tier-2 orchestrator path shipped under Epic AZ-835. Sub-tickets:

| Ticket | Role |
|---|---|
| AZ-836 | TlogRouteExtractor: active-segment trim + Douglas-Peucker coarsening of tlog GPS to ≤ N waypoints |
| AZ-838 | SatelliteProviderRouteClient + seed_route.py CLI: POST RouteSpec to satellite-provider, poll mapsReady |
| AZ-839 | C3 operator_pre_flight_setup real fixture: wires C1+C2+C11+C10 against the seeded catalog |
| AZ-840 | C4 E2E orchestrator test: drives the full 7-step pipeline from (tlog, video, calibration) |
| AZ-842 | C6 Docs: replay_protocol.md Invariants 12-14 + architecture.md + this README (cycle-4 rescope) |
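AZ-836's coarsening step can be sketched as classic Douglas-Peucker plus an outer loop that doubles the tolerance until the route fits in N waypoints. This is a minimal illustration, assuming the GPS fixes have already been projected to planar metres and the active segment already trimmed; the real TlogRouteExtractor's tolerance schedule and distance metric are not documented here.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # planar (x, y) in metres; projection assumed done


def _segment_dist(p: Point, a: Point, b: Point) -> float:
    """Distance from p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg2 = dx * dx + dy * dy
    if seg2 == 0.0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points: Sequence[Point], epsilon: float) -> List[Point]:
    """Iterative Douglas-Peucker: keep the farthest point while it exceeds epsilon."""
    if len(points) < 3:
        return list(points)
    keep = [False] * len(points)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]
    while stack:
        lo, hi = stack.pop()
        best_i, best_d = -1, -1.0
        for i in range(lo + 1, hi):
            d = _segment_dist(points[i], points[lo], points[hi])
            if d > best_d:
                best_i, best_d = i, d
        if best_i != -1 and best_d > epsilon:
            keep[best_i] = True
            stack.extend([(lo, best_i), (best_i, hi)])
    return [p for p, k in zip(points, keep) if k]


def coarsen_to_max(points: Sequence[Point], max_waypoints: int,
                   epsilon: float = 1.0) -> List[Point]:
    """Double epsilon until the simplified route has at most max_waypoints points."""
    if max_waypoints < 2 or epsilon <= 0.0:
        raise ValueError("need max_waypoints >= 2 and epsilon > 0")
    simplified = douglas_peucker(points, epsilon)
    while len(simplified) > max_waypoints:
        epsilon *= 2.0
        simplified = douglas_peucker(points, epsilon)
    return simplified
```

Doubling trades an exact N for simplicity; a bisection on epsilon would land closer to N waypoints at the cost of more passes.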

The cycle-4 replay-input redesign tickets ride alongside the Epic:

| Ticket | Role |
|---|---|
| AZ-894 | CsvReplayInputAdapter: new CSV-driven primary path on the single canonical clock |
| AZ-895 | Auto-sync surface deprecation: tlog adapter reduced to audit-only role |
| AZ-896 | CSV format spec (csv_replay_format.md) + example data_imu.csv |
| AZ-897 | Operator-facing UI (React + Tailwind paired-upload form); cycle 5+ |