gps-denied-onboard/_docs/02_tasks/todo/AZ-404_replay_e2e_fixture.md
Oleksandr Bezdieniezhnykh 5adf3dd04f [AZ-265] Replay as configuration of airborne binary (ADR-011)
Re-design replay mode per user direction: replay is no longer a fourth
Docker image with a reduced component set, but a `config.mode = "replay"`
branch of the single airborne binary. The pre-flight workflow (route in
suite UI -> C12 tile download via real satellite-provider -> C10
manifest+engines build) is identical between live and replay; only three
strategies swap at compose time:

  FrameSource:      Live <-> Video
  FcAdapter:        Pymavlink/MSP2 <-> TlogReplay
  MavlinkTransport: Serial <-> Noop

The C8 outbound MAVLink encoders run unchanged in both modes; their
bytes hit `NoopMavlinkTransport` in replay and disappear. A new
`JsonlReplaySink` taps C5's `EstimatorOutput` stream so the parent-suite
UI sees per-tick coordinates by tailing `results.jsonl`. MAVLink 2.0
signing key remains mandatory (operator supplies a dummy file).

A new `replay_input/` Layer-4 cross-cutting coordinator owns
`(video, tlog) -> (FrameSource, FcAdapter, Clock)` convergence; the
composition root sees only standard interfaces past `.open()`.

Docs:
- architecture.md: new ADR-011 with full rationale; ADR-002 binary
  narrative updated.
- contracts/replay/replay_protocol.md: bumped to v2.0.0; 12 invariants
  (notably mode-agnosticism + encoder byte-equality + signing key
  mandatory + real C6 cache in replay).
- module-layout.md: Build-Time Exclusion Map dropped from 4 to 3 binary
  columns; replay-mode `BUILD_*` flags default ON in airborne;
  `shared/replay_input` cross-cutting entry added.
- epics.md: E-DEMO-REPLAY scope reframed; story points 27-32 -> 19-24.

Task respecs:
- AZ-401: shrunk 3 -> 2 pts; `compose_root` mode branch + JSONL sink +
  NoopMavlinkTransport wiring; legacy `compose_replay` export deleted.
- AZ-402: console-script wrapper that mutates `config.mode = "replay"`
  and dispatches into the shared airborne main; `--mavlink-signing-key`
  mandatory.
- AZ-403: CANCELLED. Moved to done/ with banner; Jira transition deferred
  via `_docs/_process_leftovers/2026-05-14_az_403_cancellation_pending_tracker.md`.
- AZ-404: AC-4 reworded as mode-agnosticism AST scan + encoder
  byte-equality test; new AC-8 operator-workflow rehearsal.
- AZ-405: also owns the `replay_input/` module + `ReplayInputAdapter`.

_dependencies_table.md updated: AZ-401 gains AZ-405 dep; AZ-404 drops
AZ-403 dep; AZ-403 row marked CANCELLED.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-14 09:01:04 +03:00


Replay: E2E replay fixture test (Derkachi 1-2 min clip + tlog) + mode-agnosticism + operator workflow

Task: AZ-404_replay_e2e_fixture
Name: E2E replay fixture test: Derkachi 1-2 min clip + tlog; AC-3 ≤ 100 m for ≥ 80 % of ticks + mode-agnosticism enforcement + operator-workflow rehearsal
Description: Implement tests/e2e/replay/test_derkachi_1min.py, which runs the gps-denied-replay console-script against a 1-2 min Derkachi clip + matching pymavlink .tlog and asserts AC-3 of the epic: L2 horizontal distance ≤ 100 m for ≥ 80 % of ticks (matches AC-1.3 cumulative-drift bound). Per ADR-011 the test runs against the single airborne image in replay mode; there is no separate replay-cli image to verify. Also asserts:

  • AC-1 (CLI exits 0; JSONL line count within ±5 % of GLOBAL_POSITION_INT tlog count);
  • AC-2 (each line is valid JSON matching EstimatorOutput schema);
  • AC-4 (revised per ADR-011): mode-agnosticism of the C1-C7 + C13 components + byte-equality of C8 outbound encoders between live and replay (the v1.0.0 SBOM-diff check is replaced by these two AST/byte assertions);
  • AC-5 (determinism: same input → same output within ≤ 1e-6 float drift in position fields, run twice and diff);
  • AC-6 (--pace realtime runs in 60 ± 3 s; --pace asap in ≤ 30 s on Tier-1 hardware);
  • AC-9 (operator pre-flight workflow rehearsal: the test setup runs the operator's C10/C11/C12 pre-flight flow against a mock satellite-provider before invoking the replay CLI, demonstrating that the operator workflow is identical between live and replay modes).

Test fixture: re-uses the existing Derkachi corpus (_docs/00_problem/input_data/flight_derkachi/), clipping a 60-120 s segment + matching tlog window. Test gated by RUN_REPLAY_E2E=1 env var in CI (Tier-1 capable; not run on every PR by default per the project's existing E2E gating pattern).

Complexity: 5 points (unchanged from v1.0.0: the test surface is the same; AC-4 is reworded but no smaller; AC-9 is added, AC-8 removed).
Dependencies: AZ-402 (CLI entrypoint); AZ-401 (compose_root replay branch); AZ-405 (ReplayInputAdapter + auto-sync inside replay_input/); the Derkachi fixture (_docs/00_problem/input_data/flight_derkachi/); the airborne Docker image (the same image the live binary ships in; no replay-specific image; ADR-011); AZ-263, AZ-269, AZ-266, AZ-272, AZ-273.
Component: replay-tests (epic AZ-265 / E-DEMO-REPLAY); test at tests/e2e/replay/.
Tracker: AZ-404
Epic: AZ-265 (E-DEMO-REPLAY)

Document Dependencies

  • _docs/02_document/contracts/replay/replay_protocol.md (v2.0.0) — Invariants 1, 5, 7, 10, 12 (mode-agnosticism + encoder byte-equality + JSONL one-line-per-emit + determinism + real C6 cache in replay).
  • _docs/02_document/architecture.md: ADR-011 (replay-as-configuration; the design-defining decision that AC-4 enforces).
  • _docs/02_document/components/07_c5_state/description.md: EstimatorOutput schema.
  • _docs/00_problem/input_data/flight_derkachi/README.md — fixture documentation.
  • _docs/00_problem/input_data/expected_results/position_accuracy.csv — ground-truth GPS for the AC-3 assertion.

Problem

Without this task, AC-3 (the epic's primary acceptance gate — demo confidence equals field test confidence on the same footage) is unverified. AC-5 (determinism) and AC-6 (pace timing) are similarly unverified at the system level. Under ADR-011, AC-4 (mode-agnosticism + byte-equality of C8 encoders) and AC-9 (operator workflow rehearsal) are now the structural guarantees that replace the v1.0.0 SBOM diff — without this task, the airborne and replay code paths can drift silently and nothing in CI catches it.

Outcome

  • tests/e2e/replay/conftest.py:
    • Fixture derkachi_replay_inputs returning (video_path, tlog_path, calib_path, ground_truth_csv).
    • Fixture operator_pre_flight_setup (NEW per AC-9): runs the operator C12 pre-flight flow against a mock-suite-sat-service fixture (per ADR-007) — plan route → download tiles → build C10 manifest+engines+descriptor index → assert the cache content hash matches the expected fixture. The fixture yields the populated cache directory + the manifest path.
    • Fixture replay_runner invoking the CLI via subprocess.run(["gps-denied-replay", ...]) (or equivalent) against the populated cache and returning the captured stdout/stderr + exit code + parsed JSONL output.
  • tests/e2e/replay/test_derkachi_1min.py:
    • test_ac1_exits_0_jsonl_count_match.
    • test_ac2_jsonl_schema_match.
    • test_ac3_within_100m_80pct_of_ticks.
    • test_ac4_mode_agnosticism_ast_scan (NEW per ADR-011): AST scan asserts no components/**/*.py file contains if config.mode / if mode == "replay" / is_replay style branches. The scan is part of this E2E test for centralized ownership of the invariant; can be hoisted to a standalone lint later if useful.
    • test_ac4_encoder_byte_equality (NEW per ADR-011): construct two identical EstimatorOutput instances; pass one through compose_root(config_live).fc_adapter.emit_external_position(out) (with SerialMavlinkTransport replaced by a CapturingMavlinkTransport test fixture); pass the other through compose_root(config_replay).fc_adapter.emit_external_position(out) (with NoopMavlinkTransport replaced by the same CapturingMavlinkTransport); assert the captured byte streams are byte-identical (replay protocol Invariant 5).
    • test_ac5_determinism_two_runs_diff.
    • test_ac6_pace_realtime_60s_within_5pct.
    • test_ac6_pace_asap_under_30s.
    • test_ac9_operator_workflow (NEW per ADR-011): use the operator_pre_flight_setup fixture; assert the cache directory's content hash matches the expected fixture hash; then invoke replay_runner against the populated cache; assert AC-3 passes. This is the integration proof that the operator workflow is identical between live and replay.
  • Helper tests/e2e/replay/_helpers.py:
    • JSONL parser → list of EstimatorOutput.
    • L2 horizontal-distance computation (WGS84-aware; uses WgsConverter AZ-279 inside the test for ground-truth comparison).
    • Match-percentage computation against ground-truth GPS.
    • CapturingMavlinkTransport test fixture (used by test_ac4_encoder_byte_equality).
  • CI gating: tests marked @pytest.mark.skipif(not os.getenv("RUN_REPLAY_E2E"), reason="...") per the project's E2E pattern.
  • Documentation: tests/e2e/replay/README.md describes how to run locally + which env var enables in CI + the operator-workflow rehearsal fixture.

Scope

Included

  • All 9 test methods (AC-1, AC-2, AC-3, AC-4 mode-agnosticism, AC-4 byte-equality, AC-5, AC-6 realtime, AC-6 asap, AC-9 operator workflow).
  • Helper functions for JSONL parsing + ground-truth comparison + CapturingMavlinkTransport.
  • Conftest fixtures incl. operator_pre_flight_setup.
  • README.

Excluded

  • AC-7 / AC-8 auto-sync detection unit tests — owned by AZ-405 (the E2E test uses the auto-sync via the CLI, but unit-level positive/ambiguous/hand-launch cases live with AZ-405).
  • Test against a separate replay-cli Docker image — dropped per ADR-011; the test runs against the airborne image only.

Acceptance Criteria

AC-1: test_ac1_exits_0_jsonl_count_match passes — runs the CLI; exit code is 0; JSONL line count is within ±5 % of the tlog's GLOBAL_POSITION_INT count.
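
The count check in AC-1 can be sketched as below. This is a minimal illustration, not the project's helper: the function names are hypothetical, and the expected count is assumed to have been obtained separately by counting GLOBAL_POSITION_INT messages in the tlog.

```python
import json


def count_jsonl_records(text: str) -> int:
    """Count non-empty lines, requiring each one to parse as a JSON object."""
    count = 0
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate a trailing blank line
        record = json.loads(line)  # raises ValueError on malformed JSON
        if not isinstance(record, dict):
            raise ValueError("expected one JSON object per line")
        count += 1
    return count


def within_relative_tolerance(actual: int, expected: int, rel_tol: float = 0.05) -> bool:
    """True when actual is within rel_tol (default 5 %) of expected."""
    return abs(actual - expected) <= rel_tol * expected
```

Parsing every line while counting also gives an early failure for the structural half of AC-2 (each line is a JSON object); field-level schema checks would still need the EstimatorOutput definition.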

AC-2: test_ac2_jsonl_schema_match passes — every JSONL line is a valid JSON object with all EstimatorOutput schema fields present + correct types.

AC-3: test_ac3_within_100m_80pct_of_ticks passes — for the Derkachi fixture with known ground-truth GPS, ≥ 80 % of emitted EstimatorOutput records have L2 horizontal distance ≤ 100 m from ground truth.
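
A rough sketch of the AC-3 computation follows. It swaps in a spherical haversine distance as a stand-in for the project's WgsConverter (AZ-279), whose API is not shown in this task, and it assumes estimates and ground truth are already paired tick by tick. Function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation, not the WGS84 ellipsoid


def horizontal_distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def fraction_within(estimates, truth, threshold_m: float = 100.0) -> float:
    """Fraction of paired (lat, lon) ticks whose horizontal error is <= threshold_m."""
    if not estimates or len(estimates) != len(truth):
        raise ValueError("estimates and truth must be non-empty and the same length")
    hits = sum(
        1
        for (lat, lon), (t_lat, t_lon) in zip(estimates, truth)
        if horizontal_distance_m(lat, lon, t_lat, t_lon) <= threshold_m
    )
    return hits / len(estimates)
```

With this shape, the AC-3 assertion reduces to `fraction_within(estimates, truth) >= 0.8`. At a 100 m threshold the spherical approximation error is small relative to the bound, but the real test should use the project's converter.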

AC-4a: test_ac4_mode_agnosticism_ast_scan passes — AST scan over src/gps_denied_onboard/components/**/*.py asserts no file contains an if config.mode / if mode == "replay" / if self._replay_mode / is_replay style branch. Replay-mode logic is structurally confined to the composition root + the replay strategies + the replay_input/ coordinator.
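
One possible shape for the AC-4a scan, using only the standard-library ast module. The matching rules here are an assumption derived from the branch styles named above (config.mode, a "replay" string comparison, _replay_mode, is_replay); the real scan and its whitelist are owned by the test and may differ.

```python
import ast

# Identifier names treated as replay-mode flags (taken from the AC text).
REPLAY_FLAG_NAMES = {"is_replay", "_replay_mode"}


def _mentions_replay_mode(node: ast.AST) -> bool:
    """True if a single AST node looks like a replay-mode reference."""
    if isinstance(node, ast.Attribute):
        if node.attr in REPLAY_FLAG_NAMES:
            return True
        # Matches `config.mode` specifically, to avoid flagging unrelated `.mode` attributes.
        return (
            node.attr == "mode"
            and isinstance(node.value, ast.Name)
            and node.value.id == "config"
        )
    if isinstance(node, ast.Name):
        return node.id in REPLAY_FLAG_NAMES
    if isinstance(node, ast.Constant):
        return isinstance(node.value, str) and node.value == "replay"
    return False


def find_mode_branches(source: str, filename: str = "<string>") -> list[int]:
    """Return sorted line numbers of if / conditional-expression / while tests that mention replay mode."""
    tree = ast.parse(source, filename=filename)
    lines = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.If, ast.IfExp, ast.While)):
            if any(_mentions_replay_mode(sub) for sub in ast.walk(node.test)):
                lines.add(node.lineno)
    return sorted(lines)
```

The test would call this for each file under src/gps_denied_onboard/components/ and fail with the file path plus the returned line numbers. This sketch does not cover match statements or flags hidden behind helper calls with other names.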

AC-4b: test_ac4_encoder_byte_equality passes — for a known EstimatorOutput, the C8 outbound encoder byte stream is byte-identical between compose_root(config_live) and compose_root(config_replay) (verified via CapturingMavlinkTransport). The MAVLink 2.0 signing handshake runs in both modes; the dummy signing key in replay produces a byte-equivalent encoded output.
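
The capture-and-compare idea behind AC-4b can be illustrated as follows. Everything here is hypothetical: the transport interface is assumed to expose a single write(bytes) method, and encode_position is a toy stand-in encoder, not a MAVLink frame and not the C8 encoder.

```python
import struct
from dataclasses import dataclass, field


@dataclass
class CapturingMavlinkTransport:
    """Test double that records outbound frames instead of sending them (assumed write() interface)."""

    frames: list = field(default_factory=list)

    def write(self, data: bytes) -> int:
        self.frames.append(bytes(data))
        return len(data)

    def captured(self) -> bytes:
        """All recorded frames concatenated in emit order."""
        return b"".join(self.frames)


def encode_position(lat_deg: float, lon_deg: float, alt_m: float) -> bytes:
    """Toy encoder: little-endian int32 lat/lon in 1e-7 degrees and altitude in millimetres."""
    return struct.pack(
        "<iii", round(lat_deg * 1e7), round(lon_deg * 1e7), round(alt_m * 1000)
    )


def emit(transport: CapturingMavlinkTransport, lat_deg: float, lon_deg: float, alt_m: float) -> None:
    """Push one encoded position through the given transport."""
    transport.write(encode_position(lat_deg, lon_deg, alt_m))
```

In the real test the two transports would be injected into compose_root(config_live) and compose_root(config_replay) respectively, and the assertion would be `live.captured() == replay.captured()`.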

AC-5: test_ac5_determinism_two_runs_diff passes — run the CLI twice with identical args; load both JSONL outputs; assert position fields differ by ≤ 1e-6 float (replay protocol Invariant 10).
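
A minimal sketch of the two-run diff in AC-5. The position field names (lat, lon, alt) are placeholders, since the EstimatorOutput schema is defined elsewhere; the function name is also hypothetical.

```python
import json

# Placeholder field names; the real ones come from the EstimatorOutput schema.
POSITION_FIELDS = ("lat", "lon", "alt")


def max_position_drift(run_a: str, run_b: str, fields=POSITION_FIELDS) -> float:
    """Largest absolute per-field difference between two JSONL outputs, compared line by line."""
    rows_a = [json.loads(line) for line in run_a.splitlines() if line.strip()]
    rows_b = [json.loads(line) for line in run_b.splitlines() if line.strip()]
    if len(rows_a) != len(rows_b):
        raise ValueError("runs produced a different number of records")
    drift = 0.0
    for a, b in zip(rows_a, rows_b):
        for name in fields:
            drift = max(drift, abs(a[name] - b[name]))
    return drift
```

The assertion then becomes `max_position_drift(first, second) <= 1e-6`. Treating a differing record count as an error, rather than truncating to the shorter run, keeps a dropped tick from passing silently.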

AC-6a: test_ac6_pace_realtime_60s_within_5pct passes — run with --pace realtime on a 60 s clip; assert wall-clock duration is 60 s ± 3 s.

AC-6b: test_ac6_pace_asap_under_30s passes — run with --pace asap on the same 60 s clip; assert wall-clock duration ≤ 30 s on Tier-1 hardware.
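
The wall-clock checks in AC-6a and AC-6b could be measured along these lines. This is a sketch with hypothetical helper names; it times any subprocess, so the CLI's input arguments (not listed in this task) are left to the caller.

```python
import subprocess
import time


def timed_run(argv, timeout_s: float = 600.0):
    """Run a command and return (exit_code, wall_clock_seconds) using a monotonic clock."""
    start = time.monotonic()
    completed = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    return completed.returncode, time.monotonic() - start


def within_window(elapsed_s: float, target_s: float, tol_s: float) -> bool:
    """True when elapsed_s is inside target_s plus or minus tol_s."""
    return abs(elapsed_s - target_s) <= tol_s
```

For AC-6a the test would assert `within_window(elapsed, 60.0, 3.0)` after a --pace realtime run; for AC-6b it would assert `elapsed <= 30.0` after a --pace asap run. time.monotonic is used so that system clock adjustments during the run cannot skew the measurement.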

AC-7: All tests skip cleanly without RUN_REPLAY_E2E. When the env var is unset, pytest tests/e2e/replay/ reports all 9 tests as SKIPPED, not FAILED.

AC-8: test_ac9_operator_workflow passes — the operator_pre_flight_setup fixture runs the operator C12 pre-flight flow against a mock satellite-provider; the resulting cache directory's content hash matches the expected fixture; the replay CLI then runs against the populated cache and AC-3 passes. Demonstrates replay protocol Invariant 12 (real C6 cache in replay) + epic AC-9 (operator workflow identity).

AC-9: Helper L2 computation correct — unit-level test of the WGS84 L2 helper against hand-computed expected distance for a known coord pair.

AC-10: README accuracy; tests/e2e/replay/README.md documents the env var, the fixture location, the expected runtime per pace, the operator-workflow rehearsal fixture, and the failure-mode cookbook (e.g., "if AC-3 fails, regenerate ground-truth via X").

Non-Functional Requirements

  • E2E suite runtime ≤ 6 min on Tier-1 hardware (one operator pre-flight setup + one realtime run + one asap run + two determinism asap runs + AC-4 byte-equality + AST scan; the operator-workflow setup adds ~30 s vs. v1.0.0).
  • E2E memory ≤ 4 GB resident (epic NFT).

Constraints

  • Re-use the Derkachi fixture (_docs/00_problem/input_data/flight_derkachi/); do NOT introduce new fixture data unless explicitly missing.
  • Re-use the mock-suite-sat-service test fixture (per ADR-007) for the operator pre-flight rehearsal.
  • pytest is the test runner.
  • Tier-1 hardware assumed (Jetson AGX Orin or equivalent x86 with CUDA per the project's CI matrix).
  • The 1-2 min clip is a sub-segment of the existing Derkachi flight; the segment range is documented in tests/e2e/replay/README.md.

Risks & Mitigation

  • Risk: AC-3 flake under non-deterministic ML inference. Mitigation: AC-5 (determinism) covers the two-runs-equal case; AC-3 is the offline-replay-quality check; if the system is non-deterministic enough to flake AC-3, that's a deeper bug worth surfacing.
  • Risk: Derkachi fixture clip not yet trimmed. Mitigation: this task includes producing the trimmed clip + tlog window as part of the fixture; the conftest fixture file holds the trim definition (start/end timestamps).
  • Risk: AC-6 realtime timing flakes on shared CI runners. Mitigation: ± 3 s tolerance is generous; if flakes persist, the tolerance widens to ± 5 s in a follow-up.
  • Risk (new per ADR-011): mode-agnosticism AST scan false-positives. Mitigation: the scan whitelist is owned by this test; legitimate uses of config.mode inside runtime_root/* are NOT scanned (only components/**/*.py); the test fails with the offending file path + line so the author can move the branch into runtime_root or into a replay strategy.
  • Risk (new per ADR-011): encoder byte-equality fails because the MAVLink signing nonce / counter differs between live and replay. Mitigation: the test uses a DeterministicSigningKey fixture that seeds the per-flight nonce / counter to a known value; both compose_root(config_live) and compose_root(config_replay) use this seeded key. If the byte streams still differ after the deterministic-seeding fix, that is a genuine drift between live and replay encoders and is a P0 bug.

Runtime Completeness

  • Named capability: end-to-end replay regression test against the Derkachi fixture + mode-agnosticism enforcement + operator-workflow rehearsal.
  • Production code: real CLI invocation, real ground-truth comparison, real determinism diff, real AST scan, real encoder byte-stream capture, real operator C12 pre-flight run.
  • Allowed external stubs: mock-suite-sat-service (per ADR-007) for the operator pre-flight rehearsal only; no other stubs — this is the integration-fidelity test.
  • Unacceptable substitutes: an in-process pytest harness that bypasses the CLI subprocess (defeats AC-1 — the deliverable is the console-script entrypoint); a separate replay-cli Docker image test (defeats ADR-011 — there is only one image).

Contract

Verifies _docs/02_document/contracts/replay/replay_protocol.md (v2.0.0) — Invariants 1, 5, 7, 10, 12; epic ACs 1, 2, 3, 4 (mode-agnosticism + byte-equality), 5, 6, 9.