# Replay — E2E replay fixture test (Derkachi 1–2 min clip + tlog) + mode-agnosticism + operator workflow

**Task**: AZ-404_replay_e2e_fixture
**Name**: E2E replay fixture test — Derkachi 1–2 min clip + tlog; AC-3 ≤ 100 m for ≥ 80 % of ticks + mode-agnosticism enforcement + operator-workflow rehearsal
**Description**: Implement `tests/e2e/replay/test_derkachi_1min.py`, which runs the `gps-denied-replay` console-script against a 1–2 min Derkachi clip + matching pymavlink `.tlog` and asserts AC-3 of the epic: L2 horizontal distance ≤ 100 m for ≥ 80 % of ticks (matches AC-1.3 cumulative-drift bound). Per ADR-011 the test runs against the **single airborne image** in replay mode — there is no separate replay-cli image to verify. The test also asserts:

- AC-1 (CLI exits 0; JSONL line count within ±5 % of `GLOBAL_POSITION_INT` tlog count);
- AC-2 (each line is valid JSON matching `EstimatorOutput` schema);
- AC-4 — **revised per ADR-011** — mode-agnosticism of the C1–C7 + C13 components + byte-equality of C8 outbound encoders between live and replay (the v1.0.0 SBOM-diff check is replaced by these two AST/byte assertions);
- AC-5 (determinism: same input → same output within ≤ 1e-6 float drift in position fields, run twice and diff);
- AC-6 (`--pace realtime` runs in 60 s ± 3 s; `--pace asap` in ≤ 30 s on Tier-1 hardware);
- AC-9 (operator pre-flight workflow rehearsal: the test setup runs the operator's C10/C11/C12 pre-flight flow against a mock satellite-provider before invoking the replay CLI, demonstrating that the operator workflow is identical between live and replay modes).

Test fixture: re-uses the existing Derkachi corpus (`_docs/00_problem/input_data/flight_derkachi/`) — clip a 60–120 s segment + matching tlog window. The test is gated by the `RUN_REPLAY_E2E=1` env var in CI (Tier-1 capable; not run on every PR by default, per the project's existing E2E gating pattern).

**Complexity**: 5 points (unchanged from v1.0.0 — the test surface is the same; AC-4 is reworded but no smaller; AC-9 is added, AC-8 removed).
**Dependencies**: AZ-402 (CLI entrypoint); AZ-401 (compose_root replay branch); AZ-405 (`ReplayInputAdapter` + auto-sync inside replay_input/); the Derkachi fixture (`_docs/00_problem/input_data/flight_derkachi/`); the airborne Docker image (the same image the live binary ships in — no replay-specific image; ADR-011); AZ-263, AZ-269, AZ-266, AZ-272, AZ-273.
**Component**: replay-tests (epic AZ-265 / E-DEMO-REPLAY) — test at `tests/e2e/replay/`.
**Tracker**: AZ-404
**Epic**: AZ-265 (E-DEMO-REPLAY)

### Document Dependencies

- `_docs/02_document/contracts/replay/replay_protocol.md` (v2.0.0) — Invariants 1, 5, 7, 10, 12 (mode-agnosticism + encoder byte-equality + JSONL one-line-per-emit + determinism + real C6 cache in replay).
- `_docs/02_document/architecture.md` — **ADR-011** (replay-as-configuration; the design-defining decision that AC-4 enforces).
- `_docs/02_document/components/07_c5_state/description.md` — `EstimatorOutput` schema.
- `_docs/00_problem/input_data/flight_derkachi/README.md` — fixture documentation.
- `_docs/00_problem/input_data/expected_results/position_accuracy.csv` — ground-truth GPS for the AC-3 assertion.

## Problem

Without this task, AC-3 (the epic's primary acceptance gate — demo confidence equals field test confidence on the same footage) is unverified. AC-5 (determinism) and AC-6 (pace timing) are similarly unverified at the system level. Under ADR-011, AC-4 (mode-agnosticism + byte-equality of C8 encoders) and AC-9 (operator workflow rehearsal) are now the structural guarantees that replace the v1.0.0 SBOM diff — without this task, the airborne and replay code paths can drift silently and nothing in CI catches it.

## Outcome

- `tests/e2e/replay/conftest.py`:
  - Fixture `derkachi_replay_inputs` returning `(video_path, tlog_path, calib_path, ground_truth_csv)`.
  - Fixture `operator_pre_flight_setup` (NEW per AC-9): runs the operator C12 pre-flight flow against a `mock-suite-sat-service` fixture (per ADR-007) — plan route → download tiles → build C10 manifest+engines+descriptor index → assert the cache content hash matches the expected fixture. The fixture yields the populated cache directory + the manifest path.
  - Fixture `replay_runner` invoking the CLI via `subprocess.run(["gps-denied-replay", ...])` (or equivalent) against the populated cache and returning the captured stdout/stderr + exit code + parsed JSONL output.
- `tests/e2e/replay/test_derkachi_1min.py`:
  - `test_ac1_exits_0_jsonl_count_match`.
  - `test_ac2_jsonl_schema_match`.
  - `test_ac3_within_100m_80pct_of_ticks`.
  - `test_ac4_mode_agnosticism_ast_scan` (NEW per ADR-011): AST scan asserts no `components/**/*.py` file contains `if config.mode` / `if mode == "replay"` / `is_replay` style branches. The scan is part of this E2E test for centralized ownership of the invariant; it can be hoisted to a standalone lint later if useful.
  - `test_ac4_encoder_byte_equality` (NEW per ADR-011): construct two identical `EstimatorOutput` instances; pass one through `compose_root(config_live).fc_adapter.emit_external_position(out)` (with `SerialMavlinkTransport` replaced by a `CapturingMavlinkTransport` test fixture); pass the other through `compose_root(config_replay).fc_adapter.emit_external_position(out)` (with `NoopMavlinkTransport` replaced by the same `CapturingMavlinkTransport`); assert the captured byte streams are byte-identical (replay protocol Invariant 5).
  - `test_ac5_determinism_two_runs_diff`.
  - `test_ac6_pace_realtime_60s_within_5pct`.
  - `test_ac6_pace_asap_under_30s`.
  - `test_ac9_operator_workflow` (NEW per ADR-011): use the `operator_pre_flight_setup` fixture; assert the cache directory's content hash matches the expected fixture hash; then invoke `replay_runner` against the populated cache; assert AC-3 passes. This is the integration proof that the operator workflow is identical between live and replay.
- Helper `tests/e2e/replay/_helpers.py`:
  - JSONL parser → list of `EstimatorOutput`.
  - L2 horizontal-distance computation (WGS84-aware; uses `WgsConverter` AZ-279 inside the test for ground-truth comparison).
  - Match-percentage computation against ground-truth GPS.
  - `CapturingMavlinkTransport` test fixture (used by `test_ac4_encoder_byte_equality`).
- CI gating: tests marked `@pytest.mark.skipif(not os.getenv("RUN_REPLAY_E2E"), reason="...")` per the project's E2E pattern.
- Documentation: `tests/e2e/replay/README.md` describes how to run locally + which env var enables in CI + the operator-workflow rehearsal fixture.

## Scope

### Included

- All 9 test methods (AC-1, AC-2, AC-3, AC-4 mode-agnosticism, AC-4 byte-equality, AC-5, AC-6 realtime, AC-6 asap, AC-9 operator workflow).
- Helper functions for JSONL parsing + ground-truth comparison + `CapturingMavlinkTransport`.
- Conftest fixtures incl. `operator_pre_flight_setup`.
- README.

### Excluded

- AC-7 / AC-8 auto-sync detection unit tests — owned by AZ-405 (the E2E test uses the auto-sync via the CLI, but unit-level positive/ambiguous/hand-launch cases live with AZ-405).
- Test against a separate replay-cli Docker image — **dropped per ADR-011**; the test runs against the airborne image only.

## Acceptance Criteria

**AC-1: test_ac1_exits_0_jsonl_count_match passes** — runs the CLI; exit code is 0; JSONL line count is within ±5 % of the tlog's `GLOBAL_POSITION_INT` count.

**AC-2: test_ac2_jsonl_schema_match passes** — every JSONL line is a valid JSON object with all `EstimatorOutput` schema fields present + correct types.

**AC-3: test_ac3_within_100m_80pct_of_ticks passes** — for the Derkachi fixture with known ground-truth GPS, ≥ 80 % of emitted `EstimatorOutput` records have L2 horizontal distance ≤ 100 m from ground truth.

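The AC-3 pass criterion reduces to a proportion check over per-tick errors. A minimal sketch, assuming the horizontal errors (in metres) have already been computed against ground truth; `fraction_within` and `ac3_passes` are hypothetical helper names used for illustration, not existing project code:

```python
from typing import Sequence

AC3_MAX_ERROR_M = 100.0   # per-tick horizontal error bound from AC-3
AC3_MIN_FRACTION = 0.80   # share of ticks that must satisfy the bound


def fraction_within(errors_m: Sequence[float],
                    max_error_m: float = AC3_MAX_ERROR_M) -> float:
    """Return the share of ticks whose horizontal error is <= max_error_m."""
    if not errors_m:
        # Assumption: an empty replay output should fail loudly, not pass vacuously.
        raise ValueError("no ticks to evaluate")
    hits = sum(1 for e in errors_m if e <= max_error_m)
    return hits / len(errors_m)


def ac3_passes(errors_m: Sequence[float]) -> bool:
    """True when at least AC3_MIN_FRACTION of ticks are within the bound."""
    return fraction_within(errors_m) >= AC3_MIN_FRACTION
```

Rejecting an empty error list is a design choice in this sketch: a run that emits no ticks would otherwise be indistinguishable from a perfect run.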
**AC-4a: test_ac4_mode_agnosticism_ast_scan passes** — AST scan over `src/gps_denied_onboard/components/**/*.py` asserts no file contains an `if config.mode` / `if mode == "replay"` / `if self._replay_mode` / `is_replay` style branch. Replay-mode logic is structurally confined to the composition root + the replay strategies + the `replay_input/` coordinator.

**AC-4b: test_ac4_encoder_byte_equality passes** — for a known `EstimatorOutput`, the C8 outbound encoder byte stream is byte-identical between `compose_root(config_live)` and `compose_root(config_replay)` (verified via `CapturingMavlinkTransport`). The MAVLink 2.0 signing handshake runs in both modes; the dummy signing key in replay produces a byte-equivalent encoded output.

**AC-5: test_ac5_determinism_two_runs_diff passes** — run the CLI twice with identical args; load both JSONL outputs; assert position fields differ by ≤ 1e-6 float (replay protocol Invariant 10).

**AC-6a: test_ac6_pace_realtime_60s_within_5pct passes** — run with `--pace realtime` on a 60 s clip; assert wall-clock duration is 60 s ± 3 s.

**AC-6b: test_ac6_pace_asap_under_30s passes** — run with `--pace asap` on the same 60 s clip; assert wall-clock duration ≤ 30 s on Tier-1 hardware.

**AC-7: All tests skip cleanly without RUN_REPLAY_E2E** — when the env var is unset, `pytest tests/e2e/replay/` reports all 9 E2E tests as SKIPPED, not FAILED.

**AC-8: test_ac9_operator_workflow passes** — the `operator_pre_flight_setup` fixture runs the operator C12 pre-flight flow against a mock satellite-provider; the resulting cache directory's content hash matches the expected fixture; the replay CLI then runs against the populated cache and AC-3 passes. Demonstrates replay protocol Invariant 12 (real C6 cache in replay) + epic AC-9 (operator workflow identity).

**AC-9: Helper L2 computation correct** — unit-level test of the WGS84 L2 helper against hand-computed expected distance for a known coord pair.

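One way to obtain an independent expected value for the helper check in AC-9 is a spherical haversine computation. The sketch below is an illustration under stated assumptions: it is not the project's `WgsConverter`, `haversine_m` is a hypothetical name, and a spherical model can differ from an ellipsoidal WGS84 computation by a few tenths of a percent, so any comparison against the real helper needs a matching tolerance.

```python
import math

# Assumption: mean Earth radius in metres (spherical model, not the WGS84 ellipsoid).
EARTH_RADIUS_M = 6_371_008.8


def haversine_m(lat1_deg: float, lon1_deg: float,
                lat2_deg: float, lon2_deg: float) -> float:
    """Great-circle distance in metres between two lat/lon points given in degrees."""
    phi1 = math.radians(lat1_deg)
    phi2 = math.radians(lat2_deg)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2_deg - lon1_deg)
    # Haversine formula: a is the squared half-chord length on the unit sphere.
    a = math.sin(dphi / 2.0) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2.0) ** 2
    # Clamp against floating-point overshoot before taking asin.
    c = 2.0 * math.asin(min(1.0, math.sqrt(a)))
    return EARTH_RADIUS_M * c
```

A convenient hand-computed case is one degree of latitude along a meridian, where the expected value on this sphere is simply `EARTH_RADIUS_M * math.pi / 180`, roughly 111.2 km.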
**AC-10: README accuracy** — `tests/e2e/replay/README.md` documents the env var, the fixture location, the expected runtime per pace, the operator-workflow rehearsal fixture, and the failure-mode cookbook (e.g., "if AC-3 fails, regenerate ground-truth via X").

## Non-Functional Requirements

- E2E suite runtime ≤ 6 min on Tier-1 hardware (one operator pre-flight setup + one realtime run + one asap run + two determinism asap runs + AC-4 byte-equality + AST scan; the operator-workflow setup adds ~30 s vs. v1.0.0).
- E2E memory ≤ 4 GB resident (epic NFT).

## Constraints

- Re-use the Derkachi fixture (`_docs/00_problem/input_data/flight_derkachi/`); do NOT introduce new fixture data unless explicitly missing.
- Re-use the `mock-suite-sat-service` test fixture (per ADR-007) for the operator pre-flight rehearsal.
- pytest is the test runner.
- Tier-1 hardware assumed (Jetson AGX Orin or equivalent x86 with CUDA per the project's CI matrix).
- The 1–2 min clip is a sub-segment of the existing Derkachi flight; the segment range is documented in `tests/e2e/replay/README.md`.

## Risks & Mitigation

- **Risk: AC-3 flake under non-deterministic ML inference** — *Mitigation*: AC-5 (determinism) covers the two-runs-equal case; AC-3 is the offline-replay-quality check; if the system is non-deterministic enough to flake AC-3, that's a deeper bug worth surfacing.
- **Risk: Derkachi fixture clip not yet trimmed** — *Mitigation*: this task includes producing the trimmed clip + tlog window as part of the fixture; the conftest fixture file holds the trim definition (start/end timestamps).
- **Risk: AC-6 realtime timing flakes on shared CI runners** — *Mitigation*: ± 3 s tolerance is generous; if flakes persist, the tolerance widens to ± 5 s in a follow-up.
- **Risk (new per ADR-011): mode-agnosticism AST scan false-positives** — *Mitigation*: the scan whitelist is owned by this test; legitimate uses of `config.mode` inside `runtime_root/*` are NOT scanned (only `components/**/*.py`); the test fails with the offending file path + line so the author can move the branch into `runtime_root` or into a replay strategy.
- **Risk (new per ADR-011): encoder byte-equality fails because the MAVLink signing nonce / counter differs between live and replay** — *Mitigation*: the test uses a `DeterministicSigningKey` fixture that seeds the per-flight nonce / counter to a known value; both `compose_root(config_live)` and `compose_root(config_replay)` use this seeded key. If the byte streams still differ after the deterministic-seeding fix, that is a genuine drift between live and replay encoders and is a P0 bug.

## Runtime Completeness

- **Named capability**: end-to-end replay regression test against the Derkachi fixture + mode-agnosticism enforcement + operator-workflow rehearsal.
- **Production code**: real CLI invocation, real ground-truth comparison, real determinism diff, real AST scan, real encoder byte-stream capture, real operator C12 pre-flight run.
- **Allowed external stubs**: `mock-suite-sat-service` (per ADR-007) for the operator pre-flight rehearsal only; no other stubs — this is the integration-fidelity test.
- **Unacceptable substitutes**: an in-process pytest harness that bypasses the CLI subprocess (defeats AC-1 — the deliverable is the console-script entrypoint); a separate replay-cli Docker image test (defeats ADR-011 — there is only one image).

## Contract

Verifies `_docs/02_document/contracts/replay/replay_protocol.md` (v2.0.0) — Invariants 1, 5, 7, 10, 12; epic ACs 1, 2, 3, 4 (mode-agnosticism + byte-equality), 5, 6, 9.
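The mode-agnosticism scan from AC-4a can be sketched with the standard-library `ast` module. This is an illustrative outline, not the project's implementation: the function names, the set of flagged identifiers, and the rule that only `config.mode` / `_config.mode` attribute access counts (to limit false positives on unrelated `.mode` attributes) are all assumptions.

```python
import ast
from pathlib import Path
from typing import List, Tuple

# Assumption: identifiers treated as replay-mode flags for this sketch.
REPLAY_FLAG_NAMES = {"is_replay", "replay_mode", "_replay_mode"}
CONFIG_NAMES = {"config", "_config"}


def _identifier(node: ast.AST) -> str:
    """Return the trailing identifier of a Name or Attribute node, else ''."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return node.attr
    return ""


def _is_mode_test(test: ast.AST) -> bool:
    """True if a branch condition looks like a live/replay mode check."""
    for node in ast.walk(test):
        if _identifier(node) in REPLAY_FLAG_NAMES:
            return True
        if (isinstance(node, ast.Attribute) and node.attr == "mode"
                and _identifier(node.value) in CONFIG_NAMES):
            return True
        if isinstance(node, ast.Compare):
            operands = [node.left, *node.comparators]
            if any(isinstance(o, ast.Constant) and o.value == "replay" for o in operands):
                return True
    return False


def find_mode_branches(source: str, filename: str = "<string>") -> List[Tuple[str, int]]:
    """Return (filename, line) for each if / conditional expression / while that branches on mode."""
    tree = ast.parse(source, filename=filename)
    return sorted(
        (filename, node.lineno)
        for node in ast.walk(tree)
        if isinstance(node, (ast.If, ast.IfExp, ast.While)) and _is_mode_test(node.test)
    )


def scan_components(root: Path) -> List[Tuple[str, int]]:
    """Scan every *.py file under root; an empty result means the invariant holds."""
    offenders: List[Tuple[str, int]] = []
    for path in sorted(root.rglob("*.py")):
        offenders.extend(find_mode_branches(path.read_text(encoding="utf-8"), str(path)))
    return offenders
```

A test built on this would call `scan_components` on the components directory only and assert the result is empty, printing the offending path and line on failure. Keeping `runtime_root` out of the scanned tree, rather than maintaining per-file exceptions, matches the whitelist approach described in the risks.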