# Batch 109 — Cycle 3 — AZ-840 e2e orchestrator test

**Date**: 2026-05-23
**Tasks**: AZ-840 (C4 — Epic AZ-835).
**Story points**: 3 (per the task spec).
**Jira status**: AZ-840 In Progress → In Testing at commit step.

## Why this batch exists

Epic AZ-835 (real-flight e2e validation) needs a single Tier-2 test that proves the 7-step pipeline runs from `(tlog, video, calibration)` to a horizontal-error verdict without operator hand-curation between steps. Steps 3-5 were delivered by AZ-839 (C3 — `operator_pre_flight_setup`); steps 1-2-6-7 are this batch.

The AZ-839 batch 108b follow-up note explicitly anticipated this batch: "AZ-840 will additionally need to feed the airborne replay binary a config that points at the same `cache_root` ... the cleanest path is for AZ-840 to write an effective YAML at runtime from the same override recipe used here."

## What this batch ships

A driver module + unit test suite + Tier-2 integration test:

* `tests/e2e/replay/_e2e_orchestrator.py` — wraps the AZ-699 verdict-report path with the AZ-839 C3 fixture's `PopulatedC6Cache`. Public surface:
  * `OrchestratorStep` enum — failure-step labels per AC-5.
  * `OrchestrationFailure(step, message)` exception — wraps every step failure with the step name in the message prefix.
  * `OrchestrationReport` dataclass — verdict, distribution, paths, wall-clock measurements per AC-4.
  * `write_effective_replay_config` — small helper that overlays `c6_tile_cache.root_dir` onto the static operator YAML.
  * `read_calibration_acquisition_method` — mirror of AZ-699's helper so the report writer keeps the same shape.
  * `run_e2e_orchestration` — the AC-1 entry point wiring validate → write_config → airborne subprocess → parse JSONL → load tlog GT → compute distribution → render report.
* `tests/e2e/replay/test_e2e_orchestrator_unit.py` — 17 unit tests covering each of the 7 steps' failure modes plus the happy path. The runner is injected (`subprocess.run` default) so unit tests stage synthetic JSONL output without touching the airborne binary. `load_tlog_ground_truth` is monkeypatched to return a synthetic 3-row series.
* `tests/e2e/replay/test_az835_e2e_real_flight.py::test_az840_e2e_real_flight_orchestration` — Tier-2 + RUN_REPLAY_E2E gated test that consumes the C3 fixture + Derkachi inputs and asserts the verdict markdown is written, the threshold-hit share table is present, and the 15-min budget held.

## AC coverage

| AC | Description | Coverage |
|----|-------------|----------|
| AC-1 | Steps 1-7 end-to-end on Tier-2 from a fresh tlog/video | `test_az840_e2e_real_flight_orchestration` (Tier-2-gated); 17 unit tests prove the orchestrator structure |
| AC-2 | Verdict report exists either PASS or FAIL | `test_run_e2e_orchestration_writes_report_even_on_fail_verdict` + integration assertion `report_path.is_file()` |
| AC-3 | Reuses C3 fixture (`operator_pre_flight_setup`) | Integration test consumes the fixture; effective config overlay points at `populated_cache.cache_root` |
| AC-4 | 15-min wall-time soft target on the Derkachi clip | `_DEFAULT_MAX_SECONDS = 900.0` passed as `subprocess.run` `timeout`; integration asserts `replay_subprocess_seconds <= 900` |
| AC-5 | Mid-pipeline failure fails LOUD with a clear step prefix | `OrchestratorStep` enum + 8 step-specific failure unit tests (`validate`/`write_config`/`airborne` × 3/`parse` × 2/`gt`) |
| AC-6 | Gated by `RUN_REPLAY_E2E=1` + Tier-2 marker | `_orchestrator_skip_reason()` checks env vars + binary + video size; `@pytest.mark.tier2` decorator |
| AC-7 | AZ-699 verdict test continues to pass | No changes to `test_derkachi_real_tlog.py`; same `real_flight_validation_.md` report path convention |
| AC-8 | Unit-tested orchestration helper without Tier-2 inputs | 17 unit tests covering config write (4) + calibration parse (3) + run helper (10) — all use mocked subprocess + GT loader |

## Test run results

```
$ .venv/bin/pytest tests/e2e/replay/ -v --tb=short --timeout=60
============================ 45 passed, 10 skipped, 3 warnings in 0.78s ============
```

Breakdown:

* 17 new orchestrator unit tests pass.
* 11 AZ-839 driver unit tests still pass (no driver changes).
* 14 helper unit tests (`test_helpers.py`) still pass.
* 3 derkachi-1min mode-agnostic AST tests still pass.
* 10 skips: 1 new Tier-2 (this AZ-840 integration), 6 RUN_REPLAY_E2E gated AZ-404 cases, 1 AC-8 D-PROJ-2 placeholder, 1 Tier-2 AZ-699, 1 Tier-2 AZ-839 integration. None are regressions; the tier2 gate trips off-Jetson.

## Design notes

### `--auto-trim` ownership

The orchestrator passes `--auto-trim` unconditionally so AZ-405 / AZ-698 active-flight-cut + tlog/video sync (Epic step 1) runs inside the airborne binary every time. The Epic narrative does not separate trim from the airborne pipeline; collapsing them into a single subprocess invocation matches AZ-699 and avoids duplicating the trim path.

### `clip_duration_s` parity with AZ-699

`run_e2e_orchestration` computes `clip_duration_s = ground_truth[-1].t_s - ground_truth[0].t_s` exactly as `test_derkachi_real_tlog.py` does. This means both verdict reports name the same clip duration even when the trimmed video is shorter than the ground-truth window — a deliberate choice: the report header documents what the verdict covers, not what the binary processed.

### Effective config write — single source of truth

`write_effective_replay_config` materialises the same override recipe AZ-839 uses in-memory, but on disk so the airborne subprocess sees the cache_root the fixture chose. Field-level merge: every other block in the operator YAML is preserved verbatim; only `c6_tile_cache.root_dir` and `c6_tile_cache.faiss_index_path` are overwritten. The static operator YAML on disk is never touched.
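
The field-level merge described above can be sketched on an already-parsed config mapping. This is a minimal illustration under stated assumptions, not the helper's actual code: `overlay_c6_cache_paths`, the sample config, and every path below are hypothetical, and only the key names `c6_tile_cache.root_dir` and `c6_tile_cache.faiss_index_path` come from this report. YAML reading and writing are left out.

```python
import copy
from pathlib import Path
from typing import Any, Dict, Mapping


def overlay_c6_cache_paths(
    operator_config: Mapping[str, Any],
    cache_root: Path,
    faiss_index_path: Path,
) -> Dict[str, Any]:
    """Return a copy of the operator config with the two C6 cache paths replaced.

    Hypothetical stand-in for the merge step; the input mapping is never mutated.
    """
    # Deep copy so nested blocks of the static config stay untouched.
    effective = copy.deepcopy(dict(operator_config))
    # setdefault keeps sibling keys that already live under c6_tile_cache.
    block = effective.setdefault("c6_tile_cache", {})
    block["root_dir"] = str(cache_root)
    block["faiss_index_path"] = str(faiss_index_path)
    return effective


# Illustrative operator config (assumed shape, not the project's real YAML).
static_config = {
    "camera": {"fps": 30},
    "c6_tile_cache": {
        "root_dir": "/opt/static/cache",
        "faiss_index_path": "/opt/static/cache/index.faiss",
        "max_tiles": 512,
    },
}

effective = overlay_c6_cache_paths(
    static_config,
    cache_root=Path("/tmp/fixture/cache"),
    faiss_index_path=Path("/tmp/fixture/cache/index.faiss"),
)
```

In this sketch the `camera` block and the `max_tiles` sibling key pass through unchanged, and `static_config` still holds its original paths afterwards, which mirrors the "never touched" guarantee for the static operator YAML.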
### Failure surface = step prefix

`OrchestrationFailure` always prefixes its message with `[]`. CI log scrapers and pytest's traceback printer both surface the prefix on the first line; AC-5 ("clear error pointing at the failing step") holds without requiring the test to inspect the exception object. The step is also exposed as `exc.step` for programmatic assertions.

## Files changed

* `tests/e2e/replay/_e2e_orchestrator.py` (new, 656 LOC).
* `tests/e2e/replay/test_e2e_orchestrator_unit.py` (new, 660+ LOC).
* `tests/e2e/replay/test_az835_e2e_real_flight.py` (new, 156 LOC).

No `src/` changes, no operator-config YAML changes, no AZ-839 driver changes. AZ-840 is purely additive at the test layer.

## Code review (self-review)

Verdict: **PASS_WITH_WARNINGS**.

| Phase | Result |
|-------|--------|
| 1. Context loading | Re-read `gps_compare.py`, `accuracy_report.py`, `replay_input.py`, `cli/replay.py`, `test_derkachi_real_tlog.py`. Emission schema (`emitted_at`, `position_wgs84`) is the same shape `gps-denied-replay` writes. |
| 2. Spec compliance | All 8 AZ-840 ACs covered; AC-7 holds by inspection (no AZ-699 changes). |
| 3. Code quality | All public types have docstrings; failure messages name the upstream exception via `repr` so `OSError` / `subprocess.TimeoutExpired` carry through. Runner kw-args mirror `subprocess.run` signature 1:1. |
| 4. Security quick-scan | Effective config write goes to a tmp file the test owns; no secrets in the YAML overlay (override is two string fields). Subprocess `env` is opt-in (`None` defaults to `os.environ`). |
| 5. Performance scan | Unit tests run in 0.51 s. Tier-2 wall-clock cap is 900 s, enforced by the subprocess timeout. |
| 6. Cross-task consistency | `clip_duration_s` and `report_path` match AZ-699 exactly so a single Jetson run produces the same markdown shape. |
| 7. Architecture compliance | Orchestrator lives entirely under `tests/e2e/replay/`; no `src/` writes. C3 fixture's invariants (`PopulatedC6Cache.cache_root` is the single source of truth) propagate via `write_effective_replay_config`. |

## Findings

| ID | Severity | Description | Disposition |
|----|----------|-------------|-------------|
| F1 | Low | `_default_tile_decoder` in `conftest.py` (carried from batch 108) — still raw TIFF. Not in the AZ-840 path; AZ-840 doesn't change tile decoding. | Defer; no AZ-840 ticket. |
| F2 | Low | `_resolve_replay_descriptor_dim` is NetVLAD-only (carried from batch 108). AZ-840 doesn't change descriptors. | Defer; no AZ-840 ticket. |
| F3 | Low | `--pace asap` is hardcoded in `_run_replay_subprocess` argv; the AZ-699 test passes `--pace asap` too, so behaviour is identical. If a future test wants a real-time pace, the runner kwarg is the seam. | Document; no ticket. |
| F4 | Low | `_run_replay_subprocess` does not stream stdout/stderr; failures surface only after the subprocess exits. For 15-min runs this means the operator sees no progress until the budget expires. AZ-699 has the same shape. | Document; consider an AZ-* if the budget grows. |

## Notes for follow-up

* AZ-840 lands the orchestrator test as Tier-2-gated. Verifying the Tier-2 path actually runs on the Jetson harness is the next gating step before Epic AZ-835 can flip from "covered by unit tests" to "covered by Tier-2 integration".
* `_e2e_orchestrator.py` is intentionally kept under `tests/` rather than promoted to `src/`. If a second consumer of the same orchestration shape appears (e.g. AZ-833 mock-suite-sat parity test), the move to a shared helper module under `src/gps_denied_onboard/replay/` is the right next step; for now the test-only location matches the helper's only consumer.
* AZ-841 (Tier-2 unxfail follow-up) and AZ-842 (replay protocol + orchestrator docs) sit downstream — both should reference this batch report in their planning sections.
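
For readers picking up AZ-841 or AZ-842, the step-prefixed failure surface and the injected runner described in the design notes could look roughly like the sketch below. It is an illustration under assumptions, not the module's code: the `[<step>] <message>` bracket format, the enum member names, and `run_airborne_step` are hypothetical, and only the step labels, the 900 s timeout, and the use of `repr` on the upstream exception come from this report.

```python
import enum
import subprocess
from typing import Any, Callable, Sequence


class OrchestratorStep(enum.Enum):
    # Labels from the AC-5 row; the real enum may define more members.
    VALIDATE = "validate"
    WRITE_CONFIG = "write_config"
    AIRBORNE = "airborne"
    PARSE = "parse"
    GT = "gt"


class OrchestrationFailure(RuntimeError):
    """Step failure whose message starts with the failing step's label."""

    def __init__(self, step: OrchestratorStep, message: str) -> None:
        # Assumed prefix format: "[<step>] <message>".
        super().__init__(f"[{step.value}] {message}")
        self.step = step


def run_airborne_step(
    argv: Sequence[str],
    *,
    timeout_s: float = 900.0,
    runner: Callable[..., Any] = subprocess.run,
) -> Any:
    """Run the replay command through an injectable runner (hypothetical helper)."""
    try:
        return runner(
            list(argv), capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except (OSError, subprocess.SubprocessError) as exc:
        # repr keeps the upstream exception type visible in the message.
        raise OrchestrationFailure(
            OrchestratorStep.AIRBORNE, f"replay subprocess failed: {exc!r}"
        ) from exc


def fake_timeout_runner(argv: Sequence[str], **kwargs: Any) -> Any:
    # Test double: simulates the binary exceeding its wall-clock budget.
    raise subprocess.TimeoutExpired(argv, kwargs["timeout"])


caught = None
try:
    run_airborne_step(
        ["gps-denied-replay", "--auto-trim", "--pace", "asap"],
        runner=fake_timeout_runner,
    )
except OrchestrationFailure as exc:
    caught = exc
```

Because the runner is a plain callable with `subprocess.run` as its default, a unit test can swap in a double like `fake_timeout_runner` and assert on both `caught.step` and the message prefix without launching the airborne binary.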