Wraps the AZ-699 verdict-report path with the AZ-839 operator_pre_flight_setup C3 fixture so a single Tier-2 test takes only (tlog, video, calibration) and runs the full 7-step pipeline on the Jetson harness without operator hand-curation.
New surface (tests-only, no src/ changes):
- tests/e2e/replay/_e2e_orchestrator.py: orchestrator with OrchestratorStep enum, OrchestrationFailure exception (step prefix per AC-5), OrchestrationReport dataclass, write_effective_replay_config helper, and run_e2e_orchestration entry point covering steps 1-2-6-7.
- tests/e2e/replay/test_e2e_orchestrator_unit.py: 17 unit tests covering each failure mode + happy path with mocked subprocess + ground-truth loader (AC-8).
- tests/e2e/replay/test_az835_e2e_real_flight.py: Tier-2 + RUN_REPLAY_E2E gated integration test asserting verdict report exists, 15-min budget held (AC-1, AC-2, AC-3, AC-4, AC-6).
The effective config write overlays c6_tile_cache.root_dir onto the static operator YAML at runtime so the airborne subprocess shares the cache_root the C3 fixture chose. Field-level merge: every other operator-config block stays verbatim. The static YAML on disk is never touched.
Test run: tests/e2e/replay 45 passed, 10 skipped (10 skips were 9 pre-existing + 1 new tier2).
No src/ touched, no AZ-839 driver changes; AC-7 (AZ-699 still passes) holds by inspection.
Co-authored-by: Cursor <cursoragent@cursor.com>
Batch 109 — Cycle 3 — AZ-840 e2e orchestrator test
Date: 2026-05-23
Tasks: AZ-840 (C4, Epic AZ-835).
Story points: 3 (per the task spec).
Jira status: AZ-840 In Progress → In Testing at commit step.
Why this batch exists
Epic AZ-835 (real-flight e2e validation) needs a single Tier-2
test that proves the 7-step pipeline runs from
(tlog, video, calibration) to a horizontal-error verdict
without operator hand-curation between steps. Steps 3-5 were
delivered by AZ-839 (C3 — operator_pre_flight_setup); steps
1-2-6-7 are this batch.
The AZ-839 batch 108b follow-up note explicitly anticipated this
batch: "AZ-840 will additionally need to feed the airborne
replay binary a config that points at the same cache_root
... the cleanest path is for AZ-840 to write an effective YAML
at runtime from the same override recipe used here."
What this batch ships
A driver module + unit test suite + Tier-2 integration test:
- tests/e2e/replay/_e2e_orchestrator.py: wraps the AZ-699 verdict-report path with the AZ-839 C3 fixture's PopulatedC6Cache. Public surface:
  - OrchestratorStep enum: failure-step labels per AC-5.
  - OrchestrationFailure(step, message) exception: wraps every step failure with the step name in the message prefix.
  - OrchestrationReport dataclass: verdict, distribution, paths, wall-clock measurements per AC-4.
  - write_effective_replay_config: small helper that overlays c6_tile_cache.root_dir onto the static operator YAML.
  - read_calibration_acquisition_method: mirror of AZ-699's helper so the report writer keeps the same shape.
  - run_e2e_orchestration: the AC-1 entry point wiring validate → write_config → airborne subprocess → parse JSONL → load tlog GT → compute distribution → render report.
- tests/e2e/replay/test_e2e_orchestrator_unit.py: 17 unit tests covering each of the 7 steps' failure modes plus the happy path. The runner is injected (subprocess.run default) so unit tests stage synthetic JSONL output without touching the airborne binary. load_tlog_ground_truth is monkeypatched to return a synthetic 3-row series.
- tests/e2e/replay/test_az835_e2e_real_flight.py::test_az840_e2e_real_flight_orchestration: Tier-2 + RUN_REPLAY_E2E gated test that consumes the C3 fixture + Derkachi inputs and asserts the verdict markdown is written, the threshold-hit share table is present, and the 15-min budget held.
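The runner injection described above can be illustrated with a minimal sketch. It is not the code in _e2e_orchestrator.py: run_replay_step and make_fake_runner are hypothetical names, and the value shapes inside each JSONL row are assumptions (only the field names emitted_at and position_wgs84 come from this report).

```python
import json
import subprocess
import tempfile
from pathlib import Path


def run_replay_step(argv, jsonl_path, *, runner=subprocess.run, timeout=900.0):
    """Hypothetical stand-in for the airborne + parse steps of the orchestrator."""
    # The runner receives the same keyword arguments subprocess.run would.
    completed = runner(argv, capture_output=True, text=True, timeout=timeout)
    if completed.returncode != 0:
        raise RuntimeError(f"[airborne] exit code {completed.returncode}")
    # Parse one JSON object per non-empty line of the emissions file.
    text = Path(jsonl_path).read_text(encoding="utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def make_fake_runner(jsonl_path, rows, returncode=0):
    """Build a runner that writes synthetic JSONL instead of spawning a process."""
    calls = []

    def fake_runner(argv, **kwargs):
        calls.append((list(argv), kwargs))
        Path(jsonl_path).write_text(
            "".join(json.dumps(row) + "\n" for row in rows), encoding="utf-8"
        )
        return subprocess.CompletedProcess(argv, returncode, stdout="", stderr="")

    fake_runner.calls = calls  # exposed so a test can inspect argv and kwargs
    return fake_runner


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "emissions.jsonl"
    # Assumed row shape; the real emission schema may carry more fields.
    rows = [{"emitted_at": 0.0, "position_wgs84": [50.0, 36.0, 120.0]}]
    fake = make_fake_runner(out, rows)
    parsed = run_replay_step(["gps-denied-replay", "--auto-trim"], out, runner=fake)
```

Because the fake returns a subprocess.CompletedProcess, the step under test cannot tell it apart from a real run, and the recorded calls let a unit test assert that the 900 s timeout was forwarded.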
AC coverage
| AC | Description | Coverage |
|---|---|---|
| AC-1 | Steps 1-7 end-to-end on Tier-2 from a fresh tlog/video | test_az840_e2e_real_flight_orchestration (Tier-2-gated); 17 unit tests prove the orchestrator structure |
| AC-2 | Verdict report exists either PASS or FAIL | test_run_e2e_orchestration_writes_report_even_on_fail_verdict + integration assertion report_path.is_file() |
| AC-3 | Reuses C3 fixture (operator_pre_flight_setup) | Integration test consumes the fixture; effective config overlay points at populated_cache.cache_root |
| AC-4 | 15-min wall-time soft target on the Derkachi clip | _DEFAULT_MAX_SECONDS = 900.0 passed as subprocess.run timeout; integration asserts replay_subprocess_seconds <= 900 |
| AC-5 | Mid-pipeline failure fails LOUD with a clear step prefix | OrchestratorStep enum + 8 step-specific failure unit tests (validate/write_config/airborne × 3/parse × 2/gt) |
| AC-6 | Gated by RUN_REPLAY_E2E=1 + Tier-2 marker | _orchestrator_skip_reason() checks env vars + binary + video size; @pytest.mark.tier2 decorator |
| AC-7 | AZ-699 verdict test continues to pass | No changes to test_derkachi_real_tlog.py; same real_flight_validation_<date>.md report path convention |
| AC-8 | Unit-tested orchestration helper without Tier-2 inputs | 17 unit tests covering config write (4) + calibration parse (3) + run helper (10) — all use mocked subprocess + GT loader |
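The AC-6 gate in the table above can be sketched as a pure function that returns a skip reason or None. This is an illustration under assumptions, not the body of _orchestrator_skip_reason(): the parameter names, the reason strings, and the min_video_bytes threshold are hypothetical, and only the RUN_REPLAY_E2E=1 check and the "binary + video size" criteria come from this report.

```python
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def orchestrator_skip_reason(
    env: Mapping[str, str],
    binary: Path,
    video: Path,
    min_video_bytes: int = 1,  # assumed threshold, not taken from the source
) -> Optional[str]:
    """Return a human-readable skip reason, or None when the test may run."""
    if env.get("RUN_REPLAY_E2E") != "1":
        return "RUN_REPLAY_E2E=1 is not set"
    if not binary.is_file():
        return f"replay binary not found at {binary}"
    if not video.is_file() or video.stat().st_size < min_video_bytes:
        return f"video missing or smaller than {min_video_bytes} bytes: {video}"
    return None


with tempfile.TemporaryDirectory() as tmp:
    binary = Path(tmp) / "replay-binary"
    video = Path(tmp) / "flight.mp4"
    binary.write_bytes(b"\x7fELF")
    video.write_bytes(b"\x00" * 16)
    reason_gate_off = orchestrator_skip_reason({}, binary, video)
    reason_small = orchestrator_skip_reason(
        {"RUN_REPLAY_E2E": "1"}, binary, video, min_video_bytes=1024
    )
    reason_ok = orchestrator_skip_reason({"RUN_REPLAY_E2E": "1"}, binary, video)
```

Taking the environment as a mapping rather than reading os.environ inside the function keeps the gate testable off-Jetson; a test module would typically pass os.environ and feed the result to a pytest skip.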
Test run results
$ .venv/bin/pytest tests/e2e/replay/ -v --tb=short --timeout=60
============================ 45 passed, 10 skipped, 3 warnings in 0.78s ============
Breakdown:
- 17 new orchestrator unit tests pass.
- 11 AZ-839 driver unit tests still pass (no driver changes).
- 14 helper unit tests (test_helpers.py) still pass.
- 3 derkachi-1min mode-agnostic AST tests still pass.
- 10 skips: 1 new Tier-2 (this AZ-840 integration), 6 RUN_REPLAY_E2E gated AZ-404 cases, 1 AC-8 D-PROJ-2 placeholder, 1 Tier-2 AZ-699, 1 Tier-2 AZ-839 integration. None are regressions; the tier2 gate trips off-Jetson.
Design notes
--auto-trim ownership
The orchestrator passes --auto-trim unconditionally so AZ-405 /
AZ-698 active-flight-cut + tlog/video sync (Epic step 1) runs
inside the airborne binary every time. The Epic narrative does
not separate trim from the airborne pipeline; collapsing them
into a single subprocess invocation matches AZ-699 and avoids
duplicating the trim path.
clip_duration_s parity with AZ-699
run_e2e_orchestration computes
clip_duration_s = ground_truth[-1].t_s - ground_truth[0].t_s
exactly as test_derkachi_real_tlog.py does. This means both
verdict reports name the same clip duration even when the
trimmed video is shorter than the ground-truth window — a
deliberate choice: the report header documents what the verdict
covers, not what the binary processed.
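As a worked instance of the formula above, the following sketch computes clip_duration_s from a synthetic 3-row series. GroundTruthSample and its latitude/longitude fields are hypothetical; the only attribute this report confirms is t_s.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GroundTruthSample:
    """Hypothetical ground-truth row; only t_s is named in the batch report."""
    t_s: float
    lat_deg: float
    lon_deg: float


def clip_duration_s(ground_truth: Sequence[GroundTruthSample]) -> float:
    """Duration of the ground-truth window, independent of the trimmed video."""
    if not ground_truth:
        raise ValueError("ground truth series is empty")
    return ground_truth[-1].t_s - ground_truth[0].t_s


series = [
    GroundTruthSample(t_s=10.0, lat_deg=50.0, lon_deg=36.0),
    GroundTruthSample(t_s=40.0, lat_deg=50.001, lon_deg=36.001),
    GroundTruthSample(t_s=70.0, lat_deg=50.002, lon_deg=36.002),
]
duration = clip_duration_s(series)  # 70.0 - 10.0 = 60.0
```

The duration depends only on the first and last ground-truth timestamps, which is why both verdict reports agree even when the binary processed a shorter trimmed clip.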
Effective config write — single source of truth
write_effective_replay_config materialises the same override
recipe AZ-839 uses in-memory, but on disk so the airborne
subprocess sees the cache_root the fixture chose. Field-level
merge: every other block in the operator YAML is preserved
verbatim; only c6_tile_cache.root_dir and
c6_tile_cache.faiss_index_path are overwritten. The static
operator YAML on disk is never touched.
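A minimal sketch of the field-level merge, operating on an already-parsed config mapping. Loading the operator YAML and dumping the result to a temporary file are deliberately left out, and overlay_tile_cache, the sample paths, and the max_tiles and camera keys are hypothetical; only c6_tile_cache.root_dir and c6_tile_cache.faiss_index_path are taken from this report.

```python
import copy
from typing import Any, Dict


def overlay_tile_cache(
    operator_config: Dict[str, Any],
    cache_root: str,
    faiss_index_path: str,
) -> Dict[str, Any]:
    """Return a copy of the operator config with two c6_tile_cache fields replaced."""
    # Deep copy so the caller's (static) config object is never mutated.
    effective = copy.deepcopy(operator_config)
    block = effective.setdefault("c6_tile_cache", {})
    block["root_dir"] = str(cache_root)
    block["faiss_index_path"] = str(faiss_index_path)
    return effective


# Hypothetical operator config; only the two overridden keys are from the source.
operator = {
    "c6_tile_cache": {
        "root_dir": "/static/cache",
        "faiss_index_path": "/static/cache/index.faiss",
        "max_tiles": 4096,
    },
    "camera": {"fps": 30},
}
effective = overlay_tile_cache(operator, "/tmp/c3/cache", "/tmp/c3/cache/index.faiss")
```

Overwriting individual keys instead of replacing the whole c6_tile_cache block is what keeps sibling settings such as the illustrative max_tiles intact.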
Failure surface = step prefix
OrchestrationFailure always prefixes its message with
[<step>]. CI log scrapers and pytest's traceback printer both
surface the prefix on the first line; AC-5 ("clear error
pointing at the failing step") holds without requiring the test
to inspect the exception object. The step is also exposed as
exc.step for programmatic assertions.
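The prefix convention can be sketched as follows. The enum member names and values are assumptions inferred from the failure-test grouping in the AC table (validate, write_config, airborne, parse, gt), and wrap_failure is a hypothetical helper; the real OrchestratorStep and OrchestrationFailure may differ in detail.

```python
import enum


class OrchestratorStep(enum.Enum):
    """Assumed step labels; the real enum may use different members."""
    VALIDATE = "validate"
    WRITE_CONFIG = "write_config"
    AIRBORNE = "airborne"
    PARSE = "parse"
    GT = "gt"


class OrchestrationFailure(RuntimeError):
    """Failure whose message always starts with the failing step in brackets."""

    def __init__(self, step: OrchestratorStep, message: str) -> None:
        self.step = step
        super().__init__(f"[{step.value}] {message}")


def wrap_failure(step: OrchestratorStep, exc: BaseException) -> OrchestrationFailure:
    """Hypothetical helper: carry the upstream exception through via repr."""
    return OrchestrationFailure(step, f"upstream error: {exc!r}")


try:
    try:
        raise OSError("replay binary is not executable")
    except OSError as upstream:
        raise wrap_failure(OrchestratorStep.AIRBORNE, upstream) from upstream
except OrchestrationFailure as failure:
    caught = failure
```

A log scraper can match on the bracketed prefix of str(caught), while a test can assert on caught.step without parsing the message.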
Files changed
- tests/e2e/replay/_e2e_orchestrator.py (new, 656 LOC).
- tests/e2e/replay/test_e2e_orchestrator_unit.py (new, 660+ LOC).
- tests/e2e/replay/test_az835_e2e_real_flight.py (new, 156 LOC).
No src/ changes, no operator-config YAML changes, no AZ-839
driver changes. AZ-840 is purely additive at the test layer.
Code review (self-review)
Verdict: PASS_WITH_WARNINGS.
| Phase | Result |
|---|---|
| 1. Context loading | Re-read gps_compare.py, accuracy_report.py, replay_input.py, cli/replay.py, test_derkachi_real_tlog.py. Emission schema (emitted_at, position_wgs84) is the same shape gps-denied-replay writes. |
| 2. Spec compliance | All 8 AZ-840 ACs covered; AC-7 holds by inspection (no AZ-699 changes). |
| 3. Code quality | All public types have docstrings; failure messages name the upstream exception via repr so OSError / subprocess.TimeoutExpired carry through. Runner kw-args mirror subprocess.run signature 1:1. |
| 4. Security quick-scan | Effective config write goes to a tmp file the test owns; no secrets in the YAML overlay (override is two string fields). Subprocess env is opt-in (None defaults to os.environ). |
| 5. Performance scan | Unit tests run in 0.51 s. Tier-2 wall-clock cap is 900 s, enforced by the subprocess timeout. |
| 6. Cross-task consistency | clip_duration_s and report_path match AZ-699 exactly so a single Jetson run produces the same markdown shape. |
| 7. Architecture compliance | Orchestrator lives entirely under tests/e2e/replay/; no src/ writes. C3 fixture's invariants (PopulatedC6Cache.cache_root is the single source of truth) propagate via write_effective_replay_config. |
Findings
| ID | Severity | Description | Disposition |
|---|---|---|---|
| F1 | Low | _default_tile_decoder in conftest.py (carried from batch 108) is still raw TIFF. Not in the AZ-840 path; AZ-840 doesn't change tile decoding. | Defer; no AZ-840 ticket. |
| F2 | Low | _resolve_replay_descriptor_dim is NetVLAD-only (carried from batch 108). AZ-840 doesn't change descriptors. | Defer; no AZ-840 ticket. |
| F3 | Low | --pace asap is hardcoded in _run_replay_subprocess argv; the AZ-699 test passes --pace asap too, so behaviour is identical. If a future test wants a real-time pace, the runner kwarg is the seam. | Document; no ticket. |
| F4 | Low | _run_replay_subprocess does not stream stdout/stderr; failures surface only after the subprocess exits. For 15-min runs this means the operator sees no progress until the budget expires. AZ-699 has the same shape. | Document; consider an AZ-* if the budget grows. |
Notes for follow-up
- AZ-840 lands the orchestrator test as Tier-2-gated. Verifying the Tier-2 path actually runs on the Jetson harness is the next gating step before Epic AZ-835 can flip from "covered by unit tests" to "covered by Tier-2 integration".
- _e2e_orchestrator.py is intentionally kept under tests/ rather than promoted to src/. If a second consumer of the same orchestration shape appears (e.g. AZ-833 mock-suite-sat parity test), the move to a shared helper module under src/gps_denied_onboard/replay/ is the right next step; for now the test-only location matches the helper's only consumer.
- AZ-841 (Tier-2 unxfail follow-up) and AZ-842 (replay protocol + orchestrator docs) sit downstream; both should reference this batch report in their planning sections.