Oleksandr Bezdieniezhnykh ade0c86f2b [AZ-840] [AZ-835] e2e orchestrator test (E-AZ-835 C4)
Wraps the AZ-699 verdict-report path with the AZ-839
operator_pre_flight_setup C3 fixture so a single Tier-2 test
takes only (tlog, video, calibration) and runs the full 7-step
pipeline on the Jetson harness without operator hand-curation.

New surface (tests-only, no src/ changes):
- tests/e2e/replay/_e2e_orchestrator.py — orchestrator with
  OrchestratorStep enum, OrchestrationFailure exception (step
  prefix per AC-5), OrchestrationReport dataclass,
  write_effective_replay_config helper, and
  run_e2e_orchestration entry point covering steps 1-2-6-7.
- tests/e2e/replay/test_e2e_orchestrator_unit.py — 17 unit
  tests covering each failure mode + happy path with mocked
  subprocess + ground-truth loader (AC-8).
- tests/e2e/replay/test_az835_e2e_real_flight.py — Tier-2 +
  RUN_REPLAY_E2E gated integration test asserting verdict
  report exists, 15-min budget held (AC-1, AC-2, AC-3, AC-4,
  AC-6).

The effective config write overlays c6_tile_cache.root_dir
onto the static operator YAML at runtime so the airborne
subprocess shares the cache_root the C3 fixture chose. Field-
level merge — every other operator-config block stays
verbatim. The static YAML on disk is never touched.

Test run: tests/e2e/replay 45 passed, 10 skipped (10 skips
were 9 pre-existing + 1 new tier2). No src/ touched, no
AZ-839 driver changes; AC-7 (AZ-699 still passes) holds by
inspection.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-23 15:27:41 +03:00


Batch 109 — Cycle 3 — AZ-840 e2e orchestrator test

Date: 2026-05-23
Tasks: AZ-840 (C4, Epic AZ-835)
Story points: 3 (per the task spec)
Jira status: AZ-840 In Progress → In Testing at commit step

Why this batch exists

Epic AZ-835 (real-flight e2e validation) needs a single Tier-2 test that proves the 7-step pipeline runs from (tlog, video, calibration) to a horizontal-error verdict without operator hand-curation between steps. Steps 3-5 were delivered by AZ-839 (C3, operator_pre_flight_setup); this batch delivers steps 1, 2, 6, and 7.

The AZ-839 batch 108b follow-up note explicitly anticipated this batch: "AZ-840 will additionally need to feed the airborne replay binary a config that points at the same cache_root ... the cleanest path is for AZ-840 to write an effective YAML at runtime from the same override recipe used here."

What this batch ships

A driver module + unit test suite + Tier-2 integration test:

  • tests/e2e/replay/_e2e_orchestrator.py — wraps the AZ-699 verdict-report path with the AZ-839 C3 fixture's PopulatedC6Cache. Public surface:
    • OrchestratorStep enum — failure-step labels per AC-5.
    • OrchestrationFailure(step, message) exception — wraps every step failure with the step name in the message prefix.
    • OrchestrationReport dataclass — verdict, distribution, paths, wall-clock measurements per AC-4.
    • write_effective_replay_config: small helper that overlays c6_tile_cache.root_dir and c6_tile_cache.faiss_index_path onto the static operator YAML.
    • read_calibration_acquisition_method — mirror of AZ-699's helper so the report writer keeps the same shape.
    • run_e2e_orchestration — the AC-1 entry point wiring validate → write_config → airborne subprocess → parse JSONL → load tlog GT → compute distribution → render report.
  • tests/e2e/replay/test_e2e_orchestrator_unit.py — 17 unit tests covering each of the 7 steps' failure modes plus the happy path. The runner is injected (subprocess.run default) so unit tests stage synthetic JSONL output without touching the airborne binary. load_tlog_ground_truth is monkeypatched to return a synthetic 3-row series.
  • tests/e2e/replay/test_az835_e2e_real_flight.py::test_az840_e2e_real_flight_orchestration is a Tier-2 + RUN_REPLAY_E2E gated test that consumes the C3 fixture + Derkachi inputs and asserts the verdict markdown is written, the threshold-hit share table is present, and the 15-min budget held.
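
The injected-runner seam described in the unit-test bullet could look roughly like the sketch below. It is not the orchestrator's actual code: `run_replay`, `make_fake_runner`, and their signatures are hypothetical. Only the idea of defaulting to `subprocess.run` while letting tests stage synthetic JSONL comes from this report.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable, Sequence

# Hypothetical alias: anything call-compatible with subprocess.run.
Runner = Callable[..., subprocess.CompletedProcess]


def run_replay(argv: Sequence[str], output_jsonl: Path, *,
               runner: Runner = subprocess.run,
               timeout: float = 900.0) -> list[dict]:
    """Invoke the replay command through `runner`, then parse its JSONL output."""
    completed = runner(list(argv), capture_output=True, text=True, timeout=timeout)
    if completed.returncode != 0:
        raise RuntimeError(
            f"replay exited with {completed.returncode}: {completed.stderr}"
        )
    lines = output_jsonl.read_text(encoding="utf-8").splitlines()
    # Skip blank lines so a trailing newline does not break parsing.
    return [json.loads(line) for line in lines if line.strip()]


def make_fake_runner(output_jsonl: Path, rows: list[dict]) -> Runner:
    """Build a stand-in for subprocess.run that stages synthetic JSONL."""
    def fake_runner(argv, **kwargs):
        payload = "\n".join(json.dumps(row) for row in rows) + "\n"
        output_jsonl.write_text(payload, encoding="utf-8")
        return subprocess.CompletedProcess(args=argv, returncode=0, stdout="", stderr="")
    return fake_runner
```

A unit test would pass `runner=make_fake_runner(path, rows)` and assert on the parsed rows, so the airborne binary is never started.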

AC coverage

| AC | Description | Coverage |
|----|-------------|----------|
| AC-1 | Steps 1-7 end-to-end on Tier-2 from a fresh tlog/video | test_az840_e2e_real_flight_orchestration (Tier-2-gated); 17 unit tests prove the orchestrator structure |
| AC-2 | Verdict report exists either PASS or FAIL | test_run_e2e_orchestration_writes_report_even_on_fail_verdict + integration assertion report_path.is_file() |
| AC-3 | Reuses C3 fixture (operator_pre_flight_setup) | Integration test consumes the fixture; effective config overlay points at populated_cache.cache_root |
| AC-4 | 15-min wall-time soft target on the Derkachi clip | _DEFAULT_MAX_SECONDS = 900.0 passed as subprocess.run timeout; integration asserts replay_subprocess_seconds <= 900 |
| AC-5 | Mid-pipeline failure fails LOUD with a clear step prefix | OrchestratorStep enum + 8 step-specific failure unit tests (validate/write_config/airborne × 3/parse × 2/gt) |
| AC-6 | Gated by RUN_REPLAY_E2E=1 + Tier-2 marker | _orchestrator_skip_reason() checks env vars + binary + video size; @pytest.mark.tier2 decorator |
| AC-7 | AZ-699 verdict test continues to pass | No changes to test_derkachi_real_tlog.py; same real_flight_validation_<date>.md report path convention |
| AC-8 | Unit-tested orchestration helper without Tier-2 inputs | 17 unit tests covering config write (4) + calibration parse (3) + run helper (10); all use mocked subprocess + GT loader |
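
The AC-6 gate (environment variable, binary presence, video size) might be expressed along these lines. This is a sketch under stated assumptions: the function name, its parameters, and the `min_video_bytes` threshold are hypothetical. The report only says that `_orchestrator_skip_reason()` checks env vars, the binary, and the video size.

```python
import shutil
from pathlib import Path
from typing import Mapping, Optional


def orchestrator_skip_reason(env: Mapping[str, str], binary: str, video: Path,
                             min_video_bytes: int = 1) -> Optional[str]:
    """Return why the Tier-2 test should skip, or None when it can run."""
    if env.get("RUN_REPLAY_E2E") != "1":
        return "RUN_REPLAY_E2E=1 not set"
    # shutil.which resolves the binary on PATH (or accepts an explicit path).
    if shutil.which(binary) is None:
        return f"replay binary {binary!r} not on PATH"
    # Assumed check: a missing or too-small file means the clip was not staged.
    if not video.is_file() or video.stat().st_size < min_video_bytes:
        return f"video {video} missing or smaller than {min_video_bytes} bytes"
    return None
```

In a test module the returned string would typically feed `pytest.mark.skipif`, so a skipped run states which precondition was missing.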

Test run results

$ .venv/bin/pytest tests/e2e/replay/ -v --tb=short --timeout=60
============================ 45 passed, 10 skipped, 3 warnings in 0.78s ============

Breakdown:

  • 17 new orchestrator unit tests pass.
  • 11 AZ-839 driver unit tests still pass (no driver changes).
  • 14 helper unit tests (test_helpers.py) still pass.
  • 3 derkachi-1min mode-agnostic AST tests still pass.
  • 10 skips: 1 new Tier-2 (this AZ-840 integration), 6 RUN_REPLAY_E2E gated AZ-404 cases, 1 AC-8 D-PROJ-2 placeholder, 1 Tier-2 AZ-699, 1 Tier-2 AZ-839 integration. None are regressions; the tier2 gate trips off-Jetson.

Design notes

--auto-trim ownership

The orchestrator passes --auto-trim unconditionally so AZ-405 / AZ-698 active-flight-cut + tlog/video sync (Epic step 1) runs inside the airborne binary every time. The Epic narrative does not separate trim from the airborne pipeline; collapsing them into a single subprocess invocation matches AZ-699 and avoids duplicating the trim path.
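
As a rough illustration of "unconditional", the argv assembly could be sketched as below. `build_replay_argv` and `input_args` are hypothetical names; the only flags taken from this report are `--auto-trim` and `--pace asap`, and every other argument is passed through opaquely rather than guessed.

```python
from typing import Sequence


def build_replay_argv(binary: str, input_args: Sequence[str]) -> list[str]:
    """Assemble the replay command line; trim and pace flags are always appended."""
    # input_args stands in for whatever input/config arguments the binary takes.
    return [binary, *input_args, "--auto-trim", "--pace", "asap"]
```

Because the caller has no parameter that removes `--auto-trim`, every invocation runs the trim and sync step inside the same subprocess.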

clip_duration_s parity with AZ-699

run_e2e_orchestration computes clip_duration_s = ground_truth[-1].t_s - ground_truth[0].t_s exactly as test_derkachi_real_tlog.py does. This means both verdict reports name the same clip duration even when the trimmed video is shorter than the ground-truth window — a deliberate choice: the report header documents what the verdict covers, not what the binary processed.
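
A minimal sketch of that computation follows. The `GroundTruthRow` shape is an assumption made for illustration (only the `t_s` field is named in this report), and the empty-series guard is an addition of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundTruthRow:
    # Hypothetical row shape; only t_s appears in the report.
    t_s: float
    lat_deg: float
    lon_deg: float


def clip_duration_s(ground_truth: list[GroundTruthRow]) -> float:
    """Duration covered by the ground-truth window, first row to last row."""
    if not ground_truth:
        raise ValueError("ground truth series is empty")
    return ground_truth[-1].t_s - ground_truth[0].t_s
```

The duration depends only on the ground-truth timestamps, so a shorter trimmed video does not change the value written to the report header.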

Effective config write — single source of truth

write_effective_replay_config materialises the same override recipe AZ-839 uses in-memory, but on disk so the airborne subprocess sees the cache_root the fixture chose. Field-level merge: every other block in the operator YAML is preserved verbatim; only c6_tile_cache.root_dir and c6_tile_cache.faiss_index_path are overwritten. The static operator YAML on disk is never touched.
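
The field-level merge can be sketched on an already-parsed mapping. YAML loading and dumping are left out so the example stays standard-library only, and `overlay_c6_cache_paths` is a hypothetical name, not the helper's real signature.

```python
import copy
from typing import Any, Mapping


def overlay_c6_cache_paths(operator_cfg: Mapping[str, Any], root_dir: str,
                           faiss_index_path: str) -> dict[str, Any]:
    """Return a copy of the operator config with only the two C6 paths replaced."""
    # Deep copy so the caller's mapping (the static config) is never mutated.
    effective = copy.deepcopy(dict(operator_cfg))
    block = dict(effective.get("c6_tile_cache") or {})
    block["root_dir"] = root_dir
    block["faiss_index_path"] = faiss_index_path
    effective["c6_tile_cache"] = block
    return effective
```

Any other key inside `c6_tile_cache`, and every other top-level block, passes through unchanged. The caller would then serialise the result to a temporary file and point the subprocess at it.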

Failure surface = step prefix

OrchestrationFailure always prefixes its message with [<step>]. CI log scrapers and pytest's traceback printer both surface the prefix on the first line; AC-5 ("clear error pointing at the failing step") holds without requiring the test to inspect the exception object. The step is also exposed as exc.step for programmatic assertions.
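
One way to get this behaviour is to build the prefix in the exception's constructor. The sketch below assumes a `RuntimeError` base class and illustrative enum member names; neither is confirmed by this report, which specifies only the `(step, message)` signature, the `[<step>]` prefix, and the `exc.step` attribute.

```python
import enum


class OrchestratorStep(enum.Enum):
    # Member names and values are illustrative, not the real enum.
    VALIDATE = "validate"
    WRITE_CONFIG = "write_config"
    AIRBORNE = "airborne"
    PARSE = "parse"
    GROUND_TRUTH = "gt"


class OrchestrationFailure(RuntimeError):
    """Step failure whose message always starts with the step name in brackets."""

    def __init__(self, step: OrchestratorStep, message: str) -> None:
        self.step = step  # kept for programmatic assertions
        super().__init__(f"[{step.value}] {message}")


def wrap_failure(step: OrchestratorStep, what: str, err: BaseException) -> OrchestrationFailure:
    """Name the upstream exception via repr so its type carries into the message."""
    return OrchestrationFailure(step, f"{what}: {err!r}")
```

A test can then assert either on `str(exc)` starting with the bracketed prefix or on `exc.step` directly.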

Files changed

  • tests/e2e/replay/_e2e_orchestrator.py (new, 656 LOC).
  • tests/e2e/replay/test_e2e_orchestrator_unit.py (new, 660+ LOC).
  • tests/e2e/replay/test_az835_e2e_real_flight.py (new, 156 LOC).

No src/ changes, no operator-config YAML changes, no AZ-839 driver changes. AZ-840 is purely additive at the test layer.

Code review (self-review)

Verdict: PASS_WITH_WARNINGS.

| Phase | Result |
|-------|--------|
| 1. Context loading | Re-read gps_compare.py, accuracy_report.py, replay_input.py, cli/replay.py, test_derkachi_real_tlog.py. Emission schema (emitted_at, position_wgs84) is the same shape gps-denied-replay writes. |
| 2. Spec compliance | All 8 AZ-840 ACs covered; AC-7 holds by inspection (no AZ-699 changes). |
| 3. Code quality | All public types have docstrings; failure messages name the upstream exception via repr so OSError / subprocess.TimeoutExpired carry through. Runner kw-args mirror subprocess.run signature 1:1. |
| 4. Security quick-scan | Effective config write goes to a tmp file the test owns; no secrets in the YAML overlay (override is two string fields). Subprocess env is opt-in (None defaults to os.environ). |
| 5. Performance scan | Unit tests run in 0.51 s. Tier-2 wall-clock cap is 900 s, enforced by the subprocess timeout. |
| 6. Cross-task consistency | clip_duration_s and report_path match AZ-699 exactly so a single Jetson run produces the same markdown shape. |
| 7. Architecture compliance | Orchestrator lives entirely under tests/e2e/replay/; no src/ writes. C3 fixture's invariants (PopulatedC6Cache.cache_root is the single source of truth) propagate via write_effective_replay_config. |

Findings

| ID | Severity | Description | Disposition |
|----|----------|-------------|-------------|
| F1 | Low | _default_tile_decoder in conftest.py (carried from batch 108) is still raw TIFF. Not in the AZ-840 path; AZ-840 doesn't change tile decoding. | Defer; no AZ-840 ticket. |
| F2 | Low | _resolve_replay_descriptor_dim is NetVLAD-only (carried from batch 108). AZ-840 doesn't change descriptors. | Defer; no AZ-840 ticket. |
| F3 | Low | --pace asap is hardcoded in _run_replay_subprocess argv; the AZ-699 test passes --pace asap too, so behaviour is identical. If a future test wants a real-time pace, the runner kwarg is the seam. | Document; no ticket. |
| F4 | Low | _run_replay_subprocess does not stream stdout/stderr; failures surface only after the subprocess exits. For 15-min runs this means the operator sees no progress until the budget expires. AZ-699 has the same shape. | Document; consider an AZ-* if the budget grows. |

Notes for follow-up

  • AZ-840 lands the orchestrator test as Tier-2-gated. Verifying the Tier-2 path actually runs on the Jetson harness is the next gating step before Epic AZ-835 can flip from "covered by unit tests" to "covered by Tier-2 integration".
  • _e2e_orchestrator.py is intentionally kept under tests/ rather than promoted to src/. If a second consumer of the same orchestration shape appears (e.g. AZ-833 mock-suite-sat parity test), the move to a shared helper module under src/gps_denied_onboard/replay/ is the right next step; for now the test-only location matches the helper's only consumer.
  • AZ-841 (Tier-2 unxfail follow-up) and AZ-842 (replay protocol + orchestrator docs) sit downstream; both should reference this batch report in their planning sections.