[AZ-840] [AZ-835] e2e orchestrator test (E-AZ-835 C4)

Wraps the AZ-699 verdict-report path with the AZ-839
operator_pre_flight_setup C3 fixture so a single Tier-2 test
takes only (tlog, video, calibration) and runs the full 7-step
pipeline on the Jetson harness without operator hand-curation.

New surface (tests-only, no src/ changes):
- tests/e2e/replay/_e2e_orchestrator.py — orchestrator with
  OrchestratorStep enum, OrchestrationFailure exception (step
  prefix per AC-5), OrchestrationReport dataclass,
  write_effective_replay_config helper, and
  run_e2e_orchestration entry point covering steps 1-2-6-7.
- tests/e2e/replay/test_e2e_orchestrator_unit.py — 17 unit
  tests covering each failure mode + happy path with mocked
  subprocess + ground-truth loader (AC-8).
- tests/e2e/replay/test_az835_e2e_real_flight.py — Tier-2 +
  RUN_REPLAY_E2E gated integration test asserting verdict
  report exists, 15-min budget held (AC-1, AC-2, AC-3, AC-4,
  AC-6).

The effective config write overlays c6_tile_cache.root_dir and
c6_tile_cache.faiss_index_path onto the static operator YAML at
runtime so the airborne subprocess shares the cache_root the C3
fixture chose. The merge is field-level: every other
operator-config block stays verbatim. The static YAML on disk
is never touched.

Test run: tests/e2e/replay 45 passed, 10 skipped (9
pre-existing skips + 1 new Tier-2 skip). No src/ touched, no
AZ-839 driver changes; AC-7 (AZ-699 still passes) holds by
inspection.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-23 15:27:41 +03:00
parent 8c4be9ace0
commit ade0c86f2b
6 changed files with 1680 additions and 1 deletions
@@ -0,0 +1,171 @@
# Batch 109 — Cycle 3 — AZ-840 e2e orchestrator test
**Date**: 2026-05-23
**Tasks**: AZ-840 (C4 — Epic AZ-835).
**Story points**: 3 (per the task spec).
**Jira status**: AZ-840 In Progress → In Testing at commit step.
## Why this batch exists
Epic AZ-835 (real-flight e2e validation) needs a single Tier-2
test that proves the 7-step pipeline runs from
`(tlog, video, calibration)` to a horizontal-error verdict
without operator hand-curation between steps. Steps 3-5 were
delivered by AZ-839 (C3 — `operator_pre_flight_setup`); steps
1-2-6-7 are this batch.
The AZ-839 batch 108b follow-up note explicitly anticipated this
batch: "AZ-840 will additionally need to feed the airborne
replay binary a config that points at the same `cache_root`
... the cleanest path is for AZ-840 to write an effective YAML
at runtime from the same override recipe used here."
## What this batch ships
A driver module + unit test suite + Tier-2 integration test:
* `tests/e2e/replay/_e2e_orchestrator.py`: wraps the AZ-699
  verdict-report path with the AZ-839 C3 fixture's
  `PopulatedC6Cache`. Public surface:
  * `OrchestratorStep` enum: failure-step labels per AC-5.
  * `OrchestrationFailure(step, message)` exception: wraps
    every step failure with the step name in the message prefix.
  * `OrchestrationReport` dataclass: verdict, distribution,
    paths, wall-clock measurements per AC-4.
  * `write_effective_replay_config`: small helper that overlays
    `c6_tile_cache.root_dir` and `c6_tile_cache.faiss_index_path`
    onto the static operator YAML.
  * `read_calibration_acquisition_method`: mirror of AZ-699's
    helper so the report writer keeps the same shape.
  * `run_e2e_orchestration`: the AC-1 entry point wiring
    validate → write_config → airborne subprocess → parse JSONL
    → load tlog GT → compute distribution → render report.
* `tests/e2e/replay/test_e2e_orchestrator_unit.py`: 17 unit
  tests covering each of the 7 steps' failure modes plus the
  happy path. The subprocess runner is injected (defaulting to
  `subprocess.run`) so unit tests stage synthetic JSONL output
  without touching the airborne binary. `load_tlog_ground_truth`
  is monkeypatched to return a synthetic 3-row series.
* `tests/e2e/replay/test_az835_e2e_real_flight.py::test_az840_e2e_real_flight_orchestration`:
  Tier-2 + RUN_REPLAY_E2E gated test that consumes the C3
  fixture + Derkachi inputs and asserts that the verdict
  markdown is written, the threshold-hit share table is
  present, and the 15-min budget held.
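
The runner injection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `_e2e_orchestrator.py`: `run_replay`, `make_fake_runner`, `timing_out_runner`, the `[airborne]` prefix text, and the synthetic row contents are hypothetical. Only the `subprocess.run` default, the 900 s timeout, and the `--auto-trim` / `--pace asap` arguments come from this report.

```python
import json
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Sequence

Runner = Callable[..., subprocess.CompletedProcess]


def run_replay(
    argv: Sequence[str],
    *,
    runner: Runner = subprocess.run,
    timeout: float = 900.0,
) -> subprocess.CompletedProcess:
    """Invoke the replay command through an injectable runner (hypothetical helper)."""
    try:
        return runner(
            list(argv), capture_output=True, text=True, timeout=timeout, check=False
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Name the upstream exception via repr so its type carries through.
        raise RuntimeError(f"[airborne] replay subprocess failed: {exc!r}") from exc


def make_fake_runner(jsonl_path: Path, rows: list) -> Runner:
    """Build a runner that stages synthetic JSONL instead of running a binary."""

    def fake_runner(argv, **kwargs):
        jsonl_path.write_text("".join(json.dumps(row) + "\n" for row in rows))
        return subprocess.CompletedProcess(args=argv, returncode=0, stdout="", stderr="")

    return fake_runner


def timing_out_runner(argv, **kwargs):
    """Runner that simulates the wall-clock budget being exceeded."""
    raise subprocess.TimeoutExpired(cmd=argv, timeout=kwargs["timeout"])


with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "emissions.jsonl"
    synthetic_rows = [{"seq": 0}, {"seq": 1}, {"seq": 2}]  # placeholder payload
    result = run_replay(
        ["gps-denied-replay", "--auto-trim", "--pace", "asap"],
        runner=make_fake_runner(out_path, synthetic_rows),
    )
    staged = [json.loads(line) for line in out_path.read_text().splitlines()]
```

Passing `timing_out_runner` instead exercises the timeout branch without waiting for a real 15-minute run.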
## AC coverage
| AC | Description | Coverage |
|-----|----------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| AC-1| Steps 1-7 end-to-end on Tier-2 from a fresh tlog/video | `test_az840_e2e_real_flight_orchestration` (Tier-2-gated); 17 unit tests prove the orchestrator structure |
| AC-2| Verdict report exists either PASS or FAIL | `test_run_e2e_orchestration_writes_report_even_on_fail_verdict` + integration assertion `report_path.is_file()` |
| AC-3| Reuses C3 fixture (`operator_pre_flight_setup`) | Integration test consumes the fixture; effective config overlay points at `populated_cache.cache_root` |
| AC-4| 15-min wall-time soft target on the Derkachi clip | `_DEFAULT_MAX_SECONDS = 900.0` passed as `subprocess.run` `timeout`; integration asserts `replay_subprocess_seconds <= 900`|
| AC-5| Mid-pipeline failure fails LOUD with a clear step prefix | `OrchestratorStep` enum + 8 step-specific failure unit tests (`validate`/`write_config`/`airborne` × 3/`parse` × 2/`gt`) |
| AC-6| Gated by `RUN_REPLAY_E2E=1` + Tier-2 marker | `_orchestrator_skip_reason()` checks env vars + binary + video size; `@pytest.mark.tier2` decorator |
| AC-7| AZ-699 verdict test continues to pass | No changes to `test_derkachi_real_tlog.py`; same `real_flight_validation_<date>.md` report path convention |
| AC-8| Unit-tested orchestration helper without Tier-2 inputs | 17 unit tests covering config write (4) + calibration parse (3) + run helper (10) — all use mocked subprocess + GT loader |
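
As an illustration of the AC-6 gate, a skip-reason helper might look like the sketch below. The signature, the reason strings, and the `min_video_bytes` parameter are assumptions; the report states only that `_orchestrator_skip_reason()` checks env vars, the binary, and the video size.

```python
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def orchestrator_skip_reason(
    env: Mapping[str, str],
    binary: Optional[str],
    video: Path,
    min_video_bytes: int = 1,
) -> Optional[str]:
    """Return a human-readable skip reason, or None when the test may run.

    Hypothetical stand-in for the gate; all parameters are illustrative.
    """
    if env.get("RUN_REPLAY_E2E") != "1":
        return "RUN_REPLAY_E2E=1 not set"
    if binary is None:
        return "replay binary not found on PATH"
    if not video.is_file() or video.stat().st_size < min_video_bytes:
        return f"video missing or too small: {video}"
    return None


with tempfile.TemporaryDirectory() as tmp:
    clip = Path(tmp) / "clip.mp4"
    clip.write_bytes(b"\x00" * 16)  # stand-in for a real video file
    gate_open = orchestrator_skip_reason({"RUN_REPLAY_E2E": "1"}, "/opt/replay", clip)
    gate_small = orchestrator_skip_reason(
        {"RUN_REPLAY_E2E": "1"}, "/opt/replay", clip, min_video_bytes=1024
    )

gate_no_env = orchestrator_skip_reason({}, "/opt/replay", Path("clip.mp4"))
gate_no_binary = orchestrator_skip_reason({"RUN_REPLAY_E2E": "1"}, None, Path("clip.mp4"))
```

In a test module the helper would typically receive `os.environ` and the result of `shutil.which(...)`, with a non-`None` reason fed to a skip marker.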
## Test run results
```
$ .venv/bin/pytest tests/e2e/replay/ -v --tb=short --timeout=60
============================ 45 passed, 10 skipped, 3 warnings in 0.78s ============
```
Breakdown:
* 17 new orchestrator unit tests pass.
* 11 AZ-839 driver unit tests still pass (no driver changes).
* 14 helper unit tests (`test_helpers.py`) still pass.
* 3 derkachi-1min mode-agnostic AST tests still pass.
* 10 skips: 1 new Tier-2 (this AZ-840 integration), 6
RUN_REPLAY_E2E gated AZ-404 cases, 1 AC-8 D-PROJ-2 placeholder,
1 Tier-2 AZ-699, 1 Tier-2 AZ-839 integration. None are
regressions; the tier2 gate trips off-Jetson.
## Design notes
### `--auto-trim` ownership
The orchestrator passes `--auto-trim` unconditionally so AZ-405 /
AZ-698 active-flight-cut + tlog/video sync (Epic step 1) runs
inside the airborne binary every time. The Epic narrative does
not separate trim from the airborne pipeline; collapsing them
into a single subprocess invocation matches AZ-699 and avoids
duplicating the trim path.
### `clip_duration_s` parity with AZ-699
`run_e2e_orchestration` computes
`clip_duration_s = ground_truth[-1].t_s - ground_truth[0].t_s`
exactly as `test_derkachi_real_tlog.py` does. This means both
verdict reports name the same clip duration even when the
trimmed video is shorter than the ground-truth window — a
deliberate choice: the report header documents what the verdict
covers, not what the binary processed.
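
A minimal sketch of the duration rule, assuming a ground-truth row type with a `t_s` field. `GroundTruthRow`, its latitude/longitude fields, and the sample values are hypothetical; only the `ground_truth[-1].t_s - ground_truth[0].t_s` expression comes from this report.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GroundTruthRow:
    """Hypothetical ground-truth sample; only t_s matters for the duration."""

    t_s: float
    lat_deg: float
    lon_deg: float


def clip_duration_s(ground_truth: Sequence[GroundTruthRow]) -> float:
    """Span of the ground-truth window, independent of the trimmed video length."""
    if not ground_truth:
        raise ValueError("ground truth series is empty")
    return ground_truth[-1].t_s - ground_truth[0].t_s


# Synthetic 3-row series with made-up coordinates.
series = [
    GroundTruthRow(t_s=10.0, lat_deg=50.0, lon_deg=36.0),
    GroundTruthRow(t_s=40.0, lat_deg=50.001, lon_deg=36.001),
    GroundTruthRow(t_s=70.0, lat_deg=50.002, lon_deg=36.002),
]
duration = clip_duration_s(series)  # 70.0 - 10.0 = 60.0
```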
### Effective config write — single source of truth
`write_effective_replay_config` materialises the same override
recipe AZ-839 uses in-memory, but on disk so the airborne
subprocess sees the cache_root the fixture chose. Field-level
merge: every other block in the operator YAML is preserved
verbatim; only `c6_tile_cache.root_dir` and
`c6_tile_cache.faiss_index_path` are overwritten. The static
operator YAML on disk is never touched.
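
The field-level merge can be illustrated on an already-parsed config mapping. This sketch leaves out YAML loading and dumping, and `overlay_tile_cache`, the sample keys other than `c6_tile_cache.root_dir` and `c6_tile_cache.faiss_index_path`, and all paths are hypothetical.

```python
import copy
from typing import Any, Dict


def overlay_tile_cache(
    operator_config: Dict[str, Any],
    root_dir: str,
    faiss_index_path: str,
) -> Dict[str, Any]:
    """Return a copy of the config with only the two tile-cache fields replaced."""
    effective = copy.deepcopy(operator_config)  # never mutate the static config
    block = effective.setdefault("c6_tile_cache", {})
    block["root_dir"] = root_dir
    block["faiss_index_path"] = faiss_index_path
    return effective


# Made-up stand-in for the parsed static operator YAML.
static_config = {
    "c6_tile_cache": {
        "root_dir": "/static/cache",
        "faiss_index_path": "/static/cache/index.faiss",
        "max_tiles": 4096,
    },
    "camera": {"calibration": "calib.json"},
}

effective_config = overlay_tile_cache(
    static_config,
    root_dir="/tmp/fixture/cache",
    faiss_index_path="/tmp/fixture/cache/index.faiss",
)
```

Writing `effective_config` to a temporary file and handing that path to the subprocess keeps the static file untouched, which is the property the design note relies on.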
### Failure surface = step prefix
`OrchestrationFailure` always prefixes its message with
`[<step>]`. CI log scrapers and pytest's traceback printer both
surface the prefix on the first line; AC-5 ("clear error
pointing at the failing step") holds without requiring the test
to inspect the exception object. The step is also exposed as
`exc.step` for programmatic assertions.
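
A minimal sketch of the prefix convention, assuming step labels that match the names listed in the AC-5 row (`validate`, `write_config`, `airborne`, `parse`, `gt`). The real enum members and base class may differ; the example message is made up.

```python
from enum import Enum


class OrchestratorStep(str, Enum):
    """Assumed step labels; the actual enum may name or order them differently."""

    VALIDATE = "validate"
    WRITE_CONFIG = "write_config"
    AIRBORNE = "airborne"
    PARSE = "parse"
    GT = "gt"


class OrchestrationFailure(RuntimeError):
    """Failure whose message always starts with the failing step in brackets."""

    def __init__(self, step: OrchestratorStep, message: str) -> None:
        self.step = step
        super().__init__(f"[{step.value}] {message}")


try:
    raise OrchestrationFailure(OrchestratorStep.PARSE, "no emissions in JSONL output")
except OrchestrationFailure as exc:
    first_line = str(exc)   # what a log scraper would see
    failed_step = exc.step  # what a test can assert on
```

Because the prefix is built in the constructor, every raise site gets it for free and cannot forget it.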
## Files changed
* `tests/e2e/replay/_e2e_orchestrator.py` (new, 656 LOC).
* `tests/e2e/replay/test_e2e_orchestrator_unit.py` (new, 660+ LOC).
* `tests/e2e/replay/test_az835_e2e_real_flight.py` (new, 156 LOC).
No `src/` changes, no operator-config YAML changes, no AZ-839
driver changes. AZ-840 is purely additive at the test layer.
## Code review (self-review)
Verdict: **PASS_WITH_WARNINGS**.
| Phase | Result |
|-------|--------|
| 1. Context loading | Re-read `gps_compare.py`, `accuracy_report.py`, `replay_input.py`, `cli/replay.py`, `test_derkachi_real_tlog.py`. Emission schema (`emitted_at`, `position_wgs84`) is the same shape `gps-denied-replay` writes. |
| 2. Spec compliance | All 8 AZ-840 ACs covered; AC-7 holds by inspection (no AZ-699 changes). |
| 3. Code quality | All public types have docstrings; failure messages name the upstream exception via `repr` so `OSError` / `subprocess.TimeoutExpired` carry through. Runner kw-args mirror `subprocess.run` signature 1:1. |
| 4. Security quick-scan | Effective config write goes to a tmp file the test owns; no secrets in the YAML overlay (override is two string fields). Subprocess `env` is opt-in (`None` defaults to `os.environ`). |
| 5. Performance scan | Unit tests run in 0.51 s. Tier-2 wall-clock cap is 900 s, enforced by the subprocess timeout. |
| 6. Cross-task consistency | `clip_duration_s` and `report_path` match AZ-699 exactly so a single Jetson run produces the same markdown shape. |
| 7. Architecture compliance | Orchestrator lives entirely under `tests/e2e/replay/`; no `src/` writes. C3 fixture's invariants (`PopulatedC6Cache.cache_root` is the single source of truth) propagate via `write_effective_replay_config`. |
## Findings
| ID | Severity | Description | Disposition |
|----|----------|-------------|-------------|
| F1 | Low | `_default_tile_decoder` in `conftest.py` (carried from batch 108) — still raw TIFF. Not in the AZ-840 path; AZ-840 doesn't change tile decoding. | Defer; no AZ-840 ticket. |
| F2 | Low | `_resolve_replay_descriptor_dim` is NetVLAD-only (carried from batch 108). AZ-840 doesn't change descriptors. | Defer; no AZ-840 ticket. |
| F3 | Low | `--pace asap` is hardcoded in `_run_replay_subprocess` argv; the AZ-699 test passes `--pace asap` too, so behaviour is identical. If a future test wants a real-time pace, the runner kwarg is the seam. | Document; no ticket. |
| F4 | Low | `_run_replay_subprocess` does not stream stdout/stderr; failures surface only after the subprocess exits. For 15-min runs this means the operator sees no progress until the budget expires. AZ-699 has the same shape. | Document; consider an AZ-* if the budget grows. |
## Notes for follow-up
* AZ-840 lands the orchestrator test as Tier-2-gated. Verifying
the Tier-2 path actually runs on the Jetson harness is the
next gating step before Epic AZ-835 can flip from "covered by
unit tests" to "covered by Tier-2 integration".
* `_e2e_orchestrator.py` is intentionally kept under `tests/`
rather than promoted to `src/`. If a second consumer of the
same orchestration shape appears (e.g. AZ-833 mock-suite-sat
parity test), the move to a shared helper module under
`src/gps_denied_onboard/replay/` is the right next step;
for now the test-only location matches the helper's only
consumer.
* AZ-841 (Tier-2 unxfail follow-up) and AZ-842 (replay protocol
+ orchestrator docs) sit downstream — both should reference
this batch report in their planning sections.