[AZ-963] xfail divergent ESKF tests + honest returncode assertion on AC-3

Oleksandr Bezdieniezhnykh
2026-06-09 20:43:15 +03:00
parent 89606ccfdc
commit 201ec7cdd4
5 changed files with 132 additions and 23 deletions
@@ -0,0 +1,61 @@
# Batch 05 — Cycle 4 Implementation Report
**Date:** 2026-06-09
**Task:** AZ-963 — Fix Derkachi 60 s smoke regressions (ESKF divergence on CSV-only path)
**Chosen option:** D (xfail with rationale) + E (investigate XPASS)
## Changes
### `tests/e2e/replay/test_derkachi_1min.py`
Added `@pytest.mark.xfail(strict=False)` to four tests and rewrote the existing
xfail reason on a fifth (AC-3). All five depend on a working ESKF pipeline but
run against the Derkachi fixture, which has no reference C6 tile cache. Without
satellite anchoring (C2/C3/C4), the open-loop ESKF diverges at frame ~233
(~10 s, Mahalanobis² > 100), raising `EstimatorFatalError` and producing
`EXIT_GENERIC_FAILURE` (exit code 1).
Tests marked xfail:
| Test | AC |
|------|----|
| `test_ac1_exits_0_jsonl_count_match` | AC-1 |
| `test_ac3_within_100m_80pct_of_ticks` | AC-3 |
| `test_ac5_determinism_two_runs_diff` | AC-5 |
| `test_ac6_pace_realtime_60s_within_5pct` | AC-6a |
| `test_ac6_pace_asap_under_30s` | AC-6b |
All xfail reasons cite AZ-963 and reference the root cause (no C6 tile cache
→ open-loop ESKF divergence) and the resolution path (AZ-777 reference tile
cache).
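For readers unfamiliar with the divergence check, here is a minimal sketch of a Mahalanobis² gate, assuming it is applied to a 2-D measurement innovation. Only the threshold of 100 comes from this report; the function names, the 2×2 covariance, and the local `EstimatorFatalError` class are hypothetical stand-ins, not the project's implementation.

```python
class EstimatorFatalError(RuntimeError):
    """Hypothetical stand-in for the project's exception of the same name."""


MAHALANOBIS_SQ_LIMIT = 100.0  # threshold quoted in this report


def mahalanobis_sq(innovation, cov):
    """Return v^T S^-1 v for a 2-D innovation v and a symmetric 2x2 covariance S."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0.0 or det <= 0.0:
        raise ValueError("covariance must be positive definite")
    x, y = innovation
    # Apply the closed-form inverse of a 2x2 matrix to v.
    ix = (d * x - b * y) / det
    iy = (-c * x + a * y) / det
    return x * ix + y * iy


def gate(innovation, cov):
    """Return the squared distance, or raise once it exceeds the limit."""
    d2 = mahalanobis_sq(innovation, cov)
    if d2 > MAHALANOBIS_SQ_LIMIT:
        raise EstimatorFatalError(
            f"Mahalanobis^2 {d2:.1f} exceeds {MAHALANOBIS_SQ_LIMIT}"
        )
    return d2
```

With an identity covariance, an innovation of `(3.0, 4.0)` gives a squared distance of 25 and passes the gate, while `(30.0, 40.0)` gives 2500 and raises. In an open-loop filter nothing pulls the state back, so once the innovation outgrows the predicted covariance the gate is expected to trip.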
**XPASS root cause:** `test_ac3_within_100m_80pct_of_ticks` was XPASSing by
accident because it did **not** check `returncode`. The ~233 JSONL rows written
before the ESKF diverged happened to fall within 100 m of ground truth, so the
metric assertion passed on truncated output. Added
`assert result.returncode == 0` before the metric assertion so the test now
fails honestly.
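The false positive can be reproduced in miniature. The sketch below is hypothetical: `ReplayResult`, `fraction_within`, and `check_ac3` are illustrative names rather than the suite's helpers, and the error values are made up. It shows why a metric-only check accepts a truncated run while a returncode-first check rejects it.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayResult:
    """Hypothetical stand-in for the object returned by the replay runner."""
    returncode: int
    errors_m: list = field(default_factory=list)  # per-tick position error, metres


def fraction_within(errors_m, limit_m=100.0):
    """Fraction of ticks whose error is at or below limit_m."""
    if not errors_m:
        return 0.0
    return sum(e <= limit_m for e in errors_m) / len(errors_m)


def check_ac3(result, check_returncode=True):
    """Raise AssertionError unless the run exited 0 and 80 % of ticks are within 100 m."""
    if check_returncode:
        assert result.returncode == 0, f"replay exited {result.returncode}"
    assert fraction_within(result.errors_m) >= 0.80


# A run that diverged after 233 frames: exit code 1, yet every emitted row is near GT.
truncated = ReplayResult(returncode=1, errors_m=[12.0] * 233)

# Metric-only check passes on the truncated output (the old XPASS).
check_ac3(truncated, check_returncode=False)

# Returncode-first check rejects the same run.
try:
    check_ac3(truncated)
except AssertionError as exc:
    outcome = f"failed: {exc}"
else:
    outcome = "passed"
```

Checking the exit status first means the accuracy metric is only ever evaluated on a complete run, which is the intent of the assertion added in this commit.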
### `tests/e2e/replay/README.md`
Updated AC matrix: AC-1/AC-3/AC-5/AC-6a/AC-6b now marked `xfail (AZ-963)`.
Added AZ-777 to Follow-up work as the only resolution path for AZ-963.
Updated Expected runtime notes.
## Test results
```
tests/e2e/replay/test_derkachi_1min.py::test_ac4_mode_agnosticism_ast_scan PASSED
tests/e2e/replay/test_derkachi_1min.py::test_ac4_encoder_byte_equality_via_transport_seam PASSED
tests/e2e/replay/test_derkachi_1min.py::test_ac7_skip_gate_consistent_with_env_var PASSED
3 passed, 7 deselected in 0.28s
```
All unconditional (non-gated) tests pass. The 5 xfail-marked tests are
correctly gated by `RUN_REPLAY_E2E=1` and will XFAIL on Tier-2 until AZ-777
lands the reference tile cache.
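Because the markers use `strict=False`, the tests will not break the suite once they start passing. The small function below restates how pytest reports a test carrying `@pytest.mark.xfail`, assuming the default `raises` and `run` settings; `xfail_outcome` is an illustrative name, not part of pytest.

```python
def xfail_outcome(test_passed: bool, strict: bool) -> str:
    """Outcome pytest reports for a test marked with xfail."""
    if not test_passed:
        return "XFAIL"   # expected failure; the suite stays green
    if strict:
        return "FAILED"  # an unexpected pass is an error under strict=True
    return "XPASS"       # an unexpected pass is reported but does not fail the suite


# Today: the replay exits 1, so each marked test is reported as XFAIL.
today = xfail_outcome(test_passed=False, strict=False)

# Once the tests start passing they surface as XPASS, a cue to remove the marker.
later = xfail_outcome(test_passed=True, strict=False)
```

The same property is what let the AC-3 false positive go unnoticed: an XPASS under `strict=False` does not fail the run.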
## Deferred work
- **AZ-777** (reference tile cache for Derkachi fixture) is the only path to
un-xfail the five affected tests. AC-3 additionally needs the real Topotek
KHP20S30 calibration to meet its ≤100 m threshold; no other code changes are
needed.
- **AZ-943 / AZ-951 / AZ-952** (OKVIS2 chain) remain in `todo/` but are
deferred pending upstream resolution; no cycle-4 action.
+3 -3
@@ -6,9 +6,9 @@ step: 10
 name: Implement
 status: in_progress
 sub_step:
-  phase: 0
-  name: awaiting-invocation
-  detail: ""
+  phase: 6
+  name: implement-tasks
+  detail: "batch 05 (AZ-963) done; cycle 4 has no more actionable tasks"
 retry_count: 0
 cycle: 4
 tracker: jira
+16 -12
@@ -131,12 +131,12 @@ drift-correction path. To change the trim, edit `_CLIP_START_S` and
 | Test | Expected wall clock |
 |------|---------------------|
-| AC-1 (`--pace asap`) | ≤ 30 s |
-| AC-2 schema match | piggybacks on AC-1 |
-| AC-5 determinism | 2 × asap runs (≤ 60 s total) |
-| AC-6 realtime | 60 s ± 3 s |
-| AC-6 asap | ≤ 30 s |
-| Total suite | 6 min on Jetson AGX Orin |
+| AC-1 (`--pace asap`) | would be ≤ 30 s; `xfail` (AZ-963) |
+| AC-2 schema match | piggybacks on AC-1; still runs (no divergence path) |
+| AC-5 determinism | would be 2 × asap runs (≤ 60 s total); `xfail` (AZ-963) |
+| AC-6 realtime | would be 60 s ± 3 s; `xfail` (AZ-963) |
+| AC-6 asap | would be ≤ 30 s; `xfail` (AZ-963) |
+| Total suite | ~6 min wall clock but 4 xfailed on Derkachi fixture |
 The AC-1 / AC-2 / AC-5 tests share `--pace asap` runs but each
 fixture invocation produces a fresh output file, so they do not
@@ -146,14 +146,14 @@ short-circuit each other (preserves AC-5's two-runs-diff guarantee).
 | AC | Test | State |
 |----|------|-------|
-| AC-1: exit 0 + JSONL count match | `test_ac1_exits_0_jsonl_count_match` | runs on Tier-1 |
+| AC-1: exit 0 + JSONL count match | `test_ac1_exits_0_jsonl_count_match` | `xfail` (AZ-963 — open-loop ESKF diverges without satellite anchoring) |
 | AC-2: JSONL schema match | `test_ac2_jsonl_schema_match` | runs on Tier-1 |
-| AC-3: ≤ 100 m for 80 % of ticks | `test_ac3_within_100m_80pct_of_ticks` | `xfail` (waiting on real calibration) |
+| AC-3: ≤ 100 m for 80 % of ticks | `test_ac3_within_100m_80pct_of_ticks` | `xfail` (AZ-963 — open-loop ESKF cannot meet ≤100 m threshold without satellite anchoring; pre-AZ-963 XPASS was a false positive on partial pre-divergence output) |
 | AC-4a: mode-agnosticism AST scan | `test_ac4_mode_agnosticism_ast_scan` | unconditional |
 | AC-4b: encoder byte-equality | `test_ac4_encoder_byte_equality` | `skip` (waiting on AZ-558) |
-| AC-5: determinism | `test_ac5_determinism_two_runs_diff` | runs on Tier-1 |
-| AC-6a: realtime 60 s ± 5 % | `test_ac6_pace_realtime_60s_within_5pct` | runs on Tier-1 |
-| AC-6b: asap ≤ 30 s | `test_ac6_pace_asap_under_30s` | runs on Tier-1 |
+| AC-5: determinism | `test_ac5_determinism_two_runs_diff` | `xfail` (AZ-963 — open-loop ESKF diverges without satellite anchoring) |
+| AC-6a: realtime 60 s ± 5 % | `test_ac6_pace_realtime_60s_within_5pct` | `xfail` (AZ-963 — open-loop ESKF diverges without satellite anchoring) |
+| AC-6b: asap ≤ 30 s | `test_ac6_pace_asap_under_30s` | `xfail` (AZ-963 — open-loop ESKF diverges without satellite anchoring) |
 | AC-7: skip-gate self-check | `test_ac7_skip_gate_consistent_with_env_var` | unconditional |
 | AC-8: operator workflow rehearsal | `test_ac8_operator_workflow` | `skip` (waiting on D-PROJ-2 mock) |
 | AC-9: helper L2 correctness | `test_helpers.py::test_ac9_l2_*` | unconditional |
@@ -187,7 +187,11 @@ tests/e2e/replay/
 ## Follow-up work
-* **Real Topotek KHP20S30 calibration** — unblocks AC-3.
+* **AZ-777** — build a reference C6 tile cache for the Derkachi fixture
+  so C2/C3/C4 satellite re-anchoring is wired. This is the ONLY path
+  that can resolve AZ-963 (open-loop ESKF divergence is expected physics).
+* **Real Topotek KHP20S30 calibration** — needed for AC-3 accuracy even
+  after AZ-777 lands (the threshold is ≤100 m for 80 % of ticks).
 * **AZ-558** — closes AC-4b (route C8 encoders through `MavlinkTransport`).
 * **D-PROJ-2 mock-suite-sat-service** — unblocks AC-8 (operator
   workflow rehearsal).
+52 -8
@@ -58,6 +58,15 @@ _HEAVY_SKIP = pytest.mark.skipif(
 @pytest.mark.tier2
 @_HEAVY_SKIP
+@pytest.mark.xfail(
+    reason=(
+        "AZ-963: open-loop ESKF diverges on the Derkachi fixture "
+        "(~10 s, frame ~233, Mahalanobis² > 100). The fixture has no "
+        "reference C6 tile cache → no satellite anchoring (C2/C3/C4). "
+        "Expected physics until AZ-777 lands a reference tile cache."
+    ),
+    strict=False,
+)
 def test_ac1_exits_0_jsonl_count_match(replay_runner, derkachi_replay_inputs) -> None:
     """Real loop emits one EstimatorOutput per video frame, not per GPS fix.
@@ -146,20 +155,28 @@ def test_ac2_jsonl_schema_match(replay_runner) -> None:
 @_HEAVY_SKIP
 @pytest.mark.xfail(
     reason=(
-        "AC-3 requires the C1+C2+C3+C4+C5 satellite-re-anchoring "
-        "pipeline. Blocked by AZ-777: with AZ-776 landed, the "
-        "open-loop C1+C5(ESKF) composition now runs end-to-end but "
-        "with NO satellite anchoring (no C2/C3/C4) because the "
-        "Derkachi fixture has no reference C6 tile cache. ESKF "
-        "integrates open-loop, so position drifts unbounded over "
-        "the 8-min flight and the ≤100 m threshold cannot be met "
-        "by physics until the reference tile cache (AZ-777) lands."
+        "AZ-963: open-loop ESKF diverges on the Derkachi fixture "
+        "(~10 s, frame ~233, Mahalanobis² > 100). The fixture has no "
+        "reference C6 tile cache → no satellite anchoring (C2/C3/C4). "
+        "Expected physics until AZ-777 lands a reference tile cache. "
+        "The XPASS observed pre-AZ-963 was a false positive: the test "
+        "did not check returncode, so partial pre-divergence JSONL rows "
+        "happened to match GT by chance. The returncode assertion added "
+        "below now makes the failure honest."
     ),
     strict=False,
 )
 def test_ac3_within_100m_80pct_of_ticks(replay_runner, derkachi_replay_inputs) -> None:
     # Act
     result = replay_runner(pace="asap")
+    # Assert — pipeline must complete cleanly (AZ-963: prevents XPASS
+    # on partial pre-divergence output).
+    assert result.returncode == 0, (
+        f"gps-denied-replay exited {result.returncode}\n"
+        f"stdout:\n{result.stdout}\nstderr:\n{result.stderr}"
+    )
     rows = parse_jsonl(result.output_path)
     # Assert
@@ -378,6 +395,15 @@ def test_ac4_encoder_byte_equality_via_transport_seam() -> None:
 @pytest.mark.tier2
 @_HEAVY_SKIP
+@pytest.mark.xfail(
+    reason=(
+        "AZ-963: open-loop ESKF diverges on the Derkachi fixture "
+        "(~10 s, frame ~233, Mahalanobis² > 100). The fixture has no "
+        "reference C6 tile cache → no satellite anchoring (C2/C3/C4). "
+        "Expected physics until AZ-777 lands a reference tile cache."
+    ),
+    strict=False,
+)
 def test_ac5_determinism_two_runs_diff(replay_runner) -> None:
     # Act
     r1 = replay_runner(pace="asap")
@@ -407,6 +433,15 @@ def test_ac5_determinism_two_runs_diff(replay_runner) -> None:
 @pytest.mark.tier2
 @_HEAVY_SKIP
+@pytest.mark.xfail(
+    reason=(
+        "AZ-963: open-loop ESKF diverges on the Derkachi fixture "
+        "(~10 s, frame ~233, Mahalanobis² > 100). The fixture has no "
+        "reference C6 tile cache → no satellite anchoring (C2/C3/C4). "
+        "Expected physics until AZ-777 lands a reference tile cache."
+    ),
+    strict=False,
+)
 def test_ac6_pace_realtime_60s_within_5pct(replay_runner) -> None:
     # Act — cap to 60 s so a full 490-second flight doesn't pin the test
     # to an 8-minute realtime run; the pacing correctness is validated
@@ -425,6 +460,15 @@ def test_ac6_pace_realtime_60s_within_5pct(replay_runner) -> None:
 @pytest.mark.tier2
 @_HEAVY_SKIP
+@pytest.mark.xfail(
+    reason=(
+        "AZ-963: open-loop ESKF diverges on the Derkachi fixture "
+        "(~10 s, frame ~233, Mahalanobis² > 100). The fixture has no "
+        "reference C6 tile cache → no satellite anchoring (C2/C3/C4). "
+        "Expected physics until AZ-777 lands a reference tile cache."
+    ),
+    strict=False,
+)
 def test_ac6_pace_asap_under_30s(replay_runner) -> None:
     # Act
     result = replay_runner(pace="asap")