[AZ-963] xfail divergent ESKF tests + honest returncode assertion on AC-3

Oleksandr Bezdieniezhnykh
2026-06-09 20:43:15 +03:00
parent 89606ccfdc
commit 201ec7cdd4
5 changed files with 132 additions and 23 deletions
@@ -0,0 +1,111 @@
# AZ-963 — Fix Derkachi 60s smoke regressions: ESKF divergence on CSV-only path with no satellite anchoring (AZ-895 fallout)
**Status**: To Do (Jira) / `todo/` (local)
**Issue type**: Task
**Complexity**: 3 SP (may bump to 5 SP after triage if option B is chosen)
**Cycle**: cycle-4 e2e closure follow-up
**Jira**: https://denyspopov.atlassian.net/browse/AZ-963
**Filed**: 2026-05-29 during cycle-4 Tier-2 validation run
## Why
Discovered 2026-05-29 during cycle-4 e2e validation run on Tier-2 Jetson AGX Orin. Four tests in `tests/e2e/replay/test_derkachi_1min.py` regressed to FAIL after the AZ-895 deprecation made the CSV-driven replay path primary:
* `test_ac1_exits_0_jsonl_count_match` — expects exit 0, got exit 1
* `test_ac5_determinism_two_runs_diff` — expects two PASSing runs to diff cleanly, both exit 1
* `test_ac6_pace_realtime_60s_within_5pct` — expects realtime pace within 5%, exits 1 before timing measurement is meaningful
* `test_ac6_pace_asap_under_30s` — expects asap under 30s, exits 1 in ~13s with fatal error
All four fail with the same root cause:
```
ERROR c5.state.eskf_filter_divergence kv={"source":"vio","mahalanobis_sq":212.31,"threshold_sq":100.0}
ERROR replay_loop.state_add_vio_fatal frame=233
EstimatorFatalError('eskf filter divergence on vio: mahalanobis²=212.311 > 100.0')
```
The CSV-driven path (now primary since AZ-895 deprecation) runs **open-loop** — the Derkachi fixture has no reference C6 tile cache so C2 VPR / C3 matcher / C4 pose-anchor stages are not wired:
```
WARN replay_loop.satellite_anchoring_not_wired: frame=0 — C2 VPR / C4 pose-anchor stages are not wired
in this run (Derkachi has no reference tile cache); estimator runs open-loop on VIO + IMU. Expect
monotonically growing position error.
```
After ~10s of open-loop integration, ESKF Mahalanobis distance exceeds the 100.0 threshold at frame 233 and the runner crashes with a non-zero exit code. The 4 tests don't care about accuracy but they require a clean exit — which they can't get on the CSV-only path.
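The gate that trips here can be illustrated with a minimal sketch. It assumes the logged `mahalanobis_sq` is the usual squared Mahalanobis distance of the VIO innovation against its innovation covariance; the source only shows the two logged numbers, so the function names, the pure-Python solver, and the `EstimatorFatalError` stand-in class below are all hypothetical and not the project's code.

```python
from typing import List, Sequence

THRESHOLD_SQ = 100.0  # value reported as threshold_sq in the log above


class EstimatorFatalError(RuntimeError):
    """Illustrative stand-in for the project's exception of the same name."""


def solve(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not mutated.
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("innovation covariance is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def mahalanobis_sq(innovation: Sequence[float],
                   innovation_cov: Sequence[Sequence[float]]) -> float:
    """Return d^2 = nu^T S^-1 nu without forming the inverse explicitly."""
    weighted = solve(innovation_cov, innovation)
    return sum(v * w for v, w in zip(innovation, weighted))


def gate_vio_update(innovation: Sequence[float],
                    innovation_cov: Sequence[Sequence[float]],
                    threshold_sq: float = THRESHOLD_SQ) -> float:
    """Raise when the innovation is implausibly large for its covariance."""
    d2 = mahalanobis_sq(innovation, innovation_cov)
    if d2 > threshold_sq:
        raise EstimatorFatalError(
            f"eskf filter divergence on vio: mahalanobis²={d2:.3f} > {threshold_sq}"
        )
    return d2
```

Under this reading, a threshold of 100.0 corresponds to an innovation roughly ten standard deviations from what the filter expects, which is why loosening it (option A below) is treated as risky.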
**Why this matters now**: before AZ-895, the tlog path was the primary replay surface and presumably exited cleanly (with some warning about divergence) without raising `EstimatorFatalError`. The AZ-895 deprecation didn't account for the runtime-semantic difference between the two paths in test fixtures that depended on "runner exits 0 even without satellite anchoring".
## Related XPASS finding (in scope to investigate, may split into sub-ticket)
`test_ac3_within_100m_80pct_of_ticks` showed up as XPASS in the same run. It was marked xfail because "AC-3 requires the C1+C2+C3+C4+C5 satellite-re-anchoring pipeline. Blocked by AZ-777...". XPASS means "marked xfail but unexpectedly passed", which should be impossible per the documented physics: an open-loop ESKF can't keep ≥80% of ticks within 100 m. Either the test is silently no-oping into a pass, or the xfail mark is stale, or the new semantics changed something that fixed it. Worth investigating because it could be a third silent-failure surface.
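One way an accuracy test could pass vacuously is if it only scores the ticks emitted before the runner crashed. Whether that is what happened here is not established. The sketch below shows an assertion order that would rule it out, with the return code checked before any accuracy metric; `ReplayResult`, `fraction_within`, and `assert_ac3` are hypothetical names for illustration, not the suite's real helpers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplayResult:
    """Hypothetical container for one replay run (not the project's type)."""
    returncode: int
    errors_m: List[float]  # per-tick horizontal position error, in metres


def fraction_within(errors_m: List[float], limit_m: float) -> float:
    """Fraction of ticks whose error is at most limit_m."""
    return sum(1 for e in errors_m if e <= limit_m) / len(errors_m)


def assert_ac3(result: ReplayResult,
               limit_m: float = 100.0,
               min_fraction: float = 0.80) -> None:
    # Check the exit code first: a run that died early has only scored
    # its first few ticks, before open-loop drift had time to grow.
    assert result.returncode == 0, (
        f"replay exited with returncode={result.returncode}; "
        "accuracy is not meaningful for a crashed run"
    )
    # An empty tick list must fail rather than pass by default.
    assert result.errors_m, "no ticks were emitted"
    frac = fraction_within(result.errors_m, limit_m)
    assert frac >= min_fraction, (
        f"only {frac:.0%} of ticks within {limit_m} m (need {min_fraction:.0%})"
    )
```

With this ordering, a run that exits 1 at frame 233 fails on the return code even if every tick before the crash was within 100 m.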
## Goal
The 4 currently-failing tests must either PASS, or have an explicit gating decision (xfail with a tracked reason, or skip with the right mark) that doesn't silently hide AC coverage. The AC matrix in the README must accurately reflect what's measured vs what's deferred.
This ticket does NOT mandate a specific fix — the right answer requires triage. Options on the table:
* **A**: Loosen the ESKF divergence threshold in the test harness path (changes production code; risky — the threshold exists for a real safety reason)
* **B**: Add a reference C6 tile cache for Derkachi so satellite anchoring works (AZ-777 follow-up scope; large; the fixture has no anchorable imagery yet)
* **C**: Gate the 4 tests behind a "satellite anchoring required" mark and skip them on the open-loop path (preserves the tests as documentation; doesn't restore AC coverage)
* **D**: Mark the divergence-driven failures as expected (xfail with rationale: "open-loop ESKF diverges on this fixture")
* **E**: Investigate why AC-3 XPASSes and whether that finding changes the choice among options A-D
* **F**: Some combination after triage
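For reference, option D might look like the following minimal pytest sketch. The reason string and the placeholder body are assumptions for illustration; only the test name and the ticket ID come from this document. `strict=True` is a standard `pytest.mark.xfail` parameter that turns an unexpected pass into a failure, so a stale mark surfaces instead of showing up as a quiet XPASS.

```python
import pytest

# Hypothetical wording; the tracked Jira ID keeps the gating decision auditable.
OPEN_LOOP_REASON = (
    "AZ-963: open-loop ESKF diverges on the Derkachi fixture "
    "(no satellite anchoring), so the runner exits non-zero"
)


@pytest.mark.xfail(reason=OPEN_LOOP_REASON, strict=True)
def test_ac1_exits_0_jsonl_count_match():
    # Placeholder standing in for the real replay invocation.
    returncode = 1
    assert returncode == 0
```

A `skip` mark (option C) would read similarly, but the test body would never run, so it could not report the day the underlying problem is fixed.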
## Acceptance Criteria
* **AC-1**: All 4 currently-failing tests (`test_ac1_exits_0_jsonl_count_match`, `test_ac5_determinism_two_runs_diff`, `test_ac6_pace_realtime_60s_within_5pct`, `test_ac6_pace_asap_under_30s`) are either PASSing or have an explicit gating decision with a tracked Jira reference — NOT silently disabled.
* **AC-2**: The `test_ac3_within_100m_80pct_of_ticks` XPASS is investigated and either becomes a real PASS (xfail mark removed with rationale) or stays xfail with an updated rationale (one of the two; not both, not silent).
* **AC-3**: No regression to the documented AC matrix in `tests/e2e/replay/README.md` § `AC matrix` — every AC row is still being measured in some form (PASS / honest xfail / honest skip with reason), and the README accurately reflects the current state.
* **AC-4**: The fix does not bring back the AZ-895-deprecated auto-sync surface (`--time-offset-ms`, `--skip-auto-sync-validation` CLI flags must remain deprecated).
* **AC-5**: A short triage memo lives at `_docs/03_implementation/batch_*_az963_triage.md` (or equivalent batch report) explaining which of options A-F was chosen and why, with the run-log evidence cited.
## Out of scope
* AZ-840 orchestrator test (separate AZ-962 ticket).
* Reverting AZ-895 to restore the tlog path as primary.
* Building a reference C6 tile cache for Derkachi (separate large work).
* Tracker-state cleanup for AZ-840 / AZ-842 (separate user decision).
## Dependencies
* **AZ-895** (Done locally / In Testing in Jira) — this ticket addresses fallout from that deprecation.
* **AZ-265 / AZ-404** (60s suite epic) — the regressed tests are deliverables of that epic.
* **AZ-777** (Phase 3 superseded) — referenced in the existing xfail rationale; understanding why it's superseded informs the triage.
* **AZ-962** (sibling) — the AZ-840 orchestrator test is blocked by a different gap; both are cycle-4 e2e closure work but they're independent and can be worked in parallel.
## Estimate
3 SP. Investigation + triage + implementation. May bump to 5 SP if option B (build reference tile cache) is chosen — in that case split into sub-tickets per the user's complexity-budget rule (≤5 SP per ticket).
## Run-log evidence (2026-05-29 Tier-2)
```
e2e-runner-1 | = 4 failed, 48 passed, 3 skipped, 1 xfailed, 1 xpassed, 1 warning in 90.59s (0:01:30) =
e2e-runner-1 | FAILED tests/e2e/replay/test_derkachi_1min.py::test_ac1_exits_0_jsonl_count_match
e2e-runner-1 | FAILED tests/e2e/replay/test_derkachi_1min.py::test_ac5_determinism_two_runs_diff
e2e-runner-1 | FAILED tests/e2e/replay/test_derkachi_1min.py::test_ac6_pace_realtime_60s_within_5pct
e2e-runner-1 | FAILED tests/e2e/replay/test_derkachi_1min.py::test_ac6_pace_asap_under_30s
e2e-runner-1 | XPASS tests/e2e/replay/test_derkachi_1min.py::test_ac3_within_100m_80pct_of_ticks
```
Excerpt from the stdout of the first failure (representative of all 4):
```
{"ts":"2026-05-29T10:34:50.397901Z","level":"ERROR","component":"c5_state.eskf_baseline",
"kind":"c5.state.eskf_filter_divergence",
"kv":{"source":"vio","mahalanobis_sq":212.31115250586484,"threshold_sq":100.0}}
{"ts":"2026-05-29T10:34:50.398356Z","level":"ERROR","component":"runtime_root.replay_loop",
"kind":"replay_loop.state_add_vio_fatal",
"msg":"replay_loop.state_add_vio_fatal: frame=233 EstimatorFatalError('eskf filter divergence on vio: mahalanobis²=212.311 > 100.0')"}
```
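For the triage memo, the divergence records can be pulled out of a captured run log with a few lines of Python. This is a sketch under the assumption that the runner writes one JSON object per line (the excerpt above is wrapped for readability) and that non-JSON lines may be interleaved; `find_divergence_events` is a hypothetical helper, and the field names are taken from the excerpt.

```python
import json
from typing import Dict, Iterable, List

DIVERGENCE_KIND = "c5.state.eskf_filter_divergence"


def find_divergence_events(lines: Iterable[str]) -> List[Dict]:
    """Return the structured-log records that report an ESKF divergence."""
    events = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip plain-text lines mixed into the stream
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or otherwise malformed records
        if record.get("kind") == DIVERGENCE_KIND:
            events.append(record)
    return events


sample = [
    "some non-JSON line from the container runtime",
    '{"ts":"2026-05-29T10:34:50.397901Z","level":"ERROR",'
    '"component":"c5_state.eskf_baseline",'
    '"kind":"c5.state.eskf_filter_divergence",'
    '"kv":{"source":"vio","mahalanobis_sq":212.31115250586484,"threshold_sq":100.0}}',
]
events = find_divergence_events(sample)
```

Each returned record keeps its `kv` payload, so the memo can quote `mahalanobis_sq` and `threshold_sq` directly from the run rather than from a hand-copied excerpt.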
## References
* Failing tests: `tests/e2e/replay/test_derkachi_1min.py:82, 387, 417, 433`
* XPASS: `tests/e2e/replay/test_derkachi_1min.py::test_ac3_within_100m_80pct_of_ticks`
* ESKF threshold: `c5_state.eskf_baseline` (Mahalanobis² 100.0 threshold)
* Satellite-anchoring-not-wired warning: `runtime_root.replay_loop:replay_loop.satellite_anchoring_not_wired`
* README AC matrix: `tests/e2e/replay/README.md` § `AC matrix`
* Sibling ticket (parallel work): AZ-962 — orchestrator config wiring