mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 12:11:13 +00:00
[AZ-963] xfail divergent ESKF tests + honest returncode assertion on AC-3
@@ -1,111 +0,0 @@
# AZ-963 — Fix Derkachi 60s smoke regressions: ESKF divergence on CSV-only path with no satellite anchoring (AZ-895 fallout)
**Status**: To Do (Jira) / `todo/` (local)
**Issue type**: Task
**Complexity**: 3 SP (may bump to 5 SP after triage if option B is chosen)
**Cycle**: cycle-4 e2e closure follow-up
**Jira**: https://denyspopov.atlassian.net/browse/AZ-963
**Filed**: 2026-05-29 during cycle-4 Tier-2 validation run

## Why

Discovered on 2026-05-29 during the cycle-4 e2e validation run on the Tier-2 Jetson AGX Orin. Four tests in `tests/e2e/replay/test_derkachi_1min.py` regressed to FAIL after the AZ-895 deprecation made the CSV-driven replay path primary:

* `test_ac1_exits_0_jsonl_count_match`: expects exit 0, got exit 1
* `test_ac5_determinism_two_runs_diff`: expects two PASSing runs to diff cleanly, but both runs exit 1
* `test_ac6_pace_realtime_60s_within_5pct`: expects realtime pace within 5%, but the run exits 1 before the timing measurement is meaningful
* `test_ac6_pace_asap_under_30s`: expects asap pace under 30s, but the run exits 1 in ~13s with a fatal error

All four fail with the same root cause:

```
ERROR c5.state.eskf_filter_divergence kv={"source":"vio","mahalanobis_sq":212.31,"threshold_sq":100.0}
ERROR replay_loop.state_add_vio_fatal frame=233
EstimatorFatalError('eskf filter divergence on vio: mahalanobis²=212.311 > 100.0')
```

The CSV-driven path (primary since the AZ-895 deprecation) runs **open-loop**. The Derkachi fixture has no reference C6 tile cache, so the C2 VPR / C3 matcher / C4 pose-anchor stages are not wired, as the runner's own warning states:

```
WARN replay_loop.satellite_anchoring_not_wired: frame=0 — C2 VPR / C4 pose-anchor stages are not wired
in this run (Derkachi has no reference tile cache); estimator runs open-loop on VIO + IMU. Expect
monotonically growing position error.
```

After ~10s of open-loop integration, the squared Mahalanobis distance of the VIO update (212.31) exceeds the 100.0 threshold at frame 233, and the runner exits with a non-zero code. The four tests do not assert accuracy, but they do require a clean exit, which the CSV-only path cannot currently provide.

**Why this matters now**: before AZ-895, the tlog path was the primary replay surface and presumably exited cleanly (with some warning about divergence) without raising `EstimatorFatalError`. The AZ-895 deprecation did not account for the runtime-semantic difference between the two paths, so tests that depended on "runner exits 0 even without satellite anchoring" now fail.

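For readers unfamiliar with the gate that fires here, the check can be sketched as a squared Mahalanobis test on the measurement innovation. This is a minimal illustration and not the code in `c5_state.eskf_baseline`: the function names, the pure-Python linear solve, and the example values are assumptions; only the 100.0 threshold comes from the log above.

```python
from typing import List

THRESHOLD_SQ = 100.0  # value reported as threshold_sq in the run log


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's inputs stay untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("innovation covariance is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis_sq(innovation: List[float], innovation_cov: List[List[float]]) -> float:
    """Return nu^T S^-1 nu for innovation nu and innovation covariance S."""
    weighted = solve(innovation_cov, innovation)
    return sum(v * w for v, w in zip(innovation, weighted))


def is_divergent(innovation, innovation_cov, threshold_sq: float = THRESHOLD_SQ) -> bool:
    """Hypothetical gate: flag the update when the squared distance exceeds the threshold."""
    return mahalanobis_sq(innovation, innovation_cov) > threshold_sq
```

With an identity covariance, an innovation of `[3.0, 4.0]` gives a squared distance of 25.0, which is under the gate. The logged value of 212.31 means the VIO innovation was far larger than the filter's own covariance said it should be.
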
## Related XPASS finding (in scope to investigate, may split into sub-ticket)

`test_ac3_within_100m_80pct_of_ticks` showed up as XPASS in the same run. It was marked xfail because "AC-3 requires the C1+C2+C3+C4+C5 satellite-re-anchoring pipeline. Blocked by AZ-777...". XPASS means "marked xfail but unexpectedly passed", which should be impossible per the documented physics (an open-loop ESKF cannot keep at least 80% of ticks within 100 m). Either the test is silently no-oping into a pass, or the xfail mark is stale, or the new semantics changed something that fixed it. This is worth investigating because it could be a third silent-failure surface.

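One way an accuracy check can pass vacuously is by evaluating an empty tick series, for example when the run aborts before any tick is emitted. The sketch below is purely illustrative and makes no claim about what the actual test does; `fraction_within` and `ac3_met` are hypothetical names, and the 100 m / 80% figures are taken from the test name.

```python
from typing import Sequence


def fraction_within(errors_m: Sequence[float], radius_m: float = 100.0) -> float:
    """Fraction of ticks whose position error is within radius_m."""
    if not errors_m:
        # An empty series must never be reported as a pass.
        raise ValueError("no ticks to evaluate; refusing to report a pass")
    hits = sum(1 for e in errors_m if e <= radius_m)
    return hits / len(errors_m)


def ac3_met(errors_m: Sequence[float], min_fraction: float = 0.80) -> bool:
    """AC-3 style check: at least min_fraction of ticks within 100 m."""
    return fraction_within(errors_m) >= min_fraction
```

Raising on empty input turns a silent no-op into a visible error, which is the behaviour the triage would want to confirm or rule out.
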
## Goal

The four currently failing tests must either PASS or carry an explicit gating decision (xfail with a tracked reason, or skip with the right mark) that does not silently hide AC coverage. The AC matrix in the README must accurately reflect what is measured and what is deferred.

This ticket does NOT mandate a specific fix; the right answer requires triage. Options on the table:

* **A**: Loosen the ESKF divergence threshold in the test harness path (changes production code; risky, because the threshold exists for a real safety reason)
* **B**: Add a reference C6 tile cache for Derkachi so satellite anchoring works (AZ-777 follow-up scope; large; the fixture has no anchorable imagery yet)
* **C**: Gate the 4 tests behind a "satellite anchoring required" mark and skip them on the open-loop path (preserves the tests as documentation; doesn't restore AC coverage)
* **D**: Mark the divergence-driven failures as expected (xfail with rationale: "open-loop ESKF diverges on this fixture")
* **E**: Investigate why AC-3 XPASSes and whether that finding changes A–D
* **F**: Some combination after triage

## Acceptance Criteria

* **AC-1**: All 4 currently failing tests (`test_ac1_exits_0_jsonl_count_match`, `test_ac5_determinism_two_runs_diff`, `test_ac6_pace_realtime_60s_within_5pct`, `test_ac6_pace_asap_under_30s`) are either PASSing or have an explicit gating decision with a tracked Jira reference, NOT silently disabled.
* **AC-2**: The `test_ac3_within_100m_80pct_of_ticks` XPASS is investigated and either becomes a real PASS (xfail mark removed with rationale) or stays xfail with an updated rationale (one of the two; not both, not silent).
* **AC-3**: No regression to the documented AC matrix in `tests/e2e/replay/README.md` § `AC matrix`: every AC row is still being measured in some form (PASS / honest xfail / honest skip with reason), and the README accurately reflects the current state.
* **AC-4**: The fix does not bring back the AZ-895-deprecated auto-sync surface (the `--time-offset-ms` and `--skip-auto-sync-validation` CLI flags must remain deprecated).
* **AC-5**: A short triage memo lives at `_docs/03_implementation/batch_*_az963_triage.md` (or equivalent batch report) explaining which of options A–F was chosen and why, with the run-log evidence cited.

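The "tracked Jira reference" requirement in AC-1 lends itself to a mechanical check. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the `AZ-<number>` key pattern is inferred from the ticket IDs in this document, and how the reasons are collected from the test suite is left out.

```python
import re

# Assumed key format, inferred from ticket IDs such as AZ-963 and AZ-895.
JIRA_KEY = re.compile(r"\bAZ-\d+\b")


def has_tracked_reason(reason: str) -> bool:
    """True if a skip/xfail reason is non-empty and cites a Jira key."""
    return bool(reason.strip()) and JIRA_KEY.search(reason) is not None


def untracked(gated: dict) -> list:
    """Return the names of gated tests whose reason lacks a Jira reference."""
    return sorted(
        name for name, reason in gated.items() if not has_tracked_reason(reason)
    )
```

Feeding it a mapping of test name to gating reason yields the tests that would violate AC-1; an empty list means every gate is tracked.
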
## Out of scope

* AZ-840 orchestrator test (separate AZ-962 ticket).
* Reverting AZ-895 to restore the tlog path as primary.
* Building a reference C6 tile cache for Derkachi (separate large work).
* Tracker-state cleanup for AZ-840 / AZ-842 (separate user decision).

## Dependencies

* **AZ-895** (Done locally / In Testing in Jira): this ticket addresses fallout from that deprecation.
* **AZ-265 / AZ-404** (60s suite epic): the regressed tests are deliverables of that epic.
* **AZ-777** (Phase 3 superseded): referenced in the existing xfail rationale; understanding why it's superseded informs the triage.
* **AZ-962** (sibling): the AZ-840 orchestrator test is blocked by a different gap; both are cycle-4 e2e closure work, but they're independent and can be worked in parallel.

## Estimate

3 SP, covering investigation, triage, and implementation. May bump to 5 SP if option B (build a reference tile cache) is chosen; in that case, split into sub-tickets per the user's complexity-budget rule (≤5 SP per ticket).

## Run-log evidence (2026-05-29 Tier-2)

```
e2e-runner-1 | = 4 failed, 48 passed, 3 skipped, 1 xfailed, 1 xpassed, 1 warning in 90.59s (0:01:30) =
e2e-runner-1 | FAILED tests/e2e/replay/test_derkachi_1min.py::test_ac1_exits_0_jsonl_count_match
e2e-runner-1 | FAILED tests/e2e/replay/test_derkachi_1min.py::test_ac5_determinism_two_runs_diff
e2e-runner-1 | FAILED tests/e2e/replay/test_derkachi_1min.py::test_ac6_pace_realtime_60s_within_5pct
e2e-runner-1 | FAILED tests/e2e/replay/test_derkachi_1min.py::test_ac6_pace_asap_under_30s
e2e-runner-1 | XPASS tests/e2e/replay/test_derkachi_1min.py::test_ac3_within_100m_80pct_of_ticks
```

Excerpt from the stdout of the first failure (representative of all 4):

```
{"ts":"2026-05-29T10:34:50.397901Z","level":"ERROR","component":"c5_state.eskf_baseline",
"kind":"c5.state.eskf_filter_divergence",
"kv":{"source":"vio","mahalanobis_sq":212.31115250586484,"threshold_sq":100.0}}
{"ts":"2026-05-29T10:34:50.398356Z","level":"ERROR","component":"runtime_root.replay_loop",
"kind":"replay_loop.state_add_vio_fatal",
"msg":"replay_loop.state_add_vio_fatal: frame=233 EstimatorFatalError('eskf filter divergence on vio: mahalanobis²=212.311 > 100.0')"}
```

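Because the runner emits structured records, a test can tell "exited non-zero because of the known ESKF divergence" apart from any other failure instead of asserting on the return code alone. The sketch below assumes each record occupies a single stdout line (the excerpt above appears wrapped for readability) and uses a hypothetical helper name; the `kind` value is taken from the excerpt.

```python
import json
from typing import Iterable, List

DIVERGENCE_KIND = "c5.state.eskf_filter_divergence"  # from the excerpt above


def divergence_events(stdout_lines: Iterable[str]) -> List[dict]:
    """Collect structured log records that report ESKF divergence."""
    events = []
    for line in stdout_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip non-JSON output such as plain-text banners
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or interleaved lines
        if isinstance(record, dict) and record.get("kind") == DIVERGENCE_KIND:
            events.append(record)
    return events
```

A gated test could then assert that a non-zero exit is accompanied by at least one such event, so that an unrelated crash is not hidden behind the same xfail.
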
## References

* Failing tests: `tests/e2e/replay/test_derkachi_1min.py:82, 387, 417, 433`
* XPASS: `tests/e2e/replay/test_derkachi_1min.py::test_ac3_within_100m_80pct_of_ticks`
* ESKF threshold: `c5_state.eskf_baseline` (Mahalanobis² 100.0 threshold)
* Satellite-anchoring-not-wired warning: `runtime_root.replay_loop:replay_loop.satellite_anchoring_not_wired`
* README AC matrix: `tests/e2e/replay/README.md` § `AC matrix`
* Sibling ticket (parallel work): AZ-962 (orchestrator config wiring)