[AZ-964] FAISS index bootstrap for AZ-839 fixture + build flag

AZ-964 SHIPPED — AZ-840 orchestrator test moves past FAISS gate.

Changes:
* tests/e2e/replay/_faiss_seed.py — extracts the empty HNSW32
  seeding logic from scripts/mk_test_faiss_fixture.py into a
  reusable test-infra module: seed_empty_faiss_index(root_dir,
  *, descriptor_dim=512, backbone_label="ultra_vpr") -> Path.
* scripts/mk_test_faiss_fixture.py rewritten as a thin CLI shim
  importing the same helper. The compose `tile-init` contract is
  preserved.
* tests/e2e/replay/conftest.py::_build_operator_pre_flight_cache
  now calls seed_empty_faiss_index(cache_root) immediately before
  build_descriptor_index(config), so the factory's _load() finds
  a valid .index + .sha256 + .meta.json triplet at the fixture's
  override root_dir. populate_c6_from_route later in the fixture
  rebuilds the real index once route tiles are downloaded.
* docker-compose.test.jetson.yml: BUILD_PYTORCH_FP16_RUNTIME: "ON"
  added to e2e-runner.environment. This is scope creep, documented
  in the spec: while validating AZ-964, Tier-2 surfaced a third
  config gap on the same fixture chain (RuntimeNotAvailableError:
  ... the flag is OFF). The fix is one line of wiring; the
  dustynv/l4t-pytorch base image bakes in the Tegra-tuned PyTorch
  wheel and pytorch_fp16_runtime.py exists, so flipping the flag
  is sufficient.

Tier-2 verdict (4F / 48P / 3S / 1XF / 1XP in 86.07s, 0 errors;
was 2 errors before this commit): AZ-840 orchestrator test moves
from ERROR at FAISS gate to SKIP at empty-backbones gate, which is
exactly the AZ-965 gate that AZ-964 AC-3 anticipated.
test_operator_pre_flight_integration now SKIPs cleanly too. The 4
derkachi_1min ESKF-divergence FAILs are constant across all three
runs today (AZ-963 path, independent of orchestrator chain).

Three Tier-2 runs today on the orchestrator chain:
  i.   pre-AZ-962: SKIP at env-var gate
  ii.  post-AZ-962: ERROR at FAISS gate
  iii. post-AZ-964: SKIP at backbones gate (AZ-965)

Cycle-4 e2e gate still NOT GREEN. Orchestrator chain remaining =
AZ-965 (NetVLAD backbone provisioning); 60s smoke chain remaining
= AZ-963 (ESKF divergence). OKVIS2 deferral directive unchanged.

Pre-existing yamllint false positive at
docker-compose.test.jetson.yml:185 (sibling `volumes:` keys flagged
as duplicates without respecting parent-key scope). PyYAML parses
the file cleanly with no duplicates and docker-compose accepts it
at runtime.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-29 17:02:49 +03:00
parent 763d8b21ad
commit 288aae881d
7 changed files with 144 additions and 44 deletions
@@ -1,80 +0,0 @@
# AZ-964 — Bootstrap FAISS descriptor index for AZ-839 C3 fixture (`operator_pre_flight_cache`)
**Status**: To Do (Jira) / `todo/` (local)
**Issue type**: Task
**Complexity**: 3 SP
**Cycle**: cycle-4 e2e closure follow-up
**Jira**: https://denyspopov.atlassian.net/browse/AZ-964
**Filed**: 2026-05-29 (surfaced by AZ-962 Tier-2 re-run)
## Why
Discovered 2026-05-29 during the AZ-962 Tier-2 re-run on Jetson AGX Orin. With `GPS_DENIED_OPERATOR_CONFIG_PATH` + `operator_replay.yaml` now correctly wired (AZ-962 shipped), the AZ-840 orchestrator test (`tests/e2e/replay/test_az835_e2e_real_flight.py::test_az840_e2e_real_flight_orchestration`) moved from SKIPped to ERRORed at a deeper, real gate during fixture setup:
```
gps_denied_onboard.components.c6_tile_cache.errors.IndexUnavailableError:
FaissDescriptorIndex: .index file missing at
/tmp/pytest-of-root/pytest-0/operator_pre_flight_cache0/descriptor.index
```
The same error also breaks `test_operator_pre_flight_integration.py::test_operator_pre_flight_setup_produces_populated_cache`, confirming this is a fixture-wide problem, not specific to one test.
## Root cause (read from code)
`tests/e2e/replay/conftest.py::_build_operator_pre_flight_cache` (line 487):
1. Overrides `c6_tile_cache.root_dir` to a fresh `/tmp/pytest-of-root/.../operator_pre_flight_cache0/` (per the AZ-839 AC, the fixture creates a *new* cache for each test).
2. Calls `build_descriptor_index(config)`, which constructs the index via `FaissDescriptorIndex.from_config(config)`.
3. `FaissDescriptorIndex.__init__` calls `_load()` which **raises** `IndexUnavailableError` when no `.index` file exists at `c6_tile_cache.root_dir/descriptor.index`.
4. The fixture never gets to call `populate_c6_from_route` (which presumably creates the index downstream).
The compose `tile-init` setup service exists and runs `scripts/mk_test_faiss_fixture.py` — but it writes a seed index to `/var/lib/gps-denied/tiles` (the `tile-data` volume), **not** to the tmp dir the fixture overrides into. So the fixture's override path always starts empty.
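
The failure mode described above can be reproduced in miniature. The sketch below is a stand-in, not the project's code: `StubDescriptorIndex`, `try_build`, and the local `IndexUnavailableError` are hypothetical, and the directory names are illustrative. Only the `descriptor.index` file name and the raise-on-missing behavior are taken from the traceback and the steps above.

```python
import tempfile
from pathlib import Path


class IndexUnavailableError(RuntimeError):
    """Local stand-in for the c6_tile_cache error of the same name."""


class StubDescriptorIndex:
    """Hypothetical stand-in that loads eagerly in __init__, as in step 3."""

    def __init__(self, root_dir: Path) -> None:
        self.index_path = Path(root_dir) / "descriptor.index"
        self._load()

    def _load(self) -> None:
        # Raise instead of creating the file when it is absent.
        if not self.index_path.is_file():
            raise IndexUnavailableError(
                f".index file missing at {self.index_path}"
            )


def try_build(root_dir: Path) -> str:
    """Return "OK" if the stub index loads, "ERROR" if the file is missing."""
    try:
        StubDescriptorIndex(root_dir)
    except IndexUnavailableError:
        return "ERROR"
    return "OK"


with tempfile.TemporaryDirectory() as tmp:
    volume_dir = Path(tmp) / "tiles"  # plays the role of the tile-data volume
    override_dir = Path(tmp) / "operator_pre_flight_cache0"  # fixture override
    volume_dir.mkdir()
    override_dir.mkdir()

    # The seed step writes into the volume only, never into the override dir.
    (volume_dir / "descriptor.index").write_bytes(b"seed")

    volume_result = try_build(volume_dir)
    override_result = try_build(override_dir)
    print(volume_result, override_result)
```

Seeding one directory has no effect on a loader pointed at another, which is why the fixture's override path needs its own seed step.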
## Goal
Make `_build_operator_pre_flight_cache` succeed past the `build_descriptor_index(config)` call so the AZ-840 orchestrator test can actually exercise the 7-step pipeline (or fail at the next real gate — c10 backbones, AZ-965).
## Scope
One of the following, in preference order (pick during implementation):
1. **Fixture seeds the index inline**: before calling `build_descriptor_index`, invoke `scripts/mk_test_faiss_fixture.py` programmatically (or in-process equivalent) against the override `root_dir`. Pure test-infra change.
2. **`populate_c6_from_route` creates the index if missing**: production code change so the descriptor-index factory tolerates a fresh `root_dir`. Larger blast radius — touches a shared factory.
3. **`FaissDescriptorIndex` supports an explicit `bootstrap=True` mode**: factory signal that this run intends to create a fresh index. Requires API design.
Option (1) is the smallest, lowest-risk path and the natural extension of the `tile-init` pattern already in compose. **Recommended.**
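
Option (1) could look roughly like the following. This is a sketch under stated assumptions, not the shipped helper: `seed_placeholder_index` is a hypothetical name, the sidecar file names (`descriptor.sha256`, `descriptor.meta.json`) and the metadata keys are guesses, and the placeholder bytes stand in for a serialized empty FAISS index, which the real seed script would produce.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def seed_placeholder_index(
    root_dir: Path, index_bytes: bytes, *, descriptor_dim: int = 512
) -> Path:
    """Write an index file plus checksum and metadata sidecars into root_dir.

    Hypothetical helper: index_bytes stands in for a serialized empty index.
    """
    root_dir = Path(root_dir)
    root_dir.mkdir(parents=True, exist_ok=True)

    index_path = root_dir / "descriptor.index"
    index_path.write_bytes(index_bytes)

    # Checksum sidecar, so a loader can verify the index file's integrity.
    digest = hashlib.sha256(index_bytes).hexdigest()
    (root_dir / "descriptor.sha256").write_text(digest + "\n", encoding="utf-8")

    # Metadata sidecar; the keys here are assumptions for illustration.
    meta = {"descriptor_dim": descriptor_dim, "num_vectors": 0}
    (root_dir / "descriptor.meta.json").write_text(
        json.dumps(meta, indent=2), encoding="utf-8"
    )
    return index_path


with tempfile.TemporaryDirectory() as tmp:
    cache_root = Path(tmp) / "operator_pre_flight_cache0"
    payload = b"placeholder-index-bytes"

    # Seed first; the fixture would call build_descriptor_index(config) next.
    seeded = seed_placeholder_index(cache_root, payload)

    files = sorted(p.name for p in cache_root.iterdir())
    recorded = (cache_root / "descriptor.sha256").read_text(encoding="utf-8").strip()
    print(files)
```

Keeping the seed step in test infrastructure leaves the production factory's raise-on-missing behavior untouched, which is the main reason option (1) has the smallest blast radius.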
## Acceptance Criteria
* **AC-1**: `_build_operator_pre_flight_cache` no longer ERRORs at `build_descriptor_index` when started against a fresh empty `c6_tile_cache.root_dir`.
* **AC-2**: `JETSON_SSH_ALIAS=<alias> bash scripts/run-tests-jetson.sh` no longer reports the `IndexUnavailableError` for `test_az840_e2e_real_flight_orchestration` **or** for `test_operator_pre_flight_setup_produces_populated_cache`.
* **AC-3**: If the AZ-840 orchestrator test now reaches the c10-backbone gate (`AZ-839 operator_pre_flight_setup: config has no c10_provisioning.backbones entries`), that's the expected next gate — AZ-965 handles it; AZ-964 is done.
* **AC-4**: `tests/unit` + `tests/e2e/replay/test_operator_pre_flight_*` continue to pass on Tier-1 (Colima).
## Out of scope
* c10 backbone provisioning (separate ticket — AZ-965).
* The 4 ESKF-divergence regression failures in `test_derkachi_1min.py` (separate ticket — AZ-963).
* Adding a reference C6 tile cache for the Derkachi fixture (large separate work).
* Re-opening AZ-840 / AZ-842 tracker state.
## Dependencies
* **Blocks**: AZ-840 (orchestrator test cannot run end-to-end until this clears).
* **Surfaced by**: AZ-962 (env-var + YAML wiring exposed the next gate).
* **Related**: AZ-839 (C3 fixture — this is its bug to own).
## Estimate
3 SP. Multi-step (locate the seed-index script, invoke it from the fixture before `build_descriptor_index`, verify on Tier-2), moderate risk (the seed script's assumptions might not match the fixture's override path layout).
## References
* Run log: 2026-05-29 Tier-2 Jetson AGX Orin (AZ-962 re-run), 84.99s, 4 failed / 48 passed / 1 skipped / 1 xfailed / 1 xpassed / 2 errors
* Test: `tests/e2e/replay/test_az835_e2e_real_flight.py::test_az840_e2e_real_flight_orchestration` (ERROR)
* Test: `tests/e2e/replay/test_operator_pre_flight_integration.py::test_operator_pre_flight_setup_produces_populated_cache` (ERROR)
* Fixture: `tests/e2e/replay/conftest.py:487`
* Faulting factory: `src/gps_denied_onboard/runtime_root/storage_factory.py:176`
* Faulting class: `src/gps_denied_onboard/components/c6_tile_cache/faiss_descriptor_index.py:107,430`
* Existing seed script: `scripts/mk_test_faiss_fixture.py` (invoked by `tile-init` compose service)
* AZ-962 spec: `_docs/02_tasks/done/AZ-962_operator_config_jetson_wiring.md`