[AZ-962] [AZ-964] [AZ-965] operator_replay.yaml + Tier-2 wiring

AZ-962 SHIPPED — Tier-2 Jetson AZ-840 orchestrator test no longer
SKIPs at the env-var gate. configs/operator_replay.yaml registers
c6/c7/c10/c11 with sane defaults (backbones intentionally empty,
see AZ-965); docker-compose.test.jetson.yml exports
GPS_DENIED_OPERATOR_CONFIG_PATH=/opt/configs/operator_replay.yaml
and bind-mounts ./configs:/opt/configs:ro. ENV_KEY_MAP gains
SATELLITE_PROVIDER_URL → c11_tile_manager.satellite_provider_url
and SATELLITE_PROVIDER_API_KEY → c11_tile_manager.service_api_key
so secrets flow from .env.test and never sit in YAML. README drops
the manual export step. 97/97 c11 + config unit tests stay green.

Tier-2 re-run (4 failed / 48 passed / 1 skipped / 1 xfailed /
1 xpassed / 2 errors in 84.99s vs baseline 3 skipped — i.e. -2
skipped, +2 errors): AZ-840 orchestrator test moves from SKIP to
ERROR with a deeper, real gate — IndexUnavailableError on
FaissDescriptorIndex against a fresh c6_tile_cache.root_dir.

AZ-964 (3 SP, todo/) filed for FAISS index bootstrap in the AZ-839
C3 fixture. AZ-965 (3 SP, todo/, blocked by AZ-964) filed for
NetVLAD ONNX backbone provisioning — the next gate the orchestrator
test will hit once FAISS clears.

Cycle-4 e2e gate remains NOT GREEN: AZ-840 chain is now AZ-964 →
AZ-965 → PASS; 60s smoke chain is AZ-963 → PASS. OKVIS2 deferral
directive (2026-05-29) unchanged — still gated behind Derkachi
e2e green, still NOT MET.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-29 16:42:55 +03:00
parent 92ba7997a9
commit 763d8b21ad
9 changed files with 272 additions and 6 deletions
File diff suppressed because one or more lines are too long
@@ -1,11 +1,20 @@
# AZ-962 — Wire `GPS_DENIED_OPERATOR_CONFIG_PATH` + `operator_replay.yaml` into Tier-2 Jetson harness
**Status**: To Do (Jira) / `todo/` (local)
**Status**: Done (Jira) / `done/` (local)
**Issue type**: Task
**Complexity**: 3 SP
**Cycle**: cycle-4 e2e closure follow-up
**Jira**: https://denyspopov.atlassian.net/browse/AZ-962
**Filed**: 2026-05-29 during cycle-4 Tier-2 validation run
**Shipped**: 2026-05-29 (same day)
## Closure note (2026-05-29)
Shipped: `configs/operator_replay.yaml` authored (registers all 4 blocks c6/c7/c10/c11), `docker-compose.test.jetson.yml` exports `GPS_DENIED_OPERATOR_CONFIG_PATH=/opt/configs/operator_replay.yaml` and bind-mounts `./configs:/opt/configs:ro`, and `ENV_KEY_MAP` (`src/gps_denied_onboard/config/loader.py`) gained two entries for `SATELLITE_PROVIDER_URL` / `SATELLITE_PROVIDER_API_KEY` → `c11_tile_manager` so secrets stay out of the YAML and flow in from `.env.test`. README `tests/e2e/replay/README.md` updated to drop the manual `export GPS_DENIED_OPERATOR_CONFIG_PATH=...` step.
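A minimal sketch of the env-to-config overlay described above. It assumes `ENV_KEY_MAP` maps an environment variable name to a `(block, field)` pair; the real structure in `loader.py` is not shown in this spec, so the tuple layout and the `apply_env_overrides` helper are hypothetical:

```python
from typing import Any, Mapping

# Assumed shape: env var name -> (config block, field). The two target keys
# come from the AZ-962 commit message; the tuple layout is illustrative only.
ENV_KEY_MAP: dict[str, tuple[str, str]] = {
    "SATELLITE_PROVIDER_URL": ("c11_tile_manager", "satellite_provider_url"),
    "SATELLITE_PROVIDER_API_KEY": ("c11_tile_manager", "service_api_key"),
}


def apply_env_overrides(
    config: Mapping[str, Mapping[str, Any]],
    environ: Mapping[str, str],
) -> dict[str, dict[str, Any]]:
    """Return a copy of `config` with mapped env values layered on top."""
    merged = {block: dict(fields) for block, fields in config.items()}
    for env_name, (block, field) in ENV_KEY_MAP.items():
        value = environ.get(env_name)
        if value:  # unset or empty variables leave the YAML value untouched
            merged.setdefault(block, {})[field] = value
    return merged
```

Called with the parsed YAML and `os.environ`, an overlay of this kind lets `operator_replay.yaml` ship without the provider URL or API key while `.env.test` supplies them at run time.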
Tier-2 re-run on Jetson AGX Orin (`JETSON_SSH_ALIAS=jetson bash scripts/run-tests-jetson.sh`): 4 failed / 48 passed / 1 skipped / 1 xfailed / 1 xpassed / 2 errors in 84.99s. AC-3 satisfied — `test_az840_e2e_real_flight_orchestration` no longer SKIPs at the env-var gate. AC-4 satisfied — it now ERRORs at a deeper, real gate (`IndexUnavailableError: FaissDescriptorIndex: .index file missing at /tmp/pytest-of-root/pytest-0/operator_pre_flight_cache0/descriptor.index`) which is captured in a NEW follow-up ticket **AZ-964**. The empty-backbones gate that this spec originally flagged (c10 backbones) becomes the gate AFTER AZ-964 clears — filed as **AZ-965**.
Net cycle-4 status remains NOT GREEN (orchestrator test still doesn't PASS, blocked by AZ-964 + AZ-965; ESKF divergence regression still blocked by AZ-963). AZ-962 itself is complete.
## Why
@@ -0,0 +1,80 @@
# AZ-964 — Bootstrap FAISS descriptor index for AZ-839 C3 fixture (`operator_pre_flight_cache`)
**Status**: To Do (Jira) / `todo/` (local)
**Issue type**: Task
**Complexity**: 3 SP
**Cycle**: cycle-4 e2e closure follow-up
**Jira**: https://denyspopov.atlassian.net/browse/AZ-964
**Filed**: 2026-05-29 (surfaced by AZ-962 Tier-2 re-run)
## Why
Discovered 2026-05-29 during the AZ-962 Tier-2 re-run on Jetson AGX Orin. With `GPS_DENIED_OPERATOR_CONFIG_PATH` + `operator_replay.yaml` now correctly wired (AZ-962 shipped), the AZ-840 orchestrator test (`tests/e2e/replay/test_az835_e2e_real_flight.py::test_az840_e2e_real_flight_orchestration`) moved from SKIPped to ERRORed at a deeper, real gate during fixture setup:
```
gps_denied_onboard.components.c6_tile_cache.errors.IndexUnavailableError:
FaissDescriptorIndex: .index file missing at
/tmp/pytest-of-root/pytest-0/operator_pre_flight_cache0/descriptor.index
```
The same error also breaks `test_operator_pre_flight_integration.py::test_operator_pre_flight_setup_produces_populated_cache`, confirming this is a fixture-wide problem, not specific to one test.
## Root cause (read from code)
`tests/e2e/replay/conftest.py::_build_operator_pre_flight_cache` (line 487):
1. Overrides `c6_tile_cache.root_dir` to a fresh `/tmp/pytest-of-root/.../operator_pre_flight_cache0/` (per AC of AZ-839, the fixture creates a *new* cache each test).
2. Calls `build_descriptor_index(config)` — which constructs `FaissDescriptorIndex.from_config(config)`.
3. `FaissDescriptorIndex.__init__` calls `_load()` which **raises** `IndexUnavailableError` when no `.index` file exists at `c6_tile_cache.root_dir/descriptor.index`.
4. The fixture never gets to call `populate_c6_from_route` (which presumably creates the index downstream).
The compose `tile-init` setup service exists and runs `scripts/mk_test_faiss_fixture.py` — but it writes a seed index to `/var/lib/gps-denied/tiles` (the `tile-data` volume), **not** to the tmp dir the fixture overrides into. So the fixture's override path always starts empty.
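The ordering problem can be reproduced with a toy model. This is a sketch only: `ToyDescriptorIndex` and `build_cache` are stand-ins written for illustration, not the project's `FaissDescriptorIndex` or fixture code, and the sketch assumes nothing about them beyond "the constructor loads from disk and raises when the file is absent":

```python
from pathlib import Path
from typing import Callable


class IndexUnavailableError(RuntimeError):
    """Stand-in for the project's c6_tile_cache error type."""


class ToyDescriptorIndex:
    """Toy model: loading happens in the constructor and needs an existing file."""

    def __init__(self, root_dir: Path) -> None:
        self.index_path = root_dir / "descriptor.index"
        self._load()

    def _load(self) -> None:
        if not self.index_path.is_file():
            raise IndexUnavailableError(f".index file missing at {self.index_path}")
        self.payload = self.index_path.read_bytes()


def build_cache(
    root_dir: Path,
    populate: Callable[[ToyDescriptorIndex], None],
) -> ToyDescriptorIndex:
    """Mirror the fixture's order: construct the index first, populate second."""
    index = ToyDescriptorIndex(root_dir)  # raises on a fresh, empty root_dir
    populate(index)  # never reached when the constructor raises
    return index
```

Against a fresh temporary directory `build_cache` raises before `populate` runs, which matches steps 3 and 4 above; once a `descriptor.index` file exists under `root_dir`, the same call proceeds to the populate step.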
## Goal
Make `_build_operator_pre_flight_cache` succeed past the `build_descriptor_index(config)` call so the AZ-840 orchestrator test can actually exercise the 7-step pipeline (or fail at the next real gate — c10 backbones, AZ-965).
## Scope
One of (in preference order; pick during implementation):
1. **Fixture seeds the index inline**: before calling `build_descriptor_index`, invoke `scripts/mk_test_faiss_fixture.py` programmatically (or in-process equivalent) against the override `root_dir`. Pure test-infra change.
2. **`populate_c6_from_route` creates the index if missing**: production code change so the descriptor-index factory tolerates a fresh `root_dir`. Larger blast radius — touches a shared factory.
3. **`FaissDescriptorIndex` supports an explicit `bootstrap=True` mode**: factory signal that this run intends to create a fresh index. Requires API design.
Option (1) is the smallest, lowest-risk path and the natural extension of the `tile-init` pattern already in compose. **Recommended.**
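Option (1) could look roughly like the helper below, called from the fixture before `build_descriptor_index`. This is a sketch under stated assumptions: `ensure_seed_index` and the `write_seed_index` callable are hypothetical names, and how `scripts/mk_test_faiss_fixture.py` is actually invoked (CLI arguments, importable function) still has to be checked during implementation:

```python
from pathlib import Path
from typing import Callable

INDEX_FILENAME = "descriptor.index"  # file name taken from the error message


def ensure_seed_index(
    root_dir: Path,
    write_seed_index: Callable[[Path], None],
) -> Path:
    """Create a seed index under `root_dir` unless one is already present.

    `write_seed_index` is a hypothetical callable wrapping whatever
    `scripts/mk_test_faiss_fixture.py` does; its real interface is not
    shown in this ticket.
    """
    index_path = root_dir / INDEX_FILENAME
    if index_path.is_file():
        return index_path  # leave an existing index untouched
    root_dir.mkdir(parents=True, exist_ok=True)
    write_seed_index(index_path)
    if not index_path.is_file():
        raise RuntimeError(f"seed writer did not produce {index_path}")
    return index_path
```

Keeping the seed writer as an injected callable means the fixture decides whether to run the script in a subprocess or call an in-process equivalent, and the check after the call fails loudly if the script writes to a different layout than the override path expects.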
## Acceptance Criteria
* **AC-1**: `_build_operator_pre_flight_cache` no longer ERRORs at `build_descriptor_index` when started against a fresh empty `c6_tile_cache.root_dir`.
* **AC-2**: `JETSON_SSH_ALIAS=<alias> bash scripts/run-tests-jetson.sh` no longer reports the `IndexUnavailableError` for `test_az840_e2e_real_flight_orchestration` **or** for `test_operator_pre_flight_setup_produces_populated_cache`.
* **AC-3**: If the AZ-840 orchestrator test now reaches the c10-backbone gate (`AZ-839 operator_pre_flight_setup: config has no c10_provisioning.backbones entries`), that's the expected next gate — AZ-965 handles it; AZ-964 is done.
* **AC-4**: `tests/unit` + `tests/e2e/replay/test_operator_pre_flight_*` continue to pass on Tier-1 (Colima).
## Out of scope
* c10 backbone provisioning (separate ticket — AZ-965).
* The 4 ESKF-divergence regression failures in `test_derkachi_1min.py` (separate ticket — AZ-963).
* Adding a reference C6 tile cache for the Derkachi fixture (large separate work).
* Re-opening AZ-840 / AZ-842 tracker state.
## Dependencies
* **Blocks**: AZ-840 (orchestrator test cannot run end-to-end until this clears).
* **Surfaced by**: AZ-962 (env-var + YAML wiring exposed the next gate).
* **Related**: AZ-839 (C3 fixture — this is its bug to own).
## Estimate
3 SP. Multi-step (locate the seed-index script, invoke it from the fixture before `build_descriptor_index`, verify on Tier-2), moderate risk (the seed script's assumptions might not match the fixture's override path layout).
## References
* Run log: 2026-05-29 Tier-2 Jetson AGX Orin (AZ-962 re-run), 84.99s, 4 failed / 48 passed / 1 skipped / 1 xfailed / 1 xpassed / 2 errors
* Test: `tests/e2e/replay/test_az835_e2e_real_flight.py::test_az840_e2e_real_flight_orchestration` (ERROR)
* Test: `tests/e2e/replay/test_operator_pre_flight_integration.py::test_operator_pre_flight_setup_produces_populated_cache` (ERROR)
* Fixture: `tests/e2e/replay/conftest.py:487`
* Faulting factory: `src/gps_denied_onboard/runtime_root/storage_factory.py:176`
* Faulting class: `src/gps_denied_onboard/components/c6_tile_cache/faiss_descriptor_index.py:107,430`
* Existing seed script: `scripts/mk_test_faiss_fixture.py` (invoked by `tile-init` compose service)
* AZ-962 spec: `_docs/02_tasks/done/AZ-962_operator_config_jetson_wiring.md`
@@ -0,0 +1,83 @@
# AZ-965 — Provision NetVLAD ONNX backbone for AZ-839 `c10_provisioning` corpus
**Status**: To Do (Jira) / `todo/` (local)
**Issue type**: Task
**Complexity**: 3 SP (5 SP if export/training required)
**Cycle**: cycle-4 e2e closure follow-up
**Jira**: https://denyspopov.atlassian.net/browse/AZ-965
**Filed**: 2026-05-29 (anticipated during AZ-962)
## Why
Identified ahead of time during AZ-962. The AZ-839 C3 fixture's `_build_replay_backbone_embedder` (`conftest.py:594-601`) calls `build_backbone_specs(config)`, which reads `config.components['c10_provisioning'].backbones` (a tuple of `BackboneSpec`). When that tuple is empty (the current state, since no `.onnx` files ship in the repo), the fixture `pytest.skip`s with:
```
AZ-839 operator_pre_flight_setup: config has no c10_provisioning.backbones
entries — the e2e harness config must declare at least one backbone
(typically DINOv2-VPR or NetVLAD per AZ-321).
```
The AZ-962 YAML (`configs/operator_replay.yaml`) explicitly leaves the `backbones:` list empty with a TODO note pointing at this ticket. Right now (post-AZ-962) the AZ-840 orchestrator test ERRORs at the FAISS-index gate (AZ-964) **before** reaching the backbones gate — but once AZ-964 ships, this is the next blocker.
## Goal
Provision a NetVLAD `.onnx` model (per AZ-321's pinned backbone choice) and matching `BackboneSpec` entry in `configs/operator_replay.yaml` so `c10_provisioning.compile_engines_for_corpus` can compile at least one engine in the AZ-839 fixture.
## Scope
1. **Source a NetVLAD `.onnx`**: AZ-321 specifies NetVLAD as the C2 baseline. Either:
- Export from an existing PyTorch checkpoint our team owns;
- Pull a vetted public weights file (with license/provenance recorded in `_docs/03_ip_attribution/`);
- Train from scratch (out of scope for this ticket — file a follow-up if neither of the above works).
2. **Place the `.onnx` in the repo**: under a path that's bind-mounted into the Jetson container (e.g. `models/netvlad/netvlad.onnx`). Add to `.gitattributes` for git-lfs if >50 MiB. Verify size against existing checked-in models.
3. **Verify TensorRT compile**: run `c7_inference.PyTorchFp16Runtime.compile_engine` (or the relevant production code path) against the new `.onnx` on Jetson AGX Orin to confirm a `.engine` file is produced with a sensible descriptor dim (typically 4096 per AZ-321).
4. **Populate `configs/operator_replay.yaml`**:
```yaml
c10_provisioning:
  workspace_mb: 4096
  backbones:
    - model_name: netvlad
      onnx_path: /opt/models/netvlad/netvlad.onnx
      input_name: image
      input_shape_chw: [3, 224, 224]
      descriptor_dim: 4096
```
(Exact field names per `BackboneSpec` dataclass — verify in `src/gps_denied_onboard/components/c10_provisioning/`.)
5. **Wire `./models` bind-mount** into `docker-compose.test.jetson.yml`.
6. **Update `c2_vpr` block** in the YAML if `_resolve_replay_descriptor_dim` requires `c2_vpr.strategy='net_vlad'` (it does — see `conftest.py:658-666`).
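Before wiring the entry from step 4 into the harness, a small validation pass can catch malformed values early. The sketch below assumes the field names from the YAML example in step 4, which this ticket itself says must be verified against the real `BackboneSpec`; `BackboneSpecSketch` and `parse_backbone_entry` are hypothetical names and not the project's dataclass or factory:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Mapping


@dataclass(frozen=True)
class BackboneSpecSketch:
    """Illustrative mirror of the YAML entry; not the project's BackboneSpec."""

    model_name: str
    onnx_path: Path
    input_name: str
    input_shape_chw: tuple[int, int, int]
    descriptor_dim: int


def parse_backbone_entry(entry: Mapping[str, Any]) -> BackboneSpecSketch:
    """Validate one `backbones:` entry and return a typed spec."""
    shape = tuple(entry["input_shape_chw"])
    if len(shape) != 3 or any(not isinstance(d, int) or d <= 0 for d in shape):
        raise ValueError(f"input_shape_chw must be three positive ints, got {shape}")
    dim = entry["descriptor_dim"]
    if not isinstance(dim, int) or dim <= 0:
        raise ValueError(f"descriptor_dim must be a positive int, got {dim!r}")
    onnx_path = Path(entry["onnx_path"])
    if onnx_path.suffix != ".onnx":
        raise ValueError(f"onnx_path must point at an .onnx file, got {onnx_path}")
    return BackboneSpecSketch(
        model_name=str(entry["model_name"]),
        onnx_path=onnx_path,
        input_name=str(entry["input_name"]),
        input_shape_chw=shape,
        descriptor_dim=dim,
    )
```

The check deliberately does not test whether the file exists, because `/opt/models/...` only resolves inside the container once the `./models` bind-mount from step 5 is in place.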
## Acceptance Criteria
* **AC-1**: `models/netvlad/netvlad.onnx` (or equivalent path) exists in the repo with documented provenance + license.
* **AC-2**: `c7_inference` can compile this `.onnx` to a TensorRT `.engine` on Jetson AGX Orin (Tier-2) without errors.
* **AC-3**: `configs/operator_replay.yaml` declares the `netvlad` backbone in `c10_provisioning.backbones`.
* **AC-4**: `JETSON_SSH_ALIAS=<alias> bash scripts/run-tests-jetson.sh` no longer SKIPs `test_az840_e2e_real_flight_orchestration` with the empty-backbones message.
* **AC-5**: The AZ-840 orchestrator test either PASSes (and the AZ-699 verdict report lands at `_docs/06_metrics/real_flight_validation_<YYYY-MM-DD>.md`) or fails with a NEW error filed as a separate follow-up ticket.
* **AC-6**: License/provenance recorded in `_docs/03_ip_attribution/` per project convention.
## Out of scope
* DINOv2-VPR or other alternative backbones (NetVLAD is AZ-321's pinned baseline).
* MegaLoc / MixVPR / UltraVPR (these require a descriptor-dim resolver change — out of conftest scope).
* The 4 ESKF-divergence regression failures (AZ-963).
* Reference C6 tile cache for the Derkachi fixture (large separate work).
## Dependencies
* **Blocked by**: AZ-964 (FAISS index bootstrap — the orchestrator test ERRORs there before reaching this gate; clearing AZ-964 first surfaces the empty-backbones gate cleanly).
* **Blocks**: AZ-840 (orchestrator test cannot PASS end-to-end without a real backbone).
* **Related**: AZ-321 (defines NetVLAD as the C2 baseline), AZ-839 (C3 fixture).
## Estimate
3 SP if a usable `.onnx` already exists in the team's drive; 5 SP if export/training is needed. If 5+ SP, consider splitting model-acquisition from yaml-wiring into two sub-tickets.
## References
* Fixture skip-gate: `tests/e2e/replay/conftest.py:594-601`
* Backbone factory: `src/gps_denied_onboard/runtime_root/c10_factory.py::build_backbone_specs`
* Backbone spec dataclass: `src/gps_denied_onboard/components/c10_provisioning/config.py`
* AZ-321 (NetVLAD baseline choice)
* AZ-962 spec: `_docs/02_tasks/done/AZ-962_operator_config_jetson_wiring.md`