mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 19:11:14 +00:00
[AZ-965] NetVLAD-VGG16 backbone checkpoint + YAML/compose wiring
AZ-965 ships the NetVLAD .pt checkpoint that clears the AZ-839
empty-c10_provisioning.backbones SKIP gate. Pipeline-integration
scaffold — encoder is real, NetVLAD tail is honestly labelled as
untrained.
Composition:
* Encoder (26 keys, encoder.0..encoder.28): torchvision
vgg16(weights=IMAGENET1K_V1) features [:-2], BSD-3-Clause.
Real ImageNet-pretrained VGG16 conv stack.
* NetVLAD pool + PCA tail (5 keys: pool.conv.{weight,bias},
pool.centroids, pca.{weight,bias}): random-init via
torch.manual_seed(0). NOT trained for visual place recognition.
Total: 149,002,112 params (568.4 MiB fp32, sha256=745c6f29...).
Round-trip verified locally: torch.load(weights_only=True) +
load_state_dict(strict=True) succeed; forward(1,3,480,480) emits
{'vlad_descriptor': (1, 4096) fp32} — matches NetVladStrategy
contract per net_vlad.py:247-251.
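
The totals above can be cross-checked by hand. The sketch below is an
illustration, not the project's script: it assumes the standard torchvision
VGG16 3x3 conv layout for the 13 encoder convolutions, a bias-carrying 1x1
conv for pool.conv, a 64x512 centroid table, and a PCA layer mapping the
64*512 VLAD vector to 4096 dimensions. Those shapes are inferred from the key
names listed above and were not copied from _net_vlad_architecture.py.

```python
# Hand count of NetVLAD-VGG16 parameters (layer shapes are assumptions).

# (in_channels, out_channels) of the 13 conv layers in torchvision's VGG16.
VGG16_CONVS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]

NUM_CLUSTERS, ENCODER_DIM, DESCRIPTOR_DIM = 64, 512, 4096


def encoder_params() -> int:
    # 3x3 kernels: weight = out * in * 9, plus one bias per output channel.
    return sum(o * i * 9 + o for i, o in VGG16_CONVS)


def tail_params() -> int:
    pool_conv = NUM_CLUSTERS * ENCODER_DIM + NUM_CLUSTERS  # 1x1 conv weight + bias
    centroids = NUM_CLUSTERS * ENCODER_DIM                 # one centroid per cluster
    vlad_dim = NUM_CLUSTERS * ENCODER_DIM                  # flattened VLAD vector
    pca = DESCRIPTOR_DIM * vlad_dim + DESCRIPTOR_DIM       # projection weight + bias
    return pool_conv + centroids + pca


total = encoder_params() + tail_params()
size_mib = total * 4 / 2**20  # fp32 = 4 bytes per parameter
print(f"{total:,} params, {size_mib:.1f} MiB fp32")
# prints: 149,002,112 params, 568.4 MiB fp32
```

Under these assumptions the 13 convolutions also account for the 26 encoder
keys (one weight and one bias each).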
Two material discoveries documented in the AZ-965 spec:
1. The NetVLAD-VGG16 architecture already lives in repo at
src/gps_denied_onboard/components/c2_vpr/_net_vlad_architecture.py,
so we instantiate it and save a state_dict instead of sourcing a
model externally.
2. The PyTorch FP16 runtime expects a .pt state_dict (NOT .onnx).
BackboneConfig.onnx_path is a misnomer for NetVLAD: per AZ-321
design + c2_vpr description.md §1, NetVLAD runs on PyTorch FP16
(NOT TRT). compile_engine is a no-op sha256+path wrap;
deserialize_engine does torch.load(weights_only=True) +
load_state_dict(strict=True).
The user skipped the Option A/B/C/D/E question, so the judgment call
was Option B (IMAGENET1K_V1 + random tail) per "use judgment, don't
block":
* Option A (Nanne translation) was 5-8 SP, above the 5 SP budget.
* Option B is 3 SP, fits the budget, honestly labelled.
* Option C (pure random) was borderline-dishonest per "Real Results,
Not Simulated Ones".
Files:
* scripts/mk_netvlad_checkpoint.py — deterministic generator.
* models/netvlad/netvlad.pt — 568 MiB, via git-lfs (.gitattributes
extended for models/**/*.pt, *.onnx, *.engine).
* configs/operator_replay.yaml — c2_vpr + c10_provisioning blocks
populated; the field literally named onnx_path actually points
at the .pt for NetVLAD per the runtime semantics noted above.
* docker-compose.test.jetson.yml — ./models:/opt/models:ro bind
mount added to e2e-runner.
* _docs/03_ip_attribution/netvlad.md — provenance, licence, how-to-
reproduce, honest scope statement ("NOT a real-retrieval
checkpoint; ESKF divergence under garbage retrievals is the
expected next gate").
* _docs/02_tasks/todo/AZ-965_netvlad_onnx_backbone_provisioning.md
— rewritten to reflect the .pt-not-.onnx + Option B discoveries.
Tier-2 verification follows in a separate commit after the harness
run confirms the empty-backbones SKIP gate clears.
Out of scope (filed as follow-ups):
* Real-retrieval NetVLAD weights (Nanne Pittsburgh-30k translation
or internal team checkpoint) — separate ticket.
* AZ-840 orchestrator PASSing end-to-end (depends on retrieval
quality + ESKF stability).
* AZ-963 60s smoke ESKF divergence (independent chain).
Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -1,15 +1,16 @@
# AZ-965 — Provision NetVLAD ONNX backbone for AZ-839 `c10_provisioning` corpus
# AZ-965 — Provision NetVLAD backbone for AZ-839 `c10_provisioning` corpus
**Status**: To Do (Jira) / `todo/` (local)
**Status**: In Progress (Jira) / `todo/` (local)
**Issue type**: Task
**Complexity**: 3 SP (5 SP if export/training required)
**Complexity**: 3 SP (was estimated 3-5)
**Cycle**: cycle-4 e2e closure follow-up
**Jira**: https://denyspopov.atlassian.net/browse/AZ-965
**Filed**: 2026-05-29 (forward-looked during AZ-962)
**Started**: 2026-05-29
## Why
Forward-looked during AZ-962. The AZ-839 C3 fixture's `_build_replay_backbone_embedder` (`conftest.py:594-601`) calls `build_backbone_specs(config)` which reads `config.components['c10_provisioning'].backbones` (a tuple of `BackboneSpec`). When empty (the current state — no `.onnx` files ship in the repo), the fixture `pytest.skip`s with:
Forward-looked during AZ-962 + confirmed by AZ-964's Tier-2 result: with the FAISS index gate cleared (AZ-964), the AZ-840 orchestrator test SKIPs at the **empty-backbones gate** in `tests/e2e/replay/conftest.py:594-601`:
```
AZ-839 operator_pre_flight_setup: config has no c10_provisioning.backbones
@@ -17,67 +18,97 @@ entries — the e2e harness config must declare at least one backbone
(typically DINOv2-VPR or NetVLAD per AZ-321).
```
The AZ-962 YAML (`configs/operator_replay.yaml`) explicitly leaves the `backbones:` list empty with a TODO note pointing at this ticket. Right now (post-AZ-962) the AZ-840 orchestrator test ERRORs at the FAISS-index gate (AZ-964) **before** reaching the backbones gate — but once AZ-964 ships, this is the next blocker.
## Important corrections to the original spec
Two material discoveries during AZ-965 implementation that change the work shape:
1. **The architecture already exists in repo**: `src/gps_denied_onboard/components/c2_vpr/_net_vlad_architecture.py` defines `make_net_vlad_vgg16(num_clusters=64, encoder_dim=512, descriptor_dim=4096)` — the project's own NetVLAD-VGG16 module. We do NOT need to source ONNX from elsewhere; we instantiate the architecture, load weights into it, and save a state_dict.
2. **Runtime expects a PyTorch `.pt` state_dict, NOT `.onnx`**. Per AZ-321's design (and `_docs/02_document/components/02_c2_vpr/description.md` §1): NetVLAD runs on the C7 **PyTorch FP16 runtime** (NOT TensorRT). The PyTorch FP16 `compile_engine` is a **no-op** that sha-256's the `.pt` path; `deserialize_engine` calls `torch.load(weights_only=True)` + `model.load_state_dict(state_dict, strict=True)`. The `BackboneConfig.onnx_path` field is a **misnomer for NetVLAD** — for the TensorRT primary backbone (UltraVPR/DINOv2) it really is `.onnx`, but for the PyTorch-FP16 baseline (NetVLAD) it's a `.pt` path.
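
To make the "no-op compile" in item 2 concrete, here is a minimal stdlib-only sketch of what such a wrapper could look like. `EngineHandle` and `compile_engine_noop` are hypothetical names, not the real `c7_inference` API, and hashing the file contents (rather than the path string) is an assumption about what "sha-256's the `.pt` path" means.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class EngineHandle:
    """Hypothetical stand-in for whatever the runtime returns from compile_engine."""

    path: Path
    sha256: str


def compile_engine_noop(weights_path: str, chunk_size: int = 1 << 20) -> EngineHandle:
    """Do no compilation: fingerprint the checkpoint and hand its path back."""
    path = Path(weights_path)
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Stream in chunks so a ~568 MiB checkpoint is never fully held in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return EngineHandle(path=path, sha256=digest.hexdigest())
```

In this picture the real work happens later, when the deserialize step loads the recorded path with `torch.load(weights_only=True)` and `load_state_dict(strict=True)` as described above.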
## Chosen approach — Option B (judgment call)
The original spec's source options were:
* A — Translate Nanne/pytorch-NetVlad's Pittsburgh-30k weights (5-8 SP — exceeds the 5 SP budget per `tracker.mdc` user-rule; needs split).
* B — `torchvision.models.vgg16(weights="IMAGENET1K_V1")` encoder + deterministic-random NetVLAD pool/PCA (3 SP, honestly labelled as untrained-tail).
* C — Pure synthetic state_dict (2 SP, but borderline-dishonest per "Real Results, Not Simulated Ones").
* D — Internal team checkpoint (user-provided).
* E — Defer AZ-965 entirely.
The user was presented options A-E on 2026-05-29 and skipped the choice. Per the "use judgment, don't block" pattern observed that day, the judgment call was **Option B**: torchvision IMAGENET1K_V1 encoder + deterministic-random tail. Reasoning:
* Encoder IS a real public source (torchvision BSD-3-Clause).
* 3 SP fits the budget.
* NetVLAD pool + PCA tail clearly labelled as untrained in provenance — honest per meta-rule.
* Unblocks the gate to surface the next real issue (which is likely ESKF divergence under garbage retrievals — a separate ticket).
## Goal
Provision a NetVLAD `.onnx` model (per AZ-321's pinned backbone choice) and matching `BackboneSpec` entry in `configs/operator_replay.yaml` so `c10_provisioning.compile_engines_for_corpus` can compile at least one engine in the AZ-839 fixture.
Provision a NetVLAD-VGG16 `.pt` checkpoint at `models/netvlad/netvlad.pt` + matching `BackboneConfig` entry in `configs/operator_replay.yaml` so the AZ-839 fixture skip-gate clears and the AZ-840 orchestrator can compose c10 (+ c2_vpr) into a real pipeline run.
## Scope
1. **Source a NetVLAD `.onnx`**: AZ-321 specifies NetVLAD as the C2 baseline. Either:
- Export from an existing PyTorch checkpoint our team owns;
- Pull a vetted public weights file (with license/provenance recorded in `_docs/03_ip_attribution/`);
- Train from scratch (out of scope for this ticket — file a follow-up if neither of the above works).
2. **Place the `.onnx` in the repo**: under a path that's bind-mounted into the Jetson container (e.g. `models/netvlad/netvlad.onnx`). Add to `.gitattributes` for git-lfs if >50 MiB. Verify size against existing checked-in models.
3. **Verify TensorRT compile**: run `c7_inference.PyTorchFp16Runtime.compile_engine` (or the relevant production code path) against the new `.onnx` on Jetson AGX Orin to confirm a `.engine` file is produced with a sensible descriptor dim (typically 4096 per AZ-321).
4. **Populate `configs/operator_replay.yaml`**:
1. **Write `scripts/mk_netvlad_checkpoint.py`** — generates a deterministic `.pt`:
* Loads `torchvision.models.vgg16(weights="IMAGENET1K_V1")` features, slices `[:-2]` to match `_NetVladVgg16.encoder`.
* Seeds `torch.manual_seed(0)`, instantiates `make_net_vlad_vgg16(num_clusters=64, encoder_dim=512, descriptor_dim=4096)`, overlays ImageNet features into `encoder.*` keys.
* Saves to `models/netvlad/netvlad.pt`.
* Prints SHA-256 + key composition.
2. **Add `models/**/*.pt`, `*.onnx`, `*.engine` to `.gitattributes` for git-lfs**.
3. **Commit `models/netvlad/netvlad.pt` via git-lfs**.
4. **Update `configs/operator_replay.yaml`**:
```yaml
c2_vpr:
  strategy: net_vlad
  backbone_weights_path: /opt/models/netvlad/netvlad.pt
  netvlad_descriptor_dim: 4096
  warn_top1_threshold: 0.30

c10_provisioning:
  workspace_mb: 4096
  backbones:
    - model_name: netvlad
      onnx_path: /opt/models/netvlad/netvlad.onnx
      input_name: image
      input_shape_chw: [3, 224, 224]
      descriptor_dim: 4096
    - model_name: net_vlad
      onnx_path: /opt/models/netvlad/netvlad.pt
      expected_input_shape: [3, 480, 480]
      input_name: input
```
(Exact field names per `BackboneSpec` dataclass — verify in `src/gps_denied_onboard/components/c10_provisioning/`.)
5. **Wire `./models` bind-mount** into `docker-compose.test.jetson.yml`.
6. **Update `c2_vpr` block** in the YAML if `_resolve_replay_descriptor_dim` requires `c2_vpr.strategy='net_vlad'` (it does — see `conftest.py:658-666`).
5. **Add `./models:/opt/models:ro` bind-mount** to `docker-compose.test.jetson.yml` e2e-runner.
6. **Write `_docs/03_ip_attribution/netvlad.md`** — provenance, licence, how to reproduce, honest scope statement.
7. **Tier-2 verify**: `JETSON_SSH_ALIAS=jetson bash scripts/run-tests-jetson.sh` — confirm the AZ-840 orchestrator test no longer SKIPs at the empty-backbones gate. Document the next gate that surfaces.
8. **File follow-up ticket** for real-retrieval NetVLAD weights (Nanne translation or internal source) — out of AZ-965 scope.
## Acceptance Criteria
* **AC-1**: `models/netvlad/netvlad.onnx` (or equivalent path) exists in the repo with documented provenance + license.
* **AC-2**: `c7_inference` can compile this `.onnx` to a TensorRT `.engine` on Jetson AGX Orin (Tier-2) without errors.
* **AC-3**: `configs/operator_replay.yaml` declares the `netvlad` backbone in `c10_provisioning.backbones`.
* **AC-1**: `models/netvlad/netvlad.pt` exists in the repo (via git-lfs) with documented provenance + licence.
* **AC-2**: `torch.load(path, weights_only=True)` + `load_state_dict(strict=True)` on `make_net_vlad_vgg16()` succeeds locally (round-trip verified before commit).
* **AC-3**: `configs/operator_replay.yaml` declares the `net_vlad` backbone in `c10_provisioning.backbones` and the `c2_vpr` block with matching `backbone_weights_path`.
* **AC-4**: `JETSON_SSH_ALIAS=<alias> bash scripts/run-tests-jetson.sh` no longer SKIPs `test_az840_e2e_real_flight_orchestration` with the empty-backbones message.
* **AC-5**: The AZ-840 orchestrator test either PASSes (and the AZ-699 verdict report lands at `_docs/06_metrics/real_flight_validation_<YYYY-MM-DD>.md`) or fails with a NEW error filed as a separate follow-up ticket.
* **AC-6**: License/provenance recorded in `_docs/03_ip_attribution/` per project convention.
* **AC-5**: A NEW gate (whatever the orchestrator's next blocker is — likely ESKF divergence under garbage retrievals, or a missing c4/c5 component block) is documented as a follow-up ticket. AZ-840 PASSing is OUT OF SCOPE for AZ-965.
* **AC-6**: Provenance + licence recorded in `_docs/03_ip_attribution/netvlad.md`.
* **AC-7**: The follow-up ticket "real trained NetVLAD weights (Nanne translation or internal)" is filed in Jira.
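
The AC-3 requirement that the `c10_provisioning` backbone and the `c2_vpr` block point at the same checkpoint can be checked mechanically. The sketch below is illustrative only: `check_backbone_wiring` is a hypothetical helper, the config is a plain dict mirroring the YAML fields shown in this spec (not loaded through the project's config classes), and treating `onnx_path == backbone_weights_path` as the definition of "matching" is an assumption.

```python
def check_backbone_wiring(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means AC-3 looks satisfied."""
    problems = []
    c2 = config.get("c2_vpr", {})
    backbones = config.get("c10_provisioning", {}).get("backbones", [])

    if not backbones:
        # This is the condition that triggers the AZ-839 fixture skip.
        problems.append("c10_provisioning.backbones is empty")

    net_vlad = [b for b in backbones if b.get("model_name") == "net_vlad"]
    if not net_vlad:
        problems.append("no backbone with model_name 'net_vlad'")

    for backbone in net_vlad:
        # The field is called onnx_path but holds the .pt path for NetVLAD.
        if backbone.get("onnx_path") != c2.get("backbone_weights_path"):
            problems.append("onnx_path does not match c2_vpr.backbone_weights_path")

    return problems


CONFIG = {
    "c2_vpr": {
        "strategy": "net_vlad",
        "backbone_weights_path": "/opt/models/netvlad/netvlad.pt",
    },
    "c10_provisioning": {
        "backbones": [
            {"model_name": "net_vlad", "onnx_path": "/opt/models/netvlad/netvlad.pt"},
        ],
    },
}

print(check_backbone_wiring(CONFIG))  # prints []
```

A check like this only covers the static wiring; whether the gate actually clears still has to be confirmed by the Tier-2 run in AC-4.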
## Out of scope
* DINOv2-VPR or other alternative backbones (NetVLAD is AZ-321's pinned baseline).
* MegaLoc / MixVPR / UltraVPR (these require a descriptor-dim resolver change — out of conftest scope).
* The 4 ESKF-divergence regression failures (AZ-963).
* Reference C6 tile cache for the Derkachi fixture (large separate work).
* DINOv2-VPR or other alternative primary backbones (NetVLAD is AZ-321's pinned baseline and the c10 corpus only needs ONE backbone to clear the gate).
* Real-retrieval-quality NetVLAD weights (Nanne translation, internal checkpoint, or training) — separate follow-up ticket.
* MegaLoc / MixVPR / UltraVPR / SelaVPR / EigenPlaces / SALAD provisioning.
* The 4 ESKF-divergence regression failures from the 60s smoke (AZ-963).
* Reference C6 tile cache for the Derkachi fixture.
* Making AZ-840 actually PASS end-to-end.
## Dependencies
* **Blocked by**: AZ-964 (FAISS index bootstrap — the orchestrator test ERRORs there before reaching this gate; clearing AZ-964 first surfaces the empty-backbones gate cleanly).
* **Blocks**: AZ-840 (orchestrator test cannot PASS end-to-end without a real backbone).
* **Related**: AZ-321 (defines NetVLAD as the C2 baseline), AZ-839 (C3 fixture).
## Estimate
3 SP if a usable `.onnx` already exists in the team's drive; 5 SP if export/training is needed. If 5+ SP, consider splitting model-acquisition from yaml-wiring into two sub-tickets.
* **Blocked by**: AZ-964 (FAISS index bootstrap — cleared 2026-05-29).
* **Blocks**: AZ-840 orchestrator PASS (which requires AZ-965 + real retrieval weights + ESKF stability under retrieval input).
* **Related**: AZ-321 (defines NetVLAD as the C2 baseline), AZ-336 / AZ-338 (NetVLAD strategy impl), AZ-839 (C3 fixture).
## References
* Fixture skip-gate: `tests/e2e/replay/conftest.py:594-601`
* Fixture skip-gate: `tests/e2e/replay/conftest.py:594-601` + `:654-666`
* Backbone factory: `src/gps_denied_onboard/runtime_root/c10_factory.py::build_backbone_specs`
* Backbone spec dataclass: `src/gps_denied_onboard/components/c10_provisioning/config.py`
* AZ-321 (NetVLAD baseline choice)
* AZ-962 spec: `_docs/02_tasks/done/AZ-962_operator_config_jetson_wiring.md`
* `BackboneConfig` dataclass: `src/gps_denied_onboard/components/c10_provisioning/config.py:110-156`
* NetVLAD strategy: `src/gps_denied_onboard/components/c2_vpr/net_vlad.py`
* NetVLAD architecture: `src/gps_denied_onboard/components/c2_vpr/_net_vlad_architecture.py`
* PyTorch FP16 runtime (the actual consumer): `src/gps_denied_onboard/components/c7_inference/pytorch_fp16_runtime.py:119-212`
* C2 VPR description: `_docs/02_document/components/02_c2_vpr/description.md` §1 §5
* AZ-321 spec: `_docs/02_tasks/done/AZ-321_c10_engine_compiler.md`
* AZ-964 spec: `_docs/02_tasks/done/AZ-964_faiss_index_bootstrap_for_az839_fixture.md`