[AZ-339] C2 MegaLoc + MixVPR secondary VPR backbones

Adds two research-only VprStrategy implementations for the IT-12
comparative-study matrix. MegaLocStrategy (D=2048, 322x322) and
MixVprStrategy (D=4096, 320x320), both via C7 TensorRT FP16 with
their own concrete BackbonePreprocessor. Single-stage global L2
normalisation; retrieval delegated to FaissBridge; FDR records +
structured logs identical to UltraVPR. BUILD_VPR_MEGALOC and
BUILD_VPR_MIXVPR ON for research/replay-cli only, OFF for airborne
and operator-tooling (fail-fast at composition root via existing
AZ-336 factory). Uses helpers.iso_ts_from_clock from day 1 — no
new timestamp helper duplicates introduced.

36 parametrised AC tests + 25 protocol-conformance + 18 helper
regression tests pass; 1690 / 1690 unit tests pass (excluding 1
pre-existing flaky cold-start subprocess test in c12). Verdict:
PASS_WITH_WARNINGS — one Medium follow-on (AZ-527 to consolidate
4-way _assert_engine_output_dim) + one Low AC wording drift.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-13 23:52:54 +03:00
parent 5dfd9a577e
commit 0d65ff4705
9 changed files with 2283 additions and 1 deletions
@@ -1,207 +0,0 @@
# C2 MegaLoc + MixVPR Secondary Backbones
**Task**: AZ-339_c2_megaloc_mixvpr
**Name**: C2 MegaLoc + MixVPR Secondary Backbones (Research-only)
**Description**: Implement `MegaLocStrategy` and `MixVprStrategy`, two secondary `VprStrategy` backbones used for IT-12 comparative-study purposes (research binary only). Both run on the C7 TensorRT runtime (same path as UltraVPR; FP16 engines compiled by C10) but are gated OFF for airborne and operator-tooling per ADR-002 — they're available only in the research binary and (selectable) replay-cli. Each strategy ships its own concrete `BackbonePreprocessor` (different resize target and normalisation per upstream code drop). Embeddings: MegaLoc D=2048, MixVPR D=4096. Both produce L2-normalised embeddings; both delegate `retrieve_topk` to the C6 TileStore Public API. Neither is on the production critical path; performance NFRs are looser than UltraVPR.
**Complexity**: 5 points
**Dependencies**: AZ-336_c2_vpr_strategy_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-298_c7_tensorrt_runtime, AZ-303_c6_storage_interfaces, AZ-283_descriptor_normaliser, AZ-281_engine_filename_schema, AZ-321_c10_engine_compiler, AZ-266_log_module, AZ-272_fdr_record_schema
**Component**: c2_vpr (epic AZ-255 / E-C2)
**Tracker**: AZ-339
**Epic**: AZ-255 (E-C2)
### Document Dependencies
- `_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md` — Protocol contract; both strategies satisfy every invariant.
- `_docs/02_document/components/02_c2_vpr/description.md` — § 1 secondary backbones for IT-12 comparative study; § 5 backbone library list.
- `_docs/02_document/module-layout.md`: `c2_vpr.mega_loc` and `c2_vpr.mix_vpr` Internal entries; `BUILD_VPR_MEGALOC` and `BUILD_VPR_MIXVPR` rows (both OFF for airborne/operator-tooling, ON for research; replay-cli inherits research selection at config time).
- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md`: `InferenceRuntime` interface (TRT runtime).
- `_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md` — L2 normalisation.
## Problem
Without this task:
- The IT-12 comparative-study cannot enumerate MegaLoc and MixVPR alongside UltraVPR / NetVLAD; researchers cannot quantify whether UltraVPR's PRIMARY designation is justified against the broader VPR-backbone landscape.
- The research binary's link surface is incomplete; the comparative-study CI matrix entry that asserts the research binary contains every secondary backbone fails.
- A future cycle that wants to swap MegaLoc to PRIMARY (e.g., if UltraVPR's upstream code drop becomes unmaintained) would have no migration path — the strategy class would not yet exist.
## Outcome
- `src/gps_denied_onboard/components/c2_vpr/mega_loc.py` defining `MegaLocStrategy` (Protocol-conforming) + `create(config, tile_store, inference_runtime)` factory entry-point.
- Constructor signature: `__init__(self, runtime, tile_store, weights_path, preprocessor, normaliser, fdr_client)`.
- `embed_query`: preprocess → TRT forward → L2 normalise → return `VprQuery`.
- `retrieve_topk`: delegate to `tile_store.faiss_topk`; return `VprResult` with `backbone_label="mega_loc"`, `descriptor_dim=2048`.
- `descriptor_dim() -> int`: returns 2048; engine output shape asserted at load.
- `src/gps_denied_onboard/components/c2_vpr/_preprocessor_mega_loc.py` defining `MegaLocBackbonePreprocessor`:
- `input_shape() -> (322, 322)` per upstream MegaLoc default.
- Normalisation: ImageNet mean/std (same as UltraVPR — common upstream convention; not a coupling, both happen to use ImageNet).
- Centre-crop with calibration-aware logic (same pattern as UltraVPR / NetVLAD; copied not shared per description.md § 6).
- Output dtype FP16, NCHW.
- `src/gps_denied_onboard/components/c2_vpr/mix_vpr.py` defining `MixVprStrategy` (mirrors `MegaLocStrategy` structure):
- `backbone_label="mix_vpr"`, `descriptor_dim=4096`.
- `src/gps_denied_onboard/components/c2_vpr/_preprocessor_mix_vpr.py` defining `MixVprBackbonePreprocessor`:
- `input_shape() -> (320, 320)` per upstream MixVPR default.
- Normalisation: ImageNet mean/std.
- Output dtype FP16, NCHW.
- Composition-root wiring paths for `config.vpr.strategy in {"mega_loc", "mix_vpr"}`.
- `BUILD_VPR_MEGALOC` and `BUILD_VPR_MIXVPR` CMake flags wired per ADR-002.
- Logging per description.md § 9 (INFO ready, WARN top-1-above-threshold, ERROR / FDR per error path).
- Engine output shape assertion at load for both strategies.
- Unit tests covering Protocol conformance, L2-normalisation, deterministic embeddings, top-K invariants, error paths — for BOTH strategies.
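The preprocessor bullets above (centre-crop, resize to the backbone's input shape, ImageNet normalisation, FP16 NCHW output) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the shipped preprocessor: the function names are hypothetical, the calibration-aware part of the crop is omitted, and a NumPy nearest-neighbour resize stands in for the OpenCV resize so that the block has no dependency beyond NumPy.

```python
import numpy as np

# Standard ImageNet per-channel statistics (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def centre_crop_square(image: np.ndarray) -> np.ndarray:
    """Crop the largest centred square from an HWC image (no calibration logic)."""
    h, w = image.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    return image[top:top + side, left:left + side]


def resize_nearest(image: np.ndarray, size: tuple) -> np.ndarray:
    """Nearest-neighbour resize; an assumed stand-in for the OpenCV resize."""
    out_h, out_w = size
    rows = np.arange(out_h) * image.shape[0] // out_h
    cols = np.arange(out_w) * image.shape[1] // out_w
    return image[rows[:, None], cols[None, :]]


def preprocess(image_rgb_u8: np.ndarray, size: tuple) -> np.ndarray:
    """uint8 HWC RGB image -> contiguous float16 NCHW tensor (1, 3, H, W)."""
    square = centre_crop_square(image_rgb_u8)
    resized = resize_nearest(square, size).astype(np.float32) / 255.0
    normalised = (resized - IMAGENET_MEAN) / IMAGENET_STD
    chw = np.transpose(normalised, (2, 0, 1))
    return np.ascontiguousarray(chw[np.newaxis], dtype=np.float16)


# Synthetic 640x480 frame; real frames come from NavCameraFrame.image_bytes.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
mega_loc_input = preprocess(frame, (322, 322))
mix_vpr_input = preprocess(frame, (320, 320))
```

Passing the target size as an argument is only for brevity here; in this task each strategy hard-codes its own shape in its own preprocessor.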
## Scope
### Included
- Both `MegaLocStrategy` and `MixVprStrategy` classes implementing the Protocol.
- Both concrete `BackbonePreprocessor` implementations (one per strategy; preprocessing parameters per upstream code drop).
- Module-level `create` factory functions for both.
- Composition-root wiring for both strategy choices.
- Engine output shape assertion at load for both.
- Logging + FDR records identical pattern to UltraVPR (per-backbone `backbone_label`).
- Unit tests for both strategies covering invariants + error paths.
- `BUILD_VPR_MEGALOC` and `BUILD_VPR_MIXVPR` CMake flag wiring.
### Excluded
- The `VprStrategy` Protocol — owned by AZ-336.
- Shared `DescriptorNormaliser` — already AZ-283.
- C7 TensorRT runtime — owned by AZ-298.
- Engine compilation — owned by AZ-321.
- Other backbones — AZ-337 (UltraVPR), AZ-338 (NetVLAD), AZ-340 (SelaVPR + EigenPlaces + SALAD).
- FAISS retrieve wiring — owned by AZ-341.
- Recall@10 acceptance tests for these secondary backbones — deferred to Step 9 / E-BBT (and the floors are looser per the engine rule — these are research-only, not engine-rule-binding).
## Acceptance Criteria
**AC-1 (per strategy): Protocol conformance**
Given a constructed `MegaLocStrategy` AND a constructed `MixVprStrategy`
When `isinstance(strategy, VprStrategy)` is evaluated
Then both return `True`
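For AC-1, note that `isinstance` against a `typing.Protocol` only works when the Protocol is decorated with `runtime_checkable`, and that the check covers method presence, not signatures. The sketch below uses a stand-in Protocol; the real `VprStrategy` is owned by AZ-336, and the method signatures and stub classes here are assumptions for illustration.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class VprStrategy(Protocol):
    """Stand-in for the AZ-336 Protocol; signatures are assumed."""

    def embed_query(self, frame: Any, calibration: Any) -> Any: ...
    def retrieve_topk(self, query: Any, k: int) -> Any: ...
    def descriptor_dim(self) -> int: ...


class StubMegaLocStrategy:
    """Hypothetical stub exposing the three methods named in this task."""

    def embed_query(self, frame: Any, calibration: Any) -> Any:
        return None

    def retrieve_topk(self, query: Any, k: int) -> Any:
        return []

    def descriptor_dim(self) -> int:
        return 2048


class NotAStrategy:
    """Lacks embed_query and retrieve_topk, so the isinstance check fails."""

    def descriptor_dim(self) -> int:
        return 2048


conforms = isinstance(StubMegaLocStrategy(), VprStrategy)
rejects = not isinstance(NotAStrategy(), VprStrategy)
```

Because the runtime check is structural and shallow, the remaining ACs are still needed to pin down return shapes and error behaviour.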
**AC-2 (per strategy): `embed_query` produces L2-normalised FP16 embedding of correct dim**
Given a valid `NavCameraFrame` and `CameraCalibration`
When `embed_query` is called on each strategy
Then MegaLoc returns `embedding.shape == (2048,)`, MixVPR returns `embedding.shape == (4096,)`; both are `dtype == np.float16`; both have `||embedding||_2 == 1.0 ± 1e-3`
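The AC-2 check can be illustrated with NumPy. This sketch assumes normalisation happens in float32 before the cast to float16, and it uses random vectors in place of real engine output; the actual normaliser is the shared `DescriptorNormaliser` from AZ-283, whose API is not reproduced here, so both function names are hypothetical.

```python
import numpy as np


def l2_normalise_fp16(raw: np.ndarray) -> np.ndarray:
    """Normalise in float32, then cast to float16 (assumed ordering)."""
    v = raw.astype(np.float32)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return (v / norm).astype(np.float16)


def check_embedding(embedding: np.ndarray, dim: int, tol: float = 1e-3) -> bool:
    """AC-2 style check: shape, dtype and unit norm within tolerance."""
    # Measure the norm in float32 so the check itself adds no FP16 error.
    norm = float(np.linalg.norm(embedding.astype(np.float32)))
    return (
        embedding.shape == (dim,)
        and embedding.dtype == np.float16
        and abs(norm - 1.0) <= tol
    )


# Random stand-ins for raw engine output at the two descriptor sizes.
rng = np.random.default_rng(0)
mega_loc_embedding = l2_normalise_fp16(rng.standard_normal(2048))
mix_vpr_embedding = l2_normalise_fp16(rng.standard_normal(4096))
```

The `1e-3` tolerance leaves room for FP16 rounding: each element carries a relative error of at most about `4.9e-4` after the cast, which bounds the error on the norm to roughly the same size.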
**AC-3 (per strategy): Deterministic embeddings**
Given the same frame
When `embed_query` is called 3 times
Then all three embeddings match for each strategy (bit-exact up to FP16 ULP tolerance)
**AC-4 (per strategy): `retrieve_topk` returns exactly k candidates with correct backbone_label**
Given a corpus of 100 tiles per strategy's descriptor_dim + a constructed `VprQuery`
When `retrieve_topk(query, k=10)` is called on each strategy
Then `len(candidates) == 10`, sorted ascending; `backbone_label == "mega_loc"` for MegaLoc; `backbone_label == "mix_vpr"` for MixVPR; `descriptor_dim` matches
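A test for the AC-4 invariants might look like the sketch below. It assumes "sorted ascending" refers to a distance where smaller means closer; the `Candidate` fields, the helper names, and the synthetic distances are all hypothetical, and a plain dictionary stands in for the tile store.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    tile_id: int     # hypothetical field
    distance: float  # hypothetical field; smaller means closer


def topk_ascending(distances: dict, k: int) -> list:
    """Return the k closest tiles, smallest distance first."""
    best = heapq.nsmallest(k, distances.items(), key=lambda item: item[1])
    return [Candidate(tile_id=t, distance=d) for t, d in best]


def satisfies_ac4(candidates: list, k: int) -> bool:
    """Exactly k results, in non-decreasing distance order."""
    ordered = all(
        a.distance <= b.distance for a, b in zip(candidates, candidates[1:])
    )
    return len(candidates) == k and ordered


# Fixture: 100 tiles with made-up, pairwise distinct distances.
corpus = {tile_id: ((tile_id * 37) % 100) / 100.0 for tile_id in range(100)}
candidates = topk_ascending(corpus, k=10)
```

The `backbone_label` and `descriptor_dim` assertions from AC-4 would sit alongside `satisfies_ac4` in the real test; they are left out here because the `VprResult` layout is not shown in this task.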
**AC-5 (per strategy): `descriptor_dim()` is stable**
Given a constructed strategy
When `descriptor_dim()` is called 100 times
Then MegaLoc returns 2048 every call; MixVPR returns 4096 every call
**AC-6 (per strategy): Engine output shape mismatch → `ConfigurationError`**
Given a TRT engine whose output tensor shape does not match the strategy's expected `descriptor_dim`
When `create(...)` is called
Then `ConfigurationError` is raised; the strategy is NOT instantiated
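The load-time assertion in AC-6 can be sketched as a small guard that runs before any strategy object exists. The commit message refers to a `_assert_engine_output_dim` helper, but its real signature is not shown, so the signature below, the `create_checked` wrapper, and the assumption that the descriptor sits on the last axis of the output shape are all illustrative.

```python
class ConfigurationError(Exception):
    """Stand-in for the project's ConfigurationError."""


def assert_engine_output_dim(output_shape, expected_dim, backbone_label):
    """Fail at load time when the engine's last axis is not expected_dim."""
    if not output_shape or output_shape[-1] != expected_dim:
        raise ConfigurationError(
            f"{backbone_label}: engine output shape {tuple(output_shape)} "
            f"does not end in descriptor_dim={expected_dim}"
        )


def create_checked(output_shape, expected_dim, backbone_label):
    """Hypothetical create(): validate first, construct only on success."""
    assert_engine_output_dim(output_shape, expected_dim, backbone_label)
    return {"backbone_label": backbone_label, "descriptor_dim": expected_dim}


# Matching engine: construction proceeds.
ok = create_checked((1, 2048), 2048, "mega_loc")

# A 2048-d engine handed to the 4096-d strategy: nothing is constructed.
try:
    create_checked((1, 2048), 4096, "mix_vpr")
    mismatch_raised = False
except ConfigurationError:
    mismatch_raised = True
```

Running the check inside `create` rather than on the first frame is what keeps the failure at startup.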
**AC-7 (per strategy): `VprBackboneError` on forward-pass failure**
Given an `InferenceRuntime` test double that raises
When `embed_query` is called
Then `VprBackboneError` is raised; ERROR log + FDR record emitted
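One way to satisfy AC-7 is to wrap the forward call so that any runtime failure becomes a `VprBackboneError` and leaves one FDR record behind. In this sketch the test doubles, the `forward(engine_id, tensor)` signature, the `record(kind, payload)` method, and the record kind string are all assumptions; the ERROR log call is omitted for brevity.

```python
class VprBackboneError(Exception):
    """Stand-in for the project's VprBackboneError."""


class FailingRuntime:
    """Test double whose forward pass always raises."""

    def forward(self, engine_id, tensor):
        raise RuntimeError("simulated TRT failure")


class RecordingFdrClient:
    """Hypothetical FDR double that keeps records in a list."""

    def __init__(self):
        self.records = []

    def record(self, kind, payload):
        self.records.append((kind, payload))


def forward_or_raise(runtime, engine_id, tensor, fdr_client, backbone_label):
    """Translate a runtime failure into VprBackboneError plus one FDR record."""
    try:
        return runtime.forward(engine_id, tensor)
    except Exception as exc:
        fdr_client.record(
            "c2.vpr.backbone_error",  # assumed record kind
            {"backbone_label": backbone_label, "error": str(exc)},
        )
        raise VprBackboneError(f"{backbone_label}: forward pass failed") from exc


fdr = RecordingFdrClient()
cause = None
try:
    forward_or_raise(FailingRuntime(), "engine-0", None, fdr, "mega_loc")
    raised = False
except VprBackboneError as exc:
    raised = True
    cause = exc.__cause__  # original failure preserved by "raise ... from"
```

Chaining with `from exc` keeps the original failure available for the log and the FDR record without exposing runtime-specific exception types to callers.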
**AC-8 (per strategy): `VprPreprocessError` on corrupt image bytes**
Given a frame with malformed `image_bytes`
When `embed_query` is called
Then `VprPreprocessError` is raised; ERROR log + FDR record emitted
**AC-9 (per strategy): Composition-root wiring**
Given `config.vpr.strategy = "mega_loc"` (resp. `"mix_vpr"`) AND valid weights AND matching `descriptor_dim`
When `compose_root(config)` runs
Then the corresponding strategy is wired; AZ-336 factory's pre-flight `descriptor_dim` validation passes; INFO log `kind="c2.vpr.ready"` with `{strategy: "mega_loc", descriptor_dim: 2048}` (resp. `mix_vpr` / 4096) is emitted
**AC-10 (per strategy): Build-flag exclusion in airborne binary**
Given `config.vpr.strategy = "mega_loc"` (resp. `"mix_vpr"`) AND `BUILD_VPR_MEGALOC=OFF` (resp. `BUILD_VPR_MIXVPR=OFF`) — the airborne case
When the binary tries to load
Then `ConfigurationError` is raised at composition-root time with message containing the missing flag; the binary refuses to start (fail-fast per AZ-336 factory's lazy-import → ImportError → `ConfigurationError` mapping)
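The lazy-import to `ImportError` to `ConfigurationError` mapping named in AC-10 can be sketched as below. The real factory belongs to AZ-336 and is not shown in this task; the table layout, the function name, the message wording, and the module paths (inferred from the file paths in the Outcome section) are assumptions.

```python
import importlib


class ConfigurationError(Exception):
    """Stand-in for the project's ConfigurationError."""


# Assumed mapping from strategy name to (module path, build flag).
STRATEGY_MODULES = {
    "mega_loc": ("gps_denied_onboard.components.c2_vpr.mega_loc", "BUILD_VPR_MEGALOC"),
    "mix_vpr": ("gps_denied_onboard.components.c2_vpr.mix_vpr", "BUILD_VPR_MIXVPR"),
}


def load_strategy_module(strategy: str):
    """Import the strategy module lazily; map ImportError to ConfigurationError."""
    if strategy not in STRATEGY_MODULES:
        raise ConfigurationError(f"unknown vpr.strategy: {strategy!r}")
    module_path, build_flag = STRATEGY_MODULES[strategy]
    try:
        return importlib.import_module(module_path)
    except ImportError as exc:
        raise ConfigurationError(
            f"vpr.strategy={strategy!r} is not linked into this binary; "
            f"rebuild with {build_flag}=ON"
        ) from exc


# Where the module is absent (as in an airborne build), startup fails fast.
try:
    load_strategy_module("mega_loc")
    message = ""
except ConfigurationError as exc:
    message = str(exc)
```

Naming the missing flag in the message is what lets an operator tell a mis-selected research config apart from a broken install.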
**AC-11 (per strategy): Preprocessing input shape**
Given the strategy's preprocessor instance
When `input_shape()` is called
Then MegaLoc returns `(322, 322)`; MixVPR returns `(320, 320)`
## Non-Functional Requirements
**Performance** (looser than UltraVPR — research-only, not on production critical path):
- MegaLoc `embed_query` p95 ≤ 80 ms on Tier-1 Jetson Orin (FP16 TRT).
- MixVPR `embed_query` p95 ≤ 100 ms on Tier-1 Jetson Orin (FP16 TRT) — slightly higher because MixVPR's mix-net is ~30% larger than UltraVPR's backbone.
- `retrieve_topk` p95: MegaLoc ≤ 3 ms, MixVPR ≤ 4 ms (4096-d FAISS HNSW slower than 512-d).
- GPU memory per strategy: MegaLoc ≤ 700 MB; MixVPR ≤ 800 MB resident.
- These NFRs are research-side guidance; not engine-rule blockers.
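As a rough illustration of how the p95 budgets above could be checked from recorded latencies, the sketch below uses the nearest-rank percentile definition. The choice of percentile method, the helper names, and the synthetic samples are assumptions; the two budgets are the `embed_query` figures from the list above.

```python
def p95_nearest_rank(samples_ms: list) -> float:
    """95th percentile by the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # 1-based rank = ceil(0.95 * n), computed in integers to avoid float error.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# embed_query p95 budgets in milliseconds, taken from the NFR list.
BUDGET_MS = {"mega_loc": 80.0, "mix_vpr": 100.0}


def within_budget(strategy: str, samples_ms: list) -> bool:
    """True when the measured p95 does not exceed the strategy's budget."""
    return p95_nearest_rank(samples_ms) <= BUDGET_MS[strategy]


# Made-up latencies: one sample at each of 1..100 ms.
synthetic = [float(ms) for ms in range(1, 101)]
synthetic_p95 = p95_nearest_rank(synthetic)
```

With these synthetic samples the p95 is 95 ms, which would exceed the MegaLoc budget and fit the MixVPR one.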
**Compatibility**
- Both consume TRT engines produced by AZ-321 with the AZ-281 self-describing filename schema.
- Upstream code drops pinned per Plan-phase; weight-format changes between drops require engine rebuild.
**Reliability**
- Both strategies single-threaded by contract.
- Both use unconditional L2-normalisation (INV-3).
- Errors do not crash the process; downstream falls back to VIO-only.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 (MegaLoc) | `isinstance(MegaLocStrategy(...), VprStrategy)` | `True` |
| AC-1 (MixVPR) | `isinstance(MixVprStrategy(...), VprStrategy)` | `True` |
| AC-2 (MegaLoc) | `embed_query` output | shape (2048,), dtype float16, L2-norm ≈ 1.0 |
| AC-2 (MixVPR) | `embed_query` output | shape (4096,), dtype float16, L2-norm ≈ 1.0 |
| AC-3 (each) | `embed_query` × 3 same frame | bit-exact embeddings (ULP-tolerant) |
| AC-4 (each) | `retrieve_topk` against fixture corpus | `len == 10`, sorted, correct `backbone_label`, correct `descriptor_dim` |
| AC-5 (each) | `descriptor_dim()` × 100 | always returns the correct dim |
| AC-6 (each) | TRT engine with wrong output shape | `ConfigurationError` at create time |
| AC-7 (each) | `forward` raises | `VprBackboneError`; ERROR log + FDR |
| AC-8 (each) | malformed `image_bytes` | `VprPreprocessError`; ERROR log + FDR |
| AC-9 (each) | `compose_root(config=<strategy>)` | wired; INFO log with correct backbone label and dim |
| AC-10 (each) | airborne binary + strategy chosen | `ConfigurationError` with missing-flag message; fail-fast |
| AC-11 (MegaLoc) | `MegaLocBackbonePreprocessor.input_shape()` | returns `(322, 322)` |
| AC-11 (MixVPR) | `MixVprBackbonePreprocessor.input_shape()` | returns `(320, 320)` |
| Preprocess-shape (each) | `preprocess(frame)` output | NCHW shape `(1, 3, H, W)`, dtype float16 |
## Constraints
- **Each strategy ships its own concrete preprocessor** — preprocessing parameters per upstream code drop (description.md § 6 "C2-internal helper, NOT a shared helper").
- **Preprocessing parameters are weights-coupled** — `(322, 322)` for MegaLoc, `(320, 320)` for MixVPR; ImageNet mean/std for both. Hard-coded; not config-knobs.
- **Centre-crop logic is duplicated, NOT shared** — copying preprocessing between strategies is intentional per the contract; sharing would couple weights-versions across strategies and let one strategy's upgrade silently break another's preprocessing.
- **Both use TensorRT runtime** (consistent with UltraVPR's path); the difference between secondary and primary is not the runtime but the build-flag ON/OFF in airborne.
- **No engine compilation in this task** — the `.trt` engine files come from AZ-321; this task consumes them via `config.vpr.backbone_weights_path`.
- **Both strategies hold engine IDs returned by `inference_runtime.load_engine`, NOT engines themselves**.
- **No GPU operations in `__init__` beyond engine load** — same constraint as UltraVPR.
## Risks & Mitigation
**Risk 1: MegaLoc and MixVPR upstream code drops use different ONNX op sets that TRT 10.3 partially supports**
- *Risk*: Engine compilation succeeds but with fallback layers that don't run on GPU; `embed_query` p95 inflates.
- *Mitigation*: AZ-321 (engine compile) is responsible for detecting fallback layers and reporting them. This task consumes the produced engine; if NFR-perf budgets are violated, AZ-321 escalates the upstream support gap.
**Risk 2: Higher embedding dim (4096 for MixVPR) inflates corpus storage requirements**
- *Risk*: A research binary that switches between UltraVPR (D=512) and MixVPR (D=4096) needs to rebuild the FAISS corpus every swap; researchers may forget.
- *Mitigation*: AZ-336 factory's pre-flight `descriptor_dim` validation catches the mismatch at startup with a clear error message. Researchers must rebuild the corpus (C10) before swapping; the helpful error tells them so.
**Risk 3: MegaLoc / MixVPR are research-only — operators may select them by mistake**
- *Risk*: A typo or copy-pasted research config selects MegaLoc / MixVPR on an airborne binary; cold start fails.
- *Mitigation*: AC-10 ensures fail-fast at composition-root with a clear message. Operators learn at startup, not after takeoff.
**Risk 4: Test fixtures for MegaLoc / MixVPR engines don't exist in CI**
- *Risk*: Without TRT engines for these strategies, the unit tests cannot exercise the full `embed_query` path; they're stubbed via `FakeInferenceRuntime`.
- *Mitigation*: This is fine — Step 9 / E-BBT validates the real engine path against C2-IT-01 and the C2-PT-01 NFR. The unit tests validate Protocol conformance + invariants; they don't need real engines.
**Risk 5: Preprocessing duplication across strategies invites subtle bugs**
- *Risk*: A bug fix to UltraVPR's centre-crop logic doesn't propagate to MegaLoc / MixVPR.
- *Mitigation*: This is the documented trade-off (description.md § 6). The duplication is intentional. If a bug fix is needed across strategies, each strategy's preprocessor is updated explicitly with a coordinated commit; cross-checking is part of code review.
## Runtime Completeness
- **Named capability**: secondary `VprStrategy` implementations for IT-12 comparative-study (architecture / E-C2 / `solution.md` "MegaLoc, MixVPR secondary backbones").
- **Production code that must exist**: real `MegaLocStrategy` and `MixVprStrategy` classes calling real C7 TRT `InferenceRuntime.forward` with real loaded `.trt` engines; real concrete preprocessors with real OpenCV resize + ImageNet normalisation + FP16 cast; real L2-normalisation; real composition-root wiring paths.
- **Allowed external stubs**: tests MAY use `FakeInferenceRuntime` returning pre-computed embeddings; `FakeTileStore`; `FakeFdrClient`; production wiring uses real C7 + real engines + real C6.
- **Unacceptable substitutes**: NumPy-only forward passes (would not satisfy NFR budgets); skipping L2-normalisation (would break INV-3); shared preprocessors across strategies (would defeat description.md § 6 isolation); selecting these strategies in airborne binaries (must fail-fast per AC-10); engine load at first frame (would defer the engine-output-shape assertion past startup); per-strategy thread safety (the contract is single-thread).