# C2 NetVLAD Mandatory Simple-Baseline

**Task**: AZ-338_c2_net_vlad
**Name**: C2 NetVLAD Mandatory Simple-Baseline
**Description**: Implement `NetVladStrategy`, the C2 mandatory simple-baseline `VprStrategy`. The engine rule says every component MUST ship a comparative baseline alongside its production default, and description.md § 1 designates NetVLAD as the C2 baseline. NetVLAD has a much higher embedding dimension than UltraVPR: D=4096 with the NetVLAD-VGG16 default, reducible to D=512 via PCA-whitening per the upstream NetVLAD code drop. It runs on PyTorch FP16 (NOT TensorRT) per the simple-baseline policy ("the baseline runs on the simplest available runtime"), so that a TRT engine compile bug cannot break the baseline AND the primary at the same time. The task includes the concrete `NetVladBackbonePreprocessor`, whose resize target and normalisation differ from UltraVPR's. The strategy MUST satisfy AC-2.1b's relaxed engine-rule floor, `recall@10 ≥ 0.85` on the Derkachi normal segment.
**Complexity**: 3 points
**Dependencies**: AZ-336_c2_vpr_strategy_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-300_c7_pytorch_baseline, AZ-303_c6_storage_interfaces, AZ-283_descriptor_normaliser, AZ-266_log_module, AZ-272_fdr_record_schema
**Component**: c2_vpr (epic AZ-255 / E-C2)
**Tracker**: AZ-338
**Epic**: AZ-255 (E-C2)

### Document Dependencies

- `_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md` — Protocol contract; every invariant MUST be satisfied; INV-3 (L2-normalised) is critical because NetVLAD raw embeddings include intra-cluster residuals that must be globally L2-normalised after the VLAD aggregation.
- `_docs/02_document/components/02_c2_vpr/description.md` — § 1 NetVLAD designated as mandatory simple-baseline; § 5 PyTorch matches simple-baseline track; § 9 logging.
- `_docs/02_document/module-layout.md` — `c2_vpr.net_vlad` Internal entry; `BUILD_VPR_NETVLAD` row; `BUILD_PYTORCH_RUNTIME` row (NetVLAD requires PyTorch runtime ON which is OFF for airborne — NetVLAD is research/replay-only by build-flag combination).
- `_docs/02_document/components/02_c2_vpr/tests.md` — C2-IT-01 engine rule check `recall@10 ≥ 0.85` for NetVLAD on Derkachi normal segment.
- `_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md` — `InferenceRuntime` interface; AZ-300 `pytorch_fp16_runtime` is the consumed concrete runtime.
- `_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md` — L2 + intra-normalisation (NetVLAD's published preprocessing chain includes intra-cluster normalisation BEFORE the global L2 normalisation; the `DescriptorNormaliser` helper must support both).

## Problem

Without this task:

- The C2 component has no comparative baseline; the engine rule (every primary backbone has a baseline alongside it for FT-12 comparative-study and for risk reduction if the primary fails) is violated for C2 specifically — the project-wide policy goes unsatisfied for one of its largest backbone surfaces.
- AC-2.1b's relaxed-floor check (`recall@10 ≥ 0.85` for NetVLAD) has no producer; suite-level FT-P-19 cannot validate the engine rule.
- The research binary (which links every backbone for FT-12 comparative studies) cannot ship without a NetVLAD strategy; researchers cannot run the comparative study that informs whether the primary's engine choice is justified.
- A code drop / weights / engine compile bug in UltraVPR has no fallback at the strategy layer; the operator who notices a sudden drop in suite-level satellite re-loc accuracy would have no mechanism to A/B against the baseline.

## Outcome

- `src/gps_denied_onboard/components/c2_vpr/net_vlad.py` defining:
  - `NetVladStrategy` class implementing the `VprStrategy` Protocol.
  - Constructor signature: `__init__(self, runtime: InferenceRuntime, tile_store: TileStore, weights_path: Path, preprocessor: NetVladBackbonePreprocessor, normaliser: DescriptorNormaliser, fdr_client: FdrClient, descriptor_dim: int = 4096)`.
  - `embed_query(frame, calibration)`:
    1. `tensor = self._preprocessor.preprocess(frame, calibration)` (returns FP16 NCHW (1, 3, H, W); H=W=480 per the upstream NetVLAD-VGG16 default).
    2. `intermediate = self._runtime.forward(self._engine_id, {"input": tensor})["vlad_descriptor"]` (returns FP16 (1, descriptor_dim) post-VLAD aggregation).
    3. `intra_normalised = self._normaliser.intra_cluster_normalise(intermediate[0], num_clusters=64)` (per NetVLAD's published preprocessing: intra-cluster L2 first).
    4. `embedding = self._normaliser.l2_normalise(intra_normalised)` (then global L2).
    5. Return `VprQuery(frame_id, embedding, produced_at=monotonic_ns())`.
    6. Catch RuntimeError → wrap in `VprBackboneError`; emit ERROR log + FDR record.
  - `retrieve_topk(query, k)`: identical to UltraVPR — delegates to `tile_store.faiss_topk`; returns `VprResult` with `backbone_label="net_vlad"`.
  - `descriptor_dim() -> int`: returns the constructor-passed value (default 4096). The value is asserted at engine-load time against the engine's output tensor shape; a mismatch raises `ConfigurationError` (see `create` step 6 and AC-7).
  - Module-level `create(config, tile_store, inference_runtime) -> VprStrategy`:
    1. Resolve `weights_path = config.vpr.backbone_weights_path` (a PyTorch state_dict file with the `.pth` extension; NetVLAD does NOT use the AZ-281 self-describing TRT filename schema — its own AZ-280 sidecar carries the PCA matrix + cluster centres).
    2. Resolve `descriptor_dim = config.vpr.netvlad_descriptor_dim` (default 4096; can be 512 if PCA-whitened weights are loaded).
    3. Construct `NetVladBackbonePreprocessor(input_shape=(480, 480), mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))`.
    4. Construct `DescriptorNormaliser` with `intra_cluster_normalise` capability.
    5. Load model via `inference_runtime.load_engine(weights_path)` (the PyTorch runtime accepts `.pth` files; AZ-300).
    6. Assert engine output shape == `(1, descriptor_dim)`; mismatch → `ConfigurationError`.
    7. Construct and return `NetVladStrategy(...)`.
- `src/gps_denied_onboard/components/c2_vpr/_preprocessor_net_vlad.py`:
  - Implements `BackbonePreprocessor` Protocol.
  - `preprocess(frame, calibration)`:
    1. Decode `frame.image_bytes` to RGB uint8 (H_in, W_in, 3).
    2. Centre-crop to a square region (same calibration-aware logic as UltraVPR — copied here, NOT shared, because the calibration handling is part of the preprocessor's contract).
    3. Resize to `(480, 480)` via OpenCV.
    4. Normalise: `(pixel/255.0 - mean) / std`; cast to FP16.
    5. Transpose HWC → CHW; add batch dim.
    6. Return ndarray of shape `(1, 3, 480, 480)` dtype float16.
  - `input_shape() -> tuple[int, ...]`: returns `(480, 480)`.
  - On failure: raise `VprPreprocessError`.
- Composition-root wiring path for `config.vpr.strategy == "net_vlad"`.
- Logging per description.md § 9: INFO `kind="c2.vpr.ready"` with `{strategy: "net_vlad", descriptor_dim: 4096}` (or the configured `descriptor_dim` when it is not the default); ERROR / WARN identical to UltraVPR.
- FDR records emitted: `kind="vpr.embed_query"`, `kind="vpr.backbone_error"`, `kind="vpr.preprocess_error"`.
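
The `embed_query` steps listed above can be sketched as the control flow below. This is a minimal illustration under stated assumptions, not the project's implementation: `NetVladStrategySketch`, the `report_error` hook (standing in for the ERROR log plus FDR record), the `frame.frame_id` attribute, and every test double are hypothetical, and the fake runtime returns FP32 rather than FP16 to keep the arithmetic simple.

```python
import time
from dataclasses import dataclass
from types import SimpleNamespace

import numpy as np


class VprBackboneError(Exception):
    """Stand-in for the project's backbone error type."""


@dataclass(frozen=True)
class VprQuery:
    """Stand-in; the real type is defined by the AZ-336 contract."""

    frame_id: int
    embedding: np.ndarray
    produced_at: int


class NetVladStrategySketch:
    def __init__(self, runtime, preprocessor, normaliser, engine_id, report_error):
        self._runtime = runtime
        self._preprocessor = preprocessor
        self._normaliser = normaliser
        self._engine_id = engine_id
        # Hypothetical hook standing in for the ERROR log + FDR record.
        self._report_error = report_error

    def embed_query(self, frame, calibration) -> VprQuery:
        tensor = self._preprocessor.preprocess(frame, calibration)  # step 1
        try:
            outputs = self._runtime.forward(self._engine_id, {"input": tensor})  # step 2
        except RuntimeError as exc:  # step 6
            self._report_error("vpr.backbone_error", str(exc))
            raise VprBackboneError(str(exc)) from exc
        vlad = outputs["vlad_descriptor"][0]
        intra = self._normaliser.intra_cluster_normalise(vlad, num_clusters=64)  # step 3
        embedding = self._normaliser.l2_normalise(intra)  # step 4
        return VprQuery(frame.frame_id, embedding, time.monotonic_ns())  # step 5


# Hypothetical test doubles; the identity intra-cluster stage keeps the focus on control flow.
errors = []


def report_error(kind, message):
    errors.append(kind)


def failing_forward(engine_id, inputs):
    raise RuntimeError("forward failed")


normaliser = SimpleNamespace(
    intra_cluster_normalise=lambda vec, num_clusters: vec,
    l2_normalise=lambda vec: vec / np.linalg.norm(vec),
)
preprocessor = SimpleNamespace(
    preprocess=lambda frame, calibration: np.zeros((1, 3, 480, 480), np.float16)
)
ok_runtime = SimpleNamespace(
    forward=lambda engine_id, inputs: {"vlad_descriptor": np.ones((1, 4096), np.float32)}
)
bad_runtime = SimpleNamespace(forward=failing_forward)
frame = SimpleNamespace(frame_id=7)

strategy = NetVladStrategySketch(ok_runtime, preprocessor, normaliser, "netvlad", report_error)
query = strategy.embed_query(frame, calibration=None)
```

Swapping `ok_runtime` for `bad_runtime` exercises the error path: the `RuntimeError` surfaces as `VprBackboneError` and the hook records `"vpr.backbone_error"`.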

## Scope

### Included

- `NetVladStrategy` implementing the Protocol; `NetVladBackbonePreprocessor` implementing `BackbonePreprocessor`.
- Module-level `create(config, tile_store, inference_runtime)` factory entry-point.
- Intra-cluster L2 normalisation BEFORE global L2 normalisation (NetVLAD's published preprocessing chain).
- Composition-root wiring for `config.vpr.strategy == "net_vlad"`.
- Engine output shape assertion at load time.
- Logging + FDR records identical to UltraVPR (the per-backbone label distinguishes the records).
- Unit tests covering all 7 invariants, the dual-stage normalisation, the preprocessing contract, the load-time shape assertion.
- `BUILD_VPR_NETVLAD` CMake flag wiring per ADR-002 (ON for research; OFF for airborne / operator-tooling because PyTorch runtime is excluded; ON-but-effectively-unused for replay-cli unless explicitly selected).

### Excluded

- The `VprStrategy` Protocol — owned by AZ-336.
- `DescriptorNormaliser.l2_normalise`: already delivered by AZ-283. **Note**: AZ-283 ships only `l2_normalise`, while this task also needs `intra_cluster_normalise(vec, num_clusters)`. **Decision**: adding `intra_cluster_normalise` to the AZ-283 helper is in scope here as a small contract addition (a single function). If AZ-283 is already merged when this task starts, the change is a backward-compatible function addition with no breaking change.
- The C7 PyTorch runtime — owned by AZ-300; this task consumes the interface.
- Other backbones — owned by AZ-337 (UltraVPR), AZ-339 (MegaLoc + MixVPR), AZ-340 (SelaVPR + EigenPlaces + SALAD).
- FAISS retrieve wiring — owned by AZ-341.
- C2-IT-01's NetVLAD recall@10 ≥ 0.85 acceptance test — deferred to Step 9 / E-BBT.
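
A minimal sketch of what the `intra_cluster_normalise(vec, num_clusters)` addition might look like next to `l2_normalise`. It assumes the flat VLAD descriptor is laid out as `num_clusters` contiguous, equal-length blocks and that the arithmetic is done in FP32 before casting back to the input dtype. Both are assumptions, as is the `EPS` guard; the actual AZ-283 contract may differ.

```python
import numpy as np

EPS = 1e-12  # assumed guard against division by zero for all-zero blocks


def intra_cluster_normalise(vec: np.ndarray, num_clusters: int) -> np.ndarray:
    """L2-normalise each per-cluster block of a flat VLAD descriptor."""
    if vec.ndim != 1 or vec.size % num_clusters != 0:
        raise ValueError(
            f"cannot split shape {vec.shape} into {num_clusters} equal blocks"
        )
    # Assumed layout: num_clusters contiguous blocks of equal length.
    blocks = vec.astype(np.float32).reshape(num_clusters, -1)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    blocks = blocks / np.maximum(norms, EPS)
    return blocks.reshape(-1).astype(vec.dtype)


def l2_normalise(vec: np.ndarray) -> np.ndarray:
    """Scale the whole vector to unit L2 norm."""
    v32 = vec.astype(np.float32)
    return (v32 / max(float(np.linalg.norm(v32)), EPS)).astype(vec.dtype)


rng = np.random.default_rng(0)
raw = rng.standard_normal(4096).astype(np.float16)
embedding = l2_normalise(intra_cluster_normalise(raw, num_clusters=64))
```

Because the new function does not touch `l2_normalise`, existing consumers of the helper are unaffected, which is what makes the addition backward-compatible.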

## Acceptance Criteria

**AC-1: Protocol conformance**
Given a constructed `NetVladStrategy` instance
When `isinstance(strategy, VprStrategy)` is evaluated
Then the result is `True`

**AC-2: `embed_query` produces L2-normalised FP16 (descriptor_dim,) embedding**
Given a valid `NavCameraFrame` and `CameraCalibration`
When `strategy.embed_query(frame, calibration)` is called
Then `embedding.shape == (4096,)` (or the configured `descriptor_dim`), `embedding.dtype == np.float16`, `||embedding||_2 == 1.0 ± 1e-3`

**AC-3: Dual-stage normalisation — intra-cluster THEN global L2**
Given a fake intermediate VLAD descriptor with non-zero per-cluster sub-vectors
When the embedding pipeline runs
Then `intra_cluster_normalise` is called BEFORE `l2_normalise` (verifiable via spy on the normaliser); the order is NEVER reversed; in the output, the full vector is unit-norm globally and every per-cluster sub-vector has the same norm, 1/√64 = 0.125 (each sub-vector is unit-norm after the intra-cluster stage, and the global L2 stage then divides all of them by √64)
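
One way to verify the call order with a spy is `unittest.mock`, whose `mock_calls` list records child-method calls in the order they happen. The sketch below is illustrative only: `run_normalisation` is a hypothetical stand-in for the normalisation steps of `embed_query`, and the string placeholders replace real arrays so the comparison stays simple.

```python
from unittest.mock import MagicMock, call


def run_normalisation(normaliser, vlad, num_clusters=64):
    """Hypothetical stand-in for the normalisation steps of embed_query."""
    intra = normaliser.intra_cluster_normalise(vlad, num_clusters=num_clusters)
    return normaliser.l2_normalise(intra)


spy = MagicMock()
spy.intra_cluster_normalise.return_value = "intra"
spy.l2_normalise.return_value = "embedding"

result = run_normalisation(spy, "vlad")

# Exactly one call to each method, intra-cluster first, global L2 second.
expected_calls = [
    call.intra_cluster_normalise("vlad", num_clusters=64),
    call.l2_normalise("intra"),
]
call_order_ok = spy.mock_calls == expected_calls
```

Comparing the whole `mock_calls` list checks both the order and the "exactly once each" condition in a single assertion.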

**AC-4: `embed_query` is deterministic**
Given the same frame + calibration
When `embed_query` is called 3 times
Then all three returns have bit-exact `embedding` arrays (ULP-tolerant FP16)
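
The two comparison modes named in this criterion can be sketched as follows. `bit_exact` and `max_ulp_distance` are hypothetical helper names, and the ULP helper assumes finite FP16 inputs of equal shape. Note that `np.array_equal` alone is not a bit-exact check, because it treats `0.0` and `-0.0` as equal.

```python
import numpy as np


def bit_exact(a: np.ndarray, b: np.ndarray) -> bool:
    """True only if dtype, shape, and every underlying byte match."""
    return a.dtype == b.dtype and a.shape == b.shape and a.tobytes() == b.tobytes()


def max_ulp_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Largest ULP gap between two finite float16 arrays of equal shape."""

    def ordered(x: np.ndarray) -> np.ndarray:
        bits = np.ascontiguousarray(x, dtype=np.float16).view(np.uint16).astype(np.int32)
        magnitude = bits & 0x7FFF
        # Map sign-magnitude bit patterns onto one monotonic integer line.
        return np.where(bits & 0x8000, -magnitude, magnitude)

    return int(np.max(np.abs(ordered(a) - ordered(b))))


first = np.array([1.0, -0.5, 0.25], dtype=np.float16)
second = first.copy()

# Nudge one element to the next representable FP16 value.
nudged_bits = first.view(np.uint16).copy()
nudged_bits[0] += 1
nudged = nudged_bits.view(np.float16)
```

Here `first` and `second` are bit-exact, while `first` and `nudged` differ by one ULP in a single element.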

**AC-5: `retrieve_topk` returns exactly k candidates with `backbone_label = "net_vlad"`**
Given a corpus of 100 tiles + a constructed `VprQuery` with D=4096
When `strategy.retrieve_topk(query, k=10)` is called
Then `len(candidates) == 10`; sorted ascending; `backbone_label == "net_vlad"`; `candidates[0].descriptor_dim == 4096`

**AC-6: `descriptor_dim()` is config-driven and stable**
Given construction with `descriptor_dim=4096`
When `descriptor_dim()` is called 100 times
Then every call returns 4096; constructing a second instance with `descriptor_dim=512` (PCA-whitened weights case) returns 512 from that instance's `descriptor_dim()`

**AC-7: Engine output shape mismatch at load → `ConfigurationError`**
Given a model whose output tensor shape is `(1, 2048)` while `config.vpr.netvlad_descriptor_dim = 4096`
When the module-level `create(...)` factory in `net_vlad.py` is called
Then `ConfigurationError` is raised with message containing `"engine output shape mismatch: expected (1, 4096), got (1, 2048)"`; the strategy is NOT instantiated
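
A minimal sketch of the load-time check, assuming the runtime can report the model's output shape as a tuple. `assert_output_shape` is a hypothetical helper name and `ConfigurationError` here is a local stand-in for the project's error type.

```python
class ConfigurationError(Exception):
    """Stand-in for the project's configuration error type."""


def assert_output_shape(actual: tuple, descriptor_dim: int) -> None:
    """Raise if the engine's output shape is not (1, descriptor_dim)."""
    expected = (1, descriptor_dim)
    if tuple(actual) != expected:
        raise ConfigurationError(
            f"engine output shape mismatch: expected {expected}, got {tuple(actual)}"
        )


assert_output_shape((1, 4096), 4096)  # matching shape: no exception

try:
    assert_output_shape((1, 2048), 4096)
except ConfigurationError as exc:
    message = str(exc)
```

Running the check inside `create` before the strategy is constructed is what guarantees that the strategy is never instantiated on a mismatch.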

**AC-8: `VprBackboneError` on forward-pass failure**
Given an `InferenceRuntime` test double that raises `RuntimeError` from `forward`
When `embed_query` is called
Then `VprBackboneError` is raised; ERROR log + FDR record emitted

**AC-9: `VprPreprocessError` on corrupt image bytes**
Given a frame with malformed `image_bytes`
When `embed_query` is called
Then `VprPreprocessError` is raised; ERROR log + FDR record emitted

**AC-10: Composition-root wiring**
Given `config.vpr.strategy = "net_vlad"` AND valid weights AND matching `descriptor_dim`
When `compose_root(config)` runs
Then a `NetVladStrategy` is wired; AZ-336 factory's pre-flight `descriptor_dim` validation passes; INFO log `kind="c2.vpr.ready"` with `{strategy: "net_vlad", descriptor_dim: 4096}` emitted

**AC-11: Build-flag combination — NetVLAD requires PyTorch runtime**
Given `config.vpr.strategy = "net_vlad"` AND `BUILD_PYTORCH_RUNTIME=OFF` (airborne binary)
When the binary tries to load
Then `ConfigurationError` is raised at composition-root time with message containing `"NetVLAD requires BUILD_PYTORCH_RUNTIME=ON; this binary has BUILD_PYTORCH_RUNTIME=OFF"`; the binary refuses to start (fail-fast)
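
A sketch of the fail-fast guard at composition-root time. How the build flags are exposed to Python is not specified here, so the `build_flags` mapping and the `check_runtime_available` name are assumptions made purely for illustration.

```python
class ConfigurationError(Exception):
    """Stand-in for the project's configuration error type."""


def check_runtime_available(strategy: str, build_flags: dict) -> None:
    """Refuse to wire NetVLAD when the PyTorch runtime was compiled out."""
    pytorch_on = build_flags.get("BUILD_PYTORCH_RUNTIME", False)
    if strategy == "net_vlad" and not pytorch_on:
        raise ConfigurationError(
            "NetVLAD requires BUILD_PYTORCH_RUNTIME=ON; "
            "this binary has BUILD_PYTORCH_RUNTIME=OFF"
        )


# Research binary: both flags ON, so the guard passes.
check_runtime_available("net_vlad", {"BUILD_PYTORCH_RUNTIME": True})

# Airborne binary: PyTorch runtime excluded, so start-up must fail.
try:
    check_runtime_available("net_vlad", {"BUILD_PYTORCH_RUNTIME": False})
except ConfigurationError as exc:
    airborne_error = str(exc)
```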

## Non-Functional Requirements

**Performance**
- `embed_query` p95 ≤ 80 ms on Tier-1 Jetson Orin with PyTorch FP16 — looser than UltraVPR's 60 ms because the simple-baseline runs on the simpler runtime; not on the production critical path.
- `retrieve_topk` p95 ≤ 4 ms: looser than the 2 ms lookup budget that C2-PT-01 sets against UltraVPR's D=512, because the higher embedding dim (4096 vs 512) makes each FAISS distance computation roughly 8× more expensive; still sub-frame at 3 Hz.
- GPU memory: ≤ 800 MB resident for backbone weights — looser than UltraVPR's 600 MB because NetVLAD's VGG16 backbone is larger.
- These NFRs are not enforced as engine-rule blockers; they're operator guidance for the research binary's resource budget.

**Compatibility**
- The PyTorch state_dict format is owned by C7's PyTorch runtime (AZ-300); this task consumes the produced model via `config.vpr.backbone_weights_path`.
- The upstream NetVLAD code drop is pinned per Plan-phase; PCA-whitening parameters change with weights → AZ-280 sidecar carries them.

**Reliability**
- Strategy is single-threaded by contract (INV-1).
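- Dual-stage normalisation order (intra-cluster THEN global L2) is mandatory; reversing the order re-normalises each cluster block after the global stage, so the final vector has a global norm of up to √64 = 8 instead of 1. That violates INV-3 and can silently break AC-2.1b (recall regression).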
- `VprBackboneError` does not crash the process; downstream falls back to VIO-only.
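
The effect of the normalisation order on the final norm can be checked numerically. The sketch below uses FP64 and compact hypothetical helpers (`intra`, `l2`) to keep rounding out of the picture; it assumes 64 contiguous, equal-length, non-zero cluster blocks.

```python
import numpy as np


def intra(vec: np.ndarray, num_clusters: int) -> np.ndarray:
    """Normalise each contiguous cluster block to unit L2 norm."""
    blocks = vec.reshape(num_clusters, -1)
    return (blocks / np.linalg.norm(blocks, axis=1, keepdims=True)).reshape(-1)


def l2(vec: np.ndarray) -> np.ndarray:
    """Normalise the whole vector to unit L2 norm."""
    return vec / np.linalg.norm(vec)


rng = np.random.default_rng(0)
vlad = rng.standard_normal(4096)

correct = l2(intra(vlad, 64))         # intra-cluster THEN global L2
reversed_order = intra(l2(vlad), 64)  # forbidden order

correct_norm = float(np.linalg.norm(correct))          # unit norm
reversed_norm = float(np.linalg.norm(reversed_order))  # sqrt(64) = 8
```

With the mandated order the result is unit-norm; with the reversed order each of the 64 blocks ends up unit-norm, so the full vector has norm 8.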

## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | `isinstance(NetVladStrategy(...), VprStrategy)` | `True` |
| AC-2 | `embed_query` output | shape (4096,), dtype float16, L2-norm == 1.0 ± 1e-3 |
| AC-3 | Spy on normaliser methods | `intra_cluster_normalise` called BEFORE `l2_normalise` exactly once each per `embed_query` |
| AC-4 | `embed_query` × 3 same frame | bit-exact embeddings |
| AC-5 | `retrieve_topk` against fixture corpus | `len == 10`, sorted, `backbone_label == "net_vlad"`, `descriptor_dim == 4096` |
| AC-6 | `descriptor_dim()` × 100 (D=4096 instance) + a second D=512 instance | first instance always 4096; second always 512 |
| AC-7 | Model with wrong output shape | `ConfigurationError` at create time |
| AC-8 | `forward` raises | `VprBackboneError`; ERROR log + FDR |
| AC-9 | malformed `image_bytes` | `VprPreprocessError`; ERROR log + FDR |
| AC-10 | `compose_root(config)` with `config.vpr.strategy = "net_vlad"` | wired; INFO log with `{strategy: "net_vlad", descriptor_dim: 4096}` |
| AC-11 | airborne binary + `config.vpr.strategy = "net_vlad"` | `ConfigurationError` with PyTorch-OFF message; fail-fast |
| Preprocess-shape | `preprocessor.preprocess(frame, calibration)` output | shape `(1, 3, 480, 480)`, dtype float16 |
| Preprocess-input-shape | `preprocessor.input_shape()` | returns `(480, 480)` |

## Constraints

- **Dual-stage normalisation order is non-negotiable** — intra-cluster THEN global L2. Reversing is forbidden.
- **NetVLAD uses the PyTorch runtime, NOT TensorRT** — the simple-baseline policy isolates it from TRT engine compile risk. The research binary links both runtimes; airborne binary excludes the PyTorch runtime via `BUILD_PYTORCH_RUNTIME=OFF`, which makes NetVLAD effectively unselectable for airborne (AC-11).
- **Preprocessing parameters are weights-coupled** — `(480, 480)` resize, ImageNet mean/std. Hard-coded; not config-knobs.
- **`descriptor_dim` IS config-driven** (unlike UltraVPR which hard-codes 512) because NetVLAD ships in two flavours: full 4096-d and PCA-whitened 512-d. The choice is part of the operator's deployment, not a runtime decision.
- **Constructor injection only**; no `import gps_denied_onboard.config` inside the strategy module.
- **The strategy holds the engine ID, NOT the engine itself** — engine lifecycle is owned by C7.
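
The hard-coded preprocessing constants and the normalise, transpose, and batch steps of the preprocessor can be sketched as below. Decoding, the calibration-aware centre-crop, and the OpenCV resize are deliberately left out, so the input is assumed to be an already-resized RGB uint8 image; `to_model_tensor` is a hypothetical helper name.

```python
import numpy as np

# Weights-coupled constants from this spec; module-level, not config-driven.
INPUT_SHAPE = (480, 480)
EXPECTED_HWC = (*INPUT_SHAPE, 3)
MEAN = np.array((0.485, 0.456, 0.406), dtype=np.float32)
STD = np.array((0.229, 0.224, 0.225), dtype=np.float32)


def to_model_tensor(rgb: np.ndarray) -> np.ndarray:
    """Turn a resized RGB uint8 image into an NCHW FP16 tensor."""
    if rgb.shape != EXPECTED_HWC or rgb.dtype != np.uint8:
        raise ValueError(f"expected uint8 {EXPECTED_HWC}, got {rgb.dtype} {rgb.shape}")
    scaled = rgb.astype(np.float32) / 255.0
    # MEAN and STD broadcast over the trailing channel axis.
    normalised = ((scaled - MEAN) / STD).astype(np.float16)
    chw = np.transpose(normalised, (2, 0, 1))  # HWC -> CHW
    return np.ascontiguousarray(chw[np.newaxis])  # add batch dim


tensor = to_model_tensor(np.full(EXPECTED_HWC, 255, dtype=np.uint8))
```

Keeping the constants at module level rather than reading them from config mirrors the constraint that these values are tied to the weights.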

## Risks & Mitigation

**Risk 1: NetVLAD embedding dim of 4096 is 8× larger than UltraVPR's 512; FAISS HNSW lookup is slower**
- *Risk*: `retrieve_topk` may exceed C2-PT-01's 2 ms budget for the lookup stage; the budget was set against UltraVPR's D=512.
- *Mitigation*: `retrieve_topk` p95 ≤ 4 ms is the looser baseline budget (acknowledged in NFRs); for the research binary this is acceptable since NetVLAD is comparison-only. If an operator wants the production-fast path with NetVLAD, they configure PCA-whitening (D=512) at corpus build time (C10).

**Risk 2: NetVLAD recall@10 ≥ 0.85 floor not achievable with FP16**
- *Risk*: FP16 quantisation degrades the VLAD aggregation precision below the relaxed engine-rule floor.
- *Mitigation*: C2-IT-01's NetVLAD assertion is the validation gate (deferred to Step 9). If FP16 fails, the operator can configure FP32 weights — the strategy does not hard-code dtype; it follows the runtime's loaded model.

**Risk 3: PyTorch FP16 runtime on Tier-1 Jetson is slower than expected**
- *Risk*: PyTorch FP16 inference on Jetson has known pipeline-stall issues compared to TRT.
- *Mitigation*: NetVLAD is research-only by build-flag combination (AC-11 enforces); the production critical path is UltraVPR. If a future cycle wants NetVLAD on the airborne binary, that's a separate task: convert NetVLAD to ONNX → TRT engine, then update this strategy to use the TRT runtime.

**Risk 4: Operator picks NetVLAD on airborne binary by mistake**
- *Risk*: A typo in the airborne config that selects `net_vlad` would silently fall back to VIO-only every flight if the runtime were missing.
- *Mitigation*: AC-11 makes this fail-fast at composition-root time with a clear error message. Operators learn at startup, not after takeoff.

**Risk 5: AZ-283 `DescriptorNormaliser` may not yet ship `intra_cluster_normalise`**
- *Risk*: The helper as defined in AZ-283 ships only `l2_normalise`; this task needs `intra_cluster_normalise` too.
- *Mitigation*: As noted in Scope/Excluded, extending AZ-283 to add `intra_cluster_normalise` is a backward-compatible function addition. If AZ-283 has already merged before this task starts, the addition is committed alongside this task with a one-line note in `_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md`. If AZ-283 has not yet merged, coordinate the addition during AZ-283's implementation. Either way, there is no breaking change to existing consumers.

## Runtime Completeness

- **Named capability**: mandatory simple-baseline `VprStrategy` for engine-rule comparative validation against the production-default UltraVPR (architecture / E-C2 / `solution.md` "NetVLAD mandatory simple-baseline" / engine rule + AC-2.1b relaxed floor).
- **Production code that must exist**: real `NetVladStrategy` calling real C7 PyTorch `InferenceRuntime.forward` with a real loaded NetVLAD `.pth` model; real `NetVladBackbonePreprocessor` performing real OpenCV resize + ImageNet normalisation + FP16 cast; real dual-stage normalisation (intra-cluster THEN global L2); real composition-root wiring path.
- **Allowed external stubs**: tests MAY use `FakeInferenceRuntime` returning pre-computed VLAD descriptors; `FakeTileStore`; `FakeFdrClient`; `FakeDescriptorNormaliser` instrumented to verify call order (AC-3); production wiring uses the real C7 PyTorch runtime + real NetVLAD weights + real C6.
- **Unacceptable substitutes**:
  - a NumPy-only NetVLAD forward pass (would not satisfy NFR-perf budget; would defeat the runtime-isolation strategy of using a different runtime than UltraVPR);
  - skipping intra-cluster normalisation (would silently break AC-2.1b's recall floor);
  - using TensorRT for NetVLAD (would defeat the simple-baseline policy of isolating runtime risk);
  - making preprocessing parameters config-knobs (would let operators silently break the recall floor);
  - selecting NetVLAD in an airborne binary (must fail-fast per AC-11);
  - a single-stage L2-only normalisation (would deviate from NetVLAD's published preprocessing chain; recall regression risk).