mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-21 19:01:14 +00:00
[AZ-338] [AZ-283] C2 NetVLAD mandatory simple-baseline VprStrategy
NetVLAD is the C2 comparative baseline per the engine rule (every production-default backbone ships with a simple-baseline alongside). Runs on the C7 PyTorch FP16 runtime (NOT TRT) so a TRT engine compile bug cannot simultaneously break NetVLAD AND UltraVPR.

Production changes:
- c2_vpr/net_vlad.py: NetVladStrategy + module-level create() factory. Constructor wires InferenceRuntimeCut + DescriptorIndexCut + NetVladBackbonePreprocessor + DescriptorNormaliser + FaissBridge. embed_query pipeline: preprocess -> runtime.infer -> dual-stage normalisation (intra-cluster THEN global L2) -> VprQuery. retrieve_topk delegates one-line to FaissBridge.
- c2_vpr/_net_vlad_architecture.py: Arandjelovic et al. 2016 NetVLAD layer over torchvision VGG16 features + optional Linear PCA projection to descriptor_dim (default 4096; published Pittsburgh reference uses K*D=64*512=32768 raw + Linear(32768, 4096) PCA).
- c2_vpr/_preprocessor_net_vlad.py: OpenCV-based image preprocessor: decode -> centre-crop square -> resize (480, 480) -> ImageNet normalisation -> FP16 NCHW. Calibration is not consumed (NetVLAD is calibration-agnostic per published preprocessing chain).
- c2_vpr/inference_runtime_cut.py: NEW AZ-507 consumer-side cut mirroring C7 InferenceRuntime; lets c2_vpr stay AZ-507-clean.
- c2_vpr/config.py: added netvlad_descriptor_dim: int = 4096 knob.
- helpers/descriptor_normaliser.py: added intra_cluster_normalise (DescriptorNormaliser v1.0.0 -> v1.1.0; backward-compatible add).
- runtime_root/vpr_factory.py: added _register_strategy_architecture helper that binds (MODEL_NAME, architecture_factory(descriptor_dim)) to C7's architecture registry before delegating to the strategy's create() factory. Keeps the c7 import at L4, preserves AZ-507.
- fdr_client/records.py: registered vpr.embed_query, vpr.backbone_error, vpr.preprocess_error record kinds.

Tests:
- tests/unit/c2_vpr/test_net_vlad.py: 31 tests covering all 11 ACs + preprocessor contract + architecture factory + constructor validation + FDR record emission.
- tests/unit/test_az283_descriptor_normaliser.py: +8 tests for the new intra_cluster_normalise.
- tests/unit/test_az272_fdr_record_schema.py: +3 fixture payloads.

Full unit suite: 1608 passed / 80 env-skipped (+43 new tests).

Per-batch code review (batch_46_review.md): PASS_WITH_WARNINGS (4 Low-severity hygiene findings; no Critical/High/Medium).

Architectural notes:
- The spec implied c2_vpr.net_vlad.create() registers the architecture with C7. That violates AZ-507 (no cross-component imports). Resolved by exposing MODEL_NAME + architecture_factory(descriptor_dim) on the strategy module and having the composition root perform the C7 bind.
- C7 PyTorch runtime API names in the spec (forward, load_engine) were outdated; aligned implementation with the live v1.0.0 Protocol (infer, compile_engine + deserialize_engine). Spec hygiene flagged in review F2.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -0,0 +1,265 @@
# Code Review: Batch 46 / AZ-338 (C2 NetVLAD Mandatory Simple-Baseline)

**Date**: 2026-05-13
**Mode**: Per-batch (all 7 phases)
**Task**: AZ-338 (C2 NetVLAD Mandatory Simple-Baseline, 3pt)
**Verdict**: **PASS_WITH_WARNINGS**

## Scope

| Domain | Files |
|--------|-------|
| c2_vpr (production) | `net_vlad.py` (NEW), `_net_vlad_architecture.py` (NEW), `_preprocessor_net_vlad.py` (NEW), `inference_runtime_cut.py` (NEW; AZ-507 cut of C7 InferenceRuntime), `config.py` (added `netvlad_descriptor_dim: int = 4096`), `__init__.py` (re-exports `InferenceRuntimeCut`) |
| Shared helpers | `helpers/descriptor_normaliser.py` (added `intra_cluster_normalise(descriptor, num_clusters)`; backward-compatible v1.1.0) |
| FDR | `fdr_client/records.py` (registered `vpr.embed_query`, `vpr.backbone_error`, `vpr.preprocess_error` per the AZ-338 spec § Outcome) |
| Composition root | `runtime_root/vpr_factory.py` (added `_register_strategy_architecture` helper; calls C7 `register_architecture` for the strategy's `MODEL_NAME` + `architecture_factory` pair before delegating to `create()`) |
| Tests | `tests/unit/c2_vpr/test_net_vlad.py` (NEW, 31 tests), `tests/unit/test_az283_descriptor_normaliser.py` (+8 tests for the new method), `tests/unit/test_az272_fdr_record_schema.py` (+3 fixture payloads) |
| Docs | `_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md` (v1.0.0 → v1.1.0; documented `intra_cluster_normalise` row + changelog entry) |

## Phase 1: Context Loading

Inputs reviewed:

- AZ-338 spec (`_docs/02_tasks/todo/AZ-338_c2_net_vlad.md`).
- `vpr_strategy_protocol.md` v1.0.0: 7 invariants; INV-3 (L2-normalised
  embedding) is the central correctness contract.
- `c2_vpr/_faiss_bridge.py` (AZ-341, prior batch): the strategy's
  one-line retrieve delegation target.
- `c7_inference/pytorch_fp16_runtime.py` (AZ-300): the runtime that
  actually deserializes the registered NetVLAD architecture.
- `c7_inference/architecture_registry.py`: the registration target;
  rejects re-registration with a different factory under the same key
  (defensive against accidental collision).
- AZ-507 lint rule (`tests/unit/test_az270_compose_root.py::test_ac6_only_compose_root_imports_concrete_strategies`):
  components MAY NOT import other components.
- `_types/inference.py`: `BuildConfig`, `EngineCacheEntry`,
  `EngineHandle`, `PrecisionMode` (L1 shared DTOs the strategy uses).

## Phase 2: Spec Compliance

All 11 ACs satisfied:

| AC | Description | Covering test(s) |
|----|-------------|------------------|
| AC-1 | Protocol conformance | `test_ac1_protocol_conformance` |
| AC-2 | L2-norm == 1.0 ± 1e-3 FP16 (D,) | `test_ac2_embed_query_returns_unit_norm_fp16_descriptor` + 512-PCA variant |
| AC-3 | `intra_cluster_normalise` BEFORE `l2_normalise` | `test_ac3_intra_cluster_called_before_global_l2` + once-each |
| AC-4 | Deterministic across 3 calls | `test_ac4_embed_query_deterministic_for_same_frame` |
| AC-5 | `retrieve_topk` == k, label="net_vlad", sorted | `test_ac5_retrieve_topk_returns_exactly_k_with_net_vlad_label` |
| AC-6 | `descriptor_dim()` stable | 4096 + 512 instance variants |
| AC-7 | Engine output shape mismatch → ConfigError | `test_ac7_create_rejects_engine_output_shape_mismatch` |
| AC-8 | `VprBackboneError` on forward failure | RuntimeError + missing-key + wrong-shape variants |
| AC-9 | `VprPreprocessError` on corrupt image | non-array + wrong-dtype + wrong-shape variants |
| AC-10 | Composition-root wiring + `c2.vpr.ready` log | INFO log + model_name forcing |
| AC-11 | `BUILD_PYTORCH_RUNTIME=OFF` → ConfigError fail-fast | `tensorrt` + `onnx_trt_ep` runtime label variants |

Spec deviations:

- **`runtime.forward(engine_id, ...)` → `runtime.infer(handle, ...)`**:
  the spec used placeholder names; the actual C7 `InferenceRuntime`
  Protocol API is `infer(handle, inputs)` + `compile_engine` +
  `deserialize_engine`. Aligned with the live Protocol shape (AZ-297).
  Flag: spec wording should be refreshed to match the c7 contract.
- **Architecture registration moved from `c2_vpr.net_vlad.create()` to
  `runtime_root/vpr_factory.py::_register_strategy_architecture`**: the
  spec implies the strategy's `create(...)` registers the architecture
  with C7. That violates AZ-507 (c2_vpr cannot import c7_inference).
  Resolved by exposing `MODEL_NAME` + `architecture_factory(descriptor_dim)`
  on the strategy module and having the composition root perform the
  c7 binding before calling `create(...)`. The C7-side `register_architecture`
  call lives at L4 (runtime_root), not L3. **This is a design
  improvement over the spec; the spec should be updated.**
- **`NetVladStrategy.__init__` signature**: differs from the spec's
  positional argument list (the spec lists `runtime, tile_store,
  weights_path, preprocessor, normaliser, fdr_client, descriptor_dim`).
  Implemented as keyword-only, with `engine_handle` (returned from
  `deserialize_engine`) replacing `weights_path`: the strategy holds
  the resolved handle, not the source path, which is more consistent
  with the spec's own "holds the engine ID, NOT the engine itself"
  constraint. The `tile_store` field was also renamed to
  `descriptor_index` to match `DescriptorIndexCut` (AZ-507 cut).

Aligning the spec with the implementation is tracked in the **Findings** below
(see F2).

## Phase 3: Code Quality

- Every function ≤ ~50 LOC except `make_net_vlad_vgg16` (~75 LOC, of
  which 60 are inner `nn.Module` definitions: natural, indivisible).
- No bare `except`; every error chain uses `raise ... from exc`.
- No silently-swallowed errors; the strategy emits ERROR logs + an FDR
  record for both `VprBackboneError` and `VprPreprocessError` paths.
- Constructor validation is consistent: `ValueError` for range/shape
  violations, `TypeError` for type violations (matches the pattern of
  the prior batch's `FaissBridge`).
- The `_iso_ts_from_clock` helper is duplicated yet again: sixth
  module-local copy (see F1 below; carried over from cumulative review
  43-45).
- Class names (`NetVladStrategy`, `NetVladBackbonePreprocessor`) match
  the spec.
- No verbose default-on debug logging; logs are scoped to ERROR-on-error
  + one INFO `c2.vpr.ready` at composition time.
- Ruff clean on every new file (UP037 auto-fixes applied; one RUF002
  ambiguous-glyph in `_net_vlad_architecture.py` docstring fixed in
  Phase F).

## Phase 4: Security Quick-Scan

- No SQL injection / command injection / eval / exec.
- No hardcoded secrets.
- FDR error-message payload is bounded to `str(error)[:512]`, which prevents
  unbounded sensitive-data exfiltration via long exception messages.
- No PII; `vpr.embed_query` payload is `(frame_id, backbone_label,
  descriptor_dim, latency_us)`, all operational metadata.
- The `intra_cluster_normalise` helper rejects float64 input, which denies
  upcasts that would silently break the FAISS metric.
- The `c7_inference.register_architecture` call lives in the
  composition root which runs at startup; not reachable from
  user-controlled input.
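
As an illustration of the bounded error payload combined with the `raise ... from exc` chaining noted in Phase 3, here is a minimal sketch. The record layout, the `infer_or_record` helper, and the local `VprBackboneError` class are hypothetical stand-ins; only the 512-character bound and the `vpr.backbone_error` kind come from this review.

```python
from typing import Any, Callable, Dict, List

MAX_ERROR_CHARS = 512  # bound quoted in the review


class VprBackboneError(RuntimeError):
    """Local stand-in for the project's error type."""


def infer_or_record(
    infer: Callable[[], Any],
    records: List[Dict[str, Any]],
    frame_id: int,
) -> Any:
    """Run infer(); on failure append a bounded record and re-raise, chained."""
    try:
        return infer()
    except Exception as exc:
        records.append(
            {
                "kind": "vpr.backbone_error",
                "frame_id": frame_id,
                # Truncate so an arbitrarily long message cannot bloat the record.
                "error": str(exc)[:MAX_ERROR_CHARS],
            }
        )
        # Chain the original exception so the root cause stays inspectable.
        raise VprBackboneError("backbone inference failed") from exc


def _failing_infer() -> None:
    raise RuntimeError("x" * 10_000)


records: List[Dict[str, Any]] = []
caught = None
try:
    infer_or_record(_failing_infer, records, frame_id=7)
except VprBackboneError as err:
    caught = err
```

The full message is still reachable through `caught.__cause__` for local debugging; only the persisted record is truncated.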

## Phase 5: Performance Scan

- `embed_query` p95 ≤ 80ms NFR: not verified by microbench in this
  batch (deferred to C2-IT-01 / FT-P-19, Step 9). Justification:
  microbench requires real PyTorch CUDA + real NetVLAD weights; the
  current Tier-0 dev host has neither (see F3).
- `retrieve_topk` p95 ≤ 4ms: the `FaissBridge` (AZ-341) already
  carries the p95 ≤ 500µs microbench; this strategy is a single-line
  delegation, no added overhead.
- The architecture's NetVLAD pooling layer uses `torch.bmm` for the
  K-cluster reduction instead of a Python loop, i.e. a single optimised
  CUDA kernel call. The published reference impl from Pittsburgh
  has a Python `for k in range(K)` loop; this batched form does the
  same asymptotic work (K ~ 64) but avoids K sequential kernel
  launches on GPU.
- The dual-stage normalisation is two FP32-on-FP16-input operations,
  ~ 4096-element working set: sub-µs on any host.
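
The equivalence between the per-cluster loop and the batched reduction can be checked with a small sketch. It uses NumPy's batched `@` as a stand-in for `torch.bmm`, and it assumes the standard NetVLAD residual sum `V[k] = sum_n a[n, k] * (x[n] - c[k])`; whether the project's layer is written exactly this way is not stated in the review, and the function names are illustrative.

```python
import numpy as np


def vlad_loop(x: np.ndarray, a: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Per-cluster Python loop. x: (B, N, D), a: (B, N, K), c: (K, D)."""
    B, N, D = x.shape
    K = c.shape[0]
    out = np.zeros((B, K, D), dtype=x.dtype)
    for k in range(K):
        residual = x - c[k]  # (B, N, D)
        out[:, k, :] = (a[:, :, k:k + 1] * residual).sum(axis=1)
    return out


def vlad_batched(x: np.ndarray, a: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Same sum, expanded: (a^T x)[k] - (sum_n a[n, k]) * c[k]."""
    weighted = np.transpose(a, (0, 2, 1)) @ x  # (B, K, D), one batched matmul
    mass = a.sum(axis=1)[:, :, None]           # (B, K, 1)
    return weighted - mass * c[None, :, :]


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 30, 8))            # B=2 images, N=30 locations, D=8
logits = rng.standard_normal((2, 30, 4))       # K=4 clusters
a = np.exp(logits)
a /= a.sum(axis=2, keepdims=True)              # soft assignment over clusters
c = rng.standard_normal((4, 8))

max_abs_diff = float(np.abs(vlad_loop(x, a, c) - vlad_batched(x, a, c)).max())
```

Both functions do O(B·N·K·D) work; the batched form simply hands the whole reduction to one matrix-multiply call instead of K small ones.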

## Phase 6: Cross-Task Consistency

NetVLAD is the first concrete VprStrategy implementation. Cross-task
consistency therefore concerns the patterns it establishes for
AZ-337 (UltraVPR), AZ-339 (MegaLoc/MixVPR), AZ-340 (SelaVPR/EigenPlaces/SALAD):

- **AZ-507 cut pattern**: `InferenceRuntimeCut` joins
  `DescriptorIndexCut` (AZ-341), `TileUploaderCut` (AZ-329),
  `TileDownloaderCut` (AZ-328). Four Protocol cuts now exist
  cross-component; all named `*Cut`; all decorated with
  `@runtime_checkable`; all one Protocol per file; all consumed via
  the consumer-side cut module path. Pattern is stable.
- **Architecture-registration split**: the strategy module exposes
  `MODEL_NAME` + `architecture_factory(descriptor_dim)`; the
  composition root performs the c7 registration. Future C2 strategies
  using the PyTorch runtime (AZ-339 MegaLoc/MixVPR with VGG/ResNet
  backbones; AZ-340 SelaVPR/EigenPlaces/SALAD with various backbones)
  follow the same shape; the composition-root helper
  `_register_strategy_architecture` already has the dispatch slot for
  per-strategy `descriptor_dim` lookup.
- **Dual-stage normalisation**: NetVLAD's `intra_cluster_normalise`
  + `l2_normalise` chain is unique to NetVLAD (UltraVPR uses
  single-stage `l2_normalise` per the AZ-337 spec). The helper
  addition to `DescriptorNormaliser` is therefore NetVLAD-specific by
  invocation but architectural-pattern-neutral by API; future
  VLAD-aggregating strategies (SALAD has VLAD-like aggregation) can
  reuse the same helper.
- **FDR record kinds**: `vpr.embed_query` / `vpr.backbone_error` /
  `vpr.preprocess_error` are strategy-generic; every concrete C2
  strategy emits the same three plus the AZ-341 `vpr.retrieve_topk`
  from the bridge.
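
A minimal NumPy sketch of the dual-stage chain, for readers unfamiliar with it. The function names mirror the helper names in this review, but the bodies are assumptions: the real `DescriptorNormaliser` may differ in epsilon handling, error types, and return dtype. The sketch does the arithmetic in FP32 on FP16 input and rejects float64, as the review describes.

```python
import numpy as np

_EPS = 1e-12  # guard against division by zero (assumed value)


def intra_cluster_normalise(descriptor: np.ndarray, num_clusters: int) -> np.ndarray:
    """L2-normalise each of the num_clusters contiguous blocks independently."""
    if descriptor.dtype == np.float64:
        raise TypeError("float64 descriptors are rejected")
    if descriptor.ndim != 1 or descriptor.size % num_clusters != 0:
        raise ValueError("descriptor must be 1-D and divisible by num_clusters")
    blocks = descriptor.astype(np.float32).reshape(num_clusters, -1)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    return (blocks / np.maximum(norms, _EPS)).reshape(-1)


def l2_normalise(descriptor: np.ndarray) -> np.ndarray:
    """Scale the whole vector to unit L2 norm, computing in FP32."""
    v = descriptor.astype(np.float32)
    return v / max(float(np.linalg.norm(v)), _EPS)


def dual_stage(raw: np.ndarray, num_clusters: int) -> np.ndarray:
    # Order matters: per-cluster first, then global.
    staged = intra_cluster_normalise(raw, num_clusters)
    return l2_normalise(staged).astype(np.float16)


rng = np.random.default_rng(0)
raw = rng.standard_normal(4 * 8).astype(np.float16)  # K=4 clusters of 8 dims
out = dual_stage(raw, num_clusters=4)
global_norm = float(np.linalg.norm(out.astype(np.float32)))
block_norms = np.linalg.norm(out.astype(np.float32).reshape(4, -1), axis=1)
```

After both stages every cluster block carries equal energy (norm `1/sqrt(K)`), so no single cluster dominates the inner-product metric, while the full vector stays unit-norm up to FP16 rounding.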

## Phase 7: Architecture Compliance

1. **Layer direction (rule 1)**: no upward imports. The strategy module
   imports `_types`, `clock`, `config`, `fdr_client`, `helpers`,
   `logging`, and its sibling c2_vpr modules, all at or below L3.
2. **Public API respect / AZ-507 (rule 2)**: verified by the
   `test_ac6_only_compose_root_imports_concrete_strategies` lint:
   PASS. `c2_vpr/net_vlad.py` consumes `InferenceRuntimeCut` (defined
   in c2_vpr) instead of importing `c7_inference.InferenceRuntime`.
3. **No new cyclic dependencies (rule 3)**: no new cycles.
4. **Duplicate symbols (rule 4)**: `_iso_ts_from_clock` now in 6
   modules (carry-over F1, AZ-508 covers consolidation). No new
   duplications introduced.
5. **Cross-cutting concerns not locally re-implemented (rule 5)**: the
   composition root owns the c7 architecture registration; the
   c2_vpr factory does not.
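
The consumer-side cut in item 2 relies on structural typing. The sketch below shows the idea with `typing.Protocol`; the single-method Protocol body, the `"output"` key, and the `FakeRuntime`, `NotARuntime`, and `Strategy` classes are illustrative assumptions, not the project's actual definitions (the real cut may also declare `compile_engine` and `deserialize_engine`).

```python
from typing import Any, Mapping, Protocol, runtime_checkable


@runtime_checkable
class InferenceRuntimeCut(Protocol):
    """Consumer-side view of the runtime: only what the strategy calls."""

    def infer(self, handle: Any, inputs: Mapping[str, Any]) -> Mapping[str, Any]:
        ...


class FakeRuntime:
    """Conforms structurally; it neither inherits from nor imports the cut."""

    def infer(self, handle: Any, inputs: Mapping[str, Any]) -> Mapping[str, Any]:
        return {"output": [0.0] * 4}


class NotARuntime:
    """Exposes the outdated spec name, so it does not satisfy the cut."""

    def forward(self, engine_id: Any, inputs: Mapping[str, Any]) -> Mapping[str, Any]:
        return {}


class Strategy:
    def __init__(self, *, runtime: InferenceRuntimeCut, engine_handle: Any) -> None:
        # runtime_checkable isinstance only checks that the method exists,
        # not its signature; it is a cheap constructor-time guard.
        if not isinstance(runtime, InferenceRuntimeCut):
            raise TypeError("runtime must provide infer(handle, inputs)")
        self._runtime = runtime
        self._engine_handle = engine_handle

    def embed(self, tensor: Any) -> Any:
        return self._runtime.infer(self._engine_handle, {"input": tensor})["output"]


descriptor = Strategy(runtime=FakeRuntime(), engine_handle="h").embed(None)
```

Because conformance is structural, the concrete C7 runtime can be injected by the composition root without the consumer ever importing the C7 package.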

## Findings

| # | Severity | Category | Files | Title |
|---|----------|----------|-------|-------|
| F1 | Low | Maintainability | `c2_vpr/net_vlad.py` | `_iso_ts_from_clock` duplicated (6th module-local copy) |
| F2 | Low | Spec-Hygiene | AZ-338 task spec | Spec § Outcome lists outdated C7 API names (`runtime.forward` vs `infer`; `runtime.load_engine` vs `compile_engine + deserialize_engine`) + architecture-registration location |
| F3 | Low | Test-Coverage | `tests/unit/c2_vpr/test_net_vlad.py` | NFR-perf microbench (p95 ≤ 80ms) deferred (no Tier-1 PyTorch CUDA host); flagged in Phase 5 |
| F4 | Low | Architecture | `_net_vlad_architecture.py` | NetVLAD's PCA-projection layer parameters are part of the loaded `.pth` state dict; weights validation that the PCA centroids match the recorded sidecar is deferred to AZ-280 (engine sidecar) integration |

### Finding Details

**F1: `_iso_ts_from_clock` duplicated (6th copy)** (Low / Maintainability)

- Location: `src/gps_denied_onboard/components/c2_vpr/net_vlad.py`
  module-level function.
- Description: same 6-line helper as `c2_vpr/_faiss_bridge.py`,
  `c12_operator_orchestrator/operator_reloc_service.py`,
  `c11_tile_manager/idempotent_retry.py`,
  `c11_tile_manager/signing_key.py`,
  `c6_tile_cache/postgres_filesystem_store.py`,
  `c6_tile_cache/freshness_gate.py`; six modules now.
- Suggestion: AZ-508 (hygiene PBI for ISO-timestamp consolidation) is
  already in `todo/` and scoped to absorb all six call-sites.

**F2: AZ-338 spec uses outdated C7 API names + architecture-registration location**
(Low / Spec-Hygiene)

- Locations:
  - Spec § Outcome:
    `intermediate = self._runtime.forward(self._engine_id, {"input": tensor})`
    → live API is `self._runtime.infer(self._engine_handle, {"input": tensor})`.
  - Spec § Outcome:
    `inference_runtime.load_engine(weights_path)` → live API is
    `compile_engine(model_path, build_config) -> entry; deserialize_engine(entry) -> handle`.
  - Spec § Outcome implies `create(...)` performs the C7
    architecture registration; AZ-507 forbids this. Resolved by
    moving the registration to `runtime_root/vpr_factory.py::_register_strategy_architecture`.
- Description: the spec was written against an earlier C7 Protocol
  draft; the C7 Protocol stabilised at v1.0.0 in AZ-297. The
  implementation aligns with the v1.0.0 Protocol; the spec is now
  stale on this detail.
- Suggestion: surface to user as a small spec-hygiene follow-up.
  Same class of finding as cumulative review F3 (AZ-341 spec
  listed an unused `normaliser` parameter). Recommend a single
  hygiene PBI scoped to "refresh AZ-337..AZ-340 specs against the
  stabilised C7 v1.0.0 + AZ-507 patterns".

**F3: NFR-perf microbench deferred (no Tier-1 PyTorch CUDA host)**
(Low / Test-Coverage)

- Location: `tests/unit/c2_vpr/test_net_vlad.py` (no microbench
  test class for AZ-338 NFR-perf).
- Description: the AZ-338 spec NFRs cite p95 ≤ 80ms for `embed_query`
  on Tier-1 Jetson Orin. Microbench requires real PyTorch CUDA + real
  NetVLAD weights; not runnable on this Tier-0 dev host (macOS, no
  CUDA). The fake `InferenceRuntime` returns a synthetic output and
  therefore cannot probe real-runtime latency.
- Suggestion: schedule under FT-P-19 / C2-IT-01 (Step 9 / E-BBT) on
  Tier-1 hardware. No action this batch.
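
For whoever picks up the deferred microbench, a p95 harness could look roughly like the sketch below. It is an assumption-laden outline, not the planned FT-P-19 test: the helper names, warm-up count, and run count are invented for illustration, and the percentile uses the nearest-rank definition.

```python
import time
from typing import Callable, List


def nearest_rank(sorted_samples: List[float], percentile: int) -> float:
    """Nearest-rank percentile: smallest sample with >= percentile% at or below it."""
    n = len(sorted_samples)
    rank = max(1, (percentile * n + 99) // 100)  # integer ceil(percentile * n / 100)
    return sorted_samples[rank - 1]


def p95_latency_ms(fn: Callable[[], object], *, warmup: int = 10, runs: int = 200) -> float:
    """Time fn() repeatedly and return the 95th-percentile latency in ms."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialisation before measuring
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return nearest_rank(samples, 95)


# Placeholder workload; a real run would call strategy.embed_query(frame).
measured = p95_latency_ms(lambda: sum(range(1000)), warmup=2, runs=50)
budget_ms = 80.0  # NFR quoted in this finding
```

On a CUDA host the timed callable would also need to wait for the GPU (for example with `torch.cuda.synchronize()`) before the clock is read, since kernel launches return asynchronously.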

**F4: PCA-projection sidecar verification deferred** (Low / Architecture)

- Location: `src/gps_denied_onboard/components/c2_vpr/_net_vlad_architecture.py`
  PCA `nn.Linear(K*D, descriptor_dim)`.
- Description: the architecture loads its PCA-projection layer's
  weights from the same `.pth` state dict as the rest of the model
  via `torch.load + load_state_dict(strict=True)`. There is no
  separate check that the PCA centroids + whitening matrix match
  the sha256 sidecar (AZ-280). For now the deserialize-time
  strict-mode check is the only safeguard.
- Suggestion: schedule under a future "C2 PCA-whitening sidecar
  validation" PBI if FT-P-19 / C2-IT-01 reveals real-world drift.
  No action this batch.

## Verdict

**PASS_WITH_WARNINGS**: 4 Low-severity findings, all hygiene /
deferred-validation. No Critical, no High, no Medium. AC coverage is
complete; full unit suite is green (1608 passed / 80 env-skipped, +43
tests over batch 45).