mirror of https://github.com/azaion/gps-denied-onboard.git synced 2026-06-21 15:11:12 +00:00

Oleksandr Bezdieniezhnykh af0dbe863a [AZ-338] [AZ-283] C2 NetVLAD mandatory simple-baseline VprStrategy

NetVLAD is the C2 comparative baseline per the engine rule (every
production-default backbone ships with a simple-baseline alongside).
Runs on the C7 PyTorch FP16 runtime (NOT TRT) so a TRT engine compile
bug cannot simultaneously break NetVLAD AND UltraVPR.

Production changes:
- c2_vpr/net_vlad.py — NetVladStrategy + module-level create() factory.
  Constructor wires InferenceRuntimeCut + DescriptorIndexCut +
  NetVladBackbonePreprocessor + DescriptorNormaliser + FaissBridge.
  embed_query pipeline: preprocess -> runtime.infer -> dual-stage
  normalisation (intra-cluster THEN global L2) -> VprQuery.
  retrieve_topk delegates one-line to FaissBridge.
- c2_vpr/_net_vlad_architecture.py — Arandjelovic et al. 2016 NetVLAD
  layer over torchvision VGG16 features + optional Linear PCA
  projection to descriptor_dim (default 4096; published Pittsburgh
  reference uses K*D=64*512=32768 raw + Linear(32768, 4096) PCA).
- c2_vpr/_preprocessor_net_vlad.py — OpenCV-based image preprocessor:
  decode -> centre-crop square -> resize (480, 480) -> ImageNet
  normalisation -> FP16 NCHW. Calibration is not consumed (NetVLAD
  is calibration-agnostic per published preprocessing chain).
- c2_vpr/inference_runtime_cut.py — NEW AZ-507 consumer-side cut
  mirroring C7 InferenceRuntime; lets c2_vpr stay AZ-507-clean.
- c2_vpr/config.py — added netvlad_descriptor_dim: int = 4096 knob.
- helpers/descriptor_normaliser.py — added intra_cluster_normalise
  (DescriptorNormaliser v1.0.0 -> v1.1.0; backward-compatible add).
- runtime_root/vpr_factory.py — added _register_strategy_architecture
  helper that binds (MODEL_NAME, architecture_factory(descriptor_dim))
  to C7's architecture registry before delegating to the strategy's
  create() factory. Keeps the c7 import at L4, preserves AZ-507.
- fdr_client/records.py — registered vpr.embed_query,
  vpr.backbone_error, vpr.preprocess_error record kinds.

Tests:
- tests/unit/c2_vpr/test_net_vlad.py — 31 tests covering all 11 ACs +
  preprocessor contract + architecture factory + constructor
  validation + FDR record emission.
- tests/unit/test_az283_descriptor_normaliser.py — +8 tests for the
  new intra_cluster_normalise.
- tests/unit/test_az272_fdr_record_schema.py — +3 fixture payloads.

Full unit suite: 1608 passed / 80 env-skipped (+43 new tests).
Per-batch code review (batch_46_review.md): PASS_WITH_WARNINGS
(4 Low-severity hygiene findings; no Critical/High/Medium).

Architectural notes:
- The spec implied c2_vpr.net_vlad.create() registers the architecture
  with C7. That violates AZ-507 (no cross-component imports). Resolved
  by exposing MODEL_NAME + architecture_factory(descriptor_dim) on the
  strategy module and having the composition root perform the C7 bind.
- C7 PyTorch runtime API names in the spec (forward, load_engine)
  were outdated; aligned implementation with the live v1.0.0 Protocol
  (infer, compile_engine + deserialize_engine). Spec hygiene flagged
  in review F2.

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-13 22:30:29 +03:00

Code Review — Batch 46 / AZ-338 (C2 NetVLAD Mandatory Simple-Baseline)

Date: 2026-05-13
Mode: Per-batch (all 7 phases)
Task: AZ-338: C2 NetVLAD Mandatory Simple-Baseline (3pt)
Verdict: PASS_WITH_WARNINGS

Scope

| Domain | Files |
|---|---|
| c2_vpr (production) | `net_vlad.py` (NEW), `_net_vlad_architecture.py` (NEW), `_preprocessor_net_vlad.py` (NEW), `inference_runtime_cut.py` (NEW; AZ-507 cut of C7 InferenceRuntime), `config.py` (added `netvlad_descriptor_dim: int = 4096`), `__init__.py` (re-exports `InferenceRuntimeCut`) |
| Shared helpers | `helpers/descriptor_normaliser.py` (added `intra_cluster_normalise(descriptor, num_clusters)`; backward-compatible v1.1.0) |
| FDR | `fdr_client/records.py` (registered `vpr.embed_query`, `vpr.backbone_error`, `vpr.preprocess_error` per the AZ-338 spec § Outcome) |
| Composition root | `runtime_root/vpr_factory.py` (added `_register_strategy_architecture` helper; calls C7 `register_architecture` for the strategy's `MODEL_NAME` + `architecture_factory` pair before delegating to `create()`) |
| Tests | `tests/unit/c2_vpr/test_net_vlad.py` (NEW, 31 tests), `tests/unit/test_az283_descriptor_normaliser.py` (+8 tests for the new method), `tests/unit/test_az272_fdr_record_schema.py` (+3 fixture payloads) |
| Docs | `_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md` (v1.0.0 → v1.1.0; documented `intra_cluster_normalise` row + changelog entry) |

Phase 1 — Context Loading

Inputs reviewed:

AZ-338 spec (_docs/02_tasks/todo/AZ-338_c2_net_vlad.md).
vpr_strategy_protocol.md v1.0.0 — 7 invariants; INV-3 (L2-normalised embedding) is the central correctness contract.
c2_vpr/_faiss_bridge.py (AZ-341, prior batch) — the strategy's one-line retrieve delegation target.
c7_inference/pytorch_fp16_runtime.py (AZ-300) — the runtime that actually deserializes the registered NetVLAD architecture.
c7_inference/architecture_registry.py — the registration target; rejects re-registration with a different factory under the same key (defensive against accidental collision).
AZ-507 lint rule (tests/unit/test_az270_compose_root.py::test_ac6_only_compose_root_imports_concrete_strategies) — components MAY NOT import other components.
_types/inference.py — BuildConfig, EngineCacheEntry, EngineHandle, PrecisionMode (L1 shared DTOs the strategy uses).
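
The registry behaviour noted above (re-registration under the same key with a different factory is rejected) can be sketched as follows. This is a minimal illustration, not the C7 implementation: the class name, the method names, and the use of identity comparison between factories are all assumptions.

```python
from typing import Callable, Dict

# Hypothetical alias: in the real registry a factory would build a torch module.
ArchitectureFactory = Callable[[], object]


class ArchitectureRegistry:
    """Maps a model name to the factory that builds its architecture."""

    def __init__(self) -> None:
        self._factories: Dict[str, ArchitectureFactory] = {}

    def register(self, model_name: str, factory: ArchitectureFactory) -> None:
        existing = self._factories.get(model_name)
        if existing is not None and existing is not factory:
            # Same key, different factory: refuse rather than silently replace.
            raise ValueError(f"architecture {model_name!r} is already registered")
        # First registration, or an idempotent repeat with the same factory.
        self._factories[model_name] = factory

    def resolve(self, model_name: str) -> ArchitectureFactory:
        return self._factories[model_name]
```

Repeating a registration with the same factory object is treated as a no-op here; only a conflicting factory raises.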

Phase 2 — Spec Compliance

All 11 ACs satisfied:

| AC | Description | Covering test(s) |
|---|---|---|
| AC-1 | Protocol conformance | `test_ac1_protocol_conformance` |
| AC-2 | L2-norm == 1.0 ± 1e-3 FP16 (D,) | `test_ac2_embed_query_returns_unit_norm_fp16_descriptor` + 512-PCA variant |
| AC-3 | `intra_cluster_normalise` BEFORE `l2_normalise` | `test_ac3_intra_cluster_called_before_global_l2` + once-each |
| AC-4 | Deterministic across 3 calls | `test_ac4_embed_query_deterministic_for_same_frame` |
| AC-5 | `retrieve_topk` == k, label="net_vlad", sorted | `test_ac5_retrieve_topk_returns_exactly_k_with_net_vlad_label` |
| AC-6 | `descriptor_dim()` stable | 4096 + 512 instance variants |
| AC-7 | Engine output shape mismatch → ConfigError | `test_ac7_create_rejects_engine_output_shape_mismatch` |
| AC-8 | `VprBackboneError` on forward failure | RuntimeError + missing-key + wrong-shape variants |
| AC-9 | `VprPreprocessError` on corrupt image | non-array + wrong-dtype + wrong-shape variants |
| AC-10 | Composition-root wiring + `c2.vpr.ready` log | INFO log + model_name forcing |
| AC-11 | `BUILD_PYTORCH_RUNTIME=OFF` → ConfigError fail-fast | `tensorrt` + `onnx_trt_ep` runtime label variants |
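
For reference, the ordering that AC-2 and AC-3 pin down (per-cluster normalisation first, global L2 second) can be sketched in plain Python. Only the name `intra_cluster_normalise(descriptor, num_clusters)` comes from the batch; the cluster-major layout of the flat descriptor, the `eps` guard, and the helper names `_l2_normalise` and `dual_stage_normalise` are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def _l2_normalise(values: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / max(norm, eps) for v in values]


def intra_cluster_normalise(descriptor: Sequence[float], num_clusters: int) -> List[float]:
    """L2-normalise each of the num_clusters equal-length blocks independently.

    Assumes the flat descriptor is laid out cluster by cluster.
    """
    if num_clusters <= 0 or len(descriptor) % num_clusters != 0:
        raise ValueError("descriptor length must be a positive multiple of num_clusters")
    block = len(descriptor) // num_clusters
    out: List[float] = []
    for k in range(num_clusters):
        out.extend(_l2_normalise(descriptor[k * block:(k + 1) * block]))
    return out


def dual_stage_normalise(descriptor: Sequence[float], num_clusters: int) -> List[float]:
    """Stage 1: per-cluster L2. Stage 2: global L2 over the whole vector."""
    return _l2_normalise(intra_cluster_normalise(descriptor, num_clusters))
```

With two clusters, `[3, 4, 0, 5]` becomes `[0.6, 0.8, 0.0, 1.0]` after the first stage, and the second stage rescales that vector to unit norm.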

Spec deviations:

runtime.forward(engine_id, ...) → runtime.infer(handle, ...): the spec used placeholder names; the actual C7 InferenceRuntime Protocol API is infer(handle, inputs) plus compile_engine and deserialize_engine. The implementation follows the live Protocol shape (AZ-297). Flag: the spec wording should be refreshed to match the C7 contract.
Architecture registration moved from c2_vpr.net_vlad.create() to runtime_root/vpr_factory.py::_register_strategy_architecture: the spec implies that the strategy's create(...) registers the architecture with C7. That violates AZ-507 (c2_vpr cannot import c7_inference). Resolved by exposing MODEL_NAME + architecture_factory(descriptor_dim) on the strategy module and having the composition root perform the C7 binding before calling create(...). The C7-side register_architecture call therefore lives at L4 (runtime_root), not L3. This is a design improvement over the spec; the spec should be updated.
NetVladStrategy.__init__ signature: differs from the spec's positional argument list (runtime, tile_store, weights_path, preprocessor, normaliser, fdr_client, descriptor_dim). It is implemented as keyword-only, with engine_handle (returned from deserialize_engine) replacing weights_path: the strategy holds the resolved handle rather than the source path, which is more consistent with the spec's own "holds the engine ID, NOT the engine itself" constraint. The tile_store field was also renamed descriptor_index to match DescriptorIndexCut (AZ-507 cut).

Aligning the spec with the implementation is tracked in the Findings below (see F2).
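
The registration split described in the second deviation can be sketched as below. This is an illustration under stated assumptions, not the code in runtime_root/vpr_factory.py: the value of `MODEL_NAME`, the dict standing in for the C7 registry, the `SimpleNamespace` standing in for the strategy module, and the `build_vpr_strategy` name are all hypothetical.

```python
from types import SimpleNamespace
from typing import Callable, Dict

# --- c2_vpr side (L3): the strategy module exposes names only; no C7 import ---
MODEL_NAME = "net_vlad"  # assumed value; the review only says the constant exists


def architecture_factory(descriptor_dim: int) -> Callable[[], dict]:
    """Return a zero-argument builder; a dict stands in for the nn.Module."""
    def build() -> dict:
        return {"model": MODEL_NAME, "descriptor_dim": descriptor_dim}
    return build


def create(descriptor_dim: int) -> dict:
    """Stand-in for the strategy's create() factory."""
    return {"strategy": MODEL_NAME, "descriptor_dim": descriptor_dim}


net_vlad_module = SimpleNamespace(
    MODEL_NAME=MODEL_NAME, architecture_factory=architecture_factory, create=create
)

# --- C7 side: a plain dict stands in for the architecture registry ---
C7_REGISTRY: Dict[str, Callable[[], dict]] = {}


def register_architecture(name: str, factory: Callable[[], dict]) -> None:
    C7_REGISTRY[name] = factory


# --- composition root (L4): the only place that touches both sides ---
def build_vpr_strategy(strategy_module: SimpleNamespace, descriptor_dim: int) -> dict:
    register_architecture(
        strategy_module.MODEL_NAME,
        strategy_module.architecture_factory(descriptor_dim),
    )
    return strategy_module.create(descriptor_dim)
```

The strategy side never references `register_architecture`; only the composition-root function does, which is what keeps the component free of cross-component imports.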

Phase 3 — Code Quality

Every function is ≤ ~50 LOC except make_net_vlad_vgg16 (~75 LOC, of which ~60 are inner nn.Module definitions that do not split naturally).
No bare except; every error chain uses raise ... from exc.
No silently-swallowed errors; the strategy emits ERROR logs + an FDR record for both VprBackboneError and VprPreprocessError paths.
Constructor validation is consistent: ValueError for range/shape violations, TypeError for type violations (matches the pattern of the prior batch's FaissBridge).
The _iso_ts_from_clock helper is duplicated again: this is the sixth module-local copy (see F1 below; carried over from cumulative review 43-45).
Class names (NetVladStrategy, NetVladBackbonePreprocessor) match the spec.
No verbose default-on debug logging; logs are scoped to ERROR-on-error + one INFO c2.vpr.ready at composition time.
Ruff clean on every new file (UP037 auto-fixes applied; one RUF002 ambiguous-glyph in _net_vlad_architecture.py docstring fixed in Phase F).
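
The constructor-validation convention noted above (keyword-only arguments, TypeError for type violations, ValueError for range violations) looks roughly like the sketch below. The class name is hypothetical and the parameter list is a subset of the real one; the actual checks in net_vlad.py may differ.

```python
class NetVladStrategySketch:
    """Illustrative constructor only; not the real NetVladStrategy."""

    def __init__(
        self,
        *,  # everything after this marker must be passed by keyword
        runtime: object,
        engine_handle: object,
        descriptor_index: object,
        descriptor_dim: int = 4096,
    ) -> None:
        # Type violations raise TypeError (bool is excluded even though it
        # subclasses int, since True is not a meaningful dimension).
        if isinstance(descriptor_dim, bool) or not isinstance(descriptor_dim, int):
            raise TypeError("descriptor_dim must be an int")
        if runtime is None or engine_handle is None or descriptor_index is None:
            raise TypeError("runtime, engine_handle and descriptor_index are required")
        # Range violations raise ValueError.
        if descriptor_dim <= 0:
            raise ValueError("descriptor_dim must be positive")
        self._runtime = runtime
        self._engine_handle = engine_handle  # resolved handle, not a weights path
        self._descriptor_index = descriptor_index
        self._descriptor_dim = descriptor_dim

    def descriptor_dim(self) -> int:
        return self._descriptor_dim
```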

Phase 4 — Security Quick-Scan

No SQL injection / command injection / eval / exec.
No hardcoded secrets.
FDR error-message payload is bounded to str(error)[:512] — prevents unbounded sensitive-data exfiltration via long exception messages.
No PII; vpr.embed_query payload is (frame_id, backbone_label, descriptor_dim, latency_us) — all operational metadata.
The intra_cluster_normalise helper rejects float64 input — denies upcasts that would silently break the FAISS metric.
The c7_inference.register_architecture call lives in the composition root which runs at startup; not reachable from user-controlled input.
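
The bounded error payload and the `raise ... from exc` chaining mentioned in Phases 3 and 4 combine along these lines. This is a sketch: the record layout (`kind`, `error` keys), the list standing in for the FDR client, and the `run_backbone` wrapper are assumptions; only the 512-character bound and the record-kind name come from the batch.

```python
from typing import Callable, Dict, List

MAX_ERROR_CHARS = 512  # bound quoted in the review: str(error)[:512]


class VprBackboneError(RuntimeError):
    """Stand-in for the project's error type of the same name."""


def bounded_error_message(error: BaseException, limit: int = MAX_ERROR_CHARS) -> str:
    """Truncate the exception text so a record can never grow without bound."""
    return str(error)[:limit]


def run_backbone(infer: Callable[[], object], records: List[Dict[str, str]]) -> object:
    """Call infer(); on failure emit a bounded record and re-raise, keeping the cause."""
    try:
        return infer()
    except Exception as exc:
        records.append(
            {"kind": "vpr.backbone_error", "error": bounded_error_message(exc)}
        )
        # "from exc" preserves the original exception as __cause__.
        raise VprBackboneError("backbone inference failed") from exc
```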

Phase 5 — Performance Scan

embed_query p95 ≤ 80ms NFR: not verified by microbench in this batch (deferred to C2-IT-01 / FT-P-19, Step 9). Justification: the microbench requires real PyTorch CUDA + real NetVLAD weights; the current Tier-0 dev host has neither (see F3).
retrieve_topk p95 ≤ 4ms — the FaissBridge (AZ-341) already carries the p95 ≤ 500µs microbench; this strategy is a single-line delegation, no added overhead.
The architecture's NetVLAD pooling layer uses torch.bmm for the K-cluster reduction instead of a Python loop, so the reduction is issued as a single batched call rather than K separate ones. The published Pittsburgh reference implementation has a Python for k in range(K) loop; the batched form performs the same arithmetic (K ~ 64) and is dramatically faster on GPU.
The dual-stage normalisation is two FP32-on-FP16-input operations, ~ 4096-element working set — sub-µs on any host.
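
When the deferred microbench is written, a p95 check could take roughly the following shape. This is a hedged sketch: the function names, iteration counts, and the nearest-rank percentile definition are assumptions, not the project's bench harness. On a CUDA host the timed callable would also need to synchronise the device before the clock is read, because kernel launches are asynchronous.

```python
import time
from typing import Callable, List, Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def measure_latencies_ms(
    fn: Callable[[], object], iterations: int = 200, warmup: int = 20
) -> List[float]:
    """Time fn() per call in milliseconds, discarding warm-up iterations."""
    for _ in range(warmup):
        fn()
    latencies: List[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies
```

A test would then assert `p95(measure_latencies_ms(call_embed_query)) <= 80.0`, where `call_embed_query` is a hypothetical zero-argument wrapper around the strategy call.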

Phase 6 — Cross-Task Consistency

NetVLAD is the first concrete VprStrategy implementation. Cross-task consistency therefore concerns the patterns it establishes for AZ-337 (UltraVPR), AZ-339 (MegaLoc/MixVPR), AZ-340 (SelaVPR/EigenPlaces/SALAD):

AZ-507 cut pattern: InferenceRuntimeCut joins DescriptorIndexCut (AZ-341), TileUploaderCut (AZ-329), TileDownloaderCut (AZ-328). Five Protocol cuts now exist cross-component; all named *Cut; all runtime_checkable=True; all one Protocol per file; all consumed via the consumer-side cut module path. Pattern is stable.
Architecture-registration split: the strategy module exposes MODEL_NAME + architecture_factory(descriptor_dim); the composition root performs the c7 registration. Future C2 strategies using the PyTorch runtime (AZ-339 MegaLoc/MixVPR with VGG/ResNet backbones; AZ-340 SelaVPR/EigenPlaces/SALAD with various backbones) follow the same shape; the composition-root helper _register_strategy_architecture already has the dispatch slot for per-strategy descriptor_dim lookup.
Dual-stage normalisation: NetVLAD's intra_cluster_normalise + l2_normalise chain is unique to NetVLAD (UltraVPR uses single-stage l2_normalise per the AZ-337 spec). The helper addition to DescriptorNormaliser is therefore NetVLAD-specific by invocation but architectural-pattern-neutral by API; future VLAD-aggregating strategies (SALAD has VLAD-like aggregation) can reuse the same helper.
FDR record kinds: vpr.embed_query / vpr.backbone_error / vpr.preprocess_error are strategy-generic; every concrete C2 strategy emits the same three plus the AZ-341 vpr.retrieve_topk from the bridge.

Phase 7 — Architecture Compliance

Layer direction (rule 1): no upward imports. The strategy module imports _types, clock, config, fdr_client, helpers, logging, and its sibling c2_vpr modules — all at or below L3.
Public API respect / AZ-507 (rule 2): verified by the test_ac6_only_compose_root_imports_concrete_strategies lint: PASS. c2_vpr/net_vlad.py consumes InferenceRuntimeCut (defined in c2_vpr) instead of importing c7_inference.InferenceRuntime.
No new cyclic dependencies (rule 3): no new cycles.
Duplicate symbols (rule 4): _iso_ts_from_clock now in 6 modules (carry-over F1, AZ-508 covers consolidation). No new duplications introduced.
Cross-cutting concerns not locally re-implemented (rule 5): the composition root owns the c7 architecture registration; the c2_vpr factory does not.
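
The consumer-side cut referenced under rule 2 is, in shape, a structural Protocol declared inside the consuming component. The sketch below assumes the cut exposes only `infer`; the real InferenceRuntimeCut may mirror more of the C7 Protocol, and `FakeRuntime` is a hypothetical test double.

```python
from typing import Any, Mapping, Protocol, runtime_checkable


@runtime_checkable
class InferenceRuntimeCut(Protocol):
    """Consumer-side view of the runtime: only what the strategy calls."""

    def infer(self, handle: Any, inputs: Mapping[str, Any]) -> Mapping[str, Any]:
        ...


class FakeRuntime:
    """Satisfies the cut structurally; it never imports or subclasses it."""

    def infer(self, handle: Any, inputs: Mapping[str, Any]) -> Mapping[str, Any]:
        return {"output": inputs["input"]}
```

Because the Protocol is runtime-checkable, `isinstance(FakeRuntime(), InferenceRuntimeCut)` holds without any inheritance; the check only confirms that the method exists, not that its signature matches.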

Findings

| # | Severity | Category | Files | Title |
|---|---|---|---|---|
| F1 | Low | Maintainability | `c2_vpr/net_vlad.py` | `_iso_ts_from_clock` duplicated (6th module-local copy) |
| F2 | Low | Spec-Hygiene | AZ-338 task spec | Spec § Outcome lists outdated C7 API names (`runtime.forward` vs `infer`; `runtime.load_engine` vs `compile_engine + deserialize_engine`) + architecture-registration location |
| F3 | Low | Test-Coverage | `tests/unit/c2_vpr/test_net_vlad.py` | NFR-perf microbench (p95 ≤ 80ms) deferred (no Tier-1 PyTorch CUDA host); flagged in Phase 5 |
| F4 | Low | Architecture | `_net_vlad_architecture.py` | NetVLAD's PCA-projection layer parameters are part of the loaded `.pth` state dict; weights validation that the PCA centroids match the recorded sidecar is deferred to AZ-280 (engine sidecar) integration |

Finding Details

F1: _iso_ts_from_clock duplicated (6th copy) (Low / Maintainability)

Location: src/gps_denied_onboard/components/c2_vpr/net_vlad.py module-level function.
Description: same 6-line helper as c2_vpr/_faiss_bridge.py, c12_operator_orchestrator/operator_reloc_service.py, c11_tile_manager/idempotent_retry.py, c11_tile_manager/signing_key.py, c6_tile_cache/postgres_filesystem_store.py, c6_tile_cache/freshness_gate.py — six modules now.
Suggestion: AZ-508 (hygiene PBI for ISO-timestamp consolidation) is already in todo/ and scoped to absorb all six call-sites.

F2: AZ-338 spec uses outdated C7 API names + architecture-registration location (Low / Spec-Hygiene)

Locations:
- Spec § Outcome: intermediate = self._runtime.forward(self._engine_id, {"input": tensor}) → live API is self._runtime.infer(self._engine_handle, {"input": tensor}).
- Spec § Outcome: inference_runtime.load_engine(weights_path) → live API is compile_engine(model_path, build_config) -> entry; deserialize_engine(entry) -> handle.
- Spec § Outcome implies create(...) performs the C7 architecture registration; AZ-507 forbids this. Resolved by moving the registration to runtime_root/vpr_factory.py::_register_strategy_architecture.
Description: the spec was written against an earlier C7 Protocol draft; the C7 Protocol stabilised at v1.0.0 in AZ-297. The implementation aligns with the v1.0.0 Protocol; the spec is now stale on this detail.
Suggestion: surface to user as a small spec-hygiene follow-up. Same class of finding as cumulative review F3 (AZ-341 spec listed an unused normaliser parameter). Recommend a single hygiene PBI scoped to "refresh AZ-337..AZ-340 specs against the stabilised C7 v1.0.0 + AZ-507 patterns".

F3: NFR-perf microbench deferred (no Tier-1 PyTorch CUDA host) (Low / Test-Coverage)

Location: tests/unit/c2_vpr/test_net_vlad.py (no microbench test class for AZ-338 NFR-perf).
Description: the AZ-338 spec NFRs cite p95 ≤ 80ms for embed_query on Tier-1 Jetson Orin. Microbench requires real PyTorch CUDA + real NetVLAD weights; not runnable on this Tier-0 dev host (macOS, no CUDA). The fake InferenceRuntime returns a synthetic output and therefore cannot probe real-runtime latency.
Suggestion: schedule under FT-P-19 / C2-IT-01 (Step 9 / E-BBT) on Tier-1 hardware. No action this batch.

F4: PCA-projection sidecar verification deferred (Low / Architecture)

Location: src/gps_denied_onboard/components/c2_vpr/_net_vlad_architecture.py PCA nn.Linear(K*D, descriptor_dim).
Description: the architecture loads its PCA-projection layer's weights from the same .pth state dict as the rest of the model via torch.load + load_state_dict(strict=True). There is no separate check that the PCA centroids + whitening matrix match the sha256 sidecar (AZ-280). For now the deserialize-time strict-mode check is the only safeguard.
Suggestion: schedule under a future "C2 PCA-whitening sidecar validation" PBI if FT-P-19 / C2-IT-01 reveals real-world drift. No action this batch.

Verdict

PASS_WITH_WARNINGS — 4 Low-severity findings, all hygiene / deferred-validation. No Critical, no High, no Medium. AC coverage is complete; full unit suite is green (1608 passed / 80 env-skipped, +43 tests over batch 45).
