mirror of https://github.com/azaion/gps-denied-onboard.git synced 2026-06-21 08:51:12 +00:00

Oleksandr Bezdieniezhnykh 0d65ff4705 [AZ-339] C2 MegaLoc + MixVPR secondary VPR backbones

Adds two research-only VprStrategy implementations for the IT-12
comparative-study matrix. MegaLocStrategy (D=2048, 322x322) and
MixVprStrategy (D=4096, 320x320), both via C7 TensorRT FP16 with
their own concrete BackbonePreprocessor. Single-stage global L2
normalisation; retrieval delegated to FaissBridge; FDR records +
structured logs identical to UltraVPR. BUILD_VPR_MEGALOC and
BUILD_VPR_MIXVPR ON for research/replay-cli only, OFF for airborne
and operator-tooling (fail-fast at composition root via existing
AZ-336 factory). Uses helpers.iso_ts_from_clock from day 1 — no
new timestamp helper duplicates introduced.

36 parametrised AC tests + 25 protocol-conformance + 18 helper
regression tests pass; 1690 / 1690 unit tests pass (excluding 1
pre-existing flaky cold-start subprocess test in c12). Verdict:
PASS_WITH_WARNINGS — one Medium follow-on (AZ-527 to consolidate
4-way _assert_engine_output_dim) + one Low AC wording drift.

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-13 23:52:54 +03:00

C2 MegaLoc + MixVPR Secondary Backbones

Task: AZ-339_c2_megaloc_mixvpr
Name: C2 MegaLoc + MixVPR Secondary Backbones (Research-only)
Description: Implement MegaLocStrategy and MixVprStrategy, two secondary VprStrategy backbones used for IT-12 comparative-study purposes (research binary only). Both run on the C7 TensorRT runtime (same path as UltraVPR; FP16 engines compiled by C10) but are gated OFF for airborne and operator-tooling per ADR-002; they are available only in the research binary and (selectable) replay-cli. Each strategy ships its own concrete BackbonePreprocessor (different resize target and normalisation per upstream code drop). Embeddings: MegaLoc D=2048, MixVPR D=4096. Both produce L2-normalised embeddings; both delegate retrieve_topk to the C6 TileStore Public API. Neither is on the production critical path; performance NFRs are looser than UltraVPR.
Complexity: 5 points
Dependencies: AZ-336_c2_vpr_strategy_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-298_c7_tensorrt_runtime, AZ-303_c6_storage_interfaces, AZ-283_descriptor_normaliser, AZ-281_engine_filename_schema, AZ-321_c10_engine_compiler, AZ-266_log_module, AZ-272_fdr_record_schema
Component: c2_vpr (epic AZ-255 / E-C2)
Tracker: AZ-339
Epic: AZ-255 (E-C2)

Document Dependencies

_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md — Protocol contract; both strategies satisfy every invariant.
_docs/02_document/components/02_c2_vpr/description.md — § 1 secondary backbones for IT-12 comparative study; § 5 backbone library list.
_docs/02_document/module-layout.md — c2_vpr.mega_loc and c2_vpr.mix_vpr Internal entries; BUILD_VPR_MEGALOC and BUILD_VPR_MIXVPR rows (both OFF for airborne/operator-tooling, ON for research; replay-cli inherits research selection at config time).
_docs/02_document/contracts/c7_inference/inference_runtime_protocol.md — InferenceRuntime interface (TRT runtime).
_docs/02_document/contracts/shared_helpers/descriptor_normaliser.md — L2 normalisation.

Problem

Without this task:

The IT-12 comparative-study cannot enumerate MegaLoc and MixVPR alongside UltraVPR / NetVLAD; researchers cannot quantify whether UltraVPR's PRIMARY designation is justified against the broader VPR-backbone landscape.
The research binary's link surface is incomplete; the comparative-study CI matrix entry that asserts the research binary contains every secondary backbone fails.
A future cycle that wants to swap MegaLoc to PRIMARY (e.g., if UltraVPR's upstream code drop becomes unmaintained) would have no migration path — the strategy class would not yet exist.

Outcome

src/gps_denied_onboard/components/c2_vpr/mega_loc.py defining MegaLocStrategy (Protocol-conforming) + create(config, tile_store, inference_runtime) factory entry-point.
- Constructor signature: __init__(self, runtime, tile_store, weights_path, preprocessor, normaliser, fdr_client).
- embed_query: preprocess → TRT forward → L2 normalise → return VprQuery.
- retrieve_topk: delegate to tile_store.faiss_topk; return VprResult with backbone_label="mega_loc", descriptor_dim=2048.
- descriptor_dim() -> int: returns 2048; engine output shape asserted at load.
src/gps_denied_onboard/components/c2_vpr/_preprocessor_mega_loc.py defining MegaLocBackbonePreprocessor:
- input_shape() -> (322, 322) per upstream MegaLoc default.
- Normalisation: ImageNet mean/std (same as UltraVPR — common upstream convention; not a coupling, both happen to use ImageNet).
- Centre-crop with calibration-aware logic (same pattern as UltraVPR / NetVLAD; copied not shared per description.md § 6).
- Output dtype FP16, NCHW.
src/gps_denied_onboard/components/c2_vpr/mix_vpr.py defining MixVprStrategy (mirrors MegaLocStrategy structure):
- backbone_label="mix_vpr", descriptor_dim=4096.
src/gps_denied_onboard/components/c2_vpr/_preprocessor_mix_vpr.py defining MixVprBackbonePreprocessor:
- input_shape() -> (320, 320) per upstream MixVPR default.
- Normalisation: ImageNet mean/std.
- Output dtype FP16, NCHW.
Composition-root wiring paths for config.vpr.strategy in {"mega_loc", "mix_vpr"}.
BUILD_VPR_MEGALOC and BUILD_VPR_MIXVPR CMake flags wired per ADR-002.
Logging per description.md § 9 (INFO ready, WARN top-1-above-threshold, ERROR / FDR per error path).
Engine output shape assertion at load for both strategies.
Unit tests covering Protocol conformance, L2-normalisation, deterministic embeddings, top-K invariants, error paths — for BOTH strategies.
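
The MegaLocStrategy outcome listed above (preprocess → TRT forward → L2 normalise, then delegation of retrieval to the tile store) can be sketched as follows. This is a minimal sketch under stated assumptions, not the repository's implementation: the VprQuery / VprResult fields, the forward(engine_id, tensor) call shape, the faiss_topk(embedding, k) call shape, and both Fake classes are hypothetical, and the constructor is simplified relative to the signature given above (weights_path and fdr_client are omitted). Only the pipeline order, backbone_label and descriptor_dim come from the task text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VprQuery:  # hypothetical shape; the real type comes from the AZ-336 contract
    embedding: tuple
    backbone_label: str


@dataclass(frozen=True)
class VprResult:  # hypothetical shape
    candidates: list
    backbone_label: str
    descriptor_dim: int


def l2_normalise(vec):
    """Scale vec to unit L2 norm (stand-in for the shared DescriptorNormaliser)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return tuple(x / norm for x in vec)


class MegaLocStrategySketch:
    DESCRIPTOR_DIM = 2048
    BACKBONE_LABEL = "mega_loc"

    def __init__(self, runtime, tile_store, preprocess, normalise, engine_id):
        self._runtime = runtime
        self._tile_store = tile_store
        self._preprocess = preprocess
        self._normalise = normalise
        self._engine_id = engine_id  # an engine ID, not the engine itself

    def descriptor_dim(self):
        return self.DESCRIPTOR_DIM

    def embed_query(self, frame):
        tensor = self._preprocess(frame)                      # 1. preprocess
        raw = self._runtime.forward(self._engine_id, tensor)  # 2. forward pass
        embedding = self._normalise(raw)                      # 3. L2 normalise
        return VprQuery(embedding, self.BACKBONE_LABEL)

    def retrieve_topk(self, query, k):
        # Retrieval is delegated; the strategy only labels the result.
        candidates = self._tile_store.faiss_topk(query.embedding, k)
        return VprResult(candidates, self.BACKBONE_LABEL, self.DESCRIPTOR_DIM)


class FakeRuntime:
    """Test double returning a fixed, un-normalised 2048-d vector."""

    def forward(self, engine_id, tensor):
        return [float(i % 7 + 1) for i in range(2048)]


class FakeTileStore:
    """Test double returning k (tile_id, distance) pairs in ascending order."""

    def faiss_topk(self, embedding, k):
        return [(tile_id, 0.1 * tile_id) for tile_id in range(k)]
```

A MixVPR counterpart would mirror this structure with backbone_label "mix_vpr" and descriptor_dim 4096.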

Scope

Included

Both MegaLocStrategy and MixVprStrategy classes implementing the Protocol.
Both concrete BackbonePreprocessor implementations (one per strategy; preprocessing parameters per upstream code drop).
Module-level create factory functions for both.
Composition-root wiring for both strategy choices.
Engine output shape assertion at load for both.
Logging + FDR records identical pattern to UltraVPR (per-backbone backbone_label).
Unit tests for both strategies covering invariants + error paths.
BUILD_VPR_MEGALOC and BUILD_VPR_MIXVPR CMake flag wiring.

Excluded

The VprStrategy Protocol — owned by AZ-336.
Shared DescriptorNormaliser — already AZ-283.
C7 TensorRT runtime — owned by AZ-298.
Engine compilation — owned by AZ-321.
Other backbones — AZ-337 (UltraVPR), AZ-338 (NetVLAD), AZ-340 (SelaVPR + EigenPlaces + SALAD).
FAISS retrieve wiring — owned by AZ-341.
Recall@10 acceptance tests for these secondary backbones — deferred to Step 9 / E-BBT (and the floors are looser per the engine rule — these are research-only, not engine-rule-binding).

Acceptance Criteria

AC-1 (per strategy): Protocol conformance
Given a constructed MegaLocStrategy AND a constructed MixVprStrategy
When isinstance(strategy, VprStrategy) is evaluated
Then both return True
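
The isinstance check in AC-1 only works on a typing.Protocol that is decorated with runtime_checkable. The sketch below illustrates the mechanism; the three-method VprStrategy shown here is a hypothetical reduction (the real Protocol is owned by AZ-336), and both example classes are invented for illustration.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class VprStrategy(Protocol):  # hypothetical reduction of the AZ-336 Protocol
    def embed_query(self, frame, calibration): ...
    def retrieve_topk(self, query, k): ...
    def descriptor_dim(self) -> int: ...


class ConformingStrategy:
    """Defines every Protocol method, so the isinstance check passes."""

    def embed_query(self, frame, calibration):
        return None

    def retrieve_topk(self, query, k):
        return []

    def descriptor_dim(self) -> int:
        return 2048


class MissingRetrieve:
    """Lacks retrieve_topk, so the isinstance check fails."""

    def embed_query(self, frame, calibration):
        return None

    def descriptor_dim(self) -> int:
        return 2048
```

Note that a runtime_checkable isinstance check verifies only that the methods exist, not their signatures or return types, which is why the remaining ACs test behaviour separately.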

AC-2 (per strategy): embed_query produces L2-normalised FP16 embedding of correct dim
Given a valid NavCameraFrame and CameraCalibration
When embed_query is called on each strategy
Then MegaLoc returns embedding.shape == (2048,), MixVPR returns embedding.shape == (4096,); both are dtype == np.float16; both have ||embedding||_2 == 1.0 ± 1e-3
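
The ± 1e-3 tolerance in AC-2 leaves room for FP16 rounding after normalisation. The following standard-library sketch illustrates that: it normalises in double precision, rounds each element to half precision through the struct module's "e" format, and measures the resulting norm. The function names and the synthetic input are assumptions; production code would presumably use NumPy float16 arrays and the shared DescriptorNormaliser instead.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def l2_norm(vec) -> float:
    return math.sqrt(sum(x * x for x in vec))


def l2_normalise_fp16(vec):
    """Normalise in double precision, then round each element to FP16."""
    norm = l2_norm(vec)
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [to_fp16(x / norm) for x in vec]


# Synthetic 2048-d descriptor standing in for a MegaLoc engine output.
raw = [math.sin(i) + 1.5 for i in range(2048)]
embedding = l2_normalise_fp16(raw)
norm_error = abs(l2_norm(embedding) - 1.0)
```

Half precision carries a relative rounding error of at most 2^-11 (about 4.9e-4) per element in the normal range, so the norm of the rounded vector stays inside the 1e-3 band for inputs like this one.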

AC-3 (per strategy): Deterministic embeddings
Given the same frame
When embed_query is called 3 times
Then bit-exact embeddings (ULP-tolerant FP16) for each strategy
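
One way to express the AC-3 comparison is as a distance in FP16 units in the last place (ULPs), where a distance of 0 means bit-exact. The sketch below is an illustration only: the helper names and the max_ulps parameter are assumptions, and the task text does not state which tolerance the real tests use. NaN inputs are not handled.

```python
import struct


def fp16_bits(x: float) -> int:
    """Return the 16-bit IEEE 754 half-precision pattern of x."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def fp16_ordered(x: float) -> int:
    """Map the sign-magnitude bit pattern to a monotonically ordered integer."""
    bits = fp16_bits(x)
    magnitude = bits & 0x7FFF
    return -magnitude if bits & 0x8000 else magnitude


def fp16_ulp_distance(a: float, b: float) -> int:
    """Number of representable FP16 steps between a and b (+0 and -0 are 0 apart)."""
    return abs(fp16_ordered(a) - fp16_ordered(b))


def embeddings_match(e1, e2, max_ulps: int = 0) -> bool:
    """True when both embeddings have equal length and differ by at most max_ulps."""
    if len(e1) != len(e2):
        return False
    return all(fp16_ulp_distance(a, b) <= max_ulps for a, b in zip(e1, e2))
```

With max_ulps=0 this is a strict bit-exact check (up to the sign of zero); a small positive value tolerates last-bit differences between runs.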

AC-4 (per strategy): retrieve_topk returns exactly k candidates with correct backbone_label
Given a corpus of 100 tiles per strategy's descriptor_dim + a constructed VprQuery
When retrieve_topk(query, k=10) is called on each strategy
Then len(candidates) == 10, sorted ascending; backbone_label == "mega_loc" for MegaLoc; backbone_label == "mix_vpr" for MixVPR; descriptor_dim matches

AC-5 (per strategy): descriptor_dim() is stable
Given a constructed strategy
When descriptor_dim() is called 100 times
Then MegaLoc returns 2048 every call; MixVPR returns 4096 every call

AC-6 (per strategy): Engine output shape mismatch → ConfigurationError
Given a TRT engine whose output tensor shape does not match the strategy's expected descriptor_dim
When create(...) is called
Then ConfigurationError is raised; the strategy is NOT instantiated
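
The load-time assertion behind AC-6 might look like the sketch below. It is hedged throughout: ConfigurationError is a local stand-in for the project's exception, the accepted shapes ((D,) or a leading run of size-1 dimensions such as (1, D)) are an assumption, and create_sketch returns a plain dict instead of a strategy instance.

```python
class ConfigurationError(Exception):
    """Local stand-in for the project's ConfigurationError."""


def assert_engine_output_dim(output_shape, expected_dim, backbone_label):
    """Raise ConfigurationError unless the engine emits one expected_dim-wide vector."""
    dims = tuple(output_shape)
    last_dim = dims[-1] if dims else None
    leading_ok = all(d == 1 for d in dims[:-1])  # assumption: batch size 1 only
    if last_dim != expected_dim or not leading_ok:
        raise ConfigurationError(
            f"{backbone_label}: engine output shape {dims} does not match "
            f"descriptor_dim={expected_dim}"
        )


def create_sketch(output_shape, expected_dim, backbone_label):
    """Validate first, construct second, so a mismatch never yields an instance."""
    assert_engine_output_dim(output_shape, expected_dim, backbone_label)
    return {"backbone_label": backbone_label, "descriptor_dim": expected_dim}
```

Running the check before construction is what guarantees the "strategy is NOT instantiated" half of the criterion.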

AC-7 (per strategy): VprBackboneError on forward-pass failure
Given an InferenceRuntime test double that raises
When embed_query is called
Then VprBackboneError is raised; ERROR log + FDR record emitted
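
The error path in AC-7 can be sketched as a wrapper around the forward call. Everything named here is an assumption made for illustration: the exception class is a local stand-in, the log message and the "c2.vpr.backbone_error" kind are hypothetical, and RecordingFdrClient.record is an invented test-double method, not the real FDR client API.

```python
import logging

logger = logging.getLogger("c2_vpr.sketch")


class VprBackboneError(Exception):
    """Local stand-in for the project's VprBackboneError."""


class RaisingRuntime:
    """InferenceRuntime test double whose forward pass always fails."""

    def forward(self, engine_id, tensor):
        raise RuntimeError("simulated TRT failure")


class RecordingFdrClient:
    """Hypothetical FDR test double that stores records in memory."""

    def __init__(self):
        self.records = []

    def record(self, payload):
        self.records.append(payload)


def forward_or_raise(runtime, engine_id, tensor, fdr_client, backbone_label):
    """Run the forward pass; on failure emit an ERROR log and an FDR record, then re-raise."""
    try:
        return runtime.forward(engine_id, tensor)
    except Exception as exc:
        logger.error("forward pass failed strategy=%s cause=%r", backbone_label, exc)
        fdr_client.record(
            {"kind": "c2.vpr.backbone_error", "strategy": backbone_label, "cause": repr(exc)}
        )
        # Chain the original exception so the root cause stays visible.
        raise VprBackboneError(f"{backbone_label}: forward pass failed") from exc
```

Raising a typed error instead of letting the runtime exception escape gives the caller a single exception type to catch when it falls back to VIO-only.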

AC-8 (per strategy): VprPreprocessError on corrupt image bytes
Given a frame with malformed image_bytes
When embed_query is called
Then VprPreprocessError is raised; ERROR log + FDR record emitted

AC-9 (per strategy): Composition-root wiring
Given config.vpr.strategy = "mega_loc" (resp. "mix_vpr") AND valid weights AND matching descriptor_dim
When compose_root(config) runs
Then the corresponding strategy is wired; AZ-336 factory's pre-flight descriptor_dim validation passes; INFO log kind="c2.vpr.ready" with {strategy: "mega_loc", descriptor_dim: 2048} (resp. mix_vpr / 4096) is emitted

AC-10 (per strategy): Build-flag exclusion in airborne binary
Given config.vpr.strategy = "mega_loc" (resp. "mix_vpr") AND BUILD_VPR_MEGALOC=OFF (resp. BUILD_VPR_MIXVPR=OFF), i.e. the airborne case
When the binary tries to load
Then ConfigurationError is raised at composition-root time with message containing the missing flag; the binary refuses to start (fail-fast per AZ-336 factory's lazy-import → ImportError → ConfigurationError mapping)
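
The lazy-import → ImportError → ConfigurationError mapping named in AC-10 could be sketched as below. The registry, the function name and the message wording are hypothetical; the dotted module paths are inferred from the file paths listed under Outcome, and how an OFF build flag actually removes a module from a binary is not described in this task, so the sketch simply treats "module cannot be imported" as the signal.

```python
import importlib


class ConfigurationError(Exception):
    """Local stand-in for the project's ConfigurationError."""


# Hypothetical registry: strategy name -> (module path, build flag that ships it).
_STRATEGY_MODULES = {
    "mega_loc": ("gps_denied_onboard.components.c2_vpr.mega_loc", "BUILD_VPR_MEGALOC"),
    "mix_vpr": ("gps_denied_onboard.components.c2_vpr.mix_vpr", "BUILD_VPR_MIXVPR"),
}


def load_strategy_module(strategy_name, importer=importlib.import_module):
    """Import the strategy module lazily; map a missing module to ConfigurationError."""
    try:
        module_path, build_flag = _STRATEGY_MODULES[strategy_name]
    except KeyError:
        raise ConfigurationError(f"unknown vpr.strategy {strategy_name!r}") from None
    try:
        return importer(module_path)
    except ImportError as exc:
        # Name the missing flag so the startup failure is self-explanatory.
        raise ConfigurationError(
            f"vpr.strategy {strategy_name!r} is not available in this binary; "
            f"it requires {build_flag}=ON"
        ) from exc
```

The importer parameter exists only so tests can substitute a fake; in an environment where the module is absent, the default importer raises ModuleNotFoundError, which is a subclass of ImportError and is therefore mapped as well.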

AC-11 (per strategy): Preprocessing input shape
Given the strategy's preprocessor instance
When input_shape() is called
Then MegaLoc returns (322, 322); MixVPR returns (320, 320)

Non-Functional Requirements

Performance (looser than UltraVPR — research-only, not on production critical path):

MegaLoc embed_query p95 ≤ 80 ms on Tier-1 Jetson Orin (FP16 TRT).
MixVPR embed_query p95 ≤ 100 ms on Tier-1 Jetson Orin (FP16 TRT) — slightly higher because MixVPR's mix-net is ~30% larger than UltraVPR's backbone.
retrieve_topk p95: MegaLoc ≤ 3 ms, MixVPR ≤ 4 ms (FAISS HNSW search over 2048-d and 4096-d descriptors is slower than over UltraVPR's 512-d descriptors).
GPU memory per strategy: MegaLoc ≤ 700 MB; MixVPR ≤ 800 MB resident.
These NFRs are research-side guidance; not engine-rule blockers.
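
A p95 budget check of the kind listed above can be sketched with the nearest-rank percentile. This is an illustration under assumptions: the task does not say which percentile definition the benchmark harness uses, and the function and dictionary names are hypothetical; only the 80 ms and 100 ms embed_query budgets come from the list above.

```python
def p95_nearest_rank(samples_ms):
    """Return the nearest-rank 95th percentile of a list of latency samples."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) in integer arithmetic, to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def within_budget(samples_ms, budget_ms):
    """True when the p95 latency does not exceed the budget."""
    return p95_nearest_rank(samples_ms) <= budget_ms


# embed_query p95 budgets from the NFR list, in milliseconds.
EMBED_QUERY_P95_BUDGET_MS = {"mega_loc": 80.0, "mix_vpr": 100.0}
```

With 20 samples the rank is 19, so a single outlier does not break the budget but two outliers do.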

Compatibility

Both consume TRT engines produced by AZ-321 with the AZ-281 self-describing filename schema.
Upstream code drops pinned per Plan-phase; weight-format changes between drops require engine rebuild.

Reliability

Both strategies single-threaded by contract.
Both use unconditional L2-normalisation (INV-3).
Errors do not crash the process; downstream falls back to VIO-only.

Unit Tests

| AC Ref | What to Test | Required Outcome |
| --- | --- | --- |
| AC-1 (MegaLoc) | `isinstance(MegaLocStrategy(...), VprStrategy)` | `True` |
| AC-1 (MixVPR) | `isinstance(MixVprStrategy(...), VprStrategy)` | `True` |
| AC-2 (MegaLoc) | `embed_query` output | shape (2048,), dtype float16, L2-norm ≈ 1.0 |
| AC-2 (MixVPR) | `embed_query` output | shape (4096,), dtype float16, L2-norm ≈ 1.0 |
| AC-3 (each) | `embed_query` × 3 same frame | bit-exact embeddings (ULP-tolerant) |
| AC-4 (each) | `retrieve_topk` against fixture corpus | `len == 10`, sorted, correct `backbone_label`, correct `descriptor_dim` |
| AC-5 (each) | `descriptor_dim()` × 100 | always returns the correct dim |
| AC-6 (each) | TRT engine with wrong output shape | `ConfigurationError` at create time |
| AC-7 (each) | `forward` raises | `VprBackboneError`; ERROR log + FDR |
| AC-8 (each) | malformed `image_bytes` | `VprPreprocessError`; ERROR log + FDR |
| AC-9 (each) | `compose_root(config=<strategy>)` | wired; INFO log with correct backbone label and dim |
| AC-10 (each) | airborne binary + strategy chosen | `ConfigurationError` with missing-flag message; fail-fast |
| AC-11 (MegaLoc) | `MegaLocBackbonePreprocessor.input_shape()` | returns `(322, 322)` |
| AC-11 (MixVPR) | `MixVprBackbonePreprocessor.input_shape()` | returns `(320, 320)` |
| Preprocess-shape (each) | `preprocess(frame)` output | NCHW shape `(1, 3, H, W)`, dtype float16 |

Constraints

Each strategy ships its own concrete preprocessor — preprocessing parameters per upstream code drop (description.md § 6 "C2-internal helper, NOT a shared helper").
Preprocessing parameters are weights-coupled — (322, 322) for MegaLoc, (320, 320) for MixVPR; ImageNet mean/std for both. Hard-coded; not config-knobs.
Centre-crop logic is duplicated, NOT shared — copying preprocessing between strategies is intentional per the contract; sharing would couple weights-versions across strategies and let one strategy's upgrade silently break another's preprocessing.
Both use TensorRT runtime (consistent with UltraVPR's path); the difference between secondary and primary is not the runtime but the build-flag ON/OFF in airborne.
No engine compilation in this task — the .trt engine files come from AZ-321; this task consumes them via config.vpr.backbone_weights_path.
Both strategies hold engine IDs returned by inference_runtime.load_engine, NOT engines themselves.
No GPU operations in __init__ beyond engine load — same constraint as UltraVPR.
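
The hard-coded, per-strategy preprocessing constraints above can be illustrated with a small sketch for MegaLoc. Assumptions are flagged here and in the comments: the class and method names other than input_shape are hypothetical, a plain geometric centre-crop stands in for the calibration-aware logic (whose details are not given in this task), and resize, FP16 cast and NCHW packing are left out. The mean/std values are the standard ImageNet constants.

```python
class MegaLocPreprocessorSketch:
    # Weights-coupled constants: hard-coded on the class, not read from config.
    INPUT_SHAPE = (322, 322)
    IMAGENET_MEAN = (0.485, 0.456, 0.406)
    IMAGENET_STD = (0.229, 0.224, 0.225)

    def input_shape(self):
        return self.INPUT_SHAPE

    def output_shape(self):
        """NCHW shape of the tensor handed to the runtime (batch of 1, RGB)."""
        height, width = self.INPUT_SHAPE
        return (1, 3, height, width)

    def centre_crop_box(self, width, height):
        """Return (left, top, side) of the largest centred square crop.

        Assumption: purely geometric; the real logic is calibration-aware.
        """
        side = min(width, height)
        return (width - side) // 2, (height - side) // 2, side

    def normalise_pixel(self, rgb):
        """Map one 8-bit RGB pixel to ImageNet-normalised floats."""
        return tuple(
            (channel / 255.0 - mean) / std
            for channel, mean, std in zip(rgb, self.IMAGENET_MEAN, self.IMAGENET_STD)
        )
```

Following the duplication constraint, a MixVPR preprocessor would be a separate copy of this class with INPUT_SHAPE = (320, 320) and its own constants, not a subclass or a shared helper.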

Risks & Mitigation

Risk 1: MegaLoc and MixVPR upstream code drops use different ONNX op sets that TRT 10.3 partially supports

Risk: Engine compilation succeeds but with fallback layers that don't run on GPU; embed_query p95 inflates.
Mitigation: AZ-321 (engine compile) is responsible for detecting fallback layers and reporting them. This task consumes the produced engine; if NFR-perf budgets are violated, AZ-321 escalates the upstream support gap.

Risk 2: Higher embedding dim (4096 for MixVPR) inflates corpus storage requirements

Risk: A research binary that switches between UltraVPR (D=512) and MixVPR (D=4096) needs to rebuild the FAISS corpus every swap; researchers may forget.
Mitigation: AZ-336 factory's pre-flight descriptor_dim validation catches the mismatch at startup with a clear error message. Researchers must rebuild the corpus (C10) before swapping; the helpful error tells them so.
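
The storage inflation and the pre-flight check in Risk 2 can be made concrete with a short sketch. The bytes-per-element value is a parameter because this task does not state the dtype in which the corpus stores descriptors; the function names, the message wording and the local ConfigurationError are likewise assumptions, and index overhead (for example HNSW graph links) is ignored.

```python
class ConfigurationError(Exception):
    """Local stand-in for the project's ConfigurationError."""


def descriptor_bytes(dim, bytes_per_element):
    """Raw size of one stored descriptor, ignoring any index overhead."""
    return dim * bytes_per_element


def validate_corpus_dim(strategy_name, strategy_dim, corpus_dim):
    """Fail at startup when the corpus was built for a different descriptor_dim."""
    if strategy_dim != corpus_dim:
        raise ConfigurationError(
            f"vpr.strategy {strategy_name!r} emits {strategy_dim}-d descriptors but the "
            f"corpus holds {corpus_dim}-d descriptors; rebuild the corpus before swapping"
        )


# Per-descriptor size at 2 bytes per element (FP16), as an example.
ULTRA_VPR_BYTES = descriptor_bytes(512, 2)   # 1024 bytes
MIX_VPR_BYTES = descriptor_bytes(4096, 2)    # 8192 bytes
INFLATION = MIX_VPR_BYTES // ULTRA_VPR_BYTES
```

Whatever the element size, the ratio between MixVPR and UltraVPR descriptors is 4096 / 512 = 8, so a corpus rebuilt for MixVPR needs roughly eight times the raw descriptor storage.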

Risk 3: MegaLoc / MixVPR are research-only — operators may select them by mistake

Risk: A typo or copy-pasted research config selects MegaLoc / MixVPR on an airborne binary; cold start fails.
Mitigation: AC-10 ensures fail-fast at composition-root with a clear message. Operators learn at startup, not after takeoff.

Risk 4: Test fixtures for MegaLoc / MixVPR engines don't exist in CI

Risk: Without TRT engines for these strategies, the unit tests cannot exercise the full embed_query path; they're stubbed via FakeInferenceRuntime.
Mitigation: This is fine — Step 9 / E-BBT validates the real engine path against C2-IT-01 and the C2-PT-01 NFR. The unit tests validate Protocol conformance + invariants; they don't need real engines.

Risk 5: Preprocessing duplication across strategies invites subtle bugs

Risk: A bug fix to UltraVPR's centre-crop logic doesn't propagate to MegaLoc / MixVPR.
Mitigation: This is the documented trade-off (description.md § 6). The duplication is intentional. If a bug fix is needed across strategies, each strategy's preprocessor is updated explicitly with a coordinated commit; cross-checking is part of code review.

Runtime Completeness

Named capability: secondary VprStrategy implementations for IT-12 comparative-study (architecture / E-C2 / solution.md "MegaLoc, MixVPR secondary backbones").
Production code that must exist: real MegaLocStrategy and MixVprStrategy classes calling real C7 TRT InferenceRuntime.forward with real loaded .trt engines; real concrete preprocessors with real OpenCV resize + ImageNet normalisation + FP16 cast; real L2-normalisation; real composition-root wiring paths.
Allowed external stubs: tests MAY use FakeInferenceRuntime returning pre-computed embeddings; FakeTileStore; FakeFdrClient; production wiring uses real C7 + real engines + real C6.
Unacceptable substitutes: NumPy-only forward passes (would not satisfy NFR budgets); skipping L2-normalisation (would break INV-3); shared preprocessors across strategies (would defeat description.md § 6 isolation); selecting these strategies in airborne binaries (must fail-fast per AC-10); engine load at first frame (would defer the engine-output-shape assertion past startup); per-strategy thread safety (the contract is single-thread).
