Oleksandr Bezdieniezhnykh 846670a5c5 Refactor documentation for splittable artifacts and update references
Updated various documentation files to clarify the handling of splittable artifacts, allowing for folder equivalents of key markdown files when they exceed size limits. Adjusted references in multiple sections to reflect this new structure, ensuring consistency across the research methodology. Enhanced clarity on the saving actions and artifact organization, particularly for `01_source_registry.md`, `02_fact_cards.md`, and `06_component_fit_matrix.md`. This change aims to improve usability and maintainability of the research documentation.
2026-05-08 23:39:30 +03:00


Fact Cards — C2: Visual Place Recognition

Mode A Phase 2 — engine Step 3 (Fact Extraction & Evidence Cards). Extracted from sources logged in ../01_source_registry/C2_vpr.md (see ../01_source_registry/00_summary.md for index). Confidence labels: High (L1 / verified source code), ⚠️ Medium (L1/L2 with caveat), Low (L3/L4 inferential). Bound to sub-questions in ../00_question_decomposition.md.

Index: ../00_summary.md. Sibling categories: SQ6 (FC external positioning), SQ1 (existing systems), SQ2 (canonical pipeline), C1 (VIO), C3 (matchers).

Facts in this file: VPR candidate enumeration (MixVPR, SALAD, SelaVPR, NetVLAD, EigenPlaces, AnyLoc, BoQ, DINOv2-VLAD) + Plan-phase decisions D-C2-1..D-C2-N + C2 working conclusions.


SQ3+SQ4 / C2 — Visual Place Recognition (VPR) candidate enumeration

This section opens the C2 (VPR) candidate enumeration for the per-Mode API Capability Verification gate. Per 00_question_decomposition.md SQ3+SQ4 pre-screen and Fact #26, the candidates entering this gate are: MixVPR, SALAD, SelaVPR, EigenPlaces, NetVLAD (mandatory pre-screen survivors); plus AnyLoc, BoQ, DINOv2-VLAD (conditional on INT8 quantization path). SuperGlue-as-reranker pruned outright (matcher class, not VPR class).

The 2026-05-08 sessions cover MixVPR (session 1) and SALAD (session 2); SelaVPR, EigenPlaces, NetVLAD are scheduled for subsequent sessions per the autodev session-shape note ("one VPR family per session").

The pinned mode for every C2 candidate is the same per-frame retrieval contract — query: 1× ADTi 20MP nadir frame downscaled to the candidate's native input size; cache: pre-computed descriptor table over the project's ~400 km² operational area at the AC-8.1 resolution floor (≥0.5 m/px); output: top-K cosine-similar tile candidates fed to C3 (cross-domain matcher). Per-candidate variations are: input image size, backbone, descriptor dimensionality, training-domain provenance, and inference runtime.
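The retrieval contract above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it uses plain Python lists in place of a FAISS index, toy 3-D vectors in place of the 2048-D or 8448-D descriptors, and hypothetical tile ids (`tile_a`, ...) that do not come from the source.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalise(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def top_k_tiles(
    query: Sequence[float],
    table: Dict[str, List[float]],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Return the k (tile_id, cosine similarity) pairs closest to the query.

    `table` maps a tile id to its pre-computed, L2-normalised descriptor,
    so the dot product with the normalised query equals cosine similarity.
    """
    q = l2_normalise(query)
    scored = (
        (tile_id, sum(a * b for a, b in zip(q, desc)))
        for tile_id, desc in table.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy descriptor cache; tile ids are hypothetical.
cache = {
    "tile_a": l2_normalise([1.0, 0.0, 0.0]),
    "tile_b": l2_normalise([0.9, 0.1, 0.0]),
    "tile_c": l2_normalise([0.0, 1.0, 0.0]),
}
candidates = top_k_tiles([1.0, 0.05, 0.0], cache, k=2)
print(candidates)
```

The returned list is what would be handed to C3; only the descriptor extractor changes between candidates, not this lookup.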

Fact #42 — MixVPR per-mode API capability verification (ResNet50 backbone + MixVPR aggregator on Jetson Orin Nano Super) — DOCUMENTARY PASS WITH AERIAL-DOMAIN-TRAINING CAVEAT; Jetson MVE pending

  • Statement: MixVPR (amaralibey/MixVPR, WACV 2023; canonical implementation now folded into amaralibey/openvprlab as the MixVPR aggregator class) is an MLP-Mixer-style aggregation head that consumes a CNN/ViT backbone feature map and produces a single L2-normalised global descriptor. Per the per-Mode API Capability Verification rule, the project's pinned mode is the (ResNet50 backbone, cropped at last block, ImageNet pretrained, num_unfrozen_blocks=1) + (MixVPR aggregator with in_channels=1024, in_h=20, in_w=20, out_channels=512, mix_depth=4, mlp_ratio=1, out_rows=4 → 2048-D descriptor) at 320×320 ImageNet-normalised input tuple — selected as the canonical paper config and the OpenVPRLab default resnet50_mixvpr.yaml config. Mode-enumeration query (1/3): MixVPR is parameterised by the (backbone, aggregator-shape) pair; the MixVPR(in_channels, in_h, in_w, out_channels, mix_depth, mlp_ratio, out_rows) class accepts any 4-D feature tensor that matches the configured (in_channels, in_h, in_w), so the actual mode space is the cross product of supported backbones (ResNet18/50, DinoV2 ViT-S/B/L/G-14) × aggregator hyperparameters. Per Per-Mode API rule each (backbone, aggregator) pair is a separately-cataloged candidate; the project's pinned mode is the canonical ResNet50+MixVPR pair, and any DinoV2 variant would be a separate Fact-card entry. Pinned-mode runnable example query (2/3): OpenVPRLab ships config/resnet50_mixvpr.yaml as a first-class config, runnable via python run.py --config ./config/resnet50_mixvpr.yaml for training; for inference the canonical pattern is model.eval(); descriptors = model(images) where images: torch.Tensor[B, 3, 320, 320] ImageNet-normalised, output descriptors: torch.Tensor[B, 2048] L2-normalised. The companion FAISS-based recall pipeline compute_recall_performance(descriptors, num_references, num_queries, ground_truth, k_values=[1,5,10,15]) is documented as the validation harness. 
Disqualifier-probe query (3/3): did NOT surface any documented frame-rate floor (VPR is per-frame independent, so no rate gate applies); did NOT surface any documented memory ceiling beyond the standard ResNet50+aggregator footprint (~25M parameters total → ~100 MB weights at fp32 / ~50 MB at fp16); did NOT surface any documented Jetson Orin Nano measurement; did NOT surface a documented ONNX/TensorRT export path inside OpenVPRLab itself (relies on standard PyTorch → ONNX export practice — to be resolved in C7 row, not C2). Critical documentary gap: MixVPR's published Recall@1 numbers are on ground-level VPR benchmarks (Pitts30k, MSLS, Tokyo24/7, Nordland) — NOT on aerial nadir benchmarks (AerialVL, AerialExtreMatch) which are the project's actual operating domain. Per Fact #19, AerialExtreMatch is the cross-source / cross-pitch / cross-scale benchmark this candidate must publish numbers against; the canonical MixVPR weights are not aerial-trained, so a project-domain re-train (or community aerial-retrain checkpoint, if surfaced in subsequent search) is required before the Jetson MVE can produce empirically meaningful AC-8.6 numbers. Pinned-mode sentence: "We will use MixVPR with ResNet50 backbone (cropped at last block, ImageNet pretrained) + MixVPR aggregator (in=1024×20×20, out_channels=512, mix_depth=4, mlp_ratio=1, out_rows=4 → 2048-D descriptor) at 320×320 ImageNet-normalised input, with inputs {1× ADTi 20MP nav frame stream → center-cropped + bilinearly downscaled to 320×320 + ImageNet-normalised} and expect outputs {2048-D L2-normalised global descriptor per frame; cosine top-K retrieval against pre-cached descriptor table over the operational area's tiles} on Jetson Orin Nano Super (8 GB shared, JetPack 6, ROS 2 Humble; PyTorch fp16 baseline; final inference runtime selection deferred to C7)."
  • Source: Source #57 (OpenVPRLab context7 lookup), Source #58 (MixVPR canonical paper arXiv:2303.02190 + GitHub amaralibey/MixVPR)
  • Phase: Phase 2
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer
  • Confidence: High for mode-enumeration, runnable-example, and parameter-count documentary evidence; ⚠️ Medium for Jetson Orin Nano Super latency / memory / accuracy (no documentary measurement — Jetson MVE will resolve); Low for canonical-checkpoint aerial-domain fitness (canonical weights are ground-level-trained — project-domain retrain or aerial-trained community checkpoint required)
  • Related Dimension: SQ3+SQ4 / C2 lead candidate — per-mode API capability verification gate
  • Fit Impact: DOCUMENTARY PASS for the per-mode API capability verification gate — MixVPR has a documented runnable per-mode example with the project's pinned configuration, and no documented disqualifier exists at the API/algorithm level. HOWEVER, an explicit aerial-domain-training caveat is raised: the canonical MixVPR weights are trained on GSV-Cities (street-view); the project operates on aerial nadir at 1 km AGL. This is a Plan-phase decision point — the Plan must either (a) commit to a project-domain MixVPR retrain on AerialVL / AerialExtreMatch, OR (b) source an aerial-trained community checkpoint at Plan time, OR (c) downgrade MixVPR's status to Experimental-only and elevate a different C2 candidate (SALAD / SelaVPR / EigenPlaces / NetVLAD aerial variants — all to be assessed in subsequent sessions). The deferred Jetson Orin Nano Super hardware MVE phase still gates final accuracy/latency/memory promotion. License: MIT (per amaralibey/MixVPR repo) — permissive, no copyleft, dual-track-compatible.
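The shape arithmetic behind the pinned MixVPR mode can be checked with a short sketch. The class name `MixVPRMode` is hypothetical, and the backbone stride of 16 is an assumption (a ResNet50 truncated before its final block downsamples by 16); the remaining numbers are the pinned hyperparameters quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MixVPRMode:
    """Hypothetical container for the pinned (backbone, aggregator) shapes."""

    input_size: int = 320      # square input side in pixels
    backbone_stride: int = 16  # ASSUMPTION: ResNet50 cropped at its last block
    in_channels: int = 1024
    in_h: int = 20
    in_w: int = 20
    out_channels: int = 512
    out_rows: int = 4

    @property
    def feature_side(self) -> int:
        """Spatial side of the backbone feature map fed to the aggregator."""
        return self.input_size // self.backbone_stride

    @property
    def descriptor_dim(self) -> int:
        """Flattened descriptor length: out_channels x out_rows."""
        return self.out_channels * self.out_rows

    def is_consistent(self) -> bool:
        """True when the backbone output matches the configured in_h / in_w."""
        return self.feature_side == self.in_h == self.in_w


mode = MixVPRMode()
print(mode.feature_side, mode.descriptor_dim, mode.is_consistent())
```

Changing `input_size` without also changing `in_h` and `in_w` makes `is_consistent()` return False, which is why each input size counts as a separate mode.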

Fact #43 — SALAD per-mode API capability verification (DINOv2 ViT-B/14 backbone + SALAD aggregator on Jetson Orin Nano Super) — DOCUMENTARY PASS WITH AERIAL-DOMAIN-TRAINING CAVEAT + GPL-3.0 LICENSE-TRACK FINDING + DINOV2-VIT-EXPORT RISK; Jetson MVE pending

  • Statement: SALAD (serizba/salad, CVPR 2024; canonical implementation by Sergio Izquierdo + Javier Civera, Universidad de Zaragoza) is an optimal-transport-based aggregation head that consumes a fine-tuned DINOv2 ViT backbone's spatial patch tokens + global (cls) token and produces a single L2-normalised global descriptor. Per the per-Mode API Capability Verification rule, the project's pinned mode is the (DINOv2-B/14 backbone, last 4 transformer blocks fine-tuned, return_cls_token=True) + (SALAD aggregator with m=64 clusters, score-projection MLP hidden=512+ReLU, dimensionality reduction d=768→l=128, global-token MLP d=768→256, learned dustbin scalar, Sinkhorn-Knopp optimal-transport assignment, final L2 intra+inter norm → 64×128 + 256 = 8192+256 = 8448-D descriptor) at 322×322 ImageNet-normalised input tuple — selected as the canonical paper config (Source #60 §4.1 + Table 6) and the canonical Torch-Hub default (torch.hub.load("serizba/salad", "dinov2_salad") returns this exact config per Source #59 README §Setup). Slim variants (m=16, l=32 plus a 32-D global token → 512+32 = 544-D; m=32, l=64 plus a 64-D global token → 2048+64 = 2112-D) are documented and ship as separate pretrained checkpoints — per the Per-Mode API rule each (m, l) choice is a separately-cataloged candidate, but they share the same upstream API and license; the project pins the full 8448-D variant as the canonical default and treats the slim variants as Plan-phase trade-off knobs against AC-8.3 cache budget. Mode-enumeration query (1/3): SALAD is parameterised by the (backbone, m, l, hidden_dim, train_blocks) tuple. The canonical class definition lives in serizba/salad (NOT amaralibey/openvprlab — Source #61 confirms OpenVPRLab indexes only MixVPR / GeMPool / BoQ aggregators, no SALAD class). Three pretrained checkpoints are documented: dino_salad (8448-D), dino_salad_2048_64 (2112-D), dino_salad_512_32 (544-D). 
DINOv2 backbone size enumeration (paper Table 5): S/B/L/G with parameter counts 21M / 86M / 300M / 1100M and RTX-3090 latencies 1.30 / 2.41 / 7.82 / 24.93 ms; paper picks B as the canonical balance, and we inherit that choice for the project — DINOv2-B is the only size that survives the project's AC-4.1 latency budget on Jetson Orin Nano Super even at fp16+TensorRT extrapolation, and INT8 quantization is the prerequisite for any DINOv2-L/G variant (rolls into the Conditional candidates row alongside AnyLoc). Pinned-mode runnable example query (2/3): Source #59 README §Setup ships a Torch-Hub one-liner — model = torch.hub.load("serizba/salad", "dinov2_salad"); model.eval(); model.cuda() — that loads the full 8448-D config without any per-parameter wiring. The eval CLI python3 eval.py --ckpt_path 'weights/dino_salad.ckpt' --image_size 322 322 --batch_size 256 --val_datasets MSLS Nordland is the documented inference harness; for the slim variants the same eval.py is reused with the slim checkpoint paths. The MixVPR-derived training framework (Source #59 README "Acknowledgements") is the harness serizba/salad extends — python3 main.py for training on GSV-Cities. Disqualifier-probe query (3/3): did NOT surface any documented frame-rate floor (VPR is per-frame independent, no temporal-rate gate); did NOT surface any documented memory ceiling at the algorithm level beyond the standard DINOv2-B + SALAD-aggregator footprint (~86M params backbone + small SALAD aggregator → ~344 MB weights at fp32 / ~172 MB at fp16); did NOT surface any documented Jetson Orin Nano measurement; did NOT surface a documented ONNX/TensorRT export path inside serizba/salad itself (relies on standard PyTorch → ONNX export + TensorRT — to be resolved in C7 row, not C2). 
Three new disqualifier-class findings raised that did NOT surface for MixVPR: (i) GPL-3.0 license (Source #59 LICENSE file = GNU GENERAL PUBLIC LICENSE v3) — places SALAD on the GPL-3.0 license track alongside OpenVINS / VINS-Mono / VINS-Fusion / ORB-SLAM3, NOT on the BSD/permissive track where MixVPR (MIT) sits. This materially affects D-C1-1 license-posture interaction: if the project locks the BSD/permissive track at Plan time (D-C1-1 = (b)), SALAD becomes unusable as a C2 candidate, leaving only MixVPR + EigenPlaces + NetVLAD on the BSD/permissive C2 candidate axis. (ii) DINOv2-ViT-B export risk — paper §5 explicitly flags "the adoption of DINOv2 as our backbone results in slower processing speeds compared to ResNet-based methods"; ViT export to TensorRT is more fragile than ResNet export; INT8 quantization of ViTs is harder than CNNs (well-known industry signal). The Jetson MVE phase (D-C1-2 + D-C2-4 deferred bring-up) must validate the DINOv2-B → TensorRT fp16 export path BEFORE SALAD's documentary lead can be promoted to Selected. (iii) Descriptor-cache budget pressure — at the canonical 8448-D the descriptor cache for ~400 km² @ 0.5 m/px tiles (~160k tiles × 8448 dim × 2 bytes fp16) consumes ~2.7 GB, ~27% of the AC-8.3 10 GB cache budget — vs MixVPR's ~650 MB / 6.5%. The slim 544-D variant restores feasibility (~0.17 GB / 1.7%) at the cost of ~5 R@1 points on MSLS Challenge — Plan-phase trade-off raises D-C2-6 SALAD descriptor-size choice as a new gate. Critical documentary gap (same as MixVPR): SALAD's published Recall@1 numbers are on ground-level VPR benchmarks (MSLS Challenge/Val, Pittsburgh250k-test, SPED, NordLand, SF-XL) — NOT on aerial nadir benchmarks (AerialVL, AerialExtreMatch). Per Fact #19 + Fact #26, this is the SAME aerial-domain-training caveat raised by the MixVPR closure (Fact #42 Fit Impact) — D-C2-1 (canonical-vs-aerial-retrain-vs-community-aerial-checkpoint) applies to SALAD identically. 
Pinned-mode sentence: "We will use SALAD with DINOv2 ViT-B/14 backbone (last 4 transformer blocks fine-tuned, return_cls_token=True) + SALAD aggregator (m=64 clusters, score-projection MLP hidden=512+ReLU, dim-reduction d=768→l=128, global-token MLP d=768→256, learned dustbin scalar, Sinkhorn-Knopp optimal-transport assignment, final L2 intra+inter norm → 8192+256 = 8448-D descriptor) at 322×322 ImageNet-normalised input, with inputs {1× ADTi 20MP nav frame stream → center-cropped + bilinearly downscaled to 322×322 + ImageNet-normalised} and expect outputs {8448-D L2-normalised global descriptor per frame; cosine top-K retrieval against pre-cached descriptor table over the operational area's tiles} on Jetson Orin Nano Super (8 GB shared, JetPack 6, ROS 2 Humble; PyTorch fp16 + TensorRT baseline; final inference runtime selection deferred to C7)."
  • Source: Source #59 (serizba/salad README + LICENSE WebFetch), Source #60 (canonical paper arXiv:2311.15937 v2 / CVPR 2024), Source #61 (OpenVPRLab DinoV2 backbone context7 cross-source — disconfirmation that OpenVPRLab ships SALAD)
  • Phase: Phase 2
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer + license-posture decision-maker
  • Confidence: High for mode-enumeration, runnable-example, parameter-count, license, RTX-3090 latency, and ground-level-benchmark Recall@K documentary evidence; ⚠️ Medium for Jetson Orin Nano Super latency / memory / accuracy (no documentary measurement — Jetson MVE will resolve); ⚠️ Medium for DINOv2-ViT-B → TensorRT fp16 / INT8 export quality (paper-acknowledged "slower processing" + industry signal of harder ViT export — C7 + Jetson MVE will resolve); Low for canonical-checkpoint aerial-domain fitness (same caveat as MixVPR — canonical weights are GSV-Cities ground-level-trained, no aerial-nadir benchmark in canonical paper)
  • Related Dimension: SQ3+SQ4 / C2 lead candidate — per-mode API capability verification gate
  • Fit Impact: DOCUMENTARY PASS for the per-mode API capability verification gate — SALAD has a documented runnable per-mode example with the project's pinned configuration (Torch-Hub one-liner), three documented descriptor-size variants, and no API-level disqualifier. Three explicit caveats are raised, two new vs MixVPR and one shared: (i) GPL-3.0 license-track placement (NEW vs MixVPR-MIT) — interacts with D-C1-1 license-posture decision; if the project locks BSD/permissive track at Plan time, SALAD is excluded. (ii) DINOv2-ViT-B export risk (NEW vs MixVPR-ResNet50) — paper-acknowledged + industry-signal risk that DINOv2-B → TensorRT fp16 / INT8 path on Jetson Orin Nano Super may not deliver the latency budget; Jetson MVE phase is more critical for SALAD than for MixVPR. (iii) Aerial-domain-training caveat (SHARED with MixVPR via D-C2-1) — canonical weights are GSV-Cities street-view, not aerial-nadir; same Plan-phase decision (project-domain retrain / aerial-trained community checkpoint / elevate alternate C2 candidate) applies. Against these caveats, SALAD's accuracy advantage is material: paper Table 1 shows SALAD-full MSLS Challenge R@1 = 75.0 vs MixVPR's 64.0 (+11 R@1 absolute), and SALAD's slim 544-D variant already outperforms MixVPR (70.8 vs 64.0, +6.8 R@1) with a smaller descriptor. That advantage is measured on ground-level benchmarks only; aerial-domain transfer is uncharted in the canonical paper. Two new Plan-phase decisions raised by SALAD closure: D-C2-5 DINOv2 ViT-export to TensorRT fp16/INT8 path on Jetson Orin Nano Super (applies to every ViT-based C2 candidate: SALAD, SelaVPR, AnyLoc, BoQ, DINOv2-VLAD); D-C2-6 SALAD descriptor-size choice (8448-D / 2112-D / 544-D — interacts with D-C2-2 cache carve-out and AC-8.3 differently than MixVPR; full 8448-D consumes ~27% of cache budget). The deferred Jetson Orin Nano Super hardware MVE phase still gates final accuracy/latency/memory promotion (D-C1-2 + D-C2-4). 
License: GPL-3.0 (per serizba/salad LICENSE file = GNU GPL v3) — copyleft, GPL-3.0 license track.
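The descriptor-cache figures quoted for SALAD and MixVPR follow from tiles × dimensions × bytes per value. The sketch below reproduces that arithmetic under two assumptions taken from the text: roughly 160k tiles and a 10 GB budget, with GB read as decimal (10^9 bytes). The function and variable names are illustrative only.

```python
def descriptor_cache_bytes(num_tiles: int, dim: int, bytes_per_value: int = 2) -> int:
    """Size of a dense descriptor table; bytes_per_value=2 corresponds to fp16."""
    return num_tiles * dim * bytes_per_value


NUM_TILES = 160_000           # approximate tile count quoted in the text
BUDGET_BYTES = 10 * 10**9     # AC-8.3 cache budget, ASSUMING decimal gigabytes

descriptor_dims = {
    "SALAD full": 8448,
    "SALAD slim 2112-D": 2112,
    "SALAD slim 544-D": 544,
    "MixVPR": 2048,
}

sizes = {
    name: descriptor_cache_bytes(NUM_TILES, dim)
    for name, dim in descriptor_dims.items()
}

for name, size in sizes.items():
    share = 100.0 * size / BUDGET_BYTES
    print(f"{name}: {size / 1e9:.2f} GB, {share:.1f}% of budget")
```

Because the table grows linearly with descriptor width, the slim variants shrink the cache by the same factor as their dimensionality, which is the trade-off D-C2-6 has to weigh against recall.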

C2 — Per-Mode API Capability Verification (engine Step 2 — Mandatory context7 lookup) [2026-05-08 sessions, MixVPR + SALAD pass]

This section catalogs per-candidate Per-Mode API Capability Verification entries for C2. Pre-screen survivors (per Fact #26 + 00_question_decomposition.md SQ3+SQ4): MixVPR (session 1), SALAD (session 2), SelaVPR, EigenPlaces, NetVLAD, AnyLoc (conditional), BoQ (conditional), DINOv2-VLAD (conditional). Each candidate gets a pinned-mode statement, three-query context7 (or equivalent) lookup, MVE block, and per-numbered-Restriction × per-numbered-AC sub-matrix.

| Candidate | Per-mode verification | Status | License track | Session |
|---|---|---|---|---|
| MixVPR (ResNet50+MixVPR @ 320×320 → 2048-D) | Fact #42 + Source #57 + #58 | Documentary lead with aerial-domain-training caveat | BSD/permissive (MIT) | 2026-05-08 (session 1) |
| SALAD (DINOv2 ViT-B+SALAD @ 322×322 → 8448-D, with slim 2112-D / 544-D variants) | Fact #43 + Source #59 + #60 + #61 | Documentary lead with aerial-domain-training caveat + GPL-3.0-license-track caveat + DINOv2-ViT-export risk caveat | GPL-3.0 (canonical) | 2026-05-08 (session 2) |
| SelaVPR (DINOv2 ViT+SelaVPR) | | NOT STARTED | (TBD) | next session |
| EigenPlaces (ResNet50+EigenPlaces) | | NOT STARTED | (TBD) | next session |
| NetVLAD class (VGG16+NetVLAD or AlexNet+NetVLAD) | | NOT STARTED | (TBD) | next session |
| AnyLoc (DINOv2 ViT-G+VLAD) | | NOT STARTED — conditional on INT8 quantization | (TBD) | conditional |
| BoQ (DINOv2 ViT-B+BoQ) | | NOT STARTED — conditional on INT8 quantization | (TBD) | conditional |
| DINOv2-VLAD (DINOv2 direct + VLAD pooling) | | NOT STARTED — conditional on INT8 quantization | (TBD) | conditional |

C2 — Minimum Viable Example (MVE) Blocks

MVE — MixVPR with ResNet50 backbone @ 320×320 → 2048-D descriptor

  • Source: Source #57 (OpenVPRLab context7 → https://context7.com/amaralibey/openvprlab/llms.txt — code snippets Initialize and Use MixVPR Aggregator, Initialize VPRFramework and Perform Inference, Compute Recall Performance with FAISS, Train and Monitor VPR Models via CLI), accessed 2026-05-08; Source #58 (MixVPR canonical paper Ali-bey et al. WACV 2023 arXiv:2303.02190 + amaralibey/MixVPR GitHub)
  • Inputs in the example: GSV-Cities images at 320×320 (ImageNet mean/std normalised); batch tensor images: torch.Tensor[B, 3, 320, 320]; ResNet50 (cropped at last block, ImageNet pretrained, num_unfrozen_blocks=1) → 1024-channel feature map at 20×20; MixVPR aggregator hyperparameters (in_channels=1024, in_h=20, in_w=20, out_channels=512, mix_depth=4, mlp_ratio=1, out_rows=4)
  • Outputs in the example: descriptors: torch.Tensor[B, 2048] L2-normalised; cosine retrieval via FAISS (compute_recall_performance harness reports Recall@{1,5,10,15})
  • Project inputs: 1× ADTi 20MP nav frame stream (5472×3648, target 3 fps) → center-cropped to 3648×3648 (square) → bilinearly downscaled to 320×320 → ImageNet-normalised → fp16 batch on Jetson Orin Nano Super
  • Project outputs required: 2048-D L2-normalised global descriptor per frame; cosine top-K (project default K=10 per Fact #25) against pre-cached descriptor table over the ~400 km² operational area's tiles at AC-8.1 resolution floor; satisfies AC-8.6 retrieval-recall requirement under cross-season / cross-domain / scene-change conditions; satisfies AC-4.1 latency budget for steady-state and AC-NEW-2 spoofing-promotion path
  • Match assessment: exact mode match for (ResNet50 backbone, MixVPR aggregator, 320×320 input, 2048-D output); FAISS retrieval harness exists; ⚠️ partial input domain (canonical weights trained on GSV-Cities ground-level imagery vs project's nadir aerial 1 km AGL — domain shift unverified); ⚠️ partial Jetson Orin Nano Super measurement (no documented benchmark); ⚠️ partial inference runtime (PyTorch fp16 baseline assumed; ONNX/TensorRT export path is C7's job, not MixVPR's)
  • If ⚠️ or : docs do not explicitly disqualify the configuration. The (backbone, aggregator) pair, input size, normalisation, and output shape are all documented and runnable as-is. The aerial-domain-training caveat is not an API/mode disqualifier (the API works on any ImageNet-normalised image); it is an accuracy caveat that the deferred Jetson MVE phase + Plan-phase retrain decision will resolve. → Status: Documentary lead with aerial-domain-training caveat; final promotion to "Selected" requires (a) Plan-phase decision on canonical-vs-aerial-retrain-vs-community-aerial-checkpoint, AND (b) Jetson Orin Nano Super hardware MVE phase artifact (latency, memory, AerialExtreMatch Recall@K).
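The "Project inputs" preprocessing chain (center crop to a square, downscale, ImageNet normalisation) can be sanity-checked without any imaging library. The sketch below only computes the crop box, the downscale factor and a per-pixel normalisation; the function names are hypothetical, and the mean and standard deviation are the commonly used ImageNet constants rather than values quoted in the source.

```python
from typing import Tuple

# Commonly used ImageNet per-channel statistics (RGB, inputs scaled to [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def center_crop_box(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def normalise_pixel(rgb: Tuple[int, int, int]) -> Tuple[float, ...]:
    """Map one 8-bit RGB pixel to ImageNet-normalised floats."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


box = center_crop_box(5472, 3648)   # ADTi 20MP frame size from the text
crop_side = box[2] - box[0]
downscale = crop_side / 320         # linear reduction to the 320x320 input
print(box, crop_side, downscale)
print(normalise_pixel((255, 255, 255)))
```

The downscale factor of about 11 per axis means each network input pixel averages over roughly 130 sensor pixels, which is worth keeping in mind when comparing the query's effective ground resolution with the tile cache.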

MVE — SALAD with DINOv2 ViT-B/14 backbone @ 322×322 → 8448-D descriptor (canonical full variant; slim 2112-D and 544-D variants documented as separately-cataloged sibling modes)

  • Source: Source #59 (serizba/salad README + LICENSE WebFetch — Torch-Hub one-liner model = torch.hub.load("serizba/salad", "dinov2_salad"), eval CLI python3 eval.py --ckpt_path 'weights/dino_salad.ckpt' --image_size 322 322 --batch_size 256 --val_datasets MSLS Nordland, three pretrained checkpoints), accessed 2026-05-08; Source #60 (canonical paper arXiv:2311.15937 v2 / Izquierdo & Civera CVPR 2024 — §4.1 implementation details + Table 1 per-variant Recall@K + Table 5 DINOv2-size ablation + Table 6 train-blocks ablation); Source #61 (OpenVPRLab DinoV2 backbone context7 cross-source — confirms ViT-B/14 spatial-feature backbone API at 322×322, disconfirms OpenVPRLab as a SALAD aggregator source)
  • Inputs in the example: GSV-Cities images for training at 224×224, MSLS / Nordland / Pittsburgh250k-test / SPED / SF-XL evaluation images at 322×322 (ImageNet mean/std normalised; must be divisible by 14 → 322/14 = 23 patches per side → 23×23 = 529 spatial tokens + 1 global cls token); batch tensor images: torch.Tensor[B, 3, 322, 322]; DINOv2-B/14 backbone (768-dim tokens, 86M params, last 4 transformer blocks fine-tuned, return_cls_token=True) → spatial feature tensor [B, 768, 23, 23] + global token [B, 768]; SALAD aggregator hyperparameters (m=64 clusters, hidden=512+ReLU score-projection MLP, dim-reduction d=768→l=128, global-token MLP d=768→256, learned dustbin scalar z, Sinkhorn-Knopp optimal-transport assignment, final L2 intra+inter norm)
  • Outputs in the example: full variant descriptors: torch.Tensor[B, 8448] (= m×l + global = 64×128 + 256 = 8192+256) L2-normalised; slim variants [B, 2112] (m=32, l=64, 64-D global token: 2048+64) or [B, 544] (m=16, l=32, 32-D global token: 512+32). Cosine top-K retrieval against pre-cached descriptor table; canonical paper Table 1 reports MSLS Challenge R@1 = 75.0 / 73.7 / 70.8 across full / 2112-D / 544-D variants on RTX 3090; canonical paper Table 5 reports DINOv2-B forward pass at 2.41 ms per image on RTX 3090 batch=1 at 322×322 (full SALAD pipeline 2.41 ms per Table 1 footnote — aggregator overhead negligible)
  • Project inputs: 1× ADTi 20MP nav frame stream (5472×3648, target 3 fps) → center-cropped to 3648×3648 (square) → bilinearly downscaled to 322×322 → ImageNet-normalised → fp16 batch on Jetson Orin Nano Super
  • Project outputs required: 8448-D (or slim 2112-D / 544-D — Plan-phase choice per D-C2-6) L2-normalised global descriptor per frame; cosine top-K (project default K=10 per Fact #25) against pre-cached descriptor table over the ~400 km² operational area's tiles at AC-8.1 resolution floor; satisfies AC-8.6 retrieval-recall requirement under cross-season / cross-domain / scene-change conditions; satisfies AC-4.1 latency budget for steady-state and AC-NEW-2 spoofing-promotion path
  • Match assessment: exact mode match for (DINOv2-B/14 backbone, SALAD aggregator, 322×322 input, 8448-D / 2112-D / 544-D output); Torch-Hub runnable one-liner exists for the full variant; eval CLI ships with documented per-variant checkpoints; ⚠️ partial input domain (canonical weights trained on GSV-Cities ground-level imagery vs project's nadir aerial 1 km AGL — domain shift unverified, same caveat as MixVPR); ⚠️ partial Jetson Orin Nano Super measurement (no documented benchmark); ⚠️ partial inference runtime — paper-acknowledged "DINOv2 backbone results in slower processing speeds compared to ResNet-based methods" + industry signal that ViT export to TensorRT is more fragile than ResNet export + INT8 quantization of ViTs is harder than CNNs (PyTorch fp16 baseline assumed; ONNX/TensorRT/INT8 export path is C7's job + new D-C2-5 deferred gate, not SALAD's algorithmic responsibility); ⚠️ descriptor-cache budget at full 8448-D consumes ~2.7 GB / ~27% of AC-8.3 10 GB cache budget over ~400 km² @ 0.5 m/px (vs MixVPR's ~650 MB / 6.5%) — slim 2112-D ~0.68 GB / 6.8%, slim 544-D ~0.17 GB / 1.7% — Plan-phase D-C2-6 trade-off
  • If ⚠️ or : docs do not explicitly disqualify the algorithmic mode. The (backbone, aggregator, m, l, train_blocks) tuple, input size, normalisation, and output shape are all documented and runnable as-is via the Torch-Hub one-liner. However, three caveats elevate the verification gate's risk profile beyond MixVPR's: (i) GPL-3.0 license-track placement — Source #59 LICENSE = GNU GPL v3, copyleft; interacts with D-C1-1 license-posture decision (BSD/permissive lock at Plan time would exclude SALAD); (ii) DINOv2-ViT-B export risk to Jetson Orin Nano Super at fp16/INT8 — paper-acknowledged "slower" + industry signal — Jetson MVE phase is more critical for SALAD than for CNN-based candidates; (iii) Aerial-domain-training caveat — same as MixVPR via D-C2-1. → Status: Documentary lead with aerial-domain-training caveat + GPL-3.0-license-track caveat + DINOv2-ViT-export risk caveat; final promotion to "Selected" requires (a) Plan-phase decision on D-C2-1 (canonical-vs-aerial-retrain-vs-community-aerial-checkpoint), (b) Plan-phase decision on D-C1-1 license-posture (must allow GPL-3.0 track or SALAD is excluded), (c) Plan-phase decision on D-C2-6 SALAD descriptor-size choice (interacts with D-C2-2 cache carve-out), AND (d) Jetson Orin Nano Super hardware MVE phase artifact (latency, memory, DINOv2-B → TensorRT fp16 export quality, AerialExtreMatch Recall@K).
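The token-grid and descriptor-size arithmetic for SALAD can be written out as a small check. The full-variant numbers (patch size 14, m=64, l=128, 256-D global token) are taken from the text above. The slim-variant triples are an assumption inferred from the checkpoint names `dino_salad_2048_64` and `dino_salad_512_32`, chosen so that cluster part plus global token adds up to the documented 2112-D and 544-D. The helper names are hypothetical.

```python
def patch_grid_side(image_size: int, patch_size: int = 14) -> int:
    """Number of ViT patches per side; the input must be divisible by the patch size."""
    if image_size % patch_size != 0:
        raise ValueError(f"{image_size} is not divisible by patch size {patch_size}")
    return image_size // patch_size


def salad_descriptor_dim(num_clusters: int, cluster_dim: int, token_dim: int) -> int:
    """Descriptor length: m clusters of l dims each, plus the global-token part."""
    return num_clusters * cluster_dim + token_dim


side = patch_grid_side(322)
spatial_tokens = side * side

variants = {
    "full": (64, 128, 256),      # from the pinned mode
    "slim_2112": (32, 64, 64),   # ASSUMPTION: inferred from checkpoint name
    "slim_544": (16, 32, 32),    # ASSUMPTION: inferred from checkpoint name
}
dims = {name: salad_descriptor_dim(*cfg) for name, cfg in variants.items()}

print(side, spatial_tokens, dims)
```

An input such as 320×320 raises `ValueError` here, which mirrors why the SALAD mode is pinned at 322×322 rather than reusing MixVPR's 320×320.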

C2 — Per-numbered-Restriction × Per-numbered-AC Sub-Matrix per Candidate

Per item 4 of the Per-Mode API Capability Verification rule, every numbered Restriction line and every numbered Acceptance Criterion is bound to one of {Pass, Fail, Verify, N/A} per candidate, with a one-line evidence cite. Lines marked N/A are out of C2 scope (handled by C1 / C3 / C4 / C5 / C6 / C7 / C8 / C9 / C10). Cells marked Verify block final "Selected" promotion until the Jetson Orin Nano Super hardware MVE phase and the Plan-phase aerial-training decision resolve them.

Sub-matrix legend

  • Pass: pinned mode satisfies the line with cited documentary evidence
  • Fail: pinned mode contradicts the line with cited documentary evidence
  • Verify: no documentary evidence either way; deferred Jetson MVE phase will resolve (or Plan-phase aerial-training decision)
  • N/A: line is irrelevant to C2 (will be bound by C1/.../C10 in their respective rows)
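One way to make the legend machine-checkable is to model the four bindings as an enum and derive a candidate's promotion state from its rows. This is only a sketch: the names `Binding` and `promotion_status` and the status strings are hypothetical, and treating any Fail as a rejection is an assumption, since the legend states only that Verify cells block promotion.

```python
from enum import Enum
from typing import Dict


class Binding(Enum):
    PASS = "Pass"
    FAIL = "Fail"
    VERIFY = "Verify"
    NA = "N/A"


def promotion_status(bindings: Dict[str, Binding]) -> str:
    """Summarise a candidate's sub-matrix into a single promotion state."""
    values = set(bindings.values())
    if Binding.FAIL in values:
        return "rejected"    # ASSUMPTION: a documented contradiction disqualifies
    if Binding.VERIFY in values:
        return "blocked"     # per the legend: Verify blocks "Selected"
    return "selectable"      # only Pass and N/A remain


# Illustrative rows; mixed "Pass (API) -> Verify (recall)" cells are recorded as Verify.
mixvpr_rows = {
    "AC-1.1": Binding.VERIFY,
    "AC-1.2": Binding.VERIFY,
    "AC-2.1b": Binding.VERIFY,
    "AC-3.3": Binding.VERIFY,
    "AC-4.3": Binding.NA,
}
print(promotion_status(mixvpr_rows))
```

Recording mixed cells at their weakest binding keeps the gate conservative: a candidate becomes selectable only once every Verify has been resolved by measurement.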

Cross-cutting N/A lines (apply to ALL C2 candidates)

The following AC and Restriction lines are out of C2 scope and are marked N/A for every C2 candidate without per-candidate citation:

  • All of AC-1.3, AC-1.4 (frame-to-frame drift, source label) — bound by C1 (VIO) + C5 (state estimator)
  • All of AC-2.1a (frame-to-frame registration) — bound by C1
  • AC-2.2 frame-to-frame branch (<1.0 px MRE) — bound by C1; the cross-domain branch (<2.5 px MRE) is bound by C3 (matcher), not C2 — C2 only enables the matcher by retrieving the right tile candidates
  • All of AC-3.1, AC-3.2, AC-3.4, AC-3.5 (sharp-turn outliers, sharp-turn recovery, operator re-loc, visual blackout) — bound by C1 + C5 + C8
  • AC-4.3 (FC output contract) — bound by C8
  • AC-4.4 (frame-by-frame, no batching) — bound by C5 (estimator emits per-frame); C2 contributes per-frame retrieval but the no-batching contract is C5's
  • AC-4.5 (corrections allowed) — bound by C5
  • All of AC-5.x (initialisation, failsafe, reboot) — bound by C5
  • All of AC-6.x (GCS telemetry) — bound by C8
  • All of AC-7.x (AI-camera object localization) — bound outside C2 entirely
  • AC-8.1 ortho-resolution sub-bound (raw cache provider format) — bound by C6 (tile cache pipeline); C2 only consumes the cache-interface descriptors
  • AC-8.2 (tile freshness policy) — bound by C6 (cache provisioning) + C10 (provisioning); C2's freshness consumption is captured under AC-NEW-6
  • AC-8.3 (offline pre-load of imagery) — bound by C10 (provisioning)
  • AC-8.4 (mid-flight tile generation) — bound by C5 (state estimator quality gate, source-label satellite_anchor requirement) + C6 (tile cache write + single-frame orthorectification responsibility, REASSIGNED 2026-05-08 per user-locked C4 = Pose estimation definition). Before 2026-05-08 this line was bound to "C4 (orthorectification)", when C4 = Single-frame orthorectification; the user-locked redefinition reassigns the orthorectification responsibility to C6 (write-side cache concern), since C6 already owns tile cache write per this same binding. C4 now = Pose estimation (PnP + RANSAC + LM) and is bound to AC-2.1b satellite-anchor registration + AC-1.1/1.2 frame-center pose accuracy, NOT AC-8.4 mid-flight tile generation. No new component slot is created; the original C4..C10 numbering is preserved.
  • AC-8.5 (no raw frame retention) — bound by C5 + system-wide FDR
  • All of AC-NEW-1 (cold-start TTFF) — bound by C5 + C7; C2's contribution is the descriptor-table-load latency, captured under AC-4.2 sub-row
  • All of AC-NEW-3 (FDR records) — bound by C5 + system-wide
  • All of AC-NEW-4 (false-position safety budget — covariance honesty) — bound by C5; C2's retrieval-quality contribution is captured under AC-NEW-7
  • All of AC-NEW-5 (operating environmental envelope) — bound by C7 + system-wide thermal
  • AC-NEW-8 (visual blackout + GPS spoofing degraded mode) — bound by C5 + C1 (IMU-only propagation); C2's role is to enable re-anchor when visual returns, captured under AC-3.3 row
  • Restriction "UAV & Flight" sub-bullets:
    • "Fixed-wing UAVs only" — N/A (not a VPR concern)
    • "Operational area: eastern/southern Ukraine" — Pass via training-domain match; explicit row below
    • "Mission profile: 8-hour flights" — N/A (VPR is stateless per-frame)
    • "Sharp turns are exceptions" — N/A (VPR is per-frame independent)
    • "No raw-photo storage" — N/A (bound by C5 / system-wide FDR)
  • Restriction "Cameras" — AI camera + camera-to-companion interface — N/A (VPR consumes nav-camera frames only)
  • Restriction "Sensors & Integration" — N/A (VPR does not consume FC IMU)
  • Restriction "Communication protocol" / "Output to FC" / "Ground station" — bound by C8
  • Restriction "Failsafe & Safety" — bound by C5 + C8

MixVPR — per-numbered binding (C2-relevant lines only; cross-cutting N/A above)

| Line | Binding | Evidence (one-line cite) |
| --- | --- | --- |
| AC-1.1 (frame-center within 50 m, ≥80% normal-flight photos) | Verify (downstream) | C2's retrieval-correctness contribution to AC-1.1 is the prerequisite for C3+C4's geometric pose; documentary evidence of MixVPR retrieval recall on aerial nadir at AC-8.1 resolution floor is absent. Plan-phase aerial-training decision + Jetson MVE on Derkachi flight required |
| AC-1.2 (frame-center within 20 m, ≥50% normal-flight photos) | Verify (downstream) | Same as AC-1.1, tighter tail; AerialExtreMatch Recall@1 stratified by difficulty cell is the documentary target |
| AC-2.1b (satellite-anchor registration succeeds, AC-1.1/1.2 + AC-2.2 + AC-8.2 + AC-8.6 conditions) | Verify (downstream) | C2 produces the top-K retrieval that C3+C4 consume; the success rate of the retrieval stage under AC-2.1b is what C2 owns. Jetson MVE measurement on AerialExtreMatch + Derkachi flight |
| AC-3.3 (≥3 disconnected segments via satellite-reference re-localization) | Pass (API) → Verify (recall) | MixVPR's per-frame top-K cosine retrieval is structurally suited to re-localization (no temporal state required); recall under cross-season / scene-change scoring is unverified. AerialExtreMatch + Plan-phase aerial-training decision required |
| AC-4.1 (latency <400 ms p95, end-to-end camera→FC) | Verify | MixVPR canonical paper reports 1.21 ms inference per image on A100 at 320×320 batch=1; Jetson Orin Nano Super equivalent extrapolated to ~10–30 ms (fp16, TensorRT), well within budget assuming C7's TensorRT/ONNX deployment delivers the fp16 path; ResNet50 is well-supported on TensorRT; to be resolved by Jetson MVE measurement |
| AC-4.2 (memory <8 GB shared) | Verify | ResNet50 + MixVPR weights ~25 MB at fp16; activations at 320×320 batch=1 ~50 MB; descriptor cache for ~400 km² @ 0.5 m/px tiles → ~160k tiles × 2048 dims × 2 bytes (fp16) = ~650 MB, or ~325 MB at int8, well within budget assuming C6's tile-cache carve-out negotiates the descriptor-table allocation; co-resident memory pressure with C1/C3/C5/C6 unverified; to be resolved by Jetson MVE measurement |
| AC-8.1 (cache-interface resolution ≥0.5 m/px, ideally 0.3 m/px) | Pass (with Verify) | MixVPR is resolution-agnostic at the algorithm level (ResNet50 accepts any 320×320 input regardless of source GSD); the question is whether descriptor recall holds at 0.5 m/px tile GSD vs nav-camera 12 cm/px GSD. Cross-resolution generalization unverified; AerialExtreMatch cross-scale cells (Fact #19) are the documentary target |
| AC-8.6: Scale-ratio (any UAV-frame ground footprint at deployment altitude must be retrievable) | Verify | At 1 km AGL the nav-camera frame footprint is 470×314 m to 980×655 m (per restrictions.md); MixVPR's 320×320 input must be center-cropped + downscaled from this range. Cross-scale recall at AC-8.6 spec is exactly the AerialExtreMatch test cell; to be resolved by Jetson MVE measurement |
| AC-8.6: Scene change in active-conflict sectors | Verify | Cratering / building destruction / road realignment is exactly the AerialExtreMatch "scene-change" cell + the Skoltech aerial-VPR survey (Source #38) cropland/season class; canonical MixVPR weights are not aerial-trained, so the Plan-phase aerial-training decision will materially affect this row |
| AC-8.6: Compute & latency under steady-state and re-loc-trigger | Verify | MixVPR's per-frame compute is constant (no temporal state), so the "re-loc-trigger workload" and "steady-state workload" have the same per-frame latency; co-resident memory + GPU-time pressure under simultaneous C1+C2+C3 inference unverified; to be resolved by Jetson MVE measurement |
| AC-NEW-2 (spoofing-promotion latency <3 s p95) | Pass (latency budget) → Verify (recall) | MixVPR per-frame latency at fp16 + TensorRT well under 3 s budget; the gating constraint is whether the re-anchor retrieval succeeds on the first or first-few frames after spoofing detection. Recall under "first-frame after spoof onset" condition unverified, Jetson MVE on Derkachi flight required |
| AC-NEW-6 (imagery freshness: never satellite_anchored on stale-tile match) | Pass (mechanical) | MixVPR returns top-K with cosine scores; the freshness-age decision is a downstream filter on the retrieved candidates (C5 / C6 owns the freshness-aware ranking); MixVPR provides the input |
| AC-NEW-7 (cache-poisoning safety budget: P(>30 m geo-misalign) <1%, P(>100 m) <0.1%) | Verify (downstream) | MixVPR's contribution is retrieval correctness under mid-flight-written tile (AC-8.4) presence; if a misaligned mid-flight tile has a near-correct descriptor it can poison the retrieval. Multi-flight Monte Carlo replay is the validation; the Plan-phase aerial-training decision affects this |
| Restriction "Operational area: eastern/southern Ukraine" (VPR train-domain match) | ⚠️ Documentary gap → Verify | Canonical MixVPR weights are GSV-Cities (street-level) trained; Skoltech aerial-VPR survey (Source #38) demonstrates aerial-trained MixVPR retrains with materially different recall. Plan-phase decision: (a) project-domain retrain on AerialVL, (b) source aerial-trained community checkpoint, (c) elevate a different C2 candidate |
| Restriction "Altitude ≤1 km AGL; terrain assumed flat (rolling steppe / agricultural)" (VPR scale band match) | Verify | Same as AC-8.6 scale-ratio row; cross-scale recall at the project's altitude band is the AerialExtreMatch cross-scale cell |
| Restriction "Weather: predominantly sunny ... seasonal/visibility classes (summer crops, autumn/winter bare fields, cloud/haze, snow if winter)" (VPR cross-season generalization) | Verify | Cross-season VPR is exactly the dominant aerial-VPR failure mode per Fact #19 + SQ5; canonical MixVPR weights are single-domain, so the Plan-phase aerial-training decision is the primary lever |
| Restriction "Navigation camera (pinned): ADTi 20MP, 5472×3648" | Pass (API) | MixVPR consumes any 320×320 ImageNet-normalised input; the 5472×3648 → 320×320 downscale is mechanical (center-crop + bilinear); information loss at downscale is the actual concern but is shared with all 320×320-input C2 candidates (Plan-time may consider C2 candidates with other input sizes, such as SelaVPR @ 224×224 for ViT or AnyLoc @ 322×322 for DINOv2) |
| Restriction "Satellite Imagery / resolution ≥0.5 m/px" (VPR descriptor pipeline at AC-8.1 floor) | Verify | Same as AC-8.1; MixVPR resolution-agnostic at API level, recall at 0.5 m/px tile GSD vs 12 cm/px nav-camera GSD unverified |
| Restriction "Satellite Imagery / Cache budget: 10 GB" (descriptor budget carve-out) | Pass (with Verify) | MixVPR descriptor cache estimate ~650 MB fp16 / ~325 MB int8 over ~400 km² @ 0.5 m/px; comfortably within the 10 GB cache budget assuming C6 carves out a descriptor-table allocation; AC-8.3 explicitly says "Pre-extracted descriptors/indices count against the cache budget unless explicitly carved out". Plan-phase decision |
| Restriction "Companion computer: Jetson Orin Nano Super, 8 GB shared" | Verify | ResNet50 fp16 inference is well within Jetson Orin Nano Super capability (well-supported on TensorRT); steady-state co-resident memory + GPU-time with C1 (VIO) + C3 (matcher) unverified; to be resolved by Jetson MVE measurement |
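
The AC-3.3 and AC-NEW-6 rows describe a stateless per-frame top-K cosine retrieval whose candidates are then filtered for tile freshness by a downstream owner. A minimal pure-Python sketch of that split follows. The function names, the toy 3-D descriptors, and the `max_age_days` threshold are all hypothetical illustrations, not project or MixVPR APIs.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Candidate = Tuple[int, float]  # (tile index, cosine score)


def l2_normalise(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def top_k_cosine(query: Sequence[float],
                 tile_descriptors: Sequence[Sequence[float]],
                 k: int) -> List[Candidate]:
    """Return the k best (tile index, cosine score) pairs, best first.

    Both sides are L2-normalised, so the dot product is the cosine similarity.
    No temporal state is kept: every frame is scored independently.
    """
    q = l2_normalise(query)
    scored = []
    for idx, desc in enumerate(tile_descriptors):
        d = l2_normalise(desc)
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


def drop_stale(candidates: Sequence[Candidate],
               tile_age_days: Sequence[float],
               max_age_days: float) -> List[Candidate]:
    """Hypothetical downstream filter: keep only candidates on fresh tiles."""
    return [(i, s) for i, s in candidates if tile_age_days[i] <= max_age_days]


if __name__ == "__main__":
    tiles = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    ages = [400.0, 30.0, 10.0, 5.0]  # days since tile capture (made-up values)
    hits = top_k_cosine([1.0, 0.05, 0.0], tiles, k=2)
    print(hits)
    print(drop_stale(hits, ages, max_age_days=180.0))
```

Keeping the freshness filter outside the retrieval function mirrors the binding in the table: the VPR stage only supplies scored candidates, and the freshness policy can change without touching the descriptor pipeline.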

SALAD — per-numbered binding (C2-relevant lines only; cross-cutting N/A above also apply identically)

Cells share the legend defined under the MixVPR sub-matrix. Where a binding is identical in both substance and evidence to the MixVPR row, the SALAD row points to the MixVPR row to avoid restating; where SALAD's pinned mode produces a materially different binding (license, descriptor budget, ViT export risk, accuracy advantage), the SALAD row carries a distinct evidence cite.

| Line | Binding | Evidence (one-line cite) |
| --- | --- | --- |
| AC-1.1 (frame-center within 50 m, ≥80% normal-flight photos) | Verify (downstream) | Same downstream-of-C2 dependency as the MixVPR row; documentary evidence of SALAD retrieval recall on aerial nadir at AC-8.1 resolution floor is absent. Plan-phase aerial-training decision (D-C2-1) + Jetson MVE on Derkachi flight required. SALAD-specific upside: paper Table 1 MSLS Challenge R@1 = 75.0 vs MixVPR 64.0 (+11 R@1 absolute on ground-level urban) suggests SALAD's aerial transfer ceiling may be higher, but transfer is unverified |
| AC-1.2 (frame-center within 20 m, ≥50% normal-flight photos) | Verify (downstream) | Same as AC-1.1, tighter tail; AerialExtreMatch Recall@1 stratified by difficulty cell remains the documentary target |
| AC-2.1b (satellite-anchor registration succeeds, AC-1.1/1.2 + AC-2.2 + AC-8.2 + AC-8.6 conditions) | Verify (downstream) | C2's contribution identical to MixVPR row: top-K retrieval feeding C3+C4; Jetson MVE measurement on AerialExtreMatch + Derkachi flight |
| AC-3.3 (≥3 disconnected segments via satellite-reference re-localization) | Pass (API) → Verify (recall) | SALAD's per-frame top-K cosine retrieval is structurally identical to MixVPR for re-localization (no temporal state required); cross-season recall under SALAD's dustbin-aware features may be more robust than MixVPR (paper Fig 3 shows the network discards sky/road/dynamic objects), but this is unverified on aerial nadir. AerialExtreMatch + D-C2-1 required |
| AC-4.1 (latency <400 ms p95, end-to-end camera→FC) | Verify | SALAD canonical paper reports 2.41 ms inference per image on RTX 3090 at 322×322 batch=1 (Table 1 + Table 5 for DINOv2-B); RTX-3090-to-Jetson-Orin-Nano-Super extrapolation factor 8–10× → ~20–30 ms per frame at fp16 with TensorRT, well within budget. HOWEVER: extrapolation assumes clean DINOv2 ViT-B → TensorRT fp16 export, which paper §5 explicitly flags as risk: "the adoption of DINOv2 as our backbone results in slower processing speeds compared to ResNet-based methods" (D-C2-5 deferred Jetson MVE risk) |
| AC-4.2 (memory <8 GB shared) | Verify | DINOv2-B + SALAD weights ~86M params × 2 bytes (fp16) = ~172 MB (vs MixVPR-ResNet50's ~25 MB), a 7× larger model footprint; activations at 322×322 batch=1 ~80 MB; descriptor cache for ~400 km² @ 0.5 m/px tiles depends on descriptor-size variant: full 8448-D → ~2.7 GB, slim 2112-D → ~0.68 GB, slim 544-D → ~0.17 GB. AC-8.3 cache budget interaction is materially harsher for SALAD than for MixVPR (D-C2-2 carve-out + D-C2-6 descriptor-size choice). Co-resident memory pressure with C1/C3/C5/C6 unverified; to be resolved by Jetson MVE measurement |
| AC-8.1 (cache-interface resolution ≥0.5 m/px, ideally 0.3 m/px) | Pass (with Verify) | SALAD is resolution-agnostic at the algorithm level (DINOv2 accepts any input divisible by 14, paper §4.1 "model is agnostic to image input size"); cross-resolution generalization at 0.5 m/px tile GSD vs nav-camera 12 cm/px GSD unverified; AerialExtreMatch cross-scale cells (Fact #19) are the documentary target (same dependency as MixVPR row) |
| AC-8.6: Scale-ratio (any UAV-frame ground footprint at deployment altitude must be retrievable) | Verify | At 1 km AGL the nav-camera frame footprint is 470×314 m to 980×655 m (per restrictions.md); SALAD's 322×322 input must be center-cropped + downscaled from this range. Cross-scale recall at AC-8.6 spec is exactly the AerialExtreMatch test cell. SALAD's dustbin mechanism (paper §3.2) explicitly discards uninformative regions, which may help under high-altitude downscale (sky/featureless field tokens get assigned to dustbin), but this is unverified on aerial nadir; to be resolved by Jetson MVE measurement |
| AC-8.6: Scene change in active-conflict sectors | Verify | Cratering / building destruction / road realignment is exactly the AerialExtreMatch "scene-change" cell + the Skoltech aerial-VPR survey (Source #38); canonical SALAD weights are not aerial-trained, so D-C2-1 will materially affect this row identically to MixVPR. SALAD's dustbin mechanism may help by discarding scene-change-unstable regions, but this is unverified on aerial active-conflict imagery |
| AC-8.6: Compute & latency under steady-state and re-loc-trigger | Verify | SALAD's per-frame compute is constant (single-stage, no temporal state, no re-ranking; paper §1 explicitly notes "single-stage approach … without requiring expensive post-processing steps"); the "re-loc-trigger workload" and "steady-state workload" have the same per-frame latency. Co-resident memory + GPU-time pressure under simultaneous C1+C2+C3 inference unverified, and the DINOv2-B vs ResNet50 ratio (3.4× more params) materially shifts this from MixVPR; to be resolved by Jetson MVE measurement (D-C2-5 risk) |
| AC-NEW-2 (spoofing-promotion latency <3 s p95) | Pass (latency budget) → Verify (recall) | Same structure as MixVPR row: SALAD per-frame latency at fp16 + TensorRT well under 3 s budget on extrapolation; gating constraint is whether re-anchor retrieval succeeds on first or first-few frames after spoofing detection. Recall under "first-frame after spoof onset" condition unverified, Jetson MVE on Derkachi flight required |
| AC-NEW-6 (imagery freshness: never satellite_anchored on stale-tile match) | Pass (mechanical) | SALAD returns top-K with cosine scores identically to MixVPR; freshness-age decision is a downstream C5/C6 filter on the retrieved candidates |
| AC-NEW-7 (cache-poisoning safety budget: P(>30 m geo-misalign) <1%, P(>100 m) <0.1%) | Verify (downstream) | SALAD's contribution is retrieval correctness under mid-flight-written tile (AC-8.4) presence; if a misaligned mid-flight tile has a near-correct descriptor it can poison the retrieval. Multi-flight Monte Carlo replay is the validation; D-C2-1 affects this. SALAD's dustbin + bidirectional optimal-transport assignment may produce more conservative scoring than MixVPR's pure feature-mixing aggregation, potentially reducing false-positive cosine matches at the descriptor level, but this is unverified speculation; AerialExtreMatch + replay required |
| Restriction "Operational area: eastern/southern Ukraine" (VPR train-domain match) | ⚠️ Documentary gap → Verify | Canonical SALAD weights are GSV-Cities (street-level) trained, same caveat as MixVPR; D-C2-1 applies identically. NEW finding vs MixVPR: SALAD's GPL-3.0 license places aerial-retrain artifacts under copyleft if redistributed; this must be considered alongside D-C1-1 license-posture decision. Aerial-trained community SALAD checkpoints are an open search target for next sessions |
| Restriction "Altitude ≤1 km AGL; terrain assumed flat (rolling steppe / agricultural)" (VPR scale band match) | Verify | Same as AC-8.6 scale-ratio row; cross-scale recall at the project's altitude band is the AerialExtreMatch cross-scale cell |
| Restriction "Weather: predominantly sunny ... seasonal/visibility classes" (VPR cross-season generalization) | Verify | Cross-season VPR is exactly the dominant aerial-VPR failure mode per Fact #19 + SQ5; canonical SALAD weights are single-domain, so D-C2-1 is the primary lever. SALAD-specific finding: paper Table 1 NordLand R@1 = 76.0 vs MixVPR's 58.4 (+17.6 R@1 absolute). NordLand evaluates extreme seasonal variation on a Norway train route; this is documentary evidence that SALAD's cross-season generalization on ground-level imagery is materially stronger than MixVPR's, but aerial cross-season is unverified |
| Restriction "Navigation camera (pinned): ADTi 20MP, 5472×3648" | Pass (API) | SALAD consumes any 322×322 ImageNet-normalised input (must be divisible by 14); the 5472×3648 → 322×322 downscale is mechanical (center-crop + bilinear); information loss at downscale is shared with MixVPR, so the D-C2-3 input-resolution-shape Plan-phase decision still applies (SelaVPR and AnyLoc may be evaluated at higher input sizes in their respective sessions) |
| Restriction "Satellite Imagery / resolution ≥0.5 m/px" (VPR descriptor pipeline at AC-8.1 floor) | Verify | Same as AC-8.1; algorithm-level resolution-agnostic, recall at 0.5 m/px tile GSD vs 12 cm/px nav-camera GSD unverified |
| Restriction "Satellite Imagery / Cache budget: 10 GB" (descriptor budget carve-out) | Pass (with Verify), descriptor-size-choice-dependent | Per-variant: full 8448-D ~2.7 GB / 27% of cache budget; slim 2112-D ~0.68 GB / 6.8%; slim 544-D ~0.17 GB / 1.7%. AC-8.3 explicitly says "Pre-extracted descriptors/indices count against the cache budget unless explicitly carved out". D-C2-2 carve-out decision interacts with D-C2-6 SALAD descriptor-size choice (full variant only viable if explicit carve-out is granted; slim variants viable within cache budget but lose ~5 R@1 points on MSLS Challenge per paper Table 1) |
| Restriction "Companion computer: Jetson Orin Nano Super, 8 GB shared" | Verify (with elevated risk) | DINOv2 ViT-B fp16 inference on Jetson Orin Nano Super is paper-acknowledged slower than ResNet-based methods (paper §5 Conclusions and Limitations); ViT export to TensorRT/INT8 is industry-known harder than CNN export, so the D-C2-5 deferred Jetson MVE risk is materially higher than for MixVPR. Steady-state co-resident memory + GPU-time with C1 + C3 (matcher) unverified; to be resolved by Jetson MVE measurement |
| Restriction "License posture (D-C1-1)" (VPR license-track interaction) | NEW finding vs MixVPR: sub-matrix-blocking under BSD/permissive track | SALAD canonical implementation is GPL-3.0 (Source #59 LICENSE), i.e. copyleft. If Plan-phase D-C1-1 locks the BSD/permissive track, SALAD is excluded as a C2 candidate (no permissive aerial-trained SALAD checkpoint surfaced in this session's search). Under D-C1-1 = (a) GPL-3.0 track or (c) keep-both-tracks-open, SALAD is eligible. Recommendation: present D-C1-1 + this row to user as a structured Choose block at Plan time |
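
The AC-4.1 and AC-4.2 rows for SALAD rest on two pieces of arithmetic: parameter count × bytes per parameter for the weight footprint, and desktop-GPU latency × an extrapolation factor for the Jetson estimate. A small sketch that reproduces them is below. The helper names are hypothetical, and the 8–10× factor is the document's own extrapolation assumption, not a measurement.

```python
from typing import Tuple

AC_4_1_BUDGET_MS = 400.0  # end-to-end p95 latency budget from AC-4.1


def weight_bytes(params: int, bytes_per_param: int) -> int:
    """Weight storage for a model with `params` parameters."""
    return params * bytes_per_param


def extrapolate_latency_ms(desktop_ms: float,
                           factor_low: float,
                           factor_high: float) -> Tuple[float, float]:
    """Scale a desktop-GPU latency by an assumed slowdown range."""
    return desktop_ms * factor_low, desktop_ms * factor_high


if __name__ == "__main__":
    # ~86M parameters at fp16 (2 bytes each), as stated in the AC-4.2 row.
    print(weight_bytes(86_000_000, 2) / 1e6, "MB")

    # 2.41 ms on RTX 3090, scaled by the assumed 8-10x factor.
    low, high = extrapolate_latency_ms(2.41, 8.0, 10.0)
    print(round(low, 1), round(high, 1), "ms")
    print("within AC-4.1 budget:", high < AC_4_1_BUDGET_MS)
```

The computed range sits at the lower end of the row's ~20–30 ms figure, which appears to be a looser rounding; either way the estimate only holds if the ViT-B export to TensorRT fp16 is clean, which is the D-C2-5 risk.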

C2 — Status [2026-05-08 sessions, MixVPR + SALAD]

C2 is OPEN. After this session, 2 of 5 mandatory pre-screen candidates have per-mode entries:

  • MixVPR (session 1, 2026-05-08): Pinned-mode statement + Three-query context7 lookup (OpenVPRLab) + MVE block + Per-numbered-Restriction × per-numbered-AC sub-matrix → Documentary lead with aerial-domain-training caveat, BSD/permissive track (MIT)
  • SALAD (session 2, 2026-05-08): Pinned-mode statement + Three-query lookup via serizba/salad README + LICENSE + canonical paper (context7 fall-back per Per-Mode API rule item 2 — serizba/salad not indexed in context7; OpenVPRLab cross-source for DINOv2 ViT-B backbone API but NOT for SALAD aggregator) + MVE block + Per-numbered-Restriction × per-numbered-AC sub-matrix → Documentary lead with aerial-domain-training caveat + GPL-3.0-license-track caveat + DINOv2-ViT-export risk caveat, GPL-3.0 track (canonical)

Per-mode API capability verification gate: PASS for both candidates (with documented caveats — see Fact #42 + Fact #43 "Fit Impact" + Plan-phase decisions).

Status assignment in ../06_component_fit_matrix/C2_vpr.md row:

  • MixVPR = Documentary lead with aerial-domain-training caveat (BSD/permissive track)
  • SALAD = Documentary lead with aerial-domain-training caveat + GPL-3.0-license-track caveat + DINOv2-ViT-export risk caveat (GPL-3.0 track)

Final promotion to "Selected" for either candidate requires the Plan-phase decisions (D-C2-1 / D-C2-2 / D-C2-3 / D-C2-5 / D-C2-6) AND the Jetson Orin Nano Super hardware MVE phase artifact (D-C1-2 + D-C2-4).

Next session candidates: SelaVPR, EigenPlaces, NetVLAD (remaining mandatory pre-screen survivors); AnyLoc, BoQ, DINOv2-VLAD (conditional on INT8 quantization path). Per the autodev session-shape note ("one VPR family per session"), the natural next picks are:

  • SelaVPR — surfaces the "lighter foundation-model" branch (DINOv2 + self-attention aggregation, smaller descriptor than SALAD per published benchmarks); informs D-C2-3 input-resolution shape and D-C2-5 ViT-export-risk via a second ViT-based candidate.
  • NetVLAD — mandatory simple-VLAD reference baseline; closes the simple-baseline reference point and gives the comparison framework an unambiguous lower bound.
  • EigenPlaces — same ResNet50 backbone family as MixVPR with a different (margin-loss + viewpoint-invariance) aggregation; the most direct apples-to-apples comparison against MixVPR on the BSD/permissive license track.

Cross-cutting decisions raised across MixVPR + SALAD sessions (will compound as more candidates are processed):

  1. D-C2-1 VPR canonical-weights vs aerial-retrain vs aerial-community-checkpoint (raised by MixVPR closure; reaffirmed by SALAD closure with identical caveat) — Plan-phase Choose block; applies to every ground-level-pretrained C2 candidate.
  2. D-C2-2 descriptor-cache carve-out vs raw-tile-cache budget (raised by MixVPR closure; harshened by SALAD closure because SALAD-full's 8448-D descriptor consumes ~27% of the 10 GB cache budget alone) — AC-8.3 forces this; conditional candidates (AnyLoc/BoQ/DINOv2-VLAD) at higher dimensionality push it further.
  3. D-C2-3 input-resolution shape (320×320 vs 322×322 vs higher per SelaVPR/AnyLoc/BoQ) (raised by MixVPR closure; reaffirmed by SALAD closure — 322×322 is the canonical SALAD eval size, marginally higher than MixVPR's 320×320 but still in the same downscale-from-5472×3648 regime).
  4. D-C2-4 deferred Jetson Orin Nano Super hardware MVE coverage for C2 (raised by MixVPR closure; scope-broadened by SALAD closure — Jetson MVE must now also cover DINOv2-B → TensorRT fp16/INT8 export quality, not just ResNet50 latency).
  5. D-C2-5 (NEW) — DINOv2 ViT-export to TensorRT fp16/INT8 path on Jetson Orin Nano Super (raised by SALAD closure) — applies to every ViT-based C2 candidate (SALAD, SelaVPR, AnyLoc, BoQ, DINOv2-VLAD); Jetson MVE prerequisite. Likely rolls into D-C1-2 + the C7 inference-runtime row.
  6. D-C2-6 (NEW) — SALAD descriptor-size choice (8448-D / 2112-D / 544-D) (raised by SALAD closure) — Plan-phase trade-off; full variant gives best R@1 but consumes ~27% of cache budget; slim 544-D fits within 1.7% of cache budget but loses ~5 R@1 points. Interacts with D-C2-2.
  7. D-C1-1 license-posture interaction with C2 (already raised by C1; reaffirmed and sharpened by SALAD closure) — SALAD canonical implementation is GPL-3.0; under BSD/permissive lock at Plan time SALAD is excluded as a C2 candidate. The BSD/permissive C2 axis currently contains MixVPR + (next-session: EigenPlaces, NetVLAD); the GPL-3.0 C2 axis currently contains SALAD + (next-session: possibly SelaVPR, AnyLoc, BoQ, DINOv2-VLAD pending license verification).
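
The D-C2-2 / D-C2-6 interaction in the list above reduces to tile count × descriptor width × bytes per value, compared against the 10 GB cache budget. The sketch below recomputes the quoted figures. The names are hypothetical, the ~160k tile count is taken from the text, and treating 10 GB as decimal (10 × 10⁹ bytes) is an assumption.

```python
TILE_COUNT = 160_000              # ~160k tiles over ~400 km2 (from the text)
CACHE_BUDGET_BYTES = 10 * 10**9   # 10 GB cache budget, decimal GB assumed
FP16_BYTES = 2

# Descriptor widths quoted in the MixVPR and SALAD sub-matrices.
VARIANTS = {
    "MixVPR 2048-D": 2048,
    "SALAD full 8448-D": 8448,
    "SALAD slim 2112-D": 2112,
    "SALAD slim 544-D": 544,
}


def descriptor_table_bytes(tiles: int, dims: int, bytes_per_dim: int) -> int:
    """Size of a dense descriptor table with one descriptor per tile."""
    return tiles * dims * bytes_per_dim


def budget_share_percent(table_bytes: int, budget_bytes: int) -> float:
    """Share of the cache budget consumed by the descriptor table."""
    return 100.0 * table_bytes / budget_bytes


if __name__ == "__main__":
    for name, dims in VARIANTS.items():
        size = descriptor_table_bytes(TILE_COUNT, dims, FP16_BYTES)
        share = budget_share_percent(size, CACHE_BUDGET_BYTES)
        print(f"{name}: {size / 1e9:.2f} GB, {share:.1f}% of budget")
```

Halving `bytes_per_dim` to 1 gives the int8 figures, which is why the INT8 quantization path matters for the higher-dimensional conditional candidates.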

Fact #44 — SelaVPR per-mode API capability verification (DINOv2 ViT-L/14 frozen + lightweight adapters + LocalAdapt up-conv on Jetson Orin Nano Super) — DOCUMENTARY PASS WITH AERIAL-DOMAIN-TRAINING CAVEAT + LARGER-VIT-EXPORT RISK + TWO-STAGE-LATENCY-AND-LOCAL-FEATURE-CACHE RISKS; Jetson MVE pending

  • Statement: SelaVPR (Lu-Feng/SelaVPR, ICLR 2024; canonical implementation by Feng Lu et al., Tsinghua Shenzhen + Peng Cheng Laboratory + UCAS) is a two-stage VPR method that adds lightweight serial+parallel adapters to a frozen DINOv2 ViT-L/14 backbone for global feature extraction, plus a LocalAdapt up-convolutional module after the backbone for dense local feature extraction; retrieval is two-stage (top-K via global-feature cosine search + re-ranking via mutual nearest neighbor cross-matching of local features, no RANSAC).
    • Per the per-Mode API Capability Verification rule, the project's pinned mode is the tuple (DINOv2 ViT-L/14 backbone, FROZEN, only adapters trained, optional --registers 4-register variant) + (Global Adaptation: per-block serial Adapter1 after MHA with internal skip + parallel Adapter2 alongside MLP scaled by s=0.2, both bottleneck FC→ReLU→FC with bottleneck ratio 0.5; class token discarded; patch tokens reshaped to 16×16×1024 feature map; GeM pool → L2-normalised 1024-D global descriptor) + (Local Adaptation: two 3×3 up-conv layers stride=2 padding=1 with ReLU between, output channels 256 then 128; intra-channel L2 normalisation → 61×61×128 dense local features) + (Re-ranking: top-100 candidates by default; --rerank_num=100 for paper accuracy, --rerank_num=20 for 1/5 runtime at near-identical accuracy; mutual-nearest-neighbor cross-matching with |M| count as score) at 224×224 ImageNet-normalised input. This tuple was selected as the canonical paper config (Source #63 §4.2 Implementation Details) and the canonical CLI default (Source #62 README Train/Test sections). The MSLS-finetuned checkpoint is the recommended starting point for cross-domain transfer projects (NOT Pitts30k-further-finetuned, which is "only for urban scenes" per README). The optional --registers variant uses DINOv2+4-registers backbone (per Darcet et al. 2024 ViT registers paper) and ships with a separately finetuned MSLS checkpoint, with better local-matching performance per README §"Local Matching using DINOv2+Registers", but adds yet another export/MVE variant; the project pins the non-registers variant as the canonical default and treats the registers variant as a Plan-phase optional MVE knob.
    • Mode-enumeration query (1/3): SelaVPR is parameterised by the (backbone, adapter-bottleneck-ratio, scaling-factor s, local-adapter-up-conv-shape, training-dataset-finetune-chain) tuple. The canonical class definitions live in Lu-Feng/SelaVPR: adapter1 + adapter2 in /backbone/dinov2/block.py, LocalAdapt in network.py. Two pretrained checkpoint variants are documented: MSLS-finetuned (for diverse scenes: MSLS-val R@1=90.8 / Nordland-test R@1=85.2 / St. Lucia R@1=99.8) and Pitts30k-further-finetuned (only for urban: Tokyo24/7 R@1=94.0 / Pitts30k R@1=92.8 / Pitts250k R@1=95.7); plus the optional --registers variant with its own MSLS-finetuned checkpoint. Backbone enumeration (paper §4.2 + README) is fixed at DINOv2 ViT-L/14; the paper does NOT ablate to ViT-S/B/G as SALAD did, so the ViT-L choice is hardwired. Per the Per-Mode API rule, each (training-finetune-chain, registers-or-not) tuple is a separately-cataloged sibling mode.
    • Pinned-mode runnable example query (2/3): Source #62 README ships a documented training CLI python3 train.py --datasets_folder=... --dataset_name=msls --queries_per_epoch=30000 --foundation_model_path=/path/to/dinov2_vitl14_pretrain.pth (then resume on Pitts30k for the urban variant) and an evaluation CLI python3 eval.py --datasets_folder=... --dataset_name=pitts30k --resume=/path/to/SelaVPR_pitts30k.pth --rerank_num=100. Pretrained weights are distributed via Google Drive links inside README HTML tables. The canonical inference pattern is model.eval(); global_features, local_features = model(images) where images: torch.Tensor[B, 3, 224, 224] ImageNet-normalised, output global_features: torch.Tensor[B, 1024] L2-normalised + local_features: torch.Tensor[B, 128, 61, 61] intra-channel L2-normalised. Re-ranking is performed by a separate cross-match step over the top-K candidates' local features. The optional --efficient_ram_testing flag saves extracted local features to disk (./output_local_features/) and loads only currently-needed features into RAM, which is useful when the local-feature cache exceeds available shared memory (relevant to project's 8 GB shared budget).
    • Disqualifier-probe query (3/3): did NOT surface any documented frame-rate floor (VPR is per-frame independent at the global-retrieval stage; re-ranking is per-event); did NOT surface any documented memory ceiling at the algorithm level beyond the standard DINOv2-L + adapters + LocalAdapt footprint (frozen DINOv2-L weights ~1.1 GB at fp32 / ~550 MB at fp16; adapter+LocalAdapt weights modest); did NOT surface any documented Jetson Orin Nano measurement; did NOT surface a documented ONNX/TensorRT export path inside Lu-Feng/SelaVPR itself (relies on standard PyTorch → ONNX export + TensorRT; to be resolved in C7 row, not C2).
    • Three new disqualifier-class findings raised that did NOT surface for MixVPR, partially shared with SALAD:
      • (i) MIT license (Source #62 LICENSE = MIT, Copyright 2024 Feng Lu): places SelaVPR on the BSD/permissive license track alongside MixVPR, OKVIS2, Kimera-VIO, DPVO, pure-VO baseline; distinct from SALAD's GPL-3.0 placement. SelaVPR is the FIRST DINOv2-based C2 candidate on the BSD/permissive track; under D-C1-1 = (b) BSD/permissive lock at Plan time, SelaVPR survives where SALAD does not. This is materially positive for the BSD/permissive C2 axis.
      • (ii) DINOv2-ViT-L export risk (HARSHER than SALAD's ViT-B): SelaVPR's frozen backbone is DINOv2 ViT-L/14 (300M params) vs SALAD's fine-tuned DINOv2 ViT-B/14 (86M params, only last 4 blocks fine-tuned). Per SALAD paper Table 5, ViT-L latency on RTX 3090 = 7.82 ms vs ViT-B = 2.41 ms (3.2× slower for the backbone alone). Extrapolation to Jetson Orin Nano Super (factor 8–10× from RTX 3090 at fp16+TensorRT): SelaVPR backbone alone ~60–80 ms, plus adapter overhead, plus LocalAdapt up-conv overhead → estimated ~200–270 ms feature extraction per frame vs SALAD's ~20–30 ms. D-C2-5 deferred Jetson MVE risk is materially HARSHER for SelaVPR than for SALAD: the project's AC-4.1 latency budget (400 ms p95 end-to-end camera→FC) gets a much larger SelaVPR carve-out vs SALAD. Counter-mitigation: SelaVPR's FROZEN backbone may make TensorRT export easier than SALAD's fine-tuned-backbone export, since the canonical DINOv2-L pretrained weights have a well-documented and TensorRT-optimized export pathway (FB AI Public Files distribution).
      • (iii) Two-stage retrieval+re-ranking adds latency & local-feature-cache cost not present in single-stage MixVPR/SALAD/NetVLAD/EigenPlaces; SelaVPR is the FIRST two-stage C2 candidate evaluated. Per paper Table 3 (Pitts30k-test on RTX 3090): re-ranking matching time 0.085 s/query at rerank_num=100 (extrapolated to Jetson: ~700 ms, which exceeds the AC-4.1 400 ms budget); 0.018 s/query at rerank_num=20 (extrapolated to Jetson: ~150 ms, which fits within budget). SelaVPR is only viable on Jetson Orin Nano Super at rerank_num≤20, and even then the extraction (~200 ms) + re-ranking (~150 ms) totals ~350 ms, tight against the AC-4.1 400 ms budget before any other component (C1+C3+C5+C8) costs are added. Local-feature-cache cost: SelaVPR's dense 61×61×128 local features = 476,288 floats per image = ~1.9 MB at fp32 / ~950 KB at fp16. For ~160k tiles in the project's ~400 km² operational area, the local-feature cache alone would consume ~150 GB at fp16, which is fundamentally infeasible against the 10 GB AC-8.3 cache budget. Mitigation paths: (a) cache only global descriptors (1024-D × 2 bytes × 160k = ~320 MB fp16) and re-extract local features on-demand per re-ranking event, which adds GPU time per re-rank trigger but keeps the cache budget feasible; (b) precompute and cache the top-K (K=10 or K=20) local feature sets per likely query path, which reduces re-extract cost at the cost of provisioning complexity; (c) drop SelaVPR's re-ranking entirely and use only its global descriptor (the paper's "SelaVPR(global)" variant; see Table 2: MSLS-challenge R@1=69.6 vs full SelaVPR's 73.5; Tokyo24/7 R@1=81.9 vs 94.0; Pitts30k R@1=90.2 vs 92.8), which gives up the two-stage advantage but re-establishes single-stage parity with MixVPR/SALAD. Plan-phase trade-off raises D-C2-7 SelaVPR re-ranking strategy choice (full re-rank with on-demand local feature extraction / cache top-K local features / disable re-ranking entirely) as a new gate.
    • Critical documentary gap (same as MixVPR + SALAD): SelaVPR's published Recall@1 numbers are on ground-level VPR benchmarks (Tokyo24/7 / MSLS-val / MSLS-challenge / Pitts30k-test / Pitts250k / Nordland-test / St. Lucia), NOT on aerial nadir benchmarks (AerialVL, AerialExtreMatch). Per Fact #19 + Fact #26, this is the SAME aerial-domain-training caveat raised by MixVPR closure (Fact #42) and SALAD closure (Fact #43); D-C2-1 (canonical-vs-aerial-retrain-vs-community-aerial-checkpoint) applies to SelaVPR identically; the MSLS-finetuned variant is recommended for cross-domain transfer per README, but aerial transfer remains unverified.
    • Pinned-mode sentence: "We will use SelaVPR with DINOv2 ViT-L/14 backbone (FROZEN, no fine-tuning) + Global Adaptation (per-block serial Adapter1 after MHA + parallel Adapter2 alongside MLP scaled by s=0.2; bottleneck ratio 0.5; ReLU; output GeM-pooled to 1024-D L2-normalised global descriptor) + Local Adaptation (two 3×3 up-conv layers stride=2 padding=1 with ReLU between, output channels 256 then 128; intra-channel L2 norm → 61×61×128 dense local features) + Re-ranking (top-K via global-cosine search + mutual-nearest-neighbor cross-matching with |M| as re-rank score, rerank_num=20 for Jetson budget compatibility) at 224×224 ImageNet-normalised input, with inputs {1× ADTi 20MP nav frame stream → center-cropped + bilinearly downscaled to 224×224 + ImageNet-normalised} and expect outputs {1024-D L2-normalised global descriptor per frame for cosine top-K retrieval over the operational area's tiles + on-demand 61×61×128 local features for top-20 re-ranking} on Jetson Orin Nano Super (8 GB shared, JetPack 6, ROS 2 Humble; PyTorch fp16 + TensorRT baseline; final inference runtime selection deferred to C7)."
  • Source: Source #62 (Lu-Feng/SelaVPR README + LICENSE WebFetch — context7 not indexed), Source #63 (canonical paper arXiv:2402.14505 v1 / ICLR 2024), Source #61 (OpenVPRLab DinoV2 backbone context7 cross-source — confirms DINOv2 ViT-L/14 is a first-class supported backbone in the broader VPR ecosystem; reused across SALAD + SelaVPR sessions for backbone-API documentary cross-source)
  • Phase: Phase 2
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer + license-posture decision-maker + Plan-phase re-ranking-strategy decision-maker
  • Confidence: High for mode-enumeration, runnable-example, parameter-count, license, RTX-3090 runtime, and ground-level-benchmark Recall@K documentary evidence; ⚠️ for Jetson Orin Nano Super latency / memory / accuracy (no documentary measurement; Jetson MVE will resolve, and risk is harsher than for SALAD due to ViT-L vs ViT-B); ⚠️ for DINOv2-ViT-L → TensorRT fp16 / INT8 export quality (industry signal of harder ViT export, harsher than ViT-B; counter-mitigated by frozen-backbone canonical export pathway via FB AI Public Files); ⚠️ for two-stage re-ranking latency budget on Jetson (rerank_num=100 fails AC-4.1 budget on extrapolation; rerank_num=20 fits but is tight; rerank-disabled "global-only" mode falls back to single-stage parity with MixVPR/SALAD); ⚠️ for local-feature-cache budget (61×61×128 dense local features × 160k tiles = ~150 GB fp16, infeasible without on-demand-extraction or cache-strategy mitigation); Low for canonical-checkpoint aerial-domain fitness (same caveat as MixVPR + SALAD: canonical weights are MSLS+Pitts30k street-level-trained, no aerial-nadir benchmark in canonical paper)
  • Related Dimension: SQ3+SQ4 / C2 lead candidate — per-mode API capability verification gate
  • Fit Impact: DOCUMENTARY PASS for the per-mode API capability verification gate — SelaVPR has a documented runnable per-mode example with the project's pinned configuration (CLI + multiple pretrained checkpoints), three documented checkpoint variants (MSLS-finetuned / Pitts30k-further-finetuned / --registers MSLS-finetuned), and no API-level disqualifier. HOWEVER, four caveats are raised — three new vs SALAD, one shared with MixVPR + SALAD: (i) MIT license-track placement (NEW vs SALAD-GPL-3.0; same as MixVPR-MIT) — interacts positively with D-C1-1 license-posture decision; SelaVPR is the FIRST DINOv2-based C2 candidate on the BSD/permissive track, materially expanding the BSD/permissive C2 axis options. (ii) DINOv2-ViT-L export risk (HARSHER than SALAD-ViT-B) — 300M params vs 86M (3.5× larger model, ~3.2× slower backbone per SALAD paper Table 5); D-C2-5 deferred Jetson MVE risk is materially harsher for SelaVPR. Counter-mitigation: frozen backbone may make TensorRT export easier than SALAD's fine-tuned ViT-B export. (iii) Two-stage re-ranking latency + local-feature-cache cost (NEW vs MixVPR + SALAD — both single-stage) — at default rerank_num=100 the matching cost exceeds AC-4.1 400 ms budget on Jetson extrapolation; at rerank_num=20 the total extraction+matching is ~350 ms, tight against budget; the dense 61×61×128 local features are infeasible to cache (~150 GB across the operational area). Plan-phase D-C2-7 re-ranking strategy choice required. (iv) Aerial-domain-training caveat (SHARED with MixVPR + SALAD via D-C2-1) — canonical weights are MSLS+Pitts30k street-level, not aerial-nadir; same Plan-phase decision (project-domain retrain / aerial-trained community checkpoint / elevate alternate C2 candidate) applies. 
HOWEVER, SelaVPR's accuracy advantage is material on cross-illumination/cross-season ground-level benchmarks: paper Table 2 shows SelaVPR Tokyo24/7 R@1=94.0 — best across all compared methods including MixVPR (85.1) and prior SOTA R²Former (88.6) — and Nordland-test R@1=85.2 (vs SALAD's 76.0 and MixVPR's 58.4), indicating SelaVPR's adapter-on-frozen-DINOv2-L design generalizes well to extreme illumination (Tokyo24/7 day/night) and extreme seasonal (Nordland) variation. HOWEVER, this advantage is on ground-level benchmarks; aerial-domain transfer is uncharted in the canonical paper, and the larger backbone may not help if the aerial-vs-ground gap dominates the cross-illumination/cross-season gap. One new Plan-phase decision raised by SelaVPR closure: D-C2-7 SelaVPR re-ranking strategy (full rerank with on-demand local-feature extraction / cache top-K local features per likely query path / disable re-ranking entirely and use SelaVPR-global-only at single-stage parity). The deferred Jetson Orin Nano Super hardware MVE phase still gates final accuracy/latency/memory promotion (D-C1-2 + D-C2-4 + D-C2-5 — all harshened by the ViT-L choice). License: MIT (per Lu-Feng/SelaVPR LICENSE file Copyright 2024 Feng Lu) — permissive, BSD/permissive license track.
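The latency caveats in this fact rest on a simple extrapolation from the paper's RTX 3090 timings. A rough sketch of that arithmetic follows; it assumes a flat 8 to 10× RTX-3090-to-Jetson slowdown and a matching cost that scales linearly with rerank_num (the "1/5 runtime at rerank_num=20" reading). Both are modelling assumptions, not measurements, and the function name is hypothetical.

```python
# Extrapolating SelaVPR's RTX 3090 timings (paper Table 3) to a Jetson-class device.
# ASSUMPTIONS: a flat 8-10x slowdown factor and matching cost linear in rerank_num.

EXTRACT_S_3090 = 0.027     # feature extraction, s/query
MATCH_S_3090 = 0.085       # matching at rerank_num=100, s/query
BUDGET_MS = 400.0          # AC-4.1 end-to-end latency budget


def jetson_latency_ms(rerank_num: int, slowdown: float) -> float:
    """Estimated extraction + re-ranking latency in ms for one query."""
    extract_ms = EXTRACT_S_3090 * slowdown * 1000.0
    match_ms = MATCH_S_3090 * (rerank_num / 100.0) * slowdown * 1000.0
    return extract_ms + match_ms


for rerank_num in (0, 20, 100):          # 0 = global-only, no re-ranking
    low = jetson_latency_ms(rerank_num, 8.0)
    high = jetson_latency_ms(rerank_num, 10.0)
    if high <= BUDGET_MS:
        verdict = "within"
    elif low <= BUDGET_MS:
        verdict = "borderline against"
    else:
        verdict = "over"
    print(f"rerank_num={rerank_num:>3}: {low:.0f}-{high:.0f} ms "
          f"({verdict} the {BUDGET_MS:.0f} ms budget)")
```

With these inputs, rerank_num=20 only fits the budget at the optimistic end of the slowdown range, which matches the "fits but tight" reading; the estimate also ignores the C1+C3+C5+C8 costs that share the same budget, so only the Jetson MVE can settle it.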

C2 — Per-Mode API Capability Verification (engine Step 2 — SelaVPR session entry, 2026-05-08)

MVE — SelaVPR with DINOv2 ViT-L/14 frozen backbone + adapters + LocalAdapt @ 224×224 → 1024-D global + 61×61×128 dense local descriptors (canonical MSLS-finetuned variant; Pitts30k-further-finetuned and --registers variants documented as separately-cataloged sibling modes)

  • Source: Source #62 (Lu-Feng/SelaVPR README + LICENSE WebFetch: training CLI python3 train.py --datasets_folder=... --dataset_name=msls --foundation_model_path=/path/to/dinov2_vitl14_pretrain.pth, evaluation CLI python3 eval.py --datasets_folder=... --dataset_name=pitts30k --resume=/path/to/SelaVPR_pitts30k.pth --rerank_num={20,100}, two pretrained checkpoint variants distributed via Google Drive links in README HTML tables, optional --registers flag for DINOv2+4-register variant, optional --efficient_ram_testing flag for disk-backed local-feature cache), accessed 2026-05-08; Source #63 (canonical paper arXiv:2402.14505 v1 / Lu et al. ICLR 2024: §3.1-3.5 Method + §4.1-4.3 Datasets/Implementation/Comparisons + Table 2 Recall@K + Table 3 single-query runtime); Source #61 (OpenVPRLab DinoV2 backbone context7 cross-source; confirms DINOv2 ViT-L/14 is a first-class supported backbone, with input-divisibility-by-14 constraint and 16×16 patch-grid layout for 224×224 input)
  • Inputs in the example: MSLS images for training at 224×224 (ImageNet mean/std normalised; must be divisible by 14 → 224/14 = 16 patches per side → 16×16 = 256 spatial tokens + 1 global cls token); MSLS / Tokyo24/7 / Pitts30k / Nordland evaluation images at 224×224 (same divisibility constraint); batch tensor images: torch.Tensor[B, 3, 224, 224]; DINOv2-L/14 backbone (1024-dim tokens, 300M params, FROZEN — no fine-tuning, only adapters trained) → spatial feature tensor [B, 1024, 16, 16]; Global Adaptation (per-block serial Adapter1 after MHA + parallel Adapter2 alongside MLP scaled by s=0.2; bottleneck ratio 0.5) → adapted spatial feature map; class token discarded; Local Adaptation (two 3×3 up-conv layers stride=2 padding=1 with ReLU between, output channels 256 then 128; intra-channel L2 norm) → [B, 128, 61, 61] dense local features
  • Outputs in the example: global_features: torch.Tensor[B, 1024] L2-normalised + local_features: torch.Tensor[B, 128, 61, 61] intra-channel L2-normalised; cosine top-K retrieval against pre-cached global descriptors; mutual-nearest-neighbor cross-matching with |M| count as re-rank score over top-K candidates (default rerank_num=100, alternative rerank_num=20 for 1/5 runtime); canonical paper Table 2 reports Tokyo24/7 R@1=94.0 / MSLS-val R@1=90.8 / MSLS-challenge R@1=73.5 / Pitts30k R@1=92.8 (Pitts30k-further-finetuned variant); MSLS-finetuned variant (recommended for cross-domain transfer) reports MSLS-val R@1=90.8 / Nordland-test R@1=85.2 / St. Lucia R@1=99.8; canonical paper Table 3 reports extraction 0.027 s/query + matching 0.085 s/query = total 0.112 s/query at rerank_num=100 on RTX 3090 / Pitts30k-test (less than 4% of TransVPR's 3.018 s/query)
  • Project inputs: 1× ADTi 20MP nav frame stream (5472×3648, target 3 fps) → center-cropped to 3648×3648 (square) → bilinearly downscaled to 224×224 → ImageNet-normalised → fp16 batch on Jetson Orin Nano Super
  • Project outputs required: 1024-D L2-normalised global descriptor per frame; cosine top-K (project default K=10 per Fact #25) against pre-cached descriptor table over the ~400 km² operational area's tiles at AC-8.1 resolution floor; on-demand re-ranking of top-K candidates via mutual-nearest-neighbor local-feature cross-matching at rerank_num=20 (the only Jetson-budget-compatible setting per Fact #44 Disqualifier-probe); satisfies AC-8.6 retrieval-recall requirement under cross-season / cross-domain / scene-change conditions; satisfies AC-4.1 latency budget for steady-state ONLY at rerank_num=20 AND with successful DINOv2-L → TensorRT fp16 export (D-C2-5); satisfies AC-NEW-2 spoofing-promotion path
  • Match assessment: exact mode match for (DINOv2 ViT-L/14 frozen backbone, Global Adaptation adapters, LocalAdapt up-conv module, 224×224 input, 1024-D global + 61×61×128 local output); training+evaluation CLI exists; multiple pretrained checkpoints documented with Google Drive distribution + DINOv2 backbone weights from FB AI Public Files; ⚠️ partial input domain (canonical weights trained on MSLS + Pitts30k street-level imagery vs project's nadir aerial 1 km AGL; domain shift unverified, same caveat as MixVPR + SALAD); ⚠️ HARSHER Jetson Orin Nano Super export risk than SALAD: DINOv2 ViT-L (300M params) is 3.5× larger than SALAD's ViT-B (86M params) and ~3.2× slower per SALAD paper Table 5, with extrapolated extraction of ~200-270 ms per frame on Jetson at fp16+TensorRT; ⚠️ two-stage re-ranking latency: at default rerank_num=100 the matching cost extrapolates to ~700 ms, which exceeds the AC-4.1 400 ms budget; at rerank_num=20 the total extraction+matching is ~350 ms, tight against budget before C1+C3+C5+C8 costs are added; ⚠️ two-stage local-feature-cache cost: 61×61×128 = 476k floats × 160k tiles × 2 bytes (fp16) = ~150 GB, fundamentally infeasible against the AC-8.3 10 GB cache budget; mitigation paths: (a) cache global descriptors only (~320 MB fp16 / 3.2% of cache budget) + on-demand local-feature re-extraction per re-rank trigger (adds GPU time); (b) precompute top-K local feature sets per likely query path (~15 GB if K=100, still over budget; ~3 GB if K=20 + selective coverage); (c) disable re-ranking entirely and use SelaVPR-global-only mode (MSLS-challenge R@1=69.6 vs full SelaVPR's 73.5; gives back the two-stage advantage but re-establishes single-stage parity with MixVPR/SALAD); ⚠️ partial inference runtime: paper §4.3 explicitly notes "TransVPR is fast at extracting features, while SelaVPR is slower (but faster than other methods) due to the use of the ViT/L backbone", so D-C2-5 risk is harsher than SALAD's; counter-mitigation: SelaVPR's FROZEN backbone may have an easier TensorRT export pathway than SALAD's fine-tuned-backbone export (canonical DINOv2-L pretrained weights distributed via FB AI Public Files have a well-documented optimization pathway)
  • If ⚠️ or : docs do not explicitly disqualify the algorithmic mode. The (backbone, adapter-config, LocalAdapt-config, re-ranking-pool-size) tuple, input size, normalisation, and output shapes are all documented and runnable as-is via the canonical CLI. However, four caveats elevate the verification gate's risk profile beyond MixVPR's and partially differently from SALAD's: (i) MIT license-track placement — Source #62 LICENSE = MIT, BSD/permissive track; POSITIVELY interacts with D-C1-1 license-posture decision (SelaVPR survives BSD/permissive lock where SALAD does not); (ii) DINOv2-ViT-L export risk to Jetson Orin Nano Super at fp16/INT8 — harsher than SALAD's ViT-B because ViT-L is 3.5× larger; D-C2-5 deferred Jetson MVE risk is materially elevated; counter-mitigation by frozen-backbone canonical export pathway; (iii) Two-stage re-ranking latency + local-feature-cache cost — NEW vs MixVPR + SALAD; raises D-C2-7 SelaVPR re-ranking strategy choice as a new Plan-phase gate; only rerank_num≤20 fits AC-4.1 budget on Jetson extrapolation; local-feature cache infeasible without on-demand-extraction or cache-strategy mitigation; (iv) Aerial-domain-training caveat — same as MixVPR + SALAD via D-C2-1. 
→ Status: Documentary lead with aerial-domain-training caveat + DINOv2-ViT-L-export risk caveat (harsher than SALAD-ViT-B) + two-stage-latency-and-local-feature-cache-strategy risk caveat (NEW), BSD/permissive track (MIT — same as MixVPR, distinct from SALAD); final promotion to "Selected" requires (a) Plan-phase decision on D-C2-1 (canonical-vs-aerial-retrain-vs-community-aerial-checkpoint), (b) Plan-phase decision on D-C2-7 SelaVPR re-ranking strategy (full-rerank-with-on-demand / cache-top-K / disable-rerank-and-fall-back-to-global-only), (c) Plan-phase decision on D-C2-3 input-resolution shape (SelaVPR's 224×224 is materially smaller than MixVPR's 320×320 and SALAD's 322×322 — interaction with information loss at downscale-from-5472×3648), AND (d) Jetson Orin Nano Super hardware MVE phase artifact (latency, memory, DINOv2-L → TensorRT fp16 export quality, cross-validation against SALAD's DINOv2-B export numbers, AerialExtreMatch Recall@K).
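The shapes quoted in the MVE block (16×16 patch grid, 61×61 local map, 3648×3648 center crop) can be cross-checked with plain arithmetic. The sketch below assumes the two "up-conv" layers behave like transposed convolutions with output_padding=0 and dilation=1, which reproduces the documented 61×61 size; the helper names are illustrative and not taken from the SelaVPR code base.

```python
# Shape bookkeeping for the pinned SelaVPR mode (pure arithmetic, no framework).
# ASSUMPTION: "up-conv" = transposed convolution with output_padding=0, dilation=1.
# All helper names are hypothetical.

from typing import Tuple

PATCH = 14
INPUT_SIDE = 224


def patch_grid(side: int, patch: int = PATCH) -> int:
    """Patches per side; the input side must be divisible by the patch size."""
    if side % patch != 0:
        raise ValueError(f"{side} is not divisible by {patch}")
    return side // patch


def upconv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output size of a transposed convolution (output_padding=0, dilation=1)."""
    return (size - 1) * stride - 2 * padding + kernel


def center_crop_box(width: int, height: int) -> Tuple[int, int, int, int]:
    """(left, top, right, bottom) of the largest centered square crop."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


grid = patch_grid(INPUT_SIDE)                 # 16 patches per side
tokens = grid * grid + 1                      # 256 spatial tokens + 1 class token
local_side = upconv_out(upconv_out(grid))     # 16 -> 31 -> 61
crop = center_crop_box(5472, 3648)            # nav frame -> 3648x3648 square
downscale = (crop[2] - crop[0]) / INPUT_SIDE  # linear downscale factor to 224x224

print(grid, tokens, local_side, crop, round(downscale, 1))
```

The linear downscale of roughly 16× from the cropped nav frame to 224×224 is the quantity behind the D-C2-3 input-resolution concern.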

C2 — Per-numbered-Restriction × Per-numbered-AC Sub-Matrix per Candidate (SelaVPR addition)

SelaVPR — per-numbered binding (C2-relevant lines only; cross-cutting N/A above also apply identically)

Cells share the legend defined under the MixVPR sub-matrix. Where a binding is identical in both substance and evidence to the MixVPR or SALAD row, the SelaVPR row points to those rows to avoid restating; where SelaVPR's pinned mode produces a materially different binding (license, larger backbone, two-stage re-ranking, smaller global descriptor, larger input downscale), the SelaVPR row carries a distinct evidence cite.

Line Binding Evidence (one-line cite)
AC-1.1 (frame-center within 50 m, ≥80% normal-flight photos) Verify (downstream) Same downstream-of-C2 dependency as MixVPR + SALAD rows; documentary evidence of SelaVPR retrieval recall on aerial nadir at AC-8.1 resolution floor is absent — Plan-phase aerial-training decision (D-C2-1) + Jetson MVE on Derkachi flight required. SelaVPR-specific upside: paper Table 2 Tokyo24/7 R@1=94.0 (best across all compared methods, including MixVPR's 85.1 and prior SOTA R²Former's 88.6) and Nordland-test R@1=85.2 (vs SALAD's 76.0 and MixVPR's 58.4) suggests adapter-on-frozen-DINOv2-L design generalizes well to extreme illumination/seasonal variation, which may translate to aerial cross-season/cross-illumination — but unverified
AC-1.2 (frame-center within 20 m, ≥50% normal-flight photos) Verify (downstream) Same as AC-1.1, tighter tail; AerialExtreMatch Recall@1 stratified by difficulty cell remains the documentary target. SelaVPR-specific consideration: the two-stage re-ranking via dense local-feature mutual-nearest-neighbor matching may improve geometric-fine-grain accuracy (the MNN matching implicitly enforces local consistency without RANSAC), which may raise AC-1.2 tail performance vs single-stage MixVPR/SALAD — but this is speculative on aerial nadir
AC-2.1b (satellite-anchor registration succeeds, AC-1.1/1.2 + AC-2.2 + AC-8.2 + AC-8.6 conditions) Verify (downstream) C2's contribution identical to MixVPR + SALAD rows — top-K retrieval feeding C3+C4; SelaVPR's re-ranking adds a second filter that may improve top-1 quality before C3+C4 see the candidate; Jetson MVE measurement on AerialExtreMatch + Derkachi flight
AC-3.3 (≥3 disconnected segments via satellite-reference re-localization) Pass (API) → Verify (recall) SelaVPR's per-frame top-K cosine retrieval (global) is structurally identical to MixVPR + SALAD for re-localization (no temporal state required); the two-stage re-ranking adds robustness against perceptual aliasing at the cost of re-rank-event latency — the MNN-with-
AC-4.1 (latency <400 ms p95, end-to-end camera→FC) Verify (HARSHER risk than SALAD; tight budget at rerank_num=20) SelaVPR canonical paper reports 0.027 s/query feature extraction + 0.085 s/query matching at rerank_num=100 = 0.112 s/query total on RTX 3090 at 224×224 batch=1 (paper Table 3, Pitts30k-test); RTX-3090-to-Jetson-Orin-Nano-Super extrapolation factor 8-10× → ~200-270 ms extraction + ~700 ms matching at rerank_num=100 (FAILS AC-4.1) OR ~150 ms matching at rerank_num=20 (extraction + matching ~350 ms, tight against budget before C1+C3+C5+C8 costs are added). D-C2-5 + D-C2-7 deferred Jetson MVE risk is materially HARSHER than SALAD's: DINOv2-L (300M params) is 3.5× larger than SALAD's DINOv2-B (86M params). Paper §4.3 explicitly notes "TransVPR is fast at extracting features, while SelaVPR is slower (but faster than other methods) due to the use of the ViT/L backbone". Counter-mitigation: SelaVPR's FROZEN backbone may have an easier TensorRT export pathway than SALAD's fine-tuned-backbone export. Plan-phase commitment required: project must commit to either (a) rerank_num=20 with on-demand local-feature extraction (tight budget, validated by Jetson MVE), (b) disable re-ranking and use SelaVPR-global-only at single-stage parity (MSLS-challenge R@1=69.6 vs full's 73.5), or (c) reject SelaVPR if Jetson MVE extraction exceeds ~250 ms after TensorRT optimization
AC-4.2 (memory <8 GB shared) Verify (descriptor cache feasible at global-only; local-feature cache INFEASIBLE without mitigation) DINOv2-L + adapters + LocalAdapt weights ~300M params × 2 bytes (fp16) = ~600 MB (vs SALAD's ~172 MB and MixVPR's ~25 MB) — 3.5× larger model footprint than SALAD; activations at 224×224 batch=1 ~50 MB; descriptor cache for ~400 km² @ 0.5 m/px tiles: 1024-D global descriptor → ~320 MB fp16 / 3.2% of 10 GB cache budget (smallest of all C2 candidates so far); HOWEVER, dense 61×61×128 = 476k floats local-features × 160k tiles × 2 bytes (fp16) = ~150 GB local-feature cache — fundamentally infeasible. Mitigation paths (D-C2-7): (a) cache global only + on-demand local-feature re-extraction per re-rank event (adds GPU time per re-rank trigger but keeps cache budget feasible); (b) precompute top-K local feature sets per likely query path (~3 GB at K=20 with selective coverage — feasible but adds provisioning complexity to C10); (c) disable re-ranking entirely (back to single-stage parity). AC-8.3 cache budget interaction is materially different from MixVPR + SALAD — SelaVPR has the smallest global-descriptor cache of all C2 candidates so far AND the largest potential local-feature cache (infeasible without mitigation). Co-resident memory pressure with C1/C3/C5/C6 unverified — Jetson MVE measurement
AC-8.1 (cache-interface resolution ≥0.5 m/px, ideally 0.3 m/px) Pass (with Verify) SelaVPR is resolution-agnostic at the algorithm level (DINOv2 accepts any input divisible by 14, paper §3.3 implementation accepts 224×224); cross-resolution generalization at 0.5 m/px tile GSD vs nav-camera 12 cm/px GSD unverified, AerialExtreMatch cross-scale cells (Fact #19) is the documentary target — same dependency as MixVPR + SALAD rows
AC-8.6 — Scale-ratio (any UAV-frame ground footprint at deployment altitude must be retrievable) Verify (smaller input size = harsher downscale than MixVPR + SALAD) At 1 km AGL the nav-camera frame footprint is 470×314 m to 980×655 m (per restrictions.md); SelaVPR's 224×224 input must be center-cropped + downscaled from this range — this is a more aggressive downscale than MixVPR's 320×320 or SALAD's 322×322, potentially losing more fine-grained content needed for cross-scale matching. Cross-scale recall at AC-8.6 spec is exactly the AerialExtreMatch test cell — Jetson MVE measurement. SelaVPR's two-stage re-ranking with dense local features may compensate for the smaller input by exploiting fine-grain local matches that the global descriptor missed — but unverified on aerial nadir
AC-8.6 — Scene change in active-conflict sectors Verify Cratering / building destruction / road realignment is exactly the AerialExtreMatch "scene-change" cell + the Skoltech aerial-VPR survey (Source #38); canonical SelaVPR weights are not aerial-trained — D-C2-1 will materially affect this row identically to MixVPR + SALAD. SelaVPR-specific consideration: the local-feature re-ranking may help reject candidates with significant scene-change-induced local-feature drift (the MNN matching count `
AC-8.6: Compute & latency under steady-state and re-loc-trigger Verify (asymmetric latency profile, NEW vs MixVPR + SALAD) SelaVPR's per-frame compute is NOT constant: global-only retrieval (steady-state) costs ~200-270 ms extraction on Jetson extrapolation; full re-ranking (re-loc-trigger or top-K validation) adds ~150 ms at rerank_num=20 or ~700 ms at rerank_num=100. The "re-loc-trigger workload" and "steady-state workload" have materially different latency; this is a NEW cost-model finding vs MixVPR + SALAD (both single-stage, constant per-frame cost). The project may want to use SelaVPR-global-only for steady-state and trigger full re-ranking only on satellite-re-anchor events or VIO-divergence-triggered re-localization. Co-resident memory + GPU-time pressure under simultaneous C1+C2+C3 inference is unverified, and the DINOv2-L parameter count (12× more params than MixVPR's ResNet50, 3.5× more params than SALAD's ViT-B) materially shifts this from MixVPR + SALAD; Jetson MVE measurement required (D-C2-5 + D-C2-7 risks compounded)
AC-NEW-2 (spoofing-promotion latency <3 s p95) Pass (latency budget comfortable) → Verify (recall at re-anchor) Same structure as MixVPR + SALAD rows: SelaVPR per-frame global retrieval at fp16 + TensorRT is well under the 3 s budget on extrapolation; the spoofing-promotion event is precisely a re-loc-trigger where SelaVPR's two-stage re-ranking can amortize its event-cost over the multi-second budget, since rerank_num=100 at ~700 ms is well within the 3 s budget for a single re-anchor event. Gating constraint is whether re-anchor retrieval (after re-ranking) succeeds on the first or first-few frames after spoofing detection; recall under the "first-frame after spoof onset" condition is unverified, Jetson MVE on Derkachi flight required. SelaVPR-specific upside: the re-rank step can provide a geometrically checked re-anchor signal that single-stage methods lack, so the spoof-promotion event is exactly the use-case where SelaVPR's two-stage architecture earns its latency cost
AC-NEW-6 (imagery freshness — never satellite_anchored on stale-tile match) Pass (mechanical) SelaVPR returns top-K with cosine scores from global descriptors identically to MixVPR + SALAD; freshness-age decision is a downstream C5/C6 filter on the retrieved candidates. The two-stage re-ranking adds an additional filter (only re-rank candidates with acceptable freshness age) — this can be exploited by the freshness-aware ranking to reject stale-tile candidates BEFORE they consume re-rank GPU time
AC-NEW-7 (cache-poisoning safety budget — P(>30 m geo-misalign) <1%, P(>100 m) <0.1%) Verify (downstream — POSITIVE structural advantage vs single-stage) SelaVPR's contribution is retrieval correctness under mid-flight-written tile (AC-8.4) presence; if a misaligned mid-flight tile has a near-correct global descriptor it CAN poison the global-retrieval stage, BUT the two-stage re-ranking via dense local-feature MNN matching provides a structurally novel filter against geometric misalignment — a poisoned-but-misaligned tile would have local features that DON'T mutual-nearest-neighbor match, so `
Restriction "Operational area: eastern/southern Ukraine" — VPR train-domain match ⚠️ Documentary gap → Verify Canonical SelaVPR weights are MSLS + Pitts30k (street-level / urban) trained, same caveat as MixVPR + SALAD — D-C2-1 applies identically. NEW finding vs SALAD (positive): SelaVPR's MIT license places aerial-retrain artifacts under permissive licensing if redistributed — no copyleft friction with the BSD/permissive track. Aerial-trained community SelaVPR checkpoints are an open search target for next sessions; the README's recommendation to use the MSLS-finetuned variant for "diverse scenes" rather than the Pitts30k-further-finetuned urban variant is a useful default for cross-domain transfer projects
Restriction "Altitude ≤1 km AGL; terrain assumed flat (rolling steppe / agricultural)" — VPR scale band match Verify Same as AC-8.6 scale-ratio row; cross-scale recall at the project's altitude band is the AerialExtreMatch cross-scale cell
Restriction "Weather: predominantly sunny ... seasonal/visibility classes" — VPR cross-season generalization Verify (DOCUMENTARY ADVANTAGE on cross-illumination/cross-season ground-level) Cross-season VPR is the dominant aerial-VPR failure mode per Fact #19 + SQ5; canonical SelaVPR weights are single-domain — D-C2-1 is the primary lever. SelaVPR-specific finding: paper Table 2 Tokyo24/7 R@1 = 94.0 (extreme day/night illumination) is the BEST across all compared methods (vs MixVPR's 85.1 and SALAD-comparable two-stage R²Former's 88.6); paper's "trained models" table reports Nordland-test R@1 = 85.2 (vs SALAD's 76.0 and MixVPR's 58.4) — extreme seasonal variation. This is documentary evidence that SelaVPR's adapter-on-frozen-DINOv2-L design generalizes to cross-illumination/cross-season ground-level imagery materially better than both MixVPR and SALAD, suggesting the design's transferability to aerial cross-season may also be stronger — but aerial cross-season is unverified
Restriction "Navigation camera (pinned): ADTi 20MP, 5472×3648" Pass (API) — but harsher downscale than MixVPR + SALAD SelaVPR consumes any 224×224 ImageNet-normalised input (must be divisible by 14); the 5472×3648 → 224×224 downscale is more aggressive than MixVPR's 5472×3648 → 320×320 or SALAD's 5472×3648 → 322×322; information loss at this larger downscale is the actual concern; D-C2-3 input-resolution-shape Plan-phase decision is harshened by SelaVPR closure — SelaVPR is at the small-input extreme of the C2 candidate space, MixVPR + SALAD are at the medium-input baseline, AnyLoc + BoQ may be at the higher-resolution end (next sessions). The two-stage re-ranking via dense local features may compensate for the aggressive downscale by exploiting fine-grain local matches the global descriptor missed
Restriction "Satellite Imagery — resolution ≥0.5 m/px" — VPR descriptor pipeline at AC-8.1 floor Verify Same as AC-8.1; algorithm-level resolution-agnostic, recall at 0.5 m/px tile GSD vs 12 cm/px nav-camera GSD unverified
Restriction "Satellite Imagery — Cache budget: 10 GB" — descriptor budget carve-out Pass (with Verify) — global cache smallest of all C2 candidates; local-feature cache INFEASIBLE without strategy mitigation Per-stage: 1024-D global descriptor cache ~320 MB fp16 / 3.2% of cache budget — smallest of all C2 candidates so far (vs MixVPR's 650 MB / 6.5% and SALAD-full's 2.7 GB / 27%); 61×61×128 dense local feature cache ~150 GB fp16 — fundamentally infeasible against AC-8.3 10 GB budget without mitigation. AC-8.3 explicitly says "Pre-extracted descriptors/indices count against the cache budget unless explicitly carved out" — D-C2-2 carve-out decision interacts with D-C2-7 SelaVPR re-ranking strategy choice: if D-C2-7 = (a) on-demand local-feature re-extraction, then only the 320 MB global cache counts, and SelaVPR has the most cache-efficient C2 footprint of all candidates so far; if D-C2-7 = (b) precompute top-K local features per likely path (~3 GB at K=20), the cache cost is moderate; if D-C2-7 = (c) disable re-ranking, SelaVPR matches MixVPR + SALAD-slim on cache footprint. NEW interaction: D-C2-7 vs D-C2-2 vs AC-8.3 form a three-way Plan-phase decision uniquely raised by SelaVPR's two-stage architecture
Restriction "Companion computer: Jetson Orin Nano Super, 8 GB shared" Verify (with elevated risk — HARSHER than SALAD) DINOv2 ViT-L fp16 inference on Jetson Orin Nano Super is paper-acknowledged slower than ResNet/ViT-B-based methods (paper §4.3 Table 3 + §3.1 implementation note); ViT-L export to TensorRT/INT8 is industry-known harder than ViT-B export, which is in turn harder than CNN export — D-C2-5 deferred Jetson MVE risk is materially higher than for SALAD (and SALAD is harder than MixVPR). Counter-mitigation: SelaVPR's FROZEN backbone may have an easier TensorRT export pathway than SALAD's fine-tuned-backbone export (canonical DINOv2-L pretrained weights distributed via FB AI Public Files have a well-documented optimization pathway). Steady-state co-resident memory + GPU-time with C1 + C3 (matcher) unverified — Jetson MVE measurement
Restriction "License posture (D-C1-1)" — VPR license-track interaction POSITIVE finding vs SALAD — sub-matrix-PASSING under BSD/permissive track SelaVPR canonical implementation is MIT (Source #62 LICENSE Copyright 2024 Feng Lu) — permissive. Distinct from SALAD's GPL-3.0 placement. Under D-C1-1 = (a) GPL-3.0 track, (b) BSD/permissive lock, or (c) keep-both-tracks-open, SelaVPR is eligible on every license-posture choice — first DINOv2-based C2 candidate to achieve this. This materially expands the BSD/permissive C2 axis options beyond MixVPR + (next-session: EigenPlaces, NetVLAD pending license verification): under D-C1-1 = (b), the BSD/permissive C2 axis now contains MixVPR (CNN-based, MIT) + SelaVPR (DINOv2 ViT-L-based, MIT) with materially different design points (single-stage vs two-stage; ResNet50 vs DINOv2-L; 320×320 vs 224×224 input; 2048-D vs 1024-D global descriptor). Recommendation: present D-C1-1 + this row to user as a structured Choose block at Plan time, noting that SelaVPR materially changes the BSD/permissive-track ceiling vs the prior MixVPR-only state
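Several rows above lean on the mutual-nearest-neighbor (MNN) match count as the re-rank score. A minimal pure-Python sketch of that idea on toy 2-D descriptors follows; it is an illustration of the general MNN technique under the assumption of L2-normalised descriptors compared by dot product, not the SelaVPR implementation, and every function name here is hypothetical. A real pipeline would run this as batched matrix products on the GPU over the 61×61×128 feature maps.

```python
# Mutual-nearest-neighbor (MNN) match count between two sets of L2-normalised
# local descriptors, used as a re-rank score. Illustrative only; names are
# hypothetical and the descriptors are toy 2-D vectors.

from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product, equal to cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nearest(src: Sequence[Vector], dst: Sequence[Vector]) -> List[int]:
    """For each descriptor in `src`, index of the most similar descriptor in `dst`."""
    return [max(range(len(dst)), key=lambda j: dot(s, dst[j])) for s in src]


def mnn_matches(query: Sequence[Vector], cand: Sequence[Vector]) -> List[Tuple[int, int]]:
    """Pairs (i, j) where query[i] and cand[j] are each other's nearest neighbor."""
    q_to_c = nearest(query, cand)
    c_to_q = nearest(cand, query)
    return [(i, j) for i, j in enumerate(q_to_c) if c_to_q[j] == i]


def rerank(query: Sequence[Vector], candidates: Sequence[Sequence[Vector]]) -> List[int]:
    """Candidate indices sorted by descending MNN match count."""
    scores = [len(mnn_matches(query, c)) for c in candidates]
    return sorted(range(len(candidates)), key=lambda k: scores[k], reverse=True)


# Candidate B is a permutation of the query descriptors; candidate A is only
# loosely similar, so it collects fewer mutual matches.
query = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
cand_a = [(0.6, 0.8), (0.8, 0.6), (0.6, -0.8)]
cand_b = [(0.0, 1.0), (-1.0, 0.0), (1.0, 0.0)]
order = rerank(query, [cand_a, cand_b])
print(order)
```

Because a match only counts when both directions agree, a candidate whose global descriptor looks right but whose local structure differs tends to score low, which is the filtering behaviour the AC-1.2 and AC-NEW-7 rows rely on.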

C2 — Status [2026-05-08 sessions, MixVPR + SALAD + SelaVPR]

C2 is OPEN. After this session, 3 of 5 mandatory pre-screen candidates have per-mode entries:

  • MixVPR (session 1, 2026-05-08): Documentary lead with aerial-domain-training caveat, BSD/permissive track (MIT)
  • SALAD (session 2, 2026-05-08): Documentary lead with aerial-domain-training caveat + GPL-3.0-license-track caveat + DINOv2-ViT-export risk caveat, GPL-3.0 track (canonical)
  • SelaVPR (session 3, 2026-05-08): Pinned-mode statement + Three-query lookup via Lu-Feng/SelaVPR README + LICENSE + canonical paper (context7 fall-back per Per-Mode API rule item 2 — Lu-Feng/SelaVPR not indexed in context7 — only unrelated liu-feng-deeplearning/coverhunter returned; OpenVPRLab cross-source for DINOv2 ViT-L/14 backbone API reused from SALAD session) + MVE block + Per-numbered-Restriction × per-numbered-AC sub-matrix → Documentary lead with aerial-domain-training caveat + DINOv2-ViT-L-export risk caveat (HARSHER than SALAD-ViT-B) + two-stage-latency-and-local-feature-cache-strategy risk caveat (NEW vs MixVPR + SALAD), BSD/permissive track (MIT — same as MixVPR, distinct from SALAD)

Per-mode API capability verification gate: PASS for all three candidates (with documented caveats — see Fact #42 + Fact #43 + Fact #44 "Fit Impact" + Plan-phase decisions).

Status assignment in ../06_component_fit_matrix/C2_vpr.md row:

  • MixVPR = Documentary lead with aerial-domain-training caveat (BSD/permissive track)
  • SALAD = Documentary lead with aerial-domain-training caveat + GPL-3.0-license-track caveat + DINOv2-ViT-export risk caveat (GPL-3.0 track)
  • SelaVPR = Documentary lead with aerial-domain-training caveat + DINOv2-ViT-L-export risk caveat (HARSHER than SALAD) + two-stage-latency-and-local-feature-cache-strategy risk caveat (BSD/permissive track)

Final promotion to "Selected" for any candidate requires the Plan-phase decisions (D-C2-1 / D-C2-2 / D-C2-3 / D-C2-5 / D-C2-6 / D-C2-7 NEW from SelaVPR) AND the Jetson Orin Nano Super hardware MVE phase artifact (D-C1-2 + D-C2-4).

Next session candidates: EigenPlaces, NetVLAD (remaining mandatory pre-screen survivors); AnyLoc, BoQ, DINOv2-VLAD (conditional on INT8 quantization path).

SelaVPR closure raises one new Plan-phase decision (compounding with the prior six, D-C2-1 through D-C2-6), listed as item 8 among the cross-component process gates:

  8. D-C2-7 (NEW from SelaVPR closure 2026-05-08): SelaVPR re-ranking strategy choice (full re-rank with on-demand local-feature extraction / cache top-K local features per likely query path / disable re-ranking entirely and use SelaVPR-global-only mode), a Plan-phase decision. Full re-rank at rerank_num=100 fails the AC-4.1 latency budget on Jetson extrapolation; rerank_num=20 fits but is tight; on-demand local-feature extraction + global-only cache (~320 MB) is the most cache-efficient mitigation; precompute-top-K-local-features (~3 GB at K=20 with selective coverage) is the moderate option; disable-rerank gives back the two-stage advantage and drops MSLS-challenge R@1 from 73.5 to 69.6 (still ahead of MixVPR's 64.0). Interacts with D-C2-2 (cache carve-out), AC-8.3 (10 GB budget), and AC-4.1 (400 ms latency budget); present as a structured Choose block at Plan time, conditional on SelaVPR being elevated to Selected.

Cross-component process gates open (compounding across MixVPR + SALAD + SelaVPR sessions):

  1. D-C2-1 VPR canonical-weights vs aerial-retrain vs aerial-community-checkpoint (raised by MixVPR; reaffirmed by SALAD + SelaVPR — applies to ALL three candidates and every subsequent ground-level-pretrained C2 candidate)
  2. D-C2-2 descriptor-cache carve-out vs raw-tile-cache budget (raised by MixVPR; harshened by SALAD-full; materially-changed-shape by SelaVPR — global-only cache is smallest of all candidates, but local-feature cache is largest by orders of magnitude, forcing the D-C2-7 mitigation choice)
  3. D-C2-3 input-resolution shape (raised by MixVPR/SALAD at 320-322 px; harshened by SelaVPR's 224×224, which makes SelaVPR the smallest-input C2 candidate; AnyLoc + BoQ may sit at the higher-resolution end in subsequent sessions)
  4. D-C2-4 deferred Jetson Orin Nano Super hardware MVE coverage for C2 (raised by MixVPR; broadened by SALAD; broadened further by SelaVPR — must now also cover DINOv2-L → TensorRT fp16 export quality + two-stage re-ranking latency profile + local-feature on-demand extraction performance)
  5. D-C2-5 DINOv2 ViT-export to TensorRT fp16/INT8 path on Jetson Orin Nano Super (raised by SALAD; harshened by SelaVPR — ViT-L is 3.5× larger than ViT-B; export risk profile materially elevated; counter-mitigation by frozen-backbone canonical export pathway)
  6. D-C2-6 SALAD descriptor-size choice (raised by SALAD only — does not apply to SelaVPR which has fixed 1024-D global)
  7. D-C1-1 license-posture interaction with C2 (raised by C1; sharpened by SALAD-GPL-3.0; materially-positive update from SelaVPR-MIT — SelaVPR is the first DINOv2-based C2 candidate on the BSD/permissive track, expanding the BSD/permissive C2 axis to MixVPR + SelaVPR with materially different design points)
  8. D-C2-7 (NEW from SelaVPR) — SelaVPR re-ranking strategy choice (only applies to SelaVPR; first two-stage C2 candidate evaluated)
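
The D-C2-2 carve-out pressure can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only: it assumes the ~160k-tile count, fp16 storage and global-descriptor sizes quoted in this file's cache estimates (Fact #45), and it reads the 10 GB AC-8.3 budget as decimal gigabytes. The names `cache_bytes` and `budget_share` are hypothetical helpers, not part of any candidate's API, and SelaVPR's local-feature cache is deliberately left out because its size depends on the D-C2-7 choice.

```python
TILES = 160_000        # ~400 km2 at the AC-8.1 resolution floor (figure from this file)
BYTES_PER_VALUE = 2    # fp16 storage
BUDGET_BYTES = 10e9    # AC-8.3 cache budget; 10 GB read as decimal (assumption)

# Global-descriptor sizes as quoted in this file's cache estimates.
DESCRIPTOR_DIMS = {
    "SALAD-slim": 544,
    "SelaVPR global": 1024,
    "MixVPR": 2048,
    "SALAD-full": 8448,
}


def cache_bytes(dim, tiles=TILES, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to store one global descriptor per tile."""
    return dim * bytes_per_value * tiles


def budget_share(size_bytes, budget_bytes=BUDGET_BYTES):
    """Cache size as a percentage of the cache budget."""
    return 100.0 * size_bytes / budget_bytes


if __name__ == "__main__":
    # Print candidates from smallest to largest descriptor.
    for name, dim in sorted(DESCRIPTOR_DIMS.items(), key=lambda kv: kv[1]):
        size = cache_bytes(dim)
        print(f"{name:<16}{dim:>6}-D {size / 1e6:8.0f} MB {budget_share(size):5.1f}%")
```

The same two helpers can be reused for any later candidate by adding its descriptor size to `DESCRIPTOR_DIMS`; the result is only as good as the tile-count assumption.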

Fact #45 — NetVLAD per-mode API capability verification (canonical VGG-16 + NetVLAD + PCA-whitening reference baseline on Jetson Orin Nano Super) — DOCUMENTARY PASS WITH ESTABLISHED-BASELINE EXEMPTION + MIT LICENSE TRACK + KNOWN ACCURACY-DEFICIT VS MODERN C2 CANDIDATES + RUNTIME-STACK PORT-RISK; Jetson MVE pending

  • Statement: NetVLAD (Relja/netvlad v1.03 MATLAB canonical, CVPR 2016 / TPAMI 2018; canonical implementation by Relja Arandjelović + Petr Gronat + Akihiko Torii + Tomas Pajdla + Josef Sivic, INRIA WILLOW + ENS + Tokyo Tech + CTU Prague; modern PyTorch reproduction Nanne/pytorch-NetVlad per Source #65) is the canonical learned-VLAD reference baseline for the entire VPR field: a single-stage VPR method that consumes a CNN backbone's last-conv-layer dense descriptor map and produces a fixed-dimensional global descriptor via a learned soft-assignment-VLAD pooling layer (paper Eqs. 1-4). Per the per-Mode API Capability Verification rule, the project's pinned mode is the tuple (VGG-16 backbone cropped at conv5_3 → 512-D dense descriptor map at H×W spatial locations) + (NetVLAD pooling with vlad_preL2_intra method: input features L2-normalised, K=64 cluster centres with learned w_k/b_k/c_k parameters, soft-assignment via softmax over w_k^T x_i + b_k, aggregation of first-order residuals (x_i - c_k) weighted by soft-assignment into a 64×512 = 32768-D K×D matrix, intra-channel L2-normalisation per the _intra suffix, flatten to 32768-D, final L2-normalisation) + (PCA + whitening dimensionality reduction to 4096-D L2-normalised global descriptor, the canonical paper recommendation per Relja/netvlad README §"Train PCA + whitening" and paper §5; alternatively cropped to 256-D / 512-D for tighter cache budgets via cropToDim, only valid for +whitening networks) at 224×224 ImageNet-normalised input. This tuple is selected as the canonical paper test config (Source #66 §5 + Source #64 README) and the canonical pretrained-weight distribution (vd16_pitts30k_conv5_3_vlad_preL2_intra_white.mat, 529 MB, pretrained on Pittsburgh 30k, which is drawn from Google Street View Time Machine imagery). The Pitts30k-trained checkpoint is the recommended starting point for cross-domain transfer projects (the Tokyo-Time-Machine-trained checkpoint vd16_tokyoTM_conv5_3_vlad_preL2_intra_white.mat is also distributed for cross-domain ablation). 
Modern PyTorch runtime path uses Source #65 (Nanne/pytorch-NetVlad) with verified Recall@K reproduction (R@1=85.2 vs paper's 84.1 on Pitts30k-test, a +1.1 absolute reproduction gap), OR re-port from Relja/netvlad MATLAB to PyTorch directly (preserving MIT licensing on the project's NetVLAD path), OR use OpenVPRLab's NetVLAD aggregator option on ResNet50/DINOv2 backbones (per Source #57; that is a different mode per the Per-Mode API rule and would be separately cataloged). Mode-enumeration query (1/3), context7 PASS: /relja/netvlad is indexed in context7 with 90 code snippets and Medium source reputation; the canonical loadNet() API supports network IDs vd16 (VGG-16, last conv = conv5_3, 512-D feature map), vd19 (VGG-19), caffe (AlexNet, last conv = conv5, 256-D), places (Places-CNN). The addLayers() API supports aggregation methods vlad_preL2_intra (default: input L2-norm + intra-channel L2-norm of NetVLAD K×D matrix + flatten + L2-norm), vlad_preL2 (no intra-norm), vladv2_preL2_intra (full NetVLAD with trainable biases per paper Eq. 4; slightly higher capacity but slower convergence), max (global max-pool; paper Eq. for f_max), avg (global avg-pool). Default cluster count K=64. Output dimensionality = K × D (e.g., VGG-16 conv5_3 with K=64 → 32768-D pre-PCA NetVLAD matrix); the recommended PCA + whitening reduces this to 4096-D (canonical paper recommendation). Per the Per-Mode API rule, each (backbone, aggregation-method, K, PCA-dim) tuple is a separately-cataloged sibling mode. Pinned-mode runnable example query (2/3), context7 PASS + WebFetch cross-validation: Source #64 README ships a documented inference CLI (computeRepresentation(net, im) for single-image, serialAllFeats(net, imPath, imageFns, outputFn) for batched), evaluation CLI (testFromFn(dbTest, dbFeatFn, qFeatFn) returns Recall@N + retrieval indices), training CLI (trainWeakly(dbTrain, dbVal, ...) 
with weakly supervised triplet ranking + hard negative mining), and PCA-whitening (addPCA(bestNet, dbTrain, 'doWhite', true, 'pcaDim', 4096)). Source #65 README ships the same with PyTorch + Faiss runtime path (python main.py --mode={train,test,cluster} --arch={vgg16,alexnet} --pooling=netvlad --num_clusters=64). Pretrained weights are distributed via canonical project page download links (Pitts30k-best at 529 MB; all-models tarball at 3 GB). The canonical inference pattern in PyTorch is model.eval(); descriptor = model(images) where images: torch.Tensor[B, 3, 224, 224] ImageNet-normalised, output descriptor: torch.Tensor[B, 4096] L2-normalised after PCA-whitening (or 32768-D pre-whitening if PCA layer not applied). Disqualifier-probe query (3/3): did NOT surface any documented frame-rate floor (single-stage, per-frame independent, single-pass through the CNN backbone + NetVLAD aggregation); did NOT surface any documented memory ceiling at the algorithm level beyond the standard VGG-16 + NetVLAD layer footprint (full VGG-16 ~138M params; backbone cropped at conv5_3 ~14.7M params, ~29 MB at fp16; NetVLAD layer 64×512 = 32768 cluster-centre parameters + 64×512 soft-assignment weights + 64 biases, ~0.13 MB at fp16; PCA-whitening matrix 32768×4096 = ~268 MB at fp16); DID surface a documented Jetson-incompatibility risk on the canonical implementation (MATLAB + MatConvNet stack is not deployable on Jetson Orin Nano Super JetPack 6 / ROS 2 Humble; PyTorch port required); did NOT surface any Jetson Orin Nano measurement for the PyTorch port either (similarly to MixVPR / SALAD / SelaVPR; the D-C2-4 deferred Jetson MVE phase will resolve); did NOT surface a documented ONNX/TensorRT export path inside Relja/netvlad (MATLAB → ONNX is not supported), or inside Nanne/pytorch-NetVlad (relies on standard PyTorch → ONNX export, to be resolved in the C7 row, not C2). 
Five disqualifier-class findings are raised, three new and two shared with prior C2 candidates: (i) MIT license (Source #64 README explicitly states "NetVLAD is distributed under the MIT License (see the LICENCE file).") places NetVLAD on the BSD/permissive license track alongside MixVPR (MIT) + SelaVPR (MIT), distinct from SALAD's GPL-3.0. CRITICAL caveat for the Nanne/pytorch-NetVlad PyTorch port: Source #65 README does NOT cite a LICENSE file, so verification of licensing terms is a Plan-phase blocker if the project adopts the Nanne port directly. Mitigation: re-port the canonical Relja/netvlad MATLAB repo to PyTorch directly (preserves MIT licensing); the canonical algorithm is documented in the canonical paper (Source #66) + canonical README (Source #64), so re-implementation effort is moderate (~1 week of engineering + cluster-init prerequisite + retraining or weight transfer from canonical pretrained weights). (ii) Established-baseline accuracy-deficit vs modern C2 candidates (NEW vs MixVPR + SALAD + SelaVPR: NetVLAD is the simple baseline, the others are the modern competitive leads), per Source #66 paper Table 1 + cross-validated against modern papers' baseline comparisons. Pitts30k-test R@1: NetVLAD = 84.1 (paper) / 85.2 (PyTorch reproduction); MixVPR ~90; SALAD = 95.1 (a Pitts250k figure per the MVE entry below, so not strictly like-for-like); SelaVPR = 92.8; NetVLAD is 5-11 absolute Recall@1 points below modern leads. Tokyo24/7 R@1: NetVLAD = 73.3 (paper); MixVPR = 85.1; SelaVPR = 94.0; NetVLAD is 11.8-20.7 absolute Recall@1 points below modern leads on Tokyo24/7. Nordland-test R@1: NetVLAD reported as ~33 in MixVPR paper Table 1 baseline column; MixVPR = 58.4; SALAD = 76.0; SelaVPR = 85.2; NetVLAD is 25-52 absolute R@1 points below modern leads on Nordland. 
This deficit is documentary and expected — NetVLAD's role per the engine's Component Option Breadth rule is precisely to be the long-established reference point that prevents false confidence in the modern leads, NOT a competitive lead. The accuracy gap is the whole point of including the simple baseline. (iii) Runtime-stack port-risk (NEW vs MixVPR + SALAD + SelaVPR — they ship modern PyTorch implementations with TensorRT-export-known pathways; NetVLAD ships canonical MATLAB + MatConvNet) — the project has three Plan-phase choices: (a) adopt Nanne/pytorch-NetVlad PyTorch port (Source #65) — fast path but license-uncertain; (b) re-port Relja/netvlad MATLAB to PyTorch from scratch — preserves MIT licensing but ~1 week engineering; (c) use OpenVPRLab's NetVLAD aggregator option (Source #57) on ResNet50/DINOv2 backbones — apples-to-apples comparison vs MixVPR but a different mode per Per-Mode API rule. The project's pinned mode in this fact card is option (b), with (a) and (c) tracked as separately-cataloged sibling modes if the project elevates NetVLAD beyond mandatory-baseline role. (iv) Aerial-domain-training caveat (SHARED with MixVPR + SALAD + SelaVPR via D-C2-1) — canonical weights are Pittsburgh 30k + Tokyo Time Machine street-level, NOT aerial-nadir; same Plan-phase decision (project-domain retrain / aerial-trained community checkpoint / elevate alternate C2 candidate) applies. HOWEVER, NetVLAD's role does not require aerial-domain training to be useful as a baseline — its purpose is to provide the long-established reference point against which modern aerial-trained candidates are scored. 
(v) Descriptor-dimensionality cache-footprint (NEW vs MixVPR + SALAD + SelaVPR: NetVLAD's canonical 4096-D PCA-whitened descriptor is 2× larger than MixVPR's 2048-D and 4× larger than SelaVPR's 1024-D global). Per ~400 km² operational area at the AC-8.1 resolution floor (~160k tiles): NetVLAD 4096-D × 2 bytes × 160k = ~1.3 GB fp16 / 13% of the 10 GB AC-8.3 cache budget. This is the second-largest single-stage descriptor cache of any C2 candidate evaluated so far (vs MixVPR's ~650 MB / 6.5%, SelaVPR-global-only's ~320 MB / 3.2%, SALAD-slim-544's ~170 MB / 1.7%); only SALAD-full-8448 at 2.7 GB / 27% is larger, and AnyLoc-49152 / BoQ-16384 / DINOv2-VLAD would also exceed it if and when added. The 256-D / 512-D cropToDim variants are documented as supported by the canonical implementation (only valid for +whitening networks) and would reduce cache footprint to ~80 MB / 0.8% (256-D) or ~160 MB / 1.6% (512-D) at the cost of further Recall@K loss. Pinned-mode sentence: "We will use NetVLAD with VGG-16 backbone cropped at conv5_3 + NetVLAD pooling layer (vlad_preL2_intra method: input features L2-normalised, K=64 cluster centres, intra-channel L2-norm of K×D matrix, final L2-norm → 32768-D raw NetVLAD descriptor) + PCA + whitening to 4096-D L2-normalised global descriptor at 224×224 ImageNet-normalised input (canonical paper test config with Pittsburgh 30k pretrained weights vd16_pitts30k_conv5_3_vlad_preL2_intra_white.mat), with inputs {1× ADTi 20MP nav frame stream → center-cropped + bilinearly downscaled to 224×224 + ImageNet-normalised} and expect outputs {4096-D L2-normalised global descriptor per frame for cosine top-K retrieval over the operational area's tiles} on Jetson Orin Nano Super (8 GB shared, JetPack 6, ROS 2 Humble; PyTorch fp16 + TensorRT baseline via re-ported MIT-licensed canonical weights; final inference runtime selection deferred to C7). 
Mandatory simple-baseline role per engine Component Option Breadth rule: NetVLAD's purpose is to be the long-established VLAD-aggregation reference point against which modern C2 leads (MixVPR / SALAD / SelaVPR / EigenPlaces / AnyLoc / BoQ) are scored; the documented Recall@1 deficit of 5-52 absolute points on standard benchmarks vs modern leads is expected and serves the role's purpose."
  • Source: Source #64 (Relja/netvlad v1.03 canonical MATLAB README + README_more + project page WebFetch + context7 /relja/netvlad indexed lookup), Source #65 (Nanne/pytorch-NetVlad modern PyTorch reproduction README + verified Recall@K reproduction), Source #66 (canonical paper arXiv:1511.07247 / Arandjelović et al. CVPR 2016 + TPAMI 2018: §3.1 NetVLAD layer Eqs. 1-4 + §4 weakly supervised triplet ranking loss + §5 implementation details + Table 1 Pitts30k-test Recall@K + cross-source paper citation by every modern VPR work)
  • Phase: Phase 2
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer + simple-baseline reference-point owner + license-posture decision-maker (Nanne port license verification gate)
  • Confidence: for mode-enumeration, runnable-example, parameter-count, license (canonical), RTX-3090 PyTorch reproduction Recall@K, and ground-level-benchmark Recall@K documentary evidence; for Established-baseline exemption applicability per the engine's source-tiering rule (NetVLAD is the mandatory simple-VLAD baseline per Component Option Breadth rule, exempt from strict 18-month Critical-novelty window); ⚠️ for Nanne/pytorch-NetVlad license (README does NOT cite a LICENSE file; Plan-phase verification gate); ⚠️ for Jetson Orin Nano Super latency / memory / accuracy (no documentary measurement; Jetson MVE will resolve); ⚠️ for VGG-16 → TensorRT fp16 / INT8 export quality (VGG-16 is a 6× larger and ~4× slower CNN than ResNet50 per modern benchmarks; export path is well-documented but runtime cost is materially higher than MixVPR's ResNet50); for canonical-checkpoint aerial-domain fitness (same caveat as MixVPR + SALAD + SelaVPR: canonical weights are Pittsburgh 30k + Tokyo Time Machine street-level-trained, no aerial-nadir benchmark in canonical paper); for accuracy-deficit-as-feature framing (the 5-52 absolute R@1 deficit vs modern leads is the whole point of including NetVLAD as the mandatory simple baseline; this is not a disqualifier, it is the role's definition)
  • Related Dimension: SQ3+SQ4 / C2 mandatory simple-VLAD baseline candidate — per-mode API capability verification gate
  • Fit Impact: DOCUMENTARY PASS for the per-mode API capability verification gate — NetVLAD has a documented runnable per-mode example with the project's pinned configuration (canonical MATLAB CLI via Source #64 + modern PyTorch CLI via Source #65 + algorithmic specification via Source #66 paper), multiple documented pretrained checkpoints (Pittsburgh 30k for cross-domain transfer + Tokyo Time Machine for cross-domain ablation + AlexNet/VGG-16/VGG-19 backbone variants for capacity ablation), and no API-level disqualifier. HOWEVER, five caveats are raised — three new vs MixVPR + SALAD + SelaVPR, two shared: (i) MIT license-track placement on canonical — same as MixVPR-MIT + SelaVPR-MIT; CRITICAL Nanne/pytorch-NetVlad license-uncertainty caveat — the most-cited PyTorch port does NOT cite a LICENSE file; Plan-phase verification gate; mitigation via re-port from canonical MATLAB MIT repo. (ii) Established-baseline accuracy-deficit vs modern C2 candidates (NEW vs MixVPR + SALAD + SelaVPR — they are the modern competitive leads, NetVLAD is the long-established reference point) — Pitts30k-test R@1 deficit of 5-11 absolute, Tokyo24/7 R@1 deficit of 11.8-20.7 absolute, Nordland-test R@1 deficit of 25-52 absolute. This deficit IS the role's purpose — NetVLAD's job is to be the long-established VLAD-aggregation reference point that prevents false confidence in modern leads, NOT to be a competitive lead. (iii) Runtime-stack port-risk (NEW vs MixVPR + SALAD + SelaVPR — they ship modern PyTorch implementations) — three Plan-phase port-strategy choices documented: Nanne/pytorch-NetVlad (fast, license-uncertain), re-port from canonical (preserves MIT, ~1 week engineering), OpenVPRLab-NetVLAD-on-ResNet50 (apples-to-apples vs MixVPR but separately-cataloged sibling mode). 
(iv) Descriptor-dimensionality cache-footprint (NEW vs SelaVPR's 1024-D global; comparable to SALAD-full-8448 within order of magnitude) — canonical 4096-D PCA-whitened consumes ~1.3 GB / 13% of 10 GB cache budget; 256-D / 512-D cropToDim variants documented as supported (only valid for +whitening networks) — interacts with D-C2-2 carve-out decision and D-C2-6-style Plan-phase descriptor-size choice. (v) Aerial-domain-training caveat (SHARED with MixVPR + SALAD + SelaVPR via D-C2-1) — canonical weights are Pittsburgh 30k + Tokyo Time Machine street-level, not aerial-nadir; HOWEVER, NetVLAD's mandatory-simple-baseline role does NOT require aerial-domain training to be useful — its purpose is to provide the long-established reference point against which modern aerial-trained candidates are scored. NEW Plan-phase decisions raised by NetVLAD closure (will be tagged D-C2-8 + D-C2-9): D-C2-8 (NEW) NetVLAD PyTorch-port-strategy choice (Nanne port with license-uncertainty / re-port from canonical with MIT preservation / OpenVPRLab-NetVLAD-on-ResNet50 as separately-cataloged sibling mode); D-C2-9 (NEW) NetVLAD descriptor-dimension choice (canonical 4096-D PCA-whitened / 512-D cropToDim for tighter cache / 256-D cropToDim for tightest cache — only valid for +whitening networks). The deferred Jetson Orin Nano Super hardware MVE phase still gates final accuracy/latency/memory measurement (D-C1-2 + D-C2-4) — NetVLAD's measurement role on the Jetson is to establish the simple-VLAD-baseline floor that modern C2 leads must exceed by margin to justify their added complexity. License: MIT for canonical Relja/netvlad (per Source #64 README) — permissive, BSD/permissive license track; license-uncertain for Nanne/pytorch-NetVlad PyTorch port (Plan-phase verification gate).
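
The vlad_preL2_intra pooling steps named in the Statement (per-feature L2-norm, softmax soft-assignment over w_k^T x_i + b_k, weighted residual aggregation, intra-normalisation, flatten, final L2-norm) can be sketched without any framework. This is a dependency-free illustration written for this note, not the Relja/netvlad or Nanne/pytorch-NetVlad code: the function names are hypothetical, the parameters are random rather than trained, the demo uses toy sizes instead of K=64, D=512, and the PCA + whitening step is omitted.

```python
import math
import random


def l2_normalise(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def netvlad_pool(features, weights, biases, centres):
    """Pool N local D-dim features into one K*D descriptor.

    features: N x D local descriptors (one per spatial location)
    weights:  K x D soft-assignment filters (w_k)
    biases:   K soft-assignment biases (b_k)
    centres:  K x D cluster centres (c_k)
    """
    num_clusters, dim = len(centres), len(centres[0])
    vlad = [[0.0] * dim for _ in range(num_clusters)]
    for x in features:
        x = l2_normalise(x)  # "preL2": normalise each input feature
        logits = [sum(w * v for w, v in zip(weights[k], x)) + biases[k]
                  for k in range(num_clusters)]
        assign = softmax(logits)  # soft assignment over the K clusters
        for k in range(num_clusters):
            a, c, row = assign[k], centres[k], vlad[k]
            for d in range(dim):
                row[d] += a * (x[d] - c[d])  # weighted first-order residual
    # "_intra": L2-normalise each cluster's residual row, then flatten
    flat = [v for row in vlad for v in l2_normalise(row)]
    return l2_normalise(flat)  # final L2 normalisation


def random_matrix(rng, rows, cols):
    """Gaussian random matrix standing in for trained parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    K, D, N = 4, 8, 9  # toy sizes; the pinned mode uses K=64, D=512, N=14*14
    desc = netvlad_pool(random_matrix(rng, N, D), random_matrix(rng, K, D),
                        [0.0] * K, random_matrix(rng, K, D))
    print(len(desc))  # K * D values in the pooled descriptor
```

With the pinned K=64 and D=512 the same function would return the 32768-D raw descriptor that the PCA + whitening layer then reduces to 4096-D; a production path would use a vectorised PyTorch layer instead of Python loops.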

C2 — Per-Mode API Capability Verification (engine Step 2 — NetVLAD session entry, 2026-05-08)

MVE — NetVLAD with VGG-16 cropped at conv5_3 + NetVLAD pooling (vlad_preL2_intra, K=64) + PCA-whitening @ 224×224 → 4096-D global descriptor (canonical Pittsburgh-30k-pretrained variant; Tokyo-Time-Machine-pretrained, AlexNet/VGG-19 backbone variants, and 256-D/512-D cropToDim variants documented as separately-cataloged sibling modes)

  • Source: Source #64 (Relja/netvlad v1.03 canonical README + README_more: loadNet('vd16', 'conv5_3') for VGG-16 backbone, addLayers(net, opts, dbTrain) with opts.method='vlad_preL2_intra' for NetVLAD pooling, addPCA(bestNet, dbTrain, 'doWhite', true, 'pcaDim', 4096) for PCA-whitening, computeRepresentation(net, im) for single-image inference, serialAllFeats(net, imPath, imageFns, outputFn) for batched inference, testFromFn(dbTest, dbFeatFn, qFeatFn) for Recall@K evaluation, pretrained vd16_pitts30k_conv5_3_vlad_preL2_intra_white.mat (529 MB) for Pittsburgh-domain canonical and vd16_tokyoTM_conv5_3_vlad_preL2_intra_white.mat for Tokyo-Time-Machine-domain canonical, both distributed via the canonical project page; context7 indexed at /relja/netvlad), accessed 2026-05-08; Source #65 (Nanne/pytorch-NetVlad modern PyTorch reproduction README: python main.py --mode={train,test,cluster} --arch={vgg16,alexnet} --pooling=netvlad --num_clusters=64, verified VGG-16 reproduction R@1=85.2 on Pitts30k-test vs paper's 84.1, license-uncertain because the README does NOT cite a LICENSE file), accessed 2026-05-08; Source #66 (canonical paper arXiv:1511.07247 / Arandjelović et al. CVPR 2016: §3.1 NetVLAD pooling layer Eqs. 1-4 + §4 weakly supervised triplet ranking loss + §5 implementation details + Table 1 Pitts30k-test Recall@K; TPAMI 2018 extended version)
  • Inputs in the example: Pittsburgh 30k images (perspective images sampled from Google Street View Time Machine panoramas) for training at 224×224 (ImageNet mean/std normalised); Pittsburgh 30k / Pittsburgh 250k / Tokyo24/7 evaluation images at 224×224; batch tensor images: torch.Tensor[B, 3, 224, 224]; VGG-16 backbone cropped at conv5_3 (full VGG-16 ~138M params; the conv layers up to conv5_3 hold ~14.7M params, i.e. ~59 MB at fp32 / ~29 MB at fp16; output spatial feature tensor [B, 512, 14, 14] at 224×224); NetVLAD pooling layer with K=64 cluster centres, learned w_k (1×1×512 conv filters per cluster) + b_k (per-cluster biases) + c_k (cluster centres), soft-assignment via softmax over w_k^T x_i + b_k, aggregation of first-order residuals (x_i - c_k) weighted by soft-assignment into a 64×512 = 32768-D K×D matrix; intra-channel L2-norm + flatten + final L2-norm → 32768-D L2-normalised raw NetVLAD descriptor; PCA + whitening dimensionality reduction to 4096-D L2-normalised global descriptor (canonical paper recommendation per Source #66 §5; alternatively cropToDim to 256-D / 512-D for tighter cache budgets, only valid for +whitening networks)
  • Outputs in the example: descriptor: torch.Tensor[B, 4096] L2-normalised; cosine top-K retrieval against pre-cached descriptors; canonical paper Table 1 reports Pitts30k-test R@1=84.1 (VGG-16 + NetVLAD + whitening trained on Pittsburgh; PyTorch reproduction R@1=85.2 per Source #65), R@5=94.6, R@10=95.5; Tokyo24/7 R@1=73.3 (across daytime/sunset/nighttime queries); Tokyo Time Machine cross-domain ablation R@1 reported in paper §5 across multiple training regimes; canonical paper Table 1 Recall@K positions NetVLAD as the CVPR 2016 SOTA at the time of publication, with subsequent C2 candidates measuring their improvement against this baseline (MixVPR Pitts30k R@1 ~90 = +6 absolute over NetVLAD, SALAD Pitts250k R@1=95.1 = +11 absolute over NetVLAD, SelaVPR Pitts30k R@1=92.8 = +8.7 absolute over NetVLAD)
  • Project inputs: 1× ADTi 20MP nav frame stream (5472×3648, target 3 fps) → center-cropped to 3648×3648 (square) → bilinearly downscaled to 224×224 → ImageNet-normalised → fp16 batch on Jetson Orin Nano Super
  • Project outputs required: 4096-D L2-normalised global descriptor per frame; cosine top-K (project default K=10 per Fact #25) against pre-cached descriptor table over the ~400 km² operational area's tiles at AC-8.1 resolution floor; satisfies AC-8.6 retrieval-recall requirement under cross-season / cross-domain / scene-change conditions ONLY as a baseline floor — NetVLAD is NOT expected to satisfy AC-8.6 competitively vs modern aerial-trained candidates; satisfies AC-4.1 latency budget for steady-state pending Jetson MVE measurement (VGG-16 forward pass on Jetson Orin Nano Super at fp16 + TensorRT estimated ~30-50 ms per frame, NetVLAD aggregation ~5-10 ms, PCA-whitening ~1-2 ms = ~40-60 ms total — comfortable margin within 400 ms budget); satisfies AC-NEW-2 spoofing-promotion path as the simple-baseline retrieval reference, not as the competitive lead
  • Match assessment: exact mode match for (VGG-16 cropped at conv5_3 backbone, NetVLAD pooling with vlad_preL2_intra method and K=64 cluster centres, PCA-whitening to 4096-D, 224×224 input, 4096-D L2-normalised global descriptor output); training+evaluation+PCA-whitening CLIs exist in canonical Relja/netvlad (Source #64) AND in Nanne/pytorch-NetVlad PyTorch port (Source #65); multiple pretrained checkpoints documented (Pittsburgh 30k canonical + Tokyo Time Machine canonical, VGG-16/AlexNet/VGG-19 backbone variants, distributed via canonical project page); ⚠️ partial input domain (canonical weights trained on Pittsburgh 30k + Tokyo Time Machine street-level urban imagery vs project's nadir aerial 1 km AGL; domain shift unverified, same caveat as MixVPR + SALAD + SelaVPR); ⚠️ Jetson Orin Nano Super export risk on canonical MATLAB stack (MATLAB + MatConvNet not deployable on JetPack 6; PyTorch port required); ⚠️ partial PyTorch port license (Source #65 README does NOT cite a LICENSE file; Plan-phase blocker if Nanne port is adopted directly; mitigation via re-port from canonical MIT repo, ~1 week engineering); ⚠️ documented Recall@K deficit vs modern C2 leads (Pitts30k -5 to -11 R@1 absolute; Tokyo24/7 -11.8 to -20.7 R@1 absolute; Nordland -25 to -52 R@1 absolute); this deficit is expected and IS the role's purpose (mandatory simple-VLAD baseline per engine Component Option Breadth rule); ⚠️ descriptor-dimensionality cache-footprint at 4096-D PCA-whitened (~1.3 GB / 13% of AC-8.3 cache budget; second-largest single-stage descriptor cache of any C2 candidate evaluated so far, behind only SALAD-full-8448; 256-D / 512-D cropToDim variants documented as supported with +whitening networks for tighter cache budgets at cost of further Recall@K loss)
  • If ⚠️ or : docs do not explicitly disqualify the algorithmic mode. The (backbone, aggregation-method, K, PCA-whitening-dim) tuple, input size, normalisation, and output shape are all documented and runnable, directly via Relja/netvlad MATLAB (canonical) OR Nanne/pytorch-NetVlad PyTorch (modern reproduction with verified Recall@K) OR re-ported PyTorch (preserves MIT licensing). However, five caveats elevate the verification gate's interpretation beyond MixVPR's, SALAD's, and SelaVPR's: (i) MIT license-track placement on canonical (POSITIVE: same as MixVPR + SelaVPR); license-uncertain caveat on Nanne port (NEGATIVE: Plan-phase blocker if adopted directly); (ii) Established-baseline accuracy-deficit vs modern leads (5-52 absolute R@1 points across Pitts30k/Tokyo24/7/Nordland; this deficit IS the role's purpose per engine Component Option Breadth rule); (iii) Runtime-stack port-risk (NEW vs all prior C2 candidates; three Plan-phase port-strategy choices documented); (iv) Descriptor-dimensionality cache-footprint at 4096-D (~1.3 GB / 13%, with 256-D / 512-D cropToDim variants for tighter budgets); (v) Aerial-domain-training caveat (SHARED with MixVPR + SALAD + SelaVPR via D-C2-1, but NetVLAD's mandatory-baseline role does NOT require aerial-domain training to be useful). → Status: Mandatory simple-baseline (engine Component Option Breadth rule) with MIT license + license-uncertain-Nanne-port caveat + established-baseline-accuracy-deficit-as-feature + runtime-stack-port-risk caveat + 4096-D-descriptor-cache caveat + aerial-domain-training caveat, BSD/permissive track (canonical MIT). Final role assignment is NOT promotion to "Selected" but promotion to "Mandatory simple-baseline reference floor" that all modern C2 leads must measurably exceed on the project's evaluation conditions to justify their added complexity. 
The deferred Jetson Orin Nano Super hardware MVE phase will measure NetVLAD's Jetson latency/memory/Recall@K floor — modern leads MixVPR / SALAD / SelaVPR / EigenPlaces (next session) / AnyLoc / BoQ must show measurable advantage over this floor under the project's specific operating context (aerial nadir, 1 km AGL, eastern/southern Ukraine cross-season, AC-4.1 + AC-4.2 + AC-8.3 budgets) to remain in the Plan-phase candidate pool.
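
The project-side steps around the descriptor (square center-crop of the 5472×3648 nav frame, then cosine top-K against the pre-cached table) can be sketched as follows. This is a minimal illustration under stated assumptions, not project code: `center_crop_box` and `top_k_cosine` are hypothetical helper names, the cache is a plain dict keyed by made-up tile ids, and descriptors are assumed to be L2-normalised already so that a dot product equals cosine similarity. The resize to 224×224, ImageNet normalisation and the model forward pass are left out.

```python
import heapq


def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centred square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def top_k_cosine(query, table, k=10):
    """Rank cached descriptors by cosine similarity to the query.

    Assumes every descriptor is already L2-normalised, so the dot product
    equals the cosine similarity. Returns [(tile_id, score), ...], best first.
    """
    scored = ((tile_id, sum(q * t for q, t in zip(query, desc)))
              for tile_id, desc in table.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Square crop of the 20MP nav frame before downscaling.
    print(center_crop_box(5472, 3648))
    # Tiny 2-D stand-in for the 4096-D descriptor cache (tile ids are made up).
    cache = {"tile_a": [1.0, 0.0], "tile_b": [0.6, 0.8], "tile_c": [0.0, 1.0]}
    print(top_k_cosine([1.0, 0.0], cache, k=2))
```

A linear scan like this is only meant to show the ranking rule; at ~160k tiles a real deployment would more likely hold the table as one matrix or use an index such as Faiss, which Source #65 already relies on.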

C2 — Per-numbered-Restriction × Per-numbered-AC Sub-Matrix per Candidate (NetVLAD addition)

NetVLAD — per-numbered binding (C2-relevant lines only; cross-cutting N/A above also apply identically)

Cells share the legend defined under the MixVPR sub-matrix. Where a binding is identical in both substance and evidence to the MixVPR / SALAD / SelaVPR rows, the NetVLAD row points to those rows to avoid restating; where NetVLAD's pinned mode produces a materially different binding (mandatory-simple-baseline role, larger 4096-D PCA-whitened descriptor, MATLAB-canonical-stack port-risk, expected accuracy-deficit-as-feature), the NetVLAD row carries a distinct evidence cite.

| Line | Binding | Evidence (one-line cite) |
|---|---|---|
| AC-1.1 (frame-center within 50 m, ≥80% normal-flight photos) | Verify (downstream); expected to fail competitive bar; baseline floor only | Same downstream-of-C2 dependency as MixVPR + SALAD + SelaVPR rows; documentary evidence of NetVLAD retrieval recall on aerial nadir at AC-8.1 resolution floor is absent, so the Plan-phase aerial-training decision (D-C2-1) + Jetson MVE on Derkachi flight are required. NetVLAD-specific framing: paper Table 1 Pitts30k R@1=84.1 (street-level urban) is 5-11 absolute below MixVPR/SALAD/SelaVPR; this deficit on ground-level cross-domain is expected to widen on aerial nadir due to NetVLAD's older VGG-16 backbone + simpler aggregation (vs ResNet50+MixVPR / DINOv2-B+SALAD / DINOv2-L+SelaVPR). NetVLAD's role here is to establish the simple-VLAD-baseline floor that modern C2 leads must exceed by ≥5-10 absolute R@1 points to justify inclusion, not to be a competitive contender |
| AC-1.2 (frame-center within 20 m, ≥50% normal-flight photos) | Verify (downstream); expected to fail competitive bar; baseline floor only | Same as AC-1.1, tighter tail; AerialExtreMatch Recall@1 stratified by difficulty cell remains the documentary target. NetVLAD-specific consideration: NetVLAD's single-stage retrieval has no second-stage filter (vs SelaVPR's local-feature MNN re-ranking), so geometric-fine-grain accuracy at the AC-1.2 tail is structurally less robust than two-stage methods; this is documented as the canonical limitation in modern Patch-NetVLAD (CVPR 2021), which exists precisely to add patch-level re-ranking on top of NetVLAD |
| AC-2.1b (satellite-anchor registration succeeds, AC-1.1/1.2 + AC-2.2 + AC-8.2 + AC-8.6 conditions) | Verify (downstream); expected to fail competitive bar; baseline floor only | C2's contribution identical to MixVPR + SALAD + SelaVPR rows: top-K retrieval feeding C3+C4; NetVLAD's recall floor sets the expectation that modern C2 leads must clear; Jetson MVE measurement on AerialExtreMatch + Derkachi flight |
| AC-3.3 (≥3 disconnected segments via satellite-reference re-localization) | Pass (API) → Verify (recall); expected to fail competitive bar; baseline floor only | NetVLAD's per-frame top-K cosine retrieval is structurally identical to MixVPR + SALAD + SelaVPR-global-only for re-localization (no temporal state required); single-stage simplicity is structurally less robust against perceptual aliasing (no re-ranking filter). Cross-season recall under NetVLAD's VGG-16 + NetVLAD aggregation is documented as substantially below modern leads: paper Table on Tokyo24/7 R@1=73.3 vs SelaVPR's 94.0 (-20.7) vs MixVPR's 85.1 (-11.8). Aerial nadir cross-season is unverified; AerialExtreMatch + D-C2-1 required |
| AC-4.1 (latency <400 ms p95, end-to-end camera→FC) | Pass (with Verify); VGG-16 + NetVLAD + PCA-whitening on Jetson Orin Nano Super at fp16 + TensorRT estimated ~40-60 ms/frame total; comfortable budget | Source #66 paper §5 reports VGG-16 forward pass at standard image-classification benchmarks (~10-20 ms on contemporary GPUs); NetVLAD aggregation is a soft-max + multiply-add over K×D = 32768 elements (~1-2 ms); PCA-whitening matrix multiply is a 32768×4096 dense MatMul (~1-2 ms with TensorRT). An RTX-3090-to-Jetson-Orin-Nano-Super extrapolation factor of 4-6× gives ~40-60 ms total per frame at fp16+TensorRT, comfortable within the AC-4.1 400 ms budget before C1+C3+C5+C8 costs are added. D-C2-4 deferred Jetson MVE risk is LOWEST among all C2 candidates evaluated so far: NetVLAD's CNN backbone is the most export-friendly (VGG-16 has the most well-documented TensorRT export pathway of any backbone in this row); structural simplicity of single-stage retrieval; no DINOv2 ViT export-risk; no two-stage re-ranking latency. HOWEVER, accuracy floor is the trade-off; see AC-1.1/1.2/2.1b/3.3 rows above |
| AC-4.2 (memory <8 GB shared) | Pass (with Verify); comfortable margin | VGG-16 + NetVLAD + PCA-whitening weights: cropped backbone ~14.7M params × 2 bytes (fp16) = ~29 MB + NetVLAD layer ~0.13 MB + PCA matrix ~268 MB = ~300 MB total weights (vs SALAD's ~172 MB and SelaVPR's ~600 MB and MixVPR's ~25 MB; NetVLAD model footprint is medium, dominated by the PCA matrix); activations at 224×224 batch=1 ~30 MB; descriptor cache for ~400 km² @ 0.5 m/px tiles: 4096-D global descriptor → ~1.3 GB fp16 / 13% of 10 GB cache budget, the second-largest single-stage descriptor cache of any C2 candidate evaluated so far (vs MixVPR's 650 MB / 6.5%, SALAD-slim-544's 170 MB / 1.7%, SelaVPR-global-only's 320 MB / 3.2%; only SALAD-full-8448 at 2.7 GB / 27% is larger). 256-D / 512-D cropToDim variants documented as supported (only valid for +whitening networks); they would reduce cache footprint to ~80 MB / 0.8% (256-D) or ~160 MB / 1.6% (512-D) at cost of further Recall@K loss. AC-8.3 cache budget interaction is the D-C2-9 NetVLAD descriptor-dimension Plan-phase choice (NEW). Co-resident memory pressure with C1/C3/C5/C6 manageable; Jetson MVE measurement |
| AC-8.1 (cache-interface resolution ≥0.5 m/px, ideally 0.3 m/px) | Pass (with Verify) | NetVLAD is resolution-agnostic at the algorithm level (VGG-16 accepts any input size; 224×224 is the canonical paper test resolution); cross-resolution generalization at 0.5 m/px tile GSD vs nav-camera 12 cm/px GSD is unverified, and AerialExtreMatch cross-scale cells (Fact #19) are the documentary target; same dependency as MixVPR + SALAD + SelaVPR rows |
| AC-8.6: Scale-ratio (any UAV-frame ground footprint at deployment altitude must be retrievable) | Verify (input downscale matches SelaVPR and is more aggressive than MixVPR/SALAD; older backbone may be less scale-invariant) | At 1 km AGL the nav-camera frame footprint is 470×314 m to 980×655 m (per restrictions.md); NetVLAD's 224×224 input is the same downscale aggressiveness as SelaVPR's 224×224 (more aggressive than MixVPR's 320×320 / SALAD's 322×322). Cross-scale recall at AC-8.6 spec is exactly the AerialExtreMatch test cell; Jetson MVE measurement. NetVLAD-specific consideration: VGG-16 backbone is older and less scale-invariant than ResNet50 (documented in modern CNN literature), so the cross-scale recall floor is expected to be lowest of all C2 candidates evaluated so far |
AC-8.6 — Scene change in active-conflict sectors Verify — expected to fail competitive bar; baseline floor only Cratering / building destruction / road realignment is exactly the AerialExtreMatch "scene-change" cell + the Skoltech aerial-VPR survey (Source #38); canonical NetVLAD weights are not aerial-trained — D-C2-1 will materially affect this row identically to MixVPR + SALAD + SelaVPR. NetVLAD-specific consideration: the soft-assignment-VLAD aggregation has no built-in mechanism to reject local-feature drift from scene change (vs SelaVPR's local-feature MNN re-ranking which provides a structural filter; vs SALAD's optimal-transport dustbin which discards uninformative regions). This is the limitation that motivated Patch-NetVLAD (CVPR 2021) — adding patch-level re-ranking on top of NetVLAD precisely to address scene-change robustness
AC-8.6 — Compute & latency under steady-state and re-loc-trigger Pass — single-stage constant per-frame cost (LOWEST risk among C2 candidates) NetVLAD's per-frame compute is constant (single-stage retrieval, no re-ranking — vs SelaVPR's variable cost). Steady-state and re-loc-trigger workloads have identical latency profile (~40-60 ms total per frame at fp16+TensorRT extrapolation). Co-resident memory + GPU-time pressure under simultaneous C1+C2+C3 inference manageable — VGG-16 backbone is the most-export-friendly + smallest-runtime-risk of all C2 candidates evaluated; D-C2-4 deferred Jetson MVE risk is LOWEST for NetVLAD; D-C2-5 ViT-export-risk does NOT apply (NetVLAD uses CNN backbone, not ViT). This cost-model advantage is the structural counterpart to the accuracy-deficit-as-feature — NetVLAD trades modern recall for runtime simplicity, which IS the role's purpose
AC-NEW-2 (spoofing-promotion latency <3 s p95) Pass (latency budget very comfortable) → Verify (recall at re-anchor): expected to fail competitive bar; baseline floor only Same structure as MixVPR + SALAD + SelaVPR-global-only rows: NetVLAD per-frame global retrieval at fp16 + TensorRT is well under the 3 s budget on extrapolation (~40-60 ms total per frame, i.e. ~50-75× under budget); single-stage simplicity is the lowest-latency option in the C2 row. The gating constraint is whether re-anchor retrieval succeeds on the first or first few frames after spoofing detection; recall under the "first-frame after spoof onset" condition is unverified, and a Jetson MVE on the Derkachi flight is required. NetVLAD-specific consideration: re-anchor recall is expected to be substantially below modern leads (per documented Tokyo24/7 R@1=73.3 vs SelaVPR's 94.0); the spoofing-promotion event may need to fall through to a modern C2 lead as primary with NetVLAD as the simple-baseline reference floor
AC-NEW-6 (imagery freshness — never satellite_anchored on stale-tile match) Pass (mechanical) NetVLAD returns top-K with cosine scores from global descriptors identically to MixVPR + SALAD + SelaVPR-global-only; freshness-age decision is a downstream C5/C6 filter on the retrieved candidates. No re-ranking step (unlike SelaVPR) — freshness-aware candidate filtering happens entirely after the NetVLAD top-K retrieval
AC-NEW-7 (cache-poisoning safety budget — P(>30 m geo-misalign) <1%, P(>100 m) <0.1%) Verify (downstream — single-stage retrieval has NO structural advantage over poisoned-but-misaligned tiles) NetVLAD's contribution is retrieval correctness under mid-flight-written tile (AC-8.4) presence; if a misaligned mid-flight tile has a near-correct global descriptor it CAN poison the global-retrieval stage. Unlike SelaVPR, NetVLAD has no second-stage filter against geometric misalignment — single-stage retrieval is structurally less robust against the cache-poisoning attack class. However, like MixVPR + SALAD, this is the structural baseline single-stage cost that the modern leads share with NetVLAD. Multi-flight Monte Carlo replay is the validation, D-C2-1 affects this. NetVLAD's role here is to establish the simple-baseline cache-poisoning resistance floor that modern C2 leads must measurably exceed
Restriction "Operational area: eastern/southern Ukraine" — VPR train-domain match ⚠️ Documentary gap → Verify Canonical NetVLAD weights are Pittsburgh 30k + Tokyo Time Machine (street-level / urban) trained, same caveat as MixVPR + SALAD + SelaVPR — D-C2-1 applies identically. NetVLAD-specific consideration: as the mandatory simple-baseline, NetVLAD does NOT require aerial-domain training to fulfill its reference role — its job is to be the long-established floor against which modern aerial-trained candidates are scored. Aerial-trained NetVLAD weights are NOT a search target for this candidate; the role is satisfied by the canonical Pittsburgh-30k-pretrained weights
Restriction "Altitude ≤1 km AGL; terrain assumed flat (rolling steppe / agricultural)" — VPR scale band match Verify Same as AC-8.6 scale-ratio row; cross-scale recall at the project's altitude band is the AerialExtreMatch cross-scale cell
Restriction "Weather: predominantly sunny ... seasonal/visibility classes" — VPR cross-season generalization Verify (DOCUMENTARY DEFICIT on cross-illumination/cross-season ground-level) Cross-season VPR is the dominant aerial-VPR failure mode per Fact #19 + SQ5; canonical NetVLAD weights are single-domain — D-C2-1 is the primary lever. NetVLAD-specific finding: paper Tokyo24/7 R@1 = 73.3 (extreme day/night illumination) is the LOWEST across all C2 candidates evaluated; cross-validated against MixVPR paper Table 1 baseline column reporting Nordland R@1 ~33 — documenting the substantial cross-season deficit vs modern leads (MixVPR 58.4 / SALAD 76.0 / SelaVPR 85.2). This deficit IS the role's purpose — NetVLAD as the simple-VLAD baseline establishes the cross-season recall floor that modern leads must measurably exceed to justify their added complexity
Restriction "Navigation camera (pinned): ADTi 20MP, 5472×3648" Pass (API) — same downscale aggressiveness as SelaVPR NetVLAD consumes any 224×224 ImageNet-normalised input; the 5472×3648 → 224×224 downscale is the same aggressiveness as SelaVPR (more aggressive than MixVPR's 320×320 / SALAD's 322×322). D-C2-3 input-resolution-shape Plan-phase decision applies identically to NetVLAD as to SelaVPR. NetVLAD's older VGG-16 backbone may be more sensitive to information loss at this aggressive downscale than modern backbones — but documentary evidence is consistent with the cross-scale-ratio limitations of the simple-baseline role
Restriction "Satellite Imagery — resolution ≥0.5 m/px" — VPR descriptor pipeline at AC-8.1 floor Verify Same as AC-8.1; algorithm-level resolution-agnostic, recall at 0.5 m/px tile GSD vs 12 cm/px nav-camera GSD unverified
Restriction "Satellite Imagery — Cache budget: 10 GB" (descriptor budget carve-out) Verify (largest single-stage cache footprint of all C2 candidates so far at canonical 4096-D) Per-candidate: 4096-D global descriptor cache ~1.3 GB fp16 / 13% of cache budget, the largest single-stage descriptor cache of any C2 candidate evaluated so far (vs MixVPR's 650 MB / 6.5%, SelaVPR-global-only's 320 MB / 3.2%, SALAD-slim-544's 170 MB / 1.7%; only SALAD-full-8448 at 2.7 GB / 27% is larger). 256-D / 512-D cropToDim variants are documented as supported (only valid for +whitening networks) and would reduce cache footprint to ~80 MB / 0.8% (256-D) or ~160 MB / 1.6% (512-D) at the cost of further Recall@K loss vs the canonical 4096-D. AC-8.3 explicitly says "Pre-extracted descriptors/indices count against the cache budget unless explicitly carved out"; the D-C2-2 carve-out decision interacts with the D-C2-9 NetVLAD descriptor-dimension Plan-phase choice (NEW)
Restriction "Companion computer: Jetson Orin Nano Super, 8 GB shared" Pass (with Verify) — LOWEST runtime risk among C2 candidates VGG-16 fp16 inference on Jetson Orin Nano Super has the most-well-documented TensorRT export pathway of any backbone in this row — D-C2-5 ViT-export-risk does NOT apply (NetVLAD uses CNN backbone, not ViT); D-C2-4 deferred Jetson MVE risk is LOWEST for NetVLAD. Steady-state co-resident memory + GPU-time with C1 + C3 (matcher) manageable — single-stage simplicity is the runtime advantage of the simple-baseline role. HOWEVER, runtime-stack port-risk is NEW vs MixVPR + SALAD + SelaVPR — canonical implementation is MATLAB + MatConvNet, not deployable on JetPack 6; PyTorch port required (D-C2-8 NEW Plan-phase choice: Nanne port with license-uncertainty / re-port from canonical with MIT preservation / OpenVPRLab-NetVLAD-on-ResNet50 as separately-cataloged sibling mode)
Restriction "License posture (D-C1-1)" — VPR license-track interaction POSITIVE finding on canonical (MIT, BSD/permissive); NEGATIVE on Nanne PyTorch port (license-uncertain) — Plan-phase verification gate NetVLAD canonical implementation is MIT (Source #64 README explicitly states "NetVLAD is distributed under the MIT License (see the LICENCE file).") — permissive. Same as MixVPR-MIT + SelaVPR-MIT; distinct from SALAD's GPL-3.0. Under D-C1-1 = (a) GPL-3.0 track, (b) BSD/permissive lock, or (c) keep-both-tracks-open, NetVLAD canonical is eligible on every license-posture choice. HOWEVER, the most-cited PyTorch port (Nanne/pytorch-NetVlad, Source #65) does NOT cite a LICENSE file — Plan-phase verification gate is required before Nanne port adoption. Mitigation: re-port from canonical MIT MATLAB to PyTorch directly (~1 week engineering, preserves MIT licensing). This places NetVLAD on the BSD/permissive C2 axis: MixVPR (MIT) + SelaVPR (MIT) + NetVLAD (MIT canonical) + (next-session: EigenPlaces pending license verification) with materially different design points (single-stage CNN vs single-stage CNN-ResNet50+MixVPR vs two-stage DINOv2-L+SelaVPR vs simple-baseline CNN-VGG16+VLAD). Recommendation: present D-C1-1 + this row to user as a structured Choose block at Plan time, noting NetVLAD's role is mandatory simple-baseline regardless of which D-C1-1 path is chosen
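
The descriptor-cache sizes and budget percentages quoted in the AC-4.2 and cache-budget rows above follow from simple arithmetic: descriptor dimension × bytes per value × number of cached tiles, divided by the 10 GB budget. A minimal sketch that reproduces them is given below. The tile count of 160,000 is not stated in this file; it is an assumption back-derived from the quoted ~1.3 GB figure for a 4096-D fp16 descriptor. The function and constant names are illustrative only.

```python
# Assumption: tile count back-derived from "4096-D -> ~1.3 GB fp16"; not a documented value.
ASSUMED_TILE_COUNT = 160_000
BYTES_PER_FP16 = 2
CACHE_BUDGET_BYTES = 10 * 10**9  # 10 GB cache budget (decimal GB assumed)


def descriptor_cache(dim, tiles=ASSUMED_TILE_COUNT):
    """Return (size in MB, percent of cache budget) for a dim-D fp16 descriptor table."""
    size_bytes = dim * BYTES_PER_FP16 * tiles
    return size_bytes / 1e6, 100.0 * size_bytes / CACHE_BUDGET_BYTES


# Descriptor dimensions as quoted in the rows above.
candidates = {
    "NetVLAD canonical": 4096,
    "NetVLAD cropToDim 512": 512,
    "NetVLAD cropToDim 256": 256,
    "SALAD-full": 8448,
    "SALAD-slim": 544,
}

for name, dim in candidates.items():
    size_mb, pct = descriptor_cache(dim)
    print(f"{name:<24} {dim:>5}-D  {size_mb:8.0f} MB  {pct:5.1f}% of budget")
```

Under that assumed tile count the 4096-D row evaluates to about 1.31 GB (13.1% of budget), and the reduced-dimension variants scale linearly with descriptor dimension, which is why the D-C2-9 dimension choice and the D-C2-2 carve-out decision interact.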

Fact #46 — EigenPlaces per-mode API capability verification (canonical ResNet-50 + GeM + 2048-D viewpoint-robust modern competitive lead on Jetson Orin Nano Super) — DOCUMENTARY PASS WITH MIT LICENSE TRACK + STRUCTURALLY-SIMPLEST MODERN COMPETITIVE CNN ARCHITECTURE + VIEWPOINT-ROBUST TRAINING ADVANTAGE + 60%-LESS-VRAM-RETRAIN ADVANTAGE; Jetson MVE pending; closes C2 mandatory pre-screen at 5/5

  • Statement: EigenPlaces (gmberton/EigenPlaces, ICCV 2023; canonical implementation by Gabriele Berton + Gabriele Trivigno + Barbara Caputo + Carlo Masone, Politecnico di Torino — same author group as CosPlace [CVPR 2022] and the standardized fair-comparison harness gmberton/VPR-methods-evaluation) is a modern competitive single-stage VPR method that introduces a viewpoint-robust training paradigm rather than a new architecture. Per the per-Mode API Capability Verification rule, the project's pinned mode is the (ResNet-50 backbone cropped at the last conv layer → 2048-D dense feature map at H×W spatial locations) + (GeM [Generalized Mean Pooling, Radenović et al. 2018] aggregation) + (single fully-connected layer producing 2048-D global descriptor with L2-normalisation) at 224×224 ImageNet-normalised input tuple — the canonical PyTorch-Hub-distributed best-Recall@K config (Source #67 + Source #68 paper Table 3). PyTorch Hub one-liner model = torch.hub.load("gmberton/eigenplaces", "get_trained_model", backbone="ResNet50", fc_output_dim=2048) returns the pretrained model with no Google-Drive dependency (unlike SelaVPR). Multiple per-Mode sibling candidates are PyTorch-Hub-distributed: (ResNet18, 256), (ResNet18, 512), (ResNet50, 128), (ResNet50, 256), (ResNet50, 512), (ResNet50, 2048), (ResNet101, 128), (ResNet101, 256), (ResNet101, 512), (ResNet101, 2048), (VGG16, 512) — eleven canonical pretrained checkpoints, more than any other C2 candidate evaluated so far. Mode-enumeration query (1/3) — context7 NOT INDEXED + WebFetch fallback PASS: context7 returned 404 for gmberton/eigenplaces and EMPTY results for the search query eigenplaces; per Per-Mode API Capability Verification rule item 2, fall-back to official-docs WebFetch on the canonical repo README + LICENSE was used (Source #67) plus canonical paper WebFetch (Source #68). 
The canonical train.py and eval.py CLIs expose --backbone {ResNet18, ResNet50, ResNet101, VGG16} and --fc_output_dim {N} flags as documented per-mode configuration parameters, with the per-backbone fc_output_dim enumerations listed exhaustively in the README. Per the Per-Mode API rule, each (backbone, fc_output_dim) tuple is a separately-cataloged sibling mode. Pinned-mode runnable example query (2/3) — WebFetch PASS: Source #67 README ships a documented inference CLI (python3 eval.py --backbone ResNet50 --fc_output_dim 2048 --resume_model torchhub — downloads pretrained from PyTorch Hub and runs evaluation against any canonical-eval dataset format), training CLI (python3 train.py --backbone ResNet50 --fc_output_dim 2048 --train_dataset_folder path/to/sf_xl/raw/train/panoramas --val_dataset_folder ... --test_dataset_folder ...), PyTorch Hub one-liner for pure inference (torch.hub.load("gmberton/eigenplaces", "get_trained_model", backbone="ResNet50", fc_output_dim=2048)), and the companion gmberton/VPR-methods-evaluation framework that runs EigenPlaces alongside NetVLAD + SFRS + CosPlace + Conv-AP + MixVPR within a fair-comparison harness — directly usable for the project's Jetson MVE phase. The canonical inference pattern is model.eval(); descriptor = model(images) where images: torch.Tensor[B, 3, 224, 224] ImageNet-normalised, output descriptor: torch.Tensor[B, 2048] L2-normalised. 
Disqualifier-probe query (3/3): did NOT surface any documented frame-rate floor (single-stage, per-frame independent, single pass through the CNN backbone + GeM + FC); did NOT surface any documented memory ceiling at the algorithm level beyond the standard ResNet-50 + GeM + FC footprint (ResNet-50 ~25M params + GeM (one learnable exponent p, otherwise parameter-free) + FC layer 2048×2048 = ~4M params, i.e. ~29M params ≈ ~58 MB total weights at fp16; among the smallest model footprints of any C2 candidate evaluated so far: slightly above MixVPR's ~50 MB and well below SALAD's ~172 MB, SelaVPR's ~600 MB, and NetVLAD's ~400 MB); did NOT surface any Jetson Orin Nano measurement (as with all C2 candidates; the D-C2-4 deferred Jetson MVE phase will resolve this); did NOT surface a documented ONNX/TensorRT export path inside gmberton/EigenPlaces (relies on standard PyTorch → ONNX export, to be resolved in the C7 row, not C2; ResNet-50 is the most-export-friendly modern competitive backbone). Three POSITIVE structural advantages over MixVPR + SALAD + SelaVPR + NetVLAD: (i) STRUCTURALLY-SIMPLEST MODERN COMPETITIVE CNN ARCHITECTURE in the C2 row: ResNet-50 + GeM + FC has fewer moving parts than MixVPR's MLP-Mixer aggregation, SALAD's optimal-transport aggregation + DINOv2-B fine-tuned backbone, SelaVPR's frozen DINOv2-L + per-block adapters + LocalAdapt up-conv module + two-stage retrieval+rerank, and NetVLAD's K=64 cluster-centre soft-assignment + PCA-whitening. Implication: lowest D-C2-4 + D-C2-5 risk among modern competitive C2 leads (ResNet-50 → TensorRT fp16 has the most well-documented export pathway of any modern competitive backbone; no DINOv2 ViT export-risk applies; no two-stage re-ranking variance; no local-feature cache pressure; no NetVLAD-style soft-assignment-to-cluster MatMul). (ii) 60%-LESS-VRAM-RETRAIN advantage over MixVPR (Source #68 §4.4): EigenPlaces ResNet-50 + 2048-D trains with <7 GB GPU VRAM vs MixVPR's 18 GB at canonical batch 480. 
Implication: most retrain-friendly C2 candidate for D-C2-1 aerial-domain retrain decision — the project can iterate on aerial nadir retraining experiments at lower GPU cost; viewpoint-robust training paradigm naturally extends to aerial nadir's wide-area-coverage capture geometry (UAV passes over the same area at multiple headings/altitudes, generating exactly the multi-viewpoint training signal that EigenPlaces is designed to exploit). (iii) VIEWPOINT-ROBUST TRAINING PARADIGM ALIGNMENT WITH AERIAL NADIR USE CASE (Source #68 §3 + §4.3) — EigenPlaces's lateral CosFace loss is explicitly designed to handle queries with very different viewpoints relative to the database images (paper Tab 3 — Tokyo24/7 R@1=93.0 vs CosPlace 87.3 = +5.7, AmsterTime R@1=48.9 vs CosPlace 47.7 = +1.2, Pitts30k R@1=92.5). For aerial nadir VPR where the project's nav-camera ADTi 20MP captures vary in heading and altitude (and where satellite reference imagery has yet another distinct capture geometry — orthorectified, single-time-instant), the viewpoint-robust training paradigm is the most semantically-aligned training prior for the project's domain of any C2 candidate evaluated so far (vs MixVPR's standard metric learning, SALAD's optimal-transport, SelaVPR's adapter-based fine-tuning, NetVLAD's weakly-supervised triplet — none of which explicitly model viewpoint variation). 
Documented Recall@1 vs other C2 candidates evaluated (best-config-of-each): Pitts30k: EigenPlaces ResNet-50 @ 2048 = 92.5 vs MixVPR ~90 vs SALAD-full 95.1 (different paper Table) vs SelaVPR 92.8 vs NetVLAD 84.1 (paper) / 85.2 (PyTorch reproduction); Tokyo24/7: EigenPlaces ResNet-50 @ 2048 = 93.0 (best in EigenPlaces paper Tab 3) vs SelaVPR 94.0 (best per Source #63) vs MixVPR 85.1 vs NetVLAD 73.3 (EigenPlaces and SelaVPR are the two top performers, with SelaVPR winning by +1.0 absolute at the cost of two-stage re-ranking, DINOv2-L backbone export risk, and 150 GB local-feature cache pressure); MSLS Val: EigenPlaces 89.1 vs SALAD 92.2 vs SelaVPR 90.8 vs MixVPR 87.2 (EigenPlaces is third on this dataset; SALAD wins on MSLS-Val by +3.1 absolute); Nordland: EigenPlaces 71.2 vs MixVPR 76.2 vs SALAD 76.0 vs SelaVPR 85.2 vs NetVLAD ~13 (EigenPlaces is fourth on extreme cross-season, well behind SelaVPR's 85.2; SelaVPR is the clear cross-season winner); AmsterTime: EigenPlaces 48.9 vs CosPlace 47.7 vs MixVPR 40.2 vs NetVLAD 16.3 (EigenPlaces is the BEST on extreme decade-scale cross-time domain shift, which is relevant to Ukraine-active-conflict scene-change scenarios where a year or more may separate satellite capture time from UAV flight time); SF-XL test v1 (multi-view + night + viewpoint): EigenPlaces 84.1 vs CosPlace 76.4 vs NetVLAD 40.0 (EigenPlaces is BEST by a large margin: 7.7 absolute over CosPlace, 44.1 absolute over NetVLAD). 
Pinned-mode sentence: "We will use EigenPlaces with ResNet-50 backbone cropped at the last conv layer + GeM (Generalized Mean Pooling) aggregation + single fully-connected layer producing 2048-D L2-normalised global descriptor at 224×224 ImageNet-normalised input (canonical PyTorch-Hub-distributed best-Recall@K config trained on SF-XL panoramas with 200k iterations + batch 128 + Adam lr=1e-5 + mixed precision + lateral-CosFace + frontal-CosFace dual-loss with focal distance D=10m and cell M=15m), with inputs {1× ADTi 20MP nav frame stream → center-cropped + bilinearly downscaled to 224×224 + ImageNet-normalised} and expect outputs {2048-D L2-normalised global descriptor per frame for cosine top-K (K=10 per Fact #25) retrieval against pre-cached descriptor table over the ~400 km² operational area's tiles at AC-8.1 resolution floor} on Jetson Orin Nano Super (8 GB shared, JetPack 6, ROS 2 Humble; PyTorch fp16 + TensorRT baseline; final inference runtime selection deferred to C7). Modern competitive lead role per engine Component Option Breadth rule — EigenPlaces is the viewpoint-robust modern competitive CNN lead that closes the BSD/permissive C2 axis with a 4th materially-different design point alongside MixVPR (MLP-Mixer aggregation, multi-similarity loss) + SelaVPR (DINOv2-L two-stage adapter) + NetVLAD (mandatory simple-VLAD baseline). The viewpoint-robust training paradigm is the most semantically-aligned training prior for aerial nadir VPR of any C2 candidate evaluated so far."
  • Source: Source #67 (gmberton/EigenPlaces canonical README + LICENSE + PyTorch Hub registration with eleven pretrained checkpoints + companion VPR-methods-evaluation fair-comparison harness, accessed 2026-05-08), Source #68 (canonical paper arXiv:2308.10832 / Berton et al. ICCV 2023 — §3 viewpoint-robust training paradigm + §4.2 implementation details + §4.3 Tab 3+4 Recall@1 across 16 datasets + §4.4 resource analysis [<7 GB VRAM training, 60% less than MixVPR, 50% smaller descriptor than MixVPR-best])
  • Phase: Phase 2
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer + viewpoint-robust-training-paradigm reference-point owner + simple-baseline-vs-modern-lead comparison framework owner (via gmberton/VPR-methods-evaluation companion)
  • Confidence: for mode-enumeration (eleven canonical PyTorch-Hub-distributed checkpoints), runnable-example, parameter-count, license (MIT), training-procedure (200k iterations + Adam + dual CosFace loss + 60% less VRAM than MixVPR), and ground-level-benchmark Recall@K documentary evidence across 16 datasets; ⚠️ for input image size (paper does NOT explicitly state input size in §4.2 implementation details — the canonical eval defaults to 224×224 in the companion VPR-methods-evaluation framework following standardized practice across CosPlace + Conv-AP + MixVPR + EigenPlaces siblings, but a --image_size flag is exposed and the algorithm is resolution-agnostic at the API level; project should document the exact --image_size choice at Jetson MVE time as a reproducibility detail); ⚠️ for Jetson Orin Nano Super latency / memory / accuracy (no documentary measurement — Jetson MVE will resolve; ResNet-50 fp16 + TensorRT extrapolation is ~15-30 ms per frame total, lowest among modern competitive C2 leads); ⚠️ for ResNet-50 → TensorRT fp16 / INT8 export quality (well-documented pathway, but project must measure against AerialExtreMatch + Derkachi flight); for canonical-checkpoint aerial-domain fitness (same caveat as MixVPR + SALAD + SelaVPR + NetVLAD — canonical weights are SF-XL street-view-trained, no aerial-nadir benchmark in canonical paper; HOWEVER, EigenPlaces is the most-retrain-friendly C2 candidate at <7 GB GPU VRAM training cost — D-C2-1 aerial retrain decision has the lowest cost on EigenPlaces); for viewpoint-robust training paradigm as semantically-aligned prior for aerial nadir VPR (the lateral CosFace loss is explicitly designed for queries that vary in viewpoint relative to the database, which directly maps to UAV multi-heading / multi-altitude flights over the same operational area)
  • Related Dimension: SQ3+SQ4 / C2 modern competitive viewpoint-robust CNN lead candidate — per-mode API capability verification gate
  • Fit Impact: DOCUMENTARY PASS for the per-mode API capability verification gate — EigenPlaces has a documented runnable per-mode example with the project's pinned configuration (canonical PyTorch CLI via Source #67 + algorithmic specification via Source #68 paper), eleven documented PyTorch-Hub-distributed pretrained checkpoints (more than any other C2 candidate evaluated so far), and no API-level disqualifier. Three POSITIVE structural findings vs all prior C2 candidates: (i) STRUCTURALLY-SIMPLEST MODERN COMPETITIVE CNN ARCHITECTURE (ResNet-50 + GeM + FC — fewer moving parts than MixVPR / SALAD / SelaVPR / NetVLAD) — implication: lowest D-C2-4 + D-C2-5 risk among modern competitive C2 leads; ResNet-50 → TensorRT fp16 has the most well-documented export pathway; no DINOv2 ViT export-risk; no two-stage re-ranking variance; no local-feature cache pressure; no NetVLAD-style soft-assignment-to-cluster MatMul. (ii) 60%-LESS-VRAM-RETRAIN advantage vs MixVPR (Source #68 §4.4) — implication: most retrain-friendly C2 candidate for D-C2-1 aerial-domain retrain decision; the project can iterate on aerial nadir retraining experiments at <7 GB GPU VRAM cost vs MixVPR's 18 GB. (iii) VIEWPOINT-ROBUST TRAINING PARADIGM (paper §3) is the most semantically-aligned training prior for aerial nadir VPR — the lateral CosFace loss is explicitly designed for queries with viewpoint variability, mapping directly to UAV multi-heading / multi-altitude flights over the same operational area. 
Four documented benchmark observations: (a) Tokyo24/7 R@1=93.0 = +5.7 over CosPlace, second only to SelaVPR's 94.0 in the C2 row, with much lower deployment risk than SelaVPR; (b) AmsterTime R@1=48.9 = best in C2 row for extreme decade-scale cross-time domain shift, directly relevant to Ukraine-active-conflict scene-change scenarios; (c) on multi-view datasets (the paper's strength) EigenPlaces wins on most: Pitts30k 92.5, AmsterTime 48.9, SF-XL-v1 84.1; (d) on extreme cross-season Nordland (71.2) and extreme night SVOX-Night (58.9), MixVPR-4096 wins by ~5 absolute on both (Nordland 76.2, SVOX-Night 64.4) and SelaVPR wins on Nordland by 14 absolute (85.2), so EigenPlaces is third in extreme cross-illumination but strongest in viewpoint-robust scenarios. NEW Plan-phase decision raised by EigenPlaces closure (will be tagged D-C2-10): D-C2-10 (NEW) EigenPlaces descriptor-dimension choice (canonical 2048-D / 512-D / 256-D / 128-D; eleven backbone+dim sibling modes documented). 2048-D gives best Recall@1 across multi-view datasets and matches MixVPR-2048 / NetVLAD-canonical for cache-budget direct comparison; 512-D fits within 1.6% of cache budget at modest Recall@1 loss (Tokyo24/7 89.8 vs 93.0 = -3.2); 128-D fits within 0.4% of cache budget at substantial Recall@K loss on cross-domain datasets (paper §4.3 explicit observation). REUSE of D-C2-1 aerial-domain decision: applies identically to EigenPlaces as to MixVPR + SALAD + SelaVPR + NetVLAD, but EigenPlaces is the most retrain-friendly candidate, so D-C2-1 = (a) project-domain retrain on AerialVL is materially cheaper to execute on EigenPlaces than on any other candidate. C2 mandatory pre-screen status: EigenPlaces closes the C2 mandatory pre-screen at 5 of 5 candidates (MixVPR + SALAD + SelaVPR + NetVLAD + EigenPlaces). 
The deferred Jetson Orin Nano Super hardware MVE phase still gates final accuracy/latency/memory measurement (D-C1-2 + D-C2-4) — EigenPlaces's measurement role on the Jetson is to establish the viewpoint-robust modern competitive CNN lead that all other modern competitive leads (SALAD GPL-3.0 / SelaVPR DINOv2-L / MixVPR ResNet50+MixVPR) are scored against on the project's specific operating context (aerial nadir, 1 km AGL, eastern/southern Ukraine cross-season + scene-change, AC-4.1 + AC-4.2 + AC-8.3 budgets). License: MIT for canonical gmberton/EigenPlaces (per Source #67 LICENSE) — permissive, BSD/permissive license track.
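
The descriptor head described in the Statement (GeM pooling over the backbone feature map, followed by L2 normalisation) can be sketched in plain Python so the arithmetic is visible. This illustrates the generic GeM formula only; it is not the code in gmberton/EigenPlaces. The FC layer is omitted, the exponent p=3 and the clamp eps=1e-6 are commonly used defaults assumed here rather than values taken from the sources, and the function names are hypothetical.

```python
import math


def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized Mean pooling: one value per channel.

    feature_map: list of channels, each a flat list of H*W activations.
    p=1 gives average pooling; large p approaches max pooling.
    """
    pooled = []
    for channel in feature_map:
        # Clamp to eps so the fractional power is defined for non-positive activations.
        mean_pow = sum(max(v, eps) ** p for v in channel) / len(channel)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


def l2_normalise(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


# Toy stand-in for a [C=2, H*W=4] feature map (the real one would be [2048, 7*7]).
toy_features = [[0.1, 0.4, 0.2, 0.9], [1.0, 1.0, 1.0, 1.0]]
descriptor = l2_normalise(gem_pool(toy_features))
print(descriptor)
```

The only learnable quantity in this head is p, which is one reason the ResNet-50 + GeM + FC stack has so few moving parts compared with cluster-based or optimal-transport aggregation.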

C2 — Per-Mode API Capability Verification (engine Step 2 — EigenPlaces session entry, 2026-05-08)

MVE — EigenPlaces with ResNet-50 + GeM + 2048-D global descriptor @ 224×224 (canonical PyTorch-Hub best-Recall@K variant; ResNet18/256, ResNet18/512, ResNet50/{128, 256, 512}, ResNet101/{128, 256, 512, 2048}, VGG16/512 documented as separately-cataloged sibling modes)

  • Source: Source #67 (gmberton/EigenPlaces canonical README + LICENSE — python3 eval.py --backbone ResNet50 --fc_output_dim 2048 --resume_model torchhub for canonical pretrained inference, torch.hub.load("gmberton/eigenplaces", "get_trained_model", backbone="ResNet50", fc_output_dim=2048) for PyTorch-Hub one-liner, python3 train.py --backbone ResNet50 --fc_output_dim 2048 --train_dataset_folder path/to/sf_xl/raw/train/panoramas for canonical SF-XL retraining, eleven canonical pretrained checkpoints PyTorch-Hub-distributed, companion gmberton/VPR-methods-evaluation fair-comparison harness, MIT License), accessed 2026-05-08; Source #68 (canonical paper arXiv:2308.10832 / Berton et al. ICCV 2023 — §3 viewpoint-robust training paradigm + §4.2 implementation details [200k iterations, batch 128, Adam lr=1e-5, mixed precision, lateral+frontal CosFace dual loss, cell M=15m, N=3, focal distance D=10m] + §4.3 Tab 3+4 Recall@1 across 16 datasets + §4.4 resource analysis [<7 GB VRAM training, 24 hours on RTX 3090, 60% less GPU memory than MixVPR, 50% smaller descriptor than MixVPR-best])
  • Inputs in the example: SF-XL panoramas (San Francisco eXtra Large, ~170 km², ~2.8M street-level urban database images) for training; 16 ground-level evaluation datasets (multi-view: Pitts30k, Pitts250k, Tokyo24/7, AmsterTime, Eynsham, SF-XL test v1+v2, San Francisco Landmark; frontal-view: MSLS Val, Nordland, St Lucia, SVOX Night/Overcast/Rain/Snow/Sun); batch tensor images: torch.Tensor[B, 3, 224, 224] ImageNet-normalised; ResNet-50 backbone (~25M params, output spatial feature tensor [B, 2048, 7, 7] at 224×224); GeM (Generalized Mean Pooling; no weights apart from a single learnable scalar exponent p) reduces this to [B, 2048, 1, 1]; FC layer (nn.Linear(2048, 2048)) produces the final 2048-D descriptor, which is then L2-normalised
  • Outputs in the example: descriptor: torch.Tensor[B, 2048] L2-normalised; cosine top-K retrieval against pre-cached descriptors; canonical paper Tab 3 reports Pitts30k R@1=92.5 (vs CosPlace 90.9, MixVPR-2048 91.5, NetVLAD-VGG16-4096 85.0), Tokyo24/7 R@1=93.0 (BEST in EigenPlaces Tab 3) (vs CosPlace 87.3, MixVPR-4096 85.1, NetVLAD-VGG16-4096 69.8; SelaVPR 94.0 from Source #63 wins by +1.0 absolute), AmsterTime R@1=48.9 (BEST in C2 row for extreme decade-scale cross-time domain shift), SF-XL test v1 R@1=84.1 (vs CosPlace 76.4, NetVLAD 40.0); paper Tab 4 reports MSLS-Val R@1=89.1 (vs CosPlace 87.4, MixVPR-4096 87.2; SALAD 92.2 wins by +3.1, SelaVPR 90.8 wins by +1.7), Nordland R@1=71.2 (vs MixVPR-4096 76.2 — MixVPR wins by +5; SelaVPR 85.2 wins by +14), SVOX-Night R@1=58.9 (vs MixVPR-4096 64.4 — MixVPR wins by +5.5), St Lucia R@1=99.6 (essentially saturated)
  • Project inputs: 1× ADTi 20MP nav frame stream (5472×3648, target 3 fps) → center-cropped to 3648×3648 (square) → bilinearly downscaled to 224×224 → ImageNet-normalised → fp16 batch on Jetson Orin Nano Super
  • Project outputs required: 2048-D L2-normalised global descriptor per frame; cosine top-K (project default K=10 per Fact #25) against pre-cached descriptor table over the ~400 km² operational area's tiles at AC-8.1 resolution floor; satisfies AC-8.6 retrieval-recall requirement under viewpoint shifts (project's strongest expected performance — viewpoint-robust training paradigm is semantically aligned with UAV multi-heading flights), cross-season (mid-tier expected — paper Nordland 71.2 vs SelaVPR 85.2), cross-domain decade-scale scene-change (project's BEST expected performance per AmsterTime 48.9); satisfies AC-4.1 latency budget for steady-state pending Jetson MVE measurement (ResNet-50 fp16 + TensorRT extrapolation ~15-30 ms total per frame — lowest among modern competitive C2 leads); satisfies AC-NEW-2 spoofing-promotion path with comfortable single-stage retrieval latency
  • Match assessment: exact mode match for (ResNet-50 backbone cropped at last conv layer, GeM pooling, FC layer to 2048-D, 224×224 ImageNet-normalised input, 2048-D L2-normalised global descriptor output); training+evaluation+PyTorch-Hub-pretrained-distribution CLIs exist in canonical gmberton/EigenPlaces (Source #67); eleven pretrained checkpoints documented (more than any other C2 candidate evaluated); companion gmberton/VPR-methods-evaluation fair-comparison harness ships in-the-box for Jetson MVE phase; ⚠️ partial input domain (canonical weights trained on SF-XL San Francisco street-level urban vs project's nadir aerial 1 km AGL — domain shift unverified, same caveat as MixVPR + SALAD + SelaVPR + NetVLAD, but EigenPlaces is the MOST retrain-friendly candidate at <7 GB GPU VRAM training cost); ⚠️ Jetson Orin Nano Super export risk on ResNet-50 (well-documented pathway but Jetson measurement absent — ResNet-50 → TensorRT fp16 extrapolation is the lowest-risk export pathway among modern competitive C2 leads); ⚠️ partial input image size documentation (paper §4.2 does NOT explicitly state input size — companion framework defaults to 224×224 following CosPlace/Conv-AP/MixVPR/EigenPlaces ecosystem standardized practice; algorithm is resolution-agnostic at API level, project must document the exact choice at Jetson MVE time as a reproducibility detail); ⚠️ third-place ranking on extreme cross-season (Nordland 71.2 vs SelaVPR 85.2 is -14 absolute; SVOX-Night 58.9 vs MixVPR-4096 64.4 is -5.5 absolute) — viewpoint robustness comes at the cost of being weaker than DINOv2-based candidates on extreme illumination, but EigenPlaces is the BEST on viewpoint-shift datasets and on extreme decade-scale cross-time domain shift (AmsterTime 48.9)
  • If ⚠️ or : docs do not explicitly disqualify the algorithmic mode. The (backbone, pooling, fc_output_dim, input size, normalisation, output shape) tuple is documented and runnable directly via PyTorch Hub one-liner OR via canonical eval.py CLI. Three POSITIVE structural advantages elevate the verification gate's interpretation vs MixVPR + SALAD + SelaVPR + NetVLAD: (i) STRUCTURALLY-SIMPLEST MODERN COMPETITIVE CNN ARCHITECTURE → lowest D-C2-4 + D-C2-5 risk among modern competitive C2 leads; (ii) 60%-LESS-VRAM-RETRAIN advantage → most retrain-friendly for D-C2-1 aerial-domain retrain decision; (iii) VIEWPOINT-ROBUST TRAINING PARADIGM → most semantically-aligned training prior for aerial nadir VPR (UAV multi-heading flights generate exactly the multi-viewpoint training signal EigenPlaces is designed to exploit). → Status: Documentary lead with aerial-domain-training caveat + structurally-simplest-modern-competitive-CNN advantage + 60%-less-VRAM-retrain advantage + viewpoint-robust-training-paradigm advantage + extreme-cross-season-third-place caveat, BSD/permissive license track (MIT). Final lead promotion to "Selected" deferred to D-C1-2 + D-C2-4 dedicated Jetson Orin Nano Super hardware MVE phase. Per the engine Component Option Breadth rule, EigenPlaces closes the C2 mandatory pre-screen at 5 of 5 candidates with a viewpoint-robust modern competitive CNN design point materially different from MixVPR (MLP-Mixer aggregation), SALAD (optimal-transport + DINOv2-B GPL-3.0), SelaVPR (DINOv2-L two-stage), and NetVLAD (canonical simple-VLAD baseline).
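The project input and output contract above (centre crop to a square, downscale to 224×224, L2-normalised descriptor, cosine top-K) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the EigenPlaces implementation: the helper names `center_crop_box`, `l2_normalise`, and `cosine_top_k` are hypothetical, and the toy 4-D vectors stand in for real 2048-D descriptors.

```python
import math

def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centred square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side

def l2_normalise(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_top_k(query, database, k):
    """Rank database rows by cosine similarity to the query.

    For L2-normalised vectors the dot product equals the cosine similarity.
    Returns the k best (index, score) pairs, best first.
    """
    q = l2_normalise(query)
    scores = []
    for index, row in enumerate(database):
        d = l2_normalise(row)
        scores.append((index, sum(a * b for a, b in zip(q, d))))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# ADTi nav frame: 5472x3648 -> centred 3648x3648 square.
box = center_crop_box(5472, 3648)
side = box[2] - box[0]
downscale = side / 224  # linear downscale factor to the 224x224 network input

# Toy descriptor table (hypothetical values) and a query close to row 0.
database = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0]]
top = cosine_top_k([1.0, 0.05, 0.0, 0.0], database, k=2)
print(box, round(downscale, 2), [index for index, _ in top])
```

The crop discards 912 px on each side of the long edge, and the remaining square is shrunk by a linear factor of roughly 16 before it reaches the network, which is the downscale aggressiveness the D-C2-3 input-resolution decision has to weigh.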

C2 — Per-numbered-Restriction × Per-numbered-AC Sub-Matrix per Candidate (EigenPlaces addition)

EigenPlaces — per-numbered binding (C2-relevant lines only; cross-cutting N/A above also apply identically)

Cells share the legend defined under the MixVPR sub-matrix. Where a binding is identical in both substance and evidence to the MixVPR / SALAD / SelaVPR / NetVLAD rows, the EigenPlaces row points to those rows to avoid restating; where EigenPlaces's pinned mode produces a materially different binding (viewpoint-robust training paradigm, structurally-simplest modern competitive CNN architecture, 60%-less-VRAM-retrain advantage, extreme-cross-season-third-place trade-off), the EigenPlaces row carries a distinct evidence cite.

Line Binding Evidence (one-line cite)
AC-1.1 (frame-center within 50 m, ≥80% normal-flight photos) Verify (downstream) — strongest expected performance on viewpoint-shift; weaker on extreme cross-season Same downstream-of-C2 dependency as MixVPR + SALAD + SelaVPR + NetVLAD rows; documentary evidence of EigenPlaces retrieval recall on aerial nadir at AC-8.1 resolution floor is absent — Plan-phase aerial-training decision (D-C2-1) + Jetson MVE on Derkachi flight required. EigenPlaces-specific framing: paper Tab 3 multi-view datasets (Pitts30k 92.5 + Tokyo24/7 93.0 + AmsterTime 48.9 + SF-XL-v1 84.1) demonstrate strongest viewpoint-robustness in C2 row; aerial nadir UAV multi-heading flights are the ideal use case for EigenPlaces's lateral CosFace loss training paradigm (paper §3.3); however paper Tab 4 frontal-view extreme cross-season (Nordland 71.2 vs SelaVPR 85.2 = -14) suggests extreme cross-season recall is weaker than SelaVPR — D-C2-1 aerial+cross-season retrain may be required for competitive performance
AC-1.2 (frame-center within 20 m, ≥50% normal-flight photos) Verify (downstream) — single-stage retrieval inherits same geometric-fine-grain limit as MixVPR + NetVLAD Same as AC-1.1, tighter tail; AerialExtreMatch Recall@1 stratified by difficulty cell remains the documentary target. EigenPlaces-specific consideration: single-stage retrieval has no geometric verification step (vs SelaVPR's local-feature MNN re-ranking) — geometric-fine-grain accuracy at AC-1.2 tail is structurally less robust than two-stage methods; HOWEVER, viewpoint-robust training paradigm provides better discrimination between near-duplicate viewpoints than MixVPR's standard metric learning
AC-2.1b (satellite-anchor registration succeeds) Verify (downstream) — strongest expected performance on viewpoint-shift use cases C2's contribution identical to MixVPR + SALAD + SelaVPR-global-only + NetVLAD rows — top-K retrieval feeding C3+C4; EigenPlaces's viewpoint-robust training paradigm is the most semantically-aligned for satellite-vs-UAV-aerial cross-domain (different capture altitudes, different headings, different times-of-day); Jetson MVE measurement on AerialExtreMatch + Derkachi flight
AC-3.3 (≥3 disconnected segments via satellite-reference re-localization) Pass (API) → Verify (recall) — strongest viewpoint-shift performance; mid-tier extreme-cross-season EigenPlaces's per-frame top-K cosine retrieval is structurally identical to MixVPR + SALAD + SelaVPR-global-only + NetVLAD for re-localization (no temporal state required); single-stage simplicity is structurally less robust against perceptual aliasing (no re-ranking filter), but viewpoint-robust training compensates partially. Cross-season recall: paper Nordland 71.2 (vs MixVPR 76.2 -5, SALAD 76.0 -4.8, SelaVPR 85.2 -14, NetVLAD ~33 +38) — mid-tier in C2 row for extreme cross-season; HOWEVER best in C2 row for extreme decade-scale cross-time (AmsterTime 48.9 vs CosPlace 47.7 / MixVPR 40.2 / NetVLAD 16.3). Aerial nadir cross-season+cross-time unverified — AerialExtreMatch + D-C2-1 required
AC-4.1 (latency <400 ms p95, end-to-end camera→FC) Pass (with Verify) — among the LOWEST latencies of the modern competitive C2 leads Source #68 §4.4 paper acknowledges "extraction time negligible compared to kNN matching at scale" but does NOT report explicit latency. Contemporary GPU benchmarks place ResNet-50 fp16 at ~1-2 ms on A100 / ~3-5 ms on RTX 3090; GeM pooling is parameter-free (~0.1 ms); FC layer 2048×2048 is ~0.5 ms; total ResNet-50 + GeM + FC ≈ 5 ms on RTX 3090. Applying the RTX-3090-to-Jetson-Orin-Nano-Super extrapolation factor of 4-6× gives ~15-30 ms total per frame at fp16+TensorRT, on par with MixVPR (~10-30 ms) and at or below the remaining C2 candidates (SALAD ~20-30 ms, SelaVPR ~350 ms two-stage, NetVLAD ~40-60 ms). D-C2-4 deferred Jetson MVE risk is LOW — ResNet-50 → TensorRT fp16 has the most well-documented export pathway among modern competitive backbones; D-C2-5 ViT-export-risk does NOT apply (EigenPlaces uses ResNet-50, not ViT); structural simplicity of single-stage retrieval; no two-stage re-ranking variance. Comfortable budget within AC-4.1 400 ms before C1+C3+C5+C8 costs added
AC-4.2 (memory <8 GB shared) Pass (with Verify) — near-SMALLEST model footprint among C2 candidates ResNet-50 (~25M params) + GeM (parameter-free) + FC layer 2048×2048 (~4M params) ≈ 29M params ≈ ~58 MB total weights at fp16 — on par with MixVPR's ~50 MB at 2048-D config and far smaller than the other C2 candidates evaluated so far (SALAD's ~172 MB DINOv2-B, SelaVPR's ~600 MB DINOv2-L+adapters, NetVLAD's ~400 MB VGG-16+NetVLAD+PCA-whitening); activations at 224×224 batch=1 ~25 MB; descriptor cache for ~400 km² @ 0.5 m/px tiles: 2048-D global descriptor → ~650 MB fp16 / 6.5% of 10 GB cache budget (identical to MixVPR-2048; smaller than SALAD-full-8448 ~2.7 GB / 27% and NetVLAD-canonical ~1.3 GB / 13%; larger than SelaVPR-global-only ~320 MB / 3.2% and SALAD-slim-544 ~170 MB / 1.7%). 128-D / 256-D / 512-D sibling modes documented as PyTorch-Hub-distributed alternatives — would reduce cache to ~40 MB / 0.4% (128-D) or ~80 MB / 0.8% (256-D) or ~160 MB / 1.6% (512-D) at modest Recall@K loss on cross-domain (paper §4.3 explicit). AC-8.3 cache budget interaction is D-C2-10 EigenPlaces descriptor-dimension Plan-phase choice (NEW). Co-resident memory pressure with C1/C3/C5/C6 manageable — Jetson MVE measurement
AC-8.1 (cache-interface resolution ≥0.5 m/px, ideally 0.3 m/px) Pass (with Verify) EigenPlaces is resolution-agnostic at the algorithm level (ResNet-50 accepts any input size; canonical eval defaults to 224×224 in companion VPR-methods-evaluation framework); cross-resolution generalization at 0.5 m/px tile GSD vs nav-camera 12 cm/px GSD unverified, AerialExtreMatch cross-scale cells (Fact #19) is the documentary target — same dependency as MixVPR + SALAD + SelaVPR + NetVLAD rows
AC-8.6 — Scale-ratio (any UAV-frame ground footprint at deployment altitude must be retrievable) Verify — same downscale aggressiveness as SelaVPR + NetVLAD At 1 km AGL the nav-camera frame footprint is 470×314 m to 980×655 m (per restrictions.md); EigenPlaces's canonical 224×224 is the same downscale aggressiveness as SelaVPR's 224×224 and NetVLAD's 224×224 (more aggressive than MixVPR's 320×320 / SALAD's 322×322). Cross-scale recall at AC-8.6 spec is exactly the AerialExtreMatch test cell — Jetson MVE measurement. EigenPlaces-specific consideration: viewpoint-robust training paradigm partially compensates for cross-scale aggressiveness — multi-heading UAV flights at the same altitude generate the multi-viewpoint training signal EigenPlaces is designed to exploit
AC-8.6 — Scene change in active-conflict sectors Verify — BEST in C2 row for extreme decade-scale cross-time domain shift (AmsterTime 48.9) Cratering / building destruction / road realignment is exactly the AerialExtreMatch "scene-change" cell + the Skoltech aerial-VPR survey (Source #38). EigenPlaces-specific finding: AmsterTime R@1=48.9 (paper Tab 3) is the BEST in C2 row for extreme decade-scale cross-time domain shift — vs CosPlace 47.7 (+1.2), MixVPR 40.2 (+8.7), NetVLAD 16.3 (+32.6). The viewpoint-robust training paradigm extends naturally to scene-change scenarios because the lateral CosFace loss exposes the model to many different views of the same place, building partial-occlusion robustness. AerialExtreMatch + D-C2-1 required for aerial-domain cross-time validation
AC-8.6 — Compute & latency under steady-state and re-loc-trigger Pass — single-stage constant per-frame cost (LOWEST risk among modern competitive C2 leads) EigenPlaces's per-frame compute is constant (single-stage retrieval, no re-ranking — vs SelaVPR's variable cost). Steady-state and re-loc-trigger workloads have identical latency profile (~15-30 ms total per frame at fp16+TensorRT extrapolation). Co-resident memory + GPU-time pressure under simultaneous C1+C2+C3 inference manageable — ResNet-50 backbone is the most-export-friendly modern-competitive backbone; D-C2-4 deferred Jetson MVE risk is LOWEST for EigenPlaces among modern competitive C2 leads (NetVLAD's VGG-16 has lower export risk but lower competitive recall — different design point); D-C2-5 ViT-export-risk does NOT apply (EigenPlaces uses ResNet-50, not ViT). This cost-model advantage compounds with the viewpoint-robust training advantage
AC-NEW-2 (spoofing-promotion latency <3 s p95) Pass (latency budget very comfortable) → Verify (recall at re-anchor) — strongest viewpoint-shift performance Same structure as MixVPR + SALAD + SelaVPR-global-only + NetVLAD rows: EigenPlaces per-frame global retrieval at fp16 + TensorRT well under 3 s budget (~15-30 ms total per frame, ~100-200× under budget); single-stage simplicity is among the lowest-latency options in the C2 row. Gating constraint is whether re-anchor retrieval succeeds on first or first-few frames after spoofing detection. EigenPlaces-specific consideration: viewpoint-robust training paradigm + best multi-view dataset Recall@K (Pitts30k 92.5, Tokyo24/7 93.0) + best AmsterTime Recall@K (48.9) suggest strong re-anchor recall at spoofing-promotion event — the UAV may have flown a different heading by the time spoofing is detected, exactly the viewpoint-shift scenario EigenPlaces is designed for. Jetson MVE on Derkachi flight required
AC-NEW-6 (imagery freshness — never satellite_anchored on stale-tile match) Pass (mechanical) EigenPlaces returns top-K with cosine scores from global descriptors identically to MixVPR + SALAD + SelaVPR-global-only + NetVLAD; freshness-age decision is a downstream C5/C6 filter on the retrieved candidates. No re-ranking step (unlike SelaVPR) — freshness-aware candidate filtering happens entirely after the EigenPlaces top-K retrieval
AC-NEW-7 (cache-poisoning safety budget — P(>30 m geo-misalign) <1%, P(>100 m) <0.1%) Verify (downstream — single-stage retrieval has NO structural advantage over poisoned-but-misaligned tiles) EigenPlaces's contribution is retrieval correctness under mid-flight-written tile (AC-8.4) presence; if a misaligned mid-flight tile has a near-correct global descriptor it CAN poison the global-retrieval stage. Unlike SelaVPR, EigenPlaces has no second-stage filter against geometric misalignment — single-stage retrieval is structurally less robust against the cache-poisoning attack class. Multi-flight Monte Carlo replay is the validation, D-C2-1 affects this. EigenPlaces-specific consideration: viewpoint-robust training paradigm may provide partial discrimination benefit (the lateral CosFace loss makes the network focus on stable scene structure rather than viewpoint-specific cues, which may also be more robust against misalignment perturbation), but this is unverified
Restriction "Operational area: eastern/southern Ukraine" — VPR train-domain match ⚠️ Documentary gap → Verify (BUT MOST RETRAIN-FRIENDLY candidate) Canonical EigenPlaces weights are SF-XL San Francisco (street-level / urban) trained — same caveat as MixVPR + SALAD + SelaVPR + NetVLAD; D-C2-1 applies identically. EigenPlaces-specific POSITIVE finding: <7 GB GPU VRAM training cost (vs MixVPR 18 GB, vs DINOv2-based ~24 GB) makes EigenPlaces the most retrain-friendly C2 candidate for D-C2-1 aerial-domain retrain decision — the project can iterate on aerial nadir retraining experiments at the lowest GPU cost; viewpoint-robust training paradigm + multi-heading UAV flights generate the multi-viewpoint training signal EigenPlaces is designed to exploit
Restriction "Altitude ≤1 km AGL; terrain assumed flat (rolling steppe / agricultural)" — VPR scale band match Verify Same as AC-8.6 scale-ratio row; cross-scale recall at the project's altitude band is the AerialExtreMatch cross-scale cell
Restriction "Weather: predominantly sunny ... seasonal/visibility classes" — VPR cross-season generalization Verify (MID-TIER documentary cross-season recall — third in C2 row) Cross-season VPR is the dominant aerial-VPR failure mode per Fact #19 + SQ5; canonical EigenPlaces weights are single-domain — D-C2-1 is the primary lever. EigenPlaces-specific finding: paper Nordland R@1 = 71.2 (vs SelaVPR 85.2 -14, SALAD 76.0 -4.8, MixVPR 76.2 -5, NetVLAD ~33 +38) — third tier in the C2 row for extreme cross-season, behind SelaVPR (clear winner) and the near-tied MixVPR/SALAD pair; SVOX-Night R@1 = 58.9 (vs MixVPR-4096 64.4 -5.5) — fourth in extreme-night. Viewpoint robustness comes at the cost of being weaker than DINOv2-based candidates on extreme illumination, but EigenPlaces is the BEST on viewpoint-shift datasets (Tokyo24/7 93.0, Pitts30k 92.5, AmsterTime 48.9). For aerial nadir Ukraine deployment with cross-season + multi-heading flights, the trade-off is favorable to EigenPlaces if D-C2-1 retrain is committed
Restriction "Navigation camera (pinned): ADTi 20MP, 5472×3648" Pass (API) — same downscale aggressiveness as SelaVPR + NetVLAD EigenPlaces consumes any 224×224 ImageNet-normalised input; the 5472×3648 → 224×224 downscale is the same aggressiveness as SelaVPR + NetVLAD (more aggressive than MixVPR's 320×320 / SALAD's 322×322). D-C2-3 input-resolution-shape Plan-phase decision applies identically to EigenPlaces as to SelaVPR + NetVLAD. Algorithm is resolution-agnostic at API level — --image_size flag is exposed in companion VPR-methods-evaluation framework; project may choose 320×320 or higher at Jetson MVE time at proportional latency cost
Restriction "Satellite Imagery — resolution ≥0.5 m/px" — VPR descriptor pipeline at AC-8.1 floor Verify Same as AC-8.1; algorithm-level resolution-agnostic, recall at 0.5 m/px tile GSD vs 12 cm/px nav-camera GSD unverified
Restriction "Satellite Imagery — Cache budget: 10 GB" — descriptor budget carve-out Verify (medium cache footprint at canonical 2048-D; compact at 128-D) Per-candidate: 2048-D global descriptor cache ~650 MB fp16 / 6.5% of cache budget — identical to MixVPR-2048 (mid-range among the C2 candidates evaluated so far). Smaller sibling modes are documented as PyTorch-Hub-distributed: 128-D ~40 MB / 0.4%, 256-D ~80 MB / 0.8%, 512-D ~160 MB / 1.6%. AC-8.3 explicitly says "Pre-extracted descriptors/indices count against the cache budget unless explicitly carved out" — D-C2-2 carve-out decision interacts with D-C2-10 EigenPlaces descriptor-dimension Plan-phase choice (NEW). EigenPlaces's eleven PyTorch-Hub-distributed checkpoints give the project the widest range of cache-footprint sibling modes of any C2 candidate evaluated
Restriction "Companion computer: Jetson Orin Nano Super, 8 GB shared" Pass (with Verify) — LOWEST runtime risk among modern competitive C2 leads ResNet-50 fp16 inference on Jetson Orin Nano Super has the best-documented TensorRT export pathway of any modern-competitive backbone in this row — D-C2-5 ViT-export-risk does NOT apply (EigenPlaces uses ResNet-50, not ViT). NetVLAD's VGG-16 has lower-still export risk but lower competitive recall (different design point — mandatory simple-baseline vs modern competitive lead). D-C2-4 deferred Jetson MVE risk is LOWEST for EigenPlaces among modern competitive C2 leads. Steady-state co-resident memory + GPU-time with C1 + C3 (matcher) manageable — single-stage simplicity + a small model footprint (~58 MB at fp16) are the runtime advantages over SALAD + SelaVPR + NetVLAD, with MixVPR's ~50 MB footprint comparable
Restriction "License posture (D-C1-1)" — VPR license-track interaction POSITIVE finding (MIT, BSD/permissive) EigenPlaces canonical implementation is MIT (Source #67 LICENSE explicit copyright statement) — permissive. Same as MixVPR-MIT + SelaVPR-MIT + NetVLAD-canonical-MIT; distinct from SALAD's GPL-3.0. Under D-C1-1 = (a) GPL-3.0 track, (b) BSD/permissive lock, or (c) keep-both-tracks-open, EigenPlaces is eligible on every license-posture choice. Closes the BSD/permissive C2 axis with a 4th materially-different design point: MixVPR (CNN-ResNet50 + MLP-Mixer aggregation) + SelaVPR (DINOv2-L two-stage + adapters) + NetVLAD (CNN-VGG16 + soft-assignment-VLAD + PCA-whitening, mandatory simple-baseline) + EigenPlaces (CNN-ResNet50 + GeM + FC, viewpoint-robust training paradigm). The BSD/permissive C2 axis now has the most diverse design-point coverage of any license track in any component row in the project. Recommendation: present D-C1-1 + this row to user as a structured Choose block at Plan time; EigenPlaces is the lowest-risk, most-retrain-friendly modern competitive C2 lead on the BSD/permissive track
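The cache-footprint figures in the AC-4.2 and cache-budget rows all scale linearly with descriptor dimension, so they can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the tile count is not stated in the rows, so it is backed out from the ~650 MB figure at 2048-D fp16 (an assumption), decimal units (1 MB = 10^6 bytes, 10 GB = 10^10 bytes) are also assumed, and `cache_bytes` is a hypothetical helper name.

```python
BYTES_FP16 = 2
CACHE_BUDGET_BYTES = 10 * 10**9  # 10 GB cache budget, decimal units (assumption)

def cache_bytes(n_descriptors, dim, bytes_per_value=BYTES_FP16):
    """Size of a dense descriptor table with one descriptor per tile."""
    return n_descriptors * dim * bytes_per_value

# Assumption: tile count inferred from the ~650 MB figure at 2048-D fp16.
n_tiles = round(650 * 10**6 / (2048 * BYTES_FP16))

# Map each descriptor dimension to (size in MB, share of the cache budget in %).
footprints = {}
for dim in (128, 256, 512, 2048):
    size = cache_bytes(n_tiles, dim)
    footprints[dim] = (size / 10**6, 100 * size / CACHE_BUDGET_BYTES)

for dim, (megabytes, percent) in footprints.items():
    print(f"{dim:>5}-D: {megabytes:7.1f} MB  ({percent:.2f}% of budget)")
```

Under these assumptions the table holds roughly 159k descriptors, and the 128-D, 256-D, and 512-D sibling modes land near the ~40 MB, ~80 MB, and ~160 MB figures quoted in the rows, which is the trade space the D-C2-10 descriptor-dimension choice operates in.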

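The latency extrapolation used in the AC-4.1 and AC-NEW-2 rows is a range multiplied by a range, and it is easy to reproduce. The sketch below takes the RTX 3090 per-stage estimates and the 4-6× Jetson factor exactly as quoted in the rows; the function names `extrapolate` and `headroom` are hypothetical, and none of these numbers is a Jetson measurement.

```python
# RTX 3090 per-stage estimates quoted in the AC-4.1 row, in milliseconds.
BACKBONE_MS = (3.0, 5.0)    # ResNet-50 fp16, (optimistic, pessimistic)
GEM_MS = 0.1                # GeM pooling
FC_MS = 0.5                 # 2048x2048 FC layer
JETSON_FACTOR = (4.0, 6.0)  # RTX 3090 -> Jetson Orin Nano Super slowdown range

def extrapolate(low_ms, high_ms, factor):
    """Scale a desktop latency range by an (optimistic, pessimistic) factor."""
    return low_ms * factor[0], high_ms * factor[1]

def headroom(budget_ms, latency_range):
    """How many times the latency fits in a budget: (worst case, best case)."""
    low, high = latency_range
    return budget_ms / high, budget_ms / low

desktop = (BACKBONE_MS[0] + GEM_MS + FC_MS, BACKBONE_MS[1] + GEM_MS + FC_MS)
jetson = extrapolate(desktop[0], desktop[1], JETSON_FACTOR)

ac_4_1 = headroom(400, jetson)      # AC-4.1 end-to-end budget
ac_new_2 = headroom(3000, jetson)   # AC-NEW-2 spoofing-promotion budget

print(f"desktop {desktop[0]:.1f}-{desktop[1]:.1f} ms, "
      f"jetson {jetson[0]:.1f}-{jetson[1]:.1f} ms")
print(f"AC-4.1 headroom {ac_4_1[0]:.0f}x-{ac_4_1[1]:.0f}x, "
      f"AC-NEW-2 headroom {ac_new_2[0]:.0f}x-{ac_new_2[1]:.0f}x")
```

The unrounded range comes out at about 14-34 ms, which the rows round to ~15-30 ms. The headroom figures cover C2 alone; the C1, C3, C5, and C8 costs still have to fit in the same AC-4.1 budget.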
Fact #47 — LightGlue per-mode API capability verification (canonical SuperPoint+LightGlue cross-domain sparse matcher reference baseline on Jetson Orin Nano Super) — DOCUMENTARY PASS WITH APACHE-2.0 (LIGHTGLUE) + MAGIC-LEAP-RESTRICTIVE-LICENSE (SUPERPOINT WEIGHTS) DISQUALIFIER + DISK+LIGHTGLUE / ALIKED+LIGHTGLUE PERMISSIVE-LICENSE MITIGATIONS DOCUMENTED + JETSON ONNX/TENSORRT/FP16/FP8 EXPORT PATH ACTIVELY-MAINTAINED VIA fabio-sim/LightGlue-ONNX; Jetson MVE pending; opens C3 mandatory pre-screen at 1/N

  • Statement: LightGlue (cvg/LightGlue, ICCV 2023; canonical implementation by Philipp Lindenberger + Paul-Edouard Sarlin + Marc Pollefeys, ETH Zurich + Microsoft Mixed Reality & AI Lab; same author group as cvg/Hierarchical-Localization (hloc), cvg/glue-factory, cvg/pixel-perfect-sfm) is the canonical adaptive-depth/adaptive-width sparse-matcher reference baseline for the entire local-feature-matching field — a deep neural network for sparse feature matching that consumes (keypoint coords, descriptor vectors) tuples from any of five sibling extractor modes and produces a soft partial assignment matrix combining pairwise-similarity + matchability scores, returning 2D-2D correspondences with confidence scores. Per the per-Mode API Capability Verification rule, the project's pinned mode is the (SuperPoint MagicLeap-pretrained extractor at 1024×1024 grayscale input → up to 1024 keypoints with 256-D descriptors and per-keypoint confidence scores) + (LightGlue matcher with features='superpoint', n_layers=9, depth_confidence=0.95, width_confidence=0.99, filter_threshold=0.1, flash=True auto-detected, mp=False) → up to 1024 2D-2D correspondences with confidence scores feeding the project's downstream C4 PnP+RANSAC pose estimator. The canonical inference pipeline is extractor.extract(image_query) → extractor.extract(image_target) → matcher({'image0': feats_q, 'image1': feats_t}) → rbd() to remove batch dimension → extract points0 = feats0['keypoints'][matches[..., 0]] and points1 = feats1['keypoints'][matches[..., 1]] for the 2D-2D correspondences.
Five separately-cataloged sibling extractor-matcher modes are documented per the Per-Mode API rule: (SuperPoint, LightGlue) with 256-D descriptors and Magic Leap restrictive license; (DISK, LightGlue) with 128-D descriptors and Apache-2.0 permissive license — paper Table 6 documents DISK+LightGlue beats SP+LightGlue on IMC 2020 stereo by +7.99 absolute AUC@5° (67.02 vs 59.03); (ALIKED, LightGlue) with 128-D descriptors and BSD-3-Clause permissive license; (SIFT, LightGlue) with 128-D classical descriptors (patent-free); (DoGHardNet, LightGlue) with 128-D learned descriptors. Mode-enumeration query (1/3) — context7 PASS: /cvg/lightglue is indexed in context7 with High source reputation, benchmark score 85.4, 64 code snippets — confirms canonical reference implementation status; the canonical LightGlue(features=..., n_layers=..., depth_confidence=..., width_confidence=..., filter_threshold=..., flash=..., mp=...) constructor signature is exposed as documented per-mode configuration with features enum supporting superpoint | disk | aliked | sift | doghardnet. 
Pinned-mode runnable example query (2/3) — context7 PASS + WebFetch cross-validation: Source #69 returns nine canonical code snippets (Initialize LightGlue Feature Matcher, Initialize SuperPoint Feature Extractor, Initialize and Use DISK Feature Extractor, Initialize and Use SIFT Feature Extractor, Perform Feature Matching with LightGlue, Complete Matching Pipeline Example, Initialize and Use SuperPoint + LightGlue Matcher, Extract Matched Keypoint Coordinates); Source #70 README ships the canonical pipeline (from lightglue import LightGlue, SuperPoint, DISK; from lightglue.utils import load_image, rbd; extractor = SuperPoint(max_num_keypoints=2048).eval().cuda(); matcher = LightGlue(features='superpoint').eval().cuda(); image0 = load_image('path/to/image0.jpg').cuda(); image1 = load_image('path/to/image1.jpg').cuda(); feats0 = extractor.extract(image0); feats1 = extractor.extract(image1); matches01 = matcher({'image0': feats0, 'image1': feats1}); feats0, feats1, matches01 = [rbd(x) for x in [feats0, feats1, matches01]]; matches = matches01['matches']; points0 = feats0['keypoints'][matches[..., 0]]; points1 = feats1['keypoints'][matches[..., 1]]); Source #71 paper Table 3 documents the canonical Aachen Day-Night visual-localization benchmark with the NetVLAD top-50 retrieval → SP+LightGlue match → PnP+RANSAC pose estimation pipeline — directly equivalent to the project's intended pipeline shape (C2 NetVLAD/MixVPR/SALAD/SelaVPR/EigenPlaces → C3 SP+LightGlue → C4 PnP+RANSAC), Day (0.25m,2°)/(0.5m,5°)/(1.0m,10°) = 89.2/95.4/98.5, Night = 87.8/93.9/100, throughput 17.2 pairs/sec standard / 26.1 pairs/sec optimized on RTX 3080. 
Disqualifier-probe query (3/3): did NOT surface any documented frame-rate floor (single-pair single-pass inference, parameter-free per-pair besides the model itself); did NOT surface any documented memory ceiling at the algorithm level beyond the standard SuperPoint+LightGlue footprint (SuperPoint ~1.3M params + LightGlue ~12M params at canonical config = ~13.3M params ≈ ~27 MB at fp16 total weights); did NOT surface any Jetson Orin Nano measurement directly (similarly to all C2 candidates — D-C3-3 (NEW) deferred Jetson MVE phase will resolve); DID surface a documented ONNX/TensorRT/OpenVINO/FP16/FP8 export path via the actively-maintained companion fabio-sim/LightGlue-ONNX project (Source #73) with January 2026 changelog entries on FP8 quantization workflow via NVIDIA ModelOpt — but Jetson Orin Nano Super has Ampere architecture, NOT FP8-native, so FP8 path applies only with INT8 emulation fallback (verification required at Jetson MVE phase); DID surface a HARD LICENSE DISQUALIFIER on the canonical SuperPoint pretrained weights AND the bundled lightglue/superpoint.py inference file — Source #72 documents the Magic Leap "ACADEMIC OR NON-PROFIT ORGANIZATION NONCOMMERCIAL RESEARCH USE ONLY" Software License Agreement which blocks commercial AND dual-use deployment as documented in the project's question_decomposition.md hard disqualifier list. NEW POSITIVE structural advantages over alternative dense-matcher candidates (e.g., MASt3R, RoMa, dense LoFTR — separately-cataloged or pruned per Fact #26 NGPS template): (i) Apache-2.0 permissive license on cvg/LightGlue itself — places LightGlue ITSELF on the BSD/permissive license track alongside cvg/Hierarchical-Localization (hloc) + Kimera-VIO + OKVIS2 + DPVO + pure-VO baseline; cvg/LightGlue Apache-2.0 status is independent of which extractor's weights are used. 
(ii) Adaptive-depth + adaptive-width pruning (paper §3.3) reduce inference time by ~33% average / 1.45× speedup at <1% accuracy loss on common workloads, with up to 1.86× speedup on easy pairs — critical for Jetson Orin Nano Super's tight latency budget where many UAV-vs-cached-tile pairs are "easy" (high-overlap, low-viewpoint-shift) and only a few are "hard" (cross-season, scene-change). (iii) Bidirectional cross-attention (paper §3.5) computes the similarity matrix only once per layer, saving ~33% time over full cross-attention. (iv) Rotary positional encoding (paper §3.4) provides relative position encoding in self-attention, enabling generalization to image-pair viewpoint-shifts. (v) FlashAttention support (canonical README + paper §C.3) auto-detected on PyTorch ≥2.0; LightGlue-ONNX (Source #73) ships FlashAttention-2 fused ONNX models with up to 80% faster inference on long-keypoint sequences. (vi) HuggingFace Transformers integration (canonical README §"Other links") — pip install transformers plug-and-play with ETH-CVG/lightglue_superpoint model card (separate license terms inherited from HuggingFace + Magic Leap stack). (vii) Kornia integration (canonical README §"Other links") — kornia.feature.LightGlue and kornia.feature.LightGlueMatcher interfaces; LightGlue-ONNX integration via kornia.feature.OnnxLightGlue. (viii) hloc integration for Structure-from-Motion + visual localization via cvg/Hierarchical-Localization — directly applicable to the project's offline-PC pre-flight cache-provisioning C10 row. Documented Recall@K + AUC + throughput vs SuperGlue baseline (paper §5 + Tables 1-7 + Appendix A IMC 2020/2021/2023): HPatches homography Table 1 (SP+LightGlue, 1024 keypoints): R=94.3 / P=88.9 (best precision; +1.5 over SuperGlue 87.4); AUC-DLT@5px=78.6 (vs SuperGlue 76.7, vs SGMNet 76.0; competitive with dense LoFTR 70.6). 
MegaDepth-1500 relative pose Table 2 (SP+LightGlue, LO-RANSAC): AUC@5°/10°/20°=66.7/79.3/87.9 (vs SuperGlue 65.8/78.7/87.5; vs LoFTR 66.4/78.6/86.5 — competitive with dense matcher at a fraction of the inference time); inference time 44.2 ms standard / 31.4 ms adaptive on RTX 3080. Aachen Day-Night Table 3 (SP+LightGlue + hloc + NetVLAD top-50, with cvg/Hierarchical-Localization pipeline): Day (0.25m,2°)/(0.5m,5°)/(1.0m,10°) = 89.2/95.4/98.5, Night = 87.8/93.9/100, 17.2 pairs/sec standard / 26.1 pairs/sec optimized on RTX 3080; directly equivalent to the project's intended pipeline shape (C2 → C3 → C4); documentary evidence that the chosen architectural pattern is canonical and well-validated in the visual-localization community. IMC 2020 stereo (Appendix A Table 6, SP+LightGlue): AUC@5°=59.03 / AUC@10°=72.18 (beats SP+SuperGlue 58.64/71.95 +0.39/+0.23). IMC 2020 stereo with DISK+LightGlue (Appendix A Table 6 — alternative to mitigate D-C3-1): AUC@5°=67.02 / AUC@10°=77.67 — DISK+LightGlue beats SP+LightGlue by +7.99 / +5.49 absolute on stereo AUC@5°/10°, important Plan-phase signal that DISK+LightGlue is competitive with SP+LightGlue and is preferable when SuperPoint license is the binding constraint. IMC 2021 multi-view (Appendix A Table 6, SP+LightGlue): AUC@10°=50.2 / AUC@20°=62.6 (beats SP+SuperGlue 49.9/62.2). Reported headline throughput: 150 FPS @ 1024 keypoints on RTX 3080 with compilation + adaptivity (= ~6.7 ms per pair) and 50 FPS @ 4096 keypoints on RTX 3080 (= 20 ms per pair); 4-10× speedup over SuperGlue depending on input difficulty; 20 FPS @ 512 keypoints on Intel i7 10700K CPU (= 50 ms per pair, CPU baseline). Jetson Orin Nano Super extrapolation factor 4-6× of RTX 3080 baseline → ~30-60 ms per pair @ 1024 keypoints at fp16+TensorRT; ~80-120 ms per pair @ 2048 keypoints.
CRITICAL latency-budget interaction: at the project's expected per-frame K=10 top-K retrieval pairs (Fact #25 + AC-3.3 re-localization) → 10 pairs × 30-60 ms = 300-600 ms per UAV frame on extrapolation, TIGHT against AC-4.1 400 ms budget before C1+C2+C5+C8 costs added — Plan-phase D-C3-3 latency-budget mitigation choice is required: (a) reduce K from 10 to 3-5 (cost: lower retrieval recall under perceptual aliasing); (b) reduce keypoints from 1024 to 512 (cost: lower geometric verification accuracy at AC-1.2 tail); (c) accept TIGHT margin and validate at Jetson MVE; (d) parallelize matcher across multiple Jetson GPU streams (limited by single-GPU shared-memory architecture); (e) elevate to ONNX Runtime + TensorRT EP + adaptive depth (paper §5.4 reports 1.86× speedup on easy pairs, achievable if many of the K pairs are high-overlap). Pinned-mode sentence: "We will use LightGlue with SuperPoint MagicLeap-pretrained extractor at 1024×1024 grayscale input + up to 1024 keypoints with 256-D descriptors + LightGlue matcher with features='superpoint', n_layers=9, depth_confidence=0.95, width_confidence=0.99, filter_threshold=0.1, flash=True at 1024×1024 grayscale input per image (canonical cvg/LightGlue + canonical SuperPoint pretrained weights config), with inputs {1× ADTi 20MP nav frame stream → grayscale-converted + bilinearly downscaled-to-largest-edge 1024 + canonical 1× cached satellite tile per top-K retrieval result from C2} and expect outputs {up to 1024 2D-2D correspondences with confidence scores per (UAV-frame, satellite-tile) image pair, feeding the downstream C4 PnP+RANSAC pose estimator with cosine confidence threshold filter at 0.95 × max-score} on Jetson Orin Nano Super (8 GB shared, JetPack 6, ROS 2 Humble; PyTorch fp16 + TensorRT baseline via fabio-sim/LightGlue-ONNX Source #73; final inference runtime selection deferred to C7 row + D-C3-2). 
CRITICAL LICENSE DISQUALIFIER on SP+LightGlue canonical mode — Magic Leap's SuperPoint LICENSE (Source #72) is "ACADEMIC OR NON-PROFIT ORGANIZATION NONCOMMERCIAL RESEARCH USE ONLY" which blocks commercial AND dual-use deployment per the project's question_decomposition.md hard disqualifier ("anything whose license blocks military / dual-use deployment"); the project's deployment context (fixed-wing UAV in active-conflict eastern/southern Ukraine with AC-NEW-2 spoofing-promotion path) is dual-use military by every reasonable interpretation. Mitigation paths for D-C3-1 NEW Plan-phase decision: (a) DISK+LightGlue (Apache-2.0 throughout) — paper Table 6 shows DISK+LightGlue stereo AUC@5°=67.02 vs SP+LightGlue 59.03 (+7.99 absolute — DISK+LightGlue is demonstrably superior on phototourism stereo to SP+LightGlue) — RECOMMENDED; (b) ALIKED+LightGlue (BSD-3-Clause + Apache-2.0) — second-cleanest license-compliant option but lacks IMC documentary phototourism benchmarks that DISK+LightGlue has; (c) re-train a SuperPoint-class extractor under permissive license (e.g., kornia's reproduction kornia.feature.SuperPoint whose weights' license must be independently verified at Plan-phase OR retrain on aerial nadir corpus); (d) accept Magic Leap noncommercial-research license for the project's R&D phase only with explicit Plan-phase commitment to swap before production deployment (legally risky — internal research could still be construed as commercial preparation given the dual-use deployment intent); (e) use ALIKEDv2 + LightGlue (newer ALIKED variant) if community implementation matures sufficiently. Modern competitive role per engine Component Option Breadth rule — LightGlue is the adaptive-depth/adaptive-width sparse-matcher reference baseline that opens the C3 row with the canonical Apache-2.0 permissive matcher backbone; D-C3-1 SuperPoint-replacement-strategy choice resolves the binding-license-constraint tension on the project's pinned extractor mode."
  • Source: Source #69 (/cvg/lightglue context7 indexed lookup with High source reputation, benchmark 85.4, 64 code snippets; confirms canonical reference implementation status; nine returned canonical code snippets for the pipeline + extractor + matcher initialization + complete-matching-pipeline example), Source #70 (cvg/LightGlue canonical README + LICENSE: Apache-2.0 status; canonical pipeline; PyTorch ≥2.0 + FlashAttention auto-detection + compile() support; HuggingFace Transformers integration; kornia integration; hloc integration; LightGlue-ONNX companion; canonical RTX-3080 throughput benchmarks; eleven canonical pretrained extractor-matcher checkpoints), Source #71 (canonical paper arXiv:2306.13643 / Lindenberger et al. ICCV 2023: §3 architecture + §3.3 adaptive-depth/adaptive-width pruning + §3.4 rotary positional encoding + §3.5 bidirectional cross-attention + §4 training recipe + §5 experiments [HPatches Table 1, MegaDepth-1500 Table 2, Aachen Day-Night Table 3 documentary equivalence to project pipeline shape] + Appendix A IMC 2020/2021/2023 [Table 6 DISK+LightGlue vs SP+LightGlue +7.99 stereo AUC documentary signal for D-C3-1 mitigation] + Appendix B MegaDepth-1800 / Aachen v1.1 / InLoc + Appendix C implementation details + Appendix D timing breakdowns), Source #72 (magicleap/SuperPointPretrainedNetwork LICENSE: "ACADEMIC OR NON-PROFIT ORGANIZATION NONCOMMERCIAL RESEARCH USE ONLY" Software License Agreement; HARD DISQUALIFIER for the canonical SP+LightGlue pinned mode in the project's commercial/dual-use deployment context; mitigation paths for D-C3-1 documented), Source #73 (fabio-sim/LightGlue-ONNX companion ONNX/TensorRT/OpenVINO/FP16/FP8 export project: actively maintained through January 2026 with FP8 ModelOpt workflow, FlashAttention-2 fused ONNX models, MultiHead-Attention fusion, ArgMax→TopK trick ~30% speedup, Kornia integration as kornia.feature.OnnxLightGlue, CLI lightglue-onnx with export | infer | trtexec commands; Jetson Orin Nano Super deployment path documented; FP8 Ampere-emulation verification gate for D-C3-2 NEW Plan-phase choice)
  • Phase: Phase 2
  • Target Audience: System architects + C3 implementer + C4 (PnP+RANSAC) implementer + C7 (Jetson runtime) implementer + C10 (offline-PC pre-flight cache provisioning) implementer + Step-7.5 reviewer + license-posture decision-maker (D-C1-1 + D-C3-1 NEW SuperPoint-replacement-strategy choice) + latency-budget decision-maker (D-C3-2 NEW LightGlue inference runtime choice + D-C3-3 NEW K-pairs-per-frame budget choice)
  • Confidence: for mode-enumeration (five canonical extractor-matcher sibling modes: SP+LightGlue, DISK+LightGlue, ALIKED+LightGlue, SIFT+LightGlue, DoGHardNet+LightGlue), runnable-example (canonical README five-line pipeline + nine context7 indexed snippets), parameter-count (~13.3M params ≈ ~27 MB at fp16 total), license (cvg/LightGlue Apache-2.0 permissive; SuperPoint weights Magic Leap restrictive, a HARD DISQUALIFIER for canonical SP+LightGlue mode in project's dual-use deployment context); for documentary RTX-3080 throughput benchmarks (150 FPS @ 1024 kpts with adaptivity / 50 FPS @ 4096 kpts), HPatches/MegaDepth/Aachen/IMC documentary Recall@K + AUC + throughput across 7 datasets; for Aachen Day-Night Table 3 documentary equivalence to project's intended pipeline shape (C2 NetVLAD top-K → C3 SP+LightGlue → C4 PnP+RANSAC); for DISK+LightGlue Apache-2.0 license-compliant alternative with +7.99 absolute AUC@5° improvement on IMC 2020 stereo over SP+LightGlue (paper Appendix A Table 6), so the D-C3-1 mitigation path is technically superior to the canonical SP+LightGlue mode on phototourism stereo; ⚠️ for Jetson Orin Nano Super latency / memory / accuracy (no documentary measurement; Jetson MVE will resolve via D-C3-3); ⚠️ for Jetson Orin Nano Super FP8 emulation on Ampere architecture (Source #73 documents FP8 ModelOpt workflow, but Jetson Orin Nano Super is Ampere not Hopper/Ada/Blackwell; verification gate at Jetson MVE phase for D-C3-2); ⚠️ for SuperPoint+LightGlue → TensorRT fp16 export quality (well-documented pathway via Source #73, but project must measure on Jetson Orin Nano Super); for canonical-checkpoint aerial-domain fitness (canonical training on synthetic homographies of Oxford-Paris 1M distractors + fine-tuning on MegaDepth phototourism; neither dataset is aerial nadir; same aerial-domain caveat as C2 candidates; aerial applicability referenced transitively via paper §1 Related work citation [83] Zhang et al. 2022 ISPRS "SuperGlue generalizes well to aerial matching" but NO explicit aerial-nadir validation in canonical paper; validation is project-side via Jetson MVE on AerialExtreMatch + Derkachi flight); for Apache-2.0 placement on cvg/LightGlue itself (independent of extractor's weight license); for actively-maintained Jetson deployment pathway via Source #73 (January 2026 changelog entries on FP8 quantization + uv UX refresh)
  • Related Dimension: SQ3+SQ4 / C3 modern adaptive-depth/adaptive-width sparse-matcher reference baseline candidate — per-mode API capability verification gate; opens C3 mandatory pre-screen; raises D-C3-1 SuperPoint-replacement-strategy + D-C3-2 LightGlue-inference-runtime + D-C3-3 K-pairs-per-frame Plan-phase decisions
  • Fit Impact: DOCUMENTARY PASS for the per-mode API capability verification gate: LightGlue has a documented runnable per-mode example with the project's pinned configuration (canonical context7 + WebFetch via Source #69 + Source #70 + Source #71 paper), five documented extractor-matcher sibling modes (SP+LightGlue, DISK+LightGlue, ALIKED+LightGlue, SIFT+LightGlue, DoGHardNet+LightGlue), and no API-level disqualifier. HOWEVER, three caveats are raised, all three NEW for the C3 row: (i) HARD LICENSE DISQUALIFIER on SuperPoint canonical pretrained weights (Source #72 Magic Leap "ACADEMIC OR NON-PROFIT ORGANIZATION NONCOMMERCIAL RESEARCH USE ONLY" Software License Agreement), which blocks commercial AND dual-use deployment; the project's deployment context (eastern/southern Ukraine fixed-wing UAV, AC-NEW-2 spoofing-promotion path) is dual-use military by every reasonable interpretation; mitigation via D-C3-1 NEW SuperPoint-replacement-strategy choice with DISK+LightGlue (Apache-2.0 throughout) RECOMMENDED as the cleanest license-compliant alternative AND demonstrably superior on phototourism stereo (+7.99 absolute AUC@5° per paper Appendix A Table 6). (ii) TIGHT latency-budget interaction at K=10 top-K retrieval pairs per frame: 10 pairs × 30-60 ms = 300-600 ms on Jetson Orin Nano Super extrapolation, against the AC-4.1 400 ms budget before C1+C2+C5+C8 costs are added; D-C3-3 NEW Plan-phase choice (reduce K, reduce keypoints, accept TIGHT margin, parallelize, elevate ONNX Runtime + TensorRT EP + adaptive depth). (iii) Jetson Orin Nano Super FP8 emulation on Ampere is uncertain: Source #73 documents FP8 ModelOpt workflow but Jetson Orin Nano Super is Ampere not Hopper/Ada/Blackwell; verification gate at Jetson MVE phase for D-C3-2 NEW LightGlue-inference-runtime choice (PyTorch-fp16 / Torch-TensorRT / ONNX Runtime + TensorRT EP / pure TensorRT via trtexec + Polygraphy / FP8 ModelOpt-on-Jetson if Ampere FP8 emulation works). 
NEW Plan-phase decisions raised by LightGlue closure: D-C3-1 (NEW) SuperPoint-replacement-strategy choice (DISK+LightGlue with Apache-2.0 + paper Table 6 superiority / ALIKED+LightGlue with BSD-3-Clause+Apache-2.0 / SuperPoint-reproduction-with-permissive-license / accept-Magic-Leap-noncommercial-with-swap-commitment / SIFT+LightGlue classical-baseline-fallback); D-C3-2 (NEW) LightGlue-inference-runtime choice (PyTorch-fp16 / Torch-TensorRT / ONNX Runtime + TensorRT EP via Source #73 / pure TensorRT via trtexec + Polygraphy via Source #73 / FP8 ModelOpt-on-Jetson if Ampere FP8 emulation works); D-C3-3 (NEW) K-pairs-per-frame Plan-phase choice (reduce K from 10 to 3-5 / reduce keypoints from 1024 to 512 / accept TIGHT 300-600 ms margin and validate at Jetson MVE / parallelize matcher across multiple Jetson GPU streams / elevate ONNX Runtime + TensorRT EP + adaptive depth). REUSE of D-C2-1 aerial-domain decision — applies to LightGlue identically as to all C2 candidates; canonical training on synthetic homographies of Oxford-Paris 1M distractors + MegaDepth phototourism is NOT aerial nadir; D-C2-1 retrain decision interacts with D-C3-1 extractor choice (DISK+LightGlue retrain on aerial nadir corpus is the cleanest license-compliant + retrain-friendly pathway). C3 mandatory pre-screen status: LightGlue opens the C3 mandatory pre-screen at 1 of N candidates. Subsequent C3 candidates expected per Component Option Breadth rule include: XFeat (CVPR 2024 — separately-cataloged, documented to outperform LightGlue on speed at slightly lower accuracy); MASt3R (separately-cataloged, paper documented to outperform LightGlue on dense matching but pruned by Fact #26 due to dense-matcher latency on Jetson); RoMa (dense matcher, separately-cataloged); SuperGlue+SuperPoint (canonical SuperGlue baseline, displaced by LightGlue but still documentary evidence); LoFTR (dense matcher reference baseline). 
The deferred Jetson Orin Nano Super hardware MVE phase still gates final accuracy/latency/memory measurement — LightGlue's measurement role on the Jetson is to establish the adaptive-depth/adaptive-width sparse-matcher reference baseline on the BSD/permissive license track (with D-C3-1 mitigation to DISK+LightGlue / ALIKED+LightGlue), against which subsequent C3 candidates (XFeat, MASt3R, RoMa, SuperGlue, LoFTR) are scored on the project's specific operating context (aerial nadir, 1 km AGL, eastern/southern Ukraine cross-season, AC-4.1 + AC-4.2 + AC-8.3 budgets). License: Apache-2.0 for canonical cvg/LightGlue (per Source #70 LICENSE) — permissive, BSD/permissive license track on the matcher itself; Magic Leap restrictive on SuperPoint pretrained weights (per Source #72 LICENSE) — HARD DISQUALIFIER for canonical SP+LightGlue pinned mode in project's dual-use deployment context, mitigation via D-C3-1.
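
The D-C3-3 latency arithmetic in this card can be checked with a minimal sketch. It takes the extrapolated 30-60 ms per-pair range and the AC-4.1 400 ms budget from this card as inputs. The function names and the `other_ms` parameter (a placeholder for the not-yet-measured C1+C2+C5+C8 costs) are hypothetical, sequential matching of the K pairs is assumed, and the per-pair figures remain extrapolations until the Jetson MVE measures them.

```python
"""Sketch of the D-C3-3 matcher latency budget (inputs are extrapolations)."""

BUDGET_MS = 400.0            # AC-4.1 per-frame budget quoted in this card
PER_PAIR_MS = (30.0, 60.0)   # extrapolated Jetson range @ 1024 keypoints


def matcher_cost_ms(k_pairs, per_pair_ms=PER_PAIR_MS):
    """Return (best, worst) matcher cost in ms for k_pairs sequential pairs."""
    best, worst = per_pair_ms
    return k_pairs * best, k_pairs * worst


def max_pairs_within_budget(per_pair_ms, budget_ms=BUDGET_MS, other_ms=0.0):
    """Largest K whose sequential matching fits into the remaining budget.

    other_ms is a hypothetical placeholder for C1+C2+C5+C8 costs.
    """
    remaining = budget_ms - other_ms
    if remaining <= 0:
        return 0
    return int(remaining // per_pair_ms)


if __name__ == "__main__":
    for k in (10, 5, 3):
        best, worst = matcher_cost_ms(k)
        if worst <= BUDGET_MS:
            verdict = "fits"
        elif best <= BUDGET_MS:
            verdict = "fits only at the optimistic end"
        else:
            verdict = "over budget"
        print(f"K={k}: {best:.0f}-{worst:.0f} ms -> {verdict}")
    print("max K at 60 ms/pair:", max_pairs_within_budget(60.0))
```

Under these assumptions K=10 fits only at the optimistic end of the range, while K up to 6 fits even at 60 ms per pair before any other component cost is counted, which is consistent with the K=3-5 range of option (a).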