gps-denied-onboard/_docs/00_research/01_source_registry/C2_vpr.md
Oleksandr Bezdieniezhnykh 846670a5c5 Refactor documentation for splittable artifacts and update references
Updated various documentation files to clarify the handling of splittable artifacts, allowing for folder equivalents of key markdown files when they exceed size limits. Adjusted references in multiple sections to reflect this new structure, ensuring consistency across the research methodology. Enhanced clarity on the saving actions and artifact organization, particularly for `01_source_registry.md`, `02_fact_cards.md`, and `06_component_fit_matrix.md`. This change aims to improve usability and maintainability of the research documentation.
2026-05-08 23:39:30 +03:00


Source Registry — C2 — Visual Place Recognition candidates

Mode A Phase 2 — engine Step 2 (Source Tiering & Exhaustive Web Investigation). Critical-novelty sensitivity per Step 0.5 in ../00_question_decomposition.md. Time windows applied:

  • Lead-candidate / SOTA claims: prefer sources within last 6 months; up to 18 months if older is the official authority.
  • Library/SDK API behaviour: must reflect the currently shipped version at search time (context7 mandatory per lead candidate).
  • Established baselines (KLT, RANSAC, EKF, ORB, SIFT, GTSAM): no time window.

This file replaces a section of the previous monolithic 01_source_registry.md. See 00_summary.md for the full category index. Investigation order is tracked in ../00_question_decomposition.md and the cross-category Investigation Status table in 00_summary.md.


Source #57

  • Title: OpenVPRLab — comprehensive open-source framework for Visual Place Recognition (amaralibey/openvprlab, main); bundles MixVPR aggregator + ResNet50 backbone + GSV-Cities datamodule + FAISS recall harness as the canonical reference implementation
  • Link: https://context7.com/amaralibey/openvprlab/llms.txt (accessed 2026-05-08); canonical README https://github.com/amaralibey/openvprlab
  • Tier: L1 (project-official codebase by the canonical MixVPR author Amar Ali-bey; same repo also packages BoQ aggregator)
  • Publication Date: README live; OpenVPRLab repo created 2024 as the modular successor to amaralibey/MixVPR and amaralibey/gsv-cities repos; main HEAD within Critical-novelty window (per Fact #36 timeliness audit)
  • Timeliness Status: Fully within Critical-novelty window
  • Version Info: OpenVPRLab main HEAD; PyTorch Lightning module; GSV-Cities-light + GSV-Cities datamodules; supported aggregators: MixVPR, BoQ, NetVLAD, GeM, ConvAP, Cosine; supported backbones: ResNet18/50, DinoV2 ViT-S/B/L/G-14
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer
  • Research Boundary Match: Full match for the project's pinned C2 mode (ResNet50 backbone + MixVPR aggregator at 320×320 input → 2048-D L2-normalised descriptor). Code snippets confirmed: Initialize and Use MixVPR Aggregator (parameter-by-parameter API), Initialize VPRFramework and Perform Inference (full backbone+aggregator+loss assembly with images: torch.Tensor[B, 3, 320, 320]), Compute Recall Performance with FAISS (the validation harness reporting Recall@{1,5,10,15}), Train and Monitor VPR Models via CLI (the canonical python run.py --config ./config/resnet50_mixvpr.yaml runnable). The DinoV2-MixVPR variant is also documented but is a separate per-Mode candidate per Per-Mode API rule.
  • Summary: OpenVPRLab is the canonical PyTorch Lightning reference implementation for MixVPR. Key documentary findings for the per-mode API verification gate: (a) MixVPR is parameterized as MixVPR(in_channels, in_h, in_w, out_channels, mix_depth, mlp_ratio, out_rows) — the (backbone, aggregator-shape) pair is the per-Mode unit; (b) canonical paper config (ResNet50, in=1024×20×20, out_channels=512, mix_depth=4, mlp_ratio=1, out_rows=4) produces a 2048-D L2-normalised descriptor at 320×320 input; (c) preprocessing is ImageNet mean/std normalisation; (d) FAISS-based cosine retrieval is the documented validation pipeline; (e) training is via python run.py --config ./config/resnet50_mixvpr.yaml on GSV-Cities (street-view) — NOT aerial nadir; (f) no documented Jetson measurement; (g) no documented ONNX/TensorRT export path inside the framework (relies on standard PyTorch → ONNX export — to be resolved in C7, not C2). License: MIT.
  • Related Sub-question: SQ3+SQ4 / C2 — MixVPR per-mode API capability verification (Mandatory context7 lookup per Per-Mode API Capability Verification rule)
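
The recall harness in finding (d) can be illustrated without FAISS. The following is a minimal brute-force sketch, assuming L2-normalised descriptors so that the inner product equals cosine similarity. The function names (`l2_normalise`, `top_k`, `recall_at_k`) and the toy data are hypothetical and are not OpenVPRLab's API; a FAISS inner-product index would replace `top_k` at real database sizes.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k(query, database, k):
    """Indices of the k database descriptors most similar to the query.

    For unit-norm vectors the inner product equals cosine similarity.
    """
    sims = [sum(q * d for q, d in zip(query, ref)) for ref in database]
    return sorted(range(len(database)), key=lambda i: sims[i], reverse=True)[:k]


def recall_at_k(queries, database, ground_truth, ks=(1, 5, 10, 15)):
    """Fraction of queries with at least one true match in the top-k results."""
    hits = {k: 0 for k in ks}
    for qi, query in enumerate(queries):
        ranked = top_k(query, database, max(ks))
        for k in ks:
            if any(idx in ground_truth[qi] for idx in ranked[:k]):
                hits[k] += 1
    return {k: hits[k] / len(queries) for k in ks}


# Toy 3-D descriptors standing in for 2048-D MixVPR outputs (illustrative only).
database = [l2_normalise(v) for v in ([1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0])]
queries = [l2_normalise(v) for v in ([0.9, 0.1, 0.0], [0.1, 0.2, 0.9])]
ground_truth = [{3}, {2}]  # set of correct database indices per query
result = recall_at_k(queries, database, ground_truth, ks=(1, 2))
print(result)
```

In the toy data the first query's true match only appears at rank 2, so Recall@1 is lower than Recall@2, which is the behaviour the Recall@{1,5,10,15} report is meant to expose.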

Source #58

  • Title: MixVPR canonical paper — "MixVPR: Feature Mixing for Visual Place Recognition" (Ali-bey, Chaib-draa, Giguère, WACV 2023, arXiv:2303.02190) + canonical implementation amaralibey/MixVPR
  • Link: arXiv canonical paper https://arxiv.org/abs/2303.02190 (Mar 2023); canonical implementation https://github.com/amaralibey/MixVPR
  • Tier: L1 (peer-reviewed WACV 2023 + author's canonical implementation)
  • Publication Date: arXiv preprint 2023-03-03; WACV 2023 acceptance
  • Timeliness Status: ⚠️ Borderline — the canonical paper itself (Mar 2023) is older than the Critical-novelty 18-month window threshold for SQ3+SQ4 component selection, but the algorithm is mature and OpenVPRLab (Source #57, in window) maintains it actively; the algorithmic content is stable, the freshness concern is on weights and aerial-domain retrains
  • Version Info: WACV 2023 published version; canonical implementation tag-less but main HEAD aligned with paper config
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer
  • Research Boundary Match: Full match for the algorithm/API (single-image global-descriptor VPR with MLP-Mixer aggregation on a CNN backbone); partial match for the project's domain (paper benchmarks Pitts30k, MSLS-val, Tokyo24/7, Nordland — all ground-level / street-level / urban; no aerial nadir benchmark in the canonical paper)
  • Summary: The canonical paper introduces the MixVPR aggregation method: a stack of mix_depth FeatureMixer (MLP-Mixer-style) blocks operating over CNN feature-map rows + columns, followed by depth-wise + row-wise projections to produce a compact L2-normalised descriptor. Default config (and the project's pinned mode) is ResNet50-cropped backbone → 1024×20×20 feature map → MixVPR(1024, 20, 20, 512, 4, 1, 4) → 2048-D descriptor. Reported inference latency: 1.21 ms per image on A100 at 320×320 batch=1 (paper Table 4); this is useful as the documentary best-case latency (a lower bound), against which any Jetson Orin Nano Super extrapolation must be checked by measurement. Reported Recall@1: ≥90% on Pitts30k-test, ≥85% on MSLS-val (state-of-the-art at publication time among 8K-and-below descriptor methods). Critical observation: the paper does NOT report aerial nadir benchmarks; aerial-domain transfer of MixVPR is documented in subsequent third-party work (Skoltech aerial-VPR survey + AerialExtreMatch, Sources #38, #19) but with materially different recall numbers, requiring project-domain retrain or aerial-trained community checkpoint. License: MIT (per canonical implementation repo).
  • Related Sub-question: SQ3+SQ4 / C2 — MixVPR per-mode API capability verification (cross-source verification of the OpenVPRLab descriptor's algorithmic claims; aerial-domain caveat sourced)
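
The shape bookkeeping behind the pinned MixVPR(1024, 20, 20, 512, 4, 1, 4) mode can be traced with plain arithmetic. This is a sketch under stated assumptions, not the author's code: `mixvpr_shapes` is a hypothetical helper, and it assumes the mixer blocks act on the flattened H×W axis and preserve shape, that the channel projection maps in_channels to out_channels, and that the row projection maps H×W to out_rows.

```python
def mixvpr_shapes(in_channels, in_h, in_w, out_channels, mix_depth, mlp_ratio, out_rows):
    """Return (stage name, per-image shape) pairs for a MixVPR-style aggregator."""
    hw = in_h * in_w
    shapes = [
        ("backbone feature map", (in_channels, in_h, in_w)),
        ("flattened spatial axis", (in_channels, hw)),
    ]
    for i in range(mix_depth):
        # Assumption: each mixer block is a residual MLP with hidden width
        # hw * mlp_ratio, so the tensor shape is unchanged.
        hidden = int(hw * mlp_ratio)
        shapes.append((f"mixer block {i + 1} (hidden {hidden})", (in_channels, hw)))
    shapes.append(("channel projection", (out_channels, hw)))
    shapes.append(("row projection", (out_channels, out_rows)))
    shapes.append(("flattened descriptor", (out_channels * out_rows,)))
    return shapes


trace = mixvpr_shapes(1024, 20, 20, 512, 4, 1, 4)
for name, shape in trace:
    print(f"{name:32s} {shape}")
descriptor_dim = trace[-1][1][0]
```

The last row gives out_channels × out_rows = 512 × 4 = 2048, matching the 2048-D descriptor quoted for the canonical config.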

Source #59

  • Title: SALAD canonical implementation — serizba/salad (Izquierdo & Civera, CVPR 2024) — official PyTorch reference implementation, eval CLI, three pretrained checkpoints (8192+256, 2048+64, 512+32 descriptor sizes), Torch Hub registration, GSV-Cities training pipeline
  • Link: README https://github.com/serizba/salad (raw https://raw.githubusercontent.com/serizba/salad/main/README.md, accessed 2026-05-08); LICENSE https://github.com/serizba/salad/blob/main/LICENSE (raw https://raw.githubusercontent.com/serizba/salad/main/LICENSE, accessed 2026-05-08)
  • Tier: L1 (project-official codebase by the canonical SALAD authors Sergio Izquierdo + Javier Civera, Universidad de Zaragoza)
  • Publication Date: README live; main HEAD within Critical-novelty window per Fact #36 timeliness audit; canonical paper CVPR 2024
  • Timeliness Status: Fully within Critical-novelty window
  • Version Info: main HEAD; PyTorch 2.1.0 + CUDA 12.1 + Xformers (per README §Setup); Torch Hub one-liner model = torch.hub.load("serizba/salad", "dinov2_salad") returns the full 8448-D config; eval CLI python3 eval.py --ckpt_path 'weights/dino_salad.ckpt' --image_size 322 322 --batch_size 256 --val_datasets MSLS Nordland; three pretrained checkpoints documented (dino_salad 8192+256, dino_salad_2048_64 2048+64, dino_salad_512_32 512+32)
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer
  • Research Boundary Match: Full match for the project's pinned C2 mode (DINOv2 ViT-B/14 backbone + SALAD aggregator at 322×322 input). The repo ships everything needed to instantiate the model, run inference, and reproduce the canonical numbers. Partial match for the project's domain (canonical training set is GSV-Cities street-view; canonical evaluation sets are MSLS Challenge/Val, Pittsburgh250k-test, SPED, NordLand, SF-XL — all ground-level / street-level / urban / cross-season; no aerial nadir benchmark in the repo's reported tables, same caveat as MixVPR).
  • Summary: serizba/salad is the canonical reference implementation of the CVPR 2024 paper "Optimal Transport Aggregation for Visual Place Recognition" by Sergio Izquierdo and Javier Civera. Critical license finding: LICENSE file is GNU GENERAL PUBLIC LICENSE v3 (GPL-3.0), a copyleft license. This places SALAD on the GPL-3.0 license track alongside OpenVINS / VINS-Mono / VINS-Fusion / ORB-SLAM3, NOT on the BSD/permissive track where MixVPR (MIT) sits. Three pretrained checkpoints documented: full (dino_salad, 8192+256 = 8448-D descriptor; m=64 clusters × l=128 dim per cluster + 256-D global token), medium (dino_salad_2048_64, 2048+64 = 2112-D; m=32, l=64), slim (dino_salad_512_32, 512+32 = 544-D; m=16, l=32, since 16 × 32 = 512). Canonical evaluation input: --image_size 322 322 (must be divisible by 14 for ViT/14 patch grid → 322/14 = 23 patches per side → 23×23 = 529 spatial tokens + 1 global token). Training: GSV-Cities, 4 epochs, ~30 min on RTX 3090. Acknowledged base: MixVPR's training framework is the harness on top of which SALAD is built (NOT OpenVPRLab; they share a lineage but SALAD's repo extends amaralibey/MixVPR directly per README "Acknowledgements"). No documented aerial-nadir benchmark in the repo's reported tables.
  • Related Sub-question: SQ3+SQ4 / C2 — SALAD per-mode API capability verification (context7 did not index serizba/salad directly; per Per-Mode API Capability Verification rule item 2, fall-back to official-docs WebFetch on the canonical repo README + LICENSE was used)
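
The three checkpoint sizes follow from one formula: descriptor dimension = clusters × per-cluster dimension + global-token dimension. The sketch below derives the cluster count from the "local + global" split and the per-cluster dimension l rather than asserting it. The helper names and the `CHECKPOINTS` table layout are hypothetical; only the checkpoint names and dimensions come from the entry above.

```python
def implied_clusters(local_dim, cluster_dim):
    """Cluster count m such that m * cluster_dim == local_dim."""
    if local_dim % cluster_dim != 0:
        raise ValueError(f"{local_dim} is not a multiple of {cluster_dim}")
    return local_dim // cluster_dim


# name -> (local part, per-cluster dim l, global-token dim), per the checkpoint list
CHECKPOINTS = {
    "dino_salad": (8192, 128, 256),
    "dino_salad_2048_64": (2048, 64, 64),
    "dino_salad_512_32": (512, 32, 32),
}


def describe(checkpoints):
    """Map each checkpoint to its implied cluster count and total descriptor size."""
    rows = {}
    for name, (local_dim, cluster_dim, token_dim) in checkpoints.items():
        rows[name] = {
            "clusters": implied_clusters(local_dim, cluster_dim),
            "descriptor_dim": local_dim + token_dim,
        }
    return rows


summary = describe(CHECKPOINTS)
for name, row in summary.items():
    print(name, row)
```

Deriving m this way makes any mismatch between a quoted cluster count and the quoted descriptor size visible immediately.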

Source #60

  • Title: SALAD canonical paper — "Optimal Transport Aggregation for Visual Place Recognition" (Izquierdo & Civera, CVPR 2024, arXiv:2311.15937 v2)
  • Link: arXiv https://arxiv.org/abs/2311.15937 (CVPR 2024 published version), accessed 2026-05-08
  • Tier: L1 (peer-reviewed CVPR 2024 + canonical implementation cross-referenced)
  • Publication Date: arXiv preprint 2023-11-27; CVPR 2024 acceptance June 2024
  • Timeliness Status: ⚠️ Borderline — like MixVPR, the canonical paper (Nov 2023 / CVPR 2024) is at the edge of the Critical-novelty 18-month window for SQ3+SQ4 component selection, but the algorithm is mature, the canonical implementation is actively maintained, and the algorithmic content is stable; the freshness concern is on weights and aerial-domain retrains
  • Version Info: arXiv v2 (CVPR camera-ready)
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer
  • Research Boundary Match: Full match for the algorithm (single-stage VPR with optimal-transport-based aggregation on a fine-tuned DINOv2 backbone). Partial match for the project's domain (paper benchmarks MSLS Challenge / MSLS Val, Pittsburgh250k-test, SPED, NordLand, SF-XL — all ground-level urban / street-view; NO aerial nadir benchmark in the canonical paper, same caveat as MixVPR).
  • Summary: The canonical paper introduces SALAD = Sinkhorn Algorithm for Locally Aggregated Descriptors. It reformulates NetVLAD's soft-assignment as an optimal-transport problem, considering both feature-to-cluster AND cluster-to-feature relations, with a learned dustbin cluster that discards uninformative features (sky/road/dynamic objects), and combines this with a fine-tuned DINOv2 backbone. Pinned canonical configuration (paper §4.1): DINOv2-B (768-dim tokens, 86M params), fine-tune the last 4 transformer blocks (Table 6: train 2 or 4 blocks both report best results, 4 marginally better at 92.2 vs 92.0 MSLS R@1). Score-projection MLP W_s1, W_s2 with hidden dim 512 and ReLU. Dimensionality reduction W_f1, W_f2 from d=768 to l=128. Global-token MLP W_g1, W_g2 from d=768 to 256. m=64 clusters, yielding final descriptor m × l + global = 64 × 128 + 256 = 8192 + 256 = 8448-D. Slim variants: m=16, l=32 → 512+32 = 544-D; m=32, l=64 → 2048+64 = 2112-D. Sinkhorn algorithm for optimal-transport assignment. Final L2 intra-norm + global L2-norm. Training: GSV-Cities, batch size 60 places × 4 images, multi-similarity loss, AdamW, lr=6e-5, 4 epochs, 30 min on RTX 3090. Training input size: 224×224; evaluation input size: 322×322 ("model is agnostic to input size as long as divisible by 14"). Reported latency (paper Table 1, 2 footnote): 2.33–2.41 ms per image on RTX 3090 at 322×322 batch=1 across all three descriptor-size variants, which confirms that aggregator overhead over the bare DINOv2-B backbone (2.41 ms) is negligible. Reported Recall@1 (paper Table 1, full 8448-D variant): MSLS Challenge 75.0, MSLS Val 92.2, NordLand 76.0, Pitts250k-test 95.1, SPED 92.1. Reported Recall@1 (slim 544-D variant): MSLS Challenge 70.8, MSLS Val 89.3, NordLand 61.2, Pitts250k-test 93.0, SPED 88.5. Reported memory footprint (Table 2 footnote, MSLS Val ~18,000 images): 0.0 GB local features (single-stage, no per-patch features cached) + global descriptors ~0.6 GB at 8448-D fp32 (18,000 × 8448 × 4 bytes ≈ 608 MB), ~0.3 GB at fp16.
Authors' explicit limitation (§5): "the adoption of DINOv2 as our backbone results in slower processing speeds compared to ResNet-based methods" — material to project's Jetson Orin Nano Super deployment, since DINOv2-B has 86M params vs MixVPR-ResNet50's 25M (~3.4× more params; ViT export to TensorRT/INT8 is also harder than ResNet export — C7 deferred concern). License (canonical implementation): GPL-3.0 (per Source #59).
  • Related Sub-question: SQ3+SQ4 / C2 — SALAD per-mode API capability verification (cross-source verification of the canonical implementation's mode-enumeration, parameter-count, and performance claims; aerial-domain caveat documented)
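
The global-descriptor memory footprint can be sanity-checked from the image count and descriptor size alone. This is a back-of-the-envelope sketch that assumes the ~18,000-image MSLS Val count quoted above, dense storage with no index overhead, and decimal megabytes; `descriptor_db_bytes` is a hypothetical helper.

```python
def descriptor_db_bytes(num_images, descriptor_dim, bytes_per_value):
    """Raw storage for a dense descriptor matrix, ignoring index overhead."""
    return num_images * descriptor_dim * bytes_per_value


NUM_IMAGES = 18_000  # approximate MSLS Val size quoted in the entry above
VARIANT_DIMS = {"full": 8448, "medium": 2112, "slim": 544}
PRECISIONS = {"fp32": 4, "fp16": 2}

footprint_bytes = {
    (variant, precision): descriptor_db_bytes(NUM_IMAGES, dim, width)
    for variant, dim in VARIANT_DIMS.items()
    for precision, width in PRECISIONS.items()
}

for (variant, precision), size in footprint_bytes.items():
    print(f"{variant:6s} {precision}: {size / 1e6:8.1f} MB")
```

The same function applies to the project's own tile database once its image count is known, which is the number that matters against the Jetson memory budget.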

Source #61

  • Title: OpenVPRLab DinoV2 backbone — context7 cross-source for DINOv2 ViT-B/14 spatial-feature backbone API at 322×322 input (NOT a SALAD aggregator source — OpenVPRLab does not ship a SALAD aggregator class, only MixVPR / GeMPool / BoQ are documented in the snippets)
  • Link: https://context7.com/amaralibey/openvprlab/llms.txt (accessed 2026-05-08, snippet Initialize and Use DinoV2 Backbone); canonical README https://github.com/amaralibey/openvprlab
  • Tier: L1 (canonical OpenVPRLab framework by Amar Ali-bey, packaged as a multi-aggregator VPR research lab; context7 indexed)
  • Publication Date: README live; OpenVPRLab main HEAD within Critical-novelty window (per Fact #36)
  • Timeliness Status: Fully within Critical-novelty window
  • Version Info: OpenVPRLab main HEAD; supported DinoV2 backbones: dinov2_vits14, dinov2_vitb14, dinov2_vitl14, dinov2_vitg14; DinoV2(backbone_name, num_unfrozen_blocks, return_cls_token) constructor; default num_unfrozen_blocks=2 (paper canonical SALAD config uses 4 per Table 6); return_cls_token=False returns spatial features only (SALAD pipeline needs True because it uses both spatial features + cls/global token)
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer
  • Research Boundary Match: Partial match — confirms the DINOv2 backbone API (input must be divisible by 14, 322×322 → [B, 768, 23, 23] spatial features for ViT-B), but the SALAD aggregator itself is NOT in OpenVPRLab. Pipeline composition for SALAD must use serizba/salad repo's aggregator class on top of either serizba/salad's own DINOv2 wrapper or OpenVPRLab's DinoV2 class with return_cls_token=True.
  • Summary: Documentary cross-source confirmation that DINOv2 ViT-B is a first-class supported backbone in the broader VPR research-pipeline ecosystem, with the same input-divisibility-by-14 constraint and 322×322 → 23×23 spatial-token layout the canonical SALAD paper uses. Critical finding for the SALAD verification gate: OpenVPRLab's documented aggregator catalog (per context7 snippet inventory) is MixVPR, GeMPool, BoQ; no SALAD class is present. This means SALAD cannot be assembled from OpenVPRLab alone; the project must depend on the canonical serizba/salad repo (Source #59), which is GPL-3.0. Confirms that the per-Mode API rule's "two modes of one library are two distinct candidates" semantics applies here too: DINOv2-B + MixVPR (in OpenVPRLab) and DINOv2-B + SALAD (in serizba/salad) are different candidates, with different code-of-record and different licenses.
  • Related Sub-question: SQ3+SQ4 / C2 — SALAD per-mode API capability verification (cross-source confirmation of DINOv2 ViT-B backbone API; cross-source disconfirmation of OpenVPRLab as a SALAD source). Cross-cutting reuse: Source #61 also confirms DINOv2 ViT-L/14 is a first-class supported backbone in OpenVPRLab — relied on by Source #62 + #63 (SelaVPR) for backbone-API documentary cross-source.
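
The divisible-by-14 constraint and the resulting spatial layout can be checked before any model is loaded. The sketch below is a hypothetical shape calculator, not OpenVPRLab's `DinoV2` class; the `(batch, embed_dim)` layout it returns for the class token is an assumption made for illustration.

```python
PATCH_SIZE = 14  # ViT/14 patch size used by all DINOv2 variants named above


def dinov2_feature_shape(batch, height, width, embed_dim, return_cls_token=False):
    """Expected spatial-feature shape [B, C, H/14, W/14] for a ViT/14 backbone.

    Raises ValueError when the input does not tile into whole patches.
    """
    if height % PATCH_SIZE or width % PATCH_SIZE:
        raise ValueError(
            f"input {height}x{width} is not divisible by patch size {PATCH_SIZE}"
        )
    grid_h, grid_w = height // PATCH_SIZE, width // PATCH_SIZE
    spatial = (batch, embed_dim, grid_h, grid_w)
    if return_cls_token:
        # Assumed layout: one embed_dim vector per image for the class token.
        return spatial, (batch, embed_dim)
    return spatial


print(dinov2_feature_shape(1, 322, 322, 768))        # ViT-B at the SALAD eval size
print(dinov2_feature_shape(1, 224, 224, 1024))       # ViT-L at the SelaVPR input size
print(dinov2_feature_shape(1, 322, 322, 768, True))  # spatial features plus class token
```

Note that 320×320, the ResNet50 + MixVPR input size, fails this check, so the two modes cannot share a single preprocessing resolution.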

Source #62

  • Title: SelaVPR canonical implementation — Lu-Feng/SelaVPR (Lu, Zhang, Lan, Dong, Wang, Yuan — ICLR 2024) — official PyTorch reference implementation, training/eval CLIs, two pretrained checkpoints (MSLS-finetuned for diverse scenes; Pitts30k-further-finetuned for urban scenes), DINOv2+registers variant, two-stage retrieval+re-ranking pipeline
  • Link: README https://github.com/Lu-Feng/SelaVPR (raw https://raw.githubusercontent.com/Lu-Feng/SelaVPR/main/README.md, accessed 2026-05-08); LICENSE https://github.com/Lu-Feng/SelaVPR/blob/main/LICENSE (raw https://raw.githubusercontent.com/Lu-Feng/SelaVPR/main/LICENSE, accessed 2026-05-08); pretrained DINOv2 backbone weights https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_pretrain.pth ; SelaVPR++ successor repo https://github.com/Lu-Feng/SelaVPRplusplus (Feb 2026, separate)
  • Tier: L1 (project-official codebase by the canonical SelaVPR authors Feng Lu et al., Tsinghua Shenzhen + Peng Cheng Laboratory + UCAS)
  • Publication Date: README live; main HEAD within Critical-novelty window (Feb 2026 announcement of SelaVPR++ successor confirms continued active maintenance of the lineage); canonical paper ICLR 2024
  • Timeliness Status: Fully within Critical-novelty window
  • Version Info: main HEAD; PyTorch (no specific version pinned in README); requires DINOv2 ViT-L/14 pretrained backbone weights download (dinov2_vitl14_pretrain.pth ~1.1 GB from FB AI Public Files); CLI surface python3 train.py --datasets_folder=... --dataset_name={msls,pitts30k} --foundation_model_path=... for training + python3 eval.py --datasets_folder=... --dataset_name={msls,pitts30k,nordland,...} --resume=... --rerank_num={20,100} for evaluation; pretrained models distributed via Google Drive links inside README HTML tables (MSLS-finetuned: MSLS-val R@1=90.8 / Nordland-test R@1=85.2 / St. Lucia R@1=99.8; Pitts30k-further-finetuned: Tokyo24/7 R@1=94.0 / Pitts30k R@1=92.8 / Pitts250k R@1=95.7); optional --registers flag adds DINOv2+4-register variant (separate finetuned checkpoint also linked); 262 stars; MIT License
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer
  • Research Boundary Match: Full match for the project's pinned C2 mode (DINOv2 ViT-L/14 backbone with frozen weights + lightweight serial+parallel adapters in each transformer block + GeM-pooled 1024-D global descriptor + LocalAdapt up-conv module producing 61×61×128-D dense local features at 224×224 input). Two-stage retrieval+re-ranking is structurally distinct from MixVPR's and SALAD's single-stage retrieval — global descriptor is used for top-K candidate retrieval, then re-ranking via mutual-nearest-neighbor cross-matching of dense local features with |M| (count of mutual matches) as the re-rank score. Re-rank pool size is a runtime parameter: rerank_num=100 reproduces paper accuracy; rerank_num=20 cuts re-ranking runtime to 1/5 (0.018 s/query on RTX 3090) at near-identical accuracy. Partial match for the project's domain (canonical training is MSLS + Pitts30k street-view; canonical evaluation is Tokyo24/7 / MSLS-val / MSLS-challenge / Pitts30k-test / Pitts250k / Nordland-test / St. Lucia — all ground-level / street-level / urban / cross-season / cross-illumination; no aerial nadir benchmark in the repo's reported tables, same caveat as MixVPR + SALAD).
  • Summary: SelaVPR is the canonical reference implementation of the ICLR 2024 paper "Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition" by Feng Lu et al. Critical license finding: LICENSE file is MIT License (Copyright (c) 2024 Feng Lu) — permissive; this places SelaVPR on the BSD/permissive license track alongside MixVPR (MIT), Kimera-VIO (BSD-2), OKVIS2 (BSD-3), DPVO (MIT), pure-VO baseline (OpenCV-Apache-2.0). SelaVPR is the first DINOv2-based C2 candidate on the BSD/permissive track (SALAD is GPL-3.0). Two pretrained checkpoints are documented inside the README: (a) MSLS-finetuned (for diverse scenes — recommended starting point for cross-domain transfer projects): MSLS-val R@1=90.8 / R@5=96.4 / R@10=97.2; Nordland-test R@1=85.2 / R@5=95.5 / R@10=98.5; St. Lucia R@1=99.8; (b) Pitts30k-further-finetuned (only for urban scenes; resumed from MSLS checkpoint): Tokyo24/7 R@1=94.0 / R@5=96.8 / R@10=97.5; Pitts30k R@1=92.8 / R@5=96.8 / R@10=97.7; Pitts250k R@1=95.7 / R@5=98.8 / R@10=99.2. Optional --registers flag enables DINOv2+4-register variant (per Darcet et al. 2024 ViT registers paper) with separately finetuned MSLS checkpoint — better local-matching performance per README §"Local Matching using DINOv2+Registers". Two-stage method has an "efficient RAM usage" mode (--efficient_ram_testing flag) that saves extracted local features to disk in ./output_local_features/ and loads only the currently-needed local features into RAM — useful when descriptor cache exceeds available RAM (relevant to the project's 8 GB shared budget). Acknowledged dependencies: gmberton/deep-visual-geo-localization-benchmark (Visual Geo-localization Benchmark — dataset preparation pipeline) and facebookresearch/dinov2 (the frozen pre-trained backbone). Adapter-and-up-conv code is in /backbone/dinov2/block.py (adapter1 + adapter2 in each transformer block) and network.py (LocalAdapt up-conv module after the entire ViT).
  • Related Sub-question: SQ3+SQ4 / C2 — SelaVPR per-mode API capability verification (context7 returned no match for Lu-Feng/SelaVPR — the only "lu-feng" hit was liu-feng-deeplearning/coverhunter which is an unrelated music-similarity project; per Per-Mode API Capability Verification rule item 2, fall-back to official-docs WebFetch on the canonical repo README + LICENSE was used)
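
The second stage described above (mutual-nearest-neighbour cross-matching with the match count |M| as the re-rank score) can be sketched in a few lines. This is an illustrative brute-force version on toy 2-D features, assuming unit-norm local features so the dot product is cosine similarity; the function names are hypothetical and this is not the Lu-Feng/SelaVPR code.

```python
def dot(a, b):
    """Inner product; equals cosine similarity for unit-norm vectors."""
    return sum(x * y for x, y in zip(a, b))


def nearest(src, dst):
    """For each feature in src, the index of its most similar feature in dst."""
    return [max(range(len(dst)), key=lambda j: dot(s, dst[j])) for s in src]


def mutual_match_count(query_feats, cand_feats):
    """|M|: number of (i, j) pairs that are each other's nearest neighbour."""
    q2c = nearest(query_feats, cand_feats)
    c2q = nearest(cand_feats, query_feats)
    return sum(1 for i, j in enumerate(q2c) if c2q[j] == i)


def rerank(query_feats, candidates, rerank_num):
    """Re-order the first rerank_num candidates by |M|; leave the rest as retrieved.

    candidates: list of (candidate_id, local_feats) in global-retrieval order.
    """
    head, tail = candidates[:rerank_num], candidates[rerank_num:]
    head = sorted(head, key=lambda c: mutual_match_count(query_feats, c[1]), reverse=True)
    return [cid for cid, _ in head] + [cid for cid, _ in tail]


# Toy unit-norm local features (real SelaVPR features are 61x61 vectors of 128-D).
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    ("a", [[0.8, 0.6]]),              # one mutual match with the query
    ("b", [[1.0, 0.0], [0.0, 1.0]]),  # two mutual matches
    ("c", [[0.0, 1.0], [1.0, 0.0]]),  # outside the re-rank pool below
]
order = rerank(query, candidates, rerank_num=2)
print(order)
```

Because only the first rerank_num candidates are scored, the matching cost grows roughly linearly with the pool size, which is consistent with the runtime difference quoted between rerank_num=100 and rerank_num=20.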

Source #63

  • Title: SelaVPR canonical paper — "Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition" (Lu, Zhang, Lan, Dong, Wang, Yuan — ICLR 2024, arXiv:2402.14505 v1)
  • Link: arXiv https://arxiv.org/abs/2402.14505 (ICLR 2024 published version: https://openreview.net/pdf?id=TVg6hlfsKa), accessed 2026-05-08; HTML render https://arxiv.org/html/2402.14505v1
  • Tier: L1 (peer-reviewed ICLR 2024 + canonical implementation cross-referenced)
  • Publication Date: arXiv preprint 2024-02-22; ICLR 2024 acceptance
  • Timeliness Status: ⚠️ Borderline — canonical paper (Feb 2024 / ICLR 2024) is at the edge of the Critical-novelty 18-month window for SQ3+SQ4 component selection, but algorithm is mature, canonical implementation is actively maintained (SelaVPR++ released Feb 2026 confirms lineage activity), algorithmic content is stable; freshness concern is on weights and aerial-domain retrains
  • Version Info: arXiv v1 (ICLR camera-ready)
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer
  • Research Boundary Match: Full match for the algorithm (two-stage VPR with adapter-based parameter-efficient transfer learning on a frozen DINOv2 ViT-L/14 backbone + GeM-pooled global descriptor + dense local features for re-ranking via mutual nearest neighbor cross-matching, no RANSAC needed). Partial match for the project's domain (paper benchmarks Tokyo24/7 / MSLS-val / MSLS-challenge / Pitts30k-test — all ground-level urban / street-view; appendix benchmarks Nordland / St. Lucia — also non-aerial; NO aerial nadir benchmark in the canonical paper, same caveat as MixVPR + SALAD).
  • Summary: The canonical paper introduces SelaVPR = Seamless adaptation of pre-trained foundation models for VPR via hybrid global-local adaptation. Architecture (paper §3): (a) Backbone: DINOv2 ViT-L/14 (frozen, only adapters trained — preserves the pre-trained model's transferability and avoids catastrophic forgetting); (b) Global adaptation (paper §3.2 + Fig 2): two adapters per transformer block — Adapter1 is a serial adapter after the MHA layer with internal skip-connection, Adapter2 is a parallel adapter alongside the MLP layer multiplied by scaling factor s=0.2; each adapter is a bottleneck (down-project FC → ReLU → up-project FC) with bottleneck ratio 0.5; the output of the last transformer block goes through an LN layer; the class token is discarded and patch tokens are reshaped as a feature map fm; the global feature is f^g = L2(GeM(fm)); (c) Local adaptation (paper §3.3 + Fig 3): two up-convolutional layers (3×3 kernel, stride=2, padding=1) with a ReLU layer between them upsample the 16×16×1024 ViT feature map to 61×61×128 dense local features; intra-channel L2 normalization; (d) Local matching for re-ranking (paper §3.4): mutual nearest neighbor cross-matching between query and each candidate; cosine similarity is used (local features are L2-normalized); the count |M| of mutual matches is the re-rank score (no RANSAC, no spatial verification); (e) Loss (paper §3.5): triplet loss L_g on global features + mutual nearest neighbor local feature loss L_l (Eq. 10) that maximizes match similarity for positive pairs and minimizes for negative pairs; combined as L = L_g + λ·L_l with λ=1, m=0.1. 
Implementation details (paper §4.2): input size 224×224 (NOT 322×322 like SALAD; 224/14 = 16, so 16×16=256 patch tokens + 1 cls token); global descriptor 1024-D (much smaller than MixVPR's 2048-D and SALAD's 8448-D full); local features 61×61×128 = 476,288 floats per image for re-ranking; Adam optimizer, lr=1e-5, batch=4; trained on MSLS first then further-finetuned on Pitts30k; re-rank top-100 candidates by default. Reported performance (paper Table 2): MSLS-challenge R@1=73.5 (vs SALAD's 75.0, vs MixVPR's 64.0; vs prior SOTA R²Former 73.0); Tokyo24/7 R@1=94.0 (best, +5.4 absolute over prior SOTA R²Former 88.6); MSLS-val R@1=90.8 (best); Pitts30k R@1=92.8 (best). Reported runtime (paper Table 3, RTX 3090, Pitts30k-test): feature extraction 0.027 s/query (slower than TransVPR's 0.008 because of ViT-L backbone, but still fast); matching 0.085 s/query at rerank_num=100; total 0.112 s/query, which is less than 4% of TransVPR's total (3.018 s/query) and ~1% of Patch-NetVLAD-p (11.144 s/query). Critical observation: SelaVPR's "fast two-stage" claim is relative to other two-stage methods; at 27 ms per extraction it is still slower than single-stage MixVPR (1.21 ms extraction on A100, per Source #58) and SALAD (~2.41 ms on RTX 3090) at the global-retrieval stage, AND adds an 18 ms (rerank_num=20) to 85 ms (rerank_num=100) re-ranking cost per query that single-stage methods don't pay. Authors' explicit observation (paper §4.3): "TransVPR is fast at extracting features, while SelaVPR is slower (but faster than other methods) due to the use of the ViT/L backbone"; this is material to the project's Jetson Orin Nano Super deployment, since DINOv2-L has 300M params vs SALAD-DINOv2-B's 86M (~3.5× more params). License (canonical implementation): MIT (per Source #62).
  • Related Sub-question: SQ3+SQ4 / C2 — SelaVPR per-mode API capability verification (cross-source verification of the canonical implementation's mode/parameter/runtime claims; aerial-domain caveat documented; ViT-L vs ViT-B backbone-size trade-off documented for Jetson MVE planning)
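
The 16×16 → 61×61 local-feature size follows from the standard transposed-convolution output formula, out = (in − 1)·stride − 2·padding + kernel. The sketch below assumes the two "up-convolutional" layers are transposed convolutions with no output padding or dilation, which is consistent with the sizes quoted above but is not confirmed here against the repo code; the helper names are hypothetical.

```python
def transposed_conv_out(size, kernel=3, stride=2, padding=1, output_padding=0):
    """Output side length of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def local_feature_shape(image_size=224, patch=14, out_channels=128, num_upconvs=2):
    """Side lengths after each up-conv, starting from the ViT patch grid."""
    side = image_size // patch          # 224 / 14 = 16 patch tokens per side
    sides = [side]
    for _ in range(num_upconvs):
        side = transposed_conv_out(side)
        sides.append(side)
    return sides, (side, side, out_channels)


sides, shape = local_feature_shape()
floats_per_image = shape[0] * shape[1] * shape[2]
fp32_bytes_per_image = floats_per_image * 4

print("side lengths:", sides)
print("local feature map:", shape)
print("floats per image:", floats_per_image)
print("fp32 MB per image:", fp32_bytes_per_image / 1e6)
```

At roughly 1.9 MB of fp32 local features per image, the local-feature cache rather than the 1024-D global descriptor dominates storage for the two-stage pipeline.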

Source #64

  • Title: NetVLAD canonical implementation — Relja/netvlad v1.03 (Arandjelović et al., CVPR 2016 / TPAMI 2018) — official MATLAB / MatConvNet reference implementation, off-the-shelf and trained networks (VGG-16 + NetVLAD + whitening pretrained on Pittsburgh 30k and Tokyo Time Machine), training (trainWeakly) + testing (testFromFn) + PCA-whitening (addPCA) + cluster-init (addLayers) CLIs, multiple aggregation methods (vlad_preL2_intra default, vlad_preL2, vladv2_preL2_intra, max, avg)
  • Link: README https://raw.githubusercontent.com/Relja/netvlad/master/README.md (accessed 2026-05-08); README_more https://raw.githubusercontent.com/Relja/netvlad/master/README_more.md (accessed 2026-05-08); project page https://www.di.ens.fr/willow/research/netvlad/ ; trained models https://www.di.ens.fr/willow/research/netvlad/data/models/vd16_pitts30k_conv5_3_vlad_preL2_intra_white.mat (529 MB best model) + https://www.di.ens.fr/willow/research/netvlad/data/netvlad_v103_allmodels.tar.gz (3 GB all models); context7 indexed at /relja/netvlad; canonical paper arXiv:1511.07247 (CVPR 2016)
  • Tier: L1 (project-official MATLAB codebase by canonical NetVLAD authors Relja Arandjelović + Petr Gronat + Akihiko Torii + Tomas Pajdla + Josef Sivic; INRIA WILLOW + ENS + Tokyo Tech + CTU Prague)
  • Publication Date: README v1.03 dated 2016-03-04; canonical paper CVPR 2016 (arXiv 2015-11-23 v1, with 2016 updates); algorithm has been continuously cited as the canonical baseline since 2016 across MixVPR (Source #57+58), SALAD (Source #59+60), SelaVPR (Source #62+63), AnyLoc, BoQ, and every subsequent VPR paper
  • Timeliness Status: ⚠️ Borderline — the implementation is from 2016 (10 years old) and uses the deprecated MATLAB + MatConvNet stack; HOWEVER, the algorithm is canonical and stable, the canonical paper has been continuously cited as the baseline reference, and modern PyTorch ports (Source #65) reproduce its numbers with high fidelity. The freshness concern is on the runtime stack (MATLAB + MatConvNet → PyTorch port required for the project) and aerial-domain transfer (no aerial benchmark in the canonical paper). Per the engine's "Established baseline" rule (KLT, RANSAC, EKF, ORB, SIFT, GTSAM-class), NetVLAD as the mandatory simple-VLAD baseline for the C2 row is exempt from the strict 18-month Critical-novelty window — its role is exactly to be the long-established reference point against which the modern (MixVPR / SALAD / SelaVPR / AnyLoc / BoQ / DINOv2-VLAD) candidates are scored
  • Version Info: v1.03 (04 Mar 2016) — last canonical version; depends on relja_matlab v1.02+, MatConvNet v1.0-beta18+, optional Yael_matlab v438. Critical license finding: README states "NetVLAD is distributed under the MIT License (see the LICENCE file)." — MIT (BSD/permissive license track, same as MixVPR + SelaVPR; distinct from SALAD's GPL-3.0). Supported network IDs in loadNet(): vd16 (VGG-16, 138M params, conv5_3 last conv layer = 512-D feature map), vd19 (VGG-19), caffe (AlexNet, last conv = 256-D feature map), places (Places-CNN). Aggregation methods in addLayers(): vlad_preL2_intra (default — input L2-norm + intra-channel L2-norm of the K×D NetVLAD matrix + final flatten + L2-norm), vlad_preL2 (no intra-norm), vladv2_preL2_intra (full NetVLAD with trainable biases per Eq. 4 of paper), max (global max pool), avg (global avg pool). Default cluster count K=64. Output dimensionality = K × D (e.g., VGG-16 conv5_3: K=64 × D=512 = 32768-D pre-PCA NetVLAD matrix); recommended PCA + whitening reduces to 4096-D (canonical paper recommendation; 256-D / 512-D variants supported via cropToDim after L2-renormalisation, only valid for +whitening networks). 599 stars
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer + simple-baseline reference-point owner
  • Research Boundary Match: Full match for the project's pinned C2 mode (VGG-16 + NetVLAD + whitening pretrained on Pittsburgh 30k / Tokyo Time Machine, NetVLAD aggregation vlad_preL2_intra with K=64 cluster centres, output 4096-D L2-normalised global descriptor after PCA-whitening). The repo ships everything needed for inference (computeRepresentation, serialAllFeats, testFromFn) and training (trainWeakly). Partial match for the project's domain (canonical training sets are Pittsburgh 30k + Tokyo Time Machine, both street-level / urban; canonical evaluation sets are Pittsburgh 30k / Pittsburgh 250k / Tokyo 24/7 — all ground-level / street-view; NO aerial nadir benchmark in the canonical paper, same caveat as MixVPR + SALAD + SelaVPR). Critical implementation match: NetVLAD is the only canonical C2 baseline whose pretrained weights cover both Pittsburgh and Tokyo Time Machine domains as separate checkpoints (vd16_pitts30k_conv5_3_vlad_preL2_intra_white and vd16_tokyoTM_conv5_3_vlad_preL2_intra_white) — useful for cross-domain ablation if the project ever needs to compare canonical-domain transferability
  • Summary: NetVLAD is the canonical learned-VLAD reference baseline for the entire VPR field, introduced by Arandjelović et al. (CVPR 2016) and continuously cited by every subsequent VPR work (including MixVPR, SALAD, SelaVPR, AnyLoc, BoQ — all of which compare against NetVLAD as a baseline in their respective papers). The official Relja/netvlad MATLAB implementation (v1.03, 2016) defines the algorithm: (a) crop a CNN backbone (VGG-16 / VGG-19 / AlexNet / Places-CNN) at the last convolutional layer to obtain a H×W×D dense descriptor map; (b) apply NetVLAD pooling layer that learns K cluster centres c_k, K cluster-assignment weights w_k, K biases b_k, computes soft-assignment via softmax over w_k^T x_i + b_k, and aggregates first-order residuals (x_i - c_k) weighted by soft-assignment into a K×D matrix; (c) intra-channel L2-normalise the K×D matrix (per the vlad_preL2_intra method), flatten to K·D-D vector, final L2-normalise; (d) optionally apply PCA + whitening (canonical recommendation: reduce to 4096-D) for both retrieval-quality improvement AND memory-footprint reduction. Canonical training: weakly supervised triplet ranking loss with hard negative mining on Google Street View Time Machine (Pittsburgh + Tokyo) — uses only image-level GPS as supervision, no landmark annotations. License: MIT — places NetVLAD on the BSD/permissive track alongside MixVPR (MIT) + SelaVPR (MIT) + Kimera-VIO (BSD-2) + OKVIS2 (BSD-3) + DPVO (MIT) + pure-VO baseline (OpenCV-Apache-2.0). 
Critical limitations: (a) MATLAB + MatConvNet stack is deprecated and not deployable on Jetson Orin Nano Super (PyTorch port required; see Source #65); (b) algorithm pre-dates DINOv2 / ViT family by 6+ years; its VGG-16 backbone carries roughly 5× the parameters of ResNet50 (138M vs ~25.6M) and roughly 4× the FLOPs at 224×224 input, at comparable accuracy; (c) canonical 4096-D descriptor (after PCA-whitening) is 2× larger than MixVPR's 2048-D and 4× larger than SelaVPR's 1024-D, increasing the descriptor-cache footprint AND retrieval-time cosine-similarity cost; (d) no documented Jetson measurement; (e) canonical R@1 numbers (Pitts30k-test): 84.1 (paper Table 1), substantially below MixVPR's ~90 / SALAD's 95.1 / SelaVPR's 92.8 (all on Pitts30k); on Tokyo24/7: 73.3 (paper), vs SelaVPR's 94.0 = 20.7 absolute gap; vs MixVPR's 85.1 = 11.8 absolute gap. NetVLAD is kept as the mandatory simple-VLAD baseline per the engine's Component Option Breadth rule ("at least one simple baseline...per component area when possible; these prevent false confidence in the selected option"); its role is to be the long-established reference point, NOT a competitive lead. Any selected modern C2 candidate (MixVPR / SALAD / SelaVPR / EigenPlaces / AnyLoc / BoQ) must show measurable Recall@K advantage over NetVLAD on the project's evaluation conditions to justify its added complexity. Acknowledged dependencies: MatConvNet, relja_matlab, Yael_matlab (optional, GPU acceleration). Modern lineage: Patch-NetVLAD (2021, CVPR; extends NetVLAD with patch-level features for re-ranking, GPL-3.0 license per their canonical repo); Generalized-Mean-Pool (GeM, 2018) is a simpler alternative often used together with NetVLAD-style aggregation; AnyLoc (2024) wraps NetVLAD aggregation around DINOv2 ViT-G features
  • Related Sub-question: SQ3+SQ4 / C2 — NetVLAD per-mode API capability verification (Mandatory context7 lookup PASS — /relja/netvlad indexed with 90 code snippets and Medium source reputation, benchmark score 80.5; cross-validated against canonical project page WebFetch + canonical paper WebFetch)
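The pooling steps described in the Summary above (input L2-norm, softmax soft-assignment, residual aggregation, intra-normalisation, flatten, final L2-norm) can be sketched in plain Python. This is a minimal illustration with toy sizes, not the Relja/netvlad code: the function names, the toy inputs, and the `eps` guard are assumptions made for this example, and a real implementation would use K=64 clusters over a 512-D conv5_3 map.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def netvlad_pool(features, centres, weights, biases):
    """Aggregate N local D-dim features into one (K*D)-dim descriptor.

    features: N lists of length D (one per spatial location of the conv map)
    centres, weights: K lists of length D; biases: K floats
    """
    K, D = len(centres), len(centres[0])
    vlad = [[0.0] * D for _ in range(K)]
    for x in features:
        x = l2_normalize(x)  # "preL2": normalise each input descriptor
        # Soft-assignment: softmax over w_k^T x + b_k
        logits = [sum(w * xi for w, xi in zip(weights[k], x)) + biases[k]
                  for k in range(K)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        for k in range(K):
            a = exps[k] / total
            for d in range(D):
                # First-order residual (x - c_k), weighted by the assignment
                vlad[k][d] += a * (x[d] - centres[k][d])
    # Intra-normalisation: L2-normalise each cluster's residual row
    vlad = [l2_normalize(row) for row in vlad]
    flat = [v for row in vlad for v in row]  # flatten K x D -> K*D
    return l2_normalize(flat)  # final L2-normalisation

# Toy example: N=4 local features, D=3 channels, K=2 clusters (hypothetical values)
features = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1], [0.4, 0.4, 0.8], [0.9, 0.1, 0.0]]
centres = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
weights = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
biases = [0.0, 0.0]
descriptor = netvlad_pool(features, centres, weights, biases)
print(len(descriptor))
```

The output length is K·D, which is why the VGG-16 configuration yields a 32768-D vector before PCA-whitening.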

Source #65

  • Title: NetVLAD modern PyTorch reproduction — Nanne/pytorch-NetVlad (Pytorch implementation of NetVlad with verified Pittsburgh 30k Recall@K reproduction); modern PyTorch + Faiss runtime path (replaces canonical MATLAB + MatConvNet stack); supports VGG-16 and AlexNet backbones with K=64 NetVLAD pooling
  • Link: README https://raw.githubusercontent.com/Nanne/pytorch-NetVlad/master/README.md (accessed 2026-05-08); repo https://github.com/Nanne/pytorch-NetVlad ; pretrained checkpoint mirror https://drive.google.com/open?id=17luTjZFCX639guSVy00OUtzfTQo4AMF2 (VGG-16 + NetVLAD reproduction)
  • Tier: L2 (third-party PyTorch port; canonical algorithm is owned by Relja/netvlad per Source #64; this port is the most-cited PyTorch reproduction in the VPR research community as of 2026, with verified Recall@K against the canonical paper)
  • Publication Date: Initial commit ~2018; main HEAD active through 2024
  • Timeliness Status: ⚠️ Borderline — last meaningful update appears to be ~2022 (via repo stars/issues age signals); however the algorithm is canonical and the reproduction has been continuously verified against the canonical paper's Table 1 numbers; per Established-baseline exemption, NetVLAD's reference role does not require fresh updates
  • Version Info: main HEAD; PyTorch v0.4.0+ minimum; depends on Faiss + scipy + numpy + sklearn + h5py + tensorboardX; CLI python main.py --mode={train,test,cluster} --arch={vgg16,alexnet} --pooling=netvlad --num_clusters=64; pretrained VGG-16 checkpoint distributed via Google Drive; license not explicitly stated in README — README does NOT cite a LICENSE file; verification of licensing terms is required if the project actually adopts this PyTorch port (Plan-phase clarification gate, deferred to Plan if NetVLAD is elevated beyond mandatory-baseline role); canonical alternative is to re-port from the MIT-licensed Relja/netvlad MATLAB repo (Source #64) directly — preserving MIT licensing on the project's NetVLAD path; 490 stars
  • Target Audience: System architects + C2 implementer + simple-baseline reference-point owner
  • Research Boundary Match: Full match for the runtime stack the project will use (PyTorch + Faiss vs canonical MATLAB + MatConvNet); partial match for the canonical NetVLAD-paper reproducibility: reported VGG-16 + NetVLAD R@1 on Pitts30k-test = 85.2 (vs canonical paper's 84.1, i.e. 1.1 absolute higher, a positive reproduction signal); AlexNet R@1 = 68.6 (paper does not report AlexNet on Pitts30k as the primary number; appendix only). Partial match for the project's domain (same as Source #64: Pittsburgh 30k street-level training + Tokyo 24/7 test, no aerial nadir)
  • Summary: Nanne/pytorch-NetVlad is the canonical PyTorch reproduction of NetVLAD that the modern VPR research community uses. README explicitly reports verified Recall@K vs the canonical paper's Table 1: VGG-16 + NetVLAD reproduction R@1=85.2 (paper: 84.1), R@5=94.8 (paper: 94.6), R@10=97.0 (paper: 95.5). The reproduction is +0.2 to +1.5 absolute above the original numbers (R@1 +1.1, R@5 +0.2, R@10 +1.5), demonstrating that the PyTorch port preserves the canonical algorithm's accuracy. Three CLI modes: --mode=train (full training pipeline with cluster-init prerequisite), --mode=test (inference + Recall@K evaluation), --mode=cluster (NetVLAD layer initialization via K-means clustering on training features, a prerequisite for --mode=train). Default flags: --arch=vgg16 --pooling=netvlad --num_clusters=64. Model state distributed via Google Drive. Critical license caveat: README does not cite a LICENSE file; verification of licensing terms is a Plan-phase blocker if the project adopts this port. Mitigation: re-port the canonical Relja/netvlad MATLAB repo (Source #64, MIT) to PyTorch directly, which preserves MIT licensing on the project's NetVLAD path; the canonical algorithm is well-documented in the paper and in Relja/netvlad README/README_more (Source #64), so re-implementation effort is moderate (~1 week of engineering + cluster-init + retraining or weight transfer). Alternatively, OpenVPRLab (Source #57) ships a NetVLAD aggregator option that can be combined with ResNet50/DINOv2 backbones, but that is a different mode per the Per-Mode API rule (different backbone, different pretrained checkpoint provenance, possibly different aggregation parameter defaults), and the project should treat OpenVPRLab-NetVLAD-on-ResNet50 as a separately-cataloged sibling mode if it pursues that path
  • Related Sub-question: SQ3+SQ4 / C2: NetVLAD per-mode API capability verification (cross-source PyTorch-stack reproduction validation; canonical paper Table 1 number reproduction verified to within +0.2 to +1.5 absolute Recall@K)
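The reproduction gap quoted for this source is plain subtraction over the Recall@K figures listed in the Summary. A short sketch of that arithmetic follows; the dictionary names are illustrative only, and the values are the ones reported above for the VGG-16 + NetVLAD port and the paper's Table 1.

```python
# Recall@K on Pitts30k-test as reported in this registry entry
paper = {1: 84.1, 5: 94.6, 10: 95.5}   # canonical paper, Table 1
port = {1: 85.2, 5: 94.8, 10: 97.0}    # Nanne/pytorch-NetVlad README, VGG-16

# Absolute difference per K, rounded to one decimal to hide float noise
deltas = {k: round(port[k] - paper[k], 1) for k in paper}
delta_range = (min(deltas.values()), max(deltas.values()))

for k in sorted(deltas):
    print(f"R@{k}: {deltas[k]:+.1f}")
print("range:", delta_range)
```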

Source #66

  • Title: NetVLAD canonical paper — "NetVLAD: CNN architecture for weakly supervised place recognition" (Arandjelović, Gronat, Torii, Pajdla, Sivic — CVPR 2016, arXiv:1511.07247 v1; extended TPAMI 2018 version arXiv:1511.07247 v3)
  • Link: arXiv https://arxiv.org/abs/1511.07247 (CVPR 2016 published version), accessed 2026-05-08; project page https://www.di.ens.fr/willow/research/netvlad/ ; CVPR 2016 PDF
  • Tier: L1 (peer-reviewed CVPR 2016 + TPAMI 2018 + canonical implementation cross-referenced; most-cited VPR paper of the modern deep-learning era, > 4000 citations as of 2026)
  • Publication Date: arXiv preprint 2015-11-23 (v1); CVPR 2016 acceptance; extended TPAMI 2018 (v3); algorithm has been the canonical VPR baseline for 10 years
  • Timeliness Status: ⚠️ Borderline by 18-month Critical-novelty window (10 years old) — but Established-baseline exemption applies per the engine's source-tiering rule. The algorithm itself is mature and cited as the canonical reference baseline in every subsequent VPR paper (MixVPR Table 1+4, SALAD Table 1, SelaVPR Table 2+3, AnyLoc, BoQ, etc.). Freshness concerns are on (a) the runtime stack (MATLAB + MatConvNet vs modern PyTorch — covered by Sources #64 + #65), (b) the canonical pretrained weights (street-level only, no aerial — same D-C2-1 caveat as MixVPR + SALAD + SelaVPR), and (c) the descriptor dimensionality vs modern compact descriptors (4096-D PCA-whitened is 2× larger than MixVPR's 2048-D and 4× larger than SelaVPR's 1024-D)
  • Version Info: arXiv v1 (CVPR 2016 camera-ready) + arXiv v3 (TPAMI 2018 extended)
  • Target Audience: System architects + C2 implementer + simple-baseline reference-point owner + algorithm-comparison reviewer
  • Research Boundary Match: Full match for the algorithm (NetVLAD pooling layer that generalizes VLAD to a learnable, end-to-end-trainable CNN module, described in §3.1 of the paper, equations 1-4); full match for the training procedure (weakly supervised triplet ranking loss with hard negative mining on Google Street View Time Machine, §4); partial match for the project's domain (paper benchmarks Pitts30k, Pitts250k, Tokyo24/7, Tokyo Time Machine, all ground-level urban; NO aerial nadir benchmark in the canonical paper, same caveat as MixVPR + SALAD + SelaVPR)
  • Summary: The canonical paper introduces NetVLAD = a generalized, end-to-end-trainable VLAD layer pluggable into any CNN architecture, with three principal contributions: (a) NetVLAD pooling layer (paper §3.1, Eq. 1-4): replaces VLAD's hard cluster-assignment with differentiable soft-assignment via softmax over w_k^T x_i + b_k; aggregates first-order residuals (x_i - c_k) weighted by soft-assignment into a K×D matrix; intra-channel L2-norm + flatten + final L2-norm to produce a (K·D)-dimensional vector; (b) weakly supervised triplet ranking loss (paper §4): uses GPS-tagged Google Street View Time Machine images with positive examples from the same approximate location and negative examples from far-away locations; hard negative mining selects the most-similar negatives for each query; (c) Time Machine training data: provides multiple panoramas at the same location over time, allowing the network to learn invariance to illumination/season changes by construction. Architecture: VGG-16 cropped at conv5_3 (output 512-D feature map at H×W spatial locations) + NetVLAD pooling with K=64 cluster centres → 64×512 = 32768-D K×D matrix → intra-norm + flatten + L2-norm → 32768-D NetVLAD descriptor → optional PCA-whitening to 4096-D (canonical recommendation) or 256-D / 512-D for tighter cache budgets. Reported performance (paper Table 1, Pitts30k-test on best VGG-16 + NetVLAD + whitening trained on Pittsburgh): R@1=84.1, R@5=94.6, R@10=95.5. Reported performance (Tokyo24/7): R@1=73.3 (paper's Table on Tokyo24/7 across daytime/sunset/nighttime queries, used as the cross-illumination challenge benchmark, the same way modern papers use Tokyo24/7 for night queries). Reported architecture choices: input image size 224×224 (canonical paper test resolution); training with SGD, momentum=0.9, weight decay=0.001, lr=10^-4 (Tokyo Time Machine) or 10^-3 (Pitts30k), batch size 4 tuples per SGD step, margin m=0.1, 30 epochs. 
Comparison to non-deep baselines: NetVLAD outperforms RootSIFT+VLAD+whitening (the previous SOTA) by ~5-7 absolute Recall@1 points on Pitts30k and Tokyo24/7. Comparison to off-the-shelf CNN descriptors: end-to-end-trained NetVLAD outperforms off-the-shelf VGG/AlexNet conv5 features pooled via max-pool/avg-pool/VLAD by ~10-20 absolute R@1; the paper's primary scientific contribution = "training the representation directly for the place recognition task is crucial for obtaining good performance" (paper §1.1, §5.2). Authors' acknowledged limitations: (a) Tokyo Time Machine training is geographically limited (eventually expanded in TPAMI 2018 extended version); (b) PCA-whitening tuning is dataset-specific; (c) the soft-assignment softmax is sensitive to the cluster centre initialization (the --mode=cluster step of the PyTorch port in Source #65 performs this K-means cluster-centre initialization on a sample of training features; without it, NetVLAD training collapses or drifts). Modern descendants: Patch-NetVLAD (2021) extends NetVLAD with patch-level features for re-ranking (vs. SelaVPR's local features for re-ranking); AnyLoc (2024) wraps NetVLAD-style aggregation around DINOv2 ViT-G features (the Conditional INT8 candidate row in 06_component_fit_matrix/C2_vpr.md). License (canonical implementation): MIT (per Source #64)
  • Related Sub-question: SQ3+SQ4 / C2 — NetVLAD per-mode API capability verification (cross-source verification of the canonical implementation's algorithmic claims, training procedure, and Pitts30k / Tokyo24/7 Recall@K numbers; aerial-domain caveat documented; descriptor-dimensionality vs modern compact descriptors trade-off documented)
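The descriptor-dimensionality trade-off noted above reduces to simple arithmetic. The sketch below works it through under two assumptions that are not stated in the sources: descriptors are stored as float32 (4 bytes per value), and the reference cache holds a hypothetical 100,000 images. The dimensionalities themselves are the ones quoted in this entry.

```python
BYTES_PER_FLOAT32 = 4            # assumption: descriptors cached as float32
NUM_REFERENCE_IMAGES = 100_000   # hypothetical cache size, for illustration only

def descriptor_bytes(dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Storage needed for one global descriptor of the given dimensionality."""
    return dim * bytes_per_value

# NetVLAD on VGG-16 conv5_3: K cluster centres x D channels
K, D = 64, 512
raw_dim = K * D

dims = {
    "NetVLAD raw (K*D)": raw_dim,
    "NetVLAD PCA-whitened": 4096,
    "MixVPR": 2048,
    "SelaVPR": 1024,
}

# Cache footprint in MiB for the hypothetical reference set
cache_mib = {
    name: descriptor_bytes(dim) * NUM_REFERENCE_IMAGES / 2**20
    for name, dim in dims.items()
}

for name, mib in cache_mib.items():
    print(f"{name:22s} {dims[name]:6d}-D  {mib:9.1f} MiB")
```

Brute-force cosine-similarity cost scales linearly with the same dimensionality, so the 2× and 4× ratios carry over to retrieval time as well as cache size.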

Source #67

  • Title: EigenPlaces canonical implementation — gmberton/EigenPlaces (Berton, Trivigno, Caputo, Masone — ICCV 2023) — official PyTorch reference implementation, training (train.py) + evaluation (eval.py) CLIs, PyTorch Hub registration with multiple pretrained backbones+descriptor-dim variants (ResNet18 256/512; ResNet50 128/256/512/2048; ResNet101 128/256/512/2048; VGG16 512), companion fair-evaluation framework gmberton/VPR-methods-evaluation, MIT License
  • Link: README https://raw.githubusercontent.com/gmberton/EigenPlaces/main/README.md (accessed 2026-05-08); LICENSE https://raw.githubusercontent.com/gmberton/EigenPlaces/main/LICENSE (accessed 2026-05-08); repo https://github.com/gmberton/EigenPlaces ; PyTorch Hub one-liner model = torch.hub.load("gmberton/eigenplaces", "get_trained_model", backbone="ResNet50", fc_output_dim=2048) ; companion eval framework https://github.com/gmberton/VPR-methods-evaluation ; companion CosPlace repo (training-data lineage) https://github.com/gmberton/CosPlace
  • Tier: L1 (project-official codebase by the canonical EigenPlaces authors Gabriele Berton + Gabriele Trivigno + Barbara Caputo + Carlo Masone, Politecnico di Torino — same author group as CosPlace [CVPR 2022] and VPR-methods-evaluation standardized comparison harness; Berton's lab is one of the most active VPR research groups of the modern era, also producing MegaLoc which the EigenPlaces README cites as "Looking for SOTA Visual Place Recognition (VPR)? Check out MegaLoc")
  • Publication Date: README live; main HEAD active through 2024 (PyTorch Hub publication preserves the canonical pretrained checkpoints in perpetuity); canonical paper ICCV 2023 (Aug 2023 arXiv submission, Oct 2023 ICCV proceedings)
  • Timeliness Status: ⚠️ Borderline by 18-month Critical-novelty window (paper Aug 2023 / ICCV 2023) — but mature algorithm with active PyTorch-Hub-distributed pretrained checkpoints; Berton's lab continues active VPR research (MegaLoc successor + VPR-methods-evaluation companion harness); the freshness concern is on (a) emerging successors that improve on EigenPlaces (MegaLoc per README's own pointer), (b) aerial-domain transfer of canonical SF-XL street-view weights (same D-C2-1 caveat as MixVPR + SALAD + SelaVPR + NetVLAD), (c) potential improvements via larger backbones (DINOv2-based candidates not yet in this row)
  • Version Info: main HEAD; PyTorch (no specific version pinned in README); canonical eval one-liner python3 eval.py --backbone ResNet50 --fc_output_dim 2048 --resume_model torchhub (downloads pretrained from PyTorch Hub); training one-liner python3 train.py --backbone ResNet50 --fc_output_dim 128 (configurable per backbone+dim combo); supported --backbone values: ResNet18, ResNet50, ResNet101, VGG16; supported --fc_output_dim per backbone: ResNet18 ∈ {256, 512}, ResNet50 ∈ {128, 256, 512, 2048}, ResNet101 ∈ {128, 256, 512, 2048}, VGG16 ∈ {512}; --train_dataset_folder path/to/sf_xl/raw/train/panoramas for canonical SF-XL training; MIT License (Copyright (c) 2023 Gabriele Berton, Gabriele Trivigno, Carlo Masone, Barbara Caputo); 200+ stars; auto_VPR companion codebase (referenced in paper §1) for fair-comparison-with-other-baselines is at https://github.com/gmberton/auto_VPR (also MIT, Berton's lab)
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer + simple-baseline-vs-modern-lead comparison framework owner
  • Research Boundary Match: Full match for the project's pinned C2 mode (ResNet-50 backbone + GeM pooling + single fully-connected layer producing 2048-D L2-normalised global descriptor at ImageNet-normalised input — single-stage retrieval, no re-ranking). The repo ships everything needed for inference (eval.py --backbone ResNet50 --fc_output_dim 2048 --resume_model torchhub) and for retraining on a custom dataset (train.py --train_dataset_folder ... --backbone ResNet50 --fc_output_dim 2048). Multiple per-Mode sibling candidates: each (backbone, fc_output_dim) tuple is a distinct mode per the Per-Mode API rule — ResNet50@128 (cache-tightest), ResNet50@512 (cache-medium), ResNet50@2048 (best Recall@1 / cache-medium-large), VGG16@512 (legacy backbone for cross-comparison with NetVLAD's VGG-16). Partial match for the project's domain (canonical training is SF-XL panoramas — San Francisco eXtra Large, 170 km², ~2.8M database images street-level urban; canonical evaluation is 16 datasets including Pitts30k/Pitts250k/Tokyo24/7/AmsterTime/SF-XL-test-v1/SF-XL-test-v2/MSLS-val/Nordland/St-Lucia/SVOX-{Night,Overcast,Rain,Snow,Sun}/Eynsham/San-Francisco-Landmark — all ground-level / street-level / urban + multi-season; NO aerial nadir benchmark in the repo's reported tables, same caveat as MixVPR + SALAD + SelaVPR + NetVLAD). 
Critical implementation match advantage vs MixVPR / SALAD / SelaVPR: (i) the auto_VPR and VPR-methods-evaluation companion frameworks ship an out-of-the-box fair-comparison harness with NetVLAD + SFRS + CosPlace + Conv-AP + MixVPR + EigenPlaces, directly usable for the Jetson MVE phase to score all C2 candidates against the same ADTi 20MP nav frames + Derkachi flight; (ii) ResNet-50 + GeM + FC is the structurally simplest modern competitive C2 architecture in this row (simpler than MixVPR's MLP-Mixer aggregation, simpler than SALAD's optimal-transport aggregation + DINOv2 backbone, simpler than SelaVPR's two-stage adapter+local-up-conv architecture, simpler than NetVLAD's K=64 cluster-centre soft-assignment), which means fewer Jetson-export risks
  • Summary: EigenPlaces is the canonical reference implementation of the ICCV 2023 paper "EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition" by Gabriele Berton et al. Critical license finding: LICENSE file is MIT License (Copyright (c) 2023 Gabriele Berton, Gabriele Trivigno, Carlo Masone, Barbara Caputo), i.e. permissive; this places EigenPlaces on the BSD/permissive license track alongside MixVPR (MIT) + SelaVPR (MIT) + NetVLAD-canonical (MIT) + Kimera-VIO (BSD-2) + OKVIS2 (BSD-3) + DPVO (MIT) + pure-VO baseline (OpenCV-Apache-2.0). EigenPlaces is the fourth modern C2 candidate on the BSD/permissive track (after MixVPR + SelaVPR + NetVLAD-canonical), completing the BSD/permissive C2 axis with a viewpoint-robust modern design point. Multiple pretrained checkpoints are documented and PyTorch-Hub-distributed (no Google Drive dependency, unlike SelaVPR): (ResNet18, 256), (ResNet18, 512), (ResNet50, 128), (ResNet50, 256), (ResNet50, 512), (ResNet50, 2048), (ResNet101, 128), (ResNet101, 256), (ResNet101, 512), (ResNet101, 2048), (VGG16, 512), i.e. eleven canonical pretrained checkpoints in total, more than any other C2 candidate evaluated so far. Acknowledged dependencies: CosFace PyTorch implementation (for Large Margin Cosine Loss layer), CNN Image Retrieval in PyTorch (for the GeM layer), Visual Geo-localization benchmark (for evaluation/test code), CosPlace (for the SF-XL dataset partitioning lineage). README explicit pointer: "EigenPlaces is quite old. Looking for SOTA VPR? Check out MegaLoc". Berton's lab acknowledges newer work supersedes EigenPlaces on average benchmarks, but for the project's mandatory simple-baseline + modern-CNN-lead role, EigenPlaces remains the canonical viewpoint-robust reference design point.
  • Related Sub-question: SQ3+SQ4 / C2 — EigenPlaces per-mode API capability verification (context7 returned NOT INDEXED for gmberton/eigenplaces and EMPTY search results for the query eigenplaces — per Per-Mode API Capability Verification rule item 2, fall-back to official-docs WebFetch on the canonical repo README + LICENSE was used; cross-validated against the canonical paper [Source #68])
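Because each (backbone, fc_output_dim) tuple counts as a distinct mode under the Per-Mode API rule, it can help to enumerate the supported combinations before requesting a checkpoint. The sketch below encodes the table from the Version Info field; `SUPPORTED_MODES`, `list_modes`, and `check_mode` are hypothetical helper names invented for this example and are not part of the EigenPlaces repo. The commented `torch.hub.load` call is the one-liner quoted in the Link field and is not executed here.

```python
# (backbone -> allowed fc_output_dim values), as listed in the Version Info field
SUPPORTED_MODES = {
    "ResNet18": (256, 512),
    "ResNet50": (128, 256, 512, 2048),
    "ResNet101": (128, 256, 512, 2048),
    "VGG16": (512,),
}

def list_modes():
    """Return every (backbone, fc_output_dim) pair as a flat list."""
    return [(b, d) for b, dims in SUPPORTED_MODES.items() for d in dims]

def check_mode(backbone, fc_output_dim):
    """Raise ValueError for a combination without a published checkpoint."""
    if fc_output_dim not in SUPPORTED_MODES.get(backbone, ()):
        raise ValueError(f"unsupported mode: {backbone}@{fc_output_dim}")
    return backbone, fc_output_dim

backbone, dim = check_mode("ResNet50", 2048)
print(len(list_modes()), "modes; selected", backbone, dim)

# With torch installed and network access, the validated pair would feed the
# one-liner quoted in this entry:
# model = torch.hub.load("gmberton/eigenplaces", "get_trained_model",
#                        backbone=backbone, fc_output_dim=dim)
```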

Source #68

  • Title: EigenPlaces canonical paper — "EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition" (Berton, Trivigno, Caputo, Masone — ICCV 2023, arXiv:2308.10832 v1) + ICCV 2023 Open Access Foundation page
  • Link: arXiv https://arxiv.org/abs/2308.10832 (Aug 2023); ICCV 2023 Open Access https://openaccess.thecvf.com/content/ICCV2023/papers/Berton_EigenPlaces_Training_Viewpoint_Robust_Models_for_Visual_Place_Recognition_ICCV_2023_paper.pdf (accessed 2026-05-08); ICCV 2023 OA HTML https://openaccess.thecvf.com/content/ICCV2023/html/Berton_EigenPlaces_Training_Viewpoint_Robust_Models_for_Visual_Place_Recognition_ICCV_2023_paper.html
  • Tier: L1 (peer-reviewed ICCV 2023 + canonical implementation cross-referenced)
  • Publication Date: arXiv preprint 2023-08-21; ICCV 2023 acceptance October 2023 (pages 11080-11090 of the proceedings)
  • Timeliness Status: ⚠️ Borderline by 18-month Critical-novelty window (Aug 2023 / ICCV 2023) — algorithm is mature, canonical implementation is actively maintained via PyTorch Hub, algorithmic content is stable; the freshness concern is on (a) MegaLoc successor that the README itself recommends, (b) aerial-domain weights (same D-C2-1 as all C2 candidates), (c) DINOv2-backbone variants not in EigenPlaces canonical (would require retrain)
  • Version Info: arXiv v1 (ICCV camera-ready; pages 11080-11090 in ICCV 2023 proceedings)
  • Target Audience: System architects + C2 implementer + Step-7.5 reviewer
  • Research Boundary Match: Full match for the algorithm (single-stage VPR with viewpoint-robust training paradigm: SVD-based class construction within map cells + lateral CosFace loss + frontal CosFace loss; CNN backbone + GeM pooling + FC layer for descriptor extraction). Partial match for the project's domain (paper benchmarks 16 ground-level urban / multi-season datasets — Pitts30k, Pitts250k, Tokyo24/7, AmsterTime, Eynsham, SF-XL test v1/v2, San Francisco Landmark [multi-view] + MSLS Val, Nordland, St Lucia, SVOX Night/Overcast/Rain/Snow/Sun [frontal-view]; NO aerial nadir benchmark in the canonical paper, same caveat as MixVPR + SALAD + SelaVPR + NetVLAD)
  • Summary: The canonical paper introduces EigenPlaces = a training paradigm, NOT a new neural architecture. The architecture is intentionally simple — VGG-16 or ResNet-50 (or ResNet-18 / ResNet-101 in repo) cropped at the last conv layer, fed into GeM pooling [Radenovic et al. 2018], then a single fully-connected layer produces the descriptor of dim fc_output_dim. The novelty is the viewpoint-robust training paradigm (paper §3): (a) Map partitioning (§3.1) — divide the SF-XL map into M×M=15m×15m cells; group cells in subsets where N=3 ensures no spatial overlap; shift the active cell-set after each training epoch; (b) EigenPlaces class construction (§3.2 + Fig 3-4) — for each cell, compute SVD of the (centered) UTM coordinates of all images in that cell; the first principal component (V0) is the road direction; the second principal component (V1, perpendicular to road) points to the side of the road where points-of-interest (building facades) are located; place a focal point at distance D=10m from the cell center along V1; group images facing this focal point into one lateral class; repeat with a focal point along V0 (the road direction itself) to form a frontal class for forward-facing cameras; (c) Two-CosFace-loss training (§3.3) — Large Margin Cosine Loss [Wang et al. 2018] on the lateral class (Llat) makes the network robust to viewpoint shifts; second CosFace loss on the frontal class (Lfront) handles forward-facing cameras (e.g., MSLS / St Lucia datasets); final loss is L = Llat + Lfront. 
Implementation details (paper §4.2): simple architecture (VGG-16 or ResNet-50 + GeM + FC), 200k iterations, batch size 128 (64 lateral + 64 frontal), Adam lr=1e-5, mixed precision (AMP), SFRS-style augmentations (color jittering + random cropping); cell M=15m, N=3, focal distance D=10m (paper Tab 6 ablation: D=20m slightly better on average); input image size: NOT explicitly stated in paper text, follows VPR-methods-evaluation framework default 224×224 ImageNet-normalised (canonical eval CLI exposes --image_size flag in companion gmberton/VPR-methods-evaluation framework, defaulting to 224×224 for fair comparison across all methods in the standardized harness). Reported Recall@1 on multi-view datasets (Tab 3, ResNet-50 best-config @ 2048-D): AmsterTime=48.9 (best in table, +1.2 over CosPlace 47.7); Eynsham=90.7; Pitts30k=92.5 (vs CosPlace 90.9, MixVPR-2048 91.5, MixVPR-4096 91.5, NetVLAD-VGG16-4096 85.0; SALAD Pitts250k 95.1 is from different paper Table); Pitts250k=94.1; Tokyo24/7=93.0 (best in Tab 3 across all methods compared in paper, +5.7 over CosPlace 87.3, +7.9 over MixVPR-4096 85.1; SelaVPR Tokyo24/7=94.0 from Source #63 is +1.0 over EigenPlaces); SF-XL-test-v1=84.1 (vs CosPlace 76.4, NetVLAD-VGG16-4096 40.0); SF-XL-test-v2=90.8. Reported Recall@1 on frontal-view datasets (Tab 4, ResNet-50 @ 2048-D): MSLS-Val=89.1 (vs CosPlace 87.4, MixVPR-4096 87.2, MixVPR-512 83.6; SALAD MSLS-Val=92.2 from Source #60 wins by +3.1 absolute; SelaVPR MSLS-Val=90.8 from Source #62 wins by +1.7 absolute); Nordland=71.2 (vs MixVPR-4096 76.2 — MixVPR wins by +5 absolute on Nordland); St-Lucia=99.6; SVOX-Night=58.9 (vs MixVPR-4096 64.4 — MixVPR wins by +5.5 absolute on extreme night); SVOX-Overcast=93.1; SVOX-Rain=90.0; SVOX-Snow=93.1; SVOX-Sun=86.4. 
Resource analysis (§4.4): EigenPlaces ResNet-50 + 2048-D trains with <7 GB GPU VRAM (vs MixVPR 18 GB at batch=480 with their canonical batch — EigenPlaces requires 60% less GPU memory); training takes ~24 hours on a single RTX 3090 (similar to SFRS, CosPlace, MixVPR); EigenPlaces uses mixed precision (AMP) following Conv-AP and MixVPR; descriptor dimensionality is 2048-D vs MixVPR's best of 4096-D — 50% smaller descriptor; paper notes "the inference time can be computed as the sum of the query's descriptors extraction time plus the matching (kNN) time, rendering the extraction time negligible when working on large scale datasets" — the paper does NOT report explicit per-query extraction latency (unlike MixVPR's 1.21 ms on A100, SALAD's 2.41 ms on RTX 3090, SelaVPR's 27 ms on RTX 3090); for ResNet-50 forward-pass extrapolation, contemporary GPU benchmarks place ResNet-50 fp16 at ~1-2 ms on A100 / ~3-5 ms on RTX 3090 / ~15-30 ms on Jetson Orin Nano Super at fp16+TensorRT. Authors' acknowledged limitations / observations (§4.3): (a) "no single model that outperforms all other ones on all datasets" — EigenPlaces wins on multi-view but MixVPR-4096 wins on Nordland and SVOX-Night; (b) "lower dimensionality descriptors still struggle on cross-domain datasets (e.g. AmsterTime, Tokyo 24/7, SVOX night)" — the 128-D / 256-D variants trade Recall@K for cache footprint; (c) the paper does not benchmark on aerial nadir imagery (same D-C2-1 caveat); (d) the README explicitly recommends MegaLoc as a SOTA successor — for the project's mandatory-pre-screen role this is acceptable, but Plan-phase may want to also evaluate MegaLoc as a separately-cataloged sibling/successor candidate. License (canonical implementation): MIT (per Source #67).
  • Related Sub-question: SQ3+SQ4 / C2 — EigenPlaces per-mode API capability verification (cross-source verification of the canonical implementation's mode/parameter/training-recipe/Recall@K claims; aerial-domain caveat documented; ResNet-50 vs DINOv2-based candidates structural-simplicity advantage documented; viewpoint-robust training paradigm completes the BSD/permissive C2 axis with a 4th materially-different design point alongside MixVPR + SelaVPR + NetVLAD)
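The class-construction step summarised above (principal directions of the image coordinates within a cell, then focal points at distance D along V1 and V0) can be sketched for the 2D case without any linear-algebra library. This is an illustration under stated assumptions rather than the authors' code: it uses the centroid of the image coordinates as the cell center, derives the principal axis from the closed-form angle of the 2×2 scatter matrix (equivalent to an SVD of the centred coordinates in 2D), and leaves the sign of each axis arbitrary. The function names and sample coordinates are hypothetical.

```python
import math

def principal_axes(points):
    """Return (centroid, v0, v1) for 2D points; v0 is the dominant direction."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    # Entries of the 2x2 scatter matrix of the centred coordinates
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    # Closed-form orientation of the major axis
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    v0 = (math.cos(theta), math.sin(theta))    # first principal component
    v1 = (-math.sin(theta), math.cos(theta))   # perpendicular to v0
    return (cx, cy), v0, v1

def focal_points(points, distance=10.0):
    """Lateral focal point along v1 and frontal focal point along v0."""
    centre, v0, v1 = principal_axes(points)
    lateral = (centre[0] + distance * v1[0], centre[1] + distance * v1[1])
    frontal = (centre[0] + distance * v0[0], centre[1] + distance * v0[1])
    return lateral, frontal

# Hypothetical UTM-like coordinates (metres) of images along an east-west road
cell_images = [(0.0, 0.2), (5.0, -0.1), (10.0, 0.1), (15.0, -0.2)]
centre, v0, v1 = principal_axes(cell_images)
lateral, frontal = focal_points(cell_images, distance=10.0)
print("road direction:", v0)
print("lateral focal point:", lateral)
```

Images whose cameras face the lateral focal point would then form one lateral class, and those facing the frontal focal point one frontal class, as described in the Summary.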