Source Registry — C2 — Visual Place Recognition candidates
Mode A Phase 2: engine Step 2 (Source Tiering & Exhaustive Web Investigation). Critical-novelty sensitivity per Step 0.5 in `../00_question_decomposition.md`. Time windows applied:
- Lead-candidate / SOTA claims: prefer sources within last 6 months; up to 18 months if older is the official authority.
- Library/SDK API behaviour: must reflect the currently shipped version at search time (`context7` mandatory per lead candidate).
- Established baselines (KLT, RANSAC, EKF, ORB, SIFT, GTSAM): no time window.
This file replaces a section of the previous monolithic `01_source_registry.md`. See `00_summary.md` for the full category index. Investigation order is tracked in `../00_question_decomposition.md` and the cross-category Investigation Status table in `00_summary.md`.
Source #57
- Title: OpenVPRLab, a comprehensive open-source framework for Visual Place Recognition (`amaralibey/openvprlab`, main); bundles MixVPR aggregator + ResNet50 backbone + GSV-Cities datamodule + FAISS recall harness as the canonical reference implementation
- Link: https://context7.com/amaralibey/openvprlab/llms.txt (accessed 2026-05-08); canonical README https://github.com/amaralibey/openvprlab
- Tier: L1 (project-official codebase by the canonical MixVPR author Amar Ali-bey; same repo also packages BoQ aggregator)
- Publication Date: README live; OpenVPRLab repo created 2024 as the modular successor to the `amaralibey/MixVPR` and `amaralibey/gsv-cities` repos; main HEAD within Critical-novelty window (per Fact #36 timeliness audit)
- Timeliness Status: ✅ Fully within Critical-novelty window
- Version Info: OpenVPRLab main HEAD; PyTorch Lightning module; GSV-Cities-light + GSV-Cities datamodules; supported aggregators: MixVPR, BoQ, NetVLAD, GeM, ConvAP, Cosine; supported backbones: ResNet18/50, DinoV2 ViT-S/B/L/G-14
- Target Audience: System architects + C2 implementer + Step-7.5 reviewer
- Research Boundary Match: Full match for the project's pinned C2 mode (ResNet50 backbone + MixVPR aggregator at 320×320 input → 2048-D L2-normalised descriptor). Code snippets confirmed: `Initialize and Use MixVPR Aggregator` (parameter-by-parameter API), `Initialize VPRFramework and Perform Inference` (full backbone+aggregator+loss assembly with `images: torch.Tensor[B, 3, 320, 320]`), `Compute Recall Performance with FAISS` (the validation harness reporting Recall@{1,5,10,15}), `Train and Monitor VPR Models via CLI` (the canonical `python run.py --config ./config/resnet50_mixvpr.yaml` runnable). The DinoV2-MixVPR variant is also documented but is a separate per-Mode candidate per the Per-Mode API rule.
- Summary: OpenVPRLab is the canonical PyTorch Lightning reference implementation for MixVPR. Key documentary findings for the per-mode API verification gate: (a) MixVPR is parameterized as `MixVPR(in_channels, in_h, in_w, out_channels, mix_depth, mlp_ratio, out_rows)`, so the (backbone, aggregator-shape) pair is the per-Mode unit; (b) canonical paper config `(ResNet50, in=1024×20×20, out_channels=512, mix_depth=4, mlp_ratio=1, out_rows=4)` produces a 2048-D L2-normalised descriptor at 320×320 input; (c) preprocessing is ImageNet mean/std normalisation; (d) FAISS-based cosine retrieval is the documented validation pipeline; (e) training is via `python run.py --config ./config/resnet50_mixvpr.yaml` on GSV-Cities (street-view), NOT aerial nadir; (f) no documented Jetson measurement; (g) no documented ONNX/TensorRT export path inside the framework (relies on standard PyTorch → ONNX export; to be resolved in C7, not C2). License: MIT.
- Related Sub-question: SQ3+SQ4 / C2: MixVPR per-mode API capability verification (mandatory `context7` lookup per Per-Mode API Capability Verification rule)
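
To make finding (d) concrete, here is a minimal sketch of cosine-similarity retrieval and a Recall@K computation over L2-normalised descriptors. It is a dependency-free stand-in for what an inner-product index would do, not the OpenVPRLab harness itself; the helper names `l2_normalise`, `top_k`, and `recall_at_k` and the toy data are hypothetical.

```python
import math

def l2_normalise(vec):
    """Scale a descriptor to unit length (cosine similarity becomes a dot product)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def top_k(query, database, k):
    """Indices of the k database descriptors most similar to the query."""
    scores = [sum(q * d for q, d in zip(query, db_vec)) for db_vec in database]
    order = sorted(range(len(database)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def recall_at_k(queries, database, ground_truth, ks=(1, 5, 10, 15)):
    """Fraction of queries with at least one correct match in the top k."""
    db = [l2_normalise(v) for v in database]
    hits = {k: 0 for k in ks}
    for qi, query in enumerate(queries):
        ranked = top_k(l2_normalise(query), db, max(ks))
        for k in ks:
            if any(idx in ground_truth[qi] for idx in ranked[:k]):
                hits[k] += 1
    return {k: hits[k] / len(queries) for k in ks}

# Toy data (hypothetical): 3-D descriptors instead of 2048-D.
database = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.7, 0.7, 0.0]]
queries = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
ground_truth = [{0}, {1}]  # database indices counted as correct for each query
toy_recall = recall_at_k(queries, database, ground_truth, ks=(1, 2))
```

On the toy data the first query is matched at rank 1 and the second only at rank 2, so Recall@1 is 0.5 and Recall@2 is 1.0. A real evaluation would swap the brute-force `top_k` for an approximate or exact index, with the same normalise-then-dot-product logic.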
Source #58
- Title: MixVPR canonical paper, "MixVPR: Feature Mixing for Visual Place Recognition" (Ali-bey, Chaib-draa, Giguère, WACV 2023, arXiv:2303.02190) + canonical implementation `amaralibey/MixVPR`
- Link: arXiv canonical paper https://arxiv.org/abs/2303.02190 (Mar 2023); canonical implementation https://github.com/amaralibey/MixVPR
- Tier: L1 (peer-reviewed WACV 2023 + author's canonical implementation)
- Publication Date: arXiv preprint 2023-03-03; WACV 2023 acceptance
- Timeliness Status: ⚠️ Borderline — the canonical paper itself (Mar 2023) is older than the Critical-novelty 18-month window threshold for SQ3+SQ4 component selection, but the algorithm is mature and OpenVPRLab (Source #57, in window) maintains it actively; the algorithmic content is stable, the freshness concern is on weights and aerial-domain retrains
- Version Info: WACV 2023 published version; canonical implementation tag-less but main HEAD aligned with paper config
- Target Audience: System architects + C2 implementer + Step-7.5 reviewer
- Research Boundary Match: Full match for the algorithm/API (single-image global-descriptor VPR with MLP-Mixer aggregation on a CNN backbone); partial match for the project's domain (paper benchmarks Pitts30k, MSLS-val, Tokyo24/7, Nordland — all ground-level / street-level / urban; no aerial nadir benchmark in the canonical paper)
- Summary: The canonical paper introduces the MixVPR aggregation method: a stack of `mix_depth` FeatureMixer (MLP-Mixer-style) blocks operating over CNN feature-map rows + columns, followed by depth-wise + row-wise projections to produce a compact L2-normalised descriptor. Default config (and the project's pinned mode) is ResNet50-cropped backbone → 1024×20×20 feature map → MixVPR(1024, 20, 20, 512, 4, 1, 4) → 2048-D descriptor. Reported inference latency: 1.21 ms per image on A100 at 320×320 batch=1 (paper Table 4); useful as the documentary best-case (datacenter-GPU) reference point, since the Jetson Orin Nano Super figure will be slower and must be measured rather than assumed. Reported Recall@1: ≥90% on Pitts30k-test, ≥85% on MSLS-val (state-of-the-art at publication time among 8K-and-below descriptor methods). Critical observation: the paper does NOT report aerial nadir benchmarks; aerial-domain transfer of MixVPR is documented in subsequent third-party work (Skoltech aerial-VPR survey + AerialExtreMatch, Sources #38, #19) but with materially different recall numbers, requiring project-domain retrain or aerial-trained community checkpoint. License: MIT (per canonical implementation repo).
- Related Sub-question: SQ3+SQ4 / C2: MixVPR per-mode API capability verification (cross-source verification of the OpenVPRLab descriptor's algorithmic claims; aerial-domain caveat sourced)
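
The shape bookkeeping behind the pinned mode can be traced with a short sketch. It only tracks tensor shapes, not the learned layers; the function name `mixvpr_shape_trace` is hypothetical, and the backbone stride of 16 is an assumption inferred from 320×320 input giving a 20×20 feature map.

```python
def mixvpr_shape_trace(image_size=320, backbone_stride=16, in_channels=1024,
                       out_channels=512, out_rows=4):
    """Return the (stage, shape) sequence for one image through the aggregator."""
    if image_size % backbone_stride != 0:
        raise ValueError("image_size must be a multiple of the backbone stride")
    side = image_size // backbone_stride
    tokens = side * side
    return [
        ("backbone feature map", (in_channels, side, side)),
        ("flattened spatial dims", (in_channels, tokens)),
        ("after FeatureMixer stack", (in_channels, tokens)),  # mixing keeps the shape
        ("after depth-wise projection", (out_channels, tokens)),
        ("after row-wise projection", (out_channels, out_rows)),
        ("flattened, L2-normalised descriptor", (out_channels * out_rows,)),
    ]

trace = mixvpr_shape_trace()
descriptor_dim = trace[-1][1][0]  # 512 * 4 = 2048
```

Changing `out_channels` or `out_rows` changes the descriptor length directly, which is why the registry treats each (backbone, aggregator-shape) pair as a separate mode.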
Source #59
- Title: SALAD canonical implementation: `serizba/salad` (Izquierdo & Civera, CVPR 2024); official PyTorch reference implementation, eval CLI, three pretrained checkpoints (8192+256, 2048+64, 512+32 descriptor sizes), Torch Hub registration, GSV-Cities training pipeline
- Link: README https://github.com/serizba/salad (raw https://raw.githubusercontent.com/serizba/salad/main/README.md, accessed 2026-05-08); LICENSE https://github.com/serizba/salad/blob/main/LICENSE (raw https://raw.githubusercontent.com/serizba/salad/main/LICENSE, accessed 2026-05-08)
- Tier: L1 (project-official codebase by the canonical SALAD authors Sergio Izquierdo + Javier Civera, Universidad de Zaragoza)
- Publication Date: README live; main HEAD within Critical-novelty window per Fact #36 timeliness audit; canonical paper CVPR 2024
- Timeliness Status: ✅ Fully within Critical-novelty window
- Version Info: main HEAD; PyTorch 2.1.0 + CUDA 12.1 + Xformers (per README §Setup); Torch Hub one-liner `model = torch.hub.load("serizba/salad", "dinov2_salad")` returns the full 8448-D config; eval CLI `python3 eval.py --ckpt_path 'weights/dino_salad.ckpt' --image_size 322 322 --batch_size 256 --val_datasets MSLS Nordland`; three pretrained checkpoints documented (`dino_salad` 8192+256, `dino_salad_2048_64` 2048+64, `dino_salad_512_32` 512+32)
- Target Audience: System architects + C2 implementer + Step-7.5 reviewer
- Research Boundary Match: Full match for the project's pinned C2 mode (DINOv2 ViT-B/14 backbone + SALAD aggregator at 322×322 input). The repo ships everything needed to instantiate the model, run inference, and reproduce the canonical numbers. Partial match for the project's domain (canonical training set is GSV-Cities street-view; canonical evaluation sets are MSLS Challenge/Val, Pittsburgh250k-test, SPED, NordLand, SF-XL — all ground-level / street-level / urban / cross-season; no aerial nadir benchmark in the repo's reported tables, same caveat as MixVPR).
- Summary: SALAD is the canonical reference implementation of the CVPR 2024 paper "Optimal Transport Aggregation for Visual Place Recognition" by Sergio Izquierdo and Javier Civera. Critical license finding: LICENSE file is GNU GENERAL PUBLIC LICENSE v3 (GPL-3.0), which is copyleft. This places SALAD on the GPL-3.0 license track alongside OpenVINS / VINS-Mono / VINS-Fusion / ORB-SLAM3, NOT on the BSD/permissive track where MixVPR (MIT) sits. Three pretrained checkpoints documented: full (`dino_salad`, 8192+256 = 8448-D descriptor; m=64 clusters × l=128 dim per cluster + 256-D global token), medium (`dino_salad_2048_64`, 2048+64 = 2112-D; m=32, l=64), slim (`dino_salad_512_32`, 512+32 = 544-D; m=16, l=32, since a 512-D cluster part at l=32 requires 16 clusters). Canonical evaluation input: `--image_size 322 322` (must be divisible by 14 for ViT/14 patch grid → 322/14 = 23 patches per side → 23×23 = 529 spatial tokens + 1 global token). Training: GSV-Cities, 4 epochs, ~30 min on RTX 3090. Acknowledged base: MixVPR's training framework is the harness on top of which SALAD is built (NOT OpenVPRLab; they share a lineage, but SALAD's repo extends `amaralibey/MixVPR` directly per README "Acknowledgements"). No documented aerial-nadir benchmark in the repo's reported tables.
- Related Sub-question: SQ3+SQ4 / C2: SALAD per-mode API capability verification (`context7` did not index `serizba/salad` directly; per Per-Mode API Capability Verification rule item 2, fall-back to official-docs WebFetch on the canonical repo README + LICENSE was used)
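
The three checkpoint sizes follow from one formula, clusters × per-cluster dimension + global-token dimension. The sketch below checks that arithmetic; the function and dictionary names are hypothetical, and the slim cluster count of 16 is an assumption derived from 512 / 32 rather than read from the repository.

```python
def salad_descriptor_dim(num_clusters, cluster_dim, global_dim):
    """Length of the concatenated [cluster part | global token] descriptor."""
    return num_clusters * cluster_dim + global_dim

# (num_clusters m, per-cluster dim l, global-token dim) per checkpoint name.
CHECKPOINT_SHAPES = {
    "dino_salad": (64, 128, 256),
    "dino_salad_2048_64": (32, 64, 64),
    "dino_salad_512_32": (16, 32, 32),  # m inferred from 512 / 32 (assumption)
}

descriptor_dims = {
    name: salad_descriptor_dim(*shape) for name, shape in CHECKPOINT_SHAPES.items()
}
```

The resulting lengths are 8448, 2112, and 544, matching the 8192+256, 2048+64, and 512+32 splits listed for the checkpoints.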
Source #60
- Title: SALAD canonical paper — "Optimal Transport Aggregation for Visual Place Recognition" (Izquierdo & Civera, CVPR 2024, arXiv:2311.15937 v2)
- Link: arXiv https://arxiv.org/abs/2311.15937 (CVPR 2024 published version), accessed 2026-05-08
- Tier: L1 (peer-reviewed CVPR 2024 + canonical implementation cross-referenced)
- Publication Date: arXiv preprint 2023-11-27; CVPR 2024 acceptance June 2024
- Timeliness Status: ⚠️ Borderline — like MixVPR, the canonical paper (Nov 2023 / CVPR 2024) is at the edge of the Critical-novelty 18-month window for SQ3+SQ4 component selection, but the algorithm is mature, the canonical implementation is actively maintained, and the algorithmic content is stable; the freshness concern is on weights and aerial-domain retrains
- Version Info: arXiv v2 (CVPR camera-ready)
- Target Audience: System architects + C2 implementer + Step-7.5 reviewer
- Research Boundary Match: Full match for the algorithm (single-stage VPR with optimal-transport-based aggregation on a fine-tuned DINOv2 backbone). Partial match for the project's domain (paper benchmarks MSLS Challenge / MSLS Val, Pittsburgh250k-test, SPED, NordLand, SF-XL — all ground-level urban / street-view; NO aerial nadir benchmark in the canonical paper, same caveat as MixVPR).
- Summary: The canonical paper introduces SALAD = Sinkhorn Algorithm for Locally Aggregated Descriptors. It reformulates NetVLAD's soft-assignment as an optimal-transport problem, considering both feature-to-cluster AND cluster-to-feature relations, with a learned `dustbin` cluster that discards uninformative features (sky/road/dynamic objects), combined with a fine-tuned DINOv2 backbone. Pinned canonical configuration (paper §4.1): DINOv2-B (768-dim tokens, 86M params), fine-tune the last 4 transformer blocks (Table 6: training 2 or 4 blocks both report best results, 4 marginally better at 92.2 vs 92.0 MSLS R@1). Score-projection MLP `W_s1, W_s2` with hidden dim 512 and ReLU. Dimensionality reduction `W_f1, W_f2` from d=768 to l=128. Global-token MLP `W_g1, W_g2` from d=768 to 256. m=64 clusters, yielding final descriptor `m × l + global = 64 × 128 + 256 = 8192 + 256 = 8448-D`. Slim variants: `m=16, l=32` → 512+32 = 544-D; `m=32, l=64` → 2048+64 = 2112-D. Sinkhorn algorithm for optimal-transport assignment. Final L2 intra-norm + global L2-norm. Training: GSV-Cities, batch size 60 places × 4 images, multi-similarity loss, AdamW, lr=6e-5, 4 epochs, 30 min on RTX 3090. Training input size: 224×224; evaluation input size: 322×322 ("model is agnostic to input size as long as divisible by 14"). Reported latency (paper Table 1, 2 footnote): 2.33–2.41 ms per image on RTX 3090 at 322×322 batch=1 across all three descriptor-size variants, which confirms aggregator overhead over the bare DINOv2-B backbone (2.41 ms) is negligible. Reported Recall@1 (paper Table 1, full 8448-D variant): MSLS Challenge 75.0, MSLS Val 92.2, NordLand 76.0, Pitts250k-test 95.1, SPED 92.1. Reported Recall@1 (slim 544-D variant): MSLS Challenge 70.8, MSLS Val 89.3, NordLand 61.2, Pitts250k-test 93.0, SPED 88.5. Memory footprint (Table 2 footnote, MSLS Val ~18,000 images): 0.0 GB local features (single-stage, no per-patch features cached); by direct arithmetic the global descriptors occupy about 0.6 GB at 8448-D fp32 (18,000 × 8448 × 4 bytes ≈ 608 MB) and about 0.3 GB at fp16. Authors' explicit limitation (§5): "the adoption of DINOv2 as our backbone results in slower processing speeds compared to ResNet-based methods". This is material to the project's Jetson Orin Nano Super deployment, since DINOv2-B has 86M params vs MixVPR-ResNet50's 25M (~3.4× more params); ViT export to TensorRT/INT8 is also harder than ResNet export (C7 deferred concern). License (canonical implementation): GPL-3.0 (per Source #59).
- Related Sub-question: SQ3+SQ4 / C2: SALAD per-mode API capability verification (cross-source verification of the canonical implementation's mode-enumeration, parameter-count, and performance claims; aerial-domain caveat documented)
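
As a quick sanity check on descriptor-cache sizing, the sketch below multiplies image count, descriptor length, and bytes per value. The 18,000-image count and the descriptor lengths come from this entry; the function name `descriptor_cache_bytes` is hypothetical, and the result ignores any index overhead.

```python
def descriptor_cache_bytes(num_images, descriptor_dim, bytes_per_value):
    """Raw storage for a dense descriptor matrix, without index overhead."""
    return num_images * descriptor_dim * bytes_per_value

NUM_IMAGES = 18_000  # approximate MSLS Val database size quoted in this entry

full_fp32 = descriptor_cache_bytes(NUM_IMAGES, 8448, 4)  # full SALAD descriptor
full_fp16 = descriptor_cache_bytes(NUM_IMAGES, 8448, 2)
slim_fp32 = descriptor_cache_bytes(NUM_IMAGES, 544, 4)   # slim SALAD descriptor

for label, size in [("8448-D fp32", full_fp32),
                    ("8448-D fp16", full_fp16),
                    ("544-D fp32", slim_fp32)]:
    print(f"{label}: {size / 1e6:.1f} MB")
```

Under these inputs the full descriptor needs roughly 608 MB in fp32 and 304 MB in fp16, while the slim 544-D variant needs about 39 MB, so any smaller figure recorded for the full variant is worth re-checking against the paper's table.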
Source #61
- Title: OpenVPRLab DinoV2 backbone — context7 cross-source for DINOv2 ViT-B/14 spatial-feature backbone API at 322×322 input (NOT a SALAD aggregator source — OpenVPRLab does not ship a SALAD aggregator class, only MixVPR / GeMPool / BoQ are documented in the snippets)
- Link: https://context7.com/amaralibey/openvprlab/llms.txt (accessed 2026-05-08, snippet `Initialize and Use DinoV2 Backbone`); canonical README https://github.com/amaralibey/openvprlab
- Tier: L1 (canonical OpenVPRLab framework by Amar Ali-bey, packaged as a multi-aggregator VPR research lab; `context7` indexed)
- Publication Date: README live; OpenVPRLab main HEAD within Critical-novelty window (per Fact #36)
- Timeliness Status: ✅ Fully within Critical-novelty window
- Version Info: OpenVPRLab main HEAD; supported DinoV2 backbones: `dinov2_vits14, dinov2_vitb14, dinov2_vitl14, dinov2_vitg14`; `DinoV2(backbone_name, num_unfrozen_blocks, return_cls_token)` constructor; default `num_unfrozen_blocks=2` (paper canonical SALAD config uses `4` per Table 6); `return_cls_token=False` returns spatial features only (SALAD pipeline needs `True` because it uses both spatial features + cls/global token)
- Target Audience: System architects + C2 implementer + Step-7.5 reviewer
- Research Boundary Match: Partial match. Confirms the DINOv2 backbone API (input must be divisible by 14, 322×322 → `[B, 768, 23, 23]` spatial features for ViT-B), but the SALAD aggregator itself is NOT in OpenVPRLab. Pipeline composition for SALAD must use the `serizba/salad` repo's aggregator class on top of either `serizba/salad`'s own DINOv2 wrapper or OpenVPRLab's `DinoV2` class with `return_cls_token=True`.
- Summary: Documentary cross-source confirmation that DINOv2 ViT-B is a first-class supported backbone in the broader VPR research-pipeline ecosystem, with the same input-divisibility-by-14 constraint and 322×322 → 23×23 spatial-token layout the canonical SALAD paper uses. Critical finding for the SALAD verification gate: OpenVPRLab's documented aggregator catalog (per `context7` snippet inventory) is `MixVPR`, `GeMPool`, `BoQ`; no `SALAD` class is present. This means SALAD cannot be assembled from OpenVPRLab alone; the project must depend on the canonical `serizba/salad` repo (Source #59), which is GPL-3.0. Confirms that the per-Mode API rule's "two modes of one library are two distinct candidates" semantics applies here too: DINOv2-B + MixVPR (in OpenVPRLab) and DINOv2-B + SALAD (in `serizba/salad`) are different candidates, with different code-of-record and different licenses.
- Related Sub-question: SQ3+SQ4 / C2: SALAD per-mode API capability verification (cross-source confirmation of DINOv2 ViT-B backbone API; cross-source disconfirmation of OpenVPRLab as a SALAD source). Cross-cutting reuse: Source #61 also confirms DINOv2 ViT-L/14 is a first-class supported backbone in OpenVPRLab, relied on by Source #62 + #63 (SelaVPR) for backbone-API documentary cross-source.
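
The divisibility-by-14 constraint and the resulting token layout can be checked with a few lines of arithmetic. This is a sketch of the bookkeeping only; the helper names `vit_patch_grid` and `largest_valid_size` are hypothetical and not part of either repository.

```python
def vit_patch_grid(height, width, patch=14):
    """Return (rows, cols, spatial_tokens) for a ViT with square patches."""
    if height % patch or width % patch:
        raise ValueError(f"{height}x{width} is not divisible by patch size {patch}")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols

def largest_valid_size(size, patch=14):
    """Largest side length <= size that is a multiple of the patch size."""
    return (size // patch) * patch

grid_322 = vit_patch_grid(322, 322)   # SALAD evaluation input
grid_224 = vit_patch_grid(224, 224)   # SelaVPR input
fallback = largest_valid_size(320)    # 320 itself is not a multiple of 14
```

A 322×322 input yields a 23×23 grid (529 spatial tokens) and 224×224 yields 16×16 (256 tokens); a 320×320 frame, as used for the ResNet50 mode, would have to be resized, for example down to 308×308, before entering a ViT/14 backbone.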
Source #62
- Title: SelaVPR canonical implementation: `Lu-Feng/SelaVPR` (Lu, Zhang, Lan, Dong, Wang, Yuan; ICLR 2024); official PyTorch reference implementation, training/eval CLIs, two pretrained checkpoints (MSLS-finetuned for diverse scenes; Pitts30k-further-finetuned for urban scenes), DINOv2+registers variant, two-stage retrieval+re-ranking pipeline
- Link: README https://github.com/Lu-Feng/SelaVPR (raw https://raw.githubusercontent.com/Lu-Feng/SelaVPR/main/README.md, accessed 2026-05-08); LICENSE https://github.com/Lu-Feng/SelaVPR/blob/main/LICENSE (raw https://raw.githubusercontent.com/Lu-Feng/SelaVPR/main/LICENSE, accessed 2026-05-08); pretrained DINOv2 backbone weights https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_pretrain.pth ; SelaVPR++ successor repo https://github.com/Lu-Feng/SelaVPRplusplus (Feb 2026, separate)
- Tier: L1 (project-official codebase by the canonical SelaVPR authors Feng Lu et al., Tsinghua Shenzhen + Peng Cheng Laboratory + UCAS)
- Publication Date: README live; main HEAD within Critical-novelty window (Feb 2026 announcement of SelaVPR++ successor confirms continued active maintenance of the lineage); canonical paper ICLR 2024
- Timeliness Status: ✅ Fully within Critical-novelty window
- Version Info: main HEAD; PyTorch (no specific version pinned in README); requires DINOv2 ViT-L/14 pretrained backbone weights download (`dinov2_vitl14_pretrain.pth`, ~1.1 GB from FB AI Public Files); CLI surface `python3 train.py --datasets_folder=... --dataset_name={msls,pitts30k} --foundation_model_path=...` for training + `python3 eval.py --datasets_folder=... --dataset_name={msls,pitts30k,nordland,...} --resume=... --rerank_num={20,100}` for evaluation; pretrained models distributed via Google Drive links inside README HTML tables (MSLS-finetuned: MSLS-val R@1=90.8 / Nordland-test R@1=85.2 / St. Lucia R@1=99.8; Pitts30k-further-finetuned: Tokyo24/7 R@1=94.0 / Pitts30k R@1=92.8 / Pitts250k R@1=95.7); optional `--registers` flag adds DINOv2+4-register variant (separate finetuned checkpoint also linked); 262 stars; MIT License
- Target Audience: System architects + C2 implementer + Step-7.5 reviewer
- Research Boundary Match: Full match for the project's pinned C2 mode (DINOv2 ViT-L/14 backbone with frozen weights + lightweight serial+parallel adapters in each transformer block + GeM-pooled 1024-D global descriptor + LocalAdapt up-conv module producing 61×61×128-D dense local features at 224×224 input). Two-stage retrieval+re-ranking is structurally distinct from MixVPR's and SALAD's single-stage retrieval: the global descriptor is used for top-K candidate retrieval, then candidates are re-ranked via mutual-nearest-neighbor cross-matching of dense local features, with `|M|` (count of mutual matches) as the re-rank score. Re-rank pool size is a runtime parameter: `rerank_num=100` reproduces paper accuracy; `rerank_num=20` cuts re-ranking runtime to 1/5 (0.018 s/query on RTX 3090) at near-identical accuracy. Partial match for the project's domain (canonical training is MSLS + Pitts30k street-view; canonical evaluation is Tokyo24/7 / MSLS-val / MSLS-challenge / Pitts30k-test / Pitts250k / Nordland-test / St. Lucia, all ground-level / street-level / urban / cross-season / cross-illumination; no aerial nadir benchmark in the repo's reported tables, same caveat as MixVPR + SALAD).
- Summary: SelaVPR is the canonical reference implementation of the ICLR 2024 paper "Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition" by Feng Lu et al. Critical license finding: LICENSE file is MIT License (Copyright (c) 2024 Feng Lu), which is permissive; this places SelaVPR on the BSD/permissive license track alongside MixVPR (MIT), Kimera-VIO (BSD-2), OKVIS2 (BSD-3), DPVO (MIT), pure-VO baseline (OpenCV-Apache-2.0). SelaVPR is the first DINOv2-based C2 candidate on the BSD/permissive track (SALAD is GPL-3.0). Two pretrained checkpoints are documented inside the README: (a) MSLS-finetuned (for diverse scenes; recommended starting point for cross-domain transfer projects): MSLS-val R@1=90.8 / R@5=96.4 / R@10=97.2; Nordland-test R@1=85.2 / R@5=95.5 / R@10=98.5; St. Lucia R@1=99.8; (b) Pitts30k-further-finetuned (only for urban scenes; resumed from MSLS checkpoint): Tokyo24/7 R@1=94.0 / R@5=96.8 / R@10=97.5; Pitts30k R@1=92.8 / R@5=96.8 / R@10=97.7; Pitts250k R@1=95.7 / R@5=98.8 / R@10=99.2. Optional `--registers` flag enables DINOv2+4-register variant (per Darcet et al. 2024 ViT registers paper) with separately finetuned MSLS checkpoint, giving better local-matching performance per README §"Local Matching using DINOv2+Registers". The two-stage method has an "efficient RAM usage" mode (`--efficient_ram_testing` flag) that saves extracted local features to disk in `./output_local_features/` and loads only the currently-needed local features into RAM; this is useful when the descriptor cache exceeds available RAM (relevant to the project's 8 GB shared budget). Acknowledged dependencies: gmberton/deep-visual-geo-localization-benchmark (Visual Geo-localization Benchmark, dataset preparation pipeline) and facebookresearch/dinov2 (the frozen pre-trained backbone). Adapter-and-up-conv code is in `/backbone/dinov2/block.py` (adapter1 + adapter2 in each transformer block) and `network.py` (LocalAdapt up-conv module after the entire ViT).
- Related Sub-question: SQ3+SQ4 / C2: SelaVPR per-mode API capability verification (`context7` returned no match for `Lu-Feng/SelaVPR`; the only "lu-feng" hit was `liu-feng-deeplearning/coverhunter`, which is an unrelated music-similarity project; per Per-Mode API Capability Verification rule item 2, fall-back to official-docs WebFetch on the canonical repo README + LICENSE was used)
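
The re-ranking stage described above can be sketched in plain Python: count mutual nearest neighbours between query and candidate local features, then reorder only the first `rerank_num` candidates by that count. This is an illustrative sketch, not the SelaVPR code; the function names and the 2-D toy features are hypothetical, and real local features are 128-D over a 61×61 grid.

```python
import math

def _unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def mutual_match_count(query_feats, cand_feats):
    """Number of (query, candidate) feature pairs that are each other's best match."""
    q = [_unit(v) for v in query_feats]
    c = [_unit(v) for v in cand_feats]
    sim = [[sum(a * b for a, b in zip(qv, cv)) for cv in c] for qv in q]
    best_c_for_q = [max(range(len(c)), key=lambda j: sim[i][j]) for i in range(len(q))]
    best_q_for_c = [max(range(len(q)), key=lambda i: sim[i][j]) for j in range(len(c))]
    return sum(1 for i, j in enumerate(best_c_for_q) if best_q_for_c[j] == i)

def rerank(query_feats, candidates, rerank_num):
    """candidates: [(db_index, local_feats), ...] already sorted by global similarity."""
    head, tail = candidates[:rerank_num], candidates[rerank_num:]
    # sorted() is stable, so ties keep their global-retrieval order.
    head = sorted(head, key=lambda item: mutual_match_count(query_feats, item[1]),
                  reverse=True)
    return [idx for idx, _ in head] + [idx for idx, _ in tail]

query = [[1.0, 0.0], [0.0, 1.0]]
feats_7 = [[0.1, 1.0], [0.2, 1.0]]   # both features resemble only one query feature
feats_3 = [[1.0, 0.1], [0.1, 1.0]]   # one good match per query feature
feats_9 = [[1.0, 1.0], [1.0, 1.0]]
reranked = rerank(query, [(7, feats_7), (3, feats_3), (9, feats_9)], rerank_num=2)
```

Candidate 3 has two mutual matches and candidate 7 only one, so the order becomes 3, 7, 9; candidate 9 sits outside the re-rank pool and keeps its global-retrieval position. The cost grows with `rerank_num`, which is the trade-off between the 20-candidate and 100-candidate settings.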
Source #63
- Title: SelaVPR canonical paper — "Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition" (Lu, Zhang, Lan, Dong, Wang, Yuan — ICLR 2024, arXiv:2402.14505 v1)
- Link: arXiv https://arxiv.org/abs/2402.14505 (ICLR 2024 published version: https://openreview.net/pdf?id=TVg6hlfsKa), accessed 2026-05-08; HTML render https://arxiv.org/html/2402.14505v1
- Tier: L1 (peer-reviewed ICLR 2024 + canonical implementation cross-referenced)
- Publication Date: arXiv preprint 2024-02-22; ICLR 2024 acceptance
- Timeliness Status: ⚠️ Borderline — canonical paper (Feb 2024 / ICLR 2024) is at the edge of the Critical-novelty 18-month window for SQ3+SQ4 component selection, but algorithm is mature, canonical implementation is actively maintained (SelaVPR++ released Feb 2026 confirms lineage activity), algorithmic content is stable; freshness concern is on weights and aerial-domain retrains
- Version Info: arXiv v1 (ICLR camera-ready)
- Target Audience: System architects + C2 implementer + Step-7.5 reviewer
- Research Boundary Match: Full match for the algorithm (two-stage VPR with adapter-based parameter-efficient transfer learning on a frozen DINOv2 ViT-L/14 backbone + GeM-pooled global descriptor + dense local features for re-ranking via mutual nearest neighbor cross-matching, no RANSAC needed). Partial match for the project's domain (paper benchmarks Tokyo24/7 / MSLS-val / MSLS-challenge / Pitts30k-test — all ground-level urban / street-view; appendix benchmarks Nordland / St. Lucia — also non-aerial; NO aerial nadir benchmark in the canonical paper, same caveat as MixVPR + SALAD).
- Summary: The canonical paper introduces SelaVPR = Seamless adaptation of pre-trained foundation models for VPR via hybrid global-local adaptation. Architecture (paper §3): (a) Backbone: DINOv2 ViT-L/14 (frozen, only adapters trained; preserves the pre-trained model's transferability and avoids catastrophic forgetting); (b) Global adaptation (paper §3.2 + Fig 2): two adapters per transformer block. Adapter1 is a serial adapter after the MHA layer with internal skip-connection; Adapter2 is a parallel adapter alongside the MLP layer, with its output multiplied by scaling factor `s=0.2`; each adapter is a bottleneck (down-project FC → ReLU → up-project FC) with bottleneck ratio 0.5; the output of the last transformer block goes through an LN layer; the class token is discarded and patch tokens are reshaped as a feature map `fm`; the global feature is `f^g = L2(GeM(fm))`; (c) Local adaptation (paper §3.3 + Fig 3): two up-convolutional layers (3×3 kernel, stride=2, padding=1) with a ReLU layer between them upsample the 16×16×1024 ViT feature map to 61×61×128 dense local features; intra-channel L2 normalization; (d) Local matching for re-ranking (paper §3.4): mutual nearest neighbor cross-matching between query and each candidate; cosine similarity is used (local features are L2-normalized); the count `|M|` of mutual matches is the re-rank score (no RANSAC, no spatial verification); (e) Loss (paper §3.5): triplet loss `L_g` on global features + mutual nearest neighbor local feature loss `L_l` (Eq. 10) that maximizes match similarity for positive pairs and minimizes it for negative pairs; combined as `L = L_g + λ·L_l` with λ=1 and margin m=0.1. Implementation details (paper §4.2): input size 224×224 (NOT 322×322 like SALAD; 224/14 = 16, so 16×16=256 patch tokens + 1 cls token); global descriptor 1024-D (much smaller than MixVPR's 2048-D and SALAD's 8448-D full); local features 61×61×128 = 476,288 floats per image for re-ranking; Adam optimizer, lr=1e-5, batch=4; trained on MSLS first, then further-finetuned on Pitts30k; re-rank top-100 candidates by default. Reported performance (paper Table 2): MSLS-challenge R@1=73.5 (vs SALAD's 75.0, vs MixVPR's 64.0; vs prior SOTA R²Former 73.0); Tokyo24/7 R@1=94.0 (best, +5.4 absolute over prior SOTA R²Former 88.6); MSLS-val R@1=90.8 (best); Pitts30k R@1=92.8 (best). Reported runtime (paper Table 3, RTX 3090, Pitts30k-test): feature extraction 0.027 s/query (slower than TransVPR's 0.008 because of the ViT-L backbone, but still fast); matching 0.085 s/query at rerank_num=100; total 0.112 s/query, which is less than 4% of TransVPR's total (3.018 s/query) and ~1% of Patch-NetVLAD-p (11.144 s/query). Critical observation: SelaVPR's "fast two-stage" claim is relative to other two-stage methods; at 27 ms extraction it is still slower than single-stage MixVPR (1.21 ms on A100, per Source #58) and SALAD (~2.41 ms on RTX 3090) at the global-retrieval stage, AND adds an 18 ms (rerank_num=20) to 85 ms (rerank_num=100) re-ranking cost per query that single-stage methods don't pay. Authors' explicit observation (paper §4.3): "TransVPR is fast at extracting features, while SelaVPR is slower (but faster than other methods) due to the use of the ViT/L backbone". This is material to the project's Jetson Orin Nano Super deployment, since DINOv2-L has 300M params vs SALAD-DINOv2-B's 86M (~3.5× more params). License (canonical implementation): MIT (per Source #62).
- Related Sub-question: SQ3+SQ4 / C2: SelaVPR per-mode API capability verification (cross-source verification of the canonical implementation's mode/parameter/runtime claims; aerial-domain caveat documented; ViT-L vs ViT-B backbone-size trade-off documented for Jetson MVE planning)
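
The 16×16 → 61×61 upsampling and the per-image local-feature size can be reproduced with the standard transposed-convolution size formula. The sketch assumes the two up-convolutional layers behave like transposed convolutions with dilation 1 and no output padding, which is an assumption on top of the kernel, stride, and padding values quoted in this entry; the function name is hypothetical.

```python
def conv_transpose_out(size, kernel=3, stride=2, padding=1, output_padding=0):
    """Output side length of a transposed convolution (dilation assumed to be 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

patch_side = 224 // 14                       # 16 patch tokens per side at 224x224
after_first = conv_transpose_out(patch_side)
after_second = conv_transpose_out(after_first)

LOCAL_CHANNELS = 128
floats_per_image = after_second * after_second * LOCAL_CHANNELS
bytes_per_image_fp32 = floats_per_image * 4
bytes_top100_fp32 = bytes_per_image_fp32 * 100   # local features for a 100-candidate pool
```

Under these assumptions the side length goes 16 → 31 → 61, giving 476,288 floats (about 1.9 MB in fp32) per image and roughly 190 MB for the local features of a 100-candidate re-rank pool, which is the kind of figure that matters against a shared 8 GB memory budget.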
Source #64
- Title: NetVLAD canonical implementation: `Relja/netvlad` v1.03 (Arandjelović et al., CVPR 2016 / TPAMI 2018); official MATLAB / MatConvNet reference implementation, off-the-shelf and trained networks (VGG-16 + NetVLAD + whitening pretrained on Pittsburgh 30k and Tokyo Time Machine), training (`trainWeakly`) + testing (`testFromFn`) + PCA-whitening (`addPCA`) + cluster-init (`addLayers`) CLIs, multiple aggregation methods (`vlad_preL2_intra` default, `vlad_preL2`, `vladv2_preL2_intra`, `max`, `avg`)
- Link: README https://raw.githubusercontent.com/Relja/netvlad/master/README.md (accessed 2026-05-08); README_more https://raw.githubusercontent.com/Relja/netvlad/master/README_more.md (accessed 2026-05-08); project page https://www.di.ens.fr/willow/research/netvlad/ ; trained models https://www.di.ens.fr/willow/research/netvlad/data/models/vd16_pitts30k_conv5_3_vlad_preL2_intra_white.mat (529 MB best model) + https://www.di.ens.fr/willow/research/netvlad/data/netvlad_v103_allmodels.tar.gz (3 GB all models); context7 indexed at `/relja/netvlad`; canonical paper arXiv:1511.07247 (CVPR 2016)
- Tier: L1 (project-official MATLAB codebase by canonical NetVLAD authors Relja Arandjelović + Petr Gronat + Akihiko Torii + Tomas Pajdla + Josef Sivic; INRIA WILLOW + ENS + Tokyo Tech + CTU Prague)
- Publication Date: README v1.03 dated 2016-03-04; canonical paper CVPR 2016 (arXiv 2015-11-23 v1, with 2016 updates); algorithm has been continuously cited as the canonical baseline since 2016 across MixVPR (Source #57+58), SALAD (Source #59+60), SelaVPR (Source #62+63), AnyLoc, BoQ, and every subsequent VPR paper
- Timeliness Status: ⚠️ Borderline. The implementation is from 2016 (10 years old) and uses the deprecated MATLAB + MatConvNet stack; HOWEVER, the algorithm is canonical and stable, the canonical paper has been continuously cited as the baseline reference, and modern PyTorch ports (Source #65) reproduce its numbers with high fidelity. The freshness concern is on the runtime stack (MATLAB + MatConvNet → PyTorch port required for the project) and aerial-domain transfer (no aerial benchmark in the canonical paper). Per the engine's "Established baseline" rule (`KLT, RANSAC, EKF, ORB, SIFT, GTSAM`-class), NetVLAD as the mandatory simple-VLAD baseline for the C2 row is exempt from the strict 18-month Critical-novelty window; its role is exactly to be the long-established reference point against which the modern (MixVPR / SALAD / SelaVPR / AnyLoc / BoQ / DINOv2-VLAD) candidates are scored
- Version Info: v1.03 (04 Mar 2016), the last canonical version; depends on `relja_matlab` v1.02+, MatConvNet v1.0-beta18+, optional Yael_matlab v438. Critical license finding: README states "NetVLAD is distributed under the MIT License (see the `LICENCE` file)." This is MIT (BSD/permissive license track, same as MixVPR + SelaVPR; distinct from SALAD's GPL-3.0). Supported network IDs in `loadNet()`: `vd16` (VGG-16, 138M params, conv5_3 last conv layer = 512-D feature map), `vd19` (VGG-19), `caffe` (AlexNet, last conv = 256-D feature map), `places` (Places-CNN). Aggregation methods in `addLayers()`: `vlad_preL2_intra` (default: input L2-norm + intra-channel L2-norm of the K×D NetVLAD matrix + final flatten + L2-norm), `vlad_preL2` (no intra-norm), `vladv2_preL2_intra` (full NetVLAD with trainable biases per Eq. 4 of paper), `max` (global max pool), `avg` (global avg pool). Default cluster count K=64. Output dimensionality = K × D (e.g., VGG-16 conv5_3: K=64 × D=512 = 32768-D pre-PCA NetVLAD matrix); recommended PCA + whitening reduces to 4096-D (canonical paper recommendation; 256-D / 512-D variants supported via `cropToDim` after L2-renormalisation, only valid for `+whitening` networks). 599 stars
- Target Audience: System architects + C2 implementer + Step-7.5 reviewer + simple-baseline reference-point owner
- Research Boundary Match: Full match for the project's pinned C2 mode (VGG-16 + NetVLAD + whitening pretrained on Pittsburgh 30k / Tokyo Time Machine, NetVLAD aggregation `vlad_preL2_intra` with K=64 cluster centres, output 4096-D L2-normalised global descriptor after PCA-whitening). The repo ships everything needed for inference (`computeRepresentation`, `serialAllFeats`, `testFromFn`) and training (`trainWeakly`). Partial match for the project's domain (canonical training sets are Pittsburgh 30k + Tokyo Time Machine, both street-level / urban; canonical evaluation sets are Pittsburgh 30k / Pittsburgh 250k / Tokyo 24/7, all ground-level / street-view; NO aerial nadir benchmark in the canonical paper, same caveat as MixVPR + SALAD + SelaVPR). Critical implementation match: NetVLAD is the only canonical C2 baseline whose pretrained weights cover both Pittsburgh and Tokyo Time Machine domains as separate checkpoints (`vd16_pitts30k_conv5_3_vlad_preL2_intra_white` and `vd16_tokyoTM_conv5_3_vlad_preL2_intra_white`), which is useful for cross-domain ablation if the project ever needs to compare canonical-domain transferability
- Summary: NetVLAD is the canonical learned-VLAD reference baseline for the entire VPR field, introduced by Arandjelović et al. (CVPR 2016) and continuously cited by every subsequent VPR work (including MixVPR, SALAD, SelaVPR, AnyLoc, BoQ, all of which compare against NetVLAD as a baseline in their respective papers). The official `Relja/netvlad` MATLAB implementation (v1.03, 2016) defines the algorithm: (a) crop a CNN backbone (VGG-16 / VGG-19 / AlexNet / Places-CNN) at the last convolutional layer to obtain an `H×W×D` dense descriptor map; (b) apply a NetVLAD pooling layer that learns K cluster centres `c_k`, K cluster-assignment weights `w_k`, K biases `b_k`, computes soft-assignment via softmax over `w_k^T x_i + b_k`, and aggregates first-order residuals `(x_i - c_k)` weighted by soft-assignment into a K×D matrix; (c) intra-channel L2-normalise the K×D matrix (per the `vlad_preL2_intra` method), flatten to a (K·D)-dimensional vector, final L2-normalise; (d) optionally apply PCA + whitening (canonical recommendation: reduce to 4096-D) for both retrieval-quality improvement AND memory-footprint reduction. Canonical training: weakly supervised triplet ranking loss with hard negative mining on Google Street View Time Machine (Pittsburgh + Tokyo), using only image-level GPS as supervision, no landmark annotations. License: MIT, which places NetVLAD on the BSD/permissive track alongside MixVPR (MIT) + SelaVPR (MIT) + Kimera-VIO (BSD-2) + OKVIS2 (BSD-3) + DPVO (MIT) + pure-VO baseline (OpenCV-Apache-2.0). Critical limitations: (a) MATLAB + MatConvNet stack is deprecated and not deployable on Jetson Orin Nano Super (PyTorch port required, see Source #65); (b) algorithm pre-dates DINOv2 / ViT family by 6+ years; its VGG-16 backbone (138M parameters) is roughly 5× larger than ResNet50 (~25.6M parameters) and about 4× slower (per various 2020s benchmarks) at comparable accuracy; (c) canonical 4096-D descriptor (after PCA-whitening) is 2× larger than MixVPR's 2048-D and 4× larger than SelaVPR's 1024-D, increasing the descriptor-cache footprint AND retrieval-time cosine-similarity cost; (d) no documented Jetson measurement; (e) canonical R@1 on Pitts30k-test is 84.1 (paper Table 1), substantially below MixVPR's ~91.5 and SelaVPR's 92.8 on Pitts30k (SALAD's 95.1 is a Pitts250k figure, per the comparison note under Source #68); on Tokyo24/7: 73.3 (paper) vs SelaVPR's 94.0 = 20.7 absolute gap; vs MixVPR's 85.1 = 11.8 absolute gap.
NetVLAD is kept as the mandatory simple-VLAD baseline per the engine's Component Option Breadth rule ("at least one simple baseline...per component area when possible; these prevent false confidence in the selected option"); its role is to be the long-established reference point, NOT a competitive lead. Any selected modern C2 candidate (MixVPR / SALAD / SelaVPR / EigenPlaces / AnyLoc / BoQ) must show measurable Recall@K advantage over NetVLAD on the project's evaluation conditions to justify its added complexity. Acknowledged dependencies: MatConvNet, relja_matlab, Yael_matlab (optional, GPU acceleration). Modern lineage: Patch-NetVLAD (2021, CVPR; extends NetVLAD with patch-level features for re-ranking, GPL-3.0 license per their canonical repo); Generalized-Mean-Pool (GeM, 2018) is a simpler alternative often used together with NetVLAD-style aggregation; AnyLoc (2024) wraps NetVLAD aggregation around DINOv2 ViT-G features
- Related Sub-question: SQ3+SQ4 / C2: NetVLAD per-mode API capability verification (Mandatory `context7` lookup PASS: `/relja/netvlad` indexed with 90 code snippets and Medium source reputation, benchmark score 80.5; cross-validated against canonical project page WebFetch + canonical paper WebFetch)
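The dimensionality and cache-footprint figures in this entry reduce to simple arithmetic. A minimal sketch follows, assuming fp32 storage (4 bytes per value) and a hypothetical 10,000-image reference database; the helper names `netvlad_raw_dim` and `cache_bytes` are illustrative and not part of any cited repo.

```python
def netvlad_raw_dim(num_clusters: int = 64, feat_dim: int = 512) -> int:
    """Pre-PCA NetVLAD size: K cluster centres times D feature channels."""
    return num_clusters * feat_dim


def cache_bytes(num_images: int, dim: int, bytes_per_value: int = 4) -> int:
    """Descriptor-cache size; fp32 storage is an assumption, not a project decision."""
    return num_images * dim * bytes_per_value


raw_dim = netvlad_raw_dim()  # VGG-16 conv5_3: 64 x 512
num_images = 10_000          # hypothetical database size, for illustration only

footprints = {
    "NetVLAD pre-PCA": cache_bytes(num_images, raw_dim),
    "NetVLAD PCA-whitened (4096-D)": cache_bytes(num_images, 4096),
    "MixVPR (2048-D)": cache_bytes(num_images, 2048),
    "SelaVPR (1024-D)": cache_bytes(num_images, 1024),
}
for name, size in footprints.items():
    print(f"{name}: {size / 2**20:.2f} MiB")
```

Under these assumptions the 4096-D cache is twice the 2048-D one and four times the 1024-D one, which is the ratio the limitation list above relies on.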
Source #65
- Title: NetVLAD modern PyTorch reproduction: `Nanne/pytorch-NetVlad` (Pytorch implementation of NetVlad with verified Pittsburgh 30k Recall@K reproduction); modern PyTorch + Faiss runtime path (replaces canonical MATLAB + MatConvNet stack); supports VGG-16 and AlexNet backbones with K=64 NetVLAD pooling
- Link: README https://raw.githubusercontent.com/Nanne/pytorch-NetVlad/master/README.md (accessed 2026-05-08); repo https://github.com/Nanne/pytorch-NetVlad ; pretrained checkpoint mirror https://drive.google.com/open?id=17luTjZFCX639guSVy00OUtzfTQo4AMF2 (VGG-16 + NetVLAD reproduction)
- Tier: L2 (third-party PyTorch port; canonical algorithm is owned by `Relja/netvlad` per Source #64; this port is the most-cited PyTorch reproduction in the VPR research community as of 2026, with verified Recall@K against the canonical paper)
- Publication Date: Initial commit ~2018; main HEAD active through 2024
- Timeliness Status: ⚠️ Borderline — last meaningful update appears to be ~2022 (via repo stars/issues age signals); however the algorithm is canonical and the reproduction has been continuously verified against the canonical paper's Table 1 numbers; per Established-baseline exemption, NetVLAD's reference role does not require fresh updates
- Version Info: main HEAD; PyTorch v0.4.0+ minimum; depends on Faiss + scipy + numpy + sklearn + h5py + tensorboardX; CLI `python main.py --mode={train,test,cluster} --arch={vgg16,alexnet} --pooling=netvlad --num_clusters=64`; pretrained VGG-16 checkpoint distributed via Google Drive; license not explicitly stated in README (the README does NOT cite a LICENSE file); verification of licensing terms is required if the project actually adopts this PyTorch port (Plan-phase clarification gate, deferred to Plan if NetVLAD is elevated beyond mandatory-baseline role); canonical alternative is to re-port from the MIT-licensed `Relja/netvlad` MATLAB repo (Source #64) directly, preserving MIT licensing on the project's NetVLAD path; 490 stars
- Target Audience: System architects + C2 implementer + simple-baseline reference-point owner
- Research Boundary Match: Full match for the runtime stack the project will use (PyTorch + Faiss vs canonical MATLAB + MatConvNet); partial match for the canonical NetVLAD-paper reproducibility: reported VGG-16 + NetVLAD R@1 on Pitts30k-test = 85.2 (vs canonical paper's 84.1, i.e. 1.1 absolute higher, a positive reproduction signal); AlexNet R@1 = 68.6 (paper does not report AlexNet on Pitts30k as the primary number; appendix only). Partial match for the project's domain (same as Source #64: Pittsburgh 30k street-level training + Tokyo 24/7 test, no aerial nadir)
- Summary: `Nanne/pytorch-NetVlad` is the canonical PyTorch reproduction of NetVLAD that the modern VPR research community uses. README explicitly reports verified Recall@K vs the canonical paper's Table 1: VGG-16 + NetVLAD reproduction R@1=85.2 (paper: 84.1), R@5=94.8 (paper: 94.6), R@10=97.0 (paper: 95.5). The reproduction is therefore +1.1 (R@1), +0.2 (R@5) and +1.5 (R@10) absolute above the original numbers, demonstrating that the PyTorch port preserves the canonical algorithm's accuracy. Three CLI modes: `--mode=train` (full training pipeline with cluster-init prerequisite), `--mode=test` (inference + Recall@K evaluation), `--mode=cluster` (NetVLAD layer initialization via K-means clustering on training features, a prerequisite for `--mode=train`). Default flags: `--arch=vgg16 --pooling=netvlad --num_clusters=64`. Model state distributed via Google Drive. Critical license caveat: README does not cite a LICENSE file; verification of licensing terms is a Plan-phase blocker if the project adopts this port. Mitigation: re-port the canonical `Relja/netvlad` MATLAB repo (Source #64, MIT) to PyTorch directly, which preserves MIT licensing on the project's NetVLAD path; the canonical algorithm is well-documented in the paper and in the `Relja/netvlad` README/README_more (Source #64), so re-implementation effort is moderate (~1 week of engineering + cluster-init + retraining or weight transfer). Alternatively, OpenVPRLab (Source #57) ships a NetVLAD aggregator option that can be combined with ResNet50/DINOv2 backbones, but that is a different mode per the Per-Mode API rule (different backbone, different pretrained checkpoint provenance, possibly different aggregation parameter defaults), and the project should treat OpenVPRLab-NetVLAD-on-ResNet50 as a separately-cataloged sibling mode if it pursues that path
- Related Sub-question: SQ3+SQ4 / C2: NetVLAD per-mode API capability verification (cross-source PyTorch-stack reproduction validation; canonical paper Table 1 numbers reproduced to within +0.2 to +1.5 absolute Recall@K)
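The Recall@K figures quoted for this port follow the usual VPR convention: a query counts as correct at K if any of its top-K retrieved database images is a true positive. A minimal sketch of that metric, in pure Python, is given below. The function name `recall_at_k` and the toy rankings are hypothetical and are not taken from the port's code.

```python
def recall_at_k(ranked_ids, positives, ks=(1, 5, 10)):
    """Percentage of queries with at least one true positive in the top K.

    ranked_ids[q] -- database ids for query q, most similar first
    positives[q]  -- set of database ids that count as correct for query q
    """
    hits = {k: 0 for k in ks}
    for ranking, pos in zip(ranked_ids, positives):
        for k in ks:
            if any(db_id in pos for db_id in ranking[:k]):
                hits[k] += 1
    num_queries = len(ranked_ids)
    return {k: 100.0 * hits[k] / num_queries for k in ks}


# Toy data (illustrative only): four queries, three retrieved ids each.
toy_rankings = [[3, 1, 2], [0, 2, 1], [5, 4, 3], [9, 8, 7]]
toy_positives = [{3}, {1}, {3}, {0}]
toy_recall = recall_at_k(toy_rankings, toy_positives, ks=(1, 2, 3))
print(toy_recall)
```

In a real evaluation the rankings would come from a nearest-neighbour search (for example Faiss) over the database descriptors, and the positives from a GPS distance threshold.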
Source #66
- Title: NetVLAD canonical paper — "NetVLAD: CNN architecture for weakly supervised place recognition" (Arandjelović, Gronat, Torii, Pajdla, Sivic — CVPR 2016, arXiv:1511.07247 v1; extended TPAMI 2018 version arXiv:1511.07247 v3)
- Link: arXiv https://arxiv.org/abs/1511.07247 (CVPR 2016 published version), accessed 2026-05-08; project page https://www.di.ens.fr/willow/research/netvlad/ ; CVPR 2016 PDF
- Tier: L1 (peer-reviewed CVPR 2016 + TPAMI 2018 + canonical implementation cross-referenced; most-cited VPR paper of the modern deep-learning era, > 4000 citations as of 2026)
- Publication Date: arXiv preprint 2015-11-23 (v1); CVPR 2016 acceptance; extended TPAMI 2018 (v3); algorithm has been the canonical VPR baseline for 10 years
- Timeliness Status: ⚠️ Borderline by 18-month Critical-novelty window (10 years old) — but Established-baseline exemption applies per the engine's source-tiering rule. The algorithm itself is mature and cited as the canonical reference baseline in every subsequent VPR paper (MixVPR Table 1+4, SALAD Table 1, SelaVPR Table 2+3, AnyLoc, BoQ, etc.). Freshness concerns are on (a) the runtime stack (MATLAB + MatConvNet vs modern PyTorch — covered by Sources #64 + #65), (b) the canonical pretrained weights (street-level only, no aerial — same D-C2-1 caveat as MixVPR + SALAD + SelaVPR), and (c) the descriptor dimensionality vs modern compact descriptors (4096-D PCA-whitened is 2× larger than MixVPR's 2048-D and 4× larger than SelaVPR's 1024-D)
- Version Info: arXiv v1 (CVPR 2016 camera-ready) + arXiv v3 (TPAMI 2018 extended)
- Target Audience: System architects + C2 implementer + simple-baseline reference-point owner + algorithm-comparison reviewer
- Research Boundary Match: Full match for the algorithm (NetVLAD pooling layer that generalizes VLAD to a learnable, end-to-end-trainable CNN module — described in §3.1 of the paper, equations 1–4); full match for the training procedure (weakly supervised triplet ranking loss with hard negative mining on Google Street View Time Machine — §4); partial match for the project's domain (paper benchmarks Pitts30k, Pitts250k, Tokyo24/7, Tokyo Time Machine — all ground-level urban; NO aerial nadir benchmark in the canonical paper, same caveat as MixVPR + SALAD + SelaVPR)
- Summary: The canonical paper introduces NetVLAD = a generalized, end-to-end-trainable VLAD layer pluggable into any CNN architecture, with three principal contributions: (a) NetVLAD pooling layer (paper §3.1, Eq. 1–4): replaces VLAD's hard cluster-assignment with differentiable soft-assignment via softmax over `w_k^T x_i + b_k`; aggregates first-order residuals `(x_i - c_k)` weighted by soft-assignment into a K×D matrix; intra-channel L2-norm + flatten + final L2-norm to produce a (K·D)-dimensional vector; (b) weakly supervised triplet ranking loss (paper §4): uses GPS-tagged Google Street View Time Machine images with positive examples from the same approximate location and negative examples from far-away locations; hard negative mining selects the most-similar negatives for each query; (c) Time Machine training data: provides multiple panoramas at the same location over time, allowing the network to learn invariance to illumination/season changes by construction. Architecture: VGG-16 cropped at conv5_3 (output 512-D feature map at H×W spatial locations) + NetVLAD pooling with K=64 cluster centres → 64×512 = 32768-D K×D matrix → intra-norm + flatten + L2-norm → 32768-D NetVLAD descriptor → optional PCA-whitening to 4096-D (canonical recommendation) or 256-D / 512-D for tighter cache budgets. Reported performance (paper Table 1, Pitts30k-test on best VGG-16 + NetVLAD + whitening trained on Pittsburgh): R@1=84.1, R@5=94.6, R@10=95.5. Reported performance (Tokyo24/7): R@1=73.3 (paper's Table on Tokyo24/7 across daytime/sunset/nighttime queries; used as the cross-illumination challenge benchmark, the same way modern papers use Tokyo24/7 for night queries). Reported architecture choices: input image size 224×224 (canonical paper test resolution); training with SGD, momentum=0.9, weight decay=0.001, lr=10^-4 (Tokyo Time Machine) or 10^-3 (Pitts30k), batch size 4 tuples per SGD step, margin m=0.1, 30 epochs. Comparison to non-deep baselines: NetVLAD outperforms RootSIFT+VLAD+whitening (the previous SOTA) by ~5-7 absolute Recall@1 points on Pitts30k and Tokyo24/7.
Comparison to off-the-shelf CNN descriptors: end-to-end-trained NetVLAD outperforms off-the-shelf VGG/AlexNet conv5 features pooled via max-pool/avg-pool/VLAD by ~10-20 absolute R@1; the paper's primary scientific contribution = "training the representation directly for the place recognition task is crucial for obtaining good performance" (paper §1.1, §5.2). Authors' acknowledged limitations: (a) Tokyo Time Machine training is geographically limited (eventually expanded in TPAMI 2018 extended version); (b) PCA-whitening tuning is dataset-specific; (c) the soft-assignment softmax is sensitive to the cluster centre initialization (the `--mode=cluster` step of the PyTorch port in Source #65 is the K-means cluster-centre initialization on a sample of training features; without this, NetVLAD training collapses or drifts). Modern descendants: Patch-NetVLAD (2021) extends NetVLAD with patch-level features for re-ranking (vs. SelaVPR's local features for re-ranking); AnyLoc (2024) wraps NetVLAD-style aggregation around DINOv2 ViT-G features (the Conditional INT8 candidate row in `06_component_fit_matrix/C2_vpr.md`). License (canonical implementation): MIT (per Source #64)
- Related Sub-question: SQ3+SQ4 / C2: NetVLAD per-mode API capability verification (cross-source verification of the canonical implementation's algorithmic claims, training procedure, and Pitts30k / Tokyo24/7 Recall@K numbers; aerial-domain caveat documented; descriptor-dimensionality vs modern compact descriptors trade-off documented)
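The pooling steps summarised in this entry (soft-assignment, residual aggregation, intra-normalisation, flatten, final L2-norm) can be sketched in a few lines of NumPy. This is an illustrative forward pass under stated assumptions, not the MATLAB or PyTorch implementation: the function name `netvlad_pool`, the argument names, and the random inputs are hypothetical, and the input L2-normalisation mirrors the "preL2" convention described under Source #64.

```python
import numpy as np


def netvlad_pool(x, centres, w, b, eps=1e-12):
    """Forward pass of a NetVLAD-style pooling layer (sketch).

    x: (N, D) local descriptors; centres: (K, D); w: (K, D); b: (K,).
    Returns an L2-normalised vector of length K*D.
    """
    # Input L2-normalisation of each local descriptor.
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    # Soft-assignment: softmax over w_k^T x_i + b_k (numerically stabilised).
    logits = x @ w.T + b
    logits = logits - logits.max(axis=1, keepdims=True)
    a = np.exp(logits)
    a = a / a.sum(axis=1, keepdims=True)                       # (N, K)
    # V[k] = sum_i a[i, k] * (x[i] - centres[k])
    v = a.T @ x - a.sum(axis=0)[:, None] * centres             # (K, D)
    # Intra-normalisation per cluster, then flatten and final L2-norm.
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + eps)


rng = np.random.default_rng(0)
descriptor = netvlad_pool(
    rng.normal(size=(100, 512)),   # 100 local descriptors, D=512 (VGG-16 conv5_3 width)
    rng.normal(size=(64, 512)),    # K=64 cluster centres (random here, K-means in practice)
    rng.normal(size=(64, 512)),    # assignment weights
    np.zeros(64),                  # assignment biases
)
print(descriptor.shape)
```

PCA-whitening to 4096-D would be applied to this vector as a separate learned projection; it is omitted here.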
Source #67
- Title: EigenPlaces canonical implementation: `gmberton/EigenPlaces` (Berton, Trivigno, Caputo, Masone; ICCV 2023), the official PyTorch reference implementation, with training (`train.py`) + evaluation (`eval.py`) CLIs, PyTorch Hub registration with multiple pretrained backbones+descriptor-dim variants (ResNet18 256/512; ResNet50 128/256/512/2048; ResNet101 128/256/512/2048; VGG16 512), companion fair-evaluation framework `gmberton/VPR-methods-evaluation`, MIT License
- Link: README https://raw.githubusercontent.com/gmberton/EigenPlaces/main/README.md (accessed 2026-05-08); LICENSE https://raw.githubusercontent.com/gmberton/EigenPlaces/main/LICENSE (accessed 2026-05-08); repo https://github.com/gmberton/EigenPlaces ; PyTorch Hub one-liner `model = torch.hub.load("gmberton/eigenplaces", "get_trained_model", backbone="ResNet50", fc_output_dim=2048)`; companion eval framework https://github.com/gmberton/VPR-methods-evaluation ; companion CosPlace repo (training-data lineage) https://github.com/gmberton/CosPlace
- Tier: L1 (project-official codebase by the canonical EigenPlaces authors Gabriele Berton + Gabriele Trivigno + Barbara Caputo + Carlo Masone, Politecnico di Torino; same author group as CosPlace [CVPR 2022] and the `VPR-methods-evaluation` standardized comparison harness; Berton's lab is one of the most active VPR research groups of the modern era, also producing MegaLoc which the EigenPlaces README cites as "Looking for SOTA Visual Place Recognition (VPR)? Check out MegaLoc")
- Publication Date: README live; main HEAD active through 2024 (PyTorch Hub publication preserves the canonical pretrained checkpoints in perpetuity); canonical paper ICCV 2023 (Aug 2023 arXiv submission, Oct 2023 ICCV proceedings)
- Timeliness Status: ⚠️ Borderline by 18-month Critical-novelty window (paper Aug 2023 / ICCV 2023), but a mature algorithm with active PyTorch-Hub-distributed pretrained checkpoints; Berton's lab continues active VPR research (MegaLoc successor + `VPR-methods-evaluation` companion harness); the freshness concern is on (a) emerging successors that improve on EigenPlaces (MegaLoc per README's own pointer), (b) aerial-domain transfer of canonical SF-XL street-view weights (same D-C2-1 caveat as MixVPR + SALAD + SelaVPR + NetVLAD), (c) potential improvements via larger backbones (DINOv2-based candidates not yet in this row)
- Version Info: main HEAD; PyTorch (no specific version pinned in README); canonical eval one-liner `python3 eval.py --backbone ResNet50 --fc_output_dim 2048 --resume_model torchhub` (downloads pretrained from PyTorch Hub); training one-liner `python3 train.py --backbone ResNet50 --fc_output_dim 128` (configurable per backbone+dim combo); supported `--backbone` values: `ResNet18, ResNet50, ResNet101, VGG16`; supported `--fc_output_dim` per backbone: ResNet18 ∈ {256, 512}, ResNet50 ∈ {128, 256, 512, 2048}, ResNet101 ∈ {128, 256, 512, 2048}, VGG16 ∈ {512}; `--train_dataset_folder path/to/sf_xl/raw/train/panoramas` for canonical SF-XL training; MIT License (Copyright (c) 2023 Gabriele Berton, Gabriele Trivigno, Carlo Masone, Barbara Caputo); 200+ stars; `auto_VPR` companion codebase (referenced in paper §1) for fair-comparison-with-other-baselines is at https://github.com/gmberton/auto_VPR (also MIT, Berton's lab)
- Target Audience: System architects + C2 implementer + Step-7.5 reviewer + simple-baseline-vs-modern-lead comparison framework owner
- Research Boundary Match: Full match for the project's pinned C2 mode (ResNet-50 backbone + GeM pooling + single fully-connected layer producing 2048-D L2-normalised global descriptor at ImageNet-normalised input; single-stage retrieval, no re-ranking). The repo ships everything needed for inference (`eval.py --backbone ResNet50 --fc_output_dim 2048 --resume_model torchhub`) and for retraining on a custom dataset (`train.py --train_dataset_folder ... --backbone ResNet50 --fc_output_dim 2048`). Multiple per-Mode sibling candidates: each `(backbone, fc_output_dim)` tuple is a distinct mode per the Per-Mode API rule: ResNet50@128 (cache-tightest), ResNet50@512 (cache-medium), ResNet50@2048 (best Recall@1 / cache-medium-large), VGG16@512 (legacy backbone for cross-comparison with NetVLAD's VGG-16). Partial match for the project's domain (canonical training is SF-XL panoramas: San Francisco eXtra Large, 170 km², ~2.8M database images, street-level urban; canonical evaluation is 16 datasets including Pitts30k/Pitts250k/Tokyo24/7/AmsterTime/SF-XL-test-v1/SF-XL-test-v2/MSLS-val/Nordland/St-Lucia/SVOX-{Night,Overcast,Rain,Snow,Sun}/Eynsham/San-Francisco-Landmark, all ground-level / street-level / urban + multi-season; NO aerial nadir benchmark in the repo's reported tables, same caveat as MixVPR + SALAD + SelaVPR + NetVLAD). Critical implementation match advantage vs MixVPR / SALAD / SelaVPR: (i) the `auto_VPR` and `VPR-methods-evaluation` companion frameworks ship an in-the-box fair-comparison harness with NetVLAD + SFRS + CosPlace + Conv-AP + MixVPR + EigenPlaces, directly usable for the Jetson MVE phase to score all C2 candidates against the same ADTi 20MP nav frames + Derkachi flight; (ii) ResNet-50 + GeM + FC is the structurally simplest modern competitive C2 architecture in this row (simpler than MixVPR's MLP-Mixer aggregation, simpler than SALAD's optimal-transport aggregation + DINOv2 backbone, simpler than SelaVPR's two-stage adapter+local-up-conv architecture, simpler than NetVLAD's K=64 cluster-centre soft-assignment), which means fewer Jetson-export risks
- Summary: EigenPlaces is the canonical reference implementation of the ICCV 2023 paper "EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition" by Gabriele Berton et al.
Critical license finding: LICENSE file is MIT License (Copyright (c) 2023 Gabriele Berton, Gabriele Trivigno, Carlo Masone, Barbara Caputo), which is permissive; this places EigenPlaces on the BSD/permissive license track alongside MixVPR (MIT) + SelaVPR (MIT) + NetVLAD-canonical (MIT) + Kimera-VIO (BSD-2) + OKVIS2 (BSD-3) + DPVO (MIT) + pure-VO baseline (OpenCV-Apache-2.0). EigenPlaces is the fourth modern C2 candidate on the BSD/permissive track (after MixVPR + SelaVPR + NetVLAD-canonical), completing the BSD/permissive C2 axis with a viewpoint-robust modern design point. Multiple pretrained checkpoints are documented and PyTorch-Hub-distributed (no Google Drive dependency, unlike SelaVPR): `(ResNet18, 256)`, `(ResNet18, 512)`, `(ResNet50, 128)`, `(ResNet50, 256)`, `(ResNet50, 512)`, `(ResNet50, 2048)`, `(ResNet101, 128)`, `(ResNet101, 256)`, `(ResNet101, 512)`, `(ResNet101, 2048)`, `(VGG16, 512)`: eleven canonical pretrained checkpoints in total, more than any other C2 candidate evaluated so far. Acknowledged dependencies: CosFace PyTorch implementation (for Large Margin Cosine Loss layer), CNN Image Retrieval in PyTorch (for the GeM layer), Visual Geo-localization benchmark (for evaluation/test code), CosPlace (for the SF-XL dataset partitioning lineage). README explicit pointer: "EigenPlaces is quite old. Looking for SOTA VPR? Check out MegaLoc". Berton's lab acknowledges newer work supersedes EigenPlaces on average benchmarks, but for the project's mandatory simple-baseline + modern-CNN-lead role, EigenPlaces remains the canonical viewpoint-robust reference design point.
- Related Sub-question: SQ3+SQ4 / C2: EigenPlaces per-mode API capability verification (`context7` returned NOT INDEXED for `gmberton/eigenplaces` and EMPTY search results for the query `eigenplaces`; per Per-Mode API Capability Verification rule item 2, fall-back to official-docs WebFetch on the canonical repo README + LICENSE was used; cross-validated against the canonical paper [Source #68])
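The "backbone + GeM pooling + single FC layer" head described in this entry is small enough to sketch directly. The NumPy version below is a simplified illustration, not the repo's code: the exponent `p=3.0`, the helper names `gem_pool` and `describe`, the random weights, and the 2048×7×7 feature map (the shape a ResNet-50 produces for a 224×224 input) are all assumptions made for the example.

```python
import numpy as np


def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean pooling over the spatial axes: (C, H, W) -> (C,).

    p=1 gives average pooling; large p approaches max pooling.
    """
    clamped = np.clip(feature_map, eps, None)   # GeM assumes positive activations
    return np.mean(clamped ** p, axis=(1, 2)) ** (1.0 / p)


def describe(feature_map, fc_weight, fc_bias, p=3.0):
    """GeM pooling, one fully-connected layer, then L2-normalisation."""
    pooled = gem_pool(feature_map, p)
    desc = fc_weight @ pooled + fc_bias
    return desc / np.linalg.norm(desc)


rng = np.random.default_rng(0)
fmap = rng.uniform(0.1, 1.0, size=(2048, 7, 7))   # stand-in for backbone output
fc_w = rng.normal(size=(128, 2048))               # fc_output_dim=128 variant
fc_b = np.zeros(128)
desc = describe(fmap, fc_w, fc_b)
print(desc.shape)
```

Each `(backbone, fc_output_dim)` mode listed above differs only in the backbone channel count and the FC output width, which is why the head is considered low-risk for export.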
Source #68
- Title: EigenPlaces canonical paper — "EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition" (Berton, Trivigno, Caputo, Masone — ICCV 2023, arXiv:2308.10832 v1) + ICCV 2023 Open Access Foundation page
- Link: arXiv https://arxiv.org/abs/2308.10832 (Aug 2023); ICCV 2023 Open Access https://openaccess.thecvf.com/content/ICCV2023/papers/Berton_EigenPlaces_Training_Viewpoint_Robust_Models_for_Visual_Place_Recognition_ICCV_2023_paper.pdf (accessed 2026-05-08); ICCV 2023 OA HTML https://openaccess.thecvf.com/content/ICCV2023/html/Berton_EigenPlaces_Training_Viewpoint_Robust_Models_for_Visual_Place_Recognition_ICCV_2023_paper.html
- Tier: L1 (peer-reviewed ICCV 2023 + canonical implementation cross-referenced)
- Publication Date: arXiv preprint 2023-08-21; ICCV 2023 acceptance October 2023 (pages 11080–11090 of the proceedings)
- Timeliness Status: ⚠️ Borderline by 18-month Critical-novelty window (Aug 2023 / ICCV 2023) — algorithm is mature, canonical implementation is actively maintained via PyTorch Hub, algorithmic content is stable; the freshness concern is on (a) MegaLoc successor that the README itself recommends, (b) aerial-domain weights (same D-C2-1 as all C2 candidates), (c) DINOv2-backbone variants not in EigenPlaces canonical (would require retrain)
- Version Info: arXiv v1 (ICCV camera-ready; pages 11080–11090 in ICCV 2023 proceedings)
- Target Audience: System architects + C2 implementer + Step-7.5 reviewer
- Research Boundary Match: Full match for the algorithm (single-stage VPR with viewpoint-robust training paradigm: SVD-based class construction within map cells + lateral CosFace loss + frontal CosFace loss; CNN backbone + GeM pooling + FC layer for descriptor extraction). Partial match for the project's domain (paper benchmarks 16 ground-level urban / multi-season datasets — Pitts30k, Pitts250k, Tokyo24/7, AmsterTime, Eynsham, SF-XL test v1/v2, San Francisco Landmark [multi-view] + MSLS Val, Nordland, St Lucia, SVOX Night/Overcast/Rain/Snow/Sun [frontal-view]; NO aerial nadir benchmark in the canonical paper, same caveat as MixVPR + SALAD + SelaVPR + NetVLAD)
- Summary: The canonical paper introduces EigenPlaces = a training paradigm, NOT a new neural architecture. The architecture is intentionally simple: VGG-16 or ResNet-50 (or ResNet-18 / ResNet-101 in repo) cropped at the last conv layer, fed into GeM pooling [Radenovic et al. 2018], then a single fully-connected layer produces the descriptor of dim `fc_output_dim`. The novelty is the viewpoint-robust training paradigm (paper §3): (a) Map partitioning (§3.1): divide the SF-XL map into M×M=15m×15m cells; group cells in subsets where N=3 ensures no spatial overlap; shift the active cell-set after each training epoch; (b) EigenPlaces class construction (§3.2 + Fig 3-4): for each cell, compute SVD of the (centered) UTM coordinates of all images in that cell; the first principal component (V0) is the road direction; the second principal component (V1, perpendicular to road) points to the side of the road where points-of-interest (building facades) are located; place a focal point at distance D=10m from the cell center along V1; group images facing this focal point into one lateral class; repeat with a focal point along V0 (the road direction itself) to form a frontal class for forward-facing cameras; (c) Two-CosFace-loss training (§3.3): Large Margin Cosine Loss [Wang et al. 2018] on the lateral class (Llat) makes the network robust to viewpoint shifts; second CosFace loss on the frontal class (Lfront) handles forward-facing cameras (e.g., MSLS / St Lucia datasets); final loss is `L = Llat + Lfront`. Implementation details (paper §4.2): simple architecture (VGG-16 or ResNet-50 + GeM + FC), 200k iterations, batch size 128 (64 lateral + 64 frontal), Adam lr=1e-5, mixed precision (AMP), SFRS-style augmentations (color jittering + random cropping); cell M=15m, N=3, focal distance D=10m (paper Tab 6 ablation: D=20m slightly better on average); input image size: NOT explicitly stated in paper text, follows VPR-methods-evaluation framework default 224×224 ImageNet-normalised (canonical eval CLI exposes an `--image_size` flag in the companion `gmberton/VPR-methods-evaluation` framework, defaulting to 224×224 for fair comparison across all methods in the standardized harness).
Reported Recall@1 on multi-view datasets (Tab 3, ResNet-50 best-config @ 2048-D): AmsterTime=48.9 (best in table, +1.2 over CosPlace 47.7); Eynsham=90.7; Pitts30k=92.5 (vs CosPlace 90.9, MixVPR-2048 91.5, MixVPR-4096 91.5, NetVLAD-VGG16-4096 85.0; SALAD Pitts250k 95.1 is from different paper Table); Pitts250k=94.1; Tokyo24/7=93.0 (best in Tab 3 across all methods compared in paper, +5.7 over CosPlace 87.3, +7.9 over MixVPR-4096 85.1; SelaVPR Tokyo24/7=94.0 from Source #63 is +1.0 over EigenPlaces); SF-XL-test-v1=84.1 (vs CosPlace 76.4, NetVLAD-VGG16-4096 40.0); SF-XL-test-v2=90.8. Reported Recall@1 on frontal-view datasets (Tab 4, ResNet-50 @ 2048-D): MSLS-Val=89.1 (vs CosPlace 87.4, MixVPR-4096 87.2, MixVPR-512 83.6; SALAD MSLS-Val=92.2 from Source #60 wins by +3.1 absolute; SelaVPR MSLS-Val=90.8 from Source #62 wins by +1.7 absolute); Nordland=71.2 (vs MixVPR-4096 76.2 — MixVPR wins by +5 absolute on Nordland); St-Lucia=99.6; SVOX-Night=58.9 (vs MixVPR-4096 64.4 — MixVPR wins by +5.5 absolute on extreme night); SVOX-Overcast=93.1; SVOX-Rain=90.0; SVOX-Snow=93.1; SVOX-Sun=86.4. 
Resource analysis (§4.4): EigenPlaces ResNet-50 + 2048-D trains with <7 GB GPU VRAM (vs MixVPR 18 GB at batch=480 with their canonical batch — EigenPlaces requires 60% less GPU memory); training takes ~24 hours on a single RTX 3090 (similar to SFRS, CosPlace, MixVPR); EigenPlaces uses mixed precision (AMP) following Conv-AP and MixVPR; descriptor dimensionality is 2048-D vs MixVPR's best of 4096-D — 50% smaller descriptor; paper notes "the inference time can be computed as the sum of the query's descriptors extraction time plus the matching (kNN) time, rendering the extraction time negligible when working on large scale datasets" — the paper does NOT report explicit per-query extraction latency (unlike MixVPR's 1.21 ms on A100, SALAD's 2.41 ms on RTX 3090, SelaVPR's 27 ms on RTX 3090); for ResNet-50 forward-pass extrapolation, contemporary GPU benchmarks place ResNet-50 fp16 at ~1-2 ms on A100 / ~3-5 ms on RTX 3090 / ~15-30 ms on Jetson Orin Nano Super at fp16+TensorRT. Authors' acknowledged limitations / observations (§4.3): (a) "no single model that outperforms all other ones on all datasets" — EigenPlaces wins on multi-view but MixVPR-4096 wins on Nordland and SVOX-Night; (b) "lower dimensionality descriptors still struggle on cross-domain datasets (e.g. AmsterTime, Tokyo 24/7, SVOX night)" — the 128-D / 256-D variants trade Recall@K for cache footprint; (c) the paper does not benchmark on aerial nadir imagery (same D-C2-1 caveat); (d) the README explicitly recommends MegaLoc as a SOTA successor — for the project's mandatory-pre-screen role this is acceptable, but Plan-phase may want to also evaluate MegaLoc as a separately-cataloged sibling/successor candidate. License (canonical implementation): MIT (per Source #67). 
- Related Sub-question: SQ3+SQ4 / C2 — EigenPlaces per-mode API capability verification (cross-source verification of the canonical implementation's mode/parameter/training-recipe/Recall@K claims; aerial-domain caveat documented; ResNet-50 vs DINOv2-based candidates structural-simplicity advantage documented; viewpoint-robust training paradigm completes the BSD/permissive C2 axis with a 4th materially-different design point alongside MixVPR + SelaVPR + NetVLAD)
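The class-construction step summarised for this paper (SVD of centred UTM coordinates, then focal points at distance D along V1 and V0) can be illustrated with NumPy. This is a sketch of the geometry only, under assumptions: the function name `cell_focal_points` and the synthetic coordinates are hypothetical, and the sign of each singular vector is arbitrary, so which side of the road the lateral focal point falls on would need a separate convention.

```python
import numpy as np


def cell_focal_points(utm_xy, distance_m=10.0):
    """Principal directions of one map cell and the two focal points.

    utm_xy: (N, 2) UTM east/north coordinates of the images in the cell.
    Returns (centre, v0, v1, frontal_focal, lateral_focal).
    """
    centre = utm_xy.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(utm_xy - centre, full_matrices=False)
    v0, v1 = vt[0], vt[1]          # v0: road direction, v1: perpendicular to it
    frontal_focal = centre + distance_m * v0
    lateral_focal = centre + distance_m * v1
    return centre, v0, v1, frontal_focal, lateral_focal


# Synthetic cell: images spaced 1 m apart along an east-west road, with small
# alternating north-south offsets (illustrative data, not from SF-XL).
east = np.arange(15, dtype=float)
north = 0.5 * (-1.0) ** np.arange(15)
cell = np.stack([east, north], axis=1)

centre, v0, v1, frontal_focal, lateral_focal = cell_focal_points(cell)
print(v0, v1)
```

Grouping images into a class would then select those whose heading points towards the chosen focal point; that step depends on per-image heading metadata and is not shown.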