add clarification to research methodology by including a step for solution comparison and user consultation

This commit is contained in:
Oleksandr Bezdieniezhnykh
2026-03-17 18:43:57 +02:00
parent d764250f9a
commit b419e2c04a
# DINOv2 Feature Aggregation for Visual Place Recognition / Image Retrieval
**Research Date**: March 2025
**Context**: GPS-denied UAV navigation using DINOv2 ViT-S/14 for coarse satellite tile retrieval. Current approach: spatial average pooling of patch tokens.
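The current baseline, spatial average pooling of DINOv2 patch tokens into a single global descriptor, can be sketched as follows. This is a minimal NumPy illustration (the random token array stands in for real backbone output; shapes follow a 224×224 input to ViT-S/14):

```python
import numpy as np

def average_pool_descriptor(patch_tokens: np.ndarray) -> np.ndarray:
    """Average-pool patch tokens of shape (n_patches, d) into one
    L2-normalized global descriptor of shape (d,)."""
    desc = patch_tokens.mean(axis=0)
    return desc / np.linalg.norm(desc)

# 224x224 input / 14px patches -> 16x16 = 256 patch tokens, dim 384 for ViT-S
tokens = np.random.rand(256, 384).astype(np.float32)
descriptor = average_pool_descriptor(tokens)
```

L2-normalizing the pooled vector makes dot products equal cosine similarity, which is the usual retrieval metric for such descriptors.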
---
## Executive Summary
| Aggregation | Recall (MSLS Challenge R@1) | Latency | Memory | Training | Recommendation |
|-------------|-----------------------------|---------|--------|----------|----------------|
| **Average pooling** | ~42–50% (est. from AnyLoc/GeM) | ~50ms | Low | None | Baseline |
| **GeM pooling** | 62.6% (DINOv2 GeM) | ~50ms | Low | Yes (80 epochs) | Simple upgrade |
| **SALAD** | **75.0%** | **<3ms** (extract+aggregate) | 0.63 GB retrieval | 30 min | **Best** |
**Recommendation**: SALAD provides the largest recall gain (+12–25 pp over GeM, +25–33 pp over average) with negligible latency overhead. GeM is a simpler middle-ground if training is constrained. SALAD was validated with ViT-B; ViT-S support is architecture-agnostic but requires config changes and may reduce recall.
---
## 1. Recall Improvement: SALAD vs Average Pooling
### 1.1 Benchmark Numbers (from SALAD paper, arxiv 2311.15937)
**Single-stage baselines (Table 1), R@1 / R@5 / R@10 per dataset:**
| Method | MSLS Challenge | MSLS Val | NordLand | Pitts250k-test | SPED |
|--------|----------------|----------|----------|----------------|------|
| **GeM** (ResNet) | 49.7 / 64.2 / 67.0 | 78.2 / 86.6 / 89.6 | 21.6 / 37.3 / 44.2 | 87.0 / 94.4 / 96.3 | 66.7 / 83.4 / 88.0 |
| **DINOv2 SALAD** | **75.0 / 88.8 / 91.3** | **92.2 / 96.4 / 97.0** | **76.0 / 89.2 / 92.0** | **95.1 / 98.5 / 99.1** | **92.1 / 96.2 / 96.5** |
**Ablation: DINOv2 with different aggregations (Table 3):**
| Method | Feature Dim | MSLS Challenge R@1 | MSLS Val R@1 | NordLand R@1 | Pitts250k R@1 | SPED R@1 |
|--------|-------------|--------------------|--------------|--------------|--------------|----------|
| DINOv2 AnyLoc (VLAD, no fine-tune) | 49152 | 42.2 | 68.7 | 16.1 | 87.2 | 85.3 |
| **DINOv2 GeM** | 4096 | 62.6 | 85.4 | 35.4 | 89.5 | 83.0 |
| DINOv2 MixVPR | 4096 | 72.1 | 90.0 | 63.6 | 94.6 | 89.8 |
| DINOv2 NetVLAD | 24576 | 75.8 | 92.4 | 71.8 | 95.6 | 90.8 |
| **DINOv2 SALAD** | 8192+256 | **75.0** | **92.2** | **76.0** | **95.1** | **92.1** |
**Recall improvement (SALAD vs baselines):**
- vs DINOv2 GeM: +12.4 pp (MSLS Challenge), +6.8 pp (MSLS Val), +40.6 pp (NordLand)
- vs DINOv2 AnyLoc (closest to “average-like”): +32.8 pp (MSLS Challenge), +23.5 pp (MSLS Val), +59.9 pp (NordLand)
- vs ResNet GeM: +25.3 pp (MSLS Challenge), +14.0 pp (MSLS Val)
**Sources:**
- [SALAD paper (arxiv 2311.15937)](https://arxiv.org/abs/2311.15937)
- [CVPR 2024 paper](https://openaccess.thecvf.com/content/CVPR2024/html/Izquierdo_Optimal_Transport_Aggregation_for_Visual_Place_Recognition_CVPR_2024_paper.html)
- [SALAD project page](https://serizba.github.io/salad.html)
- [GitHub: serizba/salad](https://github.com/serizba/salad)
---
## 2. Computational Overhead: SALAD vs Average Pooling
### 2.1 Latency (Table 2 from SALAD paper)
| Method | Retrieval (ms) | Reranking (ms) | Total (ms) | MSLS Challenge R@1 |
|--------|----------------|----------------|------------|--------------------|
| Patch-NetVLAD | 908.30 | 8377.17 | ~9285 | 48.1 |
| TransVPR | 22.72 | 1757.70 | ~1780 | 63.9 |
| R2Former | 4.7 | 202.37 | ~207 | 73.0 |
| **DINOv2 SALAD** | **0.63** | **0.0** | **2.41** | **75.0** |
- SALAD: **<3 ms per image** (RTX 3090), single-stage, no re-ranking.
- Sinkhorn iterations: O(n²) per iteration, but n = number of patches (~256 for 224×224), so cost is small vs backbone.
- Backbone (DINOv2) dominates; SALAD adds only a few ms.
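The Sinkhorn step above can be sketched in a few lines: alternating row/column normalization of a patch-to-cluster score matrix, each iteration O(n·m) over the n×m matrix. This is a minimal illustration, not SALAD's actual implementation (cluster count and iteration count are illustrative; SALAD also adds a dustbin column):

```python
import numpy as np

def sinkhorn(scores: np.ndarray, n_iters: int = 3) -> np.ndarray:
    """Alternately normalize rows and columns of exp(scores) so the result
    approaches a soft assignment with balanced cluster mass."""
    P = np.exp(scores - scores.max())  # shift for numerical stability
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)  # each patch's assignments sum to 1
        P = P / P.sum(axis=0, keepdims=True)  # each cluster's total mass sums to 1
    return P

# 256 patches (224x224 input, 14px patches) x 64 clusters -- illustrative sizes
P = sinkhorn(np.random.randn(256, 64))
```

With only ~256 patches, a handful of these iterations is far cheaper than one ViT forward pass, which is why the backbone dominates total latency.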
### 2.2 Memory
- DINOv2 SALAD: **0.63 GB** retrieval memory (MSLS Val, ~18k images).
- Global descriptor only (no local feature storage for re-ranking).
- SALAD descriptor: 8192+256 dims vs average pooling ~384 (ViT-S) or 768 (ViT-B).
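The reported 0.63 GB is consistent with a back-of-the-envelope estimate for float32 descriptors; the exact database size below (~18k images) is an approximation from the bullet above:

```python
n_images = 18_000        # approx. MSLS Val database size
dims = 8192 + 256        # SALAD global descriptor dimensionality
bytes_per_value = 4      # float32

total_gb = n_images * dims * bytes_per_value / 1e9
print(total_gb)          # -> 0.608256, i.e. ~0.61 GB
```

The small gap to 0.63 GB is plausibly index overhead or a slightly larger database; the point is that global-descriptor-only storage stays under a gigabyte at this scale.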
---
## 3. GeM Pooling as Middle-Ground
### 3.1 GeM vs Average Pooling
- GeM (Generalized Mean): \( \mathrm{GeM}(x) = \left( \frac{1}{n}\sum_{i=1}^{n} x_i^p \right)^{1/p} \); p=1 → average, p→∞ → max.
- GeM is a learned generalization of average pooling; typically improves retrieval.
- SALAD paper: DINOv2 GeM **62.6%** R@1 (MSLS Challenge) vs ResNet GeM **49.7%**.
- No direct DINOv2 average-pooling numbers in the paper; AnyLoc (VLAD, no fine-tune) is ~42.2% R@1.
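A minimal NumPy sketch of GeM over patch tokens; in practice p is a learned parameter trained end-to-end (p=3 is a common initialization, and the offset added to the random tokens below just keeps them safely positive for the toy check):

```python
import numpy as np

def gem_pool(tokens: np.ndarray, p: float = 3.0, eps: float = 1e-6) -> np.ndarray:
    """Generalized-mean pooling over patch tokens (n_patches, d).
    p=1 reduces to average pooling; larger p weights strong activations more."""
    return (np.clip(tokens, eps, None) ** p).mean(axis=0) ** (1.0 / p)

tokens = np.random.rand(256, 384) + 0.5  # illustrative positive tokens
pooled = gem_pool(tokens, p=3.0)
```

Because it is a single parameterized pooling layer, swapping average pooling for GeM is a one-line model change; the cost is that p (and usually the backbone) must be trained.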
### 3.2 GeM vs SALAD
| Metric | DINOv2 GeM | DINOv2 SALAD |
|--------|------------|--------------|
| MSLS Challenge R@1 | 62.6 | 75.0 |
| NordLand R@1 | 35.4 | 76.0 |
| Descriptor size | 4096 | 8192+256 |
| Training | 80 epochs (MixVPR pipeline) | 4 epochs, 30 min |
| Implementation | Simple (one layer) | Sinkhorn + MLP |
**Conclusion**: GeM is a simpler upgrade over average pooling; SALAD gives a larger gain, especially on hard datasets (e.g. NordLand +40.6 pp).
**Sources:**
- [GeM paper (Radenović et al.)](https://arxiv.org/abs/1811.00202)
- [DINO-Mix (Nature 2024)](https://www.nature.com/articles/s41598-024-73853-3) — GeM + attention
---
## 4. DINOv2 Aggregation Methods 2025–2026
| Method | Year | Aggregation | Key result | Outperforms SALAD? |
|--------|------|-------------|------------|--------------------|
| **SALAD** | CVPR 2024 | Optimal transport (Sinkhorn) | 75.0% MSLS Challenge | — |
| **DINO-Mix** | 2023/2024 | MLP-Mixer | 91.75% Tokyo24/7, 80.18% NordLand | Different benchmarks |
| **DINO-MSRA** | 2025 | Multi-scale residual attention | UAV–satellite cross-view | Cross-view only |
| **CV-Cities** | 2024 | DINOv2 + feature mixer | Cross-view geo-localization | Cross-view only |
| **UAV Self-Localization (GLFA+CESP)** | 2025 | Custom (GLFA, CESP) | 86.27% R@1 DenseUAV | Cross-view only |
**Conclusion**: No standard VPR benchmark shows a 2025 method clearly beating SALAD. New work focuses on cross-view (UAV–satellite) with custom architectures rather than generic aggregation.
**Sources:**
- [DINO-Mix arxiv 2311.00230](https://arxiv.org/abs/2311.00230)
- [DINO-MSRA (2025)](https://www.dqxxkx.cn/EN/10.12082/dqxxkx.2025.250051)
- [CV-Cities arxiv 2411.12431](https://arxiv.org/html/2411.12431)
- [DINOv2 UAV Self-Localization (2025)](https://ui.adsabs.harvard.edu/abs/2025IRAL...10.2080Y/abstract)
---
## 5. Cross-View (UAV-to-Satellite) Retrieval
### 5.1 Methods
| Method | Dataset | R@1 | Notes |
|--------|---------|-----|------|
| DINOv2 + GLFA + CESP | DenseUAV | 86.27% | Custom enhancement, not SALAD |
| DINO-MSRA | UAV–satellite | — | Multi-scale residual attention |
| CV-Cities | Ground–satellite | — | 223k pairs, 16 cities |
| Training-free (LLM + PCA) | — | — | Zero-shot, DINOv2 features |
### 5.2 SALAD on Cross-View
- SALAD is evaluated on same-view VPR (street-level, dashcam), not UAV–satellite.
- Cross-view papers use DINOv2 + custom modules (GLFA, CESP, multi-scale attention).
- **Recommendation**: For UAV–satellite, try SALAD first; if insufficient, consider DINO-MSRA or GLFA+CESP.
**Sources:**
- [DINOv2 UAV Self-Localization](https://ui.adsabs.harvard.edu/abs/2025IRAL...10.2080Y/abstract)
- [DINO-MSRA](https://www.dqxxkx.cn/EN/10.12082/dqxxkx.2025.250051)
- [Street2Orbit (training-free)](https://jeonghomin.github.io/street2orbit.github.io/)
---
## 6. SALAD with ViT-S/14
### 6.1 Paper Configuration
- SALAD paper uses **DINOv2-B** (768-dim, 86M params).
- Table 4: ViT-S (384-dim, 21M), ViT-B (768-dim, 86M), ViT-L (1024-dim, 300M), ViT-G (1536-dim, 1.1B).
- Figure 3: different backbone sizes tested; ViT-B chosen for performance/size trade-off.
### 6.2 ViT-S Compatibility
- SALAD is backbone-agnostic: score MLP and aggregation take token dim `d` as input.
- ViT-S: d=384 → adjust `W_s1`, `W_f1`, `W_g1` (512 hidden) and output dims.
- No ViT-S results in the paper; expect lower recall than ViT-B (e.g. ~2–3 pp from Nature ViT comparison).
**Conclusion**: SALAD can work with ViT-S with config changes; recall will likely be lower than ViT-B but still above average/GeM.
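Under the assumption that SALAD's head layers depend on the backbone only through the token dim `d`, the ViT-S adaptation is a shape-configuration change. The helper below is purely illustrative (layer names and the hidden size are not taken from the SALAD codebase, though 64 clusters × 128 dims matches the 8192-dim descriptor reported above):

```python
def salad_head_shapes(token_dim: int, hidden: int = 512,
                      n_clusters: int = 64, cluster_dim: int = 128) -> dict:
    """Illustrative SALAD-style head layer shapes parameterized by token dim.
    Descriptor size = n_clusters * cluster_dim (+ global token dim)."""
    return {
        "score_mlp": [(token_dim, hidden), (hidden, n_clusters)],  # patch -> cluster scores
        "feat_mlp": [(token_dim, hidden), (hidden, cluster_dim)],  # patch -> reduced feature
        "descriptor_dim": n_clusters * cluster_dim,
    }

vit_s = salad_head_shapes(384)  # DINOv2 ViT-S/14 tokens
vit_b = salad_head_shapes(768)  # DINOv2 ViT-B/14 tokens (paper config)
```

Only the first weight matrix of each MLP changes between backbones; the aggregation itself is unchanged, which is what makes the method backbone-agnostic.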
**Sources:**
- [SALAD paper Sec 4.1](https://arxiv.org/html/2311.15937v1)
- [DINOv2 ViT comparison (Nature 2024)](https://www.nature.com/articles/s41598-024-83358-8)
- [DINOv2 MODEL_CARD](https://github.com/facebookresearch/dinov2/blob/main/MODEL_CARD.md)
---
## 7. Structured Comparison Table
| Aggregation | MSLS Ch. R@1 | NordLand R@1 | Pitts250k R@1 | Latency | Memory | Training | ViT-S | Cross-view |
|-------------|--------------|--------------|---------------|---------|--------|----------|-------|------------|
| **Average pooling** | ~42–50 | ~16–35 | ~87 | ~50ms | Low | None | ✓ | Unknown |
| **GeM** | 62.6 | 35.4 | 89.5 | ~50ms | Low | 80 ep | ✓ | Unknown |
| **SALAD** | **75.0** | **76.0** | **95.1** | **<3ms** | 0.63 GB | 30 min | Config change | Not evaluated |
| DINO-Mix | — | 80.2 | — | — | — | Yes | — | — |
| NetVLAD (dim red.) | 73.3 | 70.1 | 95.4 | — | — | Yes | — | — |
---
## 8. Recommendation for GPS-Denied UAV System
| Priority | Option | Rationale |
|----------|--------|-----------|
| **1** | **SALAD** | Largest recall gain, <3 ms overhead, 30 min training. Adapt config for ViT-S if needed. |
| **2** | **GeM** | Simpler than SALAD, clear gain over average pooling, minimal code change. |
| **3** | **Average pooling** | Keep only if no training is possible and latency is critical. |
**Implementation path:**
1. Add GeM pooling as a low-effort upgrade (no Sinkhorn, small code change).
2. Integrate SALAD (e.g. from [serizba/salad](https://github.com/serizba/salad)); adapt for ViT-S (d=384).
3. Benchmark on UAV–satellite data; compare SALAD vs GeM vs average.
4. If cross-view performance is weak, consider DINO-MSRA or GLFA+CESP.
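For step 3, the aggregations can be compared with a standard Recall@K over cosine similarity of L2-normalized global descriptors. This is a generic evaluation sketch, not tied to any particular dataset loader; the toy data at the bottom is only a sanity check:

```python
import numpy as np

def recall_at_k(query_desc: np.ndarray, db_desc: np.ndarray,
                ground_truth: list, k: int = 1) -> float:
    """Fraction of queries whose top-k database matches contain a correct index.
    query_desc: (Q, d), db_desc: (N, d), both L2-normalized;
    ground_truth: per-query set of correct database indices."""
    sims = query_desc @ db_desc.T               # cosine similarity matrix (Q, N)
    topk = np.argsort(-sims, axis=1)[:, :k]     # best-k database indices per query
    hits = [len(set(row) & ground_truth[i]) > 0 for i, row in enumerate(topk)]
    return float(np.mean(hits))

# sanity check: queries identical to three orthonormal database descriptors
db = np.eye(3)
queries = np.eye(3)
r1 = recall_at_k(queries, db, [{0}, {1}, {2}], k=1)  # -> 1.0
```

Running the same harness over average, GeM, and SALAD descriptors keeps the backbone and tile database fixed, so differences in R@1 isolate the aggregation choice.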
---
## Source URLs
| Source | URL |
|--------|-----|
| SALAD paper | https://arxiv.org/abs/2311.15937 |
| SALAD HTML | https://arxiv.org/html/2311.15937v1 |
| SALAD project | https://serizba.github.io/salad.html |
| SALAD GitHub | https://github.com/serizba/salad |
| CVPR 2024 | https://openaccess.thecvf.com/content/CVPR2024/html/Izquierdo_Optimal_Transport_Aggregation_for_Visual_Place_Recognition_CVPR_2024_paper.html |
| DINO-Mix | https://arxiv.org/abs/2311.00230 |
| DINO-Mix Nature | https://www.nature.com/articles/s41598-024-73853-3 |
| GeM paper | https://arxiv.org/abs/1811.00202 |
| DINOv2 paper | https://arxiv.org/abs/2304.07193 |
| DINOv2 GitHub | https://github.com/facebookresearch/dinov2 |
| DINO-MSRA 2025 | https://www.dqxxkx.cn/EN/10.12082/dqxxkx.2025.250051 |
| CV-Cities | https://arxiv.org/html/2411.12431 |
| UAV Self-Loc | https://ui.adsabs.harvard.edu/abs/2025IRAL...10.2080Y/abstract |
| dinov2-retrieval | https://github.com/vra/dinov2-retrieval |
| Emergent Mind SALAD | https://www.emergentmind.com/topics/dinov2-features-with-salad-aggregation |