add clarification to research methodology by including a step for solution comparison and user consultation

This commit is contained in:
Oleksandr Bezdieniezhnykh
2026-03-17 18:43:57 +02:00
parent d764250f9a
commit b419e2c04a
# DINOv2 Feature Aggregation for Visual Place Recognition / Image Retrieval
**Research Date**: March 2025
**Context**: GPS-denied UAV navigation using DINOv2 ViT-S/14 for coarse satellite tile retrieval. Current approach: spatial average pooling of patch tokens.
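The current baseline, spatial average pooling of DINOv2 patch tokens into a single global descriptor, can be sketched as follows. This is a minimal NumPy illustration (the random token array stands in for real backbone output; shapes follow a 224×224 input to ViT-S/14):

```python
import numpy as np

def average_pool_descriptor(patch_tokens: np.ndarray) -> np.ndarray:
    """Average-pool patch tokens of shape (n_patches, d) into one
    L2-normalized global descriptor of shape (d,)."""
    desc = patch_tokens.mean(axis=0)
    return desc / np.linalg.norm(desc)

# 224x224 input / 14px patches -> 16x16 = 256 patch tokens, dim 384 for ViT-S
tokens = np.random.rand(256, 384).astype(np.float32)
descriptor = average_pool_descriptor(tokens)
```

L2-normalizing the pooled vector makes dot products equal cosine similarity, which is the usual retrieval metric for such descriptors.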
---
## Executive Summary
| Aggregation | Recall (MSLS Challenge R@1) | Latency | Memory | Training | Recommendation |
|-------------|-----------------------------|---------|--------|----------|----------------|
| **Average pooling** | ~42–50% (est. from AnyLoc/GeM) | ~50ms | Low | None | Baseline |
| **GeM pooling** | 62.6% (DINOv2 GeM) | ~50ms | Low | Yes (80 epochs) | Simple upgrade |
| **SALAD** | **75.0%** | **<3ms** (extract+aggregate) | 0.63 GB retrieval | 30 min | **Best** |
**Recommendation**: SALAD provides the largest recall gain (+12–25 pp over GeM, +25–33 pp over average) with negligible latency overhead. GeM is a simpler middle-ground if training is constrained. SALAD was validated with ViT-B; ViT-S support is architecture-agnostic but requires config changes and may reduce recall.
---
## 1. Recall Improvement: SALAD vs Average Pooling
### 1.1 Benchmark Numbers (from SALAD paper, arxiv 2311.15937)
**Single-stage baselines (Table 1), R@1 / R@5 / R@10 per dataset:**
| Method | MSLS Challenge | MSLS Val | NordLand | Pitts250k-test | SPED |
|--------|----------------|----------|----------|----------------|------|
| **GeM** (ResNet) | 49.7 / 64.2 / 67.0 | 78.2 / 86.6 / 89.6 | 21.6 / 37.3 / 44.2 | 87.0 / 94.4 / 96.3 | 66.7 / 83.4 / 88.0 |
| **DINOv2 SALAD** | **75.0 / 88.8 / 91.3** | **92.2 / 96.4 / 97.0** | **76.0 / 89.2 / 92.0** | **95.1 / 98.5 / 99.1** | **92.1 / 96.2 / 96.5** |
**Ablation: DINOv2 with different aggregations (Table 3):**
| Method | Feature Dim | MSLS Challenge R@1 | MSLS Val R@1 | NordLand R@1 | Pitts250k R@1 | SPED R@1 |
|--------|-------------|--------------------|--------------|--------------|--------------|----------|
| DINOv2 AnyLoc (VLAD, no fine-tune) | 49152 | 42.2 | 68.7 | 16.1 | 87.2 | 85.3 |
| **DINOv2 GeM** | 4096 | 62.6 | 85.4 | 35.4 | 89.5 | 83.0 |
| DINOv2 MixVPR | 4096 | 72.1 | 90.0 | 63.6 | 94.6 | 89.8 |
| DINOv2 NetVLAD | 24576 | 75.8 | 92.4 | 71.8 | 95.6 | 90.8 |
| **DINOv2 SALAD** | 8192+256 | **75.0** | **92.2** | **76.0** | **95.1** | **92.1** |
**Recall improvement (SALAD vs baselines):**
- vs DINOv2 GeM: +12.4 pp (MSLS Challenge), +6.8 pp (MSLS Val), +40.6 pp (NordLand)
- vs DINOv2 AnyLoc (closest to “average-like”): +32.8 pp (MSLS Challenge), +23.5 pp (MSLS Val), +59.9 pp (NordLand)
- vs ResNet GeM: +25.3 pp (MSLS Challenge), +14.0 pp (MSLS Val)
**Sources:**
- [SALAD paper (arxiv 2311.15937)](https://arxiv.org/abs/2311.15937)
- [CVPR 2024 paper](https://openaccess.thecvf.com/content/CVPR2024/html/Izquierdo_Optimal_Transport_Aggregation_for_Visual_Place_Recognition_CVPR_2024_paper.html)
- [SALAD project page](https://serizba.github.io/salad.html)
- [GitHub: serizba/salad](https://github.com/serizba/salad)
---
## 2. Computational Overhead: SALAD vs Average Pooling
### 2.1 Latency (Table 2 from SALAD paper)
| Method | Retrieval (ms) | Reranking (ms) | Total (ms) | MSLS Challenge R@1 |
|--------|----------------|----------------|------------|--------------------|
| Patch-NetVLAD | 908.30 | 8377.17 | ~9285 | 48.1 |
| TransVPR | 22.72 | 1757.70 | ~1780 | 63.9 |
| R2Former | 4.7 | 202.37 | ~207 | 73.0 |
| **DINOv2 SALAD** | **0.63** | **0.0** | **2.41** | **75.0** |
- SALAD: **<3 ms per image** (RTX 3090), single-stage, no re-ranking.
- Sinkhorn iterations: O(n²) per iteration, but n = number of patches (~256 for 224×224), so cost is small vs backbone.
- Backbone (DINOv2) dominates; SALAD adds only a few ms.
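The Sinkhorn step above can be sketched in a few lines: alternating row/column normalization of a patch-to-cluster score matrix, each iteration O(n·m) over the n×m matrix. This is a minimal illustration, not SALAD's actual implementation (cluster count and iteration count are illustrative; SALAD also adds a dustbin column):

```python
import numpy as np

def sinkhorn(scores: np.ndarray, n_iters: int = 3) -> np.ndarray:
    """Alternately normalize rows and columns of exp(scores) so the result
    approaches a soft assignment with balanced cluster mass."""
    P = np.exp(scores - scores.max())  # shift for numerical stability
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)  # each patch's assignments sum to 1
        P = P / P.sum(axis=0, keepdims=True)  # each cluster's total mass sums to 1
    return P

# 256 patches (224x224 input, 14px patches) x 64 clusters -- illustrative sizes
P = sinkhorn(np.random.randn(256, 64))
```

With only ~256 patches, a handful of these iterations is far cheaper than one ViT forward pass, which is why the backbone dominates total latency.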
### 2.2 Memory
- DINOv2 SALAD: **0.63 GB** retrieval memory (MSLS Val, ~18k images).
- Global descriptor only (no local feature storage for re-ranking).
- SALAD descriptor: 8192+256 dims vs average pooling ~384 (ViT-S) or 768 (ViT-B).
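The reported 0.63 GB is consistent with a back-of-the-envelope estimate for float32 descriptors; the exact database size below (~18k images) is an approximation from the bullet above:

```python
n_images = 18_000        # approx. MSLS Val database size
dims = 8192 + 256        # SALAD global descriptor dimensionality
bytes_per_value = 4      # float32

total_gb = n_images * dims * bytes_per_value / 1e9
print(total_gb)          # -> 0.608256, i.e. ~0.61 GB
```

The small gap to 0.63 GB is plausibly index overhead or a slightly larger database; the point is that global-descriptor-only storage stays under a gigabyte at this scale.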
---
## 3. GeM Pooling as Middle-Ground
### 3.1 GeM vs Average Pooling
- GeM (Generalized Mean): \( \mathrm{GeM}(x) = \left( \frac{1}{n}\sum_{i=1}^{n} x_i^p \right)^{1/p} \); p=1 → average, p→∞ → max.
- GeM is a learned generalization of average pooling; typically improves retrieval.
- SALAD paper: DINOv2 GeM **62.6%** R@1 (MSLS Challenge) vs ResNet GeM **49.7%**.
- No direct DINOv2 average-pooling numbers in the paper; AnyLoc (VLAD, no fine-tune) is ~42.2% R@1.
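A minimal NumPy sketch of GeM over patch tokens; in practice p is a learned parameter trained end-to-end (p=3 is a common initialization, and the offset added to the random tokens below just keeps them safely positive for the toy check):

```python
import numpy as np

def gem_pool(tokens: np.ndarray, p: float = 3.0, eps: float = 1e-6) -> np.ndarray:
    """Generalized-mean pooling over patch tokens (n_patches, d).
    p=1 reduces to average pooling; larger p weights strong activations more."""
    return (np.clip(tokens, eps, None) ** p).mean(axis=0) ** (1.0 / p)

tokens = np.random.rand(256, 384) + 0.5  # illustrative positive tokens
pooled = gem_pool(tokens, p=3.0)
```

Because it is a single parameterized pooling layer, swapping average pooling for GeM is a one-line model change; the cost is that p (and usually the backbone) must be trained.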
### 3.2 GeM vs SALAD
| Metric | DINOv2 GeM | DINOv2 SALAD |
|--------|------------|--------------|
| MSLS Challenge R@1 | 62.6 | 75.0 |
| NordLand R@1 | 35.4 | 76.0 |
| Descriptor size | 4096 | 8192+256 |
| Training | 80 epochs (MixVPR pipeline) | 4 epochs, 30 min |
| Implementation | Simple (one layer) | Sinkhorn + MLP |
**Conclusion**: GeM is a simpler upgrade over average pooling; SALAD gives a larger gain, especially on hard datasets (e.g. NordLand +40.6 pp).
**Sources:**
- [GeM paper (Radenović et al.)](https://arxiv.org/abs/1811.00202)
- [DINO-Mix (Nature 2024)](https://www.nature.com/articles/s41598-024-73853-3) — GeM + attention
---
## 4. DINOv2 Aggregation Methods 2025–2026
| Method | Year | Aggregation | Key result | Outperforms SALAD? |
|--------|------|-------------|------------|--------------------|
| **SALAD** | CVPR 2024 | Optimal transport (Sinkhorn) | 75.0% MSLS Challenge | — |
| **DINO-Mix** | 2023/2024 | MLP-Mixer | 91.75% Tokyo24/7, 80.18% NordLand | Different benchmarks |
| **DINO-MSRA** | 2025 | Multi-scale residual attention | UAV–satellite cross-view | Cross-view only |
| **CV-Cities** | 2024 | DINOv2 + feature mixer | Cross-view geo-localization | Cross-view only |
| **UAV Self-Localization (GLFA+CESP)** | 2025 | Custom (GLFA, CESP) | 86.27% R@1 DenseUAV | Cross-view only |
**Conclusion**: No standard VPR benchmark shows a 2025 method clearly beating SALAD. New work focuses on cross-view (UAV–satellite) with custom architectures rather than generic aggregation.
**Sources:**
- [DINO-Mix arxiv 2311.00230](https://arxiv.org/abs/2311.00230)
- [DINO-MSRA (2025)](https://www.dqxxkx.cn/EN/10.12082/dqxxkx.2025.250051)
- [CV-Cities arxiv 2411.12431](https://arxiv.org/html/2411.12431)
- [DINOv2 UAV Self-Localization (2025)](https://ui.adsabs.harvard.edu/abs/2025IRAL...10.2080Y/abstract)
---
## 5. Cross-View (UAV-to-Satellite) Retrieval
### 5.1 Methods
| Method | Dataset | R@1 | Notes |
|--------|---------|-----|------|
| DINOv2 + GLFA + CESP | DenseUAV | 86.27% | Custom enhancement, not SALAD |
| DINO-MSRA | UAV–satellite | — | Multi-scale residual attention |
| CV-Cities | Ground–satellite | — | 223k pairs, 16 cities |
| Training-free (LLM + PCA) | — | — | Zero-shot, DINOv2 features |
### 5.2 SALAD on Cross-View
- SALAD is evaluated on same-view VPR (street-level, dashcam), not UAV–satellite.
- Cross-view papers use DINOv2 + custom modules (GLFA, CESP, multi-scale attention).
- **Recommendation**: For UAV–satellite, try SALAD first; if insufficient, consider DINO-MSRA or GLFA+CESP.
**Sources:**
- [DINOv2 UAV Self-Localization](https://ui.adsabs.harvard.edu/abs/2025IRAL...10.2080Y/abstract)
- [DINO-MSRA](https://www.dqxxkx.cn/EN/10.12082/dqxxkx.2025.250051)
- [Street2Orbit (training-free)](https://jeonghomin.github.io/street2orbit.github.io/)
---
## 6. SALAD with ViT-S/14
### 6.1 Paper Configuration
- SALAD paper uses **DINOv2-B** (768-dim, 86M params).
- Table 4: ViT-S (384-dim, 21M), ViT-B (768-dim, 86M), ViT-L (1024-dim, 300M), ViT-G (1536-dim, 1.1B).
- Figure 3: different backbone sizes tested; ViT-B chosen for performance/size trade-off.
### 6.2 ViT-S Compatibility
- SALAD is backbone-agnostic: score MLP and aggregation take token dim `d` as input.
- ViT-S: d=384 → adjust `W_s1`, `W_f1`, `W_g1` (512 hidden) and output dims.
- No ViT-S results in the paper; expect lower recall than ViT-B (e.g. ~2–3 pp from Nature ViT comparison).
**Conclusion**: SALAD can work with ViT-S with config changes; recall will likely be lower than ViT-B but still above average/GeM.
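Under the assumption that SALAD's head layers depend on the backbone only through the token dim `d`, the ViT-S adaptation is a shape-configuration change. The helper below is purely illustrative (layer names and the hidden size are not taken from the SALAD codebase, though 64 clusters × 128 dims matches the 8192-dim descriptor reported above):

```python
def salad_head_shapes(token_dim: int, hidden: int = 512,
                      n_clusters: int = 64, cluster_dim: int = 128) -> dict:
    """Illustrative SALAD-style head layer shapes parameterized by token dim.
    Descriptor size = n_clusters * cluster_dim (+ global token dim)."""
    return {
        "score_mlp": [(token_dim, hidden), (hidden, n_clusters)],  # patch -> cluster scores
        "feat_mlp": [(token_dim, hidden), (hidden, cluster_dim)],  # patch -> reduced feature
        "descriptor_dim": n_clusters * cluster_dim,
    }

vit_s = salad_head_shapes(384)  # DINOv2 ViT-S/14 tokens
vit_b = salad_head_shapes(768)  # DINOv2 ViT-B/14 tokens (paper config)
```

Only the first weight matrix of each MLP changes between backbones; the aggregation itself is unchanged, which is what makes the method backbone-agnostic.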
**Sources:**
- [SALAD paper Sec 4.1](https://arxiv.org/html/2311.15937v1)
- [DINOv2 ViT comparison (Nature 2024)](https://www.nature.com/articles/s41598-024-83358-8)
- [DINOv2 MODEL_CARD](https://github.com/facebookresearch/dinov2/blob/main/MODEL_CARD.md)
---
## 7. Structured Comparison Table
| Aggregation | MSLS Ch. R@1 | NordLand R@1 | Pitts250k R@1 | Latency | Memory | Training | ViT-S | Cross-view |
|-------------|--------------|--------------|---------------|---------|--------|----------|-------|------------|
| **Average pooling** | ~42–50 | ~16–35 | ~87 | ~50ms | Low | None | ✓ | Unknown |
| **GeM** | 62.6 | 35.4 | 89.5 | ~50ms | Low | 80 ep | ✓ | Unknown |
| **SALAD** | **75.0** | **76.0** | **95.1** | **<3ms** | 0.63 GB | 30 min | Config change | Not evaluated |
| DINO-Mix | — | 80.2 | — | — | — | Yes | — | — |
| NetVLAD (dim red.) | 73.3 | 70.1 | 95.4 | — | — | Yes | — | — |
---
## 8. Recommendation for GPS-Denied UAV System
| Priority | Option | Rationale |
|----------|--------|-----------|
| **1** | **SALAD** | Largest recall gain, <3 ms overhead, 30 min training. Adapt config for ViT-S if needed. |
| **2** | **GeM** | Simpler than SALAD, clear gain over average pooling, minimal code change. |
| **3** | **Average pooling** | Keep only if no training is possible and latency is critical. |
**Implementation path:**
1. Add GeM pooling as a low-effort upgrade (no Sinkhorn, small code change).
2. Integrate SALAD (e.g. from [serizba/salad](https://github.com/serizba/salad)); adapt for ViT-S (d=384).
3. Benchmark on UAV–satellite data; compare SALAD vs GeM vs average.
4. If cross-view performance is weak, consider DINO-MSRA or GLFA+CESP.
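For step 3, the aggregations can be compared with a standard Recall@K over cosine similarity of L2-normalized global descriptors. This is a generic evaluation sketch, not tied to any particular dataset loader; the toy data at the bottom is only a sanity check:

```python
import numpy as np

def recall_at_k(query_desc: np.ndarray, db_desc: np.ndarray,
                ground_truth: list, k: int = 1) -> float:
    """Fraction of queries whose top-k database matches contain a correct index.
    query_desc: (Q, d), db_desc: (N, d), both L2-normalized;
    ground_truth: per-query set of correct database indices."""
    sims = query_desc @ db_desc.T               # cosine similarity matrix (Q, N)
    topk = np.argsort(-sims, axis=1)[:, :k]     # best-k database indices per query
    hits = [len(set(row) & ground_truth[i]) > 0 for i, row in enumerate(topk)]
    return float(np.mean(hits))

# sanity check: queries identical to three orthonormal database descriptors
db = np.eye(3)
queries = np.eye(3)
r1 = recall_at_k(queries, db, [{0}, {1}, {2}], k=1)  # -> 1.0
```

Running the same harness over average, GeM, and SALAD descriptors keeps the backbone and tile database fixed, so differences in R@1 isolate the aggregation choice.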
---
## Source URLs
| Source | URL |
|--------|-----|
| SALAD paper | https://arxiv.org/abs/2311.15937 |
| SALAD HTML | https://arxiv.org/html/2311.15937v1 |
| SALAD project | https://serizba.github.io/salad.html |
| SALAD GitHub | https://github.com/serizba/salad |
| CVPR 2024 | https://openaccess.thecvf.com/content/CVPR2024/html/Izquierdo_Optimal_Transport_Aggregation_for_Visual_Place_Recognition_CVPR_2024_paper.html |
| DINO-Mix | https://arxiv.org/abs/2311.00230 |
| DINO-Mix Nature | https://www.nature.com/articles/s41598-024-73853-3 |
| GeM paper | https://arxiv.org/abs/1811.00202 |
| DINOv2 paper | https://arxiv.org/abs/2304.07193 |
| DINOv2 GitHub | https://github.com/facebookresearch/dinov2 |
| DINO-MSRA 2025 | https://www.dqxxkx.cn/EN/10.12082/dqxxkx.2025.250051 |
| CV-Cities | https://arxiv.org/html/2411.12431 |
| UAV Self-Loc | https://ui.adsabs.harvard.edu/abs/2025IRAL...10.2080Y/abstract |
| dinov2-retrieval | https://github.com/vra/dinov2-retrieval |
| Emergent Mind SALAD | https://www.emergentmind.com/topics/dinov2-features-with-salad-aggregation |