# DINOv2 Feature Aggregation for Visual Place Recognition / Image Retrieval

**Research Date**: March 2025

**Context**: GPS-denied UAV navigation using DINOv2 ViT-S/14 for coarse satellite tile retrieval. Current approach: spatial average pooling of patch tokens.

---

## Executive Summary

| Aggregation | Recall (MSLS Challenge R@1) | Latency | Memory | Training | Recommendation |
|-------------|-----------------------------|---------|--------|----------|----------------|
| **Average pooling** | ~42–50% (est. from AnyLoc/GeM) | ~50ms | Low | None | Baseline |
| **GeM pooling** | 62.6% (DINOv2 GeM) | ~50ms | Low | Yes (80 epochs) | Simple upgrade |
| **SALAD** | **75.0%** | **<3ms** (extract+aggregate) | 0.63 GB retrieval | 30 min | **Best** |

**Recommendation**: SALAD provides the largest recall gain (+12–25 pp over GeM, +25–33 pp over average pooling) with negligible latency overhead. GeM is a simpler middle ground if training resources are constrained. SALAD was validated with ViT-B; its aggregation is architecture-agnostic, so ViT-S works with config changes, though recall may drop slightly.
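
As a reference point, the current baseline, spatial average pooling of DINOv2 patch tokens into one global descriptor, amounts to the following (illustrative numpy sketch; the random array stands in for real ViT-S/14 patch tokens):

```python
import numpy as np

def average_pool_descriptor(patch_tokens: np.ndarray) -> np.ndarray:
    """Average-pool ViT patch tokens into one L2-normalized global descriptor.

    patch_tokens: (n_patches, d) array, e.g. (256, 384) for a 224x224 input
    through DINOv2 ViT-S/14 (16x16 patches, d = 384).
    """
    desc = patch_tokens.mean(axis=0)       # (d,) spatial average over patches
    return desc / np.linalg.norm(desc)     # unit norm for cosine retrieval

# Dummy tokens standing in for DINOv2 output (real tokens come from the model).
tokens = np.random.default_rng(0).standard_normal((256, 384))
d = average_pool_descriptor(tokens)
assert d.shape == (384,) and np.isclose(np.linalg.norm(d), 1.0)
```

Every aggregation in the tables below replaces only this pooling step; the backbone forward pass is unchanged.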

---

## 1. Recall Improvement: SALAD vs Average Pooling

### 1.1 Benchmark Numbers (from SALAD paper, arXiv 2311.15937)

**Single-stage baselines (Table 1):**

| Method | MSLS Challenge (R@1 / R@5 / R@10) | MSLS Val (R@1 / R@5 / R@10) | NordLand (R@1 / R@5 / R@10) | Pitts250k-test (R@1 / R@5 / R@10) | SPED (R@1 / R@5 / R@10) |
|--------|-----------------------------------|-----------------------------|-----------------------------|-----------------------------------|-------------------------|
| **GeM** (ResNet) | 49.7 / 64.2 / 67.0 | 78.2 / 86.6 / 89.6 | 21.6 / 37.3 / 44.2 | 87.0 / 94.4 / 96.3 | 66.7 / 83.4 / 88.0 |
| **DINOv2 SALAD** | **75.0 / 88.8 / 91.3** | **92.2 / 96.4 / 97.0** | **76.0 / 89.2 / 92.0** | **95.1 / 98.5 / 99.1** | **92.1 / 96.2 / 96.5** |

**Ablation: DINOv2 with different aggregations (Table 3):**

| Method | Feature Dim | MSLS Challenge R@1 | MSLS Val R@1 | NordLand R@1 | Pitts250k R@1 | SPED R@1 |
|--------|-------------|--------------------|--------------|--------------|---------------|----------|
| DINOv2 AnyLoc (VLAD, no fine-tune) | 49152 | 42.2 | 68.7 | 16.1 | 87.2 | 85.3 |
| **DINOv2 GeM** | 4096 | 62.6 | 85.4 | 35.4 | 89.5 | 83.0 |
| DINOv2 MixVPR | 4096 | 72.1 | 90.0 | 63.6 | 94.6 | 89.8 |
| DINOv2 NetVLAD | 24576 | **75.8** | **92.4** | 71.8 | **95.6** | 90.8 |
| **DINOv2 SALAD** | 8192+256 | 75.0 | 92.2 | **76.0** | 95.1 | **92.1** |

Note: bold marks the best value per column. DINOv2 NetVLAD edges out SALAD on three benchmarks, but with a ~3× larger descriptor (24576 vs 8192+256 dims).

**Recall improvement (SALAD vs baselines):**

- vs DINOv2 GeM: +12.4 pp (MSLS Challenge), +6.8 pp (MSLS Val), +40.6 pp (NordLand)
- vs DINOv2 AnyLoc (closest to "average-like"): +32.8 pp (MSLS Challenge), +23.5 pp (MSLS Val), +59.9 pp (NordLand)
- vs ResNet GeM: +25.3 pp (MSLS Challenge), +14.0 pp (MSLS Val)

**Sources:**

- [SALAD paper (arXiv 2311.15937)](https://arxiv.org/abs/2311.15937)
- [CVPR 2024 paper](https://openaccess.thecvf.com/content/CVPR2024/html/Izquierdo_Optimal_Transport_Aggregation_for_Visual_Place_Recognition_CVPR_2024_paper.html)
- [SALAD project page](https://serizba.github.io/salad.html)
- [GitHub: serizba/salad](https://github.com/serizba/salad)

---

## 2. Computational Overhead: SALAD vs Average Pooling

### 2.1 Latency (Table 2 from SALAD paper)

| Method | Retrieval (ms) | Reranking (ms) | Total (ms) | MSLS Challenge R@1 |
|--------|----------------|----------------|------------|--------------------|
| Patch-NetVLAD | 908.30 | 8377.17 | ~9285 | 48.1 |
| TransVPR | 22.72 | 1757.70 | ~1780 | 63.9 |
| R2Former | 4.7 | 202.37 | ~207 | 73.0 |
| **DINOv2 SALAD** | **0.63** | **0.0** | **2.41** | **75.0** |

- SALAD: **<3 ms per image** (RTX 3090), single-stage, no re-ranking; its 2.41 ms total covers extraction + aggregation on top of the 0.63 ms retrieval, which is why it exceeds the sum of the two columns.
- Sinkhorn iterations: O(n²) per iteration in the worst case, but n = number of patch tokens (~256 for a 224×224 input at patch size 14), so the cost is small vs the backbone.
- The backbone (DINOv2) forward pass dominates; SALAD aggregation adds only a few ms.
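
The Sinkhorn cost argument can be made concrete with a minimal numpy sketch (illustrative only, not the serizba/salad implementation; the uniform marginals, iteration count, and temperature are assumptions). Each iteration is a pair of broadcasted normalizations over the full patch-by-cluster score matrix, i.e. cost proportional to its size:

```python
import numpy as np

def sinkhorn(scores: np.ndarray, n_iters: int = 3, eps: float = 0.5) -> np.ndarray:
    """Sinkhorn normalization of an (n_patches, n_clusters) score matrix into a
    soft assignment whose rows/columns approach uniform marginals.

    Per-iteration cost is two matrix-vector products, i.e. O(n_patches * n_clusters).
    """
    K = np.exp(scores / eps)                  # Gibbs kernel of the scores
    n, m = K.shape
    r = np.full(n, 1.0 / n)                   # uniform patch marginal
    c = np.full(m, 1.0 / m)                   # uniform cluster marginal
    u = np.ones(n)
    for _ in range(n_iters):
        u = r / (K @ (c / (K.T @ u)))         # alternate row/column scaling
    v = c / (K.T @ u)
    return (u[:, None] * K) * v[None, :]      # transport plan (soft assignment)

# 256 patch tokens assigned to 64 clusters: a 256x64 matrix per iteration.
P = sinkhorn(np.random.default_rng(0).standard_normal((256, 64)))
assert np.allclose(P.sum(axis=0), 1.0 / 64)   # columns hit the cluster marginal
assert abs(P.sum() - 1.0) < 1e-9
```

A 256×64 matrix is tiny next to a ViT forward pass, which is the point of the bullet above.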

### 2.2 Memory

- DINOv2 SALAD: **0.63 GB** retrieval memory (MSLS Val, ~18k images).
- Global descriptor only (no local feature storage for re-ranking).
- SALAD descriptor: 8192+256 dims vs average pooling ~384 (ViT-S) or 768 (ViT-B).

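
The memory figure is consistent with simple float32 arithmetic (image count and descriptor dims taken from the bullets above):

```python
n_images = 18_000                  # MSLS Val database size (approx.)
dims = 8192 + 256                  # SALAD global descriptor dimensions
bytes_total = n_images * dims * 4  # 4 bytes per float32 value
print(bytes_total / 1e9)           # ~0.61 GB, in line with the reported 0.63 GB
```
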
---

## 3. GeM Pooling as Middle-Ground

### 3.1 GeM vs Average Pooling

- GeM (Generalized Mean): \( \text{GeM}(x) = \left( \frac{1}{n}\sum_{i=1}^{n} x_i^p \right)^{1/p} \); p = 1 recovers average pooling, p → ∞ approaches max pooling.
- GeM is a learned generalization of average pooling (p is trained); it typically improves retrieval over plain averaging.
- SALAD paper: DINOv2 GeM **62.6%** R@1 (MSLS Challenge) vs ResNet GeM **49.7%**.
- No direct DINOv2 average-pooling numbers in the paper; AnyLoc (VLAD, no fine-tune), at 42.2% R@1, is the closest training-free reference.

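
A minimal GeM sketch over patch tokens (illustrative numpy; in the learned setting p is a trainable parameter):

```python
import numpy as np

def gem(x: np.ndarray, p: float = 3.0, eps: float = 1e-6) -> np.ndarray:
    """Generalized mean over the patch axis: (mean(x^p))^(1/p).

    x: (n_patches, d) activations, clamped below by eps so x^p is defined.
    p = 1 gives average pooling; large p approaches max pooling.
    """
    x = np.clip(x, eps, None)
    return (x ** p).mean(axis=0) ** (1.0 / p)

x = np.abs(np.random.default_rng(0).standard_normal((256, 384)))
assert np.allclose(gem(x, p=1.0), x.mean(axis=0))             # p = 1 -> average
assert np.allclose(gem(x, p=100.0), x.max(axis=0), rtol=0.1)  # large p -> ~max
```

This is why GeM is the low-effort upgrade: it is a one-line change to the pooling step plus one learned scalar.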
### 3.2 GeM vs SALAD

| Metric | DINOv2 GeM | DINOv2 SALAD |
|--------|------------|--------------|
| MSLS Challenge R@1 | 62.6 | 75.0 |
| NordLand R@1 | 35.4 | 76.0 |
| Descriptor size | 4096 | 8192+256 |
| Training | 80 epochs (MixVPR pipeline) | 4 epochs, 30 min |
| Implementation | Simple (one layer) | Sinkhorn + MLP |

**Conclusion**: GeM is a simpler upgrade over average pooling; SALAD gives a larger gain, especially on hard datasets (e.g. NordLand, +40.6 pp).

**Sources:**

- [GeM paper (Radenović et al.)](https://arxiv.org/abs/1811.00202)
- [DINO-Mix (Nature 2024)](https://www.nature.com/articles/s41598-024-73853-3) — GeM + attention

---

## 4. DINOv2 Aggregation Methods 2025–2026

| Method | Year | Aggregation | Key result | Outperforms SALAD? |
|--------|------|-------------|------------|--------------------|
| **SALAD** | CVPR 2024 | Optimal transport (Sinkhorn) | 75.0% MSLS Challenge | — |
| **DINO-Mix** | 2023/2024 | MLP-Mixer | 91.75% Tokyo24/7, 80.18% NordLand | Different benchmarks |
| **DINO-MSRA** | 2025 | Multi-scale residual attention | UAV–satellite cross-view | Cross-view only |
| **CV-Cities** | 2024 | DINOv2 + feature mixer | Cross-view geo-localization | Cross-view only |
| **UAV Self-Localization (GLFA+CESP)** | 2025 | Custom (GLFA, CESP) | 86.27% R@1 DenseUAV | Cross-view only |

**Conclusion**: No standard VPR benchmark shows a 2025 method clearly beating SALAD. New work focuses on cross-view (UAV–satellite) retrieval with custom architectures rather than generic aggregation.

**Sources:**

- [DINO-Mix arXiv 2311.00230](https://arxiv.org/abs/2311.00230)
- [DINO-MSRA (2025)](https://www.dqxxkx.cn/EN/10.12082/dqxxkx.2025.250051)
- [CV-Cities arXiv 2411.12431](https://arxiv.org/html/2411.12431)
- [DINOv2 UAV Self-Localization (2025)](https://ui.adsabs.harvard.edu/abs/2025IRAL...10.2080Y/abstract)

---

## 5. Cross-View (UAV-to-Satellite) Retrieval

### 5.1 Methods

| Method | Dataset | R@1 | Notes |
|--------|---------|-----|-------|
| DINOv2 + GLFA + CESP | DenseUAV | 86.27% | Custom enhancement, not SALAD |
| DINO-MSRA | UAV–satellite | — | Multi-scale residual attention |
| CV-Cities | Ground–satellite | — | 223k pairs, 16 cities |
| Training-free (LLM + PCA) | — | — | Zero-shot, DINOv2 features |

### 5.2 SALAD on Cross-View

- SALAD is evaluated on same-view VPR (street-level, dashcam), not UAV–satellite.
- Cross-view papers use DINOv2 + custom modules (GLFA, CESP, multi-scale attention).
- **Recommendation**: For UAV–satellite, try SALAD first; if insufficient, consider DINO-MSRA or GLFA+CESP.

**Sources:**

- [DINOv2 UAV Self-Localization](https://ui.adsabs.harvard.edu/abs/2025IRAL...10.2080Y/abstract)
- [DINO-MSRA](https://www.dqxxkx.cn/EN/10.12082/dqxxkx.2025.250051)
- [Street2Orbit (training-free)](https://jeonghomin.github.io/street2orbit.github.io/)

---

## 6. SALAD with ViT-S/14

### 6.1 Paper Configuration

- The SALAD paper uses **DINOv2-B** (768-dim tokens, 86M params) as its backbone.
- Table 4: ViT-S (384-dim, 21M), ViT-B (768-dim, 86M), ViT-L (1024-dim, 300M), ViT-G (1536-dim, 1.1B).
- Figure 3: different backbone sizes tested; ViT-B chosen for the performance/size trade-off.

### 6.2 ViT-S Compatibility

- SALAD is backbone-agnostic: the score MLP and aggregation take the token dim `d` as input.
- ViT-S: d=384 → adjust `W_s1`, `W_f1`, `W_g1` (512 hidden) and the output dims accordingly.
- No ViT-S results in the paper; expect lower recall than ViT-B (e.g. ~2–3 pp, extrapolating from the Nature ViT-size comparison).

**Conclusion**: SALAD can work with ViT-S after config changes; recall will likely be lower than with ViT-B but still above average pooling or GeM.
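
As a sketch of why only the first-layer widths change (illustrative shapes only, not the serizba/salad code; the hidden, cluster, and reduction sizes here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_salad_style_head(d: int, hidden: int = 512, m_clusters: int = 64, l_reduced: int = 128):
    """Illustrative SALAD-style head: only the first-layer widths depend on the
    backbone token dim d (768 for ViT-B, 384 for ViT-S).
    (The global-token branch W_g1 is omitted from this sketch.)"""
    return {
        "W_s1": rng.standard_normal((d, hidden)) * 0.02,       # score MLP, layer 1
        "W_s2": rng.standard_normal((hidden, m_clusters)) * 0.02,
        "W_f1": rng.standard_normal((d, l_reduced)) * 0.02,    # per-patch feature reduction
    }

def forward(tokens, head):
    scores = np.maximum(tokens @ head["W_s1"], 0.0) @ head["W_s2"]  # (n, m) cluster scores
    feats = tokens @ head["W_f1"]                                   # (n, l) reduced features
    return scores, feats

# Swapping ViT-B (d=768) for ViT-S (d=384) changes only first-layer shapes;
# the downstream score/feature shapes, and hence the aggregation, are unchanged.
for d in (768, 384):
    s, f = forward(rng.standard_normal((256, d)), make_salad_style_head(d))
    assert s.shape == (256, 64) and f.shape == (256, 128)
```
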

**Sources:**

- [SALAD paper Sec 4.1](https://arxiv.org/html/2311.15937v1)
- [DINOv2 ViT comparison (Nature 2024)](https://www.nature.com/articles/s41598-024-83358-8)
- [DINOv2 MODEL_CARD](https://github.com/facebookresearch/dinov2/blob/main/MODEL_CARD.md)

---

## 7. Structured Comparison Table

| Aggregation | MSLS Ch. R@1 | NordLand R@1 | Pitts250k R@1 | Latency | Memory | Training | ViT-S | Cross-view |
|-------------|--------------|--------------|---------------|---------|--------|----------|-------|------------|
| **Average pooling** | ~42–50 | ~16–35 | ~87 | ~50ms | Low | None | ✓ | Unknown |
| **GeM** | 62.6 | 35.4 | 89.5 | ~50ms | Low | 80 ep | ✓ | Unknown |
| **SALAD** | **75.0** | **76.0** | **95.1** | **<3ms** | 0.63 GB | 30 min | Config change | Not evaluated |
| DINO-Mix | — | 80.2 | — | — | — | Yes | — | — |
| NetVLAD (dim red.) | 73.3 | 70.1 | 95.4 | — | — | Yes | — | — |

---

## 8. Recommendation for GPS-Denied UAV System

| Priority | Option | Rationale |
|----------|--------|-----------|
| **1** | **SALAD** | Largest recall gain, <3 ms overhead, 30 min training. Adapt config for ViT-S if needed. |
| **2** | **GeM** | Simpler than SALAD, clear gain over average pooling, minimal code change. |
| **3** | **Average pooling** | Keep only if no training is possible and latency is critical. |

**Implementation path:**

1. Add GeM pooling as a low-effort upgrade (no Sinkhorn, small code change).
2. Integrate SALAD (e.g. from [serizba/salad](https://github.com/serizba/salad)); adapt for ViT-S (d=384).
3. Benchmark on UAV–satellite data; compare SALAD vs GeM vs average.
4. If cross-view performance is weak, consider DINO-MSRA or GLFA+CESP.
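
For step 3, a minimal Recall@1 harness over L2-normalized global descriptors (illustrative; assumes a single ground-truth tile index per query, whichever aggregation produced the descriptors):

```python
import numpy as np

def recall_at_k(q: np.ndarray, db: np.ndarray, gt: np.ndarray, k: int = 1) -> float:
    """Fraction of queries whose ground-truth index is in the top-k matches.

    q: (nq, d) query descriptors, db: (ndb, d) database descriptors, both
    L2-normalized so the dot product is cosine similarity; gt: (nq,) index of
    the correct database tile per query.
    """
    sims = q @ db.T                          # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]  # top-k database indices per query
    return float(np.mean([gt[i] in topk[i] for i in range(len(gt))]))

# Tiny synthetic check: each query is a noisy copy of its ground-truth db entry.
rng = np.random.default_rng(0)
db = rng.standard_normal((100, 64))
db /= np.linalg.norm(db, axis=1, keepdims=True)
q = db + 0.05 * rng.standard_normal((100, 64))
q /= np.linalg.norm(q, axis=1, keepdims=True)
assert recall_at_k(q, db, np.arange(100), k=1) == 1.0
```

Running this with descriptors from average pooling, GeM, and SALAD on the same query/database split gives the head-to-head numbers the comparison calls for.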

---

## Source URLs

| Source | URL |
|--------|-----|
| SALAD paper | https://arxiv.org/abs/2311.15937 |
| SALAD HTML | https://arxiv.org/html/2311.15937v1 |
| SALAD project | https://serizba.github.io/salad.html |
| SALAD GitHub | https://github.com/serizba/salad |
| CVPR 2024 | https://openaccess.thecvf.com/content/CVPR2024/html/Izquierdo_Optimal_Transport_Aggregation_for_Visual_Place_Recognition_CVPR_2024_paper.html |
| DINO-Mix | https://arxiv.org/abs/2311.00230 |
| DINO-Mix Nature | https://www.nature.com/articles/s41598-024-73853-3 |
| GeM paper | https://arxiv.org/abs/1811.00202 |
| DINOv2 paper | https://arxiv.org/abs/2304.07193 |
| DINOv2 GitHub | https://github.com/facebookresearch/dinov2 |
| DINO-MSRA 2025 | https://www.dqxxkx.cn/EN/10.12082/dqxxkx.2025.250051 |
| CV-Cities | https://arxiv.org/html/2411.12431 |
| UAV Self-Loc | https://ui.adsabs.harvard.edu/abs/2025IRAL...10.2080Y/abstract |
| dinov2-retrieval | https://github.com/vra/dinov2-retrieval |
| Emergent Mind SALAD | https://www.emergentmind.com/topics/dinov2-features-with-salad-aggregation |