add solution drafts 3 times, used research skill, expand acceptance criteria

Oleksandr Bezdieniezhnykh
2026-03-14 20:38:00 +02:00
parent 767874cb90
commit d764250f9a
23 changed files with 3385 additions and 1 deletions
@@ -0,0 +1,283 @@
# Solution Draft
## Product Solution Description
A Python-based GPS-denied visual navigation service that determines GPS coordinates of consecutive UAV aerial photo centers using visual odometry, satellite image geo-referencing, and sliding window position optimization. The system operates as a background REST API service with real-time SSE streaming.
**Core approach**: Consecutive images are matched using learned features (SuperPoint + LightGlue) to estimate relative motion (visual odometry). Periodically, each image is matched against pre-cached Google Maps satellite tiles to obtain absolute position anchors. A sliding window optimizer fuses VO estimates with satellite anchors, constraining drift. The system handles route disconnections by treating each continuous VO chain as an independent segment, geo-referenced through satellite matching.
```
┌─────────────────────────────────────────────────────────────────────┐
│ Client (Desktop App) │
│ POST /jobs (start GPS, camera params, image folder) │
│ GET /jobs/{id}/stream (SSE) │
│ POST /jobs/{id}/anchor (user manual GPS input) │
│ GET /jobs/{id}/point-to-gps (image_id, pixel_x, pixel_y) │
└──────────────────────┬──────────────────────────────────────────────┘
│ HTTP/SSE
┌──────────────────────▼──────────────────────────────────────────────┐
│ FastAPI Service Layer │
│ Job Manager → Pipeline Orchestrator → SSE Event Emitter │
└──────────────────────┬──────────────────────────────────────────────┘
┌──────────────────────▼──────────────────────────────────────────────┐
│ Processing Pipeline │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────────────┐ │
│ │ Feature │ │ Visual │ │ Satellite Geo-Referencing │ │
│ │ Extractor │→│ Odometry │→│ (cross-view matching) │ │
│ │ (SuperPoint) │ │ (homography) │ │ (SuperPoint+LightGlue) │ │
│ └─────────────┘ └──────────────┘ └────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Sliding Window Position Optimizer │ │
│ │ (VO estimates + satellite anchors + drift constraints) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────▼──────────────────────────────────┐ │
│ │ Segment Manager │ │
│ │ (independent segments, satellite-anchored stitching) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Satellite Tile Cache Manager │ │
│ │ (progressive download, Google Maps Tiles API, disk cache) │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
```
## Existing/Competitor Solutions Analysis
| Solution | Approach | Accuracy | IMU Required | Open Source | Relevance |
|----------|----------|----------|-------------|-------------|-----------|
| YFS90/GNSS-Denied-UAV-Geolocalization | VO + satellite matching + terrain-weighted constraint optimization | <7m MAE | No | Yes (GitHub, 69★) | **Highest** — same constraints, best results |
| AerialPositioning (hamitbugrabayram) | Multi-provider tile engine + deep matchers + perspective warping | Not reported | Simulated INS | Yes (GitHub, 52★) | High — tile engine and perspective warping reference |
| NaviLoc | Trajectory-level VPR + VIO fusion | 19.5m MLE | Yes (VIO) | Partial | Medium — uses IMU, different altitude range |
| ITU Thesis (Öztürk 2025) | ORB-SLAM3 + SuperPoint/SuperGlue/GIM SIM | GPS-level | No | No | High — architecture reference |
| Mateos-Ramirez et al. (2024) | ORB VO + AKAZE satellite + Kalman filter | 143m mean (17km) | Yes | No | Medium — higher altitude, uses IMU |
| VisionUAV-Navigation | Multi-algorithm feature detection + satellite matching | Not reported | Not stated | Yes (GitHub) | Low — early stage |
**Key insight from competitor analysis**: YFS90 achieves <7m without IMU using terrain-weighted constraint optimization. This validates that our target accuracy (20-50m) is realistic and possibly conservative. The sliding window optimization approach is the critical differentiator from simpler VO+satellite systems.
## Architecture
### Component: Feature Extraction
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| SuperPoint | superpoint (PyTorch) | Learned features, robust to viewpoint/illumination changes. Repeatability in aerial scenes. GPU-accelerated. ~80ms per image. | Requires GPU. Fixed descriptor dimension (256). | NVIDIA GPU, PyTorch, CUDA | Model weights from official source only | Free (MIT license) | **Best** |
| SIFT | OpenCV cv2.SIFT | Classical, well-understood. Scale/rotation invariant. Good satellite matching (SIFT+LightGlue top on ISPRS 2025). | Slower than SuperPoint. Less robust to extreme viewpoint changes. | OpenCV | N/A | Free | Good fallback |
| ORB | OpenCV cv2.ORB | Very fast. Many keypoints. | Not scale-invariant. Poor for cross-view matching. | OpenCV | N/A | Free | Only for fast VO |
**Selected**: SuperPoint as primary (both VO and satellite matching — unified pipeline). SIFT as fallback for satellite matching where SuperPoint struggles.
### Component: Feature Matching
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| LightGlue | lightglue (PyTorch) | Fastest learned matcher (~20-50ms). Adaptive pruning. Best on satellite benchmarks. ONNX/TensorRT support for 2-4x speedup. | Requires GPU for best performance. | NVIDIA GPU, PyTorch | Model weights from official source only | Free (Apache 2.0) | **Best** |
| SuperGlue | superglue (PyTorch) | Graph neural network, strong spatial context. 93% match rate (ITU thesis). | Slower than LightGlue (~2x). Non-commercial license. | NVIDIA GPU, PyTorch | Model weights from official source only | Non-commercial license | Backup |
| GIM | gim (PyTorch) | Best generalization for challenging cross-domain scenes. | Additional model complexity. | NVIDIA GPU, PyTorch | Model weights from official source only | Free | Supplementary for difficult matches |
**Selected**: LightGlue as primary. GIM as supplementary for difficult satellite matches.
### Component: Visual Odometry (Consecutive Frame Matching)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| Homography-based VO | OpenCV findHomography, decomposeHomographyMat | Perfect for downward camera + flat terrain. Cleanly gives rotation + translation. Known altitude resolves scale. Simple, fast. | Assumes planar ground (valid for steppe at 400m). Fails at sharp turns (by design). | OpenCV, NumPy | N/A | Free | **Best** |
| Essential matrix VO | OpenCV findEssentialMat, recoverPose | More general than homography. Works for non-planar scenes. | Scale ambiguity harder to resolve. More complex. Unnecessary for our flat terrain case. | OpenCV | N/A | Free | Overengineered |
| ORB-SLAM3 monocular | ORB-SLAM3 | Full SLAM with map management, loop closure. | Heavy dependency. Map building unnecessary. Scale ambiguity. | ROS (optional), C++ | N/A | Free (GPL) | Too complex |
**Selected**: Homography-based VO with SuperPoint+LightGlue features.
**VO Pipeline per frame**:
1. Extract SuperPoint features from current image
2. Match with previous image using LightGlue
3. Estimate homography (cv2.findHomography with RANSAC)
4. Decompose homography → rotation + translation (cv2.decomposeHomographyMat)
5. Select correct decomposition (motion must be consistent with previous direction)
6. Convert pixel displacement to meters: `displacement_m = displacement_px × GSD`
7. Compute GSD as `GSD = (altitude × sensor_width) / (focal_length × image_width)`
8. Update position: `new_pos = prev_pos + rotation × displacement_m`
9. Report inlier ratio as match quality metric
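The scale and position-update steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the image-axis convention (+x right, +y down), the heading convention (clockwise from north), and the example camera parameters are all assumptions made for the sketch.

```python
import math


def ground_sample_distance(altitude_m, sensor_width_mm, focal_length_mm, image_width_px):
    """Metres of ground per pixel for a nadir camera over flat terrain."""
    return (altitude_m * sensor_width_mm) / (focal_length_mm * image_width_px)


def update_position(prev_east_m, prev_north_m, dx_px, dy_px, heading_rad, gsd_m_per_px):
    """Apply one VO step in a local east/north frame.

    dx_px, dy_px: camera displacement expressed in image axes
                  (+x right, +y down), already sign-corrected.
    heading_rad:  clockwise angle from north to the image "up" direction
                  (an assumed convention for this sketch).
    """
    right_m = dx_px * gsd_m_per_px
    forward_m = -dy_px * gsd_m_per_px  # image "up" is negative y
    east = prev_east_m + right_m * math.cos(heading_rad) + forward_m * math.sin(heading_rad)
    north = prev_north_m - right_m * math.sin(heading_rad) + forward_m * math.cos(heading_rad)
    return east, north


# Hypothetical camera: 23.5 mm sensor width, 25 mm lens, 6252 px wide, 400 m altitude.
gsd = ground_sample_distance(400.0, 23.5, 25.0, 6252)
east, north = update_position(0.0, 0.0, 0.0, -1000.0, 0.0, gsd)
```

With a heading of zero, a 1000-pixel shift toward the top of the image moves the estimate north by `1000 × GSD` metres and leaves the east coordinate unchanged.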
### Component: Satellite Image Geo-Referencing
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| Direct cross-view matching with perspective warping | SuperPoint+LightGlue, OpenCV warpPerspective | Pre-warp UAV image to approximate nadir view. Reduces viewpoint gap. Proven approach. | Needs rough camera pose estimate (from VO) for warping. | PyTorch, OpenCV | API key secured | Google Maps API cost | **Best** |
| Template matching (normalized cross-correlation) | OpenCV matchTemplate | Simple, no learning required. | Very sensitive to scale/rotation/illumination differences. Poor for cross-view. | OpenCV | N/A | Free | Poor for cross-view |
| VPR retrieval + refinement (NetVLAD/CosPlace) | torchvision, faiss | Handles large search areas. | Coarse localization only (tile-level). Needs fine-grained refinement step. | PyTorch, faiss | N/A | Free | Supplementary — coarse search |
**Selected**: Direct cross-view matching with perspective warping using SuperPoint + LightGlue.
**Satellite Matching Pipeline per frame**:
1. Estimate approximate position from VO
2. Fetch satellite tile(s) from cache at estimated position (zoom 18, ~0.4m/px)
3. Crop satellite region matching UAV image footprint (with margin)
4. Warp UAV image to approximate nadir view using estimated camera pose
5. Extract SuperPoint features from warped UAV image
6. Extract SuperPoint features from satellite crop (can be pre-computed and cached)
7. Match with LightGlue
8. If insufficient matches: try GIM, try wider search area, try zoom 17
9. If sufficient matches (≥15 inliers):
   a. Estimate homography from matches
   b. Transform image center through homography → satellite pixel coordinates
   c. Convert satellite pixel coordinates to WGS84 using tile geo-referencing
   d. This is the absolute position anchor
10. Report match count and inlier ratio as confidence metrics
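Step 9c, converting a satellite pixel to WGS84, can be sketched with the standard Web Mercator tile formulas. The sketch assumes the common XYZ tile scheme with 256-pixel tiles; the function name is hypothetical, and the tile size should be checked against whatever the tile provider actually returns.

```python
import math

TILE_SIZE_PX = 256  # assumed tile edge length


def tile_pixel_to_wgs84(tile_x, tile_y, zoom, px, py):
    """Convert a pixel inside an XYZ tile to (lat, lon) in degrees.

    px, py are measured from the tile's top-left corner.
    """
    n = 2 ** zoom
    # Fractional tile coordinates across the whole world map.
    x = tile_x + px / TILE_SIZE_PX
    y = tile_y + py / TILE_SIZE_PX
    lon = x / n * 360.0 - 180.0
    lat = math.degrees(math.atan(math.sinh(math.pi * (1.0 - 2.0 * y / n))))
    return lat, lon


# The centre of the single zoom-0 tile is the equator / prime meridian crossing.
lat, lon = tile_pixel_to_wgs84(0, 0, 0, 128, 128)
```

In the pipeline, `(px, py)` would be the image center after it has been pushed through the estimated homography into the satellite crop, offset by the crop's origin within the tile.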
### Component: Sliding Window Position Optimizer
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| Constrained sliding window optimization | scipy.optimize, NumPy | Fuses VO + satellite anchors. Constrains maximum drift. Smooths trajectory. Inspired by YFS90 (<7m). | Window size tuning needed. | SciPy, NumPy | N/A | Free | **Best** |
| Extended Kalman Filter | filterpy | Standard, well-understood. Online fusion. | Linearization approximation. Single-pass, no backward smoothing. | filterpy | N/A | Free | Good simpler alternative |
| Pose Graph Optimization | g2o or GTSAM (Python bindings) | Globally optimal. Handles complex factor graphs. | Heavy C++ dependency. Overkill for sequential processing. | g2o/GTSAM, C++ | N/A | Free | Over-engineered |
**Selected**: Constrained sliding window optimization (primary), with EKF as simpler initial implementation.
**Optimizer behavior**:
- Maintains a sliding window of last N positions (N=20-50)
- VO estimates provide relative motion constraints between consecutive positions
- Satellite matches provide absolute position anchors (hard/soft constraints)
- Maximum drift constraint: cumulative VO displacement between anchors < 100m
- Optimization minimizes: sum of VO residuals + anchor residuals + smoothness penalty
- On each new frame: add to window, re-optimize, emit updated positions
- Enables refinement: earlier positions improve as new anchors arrive
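The objective described above can be illustrated with a small linear least-squares problem. This sketch deliberately swaps `scipy.optimize` for `numpy.linalg.lstsq`, works in a local east/north frame in metres, treats anchors as soft constraints, and does not model the 100m maximum-drift inequality. The function name and all weights are assumptions.

```python
import numpy as np


def optimize_window(vo_deltas, anchors, w_vo=1.0, w_anchor=5.0, w_smooth=0.1):
    """Fuse relative VO steps with absolute anchors over one window.

    vo_deltas: list of (d_east, d_north) between consecutive positions.
    anchors:   dict {position_index: (east, north)} from satellite matches.
    Returns an (N, 2) array of optimized positions.
    """
    n = len(vo_deltas) + 1
    rows, rhs = [], []
    # VO residuals: p[i+1] - p[i] should equal the measured step.
    for i, delta in enumerate(vo_deltas):
        row = np.zeros(n)
        row[i], row[i + 1] = -w_vo, w_vo
        rows.append(row)
        rhs.append(w_vo * np.asarray(delta, dtype=float))
    # Anchor residuals: p[k] should equal the satellite fix.
    for k, fix in anchors.items():
        row = np.zeros(n)
        row[k] = w_anchor
        rows.append(row)
        rhs.append(w_anchor * np.asarray(fix, dtype=float))
    # Smoothness penalty: discourage large second differences.
    for i in range(1, n - 1):
        row = np.zeros(n)
        row[i - 1], row[i], row[i + 1] = w_smooth, -2 * w_smooth, w_smooth
        rows.append(row)
        rhs.append(np.zeros(2))
    solution, *_ = np.linalg.lstsq(np.vstack(rows), np.vstack(rhs), rcond=None)
    return solution


# VO overestimates each 100 m step by 10 m; anchors at both ends pull it back.
positions = optimize_window([(110.0, 0.0)] * 4, {0: (0.0, 0.0), 4: (400.0, 0.0)})
```

Re-running this on every new frame, with the oldest position dropped once the window exceeds N, gives the refinement behavior described: earlier positions shift when a later anchor arrives.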
### Component: Segment Manager
The segment manager is the core architectural pattern, not an edge case handler.
**Segment lifecycle**:
1. **Start condition**: First image of flight, or VO failure (feature match count < threshold)
2. **Active tracking**: VO provides frame-to-frame motion within segment
3. **Anchoring**: Satellite matching provides absolute position for segment's images
4. **End condition**: VO failure (sharp turn, outlier, occlusion)
5. **New segment**: Starts from satellite anchor or user-provided GPS
**Segment states**:
- `ANCHORED`: At least one satellite match provides absolute position → HIGH confidence
- `FLOATING`: No satellite match yet → positioned relative to start point only → LOW confidence
- `USER_ANCHORED`: User provided manual GPS → MEDIUM confidence (human error possible)
**Segment stitching**:
- All segments share the WGS84 coordinate frame via satellite matching
- No direct inter-segment matching needed
- A segment without any satellite anchor remains "floating" and is flagged for user input
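The three states and their confidence levels can be captured in a small data model. This is only a sketch: the class and method names are hypothetical, and the rule that a satellite anchor takes precedence over a user anchor is an assumption inferred from the HIGH versus MEDIUM confidence ranking.

```python
from dataclasses import dataclass, field
from enum import Enum


class SegmentState(Enum):
    FLOATING = "FLOATING"
    USER_ANCHORED = "USER_ANCHORED"
    ANCHORED = "ANCHORED"


CONFIDENCE = {
    SegmentState.ANCHORED: "HIGH",
    SegmentState.USER_ANCHORED: "MEDIUM",
    SegmentState.FLOATING: "LOW",
}


@dataclass
class Segment:
    segment_id: int
    image_ids: list = field(default_factory=list)
    state: SegmentState = SegmentState.FLOATING

    def add_satellite_anchor(self):
        # A satellite match always yields the highest-confidence state.
        self.state = SegmentState.ANCHORED

    def add_user_anchor(self):
        # Assumed precedence: a manual fix never downgrades a satellite anchor.
        if self.state is SegmentState.FLOATING:
            self.state = SegmentState.USER_ANCHORED

    @property
    def confidence(self):
        return CONFIDENCE[self.state]

    @property
    def needs_user_input(self):
        # Floating segments are the ones flagged for manual GPS input.
        return self.state is SegmentState.FLOATING


segment = Segment(segment_id=1)
segment.add_user_anchor()
```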
### Component: Satellite Tile Cache Manager
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| Progressive download with disk cache | aiohttp, aiofiles, sqlite3 | Async download doesn't block pipeline. Tiles cached to disk. Progressive expansion follows route. | Needs internet during processing. First few images may wait for tiles. | Google Maps Tiles API key | API key in env var, not in code | Google Maps API: $200/month free credit covers ~40K tiles | **Best** |
| Pre-download entire area | requests, sqlite3 | All tiles available at start. No download latency during processing. | Requires known bounding box. Large download for unknown routes. Wasteful. | Same | Same | Higher cost if area is large | For known routes |
**Selected**: Progressive download with disk cache.
**Strategy**:
1. On job start: download tiles in radius R=1km around starting GPS at zoom 18
2. As route extends: download tiles ahead of estimated position (radius 500m)
3. Cache tiles on disk in `{zoom}/{x}/{y}.jpg` directory structure
4. Cache is persistent across jobs — tiles are reused for overlapping areas
5. Pre-compute SuperPoint features for cached tiles (saved alongside tile images)
6. If tile download fails or is unavailable: log warning, mark position as VO-only
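Steps 1 to 3 depend on mapping a GPS position to tile indices and to a cache path. A minimal sketch using the standard XYZ (Web Mercator) tiling formulas is shown below. The function names, the cache root, and the example coordinates are hypothetical, and the helper returns the bounding square around the point rather than a true circle.

```python
import math
from pathlib import PurePosixPath

EARTH_RADIUS_M = 6_378_137.0  # WGS84 equatorial radius


def latlon_to_tile(lat, lon, zoom):
    """Return the (x, y) index of the XYZ tile containing a WGS84 point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y


def cache_path(root, zoom, x, y):
    """Build the on-disk location following the {zoom}/{x}/{y}.jpg layout."""
    return PurePosixPath(root) / str(zoom) / str(x) / f"{y}.jpg"


def tiles_around(lat, lon, radius_m, zoom):
    """List tiles covering the bounding square of a radius around a point."""
    dlat = math.degrees(radius_m / EARTH_RADIUS_M)
    dlon = dlat / math.cos(math.radians(lat))
    # Tile y grows southward, so the north-west corner has the smallest indices.
    x_min, y_min = latlon_to_tile(lat + dlat, lon - dlon, zoom)
    x_max, y_max = latlon_to_tile(lat - dlat, lon + dlon, zoom)
    return [(x, y) for x in range(x_min, x_max + 1) for y in range(y_min, y_max + 1)]


# Hypothetical start position; each tile maps to one cache file.
wanted = [cache_path("tile_cache", 18, x, y) for x, y in tiles_around(48.0, 37.0, 1000, 18)]
```

A downloader would then fetch only the paths in `wanted` that do not already exist on disk, which is what makes the cache reusable across jobs.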
**Tile download budget**:
- Initial 1km radius at zoom 18: ~300 tiles (~12MB)
- Per-frame expansion: 5-20 new tiles (~0.2-0.8MB)
- Full 20km flight: ~2000 tiles (~80MB) over the course of processing
- Well within $200/month Google Maps free credit
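The tile counts in this budget can be sanity-checked from Web Mercator geometry. The sketch below assumes 256-pixel tiles and a latitude of about 48°N; both are assumptions, chosen because they reproduce the ~0.4m/px resolution quoted for zoom 18. The function names are hypothetical.

```python
import math

EARTH_CIRCUMFERENCE_M = 40_075_016.686  # WGS84 equatorial circumference
TILE_SIZE_PX = 256                      # assumed tile edge length


def tile_size_m(lat_deg, zoom):
    """Ground width of one tile at the given latitude, in metres."""
    return EARTH_CIRCUMFERENCE_M * math.cos(math.radians(lat_deg)) / 2 ** zoom


def tiles_in_circle(radius_m, lat_deg, zoom):
    """Approximate number of tiles needed to cover a circular area."""
    return math.pi * radius_m ** 2 / tile_size_m(lat_deg, zoom) ** 2


resolution_m_per_px = tile_size_m(48.0, 18) / TILE_SIZE_PX  # roughly 0.4
initial_tiles = tiles_in_circle(1000.0, 48.0, 18)           # roughly 300
```

The storage figures above (12MB for ~300 tiles, 80MB for ~2000) both imply an average of about 40KB per tile.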
### Component: API & Real-Time Streaming
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| FastAPI + SSE | FastAPI ≥0.135.0, EventSourceResponse, uvicorn | Native SSE support. Async pipeline. Excellent for ML workloads. OpenAPI docs auto-generated. | Python GIL (mitigated with asyncio + GPU-bound ops). | Python 3.11+, uvicorn | CORS configuration, API key auth | Free | **Best** |
**Selected**: FastAPI + SSE.
**API Endpoints**:
```
POST /jobs
Body: { start_lat, start_lon, altitude, camera_params, image_folder }
Returns: { job_id }
GET /jobs/{job_id}/stream
SSE stream of:
- { event: "position", data: { image_id, lat, lon, confidence, segment_id } }
- { event: "refined", data: { image_id, lat, lon, confidence } }
- { event: "segment_start", data: { segment_id, reason } }
- { event: "user_input_needed", data: { image_id, reason } }
- { event: "complete", data: { summary } }
POST /jobs/{job_id}/anchor
Body: { image_id, lat, lon }
Manual user GPS input for an image
GET /jobs/{job_id}/point-to-gps?image_id=X&px=100&py=200
Returns: { lat, lon, confidence }
Interactive point-to-GPS lookup
GET /jobs/{job_id}/results
Returns: full results as GeoJSON or CSV
```
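For reference, each item in the stream is a Server-Sent Events frame: an `event:` line, a `data:` line, and a blank line. The helper below is a framework-independent sketch of that wire format. In practice the web framework's SSE response class would normally do this framing, and the function name and sample values here are hypothetical.

```python
import json


def format_sse(event, data):
    """Serialize one event into the text/event-stream wire format."""
    payload = json.dumps(data, separators=(",", ":"))
    # A blank line terminates the event.
    return f"event: {event}\ndata: {payload}\n\n"


frame = format_sse(
    "position",
    {"image_id": 12, "lat": 48.0, "lon": 37.0, "confidence": "HIGH", "segment_id": 1},
)
```

Because the JSON is emitted on a single line, one `data:` field per event is enough; a multi-line payload would need one `data:` line per line of text.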
### Component: Interactive Point-to-GPS Lookup
For each processed image, the system stores the estimated camera-to-ground homography (from either satellite matching or VO+estimated pose). Given a pixel coordinate (px, py) in an image:
1. If image has satellite match: use the computed homography to project (px, py) → satellite tile coordinates → WGS84. High confidence.
2. If image has only VO pose: use camera intrinsics + estimated altitude + estimated heading to ray-cast (px, py) to the ground plane → WGS84. Medium confidence.
3. Both methods return confidence score based on the underlying position estimate quality.
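The first method reduces to applying a 3×3 homography to the pixel in homogeneous coordinates. A minimal sketch follows, assuming `H` maps UAV image pixels to satellite pixel coordinates; the function name and the example matrix are hypothetical.

```python
import numpy as np


def project_pixel(H, px, py):
    """Map an image pixel through a 3x3 homography."""
    v = np.asarray(H, dtype=float) @ np.array([px, py, 1.0])
    if abs(v[2]) < 1e-12:
        raise ValueError("pixel maps to a point at infinity")
    # Divide by the homogeneous coordinate to return to pixel units.
    return v[0] / v[2], v[1] / v[2]


# Example: a pure translation of (+10, +20) satellite pixels.
H = np.array([[1.0, 0.0, 10.0],
              [0.0, 1.0, 20.0],
              [0.0, 0.0, 1.0]])
sat_x, sat_y = project_pixel(H, 100, 200)
```

The resulting satellite pixel would then go through the same tile geo-referencing step used for position anchors to obtain WGS84 coordinates.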
## Testing Strategy
### Integration / Functional Tests
- End-to-end pipeline test using provided 60-image sample dataset with ground truth GPS
- Verify 80% of positions within 50m of ground truth
- Verify 60% of positions within 20m of ground truth
- Test sharp turn handling: simulate turn by reordering/skipping images
- Test segment creation and reconnection
- Test user manual anchor injection
- Test point-to-GPS lookup accuracy against known coordinates
- Test SSE streaming delivers results within 1s of processing completion
- Test with FullHD resolution images (degraded accuracy expected, but pipeline must not fail)
### Non-Functional Tests
- Processing speed: <5s per image on RTX 2060 (target <2s)
- Memory: peak RAM <16GB, VRAM <6GB during 3000-image flight
- Memory leak test: process 3000 images, verify stable memory
- Concurrent jobs: 2 simultaneous flights, verify isolation
- Tile cache: verify tiles are cached and reused across jobs
- API: load test SSE connections (10 simultaneous clients)
- Recovery: kill and restart service mid-job, verify job can resume
### Security Tests
- API key authentication enforcement
- Google Maps API key not exposed in responses or logs
- Image folder path traversal prevention
- Input validation (GPS coordinates, camera parameters)
- Rate limiting on API endpoints
## References
- [YFS90/GNSS-Denied-UAV-Geolocalization](https://github.com/YFS90/GNSS-Denied-UAV-Geolocalization) — <7m MAE without IMU
- [AerialPositioning](https://github.com/hamitbugrabayram/AerialPositioning) — tile engine and deep matcher integration reference
- [NaviLoc (2025)](https://www.mdpi.com/2504-446X/10/2/97) — trajectory-level visual localization, 19.5m MLE
- [ITU Thesis (2025)](https://polen.itu.edu.tr/items/1fe1e872-7cea-44d8-a8de-339e4587bee6) — ORB-SLAM3 + SIM integration
- [Mateos-Ramirez et al. (2024)](https://www.mdpi.com/2076-3417/14/16/7420) — VO + satellite correction for fixed-wing UAV
- [LightGlue (ICCV 2023)](https://github.com/cvg/LightGlue) — feature matching
- [SuperPoint](https://github.com/magicleap/SuperPointPretrainedNetwork) — feature extraction
- [DALGlue (2025)](https://www.nature.com/articles/s41598-025-21602-5) — 11.8% improvement over LightGlue
- [SCAR (2026)](https://arxiv.org/html/2602.16349v1) — satellite-based aerial calibration
- [DUSt3R/MASt3R evaluation (2025)](https://arxiv.org/abs/2507.14798) — extreme low-overlap matching
- [FastAPI SSE docs](https://fastapi.tiangolo.com/tutorial/server-sent-events/)
- [Google Maps Tiles API](https://developers.google.com/maps/documentation/tile/satellite)
## Related Artifacts
- AC assessment: `_docs/00_research/gps_denied_visual_nav/00_ac_assessment.md`
- Comparison framework: `_docs/00_research/gps_denied_visual_nav/03_comparison_framework.md`
- Reasoning chain: `_docs/00_research/gps_denied_visual_nav/04_reasoning_chain.md`
@@ -0,0 +1,360 @@
# Solution Draft
## Assessment Findings
| Old Component Solution | Weak Point | New Solution |
|------------------------|------------|-------------|
| Direct SuperPoint+LightGlue satellite matching | **Functional**: No coarse localization stage. Fails when VO drift is large or satellite tile is wrong. Not rotation-invariant (GitHub issue #64). Returns false matches on non-overlapping pairs (issue #13). | Two-stage: DINOv2 coarse retrieval → SuperPoint+LightGlue fine alignment. Image rotation normalization. Geometric consistency check for match validation. |
| SuperPoint for all feature extraction | **Performance**: Unified pipeline is simpler but suboptimal. SuperPoint ~80ms per image is slower than needed for every-frame VO. | Dual-extractor: XFeat for VO (5x faster, ~15ms), SuperPoint for satellite matching (higher accuracy). |
| scipy.optimize sliding window | **Functional**: Generic optimizer. No proper uncertainty modeling per measurement. No terrain constraints. Reinvents what GTSAM already provides. | GTSAM iSAM2 factor graph: BetweenFactor (VO), GPSFactor (satellite anchors), terrain constraints from Copernicus DEM. |
| Google Maps as sole satellite provider | **Functional**: Eastern Ukraine imagery 3-5+ years old. $200/month free credit expired Feb 2025. 15K/day rate limit tight for large flights. | Multi-provider: Google Maps primary + Mapbox fallback + user-provided tiles. Request budgeting. |
| No image downscaling strategy | **Performance/Memory**: 6252×4168 images cannot fit in 6GB VRAM for feature extraction. No memory budget specified. | Downscale to 1600 long edge for feature extraction. Streaming one-at-a-time processing. Explicit memory budgets. |
| No camera rotation handling | **Functional**: Non-stabilized camera produces rotated images. SuperPoint/LightGlue fail at 90° rotation. | Estimate heading from VO chain. Rectify images before satellite matching. SIFT fallback for rotation-heavy cases. |
| Homography VO without terrain correction | **Functional**: GSD assumes constant altitude. No fallback for non-planar scenes. Decomposition can be unstable. | Integrate Copernicus DEM for terrain-corrected GSD. Essential matrix fallback when RANSAC inlier ratio is low. |
| No non-match detection | **Functional**: VO failure detection relies on match count only. Misses geometrically inconsistent matches. | Triple check: match count + RANSAC inlier ratio + motion consistency with previous frames. |
| API key authentication only | **Security**: API keys in URLs persist in logs and browser history. No SSE connection limits. No DoS protection. | JWT authentication. Short-lived SSE tokens. Rate limiting. Connection pool limits. Image size validation. |
| Segment reconnection via satellite only | **Functional**: Floating segments with no satellite match stay permanently unresolved. | Cross-segment matching when new anchors arrive. DEM constraints. Configurable user-input timeout with auto-continue. |
## Product Solution Description
A Python-based GPS-denied visual navigation service that determines GPS coordinates of consecutive UAV photo centers using a hierarchical localization approach: fast visual odometry for frame-to-frame motion, two-stage satellite geo-referencing (coarse retrieval + fine matching) for absolute positioning, and factor graph optimization for trajectory refinement. The system operates as a background REST API service with real-time SSE streaming.
**Core approach**: Consecutive images are matched using XFeat (fast learned features) to estimate relative motion (visual odometry). Periodically, each image is geo-referenced against satellite imagery through a two-stage process: DINOv2 global retrieval selects the best-matching satellite tile, then SuperPoint+LightGlue refines the alignment to pixel precision. A GTSAM iSAM2 factor graph fuses VO constraints, satellite anchors, and DEM terrain constraints to produce an optimized trajectory. The system handles route disconnections by treating each continuous VO chain as an independent segment, geo-referenced through satellite matching and connected via the shared WGS84 coordinate frame.
```
┌─────────────────────────────────────────────────────────────────────┐
│ Client (Desktop App) │
│ POST /jobs (start GPS, camera params, image folder) │
│ GET /jobs/{id}/stream (SSE) │
│ POST /jobs/{id}/anchor (user manual GPS input) │
│ GET /jobs/{id}/point-to-gps (image_id, pixel_x, pixel_y) │
└──────────────────────┬──────────────────────────────────────────────┘
│ HTTP/SSE (JWT auth)
┌──────────────────────▼──────────────────────────────────────────────┐
│ FastAPI Service Layer │
│ Job Manager → Pipeline Orchestrator → SSE Event Emitter │
└──────────────────────┬──────────────────────────────────────────────┘
┌──────────────────────▼──────────────────────────────────────────────┐
│ Processing Pipeline │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │
│ │ Image │ │ Visual │ │ Satellite Geo-Ref │ │
│ │ Preprocessor │→│ Odometry │→│ Stage 1: DINOv2 retrieval│ │
│ │ (downscale, │ │ (XFeat + │ │ Stage 2: SuperPoint + │ │
│ │ rectify) │ │ LightGlue) │ │ LightGlue refinement │ │
│ └──────────────┘ └──────────────┘ └─────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ GTSAM iSAM2 Factor Graph Optimizer │ │
│ │ (VO factors + satellite anchors + DEM terrain constraints) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────▼──────────────────────────────────┐ │
│ │ Segment Manager │ │
│ │ (independent segments, cross-segment reconnection) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Multi-Provider Satellite Tile Cache │ │
│ │ (Google Maps + Mapbox + user tiles, disk cache, DEM cache) │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
```
## Architecture
### Component: Image Preprocessor
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| Downscale + rectify pipeline | OpenCV resize, NumPy | Normalizes input for all downstream components. Consistent memory usage. Preserves full-res metadata for GSD. | Loses fine detail in downscaled images. | OpenCV, NumPy | Input validation on image files | <10ms per image | **Best** |
**Selected**: Downscale + rectify pipeline.
**Preprocessing per image**:
1. Load image, validate format and dimensions
2. Downscale to max 1600 pixels on longest edge (preserving aspect ratio) for feature extraction
3. Store original resolution for GSD calculation: `GSD = (altitude × sensor_width) / (focal_length × original_width)`
4. If estimated heading is available (from previous VO): rotate image to approximate north-up orientation for satellite matching
5. Convert to grayscale for feature extraction
6. Output: downscaled grayscale image + metadata (original dims, GSD, estimated heading)
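Steps 2 and 3 interact: pixel displacements measured on the downscaled image are larger in ground terms than the full-resolution GSD suggests. The sketch below computes the target size and the matching per-pixel scale; the function names are hypothetical.

```python
def downscaled_size(width, height, max_edge=1600):
    """Return (new_width, new_height, scale), preserving aspect ratio."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height, 1.0  # never upscale
    scale = max_edge / longest
    return round(width * scale), round(height * scale), scale


def gsd_on_downscaled(gsd_original, scale):
    """Each downscaled pixel covers 1/scale original pixels of ground."""
    return gsd_original / scale


new_w, new_h, scale = downscaled_size(6252, 4168)
```

For the 6252×4168 source images this yields a 1600×1067 working image, so a displacement measured in downscaled pixels should be multiplied by `gsd_on_downscaled(GSD, scale)` rather than by the original-resolution GSD.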
### Component: Feature Extraction
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| XFeat (for VO) | accelerated_features (PyTorch) | 5x faster than SuperPoint. CPU-capable. Sparse + semi-dense matching. Used in SatLoc-Fusion. | Fewer keypoints than SuperPoint in some scenes. | PyTorch | Model weights from official source | ~15ms GPU, ~50ms CPU | **Best for VO** |
| SuperPoint (for satellite matching) | superpoint (PyTorch) | Learned features, robust to viewpoint/illumination. Proven for satellite matching (ISPRS 2025). 256-dim descriptors. | Slower than XFeat. Not rotation-invariant. | NVIDIA GPU, PyTorch, CUDA | Model weights from official source | ~80ms GPU | **Best for satellite** |
| SIFT (fallback) | OpenCV cv2.SIFT | Rotation-invariant. Scale-invariant. Better for high-rotation scenarios. | Slower. Less discriminative in low-texture. | OpenCV | N/A | ~200ms CPU | Rotation fallback |
**Selected**: XFeat for VO, SuperPoint for satellite matching, SIFT as rotation-heavy fallback.
### Component: Feature Matching
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| LightGlue + ONNX/TensorRT | lightglue + LightGlue-ONNX | 2-4x faster than PyTorch via ONNX. Best balance of speed/accuracy. FlashAttention-2 + TopK-trick for additional 30%. | FP8 not available on RTX 2060 (Turing). Not rotation-invariant. | NVIDIA GPU, ONNX Runtime / TensorRT | Model weights from official source | ~50-100ms ONNX on RTX 2060 | **Best** |
| SuperGlue | superglue (PyTorch) | Strong spatial context. 93% match rate. | 2x slower than LightGlue. Non-commercial license. | NVIDIA GPU, PyTorch | Model weights from official source | ~100-200ms | Backup |
| DALGlue | dalglue (PyTorch) | 11.8% MMA improvement over LightGlue. UAV-optimized wavelet preprocessing. | Very new (2025). Limited production validation. | NVIDIA GPU, PyTorch | Model weights from official source | Comparable to LightGlue | Monitor for future |
**Selected**: LightGlue with ONNX optimization. DALGlue to evaluate when mature.
### Component: Visual Odometry (Consecutive Frame Matching)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| Homography VO with essential matrix fallback | OpenCV findHomography, findEssentialMat, decomposeHomographyMat | Homography: optimal for flat terrain. Essential matrix: handles non-planar. Known altitude resolves scale. | Homography assumes planar. Essential matrix: more complex, scale ambiguity. | OpenCV, NumPy | N/A | ~5ms for estimation | **Best** |
| ORB-SLAM3 monocular | ORB-SLAM3 | Full SLAM with loop closure. | Heavy. Map building unnecessary. C++ dependency. | ROS (optional), C++ | N/A | — | Over-engineered |
**Selected**: Homography VO with essential matrix fallback and DEM terrain correction.
**VO Pipeline per frame**:
1. Extract XFeat features from current image (~15ms)
2. Match with previous image using LightGlue ONNX (~50ms)
3. **Triple failure check**: match count ≥ 30 AND RANSAC inlier ratio ≥ 0.4 AND motion magnitude consistent with expected inter-frame distance (nominally ~100m; accept up to 350m to tolerate outliers)
4. If checks pass → estimate homography (cv2.findHomography with USAC_MAGSAC)
5. If RANSAC inlier ratio < 0.6 → additionally estimate essential matrix (cv2.findEssentialMat) as quality check
6. Decompose homography → rotation + translation
7. Select correct decomposition (motion consistent with previous direction + positive depth)
8. **Terrain-corrected GSD**: query Copernicus DEM at estimated position → `effective_altitude = flight_altitude - terrain_elevation`, then `GSD = (effective_altitude × sensor_width) / (focal_length × original_image_width)`
9. Convert pixel displacement to meters: `displacement_m = displacement_px × GSD`
10. Update position: `new_pos = prev_pos + rotation @ displacement_m`
11. Track cumulative heading for image rectification
12. If triple failure check fails → trigger segment break
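The triple failure check and the terrain-corrected GSD can be sketched as two small functions. The thresholds come from the steps above. The function names are hypothetical, and the sketch assumes that flight altitude and terrain elevation are given in the same vertical datum.

```python
def vo_match_ok(match_count, inlier_ratio, displacement_m,
                min_matches=30, min_inlier_ratio=0.4, max_displacement_m=350.0):
    """Return True only if all three checks pass; False triggers a segment break."""
    return (
        match_count >= min_matches
        and inlier_ratio >= min_inlier_ratio
        and 0.0 <= displacement_m <= max_displacement_m
    )


def terrain_corrected_gsd(flight_altitude_m, terrain_elevation_m,
                          sensor_width_mm, focal_length_mm, original_image_width_px):
    """GSD using height above ground instead of absolute altitude."""
    effective_altitude_m = flight_altitude_m - terrain_elevation_m
    if effective_altitude_m <= 0:
        raise ValueError("terrain elevation is at or above flight altitude")
    return (effective_altitude_m * sensor_width_mm) / (
        focal_length_mm * original_image_width_px
    )


# Hypothetical values: 550 m flight altitude over 150 m terrain.
gsd = terrain_corrected_gsd(550.0, 150.0, 23.5, 25.0, 6252)
keep_frame = vo_match_ok(match_count=120, inlier_ratio=0.7, displacement_m=95.0)
```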
### Component: Satellite Image Geo-Referencing (Two-Stage)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| Stage 1: DINOv2 coarse retrieval | dinov2 (PyTorch), faiss | Handles large viewpoint/domain gap. Finds correct area even with 100-200m VO drift. Semantic matching robust to seasonal change. | Coarse only (~tile-level). Needs pre-computed satellite tile embeddings. | PyTorch, faiss | Model weights from official source | ~50ms per query | **Best coarse** |
| Stage 2: SuperPoint+LightGlue fine matching with perspective warping | SuperPoint, LightGlue-ONNX, OpenCV warpPerspective | Precise alignment. Pixel-level accuracy. Proven on satellite benchmarks. | Needs rough pose for warping. Fails without Stage 1 on large drift. | PyTorch, ONNX Runtime, OpenCV | API keys secured | ~150ms total | **Best fine** |
**Selected**: Two-stage hierarchical matching.
**Satellite Matching Pipeline**:
1. Estimate approximate position from VO
2. **Stage 1 — Coarse retrieval**:
   a. Define search area: 500m radius around VO estimate (expand to 1km if segment just started)
   b. Pre-compute DINOv2 embeddings for all satellite tiles in search area (cached)
   c. Extract DINOv2 embedding from rectified UAV image
   d. Find top-5 most similar satellite tiles using faiss cosine similarity
3. **Stage 2 — Fine matching** (on top-5 tiles, stop on first good match):
a. Warp UAV image to approximate nadir view using estimated camera pose
b. Extract SuperPoint features from warped UAV image
c. Extract SuperPoint features from satellite tile (pre-computed and cached)
d. Match with LightGlue ONNX
e. **Geometric validation**: require ≥15 inliers, inlier ratio ≥ 0.3, reprojection error < 3px
f. If valid: estimate homography → transform image center → satellite pixel → WGS84
g. Report: absolute position anchor with confidence based on match quality
4. If all 5 tiles fail Stage 2: try SIFT+LightGlue (rotation-invariant), try zoom level 17 (wider view)
5. If still fails: mark frame as VO-only, reduce confidence, continue
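
The geometric validation in step 3e and the "stop on first good match" rule can be sketched as follows. The `MatchResult` container and function names are hypothetical; only the three thresholds come from the pipeline above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class MatchResult:
    tile_id: str
    num_matches: int            # putative matches from the matcher
    num_inliers: int            # matches surviving RANSAC
    reprojection_error_px: float


def is_valid_match(m: MatchResult, min_inliers: int = 15,
                   min_inlier_ratio: float = 0.3,
                   max_reprojection_px: float = 3.0) -> bool:
    """Apply the three geometric checks from step 3e."""
    if m.num_matches == 0:
        return False
    ratio = m.num_inliers / m.num_matches
    return (m.num_inliers >= min_inliers
            and ratio >= min_inlier_ratio
            and m.reprojection_error_px < max_reprojection_px)


def first_valid_match(candidates: Iterable[MatchResult]) -> Optional[MatchResult]:
    """Walk the ranked tiles and stop at the first one that validates."""
    for m in candidates:
        if is_valid_match(m):
            return m
    return None
```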
**Satellite matching frequency**: Attempted on every frame for which cached tiles are available, but run asynchronously so that it never blocks the VO pipeline. When a satellite result arrives, it is added to the factor graph retroactively.
### Component: GTSAM Factor Graph Optimizer
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| GTSAM iSAM2 factor graph | gtsam 4.2 (pip) | Incremental smoothing. Proper uncertainty propagation. Built-in GPSFactor. Backward smoothing on new evidence. Python bindings. Production-proven. | C++ backend (pip binary handles this). Learning curve for factor graph API. | gtsam==4.2, NumPy | N/A | ~5-10ms incremental update | **Best** |
| scipy.optimize sliding window | scipy, NumPy | Simple. No external dependency. | Generic optimizer. No uncertainty. No incremental update. | SciPy, NumPy | N/A | ~10-50ms per window | Baseline only |
**Selected**: GTSAM iSAM2.
**Factor graph structure**:
- **Variables**: Pose2 (x, y, heading) per image
- **VO Factor** (BetweenFactorPose2): relative motion between consecutive frames. Noise model: diagonal with sigma proportional to `1 / inlier_ratio`. Higher inlier ratio = lower uncertainty.
- **Satellite Anchor Factor** (GPSFactor or PriorFactorPoint2): absolute position from satellite matching. Noise model: sigma proportional to `reprojection_error × GSD`. Good match (~0.5px × 0.4m/px) = 0.2m sigma. Poor match = 5-10m sigma.
- **DEM Terrain Factor** (custom): constrains altitude to be consistent with Copernicus DEM at estimated position. Soft constraint, sigma = 5m.
- **Drift Limit Factor** (custom): penalizes cumulative VO displacement between satellite anchors exceeding 100m. Activated only when the anchor-to-anchor VO path exceeds the threshold.
**Optimizer behavior**:
- On each new frame: add VO factor, run iSAM2.update() → ~5ms
- On satellite match arrival: add anchor factor, run iSAM2.update() → triggers backward correction of recent poses
- Emit updated positions via SSE after each update
- Refinement events: when backward correction moves positions by >1m, emit "refined" SSE event
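
The "refined" rule can be sketched as a comparison of positions before and after an optimizer update. This assumes positions are held in a local metric frame; the function name is hypothetical and the event payload is abridged to the fields needed for the check.

```python
import math


def refined_events(previous, updated, threshold_m=1.0):
    """Return a 'refined' event for every pose moved by more than threshold_m.

    previous / updated: dict of image_id -> (x_m, y_m).
    """
    events = []
    for image_id, (x_new, y_new) in updated.items():
        if image_id not in previous:
            continue  # a new pose, not a refinement
        x_old, y_old = previous[image_id]
        delta_m = math.hypot(x_new - x_old, y_new - y_old)
        if delta_m > threshold_m:
            events.append({
                "event": "refined",
                "data": {"image_id": image_id, "delta_m": round(delta_m, 2)},
            })
    return events
```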
### Component: Segment Manager
The segment manager tracks independent VO chains and manages their lifecycle and interconnection.
**Segment lifecycle**:
1. **Start condition**: First image, OR VO triple failure check fails
2. **Active tracking**: VO provides frame-to-frame motion within segment
3. **Anchoring**: Satellite two-stage matching provides absolute position
4. **End condition**: VO failure (sharp turn, outlier >350m, occlusion)
5. **New segment**: Starts from satellite anchor or user GPS
**Segment states**:
- `ANCHORED`: At least one satellite match → HIGH confidence
- `FLOATING`: No satellite match yet → positioned relative to start point → LOW confidence
- `USER_ANCHORED`: User provided manual GPS → MEDIUM confidence
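
A minimal sketch of the state-to-confidence mapping. The precedence (a satellite match outranks a manual anchor) is an assumption for illustration, not something the states above specify.

```python
from enum import Enum


class SegmentState(Enum):
    ANCHORED = "ANCHORED"
    FLOATING = "FLOATING"
    USER_ANCHORED = "USER_ANCHORED"


CONFIDENCE = {
    SegmentState.ANCHORED: "HIGH",
    SegmentState.USER_ANCHORED: "MEDIUM",
    SegmentState.FLOATING: "LOW",
}


def segment_state(satellite_matches: int, has_user_anchor: bool) -> SegmentState:
    # Assumed precedence: satellite match first, then manual GPS, else floating.
    if satellite_matches >= 1:
        return SegmentState.ANCHORED
    if has_user_anchor:
        return SegmentState.USER_ANCHORED
    return SegmentState.FLOATING
```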
**Enhanced segment reconnection**:
- When a segment becomes ANCHORED, check for nearby FLOATING segments (within 500m of any anchored position in the new segment)
- Attempt satellite-based position matching between FLOATING segment images and satellite tiles near the ANCHORED segment
- If match found: anchor the floating segment and connect to the trajectory
- DEM consistency check: ensure segment positions are consistent with terrain elevation
- If no match after all frames in floating segment are tried: request user input with configurable timeout (default: 30s), then continue with best VO estimate
### Component: Multi-Provider Satellite Tile Cache
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| Multi-provider progressive cache with DEM | aiohttp, aiofiles, sqlite3, faiss | Multiple providers for coverage. Async download. DINOv2 embeddings pre-computed. DEM cached alongside tiles. | Needs internet. Provider API differences. | Google Maps Tiles API + Mapbox API keys | API keys in env vars only. Never logged. | Async, non-blocking | **Best** |
**Selected**: Multi-provider progressive cache.
**Provider priority**:
1. User-provided tiles (highest priority — custom/recent imagery for the area)
2. Google Maps (zoom 18, ~0.4m/px) — 100K free tiles/month
3. Mapbox Satellite (zoom 16+, up to 0.3m/px) — 200K free requests/month
**Cache strategy**:
1. On job start: download tiles in 1km radius around starting GPS from primary provider
2. Pre-compute SuperPoint features AND DINOv2 embeddings for all cached tiles
3. As route extends: download tiles 500m ahead of estimated position
4. **Request budgeting**: track daily API requests, switch to secondary provider at 80% of daily limit
5. Cache structure on disk:
```
cache/
├── tiles/{provider}/{zoom}/{x}/{y}.jpg
├── features/{provider}/{zoom}/{x}/{y}_sp.npz (SuperPoint features)
├── embeddings/{provider}/{zoom}/{x}/{y}_dino.npz (DINOv2 embedding)
└── dem/{lat}_{lon}.tif (Copernicus DEM tiles)
```
6. Cache is persistent across jobs — tiles and features reused for overlapping areas
7. **DEM cache**: download Copernicus DEM GLO-30 tiles alongside satellite tiles. 30m resolution is sufficient for terrain correction at flight altitude.
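
The on-disk layout can be addressed from a GPS position as sketched below. This assumes the providers use the standard Web Mercator XYZ tiling scheme; the function names and the `cache` root are illustrative.

```python
import math
from pathlib import PurePosixPath


def latlon_to_tile(lat_deg: float, lon_deg: float, zoom: int):
    """Standard Web Mercator XYZ tile indices containing a WGS84 point."""
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the map edge still map to a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_cache_path(provider: str, zoom: int, x: int, y: int, root: str = "cache"):
    """Path of a cached tile image, following tiles/{provider}/{zoom}/{x}/{y}.jpg."""
    return PurePosixPath(root) / "tiles" / provider / str(zoom) / str(x) / f"{y}.jpg"
```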
**Tile download budget (revised)**:
- Google Maps: 100,000 tiles/month free → ~50 flights at 2000 tiles each
- Mapbox: 200,000 requests/month free → additional capacity
- Per flight: ~2000 satellite tiles (~80MB) + ~500 DEM tiles (~20MB)
- Combined free tier handles operational volume
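
The request-budgeting rule (switch provider at 80% of the daily limit) and the flights-per-month arithmetic can be sketched as follows. The class name and the daily limits passed in are hypothetical; real limits depend on each provider's terms.

```python
class RequestBudget:
    """Tracks daily requests and picks the first provider under its threshold."""

    def __init__(self, daily_limits, switch_fraction=0.8):
        # daily_limits: provider -> limit, listed in priority order.
        self.daily_limits = dict(daily_limits)
        self.used = {provider: 0 for provider in self.daily_limits}
        self.switch_fraction = switch_fraction

    def record(self, provider, count=1):
        self.used[provider] += count

    def pick_provider(self):
        for provider, limit in self.daily_limits.items():
            if self.used[provider] < self.switch_fraction * limit:
                return provider
        return None  # every provider is at or above its threshold


# Free-tier arithmetic from the list above.
flights_per_month = 100_000 // 2_000
```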
### Component: API & Real-Time Streaming
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| FastAPI + SSE + JWT | FastAPI ≥0.135.0, EventSourceResponse, uvicorn, python-jose | Native SSE. Async pipeline. OpenAPI auto-generated. JWT for proper auth. | Python GIL (mitigated with asyncio + GPU-bound ops). | Python 3.11+, uvicorn | JWT auth, CORS, rate limiting | Async, non-blocking | **Best** |
**Selected**: FastAPI + SSE + JWT authentication.
**API Endpoints**:
```
POST /auth/token
Body: { api_key }
Returns: { access_token, token_type, expires_in }
POST /jobs
Headers: Authorization: Bearer <token>
Body: { start_lat, start_lon, altitude, camera_params, image_folder }
Returns: { job_id }
GET /jobs/{job_id}/stream
Headers: Authorization: Bearer <token>
SSE stream of:
- { event: "position", data: { image_id, lat, lon, confidence, segment_id } }
- { event: "refined", data: { image_id, lat, lon, confidence, delta_m } }
- { event: "segment_start", data: { segment_id, reason } }
- { event: "user_input_needed", data: { image_id, reason, timeout_s } }
- { event: "complete", data: { summary } }
POST /jobs/{job_id}/anchor
Headers: Authorization: Bearer <token>
Body: { image_id, lat, lon }
Manual user GPS input for an image
GET /jobs/{job_id}/point-to-gps?image_id=X&px=100&py=200
Headers: Authorization: Bearer <token>
Returns: { lat, lon, confidence }
GET /jobs/{job_id}/results?format=geojson
Headers: Authorization: Bearer <token>
Returns: full results as GeoJSON or CSV (WGS84)
```
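
For reference, each item on the stream is a standard Server-Sent Events frame. The helper below is a hypothetical sketch of the wire format only; in practice the framework's SSE response class would normally produce it.

```python
import json


def format_sse(event: str, data: dict, event_id=None) -> str:
    """Serialise one event as an SSE frame terminated by a blank line."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data, separators=(',', ':'))}")
    return "\n".join(lines) + "\n\n"
```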
**Security measures**:
- JWT authentication on all endpoints (short-lived tokens, 1h expiry)
- Image folder whitelist: only paths under configured base directories allowed
- Max image dimensions: 8000×8000 pixels
- Max concurrent SSE connections per client: 5
- Rate limiting: 100 requests/minute per client
- All provider API keys in environment variables, never logged or returned in responses
- CORS configured for known client origins only
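
The image folder whitelist can be sketched with `pathlib`. This assumes Python 3.9+ for `Path.is_relative_to`; the function name and base directory are illustrative.

```python
from pathlib import Path


def is_allowed_folder(image_folder: str, base_dirs) -> bool:
    """True only if the resolved folder lies under one of the base directories."""
    # resolve() collapses ".." segments and symlinks before the comparison,
    # which is what defeats path traversal attempts.
    candidate = Path(image_folder).resolve()
    return any(candidate.is_relative_to(Path(base).resolve()) for base in base_dirs)
```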
### Component: Interactive Point-to-GPS Lookup
For each processed image, the system stores the estimated camera-to-ground homography. Given pixel coordinates (px, py):
1. If image has satellite match: use computed homography to project (px, py) → satellite tile coordinates → WGS84. HIGH confidence.
2. If image has only VO pose: use camera intrinsics + DEM-corrected altitude + estimated heading to ray-cast (px, py) to ground plane → WGS84. MEDIUM confidence.
3. Confidence score derived from underlying position estimate quality.
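
For the VO-only case (item 2), a simplified ray-cast can be sketched as below. It assumes a nadir-pointing camera over locally flat ground, a heading measured clockwise from north with the image top pointing along the heading, and a GSD already corrected for terrain. It returns a metric offset from the image centre, which would then be converted to WGS84.

```python
import math


def pixel_to_ground_offset(px, py, image_width, image_height,
                           gsd_m_per_px, heading_rad):
    """Offset (east_m, north_m) of a pixel from the image centre on the ground."""
    dx = px - image_width / 2.0   # pixels to the right of centre
    dy = py - image_height / 2.0  # pixels below centre
    cos_h, sin_h = math.cos(heading_rad), math.sin(heading_rad)
    east = gsd_m_per_px * (dx * cos_h - dy * sin_h)
    north = gsd_m_per_px * (-dx * sin_h - dy * cos_h)
    return east, north
```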
## Testing Strategy
### Integration / Functional Tests
- End-to-end pipeline test using provided 60-image sample dataset with ground truth GPS
- Verify 80% of positions within 50m of ground truth
- Verify 60% of positions within 20m of ground truth
- Test sharp turn handling: simulate 90° turn with non-overlapping images
- Test segment creation, satellite anchoring, and cross-segment reconnection
- Test user manual anchor injection via POST endpoint
- Test point-to-GPS lookup accuracy against known ground coordinates
- Test SSE streaming delivers results within 1s of processing completion
- Test with FullHD resolution images (pipeline must not fail)
- Test with 6252×4168 images (verify downscaling and memory usage)
- Test DINOv2 coarse retrieval finds correct satellite tile with 100m VO drift
- Test multi-provider fallback: block Google Maps, verify Mapbox takes over
- Test with outdated satellite imagery: verify confidence scores reflect match quality
- Test outlier handling: 350m gap between consecutive photos
- Test image rotation handling: apply 45° rotation to images, verify pipeline handles it
### Non-Functional Tests
- Processing speed: <5s per image on RTX 2060 (target <2s with ONNX optimization)
- Memory: peak RAM <16GB, VRAM <6GB during 3000-image flight at max resolution
- Memory stability: process 3000 images, verify no memory leak (stable RSS over time)
- Concurrent jobs: 2 simultaneous flights, verify isolation and resource sharing
- Tile cache: verify tiles and features are cached and reused across jobs
- API: load test SSE connections (10 simultaneous clients)
- Recovery: kill and restart service mid-job, verify job can resume from last processed image
- DEM download: verify Copernicus DEM tiles fetched and cached correctly
- GTSAM optimizer: verify backward correction produces "refined" events with improved positions
### Security Tests
- JWT authentication enforcement on all endpoints
- Expired/invalid token rejection
- Provider API keys not exposed in responses, logs, or error messages
- Image folder path traversal prevention (attempt to access /etc/passwd via image_folder)
- Image folder whitelist enforcement
- Input validation: invalid GPS coordinates, negative altitude, malformed camera params
- Rate limiting: verify 429 response after exceeding limit
- Max SSE connection enforcement
- Max image size validation (reject >8000px)
- CORS enforcement: reject requests from unknown origins
## References
- [YFS90/GNSS-Denied-UAV-Geolocalization](https://github.com/YFS90/GNSS-Denied-UAV-Geolocalization) — <7m MAE with terrain-weighted constraint optimization
- [SatLoc-Fusion (2025)](https://www.mdpi.com/2072-4292/17/17/3048) — hierarchical DINOv2+XFeat+optical flow, <15m on edge hardware
- [CEUSP (2025)](https://arxiv.org/abs/2502.11408) — DINOv2-based cross-view UAV self-positioning
- [Oblique-Robust AVL (IEEE TGRS 2024)](https://ieeexplore.ieee.org/iel7/36/10354519/10356107.pdf) — rotation-equivariant features for UAV-satellite matching
- [XFeat (CVPR 2024)](https://github.com/verlab/accelerated_features) — 5x faster than SuperPoint
- [LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX) — 2-4x speedup via ONNX/TensorRT
- [DALGlue (2025)](https://www.nature.com/articles/s41598-025-21602-5) — 11.8% MMA improvement over LightGlue for UAV
- [SIFT+LightGlue UAV Mosaicking (ISPRS 2025)](https://isprs-archives.copernicus.org/articles/XLVIII-2-W11-2025/169/2025/) — SIFT superior for high-rotation conditions
- [LightGlue rotation issue #64](https://github.com/cvg/LightGlue/issues/64) — confirmed not rotation-invariant
- [LightGlue no-match issue #13](https://github.com/cvg/LightGlue/issues/13) — false matches on non-overlapping pairs
- [GTSAM v4.2](https://github.com/borglab/gtsam) — factor graph optimization with Python bindings
- [Copernicus DEM GLO-30](https://dataspace.copernicus.eu/explore-data/data-collections/copernicus-contributing-missions/collections-description/COP-DEM) — free 30m global DEM
- [Google Maps Tiles API](https://developers.google.com/maps/documentation/tile/satellite) — satellite tiles, 100K free/month
- [Mapbox Satellite](https://docs.mapbox.com/data/tilesets/reference/mapbox-satellite/) — alternative tile provider, up to 0.3m/px
- [DINOv2 UAV Self-Localization (2025)](https://ui.adsabs.harvard.edu/abs/2025IRAL...10.2080Y/) — 86.27 R@1 on DenseUAV
- [FastAPI SSE](https://fastapi.tiangolo.com/tutorial/server-sent-events/)
- [Homography Decomposition Revisited (IJCV 2025)](https://link.springer.com/article/10.1007/s11263-025-02680-4)
- [Sliding Window Factor Graph Optimization (2020)](https://www.cambridge.org/core/services/aop-cambridge-core/content/view/523C7C41D18A8D7C159C59235DF502D0/)
## Related Artifacts
- Assessment research: `_docs/00_research/gps_denied_nav_assessment/`
- Previous AC assessment: `_docs/00_research/gps_denied_visual_nav/00_ac_assessment.md`
- Previous comparison framework: `_docs/00_research/gps_denied_visual_nav/03_comparison_framework.md`
+491
View File
@@ -0,0 +1,491 @@
# Solution Draft
## Assessment Findings
| Old Component Solution | Weak Point | New Solution |
| ---------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Pose2 + GPSFactor for satellite anchors | **Functional (Critical)**: GPSFactor works with Pose3, not Pose2. Code would fail at runtime. Custom Python DEM/drift factors are slow. | Pose2 + BetweenFactorPose2 (VO) + PriorFactorPose2 (satellite anchors). Remove DEM terrain factor and drift limit factor from graph — handle in GSD calculation and Segment Manager respectively. |
| DINOv2 model variant unspecified + faiss GPU | **Performance/Memory**: No model variant chosen. faiss GPU uses ~2GB scratch. Combined VRAM could exceed 6GB. | DINOv2 ViT-S/14 (300MB VRAM, 50ms/img). faiss on CPU only (<1ms for ~2000 vectors). Explicit VRAM budget per model. |
| Heading-based rotation rectification + SIFT fallback | **Functional**: No heading at segment start. No trigger criteria for SIFT. Multi-rotation matching not specified. | Three-tier: (1) segment start → 4-rotation retry {0°,90°,180°,270°}, (2) heading available → rectify, (3) all fail → SIFT+LightGlue. Trigger: SuperPoint inlier ratio < 0.15. |
| "Motion consistent with previous direction" homography selection | **Functional**: Underspecified for 4 decomposition solutions. Non-orthogonal R possible. No strategy for first frame pair. | Four-step disambiguation: positive depth → plane normal up → motion consistency → orthogonality check via SVD. First pair: depth + normal only. |
| Raw DINOv2 CLS token + faiss cosine | **Performance**: Raw CLS token suboptimal for retrieval. Patch-level features capture more spatial information. | DINOv2 ViT-S/14 patch tokens with spatial average pooling (not just CLS). Cosine similarity via CPU faiss. |
| "Async satellite matching — don't block VO"                      | **Functional**: No concrete concurrency model. Single GPU can't run two models simultaneously.                                          | Sequential GPU pipeline: VO first (~40ms), satellite matching overlapped with next frame's VO (~205ms). asyncio for I/O. CPU for faiss + RANSAC.                                                  |
| JWT + rate limiting + CORS | **Security**: No image format validation. No Pillow CVE-2025-48379 mitigation. No SSE heartbeat. No memory-limited image loading. | Pin Pillow ≥11.3.0. Validate magic bytes. Reject images >10,000px. SSE heartbeat 15s. asyncio.Queue event publisher. CSP headers. |
| Custom GTSAM drift limit factor | **Functional**: Python callback per optimization step (slow). If no anchors for 50+ frames, nothing to constrain. | Replace with Segment Manager drift thresholds: 100m → warning, 200m → user input request, 500m → LOW confidence. Exponential confidence decay from last anchor. |
| Google Maps tile download (API key only) | **Functional**: Google Maps requires session tokens via createSession API, not just API key. 15K/day limit not managed. | Implement session token lifecycle: createSession → use token → handle expiry. Request budget tracking per provider per day. |
| FastAPI EventSourceResponse | **Stability**: Async generator cleanup issues on shutdown. No heartbeat. No reconnection support. | asyncio.Queue-based EventPublisher pattern. SSE heartbeat every 15s. Last-Event-ID support for reconnection. |
## Product Solution Description
A Python-based GPS-denied visual navigation service that determines GPS coordinates of consecutive UAV photo centers using a hierarchical localization approach: fast visual odometry for frame-to-frame motion, two-stage satellite geo-referencing (coarse retrieval + fine matching) for absolute positioning, and factor graph optimization for trajectory refinement. The system operates as a background REST API service with real-time SSE streaming.
**Core approach**: Consecutive images are matched using XFeat (fast learned features) to estimate relative motion (visual odometry). Each image is geo-referenced against satellite imagery through a two-stage process: DINOv2 ViT-S/14 coarse retrieval selects the best-matching satellite tile using patch-level features, then SuperPoint+LightGlue refines the alignment to pixel precision. A GTSAM iSAM2 factor graph fuses VO constraints (BetweenFactorPose2) and satellite anchors (PriorFactorPose2) in local ENU coordinates to produce an optimized trajectory. The system handles route disconnections by treating each continuous VO chain as an independent segment, geo-referenced through satellite matching and connected via the shared WGS84 coordinate frame.
```
┌─────────────────────────────────────────────────────────────────────┐
│ Client (Desktop App) │
│ POST /jobs (start GPS, camera params, image folder) │
│ GET /jobs/{id}/stream (SSE) │
│ POST /jobs/{id}/anchor (user manual GPS input) │
│ GET /jobs/{id}/point-to-gps (image_id, pixel_x, pixel_y) │
└──────────────────────┬──────────────────────────────────────────────┘
│ HTTP/SSE (JWT auth)
┌──────────────────────▼──────────────────────────────────────────────┐
│ FastAPI Service Layer │
│ Job Manager → Pipeline Orchestrator → SSE Event Publisher │
│ (asyncio.Queue-based publisher, heartbeat, Last-Event-ID) │
└──────────────────────┬──────────────────────────────────────────────┘
┌──────────────────────▼──────────────────────────────────────────────┐
│ Processing Pipeline │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │
│ │ Image │ │ Visual │ │ Satellite Geo-Ref │ │
│ │ Preprocessor │→│ Odometry │→│ Stage 1: DINOv2-S patch │ │
│ │ (downscale, │ │ (XFeat + │ │ retrieval (CPU faiss) │ │
│ │ rectify) │ │ XFeat │ │ Stage 2: SuperPoint + │ │
│ │ │ │ matcher) │ │ LightGlue-ONNX refine │ │
│ └──────────────┘ └──────────────┘ └─────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ GTSAM iSAM2 Factor Graph Optimizer │ │
│ │ Pose2 + BetweenFactorPose2 (VO) + PriorFactorPose2 (sat) │ │
│ │ Local ENU coordinates → WGS84 output │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────▼──────────────────────────────────┐ │
│ │ Segment Manager │ │
│ │ (drift thresholds, confidence decay, user input triggers) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Multi-Provider Satellite Tile Cache │ │
│ │ (Google Maps + Mapbox + user tiles, session tokens, │ │
│ │ DEM cache, request budgeting) │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
```
## Architecture
### Component: Image Preprocessor
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
| ------------------------------ | -------------------- | -------------------------------------------------------------- | --------------------------------------- | ------------- | -------------------------------------------------- | --------------- | -------- |
| Downscale + rectify + validate | OpenCV resize, NumPy | Normalizes input. Consistent memory. Validates before loading. | Loses fine detail in downscaled images. | OpenCV, NumPy | Magic byte validation, dimension check before load | <10ms per image | **Best** |
**Selected**: Downscale + rectify + validate pipeline.
**Preprocessing per image**:
1. Validate file: check magic bytes (JPEG/PNG/TIFF), reject unknown formats
2. Read image header only: check dimensions, reject if either > 10,000px
3. Load image via OpenCV (cv2.imread)
4. Downscale to max 1600 pixels on longest edge (preserving aspect ratio)
5. Store original resolution for GSD: `GSD = (effective_altitude × sensor_width) / (focal_length × original_width)` where `effective_altitude = flight_altitude - terrain_elevation` (terrain from Copernicus DEM)
6. If estimated heading is available: rotate to approximate north-up for satellite matching
7. If no heading (segment start): pass unrotated
8. Convert to grayscale for feature extraction
9. Output: downscaled grayscale image + metadata (original dims, GSD, heading if known)
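
Steps 1 and 4 can be sketched without any image library. The helper names are hypothetical; the signatures are the standard JPEG, PNG, and TIFF magic bytes.

```python
MAGIC_BYTES = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"II*\x00": "TIFF",   # little-endian TIFF
    b"MM\x00*": "TIFF",   # big-endian TIFF
}


def detect_format(header: bytes):
    """Return the format name for a file header, or None if unknown."""
    for magic, name in MAGIC_BYTES.items():
        if header.startswith(magic):
            return name
    return None


def downscaled_size(width: int, height: int, max_edge: int = 1600):
    """Target size with the longest edge capped at max_edge, aspect preserved."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```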
### Component: Feature Extraction
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
| ----------------------------------- | ------------------------------ | -------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- | ------------------------- | ---------------------------------- | -------------------- | ---------------------- |
| XFeat (for VO) | accelerated_features (PyTorch) | 5x faster than SuperPoint. CPU-capable. Has built-in matcher. Semi-dense matching. Used in SatLoc-Fusion. | Fewer keypoints than SuperPoint in some scenes. Not rotation-invariant. | PyTorch | Model weights from official source | ~15ms GPU, ~50ms CPU | **Best for VO** |
| SuperPoint (for satellite matching) | superpoint (PyTorch) | Learned features, robust to viewpoint/illumination. Proven for satellite matching (ISPRS 2025). 256-dim descriptors. | Slower than XFeat. Not rotation-invariant. | NVIDIA GPU, PyTorch, CUDA | Model weights from official source | ~80ms GPU | **Best for satellite** |
| SIFT (rotation fallback) | OpenCV cv2.SIFT | Rotation-invariant. Scale-invariant. Proven SIFT+LightGlue hybrid for UAV mosaicking (ISPRS 2025). | Slower. Less discriminative in low-texture. | OpenCV | N/A | ~200ms CPU | **Rotation fallback** |
**Selected**: XFeat with built-in matcher for VO, SuperPoint for satellite matching, SIFT+LightGlue as rotation-heavy fallback.
**VRAM budget**:
| Model | VRAM | Loaded When |
| --------------------- | ---------- | -------------------------- |
| XFeat | ~200MB | Always (VO every frame) |
| DINOv2 ViT-S/14 | ~300MB | Satellite coarse retrieval |
| SuperPoint | ~400MB | Satellite fine matching |
| LightGlue ONNX FP16 | ~500MB | Satellite fine matching |
| ONNX Runtime overhead | ~200MB | When ONNX models active |
| **Peak total** | **~1.6GB** | Satellite matching phase |
### Component: Feature Matching
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
| ---------------------------------- | ----------------------- | ------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------- | ------------------------ | ---------------------------------- | --------------------- | ---------------------- |
| XFeat built-in matcher (VO) | accelerated_features | Fastest option (~15ms extract+match). Paired with XFeat extraction. | Lower quality than LightGlue. | PyTorch | N/A | ~15ms total | **Best for VO** |
| LightGlue ONNX FP16 (satellite) | LightGlue-ONNX | 2-4x faster than PyTorch via ONNX. FP16 works on Turing (RTX 2060). | FP8 not available on Turing. Not rotation-invariant. | ONNX Runtime, NVIDIA GPU | Model weights from official source | ~50-100ms on RTX 2060 | **Best for satellite** |
| SIFT+LightGlue (rotation fallback) | OpenCV SIFT + LightGlue | SIFT rotation invariance + LightGlue contextual matching. Proven superior for high-rotation UAV (ISPRS 2025). | Slower than SuperPoint+LightGlue. | OpenCV + ONNX Runtime | N/A | ~250ms total | **Rotation fallback** |
**Selected**: XFeat matcher for VO, LightGlue ONNX FP16 for satellite matching, SIFT+LightGlue as rotation fallback.
### Component: Visual Odometry (Consecutive Frame Matching)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
| -------------------------------------------- | ----------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- | --------------------------------------------------------- | ------------- | -------- | ------------------- | -------- |
| Homography VO with essential matrix fallback | OpenCV findHomography (USAC_MAGSAC), findEssentialMat, decomposeHomographyMat | Homography: optimal for flat terrain. Essential matrix: non-planar fallback. Known altitude resolves scale. | Homography assumes planar. 4-way decomposition ambiguity. | OpenCV, NumPy | N/A | ~5ms for estimation | **Best** |
**Selected**: Homography VO with essential matrix fallback and DEM terrain-corrected GSD.
**VO Pipeline per frame**:
1. Extract XFeat features from current image (~15ms)
2. Match with previous image using XFeat built-in matcher (included in extraction time)
3. **Triple failure check**: match count ≥ 30 AND RANSAC inlier ratio ≥ 0.4 AND motion magnitude consistent with expected inter-frame distance (100m ± 250m)
4. If checks pass → estimate homography (cv2.findHomography with USAC_MAGSAC, confidence 0.999, max iterations 2000)
5. If RANSAC inlier ratio < 0.6 → additionally estimate essential matrix as quality check
6. **Decomposition disambiguation** (4 solutions from decomposeHomographyMat):
a. Filter by positive depth: triangulate 5 matched points, reject if behind camera
b. Filter by plane normal: normal z-component > 0.5 (downward camera → ground plane normal points up)
c. If previous direction available: prefer solution consistent with expected motion
d. Orthogonality check: verify R^T R ≈ I (Frobenius norm < 0.01). If failed, re-orthogonalize via SVD: U,S,V = svd(R), R_clean = U @ V^T
e. First frame pair in segment: use filters a+b only
7. **Terrain-corrected GSD**: query Copernicus DEM at estimated position → `effective_altitude = flight_altitude - terrain_elevation` → `GSD = (effective_altitude × sensor_width) / (focal_length × original_image_width)`
8. Convert pixel displacement to meters: `displacement_m = displacement_px × GSD`
9. Update position: `new_pos = prev_pos + rotation @ displacement_m`
10. Track cumulative heading for image rectification
11. If triple failure check fails → trigger segment break
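
Steps 6b and 6d can be sketched with NumPy. The candidates are assumed to be `(R, t, n)` triples such as those produced by a homography decomposition; the helper names are hypothetical, and the determinant guard is an addition to keep the result a proper rotation.

```python
import numpy as np


def filter_by_plane_normal(candidates, min_normal_z=0.5):
    """Keep decomposition candidates whose plane normal z-component is > 0.5."""
    return [(R, t, n) for R, t, n in candidates
            if float(np.ravel(n)[2]) > min_normal_z]


def reorthogonalize(R, tol=0.01):
    """Return R unchanged if R^T R is close to I, else its nearest rotation."""
    R = np.asarray(R, dtype=float)
    if np.linalg.norm(R.T @ R - np.eye(3), ord="fro") < tol:
        return R
    U, _, Vt = np.linalg.svd(R)   # NumPy returns V^T directly
    R_clean = U @ Vt
    if np.linalg.det(R_clean) < 0:
        U[:, -1] *= -1            # flip one axis to avoid a reflection
        R_clean = U @ Vt
    return R_clean
```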
### Component: Satellite Image Geo-Referencing (Two-Stage)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
| ---------------------------------------- | -------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------ | --------------------- | ---------------------------------- | --------------------------- | --------------- |
| Stage 1: DINOv2 ViT-S/14 patch retrieval | dinov2 ViT-S/14 (PyTorch), faiss (CPU) | Fast (50ms). 300MB VRAM. Patch tokens capture spatial layout better than CLS alone. Semantic matching robust to seasonal change. | Coarse only (~tile-level). Lower precision than ViT-B/ViT-L. | PyTorch, faiss-cpu | Model weights from official source | ~50ms extract + <1ms search | **Best coarse** |
| Stage 2: SuperPoint+LightGlue ONNX FP16 | SuperPoint, LightGlue-ONNX, OpenCV | Precise pixel-level alignment. Proven on satellite benchmarks. FP16 on RTX 2060. | Needs rough pose for warping. Not rotation-invariant. | PyTorch, ONNX Runtime | N/A | ~150ms total | **Best fine** |
**Selected**: Two-stage hierarchical matching.
**Satellite Matching Pipeline**:
1. Estimate approximate position from VO
2. **Stage 1 — Coarse retrieval**:
a. Define search area: 500m radius around VO estimate (expand to 1km if segment just started or drift > 100m)
b. Pre-compute DINOv2 ViT-S/14 patch embeddings for all satellite tiles in search area. Method: extract patch tokens (not CLS), apply spatial average pooling to get a single descriptor per tile. Cache embeddings.
c. Extract DINOv2 ViT-S/14 patch embedding from UAV image (same pooling)
d. Find top-5 most similar satellite tiles using faiss (CPU) cosine similarity
3. **Stage 2 — Fine matching** (on top-5 tiles, stop on first good match):
a. Warp UAV image to approximate nadir view using estimated camera pose
b. **Rotation handling**:
- If heading known: single attempt with rectified image
- If no heading (segment start): try 4 rotations {0°, 90°, 180°, 270°}
c. Extract SuperPoint features from warped UAV image
d. Extract SuperPoint features from satellite tile (pre-computed and cached)
e. Match with LightGlue ONNX FP16
f. **Geometric validation**: require ≥15 inliers, inlier ratio ≥ 0.3, reprojection error < 3px
g. If valid: estimate homography → transform image center → satellite pixel → WGS84
h. Report: absolute position anchor with confidence based on match quality
4. If all 5 tiles fail Stage 2 with SuperPoint:
a. Try SIFT+LightGlue on top-3 tiles (rotation-invariant). Trigger: best SuperPoint inlier ratio was < 0.15.
b. Try zoom level 17 (wider view)
5. If still fails: mark frame as VO-only, reduce confidence, continue
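
Stage 1 (steps 2b to 2d) can be sketched with NumPy alone. Here the patch tokens are assumed to be already extracted as a `(num_patches, dim)` array, and a plain dot product stands in for the faiss index; an inner-product index over L2-normalised vectors performs the same cosine search.

```python
import numpy as np


def pooled_descriptor(patch_tokens):
    """Average patch tokens into one L2-normalised descriptor."""
    d = np.asarray(patch_tokens, dtype=np.float32).mean(axis=0)
    return d / (np.linalg.norm(d) + 1e-12)


def top_k_tiles(query, tile_descriptors, k=5):
    """Return [(tile_index, cosine_similarity)] for the k best tiles."""
    sims = np.asarray(tile_descriptors) @ query  # cosine: inputs are normalised
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]
```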
**Satellite matching frequency**: Every frame when available, but async — satellite matching for frame N overlaps with VO processing for frame N+1. Satellite result arrives and gets added to factor graph retroactively via iSAM2 update.
### Component: GTSAM Factor Graph Optimizer
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
| -------------------------------- | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------- | ----------------- | -------- | -------------------------- | -------- |
| GTSAM iSAM2 factor graph (Pose2) | gtsam==4.2 (pip) | Incremental smoothing. Proper uncertainty propagation. Native BetweenFactorPose2 and PriorFactorPose2. Backward smoothing on new evidence. Python bindings. | C++ backend (pip binary). Learning curve. | gtsam==4.2, NumPy | N/A | ~5-10ms incremental update | **Best** |
**Selected**: GTSAM iSAM2 with Pose2 variables.
**Coordinate system**: Local East-North-Up (ENU) centered on starting GPS. All positions computed in ENU meters, converted to WGS84 for output. Conversion: pyproj or manual geodetic math (WGS84 ellipsoid).
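
As an illustration of the "manual geodetic math" option, the sketch below converts between WGS84 and a local East-North frame using the ellipsoid's radii of curvature at the origin. It is a small-area tangent-plane approximation, an assumption of this example rather than something the draft specifies, and it ignores the Up axis. The function names are hypothetical.

```python
import math

WGS84_A = 6378137.0          # semi-major axis in metres
WGS84_E2 = 6.69437999014e-3  # first eccentricity squared


def _radii(lat0_deg):
    """Meridian and prime-vertical radii of curvature at the origin latitude."""
    s = math.sin(math.radians(lat0_deg))
    w = math.sqrt(1.0 - WGS84_E2 * s * s)
    meridian = WGS84_A * (1.0 - WGS84_E2) / w**3
    prime_vertical = WGS84_A / w
    return meridian, prime_vertical


def wgs84_to_enu(lat, lon, lat0, lon0):
    """Return (east, north) in metres relative to the origin (lat0, lon0)."""
    m, n = _radii(lat0)
    east = math.radians(lon - lon0) * n * math.cos(math.radians(lat0))
    north = math.radians(lat - lat0) * m
    return east, north


def enu_to_wgs84(east, north, lat0, lon0):
    """Inverse of wgs84_to_enu under the same small-area approximation."""
    m, n = _radii(lat0)
    lat = lat0 + math.degrees(north / m)
    lon = lon0 + math.degrees(east / (n * math.cos(math.radians(lat0))))
    return lat, lon
```

The approximation error grows with distance from the origin, so for long routes a library conversion such as pyproj is likely the safer choice.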
**Factor graph structure**:
- **Variables**: Pose2 (x_enu, y_enu, heading) per image
- **Prior Factor** (PriorFactorPose2): first frame anchored at ENU origin (0, 0, initial_heading) with tight noise (sigma_xy = 5m if GPS accurate, sigma_theta = 0.1 rad)
- **VO Factor** (BetweenFactorPose2): relative motion between consecutive frames. Noise model: `Diagonal.Sigmas([sigma_x, sigma_y, sigma_theta])` where the position sigma grows as the RANSAC inlier ratio falls. High inlier ratio (0.8) → sigma 2m. Low inlier ratio (0.4) → sigma 10m. Sigma_theta is proportional to displacement magnitude.
- **Satellite Anchor Factor** (PriorFactorPose2): absolute position from satellite matching. Position noise: `sigma = reprojection_error × GSD × scale_factor`. Good match (0.5px × 0.4m/px × 3) = 0.6m. Poor match = 5-10m. Heading component: loose (sigma = 1.0 rad) unless estimated from satellite alignment.
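
The two position-noise rules above can be written out as small helpers. The draft only gives two reference points for the VO sigma (0.8 → 2m, 0.4 → 10m); the linear interpolation and clamping between them are assumptions of this sketch, as are the function names.

```python
def vo_sigma_xy(inlier_ratio):
    """Position sigma (metres) for a VO factor.

    ASSUMPTION: linear interpolation between the two points given in the
    draft, clamped outside the [0.4, 0.8] inlier-ratio range.
    """
    hi_ratio, hi_sigma = 0.8, 2.0
    lo_ratio, lo_sigma = 0.4, 10.0
    r = min(max(inlier_ratio, lo_ratio), hi_ratio)
    t = (r - lo_ratio) / (hi_ratio - lo_ratio)
    return lo_sigma + t * (hi_sigma - lo_sigma)


def satellite_sigma_xy(reproj_error_px, gsd_m_per_px, scale_factor=3.0):
    """Position sigma (metres) for a satellite anchor factor.

    Implements sigma = reprojection_error * GSD * scale_factor.
    """
    return reproj_error_px * gsd_m_per_px * scale_factor
```

For the draft's good-match example, `satellite_sigma_xy(0.5, 0.4)` evaluates the product 0.5 × 0.4 × 3, which is about 0.6m.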
**Optimizer behavior**:
- On each new frame: add VO factor, run iSAM2.update() → ~5ms
- On satellite match arrival: add PriorFactorPose2, run iSAM2.update() → backward correction
- Emit updated positions via SSE after each update
- Refinement events: when backward correction moves positions by >1m, emit "refined" SSE event
- No custom Python factors — all factors use native GTSAM C++ implementations for speed
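
The "refined" event rule can be sketched independently of GTSAM: compare the positions before and after an optimizer update and report those that moved by more than 1m. The dictionaries of ENU positions and the function name are hypothetical; how positions are read out of the optimizer is not shown here.

```python
import math

REFINE_THRESHOLD_M = 1.0  # from the draft: emit "refined" when a position moves by >1m


def refined_events(previous, updated, threshold_m=REFINE_THRESHOLD_M):
    """Return payloads for positions that moved more than threshold_m.

    `previous` and `updated` map image_id -> (east, north) in ENU metres.
    Images that appear only in `updated` are new, not refined, and are skipped.
    """
    events = []
    for image_id, (e_new, n_new) in updated.items():
        if image_id not in previous:
            continue
        e_old, n_old = previous[image_id]
        delta = math.hypot(e_new - e_old, n_new - n_old)
        if delta > threshold_m:
            events.append({"image_id": image_id, "delta_m": delta})
    return events
```

Each returned payload carries the `delta_m` field that the "refined" SSE event in the API section expects.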
### Component: Segment Manager
The segment manager tracks independent VO chains, manages drift thresholds, and handles reconnection.
**Segment lifecycle**:
1. **Start condition**: First image, OR VO triple failure check fails
2. **Active tracking**: VO provides frame-to-frame motion within segment
3. **Anchoring**: Satellite two-stage matching provides absolute position
4. **End condition**: VO failure (sharp turn, outlier >350m, occlusion)
5. **New segment**: Starts, attempts satellite anchor immediately
**Segment states**:
- `ANCHORED`: At least one satellite match → HIGH confidence
- `FLOATING`: No satellite match yet → positioned relative to segment start → LOW confidence
- `USER_ANCHORED`: User provided manual GPS → MEDIUM confidence
**Drift monitoring (replaces GTSAM custom drift factor)**:
- Track cumulative VO displacement since last satellite anchor per segment
- **100m threshold**: emit warning SSE event, expand satellite search radius to 1km, increase matching attempts per frame
- **200m threshold**: emit `user_input_needed` SSE event with configurable timeout (default: 30s)
- **500m threshold**: mark all subsequent positions as VERY LOW confidence, continue processing
- **Confidence formula**: `confidence = base_confidence × exp(-drift / decay_constant)` where base_confidence is from satellite match quality, drift is distance from nearest anchor, decay_constant = 100m
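
A minimal sketch of the threshold logic and the confidence formula follows. The thresholds and decay constant come from the list above; the event label `very_low_confidence` and the function names are hypothetical, and the sketch assumes each threshold fires once, when drift first crosses it.

```python
import math

DECAY_CONSTANT_M = 100.0

# (threshold in metres, action label); the last label is a hypothetical name.
THRESHOLDS = (
    (100.0, "drift_warning"),
    (200.0, "user_input_needed"),
    (500.0, "very_low_confidence"),
)


def confidence(base_confidence, drift_m, decay_constant_m=DECAY_CONSTANT_M):
    """confidence = base_confidence * exp(-drift / decay_constant)."""
    return base_confidence * math.exp(-drift_m / decay_constant_m)


def drift_actions(previous_drift_m, current_drift_m):
    """Return the actions whose threshold was crossed since the last reading."""
    return [
        name
        for threshold, name in THRESHOLDS
        if previous_drift_m < threshold <= current_drift_m
    ]
```

With a decay constant of 100m, confidence falls to about 37% of its base value at 100m of drift and to about 14% at 200m.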
**Segment reconnection**:
- When a segment becomes ANCHORED, check for nearby FLOATING segments (within 500m of any anchored position)
- Attempt satellite-based position matching between FLOATING segment images and tiles near the ANCHORED segment
- DEM consistency: verify segment elevation profile is consistent with terrain
- If no match after all frames tried: request user input, auto-continue after timeout
### Component: Multi-Provider Satellite Tile Cache
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
| ----------------------------------------- | ------------------------------------- | ------------------------------------------------------------------------------------------------------------------ | ----------------------------------------- | --------------------------------------- | ------------------------------------------------------------- | ------------------- | -------- |
| Multi-provider progressive cache with DEM | aiohttp, aiofiles, sqlite3, faiss-cpu | Multiple providers. Async download. DINOv2/SuperPoint features pre-computed. DEM cached. Session token management. | Needs internet. Provider API differences. | Google Maps Tiles API + Mapbox API keys | API keys in env vars only. Session tokens managed internally. | Async, non-blocking | **Best** |
**Selected**: Multi-provider progressive cache.
**Provider priority**:
1. User-provided tiles (highest priority — custom/recent imagery)
2. Google Maps (zoom 18, ~0.4m/px) — 100K free requests/month, 15K/day
3. Mapbox Satellite (zoom 16-18, ~0.6-0.3m/px) — 200K free requests/month
**Google Maps session management**:
1. On job start: POST to `/v1/createSession` with API key → receive session token
2. Use session token in all subsequent tile requests for this job
3. Token has finite lifetime — handle expiry by creating new session
4. Track request count per day per provider
**Cache strategy**:
1. On job start: download tiles in 1km radius around starting GPS from primary provider
2. Pre-compute SuperPoint features AND DINOv2 ViT-S/14 patch embeddings for all cached tiles
3. As route extends: download tiles 500m ahead of estimated position
4. **Request budgeting**: track daily API requests per provider. At 80% daily limit (12,000 for Google): switch to Mapbox. Log budget status.
5. Cache structure on disk:
```
cache/
├── tiles/{provider}/{zoom}/{x}/{y}.jpg
├── features/{provider}/{zoom}/{x}/{y}_sp.npz (SuperPoint features)
├── embeddings/{provider}/{zoom}/{x}/{y}_dino.npy (DINOv2 patch embedding)
└── dem/{lat}_{lon}.tif (Copernicus DEM tiles)
```
6. Cache persistent across jobs — tiles and features reused for overlapping areas
7. **DEM cache**: Copernicus DEM GLO-30 tiles from AWS S3 (free, no auth). `s3://copernicus-dem-30m/`. Cloud Optimized GeoTIFFs, 30m resolution. Downloaded via HTTPS (no AWS SDK needed): `https://copernicus-dem-30m.s3.amazonaws.com/Copernicus_DSM_COG_10_{N|S}{lat}_00_{E|W}{lon}_00_DEM/...`
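
To make the cache layout concrete, the sketch below maps a WGS84 position to Web Mercator tile indices and builds the corresponding cache paths. It assumes the providers use the standard XYZ (slippy map) tiling scheme; the provider string `"google"` and the function names are hypothetical.

```python
import math
from pathlib import PurePosixPath


def latlon_to_tile(lat, lon, zoom):
    """Standard Web Mercator XYZ tile indices for a WGS84 position."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so positions on the map edge still yield a valid index.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_cache_paths(provider, zoom, x, y, root="cache"):
    """Paths for the tile image and its derived features, per the layout above."""
    base = PurePosixPath(root)
    key = PurePosixPath(provider) / str(zoom) / str(x)
    return {
        "tile": base / "tiles" / key / f"{y}.jpg",
        "features": base / "features" / key / f"{y}_sp.npz",
        "embedding": base / "embeddings" / key / f"{y}_dino.npy",
    }
```

Keying all three artifacts on the same `{provider}/{zoom}/{x}/{y}` tuple means a cache hit on the tile can be checked together with its pre-computed features.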
**Tile download budget**:
- Google Maps: 100,000/month, 15,000/day → ~7 flights/day from cache misses, ~50 flights/month
- Mapbox: 200,000/month → additional ~100 flights/month
- Per flight: ~2000 satellite tiles (~80MB) + ~200 DEM tiles (~10MB)
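
The budget arithmetic and the 80% switch-over rule can be sketched as below. `ProviderBudget`, `pick_provider`, and `flights_per_quota` are hypothetical names, and modelling Mapbox without a daily cap is an assumption of this example, since the draft only states its monthly quota.

```python
from dataclasses import dataclass
from typing import Optional

TILES_PER_FLIGHT = 2000  # approximate figure from the draft


@dataclass
class ProviderBudget:
    name: str
    daily_limit: Optional[int]  # None = no daily cap modelled (assumption)
    used_today: int = 0

    def exhausted(self, switch_fraction=0.8):
        """True once usage reaches the switch-over fraction of the daily limit."""
        if self.daily_limit is None:
            return False
        return self.used_today >= switch_fraction * self.daily_limit


def pick_provider(budgets):
    """Return the first provider, in priority order, that still has budget."""
    for budget in budgets:
        if not budget.exhausted():
            return budget
    return None


def flights_per_quota(quota, tiles_per_flight=TILES_PER_FLIGHT):
    """Whole flights that fit in a request quota, assuming every tile is a cache miss."""
    return quota // tiles_per_flight
```

Under these figures, `flights_per_quota(15_000)` gives 7 and `flights_per_quota(100_000)` gives 50, matching the Google Maps numbers above.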
### Component: API & Real-Time Streaming
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
| --------------------------------- | ----------------------------------------------------- | ---------------------------------------------------------------------------------------------------- | ------------------------------------ | --------------------- | ------------------------------------- | ------------------- | -------- |
| FastAPI + SSE (Queue-based) + JWT | FastAPI ≥0.135.0, asyncio.Queue, uvicorn, python-jose | Native SSE. Queue-based publisher avoids generator cleanup issues. JWT auth. OpenAPI auto-generated. | Python GIL (mitigated with asyncio). | Python 3.11+, uvicorn | JWT, CORS, rate limiting, CSP headers | Async, non-blocking | **Best** |
**Selected**: FastAPI + Queue-based SSE + JWT authentication.
**SSE implementation**:
- Use `asyncio.Queue` per client connection (not bare async generators)
- Server pushes events to queue; client reads from queue
- On disconnect: queue is garbage collected, no lingering generators
- SSE heartbeat: send `event: heartbeat` every 15 seconds to detect stale connections
- Support `Last-Event-ID` header for reconnection: include monotonic event ID in each SSE message. On reconnect, replay missed events from in-memory ring buffer (last 1000 events per job).
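
The replay buffer behind `Last-Event-ID` could look roughly like this. It is a sketch using only the standard library; `EventBuffer` and `format_sse` are hypothetical names, and the per-client `asyncio.Queue` wiring is left out.

```python
import json
from collections import deque


class EventBuffer:
    """Per-job ring buffer that assigns monotonic ids and supports replay."""

    def __init__(self, capacity=1000):
        self._events = deque(maxlen=capacity)  # oldest events drop off first
        self._next_id = 1

    def append(self, event, data):
        """Store an event and return the record, including its assigned id."""
        record = {"id": self._next_id, "event": event, "data": data}
        self._events.append(record)
        self._next_id += 1
        return record

    def replay_after(self, last_event_id):
        """Events the client missed, oldest first, limited to what is still buffered."""
        return [e for e in self._events if e["id"] > last_event_id]


def format_sse(record):
    """Serialize a record in the text/event-stream wire format."""
    return (
        f"id: {record['id']}\n"
        f"event: {record['event']}\n"
        f"data: {json.dumps(record['data'])}\n\n"
    )
```

On reconnect, the handler would parse the `Last-Event-ID` header as an integer, push `replay_after(...)` into the new client's queue, and then continue with live events. A client that falls more than 1000 events behind cannot be fully caught up from the buffer alone.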
**API Endpoints**:
```
POST /auth/token
Body: { api_key }
Returns: { access_token, token_type, expires_in }
POST /jobs
Headers: Authorization: Bearer <token>
Body: { start_lat, start_lon, altitude, camera_params, image_folder }
Returns: { job_id }
GET /jobs/{job_id}/stream
Headers: Authorization: Bearer <token>
SSE stream of:
- { event: "position", id: "42", data: { image_id, lat, lon, confidence, segment_id } }
- { event: "refined", id: "43", data: { image_id, lat, lon, confidence, delta_m } }
- { event: "segment_start", id: "44", data: { segment_id, reason } }
- { event: "drift_warning", id: "45", data: { segment_id, cumulative_drift_m } }
- { event: "user_input_needed", id: "46", data: { image_id, reason, timeout_s } }
- { event: "heartbeat", id: "47", data: { timestamp } }
- { event: "complete", id: "48", data: { summary } }
POST /jobs/{job_id}/anchor
Headers: Authorization: Bearer <token>
Body: { image_id, lat, lon }
GET /jobs/{job_id}/point-to-gps?image_id=X&px=100&py=200
Headers: Authorization: Bearer <token>
Returns: { lat, lon, confidence }
GET /jobs/{job_id}/results?format=geojson
Headers: Authorization: Bearer <token>
Returns: full results as GeoJSON or CSV (WGS84)
```
**Security measures**:
- JWT authentication on all endpoints (short-lived tokens, 1h expiry)
- Image folder whitelist: resolve to canonical path (os.path.realpath), verify under configured base directories
- Image validation: magic byte check (JPEG FFD8, PNG 89504E47, TIFF 4949/4D4D), dimension check (<10,000px per side), reject others
- Pin Pillow ≥11.3.0 (CVE-2025-48379 mitigation)
- Max concurrent SSE connections per client: 5
- Rate limiting: 100 requests/minute per client
- All provider API keys in environment variables, never logged or returned
- CORS configured for known client origins only
- Content-Security-Policy headers
- SSE heartbeat prevents stale connections accumulating
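
Two of the checks above, the image folder whitelist and the magic byte test, can be sketched with the standard library alone. The function names and the example base directory are hypothetical; the dimension check and the rest of the validation chain are not shown.

```python
import os

# Leading bytes accepted by the magic byte check.
MAGIC_PREFIXES = (
    b"\xff\xd8",  # JPEG (FF D8)
    b"\x89PNG",   # PNG (89 50 4E 47)
    b"II",        # TIFF, little-endian (49 49)
    b"MM",        # TIFF, big-endian (4D 4D)
)


def is_allowed_folder(folder, allowed_bases):
    """True if the canonical path of `folder` lies under one of the allowed bases."""
    real = os.path.realpath(folder)
    for base in allowed_bases:
        real_base = os.path.realpath(base)
        try:
            # commonpath compares whole path components, so a sibling such as
            # /data/flights_evil does not pass as being under /data/flights.
            if os.path.commonpath([real, real_base]) == real_base:
                return True
        except ValueError:
            # Raised for paths on different drives (Windows); treat as no match.
            continue
    return False


def has_image_magic(header: bytes) -> bool:
    """Check the first bytes of a file against the accepted signatures."""
    return header.startswith(MAGIC_PREFIXES)
```

Comparing with `os.path.commonpath` rather than a string prefix is the design choice that matters here: a plain `startswith` on the path string would wrongly accept sibling directories that share a prefix with an allowed base.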
### Component: Interactive Point-to-GPS Lookup
For each processed image, the system stores the estimated camera-to-ground transformation. Given pixel coordinates (px, py):
1. If image has satellite match: use computed homography to project (px, py) → satellite tile coordinates → WGS84. HIGH confidence.
2. If image has only VO pose: use camera intrinsics + DEM-corrected altitude + estimated heading to ray-cast (px, py) to ground plane → WGS84. MEDIUM confidence.
3. Confidence score derived from underlying position estimate quality.
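
The satellite-match path (case 1) can be sketched as two steps: apply the stored homography to the pixel, then convert the resulting map pixel to WGS84. This example assumes the homography maps UAV pixels to global Web Mercator pixel coordinates at the tile zoom level with 256px tiles; the draft does not state the target frame, so treat that convention and the function names as hypothetical.

```python
import math


def apply_homography(h, px, py):
    """Project a pixel through a 3x3 homography given as nested row-major lists."""
    x = h[0][0] * px + h[0][1] * py + h[0][2]
    y = h[1][0] * px + h[1][1] * py + h[1][2]
    w = h[2][0] * px + h[2][1] * py + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return x / w, y / w


def global_pixel_to_wgs84(gx, gy, zoom, tile_size=256):
    """Convert global Web Mercator pixel coordinates to (lat, lon) in degrees."""
    world = tile_size * 2 ** zoom  # map width and height in pixels at this zoom
    lon = gx / world * 360.0 - 180.0
    lat = math.degrees(math.atan(math.sinh(math.pi * (1.0 - 2.0 * gy / world))))
    return lat, lon


def point_to_gps(h, px, py, zoom):
    """Image pixel -> map pixel -> WGS84, for images with a satellite match."""
    gx, gy = apply_homography(h, px, py)
    return global_pixel_to_wgs84(gx, gy, zoom)
```

The VO-only path (case 2) would replace the homography with a ray cast from camera intrinsics, altitude, and heading, and is not covered by this sketch.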
## Processing Time Budget
| Step | Component | Time | GPU/CPU | Notes |
| ---------------------- | ---------------------------------------- | -------------- | ------- | ----------------------------------- |
| 1 | Image load + validate + downscale | <10ms | CPU | OpenCV |
| 2 | XFeat extract + match (VO) | ~15ms | GPU | Built-in matcher |
| 3 | Homography estimation + decomposition | ~5ms | CPU | USAC_MAGSAC |
| 4 | GTSAM iSAM2 update (VO factor) | ~5ms | CPU | Incremental |
| 5 | SSE position emit | <1ms | CPU | Queue push |
| **VO subtotal** | | **~36ms** | | **Per-frame critical path** |
| 6 | DINOv2 ViT-S/14 extract (UAV image) | ~50ms | GPU | Patch tokens |
| 7 | faiss cosine search (top-5 tiles) | <1ms | CPU | ~2000 vectors |
| 8 | SuperPoint extract (UAV warped) | ~80ms | GPU | |
| 9 | LightGlue ONNX match (per tile, up to 5) | ~50-100ms | GPU | Stop on first good match |
| 10 | Geometric validation + homography | ~5ms | CPU | |
| 11 | GTSAM iSAM2 update (satellite factor) | ~5ms | CPU | Backward correction |
| **Satellite subtotal** |                                          | **~191-241ms** |         | **Overlapped with next frame's VO** |
| **Total per frame**    |                                          | **~230-280ms** |         | **Well under 5s budget**            |
## Testing Strategy
### Integration / Functional Tests
- End-to-end pipeline test using provided 60-image sample dataset with ground truth GPS
- Verify 80% of positions within 50m of ground truth
- Verify 60% of positions within 20m of ground truth
- Test sharp turn handling: simulate 90° turn with non-overlapping images
- Test segment creation, satellite anchoring, and cross-segment reconnection
- Test user manual anchor injection via POST endpoint
- Test point-to-GPS lookup accuracy against known ground coordinates
- Test SSE streaming delivers results within 1s of processing completion
- Test with FullHD resolution images (pipeline must not fail)
- Test with 6252×4168 images (verify downscaling and memory usage)
- Test DINOv2 ViT-S/14 coarse retrieval finds correct satellite tile with 100m VO drift
- Test multi-provider fallback: block Google Maps, verify Mapbox takes over
- Test with outdated satellite imagery: verify confidence scores reflect match quality
- Test outlier handling: 350m gap between consecutive photos
- Test image rotation handling: apply 45° and 90° rotation, verify 4-rotation retry works
- Test SIFT+LightGlue fallback triggers when SuperPoint inlier ratio < 0.15
- Test GTSAM PriorFactorPose2 satellite anchoring produces backward correction
- Test drift warning at 100m cumulative displacement without satellite anchor
- Test user_input_needed event at 200m cumulative displacement
- Test SSE heartbeat arrives every 15s during long processing
- Test SSE reconnection with Last-Event-ID replays missed events
- Test homography decomposition disambiguation for first frame pair (no previous direction)
### Non-Functional Tests
- Processing speed: <5s per image on RTX 2060 (target <300ms with ONNX optimization)
- Memory: peak RAM <16GB, VRAM <6GB during 3000-image flight at max resolution
- VRAM: verify peak stays under 2GB during satellite matching phase
- Memory stability: process 3000 images, verify no memory leak (stable RSS over time)
- Concurrent jobs: 2 simultaneous flights, verify isolation and resource sharing
- Tile cache: verify tiles, SuperPoint features, and DINOv2 embeddings cached and reused
- API: load test SSE connections (10 simultaneous clients)
- Recovery: kill and restart service mid-job, verify job can resume from last processed image
- DEM download: verify Copernicus DEM tiles fetched from AWS S3 and cached correctly
- GTSAM optimizer: verify backward correction produces "refined" events
- Session token lifecycle: verify Google Maps session creation, usage, and expiry handling
### Security Tests
- JWT authentication enforcement on all endpoints
- Expired/invalid token rejection
- Provider API keys not exposed in responses, logs, or error messages
- Image folder path traversal prevention (attempt to access /etc/passwd via image_folder)
- Image folder whitelist enforcement (canonical path resolution)
- Image magic byte validation: reject non-image files renamed to .jpg
- Image dimension validation: reject >10,000px images
- Input validation: invalid GPS coordinates, negative altitude, malformed camera params
- Rate limiting: verify 429 response after exceeding limit
- Max SSE connection enforcement
- CORS enforcement: reject requests from unknown origins
- Content-Security-Policy header presence
- Pillow version ≥11.3.0 verified in requirements
## References
- [YFS90/GNSS-Denied-UAV-Geolocalization](https://github.com/YFS90/GNSS-Denied-UAV-Geolocalization) — <7m MAE with terrain-weighted constraint optimization
- [SatLoc-Fusion (2025)](https://www.mdpi.com/2072-4292/17/17/3048) — hierarchical DINOv2+XFeat+optical flow, <15m on edge hardware
- [CEUSP (2025)](https://arxiv.org/abs/2502.11408) — DINOv2-based cross-view UAV self-positioning
- [DINOv2 UAV Self-Localization (2025)](https://ui.adsabs.harvard.edu/abs/2025IRAL...10.2080Y/) — 86.27 R@1 on DenseUAV
- [XFeat (CVPR 2024)](https://github.com/verlab/accelerated_features) — 5x faster than SuperPoint
- [XFeat+LightGlue (HuggingFace)](https://huggingface.co/vismatch/xfeat-lightglue) — trained xfeat-lightglue models
- [LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX) — 2-4x speedup via ONNX/TensorRT, FP16 on Turing
- [SIFT+LightGlue UAV Mosaicking (ISPRS 2025)](https://isprs-archives.copernicus.org/articles/XLVIII-2-W11-2025/169/2025/) — SIFT superior for high-rotation conditions
- [LightGlue rotation issue #64](https://github.com/cvg/LightGlue/issues/64) — confirmed not rotation-invariant
- [DALGlue (2025)](https://www.nature.com/articles/s41598-025-21602-5) — 11.8% MMA improvement over LightGlue for UAV
- [SALAD: DINOv2 Optimal Transport Aggregation (2024)](https://arxiv.org/abs/2311.15937) — improved visual place recognition
- [NaviLoc (2025)](https://www.mdpi.com/2504-446X/10/2/97) — trajectory-level optimization, 19.5m MLE, 16x improvement
- [GTSAM v4.2](https://github.com/borglab/gtsam) — factor graph optimization with Python bindings
- [GTSAM GPSFactor docs](https://gtsam.org/doxygen/a04084.html) — GPSFactor works with Pose3 only
- [GTSAM Pose2 SLAM Example](https://gtbook.github.io/gtsam-examples/Pose2SLAMExample.html) — BetweenFactorPose2 + PriorFactorPose2
- [OpenCV decomposeHomographyMat issue #23282](https://github.com/opencv/opencv/issues/23282) — non-orthogonal matrices, 4-solution ambiguity
- [Copernicus DEM GLO-30 on AWS](https://registry.opendata.aws/copernicus-dem/) — free 30m global DEM, no auth via S3
- [Google Maps Tiles API](https://developers.google.com/maps/documentation/tile/satellite) — satellite tiles, 100K free/month, session tokens required
- [Google Maps Tiles API billing](https://developers.google.com/maps/documentation/tile/usage-and-billing) — 15K/day, 6K/min rate limits
- [Mapbox Satellite](https://docs.mapbox.com/data/tilesets/reference/mapbox-satellite/) — alternative tile provider, up to 0.3m/px regional
- [FastAPI SSE](https://fastapi.tiangolo.com/tutorial/server-sent-events/) — EventSourceResponse
- [SSE-Starlette cleanup issue #99](https://github.com/sysid/sse-starlette/issues/99) — async generator cleanup, Queue pattern recommended
- [CVE-2025-48379 Pillow](https://nvd.nist.gov/vuln/detail/CVE-2025-48379) — heap buffer overflow, fixed in 11.3.0
- [FAISS GPU wiki](https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU) — ~2GB scratch space default, CPU recommended for small datasets
- [Oblique-Robust AVL (IEEE TGRS 2024)](https://ieeexplore.ieee.org/iel7/36/10354519/10356107.pdf) — rotation-equivariant features for UAV-satellite matching
## Related Artifacts
- Previous assessment research: `_docs/00_research/gps_denied_nav_assessment/`
- This assessment research: `_docs/00_research/gps_denied_draft02_assessment/`
- Previous AC assessment: `_docs/00_research/gps_denied_visual_nav/00_ac_assessment.md`