# Solution Draft
## Product Solution Description
A Python-based GPS-denied visual navigation service that determines GPS coordinates of consecutive UAV aerial photo centers using visual odometry, satellite image geo-referencing, and sliding window position optimization. The system operates as a background REST API service with real-time SSE streaming.
**Core approach**: Consecutive images are matched using learned features (SuperPoint + LightGlue) to estimate relative motion (visual odometry). Each image is also matched against cached Google Maps satellite tiles to obtain an absolute position anchor wherever the match succeeds. A sliding window optimizer fuses VO estimates with satellite anchors to constrain drift. The system handles route disconnections by treating each continuous VO chain as an independent segment that is geo-referenced through satellite matching.
```
┌─────────────────────────────────────────────────────────────────────┐
│ Client (Desktop App) │
│ POST /jobs (start GPS, camera params, image folder) │
│ GET /jobs/{id}/stream (SSE) │
│ POST /jobs/{id}/anchor (user manual GPS input) │
│ GET /jobs/{id}/point-to-gps (image_id, pixel_x, pixel_y) │
└──────────────────────┬──────────────────────────────────────────────┘
│ HTTP/SSE
┌──────────────────────▼──────────────────────────────────────────────┐
│ FastAPI Service Layer │
│ Job Manager → Pipeline Orchestrator → SSE Event Emitter │
└──────────────────────┬──────────────────────────────────────────────┘
┌──────────────────────▼──────────────────────────────────────────────┐
│ Processing Pipeline │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────────────┐ │
│ │ Feature │ │ Visual │ │ Satellite Geo-Referencing │ │
│ │ Extractor │→│ Odometry │→│ (cross-view matching) │ │
│ │ (SuperPoint) │ │ (homography) │ │ (SuperPoint+LightGlue) │ │
│ └─────────────┘ └──────────────┘ └────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Sliding Window Position Optimizer │ │
│ │ (VO estimates + satellite anchors + drift constraints) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────▼──────────────────────────────────┐ │
│ │ Segment Manager │ │
│ │ (independent segments, satellite-anchored stitching) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Satellite Tile Cache Manager │ │
│ │ (progressive download, Google Maps Tiles API, disk cache) │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
```
## Existing/Competitor Solutions Analysis
| Solution | Approach | Accuracy | IMU Required | Open Source | Relevance |
|----------|----------|----------|-------------|-------------|-----------|
| YFS90/GNSS-Denied-UAV-Geolocalization | VO + satellite matching + terrain-weighted constraint optimization | <7m MAE | No | Yes (GitHub, 69★) | **Highest** — same constraints, best results |
| AerialPositioning (hamitbugrabayram) | Multi-provider tile engine + deep matchers + perspective warping | Not reported | Simulated INS | Yes (GitHub, 52★) | High — tile engine and perspective warping reference |
| NaviLoc | Trajectory-level VPR + VIO fusion | 19.5m MLE | Yes (VIO) | Partial | Medium — uses IMU, different altitude range |
| ITU Thesis (Öztürk 2025) | ORB-SLAM3 + SuperPoint/SuperGlue/GIM SIM | GPS-level | No | No | High — architecture reference |
| Mateos-Ramirez et al. (2024) | ORB VO + AKAZE satellite + Kalman filter | 143m mean (17km) | Yes | No | Medium — higher altitude, uses IMU |
| VisionUAV-Navigation | Multi-algorithm feature detection + satellite matching | Not reported | Not stated | Yes (GitHub) | Low — early stage |
**Key insight from competitor analysis**: YFS90 reports <7m MAE without an IMU using terrain-weighted constraint optimization. This suggests that our target accuracy (20-50m) is realistic and possibly conservative. The sliding window optimization approach is the critical differentiator from simpler VO+satellite systems.
## Architecture
### Component: Feature Extraction
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| SuperPoint | superpoint (PyTorch) | Learned features, robust to viewpoint/illumination changes. Good repeatability in aerial scenes. GPU-accelerated. ~80ms per image. | Requires GPU. Fixed descriptor dimension (256). | NVIDIA GPU, PyTorch, CUDA | Model weights from official source only | Free for non-commercial research use (the official Magic Leap weights are not MIT-licensed) | **Best** |
| SIFT | OpenCV cv2.SIFT_create | Classical, well-understood. Scale/rotation invariant. Good satellite matching (SIFT+LightGlue top on ISPRS 2025). | Slower than SuperPoint. Less robust to extreme viewpoint changes. | OpenCV | N/A | Free | Good fallback |
| ORB | OpenCV cv2.ORB_create | Very fast. Many keypoints. | Limited scale invariance (image pyramid only). Poor for cross-view matching. | OpenCV | N/A | Free | Only for fast VO |
**Selected**: SuperPoint as primary (both VO and satellite matching — unified pipeline). SIFT as fallback for satellite matching where SuperPoint struggles.
### Component: Feature Matching
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| LightGlue | lightglue (PyTorch) | Fastest learned matcher (~20-50ms). Adaptive pruning. Best on satellite benchmarks. ONNX/TensorRT support for 2-4x speedup. | Requires GPU for best performance. | NVIDIA GPU, PyTorch | Model weights from official source only | Free (Apache 2.0) | **Best** |
| SuperGlue | superglue (PyTorch) | Graph neural network, strong spatial context. 93% match rate (ITU thesis). | Slower than LightGlue (~2x). Non-commercial license. | NVIDIA GPU, PyTorch | Model weights from official source only | Non-commercial license | Backup |
| GIM | gim (PyTorch) | Best generalization for challenging cross-domain scenes. | Additional model complexity. | NVIDIA GPU, PyTorch | Model weights from official source only | Free | Supplementary for difficult matches |
**Selected**: LightGlue as primary. GIM as supplementary for difficult satellite matches.
### Component: Visual Odometry (Consecutive Frame Matching)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| Homography-based VO | OpenCV findHomography, decomposeHomographyMat | Perfect for downward camera + flat terrain. Cleanly gives rotation + translation. Known altitude resolves scale. Simple, fast. | Assumes planar ground (valid for steppe at 400m). Fails at sharp turns (by design). | OpenCV, NumPy | N/A | Free | **Best** |
| Essential matrix VO | OpenCV findEssentialMat, recoverPose | More general than homography. Works for non-planar scenes. | Scale ambiguity harder to resolve. More complex. Unnecessary for our flat terrain case. | OpenCV | N/A | Free | Overengineered |
| ORB-SLAM3 monocular | ORB-SLAM3 | Full SLAM with map management, loop closure. | Heavy dependency. Map building unnecessary. Scale ambiguity. | ROS (optional), C++ | N/A | Free (GPL) | Too complex |
**Selected**: Homography-based VO with SuperPoint+LightGlue features.
**VO Pipeline per frame**:
1. Extract SuperPoint features from current image
2. Match with previous image using LightGlue
3. Estimate homography (cv2.findHomography with RANSAC)
4. Decompose homography → rotation + translation (cv2.decomposeHomographyMat)
5. Select correct decomposition (motion must be consistent with previous direction)
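6. Compute the ground sample distance in meters per pixel: `GSD = (altitude × sensor_width) / (focal_length × image_width)`, with `sensor_width` and `focal_length` in the same unit and `image_width` in pixels
7. Convert pixel displacement to meters: `displacement_m = displacement_px × GSD`
8. Update position: `new_pos = prev_pos + rotation × displacement_m`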
9. Report inlier ratio as match quality metric
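
The GSD and position-update steps above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function names, the axis conventions (image x to the right, image y downward, heading clockwise from north) and the camera parameters in the example are assumptions made for the sketch.

```python
import math


def ground_sample_distance(altitude_m, sensor_width_mm, focal_length_mm, image_width_px):
    """Metres of ground covered by one pixel for a nadir camera over flat terrain."""
    return (altitude_m * sensor_width_mm) / (focal_length_mm * image_width_px)


def update_position(prev_east_m, prev_north_m, dx_px, dy_px, heading_rad, gsd_m):
    """Advance a local east/north position by a camera displacement given in pixels.

    Assumed conventions (not from the source): image x points right, image y
    points down, and heading is measured clockwise from north.
    """
    right_m = dx_px * gsd_m
    forward_m = -dy_px * gsd_m  # image y grows downward, so "up" is forward
    sin_h, cos_h = math.sin(heading_rad), math.cos(heading_rad)
    east = prev_east_m + right_m * cos_h + forward_m * sin_h
    north = prev_north_m - right_m * sin_h + forward_m * cos_h
    return east, north


# Hypothetical camera: 23.5 mm sensor width, 25 mm lens, 6000 px image width, 400 m altitude.
gsd = ground_sample_distance(400.0, 23.5, 25.0, 6000)

# Camera moved 1000 px "forward" in the image while heading due east.
east, north = update_position(0.0, 0.0, 0.0, -1000.0, math.radians(90.0), gsd)
print(f"GSD = {gsd:.4f} m/px, east = {east:.1f} m, north = {north:.1f} m")
```

Because the GSD scales linearly with altitude, any altitude error propagates directly into the metric displacement, which is one reason the satellite anchors are needed to bound drift.
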
### Component: Satellite Image Geo-Referencing
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| Direct cross-view matching with perspective warping | SuperPoint+LightGlue, OpenCV warpPerspective | Pre-warp UAV image to approximate nadir view. Reduces viewpoint gap. Proven approach. | Needs rough camera pose estimate (from VO) for warping. | PyTorch, OpenCV | API key secured | Google Maps API cost | **Best** |
| Template matching (normalized cross-correlation) | OpenCV matchTemplate | Simple, no learning required. | Very sensitive to scale/rotation/illumination differences. Poor for cross-view. | OpenCV | N/A | Free | Poor for cross-view |
| VPR retrieval + refinement (NetVLAD/CosPlace) | torchvision, faiss | Handles large search areas. | Coarse localization only (tile-level). Needs fine-grained refinement step. | PyTorch, faiss | N/A | Free | Supplementary — coarse search |
**Selected**: Direct cross-view matching with perspective warping using SuperPoint + LightGlue.
**Satellite Matching Pipeline per frame**:
1. Estimate approximate position from VO
2. Fetch satellite tile(s) from cache at estimated position (zoom 18, ~0.4m/px)
3. Crop satellite region matching UAV image footprint (with margin)
4. Warp UAV image to approximate nadir view using estimated camera pose
5. Extract SuperPoint features from warped UAV image
6. Extract SuperPoint features from satellite crop (can be pre-computed and cached)
7. Match with LightGlue
8. If insufficient matches: try GIM, try wider search area, try zoom 17
9. If sufficient matches (≥15 inliers):
   - Estimate homography from matches
   - Transform image center through homography → satellite pixel coordinates
   - Convert satellite pixel coordinates to WGS84 using tile geo-referencing
   - Use the result as the absolute position anchor
10. Report match count and inlier ratio as confidence metrics
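
Projecting the image center through the homography and converting the resulting tile pixel to WGS84 could look roughly like the sketch below. It assumes 256-pixel tiles in the standard Web Mercator XYZ scheme; the function names are hypothetical, and the homography is passed as plain nested lists so the example stays dependency-free.

```python
import math

TILE_SIZE = 256  # assumed tile edge length in pixels


def apply_homography(h, x, y):
    """Project (x, y) through a 3x3 homography given as nested lists."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return u / w, v / w


def tile_pixel_to_wgs84(tile_x, tile_y, zoom, px, py):
    """Convert a pixel inside an XYZ tile to (lat, lon) in degrees."""
    n = 2 ** zoom
    fx = (tile_x + px / TILE_SIZE) / n  # fraction of the world width
    fy = (tile_y + py / TILE_SIZE) / n  # fraction of the world height
    lon = fx * 360.0 - 180.0
    lat = math.degrees(math.atan(math.sinh(math.pi * (1.0 - 2.0 * fy))))
    return lat, lon


def ground_resolution_m_per_px(lat_deg, zoom):
    """Web Mercator ground resolution for 256 px tiles."""
    return 156543.03392 * math.cos(math.radians(lat_deg)) / (2 ** zoom)


# Hypothetical homography: a pure translation of (+10, +20) pixels.
H = [[1.0, 0.0, 10.0], [0.0, 1.0, 20.0], [0.0, 0.0, 1.0]]
sat_px, sat_py = apply_homography(H, 100.0, 100.0)

# The center pixel of the single zoom-0 tile is the origin of WGS84.
lat0, lon0 = tile_pixel_to_wgs84(0, 0, 0, 128, 128)

# Resolution at zoom 18 for an assumed latitude of 48 degrees.
res = ground_resolution_m_per_px(48.0, 18)
```

The ~0.4m/px figure quoted for zoom 18 holds near 48° latitude; the resolution is coarser towards the equator (about 0.6m/px), so the footprint crop should be computed from the estimated latitude.
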
### Component: Sliding Window Position Optimizer
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| Constrained sliding window optimization | scipy.optimize, NumPy | Fuses VO + satellite anchors. Constrains maximum drift. Smooths trajectory. Inspired by YFS90 (<7m). | Window size tuning needed. | SciPy, NumPy | N/A | Free | **Best** |
| Extended Kalman Filter | filterpy | Standard, well-understood. Online fusion. | Linearization approximation. Single-pass, no backward smoothing. | filterpy | N/A | Free | Good simpler alternative |
| Pose Graph Optimization | g2o or GTSAM (Python bindings) | Globally optimal. Handles complex factor graphs. | Heavy C++ dependency. Overkill for sequential processing. | g2o/GTSAM, C++ | N/A | Free | Over-engineered |
**Selected**: Constrained sliding window optimization (primary), with EKF as simpler initial implementation.
**Optimizer behavior**:
- Maintains a sliding window of last N positions (N=20-50)
- VO estimates provide relative motion constraints between consecutive positions
- Satellite matches provide absolute position anchors (hard/soft constraints)
- Maximum drift constraint: cumulative VO displacement between anchors < 100m
- Optimization minimizes: sum of VO residuals + anchor residuals + smoothness penalty
- On each new frame: add to window, re-optimize, emit updated positions
- Enables refinement: earlier positions improve as new anchors arrive
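
To make the fusion idea concrete, here is a deliberately simplified sketch. It minimizes only the VO residuals and anchor residuals with a weighted least-squares cost, solved by coordinate descent in pure Python rather than `scipy.optimize`; the smoothness penalty and the 100m drift constraint are omitted. The weights, function name and example data are assumptions for illustration.

```python
def fuse_window(initial, vo_deltas, anchors, w_vo=1.0, w_anchor=4.0, iterations=200):
    """Fuse relative VO motion with absolute anchors inside one window.

    initial   -- list of (east, north) starting guesses, one per frame
    vo_deltas -- list of (d_east, d_north) from frame i to frame i + 1
    anchors   -- dict mapping frame index -> (east, north) satellite fix

    Each pass sets every coordinate to the weighted mean of what its
    neighbours and its anchor (if any) predict, which is the exact
    minimiser of the quadratic cost with the other positions held fixed.
    """
    positions = [list(p) for p in initial]
    count = len(positions)
    for _ in range(iterations):
        for i in range(count):
            for axis in (0, 1):
                num, den = 0.0, 0.0
                if i > 0:
                    num += w_vo * (positions[i - 1][axis] + vo_deltas[i - 1][axis])
                    den += w_vo
                if i < count - 1:
                    num += w_vo * (positions[i + 1][axis] - vo_deltas[i][axis])
                    den += w_vo
                if i in anchors:
                    num += w_anchor * anchors[i][axis]
                    den += w_anchor
                if den > 0.0:
                    positions[i][axis] = num / den
    return [tuple(p) for p in positions]


# Hypothetical window: VO over-estimates each 10 m step as 11 m.
vo_deltas = [(11.0, 0.0)] * 4
dead_reckoned = [(11.0 * i, 0.0) for i in range(5)]
anchors = {0: (0.0, 0.0), 4: (40.0, 0.0)}
fused = fuse_window(dead_reckoned, vo_deltas, anchors)
```

In this example dead reckoning ends 4m past the true end point, while the fused estimate is pulled back to within a fraction of a meter of the anchor and the correction is spread evenly over the intermediate frames. That is the "earlier positions improve as new anchors arrive" behavior in miniature.
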
### Component: Segment Manager
The segment manager is the core architectural pattern, not an edge case handler.
**Segment lifecycle**:
1. **Start condition**: First image of flight, or VO failure (feature match count < threshold)
2. **Active tracking**: VO provides frame-to-frame motion within segment
3. **Anchoring**: Satellite matching provides absolute position for segment's images
4. **End condition**: VO failure (sharp turn, outlier, occlusion)
5. **New segment**: Starts from satellite anchor or user-provided GPS
**Segment states**:
- `ANCHORED`: At least one satellite match provides absolute position → HIGH confidence
- `FLOATING`: No satellite match yet → positioned relative to start point only → LOW confidence
- `USER_ANCHORED`: User provided manual GPS → MEDIUM confidence (human error possible)
**Segment stitching**:
- All segments share the WGS84 coordinate frame via satellite matching
- No direct inter-segment matching needed
- A segment without any satellite anchor remains "floating" and is flagged for user input
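
The lifecycle and states above map naturally onto a small state holder. The sketch below is one possible shape, not the actual implementation: the class names, the `MIN_VO_MATCHES` threshold and the rule that a satellite match takes precedence over a manual anchor are all assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class SegmentState(Enum):
    FLOATING = "FLOATING"
    ANCHORED = "ANCHORED"
    USER_ANCHORED = "USER_ANCHORED"


CONFIDENCE = {
    SegmentState.ANCHORED: "HIGH",
    SegmentState.USER_ANCHORED: "MEDIUM",
    SegmentState.FLOATING: "LOW",
}

MIN_VO_MATCHES = 30  # hypothetical VO failure threshold


@dataclass
class Segment:
    segment_id: int
    image_ids: list = field(default_factory=list)
    state: SegmentState = SegmentState.FLOATING

    def on_satellite_match(self):
        # Assumption: a satellite anchor always wins over a manual one.
        self.state = SegmentState.ANCHORED

    def on_user_anchor(self):
        if self.state is SegmentState.FLOATING:
            self.state = SegmentState.USER_ANCHORED

    @property
    def confidence(self):
        return CONFIDENCE[self.state]


class SegmentManager:
    def __init__(self):
        self.segments = [Segment(segment_id=0)]

    @property
    def current(self):
        return self.segments[-1]

    def add_image(self, image_id, vo_match_count):
        """Append an image, opening a new segment when VO to the previous frame failed."""
        if self.current.image_ids and vo_match_count < MIN_VO_MATCHES:
            self.segments.append(Segment(segment_id=len(self.segments)))
        self.current.image_ids.append(image_id)
        return self.current


manager = SegmentManager()
manager.add_image("img_001", vo_match_count=0)    # first frame, nothing to match against
manager.add_image("img_002", vo_match_count=120)
manager.segments[0].on_satellite_match()
manager.add_image("img_003", vo_match_count=5)    # sharp turn: VO fails, new segment
floating = [s.segment_id for s in manager.segments if s.state is SegmentState.FLOATING]
```

Segments listed in `floating` are the ones that would trigger a `user_input_needed` event if no satellite match arrives.
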
### Component: Satellite Tile Cache Manager
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| Progressive download with disk cache | aiohttp, aiofiles, sqlite3 | Async download doesn't block pipeline. Tiles cached to disk. Progressive expansion follows route. | Needs internet during processing. First few images may wait for tiles. | Google Maps Tiles API key | API key in env var, not in code | Google Maps API: $200/month free credit covers ~40K tiles | **Best** |
| Pre-download entire area | requests, sqlite3 | All tiles available at start. No download latency during processing. | Requires known bounding box. Large download for unknown routes. Wasteful. | Same | Same | Higher cost if area is large | For known routes |
**Selected**: Progressive download with disk cache.
**Strategy**:
1. On job start: download tiles in radius R=1km around starting GPS at zoom 18
2. As route extends: download tiles ahead of estimated position (radius 500m)
3. Cache tiles on disk in `{zoom}/{x}/{y}.jpg` directory structure
4. Cache is persistent across jobs — tiles are reused for overlapping areas
5. Pre-compute SuperPoint features for cached tiles (saved alongside tile images)
6. If tile download fails or is unavailable: log warning, mark position as VO-only
**Tile download budget**:
- Initial 1km radius at zoom 18: ~300 tiles (~12MB)
- Per-frame expansion: 5-20 new tiles (~0.2-0.8MB)
- Full 20km flight: ~2000 tiles (~80MB) over the course of processing
- Well within $200/month Google Maps free credit
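
The tile budget can be sanity-checked by enumerating the tiles that cover the initial radius. The sketch below assumes the standard Web Mercator XYZ tiling and uses a bounding square rather than a circle, so it returns somewhat more tiles than the circular estimate above. The start coordinates are hypothetical, and 111,320 m per degree of latitude is an approximation.

```python
import math

METERS_PER_DEG_LAT = 111_320.0  # approximate


def latlon_to_tile(lat, lon, zoom):
    """Return the XYZ tile indices (x, y) containing a WGS84 point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y


def tiles_in_radius(lat, lon, radius_m, zoom):
    """List (zoom, x, y) for every tile touching the bounding square of the radius."""
    dlat = radius_m / METERS_PER_DEG_LAT
    dlon = radius_m / (METERS_PER_DEG_LAT * math.cos(math.radians(lat)))
    x_min, y_min = latlon_to_tile(lat + dlat, lon - dlon, zoom)  # north-west corner
    x_max, y_max = latlon_to_tile(lat - dlat, lon + dlon, zoom)  # south-east corner
    return [
        (zoom, x, y)
        for x in range(x_min, x_max + 1)
        for y in range(y_min, y_max + 1)
    ]


def cache_path(zoom, x, y):
    """Relative path following the {zoom}/{x}/{y}.jpg cache layout."""
    return f"{zoom}/{x}/{y}.jpg"


# Hypothetical start point at 48 N, 35 E.
initial_tiles = tiles_in_radius(48.0, 35.0, 1000.0, 18)
print(len(initial_tiles), "tiles, first path:", cache_path(*initial_tiles[0]))
```

At this latitude a zoom-18 tile is roughly 100m on a side, so a 2km square needs about 20 × 20 tiles; restricting the download to the inscribed circle brings the count close to the ~300 tiles budgeted.
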
### Component: API & Real-Time Streaming
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| FastAPI + SSE | FastAPI ≥0.135.0, EventSourceResponse, uvicorn | Native SSE support. Async pipeline. Excellent for ML workloads. OpenAPI docs auto-generated. | Python GIL (mitigated with asyncio + GPU-bound ops). | Python 3.11+, uvicorn | CORS configuration, API key auth | Free | **Best** |
**Selected**: FastAPI + SSE.
**API Endpoints**:
```
POST /jobs
  Body: { start_lat, start_lon, altitude, camera_params, image_folder }
  Returns: { job_id }

GET /jobs/{job_id}/stream
  SSE stream of:
    - { event: "position", data: { image_id, lat, lon, confidence, segment_id } }
    - { event: "refined", data: { image_id, lat, lon, confidence } }
    - { event: "segment_start", data: { segment_id, reason } }
    - { event: "user_input_needed", data: { image_id, reason } }
    - { event: "complete", data: { summary } }

POST /jobs/{job_id}/anchor
  Body: { image_id, lat, lon }
  Manual user GPS input for an image

GET /jobs/{job_id}/point-to-gps?image_id=X&px=100&py=200
  Returns: { lat, lon, confidence }
  Interactive point-to-GPS lookup

GET /jobs/{job_id}/results
  Returns: full results as GeoJSON or CSV
```
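
On the wire, each of the stream events listed above is a text block with an `event:` line, a `data:` line and a terminating blank line. The helper below is a framework-independent sketch of that framing; the function name and the sample payload values are made up for illustration, and it does not show how the service wires this into FastAPI.

```python
import json


def format_sse(event, data):
    """Serialize one server-sent event with a JSON payload.

    json.dumps emits a single line by default, so one data field is enough.
    """
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


# Hypothetical payloads using the field names from the endpoint list.
frames = [
    format_sse("segment_start", {"segment_id": 0, "reason": "first_image"}),
    format_sse(
        "position",
        {"image_id": "img_001", "lat": 48.0, "lon": 35.0, "confidence": "HIGH", "segment_id": 0},
    ),
    format_sse("complete", {"summary": {"images": 1}}),
]
print("".join(frames), end="")
```

A browser `EventSource` client dispatches on the `event` name, so `position` and `refined` updates can be handled by separate listeners without inspecting the payload.
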
### Component: Interactive Point-to-GPS Lookup
For each processed image, the system stores the estimated camera-to-ground homography (from either satellite matching or VO+estimated pose). Given a pixel coordinate (px, py) in an image:
1. If image has satellite match: use the computed homography to project (px, py) → satellite tile coordinates → WGS84. High confidence.
2. If image has only VO pose: use camera intrinsics + estimated altitude + estimated heading to ray-cast (px, py) to the ground plane → WGS84. Medium confidence.
3. Both methods return confidence score based on the underlying position estimate quality.
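
For the VO-only case, the ray-cast reduces to a scaled and rotated pixel offset when the camera is assumed to look straight down over flat ground. The sketch below makes that simplification explicit: intrinsics and altitude are folded into a single GSD value, the heading is clockwise from north, and 111,320 m per degree of latitude is an approximation. The function name and example values are hypothetical.

```python
import math

METERS_PER_DEG_LAT = 111_320.0  # approximate


def pixel_to_gps(px, py, image_w, image_h, center_lat, center_lon, heading_rad, gsd_m):
    """Map an image pixel to (lat, lon) for a nadir camera over flat ground."""
    right_m = (px - image_w / 2.0) * gsd_m
    forward_m = (image_h / 2.0 - py) * gsd_m  # image y grows downward
    sin_h, cos_h = math.sin(heading_rad), math.cos(heading_rad)
    east_m = right_m * cos_h + forward_m * sin_h
    north_m = -right_m * sin_h + forward_m * cos_h
    lat = center_lat + north_m / METERS_PER_DEG_LAT
    lon = center_lon + east_m / (METERS_PER_DEG_LAT * math.cos(math.radians(center_lat)))
    return lat, lon


# Hypothetical 6000 x 4000 image centered at 48 N, 35 E, heading north, GSD 0.0627 m/px.
center = pixel_to_gps(3000, 2000, 6000, 4000, 48.0, 35.0, 0.0, 0.0627)
top_edge = pixel_to_gps(3000, 0, 6000, 4000, 48.0, 35.0, 0.0, 0.0627)
```

Here `top_edge` lies about 125m north of the image center (2000 px × 0.0627 m/px). Errors in the estimated heading grow with distance from the center, which is consistent with assigning this path a lower confidence than the homography-based lookup.
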
## Testing Strategy
### Integration / Functional Tests
- End-to-end pipeline test using provided 60-image sample dataset with ground truth GPS
- Verify at least 80% of positions are within 50m of ground truth
- Verify at least 60% of positions are within 20m of ground truth
- Test sharp turn handling: simulate turn by reordering/skipping images
- Test segment creation and reconnection
- Test user manual anchor injection
- Test point-to-GPS lookup accuracy against known coordinates
- Test SSE streaming delivers results within 1s of processing completion
- Test with FullHD resolution images (degraded accuracy expected, but pipeline must not fail)
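
The accuracy checks above come down to a per-image great-circle error and a pass rate per threshold. A minimal sketch of that metric follows, using the haversine formula with a mean Earth radius of 6,371km; the helper names and the sample error values are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2.0) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def fraction_within(errors_m, threshold_m):
    """Share of position errors at or below the threshold."""
    return sum(e <= threshold_m for e in errors_m) / len(errors_m)


# Hypothetical per-image errors in meters.
errors = [5.0, 15.0, 18.0, 30.0, 70.0]
passes = fraction_within(errors, 50.0) >= 0.80 and fraction_within(errors, 20.0) >= 0.60

# One degree of longitude on the equator, as a quick sanity check.
one_degree = haversine_m(0.0, 0.0, 0.0, 1.0)
```

In a real test the `errors` list would be built by calling `haversine_m` on each estimated position and its ground-truth coordinate from the sample dataset.
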
### Non-Functional Tests
- Processing speed: <5s per image on RTX 2060 (target <2s)
- Memory: peak RAM <16GB, VRAM <6GB during 3000-image flight
- Memory leak test: process 3000 images, verify stable memory
- Concurrent jobs: 2 simultaneous flights, verify isolation
- Tile cache: verify tiles are cached and reused across jobs
- API: load test SSE connections (10 simultaneous clients)
- Recovery: kill and restart service mid-job, verify job can resume
### Security Tests
- API key authentication enforcement
- Google Maps API key not exposed in responses or logs
- Image folder path traversal prevention
- Input validation (GPS coordinates, camera parameters)
- Rate limiting on API endpoints
## References
- [YFS90/GNSS-Denied-UAV-Geolocalization](https://github.com/YFS90/GNSS-Denied-UAV-Geolocalization) — <7m MAE without IMU
- [AerialPositioning](https://github.com/hamitbugrabayram/AerialPositioning) — tile engine and deep matcher integration reference
- [NaviLoc (2025)](https://www.mdpi.com/2504-446X/10/2/97) — trajectory-level visual localization, 19.5m MLE
- [ITU Thesis (2025)](https://polen.itu.edu.tr/items/1fe1e872-7cea-44d8-a8de-339e4587bee6) — ORB-SLAM3 + SIM integration
- [Mateos-Ramirez et al. (2024)](https://www.mdpi.com/2076-3417/14/16/7420) — VO + satellite correction for fixed-wing UAV
- [LightGlue (ICCV 2023)](https://github.com/cvg/LightGlue) — feature matching
- [SuperPoint](https://github.com/magicleap/SuperPointPretrainedNetwork) — feature extraction
- [DALGlue (2025)](https://www.nature.com/articles/s41598-025-21602-5) — 11.8% improvement over LightGlue
- [SCAR (2026)](https://arxiv.org/html/2602.16349v1) — satellite-based aerial calibration
- [DUSt3R/MASt3R evaluation (2025)](https://arxiv.org/abs/2507.14798) — extreme low-overlap matching
- [FastAPI SSE docs](https://fastapi.tiangolo.com/tutorial/server-sent-events/)
- [Google Maps Tiles API](https://developers.google.com/maps/documentation/tile/satellite)
## Related Artifacts
- AC assessment: `_docs/00_research/gps_denied_visual_nav/00_ac_assessment.md`
- Comparison framework: `_docs/00_research/gps_denied_visual_nav/03_comparison_framework.md`
- Reasoning chain: `_docs/00_research/gps_denied_visual_nav/04_reasoning_chain.md`