Solution Draft
Product Solution Description
A Python-based GPS-denied visual navigation service that determines GPS coordinates of consecutive UAV aerial photo centers using visual odometry, satellite image geo-referencing, and sliding window position optimization. The system operates as a background REST API service with real-time SSE streaming.
Core approach: Consecutive images are matched using learned features (SuperPoint + LightGlue) to estimate relative motion (visual odometry). Each image is also matched against cached Google Maps satellite tiles to obtain an absolute position anchor whenever enough matches are found. A sliding window optimizer fuses VO estimates with satellite anchors, constraining drift. The system handles route disconnections by treating each continuous VO chain as an independent segment, geo-referenced through satellite matching.
┌─────────────────────────────────────────────────────────────────────┐
│ Client (Desktop App) │
│ POST /jobs (start GPS, camera params, image folder) │
│ GET /jobs/{id}/stream (SSE) │
│ POST /jobs/{id}/anchor (user manual GPS input) │
│ GET /jobs/{id}/point-to-gps (image_id, px, py) │
└──────────────────────┬──────────────────────────────────────────────┘
│ HTTP/SSE
┌──────────────────────▼──────────────────────────────────────────────┐
│ FastAPI Service Layer │
│ Job Manager → Pipeline Orchestrator → SSE Event Emitter │
└──────────────────────┬──────────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────────┐
│ Processing Pipeline │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────────────┐ │
│ │ Feature │ │ Visual │ │ Satellite Geo-Referencing │ │
│ │ Extractor │→│ Odometry │→│ (cross-view matching) │ │
│ │ (SuperPoint) │ │ (homography) │ │ (SuperPoint+LightGlue) │ │
│ └─────────────┘ └──────────────┘ └────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Sliding Window Position Optimizer │ │
│ │ (VO estimates + satellite anchors + drift constraints) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────▼──────────────────────────────────┐ │
│ │ Segment Manager │ │
│ │ (independent segments, satellite-anchored stitching) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Satellite Tile Cache Manager │ │
│ │ (progressive download, Google Maps Tiles API, disk cache) │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Existing/Competitor Solutions Analysis
| Solution | Approach | Accuracy | IMU Required | Open Source | Relevance |
|---|---|---|---|---|---|
| YFS90/GNSS-Denied-UAV-Geolocalization | VO + satellite matching + terrain-weighted constraint optimization | <7m MAE | No | Yes (GitHub, 69★) | Highest — same constraints, best results |
| AerialPositioning (hamitbugrabayram) | Multi-provider tile engine + deep matchers + perspective warping | Not reported | Simulated INS | Yes (GitHub, 52★) | High — tile engine and perspective warping reference |
| NaviLoc | Trajectory-level VPR + VIO fusion | 19.5m MLE | Yes (VIO) | Partial | Medium — uses IMU, different altitude range |
| ITU Thesis (Öztürk 2025) | ORB-SLAM3 + SuperPoint/SuperGlue/GIM SIM | GPS-level | No | No | High — architecture reference |
| Mateos-Ramirez et al. (2024) | ORB VO + AKAZE satellite + Kalman filter | 143m mean (17km) | Yes | No | Medium — higher altitude, uses IMU |
| VisionUAV-Navigation | Multi-algorithm feature detection + satellite matching | Not reported | Not stated | Yes (GitHub) | Low — early stage |
Key insight from competitor analysis: YFS90 achieves <7m without IMU using terrain-weighted constraint optimization. This validates that our target accuracy (20-50m) is realistic and possibly conservative. The sliding window optimization approach is the critical differentiator from simpler VO+satellite systems.
Architecture
Component: Feature Extraction
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|---|---|---|---|---|---|---|---|
| SuperPoint | superpoint (PyTorch) | Learned features, robust to viewpoint/illumination changes. Good repeatability in aerial scenes. GPU-accelerated. ~80ms per image. | Requires GPU. Fixed descriptor dimension (256). | NVIDIA GPU, PyTorch, CUDA | Model weights from official source only | Free to download; verify the license of the chosen weights (the original Magic Leap release is limited to non-commercial research use) | Best |
| SIFT | OpenCV cv2.SIFT | Classical, well-understood. Scale/rotation invariant. Good satellite matching (SIFT+LightGlue top on ISPRS 2025). | Slower than SuperPoint. Less robust to extreme viewpoint changes. | OpenCV | N/A | Free | Good fallback |
| ORB | OpenCV cv2.ORB | Very fast. Many keypoints. | Not scale-invariant. Poor for cross-view matching. | OpenCV | N/A | Free | Only for fast VO |
Selected: SuperPoint as primary (both VO and satellite matching — unified pipeline). SIFT as fallback for satellite matching where SuperPoint struggles.
Component: Feature Matching
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|---|---|---|---|---|---|---|---|
| LightGlue | lightglue (PyTorch) | Fastest learned matcher (~20-50ms). Adaptive pruning. Best on satellite benchmarks. ONNX/TensorRT support for 2-4x speedup. | Requires GPU for best performance. | NVIDIA GPU, PyTorch | Model weights from official source only | Free (Apache 2.0) | Best |
| SuperGlue | superglue (PyTorch) | Graph neural network, strong spatial context. 93% match rate (ITU thesis). | Slower than LightGlue (~2x). Non-commercial license. | NVIDIA GPU, PyTorch | Model weights from official source only | Non-commercial license | Backup |
| GIM | gim (PyTorch) | Best generalization for challenging cross-domain scenes. | Additional model complexity. | NVIDIA GPU, PyTorch | Model weights from official source only | Free | Supplementary for difficult matches |
Selected: LightGlue as primary. GIM as supplementary for difficult satellite matches.
Component: Visual Odometry (Consecutive Frame Matching)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|---|---|---|---|---|---|---|---|
| Homography-based VO | OpenCV findHomography, decomposeHomographyMat | Perfect for downward camera + flat terrain. Cleanly gives rotation + translation. Known altitude resolves scale. Simple, fast. | Assumes planar ground (valid for steppe at 400m). Fails at sharp turns (by design). | OpenCV, NumPy | N/A | Free | Best |
| Essential matrix VO | OpenCV findEssentialMat, recoverPose | More general than homography. Works for non-planar scenes. | Scale ambiguity harder to resolve. More complex. Unnecessary for our flat terrain case. | OpenCV | N/A | Free | Overengineered |
| ORB-SLAM3 monocular | ORB-SLAM3 | Full SLAM with map management, loop closure. | Heavy dependency. Map building unnecessary. Scale ambiguity. | ROS (optional), C++ | N/A | Free (GPL) | Too complex |
Selected: Homography-based VO with SuperPoint+LightGlue features.
VO Pipeline per frame:
- Extract SuperPoint features from current image
- Match with previous image using LightGlue
- Estimate homography (cv2.findHomography with RANSAC)
- Decompose homography → rotation + translation (cv2.decomposeHomographyMat)
- Select correct decomposition (motion must be consistent with previous direction)
- Convert pixel displacement to meters:
  - displacement_m = displacement_px × GSD
  - GSD = (altitude × sensor_width) / (focal_length × image_width)
- Update position: new_pos = prev_pos + rotation × displacement_m
- Report inlier ratio as match quality metric
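The pixel-to-meter conversion in the steps above can be sketched as follows. This is a minimal illustration, assuming a nadir-pointing camera over flat terrain; the `CameraParams` container and its field names are hypothetical and only stand in for the camera parameters passed to the service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CameraParams:
    """Hypothetical container for the camera parameters supplied with a job."""
    sensor_width_mm: float
    focal_length_mm: float
    image_width_px: int


def ground_sample_distance(altitude_m: float, cam: CameraParams) -> float:
    """Meters of ground covered by one pixel.

    GSD = (altitude x sensor_width) / (focal_length x image_width).
    Sensor width and focal length must share the same unit (mm here).
    """
    return (altitude_m * cam.sensor_width_mm) / (
        cam.focal_length_mm * cam.image_width_px
    )


def displacement_to_meters(dx_px: float, dy_px: float,
                           altitude_m: float, cam: CameraParams):
    """Scale a pixel displacement from the homography into meters."""
    gsd = ground_sample_distance(altitude_m, cam)
    return dx_px * gsd, dy_px * gsd


# Illustrative values only: 400 m altitude, 24 mm sensor, 24 mm lens, 4000 px wide.
cam = CameraParams(sensor_width_mm=24.0, focal_length_mm=24.0, image_width_px=4000)
gsd = ground_sample_distance(400.0, cam)          # 0.1 m per pixel
dx_m, dy_m = displacement_to_meters(250.0, -100.0, 400.0, cam)
```

With these example numbers a 250 px shift between frames corresponds to roughly 25 m on the ground, which is the quantity fed into the position update.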
Component: Satellite Image Geo-Referencing
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|---|---|---|---|---|---|---|---|
| Direct cross-view matching with perspective warping | SuperPoint+LightGlue, OpenCV warpPerspective | Pre-warp UAV image to approximate nadir view. Reduces viewpoint gap. Proven approach. | Needs rough camera pose estimate (from VO) for warping. | PyTorch, OpenCV | API key secured | Google Maps API cost | Best |
| Template matching (normalized cross-correlation) | OpenCV matchTemplate | Simple, no learning required. | Very sensitive to scale/rotation/illumination differences. Poor for cross-view. | OpenCV | N/A | Free | Poor for cross-view |
| VPR retrieval + refinement (NetVLAD/CosPlace) | torchvision, faiss | Handles large search areas. | Coarse localization only (tile-level). Needs fine-grained refinement step. | PyTorch, faiss | N/A | Free | Supplementary — coarse search |
Selected: Direct cross-view matching with perspective warping using SuperPoint + LightGlue.
Satellite Matching Pipeline per frame:
- Estimate approximate position from VO
- Fetch satellite tile(s) from cache at estimated position (zoom 18, ~0.4m/px)
- Crop satellite region matching UAV image footprint (with margin)
- Warp UAV image to approximate nadir view using estimated camera pose
- Extract SuperPoint features from warped UAV image
- Extract SuperPoint features from satellite crop (can be pre-computed and cached)
- Match with LightGlue
- If insufficient matches: try GIM, try wider search area, try zoom 17
- If sufficient matches (≥15 inliers):
  - a. Estimate homography from matches
  - b. Transform image center through homography → satellite pixel coordinates
  - c. Convert satellite pixel coordinates to WGS84 using tile geo-referencing
  - d. This is the absolute position anchor
- Report match count and inlier ratio as confidence metrics
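Steps b and c of the anchor computation can be sketched with plain Python. This assumes the standard 256 px Web Mercator tiling scheme and a homography already estimated elsewhere (for example by `cv2.findHomography`); the function names are illustrative, not part of the design.

```python
import math

TILE_SIZE = 256  # pixels per tile edge in the standard Web Mercator scheme


def apply_homography(h, x, y):
    """Project pixel (x, y) through a 3x3 homography given as nested lists."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return u, v


def tile_pixel_to_wgs84(tile_x, tile_y, px, py, zoom):
    """Convert a pixel inside tile (tile_x, tile_y) to (lat, lon) in degrees."""
    world = TILE_SIZE * (2 ** zoom)          # width of the whole map in pixels
    gx = tile_x * TILE_SIZE + px             # global pixel column
    gy = tile_y * TILE_SIZE + py             # global pixel row
    lon = gx / world * 360.0 - 180.0
    lat = math.degrees(math.atan(math.sinh(math.pi * (1.0 - 2.0 * gy / world))))
    return lat, lon


def meters_per_pixel(lat_deg, zoom):
    """Approximate ground resolution of a Web Mercator tile at a latitude."""
    equator_m = 2.0 * math.pi * 6378137.0
    return equator_m * math.cos(math.radians(lat_deg)) / (TILE_SIZE * 2 ** zoom)


# A pure translation stands in for the UAV-to-satellite homography here.
H = [[1.0, 0.0, 10.0], [0.0, 1.0, 20.0], [0.0, 0.0, 1.0]]
sat_px = apply_homography(H, 100.0, 100.0)
origin = tile_pixel_to_wgs84(1, 1, 0, 0, 1)   # corner shared by the four zoom-1 tiles
res_48 = meters_per_pixel(48.0, 18)
```

The resolution helper also shows why the quoted ~0.4 m/px at zoom 18 is latitude dependent: it is close to 0.4 m/px near 48° latitude and about 0.6 m/px at the equator.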
Component: Sliding Window Position Optimizer
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|---|---|---|---|---|---|---|---|
| Constrained sliding window optimization | scipy.optimize, NumPy | Fuses VO + satellite anchors. Constrains maximum drift. Smooths trajectory. Inspired by YFS90 (<7m). | Window size tuning needed. | SciPy, NumPy | N/A | Free | Best |
| Extended Kalman Filter | filterpy | Standard, well-understood. Online fusion. | Linearization approximation. Single-pass, no backward smoothing. | filterpy | N/A | Free | Good simpler alternative |
| Pose Graph Optimization | g2o or GTSAM (Python bindings) | Globally optimal. Handles complex factor graphs. | Heavy C++ dependency. Overkill for sequential processing. | g2o/GTSAM, C++ | N/A | Free | Over-engineered |
Selected: Constrained sliding window optimization (primary), with EKF as simpler initial implementation.
Optimizer behavior:
- Maintains a sliding window of last N positions (N=20-50)
- VO estimates provide relative motion constraints between consecutive positions
- Satellite matches provide absolute position anchors (hard/soft constraints)
- Maximum drift constraint: cumulative VO displacement between anchors < 100m
- Optimization minimizes: sum of VO residuals + anchor residuals + smoothness penalty
- On each new frame: add to window, re-optimize, emit updated positions
- Enables refinement: earlier positions improve as new anchors arrive
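The fusion of VO constraints and satellite anchors can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: positions are 2D points in a local metric frame, the smoothness penalty and the maximum drift constraint are omitted, and a simple coordinate-descent loop stands in for a `scipy.optimize` solver. The weights `w_vo` and `w_anchor` are hypothetical tuning parameters.

```python
def optimize_window(initial, vo_deltas, anchors,
                    w_vo=1.0, w_anchor=10.0, iterations=200):
    """Minimize sum w_vo*|p[i+1]-p[i]-d[i]|^2 + sum w_anchor*|p[k]-a[k]|^2.

    initial   -- list of (x, y) starting guesses, e.g. dead-reckoned VO positions
    vo_deltas -- list of (dx, dy), vo_deltas[i] is the VO motion from i to i+1
    anchors   -- dict {index: (x, y)} of satellite anchors inside the window
    """
    pos = [list(p) for p in initial]
    n = len(pos)
    for _ in range(iterations):
        for i in range(n):
            acc = [0.0, 0.0]
            total_w = 0.0
            if i > 0:  # constraint from the previous position
                for k in range(2):
                    acc[k] += w_vo * (pos[i - 1][k] + vo_deltas[i - 1][k])
                total_w += w_vo
            if i < n - 1:  # constraint from the next position
                for k in range(2):
                    acc[k] += w_vo * (pos[i + 1][k] - vo_deltas[i][k])
                total_w += w_vo
            if i in anchors:  # absolute constraint from a satellite match
                for k in range(2):
                    acc[k] += w_anchor * anchors[i][k]
                total_w += w_anchor
            if total_w > 0.0:
                # Closed-form minimizer for p[i] with its neighbors held fixed.
                pos[i] = [acc[0] / total_w, acc[1] / total_w]
    return [tuple(p) for p in pos]


# VO over-estimates each 10 m step as 11 m; anchors pin both ends of the window.
deltas = [(11.0, 0.0)] * 3
guess = [(0.0, 0.0), (11.0, 0.0), (22.0, 0.0), (33.0, 0.0)]
refined = optimize_window(guess, deltas, {0: (0.0, 0.0), 3: (30.0, 0.0)},
                          w_anchor=100.0)
```

In this example the 3 m of accumulated drift is spread evenly across the window, so the interior positions move back toward 10 m and 20 m. This is the refinement effect described above: earlier positions improve once a later anchor arrives.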
Component: Segment Manager
The segment manager is the core architectural pattern, not an edge case handler.
Segment lifecycle:
- Start condition: First image of flight, or VO failure (feature match count < threshold)
- Active tracking: VO provides frame-to-frame motion within segment
- Anchoring: Satellite matching provides absolute position for segment's images
- End condition: VO failure (sharp turn, outlier, occlusion)
- New segment: Starts from satellite anchor or user-provided GPS
Segment states:
- ANCHORED: At least one satellite match provides absolute position → HIGH confidence
- FLOATING: No satellite match yet → positioned relative to start point only → LOW confidence
- USER_ANCHORED: User provided manual GPS → MEDIUM confidence (human error possible)
Segment stitching:
- All segments share the WGS84 coordinate frame via satellite matching
- No direct inter-segment matching needed
- A segment without any satellite anchor remains "floating" and is flagged for user input
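The lifecycle and states above could be modeled roughly as below. This is a sketch, not the planned implementation: the class and method names are hypothetical, the `min_matches` threshold of 30 is a placeholder, and the rule that a satellite anchor is never downgraded by later manual input is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class SegmentState(Enum):
    """Segment states; the value is the confidence level attached to each."""
    FLOATING = "LOW"
    USER_ANCHORED = "MEDIUM"
    ANCHORED = "HIGH"


@dataclass
class Segment:
    segment_id: int
    image_ids: List[str] = field(default_factory=list)
    state: SegmentState = SegmentState.FLOATING

    def on_satellite_anchor(self) -> None:
        self.state = SegmentState.ANCHORED

    def on_user_anchor(self) -> None:
        # Assumption: manual input only upgrades a floating segment.
        if self.state is SegmentState.FLOATING:
            self.state = SegmentState.USER_ANCHORED


class SegmentManager:
    def __init__(self, min_matches: int = 30):  # hypothetical threshold
        self.min_matches = min_matches
        self.segments: List[Segment] = []

    def add_frame(self, image_id: str, match_count: Optional[int]) -> Segment:
        """Assign a frame to a segment; match_count is None for the first image."""
        vo_failed = match_count is None or match_count < self.min_matches
        if vo_failed or not self.segments:
            self.segments.append(Segment(segment_id=len(self.segments)))
        current = self.segments[-1]
        current.image_ids.append(image_id)
        return current

    def floating_segments(self) -> List[Segment]:
        """Segments that still need a satellite match or user input."""
        return [s for s in self.segments if s.state is SegmentState.FLOATING]


manager = SegmentManager(min_matches=30)
manager.add_frame("img_0001", None)
manager.add_frame("img_0002", 120)
after_turn = manager.add_frame("img_0003", 5)   # VO failure starts segment 1
```

Because every segment is anchored independently in WGS84, the manager never needs to match one segment against another; it only tracks which ones are still floating.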
Component: Satellite Tile Cache Manager
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|---|---|---|---|---|---|---|---|
| Progressive download with disk cache | aiohttp, aiofiles, sqlite3 | Async download doesn't block pipeline. Tiles cached to disk. Progressive expansion follows route. | Needs internet during processing. First few images may wait for tiles. | Google Maps Tiles API key | API key in env var, not in code | Google Maps API: $200/month free credit covers ~40K tiles | Best |
| Pre-download entire area | requests, sqlite3 | All tiles available at start. No download latency during processing. | Requires known bounding box. Large download for unknown routes. Wasteful. | Same | Same | Higher cost if area is large | For known routes |
Selected: Progressive download with disk cache.
Strategy:
- On job start: download tiles in radius R=1km around starting GPS at zoom 18
- As route extends: download tiles ahead of estimated position (radius 500m)
- Cache tiles on disk in a {zoom}/{x}/{y}.jpg directory structure
- Cache is persistent across jobs; tiles are reused for overlapping areas
- Pre-compute SuperPoint features for cached tiles (saved alongside tile images)
- If tile download fails or is unavailable: log warning, mark position as VO-only
Tile download budget:
- Initial 1km radius at zoom 18: ~300 tiles (~12MB)
- Per-frame expansion: 5-20 new tiles (~0.2-0.8MB)
- Full 20km flight: ~2000 tiles (~80MB) over the course of processing
- Well within $200/month Google Maps free credit
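The download strategy and the tile budget can be checked with a short sketch. It assumes standard Web Mercator tile indexing and selects tiles whose center lies within the radius; the helper names and the example coordinates are illustrative only.

```python
import math
from pathlib import Path

EARTH_RADIUS_M = 6378137.0
M_PER_DEG = math.pi * EARTH_RADIUS_M / 180.0  # meters per degree of latitude


def latlon_to_tile(lat, lon, zoom):
    """Index (x, y) of the Web Mercator tile containing a WGS84 point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y


def tile_center(x, y, zoom):
    """WGS84 (lat, lon) of the center of tile (x, y)."""
    n = 2 ** zoom
    lon = (x + 0.5) / n * 360.0 - 180.0
    lat = math.degrees(math.atan(math.sinh(math.pi * (1.0 - 2.0 * (y + 0.5) / n))))
    return lat, lon


def tiles_in_radius(lat, lon, radius_m, zoom):
    """Tiles whose center is within radius_m of (lat, lon)."""
    cos_lat = math.cos(math.radians(lat))
    dlat = radius_m / M_PER_DEG
    dlon = radius_m / (M_PER_DEG * cos_lat)
    x_min, y_min = latlon_to_tile(lat + dlat, lon - dlon, zoom)  # north-west corner
    x_max, y_max = latlon_to_tile(lat - dlat, lon + dlon, zoom)  # south-east corner
    selected = []
    for x in range(x_min, x_max + 1):
        for y in range(y_min, y_max + 1):
            clat, clon = tile_center(x, y, zoom)
            east = (clon - lon) * M_PER_DEG * cos_lat
            north = (clat - lat) * M_PER_DEG
            if math.hypot(east, north) <= radius_m:
                selected.append((x, y))
    return selected


def cache_path(root, zoom, x, y):
    """Disk location of a tile in the {zoom}/{x}/{y}.jpg layout."""
    return Path(root) / str(zoom) / str(x) / f"{y}.jpg"


initial_tiles = tiles_in_radius(48.0, 35.0, 1000.0, 18)  # example start point
missing = [t for t in initial_tiles
           if not cache_path("tile_cache", 18, *t).exists()]
```

At roughly 48° latitude a zoom-18 tile covers about 100 m on each side, so a 1 km radius yields on the order of 300 tiles, in line with the budget above. Closer to the equator tiles are larger and the count drops.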
Component: API & Real-Time Streaming
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|---|---|---|---|---|---|---|---|
| FastAPI + SSE | FastAPI ≥0.135.0, EventSourceResponse, uvicorn | Native SSE support. Async pipeline. Excellent for ML workloads. OpenAPI docs auto-generated. | Python GIL (mitigated with asyncio + GPU-bound ops). | Python 3.11+, uvicorn | CORS configuration, API key auth | Free | Best |
Selected: FastAPI + SSE.
API Endpoints:
POST /jobs
Body: { start_lat, start_lon, altitude, camera_params, image_folder }
Returns: { job_id }
GET /jobs/{job_id}/stream
SSE stream of:
- { event: "position", data: { image_id, lat, lon, confidence, segment_id } }
- { event: "refined", data: { image_id, lat, lon, confidence } }
- { event: "segment_start", data: { segment_id, reason } }
- { event: "user_input_needed", data: { image_id, reason } }
- { event: "complete", data: { summary } }
POST /jobs/{job_id}/anchor
Body: { image_id, lat, lon }
Manual user GPS input for an image
GET /jobs/{job_id}/point-to-gps?image_id=X&px=100&py=200
Returns: { lat, lon, confidence }
Interactive point-to-GPS lookup
GET /jobs/{job_id}/results
Returns: full results as GeoJSON or CSV
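For reference, each event on the stream is a small text frame in the Server-Sent Events wire format. The sketch below shows how the events listed above could be serialized; it is independent of FastAPI, and the helper names and the payload values are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def format_sse(event: str, data: Dict) -> str:
    """Serialize one event as an SSE frame: event line, data line, blank line."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def event_stream(events: Iterable[Tuple[str, Dict]]) -> Iterator[str]:
    """Yield SSE frames for a sequence of (event_name, payload) pairs."""
    for name, payload in events:
        yield format_sse(name, payload)


frames = list(event_stream([
    ("segment_start", {"segment_id": 0, "reason": "first_image"}),
    ("position", {"image_id": "img_0001", "lat": 48.0, "lon": 35.0,
                  "confidence": "HIGH", "segment_id": 0}),
    ("complete", {"summary": {"images": 1}}),
]))
```

A client distinguishes `position` from `refined` frames by the event name, so a refined estimate for an earlier image can replace the one already displayed.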
Component: Interactive Point-to-GPS Lookup
For each processed image, the system stores the estimated camera-to-ground homography (from either satellite matching or VO+estimated pose). Given a pixel coordinate (px, py) in an image:
- If image has satellite match: use the computed homography to project (px, py) → satellite tile coordinates → WGS84. High confidence.
- If image has only VO pose: use camera intrinsics + estimated altitude + estimated heading to ray-cast (px, py) to the ground plane → WGS84. Medium confidence.
- Both methods return confidence score based on the underlying position estimate quality.
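The VO-only path can be approximated as below. This sketch makes simplifying assumptions that go beyond the text: the camera points straight down, the ground is flat, the top of the image points along the heading, and the ground sample distance is already known. The function name and argument order are hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0


def pixel_to_gps_nadir(px, py, image_w, image_h,
                       cam_lat, cam_lon, heading_deg, gsd_m):
    """Project a pixel to WGS84 for a nadir camera over flat ground.

    heading_deg is measured clockwise from north; image x grows to the
    right and image y grows downward.
    """
    right_m = (px - image_w / 2.0) * gsd_m      # offset to the right of center
    forward_m = (image_h / 2.0 - py) * gsd_m    # offset toward the image top
    h = math.radians(heading_deg)
    east_m = right_m * math.cos(h) + forward_m * math.sin(h)
    north_m = -right_m * math.sin(h) + forward_m * math.cos(h)
    lat = cam_lat + math.degrees(north_m / EARTH_RADIUS_M)
    lon = cam_lon + math.degrees(
        east_m / (EARTH_RADIUS_M * math.cos(math.radians(cam_lat)))
    )
    return lat, lon


# Illustrative values: 4000x3000 image, 0.1 m/px, camera heading due east.
center = pixel_to_gps_nadir(2000, 1500, 4000, 3000, 48.0, 35.0, 90.0, 0.1)
ahead = pixel_to_gps_nadir(2000, 500, 4000, 3000, 48.0, 35.0, 90.0, 0.1)
```

The satellite-matched path replaces this geometry with the stored homography followed by the tile pixel to WGS84 conversion, which is why it can report a higher confidence.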
Testing Strategy
Integration / Functional Tests
- End-to-end pipeline test using provided 60-image sample dataset with ground truth GPS
- Verify at least 80% of positions are within 50m of ground truth
- Verify at least 60% of positions are within 20m of ground truth
- Test sharp turn handling: simulate turn by reordering/skipping images
- Test segment creation and reconnection
- Test user manual anchor injection
- Test point-to-GPS lookup accuracy against known coordinates
- Test SSE streaming delivers results within 1s of processing completion
- Test with FullHD resolution images (degraded accuracy expected, but pipeline must not fail)
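The two accuracy checks can be expressed as a small helper for the end-to-end test. This is a sketch assuming estimates and ground truth are parallel lists of (lat, lon) pairs; the function names are illustrative, and the haversine formula with a mean Earth radius is one reasonable choice of distance measure at these scales.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2.0) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2.0) ** 2
    return 2.0 * radius_m * math.asin(math.sqrt(a))


def fraction_within(estimates, truth, threshold_m):
    """Share of estimated positions within threshold_m of ground truth."""
    errors = [haversine_m(e[0], e[1], t[0], t[1]) for e, t in zip(estimates, truth)]
    return sum(err <= threshold_m for err in errors) / len(errors)


def meets_accuracy_targets(estimates, truth):
    """At least 80% within 50 m and at least 60% within 20 m."""
    return (fraction_within(estimates, truth, 50.0) >= 0.8
            and fraction_within(estimates, truth, 20.0) >= 0.6)


# Synthetic example: errors of about 11 m (x3), 33 m and 111 m.
truth = [(48.0, 35.0)] * 5
estimates = [(48.0001, 35.0)] * 3 + [(48.0003, 35.0), (48.001, 35.0)]
ok = meets_accuracy_targets(estimates, truth)
```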
Non-Functional Tests
- Processing speed: <5s per image on RTX 2060 (target <2s)
- Memory: peak RAM <16GB, VRAM <6GB during 3000-image flight
- Memory leak test: process 3000 images, verify stable memory
- Concurrent jobs: 2 simultaneous flights, verify isolation
- Tile cache: verify tiles are cached and reused across jobs
- API: load test SSE connections (10 simultaneous clients)
- Recovery: kill and restart service mid-job, verify job can resume
Security Tests
- API key authentication enforcement
- Google Maps API key not exposed in responses or logs
- Image folder path traversal prevention
- Input validation (GPS coordinates, camera parameters)
- Rate limiting on API endpoints
References
- YFS90/GNSS-Denied-UAV-Geolocalization — <7m MAE without IMU
- AerialPositioning — tile engine and deep matcher integration reference
- NaviLoc (2025) — trajectory-level visual localization, 19.5m MLE
- ITU Thesis (2025) — ORB-SLAM3 + SIM integration
- Mateos-Ramirez et al. (2024) — VO + satellite correction for fixed-wing UAV
- LightGlue (ICCV 2023) — feature matching
- SuperPoint — feature extraction
- DALGlue (2025) — 11.8% improvement over LightGlue
- SCAR (2026) — satellite-based aerial calibration
- DUSt3R/MASt3R evaluation (2025) — extreme low-overlap matching
- FastAPI SSE docs
- Google Maps Tiles API
Related Artifacts
- AC assessment: _docs/00_research/gps_denied_visual_nav/00_ac_assessment.md
- Comparison framework: _docs/00_research/gps_denied_visual_nav/03_comparison_framework.md
- Reasoning chain: _docs/00_research/gps_denied_visual_nav/04_reasoning_chain.md