# Solution Draft

## Product Solution Description

A real-time GPS-denied visual navigation system for fixed-wing UAVs, running entirely on a Jetson Orin Nano Super (8GB). The system determines frame-center GPS coordinates by fusing three information sources: (1) CUDA-accelerated visual odometry (cuVSLAM), (2) absolute position corrections from satellite image matching, and (3) IMU-based motion prediction. Results stream to clients via REST API + SSE in real time.

**Hard constraint**: The camera captures at ~3 fps (333-400ms interval). The full pipeline must complete within **400ms per frame**.

**Satellite matching strategy**: Benchmark LiteSAM on actual Orin Nano Super hardware as a day-one priority. If LiteSAM cannot achieve ≤400ms at 480px resolution, **abandon it entirely** and use XFeat semi-dense matching as the primary satellite matcher. Speed is non-negotiable.

**Core architectural principles**:

1. **cuVSLAM handles VO** — NVIDIA's CUDA-accelerated library achieves 90fps on Jetson Orin Nano, giving VO essentially "for free" (~11ms/frame).
2. **Keyframe-based satellite matching** — the satellite matcher runs on keyframes only (every 3-10 frames), amortizing its cost. Non-keyframes rely on cuVSLAM VO + IMU.
3. **Every keyframe independently attempts satellite-based geo-localization** — this handles disconnected segments natively.
4. **Pipeline parallelism** — satellite matching for frame N overlaps with VO processing of frame N+1 via CUDA streams.
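Principles 2 and 3 reduce to a per-frame scheduling decision: match on a fixed interval, on a confidence drop, or on VO failure. A minimal sketch of that trigger logic — all names and threshold values here are illustrative placeholders, not tuned parameters:

```python
from dataclasses import dataclass

@dataclass
class KeyframePolicy:
    """Decides when a frame should trigger (async) satellite matching.

    Thresholds are illustrative placeholders, not tuned values.
    """
    max_interval: int = 10           # force a keyframe at least every N frames
    min_interval: int = 3            # never more often than every N frames
    cov_threshold_m2: float = 400.0  # ESKF position variance trigger (~20m sigma)
    frames_since_keyframe: int = 0

    def is_keyframe(self, position_variance_m2: float, tracking_lost: bool) -> bool:
        self.frames_since_keyframe += 1
        if self.frames_since_keyframe < self.min_interval:
            return False  # amortize matcher cost: enforce minimum spacing
        triggered = (
            tracking_lost                                       # cuVSLAM reports VO failure
            or position_variance_m2 > self.cov_threshold_m2     # confidence drop
            or self.frames_since_keyframe >= self.max_interval  # fixed interval
        )
        if triggered:
            self.frames_since_keyframe = 0
        return triggered
```

The per-frame loop would call `is_keyframe` with the ESKF position covariance and the cuVSLAM tracking status; a `True` result launches the satellite matcher asynchronously.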
```
┌─────────────────────────────────────────────────────────────────┐
│ OFFLINE (Before Flight)                                         │
│ Satellite Tiles → Download & Crop → Store as tile pairs         │
│  (Google Maps)    (per flight plan)  (disk, GeoHash indexed)    │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│ ONLINE (During Flight)                                          │
│                                                                 │
│ EVERY FRAME (400ms budget):                                     │
│ ┌────────────────────────────────┐                              │
│ │ Camera → Downsample (CUDA 2ms) │                              │
│ │        → cuVSLAM VO+IMU (~11ms)│──→ ESKF Update → SSE Emit    │
│ └────────────────────────────────┘         ↑                    │
│                                            │                    │
│ KEYFRAMES ONLY (every 3-10 frames):        │                    │
│ ┌────────────────────────────────────┐     │                    │
│ │ Satellite match (async CUDA stream)│─────┘                    │
│ │ LiteSAM or XFeat (see benchmark)   │                          │
│ │ (does NOT block VO output)         │                          │
│ └────────────────────────────────────┘                          │
│                                                                 │
│ IMU: 100+Hz continuous → ESKF prediction                        │
└─────────────────────────────────────────────────────────────────┘
```

## Speed Optimization Techniques

### 1. cuVSLAM for Visual Odometry (~11ms/frame)

NVIDIA's CUDA-accelerated VO library (v15.0.0, March 2026) achieves 90fps on Jetson Orin Nano. Supports monocular camera + IMU natively. Features: automatic IMU fallback when visual tracking fails, loop closure, Python and C++ APIs. This eliminates custom VO entirely.

### 2. Keyframe-Based Satellite Matching

Not every frame needs satellite matching. Strategy:

- cuVSLAM provides VO at every frame (high-rate, low-latency)
- Satellite matching triggers on **keyframes** selected by:
  - Fixed interval: every 3-10 frames (~1-3.3s between satellite corrections)
  - Confidence drop: when ESKF covariance exceeds threshold
  - VO failure: when cuVSLAM reports tracking loss (sharp turn)

### 3. Satellite Matcher Selection (Benchmark-Driven)

**Candidate A: LiteSAM (opt)** — Best accuracy for satellite-aerial matching (RMSE@30 = 17.86m on UAV-VisLoc). 6.31M params, MobileOne + TAIFormer + MinGRU.
Benchmarked at 497ms on Jetson AGX Orin at 1184px. AGX Orin is 3-4x more powerful than Orin Nano Super (275 TOPS vs 67 TOPS, $2000+ vs $249). Realistic Orin Nano Super estimates:

- At 1184px: ~1.5-2.0s (unusable)
- At 640px: ~500-800ms (borderline)
- At 480px: ~300-500ms (best case)

**Candidate B: XFeat semi-dense** — ~50-100ms on Orin Nano Super. Proven on Jetson. Not specifically designed for cross-view satellite-aerial, but fast and reliable.

**Decision rule**: Benchmark LiteSAM TensorRT FP16 at 480px on Orin Nano Super. If ≤400ms → use LiteSAM. If >400ms → **abandon LiteSAM, use XFeat as primary**. No hybrid compromises — pick one and optimize it.

### 4. TensorRT FP16 Optimization

LiteSAM's MobileOne backbone is reparameterizable — the multi-branch training structure collapses to a single feed-forward path at inference. Combined with TensorRT FP16, this maximizes throughput. INT8 is possible for the MobileOne backbone, but ViT/transformer components may degrade with INT8.

### 5. CUDA Stream Pipelining

Overlap operations across consecutive frames:

- Stream A: cuVSLAM VO for current frame (~11ms) + ESKF fusion (~1ms)
- Stream B: Satellite matching for previous keyframe (async)
- CPU: SSE emission, tile management, keyframe selection logic

### 6. Pre-cropped Satellite Tiles

Offline: for each satellite tile, store both the raw image and a pre-resized version matching the satellite matcher's input resolution. Runtime avoids the resize cost.

## Existing/Competitor Solutions Analysis

| Solution | Approach | Accuracy | Hardware | Limitations |
|----------|----------|----------|----------|-------------|
| Mateos-Ramirez et al. (2024) | VO (ORB) + satellite keypoint correction + Kalman | 142m mean / 17km (0.83%) | Orange Pi class | No re-localization; ORB only; 1000m+ altitude |
| SatLoc (2025) | DinoV2 + XFeat + optical flow + adaptive fusion | <15m, >90% coverage | Edge (unspecified) | Paper not fully accessible |
| LiteSAM (2025) | MobileOne + TAIFormer + MinGRU subpixel refinement | RMSE@30 = 17.86m on UAV-VisLoc | RTX 3090 (62ms), AGX Orin (497ms) | Not tested on Orin Nano; AGX Orin is 3-4x more powerful |
| TerboucheHacene/visual_localization | SuperPoint/SuperGlue/GIM + VO + satellite | Not quantified | Desktop-class | Not edge-optimized |
| cuVSLAM (NVIDIA, 2025-2026) | CUDA-accelerated VO+SLAM, mono/stereo/IMU | <1% trajectory error (KITTI), <5cm (EuRoC) | Jetson Orin Nano (90fps) | VO only, no satellite matching |

**Key insight**: Combine cuVSLAM (best-in-class VO for Jetson) with the fastest viable satellite-aerial matcher via ESKF fusion. LiteSAM is the accuracy leader but unproven on Orin Nano Super — benchmark first, abandon for XFeat if too slow.

## Architecture

### Component: Visual Odometry

| Solution | Tools | Advantages | Limitations | Fit |
|----------|-------|-----------|-------------|-----|
| cuVSLAM (mono+IMU) | PyCuVSLAM / C++ API | 90fps on Orin Nano, NVIDIA-optimized, loop closure, IMU fallback | Closed-source CUDA library | ✅ Best |
| XFeat frame-to-frame | XFeatTensorRT | 5x faster than SuperPoint, open-source | ~30-50ms total, no IMU integration | ⚠️ Fallback |
| ORB-SLAM3 | OpenCV + custom | Well-understood, open-source | CPU-heavy, ~30fps on Orin | ⚠️ Slower |

**Selected**: **cuVSLAM (mono+IMU mode)** — purpose-built by NVIDIA for Jetson. ~11ms/frame leaves 389ms for everything else. Auto-fallback to IMU when visual tracking fails.
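The pipeline-parallel behavior described earlier — a VO result emitted every frame while satellite matching runs off the critical path — can be sketched with asyncio. Everything below is an illustrative stand-in: `satellite_match` models a LiteSAM/XFeat launch on its own CUDA stream, the events list stands in for the SSE stream, and the timings are compressed:

```python
import asyncio

async def satellite_match(frame_id: int) -> tuple[int, float, float]:
    """Stand-in for the keyframe matcher (LiteSAM/XFeat on its own CUDA stream).

    Sleeps to model a slow match; returns (frame_id, lat, lon) placeholders.
    """
    await asyncio.sleep(0.3)
    return frame_id, 0.0, 0.0

async def run_pipeline(num_frames: int, keyframe_interval: int = 3) -> list[str]:
    events = []      # stands in for the SSE event stream
    pending = set()  # in-flight satellite matches
    for frame_id in range(num_frames):
        # Per-frame critical path: VO + ESKF (~25ms) — emit immediately.
        events.append(f"vo:{frame_id}")
        # Keyframes launch a satellite match WITHOUT awaiting it.
        if frame_id % keyframe_interval == 0:
            pending.add(asyncio.create_task(satellite_match(frame_id)))
        # Harvest any completed matches; apply as delayed ESKF corrections.
        done = {t for t in pending if t.done()}
        for t in done:
            kf_id, lat, lon = t.result()
            events.append(f"sat_fix:{kf_id}")
        pending -= done
        await asyncio.sleep(0.05)  # next camera frame (interval compressed for the sketch)
    # Drain any matches still in flight at the end of the sequence.
    for kf_id, lat, lon in await asyncio.gather(*pending):
        events.append(f"sat_fix:{kf_id}")
    return events

# events = asyncio.run(run_pipeline(10))
```

The key property mirrored here is that `vo:1` is emitted long before `sat_fix:0` arrives: the keyframe match never blocks the per-frame output, matching the latency budget's "VO result emitted in ~25ms" claim.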
### Component: Satellite Image Matching

| Solution | Tools | Advantages | Limitations | Fit |
|----------|-------|-----------|-------------|-----|
| LiteSAM (opt) | TensorRT | Best satellite-aerial accuracy (RMSE@30 17.86m), 6.31M params, subpixel refinement | 497ms on AGX Orin at 1184px; AGX Orin is 3-4x more powerful than Orin Nano Super | ✅ If benchmark passes |
| XFeat semi-dense | XFeatTensorRT | ~50-100ms, lightweight, Jetson-proven | Not designed for cross-view satellite-aerial | ✅ If LiteSAM fails benchmark |
| EfficientLoFTR | TensorRT | Good accuracy, semi-dense | 15.05M params (2.4x LiteSAM), slower | ⚠️ Heavier |
| SuperPoint + LightGlue | TensorRT C++ | Good general matching | Sparse only, worse on satellite-aerial | ⚠️ Not specialized |

**Selection**: Benchmark-driven. Day-one test on Orin Nano Super:

1. Export LiteSAM (opt) to TensorRT FP16
2. Measure at 480px, 640px, 800px
3. If ≤400ms at 480px → **LiteSAM**
4. If >400ms at any viable resolution → **XFeat semi-dense** (primary, no hybrid)

### Component: Sensor Fusion

| Solution | Tools | Advantages | Limitations | Fit |
|----------|-------|-----------|-------------|-----|
| Error-State EKF (ESKF) | Custom Python/C++ | Lightweight, multi-rate, well-understood | Linear approximation | ✅ Best |
| Hybrid ESKF/UKF | Custom | 49% better accuracy | More complex | ⚠️ Upgrade path |
| Factor Graph (GTSAM) | GTSAM | Best accuracy | Heavy compute | ❌ Too heavy |

**Selected**: **ESKF** with adaptive measurement noise. State vector: [position(3), velocity(3), orientation_quat(4), accel_bias(3), gyro_bias(3)] = 16 states.

Measurement sources and rates:

- IMU prediction: 100+Hz
- cuVSLAM VO update: ~3Hz (every frame)
- Satellite update: ~0.3-1Hz (keyframes only, delayed via async pipeline)

### Component: Satellite Tile Preprocessing (Offline)

**Selected**: **GeoHash-indexed tile pairs on disk**. Pipeline:

1. Define operational area from flight plan
2. Download satellite tiles from Google Maps Tile API at max zoom (18-19)
3. Pre-resize each tile to matcher input resolution
4. Store: original tile + resized tile + metadata (GPS bounds, zoom, GSD) in GeoHash-indexed directory structure
5. Copy to Jetson storage before flight

### Component: Re-localization (Disconnected Segments)

**Selected**: **Keyframe satellite matching is always active + expanded search on VO failure**. When cuVSLAM reports tracking loss (sharp turn, no features):

1. Immediately flag next frame as keyframe → trigger satellite matching
2. Expand tile search radius (from ±200m to ±1km based on IMU dead-reckoning uncertainty)
3. If match found: position recovered, new segment begins
4. If 3+ consecutive keyframe failures: request user input via API

### Component: Object Center Coordinates

Geometric calculation once frame-center GPS is known:

1. Pixel offset from center: (dx_px, dy_px)
2. Convert to meters: dx_m = dx_px × GSD, dy_m = dy_px × GSD
3. Rotate by IMU yaw heading
4. Convert meter offset to lat/lon and add to frame-center GPS

### Component: API & Streaming

**Selected**: **FastAPI + sse-starlette**. REST for session management, SSE for real-time position stream. OpenAPI auto-documentation.

## Processing Time Budget (per frame, 400ms budget)

### Normal Frame (non-keyframe, ~60-80% of frames)

| Step | Time | Notes |
|------|------|-------|
| Image capture + transfer | ~10ms | CSI/USB3 |
| Downsample (for cuVSLAM) | ~2ms | OpenCV CUDA |
| cuVSLAM VO+IMU | ~11ms | NVIDIA CUDA-optimized, 90fps capable |
| ESKF fusion (VO+IMU update) | ~1ms | C extension or NumPy |
| SSE emit | ~1ms | Async |
| **Total** | **~25ms** | Well within 400ms |

### Keyframe Satellite Matching (async, every 3-10 frames)

Runs asynchronously on a separate CUDA stream — does NOT block per-frame VO output.
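As a worked example, the four-step object-center calculation described in the components above reduces to a short function. A minimal sketch under a flat-earth (local tangent plane) approximation; the axis conventions (image +x right, +y down; yaw clockwise from north) and all names are my assumptions, not a fixed interface:

```python
import math

EARTH_RADIUS_M = 6_378_137.0  # WGS-84 equatorial radius

def object_latlon(center_lat: float, center_lon: float,
                  dx_px: float, dy_px: float,
                  gsd_m_per_px: float, yaw_rad: float) -> tuple[float, float]:
    """Convert a pixel offset from frame center to absolute lat/lon.

    dx_px: +right in image, dy_px: +down in image; yaw_rad: heading,
    clockwise from north. Flat-earth approximation — fine for sub-km offsets.
    """
    # Steps 1-2: pixel offset → meters in the camera frame
    dx_m = dx_px * gsd_m_per_px
    dy_m = dy_px * gsd_m_per_px
    # Step 3: rotate the camera-frame offset into east/north by the yaw heading
    # (image "up" points along the heading; image "right" is heading + 90°)
    east = dx_m * math.cos(yaw_rad) - dy_m * math.sin(yaw_rad)
    north = -dx_m * math.sin(yaw_rad) - dy_m * math.cos(yaw_rad)
    # Step 4: meters → degrees, added to the frame-center GPS
    dlat = math.degrees(north / EARTH_RADIUS_M)
    dlon = math.degrees(east / (EARTH_RADIUS_M * math.cos(math.radians(center_lat))))
    return center_lat + dlat, center_lon + dlon
```

At yaw 0 (nose north), a +dx_px offset moves the object east and a +dy_px offset (down in the image) moves it south, which is the sanity check to run against real footage before trusting the sign conventions.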
**Path A — LiteSAM (if benchmark passes)**:

| Step | Time | Notes |
|------|------|-------|
| Downsample to ~480px | ~1ms | OpenCV CUDA |
| Load satellite tile | ~5ms | Pre-resized, from storage |
| LiteSAM (opt) matching | ~300-500ms | TensorRT FP16, 480px, Orin Nano Super estimate |
| Geometric pose (RANSAC) | ~5ms | Homography estimation |
| ESKF satellite update | ~1ms | Delayed measurement |
| **Total** | **~310-510ms** | Async, does not block VO |

**Path B — XFeat (if LiteSAM abandoned)**:

| Step | Time | Notes |
|------|------|-------|
| XFeat feature extraction (both images) | ~10-20ms | TensorRT FP16/INT8 |
| XFeat semi-dense matching | ~30-50ms | KNN + refinement |
| Geometric verification (RANSAC) | ~5ms | |
| ESKF satellite update | ~1ms | |
| **Total** | **~50-80ms** | Comfortably within budget |

### Per-Frame Wall-Clock Latency

Every frame:

- **VO result emitted in ~25ms** (cuVSLAM + ESKF + SSE)
- Satellite correction arrives asynchronously on keyframes
- Client gets immediate position, then refined position when satellite match completes

## Memory Budget (Jetson Orin Nano Super, 8GB shared)

| Component | Memory | Notes |
|-----------|--------|-------|
| OS + runtime | ~1.5GB | JetPack 6.2 + Python |
| cuVSLAM | ~200-300MB | NVIDIA CUDA library + internal state |
| Satellite matcher TensorRT | ~50-100MB | LiteSAM FP16 or XFeat FP16 |
| Current frame (downsampled) | ~2MB | 640×480×3 |
| Satellite tile (pre-resized) | ~1MB | Single active tile |
| ESKF state + buffers | ~10MB | |
| FastAPI + SSE runtime | ~100MB | |
| **Total** | **~1.9-2.4GB** | ~25-30% of 8GB — comfortable margin |

## Confidence Scoring

| Level | Condition | Expected Accuracy |
|-------|-----------|-------------------|
| HIGH | Satellite match succeeded + cuVSLAM consistent | <20m |
| MEDIUM | cuVSLAM VO only, recent satellite correction (<500m travel) | 20-50m |
| LOW | cuVSLAM VO only, no recent satellite correction | 50-100m+ |
| VERY LOW | IMU dead-reckoning only (cuVSLAM + satellite both failed) | 100m+ |
| MANUAL | User-provided position | As provided |

## Key Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| LiteSAM too slow on Orin Nano Super | HIGH | Misses 400ms deadline | **Abandon LiteSAM, use XFeat**. Day-one benchmark is the go/no-go gate |
| cuVSLAM not supporting nadir-only camera well | MEDIUM | VO accuracy degrades | Fall back to XFeat frame-to-frame matching |
| Google Maps satellite quality in conflict zone | HIGH | Satellite matching fails | Accept VO+IMU with higher drift; request user input sooner; alternative satellite providers |
| XFeat cross-view accuracy insufficient | MEDIUM | Position corrections less accurate than LiteSAM | Increase keyframe frequency; multi-tile consensus voting; geometric verification with strict RANSAC |
| cuVSLAM is closed-source | LOW | Hard to debug | Fall back to XFeat VO; cuVSLAM has Python+C++ APIs |

## Testing Strategy

### Integration / Functional Tests

- End-to-end pipeline test with real flight data (60 images from input_data/)
- Compare computed positions against ground truth GPS from coordinates.csv
- Measure: percentage within 50m, percentage within 20m
- Test sharp-turn handling: introduce a 90-degree heading change into the sequence
- Test user-input fallback: simulate 3+ consecutive failures
- Test SSE streaming: verify client receives VO result within 50ms, satellite-corrected result within 500ms
- Test session management: start/stop/restart flight sessions via REST API

### Non-Functional Tests

- **Day-one benchmark**: LiteSAM TensorRT FP16 at 480/640/800px on Orin Nano Super → go/no-go for LiteSAM
- cuVSLAM benchmark: verify 90fps monocular+IMU on Orin Nano Super
- Performance: measure per-frame processing time (must be <400ms)
- Memory: monitor peak usage during 1000-frame session (must stay <8GB)
- Stress: process 3000 frames without memory leak
- Keyframe strategy: vary interval (2, 3, 5, 10) and measure accuracy vs latency tradeoff

## References

- LiteSAM (2025): https://www.mdpi.com/2072-4292/17/19/3349
- LiteSAM code: https://github.com/boyagesmile/LiteSAM
- cuVSLAM (2025-2026): https://github.com/NVlabs/PyCuVSLAM
- PyCuVSLAM API: https://nvlabs.github.io/PyCuVSLAM/api.html
- Mateos-Ramirez et al. (2024): https://www.mdpi.com/2076-3417/14/16/7420
- SatLoc (2025): https://www.scilit.com/publications/e5cafaf875a49297a62b298a89d5572f
- XFeat (CVPR 2024): https://arxiv.org/abs/2404.19174
- XFeat TensorRT for Jetson: https://github.com/PranavNedunghat/XFeatTensorRT
- EfficientLoFTR (CVPR 2024): https://github.com/zju3dv/EfficientLoFTR
- JetPack 6.2: https://docs.nvidia.com/jetson/archives/jetpack-archived/jetpack-62/release-notes/
- Hybrid ESKF/UKF: https://arxiv.org/abs/2512.17505
- Google Maps Tile API: https://developers.google.com/maps/documentation/tile/satellite

## Related Artifacts

- AC Assessment: `_docs/00_research/gps_denied_nav/00_ac_assessment.md`
- Tech stack evaluation: `_docs/01_solution/tech_stack.md`