# Solution Draft

## Product Solution Description

A real-time GPS-denied visual navigation system for fixed-wing UAVs, running entirely on a Jetson Orin Nano Super (8GB). The system determines frame-center GPS coordinates by fusing three information sources: (1) CUDA-accelerated visual odometry (cuVSLAM), (2) absolute position corrections from satellite image matching, and (3) IMU-based motion prediction. Results stream to clients via REST API + SSE in real time.

**Hard constraint**: The camera captures at ~3 fps (333-400ms interval). The full pipeline must complete within **400ms per frame**.

**Satellite matching strategy**: Benchmark LiteSAM on actual Orin Nano Super hardware as a day-one priority. If LiteSAM cannot achieve ≤400ms at 480px resolution, **abandon it entirely** and use XFeat semi-dense matching as the primary satellite matcher. Speed is non-negotiable.

**Core architectural principles**:

1. **cuVSLAM handles VO** — NVIDIA's CUDA-accelerated library achieves 90fps on Jetson Orin Nano, giving VO essentially "for free" (~11ms/frame).
2. **Keyframe-based satellite matching** — the satellite matcher runs on keyframes only (every 3-10 frames), amortizing its cost. Non-keyframes rely on cuVSLAM VO + IMU.
3. **Every keyframe independently attempts satellite-based geo-localization** — this handles disconnected segments natively.
4. **Pipeline parallelism** — satellite matching for frame N overlaps with VO processing of frame N+1 via CUDA streams.
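Principles 2 and 3 reduce to a per-frame scheduling decision: match on a fixed interval, on a confidence drop, or on VO failure. A minimal sketch of that trigger logic — all names and threshold values here are illustrative placeholders, not tuned parameters:

```python
from dataclasses import dataclass

@dataclass
class KeyframePolicy:
    """Decides when a frame should trigger (async) satellite matching.

    Thresholds are illustrative placeholders, not tuned values.
    """
    max_interval: int = 10           # force a keyframe at least every N frames
    min_interval: int = 3            # never more often than every N frames
    cov_threshold_m2: float = 400.0  # ESKF position variance trigger (~20m sigma)
    frames_since_keyframe: int = 0

    def is_keyframe(self, position_variance_m2: float, tracking_lost: bool) -> bool:
        self.frames_since_keyframe += 1
        if self.frames_since_keyframe < self.min_interval:
            return False  # amortize matcher cost: enforce minimum spacing
        triggered = (
            tracking_lost                                       # cuVSLAM reports VO failure
            or position_variance_m2 > self.cov_threshold_m2     # confidence drop
            or self.frames_since_keyframe >= self.max_interval  # fixed interval
        )
        if triggered:
            self.frames_since_keyframe = 0
        return triggered
```

The per-frame loop would call `is_keyframe` with the ESKF position covariance and the cuVSLAM tracking status; a `True` result launches the satellite matcher asynchronously.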
```
┌─────────────────────────────────────────────────────────────────┐
│ OFFLINE (Before Flight)                                         │
│ Satellite Tiles → Download & Crop → Store as tile pairs         │
│  (Google Maps)    (per flight plan)  (disk, GeoHash indexed)    │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│ ONLINE (During Flight)                                          │
│                                                                 │
│ EVERY FRAME (400ms budget):                                     │
│ ┌────────────────────────────────┐                              │
│ │ Camera → Downsample (CUDA 2ms) │                              │
│ │        → cuVSLAM VO+IMU (~11ms)│──→ ESKF Update → SSE Emit    │
│ └────────────────────────────────┘         ↑                    │
│                                            │                    │
│ KEYFRAMES ONLY (every 3-10 frames):        │                    │
│ ┌────────────────────────────────────┐     │                    │
│ │ Satellite match (async CUDA stream)│─────┘                    │
│ │ LiteSAM or XFeat (see benchmark)   │                          │
│ │ (does NOT block VO output)         │                          │
│ └────────────────────────────────────┘                          │
│                                                                 │
│ IMU: 100+Hz continuous → ESKF prediction                        │
└─────────────────────────────────────────────────────────────────┘
```

## Speed Optimization Techniques

### 1. cuVSLAM for Visual Odometry (~11ms/frame)

NVIDIA's CUDA-accelerated VO library (v15.0.0, March 2026) achieves 90fps on Jetson Orin Nano. Supports monocular camera + IMU natively. Features: automatic IMU fallback when visual tracking fails, loop closure, Python and C++ APIs. This eliminates custom VO entirely.

### 2. Keyframe-Based Satellite Matching

Not every frame needs satellite matching. Strategy:

- cuVSLAM provides VO at every frame (high-rate, low-latency)
- Satellite matching triggers on **keyframes** selected by:
  - Fixed interval: every 3-10 frames (~1-3.3s between satellite corrections)
  - Confidence drop: when ESKF covariance exceeds threshold
  - VO failure: when cuVSLAM reports tracking loss (sharp turn)

### 3. Satellite Matcher Selection (Benchmark-Driven)

**Candidate A: LiteSAM (opt)** — Best accuracy for satellite-aerial matching (RMSE@30 = 17.86m on UAV-VisLoc). 6.31M params, MobileOne + TAIFormer + MinGRU.
Benchmarked at 497ms on Jetson AGX Orin at 1184px. AGX Orin is 3-4x more powerful than Orin Nano Super (275 TOPS vs 67 TOPS, $2000+ vs $249). Realistic Orin Nano Super estimates:

- At 1184px: ~1.5-2.0s (unusable)
- At 640px: ~500-800ms (borderline)
- At 480px: ~300-500ms (best case)

**Candidate B: XFeat semi-dense** — ~50-100ms on Orin Nano Super. Proven on Jetson. Not specifically designed for cross-view satellite-aerial, but fast and reliable.

**Decision rule**: Benchmark LiteSAM TensorRT FP16 at 480px on Orin Nano Super. If ≤400ms → use LiteSAM. If >400ms → **abandon LiteSAM, use XFeat as primary**. No hybrid compromises — pick one and optimize it.

### 4. TensorRT FP16 Optimization

LiteSAM's MobileOne backbone is reparameterizable — the multi-branch training structure collapses to a single feed-forward path at inference. Combined with TensorRT FP16, this maximizes throughput. INT8 is possible for the MobileOne backbone, but ViT/transformer components may degrade with INT8.

### 5. CUDA Stream Pipelining

Overlap operations across consecutive frames:

- Stream A: cuVSLAM VO for current frame (~11ms) + ESKF fusion (~1ms)
- Stream B: Satellite matching for previous keyframe (async)
- CPU: SSE emission, tile management, keyframe selection logic

### 6. Pre-cropped Satellite Tiles

Offline: for each satellite tile, store both the raw image and a pre-resized version matching the satellite matcher's input resolution. Runtime avoids the resize cost.

## Existing/Competitor Solutions Analysis

| Solution | Approach | Accuracy | Hardware | Limitations |
|----------|----------|----------|----------|-------------|
| Mateos-Ramirez et al. (2024) | VO (ORB) + satellite keypoint correction + Kalman | 142m mean / 17km (0.83%) | Orange Pi class | No re-localization; ORB only; 1000m+ altitude |
| SatLoc (2025) | DinoV2 + XFeat + optical flow + adaptive fusion | <15m, >90% coverage | Edge (unspecified) | Paper not fully accessible |
| LiteSAM (2025) | MobileOne + TAIFormer + MinGRU subpixel refinement | RMSE@30 = 17.86m on UAV-VisLoc | RTX 3090 (62ms), AGX Orin (497ms) | Not tested on Orin Nano; AGX Orin is 3-4x more powerful |
| TerboucheHacene/visual_localization | SuperPoint/SuperGlue/GIM + VO + satellite | Not quantified | Desktop-class | Not edge-optimized |
| cuVSLAM (NVIDIA, 2025-2026) | CUDA-accelerated VO+SLAM, mono/stereo/IMU | <1% trajectory error (KITTI), <5cm (EuRoC) | Jetson Orin Nano (90fps) | VO only, no satellite matching |

**Key insight**: Combine cuVSLAM (best-in-class VO for Jetson) with the fastest viable satellite-aerial matcher via ESKF fusion. LiteSAM is the accuracy leader but unproven on Orin Nano Super — benchmark first, abandon for XFeat if too slow.

## Architecture

### Component: Visual Odometry

| Solution | Tools | Advantages | Limitations | Fit |
|----------|-------|-----------|-------------|-----|
| cuVSLAM (mono+IMU) | PyCuVSLAM / C++ API | 90fps on Orin Nano, NVIDIA-optimized, loop closure, IMU fallback | Closed-source CUDA library | ✅ Best |
| XFeat frame-to-frame | XFeatTensorRT | 5x faster than SuperPoint, open-source | ~30-50ms total, no IMU integration | ⚠️ Fallback |
| ORB-SLAM3 | OpenCV + custom | Well-understood, open-source | CPU-heavy, ~30fps on Orin | ⚠️ Slower |

**Selected**: **cuVSLAM (mono+IMU mode)** — purpose-built by NVIDIA for Jetson. ~11ms/frame leaves 389ms for everything else. Auto-fallback to IMU when visual tracking fails.
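The pipeline-parallel behavior described earlier — a VO result emitted every frame while satellite matching runs off the critical path — can be sketched with asyncio. Everything below is an illustrative stand-in: `satellite_match` models a LiteSAM/XFeat launch on its own CUDA stream, the events list stands in for the SSE stream, and the timings are compressed:

```python
import asyncio

async def satellite_match(frame_id: int) -> tuple[int, float, float]:
    """Stand-in for the keyframe matcher (LiteSAM/XFeat on its own CUDA stream).

    Sleeps to model a slow match; returns (frame_id, lat, lon) placeholders.
    """
    await asyncio.sleep(0.3)
    return frame_id, 0.0, 0.0

async def run_pipeline(num_frames: int, keyframe_interval: int = 3) -> list[str]:
    events = []      # stands in for the SSE event stream
    pending = set()  # in-flight satellite matches
    for frame_id in range(num_frames):
        # Per-frame critical path: VO + ESKF (~25ms) — emit immediately.
        events.append(f"vo:{frame_id}")
        # Keyframes launch a satellite match WITHOUT awaiting it.
        if frame_id % keyframe_interval == 0:
            pending.add(asyncio.create_task(satellite_match(frame_id)))
        # Harvest any completed matches; apply as delayed ESKF corrections.
        done = {t for t in pending if t.done()}
        for t in done:
            kf_id, lat, lon = t.result()
            events.append(f"sat_fix:{kf_id}")
        pending -= done
        await asyncio.sleep(0.05)  # next camera frame (interval compressed for the sketch)
    # Drain any matches still in flight at the end of the sequence.
    for kf_id, lat, lon in await asyncio.gather(*pending):
        events.append(f"sat_fix:{kf_id}")
    return events

# events = asyncio.run(run_pipeline(10))
```

The key property mirrored here is that `vo:1` is emitted long before `sat_fix:0` arrives: the keyframe match never blocks the per-frame output, matching the latency budget's "VO result emitted in ~25ms" claim.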
### Component: Satellite Image Matching

| Solution | Tools | Advantages | Limitations | Fit |
|----------|-------|-----------|-------------|-----|
| LiteSAM (opt) | TensorRT | Best satellite-aerial accuracy (RMSE@30 17.86m), 6.31M params, subpixel refinement | 497ms on AGX Orin at 1184px; AGX Orin is 3-4x more powerful than Orin Nano Super | ✅ If benchmark passes |
| XFeat semi-dense | XFeatTensorRT | ~50-100ms, lightweight, Jetson-proven | Not designed for cross-view satellite-aerial | ✅ If LiteSAM fails benchmark |
| EfficientLoFTR | TensorRT | Good accuracy, semi-dense | 15.05M params (2.4x LiteSAM), slower | ⚠️ Heavier |
| SuperPoint + LightGlue | TensorRT C++ | Good general matching | Sparse only, worse on satellite-aerial | ⚠️ Not specialized |

**Selection**: Benchmark-driven. Day-one test on Orin Nano Super:

1. Export LiteSAM (opt) to TensorRT FP16
2. Measure at 480px, 640px, 800px
3. If ≤400ms at 480px → **LiteSAM**
4. If >400ms at any viable resolution → **XFeat semi-dense** (primary, no hybrid)

### Component: Sensor Fusion

| Solution | Tools | Advantages | Limitations | Fit |
|----------|-------|-----------|-------------|-----|
| Error-State EKF (ESKF) | Custom Python/C++ | Lightweight, multi-rate, well-understood | Linear approximation | ✅ Best |
| Hybrid ESKF/UKF | Custom | 49% better accuracy | More complex | ⚠️ Upgrade path |
| Factor Graph (GTSAM) | GTSAM | Best accuracy | Heavy compute | ❌ Too heavy |

**Selected**: **ESKF** with adaptive measurement noise. State vector: [position(3), velocity(3), orientation_quat(4), accel_bias(3), gyro_bias(3)] = 16 states.

Measurement sources and rates:

- IMU prediction: 100+Hz
- cuVSLAM VO update: ~3Hz (every frame)
- Satellite update: ~0.3-1Hz (keyframes only, delayed via async pipeline)

### Component: Satellite Tile Preprocessing (Offline)

**Selected**: **GeoHash-indexed tile pairs on disk**. Pipeline:

1. Define operational area from flight plan
2. Download satellite tiles from Google Maps Tile API at max zoom (18-19)
3. Pre-resize each tile to matcher input resolution
4. Store: original tile + resized tile + metadata (GPS bounds, zoom, GSD) in GeoHash-indexed directory structure
5. Copy to Jetson storage before flight

### Component: Re-localization (Disconnected Segments)

**Selected**: **Keyframe satellite matching is always active + expanded search on VO failure**. When cuVSLAM reports tracking loss (sharp turn, no features):

1. Immediately flag next frame as keyframe → trigger satellite matching
2. Expand tile search radius (from ±200m to ±1km based on IMU dead-reckoning uncertainty)
3. If match found: position recovered, new segment begins
4. If 3+ consecutive keyframe failures: request user input via API

### Component: Object Center Coordinates

Geometric calculation once frame-center GPS is known:

1. Pixel offset from center: (dx_px, dy_px)
2. Convert to meters: dx_m = dx_px × GSD, dy_m = dy_px × GSD
3. Rotate by IMU yaw heading
4. Convert meter offset to lat/lon and add to frame-center GPS

### Component: API & Streaming

**Selected**: **FastAPI + sse-starlette**. REST for session management, SSE for real-time position stream. OpenAPI auto-documentation.

## Processing Time Budget (per frame, 400ms budget)

### Normal Frame (non-keyframe, ~60-80% of frames)

| Step | Time | Notes |
|------|------|-------|
| Image capture + transfer | ~10ms | CSI/USB3 |
| Downsample (for cuVSLAM) | ~2ms | OpenCV CUDA |
| cuVSLAM VO+IMU | ~11ms | NVIDIA CUDA-optimized, 90fps capable |
| ESKF fusion (VO+IMU update) | ~1ms | C extension or NumPy |
| SSE emit | ~1ms | Async |
| **Total** | **~25ms** | Well within 400ms |

### Keyframe Satellite Matching (async, every 3-10 frames)

Runs asynchronously on a separate CUDA stream — does NOT block per-frame VO output.
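As a worked example, the four-step object-center calculation described in the components above reduces to a short function. A minimal sketch under a flat-earth (local tangent plane) approximation; the axis conventions (image +x right, +y down; yaw clockwise from north) and all names are my assumptions, not a fixed interface:

```python
import math

EARTH_RADIUS_M = 6_378_137.0  # WGS-84 equatorial radius

def object_latlon(center_lat: float, center_lon: float,
                  dx_px: float, dy_px: float,
                  gsd_m_per_px: float, yaw_rad: float) -> tuple[float, float]:
    """Convert a pixel offset from frame center to absolute lat/lon.

    dx_px: +right in image, dy_px: +down in image; yaw_rad: heading,
    clockwise from north. Flat-earth approximation — fine for sub-km offsets.
    """
    # Steps 1-2: pixel offset → meters in the camera frame
    dx_m = dx_px * gsd_m_per_px
    dy_m = dy_px * gsd_m_per_px
    # Step 3: rotate the camera-frame offset into east/north by the yaw heading
    # (image "up" points along the heading; image "right" is heading + 90°)
    east = dx_m * math.cos(yaw_rad) - dy_m * math.sin(yaw_rad)
    north = -dx_m * math.sin(yaw_rad) - dy_m * math.cos(yaw_rad)
    # Step 4: meters → degrees, added to the frame-center GPS
    dlat = math.degrees(north / EARTH_RADIUS_M)
    dlon = math.degrees(east / (EARTH_RADIUS_M * math.cos(math.radians(center_lat))))
    return center_lat + dlat, center_lon + dlon
```

At yaw 0 (nose north), a +dx_px offset moves the object east and a +dy_px offset (down in the image) moves it south, which is the sanity check to run against real footage before trusting the sign conventions.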
**Path A — LiteSAM (if benchmark passes)**:

| Step | Time | Notes |
|------|------|-------|
| Downsample to ~480px | ~1ms | OpenCV CUDA |
| Load satellite tile | ~5ms | Pre-resized, from storage |
| LiteSAM (opt) matching | ~300-500ms | TensorRT FP16, 480px, Orin Nano Super estimate |
| Geometric pose (RANSAC) | ~5ms | Homography estimation |
| ESKF satellite update | ~1ms | Delayed measurement |
| **Total** | **~310-510ms** | Async, does not block VO |

**Path B — XFeat (if LiteSAM abandoned)**:

| Step | Time | Notes |
|------|------|-------|
| XFeat feature extraction (both images) | ~10-20ms | TensorRT FP16/INT8 |
| XFeat semi-dense matching | ~30-50ms | KNN + refinement |
| Geometric verification (RANSAC) | ~5ms | |
| ESKF satellite update | ~1ms | |
| **Total** | **~50-80ms** | Comfortably within budget |

### Per-Frame Wall-Clock Latency

Every frame:

- **VO result emitted in ~25ms** (cuVSLAM + ESKF + SSE)
- Satellite correction arrives asynchronously on keyframes
- Client gets immediate position, then refined position when satellite match completes

## Memory Budget (Jetson Orin Nano Super, 8GB shared)

| Component | Memory | Notes |
|-----------|--------|-------|
| OS + runtime | ~1.5GB | JetPack 6.2 + Python |
| cuVSLAM | ~200-300MB | NVIDIA CUDA library + internal state |
| Satellite matcher TensorRT | ~50-100MB | LiteSAM FP16 or XFeat FP16 |
| Current frame (downsampled) | ~2MB | 640×480×3 |
| Satellite tile (pre-resized) | ~1MB | Single active tile |
| ESKF state + buffers | ~10MB | |
| FastAPI + SSE runtime | ~100MB | |
| **Total** | **~1.9-2.4GB** | ~25-30% of 8GB — comfortable margin |

## Confidence Scoring

| Level | Condition | Expected Accuracy |
|-------|-----------|-------------------|
| HIGH | Satellite match succeeded + cuVSLAM consistent | <20m |
| MEDIUM | cuVSLAM VO only, recent satellite correction (<500m travel) | 20-50m |
| LOW | cuVSLAM VO only, no recent satellite correction | 50-100m+ |
| VERY LOW | IMU dead-reckoning only (cuVSLAM + satellite both failed) | 100m+ |
| MANUAL | User-provided position | As provided |

## Key Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| LiteSAM too slow on Orin Nano Super | HIGH | Misses 400ms deadline | **Abandon LiteSAM, use XFeat**. Day-one benchmark is the go/no-go gate |
| cuVSLAM not supporting nadir-only camera well | MEDIUM | VO accuracy degrades | Fall back to XFeat frame-to-frame matching |
| Google Maps satellite quality in conflict zone | HIGH | Satellite matching fails | Accept VO+IMU with higher drift; request user input sooner; alternative satellite providers |
| XFeat cross-view accuracy insufficient | MEDIUM | Position corrections less accurate than LiteSAM | Increase keyframe frequency; multi-tile consensus voting; geometric verification with strict RANSAC |
| cuVSLAM is closed-source | LOW | Hard to debug | Fall back to XFeat VO; cuVSLAM has Python+C++ APIs |

## Testing Strategy

### Integration / Functional Tests

- End-to-end pipeline test with real flight data (60 images from input_data/)
- Compare computed positions against ground truth GPS from coordinates.csv
- Measure: percentage within 50m, percentage within 20m
- Test sharp-turn handling: introduce a 90-degree heading change into the sequence
- Test user-input fallback: simulate 3+ consecutive failures
- Test SSE streaming: verify client receives VO result within 50ms, satellite-corrected result within 500ms
- Test session management: start/stop/restart flight sessions via REST API

### Non-Functional Tests

- **Day-one benchmark**: LiteSAM TensorRT FP16 at 480/640/800px on Orin Nano Super → go/no-go for LiteSAM
- cuVSLAM benchmark: verify 90fps monocular+IMU on Orin Nano Super
- Performance: measure per-frame processing time (must be <400ms)
- Memory: monitor peak usage during 1000-frame session (must stay <8GB)
- Stress: process 3000 frames without memory leak
- Keyframe strategy: vary interval (2, 3, 5, 10) and measure accuracy vs latency tradeoff

## References

- LiteSAM (2025): https://www.mdpi.com/2072-4292/17/19/3349
- LiteSAM code: https://github.com/boyagesmile/LiteSAM
- cuVSLAM (2025-2026): https://github.com/NVlabs/PyCuVSLAM
- PyCuVSLAM API: https://nvlabs.github.io/PyCuVSLAM/api.html
- Mateos-Ramirez et al. (2024): https://www.mdpi.com/2076-3417/14/16/7420
- SatLoc (2025): https://www.scilit.com/publications/e5cafaf875a49297a62b298a89d5572f
- XFeat (CVPR 2024): https://arxiv.org/abs/2404.19174
- XFeat TensorRT for Jetson: https://github.com/PranavNedunghat/XFeatTensorRT
- EfficientLoFTR (CVPR 2024): https://github.com/zju3dv/EfficientLoFTR
- JetPack 6.2: https://docs.nvidia.com/jetson/archives/jetpack-archived/jetpack-62/release-notes/
- Hybrid ESKF/UKF: https://arxiv.org/abs/2512.17505
- Google Maps Tile API: https://developers.google.com/maps/documentation/tile/satellite

## Related Artifacts

- AC Assessment: `_docs/00_research/gps_denied_nav/00_ac_assessment.md`
- Tech stack evaluation: `_docs/01_solution/tech_stack.md`