Solution Draft

Assessment Findings

| Old Component Solution | Weak Point (functional/security/performance) | New Solution |
|---|---|---|
| LiteSAM at 480px as satellite matcher | Performance: 497ms on AGX Orin at 1184px. Orin Nano Super is ~3-4x slower. At 480px, estimated ~270-360ms: borderline. The paper uses PyTorch AMP, not TensorRT FP16; TensorRT could bring a 2-3x improvement. | Add TensorRT FP16 as a mandatory optimization step. Revised estimate at 480px with TensorRT: ~90-180ms. Still benchmark-driven: abandon if >400ms. |
| XFeat as LiteSAM fallback for satellite matching | Functional: XFeat is a general-purpose feature matcher, NOT designed for the cross-view satellite-aerial gap. It may fail on season/lighting differences between UAV and satellite imagery. | Expand the fallback options: benchmark EfficientLoFTR (designed for weak-texture aerial imagery) alongside XFeat. Consider STHN-style deep homography as a third option. See the detailed satellite matcher comparison below. |
| SP+LG considered as "sparse only, worse on satellite-aerial" | Functional: the LiteSAM paper confirms "SP+LG achieves fastest inference speed but at expense of accuracy." A sparse matcher fails on texture-scarce regions. ~180-360ms on Orin Nano Super. | Reject SP+LG for both VO and satellite matching. cuVSLAM is 15-33x faster for VO. |
| cuVSLAM on low-texture terrain | Functional: cuVSLAM uses Shi-Tomasi corners + Lucas-Kanade tracking. On uniform agricultural fields or water bodies, features will be sparse, causing frequent tracking loss. The IMU fallback lasts only ~1s. No published benchmarks cover nadir agricultural terrain, and pose recovery after tracking loss is NOT guaranteed. | CRITICAL RISK: cuVSLAM will likely fail frequently over low-texture terrain. Mitigation: (1) increase satellite matching frequency in low-texture areas, (2) use an IMU dead-reckoning bridge, (3) accept higher drift in featureless segments, (4) XFeat VO as a secondary fallback (which may also struggle on the same terrain). |
| cuVSLAM memory estimate ~200-300MB | Performance: the map grows over time. For 3000-frame flights (~16min at 3fps), the map could reach 500MB-1GB without pruning. | Configure cuVSLAM map pruning, set a maximum keyframe count, and monitor memory. |
| Tile search on VO failure: "expand to ±1km" | Functional: underspecified. Loading 10-20 tiles from disk is slow (I/O-bound). | Preload tiles within ±2km of the flight plan into RAM. Rank the search by IMU dead-reckoning position. |
| LiteSAM resolution | Performance: the paper benchmarked at 1184px on AGX Orin (497ms, AMP). TensorRT FP16 with the reparameterized MobileOne backbone is expected to be 2-3x faster. | Benchmark LiteSAM TRT FP16 at 1280px on Orin Nano Super. If ≤200ms, use LiteSAM at 1280px; if >200ms, use XFeat. |
| SP+LG proposed for VO by user | Performance: ~130-280ms/frame on Orin Nano vs ~8.6ms/frame for cuVSLAM. No IMU, no loop closure. | Reject SP+LG for VO; cuVSLAM is 15-33x faster. XFeat frame-to-frame remains the fallback. |

Product Solution Description

A real-time GPS-denied visual navigation system for fixed-wing UAVs, running entirely on a Jetson Orin Nano Super (8GB). The system determines frame-center GPS coordinates by fusing three information sources: (1) CUDA-accelerated visual odometry (cuVSLAM), (2) absolute position corrections from satellite image matching, and (3) IMU-based motion prediction. Results stream to clients via REST API + SSE in real time.

Hard constraint: Camera shoots at ~3fps (333-400ms interval). The full pipeline must complete within 400ms per frame.

Satellite matching strategy: Benchmark LiteSAM TensorRT FP16 at 1280px on Orin Nano Super as a day-one priority. The paper's AGX Orin benchmark used PyTorch AMP — TensorRT FP16 with reparameterized MobileOne should yield 2-3x additional speedup. Decision rule: if LiteSAM TRT FP16 at 1280px ≤200ms → use LiteSAM. If >200ms → use XFeat.

Core architectural principles:

  1. cuVSLAM handles VO — 116fps on Orin Nano 8GB, ~8.6ms/frame. SuperPoint+LightGlue was evaluated and rejected (15-33x slower, no IMU integration).
  2. Keyframe-based satellite matching — satellite matcher runs on keyframes only (every 3-10 frames), amortizing cost. Non-keyframes rely on cuVSLAM VO + IMU.
  3. Every keyframe independently attempts satellite-based geo-localization — handles disconnected segments natively.
  4. Pipeline parallelism — satellite matching for frame N overlaps with VO processing of frame N+1 via CUDA streams.
  5. Proactive tile loading — preload tiles within ±2km of flight plan into RAM for fast lookup during expanded search.
┌─────────────────────────────────────────────────────────────────┐
│                    OFFLINE (Before Flight)                       │
│  Satellite Tiles → Download & Crop → Store as tile pairs        │
│  (Google Maps)     (per flight plan)   (disk, GeoHash indexed)  │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    ONLINE (During Flight)                        │
│                                                                  │
│  EVERY FRAME (400ms budget):                                     │
│  ┌────────────────────────────────┐                              │
│  │ Camera → Downsample (CUDA 2ms)│                               │
│  │       → cuVSLAM VO+IMU (~9ms) │──→ ESKF Update → SSE Emit   │
│  └────────────────────────────────┘         ↑                    │
│                                             │                    │
│  KEYFRAMES ONLY (every 3-10 frames):        │                    │
│  ┌────────────────────────────────────┐     │                    │
│  │ Satellite match (async CUDA stream)│─────┘                    │
│  │ LiteSAM TRT FP16 or XFeat         │                           │
│  │ (does NOT block VO output)         │                           │
│  └────────────────────────────────────┘                          │
│                                                                  │
│  IMU: 100+Hz continuous → ESKF prediction                        │
│  TILES: ±2km preloaded in RAM from flight plan                   │
└─────────────────────────────────────────────────────────────────┘

Speed Optimization Techniques

1. cuVSLAM for Visual Odometry (~9ms/frame)

NVIDIA's CUDA-accelerated VO library (v15.0.0, March 2026) achieves 116fps on Jetson Orin Nano 8GB at 720p. Supports monocular camera + IMU natively. Features: automatic IMU fallback when visual tracking fails, loop closure, Python and C++ APIs.

Why not SuperPoint+LightGlue for VO: SP+LG is 15-33x slower (~130-280ms vs ~9ms). Lacks IMU integration, loop closure, auto-fallback.

CRITICAL: cuVSLAM on difficult/uniform terrain (agricultural fields, water). cuVSLAM uses Shi-Tomasi corner detection + Lucas-Kanade optical-flow tracking (classical features, not learned). On uniform agricultural terrain or water bodies:

  • Very few corners will be detected → sparse/unreliable tracking
  • Frequent keyframe creation → heavier compute
  • Tracking loss → IMU fallback (~1 second) → constant-velocity integrator (~0.5s more)
  • cuVSLAM does NOT guarantee pose recovery after tracking loss
  • All published benchmarks (KITTI: urban/suburban, EuRoC: indoor) do NOT include nadir agricultural terrain
  • Multi-stereo mode helps with featureless surfaces, but we have mono camera only

Mitigation strategy for low-texture terrain:

  1. Increase satellite matching frequency: In low-texture areas (detected by cuVSLAM's keypoint count dropping), switch from every 3-10 frames to every frame
  2. IMU dead-reckoning bridge: When cuVSLAM reports tracking loss, ESKF continues with IMU prediction. At 3fps with ~1.5s IMU bridge, that covers ~4-5 frames
  3. Accept higher drift: In featureless segments, position accuracy degrades to IMU-only level (50-100m+ over ~10s). Satellite matching must recover absolute position when texture returns
  4. Keypoint density monitoring: Track cuVSLAM's number of tracked features per frame. When below threshold (e.g., <50), proactively trigger satellite matching
  5. XFeat frame-to-frame as VO fallback: XFeat uses learned features that may detect texture invisible to Shi-Tomasi corners. But XFeat may also struggle on truly uniform terrain

2. Keyframe-Based Satellite Matching

Not every frame needs satellite matching. Strategy:

  • cuVSLAM provides VO at every frame (high-rate, low-latency)
  • Satellite matching triggers on keyframes selected by:
    • Fixed interval: every 3-10 frames (~1-3.3s between satellite corrections)
    • Confidence drop: when ESKF covariance exceeds threshold
    • VO failure: when cuVSLAM reports tracking loss (sharp turn)
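A minimal sketch of how the three triggers above might combine into a single predicate. The thresholds (covariance trace 25, 50-keypoint floor, 10-frame interval) and the `is_keyframe` name are illustrative placeholders, not tuned values:

```python
def is_keyframe(frames_since_keyframe: int,
                cov_trace: float,
                tracking_lost: bool,
                tracked_keypoints: int,
                interval: int = 10,
                cov_threshold: float = 25.0,
                min_keypoints: int = 50) -> bool:
    """Return True when the current frame should trigger satellite matching."""
    if tracking_lost:                        # VO failure, e.g. a sharp turn
        return True
    if tracked_keypoints < min_keypoints:    # low-texture terrain: match more often
        return True
    if cov_trace > cov_threshold:            # ESKF confidence drop
        return True
    return frames_since_keyframe >= interval # fixed-interval fallback
```

Keeping the checks ordered by urgency means a tracking loss always forces a match attempt regardless of the fixed interval.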

3. Satellite Matcher Selection (Benchmark-Driven)

Important context: Our UAV-to-satellite matching is EASIER than typical cross-view geo-localization problems. Both the UAV camera and satellite imagery are approximately nadir (top-down). The main challenges are season/lighting differences, resolution mismatch, and temporal changes — not the extreme viewpoint gap seen in ground-to-satellite matching. This means even general-purpose matchers may perform well.

Candidate A: LiteSAM (opt) with TensorRT FP16 at 1280px — Best satellite-aerial accuracy (RMSE@30 = 17.86m on UAV-VisLoc). 6.31M params, MobileOne reparameterizable for TensorRT. Paper benchmarked at 497ms on AGX Orin using AMP at 1184px. TensorRT FP16 with reparameterized MobileOne expected 2-3x faster than AMP. At 1280px (close to paper's 1184px benchmark resolution), accuracy should match published results.

Orin Nano Super TensorRT FP16 estimate at 1280px (rough multiplicative chain; only the day-one benchmark settles it):

  • AGX Orin AMP @ 1184px: 497ms (published)
  • TRT FP16 speedup over AMP: ~2-3x → AGX Orin TRT estimate: ~165-250ms
  • Orin Nano Super is ~3-4x slower on paper → naive scaling gives ~1500-2000ms without TRT and ~500-1000ms with it
  • Working estimate with TRT FP16: ~165-330ms, which assumes the real FP16 gap between AGX Orin and Orin Nano Super is well below the raw compute ratio
  • Go/no-go threshold: ≤200ms
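The chain can be checked numerically; multiplying all three assumed factors lands well above the ~165-330ms working estimate, which is why the benchmark, not the arithmetic, is the decision input:

```python
# Numeric check of the estimate chain above. The speedup/slowdown factors
# are this draft's assumptions, not measurements; treat results as bounds.
agx_amp_ms = 497.0                    # published LiteSAM AMP @ 1184px, AGX Orin
trt_speedup = (2.0, 3.0)              # assumed TensorRT FP16 gain over AMP
nano_slowdown = (3.0, 4.0)            # assumed Orin Nano Super vs AGX Orin

# AGX Orin with TRT FP16: ~165-250 ms
agx_trt = (agx_amp_ms / trt_speedup[1], agx_amp_ms / trt_speedup[0])
# Orin Nano Super, naive scaling of the TRT figure: ~500-1000 ms
nano_trt = (agx_trt[0] * nano_slowdown[0], agx_trt[1] * nano_slowdown[1])
```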

Candidate B (fallback): XFeat semi-dense — ~50-100ms on Orin Nano Super. Proven on Jetson. General-purpose, not designed for cross-view gap. FASTEST option. Since our cross-view gap is small (both nadir), XFeat may work adequately for this specific use case.

Other evaluated options (not selected):

  • EfficientLoFTR: Semi-dense, 15.05M params, handles weak-texture well. ~20% slower than LiteSAM. Strong option if LiteSAM codebase proves difficult to export to TRT, but larger model footprint.
  • Deep Homography (STHN-style): End-to-end homography estimation, no feature/RANSAC pipeline. 4.24m at 50m range. Interesting future option but needs RGB retraining — higher implementation risk.
  • PFED and retrieval-based methods: Image RETRIEVAL only (identifies which tile matches), not pixel-level matching. We already know which tile to use from ESKF position.
  • SuperPoint+LightGlue: Sparse matcher. LiteSAM paper confirms worse satellite-aerial accuracy. Slower than XFeat.

Decision rule (day-one on Orin Nano Super):

  1. Export LiteSAM (opt) to TensorRT FP16
  2. Benchmark at 1280px
  3. If ≤200ms → use LiteSAM at 1280px
  4. If >200ms → use XFeat

4. TensorRT FP16 Optimization

LiteSAM's MobileOne backbone is reparameterizable — multi-branch training structure collapses to a single feed-forward path at inference. Combined with TensorRT FP16, this maximizes throughput. Do NOT use INT8 on transformer components (TAIFormer) — accuracy degrades. INT8 is safe only for the MobileOne backbone CNN layers.

5. CUDA Stream Pipelining

Overlap operations across consecutive frames:

  • Stream A: cuVSLAM VO for current frame (~9ms) + ESKF fusion (~1ms)
  • Stream B: Satellite matching for previous keyframe (async)
  • CPU: SSE emission, tile management, keyframe selection logic
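The same overlap can be sketched host-side with asyncio, standing in for work queued on a second CUDA stream. `run_vo` and `match_satellite` are stubs with placeholder latencies; the point is that per-frame VO is never blocked by an in-flight keyframe match:

```python
import asyncio

async def run_vo(frame: int) -> dict:
    await asyncio.sleep(0.001)             # stub for cuVSLAM (~9 ms)
    return {"frame": frame, "pose": "vo"}

async def match_satellite(frame: int) -> dict:
    await asyncio.sleep(0.005)             # stub for LiteSAM/XFeat (~50-200 ms)
    return {"frame": frame, "fix": "satellite"}

async def pipeline(n_frames: int, keyframe_every: int = 3) -> list:
    fixes, sat_task = [], None
    for frame in range(n_frames):
        pose = await run_vo(frame)         # per-frame VO, never blocked
        # pose would be fused into the ESKF and emitted over SSE here
        if sat_task is not None and sat_task.done():
            fixes.append(sat_task.result())  # delayed ESKF satellite update
            sat_task = None
        if frame % keyframe_every == 0 and sat_task is None:
            sat_task = asyncio.create_task(match_satellite(frame))
    if sat_task is not None:
        fixes.append(await sat_task)       # drain the in-flight keyframe
    return fixes

results = asyncio.run(pipeline(9))
```

Only one satellite match is in flight at a time here; a keyframe arriving while a match is pending is simply skipped, mirroring the amortized keyframe strategy.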

6. Proactive Tile Loading

Change from draft01: Instead of loading tiles on-demand from disk, preload tiles within ±2km of the flight plan into RAM at session start. This eliminates disk I/O latency during flight. For a 50km flight path, ~2000 tiles at zoom 19 ≈ ~200MB RAM — well within budget.

On VO failure / expanded search:

  1. Compute IMU dead-reckoning position
  2. Rank preloaded tiles by distance to predicted position
  3. Try top 3 tiles (not all tiles in ±1km radius)
  4. If no match in top 3, expand to next 3
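A sketch of the ranked-batch search above, under assumed `(id, lat, lon)` tile records; `match_tile` stands in for the real matcher and returns `None` on failure:

```python
import math

def ranked_batches(tiles, pred_lat, pred_lon, batch=3):
    """Yield tiles in batches of `batch`, nearest to the prediction first."""
    def dist2(t):
        # equirectangular approximation is plenty at tile-ranking scale
        dlat = t[1] - pred_lat
        dlon = (t[2] - pred_lon) * math.cos(math.radians(pred_lat))
        return dlat * dlat + dlon * dlon
    ranked = sorted(tiles, key=dist2)
    for i in range(0, len(ranked), batch):
        yield ranked[i:i + batch]

def relocalize(tiles, pred_lat, pred_lon, match_tile, max_batches=2):
    """Try the top 3 tiles, then the next 3; give up after max_batches."""
    for n, group in enumerate(ranked_batches(tiles, pred_lat, pred_lon)):
        if n >= max_batches:
            break
        for tile in group:
            pose = match_tile(tile)
            if pose is not None:
                return tile, pose
    return None  # caller escalates (request user input via API)
```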

Existing/Competitor Solutions Analysis

| Solution | Approach | Accuracy | Hardware | Limitations |
|---|---|---|---|---|
| Mateos-Ramirez et al. (2024) | VO (ORB) + satellite keypoint correction + Kalman | 142m mean / 17km (0.83%) | Orange Pi class | No re-localization; ORB only; 1000m+ altitude |
| SatLoc (2025) | DinoV2 + XFeat + optical flow + adaptive fusion | <15m, >90% coverage | Edge (unspecified) | Paper not fully accessible |
| LiteSAM (2025) | MobileOne + TAIFormer + MinGRU subpixel refinement | RMSE@30 = 17.86m on UAV-VisLoc | RTX 3090 (62ms), AGX Orin (497ms @ 1184px) | Not tested on Orin Nano; AGX Orin is 3-4x more powerful |
| TerboucheHacene/visual_localization | SuperPoint/SuperGlue/GIM + VO + satellite | Not quantified | Desktop-class | Not edge-optimized |
| cuVSLAM (NVIDIA, 2025-2026) | CUDA-accelerated VO+SLAM, mono/stereo/IMU | <1% trajectory error (KITTI), <5cm (EuRoC) | Jetson Orin Nano (116fps) | VO only, no satellite matching |
| VRLM (2024) | FocalNet backbone + multi-scale feature fusion | 83.35% MA@20 | Desktop | Not edge-optimized |
| Scale-Aware UAV-to-Satellite (2026) | Semantic geometric + metric scale recovery | N/A | Desktop | Addresses scale ambiguity problem |
| EfficientLoFTR (CVPR 2024) | Aggregated attention + adaptive token selection, semi-dense | Competitive with LiteSAM | 2.5x faster than LoFTR, TRT available | 15.05M params, heavier than LiteSAM |
| PFED (2025) | Knowledge distillation + multi-view refinement, retrieval | 97.15% Recall@1 (University-1652) | AGX Orin (251.5 FPS) | Retrieval only, not pixel-level matching |
| STHN (IEEE RA-L 2024) | Deep homography estimation, coarse-to-fine | 4.24m at 50m range | Open-source, lightweight | Trained on thermal, needs RGB retraining |
| Hierarchical AVL (2025) | DINOv2 retrieval + SuperPoint matching | 64.5-95% success rate | ROS, IMU integration | Two-stage complexity |
| JointLoc (IROS 2024) | Retrieval + VO fusion, adaptive weighting | 0.237m RMSE over 1km | Open-source | Designed for Mars/planetary, needs adaptation |

Architecture

Component: Visual Odometry

| Solution | Tools | Advantages | Limitations | Performance | Fit |
|---|---|---|---|---|---|
| cuVSLAM (mono+IMU) | PyCuVSLAM v15.0.0 | 116fps on Orin Nano, NVIDIA-optimized, loop closure, IMU fallback | Closed-source CUDA library | ~9ms/frame | Best |
| XFeat frame-to-frame | XFeatTensorRT | 5x faster than SuperPoint, open-source | ~30-50ms total, no IMU integration | ~30-50ms/frame | ⚠️ Fallback |
| SuperPoint+LightGlue | LightGlue-ONNX TRT | Good accuracy, adaptive pruning | ~130-280ms, no IMU, no loop closure | ~130-280ms/frame | Rejected |
| ORB-SLAM3 | OpenCV + custom | Well-understood, open-source | CPU-heavy, ~30fps on Orin | ~33ms/frame | ⚠️ Slower |

Selected: cuVSLAM (mono+IMU mode) — 116fps, purpose-built by NVIDIA for Jetson. Auto-fallback to IMU when visual tracking fails.

SP+LG rejection rationale: 15-33x slower than cuVSLAM. No built-in IMU fusion, loop closure, or tracking failure detection. Building these features around SP+LG would take significant development time and still be slower. XFeat at ~30-50ms is a better fallback for VO if cuVSLAM fails on nadir camera.

Component: Satellite Image Matching

| Solution | Tools | Advantages | Limitations | Performance | Fit |
|---|---|---|---|---|---|
| LiteSAM (opt) TRT FP16 @ 1280px | TensorRT | Best satellite-aerial accuracy (RMSE@30 17.86m), 6.31M params, subpixel refinement | Untested on Orin Nano Super with TensorRT | Est. ~165-330ms @ 1280px TRT FP16 | If ≤200ms |
| XFeat semi-dense | XFeatTensorRT | ~50-100ms, lightweight, Jetson-proven, fastest | General-purpose, not designed for cross-view; our nadir-nadir gap is small, so it may work | ~50-100ms | Fallback if LiteSAM >200ms |

Selection: Day-one benchmark on Orin Nano Super:

  1. Export LiteSAM (opt) to TensorRT FP16
  2. Benchmark at 1280px
  3. If ≤200ms → LiteSAM at 1280px
  4. If >200ms → XFeat

Component: Sensor Fusion

| Solution | Tools | Advantages | Limitations | Performance | Fit |
|---|---|---|---|---|---|
| Error-State EKF (ESKF) | Custom Python/C++ | Lightweight, multi-rate, well-understood | Linear approximation | <1ms/step | Best |
| Hybrid ESKF/UKF | Custom | 49% better accuracy | More complex | ~2-3ms/step | ⚠️ Upgrade path |
| Factor Graph (GTSAM) | GTSAM | Best accuracy | Heavy compute | ~10-50ms/step | Too heavy |

Selected: ESKF with adaptive measurement noise. State vector: [position(3), velocity(3), orientation_quat(4), accel_bias(3), gyro_bias(3)] = 16 states.
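A deliberately simplified planar sketch of the prediction step: only position, velocity, yaw, and the two biases are propagated, whereas the real filter carries the full 16-value nominal state plus error covariance. All names (`NavState`, `predict`) are illustrative:

```python
import math
from dataclasses import dataclass

@dataclass
class NavState:
    px: float = 0.0; py: float = 0.0   # position, metres (local frame)
    vx: float = 0.0; vy: float = 0.0   # velocity, m/s
    yaw: float = 0.0                   # heading, rad
    ab: float = 0.0                    # accel bias (body x), m/s^2
    gb: float = 0.0                    # gyro bias (yaw rate), rad/s

def predict(s: NavState, accel_body_x: float, yaw_rate: float, dt: float) -> NavState:
    """Propagate the nominal state with one bias-corrected IMU sample (Euler)."""
    a = accel_body_x - s.ab
    w = yaw_rate - s.gb
    yaw = s.yaw + w * dt
    ax = a * math.cos(yaw)             # rotate body accel into the local frame
    ay = a * math.sin(yaw)
    return NavState(
        px=s.px + s.vx * dt + 0.5 * ax * dt * dt,
        py=s.py + s.vy * dt + 0.5 * ay * dt * dt,
        vx=s.vx + ax * dt,
        vy=s.vy + ay * dt,
        yaw=yaw, ab=s.ab, gb=s.gb,
    )

# 1 s of straight flight at 20 m/s, zero acceleration, 100 Hz IMU:
s = NavState(vx=20.0)
for _ in range(100):
    s = predict(s, accel_body_x=0.0, yaw_rate=0.0, dt=0.01)
```

Between measurement updates this predict step runs at IMU rate (100+Hz); VO and satellite fixes then correct the accumulated error state.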

Component: Satellite Tile Preprocessing (Offline)

Selected: GeoHash-indexed tile pairs on disk + RAM preloading.

Pipeline:

  1. Define operational area from flight plan
  2. Download satellite tiles from Google Maps Tile API at max zoom (18-19)
  3. Pre-resize each tile to matcher input resolution
  4. Store: original tile + resized tile + metadata (GPS bounds, zoom, GSD) in GeoHash-indexed directory structure
  5. Copy to Jetson storage before flight
  6. At session start: preload tiles within ±2km of flight plan into RAM (~200MB for 50km route)
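For step 4's directory keys, the standard interleaved-base32 GeoHash algorithm is enough; a minimal encoder follows. Precision 6 (~1.2 km × 0.6 km cells) is an assumed bucket size, not a project decision:

```python
_B32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # GeoHash base32 (no a, i, l, o)

def geohash(lat: float, lon: float, precision: int = 6) -> str:
    """Encode lat/lon to a GeoHash string by bisecting lon/lat alternately."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, bit_count, even, out = 0, 0, True, []
    while len(out) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if val >= mid:
            bits |= 1
            rng[0] = mid               # keep the upper half
        else:
            rng[1] = mid               # keep the lower half
        even = not even
        bit_count += 1
        if bit_count == 5:             # 5 bits per base32 character
            out.append(_B32[bits])
            bits, bit_count = 0, 0
    return "".join(out)
```

Tiles sharing a prefix are spatially adjacent, so prefix listing of the directory tree doubles as a coarse spatial query during preloading.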

Component: Re-localization (Disconnected Segments)

Selected: Keyframe satellite matching is always active + ranked tile search on VO failure.

When cuVSLAM reports tracking loss (sharp turn, no features):

  1. Immediately flag next frame as keyframe → trigger satellite matching
  2. Compute IMU dead-reckoning position since last known position
  3. Rank preloaded tiles by distance to dead-reckoning position
  4. Try top 3 tiles sequentially (not all tiles in radius)
  5. If match found: position recovered, new segment begins
  6. If 3 consecutive keyframe failures across top tiles: expand to next 3 tiles
  7. If still no match after 3+ full attempts: request user input via API

Component: Object Center Coordinates

Geometric calculation once frame-center GPS is known:

  1. Pixel offset from center: (dx_px, dy_px)
  2. Convert to meters: dx_m = dx_px × GSD, dy_m = dy_px × GSD
  3. Rotate by IMU yaw heading
  4. Convert meter offset to lat/lon and add to frame-center GPS
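The four steps above, sketched with the usual small-offset (equirectangular) approximation. The yaw convention (counterclockwise from east), the image-y sign flip, and the 111,320 m-per-degree constant are assumptions to validate against the real camera/IMU frames:

```python
import math

def object_latlon(center_lat: float, center_lon: float,
                  dx_px: float, dy_px: float,
                  gsd_m: float, yaw_rad: float) -> tuple:
    """Pixel offset from frame center (x right, y down) -> object lat/lon."""
    # 1-2. pixel offset to metres in the camera frame
    dx_m = dx_px * gsd_m
    dy_m = -dy_px * gsd_m              # image y grows downward; north is +y
    # 3. rotate the camera-frame offset by yaw into east/north components
    east = dx_m * math.cos(yaw_rad) - dy_m * math.sin(yaw_rad)
    north = dx_m * math.sin(yaw_rad) + dy_m * math.cos(yaw_rad)
    # 4. metres to degrees (small-offset approximation)
    dlat = north / 111_320.0
    dlon = east / (111_320.0 * math.cos(math.radians(center_lat)))
    return center_lat + dlat, center_lon + dlon
```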

Component: API & Streaming

Selected: FastAPI + sse-starlette. REST for session management, SSE for real-time position stream. OpenAPI auto-documentation.

Processing Time Budget (per frame, 400ms budget)

Normal Frame (non-keyframe, ~60-80% of frames)

| Step | Time | Notes |
|---|---|---|
| Image capture + transfer | ~10ms | CSI/USB3 |
| Downsample (for cuVSLAM) | ~2ms | OpenCV CUDA |
| cuVSLAM VO+IMU | ~9ms | NVIDIA CUDA-optimized, 116fps capable |
| ESKF fusion (VO+IMU update) | ~1ms | C extension or NumPy |
| SSE emit | ~1ms | Async |
| Total | ~23ms | Well within 400ms |

Keyframe Satellite Matching (async, every 3-10 frames)

Runs asynchronously on a separate CUDA stream — does NOT block per-frame VO output.

Path A — LiteSAM TRT FP16 at 1280px (if ≤200ms benchmark):

| Step | Time | Notes |
|---|---|---|
| Downsample to 1280px | ~1ms | OpenCV CUDA |
| Load satellite tile | ~1ms | Pre-loaded in RAM |
| LiteSAM (opt) TRT FP16 matching | ≤200ms | TensorRT FP16, 1280px, go/no-go threshold |
| Geometric pose (RANSAC) | ~5ms | Homography estimation |
| ESKF satellite update | ~1ms | Delayed measurement |
| Total | ≤210ms | Async, within budget |

Path B — XFeat (if LiteSAM >200ms):

| Step | Time | Notes |
|---|---|---|
| XFeat feature extraction (both images) | ~10-20ms | TensorRT FP16/INT8 |
| XFeat semi-dense matching | ~30-50ms | KNN + refinement |
| Geometric verification (RANSAC) | ~5ms | |
| ESKF satellite update | ~1ms | |
| Total | ~50-80ms | Comfortably within budget |

Memory Budget (Jetson Orin Nano Super, 8GB shared)

| Component | Memory | Notes |
|---|---|---|
| OS + runtime | ~1.5GB | JetPack 6.2 + Python |
| cuVSLAM | ~200-500MB | CUDA library + map state; configure map pruning for 3000-frame flights |
| Satellite matcher TensorRT | ~50-100MB | LiteSAM FP16 or XFeat FP16 |
| Preloaded satellite tiles | ~200MB | ±2km of flight plan, pre-resized |
| Current frame (downsampled) | ~2MB | 640×480×3 |
| ESKF state + buffers | ~10MB | |
| FastAPI + SSE runtime | ~100MB | |
| Total | ~2.1-2.9GB | ~26-36% of 8GB, comfortable margin |

Confidence Scoring

| Level | Condition | Expected Accuracy |
|---|---|---|
| HIGH | Satellite match succeeded + cuVSLAM consistent | <20m |
| MEDIUM | cuVSLAM VO only, recent satellite correction (<500m travel) | 20-50m |
| LOW | cuVSLAM VO only, no recent satellite correction | 50-100m+ |
| VERY LOW | IMU dead-reckoning only (cuVSLAM + satellite both failed) | 100m+ |
| MANUAL | User-provided position | As provided |
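The table transcribes directly into code; the `confidence` function signature and its inputs are ad hoc names for signals the fusion layer already has:

```python
def confidence(sat_match_ok: bool, vo_ok: bool,
               metres_since_fix: float, user_position: bool) -> str:
    """Map fusion status to the confidence levels in the table above."""
    if user_position:
        return "MANUAL"
    if sat_match_ok and vo_ok:
        return "HIGH"
    if vo_ok and metres_since_fix < 500:   # recent satellite correction
        return "MEDIUM"
    if vo_ok:
        return "LOW"
    return "VERY LOW"                      # IMU dead-reckoning only
```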

Key Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| cuVSLAM fails on low-texture agricultural terrain | HIGH | Frequent tracking loss, degraded VO | Increase satellite matching frequency when keypoint count drops; IMU dead-reckoning bridge (~1.5s); accept higher drift in featureless segments; satellite matching recovers position when texture returns |
| LiteSAM TRT FP16 >200ms at 1280px on Orin Nano Super | MEDIUM | Must use XFeat instead (less accurate for cross-view) | Day-one TRT FP16 benchmark; if >200ms, switch to XFeat. Since the nadir-nadir gap is small, XFeat may still perform adequately |
| XFeat cross-view accuracy insufficient | MEDIUM | Satellite corrections less accurate | Benchmark XFeat on actual operational-area satellite-aerial pairs; increase keyframe frequency; multi-tile consensus; strict RANSAC |
| cuVSLAM map memory growth on long flights | MEDIUM | Memory pressure | Configure map pruning, set max keyframes, monitor memory |
| Google Maps satellite quality in conflict zone | HIGH | Satellite matching fails | Accept VO+IMU with higher drift; request user input sooner; alternative satellite providers |
| cuVSLAM is closed-source, no nadir benchmarks | MEDIUM | Unknown failure modes over farmland | Extensive testing with real nadir UAV imagery before deployment; XFeat VO as fallback (also uses learned features) |
| Tile I/O bottleneck during expanded search | LOW | Delayed re-localization | Preload ±2km tiles in RAM; ranked search instead of exhaustive |

Testing Strategy

Integration / Functional Tests

  • End-to-end pipeline test with real flight data (60 images from input_data/)
  • Compare computed positions against ground truth GPS from coordinates.csv
  • Measure: percentage within 50m, percentage within 20m
  • Test sharp-turn handling: introduce 90-degree heading change in sequence
  • Test user-input fallback: simulate 3+ consecutive failures
  • Test SSE streaming: verify client receives VO result within 50ms, satellite-corrected result within 500ms
  • Test session management: start/stop/restart flight sessions via REST API
  • Test cuVSLAM map memory: run 3000-frame session, monitor memory growth

Non-Functional Tests

  • Day-one satellite matcher benchmark: LiteSAM TRT FP16 at 1280px on Orin Nano Super. If ≤200ms → use LiteSAM. If >200ms → use XFeat. Also measure accuracy on test satellite-aerial pairs for both.
  • cuVSLAM benchmark: verify 116fps monocular+IMU on Orin Nano Super
  • cuVSLAM terrain stress test: test with nadir camera over (a) urban/structured terrain, (b) agricultural fields, (c) water/uniform terrain, (d) forest. Measure: keypoint count, tracking success rate, drift per 100 frames, IMU fallback frequency
  • cuVSLAM keypoint monitoring: verify that low-keypoint detection triggers increased satellite matching
  • Performance: measure per-frame processing time (must be <400ms)
  • Memory: monitor peak usage during 3000-frame session (must stay <8GB)
  • Stress: process 3000 frames without memory leak
  • Keyframe strategy: vary interval (2, 3, 5, 10) and measure accuracy vs latency tradeoff
  • Tile preloading: verify RAM usage of preloaded tiles for 50km flight plan

References

  • AC Assessment: _docs/00_research/gps_denied_nav/00_ac_assessment.md
  • Tech stack evaluation: _docs/01_solution/tech_stack.md
  • Security analysis: _docs/01_solution/security_analysis.md