gps-denied-onboard/_docs/01_solution/solution_draft04.md
Solution Draft

Assessment Findings

| Old Component Solution | Weak Point (functional/security/performance) | New Solution |
|---|---|---|
| ONNX Runtime as potential inference runtime for AI models | Performance: ONNX Runtime CUDA EP on Jetson Orin Nano is 7-8x slower than standalone TRT with default settings (tensor cores not utilized). Even TRT-EP shows up to 3x overhead on some models. | Use native TRT Engine for all AI models. Convert PyTorch → ONNX → trtexec → .engine. Load with the tensorrt Python module. Eliminates the ONNX Runtime dependency entirely. |
| ONNX Runtime TRT-EP memory overhead | Performance: ONNX RT TRT-EP keeps the serialized engine in memory (~420-440MB vs 130-140MB native TRT), a delta of ~280-300MB PER MODEL. On 8GB shared memory this wastes ~560-600MB for two models. | Native TRT releases the serialized blob after deserialization → saves ~280-300MB per model. Total savings ~560-600MB (~7% of total memory). Critical given the cuVSLAM map growth risk. |
| No explicit TRT engine build step in offline pipeline | Functional: Draft03 mentions TRT FP16 but doesn't define the build workflow. When and where are engines built? | Add a TRT engine build step to the offline preparation pipeline: after satellite tile download, run trtexec on the Jetson to build .engine files. Store alongside tiles. One-time cost per model version. |
| Cross-platform portability via ONNX Runtime | Functional: ONNX Runtime's primary value is cross-platform support. Our deployment is Jetson-only, so this value is zero; we pay the performance/memory tax for unused portability. | Drop ONNX Runtime. Jetson Orin Nano Super is fixed deployment hardware. TRT Engine is the optimal runtime for NVIDIA-only deployment. |
| No DLA offloading considered | Performance: Draft03 doesn't mention DLA. Jetson Orin Nano has NO DLA cores; only Orin NX (1-2) and AGX Orin (2) have DLA. | Confirm: DLA offloading is NOT available on Orin Nano. All inference must run on the GPU (1024 CUDA cores, 32 tensor cores), which makes maximizing GPU efficiency via native TRT even more critical. |
| LiteSAM MinGRU TRT compatibility risk | Functional: LiteSAM's subpixel refinement uses 4 stacked MinGRU layers over a 3×3 candidate window (seq_len=9). MinGRU gates depend only on input C_f (not h_{t-1}), so z_t/h̃_t are pre-computable. Ops are standard: Linear, Sigmoid, Mul, Add, ReLU, Tanh. Risk is LOW-MEDIUM, depending on whether the implementation uses logcumsumexp (problematic) or a simple loop (fine). seq_len=9 makes this trivially rewritable. | Day-one verification: clone LiteSAM repo → torch.onnx.export → polygraphy inspect → trtexec --fp16. If export fails on MinGRU: rewrite forward() as an unrolled loop (9 steps). If LiteSAM cannot be made TRT-compatible: replace with EfficientLoFTR TRT (proven TRT path via Coarse_LoFTR_TRT, 15.05M params, semi-dense matching). |

Product Solution Description

A real-time GPS-denied visual navigation system for fixed-wing UAVs, running on a Jetson Orin Nano Super (8GB). All AI model inference uses native TensorRT Engine files — no ONNX Runtime dependency. The system replaces the GPS module by sending MAVLink GPS_INPUT messages via pymavlink over UART at 5-10Hz.

Position is determined by fusing: (1) CUDA-accelerated visual odometry (cuVSLAM — native CUDA), (2) absolute position corrections from satellite image matching (LiteSAM or XFeat — TRT Engine FP16), and (3) IMU data from the flight controller via ESKF.

Inference runtime decision: Native TRT Engine over ONNX Runtime because:

  1. ONNX RT CUDA EP is 7-8x slower on Orin Nano (tensor core bug)
  2. ONNX RT TRT-EP wastes ~280-300MB per model (serialized engine retained in memory)
  3. Cross-platform portability has zero value — deployment is Jetson-only
  4. Native TRT provides direct CUDA stream control for pipelining with cuVSLAM

Hard constraint: the camera captures at ~3fps (333ms interval), the full VO+ESKF pipeline must complete within 400ms, and GPS_INPUT is sent at 5-10Hz.
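The constraint's arithmetic, restated in plain Python (all numbers come from this draft; the conclusion about asynchrony matches the Stream B design below):

```python
frame_interval_ms = 1000 / 3   # ~333 ms between camera frames at ~3 fps
pipeline_budget_ms = 400       # hard deadline for the full VO+ESKF pipeline
gps_period_ms = (100, 200)     # GPS_INPUT at 10 Hz .. 5 Hz

# The deadline exceeds the frame interval, so any work that approaches the
# budget (keyframe satellite matching) must overlap the next frame --
# hence it runs asynchronously on its own CUDA stream.
assert pipeline_budget_ms > frame_interval_ms

# Between camera frames the ESKF bridges with IMU-only predictions so
# GPS_INPUT can still go out at 5-10 Hz: at 10 Hz, ~3.3 predictions/frame.
per_frame = frame_interval_ms / gps_period_ms[0]
print(round(per_frame, 1))  # 3.3
```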

AI Model Runtime Summary:

| Model | Runtime | Precision | Memory | Integration |
|---|---|---|---|---|
| cuVSLAM | Native CUDA (PyCuVSLAM) | N/A (closed-source) | ~200-500MB | CUDA Stream A |
| LiteSAM | TRT Engine | FP16 | ~50-80MB | CUDA Stream B |
| XFeat | TRT Engine | FP16 | ~30-50MB | CUDA Stream B (fallback) |
| ESKF | CPU (Python/C++) | FP64 | ~10MB | CPU thread |

Offline Preparation Pipeline (before flight):

  1. Download satellite tiles → validate → pre-resize → store (existing)
  2. NEW: Build TRT engines on Jetson (one-time per model version)
    • trtexec --onnx=litesam_fp16.onnx --saveEngine=litesam.engine --fp16
    • trtexec --onnx=xfeat.onnx --saveEngine=xfeat.engine --fp16
  3. Copy tiles + engines to Jetson storage
  4. At startup: load engines + preload tiles into RAM
```
┌─────────────────────────────────────────────────────────────────────┐
│                    OFFLINE (Before Flight)                           │
│  1. Satellite Tiles → Download & Validate → Pre-resize → Store      │
│     (Google Maps)     (≥0.5m/px, <2yr)     (matcher res)  (GeoHash)│
│  2. TRT Engine Build (one-time per model version):                  │
│     PyTorch model → reparameterize → ONNX export → trtexec --fp16  │
│     Output: litesam.engine, xfeat.engine                            │
│  3. Copy tiles + engines to Jetson storage                          │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    ONLINE (During Flight)                            │
│                                                                     │
│  STARTUP:                                                           │
│  1. pymavlink → read GLOBAL_POSITION_INT → init ESKF                │
│  2. Load TRT engines: litesam.engine + xfeat.engine                 │
│     (tensorrt.Runtime → deserialize_cuda_engine → create_context)   │
│  3. Allocate GPU buffers for TRT input/output (PyCUDA)              │
│  4. Start cuVSLAM with first camera frames                          │
│  5. Preload satellite tiles ±2km into RAM                           │
│  6. Begin GPS_INPUT output loop at 5-10Hz                           │
│                                                                     │
│  EVERY FRAME (3fps, 333ms interval):                                │
│  ┌──────────────────────────────────────┐                           │
│  │ Nav Camera → Downsample (CUDA ~2ms)  │                           │
│  │           → cuVSLAM VO+IMU (~9ms)    │ ← CUDA Stream A          │
│  │           → ESKF measurement update  │                           │
│  └──────────────────────────────────────┘                           │
│                                                                     │
│  5-10Hz CONTINUOUS:                                                 │
│  ┌──────────────────────────────────────┐                           │
│  │ ESKF IMU prediction → GPS_INPUT send │──→ Flight Controller      │
│  └──────────────────────────────────────┘                           │
│                                                                     │
│  KEYFRAMES (every 3-10 frames, async):                              │
│  ┌──────────────────────────────────────┐                           │
│  │ TRT Engine inference (Stream B):     │                           │
│  │   context.execute_async_v3(stream_B) │──→ ESKF correction        │
│  │   LiteSAM FP16 or XFeat FP16         │                           │
│  └──────────────────────────────────────┘                           │
│                                                                     │
│  TELEMETRY (1Hz):                                                   │
│  ┌──────────────────────────────────────┐                           │
│  │ NAMED_VALUE_FLOAT: confidence, drift │──→ Ground Station         │
│  └──────────────────────────────────────┘                           │
└─────────────────────────────────────────────────────────────────────┘
```

Architecture

Component: AI Model Inference Runtime

| Solution | Tools | Advantages | Limitations | Performance | Memory | Fit |
|---|---|---|---|---|---|---|
| Native TRT Engine | tensorrt Python + PyCUDA + trtexec | Optimal latency, minimal memory, full tensor core usage, direct CUDA stream control | Hardware-specific engines, manual buffer management, rebuild per TRT version | Optimal | ~50-130MB total (both models) | Best |
| ONNX Runtime TRT-EP | onnxruntime + TensorRT EP | Auto-fallback for unsupported ops, simpler API, automatic engine caching | +280-300MB per model, wrapper overhead, first-run latency spike | Near-parity (claimed), up to 3x slower (observed) | ~640-690MB total (both models) | Memory overhead unacceptable |
| ONNX Runtime CUDA EP | onnxruntime + CUDA EP | Simplest API, broadest op support | 7-8x slower on Orin Nano (tensor core bug), no TRT optimizations | 7-8x slower | Standard | Performance unacceptable |
| Torch-TensorRT | torch_tensorrt | AOT compilation, PyTorch-native, handles mixed TRT/PyTorch | Newer on Jetson, requires PyTorch runtime at inference | Near native TRT | PyTorch runtime ~500MB+ | ⚠️ Viable alternative if TRT export fails |

Selected: Native TRT Engine — optimal performance and memory on our fixed NVIDIA hardware.

Fallback: If any model has unsupported TRT ops (e.g., MinGRU in LiteSAM), use Torch-TensorRT for that specific model. Torch-TensorRT handles mixed TRT/PyTorch execution but requires PyTorch runtime in memory.

Component: TRT Engine Conversion Workflow

LiteSAM conversion:

  1. Load PyTorch model with trained weights
  2. Reparameterize MobileOne backbone (collapse multi-branch → single Conv2d+BN)
  3. Export to ONNX: torch.onnx.export(model, dummy_input, "litesam.onnx", opset_version=17)
  4. Verify with polygraphy: polygraphy inspect model litesam.onnx
  5. Build engine on Jetson: trtexec --onnx=litesam.onnx --saveEngine=litesam.engine --fp16 --memPoolSize=workspace:2048
  6. Verify engine: trtexec --loadEngine=litesam.engine --fp16

XFeat conversion:

  1. Load PyTorch model
  2. Export to ONNX: torch.onnx.export(model, dummy_input, "xfeat.onnx", opset_version=17)
  3. Build engine on Jetson: trtexec --onnx=xfeat.onnx --saveEngine=xfeat.engine --fp16
  4. Alternative: use XFeatTensorRT C++ implementation directly

INT8 quantization strategy (optional, future optimization):

  • MobileOne backbone (CNN): INT8 safe with calibration data
  • TAIFormer (transformer attention): FP16 only — INT8 degrades accuracy
  • XFeat: evaluate INT8 on actual UAV-satellite pairs before deploying
  • Use nvidia-modelopt for calibration: from modelopt.onnx.quantization import quantize
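To make the calibration point concrete, here is a generic symmetric INT8 quantize/dequantize round-trip in numpy. This is illustrative math, not the nvidia-modelopt API: calibration's job is essentially to choose the scale, and a scale chosen from non-representative data inflates quantization error, which is why XFeat INT8 must be evaluated on actual UAV-satellite pairs.

```python
import numpy as np

def int8_quantize(x, scale):
    """Symmetric INT8 quantization: q = clip(round(x / scale), -127, 127)."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def int8_dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, 10_000).astype(np.float32)  # stand-in activations

# Max calibration: scale chosen so the observed range maps onto [-127, 127].
scale = np.abs(acts).max() / 127.0
err_good = np.abs(int8_dequantize(int8_quantize(acts, scale), scale) - acts).mean()

# A mis-calibrated (10x too large) scale coarsens the grid and inflates error.
bad_scale = scale * 10
err_bad = np.abs(int8_dequantize(int8_quantize(acts, bad_scale), bad_scale) - acts).mean()
assert err_bad > err_good
```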

Component: TRT Python Inference Wrapper

Minimal wrapper class for TRT engine inference:

```python
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # simplest way to create a CUDA context (or manage one explicitly)
import pycuda.driver as cuda

class TRTInference:
    def __init__(self, engine_path, stream):
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.runtime = trt.Runtime(self.logger)
        with open(engine_path, 'rb') as f:
            self.engine = self.runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        self.stream = stream
        self._allocate_buffers()

    def _allocate_buffers(self):
        # Pre-allocate one device buffer per I/O tensor: zero allocation in flight.
        self.inputs = {}
        self.outputs = {}
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            shape = self.engine.get_tensor_shape(name)
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            size = trt.volume(shape)
            device_mem = cuda.mem_alloc(size * np.dtype(dtype).itemsize)
            self.context.set_tensor_address(name, int(device_mem))
            mode = self.engine.get_tensor_mode(name)
            if mode == trt.TensorIOMode.INPUT:
                self.inputs[name] = (device_mem, shape, dtype)
            else:
                self.outputs[name] = (device_mem, shape, dtype)

    def infer_async(self, input_data):
        # Enqueue H2D copies and inference on our stream; returns immediately.
        for name, data in input_data.items():
            cuda.memcpy_htod_async(self.inputs[name][0],
                                   np.ascontiguousarray(data), self.stream)
        self.context.execute_async_v3(self.stream.handle)

    def get_output(self):
        # Enqueue D2H copies, then block until the stream has drained.
        results = {}
        for name, (dev_mem, shape, dtype) in self.outputs.items():
            host_mem = np.empty(shape, dtype=dtype)
            cuda.memcpy_dtoh_async(host_mem, dev_mem, self.stream)
            results[name] = host_mem
        self.stream.synchronize()
        return results
```

Key design: infer_async() + get_output() split enables pipelining with cuVSLAM on Stream A while satellite matching runs on Stream B.

Component: Visual Odometry (UNCHANGED)

cuVSLAM — native CUDA library, not affected by TRT migration. Already optimal.

Component: Satellite Image Matching (UPDATED runtime + fallback chain)

| Solution | Tools | Advantages | Limitations | Performance (est. Orin Nano Super TRT FP16) | Params | Fit |
|---|---|---|---|---|---|---|
| LiteSAM (opt) TRT Engine FP16 @ 1280px | trtexec + tensorrt Python | Best satellite-aerial accuracy (RMSE@30=17.86m UAV-VisLoc), 6.31M params, smallest model | MinGRU TRT export needs verification (LOW-MEDIUM risk) | Est. ~165-330ms | 6.31M | Primary (if TRT export succeeds AND ≤200ms) |
| EfficientLoFTR TRT Engine FP16 | trtexec + tensorrt Python | Proven TRT path (Coarse_LoFTR_TRT repo, 138 stars). Semi-dense. CVPR 2024. High accuracy. | 2.4x more params than LiteSAM. Requires einsum→elementary-ops rewrite for TRT (documented in Coarse_LoFTR_TRT). | Est. ~200-400ms | 15.05M | Fallback if LiteSAM TRT fails |
| XFeat TRT Engine FP16 | trtexec + tensorrt Python (or XFeatTensorRT C++) | Fastest. Proven TRT implementation. Lightweight. | General-purpose, not designed for the cross-view satellite-aerial gap (but the nadir-nadir gap is small). | Est. ~50-100ms | <5M | Speed fallback |

Decision tree (day-one on Orin Nano Super):

  1. Clone LiteSAM repo → reparameterize MobileOne → torch.onnx.export() → polygraphy inspect
  2. If ONNX export succeeds → trtexec --onnx=litesam.onnx --saveEngine=litesam.engine --fp16
  3. If MinGRU causes ONNX/TRT failure → rewrite MinGRU forward() as unrolled 9-step loop → retry
  4. If rewrite fails or accuracy degrades → switch to EfficientLoFTR TRT:
    • Apply Coarse_LoFTR_TRT TRT-adaptation techniques (einsum replacement, etc.)
    • Export to ONNX → trtexec --fp16
    • Benchmark at 640×480 and 1280px
  5. Benchmark winner: if ≤200ms → use it. If >200ms but ≤300ms → acceptable (async on Stream B). If >300ms → use XFeat TRT
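The unrolled-loop rewrite in step 3 can be sketched in numpy. This assumes the standard minGRU recurrence (h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t, with z_t and h̃_t computed from x_t alone), a zero initial state, and single-layer illustrative weights W_z, W_h; the real LiteSAM head stacks 4 such layers. Because the gates never depend on h_{t-1}, everything except 9 chained Mul/Add ops is precomputable, so the export contains no Loop/Scan op:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mingru_unrolled(x, W_z, W_h, seq_len=9):
    """minGRU over a fixed 9-step window, fully unrolled.

    z_t and h_tilde_t depend only on x_t, so both are batch-precomputed;
    the recurrence degenerates to seq_len chained Mul/Add ops, all of
    which are TRT-supported elementary layers.
    """
    z = sigmoid(x @ W_z)          # (seq, batch, hidden) -- all gates at once
    h_tilde = x @ W_h             # (seq, batch, hidden) -- all candidates at once
    h = np.zeros_like(h_tilde[0]) # zero initial state (assumption)
    for t in range(seq_len):      # fixed trip count -> unrollable at export time
        h = (1.0 - z[t]) * h + z[t] * h_tilde[t]
    return h

rng = np.random.default_rng(0)
x = rng.standard_normal((9, 4, 16)).astype(np.float32)  # seq=9 (3x3 window)
W_z = rng.standard_normal((16, 32)).astype(np.float32)
W_h = rng.standard_normal((16, 32)).astype(np.float32)
h = mingru_unrolled(x, W_z, W_h)
assert h.shape == (4, 32)
```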

EfficientLoFTR TRT adaptation (from Coarse_LoFTR_TRT paper, proven workflow):

  • Replace torch.einsum() with elementary ops (view, bmm, reshape, sum)
  • Replace any TRT-incompatible high-level PyTorch functions
  • Use ONNX export path (less memory required than Torch-TensorRT on 8GB device)
  • Knowledge distillation available for further parameter reduction if needed
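The einsum replacement can be checked numerically. The sketch below reproduces a typical LoFTR-style attention contraction with transpose/reshape/matmul only (in PyTorch these would be permute/view/bmm), all of which map to TRT layers; numpy is used here so the equivalence check is self-contained:

```python
import numpy as np

# Typical attention contraction: scores = einsum("nlhd,nshd->nlsh", q, k)
rng = np.random.default_rng(0)
n, l, s, h, d = 2, 5, 7, 4, 8
q = rng.standard_normal((n, l, h, d)).astype(np.float32)
k = rng.standard_normal((n, s, h, d)).astype(np.float32)

ref = np.einsum("nlhd,nshd->nlsh", q, k)

# Same contraction via elementary ops only:
q2 = q.transpose(0, 2, 1, 3).reshape(n * h, l, d)          # (n*h, l, d)
k2 = k.transpose(0, 2, 3, 1).reshape(n * h, d, s)          # (n*h, d, s)
out = (q2 @ k2).reshape(n, h, l, s).transpose(0, 2, 3, 1)  # -> (n, l, s, h)

assert np.allclose(ref, out, atol=1e-5)
```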

Component: Sensor Fusion (UNCHANGED)

ESKF — CPU-based mathematical filter, not affected.

Component: Flight Controller Integration (UNCHANGED)

pymavlink — not affected by TRT migration.

Component: Ground Station Telemetry (UNCHANGED)

MAVLink NAMED_VALUE_FLOAT — not affected.

Component: Startup & Lifecycle (UPDATED)

Updated startup sequence:

  1. Boot Jetson → start GPS-Denied service (systemd)
  2. Connect to flight controller via pymavlink on UART
  3. Wait for heartbeat from flight controller
  4. Initialize PyCUDA context
  5. Load TRT engines: litesam.engine + xfeat.engine via tensorrt.Runtime.deserialize_cuda_engine()
  6. Allocate GPU I/O buffers for both models
  7. Create CUDA streams: Stream A (cuVSLAM), Stream B (satellite matching)
  8. Read GLOBAL_POSITION_INT → init ESKF
  9. Start cuVSLAM with first camera frames
  10. Begin GPS_INPUT output loop at 5-10Hz
  11. Preload satellite tiles within ±2km into RAM
  12. System ready

Engine load time: ~1-3 seconds per engine (deserialization from .engine file). One-time cost at startup.

Component: Thermal Management (UNCHANGED)

Same adaptive pipeline. TRT engines are slightly more power-efficient than ONNX Runtime, but the difference is within noise.

Component: Object Localization (UNCHANGED)

Not affected — trigonometric calculation, no AI inference.

Speed Optimization Techniques

1. cuVSLAM for Visual Odometry (~9ms/frame)

Unchanged from draft03. Native CUDA, not part of TRT migration.

2. Native TRT Engine Inference (NEW)

All AI models run as pre-compiled TRT FP16 engines:

  • Engine files built offline with trtexec (one-time per model version)
  • Loaded at startup (~1-3s per engine)
  • Inference via context.execute_async_v3() on dedicated CUDA Stream B
  • GPU buffers pre-allocated — zero runtime allocation during flight
  • No ONNX Runtime dependency — no framework overhead

Memory advantage over ONNX Runtime TRT-EP: ~560-600MB saved (both models combined). Latency advantage: eliminates ONNX wrapper overhead, guaranteed tensor core utilization.

3. CUDA Stream Pipelining (REFINED)

  • Stream A: cuVSLAM VO for current frame (~9ms) + ESKF fusion (~1ms)
  • Stream B: TRT engine inference for satellite matching (LiteSAM or XFeat, async)
  • CPU: GPS_INPUT output loop, NAMED_VALUE_FLOAT, command listener, tile management
  • NEW: Both cuVSLAM and TRT engines use CUDA streams natively — no framework abstraction layer. Direct GPU scheduling.

4-7. (UNCHANGED from draft03)

Keyframe-based satellite matching, TensorRT FP16 optimization, proactive tile loading, 5-10Hz GPS_INPUT output — all unchanged.

Processing Time Budget (per frame, 333ms interval)

Normal Frame (non-keyframe)

Unchanged from draft03 — cuVSLAM dominates at ~22ms total.

Keyframe Satellite Matching (async, CUDA Stream B)

Path A — LiteSAM TRT Engine FP16 at 1280px:

| Step | Time | Notes |
|---|---|---|
| Downsample to 1280px | ~1ms | OpenCV CUDA |
| Load satellite tile | ~1ms | Pre-loaded in RAM |
| Copy input to GPU buffer | <0.5ms | PyCUDA memcpy_htod_async |
| LiteSAM TRT Engine FP16 | ≤200ms | context.execute_async_v3(stream_B) |
| Copy output from GPU | <0.5ms | PyCUDA memcpy_dtoh_async |
| Geometric pose (RANSAC) | ~5ms | Homography |
| ESKF satellite update | ~1ms | Delayed measurement |
| Total | ≤210ms | Async on Stream B |
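A minimal sketch of the final geometric step, under assumptions this draft does not spell out: the RANSAC-estimated homography H maps camera pixels into satellite-tile pixels, and the tile carries a simple linear georeference (`tile_pixel_to_latlon`, `tile_origin`, and `deg_per_px` are hypothetical names for illustration). With an identity H the mapping is trivially checkable:

```python
import numpy as np

def apply_homography(H, pt):
    """Map a pixel through a 3x3 homography (homogeneous divide)."""
    v = H @ np.array([pt[0], pt[1], 1.0])
    return v[:2] / v[2]

def tile_pixel_to_latlon(px, py, tile_origin, deg_per_px):
    """Hypothetical linear georeference: tile pixel (0,0) sits at tile_origin."""
    lat = tile_origin[0] - py * deg_per_px  # image y grows southward
    lon = tile_origin[1] + px * deg_per_px
    return lat, lon

H = np.eye(3)                    # identity homography for the sanity check
center = (640.0, 512.0)          # camera principal point (example values)
tp = apply_homography(H, center)
lat, lon = tile_pixel_to_latlon(tp[0], tp[1],
                                tile_origin=(48.0, 35.0), deg_per_px=1e-5)
assert np.allclose(tp, center)   # identity H maps the point to itself
```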

Path B — XFeat TRT Engine FP16:

| Step | Time | Notes |
|---|---|---|
| XFeat TRT Engine inference | ~50-80ms | context.execute_async_v3(stream_B) |
| Geometric verification (RANSAC) | ~5ms | |
| ESKF satellite update | ~1ms | |
| Total | ~60-90ms | Async on Stream B |

Memory Budget (Jetson Orin Nano Super, 8GB shared)

| Component | Memory (Native TRT) | Memory (ONNX RT TRT-EP) | Notes |
|---|---|---|---|
| OS + runtime | ~1.5GB | ~1.5GB | JetPack 6.2 + Python |
| cuVSLAM | ~200-500MB | ~200-500MB | CUDA library + map |
| LiteSAM TRT engine | ~50-80MB | ~330-360MB | Native TRT vs TRT-EP. If LiteSAM fails: EfficientLoFTR ~100-150MB |
| XFeat TRT engine | ~30-50MB | ~310-330MB | Native TRT vs TRT-EP |
| Preloaded satellite tiles | ~200MB | ~200MB | ±2km of flight plan |
| pymavlink + MAVLink | ~20MB | ~20MB | |
| FastAPI (local IPC) | ~50MB | ~50MB | |
| ESKF + buffers | ~10MB | ~10MB | |
| ONNX Runtime framework | 0MB | ~150MB | Eliminated with native TRT |
| Total | ~2.1-2.9GB | ~2.8-3.6GB | |
| % of 8GB | 26-36% | 35-45% | |
| Savings | ~700MB saved with native TRT | | |

Confidence Scoring → GPS_INPUT Mapping

Unchanged from draft03.

Key Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| LiteSAM MinGRU ops unsupported in TRT 10.3 | LOW-MEDIUM | LiteSAM TRT export fails | Day-one verification: ONNX export → polygraphy → trtexec. If MinGRU fails: (1) rewrite as an unrolled 9-step loop; (2) if that still fails, switch to EfficientLoFTR TRT (proven TRT path, Coarse_LoFTR_TRT, 15.05M params). XFeat TRT as speed fallback. |
| TRT engine build OOM on 8GB Jetson | LOW | Cannot build engines on target device | Our models are small (6.31M LiteSAM, <5M XFeat), so OOM is unlikely. If it occurs: reduce --memPoolSize, or build on an identical Orin Nano module with more headroom. |
| Engine incompatibility after JetPack update | MEDIUM | Must rebuild engines | Include engine rebuild in the JetPack update procedure. Takes minutes per model. |
| MAVSDK cannot send GPS_INPUT | CONFIRMED | Must use pymavlink | Unchanged from draft03 |
| cuVSLAM fails on low-texture terrain | HIGH | Frequent tracking loss | Unchanged from draft03 |
| Thermal throttling | MEDIUM | Satellite matching budget blown | Unchanged from draft03 |
| LiteSAM TRT FP16 >200ms at 1280px | MEDIUM | Must use fallback matcher | Day-one benchmark. Fallback chain: EfficientLoFTR TRT (if ≤300ms) → XFeat TRT (if all >300ms) |
| Google Maps satellite quality in conflict zone | HIGH | Satellite matching fails | Unchanged from draft03 |

Testing Strategy

Integration / Functional Tests

All tests from draft03 unchanged, plus:

  • TRT engine load test: Verify litesam.engine and xfeat.engine load successfully on Jetson Orin Nano Super
  • TRT inference correctness: Compare TRT engine output vs PyTorch reference output (max L1 error < 0.01)
  • CUDA Stream B pipelining: Verify satellite matching on Stream B does not block cuVSLAM on Stream A
  • Engine pre-built validation: Verify engine files from offline preparation work without rebuild at runtime
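The TRT-vs-PyTorch correctness criterion can be a one-function helper. `max_l1_error` and the tensor names below are illustrative; synthetic arrays stand in for the two frameworks' named outputs:

```python
import numpy as np

def max_l1_error(ref_outputs, trt_outputs):
    """Max absolute elementwise difference across all named output tensors."""
    worst = 0.0
    for name, ref in ref_outputs.items():
        diff = np.abs(ref.astype(np.float32) - trt_outputs[name].astype(np.float32))
        worst = max(worst, float(diff.max()))
    return worst

# Synthetic stand-ins for the PyTorch reference and TRT FP16 outputs.
rng = np.random.default_rng(0)
ref = {"scores": rng.standard_normal((1, 100)).astype(np.float32)}
trt = {"scores": ref["scores"] + 0.003}   # small FP16-style deviation
err = max_l1_error(ref, trt)
assert err < 0.01                         # the acceptance threshold from above
```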

Non-Functional Tests

All tests from draft03 unchanged, plus:

  • TRT engine build time: Measure trtexec build time for LiteSAM and XFeat on Orin Nano Super (expected: 1-5 minutes each)
  • TRT engine load time: Measure deserialization time (expected: 1-3 seconds each)
  • Memory comparison: Measure actual GPU memory with native TRT vs ONNX RT TRT-EP for both models
  • MinGRU TRT compatibility (day-one blocker):
    1. Clone LiteSAM repo, load pretrained weights
    2. Reparameterize MobileOne backbone
    3. torch.onnx.export(model, dummy, "litesam.onnx", opset_version=17)
    4. polygraphy inspect model litesam.onnx — check for unsupported ops
    5. trtexec --onnx=litesam.onnx --saveEngine=litesam.engine --fp16
    6. If step 3 or 5 fails on MinGRU: rewrite MinGRU forward() as unrolled loop, retry
    7. If still fails: switch to EfficientLoFTR, apply Coarse_LoFTR_TRT adaptation
    8. Compare TRT output vs PyTorch reference (max L1 error < 0.01)
  • EfficientLoFTR TRT fallback benchmark (if LiteSAM fails): apply TRT adaptation from Coarse_LoFTR_TRT → ONNX → trtexec → measure latency at 640×480 and 1280px
  • Tensor core utilization: Verify with NSight that TRT engines use tensor cores (unlike ONNX RT CUDA EP)

References

  • AC Assessment: _docs/00_research/gps_denied_nav/00_ac_assessment.md
  • Research artifacts (this assessment): _docs/00_research/trt_engine_migration/
  • Previous research: _docs/00_research/gps_denied_nav_v3/
  • Tech stack evaluation: _docs/01_solution/tech_stack.md
  • Security analysis: _docs/01_solution/security_analysis.md