mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-04-22 21:56:38 +00:00
563 lines
40 KiB
Markdown
# Solution Draft

## Assessment Findings

| Old Component Solution | Weak Point (functional/security/performance) | New Solution |
|------------------------|----------------------------------------------|--------------|
| ONNX Runtime as potential inference runtime for AI models | **Performance**: ONNX Runtime CUDA EP on Jetson Orin Nano is 7-8x slower than TRT standalone with default settings (tensor cores not utilized). Even TRT-EP shows up to 3x overhead on some models. | **Use native TRT Engine for all AI models**. Convert PyTorch → ONNX → trtexec → .engine. Load with tensorrt Python module. Eliminates ONNX Runtime dependency entirely. |
| ONNX Runtime TRT-EP memory overhead | **Performance**: ONNX RT TRT-EP keeps serialized engine in memory (~420-440MB vs 130-140MB native TRT). Delta ~280-300MB PER MODEL. On 8GB shared memory, this wastes ~560-600MB for two models. | **Native TRT releases serialized blob after deserialization** → saves ~280-300MB per model. Total savings ~560-600MB — 7% of total memory. Critical given cuVSLAM map growth risk. |
| No explicit TRT engine build step in offline pipeline | **Functional**: Draft03 mentions TRT FP16 but doesn't define the build workflow. When/where are engines built? | **Add TRT engine build to offline preparation pipeline**: After satellite tile download, run trtexec on Jetson to build .engine files. Store alongside tiles. One-time cost per model version. |
| Cross-platform portability via ONNX Runtime | **Functional**: ONNX Runtime's primary value is cross-platform support. Our deployment is Jetson-only — this value is zero. We pay the performance/memory tax for unused portability. | **Drop ONNX Runtime**. Jetson Orin Nano Super is fixed deployment hardware. TRT Engine is the optimal runtime for NVIDIA-only deployment. |
| No DLA offloading considered | **Performance**: Draft03 doesn't mention DLA. Jetson Orin Nano has NO DLA cores — only Orin NX (1-2) and AGX Orin (2) have DLA. | **Confirm: DLA offloading is NOT available on Orin Nano**. All inference must run on GPU (1024 CUDA cores, 16 tensor cores). This makes maximizing GPU efficiency via native TRT even more critical. |
| LiteSAM MinGRU TRT compatibility risk | **Functional**: LiteSAM's subpixel refinement uses 4 stacked MinGRU layers over a 3×3 candidate window (seq_len=9). MinGRU gates depend only on input C_f (not h_{t-1}), so z_t/h̃_t are pre-computable. Ops are standard: Linear, Sigmoid, Mul, Add, ReLU, Tanh. Risk is LOW-MEDIUM — depends on whether implementation uses logcumsumexp (problematic) or simple loop (fine). Seq_len=9 makes this trivially rewritable. | **Day-one verification**: clone LiteSAM repo → torch.onnx.export → polygraphy inspect → trtexec --fp16. If export fails on MinGRU: rewrite forward() as unrolled loop (9 steps). **If LiteSAM cannot be made TRT-compatible: replace with EfficientLoFTR TRT** (proven TRT path via Coarse_LoFTR_TRT, 15.05M params, semi-dense matching). |
| Camera shoots at ~3fps (draft03/04 hard constraint) | **Functional**: ADTI 20L V1 max continuous rate is **2.0 fps** (burst only, buffer-limited). ADTi recommends 1.5s per capture (**0.7 fps sustained**). 3fps is physically impossible. 2fps is not sustainable for multi-hour flights (buffer saturation + mechanical shutter wear). | **Revised to 0.7 fps sustained**. ADTI 20L V1 is the sole navigation camera — used for both cuVSLAM VO and satellite matching. At 70 km/h cruise and 600m altitude: 27.8m inter-frame displacement, ~175px pixel shift, **95.2% frame overlap** — within pyramid-assisted LK optical flow range. ESKF IMU prediction at 5-10Hz bridges 1.43s gaps between frames. Satellite matching triggered on keyframes from the same stream. Viewpro A40 Pro reserved for AI object detection only. |

## UAV Platform

### Airframe Configuration

| Component | Specification | Weight |
|-----------|--------------|--------|
| Airframe | Custom 3.5m S-2 Glass Composite, Eppler 423 airfoil | ~4.5 kg |
| Battery | 2x VANT Semi-Solid State 6S 30Ah (22.2V, 666Wh each) | 5.30 kg (2.65 kg each) |
| Motor | T-Motor AT4125 KV540 (2000W peak, 5.5 kg thrust w/ APC 15x8) | 0.36 kg |
| Propulsion Acc. | ESC, 15x8 Folding Propeller, Servos, Cables | ~0.50 kg |
| Avionics | Pixhawk 6x + GPS | ~0.10 kg |
| Computing | NVIDIA Jetson Orin Nano Super Dev Kit | ~0.30 kg |
| Camera 1 | ADTI 20L V1 APS-C Camera + 16mm Lens | ~0.22 kg |
| Camera 2 | Viewpro A40 Pro (A40TPro) AI Gimbal | ~0.85 kg |
| Misc | Mounts, wiring, connectors | ~0.35 kg |
| **Total AUW** | | **~12.5 kg** |

T-Motor AT4125 KV540 is spec'd for 8-10 kg fixed-wing. At 12.5 kg AUW, static thrust-to-weight is ~0.44. Flyable but margins are tight — weight optimization should be monitored.

### Flight Performance (Max Endurance)

Assumptions: wingspan 3.5m, mean chord ~0.30m, wing area S ~1.05 m², AR ~11.7, Eppler 423 Cl_max ~1.8, cruise altitude 800-1000m (ρ ~1.10 kg/m³).

| Parameter | Value |
|-----------|-------|
| Stall speed (900m altitude) | 10.6 m/s (38 km/h) |
| Min-power speed (theoretical) | 12.0 m/s (43 km/h) — only 13% above stall, impractical |
| **Max endurance cruise (1.3× stall margin)** | **14 m/s (50 km/h)** |
| Best range speed | ~18 m/s (65 km/h) |

| Energy Budget | Value |
|---------------|-------|
| Total battery energy | 1332 Wh (2 × 666 Wh) |
| Usable (80% DoD) | 1066 Wh |
| Climb to 900m (~5 min at 3 m/s) | −57 Wh |
| 10% reserve | ×0.9 |
| **Available for cruise** | **~908 Wh** |

| Power Budget | Value |
|--------------|-------|
| Propulsion (L/D ~15, η_prop 0.65, η_motor 0.80) | ~212 W |
| Electronics (Pixhawk + Jetson + cameras + gimbal + servos) | ~55 W |
| **Total cruise power** | **~267 W** |

| Endurance | Value |
|-----------|-------|
| **Max endurance (at 50 km/h)** | **~3.4 hours** |
| Total mission (incl. climb + reserve) | ~3.5 hours |
| Max range (at 65 km/h best-range speed) | ~209 km |
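
The endurance figure follows directly from the budget tables; a quick arithmetic check, using the power and energy values stated above:

```python
# Endurance arithmetic from the energy and power budgets above.
total_wh = 2 * 666                       # two VANT 6S 30Ah packs
usable_wh = total_wh * 0.80              # 80% depth of discharge
cruise_wh = (usable_wh - 57) * 0.90      # minus climb energy, then 10% reserve
cruise_power_w = 212 + 55                # propulsion + electronics

endurance_h = cruise_wh / cruise_power_w

assert round(usable_wh) == 1066
assert round(cruise_wh) == 908
assert round(endurance_h, 1) == 3.4      # ~3.4 hours at 50 km/h
```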

### Camera 1: ADTI 20L V1 + 16mm Lens

| Spec | Value |
|------|-------|
| Sensor | Sony CMOS APS-C, 23.2 × 15.4 mm |
| Resolution | 5456 × 3632 (20 MP) |
| Focal length | 16 mm |
| Shutter | Mechanical global inter-mirror shutter (ADTI product line) |
| Max continuous fps | **2.0 fps** (spec — burst rate, buffer-limited) |
| Sustained capture rate | **~0.7 fps** (ADTi recommended 1.5s per capture) |
| File formats | JPEG, RAW, RAW+JPEG |
| HDMI video output | 1080p 24p/30p, 1440×1080 30p |
| Weight | 118g body + ~100g lens |
| ISP | Socionext Milbeaut |
| Cooling | Active fan |

**2.0 fps is a burst rate, not sustained.** The 2.0 fps spec is limited by the internal buffer (estimated 3-5 frames). Once the buffer fills, the camera throttles to the write pipeline speed of ~0.7 fps. The bottleneck chain: mechanical shutter actuation (~100-300ms) + ISP processing (demosaic, NR, JPEG compress) + storage write (~5-10 MB/frame JPEG). The 1.5s/capture recommendation guarantees the buffer never fills and accounts for thermal margin over multi-hour flights.

**Mechanical shutter wear at sustained rates:**

| Rate | Actuations per 3.5h flight | Est. flights before 150K shutter life | Est. flights before 500K shutter life |
|------|---------------------------|---------------------------------------|---------------------------------------|
| 2.0 fps (burst, unsustainable) | 25,200 | ~6 | ~20 |
| 1.0 fps | 12,600 | ~12 | ~40 |
| 0.7 fps (recommended) | 8,820 | ~17 | ~57 |

The 20L V1 shutter lifespan is not documented. The higher-end 102PRO is rated at 500K actuations; as an entry-level model, the 20L is likely rated 100K-150K. At 0.7 fps sustained, that implies roughly 11-17 flights per shutter, or up to ~57 if it matches the 102PRO's 500K rating.
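
The wear figures reduce to simple arithmetic, using the flight duration and shutter-life bounds from the table above:

```python
# Shutter actuations per flight and flights-to-failure for each capture rate.
# Flight duration (3.5 h) and shutter-life values follow the tables above.
flight_s = 3.5 * 3600                      # seconds per flight

def flights_before(life_actuations, fps):
    actuations_per_flight = fps * flight_s # one actuation per frame
    return life_actuations / actuations_per_flight

assert round(0.7 * flight_s) == 8820       # actuations per flight at 0.7 fps
assert round(flights_before(150_000, 0.7)) == 17
assert round(flights_before(500_000, 0.7)) == 57
assert round(flights_before(150_000, 2.0)) == 6
```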

**Confirmed operational rate: 0.7 fps (JPEG mode) for both VO and satellite matching.**

### Camera 1: Ground Coverage at Mission Altitude

| Parameter | H = 600 m | H = 800 m | H = 1000 m |
|-----------|-----------|-----------|------------|
| Along-track footprint (15.4mm side) | 577 m | 770 m | 962 m |
| Cross-track footprint (23.2mm side) | 870 m | 1160 m | 1450 m |
| GSD | 15.9 cm/pixel | 21.3 cm/pixel | 26.6 cm/pixel |

### Camera 1: Forward Overlap at 0.7 fps

At 0.7 fps, distance between shots = V / 0.7.

**At 70 km/h (19.4 m/s) — realistic cruise speed:**

| Altitude | Along-track footprint | Shot gap (27.8m) | Forward overlap | Pixel shift |
|----------|-----------------------|------------------|-----------------|-------------|
| 600 m | 577 m | 27.8 m | **95.2%** | ~175 px |
| 800 m | 770 m | 27.8 m | **96.4%** | ~131 px |
| 1000 m | 962 m | 27.8 m | **97.1%** | ~105 px |

**Across speed range (at 600m altitude, 0.7 fps):**

| Speed | Frame gap | Pixel shift | Forward overlap |
|-------|-----------|-------------|-----------------|
| 50 km/h (14 m/s) | 20.0 m | ~126 px | 96.5% |
| 70 km/h (19.4 m/s) | 27.8 m | ~175 px | 95.2% |
| 90 km/h (25 m/s) | 35.7 m | ~224 px | 93.8% |

Even at 90 km/h and the lowest altitude (600m), overlap remains >93%. The 16mm lens on APS-C at these altitudes produces a footprint so large that 0.7 fps provides massive redundancy for both VO and satellite matching.
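
The geometry above reproduces from first principles; a sketch using the sensor and lens values from the spec table (helper names are illustrative):

```python
# Footprint, GSD, forward overlap, and inter-frame pixel shift from the
# camera geometry above (APS-C 23.2x15.4 mm, 5456x3632 px, 16 mm lens).
SENSOR_ALONG_MM, PX_ALONG = 15.4, 3632
FOCAL_MM = 16.0

def along_footprint_m(alt_m):
    return alt_m * SENSOR_ALONG_MM / FOCAL_MM

def gsd_m(alt_m):
    return alt_m * (SENSOR_ALONG_MM / PX_ALONG) / FOCAL_MM

def frame_gap_m(speed_kmh, fps=0.7):
    return speed_kmh / 3.6 / fps

alt, v = 600, 70
gap = frame_gap_m(v)                          # ~27.8 m between shots
overlap = 1 - gap / along_footprint_m(alt)    # forward overlap fraction
px_shift = gap / gsd_m(alt)                   # inter-frame shift in pixels

assert round(along_footprint_m(600)) == 578   # table truncates to 577 m
assert round(overlap * 100, 1) == 95.2
assert round(px_shift) == 175
```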

**For satellite matching specifically** (keyframe-based, every 5-10 camera frames):

| Target overlap | Required gap (600m) | Time between shots (70 km/h) | Capture rate |
|----------------|---------------------|------------------------------|--------------|
| 80% | 115 m | 5.9 s | 0.17 fps |
| 70% | 173 m | 8.9 s | 0.11 fps |
| 60% | 231 m | 11.9 s | 0.084 fps |

Even at the lowest altitude and highest speed, 1 satellite matching keyframe every 6-12 seconds gives 60-80% overlap.

### Camera 2: Viewpro A40 Pro (A40TPro) AI Gimbal

Dual EO/IR gimbal with AI tracking. Reserved for **AI object detection and tracking only** — not used for navigation. Operates independently from the navigation pipeline.

### Camera Role Assignment

| Role | Camera | Rate | Notes |
|------|--------|------|-------|
| Visual Odometry (cuVSLAM) | ADTI 20L V1 + 16mm | 0.7 fps (sustained) | Sole navigation camera. At 70 km/h: 27.8m/~175px displacement at 600m alt. 95%+ overlap. |
| Satellite Image Matching | ADTI 20L V1 + 16mm | Keyframes from VO stream (~every 5-10 frames) | Same image stream as VO. Subset routed to satellite matcher on Stream B. |
| AI Object Detection | Viewpro A40 Pro | Independent | Not part of navigation pipeline. |

### cuVSLAM at 0.7 fps — Feasibility

At 0.7 fps and 70 km/h cruise, inter-frame displacement is 27.8m. In pixel terms:

| Altitude | Displacement | Pixel shift | % of image height | Overlap |
|----------|-------------|-------------|-------------------|---------|
| 600 m | 27.8 m | ~175 px | 4.8% | 95.2% |
| 800 m | 27.8 m | ~131 px | 3.6% | 96.4% |
| 1000 m | 27.8 m | ~105 px | 2.9% | 97.1% |

cuVSLAM uses Lucas-Kanade optical flow with image pyramids. Standard LK handles 30-50px displacements on the base level; with 3-4 pyramid levels, effective search range extends to ~150-200px. At 600m altitude, the 175px shift is within this pyramid-assisted range. At 800-1000m, the shift drops to 105-131px — well within range.

Key factors that make 0.7 fps viable at high altitude:

- **Large footprint**: The 16mm lens on APS-C at 600-1000m produces 577-962m along-track coverage. The aircraft moves only 4-5% of the frame between shots.
- **High texture from altitude**: At 600-1000m, each frame covers a large area with diverse terrain features (roads, field boundaries, structures) even in agricultural regions.
- **IMU bridging**: cuVSLAM's built-in IMU integrator provides pose prediction during the 1.43s gap between frames. ESKF IMU prediction runs at 5-10Hz for continuous GPS_INPUT output.
- **95%+ overlap**: Consecutive frames share >95% content — abundant features for matching.

**Risk**: Over completely uniform terrain (e.g., a single crop field filling the entire 577m+ footprint), feature tracking may still fail. cuVSLAM falls back to IMU-only (~1s acceptable) then constant-velocity (~0.5s) before tracking loss. Satellite matching corrections every 5-10 frames bound accumulated drift.

## Product Solution Description

A real-time GPS-denied visual navigation system for fixed-wing UAVs, running on a Jetson Orin Nano Super (8GB). All AI model inference uses **native TensorRT Engine files** — no ONNX Runtime dependency. The system replaces the GPS module by sending MAVLink GPS_INPUT messages via pymavlink over UART at 5-10Hz.

Position is determined by fusing: (1) CUDA-accelerated visual odometry (cuVSLAM — native CUDA) from ADTI 20L V1 at 0.7 fps sustained, (2) absolute position corrections from satellite image matching (LiteSAM or XFeat — TRT Engine FP16) using keyframes from the same ADTI image stream, and (3) IMU data from the flight controller via ESKF. Viewpro A40 Pro is reserved for AI object detection only.

**Inference runtime decision**: Native TRT Engine over ONNX Runtime because:

1. ONNX RT CUDA EP is 7-8x slower on Orin Nano (tensor core bug)
2. ONNX RT TRT-EP wastes ~280-300MB per model (serialized engine retained in memory)
3. Cross-platform portability has zero value — deployment is Jetson-only
4. Native TRT provides direct CUDA stream control for pipelining with cuVSLAM

**Hard constraint**: ADTI 20L V1 shoots at 0.7 fps sustained (1430ms interval). Full VO+ESKF pipeline within 400ms per frame. Satellite matching async on keyframes (every 5-10 camera frames). GPS_INPUT at 5-10Hz (ESKF IMU prediction fills gaps between camera frames).
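
The GPS_INPUT path can be sketched with pymavlink. Only the unit conversions below are fixed by the MAVLink GPS_INPUT definition (lat/lon in degE7 integers, altitude in meters, NED velocities in m/s); the helper name, default accuracies, satellite count, and serial settings are illustrative assumptions:

```python
# Sketch: scale an ESKF fix into MAVLink GPS_INPUT units. Helper and default
# accuracy values are assumptions, not the project's API.
def gps_input_fields(lat_deg, lon_deg, alt_m, vn, ve, vd, t_usec):
    return dict(
        time_usec=t_usec,
        gps_id=0,
        ignore_flags=0,                  # provide all fields
        time_week_ms=0, time_week=0,     # no GPS time available
        fix_type=3,                      # report a 3D fix
        lat=int(round(lat_deg * 1e7)),   # degE7
        lon=int(round(lon_deg * 1e7)),   # degE7
        alt=alt_m,                       # meters AMSL
        hdop=1.0, vdop=1.0,
        vn=vn, ve=ve, vd=vd,             # NED velocity, m/s
        speed_accuracy=0.5, horiz_accuracy=3.0, vert_accuracy=5.0,
        satellites_visible=12,           # plausible constant for EKF acceptance
    )

def send_fix(master, fields):
    # master = pymavlink.mavutil.mavlink_connection('/dev/ttyTHS1', baud=...)
    master.mav.gps_input_send(**fields)  # runs on the real link at 5-10 Hz

f = gps_input_fields(48.1234567, 37.7654321, 612.0, 19.0, 2.0, -0.1, 1_000_000)
assert f["lat"] == 481234567 and f["lon"] == 377654321
```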

**AI Model Runtime Summary**:

| Model | Runtime | Precision | Memory | Integration |
|-------|---------|-----------|--------|-------------|
| cuVSLAM | Native CUDA (PyCuVSLAM) | N/A (closed-source) | ~200-500MB | CUDA Stream A |
| LiteSAM | TRT Engine | FP16 | ~50-80MB | CUDA Stream B |
| XFeat | TRT Engine | FP16 | ~30-50MB | CUDA Stream B (fallback) |
| ESKF | CPU (Python/C++) | FP64 | ~10MB | CPU thread |

**Offline Preparation Pipeline** (before flight):

1. Download satellite tiles → validate → pre-resize → store (existing)
2. **NEW: Build TRT engines on Jetson** (one-time per model version)
   - `trtexec --onnx=litesam_fp16.onnx --saveEngine=litesam.engine --fp16`
   - `trtexec --onnx=xfeat.onnx --saveEngine=xfeat.engine --fp16`
3. Copy tiles + engines to Jetson storage
4. At startup: load engines + preload tiles into RAM

```
┌─────────────────────────────────────────────────────────────────────┐
│                       OFFLINE (Before Flight)                       │
│  1. Satellite Tiles → Download & Validate → Pre-resize → Store      │
│     (Google Maps)      (≥0.5m/px, <2yr)     (matcher res)  (GeoHash)│
│  2. TRT Engine Build (one-time per model version):                  │
│     PyTorch model → reparameterize → ONNX export → trtexec --fp16   │
│     Output: litesam.engine, xfeat.engine                            │
│  3. Copy tiles + engines to Jetson storage                          │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       ONLINE (During Flight)                        │
│                                                                     │
│  STARTUP:                                                           │
│  1. pymavlink → read GLOBAL_POSITION_INT → init ESKF                │
│  2. Load TRT engines: litesam.engine + xfeat.engine                 │
│     (tensorrt.Runtime → deserialize_cuda_engine → create_context)   │
│  3. Allocate GPU buffers for TRT input/output (PyCUDA)              │
│  4. Start cuVSLAM with ADTI 20L V1 camera stream                    │
│  5. Preload satellite tiles ±2km into RAM                           │
│  6. Begin GPS_INPUT output loop at 5-10Hz                           │
│                                                                     │
│  EVERY CAMERA FRAME (0.7fps sustained from ADTI 20L V1):            │
│  ┌──────────────────────────────────────┐                           │
│  │ ADTI 20L V1 → Downsample (CUDA)      │                           │
│  │ → cuVSLAM VO+IMU (~9ms)              │ ← CUDA Stream A           │
│  │ → ESKF measurement                   │                           │
│  └──────────────────────────────────────┘                           │
│                                                                     │
│  5-10Hz CONTINUOUS (IMU-driven between camera frames):              │
│  ┌──────────────────────────────────────┐                           │
│  │ ESKF IMU prediction → GPS_INPUT send │──→ Flight Controller      │
│  └──────────────────────────────────────┘                           │
│                                                                     │
│  KEYFRAMES (every 5-10 camera frames, async):                       │
│  ┌──────────────────────────────────────┐                           │
│  │ Same ADTI frame → TRT inference (B): │                           │
│  │ context.enqueue_v3(stream_B)         │──→ ESKF correction        │
│  │ LiteSAM FP16 or XFeat FP16           │                           │
│  └──────────────────────────────────────┘                           │
│                                                                     │
│  TELEMETRY (1Hz):                                                   │
│  ┌──────────────────────────────────────┐                           │
│  │ NAMED_VALUE_FLOAT: confidence, drift │──→ Ground Station         │
│  └──────────────────────────────────────┘                           │
└─────────────────────────────────────────────────────────────────────┘
```
## Architecture

### Component: AI Model Inference Runtime

| Solution | Tools | Advantages | Limitations | Performance | Memory | Fit |
|----------|-------|-----------|-------------|------------|--------|-----|
| Native TRT Engine | tensorrt Python + PyCUDA + trtexec | Optimal latency, minimal memory, full tensor core usage, direct CUDA stream control | Hardware-specific engines, manual buffer management, rebuild per TRT version | Optimal | ~50-130MB total (both models) | ✅ Best |
| ONNX Runtime TRT-EP | onnxruntime + TensorRT EP | Auto-fallback for unsupported ops, simpler API, auto engine caching | +280-300MB per model, wrapper overhead, first-run latency spike | Near-parity (claimed), up to 3x slower (observed) | ~640-690MB total (both models) | ❌ Memory overhead unacceptable |
| ONNX Runtime CUDA EP | onnxruntime + CUDA EP | Simplest API, broadest op support | 7-8x slower on Orin Nano (tensor core bug), no TRT optimizations | 7-8x slower | Standard | ❌ Performance unacceptable |
| Torch-TensorRT | torch_tensorrt | AOT compilation, PyTorch-native, handles mixed TRT/PyTorch | Newer on Jetson, requires PyTorch runtime at inference | Near native TRT | PyTorch runtime ~500MB+ | ⚠️ Viable alternative if TRT export fails |

**Selected**: **Native TRT Engine** — optimal performance and memory on our fixed NVIDIA hardware.

**Fallback**: If any model has unsupported TRT ops (e.g., MinGRU in LiteSAM), use **Torch-TensorRT** for that specific model. Torch-TensorRT handles mixed TRT/PyTorch execution but requires PyTorch runtime in memory.

### Component: TRT Engine Conversion Workflow

**LiteSAM conversion**:

1. Load PyTorch model with trained weights
2. Reparameterize MobileOne backbone (collapse multi-branch → single Conv2d+BN)
3. Export to ONNX: `torch.onnx.export(model, dummy_input, "litesam.onnx", opset_version=17)`
4. Verify with polygraphy: `polygraphy inspect model litesam.onnx`
5. Build engine on Jetson: `trtexec --onnx=litesam.onnx --saveEngine=litesam.engine --fp16 --memPoolSize=workspace:2048`
6. Verify engine: `trtexec --loadEngine=litesam.engine --fp16`

**XFeat conversion**:

1. Load PyTorch model
2. Export to ONNX: `torch.onnx.export(model, dummy_input, "xfeat.onnx", opset_version=17)`
3. Build engine on Jetson: `trtexec --onnx=xfeat.onnx --saveEngine=xfeat.engine --fp16`
4. Alternative: use XFeatTensorRT C++ implementation directly

**INT8 quantization strategy** (optional, future optimization):

- MobileOne backbone (CNN): INT8 safe with calibration data
- TAIFormer (transformer attention): FP16 only — INT8 degrades accuracy
- XFeat: evaluate INT8 on actual UAV-satellite pairs before deploying
- Use nvidia-modelopt for calibration: `from modelopt.onnx.quantization import quantize`

### Component: TRT Python Inference Wrapper

Minimal wrapper class for TRT engine inference:

```python
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda


class TRTInference:
    def __init__(self, engine_path, stream):
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.runtime = trt.Runtime(self.logger)
        with open(engine_path, 'rb') as f:
            self.engine = self.runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        self.stream = stream
        self._allocate_buffers()

    def _allocate_buffers(self):
        # Pre-allocate one device buffer per I/O tensor (TRT 10 tensor API):
        # zero runtime allocation during flight.
        self.inputs = {}
        self.outputs = {}
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            shape = self.engine.get_tensor_shape(name)
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            size = trt.volume(shape)
            device_mem = cuda.mem_alloc(size * np.dtype(dtype).itemsize)
            self.context.set_tensor_address(name, int(device_mem))
            if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
                self.inputs[name] = (device_mem, shape, dtype)
            else:
                self.outputs[name] = (device_mem, shape, dtype)

    def infer_async(self, input_data):
        # Copy inputs and enqueue inference on our stream without blocking.
        for name, data in input_data.items():
            cuda.memcpy_htod_async(self.inputs[name][0], data, self.stream)
        # execute_async_v3 is the Python binding of the C++ enqueueV3 call.
        self.context.execute_async_v3(self.stream.handle)

    def get_output(self):
        # Queue all device-to-host copies, then synchronize the stream once.
        results = {}
        for name, (dev_mem, shape, dtype) in self.outputs.items():
            host_mem = np.empty(shape, dtype=dtype)
            cuda.memcpy_dtoh_async(host_mem, dev_mem, self.stream)
            results[name] = host_mem
        self.stream.synchronize()
        return results
```

Key design: `infer_async()` + `get_output()` split enables pipelining with cuVSLAM on Stream A while satellite matching runs on Stream B.
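
The split maps onto the frame loop like this; a pure-CPU sketch with a thread pool standing in for CUDA Stream B (timings, function names, and the every-5-frames cadence are illustrative):

```python
# VO runs every frame; satellite matching is submitted asynchronously on
# keyframes and harvested later, so it never blocks the VO loop.
from concurrent.futures import ThreadPoolExecutor
import time

def vo_update(frame_id):          # Stream A stand-in (fast path)
    time.sleep(0.03)              # pretend ~30ms frame work
    return f"pose_{frame_id}"

def satellite_match(frame_id):    # Stream B stand-in (slow, async)
    time.sleep(0.05)              # pretend ~50ms TRT inference
    return f"abs_fix_{frame_id}"

poses, corrections = [], []
pending = None
with ThreadPoolExecutor(max_workers=1) as stream_b:
    for frame_id in range(20):
        poses.append(vo_update(frame_id))          # never blocked by matching
        if pending is not None and pending.done(): # harvest finished match
            corrections.append(pending.result())
            pending = None
        if frame_id % 5 == 0 and pending is None:  # keyframe cadence
            pending = stream_b.submit(satellite_match, frame_id)
    if pending is not None:                        # drain last keyframe
        corrections.append(pending.result())
```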

### Component: Visual Odometry (UPDATED — camera rate corrected)

cuVSLAM — native CUDA library. Fed by **ADTI 20L V1 at 0.7 fps sustained** (previously assumed 3fps which exceeds camera hardware limit; 2.0 fps spec is burst-only, not sustainable). At 70 km/h cruise the inter-frame displacement is 27.8m — at 600m altitude this translates to ~175px (4.8% of frame), within pyramid-assisted LK optical flow range. At 800-1000m altitude the pixel shift drops to 105-131px. 95%+ frame overlap ensures abundant features for matching. ESKF IMU prediction at 5-10Hz fills the position output between sparse camera frames.

### Component: Satellite Image Matching (UPDATED runtime + fallback chain)

| Solution | Tools | Advantages | Limitations | Performance (est. Orin Nano Super TRT FP16) | Params | Fit |
|----------|-------|-----------|-------------|----------------------------------------------|--------|-----|
| LiteSAM (opt) TRT Engine FP16 @ 1280px | trtexec + tensorrt Python | Best satellite-aerial accuracy (RMSE@30=17.86m UAV-VisLoc), 6.31M params, smallest model | MinGRU TRT export needs verification (LOW-MEDIUM risk) | Est. ~165-330ms | 6.31M | ✅ Primary (if TRT export succeeds AND ≤200ms) |
| EfficientLoFTR TRT Engine FP16 | trtexec + tensorrt Python | Proven TRT path (Coarse_LoFTR_TRT repo, 138 stars). Semi-dense. CVPR 2024. High accuracy. | 2.4x more params than LiteSAM. Requires einsum→elementary ops rewrite for TRT (documented in Coarse_LoFTR_TRT paper). | Est. ~200-400ms | 15.05M | ✅ Fallback if LiteSAM TRT fails |
| XFeat TRT Engine FP16 | trtexec + tensorrt Python (or XFeatTensorRT C++) | Fastest. Proven TRT implementation. Lightweight. | General-purpose, not designed for cross-view satellite-aerial gap (but nadir-nadir gap is small). | Est. ~50-100ms | <5M | ✅ Speed fallback |

**Decision tree (day-one on Orin Nano Super)**:

1. Clone LiteSAM repo → reparameterize MobileOne → `torch.onnx.export()` → `polygraphy inspect`
2. If ONNX export succeeds → `trtexec --onnx=litesam.onnx --saveEngine=litesam.engine --fp16`
3. If MinGRU causes ONNX/TRT failure → rewrite MinGRU forward() as unrolled 9-step loop → retry
4. If rewrite fails or accuracy degrades → **switch to EfficientLoFTR TRT**:
   - Apply Coarse_LoFTR_TRT TRT-adaptation techniques (einsum replacement, etc.)
   - Export to ONNX → trtexec --fp16
   - Benchmark at 640×480 and 1280px
5. Benchmark winner: **if ≤200ms → use it. If >200ms but ≤300ms → acceptable (async on Stream B). If >300ms → use XFeat TRT**
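
The unrolled-loop rewrite in step 3 relies on MinGRU's gates depending only on the input. A numpy sketch (weight shapes and random values are illustrative) shows the 9-step loop and checks it against the closed form that a parallel-scan/logcumsumexp implementation computes:

```python
# MinGRU with input-only gates: z_t = sigmoid(W_z x_t), h~_t = W_h x_t,
# h_t = (1 - z_t) * h_{t-1} + z_t * h~_t. With seq_len=9 (3x3 window) the
# recurrence unrolls into 9 steps of Linear/Sigmoid/Mul/Add -- all standard
# TRT layers, no scan op needed.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_in, d_h = 9, 32, 32
W_z = rng.standard_normal((d_in, d_h)) * 0.1
W_h = rng.standard_normal((d_in, d_h)) * 0.1
x = rng.standard_normal((seq_len, d_in))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Gates are precomputable from the input alone (no dependence on h_{t-1}).
z = sigmoid(x @ W_z)                     # (9, d_h)
h_tilde = x @ W_h                        # (9, d_h)

h = np.zeros(d_h)
for t in range(seq_len):                 # unrolled 9-step recurrence
    h = (1.0 - z[t]) * h + z[t] * h_tilde[t]

# Closed form of the same recurrence (what the scan path computes):
# h_T = sum_t z_t * h~_t * prod_{s>t} (1 - z_s), with h_0 = 0.
rev_cum = np.cumprod((1.0 - z)[::-1], axis=0)[::-1]   # prod_{s>=t}(1 - z_s)
decay = np.vstack([rev_cum[1:], np.ones((1, d_h))])   # prod over s > t
h_closed = (z * h_tilde * decay).sum(axis=0)
assert np.allclose(h, h_closed)
```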

**EfficientLoFTR TRT adaptation** (from Coarse_LoFTR_TRT paper, proven workflow):

- Replace `torch.einsum()` with elementary ops (view, bmm, reshape, sum)
- Replace any TRT-incompatible high-level PyTorch functions
- Use ONNX export path (less memory required than Torch-TensorRT on 8GB device)
- Knowledge distillation available for further parameter reduction if needed

**Satellite matching cadence**: Keyframes selected from the ADTI VO stream every 5-10 frames (~7-14s at the 0.7 fps sustained rate). At 800-1000m altitude and 14 m/s cruise, this yields 60-97% forward overlap between satellite match frames. Matching runs async on Stream B — does not block VO on Stream A.

### Component: Sensor Fusion (UNCHANGED)

ESKF — CPU-based mathematical filter, not affected.

### Component: Flight Controller Integration (UNCHANGED)

pymavlink — not affected by TRT migration.

### Component: Ground Station Telemetry (UNCHANGED)

MAVLink NAMED_VALUE_FLOAT — not affected.

### Component: Startup & Lifecycle (UPDATED)

**Updated startup sequence**:

1. Boot Jetson → start GPS-Denied service (systemd)
2. Connect to flight controller via pymavlink on UART
3. Wait for heartbeat from flight controller
4. **Initialize PyCUDA context**
5. **Load TRT engines**: litesam.engine + xfeat.engine via tensorrt.Runtime.deserialize_cuda_engine()
6. **Allocate GPU I/O buffers** for both models
7. **Create CUDA streams**: Stream A (cuVSLAM), Stream B (satellite matching)
8. Read GLOBAL_POSITION_INT → init ESKF
9. Start cuVSLAM with ADTI 20L V1 camera frames
10. Begin GPS_INPUT output loop at 5-10Hz
11. Preload satellite tiles within ±2km into RAM
12. System ready

**Engine load time**: ~1-3 seconds per engine (deserialization from .engine file). One-time cost at startup.

### Component: Thermal Management (UNCHANGED)

Same adaptive pipeline. TRT engines are slightly more power-efficient than ONNX Runtime, but the difference is within noise.

### Component: Object Localization (UNCHANGED)

Not affected — trigonometric calculation, no AI inference.

## Speed Optimization Techniques

### 1. cuVSLAM for Visual Odometry (~9ms/frame)

Fed by ADTI 20L V1 at 0.7 fps sustained. At 70 km/h cruise and 600m altitude, inter-frame displacement is 27.8m (~175px, 4.8% of frame). With pyramid-based LK optical flow (3-4 levels), effective search range is ~150-200px — 175px is within range. At 800-1000m altitude, pixel shift drops to 105-131px. 95%+ overlap between consecutive frames.

### 2. Native TRT Engine Inference (NEW)

All AI models run as pre-compiled TRT FP16 engines:

- Engine files built offline with trtexec (one-time per model version)
- Loaded at startup (~1-3s per engine)
- Inference via context.enqueue_v3() on dedicated CUDA Stream B
- GPU buffers pre-allocated — zero runtime allocation during flight
- No ONNX Runtime dependency — no framework overhead

Memory advantage over ONNX Runtime TRT-EP: ~560-600MB saved (both models combined). Latency advantage: eliminates ONNX wrapper overhead, guaranteed tensor core utilization.

### 3. CUDA Stream Pipelining (REFINED)

- Stream A: cuVSLAM VO from ADTI 20L V1 (~9ms) + ESKF fusion (~1ms)
- Stream B: TRT engine inference for satellite matching (LiteSAM or XFeat, async, triggered on keyframe from same ADTI stream)
- CPU: GPS_INPUT output loop, NAMED_VALUE_FLOAT, command listener, tile management
- **NEW**: Both cuVSLAM and TRT engines use CUDA streams natively — no framework abstraction layer. Direct GPU scheduling.

### 4-7. (UNCHANGED from draft03)

Keyframe-based satellite matching, TensorRT FP16 optimization, proactive tile loading, 5-10Hz GPS_INPUT output — all unchanged.

## Processing Time Budget

### VO Frame (every ~1430ms from ADTI 20L V1 at 0.7 fps)

| Step | Time | Notes |
|------|------|-------|
| ADTI image transfer | ~5-10ms | Trigger + readout |
| Downsample (CUDA) | ~2ms | To cuVSLAM input resolution |
| cuVSLAM VO+IMU | ~9ms | CUDA Stream A |
| ESKF measurement update | ~1ms | CPU |
| **Total** | **~17-22ms** | Well within 1430ms budget |

Between camera frames, ESKF IMU prediction runs at 5-10Hz to maintain continuous GPS_INPUT output. The ~1.4s gap between frames is bridged entirely by IMU integration.
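
A minimal sketch of that bridging, with an illustrative constant IMU acceleration (the real ESKF propagates full attitude/bias state; numbers here only echo the 19.4 m/s cruise and 27.8 m inter-frame displacement from the tables above):

```python
# Dead-reckon position at 10 Hz across a ~1.4s camera gap, then compare the
# prediction against the VO-measured displacement at the next frame.
dt, steps = 0.1, 14                 # 10 Hz prediction over the frame gap
p, v = 0.0, 19.4                    # position (m), velocity (m/s) at gap start
a = 0.2                             # assumed constant IMU acceleration (m/s^2)
predictions = []
for _ in range(steps):              # each step feeds one GPS_INPUT message
    p += v * dt + 0.5 * a * dt**2
    v += a * dt
    predictions.append(p)

# Next camera frame: VO measures the actual displacement; the ESKF update
# absorbs the residual (shown here as a simple difference for illustration).
vo_measured_p = 27.8
residual = vo_measured_p - p        # error accumulated over the gap
```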

### Keyframe Satellite Matching (every 5-10 camera frames, async CUDA Stream B)

**Path A — LiteSAM TRT Engine FP16 at 1280px**:

| Step | Time | Notes |
|------|------|-------|
| Image already in GPU (from VO) | ~0ms | Same frame used for VO and matching |
| Load satellite tile | ~1ms | Pre-loaded in RAM |
| Copy input to GPU buffer | <0.5ms | PyCUDA memcpy_htod_async |
| LiteSAM TRT Engine FP16 | ≤200ms | context.enqueue_v3(stream_B) |
| Copy output from GPU | <0.5ms | PyCUDA memcpy_dtoh_async |
| Geometric pose (RANSAC) | ~5ms | Homography |
| ESKF satellite update | ~1ms | Delayed measurement |
| **Total** | **≤210ms** | Async on Stream B, does not block VO |

**Path B — XFeat TRT Engine FP16**:

| Step | Time | Notes |
|------|------|-------|
| XFeat TRT Engine inference | ~50-80ms | context.enqueue_v3(stream_B) |
| Geometric verification (RANSAC) | ~5ms | |
| ESKF satellite update | ~1ms | |
| **Total** | **~60-90ms** | Async on Stream B |

## Memory Budget (Jetson Orin Nano Super, 8GB shared)

| Component | Memory (Native TRT) | Memory (ONNX RT TRT-EP) | Notes |
|-----------|---------------------|--------------------------|-------|
| OS + runtime | ~1.5GB | ~1.5GB | JetPack 6.2 + Python |
| cuVSLAM | ~200-500MB | ~200-500MB | CUDA library + map |
| **LiteSAM TRT engine** | **~50-80MB** | **~330-360MB** | Native TRT vs TRT-EP. If LiteSAM fails: EfficientLoFTR ~100-150MB |
| **XFeat TRT engine** | **~30-50MB** | **~310-330MB** | Native TRT vs TRT-EP |
| Preloaded satellite tiles | ~200MB | ~200MB | ±2km of flight plan |
| pymavlink + MAVLink | ~20MB | ~20MB | |
| FastAPI (local IPC) | ~50MB | ~50MB | |
| ESKF + buffers | ~10MB | ~10MB | |
| ONNX Runtime framework | **0MB** | **~150MB** | Eliminated with native TRT |
| **Total** | **~2.1-2.9GB** | **~2.8-3.6GB** | |
| **% of 8GB** | **26-36%** | **35-45%** | |
| **Savings** | — | — | **~700MB saved with native TRT** |

## Confidence Scoring → GPS_INPUT Mapping

Unchanged from draft03.
|
||
|
||
## Key Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| **LiteSAM MinGRU ops unsupported in TRT 10.3** | LOW-MEDIUM | LiteSAM TRT export fails | Day-one verification: ONNX export → polygraphy → trtexec. If MinGRU fails: (1) rewrite as unrolled 9-step loop, (2) if still fails: **switch to EfficientLoFTR TRT** (proven TRT path, Coarse_LoFTR_TRT, 15.05M params). XFeat TRT as speed fallback. |
| **TRT engine build OOM on 8GB Jetson** | LOW | Cannot build engines on target device | Our models are small (6.31M LiteSAM, <5M XFeat). OOM unlikely. If it occurs: reduce `--memPoolSize`, or build on an identical Orin Nano module with more headroom |
| **Engine incompatibility after JetPack update** | MEDIUM | Must rebuild engines | Include engine rebuild in the JetPack update procedure. Takes minutes per model. |
| **MAVSDK cannot send GPS_INPUT** | CONFIRMED | Must use pymavlink | Unchanged from draft03 |
| **cuVSLAM fails on low-texture terrain** | HIGH | Frequent tracking loss | ADTI at 0.7 fps means 27.8m inter-frame displacement at 70 km/h. At 600m+ altitude, pixel shift is 105-175px with 95%+ overlap — within pyramid-assisted LK range. HIGH risk remains over completely uniform terrain (single crop covering 577m+ footprint). IMU bridging + satellite matching corrections bound drift. |
| **Thermal throttling** | MEDIUM | Satellite matching budget blown | Unchanged from draft03 |
| LiteSAM TRT FP16 >200ms at 1280px | MEDIUM | Must use fallback matcher | Day-one benchmark. Fallback chain: EfficientLoFTR TRT (if ≤300ms) → XFeat TRT (if all >300ms) |
| Google Maps satellite quality in conflict zone | HIGH | Satellite matching fails | Unchanged from draft03 |
| **AUW exceeds AT4125 recommended range** | MEDIUM | Reduced endurance, motor thermal stress | 12.5 kg AUW vs 8-10 kg recommended. Monitor motor temps. Consider weight reduction (lighter gimbal, single battery for shorter missions). |
| **cuVSLAM at 0.7 fps — inter-frame displacement** | MEDIUM | VO tracking loss on uniform terrain | At 0.7 fps and 70 km/h: ~175px displacement at 600m (4.8% of frame, 95.2% overlap). Within pyramid-assisted LK range (150-200px). At 800m+: drops to 105-131px. Mitigations: (1) cuVSLAM IMU integrator bridges 1.43s frame gaps, (2) ESKF IMU prediction at 5-10Hz fills position gaps, (3) satellite matching corrections every 5-10 frames bound drift. |
| **ADTI mechanical shutter lifespan** | MEDIUM | Shutter replacement needed periodically | At 0.7 fps sustained over 3.5h flights: ~8,800 actuations/flight. Shutter life unknown for the 20L (102PRO is 500K, entry-level likely 100-150K). Estimated 11-57 flights before replacement. Budget for shutter replacement as a consumable. |
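The displacement and overlap figures in the cuVSLAM rows follow from simple camera geometry. In the sketch below, the ~0.159 m/px GSD at 600 m and the 3648-px along-track frame dimension are assumptions chosen to be consistent with the quoted 175 px / 95.2% numbers; neither value is stated in this section:

```python
# Reproduce the inter-frame displacement figures from the risk table.
# ASSUMED (not stated here): GSD ~0.159 m/px at 600 m AGL and a 3648-px
# along-track frame dimension, chosen to match the quoted 175 px shift.
speed_ms = 70 / 3.6                 # 70 km/h ground speed -> ~19.4 m/s
fps = 0.7                           # ADTI 20L V1 sustained capture rate
displacement_m = speed_ms / fps     # ~27.8 m travelled between frames
frame_gap_s = 1 / fps               # ~1.43 s gap bridged by the IMU

gsd_m_per_px = 0.159                # assumed GSD at 600 m
shift_px = displacement_m / gsd_m_per_px   # ~175 px apparent image shift
frame_px = 3648                     # assumed along-track frame size
overlap = 1 - shift_px / frame_px   # ~0.952 -> 95.2% frame overlap
```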
## Testing Strategy

### Integration / Functional Tests

All tests from draft03 unchanged, plus:

- **TRT engine load test**: Verify litesam.engine and xfeat.engine load successfully on Jetson Orin Nano Super
- **TRT inference correctness**: Compare TRT engine output vs PyTorch reference output (max L1 error < 0.01)
- **CUDA Stream B pipelining**: Verify satellite matching on Stream B does not block cuVSLAM on Stream A
- **Engine pre-built validation**: Verify engine files from offline preparation work without rebuild at runtime
- **ADTI 20L V1 sustained capture rate**: Verify the camera sustains 0.7 fps in JPEG mode over extended periods (>30 min) without buffer overflow or overheating. Also test 1.0 fps to determine whether a higher sustained rate is achievable.
- **ADTI trigger timing**: Verify the camera trigger and image transfer pipeline delivers frames to cuVSLAM within acceptable latency (<50ms from trigger to GPU buffer)
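The TRT inference correctness test reduces to a worst-case elementwise error check between the engine output and the PyTorch reference. A minimal sketch on flattened float outputs; `outputs_match` is a hypothetical helper name, not an existing API:

```python
def outputs_match(trt_out, ref_out, max_l1=0.01):
    """Hypothetical acceptance check: worst-case elementwise absolute
    error between flattened TRT and PyTorch reference outputs."""
    if len(trt_out) != len(ref_out):
        return False  # shape mismatch is an automatic failure
    err = max(abs(a - b) for a, b in zip(trt_out, ref_out))
    return err < max_l1
```

In practice both outputs would be copied to host memory and flattened (e.g. `tensor.flatten().tolist()`) before the comparison.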
### Non-Functional Tests

All tests from draft03 unchanged, plus:

- **TRT engine build time**: Measure trtexec build time for LiteSAM and XFeat on Orin Nano Super (expected: 1-5 minutes each)
- **TRT engine load time**: Measure deserialization time (expected: 1-3 seconds each)
- **Memory comparison**: Measure actual GPU memory with native TRT vs ONNX RT TRT-EP for both models
- **MinGRU TRT compatibility** (day-one blocker):
  1. Clone the LiteSAM repo, load pretrained weights
  2. Reparameterize the MobileOne backbone
  3. `torch.onnx.export(model, dummy, "litesam.onnx", opset_version=17)`
  4. `polygraphy inspect model litesam.onnx` — check for unsupported ops
  5. `trtexec --onnx=litesam.onnx --saveEngine=litesam.engine --fp16`
  6. If step 3 or 5 fails on MinGRU: rewrite MinGRU forward() as an unrolled loop, retry
  7. If it still fails: switch to EfficientLoFTR, apply the Coarse_LoFTR_TRT adaptation
  8. Compare TRT output vs PyTorch reference (max L1 error < 0.01)
- **EfficientLoFTR TRT fallback benchmark** (if LiteSAM fails): apply the TRT adaptation from Coarse_LoFTR_TRT → ONNX → trtexec → measure latency at 640×480 and 1280px
- **Tensor core utilization**: Verify with NSight that TRT engines use tensor cores (unlike ONNX RT CUDA EP)
- **Flight endurance validation**: Ground-test full system power draw (propulsion + electronics) against the 267W estimate. Verify the ~3.4h endurance target.
- **cuVSLAM at 0.7 fps**: Benchmark VO tracking quality, drift rate, and tracking-loss frequency at 0.7 fps with the ADTI 20L V1. Measure IMU integrator effectiveness for bridging 1.43s inter-frame gaps. Test at 600m and 800m altitude, over both textured and low-texture terrain.
- **ADTI shutter durability**: Track shutter actuation count across flights. Monitor for shutter failure symptoms (missed frames, inconsistent exposure).
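The flight counts behind the shutter-durability item come from straightforward actuation arithmetic; the 100K entry-level shutter rating used below is the estimate from the risk table, not a published figure for the 20L:

```python
# Shutter-life estimate behind the shutter-durability test item.
flight_hours = 3.5                  # sustained mission duration
fps = 0.7                           # ADTI 20L V1 capture rate
actuations = flight_hours * 3600 * fps   # ~8,820 actuations per flight

# Rated shutter life for the 20L is unpublished; bracket with the
# estimated entry-level 100K rating and the 102PRO's published 500K.
flights_low = 100_000 / actuations       # ~11 flights before replacement
flights_high = 500_000 / actuations      # ~57 flights before replacement
```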
## References

- ONNX Runtime Issue #24085 (Jetson Orin Nano tensor core bug): https://github.com/microsoft/onnxruntime/issues/24085
- ONNX Runtime Issue #20457 (TRT-EP memory overhead): https://github.com/microsoft/onnxruntime/issues/20457
- ONNX Runtime Issue #12083 (TRT-EP vs native TRT): https://github.com/microsoft/onnxruntime/issues/12083
- NVIDIA TensorRT 10 Python API: https://docs.nvidia.com/deeplearning/tensorrt/10.15.1/inference-library/python-api-docs.html
- TensorRT Best Practices: https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html
- TensorRT engine hardware specificity: https://github.com/NVIDIA/TensorRT/issues/1920
- trtexec ONNX conversion: https://nvidia-jetson.piveral.com/jetson-orin-nano/how-to-convert-onnx-to-engine-on-jetson-orin-nano-dev-board/
- Torch-TensorRT JetPack 6.2: https://docs.pytorch.org/TensorRT/v2.10.0/getting_started/jetpack.html
- XFeatTensorRT: https://github.com/PranavNedunghat/XFeatTensorRT
- JetPack 6.2 Release Notes: https://docs.nvidia.com/jetson/archives/jetpack-archived/jetpack-62/release-notes/index.html
- Jetson Orin Nano Super: https://developer.nvidia.com/blog/nvidia-jetson-orin-nano-developer-kit-gets-a-super-boost/
- DLA on Jetson Orin: https://developer.nvidia.com/blog/maximizing-deep-learning-performance-on-nvidia-jetson-orin-with-dla/
- EfficientLoFTR (CVPR 2024): https://github.com/zju3dv/EfficientLoFTR
- EfficientLoFTR HuggingFace: https://huggingface.co/docs/transformers/en/model_doc/efficientloftr
- Coarse_LoFTR_TRT (TRT for embedded): https://github.com/Kolkir/Coarse_LoFTR_TRT
- Coarse_LoFTR_TRT paper: https://ar5iv.labs.arxiv.org/html/2202.00770
- LoFTR_TRT: https://github.com/Kolkir/LoFTR_TRT
- minGRU ("Were RNNs All We Needed?"): https://huggingface.co/papers/2410.01201
- minGRU PyTorch implementation: https://github.com/lucidrains/minGRU-pytorch
- LiteSAM paper (MinGRU details, Eqs 12-16): https://www.mdpi.com/2072-4292/17/19/3349
- DALGlue (UAV feature matching, 2025): https://www.nature.com/articles/s41598-025-21602-5
- ADTI 20L V1 specs: https://unmannedrc.com/products/adti-20l-v1-mapping-camera
- ADTI 20L V1 user manual: https://docs.adti.camera/adti-20l-and-24l-v1-quick-start-guide/
- T-Motor AT4125 KV540: https://uav-en.tmotor.com/2019/Motors_0429/247.html
- VANT Semi-Solid State 6S 30Ah battery: https://www.xtbattery.com/370wh/kg-42v-high-energy-density-6s-12s-14s-18s-30ah-semi-solid-state-drone-battery/
- All references from solution_draft03.md
## Related Artifacts

- AC Assessment: `_docs/00_research/gps_denied_nav/00_ac_assessment.md`
- Research artifacts (this assessment): `_docs/00_research/trt_engine_migration/`
- Previous research: `_docs/00_research/gps_denied_nav_v3/`
- Tech stack evaluation: `_docs/01_solution/tech_stack.md`
- Security analysis: `_docs/01_solution/security_analysis.md`
- Previous draft: `_docs/01_solution/solution_draft04.md`