Mirror of https://github.com/azaion/gps-denied-onboard.git (synced 2026-04-22 21:56:38 +00:00)
# Solution Draft

## Assessment Findings

| Old Component Solution | Weak Point (functional/security/performance) | New Solution |
|------------------------|----------------------------------------------|-------------|
| ONNX Runtime as potential inference runtime for AI models | **Performance**: ONNX Runtime CUDA EP on Jetson Orin Nano is 7-8x slower than TRT standalone with default settings (tensor cores not utilized). Even TRT-EP shows up to 3x overhead on some models. | **Use native TRT Engine for all AI models**. Convert PyTorch → ONNX → trtexec → .engine. Load with the tensorrt Python module. Eliminates the ONNX Runtime dependency entirely. |
| ONNX Runtime TRT-EP memory overhead | **Performance**: ONNX RT TRT-EP keeps the serialized engine in memory (~420-440MB vs 130-140MB native TRT). Delta ~280-300MB PER MODEL. On 8GB shared memory, this wastes ~560-600MB for two models. | **Native TRT releases the serialized blob after deserialization** → saves ~280-300MB per model. Total savings ~560-600MB — 7% of total memory. Critical given the cuVSLAM map growth risk. |
| No explicit TRT engine build step in offline pipeline | **Functional**: Draft03 mentions TRT FP16 but doesn't define the build workflow. When/where are engines built? | **Add TRT engine build to the offline preparation pipeline**: After satellite tile download, run trtexec on the Jetson to build .engine files. Store alongside tiles. One-time cost per model version. |
| Cross-platform portability via ONNX Runtime | **Functional**: ONNX Runtime's primary value is cross-platform support. Our deployment is Jetson-only — this value is zero. We pay the performance/memory tax for unused portability. | **Drop ONNX Runtime**. Jetson Orin Nano Super is fixed deployment hardware. TRT Engine is the optimal runtime for NVIDIA-only deployment. |
| No DLA offloading considered | **Performance**: Draft03 doesn't mention DLA. Jetson Orin Nano has NO DLA cores — only Orin NX (1-2) and AGX Orin (2) have DLA. | **Confirm: DLA offloading is NOT available on Orin Nano**. All inference must run on the GPU (1024 CUDA cores, 16 tensor cores). This makes maximizing GPU efficiency via native TRT even more critical. |
| LiteSAM MinGRU TRT compatibility risk | **Functional**: LiteSAM's subpixel refinement uses 4 stacked MinGRU layers over a 3×3 candidate window (seq_len=9). MinGRU gates depend only on input C_f (not h_{t-1}), so z_t/h̃_t are pre-computable. Ops are standard: Linear, Sigmoid, Mul, Add, ReLU, Tanh. Risk is LOW-MEDIUM — depends on whether the implementation uses logcumsumexp (problematic) or a simple loop (fine). Seq_len=9 makes this trivially rewritable. | **Day-one verification**: clone LiteSAM repo → torch.onnx.export → polygraphy inspect → trtexec --fp16. If export fails on MinGRU: rewrite forward() as an unrolled loop (9 steps). **If LiteSAM cannot be made TRT-compatible: replace with EfficientLoFTR TRT** (proven TRT path via Coarse_LoFTR_TRT, 15.05M params, semi-dense matching). |

## Product Solution Description

A real-time GPS-denied visual navigation system for fixed-wing UAVs, running on a Jetson Orin Nano Super (8GB). All AI model inference uses **native TensorRT Engine files** — no ONNX Runtime dependency. The system replaces the GPS module by sending MAVLink GPS_INPUT messages via pymavlink over UART at 5-10Hz.

Position is determined by fusing (1) CUDA-accelerated visual odometry (cuVSLAM — native CUDA), (2) absolute position corrections from satellite image matching (LiteSAM or XFeat — TRT Engine FP16), and (3) IMU data from the flight controller, combined in an ESKF.

**Inference runtime decision**: Native TRT Engine over ONNX Runtime because:

1. ONNX RT CUDA EP is 7-8x slower on Orin Nano (tensor core bug)
2. ONNX RT TRT-EP wastes ~280-300MB per model (serialized engine retained in memory)
3. Cross-platform portability has zero value — deployment is Jetson-only
4. Native TRT provides direct CUDA stream control for pipelining with cuVSLAM

**Hard constraint**: the camera captures frames at ~3fps (333ms interval); the full VO+ESKF pipeline must complete within 400ms; GPS_INPUT is sent at 5-10Hz.

**AI Model Runtime Summary**:

| Model | Runtime | Precision | Memory | Integration |
|-------|---------|-----------|--------|-------------|
| cuVSLAM | Native CUDA (PyCuVSLAM) | N/A (closed-source) | ~200-500MB | CUDA Stream A |
| LiteSAM | TRT Engine | FP16 | ~50-80MB | CUDA Stream B |
| XFeat | TRT Engine | FP16 | ~30-50MB | CUDA Stream B (fallback) |
| ESKF | CPU (Python/C++) | FP64 | ~10MB | CPU thread |

**Offline Preparation Pipeline** (before flight):

1. Download satellite tiles → validate → pre-resize → store (existing)
2. **NEW: Build TRT engines on Jetson** (one-time per model version)
   - `trtexec --onnx=litesam_fp16.onnx --saveEngine=litesam.engine --fp16`
   - `trtexec --onnx=xfeat.onnx --saveEngine=xfeat.engine --fp16`
3. Copy tiles + engines to Jetson storage
4. At startup: load engines + preload tiles into RAM

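Step 2 can be scripted; a minimal sketch that composes and runs the trtexec invocations (the file names and the 2048MB workspace value are illustrative defaults, not fixed by this draft):

```python
import shlex
import subprocess

def trtexec_build_cmd(onnx_path: str, engine_path: str,
                      fp16: bool = True, workspace_mb: int = 2048) -> list[str]:
    """Compose a trtexec invocation for the offline engine build."""
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        f"--memPoolSize=workspace:{workspace_mb}",
    ]
    if fp16:
        cmd.append("--fp16")
    return cmd

def build_engine(onnx_path: str, engine_path: str) -> None:
    """Run the one-time build on the Jetson (per model version)."""
    subprocess.run(trtexec_build_cmd(onnx_path, engine_path), check=True)

# Dry-run: show the command that would be executed for LiteSAM
print(shlex.join(trtexec_build_cmd("litesam_fp16.onnx", "litesam.engine")))
```

Keeping the command composition in one function makes it easy to log the exact build flags next to each produced `.engine` file.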
```
┌─────────────────────────────────────────────────────────────────────┐
│ OFFLINE (Before Flight)                                             │
│ 1. Satellite Tiles → Download & Validate → Pre-resize → Store       │
│    (Google Maps)     (≥0.5m/px, <2yr)     (matcher res)   (GeoHash) │
│ 2. TRT Engine Build (one-time per model version):                   │
│    PyTorch model → reparameterize → ONNX export → trtexec --fp16    │
│    Output: litesam.engine, xfeat.engine                             │
│ 3. Copy tiles + engines to Jetson storage                           │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│ ONLINE (During Flight)                                              │
│                                                                     │
│ STARTUP:                                                            │
│ 1. pymavlink → read GLOBAL_POSITION_INT → init ESKF                 │
│ 2. Load TRT engines: litesam.engine + xfeat.engine                  │
│    (tensorrt.Runtime → deserialize_cuda_engine → create_context)    │
│ 3. Allocate GPU buffers for TRT input/output (PyCUDA)               │
│ 4. Start cuVSLAM with first camera frames                           │
│ 5. Preload satellite tiles ±2km into RAM                            │
│ 6. Begin GPS_INPUT output loop at 5-10Hz                            │
│                                                                     │
│ EVERY FRAME (3fps, 333ms interval):                                 │
│ ┌──────────────────────────────────────┐                            │
│ │ Nav Camera → Downsample (CUDA ~2ms)  │                            │
│ │ → cuVSLAM VO+IMU (~9ms)              │ ← CUDA Stream A            │
│ │ → ESKF measurement update            │                            │
│ └──────────────────────────────────────┘                            │
│                                                                     │
│ 5-10Hz CONTINUOUS:                                                  │
│ ┌──────────────────────────────────────┐                            │
│ │ ESKF IMU prediction → GPS_INPUT send │──→ Flight Controller       │
│ └──────────────────────────────────────┘                            │
│                                                                     │
│ KEYFRAMES (every 3-10 frames, async):                               │
│ ┌──────────────────────────────────────┐                            │
│ │ TRT Engine inference (Stream B):     │                            │
│ │ context.execute_async_v3(stream_B)   │──→ ESKF correction         │
│ │ LiteSAM FP16 or XFeat FP16           │                            │
│ └──────────────────────────────────────┘                            │
│                                                                     │
│ TELEMETRY (1Hz):                                                    │
│ ┌──────────────────────────────────────┐                            │
│ │ NAMED_VALUE_FLOAT: confidence, drift │──→ Ground Station          │
│ └──────────────────────────────────────┘                            │
└─────────────────────────────────────────────────────────────────────┘
```

## Architecture

### Component: AI Model Inference Runtime

| Solution | Tools | Advantages | Limitations | Performance | Memory | Fit |
|----------|-------|-----------|-------------|------------|--------|-----|
| Native TRT Engine | tensorrt Python + PyCUDA + trtexec | Optimal latency, minimal memory, full tensor core usage, direct CUDA stream control | Hardware-specific engines, manual buffer management, rebuild per TRT version | Optimal | ~50-130MB total (both models) | ✅ Best |
| ONNX Runtime TRT-EP | onnxruntime + TensorRT EP | Auto-fallback for unsupported ops, simpler API, auto engine caching | +280-300MB per model, wrapper overhead, first-run latency spike | Near-parity (claimed), up to 3x slower (observed) | ~640-690MB total (both models) | ❌ Memory overhead unacceptable |
| ONNX Runtime CUDA EP | onnxruntime + CUDA EP | Simplest API, broadest op support | 7-8x slower on Orin Nano (tensor core bug), no TRT optimizations | 7-8x slower | Standard | ❌ Performance unacceptable |
| Torch-TensorRT | torch_tensorrt | AOT compilation, PyTorch-native, handles mixed TRT/PyTorch | Newer on Jetson, requires PyTorch runtime at inference | Near native TRT | PyTorch runtime ~500MB+ | ⚠️ Viable alternative if TRT export fails |

**Selected**: **Native TRT Engine** — optimal performance and memory on our fixed NVIDIA hardware.

**Fallback**: If any model has unsupported TRT ops (e.g., MinGRU in LiteSAM), use **Torch-TensorRT** for that specific model. Torch-TensorRT handles mixed TRT/PyTorch execution but requires the PyTorch runtime in memory.

### Component: TRT Engine Conversion Workflow

**LiteSAM conversion**:

1. Load PyTorch model with trained weights
2. Reparameterize MobileOne backbone (collapse multi-branch → single Conv2d+BN)
3. Export to ONNX: `torch.onnx.export(model, dummy_input, "litesam.onnx", opset_version=17)`
4. Verify with polygraphy: `polygraphy inspect model litesam.onnx`
5. Build engine on Jetson: `trtexec --onnx=litesam.onnx --saveEngine=litesam.engine --fp16 --memPoolSize=workspace:2048`
6. Verify engine: `trtexec --loadEngine=litesam.engine --fp16`

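Step 2 is the key TRT-enabling transform: the parallel 3×3, 1×1, and identity branches are folded into a single 3×3 convolution. A toy numpy check of the underlying linearity argument (the shapes and branch set are illustrative; the real MobileOne reparameterization also folds BatchNorm into the kernels):

```python
import numpy as np

def conv2d(x, k):
    # Naive 3x3 same-padding convolution: x is (C, H, W), k is (O, C, 3, 3).
    C, H, W = x.shape
    O = k.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((O, H, W))
    for o in range(O):
        for c in range(C):
            for i in range(3):
                for j in range(3):
                    out[o] += k[o, c, i, j] * xp[c, i:i+H, j:j+W]
    return out

rng = np.random.default_rng(0)
C = 4
x = rng.standard_normal((C, 8, 8))
k3 = rng.standard_normal((C, C, 3, 3))   # 3x3 branch
k1 = rng.standard_normal((C, C, 1, 1))   # 1x1 branch

# Fold: pad the 1x1 kernel to 3x3, express the skip branch as a centered
# delta kernel, and sum all kernels into one.
k1_as_3 = np.pad(k1, ((0, 0), (0, 0), (1, 1), (1, 1)))
k_id = np.zeros((C, C, 3, 3))
for c in range(C):
    k_id[c, c, 1, 1] = 1.0
k_merged = k3 + k1_as_3 + k_id

multi_branch = conv2d(x, k3) + conv2d(x, k1_as_3) + x
single_branch = conv2d(x, k_merged)
assert np.allclose(multi_branch, single_branch)
```

Because convolution is linear in the kernel, the merged model is numerically identical to the multi-branch one, so the exported graph is simpler with no accuracy cost.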
**XFeat conversion**:

1. Load PyTorch model
2. Export to ONNX: `torch.onnx.export(model, dummy_input, "xfeat.onnx", opset_version=17)`
3. Build engine on Jetson: `trtexec --onnx=xfeat.onnx --saveEngine=xfeat.engine --fp16`
4. Alternative: use the XFeatTensorRT C++ implementation directly

**INT8 quantization strategy** (optional, future optimization):

- MobileOne backbone (CNN): INT8 safe with calibration data
- TAIFormer (transformer attention): FP16 only — INT8 degrades accuracy
- XFeat: evaluate INT8 on actual UAV-satellite pairs before deploying
- Use nvidia-modelopt for calibration: `from modelopt.onnx.quantization import quantize`

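If INT8 is introduced, the per-module strategy above can be pinned down as a small policy table so the build script stays declarative (the module names here are hypothetical labels, not LiteSAM's actual module paths):

```python
# Hypothetical precision policy mirroring the strategy above.
PRECISION_POLICY = {
    "mobileone_backbone": "int8",  # CNN: INT8 safe with calibration data
    "taiformer": "fp16",           # transformer attention: keep FP16
    "subpixel_head": "fp16",
    "xfeat": "fp16",               # evaluate INT8 on real UAV-satellite pairs first
}

def precision_for(module: str) -> str:
    # Default to FP16 when a module is not explicitly whitelisted for INT8.
    return PRECISION_POLICY.get(module, "fp16")

assert precision_for("mobileone_backbone") == "int8"
```

A lookup like this keeps the FP16-by-default rule explicit and makes accidental INT8 on attention layers impossible.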
### Component: TRT Python Inference Wrapper

Minimal wrapper class for TRT engine inference:

```python
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda


class TRTInference:
    def __init__(self, engine_path, stream):
        # Assumes a CUDA context is already active (created at startup).
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.runtime = trt.Runtime(self.logger)
        with open(engine_path, 'rb') as f:
            self.engine = self.runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        self.stream = stream
        self._allocate_buffers()

    def _allocate_buffers(self):
        # Pre-allocate all GPU I/O buffers once — zero allocation during flight.
        self.inputs = {}
        self.outputs = {}
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            shape = self.engine.get_tensor_shape(name)  # static shapes assumed
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            size = trt.volume(shape)
            device_mem = cuda.mem_alloc(size * np.dtype(dtype).itemsize)
            self.context.set_tensor_address(name, int(device_mem))
            if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
                self.inputs[name] = (device_mem, shape, dtype)
            else:
                self.outputs[name] = (device_mem, shape, dtype)

    def infer_async(self, input_data):
        # Enqueue H2D copies + inference on our stream; returns immediately.
        for name, data in input_data.items():
            cuda.memcpy_htod_async(self.inputs[name][0], data, self.stream)
        # Python binding of enqueueV3:
        self.context.execute_async_v3(self.stream.handle)

    def get_output(self):
        # Collect results later: enqueue D2H copies, then synchronize once.
        results = {}
        for name, (dev_mem, shape, dtype) in self.outputs.items():
            host_mem = np.empty(shape, dtype=dtype)
            cuda.memcpy_dtoh_async(host_mem, dev_mem, self.stream)
            results[name] = host_mem
        self.stream.synchronize()
        return results
```

Key design: the `infer_async()` + `get_output()` split enables pipelining with cuVSLAM on Stream A while satellite matching runs on Stream B.

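As a conceptual illustration of that split (pure Python threads standing in for CUDA streams — this is not the TRT code path), a hypothetical `AsyncStage` shows why submit-now/collect-later lets Stream A work proceed while Stream B computes:

```python
import queue
import threading

class AsyncStage:
    """Toy analogue of infer_async()/get_output(): submit now, collect later."""
    def __init__(self, fn):
        self.fn = fn
        self.jobs = queue.Queue()
        self.results = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            x = self.jobs.get()
            self.results.put(self.fn(x))

    def submit(self, x):   # ~ infer_async(): returns immediately
        self.jobs.put(x)

    def collect(self):     # ~ get_output(): blocks until the result is ready
        return self.results.get()

matcher = AsyncStage(lambda frame: f"match({frame})")
matcher.submit("keyframe_42")   # "Stream B": satellite matching in background
vo_result = "vo(frame_42)"      # "Stream A" work proceeds meanwhile
correction = matcher.collect()  # join point: feed the ESKF correction
assert correction == "match(keyframe_42)"
```

On the real system the overlap is scheduled by the GPU, not by Python threads, but the caller-side contract is the same.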
### Component: Visual Odometry (UNCHANGED)

cuVSLAM — native CUDA library, not affected by the TRT migration. Already optimal.

### Component: Satellite Image Matching (UPDATED runtime + fallback chain)

| Solution | Tools | Advantages | Limitations | Performance (est. Orin Nano Super TRT FP16) | Params | Fit |
|----------|-------|-----------|-------------|----------------------------------------------|--------|-----|
| LiteSAM (opt) TRT Engine FP16 @ 1280px | trtexec + tensorrt Python | Best satellite-aerial accuracy (RMSE@30=17.86m UAV-VisLoc), 6.31M params, smallest model | MinGRU TRT export needs verification (LOW-MEDIUM risk) | Est. ~165-330ms | 6.31M | ✅ Primary (if TRT export succeeds AND ≤200ms) |
| EfficientLoFTR TRT Engine FP16 | trtexec + tensorrt Python | Proven TRT path (Coarse_LoFTR_TRT repo, 138 stars). Semi-dense. CVPR 2024. High accuracy. | 2.4x more params than LiteSAM. Requires einsum→elementary ops rewrite for TRT (documented in the Coarse_LoFTR_TRT paper). | Est. ~200-400ms | 15.05M | ✅ Fallback if LiteSAM TRT fails |
| XFeat TRT Engine FP16 | trtexec + tensorrt Python (or XFeatTensorRT C++) | Fastest. Proven TRT implementation. Lightweight. | General-purpose, not designed for the cross-view satellite-aerial gap (but the nadir-nadir gap is small). | Est. ~50-100ms | <5M | ✅ Speed fallback |

**Decision tree (day-one on Orin Nano Super)**:

1. Clone LiteSAM repo → reparameterize MobileOne → `torch.onnx.export()` → `polygraphy inspect`
2. If ONNX export succeeds → `trtexec --onnx=litesam.onnx --saveEngine=litesam.engine --fp16`
3. If MinGRU causes ONNX/TRT failure → rewrite MinGRU forward() as an unrolled 9-step loop → retry
4. If the rewrite fails or accuracy degrades → **switch to EfficientLoFTR TRT**:
   - Apply Coarse_LoFTR_TRT TRT-adaptation techniques (einsum replacement, etc.)
   - Export to ONNX → `trtexec --fp16`
   - Benchmark at 640×480 and 1280px
5. Benchmark the winner: **if ≤200ms → use it. If >200ms but ≤300ms → acceptable (async on Stream B). If >300ms → use XFeat TRT**

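Step 3's rewrite is mechanical because the MinGRU gates depend only on the input, never on h_{t-1}. A numpy sketch of the unrolled 9-step forward (the weight shapes and single-layer setup are illustrative; LiteSAM stacks 4 such layers inside subpixel refinement):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mingru_unrolled(x, wz, wh, h0, steps=9):
    """MinGRU forward as a plain unrolled loop — TRT-friendly: only
    MatMul/Sigmoid/Mul/Add, no logcumsumexp parallel scan.
    x: (steps, B, C) inputs; wz, wh: (C, D) weights; h0: (B, D)."""
    h = h0
    for t in range(steps):          # seq_len=9: the 3x3 candidate window
        z = sigmoid(x[t] @ wz)      # gate depends only on x_t
        h_tilde = x[t] @ wh         # candidate state, also input-only
        h = (1.0 - z) * h + z * h_tilde
    return h

rng = np.random.default_rng(0)
B, C, D = 2, 16, 16
x = rng.standard_normal((9, B, C))
wz = 0.1 * rng.standard_normal((C, D))
wh = 0.1 * rng.standard_normal((C, D))
h = mingru_unrolled(x, wz, wh, np.zeros((B, D)))
assert h.shape == (B, D)
```

With seq_len fixed at 9, the exported graph is just nine repetitions of elementary ops, which ONNX and TRT both handle without custom plugins.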
**EfficientLoFTR TRT adaptation** (from the Coarse_LoFTR_TRT paper, proven workflow):

- Replace `torch.einsum()` with elementary ops (view, bmm, reshape, sum)
- Replace any TRT-incompatible high-level PyTorch functions
- Use the ONNX export path (less memory required than Torch-TensorRT on an 8GB device)
- Knowledge distillation available for further parameter reduction if needed

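The einsum rewrite in the first bullet can be validated numerically before touching the export. For example, a LoFTR-style attention score `einsum('nlc,nsc->nls')` reduces to a transpose plus batched matmul (numpy stand-in for the PyTorch rewrite; the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, s, c = 2, 6, 5, 8
q = rng.standard_normal((n, l, c))   # query features
k = rng.standard_normal((n, s, c))   # key features

# Original form (awkward to export to TRT inside the graph):
scores_einsum = np.einsum("nlc,nsc->nls", q, k)

# Elementary-op rewrite: transpose + batched matmul
scores_bmm = q @ k.transpose(0, 2, 1)

assert np.allclose(scores_einsum, scores_bmm)
```

Checking each rewritten op against its einsum original this way makes the graph surgery safe before any engine build is attempted.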
### Component: Sensor Fusion (UNCHANGED)

ESKF — CPU-based mathematical filter, not affected.

### Component: Flight Controller Integration (UNCHANGED)

pymavlink — not affected by the TRT migration.

### Component: Ground Station Telemetry (UNCHANGED)

MAVLink NAMED_VALUE_FLOAT — not affected.

### Component: Startup & Lifecycle (UPDATED)

**Updated startup sequence**:

1. Boot Jetson → start GPS-Denied service (systemd)
2. Connect to flight controller via pymavlink on UART
3. Wait for heartbeat from flight controller
4. **Initialize PyCUDA context**
5. **Load TRT engines**: litesam.engine + xfeat.engine via `tensorrt.Runtime.deserialize_cuda_engine()`
6. **Allocate GPU I/O buffers** for both models
7. **Create CUDA streams**: Stream A (cuVSLAM), Stream B (satellite matching)
8. Read GLOBAL_POSITION_INT → init ESKF
9. Start cuVSLAM with first camera frames
10. Begin GPS_INPUT output loop at 5-10Hz
11. Preload satellite tiles within ±2km into RAM
12. System ready

**Engine load time**: ~1-3 seconds per engine (deserialization from the .engine file). One-time cost at startup.

### Component: Thermal Management (UNCHANGED)

Same adaptive pipeline. TRT engines are slightly more power-efficient than ONNX Runtime, but the difference is within noise.

### Component: Object Localization (UNCHANGED)

Not affected — trigonometric calculation, no AI inference.

## Speed Optimization Techniques

### 1. cuVSLAM for Visual Odometry (~9ms/frame)

Unchanged from draft03. Native CUDA, not part of the TRT migration.

### 2. Native TRT Engine Inference (NEW)

All AI models run as pre-compiled TRT FP16 engines:

- Engine files built offline with trtexec (one-time per model version)
- Loaded at startup (~1-3s per engine)
- Inference via `context.execute_async_v3()` on dedicated CUDA Stream B
- GPU buffers pre-allocated — zero runtime allocation during flight
- No ONNX Runtime dependency — no framework overhead

Memory advantage over ONNX Runtime TRT-EP: ~560-600MB saved (both models combined).
Latency advantage: eliminates the ONNX wrapper overhead and guarantees tensor core utilization.

### 3. CUDA Stream Pipelining (REFINED)

- Stream A: cuVSLAM VO for the current frame (~9ms) + ESKF fusion (~1ms)
- Stream B: TRT engine inference for satellite matching (LiteSAM or XFeat, async)
- CPU: GPS_INPUT output loop, NAMED_VALUE_FLOAT, command listener, tile management
- **NEW**: Both cuVSLAM and TRT engines use CUDA streams natively — no framework abstraction layer. Direct GPU scheduling.

### 4-7. (UNCHANGED from draft03)

Keyframe-based satellite matching, TensorRT FP16 optimization, proactive tile loading, 5-10Hz GPS_INPUT output — all unchanged.

## Processing Time Budget (per frame, 333ms interval)

### Normal Frame (non-keyframe)

Unchanged from draft03 — cuVSLAM dominates at ~22ms total.

### Keyframe Satellite Matching (async, CUDA Stream B)

**Path A — LiteSAM TRT Engine FP16 at 1280px**:

| Step | Time | Notes |
|------|------|-------|
| Downsample to 1280px | ~1ms | OpenCV CUDA |
| Load satellite tile | ~1ms | Pre-loaded in RAM |
| Copy input to GPU buffer | <0.5ms | PyCUDA memcpy_htod_async |
| LiteSAM TRT Engine FP16 | ≤200ms | context.execute_async_v3(stream_B) |
| Copy output from GPU | <0.5ms | PyCUDA memcpy_dtoh_async |
| Geometric pose (RANSAC) | ~5ms | Homography |
| ESKF satellite update | ~1ms | Delayed measurement |
| **Total** | **≤210ms** | Async on Stream B |

**Path B — XFeat TRT Engine FP16**:

| Step | Time | Notes |
|------|------|-------|
| XFeat TRT Engine inference | ~50-80ms | context.execute_async_v3(stream_B) |
| Geometric verification (RANSAC) | ~5ms | |
| ESKF satellite update | ~1ms | |
| **Total** | **~60-90ms** | Async on Stream B |

## Memory Budget (Jetson Orin Nano Super, 8GB shared)

| Component | Memory (Native TRT) | Memory (ONNX RT TRT-EP) | Notes |
|-----------|---------------------|--------------------------|-------|
| OS + runtime | ~1.5GB | ~1.5GB | JetPack 6.2 + Python |
| cuVSLAM | ~200-500MB | ~200-500MB | CUDA library + map |
| **LiteSAM TRT engine** | **~50-80MB** | **~330-360MB** | Native TRT vs TRT-EP. If LiteSAM fails: EfficientLoFTR ~100-150MB |
| **XFeat TRT engine** | **~30-50MB** | **~310-330MB** | Native TRT vs TRT-EP |
| Preloaded satellite tiles | ~200MB | ~200MB | ±2km of flight plan |
| pymavlink + MAVLink | ~20MB | ~20MB | |
| FastAPI (local IPC) | ~50MB | ~50MB | |
| ESKF + buffers | ~10MB | ~10MB | |
| ONNX Runtime framework | **0MB** | **~150MB** | Eliminated with native TRT |
| **Total** | **~2.1-2.9GB** | **~2.8-3.6GB** | |
| **% of 8GB** | **26-36%** | **35-45%** | |
| **Savings** | — | — | **~700MB saved with native TRT** |

## Confidence Scoring → GPS_INPUT Mapping

Unchanged from draft03.

## Key Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| **LiteSAM MinGRU ops unsupported in TRT 10.3** | LOW-MEDIUM | LiteSAM TRT export fails | Day-one verification: ONNX export → polygraphy → trtexec. If MinGRU fails: (1) rewrite as an unrolled 9-step loop, (2) if still failing: **switch to EfficientLoFTR TRT** (proven TRT path, Coarse_LoFTR_TRT, 15.05M params). XFeat TRT as speed fallback. |
| **TRT engine build OOM on 8GB Jetson** | LOW | Cannot build engines on target device | Our models are small (6.31M LiteSAM, <5M XFeat). OOM unlikely. If it occurs: reduce --memPoolSize, or build on an identical Orin Nano module with more headroom |
| **Engine incompatibility after JetPack update** | MEDIUM | Must rebuild engines | Include engine rebuild in the JetPack update procedure. Takes minutes per model. |
| **MAVSDK cannot send GPS_INPUT** | CONFIRMED | Must use pymavlink | Unchanged from draft03 |
| **cuVSLAM fails on low-texture terrain** | HIGH | Frequent tracking loss | Unchanged from draft03 |
| **Thermal throttling** | MEDIUM | Satellite matching budget blown | Unchanged from draft03 |
| LiteSAM TRT FP16 >200ms at 1280px | MEDIUM | Must use fallback matcher | Day-one benchmark. Fallback chain: EfficientLoFTR TRT (if ≤300ms) → XFeat TRT (if all >300ms) |
| Google Maps satellite quality in conflict zone | HIGH | Satellite matching fails | Unchanged from draft03 |

## Testing Strategy

### Integration / Functional Tests

All tests from draft03 unchanged, plus:

- **TRT engine load test**: Verify litesam.engine and xfeat.engine load successfully on Jetson Orin Nano Super
- **TRT inference correctness**: Compare TRT engine output vs PyTorch reference output (max L1 error < 0.01)
- **CUDA Stream B pipelining**: Verify satellite matching on Stream B does not block cuVSLAM on Stream A
- **Engine pre-built validation**: Verify engine files from offline preparation work without rebuild at runtime

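The correctness gate (max L1 error < 0.01) is worth standardizing as a small helper; a sketch with dummy arrays standing in for the PyTorch and TRT outputs (the function names are ours, not from any test framework):

```python
import numpy as np

def max_l1_error(a: np.ndarray, b: np.ndarray) -> float:
    """Max absolute elementwise difference between two model outputs."""
    return float(np.max(np.abs(a.astype(np.float64) - b.astype(np.float64))))

def outputs_match(reference: np.ndarray, candidate: np.ndarray,
                  tol: float = 0.01) -> bool:
    # Compare in float64 so FP16 candidate outputs don't lose precision
    # in the comparison itself.
    return max_l1_error(reference, candidate) < tol

ref = np.linspace(0.0, 1.0, 100, dtype=np.float32)  # stand-in: PyTorch output
trt = ref + np.float32(0.004)                       # stand-in: TRT FP16 output
assert outputs_match(ref, trt)
assert not outputs_match(ref, trt + np.float32(0.1))
```

Running this per output tensor after every engine rebuild turns the acceptance criterion into a one-line check.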
### Non-Functional Tests

All tests from draft03 unchanged, plus:

- **TRT engine build time**: Measure trtexec build time for LiteSAM and XFeat on Orin Nano Super (expected: 1-5 minutes each)
- **TRT engine load time**: Measure deserialization time (expected: 1-3 seconds each)
- **Memory comparison**: Measure actual GPU memory with native TRT vs ONNX RT TRT-EP for both models
- **MinGRU TRT compatibility** (day-one blocker):
  1. Clone LiteSAM repo, load pretrained weights
  2. Reparameterize MobileOne backbone
  3. `torch.onnx.export(model, dummy, "litesam.onnx", opset_version=17)`
  4. `polygraphy inspect model litesam.onnx` — check for unsupported ops
  5. `trtexec --onnx=litesam.onnx --saveEngine=litesam.engine --fp16`
  6. If step 3 or 5 fails on MinGRU: rewrite MinGRU forward() as an unrolled loop, retry
  7. If it still fails: switch to EfficientLoFTR, apply the Coarse_LoFTR_TRT adaptation
  8. Compare TRT output vs PyTorch reference (max L1 error < 0.01)
- **EfficientLoFTR TRT fallback benchmark** (if LiteSAM fails): apply the TRT adaptation from Coarse_LoFTR_TRT → ONNX → trtexec → measure latency at 640×480 and 1280px
- **Tensor core utilization**: Verify with Nsight that the TRT engines use tensor cores (unlike ONNX RT CUDA EP)

## References

- ONNX Runtime Issue #24085 (Jetson Orin Nano tensor core bug): https://github.com/microsoft/onnxruntime/issues/24085
- ONNX Runtime Issue #20457 (TRT-EP memory overhead): https://github.com/microsoft/onnxruntime/issues/20457
- ONNX Runtime Issue #12083 (TRT-EP vs native TRT): https://github.com/microsoft/onnxruntime/issues/12083
- NVIDIA TensorRT 10 Python API: https://docs.nvidia.com/deeplearning/tensorrt/10.15.1/inference-library/python-api-docs.html
- TensorRT Best Practices: https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html
- TensorRT engine hardware specificity: https://github.com/NVIDIA/TensorRT/issues/1920
- trtexec ONNX conversion: https://nvidia-jetson.piveral.com/jetson-orin-nano/how-to-convert-onnx-to-engine-on-jetson-orin-nano-dev-board/
- Torch-TensorRT JetPack 6.2: https://docs.pytorch.org/TensorRT/v2.10.0/getting_started/jetpack.html
- XFeatTensorRT: https://github.com/PranavNedunghat/XFeatTensorRT
- JetPack 6.2 Release Notes: https://docs.nvidia.com/jetson/archives/jetpack-archived/jetpack-62/release-notes/index.html
- Jetson Orin Nano Super: https://developer.nvidia.com/blog/nvidia-jetson-orin-nano-developer-kit-gets-a-super-boost/
- DLA on Jetson Orin: https://developer.nvidia.com/blog/maximizing-deep-learning-performance-on-nvidia-jetson-orin-with-dla/
- EfficientLoFTR (CVPR 2024): https://github.com/zju3dv/EfficientLoFTR
- EfficientLoFTR HuggingFace: https://huggingface.co/docs/transformers/en/model_doc/efficientloftr
- Coarse_LoFTR_TRT (TRT for embedded): https://github.com/Kolkir/Coarse_LoFTR_TRT
- Coarse_LoFTR_TRT paper: https://ar5iv.labs.arxiv.org/html/2202.00770
- LoFTR_TRT: https://github.com/Kolkir/LoFTR_TRT
- minGRU ("Were RNNs All We Needed?"): https://huggingface.co/papers/2410.01201
- minGRU PyTorch implementation: https://github.com/lucidrains/minGRU-pytorch
- LiteSAM paper (MinGRU details, Eqs 12-16): https://www.mdpi.com/2072-4292/17/19/3349
- DALGlue (UAV feature matching, 2025): https://www.nature.com/articles/s41598-025-21602-5
- All references from solution_draft03.md

## Related Artifacts

- AC Assessment: `_docs/00_research/gps_denied_nav/00_ac_assessment.md`
- Research artifacts (this assessment): `_docs/00_research/trt_engine_migration/`
- Previous research: `_docs/00_research/gps_denied_nav_v3/`
- Tech stack evaluation: `_docs/01_solution/tech_stack.md`
- Security analysis: `_docs/01_solution/security_analysis.md`