Mirror of https://github.com/azaion/gps-denied-onboard.git (synced 2026-04-22 21:56:38 +00:00)
# Solution Draft

## Assessment Findings

| Old Component Solution | Weak Point (functional/security/performance) | New Solution |
|------------------------|----------------------------------------------|-------------|
| ONNX Runtime as potential inference runtime for AI models | **Performance**: ONNX Runtime CUDA EP on Jetson Orin Nano is 7-8x slower than TRT standalone with default settings (tensor cores not utilized). Even TRT-EP shows up to 3x overhead on some models. | **Use native TRT Engine for all AI models**. Convert PyTorch → ONNX → trtexec → .engine. Load with the tensorrt Python module. Eliminates the ONNX Runtime dependency entirely. |
| ONNX Runtime TRT-EP memory overhead | **Performance**: ONNX RT TRT-EP keeps the serialized engine in memory (~420-440MB vs 130-140MB native TRT). Delta ~280-300MB PER MODEL. On 8GB shared memory, this wastes ~560-600MB for two models. | **Native TRT releases the serialized blob after deserialization** → saves ~280-300MB per model. Total savings ~560-600MB — 7% of total memory. Critical given the cuVSLAM map growth risk. |
| No explicit TRT engine build step in offline pipeline | **Functional**: Draft03 mentions TRT FP16 but doesn't define the build workflow. When/where are engines built? | **Add TRT engine build to the offline preparation pipeline**: After satellite tile download, run trtexec on the Jetson to build .engine files. Store alongside tiles. One-time cost per model version. |
| Cross-platform portability via ONNX Runtime | **Functional**: ONNX Runtime's primary value is cross-platform support. Our deployment is Jetson-only — this value is zero. We pay the performance/memory tax for unused portability. | **Drop ONNX Runtime**. Jetson Orin Nano Super is fixed deployment hardware. TRT Engine is the optimal runtime for NVIDIA-only deployment. |
| No DLA offloading considered | **Performance**: Draft03 doesn't mention DLA. Jetson Orin Nano has NO DLA cores — only Orin NX (1-2) and AGX Orin (2) have DLA. | **Confirm: DLA offloading is NOT available on Orin Nano**. All inference must run on the GPU (1024 CUDA cores, 16 tensor cores). This makes maximizing GPU efficiency via native TRT even more critical. |
| LiteSAM MinGRU TRT compatibility risk | **Functional**: LiteSAM's subpixel refinement uses 4 stacked MinGRU layers over a 3×3 candidate window (seq_len=9). MinGRU gates depend only on input C_f (not h_{t-1}), so z_t/h̃_t are pre-computable. Ops are standard: Linear, Sigmoid, Mul, Add, ReLU, Tanh. Risk is LOW-MEDIUM — depends on whether the implementation uses logcumsumexp (problematic) or a simple loop (fine). Seq_len=9 makes this trivially rewritable. | **Day-one verification**: clone LiteSAM repo → torch.onnx.export → polygraphy inspect → trtexec --fp16. If export fails on MinGRU: rewrite forward() as an unrolled loop (9 steps). **If LiteSAM cannot be made TRT-compatible: replace with EfficientLoFTR TRT** (proven TRT path via Coarse_LoFTR_TRT, 15.05M params, semi-dense matching). |

## Product Solution Description

A real-time GPS-denied visual navigation system for fixed-wing UAVs, running on a Jetson Orin Nano Super (8GB). All AI model inference uses **native TensorRT Engine files** — no ONNX Runtime dependency. The system replaces the GPS module by sending MAVLink GPS_INPUT messages via pymavlink over UART at 5-10Hz.

Position is determined by fusing (1) CUDA-accelerated visual odometry (cuVSLAM — native CUDA), (2) absolute position corrections from satellite image matching (LiteSAM or XFeat — TRT Engine FP16), and (3) IMU data from the flight controller, combined in an ESKF.

**Inference runtime decision**: Native TRT Engine over ONNX Runtime because:

1. ONNX RT CUDA EP is 7-8x slower on Orin Nano (tensor core bug)
2. ONNX RT TRT-EP wastes ~280-300MB per model (serialized engine retained in memory)
3. Cross-platform portability has zero value — deployment is Jetson-only
4. Native TRT provides direct CUDA stream control for pipelining with cuVSLAM

**Hard constraint**: the camera captures frames at ~3fps (333ms interval); the full VO+ESKF pipeline must complete within 400ms; GPS_INPUT is sent at 5-10Hz.

**AI Model Runtime Summary**:

| Model | Runtime | Precision | Memory | Integration |
|-------|---------|-----------|--------|-------------|
| cuVSLAM | Native CUDA (PyCuVSLAM) | N/A (closed-source) | ~200-500MB | CUDA Stream A |
| LiteSAM | TRT Engine | FP16 | ~50-80MB | CUDA Stream B |
| XFeat | TRT Engine | FP16 | ~30-50MB | CUDA Stream B (fallback) |
| ESKF | CPU (Python/C++) | FP64 | ~10MB | CPU thread |

**Offline Preparation Pipeline** (before flight):

1. Download satellite tiles → validate → pre-resize → store (existing)
2. **NEW: Build TRT engines on Jetson** (one-time per model version)
   - `trtexec --onnx=litesam_fp16.onnx --saveEngine=litesam.engine --fp16`
   - `trtexec --onnx=xfeat.onnx --saveEngine=xfeat.engine --fp16`
3. Copy tiles + engines to Jetson storage
4. At startup: load engines + preload tiles into RAM

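Step 2 can be scripted; a minimal sketch that composes and runs the trtexec invocations (the file names and the 2048MB workspace value are illustrative defaults, not fixed by this draft):

```python
import shlex
import subprocess

def trtexec_build_cmd(onnx_path: str, engine_path: str,
                      fp16: bool = True, workspace_mb: int = 2048) -> list[str]:
    """Compose a trtexec invocation for the offline engine build."""
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        f"--memPoolSize=workspace:{workspace_mb}",
    ]
    if fp16:
        cmd.append("--fp16")
    return cmd

def build_engine(onnx_path: str, engine_path: str) -> None:
    """Run the one-time build on the Jetson (per model version)."""
    subprocess.run(trtexec_build_cmd(onnx_path, engine_path), check=True)

# Dry-run: show the command that would be executed for LiteSAM
print(shlex.join(trtexec_build_cmd("litesam_fp16.onnx", "litesam.engine")))
```

Keeping the command composition in one function makes it easy to log the exact build flags next to each produced `.engine` file.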
```
┌─────────────────────────────────────────────────────────────────────┐
│ OFFLINE (Before Flight)                                             │
│ 1. Satellite Tiles → Download & Validate → Pre-resize → Store       │
│    (Google Maps)     (≥0.5m/px, <2yr)     (matcher res)   (GeoHash) │
│ 2. TRT Engine Build (one-time per model version):                   │
│    PyTorch model → reparameterize → ONNX export → trtexec --fp16    │
│    Output: litesam.engine, xfeat.engine                             │
│ 3. Copy tiles + engines to Jetson storage                           │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│ ONLINE (During Flight)                                              │
│                                                                     │
│ STARTUP:                                                            │
│ 1. pymavlink → read GLOBAL_POSITION_INT → init ESKF                 │
│ 2. Load TRT engines: litesam.engine + xfeat.engine                  │
│    (tensorrt.Runtime → deserialize_cuda_engine → create_context)    │
│ 3. Allocate GPU buffers for TRT input/output (PyCUDA)               │
│ 4. Start cuVSLAM with first camera frames                           │
│ 5. Preload satellite tiles ±2km into RAM                            │
│ 6. Begin GPS_INPUT output loop at 5-10Hz                            │
│                                                                     │
│ EVERY FRAME (3fps, 333ms interval):                                 │
│ ┌──────────────────────────────────────┐                            │
│ │ Nav Camera → Downsample (CUDA ~2ms)  │                            │
│ │ → cuVSLAM VO+IMU (~9ms)              │ ← CUDA Stream A            │
│ │ → ESKF measurement update            │                            │
│ └──────────────────────────────────────┘                            │
│                                                                     │
│ 5-10Hz CONTINUOUS:                                                  │
│ ┌──────────────────────────────────────┐                            │
│ │ ESKF IMU prediction → GPS_INPUT send │──→ Flight Controller       │
│ └──────────────────────────────────────┘                            │
│                                                                     │
│ KEYFRAMES (every 3-10 frames, async):                               │
│ ┌──────────────────────────────────────┐                            │
│ │ TRT Engine inference (Stream B):     │                            │
│ │ context.execute_async_v3(stream_B)   │──→ ESKF correction         │
│ │ LiteSAM FP16 or XFeat FP16           │                            │
│ └──────────────────────────────────────┘                            │
│                                                                     │
│ TELEMETRY (1Hz):                                                    │
│ ┌──────────────────────────────────────┐                            │
│ │ NAMED_VALUE_FLOAT: confidence, drift │──→ Ground Station          │
│ └──────────────────────────────────────┘                            │
└─────────────────────────────────────────────────────────────────────┘
```

## Architecture

### Component: AI Model Inference Runtime

| Solution | Tools | Advantages | Limitations | Performance | Memory | Fit |
|----------|-------|-----------|-------------|------------|--------|-----|
| Native TRT Engine | tensorrt Python + PyCUDA + trtexec | Optimal latency, minimal memory, full tensor core usage, direct CUDA stream control | Hardware-specific engines, manual buffer management, rebuild per TRT version | Optimal | ~50-130MB total (both models) | ✅ Best |
| ONNX Runtime TRT-EP | onnxruntime + TensorRT EP | Auto-fallback for unsupported ops, simpler API, auto engine caching | +280-300MB per model, wrapper overhead, first-run latency spike | Near-parity (claimed), up to 3x slower (observed) | ~640-690MB total (both models) | ❌ Memory overhead unacceptable |
| ONNX Runtime CUDA EP | onnxruntime + CUDA EP | Simplest API, broadest op support | 7-8x slower on Orin Nano (tensor core bug), no TRT optimizations | 7-8x slower | Standard | ❌ Performance unacceptable |
| Torch-TensorRT | torch_tensorrt | AOT compilation, PyTorch-native, handles mixed TRT/PyTorch | Newer on Jetson, requires PyTorch runtime at inference | Near native TRT | PyTorch runtime ~500MB+ | ⚠️ Viable alternative if TRT export fails |

**Selected**: **Native TRT Engine** — optimal performance and memory on our fixed NVIDIA hardware.

**Fallback**: If any model has unsupported TRT ops (e.g., MinGRU in LiteSAM), use **Torch-TensorRT** for that specific model. Torch-TensorRT handles mixed TRT/PyTorch execution but requires the PyTorch runtime in memory.

### Component: TRT Engine Conversion Workflow

**LiteSAM conversion**:

1. Load PyTorch model with trained weights
2. Reparameterize MobileOne backbone (collapse multi-branch → single Conv2d+BN)
3. Export to ONNX: `torch.onnx.export(model, dummy_input, "litesam.onnx", opset_version=17)`
4. Verify with polygraphy: `polygraphy inspect model litesam.onnx`
5. Build engine on Jetson: `trtexec --onnx=litesam.onnx --saveEngine=litesam.engine --fp16 --memPoolSize=workspace:2048`
6. Verify engine: `trtexec --loadEngine=litesam.engine --fp16`

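Step 2 is the key TRT-enabling transform: the parallel 3×3, 1×1, and identity branches are folded into a single 3×3 convolution. A toy numpy check of the underlying linearity argument (the shapes and branch set are illustrative; the real MobileOne reparameterization also folds BatchNorm into the kernels):

```python
import numpy as np

def conv2d(x, k):
    # Naive 3x3 same-padding convolution: x is (C, H, W), k is (O, C, 3, 3).
    C, H, W = x.shape
    O = k.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((O, H, W))
    for o in range(O):
        for c in range(C):
            for i in range(3):
                for j in range(3):
                    out[o] += k[o, c, i, j] * xp[c, i:i+H, j:j+W]
    return out

rng = np.random.default_rng(0)
C = 4
x = rng.standard_normal((C, 8, 8))
k3 = rng.standard_normal((C, C, 3, 3))   # 3x3 branch
k1 = rng.standard_normal((C, C, 1, 1))   # 1x1 branch

# Fold: pad the 1x1 kernel to 3x3, express the skip branch as a centered
# delta kernel, and sum all kernels into one.
k1_as_3 = np.pad(k1, ((0, 0), (0, 0), (1, 1), (1, 1)))
k_id = np.zeros((C, C, 3, 3))
for c in range(C):
    k_id[c, c, 1, 1] = 1.0
k_merged = k3 + k1_as_3 + k_id

multi_branch = conv2d(x, k3) + conv2d(x, k1_as_3) + x
single_branch = conv2d(x, k_merged)
assert np.allclose(multi_branch, single_branch)
```

Because convolution is linear in the kernel, the merged model is numerically identical to the multi-branch one, so the exported graph is simpler with no accuracy cost.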
**XFeat conversion**:

1. Load PyTorch model
2. Export to ONNX: `torch.onnx.export(model, dummy_input, "xfeat.onnx", opset_version=17)`
3. Build engine on Jetson: `trtexec --onnx=xfeat.onnx --saveEngine=xfeat.engine --fp16`
4. Alternative: use the XFeatTensorRT C++ implementation directly

**INT8 quantization strategy** (optional, future optimization):

- MobileOne backbone (CNN): INT8 safe with calibration data
- TAIFormer (transformer attention): FP16 only — INT8 degrades accuracy
- XFeat: evaluate INT8 on actual UAV-satellite pairs before deploying
- Use nvidia-modelopt for calibration: `from modelopt.onnx.quantization import quantize`

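If INT8 is introduced, the per-module strategy above can be pinned down as a small policy table so the build script stays declarative (the module names here are hypothetical labels, not LiteSAM's actual module paths):

```python
# Hypothetical precision policy mirroring the strategy above.
PRECISION_POLICY = {
    "mobileone_backbone": "int8",  # CNN: INT8 safe with calibration data
    "taiformer": "fp16",           # transformer attention: keep FP16
    "subpixel_head": "fp16",
    "xfeat": "fp16",               # evaluate INT8 on real UAV-satellite pairs first
}

def precision_for(module: str) -> str:
    # Default to FP16 when a module is not explicitly whitelisted for INT8.
    return PRECISION_POLICY.get(module, "fp16")

assert precision_for("mobileone_backbone") == "int8"
```

A lookup like this keeps the FP16-by-default rule explicit and makes accidental INT8 on attention layers impossible.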
### Component: TRT Python Inference Wrapper

Minimal wrapper class for TRT engine inference:

```python
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda


class TRTInference:
    def __init__(self, engine_path, stream):
        # Assumes a CUDA context is already active (created at startup).
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.runtime = trt.Runtime(self.logger)
        with open(engine_path, 'rb') as f:
            self.engine = self.runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        self.stream = stream
        self._allocate_buffers()

    def _allocate_buffers(self):
        # Pre-allocate all GPU I/O buffers once — zero allocation during flight.
        self.inputs = {}
        self.outputs = {}
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            shape = self.engine.get_tensor_shape(name)  # static shapes assumed
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            size = trt.volume(shape)
            device_mem = cuda.mem_alloc(size * np.dtype(dtype).itemsize)
            self.context.set_tensor_address(name, int(device_mem))
            if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
                self.inputs[name] = (device_mem, shape, dtype)
            else:
                self.outputs[name] = (device_mem, shape, dtype)

    def infer_async(self, input_data):
        # Enqueue H2D copies + inference on our stream; returns immediately.
        for name, data in input_data.items():
            cuda.memcpy_htod_async(self.inputs[name][0], data, self.stream)
        # Python binding of enqueueV3:
        self.context.execute_async_v3(self.stream.handle)

    def get_output(self):
        # Collect results later: enqueue D2H copies, then synchronize once.
        results = {}
        for name, (dev_mem, shape, dtype) in self.outputs.items():
            host_mem = np.empty(shape, dtype=dtype)
            cuda.memcpy_dtoh_async(host_mem, dev_mem, self.stream)
            results[name] = host_mem
        self.stream.synchronize()
        return results
```

Key design: the `infer_async()` + `get_output()` split enables pipelining with cuVSLAM on Stream A while satellite matching runs on Stream B.

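As a conceptual illustration of that split (pure Python threads standing in for CUDA streams — this is not the TRT code path), a hypothetical `AsyncStage` shows why submit-now/collect-later lets Stream A work proceed while Stream B computes:

```python
import queue
import threading

class AsyncStage:
    """Toy analogue of infer_async()/get_output(): submit now, collect later."""
    def __init__(self, fn):
        self.fn = fn
        self.jobs = queue.Queue()
        self.results = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            x = self.jobs.get()
            self.results.put(self.fn(x))

    def submit(self, x):   # ~ infer_async(): returns immediately
        self.jobs.put(x)

    def collect(self):     # ~ get_output(): blocks until the result is ready
        return self.results.get()

matcher = AsyncStage(lambda frame: f"match({frame})")
matcher.submit("keyframe_42")   # "Stream B": satellite matching in background
vo_result = "vo(frame_42)"      # "Stream A" work proceeds meanwhile
correction = matcher.collect()  # join point: feed the ESKF correction
assert correction == "match(keyframe_42)"
```

On the real system the overlap is scheduled by the GPU, not by Python threads, but the caller-side contract is the same.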
### Component: Visual Odometry (UNCHANGED)

cuVSLAM — native CUDA library, not affected by the TRT migration. Already optimal.

### Component: Satellite Image Matching (UPDATED runtime + fallback chain)

| Solution | Tools | Advantages | Limitations | Performance (est. Orin Nano Super TRT FP16) | Params | Fit |
|----------|-------|-----------|-------------|----------------------------------------------|--------|-----|
| LiteSAM (opt) TRT Engine FP16 @ 1280px | trtexec + tensorrt Python | Best satellite-aerial accuracy (RMSE@30=17.86m UAV-VisLoc), 6.31M params, smallest model | MinGRU TRT export needs verification (LOW-MEDIUM risk) | Est. ~165-330ms | 6.31M | ✅ Primary (if TRT export succeeds AND ≤200ms) |
| EfficientLoFTR TRT Engine FP16 | trtexec + tensorrt Python | Proven TRT path (Coarse_LoFTR_TRT repo, 138 stars). Semi-dense. CVPR 2024. High accuracy. | 2.4x more params than LiteSAM. Requires einsum→elementary ops rewrite for TRT (documented in the Coarse_LoFTR_TRT paper). | Est. ~200-400ms | 15.05M | ✅ Fallback if LiteSAM TRT fails |
| XFeat TRT Engine FP16 | trtexec + tensorrt Python (or XFeatTensorRT C++) | Fastest. Proven TRT implementation. Lightweight. | General-purpose, not designed for the cross-view satellite-aerial gap (but the nadir-nadir gap is small). | Est. ~50-100ms | <5M | ✅ Speed fallback |

**Decision tree (day-one on Orin Nano Super)**:

1. Clone LiteSAM repo → reparameterize MobileOne → `torch.onnx.export()` → `polygraphy inspect`
2. If ONNX export succeeds → `trtexec --onnx=litesam.onnx --saveEngine=litesam.engine --fp16`
3. If MinGRU causes ONNX/TRT failure → rewrite MinGRU forward() as an unrolled 9-step loop → retry
4. If the rewrite fails or accuracy degrades → **switch to EfficientLoFTR TRT**:
   - Apply Coarse_LoFTR_TRT TRT-adaptation techniques (einsum replacement, etc.)
   - Export to ONNX → `trtexec --fp16`
   - Benchmark at 640×480 and 1280px
5. Benchmark the winner: **if ≤200ms → use it. If >200ms but ≤300ms → acceptable (async on Stream B). If >300ms → use XFeat TRT**

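Step 3's rewrite is mechanical because the MinGRU gates depend only on the input, never on h_{t-1}. A numpy sketch of the unrolled 9-step forward (the weight shapes and single-layer setup are illustrative; LiteSAM stacks 4 such layers inside subpixel refinement):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mingru_unrolled(x, wz, wh, h0, steps=9):
    """MinGRU forward as a plain unrolled loop — TRT-friendly: only
    MatMul/Sigmoid/Mul/Add, no logcumsumexp parallel scan.
    x: (steps, B, C) inputs; wz, wh: (C, D) weights; h0: (B, D)."""
    h = h0
    for t in range(steps):          # seq_len=9: the 3x3 candidate window
        z = sigmoid(x[t] @ wz)      # gate depends only on x_t
        h_tilde = x[t] @ wh         # candidate state, also input-only
        h = (1.0 - z) * h + z * h_tilde
    return h

rng = np.random.default_rng(0)
B, C, D = 2, 16, 16
x = rng.standard_normal((9, B, C))
wz = 0.1 * rng.standard_normal((C, D))
wh = 0.1 * rng.standard_normal((C, D))
h = mingru_unrolled(x, wz, wh, np.zeros((B, D)))
assert h.shape == (B, D)
```

With seq_len fixed at 9, the exported graph is just nine repetitions of elementary ops, which ONNX and TRT both handle without custom plugins.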
**EfficientLoFTR TRT adaptation** (from the Coarse_LoFTR_TRT paper, proven workflow):

- Replace `torch.einsum()` with elementary ops (view, bmm, reshape, sum)
- Replace any TRT-incompatible high-level PyTorch functions
- Use the ONNX export path (less memory required than Torch-TensorRT on an 8GB device)
- Knowledge distillation available for further parameter reduction if needed

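The einsum rewrite in the first bullet can be validated numerically before touching the export. For example, a LoFTR-style attention score `einsum('nlc,nsc->nls')` reduces to a transpose plus batched matmul (numpy stand-in for the PyTorch rewrite; the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, s, c = 2, 6, 5, 8
q = rng.standard_normal((n, l, c))   # query features
k = rng.standard_normal((n, s, c))   # key features

# Original form (awkward to export to TRT inside the graph):
scores_einsum = np.einsum("nlc,nsc->nls", q, k)

# Elementary-op rewrite: transpose + batched matmul
scores_bmm = q @ k.transpose(0, 2, 1)

assert np.allclose(scores_einsum, scores_bmm)
```

Checking each rewritten op against its einsum original this way makes the graph surgery safe before any engine build is attempted.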
### Component: Sensor Fusion (UNCHANGED)

ESKF — CPU-based mathematical filter, not affected.

### Component: Flight Controller Integration (UNCHANGED)

pymavlink — not affected by the TRT migration.

### Component: Ground Station Telemetry (UNCHANGED)

MAVLink NAMED_VALUE_FLOAT — not affected.

### Component: Startup & Lifecycle (UPDATED)

**Updated startup sequence**:

1. Boot Jetson → start GPS-Denied service (systemd)
2. Connect to flight controller via pymavlink on UART
3. Wait for heartbeat from flight controller
4. **Initialize PyCUDA context**
5. **Load TRT engines**: litesam.engine + xfeat.engine via `tensorrt.Runtime.deserialize_cuda_engine()`
6. **Allocate GPU I/O buffers** for both models
7. **Create CUDA streams**: Stream A (cuVSLAM), Stream B (satellite matching)
8. Read GLOBAL_POSITION_INT → init ESKF
9. Start cuVSLAM with first camera frames
10. Begin GPS_INPUT output loop at 5-10Hz
11. Preload satellite tiles within ±2km into RAM
12. System ready

**Engine load time**: ~1-3 seconds per engine (deserialization from the .engine file). One-time cost at startup.

### Component: Thermal Management (UNCHANGED)

Same adaptive pipeline. TRT engines are slightly more power-efficient than ONNX Runtime, but the difference is within noise.

### Component: Object Localization (UNCHANGED)

Not affected — trigonometric calculation, no AI inference.

## Speed Optimization Techniques

### 1. cuVSLAM for Visual Odometry (~9ms/frame)

Unchanged from draft03. Native CUDA, not part of the TRT migration.

### 2. Native TRT Engine Inference (NEW)

All AI models run as pre-compiled TRT FP16 engines:

- Engine files built offline with trtexec (one-time per model version)
- Loaded at startup (~1-3s per engine)
- Inference via `context.execute_async_v3()` on dedicated CUDA Stream B
- GPU buffers pre-allocated — zero runtime allocation during flight
- No ONNX Runtime dependency — no framework overhead

Memory advantage over ONNX Runtime TRT-EP: ~560-600MB saved (both models combined).
Latency advantage: eliminates the ONNX wrapper overhead and guarantees tensor core utilization.

### 3. CUDA Stream Pipelining (REFINED)

- Stream A: cuVSLAM VO for the current frame (~9ms) + ESKF fusion (~1ms)
- Stream B: TRT engine inference for satellite matching (LiteSAM or XFeat, async)
- CPU: GPS_INPUT output loop, NAMED_VALUE_FLOAT, command listener, tile management
- **NEW**: Both cuVSLAM and TRT engines use CUDA streams natively — no framework abstraction layer. Direct GPU scheduling.

### 4-7. (UNCHANGED from draft03)

Keyframe-based satellite matching, TensorRT FP16 optimization, proactive tile loading, 5-10Hz GPS_INPUT output — all unchanged.

## Processing Time Budget (per frame, 333ms interval)

### Normal Frame (non-keyframe)

Unchanged from draft03 — cuVSLAM dominates at ~22ms total.

### Keyframe Satellite Matching (async, CUDA Stream B)

**Path A — LiteSAM TRT Engine FP16 at 1280px**:

| Step | Time | Notes |
|------|------|-------|
| Downsample to 1280px | ~1ms | OpenCV CUDA |
| Load satellite tile | ~1ms | Pre-loaded in RAM |
| Copy input to GPU buffer | <0.5ms | PyCUDA memcpy_htod_async |
| LiteSAM TRT Engine FP16 | ≤200ms | context.execute_async_v3(stream_B) |
| Copy output from GPU | <0.5ms | PyCUDA memcpy_dtoh_async |
| Geometric pose (RANSAC) | ~5ms | Homography |
| ESKF satellite update | ~1ms | Delayed measurement |
| **Total** | **≤210ms** | Async on Stream B |

**Path B — XFeat TRT Engine FP16**:

| Step | Time | Notes |
|------|------|-------|
| XFeat TRT Engine inference | ~50-80ms | context.execute_async_v3(stream_B) |
| Geometric verification (RANSAC) | ~5ms | |
| ESKF satellite update | ~1ms | |
| **Total** | **~60-90ms** | Async on Stream B |

## Memory Budget (Jetson Orin Nano Super, 8GB shared)

| Component | Memory (Native TRT) | Memory (ONNX RT TRT-EP) | Notes |
|-----------|---------------------|--------------------------|-------|
| OS + runtime | ~1.5GB | ~1.5GB | JetPack 6.2 + Python |
| cuVSLAM | ~200-500MB | ~200-500MB | CUDA library + map |
| **LiteSAM TRT engine** | **~50-80MB** | **~330-360MB** | Native TRT vs TRT-EP. If LiteSAM fails: EfficientLoFTR ~100-150MB |
| **XFeat TRT engine** | **~30-50MB** | **~310-330MB** | Native TRT vs TRT-EP |
| Preloaded satellite tiles | ~200MB | ~200MB | ±2km of flight plan |
| pymavlink + MAVLink | ~20MB | ~20MB | |
| FastAPI (local IPC) | ~50MB | ~50MB | |
| ESKF + buffers | ~10MB | ~10MB | |
| ONNX Runtime framework | **0MB** | **~150MB** | Eliminated with native TRT |
| **Total** | **~2.1-2.9GB** | **~2.8-3.6GB** | |
| **% of 8GB** | **26-36%** | **35-45%** | |
| **Savings** | — | — | **~700MB saved with native TRT** |

## Confidence Scoring → GPS_INPUT Mapping

Unchanged from draft03.

## Key Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| **LiteSAM MinGRU ops unsupported in TRT 10.3** | LOW-MEDIUM | LiteSAM TRT export fails | Day-one verification: ONNX export → polygraphy → trtexec. If MinGRU fails: (1) rewrite as an unrolled 9-step loop, (2) if still failing: **switch to EfficientLoFTR TRT** (proven TRT path, Coarse_LoFTR_TRT, 15.05M params). XFeat TRT as speed fallback. |
| **TRT engine build OOM on 8GB Jetson** | LOW | Cannot build engines on target device | Our models are small (6.31M LiteSAM, <5M XFeat). OOM unlikely. If it occurs: reduce --memPoolSize, or build on an identical Orin Nano module with more headroom |
| **Engine incompatibility after JetPack update** | MEDIUM | Must rebuild engines | Include engine rebuild in the JetPack update procedure. Takes minutes per model. |
| **MAVSDK cannot send GPS_INPUT** | CONFIRMED | Must use pymavlink | Unchanged from draft03 |
| **cuVSLAM fails on low-texture terrain** | HIGH | Frequent tracking loss | Unchanged from draft03 |
| **Thermal throttling** | MEDIUM | Satellite matching budget blown | Unchanged from draft03 |
| LiteSAM TRT FP16 >200ms at 1280px | MEDIUM | Must use fallback matcher | Day-one benchmark. Fallback chain: EfficientLoFTR TRT (if ≤300ms) → XFeat TRT (if all >300ms) |
| Google Maps satellite quality in conflict zone | HIGH | Satellite matching fails | Unchanged from draft03 |

## Testing Strategy

### Integration / Functional Tests

All tests from draft03 unchanged, plus:

- **TRT engine load test**: Verify litesam.engine and xfeat.engine load successfully on Jetson Orin Nano Super
- **TRT inference correctness**: Compare TRT engine output vs PyTorch reference output (max L1 error < 0.01)
- **CUDA Stream B pipelining**: Verify satellite matching on Stream B does not block cuVSLAM on Stream A
- **Engine pre-built validation**: Verify engine files from offline preparation work without rebuild at runtime

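The correctness gate (max L1 error < 0.01) is worth standardizing as a small helper; a sketch with dummy arrays standing in for the PyTorch and TRT outputs (the function names are ours, not from any test framework):

```python
import numpy as np

def max_l1_error(a: np.ndarray, b: np.ndarray) -> float:
    """Max absolute elementwise difference between two model outputs."""
    return float(np.max(np.abs(a.astype(np.float64) - b.astype(np.float64))))

def outputs_match(reference: np.ndarray, candidate: np.ndarray,
                  tol: float = 0.01) -> bool:
    # Compare in float64 so FP16 candidate outputs don't lose precision
    # in the comparison itself.
    return max_l1_error(reference, candidate) < tol

ref = np.linspace(0.0, 1.0, 100, dtype=np.float32)  # stand-in: PyTorch output
trt = ref + np.float32(0.004)                       # stand-in: TRT FP16 output
assert outputs_match(ref, trt)
assert not outputs_match(ref, trt + np.float32(0.1))
```

Running this per output tensor after every engine rebuild turns the acceptance criterion into a one-line check.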
### Non-Functional Tests

All tests from draft03 unchanged, plus:

- **TRT engine build time**: Measure trtexec build time for LiteSAM and XFeat on Orin Nano Super (expected: 1-5 minutes each)
- **TRT engine load time**: Measure deserialization time (expected: 1-3 seconds each)
- **Memory comparison**: Measure actual GPU memory with native TRT vs ONNX RT TRT-EP for both models
- **MinGRU TRT compatibility** (day-one blocker):
  1. Clone LiteSAM repo, load pretrained weights
  2. Reparameterize MobileOne backbone
  3. `torch.onnx.export(model, dummy, "litesam.onnx", opset_version=17)`
  4. `polygraphy inspect model litesam.onnx` — check for unsupported ops
  5. `trtexec --onnx=litesam.onnx --saveEngine=litesam.engine --fp16`
  6. If step 3 or 5 fails on MinGRU: rewrite MinGRU forward() as an unrolled loop, retry
  7. If it still fails: switch to EfficientLoFTR, apply the Coarse_LoFTR_TRT adaptation
  8. Compare TRT output vs PyTorch reference (max L1 error < 0.01)
- **EfficientLoFTR TRT fallback benchmark** (if LiteSAM fails): apply the TRT adaptation from Coarse_LoFTR_TRT → ONNX → trtexec → measure latency at 640×480 and 1280px
- **Tensor core utilization**: Verify with Nsight that the TRT engines use tensor cores (unlike ONNX RT CUDA EP)

## References

- ONNX Runtime Issue #24085 (Jetson Orin Nano tensor core bug): https://github.com/microsoft/onnxruntime/issues/24085
- ONNX Runtime Issue #20457 (TRT-EP memory overhead): https://github.com/microsoft/onnxruntime/issues/20457
- ONNX Runtime Issue #12083 (TRT-EP vs native TRT): https://github.com/microsoft/onnxruntime/issues/12083
- NVIDIA TensorRT 10 Python API: https://docs.nvidia.com/deeplearning/tensorrt/10.15.1/inference-library/python-api-docs.html
- TensorRT Best Practices: https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html
- TensorRT engine hardware specificity: https://github.com/NVIDIA/TensorRT/issues/1920
- trtexec ONNX conversion: https://nvidia-jetson.piveral.com/jetson-orin-nano/how-to-convert-onnx-to-engine-on-jetson-orin-nano-dev-board/
- Torch-TensorRT JetPack 6.2: https://docs.pytorch.org/TensorRT/v2.10.0/getting_started/jetpack.html
- XFeatTensorRT: https://github.com/PranavNedunghat/XFeatTensorRT
- JetPack 6.2 Release Notes: https://docs.nvidia.com/jetson/archives/jetpack-archived/jetpack-62/release-notes/index.html
- Jetson Orin Nano Super: https://developer.nvidia.com/blog/nvidia-jetson-orin-nano-developer-kit-gets-a-super-boost/
- DLA on Jetson Orin: https://developer.nvidia.com/blog/maximizing-deep-learning-performance-on-nvidia-jetson-orin-with-dla/
- EfficientLoFTR (CVPR 2024): https://github.com/zju3dv/EfficientLoFTR
- EfficientLoFTR HuggingFace: https://huggingface.co/docs/transformers/en/model_doc/efficientloftr
- Coarse_LoFTR_TRT (TRT for embedded): https://github.com/Kolkir/Coarse_LoFTR_TRT
- Coarse_LoFTR_TRT paper: https://ar5iv.labs.arxiv.org/html/2202.00770
- LoFTR_TRT: https://github.com/Kolkir/LoFTR_TRT
- minGRU ("Were RNNs All We Needed?"): https://huggingface.co/papers/2410.01201
- minGRU PyTorch implementation: https://github.com/lucidrains/minGRU-pytorch
- LiteSAM paper (MinGRU details, Eqs 12-16): https://www.mdpi.com/2072-4292/17/19/3349
- DALGlue (UAV feature matching, 2025): https://www.nature.com/articles/s41598-025-21602-5
- All references from solution_draft03.md

## Related Artifacts

- AC Assessment: `_docs/00_research/gps_denied_nav/00_ac_assessment.md`
- Research artifacts (this assessment): `_docs/00_research/trt_engine_migration/`
- Previous research: `_docs/00_research/gps_denied_nav_v3/`
- Tech stack evaluation: `_docs/01_solution/tech_stack.md`
- Security analysis: `_docs/01_solution/security_analysis.md`