add clarification to research methodology by including a step for solution comparison and user consultation

Oleksandr Bezdieniezhnykh
2026-03-17 18:43:57 +02:00
parent d764250f9a
commit b419e2c04a
# TensorRT Migration Assessment — Jetson Orin Nano Super
## Target Hardware: Jetson Orin Nano Super
| Spec | Value |
|------|-------|
| GPU | Ampere, 1,024 CUDA cores, 32 Tensor Cores @ 1,020 MHz |
| AI Performance | 67 TOPS (sparse) / 33 TOPS (dense) / 17 TFLOPS (FP16) |
| Memory | 8 GB LPDDR5 @ 102 GB/s (shared CPU/GPU) |
| JetPack | 6.2 |
| TensorRT | 10.3.0 |
| CUDA | 12.6 |
| cuDNN | 9.3 |
| Usable VRAM | ~6-7 GB (after OS/framework overhead) |
## General TRT vs ONNX Runtime on Jetson
- Native TensorRT is **2-4x faster** than PyTorch on Jetson
- ONNX Runtime with TensorRT EP is **30% to 3x slower** than native TRT due to subgraph fallbacks
- FP16 is the sweet spot for Jetson Orin Nano (Ampere Tensor Cores)
- INT8 can **regress** performance on ViT models (up to 2.7x slowdown on Orin Nano)
- Running multiple TRT engines concurrently causes large slowdowns (50ms → 300ms per thread) — sequential preferred
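The sequential-execution constraint above can be enforced with a simple lock around every engine call. A minimal sketch, with illustrative names (not TensorRT API) and a plain callable standing in for a real engine:

```python
import threading

# Hypothetical wrapper: serializes all inference so only one engine runs on
# the GPU at a time, avoiding the concurrent-engine slowdown noted above.
class SequentialEngineRunner:
    def __init__(self):
        self._gpu_lock = threading.Lock()

    def run(self, engine_fn, *inputs):
        # engine_fn is any callable that performs one inference call
        with self._gpu_lock:
            return engine_fn(*inputs)

runner = SequentialEngineRunner()
result = runner.run(lambda x: x * 2, 21)  # stand-in for a TRT engine call
```

Every model in the pipeline would share one runner instance, so VO matching and satellite matching never overlap on the GPU.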
## Conversion Pipeline
Standard path: **PyTorch → ONNX → trtexec → TRT Engine**
Alternative: **torch-tensorrt** (`torch.compile(model, backend="tensorrt")`) — less mature for complex models.
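The second step of the standard path can be scripted as a trtexec invocation; `--onnx`, `--saveEngine`, and `--fp16` are standard trtexec flags, while the file paths here are placeholders:

```python
# Compose the trtexec command for an FP16 engine build (step 2 of the
# standard path). Returns the argv list; run it with subprocess on target.
def trtexec_cmd(onnx_path: str, engine_path: str, fp16: bool = True) -> list:
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")
    return cmd

cmd = trtexec_cmd("superpoint.onnx", "superpoint_fp16.engine")
# e.g. subprocess.run(cmd, check=True) on the Jetson itself, since engines
# must be built on the hardware they will run on
```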
## Per-Model Assessment
### 1. SuperPoint (Feature Extraction)
| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ✅ **Proven** |
| **Existing Implementations** | [yuefanhao/SuperPoint-SuperGlue-TensorRT](https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT) (367 stars), [fettahyildizz/superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt), [Op2Free/SuperPoint_TensorRT_Libtorch](https://github.com/Op2Free/SuperPoint_TensorRT_Libtorch) |
| **Conversion Path** | PyTorch → ONNX → trtexec FP16 |
| **Dynamic Shapes** | Fixed input resolution (e.g. 640×480, or longest side resized to 1600) |
| **Precision** | FP16 recommended. No accuracy loss for keypoint detection. |
| **Estimated Jetson Perf** | ~20-40ms (FP16, estimated from desktop benchmarks scaled to Orin) |
| **Blocking Issues** | None |
| **Risk** | 🟢 Low |
### 2. LightGlue (Feature Matching)
| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ✅ **Proven with caveats** |
| **Existing Implementation** | [LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX) — explicitly supports TRT export. [Blog: 2-4x speedup](https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/) |
| **Conversion Path** | LightGlue-ONNX → trtexec FP16 |
| **Dynamic Shapes** | ⚠️ Fixed top-K keypoints (e.g. 2048). TRT TopK limit ≤ 3840. Variable keypoints replaced by padding + mask. |
| **Adaptive Stopping** | ❌ **Not TRT-compatible**. `torch.cond()` control flow not exportable. Must use fixed-depth LightGlue. |
| **Attention Mechanism** | ⚠️ Custom `MultiHeadAttention` export needed. Cross-attention with different Q/K lengths can be problematic. LightGlue-ONNX handles this. |
| **Precision** | FP16 recommended. FP8 gives ~6x speedup but only on Ada/Hopper (not Ampere/Orin). |
| **Estimated Jetson Perf** | ~30-60ms (FP16, 2048 keypoints, fixed depth) |
| **Blocking Issues** | Must use fixed-depth mode (no adaptive stopping). TopK ≤ 3840. |
| **Risk** | 🟡 Medium (proven path exists, but fixed-depth slightly reduces quality) |
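The padding + mask scheme from the table can be sketched in pure Python (real code would operate on tensors; the function name is illustrative): variable-length keypoint lists are capped at the TopK limit and padded to a fixed `max_k`, with a validity mask so the matcher can ignore pad entries.

```python
# Pad a variable-length keypoint list to a fixed size so the TRT engine
# always sees the same input shape. mask[i] is True for real keypoints.
def pad_keypoints(kpts, max_k, pad_value=0.0):
    if len(kpts) > max_k:
        kpts = kpts[:max_k]                    # cap at the TopK limit
    n = len(kpts)
    padded = list(kpts) + [(pad_value, pad_value)] * (max_k - n)
    mask = [True] * n + [False] * (max_k - n)
    return padded, mask

padded, mask = pad_keypoints([(10.0, 20.0), (30.0, 40.0)], max_k=4)
# padded has length 4; mask marks the first two entries as real
```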
### 3. DINOv2 ViT-S/14 (Coarse Retrieval)
| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ⚠️ **Feasible but risky** |
| **Existing Issues** | [TRT #4537](https://github.com/NVIDIA/tensorrt/issues/4537): FMHA fusion failure for DINOv2. [NVIDIA Forums #312251](https://forums.developer.nvidia.com/t/dinov2-tensorrt-model-performance-issue/312251): Embedding quality degradation. |
| **Conversion Path** | PyTorch → ONNX (custom wrapper, exclude mask inputs) → trtexec FP16 |
| **Dynamic Shapes** | Fixed input (224×224 for ViT-S/14) |
| **Precision** | FP16 preferred. INT8 **not validated** for embedding quality. FP32 also shows degradation in some reports. |
| **torch-tensorrt** | No explicit ViT support; use ONNX → trtexec path |
| **Embedding Quality** | ⚠️ **Must validate**. Reports of degraded embeddings vs PyTorch/ONNX. Retrieval recall must be tested post-conversion. |
| **Estimated Jetson Perf** | ~15-35ms (FP16, estimated) — 1.68x speedup reported on Orin Nano Super |
| **Blocking Issues** | Potential embedding quality loss. FMHA fusion failure (workaround: disable FMHA plugin or use older TRT version). |
| **Risk** | 🟠 High (quality degradation must be measured before production use) |
### 4. LiteSAM (Satellite Fine Matching)
| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ❌ **No existing path** |
| **Existing Support** | No ONNX export. No TRT conversion. 5 GitHub stars, 4 commits. |
| **MinGRU Blocks** | Standard GRU ops → supported in ONNX and TRT. Not a blocker. |
| **TAIFormer Blocks** | Need verification — may have custom ops that block export. |
| **Variable Output** | Semi-dense matchers output variable matches → use fixed max + padding + mask. |
| **Conversion Path** | Custom ONNX export (write `export_onnx.py` for LiteSAM) → trtexec FP16. Significant development effort. |
| **Estimated Jetson Perf** | ~60-120ms (FP16, estimated, if conversion succeeds) |
| **Blocking Issues** | No ONNX export exists. Must write custom export. Immature codebase may have non-exportable patterns. |
| **Risk** | 🔴 Very High (no prior art, immature codebase, custom work required) |
### 5. EfficientLoFTR (Fallback Matcher)
| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ⚠️ **Partial — requires adaptation** |
| **Related Work** | [Kolkir/LoFTR_TRT](https://github.com/Kolkir/LoFTR_TRT) — TRT conversion for **original LoFTR** (not EfficientLoFTR). Has `export_onnx.py`. |
| **Architecture Diff** | EfficientLoFTR uses efficient attention and different backbone vs original LoFTR. LoFTR_TRT scripts need adaptation. |
| **Alternative** | Use **original LoFTR** via LoFTR_TRT on Jetson. It has a proven TRT path. Trade-off: larger model, but TRT optimized. |
| **Conversion Path** | Option A: Adapt LoFTR_TRT for EfficientLoFTR. Option B: Use original LoFTR via LoFTR_TRT. |
| **Estimated Jetson Perf** | ~80-150ms @ 640×480 (original LoFTR TRT); ~300-500ms @ 1184×1184 |
| **Blocking Issues** | EfficientLoFTR-specific ONNX not documented. Adaptation of LoFTR_TRT needed. |
| **Risk** | 🟠 High (adaptation required, but base path exists) |
## Recommended Migration Strategy
### Phase 1: Low-Risk, High-Impact (VO Pipeline)
Convert SuperPoint + LightGlue to TRT. Proven path, biggest latency win for VO.
| Model | Action | Effort | Expected Win |
|-------|--------|--------|-------------|
| SuperPoint | PyTorch → ONNX → TRT FP16 | Low (existing repos) | ~2-3x speedup |
| LightGlue | LightGlue-ONNX → TRT FP16 (fixed depth, 2048 kpts) | Low-Medium | ~2-4x speedup |
**Estimated VO latency on Jetson**: ~50-100ms with TRT FP16, versus ~180ms for the PyTorch pipeline on a desktop-class GPU. Achievable.
### Phase 2: Medium-Risk (Coarse Retrieval)
Convert DINOv2 ViT-S/14 to TRT with careful quality validation.
| Model | Action | Effort | Expected Win |
|-------|--------|--------|-------------|
| DINOv2 ViT-S/14 | PyTorch → ONNX → TRT FP16. Validate retrieval recall. | Medium | ~1.5-2x speedup |
**Critical**: Must measure retrieval R@1/R@5 before and after TRT conversion. If embedding quality degrades, fall back to ONNX Runtime.
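The recall check can be a small standalone script: rank reference embeddings by cosine similarity for each query and test whether the true match lands in the top-k, once with PyTorch embeddings and once with TRT embeddings. A pure-Python sketch with illustrative names and toy data:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# R@k: fraction of queries whose true reference index is in the top-k
# most similar references.
def recall_at_k(queries, references, true_idx, k):
    hits = 0
    for q, t in zip(queries, true_idx):
        sims = [cosine(q, r) for r in references]
        topk = sorted(range(len(sims)), key=lambda i: -sims[i])[:k]
        hits += t in topk
    return hits / len(queries)

refs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
r1 = recall_at_k([[0.9, 0.1]], refs, true_idx=[0], k=1)
```

Run it on the same query/reference set for both backends; a drop in R@1/R@5 under TRT is the signal to fall back to ONNX Runtime.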
### Phase 3: High-Risk (Satellite Fine Matching)
Two options for satellite fine matching on Jetson:
**Option A: LoFTR via LoFTR_TRT (Recommended for Jetson)**
- Use original LoFTR with existing TRT conversion scripts
- Proven path, ~80-150ms @ 640×480 on Jetson
- Slightly lower accuracy than LiteSAM but reliable TRT path
- Use LiteSAM on desktop (PyTorch), LoFTR TRT on Jetson
**Option B: Custom LiteSAM TRT (High effort, high risk)**
- Write custom ONNX export for LiteSAM
- Convert to TRT
- Significant development effort with uncertain outcome
- Only worthwhile if LiteSAM accuracy advantage is critical
**Option C: EfficientLoFTR TRT (Medium effort)**
- Adapt LoFTR_TRT export scripts for EfficientLoFTR
- Closer architecture to LiteSAM than original LoFTR
- Medium effort, uncertain outcome
### Recommended: Option A for Jetson deployment
## Memory Budget on Jetson Orin Nano Super (8GB shared)
| Component | Memory (FP16 TRT) | Notes |
|-----------|-------------------|-------|
| OS + JetPack overhead | ~1.5 GB | Linux + CUDA + TRT runtime |
| SuperPoint TRT | ~200 MB | Smaller than PyTorch |
| LightGlue TRT | ~250 MB | Fixed-depth, 2048 kpts |
| DINOv2 ViT-S/14 TRT | ~150 MB | ViT-S is compact |
| LoFTR TRT (satellite) | ~300 MB | Original LoFTR |
| Satellite tile cache | ~200 MB | Reduced for Jetson |
| Working memory | ~500 MB | Image buffers, GTSAM, etc. |
| **Total** | **~3.1 GB** | **Well within 8GB** |
| **Headroom** | **~4.9 GB** | For tile cache expansion |
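The budget table can be cross-checked with simple arithmetic (values in GB, taken directly from the table above):

```python
# Memory budget from the table; summing confirms the total and headroom.
budget_gb = {
    "os_jetpack": 1.5,
    "superpoint_trt": 0.2,
    "lightglue_trt": 0.25,
    "dinov2_vits14_trt": 0.15,
    "loftr_trt": 0.3,
    "tile_cache": 0.2,
    "working_memory": 0.5,
}
total = sum(budget_gb.values())   # ~3.1 GB
headroom = 8.0 - total            # ~4.9 GB
```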
## Architecture Implications
### Desktop vs Jetson Configuration
The solution should support **two runtime configurations**:
**Desktop (RTX 2060+)**: Current architecture — PyTorch/ONNX models, full feature set
- SuperPoint + LightGlue ONNX FP16
- DINOv2 ViT-S/14 PyTorch + GeM pooling
- LiteSAM PyTorch (EfficientLoFTR fallback)
**Jetson (Orin Nano Super)**: TRT-optimized, adapted feature set
- SuperPoint TRT FP16
- LightGlue TRT FP16 (fixed depth, 2048 keypoints)
- DINOv2 ViT-S/14 TRT FP16 + GeM pooling
- LoFTR TRT FP16 (replacing LiteSAM/EfficientLoFTR)
### Configuration Switch
Runtime selection via environment variable or config:
```
INFERENCE_BACKEND=tensorrt # or "pytorch" / "onnx"
```
Model loading layer abstracts backend:
```
get_feature_extractor(backend) → SuperPointTRT or SuperPointPyTorch
get_matcher(backend) → LightGlueTRT or LightGlueONNX
get_retrieval(backend) → DINOv2TRT or DINOv2PyTorch
get_satellite_matcher(backend) → LoFTRTRT or LiteSAMPyTorch
```
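The abstraction above can be realized as a small factory keyed on `INFERENCE_BACKEND`. The class names mirror the pseudocode but are hypothetical stubs here, not real project classes; only `get_feature_extractor` is sketched, the other factories would follow the same pattern:

```python
import os

# Stub backends; in the real system these would wrap a TRT engine and a
# PyTorch model respectively.
class SuperPointTRT: ...
class SuperPointPyTorch: ...

_EXTRACTORS = {
    "tensorrt": SuperPointTRT,
    "pytorch": SuperPointPyTorch,
    "onnx": SuperPointPyTorch,  # desktop fallback in this sketch
}

def get_feature_extractor(backend=None):
    # Explicit argument wins; otherwise read the environment variable.
    backend = backend or os.environ.get("INFERENCE_BACKEND", "pytorch")
    try:
        return _EXTRACTORS[backend]()
    except KeyError:
        raise ValueError(f"unknown INFERENCE_BACKEND: {backend}")

extractor = get_feature_extractor("tensorrt")
```

Keeping the switch in one loading layer means the rest of the pipeline never branches on backend.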
## Key Constraints for TRT Deployment
1. **LightGlue fixed depth**: No adaptive stopping on TRT. Slight quality reduction but consistent latency.
2. **TopK ≤ 3840**: Keypoint count must be capped (recommend 2048 for Jetson memory).
3. **DINOv2 embedding validation required**: Must verify retrieval quality post-TRT conversion.
4. **LoFTR instead of LiteSAM on Jetson**: Different satellite fine matcher. May need separate accuracy baseline.
5. **Sequential GPU execution**: Even more critical on Jetson (smaller GPU). No concurrent TRT engines.
6. **TRT engine files are hardware-specific**: Must build separate engines for desktop GPU and Jetson. Not portable.
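Because of constraint 6, a common convention is to key each engine filename on the device and TRT version so desktop and Jetson builds never collide; the naming scheme below is an assumption, not project convention:

```python
# Encode device + TRT version + precision into the engine filename so an
# engine built on one GPU is never accidentally loaded on another.
def engine_filename(model, gpu_name, trt_version, precision="fp16"):
    gpu_tag = gpu_name.lower().replace(" ", "-")
    return f"{model}.{gpu_tag}.trt{trt_version}.{precision}.engine"

fn = engine_filename("superpoint", "Orin Nano Super", "10.3.0")
# → "superpoint.orin-nano-super.trt10.3.0.fp16.engine"
```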
## Sources
- [Jetson Orin Nano Super Specs](https://developer.nvidia.com/embedded/jetson-modules)
- [LightGlue-ONNX TRT Blog](https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/)
- [SuperPoint-SuperGlue-TensorRT](https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT)
- [superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt)
- [LoFTR_TRT](https://github.com/Kolkir/LoFTR_TRT)
- [NVIDIA TRT #4537: DINOv2 FMHA](https://github.com/NVIDIA/tensorrt/issues/4537)
- [NVIDIA Forums: DINOv2 embedding degradation](https://forums.developer.nvidia.com/t/dinov2-tensorrt-model-performance-issue/312251)
- [FP8 LightGlue Blog](https://fabio-sim.github.io/blog/fp8-quantized-lightglue-tensorrt-nvidia-model-optimizer/)
- [LightGlue-with-FlashAttentionV2-TensorRT](https://github.com/) (Jetson Orin NX 8GB)
- [XFeatTensorRT](https://github.com/) (Jetson Orin NX 16GB)