add clarification to research methodology by including a step for solution comparison and user consultation

Oleksandr Bezdieniezhnykh
2026-03-17 18:43:57 +02:00
parent d764250f9a
commit b419e2c04a
# TensorRT Migration Assessment — Jetson Orin Nano Super
## Target Hardware: Jetson Orin Nano Super
| Spec | Value |
|------|-------|
| GPU | Ampere, 1,024 CUDA cores, 32 Tensor Cores @ 1,020 MHz |
| AI Performance | 67 TOPS (sparse) / 33 TOPS (dense) / 17 TFLOPS (FP16) |
| Memory | 8 GB LPDDR5 @ 102 GB/s (shared CPU/GPU) |
| JetPack | 6.2 |
| TensorRT | 10.3.0 |
| CUDA | 12.6 |
| cuDNN | 9.3 |
| Usable VRAM | ~6-7 GB (after OS/framework overhead) |
## General TRT vs ONNX Runtime on Jetson
- Native TensorRT is **2-4x faster** than PyTorch on Jetson
- ONNX Runtime with TensorRT EP is **30% to 3x slower** than native TRT due to subgraph fallbacks
- FP16 is the sweet spot for Jetson Orin Nano (Ampere Tensor Cores)
- INT8 can **regress** performance on ViT models (up to 2.7x slowdown on Orin Nano)
- Running multiple TRT engines concurrently causes large slowdowns (50ms → 300ms per thread) — sequential preferred
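The sequential-execution constraint above can be enforced with a simple lock around every engine call. A minimal sketch, with illustrative names (not TensorRT API) and a plain callable standing in for a real engine:

```python
import threading

# Hypothetical wrapper: serializes all inference so only one engine runs on
# the GPU at a time, avoiding the concurrent-engine slowdown noted above.
class SequentialEngineRunner:
    def __init__(self):
        self._gpu_lock = threading.Lock()

    def run(self, engine_fn, *inputs):
        # engine_fn is any callable that performs one inference call
        with self._gpu_lock:
            return engine_fn(*inputs)

runner = SequentialEngineRunner()
result = runner.run(lambda x: x * 2, 21)  # stand-in for a TRT engine call
```

Every model in the pipeline would share one runner instance, so VO matching and satellite matching never overlap on the GPU.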
## Conversion Pipeline
Standard path: **PyTorch → ONNX → trtexec → TRT Engine**
Alternative: **torch-tensorrt** (`torch.compile(model, backend="tensorrt")`) — less mature for complex models.
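The second step of the standard path can be scripted as a trtexec invocation; `--onnx`, `--saveEngine`, and `--fp16` are standard trtexec flags, while the file paths here are placeholders:

```python
# Compose the trtexec command for an FP16 engine build (step 2 of the
# standard path). Returns the argv list; run it with subprocess on target.
def trtexec_cmd(onnx_path: str, engine_path: str, fp16: bool = True) -> list:
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")
    return cmd

cmd = trtexec_cmd("superpoint.onnx", "superpoint_fp16.engine")
# e.g. subprocess.run(cmd, check=True) on the Jetson itself, since engines
# must be built on the hardware they will run on
```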
## Per-Model Assessment
### 1. SuperPoint (Feature Extraction)
| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ✅ **Proven** |
| **Existing Implementations** | [yuefanhao/SuperPoint-SuperGlue-TensorRT](https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT) (367 stars), [fettahyildizz/superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt), [Op2Free/SuperPoint_TensorRT_Libtorch](https://github.com/Op2Free/SuperPoint_TensorRT_Libtorch) |
| **Conversion Path** | PyTorch → ONNX → trtexec FP16 |
| **Dynamic Shapes** | Fixed input resolution (e.g. 640×480, or longest side resized to 1600) |
| **Precision** | FP16 recommended. No accuracy loss for keypoint detection. |
| **Estimated Jetson Perf** | ~20-40ms (FP16, estimated from desktop benchmarks scaled to Orin) |
| **Blocking Issues** | None |
| **Risk** | 🟢 Low |
### 2. LightGlue (Feature Matching)
| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ✅ **Proven with caveats** |
| **Existing Implementation** | [LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX) — explicitly supports TRT export. [Blog: 2-4x speedup](https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/) |
| **Conversion Path** | LightGlue-ONNX → trtexec FP16 |
| **Dynamic Shapes** | ⚠️ Fixed top-K keypoints (e.g. 2048). TRT TopK limit ≤ 3840. Variable keypoints replaced by padding + mask. |
| **Adaptive Stopping** | ❌ **Not TRT-compatible**. `torch.cond()` control flow not exportable. Must use fixed-depth LightGlue. |
| **Attention Mechanism** | ⚠️ Custom `MultiHeadAttention` export needed. Cross-attention with different Q/K lengths can be problematic. LightGlue-ONNX handles this. |
| **Precision** | FP16 recommended. FP8 gives ~6x speedup but only on Ada/Hopper (not Ampere/Orin). |
| **Estimated Jetson Perf** | ~30-60ms (FP16, 2048 keypoints, fixed depth) |
| **Blocking Issues** | Must use fixed-depth mode (no adaptive stopping). TopK ≤ 3840. |
| **Risk** | 🟡 Medium (proven path exists, but fixed-depth slightly reduces quality) |
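The padding + mask scheme from the table can be sketched in pure Python (real code would operate on tensors; the function name is illustrative): variable-length keypoint lists are capped at the TopK limit and padded to a fixed `max_k`, with a validity mask so the matcher can ignore pad entries.

```python
# Pad a variable-length keypoint list to a fixed size so the TRT engine
# always sees the same input shape. mask[i] is True for real keypoints.
def pad_keypoints(kpts, max_k, pad_value=0.0):
    if len(kpts) > max_k:
        kpts = kpts[:max_k]                    # cap at the TopK limit
    n = len(kpts)
    padded = list(kpts) + [(pad_value, pad_value)] * (max_k - n)
    mask = [True] * n + [False] * (max_k - n)
    return padded, mask

padded, mask = pad_keypoints([(10.0, 20.0), (30.0, 40.0)], max_k=4)
# padded has length 4; mask marks the first two entries as real
```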
### 3. DINOv2 ViT-S/14 (Coarse Retrieval)
| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ⚠️ **Feasible but risky** |
| **Existing Issues** | [TRT #4537](https://github.com/NVIDIA/tensorrt/issues/4537): FMHA fusion failure for DINOv2. [NVIDIA Forums #312251](https://forums.developer.nvidia.com/t/dinov2-tensorrt-model-performance-issue/312251): Embedding quality degradation. |
| **Conversion Path** | PyTorch → ONNX (custom wrapper, exclude mask inputs) → trtexec FP16 |
| **Dynamic Shapes** | Fixed input (224×224 for ViT-S/14) |
| **Precision** | FP16 preferred. INT8 **not validated** for embedding quality. FP32 also shows degradation in some reports. |
| **torch-tensorrt** | No explicit ViT support; use ONNX → trtexec path |
| **Embedding Quality** | ⚠️ **Must validate**. Reports of degraded embeddings vs PyTorch/ONNX. Retrieval recall must be tested post-conversion. |
| **Estimated Jetson Perf** | ~15-35ms (FP16, estimated) — 1.68x speedup reported on Orin Nano Super |
| **Blocking Issues** | Potential embedding quality loss. FMHA fusion failure (workaround: disable FMHA plugin or use older TRT version). |
| **Risk** | 🟠 High (quality degradation must be measured before production use) |
### 4. LiteSAM (Satellite Fine Matching)
| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ❌ **No existing path** |
| **Existing Support** | No ONNX export. No TRT conversion. 5 GitHub stars, 4 commits. |
| **MinGRU Blocks** | Standard GRU ops → supported in ONNX and TRT. Not a blocker. |
| **TAIFormer Blocks** | Need verification — may have custom ops that block export. |
| **Variable Output** | Semi-dense matchers output variable matches → use fixed max + padding + mask. |
| **Conversion Path** | Custom ONNX export (write `export_onnx.py` for LiteSAM) → trtexec FP16. Significant development effort. |
| **Estimated Jetson Perf** | ~60-120ms (FP16, estimated, if conversion succeeds) |
| **Blocking Issues** | No ONNX export exists. Must write custom export. Immature codebase may have non-exportable patterns. |
| **Risk** | 🔴 Very High (no prior art, immature codebase, custom work required) |
### 5. EfficientLoFTR (Fallback Matcher)
| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ⚠️ **Partial — requires adaptation** |
| **Related Work** | [Kolkir/LoFTR_TRT](https://github.com/Kolkir/LoFTR_TRT) — TRT conversion for **original LoFTR** (not EfficientLoFTR). Has `export_onnx.py`. |
| **Architecture Diff** | EfficientLoFTR uses efficient attention and different backbone vs original LoFTR. LoFTR_TRT scripts need adaptation. |
| **Alternative** | Use **original LoFTR** via LoFTR_TRT on Jetson. It has a proven TRT path. Trade-off: larger model, but TRT optimized. |
| **Conversion Path** | Option A: Adapt LoFTR_TRT for EfficientLoFTR. Option B: Use original LoFTR via LoFTR_TRT. |
| **Estimated Jetson Perf** | ~80-150ms @ 640×480 (original LoFTR TRT); ~300-500ms @ 1184×1184 |
| **Blocking Issues** | EfficientLoFTR-specific ONNX not documented. Adaptation of LoFTR_TRT needed. |
| **Risk** | 🟠 High (adaptation required, but base path exists) |
## Recommended Migration Strategy
### Phase 1: Low-Risk, High-Impact (VO Pipeline)
Convert SuperPoint + LightGlue to TRT. Proven path, biggest latency win for VO.
| Model | Action | Effort | Expected Win |
|-------|--------|--------|-------------|
| SuperPoint | PyTorch → ONNX → TRT FP16 | Low (existing repos) | ~2-3x speedup |
| LightGlue | LightGlue-ONNX → TRT FP16 (fixed depth, 2048 kpts) | Low-Medium | ~2-4x speedup |
**Estimated VO latency on Jetson**: ~50-100ms with TRT FP16, versus ~180ms for the PyTorch pipeline on a desktop-class GPU. Achievable.
### Phase 2: Medium-Risk (Coarse Retrieval)
Convert DINOv2 ViT-S/14 to TRT with careful quality validation.
| Model | Action | Effort | Expected Win |
|-------|--------|--------|-------------|
| DINOv2 ViT-S/14 | PyTorch → ONNX → TRT FP16. Validate retrieval recall. | Medium | ~1.5-2x speedup |
**Critical**: Must measure retrieval R@1/R@5 before and after TRT conversion. If embedding quality degrades, fall back to ONNX Runtime.
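The recall check can be a small standalone script: rank reference embeddings by cosine similarity for each query and test whether the true match lands in the top-k, once with PyTorch embeddings and once with TRT embeddings. A pure-Python sketch with illustrative names and toy data:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# R@k: fraction of queries whose true reference index is in the top-k
# most similar references.
def recall_at_k(queries, references, true_idx, k):
    hits = 0
    for q, t in zip(queries, true_idx):
        sims = [cosine(q, r) for r in references]
        topk = sorted(range(len(sims)), key=lambda i: -sims[i])[:k]
        hits += t in topk
    return hits / len(queries)

refs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
r1 = recall_at_k([[0.9, 0.1]], refs, true_idx=[0], k=1)
```

Run it on the same query/reference set for both backends; a drop in R@1/R@5 under TRT is the signal to fall back to ONNX Runtime.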
### Phase 3: High-Risk (Satellite Fine Matching)
Two options for satellite fine matching on Jetson:
**Option A: LoFTR via LoFTR_TRT (Recommended for Jetson)**
- Use original LoFTR with existing TRT conversion scripts
- Proven path, ~80-150ms @ 640×480 on Jetson
- Slightly lower accuracy than LiteSAM but reliable TRT path
- Use LiteSAM on desktop (PyTorch), LoFTR TRT on Jetson
**Option B: Custom LiteSAM TRT (High effort, high risk)**
- Write custom ONNX export for LiteSAM
- Convert to TRT
- Significant development effort with uncertain outcome
- Only worthwhile if LiteSAM accuracy advantage is critical
**Option C: EfficientLoFTR TRT (Medium effort)**
- Adapt LoFTR_TRT export scripts for EfficientLoFTR
- Closer architecture to LiteSAM than original LoFTR
- Medium effort, uncertain outcome
### Recommended: Option A for Jetson deployment
## Memory Budget on Jetson Orin Nano Super (8GB shared)
| Component | Memory (FP16 TRT) | Notes |
|-----------|-------------------|-------|
| OS + JetPack overhead | ~1.5 GB | Linux + CUDA + TRT runtime |
| SuperPoint TRT | ~200 MB | Smaller than PyTorch |
| LightGlue TRT | ~250 MB | Fixed-depth, 2048 kpts |
| DINOv2 ViT-S/14 TRT | ~150 MB | ViT-S is compact |
| LoFTR TRT (satellite) | ~300 MB | Original LoFTR |
| Satellite tile cache | ~200 MB | Reduced for Jetson |
| Working memory | ~500 MB | Image buffers, GTSAM, etc. |
| **Total** | **~3.1 GB** | **Well within 8GB** |
| **Headroom** | **~4.9 GB** | For tile cache expansion |
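The budget table can be cross-checked with simple arithmetic (values in GB, taken directly from the table above):

```python
# Memory budget from the table; summing confirms the total and headroom.
budget_gb = {
    "os_jetpack": 1.5,
    "superpoint_trt": 0.2,
    "lightglue_trt": 0.25,
    "dinov2_vits14_trt": 0.15,
    "loftr_trt": 0.3,
    "tile_cache": 0.2,
    "working_memory": 0.5,
}
total = sum(budget_gb.values())   # ~3.1 GB
headroom = 8.0 - total            # ~4.9 GB
```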
## Architecture Implications
### Desktop vs Jetson Configuration
The solution should support **two runtime configurations**:
**Desktop (RTX 2060+)**: Current architecture — PyTorch/ONNX models, full feature set
- SuperPoint + LightGlue ONNX FP16
- DINOv2 ViT-S/14 PyTorch + GeM pooling
- LiteSAM PyTorch (EfficientLoFTR fallback)
**Jetson (Orin Nano Super)**: TRT-optimized, adapted feature set
- SuperPoint TRT FP16
- LightGlue TRT FP16 (fixed depth, 2048 keypoints)
- DINOv2 ViT-S/14 TRT FP16 + GeM pooling
- LoFTR TRT FP16 (replacing LiteSAM/EfficientLoFTR)
### Configuration Switch
Runtime selection via environment variable or config:
```
INFERENCE_BACKEND=tensorrt # or "pytorch" / "onnx"
```
Model loading layer abstracts backend:
```
get_feature_extractor(backend) → SuperPointTRT or SuperPointPyTorch
get_matcher(backend) → LightGlueTRT or LightGlueONNX
get_retrieval(backend) → DINOv2TRT or DINOv2PyTorch
get_satellite_matcher(backend) → LoFTRTRT or LiteSAMPyTorch
```
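The abstraction above can be realized as a small factory keyed on `INFERENCE_BACKEND`. The class names mirror the pseudocode but are hypothetical stubs here, not real project classes; only `get_feature_extractor` is sketched, the other factories would follow the same pattern:

```python
import os

# Stub backends; in the real system these would wrap a TRT engine and a
# PyTorch model respectively.
class SuperPointTRT: ...
class SuperPointPyTorch: ...

_EXTRACTORS = {
    "tensorrt": SuperPointTRT,
    "pytorch": SuperPointPyTorch,
    "onnx": SuperPointPyTorch,  # desktop fallback in this sketch
}

def get_feature_extractor(backend=None):
    # Explicit argument wins; otherwise read the environment variable.
    backend = backend or os.environ.get("INFERENCE_BACKEND", "pytorch")
    try:
        return _EXTRACTORS[backend]()
    except KeyError:
        raise ValueError(f"unknown INFERENCE_BACKEND: {backend}")

extractor = get_feature_extractor("tensorrt")
```

Keeping the switch in one loading layer means the rest of the pipeline never branches on backend.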
## Key Constraints for TRT Deployment
1. **LightGlue fixed depth**: No adaptive stopping on TRT. Slight quality reduction but consistent latency.
2. **TopK ≤ 3840**: Keypoint count must be capped (recommend 2048 for Jetson memory).
3. **DINOv2 embedding validation required**: Must verify retrieval quality post-TRT conversion.
4. **LoFTR instead of LiteSAM on Jetson**: Different satellite fine matcher. May need separate accuracy baseline.
5. **Sequential GPU execution**: Even more critical on Jetson (smaller GPU). No concurrent TRT engines.
6. **TRT engine files are hardware-specific**: Must build separate engines for desktop GPU and Jetson. Not portable.
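Because of constraint 6, a common convention is to key each engine filename on the device and TRT version so desktop and Jetson builds never collide; the naming scheme below is an assumption, not project convention:

```python
# Encode device + TRT version + precision into the engine filename so an
# engine built on one GPU is never accidentally loaded on another.
def engine_filename(model, gpu_name, trt_version, precision="fp16"):
    gpu_tag = gpu_name.lower().replace(" ", "-")
    return f"{model}.{gpu_tag}.trt{trt_version}.{precision}.engine"

fn = engine_filename("superpoint", "Orin Nano Super", "10.3.0")
# → "superpoint.orin-nano-super.trt10.3.0.fp16.engine"
```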
## Sources
- [Jetson Orin Nano Super Specs](https://developer.nvidia.com/embedded/jetson-modules)
- [LightGlue-ONNX TRT Blog](https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/)
- [SuperPoint-SuperGlue-TensorRT](https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT)
- [superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt)
- [LoFTR_TRT](https://github.com/Kolkir/LoFTR_TRT)
- [NVIDIA TRT #4537: DINOv2 FMHA](https://github.com/NVIDIA/tensorrt/issues/4537)
- [NVIDIA Forums: DINOv2 embedding degradation](https://forums.developer.nvidia.com/t/dinov2-tensorrt-model-performance-issue/312251)
- [FP8 LightGlue Blog](https://fabio-sim.github.io/blog/fp8-quantized-lightglue-tensorrt-nvidia-model-optimizer/)
- [LightGlue-with-FlashAttentionV2-TensorRT](https://github.com/) (Jetson Orin NX 8GB)
- [XFeatTensorRT](https://github.com/) (Jetson Orin NX 16GB)