# TensorRT Migration Assessment — Jetson Orin Nano Super

## Target Hardware: Jetson Orin Nano Super

| Spec | Value |
|------|-------|
| GPU | Ampere, 1,024 CUDA cores, 32 Tensor Cores @ 1,020 MHz |
| AI Performance | 67 TOPS (sparse) / 33 TOPS (dense) / 17 FP16 TFLOPS |
| Memory | 8 GB LPDDR5 @ 102 GB/s (shared CPU/GPU) |
| JetPack | 6.2 |
| TensorRT | 10.3.0 |
| CUDA | 12.6 |
| cuDNN | 9.3 |
| Usable VRAM | ~6-7 GB (after OS/framework overhead) |

## General TRT vs ONNX Runtime on Jetson

- Native TensorRT is **2-4x faster** than PyTorch on Jetson
- ONNX Runtime with the TensorRT EP is **30% to 3x slower** than native TRT due to subgraph fallbacks
- FP16 is the sweet spot for the Jetson Orin Nano (Ampere Tensor Cores)
- INT8 can **regress** performance on ViT models (up to 2.7x slowdown on Orin Nano)
- Running multiple TRT engines concurrently causes large slowdowns (50ms → 300ms per thread) — sequential execution preferred

## Conversion Pipeline

Standard path: **PyTorch → ONNX → trtexec → TRT Engine**

Alternative: **torch-tensorrt** (`torch.compile(model, backend="tensorrt")`) — less mature for complex models.

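The standard path boils down to two steps: export the model to ONNX with `torch.onnx.export`, then compile the graph on the target device with `trtexec`. The helper below only assembles the `trtexec` command line as a sketch; the file names are illustrative, while `--onnx`, `--saveEngine`, and `--fp16` are real trtexec flags:

```python
def build_trtexec_cmd(onnx_path, engine_path, fp16=True):
    """Assemble a trtexec invocation for a fixed-shape ONNX model."""
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",          # input ONNX graph
        f"--saveEngine={engine_path}",  # serialized engine (hardware-specific)
    ]
    if fp16:
        cmd.append("--fp16")            # enable FP16 tactics on Ampere Tensor Cores
    return cmd

# e.g. subprocess.run(build_trtexec_cmd("superpoint.onnx", "superpoint.engine"), check=True)
```

Because the resulting engine is tied to the GPU and the TRT version, the command has to be run on the deployment device itself.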
## Per-Model Assessment

### 1. SuperPoint (Feature Extraction)

| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ✅ **Proven** |
| **Existing Implementations** | [yuefanhao/SuperPoint-SuperGlue-TensorRT](https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT) (367 stars), [fettahyildizz/superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt), [Op2Free/SuperPoint_TensorRT_Libtorch](https://github.com/Op2Free/SuperPoint_TensorRT_Libtorch) |
| **Conversion Path** | PyTorch → ONNX → trtexec FP16 |
| **Dynamic Shapes** | Fixed input resolution (e.g. 640×480 or 1600×longest) |
| **Precision** | FP16 recommended. No accuracy loss for keypoint detection. |
| **Estimated Jetson Perf** | ~20-40ms (FP16, estimated from desktop benchmarks scaled to Orin) |
| **Blocking Issues** | None |
| **Risk** | 🟢 Low |

### 2. LightGlue (Feature Matching)

| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ✅ **Proven with caveats** |
| **Existing Implementation** | [LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX) — explicitly supports TRT export. [Blog: 2-4x speedup](https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/) |
| **Conversion Path** | LightGlue-ONNX → trtexec FP16 |
| **Dynamic Shapes** | ⚠️ Fixed top-K keypoints (e.g. 2048). TRT TopK limit ≤ 3840. Variable keypoint counts are handled with padding + mask. |
| **Adaptive Stopping** | ❌ **Not TRT-compatible**. `torch.cond()` control flow is not exportable. Must use fixed-depth LightGlue. |
| **Attention Mechanism** | ⚠️ Custom `MultiHeadAttention` export needed. Cross-attention with different Q/K lengths can be problematic; LightGlue-ONNX handles this. |
| **Precision** | FP16 recommended. FP8 gives ~6x speedup but only on Ada/Hopper (not Ampere/Orin). |
| **Estimated Jetson Perf** | ~30-60ms (FP16, 2048 keypoints, fixed depth) |
| **Blocking Issues** | Must use fixed-depth mode (no adaptive stopping). TopK ≤ 3840. |
| **Risk** | 🟡 Medium (proven path exists, but fixed depth slightly reduces quality) |

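The padding + mask scheme from the Dynamic Shapes row can be sketched in a few lines: keypoints are truncated or zero-padded to a fixed K so the engine always sees the same tensor shape, and a boolean mask tells the matcher which rows are real. This is a minimal numpy sketch; the `(N, 2)` layout and the K=2048 default are assumptions, not LightGlue-ONNX's actual interface:

```python
import numpy as np

def pad_keypoints(kpts, k=2048):
    """Pad an (N, 2) keypoint array to a fixed (k, 2) shape plus validity mask."""
    n = min(len(kpts), k)                      # drop extras beyond the cap
    padded = np.zeros((k, 2), dtype=kpts.dtype)
    padded[:n] = kpts[:n]
    mask = np.zeros(k, dtype=bool)
    mask[:n] = True                            # True = real keypoint, False = padding
    return padded, mask
```

The mask is passed alongside the padded tensor so padded rows can be excluded from attention and from the final match scores.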
### 3. DINOv2 ViT-S/14 (Coarse Retrieval)

| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ⚠️ **Feasible but risky** |
| **Existing Issues** | [TRT #4537](https://github.com/NVIDIA/tensorrt/issues/4537): FMHA fusion failure for DINOv2. [NVIDIA Forums #312251](https://forums.developer.nvidia.com/t/dinov2-tensorrt-model-performance-issue/312251): Embedding quality degradation. |
| **Conversion Path** | PyTorch → ONNX (custom wrapper, exclude mask inputs) → trtexec FP16 |
| **Dynamic Shapes** | Fixed input (224×224 for ViT-S/14) |
| **Precision** | FP16 preferred. INT8 **not validated** for embedding quality. FP32 also shows degradation in some reports. |
| **torch-tensorrt** | No explicit ViT support; use ONNX → trtexec path |
| **Embedding Quality** | ⚠️ **Must validate**. Reports of degraded embeddings vs PyTorch/ONNX. Retrieval recall must be tested post-conversion. |
| **Estimated Jetson Perf** | ~15-35ms (FP16, estimated) — 1.68x speedup reported on Orin Nano Super |
| **Blocking Issues** | Potential embedding quality loss. FMHA fusion failure (workaround: disable FMHA plugin or use older TRT version). |
| **Risk** | 🟠 High (quality degradation must be measured before production use) |

### 4. LiteSAM (Satellite Fine Matching)

| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ❌ **No existing path** |
| **Existing Support** | No ONNX export. No TRT conversion. 5 GitHub stars, 4 commits. |
| **MinGRU Blocks** | Standard GRU ops → supported in ONNX and TRT. Not a blocker. |
| **TAIFormer Blocks** | Need verification — may have custom ops that block export. |
| **Variable Output** | Semi-dense matchers output variable match counts → use fixed max + padding + mask. |
| **Conversion Path** | Custom ONNX export (write `export_onnx.py` for LiteSAM) → trtexec FP16. Significant development effort. |
| **Estimated Jetson Perf** | ~60-120ms (FP16, estimated, if conversion succeeds) |
| **Blocking Issues** | No ONNX export exists. Must write custom export. Immature codebase may have non-exportable patterns. |
| **Risk** | 🔴 Very High (no prior art, immature codebase, custom work required) |

### 5. EfficientLoFTR (Fallback Matcher)

| Aspect | Assessment |
|--------|-----------|
| **TRT Feasibility** | ⚠️ **Partial — requires adaptation** |
| **Related Work** | [Kolkir/LoFTR_TRT](https://github.com/Kolkir/LoFTR_TRT) — TRT conversion for **original LoFTR** (not EfficientLoFTR). Has `export_onnx.py`. |
| **Architecture Diff** | EfficientLoFTR uses efficient attention and a different backbone vs original LoFTR. LoFTR_TRT scripts need adaptation. |
| **Alternative** | Use **original LoFTR** via LoFTR_TRT on Jetson. It has a proven TRT path. Trade-off: larger model, but TRT optimized. |
| **Conversion Path** | Option A: Adapt LoFTR_TRT for EfficientLoFTR. Option B: Use original LoFTR via LoFTR_TRT. |
| **Estimated Jetson Perf** | ~80-150ms @ 640×480 (original LoFTR TRT); ~300-500ms @ 1184×1184 |
| **Blocking Issues** | EfficientLoFTR-specific ONNX export not documented. Adaptation of LoFTR_TRT needed. |
| **Risk** | 🟠 High (adaptation required, but base path exists) |

## Recommended Migration Strategy

### Phase 1: Low-Risk, High-Impact (VO Pipeline)

Convert SuperPoint + LightGlue to TRT. Proven path, biggest latency win for VO.

| Model | Action | Effort | Expected Win |
|-------|--------|--------|-------------|
| SuperPoint | PyTorch → ONNX → TRT FP16 | Low (existing repos) | ~2-3x speedup |
| LightGlue | LightGlue-ONNX → TRT FP16 (fixed depth, 2048 kpts) | Low-Medium | ~2-4x speedup |

**VO latency on Jetson**: ~50-100ms (down from ~180ms PyTorch on a desktop-class GPU). Achievable.

### Phase 2: Medium-Risk (Coarse Retrieval)

Convert DINOv2 ViT-S/14 to TRT with careful quality validation.

| Model | Action | Effort | Expected Win |
|-------|--------|--------|-------------|
| DINOv2 ViT-S/14 | PyTorch → ONNX → TRT FP16. Validate retrieval recall. | Medium | ~1.5-2x speedup |

**Critical**: Must measure retrieval R@1/R@5 before and after TRT conversion. If embedding quality degrades, fall back to ONNX Runtime.

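The before/after check reduces to computing recall@k twice over the same ground-truth match indices: once with reference (PyTorch/ONNX) embeddings, once with TRT embeddings. A minimal cosine-similarity sketch, where the function name and array shapes are assumptions:

```python
import numpy as np

def recall_at_k(query_emb, db_emb, gt_idx, k=5):
    """Fraction of queries whose ground-truth db index appears in the top-k by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = db_emb / np.linalg.norm(db_emb, axis=1, keepdims=True)
    topk = np.argsort(-(q @ d.T), axis=1)[:, :k]   # k most similar db rows per query
    return float(np.mean([gt in row for gt, row in zip(gt_idx, topk)]))
```

If `recall_at_k(q_trt, db_trt, gt, k=1)` drops noticeably relative to the PyTorch baseline, that is the signal to fall back to ONNX Runtime.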
### Phase 3: High-Risk (Satellite Fine Matching)

Three options for satellite fine matching on Jetson:

**Option A: LoFTR via LoFTR_TRT (Recommended for Jetson)**
- Use original LoFTR with existing TRT conversion scripts
- Proven path, ~80-150ms @ 640×480 on Jetson
- Slightly lower accuracy than LiteSAM but reliable TRT path
- Use LiteSAM on desktop (PyTorch), LoFTR TRT on Jetson

**Option B: Custom LiteSAM TRT (High effort, high risk)**
- Write custom ONNX export for LiteSAM
- Convert to TRT
- Significant development effort with uncertain outcome
- Only worthwhile if LiteSAM's accuracy advantage is critical

**Option C: EfficientLoFTR TRT (Medium effort)**
- Adapt LoFTR_TRT export scripts for EfficientLoFTR
- Closer in architecture to LiteSAM than original LoFTR
- Medium effort, uncertain outcome

### Recommended: Option A for Jetson deployment

## Memory Budget on Jetson Orin Nano Super (8GB shared)

| Component | Memory (FP16 TRT) | Notes |
|-----------|-------------------|-------|
| OS + JetPack overhead | ~1.5 GB | Linux + CUDA + TRT runtime |
| SuperPoint TRT | ~200 MB | Smaller than PyTorch |
| LightGlue TRT | ~250 MB | Fixed-depth, 2048 kpts |
| DINOv2 ViT-S/14 TRT | ~150 MB | ViT-S is compact |
| LoFTR TRT (satellite) | ~300 MB | Original LoFTR |
| Satellite tile cache | ~200 MB | Reduced for Jetson |
| Working memory | ~500 MB | Image buffers, GTSAM, etc. |
| **Total** | **~3.1 GB** | **Well within 8GB** |
| **Headroom** | **~4.9 GB** | For tile cache expansion |

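The table's totals can be double-checked with a few lines (figures copied from the rows above):

```python
# Per-component budget from the table, in GB
budget_gb = {
    "OS + JetPack": 1.5, "SuperPoint TRT": 0.2, "LightGlue TRT": 0.25,
    "DINOv2 TRT": 0.15, "LoFTR TRT": 0.3, "Tile cache": 0.2, "Working memory": 0.5,
}
total_gb = sum(budget_gb.values())   # ~3.1 GB
headroom_gb = 8.0 - total_gb         # ~4.9 GB left on the 8 GB module
```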
## Architecture Implications

### Desktop vs Jetson Configuration

The solution should support **two runtime configurations**:

**Desktop (RTX 2060+)**: Current architecture — PyTorch/ONNX models, full feature set
- SuperPoint + LightGlue ONNX FP16
- DINOv2 ViT-S/14 PyTorch + GeM pooling
- LiteSAM PyTorch (EfficientLoFTR fallback)

**Jetson (Orin Nano Super)**: TRT-optimized, adapted feature set
- SuperPoint TRT FP16
- LightGlue TRT FP16 (fixed depth, 2048 keypoints)
- DINOv2 ViT-S/14 TRT FP16 + GeM pooling
- LoFTR TRT FP16 (replacing LiteSAM/EfficientLoFTR)

### Configuration Switch

Runtime selection via environment variable or config:

```
INFERENCE_BACKEND=tensorrt # or "pytorch" / "onnx"
```

Model loading layer abstracts backend:

```
get_feature_extractor(backend) → SuperPointTRT or SuperPointPyTorch
get_matcher(backend) → LightGlueTRT or LightGlueONNX
get_retrieval(backend) → DINOv2TRT or DINOv2PyTorch
get_satellite_matcher(backend) → LoFTRTRT or LiteSAMPyTorch
```

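A minimal sketch of that loading layer for one of the four factories (the classes are empty stand-ins for the real wrappers, and the ONNX variant is omitted for brevity):

```python
import os

# Empty stand-ins for the wrapper classes named in the pseudocode above
class SuperPointTRT: ...
class SuperPointPyTorch: ...

_FEATURE_EXTRACTORS = {
    "tensorrt": SuperPointTRT,
    "pytorch": SuperPointPyTorch,
}

def get_feature_extractor(backend=None):
    """Pick the implementation from an explicit argument or INFERENCE_BACKEND."""
    backend = backend or os.environ.get("INFERENCE_BACKEND", "pytorch")
    try:
        return _FEATURE_EXTRACTORS[backend]()
    except KeyError:
        raise ValueError(f"unknown inference backend: {backend!r}")
```

The other three factories follow the same dictionary-dispatch pattern, so adding a backend means registering one class per factory rather than editing call sites.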
## Key Constraints for TRT Deployment

1. **LightGlue fixed depth**: No adaptive stopping on TRT. Slight quality reduction but consistent latency.
2. **TopK ≤ 3840**: Keypoint count must be capped (recommend 2048 for Jetson memory).
3. **DINOv2 embedding validation required**: Must verify retrieval quality post-TRT conversion.
4. **LoFTR instead of LiteSAM on Jetson**: Different satellite fine matcher. May need a separate accuracy baseline.
5. **Sequential GPU execution**: Even more critical on Jetson (smaller GPU). No concurrent TRT engines.
6. **TRT engine files are hardware-specific**: Must build separate engines for the desktop GPU and the Jetson. Not portable.

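Constraint 5 can be enforced mechanically with a single process-wide lock wrapped around every engine invocation. A sketch, where `infer_fn` stands in for whatever executes a TRT context:

```python
import threading

_GPU_LOCK = threading.Lock()   # shared by every engine wrapper in the process

def run_serialized(infer_fn, *args, **kwargs):
    """Run one engine at a time, even when callers live on different threads."""
    with _GPU_LOCK:
        return infer_fn(*args, **kwargs)
```

This trades a little queueing latency for the predictable per-engine timings the benchmarks above assume.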
## Sources

- [Jetson Orin Nano Super Specs](https://developer.nvidia.com/embedded/jetson-modules)
- [LightGlue-ONNX TRT Blog](https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/)
- [SuperPoint-SuperGlue-TensorRT](https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT)
- [superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt)
- [LoFTR_TRT](https://github.com/Kolkir/LoFTR_TRT)
- [NVIDIA TRT #4537: DINOv2 FMHA](https://github.com/NVIDIA/tensorrt/issues/4537)
- [NVIDIA Forums: DINOv2 embedding degradation](https://forums.developer.nvidia.com/t/dinov2-tensorrt-model-performance-issue/312251)
- [FP8 LightGlue Blog](https://fabio-sim.github.io/blog/fp8-quantized-lightglue-tensorrt-nvidia-model-optimizer/)
- [LightGlue-with-FlashAttentionV2-TensorRT](https://github.com/) (Jetson Orin NX 8GB)
- [XFeatTensorRT](https://github.com/) (Jetson Orin NX 16GB)