add clarification to research methodology by including a step for solution comparison and user consultation

This commit is contained in:
Oleksandr Bezdieniezhnykh
2026-03-17 18:43:57 +02:00
parent d764250f9a
commit b419e2c04a
35 changed files with 6030 additions and 0 deletions
# TensorRT Conversion and Deployment Research: SuperPoint + LightGlue
**Target:** Jetson Orin Nano Super deployment
**Context:** Visual navigation system with SuperPoint (~80ms) + LightGlue ONNX FP16 (~50–100ms) on RTX 2060
---
## Executive Summary
| Model | TRT Feasibility | Known Issues | Expected Speedup | Dynamic Shapes |
|-------|-----------------|--------------|------------------|----------------|
| **SuperPoint** | ✅ High | TopK ≤3840 limit; TRT version compatibility | 1.5–2x over PyTorch | Fixed via top-K padding |
| **LightGlue** | ✅ High | Adaptive stopping not supported; TopK limit | 2–4x over PyTorch | Fixed via top-K padding |
---
## 1. LightGlue-ONNX TensorRT Support
**Source:** [fabio-sim/LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX) (580 stars)
**Answer:** Yes. LightGlue-ONNX supports TensorRT via ONNX → TensorRT conversion.
- **Workflow:** Export PyTorch → ONNX (via `dynamo.py` / `lightglue-onnx` CLI), then convert ONNX → TensorRT engine with `trtexec`
- **Example:** `trtexec --onnx=weights/sift_lightglue.onnx --saveEngine=/srv/sift_lightglue.engine`
- **Optimizations:** Attention subgraph fusion, FP16, FP8 (Jan 2026), dynamic batch support
- **Note:** Uses ONNX Runtime with TensorRT EP or native TensorRT via `trtexec`-built engines
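The export-then-convert workflow above can be scripted; the sketch below assembles the `trtexec` invocation (the helper name and file paths are illustrative assumptions, not from the repo):

```python
import subprocess

def build_trtexec_cmd(onnx_path: str, engine_path: str, fp16: bool = True) -> list[str]:
    """Assemble a trtexec invocation for a fixed-shape LightGlue ONNX export."""
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
    ]
    if fp16:
        cmd.append("--fp16")  # FP16 is the main precision option on Orin (Ampere)
    return cmd

cmd = build_trtexec_cmd("weights/superpoint_lightglue.onnx",
                        "/srv/superpoint_lightglue.engine")
# Run on the target Jetson so the engine is tuned to that device:
# subprocess.run(cmd, check=True)
```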
**References:**
- https://github.com/fabio-sim/LightGlue-ONNX
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/
---
## 2. SuperPoint TensorRT Conversion
**Answer:** Yes. Several projects convert SuperPoint to TensorRT for Jetson.
| Project | Stars | Notes |
|---------|-------|-------|
| [yuefanhao/SuperPoint-SuperGlue-TensorRT](https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT) | 367 | SuperPoint + SuperGlue, C++ |
| [Op2Free/SuperPoint_TensorRT_Libtorch](https://github.com/Op2Free/SuperPoint_TensorRT_Libtorch) | - | PyTorch/Libtorch/TensorRT; ~25.5ms TRT vs ~37.4ms PyTorch on MX450 |
| [fettahyildizz/superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) | 3 | SuperPoint + LightGlue, TRT 8.5.2.2, dynamic I/O, Jetson-compatible |
**Typical flow:** PyTorch `.pth` → ONNX → TensorRT engine (`.engine`)
**Known issues:**
- TensorRT 8.5 (e.g. Xavier NX, JetPack 5.1.4): some ops (e.g. Flatten) may fail; TRT 8.6+ preferred
- TopK limit: K ≤ 3840 (see Section 4)
**References:**
- https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT
- https://github.com/Op2Free/SuperPoint_TensorRT_Libtorch
- https://github.com/fettahyildizz/superpoint_lightglue_tensorrt
- https://github.com/NVIDIA/TensorRT/issues/3918
- https://github.com/NVIDIA/TensorRT/issues/4255
---
## 3. LightGlue TensorRT Conversion
**Answer:** Yes. LightGlue has been converted to TensorRT.
**Projects:**
- **LightGlue-ONNX:** ONNX export + TRT via `trtexec`; 2–4x speedup over compiled PyTorch
- **qdLMF/LightGlue-with-FlashAttentionV2-TensorRT:** Custom FlashAttentionV2 TRT plugin; runs on Jetson Orin NX 8GB, TRT 8.5.2
- **fettahyildizz/superpoint_lightglue_tensorrt:** End-to-end SuperPoint + LightGlue TRT pipeline
**Challenges:**
1. Variable keypoint counts → handled by fixed top-K (e.g. 2048) and padding
2. Adaptive depth/width → not supported in ONNX/TRT export (control flow)
3. Attention fusion → custom `MultiHeadAttention` export for ONNX Runtime
4. TopK limit → K ≤ 3840
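Challenge 1 above (fixed top-K plus padding) can be sketched in a few lines of NumPy; the array shapes, zero pad value, and helper name are assumptions for illustration:

```python
import numpy as np

def pad_to_top_k(keypoints: np.ndarray, scores: np.ndarray, k: int = 2048):
    """Select the k highest-scoring keypoints; zero-pad if fewer were detected.

    Returns (k, 2) keypoints, (k,) scores, and a (k,) validity mask, so the
    matcher always sees a fixed-size input regardless of image content.
    """
    order = np.argsort(-scores)[:k]   # indices of the top-k scores
    n = len(order)
    kpts = np.zeros((k, 2), dtype=keypoints.dtype)
    sc = np.zeros(k, dtype=scores.dtype)
    mask = np.zeros(k, dtype=bool)
    kpts[:n] = keypoints[order]
    sc[:n] = scores[order]
    mask[:n] = True                   # padded slots stay False
    return kpts, sc, mask

kpts, sc, mask = pad_to_top_k(np.random.rand(500, 2), np.random.rand(500), k=2048)
```

Downstream, the mask lets match scores for padded slots be ignored rather than special-cased in the graph.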
**References:**
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/
- https://github.com/qdLMF/LightGlue-with-FlashAttentionV2-TensorRT
- https://github.com/fettahyildizz/superpoint_lightglue_tensorrt
---
## 4. Dynamic Shapes in TensorRT
**Answer:** TensorRT supports dynamic shapes, but LightGlue-ONNX uses fixed shapes for export.
**TensorRT dynamic shapes:**
- Use `-1` for variable dimensions at build time
- Optimization profiles: `(min, opt, max)` per dynamic dimension
- First inference after shape change can be slower (shape inference)
- https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-dynamic-shapes.html
**LightGlue-ONNX approach:**
- Variable keypoints replaced by fixed top-K (e.g. 2048)
- Extractor: top-K selection instead of confidence threshold
- Matcher: fixed `(B, N, D)` inputs; outputs unified tensors
- Enables symbolic shape inference and TRT compatibility
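When a dynamic batch dimension is kept at export, the `(min, opt, max)` profile maps onto `trtexec`'s shape flags; a minimal sketch, assuming a hypothetical input tensor named `kpts0` with a fixed 2048-keypoint dimension:

```python
def trtexec_profile_args(tensor: str, min_s: str, opt_s: str, max_s: str) -> list[str]:
    """Encode one (min, opt, max) optimization profile as trtexec shape flags."""
    return [
        f"--minShapes={tensor}:{min_s}",
        f"--optShapes={tensor}:{opt_s}",  # the shape TRT tunes kernels for
        f"--maxShapes={tensor}:{max_s}",
    ]

# Dynamic batch 1..4, keypoint count held fixed at 2048 (tensor name is an
# assumption about a particular export, not a LightGlue-ONNX guarantee):
args = trtexec_profile_args("kpts0", "1x2048x2", "1x2048x2", "4x2048x2")
```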
**References:**
- https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-dynamic-shapes.html
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/
---
## 5. Expected Speedup: TRT vs ONNX Runtime
| Scenario | Speedup | Notes |
|----------|---------|-------|
| TRT vs PyTorch (compiled) | 2–4x | Fabio's blog, SuperPoint+LightGlue |
| TRT FP8 vs FP32 | ~6x | NVIDIA Model Optimizer, Jan 2026 |
| ORT TensorRT EP vs native TRT | Often slower | 30% to 3x slower in reports; operator fallback, graph partitioning |
**Recommendation:** Prefer native TensorRT engines (via `trtexec`) over ONNX Runtime TensorRT EP for best performance.
**References:**
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/
- https://fabio-sim.github.io/blog/fp8-quantized-lightglue-tensorrt-nvidia-model-optimizer/
- https://github.com/microsoft/onnxruntime/issues/11356
- https://github.com/microsoft/onnxruntime/issues/24831
---
## 6. Attention Mechanisms in TensorRT
**Known issues:**
- **MultiHeadCrossAttentionPlugin:** Error grows with sequence length; acceptable up to ~128
- **key_padding_mask / attention_mask:** Not fully supported; alignment issues with PyTorch
- **Cross-attention (Q≠K/V length):** `bertQKVToContextPlugin` expects same Q,K,V sequence length; native TRT fusion may support via ONNX
- **Accuracy:** Some transformer/attention models show accuracy loss after TRT conversion
**LightGlue-ONNX mitigation:** Custom `MultiHeadAttention` export via `torch.library` → `com.microsoft::MultiHeadAttention` for ONNX Runtime; works with TRT EP.
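Given these accuracy reports, it is worth diffing TRT outputs against a reference backend after every conversion; a minimal sketch, with illustrative (not validated) FP16 tolerances:

```python
import numpy as np

def check_outputs(reference: np.ndarray, trt_out: np.ndarray,
                  atol: float = 1e-2, rtol: float = 1e-2) -> float:
    """Return the max absolute error; fail loudly if outputs diverge.

    FP16 engines typically need looser tolerances than FP32; the defaults
    here are starting points to tune per model, not validated thresholds.
    """
    max_err = float(np.max(np.abs(reference - trt_out)))
    if not np.allclose(reference, trt_out, atol=atol, rtol=rtol):
        raise ValueError(f"TRT output diverges from reference "
                         f"(max abs err {max_err:.4g})")
    return max_err

# Toy stand-ins for e.g. match-score tensors from ORT (reference) vs TRT:
err = check_outputs(np.ones((4, 4)), np.ones((4, 4)) + 1e-3)
```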
**References:**
- https://github.com/NVIDIA/TensorRT/issues/2674
- https://github.com/NVIDIA/TensorRT/issues/3619
- https://github.com/NVIDIA/TensorRT/issues/1483
- https://github.com/NVIDIA/TensorRT/issues/2587
---
## 7. LightGlue Adaptive Stopping and TRT
**Answer:** Not supported in current ONNX/TRT export.
- Adaptive depth/width uses control flow (`torch.cond()`)
- TorchDynamo ONNX exporter does not yet handle this
- Fabio's blog: “adaptive depth & width disabled” for TRT benchmarks
- Trade-off: fixed-depth LightGlue is faster but may use more compute on easy pairs
**Reference:**
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/
---
## 8. TensorRT TopK 3840 Limit
**Constraint:** TensorRT TopK operator has K ≤ 3840.
- SuperPoint/LightGlue use TopK for keypoint selection
- If `num_keypoints` > 3840, TRT build fails
- **Workaround:** Use `num_keypoints ≤ 3840` (e.g. 2048)
- NVIDIA is working on removing this limit (medium priority)
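A simple guard at export time avoids hitting this build failure; the helper below is a sketch (only the 3840 constant comes from the issue above):

```python
TRT_TOPK_MAX = 3840  # TensorRT TopK upper bound (NVIDIA/TensorRT#4244)

def clamp_num_keypoints(requested: int) -> int:
    """Clamp the extractor's top-K to what TensorRT can build."""
    if requested > TRT_TOPK_MAX:
        # Exporting with a larger K would only fail later at engine build
        # time, so cap it up front.
        return TRT_TOPK_MAX
    return requested

k = clamp_num_keypoints(2048)
```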
**References:**
- https://github.com/NVIDIA/TensorRT/issues/4244
- https://forums.developer.nvidia.com/t/topk-5-k-exceeds-the-maximum-value-allowed-3840/295903
---
## 9. Alternative TRT-Optimized Feature Matching Models
| Model | Project | Jetson Support | Notes |
|-------|---------|----------------|-------|
| **XFeat** | [pranavnedunghat/xfeattensorrt](https://github.com/pranavnedunghat/xfeattensorrt) | Orin NX 16GB, JetPack 6.0, TRT 8.6.2 | Sparse + dense matching |
| **LoFTR** | [Kolkir/LoFTR_TRT](https://github.com/Kolkir/LoFTR_TRT) | - | Detector-free, transformer-based |
| **SuperPoint + LightGlue** | [fettahyildizz/superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) | TRT 8.5.2.2 | Full pipeline, dynamic I/O |
**References:**
- https://github.com/pranavnedunghat/xfeattensorrt
- https://github.com/Kolkir/LoFTR_TRT
- https://github.com/qdLMF/LightGlue-with-FlashAttentionV2-TensorRT
---
## 10. Jetson Orin Nano Super Considerations
- **Specs:** 17 FP16 TFLOPS, 67 TOPS sparse INT8, 102 GB/s memory bandwidth
- **Engine build:** Build engines on the Jetson itself; TensorRT engines are tuned to the exact GPU and TRT version, so engines built on an x86 host are suboptimal at best and often not reusable
- **JetPack / TRT:** Orin Nano Super typically uses JetPack 6.x with TensorRT 8.6+
- **FP8:** Requires Hopper/Ada or newer; Orin uses Ampere, so FP8 not available; FP16 is the main option
---
## Summary by Model
### SuperPoint
| Aspect | Summary |
|--------|---------|
| **TRT feasibility** | ✅ High; multiple open-source implementations |
| **Known issues** | TopK ≤3840; TRT 8.5 op compatibility on older Jetson |
| **Expected performance** | ~1.5–2x vs PyTorch; ~25ms on MX450 (vs ~37ms PyTorch) |
| **Dynamic shapes** | Fixed via top-K (e.g. 2048); no variable keypoint count |
| **Sources** | [SuperPoint-SuperGlue-TensorRT](https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT), [superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) |
### LightGlue
| Aspect | Summary |
|--------|---------|
| **TRT feasibility** | ✅ High; LightGlue-ONNX + trtexec; dedicated TRT projects |
| **Known issues** | Adaptive stopping not supported; TopK ≤3840; attention accuracy on long sequences |
| **Expected performance** | 2–4x vs PyTorch; up to ~6x with FP8 (not on Orin) |
| **Dynamic shapes** | Fixed via top-K padding; no variable keypoint count |
| **Sources** | [LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX), [Fabio's blog](https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/), [superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) |
---
## Recommended Deployment Path for Jetson Orin Nano Super
1. Use **LightGlue-ONNX** for ONNX export of SuperPoint + LightGlue (fixed top-K ≤ 3840).
2. Convert ONNX → TensorRT with `trtexec` on the Jetson (or matching TRT version).
3. Use **fettahyildizz/superpoint_lightglue_tensorrt** as a reference for C++ deployment.
4. Use FP16; avoid FP8 on Orin (Ampere).
5. Expect a 2–4x speedup vs current ONNX Runtime FP16, depending on keypoint count and engine build.