add clarification to research methodology by including a step for solution comparison and user consultation

This commit is contained in:
Oleksandr Bezdieniezhnykh
2026-03-17 18:43:57 +02:00
parent d764250f9a
commit b419e2c04a
35 changed files with 6030 additions and 0 deletions
# TensorRT Conversion and Deployment Research: SuperPoint + LightGlue
**Target:** Jetson Orin Nano Super deployment
**Context:** Visual navigation system with SuperPoint (~80ms) + LightGlue ONNX FP16 (~50–100ms) on RTX 2060
---
## Executive Summary
| Model | TRT Feasibility | Known Issues | Expected Speedup | Dynamic Shapes |
|-------|-----------------|--------------|------------------|----------------|
| **SuperPoint** | ✅ High | TopK ≤3840 limit; TRT version compatibility | 1.5–2x over PyTorch | Fixed via top-K padding |
| **LightGlue** | ✅ High | Adaptive stopping not supported; TopK limit | 2–4x over PyTorch | Fixed via top-K padding |
---
## 1. LightGlue-ONNX TensorRT Support
**Source:** [fabio-sim/LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX) (580 stars)
**Answer:** Yes. LightGlue-ONNX supports TensorRT via ONNX → TensorRT conversion.
- **Workflow:** Export PyTorch → ONNX (via `dynamo.py` / `lightglue-onnx` CLI), then convert ONNX → TensorRT engine with `trtexec`
- **Example:** `trtexec --onnx=weights/sift_lightglue.onnx --saveEngine=/srv/sift_lightglue.engine`
- **Optimizations:** Attention subgraph fusion, FP16, FP8 (Jan 2026), dynamic batch support
- **Note:** Uses ONNX Runtime with TensorRT EP or native TensorRT via `trtexec`-built engines
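The export-then-convert workflow above can be scripted; the sketch below assembles the `trtexec` invocation (the helper name and file paths are illustrative assumptions, not from the repo):

```python
import subprocess

def build_trtexec_cmd(onnx_path: str, engine_path: str, fp16: bool = True) -> list[str]:
    """Assemble a trtexec invocation for a fixed-shape LightGlue ONNX export."""
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
    ]
    if fp16:
        cmd.append("--fp16")  # FP16 is the main precision option on Orin (Ampere)
    return cmd

cmd = build_trtexec_cmd("weights/superpoint_lightglue.onnx",
                        "/srv/superpoint_lightglue.engine")
# Run on the target Jetson so the engine is tuned to that device:
# subprocess.run(cmd, check=True)
```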
**References:**
- https://github.com/fabio-sim/LightGlue-ONNX
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/
---
## 2. SuperPoint TensorRT Conversion
**Answer:** Yes. Several projects convert SuperPoint to TensorRT for Jetson.
| Project | Stars | Notes |
|---------|-------|-------|
| [yuefanhao/SuperPoint-SuperGlue-TensorRT](https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT) | 367 | SuperPoint + SuperGlue, C++ |
| [Op2Free/SuperPoint_TensorRT_Libtorch](https://github.com/Op2Free/SuperPoint_TensorRT_Libtorch) | - | PyTorch/Libtorch/TensorRT; ~25.5ms TRT vs ~37.4ms PyTorch on MX450 |
| [fettahyildizz/superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) | 3 | SuperPoint + LightGlue, TRT 8.5.2.2, dynamic I/O, Jetson-compatible |
**Typical flow:** PyTorch `.pth` → ONNX → TensorRT engine (`.engine`)
**Known issues:**
- TensorRT 8.5 (e.g. Xavier NX, JetPack 5.1.4): some ops (e.g. Flatten) may fail; TRT 8.6+ preferred
- TopK limit: K ≤ 3840 (see Section 4)
**References:**
- https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT
- https://github.com/Op2Free/SuperPoint_TensorRT_Libtorch
- https://github.com/fettahyildizz/superpoint_lightglue_tensorrt
- https://github.com/NVIDIA/TensorRT/issues/3918
- https://github.com/NVIDIA/TensorRT/issues/4255
---
## 3. LightGlue TensorRT Conversion
**Answer:** Yes. LightGlue has been converted to TensorRT.
**Projects:**
- **LightGlue-ONNX:** ONNX export + TRT via `trtexec`; 2–4x speedup over compiled PyTorch
- **qdLMF/LightGlue-with-FlashAttentionV2-TensorRT:** Custom FlashAttentionV2 TRT plugin; runs on Jetson Orin NX 8GB, TRT 8.5.2
- **fettahyildizz/superpoint_lightglue_tensorrt:** End-to-end SuperPoint + LightGlue TRT pipeline
**Challenges:**
1. Variable keypoint counts → handled by fixed top-K (e.g. 2048) and padding
2. Adaptive depth/width → not supported in ONNX/TRT export (control flow)
3. Attention fusion → custom `MultiHeadAttention` export for ONNX Runtime
4. TopK limit → K ≤ 3840
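Challenge 1 above (fixed top-K plus padding) can be sketched in a few lines of NumPy; the array shapes, zero pad value, and helper name are assumptions for illustration:

```python
import numpy as np

def pad_to_top_k(keypoints: np.ndarray, scores: np.ndarray, k: int = 2048):
    """Select the k highest-scoring keypoints; zero-pad if fewer were detected.

    Returns (k, 2) keypoints, (k,) scores, and a (k,) validity mask, so the
    matcher always sees a fixed-size input regardless of image content.
    """
    order = np.argsort(-scores)[:k]   # indices of the top-k scores
    n = len(order)
    kpts = np.zeros((k, 2), dtype=keypoints.dtype)
    sc = np.zeros(k, dtype=scores.dtype)
    mask = np.zeros(k, dtype=bool)
    kpts[:n] = keypoints[order]
    sc[:n] = scores[order]
    mask[:n] = True                   # padded slots stay False
    return kpts, sc, mask

kpts, sc, mask = pad_to_top_k(np.random.rand(500, 2), np.random.rand(500), k=2048)
```

Downstream, the mask lets match scores for padded slots be ignored rather than special-cased in the graph.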
**References:**
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/
- https://github.com/qdLMF/LightGlue-with-FlashAttentionV2-TensorRT
- https://github.com/fettahyildizz/superpoint_lightglue_tensorrt
---
## 4. Dynamic Shapes in TensorRT
**Answer:** TensorRT supports dynamic shapes, but LightGlue-ONNX uses fixed shapes for export.
**TensorRT dynamic shapes:**
- Use `-1` for variable dimensions at build time
- Optimization profiles: `(min, opt, max)` per dynamic dimension
- First inference after shape change can be slower (shape inference)
- https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-dynamic-shapes.html
**LightGlue-ONNX approach:**
- Variable keypoints replaced by fixed top-K (e.g. 2048)
- Extractor: top-K selection instead of confidence threshold
- Matcher: fixed `(B, N, D)` inputs; outputs unified tensors
- Enables symbolic shape inference and TRT compatibility
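When a dynamic batch dimension is kept at export, the `(min, opt, max)` profile maps onto `trtexec`'s shape flags; a minimal sketch, assuming a hypothetical input tensor named `kpts0` with a fixed 2048-keypoint dimension:

```python
def trtexec_profile_args(tensor: str, min_s: str, opt_s: str, max_s: str) -> list[str]:
    """Encode one (min, opt, max) optimization profile as trtexec shape flags."""
    return [
        f"--minShapes={tensor}:{min_s}",
        f"--optShapes={tensor}:{opt_s}",  # the shape TRT tunes kernels for
        f"--maxShapes={tensor}:{max_s}",
    ]

# Dynamic batch 1..4, keypoint count held fixed at 2048 (tensor name is an
# assumption about a particular export, not a LightGlue-ONNX guarantee):
args = trtexec_profile_args("kpts0", "1x2048x2", "1x2048x2", "4x2048x2")
```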
**References:**
- https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-dynamic-shapes.html
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/
---
## 5. Expected Speedup: TRT vs ONNX Runtime
| Scenario | Speedup | Notes |
|----------|---------|-------|
| TRT vs PyTorch (compiled) | 2–4x | Fabio's blog, SuperPoint+LightGlue |
| TRT FP8 vs FP32 | ~6x | NVIDIA Model Optimizer, Jan 2026 |
| ORT TensorRT EP vs native TRT | Often slower | 30% to 3x slower in reports; operator fallback, graph partitioning |
**Recommendation:** Prefer native TensorRT engines (via `trtexec`) over ONNX Runtime TensorRT EP for best performance.
**References:**
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/
- https://fabio-sim.github.io/blog/fp8-quantized-lightglue-tensorrt-nvidia-model-optimizer/
- https://github.com/microsoft/onnxruntime/issues/11356
- https://github.com/microsoft/onnxruntime/issues/24831
---
## 6. Attention Mechanisms in TensorRT
**Known issues:**
- **MultiHeadCrossAttentionPlugin:** Error grows with sequence length; acceptable up to ~128
- **key_padding_mask / attention_mask:** Not fully supported; alignment issues with PyTorch
- **Cross-attention (Q≠K/V length):** `bertQKVToContextPlugin` expects same Q,K,V sequence length; native TRT fusion may support via ONNX
- **Accuracy:** Some transformer/attention models show accuracy loss after TRT conversion
**LightGlue-ONNX mitigation:** Custom `MultiHeadAttention` export via `torch.library` → `com.microsoft::MultiHeadAttention` for ONNX Runtime; works with TRT EP.
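Given these accuracy reports, it is worth diffing TRT outputs against a reference backend after every conversion; a minimal sketch, with illustrative (not validated) FP16 tolerances:

```python
import numpy as np

def check_outputs(reference: np.ndarray, trt_out: np.ndarray,
                  atol: float = 1e-2, rtol: float = 1e-2) -> float:
    """Return the max absolute error; fail loudly if outputs diverge.

    FP16 engines typically need looser tolerances than FP32; the defaults
    here are starting points to tune per model, not validated thresholds.
    """
    max_err = float(np.max(np.abs(reference - trt_out)))
    if not np.allclose(reference, trt_out, atol=atol, rtol=rtol):
        raise ValueError(f"TRT output diverges from reference "
                         f"(max abs err {max_err:.4g})")
    return max_err

# Toy stand-ins for e.g. match-score tensors from ORT (reference) vs TRT:
err = check_outputs(np.ones((4, 4)), np.ones((4, 4)) + 1e-3)
```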
**References:**
- https://github.com/NVIDIA/TensorRT/issues/2674
- https://github.com/NVIDIA/TensorRT/issues/3619
- https://github.com/NVIDIA/TensorRT/issues/1483
- https://github.com/NVIDIA/TensorRT/issues/2587
---
## 7. LightGlue Adaptive Stopping and TRT
**Answer:** Not supported in current ONNX/TRT export.
- Adaptive depth/width uses control flow (`torch.cond()`)
- TorchDynamo ONNX exporter does not yet handle this
- Fabio's blog: “adaptive depth & width disabled” for TRT benchmarks
- Trade-off: fixed-depth LightGlue is faster but may use more compute on easy pairs
**Reference:**
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/
---
## 8. TensorRT TopK 3840 Limit
**Constraint:** TensorRT TopK operator has K ≤ 3840.
- SuperPoint/LightGlue use TopK for keypoint selection
- If `num_keypoints` > 3840, TRT build fails
- **Workaround:** Use `num_keypoints ≤ 3840` (e.g. 2048)
- NVIDIA is working on removing this limit (medium priority)
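A simple guard at export time avoids hitting this build failure; the helper below is a sketch (only the 3840 constant comes from the issue above):

```python
TRT_TOPK_MAX = 3840  # TensorRT TopK upper bound (NVIDIA/TensorRT#4244)

def clamp_num_keypoints(requested: int) -> int:
    """Clamp the extractor's top-K to what TensorRT can build."""
    if requested > TRT_TOPK_MAX:
        # Exporting with a larger K would only fail later at engine build
        # time, so cap it up front.
        return TRT_TOPK_MAX
    return requested

k = clamp_num_keypoints(2048)
```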
**References:**
- https://github.com/NVIDIA/TensorRT/issues/4244
- https://forums.developer.nvidia.com/t/topk-5-k-exceeds-the-maximum-value-allowed-3840/295903
---
## 9. Alternative TRT-Optimized Feature Matching Models
| Model | Project | Jetson Support | Notes |
|-------|---------|----------------|-------|
| **XFeat** | [pranavnedunghat/xfeattensorrt](https://github.com/pranavnedunghat/xfeattensorrt) | Orin NX 16GB, JetPack 6.0, TRT 8.6.2 | Sparse + dense matching |
| **LoFTR** | [Kolkir/LoFTR_TRT](https://github.com/Kolkir/LoFTR_TRT) | - | Detector-free, transformer-based |
| **SuperPoint + LightGlue** | [fettahyildizz/superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) | TRT 8.5.2.2 | Full pipeline, dynamic I/O |
**References:**
- https://github.com/pranavnedunghat/xfeattensorrt
- https://github.com/Kolkir/LoFTR_TRT
- https://github.com/qdLMF/LightGlue-with-FlashAttentionV2-TensorRT
---
## 10. Jetson Orin Nano Super Considerations
- **Specs:** 17 FP16 TFLOPS, 67 TOPS sparse INT8, 102 GB/s memory bandwidth
- **Engine build:** Build engines on the Jetson itself; TensorRT engines are tuned to the exact GPU and TRT version, so engines built on an x86 host are suboptimal at best and often not reusable
- **JetPack / TRT:** Orin Nano Super typically uses JetPack 6.x with TensorRT 8.6+
- **FP8:** Requires Hopper/Ada or newer; Orin uses Ampere, so FP8 not available; FP16 is the main option
---
## Summary by Model
### SuperPoint
| Aspect | Summary |
|--------|---------|
| **TRT feasibility** | ✅ High; multiple open-source implementations |
| **Known issues** | TopK ≤3840; TRT 8.5 op compatibility on older Jetson |
| **Expected performance** | ~1.5–2x vs PyTorch; ~25ms on MX450 (vs ~37ms PyTorch) |
| **Dynamic shapes** | Fixed via top-K (e.g. 2048); no variable keypoint count |
| **Sources** | [SuperPoint-SuperGlue-TensorRT](https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT), [superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) |
### LightGlue
| Aspect | Summary |
|--------|---------|
| **TRT feasibility** | ✅ High; LightGlue-ONNX + trtexec; dedicated TRT projects |
| **Known issues** | Adaptive stopping not supported; TopK ≤3840; attention accuracy on long sequences |
| **Expected performance** | 2–4x vs PyTorch; up to ~6x with FP8 (not on Orin) |
| **Dynamic shapes** | Fixed via top-K padding; no variable keypoint count |
| **Sources** | [LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX), [Fabio's blog](https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/), [superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) |
---
## Recommended Deployment Path for Jetson Orin Nano Super
1. Use **LightGlue-ONNX** for ONNX export of SuperPoint + LightGlue (fixed top-K ≤ 3840).
2. Convert ONNX → TensorRT with `trtexec` on the Jetson (or matching TRT version).
3. Use **fettahyildizz/superpoint_lightglue_tensorrt** as a reference for C++ deployment.
4. Use FP16; avoid FP8 on Orin (Ampere).
5. Expect a 2–4x speedup vs current ONNX Runtime FP16, depending on keypoint count and engine build.