# TensorRT Conversion and Deployment Research: SuperPoint + LightGlue

**Target:** Jetson Orin Nano Super deployment
**Context:** Visual navigation system with SuperPoint (~80ms) + LightGlue ONNX FP16 (~50–100ms) on RTX 2060

---

## Executive Summary

| Model | TRT Feasibility | Known Issues | Expected Speedup | Dynamic Shapes |
|-------|-----------------|--------------|------------------|----------------|
| **SuperPoint** | ✅ High | TopK ≤3840 limit; TRT version compatibility | 1.5–2x over PyTorch | Fixed via top-K padding |
| **LightGlue** | ✅ High | Adaptive stopping not supported; TopK limit | 2–4x over PyTorch | Fixed via top-K padding |

---

## 1. LightGlue-ONNX TensorRT Support

**Source:** [fabio-sim/LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX) (580 stars)

**Answer:** Yes. LightGlue-ONNX supports TensorRT via ONNX → TensorRT conversion.

- **Workflow:** Export PyTorch → ONNX (via `dynamo.py` / `lightglue-onnx` CLI), then convert ONNX → TensorRT engine with `trtexec`
- **Example:** `trtexec --onnx=weights/sift_lightglue.onnx --saveEngine=/srv/sift_lightglue.engine`
- **Optimizations:** Attention subgraph fusion, FP16, FP8 (Jan 2026), dynamic batch support
- **Note:** Uses either ONNX Runtime with the TensorRT EP, or native TensorRT via `trtexec`-built engines

**References:**
- https://github.com/fabio-sim/LightGlue-ONNX
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/

---

## 2. SuperPoint TensorRT Conversion

**Answer:** Yes. Several projects convert SuperPoint to TensorRT for Jetson.
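The "fixed via top-K padding" strategy noted in the summary table is the trick these conversions share: always emit exactly K keypoint slots so tensor shapes stay static. A minimal NumPy sketch of that idea (a hypothetical helper for illustration, not code from any of the cited repositories, which do the equivalent selection inside the ONNX graph):

```python
import numpy as np

def select_topk_keypoints(keypoints, scores, k=2048):
    """Keep a fixed number of keypoint slots so tensor shapes stay static.

    If fewer than k keypoints were detected, pad with zeros and mark the
    padded slots invalid; if more, keep the k highest-scoring ones. Choosing
    k <= 3840 also stays under TensorRT's TopK limit (Section 8).
    """
    n = scores.shape[0]
    if n >= k:
        # Indices of the k largest scores, then sorted descending.
        idx = np.argpartition(-scores, k - 1)[:k]
        idx = idx[np.argsort(-scores[idx])]
        return keypoints[idx], scores[idx], np.ones(k, dtype=bool)
    pad = k - n
    kps = np.concatenate([keypoints, np.zeros((pad, 2), keypoints.dtype)])
    scs = np.concatenate([scores, np.zeros(pad, scores.dtype)])
    valid = np.concatenate([np.ones(n, dtype=bool), np.zeros(pad, dtype=bool)])
    return kps, scs, valid
```

Downstream code then carries the validity mask instead of relying on a variable keypoint count.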
| Project | Stars | Notes |
|---------|-------|-------|
| [yuefanhao/SuperPoint-SuperGlue-TensorRT](https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT) | 367 | SuperPoint + SuperGlue, C++ |
| [Op2Free/SuperPoint_TensorRT_Libtorch](https://github.com/Op2Free/SuperPoint_TensorRT_Libtorch) | - | PyTorch/Libtorch/TensorRT; ~25.5ms TRT vs ~37.4ms PyTorch on MX450 |
| [fettahyildizz/superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) | 3 | SuperPoint + LightGlue, TRT 8.5.2.2, dynamic I/O, Jetson-compatible |

**Typical flow:** PyTorch `.pth` → ONNX → TensorRT engine (`.engine`)

**Known issues:**
- TensorRT 8.5 (e.g. Xavier NX, JetPack 5.1.4): some ops (e.g. Flatten) may fail; TRT 8.6+ preferred
- TopK limit: K ≤ 3840 (see Section 8)

**References:**
- https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT
- https://github.com/Op2Free/SuperPoint_TensorRT_Libtorch
- https://github.com/fettahyildizz/superpoint_lightglue_tensorrt
- https://github.com/NVIDIA/TensorRT/issues/3918
- https://github.com/NVIDIA/TensorRT/issues/4255

---

## 3. LightGlue TensorRT Conversion

**Answer:** Yes. LightGlue has been converted to TensorRT.

**Projects:**
- **LightGlue-ONNX:** ONNX export + TRT via `trtexec`; 2–4x speedup over compiled PyTorch
- **qdLMF/LightGlue-with-FlashAttentionV2-TensorRT:** Custom FlashAttentionV2 TRT plugin; runs on Jetson Orin NX 8GB, TRT 8.5.2
- **fettahyildizz/superpoint_lightglue_tensorrt:** End-to-end SuperPoint + LightGlue TRT pipeline

**Challenges:**
1. Variable keypoint counts → handled by fixed top-K (e.g. 2048) and padding
2. Adaptive depth/width → not supported in ONNX/TRT export (control flow)
3. Attention fusion → custom `MultiHeadAttention` export for ONNX Runtime
4. TopK limit → K ≤ 3840

**References:**
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/
- https://github.com/qdLMF/LightGlue-with-FlashAttentionV2-TensorRT
- https://github.com/fettahyildizz/superpoint_lightglue_tensorrt

---

## 4. Dynamic Shapes in TensorRT

**Answer:** TensorRT supports dynamic shapes, but LightGlue-ONNX deliberately exports with fixed shapes.

**TensorRT dynamic shapes:**
- Use `-1` for variable dimensions at build time
- Optimization profiles: `(min, opt, max)` per dynamic dimension
- The first inference after a shape change can be slower (shape inference)
- https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-dynamic-shapes.html

**LightGlue-ONNX approach:**
- Variable keypoint counts replaced by a fixed top-K (e.g. 2048)
- Extractor: top-K selection instead of a confidence threshold
- Matcher: fixed `(B, N, D)` inputs; outputs unified tensors
- Enables symbolic shape inference and TRT compatibility

**References:**
- https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-dynamic-shapes.html
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/

---

## 5. Expected Speedup: TRT vs ONNX Runtime

| Scenario | Speedup | Notes |
|----------|---------|-------|
| TRT vs PyTorch (compiled) | 2–4x | Fabio's blog, SuperPoint+LightGlue |
| TRT FP8 vs FP32 | ~6x | NVIDIA Model Optimizer, Jan 2026 |
| ORT TensorRT EP vs native TRT | Often slower | 30%–3x slower in reports; operator fallback, graph partitioning |

**Recommendation:** Prefer native TensorRT engines (built via `trtexec`) over the ONNX Runtime TensorRT EP for best performance.

**References:**
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/
- https://fabio-sim.github.io/blog/fp8-quantized-lightglue-tensorrt-nvidia-model-optimizer/
- https://github.com/microsoft/onnxruntime/issues/11356
- https://github.com/microsoft/onnxruntime/issues/24831

---

## 6. Attention Mechanisms in TensorRT

**Known issues:**
- **MultiHeadCrossAttentionPlugin:** Numerical error grows with sequence length; acceptable up to ~128
- **key_padding_mask / attention_mask:** Not fully supported; alignment issues with PyTorch
- **Cross-attention (Q ≠ K/V length):** `bertQKVToContextPlugin` expects the same Q, K, V sequence length; native TRT fusion may support it via ONNX
- **Accuracy:** Some transformer/attention models show accuracy loss after TRT conversion

**LightGlue-ONNX mitigation:** Custom `MultiHeadAttention` export via `torch.library` → `com.microsoft::MultiHeadAttention` for ONNX Runtime; works with the TRT EP.

**References:**
- https://github.com/NVIDIA/TensorRT/issues/2674
- https://github.com/NVIDIA/TensorRT/issues/3619
- https://github.com/NVIDIA/TensorRT/issues/1483
- https://github.com/NVIDIA/TensorRT/issues/2587

---

## 7. LightGlue Adaptive Stopping and TRT

**Answer:** Not supported in the current ONNX/TRT export.

- Adaptive depth/width relies on control flow (`torch.cond()`)
- The TorchDynamo ONNX exporter does not yet handle this
- Fabio's blog: "adaptive depth & width disabled" for TRT benchmarks
- Trade-off: fixed-depth LightGlue is faster per layer but may spend more compute than needed on easy pairs

**Reference:**
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/

---

## 8. TensorRT TopK 3840 Limit

**Constraint:** The TensorRT TopK operator requires K ≤ 3840.

- SuperPoint/LightGlue use TopK for keypoint selection
- If `num_keypoints` > 3840, the TRT build fails
- **Workaround:** Use `num_keypoints` ≤ 3840 (e.g. 2048)
- NVIDIA is working on removing this limit (medium priority)

**References:**
- https://github.com/NVIDIA/TensorRT/issues/4244
- https://forums.developer.nvidia.com/t/topk-5-k-exceeds-the-maximum-value-allowed-3840/295903

---

## 9. Alternative TRT-Optimized Feature Matching Models

| Model | Project | Jetson Support | Notes |
|-------|---------|----------------|-------|
| **XFeat** | [pranavnedunghat/xfeattensorrt](https://github.com/pranavnedunghat/xfeattensorrt) | Orin NX 16GB, JetPack 6.0, TRT 8.6.2 | Sparse + dense matching |
| **LoFTR** | [Kolkir/LoFTR_TRT](https://github.com/Kolkir/LoFTR_TRT) | - | Detector-free, transformer-based |
| **SuperPoint + LightGlue** | [fettahyildizz/superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) | TRT 8.5.2.2 | Full pipeline, dynamic I/O |

**References:**
- https://github.com/pranavnedunghat/xfeattensorrt
- https://github.com/Kolkir/LoFTR_TRT
- https://github.com/qdLMF/LightGlue-with-FlashAttentionV2-TensorRT

---

## 10. Jetson Orin Nano Super Considerations

- **Specs:** 17 FP16 TFLOPS, 67 sparse INT8 TOPS, 102 GB/s memory bandwidth
- **Engine build:** Build engines on the Jetson itself for best performance; TRT engines are specific to the GPU architecture and TRT version, and cross-compilation from x86 can be suboptimal
- **JetPack / TRT:** Orin Nano Super typically ships JetPack 6.x with TensorRT 8.6+
- **FP8:** Requires Hopper/Ada or newer; Orin is Ampere, so FP8 is unavailable and FP16 is the main option

---

## Summary by Model

### SuperPoint

| Aspect | Summary |
|--------|---------|
| **TRT feasibility** | ✅ High; multiple open-source implementations |
| **Known issues** | TopK ≤3840; TRT 8.5 op compatibility on older Jetson |
| **Expected performance** | ~1.5–2x vs PyTorch; ~25ms on MX450 (vs ~37ms PyTorch) |
| **Dynamic shapes** | Fixed via top-K (e.g. 2048); no variable keypoint count |
| **Sources** | [SuperPoint-SuperGlue-TensorRT](https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT), [superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) |

### LightGlue

| Aspect | Summary |
|--------|---------|
| **TRT feasibility** | ✅ High; LightGlue-ONNX + `trtexec`; dedicated TRT projects |
| **Known issues** | Adaptive stopping not supported; TopK ≤3840; attention accuracy on long sequences |
| **Expected performance** | 2–4x vs PyTorch; up to ~6x with FP8 (not available on Orin) |
| **Dynamic shapes** | Fixed via top-K padding; no variable keypoint count |
| **Sources** | [LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX), [Fabio's blog](https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/), [superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) |

---

## Recommended Deployment Path for Jetson Orin Nano Super

1. Use **LightGlue-ONNX** to export SuperPoint + LightGlue to ONNX with a fixed top-K ≤ 3840.
2. Convert ONNX → TensorRT with `trtexec` on the Jetson itself (or with a matching TRT version).
3. Use **fettahyildizz/superpoint_lightglue_tensorrt** as a reference for C++ deployment.
4. Use FP16; FP8 is unavailable on Orin (Ampere).
5. Expect a 2–4x speedup over the current ONNX Runtime FP16 setup, depending on keypoint count and engine build.
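One practical consequence of the fixed top-K export in step 1: every inference returns exactly K keypoint slots per image, so match post-processing must drop pairs that touch padded slots. A hedged NumPy sketch (hypothetical helper and threshold, not code from the cited repositories):

```python
import numpy as np

def filter_padded_matches(matches, scores, n0, n1, min_score=0.2):
    """Drop matches that involve padded keypoint slots or low confidence.

    matches: (M, 2) int array of (index_in_image0, index_in_image1)
    n0, n1:  true keypoint counts in each image before padding to top-K
    """
    keep = (matches[:, 0] < n0) & (matches[:, 1] < n1) & (scores >= min_score)
    return matches[keep], scores[keep]
```

If the engine instead emits a per-slot validity mask, the same filtering can index that mask directly; the point is that padding must be handled explicitly somewhere downstream.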