TensorRT Conversion and Deployment Research: SuperPoint + LightGlue

Target: Jetson Orin Nano Super deployment
Context: Visual navigation system with SuperPoint (~80 ms) + LightGlue ONNX FP16 (~50–100 ms) on RTX 2060


Executive Summary

| Model | TRT Feasibility | Known Issues | Expected Speedup | Dynamic Shapes |
|---|---|---|---|---|
| SuperPoint | High | TopK ≤ 3840 limit; TRT version compatibility | 1.5–2x over PyTorch | Fixed via top-K padding |
| LightGlue | High | Adaptive stopping not supported; TopK limit | 2–4x over PyTorch | Fixed via top-K padding |

1. LightGlue-ONNX TensorRT Support

Source: fabio-sim/LightGlue-ONNX (580 stars)

Answer: Yes. LightGlue-ONNX supports TensorRT via ONNX → TensorRT conversion.

  • Workflow: Export PyTorch → ONNX (via dynamo.py / lightglue-onnx CLI), then convert ONNX → TensorRT engine with trtexec
  • Example: trtexec --onnx=weights/sift_lightglue.onnx --saveEngine=/srv/sift_lightglue.engine
  • Optimizations: Attention subgraph fusion, FP16, FP8 (Jan 2026), dynamic batch support
  • Note: Uses ONNX Runtime with TensorRT EP or native TensorRT via trtexec-built engines
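The conversion step above can be scripted around trtexec. A minimal sketch of assembling the invocation, using the paths from the example above; the `--fp16` flag is an assumption about the desired precision (it is a real trtexec flag, appropriate for Orin):

```python
import shlex

def build_trtexec_cmd(onnx_path: str, engine_path: str, fp16: bool = True) -> list[str]:
    """Assemble a trtexec invocation that converts an ONNX model to a TRT engine."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")  # FP16 is the main precision option on Orin (Ampere)
    return cmd

cmd = build_trtexec_cmd("weights/sift_lightglue.onnx", "/srv/sift_lightglue.engine")
print(shlex.join(cmd))
# Run on the Jetson with TensorRT installed, e.g. via subprocess.run(cmd, check=True)
```

Building the command in one place makes it easy to keep export and deployment flags in sync across SuperPoint and LightGlue engines.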


2. SuperPoint TensorRT Conversion

Answer: Yes. Several projects convert SuperPoint to TensorRT for Jetson.

| Project | Stars | Notes |
|---|---|---|
| yuefanhao/SuperPoint-SuperGlue-TensorRT | 367 | SuperPoint + SuperGlue, C++ |
| Op2Free/SuperPoint_TensorRT_Libtorch | – | PyTorch/Libtorch/TensorRT; ~25.5 ms TRT vs ~37.4 ms PyTorch on MX450 |
| fettahyildizz/superpoint_lightglue_tensorrt | 3 | SuperPoint + LightGlue, TRT 8.5.2.2, dynamic I/O, Jetson-compatible |

Typical flow: PyTorch .pth → ONNX → TensorRT engine (.engine)

Known issues:

  • TensorRT 8.5 (e.g. Xavier NX, JetPack 5.1.4): some ops (e.g. Flatten) may fail; TRT 8.6+ preferred
  • TopK limit: K ≤ 3840 (see Section 4)


3. LightGlue TensorRT Conversion

Answer: Yes. LightGlue has been converted to TensorRT.

Projects:

  • LightGlue-ONNX: ONNX export + TRT via trtexec; 2–4x speedup over compiled PyTorch
  • qdLMF/LightGlue-with-FlashAttentionV2-TensorRT: Custom FlashAttentionV2 TRT plugin; runs on Jetson Orin NX 8GB, TRT 8.5.2
  • fettahyildizz/superpoint_lightglue_tensorrt: End-to-end SuperPoint + LightGlue TRT pipeline

Challenges:

  1. Variable keypoint counts → handled by fixed top-K (e.g. 2048) and padding
  2. Adaptive depth/width → not supported in ONNX/TRT export (control flow)
  3. Attention fusion → custom MultiHeadAttention export for ONNX Runtime
  4. TopK limit → K ≤ 3840


4. Dynamic Shapes in TensorRT

Answer: TensorRT supports dynamic shapes, but LightGlue-ONNX uses fixed shapes for export.

TensorRT dynamic shapes: supported natively via optimization profiles; each dynamic input declares min/opt/max shapes at engine build time (trtexec: --minShapes / --optShapes / --maxShapes).

LightGlue-ONNX approach:

  • Variable keypoints replaced by fixed top-K (e.g. 2048)
  • Extractor: top-K selection instead of confidence threshold
  • Matcher: fixed (B, N, D) inputs; outputs unified tensors
  • Enables symbolic shape inference and TRT compatibility
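The fixed top-K plus padding scheme above can be sketched in NumPy. This is an illustrative reconstruction, not LightGlue-ONNX's actual code; the mask output is an assumption about how padded slots would be tracked downstream:

```python
import numpy as np

def pad_to_fixed_topk(scores: np.ndarray, descriptors: np.ndarray, k: int = 2048):
    """Select the top-k keypoints by score and zero-pad when fewer are detected,
    so the matcher always sees a fixed (k, D) input shape."""
    n, d = descriptors.shape
    order = np.argsort(-scores)[:k]     # top-k indices by descending confidence
    out_desc = np.zeros((k, d), dtype=descriptors.dtype)
    out_mask = np.zeros(k, dtype=bool)  # True where a real keypoint lives
    m = min(n, k)
    out_desc[:m] = descriptors[order[:m]]
    out_mask[:m] = True
    return out_desc, out_mask

# 1500 detected keypoints padded up to a fixed budget of 2048
desc, mask = pad_to_fixed_topk(np.random.rand(1500), np.random.rand(1500, 256), k=2048)
print(desc.shape, mask.sum())  # (2048, 256) 1500
```

Because the output shape no longer depends on the image content, the exported graph has static shapes and symbolic shape inference succeeds.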


5. Expected Speedup: TRT vs ONNX Runtime

| Scenario | Speedup | Notes |
|---|---|---|
| TRT vs PyTorch (compiled) | 2–4x | Fabio's blog, SuperPoint + LightGlue |
| TRT FP8 vs FP32 | ~6x | NVIDIA Model Optimizer, Jan 2026 |
| ORT TensorRT EP vs native TRT | Often slower | 30%–3x slower in reports; operator fallback, graph partitioning |

Recommendation: Prefer native TensorRT engines (via trtexec) over ONNX Runtime TensorRT EP for best performance.
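A back-of-envelope projection of what the 2–4x range would mean for the current ~50–100 ms LightGlue latency. Note the assumption: the reported speedups are measured against compiled PyTorch, so applying them to the ONNX Runtime baseline is optimistic:

```python
# Apply the reported 2-4x TRT speedup range to the current
# ONNX Runtime FP16 matcher latency of ~50-100 ms (RTX 2060 numbers).
current_ms = (50.0, 100.0)
speedup = (2.0, 4.0)
best = current_ms[0] / speedup[1]   # fastest case: 50 ms at 4x
worst = current_ms[1] / speedup[0]  # slowest case: 100 ms at 2x
print(f"projected TRT latency: {best:.1f}-{worst:.1f} ms")
# projected TRT latency: 12.5-50.0 ms
```

Absolute numbers on the Orin Nano Super will differ from the RTX 2060; only the ratio is being projected here.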


6. Attention Mechanisms in TensorRT

Known issues:

  • MultiHeadCrossAttentionPlugin: Error grows with sequence length; acceptable up to ~128
  • key_padding_mask / attention_mask: Not fully supported; alignment issues with PyTorch
  • Cross-attention (Q≠K/V length): bertQKVToContextPlugin expects same Q,K,V sequence length; native TRT fusion may support via ONNX
  • Accuracy: Some transformer/attention models show accuracy loss after TRT conversion

LightGlue-ONNX mitigation: Custom MultiHeadAttention export via torch.library, mapping to the com.microsoft::MultiHeadAttention contrib op for ONNX Runtime; works with the TRT EP.
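When key_padding_mask inputs are unsupported, a common generic workaround is to fold the mask into an additive bias before softmax so padded keys get near-zero weight. A NumPy sketch of the idea (not LightGlue-ONNX's actual implementation):

```python
import numpy as np

def masked_attention_scores(q, k, valid_mask):
    """Additive-bias masking: instead of a key_padding_mask input (poorly
    supported by TRT attention plugins), add a large negative bias to padded
    key positions before softmax so they receive ~zero attention weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(valid_mask[None, :], scores, -1e9)  # bias out padded keys
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

q, k = np.random.rand(4, 64), np.random.rand(8, 64)
mask = np.array([True] * 6 + [False] * 2)  # last 2 keys are top-K padding slots
attn = masked_attention_scores(q, k, mask)
print(attn[:, 6:].max() < 1e-6)  # True: padded keys get ~zero weight
```

The bias trick exports cleanly because it is just elementwise arithmetic on the score tensor, with no boolean-mask operator in the graph.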


7. LightGlue Adaptive Stopping and TRT

Answer: Not supported in current ONNX/TRT export.

  • Adaptive depth/width uses control flow (torch.cond())
  • TorchDynamo ONNX exporter does not yet handle this
  • Fabio's blog: “adaptive depth & width disabled” for TRT benchmarks
  • Trade-off: fixed-depth LightGlue is faster but may use more compute on easy pairs
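The control-flow problem above can be illustrated in plain Python. `layer_fn` and `confidence_fn` are hypothetical stand-ins for LightGlue internals; the point is that the adaptive loop's trip count depends on runtime data, which static graph exporters cannot trace:

```python
# Data-dependent early exit (what torch.cond() would express) vs a fixed
# trip count (what the ONNX/TRT export actually uses).

def match_adaptive(state, layer_fn, confidence_fn, max_layers=9, threshold=0.95):
    for i in range(max_layers):
        state = layer_fn(state)
        if confidence_fn(state) >= threshold:  # runtime-data-dependent branch
            return state, i + 1                # easy pairs stop early
    return state, max_layers

def match_fixed(state, layer_fn, max_layers=9):
    for _ in range(max_layers):                # exportable: static trip count
        state = layer_fn(state)
    return state, max_layers

# Toy run: confidence rises 0.2 per layer, so the adaptive path stops at layer 5
state, depth = match_adaptive(0.0, lambda s: s + 0.2, lambda s: s, threshold=0.95)
print(depth)  # 5
```

The fixed-depth variant always pays for all nine layers, which is the compute trade-off on easy pairs noted above.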


8. TensorRT TopK 3840 Limit

Constraint: TensorRT TopK operator has K ≤ 3840.

  • SuperPoint/LightGlue use TopK for keypoint selection
  • If num_keypoints > 3840, TRT build fails
  • Workaround: Use num_keypoints ≤ 3840 (e.g. 2048)
  • NVIDIA is working on removing this limit (medium priority)
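Since exceeding the limit only surfaces as a TRT build failure, it is worth failing fast at export time. A trivial guard sketch (the function name is illustrative):

```python
TRT_TOPK_MAX = 3840  # TensorRT TopK operator limit on K

def validate_num_keypoints(n: int) -> int:
    """Fail at export/config time instead of at TensorRT engine-build time."""
    if n > TRT_TOPK_MAX:
        raise ValueError(
            f"num_keypoints={n} exceeds the TensorRT TopK limit of {TRT_TOPK_MAX}"
        )
    return n

print(validate_num_keypoints(2048))  # 2048: the value used in LightGlue-ONNX exports
```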


9. Alternative TRT-Optimized Feature Matching Models

| Model | Project | Jetson Support | Notes |
|---|---|---|---|
| XFeat | pranavnedunghat/xfeattensorrt | Orin NX 16GB, JetPack 6.0, TRT 8.6.2 | Sparse + dense matching |
| LoFTR | Kolkir/LoFTR_TRT | – | Detector-free, transformer-based |
| SuperPoint + LightGlue | fettahyildizz/superpoint_lightglue_tensorrt | TRT 8.5.2.2 | Full pipeline, dynamic I/O |


10. Jetson Orin Nano Super Considerations

  • Specs: 17 FP16 TFLOPS, 67 TOPS sparse INT8, 102 GB/s memory bandwidth
  • Engine build: Prefer building on Jetson for best performance; cross-compilation from x86 can be suboptimal
  • JetPack / TRT: Orin Nano Super typically uses JetPack 6.x with TensorRT 8.6+
  • FP8: Requires Hopper/Ada or newer; Orin uses Ampere, so FP8 not available; FP16 is the main option
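A back-of-envelope roofline check from the peak numbers above, useful for judging whether FP16 kernels on this part will be compute- or bandwidth-bound (assumes the quoted peaks are achievable):

```python
# Compute-to-bandwidth ratio ("ridge point") from the Orin Nano Super specs:
# kernels with fewer FLOPs per byte of memory traffic are bandwidth-bound.
fp16_flops = 17e12   # 17 FP16 TFLOPS peak
mem_bw = 102e9       # 102 GB/s memory bandwidth
ridge = fp16_flops / mem_bw
print(f"ridge point: {ridge:.0f} FLOP/byte")  # ~167 FLOP/byte
```

LightGlue's attention layers at small keypoint counts tend to sit well below such a ridge point, so the 102 GB/s figure matters at least as much as the TFLOPS figure.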

Summary by Model

SuperPoint

| Aspect | Summary |
|---|---|
| TRT feasibility | High; multiple open-source implementations |
| Known issues | TopK ≤ 3840; TRT 8.5 op compatibility on older Jetson |
| Expected performance | ~1.5–2x vs PyTorch; ~25 ms on MX450 (vs ~37 ms PyTorch) |
| Dynamic shapes | Fixed via top-K (e.g. 2048); no variable keypoint count |
| Sources | SuperPoint-SuperGlue-TensorRT, superpoint_lightglue_tensorrt |

LightGlue

| Aspect | Summary |
|---|---|
| TRT feasibility | High; LightGlue-ONNX + trtexec; dedicated TRT projects |
| Known issues | Adaptive stopping not supported; TopK ≤ 3840; attention accuracy on long sequences |
| Expected performance | 2–4x vs PyTorch; up to ~6x with FP8 (not on Orin) |
| Dynamic shapes | Fixed via top-K padding; no variable keypoint count |
| Sources | LightGlue-ONNX, Fabio's blog, superpoint_lightglue_tensorrt |

Recommendations

  1. Use LightGlue-ONNX for ONNX export of SuperPoint + LightGlue (fixed top-K ≤ 3840).
  2. Convert ONNX → TensorRT with trtexec on the Jetson (or matching TRT version).
  3. Use fettahyildizz/superpoint_lightglue_tensorrt as a reference for C++ deployment.
  4. Use FP16; avoid FP8 on Orin (Ampere).
  5. Expect a 2–4x speedup over the current ONNX Runtime FP16 baseline, depending on keypoint count and engine build.