TensorRT Conversion and Deployment Research: SuperPoint + LightGlue

Target: Jetson Orin Nano Super deployment
Context: Visual navigation system with SuperPoint (~80 ms) + LightGlue ONNX FP16 (~50–100 ms) on RTX 2060


Executive Summary

| Model | TRT Feasibility | Known Issues | Expected Speedup | Dynamic Shapes |
|---|---|---|---|---|
| SuperPoint | High | TopK ≤ 3840 limit; TRT version compatibility | 1.5–2x over PyTorch | Fixed via top-K padding |
| LightGlue | High | Adaptive stopping not supported; TopK limit | 2–4x over PyTorch | Fixed via top-K padding |

1. LightGlue-ONNX TensorRT Support

Source: fabio-sim/LightGlue-ONNX (580 stars)

Answer: Yes. LightGlue-ONNX supports TensorRT via ONNX → TensorRT conversion.

  • Workflow: Export PyTorch → ONNX (via dynamo.py / lightglue-onnx CLI), then convert ONNX → TensorRT engine with trtexec
  • Example: trtexec --onnx=weights/sift_lightglue.onnx --saveEngine=/srv/sift_lightglue.engine
  • Optimizations: Attention subgraph fusion, FP16, FP8 (Jan 2026), dynamic batch support
  • Note: Uses ONNX Runtime with TensorRT EP or native TensorRT via trtexec-built engines
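The conversion step above can be scripted around trtexec. A minimal sketch of assembling the invocation, using the paths from the example above; the `--fp16` flag is an assumption about the desired precision (it is a real trtexec flag, appropriate for Orin):

```python
import shlex

def build_trtexec_cmd(onnx_path: str, engine_path: str, fp16: bool = True) -> list[str]:
    """Assemble a trtexec invocation that converts an ONNX model to a TRT engine."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")  # FP16 is the main precision option on Orin (Ampere)
    return cmd

cmd = build_trtexec_cmd("weights/sift_lightglue.onnx", "/srv/sift_lightglue.engine")
print(shlex.join(cmd))
# Run on the Jetson with TensorRT installed, e.g. via subprocess.run(cmd, check=True)
```

Building the command in one place makes it easy to keep export and deployment flags in sync across SuperPoint and LightGlue engines.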


2. SuperPoint TensorRT Conversion

Answer: Yes. Several projects convert SuperPoint to TensorRT for Jetson.

| Project | Stars | Notes |
|---|---|---|
| yuefanhao/SuperPoint-SuperGlue-TensorRT | 367 | SuperPoint + SuperGlue, C++ |
| Op2Free/SuperPoint_TensorRT_Libtorch | – | PyTorch/Libtorch/TensorRT; ~25.5 ms TRT vs ~37.4 ms PyTorch on MX450 |
| fettahyildizz/superpoint_lightglue_tensorrt | 3 | SuperPoint + LightGlue, TRT 8.5.2.2, dynamic I/O, Jetson-compatible |

Typical flow: PyTorch .pth → ONNX → TensorRT engine (.engine)

Known issues:

  • TensorRT 8.5 (e.g. Xavier NX, JetPack 5.1.4): some ops (e.g. Flatten) may fail; TRT 8.6+ preferred
  • TopK limit: K ≤ 3840 (see Section 4)


3. LightGlue TensorRT Conversion

Answer: Yes. LightGlue has been converted to TensorRT.

Projects:

  • LightGlue-ONNX: ONNX export + TRT via trtexec; 2–4x speedup over compiled PyTorch
  • qdLMF/LightGlue-with-FlashAttentionV2-TensorRT: Custom FlashAttentionV2 TRT plugin; runs on Jetson Orin NX 8GB, TRT 8.5.2
  • fettahyildizz/superpoint_lightglue_tensorrt: End-to-end SuperPoint + LightGlue TRT pipeline

Challenges:

  1. Variable keypoint counts → handled by fixed top-K (e.g. 2048) and padding
  2. Adaptive depth/width → not supported in ONNX/TRT export (control flow)
  3. Attention fusion → custom MultiHeadAttention export for ONNX Runtime
  4. TopK limit → K ≤ 3840


4. Dynamic Shapes in TensorRT

Answer: TensorRT supports dynamic shapes, but LightGlue-ONNX uses fixed shapes for export.

TensorRT dynamic shapes: supported natively via optimization profiles; each dynamic input declares min/opt/max shapes at engine build time (trtexec: --minShapes / --optShapes / --maxShapes).

LightGlue-ONNX approach:

  • Variable keypoints replaced by fixed top-K (e.g. 2048)
  • Extractor: top-K selection instead of confidence threshold
  • Matcher: fixed (B, N, D) inputs; outputs unified tensors
  • Enables symbolic shape inference and TRT compatibility
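The fixed top-K plus padding scheme above can be sketched in NumPy. This is an illustrative reconstruction, not LightGlue-ONNX's actual code; the mask output is an assumption about how padded slots would be tracked downstream:

```python
import numpy as np

def pad_to_fixed_topk(scores: np.ndarray, descriptors: np.ndarray, k: int = 2048):
    """Select the top-k keypoints by score and zero-pad when fewer are detected,
    so the matcher always sees a fixed (k, D) input shape."""
    n, d = descriptors.shape
    order = np.argsort(-scores)[:k]     # top-k indices by descending confidence
    out_desc = np.zeros((k, d), dtype=descriptors.dtype)
    out_mask = np.zeros(k, dtype=bool)  # True where a real keypoint lives
    m = min(n, k)
    out_desc[:m] = descriptors[order[:m]]
    out_mask[:m] = True
    return out_desc, out_mask

# 1500 detected keypoints padded up to a fixed budget of 2048
desc, mask = pad_to_fixed_topk(np.random.rand(1500), np.random.rand(1500, 256), k=2048)
print(desc.shape, mask.sum())  # (2048, 256) 1500
```

Because the output shape no longer depends on the image content, the exported graph has static shapes and symbolic shape inference succeeds.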


5. Expected Speedup: TRT vs ONNX Runtime

| Scenario | Speedup | Notes |
|---|---|---|
| TRT vs PyTorch (compiled) | 2–4x | Fabio's blog, SuperPoint + LightGlue |
| TRT FP8 vs FP32 | ~6x | NVIDIA Model Optimizer, Jan 2026 |
| ORT TensorRT EP vs native TRT | Often slower | 30%–3x slower in reports; operator fallback, graph partitioning |

Recommendation: Prefer native TensorRT engines (via trtexec) over ONNX Runtime TensorRT EP for best performance.
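A back-of-envelope projection of what the 2–4x range would mean for the current ~50–100 ms LightGlue latency. Note the assumption: the reported speedups are measured against compiled PyTorch, so applying them to the ONNX Runtime baseline is optimistic:

```python
# Apply the reported 2-4x TRT speedup range to the current
# ONNX Runtime FP16 matcher latency of ~50-100 ms (RTX 2060 numbers).
current_ms = (50.0, 100.0)
speedup = (2.0, 4.0)
best = current_ms[0] / speedup[1]   # fastest case: 50 ms at 4x
worst = current_ms[1] / speedup[0]  # slowest case: 100 ms at 2x
print(f"projected TRT latency: {best:.1f}-{worst:.1f} ms")
# projected TRT latency: 12.5-50.0 ms
```

Absolute numbers on the Orin Nano Super will differ from the RTX 2060; only the ratio is being projected here.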


6. Attention Mechanisms in TensorRT

Known issues:

  • MultiHeadCrossAttentionPlugin: Error grows with sequence length; acceptable up to ~128
  • key_padding_mask / attention_mask: Not fully supported; alignment issues with PyTorch
  • Cross-attention (Q≠K/V length): bertQKVToContextPlugin expects same Q,K,V sequence length; native TRT fusion may support via ONNX
  • Accuracy: Some transformer/attention models show accuracy loss after TRT conversion

LightGlue-ONNX mitigation: Custom MultiHeadAttention export via torch.library, mapping to the com.microsoft::MultiHeadAttention contrib op for ONNX Runtime; works with the TRT EP.
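When key_padding_mask inputs are unsupported, a common generic workaround is to fold the mask into an additive bias before softmax so padded keys get near-zero weight. A NumPy sketch of the idea (not LightGlue-ONNX's actual implementation):

```python
import numpy as np

def masked_attention_scores(q, k, valid_mask):
    """Additive-bias masking: instead of a key_padding_mask input (poorly
    supported by TRT attention plugins), add a large negative bias to padded
    key positions before softmax so they receive ~zero attention weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(valid_mask[None, :], scores, -1e9)  # bias out padded keys
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

q, k = np.random.rand(4, 64), np.random.rand(8, 64)
mask = np.array([True] * 6 + [False] * 2)  # last 2 keys are top-K padding slots
attn = masked_attention_scores(q, k, mask)
print(attn[:, 6:].max() < 1e-6)  # True: padded keys get ~zero weight
```

The bias trick exports cleanly because it is just elementwise arithmetic on the score tensor, with no boolean-mask operator in the graph.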


7. LightGlue Adaptive Stopping and TRT

Answer: Not supported in current ONNX/TRT export.

  • Adaptive depth/width uses control flow (torch.cond())
  • TorchDynamo ONNX exporter does not yet handle this
  • Fabio's blog: “adaptive depth & width disabled” for TRT benchmarks
  • Trade-off: fixed-depth LightGlue is faster but may use more compute on easy pairs
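The control-flow problem above can be illustrated in plain Python. `layer_fn` and `confidence_fn` are hypothetical stand-ins for LightGlue internals; the point is that the adaptive loop's trip count depends on runtime data, which static graph exporters cannot trace:

```python
# Data-dependent early exit (what torch.cond() would express) vs a fixed
# trip count (what the ONNX/TRT export actually uses).

def match_adaptive(state, layer_fn, confidence_fn, max_layers=9, threshold=0.95):
    for i in range(max_layers):
        state = layer_fn(state)
        if confidence_fn(state) >= threshold:  # runtime-data-dependent branch
            return state, i + 1                # easy pairs stop early
    return state, max_layers

def match_fixed(state, layer_fn, max_layers=9):
    for _ in range(max_layers):                # exportable: static trip count
        state = layer_fn(state)
    return state, max_layers

# Toy run: confidence rises 0.2 per layer, so the adaptive path stops at layer 5
state, depth = match_adaptive(0.0, lambda s: s + 0.2, lambda s: s, threshold=0.95)
print(depth)  # 5
```

The fixed-depth variant always pays for all nine layers, which is the compute trade-off on easy pairs noted above.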


8. TensorRT TopK 3840 Limit

Constraint: TensorRT TopK operator has K ≤ 3840.

  • SuperPoint/LightGlue use TopK for keypoint selection
  • If num_keypoints > 3840, TRT build fails
  • Workaround: Use num_keypoints ≤ 3840 (e.g. 2048)
  • NVIDIA is working on removing this limit (medium priority)
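Since exceeding the limit only surfaces as a TRT build failure, it is worth failing fast at export time. A trivial guard sketch (the function name is illustrative):

```python
TRT_TOPK_MAX = 3840  # TensorRT TopK operator limit on K

def validate_num_keypoints(n: int) -> int:
    """Fail at export/config time instead of at TensorRT engine-build time."""
    if n > TRT_TOPK_MAX:
        raise ValueError(
            f"num_keypoints={n} exceeds the TensorRT TopK limit of {TRT_TOPK_MAX}"
        )
    return n

print(validate_num_keypoints(2048))  # 2048: the value used in LightGlue-ONNX exports
```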


9. Alternative TRT-Optimized Feature Matching Models

| Model | Project | Jetson Support | Notes |
|---|---|---|---|
| XFeat | pranavnedunghat/xfeattensorrt | Orin NX 16GB, JetPack 6.0, TRT 8.6.2 | Sparse + dense matching |
| LoFTR | Kolkir/LoFTR_TRT | – | Detector-free, transformer-based |
| SuperPoint + LightGlue | fettahyildizz/superpoint_lightglue_tensorrt | TRT 8.5.2.2 | Full pipeline, dynamic I/O |


10. Jetson Orin Nano Super Considerations

  • Specs: 17 FP16 TFLOPS, 67 TOPS sparse INT8, 102 GB/s memory bandwidth
  • Engine build: Prefer building on Jetson for best performance; cross-compilation from x86 can be suboptimal
  • JetPack / TRT: Orin Nano Super typically uses JetPack 6.x with TensorRT 8.6+
  • FP8: Requires Hopper/Ada or newer; Orin uses Ampere, so FP8 not available; FP16 is the main option
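A back-of-envelope roofline check from the peak numbers above, useful for judging whether FP16 kernels on this part will be compute- or bandwidth-bound (assumes the quoted peaks are achievable):

```python
# Compute-to-bandwidth ratio ("ridge point") from the Orin Nano Super specs:
# kernels with fewer FLOPs per byte of memory traffic are bandwidth-bound.
fp16_flops = 17e12   # 17 FP16 TFLOPS peak
mem_bw = 102e9       # 102 GB/s memory bandwidth
ridge = fp16_flops / mem_bw
print(f"ridge point: {ridge:.0f} FLOP/byte")  # ~167 FLOP/byte
```

LightGlue's attention layers at small keypoint counts tend to sit well below such a ridge point, so the 102 GB/s figure matters at least as much as the TFLOPS figure.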

Summary by Model

SuperPoint

| Aspect | Summary |
|---|---|
| TRT feasibility | High; multiple open-source implementations |
| Known issues | TopK ≤ 3840; TRT 8.5 op compatibility on older Jetson |
| Expected performance | ~1.5–2x vs PyTorch; ~25 ms on MX450 (vs ~37 ms PyTorch) |
| Dynamic shapes | Fixed via top-K (e.g. 2048); no variable keypoint count |
| Sources | SuperPoint-SuperGlue-TensorRT, superpoint_lightglue_tensorrt |

LightGlue

| Aspect | Summary |
|---|---|
| TRT feasibility | High; LightGlue-ONNX + trtexec; dedicated TRT projects |
| Known issues | Adaptive stopping not supported; TopK ≤ 3840; attention accuracy on long sequences |
| Expected performance | 2–4x vs PyTorch; up to ~6x with FP8 (not on Orin) |
| Dynamic shapes | Fixed via top-K padding; no variable keypoint count |
| Sources | LightGlue-ONNX, Fabio's blog, superpoint_lightglue_tensorrt |

Recommendations

  1. Use LightGlue-ONNX for ONNX export of SuperPoint + LightGlue (fixed top-K ≤ 3840).
  2. Convert ONNX → TensorRT with trtexec on the Jetson (or matching TRT version).
  3. Use fettahyildizz/superpoint_lightglue_tensorrt as a reference for C++ deployment.
  4. Use FP16; avoid FP8 on Orin (Ampere).
  5. Expect a 2–4x speedup over the current ONNX Runtime FP16 baseline, depending on keypoint count and engine build.