# TensorRT Conversion and Deployment Research: SuperPoint + LightGlue

**Target:** Jetson Orin Nano Super deployment

**Context:** Visual navigation system with SuperPoint (~80 ms) + LightGlue ONNX FP16 (~50–100 ms) on an RTX 2060

---

## Executive Summary

| Model | TRT Feasibility | Known Issues | Expected Speedup | Dynamic Shapes |
|-------|-----------------|--------------|------------------|----------------|
| **SuperPoint** | ✅ High | TopK K ≤ 3840 limit; TRT version compatibility | 1.5–2x over PyTorch | Fixed via top-K padding |
| **LightGlue** | ✅ High | Adaptive stopping not supported; TopK limit | 2–4x over PyTorch | Fixed via top-K padding |

---

## 1. LightGlue-ONNX TensorRT Support

**Source:** [fabio-sim/LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX) (580 stars)

**Answer:** Yes. LightGlue-ONNX supports TensorRT via ONNX → TensorRT conversion.

- **Workflow:** Export PyTorch → ONNX (via `dynamo.py` / the `lightglue-onnx` CLI), then convert ONNX → TensorRT engine with `trtexec`
- **Example:** `trtexec --onnx=weights/sift_lightglue.onnx --saveEngine=/srv/sift_lightglue.engine`
- **Optimizations:** Attention subgraph fusion, FP16, FP8 (Jan 2026), dynamic batch support
- **Note:** Inference runs either through ONNX Runtime with the TensorRT EP, or through native TensorRT via `trtexec`-built engines

**References:**

- https://github.com/fabio-sim/LightGlue-ONNX
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/

---

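As a concrete sketch of the ONNX Runtime path: the provider names and option keys below come from the ONNX Runtime TensorRT EP API; the model path, cache directory, and helper names are placeholders of ours, not LightGlue-ONNX code.

```python
# Sketch: run a LightGlue-ONNX model through ONNX Runtime, preferring the
# TensorRT EP and falling back to CUDA, then CPU, for unsupported subgraphs.

def trt_provider_list(cache_dir="/tmp/trt_cache"):
    """Provider priority list for an InferenceSession."""
    return [
        ("TensorrtExecutionProvider", {
            "trt_fp16_enable": True,          # build FP16 TRT engines
            "trt_engine_cache_enable": True,  # cache engines across runs
            "trt_engine_cache_path": cache_dir,
        }),
        "CUDAExecutionProvider",  # fallback for ops TRT cannot take
        "CPUExecutionProvider",
    ]

def make_session(model_path):
    # Imported lazily so trt_provider_list() stays usable without onnxruntime.
    import onnxruntime as ort
    return ort.InferenceSession(model_path, providers=trt_provider_list())
```

With this, `make_session("weights/sift_lightglue.onnx")` would JIT-build (and cache) TRT engines on first run; any subgraph the TRT EP rejects falls back to the CUDA or CPU provider.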
## 2. SuperPoint TensorRT Conversion

**Answer:** Yes. Several projects convert SuperPoint to TensorRT for Jetson.

| Project | Stars | Notes |
|---------|-------|-------|
| [yuefanhao/SuperPoint-SuperGlue-TensorRT](https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT) | 367 | SuperPoint + SuperGlue, C++ |
| [Op2Free/SuperPoint_TensorRT_Libtorch](https://github.com/Op2Free/SuperPoint_TensorRT_Libtorch) | - | PyTorch/Libtorch/TensorRT; ~25.5 ms TRT vs ~37.4 ms PyTorch on an MX450 |
| [fettahyildizz/superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) | 3 | SuperPoint + LightGlue, TRT 8.5.2.2, dynamic I/O, Jetson-compatible |

**Typical flow:** PyTorch `.pth` → ONNX → TensorRT engine (`.engine`)

**Known issues:**

- TensorRT 8.5 (e.g. Xavier NX, JetPack 5.1.4): some ops (e.g. Flatten) may fail; TRT 8.6+ is preferred
- TopK limit: K ≤ 3840 (see Section 8)

**References:**

- https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT
- https://github.com/Op2Free/SuperPoint_TensorRT_Libtorch
- https://github.com/fettahyildizz/superpoint_lightglue_tensorrt
- https://github.com/NVIDIA/TensorRT/issues/3918
- https://github.com/NVIDIA/TensorRT/issues/4255

---

## 3. LightGlue TensorRT Conversion

**Answer:** Yes. LightGlue has been converted to TensorRT.

**Projects:**

- **LightGlue-ONNX:** ONNX export + TRT via `trtexec`; 2–4x speedup over compiled PyTorch
- **qdLMF/LightGlue-with-FlashAttentionV2-TensorRT:** Custom FlashAttentionV2 TRT plugin; runs on a Jetson Orin NX 8GB, TRT 8.5.2
- **fettahyildizz/superpoint_lightglue_tensorrt:** End-to-end SuperPoint + LightGlue TRT pipeline

**Challenges:**

1. Variable keypoint counts → handled by a fixed top-K (e.g. 2048) plus padding
2. Adaptive depth/width → not supported in ONNX/TRT export (control flow)
3. Attention fusion → custom `MultiHeadAttention` export for ONNX Runtime
4. TopK limit → K ≤ 3840

**References:**

- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/
- https://github.com/qdLMF/LightGlue-with-FlashAttentionV2-TensorRT
- https://github.com/fettahyildizz/superpoint_lightglue_tensorrt

---

## 4. Dynamic Shapes in TensorRT

**Answer:** TensorRT supports dynamic shapes, but LightGlue-ONNX uses fixed shapes for export.

**TensorRT dynamic shapes:**

- Use `-1` for variable dimensions at build time
- Optimization profiles: `(min, opt, max)` per dynamic dimension
- The first inference after a shape change can be slower (shape inference)
- https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-dynamic-shapes.html

**LightGlue-ONNX approach:**

- Variable keypoint counts replaced by a fixed top-K (e.g. 2048)
- Extractor: top-K selection instead of a confidence threshold
- Matcher: fixed `(B, N, D)` inputs; outputs unified tensors
- Enables symbolic shape inference and TRT compatibility

**References:**

- https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-dynamic-shapes.html
- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/

---

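The fixed-shape trick can be illustrated with a small sketch. This is plain Python standing in for the tensor version; the function name and the pad-with-index-`-1` convention are illustrative, not LightGlue-ONNX's actual code.

```python
# Sketch: replace "all keypoints above a confidence threshold" (variable
# count) with "top-K by score, padded to exactly K" (fixed count), which is
# what makes the exported graph shape-static for TensorRT.

def topk_padded(scores, k, pad_score=0.0):
    """Return exactly k (index, score) pairs: top-k by score, then padding.

    Padded entries use index -1 so downstream code can mask them out.
    """
    ranked = sorted(enumerate(scores), key=lambda p: p[1], reverse=True)[:k]
    pad = [(-1, pad_score)] * (k - len(ranked))
    return ranked + pad

# A frame with only 3 detections still yields a fixed-size output of 5:
fixed = topk_padded([0.9, 0.2, 0.7], k=5)
# len(fixed) == 5 regardless of how many keypoints the detector found
```

The matcher then always sees a `(B, K, D)` input, and the pad mask plays the role the variable length played before.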
## 5. Expected Speedup: TRT vs ONNX Runtime

| Scenario | Speedup | Notes |
|----------|---------|-------|
| TRT vs PyTorch (compiled) | 2–4x | Fabio’s blog, SuperPoint + LightGlue |
| TRT FP8 vs FP32 | ~6x | NVIDIA Model Optimizer, Jan 2026 |
| ORT TensorRT EP vs native TRT | Often slower | Reported 30% to 3x slower; operator fallback, graph partitioning |

**Recommendation:** Prefer native TensorRT engines (built with `trtexec`) over the ONNX Runtime TensorRT EP for best performance.

**References:**

- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/
- https://fabio-sim.github.io/blog/fp8-quantized-lightglue-tensorrt-nvidia-model-optimizer/
- https://github.com/microsoft/onnxruntime/issues/11356
- https://github.com/microsoft/onnxruntime/issues/24831

---

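Applying the 2–4x figure to the pipeline numbers from the header (~80 ms SuperPoint + ~50–100 ms LightGlue on the RTX 2060) gives a rough target range. This is back-of-the-envelope arithmetic on the document's own numbers, not a benchmark:

```python
# Back-of-the-envelope: current ORT FP16 pipeline latency vs a hoped-for
# 2-4x TensorRT speedup. Inputs come from the document header; real Jetson
# numbers will differ (Orin Nano Super is a much smaller GPU than an RTX 2060).

current_ms = (80 + 50, 80 + 100)  # (best, worst) current pipeline latency

def projected_range(current, speedup_lo=2.0, speedup_hi=4.0):
    best, worst = current
    # optimistic = best case at the high speedup; pessimistic = worst case
    # at the low speedup
    return (best / speedup_hi, worst / speedup_lo)

lo, hi = projected_range(current_ms)  # ~32.5 ms to ~90 ms per image pair
```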
## 6. Attention Mechanisms in TensorRT

**Known issues:**

- **MultiHeadCrossAttentionPlugin:** Error grows with sequence length; acceptable up to ~128
- **key_padding_mask / attention_mask:** Not fully supported; alignment issues with PyTorch
- **Cross-attention (Q ≠ K/V length):** `bertQKVToContextPlugin` expects the same Q, K, V sequence length; native TRT fusion may support it via ONNX
- **Accuracy:** Some transformer/attention models show accuracy loss after TRT conversion

**LightGlue-ONNX mitigation:** Custom `MultiHeadAttention` export via `torch.library` → `com.microsoft::MultiHeadAttention` for ONNX Runtime; works with the TRT EP.

**References:**

- https://github.com/NVIDIA/TensorRT/issues/2674
- https://github.com/NVIDIA/TensorRT/issues/3619
- https://github.com/NVIDIA/TensorRT/issues/1483
- https://github.com/NVIDIA/TensorRT/issues/2587

---

## 7. LightGlue Adaptive Stopping and TRT

**Answer:** Not supported in the current ONNX/TRT export.

- Adaptive depth/width relies on control flow (`torch.cond()`)
- The TorchDynamo ONNX exporter does not yet handle this
- Fabio’s blog: “adaptive depth & width disabled” for the TRT benchmarks
- Trade-off: fixed-depth LightGlue is faster per layer but may spend more compute on easy pairs

**Reference:**

- https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/

---

## 8. TensorRT TopK 3840 Limit

**Constraint:** The TensorRT TopK operator requires K ≤ 3840.

- SuperPoint/LightGlue use TopK for keypoint selection
- If `num_keypoints` > 3840, the TRT engine build fails
- **Workaround:** Use `num_keypoints` ≤ 3840 (e.g. 2048)
- NVIDIA is working on removing this limit (medium priority)

**References:**

- https://github.com/NVIDIA/TensorRT/issues/4244
- https://forums.developer.nvidia.com/t/topk-5-k-exceeds-the-maximum-value-allowed-3840/295903

---

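A deployment config can guard against this at export time instead of failing at engine build. A minimal sketch — the constant mirrors the limit documented in the TensorRT issue above; the function name and clamp behavior are our illustration:

```python
# Sketch: validate the requested keypoint budget against TensorRT's TopK
# limit before ONNX export, so the outcome is a clear error (or a clamp)
# rather than a cryptic failure deep inside the engine build.

TRT_TOPK_MAX = 3840  # current TensorRT limit on K for the TopK operator

def checked_num_keypoints(requested, clamp=True):
    if requested <= TRT_TOPK_MAX:
        return requested
    if clamp:
        return TRT_TOPK_MAX
    raise ValueError(
        f"num_keypoints={requested} exceeds TensorRT TopK limit {TRT_TOPK_MAX}"
    )

assert checked_num_keypoints(2048) == 2048   # typical LightGlue-ONNX setting
assert checked_num_keypoints(4096) == 3840   # silently clamped
```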
## 9. Alternative TRT-Optimized Feature Matching Models

| Model | Project | Jetson Support | Notes |
|-------|---------|----------------|-------|
| **XFeat** | [pranavnedunghat/xfeattensorrt](https://github.com/pranavnedunghat/xfeattensorrt) | Orin NX 16GB, JetPack 6.0, TRT 8.6.2 | Sparse + dense matching |
| **LoFTR** | [Kolkir/LoFTR_TRT](https://github.com/Kolkir/LoFTR_TRT) | - | Detector-free, transformer-based |
| **SuperPoint + LightGlue** | [fettahyildizz/superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) | TRT 8.5.2.2 | Full pipeline, dynamic I/O |

**References:**

- https://github.com/pranavnedunghat/xfeattensorrt
- https://github.com/Kolkir/LoFTR_TRT
- https://github.com/qdLMF/LightGlue-with-FlashAttentionV2-TensorRT

---

## 10. Jetson Orin Nano Super Considerations

- **Specs:** 17 FP16 TFLOPS, 67 sparse INT8 TOPS, 102 GB/s memory bandwidth
- **Engine build:** Build engines on the Jetson itself for best performance; cross-compilation from x86 can be suboptimal
- **JetPack / TRT:** Orin Nano Super typically runs JetPack 6.x with TensorRT 8.6+
- **FP8:** Requires Hopper/Ada or newer; Orin is Ampere-based, so FP8 is not available and FP16 is the main option

---

## Summary by Model

### SuperPoint

| Aspect | Summary |
|--------|---------|
| **TRT feasibility** | ✅ High; multiple open-source implementations |
| **Known issues** | TopK ≤ 3840; TRT 8.5 op compatibility on older Jetsons |
| **Expected performance** | ~1.5–2x vs PyTorch; ~25 ms on an MX450 (vs ~37 ms PyTorch) |
| **Dynamic shapes** | Fixed via top-K (e.g. 2048); no variable keypoint count |
| **Sources** | [SuperPoint-SuperGlue-TensorRT](https://github.com/yuefanhao/SuperPoint-SuperGlue-TensorRT), [superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) |

### LightGlue

| Aspect | Summary |
|--------|---------|
| **TRT feasibility** | ✅ High; LightGlue-ONNX + `trtexec`; dedicated TRT projects |
| **Known issues** | Adaptive stopping not supported; TopK ≤ 3840; attention accuracy on long sequences |
| **Expected performance** | 2–4x vs PyTorch; up to ~6x with FP8 (not available on Orin) |
| **Dynamic shapes** | Fixed via top-K padding; no variable keypoint count |
| **Sources** | [LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX), [Fabio’s blog](https://fabio-sim.github.io/blog/accelerating-lightglue-inference-onnx-runtime-tensorrt/), [superpoint_lightglue_tensorrt](https://github.com/fettahyildizz/superpoint_lightglue_tensorrt) |

---

## Recommended Deployment Path for Jetson Orin Nano Super

1. Use **LightGlue-ONNX** to export SuperPoint + LightGlue to ONNX (fixed top-K ≤ 3840).
2. Convert ONNX → TensorRT with `trtexec` on the Jetson (or with a matching TRT version).
3. Use **fettahyildizz/superpoint_lightglue_tensorrt** as a reference for C++ deployment.
4. Use FP16; FP8 is not available on Orin (Ampere).
5. Expect a 2–4x speedup over the current ONNX Runtime FP16 setup, depending on keypoint count and engine build.
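The conversion step (item 2) can be scripted. The sketch below assembles a `trtexec` invocation using standard flags (`--onnx`, `--saveEngine`, `--fp16`, `--minShapes`/`--optShapes`/`--maxShapes`); the file names and the `keypoints` input name are placeholders — the real dynamic-axis names come from the exported ONNX graph, and the shape flags are only needed if the export keeps a dynamic axis.

```python
# Sketch: build the trtexec command line for step 2 of the deployment path.
# The input name "keypoints" and the 1 x N x 2 shape are hypothetical stand-ins
# for whatever dynamic inputs the exported ONNX model actually declares.

def trtexec_cmd(onnx_path, engine_path, num_kpts=2048, fp16=True):
    shapes = f"keypoints:1x{num_kpts}x2"
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        # Pin min = opt = max so the engine is effectively shape-static,
        # matching the fixed top-K export described in Section 4.
        f"--minShapes={shapes}",
        f"--optShapes={shapes}",
        f"--maxShapes={shapes}",
    ]
    if fp16:
        cmd.append("--fp16")  # FP16 is the right precision target on Orin
    return cmd

# e.g. subprocess.run(trtexec_cmd("sp_lg.onnx", "sp_lg.engine"), check=True)
```

Running this on the Jetson itself (per item 2 and Section 10) keeps the engine tuned to the Orin's own kernels.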