mirror of
https://github.com/azaion/detections.git
synced 2026-04-22 22:56:31 +00:00
7a7f2a4cdd
3.2 KiB
Component: Inference Engines
Overview
Purpose: Provides pluggable inference backends (ONNX Runtime and TensorRT) behind a common abstract interface, including ONNX-to-TensorRT model conversion with FP16 and INT8 precision support.
Pattern: Strategy pattern — InferenceEngine defines the contract; OnnxEngine and TensorRTEngine are interchangeable implementations.
Upstream: Domain (constants_inf for logging). Downstream: Inference Pipeline (creates and uses engines).
Modules
| Module | Role |
|---|---|
| inference_engine | Abstract base class defining get_input_shape, get_batch_size, run |
| onnx_engine | ONNX Runtime implementation (CPU/CUDA) |
| tensorrt_engine | TensorRT implementation (GPU) + ONNX→TensorRT converter |
Internal Interfaces
InferenceEngine (abstract)
```cython
cdef class InferenceEngine:
    __init__(bytes model_bytes, int batch_size=1, **kwargs)
    cdef tuple get_input_shape()  # -> (height, width)
    cdef int get_batch_size()     # -> batch_size
    cdef run(input_data)          # -> list of output tensors
```
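The Strategy-pattern contract above can be sketched in plain Python (the Cython `cdef` methods become ordinary methods; `DummyEngine` is a hypothetical stand-in added here to show interchangeability, not part of the source):

```python
from abc import ABC, abstractmethod


class InferenceEngine(ABC):
    """Strategy-pattern contract; concrete engines are interchangeable."""

    def __init__(self, model_bytes: bytes, batch_size: int = 1, **kwargs):
        self.model_bytes = model_bytes
        self.batch_size = batch_size

    @abstractmethod
    def get_input_shape(self) -> tuple:
        """Return the (height, width) the model expects."""

    def get_batch_size(self) -> int:
        return self.batch_size

    @abstractmethod
    def run(self, input_data):
        """Return a list of output tensors."""


class DummyEngine(InferenceEngine):
    # Hypothetical implementation; a real engine would wrap ONNX Runtime
    # or TensorRT here.
    def get_input_shape(self):
        return (1280, 1280)

    def run(self, input_data):
        return [input_data]


engine: InferenceEngine = DummyEngine(b"", batch_size=4)
print(engine.get_input_shape())  # (1280, 1280)
print(engine.get_batch_size())   # 4
```

The caller only depends on the abstract type, so swapping ONNX Runtime for TensorRT requires no pipeline changes.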
OnnxEngine
```cython
cdef class OnnxEngine(InferenceEngine):
    # Implements all base methods
    # Provider priority: CUDA > CPU
```
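The CUDA-over-CPU provider priority can be expressed as a small helper; a sketch only, assuming the engine passes the resulting list to `onnxruntime.InferenceSession` (the helper name `pick_providers` is hypothetical):

```python
# Preferred order: CUDA first, CPU as the universal fallback.
PROVIDER_PRIORITY = ["CUDAExecutionProvider", "CPUExecutionProvider"]


def pick_providers(available):
    """Return the providers from PROVIDER_PRIORITY that are actually
    available, preserving priority order; fall back to CPU if none match."""
    chosen = [p for p in PROVIDER_PRIORITY if p in available]
    return chosen or ["CPUExecutionProvider"]


# Usage sketch (not executed here): feed the result into ONNX Runtime, e.g.
#   import onnxruntime as ort
#   session = ort.InferenceSession(
#       model_bytes, providers=pick_providers(ort.get_available_providers()))
print(pick_providers(["CPUExecutionProvider"]))  # ['CPUExecutionProvider']
```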
TensorRTEngine
```cython
cdef class TensorRTEngine(InferenceEngine):
    # Implements all base methods
    @staticmethod get_gpu_memory_bytes(int device_id) -> int
    @staticmethod get_engine_filename(str precision="fp16") -> str  # "fp16" or "int8"
    @staticmethod convert_from_source(bytes onnx_model, str calib_cache_path=None) -> bytes or None
```
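The precision fallback (INT8 with a calibration cache, FP16 when supported, FP32 otherwise) and the GPU-specific filename scheme can be sketched as pure functions; `choose_precision` and `make_engine_filename` are hypothetical names, and the exact placement of the `.int8` suffix is an assumption:

```python
def choose_precision(has_calib_cache: bool, gpu_supports_fp16: bool) -> str:
    """INT8 when a calibration cache is supplied, FP16 when the GPU
    supports it, FP32 otherwise."""
    if has_calib_cache:
        return "int8"
    return "fp16" if gpu_supports_fp16 else "fp32"


def make_engine_filename(cc_major: int, cc_minor: int, sm_count: int,
                         precision: str = "fp16") -> str:
    """GPU-specific engine filename keyed on compute capability and SM
    count, so an engine cached for one GPU or precision is never reused
    for another."""
    suffix = ".int8.engine" if precision == "int8" else ".engine"
    return f"azaion.cc_{cc_major}.{cc_minor}_sm_{sm_count}{suffix}"


print(make_engine_filename(8, 6, 84))          # azaion.cc_8.6_sm_84.engine
print(make_engine_filename(8, 6, 84, "int8"))  # azaion.cc_8.6_sm_84.int8.engine
```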
External API
None — internal component consumed by Inference Pipeline.
Data Access Patterns
- Model bytes loaded in-memory (provided by caller)
- TensorRT: CUDA device memory allocated at init, async H2D/D2H transfers during inference
- ONNX: managed by onnxruntime internally
Implementation Details
- OnnxEngine: default batch_size=1; loads the model into an `onnxruntime.InferenceSession`
- TensorRTEngine: default batch_size=4; dynamic dimensions default to a 1280×1280 input and 300 max detections
- Model conversion: `convert_from_source` uses 90% of GPU memory as workspace; INT8 precision when a calibration cache is supplied, FP16 if the GPU supports it, FP32 otherwise
- Engine filename: GPU-specific with a precision suffix (`azaion.cc_{major}.{minor}_sm_{count}.engine` for FP16, `*.int8.engine` for INT8), which prevents cache confusion between precision variants
- Output format: `[batch][detection_index][x1, y1, x2, y2, confidence, class_id]`
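The output layout lends itself to a small post-processing pass; a sketch only, where the `filter_detections` name and the 0.5 default threshold are assumptions, not part of the source:

```python
def filter_detections(outputs, conf_threshold=0.5):
    """Flatten [batch][detection_index][x1, y1, x2, y2, confidence, class_id]
    output, keeping only detections at or above the confidence threshold."""
    kept = []
    for batch_idx, detections in enumerate(outputs):
        for det in detections:
            x1, y1, x2, y2, conf, cls = det
            if conf >= conf_threshold:
                kept.append((batch_idx, (x1, y1, x2, y2), conf, int(cls)))
    return kept


# Two detections in one batch item; only the high-confidence one survives.
out = [[[0, 0, 10, 10, 0.9, 1], [0, 0, 5, 5, 0.2, 0]]]
print(filter_detections(out))  # [(0, (0, 0, 10, 10), 0.9, 1)]
```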
Caveats
- TensorRT engine files are GPU-architecture-specific and not portable
- INT8 engine files require a pre-computed calibration cache; cache is generated offline and uploaded to Loader manually
- The `pycuda.autoinit` import is required for its side effect (it initializes the CUDA context)
- The 1280×1280 default for dynamic shapes is hardcoded, not configurable
Dependency Graph
```mermaid
graph TD
    onnx_engine --> inference_engine
    onnx_engine --> constants_inf
    tensorrt_engine --> inference_engine
    tensorrt_engine --> constants_inf
```
Logging Strategy
Logs model metadata at init and conversion progress/errors via constants_inf.log/logerror.