# Component: Inference Engines

## Overview

**Purpose**: Provides pluggable inference backends (ONNX Runtime and TensorRT) behind a common abstract interface, including ONNX-to-TensorRT model conversion.

**Pattern**: Strategy pattern — `InferenceEngine` defines the contract; `OnnxEngine` and `TensorRTEngine` are interchangeable implementations.

**Upstream**: Domain (`constants_inf` for logging).

**Downstream**: Inference Pipeline (creates and uses engines).

## Modules

| Module | Role |
|--------|------|
| `inference_engine` | Abstract base class defining `get_input_shape`, `get_batch_size`, `run` |
| `onnx_engine` | ONNX Runtime implementation (CPU/CUDA) |
| `tensorrt_engine` | TensorRT implementation (GPU) + ONNX→TensorRT converter |

## Internal Interfaces

### InferenceEngine (abstract)

```
cdef class InferenceEngine:
    __init__(bytes model_bytes, int batch_size=1, **kwargs)
    cdef tuple get_input_shape()   # -> (height, width)
    cdef int get_batch_size()      # -> batch_size
    cdef run(input_data)           # -> list of output tensors
```

### OnnxEngine

```
cdef class OnnxEngine(InferenceEngine):
    # Implements all base methods
    # Provider priority: CUDA > CPU
```

### TensorRTEngine

```
cdef class TensorRTEngine(InferenceEngine):
    # Implements all base methods
    @staticmethod
    get_gpu_memory_bytes(int device_id) -> int
    @staticmethod
    get_engine_filename(int device_id) -> str
    @staticmethod
    convert_from_onnx(bytes onnx_model) -> bytes or None
```

## External API

None — internal component consumed by the Inference Pipeline.
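To illustrate the Strategy pattern described above, here is a minimal plain-Python sketch of the contract. The real classes are Cython (`cdef`) and back `run` with onnxruntime or TensorRT; the `DummyEngine` stand-in below is purely hypothetical and exists only to show that implementations are interchangeable behind the base interface.

```python
from abc import ABC, abstractmethod


class InferenceEngine(ABC):
    """Common contract both backends implement (plain-Python sketch)."""

    def __init__(self, model_bytes: bytes, batch_size: int = 1, **kwargs):
        self._model_bytes = model_bytes
        self._batch_size = batch_size

    @abstractmethod
    def get_input_shape(self) -> tuple:
        """Return the (height, width) the model expects."""

    def get_batch_size(self) -> int:
        return self._batch_size

    @abstractmethod
    def run(self, input_data):
        """Run inference; return a list of output tensors."""


class DummyEngine(InferenceEngine):
    """Illustrative stand-in for OnnxEngine / TensorRTEngine."""

    def get_input_shape(self):
        return (1280, 1280)  # the documented default input size

    def run(self, input_data):
        return []  # a real backend returns output tensors here


# Callers program against the base type, so backends can be swapped:
engine: InferenceEngine = DummyEngine(b"", batch_size=4)
```

Because the Inference Pipeline only depends on the abstract interface, choosing ONNX Runtime vs. TensorRT is a construction-time decision, not a call-site concern.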
## Data Access Patterns

- Model bytes loaded in-memory (provided by caller)
- TensorRT: CUDA device memory allocated at init, async H2D/D2H transfers during inference
- ONNX: managed by onnxruntime internally

## Implementation Details

- **OnnxEngine**: default `batch_size=1`; loads the model into an `onnxruntime.InferenceSession`
- **TensorRTEngine**: default `batch_size=4`; dynamic dimensions default to 1280×1280 input, 300 max detections
- **Model conversion**: `convert_from_onnx` uses 90% of GPU memory as workspace and enables FP16 if the hardware supports it
- **Engine filename**: GPU-specific (`azaion.cc_{major}.{minor}_sm_{count}.engine`) — allows pre-built engine caching per GPU architecture
- Output format: `[batch][detection_index][x1, y1, x2, y2, confidence, class_id]`

## Caveats

- TensorRT engine files are GPU-architecture-specific and not portable
- `pycuda.autoinit` must be imported for its side effect (it initializes the CUDA context)
- The 1280×1280 default for dynamic shapes is hardcoded — not configurable

## Dependency Graph

```mermaid
graph TD
    onnx_engine --> inference_engine
    onnx_engine --> constants_inf
    tensorrt_engine --> inference_engine
    tensorrt_engine --> constants_inf
```

## Logging Strategy

Logs model metadata at init and conversion progress/errors via `constants_inf.log` / `constants_inf.logerror`.
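The engine-filename scheme and the detection output format from Implementation Details can be sketched concretely. This is a hedged illustration: the real `get_engine_filename` queries compute capability and SM count from the CUDA device, whereas the helper below takes them as parameters, and `parse_detections` plus its threshold are hypothetical names added here only to show how the nested output layout is consumed.

```python
def get_engine_filename(cc_major: int, cc_minor: int, sm_count: int) -> str:
    # Mirrors the documented naming scheme azaion.cc_{major}.{minor}_sm_{count}.engine;
    # GPU-specific names let pre-built engines be cached per architecture.
    return f"azaion.cc_{cc_major}.{cc_minor}_sm_{sm_count}.engine"


def parse_detections(output, conf_threshold: float = 0.5):
    """Filter the [batch][detection][x1, y1, x2, y2, confidence, class_id]
    output, keeping detections at or above the confidence threshold."""
    kept = []
    for batch in output:
        kept.append([det for det in batch if det[4] >= conf_threshold])
    return kept


# Example: a GPU with compute capability 8.6 and 68 SMs yields
# "azaion.cc_8.6_sm_68.engine" (values are illustrative).
name = get_engine_filename(8, 6, 68)
```

Since engine files are architecture-specific (see Caveats), a cached file whose name does not match the current GPU's capability/SM signature should trigger a fresh `convert_from_onnx` rather than being reused.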