# Component: Inference Engines

## Overview

**Purpose**: Provides pluggable inference backends (ONNX Runtime and TensorRT) behind a common abstract interface, including ONNX-to-TensorRT model conversion with FP16 and INT8 precision support.

**Pattern**: Strategy pattern — `InferenceEngine` defines the contract; `OnnxEngine` and `TensorRTEngine` are interchangeable implementations.

**Upstream**: Domain (`constants_inf` for logging).

**Downstream**: Inference Pipeline (creates and uses engines).

## Modules

| Module | Role |
|--------|------|
| `inference_engine` | Abstract base class defining `get_input_shape`, `get_batch_size`, `run` |
| `onnx_engine` | ONNX Runtime implementation (CPU/CUDA) |
| `tensorrt_engine` | TensorRT implementation (GPU) plus ONNX→TensorRT converter |

## Internal Interfaces

### InferenceEngine (abstract)

```
cdef class InferenceEngine:
    __init__(bytes model_bytes, int batch_size=1, **kwargs)
    cdef tuple get_input_shape()   # -> (height, width)
    cdef int get_batch_size()      # -> batch_size
    cdef run(input_data)           # -> list of output tensors
```

### OnnxEngine

```
cdef class OnnxEngine(InferenceEngine):
    # Implements all base methods
    # Provider priority: CUDA > CPU
```

### TensorRTEngine

```
cdef class TensorRTEngine(InferenceEngine):
    # Implements all base methods

    @staticmethod
    get_gpu_memory_bytes(int device_id) -> int

    @staticmethod
    get_engine_filename(str precision="fp16") -> str  # "fp16" or "int8"

    @staticmethod
    convert_from_source(bytes onnx_model, str calib_cache_path=None) -> bytes or None
```

## External API

None — internal component consumed by Inference Pipeline.
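The Strategy contract above can be sketched in pure Python. This is a hypothetical illustration, not the Cython implementation: `DummyEngine` and `make_engine` are invented names standing in for `OnnxEngine`/`TensorRTEngine` and the pipeline's engine-selection logic, which in reality wrap onnxruntime and TensorRT.

```python
from abc import ABC, abstractmethod


class InferenceEngine(ABC):
    """Strategy contract, mirroring the Cython base class above."""

    def __init__(self, model_bytes: bytes, batch_size: int = 1, **kwargs):
        self.model_bytes = model_bytes
        self.batch_size = batch_size

    @abstractmethod
    def get_input_shape(self) -> tuple:
        """Return the (height, width) the model expects."""

    def get_batch_size(self) -> int:
        return self.batch_size

    @abstractmethod
    def run(self, input_data) -> list:
        """Return a list of output tensors."""


class DummyEngine(InferenceEngine):
    """Stand-in backend showing that implementations are interchangeable."""

    def get_input_shape(self) -> tuple:
        return (1280, 1280)  # matches the documented dynamic-shape default

    def run(self, input_data) -> list:
        return [input_data]  # echoes input; a real engine runs the model


def make_engine(backend: str, model_bytes: bytes) -> InferenceEngine:
    # Hypothetical factory: the Inference Pipeline chooses the concrete
    # strategy (e.g. ONNX Runtime vs TensorRT) and calls the common API.
    engines = {"dummy": DummyEngine}
    return engines[backend](model_bytes)


engine = make_engine("dummy", b"\x00")
print(engine.get_input_shape())  # (1280, 1280)
```

Callers only see the `InferenceEngine` surface, so swapping backends requires no changes in the Inference Pipeline.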
## Data Access Patterns

- Model bytes are loaded in-memory (provided by the caller)
- TensorRT: CUDA device memory is allocated at init; async H2D/D2H transfers occur during inference
- ONNX: memory is managed by onnxruntime internally

## Implementation Details

- **OnnxEngine**: default batch_size=1; loads the model into an `onnxruntime.InferenceSession`
- **TensorRTEngine**: default batch_size=4; dynamic dimensions default to 1280×1280 input, 300 max detections
- **Model conversion**: `convert_from_source` uses 90% of GPU memory as workspace; selects INT8 precision when a calibration cache is supplied, FP16 if the GPU supports it, FP32 otherwise
- **Engine filename**: GPU-specific with a precision suffix (`azaion.cc_{major}.{minor}_sm_{count}.engine` for FP16, `*.int8.engine` for INT8) — prevents cache confusion between precision variants
- Output format: `[batch][detection_index][x1, y1, x2, y2, confidence, class_id]`

## Caveats

- TensorRT engine files are GPU-architecture-specific and not portable between machines
- INT8 engine files require a pre-computed calibration cache; the cache is generated offline and uploaded to the Loader manually
- Importing `pycuda.autoinit` is required for its side effect (it initializes the CUDA context)
- The 1280×1280 default for dynamic shapes is hardcoded — not configurable

## Dependency Graph

```mermaid
graph TD
    onnx_engine --> inference_engine
    onnx_engine --> constants_inf
    tensorrt_engine --> inference_engine
    tensorrt_engine --> constants_inf
```

## Logging Strategy

Logs model metadata at init and conversion progress/errors via `constants_inf.log`/`logerror`.
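The documented output layout — `[batch][detection_index][x1, y1, x2, y2, confidence, class_id]` — can be consumed as plain nested lists. The helper below is a hypothetical sketch (`filter_detections` is not part of the component); it only demonstrates how a downstream consumer would index into the documented format.

```python
def filter_detections(outputs, conf_threshold=0.5):
    """Keep detections above the confidence threshold, per batch item.

    `outputs` follows the documented layout:
    outputs[batch][detection_index] == [x1, y1, x2, y2, confidence, class_id]
    """
    kept = []
    for batch_item in outputs:
        kept.append([
            det for det in batch_item
            if det[4] >= conf_threshold  # index 4 is confidence
        ])
    return kept


# One batch item with two detections; only the first passes the threshold.
outputs = [
    [
        [10.0, 20.0, 110.0, 220.0, 0.92, 0],
        [5.0, 5.0, 50.0, 50.0, 0.12, 3],
    ],
]
print(filter_detections(outputs))
# [[[10.0, 20.0, 110.0, 220.0, 0.92, 0]]]
```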