# Component: Inference Engines

## Overview

**Purpose**: Provides pluggable inference backends (ONNX Runtime and TensorRT) behind a common abstract interface, including ONNX-to-TensorRT model conversion with FP16 and INT8 precision support.

**Pattern**: Strategy pattern — `InferenceEngine` defines the contract; `OnnxEngine` and `TensorRTEngine` are interchangeable implementations.

**Upstream**: Domain (`constants_inf` for logging).

**Downstream**: Inference Pipeline (creates and uses engines).

## Modules

| Module | Role |
|--------|------|
| `inference_engine` | Abstract base class defining `get_input_shape`, `get_batch_size`, `run` |
| `onnx_engine` | ONNX Runtime implementation (CPU/CUDA) |
| `tensorrt_engine` | TensorRT implementation (GPU) plus ONNX→TensorRT converter |

## Internal Interfaces

### InferenceEngine (abstract)

```
cdef class InferenceEngine:
    __init__(bytes model_bytes, int batch_size=1, **kwargs)
    cdef tuple get_input_shape()   # -> (height, width)
    cdef int get_batch_size()      # -> batch_size
    cdef run(input_data)           # -> list of output tensors
```

### OnnxEngine

```
cdef class OnnxEngine(InferenceEngine):
    # Implements all base methods
    # Provider priority: CUDA > CPU
```

### TensorRTEngine

```
cdef class TensorRTEngine(InferenceEngine):
    # Implements all base methods

    @staticmethod
    get_gpu_memory_bytes(int device_id) -> int

    @staticmethod
    get_engine_filename(str precision="fp16") -> str  # "fp16" or "int8"

    @staticmethod
    convert_from_source(bytes onnx_model, str calib_cache_path=None) -> bytes or None
```

## External API

None — internal component consumed by Inference Pipeline.
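The Strategy contract above can be sketched in pure Python. This is a hypothetical illustration, not the Cython implementation: `DummyEngine` and `make_engine` are invented names standing in for `OnnxEngine`/`TensorRTEngine` and the pipeline's engine-selection logic, which in reality wrap onnxruntime and TensorRT.

```python
from abc import ABC, abstractmethod


class InferenceEngine(ABC):
    """Strategy contract, mirroring the Cython base class above."""

    def __init__(self, model_bytes: bytes, batch_size: int = 1, **kwargs):
        self.model_bytes = model_bytes
        self.batch_size = batch_size

    @abstractmethod
    def get_input_shape(self) -> tuple:
        """Return the (height, width) the model expects."""

    def get_batch_size(self) -> int:
        return self.batch_size

    @abstractmethod
    def run(self, input_data) -> list:
        """Return a list of output tensors."""


class DummyEngine(InferenceEngine):
    """Stand-in backend showing that implementations are interchangeable."""

    def get_input_shape(self) -> tuple:
        return (1280, 1280)  # matches the documented dynamic-shape default

    def run(self, input_data) -> list:
        return [input_data]  # echoes input; a real engine runs the model


def make_engine(backend: str, model_bytes: bytes) -> InferenceEngine:
    # Hypothetical factory: the Inference Pipeline chooses the concrete
    # strategy (e.g. ONNX Runtime vs TensorRT) and calls the common API.
    engines = {"dummy": DummyEngine}
    return engines[backend](model_bytes)


engine = make_engine("dummy", b"\x00")
print(engine.get_input_shape())  # (1280, 1280)
```

Callers only see the `InferenceEngine` surface, so swapping backends requires no changes in the Inference Pipeline.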
## Data Access Patterns

- Model bytes are loaded in-memory (provided by the caller)
- TensorRT: CUDA device memory is allocated at init; async H2D/D2H transfers occur during inference
- ONNX: memory is managed by onnxruntime internally

## Implementation Details

- **OnnxEngine**: default batch_size=1; loads the model into an `onnxruntime.InferenceSession`
- **TensorRTEngine**: default batch_size=4; dynamic dimensions default to 1280×1280 input, 300 max detections
- **Model conversion**: `convert_from_source` uses 90% of GPU memory as workspace; selects INT8 precision when a calibration cache is supplied, FP16 if the GPU supports it, FP32 otherwise
- **Engine filename**: GPU-specific with a precision suffix (`azaion.cc_{major}.{minor}_sm_{count}.engine` for FP16, `*.int8.engine` for INT8) — prevents cache confusion between precision variants
- Output format: `[batch][detection_index][x1, y1, x2, y2, confidence, class_id]`

## Caveats

- TensorRT engine files are GPU-architecture-specific and not portable between machines
- INT8 engine files require a pre-computed calibration cache; the cache is generated offline and uploaded to the Loader manually
- Importing `pycuda.autoinit` is required for its side effect (it initializes the CUDA context)
- The 1280×1280 default for dynamic shapes is hardcoded — not configurable

## Dependency Graph

```mermaid
graph TD
    onnx_engine --> inference_engine
    onnx_engine --> constants_inf
    tensorrt_engine --> inference_engine
    tensorrt_engine --> constants_inf
```

## Logging Strategy

Logs model metadata at init and conversion progress/errors via `constants_inf.log`/`logerror`.
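The documented output layout — `[batch][detection_index][x1, y1, x2, y2, confidence, class_id]` — can be consumed as plain nested lists. The helper below is a hypothetical sketch (`filter_detections` is not part of the component); it only demonstrates how a downstream consumer would index into the documented format.

```python
def filter_detections(outputs, conf_threshold=0.5):
    """Keep detections above the confidence threshold, per batch item.

    `outputs` follows the documented layout:
    outputs[batch][detection_index] == [x1, y1, x2, y2, confidence, class_id]
    """
    kept = []
    for batch_item in outputs:
        kept.append([
            det for det in batch_item
            if det[4] >= conf_threshold  # index 4 is confidence
        ])
    return kept


# One batch item with two detections; only the first passes the threshold.
outputs = [
    [
        [10.0, 20.0, 110.0, 220.0, 0.92, 0],
        [5.0, 5.0, 50.0, 50.0, 0.12, 3],
    ],
]
print(filter_detections(outputs))
# [[[10.0, 20.0, 110.0, 220.0, 0.92, 0]]]
```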