# Module: tensorrt_engine

## Purpose

TensorRT-based inference engine: high-performance GPU inference with CUDA memory management and ONNX-to-TensorRT model conversion. Supports FP16 and INT8 precision; INT8 is used when a pre-computed calibration cache is supplied.

## Public Interface

### Class: TensorRTEngine (extends InferenceEngine)

| Method | Signature | Description |
|--------|-----------|-------------|
| `__init__` | `(bytes model_bytes, int batch_size=4, **kwargs)` | Deserializes a TensorRT engine from bytes, allocates CUDA input/output memory, and creates the execution context and stream |
| `get_input_shape` | `() -> tuple` | Returns `(height, width)` from the input tensor shape |
| `get_batch_size` | `() -> int` | Returns the batch size |
| `run` | `(input_data) -> list` | Async H2D copy → execute → D2H copy; returns output as a numpy array |
| `get_gpu_memory_bytes` | `(int device_id) -> int` | Static. Returns total GPU memory in bytes (defaults to 2 GB if unavailable) |
| `get_engine_filename` | `(str precision="fp16") -> str` | Static. Returns an engine filename encoding compute capability, SM count, and precision suffix: `azaion.cc_{major}.{minor}_sm_{count}.engine` (FP16) or `azaion.cc_{major}.{minor}_sm_{count}.int8.engine` (INT8) |
| `convert_from_source` | `(bytes onnx_model, str calib_cache_path=None) -> bytes or None` | Static. Converts an ONNX model to a serialized TensorRT engine. Uses INT8 when `calib_cache_path` is provided and the file exists; otherwise falls back to FP16 if the GPU supports it, FP32 if not |

### Helper Class: `_CacheCalibrator` (module-private)

Implements `trt.IInt8EntropyCalibrator2`. Loads a pre-generated INT8 calibration cache from disk and supplies it to the TensorRT builder. Used only when `convert_from_source` is called with a valid `calib_cache_path`.

## Internal Logic

- Input shape defaults to 1280×1280 for dynamic dimensions.
- Output shape defaults to 300 max detections × 6 values (x1, y1, x2, y2, conf, cls) for dynamic dimensions.
- `run` uses async CUDA memory transfers with stream synchronization.
- `convert_from_source` uses explicit batch mode:
  - If `calib_cache_path` is a valid file path → sets `BuilderFlag.INT8` and assigns `_CacheCalibrator` as the calibrator
  - Else if the GPU has fast FP16 → sets `BuilderFlag.FP16`
  - Else → FP32 (no flag)
- Engine filenames encode a precision suffix (`*.int8.engine` vs `*.engine`) to prevent INT8/FP16 engine-cache confusion across conversions.

## Dependencies

- **External**: `tensorrt`, `pycuda.driver`, `pycuda.autoinit`, `pynvml`, `numpy`, `os`
- **Internal**: `inference_engine` (base class), `constants_inf` (logging)

## Consumers

- `inference` — instantiated when a compatible NVIDIA GPU is found; also calls `convert_from_source` and `get_engine_filename`

## Data Models

None (wraps TensorRT runtime objects).

## Configuration

- Engine filename encodes GPU compute capability + SM count + precision suffix.
- Workspace memory is 90% of available GPU memory.
- INT8 calibration cache path is supplied at conversion time (downloaded by `inference.init_ai`).

## External Integrations

None directly — model bytes are provided by the caller.

## Security

None.

## Tests

- `tests/test_az180_jetson_int8.py` — unit tests for the INT8 flag (AC-3) and FP16 fallback (AC-4); skipped when TensorRT is not available (GPU environment required).
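## Appendix: Illustrative Sketches

The INT8 → FP16 → FP32 precision fallback in `convert_from_source` can be expressed as a small pure-Python decision function. This is a sketch of the documented logic only, not the module's actual code; the function name `select_precision` and the `gpu_has_fast_fp16` parameter (which in the real conversion path would come from TensorRT's builder, e.g. `builder.platform_has_fast_fp16`) are illustrative assumptions:

```python
import os


def select_precision(calib_cache_path, gpu_has_fast_fp16):
    """Mirror the documented precision fallback: INT8 -> FP16 -> FP32.

    INT8 is selected only when a calibration cache file actually exists
    on disk; otherwise FP16 is used if the GPU supports it, else FP32.
    """
    if calib_cache_path and os.path.isfile(calib_cache_path):
        return "int8"  # BuilderFlag.INT8 + _CacheCalibrator in the real path
    if gpu_has_fast_fp16:
        return "fp16"  # BuilderFlag.FP16
    return "fp32"      # no builder flag set
```

Keeping the decision separate from the TensorRT builder calls makes this branch unit-testable without a GPU, which matches how the AC-3/AC-4 tests exercise the flag selection.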
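The engine-filename scheme is simple enough to sketch as a pure string helper. The helper name `engine_filename` and its explicit `cc_major`/`cc_minor`/`sm_count` parameters are illustrative (the real static method queries the device itself); only the filename pattern comes from this document:

```python
def engine_filename(cc_major, cc_minor, sm_count, precision="fp16"):
    """Build the cached-engine filename from GPU properties.

    The precision suffix (".int8.engine" vs ".engine") keeps INT8 and
    FP16 engines from colliding in the same cache directory.
    """
    suffix = ".int8.engine" if precision == "int8" else ".engine"
    return f"azaion.cc_{cc_major}.{cc_minor}_sm_{sm_count}{suffix}"
```

Encoding compute capability and SM count in the name matters because a serialized TensorRT engine is only valid on the GPU architecture it was built for.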