detections/_docs/02_document/modules/tensorrt_engine.md
Oleksandr Bezdieniezhnykh 7a7f2a4cdd [AZ-180] Update module and component docs for Jetson/INT8 changes
Made-with: Cursor
2026-04-02 07:25:22 +03:00


Module: tensorrt_engine

Purpose

TensorRT-based inference engine — high-performance GPU inference with CUDA memory management and ONNX-to-TensorRT model conversion. Supports FP16 and INT8 precision; INT8 is used when a pre-computed calibration cache is supplied.

Public Interface

Class: TensorRTEngine (extends InferenceEngine)

| Method | Signature | Description |
| --- | --- | --- |
| `__init__` | `(bytes model_bytes, int batch_size=4, **kwargs)` | Deserializes the TensorRT engine from bytes, allocates CUDA input/output memory, creates the execution context and stream |
| `get_input_shape` | `() -> tuple` | Returns `(height, width)` from the input tensor shape |
| `get_batch_size` | `() -> int` | Returns the batch size |
| `run` | `(input_data) -> list` | Async H2D copy → execute → D2H copy; returns output as a numpy array |
| `get_gpu_memory_bytes` | `(int device_id) -> int` | Static. Returns total GPU memory in bytes (defaults to 2 GB if the query fails) |
| `get_engine_filename` | `(str precision="fp16") -> str` | Static. Returns an engine filename encoding compute capability, SM count, and precision suffix: `azaion.cc_{major}.{minor}_sm_{count}.engine` (FP16) or `azaion.cc_{major}.{minor}_sm_{count}.int8.engine` (INT8) |
| `convert_from_source` | `(bytes onnx_model, str calib_cache_path=None) -> bytes or None` | Static. Converts an ONNX model to a serialized TensorRT engine. Uses INT8 when `calib_cache_path` is provided and the file exists; falls back to FP16 if the GPU supports it, FP32 otherwise |
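The filename scheme of `get_engine_filename` can be mirrored by a small pure helper. This is a sketch under the documented naming convention; `engine_filename` and its explicit parameters are hypothetical — the real static method queries compute capability and SM count from the device itself.

```python
def engine_filename(cc_major: int, cc_minor: int, sm_count: int,
                    precision: str = "fp16") -> str:
    """Hypothetical helper mirroring the documented filename scheme.

    Encodes compute capability and SM count so engines built on one
    GPU are never loaded on an incompatible one, and adds a precision
    suffix so INT8 and FP16 engine caches never collide.
    """
    base = f"azaion.cc_{cc_major}.{cc_minor}_sm_{sm_count}"
    return base + (".int8.engine" if precision == "int8" else ".engine")
```

For example, an Orin-class device with compute capability 8.7 and 8 SMs would yield `azaion.cc_8.7_sm_8.engine` for FP16 and `azaion.cc_8.7_sm_8.int8.engine` for INT8.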

Helper Class: _CacheCalibrator (module-private)

Implements trt.IInt8EntropyCalibrator2. Loads a pre-generated INT8 calibration cache from disk and supplies it to the TensorRT builder. Used only when convert_from_source is called with a valid calib_cache_path.
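The calibrator's role can be illustrated with a TensorRT-free stand-in. This is a sketch only: `CacheOnlyCalibrator` is a hypothetical name, and the real `_CacheCalibrator` subclasses `trt.IInt8EntropyCalibrator2`; the method names below are the standard TensorRT calibrator surface, shown here as plain Python so the idea is visible without a GPU environment.

```python
import os


class CacheOnlyCalibrator:
    """Hypothetical stand-in for a cache-only INT8 calibrator.

    The real class inherits trt.IInt8EntropyCalibrator2 and is used
    solely to feed a pre-generated calibration cache to the builder.
    """

    def __init__(self, cache_path: str):
        self._cache_path = cache_path

    def get_batch_size(self) -> int:
        # Unused when calibrating purely from a cache.
        return 1

    def get_batch(self, names):
        # Returning None signals "no live calibration data";
        # TensorRT then relies entirely on the cache.
        return None

    def read_calibration_cache(self):
        # Supply the pre-computed cache bytes, or None if missing.
        if os.path.isfile(self._cache_path):
            with open(self._cache_path, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        # Persist the cache TensorRT hands back after calibration.
        with open(self._cache_path, "wb") as f:
            f.write(cache)
```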

Internal Logic

  • Input shape defaults to 1280×1280 for dynamic dimensions.
  • Output shape defaults to 300 max detections × 6 values (x1, y1, x2, y2, conf, cls) for dynamic dimensions.
  • run uses async CUDA memory transfers with stream synchronization.
  • convert_from_source uses explicit batch mode:
    • If calib_cache_path is a valid file path → sets BuilderFlag.INT8 and assigns _CacheCalibrator as the calibrator
    • Else if GPU has fast FP16 → sets BuilderFlag.FP16
    • Else → FP32 (no flag)
  • Engine filenames encode precision suffix (*.int8.engine vs *.engine) to prevent INT8/FP16 engine cache confusion across conversions.
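The precision decision above can be condensed into a pure function. This is a sketch: `select_precision` is a hypothetical name, and the real code sets TensorRT builder flags (`BuilderFlag.INT8`, `BuilderFlag.FP16`) rather than returning a string.

```python
import os


def select_precision(calib_cache_path, fp16_supported: bool) -> str:
    """Hypothetical mirror of the documented precision fallback chain."""
    # 1. INT8 wins only when a calibration cache file actually exists.
    if calib_cache_path and os.path.isfile(calib_cache_path):
        return "int8"
    # 2. Otherwise prefer FP16 when the GPU has fast FP16 support.
    if fp16_supported:
        return "fp16"
    # 3. Last resort: FP32 (no builder flag set).
    return "fp32"
```

Keeping the cache-file existence check first means a stale or missing cache silently degrades to FP16/FP32 instead of failing the conversion.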

Dependencies

  • External: tensorrt, pycuda.driver, pycuda.autoinit, pynvml, numpy, os
  • Internal: inference_engine (base class), constants_inf (logging)

Consumers

  • inference — instantiated when compatible NVIDIA GPU is found; also calls convert_from_source and get_engine_filename

Data Models

None (wraps TensorRT runtime objects).

Configuration

  • Engine filename encodes GPU compute capability + SM count + precision suffix.
  • Workspace memory is 90% of available GPU memory.
  • INT8 calibration cache path is supplied at conversion time (downloaded by inference.init_ai).
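The workspace sizing rule above can be sketched as a one-liner. `workspace_bytes` is a hypothetical name; the real code obtains total memory via `get_gpu_memory_bytes` (with its documented 2 GB fallback) and passes the result to the TensorRT builder config.

```python
def workspace_bytes(total_gpu_bytes=None) -> int:
    """Hypothetical sketch: workspace = 90% of available GPU memory.

    Falls back to the documented 2 GB default when the memory
    query is unavailable (e.g. pynvml not present).
    """
    total = total_gpu_bytes if total_gpu_bytes else 2 * 1024 ** 3
    return int(total * 0.9)
```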

External Integrations

None directly — model bytes provided by caller.

Security

None.

Tests

  • tests/test_az180_jetson_int8.py — unit tests for INT8 flag (AC-3) and FP16 fallback (AC-4); skipped when TensorRT is not available (GPU environment required).