mirror of
https://github.com/azaion/detections.git
synced 2026-04-22 22:56:31 +00:00
7a7f2a4cdd
3.2 KiB
Component: Inference Engines
Overview
Purpose: Provides pluggable inference backends (ONNX Runtime and TensorRT) behind a common abstract interface, including ONNX-to-TensorRT model conversion with FP16 and INT8 precision support.
Pattern: Strategy pattern — InferenceEngine defines the contract; OnnxEngine and TensorRTEngine are interchangeable implementations.
Upstream: Domain (constants_inf for logging). Downstream: Inference Pipeline (creates and uses engines).
Modules
| Module | Role |
|---|---|
| inference_engine | Abstract base class defining get_input_shape, get_batch_size, run |
| onnx_engine | ONNX Runtime implementation (CPU/CUDA) |
| tensorrt_engine | TensorRT implementation (GPU) + ONNX→TensorRT converter |
Internal Interfaces
InferenceEngine (abstract)
```cython
cdef class InferenceEngine:
    __init__(bytes model_bytes, int batch_size=1, **kwargs)
    cdef tuple get_input_shape()  # -> (height, width)
    cdef int get_batch_size()     # -> batch_size
    cdef run(input_data)          # -> list of output tensors
```
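The Strategy-pattern contract above can be sketched in plain Python (the Cython `cdef` methods become ordinary methods; `DummyEngine` is a hypothetical stand-in added here to show interchangeability, not part of the source):

```python
from abc import ABC, abstractmethod


class InferenceEngine(ABC):
    """Strategy-pattern contract; concrete engines are interchangeable."""

    def __init__(self, model_bytes: bytes, batch_size: int = 1, **kwargs):
        self.model_bytes = model_bytes
        self.batch_size = batch_size

    @abstractmethod
    def get_input_shape(self) -> tuple:
        """Return the (height, width) the model expects."""

    def get_batch_size(self) -> int:
        return self.batch_size

    @abstractmethod
    def run(self, input_data):
        """Return a list of output tensors."""


class DummyEngine(InferenceEngine):
    # Hypothetical implementation; a real engine would wrap ONNX Runtime
    # or TensorRT here.
    def get_input_shape(self):
        return (1280, 1280)

    def run(self, input_data):
        return [input_data]


engine: InferenceEngine = DummyEngine(b"", batch_size=4)
print(engine.get_input_shape())  # (1280, 1280)
print(engine.get_batch_size())   # 4
```

The caller only depends on the abstract type, so swapping ONNX Runtime for TensorRT requires no pipeline changes.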
OnnxEngine
```cython
cdef class OnnxEngine(InferenceEngine):
    # Implements all base methods
    # Provider priority: CUDA > CPU
```
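The CUDA-over-CPU provider priority can be expressed as a small helper; a sketch only, assuming the engine passes the resulting list to `onnxruntime.InferenceSession` (the helper name `pick_providers` is hypothetical):

```python
# Preferred order: CUDA first, CPU as the universal fallback.
PROVIDER_PRIORITY = ["CUDAExecutionProvider", "CPUExecutionProvider"]


def pick_providers(available):
    """Return the providers from PROVIDER_PRIORITY that are actually
    available, preserving priority order; fall back to CPU if none match."""
    chosen = [p for p in PROVIDER_PRIORITY if p in available]
    return chosen or ["CPUExecutionProvider"]


# Usage sketch (not executed here): feed the result into ONNX Runtime, e.g.
#   import onnxruntime as ort
#   session = ort.InferenceSession(
#       model_bytes, providers=pick_providers(ort.get_available_providers()))
print(pick_providers(["CPUExecutionProvider"]))  # ['CPUExecutionProvider']
```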
TensorRTEngine
```cython
cdef class TensorRTEngine(InferenceEngine):
    # Implements all base methods
    @staticmethod get_gpu_memory_bytes(int device_id) -> int
    @staticmethod get_engine_filename(str precision="fp16") -> str  # "fp16" or "int8"
    @staticmethod convert_from_source(bytes onnx_model, str calib_cache_path=None) -> bytes or None
```
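The precision fallback (INT8 with a calibration cache, FP16 when supported, FP32 otherwise) and the GPU-specific filename scheme can be sketched as pure functions; `choose_precision` and `make_engine_filename` are hypothetical names, and the exact placement of the `.int8` suffix is an assumption:

```python
def choose_precision(has_calib_cache: bool, gpu_supports_fp16: bool) -> str:
    """INT8 when a calibration cache is supplied, FP16 when the GPU
    supports it, FP32 otherwise."""
    if has_calib_cache:
        return "int8"
    return "fp16" if gpu_supports_fp16 else "fp32"


def make_engine_filename(cc_major: int, cc_minor: int, sm_count: int,
                         precision: str = "fp16") -> str:
    """GPU-specific engine filename keyed on compute capability and SM
    count, so an engine cached for one GPU or precision is never reused
    for another."""
    suffix = ".int8.engine" if precision == "int8" else ".engine"
    return f"azaion.cc_{cc_major}.{cc_minor}_sm_{sm_count}{suffix}"


print(make_engine_filename(8, 6, 84))          # azaion.cc_8.6_sm_84.engine
print(make_engine_filename(8, 6, 84, "int8"))  # azaion.cc_8.6_sm_84.int8.engine
```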
External API
None — internal component consumed by Inference Pipeline.
Data Access Patterns
- Model bytes loaded in-memory (provided by caller)
- TensorRT: CUDA device memory allocated at init, async H2D/D2H transfers during inference
- ONNX: managed by onnxruntime internally
Implementation Details
- OnnxEngine: default batch_size=1; loads the model into an `onnxruntime.InferenceSession`
- TensorRTEngine: default batch_size=4; dynamic dimensions default to a 1280×1280 input and 300 max detections
- Model conversion: `convert_from_source` uses 90% of GPU memory as workspace; INT8 precision when a calibration cache is supplied, FP16 if the GPU supports it, FP32 otherwise
- Engine filename: GPU-specific with a precision suffix (`azaion.cc_{major}.{minor}_sm_{count}.engine` for FP16, `*.int8.engine` for INT8), which prevents cache confusion between precision variants
- Output format: `[batch][detection_index][x1, y1, x2, y2, confidence, class_id]`
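The output layout lends itself to a small post-processing pass; a sketch only, where the `filter_detections` name and the 0.5 default threshold are assumptions, not part of the source:

```python
def filter_detections(outputs, conf_threshold=0.5):
    """Flatten [batch][detection_index][x1, y1, x2, y2, confidence, class_id]
    output, keeping only detections at or above the confidence threshold."""
    kept = []
    for batch_idx, detections in enumerate(outputs):
        for det in detections:
            x1, y1, x2, y2, conf, cls = det
            if conf >= conf_threshold:
                kept.append((batch_idx, (x1, y1, x2, y2), conf, int(cls)))
    return kept


# Two detections in one batch item; only the high-confidence one survives.
out = [[[0, 0, 10, 10, 0.9, 1], [0, 0, 5, 5, 0.2, 0]]]
print(filter_detections(out))  # [(0, (0, 0, 10, 10), 0.9, 1)]
```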
Caveats
- TensorRT engine files are GPU-architecture-specific and not portable
- INT8 engine files require a pre-computed calibration cache; cache is generated offline and uploaded to Loader manually
- The `pycuda.autoinit` import is required for its side effect (it initializes the CUDA context)
- The 1280×1280 default for dynamic shapes is hardcoded, not configurable
Dependency Graph
```mermaid
graph TD
    onnx_engine --> inference_engine
    onnx_engine --> constants_inf
    tensorrt_engine --> inference_engine
    tensorrt_engine --> constants_inf
```
Logging Strategy
Logs model metadata at init and conversion progress/errors via constants_inf.log/logerror.