detections/_docs/02_document/components/02_inference_engines/description.md
Oleksandr Bezdieniezhnykh 7a7f2a4cdd [AZ-180] Update module and component docs for Jetson/INT8 changes
2026-04-02 07:25:22 +03:00


Component: Inference Engines

Overview

Purpose: Provides pluggable inference backends (ONNX Runtime and TensorRT) behind a common abstract interface, including ONNX-to-TensorRT model conversion with FP16 and INT8 precision support.

Pattern: Strategy pattern — InferenceEngine defines the contract; OnnxEngine and TensorRTEngine are interchangeable implementations.

Upstream: Domain (constants_inf for logging). Downstream: Inference Pipeline (creates and uses engines).

Modules

| Module | Role |
| --- | --- |
| `inference_engine` | Abstract base class defining `get_input_shape`, `get_batch_size`, `run` |
| `onnx_engine` | ONNX Runtime implementation (CPU/CUDA) |
| `tensorrt_engine` | TensorRT implementation (GPU) + ONNX→TensorRT converter |

Internal Interfaces

InferenceEngine (abstract)

```cython
cdef class InferenceEngine:
    __init__(bytes model_bytes, int batch_size=1, **kwargs)
    cdef tuple get_input_shape()       # -> (height, width)
    cdef int get_batch_size()          # -> batch_size
    cdef run(input_data)               # -> list of output tensors
```
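The contract above can be sketched in plain Python to show the strategy pattern at work (a minimal illustration only: the real classes are Cython `cdef` classes, and `DummyEngine` is a hypothetical stand-in, not a production backend):

```python
from abc import ABC, abstractmethod


class InferenceEngine(ABC):
    """Common contract for pluggable inference backends."""

    def __init__(self, model_bytes: bytes, batch_size: int = 1, **kwargs):
        self.model_bytes = model_bytes
        self.batch_size = batch_size

    @abstractmethod
    def get_input_shape(self) -> tuple:
        """Return the (height, width) the model expects."""

    def get_batch_size(self) -> int:
        return self.batch_size

    @abstractmethod
    def run(self, input_data) -> list:
        """Return a list of output tensors."""


class DummyEngine(InferenceEngine):
    # Stand-in backend that echoes its input; it exists only to show that
    # any implementation of the contract is interchangeable to the caller.
    def get_input_shape(self) -> tuple:
        return (1280, 1280)

    def run(self, input_data) -> list:
        return [input_data]


engine: InferenceEngine = DummyEngine(b"", batch_size=4)
```

The Inference Pipeline depends only on the `InferenceEngine` type, which is what lets `OnnxEngine` and `TensorRTEngine` swap freely.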

OnnxEngine

```cython
cdef class OnnxEngine(InferenceEngine):
    # Implements all base methods
    # Provider priority: CUDA > CPU
```
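The CUDA-over-CPU fallback can be illustrated with a small pure-Python helper (`pick_provider` is a hypothetical name for this sketch; in practice onnxruntime accepts an ordered `providers` list and performs the fallback itself):

```python
def pick_provider(available: set) -> str:
    """Return the highest-priority ONNX Runtime execution provider present.

    Priority mirrors the component's CUDA > CPU ordering.
    """
    priority = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    for provider in priority:
        if provider in available:
            return provider
    raise RuntimeError("no usable execution provider")
```

On a GPU host the CUDA provider wins; on a CPU-only host the same call degrades gracefully to `CPUExecutionProvider`.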

TensorRTEngine

```cython
cdef class TensorRTEngine(InferenceEngine):
    # Implements all base methods
    @staticmethod get_gpu_memory_bytes(int device_id) -> int
    @staticmethod get_engine_filename(str precision="fp16") -> str  # "fp16" or "int8"
    @staticmethod convert_from_source(bytes onnx_model, str calib_cache_path=None) -> bytes or None
```
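A sketch of how the precision-suffixed cache filename might be assembled, based on the `azaion.cc_{major}.{minor}_sm_{count}` pattern documented under Implementation Details (the explicit `cc_major`/`cc_minor`/`sm_count` parameters are an assumption for illustration; the real static method reads them from the CUDA device properties):

```python
def engine_filename(cc_major: int, cc_minor: int, sm_count: int,
                    precision: str = "fp16") -> str:
    """Build the GPU-specific TensorRT engine cache filename.

    FP16 and INT8 variants get distinct names so one precision's
    cached engine can never be loaded in place of the other's.
    """
    if precision not in ("fp16", "int8"):
        raise ValueError(f"unsupported precision: {precision}")
    base = f"azaion.cc_{cc_major}.{cc_minor}_sm_{sm_count}"
    return f"{base}.int8.engine" if precision == "int8" else f"{base}.engine"
```

Because compute capability and SM count are baked into the name, an engine built on one GPU model is simply a cache miss on another rather than a silent mismatch.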

External API

None — internal component consumed by Inference Pipeline.

Data Access Patterns

  • Model bytes loaded in-memory (provided by caller)
  • TensorRT: CUDA device memory allocated at init; asynchronous host-to-device/device-to-host (H2D/D2H) transfers during inference
  • ONNX: managed by onnxruntime internally

Implementation Details

  • OnnxEngine: default batch_size=1; loads model into onnxruntime.InferenceSession
  • TensorRTEngine: default batch_size=4; dynamic dimensions default to 1280×1280 input, 300 max detections
  • Model conversion: convert_from_source allocates 90% of GPU memory as the builder workspace; precision is INT8 when a calibration cache is supplied, FP16 when the GPU supports it, and FP32 otherwise
  • Engine filename: GPU-specific with precision suffix (azaion.cc_{major}.{minor}_sm_{count}.engine for FP16, *.int8.engine for INT8) — prevents cache confusion between precision variants
  • Output format: [batch][detection_index][x1, y1, x2, y2, confidence, class_id]
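The workspace and precision rules above can be written out as two tiny decision functions (hypothetical helpers for illustration; the real logic lives inside convert_from_source, which queries the GPU and the TensorRT builder directly):

```python
def builder_workspace_bytes(total_gpu_bytes: int) -> int:
    # convert_from_source reserves 90% of GPU memory for the
    # TensorRT builder workspace (integer math keeps this exact).
    return (total_gpu_bytes * 9) // 10


def select_precision(has_calib_cache: bool, fp16_supported: bool) -> str:
    # INT8 needs a pre-computed calibration cache; FP16 needs
    # hardware support; FP32 is the universal fallback.
    if has_calib_cache:
        return "int8"
    if fp16_supported:
        return "fp16"
    return "fp32"
```

Note the ordering: a supplied calibration cache takes priority over FP16 support, so an INT8-capable deployment never silently builds an FP16 engine.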

Caveats

  • TensorRT engine files are GPU-architecture-specific and not portable
  • INT8 engine files require a pre-computed calibration cache; cache is generated offline and uploaded to Loader manually
  • Importing pycuda.autoinit is required for its side effect of initializing the CUDA context
  • The 1280×1280 default for dynamic input shapes is hardcoded, not configurable

Dependency Graph

```mermaid
graph TD
    onnx_engine --> inference_engine
    onnx_engine --> constants_inf
    tensorrt_engine --> inference_engine
    tensorrt_engine --> constants_inf
```

Logging Strategy

Logs model metadata at init and conversion progress/errors via constants_inf.log/logerror.