detections/_docs/02_document/components/02_inference_engines/description.md

# Component: Inference Engines

## Overview

**Purpose:** Provides pluggable inference backends (ONNX Runtime and TensorRT) behind a common abstract interface, including ONNX-to-TensorRT model conversion.

**Pattern:** Strategy pattern: `InferenceEngine` defines the contract; `OnnxEngine` and `TensorRTEngine` are interchangeable implementations.

**Upstream:** Domain (`constants_inf` for logging). **Downstream:** Inference Pipeline (creates and uses engines).

## Modules

| Module | Role |
|---|---|
| `inference_engine` | Abstract base class defining `get_input_shape`, `get_batch_size`, `run` |
| `onnx_engine` | ONNX Runtime implementation (CPU/CUDA) |
| `tensorrt_engine` | TensorRT implementation (GPU) + ONNX→TensorRT converter |

## Internal Interfaces

### `InferenceEngine` (abstract)

```cython
cdef class InferenceEngine:
    __init__(bytes model_bytes, int batch_size=1, **kwargs)
    cdef tuple get_input_shape()       # -> (height, width)
    cdef int get_batch_size()          # -> batch_size
    cdef run(input_data)               # -> list of output tensors
```
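In pure-Python terms, the Strategy contract and an interchangeable implementation might look like the sketch below. `ToyEngine` is a hypothetical stand-in for `OnnxEngine`/`TensorRTEngine`, used only to show that callers depend on the abstract interface, not a concrete backend:

```python
from abc import ABC, abstractmethod


class InferenceEngine(ABC):
    """Pure-Python mirror of the Cython contract (illustrative only)."""

    def __init__(self, model_bytes: bytes, batch_size: int = 1, **kwargs):
        self.model_bytes = model_bytes
        self.batch_size = batch_size

    @abstractmethod
    def get_input_shape(self) -> tuple:
        """Return the (height, width) the model expects."""

    def get_batch_size(self) -> int:
        return self.batch_size

    @abstractmethod
    def run(self, input_data) -> list:
        """Return a list of output tensors."""


class ToyEngine(InferenceEngine):
    """Hypothetical concrete strategy, standing in for a real backend."""

    def get_input_shape(self) -> tuple:
        return (1280, 1280)

    def run(self, input_data) -> list:
        # Echo the input back as a single "output tensor".
        return [input_data]


engine: InferenceEngine = ToyEngine(b"model-bytes", batch_size=4)
print(engine.get_input_shape())  # (1280, 1280)
print(engine.get_batch_size())   # 4
```

Because the pipeline only calls the three abstract methods, swapping `ToyEngine` for any other subclass requires no caller changes.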

### `OnnxEngine`

```cython
cdef class OnnxEngine(InferenceEngine):
    # Implements all base methods
    # Provider priority: CUDA > CPU
```

### `TensorRTEngine`

```cython
cdef class TensorRTEngine(InferenceEngine):
    # Implements all base methods
    @staticmethod get_gpu_memory_bytes(int device_id) -> int
    @staticmethod get_engine_filename(int device_id) -> str
    @staticmethod convert_from_onnx(bytes onnx_model) -> bytes or None
```

## External API

None — internal component consumed by Inference Pipeline.

## Data Access Patterns

- Model bytes are loaded in-memory (provided by the caller)
- TensorRT: CUDA device memory is allocated at init; asynchronous H2D/D2H transfers occur during inference
- ONNX: memory is managed internally by onnxruntime
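Since engines receive raw model bytes rather than file paths, the caller owns all file I/O. A minimal sketch (the file name and contents are placeholders):

```python
import os
import tempfile

# Stand-in for a real .onnx file on disk.
with tempfile.NamedTemporaryFile(suffix=".onnx", delete=False) as f:
    f.write(b"onnx-bytes")
    model_path = f.name

# The caller reads the model once; engines work purely in-memory.
with open(model_path, "rb") as f:
    model_bytes = f.read()

os.unlink(model_path)
print(len(model_bytes))  # 10
```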

## Implementation Details

- `OnnxEngine`: default `batch_size=1`; loads the model into an `onnxruntime.InferenceSession`
- `TensorRTEngine`: default `batch_size=4`; dynamic dimensions default to a 1280×1280 input and 300 max detections
- Model conversion: `convert_from_onnx` uses 90% of GPU memory as workspace and enables FP16 if the hardware supports it
- Engine filename: GPU-specific (`azaion.cc_{major}.{minor}_sm_{count}.engine`), which allows pre-built engine caching per GPU architecture
- Output format: `[batch][detection_index][x1, y1, x2, y2, confidence, class_id]`
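The naming, sizing, and output conventions above can be expressed as small pure-Python helpers. These are illustrative sketches of the documented rules, not the actual `tensorrt_engine` code:

```python
def engine_filename(cc_major: int, cc_minor: int, sm_count: int) -> str:
    # GPU-specific cache name: compute capability + SM count.
    return f"azaion.cc_{cc_major}.{cc_minor}_sm_{sm_count}.engine"


def workspace_bytes(total_gpu_memory: int, fraction: float = 0.9) -> int:
    # convert_from_onnx reserves ~90% of GPU memory as builder workspace.
    return int(total_gpu_memory * fraction)


def iter_detections(outputs):
    # Flatten [batch][detection_index][x1, y1, x2, y2, confidence, class_id].
    for batch_idx, detections in enumerate(outputs):
        for x1, y1, x2, y2, confidence, class_id in detections:
            yield batch_idx, (x1, y1, x2, y2), confidence, int(class_id)


# e.g. an RTX 3080-class GPU: compute capability 8.6, 28 SMs (values assumed)
print(engine_filename(8, 6, 28))     # azaion.cc_8.6_sm_28.engine
print(workspace_bytes(8 * 1024**3))  # 7730941132
```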

## Caveats

- TensorRT engine files are GPU-architecture-specific and are not portable across devices
- Importing `pycuda.autoinit` is required for its side effect of initializing the CUDA context
- The 1280×1280 default for dynamic input shapes is hardcoded and not configurable

## Dependency Graph

```mermaid
graph TD
    onnx_engine --> inference_engine
    onnx_engine --> constants_inf
    tensorrt_engine --> inference_engine
    tensorrt_engine --> constants_inf
```

## Logging Strategy

Model metadata is logged at init, and conversion progress and errors are logged via `constants_inf.log` and `constants_inf.logerror`.