[AZ-180] Update module and component docs for Jetson/INT8 changes

Made-with: Cursor
Oleksandr Bezdieniezhnykh
2026-04-02 07:25:22 +03:00
parent 2ed9ce3336
commit 7a7f2a4cdd
5 changed files with 37 additions and 23 deletions
@@ -2,7 +2,7 @@
## Overview
-**Purpose**: Provides pluggable inference backends (ONNX Runtime and TensorRT) behind a common abstract interface, including ONNX-to-TensorRT model conversion.
+**Purpose**: Provides pluggable inference backends (ONNX Runtime and TensorRT) behind a common abstract interface, including ONNX-to-TensorRT model conversion with FP16 and INT8 precision support.
**Pattern**: Strategy pattern — `InferenceEngine` defines the contract; `OnnxEngine` and `TensorRTEngine` are interchangeable implementations.
@@ -43,8 +43,8 @@ cdef class OnnxEngine(InferenceEngine):
cdef class TensorRTEngine(InferenceEngine):
# Implements all base methods
@staticmethod get_gpu_memory_bytes(int device_id) -> int
-@staticmethod get_engine_filename(int device_id) -> str
-@staticmethod convert_from_onnx(bytes onnx_model) -> bytes or None
+@staticmethod get_engine_filename(str precision="fp16") -> str  # "fp16" or "int8"
+@staticmethod convert_from_source(bytes onnx_model, str calib_cache_path=None) -> bytes or None
```
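The Strategy pattern described above can be sketched in plain Python. This is a minimal illustration, not the module's actual implementation: the method names `load` and `infer` and the `run` helper are assumptions; only the class names and the interchangeability contract come from the doc.

```python
from abc import ABC, abstractmethod


class InferenceEngine(ABC):
    """Common contract both backends implement (method names are illustrative)."""

    @abstractmethod
    def load(self, model_bytes: bytes) -> None: ...

    @abstractmethod
    def infer(self, batch):
        """Return detections as [batch][detection_index][x1, y1, x2, y2, conf, class_id]."""


class OnnxEngine(InferenceEngine):
    def load(self, model_bytes):
        self.model = model_bytes  # stub: real code builds an onnxruntime.InferenceSession

    def infer(self, batch):
        return [[] for _ in batch]  # stub: one (empty) detection list per batch item


class TensorRTEngine(InferenceEngine):
    def load(self, model_bytes):
        self.model = model_bytes  # stub: real code deserializes a TensorRT engine

    def infer(self, batch):
        return [[] for _ in batch]


def run(engine: InferenceEngine, model: bytes, batch):
    # Callers depend only on the abstract interface, so backends swap freely.
    engine.load(model)
    return engine.infer(batch)
```

Because `run` only touches the `InferenceEngine` contract, the Inference Pipeline can select a backend at startup without branching on the concrete type anywhere else.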
## External API
@@ -61,13 +61,14 @@ None — internal component consumed by Inference Pipeline.
- **OnnxEngine**: default batch_size=1; loads model into `onnxruntime.InferenceSession`
- **TensorRTEngine**: default batch_size=4; dynamic dimensions default to 1280×1280 input, 300 max detections
-- **Model conversion**: `convert_from_onnx` uses 90% of GPU memory as workspace, enables FP16 if hardware supports it
-- **Engine filename**: GPU-specific (`azaion.cc_{major}.{minor}_sm_{count}.engine`) — allows pre-built engine caching per GPU architecture
+- **Model conversion**: `convert_from_source` uses 90% of GPU memory as workspace; INT8 precision when a calibration cache is supplied, FP16 if the GPU supports it, FP32 otherwise
+- **Engine filename**: GPU-specific with precision suffix (`azaion.cc_{major}.{minor}_sm_{count}.engine` for FP16, `*.int8.engine` for INT8) — prevents cache confusion between precision variants
- Output format: `[batch][detection_index][x1, y1, x2, y2, confidence, class_id]`
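The precision fallback and filename scheme in the bullets above can be sketched as follows. The helper names `choose_precision` and `engine_filename` are hypothetical; only the fallback order and the filename pattern are taken from the doc.

```python
def choose_precision(calib_cache_path, fp16_supported):
    # Documented fallback order: INT8 only when a calibration cache is
    # supplied, otherwise FP16 if the GPU supports it, otherwise FP32.
    if calib_cache_path is not None:
        return "int8"
    return "fp16" if fp16_supported else "fp32"


def engine_filename(cc_major, cc_minor, sm_count, precision="fp16"):
    # GPU-architecture-specific base name; INT8 engines get a distinct
    # suffix so cached FP16 and INT8 variants never collide.
    base = f"azaion.cc_{cc_major}.{cc_minor}_sm_{sm_count}"
    return f"{base}.int8.engine" if precision == "int8" else f"{base}.engine"
```

Keying the cache on compute capability, SM count, and precision means a pre-built engine is reused only on a matching GPU with a matching precision variant.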
## Caveats
- TensorRT engine files are GPU-architecture-specific and not portable
+- INT8 engine files require a pre-computed calibration cache; the cache is generated offline and uploaded to Loader manually
- Importing `pycuda.autoinit` is required for its side effect (it initializes the CUDA context)
- Dynamic shapes defaulting to 1280×1280 is hardcoded — not configurable
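Given the documented output layout `[batch][detection_index][x1, y1, x2, y2, confidence, class_id]`, a downstream consumer might thin results by confidence like this. This is an illustrative sketch; `filter_detections` is not part of the module.

```python
def filter_detections(output, conf_threshold=0.5):
    """Keep only detections at or above the threshold.

    `output` is [batch][detection_index][x1, y1, x2, y2, confidence, class_id];
    confidence sits at index 4 of each detection row.
    """
    return [
        [det for det in dets if det[4] >= conf_threshold]
        for dets in output
    ]
```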