[AZ-180] Update module and component docs for Jetson/INT8 changes

Made-with: Cursor
Oleksandr Bezdieniezhnykh
2026-04-02 07:25:22 +03:00
parent 2ed9ce3336
commit 7a7f2a4cdd
5 changed files with 37 additions and 23 deletions
+17 -9
@@ -2,7 +2,7 @@
## Purpose
TensorRT-based inference engine — high-performance GPU inference with CUDA memory management and ONNX-to-TensorRT model conversion.
TensorRT-based inference engine — high-performance GPU inference with CUDA memory management and ONNX-to-TensorRT model conversion. Supports FP16 and INT8 precision; INT8 is used when a pre-computed calibration cache is supplied.
## Public Interface
@@ -15,25 +15,32 @@ TensorRT-based inference engine — high-performance GPU inference with CUDA mem
| `get_batch_size` | `() -> int` | Returns batch size |
| `run` | `(input_data) -> list` | Async H2D copy → execute → D2H copy; returns outputs as a list of numpy arrays |
| `get_gpu_memory_bytes` | `(int device_id) -> int` | Static. Returns total GPU memory in bytes (default 2GB if unavailable) |
| `get_engine_filename` | `(int device_id) -> str` | Static. Returns engine filename with compute capability and SM count: `azaion.cc_{major}.{minor}_sm_{count}.engine` |
| `convert_from_onnx` | `(bytes onnx_model) -> bytes or None` | Static. Converts ONNX model to TensorRT serialized engine. Uses 90% of GPU memory as workspace. Enables FP16 if supported. |
| `get_engine_filename` | `(str precision="fp16") -> str` | Static. Returns engine filename encoding compute capability, SM count, and precision suffix: `azaion.cc_{major}.{minor}_sm_{count}.engine` (FP16) or `azaion.cc_{major}.{minor}_sm_{count}.int8.engine` (INT8) |
| `convert_from_source` | `(bytes onnx_model, str calib_cache_path=None) -> bytes or None` | Static. Converts ONNX model to TensorRT serialized engine. Uses INT8 when `calib_cache_path` is provided and the file exists; falls back to FP16 if GPU supports it, FP32 otherwise. |
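The engine filename scheme above can be sketched as a small helper. This is an illustrative reconstruction, not the module's actual code: the function name and the explicit `cc_major`/`cc_minor`/`sm_count` parameters are assumptions (the real static method presumably queries the GPU itself, e.g. via pycuda/pynvml).

```python
def engine_filename(cc_major: int, cc_minor: int, sm_count: int,
                    precision: str = "fp16") -> str:
    """Build the GPU-specific engine cache filename (illustrative sketch).

    The real method takes only a precision argument and queries compute
    capability and SM count from the device; here they are passed in so
    the sketch stays self-contained.
    """
    base = f"azaion.cc_{cc_major}.{cc_minor}_sm_{sm_count}"
    # INT8 engines get a distinct suffix so they never shadow FP16 engines
    # in the on-disk cache.
    return f"{base}.int8.engine" if precision == "int8" else f"{base}.engine"
```

For example, `engine_filename(8, 6, 28)` yields `azaion.cc_8.6_sm_28.engine`, while `engine_filename(8, 6, 28, "int8")` yields `azaion.cc_8.6_sm_28.int8.engine`.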
### Helper Class: `_CacheCalibrator` (module-private)
Subclasses `trt.IInt8EntropyCalibrator2`. Loads a pre-generated INT8 calibration cache from disk and supplies it to the TensorRT builder; used only when `convert_from_source` is called with a valid `calib_cache_path`.
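A minimal sketch of the calibrator's cache-only behavior, with the TensorRT base class stripped out so it stays self-contained. The class name and method bodies are illustrative: the real `_CacheCalibrator` must also inherit from `trt.IInt8EntropyCalibrator2` (calling its `__init__`), and TensorRT interprets `get_batch` returning `None` as "no live calibration data, use the cache".

```python
import os

class CacheCalibratorSketch:
    """Hypothetical stand-in for the module-private _CacheCalibrator.

    The real class additionally subclasses trt.IInt8EntropyCalibrator2;
    only the cache-loading logic is shown here.
    """

    def __init__(self, calib_cache_path: str):
        self._path = calib_cache_path

    def get_batch(self, names):
        # No live calibration batches: returning None tells the builder
        # to rely entirely on the pre-computed cache.
        return None

    def read_calibration_cache(self):
        # Hand the cached scale factors to the TensorRT builder, if present.
        if os.path.isfile(self._path):
            with open(self._path, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        # Cache is generated offline and downloaded; nothing to persist.
        pass
```
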
## Internal Logic
- Input shape defaults to 1280×1280 for dynamic dimensions.
- Output shape defaults to 300 max detections × 6 values (x1, y1, x2, y2, conf, cls) for dynamic dimensions.
- `run` uses async CUDA memory transfers with stream synchronization.
- `convert_from_onnx` uses explicit batch mode, configures FP16 precision when GPU supports it.
- Default batch size is 4 (vs OnnxEngine's 1).
- `convert_from_source` uses explicit batch mode:
- If `calib_cache_path` is a valid file path → sets `BuilderFlag.INT8` and assigns `_CacheCalibrator` as the calibrator
- Else if GPU has fast FP16 → sets `BuilderFlag.FP16`
- Else → FP32 (no flag)
- Engine filenames encode precision suffix (`*.int8.engine` vs `*.engine`) to prevent INT8/FP16 engine cache confusion across conversions.
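The precision decision in the bullets above can be sketched as a standalone function. This is an assumption-laden paraphrase, not the module's code: the function name, the `gpu_has_fast_fp16` parameter, and the string return values are illustrative, and the real `convert_from_source` expresses the same choice by setting `trt.BuilderFlag.INT8` or `trt.BuilderFlag.FP16` on the builder config.

```python
import os

def select_precision(calib_cache_path, gpu_has_fast_fp16: bool) -> str:
    """Mirror the INT8 -> FP16 -> FP32 fallback described above."""
    if calib_cache_path and os.path.isfile(calib_cache_path):
        return "int8"   # valid cache file -> INT8 flag + _CacheCalibrator
    if gpu_has_fast_fp16:
        return "fp16"   # GPU has fast FP16 -> FP16 flag
    return "fp32"       # no flag set: TensorRT's default precision
```

For example, a missing cache path on an FP16-capable GPU selects `"fp16"`, and only an existing cache file selects `"int8"`.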
## Dependencies
- **External**: `tensorrt`, `pycuda.driver`, `pycuda.autoinit`, `pynvml`, `numpy`
- **External**: `tensorrt`, `pycuda.driver`, `pycuda.autoinit`, `pynvml`, `numpy`, `os`
- **Internal**: `inference_engine` (base class), `constants_inf` (logging)
## Consumers
- `inference` — instantiated when compatible NVIDIA GPU is found; also calls `convert_from_onnx` and `get_engine_filename`
- `inference` — instantiated when compatible NVIDIA GPU is found; also calls `convert_from_source` and `get_engine_filename`
## Data Models
@@ -41,8 +48,9 @@ None (wraps TensorRT runtime objects).
## Configuration
- Engine filename is GPU-specific (compute capability + SM count).
- Engine filename encodes GPU compute capability + SM count + precision suffix.
- Workspace memory is 90% of available GPU memory.
- INT8 calibration cache path is supplied at conversion time (downloaded by `inference.init_ai`).
## External Integrations
@@ -54,4 +62,4 @@ None.
## Tests
None found.
- `tests/test_az180_jetson_int8.py` — unit tests for INT8 flag (AC-3) and FP16 fallback (AC-4); skipped when TensorRT is not available (GPU environment required).