mirror of
https://github.com/azaion/detections.git
synced 2026-04-22 23:56:31 +00:00
[AZ-180] Update module and component docs for Jetson/INT8 changes
Made-with: Cursor
@@ -2,7 +2,7 @@
|
|||||||
|
|
||||||
## Overview
|
## Overview
|
||||||
|
|
||||||
**Purpose**: Provides pluggable inference backends (ONNX Runtime and TensorRT) behind a common abstract interface, including ONNX-to-TensorRT model conversion.
|
**Purpose**: Provides pluggable inference backends (ONNX Runtime and TensorRT) behind a common abstract interface, including ONNX-to-TensorRT model conversion with FP16 and INT8 precision support.
|
||||||
|
|
||||||
**Pattern**: Strategy pattern — `InferenceEngine` defines the contract; `OnnxEngine` and `TensorRTEngine` are interchangeable implementations.
|
**Pattern**: Strategy pattern — `InferenceEngine` defines the contract; `OnnxEngine` and `TensorRTEngine` are interchangeable implementations.
|
||||||
|
|
||||||
@@ -43,8 +43,8 @@ cdef class OnnxEngine(InferenceEngine):
|
|||||||
cdef class TensorRTEngine(InferenceEngine):
|
cdef class TensorRTEngine(InferenceEngine):
|
||||||
# Implements all base methods
|
# Implements all base methods
|
||||||
@staticmethod get_gpu_memory_bytes(int device_id) -> int
|
@staticmethod get_gpu_memory_bytes(int device_id) -> int
|
||||||
@staticmethod get_engine_filename(int device_id) -> str
|
@staticmethod get_engine_filename(str precision="fp16") -> str # "fp16" or "int8"
|
||||||
@staticmethod convert_from_onnx(bytes onnx_model) -> bytes or None
|
@staticmethod convert_from_source(bytes onnx_model, str calib_cache_path=None) -> bytes or None
|
||||||
```
|
```
|
||||||
|
|
||||||
## External API
|
## External API
|
||||||
@@ -61,13 +61,14 @@ None — internal component consumed by Inference Pipeline.
|
|||||||
|
|
||||||
- **OnnxEngine**: default batch_size=1; loads model into `onnxruntime.InferenceSession`
|
- **OnnxEngine**: default batch_size=1; loads model into `onnxruntime.InferenceSession`
|
||||||
- **TensorRTEngine**: default batch_size=4; dynamic dimensions default to 1280×1280 input, 300 max detections
|
- **TensorRTEngine**: default batch_size=4; dynamic dimensions default to 1280×1280 input, 300 max detections
|
||||||
- **Model conversion**: `convert_from_onnx` uses 90% of GPU memory as workspace, enables FP16 if hardware supports it
|
- **Model conversion**: `convert_from_source` uses 90% of GPU memory as workspace; it builds with INT8 precision when a calibration cache is supplied, FP16 if the GPU supports it, and FP32 otherwise
|
||||||
- **Engine filename**: GPU-specific (`azaion.cc_{major}.{minor}_sm_{count}.engine`) — allows pre-built engine caching per GPU architecture
|
- **Engine filename**: GPU-specific with precision suffix (`azaion.cc_{major}.{minor}_sm_{count}.engine` for FP16, `*.int8.engine` for INT8) — prevents cache confusion between precision variants
|
||||||
- Output format: `[batch][detection_index][x1, y1, x2, y2, confidence, class_id]`
|
- Output format: `[batch][detection_index][x1, y1, x2, y2, confidence, class_id]`
|
||||||
|
|
||||||
## Caveats
|
## Caveats
|
||||||
|
|
||||||
- TensorRT engine files are GPU-architecture-specific and not portable
|
- TensorRT engine files are GPU-architecture-specific and not portable
|
||||||
|
- Building an INT8 engine requires a pre-computed calibration cache; the cache is generated offline and uploaded to the Loader manually
|
||||||
- `pycuda.autoinit` import is required as side-effect (initializes CUDA context)
|
- `pycuda.autoinit` import is required as side-effect (initializes CUDA context)
|
||||||
- Dynamic shapes defaulting to 1280×1280 is hardcoded — not configurable
|
- Dynamic shapes defaulting to 1280×1280 is hardcoded — not configurable
|
||||||
|
|
||||||
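The engine naming scheme described above can be sketched in a few lines. This is only an illustration of the documented filename pattern: the helper name `engine_filename` and its explicit `major`, `minor`, and `sm_count` parameters are assumptions made for the example, whereas the documented `get_engine_filename` takes only `precision` and presumably queries the GPU itself.

```python
def engine_filename(major: int, minor: int, sm_count: int,
                    precision: str = "fp16") -> str:
    """Build an engine filename following the documented pattern.

    Hypothetical helper: the GPU properties are passed in explicitly so the
    sketch runs without CUDA.
    """
    if precision not in ("fp16", "int8"):
        raise ValueError(f"unsupported precision: {precision!r}")
    base = f"azaion.cc_{major}.{minor}_sm_{sm_count}"
    # FP16 keeps the plain ".engine" name; INT8 adds a distinguishing suffix
    # so the two variants never collide in the engine cache.
    suffix = ".int8.engine" if precision == "int8" else ".engine"
    return base + suffix


# Illustrative values only (compute capability 8.7, 8 SMs).
print(engine_filename(8, 7, 8))
print(engine_filename(8, 7, 8, precision="int8"))
```

Because the precision is part of the name, an INT8 engine and an FP16 engine built for the same GPU can sit side by side in the cache.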
|
|||||||
@@ -13,6 +13,7 @@ Application-wide constants, logging infrastructure, and the object detection cla
|
|||||||
| `CONFIG_FILE` | str | `"config.yaml"` | Configuration file path |
|
| `CONFIG_FILE` | str | `"config.yaml"` | Configuration file path |
|
||||||
| `QUEUE_CONFIG_FILENAME` | str | `"secured-config.json"` | Queue config filename |
|
| `QUEUE_CONFIG_FILENAME` | str | `"secured-config.json"` | Queue config filename |
|
||||||
| `AI_ONNX_MODEL_FILE` | str | `"azaion.onnx"` | ONNX model filename |
|
| `AI_ONNX_MODEL_FILE` | str | `"azaion.onnx"` | ONNX model filename |
|
||||||
|
| `INT8_CALIB_CACHE_FILE` | str | `"azaion.int8_calib.cache"` | INT8 calibration cache filename on the Loader service |
|
||||||
| `CDN_CONFIG` | str | `"cdn.yaml"` | CDN configuration file |
|
| `CDN_CONFIG` | str | `"cdn.yaml"` | CDN configuration file |
|
||||||
| `MODELS_FOLDER` | str | `"models"` | Directory for model files |
|
| `MODELS_FOLDER` | str | `"models"` | Directory for model files |
|
||||||
| `SMALL_SIZE_KB` | int | `3` | Small file size threshold (KB) |
|
| `SMALL_SIZE_KB` | int | `3` | Small file size threshold (KB) |
|
||||||
|
|||||||
@@ -41,7 +41,8 @@ Core inference orchestrator — manages the AI engine lifecycle, preprocesses me
|
|||||||
| `run_detect_video` | `(bytes video_bytes, AIRecognitionConfig ai_config, str media_name, str save_path, annotation_callback, status_callback=None)` | cpdef | Processes video from in-memory bytes via PyAV, concurrently writes to save_path |
|
| `run_detect_video` | `(bytes video_bytes, AIRecognitionConfig ai_config, str media_name, str save_path, annotation_callback, status_callback=None)` | cpdef | Processes video from in-memory bytes via PyAV, concurrently writes to save_path |
|
||||||
| `run_detect_video_stream` | `(object readable, AIRecognitionConfig ai_config, str media_name, annotation_callback, status_callback=None)` | cpdef | Processes video from a file-like readable (e.g. StreamingBuffer) via PyAV — true streaming, no bytes in RAM (AZ-178) |
|
| `run_detect_video_stream` | `(object readable, AIRecognitionConfig ai_config, str media_name, annotation_callback, status_callback=None)` | cpdef | Processes video from a file-like readable (e.g. StreamingBuffer) via PyAV — true streaming, no bytes in RAM (AZ-178) |
|
||||||
| `stop` | `()` | cpdef | Sets stop_signal to True |
|
| `stop` | `()` | cpdef | Sets stop_signal to True |
|
||||||
| `init_ai` | `()` | cdef | Engine initialization: tries TensorRT → falls back to ONNX → background TensorRT conversion |
|
| `init_ai` | `()` | cdef | Engine initialization: tries INT8 engine → FP16 engine → background TensorRT conversion (with optional INT8 calibration cache) |
|
||||||
|
| `_try_download_calib_cache` | `(str models_dir) -> str or None` | cdef | Downloads `azaion.int8_calib.cache` from Loader; writes to a temp file; returns path or None if unavailable |
|
||||||
| `preprocess` | `(frames) -> ndarray` | via engine | OpenCV blobFromImage: resize, normalize to 0..1, swap RGB, stack batch |
|
| `preprocess` | `(frames) -> ndarray` | via engine | OpenCV blobFromImage: resize, normalize to 0..1, swap RGB, stack batch |
|
||||||
| `postprocess` | `(output, ai_config) -> list[list[Detection]]` | via engine | Parses engine output to Detection objects, applies confidence threshold and overlap removal |
|
| `postprocess` | `(output, ai_config) -> list[list[Detection]]` | via engine | Parses engine output to Detection objects, applies confidence threshold and overlap removal |
|
||||||
|
|
||||||
@@ -50,9 +51,11 @@ Core inference orchestrator — manages the AI engine lifecycle, preprocesses me
|
|||||||
### Engine Initialization (`init_ai`)
|
### Engine Initialization (`init_ai`)
|
||||||
|
|
||||||
1. If `_converted_model_bytes` exists → load TensorRT from those bytes
|
1. If `_converted_model_bytes` exists → load TensorRT from those bytes
|
||||||
2. If GPU available → try downloading pre-built TensorRT engine from loader
|
2. If GPU available → try downloading pre-built INT8 engine first (`*.int8.engine`), then FP16 engine (`*.engine`) from loader
|
||||||
3. If download fails → download ONNX model, start background thread for ONNX→TensorRT conversion
|
3. If no cached engine found → download ONNX source, attempt to download INT8 calibration cache (`azaion.int8_calib.cache`) from loader, spawn background thread for ONNX→TensorRT conversion (INT8 if cache downloaded, FP16 fallback)
|
||||||
4. If no GPU → load OnnxEngine from ONNX model bytes
|
4. Calibration cache download failure is non-fatal: a warning is logged and conversion proceeds with FP16
|
||||||
|
5. Temporary calibration cache file is deleted after conversion completes
|
||||||
|
6. If no GPU → load OnnxEngine from ONNX model bytes
|
||||||
|
|
||||||
### Stream-Based Media Processing (AZ-173)
|
### Stream-Based Media Processing (AZ-173)
|
||||||
|
|
||||||
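The lookup order in `init_ai` can be sketched as a pure function. This is a simplified illustration under stated assumptions, not the project's implementation: `choose_engine_source`, the `download` callable (returning `None` when a file is unavailable), and the returned labels are hypothetical, and the `_converted_model_bytes` shortcut from step 1 is left out.

```python
from typing import Callable, Optional, Tuple


def choose_engine_source(
    gpu_available: bool,
    download: Callable[[str], Optional[bytes]],
    base_name: str,
) -> Tuple[str, Optional[bytes]]:
    """Return (kind, payload) following the documented lookup order.

    `download` is a hypothetical stand-in for the Loader client.
    """
    if not gpu_available:
        # No GPU: fall back to the ONNX Runtime engine.
        return "onnx", download("azaion.onnx")

    # Prefer a pre-built INT8 engine, then a pre-built FP16 engine.
    for kind, name in (("int8", f"{base_name}.int8.engine"),
                       ("fp16", f"{base_name}.engine")):
        data = download(name)
        if data is not None:
            return kind, data

    # No cached engine: the caller converts from ONNX in the background.
    # A missing calibration cache is non-fatal and selects FP16 instead.
    calib = download("azaion.int8_calib.cache")
    kind = "convert_int8" if calib is not None else "convert_fp16"
    return kind, download("azaion.onnx")


# Usage with an in-memory dict standing in for the Loader.
loader = {"azaion.onnx": b"onnx-bytes", "azaion.int8_calib.cache": b"cache"}
print(choose_engine_source(True, loader.get, "azaion.cc_8.7_sm_8")[0])
```

Keeping the decision separate from the download and conversion side effects makes the fallback order easy to unit-test without a GPU.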
|
|||||||
@@ -2,7 +2,7 @@
|
|||||||
|
|
||||||
## Purpose
|
## Purpose
|
||||||
|
|
||||||
TensorRT-based inference engine — high-performance GPU inference with CUDA memory management and ONNX-to-TensorRT model conversion.
|
TensorRT-based inference engine — high-performance GPU inference with CUDA memory management and ONNX-to-TensorRT model conversion. Supports FP16 and INT8 precision; INT8 is used when a pre-computed calibration cache is supplied.
|
||||||
|
|
||||||
## Public Interface
|
## Public Interface
|
||||||
|
|
||||||
@@ -15,25 +15,32 @@ TensorRT-based inference engine — high-performance GPU inference with CUDA mem
|
|||||||
| `get_batch_size` | `() -> int` | Returns batch size |
|
| `get_batch_size` | `() -> int` | Returns batch size |
|
||||||
| `run` | `(input_data) -> list` | Async H2D copy → execute → D2H copy, returns output as numpy array |
|
| `run` | `(input_data) -> list` | Async H2D copy → execute → D2H copy, returns output as numpy array |
|
||||||
| `get_gpu_memory_bytes` | `(int device_id) -> int` | Static. Returns total GPU memory in bytes (default 2GB if unavailable) |
|
| `get_gpu_memory_bytes` | `(int device_id) -> int` | Static. Returns total GPU memory in bytes (default 2GB if unavailable) |
|
||||||
| `get_engine_filename` | `(int device_id) -> str` | Static. Returns engine filename with compute capability and SM count: `azaion.cc_{major}.{minor}_sm_{count}.engine` |
|
| `get_engine_filename` | `(str precision="fp16") -> str` | Static. Returns engine filename encoding compute capability, SM count, and precision suffix: `azaion.cc_{major}.{minor}_sm_{count}.engine` (FP16) or `azaion.cc_{major}.{minor}_sm_{count}.int8.engine` (INT8) |
|
||||||
| `convert_from_onnx` | `(bytes onnx_model) -> bytes or None` | Static. Converts ONNX model to TensorRT serialized engine. Uses 90% of GPU memory as workspace. Enables FP16 if supported. |
|
| `convert_from_source` | `(bytes onnx_model, str calib_cache_path=None) -> bytes or None` | Static. Converts ONNX model to TensorRT serialized engine. Uses INT8 when `calib_cache_path` is provided and the file exists; falls back to FP16 if GPU supports it, FP32 otherwise. |
|
||||||
|
|
||||||
|
### Helper Class: `_CacheCalibrator` (module-private)
|
||||||
|
|
||||||
|
Implements `trt.IInt8EntropyCalibrator2`. Loads a pre-generated INT8 calibration cache from disk and supplies it to the TensorRT builder. Used only when `convert_from_source` is called with a valid `calib_cache_path`.
|
||||||
|
|
||||||
## Internal Logic
|
## Internal Logic
|
||||||
|
|
||||||
- Input shape defaults to 1280×1280 for dynamic dimensions.
|
- Input shape defaults to 1280×1280 for dynamic dimensions.
|
||||||
- Output shape defaults to 300 max detections × 6 values (x1, y1, x2, y2, conf, cls) for dynamic dimensions.
|
- Output shape defaults to 300 max detections × 6 values (x1, y1, x2, y2, conf, cls) for dynamic dimensions.
|
||||||
- `run` uses async CUDA memory transfers with stream synchronization.
|
- `run` uses async CUDA memory transfers with stream synchronization.
|
||||||
- `convert_from_onnx` uses explicit batch mode, configures FP16 precision when GPU supports it.
|
- `convert_from_source` uses explicit batch mode:
|
||||||
- Default batch size is 4 (vs OnnxEngine's 1).
|
- If `calib_cache_path` is a valid file path → sets `BuilderFlag.INT8` and assigns `_CacheCalibrator` as the calibrator
|
||||||
|
- Else if GPU has fast FP16 → sets `BuilderFlag.FP16`
|
||||||
|
- Else → FP32 (no flag)
|
||||||
|
- Engine filenames encode precision suffix (`*.int8.engine` vs `*.engine`) to prevent INT8/FP16 engine cache confusion across conversions.
|
||||||
|
|
||||||
## Dependencies
|
## Dependencies
|
||||||
|
|
||||||
- **External**: `tensorrt`, `pycuda.driver`, `pycuda.autoinit`, `pynvml`, `numpy`
|
- **External**: `tensorrt`, `pycuda.driver`, `pycuda.autoinit`, `pynvml`, `numpy`, `os`
|
||||||
- **Internal**: `inference_engine` (base class), `constants_inf` (logging)
|
- **Internal**: `inference_engine` (base class), `constants_inf` (logging)
|
||||||
|
|
||||||
## Consumers
|
## Consumers
|
||||||
|
|
||||||
- `inference` — instantiated when compatible NVIDIA GPU is found; also calls `convert_from_onnx` and `get_engine_filename`
|
- `inference` — instantiated when compatible NVIDIA GPU is found; also calls `convert_from_source` and `get_engine_filename`
|
||||||
|
|
||||||
## Data Models
|
## Data Models
|
||||||
|
|
||||||
@@ -41,8 +48,9 @@ None (wraps TensorRT runtime objects).
|
|||||||
|
|
||||||
## Configuration
|
## Configuration
|
||||||
|
|
||||||
- Engine filename is GPU-specific (compute capability + SM count).
|
- Engine filename encodes GPU compute capability + SM count + precision suffix.
|
||||||
- Workspace memory is 90% of available GPU memory.
|
- Workspace memory is 90% of available GPU memory.
|
||||||
|
- INT8 calibration cache path is supplied at conversion time (downloaded by `inference.init_ai`).
|
||||||
|
|
||||||
## External Integrations
|
## External Integrations
|
||||||
|
|
||||||
@@ -54,4 +62,4 @@ None.
|
|||||||
|
|
||||||
## Tests
|
## Tests
|
||||||
|
|
||||||
None found.
|
- `tests/test_az180_jetson_int8.py` — unit tests for INT8 flag (AC-3) and FP16 fallback (AC-4); skipped when TensorRT is not available (GPU environment required).
|
||||||
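The precision decision documented for `convert_from_source` reduces to a three-way branch. The sketch below shows only that decision in plain Python so it runs without TensorRT; `select_precision` and the `gpu_has_fast_fp16` parameter are hypothetical names, and in the real module the outcome would correspond to setting `BuilderFlag.INT8` (with `_CacheCalibrator`), `BuilderFlag.FP16`, or no flag.

```python
import os
from typing import Optional


def select_precision(calib_cache_path: Optional[str],
                     gpu_has_fast_fp16: bool) -> str:
    """Pick the build precision following the documented fallback chain."""
    # INT8 only when a calibration cache path is given AND the file exists.
    if calib_cache_path and os.path.isfile(calib_cache_path):
        return "int8"
    # Otherwise FP16 when the GPU supports fast half precision.
    if gpu_has_fast_fp16:
        return "fp16"
    # Last resort: full precision, no builder flag.
    return "fp32"


print(select_precision(None, gpu_has_fast_fp16=True))
print(select_precision("/nonexistent/azaion.int8_calib.cache", False))
```

Checking that the file exists, rather than only that a path was passed, means a failed or partial cache download degrades to FP16 instead of failing the build.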
|
|||||||
@@ -1,10 +1,10 @@
|
|||||||
# Autopilot State
|
# Autopilot State
|
||||||
## Current Step
|
## Current Step
|
||||||
flow: existing-code
|
flow: existing-code
|
||||||
step: 9
|
step: 11
|
||||||
name: Implement
|
name: Update Docs
|
||||||
status: in_progress
|
status: not_started
|
||||||
sub_step: batch_01
|
sub_step: 0
|
||||||
retry_count: 0
|
retry_count: 0
|
||||||
|
|
||||||
## Cycle Notes
|
## Cycle Notes
|
||||||
@@ -19,4 +19,5 @@ step: 14 (Deploy) — DONE (all artifacts + 5 scripts created)
|
|||||||
|
|
||||||
AZ-180 cycle started 2026-04-02.
|
AZ-180 cycle started 2026-04-02.
|
||||||
step: 8 (New Task) — DONE (AZ-180: Jetson Orin Nano support + INT8)
|
step: 8 (New Task) — DONE (AZ-180: Jetson Orin Nano support + INT8)
|
||||||
step: 9 (Implement) — NOT STARTED
|
step: 9 (Implement) — DONE (Dockerfile.jetson, requirements-jetson.txt, docker-compose.jetson.yml, tensorrt_engine INT8, inference calib cache download)
|
||||||
|
step: 10 (Run Tests) — DONE (33 passed, 3 skipped/hardware-specific, 0 failed; also fixed 2 pre-existing test failures)
|
||||||
|
|||||||