
# Solution

## Product Solution Description

Azaion AI Training is an ML pipeline for training, exporting, and deploying YOLOv11 object detection models within the Azaion platform ecosystem. The system ingests annotated image data from a RabbitMQ stream, augments it through an Albumentations-based pipeline, trains YOLOv11 models on NVIDIA GPUs, exports them to multiple formats (ONNX, TensorRT, RKNN), and deploys encrypted split-model artifacts to a REST API and S3-compatible CDN for secure distribution.

The pipeline targets aerial/satellite military object detection across 17 base classes with 3 weather modes (Normal, Winter, Night), producing 80 total class slots.

## Component Interaction

```mermaid
graph LR
    RMQ[RabbitMQ Streams] -->|annotations| AQ[Annotation Queue]
    AQ -->|images + labels| FS[(Filesystem)]
    FS -->|raw data| AUG[Augmentation]
    AUG -->|8× augmented| FS
    FS -->|dataset| TRAIN[Training]
    TRAIN -->|model artifacts| EXP[Export + Encrypt]
    EXP -->|small part| API[Azaion API]
    EXP -->|big part| CDN[S3 CDN]
    API -->|small part| INF[Inference]
    CDN -->|big part| INF
    INF -->|detections| OUT[Video Output]
```
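
The small-part/big-part split shown in the diagram can be sketched as a byte-level split of the encrypted model artifact, where neither half is usable alone. The 64 KiB split point and the function names here are hypothetical placeholders, not taken from the codebase:

```python
from typing import Tuple

# Hypothetical split point; the offset the real pipeline uses is unknown.
SMALL_PART_SIZE = 64 * 1024  # 64 KiB

def split_model(encrypted_model: bytes, small_size: int = SMALL_PART_SIZE) -> Tuple[bytes, bytes]:
    """Split an encrypted model blob into a small part (uploaded to the
    Azaion API) and a big part (uploaded to the S3 CDN)."""
    return encrypted_model[:small_size], encrypted_model[small_size:]

def join_model(small_part: bytes, big_part: bytes) -> bytes:
    """Inference-side reassembly: small part from the API, big part from the CDN."""
    return small_part + big_part
```

Storing the two parts in separate systems means a leak of the CDN bucket alone does not yield a working (even if still encrypted) artifact.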

## Architecture

### Component Solution Table

| Component | Solution | Tools | Advantages | Limitations | Requirements | Security | Cost Indicators | Fitness |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Annotation Queue | Async RabbitMQ Streams consumer with role-based routing (Validator→validated, Operator→seed) | rstream, msgpack, asyncio | Decoupled ingestion; independent lifecycle; file-based offset persistence | No reconnect logic on disconnect; single consumer (no scaling) | RabbitMQ with Streams plugin, network access | Credentials in plaintext config | Low (single lightweight process) | Good for current single-server deployment |
| Data Pipeline | Continuous augmentation loop (5-min interval) producing 8× expansion via geometric + color transforms | Albumentations, OpenCV, ThreadPoolExecutor | Robust augmentation variety; parallel per-image processing | Infinite loop with no graceful shutdown; attribute bug in progress logging | Filesystem access to /azaion/data/ and /azaion/data-processed/ | None | CPU-bound, parallelized | Adequate for offline batch augmentation |
| Training | Ultralytics YOLO training with automated dataset formation (70/20/10 split), corrupt label filtering, model export and encrypted upload | Ultralytics (YOLOv11m), PyTorch 2.3.0, CUDA 12.1 | Mature framework; built-in checkpointing (save_period=1); multi-format export | Long training cycles (~11.5 days for 360K annotations); batch=11 near 24GB VRAM limit | NVIDIA GPU (RTX 4090 24GB), CUDA 12.1 | Model encrypted AES-256-CBC before upload; split storage pattern | High (GPU compute, multi-day runs) | Well-suited for periodic retraining |
| Inference | TensorRT (primary) and ONNX Runtime (fallback) engines with async CUDA streams, batch processing, NMS postprocessing | TensorRT, ONNX Runtime, PyCUDA, OpenCV | TensorRT: ~33% faster than ONNX, ~42% less VRAM; batch processing; per-GPU engine compilation | Potential uninitialized batch_size for dynamic shapes; no model caching strategy | NVIDIA GPU with TensorRT support | Hardware-bound decryption key; encrypted model download | Moderate (GPU inference) | Production-ready for GPU servers |
| Security | AES-256-CBC encryption for models and API resources; hardware fingerprinting (CPU+GPU+RAM+drive serial) for machine-bound keys | cryptography library | Split-model storage prevents single-point theft; hardware binding ties access to authorized machines | Hardcoded encryption key; hardcoded credentials in config files; no TLS cert validation | cryptography, pynvml, platform-specific hardware queries | Core security component | Minimal | Functional but needs credential externalization |
| API & CDN | REST API client with JWT auth and S3-compatible CDN for large artifact storage; split-resource upload/download pattern | requests, boto3 | Separation of small/big model parts; auto-relogin on 401/403 | No retry on 500 errors; no connection pooling | Azaion API endpoint, S3-compatible CDN endpoint | JWT tokens; separate read/write CDN keys | Low (network I/O only) | Adequate for current model distribution needs |
| Edge Deployment | RKNN export targeting RK3588 SoC (OrangePi5) with shell-based setup scripts | RKNN toolkit, bash scripts | Low-power edge inference capability | Setup scripts not integrated into main pipeline; no automated deployment | OrangePi5 hardware, RKNN runtime | N/A | Low (edge hardware) | Proof-of-concept stage |
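
The Security row's AES-256-CBC encryption with a machine-bound key can be sketched with the cryptography library. The fingerprint below hashes a few platform fields as a simplified stand-in for the real CPU+GPU+RAM+drive-serial query, and all function names are hypothetical:

```python
import hashlib
import os
import platform

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def machine_key() -> bytes:
    """Derive a 32-byte AES-256 key from a simplified hardware fingerprint.

    The real pipeline reportedly combines CPU, GPU, RAM, and drive-serial
    data (via pynvml and platform-specific queries); this stand-in hashes
    a few platform fields only.
    """
    fingerprint = "|".join([platform.machine(), platform.processor(), platform.node()])
    return hashlib.sha256(fingerprint.encode()).digest()

def encrypt(plaintext: bytes, key: bytes) -> bytes:
    iv = os.urandom(16)                                   # random IV, prepended to the ciphertext
    padder = padding.PKCS7(128).padder()                  # CBC needs block-aligned input
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()

def decrypt(blob: bytes, key: bytes) -> bytes:
    iv, ciphertext = blob[:16], blob[16:]
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(ciphertext) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()
```

Because the key is derived rather than stored, the same derivation must run on the inference host; any hardware change that alters the fingerprint invalidates previously encrypted artifacts.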

## Deployment Architecture

The system runs as independent processes without containerization or orchestration:

| Process | Runtime Pattern | Host Requirements |
| --- | --- | --- |
| Annotation Queue Consumer | Continuous (async event loop) | Network access to RabbitMQ |
| Augmentation Pipeline | Continuous loop (5-min cycle) | CPU cores, filesystem access |
| Training Pipeline | Long-running (days per run) | NVIDIA GPU (24GB VRAM), CUDA 12.1 |
| Inference | On-demand | NVIDIA GPU with TensorRT |
| Data Tools | Ad-hoc manual execution | Developer machine |
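
The augmentation pipeline runs as a continuous loop on a 5-minute cycle and, per the component table, currently has no graceful shutdown path. A minimal sketch of a stoppable variant, where run_cycle and the interval are placeholders rather than code from the repository:

```python
import signal
import threading

stop_event = threading.Event()

def _handle_signal(signum, frame):
    # Request a clean exit; the loop finishes its current cycle first.
    stop_event.set()

def augmentation_loop(run_cycle, interval_s: float = 300.0) -> None:
    """Run one augmentation cycle, then sleep up to interval_s seconds.

    stop_event.wait() doubles as an interruptible sleep, so SIGTERM/SIGINT
    stop the loop within one cycle instead of killing it mid-write.
    """
    while not stop_event.is_set():
        run_cycle()
        stop_event.wait(interval_s)

if __name__ == "__main__":
    signal.signal(signal.SIGTERM, _handle_signal)
    signal.signal(signal.SIGINT, _handle_signal)
```

This pattern matters for a process that writes augmented images to disk: a hard kill mid-cycle can leave partial files that later poison dataset formation.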

No CI/CD pipeline, container definitions, or infrastructure-as-code were found. Deployment is manual.

## Testing Strategy

### Existing Tests

| Test | Type | Coverage |
| --- | --- | --- |
| tests/security_test.py | Script-based | Encrypts a test image and verifies that the roundtrip decrypt matches the original bytes |
| tests/imagelabel_visualize_test.py | Script-based | Loads sample annotations with preprocessing.read_labels (broken: the preprocessing module is missing) |

### Gaps

- No formal test framework (pytest/unittest) configured
- No integration tests for the training pipeline, augmentation, or inference
- No API client tests (mocked or live)
- No augmentation correctness tests (bounding box transform validation)
- Security test is a standalone script, not runnable via a test runner
- imagelabel_visualize_test.py cannot run due to the missing preprocessing module
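
One of the gaps above, bounding-box transform validation, lends itself to cheap property tests that need no image data at all. A sketch assuming YOLO-format normalized boxes; hflip_yolo_bbox is a hypothetical stand-in for the pipeline's actual flip transform:

```python
def hflip_yolo_bbox(bbox):
    """Horizontally flip a YOLO-format box (x_center, y_center, w, h) in [0, 1].

    Only the x-center mirrors; the y-center and the box size are unchanged.
    """
    x, y, w, h = bbox
    return (1.0 - x, y, w, h)

def test_hflip_preserves_size_and_bounds():
    # Dyadic coordinates (exact in binary floating point) keep the
    # double-flip identity check exact.
    bbox = (0.25, 0.6, 0.2, 0.1)
    flipped = hflip_yolo_bbox(bbox)
    assert flipped[2:] == bbox[2:]                     # size invariance
    assert hflip_yolo_bbox(flipped) == bbox            # double flip is identity
    assert all(0.0 <= v <= 1.0 for v in flipped)       # stays normalized
```

Size invariance, double-flip identity, and bounds checks catch the most common transform bugs (wrong axis, off-by-one in normalization) without any fixture images.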

### Observed Quality Mechanisms

- Corrupt label detection during dataset formation (coords > 1.0 → moved to /data-corrupted/)
- Bounding box clipping and filtering during augmentation
- Training checkpointing (save_period=1) for crash recovery
- Augmentation exception handling per-image and per-variant
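
The corrupt-label check listed above (coordinates > 1.0 routed to /data-corrupted/) could look roughly like this; the label-file parsing and directory handling are assumptions for illustration, not lifted from the codebase:

```python
import shutil
from pathlib import Path

def label_is_corrupt(label_text: str) -> bool:
    """A YOLO label line is 'class x y w h' with coords normalized to [0, 1];
    any coordinate above 1.0 marks the whole file as corrupt."""
    for line in label_text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # parsing details are assumptions, not from the codebase
        if any(float(v) > 1.0 for v in parts[1:]):
            return True
    return False

def quarantine_corrupt(labels_dir: Path, corrupted_dir: Path) -> int:
    """Move corrupt label files aside (the doc's /data-corrupted/ pattern)."""
    corrupted_dir.mkdir(parents=True, exist_ok=True)
    moved = 0
    for label_file in list(labels_dir.glob("*.txt")):
        if label_is_corrupt(label_file.read_text()):
            shutil.move(str(label_file), str(corrupted_dir / label_file.name))
            moved += 1
    return moved
```

Quarantining rather than deleting preserves the bad annotations for later inspection of whatever produced them upstream.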

## References

| Artifact | Path | Purpose |
| --- | --- | --- |
| Main config | config.yaml | API credentials, queue config, directory paths |
| CDN config | cdn.yaml | S3 CDN endpoint and access keys |
| Class definitions | classes.json | 17 annotation classes with colors |
| Python dependencies | requirements.txt | Main pipeline dependencies |
| Queue dependencies | annotation-queue/requirements.txt | Annotation queue service dependencies |
| Edge setup | orangepi5/*.sh | OrangePi5 installation and run scripts |
| Training checkpoint | checkpoint.txt | Last training run timestamp (2024-06-27) |