
# Solution

## Product Solution Description

Azaion AI Training is an ML pipeline for training, exporting, and deploying YOLOv11 object detection models within the Azaion platform ecosystem. The system ingests annotated image data from a RabbitMQ stream, augments it through an Albumentations-based pipeline, trains YOLOv11 models on NVIDIA GPUs, exports them to multiple formats (ONNX, TensorRT, RKNN), and deploys encrypted split-model artifacts to a REST API and S3-compatible CDN for secure distribution.

The pipeline targets aerial/satellite military object detection across 17 base classes with 3 weather modes (Normal, Winter, Night), producing 80 total class slots.

## Component Interaction

```mermaid
graph LR
    RMQ[RabbitMQ Streams] -->|annotations| AQ[Annotation Queue]
    AQ -->|images + labels| FS[(Filesystem)]
    FS -->|raw data| AUG[Augmentation]
    AUG -->|8× augmented| FS
    FS -->|dataset| TRAIN[Training]
    TRAIN -->|model artifacts| EXP[Export + Encrypt]
    EXP -->|small part| API[Azaion API]
    EXP -->|big part| CDN[S3 CDN]
    API -->|small part| INF[Inference]
    CDN -->|big part| INF
    INF -->|detections| OUT[Video Output]
```
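
The small-part/big-part split shown in the diagram can be sketched as a byte-level split of the encrypted model artifact, where neither half is usable alone. The 64 KiB split point and the function names here are hypothetical placeholders, not taken from the codebase:

```python
from typing import Tuple

# Hypothetical split point; the offset the real pipeline uses is unknown.
SMALL_PART_SIZE = 64 * 1024  # 64 KiB

def split_model(encrypted_model: bytes, small_size: int = SMALL_PART_SIZE) -> Tuple[bytes, bytes]:
    """Split an encrypted model blob into a small part (uploaded to the
    Azaion API) and a big part (uploaded to the S3 CDN)."""
    return encrypted_model[:small_size], encrypted_model[small_size:]

def join_model(small_part: bytes, big_part: bytes) -> bytes:
    """Inference-side reassembly: small part from the API, big part from the CDN."""
    return small_part + big_part
```

Storing the two parts in separate systems means a leak of the CDN bucket alone does not yield a working (even if still encrypted) artifact.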

## Architecture

### Component Solution Table

| Component | Solution | Tools | Advantages | Limitations | Requirements | Security | Cost Indicators | Fitness |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Annotation Queue | Async RabbitMQ Streams consumer with role-based routing (Validator→validated, Operator→seed) | rstream, msgpack, asyncio | Decoupled ingestion; independent lifecycle; file-based offset persistence | No reconnect logic on disconnect; single consumer (no scaling) | RabbitMQ with Streams plugin, network access | Credentials in plaintext config | Low (single lightweight process) | Good for current single-server deployment |
| Data Pipeline | Continuous augmentation loop (5-min interval) producing 8× expansion via geometric + color transforms | Albumentations, OpenCV, ThreadPoolExecutor | Robust augmentation variety; parallel per-image processing | Infinite loop with no graceful shutdown; attribute bug in progress logging | Filesystem access to /azaion/data/ and /azaion/data-processed/ | None | CPU-bound, parallelized | Adequate for offline batch augmentation |
| Training | Ultralytics YOLO training with automated dataset formation (70/20/10 split), corrupt label filtering, model export and encrypted upload | Ultralytics (YOLOv11m), PyTorch 2.3.0, CUDA 12.1 | Mature framework; built-in checkpointing (save_period=1); multi-format export | Long training cycles (~11.5 days for 360K annotations); batch=11 near 24GB VRAM limit | NVIDIA GPU (RTX 4090 24GB), CUDA 12.1 | Model encrypted AES-256-CBC before upload; split storage pattern | High (GPU compute, multi-day runs) | Well-suited for periodic retraining |
| Inference | TensorRT (primary) and ONNX Runtime (fallback) engines with async CUDA streams, batch processing, NMS postprocessing | TensorRT, ONNX Runtime, PyCUDA, OpenCV | TensorRT: ~33% faster than ONNX, ~42% less VRAM; batch processing; per-GPU engine compilation | Potential uninitialized batch_size for dynamic shapes; no model caching strategy | NVIDIA GPU with TensorRT support | Hardware-bound decryption key; encrypted model download | Moderate (GPU inference) | Production-ready for GPU servers |
| Security | AES-256-CBC encryption for models and API resources; hardware fingerprinting (CPU+GPU+RAM+drive serial) for machine-bound keys | cryptography library | Split-model storage prevents single-point theft; hardware binding ties access to authorized machines | Hardcoded encryption key; hardcoded credentials in config files; no TLS cert validation | cryptography, pynvml, platform-specific hardware queries | Core security component | Minimal | Functional but needs credential externalization |
| API & CDN | REST API client with JWT auth and S3-compatible CDN for large artifact storage; split-resource upload/download pattern | requests, boto3 | Separation of small/big model parts; auto-relogin on 401/403 | No retry on 500 errors; no connection pooling | Azaion API endpoint, S3-compatible CDN endpoint | JWT tokens; separate read/write CDN keys | Low (network I/O only) | Adequate for current model distribution needs |
| Edge Deployment | RKNN export targeting RK3588 SoC (OrangePi5) with shell-based setup scripts | RKNN toolkit, bash scripts | Low-power edge inference capability | Setup scripts not integrated into main pipeline; no automated deployment | OrangePi5 hardware, RKNN runtime | N/A | Low (edge hardware) | Proof-of-concept stage |
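
The Security row's AES-256-CBC encryption with a machine-bound key can be sketched with the cryptography library. The fingerprint below hashes a few platform fields as a simplified stand-in for the real CPU+GPU+RAM+drive-serial query, and all function names are hypothetical:

```python
import hashlib
import os
import platform

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def machine_key() -> bytes:
    """Derive a 32-byte AES-256 key from a simplified hardware fingerprint.

    The real pipeline reportedly combines CPU, GPU, RAM, and drive-serial
    data (via pynvml and platform-specific queries); this stand-in hashes
    a few platform fields only.
    """
    fingerprint = "|".join([platform.machine(), platform.processor(), platform.node()])
    return hashlib.sha256(fingerprint.encode()).digest()

def encrypt(plaintext: bytes, key: bytes) -> bytes:
    iv = os.urandom(16)                                   # random IV, prepended to the ciphertext
    padder = padding.PKCS7(128).padder()                  # CBC needs block-aligned input
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()

def decrypt(blob: bytes, key: bytes) -> bytes:
    iv, ciphertext = blob[:16], blob[16:]
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(ciphertext) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()
```

Because the key is derived rather than stored, the same derivation must run on the inference host; any hardware change that alters the fingerprint invalidates previously encrypted artifacts.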

## Deployment Architecture

The system runs as independent processes without containerization or orchestration:

| Process | Runtime Pattern | Host Requirements |
| --- | --- | --- |
| Annotation Queue Consumer | Continuous (async event loop) | Network access to RabbitMQ |
| Augmentation Pipeline | Continuous loop (5-min cycle) | CPU cores, filesystem access |
| Training Pipeline | Long-running (days per run) | NVIDIA GPU (24GB VRAM), CUDA 12.1 |
| Inference | On-demand | NVIDIA GPU with TensorRT |
| Data Tools | Ad-hoc manual execution | Developer machine |
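
The augmentation pipeline runs as a continuous loop on a 5-minute cycle and, per the component table, currently has no graceful shutdown path. A minimal sketch of a stoppable variant, where run_cycle and the interval are placeholders rather than code from the repository:

```python
import signal
import threading

stop_event = threading.Event()

def _handle_signal(signum, frame):
    # Request a clean exit; the loop finishes its current cycle first.
    stop_event.set()

def augmentation_loop(run_cycle, interval_s: float = 300.0) -> None:
    """Run one augmentation cycle, then sleep up to interval_s seconds.

    stop_event.wait() doubles as an interruptible sleep, so SIGTERM/SIGINT
    stop the loop within one cycle instead of killing it mid-write.
    """
    while not stop_event.is_set():
        run_cycle()
        stop_event.wait(interval_s)

if __name__ == "__main__":
    signal.signal(signal.SIGTERM, _handle_signal)
    signal.signal(signal.SIGINT, _handle_signal)
```

This pattern matters for a process that writes augmented images to disk: a hard kill mid-cycle can leave partial files that later poison dataset formation.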

No CI/CD pipeline, container definitions, or infrastructure-as-code were found. Deployment is manual.

## Testing Strategy

### Existing Tests

| Test | Type | Coverage |
| --- | --- | --- |
| tests/security_test.py | Script-based | Encrypts a test image and verifies that the roundtrip decrypt matches the original bytes |
| tests/imagelabel_visualize_test.py | Script-based | Loads sample annotations with preprocessing.read_labels (broken: the preprocessing module is missing) |

### Gaps

- No formal test framework (pytest/unittest) configured
- No integration tests for the training pipeline, augmentation, or inference
- No API client tests (mocked or live)
- No augmentation correctness tests (bounding box transform validation)
- Security test is a standalone script, not runnable via a test runner
- imagelabel_visualize_test.py cannot run due to the missing preprocessing module
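
One of the gaps above, bounding-box transform validation, lends itself to cheap property tests that need no image data at all. A sketch assuming YOLO-format normalized boxes; hflip_yolo_bbox is a hypothetical stand-in for the pipeline's actual flip transform:

```python
def hflip_yolo_bbox(bbox):
    """Horizontally flip a YOLO-format box (x_center, y_center, w, h) in [0, 1].

    Only the x-center mirrors; the y-center and the box size are unchanged.
    """
    x, y, w, h = bbox
    return (1.0 - x, y, w, h)

def test_hflip_preserves_size_and_bounds():
    # Dyadic coordinates (exact in binary floating point) keep the
    # double-flip identity check exact.
    bbox = (0.25, 0.6, 0.2, 0.1)
    flipped = hflip_yolo_bbox(bbox)
    assert flipped[2:] == bbox[2:]                     # size invariance
    assert hflip_yolo_bbox(flipped) == bbox            # double flip is identity
    assert all(0.0 <= v <= 1.0 for v in flipped)       # stays normalized
```

Size invariance, double-flip identity, and bounds checks catch the most common transform bugs (wrong axis, off-by-one in normalization) without any fixture images.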

### Observed Quality Mechanisms

- Corrupt label detection during dataset formation (coords > 1.0 → moved to /data-corrupted/)
- Bounding box clipping and filtering during augmentation
- Training checkpointing (save_period=1) for crash recovery
- Augmentation exception handling per-image and per-variant
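
The corrupt-label check listed above (coordinates > 1.0 routed to /data-corrupted/) could look roughly like this; the label-file parsing and directory handling are assumptions for illustration, not lifted from the codebase:

```python
import shutil
from pathlib import Path

def label_is_corrupt(label_text: str) -> bool:
    """A YOLO label line is 'class x y w h' with coords normalized to [0, 1];
    any coordinate above 1.0 marks the whole file as corrupt."""
    for line in label_text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # parsing details are assumptions, not from the codebase
        if any(float(v) > 1.0 for v in parts[1:]):
            return True
    return False

def quarantine_corrupt(labels_dir: Path, corrupted_dir: Path) -> int:
    """Move corrupt label files aside (the doc's /data-corrupted/ pattern)."""
    corrupted_dir.mkdir(parents=True, exist_ok=True)
    moved = 0
    for label_file in list(labels_dir.glob("*.txt")):
        if label_is_corrupt(label_file.read_text()):
            shutil.move(str(label_file), str(corrupted_dir / label_file.name))
            moved += 1
    return moved
```

Quarantining rather than deleting preserves the bad annotations for later inspection of whatever produced them upstream.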

## References

| Artifact | Path | Purpose |
| --- | --- | --- |
| Main config | config.yaml | API credentials, queue config, directory paths |
| CDN config | cdn.yaml | S3 CDN endpoint and access keys |
| Class definitions | classes.json | 17 annotation classes with colors |
| Python dependencies | requirements.txt | Main pipeline dependencies |
| Queue dependencies | annotation-queue/requirements.txt | Annotation queue service dependencies |
| Edge setup | orangepi5/*.sh | OrangePi5 installation and run scripts |
| Training checkpoint | checkpoint.txt | Last training run timestamp (2024-06-27) |