
# Architecture

## System Context

Azaion AI Training is a Python-based ML pipeline for training, exporting, and deploying YOLOv11 object detection models. The system operates within the Azaion platform ecosystem, consuming annotated image data and producing encrypted inference-ready models.

### Boundaries

| Boundary | Interface | Protocol |
| --- | --- | --- |
| Azaion REST API | ApiClient | HTTPS (JWT auth) |
| S3-compatible CDN | CDNManager (boto3) | HTTPS (S3 API) |
| RabbitMQ Streams | rstream Consumer | AMQP 1.0 |
| Local filesystem | Direct I/O | POSIX paths at `/azaion/` |
| NVIDIA GPU | PyTorch, TensorRT, ONNX RT, PyCUDA | CUDA 12.1 |

### System Context Diagram

```mermaid
graph LR
    subgraph "Azaion Platform"
        API[Azaion REST API]
        CDN[S3-compatible CDN]
        Queue[RabbitMQ Streams]
    end

    subgraph "AI Training System"
        AQ[Annotation Queue Consumer]
        AUG[Augmentation Pipeline]
        TRAIN[Training Pipeline]
        INF[Inference Engine]
    end

    subgraph "Storage"
        FS["/azaion/ filesystem"]
    end

    subgraph "Hardware"
        GPU[NVIDIA GPU]
    end

    Queue -->|annotation events| AQ
    AQ -->|images + labels| FS
    FS -->|raw annotations| AUG
    AUG -->|augmented data| FS
    FS -->|processed dataset| TRAIN
    TRAIN -->|trained model| GPU
    TRAIN -->|encrypted model| API
    TRAIN -->|encrypted model big part| CDN
    API -->|encrypted model small part| INF
    CDN -->|encrypted model big part| INF
    INF -->|inference| GPU
```

## Tech Stack

| Layer | Technology | Version/Detail |
| --- | --- | --- |
| Language | Python | 3.10+ (match statements) |
| ML Framework | Ultralytics YOLO | YOLOv11 medium |
| Deep Learning | PyTorch | 2.3.0 (CUDA 12.1) |
| GPU Inference | TensorRT | FP16/INT8, async CUDA streams |
| GPU Inference (alt) | ONNX Runtime GPU | CUDAExecutionProvider |
| Edge Inference | RKNN | RK3588 (OrangePi5) |
| Augmentation | Albumentations | Geometric + color transforms |
| Computer Vision | OpenCV | Image I/O, preprocessing, display |
| Object Storage | boto3 | S3-compatible CDN |
| Message Queue | rstream | RabbitMQ Streams consumer |
| Serialization | msgpack | Queue message format |
| Encryption | cryptography | AES-256-CBC |
| HTTP Client | requests | REST API communication |
| Configuration | PyYAML | YAML config files |
| Visualization | matplotlib, netron | Annotation display, model graphs |

## Deployment Model

The system runs as multiple independent processes on machines with NVIDIA GPUs:

| Process | Entry Point | Runtime | Typical Host |
| --- | --- | --- | --- |
| Training | train.py | Long-running (days) | GPU server (RTX 4090, 24 GB VRAM) |
| Augmentation | augmentation.py | Continuous loop (infinite) | Same GPU server or CPU-only |
| Annotation Queue | annotation-queue/annotation_queue_handler.py | Continuous (async) | Any server with network access |
| Inference | start_inference.py | On-demand | GPU-equipped machine |
| Data Tools | convert-annotations.py, dataset-visualiser.py | Ad hoc | Developer machine |

No containerization (Dockerfile), CI/CD pipeline, or orchestration infrastructure was found in the codebase. Deployment appears to be manual.

## Data Model Overview

### Annotation Data Flow

```
Raw annotations (Queue) → /azaion/data-seed/              (unvalidated)
                        → /azaion/data/                   (validated)
                        → /azaion/data-processed/         (augmented, 8×)
                        → /azaion/datasets/azaion-{date}/ (train/valid/test split)
                        → /azaion/data-corrupted/         (invalid labels)
                        → /azaion/data_deleted/           (soft-deleted)
```
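The directory layout above can be sketched as a set of path constants. This is an illustrative helper, not the actual `constants.py`; the constant names and the `dataset_dir` function are assumptions.

```python
from pathlib import Path

# Hypothetical path constants mirroring the pipeline stages described above.
AZAION_ROOT = Path("/azaion")

DATA_SEED = AZAION_ROOT / "data-seed"            # unvalidated queue output
DATA = AZAION_ROOT / "data"                      # validated annotations
DATA_PROCESSED = AZAION_ROOT / "data-processed"  # augmented, 8x
DATA_CORRUPTED = AZAION_ROOT / "data-corrupted"  # invalid labels
DATA_DELETED = AZAION_ROOT / "data_deleted"      # soft-deleted


def dataset_dir(date: str) -> Path:
    """Directory holding the train/valid/test split for a dated snapshot."""
    return AZAION_ROOT / "datasets" / f"azaion-{date}"
```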

### Annotation Class System

- 17 base classes (ArmorVehicle, Truck, Vehicle, Artillery, Shadow, Trenches, MilitaryMan, TyreTracks, AdditArmoredTank, Smoke, Plane, Moto, CamouflageNet, CamouflageBranches, Roof, Building, Caponier)
- 3 weather modes: Norm (offset 0), Wint (offset 20), Night (offset 40)
- Total class slots: 80 (17 × 3 = 51 used, 29 reserved)
- Format: YOLO (center_x, center_y, width, height, all normalized to [0, 1])
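The class-slot arithmetic implied by the list above can be made concrete. This is an illustrative helper, not code from the repository; only the class names and offsets are taken from the document.

```python
# Base class index + weather offset gives the final YOLO class id.
BASE_CLASSES = [
    "ArmorVehicle", "Truck", "Vehicle", "Artillery", "Shadow", "Trenches",
    "MilitaryMan", "TyreTracks", "AdditArmoredTank", "Smoke", "Plane",
    "Moto", "CamouflageNet", "CamouflageBranches", "Roof", "Building",
    "Caponier",
]
WEATHER_OFFSETS = {"Norm": 0, "Wint": 20, "Night": 40}
TOTAL_SLOTS = 80  # 51 used (17 x 3), 29 reserved


def class_id(base: str, weather: str) -> int:
    """Map a (base class, weather mode) pair to its class slot."""
    return BASE_CLASSES.index(base) + WEATHER_OFFSETS[weather]
```

For example, `class_id("Caponier", "Night")` yields 56, the highest used slot, leaving slots 57–79 reserved.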

### Model Artifacts

| Format | Use | Export Details |
| --- | --- | --- |
| .pt | Training checkpoint | YOLOv11 PyTorch weights |
| .onnx | Cross-platform inference | 1280 px, batch=4, NMS baked in |
| .engine | GPU inference (production) | TensorRT FP16, batch=4, built per GPU architecture |
| .rknn | Edge inference | RK3588 target (OrangePi5) |
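The export details in the table roughly correspond to Ultralytics `YOLO.export()` keyword arguments. The mapping below is a sketch read off the table, not the actual `exports.py`; the exact argument values used in the codebase are assumptions.

```python
# Hypothetical mapping from artifact format to Ultralytics export kwargs,
# e.g. YOLO("model.pt").export(**export_args("engine")).
def export_args(fmt: str) -> dict:
    common = {"imgsz": 1280, "batch": 4}
    if fmt == "onnx":
        # Cross-platform inference with NMS baked into the graph.
        return {"format": "onnx", "nms": True, **common}
    if fmt == "engine":
        # TensorRT FP16; the resulting .engine is tied to the GPU architecture
        # it was built on, so it must be rebuilt per deployment GPU.
        return {"format": "engine", "half": True, **common}
    if fmt == "rknn":
        # Edge target: RK3588 SoC (OrangePi5).
        return {"format": "rknn"}
    raise ValueError(f"unsupported export format: {fmt}")
```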

## Integration Points

### Azaion REST API

- `POST /login` → JWT token
- `POST /resources/{folder}` → file upload (Bearer auth)
- `POST /resources/get/{folder}` → encrypted file download (hardware-bound key)
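The login/upload flow above can be sketched with `requests`. This is not the actual `ApiClient`: the base URL, the JSON field names (`email`, `password`, `token`), and the multipart field name are placeholders.

```python
import requests

API_BASE = "https://api.example.com"  # placeholder; the real base URL is configured elsewhere


def bearer_headers(token: str) -> dict:
    # Authenticated endpoints pass the JWT from POST /login as a Bearer token.
    return {"Authorization": f"Bearer {token}"}


def login(email: str, password: str) -> str:
    # POST /login returns a JWT; the response field name "token" is assumed.
    resp = requests.post(f"{API_BASE}/login", json={"email": email, "password": password})
    resp.raise_for_status()
    return resp.json()["token"]


def upload(token: str, folder: str, path: str) -> None:
    # POST /resources/{folder} uploads a file with Bearer auth.
    with open(path, "rb") as f:
        resp = requests.post(
            f"{API_BASE}/resources/{folder}",
            headers=bearer_headers(token),
            files={"file": f},
        )
    resp.raise_for_status()
```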

### S3-compatible CDN

- Upload: model big parts (`upload_fileobj`)
- Download: model big parts (`download_file`)
- Separate read/write access keys

### RabbitMQ Streams

- Queue: `azaion-annotations`
- Protocol: AMQP with the rstream library
- Message format: msgpack with positional integer keys
- Offset tracking: persisted to `offset.yaml`
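A sketch of the msgpack framing with positional integer keys. The field meanings (0 = image id, 1 = label payload) are illustrative assumptions, not the actual message schema.

```python
import msgpack


def encode_annotation_event(image_id: int, labels: list) -> bytes:
    # Positional integer keys keep messages compact vs. string field names.
    return msgpack.packb({0: image_id, 1: labels})


def decode_annotation_event(raw: bytes) -> dict:
    # strict_map_key=False is required: msgpack-python rejects non-str/bytes
    # map keys by default, and these messages use integer keys.
    msg = msgpack.unpackb(raw, strict_map_key=False)
    return {"image_id": msg[0], "labels": msg[1]}
```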

## Non-Functional Requirements (Observed)

| Category | Observation | Source |
| --- | --- | --- |
| Training duration | ~11.5 days for 360K annotations on 1× RTX 4090 | Code comment in train.py |
| VRAM usage | batch=11 → ~22 GB (batch=12 fails at 24.2 GB) | Code comment in train.py |
| Inference speed | TensorRT: 54 s for a 200 s video (3.7 GB VRAM) | Code comment in start_inference.py |
| ONNX inference | 81 s for a 200 s video (6.3 GB VRAM) | Code comment in start_inference.py |
| Augmentation ratio | 8× (1 original + 7 augmented per image) | augmentation.py |
| Frame sampling | Every 4th frame during inference | inference/inference.py |
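The every-4th-frame sampling in the last row amounts to a strided frame index, sketched below; the stride constant and helper name are illustrative, not the actual `inference/inference.py` loop.

```python
FRAME_STRIDE = 4  # process every 4th frame during video inference


def frames_to_process(total_frames: int, stride: int = FRAME_STRIDE) -> range:
    """Indices of the frames the inference engine would run on."""
    return range(0, total_frames, stride)
```

This cuts the inference workload to roughly a quarter of the frames, which is consistent with TensorRT processing a 200 s video in 54 s.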

## Security Architecture

| Mechanism | Implementation | Location |
| --- | --- | --- |
| API authentication | JWT token (email/password login) | api_client.py |
| Resource encryption | AES-256-CBC (hardware-bound key) | security.py |
| Model encryption | AES-256-CBC (static key) | security.py |
| Split model storage | Small part on API, big part on CDN | api_client.py |
| Hardware fingerprinting | CPU+GPU+RAM+drive serial hash | hardware_service.py |
| CDN access control | Separate read/write S3 credentials | cdn_manager.py |
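A minimal AES-256-CBC round trip using the `cryptography` package named in the tech stack. This is a sketch of the cipher, not the actual `security.py`: the IV handling and PKCS7 padding shown here are assumptions, and real key material is either hardware-derived (API resources) or static (models).

```python
import os

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes


def encrypt(key: bytes, plaintext: bytes) -> bytes:
    # Fresh random IV per message, prepended to the ciphertext.
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()  # CBC needs block-aligned input
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()


def decrypt(key: bytes, blob: bytes) -> bytes:
    iv, ciphertext = blob[:16], blob[16:]
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(ciphertext) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()
```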

### Security Concerns

- Hardcoded credentials in config.yaml and cdn.yaml
- Hardcoded model encryption key in security.py
- No TLS certificate validation visible in code
- No input validation on API responses
- Queue credentials in plaintext config files

## Key Architectural Decisions

| Decision | Rationale (inferred) |
| --- | --- |
| YOLOv11 medium at 1280 px | Balance between detection quality and training time |
| Split model storage | Prevent model theft from a single storage compromise |
| Hardware-bound API encryption | Tie resource access to authorized machines |
| TensorRT for production inference | ~33% faster than ONNX, ~42% less VRAM |
| Augmentation as separate process | Decouples data prep from training; runs continuously |
| Annotation queue as separate service | Independent lifecycle; different dependency set |
| RKNN export for OrangePi5 | Edge deployment on a low-power ARM SoC |