# Architecture
## System Context
Azaion AI Training is a Python-based ML pipeline for training, exporting, and deploying YOLOv11 object detection models. The system operates within the Azaion platform ecosystem, consuming annotated image data and producing encrypted inference-ready models.
### Boundaries
| Boundary | Interface | Protocol |
|----------|-----------|----------|
| Azaion REST API | ApiClient | HTTPS (JWT auth) |
| S3-compatible CDN | CDNManager (boto3) | HTTPS (S3 API) |
| RabbitMQ Streams | rstream Consumer | AMQP 1.0 |
| Local filesystem | Direct I/O | POSIX paths at `/azaion/` |
| NVIDIA GPU | PyTorch, TensorRT, ONNX RT, PyCUDA | CUDA 12.1 |
### System Context Diagram
```mermaid
graph LR
    subgraph "Azaion Platform"
        API[Azaion REST API]
        CDN[S3-compatible CDN]
        Queue[RabbitMQ Streams]
    end
    subgraph "AI Training System"
        AQ[Annotation Queue Consumer]
        AUG[Augmentation Pipeline]
        TRAIN[Training Pipeline]
        INF[Inference Engine]
    end
    subgraph "Storage"
        FS["/azaion/ filesystem"]
    end
    subgraph "Hardware"
        GPU[NVIDIA GPU]
    end
    Queue -->|annotation events| AQ
    AQ -->|images + labels| FS
    FS -->|raw annotations| AUG
    AUG -->|augmented data| FS
    FS -->|processed dataset| TRAIN
    TRAIN -->|trained model| GPU
    TRAIN -->|encrypted model| API
    TRAIN -->|encrypted model big part| CDN
    API -->|encrypted model small part| INF
    CDN -->|encrypted model big part| INF
    INF -->|inference| GPU
```
## Tech Stack
| Layer | Technology | Version/Detail |
|-------|-----------|---------------|
| Language | Python | 3.10+ (match statements) |
| ML Framework | Ultralytics YOLO | YOLOv11 medium |
| Deep Learning | PyTorch | 2.3.0 (CUDA 12.1) |
| GPU Inference | TensorRT | FP16/INT8, async CUDA streams |
| GPU Inference (alt) | ONNX Runtime GPU | CUDAExecutionProvider |
| Edge Inference | RKNN | RK3588 (OrangePi5) |
| Augmentation | Albumentations | Geometric + color transforms |
| Computer Vision | OpenCV | Image I/O, preprocessing, display |
| Object Storage | boto3 | S3-compatible CDN |
| Message Queue | rstream | RabbitMQ Streams consumer |
| Serialization | msgpack | Queue message format |
| Encryption | cryptography | AES-256-CBC |
| HTTP Client | requests | REST API communication |
| Configuration | PyYAML | YAML config files |
| Visualization | matplotlib, netron | Annotation display, model graphs |
## Deployment Model
The system runs as multiple independent processes on machines with NVIDIA GPUs:
| Process | Entry Point | Runtime | Typical Host |
|---------|------------|---------|-------------|
| Training | `train.py` | Long-running (days) | GPU server (RTX 4090, 24GB VRAM) |
| Augmentation | `augmentation.py` | Continuous loop (infinite) | Same GPU server or CPU-only |
| Annotation Queue | `annotation-queue/annotation_queue_handler.py` | Continuous (async) | Any server with network access |
| Inference | `start_inference.py` | On-demand | GPU-equipped machine |
| Data Tools | `convert-annotations.py`, `dataset-visualiser.py` | Ad-hoc | Developer machine |
No containerization (no Dockerfile), CI/CD configuration, or orchestration infrastructure was found in the codebase; deployment appears to be manual.
## Data Model Overview
### Annotation Data Flow
```
Raw annotations (Queue) → /azaion/data-seed/ (unvalidated)
  → /azaion/data/ (validated)
  → /azaion/data-processed/ (augmented, 8×)
  → /azaion/datasets/azaion-{date}/ (train/valid/test split)

Side paths:
  → /azaion/data-corrupted/ (invalid labels)
  → /azaion/data_deleted/ (soft-deleted)
```
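The validated/corrupted split above implies a routing step that checks each label file before promoting it out of `data-seed/`. A minimal sketch of that check, assuming YOLO `.txt` labels and the 80-slot class range described below (function names and the exact validation rules are hypothetical, not read from the codebase):

```python
from pathlib import Path
import shutil

# Hypothetical stage directories, matching the flow above.
DATA = Path("/azaion/data")
DATA_CORRUPTED = Path("/azaion/data-corrupted")

def label_is_valid(label_text: str, num_classes: int = 80) -> bool:
    """Check that every label line parses as 'class cx cy w h'
    with an in-range class ID and normalized coordinates."""
    for line in label_text.strip().splitlines():
        parts = line.split()
        if len(parts) != 5:
            return False
        try:
            cls = int(parts[0])
            coords = [float(p) for p in parts[1:]]
        except ValueError:
            return False
        if not 0 <= cls < num_classes:
            return False
        if any(not 0.0 <= c <= 1.0 for c in coords):
            return False
    return True

def route_label(label_path: Path) -> Path:
    """Promote a seed label (and, in the real pipeline, its image)
    to data/ when valid, data-corrupted/ otherwise."""
    dest = DATA if label_is_valid(label_path.read_text()) else DATA_CORRUPTED
    dest.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(label_path), str(dest / label_path.name)))
```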
### Annotation Class System
- 17 base classes (ArmorVehicle, Truck, Vehicle, Artillery, Shadow, Trenches, MilitaryMan, TyreTracks, AdditArmoredTank, Smoke, Plane, Moto, CamouflageNet, CamouflageBranches, Roof, Building, Caponier)
- 3 weather modes: Norm (offset 0), Wint (offset 20), Night (offset 40)
- Total class slots: 80 (17 × 3 = 51 used, 29 reserved)
- Format: YOLO (center_x, center_y, width, height — all normalized to 0–1)
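The offset scheme above means a detection's class slot encodes both the object type and the weather mode. A sketch of the mapping and its inverse (reconstructed from the offsets stated here, not copied from the codebase):

```python
# 17 base classes in the order listed above; weather modes are fixed ID offsets.
BASE_CLASSES = [
    "ArmorVehicle", "Truck", "Vehicle", "Artillery", "Shadow",
    "Trenches", "MilitaryMan", "TyreTracks", "AdditArmoredTank",
    "Smoke", "Plane", "Moto", "CamouflageNet", "CamouflageBranches",
    "Roof", "Building", "Caponier",
]
WEATHER_OFFSETS = {"Norm": 0, "Wint": 20, "Night": 40}

def class_id(base: str, weather: str) -> int:
    """Map (base class, weather mode) to one of the 80 class slots."""
    return WEATHER_OFFSETS[weather] + BASE_CLASSES.index(base)

def decode_class(slot: int) -> tuple[str, str]:
    """Inverse mapping: recover (base class, weather mode) from a slot.
    Slots 17-19, 37-39, and 57-79 are the 29 reserved gaps."""
    for weather, offset in sorted(WEATHER_OFFSETS.items(),
                                  key=lambda kv: kv[1], reverse=True):
        if slot >= offset and slot - offset < len(BASE_CLASSES):
            return BASE_CLASSES[slot - offset], weather
    raise ValueError(f"unused class slot: {slot}")
```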
### Model Artifacts
| Format | Use | Export Details |
|--------|-----|---------------|
| `.pt` | Training checkpoint | YOLOv11 PyTorch weights |
| `.onnx` | Cross-platform inference | 1280px, batch=4, NMS baked in |
| `.engine` | GPU inference (production) | TensorRT FP16, batch=4, per-GPU architecture |
| `.rknn` | Edge inference | RK3588 target (OrangePi5) |
## Integration Points
### Azaion REST API
- `POST /login` → JWT token
- `POST /resources/{folder}` → file upload (Bearer auth)
- `POST /resources/get/{folder}` → encrypted file download (hardware-bound key)
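A minimal client sketch for the three endpoints, using the `requests` library the doc lists. The base URL, JSON field names (`email`, `password`, `token`), and multipart field name are assumptions; only the paths and the Bearer scheme come from the doc:

```python
import requests

API_BASE = "https://api.example.invalid"  # placeholder; real host not in the doc

def bearer(token: str) -> dict[str, str]:
    """Authorization header for endpoints behind JWT auth."""
    return {"Authorization": f"Bearer {token}"}

def login(email: str, password: str) -> str:
    """POST /login → JWT token (response field name is an assumption)."""
    r = requests.post(f"{API_BASE}/login",
                      json={"email": email, "password": password}, timeout=30)
    r.raise_for_status()
    return r.json()["token"]

def upload(token: str, folder: str, path: str) -> None:
    """POST /resources/{folder} → file upload with Bearer auth."""
    with open(path, "rb") as f:
        r = requests.post(f"{API_BASE}/resources/{folder}",
                          headers=bearer(token), files={"file": f}, timeout=300)
    r.raise_for_status()
```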
### S3-compatible CDN
- Upload: model big parts (`upload_fileobj`)
- Download: model big parts (`download_file`)
- Separate read/write access keys
### RabbitMQ Streams
- Queue: `azaion-annotations`
- Protocol: AMQP with rstream library
- Message format: msgpack with positional integer keys
- Offset tracking: persisted to `offset.yaml`
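"Positional integer keys" means a decoded message is a dict keyed by field position rather than field name, e.g. `{0: ..., 1: ...}` out of `msgpack.unpackb`. A sketch of the consumer-side translation; the field positions and names below are hypothetical, only the keying scheme is from the doc:

```python
# Hypothetical position → name mapping for annotation events.
FIELD_NAMES = {0: "image_url", 1: "labels", 2: "event_type"}

def decode_annotation_event(fields: dict) -> dict:
    """Translate a positionally keyed msgpack payload into named fields,
    silently skipping positions this consumer does not know about
    (which lets producers append fields without breaking old consumers)."""
    return {name: fields[pos]
            for pos, name in FIELD_NAMES.items() if pos in fields}
```

The last-consumed stream offset would then be written to `offset.yaml` after each batch so the consumer can resume where it left off.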
## Non-Functional Requirements (Observed)
| Category | Observation | Source |
|----------|------------|--------|
| Training duration | ~11.5 days for 360K annotations on 1× RTX 4090 | Code comment in train.py |
| VRAM usage | batch=11 → ~22GB (batch=12 fails at 24.2GB) | Code comment in train.py |
| Inference speed | TensorRT: 54s for 200s video (3.7GB VRAM) | Code comment in start_inference.py |
| ONNX inference | 81s for 200s video (6.3GB VRAM) | Code comment in start_inference.py |
| Augmentation ratio | 8× (1 original + 7 augmented per image) | augmentation.py |
| Frame sampling | Every 4th frame during inference | inference/inference.py |
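The every-4th-frame sampling in the last row can be sketched as a simple stride over frame indices (the helper name is ours; the real loop in `inference/inference.py` presumably reads frames via OpenCV and skips the rest):

```python
def frames_to_process(total_frames: int, step: int = 4) -> range:
    """Indices of the frames the inference loop runs the model on,
    given every-Nth-frame sampling; intermediate frames are skipped."""
    return range(0, total_frames, step)
```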
## Security Architecture
| Mechanism | Implementation | Location |
|-----------|---------------|----------|
| API authentication | JWT token (email/password login) | api_client.py |
| Resource encryption | AES-256-CBC (hardware-bound key) | security.py |
| Model encryption | AES-256-CBC (static key) | security.py |
| Split model storage | Small part on API, big part on CDN | api_client.py |
| Hardware fingerprinting | CPU+GPU+RAM+drive serial hash | hardware_service.py |
| CDN access control | Separate read/write S3 credentials | cdn_manager.py |
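A sketch of AES-256-CBC with the `cryptography` library named in the tech stack. PKCS7 padding and prepending a random IV to the ciphertext are our assumptions; `security.py` may handle the IV and key material differently:

```python
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt(plaintext: bytes, key: bytes) -> bytes:
    """AES-256-CBC: pad to the 16-byte block size, encrypt under a
    fresh random IV, and prepend the IV to the ciphertext."""
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()

def decrypt(blob: bytes, key: bytes) -> bytes:
    """Split off the IV, decrypt, and strip the PKCS7 padding."""
    iv, ct = blob[:16], blob[16:]
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(ct) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()
```

For the hardware-bound variant, the 32-byte key would be derived from the machine fingerprint produced by `hardware_service.py` rather than stored statically.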
### Security Concerns
- Hardcoded credentials in `config.yaml` and `cdn.yaml`
- Hardcoded model encryption key in `security.py`
- No TLS certificate validation visible in code
- No input validation on API responses
- Queue credentials in plaintext config files
## Key Architectural Decisions
| Decision | Rationale (inferred) |
|----------|---------------------|
| YOLOv11 medium at 1280px | Balance between detection quality and training time |
| Split model storage | Prevent model theft from single storage compromise |
| Hardware-bound API encryption | Tie resource access to authorized machines |
| TensorRT for production inference | ~33% faster than ONNX, ~42% less VRAM |
| Augmentation as separate process | Decouples data prep from training; runs continuously |
| Annotation queue as separate service | Independent lifecycle; different dependency set |
| RKNN export for OrangePi5 | Edge deployment on low-power ARM SoC |