Refactor constants management to use Pydantic BaseModel for configuration

- Replaced module-level path variables in constants.py with a structured Pydantic Config class.
- Updated all relevant modules (train.py, augmentation.py, exports.py, dataset-visualiser.py, manual_run.py) to access paths through the new config structure.
- Fixed bugs related to image processing and model saving.
- Enhanced test infrastructure to accommodate the new configuration approach.

This refactor improves code maintainability and clarity by centralizing configuration management.
Author: Oleksandr Bezdieniezhnykh
Date: 2026-03-27 18:18:30 +02:00
Commit: 142c6c4de8 (parent b68c07b540)
106 changed files with 5706 additions and 654 deletions
# Architecture
## System Context
Azaion AI Training is a Python-based ML pipeline for training, exporting, and deploying YOLOv11 object detection models. The system operates within the Azaion platform ecosystem, consuming annotated image data and producing encrypted inference-ready models.
### Boundaries
| Boundary | Interface | Protocol |
|----------|-----------|----------|
| Azaion REST API | ApiClient | HTTPS (JWT auth) |
| S3-compatible CDN | CDNManager (boto3) | HTTPS (S3 API) |
| RabbitMQ Streams | rstream Consumer | AMQP 1.0 |
| Local filesystem | Direct I/O | POSIX paths at `/azaion/` |
| NVIDIA GPU | PyTorch, TensorRT, ONNX RT, PyCUDA | CUDA 12.1 |
### System Context Diagram
```mermaid
graph LR
subgraph "Azaion Platform"
API[Azaion REST API]
CDN[S3-compatible CDN]
Queue[RabbitMQ Streams]
end
subgraph "AI Training System"
AQ[Annotation Queue Consumer]
AUG[Augmentation Pipeline]
TRAIN[Training Pipeline]
INF[Inference Engine]
end
subgraph "Storage"
FS["/azaion/ filesystem"]
end
subgraph "Hardware"
GPU[NVIDIA GPU]
end
Queue -->|annotation events| AQ
AQ -->|images + labels| FS
FS -->|raw annotations| AUG
AUG -->|augmented data| FS
FS -->|processed dataset| TRAIN
TRAIN -->|trained model| GPU
TRAIN -->|encrypted model| API
TRAIN -->|encrypted model big part| CDN
API -->|encrypted model small part| INF
CDN -->|encrypted model big part| INF
INF -->|inference| GPU
```
## Tech Stack
| Layer | Technology | Version/Detail |
|-------|-----------|---------------|
| Language | Python | 3.10+ (uses `match` statements) |
| ML Framework | Ultralytics YOLO | YOLOv11 medium |
| Deep Learning | PyTorch | 2.3.0 (CUDA 12.1) |
| GPU Inference | TensorRT | FP16/INT8, async CUDA streams |
| GPU Inference (alt) | ONNX Runtime GPU | CUDAExecutionProvider |
| Edge Inference | RKNN | RK3588 (OrangePi5) |
| Augmentation | Albumentations | Geometric + color transforms |
| Computer Vision | OpenCV | Image I/O, preprocessing, display |
| Object Storage | boto3 | S3-compatible CDN |
| Message Queue | rstream | RabbitMQ Streams consumer |
| Serialization | msgpack | Queue message format |
| Encryption | cryptography | AES-256-CBC |
| HTTP Client | requests | REST API communication |
| Configuration | PyYAML | YAML config files |
| Visualization | matplotlib, netron | Annotation display, model graphs |
## Deployment Model
The system runs as multiple independent processes on machines with NVIDIA GPUs:
| Process | Entry Point | Runtime | Typical Host |
|---------|------------|---------|-------------|
| Training | `train.py` | Long-running (days) | GPU server (RTX 4090, 24GB VRAM) |
| Augmentation | `augmentation.py` | Continuous loop (infinite) | Same GPU server or CPU-only |
| Annotation Queue | `annotation-queue/annotation_queue_handler.py` | Continuous (async) | Any server with network access |
| Inference | `start_inference.py` | On-demand | GPU-equipped machine |
| Data Tools | `convert-annotations.py`, `dataset-visualiser.py` | Ad-hoc | Developer machine |
No containerization (no Dockerfile), CI/CD pipeline, or orchestration infrastructure was found in the codebase; deployment appears to be manual.
## Data Model Overview
### Annotation Data Flow
```
Raw annotations (Queue) → /azaion/data-seed/ (unvalidated)
→ /azaion/data/ (validated)
→ /azaion/data-processed/ (augmented, 8×)
→ /azaion/datasets/azaion-{date}/ (train/valid/test split)
→ /azaion/data-corrupted/ (invalid labels)
→ /azaion/data_deleted/ (soft-deleted)
```
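Per the commit message, these directory roots are centralized in a Pydantic model rather than module-level variables. The sketch below illustrates one way that could look; the class name, field names, and property names are assumptions, not the actual contents of `constants.py`:

```python
from pathlib import Path

from pydantic import BaseModel


class PathsConfig(BaseModel):
    """Illustrative centralized path config (names are assumptions)."""

    root: Path = Path("/azaion")

    @property
    def data_seed(self) -> Path:
        # unvalidated annotations straight from the queue
        return self.root / "data-seed"

    @property
    def data(self) -> Path:
        # validated annotations
        return self.root / "data"

    @property
    def data_processed(self) -> Path:
        # augmented output (8x per original image)
        return self.root / "data-processed"

    @property
    def datasets(self) -> Path:
        # dated train/valid/test splits, e.g. azaion-{date}
        return self.root / "datasets"
```

Consumers (`train.py`, `augmentation.py`, etc.) would then import a single config instance instead of scattered path constants, which is the maintainability gain the commit describes.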
### Annotation Class System
- 17 base classes (ArmorVehicle, Truck, Vehicle, Artillery, Shadow, Trenches, MilitaryMan, TyreTracks, AdditArmoredTank, Smoke, Plane, Moto, CamouflageNet, CamouflageBranches, Roof, Building, Caponier)
- 3 weather modes: Norm (offset 0), Wint (offset 20), Night (offset 40)
- Total class slots: 80 (17 × 3 = 51 used, 29 reserved)
- Format: YOLO (center_x, center_y, width, height, all normalized to [0, 1])
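The offset scheme above can be expressed as a small lookup. This is a sketch for illustration; the actual constants live in the project's `constants.py`:

```python
# The 17 base classes in their documented order; weather mode adds an offset.
BASE_CLASSES = [
    "ArmorVehicle", "Truck", "Vehicle", "Artillery", "Shadow", "Trenches",
    "MilitaryMan", "TyreTracks", "AdditArmoredTank", "Smoke", "Plane", "Moto",
    "CamouflageNet", "CamouflageBranches", "Roof", "Building", "Caponier",
]
WEATHER_OFFSETS = {"Norm": 0, "Wint": 20, "Night": 40}


def class_id(base: str, weather: str) -> int:
    """Map a (base class, weather mode) pair onto the 80-slot class space."""
    return WEATHER_OFFSETS[weather] + BASE_CLASSES.index(base)
```

For example, `class_id("Truck", "Night")` is 41 (base index 1 plus the Night offset of 40); slots 17-19, 37-39, and 57-79 are the 29 reserved ones.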
### Model Artifacts
| Format | Use | Export Details |
|--------|-----|---------------|
| `.pt` | Training checkpoint | YOLOv11 PyTorch weights |
| `.onnx` | Cross-platform inference | 1280px, batch=4, NMS baked in |
| `.engine` | GPU inference (production) | TensorRT FP16, batch=4, per-GPU architecture |
| `.rknn` | Edge inference | RK3588 target (OrangePi5) |
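The export settings in the table can be collected into keyword arguments for an exporter such as ultralytics' `YOLO.export()`. The helper below is a hypothetical illustration; the parameter names mirror the ultralytics export API, but only the values stated in this document (1280px, batch=4, NMS, FP16) are taken from the codebase:

```python
def export_kwargs(target: str) -> dict:
    """Collect per-format export settings from the artifacts table (sketch)."""
    common = {"imgsz": 1280, "batch": 4}
    if target == "onnx":
        # NMS baked into the exported graph
        return {"format": "onnx", "nms": True, **common}
    if target == "engine":
        # TensorRT FP16; the resulting .engine is specific to the GPU architecture
        return {"format": "engine", "half": True, **common}
    if target == "rknn":
        # RK3588 target (OrangePi5)
        return {"format": "rknn", **common}
    raise ValueError(f"unknown export target: {target}")
```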
## Integration Points
### Azaion REST API
- `POST /login` → JWT token
- `POST /resources/{folder}` → file upload (Bearer auth)
- `POST /resources/get/{folder}` → encrypted file download (hardware-bound key)
### S3-compatible CDN
- Upload: model big parts (`upload_fileobj`)
- Download: model big parts (`download_file`)
- Separate read/write access keys
### RabbitMQ Streams
- Queue: `azaion-annotations`
- Protocol: AMQP with rstream library
- Message format: msgpack with positional integer keys
- Offset tracking: persisted to `offset.yaml`
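Because the msgpack payloads use positional integer keys, the consumer has to translate them into named fields after unpacking. The field names below are hypothetical placeholders; the real mapping lives in the annotation-queue handler:

```python
# Hypothetical positional-key -> field-name mapping for queue messages.
FIELD_NAMES = {0: "image_id", 1: "image_url", 2: "labels", 3: "deleted"}


def decode_annotation(raw: dict) -> dict:
    """Translate an int-keyed dict (as produced by msgpack.unpackb) into named fields."""
    return {FIELD_NAMES[k]: v for k, v in raw.items() if k in FIELD_NAMES}
```

In the real pipeline the input dict would come from `msgpack.unpackb()` on the message body, after which the handler writes images and labels to `/azaion/data-seed/` and advances the offset persisted in `offset.yaml`.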
## Non-Functional Requirements (Observed)
| Category | Observation | Source |
|----------|------------|--------|
| Training duration | ~11.5 days for 360K annotations on 1× RTX 4090 | Code comment in train.py |
| VRAM usage | batch=11 → ~22GB (batch=12 fails at 24.2GB) | Code comment in train.py |
| Inference speed | TensorRT: 54s for 200s video (3.7GB VRAM) | Code comment in start_inference.py |
| ONNX inference | 81s for 200s video (6.3GB VRAM) | Code comment in start_inference.py |
| Augmentation ratio | 8× (1 original + 7 augmented per image) | augmentation.py |
| Frame sampling | Every 4th frame during inference | inference/inference.py |
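The frame-sampling observation amounts to processing only every 4th frame of the input video. A minimal sketch of that policy:

```python
def sampled_frames(frames, stride: int = 4):
    """Yield every `stride`-th frame (indices 0, 4, 8, ...), as during inference."""
    for i, frame in enumerate(frames):
        if i % stride == 0:
            yield frame
```

In the real engine the frames would come from an OpenCV capture loop; skipping 3 of every 4 frames is a large part of why a 200s video takes well under 200s to process.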
## Security Architecture
| Mechanism | Implementation | Location |
|-----------|---------------|----------|
| API authentication | JWT token (email/password login) | api_client.py |
| Resource encryption | AES-256-CBC (hardware-bound key) | security.py |
| Model encryption | AES-256-CBC (static key) | security.py |
| Split model storage | Small part on API, big part on CDN | api_client.py |
| Hardware fingerprinting | CPU+GPU+RAM+drive serial hash | hardware_service.py |
| CDN access control | Separate read/write S3 credentials | cdn_manager.py |
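Both encryption rows use AES-256-CBC via the `cryptography` package from the tech stack. The sketch below shows the generic pattern (PKCS7 padding, random IV prepended to the ciphertext); the project's actual key derivation (hardware-bound vs. static) and framing are not documented here:

```python
import os

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes


def encrypt_aes256_cbc(key: bytes, plaintext: bytes) -> bytes:
    """AES-256-CBC with PKCS7 padding; the random IV is prepended to the output."""
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()


def decrypt_aes256_cbc(key: bytes, blob: bytes) -> bytes:
    """Reverse of encrypt_aes256_cbc: split off the IV, decrypt, strip padding."""
    iv, ct = blob[:16], blob[16:]
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(ct) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()
```

Note that with a hardware-bound key (the resource-encryption case) the same ciphertext is only recoverable on the machine whose fingerprint produced the key.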
### Security Concerns
- Hardcoded credentials in `config.yaml` and `cdn.yaml`
- Hardcoded model encryption key in `security.py`
- No TLS certificate validation visible in code
- No input validation on API responses
- Queue credentials in plaintext config files
## Key Architectural Decisions
| Decision | Rationale (inferred) |
|----------|---------------------|
| YOLOv11 medium at 1280px | Balance between detection quality and training time |
| Split model storage | Prevent model theft from single storage compromise |
| Hardware-bound API encryption | Tie resource access to authorized machines |
| TensorRT for production inference | ~33% faster than ONNX, ~42% less VRAM |
| Augmentation as separate process | Decouples data prep from training; runs continuously |
| Annotation queue as separate service | Independent lifecycle; different dependency set |
| RKNN export for OrangePi5 | Edge deployment on low-power ARM SoC |
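The split-model-storage decision (small part on the API, big part on the CDN) reduces to partitioning the encrypted blob and reassembling it at inference time. This sketch is purely illustrative; the actual split point and any framing metadata are not documented here, and `small_size` is an arbitrary placeholder:

```python
def split_model(model_bytes: bytes, small_size: int = 4096) -> tuple[bytes, bytes]:
    """Split an encrypted model into a small part (API) and a big part (CDN).

    small_size is a hypothetical placeholder, not the project's real split point.
    """
    return model_bytes[:small_size], model_bytes[small_size:]


def join_model(small: bytes, big: bytes) -> bytes:
    """Reassemble the model before decryption on the inference host."""
    return small + big
```

The rationale in the table follows directly: an attacker who compromises only the CDN (or only the API) holds an incomplete ciphertext.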