mirror of
https://github.com/azaion/ai-training.git
synced 2026-04-22 21:46:35 +00:00
176 lines
7.1 KiB
Markdown
# Architecture

## System Context

Azaion AI Training is a Python-based ML pipeline for training, exporting, and deploying YOLOv11 object detection models. The system operates within the Azaion platform ecosystem, consuming annotated image data and producing encrypted inference-ready models.

### Boundaries

| Boundary | Interface | Protocol |
|----------|-----------|----------|
| Azaion REST API | ApiClient | HTTPS (JWT auth) |
| S3-compatible CDN | CDNManager (boto3) | HTTPS (S3 API) |
| RabbitMQ Streams | rstream Consumer | AMQP 1.0 |
| Local filesystem | Direct I/O | POSIX paths at `/azaion/` |
| NVIDIA GPU | PyTorch, TensorRT, ONNX RT, PyCUDA | CUDA 12.1 |

### System Context Diagram

```mermaid
graph LR
    subgraph "Azaion Platform"
        API[Azaion REST API]
        CDN[S3-compatible CDN]
        Queue[RabbitMQ Streams]
    end

    subgraph "AI Training System"
        AQ[Annotation Queue Consumer]
        AUG[Augmentation Pipeline]
        TRAIN[Training Pipeline]
        INF[Inference Engine]
    end

    subgraph "Storage"
        FS["/azaion/ filesystem"]
    end

    subgraph "Hardware"
        GPU[NVIDIA GPU]
    end

    Queue -->|annotation events| AQ
    AQ -->|images + labels| FS
    FS -->|raw annotations| AUG
    AUG -->|augmented data| FS
    FS -->|processed dataset| TRAIN
    TRAIN -->|trained model| GPU
    TRAIN -->|encrypted model small part| API
    TRAIN -->|encrypted model big part| CDN
    API -->|encrypted model small part| INF
    CDN -->|encrypted model big part| INF
    INF -->|inference| GPU
```

## Tech Stack

| Layer | Technology | Version/Detail |
|-------|-----------|----------------|
| Language | Python | 3.10+ (match statements) |
| ML Framework | Ultralytics YOLO | YOLOv11 medium |
| Deep Learning | PyTorch | 2.3.0 (CUDA 12.1) |
| GPU Inference | TensorRT | FP16/INT8, async CUDA streams |
| GPU Inference (alt) | ONNX Runtime GPU | CUDAExecutionProvider |
| Edge Inference | RKNN | RK3588 (OrangePi5) |
| Augmentation | Albumentations | Geometric + color transforms |
| Computer Vision | OpenCV | Image I/O, preprocessing, display |
| Object Storage | boto3 | S3-compatible CDN |
| Message Queue | rstream | RabbitMQ Streams consumer |
| Serialization | msgpack | Queue message format |
| Encryption | cryptography | AES-256-CBC |
| HTTP Client | requests | REST API communication |
| Configuration | PyYAML | YAML config files |
| Visualization | matplotlib, netron | Annotation display, model graphs |

## Deployment Model
The system runs as multiple independent processes on machines with NVIDIA GPUs:

| Process | Entry Point | Runtime | Typical Host |
|---------|-------------|---------|--------------|
| Training | `train.py` | Long-running (days) | GPU server (RTX 4090, 24 GB VRAM) |
| Augmentation | `augmentation.py` | Continuous loop (infinite) | Same GPU server, or CPU-only |
| Annotation Queue | `annotation-queue/annotation_queue_handler.py` | Continuous (async) | Any server with network access |
| Inference | `start_inference.py` | On-demand | GPU-equipped machine |
| Data Tools | `convert-annotations.py`, `dataset-visualiser.py` | Ad hoc | Developer machine |

No containerization (no Dockerfile), CI/CD pipeline, or orchestration infrastructure was found in the codebase; deployment appears to be manual.

## Data Model Overview

### Annotation Data Flow

```
Raw annotations (Queue) → /azaion/data-seed/              (unvalidated)
                        → /azaion/data/                   (validated)
                        → /azaion/data-processed/         (augmented, 8×)
                        → /azaion/datasets/azaion-{date}/ (train/valid/test split)
                        → /azaion/data-corrupted/         (invalid labels)
                        → /azaion/data_deleted/           (soft-deleted)
```

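The staged layout above can be captured as a small path map; a minimal sketch using `pathlib` (the names mirror the flow above, not the project's actual `constants.py`):

```python
from pathlib import Path

# Hypothetical central path map mirroring the data-flow stages above;
# the real project keeps its paths in constants.py.
ROOT = Path("/azaion")

STAGES = {
    "seed": ROOT / "data-seed",            # unvalidated queue output
    "validated": ROOT / "data",            # passed label validation
    "processed": ROOT / "data-processed",  # augmented 8x
    "corrupted": ROOT / "data-corrupted",  # invalid labels
    "deleted": ROOT / "data_deleted",      # soft-deleted
}

def dataset_dir(date: str) -> Path:
    """Per-run dataset directory holding the train/valid/test split."""
    return ROOT / "datasets" / f"azaion-{date}"
```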
### Annotation Class System
- 17 base classes (ArmorVehicle, Truck, Vehicle, Artillery, Shadow, Trenches, MilitaryMan, TyreTracks, AdditArmoredTank, Smoke, Plane, Moto, CamouflageNet, CamouflageBranches, Roof, Building, Caponier)
- 3 weather modes: Norm (offset 0), Wint (offset 20), Night (offset 40)
- Total class slots: 80 (17 × 3 = 51 used, 29 reserved)
- Format: YOLO (center_x, center_y, width, height; all normalized 0–1)

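The offset scheme above amounts to simple slot arithmetic; a hypothetical sketch (the helper names are illustrative, not the project's actual API):

```python
# Sketch of the class-ID scheme described above: 17 base classes,
# three weather modes applied as fixed offsets into an 80-slot table.
BASE_CLASSES = [
    "ArmorVehicle", "Truck", "Vehicle", "Artillery", "Shadow",
    "Trenches", "MilitaryMan", "TyreTracks", "AdditArmoredTank",
    "Smoke", "Plane", "Moto", "CamouflageNet", "CamouflageBranches",
    "Roof", "Building", "Caponier",
]
WEATHER_OFFSETS = {"Norm": 0, "Wint": 20, "Night": 40}
TOTAL_SLOTS = 80

def class_id(base_name: str, weather: str) -> int:
    """Map a base class plus weather mode to its slot in the class table."""
    cid = BASE_CLASSES.index(base_name) + WEATHER_OFFSETS[weather]
    assert cid < TOTAL_SLOTS
    return cid

def used_slots() -> int:
    """Occupied slots: 17 classes x 3 weather modes."""
    return len(BASE_CLASSES) * len(WEATHER_OFFSETS)
```

For example, `Truck` in winter mode lands at slot 1 + 20 = 21, and the 80-slot table leaves 29 slots reserved.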
### Model Artifacts

| Format | Use | Export Details |
|--------|-----|----------------|
| `.pt` | Training checkpoint | YOLOv11 PyTorch weights |
| `.onnx` | Cross-platform inference | 1280 px, batch=4, NMS baked in |
| `.engine` | GPU inference (production) | TensorRT FP16, batch=4, per-GPU architecture |
| `.rknn` | Edge inference | RK3588 target (OrangePi5) |

## Integration Points
### Azaion REST API

- `POST /login` → JWT token
- `POST /resources/{folder}` → file upload (Bearer auth)
- `POST /resources/get/{folder}` → encrypted file download (hardware-bound key)

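A hedged sketch of these three calls using `requests`; the base URL, payload field names, and response shape are assumptions, not the actual `api_client.py` implementation:

```python
# Illustrative client for the endpoints listed above. Only the paths
# and the Bearer-token scheme come from the source; everything else
# (host, field names) is a placeholder.
import requests

BASE_URL = "https://api.example.com"  # placeholder host

def login(email: str, password: str) -> str:
    """POST /login -> JWT token (response field name assumed)."""
    r = requests.post(f"{BASE_URL}/login",
                      json={"email": email, "password": password})
    r.raise_for_status()
    return r.json()["token"]

def auth_header(token: str) -> dict:
    """Bearer auth header used by the /resources endpoints."""
    return {"Authorization": f"Bearer {token}"}

def upload_url(folder: str) -> str:
    return f"{BASE_URL}/resources/{folder}"

def download_url(folder: str) -> str:
    return f"{BASE_URL}/resources/get/{folder}"
```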
### S3-compatible CDN

- Upload: model big parts (`upload_fileobj`)
- Download: model big parts (`download_file`)
- Separate read and write access keys

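A sketch of the split-credential CDN access under assumed bucket and key names; only the `upload_fileobj`/`download_file` calls come from the source:

```python
# Illustrative sketch of the CDN side described above. Bucket name,
# object-key layout, and endpoint are assumptions.
def model_big_part_key(model_name: str) -> str:
    """Hypothetical object-key layout for a model's big part."""
    return f"models/{model_name}/big_part.bin"

def make_clients(endpoint: str, read_creds: dict, write_creds: dict):
    """Separate read and write clients, matching the split-credential setup."""
    import boto3  # imported lazily so the key helper works without boto3
    reader = boto3.client("s3", endpoint_url=endpoint, **read_creds)
    writer = boto3.client("s3", endpoint_url=endpoint, **write_creds)
    return reader, writer

# Usage (writer side):
#   with open("big_part.bin", "rb") as f:
#       writer.upload_fileobj(f, "azaion-models", model_big_part_key("yolov11m"))
# Reader side:
#   reader.download_file("azaion-models", model_big_part_key("yolov11m"),
#                        "big_part.bin")
```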
### RabbitMQ Streams

- Queue: `azaion-annotations`
- Protocol: AMQP 1.0 via the rstream library
- Message format: msgpack with positional integer keys
- Offset tracking: persisted to `offset.yaml`

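The positional-integer-key format can be illustrated with a plain mapping step; the field positions and names below are hypothetical, not the real wire format:

```python
# Sketch of decoding a queue message whose msgpack payload uses
# positional integer keys rather than string field names.
FIELD_NAMES = {0: "annotation_id", 1: "image_url", 2: "labels"}

def name_fields(raw: dict) -> dict:
    """Translate positional integer keys into named fields.

    With msgpack, `raw` would come from
    msgpack.unpackb(body, strict_map_key=False).
    """
    return {FIELD_NAMES[k]: v for k, v in raw.items()}
```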
## Non-Functional Requirements (Observed)

| Category | Observation | Source |
|----------|-------------|--------|
| Training duration | ~11.5 days for 360K annotations on 1× RTX 4090 | Code comment in train.py |
| VRAM usage | batch=11 → ~22 GB (batch=12 fails at 24.2 GB) | Code comment in train.py |
| Inference speed | TensorRT: 54 s for a 200 s video (3.7 GB VRAM) | Code comment in start_inference.py |
| ONNX inference | 81 s for the same 200 s video (6.3 GB VRAM) | Code comment in start_inference.py |
| Augmentation ratio | 8× (1 original + 7 augmented per image) | augmentation.py |
| Frame sampling | Every 4th frame during inference | inference/inference.py |

## Security Architecture

| Mechanism | Implementation | Location |
|-----------|----------------|----------|
| API authentication | JWT token (email/password login) | api_client.py |
| Resource encryption | AES-256-CBC (hardware-bound key) | security.py |
| Model encryption | AES-256-CBC (static key) | security.py |
| Split model storage | Small part on API, big part on CDN | api_client.py |
| Hardware fingerprinting | CPU+GPU+RAM+drive serial hash | hardware_service.py |
| CDN access control | Separate read/write S3 credentials | cdn_manager.py |

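The AES-256-CBC mechanism can be sketched with the `cryptography` package named in the tech stack; the hardware-bound key derivation shown here (SHA-256 over concatenated serials) is a hypothetical stand-in for hardware_service.py:

```python
# Minimal AES-256-CBC round trip. Only the cipher and the idea of a
# hardware-derived key come from the source; the derivation itself
# is an illustrative assumption.
import hashlib
import os

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def derive_key(*hardware_ids: str) -> bytes:
    """Hypothetical hardware-bound key: SHA-256 over concatenated serials."""
    return hashlib.sha256("|".join(hardware_ids).encode()).digest()  # 32 bytes

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    """PKCS7-pad, encrypt with a random IV, and prepend the IV."""
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()

def decrypt(key: bytes, blob: bytes) -> bytes:
    """Split off the IV, decrypt, and strip PKCS7 padding."""
    iv, ct = blob[:16], blob[16:]
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(ct) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()
```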
### Security Concerns

- Hardcoded credentials in `config.yaml` and `cdn.yaml`
- Hardcoded model encryption key in `security.py`
- No TLS certificate validation visible in the code
- No input validation on API responses
- Queue credentials in plaintext config files

## Key Architectural Decisions

| Decision | Rationale (inferred) |
|----------|----------------------|
| YOLOv11 medium at 1280 px | Balance between detection quality and training time |
| Split model storage | Prevent model theft from a single storage compromise |
| Hardware-bound API encryption | Tie resource access to authorized machines |
| TensorRT for production inference | ~33% faster than ONNX, ~42% less VRAM |
| Augmentation as a separate process | Decouples data prep from training; runs continuously |
| Annotation queue as a separate service | Independent lifecycle; different dependency set |
| RKNN export for OrangePi5 | Edge deployment on a low-power ARM SoC |