Refactor constants management to use Pydantic BaseModel for configuration

- Replaced module-level path variables in constants.py with a structured Pydantic Config class.
- Updated all relevant modules (train.py, augmentation.py, exports.py, dataset-visualiser.py, manual_run.py) to access paths through the new config structure.
- Fixed bugs related to image processing and model saving.
- Enhanced test infrastructure to accommodate the new configuration approach.

This refactor improves code maintainability and clarity by centralizing configuration management.
Author: Oleksandr Bezdieniezhnykh
Date: 2026-03-27 18:18:30 +02:00
Commit: 142c6c4de8 (parent b68c07b540)
106 changed files with 5706 additions and 654 deletions
# Architecture
## System Context
Azaion AI Training is a Python-based ML pipeline for training, exporting, and deploying YOLOv11 object detection models. The system operates within the Azaion platform ecosystem, consuming annotated image data and producing encrypted inference-ready models.
### Boundaries
| Boundary | Interface | Protocol |
|----------|-----------|----------|
| Azaion REST API | ApiClient | HTTPS (JWT auth) |
| S3-compatible CDN | CDNManager (boto3) | HTTPS (S3 API) |
| RabbitMQ Streams | rstream Consumer | AMQP 1.0 |
| Local filesystem | Direct I/O | POSIX paths at `/azaion/` |
| NVIDIA GPU | PyTorch, TensorRT, ONNX RT, PyCUDA | CUDA 12.1 |
### System Context Diagram
```mermaid
graph LR
subgraph "Azaion Platform"
API[Azaion REST API]
CDN[S3-compatible CDN]
Queue[RabbitMQ Streams]
end
subgraph "AI Training System"
AQ[Annotation Queue Consumer]
AUG[Augmentation Pipeline]
TRAIN[Training Pipeline]
INF[Inference Engine]
end
subgraph "Storage"
FS["/azaion/ filesystem"]
end
subgraph "Hardware"
GPU[NVIDIA GPU]
end
Queue -->|annotation events| AQ
AQ -->|images + labels| FS
FS -->|raw annotations| AUG
AUG -->|augmented data| FS
FS -->|processed dataset| TRAIN
TRAIN -->|trained model| GPU
TRAIN -->|encrypted model| API
TRAIN -->|encrypted model big part| CDN
API -->|encrypted model small part| INF
CDN -->|encrypted model big part| INF
INF -->|inference| GPU
```
## Tech Stack
| Layer | Technology | Version/Detail |
|-------|-----------|---------------|
| Language | Python | 3.10+ (uses `match` statements) |
| ML Framework | Ultralytics YOLO | YOLOv11 medium |
| Deep Learning | PyTorch | 2.3.0 (CUDA 12.1) |
| GPU Inference | TensorRT | FP16/INT8, async CUDA streams |
| GPU Inference (alt) | ONNX Runtime GPU | CUDAExecutionProvider |
| Edge Inference | RKNN | RK3588 (OrangePi5) |
| Augmentation | Albumentations | Geometric + color transforms |
| Computer Vision | OpenCV | Image I/O, preprocessing, display |
| Object Storage | boto3 | S3-compatible CDN |
| Message Queue | rstream | RabbitMQ Streams consumer |
| Serialization | msgpack | Queue message format |
| Encryption | cryptography | AES-256-CBC |
| HTTP Client | requests | REST API communication |
| Configuration | PyYAML | YAML config files |
| Visualization | matplotlib, netron | Annotation display, model graphs |
## Deployment Model
The system runs as multiple independent processes on machines with NVIDIA GPUs:
| Process | Entry Point | Runtime | Typical Host |
|---------|------------|---------|-------------|
| Training | `train.py` | Long-running (days) | GPU server (RTX 4090, 24GB VRAM) |
| Augmentation | `augmentation.py` | Continuous loop (infinite) | Same GPU server or CPU-only |
| Annotation Queue | `annotation-queue/annotation_queue_handler.py` | Continuous (async) | Any server with network access |
| Inference | `start_inference.py` | On-demand | GPU-equipped machine |
| Data Tools | `convert-annotations.py`, `dataset-visualiser.py` | Ad-hoc | Developer machine |
No containerization (no Dockerfile), CI/CD pipeline, or orchestration infrastructure was found in the codebase; deployment appears to be manual.
## Data Model Overview
### Annotation Data Flow
```
Raw annotations (Queue) → /azaion/data-seed/ (unvalidated)
→ /azaion/data/ (validated)
→ /azaion/data-processed/ (augmented, 8×)
→ /azaion/datasets/azaion-{date}/ (train/valid/test split)
→ /azaion/data-corrupted/ (invalid labels)
→ /azaion/data_deleted/ (soft-deleted)
```
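Per the commit message, these directory roots are centralized in a Pydantic model rather than module-level variables. The sketch below illustrates one way that could look; the class name, field names, and property names are assumptions, not the actual contents of `constants.py`:

```python
from pathlib import Path

from pydantic import BaseModel


class PathsConfig(BaseModel):
    """Illustrative centralized path config (names are assumptions)."""

    root: Path = Path("/azaion")

    @property
    def data_seed(self) -> Path:
        # unvalidated annotations straight from the queue
        return self.root / "data-seed"

    @property
    def data(self) -> Path:
        # validated annotations
        return self.root / "data"

    @property
    def data_processed(self) -> Path:
        # augmented output (8x per original image)
        return self.root / "data-processed"

    @property
    def datasets(self) -> Path:
        # dated train/valid/test splits, e.g. azaion-{date}
        return self.root / "datasets"
```

Consumers (`train.py`, `augmentation.py`, etc.) would then import a single config instance instead of scattered path constants, which is the maintainability gain the commit describes.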
### Annotation Class System
- 17 base classes (ArmorVehicle, Truck, Vehicle, Artillery, Shadow, Trenches, MilitaryMan, TyreTracks, AdditArmoredTank, Smoke, Plane, Moto, CamouflageNet, CamouflageBranches, Roof, Building, Caponier)
- 3 weather modes: Norm (offset 0), Wint (offset 20), Night (offset 40)
- Total class slots: 80 (17 × 3 = 51 used, 29 reserved)
- Format: YOLO (center_x, center_y, width, height, all normalized to [0, 1])
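The offset scheme above can be expressed as a small lookup. This is a sketch for illustration; the actual constants live in the project's `constants.py`:

```python
# The 17 base classes in their documented order; weather mode adds an offset.
BASE_CLASSES = [
    "ArmorVehicle", "Truck", "Vehicle", "Artillery", "Shadow", "Trenches",
    "MilitaryMan", "TyreTracks", "AdditArmoredTank", "Smoke", "Plane", "Moto",
    "CamouflageNet", "CamouflageBranches", "Roof", "Building", "Caponier",
]
WEATHER_OFFSETS = {"Norm": 0, "Wint": 20, "Night": 40}


def class_id(base: str, weather: str) -> int:
    """Map a (base class, weather mode) pair onto the 80-slot class space."""
    return WEATHER_OFFSETS[weather] + BASE_CLASSES.index(base)
```

For example, `class_id("Truck", "Night")` is 41 (base index 1 plus the Night offset of 40); slots 17-19, 37-39, and 57-79 are the 29 reserved ones.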
### Model Artifacts
| Format | Use | Export Details |
|--------|-----|---------------|
| `.pt` | Training checkpoint | YOLOv11 PyTorch weights |
| `.onnx` | Cross-platform inference | 1280px, batch=4, NMS baked in |
| `.engine` | GPU inference (production) | TensorRT FP16, batch=4, per-GPU architecture |
| `.rknn` | Edge inference | RK3588 target (OrangePi5) |
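The export settings in the table can be collected into keyword arguments for an exporter such as ultralytics' `YOLO.export()`. The helper below is a hypothetical illustration; the parameter names mirror the ultralytics export API, but only the values stated in this document (1280px, batch=4, NMS, FP16) are taken from the codebase:

```python
def export_kwargs(target: str) -> dict:
    """Collect per-format export settings from the artifacts table (sketch)."""
    common = {"imgsz": 1280, "batch": 4}
    if target == "onnx":
        # NMS baked into the exported graph
        return {"format": "onnx", "nms": True, **common}
    if target == "engine":
        # TensorRT FP16; the resulting .engine is specific to the GPU architecture
        return {"format": "engine", "half": True, **common}
    if target == "rknn":
        # RK3588 target (OrangePi5)
        return {"format": "rknn", **common}
    raise ValueError(f"unknown export target: {target}")
```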
## Integration Points
### Azaion REST API
- `POST /login` → JWT token
- `POST /resources/{folder}` → file upload (Bearer auth)
- `POST /resources/get/{folder}` → encrypted file download (hardware-bound key)
### S3-compatible CDN
- Upload: model big parts (`upload_fileobj`)
- Download: model big parts (`download_file`)
- Separate read/write access keys
### RabbitMQ Streams
- Queue: `azaion-annotations`
- Protocol: AMQP with rstream library
- Message format: msgpack with positional integer keys
- Offset tracking: persisted to `offset.yaml`
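Because the msgpack payloads use positional integer keys, the consumer has to translate them into named fields after unpacking. The field names below are hypothetical placeholders; the real mapping lives in the annotation-queue handler:

```python
# Hypothetical positional-key -> field-name mapping for queue messages.
FIELD_NAMES = {0: "image_id", 1: "image_url", 2: "labels", 3: "deleted"}


def decode_annotation(raw: dict) -> dict:
    """Translate an int-keyed dict (as produced by msgpack.unpackb) into named fields."""
    return {FIELD_NAMES[k]: v for k, v in raw.items() if k in FIELD_NAMES}
```

In the real pipeline the input dict would come from `msgpack.unpackb()` on the message body, after which the handler writes images and labels to `/azaion/data-seed/` and advances the offset persisted in `offset.yaml`.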
## Non-Functional Requirements (Observed)
| Category | Observation | Source |
|----------|------------|--------|
| Training duration | ~11.5 days for 360K annotations on 1× RTX 4090 | Code comment in train.py |
| VRAM usage | batch=11 → ~22GB (batch=12 fails at 24.2GB) | Code comment in train.py |
| Inference speed | TensorRT: 54s for 200s video (3.7GB VRAM) | Code comment in start_inference.py |
| ONNX inference | 81s for 200s video (6.3GB VRAM) | Code comment in start_inference.py |
| Augmentation ratio | 8× (1 original + 7 augmented per image) | augmentation.py |
| Frame sampling | Every 4th frame during inference | inference/inference.py |
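The frame-sampling observation amounts to processing only every 4th frame of the input video. A minimal sketch of that policy:

```python
def sampled_frames(frames, stride: int = 4):
    """Yield every `stride`-th frame (indices 0, 4, 8, ...), as during inference."""
    for i, frame in enumerate(frames):
        if i % stride == 0:
            yield frame
```

In the real engine the frames would come from an OpenCV capture loop; skipping 3 of every 4 frames is a large part of why a 200s video takes well under 200s to process.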
## Security Architecture
| Mechanism | Implementation | Location |
|-----------|---------------|----------|
| API authentication | JWT token (email/password login) | api_client.py |
| Resource encryption | AES-256-CBC (hardware-bound key) | security.py |
| Model encryption | AES-256-CBC (static key) | security.py |
| Split model storage | Small part on API, big part on CDN | api_client.py |
| Hardware fingerprinting | CPU+GPU+RAM+drive serial hash | hardware_service.py |
| CDN access control | Separate read/write S3 credentials | cdn_manager.py |
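Both encryption rows use AES-256-CBC via the `cryptography` package from the tech stack. The sketch below shows the generic pattern (PKCS7 padding, random IV prepended to the ciphertext); the project's actual key derivation (hardware-bound vs. static) and framing are not documented here:

```python
import os

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes


def encrypt_aes256_cbc(key: bytes, plaintext: bytes) -> bytes:
    """AES-256-CBC with PKCS7 padding; the random IV is prepended to the output."""
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()


def decrypt_aes256_cbc(key: bytes, blob: bytes) -> bytes:
    """Reverse of encrypt_aes256_cbc: split off the IV, decrypt, strip padding."""
    iv, ct = blob[:16], blob[16:]
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(ct) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()
```

Note that with a hardware-bound key (the resource-encryption case) the same ciphertext is only recoverable on the machine whose fingerprint produced the key.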
### Security Concerns
- Hardcoded credentials in `config.yaml` and `cdn.yaml`
- Hardcoded model encryption key in `security.py`
- No TLS certificate validation visible in code
- No input validation on API responses
- Queue credentials in plaintext config files
## Key Architectural Decisions
| Decision | Rationale (inferred) |
|----------|---------------------|
| YOLOv11 medium at 1280px | Balance between detection quality and training time |
| Split model storage | Prevent model theft from single storage compromise |
| Hardware-bound API encryption | Tie resource access to authorized machines |
| TensorRT for production inference | ~33% faster than ONNX, ~42% less VRAM |
| Augmentation as separate process | Decouples data prep from training; runs continuously |
| Annotation queue as separate service | Independent lifecycle; different dependency set |
| RKNN export for OrangePi5 | Edge deployment on low-power ARM SoC |
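The split-model-storage decision (small part on the API, big part on the CDN) reduces to partitioning the encrypted blob and reassembling it at inference time. This sketch is purely illustrative; the actual split point and any framing metadata are not documented here, and `small_size` is an arbitrary placeholder:

```python
def split_model(model_bytes: bytes, small_size: int = 4096) -> tuple[bytes, bytes]:
    """Split an encrypted model into a small part (API) and a big part (CDN).

    small_size is a hypothetical placeholder, not the project's real split point.
    """
    return model_bytes[:small_size], model_bytes[small_size:]


def join_model(small: bytes, big: bytes) -> bytes:
    """Reassemble the model before decryption on the inference host."""
    return small + big
```

The rationale in the table follows directly: an attacker who compromises only the CDN (or only the API) holds an incomplete ciphertext.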