Mirror of https://github.com/azaion/ai-training.git, synced 2026-04-22 11:26:36 +00:00
Refactor constants management to use Pydantic BaseModel for configuration

- Replaced module-level path variables in constants.py with a structured Pydantic Config class.
- Updated all relevant modules (train.py, augmentation.py, exports.py, dataset-visualiser.py, manual_run.py) to access paths through the new config structure.
- Fixed bugs related to image processing and model saving.
- Enhanced test infrastructure to accommodate the new configuration approach.

This refactor improves code maintainability and clarity by centralizing configuration management.
# Solution

## Product Solution Description
Azaion AI Training is an ML pipeline for training, exporting, and deploying YOLOv11 object detection models within the Azaion platform ecosystem. The system ingests annotated image data from a RabbitMQ stream, augments it through an Albumentations-based pipeline, trains YOLOv11 models on NVIDIA GPUs, exports them to multiple formats (ONNX, TensorRT, RKNN), and deploys encrypted split-model artifacts to a REST API and S3-compatible CDN for secure distribution.
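The annotation-queue consumer described above persists its stream offset to a file so ingestion can resume where it left off after a restart. A minimal sketch of such an offset store (the class, file layout, and method names are illustrative assumptions, not the repository's actual rstream-based code):

```python
import os
import tempfile


class OffsetStore:
    """File-based offset persistence for a stream consumer.

    Illustrative sketch only: the real consumer is built on the rstream
    client, and its persistence format may differ.
    """

    def __init__(self, path: str):
        self.path = path

    def load(self) -> int:
        # Resume from the last committed offset; start at 0 on first run.
        try:
            with open(self.path) as f:
                return int(f.read().strip())
        except (FileNotFoundError, ValueError):
            return 0

    def store(self, offset: int) -> None:
        # Write to a temp file and rename, so a crash mid-write
        # cannot leave a corrupted offset behind.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            f.write(str(offset))
        os.replace(tmp, self.path)
```

The atomic-rename pattern matters here because the consumer commits offsets continuously; a partially written offset file would otherwise force a replay from zero.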
The pipeline targets aerial/satellite military object detection across 17 base classes with 3 weather modes (Normal, Winter, Night), producing 80 total class slots.
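Note that 17 base classes across 3 weather modes gives 51 distinct (class, mode) pairs, so the 80 reported class slots imply reserved or unused entries. One hypothetical flat-index layout, purely for illustration (the real mapping lives in the training code and is not documented here):

```python
BASE_CLASSES = 17    # annotation classes defined in classes.json
WEATHER_MODES = 3    # Normal, Winter, Night
TOTAL_SLOTS = 80     # class slots reported for the trained model


def slot_index(class_id: int, mode: int) -> int:
    """Map a (base class, weather mode) pair to a flat class slot.

    Hypothetical layout: each mode occupies a contiguous block of 17
    slots. The repository's actual slot assignment may differ.
    """
    if not (0 <= class_id < BASE_CLASSES and 0 <= mode < WEATHER_MODES):
        raise ValueError("class_id or mode out of range")
    return mode * BASE_CLASSES + class_id


# 17 * 3 = 51 slots are addressable this way; under this layout the
# remaining 80 - 51 = 29 slots would be reserved/unused padding.
```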
### Component Interaction
```mermaid
graph LR
    RMQ[RabbitMQ Streams] -->|annotations| AQ[Annotation Queue]
    AQ -->|images + labels| FS[(Filesystem)]
    FS -->|raw data| AUG[Augmentation]
    AUG -->|8× augmented| FS
    FS -->|dataset| TRAIN[Training]
    TRAIN -->|model artifacts| EXP[Export + Encrypt]
    EXP -->|small part| API[Azaion API]
    EXP -->|big part| CDN[S3 CDN]
    API -->|small part| INF[Inference]
    CDN -->|big part| INF
    INF -->|detections| OUT[Video Output]
```
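The Export + Encrypt stage ships each model as two pieces: a small part stored behind the API and a big part on the CDN, so neither store alone holds a usable artifact. A minimal sketch of that split/join idea (the split size and function names are assumptions, not the exporter's actual format):

```python
def split_model(blob: bytes, small_size: int = 4096) -> tuple[bytes, bytes]:
    """Split an encrypted model artifact into a small head and a big tail.

    Sketch of the split-storage pattern only: the real split point and
    on-disk format are not documented here, and 4096 is an arbitrary
    illustrative choice.
    """
    if small_size <= 0 or small_size >= len(blob):
        raise ValueError("split point must fall inside the blob")
    return blob[:small_size], blob[small_size:]


def join_model(small: bytes, big: bytes) -> bytes:
    # Inference downloads both parts, reassembles, then decrypts.
    return small + big
```

The security value of the split comes from distribution, not cryptography: even a leaked CDN object is useless without the API-held head (and the decryption key).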
## Architecture

### Component Solution Table
| Component | Solution | Tools | Advantages | Limitations | Requirements | Security | Cost Indicators | Fitness |
|-----------|----------|-------|------------|-------------|--------------|----------|-----------------|---------|
| Annotation Queue | Async RabbitMQ Streams consumer with role-based routing (Validator→validated, Operator→seed) | rstream, msgpack, asyncio | Decoupled ingestion, independent lifecycle, file-based offset persistence | No reconnect logic on disconnect; single consumer (no scaling) | RabbitMQ with Streams plugin, network access | Credentials in plaintext config | Low (single lightweight process) | Good for current single-server deployment |
| Data Pipeline | Continuous augmentation loop (5-min interval) producing 8× expansion via geometric + color transforms | Albumentations, OpenCV, ThreadPoolExecutor | Robust augmentation variety, parallel per-image processing | Infinite loop with no graceful shutdown; attribute bug in progress logging | Filesystem access to /azaion/data/ and /azaion/data-processed/ | None | CPU-bound, parallelized | Adequate for offline batch augmentation |
| Training | Ultralytics YOLO training with automated dataset formation (70/20/10 split), corrupt label filtering, model export and encrypted upload | Ultralytics (YOLOv11m), PyTorch 2.3.0 CUDA 12.1 | Mature framework, built-in checkpointing (save_period=1), multi-format export | Long training cycles (~11.5 days for 360K annotations); batch=11 near 24GB VRAM limit | NVIDIA GPU (RTX 4090 24GB), CUDA 12.1 | Model encrypted AES-256-CBC before upload; split storage pattern | High (GPU compute, multi-day runs) | Well-suited for periodic retraining |
| Inference | TensorRT (primary) and ONNX Runtime (fallback) engines with async CUDA streams, batch processing, NMS postprocessing | TensorRT, ONNX Runtime, PyCUDA, OpenCV | TensorRT: ~33% faster than ONNX, ~42% less VRAM; batch processing; per-GPU engine compilation | Potential uninitialized batch_size for dynamic shapes; no model caching strategy | NVIDIA GPU with TensorRT support | Hardware-bound decryption key; encrypted model download | Moderate (GPU inference) | Production-ready for GPU servers |
| Security | AES-256-CBC encryption for models and API resources; hardware fingerprinting (CPU+GPU+RAM+drive serial) for machine-bound keys | cryptography library | Split-model storage prevents single-point theft; hardware binding ties access to authorized machines | Hardcoded encryption key; hardcoded credentials in config files; no TLS cert validation | cryptography, pynvml, platform-specific hardware queries | Core security component | Minimal | Functional but needs credential externalization |
| API & CDN | REST API client with JWT auth and S3-compatible CDN for large artifact storage; split-resource upload/download pattern | requests, boto3 | Separation of small/big model parts; auto-relogin on 401/403 | No retry on 500 errors; no connection pooling | Azaion API endpoint, S3-compatible CDN endpoint | JWT tokens, separate read/write CDN keys | Low (network I/O only) | Adequate for current model distribution needs |
| Edge Deployment | RKNN export targeting RK3588 SoC (OrangePi5) with shell-based setup scripts | RKNN toolkit, bash scripts | Low-power edge inference capability | Setup scripts not integrated into main pipeline; no automated deployment | OrangePi5 hardware, RKNN runtime | N/A | Low (edge hardware) | Proof-of-concept stage |
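The auto-relogin-on-401/403 behaviour noted for the API client can be sketched transport-agnostically. Here `send` and `login` are injected callables standing in for the real `requests`-based transport and JWT login, so the retry logic can be shown (and tested) without the network; all names are illustrative assumptions:

```python
class ApiClient:
    """REST client wrapper that re-authenticates once on 401/403.

    Sketch only: the real client uses requests with JWT auth. Note it
    deliberately does NOT retry on 5xx, matching the limitation listed
    in the table above.
    """

    def __init__(self, send, login):
        self._send = send      # (token, request) -> (status, body)
        self._login = login    # () -> fresh JWT token
        self._token = login()

    def request(self, req):
        status, body = self._send(self._token, req)
        if status in (401, 403):
            # Token expired or revoked: log in again and retry once.
            self._token = self._login()
            status, body = self._send(self._token, req)
        return status, body
```

Retrying exactly once keeps the failure mode bounded: a genuinely revoked account fails fast instead of looping on login attempts.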
### Deployment Architecture

The system runs as independent processes without containerization or orchestration:
| Process | Runtime Pattern | Host Requirements |
|---------|-----------------|-------------------|
| Annotation Queue Consumer | Continuous (async event loop) | Network access to RabbitMQ |
| Augmentation Pipeline | Continuous loop (5-min cycle) | CPU cores, filesystem access |
| Training Pipeline | Long-running (days per run) | NVIDIA GPU (24GB VRAM), CUDA 12.1 |
| Inference | On-demand | NVIDIA GPU with TensorRT |
| Data Tools | Ad-hoc manual execution | Developer machine |
No CI/CD pipeline, container definitions, or infrastructure-as-code were found. Deployment is manual.

## Testing Strategy

### Existing Tests
| Test | Type | Coverage |
|------|------|----------|
| `tests/security_test.py` | Script-based | Encrypts a test image, verifies roundtrip decrypt matches original bytes |
| `tests/imagelabel_visualize_test.py` | Script-based | Loads sample annotations with `preprocessing.read_labels` (broken: `preprocessing` module missing) |
### Gaps

- No formal test framework (pytest/unittest) configured
- No integration tests for the training pipeline, augmentation, or inference
- No API client tests (mocked or live)
- No augmentation correctness tests (bounding box transform validation)
- The security test is a standalone script, not runnable via a test runner
- `imagelabel_visualize_test.py` cannot run due to the missing `preprocessing` module
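To show how the bounding-box validation gap could be closed, here is a sketch of a transform-correctness test in pytest style. The `clip_bbox` helper is hypothetical, standing in for the pipeline's real Albumentations-backed clipping step; only the testing pattern is the point:

```python
def clip_bbox(box, eps=1e-6):
    """Clip a normalised YOLO (cx, cy, w, h) box to the unit image.

    Illustrative stand-in for the augmentation pipeline's clipping and
    filtering step; the real implementation may differ.
    """
    cx, cy, w, h = box
    # Convert to corner form, clamp to [0, 1], convert back.
    x1, y1 = max(cx - w / 2, 0.0), max(cy - h / 2, 0.0)
    x2, y2 = min(cx + w / 2, 1.0), min(cy + h / 2, 1.0)
    if x2 - x1 <= eps or y2 - y1 <= eps:
        return None  # degenerate box after clipping: filter it out
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def test_clip_bbox():
    # An in-bounds box passes through unchanged.
    assert clip_bbox((0.5, 0.5, 0.5, 0.5)) == (0.5, 0.5, 0.5, 0.5)
    # A box hanging off the left edge is clipped back inside.
    assert clip_bbox((0.0, 0.5, 0.5, 0.5)) == (0.125, 0.5, 0.25, 0.5)
    # A fully out-of-frame box is dropped entirely.
    assert clip_bbox((1.5, 0.5, 0.2, 0.2)) is None
```

Tests like this would run under pytest once a test framework is configured, directly addressing two of the gaps above.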
### Observed Quality Mechanisms

- Corrupt label detection during dataset formation (coords > 1.0 → moved to /data-corrupted/)
- Bounding box clipping and filtering during augmentation
- Training checkpointing (save_period=1) for crash recovery
- Per-image and per-variant exception handling during augmentation
## References

| Artifact | Path | Purpose |
|----------|------|---------|
| Main config | `config.yaml` | API credentials, queue config, directory paths |
| CDN config | `cdn.yaml` | S3 CDN endpoint and access keys |
| Class definitions | `classes.json` | 17 annotation classes with colors |
| Python dependencies | `requirements.txt` | Main pipeline dependencies |
| Queue dependencies | `annotation-queue/requirements.txt` | Annotation queue service dependencies |
| Edge setup | `orangepi5/*.sh` | OrangePi5 installation and run scripts |
| Training checkpoint | `checkpoint.txt` | Last training run timestamp (2024-06-27) |