Mirror of https://github.com/azaion/ai-training.git (synced 2026-04-22 23:16:36 +00:00)
# Solution

## Product Solution Description

Azaion AI Training is an ML pipeline for training, exporting, and deploying YOLOv11 object detection models within the Azaion platform ecosystem. The system ingests annotated image data from a RabbitMQ stream, augments it through an Albumentations-based pipeline, trains YOLOv11 models on NVIDIA GPUs, exports them to multiple formats (ONNX, TensorRT, RKNN), and deploys encrypted split-model artifacts to a REST API and an S3-compatible CDN for secure distribution.

The pipeline targets aerial/satellite military object detection across 17 base classes with 3 weather modes (Normal, Winter, Night), producing 80 total class slots.

### Component Interaction
```mermaid
graph LR
    RMQ[RabbitMQ Streams] -->|annotations| AQ[Annotation Queue]
    AQ -->|images + labels| FS[(Filesystem)]
    FS -->|raw data| AUG[Augmentation]
    AUG -->|8× augmented| FS
    FS -->|dataset| TRAIN[Training]
    TRAIN -->|model artifacts| EXP[Export + Encrypt]
    EXP -->|small part| API[Azaion API]
    EXP -->|big part| CDN[S3 CDN]
    API -->|small part| INF[Inference]
    CDN -->|big part| INF
    INF -->|detections| OUT[Video Output]
```
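The split-artifact step (small part to the API, big part to the CDN) can be sketched as a pair of pure functions. This is a minimal illustration only: the fixed split offset and the `SMALL_PART_SIZE` constant are assumptions, not the pipeline's actual splitting scheme.

```python
# Sketch: split an encrypted model blob into a "small part" (served by the
# Azaion API) and a "big part" (served by the S3 CDN), then recombine them
# on the inference side. The fixed-offset split is an illustrative assumption.

SMALL_PART_SIZE = 64 * 1024  # assumed size of the API-hosted part

def split_model(blob: bytes, small_size: int = SMALL_PART_SIZE) -> tuple[bytes, bytes]:
    """Split a model blob into (small_part, big_part)."""
    return blob[:small_size], blob[small_size:]

def join_model(small_part: bytes, big_part: bytes) -> bytes:
    """Recombine the two parts before decryption and loading."""
    return small_part + big_part
```

Because both halves are required to reconstruct a loadable model, a leaked CDN object alone is useless, which matches the split-storage rationale in the table below.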
## Architecture

### Component Solution Table

| Component | Solution | Tools | Advantages | Limitations | Requirements | Security | Cost Indicators | Fitness |
|-----------|----------|-------|------------|-------------|--------------|----------|-----------------|---------|
| Annotation Queue | Async RabbitMQ Streams consumer with role-based routing (Validator→validated, Operator→seed) | rstream, msgpack, asyncio | Decoupled ingestion, independent lifecycle, file-based offset persistence | No reconnect logic on disconnect; single consumer (no scaling) | RabbitMQ with Streams plugin, network access | Credentials in plaintext config | Low (single lightweight process) | Good for current single-server deployment |
| Data Pipeline | Continuous augmentation loop (5-min interval) producing 8× expansion via geometric + color transforms | Albumentations, OpenCV, ThreadPoolExecutor | Robust augmentation variety, parallel per-image processing | Infinite loop with no graceful shutdown; attribute bug in progress logging | Filesystem access to /azaion/data/ and /azaion/data-processed/ | None | CPU-bound, parallelized | Adequate for offline batch augmentation |
| Training | Ultralytics YOLO training with automated dataset formation (70/20/10 split), corrupt label filtering, model export and encrypted upload | Ultralytics (YOLOv11m), PyTorch 2.3.0 CUDA 12.1 | Mature framework, built-in checkpointing (save_period=1), multi-format export | Long training cycles (~11.5 days for 360K annotations); batch=11 near 24GB VRAM limit | NVIDIA GPU (RTX 4090 24GB), CUDA 12.1 | Model encrypted AES-256-CBC before upload; split storage pattern | High (GPU compute, multi-day runs) | Well-suited for periodic retraining |
| Inference | TensorRT (primary) and ONNX Runtime (fallback) engines with async CUDA streams, batch processing, NMS postprocessing | TensorRT, ONNX Runtime, PyCUDA, OpenCV | TensorRT: ~33% faster than ONNX, ~42% less VRAM; batch processing; per-GPU engine compilation | Potential uninitialized batch_size for dynamic shapes; no model caching strategy | NVIDIA GPU with TensorRT support | Hardware-bound decryption key; encrypted model download | Moderate (GPU inference) | Production-ready for GPU servers |
| Security | AES-256-CBC encryption for models and API resources; hardware fingerprinting (CPU+GPU+RAM+drive serial) for machine-bound keys | cryptography library | Split-model storage prevents single-point theft; hardware binding ties access to authorized machines | Hardcoded encryption key; hardcoded credentials in config files; no TLS cert validation | cryptography, pynvml, platform-specific hardware queries | Core security component | Minimal | Functional but needs credential externalization |
| API & CDN | REST API client with JWT auth and S3-compatible CDN for large artifact storage; split-resource upload/download pattern | requests, boto3 | Separation of small/big model parts; auto-relogin on 401/403 | No retry on 500 errors; no connection pooling | Azaion API endpoint, S3-compatible CDN endpoint | JWT tokens, separate read/write CDN keys | Low (network I/O only) | Adequate for current model distribution needs |
| Edge Deployment | RKNN export targeting RK3588 SoC (OrangePi5) with shell-based setup scripts | RKNN toolkit, bash scripts | Low-power edge inference capability | Setup scripts not integrated into main pipeline; no automated deployment | OrangePi5 hardware, RKNN runtime | N/A | Low (edge hardware) | Proof-of-concept stage |
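The automated 70/20/10 dataset formation mentioned in the Training row can be sketched as a deterministic partition. The 70/20/10 ratios come from the table; the fixed seed and the `train`/`val`/`test` key names are illustrative assumptions.

```python
import random

def split_dataset(items: list[str], seed: int = 0) -> dict[str, list[str]]:
    """Partition items into train/val/test at roughly 70/20/10.

    A fixed seed keeps the split reproducible across runs; the ratios are
    from the component table, the seed and key names are assumptions.
    """
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)  # shuffle a copy so the input list is untouched
    n_train = int(len(shuffled) * 0.7)
    n_val = int(len(shuffled) * 0.2)
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }
```

Making the test split the remainder (rather than a third `int()` truncation) guarantees every item lands in exactly one partition.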
### Deployment Architecture

The system runs as independent processes without containerization or orchestration:

| Process | Runtime Pattern | Host Requirements |
|---------|-----------------|-------------------|
| Annotation Queue Consumer | Continuous (async event loop) | Network access to RabbitMQ |
| Augmentation Pipeline | Continuous loop (5-min cycle) | CPU cores, filesystem access |
| Training Pipeline | Long-running (days per run) | NVIDIA GPU (24GB VRAM), CUDA 12.1 |
| Inference | On-demand | NVIDIA GPU with TensorRT |
| Data Tools | Ad-hoc manual execution | Developer machine |

No CI/CD pipeline, container definitions, or infrastructure-as-code were found. Deployment is manual.
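The machine-bound key derivation from the Security row (hardware fingerprinting) can be sketched with standard-library calls. The real implementation reportedly mixes in CPU, GPU, RAM, and drive serial via pynvml and platform-specific queries; this stand-in uses only fields reachable from the standard library and is an assumption, not the project's actual fingerprint.

```python
import hashlib
import os
import platform

def hardware_fingerprint() -> bytes:
    """Derive a stable 32-byte fingerprint from host attributes.

    Stand-in fields only (architecture, CPU description, hostname, core
    count); the project's version also binds GPU, RAM, and drive serial.
    The SHA-256 digest is 32 bytes, so it is directly usable as an
    AES-256 key that only reproduces on the same machine.
    """
    fields = [
        platform.machine(),
        platform.processor(),
        platform.node(),
        str(os.cpu_count()),
    ]
    return hashlib.sha256("|".join(fields).encode("utf-8")).digest()
```

Any change to a bound component changes the digest, so a model encrypted under this key cannot be decrypted after copying it to different hardware.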
## Testing Strategy

### Existing Tests

| Test | Type | Coverage |
|------|------|----------|
| `tests/security_test.py` | Script-based | Encrypts a test image, verifies roundtrip decrypt matches original bytes |
| `tests/imagelabel_visualize_test.py` | Script-based | Loads sample annotations with `preprocessing.read_labels` (broken: `preprocessing` module missing) |
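The roundtrip that `tests/security_test.py` performs can be sketched with the `cryptography` library the project already uses. The key and IV below are throwaway values generated per run, deliberately unlike the project's hardcoded key; the helper names are illustrative.

```python
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_aes_cbc(key: bytes, iv: bytes, data: bytes) -> bytes:
    """AES-256-CBC encrypt with PKCS7 padding (key must be 32 bytes)."""
    padder = padding.PKCS7(128).padder()
    padded = padder.update(data) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return enc.update(padded) + enc.finalize()

def decrypt_aes_cbc(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Invert encrypt_aes_cbc: decrypt, then strip the PKCS7 padding."""
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(data) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()

if __name__ == "__main__":
    key, iv = os.urandom(32), os.urandom(16)
    blob = os.urandom(100_000)  # stands in for the test image bytes
    assert decrypt_aes_cbc(key, iv, encrypt_aes_cbc(key, iv, blob)) == blob
```

Wrapped in a `test_` function, this exact roundtrip would also run under pytest, addressing the test-runner gap noted below.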
### Gaps

- No formal test framework (pytest/unittest) configured
- No integration tests for the training pipeline, augmentation, or inference
- No API client tests (mocked or live)
- No augmentation correctness tests (bounding box transform validation)
- Security test is a standalone script, not runnable via a test runner
- `imagelabel_visualize_test.py` cannot run due to the missing `preprocessing` module
### Observed Quality Mechanisms

- Corrupt label detection during dataset formation (coords > 1.0 → moved to /data-corrupted/)
- Bounding box clipping and filtering during augmentation
- Training checkpointing (save_period=1) for crash recovery
- Per-image and per-variant exception handling during augmentation
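The first two mechanisms reduce to coordinate checks on normalized YOLO labels. A minimal sketch, assuming `(class_id, cx, cy, w, h)` rows: the corruption threshold (coords > 1.0) is from the list above, while the minimum-size filter value and function names are illustrative assumptions.

```python
def is_corrupt(label: tuple[float, ...]) -> bool:
    """Flag a YOLO label row whose normalized coords fall outside [0, 1]."""
    _, cx, cy, w, h = label
    return any(v < 0.0 or v > 1.0 for v in (cx, cy, w, h))

def clip_bbox(label: tuple, min_size: float = 0.001):
    """Clip a box to the image frame; drop it if it collapses.

    Mirrors the augmentation-side clipping and filtering: convert
    center/size to corners, clamp to [0, 1], filter degenerate boxes.
    """
    cls, cx, cy, w, h = label
    x1, y1 = max(0.0, cx - w / 2), max(0.0, cy - h / 2)
    x2, y2 = min(1.0, cx + w / 2), min(1.0, cy + h / 2)
    if x2 - x1 < min_size or y2 - y1 < min_size:
        return None  # filtered out, as a transform pushed it off-image
    return (cls, (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)
```

Checks like these would also be the natural seed for the missing bounding-box transform validation tests listed under Gaps.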
## References

| Artifact | Path | Purpose |
|----------|------|---------|
| Main config | `config.yaml` | API credentials, queue config, directory paths |
| CDN config | `cdn.yaml` | S3 CDN endpoint and access keys |
| Class definitions | `classes.json` | 17 annotation classes with colors |
| Python dependencies | `requirements.txt` | Main pipeline dependencies |
| Queue dependencies | `annotation-queue/requirements.txt` | Annotation queue service dependencies |
| Edge setup | `orangepi5/*.sh` | OrangePi5 installation and run scripts |
| Training checkpoint | `checkpoint.txt` | Last training run timestamp (2024-06-27) |