Mirror of https://github.com/azaion/ai-training.git, synced 2026-04-22 11:26:36 +00:00
Refactor constants management to use Pydantic BaseModel for configuration

- Replaced module-level path variables in constants.py with a structured Pydantic Config class.
- Updated all relevant modules (train.py, augmentation.py, exports.py, dataset-visualiser.py, manual_run.py) to access paths through the new config structure.
- Fixed bugs related to image processing and model saving.
- Enhanced test infrastructure to accommodate the new configuration approach.

This refactor improves code maintainability and clarity by centralizing configuration management.
# Solution

## Product Solution Description
Azaion AI Training is an ML pipeline for training, exporting, and deploying YOLOv11 object detection models within the Azaion platform ecosystem. The system ingests annotated image data from a RabbitMQ stream, augments it through an Albumentations-based pipeline, trains YOLOv11 models on NVIDIA GPUs, exports them to multiple formats (ONNX, TensorRT, RKNN), and deploys encrypted split-model artifacts to a REST API and S3-compatible CDN for secure distribution.
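The annotation-queue consumer described above persists its stream offset to a file so ingestion can resume where it left off after a restart. A minimal sketch of such an offset store (the class, file layout, and method names are illustrative assumptions, not the repository's actual rstream-based code):

```python
import os
import tempfile


class OffsetStore:
    """File-based offset persistence for a stream consumer.

    Illustrative sketch only: the real consumer is built on the rstream
    client, and its persistence format may differ.
    """

    def __init__(self, path: str):
        self.path = path

    def load(self) -> int:
        # Resume from the last committed offset; start at 0 on first run.
        try:
            with open(self.path) as f:
                return int(f.read().strip())
        except (FileNotFoundError, ValueError):
            return 0

    def store(self, offset: int) -> None:
        # Write to a temp file and rename, so a crash mid-write
        # cannot leave a corrupted offset behind.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            f.write(str(offset))
        os.replace(tmp, self.path)
```

The atomic-rename pattern matters here because the consumer commits offsets continuously; a partially written offset file would otherwise force a replay from zero.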
The pipeline targets aerial/satellite military object detection across 17 base classes with 3 weather modes (Normal, Winter, Night), producing 80 total class slots.
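Note that 17 base classes across 3 weather modes gives 51 distinct (class, mode) pairs, so the 80 reported class slots imply reserved or unused entries. One hypothetical flat-index layout, purely for illustration (the real mapping lives in the training code and is not documented here):

```python
BASE_CLASSES = 17    # annotation classes defined in classes.json
WEATHER_MODES = 3    # Normal, Winter, Night
TOTAL_SLOTS = 80     # class slots reported for the trained model


def slot_index(class_id: int, mode: int) -> int:
    """Map a (base class, weather mode) pair to a flat class slot.

    Hypothetical layout: each mode occupies a contiguous block of 17
    slots. The repository's actual slot assignment may differ.
    """
    if not (0 <= class_id < BASE_CLASSES and 0 <= mode < WEATHER_MODES):
        raise ValueError("class_id or mode out of range")
    return mode * BASE_CLASSES + class_id


# 17 * 3 = 51 slots are addressable this way; under this layout the
# remaining 80 - 51 = 29 slots would be reserved/unused padding.
```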
### Component Interaction
```mermaid
graph LR
    RMQ[RabbitMQ Streams] -->|annotations| AQ[Annotation Queue]
    AQ -->|images + labels| FS[(Filesystem)]
    FS -->|raw data| AUG[Augmentation]
    AUG -->|8× augmented| FS
    FS -->|dataset| TRAIN[Training]
    TRAIN -->|model artifacts| EXP[Export + Encrypt]
    EXP -->|small part| API[Azaion API]
    EXP -->|big part| CDN[S3 CDN]
    API -->|small part| INF[Inference]
    CDN -->|big part| INF
    INF -->|detections| OUT[Video Output]
```
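The Export + Encrypt stage ships each model as two pieces: a small part stored behind the API and a big part on the CDN, so neither store alone holds a usable artifact. A minimal sketch of that split/join idea (the split size and function names are assumptions, not the exporter's actual format):

```python
def split_model(blob: bytes, small_size: int = 4096) -> tuple[bytes, bytes]:
    """Split an encrypted model artifact into a small head and a big tail.

    Sketch of the split-storage pattern only: the real split point and
    on-disk format are not documented here, and 4096 is an arbitrary
    illustrative choice.
    """
    if small_size <= 0 or small_size >= len(blob):
        raise ValueError("split point must fall inside the blob")
    return blob[:small_size], blob[small_size:]


def join_model(small: bytes, big: bytes) -> bytes:
    # Inference downloads both parts, reassembles, then decrypts.
    return small + big
```

The security value of the split comes from distribution, not cryptography: even a leaked CDN object is useless without the API-held head (and the decryption key).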
## Architecture

### Component Solution Table
| Component | Solution | Tools | Advantages | Limitations | Requirements | Security | Cost Indicators | Fitness |
|-----------|----------|-------|------------|-------------|--------------|----------|-----------------|---------|
| Annotation Queue | Async RabbitMQ Streams consumer with role-based routing (Validator→validated, Operator→seed) | rstream, msgpack, asyncio | Decoupled ingestion, independent lifecycle, file-based offset persistence | No reconnect logic on disconnect; single consumer (no scaling) | RabbitMQ with Streams plugin, network access | Credentials in plaintext config | Low (single lightweight process) | Good for current single-server deployment |
| Data Pipeline | Continuous augmentation loop (5-min interval) producing 8× expansion via geometric + color transforms | Albumentations, OpenCV, ThreadPoolExecutor | Robust augmentation variety, parallel per-image processing | Infinite loop with no graceful shutdown; attribute bug in progress logging | Filesystem access to /azaion/data/ and /azaion/data-processed/ | None | CPU-bound, parallelized | Adequate for offline batch augmentation |
| Training | Ultralytics YOLO training with automated dataset formation (70/20/10 split), corrupt label filtering, model export and encrypted upload | Ultralytics (YOLOv11m), PyTorch 2.3.0 CUDA 12.1 | Mature framework, built-in checkpointing (save_period=1), multi-format export | Long training cycles (~11.5 days for 360K annotations); batch=11 near 24GB VRAM limit | NVIDIA GPU (RTX 4090 24GB), CUDA 12.1 | Model encrypted AES-256-CBC before upload; split storage pattern | High (GPU compute, multi-day runs) | Well-suited for periodic retraining |
| Inference | TensorRT (primary) and ONNX Runtime (fallback) engines with async CUDA streams, batch processing, NMS postprocessing | TensorRT, ONNX Runtime, PyCUDA, OpenCV | TensorRT: ~33% faster than ONNX, ~42% less VRAM; batch processing; per-GPU engine compilation | Potential uninitialized batch_size for dynamic shapes; no model caching strategy | NVIDIA GPU with TensorRT support | Hardware-bound decryption key; encrypted model download | Moderate (GPU inference) | Production-ready for GPU servers |
| Security | AES-256-CBC encryption for models and API resources; hardware fingerprinting (CPU+GPU+RAM+drive serial) for machine-bound keys | cryptography library | Split-model storage prevents single-point theft; hardware binding ties access to authorized machines | Hardcoded encryption key; hardcoded credentials in config files; no TLS cert validation | cryptography, pynvml, platform-specific hardware queries | Core security component | Minimal | Functional but needs credential externalization |
| API & CDN | REST API client with JWT auth and S3-compatible CDN for large artifact storage; split-resource upload/download pattern | requests, boto3 | Separation of small/big model parts; auto-relogin on 401/403 | No retry on 500 errors; no connection pooling | Azaion API endpoint, S3-compatible CDN endpoint | JWT tokens, separate read/write CDN keys | Low (network I/O only) | Adequate for current model distribution needs |
| Edge Deployment | RKNN export targeting RK3588 SoC (OrangePi5) with shell-based setup scripts | RKNN toolkit, bash scripts | Low-power edge inference capability | Setup scripts not integrated into main pipeline; no automated deployment | OrangePi5 hardware, RKNN runtime | N/A | Low (edge hardware) | Proof-of-concept stage |
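The auto-relogin-on-401/403 behaviour noted for the API client can be sketched transport-agnostically. Here `send` and `login` are injected callables standing in for the real `requests`-based transport and JWT login, so the retry logic can be shown (and tested) without the network; all names are illustrative assumptions:

```python
class ApiClient:
    """REST client wrapper that re-authenticates once on 401/403.

    Sketch only: the real client uses requests with JWT auth. Note it
    deliberately does NOT retry on 5xx, matching the limitation listed
    in the table above.
    """

    def __init__(self, send, login):
        self._send = send      # (token, request) -> (status, body)
        self._login = login    # () -> fresh JWT token
        self._token = login()

    def request(self, req):
        status, body = self._send(self._token, req)
        if status in (401, 403):
            # Token expired or revoked: log in again and retry once.
            self._token = self._login()
            status, body = self._send(self._token, req)
        return status, body
```

Retrying exactly once keeps the failure mode bounded: a genuinely revoked account fails fast instead of looping on login attempts.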
### Deployment Architecture

The system runs as independent processes without containerization or orchestration:
| Process | Runtime Pattern | Host Requirements |
|---------|-----------------|-------------------|
| Annotation Queue Consumer | Continuous (async event loop) | Network access to RabbitMQ |
| Augmentation Pipeline | Continuous loop (5-min cycle) | CPU cores, filesystem access |
| Training Pipeline | Long-running (days per run) | NVIDIA GPU (24GB VRAM), CUDA 12.1 |
| Inference | On-demand | NVIDIA GPU with TensorRT |
| Data Tools | Ad-hoc manual execution | Developer machine |
No CI/CD pipeline, container definitions, or infrastructure-as-code were found. Deployment is manual.

## Testing Strategy

### Existing Tests
| Test | Type | Coverage |
|------|------|----------|
| `tests/security_test.py` | Script-based | Encrypts a test image, verifies roundtrip decrypt matches original bytes |
| `tests/imagelabel_visualize_test.py` | Script-based | Loads sample annotations with `preprocessing.read_labels` (broken: `preprocessing` module missing) |
### Gaps

- No formal test framework (pytest/unittest) configured
- No integration tests for the training pipeline, augmentation, or inference
- No API client tests (mocked or live)
- No augmentation correctness tests (bounding box transform validation)
- The security test is a standalone script, not runnable via a test runner
- `imagelabel_visualize_test.py` cannot run due to the missing `preprocessing` module
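To show how the bounding-box validation gap could be closed, here is a sketch of a transform-correctness test in pytest style. The `clip_bbox` helper is hypothetical, standing in for the pipeline's real Albumentations-backed clipping step; only the testing pattern is the point:

```python
def clip_bbox(box, eps=1e-6):
    """Clip a normalised YOLO (cx, cy, w, h) box to the unit image.

    Illustrative stand-in for the augmentation pipeline's clipping and
    filtering step; the real implementation may differ.
    """
    cx, cy, w, h = box
    # Convert to corner form, clamp to [0, 1], convert back.
    x1, y1 = max(cx - w / 2, 0.0), max(cy - h / 2, 0.0)
    x2, y2 = min(cx + w / 2, 1.0), min(cy + h / 2, 1.0)
    if x2 - x1 <= eps or y2 - y1 <= eps:
        return None  # degenerate box after clipping: filter it out
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def test_clip_bbox():
    # An in-bounds box passes through unchanged.
    assert clip_bbox((0.5, 0.5, 0.5, 0.5)) == (0.5, 0.5, 0.5, 0.5)
    # A box hanging off the left edge is clipped back inside.
    assert clip_bbox((0.0, 0.5, 0.5, 0.5)) == (0.125, 0.5, 0.25, 0.5)
    # A fully out-of-frame box is dropped entirely.
    assert clip_bbox((1.5, 0.5, 0.2, 0.2)) is None
```

Tests like this would run under pytest once a test framework is configured, directly addressing two of the gaps above.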
### Observed Quality Mechanisms

- Corrupt label detection during dataset formation (coords > 1.0 → moved to /data-corrupted/)
- Bounding box clipping and filtering during augmentation
- Training checkpointing (save_period=1) for crash recovery
- Per-image and per-variant exception handling during augmentation
## References

| Artifact | Path | Purpose |
|----------|------|---------|
| Main config | `config.yaml` | API credentials, queue config, directory paths |
| CDN config | `cdn.yaml` | S3 CDN endpoint and access keys |
| Class definitions | `classes.json` | 17 annotation classes with colors |
| Python dependencies | `requirements.txt` | Main pipeline dependencies |
| Queue dependencies | `annotation-queue/requirements.txt` | Annotation queue service dependencies |
| Edge setup | `orangepi5/*.sh` | OrangePi5 installation and run scripts |
| Training checkpoint | `checkpoint.txt` | Last training run timestamp (2024-06-27) |