Mirror of https://github.com/azaion/ai-training.git (synced 2026-04-22 23:16:36 +00:00)
# Solution

## Product Solution Description

Azaion AI Training is an ML pipeline for training, exporting, and deploying YOLOv11 object detection models within the Azaion platform ecosystem. The system ingests annotated image data from a RabbitMQ stream, augments it through an Albumentations-based pipeline, trains YOLOv11 models on NVIDIA GPUs, exports them to multiple formats (ONNX, TensorRT, RKNN), and deploys encrypted split-model artifacts to a REST API and an S3-compatible CDN for secure distribution.

The pipeline targets aerial/satellite military object detection across 17 base classes with 3 weather modes (Normal, Winter, Night), producing 80 total class slots.

### Component Interaction
```mermaid
graph LR
    RMQ[RabbitMQ Streams] -->|annotations| AQ[Annotation Queue]
    AQ -->|images + labels| FS[(Filesystem)]
    FS -->|raw data| AUG[Augmentation]
    AUG -->|8× augmented| FS
    FS -->|dataset| TRAIN[Training]
    TRAIN -->|model artifacts| EXP[Export + Encrypt]
    EXP -->|small part| API[Azaion API]
    EXP -->|big part| CDN[S3 CDN]
    API -->|small part| INF[Inference]
    CDN -->|big part| INF
    INF -->|detections| OUT[Video Output]
```
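The split-artifact step (small part to the API, big part to the CDN) can be sketched as a pair of pure functions. This is a minimal illustration only: the fixed split offset and the `SMALL_PART_SIZE` constant are assumptions, not the pipeline's actual splitting scheme.

```python
# Sketch: split an encrypted model blob into a "small part" (served by the
# Azaion API) and a "big part" (served by the S3 CDN), then recombine them
# on the inference side. The fixed-offset split is an illustrative assumption.

SMALL_PART_SIZE = 64 * 1024  # assumed size of the API-hosted part

def split_model(blob: bytes, small_size: int = SMALL_PART_SIZE) -> tuple[bytes, bytes]:
    """Split a model blob into (small_part, big_part)."""
    return blob[:small_size], blob[small_size:]

def join_model(small_part: bytes, big_part: bytes) -> bytes:
    """Recombine the two parts before decryption and loading."""
    return small_part + big_part
```

Because both halves are required to reconstruct a loadable model, a leaked CDN object alone is useless, which matches the split-storage rationale in the table below.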
## Architecture

### Component Solution Table

| Component | Solution | Tools | Advantages | Limitations | Requirements | Security | Cost Indicators | Fitness |
|-----------|----------|-------|------------|-------------|--------------|----------|-----------------|---------|
| Annotation Queue | Async RabbitMQ Streams consumer with role-based routing (Validator→validated, Operator→seed) | rstream, msgpack, asyncio | Decoupled ingestion, independent lifecycle, file-based offset persistence | No reconnect logic on disconnect; single consumer (no scaling) | RabbitMQ with Streams plugin, network access | Credentials in plaintext config | Low (single lightweight process) | Good for current single-server deployment |
| Data Pipeline | Continuous augmentation loop (5-min interval) producing 8× expansion via geometric + color transforms | Albumentations, OpenCV, ThreadPoolExecutor | Robust augmentation variety, parallel per-image processing | Infinite loop with no graceful shutdown; attribute bug in progress logging | Filesystem access to /azaion/data/ and /azaion/data-processed/ | None | CPU-bound, parallelized | Adequate for offline batch augmentation |
| Training | Ultralytics YOLO training with automated dataset formation (70/20/10 split), corrupt label filtering, model export and encrypted upload | Ultralytics (YOLOv11m), PyTorch 2.3.0 CUDA 12.1 | Mature framework, built-in checkpointing (save_period=1), multi-format export | Long training cycles (~11.5 days for 360K annotations); batch=11 near 24GB VRAM limit | NVIDIA GPU (RTX 4090 24GB), CUDA 12.1 | Model encrypted AES-256-CBC before upload; split storage pattern | High (GPU compute, multi-day runs) | Well-suited for periodic retraining |
| Inference | TensorRT (primary) and ONNX Runtime (fallback) engines with async CUDA streams, batch processing, NMS postprocessing | TensorRT, ONNX Runtime, PyCUDA, OpenCV | TensorRT: ~33% faster than ONNX, ~42% less VRAM; batch processing; per-GPU engine compilation | Potential uninitialized batch_size for dynamic shapes; no model caching strategy | NVIDIA GPU with TensorRT support | Hardware-bound decryption key; encrypted model download | Moderate (GPU inference) | Production-ready for GPU servers |
| Security | AES-256-CBC encryption for models and API resources; hardware fingerprinting (CPU+GPU+RAM+drive serial) for machine-bound keys | cryptography library | Split-model storage prevents single-point theft; hardware binding ties access to authorized machines | Hardcoded encryption key; hardcoded credentials in config files; no TLS cert validation | cryptography, pynvml, platform-specific hardware queries | Core security component | Minimal | Functional but needs credential externalization |
| API & CDN | REST API client with JWT auth and S3-compatible CDN for large artifact storage; split-resource upload/download pattern | requests, boto3 | Separation of small/big model parts; auto-relogin on 401/403 | No retry on 500 errors; no connection pooling | Azaion API endpoint, S3-compatible CDN endpoint | JWT tokens, separate read/write CDN keys | Low (network I/O only) | Adequate for current model distribution needs |
| Edge Deployment | RKNN export targeting RK3588 SoC (OrangePi5) with shell-based setup scripts | RKNN toolkit, bash scripts | Low-power edge inference capability | Setup scripts not integrated into main pipeline; no automated deployment | OrangePi5 hardware, RKNN runtime | N/A | Low (edge hardware) | Proof-of-concept stage |
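The automated 70/20/10 dataset formation mentioned in the Training row can be sketched as a deterministic partition. The 70/20/10 ratios come from the table; the fixed seed and the `train`/`val`/`test` key names are illustrative assumptions.

```python
import random

def split_dataset(items: list[str], seed: int = 0) -> dict[str, list[str]]:
    """Partition items into train/val/test at roughly 70/20/10.

    A fixed seed keeps the split reproducible across runs; the ratios are
    from the component table, the seed and key names are assumptions.
    """
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)  # shuffle a copy so the input list is untouched
    n_train = int(len(shuffled) * 0.7)
    n_val = int(len(shuffled) * 0.2)
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }
```

Making the test split the remainder (rather than a third `int()` truncation) guarantees every item lands in exactly one partition.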
### Deployment Architecture

The system runs as independent processes without containerization or orchestration:

| Process | Runtime Pattern | Host Requirements |
|---------|-----------------|-------------------|
| Annotation Queue Consumer | Continuous (async event loop) | Network access to RabbitMQ |
| Augmentation Pipeline | Continuous loop (5-min cycle) | CPU cores, filesystem access |
| Training Pipeline | Long-running (days per run) | NVIDIA GPU (24GB VRAM), CUDA 12.1 |
| Inference | On-demand | NVIDIA GPU with TensorRT |
| Data Tools | Ad-hoc manual execution | Developer machine |

No CI/CD pipeline, container definitions, or infrastructure-as-code were found. Deployment is manual.
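The machine-bound key derivation from the Security row (hardware fingerprinting) can be sketched with standard-library calls. The real implementation reportedly mixes in CPU, GPU, RAM, and drive serial via pynvml and platform-specific queries; this stand-in uses only fields reachable from the standard library and is an assumption, not the project's actual fingerprint.

```python
import hashlib
import os
import platform

def hardware_fingerprint() -> bytes:
    """Derive a stable 32-byte fingerprint from host attributes.

    Stand-in fields only (architecture, CPU description, hostname, core
    count); the project's version also binds GPU, RAM, and drive serial.
    The SHA-256 digest is 32 bytes, so it is directly usable as an
    AES-256 key that only reproduces on the same machine.
    """
    fields = [
        platform.machine(),
        platform.processor(),
        platform.node(),
        str(os.cpu_count()),
    ]
    return hashlib.sha256("|".join(fields).encode("utf-8")).digest()
```

Any change to a bound component changes the digest, so a model encrypted under this key cannot be decrypted after copying it to different hardware.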
## Testing Strategy

### Existing Tests

| Test | Type | Coverage |
|------|------|----------|
| `tests/security_test.py` | Script-based | Encrypts a test image, verifies roundtrip decrypt matches original bytes |
| `tests/imagelabel_visualize_test.py` | Script-based | Loads sample annotations with `preprocessing.read_labels` (broken: `preprocessing` module missing) |
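The roundtrip that `tests/security_test.py` performs can be sketched with the `cryptography` library the project already uses. The key and IV below are throwaway values generated per run, deliberately unlike the project's hardcoded key; the helper names are illustrative.

```python
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_aes_cbc(key: bytes, iv: bytes, data: bytes) -> bytes:
    """AES-256-CBC encrypt with PKCS7 padding (key must be 32 bytes)."""
    padder = padding.PKCS7(128).padder()
    padded = padder.update(data) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return enc.update(padded) + enc.finalize()

def decrypt_aes_cbc(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Invert encrypt_aes_cbc: decrypt, then strip the PKCS7 padding."""
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(data) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()

if __name__ == "__main__":
    key, iv = os.urandom(32), os.urandom(16)
    blob = os.urandom(100_000)  # stands in for the test image bytes
    assert decrypt_aes_cbc(key, iv, encrypt_aes_cbc(key, iv, blob)) == blob
```

Wrapped in a `test_` function, this exact roundtrip would also run under pytest, addressing the test-runner gap noted below.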
### Gaps

- No formal test framework (pytest/unittest) configured
- No integration tests for the training pipeline, augmentation, or inference
- No API client tests (mocked or live)
- No augmentation correctness tests (bounding box transform validation)
- Security test is a standalone script, not runnable via a test runner
- `imagelabel_visualize_test.py` cannot run due to the missing `preprocessing` module
### Observed Quality Mechanisms

- Corrupt label detection during dataset formation (coords > 1.0 → moved to /data-corrupted/)
- Bounding box clipping and filtering during augmentation
- Training checkpointing (save_period=1) for crash recovery
- Per-image and per-variant exception handling during augmentation
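The first two mechanisms reduce to coordinate checks on normalized YOLO labels. A minimal sketch, assuming `(class_id, cx, cy, w, h)` rows: the corruption threshold (coords > 1.0) is from the list above, while the minimum-size filter value and function names are illustrative assumptions.

```python
def is_corrupt(label: tuple[float, ...]) -> bool:
    """Flag a YOLO label row whose normalized coords fall outside [0, 1]."""
    _, cx, cy, w, h = label
    return any(v < 0.0 or v > 1.0 for v in (cx, cy, w, h))

def clip_bbox(label: tuple, min_size: float = 0.001):
    """Clip a box to the image frame; drop it if it collapses.

    Mirrors the augmentation-side clipping and filtering: convert
    center/size to corners, clamp to [0, 1], filter degenerate boxes.
    """
    cls, cx, cy, w, h = label
    x1, y1 = max(0.0, cx - w / 2), max(0.0, cy - h / 2)
    x2, y2 = min(1.0, cx + w / 2), min(1.0, cy + h / 2)
    if x2 - x1 < min_size or y2 - y1 < min_size:
        return None  # filtered out, as a transform pushed it off-image
    return (cls, (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)
```

Checks like these would also be the natural seed for the missing bounding-box transform validation tests listed under Gaps.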
## References

| Artifact | Path | Purpose |
|----------|------|---------|
| Main config | `config.yaml` | API credentials, queue config, directory paths |
| CDN config | `cdn.yaml` | S3 CDN endpoint and access keys |
| Class definitions | `classes.json` | 17 annotation classes with colors |
| Python dependencies | `requirements.txt` | Main pipeline dependencies |
| Queue dependencies | `annotation-queue/requirements.txt` | Annotation queue service dependencies |
| Edge setup | `orangepi5/*.sh` | OrangePi5 installation and run scripts |
| Training checkpoint | `checkpoint.txt` | Last training run timestamp (2024-06-27) |