Mirror of https://github.com/azaion/ai-training.git, synced 2026-04-22 21:56:36 +00:00 (commit 142c6c4de8).

# System Flows

## Flow 1: Annotation Ingestion (Annotation Queue → Filesystem)

```mermaid
sequenceDiagram
    participant RMQ as RabbitMQ Streams
    participant AQH as AnnotationQueueHandler
    participant FS as Filesystem

    RMQ->>AQH: AMQP message (msgpack)
    AQH->>AQH: Decode message, read AnnotationStatus

    alt Created / Edited
        AQH->>AQH: Parse AnnotationMessage (image + detections)
        alt Validator / Admin role
            AQH->>FS: Write label → /data/labels/{name}.txt
            AQH->>FS: Write image → /data/images/{name}.jpg
        else Operator role
            AQH->>FS: Write label → /data-seed/labels/{name}.txt
            AQH->>FS: Write image → /data-seed/images/{name}.jpg
        end
    else Validated (bulk)
        AQH->>FS: Move images+labels from /data-seed/ → /data/
    else Deleted (bulk)
        AQH->>FS: Move images+labels → /data_deleted/
    end

    AQH->>FS: Persist offset to offset.yaml
```

### Data Flow Table

| Step | Input | Output | Component |
|------|-------|--------|-----------|
| Receive | AMQP message (msgpack) | AnnotationMessage / AnnotationBulkMessage | Annotation Queue |
| Route | AnnotationStatus header | Dispatch to save/validate/delete | Annotation Queue |
| Save | Image bytes + detection JSON | .jpg + .txt files on disk | Annotation Queue |
| Track | Message context offset | offset.yaml | Annotation Queue |

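The routing step above can be sketched as follows. This is a minimal illustration of the dispatch on `AnnotationStatus` and the role-based choice between `/data/` and `/data-seed/`; the enum values, role strings, and `handle` signature are assumptions for the sketch, not the real `AnnotationQueueHandler` API.

```python
from enum import Enum
from pathlib import Path

# Hypothetical status enum; the real values come from the decoded msgpack header.
class AnnotationStatus(Enum):
    CREATED = "created"
    EDITED = "edited"
    VALIDATED = "validated"
    DELETED = "deleted"

DATA = Path("/data")            # validator/admin annotations land here directly
DATA_SEED = Path("/data-seed")  # operator annotations await validation here

def target_root(role: str) -> Path:
    """Validator/Admin writes go straight to /data; Operator to /data-seed."""
    return DATA if role in ("validator", "admin") else DATA_SEED

def handle(status: AnnotationStatus, role: str, name: str) -> list[str]:
    """Return the filesystem actions for one decoded message (illustrative)."""
    if status in (AnnotationStatus.CREATED, AnnotationStatus.EDITED):
        root = target_root(role)
        return [f"write {root}/labels/{name}.txt",
                f"write {root}/images/{name}.jpg"]
    if status is AnnotationStatus.VALIDATED:
        return [f"move /data-seed/*/{name}.* -> /data/"]
    return [f"move */{name}.* -> /data_deleted/"]
```

After each message is handled, the real handler persists the stream offset to `offset.yaml` so processing resumes where it left off.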
---
## Flow 2: Data Augmentation

```mermaid
sequenceDiagram
    participant FS as Filesystem (/azaion/data/)
    participant AUG as Augmentator
    participant PFS as Filesystem (/azaion/data-processed/)

    loop Every 5 minutes
        AUG->>FS: Scan /data/images/ for unprocessed files
        AUG->>AUG: Filter out already-processed images
        loop Each unprocessed image (parallel)
            AUG->>FS: Read image + labels
            AUG->>AUG: Correct bounding boxes (clip + filter)
            AUG->>AUG: Generate 7 augmented variants
            AUG->>PFS: Write 8 images (original + 7 augmented)
            AUG->>PFS: Write 8 label files
        end
        AUG->>AUG: Sleep 5 minutes
    end
```

---
## Flow 3: Training Pipeline

```mermaid
sequenceDiagram
    participant PFS as Filesystem (/data-processed/)
    participant TRAIN as train.py
    participant DS as Filesystem (/datasets/)
    participant YOLO as Ultralytics YOLO
    participant API as Azaion API
    participant CDN as S3 CDN

    TRAIN->>PFS: Read all processed images
    TRAIN->>TRAIN: Shuffle, split 70/20/10
    TRAIN->>DS: Copy to train/valid/test folders
    Note over TRAIN: Corrupted labels → /data-corrupted/

    TRAIN->>TRAIN: Generate data.yaml (80 class names)
    TRAIN->>YOLO: Train yolo11m (120 epochs, batch=11, 1280px)
    YOLO-->>TRAIN: Training results + best.pt

    TRAIN->>DS: Copy results to /models/{date}/
    TRAIN->>TRAIN: Copy best.pt → /models/azaion.pt

    TRAIN->>TRAIN: Export .pt → .onnx (1280px, batch=4)
    TRAIN->>TRAIN: Read azaion.onnx bytes
    TRAIN->>TRAIN: Encrypt with model key (AES-256-CBC)
    TRAIN->>TRAIN: Split: small (≤3KB or 20%) + big (rest)
    TRAIN->>API: Upload azaion.onnx.small
    TRAIN->>CDN: Upload azaion.onnx.big
```

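The final split step can be sketched as below. The diagram says "small (≤3KB or 20%)"; this sketch reads that as "whichever cut is smaller", which is one plausible interpretation, and it operates on the already-encrypted bytes (the AES-256-CBC step is out of scope here). Function names and defaults are illustrative.

```python
def split_model(blob: bytes, cap: int = 3 * 1024, frac: float = 0.20) -> tuple[bytes, bytes]:
    """Split an encrypted model into a 'small' head (served via the API)
    and a 'big' tail (served via the CDN). The cut point is the smaller
    of `cap` bytes and `frac` of the file, per one reading of the spec."""
    cut = min(cap, int(len(blob) * frac))
    return blob[:cut], blob[cut:]

def reassemble(small: bytes, big: bytes) -> bytes:
    """Inverse operation, used on the inference side (Flow 4)."""
    return small + big
```

Keeping the small part behind the authenticated API means the publicly cached CDN object is useless on its own, even before decryption.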
---
## Flow 4: Model Download & Inference

```mermaid
sequenceDiagram
    participant INF as start_inference.py
    participant API as Azaion API
    participant CDN as S3 CDN
    participant SEC as Security
    participant TRT as TensorRTEngine
    participant VID as Video File
    participant GUI as OpenCV Window

    INF->>INF: Determine GPU-specific engine filename
    INF->>SEC: Get model encryption key

    INF->>API: Login (JWT)
    INF->>API: Download {engine}.small (encrypted)
    INF->>INF: Read {engine}.big from local disk
    INF->>INF: Reassemble: small + big
    INF->>SEC: Decrypt (AES-256-CBC)

    INF->>TRT: Initialize engine from bytes
    TRT->>TRT: Allocate CUDA memory (input + output)

    loop Video frames
        INF->>VID: Read frame (every 4th)
        INF->>INF: Batch frames to batch_size

        INF->>TRT: Preprocess (blob, normalize, resize)
        TRT->>TRT: CUDA memcpy host→device
        TRT->>TRT: Execute inference (async)
        TRT->>TRT: CUDA memcpy device→host

        INF->>INF: Postprocess (confidence filter + NMS)
        INF->>GUI: Draw bounding boxes + display
    end
```

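The frame-skipping and batching at the top of the loop can be sketched as a generator. This assumes "every 4th" means keeping frames 0, 4, 8, … and that the engine batch size is 4 (as exported in Flow 3); whether the real code pads a trailing partial batch is not stated, so this sketch simply yields it.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batch_frames(frames: Iterable[T], step: int = 4, batch_size: int = 4) -> Iterator[List[T]]:
    """Keep every `step`-th frame and group the survivors into batches
    of `batch_size` for the TensorRT engine (illustrative helper)."""
    batch: List[T] = []
    for i, frame in enumerate(frames):
        if i % step:          # skip step-1 of every step frames
            continue
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                 # trailing partial batch
        yield batch
```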
### Data Flow Table

| Step | Input | Output | Component |
|------|-------|--------|-----------|
| Model resolve | GPU compute capability | Engine filename | Inference |
| Download small | API endpoint + JWT | Encrypted small bytes | API & CDN |
| Load big | Local filesystem | Encrypted big bytes | API & CDN |
| Reassemble | small + big bytes | Full encrypted model | API & CDN |
| Decrypt | Encrypted model + key | Raw TensorRT engine | Security |
| Init engine | Engine bytes | CUDA buffers allocated | Inference |
| Preprocess | Video frame | NCHW float32 blob | Inference |
| Inference | Input blob | Raw detection tensor | Inference |
| Postprocess | Raw tensor | List[Detection] | Inference |
| Visualize | Detections + frame | Annotated frame | Inference |

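The Postprocess row (raw tensor → `List[Detection]`) is the usual confidence filter followed by greedy non-maximum suppression. The sketch below shows the idea on plain `(box, score, class_id)` tuples; thresholds and the per-class suppression are conventional defaults, not values confirmed by the source.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def postprocess(detections, conf_thres=0.25, iou_thres=0.45):
    """Confidence filter + greedy per-class NMS over
    (box, score, class_id) tuples, highest score first."""
    kept = []
    for det in sorted((d for d in detections if d[1] >= conf_thres),
                      key=lambda d: d[1], reverse=True):
        if all(iou(det[0], k[0]) < iou_thres for k in kept if k[2] == det[2]):
            kept.append(det)
    return kept
```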
---
## Flow 5: Model Export (Multi-Format)

```mermaid
flowchart LR
    PT[azaion.pt] -->|export_onnx| ONNX[azaion.onnx]
    PT -->|export_tensorrt| TRT[azaion.engine]
    PT -->|export_rknn| RKNN[azaion.rknn]
    ONNX -->|encrypt + split| UPLOAD[API + CDN upload]
    TRT -->|encrypt + split| UPLOAD
```

| Target Format | Resolution | Batch | Precision | Use Case |
|---------------|-----------|-------|-----------|----------|
| ONNX | 1280px | 4 | FP32 | Cross-platform inference |
| TensorRT | auto | 4 | FP16 | Production GPU inference |
| RKNN | auto | auto | auto | OrangePi5 edge device |

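The table above can be captured as a small registry, which keeps export settings in one place rather than scattered across the exporter functions. The structure and names here are illustrative, not the project's actual `exports.py` API; "auto" stands for values resolved at export time from the target device.

```python
# Illustrative registry mirroring the export-target table above.
EXPORT_TARGETS = {
    "onnx":     {"imgsz": 1280,   "batch": 4,      "precision": "fp32"},
    "tensorrt": {"imgsz": "auto", "batch": 4,      "precision": "fp16"},
    "rknn":     {"imgsz": "auto", "batch": "auto", "precision": "auto"},
}

def export_settings(fmt: str) -> dict:
    """Look up the export settings for a target format."""
    try:
        return EXPORT_TARGETS[fmt]
    except KeyError:
        raise ValueError(f"unsupported export format: {fmt}") from None
```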
---
## Error Scenarios

| Flow | Error | Handling |
|------|-------|----------|
| Annotation ingestion | Malformed message | Caught by on_message exception handler, logged |
| Annotation ingestion | Queue disconnect | Process exits (no reconnect logic) |
| Augmentation | Corrupted image | Caught per-thread, logged, skipped |
| Augmentation | Transform failure | Caught per-variant, logged, fewer augmentations produced |
| Training | Corrupted label (coords > 1.0) | Moved to /data-corrupted/ |
| Training | Power outage | save_period=1 enables resume_training from last epoch |
| API download | 401/403 | Auto-relogin + retry |
| API download | 500 | Printed, no retry |
| Inference | CUDA error | RuntimeError raised |
| CDN upload/download | Any exception | Caught, printed, returns False |