mirror of
https://github.com/azaion/ai-training.git
synced 2026-04-22 22:16:35 +00:00
System Flows
Flow 1: Annotation Ingestion (Annotation Queue → Filesystem)
sequenceDiagram
participant RMQ as RabbitMQ Streams
participant AQH as AnnotationQueueHandler
participant FS as Filesystem
RMQ->>AQH: AMQP message (msgpack)
AQH->>AQH: Decode message, read AnnotationStatus
alt Created / Edited
AQH->>AQH: Parse AnnotationMessage (image + detections)
alt Validator / Admin role
AQH->>FS: Write label → /data/labels/{name}.txt
AQH->>FS: Write image → /data/images/{name}.jpg
else Operator role
AQH->>FS: Write label → /data-seed/labels/{name}.txt
AQH->>FS: Write image → /data-seed/images/{name}.jpg
end
else Validated (bulk)
AQH->>FS: Move images+labels from /data-seed/ → /data/
else Deleted (bulk)
AQH->>FS: Move images+labels → /data_deleted/
end
AQH->>FS: Persist offset to offset.yaml
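The branching in the diagram can be sketched as two small lookups: one from status to action, one from role to target directory. This is a minimal illustration and not the handler's actual code; the enum classes and the function names `action_for`, `save_paths` and `bulk_destination` are hypothetical, and only the directory names and role/status labels come from the diagram.

```python
from enum import Enum
from pathlib import PurePosixPath


class AnnotationStatus(Enum):  # hypothetical enum; labels taken from the diagram
    CREATED = "Created"
    EDITED = "Edited"
    VALIDATED = "Validated"
    DELETED = "Deleted"


class Role(Enum):  # hypothetical enum; labels taken from the diagram
    OPERATOR = "Operator"
    VALIDATOR = "Validator"
    ADMIN = "Admin"


# Directory roots as they appear in the diagram.
DATA_ROOT = PurePosixPath("/data")
SEED_ROOT = PurePosixPath("/data-seed")
DELETED_ROOT = PurePosixPath("/data_deleted")


def action_for(status: AnnotationStatus) -> str:
    """Map a status to one of the three branches: save, validate, delete."""
    if status in (AnnotationStatus.CREATED, AnnotationStatus.EDITED):
        return "save"
    if status is AnnotationStatus.VALIDATED:
        return "validate"
    return "delete"


def save_paths(name: str, role: Role) -> tuple[PurePosixPath, PurePosixPath]:
    """Return (label_path, image_path) for a Created/Edited annotation."""
    # Validator and Admin write straight to /data; Operator writes to the seed area.
    root = DATA_ROOT if role in (Role.VALIDATOR, Role.ADMIN) else SEED_ROOT
    return root / "labels" / f"{name}.txt", root / "images" / f"{name}.jpg"


def bulk_destination(status: AnnotationStatus) -> PurePosixPath:
    """Return the destination root for a bulk Validated or Deleted message."""
    if status is AnnotationStatus.VALIDATED:
        return DATA_ROOT
    if status is AnnotationStatus.DELETED:
        return DELETED_ROOT
    raise ValueError(f"{status} is not a bulk status")
```

For example, `save_paths("frame_001", Role.OPERATOR)` yields paths under `/data-seed/`, while the same call with `Role.ADMIN` yields paths under `/data/`.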
Data Flow Table
| Step | Input | Output | Component |
|---|---|---|---|
| Receive | AMQP message (msgpack) | AnnotationMessage / AnnotationBulkMessage | Annotation Queue |
| Route | AnnotationStatus header | Dispatch to save/validate/delete | Annotation Queue |
| Save | Image bytes + detection JSON | .jpg + .txt files on disk | Annotation Queue |
| Track | Message context offset | offset.yaml | Annotation Queue |
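
The Track step only says that the offset ends up in offset.yaml. One way to make that write safe against a crash mid-write is to write a temporary file and rename it over the target. The sketch below assumes a single `offset: N` key, which is a guess at the file layout, and the helper names `persist_offset` and `load_offset` are hypothetical. It avoids a YAML library by writing the one line by hand.

```python
import contextlib
import os
import tempfile
from pathlib import Path


def persist_offset(path, offset: int) -> None:
    """Atomically replace the offset file with a single `offset: N` line."""
    path = Path(path)
    # Create the temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(f"offset: {offset}\n")
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise


def load_offset(path, default: int = 0) -> int:
    """Read the stored offset, or return `default` if the file or key is missing."""
    try:
        text = Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        return default
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "offset":
            return int(value)
    return default
```

Whether the real handler reads the offset back on startup is not stated in this document; `load_offset` is included only to show the round trip.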
Flow 2: Data Augmentation
sequenceDiagram
participant FS as Filesystem (/azaion/data/)
participant AUG as Augmentator
participant PFS as Filesystem (/azaion/data-processed/)
loop Every 5 minutes
AUG->>FS: Scan /data/images/ for unprocessed files
AUG->>AUG: Filter out already-processed images
loop Each unprocessed image (parallel)
AUG->>FS: Read image + labels
AUG->>AUG: Correct bounding boxes (clip + filter)
AUG->>AUG: Generate 7 augmented variants
AUG->>PFS: Write 8 images (original + 7 augmented)
AUG->>PFS: Write 8 label files
end
AUG->>AUG: Sleep 5 minutes
end
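The "clip + filter" step can be illustrated on YOLO-style labels, where each box is `(class, cx, cy, w, h)` normalized to the range 0..1. The sketch below is an assumption about what the correction does, not the Augmentator's code: it clips each box to the image bounds and drops boxes that become smaller than a threshold. The function name `correct_boxes` and the `min_size` default are hypothetical.

```python
def correct_boxes(boxes, min_size=0.01):
    """Clip normalized YOLO boxes to the image and drop degenerate ones.

    boxes: iterable of (class_id, cx, cy, w, h), all coordinates in image fractions.
    min_size: hypothetical threshold; boxes narrower or shorter than this are removed.
    """
    corrected = []
    for cls, cx, cy, w, h in boxes:
        # Convert centre/size to corners and clip them to [0, 1].
        x1 = max(0.0, cx - w / 2)
        y1 = max(0.0, cy - h / 2)
        x2 = min(1.0, cx + w / 2)
        y2 = min(1.0, cy + h / 2)
        new_w, new_h = x2 - x1, y2 - y1
        # A box fully outside the image ends up with negative size and is dropped too.
        if new_w < min_size or new_h < min_size:
            continue
        corrected.append((cls, (x1 + x2) / 2, (y1 + y2) / 2, new_w, new_h))
    return corrected
```

A box that hangs over the right edge keeps its class but gets a narrower width and a shifted centre; a box that lies entirely outside the image is removed.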
Flow 3: Training Pipeline
sequenceDiagram
participant PFS as Filesystem (/data-processed/)
participant TRAIN as train.py
participant DS as Filesystem (/datasets/)
participant YOLO as Ultralytics YOLO
participant API as Azaion API
participant CDN as S3 CDN
TRAIN->>PFS: Read all processed images
TRAIN->>TRAIN: Shuffle, split 70/20/10
TRAIN->>DS: Copy to train/valid/test folders
Note over TRAIN: Corrupted labels → /data-corrupted/
TRAIN->>TRAIN: Generate data.yaml (80 class names)
TRAIN->>YOLO: Train yolo11m (120 epochs, batch=11, 1280px)
YOLO-->>TRAIN: Training results + best.pt
TRAIN->>DS: Copy results to /models/{date}/
TRAIN->>TRAIN: Copy best.pt → /models/azaion.pt
TRAIN->>TRAIN: Export .pt → .onnx (1280px, batch=4)
TRAIN->>TRAIN: Read azaion.onnx bytes
TRAIN->>TRAIN: Encrypt with model key (AES-256-CBC)
TRAIN->>TRAIN: Split: small (≤3KB or 20%) + big (rest)
TRAIN->>API: Upload azaion.onnx.small
TRAIN->>CDN: Upload azaion.onnx.big
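The shuffle and 70/20/10 split can be sketched as follows. This is a minimal illustration, assuming the split is done on a list of file names; the function name `split_dataset` and the `seed` parameter are hypothetical, and the real train.py also copies files and diverts corrupted labels, which is not shown here.

```python
import random


def split_dataset(names, percents=(70, 20, 10), seed=None):
    """Shuffle names and split them into train/valid/test by integer percentages."""
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the cut points.
    n_train = len(shuffled) * percents[0] // 100
    n_valid = len(shuffled) * percents[1] // 100
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        # The test set takes whatever remains, so no item is lost to rounding.
        "test": shuffled[n_train + n_valid:],
    }
```

Passing a fixed `seed` makes the split reproducible, which helps when comparing training runs on the same data.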
Flow 4: Model Download & Inference
sequenceDiagram
participant INF as start_inference.py
participant API as Azaion API
participant CDN as S3 CDN
participant SEC as Security
participant TRT as TensorRTEngine
participant VID as Video File
participant GUI as OpenCV Window
INF->>INF: Determine GPU-specific engine filename
INF->>SEC: Get model encryption key
INF->>API: Login (JWT)
INF->>API: Download {engine}.small (encrypted)
INF->>INF: Read {engine}.big from local disk
INF->>INF: Reassemble: small + big
INF->>SEC: Decrypt (AES-256-CBC)
INF->>TRT: Initialize engine from bytes
TRT->>TRT: Allocate CUDA memory (input + output)
loop Video frames
INF->>VID: Read frame (every 4th)
INF->>INF: Batch frames to batch_size
INF->>TRT: Preprocess (blob, normalize, resize)
TRT->>TRT: CUDA memcpy host→device
TRT->>TRT: Execute inference (async)
TRT->>TRT: CUDA memcpy device→host
INF->>INF: Postprocess (confidence filter + NMS)
INF->>GUI: Draw bounding boxes + display
end
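The frame loop combines two steps: keeping every 4th frame and grouping the kept frames into batches. A generator expresses both without loading the whole video. The sketch below works on any iterable of frames; the name `sample_and_batch` is hypothetical, and two details are assumptions: that sampling starts at the first frame, and that a final partial batch is still yielded.

```python
from itertools import islice


def sample_and_batch(frames, stride=4, batch_size=4):
    """Yield lists of up to batch_size frames, keeping every stride-th frame."""
    batch = []
    # islice with a step keeps frames 0, stride, 2*stride, ... lazily.
    for frame in islice(frames, 0, None, stride):
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Assumption: the trailing partial batch is processed rather than dropped.
        yield batch
```

With 20 input frames, a stride of 4 and a batch size of 2, the generator yields the frames at indices 0 and 4, then 8 and 12, then 16 on its own.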
Data Flow Table
| Step | Input | Output | Component |
|---|---|---|---|
| Model resolve | GPU compute capability | Engine filename | Inference |
| Download small | API endpoint + JWT | Encrypted small bytes | API & CDN |
| Load big | Local filesystem | Encrypted big bytes | API & CDN |
| Reassemble | small + big bytes | Full encrypted model | API & CDN |
| Decrypt | Encrypted model + key | Raw TensorRT engine | Security |
| Init engine | Engine bytes | CUDA buffers allocated | Inference |
| Preprocess | Video frame | NCHW float32 blob | Inference |
| Inference | Input blob | Raw detection tensor | Inference |
| Postprocess | Raw tensor | List[Detection] | Inference |
| Visualize | Detections + frame | Annotated frame | Inference |
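
The Reassemble step is the inverse of the split performed after training. The sketch below shows the round trip on raw bytes only and leaves out the AES-256-CBC step. It rests on two assumptions that this document does not confirm: that the small part is the leading bytes of the encrypted file, and that "≤3KB or 20%" means the smaller of 3 KB and 20% of the file. The function names are hypothetical.

```python
def split_encrypted(data: bytes, small_limit: int = 3 * 1024, small_percent: int = 20):
    """Split encrypted bytes into (small, big) parts.

    Assumption: small is the leading min(small_limit, small_percent% of data) bytes.
    """
    small_size = min(small_limit, len(data) * small_percent // 100)
    return data[:small_size], data[small_size:]


def reassemble(small: bytes, big: bytes) -> bytes:
    """Join the two parts back into the full encrypted payload."""
    return small + big
```

Neither part is usable on its own, since decryption needs the complete ciphertext. That is presumably why the small part is served by the authenticated API while the big part can sit on a CDN or on local disk.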
Flow 5: Model Export (Multi-Format)
flowchart LR
PT[azaion.pt] -->|export_onnx| ONNX[azaion.onnx]
PT -->|export_tensorrt| TRT[azaion.engine]
PT -->|export_rknn| RKNN[azaion.rknn]
ONNX -->|encrypt + split| UPLOAD[API + CDN upload]
TRT -->|encrypt + split| UPLOAD
| Target Format | Resolution | Batch | Precision | Use Case |
|---|---|---|---|---|
| ONNX | 1280px | 4 | FP32 | Cross-platform inference |
| TensorRT | auto | 4 | FP16 | Production GPU inference |
| RKNN | auto | auto | auto | OrangePi5 edge device |
Error Scenarios
| Flow | Error | Handling |
|---|---|---|
| Annotation ingestion | Malformed message | Caught by on_message exception handler, logged |
| Annotation ingestion | Queue disconnect | Process exits (no reconnect logic) |
| Augmentation | Corrupted image | Caught per-thread, logged, skipped |
| Augmentation | Transform failure | Caught per-variant, logged, fewer augmentations produced |
| Training | Corrupted label (coords > 1.0) | Moved to /data-corrupted/ |
| Training | Power outage | save_period=1 enables resume_training from last epoch |
| API download | 401/403 | Auto-relogin + retry |
| API download | 500 | Printed, no retry |
| Inference | CUDA error | RuntimeError raised |
| CDN upload/download | Any exception | Caught, printed, returns False |
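
The two API download rows describe different policies: 401/403 triggers a fresh login and a retry, while 500 is printed and not retried. A minimal sketch of that policy, with the HTTP client abstracted away, could look like this. The callables `fetch` and `login`, the `(status, body)` return shape and the `max_attempts` parameter are all hypothetical stand-ins and not the project's API client.

```python
def download_with_relogin(fetch, login, max_attempts=2):
    """Download with one re-login on 401/403; give up immediately on other errors.

    fetch(token) -> (status_code, body_bytes)   # hypothetical callable
    login() -> token                            # hypothetical callable
    """
    token = login()
    for attempt in range(max_attempts):
        status, body = fetch(token)
        if status == 200:
            return body
        if status in (401, 403) and attempt + 1 < max_attempts:
            # The token was rejected: log in again and retry.
            token = login()
            continue
        # Any other status (for example 500) is reported and not retried.
        print(f"download failed with HTTP {status}")
        return None
    return None
```

Capping the attempts keeps a persistently rejected login from looping forever.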