
# System Flows

## Flow 1: Annotation Ingestion (Annotation Queue → Filesystem)

```mermaid
sequenceDiagram
    participant RMQ as RabbitMQ Streams
    participant AQH as AnnotationQueueHandler
    participant FS as Filesystem

    RMQ->>AQH: AMQP message (msgpack)
    AQH->>AQH: Decode message, read AnnotationStatus

    alt Created / Edited
        AQH->>AQH: Parse AnnotationMessage (image + detections)
        alt Validator / Admin role
            AQH->>FS: Write label → /data/labels/{name}.txt
            AQH->>FS: Write image → /data/images/{name}.jpg
        else Operator role
            AQH->>FS: Write label → /data-seed/labels/{name}.txt
            AQH->>FS: Write image → /data-seed/images/{name}.jpg
        end
    else Validated (bulk)
        AQH->>FS: Move images+labels from /data-seed/ → /data/
    else Deleted (bulk)
        AQH->>FS: Move images+labels → /data_deleted/
    end

    AQH->>FS: Persist offset to offset.yaml
```

### Data Flow Table

| Step | Input | Output | Component |
|---|---|---|---|
| Receive | AMQP message (msgpack) | AnnotationMessage / AnnotationBulkMessage | Annotation Queue |
| Route | AnnotationStatus header | Dispatch to save/validate/delete | Annotation Queue |
| Save | Image bytes + detection JSON | `.jpg` + `.txt` files on disk | Annotation Queue |
| Track | Message context offset | offset.yaml | Annotation Queue |
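The role-based routing above can be sketched as a small helper. The enum member names and the `target_paths` helper are hypothetical (the real enum lives in the queue-handler code); only the directory split between `/data` and `/data-seed` comes from the flow:

```python
from enum import Enum


class AnnotationStatus(str, Enum):
    # Hypothetical member names/values; the real enum lives in the handler code.
    CREATED = "created"
    EDITED = "edited"
    VALIDATED = "validated"
    DELETED = "deleted"


def target_dir(status: AnnotationStatus, role: str) -> str:
    """Pick the destination tree for a single-annotation message.

    Validator/Admin annotations go straight to /data; Operator annotations
    are staged in /data-seed until a bulk Validated message promotes them.
    """
    if status not in (AnnotationStatus.CREATED, AnnotationStatus.EDITED):
        raise ValueError("bulk statuses (Validated/Deleted) move files instead")
    return "/data" if role in ("validator", "admin") else "/data-seed"


def target_paths(name: str, status: AnnotationStatus, role: str) -> tuple[str, str]:
    """Return (label_path, image_path) for one annotation."""
    base = target_dir(status, role)
    return f"{base}/labels/{name}.txt", f"{base}/images/{name}.jpg"
```

The staging split means Operator work is invisible to training until validated, since the augmentation and training flows only read from `/data`.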

## Flow 2: Data Augmentation

```mermaid
sequenceDiagram
    participant FS as Filesystem (/azaion/data/)
    participant AUG as Augmentator
    participant PFS as Filesystem (/azaion/data-processed/)

    loop Every 5 minutes
        AUG->>FS: Scan /data/images/ for unprocessed files
        AUG->>AUG: Filter out already-processed images
        loop Each unprocessed image (parallel)
            AUG->>FS: Read image + labels
            AUG->>AUG: Correct bounding boxes (clip + filter)
            AUG->>AUG: Generate 7 augmented variants
            AUG->>PFS: Write 8 images (original + 7 augmented)
            AUG->>PFS: Write 8 label files
        end
        AUG->>AUG: Sleep 5 minutes
    end
```
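The "correct bounding boxes (clip + filter)" step can be sketched as follows. This assumes normalized YOLO label format `(cls, cx, cy, w, h)` in `[0, 1]`; the `min_size` threshold is a hypothetical value, not taken from the source:

```python
def correct_boxes(boxes, min_size=1e-3):
    """Clip normalized YOLO boxes to the image frame and drop degenerate ones.

    boxes: iterable of (cls, cx, cy, w, h) with coordinates normalized to [0, 1].
    Boxes whose clipped width or height falls below min_size are filtered out.
    """
    out = []
    for cls, cx, cy, w, h in boxes:
        # Convert center/size to corners, clip to the [0, 1] frame.
        x1 = max(0.0, cx - w / 2)
        y1 = max(0.0, cy - h / 2)
        x2 = min(1.0, cx + w / 2)
        y2 = min(1.0, cy + h / 2)
        nw, nh = x2 - x1, y2 - y1
        if nw >= min_size and nh >= min_size:
            # Back to center/size with the clipped geometry.
            out.append((cls, x1 + nw / 2, y1 + nh / 2, nw, nh))
    return out
```

Running this before augmentation matters because geometric transforms (flips, crops, rotations) can push already out-of-range coordinates further out, and downstream training rejects labels with coordinates > 1.0 (see Flow 3).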

## Flow 3: Training Pipeline

```mermaid
sequenceDiagram
    participant PFS as Filesystem (/data-processed/)
    participant TRAIN as train.py
    participant DS as Filesystem (/datasets/)
    participant YOLO as Ultralytics YOLO
    participant API as Azaion API
    participant CDN as S3 CDN

    TRAIN->>PFS: Read all processed images
    TRAIN->>TRAIN: Shuffle, split 70/20/10
    TRAIN->>DS: Copy to train/valid/test folders
    Note over TRAIN: Corrupted labels → /data-corrupted/

    TRAIN->>TRAIN: Generate data.yaml (80 class names)
    TRAIN->>YOLO: Train yolo11m (120 epochs, batch=11, 1280px)
    YOLO-->>TRAIN: Training results + best.pt

    TRAIN->>DS: Copy results to /models/{date}/
    TRAIN->>TRAIN: Copy best.pt → /models/azaion.pt

    TRAIN->>TRAIN: Export .pt → .onnx (1280px, batch=4)
    TRAIN->>TRAIN: Read azaion.onnx bytes
    TRAIN->>TRAIN: Encrypt with model key (AES-256-CBC)
    TRAIN->>TRAIN: Split: small (≤3KB or 20%) + big (rest)

    TRAIN->>API: Upload azaion.onnx.small
    TRAIN->>CDN: Upload azaion.onnx.big
```
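The small/big split at the end of the pipeline can be sketched as below. The "≤3KB or 20%" rule is interpreted here as "whichever is smaller" — that exact semantics is an assumption, and the encryption step is omitted (the split operates on already-encrypted bytes):

```python
def split_model(blob: bytes) -> tuple[bytes, bytes]:
    """Split an encrypted model into a 'small' head and a 'big' tail.

    The small part (served through the authenticated API) is 20% of the
    file, capped at 3 KiB; the big part (served via CDN) is the remainder.
    Without the small part, the big part alone cannot be decrypted into a
    usable model, which is the point of the split.
    """
    cut = min(3 * 1024, len(blob) // 5)  # 20%, capped at 3 KiB (assumed rule)
    return blob[:cut], blob[cut:]
```

The inference side (Flow 4) performs the inverse: download the small part via the API, read the big part locally, and concatenate before decrypting.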

## Flow 4: Model Download & Inference

```mermaid
sequenceDiagram
    participant INF as start_inference.py
    participant API as Azaion API
    participant CDN as S3 CDN
    participant SEC as Security
    participant TRT as TensorRTEngine
    participant VID as Video File
    participant GUI as OpenCV Window

    INF->>INF: Determine GPU-specific engine filename
    INF->>SEC: Get model encryption key

    INF->>API: Login (JWT)
    INF->>API: Download {engine}.small (encrypted)
    INF->>INF: Read {engine}.big from local disk
    INF->>INF: Reassemble: small + big
    INF->>SEC: Decrypt (AES-256-CBC)

    INF->>TRT: Initialize engine from bytes
    TRT->>TRT: Allocate CUDA memory (input + output)

    loop Video frames
        INF->>VID: Read frame (every 4th)
        INF->>INF: Batch frames to batch_size

        INF->>TRT: Preprocess (blob, normalize, resize)
        TRT->>TRT: CUDA memcpy host→device
        TRT->>TRT: Execute inference (async)
        TRT->>TRT: CUDA memcpy device→host

        INF->>INF: Postprocess (confidence filter + NMS)
        INF->>GUI: Draw bounding boxes + display
    end
```
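The frame-skipping and batching in the loop above can be sketched generically. The stride of 4 and batching to `batch_size` come from the diagram; treating them as parameters of a generator is this sketch's choice, not the source's structure:

```python
def frame_batches(frames, stride=4, batch_size=4):
    """Yield batches of every `stride`-th frame, batch_size frames at a time.

    Skipping 3 of every 4 frames trades temporal resolution for throughput;
    batching fills the engine's fixed input batch dimension.
    """
    batch = []
    for i, frame in enumerate(frames):
        if i % stride:
            continue  # skip frames between sampled ones
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # trailing partial batch at end of video
```

A partial final batch would need padding (or a flexible batch profile) before being handed to a fixed-batch TensorRT engine; how the real code handles that is not shown in the flow.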

### Data Flow Table

| Step | Input | Output | Component |
|---|---|---|---|
| Model resolve | GPU compute capability | Engine filename | Inference |
| Download small | API endpoint + JWT | Encrypted small bytes | API & CDN |
| Load big | Local filesystem | Encrypted big bytes | API & CDN |
| Reassemble | small + big bytes | Full encrypted model | API & CDN |
| Decrypt | Encrypted model + key | Raw TensorRT engine | Security |
| Init engine | Engine bytes | CUDA buffers allocated | Inference |
| Preprocess | Video frame | NCHW float32 blob | Inference |
| Inference | Input blob | Raw detection tensor | Inference |
| Postprocess | Raw tensor | List[Detection] | Inference |
| Visualize | Detections + frame | Annotated frame | Inference |
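The postprocess step (confidence filter + NMS) can be sketched in pure Python. The 0.25/0.45 thresholds are common YOLO defaults assumed here, and the sketch applies NMS per class (boxes of different classes never suppress each other), which is also an assumption:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def postprocess(dets, conf_thres=0.25, iou_thres=0.45):
    """Filter detections by confidence, then greedy per-class NMS.

    dets: list of (x1, y1, x2, y2, score, cls) tuples decoded from the
    raw output tensor. Returns the surviving detections, highest score first.
    """
    dets = [d for d in dets if d[4] >= conf_thres]
    dets.sort(key=lambda d: d[4], reverse=True)
    keep = []
    for d in dets:
        # Keep d unless a higher-scoring box of the same class overlaps it.
        if all(d[5] != k[5] or iou(d[:4], k[:4]) <= iou_thres for k in keep):
            keep.append(d)
    return keep
```

In production this runs on the detection tensor copied back from the GPU each batch, so its cost is on the hot path; vectorized or GPU-side NMS would be the usual optimization.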

## Flow 5: Model Export (Multi-Format)

```mermaid
flowchart LR
    PT[azaion.pt] -->|export_onnx| ONNX[azaion.onnx]
    PT -->|export_tensorrt| TRT[azaion.engine]
    PT -->|export_rknn| RKNN[azaion.rknn]
    ONNX -->|encrypt + split| UPLOAD[API + CDN upload]
    TRT -->|encrypt + split| UPLOAD
```

| Target Format | Resolution | Batch | Precision | Use Case |
|---|---|---|---|---|
| ONNX | 1280px | 4 | FP32 | Cross-platform inference |
| TensorRT | auto | 4 | FP16 | Production GPU inference |
| RKNN | auto | auto | auto | OrangePi5 edge device |

## Error Scenarios

| Flow | Error | Handling |
|---|---|---|
| Annotation ingestion | Malformed message | Caught by `on_message` exception handler, logged |
| Annotation ingestion | Queue disconnect | Process exits (no reconnect logic) |
| Augmentation | Corrupted image | Caught per-thread, logged, skipped |
| Augmentation | Transform failure | Caught per-variant, logged, fewer augmentations produced |
| Training | Corrupted label (coords > 1.0) | Moved to /data-corrupted/ |
| Training | Power outage | `save_period=1` enables `resume_training` from last epoch |
| API download | 401/403 | Auto-relogin + retry |
| API download | 500 | Printed, no retry |
| Inference | CUDA error | RuntimeError raised |
| CDN upload/download | Any exception | Caught, printed, returns False |
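The 401/403 auto-relogin behavior can be sketched with a small wrapper. `fetch` and `login` are hypothetical callables standing in for the real API client; the single-retry limit and the print-without-retry on other errors mirror the table above:

```python
def download_with_relogin(fetch, login, max_retries=1):
    """Call fetch(); on 401/403, re-login once and retry.

    fetch: callable returning (http_status, body_bytes_or_None).
    login: callable refreshing the JWT session.
    Returns the body on HTTP 200, None otherwise (e.g. 500: printed, no retry).
    """
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status in (401, 403) and attempt < max_retries:
            login()  # refresh expired/invalid JWT, then retry
            continue
        if status == 200:
            return body
        print(f"download failed with HTTP {status}")
        return None
    return None
```

Note the asymmetry in the error table: authentication failures are recoverable (retried after re-login), while server errors and queue disconnects are not, so a 500 during model download or a dropped AMQP connection requires operator intervention.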