Mirror of https://github.com/azaion/ai-training.git, synced 2026-04-22 21:56:36 +00:00 (commit 142c6c4de8).

# System Flows

## Flow 1: Annotation Ingestion (Annotation Queue → Filesystem)

```mermaid
sequenceDiagram
    participant RMQ as RabbitMQ Streams
    participant AQH as AnnotationQueueHandler
    participant FS as Filesystem

    RMQ->>AQH: AMQP message (msgpack)
    AQH->>AQH: Decode message, read AnnotationStatus

    alt Created / Edited
        AQH->>AQH: Parse AnnotationMessage (image + detections)
        alt Validator / Admin role
            AQH->>FS: Write label → /data/labels/{name}.txt
            AQH->>FS: Write image → /data/images/{name}.jpg
        else Operator role
            AQH->>FS: Write label → /data-seed/labels/{name}.txt
            AQH->>FS: Write image → /data-seed/images/{name}.jpg
        end
    else Validated (bulk)
        AQH->>FS: Move images+labels from /data-seed/ → /data/
    else Deleted (bulk)
        AQH->>FS: Move images+labels → /data_deleted/
    end

    AQH->>FS: Persist offset to offset.yaml
```

### Data Flow Table

| Step | Input | Output | Component |
|------|-------|--------|-----------|
| Receive | AMQP message (msgpack) | AnnotationMessage / AnnotationBulkMessage | Annotation Queue |
| Route | AnnotationStatus header | Dispatch to save/validate/delete | Annotation Queue |
| Save | Image bytes + detection JSON | .jpg + .txt files on disk | Annotation Queue |
| Track | Message context offset | offset.yaml | Annotation Queue |

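The routing step above can be sketched as follows. This is a minimal illustration of the dispatch on `AnnotationStatus` and the role-based choice between `/data/` and `/data-seed/`; the enum values, role strings, and `handle` signature are assumptions for the sketch, not the real `AnnotationQueueHandler` API.

```python
from enum import Enum
from pathlib import Path

# Hypothetical status enum; the real values come from the decoded msgpack header.
class AnnotationStatus(Enum):
    CREATED = "created"
    EDITED = "edited"
    VALIDATED = "validated"
    DELETED = "deleted"

DATA = Path("/data")            # validator/admin annotations land here directly
DATA_SEED = Path("/data-seed")  # operator annotations await validation here

def target_root(role: str) -> Path:
    """Validator/Admin writes go straight to /data; Operator to /data-seed."""
    return DATA if role in ("validator", "admin") else DATA_SEED

def handle(status: AnnotationStatus, role: str, name: str) -> list[str]:
    """Return the filesystem actions for one decoded message (illustrative)."""
    if status in (AnnotationStatus.CREATED, AnnotationStatus.EDITED):
        root = target_root(role)
        return [f"write {root}/labels/{name}.txt",
                f"write {root}/images/{name}.jpg"]
    if status is AnnotationStatus.VALIDATED:
        return [f"move /data-seed/*/{name}.* -> /data/"]
    return [f"move */{name}.* -> /data_deleted/"]
```

After each message is handled, the real handler persists the stream offset to `offset.yaml` so processing resumes where it left off.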
---
## Flow 2: Data Augmentation

```mermaid
sequenceDiagram
    participant FS as Filesystem (/azaion/data/)
    participant AUG as Augmentator
    participant PFS as Filesystem (/azaion/data-processed/)

    loop Every 5 minutes
        AUG->>FS: Scan /data/images/ for unprocessed files
        AUG->>AUG: Filter out already-processed images
        loop Each unprocessed image (parallel)
            AUG->>FS: Read image + labels
            AUG->>AUG: Correct bounding boxes (clip + filter)
            AUG->>AUG: Generate 7 augmented variants
            AUG->>PFS: Write 8 images (original + 7 augmented)
            AUG->>PFS: Write 8 label files
        end
        AUG->>AUG: Sleep 5 minutes
    end
```

---
## Flow 3: Training Pipeline

```mermaid
sequenceDiagram
    participant PFS as Filesystem (/data-processed/)
    participant TRAIN as train.py
    participant DS as Filesystem (/datasets/)
    participant YOLO as Ultralytics YOLO
    participant API as Azaion API
    participant CDN as S3 CDN

    TRAIN->>PFS: Read all processed images
    TRAIN->>TRAIN: Shuffle, split 70/20/10
    TRAIN->>DS: Copy to train/valid/test folders
    Note over TRAIN: Corrupted labels → /data-corrupted/

    TRAIN->>TRAIN: Generate data.yaml (80 class names)
    TRAIN->>YOLO: Train yolo11m (120 epochs, batch=11, 1280px)
    YOLO-->>TRAIN: Training results + best.pt

    TRAIN->>DS: Copy results to /models/{date}/
    TRAIN->>TRAIN: Copy best.pt → /models/azaion.pt

    TRAIN->>TRAIN: Export .pt → .onnx (1280px, batch=4)
    TRAIN->>TRAIN: Read azaion.onnx bytes
    TRAIN->>TRAIN: Encrypt with model key (AES-256-CBC)
    TRAIN->>TRAIN: Split: small (≤3KB or 20%) + big (rest)
    TRAIN->>API: Upload azaion.onnx.small
    TRAIN->>CDN: Upload azaion.onnx.big
```

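The final split step can be sketched as below. The diagram says "small (≤3KB or 20%)"; this sketch reads that as "whichever cut is smaller", which is one plausible interpretation, and it operates on the already-encrypted bytes (the AES-256-CBC step is out of scope here). Function names and defaults are illustrative.

```python
def split_model(blob: bytes, cap: int = 3 * 1024, frac: float = 0.20) -> tuple[bytes, bytes]:
    """Split an encrypted model into a 'small' head (served via the API)
    and a 'big' tail (served via the CDN). The cut point is the smaller
    of `cap` bytes and `frac` of the file, per one reading of the spec."""
    cut = min(cap, int(len(blob) * frac))
    return blob[:cut], blob[cut:]

def reassemble(small: bytes, big: bytes) -> bytes:
    """Inverse operation, used on the inference side (Flow 4)."""
    return small + big
```

Keeping the small part behind the authenticated API means the publicly cached CDN object is useless on its own, even before decryption.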
---
## Flow 4: Model Download & Inference

```mermaid
sequenceDiagram
    participant INF as start_inference.py
    participant API as Azaion API
    participant CDN as S3 CDN
    participant SEC as Security
    participant TRT as TensorRTEngine
    participant VID as Video File
    participant GUI as OpenCV Window

    INF->>INF: Determine GPU-specific engine filename
    INF->>SEC: Get model encryption key

    INF->>API: Login (JWT)
    INF->>API: Download {engine}.small (encrypted)
    INF->>INF: Read {engine}.big from local disk
    INF->>INF: Reassemble: small + big
    INF->>SEC: Decrypt (AES-256-CBC)

    INF->>TRT: Initialize engine from bytes
    TRT->>TRT: Allocate CUDA memory (input + output)

    loop Video frames
        INF->>VID: Read frame (every 4th)
        INF->>INF: Batch frames to batch_size

        INF->>TRT: Preprocess (blob, normalize, resize)
        TRT->>TRT: CUDA memcpy host→device
        TRT->>TRT: Execute inference (async)
        TRT->>TRT: CUDA memcpy device→host

        INF->>INF: Postprocess (confidence filter + NMS)
        INF->>GUI: Draw bounding boxes + display
    end
```

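The frame-skipping and batching at the top of the loop can be sketched as a generator. This assumes "every 4th" means keeping frames 0, 4, 8, … and that the engine batch size is 4 (as exported in Flow 3); whether the real code pads a trailing partial batch is not stated, so this sketch simply yields it.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batch_frames(frames: Iterable[T], step: int = 4, batch_size: int = 4) -> Iterator[List[T]]:
    """Keep every `step`-th frame and group the survivors into batches
    of `batch_size` for the TensorRT engine (illustrative helper)."""
    batch: List[T] = []
    for i, frame in enumerate(frames):
        if i % step:          # skip step-1 of every step frames
            continue
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                 # trailing partial batch
        yield batch
```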
### Data Flow Table

| Step | Input | Output | Component |
|------|-------|--------|-----------|
| Model resolve | GPU compute capability | Engine filename | Inference |
| Download small | API endpoint + JWT | Encrypted small bytes | API & CDN |
| Load big | Local filesystem | Encrypted big bytes | API & CDN |
| Reassemble | small + big bytes | Full encrypted model | API & CDN |
| Decrypt | Encrypted model + key | Raw TensorRT engine | Security |
| Init engine | Engine bytes | CUDA buffers allocated | Inference |
| Preprocess | Video frame | NCHW float32 blob | Inference |
| Inference | Input blob | Raw detection tensor | Inference |
| Postprocess | Raw tensor | List[Detection] | Inference |
| Visualize | Detections + frame | Annotated frame | Inference |

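The Postprocess row (raw tensor → `List[Detection]`) is the usual confidence filter followed by greedy non-maximum suppression. The sketch below shows the idea on plain `(box, score, class_id)` tuples; thresholds and the per-class suppression are conventional defaults, not values confirmed by the source.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def postprocess(detections, conf_thres=0.25, iou_thres=0.45):
    """Confidence filter + greedy per-class NMS over
    (box, score, class_id) tuples, highest score first."""
    kept = []
    for det in sorted((d for d in detections if d[1] >= conf_thres),
                      key=lambda d: d[1], reverse=True):
        if all(iou(det[0], k[0]) < iou_thres for k in kept if k[2] == det[2]):
            kept.append(det)
    return kept
```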
---
## Flow 5: Model Export (Multi-Format)

```mermaid
flowchart LR
    PT[azaion.pt] -->|export_onnx| ONNX[azaion.onnx]
    PT -->|export_tensorrt| TRT[azaion.engine]
    PT -->|export_rknn| RKNN[azaion.rknn]
    ONNX -->|encrypt + split| UPLOAD[API + CDN upload]
    TRT -->|encrypt + split| UPLOAD
```

| Target Format | Resolution | Batch | Precision | Use Case |
|---------------|-----------|-------|-----------|----------|
| ONNX | 1280px | 4 | FP32 | Cross-platform inference |
| TensorRT | auto | 4 | FP16 | Production GPU inference |
| RKNN | auto | auto | auto | OrangePi5 edge device |

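The table above can be captured as a small registry, which keeps export settings in one place rather than scattered across the exporter functions. The structure and names here are illustrative, not the project's actual `exports.py` API; "auto" stands for values resolved at export time from the target device.

```python
# Illustrative registry mirroring the export-target table above.
EXPORT_TARGETS = {
    "onnx":     {"imgsz": 1280,   "batch": 4,      "precision": "fp32"},
    "tensorrt": {"imgsz": "auto", "batch": 4,      "precision": "fp16"},
    "rknn":     {"imgsz": "auto", "batch": "auto", "precision": "auto"},
}

def export_settings(fmt: str) -> dict:
    """Look up the export settings for a target format."""
    try:
        return EXPORT_TARGETS[fmt]
    except KeyError:
        raise ValueError(f"unsupported export format: {fmt}") from None
```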
---
## Error Scenarios

| Flow | Error | Handling |
|------|-------|----------|
| Annotation ingestion | Malformed message | Caught by on_message exception handler, logged |
| Annotation ingestion | Queue disconnect | Process exits (no reconnect logic) |
| Augmentation | Corrupted image | Caught per-thread, logged, skipped |
| Augmentation | Transform failure | Caught per-variant, logged, fewer augmentations produced |
| Training | Corrupted label (coords > 1.0) | Moved to /data-corrupted/ |
| Training | Power outage | save_period=1 enables resume_training from last epoch |
| API download | 401/403 | Auto-relogin + retry |
| API download | 500 | Printed, no retry |
| Inference | CUDA error | RuntimeError raised |
| CDN upload/download | Any exception | Caught, printed, returns False |