mirror of https://github.com/azaion/ai-training.git synced 2026-04-22 22:16:35 +00:00

Oleksandr Bezdieniezhnykh 142c6c4de8 Refactor constants management to use Pydantic BaseModel for configuration

- Replaced module-level path variables in constants.py with a structured Pydantic Config class.
- Updated all relevant modules (train.py, augmentation.py, exports.py, dataset-visualiser.py, manual_run.py) to access paths through the new config structure.
- Fixed bugs related to image processing and model saving.
- Enhanced test infrastructure to accommodate the new configuration approach.

This refactor improves code maintainability and clarity by centralizing configuration management.

2026-03-27 18:18:30 +02:00

System Flows

Flow 1: Annotation Ingestion (Annotation Queue → Filesystem)

sequenceDiagram
    participant RMQ as RabbitMQ Streams
    participant AQH as AnnotationQueueHandler
    participant FS as Filesystem

    RMQ->>AQH: AMQP message (msgpack)
    AQH->>AQH: Decode message, read AnnotationStatus

    alt Created / Edited
        AQH->>AQH: Parse AnnotationMessage (image + detections)
        alt Validator / Admin role
            AQH->>FS: Write label → /data/labels/{name}.txt
            AQH->>FS: Write image → /data/images/{name}.jpg
        else Operator role
            AQH->>FS: Write label → /data-seed/labels/{name}.txt
            AQH->>FS: Write image → /data-seed/images/{name}.jpg
        end
    else Validated (bulk)
        AQH->>FS: Move images+labels from /data-seed/ → /data/
    else Deleted (bulk)
        AQH->>FS: Move images+labels → /data_deleted/
    end

    AQH->>FS: Persist offset to offset.yaml

Data Flow Table

| Step | Input | Output | Component |
|---|---|---|---|
| Receive | AMQP message (msgpack) | AnnotationMessage / AnnotationBulkMessage | Annotation Queue |
| Route | AnnotationStatus header | Dispatch to save/validate/delete | Annotation Queue |
| Save | Image bytes + detection JSON | .jpg + .txt files on disk | Annotation Queue |
| Track | Message context offset | offset.yaml | Annotation Queue |
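
The role-based routing for Created / Edited messages in Flow 1 can be sketched as follows. This is a minimal illustration, not the handler's actual code: the enum members, the `Role` type, and the `target_paths` helper are hypothetical names introduced here, and only the directory layout is taken from the diagram.

```python
from enum import Enum
from pathlib import PurePosixPath
from typing import Tuple


class AnnotationStatus(Enum):
    # Member names and values are assumptions; the source only names the type.
    CREATED = "Created"
    EDITED = "Edited"
    VALIDATED = "Validated"
    DELETED = "Deleted"


class Role(Enum):
    # Hypothetical enum for the three roles shown in the diagram.
    OPERATOR = "Operator"
    VALIDATOR = "Validator"
    ADMIN = "Admin"


DATA_ROOT = PurePosixPath("/data")       # trusted annotations
SEED_ROOT = PurePosixPath("/data-seed")  # annotations awaiting validation


def target_paths(status: AnnotationStatus, role: Role, name: str) -> Tuple[PurePosixPath, PurePosixPath]:
    """Return (label_path, image_path) for a Created/Edited annotation."""
    if status not in (AnnotationStatus.CREATED, AnnotationStatus.EDITED):
        raise ValueError(f"{status} is a bulk operation and has no single target path")
    # Validator and Admin output goes straight to /data; Operator output is staged.
    root = DATA_ROOT if role in (Role.VALIDATOR, Role.ADMIN) else SEED_ROOT
    return root / "labels" / f"{name}.txt", root / "images" / f"{name}.jpg"
```

Keeping the path decision in a pure function like this makes the routing rule testable without a running queue or a real filesystem.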

Flow 2: Data Augmentation

sequenceDiagram
    participant FS as Filesystem (/azaion/data/)
    participant AUG as Augmentator
    participant PFS as Filesystem (/azaion/data-processed/)

    loop Every 5 minutes
        AUG->>FS: Scan /data/images/ for unprocessed files
        AUG->>AUG: Filter out already-processed images
        loop Each unprocessed image (parallel)
            AUG->>FS: Read image + labels
            AUG->>AUG: Correct bounding boxes (clip + filter)
            AUG->>AUG: Generate 7 augmented variants
            AUG->>PFS: Write 8 images (original + 7 augmented)
            AUG->>PFS: Write 8 label files
        end
        AUG->>AUG: Sleep 5 minutes
    end
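
The "Correct bounding boxes (clip + filter)" step could look roughly like the sketch below. It assumes YOLO-style labels, that is `(class_id, cx, cy, w, h)` with coordinates normalized to the image size; the function name `correct_boxes` and the `min_size` threshold are hypothetical and not taken from the Augmentator.

```python
from typing import List, Tuple

# (class_id, center_x, center_y, width, height), all normalized to [0, 1].
Box = Tuple[int, float, float, float, float]


def correct_boxes(boxes: List[Box], min_size: float = 0.01) -> List[Box]:
    """Clip boxes to the image and drop those that become too small.

    min_size is an assumed threshold, not a value from the source.
    """
    corrected: List[Box] = []
    for class_id, cx, cy, w, h in boxes:
        # Convert to corners and clip each corner to the image bounds.
        x1 = max(0.0, cx - w / 2)
        y1 = max(0.0, cy - h / 2)
        x2 = min(1.0, cx + w / 2)
        y2 = min(1.0, cy + h / 2)
        new_w, new_h = x2 - x1, y2 - y1
        # A box lying fully outside the image ends up with a negative size.
        if new_w < min_size or new_h < min_size:
            continue
        corrected.append((class_id, (x1 + x2) / 2, (y1 + y2) / 2, new_w, new_h))
    return corrected
```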

Flow 3: Training Pipeline

sequenceDiagram
    participant PFS as Filesystem (/data-processed/)
    participant TRAIN as train.py
    participant DS as Filesystem (/datasets/)
    participant YOLO as Ultralytics YOLO
    participant API as Azaion API
    participant CDN as S3 CDN

    TRAIN->>PFS: Read all processed images
    TRAIN->>TRAIN: Shuffle, split 70/20/10
    TRAIN->>DS: Copy to train/valid/test folders
    Note over TRAIN: Corrupted labels → /data-corrupted/

    TRAIN->>TRAIN: Generate data.yaml (80 class names)
    TRAIN->>YOLO: Train yolo11m (120 epochs, batch=11, 1280px)
    YOLO-->>TRAIN: Training results + best.pt

    TRAIN->>DS: Copy results to /models/{date}/
    TRAIN->>TRAIN: Copy best.pt → /models/azaion.pt

    TRAIN->>TRAIN: Export .pt → .onnx (1280px, batch=4)
    TRAIN->>TRAIN: Read azaion.onnx bytes
    TRAIN->>TRAIN: Encrypt with model key (AES-256-CBC)
    TRAIN->>TRAIN: Split: small (≤3KB or 20%) + big (rest)

    TRAIN->>API: Upload azaion.onnx.small
    TRAIN->>CDN: Upload azaion.onnx.big
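
The "Shuffle, split 70/20/10" step can be illustrated with a short sketch. The function name `split_dataset` and the `seed` parameter are assumptions made for this example; how train.py actually rounds the split boundaries is not stated in this document.

```python
import random
from typing import Dict, List, Optional, Sequence


def split_dataset(
    names: Sequence[str],
    seed: Optional[int] = None,
    train_pct: int = 70,
    valid_pct: int = 20,
) -> Dict[str, List[str]]:
    """Shuffle image names and split them into train/valid/test lists.

    The test split receives whatever remains after train and valid.
    """
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * valid_pct // 100
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }
```

Passing a fixed `seed` makes the split reproducible, which can help when comparing two training runs on the same processed data.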

Flow 4: Model Download & Inference

sequenceDiagram
    participant INF as start_inference.py
    participant API as Azaion API
    participant CDN as S3 CDN
    participant SEC as Security
    participant TRT as TensorRTEngine
    participant VID as Video File
    participant GUI as OpenCV Window

    INF->>INF: Determine GPU-specific engine filename
    INF->>SEC: Get model encryption key

    INF->>API: Login (JWT)
    INF->>API: Download {engine}.small (encrypted)
    INF->>INF: Read {engine}.big from local disk
    INF->>INF: Reassemble: small + big
    INF->>SEC: Decrypt (AES-256-CBC)

    INF->>TRT: Initialize engine from bytes
    TRT->>TRT: Allocate CUDA memory (input + output)

    loop Video frames
        INF->>VID: Read frame (every 4th)
        INF->>INF: Batch frames to batch_size

        INF->>TRT: Preprocess (blob, normalize, resize)
        TRT->>TRT: CUDA memcpy host→device
        TRT->>TRT: Execute inference (async)
        TRT->>TRT: CUDA memcpy device→host

        INF->>INF: Postprocess (confidence filter + NMS)
        INF->>GUI: Draw bounding boxes + display
    end
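
The postprocessing step (confidence filter + NMS) can be sketched in pure Python as below. This is an illustrative, class-agnostic greedy NMS, not the project's implementation: the fields of `Detection` and both threshold defaults are assumptions.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    # Field layout is assumed; the source only mentions List[Detection].
    x1: float
    y1: float
    x2: float
    y2: float
    confidence: float
    class_id: int


def iou(a: Detection, b: Detection) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def postprocess(
    detections: List[Detection],
    conf_threshold: float = 0.25,  # assumed default
    iou_threshold: float = 0.45,   # assumed default
) -> List[Detection]:
    """Drop low-confidence boxes, then greedily suppress overlapping ones."""
    candidates = sorted(
        (d for d in detections if d.confidence >= conf_threshold),
        key=lambda d: d.confidence,
        reverse=True,
    )
    kept: List[Detection] = []
    for cand in candidates:
        # Keep a box only if it does not overlap too much with a stronger one.
        if all(iou(cand, k) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept
```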

Data Flow Table

| Step | Input | Output | Component |
|---|---|---|---|
| Model resolve | GPU compute capability | Engine filename | Inference |
| Download small | API endpoint + JWT | Encrypted small bytes | API & CDN |
| Load big | Local filesystem | Encrypted big bytes | API & CDN |
| Reassemble | small + big bytes | Full encrypted model | API & CDN |
| Decrypt | Encrypted model + key | Raw TensorRT engine | Security |
| Init engine | Engine bytes | CUDA buffers allocated | Inference |
| Preprocess | Video frame | NCHW float32 blob | Inference |
| Inference | Input blob | Raw detection tensor | Inference |
| Postprocess | Raw tensor | List[Detection] | Inference |
| Visualize | Detections + frame | Annotated frame | Inference |
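
The Reassemble step is the inverse of the small/big split performed at the end of the training pipeline. The sketch below shows the round trip on already-encrypted bytes. It rests on one possible reading of "≤3KB or 20%", namely that the small part is the first 20% of the file, capped at 3 KB; that reading, the constant names, and both function names are assumptions, and encryption itself is left out.

```python
from typing import Tuple

SMALL_MAX_BYTES = 3 * 1024  # assumed cap for the small part (3 KB)
SMALL_PERCENT = 20          # assumed share of the file for the small part


def split_encrypted(data: bytes) -> Tuple[bytes, bytes]:
    """Split encrypted model bytes into (small, big) parts.

    Assumption: small = first 20% of the data, but never more than 3 KB.
    """
    small_size = min(SMALL_MAX_BYTES, len(data) * SMALL_PERCENT // 100)
    return data[:small_size], data[small_size:]


def reassemble(small: bytes, big: bytes) -> bytes:
    """Concatenate the two parts back into the full encrypted model."""
    return small + big
```

Under this reading neither part is useful on its own: the encrypted model can only be decrypted after the API-hosted small part and the locally stored big part are joined in the right order.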

Flow 5: Model Export (Multi-Format)

flowchart LR
    PT[azaion.pt] -->|export_onnx| ONNX[azaion.onnx]
    PT -->|export_tensorrt| TRT[azaion.engine]
    PT -->|export_rknn| RKNN[azaion.rknn]
    ONNX -->|encrypt + split| UPLOAD[API + CDN upload]
    TRT -->|encrypt + split| UPLOAD

| Target Format | Resolution | Batch | Precision | Use Case |
|---|---|---|---|---|
| ONNX | 1280px | 4 | FP32 | Cross-platform inference |
| TensorRT | auto | 4 | FP16 | Production GPU inference |
| RKNN | auto | auto | auto | OrangePi5 edge device |

Error Scenarios

| Flow | Error | Handling |
|---|---|---|
| Annotation ingestion | Malformed message | Caught by on_message exception handler, logged |
| Annotation ingestion | Queue disconnect | Process exits (no reconnect logic) |
| Augmentation | Corrupted image | Caught per-thread, logged, skipped |
| Augmentation | Transform failure | Caught per-variant, logged, fewer augmentations produced |
| Training | Corrupted label (coords > 1.0) | Moved to /data-corrupted/ |
| Training | Power outage | save_period=1 enables resume_training from last epoch |
| API download | 401/403 | Auto-relogin + retry |
| API download | 500 | Error printed, no retry |
| Inference | CUDA error | RuntimeError raised |
| CDN upload/download | Any exception | Caught, printed, returns False |
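
The API download rows above (re-login and retry on 401/403, print without retry on 500) can be sketched as follows. The `fetch` and `login` callables, the `Response` tuple, and the single-retry limit are hypothetical stand-ins for the real API client, whose interface is not shown in this document.

```python
from typing import Callable, Optional, Tuple

# (HTTP status code, response body); a stand-in for a real HTTP response.
Response = Tuple[int, bytes]


def download_with_relogin(
    fetch: Callable[[], Response],
    login: Callable[[], None],
) -> Optional[bytes]:
    """Fetch once; on 401/403 log in again and retry exactly once."""
    status, body = fetch()
    if status in (401, 403):
        # The token is presumably expired or invalid: refresh it and retry.
        login()
        status, body = fetch()
    if status == 200:
        return body
    # Any other outcome (for example 500) is reported and not retried.
    print(f"download failed with HTTP {status}")
    return None
```

Limiting the retry to a single attempt avoids an endless login loop when the credentials themselves are rejected.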
