mirror of
https://github.com/azaion/ai-training.git
synced 2026-04-22 08:36:34 +00:00
Refactor constants management to use Pydantic BaseModel for configuration
- Replaced module-level path variables in constants.py with a structured Pydantic Config class.
- Updated all relevant modules (train.py, augmentation.py, exports.py, dataset-visualiser.py, manual_run.py) to access paths through the new config structure.
- Fixed bugs related to image processing and model saving.
- Enhanced test infrastructure to accommodate the new configuration approach.

This refactor improves code maintainability and clarity by centralizing configuration management.
# Acceptance Criteria

## Training

- Dataset split: 70% train, 20% validation, 10% test (hardcoded in train.py).
- Training parameters: YOLOv11 medium, 120 epochs, batch size 11, image size 1280px, save_period=1.
- Corrupted labels (bounding box coordinates > 1.0) are filtered to `/azaion/data-corrupted/`.
- Model export to ONNX: 1280px resolution, batch size 4, NMS baked in.
- Trained model encrypted with AES-256-CBC before upload.
- Encrypted model split: small part ≤3KB or 20% of total → API server; remainder → CDN.
- Post-training: model uploaded to both API and CDN endpoints.

## Augmentation

- Each validated image produces exactly 8 outputs (1 original + 7 augmented variants).
- Augmentation runs every 5 minutes, processing only unprocessed images.
- Bounding boxes are clipped to the [0, 1] range; boxes whose area after clipping is below 0.01% of the image are discarded.
- Processing is parallelized per image using ThreadPoolExecutor.

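The clip-and-discard rule above can be sketched in a few lines. This is an illustrative helper rather than the pipeline's actual augmentation code; the 0.01% area floor from the bullet above is expressed as 0.0001 in normalized units.

```python
def clip_bbox(cx, cy, w, h, min_area=0.0001):
    """Clip a normalized YOLO bbox (center x/y, width, height) to [0, 1].

    Returns the adjusted bbox, or None if the clipped area falls below
    min_area (0.01% of the image, per the criterion above).
    """
    x1 = max(0.0, cx - w / 2)
    y1 = max(0.0, cy - h / 2)
    x2 = min(1.0, cx + w / 2)
    y2 = min(1.0, cy + h / 2)
    nw, nh = x2 - x1, y2 - y1
    if nw <= 0 or nh <= 0 or nw * nh < min_area:
        return None  # degenerate or too small after clipping: discard
    return ((x1 + x2) / 2, (y1 + y2) / 2, nw, nh)
```

A box fully inside the image passes through unchanged; a box hanging over the right edge comes back narrowed and re-centered; a sub-threshold box is dropped.
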
## Annotation Ingestion

- Created/Edited annotations from Validators/Admins → `/azaion/data/`.
- Created/Edited annotations from Operators → `/azaion/data-seed/`.
- Validated (bulk) events → move from `/data-seed/` to `/data/`.
- Deleted (bulk) events → move to `/data_deleted/`.
- Queue consumer offset persisted to `offset.yaml` after each message.

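A minimal sketch of the routing rules above. The function name and signature are hypothetical (the real consumer's API is not shown in this document); only the role/status-to-directory mapping is taken from the bullets.

```python
def route_annotation(role: str, status: str) -> str:
    """Map an annotation event to its destination directory.

    Created/Edited events are routed by role; bulk Validated/Deleted
    events by status, per the ingestion rules above.
    """
    if status == "Deleted":
        return "/azaion/data_deleted/"
    if status == "Validated":
        return "/azaion/data/"
    if role in ("Validator", "Admin"):
        return "/azaion/data/"
    return "/azaion/data-seed/"  # Operator annotations await validation
```
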
## Inference

- TensorRT inference: ~54s for a 200s video, ~3.7GB VRAM.
- ONNX inference: ~81s for a 200s video, ~6.3GB VRAM.
- Frame sampling: every 4th frame.
- Batch size: 4 (for both ONNX and TensorRT).
- Confidence threshold: 0.3 (hardcoded in inference/inference.py).
- NMS IoU threshold: 0.3 (hardcoded in inference/inference.py).
- Overlapping detection removal: of any pair with IoU > 0.3, the lower-confidence detection is removed.

## Security

- API authentication via JWT (email/password login).
- Model encryption: AES-256-CBC with static key.
- Resource encryption: AES-256-CBC with hardware-derived key (CPU+GPU+RAM+drive serial hash).
- CDN access: separate read/write S3 credentials.
- Split-model storage: prevents model theft from a single storage compromise.

## Data Format

- Annotation format: YOLO (`class_id center_x center_y width height`, all normalized 0–1).
- 17 base annotation classes × 3 weather modes = 51 active classes (80 total slots).
- Image format: JPEG.
- Queue message format: msgpack with positional integer keys.

# Input Data Parameters

## Annotation Images

- **Format**: JPEG
- **Naming**: UUID-based (`{uuid}.jpg`)
- **Source**: Azaion annotation platform via RabbitMQ Streams
- **Volume**: 360K+ annotations observed in training comments
- **Delivery**: Real-time streaming via the annotation queue consumer

## Annotation Labels

- **Format**: YOLO text format (one detection per line)
- **Schema**: `{class_id} {center_x} {center_y} {width} {height}`
- **Coordinate system**: All values normalized to 0–1 relative to image dimensions
- **Constraints**: Coordinates must be in [0, 1]; labels with coordinates > 1.0 are treated as corrupted

## Annotation Classes

- **Source file**: `classes.json` (static, 17 entries)
- **Schema per class**: `{ Id: int, Name: str, ShortName: str, Color: hex_str }`
- **Classes**: ArmorVehicle, Truck, Vehicle, Artillery, Shadow, Trenches, MilitaryMan, TyreTracks, AdditArmoredTank, Smoke, Plane, Moto, CamouflageNet, CamouflageBranches, Roof, Building, Caponier
- **Weather expansion**: Each class × 3 modes (Norm offset 0, Wint offset 20, Night offset 40)
- **Total class IDs**: 80 slots (51 used, 29 reserved as placeholders)

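The weather-mode expansion can be sketched as follows. The offsets come from the bullet above; the 0-based base IDs and the `Name-Mode` display-name convention are assumptions made for illustration.

```python
# Offsets per the weather-expansion rule above.
WEATHER_OFFSETS = {"Norm": 0, "Wint": 20, "Night": 40}

def expand_classes(base_classes):
    """Expand base class entries ({"Id": int, "Name": str, ...}) across the
    three weather modes, returning {expanded_id: display_name}."""
    expanded = {}
    for cls in base_classes:
        for mode, offset in WEATHER_OFFSETS.items():
            expanded[cls["Id"] + offset] = f"{cls['Name']}-{mode}"
    return expanded
```

With 17 base classes this yields the 51 active IDs described above, all fitting inside the 80-slot range.
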
## Queue Messages

- **Protocol**: AMQP via RabbitMQ Streams (rstream library)
- **Serialization**: msgpack with positional integer keys
- **Message types**: AnnotationMessage (single), AnnotationBulkMessage (batch validate/delete)
- **Fields**: createdDate, name, originalMediaName, time, imageExtension, detections (JSON string), image (raw bytes), createdRole, createdEmail, source, status

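"Positional integer keys" means the msgpack map uses field positions (0, 1, 2, …) instead of field-name strings, which shrinks every message. A sketch, assuming the field order above matches the wire order (the real schema's ordering is defined by the pipeline, not this document):

```python
import msgpack  # third-party; the pipeline consumes it via rstream

# Field order is an assumption for illustration.
FIELDS = ["createdDate", "name", "originalMediaName", "time",
          "imageExtension", "detections", "image", "createdRole",
          "createdEmail", "source", "status"]

def pack_message(msg: dict) -> bytes:
    """Serialize with positional integer keys instead of field names."""
    return msgpack.packb({i: msg.get(f) for i, f in enumerate(FIELDS)})

def unpack_message(raw: bytes) -> dict:
    """Reverse the mapping back to named fields."""
    payload = msgpack.unpackb(raw, strict_map_key=False)
    return {f: payload.get(i) for i, f in enumerate(FIELDS)}
```
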
## Configuration Files

| File | Format | Key Contents |
|------|--------|--------------|
| `config.yaml` | YAML | API URL, email, password, queue host/port/username/password, directory paths |
| `cdn.yaml` | YAML | CDN endpoint, read access key/secret, write access key/secret, bucket name |
| `classes.json` | JSON | Annotation class definitions array |
| `checkpoint.txt` | Plain text | Last training run timestamp |
| `offset.yaml` | YAML | Queue consumer offset for resume |

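For orientation, `config.yaml` might look like the sketch below. The key names, nesting, and placeholder values are inferred from the "Key Contents" column and are assumptions, not the pipeline's exact schema:

```yaml
# Illustrative shape only -- keys and nesting are assumptions.
api_url: https://api.example.com
email: user@example.com
password: "<secret>"
queue:
  host: rabbitmq.example.com
  port: 5552
  username: "<user>"
  password: "<secret>"
paths:
  data: /azaion/data/
  data_seed: /azaion/data-seed/
```
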
## Video Input (Inference)

- **Format**: Any OpenCV-supported video format
- **Processing**: Every 4th frame sampled, batched in groups of 4
- **Resolution**: Resized to model input size (1280×1280) during preprocessing

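The sampling-and-batching scheme above reduces to a small index generator. This sketch only produces frame indices (decoding with OpenCV is omitted); whether the real pipeline pads or drops a trailing partial batch is not specified here, so the partial batch is yielded as-is.

```python
def sample_batches(total_frames: int, stride: int = 4, batch_size: int = 4):
    """Yield batches of frame indices: every `stride`-th frame, grouped
    into `batch_size`-sized batches, per the processing rule above."""
    batch = []
    for idx in range(0, total_frames, stride):
        batch.append(idx)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # trailing partial batch
```
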
# Expected Results

Maps every input data item to its quantifiable expected result.

## Result Format Legend

| Result Type | When to Use | Example |
|-------------|-------------|---------|
| Exact value | Output must match precisely | `detection_count: 3`, `file_count: 8` |
| Tolerance range | Numeric output with acceptable variance | `confidence: 0.92 ± 0.05` |
| Threshold | Output must exceed or stay below a limit | `latency < 500ms`, `confidence ≥ 0.3` |
| Pattern match | Output must match a string/regex pattern | `filename matches *_1.jpg` |
| Set/count | Output must contain specific items or counts | `output_count == 8` |

## Input → Expected Result Mapping
### Augmentation

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 1 | 1 image + 1 label from `dataset/` | Single annotated image with valid bboxes | `output_count: 8` (1 original + 7 augmented) | exact | N/A | N/A |
| 2 | 1 image + 1 label from `dataset/` | Same image, output filenames | Original keeps name; augmented named `{stem}_1` through `{stem}_7` | pattern | N/A | N/A |
| 3 | 1 image + 1 label from `dataset/` | All output label bboxes | Every coordinate in [0, 1] range | range | [0.0, 1.0] | N/A |
| 4 | 1 image + label with bbox near edge (x=0.99, w=0.1) | Bbox partially outside image | Bbox clipped: width reduced, tiny bboxes (area < 0.01) removed | threshold_min | width ≥ 0.01, height ≥ 0.01 | N/A |
| 5 | 1 image + empty label file | Image with no detections | `output_count: 8`, all label files empty | exact | N/A | N/A |

### Dataset Formation

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 6 | 100 images + 100 labels from `dataset/` | Full fixture dataset | 3 folders created: `train/`, `valid/`, `test/` | exact | N/A | N/A |
| 7 | 100 images + 100 labels from `dataset/` | Split ratio | train: 70, valid: 20, test: 10 | exact | N/A | N/A |
| 8 | 100 images + 100 labels from `dataset/` | Each split has images/ and labels/ subdirs | `train/images/`, `train/labels/`, `valid/images/`, `valid/labels/`, `test/images/`, `test/labels/` | exact | N/A | N/A |
| 9 | 100 images + 100 labels from `dataset/` | Total files across all splits equals input count | `sum(train + valid + test) == 100` | exact | N/A | N/A |

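Rows 6–9 can be reproduced with a simple 70/20/10 partition. Whether the real train.py shuffles, and with what seed, is an assumption here; the function below shuffles with an injectable seed so the split is reproducible in tests.

```python
import random

def split_dataset(items, seed=None):
    """Partition items into 70% train, 20% valid, 10% test."""
    items = list(items)
    rng = random.Random(seed)
    rng.shuffle(items)  # assumption: the real pipeline may shuffle differently
    n = len(items)
    n_train = int(n * 0.7)
    n_valid = int(n * 0.2)
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])
```
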
### Label Validation

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 10 | Label file: `0 0.5 0.5 0.1 0.1` | Valid label (all coords ≤ 1.0) | `check_label` returns `True` | exact | N/A | N/A |
| 11 | Label file: `0 1.5 0.5 0.1 0.1` | Corrupted label (x > 1.0) | `check_label` returns `False` | exact | N/A | N/A |
| 12 | Label file: `0 0.5 0.5 0.1 1.2` | Corrupted label (h > 1.0) | `check_label` returns `False` | exact | N/A | N/A |
| 13 | Non-existent label path | Missing label file | `check_label` returns `False` | exact | N/A | N/A |
| 14 | Mix of 5 valid + 1 corrupted images/labels | Dataset formation with corrupted data | Corrupted image+label moved to `data-corrupted/`; valid ones in dataset splits | exact | corrupted_count: 1, valid_count: 5 | N/A |

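A sketch of `check_label` that satisfies rows 10–13 above. This mirrors the documented behavior only; the real implementation lives in the training code and may differ in details (e.g., how malformed lines are treated).

```python
import os

def check_label(path: str) -> bool:
    """Return True iff the label file exists and every coordinate of every
    detection line lies in [0, 1]; otherwise False (rows 10-13 above)."""
    if not os.path.isfile(path):
        return False  # missing label counts as invalid (row 13)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # blank line: nothing to validate
            coords = [float(v) for v in parts[1:]]  # skip class_id
            if any(c < 0.0 or c > 1.0 for c in coords):
                return False
    return True
```
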
### Encryption Roundtrip

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 15 | 1024 random bytes + key "test-key" | Arbitrary binary data | `decrypt(encrypt(data, key), key) == data` | exact | N/A | N/A |
| 16 | `azaion.onnx` bytes + model encryption key | Full ONNX model file | `decrypt(encrypt(model_bytes, key), key) == model_bytes` | exact | N/A | N/A |
| 17 | Empty bytes + key "test-key" | Edge case: zero-length input | `decrypt(encrypt(b"", key), key) == b""` | exact | N/A | N/A |
| 18 | 1 byte + key "test-key" | Edge case: minimum-length input | `decrypt(encrypt(b"\x00", key), key) == b"\x00"` | exact | N/A | N/A |

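A roundtrip sketch matching the scheme described in the Security Approach section: AES-256-CBC via the `cryptography` library, PKCS7 padding, IV prepended to the ciphertext. The key-derivation step is not specified anywhere in this document, so hashing the passphrase with SHA-256 to obtain 32 key bytes is purely an assumption.

```python
import hashlib
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def _derive_key(key_str: str) -> bytes:
    # Assumption: SHA-256 of the passphrase, just to get 32 bytes for AES-256.
    return hashlib.sha256(key_str.encode()).digest()

def encrypt(data: bytes, key_str: str) -> bytes:
    key = _derive_key(key_str)
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()
    padded = padder.update(data) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    # IV prepended to ciphertext, as the Security Approach section states.
    return iv + enc.update(padded) + enc.finalize()

def decrypt(blob: bytes, key_str: str) -> bytes:
    key = _derive_key(key_str)
    iv, ct = blob[:16], blob[16:]
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(ct) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()
```

PKCS7 always adds padding, so the zero-length and one-byte edge cases (rows 17–18) roundtrip the same way as arbitrary data.
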
### Model Encryption + Split

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 19 | 10000 bytes, key | Model-like binary data | Encrypted bytes split: small ≤ 3KB or 20% of total, big = remainder | threshold_max | small ≤ max(3072, total*0.2) | N/A |
| 20 | 10000 bytes, key | Same data, reassembled | `small + big == encrypted_total` | exact | N/A | N/A |

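The split step itself is just a byte-slice. The cut point below follows the tolerance column of row 19 (`max(3072, total*0.2)`), which is one interpretation of the "small part ≤3KB or 20% of total" rule, not confirmed source code.

```python
def split_model(encrypted: bytes) -> tuple[bytes, bytes]:
    """Split an encrypted model into a small part (API server) and a big
    part (CDN). Cut point per row 19's tolerance: max(3 KB, 20% of total)."""
    cut = min(len(encrypted), max(3072, len(encrypted) // 5))
    return encrypted[:cut], encrypted[cut:]
```

Reassembly is concatenation (row 20): `small + big` reproduces the encrypted blob byte-for-byte.
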
### Annotation Class Loading

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 21 | `classes.json` | Standard class definitions | `len(classes) == 17` unique base classes | exact | N/A | N/A |
| 22 | `classes.json` | Weather mode expansion | Class IDs: Norm offset 0, Wint offset 20, Night offset 40 | exact | N/A | N/A |
| 23 | `classes.json` | Total class slots in data.yaml | `nc: 80` in generated YAML | exact | N/A | N/A |

### Hardware Hash Determinism

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 24 | String "test-hardware-info" | Arbitrary hardware string | `get_hw_hash(s1) == get_hw_hash(s1)` (deterministic) | exact | N/A | N/A |
| 25 | Strings "hw-a" and "hw-b" | Different hardware strings | `get_hw_hash("hw-a") != get_hw_hash("hw-b")` | exact | N/A | N/A |
| 26 | String "test-hardware-info" | Hash format | Result is base64-encoded string, length > 0 | pattern | matches `^[A-Za-z0-9+/]+=*$` | N/A |

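A sketch with the three properties rows 24–26 test: deterministic, input-sensitive, base64-encoded. The document only says the hardware info is hashed, so the choice of SHA-256 here is an assumption.

```python
import base64
import hashlib

def get_hw_hash(hw_info: str) -> str:
    """Deterministic, base64-encoded digest of a hardware-info string.
    SHA-256 is an assumed hash function; the doc does not name one."""
    return base64.b64encode(hashlib.sha256(hw_info.encode()).digest()).decode()
```
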
### ONNX Inference Smoke Test

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 27 | `azaion.onnx` + 1 image from `dataset/` | Model + annotated image (known to contain objects) | Engine loads without error; returns output array with shape [batch, N, 6] | exact (no exception) | N/A | N/A |
| 28 | `azaion.onnx` + 1 image from `dataset/` | Inference postprocessing | Returns list of Detection objects (≥ 0 items); each Detection has x, y, w, h in [0,1], cls ≥ 0, confidence in [0,1] | range | x,y,w,h ∈ [0,1]; confidence ∈ [0,1]; cls ∈ [0,79] | N/A |

### NMS / Overlap Removal

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 29 | 2 Detections: same position, conf 0.9 and 0.5, IoU > 0.3 | Overlapping detections, different confidence | 1 detection remaining (conf 0.9 kept) | exact | count: 1 | N/A |
| 30 | 2 Detections: non-overlapping positions, IoU < 0.3 | Non-overlapping detections | 2 detections remaining (both kept) | exact | count: 2 | N/A |
| 31 | 3 Detections: A overlaps B, B overlaps C, A doesn't overlap C | Chain overlap | ≤ 2 detections remaining; highest confidence per overlap pair kept | threshold_max | count ≤ 2 | N/A |

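A greedy sketch of the overlap-removal rule these rows describe (keep the higher-confidence detection of any pair with IoU > 0.3). The corner-coordinate box format and greedy ordering are assumptions for illustration; the pipeline's exact code in inference/inference.py may differ.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def remove_overlaps(dets, iou_thr=0.3):
    """Greedy removal: walk detections in descending confidence, keeping
    one only if its IoU with every already-kept box is <= iou_thr.
    dets: list of ((x1, y1, x2, y2), confidence) pairs."""
    kept = []
    for box, conf in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, kb) <= iou_thr for kb, _ in kept):
            kept.append((box, conf))
    return kept
```
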
### Annotation Queue Message Parsing

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 32 | Constructed msgpack bytes matching AnnotationMessage schema | Valid Created annotation message | Parsed AnnotationMessage with correct fields: name, detections, image bytes, status == Created | exact | N/A | N/A |
| 33 | Constructed msgpack bytes for bulk Validated message | Valid bulk validation message | Parsed with status == Validated, list of annotation names | exact | N/A | N/A |
| 34 | Constructed msgpack bytes for bulk Deleted message | Valid bulk deletion message | Parsed with status == Deleted, list of annotation names | exact | N/A | N/A |
| 35 | Malformed msgpack bytes | Invalid message format | Exception raised (caught by handler) | exact (exception type) | N/A | N/A |

### YAML Generation

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 36 | `classes.json` + dataset path | Generate data.yaml for training | YAML contains: `nc: 80`, `train: train/images`, `val: valid/images`, `test: test/images`, 80 class names | exact | N/A | N/A |
| 37 | `classes.json` with 17 classes | Class name listing in YAML | 17 known class names present; 63 placeholder names as `Class-N` | exact | 17 named + 63 placeholder = 80 total | N/A |

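Rows 36–37 can be reproduced with plain string building. The exact YAML layout (and whether the real code uses PyYAML instead) is an assumption; only the keys, the `nc: 80` count, and the `Class-N` placeholder convention come from the table above.

```python
def make_data_yaml(class_names, total_slots=80):
    """Build data.yaml text: split paths, nc, and a names mapping padded
    with `Class-N` placeholders up to total_slots entries."""
    names = list(class_names) + [
        f"Class-{i}" for i in range(len(class_names), total_slots)
    ]
    lines = [
        "train: train/images",
        "val: valid/images",
        "test: test/images",
        f"nc: {total_slots}",
        "names:",
    ] + [f"  {i}: {name}" for i, name in enumerate(names)]
    return "\n".join(lines)
```
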
# Problem Statement

## What is this system?

Azaion AI Training is an end-to-end machine learning pipeline for training and deploying object detection models. It detects military and infrastructure objects in aerial/satellite imagery — including vehicles, artillery, personnel, trenches, camouflage, and buildings — under varying weather and lighting conditions.

## What problem does it solve?

Automated detection of military assets and infrastructure from aerial imagery requires:

1. Continuous ingestion of human-annotated training data from the Azaion annotation platform
2. Automated data augmentation to expand limited labeled datasets (8× multiplication)
3. GPU-accelerated model training using state-of-the-art object detection architectures
4. Secure model distribution that prevents model theft and ties deployment to authorized hardware
5. Real-time inference on video feeds with GPU acceleration
6. Edge deployment capability for low-power field devices

## Who are the users?

- **Annotators/Operators**: Create annotation data through the Azaion platform. Their annotations flow into the training pipeline via RabbitMQ.
- **Validators/Admins**: Review and approve annotations, promoting them from seed to validated status.
- **ML Engineers**: Configure and run training pipelines, monitor model quality, trigger retraining.
- **Inference Operators**: Deploy and run inference on video feeds using trained models on GPU-equipped machines.
- **Edge Deployment Operators**: Set up and run inference on OrangePi5 edge devices in the field.

## How does it work (high level)?

1. Annotations (images + bounding box labels) arrive via a RabbitMQ stream from the Azaion annotation platform
2. A queue consumer service routes annotations to the filesystem based on user role (operator → seed, validator → validated)
3. An augmentation pipeline continuously processes validated images, producing 8 outputs per original (the original plus 7 augmented variants)
4. A training pipeline assembles datasets (70/20/10 split), trains a YOLOv11 model over ~120 epochs, and exports to ONNX format
5. Trained models are encrypted with AES-256-CBC, split into small and big parts, and uploaded to the Azaion API and S3 CDN respectively
6. Inference clients download and reassemble the model, decrypt it using a hardware-bound key, and run real-time detection on video feeds using TensorRT or ONNX Runtime
7. For edge deployment, models are exported to RKNN format for OrangePi5 devices

# Restrictions

## Hardware

- Training requires an NVIDIA GPU with ≥24GB VRAM (validated: RTX 4090). Batch size 11 consumes ~22GB; batch size 12 exceeds 24GB.
- TensorRT inference requires an NVIDIA GPU with TensorRT support. Engine files are GPU-architecture-specific (compiled per compute capability).
- ONNX Runtime inference requires an NVIDIA GPU with CUDA support (~6.3GB VRAM for a 200s video).
- Edge inference requires an RK3588 SoC (OrangePi5).
- Hardware fingerprinting reads the CPU model, GPU name, total RAM, and drive serial — it requires access to these system properties.

## Software

- Python 3.10+ (uses `match` statements).
- CUDA 12.1 with PyTorch 2.3.0.
- TensorRT runtime for production GPU inference.
- ONNX Runtime with CUDAExecutionProvider for cross-platform inference.
- Albumentations for augmentation transforms.
- boto3 for S3-compatible CDN access.
- rstream for RabbitMQ Streams protocol.
- cryptography library for AES-256-CBC encryption.

## Environment

- Filesystem paths default to the `/azaion/` root (overridable via `config.yaml`).
- Requires network access to the Azaion REST API, an S3-compatible CDN, and a RabbitMQ instance.
- Configuration files (`config.yaml`, `cdn.yaml`) must be present with valid credentials.
- `classes.json` must be present with the 17 annotation class definitions.
- No containerization — processes run directly on the host OS.

## Operational

- Training duration: ~11.5 days for 360K annotations on a single RTX 4090.
- Augmentation runs as an infinite loop with 5-minute sleep intervals.
- The annotation queue consumer runs as a persistent async process.
- TensorRT engine files are GPU-architecture-specific — they must be regenerated when moving to a different GPU.
- The model encryption key is hardcoded — changing it invalidates all previously encrypted models.
- No graceful shutdown mechanism for the augmentation process.
- No reconnection logic for the annotation queue consumer on disconnect.

# Security Approach

## Authentication

- **API Authentication**: JWT-based. The client sends email/password to `POST /login`, receives a JWT, and uses it as a Bearer token for subsequent requests.
- **Auto-relogin**: On HTTP 401/403 responses, the client automatically re-authenticates and retries the request.

## Encryption

- **Model encryption**: AES-256-CBC with a static key defined in `security.py`. All model artifacts (ONNX, TensorRT) are encrypted before upload.
- **Resource encryption**: AES-256-CBC with a hardware-derived key. The key is generated by hashing the machine's CPU model, GPU name, total RAM, and primary drive serial number. This ties decryption to the specific hardware.
- **Implementation**: Uses the `cryptography` library with PKCS7 padding. The IV is prepended to the ciphertext.

## Model Protection

- **Split storage**: Encrypted models are split into a small part (≤3KB or 20% of total size) stored on the Azaion API server and a big part stored on an S3-compatible CDN. Both parts are required to reconstruct the model.
- **Hardware binding**: Inference clients must run on authorized hardware whose fingerprint matches the encryption key used during upload.

## Access Control

- **CDN access**: Separate read-only and write-only S3 credentials. Training uploads use write keys; inference downloads use read keys.
- **Role-based annotation routing**: Validator/Admin annotations go directly to validated storage; Operator annotations go to seed storage pending validation.

## Known Security Issues

| Issue | Severity | Location |
|-------|----------|----------|
| Hardcoded API credentials (email, password) | High | config.yaml |
| Hardcoded CDN access keys (4 keys) | High | cdn.yaml |
| Hardcoded model encryption key | High | security.py:67 |
| Queue credentials in plaintext | Medium | config.yaml, annotation-queue/config.yaml |
| No TLS certificate validation | Low | api_client.py |
| No input validation on API responses | Low | api_client.py |