Final Documentation Report — Azaion AI Training
Executive Summary
Azaion AI Training is a Python-based ML pipeline for training, deploying, and running YOLOv11 object detection models targeting aerial military asset recognition. The system comprises 8 components (21 modules) spanning annotation ingestion, data augmentation, GPU-accelerated training, multi-format model export, encrypted model distribution, and real-time inference — with edge deployment capability via RKNN on OrangePi5 devices.
The codebase is functional and production-used (last training run: 2024-06-27) but has no CI/CD, no containerization, no formal test framework, and several hardcoded credentials. Verification identified 5 code bugs, 3 high-severity security issues, and 1 missing module.
Problem Statement
The system automates detection of 17 classes of military objects and infrastructure in aerial/satellite imagery across 3 weather conditions (Normal, Winter, Night). It replaces manual image analysis with a continuous pipeline: human-annotated data flows in via RabbitMQ, is augmented 8× for training diversity, trains YOLOv11 models over multi-day GPU runs, and distributes encrypted models to inference clients that run real-time video detection.
Architecture Overview
Tech stack: Python 3.10+ · PyTorch 2.3.0 (CUDA 12.1) · Ultralytics YOLOv11m · TensorRT · ONNX Runtime · Albumentations · boto3 · rstream · cryptography
Deployment: 5 independent processes (no orchestration, no containers) running on GPU-equipped servers, deployed manually.
Component Summary
| # | Component | Modules | Purpose | Key Dependencies |
|---|---|---|---|---|
| 01 | Core Infrastructure | constants, utils | Shared paths, config keys, Dotdict helper | None |
| 02 | Security & Hardware | security, hardware_service | AES-256-CBC encryption, hardware fingerprinting | cryptography, pynvml |
| 03 | API & CDN Client | api_client, cdn_manager | REST API (JWT auth) + S3 CDN communication | requests, boto3, Security |
| 04 | Data Models | dto/annotationClass, dto/imageLabel | Annotation class definitions, image+label container | OpenCV, matplotlib |
| 05 | Data Pipeline | augmentation, convert-annotations, dataset-visualiser | 8× augmentation, format conversion, visualization | Albumentations, Data Models |
| 06 | Training Pipeline | train, exports, manual_run | Dataset formation → YOLO training → export → encrypted upload | Ultralytics, API & CDN, Security |
| 07 | Inference Engine | inference/dto, onnx_engine, tensorrt_engine, inference, start_inference | Model download, decryption, TensorRT/ONNX video inference | TensorRT, ONNX Runtime, PyCUDA |
| 08 | Annotation Queue | annotation_queue_dto, annotation_queue_handler | Async RabbitMQ Streams consumer for annotation CRUD events | rstream, msgpack |
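Component 01 exposes a `Dotdict` helper for shared configuration access. Its actual implementation is not reproduced in this report; a minimal sketch of the usual attribute-access-dict pattern (an assumption about its shape, not the project's code) is:

```python
class Dotdict(dict):
    """Dict subclass allowing attribute-style access: d.key == d["key"]."""
    __getattr__ = dict.__getitem__   # d.key        -> d["key"]
    __setattr__ = dict.__setitem__   # d.key = v    -> d["key"] = v
    __delattr__ = dict.__delitem__   # del d.key    -> del d["key"]


# Hypothetical usage; key names are illustrative only.
cfg = Dotdict({"data_dir": "/data", "epochs": 100})
print(cfg.data_dir)  # -> /data
```

Note that with this pattern a missing key raises `KeyError` rather than `AttributeError`, which can surprise code using `hasattr`.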
System Flows
| # | Flow | Entry Point | Path | Output |
|---|---|---|---|---|
| 1 | Annotation Ingestion | RabbitMQ message | Queue → Handler → Filesystem | Images + labels on disk |
| 2 | Data Augmentation | Filesystem scan (5-min loop) | /data/ → Augmentator → /data-processed/ | 8× augmented images + labels |
| 3 | Training Pipeline | train.py main | /data-processed/ → Dataset split → YOLO train → Export → Encrypt → Upload | Encrypted model on API + CDN |
| 4 | Model Download & Inference | start_inference.py main | API + CDN download → Decrypt → TensorRT init → Video frames → Detections | Annotated video output |
| 5 | Model Export (Multi-Format) | train.py / manual_run.py | .pt → .onnx / .engine / .rknn | Multi-format model artifacts |
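Flows 3 and 4 encrypt the model before upload and decrypt it after download. The report does not show how `security.py` frames its AES-256-CBC payload; the following is a generic sketch using the `cryptography` package from the tech stack, with an IV-prefixed ciphertext layout assumed purely for illustration:

```python
import os

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes


def encrypt_model(data: bytes, key: bytes) -> bytes:
    """AES-256-CBC encrypt; the 16-byte IV is prepended to the ciphertext."""
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()          # CBC requires block-aligned input
    padded = padder.update(data) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()


def decrypt_model(blob: bytes, key: bytes) -> bytes:
    """Reverse of encrypt_model: split off the IV, decrypt, strip padding."""
    iv, ct = blob[:16], blob[16:]
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(ct) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()
```

A 32-byte key gives AES-256; in the actual system the key should come from a secret store rather than source code (see the hardcoded-key finding below).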
Risk Observations
Code Bugs (from Verification)
| # | Location | Issue | Impact |
|---|---|---|---|
| 1 | augmentation.py:118 | `self.total_to_process` undefined (should be `self.total_images_to_process`) | AttributeError during progress logging |
| 2 | train.py:93,99 | `copied` counter never incremented | Incorrect progress reporting (cosmetic) |
| 3 | exports.py:97 | Stale `ApiClient(ApiCredentials(...))` constructor call with wrong params | `upload_model` function would fail at runtime |
| 4 | inference/tensorrt_engine.py:43-44 | `batch_size` uninitialized for dynamic batch dimensions | NameError for models with dynamic batch size |
| 5 | dataset-visualiser.py:6 | Imports from a `preprocessing` module that doesn't exist | Script cannot run |
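Bugs 1 and 2 are one-line fixes (renaming the attribute, and incrementing the counter inside the copy loop). As an illustration of bug 2, here is a self-contained sketch of a copy loop whose counter is actually incremented; the helper name and call shape are hypothetical, not `train.py`'s real code:

```python
import pathlib
import shutil
import tempfile


def copy_with_progress(pairs):
    """Copy (src, dst) file pairs and return an accurate count.

    Mirrors the train.py defect: a `copied` counter that is declared but
    never incremented reports 0 regardless of how many files were copied.
    """
    copied = 0
    for src, dst in pairs:
        shutil.copy2(src, dst)
        copied += 1  # the increment missing in the original code
    return copied
```

The fix is cosmetic (progress reporting only), but it keeps training logs trustworthy during multi-day runs.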
Security Issues
| Issue | Severity | Location |
|---|---|---|
| Hardcoded API credentials | High | config.yaml |
| Hardcoded CDN access keys (4 keys) | High | cdn.yaml |
| Hardcoded model encryption key | High | security.py:67 |
| Queue credentials in plaintext | Medium | config.yaml, annotation-queue/config.yaml |
| No TLS certificate validation | Low | api_client.py |
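A common remediation for the hardcoded-credential findings is to read secrets from the environment (or a secret manager) and fail fast when they are absent. A minimal sketch, with hypothetical variable names since the report does not specify them:

```python
import os


def require_env(name: str) -> str:
    """Fetch a required secret from the environment, failing fast if absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Hypothetical names -- not taken from config.yaml / cdn.yaml / security.py:
# api_token = require_env("AZAION_API_TOKEN")
# model_key = require_env("AZAION_MODEL_KEY")
```

Failing at startup (rather than on first use) makes a missing secret an obvious deployment error instead of a mid-run crash.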
Structural Concerns
- No CI/CD pipeline or containerization
- No formal test framework (2 script-based tests, 1 broken)
- Duplicated AnnotationClass/WeatherMode code in 3 locations
- No graceful shutdown for augmentation process
- No reconnect logic for annotation queue consumer
- Manual deployment only
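For the missing graceful shutdown in the augmentation process, one conventional pattern is a signal-aware loop that finishes the current batch before exiting. A sketch assuming the 5-minute scan interval noted in Flow 2 (the class and method names are illustrative, not the project's code):

```python
import signal
import time


class GracefulLoop:
    """Run a periodic task until SIGINT/SIGTERM, finishing the current batch."""

    def __init__(self, interval_s: float = 300.0):  # 5-minute scan loop
        self.interval_s = interval_s
        self._stop = False
        signal.signal(signal.SIGINT, self._on_signal)
        signal.signal(signal.SIGTERM, self._on_signal)

    def _on_signal(self, signum, frame):
        self._stop = True  # let the in-flight iteration complete cleanly

    def run(self, task):
        while not self._stop:
            task()
            # Sleep in small slices so shutdown stays responsive.
            deadline = time.monotonic() + self.interval_s
            while not self._stop and time.monotonic() < deadline:
                time.sleep(0.1)
```

Handlers must be installed from the main thread; under systemd or Kubernetes this pattern lets the process exit cleanly on `SIGTERM` without abandoning half-written augmented files.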
Open Questions
- The `preprocessing` module may have existed previously and been deleted or renamed — its absence breaks `dataset-visualiser.py` and `tests/imagelabel_visualize_test.py`
- `exports.upload_model` may be intentionally deprecated in favor of the ApiClient-based flow in `train.py`
- The `orangepi5/` shell scripts were not analyzed (bash, not Python) — they appear to be setup/run scripts for edge deployment
- `checkpoint.txt` (2024-06-27) suggests the training infrastructure was last used in mid-2024
Artifact Index
| Path | Description | Step |
|---|---|---|
| `_docs/00_problem/problem.md` | Problem statement | 6 |
| `_docs/00_problem/restrictions.md` | Hardware, software, environment, operational restrictions | 6 |
| `_docs/00_problem/acceptance_criteria.md` | Measurable acceptance criteria from code | 6 |
| `_docs/00_problem/input_data/data_parameters.md` | Input data schemas and formats | 6 |
| `_docs/00_problem/security_approach.md` | Security mechanisms and known issues | 6 |
| `_docs/01_solution/solution.md` | Retrospective solution document | 5 |
| `_docs/02_document/00_discovery.md` | Codebase discovery: tree, tech stack, dependency graph | 0 |
| `_docs/02_document/modules/*.md` | 21 module-level documentation files | 1 |
| `_docs/02_document/components/0N_*/description.md` | 8 component specifications | 2 |
| `_docs/02_document/diagrams/components.md` | Component relationship diagram (Mermaid) | 2 |
| `_docs/02_document/architecture.md` | System architecture document | 3 |
| `_docs/02_document/system-flows.md` | 5 system flows with sequence diagrams | 3 |
| `_docs/02_document/data_model.md` | Data model with ER diagram | 3 |
| `_docs/02_document/diagrams/flows/flow_*.md` | Individual flow diagrams (4 files) | 3 |
| `_docs/02_document/04_verification_log.md` | Verification results: 87 entities, 5 bugs, 3 security issues | 4 |
| `_docs/02_document/FINAL_report.md` | This report | 7 |
| `_docs/02_document/state.json` | Document skill progress tracking | — |