# Final Documentation Report — Azaion AI Training

## Executive Summary

Azaion AI Training is a Python-based ML pipeline for training, deploying, and running YOLOv11 object detection models targeting aerial military asset recognition. The system comprises 8 components (21 modules) spanning annotation ingestion, data augmentation, GPU-accelerated training, multi-format model export, encrypted model distribution, and real-time inference — with edge deployment capability via RKNN on OrangePi5 devices.

The codebase is functional and in production use (last training run: 2024-06-27), but it has no CI/CD, no containerization, no formal test framework, and several hardcoded credentials. Verification identified 5 code bugs, 3 high-severity security issues, and 1 missing module.

## Problem Statement

The system automates detection of 17 classes of military objects and infrastructure in aerial/satellite imagery across 3 weather conditions (Normal, Winter, Night). It replaces manual image analysis with a continuous pipeline: human-annotated data arrives via RabbitMQ, is augmented 8× for training diversity, feeds multi-day GPU training runs of YOLOv11 models, and the resulting models are encrypted and distributed to inference clients that run real-time video detection.
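As a rough illustration of the 8× augmentation step, one simple way to derive eight views per image is the dihedral group of rotations and flips. This is a hypothetical NumPy-only sketch; the actual augmentation.py uses Albumentations and must also rewrite the YOLO label files to match each transform.

```python
import numpy as np

def eightfold_variants(image: np.ndarray) -> list:
    """Return the 8 dihedral variants (rotations + flips) of an image.

    Illustrative stand-in for the pipeline's 8x augmentation step; the
    real Augmentator uses Albumentations transforms, and the function
    name here is an invention for this sketch.
    """
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degree rotations
        rotated = np.rot90(image, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # plus a horizontal flip of each
    return variants

img = np.arange(12).reshape(3, 4)
print(len(eightfold_variants(img)))  # 8 augmented views per source image
```

Eight geometric variants per source image matches the 8× multiplier cited for the data pipeline, though the real transform set (color jitter, weather effects, etc.) was not inspected here.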

## Architecture Overview

**Tech stack:** Python 3.10+ · PyTorch 2.3.0 (CUDA 12.1) · Ultralytics YOLOv11m · TensorRT · ONNX Runtime · Albumentations · boto3 · rstream · cryptography

**Deployment:** 5 independent processes (no orchestration, no containers) running on GPU-equipped servers; deployment is manual.

## Component Summary

| # | Component | Modules | Purpose | Key Dependencies |
|---|---|---|---|---|
| 01 | Core Infrastructure | `constants`, `utils` | Shared paths, config keys, Dotdict helper | None |
| 02 | Security & Hardware | `security`, `hardware_service` | AES-256-CBC encryption, hardware fingerprinting | cryptography, pynvml |
| 03 | API & CDN Client | `api_client`, `cdn_manager` | REST API (JWT auth) + S3 CDN communication | requests, boto3, Security |
| 04 | Data Models | `dto/annotationClass`, `dto/imageLabel` | Annotation class definitions, image+label container | OpenCV, matplotlib |
| 05 | Data Pipeline | `augmentation`, `convert-annotations`, `dataset-visualiser` | 8× augmentation, format conversion, visualization | Albumentations, Data Models |
| 06 | Training Pipeline | `train`, `exports`, `manual_run` | Dataset formation → YOLO training → export → encrypted upload | Ultralytics, API & CDN, Security |
| 07 | Inference Engine | `inference/dto`, `onnx_engine`, `tensorrt_engine`, `inference`, `start_inference` | Model download, decryption, TensorRT/ONNX video inference | TensorRT, ONNX Runtime, PyCUDA |
| 08 | Annotation Queue | `annotation_queue_dto`, `annotation_queue_handler` | Async RabbitMQ Streams consumer for annotation CRUD events | rstream, msgpack |
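The "Dotdict helper" named for Core Infrastructure was not shown in this report; a plausible minimal reconstruction (the class shape and nested-dict wrapping are assumptions, not the actual `utils` code) might look like:

```python
class Dotdict(dict):
    """Dict subclass allowing attribute-style access, e.g. cfg.paths.data.

    Hypothetical reconstruction of the 'Dotdict helper' from utils;
    the real implementation was not inspected.
    """

    def __getattr__(self, name):
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested dicts so chained access (cfg.paths.data) also works.
        return Dotdict(value) if isinstance(value, dict) else value

    def __setattr__(self, name, value):
        self[name] = value


cfg = Dotdict({"paths": {"data": "/data", "processed": "/data-processed"}})
print(cfg.paths.processed)  # /data-processed
```

Note that the commit history shows this pattern being superseded by a Pydantic `Config` model, which adds validation that a bare dict wrapper cannot provide.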

## System Flows

| # | Flow | Entry Point | Path | Output |
|---|---|---|---|---|
| 1 | Annotation Ingestion | RabbitMQ message | Queue → Handler → Filesystem | Images + labels on disk |
| 2 | Data Augmentation | Filesystem scan (5-min loop) | `/data/` → Augmentator → `/data-processed/` | 8× augmented images + labels |
| 3 | Training Pipeline | `train.py` main | `/data-processed/` → Dataset split → YOLO train → Export → Encrypt → Upload | Encrypted model on API + CDN |
| 4 | Model Download & Inference | `start_inference.py` main | API + CDN download → Decrypt → TensorRT init → Video frames → Detections | Annotated video output |
| 5 | Model Export (Multi-Format) | `train.py` / `manual_run.py` | `.pt` → `.onnx` / `.engine` / `.rknn` | Multi-format model artifacts |
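Flow 3's "Dataset split" step can be sketched as a deterministic train/val partition. This is a minimal stdlib sketch under stated assumptions: the function name, 80/20 fraction, and fixed seed are illustrative, not taken from train.py.

```python
import random

def split_dataset(stems, val_fraction=0.2, seed=42):
    """Deterministically split image stems into (train, val) lists.

    Hedged sketch of the 'Dataset split' step in Flow 3; fraction,
    seed, and name are assumptions rather than values from train.py.
    """
    rng = random.Random(seed)       # fixed seed -> reproducible splits
    shuffled = list(stems)
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

train, val = split_dataset([f"img_{i:04d}" for i in range(100)])
print(len(train), len(val))  # 80 20
```

A seeded split keeps multi-day training runs comparable across restarts, which matters for a pipeline with no orchestration layer to track experiment state.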

## Risk Observations

### Code Bugs (from Verification)

| # | Location | Issue | Impact |
|---|---|---|---|
| 1 | `augmentation.py:118` | `self.total_to_process` undefined (should be `self.total_images_to_process`) | AttributeError during progress logging |
| 2 | `train.py:93,99` | `copied` counter never incremented | Incorrect progress reporting (cosmetic) |
| 3 | `exports.py:97` | Stale `ApiClient(ApiCredentials(...))` constructor call with wrong params | `upload_model` function would fail at runtime |
| 4 | `inference/tensorrt_engine.py:43-44` | `batch_size` uninitialized for dynamic batch dimensions | NameError for models with dynamic batch size |
| 5 | `dataset-visualiser.py:6` | Imports from `preprocessing` module that doesn't exist | Script cannot run |
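Bug #1 is a plain attribute-name mismatch. The following is a condensed reconstruction of the corrected pattern — the class shape is simplified for illustration and is not the real Augmentator:

```python
class Augmentator:
    """Minimal repro of bug #1: progress-attribute name mismatch."""

    def __init__(self, total_images_to_process: int):
        self.total_images_to_process = total_images_to_process
        self.processed = 0

    def log_progress(self) -> str:
        # The buggy line read self.total_to_process, which was never
        # assigned, so the first progress log raised AttributeError.
        # The fix references the attribute actually set in __init__.
        self.processed += 1
        return f"{self.processed}/{self.total_images_to_process}"


aug = Augmentator(total_images_to_process=8)
print(aug.log_progress())  # 1/8
```

Bugs of this shape only surface at runtime on the affected code path, which is exactly the class of error a formal test framework (currently absent) would catch before a multi-day training run.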

### Security Issues

| Issue | Severity | Location |
|---|---|---|
| Hardcoded API credentials | High | `config.yaml` |
| Hardcoded CDN access keys (4 keys) | High | `cdn.yaml` |
| Hardcoded model encryption key | High | `security.py:67` |
| Queue credentials in plaintext | Medium | `config.yaml`, `annotation-queue/config.yaml` |
| No TLS certificate validation | Low | `api_client.py` |
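One standard remediation for the hardcoded-credential findings is to move secrets out of tracked files and into the deployment environment (or a secret manager), failing fast when one is missing. A sketch, with the variable name and helper invented for illustration:

```python
import os

def load_required_secret(name: str) -> str:
    """Fetch a secret from the environment, failing fast if absent.

    Illustrative remediation sketch; the env-var name below is an
    assumption, not one the codebase actually uses.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"required secret {name!r} is not set")
    return value


# In practice this value is injected by the deploy environment,
# never committed; set here only so the example runs.
os.environ["AZAION_MODEL_KEY"] = "example-only"
print(load_required_secret("AZAION_MODEL_KEY"))  # example-only
```

Failing fast at startup is preferable to a silent fallback default, which would reproduce the original problem of a known key shipping with the code.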

### Structural Concerns

- No CI/CD pipeline or containerization
- No formal test framework (2 script-based tests, 1 broken)
- Duplicated AnnotationClass/WeatherMode code in 3 locations
- No graceful shutdown for augmentation process
- No reconnect logic for annotation queue consumer
- Manual deployment only
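The missing graceful shutdown for the augmentation process could be addressed with a signal-driven, interruptible loop. A sketch under stated assumptions — function names are invented, and the 300 s interval is inferred from the 5-min scan in Flow 2:

```python
import signal
import threading

# Shared flag flipped by SIGTERM/SIGINT; the loop checks it between
# batches so the current pass can finish cleanly before exit.
stop = threading.Event()

def _handle_term(signum, frame):
    stop.set()

signal.signal(signal.SIGTERM, _handle_term)
signal.signal(signal.SIGINT, _handle_term)

def augmentation_loop(process_pending, interval_s=300):
    """Run one augmentation pass per interval until asked to stop."""
    while not stop.is_set():
        process_pending()       # augment whatever arrived since last pass
        stop.wait(interval_s)   # interruptible sleep instead of time.sleep
```

Using `Event.wait` instead of `time.sleep` means a termination signal interrupts the idle period immediately rather than after up to five minutes.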

## Open Questions

- The `preprocessing` module may have existed previously and been deleted or renamed — its absence breaks `dataset-visualiser.py` and `tests/imagelabel_visualize_test.py`
- `exports.upload_model` may be intentionally deprecated in favor of the `ApiClient`-based flow in `train.py`
- The `orangepi5/` shell scripts were not analyzed (bash, not Python) — they appear to be setup/run scripts for edge deployment
- `checkpoint.txt` (2024-06-27) suggests the training infrastructure was last used in mid-2024

## Artifact Index

| Path | Description | Step |
|---|---|---|
| `_docs/00_problem/problem.md` | Problem statement | 6 |
| `_docs/00_problem/restrictions.md` | Hardware, software, environment, operational restrictions | 6 |
| `_docs/00_problem/acceptance_criteria.md` | Measurable acceptance criteria from code | 6 |
| `_docs/00_problem/input_data/data_parameters.md` | Input data schemas and formats | 6 |
| `_docs/00_problem/security_approach.md` | Security mechanisms and known issues | 6 |
| `_docs/01_solution/solution.md` | Retrospective solution document | 5 |
| `_docs/02_document/00_discovery.md` | Codebase discovery: tree, tech stack, dependency graph | 0 |
| `_docs/02_document/modules/*.md` | 21 module-level documentation files | 1 |
| `_docs/02_document/components/0N_*/description.md` | 8 component specifications | 2 |
| `_docs/02_document/diagrams/components.md` | Component relationship diagram (Mermaid) | 2 |
| `_docs/02_document/architecture.md` | System architecture document | 3 |
| `_docs/02_document/system-flows.md` | 5 system flow diagrams with sequence diagrams | 3 |
| `_docs/02_document/data_model.md` | Data model with ER diagram | 3 |
| `_docs/02_document/diagrams/flows/flow_*.md` | Individual flow diagrams (4 files) | 3 |
| `_docs/02_document/04_verification_log.md` | Verification results: 87 entities, 5 bugs, 3 security issues | 4 |
| `_docs/02_document/FINAL_report.md` | This report | 7 |
| `_docs/02_document/state.json` | Document skill progress tracking | |