Final Documentation Report — Azaion AI Training
Executive Summary
Azaion AI Training is a Python-based ML pipeline for training, deploying, and running YOLOv11 object detection models targeting aerial military asset recognition. The system comprises 8 components (21 modules) spanning annotation ingestion, data augmentation, GPU-accelerated training, multi-format model export, encrypted model distribution, and real-time inference — with edge deployment capability via RKNN on OrangePi5 devices.
The codebase is functional and production-used (last training run: 2024-06-27) but has no CI/CD, no containerization, no formal test framework, and several hardcoded credentials. Verification identified 5 code bugs, 3 high-severity security issues, and 1 missing module.
Problem Statement
The system automates detection of 17 classes of military objects and infrastructure in aerial/satellite imagery across 3 weather conditions (Normal, Winter, Night). It replaces manual image analysis with a continuous pipeline: human-annotated data flows in via RabbitMQ, is augmented 8× for training diversity, trains YOLOv11 models over multi-day GPU runs, and distributes encrypted models to inference clients that run real-time video detection.
Architecture Overview
Tech stack: Python 3.10+ · PyTorch 2.3.0 (CUDA 12.1) · Ultralytics YOLOv11m · TensorRT · ONNX Runtime · Albumentations · boto3 · rstream · cryptography
Deployment: 5 independent processes (no orchestration, no containers) running on GPU-equipped servers, deployed manually.
Component Summary
| # | Component | Modules | Purpose | Key Dependencies |
|---|---|---|---|---|
| 01 | Core Infrastructure | constants, utils | Shared paths, config keys, Dotdict helper | None |
| 02 | Security & Hardware | security, hardware_service | AES-256-CBC encryption, hardware fingerprinting | cryptography, pynvml |
| 03 | API & CDN Client | api_client, cdn_manager | REST API (JWT auth) + S3 CDN communication | requests, boto3, Security |
| 04 | Data Models | dto/annotationClass, dto/imageLabel | Annotation class definitions, image+label container | OpenCV, matplotlib |
| 05 | Data Pipeline | augmentation, convert-annotations, dataset-visualiser | 8× augmentation, format conversion, visualization | Albumentations, Data Models |
| 06 | Training Pipeline | train, exports, manual_run | Dataset formation → YOLO training → export → encrypted upload | Ultralytics, API & CDN, Security |
| 07 | Inference Engine | inference/dto, onnx_engine, tensorrt_engine, inference, start_inference | Model download, decryption, TensorRT/ONNX video inference | TensorRT, ONNX Runtime, PyCUDA |
| 08 | Annotation Queue | annotation_queue_dto, annotation_queue_handler | Async RabbitMQ Streams consumer for annotation CRUD events | rstream, msgpack |
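Component 01 exposes a `Dotdict` helper for shared configuration access. Its actual implementation is not reproduced in this report; a minimal sketch of the usual attribute-access-dict pattern (an assumption about its shape, not the project's code) is:

```python
class Dotdict(dict):
    """Dict subclass allowing attribute-style access: d.key == d["key"]."""
    __getattr__ = dict.__getitem__   # d.key        -> d["key"]
    __setattr__ = dict.__setitem__   # d.key = v    -> d["key"] = v
    __delattr__ = dict.__delitem__   # del d.key    -> del d["key"]


# Hypothetical usage; key names are illustrative only.
cfg = Dotdict({"data_dir": "/data", "epochs": 100})
print(cfg.data_dir)  # -> /data
```

Note that with this pattern a missing key raises `KeyError` rather than `AttributeError`, which can surprise code using `hasattr`.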
System Flows
| # | Flow | Entry Point | Path | Output |
|---|---|---|---|---|
| 1 | Annotation Ingestion | RabbitMQ message | Queue → Handler → Filesystem | Images + labels on disk |
| 2 | Data Augmentation | Filesystem scan (5-min loop) | /data/ → Augmentator → /data-processed/ | 8× augmented images + labels |
| 3 | Training Pipeline | train.py main | /data-processed/ → Dataset split → YOLO train → Export → Encrypt → Upload | Encrypted model on API + CDN |
| 4 | Model Download & Inference | start_inference.py main | API + CDN download → Decrypt → TensorRT init → Video frames → Detections | Annotated video output |
| 5 | Model Export (Multi-Format) | train.py / manual_run.py | .pt → .onnx / .engine / .rknn | Multi-format model artifacts |
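Flows 3 and 4 encrypt the model before upload and decrypt it after download. The report does not show how `security.py` frames its AES-256-CBC payload; the following is a generic sketch using the `cryptography` package from the tech stack, with an IV-prefixed ciphertext layout assumed purely for illustration:

```python
import os

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes


def encrypt_model(data: bytes, key: bytes) -> bytes:
    """AES-256-CBC encrypt; the 16-byte IV is prepended to the ciphertext."""
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()          # CBC requires block-aligned input
    padded = padder.update(data) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()


def decrypt_model(blob: bytes, key: bytes) -> bytes:
    """Reverse of encrypt_model: split off the IV, decrypt, strip padding."""
    iv, ct = blob[:16], blob[16:]
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(ct) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()
```

A 32-byte key gives AES-256; in the actual system the key should come from a secret store rather than source code (see the hardcoded-key finding below).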
Risk Observations
Code Bugs (from Verification)
| # | Location | Issue | Impact |
|---|---|---|---|
| 1 | augmentation.py:118 | `self.total_to_process` undefined (should be `self.total_images_to_process`) | AttributeError during progress logging |
| 2 | train.py:93,99 | `copied` counter never incremented | Incorrect progress reporting (cosmetic) |
| 3 | exports.py:97 | Stale `ApiClient(ApiCredentials(...))` constructor call with wrong params | `upload_model` function would fail at runtime |
| 4 | inference/tensorrt_engine.py:43-44 | `batch_size` uninitialized for dynamic batch dimensions | NameError for models with dynamic batch size |
| 5 | dataset-visualiser.py:6 | Imports from a `preprocessing` module that doesn't exist | Script cannot run |
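Bugs 1 and 2 are one-line fixes (renaming the attribute, and incrementing the counter inside the copy loop). As an illustration of bug 2, here is a self-contained sketch of a copy loop whose counter is actually incremented; the helper name and call shape are hypothetical, not `train.py`'s real code:

```python
import pathlib
import shutil
import tempfile


def copy_with_progress(pairs):
    """Copy (src, dst) file pairs and return an accurate count.

    Mirrors the train.py defect: a `copied` counter that is declared but
    never incremented reports 0 regardless of how many files were copied.
    """
    copied = 0
    for src, dst in pairs:
        shutil.copy2(src, dst)
        copied += 1  # the increment missing in the original code
    return copied
```

The fix is cosmetic (progress reporting only), but it keeps training logs trustworthy during multi-day runs.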
Security Issues
| Issue | Severity | Location |
|---|---|---|
| Hardcoded API credentials | High | config.yaml |
| Hardcoded CDN access keys (4 keys) | High | cdn.yaml |
| Hardcoded model encryption key | High | security.py:67 |
| Queue credentials in plaintext | Medium | config.yaml, annotation-queue/config.yaml |
| No TLS certificate validation | Low | api_client.py |
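A common remediation for the hardcoded-credential findings is to read secrets from the environment (or a secret manager) and fail fast when they are absent. A minimal sketch, with hypothetical variable names since the report does not specify them:

```python
import os


def require_env(name: str) -> str:
    """Fetch a required secret from the environment, failing fast if absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Hypothetical names -- not taken from config.yaml / cdn.yaml / security.py:
# api_token = require_env("AZAION_API_TOKEN")
# model_key = require_env("AZAION_MODEL_KEY")
```

Failing at startup (rather than on first use) makes a missing secret an obvious deployment error instead of a mid-run crash.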
Structural Concerns
- No CI/CD pipeline or containerization
- No formal test framework (2 script-based tests, 1 broken)
- Duplicated AnnotationClass/WeatherMode code in 3 locations
- No graceful shutdown for augmentation process
- No reconnect logic for annotation queue consumer
- Manual deployment only
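For the missing graceful shutdown in the augmentation process, one conventional pattern is a signal-aware loop that finishes the current batch before exiting. A sketch assuming the 5-minute scan interval noted in Flow 2 (the class and method names are illustrative, not the project's code):

```python
import signal
import time


class GracefulLoop:
    """Run a periodic task until SIGINT/SIGTERM, finishing the current batch."""

    def __init__(self, interval_s: float = 300.0):  # 5-minute scan loop
        self.interval_s = interval_s
        self._stop = False
        signal.signal(signal.SIGINT, self._on_signal)
        signal.signal(signal.SIGTERM, self._on_signal)

    def _on_signal(self, signum, frame):
        self._stop = True  # let the in-flight iteration complete cleanly

    def run(self, task):
        while not self._stop:
            task()
            # Sleep in small slices so shutdown stays responsive.
            deadline = time.monotonic() + self.interval_s
            while not self._stop and time.monotonic() < deadline:
                time.sleep(0.1)
```

Handlers must be installed from the main thread; under systemd or Kubernetes this pattern lets the process exit cleanly on `SIGTERM` without abandoning half-written augmented files.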
Open Questions
- The `preprocessing` module may have existed previously and been deleted or renamed — its absence breaks `dataset-visualiser.py` and `tests/imagelabel_visualize_test.py`
- `exports.upload_model` may be intentionally deprecated in favor of the ApiClient-based flow in `train.py`
- The `orangepi5/` shell scripts were not analyzed (bash, not Python) — they appear to be setup/run scripts for edge deployment
- `checkpoint.txt` (2024-06-27) suggests the training infrastructure was last used in mid-2024
Artifact Index
| Path | Description | Step |
|---|---|---|
| `_docs/00_problem/problem.md` | Problem statement | 6 |
| `_docs/00_problem/restrictions.md` | Hardware, software, environment, operational restrictions | 6 |
| `_docs/00_problem/acceptance_criteria.md` | Measurable acceptance criteria from code | 6 |
| `_docs/00_problem/input_data/data_parameters.md` | Input data schemas and formats | 6 |
| `_docs/00_problem/security_approach.md` | Security mechanisms and known issues | 6 |
| `_docs/01_solution/solution.md` | Retrospective solution document | 5 |
| `_docs/02_document/00_discovery.md` | Codebase discovery: tree, tech stack, dependency graph | 0 |
| `_docs/02_document/modules/*.md` | 21 module-level documentation files | 1 |
| `_docs/02_document/components/0N_*/description.md` | 8 component specifications | 2 |
| `_docs/02_document/diagrams/components.md` | Component relationship diagram (Mermaid) | 2 |
| `_docs/02_document/architecture.md` | System architecture document | 3 |
| `_docs/02_document/system-flows.md` | 5 system flows with sequence diagrams | 3 |
| `_docs/02_document/data_model.md` | Data model with ER diagram | 3 |
| `_docs/02_document/diagrams/flows/flow_*.md` | Individual flow diagrams (4 files) | 3 |
| `_docs/02_document/04_verification_log.md` | Verification results: 87 entities, 5 bugs, 3 security issues | 4 |
| `_docs/02_document/FINAL_report.md` | This report | 7 |
| `_docs/02_document/state.json` | Document skill progress tracking | — |