# Final Documentation Report — Azaion AI Training
## Executive Summary
Azaion AI Training is a Python-based ML pipeline for training, deploying, and running YOLOv11 object detection models targeting aerial military asset recognition. The system comprises 8 components (21 modules) spanning annotation ingestion, data augmentation, GPU-accelerated training, multi-format model export, encrypted model distribution, and real-time inference — with edge deployment capability via RKNN on OrangePi5 devices.
The codebase is functional and in production use (last training run: 2024-06-27), but it has no CI/CD, no containerization, no formal test framework, and several hardcoded credentials. Verification identified 5 code bugs, 3 high-severity security issues, and 1 missing module.
## Problem Statement
The system automates detection of 17 classes of military objects and infrastructure in aerial/satellite imagery across 3 weather conditions (Normal, Winter, Night). It replaces manual image analysis with a continuous pipeline: human-annotated data flows in via RabbitMQ, is augmented 8× for training diversity, trains YOLOv11 models over multi-day GPU runs, and distributes encrypted models to inference clients that run real-time video detection.
## Architecture Overview
**Tech stack**: Python 3.10+ · PyTorch 2.3.0 (CUDA 12.1) · Ultralytics YOLOv11m · TensorRT · ONNX Runtime · Albumentations · boto3 · rstream · cryptography
**Deployment**: 5 independent processes (no orchestration, no containers) running on GPU-equipped servers. Manual deployment.
## Component Summary
| # | Component | Modules | Purpose | Key Dependencies |
|---|-----------|---------|---------|-----------------|
| 01 | Core Infrastructure | constants, utils | Shared paths, config keys, Dotdict helper | None |
| 02 | Security & Hardware | security, hardware_service | AES-256-CBC encryption, hardware fingerprinting | cryptography, pynvml |
| 03 | API & CDN Client | api_client, cdn_manager | REST API (JWT auth) + S3 CDN communication | requests, boto3, Security |
| 04 | Data Models | dto/annotationClass, dto/imageLabel | Annotation class definitions, image+label container | OpenCV, matplotlib |
| 05 | Data Pipeline | augmentation, convert-annotations, dataset-visualiser | 8× augmentation, format conversion, visualization | Albumentations, Data Models |
| 06 | Training Pipeline | train, exports, manual_run | Dataset formation → YOLO training → export → encrypted upload | Ultralytics, API & CDN, Security |
| 07 | Inference Engine | inference/dto, onnx_engine, tensorrt_engine, inference, start_inference | Model download, decryption, TensorRT/ONNX video inference | TensorRT, ONNX Runtime, PyCUDA |
| 08 | Annotation Queue | annotation_queue_dto, annotation_queue_handler | Async RabbitMQ Streams consumer for annotation CRUD events | rstream, msgpack |
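The Security & Hardware component encrypts models with AES-256-CBC via the `cryptography` library. As a point of reference, a minimal sketch of that scheme is below; the function names and the IV-prepending convention are illustrative assumptions, not the project's actual `security.py` API:

```python
import os

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes


def encrypt_model(data: bytes, key: bytes) -> bytes:
    """AES-256-CBC encrypt; a random 16-byte IV is prepended to the ciphertext."""
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()  # CBC needs block-aligned input
    padded = padder.update(data) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + encryptor.update(padded) + encryptor.finalize()


def decrypt_model(blob: bytes, key: bytes) -> bytes:
    """Split off the IV, decrypt, and strip PKCS7 padding."""
    iv, ciphertext = blob[:16], blob[16:]
    decryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = decryptor.update(ciphertext) + decryptor.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()


# Round-trip with a freshly generated 256-bit key.
key = os.urandom(32)
blob = encrypt_model(b"model weights", key)
assert decrypt_model(blob, key) == b"model weights"
```

Note that generating the key at runtime, as above, is exactly what the verification log flags as missing: the real code hardcodes the key (see Security Issues below).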
## System Flows
| # | Flow | Entry Point | Path | Output |
|---|------|-------------|------|--------|
| 1 | Annotation Ingestion | RabbitMQ message | Queue → Handler → Filesystem | Images + labels on disk |
| 2 | Data Augmentation | Filesystem scan (5-min loop) | /data/ → Augmentator → /data-processed/ | 8× augmented images + labels |
| 3 | Training Pipeline | train.py __main__ | /data-processed/ → Dataset split → YOLO train → Export → Encrypt → Upload | Encrypted model on API + CDN |
| 4 | Model Download & Inference | start_inference.py __main__ | API + CDN download → Decrypt → TensorRT init → Video frames → Detections | Annotated video output |
| 5 | Model Export (Multi-Format) | train.py / manual_run.py | .pt → .onnx / .engine / .rknn | Multi-format model artifacts |
## Risk Observations
### Code Bugs (from Verification)
| # | Location | Issue | Impact |
|---|----------|-------|--------|
| 1 | augmentation.py:118 | `self.total_to_process` undefined (should be `self.total_images_to_process`) | AttributeError during progress logging |
| 2 | train.py:93,99 | `copied` counter never incremented | Incorrect progress reporting (cosmetic) |
| 3 | exports.py:97 | Stale `ApiClient(ApiCredentials(...))` constructor call with wrong params | `upload_model` function would fail at runtime |
| 4 | inference/tensorrt_engine.py:43-44 | `batch_size` uninitialized for dynamic batch dimensions | NameError for models with dynamic batch size |
| 5 | dataset-visualiser.py:6 | Imports from `preprocessing` module that doesn't exist | Script cannot run |
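Bug 1 is ordinary attribute-name drift. The actual `augmentation.py` source is not reproduced here, so the following is a reconstructed sketch of the pattern (class shape and method names are assumptions); it shows the fixed version, referencing the attribute that is actually defined:

```python
class Augmentator:
    """Illustrative shape only; the real class does far more."""

    def __init__(self, images: list) -> None:
        # This is the attribute that exists on the instance.
        self.total_images_to_process = len(images)
        self.processed = 0

    def log_progress(self) -> str:
        # The bug read self.total_to_process, which was never assigned,
        # raising AttributeError at progress-logging time. The fix is
        # to reference the defined name:
        return f"{self.processed}/{self.total_images_to_process} images"
```

The other bugs follow similar drift: exports.py still calls a pre-refactor `ApiClient` constructor signature, and dataset-visualiser.py imports a module that no longer exists.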
### Security Issues
| Issue | Severity | Location |
|-------|----------|----------|
| Hardcoded API credentials | High | config.yaml |
| Hardcoded CDN access keys (4 keys) | High | cdn.yaml |
| Hardcoded model encryption key | High | security.py:67 |
| Queue credentials in plaintext | Medium | config.yaml, annotation-queue/config.yaml |
| No TLS certificate validation | Low | api_client.py |
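The three high-severity findings share one remediation: move secrets out of tracked files and into the environment (or a secrets manager). A minimal fail-fast sketch, with hypothetical variable names (the project's actual config keys may differ):

```python
import os


def require_env(name: str) -> str:
    """Return a required secret from the environment, failing fast if absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Hypothetical names standing in for the values currently hardcoded
# in config.yaml, cdn.yaml, and security.py.
# api_token = require_env("AZAION_API_TOKEN")
# cdn_access_key = require_env("AZAION_CDN_ACCESS_KEY")
# model_key_hex = require_env("AZAION_MODEL_KEY")
```

Failing fast at startup is deliberate: a missing credential surfaces immediately rather than midway through a multi-day training run.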
### Structural Concerns
- No CI/CD pipeline or containerization
- No formal test framework (2 script-based tests, 1 broken)
- Duplicated AnnotationClass/WeatherMode code in 3 locations
- No graceful shutdown for augmentation process
- No reconnect logic for annotation queue consumer
- Manual deployment only
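The missing graceful shutdown matters because the augmentation process sleeps in a 5-minute scan loop (Flow 2): a plain `kill` can interrupt it mid-batch. One conventional fix is an `Event`-based loop with signal handlers, sketched below; `process_pending_images` is a placeholder, not the project's function:

```python
import signal
import threading

stop = threading.Event()


def _request_stop(signum, frame):
    # Ask the loop to finish its current pass, then exit cleanly.
    stop.set()


# SIGTERM/SIGINT now trigger a clean exit instead of killing mid-batch.
signal.signal(signal.SIGTERM, _request_stop)
signal.signal(signal.SIGINT, _request_stop)


def process_pending_images() -> None:
    """Placeholder for one augmentation pass over /data/."""


def augmentation_loop(scan_interval: float = 300.0) -> None:
    while not stop.is_set():
        process_pending_images()
        # Event.wait() returns the moment stop is set, so shutdown does
        # not block for the full 5-minute scan interval.
        stop.wait(timeout=scan_interval)
```

The same `Event` could also gate a reconnect loop in the annotation queue consumer, addressing the missing-reconnect concern with the same primitive.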
## Open Questions
- The `preprocessing` module may have existed previously and been deleted or renamed — its absence breaks `dataset-visualiser.py` and `tests/imagelabel_visualize_test.py`
- `exports.upload_model` may be intentionally deprecated in favor of the ApiClient-based flow in `train.py`
- The `orangepi5/` shell scripts were not analyzed (bash, not Python) — they appear to be setup/run scripts for edge deployment
- `checkpoint.txt` (2024-06-27) suggests training infrastructure was last used in mid-2024
## Artifact Index
| Path | Description | Step |
|------|-------------|------|
| `_docs/00_problem/problem.md` | Problem statement | 6 |
| `_docs/00_problem/restrictions.md` | Hardware, software, environment, operational restrictions | 6 |
| `_docs/00_problem/acceptance_criteria.md` | Measurable acceptance criteria from code | 6 |
| `_docs/00_problem/input_data/data_parameters.md` | Input data schemas and formats | 6 |
| `_docs/00_problem/security_approach.md` | Security mechanisms and known issues | 6 |
| `_docs/01_solution/solution.md` | Retrospective solution document | 5 |
| `_docs/02_document/00_discovery.md` | Codebase discovery: tree, tech stack, dependency graph | 0 |
| `_docs/02_document/modules/*.md` | 21 module-level documentation files | 1 |
| `_docs/02_document/components/0N_*/description.md` | 8 component specifications | 2 |
| `_docs/02_document/diagrams/components.md` | Component relationship diagram (Mermaid) | 2 |
| `_docs/02_document/architecture.md` | System architecture document | 3 |
| `_docs/02_document/system-flows.md` | 5 system flow diagrams with sequence diagrams | 3 |
| `_docs/02_document/data_model.md` | Data model with ER diagram | 3 |
| `_docs/02_document/diagrams/flows/flow_*.md` | Individual flow diagrams (4 files) | 3 |
| `_docs/02_document/04_verification_log.md` | Verification results: 87 entities, 5 bugs, 3 security issues | 4 |
| `_docs/02_document/FINAL_report.md` | This report | 7 |
| `_docs/02_document/state.json` | Document skill progress tracking | — |