mirror of
https://github.com/azaion/ai-training.git
synced 2026-04-22 22:36:36 +00:00
142c6c4de8
- Replaced module-level path variables in constants.py with a structured Pydantic Config class.
- Updated all relevant modules (train.py, augmentation.py, exports.py, dataset-visualiser.py, manual_run.py) to access paths through the new config structure.
- Fixed bugs related to image processing and model saving.
- Enhanced test infrastructure to accommodate the new configuration approach.

This refactor improves code maintainability and clarity by centralizing configuration management.
101 lines
7.0 KiB
Markdown
# Final Documentation Report — Azaion AI Training
## Executive Summary
Azaion AI Training is a Python-based ML pipeline for training, deploying, and running YOLOv11 object detection models targeting aerial military asset recognition. The system comprises 8 components (21 modules) spanning annotation ingestion, data augmentation, GPU-accelerated training, multi-format model export, encrypted model distribution, and real-time inference — with edge deployment capability via RKNN on OrangePi5 devices.
The codebase is functional and has been used in production (last training run: 2024-06-27), but it has no CI/CD, no containerization, no formal test framework, and several hardcoded credentials. Verification identified 5 code bugs, 3 high-severity security issues, and 1 missing module.
## Problem Statement
The system automates detection of 17 classes of military objects and infrastructure in aerial/satellite imagery across 3 weather conditions (Normal, Winter, Night). It replaces manual image analysis with a continuous pipeline: human-annotated data flows in via RabbitMQ and is augmented 8× for training diversity; YOLOv11 models are then trained over multi-day GPU runs, encrypted, and distributed to inference clients that run real-time video detection.
## Architecture Overview
**Tech stack**: Python 3.10+ · PyTorch 2.3.0 (CUDA 12.1) · Ultralytics YOLOv11m · TensorRT · ONNX Runtime · Albumentations · boto3 · rstream · cryptography
**Deployment**: 5 independent processes (no orchestration, no containers) running on GPU-equipped servers. Manual deployment.
## Component Summary
| # | Component | Modules | Purpose | Key Dependencies |
|---|-----------|---------|---------|------------------|
| 01 | Core Infrastructure | constants, utils | Shared paths, config keys, Dotdict helper | None |
| 02 | Security & Hardware | security, hardware_service | AES-256-CBC encryption, hardware fingerprinting | cryptography, pynvml |
| 03 | API & CDN Client | api_client, cdn_manager | REST API (JWT auth) + S3 CDN communication | requests, boto3, Security |
| 04 | Data Models | dto/annotationClass, dto/imageLabel | Annotation class definitions, image+label container | OpenCV, matplotlib |
| 05 | Data Pipeline | augmentation, convert-annotations, dataset-visualiser | 8× augmentation, format conversion, visualization | Albumentations, Data Models |
| 06 | Training Pipeline | train, exports, manual_run | Dataset formation → YOLO training → export → encrypted upload | Ultralytics, API & CDN, Security |
| 07 | Inference Engine | inference/dto, onnx_engine, tensorrt_engine, inference, start_inference | Model download, decryption, TensorRT/ONNX video inference | TensorRT, ONNX Runtime, PyCUDA |
| 08 | Annotation Queue | annotation_queue_dto, annotation_queue_handler | Async RabbitMQ Streams consumer for annotation CRUD events | rstream, msgpack |
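The AES-256-CBC mechanism in component 02 can be sketched with the `cryptography` package already in the stack. This is a generic illustration, not the repository's actual scheme: the function names, IV-prepended framing, and PKCS7 padding here are assumptions.

```python
import os

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes


def encrypt_model(data: bytes, key: bytes) -> bytes:
    """Encrypt a model blob with AES-256-CBC; the IV is prepended to the ciphertext."""
    iv = os.urandom(16)                    # fresh random IV per encryption
    padder = padding.PKCS7(128).padder()   # pad to the 16-byte AES block size
    padded = padder.update(data) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + encryptor.update(padded) + encryptor.finalize()


def decrypt_model(blob: bytes, key: bytes) -> bytes:
    """Reverse the framing: split off the IV, decrypt, strip the padding."""
    iv, ciphertext = blob[:16], blob[16:]
    decryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = decryptor.update(ciphertext) + decryptor.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()
```

Note that CBC alone provides confidentiality but no integrity; a real distribution path would also authenticate the blob (or use an AEAD mode).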
## System Flows
| # | Flow | Entry Point | Path | Output |
|---|------|-------------|------|--------|
| 1 | Annotation Ingestion | RabbitMQ message | Queue → Handler → Filesystem | Images + labels on disk |
| 2 | Data Augmentation | Filesystem scan (5-min loop) | /data/ → Augmentator → /data-processed/ | 8× augmented images + labels |
| 3 | Training Pipeline | train.py __main__ | /data-processed/ → Dataset split → YOLO train → Export → Encrypt → Upload | Encrypted model on API + CDN |
| 4 | Model Download & Inference | start_inference.py __main__ | API + CDN download → Decrypt → TensorRT init → Video frames → Detections | Annotated video output |
| 5 | Model Export (Multi-Format) | train.py / manual_run.py | .pt → .onnx / .engine / .rknn | Multi-format model artifacts |
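The format-conversion step in the data pipeline (convert-annotations) was not reproduced here; as a hedged sketch, the core of any such step is mapping pixel-space boxes to the normalized `class x_center y_center width height` lines YOLO label files use. Function names are illustrative:

```python
def to_yolo(box, img_w, img_h):
    """Convert a pixel-space (x_min, y_min, x_max, y_max) box to the
    normalized (x_center, y_center, width, height) tuple YOLO labels use."""
    x_min, y_min, x_max, y_max = box
    return (
        (x_min + x_max) / 2 / img_w,   # x_center, normalized to [0, 1]
        (y_min + y_max) / 2 / img_h,   # y_center, normalized to [0, 1]
        (x_max - x_min) / img_w,       # width, normalized
        (y_max - y_min) / img_h,       # height, normalized
    )


def yolo_label_line(class_id, box, img_w, img_h):
    """Render one '<class> <xc> <yc> <w> <h>' label-file line."""
    xc, yc, w, h = to_yolo(box, img_w, img_h)
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"
```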
## Risk Observations
### Code Bugs (from Verification)
| # | Location | Issue | Impact |
|---|----------|-------|--------|
| 1 | augmentation.py:118 | `self.total_to_process` undefined (should be `self.total_images_to_process`) | AttributeError during progress logging |
| 2 | train.py:93,99 | `copied` counter never incremented | Incorrect progress reporting (cosmetic) |
| 3 | exports.py:97 | Stale `ApiClient(ApiCredentials(...))` constructor call with wrong params | `upload_model` function would fail at runtime |
| 4 | inference/tensorrt_engine.py:43-44 | `batch_size` uninitialized for dynamic batch dimensions | NameError for models with dynamic batch size |
| 5 | dataset-visualiser.py:6 | Imports from `preprocessing` module that doesn't exist | Script cannot run |
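Bug 1's failure mode is an ordinary attribute misspelling that Python only surfaces at call time. A minimal reproduction (class and method names simplified from the real module):

```python
class Augmentator:
    """Minimal stand-in for the real augmentation class (names simplified)."""

    def __init__(self, total):
        self.total_images_to_process = total  # the attribute that actually exists

    def log_progress_buggy(self, done):
        # Mirrors augmentation.py:118 -- references an attribute never assigned,
        # so this raises AttributeError the first time progress is logged.
        return f"{done}/{self.total_to_process}"

    def log_progress_fixed(self, done):
        return f"{done}/{self.total_images_to_process}"
```

Because the broken path only runs during progress logging, it survives any test that never exercises logging, which is one argument for the formal test framework noted below.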
### Security Issues
| Issue | Severity | Location |
|-------|----------|----------|
| Hardcoded API credentials | High | config.yaml |
| Hardcoded CDN access keys (4 keys) | High | cdn.yaml |
| Hardcoded model encryption key | High | security.py:67 |
| Queue credentials in plaintext | Medium | config.yaml, annotation-queue/config.yaml |
| No TLS certificate validation | Low | api_client.py |
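The standard remediation for the hardcoded-credential findings is to read secrets from the environment (or a secrets manager) and fail fast when they are absent. A minimal sketch; the variable names `AZAION_API_USER` / `AZAION_API_PASSWORD` are illustrative, not the repository's actual configuration keys:

```python
import os


def load_api_credentials():
    """Read API credentials from the environment instead of config.yaml.

    The variable names are hypothetical placeholders; the point is that
    missing secrets raise immediately at startup rather than being baked
    into a file that ends up in version control.
    """
    user = os.environ.get("AZAION_API_USER")
    password = os.environ.get("AZAION_API_PASSWORD")
    if not user or not password:
        raise RuntimeError(
            "API credentials missing: set AZAION_API_USER and AZAION_API_PASSWORD"
        )
    return user, password
```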
### Structural Concerns
- No CI/CD pipeline or containerization
- No formal test framework (2 script-based tests, 1 broken)
- Duplicated AnnotationClass/WeatherMode code in 3 locations
- No graceful shutdown for augmentation process
- No reconnect logic for annotation queue consumer
- Manual deployment only
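One way to address the missing graceful shutdown for the 5-minute augmentation loop is a signal-driven stop flag, so SIGTERM finishes the current pass instead of killing a half-written batch. A sketch under assumed names (`process_once` stands in for one augmentation pass):

```python
import signal
import threading

stop = threading.Event()


def _request_shutdown(signum, frame):
    """Signal handler: ask the polling loop to exit after the current pass."""
    stop.set()


def run_augmentation_loop(process_once, interval_s=300):
    """Poll every interval_s seconds (the 5-minute scan) until shutdown.

    Event.wait() doubles as an interruptible sleep: a SIGTERM/SIGINT during
    the wait wakes the loop immediately instead of after the full interval.
    Returns the number of completed passes.
    """
    signal.signal(signal.SIGTERM, _request_shutdown)
    signal.signal(signal.SIGINT, _request_shutdown)
    passes = 0
    while not stop.is_set():
        process_once()
        passes += 1
        stop.wait(interval_s)  # returns early if shutdown was requested
    return passes
```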
## Open Questions
- The `preprocessing` module may have existed previously and been deleted or renamed — its absence breaks `dataset-visualiser.py` and `tests/imagelabel_visualize_test.py`
- `exports.upload_model` may be intentionally deprecated in favor of the ApiClient-based flow in `train.py`
- The `orangepi5/` shell scripts were not analyzed (bash, not Python) — they appear to be setup/run scripts for edge deployment
- `checkpoint.txt` (2024-06-27) suggests training infrastructure was last used in mid-2024
## Artifact Index
| Path | Description | Step |
|------|-------------|------|
| `_docs/00_problem/problem.md` | Problem statement | 6 |
| `_docs/00_problem/restrictions.md` | Hardware, software, environment, operational restrictions | 6 |
| `_docs/00_problem/acceptance_criteria.md` | Measurable acceptance criteria from code | 6 |
| `_docs/00_problem/input_data/data_parameters.md` | Input data schemas and formats | 6 |
| `_docs/00_problem/security_approach.md` | Security mechanisms and known issues | 6 |
| `_docs/01_solution/solution.md` | Retrospective solution document | 5 |
| `_docs/02_document/00_discovery.md` | Codebase discovery: tree, tech stack, dependency graph | 0 |
| `_docs/02_document/modules/*.md` | 21 module-level documentation files | 1 |
| `_docs/02_document/components/0N_*/description.md` | 8 component specifications | 2 |
| `_docs/02_document/diagrams/components.md` | Component relationship diagram (Mermaid) | 2 |
| `_docs/02_document/architecture.md` | System architecture document | 3 |
| `_docs/02_document/system-flows.md` | 5 system flow diagrams with sequence diagrams | 3 |
| `_docs/02_document/data_model.md` | Data model with ER diagram | 3 |
| `_docs/02_document/diagrams/flows/flow_*.md` | Individual flow diagrams (4 files) | 3 |
| `_docs/02_document/04_verification_log.md` | Verification results: 87 entities, 5 bugs, 3 security issues | 4 |
| `_docs/02_document/FINAL_report.md` | This report | 7 |
| `_docs/02_document/state.json` | Document skill progress tracking | — |