Refactor constants management to use Pydantic BaseModel for configuration

- Replaced module-level path variables in constants.py with a structured Pydantic Config class.
- Updated all relevant modules (train.py, augmentation.py, exports.py, dataset-visualiser.py, manual_run.py) to access paths through the new config structure.
- Fixed bugs related to image processing and model saving.
- Enhanced test infrastructure to accommodate the new configuration approach.

This refactor improves code maintainability and clarity by centralizing configuration management.
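The described refactor, replacing module-level path variables with a structured Pydantic model, might look roughly like the following sketch. The class and field names (`PathsConfig`, `data_dir`, `processed_dir`, `models_dir`) are hypothetical and not taken from the actual `constants.py`:

```python
from pathlib import Path
from pydantic import BaseModel


class PathsConfig(BaseModel):
    """Hypothetical grouping of the paths formerly held in module-level variables."""
    data_dir: Path = Path("/data")
    processed_dir: Path = Path("/data-processed")
    models_dir: Path = Path("/models")


class Config(BaseModel):
    """Structured configuration replacing loose constants in constants.py (sketch)."""
    paths: PathsConfig = PathsConfig()


# Consumers (train.py, augmentation.py, ...) would then read paths via one object:
config = Config()
print(config.paths.processed_dir)
```

Centralizing paths in one validated model means a typo in a path name fails at import time rather than silently producing a wrong string downstream.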
Oleksandr Bezdieniezhnykh
2026-03-27 18:18:30 +02:00
parent b68c07b540
commit 142c6c4de8
106 changed files with 5706 additions and 654 deletions
# Final Documentation Report — Azaion AI Training
## Executive Summary
Azaion AI Training is a Python-based ML pipeline for training, deploying, and running YOLOv11 object detection models targeting aerial military asset recognition. The system comprises 8 components (21 modules) spanning annotation ingestion, data augmentation, GPU-accelerated training, multi-format model export, encrypted model distribution, and real-time inference — with edge deployment capability via RKNN on OrangePi5 devices.
The codebase is functional and in production use (last training run: 2024-06-27), but it lacks CI/CD, containerization, and a formal test framework, and it contains several hardcoded credentials. Verification identified 5 code bugs, 3 high-severity security issues, and 1 missing module.
## Problem Statement
The system automates detection of 17 classes of military objects and infrastructure in aerial/satellite imagery across 3 weather conditions (Normal, Winter, Night). It replaces manual image analysis with a continuous pipeline: human-annotated data flows in via RabbitMQ, is augmented 8× for training diversity, trains YOLOv11 models over multi-day GPU runs, and distributes encrypted models to inference clients that run real-time video detection.
## Architecture Overview
**Tech stack**: Python 3.10+ · PyTorch 2.3.0 (CUDA 12.1) · Ultralytics YOLOv11m · TensorRT · ONNX Runtime · Albumentations · boto3 · rstream · cryptography
**Deployment**: 5 independent processes (no orchestration, no containers) running on GPU-equipped servers. Manual deployment.
## Component Summary
| # | Component | Modules | Purpose | Key Dependencies |
|---|-----------|---------|---------|-----------------|
| 01 | Core Infrastructure | constants, utils | Shared paths, config keys, Dotdict helper | None |
| 02 | Security & Hardware | security, hardware_service | AES-256-CBC encryption, hardware fingerprinting | cryptography, pynvml |
| 03 | API & CDN Client | api_client, cdn_manager | REST API (JWT auth) + S3 CDN communication | requests, boto3, Security |
| 04 | Data Models | dto/annotationClass, dto/imageLabel | Annotation class definitions, image+label container | OpenCV, matplotlib |
| 05 | Data Pipeline | augmentation, convert-annotations, dataset-visualiser | 8× augmentation, format conversion, visualization | Albumentations, Data Models |
| 06 | Training Pipeline | train, exports, manual_run | Dataset formation → YOLO training → export → encrypted upload | Ultralytics, API & CDN, Security |
| 07 | Inference Engine | inference/dto, onnx_engine, tensorrt_engine, inference, start_inference | Model download, decryption, TensorRT/ONNX video inference | TensorRT, ONNX Runtime, PyCUDA |
| 08 | Annotation Queue | annotation_queue_dto, annotation_queue_handler | Async RabbitMQ Streams consumer for annotation CRUD events | rstream, msgpack |
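The AES-256-CBC model encryption attributed to component 02 presumably follows the standard `cryptography` pattern below. This is a minimal sketch, assuming PKCS7 padding and an IV prepended to the ciphertext; the function names are illustrative, not the actual `security.py` API:

```python
import os

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes


def encrypt_model(data: bytes, key: bytes) -> bytes:
    """Encrypt bytes with AES-256-CBC; prepend the random IV to the ciphertext."""
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()  # pad to the 16-byte AES block size
    padded = padder.update(data) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + encryptor.update(padded) + encryptor.finalize()


def decrypt_model(blob: bytes, key: bytes) -> bytes:
    """Split off the IV, decrypt, and strip the PKCS7 padding."""
    iv, ciphertext = blob[:16], blob[16:]
    decryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = decryptor.update(ciphertext) + decryptor.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()


key = os.urandom(32)  # 256-bit key; the real system reportedly hardcodes this (see Risk Observations)
roundtrip = decrypt_model(encrypt_model(b"model bytes", key), key)
```

Note that CBC provides confidentiality only; an authenticated mode such as AES-GCM would additionally detect tampering with distributed model files.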
## System Flows
| # | Flow | Entry Point | Path | Output |
|---|------|-------------|------|--------|
| 1 | Annotation Ingestion | RabbitMQ message | Queue → Handler → Filesystem | Images + labels on disk |
| 2 | Data Augmentation | Filesystem scan (5-min loop) | /data/ → Augmentator → /data-processed/ | 8× augmented images + labels |
| 3 | Training Pipeline | train.py __main__ | /data-processed/ → Dataset split → YOLO train → Export → Encrypt → Upload | Encrypted model on API + CDN |
| 4 | Model Download & Inference | start_inference.py __main__ | API + CDN download → Decrypt → TensorRT init → Video frames → Detections | Annotated video output |
| 5 | Model Export (Multi-Format) | train.py / manual_run.py | .pt → .onnx / .engine / .rknn | Multi-format model artifacts |
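The "Dataset split" step in flow 3 presumably partitions `/data-processed/` into train and validation sets before YOLO training. A minimal stdlib-only sketch (the helper name, fraction, and seed are assumptions, not the actual `train.py` logic):

```python
import random
from pathlib import Path


def split_dataset(image_paths, val_fraction=0.2, seed=42):
    """Deterministically shuffle image paths and split them into (train, val) lists."""
    paths = sorted(image_paths)           # sort first so the split is reproducible
    random.Random(seed).shuffle(paths)    # seeded shuffle, independent of global state
    n_val = int(len(paths) * val_fraction)
    return paths[n_val:], paths[:n_val]


images = [Path(f"/data-processed/img_{i:04d}.jpg") for i in range(10)]
train, val = split_dataset(images)
```

Seeding the shuffle keeps train/val membership stable across reruns, which matters when multi-day training jobs are resumed or compared.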
## Risk Observations
### Code Bugs (from Verification)
| # | Location | Issue | Impact |
|---|----------|-------|--------|
| 1 | augmentation.py:118 | `self.total_to_process` undefined (should be `self.total_images_to_process`) | AttributeError during progress logging |
| 2 | train.py:93,99 | `copied` counter never incremented | Incorrect progress reporting (cosmetic) |
| 3 | exports.py:97 | Stale `ApiClient(ApiCredentials(...))` constructor call with wrong params | `upload_model` function would fail at runtime |
| 4 | inference/tensorrt_engine.py:43-44 | `batch_size` uninitialized for dynamic batch dimensions | NameError for models with dynamic batch size |
| 5 | dataset-visualiser.py:6 | Imports from `preprocessing` module that doesn't exist | Script cannot run |
### Security Issues
| Issue | Severity | Location |
|-------|----------|----------|
| Hardcoded API credentials | High | config.yaml |
| Hardcoded CDN access keys (4 keys) | High | cdn.yaml |
| Hardcoded model encryption key | High | security.py:67 |
| Queue credentials in plaintext | Medium | config.yaml, annotation-queue/config.yaml |
| No TLS certificate validation | Low | api_client.py |
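A common remediation for the hardcoded-credential findings above is to inject secrets via the environment and fail fast when they are missing. A minimal sketch (the variable name `AZAION_API_TOKEN` is hypothetical):

```python
import os


def require_secret(name: str) -> str:
    """Return a required credential from the environment, or fail fast with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


os.environ["AZAION_API_TOKEN"] = "example-token"  # for demonstration only
token = require_secret("AZAION_API_TOKEN")
```

Failing at startup with a named variable is far easier to diagnose than an authentication error deep inside a training or upload run.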
### Structural Concerns
- No CI/CD pipeline or containerization
- No formal test framework (2 script-based tests, 1 broken)
- Duplicated AnnotationClass/WeatherMode code in 3 locations
- No graceful shutdown for augmentation process
- No reconnect logic for annotation queue consumer
- Manual deployment only
## Open Questions
- The `preprocessing` module may have existed previously and been deleted or renamed — its absence breaks `dataset-visualiser.py` and `tests/imagelabel_visualize_test.py`
- `exports.upload_model` may be intentionally deprecated in favor of the ApiClient-based flow in `train.py`
- The `orangepi5/` shell scripts were not analyzed (bash, not Python) — they appear to be setup/run scripts for edge deployment
- `checkpoint.txt` (2024-06-27) suggests training infrastructure was last used in mid-2024
## Artifact Index
| Path | Description | Step |
|------|-------------|------|
| `_docs/00_problem/problem.md` | Problem statement | 6 |
| `_docs/00_problem/restrictions.md` | Hardware, software, environment, operational restrictions | 6 |
| `_docs/00_problem/acceptance_criteria.md` | Measurable acceptance criteria from code | 6 |
| `_docs/00_problem/input_data/data_parameters.md` | Input data schemas and formats | 6 |
| `_docs/00_problem/security_approach.md` | Security mechanisms and known issues | 6 |
| `_docs/01_solution/solution.md` | Retrospective solution document | 5 |
| `_docs/02_document/00_discovery.md` | Codebase discovery: tree, tech stack, dependency graph | 0 |
| `_docs/02_document/modules/*.md` | 21 module-level documentation files | 1 |
| `_docs/02_document/components/0N_*/description.md` | 8 component specifications | 2 |
| `_docs/02_document/diagrams/components.md` | Component relationship diagram (Mermaid) | 2 |
| `_docs/02_document/architecture.md` | System architecture document | 3 |
| `_docs/02_document/system-flows.md` | 5 system flow diagrams with sequence diagrams | 3 |
| `_docs/02_document/data_model.md` | Data model with ER diagram | 3 |
| `_docs/02_document/diagrams/flows/flow_*.md` | Individual flow diagrams (4 files) | 3 |
| `_docs/02_document/04_verification_log.md` | Verification results: 87 entities, 5 bugs, 3 security issues | 4 |
| `_docs/02_document/FINAL_report.md` | This report | 7 |
| `_docs/02_document/state.json` | Document skill progress tracking | — |