Update configuration and test structure for improved clarity and functionality

- Modified `.gitignore` to include test fixture data while excluding test results.
- Updated `config.yaml` to change the model from 'yolo11m.yaml' to 'yolo26m.pt'.
- Enhanced `.cursor/rules/coderule.mdc` with additional guidelines for test environment consistency and infrastructure handling.
- Revised autopilot state management in `_docs/_autopilot_state.md` to reflect current progress and tasks.
- Removed outdated augmentation tests and adjusted dataset formation tests to align with the new structure.

These changes streamline the configuration and testing processes, ensuring better organization and clarity in the project.
2026-06-21 07:11:12 +00:00 · 2026-03-28 06:11:55 +02:00
parent cdcd1f6ea7
commit a47fa135de
119 changed files with 824 additions and 774 deletions
@@ -0,0 +1,33 @@
+# Refactoring Roadmap
+
+**Run**: 01-code-improvements
+**Date**: 2026-03-28
+
+## Execution Order
+
+All 5 changes are grouped into a single phase (straightforward, low-to-medium risk).
+
+| Priority | Change | Risk | Effort |
+|----------|--------|------|--------|
+| 1 | C05: Unify configuration | medium | 3 pts |
+| 2 | C01: Update YOLO model | medium | 2 pts |
+| 3 | C02: Replace external augmentation | medium | 3 pts |
+| 4 | C03: Remove processed directory | medium | 3 pts |
+| 5 | C04: Hard links | low | 2 pts |
+
+**Total estimated effort**: 13 points across 5 tasks
+
+## Dependency Graph
+
+```
+C05 (config unification) ─── independent
+C01 (YOLO update) ← C02 (built-in aug) ← C03 (remove processed dir) ← C04 (hard links)
+```
+
+Arrows point from a change to its prerequisite (C02 depends on C01, and so on). C05 can be done in parallel with the C01→C04 chain.
+
+## Risk Mitigation
+
+- Existing test suite (83 tests) provides safety net
+- Each change committed separately for easy rollback
+- C02 is the highest-risk change (training pipeline behavior change) — validate by running a short training sanity check after implementation
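
The dependency graph in this roadmap can be checked mechanically. The following is a minimal sketch, assuming the dependencies are exactly those listed (C02 on C01, C03 on C02, C04 on C03, C05 independent); the `DEPENDENCIES` mapping and `execution_order` helper are illustrative names, not part of the repository.

```python
from graphlib import TopologicalSorter

# Each key maps a change to the set of changes it depends on,
# as stated in the roadmap's dependency graph.
DEPENDENCIES = {
    "C01": set(),
    "C02": {"C01"},
    "C03": {"C02"},
    "C04": {"C03"},
    "C05": set(),  # independent of the C01..C04 chain
}


def execution_order(deps):
    """Return one valid order in which every change follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(DEPENDENCIES)
```

Any order produced this way keeps C01 through C04 in sequence, while C05 may appear anywhere, which matches the note that it can run in parallel.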
@@ -0,0 +1,50 @@
+# Research Findings
+
+**Run**: 01-code-improvements
+**Date**: 2026-03-28
+
+## Current State Analysis
+
+### Training Pipeline
+- Uses `yolo11m.yaml` (architecture-only config, trains from scratch)
+- External augmentation via `albumentations` library in `src/augmentation.py`
+- Three-stage process: augment → form dataset → train
+- Dataset formation copies files with `shutil.copy()`, duplicating ~8x storage
+
+### Configuration
+- Two config files: `config.yaml` (root) and `src/annotation-queue/config.yaml`
+- Annotation queue handler parses YAML manually instead of using shared `Config` model
+- Config drift risk between the two files
+
+## YOLO 26 Model Update
+
+Ultralytics YOLO26 is the latest model family. The medium variant `yolo26m.pt` replaces `yolo11m.yaml`:
+- Uses pretrained weights (`.pt`) rather than architecture-only (`.yaml`)
+- Faster convergence with transfer learning
+- Improved accuracy on detection benchmarks
+
+## Built-in Augmentation Parameters
+
+YOLO's `model.train()` supports the following augmentation parameters that replace the external `albumentations` pipeline:
+
+| Parameter | Default | Equivalent to current external aug |
+|-----------|---------|-----------------------------------|
+| `hsv_h` | 0.015 | HueSaturationValue(hue_shift_limit=10) |
+| `hsv_s` | 0.7 | HueSaturationValue(sat_shift_limit=10) |
+| `hsv_v` | 0.4 | RandomBrightnessContrast + HSV |
+| `degrees` | 0.0 | Affine(rotate=(-35,35)) → set to 35.0 |
+| `translate` | 0.1 | Default is sufficient |
+| `scale` | 0.5 | Affine(scale=(0.8,1.2)) → default covers this |
+| `shear` | 0.0 | Affine(shear=(-10,10)) → set to 10.0 |
+| `fliplr` | 0.5 | HorizontalFlip(p=0.6) → set to 0.6 |
+| `flipud` | 0.0 | Not used currently |
+| `mosaic` | 1.0 | New — YOLO built-in |
+| `mixup` | 0.0 | New — optional |
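
As a rough illustration of the table above, the overrides could be collected into a single dictionary and passed to `model.train()`. This is a sketch only: `AUGMENTATION_KWARGS` and `build_train_kwargs` are hypothetical names, and the `data.yaml` path and epoch count in the usage comment are placeholders rather than values from this project.

```python
# Built-in augmentation settings taken from the table above.
# Entries marked "default" repeat the listed Ultralytics default;
# the others are the overrides the table recommends.
AUGMENTATION_KWARGS = {
    "hsv_h": 0.015,    # default
    "hsv_s": 0.7,      # default
    "hsv_v": 0.4,      # default
    "degrees": 35.0,   # replaces Affine(rotate=(-35, 35))
    "translate": 0.1,  # default
    "scale": 0.5,      # default, covers Affine(scale=(0.8, 1.2))
    "shear": 10.0,     # replaces Affine(shear=(-10, 10))
    "fliplr": 0.6,     # replaces HorizontalFlip(p=0.6)
    "flipud": 0.0,     # default, not used currently
    "mosaic": 1.0,     # default, YOLO built-in
    "mixup": 0.0,      # default, optional
}


def build_train_kwargs(data_yaml: str, epochs: int) -> dict:
    """Merge run settings with the augmentation overrides (hypothetical helper)."""
    return {"data": data_yaml, "epochs": epochs, **AUGMENTATION_KWARGS}


# Assumed usage (requires the ultralytics package; not executed here):
#   from ultralytics import YOLO
#   model = YOLO("yolo26m.pt")
#   model.train(**build_train_kwargs("data.yaml", epochs=100))
```

Keeping the values in one mapping makes it easier to compare them against the old `albumentations` settings during the sanity check.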
+
+## Hard Links
+
+`os.link()` creates hard links sharing the same inode. Benefits:
+- No additional file data on disk for dataset splits (only new directory entries)
+- Same read performance as regular files
+- Works only within a single filesystem (which is the case here, with everything under `/azaion/`)
+- Fallback to `shutil.copy()` for cross-filesystem edge cases
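
The link-with-fallback behaviour described above might look like the sketch below. `link_or_copy` is a hypothetical helper name. Treating `EPERM` (filesystem does not support hard links) as a second fallback trigger alongside `EXDEV` is an assumption that goes beyond the cross-filesystem case named in the findings.

```python
import errno
import os
import shutil


def link_or_copy(src: str, dst: str) -> str:
    """Hard-link src to dst, copying instead when linking is not possible.

    Returns "link" or "copy" so callers can report which path was taken.
    """
    try:
        os.link(src, dst)
        return "link"
    except OSError as exc:
        # EXDEV: src and dst are on different filesystems.
        # EPERM: the filesystem does not support hard links (assumed fallback case).
        if exc.errno not in (errno.EXDEV, errno.EPERM):
            raise
        shutil.copy(src, dst)
        return "copy"
```

Because a hard link shares its data with the original, edits made through either path affect both, which is worth keeping in mind if any later step rewrites label files in place.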
@@ -0,0 +1,66 @@
+# Baseline Metrics
+
+**Run**: 01-code-improvements
+**Date**: 2026-03-28
+**Mode**: Guided
+**Source**: `_docs/02_document/refactoring_notes.md`
+
+## Goals
+
+Apply 5 improvements identified during documentation:
+1. Update the YOLO model to the 26m variant (`yolo26m.pt`)
+2. Replace external augmentation with YOLO built-in augmentation
+3. Remove processed folder — use data dir directly
+4. Use hard links instead of file copies for dataset formation
+5. Unify configuration: remove `src/annotation-queue/config.yaml`
+
+## Code Metrics
+
+| Metric | Value |
+|--------|-------|
+| Source files (src/) | 24 Python files |
+| Source LOC | 2,945 |
+| Test files | 21 Python files |
+| Test LOC | 1,646 |
+| Total tests | 83 (77 blackbox/unit + 6 performance) |
+| Test execution time | ~130s (120s unit + 10s perf) |
+| Python version | 3.10.8 |
+| Ultralytics version | 8.4.30 |
+| Pip packages | ~76 |
+
+## Files Affected by Refactoring
+
+| File | LOC | Refactoring Items |
+|------|-----|-------------------|
+| `src/constants.py` | 118 | #3 (remove processed_dir), #5 (unify config) |
+| `src/train.py` | 178 | #1 (YOLO version), #2 (built-in aug), #3 (data dir), #4 (hard links) |
+| `src/augmentation.py` | 152 | #2 (replace with YOLO built-in), #3 (processed dir) |
+| `src/exports.py` | 118 | #3 (processed dir references) |
+| `src/convert-annotations.py` | 119 | #3 (processed dir references) |
+| `src/dataset-visualiser.py` | 52 | #3 (processed dir references) |
+| `src/annotation-queue/annotation_queue_handler.py` | 173 | #5 (remove separate config.yaml) |
+| `src/annotation-queue/config.yaml` | 21 | #5 (delete — duplicated config) |
+| `config.yaml` | 30 | #5 (single source of truth) |
+
+## Test Suite Baseline
+
+```
+77 passed, 0 failed, 0 skipped (blackbox/unit)
+6 passed, 0 failed, 0 skipped (performance)
+Total: 83 passed in ~130s
+```
+
+## Functionality Inventory
+
+| Feature | Status | Affected by Refactoring |
+|---------|--------|------------------------|
+| Augmentation pipeline | Working | Yes (#2, #3) |
+| Dataset formation | Working | Yes (#3, #4) |
+| Training | Working | Yes (#1, #2) |
+| Model export (ONNX) | Working | No |
+| Inference (ONNX/TensorRT) | Working | No |
+| Annotation queue | Working | Yes (#5) |
+| API client | Working | No |
+| CDN manager | Working | No |
+| Security/encryption | Working | No |
+| Label validation | Working | No |
@@ -0,0 +1,26 @@
+# Training Pipeline
+
+## Files
+- `src/train.py` (178 LOC)
+- `src/augmentation.py` (152 LOC)
+- `src/constants.py` (118 LOC)
+
+## Current Flow
+
+```mermaid
+graph TD
+    A[augmentation.py] -->|reads from| B[data_dir]
+    A -->|writes to| C[processed_dir]
+    D[train.py::form_dataset] -->|reads from| C
+    D -->|shutil.copy to| E[datasets_dir/today/train,valid,test]
+    F[train.py::train_dataset] -->|YOLO.train| E
+```
+
+## Issues
+- External augmentation (albumentations) runs as separate step, writing to `processed_dir`
+- `form_dataset()` copies files from `processed_dir` to dataset splits using `shutil.copy`
+- YOLO has built-in augmentation that runs during training (mosaic, mixup, flips, etc.)
+- Using built-in aug eliminates need for `processed_dir` and the full `augmentation.py` pipeline
+- `copy_annotations()` uses `shutil.copy` — wasteful for large datasets
+- Global mutable `total_files_copied` variable in `copy_annotations`
+- Model config `yolo11m.yaml` trains from scratch; likely should use pretrained weights or updated variant
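
One way to address the global `total_files_copied` counter noted in the issues is to have the copy routine return its count. The sketch below is illustrative: the real signature of `copy_annotations()` is not shown in this document, so `copy_annotations_counted` and its parameters are assumptions.

```python
import shutil
from pathlib import Path


def copy_annotations_counted(files: list[Path], dest_dir: Path) -> int:
    """Copy files into dest_dir and return how many were copied.

    Hypothetical replacement for a routine that updates a module-level counter.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    for src in files:
        shutil.copy(src, dest_dir / src.name)
        copied += 1
    return copied
```

The caller can then sum the returned counts across the train, valid, and test splits without relying on shared mutable state.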
@@ -0,0 +1,18 @@
+# Configuration System
+
+## Files
+- `src/constants.py` (118 LOC)
+- `config.yaml` (root, 30 lines)
+- `src/annotation-queue/config.yaml` (21 lines)
+- `src/annotation-queue/annotation_queue_handler.py` (173 LOC)
+
+## Current State
+- `constants.py` defines `Config` (Pydantic model) loaded from root `config.yaml`
+- `annotation_queue_handler.py` reads its own `config.yaml` with raw `yaml.safe_load`
+- Both config files share `api`, `queue`, `dirs` sections but with different `dirs` values
+- Annotation queue config has `data: 'data-test'` vs root `data: 'data'`
+
+## Issues
+- Two config files with overlapping content — drift risk
+- `annotation_queue_handler.py` parses config manually instead of using `Config` model
+- `constants.py` still has `processed_dir` properties that become obsolete after removing external augmentation
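
The direction implied by these issues, a single shared configuration object handed to the queue handler, can be sketched as follows. This uses stdlib dataclasses as a stand-in for the Pydantic `Config` model, and the class shapes, `from_mapping`, and the `data_dir` property are all hypothetical. Only the `api`, `queue`, and `dirs` section names and the `data` key come from the notes above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Config:
    """Stand-in for the shared Config model (the real one is a Pydantic model)."""
    api: dict[str, Any] = field(default_factory=dict)
    queue: dict[str, Any] = field(default_factory=dict)
    dirs: dict[str, str] = field(default_factory=dict)

    @classmethod
    def from_mapping(cls, raw: dict[str, Any]) -> "Config":
        # raw is the already-parsed content of the root config.yaml.
        return cls(
            api=dict(raw.get("api", {})),
            queue=dict(raw.get("queue", {})),
            dirs=dict(raw.get("dirs", {})),
        )


class AnnotationQueueHandler:
    """Hypothetical handler that receives Config instead of parsing its own YAML."""

    def __init__(self, config: Config) -> None:
        self.config = config

    @property
    def data_dir(self) -> str:
        return self.config.dirs["data"]


shared = Config.from_mapping({"dirs": {"data": "data"}})
handler = AnnotationQueueHandler(shared)
```

With this shape, a test run that needs `data-test` can construct its own `Config` and pass it in, rather than keeping a second YAML file in sync.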
@@ -0,0 +1,10 @@
+# Data Utilities
+
+## Files
+- `src/exports.py` — `form_data_sample()` reads from `processed_images_dir`
+- `src/dataset-visualiser.py` — `visualise_processed_folder()` reads from `processed_images_dir`/`processed_labels_dir`
+
+## Impact
+- Both files reference `processed_dir` via `constants.config`
+- After removing `processed_dir`, these must switch to `data_images_dir`/`data_labels_dir`
+- `form_data_sample()` also uses `shutil.copy` — candidate for hard links
@@ -0,0 +1,52 @@
+# List of Changes
+
+**Run**: 01-code-improvements
+**Mode**: Guided
+**Source**: `_docs/02_document/refactoring_notes.md`
+**Date**: 2026-03-28
+
+## Summary
+
+Apply 5 improvements from documentation review: update YOLO model, switch to built-in augmentation, remove processed directory, use hard links for dataset formation, and unify configuration files.
+
+## Changes
+
+### C01: Update YOLO model to 26m variant
+- **File(s)**: `src/constants.py`, `src/train.py`
+- **Problem**: Current model config uses `yolo11m.yaml`, an architecture-only definition, so training starts from scratch without pretrained weights
+- **Change**: Update `TrainingConfig.model` to the YOLO 26m variant; ensure `train_dataset()` uses the updated model reference
+- **Rationale**: Use updated model version as requested; pretrained weights improve convergence
+- **Risk**: medium
+- **Dependencies**: None
+
+### C02: Replace external augmentation with YOLO built-in
+- **File(s)**: `src/train.py`, `src/augmentation.py`
+- **Problem**: `augmentation.py` uses albumentations to augment images into a separate `processed_dir` before training — adds complexity, disk usage, and a separate processing step
+- **Change**: Remove the `augment_annotations()` call from the training pipeline. Add YOLO built-in augmentation parameters (hsv_h, hsv_s, hsv_v, degrees, translate, scale, shear, flipud, fliplr, mosaic, mixup) to the `model.train()` call in `train_dataset()`, each on its own line with a descriptive comment. `augmentation.py` remains in the codebase but is no longer called during training
+- **Rationale**: YOLO's built-in augmentation applies on-the-fly during training, eliminating the pre-processing step and processed directory
+- **Risk**: medium
+- **Dependencies**: C01
+
+### C03: Remove processed directory — use data dir directly
+- **File(s)**: `src/constants.py`, `src/train.py`, `src/exports.py`, `src/dataset-visualiser.py`
+- **Problem**: `processed_dir`, `processed_images_dir`, `processed_labels_dir` properties in `Config` are no longer needed when built-in augmentation is used; `form_dataset()` reads from processed dir; `form_data_sample()` reads from processed dir; `visualise_processed_folder()` reads from processed dir
+- **Change**: Remove `processed_dir`/`processed_images_dir`/`processed_labels_dir` properties from `Config`; update `form_dataset()` to read from `data_images_dir`/`data_labels_dir`; update `form_data_sample()` similarly; update `visualise_processed_folder()` similarly
+- **Rationale**: Processed directory is unnecessary without external augmentation step
+- **Risk**: medium
+- **Dependencies**: C02
+
+### C04: Use hard links instead of file copies for dataset
+- **File(s)**: `src/train.py`
+- **Problem**: `copy_annotations()` uses `shutil.copy()` to duplicate images and labels into train/valid/test splits — wastes disk space on large datasets
+- **Change**: Replace `shutil.copy()` with `os.link()` to create hard links; add fallback to `shutil.copy()` for cross-filesystem scenarios
+- **Rationale**: Hard links share the same inode, saving disk space while maintaining independent directory entries
+- **Risk**: low
+- **Dependencies**: C03
+
+### C05: Unify configuration — remove annotation-queue/config.yaml
+- **File(s)**: `src/constants.py`, `src/annotation-queue/annotation_queue_handler.py`, `src/annotation-queue/config.yaml`
+- **Problem**: `src/annotation-queue/config.yaml` duplicates root `config.yaml` with different `dirs` values; `annotation_queue_handler.py` parses config manually via `yaml.safe_load` instead of using the shared `Config` model
+- **Change**: Extend `Config` in `constants.py` to include queue and annotation-queue directory settings; refactor `annotation_queue_handler.py` to accept a `Config` instance (or import from constants); delete `src/annotation-queue/config.yaml`
+- **Rationale**: Single source of truth for configuration eliminates drift risk and inconsistency
+- **Risk**: medium
+- **Dependencies**: None