Refactor constants management to use Pydantic BaseModel for configuration

- Replaced module-level path variables in constants.py with a structured Pydantic Config class.
- Updated all relevant modules (train.py, augmentation.py, exports.py, dataset-visualiser.py, manual_run.py) to access paths through the new config structure.
- Fixed bugs related to image processing and model saving.
- Enhanced test infrastructure to accommodate the new configuration approach.

This refactor improves code maintainability and clarity by centralizing configuration management.
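The refactored constants.py is presumably along these lines — a minimal sketch, assuming Pydantic v2-style defaults; the field names below are illustrative, not the project's actual ones:

```python
# constants.py (sketch): a Pydantic model replacing module-level path variables.
# Field names here are illustrative assumptions, not the project's actual ones.
from pathlib import Path

from pydantic import BaseModel


class Config(BaseModel):
    """Centralised project paths: one import point for every module."""
    processed_dir: Path = Path("data/processed")
    dataset_dir: Path = Path("data/dataset")
    corrupted_dir: Path = Path("data-corrupted")
    model_dir: Path = Path("models")


config = Config()  # consumers do `from constants import config`
```

Call sites then read `config.dataset_dir` instead of a bare module-level constant, which is also what the test infrastructure below patches.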
Author: Oleksandr Bezdieniezhnykh
Date:   2026-03-27 18:18:30 +02:00
Commit: 142c6c4de8 (parent b68c07b540)
106 changed files with 5706 additions and 654 deletions
# Dataset Formation Tests
**Task**: AZ-155_test_dataset_formation
**Name**: Dataset Formation Tests
**Description**: Implement blackbox, performance, resilience, and reliability tests for the dataset split: 70/20/10 ratio, directory structure, data integrity, corrupted-label filtering
**Complexity**: 2 points
**Dependencies**: AZ-152_test_infrastructure
**Component**: Blackbox Tests
**Jira**: AZ-155
**Epic**: AZ-151
## Problem
Dataset formation splits annotated images into train/valid/test sets. Tests must verify correct ratios, directory structure, data integrity, corrupted label filtering, and performance.
## Outcome
- 8 passing pytest tests covering dataset formation
- Blackbox tests in `tests/test_dataset_formation.py`
- Performance test in `tests/performance/test_dataset_perf.py`
## Scope
### Included
- BT-DSF-01: 70/20/10 split ratio (100 images → 70/20/10)
- BT-DSF-02: Split directory structure (6 subdirs created)
- BT-DSF-03: Total files preserved (sum == 100)
- BT-DSF-04: Corrupted labels moved to corrupted directory
- PT-DSF-01: Dataset formation throughput (100 images ≤ 30s)
- RT-DSF-01: Empty processed directory handled gracefully
- RL-DSF-01: Split ratios sum to 100%
- RL-DSF-02: No data duplication across splits
### Excluded
- Label validation (separate task)
## Acceptance Criteria
**AC-1: Split ratio**
Given 100 images + labels in processed/ dir
When form_dataset() runs with patched paths
Then train: 70, valid: 20, test: 10
**AC-2: Directory structure**
Given 100 images + labels
When form_dataset() runs
Then creates train/images/, train/labels/, valid/images/, valid/labels/, test/images/, test/labels/
**AC-3: Data integrity**
Given 100 valid images + labels
When form_dataset() runs
Then count(train) + count(valid) + count(test) == 100
**AC-4: Corrupted filtering**
Given 95 valid + 5 corrupted labels
When form_dataset() runs
Then 5 in data-corrupted/, 95 across splits
**AC-5: Throughput**
Given 100 images + labels
When form_dataset() runs
Then completes within 30 seconds
**AC-6: Empty directory**
Given empty processed images dir
When form_dataset() runs
Then empty dirs created, no crash
**AC-7: No duplication**
Given 100 images after form_dataset()
When collecting all filenames across train/valid/test
Then no filename appears in more than one split
## Constraints
- Must patch constants.py paths to use tmp_path
- Requires copying 100 fixture images to tmp_path (session fixture)
- Performance test marked: `@pytest.mark.performance`
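A conftest.py satisfying these constraints could look roughly like this — fixture names, the fixtures directory location, and the Config attribute names are all assumptions about this codebase:

```python
# tests/conftest.py (sketch). Fixture names, the fixtures location, and the
# Config attribute names are assumptions, not the project's actual ones.
import shutil
from pathlib import Path

import pytest

FIXTURE_DIR = Path(__file__).parent / "fixtures" / "processed"  # assumed path


def pytest_configure(config):
    # Register the custom marker used by tests/performance/.
    config.addinivalue_line("markers", "performance: throughput tests")


@pytest.fixture(scope="session")
def processed_dir(tmp_path_factory) -> Path:
    """Copy the 100 fixture images + labels into a session-wide tmp dir."""
    target = tmp_path_factory.mktemp("processed")
    for src in FIXTURE_DIR.iterdir():
        shutil.copy2(src, target / src.name)
    return target


@pytest.fixture
def patched_config(processed_dir, tmp_path, monkeypatch):
    """Point the config's paths at tmp dirs instead of the repo tree."""
    import constants  # the refactored module holding the Config instance

    monkeypatch.setattr(constants.config, "processed_dir",
                        processed_dir, raising=False)
    monkeypatch.setattr(constants.config, "dataset_dir",
                        tmp_path / "dataset", raising=False)
    return constants.config
```

The session scope on `processed_dir` means the 100 fixture files are copied once per run, while `patched_config` stays function-scoped because `monkeypatch` undoes its changes after each test.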