ai-training/_docs/02_tasks/AZ-155_test_dataset_formation.md
Oleksandr Bezdieniezhnykh 142c6c4de8 Refactor constants management to use Pydantic BaseModel for configuration
- Replaced module-level path variables in constants.py with a structured Pydantic Config class.
- Updated all relevant modules (train.py, augmentation.py, exports.py, dataset-visualiser.py, manual_run.py) to access paths through the new config structure.
- Fixed bugs related to image processing and model saving.
- Enhanced test infrastructure to accommodate the new configuration approach.

This refactor improves code maintainability and clarity by centralizing configuration management.
2026-03-27 18:18:30 +02:00

Dataset Formation Tests

Task: AZ-155_test_dataset_formation
Name: Dataset Formation Tests
Description: Implement blackbox, performance, resilience, and resource tests for dataset split: 70/20/10 ratio, directory structure, integrity, corrupted filtering
Complexity: 2 points
Dependencies: AZ-152_test_infrastructure
Component: Blackbox Tests
Jira: AZ-155
Epic: AZ-151

Problem

Dataset formation splits annotated images into train/valid/test sets. Tests must verify correct ratios, directory structure, data integrity, corrupted label filtering, and performance.

Outcome

  • 8 passing pytest tests covering dataset formation
  • Blackbox tests in tests/test_dataset_formation.py
  • Performance test in tests/performance/test_dataset_perf.py

Scope

Included

  • BT-DSF-01: 70/20/10 split ratio (100 images → 70/20/10)
  • BT-DSF-02: Split directory structure (6 subdirs created)
  • BT-DSF-03: Total files preserved (sum == 100)
  • BT-DSF-04: Corrupted labels moved to corrupted directory
  • PT-DSF-01: Dataset formation throughput (100 images ≤ 30s)
  • RT-DSF-01: Empty processed directory handled gracefully
  • RL-DSF-01: Split ratios sum to 100%
  • RL-DSF-02: No data duplication across splits

Excluded

  • Label validation (separate task)

Acceptance Criteria

AC-1: Split ratio
Given 100 images + labels in processed/ dir
When form_dataset() runs with patched paths
Then train: 70, valid: 20, test: 10
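
The expected counts in AC-1 (and the "ratios sum to 100%" check in RL-DSF-01) can be sketched as plain integer arithmetic. `split_counts` below is a hypothetical helper written for illustration, not the project's `form_dataset()`; in particular, sending the rounding remainder to the test split is an assumption, since the task only specifies the 100-image case.

```python
def split_counts(total: int, ratios: tuple[int, int, int] = (70, 20, 10)) -> tuple[int, int, int]:
    """Hypothetical helper: expected (train, valid, test) sizes for `total` items."""
    if sum(ratios) != 100:
        raise ValueError("split ratios must sum to 100")
    train = total * ratios[0] // 100
    valid = total * ratios[1] // 100
    # Assumption: whatever is left after rounding goes to test, so no item is lost.
    test = total - train - valid
    return train, valid, test


expected = split_counts(100)
```

A test for AC-1 could compare the per-split file counts produced by `form_dataset()` against `split_counts(100)`.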

AC-2: Directory structure
Given 100 images + labels
When form_dataset() runs
Then creates train/images/, train/labels/, valid/images/, valid/labels/, test/images/, test/labels/
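
One way to check the six directories from AC-2 is to build the expected paths from the split and kind names, then report any that are missing. This is a minimal sketch: `expected_subdirs` and `missing_subdirs` are hypothetical helpers, and the demo creates the directories by hand as a stand-in for what `form_dataset()` is expected to produce.

```python
import tempfile
from pathlib import Path

SPLITS = ("train", "valid", "test")
KINDS = ("images", "labels")


def expected_subdirs(root: Path) -> list[Path]:
    """All <split>/<kind> directories the dataset root should contain."""
    return [root / split / kind for split in SPLITS for kind in KINDS]


def missing_subdirs(root: Path) -> list[Path]:
    """Expected directories that do not exist under `root`."""
    return [d for d in expected_subdirs(root) if not d.is_dir()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    missing_before = missing_subdirs(root)
    for d in expected_subdirs(root):
        d.mkdir(parents=True)  # stand-in for form_dataset() output
    missing_after = missing_subdirs(root)
```

Returning the missing paths rather than a boolean makes a failing assertion show which directory was not created.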

AC-3: Data integrity
Given 100 valid images + labels
When form_dataset() runs
Then count(train) + count(valid) + count(test) == 100

AC-4: Corrupted filtering
Given 95 valid + 5 corrupted labels
When form_dataset() runs
Then 5 in data-corrupted/, 95 across splits

AC-5: Throughput
Given 100 images + labels
When form_dataset() runs
Then completes within 30 seconds
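
The 30-second budget in AC-5 could be measured with a small wall-clock wrapper around the call under test. The sketch below uses only the standard library; `elapsed_seconds` is a hypothetical helper, and the lambda is a placeholder where the real test would call `form_dataset()` inside a function marked `@pytest.mark.performance`.

```python
import time
from typing import Callable

MAX_SECONDS = 30.0  # budget from AC-5 / PT-DSF-01


def elapsed_seconds(fn: Callable[[], object]) -> float:
    """Run `fn` once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


# Placeholder workload; the real test would pass form_dataset here.
duration = elapsed_seconds(lambda: sum(range(1000)))
within_budget = duration <= MAX_SECONDS
```

`time.perf_counter()` is monotonic, so the measurement is not affected by system clock adjustments during the run.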

AC-6: Empty directory
Given empty processed images dir
When form_dataset() runs
Then empty dirs created, no crash

AC-7: No duplication
Given 100 images after form_dataset()
When collecting all filenames across train/valid/test
Then no filename appears in more than one split
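
The duplication check in AC-7 amounts to collecting filenames per split and counting how many splits each name appears in; the same per-split sets also give the total needed for AC-3. A minimal sketch follows, with hypothetical helper names and a hand-built 7/2/1 tree standing in for real `form_dataset()` output.

```python
import tempfile
from collections import Counter
from pathlib import Path

SPLITS = ("train", "valid", "test")


def image_names_by_split(root: Path) -> dict[str, set[str]]:
    """Collect image filenames per split from <root>/<split>/images/."""
    return {
        split: {p.name for p in (root / split / "images").iterdir() if p.is_file()}
        for split in SPLITS
    }


def names_in_multiple_splits(names_by_split: dict[str, set[str]]) -> list[str]:
    """Return filenames that occur in more than one split, sorted."""
    counts = Counter(name for names in names_by_split.values() for name in names)
    return sorted(name for name, n in counts.items() if n > 1)


# Demo tree: 7 train, 2 valid, 1 test image (stand-in for form_dataset() output).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    layout = {"train": range(0, 7), "valid": range(7, 9), "test": range(9, 10)}
    for split, ids in layout.items():
        images = root / split / "images"
        images.mkdir(parents=True)
        for i in ids:
            (images / f"img_{i:03d}.jpg").touch()
    names = image_names_by_split(root)

demo_total = sum(len(v) for v in names.values())
demo_duplicates = names_in_multiple_splits(names)
```

In the real tests the assertions would be `demo_duplicates == []` for AC-7 and a total of 100 for AC-3.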

Constraints

  • Must patch constants.py paths to use tmp_path
  • Requires copying 100 fixture images to tmp_path (session fixture)
  • Performance test marked: @pytest.mark.performance
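
The path-patching constraint can be illustrated with a stand-in object, since the real attribute names in constants.py are not given here. In the sketch below, `constants`, `processed_dir`, and `datasets_dir` are all hypothetical; `unittest.mock.patch.object` is used so the example runs without pytest, and in a pytest test the equivalent would be `monkeypatch.setattr(obj, name, value)` together with the `tmp_path` fixture.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace
from unittest.mock import patch

# Stand-in for the project's constants/config object; attribute names are hypothetical.
constants = SimpleNamespace(
    processed_dir=Path("/data/processed"),
    datasets_dir=Path("/data/datasets"),
)


def current_paths() -> tuple[Path, Path]:
    """Read the paths the way code under test would: at call time, via the object."""
    return constants.processed_dir, constants.datasets_dir


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    with patch.object(constants, "processed_dir", tmp_path / "processed"), \
         patch.object(constants, "datasets_dir", tmp_path / "datasets"):
        patched = current_paths()  # sees the temporary locations

restored = current_paths()  # original values are back after the with-block
```

Patching only takes effect for code that looks the attribute up at call time; a module that copied the path into its own variable at import would keep the old value.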