# Dataset Formation Tests

**Task**: AZ-155_test_dataset_formation
**Name**: Dataset Formation Tests
**Description**: Implement blackbox, performance, resilience, and resource tests for dataset split: 70/20/10 ratio, directory structure, integrity, corrupted filtering
**Complexity**: 2 points
**Dependencies**: AZ-152_test_infrastructure
**Component**: Blackbox Tests
**Jira**: AZ-155
**Epic**: AZ-151

## Problem

Dataset formation splits annotated images into train/valid/test sets. Tests must verify correct ratios, directory structure, data integrity, corrupted label filtering, and performance.

## Outcome

- 8 passing pytest tests covering dataset formation
- Blackbox tests in `tests/test_dataset_formation.py`
- Performance test in `tests/performance/test_dataset_perf.py`

## Scope

### Included

- BT-DSF-01: 70/20/10 split ratio (100 images → 70/20/10)
- BT-DSF-02: Split directory structure (6 subdirs created)
- BT-DSF-03: Total files preserved (sum == 100)
- BT-DSF-04: Corrupted labels moved to corrupted directory
- PT-DSF-01: Dataset formation throughput (100 images ≤ 30s)
- RT-DSF-01: Empty processed directory handled gracefully
- RL-DSF-01: Split ratios sum to 100%
- RL-DSF-02: No data duplication across splits

### Excluded

- Label validation (separate task)

## Acceptance Criteria

**AC-1: Split ratio**
Given 100 images + labels in processed/ dir
When form_dataset() runs with patched paths
Then train: 70, valid: 20, test: 10

**AC-2: Directory structure**
Given 100 images + labels
When form_dataset() runs
Then creates train/images/, train/labels/, valid/images/, valid/labels/, test/images/, test/labels/

**AC-3: Data integrity**
Given 100 valid images + labels
When form_dataset() runs
Then count(train) + count(valid) + count(test) == 100

**AC-4: Corrupted filtering**
Given 95 valid + 5 corrupted labels
When form_dataset() runs
Then 5 in data-corrupted/, 95 across splits

**AC-5: Throughput**
Given 100 images + labels
When form_dataset() runs
Then completes within 30 seconds

**AC-6: Empty directory**
Given empty processed images dir
When form_dataset() runs
Then empty dirs created, no crash

**AC-7: No duplication**
Given 100 images after form_dataset()
When collecting all filenames across train/valid/test
Then no filename appears in more than one split

## Constraints

- Must patch constants.py paths to use tmp_path
- Requires copying 100 fixture images to tmp_path (session fixture)
- Performance test marked: `@pytest.mark.performance`
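
The checks behind AC-1, AC-2, AC-3, AC-6, and AC-7 can be sketched without the project code. The example below is a minimal illustration only: `stand_in_form_dataset`, `split_names`, `make_fixture`, and `run_checks` are hypothetical names, the stand-in is not the real `form_dataset()`, and the `processed/images` + `processed/labels` input layout is an assumption. In the actual tests, `form_dataset()` would be called with the `constants.py` paths patched to `tmp_path`, and only the assertions would carry over.

```python
import random
import shutil
import tempfile
from pathlib import Path

SPLITS = {"train": 70, "valid": 20, "test": 10}  # percentages from AC-1
KINDS = ("images", "labels")


def stand_in_form_dataset(processed: Path, out: Path, seed: int = 0) -> None:
    """Hypothetical stand-in for form_dataset(); NOT the project's code."""
    # Create all six subdirectories up front so an empty input still works (AC-6).
    for split in SPLITS:
        for kind in KINDS:
            (out / split / kind).mkdir(parents=True, exist_ok=True)

    images = sorted((processed / "images").glob("*.jpg"))
    random.Random(seed).shuffle(images)

    # Integer arithmetic avoids float rounding when computing split sizes.
    n_train = len(images) * SPLITS["train"] // 100
    n_valid = len(images) * SPLITS["valid"] // 100
    chunks = {
        "train": images[:n_train],
        "valid": images[n_train:n_train + n_valid],
        "test": images[n_train + n_valid:],  # remainder, so no file is dropped
    }
    for split, files in chunks.items():
        for image in files:
            label = processed / "labels" / (image.stem + ".txt")
            shutil.copy2(image, out / split / "images" / image.name)
            shutil.copy2(label, out / split / "labels" / label.name)


def split_names(out: Path) -> dict[str, set[str]]:
    """Map each split to the set of image filenames it contains."""
    return {s: {p.name for p in (out / s / "images").iterdir()} for s in SPLITS}


def make_fixture(processed: Path, n: int) -> None:
    """Write n empty placeholder image/label pairs (assumed input layout)."""
    for kind in KINDS:
        (processed / kind).mkdir(parents=True, exist_ok=True)
    for i in range(n):
        (processed / "images" / f"img_{i:03d}.jpg").touch()
        (processed / "labels" / f"img_{i:03d}.txt").touch()


def run_checks(n: int) -> dict[str, int]:
    """Run the stand-in on n files and assert AC-2/3/7-style properties."""
    with tempfile.TemporaryDirectory() as tmp:
        processed, out = Path(tmp) / "processed", Path(tmp) / "dataset"
        make_fixture(processed, n)
        stand_in_form_dataset(processed, out)
        names = split_names(out)
        counts = {s: len(v) for s, v in names.items()}
        # AC-2: all six subdirectories exist.
        assert all((out / s / k).is_dir() for s in SPLITS for k in KINDS)
        # AC-3: no file is lost.
        assert sum(counts.values()) == n
        # AC-7: the union is as large as the sum, so no name is shared.
        assert len(set().union(*names.values())) == n
        return counts
```

Calling `run_checks(100)` returns the per-split counts to compare against AC-1, and `run_checks(0)` exercises the empty-directory case from AC-6. Comparing the size of the filename union with the sum of per-split counts is one way to detect duplication; pairwise set intersection would work equally well and gives a more specific failure message.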