# Dataset Formation Tests

**Task**: AZ-155_test_dataset_formation
**Name**: Dataset Formation Tests
**Description**: Implement blackbox, performance, resilience, and resource tests for the dataset split: 70/20/10 ratio, directory structure, data integrity, and corrupted-label filtering
**Complexity**: 2 points
**Dependencies**: AZ-152_test_infrastructure
**Component**: Blackbox Tests
**Jira**: AZ-155
**Epic**: AZ-151

## Problem

Dataset formation splits annotated images into train/valid/test sets. Tests must verify correct ratios, directory structure, data integrity, corrupted label filtering, and performance.

## Outcome

- 8 passing pytest tests covering dataset formation
- Blackbox tests in `tests/test_dataset_formation.py`
- Performance test in `tests/performance/test_dataset_perf.py`

## Scope

### Included
- BT-DSF-01: 70/20/10 split ratio (100 images → 70/20/10)
- BT-DSF-02: Split directory structure (6 subdirs created)
- BT-DSF-03: Total files preserved (sum == 100)
- BT-DSF-04: Corrupted labels moved to corrupted directory
- PT-DSF-01: Dataset formation throughput (100 images ≤ 30s)
- RT-DSF-01: Empty processed directory handled gracefully
- RL-DSF-01: Split ratios sum to 100%
- RL-DSF-02: No data duplication across splits

### Excluded
- Label validation (separate task)

## Acceptance Criteria

**AC-1: Split ratio**
Given 100 images + labels in processed/ dir
When form_dataset() runs with patched paths
Then train: 70, valid: 20, test: 10
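
The expected counts can be derived rather than hard-coded, which also covers the RL-DSF-01 check that the ratios sum to 100%. The sketch below is a hypothetical test helper, not part of `form_dataset()`; the rounding rule (floor train and valid, give the remainder to test) is an assumption and may differ from the real implementation for totals that do not divide evenly.

```python
def expected_split_counts(total, ratios=(70, 20, 10)):
    """Return (train, valid, test) counts for `total` items.

    Assumption: train and valid are floored and test takes the remainder,
    so the three counts always sum to `total`.
    """
    if sum(ratios) != 100:
        raise ValueError("split ratios must sum to 100")
    train = total * ratios[0] // 100
    valid = total * ratios[1] // 100
    test = total - train - valid
    return train, valid, test
```

For the 100-image fixture the rounding rule is irrelevant, since 70/20/10 divides exactly.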

**AC-2: Directory structure**
Given 100 images + labels
When form_dataset() runs
Then creates train/images/, train/labels/, valid/images/, valid/labels/, test/images/, test/labels/
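
A minimal sketch of how the six-directory check might be written. `missing_split_dirs` is a hypothetical helper; the only thing taken from the spec is the list of subdirectories.

```python
from pathlib import Path

SPLITS = ("train", "valid", "test")
KINDS = ("images", "labels")


def missing_split_dirs(dataset_root):
    """Return the expected 'split/kind' subdirectories that do not exist."""
    root = Path(dataset_root)
    return [
        f"{split}/{kind}"
        for split in SPLITS
        for kind in KINDS
        if not (root / split / kind).is_dir()
    ]
```

Returning the missing names instead of a boolean lets the test assert `missing_split_dirs(root) == []` and get a readable failure message.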

**AC-3: Data integrity**
Given 100 valid images + labels
When form_dataset() runs
Then count(train) + count(valid) + count(test) == 100

**AC-4: Corrupted filtering**
Given 95 valid + 5 corrupted labels
When form_dataset() runs
Then 5 in data-corrupted/, 95 across splits
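
One way to check both halves of this criterion is to count label files per location. The helper names below are hypothetical, and the `.txt` label suffix and the recursive search of the corrupted directory are assumptions about the on-disk layout.

```python
from pathlib import Path


def count_label_files(directory, suffix=".txt"):
    """Count label files under `directory`, recursing into subdirectories."""
    root = Path(directory)
    if not root.is_dir():
        return 0
    return sum(1 for p in root.rglob(f"*{suffix}") if p.is_file())


def label_distribution(dataset_root, corrupted_root,
                       splits=("train", "valid", "test")):
    """Summarise how many labels ended up in the splits vs. the corrupted dir."""
    dataset_root = Path(dataset_root)
    per_split = {s: count_label_files(dataset_root / s / "labels") for s in splits}
    return {
        "splits": per_split,
        "in_splits": sum(per_split.values()),
        "corrupted": count_label_files(corrupted_root),
    }
```

With the AC-4 fixture, the test would expect `in_splits == 95` and `corrupted == 5`.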

**AC-5: Throughput**
Given 100 images + labels
When form_dataset() runs
Then completes within 30 seconds
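
The timing itself can be measured with a monotonic clock. `run_timed` is a hypothetical helper shown only as a sketch; a benchmarking plugin would serve equally well.

```python
import time


def run_timed(func, *args, **kwargs):
    """Call `func` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

In the performance test this would be used roughly as `_, elapsed = run_timed(form_dataset)` followed by `assert elapsed <= 30`.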

**AC-6: Empty directory**
Given empty processed images dir
When form_dataset() runs
Then empty dirs created, no crash

**AC-7: No duplication**
Given 100 images after form_dataset()
When collecting all filenames across train/valid/test
Then no filename appears in more than one split
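
A sketch of the duplication check, under the assumption that images are compared by file stem (so `a.jpg` and `a.png` would count as the same item). `duplicated_stems` is a hypothetical helper name.

```python
from collections import Counter
from pathlib import Path


def duplicated_stems(dataset_root, splits=("train", "valid", "test")):
    """Return the sorted file stems that appear in more than one split."""
    seen_in = Counter()
    for split in splits:
        images_dir = Path(dataset_root) / split / "images"
        if not images_dir.is_dir():
            continue
        # Use a set so each stem counts at most once per split.
        stems = {p.stem for p in images_dir.iterdir() if p.is_file()}
        seen_in.update(stems)
    return sorted(stem for stem, n in seen_in.items() if n > 1)
```

The test then reduces to `assert duplicated_stems(root) == []`.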

## Constraints

- Must patch constants.py paths to use tmp_path
- Requires copying 100 fixture images to tmp_path (session fixture; note that `tmp_path` is function-scoped, so a session-scoped fixture must use `tmp_path_factory`)
- Performance test marked: `@pytest.mark.performance`
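
The path-patching constraint might be met with a small helper built on pytest's `monkeypatch` fixture. This is a sketch only: the attribute names in `PATH_ATTRS` are hypothetical placeholders and must be replaced with the real names defined in `constants.py`.

```python
from pathlib import Path

# Hypothetical attribute names and relative locations; the real ones
# live in constants.py and are not specified in this task.
PATH_ATTRS = {
    "processed_images_dir": "processed/images",
    "processed_labels_dir": "processed/labels",
    "datasets_dir": "datasets",
    "corrupted_dir": "data-corrupted",
}


def patch_dataset_paths(monkeypatch, constants_module, root):
    """Point every path constant at a freshly created directory under `root`."""
    patched = {}
    for attr, relative in PATH_ATTRS.items():
        target = Path(root) / relative
        target.mkdir(parents=True, exist_ok=True)
        # pytest's monkeypatch.setattr raises by default if the attribute
        # does not exist, which catches misspelled constant names.
        monkeypatch.setattr(constants_module, attr, str(target))
        patched[attr] = target
    return patched
```

In a test this would be called as `patch_dataset_paths(monkeypatch, constants, tmp_path)`. If the module under test imports names with `from constants import ...`, the patch has to target that module's namespace instead, because the names are bound at import time.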