# Component: Data Pipeline

## Overview

Tools for preparing and managing annotation data: augmentation of training images, format conversion from external annotation systems, and visual inspection of annotated datasets.

**Pattern**: Batch processing tools (standalone scripts + library)
**Upstream**: Core (constants), Data Models (ImageLabel, AnnotationClass)
**Downstream**: Training (augmented images feed into dataset formation)

## Modules

- `augmentation`: image augmentation pipeline (albumentations)
- `convert-annotations`: Pascal VOC / oriented bbox → YOLO format converter
- `dataset-visualiser`: interactive annotation visualization tool

## Internal Interfaces

### Augmentator

```python
Augmentator()
Augmentator.augment_annotations(from_scratch: bool = False) -> None
Augmentator.augment_inner(img_ann: ImageLabel) -> list[ImageLabel]
Augmentator.correct_bboxes(labels) -> list
Augmentator.read_labels(labels_path) -> list[list]
```

### convert-annotations (functions)

```python
convert(folder, dest_folder, read_annotations, ann_format) -> None
minmax2yolo(width, height, xmin, xmax, ymin, ymax) -> tuple
read_pascal_voc(width, height, s: str) -> list[str]
read_bbox_oriented(width, height, s: str) -> list[str]
```

### dataset-visualiser (functions)

```python
visualise_dataset() -> None
visualise_processed_folder() -> None
```

## Data Access Patterns

- **Augmentation**: Reads from `/azaion/data/images/` + `/azaion/data/labels/`, writes to `/azaion/data-processed/images/` + `/azaion/data-processed/labels/`
- **Conversion**: Reads from user-specified source folder, writes to destination folder
- **Visualiser**: Reads from datasets or processed folder, renders to matplotlib window

## Implementation Details

- **Augmentation pipeline**: Per image → 1 original copy + 7 augmented variants (8× data expansion)
  - HorizontalFlip (60%), BrightnessContrast (40%), Affine (80%), MotionBlur (10%), HueSaturation (40%)
  - Bbox correction clips outside-boundary boxes, removes boxes < 1% of image
  - Incremental: skips already-processed images
  - Continuous mode: infinite loop with 5-minute sleep between rounds
  - Concurrent: ThreadPoolExecutor for parallel image processing
- **Format conversion**: Pluggable reader pattern. `convert()` accepts any reader function that maps (width, height, text) → YOLO lines
- **Visualiser**: Interactive (waits for keypress), intended as a developer debugging tool

## Caveats

- `dataset-visualiser` imports from a `preprocessing` module that does not exist, so the import is broken
- `dataset-visualiser` has hardcoded dataset date (`2024-06-18`) and start index (35247)
- `convert-annotations` hardcodes class mappings (Truck=1, Car/Taxi=2), which are not configurable
- Augmentation parameters are hardcoded, not configurable via config file
- Augmentation `total_to_process` attribute referenced in `augment_annotation` but never set (uses `total_images_to_process`)

## Dependency Graph

```mermaid
graph TD
    constants --> augmentation
    dto_imageLabel[dto/imageLabel] --> augmentation
    constants --> dataset-visualiser
    dto_annotationClass[dto/annotationClass] --> dataset-visualiser
    dto_imageLabel --> dataset-visualiser
    augmentation --> manual_run
```

## Logging Strategy

Print statements for progress tracking (processed count, errors). No structured logging.
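
## Example Sketch

The corner-to-YOLO conversion behind `minmax2yolo` can be illustrated with a minimal sketch. It assumes the usual YOLO label convention: box center and size, each normalized by the image dimensions. The argument order follows the signature listed under Internal Interfaces. The function body, the helper `to_yolo_line`, and the sample values are assumptions for illustration, not the module's actual code.

```python
def minmax2yolo(width, height, xmin, xmax, ymin, ymax):
    """Convert pixel corner coordinates to normalized (cx, cy, w, h).

    Assumed behavior: the real implementation may differ in rounding
    or validation.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    cx = (xmin + xmax) / 2 / width    # box center, fraction of image width
    cy = (ymin + ymax) / 2 / height   # box center, fraction of image height
    w = (xmax - xmin) / width         # box width, fraction of image width
    h = (ymax - ymin) / height        # box height, fraction of image height
    return cx, cy, w, h


def to_yolo_line(class_id, box):
    """Hypothetical helper: format one label line as 'class cx cy w h'."""
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in box)


if __name__ == "__main__":
    # A 320x240 px box centered in a 640x480 image, class 1 (Truck).
    box = minmax2yolo(640, 480, xmin=160, xmax=480, ymin=120, ymax=360)
    print(to_yolo_line(1, box))  # 1 0.500000 0.500000 0.500000 0.500000
```

A reader function passed to `convert()` could call a conversion like this once per parsed box and return the formatted lines, which would keep the parsing logic for each source format separate from the coordinate arithmetic.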