gps-denied-onboard/docs/02_components/helpers/h08_batch_validator_spec.md
Oleksandr Bezdieniezhnykh 2037870f67 add chunking
2025-11-27 03:43:19 +02:00

Batch Validator Helper

Interface Definition

Interface Name: IBatchValidator

Interface Methods

from abc import ABC, abstractmethod
from typing import List

class IBatchValidator(ABC):
    @abstractmethod
    def validate_batch_size(self, batch: ImageBatch) -> ValidationResult:
        pass
    
    @abstractmethod
    def check_sequence_continuity(self, batch: ImageBatch, expected_start: int) -> ValidationResult:
        pass
    
    @abstractmethod
    def validate_naming_convention(self, filenames: List[str]) -> ValidationResult:
        pass
    
    @abstractmethod
    def validate_format(self, image_data: bytes) -> ValidationResult:
        pass

Component Description

Responsibilities

  • Validate image batch integrity
  • Check sequence continuity and naming conventions
  • Validate image format and dimensions
  • Ensure batch size constraints (10-50 images)
  • Support strict sequential ordering (ADxxxxxx.jpg)

Scope

  • Batch validation for F05 Image Input Pipeline
  • Image format validation
  • Filename pattern matching
  • Sequence gap detection

API Methods

validate_batch_size(batch: ImageBatch) -> ValidationResult

Description: Validates that the batch contains between 10 and 50 images (inclusive).

Called By:

  • F05 Image Input Pipeline (before queuing)

Input:

batch: ImageBatch:
    images: List[bytes]
    filenames: List[str]
    start_sequence: int
    end_sequence: int

Output:

ValidationResult:
    valid: bool
    errors: List[str]

Validation Rules:

  • Minimum batch size: 10 images
  • Maximum batch size: 50 images
  • Reason: Balance between upload overhead and processing granularity

Error Conditions:

  • Returns valid=False with error message (not an exception)

Test Cases:

  1. Valid batch (20 images): Returns valid=True
  2. Too few images (5): Returns valid=False, error="Batch size 5 below minimum 10"
  3. Too many images (60): Returns valid=False, error="Batch size 60 exceeds maximum 50"
  4. Empty batch: Returns valid=False
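
The rules and test cases above can be sketched as follows. This is a minimal illustration, not the component's implementation: a plain dataclass stands in for the ValidationResult model, the function takes the image list directly instead of an ImageBatch, and the `min_size`/`max_size` parameter names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ValidationResult:
    """Stand-in for the spec's ValidationResult model (assumption)."""
    valid: bool
    errors: List[str] = field(default_factory=list)


def validate_batch_size(images: List[bytes], min_size: int = 10,
                        max_size: int = 50) -> ValidationResult:
    """Check the image count against inclusive bounds; returns errors, never raises."""
    count = len(images)
    if count < min_size:
        return ValidationResult(False, [f"Batch size {count} below minimum {min_size}"])
    if count > max_size:
        return ValidationResult(False, [f"Batch size {count} exceeds maximum {max_size}"])
    return ValidationResult(True)


print(validate_batch_size([b""] * 20).valid)   # True
print(validate_batch_size([b""] * 5).errors)   # ['Batch size 5 below minimum 10']
```

An empty batch needs no special case here: a count of 0 falls below the minimum and is reported the same way as any other undersized batch.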

check_sequence_continuity(batch: ImageBatch, expected_start: int) -> ValidationResult

Description: Validates that the images form a consecutive sequence with no gaps.

Called By:

  • F05 Image Input Pipeline (before queuing)

Input:

batch: ImageBatch
expected_start: int  # Expected starting sequence number

Output:

ValidationResult:
    valid: bool
    errors: List[str]

Validation Rules:

  1. Sequence starts at expected_start: First image sequence == expected_start
  2. Consecutive numbers: No gaps in sequence (AD000101, AD000102, AD000103, ...)
  3. Filename extraction: Parse sequence from ADxxxxxx.jpg pattern
  4. Strict ordering: Images must be in sequential order

Algorithm:

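# Guard against an empty batch before indexing sequences[0]
if not batch.filenames:
    return invalid("Batch contains no filenames")
try:
    sequences = [extract_sequence(filename) for filename in batch.filenames]
except ValueError as e:
    # extract_sequence raises on malformed names; report it instead of propagating
    return invalid(str(e))
if sequences[0] != expected_start:
    return invalid(f"Expected start {expected_start}, got {sequences[0]}")
for i in range(len(sequences) - 1):
    if sequences[i+1] != sequences[i] + 1:
        return invalid(f"Gap detected: {sequences[i]} -> {sequences[i+1]}")
return valid()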

Error Conditions:

  • Returns valid=False with specific gap information

Test Cases:

  1. Valid sequence (101-150): expected_start=101 → valid=True
  2. Wrong start: expected_start=101, got 102 → valid=False
  3. Gap in sequence: AD000101, AD000103 (missing 102) → valid=False
  4. Out of order: AD000102, AD000101 → valid=False
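
A self-contained sketch of the continuity check, exercising the test cases above. It rests on a few assumptions: a dataclass stands in for ValidationResult, the function receives the filename list rather than an ImageBatch, and malformed or missing filenames are reported as errors rather than raised.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class ValidationResult:
    """Stand-in for the spec's ValidationResult model (assumption)."""
    valid: bool
    errors: List[str] = field(default_factory=list)


def check_sequence_continuity(filenames: List[str], expected_start: int) -> ValidationResult:
    """Return the first continuity violation found, or a valid result."""
    sequences = []
    for name in filenames:
        match = re.match(r"AD(\d{6})\.", name)
        if match is None:
            # Report malformed names instead of raising
            return ValidationResult(False, [f"Invalid filename format: {name}"])
        sequences.append(int(match.group(1)))
    if not sequences:
        return ValidationResult(False, ["Batch contains no filenames"])
    if sequences[0] != expected_start:
        return ValidationResult(False, [f"Expected start {expected_start}, got {sequences[0]}"])
    # Compare each number with its successor; catches gaps and out-of-order names
    for prev, cur in zip(sequences, sequences[1:]):
        if cur != prev + 1:
            return ValidationResult(False, [f"Gap detected: {prev} -> {cur}"])
    return ValidationResult(True)


names = [f"AD{n:06d}.jpg" for n in range(101, 151)]
print(check_sequence_continuity(names, 101).valid)  # True
print(check_sequence_continuity(["AD000101.jpg", "AD000103.jpg"], 101).errors)
# ['Gap detected: 101 -> 103']
```

Because each number is compared with its immediate successor, an out-of-order pair such as AD000102, AD000101 is reported through the same "Gap detected" path as a missing image.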

validate_naming_convention(filenames: List[str]) -> ValidationResult

Description: Validates that filenames match the ADxxxxxx.jpg pattern (six digits, with a .jpg, .JPG, .png, or .PNG extension).

Called By:

  • Internal (during check_sequence_continuity)
  • F05 Image Input Pipeline

Input:

filenames: List[str]

Output:

ValidationResult:
    valid: bool
    errors: List[str]

Validation Rules:

  1. Pattern: AD\d{6}\.(jpg|JPG|png|PNG)
  2. Examples: AD000001.jpg, AD000237.JPG, AD002000.png
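  3. Extension case: Accepts all-lowercase or all-uppercase extensions (.jpg, .JPG, .png, .PNG); mixed case such as .Jpg does not match the pattern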
  4. 6 digits required: Zero-padded to 6 digits

Regex Pattern: ^AD\d{6}\.(jpg|JPG|png|PNG)$

Error Conditions:

  • Returns valid=False listing invalid filenames

Test Cases:

  1. Valid names: ["AD000001.jpg", "AD000002.jpg"] → valid=True
  2. Invalid prefix: "IMG_0001.jpg" → valid=False
  3. Wrong digit count: "AD001.jpg" (3 digits) → valid=False
  4. Missing extension: "AD000001" → valid=False
  5. Invalid extension: "AD000001.bmp" → valid=False
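
One way the naming check might look, using the regex from this section unchanged. The dataclass stand-in for ValidationResult and the "Invalid filename: ..." message wording are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Pattern copied from the spec
FILENAME_PATTERN = re.compile(r"^AD\d{6}\.(jpg|JPG|png|PNG)$")


@dataclass
class ValidationResult:
    """Stand-in for the spec's ValidationResult model (assumption)."""
    valid: bool
    errors: List[str] = field(default_factory=list)


def validate_naming_convention(filenames: List[str]) -> ValidationResult:
    """Collect every filename that fails the pattern; returns errors, never raises."""
    bad = [name for name in filenames if not FILENAME_PATTERN.fullmatch(name)]
    if bad:
        return ValidationResult(False, [f"Invalid filename: {name}" for name in bad])
    return ValidationResult(True)


result = validate_naming_convention(["AD000001.jpg", "IMG_0001.jpg", "AD001.jpg"])
print(result.valid)   # False
print(result.errors)  # ['Invalid filename: IMG_0001.jpg', 'Invalid filename: AD001.jpg']
```

The sketch uses `fullmatch` rather than `match` because `$` in Python also matches just before a trailing newline, so `match` would accept "AD000001.jpg\n". All offending names are collected so the caller sees the whole list in one pass.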

validate_format(image_data: bytes) -> ValidationResult

Description: Validates image file format and properties.

Called By:

  • F05 Image Input Pipeline (per-image validation)

Input:

image_data: bytes  # Raw image file bytes

Output:

ValidationResult:
    valid: bool
    errors: List[str]

Validation Rules:

  1. Format: Valid JPEG or PNG
  2. Dimensions: 640×480 to 6252×4168 pixels
  3. File size: At most 10MB (10 × 1024 × 1024 bytes) per image
  4. Image readable: Not corrupted
  5. Color channels: RGB (3 channels)

Algorithm:

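from io import BytesIO
from PIL import Image

try:
    # Cheap size check first, before any parsing
    if len(image_data) > 10 * 1024 * 1024:
        return invalid("File size exceeds 10MB")

    image = Image.open(BytesIO(image_data))
    width, height = image.size

    if image.format not in ['JPEG', 'PNG']:
        return invalid("Format must be JPEG or PNG")

    if width < 640 or height < 480:
        return invalid("Dimensions too small")

    if width > 6252 or height > 4168:
        return invalid("Dimensions too large")

    # Rule 5: three-channel RGB only
    if image.mode != 'RGB':
        return invalid("Image must be RGB (3 channels)")

    return valid()
except Exception as e:
    # Image.open raises if the data cannot be identified as an image
    return invalid(f"Corrupted image: {e}")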

Error Conditions:

  • Returns valid=False with specific error

Test Cases:

  1. Valid JPEG (2048×1536): valid=True
  2. Valid PNG (6252×4168): valid=True
  3. Too small (320×240): valid=False
  4. Too large (8000×6000): valid=False
  5. File too big (15MB): valid=False
  6. Corrupted file: valid=False
  7. BMP format: valid=False

Integration Tests

Test 1: Complete Batch Validation

  1. Create batch with 20 images, AD000101.jpg - AD000120.jpg
  2. validate_batch_size() → valid
  3. validate_naming_convention() → valid
  4. check_sequence_continuity(expected_start=101) → valid
  5. validate_format() for each image → all valid
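
A caller such as F05 needs a single outcome from steps 2 to 5. The sketch below shows one possible way to combine them; `merge_results` is a hypothetical helper, not part of IBatchValidator, and a dataclass stands in for ValidationResult.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class ValidationResult:
    """Stand-in for the spec's ValidationResult model (assumption)."""
    valid: bool
    errors: List[str] = field(default_factory=list)


def merge_results(results: Iterable[ValidationResult]) -> ValidationResult:
    """Combine check results: valid only if all are valid, errors concatenated."""
    errors: List[str] = []
    all_valid = True
    for result in results:
        all_valid = all_valid and result.valid
        errors.extend(result.errors)
    return ValidationResult(all_valid, errors)


# Illustrative outcomes of the four checks, with one failing
steps = [
    ValidationResult(True),
    ValidationResult(True),
    ValidationResult(False, ["Gap detected: 101 -> 103"]),
    ValidationResult(True),
]
combined = merge_results(steps)
print(combined.valid, combined.errors)  # False ['Gap detected: 101 -> 103']
```

Running every check and concatenating the errors, instead of stopping at the first failure, fits the goal of comprehensive error reporting, at the cost of doing work on batches that are already known to be invalid.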

Test 2: Invalid Batch Detection

  1. Create batch with 60 images → validate_batch_size() fails
  2. Create batch with gap (AD000101, AD000103) → check_sequence_continuity() fails
  3. Create batch with IMG_0001.jpg → validate_naming_convention() fails
  4. Create batch with corrupted image → validate_format() fails

Test 3: Edge Cases

  1. Batch with exactly 10 images → valid
  2. Batch with exactly 50 images → valid
  3. Batch with 51 images → invalid
  4. Batch of 10 images starting at AD999990.jpg and ending at AD999999.jpg (the highest 6-digit sequence number) → valid

Non-Functional Requirements

Performance

  • validate_batch_size: < 1ms
  • check_sequence_continuity: < 10ms for 50 images
  • validate_naming_convention: < 5ms for 50 filenames
  • validate_format: < 20ms per image (with PIL)
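  • Total batch validation: < 100ms for 50 images. With up to 16ms budgeted for the three batch-level checks, about 84ms remains for format validation, so validate_format must average about 1.7ms per image; the 20ms per-image limit cannot be reached on every image within this budget.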
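
These budgets could be checked with a small timing harness along the following lines. `average_ms` is a hypothetical helper, and the workload here is a stand-in size check, not the real validator.

```python
import time
from typing import Callable


def average_ms(func: Callable[[], object], repeats: int = 100) -> float:
    """Average wall-clock time of func() in milliseconds over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeats


# Stand-in workload: a size check on a 50-element list
batch = [b""] * 50
elapsed_ms = average_ms(lambda: 10 <= len(batch) <= 50)
print(f"size check: {elapsed_ms:.4f} ms per call")
```

Averaging over many calls smooths out timer resolution and scheduling noise, which matters for budgets as small as 1ms.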

Reliability

  • Never raises exceptions (returns ValidationResult with errors)
  • Handles edge cases gracefully
  • Clear, actionable error messages

Maintainability

  • Configurable validation rules (min/max batch size, dimensions)
  • Easy to add new validation rules
  • Comprehensive error reporting

Dependencies

Internal Components

  • None (pure utility, no internal dependencies)

External Dependencies

  • Pillow (PIL): Image format validation and dimension checking
  • re (regex): Filename pattern matching

Data Models

ImageBatch

class ImageBatch(BaseModel):
    images: List[bytes]  # Raw image data
    filenames: List[str]  # e.g., ["AD000101.jpg", ...]
    start_sequence: int  # 101
    end_sequence: int    # 150
    batch_number: int    # Sequential batch number

ValidationResult

class ValidationResult(BaseModel):
    valid: bool
    errors: List[str] = []  # Empty if valid
    warnings: List[str] = []  # Optional warnings

ValidationRules (Configuration)

class ValidationRules(BaseModel):
    min_batch_size: int = 10
    max_batch_size: int = 50
    min_width: int = 640
    min_height: int = 480
    max_width: int = 6252
    max_height: int = 4168
    max_file_size_mb: int = 10
    allowed_formats: List[str] = ["JPEG", "PNG"]
    filename_pattern: str = r"^AD\d{6}\.(jpg|JPG|png|PNG)$"

Sequence Extraction

import re

def extract_sequence(filename: str) -> int:
    """
    Extracts sequence number from filename.
    
    Example: "AD000237.jpg" -> 237
    """
    match = re.match(r"AD(\d{6})\.", filename)
    if match:
        return int(match.group(1))
    raise ValueError(f"Invalid filename format: {filename}")
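
When building test fixtures, the inverse mapping from sequence number to filename can be useful. The sketch below pairs a copy of the helper above with a hypothetical `format_filename` function (the name and its `extension` parameter are assumptions) and shows the round trip.

```python
import re


def extract_sequence(filename: str) -> int:
    """Copy of the spec's helper, repeated so the example runs on its own."""
    match = re.match(r"AD(\d{6})\.", filename)
    if match:
        return int(match.group(1))
    raise ValueError(f"Invalid filename format: {filename}")


def format_filename(sequence: int, extension: str = "jpg") -> str:
    """Hypothetical inverse: build a zero-padded ADxxxxxx name."""
    if not 0 <= sequence <= 999_999:
        raise ValueError(f"Sequence {sequence} does not fit in 6 digits")
    return f"AD{sequence:06d}.{extension}"


print(format_filename(237))                        # AD000237.jpg
print(extract_sequence(format_filename(999999)))   # 999999
```

The range check makes the 6-digit ceiling explicit: sequence numbers above 999999 cannot be represented in this naming scheme.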