mirror of https://github.com/azaion/ai-training.git
synced 2026-04-22 10:56:36 +00:00

Refactor task management structure and update documentation

- Changed the directory structure for task specifications to include a dedicated `todo/` folder within `_docs/02_tasks/` for tasks ready for implementation.
- Updated references in various skills and documentation to reflect the new task lifecycle, including changes in the `implementer` and `decompose` skills.
- Enhanced the README and flow documentation to clarify the new task organization and its implications for the implementation process.

These updates improve task management clarity and streamline the implementation workflow.
# Test Infrastructure

**Task**: AZ-138_test_infrastructure
**Name**: Test Infrastructure
**Description**: Scaffold the E2E test project — test runner, mock services, Docker test environment, test data fixtures, reporting
**Complexity**: 5 points
**Dependencies**: None
**Component**: Integration Tests
**Jira**: AZ-138
**Epic**: AZ-137

## Test Project Folder Layout
```
e2e/
├── conftest.py
├── requirements.txt
├── Dockerfile
├── pytest.ini
├── mocks/
│   ├── loader/
│   │   ├── Dockerfile
│   │   └── app.py
│   └── annotations/
│       ├── Dockerfile
│       └── app.py
├── fixtures/
│   ├── image_small.jpg (1280×720 JPEG, aerial, detectable objects)
│   ├── image_large.JPG (6252×4168 JPEG, triggers tiling)
│   ├── image_dense01.jpg (1280×720 JPEG, dense scene, clustered objects)
│   ├── image_dense02.jpg (1920×1080 JPEG, dense scene variant)
│   ├── image_different_types.jpg (900×1600 JPEG, varied object classes)
│   ├── image_empty_scene.jpg (1920×1080 JPEG, no detectable objects)
│   ├── video_short01.mp4 (short MP4 with moving objects)
│   ├── video_short02.mp4 (short MP4 variant for concurrent tests)
│   ├── video_long03.mp4 (long MP4, generates >100 SSE events)
│   ├── empty_image (zero-byte file, generated at build)
│   ├── corrupt_image (random binary garbage, generated at build)
│   ├── classes.json (19 classes, 3 weather modes, MaxSizeM values)
│   └── azaion.onnx (YOLO ONNX model, 1280×1280 input, 19 classes, 81MB)
├── tests/
│   ├── test_health_engine.py
│   ├── test_single_image.py
│   ├── test_tiling.py
│   ├── test_async_sse.py
│   ├── test_video.py
│   ├── test_negative.py
│   ├── test_resilience.py
│   ├── test_performance.py
│   ├── test_security.py
│   └── test_resource_limits.py
└── docker-compose.test.yml
```
### Layout Rationale

- `mocks/` separated from tests — each mock is a standalone Docker service with its own Dockerfile
- `fixtures/` holds all static test data, volume-mounted into containers
- `tests/` organized by test category matching the test spec structure (one file per task group)
- `conftest.py` provides shared pytest fixtures (HTTP clients, SSE helpers, service readiness checks)
- `pytest.ini` configures markers for `gpu`/`cpu` profiles and test ordering
## Mock Services

| Mock Service | Replaces | Endpoints | Behavior |
|-------------|----------|-----------|----------|
| mock-loader | Loader service (model download/upload) | `GET /models/azaion.onnx` — serves ONNX model from volume. `POST /upload` — accepts TensorRT engine upload, stores in memory. `POST /mock/config` — control API (simulate 503, reset state). `GET /mock/status` — returns mock state. | Deterministic: serves model file from `/models/` volume. Configurable downtime via control endpoint. First-request-fail mode for retry tests. |
| mock-annotations | Annotations service (result posting, token refresh) | `POST /annotations` — accepts annotation POST, stores in memory. `POST /auth/refresh` — returns refreshed token. `POST /mock/config` — control API (simulate 503, reset state). `GET /mock/annotations` — returns recorded annotations for assertion. | Records all incoming annotations in memory. Provides token refresh. Configurable downtime. Assertions via GET endpoint to verify what was received. |
### Mock Control API

Both mock services expose:
- `POST /mock/config` — accepts JSON `{"mode": "normal"|"error"|"first_fail"}` to control behavior
- `POST /mock/reset` — clears recorded state (annotations, uploads)
- `GET /mock/status` — returns current mode and recorded interaction count
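The control API above can be sketched as a tiny stdlib-only HTTP service. This is an illustrative stand-in, not the project's actual `app.py`: the endpoint paths and mode names come from the list above, while the handler structure and `start_mock` helper are assumptions.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Shared in-memory mock state exposed by the control endpoints.
STATE = {"mode": "normal", "interactions": 0}

class MockControlHandler(BaseHTTPRequestHandler):
    def _send_json(self, code, payload):
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        raw = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        if self.path == "/mock/config":
            cfg = json.loads(raw or b"{}")
            if cfg.get("mode") not in ("normal", "error", "first_fail"):
                self._send_json(400, {"error": "unknown mode"})
                return
            STATE["mode"] = cfg["mode"]
            self._send_json(200, {"mode": STATE["mode"]})
        elif self.path == "/mock/reset":
            STATE["interactions"] = 0  # clears recorded state
            self._send_json(200, {"reset": True})
        else:
            self._send_json(404, {"error": "not found"})

    def do_GET(self):
        if self.path == "/mock/status":
            self._send_json(200, dict(STATE))
        else:
            self._send_json(404, {"error": "not found"})

    def log_message(self, *args):  # keep test output quiet
        pass

def start_mock(port=0):
    """Serve the control API on a daemon thread; port 0 = ephemeral."""
    server = HTTPServer(("127.0.0.1", port), MockControlHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A real mock would additionally serve the business endpoints (`/models/...`, `/annotations`) and bump `interactions` there.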
## Docker Test Environment

### docker-compose.test.yml Structure

| Service | Image / Build | Purpose | Depends On |
|---------|--------------|---------|------------|
| detections | Build from repo root (Dockerfile) | System under test — FastAPI detection service | mock-loader, mock-annotations |
| mock-loader | Build from `e2e/mocks/loader/` | Serves ONNX model, accepts TensorRT uploads | — |
| mock-annotations | Build from `e2e/mocks/annotations/` | Accepts annotation results, provides token refresh | — |
| e2e-consumer | Build from `e2e/` | pytest test runner | detections |
### Networks and Volumes

**Network**: `e2e-net` — isolated bridge network, all services communicate via hostnames

**Volumes**:

| Volume | Mount Target | Content |
|--------|-------------|---------|
| test-models | mock-loader:/models | `azaion.onnx` model file |
| test-media | e2e-consumer:/media | Test images and video files |
| test-classes | detections:/app/classes.json | `classes.json` with 19 detection classes |
| test-results | e2e-consumer:/results | CSV test report output |
### GPU Profile

Two Docker Compose profiles:
- **cpu** (default): `detections` runs without GPU runtime, exercises ONNX fallback path
- **gpu**: `detections` runs with `runtime: nvidia` and `NVIDIA_VISIBLE_DEVICES=all`, exercises TensorRT path
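One way the two profiles could be expressed in `docker-compose.test.yml` — a hypothetical excerpt, not the project's actual file. Splitting the GPU variant into a separately named service (`detections-gpu`) is an assumption; only the `runtime: nvidia` / `NVIDIA_VISIBLE_DEVICES=all` settings come from the text above.

```yaml
# Hypothetical sketch: cpu is the default (no profiles key), gpu is opt-in
# via `docker compose --profile gpu up`.
services:
  detections:
    build: ..
  detections-gpu:
    build: ..
    profiles: ["gpu"]
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
```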
### Environment Variables (detections service)

| Variable | Value | Purpose |
|----------|-------|---------|
| LOADER_URL | http://mock-loader:8080 | Points to mock Loader |
| ANNOTATIONS_URL | http://mock-annotations:8081 | Points to mock Annotations |
## Test Runner Configuration

**Framework**: pytest
**Plugins & libraries**: pytest-csv (reporting), requests (HTTP client), sseclient-py (SSE streaming), pytest-timeout (per-test timeouts)
**Entry point**: `pytest --csv=/results/report.csv -v`
### Fixture Strategy

| Fixture | Scope | Purpose |
|---------|-------|---------|
| `base_url` | session | Detections service base URL (`http://detections:8000`) |
| `http_client` | session | `requests.Session` configured with base URL and default timeout |
| `sse_client_factory` | function | Factory that opens SSE connection to `/detect/stream` |
| `mock_loader_url` | session | Mock-loader base URL for control API calls |
| `mock_annotations_url` | session | Mock-annotations base URL for control API and assertion calls |
| `wait_for_services` | session (autouse) | Polls health endpoints until all services are ready |
| `reset_mocks` | function (autouse) | Calls `POST /mock/reset` on both mocks before each test |
| `image_small` | session | Reads `image_small.jpg` from `/media/` volume |
| `image_large` | session | Reads `image_large.JPG` from `/media/` volume |
| `image_dense` | session | Reads `image_dense01.jpg` from `/media/` volume |
| `image_dense_02` | session | Reads `image_dense02.jpg` from `/media/` volume |
| `image_different_types` | session | Reads `image_different_types.jpg` from `/media/` volume |
| `image_empty_scene` | session | Reads `image_empty_scene.jpg` from `/media/` volume |
| `video_short_path` | session | Path to `video_short01.mp4` on `/media/` volume |
| `video_short_02_path` | session | Path to `video_short02.mp4` on `/media/` volume |
| `video_long_path` | session | Path to `video_long03.mp4` on `/media/` volume |
| `empty_image` | session | Reads zero-byte file |
| `corrupt_image` | session | Reads random binary file |
| `jwt_token` | function | Generates a valid JWT with exp claim for auth tests |
| `warm_engine` | module | Sends one detection request to initialize engine, used by tests that need warm engine |
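The core loop behind the autouse `wait_for_services` fixture can be sketched as a plain polling helper. This is a sketch under assumed semantics — the real fixture would wrap this with session scope and real HTTP health probes; the function name matches the fixture, everything else is illustrative.

```python
import time
from typing import Callable

def wait_for_services(checks: dict[str, Callable[[], bool]],
                      timeout: float = 60.0,
                      interval: float = 0.5) -> None:
    """Poll each named readiness check until all pass or the deadline expires.

    `checks` maps a service name to a zero-arg callable returning True when
    that service is ready (e.g. a lambda hitting its health endpoint).
    """
    deadline = time.monotonic() + timeout
    pending = dict(checks)
    while pending:
        # Drop every check that now passes; keep the rest for the next round.
        pending = {name: fn for name, fn in pending.items() if not fn()}
        if pending and time.monotonic() >= deadline:
            raise TimeoutError(f"services not ready: {sorted(pending)}")
        if pending:
            time.sleep(interval)
```

A pytest fixture would call this once per session with checks for `detections`, `mock-loader`, and `mock-annotations`.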
## Test Data Fixtures

| Data Set | Source | Format | Used By |
|----------|--------|--------|---------|
| azaion.onnx | `input_data/azaion.onnx` | ONNX (1280×1280 input, 19 classes, 81MB) | All detection tests (via mock-loader) |
| classes.json | repo root `classes.json` | JSON (19 objects with Id, Name, Color, MaxSizeM) | All tests (volume mount to detections) |
| image_small.jpg | `input_data/image_small.jpg` | JPEG 1280×720 | Health, single image, filtering, negative, performance tests |
| image_large.JPG | `input_data/image_large.JPG` | JPEG 6252×4168 | Tiling tests, performance tests |
| image_dense01.jpg | `input_data/image_dense01.jpg` | JPEG 1280×720 dense scene | Dedup tests, detection cap tests |
| image_dense02.jpg | `input_data/image_dense02.jpg` | JPEG 1920×1080 dense scene | Dedup variant |
| image_different_types.jpg | `input_data/image_different_types.jpg` | JPEG 900×1600 varied classes | Weather mode class variant tests |
| image_empty_scene.jpg | `input_data/image_empty_scene.jpg` | JPEG 1920×1080 empty | Zero-detection edge case |
| video_short01.mp4 | `input_data/video_short01.mp4` | MP4 short video | Async, SSE, video processing tests |
| video_short02.mp4 | `input_data/video_short02.mp4` | MP4 short video variant | Concurrent, resilience tests |
| video_long03.mp4 | `input_data/video_long03.mp4` | MP4 long video (288MB) | SSE overflow, queue depth tests |
| empty_image | Generated at build | Zero-byte file | FT-N-01 |
| corrupt_image | Generated at build | Random binary | FT-N-02 |
### Data Isolation

Each test run starts with fresh containers (`docker compose down -v && docker compose up`). The detections service is stateless — no persistent data between runs. Mock services reset state via `POST /mock/reset` before each test. Tests that modify mock behavior (e.g., making loader unreachable) run with function-scoped mock resets.
## Test Reporting

**Format**: CSV
**Columns**: Test ID, Test Name, Execution Time (ms), Result (PASS/FAIL/SKIP), Error Message (if FAIL)
**Output path**: `/results/report.csv` → mounted to `./e2e-results/report.csv` on host
## Acceptance Criteria

**AC-1: Test environment starts**
Given the docker-compose.test.yml
When `docker compose -f docker-compose.test.yml up` is executed
Then all services start and the detections service is reachable at http://detections:8000/health

**AC-2: Mock services respond**
Given the test environment is running
When the e2e-consumer sends requests to mock-loader and mock-annotations
Then mock services respond with configured behavior and record interactions

**AC-3: Test runner executes**
Given the test environment is running
When the e2e-consumer starts
Then pytest discovers and executes test files from the `tests/` directory

**AC-4: Test report generated**
Given tests have been executed
When the test run completes
Then `/results/report.csv` exists with columns: Test ID, Test Name, Execution Time, Result, Error Message
# Health & Engine Lifecycle Tests

**Task**: AZ-139_test_health_engine
**Name**: Health & Engine Lifecycle Tests
**Description**: Implement E2E tests verifying health endpoint responses and engine lazy initialization lifecycle
**Complexity**: 3 points
**Dependencies**: AZ-138_test_infrastructure
**Component**: Integration Tests
**Jira**: AZ-139
**Epic**: AZ-137
## Problem

The health endpoint and engine initialization lifecycle are critical for operational monitoring and service readiness. Tests must verify that the health endpoint correctly reflects engine state transitions (None → Downloading → Enabled/Error) and that engine initialization is lazy (triggered by first detection, not at startup).

## Outcome

- Health endpoint behavior verified across all engine states
- Lazy initialization confirmed (no engine load at startup)
- ONNX fallback path validated on CPU-only environments
- Engine state transitions observable through health endpoint

## Scope

### Included
- FT-P-01: Health check returns status before engine initialization
- FT-P-02: Health check reflects engine availability after initialization
- FT-P-14: Engine lazy initialization on first detection request
- FT-P-15: ONNX fallback when GPU unavailable

### Excluded
- TensorRT-specific engine tests (require GPU hardware)
- Performance benchmarking of engine initialization time
- Engine error recovery scenarios (covered in resilience tests)
## Acceptance Criteria

**AC-1: Pre-init health check**
Given the detections service just started with no prior requests
When GET /health is called
Then response is 200 with status "healthy" and aiAvailability "None"

**AC-2: Post-init health check**
Given a successful detection has been performed
When GET /health is called
Then aiAvailability reflects an active engine state (not "None" or "Downloading")

**AC-3: Lazy initialization**
Given a fresh service start
When GET /health is called immediately
Then aiAvailability is "None" (engine not loaded at startup)
And after POST /detect with a valid image, GET /health shows engine active

**AC-4: ONNX fallback**
Given the service runs without GPU runtime (CPU-only profile)
When POST /detect is called with a valid image
Then detection succeeds via ONNX Runtime without TensorRT errors
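The lazy-initialization check (AC-3) reduces to a pair of health probes around the first detection. A stdlib-only sketch: the `/health` path, the `aiAvailability` field, and the None/Downloading values come from this spec; the helper functions and the injectable `opener` parameter are illustrative, not the project's test code.

```python
import json
import urllib.request

def get_health(base_url, opener=urllib.request.urlopen):
    """GET /health and return the parsed JSON body."""
    with opener(f"{base_url}/health", timeout=2) as resp:
        return json.load(resp)

def assert_lazy_init(base_url, detect, opener=urllib.request.urlopen):
    """AC-3: the engine must not be loaded before the first detection,
    and must be active afterwards.

    `detect` is a zero-arg callable performing one successful detection
    (e.g. POST /detect with a valid image) -- left to the caller.
    """
    before = get_health(base_url, opener)
    assert before["aiAvailability"] == "None", before
    detect()
    after = get_health(base_url, opener)
    assert after["aiAvailability"] not in ("None", "Downloading"), after
```

In the real suite this would run against `http://detections:8000` with a live HTTP client.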
## Non-Functional Requirements

**Performance**
- Health check response within 2s
- First detection (including engine init) within 60s

**Reliability**
- Tests must work on both CPU-only and GPU Docker profiles

## Integration Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|-------------|-------------------|----------------|
| AC-1 | Fresh service, no requests | GET /health before any detection | 200, aiAvailability: "None" | Max 2s |
| AC-2 | After POST /detect succeeds | GET /health after detection | aiAvailability not "None" | Max 30s |
| AC-3 | Fresh service | Health → Detect → Health sequence | State transition None → active | Max 60s |
| AC-4 | CPU-only Docker profile | POST /detect on CPU profile | Detection succeeds via ONNX | Max 60s |
## Constraints

- Tests must use the CPU Docker profile for ONNX fallback verification
- Engine initialization time varies by hardware; timeouts must be generous
- Health endpoint schema depends on AiAvailabilityStatus enum from codebase

## Risks & Mitigation

**Risk 1: Engine init timeout on slow CI**
- *Risk*: Engine initialization may exceed timeout on resource-constrained CI runners
- *Mitigation*: Use generous timeouts (60s) and mark as known slow test
# Single Image Detection Tests

**Task**: AZ-140_test_single_image
**Name**: Single Image Detection Tests
**Description**: Implement E2E tests verifying single image detection, confidence filtering, overlap deduplication, physical size filtering, and weather mode classes
**Complexity**: 3 points
**Dependencies**: AZ-138_test_infrastructure
**Component**: Integration Tests
**Jira**: AZ-140
**Epic**: AZ-137
## Problem

Single image detection is the core functionality of the system. Tests must verify that detections are returned with correct structure, confidence filtering works at different thresholds, overlapping detections are deduplicated, physical size filtering removes implausible detections, and weather mode class variants are recognized.

## Outcome

- Detection response structure validated (x, y, width, height, label, confidence)
- Confidence threshold filtering confirmed at multiple thresholds
- Overlap deduplication verified with configurable containment ratio
- Physical size filtering validated against MaxSizeM from classes.json
- Weather mode class variants (Norm, Wint, Night) recognized correctly

## Scope

### Included
- FT-P-03: Single image detection returns detections
- FT-P-05: Detection confidence filtering respects threshold
- FT-P-06: Overlapping detections are deduplicated
- FT-P-07: Physical size filtering removes oversized detections
- FT-P-13: Weather mode class variants

### Excluded
- Large image tiling (covered in tiling tests)
- Async/video detection (covered in async and video tests)
- Negative input validation (covered in negative tests)
## Acceptance Criteria

**AC-1: Detection response structure**
Given an initialized engine and a valid small image
When POST /detect is called with the image
Then response is 200 with an array of DetectionDto objects containing x, y, width, height, label, confidence fields with coordinates in 0.0-1.0 range

**AC-2: Confidence filtering**
Given an initialized engine
When POST /detect is called with probability_threshold 0.8
Then all returned detections have confidence >= 0.8
And calling with threshold 0.1 returns at least as many detections as threshold 0.8

**AC-3: Overlap deduplication**
Given an initialized engine and a scene with clustered objects
When POST /detect is called with tracking_intersection_threshold 0.6
Then no two detections of the same class overlap by more than 60%
And a lower threshold (0.01) produces fewer or equal detections
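The containment-style deduplication described in AC-3 can be illustrated with a small standalone sketch. This is an assumed interpretation: the service's actual algorithm and the exact meaning of `tracking_intersection_threshold` may differ.

```python
from typing import NamedTuple

class Box(NamedTuple):
    x: float
    y: float
    w: float
    h: float
    label: str
    confidence: float

def containment(a: Box, b: Box) -> float:
    """Intersection area divided by the smaller box's area."""
    ix = max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    smaller = min(a.w * a.h, b.w * b.h)
    return (ix * iy) / smaller if smaller > 0 else 0.0

def dedup(boxes: list[Box], threshold: float) -> list[Box]:
    """Greedy dedup: keep the higher-confidence box when two same-class
    boxes overlap by more than `threshold` of the smaller box."""
    kept: list[Box] = []
    for box in sorted(boxes, key=lambda b: -b.confidence):
        if all(b.label != box.label or containment(box, b) <= threshold
               for b in kept):
            kept.append(box)
    return kept
```

Note the AC-3 monotonicity property: a lower threshold makes the dedup stricter, so it can only produce fewer or equal detections.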
**AC-4: Physical size filtering**
Given an initialized engine and known GSD parameters
When POST /detect is called with altitude, focal_length, sensor_width config
Then no detection's computed physical size exceeds the MaxSizeM for its class

**AC-5: Weather mode classes**
Given an initialized engine with classes.json including weather variants
When POST /detect is called
Then all returned labels are valid entries from the 19-class x 3-mode registry
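AC-4's size check follows the standard ground-sample-distance (GSD) pinhole approximation. A hedged sketch: the `altitude`, `focal_length`, and `sensor_width` parameters and the per-class `MaxSizeM` limit come from the spec, while the formula, function names, and the assumption of square pixels are illustrative.

```python
def gsd_m_per_px(altitude_m: float, focal_length_mm: float,
                 sensor_width_mm: float, image_width_px: int) -> float:
    """Metres on the ground per image pixel (pinhole approximation,
    square pixels assumed)."""
    return (altitude_m * sensor_width_mm) / (focal_length_mm * image_width_px)

def filter_by_physical_size(detections: list[dict], max_size_m: dict[str, float],
                            gsd: float, image_width_px: int,
                            image_height_px: int) -> list[dict]:
    """Drop detections whose ground footprint exceeds the per-class
    MaxSizeM limit. Boxes use normalized 0.0-1.0 width/height."""
    kept = []
    for det in detections:
        width_m = det["width"] * image_width_px * gsd
        height_m = det["height"] * image_height_px * gsd
        limit = max_size_m.get(det["label"], float("inf"))
        if max(width_m, height_m) <= limit:
            kept.append(det)
    return kept
```

For example, at 100 m altitude with a 10 mm focal length, 10 mm sensor width, and a 1000 px wide image, the GSD is 0.1 m/px, so a box 50 px wide spans 5 m on the ground.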
## Non-Functional Requirements

**Performance**
- Single image detection within 30s (includes potential engine init)

## Integration Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|-------------|-------------------|----------------|
| AC-1 | Engine warm, small-image | POST /detect response structure | Array of DetectionDto, coords 0.0-1.0 | Max 30s |
| AC-2 | Engine warm, small-image | Two thresholds (0.8 vs 0.1) | Higher threshold = fewer detections | Max 30s |
| AC-3 | Engine warm, small-image | Two containment thresholds | Lower threshold = more dedup | Max 30s |
| AC-4 | Engine warm, small-image, GSD config | Physical size vs MaxSizeM | No oversized detections returned | Max 30s |
| AC-5 | Engine warm, small-image | Detection label validation | Labels match classes.json entries | Max 30s |
## Constraints

- Deduplication verification requires the test image to produce overlapping detections
- Physical size filtering requires correct GSD parameters matching the fixture image
- Weather mode verification depends on classes.json fixture content

## Risks & Mitigation

**Risk 1: Insufficient detections from test image**
- *Risk*: Small test image may not produce enough detections for meaningful filtering/dedup tests
- *Mitigation*: Use an image with known dense object content; accept >= 1 detection as valid
# Image Tiling Tests

**Task**: AZ-141_test_tiling
**Name**: Image Tiling Tests
**Description**: Implement E2E tests verifying GSD-based tiling for large images and tile boundary deduplication
**Complexity**: 3 points
**Dependencies**: AZ-138_test_infrastructure
**Component**: Integration Tests
**Jira**: AZ-141
**Epic**: AZ-137
## Problem

Images exceeding 1.5x model dimensions (1280x1280) must be tiled based on Ground Sample Distance (GSD) calculations. Tests must verify that tiling produces correct results with coordinates normalized to the original image, and that duplicate detections at tile boundaries are properly merged.

## Outcome

- Large image tiling confirmed with GSD-based tile sizing
- Detection coordinates normalized to original image dimensions (not tile-local)
- Tile boundary deduplication verified (no near-identical coordinate duplicates)
- Bounding box coordinates remain in 0.0-1.0 range

## Scope

### Included
- FT-P-04: Large image triggers GSD-based tiling
- FT-P-16: Tile deduplication removes duplicate detections at tile boundaries

### Excluded
- Small image detection (covered in single image tests)
- Tiling performance benchmarks (covered in performance tests)
- Tile overlap configuration beyond default (implementation detail)
## Acceptance Criteria

**AC-1: GSD-based tiling**
Given an initialized engine and a large image (4000x3000)
When POST /detect is called with altitude, focal_length, sensor_width config
Then detections are returned with coordinates in 0.0-1.0 range relative to the full original image

**AC-2: Tile boundary deduplication**
Given an initialized engine and a large image with tile overlap
When POST /detect is called with tiling config including big_image_tile_overlap_percent
Then no two detections of the same class have coordinates within 0.01 of each other (TILE_DUPLICATE_CONFIDENCE_THRESHOLD)
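A sketch of the tiling geometry behind these ACs: which tile origins cover a large image, and how a tile-local normalized box maps back to full-image coordinates. Only the 1280 model size, the 1.5x trigger, and the normalization requirement come from the spec; the stepping rule, overlap handling, and function names are assumptions.

```python
TILE = 1280      # model input side (px), from the spec
TRIGGER = 1.5    # images larger than 1.5x model input get tiled

def needs_tiling(width_px: int, height_px: int) -> bool:
    return max(width_px, height_px) > TRIGGER * TILE

def tile_origins(length_px: int, overlap_pct: float = 10.0) -> list[int]:
    """1-D tile start positions with the given overlap; the last tile is
    shifted so the image edge is always covered."""
    step = int(TILE * (1 - overlap_pct / 100))
    origins = list(range(0, max(length_px - TILE, 0) + 1, step))
    if origins[-1] + TILE < length_px:
        origins.append(length_px - TILE)
    return origins

def to_full_image(ox: int, oy: int, det: dict,
                  w_px: int, h_px: int) -> dict:
    """Map a tile-local normalized box (0.0-1.0 within the tile) to
    coordinates normalized against the full original image."""
    return {**det,
            "x": (ox + det["x"] * TILE) / w_px,
            "y": (oy + det["y"] * TILE) / h_px,
            "width": det["width"] * TILE / w_px,
            "height": det["height"] * TILE / h_px}
```

For the 4000x3000 AC-1 image this yields a grid of overlapping 1280 px tiles, and every remapped coordinate stays in the 0.0-1.0 range.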
## Non-Functional Requirements

**Performance**
- Large image processing within 60s (tiling adds overhead)

## Integration Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|-------------|-------------------|----------------|
| AC-1 | Engine warm, large-image (4000x3000), GSD config | POST /detect with large image | Detections with coords 0.0-1.0 relative to full image | Max 60s |
| AC-2 | Engine warm, large-image, tile overlap config | Check for near-duplicate detections | No same-class duplicates within 0.01 coords | Max 60s |
## Constraints

- Large image fixture must exceed 1.5x model input (1920x1920) to trigger tiling
- GSD parameters must be physically plausible for the test scenario
- Tile dedup threshold is hardcoded at 0.01 in the system

## Risks & Mitigation

**Risk 1: No detections at tile boundaries**
- *Risk*: Test image may not have objects near tile boundaries
- *Mitigation*: Verify tiling occurred by checking processing time is greater than small image; dedup assertion is vacuously true if no boundary objects
# Async Detection & SSE Streaming Tests

**Task**: AZ-142_test_async_sse
**Name**: Async Detection & SSE Streaming Tests
**Description**: Implement E2E tests verifying async media detection initiation, SSE event streaming, and duplicate media_id rejection
**Complexity**: 3 points
**Dependencies**: AZ-138_test_infrastructure
**Component**: Integration Tests
**Jira**: AZ-142
**Epic**: AZ-137
## Problem

Async media detection via POST /detect/{media_id} must return immediately with "started" status while processing continues in background. SSE streaming must deliver real-time detection events to connected clients. Duplicate media_id submissions must be rejected with 409.

## Outcome

- Async detection returns immediately without waiting for processing
- SSE connection receives detection events during processing
- Final SSE event signals completion with mediaStatus "AIProcessed"
- Duplicate media_id correctly rejected with 409 Conflict

## Scope

### Included
- FT-P-08: Async media detection returns "started" immediately
- FT-P-09: SSE streaming delivers detection events during async processing
- FT-N-04: Duplicate media_id returns 409

### Excluded
- Video frame sampling details (covered in video tests)
- SSE queue overflow behavior (covered in resource limit tests)
- Annotations service interaction (covered in resilience tests)
## Acceptance Criteria

**AC-1: Immediate async response**
Given an initialized engine
When POST /detect/{media_id} is called with config and auth headers
Then response arrives within 1s with {"status": "started"}

**AC-2: SSE event delivery**
Given an SSE client connected to GET /detect/stream
When async detection is triggered via POST /detect/{media_id}
Then SSE events are received with detection data and mediaStatus "AIProcessing"
And a final event with mediaStatus "AIProcessed" and percent 100 arrives

**AC-3: Duplicate media_id rejection**
Given an async detection is already in progress for a media_id
When POST /detect/{media_id} is called again with the same ID
Then response is 409 Conflict
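AC-2's event handling can be sketched with a minimal SSE line parser plus a completion check — a stdlib stand-in for sseclient-py. The `mediaStatus` and `percent` payload fields come from the AC text; the parser and helper names are illustrative.

```python
import json

def parse_sse(stream_lines):
    """Minimal SSE parser: yield the JSON payload of each `data:` event.
    Per the SSE wire format, a blank line terminates an event."""
    data_parts = []
    for line in stream_lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            data_parts.append(line[5:].lstrip())
        elif line == "" and data_parts:
            yield json.loads("\n".join(data_parts))
            data_parts = []

def wait_for_completion(events):
    """Consume events until the terminal one: AC-2 expects a final event
    with mediaStatus "AIProcessed" and percent 100."""
    for event in events:
        if event.get("mediaStatus") == "AIProcessed":
            assert event.get("percent") == 100, event
            return event
    raise AssertionError("stream ended without AIProcessed event")
```

In the suite, `stream_lines` would be the live byte stream from `GET /detect/stream`, opened before the detection is triggered (per the constraint below).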
## Non-Functional Requirements

**Performance**
- Async initiation response within 1s
- SSE events delivered within 120s total processing time

## Integration Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|-------------|-------------------|----------------|
| AC-1 | Engine warm, test-video, JWT token | POST /detect/{media_id} | Response < 1s, status "started" | Max 2s |
| AC-2 | Engine warm, SSE connected, test-video | Listen SSE during async detection | Events received, final AIProcessed | Max 120s |
| AC-3 | Active detection in progress | Second POST with same media_id | 409 Conflict | Max 5s |
## Constraints

- SSE client must connect before triggering async detection
- JWT token required for async detection endpoint
- Test video must be accessible via configured paths

## Risks & Mitigation

**Risk 1: SSE connection timing**
- *Risk*: SSE connection may not be established before detection starts
- *Mitigation*: Add small delay between SSE connect and detection trigger; verify connection established
# Video Processing Tests

**Task**: AZ-143_test_video
**Name**: Video Processing Tests
**Description**: Implement E2E tests verifying video frame sampling, annotation interval enforcement, and movement-based tracking
**Complexity**: 3 points
**Dependencies**: AZ-138_test_infrastructure, AZ-142_test_async_sse
**Component**: Integration Tests
**Jira**: AZ-143
**Epic**: AZ-137
## Problem

Video detection processes frames at a configurable interval (frame_period_recognition), enforces minimum annotation intervals (frame_recognition_seconds), and tracks object movement to avoid redundant annotations. Tests must verify these three video-specific behaviors work correctly.

## Outcome

- Frame sampling verified: only every Nth frame processed (±10% tolerance)
- Annotation interval enforced: no two annotations closer than configured seconds
- Movement tracking confirmed: annotations emitted on object movement, suppressed for static objects

## Scope

### Included
- FT-P-10: Video frame sampling processes every Nth frame
- FT-P-11: Video annotation interval enforcement
- FT-P-12: Video tracking accepts new annotations on movement

### Excluded
- Async detection initiation (covered in async/SSE tests)
- SSE delivery mechanics (covered in async/SSE tests)
- Video processing performance (covered in performance tests)
## Acceptance Criteria

**AC-1: Frame sampling**
Given a 10s 30fps video (300 frames) and frame_period_recognition=4
When async detection is triggered
Then approximately 75 frames are processed (±10% tolerance)

**AC-2: Annotation interval**
Given a test video and frame_recognition_seconds=2
When async detection is triggered
Then minimum gap between consecutive annotation events >= 2 seconds

**AC-3: Movement tracking**
Given a test video with moving objects and tracking_distance_confidence > 0
When async detection is triggered
Then annotations contain updated positions for moving objects
And static objects do not generate redundant annotations
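The arithmetic behind AC-1 and AC-2 can be captured in small assertion helpers: 300 frames at period 4 gives 300/4 = 75 expected frames, and the interval check reduces to the minimum gap between consecutive event timestamps. A sketch — the helper names are invented for illustration.

```python
def expected_processed_frames(total_frames: int, frame_period: int) -> int:
    """Every Nth frame is sampled: AC-1's 300-frame, period-4 video
    yields roughly 75 processed frames."""
    return total_frames // frame_period

def within_tolerance(observed: int, expected: int,
                     tolerance: float = 0.10) -> bool:
    """AC-1's ±10% allowance for codec/extraction edge cases."""
    return abs(observed - expected) <= expected * tolerance

def min_annotation_gap(timestamps: list[float]) -> float:
    """Smallest gap (seconds) between consecutive annotation events;
    AC-2 asserts this is >= frame_recognition_seconds."""
    return min((b - a for a, b in zip(timestamps, timestamps[1:])),
               default=float("inf"))
```

In the suite, `timestamps` would come from the SSE annotation events observed during async processing.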
## Non-Functional Requirements

**Performance**
- Video processing completes within 120s

## Integration Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|-------------|-------------------|----------------|
| AC-1 | Engine warm, SSE connected, test-video, frame_period=4 | Count processed frames via SSE | ~75 frames (±10%) | Max 120s |
| AC-2 | Engine warm, SSE connected, test-video, frame_recognition_seconds=2 | Measure time between annotations | >= 2s gap between annotations | Max 120s |
| AC-3 | Engine warm, SSE connected, test-video, tracking config | Inspect annotation positions | Updated coords for moving objects | Max 120s |
## Constraints
|
||||
|
||||
- Test video must contain moving objects for tracking verification
|
||||
- Frame counting tolerance accounts for start/end frame edge cases
|
||||
- Annotation interval measurement requires clock precision within 0.5s
|
||||
|
||||
## Risks & Mitigation
|
||||
|
||||
**Risk 1: Inconsistent frame counts**
|
||||
- *Risk*: Frame sampling may vary slightly depending on video codec and frame extraction
|
||||
- *Mitigation*: Use ±10% tolerance as specified in test spec
|
||||
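The AC-1 frame-count check reduces to a tolerance assertion. A minimal sketch of such a helper (the function name and signature are illustrative, not part of the suite):

```python
def assert_frame_count(processed_frames: int, total_frames: int,
                       frame_period: int, tolerance: float = 0.10) -> None:
    """Assert the sampled frame count is within ±tolerance of total/period."""
    expected = total_frames / frame_period          # e.g. 300 / 4 = 75
    lower = expected * (1 - tolerance)
    upper = expected * (1 + tolerance)
    assert lower <= processed_frames <= upper, (
        f"processed {processed_frames} frames, "
        f"expected {expected:.0f} ±{tolerance:.0%}"
    )
```

With the AC-1 parameters, any count in 67.5–82.5 passes, which absorbs the start/end frame edge cases noted in Constraints.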
@@ -0,0 +1,82 @@
# Negative Input Tests

**Task**: AZ-144_test_negative
**Name**: Negative Input Tests
**Description**: Implement E2E tests verifying proper error responses for invalid inputs, unavailable engine, and missing configuration
**Complexity**: 2 points
**Dependencies**: AZ-138_test_infrastructure
**Component**: Integration Tests
**Jira**: AZ-144
**Epic**: AZ-137

## Problem

The system must handle invalid and edge-case inputs gracefully, returning appropriate HTTP error codes without crashing. Tests must verify error responses for empty files, corrupt data, engine unavailability, and missing configuration.

## Outcome

- Empty image returns 400 Bad Request
- Corrupt/non-image data returns 400 Bad Request
- Detection when engine unavailable returns 503 or 422
- Missing classes.json prevents normal operation
- Service remains healthy after all negative inputs

## Scope

### Included
- FT-N-01: Empty image returns 400
- FT-N-02: Invalid image data returns 400
- FT-N-03: Detection when engine unavailable returns 503 or 422
- FT-N-05: Missing classes.json prevents startup

### Excluded
- Duplicate media_id (covered in async/SSE tests)
- Service outage scenarios (covered in resilience tests)
- Malformed multipart payloads (covered in security tests)

## Acceptance Criteria

**AC-1: Empty image**
Given the detections service is running
When POST /detect is called with a zero-byte file
Then the response is 400 Bad Request with an error message

**AC-2: Corrupt image**
Given the detections service is running
When POST /detect is called with random binary data
Then the response is 400 Bad Request (not 500)

**AC-3: Engine unavailable**
Given mock-loader is configured to fail model requests
When POST /detect is called
Then the response is 503 or 422 with no crash or unhandled exception

**AC-4: Missing classes.json**
Given the detections service is started without the classes.json volume mount
When the service runs or a detection is attempted
Then the service either fails to start or returns empty/error results without crashing

## Non-Functional Requirements

**Reliability**
- Service must remain operational after processing invalid inputs (AC-1, AC-2)

## Integration Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|--------------|-------------------|----------------|
| AC-1 | Service running | POST /detect with empty file | 400 Bad Request | Max 5s |
| AC-2 | Service running | POST /detect with corrupt binary | 400 Bad Request | Max 5s |
| AC-3 | mock-loader returns 503 | POST /detect with valid image | 503 or 422 | Max 30s |
| AC-4 | No classes.json mounted | Start service or detect | Fail gracefully | Max 30s |

## Constraints

- AC-4 requires a separate Docker Compose configuration without the classes.json volume
- AC-3 requires the mock-loader control API to simulate failure

## Risks & Mitigation

**Risk 1: AC-4 service start behavior**
- *Risk*: Behavior when classes.json is missing may vary (fail at start vs. fail at detection)
- *Mitigation*: Test both paths; accept either as valid graceful handling
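The negative payloads and the shared "graceful rejection" check can be sketched as follows. The names and the accepted status set are assumptions derived from AC-1 through AC-3 (4xx client errors plus 503 for an unavailable engine):

```python
import os

# FT-N-01 / FT-N-02 payloads: a zero-byte file and random bytes
# that cannot be decoded as an image
NEGATIVE_PAYLOADS = {
    "empty": b"",
    "corrupt": os.urandom(2048),
}

# Statuses that count as a graceful rejection per AC-1..AC-3
GRACEFUL_STATUSES = {400, 422, 503}

def assert_graceful_rejection(status_code: int) -> None:
    """Fail if the service crashed (e.g. 500) or accepted invalid input (2xx)."""
    assert status_code in GRACEFUL_STATUSES, (
        f"expected one of {sorted(GRACEFUL_STATUSES)}, got {status_code}"
    )
```

Each negative test would then POST one of these payloads and pass the response status to `assert_graceful_rejection`, followed by a GET /health check.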
@@ -0,0 +1,107 @@
# Resilience Tests

**Task**: AZ-145_test_resilience
**Name**: Resilience Tests
**Description**: Implement E2E tests verifying service resilience during external service outages, transient failures, and container restarts
**Complexity**: 5 points
**Dependencies**: AZ-138_test_infrastructure, AZ-142_test_async_sse
**Component**: Integration Tests
**Jira**: AZ-145
**Epic**: AZ-137

## Problem

The detection service must continue operating when external dependencies fail. Tests must verify resilience during loader outages (before and after engine init), annotations service outages, transient loader failures with retry, and service restarts with state loss.

## Outcome

- Detection continues when the loader goes down after the engine is loaded
- Async detection completes when the annotations service is down
- Engine initialization retries after a transient loader failure
- Service restart clears all in-memory state cleanly
- Loader unreachable during the initial model download is handled gracefully
- Annotations failure during async detection does not stop the pipeline

## Scope

### Included
- FT-N-06: Loader service unreachable during model download
- FT-N-07: Annotations service unreachable — detection continues
- NFT-RES-01: Loader service outage after engine initialization
- NFT-RES-02: Annotations service outage during async detection
- NFT-RES-03: Engine initialization retry after transient loader failure
- NFT-RES-04: Service restart with in-memory state loss

### Excluded
- Input validation errors (covered in negative tests)
- Performance under fault conditions
- Network partition simulation beyond service stop/start

## Acceptance Criteria

**AC-1: Loader unreachable during init**
Given mock-loader is stopped and the engine is not initialized
When POST /detect is called
Then the response is a 503 or 422 error
And GET /health reflects the engine error state

**AC-2: Annotations unreachable — detection continues**
Given the engine is initialized and mock-annotations is stopped
When async detection is triggered
Then SSE events still arrive and the final AIProcessed event is received

**AC-3: Loader outage after init**
Given the engine is already initialized (model in memory)
When mock-loader is stopped and POST /detect is called
Then detection succeeds (200 OK, engine already loaded)
And GET /health remains "Enabled"

**AC-4: Annotations outage mid-processing**
Given async detection is in progress
When mock-annotations is stopped mid-processing
Then SSE events continue arriving
And detection completes with an AIProcessed event

**AC-5: Transient loader failure with retry**
Given mock-loader fails the first request then recovers
When the first POST /detect fails and a second POST /detect is sent
Then the second detection succeeds (the engine initializes on retry)

**AC-6: Service restart state reset**
Given a detection may have been in progress
When the detections container is restarted
Then GET /health returns aiAvailability "None" (fresh start)
And POST /detect/{media_id} is accepted (no stale _active_detections)

## Non-Functional Requirements

**Reliability**
- All fault injection tests must restore mock services after test completion
- Service must not crash or leave zombie processes after any failure scenario

## Integration Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|--------------|-------------------|----------------|
| AC-1 | mock-loader stopped, fresh engine | POST /detect | 503/422 graceful error | Max 30s |
| AC-2 | Engine warm, mock-annotations stopped | Async detection + SSE | SSE events continue, AIProcessed | Max 120s |
| AC-3 | Engine warm, mock-loader stopped | POST /detect (sync) | 200 OK, detection succeeds | Max 30s |
| AC-4 | Async detection started, then stop mock-annotations | SSE event stream | Events continue, pipeline completes | Max 120s |
| AC-5 | mock-loader: first_fail mode | Two sequential POST /detect | First fails, second succeeds | Max 60s |
| AC-6 | Restart detections container | Health + detect after restart | Clean state, no stale data | Max 60s |

## Constraints

- Fault injection via Docker service stop/start and the mock control API
- The container restart test requires docker compose restart capability
- Mock services must support configurable failure modes (normal, error, first_fail)

## Risks & Mitigation

**Risk 1: Container restart timing**
- *Risk*: Container restart may take variable time, causing flaky tests
- *Mitigation*: Use service readiness polling with a generous timeout before assertions

**Risk 2: Mock state leakage between tests**
- *Risk*: A stopped mock may affect subsequent tests
- *Mitigation*: A function-scoped mock reset fixture restores all mocks before each test
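The three failure modes named in Constraints (normal, error, first_fail) can be modeled inside the mocks with a small switch. This is a sketch; the class and method names are assumptions, not the mocks' actual API:

```python
class FailureMode:
    """Per-mock failure switch: 'normal' never fails, 'error' always fails,
    'first_fail' fails only the first request (the AC-5 retry scenario)."""

    def __init__(self, mode: str = "normal"):
        assert mode in ("normal", "error", "first_fail"), f"unknown mode {mode!r}"
        self.mode = mode
        self.calls = 0

    def should_fail(self) -> bool:
        self.calls += 1
        if self.mode == "error":
            return True
        if self.mode == "first_fail":
            return self.calls == 1
        return False
```

The mock's request handler would return 503 whenever `should_fail()` is true; the control API sets `mode` and resets `calls` between tests, which is also the Risk 2 mitigation.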
@@ -0,0 +1,86 @@
# Performance Tests

**Task**: AZ-146_test_performance
**Name**: Performance Tests
**Description**: Implement E2E tests measuring detection latency, concurrent inference throughput, tiling overhead, and video processing frame rate
**Complexity**: 3 points
**Dependencies**: AZ-138_test_infrastructure
**Component**: Integration Tests
**Jira**: AZ-146
**Epic**: AZ-137

## Problem

Performance characteristics must be baselined and verified: single image latency, concurrent request handling with the 2-worker ThreadPoolExecutor, tiling overhead for large images, and video processing frame rate. These tests establish performance contracts.

## Outcome

- Single image latency profiled (p50, p95, p99) for a warm engine
- Concurrent inference behavior validated (2-at-a-time processing confirmed)
- Large image tiling overhead measured and bounded
- Video processing frame rate baselined

## Scope

### Included
- NFT-PERF-01: Single image detection latency
- NFT-PERF-02: Concurrent inference throughput
- NFT-PERF-03: Large image tiling processing time
- NFT-PERF-04: Video processing frame rate

### Excluded
- GPU vs CPU comparative benchmarks
- Memory usage profiling
- Load testing beyond 4 concurrent requests

## Acceptance Criteria

**AC-1: Single image latency**
Given a warm engine
When 10 sequential POST /detect requests are sent with small-image
Then p95 latency < 5000ms for ONNX CPU, or p95 < 1000ms for TensorRT GPU

**AC-2: Concurrent throughput**
Given a warm engine
When 2 concurrent POST /detect requests are sent
Then both complete without error
And 3 concurrent requests show queuing (total time > time for 2)

**AC-3: Tiling overhead**
Given a warm engine
When POST /detect is sent with large-image (4000x3000)
Then the request completes within 120s
And processing time scales proportionally with tile count

**AC-4: Video frame rate**
Given a warm engine with SSE connected
When async detection processes test-video with frame_period=4
Then processing completes within 5x video duration (< 50s)
And the frame processing rate is consistent (no stalls > 10s)

## Non-Functional Requirements

**Performance**
- Tests themselves should complete within defined bounds
- Results should be logged for trend analysis

## Integration Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|--------------|-------------------|----------------|
| AC-1 | Engine warm | 10 sequential detections | p95 < 5000ms (CPU) | ~60s |
| AC-2 | Engine warm | 2 then 3 concurrent requests | Queuing observed at 3 | ~30s |
| AC-3 | Engine warm, large-image | Single large image detection | Completes < 120s | ~120s |
| AC-4 | Engine warm, SSE connected | Video detection | < 50s, consistent rate | ~120s |

## Constraints

- Pass criteria differ between CPU (ONNX) and GPU (TensorRT) profiles
- Concurrent request tests must account for connection overhead
- Video frame rate depends on hardware; the test measures consistency, not absolute speed

## Risks & Mitigation

**Risk 1: CI hardware variability**
- *Risk*: Latency thresholds may fail on slower CI hardware
- *Mitigation*: Use generous thresholds; mark as performance benchmark tests that can be skipped in resource-constrained CI
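The p50/p95/p99 figures for AC-1 can be computed with a nearest-rank helper. This is a sketch; the suite may equally use `statistics.quantiles` or numpy:

```python
import math

def percentile(samples, p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]; adequate for 10-sample runs."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# 10 sequential request latencies (ms, illustrative values);
# with 10 samples, p95 by nearest rank is the worst observation
latencies = [431, 420, 455, 470, 460, 444, 451, 438, 447, 452]
p95 = percentile(latencies, 95)
assert p95 < 5000  # AC-1 CPU threshold
```

Note that with only 10 samples, p95 and p99 both collapse to the maximum, which is why the AC specifies the sample count alongside the percentile.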
@@ -0,0 +1,78 @@
# Security Tests

**Task**: AZ-147_test_security
**Name**: Security Tests
**Description**: Implement E2E tests verifying handling of malformed payloads, oversized requests, and JWT token forwarding
**Complexity**: 2 points
**Dependencies**: AZ-138_test_infrastructure
**Component**: Integration Tests
**Jira**: AZ-147
**Epic**: AZ-137

## Problem

The service must handle malicious or malformed input without crashing, reject oversized uploads, and correctly forward authentication tokens to downstream services. These tests verify security-relevant behaviors at the API boundary.

## Outcome

- Malformed multipart payloads return 4xx (not 500 or a crash)
- Oversized request bodies handled without OOM or crash
- JWT token forwarded to the annotations service exactly as received
- Service remains operational after all security test scenarios

## Scope

### Included
- NFT-SEC-01: Malformed multipart payload handling
- NFT-SEC-02: Oversized request body
- NFT-SEC-03: JWT token is forwarded without modification

### Excluded
- Authentication/authorization enforcement (the service doesn't implement auth)
- TLS verification (handled at the infrastructure level)
- CORS testing (requires a browser context)

## Acceptance Criteria

**AC-1: Malformed multipart**
Given the service is running
When POST /detect is sent with a truncated multipart body (missing boundary) or an empty file part
Then the response is 400 or 422 (not 500)
And GET /health confirms the service is still healthy

**AC-2: Oversized request**
Given the service is running
When POST /detect is sent with a 500MB random file
Then the response is an error (413, 400, or timeout) without an OOM crash
And GET /health confirms the service is still running

**AC-3: JWT forwarding**
Given the engine is initialized and mock-annotations is recording
When POST /detect/{media_id} is sent with Authorization and x-refresh-token headers
Then mock-annotations receives the exact same Authorization header value

## Non-Functional Requirements

**Reliability**
- Service must not crash on any malformed input
- Memory usage must not spike beyond bounds on oversized uploads

## Integration Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|--------------|-------------------|----------------|
| AC-1 | Service running | Truncated multipart + no file part | 400/422, not 500 | Max 5s |
| AC-2 | Service running | 500MB random file upload | Error response, no crash | Max 60s |
| AC-3 | Engine warm, mock-annotations recording | Detect with JWT headers | Exact token match in mock | Max 120s |

## Constraints

- The oversized request test may require an increased client timeout
- JWT forwarding verification requires async detection to complete the annotation POST
- Malformed multipart construction requires raw HTTP request building

## Risks & Mitigation

**Risk 1: Oversized upload behavior varies**
- *Risk*: FastAPI/Starlette may handle oversized bodies differently across versions
- *Mitigation*: Accept any non-crash error response (413, 400, timeout, connection reset)
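Because well-behaved HTTP clients refuse to emit invalid multipart, the AC-1 payload has to be built by hand, as the Constraints note. A sketch of such a builder (boundary value and helper name are arbitrary):

```python
def truncated_multipart(boundary: str = "testboundary") -> tuple:
    """Build a multipart body whose closing boundary is missing (truncated upload)."""
    body = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; filename="a.jpg"\r\n'
        "Content-Type: image/jpeg\r\n\r\n"
        "\xff\xd8partial-jpeg-data"
        # NOTE: the terminating f"--{boundary}--\r\n" is deliberately omitted
    ).encode("latin-1")
    content_type = f"multipart/form-data; boundary={boundary}"
    return body, content_type
```

The test would then send this as a raw request body with the matching `Content-Type` header (bypassing the client's own multipart encoder) and assert a 400/422 status followed by a healthy GET /health.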
@@ -0,0 +1,99 @@
# Resource Limit Tests

**Task**: AZ-148_test_resource_limits
**Name**: Resource Limit Tests
**Description**: Implement E2E tests verifying the ThreadPoolExecutor worker limit, SSE queue depth cap, max detections per frame, SSE overflow handling, and log file rotation
**Complexity**: 3 points
**Dependencies**: AZ-138_test_infrastructure, AZ-142_test_async_sse
**Component**: Integration Tests
**Jira**: AZ-148
**Epic**: AZ-137

## Problem

The system enforces several resource limits: 2 concurrent inference workers, a 100-event SSE queue depth, 300 max detections per frame, and daily log rotation. Tests must verify these limits are enforced correctly and that overflow conditions are handled gracefully.

## Outcome

- ThreadPoolExecutor limited to 2 concurrent inference operations
- SSE queue capped at 100 events per client, overflow silently dropped
- No response contains more than 300 detections per frame
- Log files use date-based naming with daily rotation
- SSE overflow does not crash the service or the detection pipeline

## Scope

### Included
- FT-N-08: SSE queue overflow is silently dropped
- NFT-RES-LIM-01: ThreadPoolExecutor worker limit (2 concurrent)
- NFT-RES-LIM-02: SSE queue depth limit (100 events)
- NFT-RES-LIM-03: Max 300 detections per frame
- NFT-RES-LIM-04: Log file rotation and retention

### Excluded
- Memory limits (OS-level, not application-enforced)
- Disk space limits
- Network bandwidth throttling

## Acceptance Criteria

**AC-1: Worker limit**
Given an initialized engine
When 4 concurrent POST /detect requests are sent
Then the first 2 complete roughly together and the next 2 complete afterwards (2-at-a-time processing)
And all 4 requests eventually succeed

**AC-2: SSE queue depth**
Given an SSE client connected but not reading (stalled)
When async detection produces > 100 events
Then the stalled client receives <= 100 events when it resumes reading
And no OOM or connection errors occur

**AC-3: SSE overflow handling**
Given an SSE client pauses reading
When async detection generates many events
Then detection completes normally (no error from overflow)
And the stalled client receives at most 100 buffered events

**AC-4: Max detections per frame**
Given an initialized engine and a dense scene image
When POST /detect is called
Then the response contains at most 300 detections

**AC-5: Log file rotation**
Given the service is running with the Logs/ volume mounted
When detection requests are made
Then a log file exists at Logs/log_inference_YYYYMMDD.txt with today's date
And the log content contains structured INFO/DEBUG/WARNING entries

## Non-Functional Requirements

**Reliability**
- Resource limits must be enforced without crashes or undefined behavior

## Integration Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|--------------|-------------------|----------------|
| AC-1 | Engine warm | 4 concurrent POST /detect | 2-at-a-time processing pattern | Max 60s |
| AC-2 | Engine warm, stalled SSE | Async detection > 100 events | <= 100 events buffered | Max 120s |
| AC-3 | Engine warm, stalled SSE | Detection pipeline behavior | Completes normally | Max 120s |
| AC-4 | Engine warm, dense scene image | POST /detect | <= 300 detections | Max 30s |
| AC-5 | Service running, Logs/ mounted | Detection requests | Date-named log file exists | Max 10s |

## Constraints

- The worker limit test requires precise timing measurement of response arrivals
- The SSE overflow test requires the ability to pause/resume SSE client reading
- The detection cap test requires an image producing many detections (may not reach 300 with the test fixture)
- The log rotation test verifies the naming convention; full 30-day retention would require a long-running test

## Risks & Mitigation

**Risk 1: Insufficient detections for the cap test**
- *Risk*: The test image may not produce 300 detections to actually hit the cap
- *Mitigation*: Verify the cap by checking detection count <= 300; accept as passing if under the limit

**Risk 2: SSE client stall implementation**
- *Risk*: HTTP client libraries may not support controlled read pausing
- *Mitigation*: Use a raw socket or thread-based approach to control when events are consumed
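The 100-event cap with silent overflow (AC-2/AC-3) implies per-client buffering roughly like the sketch below. This models the behavior the tests assert, not the service's actual implementation:

```python
import queue

class BoundedEventQueue:
    """Per-client SSE buffer sketch: capacity assumed 100, overflow silently dropped."""

    def __init__(self, maxsize: int = 100):
        self._q = queue.Queue(maxsize=maxsize)
        self.dropped = 0

    def publish(self, event) -> None:
        try:
            self._q.put_nowait(event)
        except queue.Full:
            self.dropped += 1  # silent drop: the publisher never blocks or raises

    def drain(self):
        """What a stalled client sees when it resumes reading."""
        events = []
        while not self._q.empty():
            events.append(self._q.get_nowait())
        return events
```

Because `publish` never blocks, a stalled client cannot back-pressure the detection pipeline, which is exactly the AC-3 requirement.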
@@ -0,0 +1,137 @@
# Test Infrastructure

**Task**: AZ-152_test_infrastructure
**Name**: Test Infrastructure
**Description**: Scaffold the test project — pytest configuration, fixtures, conftest, test data management, Docker test environment
**Complexity**: 3 points
**Dependencies**: None
**Component**: Blackbox Tests
**Jira**: AZ-152
**Epic**: AZ-151

## Test Project Folder Layout

```
tests/
├── conftest.py
├── test_augmentation.py
├── test_dataset_formation.py
├── test_label_validation.py
├── test_encryption.py
├── test_model_split.py
├── test_annotation_classes.py
├── test_hardware_hash.py
├── test_onnx_inference.py
├── test_nms.py
├── test_annotation_queue.py
├── performance/
│   ├── conftest.py
│   ├── test_augmentation_perf.py
│   ├── test_dataset_perf.py
│   ├── test_encryption_perf.py
│   └── test_inference_perf.py
├── resilience/
│   └── (resilience tests embedded in main test files via markers)
├── security/
│   └── (security tests embedded in main test files via markers)
└── resource_limits/
    └── (resource limit tests embedded in main test files via markers)
```

### Layout Rationale

A flat test file structure per functional area matches the existing codebase module layout. Performance tests are separated into a subdirectory so they can be run independently (slower, threshold-based). Resilience, security, and resource limit tests use pytest markers (`@pytest.mark.resilience`, `@pytest.mark.security`, `@pytest.mark.resource_limit`) within the main test files to avoid unnecessary file proliferation while allowing selective execution.

## Mock Services

No mock services are required. All 55 test scenarios operate offline against local code modules. External services (Azaion API, S3 CDN, RabbitMQ Streams, TensorRT) are excluded from the test scope per user decision.

## Docker Test Environment

### docker-compose.test.yml Structure

| Service | Image / Build | Purpose | Depends On |
|---------|--------------|---------|------------|
| test-runner | Build from `Dockerfile.test` | Runs pytest suite | — |

Single-container setup: the system under test is a Python library (not a service), so tests import modules directly. No network services are required.

### Volumes

| Volume Mount | Purpose |
|-------------|---------|
| `./test-results:/app/test-results` | JUnit XML output for CI parsing |
| `./_docs/00_problem/input_data:/app/_docs/00_problem/input_data:ro` | Fixture images, labels, ONNX model (read-only) |

## Test Runner Configuration

**Framework**: pytest
**Plugins**: none required — JUnit XML via pytest's built-in `--junitxml` option
**Entry point (local)**: `scripts/run-tests-local.sh`
**Entry point (Docker)**: `docker compose -f docker-compose.test.yml up --build --abort-on-container-exit`

### Fixture Strategy

| Fixture | Scope | Purpose |
|---------|-------|---------|
| `fixture_images_dir` | session | Path to 100 JPEG images from `_docs/00_problem/input_data/dataset/images/` |
| `fixture_labels_dir` | session | Path to 100 YOLO labels from `_docs/00_problem/input_data/dataset/labels/` |
| `fixture_onnx_model` | session | Bytes of `_docs/00_problem/input_data/azaion.onnx` |
| `fixture_classes_json` | session | Path to `classes.json` |
| `work_dir` | function | `tmp_path`-based working directory for filesystem tests |
| `sample_image_label` | function | Copies 1 image + label to `tmp_path` |
| `sample_images_labels` | function | Copies N images + labels to `tmp_path` (parameterizable) |
| `corrupted_label` | function | Generates a label file with coords > 1.0 in `tmp_path` |
| `edge_bbox_label` | function | Generates a label with a bbox near the image edge in `tmp_path` |
| `empty_label` | function | Generates an empty label file in `tmp_path` |

## Test Data Fixtures

| Data Set | Source | Format | Used By |
|----------|--------|--------|---------|
| 100 annotated images | `_docs/00_problem/input_data/dataset/images/` | JPEG | Augmentation, dataset formation, inference |
| 100 YOLO labels | `_docs/00_problem/input_data/dataset/labels/` | TXT | Augmentation, dataset formation, label validation |
| ONNX model (77MB) | `_docs/00_problem/input_data/azaion.onnx` | ONNX | Encryption roundtrip, inference |
| Class definitions | `classes.json` (project root) | JSON | Annotation class loading, YAML generation |
| Corrupted labels | Generated at test time | TXT | Label validation, dataset formation |
| Edge-case bboxes | Generated at test time | In-memory | Augmentation bbox correction |
| Detection objects | Generated at test time | In-memory | NMS overlap removal |
| Msgpack messages | Generated at test time | bytes | Annotation queue parsing |
| Random binary data | Generated at test time (`os.urandom`) | bytes | Encryption tests |
| Empty label files | Generated at test time | TXT | Augmentation edge case |

### Data Isolation

Each test function receives an isolated `tmp_path` directory. Fixture files are copied (not symlinked) to `tmp_path` to prevent cross-test interference. Session-scoped fixtures (image dir, model bytes) are read-only references. No test modifies the source fixture data.

## Test Reporting

**Format**: JUnit XML
**Output path**: `test-results/test-results.xml` (local), `/app/test-results/test-results.xml` (Docker)
**CI integration**: Standard JUnit XML parseable by GitHub Actions, Azure Pipelines, GitLab CI

## Constants Patching Strategy

The production code uses hardcoded paths from `constants.py` (e.g., `/azaion/data/`). Tests must override these paths to point to `tmp_path` directories. Strategy: use `monkeypatch` or `unittest.mock.patch` to override `constants.*` module attributes at test function scope.
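A minimal illustration of the patching approach, using a stand-in for the real constants module (the module contents and the helper are hypothetical):

```python
from types import SimpleNamespace
from unittest import mock

# Stand-in for the production constants module with its hardcoded path
constants = SimpleNamespace(data_dir="/azaion/data/")

def processed_dir() -> str:
    """Hypothetical production helper that reads the constant at call time."""
    return constants.data_dir + "processed/"

def run_with_patched_dir(tmp_dir: str) -> str:
    # In a pytest test this would be monkeypatch.setattr(constants, "data_dir", ...)
    with mock.patch.object(constants, "data_dir", tmp_dir.rstrip("/") + "/"):
        return processed_dir()
```

The key requirement is that production code reads the constant at call time rather than binding it at import time; otherwise the patch has no effect and tests would write into `/azaion/`.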
## Acceptance Criteria

**AC-1: Local test runner works**
Given requirements-test.txt is installed
When `scripts/run-tests-local.sh` is executed
Then pytest discovers and runs the tests and produces JUnit XML in `test-results/`

**AC-2: Docker test runner works**
Given Dockerfile.test and docker-compose.test.yml exist
When `docker compose -f docker-compose.test.yml up --build` is executed
Then the test-runner container runs all tests and JUnit XML is written to the mounted `test-results/` volume

**AC-3: Fixtures provide test data**
Given conftest.py defines session- and function-scoped fixtures
When a test requests `fixture_images_dir`
Then it receives a valid path to 100 JPEG images

**AC-4: Constants are properly patched**
Given a test patches `constants.data_dir` to `tmp_path`
When the test runs augmentation or dataset formation
Then all file operations target `tmp_path`, not `/azaion/`
@@ -0,0 +1,83 @@
|
||||
# Augmentation Blackbox Tests
|
||||
|
||||
**Task**: AZ-153_test_augmentation
|
||||
**Name**: Augmentation Blackbox Tests
|
||||
**Description**: Implement 8 blackbox tests for the augmentation pipeline — output count, naming, bbox validation, edge cases, filesystem integration
|
||||
**Complexity**: 3 points
|
||||
**Dependencies**: AZ-152_test_infrastructure
|
||||
**Component**: Blackbox Tests
|
||||
**Jira**: AZ-153
|
||||
**Epic**: AZ-151
|
||||
|
||||
## Problem
|
||||
|
||||
The augmentation pipeline transforms annotated images into 8 variants each. Tests must verify output count, naming conventions, bounding box validity, edge cases, and filesystem integration without referencing internals.
|
||||
|
||||
## Outcome
|
||||
|
||||
- 8 passing pytest tests in `tests/test_augmentation.py`
|
||||
- Covers: single-image augmentation, naming convention, bbox range, bbox clipping, tiny bbox removal, empty labels, full pipeline, skip-already-processed
|
||||
|
||||
## Scope
|
||||
|
||||
### Included
|
||||
- BT-AUG-01: Single image → 8 outputs
|
||||
- BT-AUG-02: Augmented filenames follow naming convention
- BT-AUG-03: All output bounding boxes in valid range [0,1]
- BT-AUG-04: Bounding box correction clips edge bboxes
- BT-AUG-05: Tiny bounding boxes removed after correction
- BT-AUG-06: Empty label produces 8 outputs with empty labels
- BT-AUG-07: Full augmentation pipeline (filesystem, 5 images → 40 outputs)
- BT-AUG-08: Augmentation skips already-processed images

### Excluded

- Performance tests (separate task)
- Resilience tests (separate task)

## Acceptance Criteria

**AC-1: Output count**
Given 1 image + 1 valid label
When augment_inner() runs
Then exactly 8 ImageLabel objects are returned

**AC-2: Naming convention**
Given image with stem "test_image"
When augment_inner() runs
Then outputs named test_image.jpg, test_image_1.jpg through test_image_7.jpg with matching .txt labels

**AC-3: Bbox validity**
Given 1 image + label with multiple bboxes
When augment_inner() runs
Then every bbox coordinate in every output is in [0.0, 1.0]

**AC-4: Edge bbox clipping**
Given label with bbox near edge (x=0.99, w=0.2)
When correct_bboxes() runs
Then width reduced to fit within bounds; no coordinate exceeds [margin, 1-margin]

**AC-5: Tiny bbox removal**
Given label with bbox that becomes < 0.01 area after clipping
When correct_bboxes() runs
Then bbox is removed from output

**AC-6: Empty label**
Given 1 image + empty label file
When augment_inner() runs
Then 8 ImageLabel objects returned, all with empty labels lists

**AC-7: Full pipeline**
Given 5 images + labels in data/ directory
When augment_annotations() runs with patched paths
Then 40 images in processed images dir, 40 matching labels

**AC-8: Skip already-processed**
Given 5 images in data/, 3 already in processed/
When augment_annotations() runs
Then only 2 new images processed (16 new outputs), existing 3 untouched

## Constraints

- Must patch constants.py paths to use tmp_path
- Fixture images from _docs/00_problem/input_data/dataset/
- Each test operates in isolated tmp_path
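The clipping behavior exercised by AC-4 and AC-5 can be sketched as follows. This `correct_bboxes` is a hypothetical stand-in for the function under test, assuming normalized center-format boxes `(cls, x, y, w, h)` and a 0.01 minimum-area threshold:

```python
def correct_bboxes(boxes, min_area=0.01):
    """Clip normalized center-format boxes into [0, 1]; drop tiny remnants."""
    corrected = []
    for cls, x, y, w, h in boxes:
        # convert to corner form, clip to the image bounds
        x1 = max(0.0, x - w / 2)
        y1 = max(0.0, y - h / 2)
        x2 = min(1.0, x + w / 2)
        y2 = min(1.0, y + h / 2)
        nw, nh = x2 - x1, y2 - y1
        if nw <= 0 or nh <= 0 or nw * nh < min_area:
            continue  # AC-5: box became too small after clipping
        corrected.append((cls, (x1 + x2) / 2, (y1 + y2) / 2, nw, nh))
    return corrected
```

A box at x=0.99 with w=0.2 is clipped so its right edge sits at 1.0 (AC-4), while a box whose clipped area falls below 0.01 is removed entirely (AC-5).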
@@ -0,0 +1,72 @@

# Augmentation Performance, Resilience & Resource Tests

**Task**: AZ-154_test_augmentation_nonfunc
**Name**: Augmentation Non-Functional Tests
**Description**: Implement performance, resilience, and resource limit tests for augmentation — throughput, parallel speedup, error handling, output bounds
**Complexity**: 2 points
**Dependencies**: AZ-152_test_infrastructure
**Component**: Blackbox Tests
**Jira**: AZ-154
**Epic**: AZ-151

## Problem

Augmentation must perform within time thresholds, handle corrupted/missing inputs gracefully, and respect output count bounds.

## Outcome

- 6 passing pytest tests across performance and resilience categories
- Performance tests in `tests/performance/test_augmentation_perf.py`
- Resilience and resource limit tests in `tests/test_augmentation.py` with markers

## Scope

### Included

- PT-AUG-01: Augmentation throughput (10 images ≤ 60s)
- PT-AUG-02: Parallel augmentation speedup (≥ 1.5× faster)
- RT-AUG-01: Handles corrupted image gracefully
- RT-AUG-02: Handles missing label file
- RT-AUG-03: Transform failure produces fewer variants (no crash)
- RL-AUG-01: Output count bounded to exactly 8

### Excluded

- Blackbox functional tests (separate task 02)

## Acceptance Criteria

**AC-1: Throughput**
Given 10 images from fixture dataset
When augment_annotations() runs
Then completes within 60 seconds

**AC-2: Parallel speedup**
Given 10 images from fixture dataset
When run with ThreadPoolExecutor vs sequential
Then parallel is ≥ 1.5× faster

**AC-3: Corrupted image**
Given 1 valid + 1 corrupted image (truncated JPEG)
When augment_annotations() runs
Then valid image produces 8 outputs, corrupted skipped, no crash

**AC-4: Missing label**
Given 1 image with no matching label file
When augment_annotation() runs on it
Then exception caught per-thread, pipeline continues

**AC-5: Transform failure**
Given 1 image + label with extremely narrow bbox
When augment_inner() runs
Then 1-8 ImageLabel objects returned, no crash

**AC-6: Output count bounded**
Given 1 image
When augment_inner() runs
Then exactly 8 outputs returned (never more)

## Constraints

- Performance tests require pytest markers: `@pytest.mark.performance`
- Resilience tests marked: `@pytest.mark.resilience`
- Resource limit tests marked: `@pytest.mark.resource_limit`
- Performance thresholds are generous (CPU-bound, no GPU requirement)
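The sequential-vs-parallel comparison in AC-2 can be structured as below. The helpers are hypothetical test scaffolding; in the real test each task would be one per-image augmentation call (which releases the GIL during I/O and NumPy work, so threads can overlap it):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_sequential(tasks):
    for task in tasks:
        task()


def run_parallel(tasks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # consume the iterator so every task actually executes
        list(pool.map(lambda t: t(), tasks))


def timed(fn, *args):
    """Return wall-clock seconds taken by fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start
```

The assertion in the real test would then be `t_sequential >= 1.5 * t_parallel`.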
@@ -0,0 +1,78 @@

# Dataset Formation Tests

**Task**: AZ-155_test_dataset_formation
**Name**: Dataset Formation Tests
**Description**: Implement blackbox, performance, resilience, and resource tests for dataset split — 70/20/10 ratio, directory structure, integrity, corrupted filtering
**Complexity**: 2 points
**Dependencies**: AZ-152_test_infrastructure
**Component**: Blackbox Tests
**Jira**: AZ-155
**Epic**: AZ-151

## Problem

Dataset formation splits annotated images into train/valid/test sets. Tests must verify correct ratios, directory structure, data integrity, corrupted label filtering, and performance.

## Outcome

- 8 passing pytest tests covering dataset formation
- Blackbox tests in `tests/test_dataset_formation.py`
- Performance test in `tests/performance/test_dataset_perf.py`

## Scope

### Included

- BT-DSF-01: 70/20/10 split ratio (100 images → 70/20/10)
- BT-DSF-02: Split directory structure (6 subdirs created)
- BT-DSF-03: Total files preserved (sum == 100)
- BT-DSF-04: Corrupted labels moved to corrupted directory
- PT-DSF-01: Dataset formation throughput (100 images ≤ 30s)
- RT-DSF-01: Empty processed directory handled gracefully
- RL-DSF-01: Split ratios sum to 100%
- RL-DSF-02: No data duplication across splits

### Excluded

- Label validation (separate task)

## Acceptance Criteria

**AC-1: Split ratio**
Given 100 images + labels in processed/ dir
When form_dataset() runs with patched paths
Then train: 70, valid: 20, test: 10

**AC-2: Directory structure**
Given 100 images + labels
When form_dataset() runs
Then creates train/images/, train/labels/, valid/images/, valid/labels/, test/images/, test/labels/

**AC-3: Data integrity**
Given 100 valid images + labels
When form_dataset() runs
Then count(train) + count(valid) + count(test) == 100

**AC-4: Corrupted filtering**
Given 95 valid + 5 corrupted labels
When form_dataset() runs
Then 5 in data-corrupted/, 95 across splits

**AC-5: Throughput**
Given 100 images + labels
When form_dataset() runs
Then completes within 30 seconds

**AC-6: Empty directory**
Given empty processed images dir
When form_dataset() runs
Then empty dirs created, no crash

**AC-7: No duplication**
Given 100 images after form_dataset()
When collecting all filenames across train/valid/test
Then no filename appears in more than one split

## Constraints

- Must patch constants.py paths to use tmp_path
- Requires copying 100 fixture images to tmp_path (session fixture)
- Performance test marked: `@pytest.mark.performance`
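The ratio arithmetic behind AC-1 and AC-3 can be sketched in one small helper. This is a hypothetical model of the split, assuming the first two splits are floored and the test split absorbs the remainder so the total is always preserved:

```python
def split_counts(n, ratios=(0.7, 0.2, 0.1)):
    """Return (train, valid, test) counts for n items under a 70/20/10 split."""
    train = int(n * ratios[0])
    valid = int(n * ratios[1])
    # test absorbs the remainder, so train + valid + test == n always holds
    return train, valid, n - train - valid
```

With 100 inputs this gives exactly (70, 20, 10), and the remainder rule guarantees the AC-3 integrity check for any n.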
@@ -0,0 +1,62 @@

# Label Validation Tests

**Task**: AZ-156_test_label_validation
**Name**: Label Validation Tests
**Description**: Implement 5 blackbox tests for YOLO label validation — valid labels, out-of-range coords, missing files, multi-line corruption
**Complexity**: 1 point
**Dependencies**: AZ-152_test_infrastructure
**Component**: Blackbox Tests
**Jira**: AZ-156
**Epic**: AZ-151

## Problem

Labels must be validated before dataset formation. Tests verify the check_label function correctly accepts valid labels and rejects corrupted ones.

## Outcome

- 5 passing pytest tests in `tests/test_label_validation.py`

## Scope

### Included

- BT-LBL-01: Valid label accepted (returns True)
- BT-LBL-02: Label with x > 1.0 rejected (returns False)
- BT-LBL-03: Label with height > 1.0 rejected (returns False)
- BT-LBL-04: Missing label file rejected (returns False)
- BT-LBL-05: Multi-line label with one corrupted line (returns False)

### Excluded

- Integration with dataset formation (separate task)

## Acceptance Criteria

**AC-1: Valid label**
Given label file with content `0 0.5 0.5 0.1 0.1`
When check_label(path) is called
Then returns True

**AC-2: x out of range**
Given label file with content `0 1.5 0.5 0.1 0.1`
When check_label(path) is called
Then returns False

**AC-3: height out of range**
Given label file with content `0 0.5 0.5 0.1 1.2`
When check_label(path) is called
Then returns False

**AC-4: Missing file**
Given non-existent file path
When check_label(path) is called
Then returns False

**AC-5: Multi-line corruption**
Given label with `0 0.5 0.5 0.1 0.1\n3 0.5 0.5 0.1 1.5`
When check_label(path) is called
Then returns False

## Constraints

- Label files are generated in tmp_path at test time
- No external fixtures needed
@@ -0,0 +1,102 @@

# Encryption & Security Tests

**Task**: AZ-157_test_encryption
**Name**: Encryption & Security Tests
**Description**: Implement blackbox, security, performance, resilience, and resource tests for AES-256-CBC encryption — roundtrips, key behavior, IV randomness, throughput, size bounds
**Complexity**: 3 points
**Dependencies**: AZ-152_test_infrastructure
**Component**: Blackbox Tests
**Jira**: AZ-157
**Epic**: AZ-151

## Problem

The encryption module must correctly encrypt/decrypt data, produce key-dependent ciphertexts with random IVs, handle edge cases, and meet throughput requirements.

## Outcome

- 13 passing pytest tests in `tests/test_encryption.py`
- Performance test in `tests/performance/test_encryption_perf.py`

## Scope

### Included

- BT-ENC-01: Encrypt-decrypt roundtrip (1024 random bytes)
- BT-ENC-02: Encrypt-decrypt roundtrip (ONNX model)
- BT-ENC-03: Empty input roundtrip
- BT-ENC-04: Single byte roundtrip
- BT-ENC-05: Different keys produce different ciphertext
- BT-ENC-06: Wrong key fails decryption
- PT-ENC-01: Encryption throughput (10MB ≤ 5s)
- RT-ENC-01: Decrypt with corrupted ciphertext
- ST-ENC-01: Random IV (same data, same key → different ciphertexts)
- ST-ENC-02: Wrong key cannot recover plaintext
- ST-ENC-03: Model encryption key is deterministic
- RL-ENC-01: Encrypted output size bounded (≤ N + 32 bytes)

### Excluded

- Model split tests (separate task)

## Acceptance Criteria

**AC-1: Roundtrip**
Given 1024 random bytes and key "test-key"
When encrypt then decrypt
Then output equals input exactly

**AC-2: Model roundtrip**
Given azaion.onnx bytes and model encryption key
When encrypt then decrypt
Then output equals input exactly

**AC-3: Empty input**
Given b"" and key
When encrypt then decrypt
Then output equals b""

**AC-4: Single byte**
Given b"\x00" and key
When encrypt then decrypt
Then output equals b"\x00"

**AC-5: Key-dependent ciphertext**
Given same data, keys "key-a" and "key-b"
When encrypting with each key
Then ciphertexts differ

**AC-6: Wrong key failure**
Given encrypted with "key-a"
When decrypting with "key-b"
Then output does NOT equal original

**AC-7: Throughput**
Given 10MB random bytes
When encrypt + decrypt roundtrip
Then completes within 5 seconds

**AC-8: Corrupted ciphertext**
Given randomly modified ciphertext bytes
When decrypt_to is called
Then either raises exception or returns non-original bytes

**AC-9: Random IV**
Given same data, same key, encrypted twice
When comparing ciphertexts
Then they differ (random IV)

**AC-10: Model key deterministic**
Given two calls to get_model_encryption_key()
When comparing results
Then identical

**AC-11: Size bound**
Given N bytes plaintext
When encrypted
Then ciphertext size ≤ N + 32 bytes

## Constraints

- ONNX model fixture is session-scoped (77MB, read once)
- Security tests marked: `@pytest.mark.security`
- Performance test marked: `@pytest.mark.performance`
- Resource limit test marked: `@pytest.mark.resource_limit`
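The N + 32 bound in AC-11 follows from the usual AES-CBC layout, assuming a 16-byte IV is prepended and the payload is PKCS#7-padded (padding always adds 1-16 bytes). A small arithmetic sketch under that assumption:

```python
BLOCK = 16  # AES block size in bytes

def expected_ciphertext_size(n: int) -> int:
    """Ciphertext size for n plaintext bytes: IV + PKCS#7-padded payload."""
    # PKCS#7 pads to the next multiple of 16 strictly above n,
    # so it adds between 1 and 16 bytes
    padded = (n // BLOCK + 1) * BLOCK
    return BLOCK + padded
```

The overhead is at most 16 (IV) + 16 (padding) = 32 bytes, with the bound tight exactly when n is a multiple of the block size.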
@@ -0,0 +1,44 @@

# Model Split Tests

**Task**: AZ-158_test_model_split
**Name**: Model Split Tests
**Description**: Implement 2 blackbox tests for model split storage — size constraint and reassembly integrity
**Complexity**: 1 point
**Dependencies**: AZ-152_test_infrastructure
**Component**: Blackbox Tests
**Jira**: AZ-158
**Epic**: AZ-151

## Problem

Encrypted models are split into small and big parts for CDN storage. Tests must verify the split respects size constraints and reassembly produces the original.

## Outcome

- 2 passing pytest tests in `tests/test_model_split.py`

## Scope

### Included

- BT-SPL-01: Split respects size constraint (small ≤ max(3072 bytes, 20% of total))
- BT-SPL-02: Reassembly produces original (small + big == encrypted bytes)

### Excluded

- CDN upload/download (requires external service)

## Acceptance Criteria

**AC-1: Size constraint**
Given 10000 encrypted bytes
When split into small + big
Then small ≤ max(3072, total × 0.2); big = remainder

**AC-2: Reassembly**
Given split parts from 10000 encrypted bytes
When small + big concatenated
Then equals original encrypted bytes

## Constraints

- Uses generated binary data (no fixture files needed)
- References SMALL_SIZE_KB constant from constants.py
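Both ACs reduce to a simple prefix split. A sketch of the constraint, where `SMALL_SIZE_KB = 3` is an assumed value of the constants.py setting:

```python
SMALL_SIZE_KB = 3  # assumed value of the constants.py constant

def split_model(encrypted: bytes) -> tuple[bytes, bytes]:
    """Split into (small, big): small is capped at max(3072 B, 20% of total)."""
    limit = max(SMALL_SIZE_KB * 1024, int(len(encrypted) * 0.2))
    # slicing guarantees small + big == encrypted (AC-2 reassembly)
    return encrypted[:limit], encrypted[limit:]
```

Because both parts are contiguous slices, concatenating them trivially reproduces the original bytes, which is what BT-SPL-02 asserts.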
@@ -0,0 +1,57 @@

# Annotation Class & YAML Tests

**Task**: AZ-159_test_annotation_classes
**Name**: Annotation Class & YAML Tests
**Description**: Implement 4 tests for annotation class loading, weather mode expansion, YAML generation, and total class count
**Complexity**: 2 points
**Dependencies**: AZ-152_test_infrastructure
**Component**: Blackbox Tests
**Jira**: AZ-159
**Epic**: AZ-151

## Problem

The system loads 17 base annotation classes, expands them across 3 weather modes, and generates a data.yaml with 80 class slots. Tests verify the class pipeline.

## Outcome

- 4 passing pytest tests in `tests/test_annotation_classes.py`

## Scope

### Included

- BT-CLS-01: Load 17 base classes from classes.json
- BT-CLS-02: Weather mode expansion (offsets 0, 20, 40)
- BT-CLS-03: YAML generation produces nc: 80 with 17 named + 63 placeholders
- RL-CLS-01: Total class count is exactly 80

### Excluded

- Training configuration (beyond scope)

## Acceptance Criteria

**AC-1: Base classes**
Given classes.json
When AnnotationClass.read_json() is called
Then returns dict with 17 unique base class entries

**AC-2: Weather expansion**
Given classes.json
When classes are read
Then same class exists at offset 0 (Norm), 20 (Wint), 40 (Night)

**AC-3: YAML generation**
Given classes.json + dataset path
When create_yaml() runs with patched paths
Then data.yaml contains nc: 80, 17 named classes + 63 Class-N placeholders

**AC-4: Total count**
Given classes.json
When generating class list
Then exactly 80 entries

## Constraints

- Uses classes.json from project root (fixture_classes_json)
- YAML output gooes to tmp_path
- Resource limit test marked: `@pytest.mark.resource_limit`
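The AC-3/AC-4 class-list shape can be modeled with one helper. This is a hypothetical sketch of the expected output (17 named base classes at indices 0-16, `Class-N` placeholders filling the remaining slots up to nc: 80), not the project's `create_yaml` itself:

```python
def yaml_class_names(base: dict, total: int = 80) -> list:
    """Class names for data.yaml: named entries where defined, Class-N otherwise."""
    # base maps index 0..16 -> class name; every other slot gets a placeholder
    return [base.get(i, f"Class-{i}") for i in range(total)]
```

With 17 base entries this yields exactly 80 names, of which 63 are `Class-N` placeholders, matching the RL-CLS-01 count bound.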
@@ -0,0 +1,65 @@

# Hardware Hash & API Key Tests

**Task**: AZ-160_test_hardware_hash
**Name**: Hardware Hash & API Key Tests
**Description**: Implement 7 tests for hardware fingerprinting — determinism, uniqueness, base64 format, API key derivation from credentials and hardware
**Complexity**: 2 points
**Dependencies**: AZ-152_test_infrastructure
**Component**: Blackbox Tests
**Jira**: AZ-160
**Epic**: AZ-151

## Problem

Hardware hashing provides machine-bound security for model encryption and API authentication. Tests must verify determinism, uniqueness, format, and credential/hardware dependency.

## Outcome

- 7 passing pytest tests in `tests/test_hardware_hash.py`

## Scope

### Included

- BT-HSH-01: Deterministic output (same input → same hash)
- BT-HSH-02: Different inputs → different hashes
- BT-HSH-03: Output is valid base64
- ST-HSH-01: Hardware hash deterministic (duplicate of BT-HSH-01 for security coverage)
- ST-HSH-02: Different hardware → different hash
- ST-HSH-03: API encryption key depends on credentials + hardware
- ST-HSH-04: API encryption key depends on credentials

### Excluded

- Actual hardware info collection (may need mocking)

## Acceptance Criteria

**AC-1: Determinism**
Given "test-hardware-info"
When get_hw_hash() called twice
Then both calls return identical string

**AC-2: Uniqueness**
Given "hw-a" and "hw-b"
When get_hw_hash() called on each
Then results differ

**AC-3: Base64 format**
Given "test-hardware-info"
When get_hw_hash() called
Then result matches `^[A-Za-z0-9+/]+=*$`

**AC-4: API key depends on hardware**
Given same credentials, different hardware hashes
When get_api_encryption_key() called
Then different keys returned

**AC-5: API key depends on credentials**
Given different credentials, same hardware hash
When get_api_encryption_key() called
Then different keys returned

## Constraints

- Security tests marked: `@pytest.mark.security`
- May require mocking hardware info collection functions
- All inputs are generated strings (no external fixtures)
@@ -0,0 +1,62 @@

# ONNX Inference Tests

**Task**: AZ-161_test_onnx_inference
**Name**: ONNX Inference Tests
**Description**: Implement 4 tests for ONNX model loading, inference execution, postprocessing, and CPU latency
**Complexity**: 3 points
**Dependencies**: AZ-152_test_infrastructure
**Component**: Blackbox Tests
**Jira**: AZ-161
**Epic**: AZ-151

## Problem

The ONNX inference engine loads a model, runs detection on images, and postprocesses results. Tests must verify the full pipeline works on CPU (smoke test — no precision validation).

## Outcome

- 4 passing pytest tests
- Blackbox tests in `tests/test_onnx_inference.py`
- Performance test in `tests/performance/test_inference_perf.py`

## Scope

### Included

- BT-INF-01: Model loads successfully (no exception, valid engine)
- BT-INF-02: Inference returns output (array shape [batch, N, 6+])
- BT-INF-03: Postprocessing returns valid detections (x,y,w,h ∈ [0,1], cls ∈ [0,79], conf ∈ [0,1])
- PT-INF-01: ONNX inference latency (single image ≤ 10s on CPU)

### Excluded

- TensorRT inference (requires NVIDIA GPU)
- Detection precision/recall validation (smoke-only per user decision)

## Acceptance Criteria

**AC-1: Model loads**
Given azaion.onnx bytes
When OnnxEngine(model_bytes) is constructed
Then no exception; engine has valid input_shape and batch_size

**AC-2: Inference output**
Given ONNX engine + 1 preprocessed image
When engine.run(input_blob) is called
Then returns list of numpy arrays; first array has shape [batch, N, 6+]

**AC-3: Valid detections**
Given ONNX engine output from real image
When Inference.postprocess() is called
Then returns list of Detection objects; each has x,y,w,h ∈ [0,1], cls ∈ [0,79], confidence ∈ [0,1]

**AC-4: CPU latency**
Given 1 preprocessed image + ONNX model
When single inference runs
Then completes within 10 seconds

## Constraints

- Uses onnxruntime (CPU) not onnxruntime-gpu
- ONNX model is 77MB, loaded once (session fixture)
- Image preprocessing must match model input size (1280×1280)
- Performance test marked: `@pytest.mark.performance`
- This is a smoke test — validates structure, not detection accuracy
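The AC-3 bounds check can be written as one predicate applied to every postprocessed detection. The dict fields here are a hypothetical stand-in for the Detection object's attributes:

```python
def is_valid_detection(det: dict) -> bool:
    """AC-3 bounds: normalized geometry, class id in [0, 79], confidence in [0, 1]."""
    geometry_ok = all(0.0 <= det[k] <= 1.0 for k in ("x", "y", "w", "h"))
    class_ok = 0 <= det["cls"] <= 79
    confidence_ok = 0.0 <= det["confidence"] <= 1.0
    return geometry_ok and class_ok and confidence_ok
```

In the smoke test, the assertion would simply be `all(is_valid_detection(d) for d in detections)` over the real postprocess output.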
@@ -0,0 +1,50 @@

# NMS Overlap Removal Tests

**Task**: AZ-162_test_nms
**Name**: NMS Overlap Removal Tests
**Description**: Implement 3 tests for non-maximum suppression — overlapping kept by confidence, non-overlapping preserved, chain overlap resolution
**Complexity**: 1 point
**Dependencies**: AZ-152_test_infrastructure
**Component**: Blackbox Tests
**Jira**: AZ-162
**Epic**: AZ-151

## Problem

The NMS module removes overlapping detections based on IoU threshold (0.3), keeping the higher-confidence detection. Tests verify all overlap scenarios.

## Outcome

- 3 passing pytest tests in `tests/test_nms.py`

## Scope

### Included

- BT-NMS-01: Overlapping detections — keep higher confidence (IoU > 0.3 → 1 kept)
- BT-NMS-02: Non-overlapping detections — keep both (IoU < 0.3 → 2 kept)
- BT-NMS-03: Chain overlap resolution (A↔B, B↔C → ≤ 2 kept)

### Excluded

- Integration with inference pipeline (separate task)

## Acceptance Criteria

**AC-1: Overlap removal**
Given 2 Detections at same position, confidence 0.9 and 0.5, IoU > 0.3
When remove_overlapping_detections() runs
Then 1 detection returned (confidence 0.9)

**AC-2: Non-overlapping preserved**
Given 2 Detections at distant positions, IoU < 0.3
When remove_overlapping_detections() runs
Then 2 detections returned

**AC-3: Chain overlap**
Given 3 Detections: A overlaps B, B overlaps C, A doesn't overlap C
When remove_overlapping_detections() runs
Then ≤ 2 detections; highest confidence per overlapping pair kept

## Constraints

- Detection objects constructed in-memory (no fixture files)
- IoU threshold is 0.3 (from constants or hardcoded in NMS)
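All three scenarios fall out of greedy IoU-based suppression. A minimal sketch, assuming corner-format boxes and `(box, confidence)` pairs as a stand-in for the Detection objects:

```python
def iou(a, b):
    """Intersection-over-union of two corner-format boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def remove_overlapping_detections(dets, iou_threshold=0.3):
    """Greedy NMS: keep a detection only if it overlaps nothing already kept."""
    kept = []
    for box, conf in sorted(dets, key=lambda d: -d[1]):  # highest confidence first
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, conf))
    return kept
```

Because candidates are visited in descending confidence, the chain case (A↔B, B↔C, A↮C) keeps A, suppresses B against A, then keeps C — at most 2 survivors, as BT-NMS-03 requires.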
@@ -0,0 +1,64 @@

# Annotation Queue Message Tests

**Task**: AZ-163_test_annotation_queue
**Name**: Annotation Queue Message Tests
**Description**: Implement 5 tests for annotation queue message parsing — Created, Validated bulk, Deleted bulk, malformed handling
**Complexity**: 2 points
**Dependencies**: AZ-152_test_infrastructure
**Component**: Blackbox Tests
**Jira**: AZ-163
**Epic**: AZ-151

## Problem

The annotation queue processes msgpack-encoded messages from RabbitMQ Streams. Tests must verify correct parsing of all message types and graceful handling of malformed input.

## Outcome

- 5 passing pytest tests in `tests/test_annotation_queue.py`

## Scope

### Included

- BT-AQM-01: Parse Created annotation message (all fields populated correctly)
- BT-AQM-02: Parse Validated bulk message (status == Validated, names list matches)
- BT-AQM-03: Parse Deleted bulk message (status == Deleted, names list matches)
- BT-AQM-04: Malformed message raises exception
- RT-AQM-01: Malformed msgpack bytes handled (exception caught, no crash)

### Excluded

- Live RabbitMQ Streams connection (requires external service)
- Queue offset persistence (requires live broker)

## Acceptance Criteria

**AC-1: Created message**
Given msgpack bytes matching AnnotationMessage schema (status=Created, role=Validator)
When decoded and constructed
Then all fields populated: name, detections, image bytes, status == "Created", role == "Validator"

**AC-2: Validated bulk**
Given msgpack bytes with status=Validated, list of names
When decoded and constructed
Then status == "Validated", names list matches input

**AC-3: Deleted bulk**
Given msgpack bytes with status=Deleted, list of names
When decoded and constructed
Then status == "Deleted", names list matches input

**AC-4: Malformed msgpack**
Given invalid msgpack bytes
When decode is attempted
Then exception raised

**AC-5: Resilient handling**
Given random bytes (not valid msgpack)
When passed to message handler
Then exception caught, handler doesn't crash

## Constraints

- Msgpack messages constructed in-memory at test time
- Must match the AnnotationMessage/AnnotationBulkMessage schemas from annotation-queue/
- Resilience test marked: `@pytest.mark.resilience`
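The split between AC-4 (decode raises) and AC-5 (handler survives) is just a try/except boundary around the decoder. A sketch of that pattern; `strict_decode` is a hypothetical stand-in for msgpack decoding plus schema construction, not the project's actual parser:

```python
def handle_message(raw: bytes, decode):
    """RT-AQM-01: any decode failure is caught so the consumer keeps running."""
    try:
        return decode(raw)
    except Exception:
        return None  # real handler would log and advance the stream offset


def strict_decode(raw: bytes) -> dict:
    """Stand-in decoder: raises on anything that is not `status:name` UTF-8."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on garbage bytes
    status, _, name = text.partition(":")
    if not status or not name:
        raise ValueError("malformed message")
    return {"status": status, "name": name}
```

The decoder itself stays strict (satisfying AC-4), while the wrapper absorbs the exception (satisfying AC-5) — the same division of labor the real msgpack-based handler needs.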