Resilience Tests
Task: AZ-145_test_resilience
Name: Resilience Tests
Description: Implement E2E tests verifying service resilience during external service outages, transient failures, and container restarts
Complexity: 5 points
Dependencies: AZ-138_test_infrastructure, AZ-142_test_async_sse
Component: Integration Tests
Jira: AZ-145
Epic: AZ-137
Problem
The detection service must continue operating when external dependencies fail. Tests must verify resilience during loader outages (before and after engine init), annotations service outages, transient loader failures with retry, and service restarts with state loss.
Outcome
- Detection continues when loader goes down after engine is loaded
- Async detection completes when annotations service is down
- Engine initialization retries after transient loader failure
- Service restart clears all in-memory state cleanly
- Loader unreachable during initial model download handled gracefully
- Annotations failure during async detection does not stop the pipeline
Scope
Included
- FT-N-06: Loader service unreachable during model download
- FT-N-07: Annotations service unreachable — detection continues
- NFT-RES-01: Loader service outage after engine initialization
- NFT-RES-02: Annotations service outage during async detection
- NFT-RES-03: Engine initialization retry after transient loader failure
- NFT-RES-04: Service restart with in-memory state loss
Excluded
- Input validation errors (covered in negative tests)
- Performance under fault conditions
- Network partition simulation beyond service stop/start
Acceptance Criteria
AC-1: Loader unreachable during init
Given mock-loader is stopped and engine not initialized
When POST /detect is called
Then response is 503 or 422 error
And GET /health reflects engine error state

AC-2: Annotations unreachable, detection continues
Given engine is initialized and mock-annotations is stopped
When async detection is triggered
Then SSE events still arrive and final AIProcessed event is received

AC-3: Loader outage after init
Given engine is already initialized (model in memory)
When mock-loader is stopped and POST /detect is called
Then detection succeeds (200 OK, engine already loaded)
And GET /health remains "Enabled"

AC-4: Annotations outage mid-processing
Given async detection is in progress
When mock-annotations is stopped mid-processing
Then SSE events continue arriving
And detection completes with AIProcessed event

AC-5: Transient loader failure with retry
Given mock-loader fails first request then recovers
When first POST /detect fails and second POST /detect is sent
Then second detection succeeds (engine initializes on retry)

AC-6: Service restart state reset
Given a detection may have been in progress
When the detections container is restarted
Then GET /health returns aiAvailability "None" (fresh start)
And POST /detect/{media_id} is accepted (no stale _active_detections)
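
AC-2 and AC-4 both come down to "the SSE stream keeps delivering events and ends with AIProcessed". One possible way to check that is sketched below. It parses only the `data:` field of the standard SSE wire format. The payload shape is an assumption: the field name `mediaStatus` and the helper names `iter_sse_data` and `stream_completed` are hypothetical and would need to match what the detection service actually emits.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete SSE event from text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one leading space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (": ...") and other fields (event:, id:) are ignored.
    # An event that is not terminated by a blank line is dropped.


def stream_completed(
    lines: Iterable[str],
    status_field: str = "mediaStatus",   # assumed field name, not from the spec
    final_status: str = "AIProcessed",
) -> bool:
    """Return True if the last complete event carries the final status."""
    events = [json.loads(payload) for payload in iter_sse_data(lines)]
    return bool(events) and events[-1].get(status_field) == final_status
```

In a real test the `lines` argument would be the decoded line iterator of the streaming HTTP response, read while mock-annotations is stopped.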
Non-Functional Requirements
Reliability
- All fault injection tests must restore mock services after test completion
- Service must not crash or leave zombie processes after any failure scenario
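
The restore requirement above can be enforced structurally rather than by convention. The sketch below wraps the outage in a context manager so that the service is started again even when an assertion fails. It assumes the tests run where `docker compose` can reach the project; `service_stopped` and `run_checked` are illustrative names, and the `run` parameter exists only so the helper can be exercised without Docker.

```python
import contextlib
import subprocess
from typing import Callable, Iterator, Sequence

Runner = Callable[[Sequence[str]], None]


def run_checked(argv: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(list(argv), check=True)


@contextlib.contextmanager
def service_stopped(service: str, run: Runner = run_checked) -> Iterator[None]:
    """Stop a compose service for the duration of the with-block."""
    run(["docker", "compose", "stop", service])
    try:
        yield
    finally:
        # Runs on success, on assertion failure, and on any other exception.
        run(["docker", "compose", "start", service])
```

A test for AC-3 could then read `with service_stopped("mock-loader"): ...` and place the POST /detect call and its assertions inside the block.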
Integration Tests
| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|---|---|---|---|---|
| AC-1 | mock-loader stopped, fresh engine | POST /detect | 503/422 graceful error | Max 30s |
| AC-2 | Engine warm, mock-annotations stopped | Async detection + SSE | SSE events continue, AIProcessed | Max 120s |
| AC-3 | Engine warm, mock-loader stopped | POST /detect (sync) | 200 OK, detection succeeds | Max 30s |
| AC-4 | Async detection started, then stop mock-annotations | SSE event stream | Events continue, pipeline completes | Max 120s |
| AC-5 | mock-loader: first_fail mode | Two sequential POST /detect | First fails, second succeeds | Max 60s |
| AC-6 | Restart detections container | Health + detect after restart | Clean state, no stale data | Max 60s |
Constraints
- Fault injection via Docker service stop/start and mock control API
- Container restart test requires docker compose restart capability
- Mock services must support configurable failure modes (normal, error, first_fail)
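
To make the three failure modes concrete, here is a minimal sketch of the decision logic a mock could run per incoming request. Only the mode names (normal, error, first_fail) come from this task; the `FailurePolicy` class, its methods, and the choice to reset the request counter on every mode change are assumptions, and the real mock control API may differ.

```python
import enum
import threading


class FailureMode(enum.Enum):
    NORMAL = "normal"
    ERROR = "error"
    FIRST_FAIL = "first_fail"


class FailurePolicy:
    """Decides, per request, whether the mock should answer with an error."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._mode = FailureMode.NORMAL
        self._seen = 0

    def set_mode(self, mode) -> None:
        """Switch mode (name or enum member) and restart the request count."""
        with self._lock:
            self._mode = FailureMode(mode)
            self._seen = 0

    def should_fail(self) -> bool:
        """Call once per incoming request."""
        with self._lock:
            self._seen += 1
            if self._mode is FailureMode.ERROR:
                return True
            if self._mode is FailureMode.FIRST_FAIL:
                # Only the first request after set_mode() fails.
                return self._seen == 1
            return False
```

With `first_fail` set on mock-loader, the first POST /detect would hit a failing loader and the second would see it recovered, which is the sequence AC-5 expects.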
Risks & Mitigation
Risk 1: Container restart timing
- Risk: Container restart may take variable time, causing flaky tests
- Mitigation: Use service readiness polling with generous timeout before assertions
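
The readiness polling mentioned in the mitigation could look like the sketch below. It is a generic helper, not taken from the existing test infrastructure: `wait_until` and its parameters are illustrative, and the `check` callable (for example, one that calls GET /health and inspects the response) is left to the caller. Connection errors raised while the container is still starting are treated as "not ready yet".

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if check():
                return True
        except OSError:
            # Covers ConnectionError and similar socket-level failures
            # that occur while the service is not yet listening.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

For AC-6, a test might restart the detections container, call `wait_until(health_is_reachable, timeout_s=60)` with a caller-supplied probe, and only then assert on aiAvailability.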
Risk 2: Mock state leakage between tests
- Risk: Stopped mock may affect subsequent tests
- Mitigation: Function-scoped mock reset fixture restores all mocks before each test
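
A reset of this kind might be structured as below. The service names come from this task; `reset_all_mocks` and the two injected callables (`start_service`, `set_mode`) are hypothetical hooks standing in for the Docker and mock control API calls. The commented fixture shows one possible pytest wiring that resets both before and after each test, which would also cover the restore-after-completion requirement under Reliability.

```python
from typing import Callable, Sequence

# Service names taken from this task; the callables are hypothetical hooks.
MOCK_SERVICES = ("mock-loader", "mock-annotations")


def reset_all_mocks(
    start_service: Callable[[str], None],
    set_mode: Callable[[str, str], None],
    services: Sequence[str] = MOCK_SERVICES,
) -> None:
    """Bring every mock back up and switch it to the 'normal' failure mode."""
    for name in services:
        start_service(name)
        set_mode(name, "normal")


# Possible pytest wiring (function scope is pytest's default):
#
# @pytest.fixture(autouse=True)
# def clean_mocks():
#     reset_all_mocks(start_service, set_mode)   # before the test
#     yield
#     reset_all_mocks(start_service, set_mode)   # after the test, even on failure
```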