Resilience Tests

Task: AZ-145_test_resilience
Name: Resilience Tests
Description: Implement E2E tests verifying service resilience during external service outages, transient failures, and container restarts
Complexity: 5 points
Dependencies: AZ-138_test_infrastructure, AZ-142_test_async_sse
Component: Integration Tests
Jira: AZ-145
Epic: AZ-137

Problem

The detection service must continue operating when external dependencies fail. Tests must verify resilience during loader outages (before and after engine init), annotations service outages, transient loader failures with retry, and service restarts with state loss.

Outcome

Detection continues when loader goes down after engine is loaded
Async detection completes when annotations service is down
Engine initialization retries after transient loader failure
Service restart clears all in-memory state cleanly
Loader outage during the initial model download is handled gracefully
Annotations failure during async detection does not stop the pipeline

Scope

Included

FT-N-06: Loader service unreachable during model download
FT-N-07: Annotations service unreachable — detection continues
NFT-RES-01: Loader service outage after engine initialization
NFT-RES-02: Annotations service outage during async detection
NFT-RES-03: Engine initialization retry after transient loader failure
NFT-RES-04: Service restart with in-memory state loss

Excluded

Input validation errors (covered in negative tests)
Performance under fault conditions
Network partition simulation beyond service stop/start

Acceptance Criteria

AC-1: Loader unreachable during init
Given mock-loader is stopped and engine not initialized
When POST /detect is called
Then response is 503 or 422 error
And GET /health reflects engine error state

AC-2: Annotations unreachable, detection continues
Given engine is initialized and mock-annotations is stopped
When async detection is triggered
Then SSE events still arrive and final AIProcessed event is received

AC-3: Loader outage after init
Given engine is already initialized (model in memory)
When mock-loader is stopped and POST /detect is called
Then detection succeeds (200 OK, engine already loaded)
And GET /health remains "Enabled"

AC-4: Annotations outage mid-processing
Given async detection is in progress
When mock-annotations is stopped mid-processing
Then SSE events continue arriving
And detection completes with AIProcessed event

AC-5: Transient loader failure with retry
Given mock-loader fails first request then recovers
When first POST /detect fails and second POST /detect is sent
Then second detection succeeds (engine initializes on retry)

AC-6: Service restart state reset
Given a detection may have been in progress
When the detections container is restarted
Then GET /health returns aiAvailability "None" (fresh start)
And POST /detect/{media_id} is accepted (no stale _active_detections)
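
AC-2 and AC-4 come down to the same check on the event stream: events keep arriving, and the last one marks the detection as processed. The sketch below shows one way to express that check. It assumes the stream uses the standard `data:` line framing of Server-Sent Events and that the literal text `AIProcessed` appears somewhere in the final payload; the payload schema is not defined in this task, and the helper names `iter_sse_data` and `stream_completed` are hypothetical.

```python
def iter_sse_data(lines):
    """Yield the data payload of each complete SSE event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # The SSE framing allows one optional space after the colon.
            value = line[5:]
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            # A blank line terminates the current event.
            yield "\n".join(buffer)
            buffer = []
        # Comment lines (":...") and other fields (event:, id:) are ignored here.


def stream_completed(lines, final_marker="AIProcessed"):
    """Return True if at least one event arrived and the last one carries the marker.

    Assumption: the marker is matched as plain text because the real payload
    schema is not specified in this task.
    """
    events = list(iter_sse_data(lines))
    return bool(events) and final_marker in events[-1]
```

In a real test the `lines` argument would be the decoded line iterator of the SSE response, read while mock-annotations is stopped.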

Non-Functional Requirements

Reliability

All fault injection tests must restore mock services after test completion
Service must not crash or leave zombie processes after any failure scenario
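
One way to guarantee that a stopped mock is restored even when a test assertion fails is to wrap the fault injection in a context manager. This is a minimal sketch, assuming the suite runs from a directory where `docker compose` resolves the project; the helper names `compose` and `service_stopped` are hypothetical, and the `run` parameter exists only so the command runner can be replaced in unit tests.

```python
import subprocess
from contextlib import contextmanager


def compose(*args, run=subprocess.run):
    """Run `docker compose <args>` and raise if the command fails."""
    run(["docker", "compose", *args], check=True)


@contextmanager
def service_stopped(service, run=subprocess.run):
    """Stop a compose service for the duration of the block, then start it again."""
    compose("stop", service, run=run)
    try:
        yield
    finally:
        # Runs even when the test body raises, so later tests see the mock again.
        compose("start", service, run=run)
```

A test body would then read `with service_stopped("mock-loader"): ...`, with the assertions placed inside the block.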

Integration Tests

AC Ref	Initial Data/Conditions	What to Test	Expected Behavior	NFR References (time limit)
AC-1	mock-loader stopped, fresh engine	POST /detect	503/422 graceful error	Max 30s
AC-2	Engine warm, mock-annotations stopped	Async detection + SSE	SSE events continue, AIProcessed	Max 120s
AC-3	Engine warm, mock-loader stopped	POST /detect (sync)	200 OK, detection succeeds	Max 30s
AC-4	Async detection started, then stop mock-annotations	SSE event stream	Events continue, pipeline completes	Max 120s
AC-5	mock-loader: first_fail mode	Two sequential POST /detect	First fails, second succeeds	Max 60s
AC-6	Restart detections container	Health + detect after restart	Clean state, no stale data	Max 60s

Constraints

Fault injection via Docker service stop/start and mock control API
Container restart test requires docker compose restart capability
Mock services must support configurable failure modes (normal, error, first_fail)
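
The three failure modes can be modeled as a small state machine inside each mock. The sketch below illustrates the idea only: the class name `MockBehaviour`, the method names, and the use of HTTP 503 as the failure status are assumptions, not the actual mock control API.

```python
from enum import Enum


class FailureMode(Enum):
    NORMAL = "normal"
    ERROR = "error"
    FIRST_FAIL = "first_fail"


class MockBehaviour:
    """Decides the HTTP status a mock returns for each incoming request."""

    def __init__(self, mode=FailureMode.NORMAL):
        self.set_mode(mode)

    def set_mode(self, mode):
        # Accepts either a FailureMode member or its string value.
        self.mode = FailureMode(mode)
        # Switching mode restarts the request counter, so first_fail fails again.
        self.requests_seen = 0

    def next_status(self):
        self.requests_seen += 1
        if self.mode is FailureMode.ERROR:
            return 503  # assumed failure status
        if self.mode is FailureMode.FIRST_FAIL and self.requests_seen == 1:
            return 503  # only the first request after set_mode fails
        return 200
```

With `first_fail`, the first request is rejected and every later one succeeds, which is the sequence AC-5 relies on.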

Risks & Mitigation

Risk 1: Container restart timing

Risk: Container restart may take variable time, causing flaky tests
Mitigation: Use service readiness polling with generous timeout before assertions
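
Readiness polling can be sketched as a loop that retries a probe until it succeeds or a deadline passes. The function name `wait_until_ready` and its parameters are hypothetical; the probe is whatever callable the suite uses to check the service, for example a GET /health request. The `clock` and `sleep` parameters are injectable so the helper can be tested without real waiting.

```python
import time


def wait_until_ready(probe, timeout_s=60.0, interval_s=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns a truthy value or timeout_s elapses.

    Returns True when the probe succeeded, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or reset while the container is still starting.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Assertions for AC-6 would run only after this helper returns True, so a slow restart produces a clear timeout failure instead of an intermittent assertion error.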

Risk 2: Mock state leakage between tests

Risk: Stopped mock may affect subsequent tests
Mitigation: Function-scoped mock reset fixture restores all mocks before each test
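
The reset fixture can be sketched as a generator that restores every mock, hands control to the test, and restores again afterwards. To keep the example runnable without Docker, `MockHandle` is a hypothetical in-memory stand-in; in the real suite `reset_mocks` would start the stopped containers and call the mock control API.

```python
from dataclasses import dataclass


@dataclass
class MockHandle:
    """In-memory stand-in for one mock service (hypothetical helper)."""
    name: str
    running: bool = True
    mode: str = "normal"


def reset_mocks(mocks):
    """Bring every mock back to a running service in 'normal' mode."""
    for mock in mocks:
        mock.running = True
        mock.mode = "normal"


def mock_reset(mocks):
    """Generator with the shape of a pytest fixture: reset, run the test, reset."""
    reset_mocks(mocks)      # before the test: undo leakage from a previous test
    try:
        yield mocks
    finally:
        reset_mocks(mocks)  # after the test: restore whatever this test broke
```

Under pytest, the same generator body could be decorated with `@pytest.fixture`, whose default scope is per test function.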
