Resilience Tests

Task: AZ-145_test_resilience
Name: Resilience Tests
Description: Implement E2E tests verifying service resilience during external service outages, transient failures, and container restarts
Complexity: 5 points
Dependencies: AZ-138_test_infrastructure, AZ-142_test_async_sse
Component: Integration Tests
Jira: AZ-145
Epic: AZ-137

Problem

The detection service must continue operating when external dependencies fail. Tests must verify resilience during loader outages (before and after engine init), annotations service outages, transient loader failures with retry, and service restarts with state loss.

Outcome

  • Detection continues when loader goes down after engine is loaded
  • Async detection completes when annotations service is down
  • Engine initialization retries after transient loader failure
  • Service restart clears all in-memory state cleanly
  • Loader unreachable during initial model download handled gracefully
  • Annotations failure during async detection does not stop the pipeline

Scope

Included

  • FT-N-06: Loader service unreachable during model download
  • FT-N-07: Annotations service unreachable — detection continues
  • NFT-RES-01: Loader service outage after engine initialization
  • NFT-RES-02: Annotations service outage during async detection
  • NFT-RES-03: Engine initialization retry after transient loader failure
  • NFT-RES-04: Service restart with in-memory state loss

Excluded

  • Input validation errors (covered in negative tests)
  • Performance under fault conditions
  • Network partition simulation beyond service stop/start

Acceptance Criteria

AC-1: Loader unreachable during init
Given mock-loader is stopped and engine not initialized
When POST /detect is called
Then response is 503 or 422 error
And GET /health reflects engine error state

AC-2: Annotations unreachable, detection continues
Given engine is initialized and mock-annotations is stopped
When async detection is triggered
Then SSE events still arrive and final AIProcessed event is received

AC-3: Loader outage after init
Given engine is already initialized (model in memory)
When mock-loader is stopped and POST /detect is called
Then detection succeeds (200 OK, engine already loaded)
And GET /health remains "Enabled"

AC-4: Annotations outage mid-processing
Given async detection is in progress
When mock-annotations is stopped mid-processing
Then SSE events continue arriving
And detection completes with AIProcessed event

AC-5: Transient loader failure with retry
Given mock-loader fails first request then recovers
When first POST /detect fails and second POST /detect is sent
Then second detection succeeds (engine initializes on retry)

AC-6: Service restart state reset
Given a detection may have been in progress
When the detections container is restarted
Then GET /health returns aiAvailability "None" (fresh start)
And POST /detect/{media_id} is accepted (no stale _active_detections)
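AC-2 and AC-4 both come down to reading the SSE stream until the final AIProcessed event shows up. A minimal sketch of that check follows. It assumes each event carries a JSON payload in its `data:` field and that the status lives in a field named `mediaStatus`; both the field name and the helper names are hypothetical and should be adjusted to the real event schema.

```python
import json


def iter_sse_data(lines):
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates one event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format drops one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (":...") and other fields are ignored in this sketch.


def collect_until_processed(lines, status_field="mediaStatus", final="AIProcessed"):
    """Return all events up to and including the first one with the final status.

    status_field is an assumed name, not taken from the service contract.
    """
    events = []
    for payload in iter_sse_data(lines):
        event = json.loads(payload)
        events.append(event)
        if event.get(status_field) == final:
            return events
    raise AssertionError(f"stream ended before a {final!r} event arrived")
```

In a test, `lines` could be the decoded line iterator of a streaming HTTP response; the mock-annotations outage is injected before (AC-2) or during (AC-4) the iteration.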

Non-Functional Requirements

Reliability

  • All fault injection tests must restore mock services after test completion
  • Service must not crash or leave zombie processes after any failure scenario
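One way to guarantee that mocks are restored after a fault injection test is to wrap the outage in a context manager whose cleanup runs in a `finally` block. The sketch below assumes the mocks are Docker Compose services named as in this task (`mock-loader`, `mock-annotations`); the helper names `run_compose` and `service_stopped` are hypothetical.

```python
import subprocess
from contextlib import contextmanager


def run_compose(*args):
    """Run `docker compose <args>` and raise if the command fails."""
    subprocess.run(["docker", "compose", *args], check=True)


@contextmanager
def service_stopped(service, run=run_compose):
    """Stop a compose service for the duration of the block, then start it again.

    `run` is injectable so the helper can be exercised without Docker.
    """
    run("stop", service)
    try:
        yield
    finally:
        # Runs even if the test body raises, so later tests see the mock running.
        run("start", service)
```

A test would then use `with service_stopped("mock-loader"): ...` and place its requests and assertions inside the block.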

Integration Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|-------------------------|--------------|-------------------|----------------|
| AC-1 | mock-loader stopped, fresh engine | POST /detect | 503/422 graceful error | Max 30s |
| AC-2 | Engine warm, mock-annotations stopped | Async detection + SSE | SSE events continue, AIProcessed | Max 120s |
| AC-3 | Engine warm, mock-loader stopped | POST /detect (sync) | 200 OK, detection succeeds | Max 30s |
| AC-4 | Async detection started, then stop mock-annotations | SSE event stream | Events continue, pipeline completes | Max 120s |
| AC-5 | mock-loader: first_fail mode | Two sequential POST /detect | First fails, second succeeds | Max 60s |
| AC-6 | Restart detections container | Health + detect after restart | Clean state, no stale data | Max 60s |

Constraints

  • Fault injection via Docker service stop/start and mock control API
  • Container restart test requires docker compose restart capability
  • Mock services must support configurable failure modes (normal, error, first_fail)
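The three failure modes can be modeled as a small state machine inside each mock. The sketch below is illustrative only: the class names are hypothetical, and the choice of 503 as the failure status is an assumption rather than something this task specifies.

```python
from enum import Enum


class FailureMode(Enum):
    NORMAL = "normal"
    ERROR = "error"
    FIRST_FAIL = "first_fail"


class MockBehavior:
    """Decides the HTTP status a mock returns for each incoming request."""

    def __init__(self, mode=FailureMode.NORMAL):
        self.set_mode(mode)

    def set_mode(self, mode):
        # Accepts either a FailureMode member or its string value.
        self.mode = FailureMode(mode)
        # Changing the mode restarts the request counter used by first_fail.
        self.requests_seen = 0

    def next_status(self):
        self.requests_seen += 1
        if self.mode is FailureMode.ERROR:
            return 503  # assumed failure status
        if self.mode is FailureMode.FIRST_FAIL and self.requests_seen == 1:
            return 503  # only the first request after set_mode fails
        return 200
```

With `first_fail`, the first request after the mode is set fails and every later one succeeds, which is the sequence AC-5 relies on.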

Risks & Mitigation

Risk 1: Container restart timing

  • Risk: Container restart may take variable time, causing flaky tests
  • Mitigation: Use service readiness polling with generous timeout before assertions
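Readiness polling of this kind can be sketched as a loop with a deadline. The helper name `wait_until` and its parameters are hypothetical; `probe` stands for any callable that returns a truthy value once the service is ready, for example a function that requests GET /health and checks the status code.

```python
import time


def wait_until(probe, timeout=60.0, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call `probe` repeatedly until it returns a truthy value or `timeout` elapses.

    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return
        except OSError:
            # Connection errors are expected while the container is still starting.
            pass
        if clock() >= deadline:
            raise TimeoutError(f"service not ready after {timeout} seconds")
        sleep(interval)
```

Using a monotonic clock keeps the deadline immune to wall-clock adjustments, and catching only `OSError` lets genuine bugs in the probe surface instead of being retried silently.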

Risk 2: Mock state leakage between tests

  • Risk: Stopped mock may affect subsequent tests
  • Mitigation: Function-scoped mock reset fixture restores all mocks before each test
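The reset fixture could take roughly the following shape. This is a sketch under assumptions: `start_service` and `set_mode` are hypothetical callables standing in for the Docker and mock control API helpers, and starting an already running service is assumed to be harmless. The generator is written so that decorating it with `@pytest.fixture` (function scope is pytest's default) would run the restore step before every test.

```python
MOCK_SERVICES = ("mock-loader", "mock-annotations")


def restore_mocks(start_service, set_mode, services=MOCK_SERVICES):
    """Bring every mock back up, then switch it to the normal mode."""
    for service in services:
        start_service(service)       # assumed to be a no-op if already running
        set_mode(service, "normal")  # clears any error or first_fail mode


def mock_reset(start_service, set_mode):
    """Generator in pytest fixture shape: restore before the test, nothing after."""
    restore_mocks(start_service, set_mode)
    yield
```

Restoring before each test, rather than only after, means a test that crashed mid-outage cannot poison the next one.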