Refactor task management structure and update documentation

- Changed the directory structure for task specifications to include a dedicated `todo/` folder within `_docs/02_tasks/` for tasks ready for implementation.
- Updated references in various skills and documentation to reflect the new task lifecycle, including changes in the `implementer` and `decompose` skills.
- Enhanced the README and flow documentation to clarify the new task organization and its implications for the implementation process.

These updates improve task management clarity and streamline the implementation workflow.
2026-04-22 22:56:34 +00:00 · 2026-03-28 01:17:45 +02:00
parent 8c665bd0a4
commit cbf370c765
35 changed files with 1348 additions and 58 deletions
@@ -0,0 +1,107 @@
+# Resilience Tests
+
+**Task**: AZ-145_test_resilience
+**Name**: Resilience Tests
+**Description**: Implement E2E tests verifying service resilience during external service outages, transient failures, and container restarts
+**Complexity**: 5 points
+**Dependencies**: AZ-138_test_infrastructure, AZ-142_test_async_sse
+**Component**: Integration Tests
+**Jira**: AZ-145
+**Epic**: AZ-137
+
+## Problem
+
+The detection service must continue operating when external dependencies fail. Tests must verify resilience during loader outages (before and after engine init), annotations service outages, transient loader failures with retry, and service restarts with state loss.
+
+## Outcome
+
+- Detection continues when loader goes down after engine is loaded
+- Async detection completes when annotations service is down
+- Engine initialization retries after transient loader failure
+- Service restart clears all in-memory state cleanly
+- Loader unreachable during initial model download is handled gracefully
+- Annotations failure during async detection does not stop the pipeline
+
+## Scope
+
+### Included
+- FT-N-06: Loader service unreachable during model download
+- FT-N-07: Annotations service unreachable — detection continues
+- NFT-RES-01: Loader service outage after engine initialization
+- NFT-RES-02: Annotations service outage during async detection
+- NFT-RES-03: Engine initialization retry after transient loader failure
+- NFT-RES-04: Service restart with in-memory state loss
+
+### Excluded
+- Input validation errors (covered in negative tests)
+- Performance under fault conditions
+- Network partition simulation beyond service stop/start
+
+## Acceptance Criteria
+
+**AC-1: Loader unreachable during init**
+Given mock-loader is stopped and engine not initialized
+When POST /detect is called
+Then response is 503 or 422 error
+And GET /health reflects engine error state
+
+**AC-2: Annotations unreachable — detection continues**
+Given engine is initialized and mock-annotations is stopped
+When async detection is triggered
+Then SSE events still arrive and final AIProcessed event is received
+
+**AC-3: Loader outage after init**
+Given engine is already initialized (model in memory)
+When mock-loader is stopped and POST /detect is called
+Then detection succeeds (200 OK, engine already loaded)
+And GET /health remains "Enabled"
+
+**AC-4: Annotations outage mid-processing**
+Given async detection is in progress
+When mock-annotations is stopped mid-processing
+Then SSE events continue arriving
+And detection completes with AIProcessed event
+
+**AC-5: Transient loader failure with retry**
+Given mock-loader fails first request then recovers
+When first POST /detect fails and second POST /detect is sent
+Then second detection succeeds (engine initializes on retry)
+
+**AC-6: Service restart state reset**
+Given a detection may have been in progress
+When the detections container is restarted
+Then GET /health returns aiAvailability "None" (fresh start)
+And POST /detect/{media_id} is accepted (no stale _active_detections)
+
+## Non-Functional Requirements
+
+**Reliability**
+- All fault injection tests must restore mock services after test completion
+- Service must not crash or leave zombie processes after any failure scenario
+
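The requirement to restore mock services after every fault injection test could be met with a context manager that always restarts the service, even when the test body raises. The following is a minimal sketch, not the project's actual helper: `service_stopped` and `run_compose` are hypothetical names, and it assumes the mocks run as Docker Compose services named after the mocks in this spec (for example `mock-loader`).

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence

# A runner takes the arguments that follow `docker compose`.
Runner = Callable[[Sequence[str]], None]


def run_compose(args: Sequence[str]) -> None:
    """Run `docker compose <args>` and raise if the command fails."""
    subprocess.run(["docker", "compose", *args], check=True)


@contextmanager
def service_stopped(service: str, run: Runner = run_compose) -> Iterator[None]:
    """Stop a compose service for the duration of the block, then start it again.

    The service name and the injectable `run` callable are assumptions made
    for illustration; they are not defined by the task spec.
    """
    run(["stop", service])
    try:
        yield
    finally:
        # Runs even when the test body raises, so the mock is always restored.
        run(["start", service])
```

A test would wrap its assertions in `with service_stopped("mock-loader"): ...`. Passing a fake `run` callable lets the helper itself be checked without Docker.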
+## Integration Tests
+
+| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
+|--------|------------------------|-------------|-------------------|----------------|
+| AC-1 | mock-loader stopped, fresh engine | POST /detect | 503/422 graceful error | Max 30s |
+| AC-2 | Engine warm, mock-annotations stopped | Async detection + SSE | SSE events continue, AIProcessed | Max 120s |
+| AC-3 | Engine warm, mock-loader stopped | POST /detect (sync) | 200 OK, detection succeeds | Max 30s |
+| AC-4 | Async detection started, then stop mock-annotations | SSE event stream | Events continue, pipeline completes | Max 120s |
+| AC-5 | mock-loader: first_fail mode | Two sequential POST /detect | First fails, second succeeds | Max 60s |
+| AC-6 | Restart detections container | Health + detect after restart | Clean state, no stale data | Max 60s |
+
+## Constraints
+
+- Fault injection via Docker service stop/start and mock control API
+- Container restart test requires docker compose restart capability
+- Mock services must support configurable failure modes (normal, error, first_fail)
+
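The three failure modes named in the constraints (`normal`, `error`, `first_fail`) can be modeled as a small state machine inside each mock. The sketch below is illustrative only: `FailureMode`, `MockBehavior`, and the choice of HTTP 500 for a failed request are assumptions, since the spec does not say how the mocks signal failure.

```python
from enum import Enum


class FailureMode(Enum):
    NORMAL = "normal"
    ERROR = "error"
    FIRST_FAIL = "first_fail"


class MockBehavior:
    """Decides whether each incoming request to a mock succeeds or fails."""

    def __init__(self, mode: FailureMode = FailureMode.NORMAL) -> None:
        self.mode = mode
        self.requests_seen = 0

    def set_mode(self, mode: FailureMode) -> None:
        """Switch mode and restart the request counter."""
        self.mode = mode
        self.requests_seen = 0

    def next_status(self) -> int:
        """Return the HTTP status the mock should answer with for this request."""
        self.requests_seen += 1
        if self.mode is FailureMode.ERROR:
            return 500  # assumed failure status, not taken from the spec
        if self.mode is FailureMode.FIRST_FAIL and self.requests_seen == 1:
            return 500  # only the first request after set_mode fails
        return 200
```

Resetting the counter in `set_mode` means `first_fail` can be armed again for each test, which is the behavior AC-5 relies on: the first request fails and the retry succeeds.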
+## Risks & Mitigation
+
+**Risk 1: Container restart timing**
+- *Risk*: Container restart may take variable time, causing flaky tests
+- *Mitigation*: Use service readiness polling with generous timeout before assertions
+
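The readiness polling named in the mitigation for Risk 1 could look roughly like this. It is a sketch under stated assumptions: `wait_until_ready` is a hypothetical helper, the 60 second default is a placeholder loosely based on the AC-6 time limit, and `probe` stands for whatever check the suite uses (for example a GET /health call that returns True once the service responds).

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True or `timeout_s` elapses.

    Returns True when the probe succeeded, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or reset while the container is still starting.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Using a monotonic clock keeps the deadline stable if the wall clock changes, and the injectable `clock` and `sleep` arguments let the helper be unit tested without real waiting.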
+**Risk 2: Mock state leakage between tests**
+- *Risk*: A mock left stopped by one test may cause failures in subsequent tests
+- *Mitigation*: Function-scoped mock reset fixture restores all mocks before each test