# Resilience Tests

### NFT-RES-01: RabbitMQ broker outage during create

**Summary**: `POST /annotations` succeeds (HTTP 200) when the RabbitMQ broker is unreachable; the outbox row is preserved; `FailsafeProducer` does not crash; on broker recovery the message is delivered.

**Traces to**: AC-F-12, OP-02 (single-instance baseline)

**Preconditions**: SUT healthy; broker initially reachable; clean outbox.

**Fault injection**:

- `docker exec rabbitmq rabbitmqctl stop_app` mid-test (stops AMQP/streams listeners; container stays up).

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Stop RabbitMQ app | broker unreachable on 5552 |
| 2 | `POST /annotations` once | HTTP 200; outbox row inserted |
| 3 | Out-of-band: `SELECT COUNT(*) FROM annotations_queue_records WHERE annotation_id = ` | `count == 1` (row not deleted because drain failed) |
| 4 | `GET /health` | HTTP 200 (SUT not crashed) |
| 5 | `docker exec rabbitmq rabbitmqctl start_app` | broker recovers |
| 6 | Wait `drain_interval × 3` | drainer publishes the queued message |
| 7 | Out-of-band: `SELECT COUNT(*) FROM annotations_queue_records WHERE annotation_id = ` | `count == 0` (drained) |
| 8 | Stream consumer (started before step 5 at offset `next`) reads one message | message body matches the documented schema |

**Pass criteria**: zero 5xx during outage; outbox preserves the row; recovery delivers the deferred message; total recovery time ≤ 60s after broker comes back.

**Duration**: ~2 minutes.

---

### NFT-RES-02: Postgres restart between writes

**Summary**: Killing and restarting Postgres during a quiet period does not corrupt state; subsequent writes succeed.

**Traces to**: AC-N-02 (idempotent migrator), implicit data-integrity NFR

**Fault injection**: `docker compose restart postgres` while no in-flight requests.

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `POST /annotations` once (FT-P-01-shape) | HTTP 200; row in DB |
| 2 | `docker compose restart postgres` | DB up after ~5s |
| 3 | Wait for SUT `/health` to return 200 | SUT recovers connection pool (or restarts itself) |
| 4 | `POST /annotations` again | HTTP 200; row in DB |
| 5 | `GET /annotations/` | HTTP 200; original row intact |

**Pass criteria**: original row intact after restart; new write succeeds within 30s of DB recovery; zero data loss.

**Duration**: ~2 minutes.

---

### NFT-RES-03: Postgres unreachable during create

**Summary**: When DB is unreachable mid-request, the SUT returns a structured error envelope (no 500 with stack trace); the SUT recovers when DB returns.

**Traces to**: AC-N-04 (zero unhandled exceptions to clients)

**Fault injection**: `docker pause postgres` between request start and request end (race-y; use a delay-injecting test proxy if needed).

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `docker pause postgres` | DB connections hang |
| 2 | `POST /annotations` once with timeout 30s | HTTP 5xx (e.g. 503); **error envelope present**; **no raw exception text in body** |
| 3 | `docker unpause postgres` | DB responsive |
| 4 | `POST /annotations` again | HTTP 200; SUT recovered |

**Pass criteria**: under-DB-outage response uses the error envelope; SUT recovers within 30s of DB recovery.

**Duration**: ~2 minutes.

---

### NFT-RES-04: SSE subscriber disconnect mid-stream

**Summary**: A subscriber that disconnects mid-stream does not crash the SUT or block other subscribers.

**Traces to**: AC-F-10, OP-01 (per-instance SSE state)

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Open 3 SSE connections to `/annotations/events?missionId=` | all 3 alive |
| 2 | Abruptly close subscriber #2 (TCP RST) | SUT cleans up its channel slot |
| 3 | `POST /annotations` for mission `` | HTTP 200 |
| 4 | Subscribers #1 and #3 each receive the event | both receive within 1000ms |
| 5 | `GET /health` | HTTP 200 |

**Pass criteria**: surviving subscribers still receive events; no SUT memory growth visible (channel slots reclaimed); `/health` stays green.

**Duration**: ~1 minute.

---

### NFT-RES-05: Repeated FailsafeProducer empty-catch path

**Summary**: When the image referenced by an outbox row no longer exists on disk, the drainer logs and proceeds (post RB-05). Tests today's behavior (empty catch) AND, after RB-05 lands, asserts the logged failure path.

**Traces to**: RB-05

**Fault injection**: insert an outbox row whose `annotation_id` references a missing image (manually delete the file after `POST /annotations` returned 200, before the drain interval fires).

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `POST /annotations` (FT-P-01) | HTTP 200; outbox row + image file present |
| 2 | Delete `/.jpg` | image gone |
| 3 | Wait `drain_interval × 2` | drainer runs |
| 4 | Out-of-band: `SELECT COUNT(*) FROM annotations_queue_records WHERE annotation_id = ` | today's behavior: row may be deleted or stuck (empty catch swallows IOException); **document actual behavior here** |
| 5 `[after RB-05]` | Inspect SUT logs for an `ERROR` entry mentioning the missing image | one log entry present; metric counter `failsafe_drain_errors` incremented |

**Pass criteria today**: SUT does not crash; `/health` stays 200.

**Pass criteria after RB-05**: as above + the logged failure path is exercised.

**Duration**: ~1 minute.
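
Steps of the form "Wait `drain_interval × N`" followed by an out-of-band check can be written as a bounded poll instead of a fixed sleep, so the test passes as soon as the condition holds and fails cleanly on timeout. The following is a minimal sketch, not the project's actual harness: `wait_until` and its parameters are hypothetical names, and `count_outbox_rows` and `drain_interval_s` in the usage comment are assumed helpers that do not appear in this document.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Returns True if the predicate held in time, False on timeout.
    `clock` and `sleep` are injectable so the helper can be unit-tested
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical usage for a drain check (count_outbox_rows is an assumed
# helper that runs the SELECT COUNT(*) query from the steps table):
#
#   drained = wait_until(
#       lambda: count_outbox_rows(annotation_id) == 0,
#       timeout_s=3 * drain_interval_s,
#   )
#   assert drained, "outbox row was not drained in time"
```

Using a monotonic clock keeps the deadline unaffected by wall-clock adjustments on the test host during the run.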

---

### NFT-RES-06: Stream consumer reconnect

**Summary**: A stream consumer that drops and reconnects with offset `last_committed` reads only post-disconnect messages.

**Traces to**: implicit (consumer-side concern, but documents the contract Annotations producer expects)

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Start consumer at offset `next`; record current end-of-stream offset `O0` | consumer up |
| 2 | `POST /annotations` 5 times | 5 outbox rows; 5 stream messages produced shortly after |
| 3 | Consumer reads all 5; commits offset after each | consumer offset = `O0 + 5` |
| 4 | Disconnect consumer | done |
| 5 | `POST /annotations` 3 more times | 3 more stream messages |
| 6 | Reconnect consumer at `last_committed = O0 + 5` | consumer reads only messages 6..8 |

**Pass criteria**: re-attached consumer sees no duplicates and no gaps.

**Duration**: ~1 minute.
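
The "no duplicates and no gaps" criterion can be checked mechanically from the offsets the re-attached consumer observes. The sketch below is illustrative only: `verify_resumed_offsets` is a hypothetical helper, and it assumes that offsets are contiguous integers and that `O0 + 5` denotes the first offset the consumer should read after reconnecting. Adjust the arithmetic if the stream client treats the committed offset as the last processed message instead.

```python
from typing import List, Sequence


def verify_resumed_offsets(
    observed: Sequence[int],
    resume_offset: int,
    expected_count: int,
) -> List[str]:
    """Return a list of problems found; an empty list means the check passed.

    observed       -- offsets read by the consumer after reconnecting
    resume_offset  -- first offset expected after reconnect (assumed O0 + 5)
    expected_count -- number of messages produced while disconnected
    """
    problems: List[str] = []

    # Duplicates: the same offset was delivered more than once.
    if len(set(observed)) != len(observed):
        problems.append("duplicate offsets observed")

    # Stale deliveries: messages from before the resume point were re-read.
    stale = sorted({o for o in observed if o < resume_offset})
    if stale:
        problems.append(f"offsets before resume point re-delivered: {stale}")

    # Gaps: an expected post-disconnect offset never arrived.
    expected = range(resume_offset, resume_offset + expected_count)
    missing = sorted(set(expected) - set(observed))
    if missing:
        problems.append(f"missing offsets: {missing}")

    return problems
```

With `O0 = 100`, a passing run of step 6 would correspond to `verify_resumed_offsets([105, 106, 107], resume_offset=105, expected_count=3)` returning an empty list.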