annotations/_docs/02_document/tests/resilience-tests.md
Oleksandr Bezdieniezhnykh 03f879206e docs+src: complete Steps 1-3 outcomes + auth re-sync baseline
This commit captures everything produced during autodev existing-code
Steps 1 (Document), 2 (Architecture Baseline Scan), and 3 (Test Spec),
together with the targeted auth + CORS re-sync triggered on 2026-05-14
when codebase drift was detected at Step 4 entry. None of this work was
previously committed.

Step 1 (Document) — 50+ _docs/02_document/ files: problem, solution,
architecture, system flows, glossary, module-layout, per-component
specs (01..06), modules, deployment, diagrams, data model, FINAL
report, verification log, discovery.

Step 2 (Architecture Baseline) — architecture_compliance_baseline.md.
Verdict PASS_WITH_WARNINGS (0 Critical, 0 High, 1 Medium, 2 Low). No
High/Critical findings; auto-chained to Step 3 per existing-code flow.

Step 3 (Test Spec) — _docs/02_document/tests/* (67 scenarios across
blackbox, security, resilience, resource-limit, performance), plus
e2e/docker-compose.test.yml, e2e/seed/run.sh, scripts/run-tests.sh,
scripts/run-performance-tests.sh. Coverage 88% over the active scope
(40 of 45 items covered, 6 RB-deferred, 5 documented-as-uncovered).

Targeted auth + CORS re-sync — replaces the deleted in-house token
issuer with a JWKS-verifier model. AuthController and TokenService
removed; JwtExtensions switched from HS256 symmetric to ES256 over
admin's JWKS. ConfigurationResolver and CorsConfigurationValidator
added under src/Infrastructure/. ADR-002 and ADR-006 retired; SEC-01,
SEC-02, SEC-03 marked Closed. One new testability risk recorded in
architecture.md Open Risks Section 6 (JWKS HTTPS gating).

Source changes:
- src/Auth/JwtExtensions.cs (modified) — ES256, JWKS, alg pinning
- src/Program.cs (modified) — DI wiring for ConfigurationResolver
  and CorsConfigurationValidator
- src/Controllers/AuthController.cs (deleted) — no in-service issuance
- src/Services/TokenService.cs (deleted) — same
- src/Infrastructure/ConfigurationResolver.cs (new)
- src/Infrastructure/CorsConfigurationValidator.cs (new)
- .env.example (new) — required env var documentation
- .gitignore (updated)

Cross-repo coordination: _docs/cross-repo/flights_h1_h2_h3_change_spec
captures the change-spec for downstream services that consumed the now
deleted /auth endpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-14 20:19:05 +03:00

# Resilience Tests
### NFT-RES-01: RabbitMQ broker outage during create
**Summary**: `POST /annotations` succeeds (HTTP 200) when the RabbitMQ broker is unreachable; the outbox row is preserved; `FailsafeProducer` does not crash; on broker recovery the message is delivered.
**Traces to**: AC-F-12, OP-02 (single-instance baseline)
**Preconditions**: SUT healthy; broker initially reachable; clean outbox.
**Fault injection**:
- `docker exec rabbitmq rabbitmqctl stop_app` mid-test (stops AMQP/streams listeners; container stays up).
**Steps**:
| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Stop RabbitMQ app (`rabbitmqctl stop_app`) | broker unreachable on the stream port (5552) |
| 2 | `POST /annotations` once | HTTP 200; outbox row inserted |
| 3 | Out-of-band: `SELECT COUNT(*) FROM annotations_queue_records WHERE annotation_id = <id>` | `count == 1` (row not deleted because drain failed) |
| 4 | `GET /health` | HTTP 200 (SUT not crashed) |
| 5 | `docker exec rabbitmq rabbitmqctl start_app` | broker recovers |
| 6 | Wait `drain_interval × 3` | drainer publishes the queued message |
| 7 | Out-of-band: `SELECT COUNT(*) FROM annotations_queue_records WHERE annotation_id = <id>` | `count == 0` (drained) |
| 8 | Stream consumer (started before step 5 at offset `next`) reads one message | message body matches the documented schema |
**Pass criteria**: zero 5xx during outage; outbox preserves the row; recovery delivers the deferred message; total recovery time ≤ 60s after broker comes back.
**Duration**: ~2 minutes.
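
The outbox behavior that steps 2 to 7 assert can be modeled in a few lines. This is a minimal in-memory sketch, not the SUT's `FailsafeProducer`: `FakeBroker`, `Outbox`, and `drain_once` are hypothetical names, and the rule "delete a row only after a successful publish" is an assumption inferred from the expected counts in steps 3 and 7.

```python
from dataclasses import dataclass, field


class BrokerDown(Exception):
    """Raised by the fake broker while it is 'stopped'."""


@dataclass
class FakeBroker:
    up: bool = True
    delivered: list = field(default_factory=list)

    def publish(self, message: str) -> None:
        if not self.up:
            raise BrokerDown("broker unreachable")
        self.delivered.append(message)


@dataclass
class Outbox:
    # Stands in for the annotations_queue_records table.
    rows: list = field(default_factory=list)

    def drain_once(self, broker: FakeBroker) -> int:
        """Publish pending rows; remove a row only after its publish succeeded."""
        sent = 0
        for row in list(self.rows):
            try:
                broker.publish(row)
            except BrokerDown:
                break  # keep the row and retry on the next drain tick
            self.rows.remove(row)
            sent += 1
        return sent


broker = FakeBroker(up=False)            # step 1: broker stopped
outbox = Outbox()
outbox.rows.append("annotation-1")       # step 2: create succeeds, row inserted
outbox.drain_once(broker)                # drain attempt fails, nothing is lost
rows_during_outage = len(outbox.rows)    # step 3: one row still queued
broker.up = True                         # step 5: broker recovers
outbox.drain_once(broker)                # step 6: next drain tick
rows_after_recovery = len(outbox.rows)   # step 7: queue is empty
```

In this model the drain failure never reaches the caller of the create path, which mirrors the "zero 5xx during outage" pass criterion.
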

---

### NFT-RES-02: Postgres restart between writes
**Summary**: Restarting Postgres during a quiet period does not corrupt state; subsequent writes succeed.
**Traces to**: AC-N-02 (idempotent migrator), implicit data-integrity NFR
**Fault injection**: `docker compose restart postgres` while no in-flight requests.
**Steps**:
| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `POST /annotations` once (FT-P-01-shape) | HTTP 200; row in DB |
| 2 | `docker compose restart postgres` | DB up after ~5s |
| 3 | Wait for SUT `/health` to return 200 | SUT recovers connection pool (or restarts itself) |
| 4 | `POST /annotations` again | HTTP 200; row in DB |
| 5 | `GET /annotations/<id from step 1>` | HTTP 200; original row intact |
**Pass criteria**: original row intact after restart; new write succeeds within 30s of DB recovery; zero data loss.
**Duration**: ~2 minutes.

---

### NFT-RES-03: Postgres unreachable during create
**Summary**: When DB is unreachable mid-request, the SUT returns a structured error envelope (no 500 with stack trace); the SUT recovers when DB returns.
**Traces to**: AC-N-04 (zero unhandled exceptions to clients)
**Fault injection**: `docker pause postgres` between request start and request end. The timing is racy; use a delay-injecting test proxy if the pause cannot be landed mid-request reliably.
**Steps**:
| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `docker pause postgres` | DB connections hang |
| 2 | `POST /annotations` once with timeout 30s | HTTP 5xx (for example 503); **error envelope present**; **no raw exception text in body** |
| 3 | `docker unpause postgres` | DB responsive |
| 4 | `POST /annotations` again | HTTP 200; SUT recovered |
**Pass criteria**: the response returned during the DB outage uses the error envelope and contains no raw exception text; SUT recovers within 30s of DB recovery.
**Duration**: ~2 minutes.
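
The step 2 assertion can be automated with a small checker. The sketch below is heuristic and rests on assumptions: this file does not define the envelope's fields, so the check only requires a JSON object, and the leak patterns (a .NET-style stack frame line, any identifier ending in `Exception`) are illustrative guesses. `check_outage_response` is a hypothetical name.

```python
import json
import re
from typing import List

# Heuristic markers of leaked exception text (assumed, not from the spec).
_LEAK_PATTERNS = [
    re.compile(r"^\s+at\s+[\w.<>]+\(", re.MULTILINE),  # stack frame line
    re.compile(r"\b\w+Exception\b"),                    # exception type name
]


def check_outage_response(status: int, body: str) -> List[str]:
    """Return a list of problems; an empty list means the response passes."""
    problems = []
    if not 500 <= status <= 599:
        problems.append(f"expected a 5xx status, got {status}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = None
    if not isinstance(payload, dict):
        problems.append("body is not a JSON object envelope")
    if any(pattern.search(body) for pattern in _LEAK_PATTERNS):
        problems.append("body contains raw exception text")
    return problems


ok = check_outage_response(503, '{"error": "database unavailable"}')
leaky = check_outage_response(
    500, "System.InvalidOperationException: boom\n   at Foo.Bar()"
)
```

Once the real envelope schema is known, the JSON-object check should be tightened to the documented fields.
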

---

### NFT-RES-04: SSE subscriber disconnect mid-stream
**Summary**: A subscriber that disconnects mid-stream does not crash the SUT or block other subscribers.
**Traces to**: AC-F-10, OP-01 (per-instance SSE state)
**Steps**:
| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Open 3 SSE connections to `/annotations/events?missionId=<m>` | all 3 alive |
| 2 | Abruptly close subscriber #2 (TCP RST) | SUT cleans up its channel slot |
| 3 | `POST /annotations` for mission `<m>` | HTTP 200 |
| 4 | Subscribers #1 and #3 each receive the event | both receive within 1000ms |
| 5 | `GET /health` | HTTP 200 |
**Pass criteria**: surviving subscribers still receive events; no SUT memory growth visible (channel slots reclaimed); `/health` stays green.
**Duration**: ~1 minute.
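
The invariant behind steps 2 to 4 is that a dropped subscriber's slot is reclaimed while the remaining subscribers keep receiving events. The toy registry below illustrates that invariant only; `SseHub` and its methods are hypothetical and say nothing about how the SUT implements its per-instance SSE state.

```python
from collections import defaultdict


class SseHub:
    """Toy per-mission subscriber registry."""

    def __init__(self):
        # mission_id -> {subscriber_id: list of received events}
        self._subs = defaultdict(dict)

    def subscribe(self, mission_id, sub_id):
        self._subs[mission_id][sub_id] = []

    def disconnect(self, mission_id, sub_id):
        # Reclaim the slot so a dead subscriber cannot accumulate events.
        self._subs[mission_id].pop(sub_id, None)

    def publish(self, mission_id, event):
        for inbox in self._subs[mission_id].values():
            inbox.append(event)

    def slot_count(self, mission_id):
        return len(self._subs[mission_id])

    def inbox(self, mission_id, sub_id):
        return self._subs[mission_id].get(sub_id)


hub = SseHub()
for sub_id in (1, 2, 3):                 # step 1: three subscribers
    hub.subscribe("m", sub_id)
hub.disconnect("m", 2)                   # step 2: subscriber #2 drops
hub.publish("m", "annotation-created")   # step 3: a create fans out
```

After the run, subscribers #1 and #3 each hold one event and only two slots remain, which is the state the pass criteria describe.
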

---

### NFT-RES-05: Repeated FailsafeProducer empty-catch path
**Summary**: When the image referenced by an outbox row no longer exists on disk, the drainer should log the failure and proceed (the behavior expected after RB-05). This test documents today's behavior (an empty catch) and, once RB-05 lands, asserts the logged failure path.
**Traces to**: RB-05
**Fault injection**: produce an outbox row whose `annotation_id` references a missing image by deleting the image file after `POST /annotations` has returned 200 and before the drain interval fires.
**Steps**:
| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `POST /annotations` (FT-P-01) | HTTP 200; outbox row + image file present |
| 2 | Delete `<images_dir>/<id>.jpg` | image gone |
| 3 | Wait `drain_interval × 2` | drainer runs |
| 4 | Out-of-band: `SELECT COUNT(*) FROM annotations_queue_records WHERE annotation_id = <id>` | today's behavior: row may be deleted or stuck (empty catch swallows IOException) — **document actual behavior here** |
| 5 `[after RB-05]` | Inspect SUT logs for an `ERROR` entry mentioning the missing image | one log entry present; metric counter `failsafe_drain_errors` incremented |
**Pass criteria today**: SUT does not crash; `/health` stays 200.
**Pass criteria after RB-05**: as above + the logged failure path is exercised.
**Duration**: ~1 minute.
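
The difference between the empty catch and the logged failure path that step 5 expects can be sketched as follows. This is an illustration in Python, not the SUT's code: `load_image_for_row` and the `METRICS` dictionary are hypothetical, with the dictionary standing in for the `failsafe_drain_errors` counter named in step 5.

```python
import logging
import os
import tempfile

log = logging.getLogger("failsafe_drain_sketch")
METRICS = {"failsafe_drain_errors": 0}  # stand-in for a real metrics counter


def load_image_for_row(images_dir: str, annotation_id: str):
    """Read the image for an outbox row; log and count the failure if missing."""
    path = os.path.join(images_dir, f"{annotation_id}.jpg")
    try:
        with open(path, "rb") as handle:
            return handle.read()
    except OSError as exc:
        # An empty catch would swallow this silently; here it is recorded.
        METRICS["failsafe_drain_errors"] += 1
        log.error("drain failed for annotation %s: missing image %s (%s)",
                  annotation_id, path, exc)
        return None


with tempfile.TemporaryDirectory() as images_dir:
    image_path = os.path.join(images_dir, "a1.jpg")
    with open(image_path, "wb") as handle:
        handle.write(b"jpeg-bytes")                      # step 1: image present
    before_delete = load_image_for_row(images_dir, "a1")
    os.remove(image_path)                                # step 2: image gone
    after_delete = load_image_for_row(images_dir, "a1")  # step 3: drainer runs
```

The function returns `None` instead of raising, so a drain loop built on it would continue with the next row, matching "logs and proceeds".
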

---

### NFT-RES-06: Stream consumer reconnect
**Summary**: A stream consumer that drops and reconnects with offset `last_committed` reads only post-disconnect messages.
**Traces to**: implicit (a consumer-side concern, but it documents the contract the Annotations producer expects)
**Steps**:
| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Start consumer at offset `next`; record current end-of-stream offset `O0` | consumer up |
| 2 | `POST /annotations` 5 times | 5 outbox rows; 5 stream messages produced shortly after |
| 3 | Consumer reads all 5; commits offset after each | consumer offset = `O0 + 5` |
| 4 | Disconnect consumer | consumer disconnected; committed offset retained |
| 5 | `POST /annotations` 3 more times | 3 more stream messages |
| 6 | Reconnect consumer at `last_committed = O0 + 5` | consumer reads only messages 6..8 |
**Pass criteria**: re-attached consumer sees no duplicates and no gaps.
**Duration**: ~1 minute.
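
The offset arithmetic in steps 1 to 6 can be checked with a list standing in for the stream. The sketch assumes the committed value names the next offset to read, which is how `O0 + 5` in steps 3 and 6 is interpreted here. If the client library instead stores the offset of the last processed message, the consumer would resume one past the stored value.

```python
stream = ["old-1", "old-2"]                   # messages that predate the test
o0 = len(stream)                              # step 1: end-of-stream offset O0

stream += [f"msg-{i}" for i in range(1, 6)]   # step 2: five new messages

committed = o0
first_batch = []
for offset in range(o0, len(stream)):         # step 3: read and commit each
    first_batch.append(stream[offset])
    committed = offset + 1                    # next offset to read

# step 4: consumer disconnects; `committed` is all that survives

stream += [f"msg-{i}" for i in range(6, 9)]   # step 5: three more messages

second_batch = stream[committed:]             # step 6: resume at committed offset
```

Concatenating `first_batch` and `second_batch` yields messages 1 through 8 exactly once, which is the "no duplicates and no gaps" criterion.
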