This commit captures everything produced during autodev existing-code Steps 1 (Document), 2 (Architecture Baseline Scan), and 3 (Test Spec), together with the targeted auth + CORS re-sync triggered on 2026-05-14 when codebase drift was detected at Step 4 entry. None of this work was previously committed.

Step 1 (Document): 50+ files under _docs/02_document/ (problem, solution, architecture, system flows, glossary, module-layout, per-component specs (01..06), modules, deployment, diagrams, data model, FINAL report, verification log, discovery).

Step 2 (Architecture Baseline): architecture_compliance_baseline.md. Verdict PASS_WITH_WARNINGS (0 Critical, 0 High, 1 Medium, 2 Low). No High/Critical findings; auto-chained to Step 3 per existing-code flow.

Step 3 (Test Spec): _docs/02_document/tests/* (67 scenarios across blackbox, security, resilience, resource-limit, performance), plus e2e/docker-compose.test.yml, e2e/seed/run.sh, scripts/run-tests.sh, scripts/run-performance-tests.sh. Coverage 88% over the active scope (40 of 45 items covered, 6 RB-deferred, 5 documented-as-uncovered).

Targeted auth + CORS re-sync: replaces the deleted in-house token issuer with a JWKS-verifier model. AuthController and TokenService removed; JwtExtensions switched from HS256 symmetric to ES256 over admin's JWKS. ConfigurationResolver and CorsConfigurationValidator added under src/Infrastructure/. ADR-002 and ADR-006 retired; SEC-01, SEC-02, SEC-03 marked Closed. One new testability risk recorded in architecture.md Open Risks Section 6 (JWKS HTTPS gating).
Source changes:
- src/Auth/JwtExtensions.cs (modified): ES256, JWKS, alg pinning
- src/Program.cs (modified): DI wiring for ConfigurationResolver and CorsConfigurationValidator
- src/Controllers/AuthController.cs (deleted): no in-service issuance
- src/Services/TokenService.cs (deleted): same
- src/Infrastructure/ConfigurationResolver.cs (new)
- src/Infrastructure/CorsConfigurationValidator.cs (new)
- .env.example (new): required env var documentation
- .gitignore (updated)

Cross-repo coordination: _docs/cross-repo/flights_h1_h2_h3_change_spec captures the change-spec for downstream services that consumed the now-deleted /auth endpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>
Resilience Tests
NFT-RES-01: RabbitMQ broker outage during create
Summary: POST /annotations succeeds (HTTP 200) when the RabbitMQ broker is unreachable; the outbox row is preserved; FailsafeProducer does not crash; on broker recovery the message is delivered.
Traces to: AC-F-12, OP-02 (single-instance baseline)
Preconditions: SUT healthy; broker initially reachable; clean outbox.
Fault injection:
docker exec rabbitmq rabbitmqctl stop_app mid-test (stops AMQP/streams listeners; container stays up).
Steps:
| Step | Action | Expected Behavior |
|---|---|---|
| 1 | Stop RabbitMQ app | broker unreachable on 5552 |
| 2 | POST /annotations once | HTTP 200; outbox row inserted |
| 3 | Out-of-band: SELECT COUNT(*) FROM annotations_queue_records WHERE annotation_id = <id> | count == 1 (row not deleted because drain failed) |
| 4 | GET /health | HTTP 200 (SUT not crashed) |
| 5 | docker exec rabbitmq rabbitmqctl start_app | broker recovers |
| 6 | Wait drain_interval × 3 | drainer publishes the queued message |
| 7 | Out-of-band: SELECT COUNT(*) FROM annotations_queue_records WHERE annotation_id = <id> | count == 0 (drained) |
| 8 | Stream consumer (started before step 5 at offset next) reads one message | message body matches the documented schema |
Pass criteria: zero 5xx during outage; outbox preserves the row; recovery delivers the deferred message; total recovery time ≤ 60s after broker comes back.
Duration: ~2 minutes.
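The recovery bound in the pass criteria (outbox drained within 60s of the broker returning) can be checked with a deadline-bound polling loop. The following is a minimal Python sketch, not the project's harness: `wait_until` and the `outbox_count` helper mentioned in the usage note are hypothetical names, and the injectable `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `predicate` until it returns True and return the elapsed seconds.

    Raises TimeoutError if the predicate is still False once `timeout_s`
    seconds have elapsed since the first check.
    """
    start = clock()
    while True:
        if predicate():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        sleep(interval_s)
```

For steps 6 and 7, a harness might call `wait_until(lambda: outbox_count(annotation_id) == 0, timeout_s=60)`, where `outbox_count` is an assumed helper that runs the out-of-band `SELECT COUNT(*)` query, and then record the returned elapsed time as the measured recovery time.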
NFT-RES-02: Postgres restart between writes
Summary: Killing and restarting Postgres during a quiet period does not corrupt state; subsequent writes succeed.
Traces to: AC-N-02 (idempotent migrator), implicit data-integrity NFR
Fault injection: docker compose restart postgres while no requests are in flight.
Steps:
| Step | Action | Expected Behavior |
|---|---|---|
| 1 | POST /annotations once (FT-P-01-shape) | HTTP 200; row in DB |
| 2 | docker compose restart postgres | DB up after ~5s |
| 3 | Wait for SUT /health to return 200 | SUT recovers connection pool (or restarts itself) |
| 4 | POST /annotations again | HTTP 200; row in DB |
| 5 | GET /annotations/<id from step 1> | HTTP 200; original row intact |
Pass criteria: original row intact after restart; new write succeeds within 30s of DB recovery; zero data loss.
Duration: ~2 minutes.
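Step 3 waits for /health to return 200 while the SUT may be refusing connections or restarting, so the probe has to treat transport errors as "not ready yet" rather than as test failures. A minimal Python sketch under that assumption follows; `health_ok` is a hypothetical helper, and the `opener` parameter is injectable purely for testing.

```python
import urllib.error
import urllib.request
from typing import Callable


def health_ok(
    url: str,
    opener: Callable = urllib.request.urlopen,
    timeout_s: float = 2.0,
) -> bool:
    """Return True only when GET `url` answers with HTTP 200.

    Connection refusals, timeouts, and non-2xx answers (which urlopen
    raises as HTTPError) all count as "not healthy yet".
    """
    try:
        with opener(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

The function makes a single attempt and never raises on network errors, which keeps it usable as the condition inside whatever retry loop the harness already has.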
NFT-RES-03: Postgres unreachable during create
Summary: When the DB is unreachable mid-request, the SUT returns a structured error envelope (no 500 with stack trace); the SUT recovers when the DB returns.
Traces to: AC-N-04 (zero unhandled exceptions to clients)
Fault injection: docker pause postgres between request start and request end (racy; use a delay-injecting test proxy if needed).
Steps:
| Step | Action | Expected Behavior |
|---|---|---|
| 1 | docker pause postgres | DB connections hang |
| 2 | POST /annotations once with timeout 30s | HTTP 5xx (for example 500 or 503); error envelope present; no raw exception text in body |
| 3 | docker unpause postgres | DB responsive |
| 4 | POST /annotations again | HTTP 200; SUT recovered |
Pass criteria: under-DB-outage response uses the error envelope; SUT recovers within 30s of DB recovery.
Duration: ~2 minutes.
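The step 2 expectation (5xx status, error envelope present, no raw exception text) can be expressed as one response check. The Python sketch below is illustrative only: the envelope field names `code` and `message` are assumptions, since this spec does not define the envelope schema, and the leak patterns are heuristics for .NET-style exception output rather than an exhaustive detector.

```python
import json
import re

# Heuristic markers of leaked .NET exception text (assumed, not exhaustive).
_LEAK_PATTERNS = [
    re.compile(r"\bat\s+[\w.]+\.\w+\("),          # stack frame: "at Ns.Type.Method("
    re.compile(r"\b[A-Za-z_][\w.]*Exception\b"),  # type name: "NpgsqlException"
]


def check_error_response(status, body, required_fields=("code", "message")):
    """Return a list of problems found in an error response; empty means OK."""
    problems = []
    if not 500 <= status <= 599:
        problems.append(f"expected 5xx, got {status}")

    try:
        payload = json.loads(body)
    except ValueError:
        payload = None

    if not isinstance(payload, dict):
        problems.append("body is not a JSON object")
    else:
        for field in required_fields:
            if field not in payload:
                problems.append(f"envelope field missing: {field}")

    if any(p.search(body) for p in _LEAK_PATTERNS):
        problems.append("body appears to contain raw exception text")
    return problems
```

Returning a list of problems instead of raising on the first one lets a single failed run report every violated expectation at once; `required_fields` should be replaced with the fields of the documented envelope.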
NFT-RES-04: SSE subscriber disconnect mid-stream
Summary: A subscriber that disconnects mid-stream does not crash the SUT or block other subscribers.
Traces to: AC-F-10, OP-01 (per-instance SSE state)
Steps:
| Step | Action | Expected Behavior |
|---|---|---|
| 1 | Open 3 SSE connections to /annotations/events?missionId=<m> | all 3 alive |
| 2 | Abruptly close subscriber #2 (TCP RST) | SUT cleans up its channel slot |
| 3 | POST /annotations for mission <m> | HTTP 200 |
| 4 | Subscribers #1 and #3 each receive the event | both receive within 1000ms |
| 5 | GET /health | HTTP 200 |
Pass criteria: surviving subscribers still receive events; no SUT memory growth visible (channel slots reclaimed); /health stays green.
Duration: ~1 minute.
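The step 4 assertion (each surviving subscriber receives the event within 1000ms of the POST) reduces to comparing timestamps. A small Python sketch of that comparison follows; `late_or_missing` and its argument shapes are hypothetical, and it assumes the harness records the POST time and each subscriber's receive time from the same clock.

```python
def late_or_missing(post_time_s, received_at_s, survivors, limit_ms=1000.0):
    """Return {subscriber_id: reason} for survivors that failed the check.

    post_time_s    -- time the POST /annotations request was sent (seconds)
    received_at_s  -- {subscriber_id: time the event arrived (seconds)}
    survivors      -- ids of subscribers that are expected to get the event
    """
    failures = {}
    for sub in survivors:
        arrived = received_at_s.get(sub)
        if arrived is None:
            failures[sub] = "no event received"
            continue
        latency_ms = (arrived - post_time_s) * 1000.0
        if latency_ms > limit_ms:
            failures[sub] = f"received after {latency_ms:.0f} ms"
    return failures
```

An empty result for `survivors=[1, 3]` corresponds to the expected behavior in step 4; subscriber #2 is deliberately left out because it was closed in step 2.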
NFT-RES-05: Repeated FailsafeProducer empty-catch path
Summary: When the image referenced by an outbox row no longer exists on disk, the drainer logs and proceeds (post RB-05). Tests today's behavior (empty catch) AND, after RB-05 lands, asserts the logged failure path.
Traces to: RB-05
Fault injection: insert an outbox row whose annotation_id references a missing image (manually delete the file after POST /annotations returned 200, before the drain interval fires).
Steps:
| Step | Action | Expected Behavior |
|---|---|---|
| 1 | POST /annotations (FT-P-01) | HTTP 200; outbox row + image file present |
| 2 | Delete <images_dir>/<id>.jpg | image gone |
| 3 | Wait drain_interval × 2 | drainer runs |
| 4 | Out-of-band: SELECT COUNT(*) FROM annotations_queue_records WHERE annotation_id = <id> | today's behavior: row may be deleted or stuck (empty catch swallows IOException); document actual behavior here |
| 5 [after RB-05] | Inspect SUT logs for an ERROR entry mentioning the missing image | one log entry present; metric counter failsafe_drain_errors incremented |
Pass criteria today: SUT does not crash; /health stays 200.
Pass criteria after RB-05: as above + the logged failure path is exercised.
Duration: ~1 minute.
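The post-RB-05 check in step 5 has two parts: one ERROR log entry for the missing image, and an increment of failsafe_drain_errors. The Python sketch below shows one way to automate both under stated assumptions: log lines are assumed to contain the literal level `ERROR` and the annotation id, and the counter is assumed to be exposed as an unlabelled line in Prometheus-style text format. Neither format is specified in this document, so both parsers are hypothetical and would need adjusting to the real output.

```python
import re


def error_entries(log_text, annotation_id):
    """Return log lines at ERROR level that mention `annotation_id`."""
    return [
        line
        for line in log_text.splitlines()
        if re.search(r"\bERROR\b", line) and annotation_id in line
    ]


def counter_value(metrics_text, name="failsafe_drain_errors"):
    """Read an unlabelled counter from Prometheus-style text; 0.0 if absent."""
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return 0.0
```

A test would capture `counter_value(...)` before step 1 and again after step 3, then assert that the difference is at least 1 and that `error_entries(...)` returns exactly one line for the annotation id.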
NFT-RES-06: Stream consumer reconnect
Summary: A stream consumer that drops and reconnects with offset last_committed reads only post-disconnect messages.
Traces to: implicit (consumer-side concern, but documents the contract Annotations producer expects)
Steps:
| Step | Action | Expected Behavior |
|---|---|---|
| 1 | Start consumer at offset next; record current end-of-stream offset O0 | consumer up |
| 2 | POST /annotations 5 times | 5 outbox rows; 5 stream messages produced shortly after |
| 3 | Consumer reads all 5; commits offset after each | consumer offset = O0 + 5 |
| 4 | Disconnect consumer | done |
| 5 | POST /annotations 3 more times | 3 more stream messages |
| 6 | Reconnect consumer at last_committed = O0 + 5 | consumer reads only messages 6..8 |
Pass criteria: re-attached consumer sees no duplicates and no gaps.
Duration: ~1 minute.
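The "no duplicates and no gaps" criterion can be checked mechanically from the offsets the consumer reports after reconnecting. The Python sketch below assumes offsets are contiguous integers and that O0 is the offset of the first message produced in step 2, so messages 6..8 occupy offsets O0 + 5 through O0 + 7; `check_resumed_offsets` is a hypothetical helper, not part of the existing test scripts.

```python
def check_resumed_offsets(offsets, resume_from, expected_count):
    """Classify problems in the offsets read after a reconnect.

    offsets        -- offsets delivered after reconnecting, in arrival order
    resume_from    -- the committed offset passed on reconnect (O0 + 5)
    expected_count -- messages produced while disconnected (3 in step 5)
    """
    seen = list(offsets)
    expected = set(range(resume_from, resume_from + expected_count))
    return {
        # same offset delivered more than once
        "duplicates": sorted({o for o in seen if seen.count(o) > 1}),
        # expected offsets that never arrived
        "gaps": sorted(expected - set(seen)),
        # pre-disconnect messages delivered again
        "replayed": sorted({o for o in seen if o < resume_from}),
    }
```

With O0 = 100, a clean run yields `check_resumed_offsets([105, 106, 107], 105, 3)` with all three lists empty; any non-empty list points at the specific contract violation.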