annotations/_docs/02_document/tests/resilience-tests.md
Oleksandr Bezdieniezhnykh 03f879206e docs+src: complete Steps 1-3 outcomes + auth re-sync baseline
This commit captures everything produced during autodev existing-code
Steps 1 (Document), 2 (Architecture Baseline Scan), and 3 (Test Spec),
together with the targeted auth + CORS re-sync triggered on 2026-05-14
when codebase drift was detected at Step 4 entry. None of this work was
previously committed.

Step 1 (Document) — 50+ _docs/02_document/ files: problem, solution,
architecture, system flows, glossary, module-layout, per-component
specs (01..06), modules, deployment, diagrams, data model, FINAL
report, verification log, discovery.

Step 2 (Architecture Baseline) — architecture_compliance_baseline.md.
Verdict PASS_WITH_WARNINGS (0 Critical, 0 High, 1 Medium, 2 Low). No
High/Critical findings; auto-chained to Step 3 per existing-code flow.

Step 3 (Test Spec) — _docs/02_document/tests/* (67 scenarios across
blackbox, security, resilience, resource-limit, performance), plus
e2e/docker-compose.test.yml, e2e/seed/run.sh, scripts/run-tests.sh,
scripts/run-performance-tests.sh. Coverage 88% over the active scope
(40 of 45 items covered, 6 RB-deferred, 5 documented-as-uncovered).

Targeted auth + CORS re-sync — replaces the deleted in-house token
issuer with a JWKS-verifier model. AuthController and TokenService
removed; JwtExtensions switched from HS256 symmetric to ES256 over
admin's JWKS. ConfigurationResolver and CorsConfigurationValidator
added under src/Infrastructure/. ADR-002 and ADR-006 retired; SEC-01,
SEC-02, SEC-03 marked Closed. One new testability risk recorded in
architecture.md Open Risks Section 6 (JWKS HTTPS gating).

Source changes:
- src/Auth/JwtExtensions.cs (modified) — ES256, JWKS, alg pinning
- src/Program.cs (modified) — DI wiring for ConfigurationResolver
  and CorsConfigurationValidator
- src/Controllers/AuthController.cs (deleted) — no in-service issuance
- src/Services/TokenService.cs (deleted) — same
- src/Infrastructure/ConfigurationResolver.cs (new)
- src/Infrastructure/CorsConfigurationValidator.cs (new)
- .env.example (new) — required env var documentation
- .gitignore (updated)

Cross-repo coordination: _docs/cross-repo/flights_h1_h2_h3_change_spec
captures the change-spec for downstream services that consumed the now
deleted /auth endpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-14 20:19:05 +03:00


Resilience Tests

NFT-RES-01: RabbitMQ broker outage during create

Summary: POST /annotations succeeds (HTTP 200) when the RabbitMQ broker is unreachable; the outbox row is preserved; FailsafeProducer does not crash; on broker recovery the message is delivered.

Traces to: AC-F-12, OP-02 (single-instance baseline)

Preconditions: SUT healthy; broker initially reachable; clean outbox.

Fault injection:

  • docker exec rabbitmq rabbitmqctl stop_app mid-test (stops AMQP/streams listeners; container stays up).

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | Stop RabbitMQ app | broker unreachable on 5552 |
| 2 | `POST /annotations` once | HTTP 200; outbox row inserted |
| 3 | Out-of-band: `SELECT COUNT(*) FROM annotations_queue_records WHERE annotation_id = <id>` | count == 1 (row not deleted because drain failed) |
| 4 | `GET /health` | HTTP 200 (SUT not crashed) |
| 5 | `docker exec rabbitmq rabbitmqctl start_app` | broker recovers |
| 6 | Wait drain_interval × 3 | drainer publishes the queued message |
| 7 | Out-of-band: `SELECT COUNT(*) FROM annotations_queue_records WHERE annotation_id = <id>` | count == 0 (drained) |
| 8 | Stream consumer (started before step 5 at offset `next`) reads one message | message body matches the documented schema |

Pass criteria: zero 5xx during outage; outbox preserves the row; recovery delivers the deferred message; total recovery time ≤ 60s after broker comes back.

Duration: ~2 minutes.
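The "wait, then check the outbox" steps and the 60s recovery bound can be driven by a small polling helper. The following is a minimal sketch, not the project's harness: `wait_until` and its parameters are hypothetical names, and the predicate is assumed to wrap the out-of-band `SELECT COUNT(*)` query from steps 3 and 7.

```python
import time
from typing import Callable, Optional


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll predicate until it is True.

    Returns the elapsed seconds on success, or None once timeout_s has
    passed without success. clock and sleep are injectable so the helper
    can be exercised without real waiting.
    """
    start = clock()
    while True:
        if predicate():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

A test could call `wait_until(lambda: outbox_count(annotation_id) == 0, timeout_s=60)` right after step 5 and fail when the result is `None`; `outbox_count` is a placeholder for whatever runs the query.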


NFT-RES-02: Postgres restart between writes

Summary: Restarting Postgres during a quiet period does not corrupt state; subsequent writes succeed.

Traces to: AC-N-02 (idempotent migrator), implicit data-integrity NFR

Fault injection: docker compose restart postgres while no in-flight requests.

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | `POST /annotations` once (FT-P-01-shape) | HTTP 200; row in DB |
| 2 | `docker compose restart postgres` | DB up after ~5s |
| 3 | Wait for SUT `/health` to return 200 | SUT recovers connection pool (or restarts itself) |
| 4 | `POST /annotations` again | HTTP 200; row in DB |
| 5 | `GET /annotations/<id from step 1>` | HTTP 200; original row intact |

Pass criteria: original row intact after restart; new write succeeds within 30s of DB recovery; zero data loss.

Duration: ~2 minutes.
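"Original row intact" can be made concrete by snapshotting the GET response before the restart and diffing it against the response from step 5. This is a sketch under assumptions: `changed_fields` is a hypothetical helper, and which fields (if any) are volatile and should be ignored depends on the real response schema, which is not described here.

```python
from typing import Any, Dict, Iterable, List


def changed_fields(
    before: Dict[str, Any],
    after: Dict[str, Any],
    ignore: Iterable[str] = (),
) -> List[str]:
    """Return the sorted names of fields that differ between two JSON snapshots.

    A field present in only one snapshot counts as changed. Names listed in
    ignore are skipped (for fields expected to vary between reads).
    """
    skipped = set(ignore)
    keys = (set(before) | set(after)) - skipped
    missing = object()  # sentinel that never equals a real JSON value
    return sorted(
        k for k in keys if before.get(k, missing) != after.get(k, missing)
    )
```

An empty result from `changed_fields(before_restart, after_restart)` is the pass condition for step 5.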


NFT-RES-03: Postgres unreachable during create

Summary: When DB is unreachable mid-request, the SUT returns a structured error envelope (no 500 with stack trace); the SUT recovers when DB returns.

Traces to: AC-N-04 (zero unhandled exceptions to clients)

Fault injection: docker pause postgres between request start and request end (timing-sensitive; use a delay-injecting test proxy if needed).

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | `docker pause postgres` | DB connections hang |
| 2 | `POST /annotations` once with timeout 30s | HTTP 5xx (for example 503); error envelope present; no raw exception text in body |
| 3 | `docker unpause postgres` | DB responsive |
| 4 | `POST /annotations` again | HTTP 200; SUT recovered |

Pass criteria: under-DB-outage response uses the error envelope; SUT recovers within 30s of DB recovery.

Duration: ~2 minutes.
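The step 2 expectation ("error envelope present; no raw exception text in body") can be checked mechanically. The sketch below is an assumption-heavy heuristic: it treats any JSON object as an envelope because the envelope's actual fields are not specified here, and the leak patterns are a guess at what .NET exception text looks like. `check_error_body` is a hypothetical name.

```python
import json
import re
from typing import List

# Heuristic patterns for leaked exception text (assumed shapes, tune as needed).
_LEAK_PATTERNS = [
    re.compile(r"\s+at [\w.<>]+\([^)]*\)"),      # stack frame: "   at Ns.Type.Method(args)"
    re.compile(r"\b\w+(\.\w+)*Exception\b"),     # exception type name
]


def check_error_body(status: int, body: str) -> List[str]:
    """Return the problems found in an outage response; an empty list means OK."""
    problems = []
    if not 500 <= status <= 599:
        problems.append(f"expected 5xx, got {status}")
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = None
    if not isinstance(parsed, dict):
        problems.append("body is not a JSON object")
    for pattern in _LEAK_PATTERNS:
        if pattern.search(body):
            problems.append(f"body matches leak pattern {pattern.pattern!r}")
    return problems
```

Once the envelope schema is pinned down, the `isinstance` check would be replaced by a check for its required fields.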


NFT-RES-04: SSE subscriber disconnect mid-stream

Summary: A subscriber that disconnects mid-stream does not crash the SUT or block other subscribers.

Traces to: AC-F-10, OP-01 (per-instance SSE state)

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | Open 3 SSE connections to `/annotations/events?missionId=<m>` | all 3 alive |
| 2 | Abruptly close subscriber #2 (TCP RST) | SUT cleans up its channel slot |
| 3 | `POST /annotations` for mission `<m>` | HTTP 200 |
| 4 | Subscribers #1 and #3 each receive the event | both receive within 1000ms |
| 5 | `GET /health` | HTTP 200 |

Pass criteria: surviving subscribers still receive events; no SUT memory growth visible (channel slots reclaimed); /health stays green.

Duration: ~1 minute.
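To assert that subscribers #1 and #3 each received the event, the test client needs to turn the raw `text/event-stream` lines into events. The following is a minimal sketch of that parsing, covering only `data:` fields, comments, and blank-line dispatch; `iter_sse_data` is a hypothetical helper and the payload format of the real endpoint is not assumed.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event in an SSE line sequence."""
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield "\n".join(data)
                data = []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # One leading space after the colon is not part of the value.
            data.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
```

An event that is still incomplete when the stream ends (no trailing blank line) is dropped, so a connection cut mid-event does not produce a partial payload.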


NFT-RES-05: Repeated FailsafeProducer empty-catch path

Summary: When the image referenced by an outbox row no longer exists on disk, the drainer logs the failure and proceeds (once RB-05 lands). The test covers today's behavior (empty catch) and, after RB-05 lands, asserts the logged failure path.

Traces to: RB-05

Fault injection: insert an outbox row whose annotation_id references a missing image (manually delete the file after POST /annotations returned 200, before the drain interval fires).

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | `POST /annotations` (FT-P-01) | HTTP 200; outbox row + image file present |
| 2 | Delete `<images_dir>/<id>.jpg` | image gone |
| 3 | Wait drain_interval × 2 | drainer runs |
| 4 | Out-of-band: `SELECT COUNT(*) FROM annotations_queue_records WHERE annotation_id = <id>` | today's behavior: row may be deleted or stuck (empty catch swallows IOException); document actual behavior here |
| 5 | [after RB-05] Inspect SUT logs for an ERROR entry mentioning the missing image | one log entry present; metric counter `failsafe_drain_errors` incremented |

Pass criteria today: SUT does not crash; /health stays 200.

Pass criteria after RB-05: as above + the logged failure path is exercised.

Duration: ~1 minute.
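The post-RB-05 log inspection in step 5 could be automated along these lines. This is only a sketch: the log line layout and the literal level token `ERROR` are assumptions (the SUT's logger may render the level differently), and `error_entries_for` is a hypothetical helper.

```python
import re
from typing import Iterable, List

# Assumed level token; adjust if the logger prints e.g. "ERR" or "fail".
_ERROR_LEVEL = re.compile(r"\bERROR\b")


def error_entries_for(log_lines: Iterable[str], image_name: str) -> List[str]:
    """Return the ERROR-level log lines that mention the given image file name."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if _ERROR_LEVEL.search(line) and image_name in line
    ]
```

The step passes when the returned list has exactly one entry for the deleted `<id>.jpg`.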


NFT-RES-06: Stream consumer reconnect

Summary: A stream consumer that drops and reconnects with offset last_committed reads only post-disconnect messages.

Traces to: implicit (consumer-side concern, but documents the contract the Annotations producer expects)

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | Start consumer at offset `next`; record current end-of-stream offset O0 | consumer up |
| 2 | `POST /annotations` 5 times | 5 outbox rows; 5 stream messages produced shortly after |
| 3 | Consumer reads all 5; commits offset after each | consumer offset = O0 + 5 |
| 4 | Disconnect consumer | done |
| 5 | `POST /annotations` 3 more times | 3 more stream messages |
| 6 | Reconnect consumer at `last_committed` = O0 + 5 | consumer reads only messages 6..8 |

Pass criteria: re-attached consumer sees no duplicates and no gaps.

Duration: ~1 minute.
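The "no duplicates and no gaps" criterion reduces to a check on the offsets the consumer observed after reconnecting. The sketch below assumes, as the steps do, that offsets advance by exactly one per message and that no other producer writes to the stream during the test; `offset_problems` is a hypothetical helper.

```python
from typing import List, Sequence


def offset_problems(
    offsets: Sequence[int], expected_first: int, expected_count: int
) -> List[str]:
    """Compare observed offsets with the contiguous range that was expected.

    Reports duplicates (in arrival order), then missing offsets, then
    offsets outside the expected range. An empty list means the read was clean.
    """
    problems = []
    seen = set()
    for off in offsets:
        if off in seen:
            problems.append(f"duplicate offset {off}")
        seen.add(off)
    expected = set(range(expected_first, expected_first + expected_count))
    for off in sorted(expected - seen):
        problems.append(f"missing offset {off}")
    for off in sorted(seen - expected):
        problems.append(f"unexpected offset {off}")
    return problems
```

For step 6 the call would be `offset_problems(observed, O0 + 5, 3)`: an offset below `O0 + 5` shows up as unexpected (a replayed pre-disconnect message), and a skipped one shows up as missing.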