missions/_docs/02_document/tests/resilience-tests.md
Oleksandr Bezdieniezhnykh 7025f4d075 refactor: enhance JWT authentication and CORS configuration
Updated JWT authentication to use configuration values instead of hardcoded secrets, improving security and flexibility. Enhanced CORS policy to conditionally allow origins based on configuration settings, with logging for permissive defaults. Updated README to reflect project renaming and clarify service context.
2026-05-14 19:48:25 +03:00

Resilience Tests

Status: produced by autodev /test-spec Phase 2 (2026-05-14).

Naming: post-rename target.

Resilience scenarios use the side-channel and container-orchestration tools to inject faults; the system under test is treated as a black box.

Critical scenarios: AC-3.3 (cascade NOT transaction-wrapped) and AC-3.4 (orphan-row race) are the highest-impact resilience invariants. They intentionally encode current (sub-optimal) behavior so that the tests catch any silent change. Several other resilience tests verify documented operational behaviors (idempotent migrator, DB-down crash, JWT secret rotation).


NFT-RES-01: Cascade is NOT transaction-wrapped — partial deletes survive mid-walk failure

Summary: Verifies the documented ADR-006 carry-forward: when the cascade walk fails mid-way (e.g., media table absent), already-committed deletes remain.

Traces to: AC-3.3, AC-3.4 (related), AC-10.2

Preconditions:

  • fixture_cascade_F3 applied (chain rooted at mid)
  • missions running

Fault injection: side-channel `DROP TABLE media CASCADE;` BEFORE the request. This turns the second sub-step of the cascade walk into a `relation does not exist` failure.

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | Side-channel `DROP TABLE media CASCADE` | succeeds |
| 2 | `DELETE /missions/{mid}` (JWT FL) | 500; envelope `{ statusCode:500, message:"Internal server error" }` |
| 3 | Side-channel `SELECT COUNT(*) FROM map_objects WHERE mission_id={mid}` | count == 0 (work BEFORE the failure point committed: non-zero pre-fault, zero post-fault) |
| 4 | Side-channel `SELECT COUNT(*) FROM missions WHERE id={mid}` | count == 1 (work AFTER the failure point did NOT run; row remains) |
| 5 | `docker logs missions 2>&1 \| grep "Unhandled exception"` | At least one matching log line containing `relation` and `media` |

Pass criteria: map_objects count is 0 (deleted before failure) AND missions count is 1 (not deleted because the failure short-circuited the walk) AND the response is 500 with the redacted body.

Note: this test will FAIL once a transaction wrap is added (ADR-006 closure). At that point ALL deletes will roll back and the map_objects count will be > 0. When the transaction wrap lands, update this test.

Max execution time: 10s.
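
The pass criteria above can be written as a single predicate over the values the test collects. The following is a minimal Python sketch, not the project's test code; `CascadeObservation`, its field names, and `partial_delete_survived` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CascadeObservation:
    """Values collected by the test (all names are illustrative assumptions)."""
    status_code: int        # HTTP status of DELETE /missions/{mid}
    body_message: str       # "message" field of the error envelope
    child_rows_before: int  # map_objects rows for the mission before the request
    child_rows_after: int   # map_objects rows for the mission after the request
    parent_rows_after: int  # missions rows with id = mid after the request


def partial_delete_survived(obs: CascadeObservation) -> bool:
    """True when the non-transactional behavior described above is observed."""
    return (
        obs.status_code == 500
        and obs.body_message == "Internal server error"
        and obs.child_rows_before > 0    # there was something to delete
        and obs.child_rows_after == 0    # deletes before the fault committed
        and obs.parent_rows_after == 1   # the walk never reached the mission row
    )
```

If a transaction wrap is added later, the child rows would survive the failed request and this predicate would return False, which is the signal to update the test.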


NFT-RES-02: Waypoint cascade is NOT transaction-wrapped (same invariant as F3)

Summary: Same invariant as NFT-RES-01, scoped to F4 (waypoint cascade).

Traces to: AC-4.6, AC-3.3 (same root cause), AC-10.2

Preconditions:

  • fixture_cascade_F4 applied (waypoint wp1 with chain)

Fault injection: side-channel DROP TABLE media CASCADE BEFORE the request.

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | Side-channel `DROP TABLE media CASCADE` | succeeds |
| 2 | `DELETE /missions/{mid}/waypoints/{wp1}` (JWT FL) | 500 |
| 3 | Side-channel `SELECT COUNT(*) FROM detection WHERE annotation_id IN (wp1 chain)` | count == 0 (deleted before failure) |
| 4 | Side-channel `SELECT COUNT(*) FROM waypoints WHERE id={wp1}` | count == 1 (not deleted) |

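Pass criteria: same shape as NFT-RES-01: detection count is 0 (deleted before failure) AND waypoints count is 1 (not deleted) AND the response is 500.

Max execution time: 10s.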


NFT-RES-03: Idempotent migrator — second startup is a no-op

Summary: Verifies the migrator can run twice in a row without `relation already exists` errors.

Traces to: AC-6.6, AC-6.4

Preconditions:

  • seed_empty (schema migrated once via first missions startup)

Fault injection: container restart (NOT volume reset).

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | `docker compose restart missions` | container exits cleanly + restarts |
| 2 | Wait for `GET /health` | 200 within 30s |
| 3 | `docker logs missions 2>&1 \| grep -E "(error\|Error\|exception)"` | no NEW error / exception lines after the restart timestamp |
| 4 | Side-channel `\d+ vehicles` | schema unchanged from after the first migrate |

Pass criteria: second start completes; no new error log lines; schema unchanged.

Max execution time: 60s.
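
Step 3 only counts error lines logged after the restart, so the log has to be filtered by time. Below is a minimal Python sketch of that filter. It assumes the log was captured with `docker logs --timestamps`, so every line starts with an RFC 3339 UTC timestamp; `new_error_lines` and `ERROR_PATTERN` are hypothetical names, and the pattern simply mirrors the grep expression in the step table.

```python
import re
from datetime import datetime

# Same alternatives as the grep -E expression in step 3.
ERROR_PATTERN = re.compile(r"error|Error|exception")


def new_error_lines(log_lines, restart_time):
    """Return the lines logged at or after restart_time that match ERROR_PATTERN.

    restart_time is a naive datetime interpreted as UTC, like the log prefixes.
    """
    hits = []
    for line in log_lines:
        stamp, _, message = line.partition(" ")
        try:
            # Keep whole seconds only; docker emits nanosecond precision.
            logged_at = datetime.strptime(stamp[:19], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # line has no parsable timestamp prefix (assumption: ignore it)
        if logged_at >= restart_time and ERROR_PATTERN.search(message):
            hits.append(line)
    return hits
```

The test passes step 3 when the returned list is empty. Recording `restart_time` just before issuing the restart keeps pre-restart errors out of the result.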


NFT-RES-04: B9 one-shot legacy table drop runs once and is idempotent

Summary: Verifies that the post-B9 `DROP TABLE IF EXISTS orthophotos / gps_corrections` block in the migrator is destructive on the FIRST run against a legacy device, and a no-op on subsequent runs.

Traces to: AC-6.5, AC-10.5

Preconditions:

  • seed_legacy_gps_tables (schema includes orthophotos + gps_corrections with 1 row each)
  • missions NOT yet started for this scenario

Fault injection: none — purely observe migrator behavior.

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | Side-channel `SELECT to_regclass('orthophotos'), to_regclass('gps_corrections')` | both NON-NULL (legacy tables present) |
| 2 | `docker compose up -d missions` | container starts |
| 3 | Wait for `GET /health` | 200 |
| 4 | Side-channel re-query | both NULL (dropped) |
| 5 | `docker compose restart missions` | |
| 6 | Side-channel re-query | both still NULL (idempotent; no error) |
| 7 | `docker logs missions 2>&1 \| grep -i "does not exist"` | NO log line (because of `IF EXISTS`) |

Pass criteria: legacy tables absent after first start; subsequent restarts produce no errors and leave them absent.

Note: this test only meaningfully runs on a post-B9 build. Before B9 lands, the migrator has no DROP block; gate this scenario on a build-time flag or by inspecting the migrator source.

Max execution time: 60s.
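
The run-once-then-no-op behavior comes from `IF EXISTS`. The sketch below illustrates it in Python using an in-memory SQLite database purely as a stand-in for PostgreSQL, so `sqlite_master` replaces `to_regclass`. `drop_legacy_tables` and `table_exists` are hypothetical helpers, not the migrator's code.

```python
import sqlite3

LEGACY_TABLES = ("orthophotos", "gps_corrections")


def table_exists(conn, name):
    """SQLite stand-in for the to_regclass(...) IS NOT NULL check."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


def drop_legacy_tables(conn):
    """Destructive on the first call, a silent no-op on every later call."""
    for name in LEGACY_TABLES:
        # Table names come from the fixed tuple above, never from user input.
        conn.execute(f"DROP TABLE IF EXISTS {name}")


def seed_legacy_tables(conn):
    """Create both legacy tables with one row each, like the seed fixture."""
    for name in LEGACY_TABLES:
        conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY)")
        conn.execute(f"INSERT INTO {name} (id) VALUES (1)")
```

Calling `drop_legacy_tables` twice on a seeded connection removes the tables the first time and raises nothing the second time, which corresponds to steps 4 and 6.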


NFT-RES-05: DB unreachable at startup — process exits non-zero

Summary: Verifies AC-6.7: DB unreachability causes the process to exit, NOT to retry silently forever.

Traces to: AC-6.7

Preconditions:

  • missions NOT running

Fault injection: stop postgres-test (docker compose stop postgres-test) then start missions.

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | `docker compose stop postgres-test` | |
| 2 | `docker compose up -d missions` | |
| 3 | Poll `docker inspect --format '{{.State.ExitCode}}' missions` every 1s for ≤ 30s | At some point within 30s, the container has exited with non-zero exit code |
| 4 | `docker logs missions` | Contains an Npgsql connection error message (e.g., Connection refused) |

Pass criteria: container exits with non-zero code within 30s; logs contain a recognisable Npgsql error.

Max execution time: 60s.
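
The polling in step 3 can be sketched as a bounded loop. This Python version is an illustration under assumptions: it reads `.State.Status` alongside `.State.ExitCode`, because a running container also reports exit code 0, and it takes the container name `missions` from the step table (the real name under compose may differ). `wait_for_nonzero_exit` and `docker_state` are hypothetical names.

```python
import subprocess
import time


def docker_state(container):
    """Return (status, exit_code) for a container using the docker CLI."""
    out = subprocess.run(
        ["docker", "inspect", "--format",
         "{{.State.Status}} {{.State.ExitCode}}", container],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return out[0], int(out[1])


def wait_for_nonzero_exit(container, read_state=docker_state, timeout_s=30.0,
                          interval_s=1.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until the container has exited; True only for a non-zero exit code.

    Returns False if the container exits with 0 or is still up at the deadline.
    read_state, sleep and clock are injectable so the loop can be unit tested.
    """
    deadline = clock() + timeout_s
    while True:
        status, exit_code = read_state(container)
        if status == "exited":
            return exit_code != 0
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A call such as `wait_for_nonzero_exit("missions")` covers step 3; the same helper would also serve step 3 of NFT-RES-06.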


NFT-RES-06: DB missing (database does not exist) — process exits with Npgsql 3D000

Summary: Verifies AC-6.8: when the azaion database does not exist, the process exits with the documented PostgreSQL error code.

Traces to: AC-6.8

Preconditions:

  • postgres-test running with the azaion database NOT yet created (use POSTGRES_DB=postgres instead, or DROP DATABASE azaion)

Fault injection: same as preconditions.

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | Side-channel `DROP DATABASE IF EXISTS azaion` | |
| 2 | `docker compose up -d missions` | |
| 3 | Poll exit code for ≤ 30s | non-zero |
| 4 | `docker logs missions 2>&1 \| grep "3D000"` | at least one match |

Pass criteria: container exits non-zero within 30s; logs contain 3D000.

Max execution time: 60s.


NFT-RES-07: JWT_SECRET rotation invalidates existing tokens

Summary: Verifies AC-5.7: restarting the service with a different JWT_SECRET causes previously valid tokens to fail validation.

Traces to: AC-5.7

Preconditions:

  • missions running with JWT_SECRET=test-secret-32-chars-min!!!!!!!!!
  • Token T1 minted with the same secret, valid for 1h

Fault injection: restart missions with JWT_SECRET=rotated-secret-32-chars-min!!!!!.

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | `GET /vehicles` with `Authorization: Bearer T1` | 200 |
| 2 | `docker compose stop missions` | |
| 3 | `docker compose run -e JWT_SECRET=rotated-secret-32-chars-min!!!!! -d missions` | |
| 4 | Wait for `GET /health` | 200 |
| 5 | `GET /vehicles` with `Authorization: Bearer T1` (same token as step 1) | 401 |
| 6 | Mint token T2 with the new secret, `GET /vehicles` with T2 | 200 |

Pass criteria: T1 works pre-rotation, fails post-rotation; T2 works post-rotation.

Max execution time: 90s.
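
Rotation invalidates T1 because the token signature is a keyed hash over the header and payload, so a different key yields a different expected signature. The Python sketch below shows this with the standard library only. It assumes the service signs with HS256, which the spec does not state; `mint_hs256` and `signature_valid` are hypothetical helpers that check the signature alone and ignore claims such as `exp`.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url without padding, as used by the JWT compact form."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _sign(secret: str, signing_input: str) -> str:
    digest = hmac.new(secret.encode("utf-8"), signing_input.encode("ascii"),
                      hashlib.sha256).digest()
    return _b64url(digest)


def mint_hs256(secret: str, claims: dict) -> str:
    """Build header.payload.signature for the given claims (HS256 assumed)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"},
                                separators=(",", ":")).encode("utf-8"))
    payload = _b64url(json.dumps(claims, separators=(",", ":")).encode("utf-8"))
    signing_input = f"{header}.{payload}"
    return f"{signing_input}.{_sign(secret, signing_input)}"


def signature_valid(secret: str, token: str) -> bool:
    """Recompute the signature with `secret` and compare in constant time."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    header, payload, signature = parts
    expected = _sign(secret, f"{header}.{payload}")
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

A token minted with the pre-rotation secret validates under that secret and fails under the rotated one, mirroring steps 1 and 5; a token minted with the rotated secret validates, mirroring step 6.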


NFT-RES-08: TOCTOU on default-vehicle exclusivity (race window)

Summary: Verifies AC-1.4: the clear-then-set is NOT transaction-wrapped, so a concurrent INSERT can leave 2+ defaults.

Traces to: AC-1.4 (carry-forward)

Preconditions:

  • seed_one_default_vehicle (default P1)

Fault injection: a second concurrent client issues `INSERT INTO vehicles (..., is_default=true)` directly against the side-channel DB at the same moment as the API client issues `POST /vehicles { IsDefault:true }`.

Synchronisation: the test orchestrator pauses at `pg_advisory_lock(1)` after the service's `UPDATE vehicles SET is_default=FALSE` and BEFORE the service's INSERT. The pause is implemented via a Postgres function that the test installs and the service path traverses. This requires an instrumented test build; if one is not available, the test is best-effort and issues 100 parallel INSERTs.

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | Run 100 parallel iterations of (`POST /vehicles { IsDefault:true }` + side-channel `INSERT (..., is_default=true)`) | each iteration completes |
| 2 | Side-channel `SELECT COUNT(*) FROM vehicles WHERE is_default=true` | count is ≥ 2 in at least one iteration |

Pass criteria: at least one iteration produces default_count ≥ 2. If no iteration produces the race, the test FAILS. Either the race window has been closed (good news: rewrite the test to assert default_count == 1 and update AC-1.4 to remove the race carry-forward), or the test concurrency primitive is wrong (investigate).

Note: this test is intentionally PROBABILISTIC. It asserts that the RACE EXISTS, not that the system is broken. A PASS here is bad news for the system, but it means the test is correctly observing the documented behavior.

Max execution time: 30s.
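
The interleaving this scenario targets can be replayed deterministically on an in-memory model. The Python sketch below is only a model of the clear-then-set sequence described above, with a list of dicts standing in for the vehicles table; `run_interleaved` and the step names are hypothetical.

```python
def clear_defaults(vehicles):
    """Service step 1: UPDATE vehicles SET is_default = FALSE."""
    for vehicle in vehicles:
        vehicle["is_default"] = False


def insert_default(vehicles, name):
    """INSERT of a new row with is_default = TRUE (no clearing of its own)."""
    vehicles.append({"name": name, "is_default": True})


def default_count(vehicles):
    """SELECT COUNT(*) FROM vehicles WHERE is_default = true."""
    return sum(1 for vehicle in vehicles if vehicle["is_default"])


def run_interleaved(schedule):
    """Apply the steps in order to a table seeded with one default (P1)."""
    vehicles = [{"name": "P1", "is_default": True}]
    for step in schedule:
        if step == "service_clear":
            clear_defaults(vehicles)
        elif step == "service_insert":
            insert_default(vehicles, "api-vehicle")
        elif step == "side_insert":
            insert_default(vehicles, "side-channel-vehicle")
        else:
            raise ValueError(f"unknown step: {step}")
    return default_count(vehicles)
```

With the schedule `["service_clear", "side_insert", "service_insert"]` the model ends with two defaults, while `["side_insert", "service_clear", "service_insert"]` ends with one. In this model a side-channel insert that lands after the service has finished also leaves two defaults, so the final count alone does not show which ordering occurred.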


Notes

  • Tests that drop tables (NFT-RES-01, NFT-RES-02) or leave duplicate default rows behind (NFT-RES-08) run in a per-class IClassFixture<DbResetFixture> that recreates the schema after each scenario.
  • NFT-RES-03 through NFT-RES-07 require container-orchestration via docker compose from inside the test runner. The e2e-consumer container needs the docker CLI + a mounted Docker socket — alternatively, the test runs from the host with docker compose available there. Hardware-assessment will lock this preference.
  • NFT-RES-08 is intentionally probabilistic and may be flaky on slow runners. CI marks this as [Trait("Stability","probabilistic")] and tolerates ≤ 1 failed run per 5; deterministic implementation (advisory lock + instrumented build) is a follow-up.