Oleksandr Bezdieniezhnykh 78dea8ebab
chore: update configuration and Docker setup for JWT and test results
Enhanced the .gitignore to exclude test results and updated the Dockerfile to include a new entrypoint script for improved container initialization. Refactored JWT configuration to support additional parameters for automatic refresh intervals, ensuring better control over token management. Updated the ConfigurationResolver to enforce required environment variables without hardcoded fallbacks, enhancing security and flexibility.
2026-05-15 03:23:23 +03:00


Resilience Tests

Status: produced by autodev /test-spec Phase 2 (2026-05-14).

Naming: post-rename target.

Resilience scenarios use the side-channel and container-orchestration tools to inject faults; the system under test is treated as a black box.

Critical scenarios: AC-3.3 (cascade NOT transaction-wrapped) and AC-3.4 (orphan-row race) are the highest-impact resilience invariants. They intentionally encode the current (sub-optimal) behavior so that the tests catch any silent change. Several other resilience tests verify documented operational behaviors (idempotent migrator, DB-down crash, JWKS key rotation).


NFT-RES-01: Cascade is NOT transaction-wrapped — partial deletes survive mid-walk failure

Summary: Verifies the documented ADR-006 carry-forward: when the cascade walk fails mid-way (e.g., media table absent), already-committed deletes remain.

Traces to: AC-3.3, AC-3.4 (related), AC-10.2

Preconditions:

  • fixture_cascade_F3 applied (chain rooted at mid)
  • missions running

Fault injection: side-channel DROP TABLE media CASCADE; issued BEFORE the request. This turns the second sub-step of the cascade walk into a "relation does not exist" failure.

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | Side-channel DROP TABLE media CASCADE | succeeds |
| 2 | DELETE /missions/{mid} (JWT FL) | 500; envelope { statusCode:500, message:"Internal server error" } |
| 3 | Side-channel SELECT COUNT(*) FROM map_objects WHERE mission_id={mid} | count == 0 (work BEFORE the failure point was committed: non-zero pre-fault, zero post-fault) |
| 4 | Side-channel SELECT COUNT(*) FROM missions WHERE id={mid} | count == 1 (work AFTER the failure point did NOT run; the row remains) |
| 5 | docker logs missions \| grep "Unhandled exception" | at least one matching log line containing relation and media |

Pass criteria: map_objects count is 0 (deleted before the failure) AND missions count is 1 (not deleted, because the failure short-circuited the walk) AND the response is 500 with the redacted body.

Note: this test will FAIL once a transaction wrap is added (ADR-006 closure). At that point ALL deletes will roll back and the map_objects count will be >0. Update this test when the transaction wrap lands.

Max execution time: 10s.
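The invariant can be illustrated with a minimal sketch that uses in-memory SQLite as a stand-in for PostgreSQL. The three-statement cascade, the table columns, and the function name run_cascade are assumptions made for illustration; the real walk has more sub-steps and runs inside the missions service. With transactional=False the sketch reproduces the current behavior asserted by this scenario; with transactional=True it shows what the counts would look like after an ADR-006 closure.

```python
import sqlite3


def run_cascade(transactional: bool) -> tuple[int, int]:
    """Return (map_objects_count, missions_count) after a cascade that fails mid-walk."""
    # isolation_level=None puts sqlite3 in autocommit mode, so every statement
    # commits on its own unless a transaction is opened explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript(
        """
        CREATE TABLE missions (id INTEGER PRIMARY KEY);
        CREATE TABLE map_objects (id INTEGER PRIMARY KEY, mission_id INTEGER);
        INSERT INTO missions VALUES (1);
        INSERT INTO map_objects VALUES (10, 1), (11, 1);
        """
    )
    # The media table is deliberately never created: this stands in for the
    # side-channel DROP TABLE media CASCADE.
    try:
        if transactional:
            conn.execute("BEGIN")
        conn.execute("DELETE FROM map_objects WHERE mission_id = 1")  # sub-step 1
        conn.execute("DELETE FROM media WHERE mission_id = 1")  # sub-step 2 fails
        conn.execute("DELETE FROM missions WHERE id = 1")  # never reached
        if transactional:
            conn.execute("COMMIT")
    except sqlite3.OperationalError:
        # "no such table: media" is SQLite's analogue of "relation does not exist".
        if transactional:
            conn.execute("ROLLBACK")
    map_objects = conn.execute("SELECT COUNT(*) FROM map_objects").fetchone()[0]
    missions = conn.execute("SELECT COUNT(*) FROM missions").fetchone()[0]
    conn.close()
    return map_objects, missions
```

Without the transaction the child rows are gone and the parent row survives, which is exactly the (0, 1) pair that steps 3 and 4 assert. With the transaction, the rollback restores the child rows, which is why this scenario must be rewritten when the wrap lands.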


NFT-RES-02: Waypoint cascade is NOT transaction-wrapped (same invariant as F3)

Summary: Same invariant as NFT-RES-01, scoped to F4 (waypoint cascade).

Traces to: AC-4.6, AC-3.3 (same root cause), AC-10.2

Preconditions:

  • fixture_cascade_F4 applied (waypoint wp1 with chain)

Fault injection: side-channel DROP TABLE media CASCADE BEFORE the request.

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | Side-channel DROP TABLE media CASCADE | succeeds |
| 2 | DELETE /missions/{mid}/waypoints/{wp1} (JWT FL) | 500 |
| 3 | Side-channel SELECT COUNT(*) FROM detection WHERE annotation_id IN (wp1 chain) | count == 0 (deleted before failure) |
| 4 | Side-channel SELECT COUNT(*) FROM waypoints WHERE id={wp1} | count == 1 (not deleted) |

Pass criteria: same shape as NFT-RES-01.

Max execution time: 10s.


NFT-RES-03: Idempotent migrator — second startup is a no-op

Summary: Verifies the migrator can run twice in a row without "relation already exists" errors.

Traces to: AC-6.6, AC-6.4

Preconditions:

  • seed_empty (schema migrated once via first missions startup)

Fault injection: container restart (NOT volume reset).

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | docker compose restart missions | container exits cleanly + restarts |
| 2 | Wait for GET /health to return 200 | within ≤ 30s |
| 3 | docker logs missions \| grep -iE "(error\|exception)" (case-insensitive, so that .NET type names such as NpgsqlException also match) | no NEW error / exception lines after the restart timestamp |
| 4 | Side-channel \d+ vehicles | schema unchanged from after the first migrate |

Pass criteria: second start completes; no new error log lines; schema unchanged.

Max execution time: 60s.
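The "no NEW error lines after the restart timestamp" check in step 3 needs timestamps on the log lines. One way to sketch it, assuming the logs are captured with docker logs --timestamps (which prefixes each line with an RFC 3339 UTC timestamp), is shown below. The function names and the case-insensitive error pattern are illustrative assumptions, not part of the spec.

```python
import re
from datetime import datetime, timezone

# Timestamp prefix as emitted by `docker logs --timestamps`, e.g.
# 2026-05-14T10:05:02.123456789Z <message>
_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z\s(.*)$")
# Case-insensitive so that .NET names such as "NpgsqlException" also match.
_ERR = re.compile(r"error|exception", re.IGNORECASE)


def parse_line(line: str):
    """Split a timestamped log line into (utc_datetime, message), or None."""
    m = _TS.match(line)
    if m is None:
        return None
    base, frac, msg = m.groups()
    # Docker prints nanoseconds; datetime only keeps microseconds.
    micros = int((frac or "0")[:6].ljust(6, "0"))
    ts = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(
        microsecond=micros, tzinfo=timezone.utc
    )
    return ts, msg


def new_error_lines(log_text: str, restart_ts: datetime) -> list[str]:
    """Return error/exception messages logged strictly after restart_ts."""
    hits = []
    for line in log_text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue  # lines without a timestamp cannot be ordered; skip them
        ts, msg = parsed
        if ts > restart_ts and _ERR.search(msg):
            hits.append(msg)
    return hits
```

The scenario passes step 3 when new_error_lines returns an empty list for the log captured after the restart.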


NFT-RES-04: B9 one-shot legacy table drop runs once and is idempotent

Summary: Verifies that the post-B9 DROP TABLE IF EXISTS orthophotos / gps_corrections block in the migrator is destructive on the FIRST run against a legacy device, and a no-op on subsequent runs.

Traces to: AC-6.5, AC-10.5

Preconditions:

  • seed_legacy_gps_tables (schema includes orthophotos + gps_corrections with 1 row each)
  • missions NOT yet started for this scenario

Fault injection: none — purely observe migrator behavior.

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | Side-channel SELECT to_regclass('orthophotos'), to_regclass('gps_corrections') | both NON-NULL (legacy tables present) |
| 2 | docker compose up -d missions | container starts |
| 3 | Wait for GET /health to return 200 | |
| 4 | Side-channel re-query | both NULL (dropped) |
| 5 | docker compose restart missions | |
| 6 | Side-channel re-query | both still NULL (idempotent, no error) |
| 7 | docker logs missions \| grep -i "does not exist" | NO log line (because of IF EXISTS) |

Pass criteria: legacy tables absent after first start; subsequent restarts produce no errors and leave them absent.

Note: this test is only meaningful on a post-B9 build. Before B9 lands, the migrator has no DROP block; gate this scenario on a build-time flag or by inspecting the migrator source.

Max execution time: 60s.
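The "destructive once, no-op afterwards" shape of this scenario can be sketched with in-memory SQLite standing in for PostgreSQL. Here table_exists plays the role of to_regclass, and run_drop_block is a hypothetical stand-in for the migrator's DROP block; neither name comes from the actual migrator source.

```python
import sqlite3

LEGACY_TABLES = ("orthophotos", "gps_corrections")


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    """SQLite analogue of PostgreSQL's to_regclass(name) IS NOT NULL."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


def run_drop_block(conn: sqlite3.Connection) -> None:
    """Hypothetical stand-in for the migrator's one-shot legacy drop block."""
    for name in LEGACY_TABLES:
        # IF EXISTS is what makes the second run a no-op instead of an error.
        conn.execute(f"DROP TABLE IF EXISTS {name}")


def scenario():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    for name in LEGACY_TABLES:  # seed_legacy_gps_tables: one row per table
        conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY)")
        conn.execute(f"INSERT INTO {name} VALUES (1)")
    before = [table_exists(conn, n) for n in LEGACY_TABLES]
    run_drop_block(conn)  # first start: destructive
    after_first = [table_exists(conn, n) for n in LEGACY_TABLES]
    run_drop_block(conn)  # restart: must not raise
    after_second = [table_exists(conn, n) for n in LEGACY_TABLES]
    conn.close()
    return before, after_first, after_second
```

The three lists correspond to the side-channel queries in steps 1, 4, and 6.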


NFT-RES-05: Required configuration missing → fail-fast at startup

Summary: Verifies AC-6.1 / AC-6.2 / E3: Infrastructure/ConfigurationResolver.ResolveRequiredOrThrow throws InvalidOperationException when any of the four required env vars (DATABASE_URL, JWT_ISSUER, JWT_AUDIENCE, JWT_JWKS_URL) is missing or whitespace-only. Also verifies AC-6.7: DB unreachability (after config resolution succeeds) still causes process exit. The legacy "silent dev fallback boot" failure mode is structurally eliminated.

Traces to: AC-6.1, AC-6.2, AC-6.7, E3, E4

Preconditions:

  • missions NOT running.
  • This scenario uses docker run outside the main compose to isolate env-var manipulation.

Steps (each row is a separate docker run invocation; each times out at 30s):

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | docker run --rm azaion/missions:test with ALL four required env vars unset | container exits non-zero within 5s; logs contain InvalidOperationException; logs mention at least one of the four required keys |
| 2 | docker run with DATABASE_URL unset; the three JWT vars set correctly | same shape; logs mention DATABASE_URL or Database:Url |
| 3 | docker run with JWT_ISSUER=" " (whitespace-only); other three set | same shape; logs mention JWT_ISSUER or Jwt:Issuer |
| 4 | docker run with JWT_AUDIENCE unset; others set | same shape; logs mention JWT_AUDIENCE or Jwt:Audience |
| 5 | docker run with JWT_JWKS_URL unset; others set | same shape; logs mention JWT_JWKS_URL or Jwt:JwksUrl |
| 6 | docker compose stop postgres-test, then start missions with all four env vars set correctly (config resolution succeeds, then DB-connect fails) | container exits non-zero within 30s; logs contain a recognisable Npgsql connection error (e.g., Connection refused), NOT an InvalidOperationException from the resolver (this differentiates "config missing" from "config valid but DB down") |

Pass criteria: rows 1-5 → fail-fast at config resolution; row 6 → fail at DB-connect AFTER config resolution succeeded.

Note: this test now exercises BOTH the fail-fast resolver (rows 1-5) AND the DB-unreachable case (row 6). Pre-revision, only row 6 was tested under the assumption of hardcoded dev fallbacks.

Max execution time: 180s (6 docker-run cycles).
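The fail-fast contract exercised by rows 1-5 can be sketched as a small Python analogue. This is not the C# ConfigurationResolver; the exception class, the function name, and the message format are assumptions, chosen only to show the "missing or whitespace-only means throw, and name the key" rule.

```python
REQUIRED_KEYS = ("DATABASE_URL", "JWT_ISSUER", "JWT_AUDIENCE", "JWT_JWKS_URL")


class MissingConfigurationError(RuntimeError):
    """Python stand-in for .NET's InvalidOperationException."""


def resolve_required_or_throw(env: dict, keys=REQUIRED_KEYS) -> dict:
    """Return the required settings, or raise naming every offending key."""
    # A key counts as missing when it is absent, empty, or whitespace-only.
    missing = [k for k in keys if not (env.get(k) or "").strip()]
    if missing:
        # No hardcoded dev fallback: the process is expected to exit non-zero.
        raise MissingConfigurationError(
            "Missing required configuration: " + ", ".join(missing)
        )
    return {k: env[k] for k in keys}
```

Naming the offending key in the message is what lets rows 2-5 assert on the log output rather than only on the exit code.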


NFT-RES-06: DB missing (database does not exist) — process exits with Npgsql 3D000

Summary: Verifies AC-6.8: when the azaion database does not exist, the process exits with the documented PostgreSQL error code.

Traces to: AC-6.8

Preconditions:

  • postgres-test running with the azaion database NOT yet created (use POSTGRES_DB=postgres instead, or DROP DATABASE azaion)

Fault injection: same as preconditions.

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | Side-channel DROP DATABASE IF EXISTS azaion | |
| 2 | docker compose up -d missions | |
| 3 | Poll exit code for ≤ 30s | non-zero |
| 4 | docker logs missions \| grep "3D000" | at least one match |

Pass criteria: container exits non-zero within 30s; logs contain 3D000.

Max execution time: 60s.
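Step 3 ("poll exit code for ≤ 30s") and the health waits in other scenarios share the same bounded-polling shape. A minimal sketch follows; poll_until and its parameters are hypothetical names, and the clock and sleep functions are injectable so the helper can be exercised without real waiting.

```python
import time
from typing import Callable, Optional


def poll_until(
    probe: Callable[[], Optional[int]],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Call probe() until it returns a non-None value or timeout_s elapses.

    Returns the probe's value, or None if the deadline passed first.
    """
    deadline = clock() + timeout_s
    while True:
        result = probe()
        if result is not None:
            return result
        if clock() >= deadline:
            return None
        sleep(interval_s)
```

In this scenario the probe would presumably wrap docker inspect on the missions container and return the exit code only once the container is no longer running; the step passes when the returned value is a non-zero integer and fails on None.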


NFT-RES-07: JWKS key rotation — no missions restart required

Summary: Verifies AC-5.7: rotating the signing key on admin (via jwks-mock POST /rotate-key) propagates to missions on the JWKS cache refresh tick without restarting missions. This is the primary operational win over the legacy shared-HMAC model, which required a coordinated re-deploy across every backend on the device.

Traces to: AC-5.7

Preconditions:

  • missions running with warm JWKS cache (any previous protected request succeeded).
  • jwks-mock running with Cache-Control: max-age=60 and OldKeyGraceSeconds=5.
  • Token T1 requested via POST /sign with the CURRENT kid (kid_v1), valid for 1h.

Fault injection: POST https://jwks-mock:8443/rotate-key {} — generates kid_v2, retains kid_v1 in the JWKS for OldKeyGraceSeconds, then evicts kid_v1.

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | GET /vehicles with Authorization: Bearer T1 | 200 (cached JWKS knows kid_v1) |
| 2 | POST https://jwks-mock:8443/rotate-key {} → returns kid_v2 | jwks-mock now publishes BOTH kid_v1 and kid_v2 in its JWKS for OldKeyGraceSeconds=5 |
| 3 | Immediately request token T2 signed with kid_v2 via POST /sign {} | mock returns JWT with header kid: kid_v2 |
| 4 | Immediately GET /vehicles with Authorization: Bearer T2 (BEFORE missions JWKS cache refresh) | 401 (cache still only has kid_v1) |
| 5 | Wait up to 90s for the missions ConfigurationManager<JsonWebKeySet> to refresh against the new JWKS (the mock's max-age=60 triggers a refresh on the next request after that interval) | |
| 6 | GET /vehicles with Authorization: Bearer T2 again | 200 (cache now contains kid_v2) |
| 7 | GET /vehicles with Authorization: Bearer T1 (still has unexpired lifetime, signed with kid_v1) | 200 IF the JWKS refresh happened BEFORE the mock's OldKeyGraceSeconds=5 window closed (the JWKS still had kid_v1); 401 AFTER the grace window, when missions refreshes and kid_v1 is no longer in the JWKS. Test asserts the eventual 401 |
| 8 | Verify missions was NEVER restarted during this scenario (docker inspect --format '{{.State.StartedAt}}' missions is unchanged from before step 1) | startup timestamp unchanged |

Pass criteria: rotation propagates without restart; T2 (new kid) eventually accepted; T1 (old kid) eventually rejected; missions startup timestamp unchanged.

Note: this test replaces the pre-revision "shared-secret rotation requires coordinated redeploy" scenario. The pre-revision test asserted that ALL services on the device had to restart together; the post-revision test asserts the opposite: they do NOT have to restart.

Max execution time: 180s (longest wait is the JWKS refresh tick).
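The expected status sequence can be reasoned about with a small timeline model. The sketch below models only what this scenario describes (a key set published by the mock with a grace window, and a validator cache that refetches once its copy is max-age seconds old). The class names are hypothetical, signatures are reduced to a kid lookup, and the model makes no claim about how ConfigurationManager actually schedules refreshes.

```python
class JwksMock:
    """Publishes the current kid, plus the previous kid during a grace window."""

    def __init__(self, grace_s: float):
        self.grace_s = grace_s
        self.version = 1
        self.old = None  # (kid, evict_at) for the previous key, if any

    def rotate(self, now: float) -> str:
        self.old = (f"kid_v{self.version}", now + self.grace_s)
        self.version += 1
        return f"kid_v{self.version}"

    def published(self, now: float) -> set:
        kids = {f"kid_v{self.version}"}
        if self.old is not None and now < self.old[1]:
            kids.add(self.old[0])
        return kids


class CachingValidator:
    """Accepts a token if its kid is in the cached key set (model only)."""

    def __init__(self, mock: JwksMock, max_age_s: float, now: float):
        self.mock = mock
        self.max_age_s = max_age_s
        self.kids = mock.published(now)  # warm cache
        self.fetched_at = now

    def status(self, token_kid: str, now: float) -> int:
        # Refetch on the first request after the cached copy has expired.
        if now - self.fetched_at >= self.max_age_s:
            self.kids = self.mock.published(now)
            self.fetched_at = now
        return 200 if token_kid in self.kids else 401


def simulate() -> list:
    mock = JwksMock(grace_s=5)
    validator = CachingValidator(mock, max_age_s=60, now=0)
    results = [validator.status("kid_v1", now=1)]    # step 1: T1 accepted
    mock.rotate(now=2)                               # step 2: rotate
    results.append(validator.status("kid_v2", now=2))   # step 4: cache is stale
    results.append(validator.status("kid_v2", now=62))  # step 6: refreshed
    results.append(validator.status("kid_v1", now=63))  # step 7: kid_v1 evicted
    return results
```

Under these assumptions the model yields 200, 401, 200, 401 for steps 1, 4, 6, and 7. Because the 5s grace window is much shorter than the 60s cache age, the first refresh after rotation already lacks kid_v1, which is why the test can assert the eventual 401 for T1.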


NFT-RES-08: TOCTOU on default-vehicle exclusivity (race window)

Summary: Verifies AC-1.4: the clear-then-set is NOT transaction-wrapped, so a concurrent INSERT can leave 2+ defaults.

Traces to: AC-1.4 (carry-forward)

Preconditions:

  • seed_one_default_vehicle (default P1)

Fault injection: a second concurrent client issues INSERT INTO vehicles (..., is_default=true) directly against the DB via the side channel at the same moment as the API client issues POST /vehicles { IsDefault:true }.

Synchronisation: the test orchestrator uses pg_advisory_lock(1) to pause the service after its UPDATE vehicles SET is_default=FALSE and BEFORE its INSERT. The pause is implemented via a Postgres function that the test installs and the service path traverses, which requires an instrumented test build. If no instrumented build is available, the test is best-effort: it issues 100 parallel INSERTs.

Steps:

| Step | Action | Expected Behavior |
|------|--------|-------------------|
| 1 | Run 100 parallel iterations of (POST /vehicles { IsDefault:true } + side-channel INSERT (..., is_default=true)) | each iteration completes |
| 2 | Side-channel SELECT COUNT(*) FROM vehicles WHERE is_default=true | count is ≥ 2 in at least one iteration |

Pass criteria: at least one iteration produces default_count ≥ 2. If 0 iterations produce the race, the test FAILS. Either the race window has been closed (good news; rewrite the test to assert default_count == 1 and update AC-1.4 to remove the race carry-forward), OR the test concurrency primitive is wrong (investigate).

Note: this test is intentionally PROBABILISTIC. It asserts that the RACE EXISTS, not that the system is broken. A PASS here is bad news for the system but means the test is correctly observing the documented behavior.

Max execution time: 30s.
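The deterministic variant described under Synchronisation can be illustrated in miniature. In the sketch below an in-memory list stands in for the vehicles table, and two threading.Event objects play the role of the advisory lock that holds the service between its UPDATE and its INSERT. All names are hypothetical; the point is only to show that individually atomic statements still admit the race when the clear-then-set pair is not atomic as a whole.

```python
import threading


class VehicleStore:
    """In-memory stand-in for the vehicles table."""

    def __init__(self):
        self.rows = [{"name": "P1", "is_default": True}]  # seed_one_default_vehicle
        self._lock = threading.Lock()  # makes each single statement atomic

    def clear_defaults(self) -> None:  # UPDATE vehicles SET is_default=FALSE
        with self._lock:
            for row in self.rows:
                row["is_default"] = False

    def insert(self, name: str, is_default: bool) -> None:  # INSERT INTO vehicles
        with self._lock:
            self.rows.append({"name": name, "is_default": is_default})

    def default_count(self) -> int:
        with self._lock:
            return sum(1 for row in self.rows if row["is_default"])


def service_post_vehicle(store, name, after_clear, resume) -> None:
    """Clear-then-set WITHOUT a surrounding transaction (the AC-1.4 behavior)."""
    store.clear_defaults()
    after_clear.set()        # tell the test the race window is open
    resume.wait(timeout=5)   # stand-in for blocking on the advisory lock
    store.insert(name, True)


def run_race() -> int:
    store = VehicleStore()
    after_clear, resume = threading.Event(), threading.Event()
    worker = threading.Thread(
        target=service_post_vehicle, args=(store, "P2", after_clear, resume)
    )
    worker.start()
    after_clear.wait(timeout=5)
    store.insert("SIDE", True)  # side-channel INSERT lands inside the window
    resume.set()
    worker.join()
    return store.default_count()
```

Forcing the interleaving this way always leaves two defaults, which is the outcome the probabilistic 100-iteration version only hits some of the time.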


Notes

  • Tests that drop tables or leave rows in an inconsistent state (NFT-RES-01, NFT-RES-02, NFT-RES-08) run in a per-class IClassFixture<DbResetFixture> that recreates the schema after each scenario.
  • NFT-RES-03 through NFT-RES-07 require container-orchestration via docker compose from inside the test runner. The e2e-consumer container needs the docker CLI + a mounted Docker socket — alternatively, the test runs from the host with docker compose available there. Hardware-assessment will lock this preference.
  • NFT-RES-08 is intentionally probabilistic and may be flaky on slow runners. CI marks this as [Trait("Stability","probabilistic")] and tolerates ≤ 1 failed run per 5; deterministic implementation (advisory lock + instrumented build) is a follow-up.