Resilience Tests
Status: produced by autodev `/test-spec` Phase 2 (2026-05-14). Naming: post-rename target.
Resilience scenarios use the side-channel and container-orchestration tools to inject faults; the system under test is treated as a black box.
Critical scenarios: AC-3.3 (cascade NOT transaction-wrapped) and AC-3.4 (orphan-row race) are the highest-impact resilience invariants. They intentionally encode current (sub-optimal) behavior so that tests catch any silent change. Several other resilience tests verify documented operational behaviors (idempotent migrator, DB-down crash, JWKS key rotation).
NFT-RES-01: Cascade is NOT transaction-wrapped — partial deletes survive mid-walk failure
Summary: Verifies the documented ADR-006 carry-forward — when the cascade walk fails mid-way (e.g., media table absent), already-committed deletes remain.
Traces to: AC-3.3, AC-3.4 (related), AC-10.2
Preconditions:
- `fixture_cascade_F3` applied (chain rooted at `mid`)
- `missions` running
Fault injection: side-channel DROP TABLE media CASCADE; BEFORE the request — this turns the second sub-step of the cascade walk into a relation does not exist failure.
Steps:
| Step | Action | Expected Behavior |
|---|---|---|
| 1 | Side-channel `DROP TABLE media CASCADE` | succeeds |
| 2 | `DELETE /missions/{mid}` (JWT FL) | 500; envelope `{ statusCode:500, message:"Internal server error" }` |
| 3 | Side-channel `SELECT COUNT(*) FROM map_objects WHERE mission_id={mid}` | count == 0 (work BEFORE the failure point committed: non-zero pre-fault, zero post-fault) |
| 4 | Side-channel `SELECT COUNT(*) FROM missions WHERE id={mid}` | count == 1 (work AFTER the failure point did NOT run; row remains) |
| 5 | `docker logs missions \| grep "Unhandled exception"` | At least one matching log line containing `relation` and `media` |
Pass criteria: map_objects count is 0 (deleted before failure) AND missions count is 1 (not deleted because failure short-circuited the walk) AND the response is 500 with the redacted body.
Note: this test will FAIL once a transaction wrap is added (ADR-006 closure) — at that point ALL deletes will roll back and map_objects count will be >0. When the transaction wrap lands, update this test.
Max execution time: 10s.
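The pass criteria above combine three observations into one verdict. A minimal sketch of that check follows, assuming the test harness has already collected the HTTP response and the two side-channel counts; the `CascadeObservation` type and its field names are hypothetical and not part of the test suite.

```python
from dataclasses import dataclass


@dataclass
class CascadeObservation:
    """Hypothetical container for what the harness observed in steps 2-4."""
    status_code: int        # HTTP status of DELETE /missions/{mid}
    body: dict              # parsed response envelope
    map_objects_count: int  # side-channel count after the request
    missions_count: int     # side-channel count after the request


def partial_delete_survived(obs: CascadeObservation) -> bool:
    """True when the non-transactional cascade left a partial delete behind."""
    redacted = (
        obs.body.get("statusCode") == 500
        and obs.body.get("message") == "Internal server error"
    )
    return (
        obs.status_code == 500
        and redacted
        and obs.map_objects_count == 0  # deleted before the failure point
        and obs.missions_count == 1     # walk short-circuited before this row
    )
```

Once a transaction wrap lands, the same observation would show a non-zero `map_objects_count` and this check would return `False`, which is the signal to update the test.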
NFT-RES-02: Waypoint cascade is NOT transaction-wrapped (same invariant as F3)
Summary: Same invariant as NFT-RES-01, scoped to F4 (waypoint cascade).
Traces to: AC-4.6, AC-3.3 (same root cause), AC-10.2
Preconditions:
- `fixture_cascade_F4` applied (waypoint `wp1` with chain)
Fault injection: side-channel DROP TABLE media CASCADE BEFORE the request.
Steps:
| Step | Action | Expected Behavior |
|---|---|---|
| 1 | Side-channel `DROP TABLE media CASCADE` | succeeds |
| 2 | `DELETE /missions/{mid}/waypoints/{wp1}` (JWT FL) | 500 |
| 3 | Side-channel `SELECT COUNT(*) FROM detection WHERE annotation_id IN (wp1 chain)` | count == 0 (deleted before failure) |
| 4 | Side-channel `SELECT COUNT(*) FROM waypoints WHERE id={wp1}` | count == 1 (not deleted) |
Pass criteria: same shape as NFT-RES-01.
Max execution time: 10s.
NFT-RES-03: Idempotent migrator — second startup is a no-op
Summary: Verifies the migrator can run twice in a row without relation already exists errors.
Traces to: AC-6.6, AC-6.4
Preconditions:
- `seed_empty` (schema migrated once via first `missions` startup)
Fault injection: container restart (NOT volume reset).
Steps:
| Step | Action | Expected Behavior |
|---|---|---|
| 1 | `docker compose restart missions` | container exits cleanly + restarts |
| 2 | Wait for `GET /health` to return 200 | within 30s |
| 3 | `docker logs missions \| grep -E "(error\|Error\|exception)"` | no NEW error / exception lines after the restart timestamp |
| 4 | Side-channel `\d+ vehicles` | schema unchanged from after the first migrate |
Pass criteria: second start completes; no new error log lines; schema unchanged.
Max execution time: 60s.
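Step 3 only counts error lines logged after the restart, which a plain `grep` cannot express. The sketch below shows one way to filter by time, assuming the logs were captured with `docker logs --timestamps` so that every line starts with an RFC 3339 UTC timestamp. The function name is hypothetical, and comparison is at whole-second resolution for simplicity.

```python
import re
from datetime import datetime

# Same alternatives as the grep in step 3.
ERROR_PATTERN = re.compile(r"error|Error|exception")


def new_error_lines(log_text: str, restart_at: datetime) -> list[str]:
    """Return error/exception messages logged at or after restart_at (UTC)."""
    hits = []
    for line in log_text.splitlines():
        stamp, _, message = line.partition(" ")
        try:
            # Keep only YYYY-MM-DDTHH:MM:SS; fractional seconds are ignored.
            logged_at = datetime.strptime(stamp[:19], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # no leading timestamp; skipped in this sketch
        if logged_at >= restart_at and ERROR_PATTERN.search(message):
            hits.append(message)
    return hits
```

The scenario passes this step when the returned list is empty for the timestamp recorded just before `docker compose restart missions`.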
NFT-RES-04: B9 one-shot legacy table drop runs once and is idempotent
Summary: Verifies that the post-B9 DROP TABLE IF EXISTS orthophotos / gps_corrections block in the migrator is destructive on the FIRST run against a legacy device, and a no-op on subsequent runs.
Traces to: AC-6.5, AC-10.5
Preconditions:
- `seed_legacy_gps_tables` (schema includes `orthophotos` + `gps_corrections` with 1 row each)
- `missions` NOT yet started for this scenario
Fault injection: none — purely observe migrator behavior.
Steps:
| Step | Action | Expected Behavior |
|---|---|---|
| 1 | Side-channel `SELECT to_regclass('orthophotos'), to_regclass('gps_corrections')` | both NON-NULL (legacy tables present) |
| 2 | `docker compose up -d missions` | container starts |
| 3 | Wait for `GET /health` to return 200 | |
| 4 | Side-channel re-query | both NULL (dropped) |
| 5 | `docker compose restart missions` | |
| 6 | Side-channel re-query | both still NULL (idempotent; no error) |
| 7 | `docker logs missions \| grep -i "does not exist"` | NO log line (because of `IF EXISTS`) |
Pass criteria: legacy tables absent after first start; subsequent restarts produce no errors and leave them absent.
Note: this test only meaningfully runs on a post-B9 build. Before B9 lands, the migrator has no DROP block; gate this scenario on a build-time flag or by inspecting the migrator source.
Max execution time: 60s.
NFT-RES-05: Required configuration missing → fail-fast at startup
Summary: Verifies AC-6.1 / AC-6.2 / E3 — Infrastructure/ConfigurationResolver.ResolveRequiredOrThrow throws InvalidOperationException when any of the four required env vars (DATABASE_URL, JWT_ISSUER, JWT_AUDIENCE, JWT_JWKS_URL) is missing or whitespace-only. Also verifies AC-6.7 — DB unreachability (after config resolution succeeds) still causes process exit. The legacy "silent dev fallback boot" failure mode is structurally eliminated.
Traces to: AC-6.1, AC-6.2, AC-6.7, E3, E4
Preconditions:
- `missions` NOT running.
- This scenario uses `docker run` outside the main compose to isolate env-var manipulation.
Steps (each row is a separate docker run invocation; each times out at 30s):
| Step | Action | Expected Behavior |
|---|---|---|
| 1 | `docker run --rm azaion/missions:test` with ALL four required env vars unset | container exits non-zero within 5s; logs contain `InvalidOperationException`; logs mention at least one of the four required keys |
| 2 | `docker run` with `DATABASE_URL` unset; the three JWT vars set correctly | same shape; logs mention `DATABASE_URL` or `Database:Url` |
| 3 | `docker run` with `JWT_ISSUER=""` (whitespace-only); other three set | same shape; logs mention `JWT_ISSUER` or `Jwt:Issuer` |
| 4 | `docker run` with `JWT_AUDIENCE` unset; others set | same shape; logs mention `JWT_AUDIENCE` or `Jwt:Audience` |
| 5 | `docker run` with `JWT_JWKS_URL` unset; others set | same shape; logs mention `JWT_JWKS_URL` or `Jwt:JwksUrl` |
| 6 | `docker compose stop postgres-test`, then start `missions` with all four env vars set correctly (config resolution succeeds, then DB-connect fails) | container exits non-zero within 30s; logs contain a recognisable Npgsql connection error (e.g., `Connection refused`), NOT an `InvalidOperationException` from the resolver (this differentiates "config missing" from "config valid but DB down") |
Pass criteria: rows 1–5 → fail-fast at config resolution; row 6 → fail at DB-connect AFTER config resolution succeeded.
Note: this test now exercises BOTH the fail-fast resolver (rows 1–5) AND the DB-unreachable case (row 6). Pre-revision, only row 6 was tested under the assumption of hardcoded dev fallbacks.
Max execution time: 180s (6 docker-run cycles).
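The distinction between rows 1–5 and row 6 comes down to which marker appears in the container logs. The following sketch classifies a failed start from its exit code and log text. The function name and category labels are hypothetical, and matching on the substrings `Npgsql` and `Connection refused` is an assumption about how the connection error is rendered.

```python
def classify_startup_failure(exit_code: int, logs: str) -> str:
    """Map an exited container to the failure mode the table rows expect."""
    if exit_code == 0:
        return "started"
    if "InvalidOperationException" in logs:
        return "config-missing"   # rows 1-5: resolver threw before any DB access
    if "Npgsql" in logs or "Connection refused" in logs:
        return "db-unreachable"   # row 6: config resolved, DB connect failed
    return "unknown"


def mentions_key(logs: str, env_name: str, config_name: str) -> bool:
    """Rows 2-5 also require the log to name the missing key in either form."""
    return env_name in logs or config_name in logs
```

Checking for `InvalidOperationException` first keeps row 6 strict: a log that contains both markers is reported as `config-missing`, so a resolver failure cannot be mistaken for a DB outage.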
NFT-RES-06: DB missing (database does not exist) — process exits with Npgsql 3D000
Summary: Verifies AC-6.8 — when the azaion database does not exist, process exits with the documented PostgreSQL error code.
Traces to: AC-6.8
Preconditions:
- `postgres-test` running with the `azaion` database NOT yet created (use `POSTGRES_DB=postgres` instead, or `DROP DATABASE azaion`)
Fault injection: same as preconditions.
Steps:
| Step | Action | Expected Behavior |
|---|---|---|
| 1 | Side-channel `DROP DATABASE IF EXISTS azaion` | |
| 2 | `docker compose up -d missions` | |
| 3 | Poll exit code for ≤ 30s | non-zero |
| 4 | `docker logs missions \| grep "3D000"` | at least one match |
Pass criteria: container exits non-zero within 30s; logs contain 3D000.
Max execution time: 60s.
NFT-RES-07: JWKS key rotation — no missions restart required
Summary: Verifies AC-5.7 — rotating the signing key on admin (via jwks-mock POST /rotate-key) propagates to missions on the JWKS cache refresh tick without restarting missions. This is the primary operational win over the legacy shared-HMAC model, which required coordinated re-deploy across every backend on the device.
Traces to: AC-5.7
Preconditions:
- `missions` running with warm JWKS cache (any previous protected request succeeded).
- `jwks-mock` running with `Cache-Control: max-age=60` and `OldKeyGraceSeconds=5`.
- Token `T1` requested via `POST /sign` with the CURRENT `kid` (`kid_v1`), valid for 1h.
Fault injection: POST https://jwks-mock:8443/rotate-key {} — generates kid_v2, retains kid_v1 in the JWKS for OldKeyGraceSeconds, then evicts kid_v1.
Steps:
| Step | Action | Expected Behavior |
|---|---|---|
| 1 | `GET /vehicles` with `Authorization: Bearer T1` | 200 (cached JWKS knows `kid_v1`) |
| 2 | `POST https://jwks-mock:8443/rotate-key {}` → returns `kid_v2` | `jwks-mock` now publishes BOTH `kid_v1` and `kid_v2` in its JWKS for `OldKeyGraceSeconds=5` |
| 3 | Immediately request token `T2` signed with `kid_v2` via `POST /sign {}` | mock returns JWT with header `kid: kid_v2` |
| 4 | Immediately `GET /vehicles` with `Authorization: Bearer T2` (BEFORE the `missions` JWKS cache refresh) | 401 (cache still only has `kid_v1`) |
| 5 | Wait up to 90s for the `missions` service's `ConfigurationManager<JsonWebKeySet>` to refresh against the new JWKS (the mock's `max-age=60` triggers a refresh on the next request after that interval) | |
| 6 | `GET /vehicles` with `Authorization: Bearer T2` again | 200 (cache now contains `kid_v2`) |
| 7 | `GET /vehicles` with `Authorization: Bearer T1` (still has unexpired lifetime, signed with `kid_v1`) | 200 IF the JWKS refresh happened BEFORE the mock's `OldKeyGraceSeconds=5` window closed (the JWKS still had `kid_v1`); 401 AFTER the grace window when `missions` refreshes and `kid_v1` is no longer in the JWKS. Test asserts the eventual 401 |
| 8 | Verify `missions` was NEVER restarted during this scenario (`docker inspect --format '{{.State.StartedAt}}' missions` is unchanged from before step 1) | startup timestamp unchanged |
Pass criteria: rotation propagates without restart; T2 (new kid) eventually accepted; T1 (old kid) eventually rejected; missions startup timestamp unchanged.
Note: this test replaces the pre-revision "shared-secret rotation requires coordinated redeploy" scenario. The pre-revision test asserted that ALL services on the device had to restart together; the post-revision test asserts the opposite — they do NOT have to restart.
Max execution time: 180s (longest wait is the JWKS refresh tick).
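Steps 5–7 are "eventually" assertions: the test waits for a status change instead of sleeping for a fixed time. A minimal polling helper for that pattern is sketched below. The name `poll_until` and its parameters are hypothetical, and the clock and sleep functions are injectable only so the helper can be exercised without real waiting.

```python
import time
from typing import Callable


def poll_until(
    probe: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or timeout_s elapses.

    The probe always runs at least once, and once more at the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

For step 6, the probe would issue `GET /vehicles` with `T2` and return whether the status is 200, with `timeout_s=90`. For step 7, the probe would return whether `T1` yields 401.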
NFT-RES-08: TOCTOU on default-vehicle exclusivity (race window)
Summary: Verifies AC-1.4: the clear-then-set is NOT transaction-wrapped, so a concurrent INSERT can leave 2+ defaults.
Traces to: AC-1.4 (carry-forward)
Preconditions:
- `seed_one_default_vehicle` (default `P1`)
Fault injection: a second concurrent client issues INSERT INTO vehicles (..., is_default=true) directly to the side-channel DB at the same moment as the API client issues POST /vehicles { IsDefault:true }. Synchronisation: the test orchestrator pauses at pg_advisory_lock(1) after the service's UPDATE vehicles SET is_default=FALSE and BEFORE the service's INSERT — implemented via a Postgres function that the test installs and the service path traverses (this requires an instrumented test build; if not available, the test is best-effort by issuing 100 parallel INSERTs).
Steps:
| Step | Action | Expected Behavior |
|---|---|---|
| 1 | Run 100 parallel iterations of (`POST /vehicles { IsDefault:true }` + side-channel `INSERT (..., is_default=true)`) | each iteration completes |
| 2 | Side-channel `SELECT COUNT(*) FROM vehicles WHERE is_default=true` | count is ≥ 2 in at least one iteration |
Pass criteria: at least one iteration produces default_count ≥ 2. If 0 iterations produce the race, the test FAILS — either the race window has been closed (good news; rewrite the test to assert default_count == 1 and update AC-1.4 to remove the race carry-forward), OR the test concurrency primitive is wrong (investigate).
Note: this test is intentionally PROBABILISTIC — it asserts the RACE EXISTS, not that the system is broken. A PASS here is bad news for the system but means the test is correctly observing the documented behavior.
Max execution time: 30s.
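The race is easier to reason about as a fixed interleaving. The sketch below replays the clear-then-set sequence against an in-memory list standing in for the `vehicles` table, with the side-channel INSERT landing inside the window, and then applies the pass criterion to a list of per-iteration counts. All function names are hypothetical; this models the documented behavior and does not call the service.

```python
def clear_defaults(vehicles: list[dict]) -> None:
    """Models the service's UPDATE vehicles SET is_default=FALSE."""
    for vehicle in vehicles:
        vehicle["is_default"] = False


def insert_default(vehicles: list[dict], name: str) -> None:
    """Models an INSERT of a row with is_default=true."""
    vehicles.append({"name": name, "is_default": True})


def default_count(vehicles: list[dict]) -> int:
    return sum(1 for vehicle in vehicles if vehicle["is_default"])


def race_observed(counts_per_iteration: list[int]) -> bool:
    """Pass criterion: at least one iteration ended with 2+ defaults."""
    return any(count >= 2 for count in counts_per_iteration)


vehicles = [{"name": "P1", "is_default": True}]
clear_defaults(vehicles)                   # service, step 1 of clear-then-set
insert_default(vehicles, "side-channel")   # concurrent INSERT inside the window
insert_default(vehicles, "api")            # service, step 2 of clear-then-set
print(default_count(vehicles))             # prints 2
```

If the two service steps ran inside one transaction with appropriate locking, the side-channel INSERT could not land between them, which is the closed-window case described in the pass criteria.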
Notes
- Tests that drop tables (NFT-RES-01, NFT-RES-02, NFT-RES-08) run in a per-class `IClassFixture<DbResetFixture>` that recreates the schema after each scenario.
- NFT-RES-03 through NFT-RES-07 require container orchestration via `docker compose` from inside the test runner. The `e2e-consumer` container needs the `docker` CLI + a mounted Docker socket; alternatively, the test runs from the host with `docker compose` available there. Hardware-assessment will lock this preference.
- NFT-RES-08 is intentionally probabilistic and may be flaky on slow runners. CI marks this as `[Trait("Stability","probabilistic")]` and tolerates ≤ 1 failed run per 5; deterministic implementation (advisory lock + instrumented build) is a follow-up.
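The "≤ 1 failed run per 5" tolerance for NFT-RES-08 can be expressed as a small budget check. The sketch below assumes the budget is evaluated over the five most recent runs; the notes do not say whether the window is sliding or fixed, so that reading is an assumption, and the function name is hypothetical.

```python
def within_flake_budget(
    recent_runs: list[bool], window: int = 5, max_failures: int = 1
) -> bool:
    """True when the last `window` runs contain at most `max_failures` failures.

    Each entry is True for a passed run and False for a failed one.
    """
    last_runs = recent_runs[-window:]
    return last_runs.count(False) <= max_failures
```

For example, a history of pass, fail, pass, pass, pass stays within budget, while two failures among the last five runs does not.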