# Resilience Tests

> **Status**: produced by autodev `/test-spec` Phase 2 (2026-05-14).
> **Naming**: post-rename target. Resilience scenarios use the side-channel + container-orchestration tools to inject faults; the system-under-test is treated as a black box.
> **Critical scenarios**: AC-3.3 (cascade NOT transaction-wrapped) and AC-3.4 (orphan-row race) are the highest-impact resilience invariants. They intentionally encode current (sub-optimal) behavior so tests catch any silent change. Several other resilience tests verify documented operational behaviors (idempotent migrator, DB-down crash, JWKS key rotation).

---

### NFT-RES-01: Cascade is NOT transaction-wrapped — partial deletes survive mid-walk failure

**Summary**: Verifies the documented ADR-006 carry-forward: when the cascade walk fails mid-way (e.g., `media` table absent), already-committed deletes remain.

**Traces to**: AC-3.3, AC-3.4 (related), AC-10.2

**Preconditions**:
- `fixture_cascade_F3` applied (chain rooted at `mid`)
- `missions` running

**Fault injection**: side-channel `DROP TABLE media CASCADE;` BEFORE the request. This turns the second sub-step of the cascade walk into a `relation does not exist` failure.
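As an illustration of the invariant this scenario pins down (deletes issued before the failure point stay committed, deletes after it never run), here is a minimal sketch using an in-memory SQLite database as a stand-in for Postgres. The table names come from the scenario. The walk order (`map_objects`, then `media`, then `missions`), the column layout, and the helpers `cascade_delete` and `run_scenario` are assumptions made for the example, not the service's actual code.

```python
import sqlite3
from typing import Tuple


def cascade_delete(conn: sqlite3.Connection, mission_id: int) -> None:
    """Hypothetical non-transactional cascade walk: each DELETE commits on its own."""
    # Assumed walk order; the real service's order may differ.
    conn.execute("DELETE FROM map_objects WHERE mission_id = ?", (mission_id,))
    conn.execute("DELETE FROM media WHERE mission_id = ?", (mission_id,))
    conn.execute("DELETE FROM missions WHERE id = ?", (mission_id,))


def run_scenario() -> Tuple[int, int, bool]:
    # isolation_level=None puts sqlite3 in autocommit mode, so no implicit
    # transaction wraps the walk (mirrors the "NOT transaction-wrapped" behavior).
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript(
        """
        CREATE TABLE missions (id INTEGER PRIMARY KEY);
        CREATE TABLE map_objects (id INTEGER PRIMARY KEY, mission_id INTEGER);
        CREATE TABLE media (id INTEGER PRIMARY KEY, mission_id INTEGER);
        INSERT INTO missions (id) VALUES (1);
        INSERT INTO map_objects (mission_id) VALUES (1), (1);
        INSERT INTO media (mission_id) VALUES (1);
        """
    )
    # Fault injection: remove the table the second sub-step needs.
    # (SQLite has no CASCADE clause on DROP TABLE.)
    conn.execute("DROP TABLE media")

    failed = False
    try:
        cascade_delete(conn, 1)
    except sqlite3.OperationalError:
        # Stand-in for the Postgres "relation does not exist" failure.
        failed = True

    map_objects_left = conn.execute(
        "SELECT COUNT(*) FROM map_objects WHERE mission_id = 1"
    ).fetchone()[0]
    missions_left = conn.execute(
        "SELECT COUNT(*) FROM missions WHERE id = 1"
    ).fetchone()[0]
    conn.close()
    return map_objects_left, missions_left, failed
```

Under these assumptions, `run_scenario()` reports zero remaining `map_objects` rows, one remaining `missions` row, and a failed walk. Wrapping the three deletes in a single transaction would instead leave the `map_objects` rows in place, which is the change this scenario is designed to detect.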
**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Side-channel `DROP TABLE media CASCADE` | succeeds |
| 2 | `DELETE /missions/{mid}` (JWT `FL`) | `500`; envelope `{ statusCode:500, message:"Internal server error" }` |
| 3 | Side-channel `SELECT COUNT(*) FROM map_objects WHERE mission_id={mid}` | `count == 0` (work BEFORE the failure point committed: non-zero pre-fault, zero post-fault) |
| 4 | Side-channel `SELECT COUNT(*) FROM missions WHERE id={mid}` | `count == 1` (work AFTER the failure point did NOT run, so the row remains) |
| 5 | `docker logs missions \| grep "Unhandled exception"` | At least one matching log line containing `relation` and `media` |

**Pass criteria**: `map_objects` count is 0 (deleted before failure) AND `missions` count is 1 (not deleted because failure short-circuited the walk) AND the response is 500 with the redacted body.

**Note**: this test will FAIL once a transaction wrap is added (ADR-006 closure). At that point ALL deletes will roll back and `map_objects` count will be `>0`. When the transaction wrap lands, update this test.

**Max execution time**: 10s.

---

### NFT-RES-02: Waypoint cascade is NOT transaction-wrapped (same invariant as F3)

**Summary**: Same invariant as NFT-RES-01, scoped to F4 (waypoint cascade).

**Traces to**: AC-4.6, AC-3.3 (same root cause), AC-10.2

**Preconditions**:
- `fixture_cascade_F4` applied (waypoint `wp1` with chain)

**Fault injection**: side-channel `DROP TABLE media CASCADE` BEFORE the request.
**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Side-channel `DROP TABLE media CASCADE` | succeeds |
| 2 | `DELETE /missions/{mid}/waypoints/{wp1}` (JWT `FL`) | `500` |
| 3 | Side-channel `SELECT COUNT(*) FROM detection WHERE annotation_id IN (wp1 chain)` | `count == 0` (deleted before failure) |
| 4 | Side-channel `SELECT COUNT(*) FROM waypoints WHERE id={wp1}` | `count == 1` (not deleted) |

**Pass criteria**: same shape as NFT-RES-01.

**Max execution time**: 10s.

---

### NFT-RES-03: Idempotent migrator — second startup is a no-op

**Summary**: Verifies the migrator can run twice in a row without `relation already exists` errors.

**Traces to**: AC-6.6, AC-6.4

**Preconditions**:
- `seed_empty` (schema migrated once via first `missions` startup)

**Fault injection**: container restart (NOT volume reset).

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `docker compose restart missions` | container exits cleanly + restarts |
| 2 | Wait for `GET /health` to return `200` | within 30s |
| 3 | `docker logs missions \| grep -E "(error\|Error\|exception)"` | no NEW error / exception lines after the restart timestamp |
| 4 | Side-channel `\d+ vehicles` | schema unchanged from after the first migrate |

**Pass criteria**: second start completes; no new error log lines; schema unchanged.

**Max execution time**: 60s.

---

### NFT-RES-04: B9 one-shot legacy table drop runs once and is idempotent

**Summary**: Verifies that the post-B9 `DROP TABLE IF EXISTS orthophotos / gps_corrections` block in the migrator is destructive on the FIRST run against a legacy device, and a no-op on subsequent runs.

**Traces to**: AC-6.5, AC-10.5

**Preconditions**:
- `seed_legacy_gps_tables` (schema includes `orthophotos` + `gps_corrections` with 1 row each)
- `missions` NOT yet started for this scenario

**Fault injection**: none; purely observe migrator behavior.
**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Side-channel `SELECT to_regclass('orthophotos'), to_regclass('gps_corrections')` | both NON-NULL (legacy tables present) |
| 2 | `docker compose up -d missions` | container starts |
| 3 | Wait for `GET /health` to return `200` | |
| 4 | Side-channel re-query | both NULL (dropped) |
| 5 | `docker compose restart missions` | |
| 6 | Side-channel re-query | both still NULL (idempotent, no error) |
| 7 | `docker logs missions \| grep -i "does not exist"` | NO log line (because of `IF EXISTS`) |

**Pass criteria**: legacy tables absent after first start; subsequent restarts produce no errors and leave them absent.

**Note**: this test only meaningfully runs on a **post-B9 build**. Before B9 lands, the migrator has no DROP block; gate this scenario on a build-time flag or by inspecting the migrator source.

**Max execution time**: 60s.

---

### NFT-RES-05: Required configuration missing → fail-fast at startup

**Summary**: Verifies AC-6.1 / AC-6.2 / E3: `Infrastructure/ConfigurationResolver.ResolveRequiredOrThrow` throws `InvalidOperationException` when any of the four required env vars (`DATABASE_URL`, `JWT_ISSUER`, `JWT_AUDIENCE`, `JWT_JWKS_URL`) is missing or whitespace-only. Also verifies AC-6.7: DB unreachability (after config resolution succeeds) still causes process exit. The legacy "silent dev fallback boot" failure mode is structurally eliminated.

**Traces to**: AC-6.1, AC-6.2, AC-6.7, E3, E4

**Preconditions**:
- `missions` NOT running.
- This scenario uses `docker run` outside the main compose to isolate env-var manipulation.
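The fail-fast rule described in the summary above can be sketched as follows. This is a Python stand-in, not the C# `ConfigurationResolver`: the function name `resolve_required_or_throw`, the `ConfigError` type, and the message format are hypothetical. Only the four variable names and the "missing or whitespace-only" rule come from the scenario.

```python
from typing import Dict, Mapping

# The four required keys named in the scenario.
REQUIRED_KEYS = ("DATABASE_URL", "JWT_ISSUER", "JWT_AUDIENCE", "JWT_JWKS_URL")


class ConfigError(RuntimeError):
    """Hypothetical stand-in for the InvalidOperationException the spec names."""


def resolve_required_or_throw(env: Mapping[str, str]) -> Dict[str, str]:
    """Return the required settings, or raise on the first unusable one."""
    resolved: Dict[str, str] = {}
    for key in REQUIRED_KEYS:
        value = env.get(key)
        # Missing, empty, and whitespace-only values are all rejected.
        if value is None or not value.strip():
            # Assumed message format; it names the key so a log assertion
            # like "logs mention JWT_ISSUER" has something to match.
            raise ConfigError(f"Required configuration '{key}' is missing or empty")
        resolved[key] = value
    return resolved
```

Calling a function like this before any database connection is attempted is what lets a harness tell "config missing" apart from "config valid but DB down": the first fails in the resolver, the second fails later in the connection layer.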
**Steps** (each row is a separate `docker run` invocation; each times out at 30s):

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `docker run --rm azaion/missions:test` with ALL four required env vars unset | container exits non-zero within 5s; logs contain `InvalidOperationException`; logs mention at least one of the four required keys |
| 2 | `docker run` with `DATABASE_URL` unset; the three JWT vars set correctly | same shape; logs mention `DATABASE_URL` or `Database:Url` |
| 3 | `docker run` with `JWT_ISSUER=" "` (whitespace-only); other three set | same shape; logs mention `JWT_ISSUER` or `Jwt:Issuer` |
| 4 | `docker run` with `JWT_AUDIENCE` unset; others set | same shape; logs mention `JWT_AUDIENCE` or `Jwt:Audience` |
| 5 | `docker run` with `JWT_JWKS_URL` unset; others set | same shape; logs mention `JWT_JWKS_URL` or `Jwt:JwksUrl` |
| 6 | `docker compose stop postgres-test`, then start `missions` with all four env vars set correctly, so that config resolution succeeds and DB-connect then fails | container exits non-zero within 30s; logs contain a recognisable Npgsql connection error (e.g., `Connection refused`), NOT an `InvalidOperationException` from the resolver (this differentiates "config missing" from "config valid but DB down") |

**Pass criteria**: rows 1–5 → fail-fast at config resolution; row 6 → fail at DB-connect AFTER config resolution succeeded.

**Note**: this test now exercises BOTH the fail-fast resolver (rows 1–5) AND the DB-unreachable case (row 6). Pre-revision, only row 6 was tested under the assumption of hardcoded dev fallbacks.

**Max execution time**: 180s (6 docker-run cycles).

---

### NFT-RES-06: DB missing (database does not exist) — process exits with Npgsql 3D000

**Summary**: Verifies AC-6.8: when the `azaion` database does not exist, the process exits with the documented PostgreSQL error code.
**Traces to**: AC-6.8

**Preconditions**:
- `postgres-test` running with the `azaion` database NOT yet created (use `POSTGRES_DB=postgres` instead, or `DROP DATABASE azaion`)

**Fault injection**: same as preconditions.

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Side-channel `DROP DATABASE IF EXISTS azaion` | |
| 2 | `docker compose up -d missions` | |
| 3 | Poll exit code for ≤ 30s | non-zero |
| 4 | `docker logs missions \| grep "3D000"` | at least one match |

**Pass criteria**: container exits non-zero within 30s; logs contain `3D000`.

**Max execution time**: 60s.

---

### NFT-RES-07: JWKS key rotation — no missions restart required

**Summary**: Verifies AC-5.7: rotating the signing key on `admin` (via `jwks-mock POST /rotate-key`) propagates to `missions` on the JWKS cache refresh tick **without restarting `missions`**. This is the primary operational win over the legacy shared-HMAC model, which required coordinated re-deploy across every backend on the device.

**Traces to**: AC-5.7

**Preconditions**:
- `missions` running with warm JWKS cache (any previous protected request succeeded).
- `jwks-mock` running with `Cache-Control: max-age=60` and `OldKeyGraceSeconds=5`.
- Token `T1` requested via `POST /sign` with the CURRENT `kid` (`kid_v1`), valid for 1h.

**Fault injection**: `POST https://jwks-mock:8443/rotate-key {}`. This generates `kid_v2`, retains `kid_v1` in the JWKS for `OldKeyGraceSeconds`, then evicts `kid_v1`.
**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `GET /vehicles` with `Authorization: Bearer T1` | `200` (cached JWKS knows `kid_v1`) |
| 2 | `POST https://jwks-mock:8443/rotate-key {}` → returns `kid_v2` | jwks-mock now publishes BOTH `kid_v1` and `kid_v2` in its JWKS for `OldKeyGraceSeconds=5` |
| 3 | Immediately request token `T2` signed with `kid_v2` via `POST /sign {}` | mock returns JWT with header `kid: kid_v2` |
| 4 | Immediately `GET /vehicles` with `Authorization: Bearer T2` (BEFORE `missions` JWKS cache refresh) | `401` (cache still only has `kid_v1`) |
| 5 | Wait up to 90s for `missions`'s `ConfigurationManager` to refresh against the new JWKS (the mock's `max-age=60` triggers a refresh on the next request after that interval) | none (wait only) |
| 6 | `GET /vehicles` with `Authorization: Bearer T2` again | `200` (cache now contains `kid_v2`) |
| 7 | `GET /vehicles` with `Authorization: Bearer T1` (still has unexpired lifetime, signed with `kid_v1`) | `200` IF the JWKS refresh happened BEFORE the mock's `OldKeyGraceSeconds=5` window closed (the JWKS still had `kid_v1`); `401` AFTER the grace window when `missions` refreshes and `kid_v1` is no longer in the JWKS. Test asserts the eventual `401` |
| 8 | Verify `missions` was NEVER restarted during this scenario (`docker inspect --format '{{.State.StartedAt}}' missions` is unchanged from before step 1) | startup timestamp unchanged |

**Pass criteria**: rotation propagates without restart; `T2` (new kid) eventually accepted; `T1` (old kid) eventually rejected; `missions` startup timestamp unchanged.

**Note**: this test replaces the pre-revision "shared-secret rotation requires coordinated redeploy" scenario. The pre-revision test asserted that ALL services on the device had to restart together; the post-revision test asserts the opposite: they do NOT have to restart.

**Max execution time**: 180s (longest wait is the JWKS refresh tick).
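Steps 5 to 7 of NFT-RES-07 are "eventually" assertions, so a harness would typically poll rather than sleep for a fixed 90s. A minimal sketch of such a helper follows. `poll_until` and its parameters are hypothetical names introduced for illustration; the real test would pass a callable that issues `GET /vehicles` with the relevant bearer token and returns the HTTP status code.

```python
import time
from typing import Callable


def poll_until(
    probe: Callable[[], int],
    expected_status: int,
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns expected_status or timeout_s elapses.

    Returns True on the first matching status, False if the deadline passes.
    clock and sleep are injectable so the helper can be unit-tested
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if probe() == expected_status:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With a helper like this, step 6 could be expressed as `poll_until(get_vehicles_with_t2, 200, timeout_s=90)` and the eventual rejection in step 7 as `poll_until(get_vehicles_with_t1, 401, timeout_s=90)`, where both `get_vehicles_with_*` callables are placeholders for the harness's own request code.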
---

### NFT-RES-08: TOCTOU on default-vehicle exclusivity (race window)

**Summary**: Verifies AC-1.4: the clear-then-set is NOT transaction-wrapped, so a concurrent INSERT can leave 2+ defaults.

**Traces to**: AC-1.4 (carry-forward)

**Preconditions**:
- `seed_one_default_vehicle` (default `P1`)

**Fault injection**: a second concurrent client issues `INSERT INTO vehicles (..., is_default=true)` directly to the side-channel DB at the same moment as the API client issues `POST /vehicles { IsDefault:true }`. Synchronisation: the test orchestrator pauses at `pg_advisory_lock(1)` after the service's `UPDATE vehicles SET is_default=FALSE` and BEFORE the service's `INSERT`. This is implemented via a Postgres function that the test installs and the service path traverses (this requires an instrumented test build; if not available, the test is best-effort by issuing 100 parallel INSERTs).

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Run 100 parallel iterations of `(POST /vehicles { IsDefault:true } + side-channel INSERT (..., is_default=true))` | each iteration completes |
| 2 | Side-channel `SELECT COUNT(*) FROM vehicles WHERE is_default=true` | count is `≥ 2` in at least one iteration |

**Pass criteria**: at least one iteration produces `default_count ≥ 2`. If 0 iterations produce the race, the test FAILS: either the race window has been closed (good news; rewrite the test to assert `default_count == 1` and update AC-1.4 to remove the race carry-forward), OR the test concurrency primitive is wrong (investigate).

**Note**: this test is intentionally PROBABILISTIC: it asserts the RACE EXISTS, not that the system is broken. A `PASS` here is bad news for the system but means the test is correctly observing the documented behavior.

**Max execution time**: 30s.

---

## Notes

- Tests that drop tables (NFT-RES-01, NFT-RES-02, NFT-RES-08) run in a per-class `IClassFixture` that recreates the schema after each scenario.
- NFT-RES-03 through NFT-RES-07 require container-orchestration via `docker compose` from inside the test runner. The `e2e-consumer` container needs the `docker` CLI + a mounted Docker socket; alternatively, the test runs from the host with `docker compose` available there. Hardware-assessment will lock this preference.
- NFT-RES-08 is intentionally probabilistic and may be flaky on slow runners. CI marks this as `[Trait("Stability","probabilistic")]` and tolerates ≤ 1 failed run per 5; deterministic implementation (advisory lock + instrumented build) is a follow-up.
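The two decision rules attached to NFT-RES-08 (the per-run pass rule and the CI tolerance in the last note) could be expressed roughly as below. Both function names are hypothetical, and reading "≤ 1 failed run per 5" as a window over the five most recent runs is an assumption; the note does not say how the five runs are selected.

```python
from typing import Sequence


def race_observed(default_counts: Sequence[int]) -> bool:
    """Per-run pass rule: at least one iteration ended with 2+ default vehicles."""
    return any(count >= 2 for count in default_counts)


def ci_tolerates(run_results: Sequence[bool], max_failures: int = 1, window: int = 5) -> bool:
    """CI tolerance under the assumed reading: look at the most recent
    `window` runs and accept if at most `max_failures` of them failed."""
    recent = list(run_results)[-window:]
    return recent.count(False) <= max_failures
```

Here `default_counts` holds the `SELECT COUNT(*) ... WHERE is_default=true` result recorded after each iteration, and `run_results` holds one boolean per CI run of the scenario, oldest first.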