# Resilience Tests

> **Status**: produced by autodev `/test-spec` Phase 2 (2026-05-14).
> **Naming**: post-rename target. Resilience scenarios use the side-channel + container-orchestration tools to inject faults; the system-under-test is treated as a black box.
> **Critical scenarios**: AC-3.3 (cascade NOT transaction-wrapped) and AC-3.4 (orphan-row race) are the highest-impact resilience invariants: they intentionally encode current (sub-optimal) behavior so tests catch any silent change. Several other resilience tests verify documented operational behaviors (idempotent migrator, DB-down crash, JWKS key rotation).

---
### NFT-RES-01: Cascade is NOT transaction-wrapped; partial deletes survive mid-walk failure

**Summary**: Verifies the documented ADR-006 carry-forward: when the cascade walk fails mid-way (e.g., `media` table absent), already-committed deletes remain.
**Traces to**: AC-3.3, AC-3.4 (related), AC-10.2

**Preconditions**:
- `fixture_cascade_F3` applied (chain rooted at `mid`)
- `missions` running

**Fault injection**: side-channel `DROP TABLE media CASCADE;` BEFORE the request. This turns the second sub-step of the cascade walk into a `relation does not exist` failure.

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Side-channel `DROP TABLE media CASCADE` | succeeds |
| 2 | `DELETE /missions/{mid}` (JWT `FL`) | `500`; envelope `{ statusCode:500, message:"Internal server error" }` |
| 3 | Side-channel `SELECT COUNT(*) FROM map_objects WHERE mission_id={mid}` | `count == 0` (work BEFORE the failure point was committed: non-zero pre-fault, zero post-fault) |
| 4 | Side-channel `SELECT COUNT(*) FROM missions WHERE id={mid}` | `count == 1` (work AFTER the failure point did NOT run, so the row remains) |
| 5 | `docker logs missions \| grep "Unhandled exception"` | At least one matching log line containing `relation` and `media` |

**Pass criteria**: `map_objects` count is 0 (deleted before failure) AND `missions` count is 1 (not deleted because failure short-circuited the walk) AND the response is 500 with the redacted body.
**Note**: this test will FAIL once a transaction wrap is added (ADR-006 closure). At that point ALL deletes will roll back and `map_objects` count will be `>0`. When the transaction wrap lands, update this test.
**Max execution time**: 10s.

---
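The NFT-RES-01 invariant above can be reproduced in miniature. The following is a minimal sketch, not the service's cascade code: it uses an in-memory SQLite database in autocommit mode as a stand-in for PostgreSQL, and the column `media.mission_id`, the three-step walk order, and the function name `run_cascade_demo` are assumptions made for illustration.

```python
import sqlite3


def run_cascade_demo() -> dict:
    """Replay the NFT-RES-01 fault against an in-memory SQLite stand-in."""
    # isolation_level=None puts sqlite3 in autocommit mode: every statement
    # commits on its own, mirroring a cascade walk with no transaction wrap.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE missions (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE map_objects (id INTEGER PRIMARY KEY, mission_id INTEGER)")
    conn.execute("CREATE TABLE media (id INTEGER PRIMARY KEY, mission_id INTEGER)")
    conn.execute("INSERT INTO missions (id) VALUES (1)")
    conn.executemany("INSERT INTO map_objects (mission_id) VALUES (?)", [(1,), (1,)])

    # Fault injection: drop the table that the second sub-step needs.
    conn.execute("DROP TABLE media")

    # Assumed walk order: map_objects, then media, then the mission row itself.
    walk = [
        "DELETE FROM map_objects WHERE mission_id = ?",
        "DELETE FROM media WHERE mission_id = ?",
        "DELETE FROM missions WHERE id = ?",
    ]
    failed = False
    try:
        for statement in walk:
            conn.execute(statement, (1,))
    except sqlite3.OperationalError:
        # SQLite's "no such table" stands in for "relation does not exist".
        failed = True

    map_objects = conn.execute(
        "SELECT COUNT(*) FROM map_objects WHERE mission_id = 1"
    ).fetchone()[0]
    missions = conn.execute("SELECT COUNT(*) FROM missions WHERE id = 1").fetchone()[0]
    conn.close()
    return {"failed": failed, "map_objects": map_objects, "missions": missions}
```

Wrapping the loop in a single transaction would flip the `map_objects` count back to its pre-fault value, which is the change the NFT-RES-01 note anticipates.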
### NFT-RES-02: Waypoint cascade is NOT transaction-wrapped (same invariant as F3)

**Summary**: Same invariant as NFT-RES-01, scoped to F4 (waypoint cascade).
**Traces to**: AC-4.6, AC-3.3 (same root cause), AC-10.2

**Preconditions**:
- `fixture_cascade_F4` applied (waypoint `wp1` with chain)

**Fault injection**: side-channel `DROP TABLE media CASCADE` BEFORE the request.

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Side-channel `DROP TABLE media CASCADE` | succeeds |
| 2 | `DELETE /missions/{mid}/waypoints/{wp1}` (JWT `FL`) | `500` |
| 3 | Side-channel `SELECT COUNT(*) FROM detection WHERE annotation_id IN (wp1 chain)` | `count == 0` (deleted before failure) |
| 4 | Side-channel `SELECT COUNT(*) FROM waypoints WHERE id={wp1}` | `count == 1` (not deleted) |

**Pass criteria**: same shape as NFT-RES-01.
**Max execution time**: 10s.

---
### NFT-RES-03: Idempotent migrator; second startup is a no-op

**Summary**: Verifies the migrator can run twice in a row without `relation already exists` errors.
**Traces to**: AC-6.6, AC-6.4

**Preconditions**:
- `seed_empty` (schema migrated once via first `missions` startup)

**Fault injection**: container restart (NOT volume reset).

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `docker compose restart missions` | container exits cleanly and restarts |
| 2 | Wait for `GET /health` to return `200` | within 30s |
| 3 | `docker logs missions \| grep -E "(error\|Error\|exception)"` | no NEW error / exception lines after the restart timestamp |
| 4 | Side-channel `\d+ vehicles` | schema unchanged from after the first migrate |

**Pass criteria**: second start completes; no new error log lines; schema unchanged.
**Max execution time**: 60s.

---
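Step 3 of NFT-RES-03 depends on separating NEW error lines from those logged before the restart. One way to sketch that filter, assuming the logs are captured with `docker logs --timestamps` (so each line starts with a UTC RFC 3339 timestamp ending in `Z`), is shown below; the helper names `parse_line` and `new_error_lines` are hypothetical.

```python
import re
from datetime import datetime, timezone

# Same alternation as the grep in step 3.
ERROR_PATTERN = re.compile(r"error|Error|exception")
# Timestamp prefix as emitted by `docker logs --timestamps` (assumed UTC, "Z" suffix).
TS_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z\s(.*)$")


def parse_line(line):
    """Split a timestamped log line into (datetime, message), or None if unparseable."""
    match = TS_PATTERN.match(line)
    if match is None:
        return None
    base, fraction, message = match.groups()
    # Keep microsecond precision; docker emits nanoseconds.
    micros = int((fraction or "0")[:6].ljust(6, "0"))
    stamp = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(
        microsecond=micros, tzinfo=timezone.utc
    )
    return stamp, message


def new_error_lines(log_text, restarted_at):
    """Return error/exception messages logged strictly after restarted_at."""
    hits = []
    for line in log_text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if stamp > restarted_at and ERROR_PATTERN.search(message):
            hits.append(message)
    return hits
```

The step passes when `new_error_lines(...)` returns an empty list for the timestamp recorded just before `docker compose restart missions`.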
### NFT-RES-04: B9 one-shot legacy table drop runs once and is idempotent

**Summary**: Verifies that the post-B9 `DROP TABLE IF EXISTS orthophotos / gps_corrections` block in the migrator is destructive on the FIRST run against a legacy device, and a no-op on subsequent runs.
**Traces to**: AC-6.5, AC-10.5

**Preconditions**:
- `seed_legacy_gps_tables` (schema includes `orthophotos` + `gps_corrections` with 1 row each)
- `missions` NOT yet started for this scenario

**Fault injection**: none; purely observe migrator behavior.

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Side-channel `SELECT to_regclass('orthophotos'), to_regclass('gps_corrections')` | both NON-NULL (legacy tables present) |
| 2 | `docker compose up -d missions` | container starts |
| 3 | Wait for `GET /health` to return `200` | |
| 4 | Side-channel re-query | both NULL (dropped) |
| 5 | `docker compose restart missions` | |
| 6 | Side-channel re-query | both still NULL (idempotent, no error) |
| 7 | `docker logs missions \| grep -i "does not exist"` | NO log line (because of `IF EXISTS`) |

**Pass criteria**: legacy tables absent after first start; subsequent restarts produce no errors and leave them absent.
**Note**: this test only meaningfully runs on a **post-B9 build**. Before B9 lands, the migrator has no DROP block; gate this scenario on a build-time flag or by inspecting the migrator source.
**Max execution time**: 60s.

---
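The destructive-once, no-op-afterwards shape that NFT-RES-04 checks can be illustrated with a small stand-in. This sketch assumes SQLite in place of PostgreSQL (a `sqlite_master` lookup plays the role of `to_regclass`), and `run_drop_block` is a hypothetical model of the migrator's drop block rather than its actual code.

```python
import sqlite3

LEGACY_TABLES = ("orthophotos", "gps_corrections")


def table_exists(conn, name):
    """SQLite analogue of checking to_regclass(name) for NON-NULL."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


def run_drop_block(conn):
    """Hypothetical model of the post-B9 drop block; IF EXISTS makes reruns safe."""
    for name in LEGACY_TABLES:
        conn.execute(f"DROP TABLE IF EXISTS {name}")


def run_legacy_drop_demo():
    """Return table presence before the first start, after it, and after a restart."""
    conn = sqlite3.connect(":memory:")
    for name in LEGACY_TABLES:
        conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY)")
        conn.execute(f"INSERT INTO {name} DEFAULT VALUES")  # 1 row each, as seeded

    observed = [tuple(table_exists(conn, n) for n in LEGACY_TABLES)]
    run_drop_block(conn)  # first start: destructive
    observed.append(tuple(table_exists(conn, n) for n in LEGACY_TABLES))
    run_drop_block(conn)  # restart: must not raise
    observed.append(tuple(table_exists(conn, n) for n in LEGACY_TABLES))
    conn.close()
    return observed
```

Without `IF EXISTS`, the second `run_drop_block` call would raise, which corresponds to the `does not exist` log line that step 7 asserts is absent.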
### NFT-RES-05: Required configuration missing → fail-fast at startup

**Summary**: Verifies AC-6.1 / AC-6.2 / E3: `Infrastructure/ConfigurationResolver.ResolveRequiredOrThrow` throws `InvalidOperationException` when any of the four required env vars (`DATABASE_URL`, `JWT_ISSUER`, `JWT_AUDIENCE`, `JWT_JWKS_URL`) is missing or whitespace-only. Also verifies AC-6.7: DB unreachability (after config resolution succeeds) still causes process exit. The legacy "silent dev fallback boot" failure mode is structurally eliminated.
**Traces to**: AC-6.1, AC-6.2, AC-6.7, E3, E4

**Preconditions**:
- `missions` NOT running.
- This scenario uses `docker run` outside the main compose to isolate env-var manipulation.

**Steps** (each row is a separate `docker run` invocation; each times out at 30s):

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `docker run --rm azaion/missions:test` with ALL four required env vars unset | container exits non-zero within 5s; logs contain `InvalidOperationException`; logs mention at least one of the four required keys |
| 2 | `docker run` with `DATABASE_URL` unset; the three JWT vars set correctly | same shape; logs mention `DATABASE_URL` or `Database:Url` |
| 3 | `docker run` with `JWT_ISSUER=""` (empty or whitespace-only); other three set | same shape; logs mention `JWT_ISSUER` or `Jwt:Issuer` |
| 4 | `docker run` with `JWT_AUDIENCE` unset; others set | same shape; logs mention `JWT_AUDIENCE` or `Jwt:Audience` |
| 5 | `docker run` with `JWT_JWKS_URL` unset; others set | same shape; logs mention `JWT_JWKS_URL` or `Jwt:JwksUrl` |
| 6 | `docker compose stop postgres-test`, then start `missions` with all four env vars set correctly: config resolution succeeds, then DB-connect fails | container exits non-zero within 30s; logs contain a recognisable Npgsql connection error (e.g., `Connection refused`), NOT an `InvalidOperationException` from the resolver (this differentiates "config missing" from "config valid but DB down") |

**Pass criteria**: rows 1–5 → fail-fast at config resolution; row 6 → fail at DB-connect AFTER config resolution succeeded.
**Note**: this test now exercises BOTH the fail-fast resolver (rows 1–5) AND the DB-unreachable case (row 6). Pre-revision, only row 6 was tested under the assumption of hardcoded dev fallbacks.
**Max execution time**: 180s (6 docker-run cycles).

---
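Rows 1–5 of NFT-RES-05 form a small environment matrix, which a harness could generate rather than hand-write. The sketch below is one possible shape, not the project's harness: the placeholder values in `VALID_ENV` and the helpers `missing_keys`, `fail_fast_cases`, and `docker_run_args` are all hypothetical; only the four key names, the image tag, and `docker run --rm -e` come from the table or the Docker CLI.

```python
REQUIRED_KEYS = ("DATABASE_URL", "JWT_ISSUER", "JWT_AUDIENCE", "JWT_JWKS_URL")

# Placeholder values only; the real settings are deployment-specific.
VALID_ENV = {
    "DATABASE_URL": "placeholder-database-url",
    "JWT_ISSUER": "placeholder-issuer",
    "JWT_AUDIENCE": "placeholder-audience",
    "JWT_JWKS_URL": "placeholder-jwks-url",
}


def missing_keys(env):
    """Keys the resolver is expected to reject: unset, empty, or whitespace-only."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


def fail_fast_cases():
    """Return (row, env) pairs matching rows 1-5 of the steps table."""
    cases = [(1, {})]  # row 1: all four unset
    for row, key in enumerate(REQUIRED_KEYS, start=2):
        env = dict(VALID_ENV)
        if key == "JWT_ISSUER":
            env[key] = ""  # row 3 passes an empty value instead of unsetting
        else:
            del env[key]
        cases.append((row, env))
    return cases


def docker_run_args(env, image="azaion/missions:test"):
    """Build the argv for one isolated `docker run` invocation."""
    args = ["docker", "run", "--rm"]
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]
    return args + [image]
```

Each generated case can then be executed with a 30s timeout and its logs checked for `InvalidOperationException` plus one of the names returned by `missing_keys`.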
### NFT-RES-06: DB missing (database does not exist); process exits with Npgsql 3D000

**Summary**: Verifies AC-6.8: when the `azaion` database does not exist, the process exits with the documented PostgreSQL error code (SQLSTATE `3D000`, `invalid_catalog_name`).
**Traces to**: AC-6.8

**Preconditions**:
- `postgres-test` running with the `azaion` database NOT yet created (use `POSTGRES_DB=postgres` instead, or `DROP DATABASE azaion`)

**Fault injection**: same as preconditions.

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Side-channel `DROP DATABASE IF EXISTS azaion` | |
| 2 | `docker compose up -d missions` | |
| 3 | Poll exit code for ≤ 30s | non-zero |
| 4 | `docker logs missions \| grep "3D000"` | at least one match |

**Pass criteria**: container exits non-zero within 30s; logs contain `3D000`.
**Max execution time**: 60s.

---
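Step 3 of NFT-RES-06 ("poll exit code for ≤ 30s") can be sketched as a bounded polling loop. This is an illustrative helper under stated assumptions: `poll_exit_code` and `docker_exit_probe` are hypothetical names, the container is assumed to be addressable as `missions` (as the `docker logs missions` steps imply), and the clock and sleep functions are injectable so the loop can be exercised without Docker.

```python
import subprocess
import time
from typing import Callable, Optional


def poll_exit_code(
    probe: Callable[[], Optional[int]],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Call probe() until it reports an exit code or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        code = probe()
        if code is not None:
            return code
        if clock() >= deadline:
            return None  # still running at the deadline: the step fails
        sleep(interval_s)


def docker_exit_probe(container: str = "missions") -> Optional[int]:
    """Return the container's exit code once its state is "exited", else None."""
    fields = subprocess.run(
        ["docker", "inspect", "--format",
         "{{.State.Status}} {{.State.ExitCode}}", container],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    status, exit_code = fields[0], int(fields[1])
    return exit_code if status == "exited" else None
```

A harness would call `poll_exit_code(docker_exit_probe)` and assert that the result is neither `None` nor `0` before grepping the logs for `3D000`.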
### NFT-RES-07: JWKS key rotation; no missions restart required

**Summary**: Verifies AC-5.7: rotating the signing key on `admin` (via `jwks-mock POST /rotate-key`) propagates to `missions` on the JWKS cache refresh tick **without restarting `missions`**. This is the primary operational win over the legacy shared-HMAC model, which required coordinated re-deploy across every backend on the device.
**Traces to**: AC-5.7

**Preconditions**:
- `missions` running with warm JWKS cache (any previous protected request succeeded).
- `jwks-mock` running with `Cache-Control: max-age=60` and `OldKeyGraceSeconds=5`.
- Token `T1` requested via `POST /sign` with the CURRENT `kid` (`kid_v1`), valid for 1h.

**Fault injection**: `POST https://jwks-mock:8443/rotate-key {}` generates `kid_v2`, retains `kid_v1` in the JWKS for `OldKeyGraceSeconds`, then evicts `kid_v1`.

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `GET /vehicles` with `Authorization: Bearer T1` | `200` (cached JWKS knows `kid_v1`) |
| 2 | `POST https://jwks-mock:8443/rotate-key {}` → returns `kid_v2` | jwks-mock now publishes BOTH `kid_v1` and `kid_v2` in its JWKS for `OldKeyGraceSeconds=5` |
| 3 | Immediately request token `T2` signed with `kid_v2` via `POST /sign {}` | mock returns JWT with header `kid: kid_v2` |
| 4 | Immediately `GET /vehicles` with `Authorization: Bearer T2` (BEFORE `missions` JWKS cache refresh) | `401` (cache still only has `kid_v1`) |
| 5 | Wait up to 90s for `missions`'s `ConfigurationManager<JsonWebKeySet>` to refresh against the new JWKS (the mock's `max-age=60` triggers a refresh on the next request after that interval) | |
| 6 | `GET /vehicles` with `Authorization: Bearer T2` again | `200` (cache now contains `kid_v2`) |
| 7 | `GET /vehicles` with `Authorization: Bearer T1` (still has unexpired lifetime, signed with `kid_v1`) | `200` only if `missions` refreshed its JWKS BEFORE the mock's `OldKeyGraceSeconds=5` window closed (the JWKS still had `kid_v1`); `401` once `missions` refreshes AFTER the grace window and `kid_v1` is no longer in the JWKS. The test asserts the eventual `401` |
| 8 | Verify `missions` was NEVER restarted during this scenario (`docker inspect --format '{{.State.StartedAt}}' missions` is unchanged from before step 1) | startup timestamp unchanged |

**Pass criteria**: rotation propagates without restart; `T2` (new kid) eventually accepted; `T1` (old kid) eventually rejected; `missions` startup timestamp unchanged.
**Note**: this test replaces the pre-revision "shared-secret rotation requires coordinated redeploy" scenario. The pre-revision test asserted that ALL services on the device had to restart together; the post-revision test asserts the opposite: they do NOT have to restart.
**Max execution time**: 180s (longest wait is the JWKS refresh tick).

---
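The timing in NFT-RES-07 is easier to follow as a simulation with an explicit clock. The sketch below is a deliberately simplified model, not the behavior of `ConfigurationManager<JsonWebKeySet>` or of `jwks-mock`: it assumes the cache refetches only when a request arrives at least `max-age` seconds after the last fetch, and the classes `JwksMock` and `JwksCache` plus the chosen timestamps are hypothetical.

```python
class JwksMock:
    """Toy key server: the old kid stays published for grace_s after a rotation."""

    def __init__(self, grace_s: float = 5.0) -> None:
        self.grace_s = grace_s
        self.current_kid = "kid_v1"
        self.retiring = {}  # kid -> time at which it is evicted

    def rotate(self, new_kid: str, now: float) -> str:
        self.retiring[self.current_kid] = now + self.grace_s
        self.current_kid = new_kid
        return new_kid

    def published_kids(self, now: float) -> set:
        still_valid = {kid for kid, evict_at in self.retiring.items() if now < evict_at}
        return still_valid | {self.current_kid}


class JwksCache:
    """Toy validator cache: refetches on the first request after max_age_s."""

    def __init__(self, mock: JwksMock, max_age_s: float = 60.0, now: float = 0.0) -> None:
        self.mock = mock
        self.max_age_s = max_age_s
        self.kids = mock.published_kids(now)  # warm cache
        self.fetched_at = now

    def status(self, kid: str, now: float) -> int:
        if now - self.fetched_at >= self.max_age_s:
            self.kids = self.mock.published_kids(now)
            self.fetched_at = now
        return 200 if kid in self.kids else 401


def run_rotation_timeline() -> list:
    mock = JwksMock(grace_s=5.0)
    cache = JwksCache(mock, max_age_s=60.0, now=0.0)
    statuses = [cache.status("kid_v1", now=10.0)]      # step 1: T1 accepted
    mock.rotate("kid_v2", now=10.0)                    # step 2: rotate
    statuses.append(cache.status("kid_v2", now=10.0))  # step 4: T2 unknown to cache
    statuses.append(cache.status("kid_v2", now=70.0))  # step 6: refreshed, T2 accepted
    statuses.append(cache.status("kid_v1", now=70.0))  # step 7: grace over, T1 rejected
    return statuses
```

In this model the refresh at t=70 happens long after the 5s grace window closed at t=15, so `kid_v1` is already gone from the published set, which is the "eventual `401`" branch of step 7.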
### NFT-RES-08: TOCTOU on default-vehicle exclusivity (race window)

**Summary**: Verifies AC-1.4: the clear-then-set is NOT transaction-wrapped, so a concurrent INSERT can leave 2+ defaults.
**Traces to**: AC-1.4 (carry-forward)

**Preconditions**:
- `seed_one_default_vehicle` (default `P1`)

**Fault injection**: a second concurrent client issues `INSERT INTO vehicles (..., is_default=true)` directly to the side-channel DB at the same moment as the API client issues `POST /vehicles { IsDefault:true }`. Synchronisation: the test orchestrator pauses at `pg_advisory_lock(1)` after the service's `UPDATE vehicles SET is_default=FALSE` and BEFORE the service's `INSERT`. This is implemented via a Postgres function that the test installs and the service path traverses, which requires an instrumented test build. If no instrumented build is available, the test is best-effort and issues 100 parallel INSERTs instead.

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Run 100 parallel iterations of `(POST /vehicles { IsDefault:true } + side-channel INSERT (..., is_default=true))` | each iteration completes |
| 2 | Side-channel `SELECT COUNT(*) FROM vehicles WHERE is_default=true` | count is `≥ 2` in at least one iteration |

**Pass criteria**: at least one iteration produces `default_count ≥ 2`. If 0 iterations produce the race, the test FAILS. Either the race window has been closed (good news; rewrite the test to assert `default_count == 1` and update AC-1.4 to remove the race carry-forward), OR the test concurrency primitive is wrong (investigate).
**Note**: this test is intentionally PROBABILISTIC: it asserts the RACE EXISTS, not that the system is broken. A `PASS` here is bad news for the system but means the test is correctly observing the documented behavior.
**Max execution time**: 30s.

---
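The interleaving that NFT-RES-08 tries to provoke can be written out deterministically. The sketch below is an illustration under assumptions, not the service code: it replays the three statements in the losing order against an in-memory SQLite table in autocommit mode, and the `name` column and the function `default_count_after_interleaving` are hypothetical.

```python
import sqlite3


def default_count_after_interleaving() -> int:
    """Replay clear-then-set with a side-channel INSERT landing in the gap."""
    # Autocommit mode: each statement commits immediately, as with no transaction wrap.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE vehicles (id INTEGER PRIMARY KEY, name TEXT, is_default INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO vehicles (name, is_default) VALUES ('P1', 1)")  # seed

    # Service, step A: clear the existing default.
    conn.execute("UPDATE vehicles SET is_default = 0")
    # Side-channel client slips in between step A and step B.
    conn.execute("INSERT INTO vehicles (name, is_default) VALUES ('side-channel', 1)")
    # Service, step B: insert the new default, unaware of the side-channel row.
    conn.execute("INSERT INTO vehicles (name, is_default) VALUES ('api', 1)")

    count = conn.execute("SELECT COUNT(*) FROM vehicles WHERE is_default = 1").fetchone()[0]
    conn.close()
    return count
```

The 100 parallel iterations in step 1 are an attempt to hit this ordering by chance; the advisory-lock variant forces it.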
## Notes

- Tests that drop tables (NFT-RES-01, NFT-RES-02) or leave conflicting rows behind (NFT-RES-08) run in a per-class `IClassFixture<DbResetFixture>` that recreates the schema after each scenario.
- NFT-RES-03 through NFT-RES-07 require container orchestration via `docker compose` from inside the test runner. The `e2e-consumer` container needs the `docker` CLI + a mounted Docker socket; alternatively, the test runs from the host with `docker compose` available there. Hardware-assessment will lock this preference.
- NFT-RES-08 is intentionally probabilistic and may be flaky on slow runners. CI marks this as `[Trait("Stability","probabilistic")]` and tolerates ≤ 1 failed run per 5; deterministic implementation (advisory lock + instrumented build) is a follow-up.