

# Resilience Tests
> **Status**: produced by autodev `/test-spec` Phase 2 (2026-05-14).
> **Naming**: post-rename target. Resilience scenarios use the side-channel + container-orchestration tools to inject faults; the system-under-test is treated as a black box.
> **Critical scenarios**: AC-3.3 (cascade NOT transaction-wrapped) and AC-3.4 (orphan-row race) are the highest-impact resilience invariants — they intentionally encode current (sub-optimal) behavior so tests catch any silent change. Several other resilience tests verify documented operational behaviors (idempotent migrator, DB-down crash, JWT secret rotation).
---
### NFT-RES-01: Cascade is NOT transaction-wrapped — partial deletes survive mid-walk failure
**Summary**: Verifies the documented ADR-006 carry-forward — when the cascade walk fails mid-way (e.g., `media` table absent), already-committed deletes remain.
**Traces to**: AC-3.3, AC-3.4 (related), AC-10.2
**Preconditions**:
- `fixture_cascade_F3` applied (chain rooted at `mid`)
- `missions` running
**Fault injection**: side-channel `DROP TABLE media CASCADE;` BEFORE the request — this turns the second sub-step of the cascade walk into a `relation does not exist` failure.
**Steps**:
| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Side-channel `DROP TABLE media CASCADE` | succeeds |
| 2 | `DELETE /missions/{mid}` (JWT `FL`) | `500`; envelope `{ statusCode:500, message:"Internal server error" }` |
| 3 | Side-channel `SELECT COUNT(*) FROM map_objects WHERE mission_id={mid}` | `count == 0` (work BEFORE the failure point committed — non-zero pre-fault, zero post-fault) |
| 4 | Side-channel `SELECT COUNT(*) FROM missions WHERE id={mid}` | `count == 1` (work AFTER the failure point did NOT run — row remains) |
| 5 | `docker logs missions \| grep "Unhandled exception"` | At least one matching log line containing `relation` and `media` |
**Pass criteria**: `map_objects` count is 0 (deleted before failure) AND `missions` count is 1 (not deleted because failure short-circuited the walk) AND the response is 500 with the redacted body.
**Note**: this test will FAIL once a transaction wrap is added (ADR-006 closure) — at that point ALL deletes will roll back and `map_objects` count will be `>0`. When the transaction wrap lands, update this test.
**Max execution time**: 10s.
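
The three pass criteria can be combined into one check once the HTTP status and the two row counts have been collected. The sketch below is a hypothetical helper, not part of the existing test suite: `check_partial_cascade` and its argument order are assumptions. In practice the status would typically come from something like `curl -s -o /dev/null -w '%{http_code}'` and the counts from `psql -tAc`, with the URL, token, and connection string supplied by the test environment.

```shell
# check_partial_cascade STATUS MAP_OBJECTS_COUNT MISSIONS_COUNT
# Hypothetical helper: succeeds only when all three NFT-RES-01 pass
# criteria hold, and prints the first criterion that is violated.
check_partial_cascade() {
  status=$1
  map_objects=$2
  missions=$3
  # The request must fail with the redacted 500 envelope.
  [ "$status" -eq 500 ] || { echo "FAIL: expected HTTP 500, got $status"; return 1; }
  # Deletes that ran before the failure point must have been committed.
  [ "$map_objects" -eq 0 ] || { echo "FAIL: $map_objects map_objects rows remain"; return 1; }
  # The mission row itself must survive because the walk was cut short.
  [ "$missions" -eq 1 ] || { echo "FAIL: missions count is $missions, expected 1"; return 1; }
  echo "PASS"
}
```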
---
### NFT-RES-02: Waypoint cascade is NOT transaction-wrapped (same invariant as F3)
**Summary**: Same invariant as NFT-RES-01, scoped to F4 (waypoint cascade).
**Traces to**: AC-4.6, AC-3.3 (same root cause), AC-10.2
**Preconditions**:
- `fixture_cascade_F4` applied (waypoint `wp1` with chain)
**Fault injection**: side-channel `DROP TABLE media CASCADE` BEFORE the request.
**Steps**:
| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Side-channel `DROP TABLE media CASCADE` | succeeds |
| 2 | `DELETE /missions/{mid}/waypoints/{wp1}` (JWT `FL`) | `500` |
| 3 | Side-channel `SELECT COUNT(*) FROM detection WHERE annotation_id IN (wp1 chain)` | `count == 0` (deleted before failure) |
| 4 | Side-channel `SELECT COUNT(*) FROM waypoints WHERE id={wp1}` | `count == 1` (not deleted) |
**Pass criteria**: same shape as NFT-RES-01.
**Max execution time**: 10s.
---
### NFT-RES-03: Idempotent migrator — second startup is a no-op
**Summary**: Verifies the migrator can run twice in a row without `relation already exists` errors.
**Traces to**: AC-6.6, AC-6.4
**Preconditions**:
- `seed_empty` (schema migrated once via first `missions` startup)
**Fault injection**: container restart (NOT volume reset).
**Steps**:
| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `docker compose restart missions` | container exits cleanly + restarts |
| 2 | Wait for `GET /health` to return `200` | within 30s |
| 3 | `docker logs missions \| grep -E "(error\|Error\|exception)"` | no NEW error / exception lines after the restart timestamp |
| 4 | Side-channel `\d+ vehicles` | schema unchanged from after the first migrate |
**Pass criteria**: second start completes; no new error log lines; schema unchanged.
**Max execution time**: 60s.
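
Steps 2 and 3 amount to a bounded poll followed by a log filter limited to lines emitted after the restart. A minimal sketch follows, assuming the logs are read with `docker logs --timestamps` so that each line starts with an RFC 3339 UTC timestamp. The helper names `wait_until` and `new_error_lines` are hypothetical. A call might look like `wait_until 30 curl -fsS "$HEALTH_URL"` followed by `docker logs --timestamps missions 2>&1 | new_error_lines "$restart_ts"`, where `HEALTH_URL` and `restart_ts` are supplied by the test.

```shell
# wait_until TIMEOUT_SECONDS CMD [ARGS...]
# Re-runs CMD once per second until it succeeds or the timeout elapses.
wait_until() {
  remaining=$1
  shift
  until "$@"; do
    [ "$remaining" -gt 0 ] || return 1
    sleep 1
    remaining=$((remaining - 1))
  done
}

# new_error_lines SINCE
# Reads timestamp-prefixed log lines on stdin and prints the error lines
# whose timestamp is at or after SINCE. The comparison is a plain string
# comparison, so SINCE must be a UTC prefix such as 2026-05-14T16:48:25
# (no trailing Z, no fractional part).
new_error_lines() {
  awk -v since="$1" '$1 >= since && /error|Error|exception/'
}
```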
---
### NFT-RES-04: B9 one-shot legacy table drop runs once and is idempotent
**Summary**: Verifies that the post-B9 `DROP TABLE IF EXISTS orthophotos / gps_corrections` block in the migrator is destructive on the FIRST run against a legacy device, and a no-op on subsequent runs.
**Traces to**: AC-6.5, AC-10.5
**Preconditions**:
- `seed_legacy_gps_tables` (schema includes `orthophotos` + `gps_corrections` with 1 row each)
- `missions` NOT yet started for this scenario
**Fault injection**: none — purely observe migrator behavior.
**Steps**:
| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Side-channel `SELECT to_regclass('orthophotos'), to_regclass('gps_corrections')` | both NON-NULL (legacy tables present) |
| 2 | `docker compose up -d missions` | container starts |
| 3 | Wait for `GET /health` to return `200` | |
| 4 | Side-channel re-query | both NULL (dropped) |
| 5 | `docker compose restart missions` | |
| 6 | Side-channel re-query | both still NULL (idempotent — no error) |
| 7 | `docker logs missions \| grep -i "does not exist"` | NO log line (because of `IF EXISTS`) |
**Pass criteria**: legacy tables absent after first start; subsequent restarts produce no errors and leave them absent.
**Note**: this test only meaningfully runs on a **post-B9 build**. Before B9 lands, the migrator has no DROP block; gate this scenario on a build-time flag or by inspecting the migrator source.
**Max execution time**: 60s.
---
### NFT-RES-05: DB unreachable at startup — process exits non-zero
**Summary**: Verifies AC-6.7 — DB unreachability causes process exit, NOT silent retry-forever.
**Traces to**: AC-6.7
**Preconditions**:
- `missions` NOT running
**Fault injection**: stop `postgres-test` (`docker compose stop postgres-test`) then start `missions`.
**Steps**:
| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `docker compose stop postgres-test` | |
| 2 | `docker compose up -d missions` | |
| 3 | Poll `docker inspect --format '{{.State.ExitCode}}' missions` every 1s for up to 30s | Within 30s the container has exited with a non-zero exit code (`ExitCode` also reads `0` while the container is still running, so confirm `.State.Status` is `exited`) |
| 4 | `docker logs missions` | Contains an Npgsql connection error message (e.g., `Connection refused`) |
**Pass criteria**: container exits with non-zero code within 30s; logs contain a recognisable Npgsql error.
**Max execution time**: 60s.
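
The polling in step 3 can be sketched as a loop that reads both the container status and the exit code, since an exit code of `0` alone does not distinguish a running container from a clean exit. This is a hedged sketch: `container_state` and `wait_for_nonzero_exit` are hypothetical names, the probe command is injectable so the loop can be exercised without Docker, and it assumes the service has no restart policy that would move it to a `restarting` state. A real run would call `wait_for_nonzero_exit 30 container_state missions`.

```shell
# container_state NAME: print "<status> <exit code>", e.g. "exited 134".
container_state() {
  docker inspect --format '{{.State.Status}} {{.State.ExitCode}}' "$1"
}

# wait_for_nonzero_exit TIMEOUT_SECONDS PROBE_CMD [ARGS...]
# Polls PROBE_CMD once per second. Succeeds as soon as the probe reports
# "exited" with a non-zero code; fails on a clean exit or on timeout.
wait_for_nonzero_exit() {
  remaining=$1
  shift
  while [ "$remaining" -ge 0 ]; do
    # A failing probe (e.g. container not created yet) counts as "no state".
    state=$("$@") || state=""
    case $state in
      "exited 0") echo "FAIL: clean exit"; return 1 ;;
      "exited "*) echo "PASS: ${state#exited }"; return 0 ;;
    esac
    sleep 1
    remaining=$((remaining - 1))
  done
  echo "FAIL: still running after timeout"
  return 1
}
```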
---
### NFT-RES-06: DB missing (database does not exist) — process exits with Npgsql 3D000
**Summary**: Verifies AC-6.8: when the `azaion` database does not exist, the process exits and the logs contain the documented PostgreSQL SQLSTATE `3D000` (`invalid_catalog_name`), as reported through Npgsql.
**Traces to**: AC-6.8
**Preconditions**:
- `postgres-test` running with the `azaion` database NOT yet created (use `POSTGRES_DB=postgres` instead, or `DROP DATABASE azaion`)
**Fault injection**: same as preconditions.
**Steps**:
| Step | Action | Expected Behavior |
|------|--------|------------------|
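| 1 | Side-channel `DROP DATABASE IF EXISTS azaion` | succeeds (the side-channel session must be connected to a different database, e.g. `postgres`, because PostgreSQL cannot drop the database of the current connection) |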
| 2 | `docker compose up -d missions` | |
| 3 | Poll exit code for ≤ 30s | non-zero |
| 4 | `docker logs missions \| grep "3D000"` | at least one match |
**Pass criteria**: container exits non-zero within 30s; logs contain `3D000`.
**Max execution time**: 60s.
---
### NFT-RES-07: JWT_SECRET rotation invalidates existing tokens
**Summary**: Verifies AC-5.7 — restarting the service with a different `JWT_SECRET` causes previously-valid tokens to fail validation.
**Traces to**: AC-5.7
**Preconditions**:
- `missions` running with `JWT_SECRET=test-secret-32-chars-min!!!!!!!!!`
- Token `T1` minted with the same secret, valid for 1h
**Fault injection**: restart `missions` with `JWT_SECRET=rotated-secret-32-chars-min!!!!!`.
**Steps**:
| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | `GET /vehicles` with `Authorization: Bearer T1` | `200` |
| 2 | `docker compose stop missions` | |
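| 3 | `docker compose run -d -e 'JWT_SECRET=rotated-secret-32-chars-min!!!!!' missions` | container starts with the rotated secret. The value is single-quoted so an interactive shell does not apply history expansion to `!!`. Note that `docker compose run` does not publish the service's ports unless `--service-ports` is passed |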
| 4 | Wait for `GET /health` 200 | |
| 5 | `GET /vehicles` with `Authorization: Bearer T1` (same token as step 1) | `401` |
| 6 | Mint token `T2` with the new secret, `GET /vehicles` with `T2` | `200` |
**Pass criteria**: `T1` works pre-rotation, fails post-rotation; `T2` works post-rotation.
**Max execution time**: 90s.
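
Minting `T1` and `T2` can be sketched with `openssl` alone. The sketch assumes the service validates HS256 (HMAC-SHA256) tokens signed with `JWT_SECRET`; the source does not state the algorithm, and the claims the service requires (role, issuer, audience, expiry) are not specified here, so the payload passed in is a hypothetical placeholder. `mint_jwt` and `b64url` are illustrative names. Calling `mint_jwt` with the old and the new secret on the same payload yields tokens that differ only in the signature segment, which is why `T1` stops validating after rotation.

```shell
# b64url: base64url-encode stdin on one line, without padding.
b64url() {
  openssl base64 -A | tr '+/' '-_' | tr -d '='
}

# mint_jwt SECRET PAYLOAD_JSON
# Prints a compact JWT signed with HMAC-SHA256 (assumed algorithm).
mint_jwt() {
  secret=$1
  payload=$2
  header='{"alg":"HS256","typ":"JWT"}'
  # The signing input is base64url(header) "." base64url(payload).
  signing_input="$(printf '%s' "$header" | b64url).$(printf '%s' "$payload" | b64url)"
  signature=$(printf '%s' "$signing_input" \
    | openssl dgst -sha256 -hmac "$secret" -binary | b64url)
  printf '%s.%s\n' "$signing_input" "$signature"
}
```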
---
### NFT-RES-08: TOCTOU on default-vehicle exclusivity (race window)
**Summary**: Verifies AC-1.4 — the clear-then-set is NOT transaction-wrapped, so a concurrent INSERT can leave 2+ defaults.
**Traces to**: AC-1.4 (carry-forward)
**Preconditions**:
- `seed_one_default_vehicle` (default `P1`)
**Fault injection**: a second concurrent client issues `INSERT INTO vehicles (..., is_default=true)` directly against the side-channel DB at the same moment as the API client issues `POST /vehicles { IsDefault:true }`. Synchronisation: the test orchestrator uses `pg_advisory_lock(1)` to pause execution after the service's `UPDATE vehicles SET is_default=FALSE` and BEFORE the service's `INSERT`. The lock is taken inside a Postgres function that the test installs and the service path traverses. This requires an instrumented test build; if one is not available, the test is best-effort and issues 100 parallel INSERTs instead.
**Steps**:
| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | Run 100 parallel iterations of `(POST /vehicles { IsDefault:true } + side-channel INSERT (..., is_default=true))` | each iteration completes |
| 2 | Side-channel `SELECT COUNT(*) FROM vehicles WHERE is_default=true` | count is `≥ 2` in at least one iteration |
**Pass criteria**: at least one iteration produces `default_count ≥ 2`. If 0 iterations produce the race, the test FAILS — either the race window has been closed (good news; rewrite the test to assert `default_count == 1` and update AC-1.4 to remove the race carry-forward), OR the test concurrency primitive is wrong (investigate).
**Note**: this test is intentionally PROBABILISTIC — it asserts the RACE EXISTS, not that the system is broken. A `PASS` here is bad news for the system but means the test is correctly observing the documented behavior.
**Max execution time**: 30s.
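
The best-effort variant (parallel iterations followed by a check on the observed counts) could be driven by two small helpers. This is only a sketch: `run_parallel` and `race_observed` are hypothetical names, and the command passed to `run_parallel` stands in for one iteration of the API call plus the side-channel INSERT, which is expected to record its resulting `default_count` somewhere the orchestrator can read it.

```shell
# run_parallel N CMD [ARGS...]
# Starts N background copies of CMD (each receives its iteration number
# as the last argument) and waits for all of them to finish.
run_parallel() {
  n=$1
  shift
  i=1
  while [ "$i" -le "$n" ]; do
    "$@" "$i" &
    i=$((i + 1))
  done
  wait
}

# race_observed
# Reads one default_count per line on stdin and succeeds if any value is
# 2 or more, which is the NFT-RES-08 pass criterion.
race_observed() {
  awk '$1 >= 2 { found = 1 } END { exit !found }'
}
```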
---
## Notes
- Tests that drop tables (NFT-RES-01, NFT-RES-02) or leave extra rows behind (NFT-RES-08) run in a per-class `IClassFixture<DbResetFixture>` that recreates the schema after each scenario.
- NFT-RES-03 through NFT-RES-07 require container-orchestration via `docker compose` from inside the test runner. The `e2e-consumer` container needs the `docker` CLI + a mounted Docker socket — alternatively, the test runs from the host with `docker compose` available there. Hardware-assessment will lock this preference.
- NFT-RES-08 is intentionally probabilistic and may be flaky on slow runners. CI marks this as `[Trait("Stability","probabilistic")]` and tolerates ≤ 1 failed run per 5; deterministic implementation (advisory lock + instrumented build) is a follow-up.