# Resilience Tests

Authored by `/test-spec` Phase 2 (2026-05-19).

Resilience tests inject a fault, observe behaviour during the fault, observe recovery behaviour, and assert against both. The fault and the recovery contract are both quantifiable.

BIT pre-flight pathways (positive R1, negatives R2/R3) are in `blackbox-tests.md` because they assert a functional gate. The runtime fault scenarios live here.

---

### NFT-RES-R4: Lost operator/Ground-Station link → RTL at 30 s grace (default)

**Summary**: Sustained loss of the operator/Ground-Station radio link MUST trigger an RTL exactly at the configured grace window (default 30 s), and operator-link health MUST flip red.

**Traces to**: AC `Reliability & Safety — Loss of operator/Ground-Station radio link MUST trigger a known mission-safe outcome / R4`, RESTRICT `Reliability & Safety — Lost operator-link failsafe MUST be deterministic and bounded`.

**Tier**: B + E.

**Preconditions**:
- SUT mid-flight (scripted MAVLink stream + active operator session).
- Operator session in steady state for ≥ 30 s before fault injection.
- Grace window configured to default 30 s.

**Fault injection**:
- `operator-replay` issues `lost-link` event at T=0 and STAYS silent (no reconnect) for the remainder of the window.

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Inject lost-link event at T=0 | health endpoint immediately shows `deps.operator_link == "red"`; `last_seen_at` frozen |
| 2 | Wait 25 s (within grace) | NO RTL command yet on `mavlink-sitl`; SUT continues mission |
| 3 | Wait until T=30 s | RTL command observed on `mavlink-sitl` at T = 30 s ± 1 s; operator-stream emits a `failsafe_triggered` event with reason `operator_link_lost` |
| 4 | Optionally reconnect operator-replay after RTL | RTL persists (operator cannot un-RTL silently — requires explicit operator override per AC); health.operator_link transitions back to green when traffic resumes |

**Pass criteria**: RTL command at T = 30 s ± 1 s (`exact` with ± 1 s tolerance), `exact` operator-link red.

**Recovery time bound**: RTL must be issued within 31 s of fault start.

**Test status**: READY (operator-session-scripts inline-authorable; mavlink-sitl runs an ArduPilot SITL accepting RTL).

---

### NFT-RES-R5: Battery at RTL-floor → RTL

**Summary**: When the airframe battery sample drops to the configured RTL floor (e.g. 25 %), the SUT MUST issue an RTL and health MUST surface yellow.

**Traces to**: AC `Reliability & Safety — Battery at or below the configured RTL floor / R5`.

**Tier**: B + E.

**Preconditions**:
- SUT mid-flight; battery telemetry replayed via `mavlink-sitl` at 1 Hz.

**Fault injection**:
- `mavlink-sitl` scripted battery curve: starts at 80 %; ramps down to 25 % at T=T0; held at 25 % afterwards.

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | At T=T0, battery reads 25 % | within 1 sample period (1 s) the SUT issues RTL on `mavlink-sitl`; health transitions to `overall == "yellow"`; operator-stream emits `failsafe_triggered` with reason `battery_rtl_floor` |
| 2 | Battery continues at 25 % | RTL persists; no oscillation |

**Pass criteria**: `exact (RTL command observed)` + `exact (health.overall == "yellow")`.

**Test status**: DEFERRED — ``.

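
The R5 timing assertion (RTL within one sample period of the battery reaching the floor) can be sketched as a small checker. This is a minimal sketch under stated assumptions, not the harness implementation: the `(seconds, percent)` sample shape, both function names, and the premise that the RTL timestamp is recorded on the same clock as the battery samples are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical sample shape: (seconds since test start, battery percent).
Sample = Tuple[float, float]


def first_floor_time(samples: Sequence[Sample], floor_pct: float) -> Optional[float]:
    """Return the time of the first sample at or below the floor (T0), or None."""
    for t, pct in samples:
        if pct <= floor_pct:
            return t
    return None


def rtl_within_sample_period(
    t0: float, rtl_time: Optional[float], period_s: float = 1.0
) -> bool:
    """True when RTL was seen no earlier than T0 and at most one sample period later."""
    if rtl_time is None:
        return False
    return t0 <= rtl_time <= t0 + period_s


# Scripted curve in the spirit of R5: starts at 80 %, ramps to 25 %, then holds.
curve = [(0.0, 80.0), (1.0, 60.0), (2.0, 40.0), (3.0, 25.0), (4.0, 25.0)]
t0 = first_floor_time(curve, floor_pct=25.0)
```

The same two helpers would apply to a hard-floor scenario by substituting the floor value; only the expected command differs.
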

---

### NFT-RES-R6: Battery at hard floor → land-now

**Summary**: When the battery hits the configured hard floor (e.g. 15 %), the SUT MUST issue land-now and ONLY an authenticated operator command may override.

**Traces to**: AC `Reliability & Safety — battery at or below the hard floor / R6`.

**Tier**: B + E.

**Preconditions**:
- SUT mid-flight; battery ramps to 15 %.

**Fault injection**: same as R5 but ramp continues to 15 %.

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | At T=T0, battery reads 15 % | within 1 sample period the SUT issues land-now (`MAV_CMD_NAV_LAND` or equivalent) on `mavlink-sitl`; health red; operator-stream emits `failsafe_triggered` with reason `battery_hard_floor` |
| 2 | Replay an UNAUTHENTICATED operator-override command | SUT REFUSES; land-now persists |
| 3 | Replay an AUTHENTICATED operator-override (placeholder until Q9; full once Q9 resolves) | land-now cancelled; SUT returns to prior mode |

**Pass criteria**: `exact (land_now observed)`; `exact (refusal of unauthenticated override)`; `exact (acceptance of authenticated override)`.

**Test status**: DEFERRED — same fixture gap as R5; step 3's full authentication semantics also ``.

---

### NFT-RES-R7: Airframe link exhaustion → health red after max-retry

**Summary**: When MAVLink commands fail through the configured bounded-retry budget (no airframe response), the airframe-link dependency MUST flip health red.

**Traces to**: AC `Reliability & Safety — MAVLink command exhaustion (bounded retry with exponential backoff fails through max-retry) / R7`.

**Tier**: B + E.

**Preconditions**:
- SUT mid-flight; max-retry configured (e.g., 5 attempts; exponential backoff base 100 ms).

**Fault injection**:
- `mavlink-sitl` configured to drop all command-ack messages for the duration of the test (peer non-responsive).

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | SUT issues a MAVLink command (e.g., waypoint upload) | command sent; no ack received |
| 2 | Backoff + retry loop executes through max-retry | retries observed on the wire with exponential backoff |
| 3 | After final retry exhausts | health.airframe_link transitions to red; operator-stream emits a `dependency_degraded` event with reason `airframe_link_retry_exhausted` |

**Pass criteria**: `exact (health.airframe_link == "red")` after max-retry; retries observed with backoff base 100 ms ± 20 ms.

**Test status**: DEFERRED — ``.

---

### NFT-RES-R8: Wall-clock drift > 200 ms → time-source yellow

**Summary**: When wall-clock drift versus GPS or NTP source exceeds 200 ms, the time-source dependency MUST report yellow, AND `clock_source` + `last_sync_at` MUST reflect the drift.

**Traces to**: AC `Reliability & Safety — Wall-clock drift greater than 200 ms / R8`, RESTRICT `Wall-clock MUST be bound to GPS time once GPS is locked, or NTP at boot`.

**Tier**: B.

**Preconditions**:
- SUT running with `time-injector` LD_PRELOAD active.
- GPS source initially locked via `mavlink-sitl` GPS_RAW_INT messages.

**Fault injection**:
- `time-injector` advances the SUT process clock by 250 ms over a 1 s window while keeping GPS source locked.

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Bind clock to GPS at boot | health.time_source == green; `clock_source == "gps"`; `last_sync_at` recent |
| 2 | Inject 250 ms drift | within 5 s health.time_source transitions to yellow; `clock_source` and `last_sync_at` updated to reflect the drift |
| 3 | Stop drift | health.time_source returns to green within the next sync cycle |

**Pass criteria**: `exact (health.time_source == "yellow")` during step 2; `exact (clock_source updated)` + `exact (last_sync_at updated)`.

**Test status**: READY (time-drift-scripts inline-authorable).

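
The R8 threshold rule can be illustrated with a short sketch of the oracle a drift script might use. Everything named here is hypothetical: `time_source_health` and `injected_drift_ms` are illustrative helpers, the linear ramp is an assumption (the scenario only states 250 ms over a 1 s window), and treating negative drift symmetrically via `abs` is also an assumption.

```python
def time_source_health(drift_ms: float, threshold_ms: float = 200.0) -> str:
    """Classify time-source health: yellow only when |drift| strictly exceeds the threshold."""
    return "yellow" if abs(drift_ms) > threshold_ms else "green"


def injected_drift_ms(
    elapsed_s: float, total_ms: float = 250.0, window_s: float = 1.0
) -> float:
    """Drift applied so far, ASSUMING a linear ramp of total_ms over window_s, then held."""
    fraction = min(max(elapsed_s / window_s, 0.0), 1.0)
    return total_ms * fraction


# Expected health at a few points of the injection window under the linear-ramp assumption.
halfway = time_source_health(injected_drift_ms(0.5))   # 125 ms of drift so far
finished = time_source_health(injected_drift_ms(1.0))  # full 250 ms applied
```

The strict comparison mirrors the wording "exceeds 200 ms": a drift of exactly 200 ms stays green in this sketch.
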

---

### NFT-RES-R9: Geofence EXCLUSION crossing → waypoint refusal + RTL

**Summary**: When a simulated waypoint crosses an EXCLUSION polygon, the SUT MUST refuse the waypoint AND trigger RTL. Symmetric behaviour for INCLUSION violations.

**Traces to**: AC `Reliability & Safety — Geofence INCLUSION and EXCLUSION violations MUST both result in waypoint refusal + RTL / R9`, RESTRICT `Geofence enforcement MUST be symmetric`.

**Tier**: B + E.

**Preconditions**:
- SUT mid-flight; geofence INCLUSION + EXCLUSION polygons loaded as part of the mission.

**Fault injection**:
- Scripted waypoint upload that crosses the EXCLUSION polygon; subsequently INCLUSION-exit test.

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Upload waypoint crossing EXCLUSION polygon | SUT refuses the waypoint; structured-log WARN with `geofence_violation_exclusion`; RTL command observed on `mavlink-sitl` |
| 2 | Reset; upload waypoint exiting the INCLUSION polygon | identical behaviour — refused + RTL |

**Pass criteria**: `exact (waypoint rejected)` + `exact (RTL command observed)` for both EXCLUSION and INCLUSION cases.

**Test status**: DEFERRED — ``.

---

### NFT-RES-Mp2: Map-pull timeout → cache-fallback (functional coverage in FT-N-003)

**Summary**: When the pre-flight map pull times out, the SUT falls back to last-known cached MapObjects and surfaces `map_sync == "cached_fallback"` with an operator-ack gate. (Functional gate semantics are tested in `blackbox-tests.md → FT-N-003`; this scenario adds the **timing+recovery** dimension.)

**Traces to**: AC `Map Reconciliation — Cache-fallback on timeout / Mp2`.

**Tier**: B.

**Preconditions**:
- `autopilot-state` seeded with a known prior MapObjects snapshot.
- `missions-mock` configured to time out on `GET /missions/{id}/mapobjects` for a configurable duration.

**Fault injection**:
- `missions-mock` returns 504 / silent timeout for 60 s; then responds normally.

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Trigger BIT | SUT issues `GET /missions/{id}/mapobjects`; observes timeout (per its configured request timeout); within 5 s falls back to cached snapshot; `map_sync == "cached_fallback"`; BIT requires explicit operator ack (see FT-N-003) |
| 2 | Mock recovers (responds normally) | next periodic resync re-attempts; once successful, `map_sync == "live"`; structured-log INFO `map_resync_recovered` |

**Pass criteria**: `exact (cached_fallback within 5 s of timeout)`; recovery within the next resync cycle.

**Test status**: READY (no external fixture beyond `mission-suite-fixture` (DEFERRED) for the cached snapshot seed; cached snapshot can be authored inline at minimal scale).

---

### NFT-RES-Mp4: Post-flight map-push 5xx → persist + bounded retry + operator warning

**Summary**: When the post-flight `POST /missions/{id}/mapobjects` returns 5xx, the pending diff MUST be persisted on durable on-device storage, an operator-visible warning MUST surface, AND bounded retry MUST execute (capped at the configured retry limit).

**Traces to**: AC `Map Reconciliation — Failure MUST persist the pending diff to durable on-device storage with bounded retry / Mp4`, RESTRICT `On-device storage MUST be bounded`.

**Tier**: B + E.

**Preconditions**:
- SUT post-landing; pending diff ready to push.
- `missions-mock` configured to return 5xx N times then 200.

**Fault injection**:
- `missions-mock` returns 503 for the first N attempts (N = configured retry-cap + 1); then returns 200.

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Trigger post-flight reconciliation | SUT issues `POST /missions/{id}/mapobjects`; receives 503 |
| 2 | Observe persistence | pending diff file exists under `autopilot-state/pending_map_diff/.json`; size > 0 |
| 3 | Observe operator-stream | warning event `map_push_failed` surfaced |
| 4 | Observe retry loop | retries observed within the configured cap; backoff with jitter |
| 5 | After retry-cap reached without success | SUT stops retrying; pending file remains for next session pickup |
| 6 | Eventual success (mock returns 200) | next attempt succeeds; pending file removed; warning cleared |

**Pass criteria**: `exact (pending file exists)` + `exact (warning surfaced)` + `threshold_max (retries ≤ configured cap)`.

**Test status**: DEFERRED — ``.

---

## Recovery-time invariants common to every scenario

- **No silent error swallowing.** Every fault scenario MUST observe a corresponding structured-log entry at WARN+ AND a corresponding health-endpoint transition. A fault that the SUT handles without surfacing through both channels is a TEST FAILURE per `security_approach.md → "No silent error swallowing for security-relevant failures"` (extended here to operational faults per `coderule.mdc → "Never suppress errors silently"`).
- **Bounded behaviour.** Every retry/backoff loop MUST be bounded — the scenario asserts the cap on retry count and the cap on backoff window. Open-ended retry is a test failure.
- **State integrity post-recovery.** After fault recovery (when applicable), the scenario asserts that the SUT returns to a known state — mode unchanged unless the fault legitimately altered it (e.g., RTL stays RTL until operator override).
- **Symmetry assertions.** R9 explicitly tests both INCLUSION and EXCLUSION because the AC names symmetric behaviour. Wherever an AC pairs two outcomes (`fail-fast` + `fail-closed`, `red` + `yellow`, etc.), the resilience scenario MUST cover both halves.
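
The bounded-behaviour invariant, combined with R7's "backoff base 100 ms ± 20 ms" criterion, could be checked against send timestamps captured on the wire roughly as follows. This is a sketch under stated assumptions: `check_bounded_backoff` and its parameters are hypothetical, the doubling factor is assumed (the scenarios say "exponential" without naming a multiplier), the cap is interpreted as total send attempts, and the ± 20 ms tolerance is applied to the base implied by each gap.

```python
from typing import List, Sequence


def backoff_gaps_ms(send_times_ms: Sequence[float]) -> List[float]:
    """Gaps between consecutive send attempts observed on the wire."""
    return [b - a for a, b in zip(send_times_ms, send_times_ms[1:])]


def check_bounded_backoff(
    send_times_ms: Sequence[float],
    max_attempts: int,
    base_ms: float = 100.0,
    tol_ms: float = 20.0,
    factor: float = 2.0,  # ASSUMED multiplier; not stated in the scenarios
) -> bool:
    """True when attempts stay within the cap and every gap implies a base within tolerance."""
    if len(send_times_ms) > max_attempts:
        return False  # open-ended or over-cap retry fails the invariant
    for i, gap in enumerate(backoff_gaps_ms(send_times_ms)):
        implied_base = gap / (factor ** i)  # undo the exponential growth for gap i
        if abs(implied_base - base_ms) > tol_ms:
            return False
    return True


# Five attempts with gaps of 100, 200, 400, 800 ms: within a cap of 5 and on schedule.
on_schedule = check_bounded_backoff([0, 100, 300, 700, 1500], max_attempts=5)
```

A jittered schedule such as Mp4's would need a wider band than this sketch allows; the attempt-count check carries over unchanged.
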