# Resilience Tests

Authored by `/test-spec` Phase 2 (2026-05-19). Resilience tests inject a fault, observe behaviour during the fault, observe recovery behaviour, and assert against both. The fault and the recovery contract are both quantifiable.

BIT pre-flight pathways (positive R1, negatives R2/R3) are in `blackbox-tests.md` because they assert a functional gate. The runtime fault scenarios live here.

---

### NFT-RES-R4: Lost operator/Ground-Station link → RTL at 30 s grace (default)
**Summary**: Sustained loss of the operator/Ground-Station radio link MUST trigger an RTL exactly at the configured grace window (default 30 s), and operator-link health MUST flip red.
**Traces to**: AC `Reliability & Safety — Loss of operator/Ground-Station radio link MUST trigger a known mission-safe outcome / R4`, RESTRICT `Reliability & Safety — Lost operator-link failsafe MUST be deterministic and bounded`.
**Tier**: B + E.

**Preconditions**:
- SUT mid-flight (scripted MAVLink stream + active operator session).
- Operator session in steady state for ≥ 30 s before fault injection.
- Grace window configured to default 30 s.

**Fault injection**:
- `operator-replay` issues `lost-link` event at T=0 and STAYS silent (no reconnect) for the remainder of the window.

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Inject lost-link event at T=0 | health endpoint immediately shows `deps.operator_link == "red"`; `last_seen_at` frozen |
| 2 | Wait 25 s (within grace) | NO RTL command yet on `mavlink-sitl`; SUT continues mission |
| 3 | Wait until T=30 s | RTL command observed on `mavlink-sitl` at T = 30 s ± 1 s; operator-stream emits a `failsafe_triggered` event with reason `operator_link_lost` |
| 4 | Optionally reconnect operator-replay after RTL | RTL persists (operator cannot un-RTL silently — requires explicit operator override per AC); health.operator_link transitions back to green when traffic resumes |

**Pass criteria**: RTL command at T = 30 s ± 1 s (`exact` with ± 1 s tolerance), `exact` operator-link red.
**Recovery time bound**: RTL must be issued within 31 s of fault start.
**Test status**: READY (operator-session-scripts inline-authorable; mavlink-sitl runs an ArduPilot SITL accepting RTL).
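
The timing assertion in step 3 can be sketched as a check over timestamped observations. This is a minimal sketch under stated assumptions, not the project's harness: the `Observation` record, the `"rtl_command"` label, and both helper names are hypothetical. Only the 30 s grace window and the ± 1 s tolerance come from the scenario above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Observation:
    t: float    # seconds since the lost-link event (T=0)
    kind: str   # hypothetical label, e.g. "rtl_command"


def first_rtl_time(events: Sequence[Observation]) -> Optional[float]:
    """Return the time of the earliest RTL command, or None if none was seen."""
    times = [e.t for e in events if e.kind == "rtl_command"]
    return min(times) if times else None


def r4_passes(events: Sequence[Observation],
              grace_s: float = 30.0, tol_s: float = 1.0) -> bool:
    """True when the first RTL lands inside grace_s +/- tol_s.

    An RTL issued early (inside the grace window) fails, and so does a
    late or missing one.
    """
    t = first_rtl_time(events)
    return t is not None and abs(t - grace_s) <= tol_s
```

Checking the earliest RTL, instead of any RTL, is what lets the same helper cover step 2: an RTL at T = 25 s fails the check even if another one follows at T = 30 s.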

---

### NFT-RES-R5: Battery at RTL-floor → RTL
**Summary**: When the airframe battery sample drops to the configured RTL floor (e.g. 25 %), the SUT MUST issue an RTL and health MUST surface yellow.
**Traces to**: AC `Reliability & Safety — Battery at or below the configured RTL floor / R5`.
**Tier**: B + E.

**Preconditions**:
- SUT mid-flight; battery telemetry replayed via `mavlink-sitl` at 1 Hz.

**Fault injection**:
- `mavlink-sitl` scripted battery curve: starts at 80 %; ramps down to 25 % at T=T0; held at 25 % afterwards.

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | At T=T0, battery reads 25 % | within 1 sample period (1 s) the SUT issues RTL on `mavlink-sitl`; health transitions to `overall == "yellow"`; operator-stream emits `failsafe_triggered` with reason `battery_rtl_floor` |
| 2 | Battery continues at 25 % | RTL persists; no oscillation |

**Pass criteria**: `exact (RTL command observed)` + `exact (health.overall == "yellow")`.
**Test status**: DEFERRED — `<DEFERRED: mid-flight battery sample at RTL-floor via mavlink-sitl battery curve script>`.
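
The deferred battery-curve fixture could be prototyped along these lines. This is a sketch only: the linear ramp shape, the function names, and the `ramp_s` / `hold_s` parameters are assumptions, since the scenario fixes only the 80 % start, the 25 % floor, and the 1 Hz sample rate.

```python
from typing import List


def battery_curve(start_pct: float, floor_pct: float,
                  ramp_s: int, hold_s: int) -> List[float]:
    """One sample per second: linear ramp from start_pct to floor_pct, then hold."""
    if ramp_s < 1:
        raise ValueError("ramp_s must be at least 1 second")
    step = (start_pct - floor_pct) / ramp_s
    ramp = [start_pct - step * i for i in range(ramp_s)]
    # The first floor sample is T0; hold_s further samples stay at the floor.
    return ramp + [floor_pct] * (hold_s + 1)


def first_at_or_below(samples: List[float], floor_pct: float) -> int:
    """Index (seconds at 1 Hz) of the first sample at or below the floor, i.e. T0."""
    for i, pct in enumerate(samples):
        if pct <= floor_pct:
            return i
    raise ValueError("curve never reaches the floor")
```

With `battery_curve(80.0, 25.0, ramp_s=55, hold_s=10)` the curve loses one percentage point per second, so T0 falls at sample index 55 and the RTL would be expected no later than one sample period after that.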

---

### NFT-RES-R6: Battery at hard floor → land-now
**Summary**: When the battery hits the configured hard floor (e.g. 15 %), the SUT MUST issue land-now and ONLY an authenticated operator command may override.
**Traces to**: AC `Reliability & Safety — battery at or below the hard floor / R6`.
**Tier**: B + E.

**Preconditions**:
- SUT mid-flight; battery ramps to 15 %.

**Fault injection**: same as R5 but ramp continues to 15 %.

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | At T=T0, battery reads 15 % | within 1 sample period the SUT issues land-now (`MAV_CMD_NAV_LAND` or equivalent) on `mavlink-sitl`; health red; operator-stream emits `failsafe_triggered` with reason `battery_hard_floor` |
| 2 | Replay an UNAUTHENTICATED operator-override command | SUT REFUSES; land-now persists |
| 3 | Replay an AUTHENTICATED operator-override (placeholder until Q9; full once Q9 resolves) | land-now cancelled; SUT returns to prior mode |

**Pass criteria**: `exact (land_now observed)`; `exact (refusal of unauthenticated override)`; `exact (acceptance of authenticated override)`.
**Test status**: DEFERRED — same fixture gap as R5; step 3's full authentication semantics also `<DEFERRED: Q9>`.
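
The three steps above amount to a latch that only an authenticated override can release. The sketch below is a reference model for the test oracle, not the SUT's implementation. The class and function names are hypothetical, the 25 % / 15 % defaults are the example values from R5 and R6, the hard floor taking precedence over the RTL floor is an assumption, and the `authenticated` flag is a placeholder because the real authentication semantics are deferred to Q9.

```python
from typing import Optional


def battery_failsafe(pct: float, rtl_floor: float = 25.0,
                     hard_floor: float = 15.0) -> Optional[str]:
    """Failsafe reason for one battery sample; the hard floor wins (assumption)."""
    if pct <= hard_floor:
        return "battery_hard_floor"
    if pct <= rtl_floor:
        return "battery_rtl_floor"
    return None


class LandNowLatch:
    """Land-now stays active until an authenticated override arrives."""

    def __init__(self) -> None:
        self.active = False

    def on_battery(self, pct: float, hard_floor: float = 15.0) -> None:
        if pct <= hard_floor:
            self.active = True

    def on_override(self, authenticated: bool) -> bool:
        """Return True when the override is accepted and land-now is cancelled."""
        if self.active and authenticated:
            self.active = False
            return True
        return False
```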

---

### NFT-RES-R7: Airframe link exhaustion → health red after max-retry
**Summary**: When MAVLink commands fail through the configured bounded-retry budget (no airframe response), the airframe-link dependency MUST flip health red.
**Traces to**: AC `Reliability & Safety — MAVLink command exhaustion (bounded retry with exponential backoff fails through max-retry) / R7`.
**Tier**: B + E.

**Preconditions**:
- SUT mid-flight; max-retry configured (e.g., 5 attempts; exponential backoff base 100 ms).

**Fault injection**:
- `mavlink-sitl` configured to drop all command-ack messages for the duration of the test (peer non-responsive).

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | SUT issues a MAVLink command (e.g., waypoint upload) | command sent; no ack received |
| 2 | Backoff + retry loop executes through max-retry | retries observed on the wire with exponential backoff |
| 3 | After final retry exhausts | health.airframe_link transitions to red; operator-stream emits a `dependency_degraded` event with reason `airframe_link_retry_exhausted` |

**Pass criteria**: `exact (health.airframe_link == "red")` after max-retry; retries observed with backoff base 100 ms ± 20 ms.
**Test status**: DEFERRED — `<DEFERRED: airframe link command + bounded retry/backoff with peer not responding through max-retries>`.
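
One way to turn "retries observed with exponential backoff" into an assertion is to compare the gaps between wire timestamps with an expected schedule. The sketch below rests on several assumptions that the scenario does not pin down: a doubling factor of 2, "5 attempts" read as five retries after the initial send, and the ± 20 ms tolerance on the 100 ms base scaled proportionally (20 %) to the later, longer delays. The function names are hypothetical.

```python
from typing import List, Sequence


def expected_backoff_ms(base_ms: float = 100.0, max_retries: int = 5,
                        factor: float = 2.0) -> List[float]:
    """Delay before each retry: base, base*factor, base*factor**2, ..."""
    return [base_ms * factor ** i for i in range(max_retries)]


def retries_match(send_times_ms: Sequence[float], base_ms: float = 100.0,
                  max_retries: int = 5, rel_tol: float = 0.2) -> bool:
    """Check wire timestamps (initial send first, then one entry per retry).

    Passes only when exactly max_retries retries were seen and every gap is
    within rel_tol of its expected delay.
    """
    gaps = [b - a for a, b in zip(send_times_ms, send_times_ms[1:])]
    expected = expected_backoff_ms(base_ms, max_retries)
    if len(gaps) != len(expected):
        return False  # too few retries, or an unbounded loop kept going
    return all(abs(g - e) <= rel_tol * e for g, e in zip(gaps, expected))
```

Requiring an exact retry count makes the same check cover the bounded-behaviour invariant: a sixth retry on the wire fails the scenario just as a missing one does.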

---

### NFT-RES-R8: Wall-clock drift > 200 ms → time-source yellow
**Summary**: When wall-clock drift versus GPS or NTP source exceeds 200 ms, the time-source dependency MUST report yellow, AND `clock_source` + `last_sync_at` MUST reflect the drift.
**Traces to**: AC `Reliability & Safety — Wall-clock drift greater than 200 ms / R8`, RESTRICT `Wall-clock MUST be bound to GPS time once GPS is locked, or NTP at boot`.
**Tier**: B.

**Preconditions**:
- SUT running with `time-injector` LD_PRELOAD active.
- GPS source initially locked via `mavlink-sitl` GPS_RAW_INT messages.

**Fault injection**:
- `time-injector` advances the SUT process clock by 250 ms over a 1 s window while keeping GPS source locked.

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Bind clock to GPS at boot | health.time_source == green; `clock_source == "gps"`; `last_sync_at` recent |
| 2 | Inject 250 ms drift | within 5 s health.time_source transitions to yellow; `clock_source` and `last_sync_at` updated to reflect the drift |
| 3 | Stop drift | health.time_source returns to green within the next sync cycle |

**Pass criteria**: `exact (health.time_source == "yellow")` during step 2; `exact (clock_source updated)` + `exact (last_sync_at updated)`.
**Test status**: READY (time-drift-scripts inline-authorable).
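
The drift threshold and the injected ramp can be modelled with integer milliseconds, which avoids floating-point edge cases at exactly 200 ms. This is a minimal sketch: both function names are hypothetical and the linear ramp is an assumption, since the scenario states only "250 ms over a 1 s window". The strict `>` comparison follows the "greater than 200 ms" wording of the AC.

```python
def injected_drift_ms(elapsed_ms: int, total_ms: int = 250,
                      window_ms: int = 1000) -> int:
    """Drift applied so far, assuming a linear ramp that stops at window_ms."""
    clamped = min(max(elapsed_ms, 0), window_ms)
    return clamped * total_ms // window_ms


def time_source_status(wall_clock_ms: int, reference_ms: int,
                       threshold_ms: int = 200) -> str:
    """'yellow' when absolute drift exceeds the threshold, otherwise 'green'."""
    drift_ms = abs(wall_clock_ms - reference_ms)
    return "yellow" if drift_ms > threshold_ms else "green"
```

Under the linear-ramp assumption the drift reaches exactly 200 ms at 800 ms into the window, which is still green, and crosses into yellow shortly after, well inside the 5 s detection bound in step 2.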

---

### NFT-RES-R9: Geofence EXCLUSION crossing → waypoint refusal + RTL
**Summary**: When a simulated waypoint crosses an EXCLUSION polygon, the SUT MUST refuse the waypoint AND trigger RTL. Symmetric behaviour for INCLUSION violations.
**Traces to**: AC `Reliability & Safety — Geofence INCLUSION and EXCLUSION violations MUST both result in waypoint refusal + RTL / R9`, RESTRICT `Geofence enforcement MUST be symmetric`.
**Tier**: B + E.

**Preconditions**:
- SUT mid-flight; geofence INCLUSION + EXCLUSION polygons loaded as part of the mission.

**Fault injection**:
- Scripted waypoint upload that crosses the EXCLUSION polygon, followed by a second upload whose waypoint exits the INCLUSION polygon.

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Upload waypoint crossing EXCLUSION polygon | SUT refuses the waypoint; structured-log WARN with `geofence_violation_exclusion`; RTL command observed on `mavlink-sitl` |
| 2 | Reset; upload waypoint exiting the INCLUSION polygon | identical behaviour — refused + RTL |

**Pass criteria**: `exact (waypoint rejected)` + `exact (RTL command observed)` for both EXCLUSION and INCLUSION cases.
**Test status**: DEFERRED — `<DEFERRED: geofence EXCLUSION polygon crossed by simulated waypoint via mavlink-sitl scripted mission>`.
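
When authoring the deferred fixture, a small reference check helps confirm that a scripted waypoint really violates the intended polygon. The sketch below uses the standard ray-casting point-in-polygon test on planar coordinates. It is a simplification and labelled as such: it tests the waypoint position only, not the flight segment leading to it, and ignores geodetic effects. The function names are hypothetical, and `geofence_violation_inclusion` is an assumed counterpart to the `geofence_violation_exclusion` log key named in step 1.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(pt: Point, poly: Sequence[Point]) -> bool:
    """Ray casting: count polygon edges crossed by a ray going right from pt."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through pt
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def geofence_violation(pt: Point, inclusion: Sequence[Point],
                       exclusions: List[Sequence[Point]]) -> Optional[str]:
    """Return a violation key, or None when the waypoint is acceptable."""
    if not point_in_polygon(pt, inclusion):
        return "geofence_violation_inclusion"  # assumed name
    if any(point_in_polygon(pt, ex) for ex in exclusions):
        return "geofence_violation_exclusion"
    return None
```

Either non-`None` result maps to the same expected outcome (waypoint refused plus RTL), which is the symmetry the RESTRICT requires.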

---

### NFT-RES-Mp2: Map-pull timeout → cache-fallback (functional coverage in FT-N-003)
**Summary**: When the pre-flight map pull times out, the SUT falls back to last-known cached MapObjects and surfaces `map_sync == "cached_fallback"` with an operator-ack gate. (Functional gate semantics are tested in `blackbox-tests.md → FT-N-003`; this scenario adds the **timing+recovery** dimension.)
**Traces to**: AC `Map Reconciliation — Cache-fallback on timeout / Mp2`.
**Tier**: B.

**Preconditions**:
- `autopilot-state` seeded with a known prior MapObjects snapshot.
- `missions-mock` configured to time out on `GET /missions/{id}/mapobjects` for a configurable duration.

**Fault injection**:
- `missions-mock` returns 504 / silent timeout for 60 s; then responds normally.

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Trigger BIT | SUT issues `GET /missions/{id}/mapobjects`; observes timeout (per its configured request timeout); within 5 s falls back to cached snapshot; `map_sync == "cached_fallback"`; BIT requires explicit operator ack (see FT-N-003) |
| 2 | Mock recovers (responds normally) | next periodic resync re-attempts; once successful, `map_sync == "live"`; structured-log INFO `map_resync_recovered` |

**Pass criteria**: `exact (cached_fallback within 5 s of timeout)`; recovery within the next resync cycle.
**Test status**: READY (the only external fixture, `mission-suite-fixture` for the cached snapshot seed, is DEFERRED, but a minimal cached snapshot can be authored inline).
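
The two steps above describe a small decision: live data when the pull succeeds, cached data plus an operator-ack requirement when it times out. The sketch below models that decision for use as a test oracle. The function name, the tuple shape, and the use of Python's `TimeoutError` to represent the mock's timeout are assumptions. What happens when no cached snapshot exists is not specified in this scenario, so the sketch simply re-raises.

```python
from typing import Callable, List, Optional, Tuple

MapObjects = List[dict]


def resolve_map_sync(fetch: Callable[[], MapObjects],
                     cached: Optional[MapObjects]
                     ) -> Tuple[str, MapObjects, bool]:
    """Return (map_sync, objects, operator_ack_required)."""
    try:
        return "live", fetch(), False
    except TimeoutError:
        if cached is None:
            raise  # no snapshot to fall back to; outcome not specified here
        return "cached_fallback", cached, True
```

Calling the same function again after the mock recovers yields `"live"`, which corresponds to the resync in step 2.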

---

### NFT-RES-Mp4: Post-flight map-push 5xx → persist + bounded retry + operator warning
**Summary**: When the post-flight `POST /missions/{id}/mapobjects` returns 5xx, the pending diff MUST be persisted on durable on-device storage, an operator-visible warning MUST surface, AND bounded retry MUST execute (capped at the configured retry limit).
**Traces to**: AC `Map Reconciliation — Failure MUST persist the pending diff to durable on-device storage with bounded retry / Mp4`, RESTRICT `On-device storage MUST be bounded`.
**Tier**: B + E.

**Preconditions**:
- SUT post-landing; pending diff ready to push.
- `missions-mock` configured to return 5xx N times then 200.

**Fault injection**:
- `missions-mock` returns 503 for the first N attempts (N = configured retry-cap + 1); then returns 200.

| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Trigger post-flight reconciliation | SUT issues `POST /missions/{id}/mapobjects`; receives 503 |
| 2 | Observe persistence | pending diff file exists under `autopilot-state/pending_map_diff/<mission-id>.json`; size > 0 |
| 3 | Observe operator-stream | warning event `map_push_failed` surfaced |
| 4 | Observe retry loop | retries observed within the configured cap; backoff with jitter |
| 5 | After retry-cap reached without success | SUT stops retrying; pending file remains for next session pickup |
| 6 | Eventual success (mock returns 200) | next attempt succeeds; pending file removed; warning cleared |

**Pass criteria**: `exact (pending file exists)` + `exact (warning surfaced)` + `threshold_max (retries ≤ configured cap)`.
**Test status**: DEFERRED — `<DEFERRED: same fixture as Mp3 (60-minute pass diff)>`.
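
The persist-then-retry contract in steps 1 to 6 can be modelled as below. This is a sketch under stated assumptions, not the SUT's code: the function name and parameters are hypothetical, the diff is written before the first POST (the scenario only requires that it exists once a failure is observed), the backoff doubles from `base_s` with ± 50 % jitter, and `write_text` does not fsync, which a real durable write would need. The `post` and `sleep` callables are injected so the model runs without a network or real delays.

```python
import json
import random
from pathlib import Path
from typing import Callable


def push_with_bounded_retry(post: Callable[[dict], int], diff: dict,
                            pending_file: Path, retry_cap: int,
                            sleep: Callable[[float], None],
                            base_s: float = 1.0) -> bool:
    """Persist the diff, then try one POST plus at most retry_cap retries.

    Returns True on a 2xx status (pending file removed) and False when the
    cap is exhausted (pending file kept for the next session).
    """
    pending_file.parent.mkdir(parents=True, exist_ok=True)
    pending_file.write_text(json.dumps(diff))  # no fsync in this sketch
    for attempt in range(retry_cap + 1):
        if attempt > 0:
            # Exponential backoff with jitter; factor and range are assumptions.
            sleep(base_s * 2 ** (attempt - 1) * random.uniform(0.5, 1.5))
        status = post(diff)
        if 200 <= status < 300:
            pending_file.unlink()
            return True
    return False
```

With a mock that returns 503 for retry-cap + 1 attempts, the function returns `False` after exactly that many POSTs and leaves the file in place, matching step 5. A later call against a recovered mock removes the file, matching step 6.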

---

## Recovery-time invariants common to every scenario

- **No silent error swallowing.** Every fault scenario MUST observe a corresponding structured-log entry at WARN+ AND a corresponding health-endpoint transition. A fault that the SUT handles without surfacing through both channels is a TEST FAILURE per `security_approach.md → "No silent error swallowing for security-relevant failures"` (extended here to operational faults per `coderule.mdc → "Never suppress errors silently"`).
- **Bounded behaviour.** Every retry/backoff loop MUST be bounded — the scenario asserts the cap on retry count and the cap on backoff window. Open-ended retry is a test failure.
- **State integrity post-recovery.** After fault recovery (when applicable), the scenario asserts that the SUT returns to a known state — mode unchanged unless the fault legitimately altered it (e.g., RTL stays RTL until operator override).
- **Symmetry assertions.** R9 explicitly tests both INCLUSION and EXCLUSION because the AC names symmetric behaviour. Wherever an AC pairs two outcomes (`fail-fast` + `fail-closed`, `red` + `yellow`, etc.), the resilience scenario MUST cover both halves.
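
The first invariant lends itself to a shared helper that every scenario can call at teardown. The sketch below is one possible shape, assuming each injected fault carries an identifier that also appears in the log entries and health transitions it caused. That keying scheme, the level names in `WARN_PLUS`, and the function name are all hypothetical.

```python
from typing import Iterable, List, Tuple

WARN_PLUS = {"WARN", "ERROR"}  # hypothetical names for WARN-and-above levels


def unsurfaced_faults(fault_ids: Iterable[str],
                      log_entries: Iterable[Tuple[str, str]],
                      health_transitions: Iterable[str]) -> List[str]:
    """Faults missing a WARN+ log entry, a health transition, or both.

    log_entries holds (fault_id, level) pairs; health_transitions holds the
    fault ids for which the health endpoint changed state.
    """
    logged = {fid for fid, level in log_entries if level in WARN_PLUS}
    transitioned = set(health_transitions)
    return [f for f in fault_ids if f not in logged or f not in transitioned]
```

A scenario passes this invariant only when the returned list is empty. A fault logged at INFO, or one with a health transition but no log entry, shows up in the list and fails the run.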