Resilience Tests
Authored by /test-spec Phase 2 (2026-05-19). Resilience tests inject a fault, observe behaviour during the fault, observe recovery behaviour, and assert against both. The fault and the recovery contract are both quantifiable.
BIT pre-flight pathways (positive R1, negatives R2/R3) are in blackbox-tests.md because they assert a functional gate. The runtime fault scenarios live here.
NFT-RES-R4: Lost operator/Ground-Station link → RTL at 30 s grace (default)
Summary: Sustained loss of the operator/Ground-Station radio link MUST trigger an RTL when the configured grace window (default 30 s) expires, within the ± 1 s test tolerance, and operator-link health MUST flip red.
Traces to: AC Reliability & Safety — Loss of operator/Ground-Station radio link MUST trigger a known mission-safe outcome / R4, RESTRICT Reliability & Safety — Lost operator-link failsafe MUST be deterministic and bounded.
Tier: B + E.
Preconditions:
- SUT mid-flight (scripted MAVLink stream + active operator session).
- Operator session in steady state for ≥ 30 s before fault injection.
- Grace window configured to default 30 s.
Fault injection:
- operator-replay issues a lost-link event at T=0 and STAYS silent (no reconnect) for the remainder of the window.
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Inject lost-link event at T=0 | health endpoint immediately shows deps.operator_link == "red"; last_seen_at frozen |
| 2 | Wait 25 s (within grace) | NO RTL command yet on mavlink-sitl; SUT continues mission |
| 3 | Wait until T=30 s | RTL command observed on mavlink-sitl at T = 30 s ± 1 s; operator-stream emits a failsafe_triggered event with reason operator_link_lost |
| 4 | Optionally reconnect operator-replay after RTL | RTL persists (operator cannot un-RTL silently — requires explicit operator override per AC); health.operator_link transitions back to green when traffic resumes |
Pass criteria: exact (RTL command observed at T = 30 s ± 1 s) + exact (operator-link health == "red").
Recovery time bound: RTL must be issued within 31 s of fault start.
Test status: READY (operator-session-scripts inline-authorable; mavlink-sitl runs an ArduPilot SITL accepting RTL).
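The grace-window contract in R4 can be sketched as a pure decision function plus a harness-side tolerance check. This is a minimal illustration only: `LinkAction`, `evaluate_operator_link`, and `rtl_within_tolerance` are hypothetical names, not taken from the SUT, and the sketch assumes the failsafe is driven by the time elapsed since the last operator message.

```rust
use std::time::Duration;

/// Outcome of evaluating the operator-link failsafe at a given instant.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
pub enum LinkAction {
    /// Still inside the grace window: keep flying the mission.
    ContinueMission,
    /// Grace window elapsed with no operator traffic: command RTL.
    ReturnToLaunch,
}

/// `silent_for` is the time since the last operator message;
/// `grace` is the configured grace window (default 30 s in R4).
pub fn evaluate_operator_link(silent_for: Duration, grace: Duration) -> LinkAction {
    if silent_for >= grace {
        LinkAction::ReturnToLaunch
    } else {
        LinkAction::ContinueMission
    }
}

/// Harness-side check for step 3: the RTL must be observed inside
/// `grace - tolerance ..= grace + tolerance` (30 s +/- 1 s by default).
pub fn rtl_within_tolerance(observed: Duration, grace: Duration, tolerance: Duration) -> bool {
    let earliest = grace.saturating_sub(tolerance);
    let latest = grace + tolerance;
    observed >= earliest && observed <= latest
}
```

With the default configuration, a sample taken 25 s into the outage yields `ContinueMission` (step 2), and an RTL observed at 30.8 s passes the tolerance check while one at 31.5 s does not.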
NFT-RES-R5: Battery at RTL-floor → RTL
Summary: When the airframe battery sample drops to the configured RTL floor (e.g. 25 %), the SUT MUST issue an RTL and health MUST surface yellow.
Traces to: AC Reliability & Safety — Battery at or below the configured RTL floor / R5.
Tier: B + E.
Preconditions:
- SUT mid-flight; battery telemetry replayed via mavlink-sitl at 1 Hz.
Fault injection:
- mavlink-sitl scripted battery curve: starts at 80 %; ramps down to 25 % at T=T0; held at 25 % afterwards.
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | At T=T0, battery reads 25 % | within 1 sample period (1 s) the SUT issues RTL on mavlink-sitl; health.overall transitions to "yellow"; operator-stream emits failsafe_triggered with reason battery_rtl_floor |
| 2 | Battery continues at 25 % | RTL persists; no oscillation |
Pass criteria: exact (RTL command observed) + exact (health.overall == "yellow").
Test status: DEFERRED — <DEFERRED: mid-flight battery sample at RTL-floor via mavlink-sitl battery curve script>.
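The "RTL persists; no oscillation" expectation in step 2 implies some form of latching once the floor is reached. The following is a minimal sketch under that assumption; `BatteryRtlLatch` and its methods are hypothetical names, and the latch itself is one possible way to satisfy the scenario, not a description of the SUT.

```rust
/// Action the battery failsafe asks for after each sample.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BatteryAction {
    Continue,
    ReturnToLaunch,
}

/// Hypothetical latch: once a sample reaches the RTL floor, the RTL
/// decision sticks even if later samples read slightly higher.
pub struct BatteryRtlLatch {
    rtl_floor_pct: u8,
    latched: bool,
}

impl BatteryRtlLatch {
    pub fn new(rtl_floor_pct: u8) -> Self {
        Self { rtl_floor_pct, latched: false }
    }

    /// Feed one battery sample (percent) and get the resulting action.
    pub fn on_sample(&mut self, pct: u8) -> BatteryAction {
        if pct <= self.rtl_floor_pct {
            self.latched = true;
        }
        if self.latched {
            BatteryAction::ReturnToLaunch
        } else {
            BatteryAction::Continue
        }
    }
}
```

Feeding the scripted curve (80 %, then 25 %) flips the action to `ReturnToLaunch` on the first sample at the floor; a later noisy reading of 27 % does not revert it.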
NFT-RES-R6: Battery at hard floor → land-now
Summary: When the battery hits the configured hard floor (e.g. 15 %), the SUT MUST issue land-now and ONLY an authenticated operator command may override.
Traces to: AC Reliability & Safety — battery at or below the hard floor / R6.
Tier: B + E.
Preconditions:
- SUT mid-flight; battery ramps to 15 %.
Fault injection: same as R5 but ramp continues to 15 %.
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | At T=T0, battery reads 15 % | within 1 sample period the SUT issues land-now (MAV_CMD_NAV_LAND or equivalent) on mavlink-sitl; health red; operator-stream emits failsafe_triggered with reason battery_hard_floor |
| 2 | Replay an UNAUTHENTICATED operator-override command | SUT REFUSES; land-now persists |
| 3 | Replay an AUTHENTICATED operator-override (placeholder until Q9; full once Q9 resolves) | land-now cancelled; SUT returns to prior mode |
Pass criteria: exact (land_now observed); exact (refusal of unauthenticated override); exact (acceptance of authenticated override).
Test status: DEFERRED — same fixture gap as R5; step 3's full authentication semantics also <DEFERRED: Q9>.
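The override gate in steps 2 and 3 can be sketched as a small state machine. Everything here is hypothetical: the `authenticated` flag is a stand-in for whatever Q9 eventually defines, and `HardFloorFailsafe`, `Mode`, and `OverrideCommand` are illustrative names only.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Mission,
    LandNow,
}

/// Placeholder for an operator override; real authentication
/// semantics are pending Q9, so a boolean stands in for them here.
pub struct OverrideCommand {
    pub authenticated: bool,
}

pub struct HardFloorFailsafe {
    hard_floor_pct: u8,
    mode: Mode,
    prior: Mode,
}

impl HardFloorFailsafe {
    pub fn new(hard_floor_pct: u8) -> Self {
        Self { hard_floor_pct, mode: Mode::Mission, prior: Mode::Mission }
    }

    pub fn mode(&self) -> Mode {
        self.mode
    }

    /// Enter land-now when a sample is at or below the hard floor,
    /// remembering the mode that was active before.
    pub fn on_sample(&mut self, pct: u8) {
        if pct <= self.hard_floor_pct && self.mode != Mode::LandNow {
            self.prior = self.mode;
            self.mode = Mode::LandNow;
        }
    }

    /// Returns true only when the override is accepted; an
    /// unauthenticated command leaves land-now in force.
    pub fn on_override(&mut self, cmd: &OverrideCommand) -> bool {
        if self.mode == Mode::LandNow && cmd.authenticated {
            self.mode = self.prior;
            true
        } else {
            false
        }
    }
}
```

Note that in this sketch the next sample at or below the floor would re-enter land-now after an accepted override; the scenario does not specify that behaviour, so it is left as an open design point.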
NFT-RES-R7: Airframe link exhaustion → health red after max-retry
Summary: When MAVLink commands fail through the configured bounded-retry budget (no airframe response), the airframe-link dependency MUST flip health red.
Traces to: AC Reliability & Safety — MAVLink command exhaustion (bounded retry with exponential backoff fails through max-retry) / R7.
Tier: B + E.
Preconditions:
- SUT mid-flight; max-retry configured (e.g., 5 attempts; exponential backoff base 100 ms).
Fault injection:
- mavlink-sitl configured to drop all command-ack messages for the duration of the test (peer non-responsive).
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | SUT issues a MAVLink command (e.g., waypoint upload) | command sent; no ack received |
| 2 | Backoff + retry loop executes through max-retry | retries observed on the wire with exponential backoff |
| 3 | After final retry exhausts | health.airframe_link transitions to red; operator-stream emits a dependency_degraded event with reason airframe_link_retry_exhausted |
Pass criteria: exact (health.airframe_link == "red") after max-retry; retries observed with backoff base 100 ms ± 20 ms.
Test status: DEFERRED — <DEFERRED: airframe link command + bounded retry/backoff with peer not responding through max-retries>.
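To make the retry budget concrete, the sketch below computes the delays a harness might expect on the wire and checks an observed delay against the ± 20 ms tolerance. It assumes a doubling factor per attempt, which the scenario does not state (only "exponential backoff base 100 ms"); all function names are hypothetical.

```rust
use std::time::Duration;

/// Expected delay before each attempt, assuming the delay doubles
/// every time (the doubling factor is an assumption of this sketch).
pub fn backoff_schedule(base: Duration, max_attempts: u32) -> Vec<Duration> {
    (0..max_attempts)
        .map(|attempt| base.saturating_mul(2u32.saturating_pow(attempt)))
        .collect()
}

/// True when `observed` is within `tolerance` of `expected`.
pub fn within_tolerance(observed: Duration, expected: Duration, tolerance: Duration) -> bool {
    let diff = if observed > expected {
        observed - expected
    } else {
        expected - observed
    };
    diff <= tolerance
}

/// The airframe-link dependency goes red once every attempt in the
/// budget has gone unacknowledged.
pub fn airframe_link_is_red(unacked_attempts: u32, max_attempts: u32) -> bool {
    unacked_attempts >= max_attempts
}
```

For the example configuration (5 attempts, base 100 ms) the schedule is 100, 200, 400, 800, and 1600 ms, so the whole loop is bounded at roughly 3.1 s of backoff before health flips red.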
NFT-RES-R8: Wall-clock drift > 200 ms → time-source yellow
Summary: When wall-clock drift versus GPS or NTP source exceeds 200 ms, the time-source dependency MUST report yellow, AND clock_source + last_sync_at MUST reflect the drift.
Traces to: AC Reliability & Safety — Wall-clock drift greater than 200 ms / R8, RESTRICT Wall-clock MUST be bound to GPS time once GPS is locked, or NTP at boot.
Tier: B.
Preconditions:
- SUT running with time-injector LD_PRELOAD active.
- GPS source initially locked via mavlink-sitl GPS_RAW_INT messages.
Fault injection:
- time-injector advances the SUT process clock by 250 ms over a 1 s window while keeping GPS source locked.
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Bind clock to GPS at boot | health.time_source == green; clock_source == "gps"; last_sync_at recent |
| 2 | Inject 250 ms drift | within 5 s health.time_source transitions to yellow; clock_source and last_sync_at updated to reflect the drift |
| 3 | Stop drift | health.time_source returns to green within the next sync cycle |
Pass criteria: exact (health.time_source == "yellow") during step 2; exact (clock_source updated) + exact (last_sync_at updated).
Test status: READY (time-drift-scripts inline-authorable).
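The 200 ms threshold can be expressed as a small classifier. This is a sketch only: `time_source_health` and `Health` are hypothetical names, and it assumes drift is measured as a signed millisecond offset between the wall clock and the reference source, with the threshold being strictly "greater than 200 ms" as the scenario title says.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Health {
    Green,
    Yellow,
}

/// Drift threshold from the R8 scenario.
pub const DRIFT_THRESHOLD_MS: u64 = 200;

/// Classify the time-source dependency from a signed drift in
/// milliseconds (wall clock minus GPS/NTP reference). Drift in either
/// direction counts; exactly 200 ms is still green.
pub fn time_source_health(drift_ms: i64) -> Health {
    if drift_ms.unsigned_abs() > DRIFT_THRESHOLD_MS {
        Health::Yellow
    } else {
        Health::Green
    }
}
```

The injected 250 ms drift in step 2 classifies as `Yellow`; once the drift stops and the offset falls back to 200 ms or less, the classifier returns `Green` again.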
NFT-RES-R9: Geofence EXCLUSION crossing → waypoint refusal + RTL
Summary: When a simulated waypoint crosses an EXCLUSION polygon, the SUT MUST refuse the waypoint AND trigger RTL. Symmetric behaviour for INCLUSION violations.
Traces to: AC Reliability & Safety — Geofence INCLUSION and EXCLUSION violations MUST both result in waypoint refusal + RTL / R9, RESTRICT Geofence enforcement MUST be symmetric.
Tier: B + E.
Preconditions:
- SUT mid-flight; geofence INCLUSION + EXCLUSION polygons loaded as part of the mission.
Fault injection:
- Scripted waypoint upload that crosses the EXCLUSION polygon; subsequently INCLUSION-exit test.
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Upload waypoint crossing EXCLUSION polygon | SUT refuses the waypoint; structured-log WARN with geofence_violation_exclusion; RTL command observed on mavlink-sitl |
| 2 | Reset; upload waypoint exiting the INCLUSION polygon | identical behaviour — refused + RTL |
Pass criteria: exact (waypoint rejected) + exact (RTL command observed) for both EXCLUSION and INCLUSION cases.
Test status: DEFERRED — <DEFERRED: geofence EXCLUSION polygon crossed by simulated waypoint via mavlink-sitl scripted mission>.
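The symmetric refusal rule can be illustrated with a point-in-polygon test. The sketch makes two simplifying assumptions that are not in the source: waypoints and polygons are treated as planar x/y coordinates rather than geodetic positions, and only the waypoint itself is checked, not the path leading to it. `Point`, `Verdict`, `Violation`, `contains`, and `check_waypoint` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Point {
    pub x: f64,
    pub y: f64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Violation {
    Exclusion,
    Inclusion,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Accept,
    /// Waypoint refused and RTL commanded, for either kind of violation.
    RefuseAndRtl(Violation),
}

/// Ray-casting point-in-polygon test on planar coordinates.
pub fn contains(polygon: &[Point], p: Point) -> bool {
    if polygon.len() < 3 {
        return false;
    }
    let mut inside = false;
    let mut j = polygon.len() - 1;
    for i in 0..polygon.len() {
        let (a, b) = (polygon[i], polygon[j]);
        // Does edge a-b straddle the horizontal line through p?
        if (a.y > p.y) != (b.y > p.y) {
            let x_cross = (b.x - a.x) * (p.y - a.y) / (b.y - a.y) + a.x;
            if p.x < x_cross {
                inside = !inside;
            }
        }
        j = i;
    }
    inside
}

/// Same outcome (refuse + RTL) whether the waypoint lands inside an
/// EXCLUSION polygon or outside the INCLUSION polygon.
pub fn check_waypoint(wp: Point, inclusion: &[Point], exclusions: &[Vec<Point>]) -> Verdict {
    if exclusions.iter().any(|zone| contains(zone, wp)) {
        return Verdict::RefuseAndRtl(Violation::Exclusion);
    }
    if !contains(inclusion, wp) {
        return Verdict::RefuseAndRtl(Violation::Inclusion);
    }
    Verdict::Accept
}
```

Both violation kinds return the same `RefuseAndRtl` variant, which mirrors the "identical behaviour" expectation in step 2; the inner `Violation` only distinguishes the log reason.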
NFT-RES-Mp2: Map-pull timeout → cache-fallback (functional coverage in FT-N-003)
Summary: When the pre-flight map pull times out, the SUT falls back to last-known cached MapObjects and surfaces map_sync == "cached_fallback" with an operator-ack gate. (Functional gate semantics are tested in blackbox-tests.md → FT-N-003; this scenario adds the timing+recovery dimension.)
Traces to: AC Map Reconciliation — Cache-fallback on timeout / Mp2.
Tier: B.
Preconditions:
- autopilot-state seeded with a known prior MapObjects snapshot.
- missions-mock configured to time out on GET /missions/{id}/mapobjects for a configurable duration.
Fault injection:
- missions-mock returns 504 / silent timeout for 60 s; then responds normally.
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Trigger BIT | SUT issues GET /missions/{id}/mapobjects; observes timeout (per its configured request timeout); within 5 s falls back to cached snapshot; map_sync == "cached_fallback"; BIT requires explicit operator ack (see FT-N-003) |
| 2 | Mock recovers (responds normally) | next periodic resync re-attempts; once successful, map_sync == "live"; structured-log INFO map_resync_recovered |
Pass criteria: exact (cached_fallback within 5 s of timeout); recovery within the next resync cycle.
Test status: READY. The only external fixture is mission-suite-fixture (DEFERRED), which seeds the cached snapshot; a minimal-scale cached snapshot can be authored inline instead.
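The two map_sync states and the operator-ack gate can be sketched as a mapping from the pull result. This assumes a cached snapshot is always available (as the preconditions guarantee); behaviour with an empty cache is not covered by the scenario. `MapSync`, `PullResult`, `MapStatus`, and `after_pull` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum MapSync {
    Live,
    CachedFallback,
}

/// Result of one GET /missions/{id}/mapobjects attempt.
pub enum PullResult {
    Fresh,
    TimedOut,
}

#[derive(Debug, PartialEq, Eq)]
pub struct MapStatus {
    pub map_sync: MapSync,
    /// BIT may only proceed on a cached snapshot after explicit operator ack.
    pub operator_ack_required: bool,
}

/// Map a pull result to the status the health surface should report.
/// Assumes a cached snapshot exists, per the scenario preconditions.
pub fn after_pull(result: PullResult) -> MapStatus {
    match result {
        PullResult::Fresh => MapStatus {
            map_sync: MapSync::Live,
            operator_ack_required: false,
        },
        PullResult::TimedOut => MapStatus {
            map_sync: MapSync::CachedFallback,
            operator_ack_required: true,
        },
    }
}
```

Step 1 corresponds to `after_pull(PullResult::TimedOut)`, and the recovery in step 2 to a later `after_pull(PullResult::Fresh)` on the next resync cycle.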
NFT-RES-Mp4: Post-flight map-push 5xx → persist + bounded retry + operator warning
Summary: When the post-flight POST /missions/{id}/mapobjects returns 5xx, the pending diff MUST be persisted on durable on-device storage, an operator-visible warning MUST surface, AND bounded retry MUST execute (capped at the configured retry limit).
Traces to: AC Map Reconciliation — Failure MUST persist the pending diff to durable on-device storage with bounded retry / Mp4, RESTRICT On-device storage MUST be bounded.
Tier: B + E.
Preconditions:
- SUT post-landing; pending diff ready to push.
- missions-mock configured to return 5xx N times then 200.
Fault injection:
- missions-mock returns 503 for the first N attempts (N = configured retry-cap + 1); then returns 200.
| Step | Action | Expected Behaviour |
|---|---|---|
| 1 | Trigger post-flight reconciliation | SUT issues POST /missions/{id}/mapobjects; receives 503 |
| 2 | Observe persistence | pending diff file exists under autopilot-state/pending_map_diff/<mission-id>.json; size > 0 |
| 3 | Observe operator-stream | warning event map_push_failed surfaced |
| 4 | Observe retry loop | retries observed within the configured cap; backoff with jitter |
| 5 | After retry-cap reached without success | SUT stops retrying; pending file remains for next session pickup |
| 6 | Eventual success (mock returns 200) | next attempt succeeds; pending file removed; warning cleared |
Pass criteria: exact (pending file exists) + exact (warning surfaced) + threshold_max (retries ≤ configured cap).
Test status: DEFERRED — <DEFERRED: same fixture as Mp3 (60-minute pass diff)>.
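The retry cap and the persist-then-retry ordering can be sketched with a closure standing in for the POST. This is an illustration under assumptions: the initial attempt plus `retry_cap` retries makes `retry_cap + 1` attempts in total (matching N in the fault injection), persistence and the warning are modelled as flags rather than real file or stream writes, and backoff with jitter is omitted. `push_with_bounded_retry` and `PushOutcome` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct PushOutcome {
    /// Total POST attempts made in this session (initial + retries).
    pub attempts: u32,
    pub succeeded: bool,
    /// True while the pending diff is still on durable storage.
    pub pending_persisted: bool,
    /// True while the operator-visible warning is active.
    pub warning_active: bool,
}

/// `post` returns the HTTP status of one attempt. The loop stops on
/// the first 2xx or after the initial attempt plus `retry_cap` retries.
pub fn push_with_bounded_retry<F: FnMut() -> u16>(mut post: F, retry_cap: u32) -> PushOutcome {
    let mut attempts = 0;
    let mut pending_persisted = false;
    let mut warning_active = false;
    while attempts <= retry_cap {
        attempts += 1;
        let status = post();
        if (200..300).contains(&status) {
            // Success: pending file removed, warning cleared.
            return PushOutcome {
                attempts,
                succeeded: true,
                pending_persisted: false,
                warning_active: false,
            };
        }
        // Failure: persist the diff and surface the warning before retrying.
        pending_persisted = true;
        warning_active = true;
    }
    PushOutcome { attempts, succeeded: false, pending_persisted, warning_active }
}
```

With a cap of 3 and a mock that fails the first 4 attempts, the first session ends after 4 attempts with the diff still persisted (step 5); a second session picks it up and succeeds on its first attempt (step 6).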
Recovery-time invariants common to every scenario
- No silent error swallowing. Every fault scenario MUST observe a corresponding structured-log entry at WARN+ AND a corresponding health-endpoint transition. A fault that the SUT handles without surfacing through both channels is a TEST FAILURE per security_approach.md → "No silent error swallowing for security-relevant failures" (extended here to operational faults per coderule.mdc → "Never suppress errors silently").
- Bounded behaviour. Every retry/backoff loop MUST be bounded: the scenario asserts the cap on retry count and the cap on backoff window. Open-ended retry is a test failure.
- State integrity post-recovery. After fault recovery (when applicable), the scenario asserts that the SUT returns to a known state — mode unchanged unless the fault legitimately altered it (e.g., RTL stays RTL until operator override).
- Symmetry assertions. R9 explicitly tests both INCLUSION and EXCLUSION because the AC names symmetric behaviour. Wherever an AC pairs two outcomes (fail-fast + fail-closed, red + yellow, etc.), the resilience scenario MUST cover both halves.
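The "no silent error swallowing" invariant lends itself to a shared harness assertion. A minimal sketch, assuming the harness can collect the highest log level seen during the fault window and a flag for whether the health endpoint changed; `FaultObservation`, `LogLevel`, and `surfaced_on_both_channels` are hypothetical names.

```rust
/// Log levels ordered by severity, so `>= Warn` means "WARN+".
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum LogLevel {
    Info,
    Warn,
    Error,
}

/// What the harness observed while one fault was active.
pub struct FaultObservation {
    /// Highest structured-log level attributed to the fault, if any.
    pub highest_log_level: Option<LogLevel>,
    /// Whether the health endpoint changed state during the fault.
    pub health_transitioned: bool,
}

/// A fault passes the invariant only when it surfaced through BOTH
/// channels: a WARN+ log entry and a health-endpoint transition.
pub fn surfaced_on_both_channels(obs: &FaultObservation) -> bool {
    let logged = matches!(obs.highest_log_level, Some(level) if level >= LogLevel::Warn);
    logged && obs.health_transitioned
}
```

A scenario that sees a WARN entry but no health transition, or a health transition with only INFO logging, fails the invariant.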