autopilot/_docs/02_document/tests/resilience-tests.md

Resilience Tests

Authored by /test-spec Phase 2 (2026-05-19). Resilience tests inject a fault, observe behaviour during the fault, observe recovery behaviour, and assert against both. The fault and the recovery contract are both quantifiable.

BIT pre-flight pathways (positive R1, negatives R2/R3) are in blackbox-tests.md because they assert a functional gate. The runtime fault scenarios live here.


NFT-RES-R4: Operator-link loss → RTL at grace window

Summary: Sustained loss of the operator/Ground-Station radio link MUST trigger an RTL at the configured grace window (default 30 s, within ± 1 s), and operator-link health MUST flip red. Traces to: AC Reliability & Safety: Loss of operator/Ground-Station radio link MUST trigger a known mission-safe outcome / R4, RESTRICT Reliability & Safety: Lost operator-link failsafe MUST be deterministic and bounded. Tier: B + E.

Preconditions:

  • SUT mid-flight (scripted MAVLink stream + active operator session).
  • Operator session in steady state for ≥ 30 s before fault injection.
  • Grace window configured to default 30 s.

Fault injection:

  • operator-replay issues lost-link event at T=0 and STAYS silent (no reconnect) for the remainder of the window.

| Step | Action | Expected Behaviour |
|------|--------|--------------------|
| 1 | Inject lost-link event at T=0 | health endpoint immediately shows deps.operator_link == "red"; last_seen_at frozen |
| 2 | Wait 25 s (within grace) | NO RTL command yet on mavlink-sitl; SUT continues mission |
| 3 | Wait until T=30 s | RTL command observed on mavlink-sitl at T = 30 s ± 1 s; operator-stream emits a failsafe_triggered event with reason operator_link_lost |
| 4 | Optionally reconnect operator-replay after RTL | RTL persists (operator cannot un-RTL silently; an explicit operator override is required per AC); health.operator_link transitions back to green when traffic resumes |

Pass criteria: RTL command at T = 30 s ± 1 s (exact with ± 1 s tolerance), exact operator-link red. Recovery time bound: RTL must be issued within 31 s of fault start. Test status: READY (operator-session-scripts inline-authorable; mavlink-sitl runs an ArduPilot SITL accepting RTL).
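
The timing contract above can be checked mechanically from a captured event log. The following is a minimal harness-side sketch in Python, not the project's actual test code. The event shape `(seconds_since_fault, name)`, the event name `"rtl_command"`, and the function name `check_rtl_timing` are all hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical event shape: (seconds since fault injection, event name).
Event = Tuple[float, str]


def check_rtl_timing(events: Iterable[Event],
                     grace_s: float = 30.0,
                     tol_s: float = 1.0) -> bool:
    """Return True only if the FIRST RTL command falls inside grace_s +/- tol_s.

    An RTL issued too early (inside the grace window) fails, and so does an
    RTL that is late or missing.
    """
    rtl_times = sorted(t for t, name in events if name == "rtl_command")
    if not rtl_times:
        return False  # RTL never issued: recovery-time bound violated
    first = rtl_times[0]
    return grace_s - tol_s <= first <= grace_s + tol_s
```

Checking only the first RTL command is deliberate: an early RTL at T=25 s followed by another at T=30 s would otherwise slip through.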


NFT-RES-R5: Battery at RTL-floor → RTL

Summary: When the airframe battery sample drops to the configured RTL floor (e.g. 25 %), the SUT MUST issue an RTL and health MUST surface yellow. Traces to: AC Reliability & Safety — Battery at or below the configured RTL floor / R5. Tier: B + E.

Preconditions:

  • SUT mid-flight; battery telemetry replayed via mavlink-sitl at 1 Hz.

Fault injection:

  • mavlink-sitl scripted battery curve: starts at 80 %; ramps down to 25 % at T=T0; held at 25 % afterwards.

| Step | Action | Expected Behaviour |
|------|--------|--------------------|
| 1 | At T=T0, battery reads 25 % | within 1 sample period (1 s) the SUT issues RTL on mavlink-sitl; health transitions to overall == "yellow"; operator-stream emits failsafe_triggered with reason battery_rtl_floor |
| 2 | Battery continues at 25 % | RTL persists; no oscillation |

Pass criteria: exact (RTL command observed) + exact (health.overall == "yellow"). Test status: DEFERRED — <DEFERRED: mid-flight battery sample at RTL-floor via mavlink-sitl battery curve script>.
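
The deferred fixture is essentially a scripted battery curve. As a rough illustration of what such a script could generate, here is a Python sketch that produces one sample per 1 Hz telemetry tick and locates T0. The linear ramp shape, the function names, and the parameters are assumptions; the spec only fixes the start value, the floor, and the hold.

```python
from typing import List


def battery_curve(start_pct: float, floor_pct: float,
                  ramp_samples: int, hold_samples: int) -> List[float]:
    """Linear ramp from start_pct towards floor_pct, then hold at floor_pct.

    One list entry per 1 Hz sample. The first sample at the floor is T0.
    """
    if ramp_samples < 1:
        raise ValueError("ramp_samples must be >= 1")
    step = (start_pct - floor_pct) / ramp_samples
    ramp = [start_pct - step * i for i in range(ramp_samples)]
    # hold_samples + 1 entries: the T0 sample itself plus the hold period.
    return ramp + [floor_pct] * (hold_samples + 1)


def first_index_at_or_below(samples: List[float], floor_pct: float) -> int:
    """Index (seconds at 1 Hz) of the first sample at or below the floor, or -1."""
    for i, pct in enumerate(samples):
        if pct <= floor_pct:
            return i
    return -1
```

With `battery_curve(80, 25, 11, 3)` the ramp drops 5 % per second and the floor is first reached at index 11, which is the T0 the step table refers to.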


NFT-RES-R6: Battery at hard floor → land-now

Summary: When the battery hits the configured hard floor (e.g. 15 %), the SUT MUST issue land-now and ONLY an authenticated operator command may override. Traces to: AC Reliability & Safety — battery at or below the hard floor / R6. Tier: B + E.

Preconditions:

  • SUT mid-flight; battery ramps to 15 %.

Fault injection: same as R5 but ramp continues to 15 %.

| Step | Action | Expected Behaviour |
|------|--------|--------------------|
| 1 | At T=T0, battery reads 15 % | within 1 sample period the SUT issues land-now (MAV_CMD_NAV_LAND or equivalent) on mavlink-sitl; health red; operator-stream emits failsafe_triggered with reason battery_hard_floor |
| 2 | Replay an UNAUTHENTICATED operator-override command | SUT REFUSES; land-now persists |
| 3 | Replay an AUTHENTICATED operator-override (placeholder until Q9; full once Q9 resolves) | land-now cancelled; SUT returns to prior mode |

Pass criteria: exact (land_now observed); exact (refusal of unauthenticated override); exact (acceptance of authenticated override). Test status: DEFERRED — same fixture gap as R5; step 3's full authentication semantics also <DEFERRED: Q9>.
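
Taken together, R5 and R6 describe a two-threshold decision plus an override gate. A small Python oracle for the expected outcome might look like the sketch below. The function names and return strings are hypothetical, the floors are the example values from the summaries, and the `authenticated` boolean is only a placeholder because the real authentication semantics are still open under Q9.

```python
def battery_failsafe(pct: float,
                     rtl_floor: float = 25.0,
                     hard_floor: float = 15.0) -> str:
    """Map one battery sample to the expected failsafe action.

    The hard floor takes precedence over the RTL floor.
    """
    if hard_floor >= rtl_floor:
        raise ValueError("hard_floor must be below rtl_floor")
    if pct <= hard_floor:
        return "land_now"
    if pct <= rtl_floor:
        return "rtl"
    return "none"


def action_after_override(current: str, authenticated: bool) -> str:
    """Only an authenticated override cancels land-now (placeholder for Q9)."""
    if current == "land_now" and authenticated:
        return "prior_mode"
    return current
```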


NFT-RES-R7: MAVLink command retry exhaustion → airframe-link red

Summary: When MAVLink commands fail through the configured bounded-retry budget (no airframe response), the airframe-link dependency MUST flip health red. Traces to: AC Reliability & Safety: MAVLink command exhaustion (bounded retry with exponential backoff fails through max-retry) / R7. Tier: B + E.

Preconditions:

  • SUT mid-flight; max-retry configured (e.g., 5 attempts; exponential backoff base 100 ms).

Fault injection:

  • mavlink-sitl configured to drop all command-ack messages for the duration of the test (peer non-responsive).

| Step | Action | Expected Behaviour |
|------|--------|--------------------|
| 1 | SUT issues a MAVLink command (e.g., waypoint upload) | command sent; no ack received |
| 2 | Backoff + retry loop executes through max-retry | retries observed on the wire with exponential backoff |
| 3 | After final retry exhausts | health.airframe_link transitions to red; operator-stream emits a dependency_degraded event with reason airframe_link_retry_exhausted |

Pass criteria: exact (health.airframe_link == "red") after max-retry; retries observed with backoff base 100 ms ± 20 ms. Test status: DEFERRED — <DEFERRED: airframe link command + bounded retry/backoff with peer not responding through max-retries>.
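
One way to assert "retries observed with backoff" is to compare the gaps between captured send times against an expected schedule. The Python sketch below does that under several assumptions that the spec does not state: a doubling factor of 2, one initial send followed by exactly `max_retries` retries, and a tolerance that scales with the delay (± 20 ms on the 100 ms base, ± 40 ms on 200 ms, and so on). All function and parameter names are hypothetical.

```python
from typing import List, Sequence


def expected_backoff_ms(base_ms: float, max_retries: int,
                        factor: float = 2.0) -> List[float]:
    """Delay before each retry: base, base*factor, base*factor**2, ..."""
    return [base_ms * factor ** i for i in range(max_retries)]


def backoff_matches(send_times_ms: Sequence[float],
                    base_ms: float = 100.0,
                    max_retries: int = 5,
                    tol_ms: float = 20.0,
                    factor: float = 2.0) -> bool:
    """send_times_ms holds the initial send followed by every retry.

    Passes only if there are exactly max_retries retries (no open-ended
    retrying) and each gap is within the scaled tolerance of its target.
    """
    gaps = [b - a for a, b in zip(send_times_ms, send_times_ms[1:])]
    if len(gaps) != max_retries:
        return False
    for gap, target in zip(gaps, expected_backoff_ms(base_ms, max_retries, factor)):
        # Tolerance grows in proportion to the target delay (assumption).
        if abs(gap - target) > tol_ms * (target / base_ms):
            return False
    return True
```

Requiring an exact retry count makes the same check enforce the bounded-behaviour invariant: a capture with a sixth retry fails just like one with too few.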


NFT-RES-R8: Wall-clock drift > 200 ms → time-source yellow

Summary: When wall-clock drift versus GPS or NTP source exceeds 200 ms, the time-source dependency MUST report yellow, AND clock_source + last_sync_at MUST reflect the drift. Traces to: AC Reliability & Safety — Wall-clock drift greater than 200 ms / R8, RESTRICT Wall-clock MUST be bound to GPS time once GPS is locked, or NTP at boot. Tier: B.

Preconditions:

  • SUT running with time-injector LD_PRELOAD active.
  • GPS source initially locked via mavlink-sitl GPS_RAW_INT messages.

Fault injection:

  • time-injector advances the SUT process clock by 250 ms over a 1 s window while keeping GPS source locked.

| Step | Action | Expected Behaviour |
|------|--------|--------------------|
| 1 | Bind clock to GPS at boot | health.time_source == green; clock_source == "gps"; last_sync_at recent |
| 2 | Inject 250 ms drift | within 5 s health.time_source transitions to yellow; clock_source and last_sync_at updated to reflect the drift |
| 3 | Stop drift | health.time_source returns to green within the next sync cycle |

Pass criteria: exact (health.time_source == "yellow") during step 2; exact (clock_source updated) + exact (last_sync_at updated). Test status: READY (time-drift-scripts inline-authorable).
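
The drift scenario reduces to a threshold comparison against the reference clock. The Python sketch below models the injected drift as a linear ramp in integer milliseconds and classifies the result. The linear ramp, the function names, and the strict "greater than" reading of "exceeds 200 ms" are assumptions made for illustration.

```python
def injected_drift_ms(t_ms: int, total_ms: int = 250, window_ms: int = 1000) -> int:
    """Drift applied t_ms after injection starts: a linear ramp that
    reaches total_ms at window_ms and stays there."""
    return total_ms * min(t_ms, window_ms) // window_ms


def time_source_health(wall_clock_ms: int, reference_ms: int,
                       threshold_ms: int = 200) -> str:
    """'yellow' when |wall clock - reference| exceeds the threshold, else 'green'."""
    drift = abs(wall_clock_ms - reference_ms)
    return "yellow" if drift > threshold_ms else "green"
```

Under these assumptions the threshold is crossed shortly after 800 ms into the injection window, comfortably inside the 5 s detection bound in step 2.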


NFT-RES-R9: Geofence EXCLUSION crossing → waypoint refusal + RTL

Summary: When a simulated waypoint crosses an EXCLUSION polygon, the SUT MUST refuse the waypoint AND trigger RTL. Symmetric behaviour for INCLUSION violations. Traces to: AC Reliability & Safety — Geofence INCLUSION and EXCLUSION violations MUST both result in waypoint refusal + RTL / R9, RESTRICT Geofence enforcement MUST be symmetric. Tier: B + E.

Preconditions:

  • SUT mid-flight; geofence INCLUSION + EXCLUSION polygons loaded as part of the mission.

Fault injection:

  • Scripted waypoint upload that crosses the EXCLUSION polygon; subsequently INCLUSION-exit test.

| Step | Action | Expected Behaviour |
|------|--------|--------------------|
| 1 | Upload waypoint crossing EXCLUSION polygon | SUT refuses the waypoint; structured-log WARN with geofence_violation_exclusion; RTL command observed on mavlink-sitl |
| 2 | Reset; upload waypoint exiting the INCLUSION polygon | identical behaviour: waypoint refused + RTL |

Pass criteria: exact (waypoint rejected) + exact (RTL command observed) for both EXCLUSION and INCLUSION cases. Test status: DEFERRED — <DEFERRED: geofence EXCLUSION polygon crossed by simulated waypoint via mavlink-sitl scripted mission>.
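
When authoring the scripted mission, it helps to have an independent oracle for which waypoints should be refused. The Python sketch below uses a standard ray-casting point-in-polygon test on flat 2-D coordinates. It checks only the waypoint itself, not the path segment leading to it, and `geofence_violation_inclusion` is a hypothetical name mirrored from the `geofence_violation_exclusion` reason that appears in step 1.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(p: Point, poly: Sequence[Point]) -> bool:
    """Ray-casting test. Points exactly on an edge are not treated specially."""
    x, y = p
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does this edge straddle the horizontal line through p?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def waypoint_violation(wp: Point,
                       inclusion: Sequence[Point],
                       exclusions: Sequence[Sequence[Point]]) -> Optional[str]:
    """Return the expected violation reason, or None if the waypoint is legal."""
    if not point_in_polygon(wp, inclusion):
        return "geofence_violation_inclusion"  # hypothetical reason name
    for zone in exclusions:
        if point_in_polygon(wp, zone):
            return "geofence_violation_exclusion"
    return None
```

Both non-None results map to the same expected behaviour (refusal plus RTL), which is the symmetry this scenario asserts.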


NFT-RES-Mp2: Map-pull timeout → cache-fallback (functional coverage in FT-N-003)

Summary: When the pre-flight map pull times out, the SUT falls back to last-known cached MapObjects and surfaces map_sync == "cached_fallback" with an operator-ack gate. (Functional gate semantics are tested in blackbox-tests.md → FT-N-003; this scenario adds the timing+recovery dimension.) Traces to: AC Map Reconciliation — Cache-fallback on timeout / Mp2. Tier: B.

Preconditions:

  • autopilot-state seeded with a known prior MapObjects snapshot.
  • missions-mock configured to time out on GET /missions/{id}/mapobjects for a configurable duration.

Fault injection:

  • missions-mock returns 504 / silent timeout for 60 s; then responds normally.

| Step | Action | Expected Behaviour |
|------|--------|--------------------|
| 1 | Trigger BIT | SUT issues GET /missions/{id}/mapobjects; observes timeout (per its configured request timeout); within 5 s falls back to cached snapshot; map_sync == "cached_fallback"; BIT requires explicit operator ack (see FT-N-003) |
| 2 | Mock recovers (responds normally) | next periodic resync re-attempts; once successful, map_sync == "live"; structured-log INFO map_resync_recovered |

Pass criteria: exact (cached_fallback within 5 s of timeout); recovery within the next resync cycle. Test status: READY (the only external fixture involved is mission-suite-fixture (DEFERRED), which would seed the cached snapshot; the snapshot can instead be authored inline at minimal scale).
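
To reason about "recovery within the next resync cycle", a simple timeline model is enough. The Python sketch below lists the expected `map_sync` value after each pull attempt, given an outage length and a resync period. The resync period is a hypothetical parameter (the spec does not give one), request-timeout latency is ignored, and the function names are illustrative only.

```python
from typing import List, Optional, Tuple


def map_sync_timeline(outage_s: int, resync_period_s: int,
                      horizon_s: int) -> List[Tuple[int, str]]:
    """Expected map_sync after each pull attempt.

    The first attempt is the BIT pull at t=0; later attempts are periodic
    resyncs every resync_period_s seconds.
    """
    if resync_period_s <= 0:
        raise ValueError("resync_period_s must be positive")
    timeline = []
    for t in range(0, horizon_s + 1, resync_period_s):
        state = "cached_fallback" if t < outage_s else "live"
        timeline.append((t, state))
    return timeline


def recovery_time(timeline: List[Tuple[int, str]]) -> Optional[int]:
    """Time of the first attempt that reports 'live', or None."""
    for t, state in timeline:
        if state == "live":
            return t
    return None
```

With a 60 s outage and a 25 s resync period, the first live attempt lands at t=75 s, which is within one resync period of the mock recovering.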


NFT-RES-Mp4: Post-flight map-push 5xx → persist + bounded retry + operator warning

Summary: When the post-flight POST /missions/{id}/mapobjects returns 5xx, the pending diff MUST be persisted on durable on-device storage, an operator-visible warning MUST surface, AND bounded retry MUST execute (capped at the configured retry limit). Traces to: AC Map Reconciliation — Failure MUST persist the pending diff to durable on-device storage with bounded retry / Mp4, RESTRICT On-device storage MUST be bounded. Tier: B + E.

Preconditions:

  • SUT post-landing; pending diff ready to push.
  • missions-mock configured to return 5xx N times then 200.

Fault injection:

  • missions-mock returns 503 for the first N attempts (N = configured retry-cap + 1); then returns 200.

| Step | Action | Expected Behaviour |
|------|--------|--------------------|
| 1 | Trigger post-flight reconciliation | SUT issues POST /missions/{id}/mapobjects; receives 503 |
| 2 | Observe persistence | pending diff file exists under autopilot-state/pending_map_diff/<mission-id>.json; size > 0 |
| 3 | Observe operator-stream | warning event map_push_failed surfaced |
| 4 | Observe retry loop | retries observed within the configured cap; backoff with jitter |
| 5 | After retry-cap reached without success | SUT stops retrying; pending file remains for next session pickup |
| 6 | Eventual success (mock returns 200) | next attempt succeeds; pending file removed; warning cleared |

Pass criteria: exact (pending file exists) + exact (warning surfaced) + threshold_max (retries ≤ configured cap). Test status: DEFERRED — <DEFERRED: same fixture as Mp3 (60-minute pass diff)>.
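
The persist-then-retry contract in the steps above can be modelled in a few lines. The Python sketch below is a reference model for reasoning about the expected attempt count and file lifecycle, not the SUT's implementation. `make_mock`, `push_with_bounded_retry`, and the reading of the cap as "one initial attempt plus `retry_cap` retries" are assumptions. Backoff and jitter sleeps are omitted so the model runs instantly.

```python
import json
import os
from typing import Callable


def make_mock(fail_count: int) -> Callable[[], int]:
    """Stand-in for missions-mock: returns 503 fail_count times, then 200."""
    remaining = [fail_count]

    def send() -> int:
        if remaining[0] > 0:
            remaining[0] -= 1
            return 503
        return 200

    return send


def push_with_bounded_retry(send: Callable[[], int], pending_path: str,
                            diff: dict, retry_cap: int) -> int:
    """Attempt the push at most 1 + retry_cap times; return attempts made.

    The diff is persisted on the first failure and removed only on success.
    """
    for attempt in range(1, retry_cap + 2):
        status = send()
        if 200 <= status < 300:
            if os.path.exists(pending_path):
                os.remove(pending_path)
            return attempt
        if attempt == 1:
            with open(pending_path, "w", encoding="utf-8") as fh:
                json.dump(diff, fh)
        # A real client would sleep here with backoff plus jitter.
    return retry_cap + 1
```

Running the model against a mock that fails `retry_cap + 1` times exhausts the budget and leaves the pending file in place; a second call, standing in for the next session pickup, succeeds on its first attempt and removes the file.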


Recovery-time invariants common to every scenario

  • No silent error swallowing. Every fault scenario MUST observe a corresponding structured-log entry at WARN+ AND a corresponding health-endpoint transition. A fault that the SUT handles without surfacing through both channels is a TEST FAILURE per security_approach.md → "No silent error swallowing for security-relevant failures" (extended here to operational faults per coderule.mdc → "Never suppress errors silently").
  • Bounded behaviour. Every retry/backoff loop MUST be bounded — the scenario asserts the cap on retry count and the cap on backoff window. Open-ended retry is a test failure.
  • State integrity post-recovery. After fault recovery (when applicable), the scenario asserts that the SUT returns to a known state — mode unchanged unless the fault legitimately altered it (e.g., RTL stays RTL until operator override).
  • Symmetry assertions. R9 explicitly tests both INCLUSION and EXCLUSION because the AC names symmetric behaviour. Wherever an AC pairs two outcomes (fail-fast + fail-closed, red + yellow, etc.), the resilience scenario MUST cover both halves.
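
The first invariant lends itself to a shared helper that every scenario can call. The Python sketch below checks that a fault surfaced through both channels. The record shapes (dicts with `fault`, `level`, `from`, `to` keys), the numeric level ordering, and the function name are hypothetical and would need to match whatever the real log and health captures look like.

```python
from typing import Dict, List

# Hypothetical severity ordering for structured-log levels.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}


def fault_surfaced(log_entries: List[Dict[str, str]],
                   health_transitions: List[Dict[str, str]],
                   fault_id: str) -> bool:
    """True only if the fault produced BOTH a WARN-or-higher log entry
    and an actual health-state change."""
    logged = any(
        e["fault"] == fault_id and LEVELS[e["level"]] >= LEVELS["WARN"]
        for e in log_entries
    )
    transitioned = any(
        h["fault"] == fault_id and h["from"] != h["to"]
        for h in health_transitions
    )
    return logged and transitioned
```

A fault that only logs at INFO, or only changes health without a log entry, returns False and should be reported as a test failure.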