[AZ-651] [AZ-668] lost-link failsafe ladder + mapobjects persistence (batch 7)

AZ-651 (mission_executor lost-link ladder):
- LostLinkLadder pure-logic state machine (LinkOk -> Degraded -> Lost -> LinkLostInFollow + MavlinkLost branch). Configurable thresholds via LostLinkConfig.
- LostLinkCommandIssuer trait + MavlinkCommandIssuer production impl emitting MAV_CMD_NAV_RETURN_TO_LAUNCH via MavlinkHandle::send_command.
- LostLinkDriver task wires the ladder to operator-link watch, MAVLink LinkEvent broadcast, and optional target-follow signal. On RTL, driver calls the issuer THEN MissionExecutorHandle::failsafe_trigger.
- failsafe_trigger(LinkLost | LinkLostInFollow) short-circuits FlyMission -> Land via direct FSM state mutation + TransitionEvent emission; Paused state is intentionally NOT overridden.
- Tests: 4/4 ACs locally green (degraded-no-rtl; lost-fires-once; follow-grace; mavlink-loss-no-rtl) plus driver + FSM integration.

AZ-668 (mapobjects_store persistence):
- Snapshot serializable shape + Store::{to_snapshot,from_snapshot} round trip.
- MapObjectsPersistence async trait + JsonSnapshotEngine default impl (write to .tmp, sync_all, atomic rename, best-effort parent fsync).
- PersistenceError::{Corrupt, SchemaMismatch} surfaces explicit errors on bad blob; PersistenceMetrics tracks last_snapshot_ts, snapshot_size_bytes, snapshot_errors_total.
- MapObjectsStore::from_snapshot factory for crash recovery from the composition root.
- Tests: 4/4 ACs locally green (round-trip; atomic rename ignores partial .tmp; crash recovery preserves pending; corruption returns explicit error) plus schema-mismatch + metrics smoke checks.

Quality gates:
- cargo fmt: clean.
- cargo clippy -p mission_executor -p mapobjects_store --tests: 0 warns.
- cargo test --workspace: all green.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 01:31:09 +00:00 · 2026-05-19 18:59:28 +03:00
parent 23366a5c6d
commit 2bcd4a8059
16 changed files with 1940 additions and 8 deletions
@@ -0,0 +1,72 @@
+# Lost-Link Failsafe Ladder (F10)
+
+**Task**: AZ-651_mission_executor_lost_link_ladder
+**Name**: Lost-link ladder LinkOk → LinkDegraded → LinkLost → LinkLostInFollow
+**Description**: Per-tick evaluation of the operator/Ground-Station modem link state. Default RTL after 30 s grace. Configurable. MAVLink-link loss to ArduPilot itself is a separate, more severe event — health → red, airframe failsafe takes over (we do NOT override it).
+**Complexity**: 3 points
+**Dependencies**: AZ-640_initial_structure, AZ-648_mission_executor_state_machine, AZ-649_mission_executor_telemetry_forwarding
+**Component**: mission_executor
+**Tracker**: AZ-651
+**Epic**: AZ-636
+
+## Problem
+
+The operator's modem link is critical to safe operation but inherently flaky. The failsafe must escalate predictably from `LinkOk` to `LinkDegraded` (5–30 s) to `LinkLost` (>30 s), with `LinkLostInFollow` as the special case that applies while target-follow is active, and each step must have a defined behaviour. The default action on `LinkLost` is RTL after a grace window. Crucially, MAVLink-link loss to ArduPilot is a different event: in that case autopilot does NOT override the airframe's built-in failsafe.
+
+## Outcome
+
+- `LostLinkLadder::tick(now, link_state)` updates an enum `LadderState ∈ {LinkOk, LinkDegraded, LinkLost, LinkLostInFollow}` deterministically based on the elapsed time since the last operator-link heartbeat.
+- `LinkDegraded` for 5–30 s: health → yellow; events queued; no command to airframe.
+- `LinkLost` for >30 s (configurable): trigger RTL via `mavlink_layer`; transition to `LAND`.
+- `LinkLostInFollow` (active `TargetFollow` + >30 s): 30 s grace, then RTL.
+- MAVLink-link loss to ArduPilot: detected via `mavlink_layer`'s `LinkLost`; health → red; do NOT issue RTL (airframe handles it).
+- Health surface: current `LadderState`, time-in-state, RTL trigger count.
+
+## Scope
+
+### Included
+- Ladder state machine.
+- Subscribe to operator-link state from `telemetry_stream` (forwarded by `operator_bridge` health).
+- Subscribe to MAVLink-link state from `mavlink_layer`.
+- Configurable thresholds (defaults: degraded=5 s, lost=30 s, follow-grace=30 s).
+- RTL command issuance via `mavlink_layer::send_command(MAV_CMD_NAV_RETURN_TO_LAUNCH)`.
+
+### Excluded
+- Operator command auth checks (`operator_bridge`).
+- Target-follow state ownership (`scan_controller`).
+
+## Acceptance Criteria
+
+**AC-1: Operator-link degraded then recovers**
+Given a healthy link
+When the operator-link heartbeat stops for 10 s and resumes
+Then the ladder reports `LinkOk → LinkDegraded → LinkOk` with correct dwell times; no RTL is issued.
+
+**AC-2: Operator-link lost triggers RTL**
+Given a healthy link
+When the operator-link heartbeat stops for 31 s
+Then the ladder reports `LinkLost`, `send_command(MAV_CMD_NAV_RETURN_TO_LAUNCH)` is issued exactly once, and the state machine transitions to `LAND`.
+
+**AC-3: Lost-in-follow grace then RTL**
+Given the system is in `TargetFollow` and the operator-link drops
+When the link is down for 30 s (grace), then continues to be down past the grace
+Then RTL is triggered only once the grace has expired, not earlier.
+
+**AC-4: MAVLink loss does NOT trigger autopilot-side RTL**
+Given the MAVLink link to ArduPilot is lost (`mavlink_layer` reports `LinkLost`)
+When the ladder tick runs
+Then health → red, no `MAV_CMD_NAV_RETURN_TO_LAUNCH` is issued by autopilot (airframe failsafe owns the response), and the event is observable.
+
+## Non-Functional Requirements
+
+**Performance**
+- Ladder tick: ≤5 ms.
+
+**Reliability**
+- All thresholds configurable; no hardcoded values other than the defaults documented above.
+
+## Runtime Completeness
+
+- **Named capability**: F10 lost-link failsafe ladder.
+- **Production code that must exist**: real state machine; real RTL command issuance.
+- **Unacceptable substitutes**: omitting the `LinkLostInFollow` grace is not acceptable (an operator may have momentary glitches mid-follow).
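The ladder behaviour described in the Outcome and Acceptance Criteria sections above can be illustrated with a minimal, std-only sketch. It is not the committed `LostLinkLadder`. Several things here are assumptions made for illustration: the `TickInput` struct and its field names are hypothetical, `tick` takes the elapsed time since the last heartbeat rather than `(now, link_state)`, the follow grace is counted in addition to the lost threshold (the spec wording leaves this open), and MAVLink loss is modelled as a flag that suppresses RTL rather than as a separate state.

```rust
use std::time::Duration;

/// Thresholds; the defaults mirror the documented ones (5 s / 30 s / 30 s).
#[derive(Debug, Clone, Copy)]
pub struct LostLinkConfig {
    pub degraded_after: Duration,
    pub lost_after: Duration,
    pub follow_grace: Duration,
}

impl Default for LostLinkConfig {
    fn default() -> Self {
        Self {
            degraded_after: Duration::from_secs(5),
            lost_after: Duration::from_secs(30),
            follow_grace: Duration::from_secs(30),
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LadderState {
    LinkOk,
    LinkDegraded,
    LinkLost,
    LinkLostInFollow,
}

/// Hypothetical per-tick input (not taken from the commit).
#[derive(Debug, Clone, Copy)]
pub struct TickInput {
    /// Time since the last operator-link heartbeat.
    pub since_heartbeat: Duration,
    /// True while target-follow is active.
    pub in_follow: bool,
    /// False when the MAVLink link to ArduPilot is down.
    pub mavlink_up: bool,
}

pub struct LostLinkLadder {
    cfg: LostLinkConfig,
    state: LadderState,
    rtl_issued: bool,
}

impl LostLinkLadder {
    pub fn new(cfg: LostLinkConfig) -> Self {
        Self { cfg, state: LadderState::LinkOk, rtl_issued: false }
    }

    pub fn state(&self) -> LadderState {
        self.state
    }

    /// Updates the state and returns true only on the tick that should issue RTL.
    pub fn tick(&mut self, input: TickInput) -> bool {
        let silent = input.since_heartbeat;
        self.state = if silent < self.cfg.degraded_after {
            LadderState::LinkOk
        } else if silent <= self.cfg.lost_after {
            LadderState::LinkDegraded
        } else if input.in_follow {
            LadderState::LinkLostInFollow
        } else {
            LadderState::LinkLost
        };
        if self.state == LadderState::LinkOk {
            self.rtl_issued = false; // re-arm once the link has recovered
        }
        let rtl_due = match self.state {
            LadderState::LinkLost => true,
            // Assumption: the grace runs on top of the lost threshold.
            LadderState::LinkLostInFollow => {
                silent > self.cfg.lost_after + self.cfg.follow_grace
            }
            _ => false,
        };
        // With MAVLink down the airframe failsafe owns the response: no RTL.
        let fire = rtl_due && input.mavlink_up && !self.rtl_issued;
        if fire {
            self.rtl_issued = true;
        }
        fire
    }
}
```

Keeping the ladder free of I/O means a caller (the driver task, in the commit's terms) decides what to do when `tick` returns `true`, and the "exactly once" requirement of AC-2 reduces to the `rtl_issued` latch.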
@@ -0,0 +1,76 @@
+# Persistence — In-Memory + JSON Snapshot (Q3 Default)
+
+**Task**: AZ-668_mapobjects_store_persistence
+**Name**: In-memory + JSON snapshot persistence (default per Q3)
+**Description**: Crash-recovery and post-flight upload durability for the in-memory MapObjects state. Default engine: in-memory + atomic JSON snapshot to `${state_dir}/mapobjects/<mission_id>.json` per checkpoint. Q3 reserves the slot for SQLite+H3 / KV alternatives.
+**Complexity**: 3 points
+**Dependencies**: AZ-640_initial_structure, AZ-665_mapobjects_store_h3_classify, AZ-667_mapobjects_store_hydrate_and_pending
+**Component**: mapobjects_store
+**Tracker**: AZ-668
+**Epic**: AZ-633
+
+## Problem
+
+The in-memory hashmap is authoritative for the active mission, but a crash mid-mission must not lose the pending diff. The persistence engine choice is still open (Q3). The default is in-memory + JSON snapshot (atomic rename), and the engine sits behind a `MapObjectsPersistence` trait so SQLite+H3 or RocksDB can swap in later without touching call sites.
+
+## Outcome
+
+- `MapObjectsPersistence` trait with `save_snapshot(state) -> Result<()>` and `load_snapshot(path) -> Result<State>`.
+- `JsonSnapshotEngine` impl that writes to `${state_dir}/mapobjects/<mission_id>.json` via atomic rename (write to `.tmp` then rename).
+- Snapshot cadence: configurable; default every 30 s OR after every N pending-observation appends, whichever comes first.
+- Crash recovery: at startup, load the most recent snapshot for any mission that did not reach `POST_FLIGHT_SYNC`.
+- Health surface: `last_snapshot_ts`, `snapshot_size_bytes`, `snapshot_errors_total`.
+- Persistence corruption on startup: refuse to start with stale state; surface explicit error to the operator.
+
+## Scope
+
+### Included
+- `MapObjectsPersistence` trait.
+- `JsonSnapshotEngine` (default impl).
+- Atomic rename pattern.
+- Crash-recovery load.
+- Snapshot cadence policy.
+
+### Excluded
+- SQLite+H3 alternative (Q3 follow-up if chosen later).
+- KV alternative (Q3 follow-up).
+- The post-flight push itself (`mission_client` task 08).
+
+## Acceptance Criteria
+
+**AC-1: Snapshot + reload round-trip**
+Given a store with 100 MapObjects + 10 IgnoredItems + 5 pending observations
+When `save_snapshot()` writes to disk and a fresh process calls `load_snapshot()`
+Then the loaded state equals the saved state.
+
+**AC-2: Atomic rename prevents partial writes**
+Given a snapshot write is interrupted mid-write (simulated kill -9)
+When a fresh process starts
+Then it loads the previous good snapshot, not the partial one (no corruption observed).
+
+**AC-3: Crash recovery loads pending**
+Given a previous run terminated with non-empty pending_observations
+When the new process calls `load_snapshot()` for the same mission_id
+Then pending_observations is non-empty and matches the pre-crash count.
+
+**AC-4: Corruption surfaces explicit error**
+Given a snapshot file with truncated content
+When `load_snapshot()` runs
+Then it returns `Err(CorruptSnapshot)` and `snapshot_errors_total` increments; the store does NOT silently start empty.
+
+## Non-Functional Requirements
+
+**Performance**
+- Snapshot of a 30 km × 30 km mission (≤1 000 MapObjects): ≤1 s.
+- Crash recovery: ≤2 s to a usable state (per `description.md §9`).
+
+**Reliability**
+- Atomic rename — no partial-write corruption.
+- Corruption never silent.
+
+## Runtime Completeness
+
+- **Named capability**: persistent MapObjects state with crash recovery — default engine in-memory + JSON snapshot per Q3.
+- **Production code that must exist**: real disk write; real atomic rename; real corruption-detection on load.
+- **Allowed external stubs**: `tempfile` for test fixtures.
+- **Unacceptable substitutes**: a no-op persistence in production is unacceptable (crash mid-flight loses the diff).
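The atomic-rename pattern named in the spec above (write to `.tmp`, `sync_all`, rename, best-effort parent fsync) can be sketched with the standard library alone. This is an illustrative sketch, not the committed `JsonSnapshotEngine`: the function names `write_snapshot_atomic` and `tmp_path_for` are hypothetical, the payload is treated as opaque bytes, and serialization, metrics, and corruption detection on load are left out.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper: the temporary sibling path, `<final_path>.tmp`.
fn tmp_path_for(final_path: &Path) -> PathBuf {
    let mut name = final_path.as_os_str().to_os_string();
    name.push(".tmp");
    PathBuf::from(name)
}

/// Writes `bytes` so that a reader of `final_path` sees either the previous
/// snapshot or the new one, never a partially written file.
pub fn write_snapshot_atomic(final_path: &Path, bytes: &[u8]) -> io::Result<()> {
    if let Some(parent) = final_path.parent() {
        fs::create_dir_all(parent)?;
    }
    let tmp_path = tmp_path_for(final_path);

    // 1. Write the full payload to the temporary file.
    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(bytes)?;
    // 2. Flush file data and metadata to disk before the rename.
    tmp.sync_all()?;
    drop(tmp);

    // 3. Replace the final file in one step. A crash before this point
    //    leaves only a stale `.tmp`, which the loader never reads.
    fs::rename(&tmp_path, final_path)?;

    // 4. Best-effort fsync of the parent directory so the rename itself is
    //    durable. Opening a directory can fail on some platforms, so errors
    //    are deliberately ignored here.
    if let Some(parent) = final_path.parent() {
        if let Ok(dir) = File::open(parent) {
            let _ = dir.sync_all();
        }
    }
    Ok(())
}
```

Because the loader only ever opens the final path, an interrupted write (the AC-2 scenario) shows up as a leftover `.tmp` file next to an intact previous snapshot. The rename is only atomic when both paths are on the same filesystem, which is why the temporary file is created as a sibling of the target.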