[AZ-651] [AZ-668] lost-link failsafe ladder + mapobjects persistence (batch 7)

AZ-651 (mission_executor lost-link ladder):
- LostLinkLadder pure-logic state machine (LinkOk -> Degraded -> Lost
  -> LinkLostInFollow + MavlinkLost branch). Configurable thresholds
  via LostLinkConfig.
- LostLinkCommandIssuer trait + MavlinkCommandIssuer production impl
  emitting MAV_CMD_NAV_RETURN_TO_LAUNCH via MavlinkHandle::send_command.
- LostLinkDriver task wires the ladder to operator-link watch, MAVLink
  LinkEvent broadcast, and optional target-follow signal. On RTL,
  driver calls the issuer THEN MissionExecutorHandle::failsafe_trigger.
- failsafe_trigger(LinkLost | LinkLostInFollow) short-circuits FlyMission
  -> Land via direct FSM state mutation + TransitionEvent emission;
  Paused state is intentionally NOT overridden.
- Tests: 4/4 ACs locally green (degraded-no-rtl; lost-fires-once;
  follow-grace; mavlink-loss-no-rtl) plus driver + FSM integration.

AZ-668 (mapobjects_store persistence):
- Snapshot serializable shape + Store::{to_snapshot,from_snapshot}
  round trip.
- MapObjectsPersistence async trait + JsonSnapshotEngine default impl
  (write to .tmp, sync_all, atomic rename, best-effort parent fsync).
- PersistenceError::{Corrupt, SchemaMismatch} surfaces explicit errors
  on bad blob; PersistenceMetrics tracks last_snapshot_ts,
  snapshot_size_bytes, snapshot_errors_total.
- MapObjectsStore::from_snapshot factory for crash recovery from the
  composition root.
- Tests: 4/4 ACs locally green (round-trip; atomic rename ignores
  partial .tmp; crash recovery preserves pending; corruption returns
  explicit error) plus schema-mismatch + metrics smoke checks.

Quality gates:
- cargo fmt: clean.
- cargo clippy -p mission_executor -p mapobjects_store --tests: 0 warns.
- cargo test --workspace: all green.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-19 18:59:28 +03:00
parent 23366a5c6d
commit 2bcd4a8059
16 changed files with 1940 additions and 8 deletions
@@ -0,0 +1,72 @@
# Lost-Link Failsafe Ladder (F10)
**Task**: AZ-651_mission_executor_lost_link_ladder
**Name**: Lost-link ladder LinkOk → LinkDegraded → LinkLost → LinkLostInFollow
**Description**: Per-tick evaluation of the operator/Ground-Station modem link state. Default RTL after 30 s grace. Configurable. MAVLink-link loss to ArduPilot itself is a separate, more severe event — health → red, airframe failsafe takes over (we do NOT override it).
**Complexity**: 3 points
**Dependencies**: AZ-640_initial_structure, AZ-648_mission_executor_state_machine, AZ-649_mission_executor_telemetry_forwarding
**Component**: mission_executor
**Tracker**: AZ-651
**Epic**: AZ-636
## Problem
The operator's modem link is critical to safe operation but inherently flaky. The failsafe must escalate predictably from `LinkOk` to `LinkDegraded` (5–30 s) to `LinkLost` (>30 s) to `LinkLostInFollow` (the special-cased target-follow variant), each step with a defined behaviour. Default action on `LinkLost` is RTL after a grace window. Crucially, MAVLink-link loss to ArduPilot is a different event: autopilot does NOT override the airframe's built-in failsafe in that case.
## Outcome
- `LostLinkLadder::tick(now, link_state)` updates an enum `LadderState ∈ {LinkOk, LinkDegraded, LinkLost, LinkLostInFollow}` deterministically based on the elapsed time since the last operator-link heartbeat.
- `LinkDegraded` for 5–30 s: health → yellow; events queued; no command to airframe.
- `LinkLost` for >30 s (configurable): trigger RTL via `mavlink_layer`; transition to `LAND`.
- `LinkLostInFollow` (active `TargetFollow` + >30 s): 30 s grace, then RTL.
- MAVLink-link loss to ArduPilot: detected via `mavlink_layer`'s `LinkLost`; health → red; do NOT issue RTL (airframe handles it).
- Health surface: current `LadderState`, time-in-state, RTL trigger count.
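The escalation rules above can be sketched as a pure function of elapsed time. This is a minimal illustration, not the shipped `LostLinkLadder`: the field names, the `LadderAction` enum, and the simplified `tick(since_heartbeat, in_follow)` signature are assumptions made for the example. It also assumes that the follow grace is counted on top of the lost threshold, which the text does not state explicitly.

```rust
use std::time::Duration;

/// Hypothetical config shape; defaults follow the documented 5 s / 30 s / 30 s.
#[derive(Debug, Clone, Copy)]
pub struct LostLinkConfig {
    pub degraded_after: Duration,
    pub lost_after: Duration,
    pub follow_grace: Duration,
}

impl Default for LostLinkConfig {
    fn default() -> Self {
        Self {
            degraded_after: Duration::from_secs(5),
            lost_after: Duration::from_secs(30),
            follow_grace: Duration::from_secs(30),
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LadderState {
    LinkOk,
    LinkDegraded,
    LinkLost,
    LinkLostInFollow,
}

/// What the caller should do after a tick (illustrative, not from the source).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LadderAction {
    NoAction,
    TriggerRtl,
}

pub struct LostLinkLadder {
    cfg: LostLinkConfig,
    state: LadderState,
    rtl_fired: bool,
}

impl LostLinkLadder {
    pub fn new(cfg: LostLinkConfig) -> Self {
        Self { cfg, state: LadderState::LinkOk, rtl_fired: false }
    }

    pub fn state(&self) -> LadderState {
        self.state
    }

    /// `since_heartbeat` is the time since the last operator-link heartbeat;
    /// `in_follow` says whether TargetFollow is active.
    pub fn tick(&mut self, since_heartbeat: Duration, in_follow: bool) -> LadderAction {
        let cfg = self.cfg;
        if since_heartbeat < cfg.degraded_after {
            // Link healthy again: re-arm the one-shot RTL latch.
            self.state = LadderState::LinkOk;
            self.rtl_fired = false;
            return LadderAction::NoAction;
        }
        if since_heartbeat <= cfg.lost_after {
            self.state = LadderState::LinkDegraded;
            return LadderAction::NoAction;
        }
        // Past the lost threshold. ASSUMPTION: in follow mode the grace
        // window is added on top of `lost_after` before RTL is due.
        let rtl_due = if in_follow {
            self.state = LadderState::LinkLostInFollow;
            since_heartbeat > cfg.lost_after + cfg.follow_grace
        } else {
            self.state = LadderState::LinkLost;
            true
        };
        if rtl_due && !self.rtl_fired {
            self.rtl_fired = true;
            LadderAction::TriggerRtl
        } else {
            LadderAction::NoAction
        }
    }
}
```

Keeping the ladder free of I/O means the thresholds can be exercised with synthetic durations; the latch is what makes RTL fire once per outage rather than on every tick.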
## Scope
### Included
- Ladder state machine.
- Subscribe to operator-link state from `telemetry_stream` (forwarded by `operator_bridge` health).
- Subscribe to MAVLink-link state from `mavlink_layer`.
- Configurable thresholds (defaults: degraded=5 s, lost=30 s, follow-grace=30 s).
- RTL command issuance via `mavlink_layer::send_command(MAV_CMD_NAV_RETURN_TO_LAUNCH)`.
### Excluded
- Operator command auth checks (`operator_bridge`).
- Target-follow state ownership (`scan_controller`).
## Acceptance Criteria
**AC-1: Operator-link degraded then recovers**
Given a healthy link
When the operator-link heartbeat stops for 10 s and resumes
Then the ladder reports `LinkOk → LinkDegraded → LinkOk` with correct dwell times; no RTL is issued.
**AC-2: Operator-link lost triggers RTL**
Given a healthy link
When the operator-link heartbeat stops for 31 s
Then the ladder reports `LinkLost`, `send_command(MAV_CMD_NAV_RETURN_TO_LAUNCH)` is issued exactly once, and the state machine transitions to `LAND`.
**AC-3: Lost-in-follow grace then RTL**
Given the system is in `TargetFollow` and the operator-link drops
When the link is down for 30 s (grace), then continues to be down past the grace
Then RTL is triggered after the grace fires, not earlier.
**AC-4: MAVLink loss does NOT trigger autopilot-side RTL**
Given the MAVLink link to ArduPilot is lost (`mavlink_layer` reports `LinkLost`)
When the ladder tick runs
Then health → red, no `MAV_CMD_NAV_RETURN_TO_LAUNCH` is issued by autopilot (airframe failsafe owns the response), and the event is observable.
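AC-2 and AC-4 both hinge on who is allowed to emit RTL, so a small seam around command issuance makes them easy to check. The sketch below is a synchronous stand-in under stated assumptions: `issue_rtl`, `CountingIssuer`, `LinkInput`, and `DriverSketch` are hypothetical names, the real driver is an async task, and the FSM transition to `LAND` is left out. The health colour for an operator-link `LinkLost` is not specified in the text, so the sketch leaves it unchanged.

```rust
/// Hypothetical seam for command emission; the production trait may differ.
pub trait LostLinkCommandIssuer {
    fn issue_rtl(&mut self);
}

/// Test double that only counts how many RTL commands were requested.
#[derive(Default)]
pub struct CountingIssuer {
    pub calls: u32,
}

impl LostLinkCommandIssuer for CountingIssuer {
    fn issue_rtl(&mut self) {
        self.calls += 1;
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Green,
    Yellow,
    Red,
}

/// Simplified inputs: three operator-link rungs plus the MAVLink branch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LinkInput {
    OperatorOk,
    OperatorDegraded,
    OperatorLost,
    MavlinkLost,
}

pub struct DriverSketch<I: LostLinkCommandIssuer> {
    issuer: I,
    rtl_sent: bool,
    pub health: Health,
    pub rtl_trigger_count: u32,
}

impl<I: LostLinkCommandIssuer> DriverSketch<I> {
    pub fn new(issuer: I) -> Self {
        Self { issuer, rtl_sent: false, health: Health::Green, rtl_trigger_count: 0 }
    }

    pub fn issuer(&self) -> &I {
        &self.issuer
    }

    pub fn on_input(&mut self, input: LinkInput) {
        match input {
            LinkInput::OperatorOk => {
                self.health = Health::Green;
                self.rtl_sent = false; // re-arm for the next outage
            }
            LinkInput::OperatorDegraded => self.health = Health::Yellow,
            LinkInput::OperatorLost => {
                // Issue RTL at most once per outage (AC-2).
                if !self.rtl_sent {
                    self.issuer.issue_rtl();
                    self.rtl_sent = true;
                    self.rtl_trigger_count += 1;
                }
            }
            // AC-4: report red, but leave the response to the airframe failsafe.
            LinkInput::MavlinkLost => self.health = Health::Red,
        }
    }
}
```

With a counting double in place of the MAVLink-backed issuer, "exactly once" and "never on MAVLink loss" become plain equality checks on the call count.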
## Non-Functional Requirements
**Performance**
- Ladder tick: ≤5 ms.
**Reliability**
- All thresholds are configurable; nothing is hardcoded beyond the defaults documented above.
## Runtime Completeness
- **Named capability**: F10 lost-link failsafe ladder.
- **Production code that must exist**: real state machine; real RTL command issuance.
- **Unacceptable substitutes**: omitting the `LinkLostInFollow` grace is not acceptable (an operator may have momentary glitches mid-follow).
@@ -0,0 +1,76 @@
# Persistence — In-Memory + JSON Snapshot (Q3 Default)
**Task**: AZ-668_mapobjects_store_persistence
**Name**: In-memory + JSON snapshot persistence (default per Q3)
**Description**: Crash-recovery and post-flight upload durability for the in-memory MapObjects state. Default engine: in-memory + atomic JSON snapshot to `${state_dir}/mapobjects/<mission_id>.json` per checkpoint. Q3 reserves the slot for SQLite+H3 / KV alternatives.
**Complexity**: 3 points
**Dependencies**: AZ-640_initial_structure, AZ-665_mapobjects_store_h3_classify, AZ-667_mapobjects_store_hydrate_and_pending
**Component**: mapobjects_store
**Tracker**: AZ-668
**Epic**: AZ-633
## Problem
The in-memory hashmap is authoritative for the active mission, but a crash mid-mission must not lose the pending diff. The persistence engine choice is Q3 (open); the default is in-memory + JSON snapshot (atomic rename), which keeps the engine choice cleanly behind a `MapObjectsPersistence` trait so SQLite+H3 or RocksDB can swap in later without touching call sites.
## Outcome
- `MapObjectsPersistence` trait with `save_snapshot(state) -> Result<()>` and `load_snapshot(path) -> Result<State>`.
- `JsonSnapshotEngine` impl that writes to `${state_dir}/mapobjects/<mission_id>.json` via atomic rename (write to `.tmp` then rename).
- Snapshot cadence: configurable; default every 30 s OR on every N pending-observation appends, whichever comes first.
- Crash recovery: at startup, load the most recent snapshot for any mission that did not reach `POST_FLIGHT_SYNC`.
- Health surface: `last_snapshot_ts`, `snapshot_size_bytes`, `snapshot_errors_total`.
- Persistence corruption on startup: refuse to start with stale state; surface explicit error to the operator.
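The write-to-`.tmp`-then-rename pattern can be sketched with the standard library alone. This is an illustrative, synchronous version; the function name `write_snapshot_atomic` is an assumption, and the real engine sits behind an async trait. The parent-directory sync is best-effort because opening a directory as a file is not supported on every platform.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `final_path` via a sibling `<name>.tmp` file and a rename,
/// so readers see either the previous snapshot or the complete new one.
pub fn write_snapshot_atomic(final_path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Build "<file name>.tmp" next to the final file.
    let mut tmp_name = final_path
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no file name"))?
        .to_os_string();
    tmp_name.push(".tmp");
    let tmp_path = final_path.with_file_name(tmp_name);

    if let Some(parent) = final_path.parent() {
        fs::create_dir_all(parent)?;
    }

    {
        let mut tmp = File::create(&tmp_path)?;
        tmp.write_all(bytes)?;
        // Flush file contents and metadata to disk before the rename.
        tmp.sync_all()?;
    }

    // Replace the previous snapshot in one step.
    fs::rename(&tmp_path, final_path)?;

    // Best-effort: persist the directory entry; errors are ignored on purpose.
    if let Some(parent) = final_path.parent() {
        if let Ok(dir) = File::open(parent) {
            let _ = dir.sync_all();
        }
    }
    Ok(())
}
```

A crash before the rename leaves only a stray `.tmp` beside the last good snapshot, which is why recovery should read the final path and never the temporary one.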
## Scope
### Included
- `MapObjectsPersistence` trait.
- `JsonSnapshotEngine` (default impl).
- Atomic rename pattern.
- Crash-recovery load.
- Snapshot cadence policy.
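One possible shape for the cadence policy is a small tracker that fires on whichever limit is reached first. The type and method names here are hypothetical, and the append threshold is a constructor parameter because the text leaves N unspecified.

```rust
use std::time::{Duration, Instant};

/// Decides when a snapshot is due: after `max_interval` has elapsed or after
/// `max_pending_appends` appends since the last snapshot, whichever comes first.
pub struct SnapshotCadence {
    max_interval: Duration,
    max_pending_appends: u32,
    last_snapshot: Instant,
    appends_since: u32,
}

impl SnapshotCadence {
    pub fn new(max_interval: Duration, max_pending_appends: u32, now: Instant) -> Self {
        Self { max_interval, max_pending_appends, last_snapshot: now, appends_since: 0 }
    }

    /// Call once per pending-observation append.
    pub fn record_append(&mut self) {
        self.appends_since = self.appends_since.saturating_add(1);
    }

    /// True when either the time limit or the append limit has been reached.
    pub fn is_due(&self, now: Instant) -> bool {
        now.duration_since(self.last_snapshot) >= self.max_interval
            || self.appends_since >= self.max_pending_appends
    }

    /// Call after a snapshot has been written successfully.
    pub fn mark_snapshot(&mut self, now: Instant) {
        self.last_snapshot = now;
        self.appends_since = 0;
    }
}
```

Passing `now` in explicitly keeps the policy deterministic, so both limits can be checked without sleeping.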
### Excluded
- SQLite+H3 alternative (Q3 follow-up if chosen later).
- KV alternative (Q3 follow-up).
- The post-flight push itself (`mission_client` task 08).
## Acceptance Criteria
**AC-1: Snapshot + reload round-trip**
Given a store with 100 MapObjects + 10 IgnoredItems + 5 pending observations
When `save_snapshot()` writes to disk and a fresh process calls `load_snapshot()`
Then the loaded state equals the saved state.
**AC-2: Atomic rename prevents partial writes**
Given a snapshot write is interrupted mid-write (simulated kill -9)
When a fresh process starts
Then it loads the previous good snapshot, not the partial one (no corruption observed).
**AC-3: Crash recovery loads pending**
Given a previous run terminated with non-empty pending_observations
When the new process calls `load_snapshot()` for the same mission_id
Then pending_observations is non-empty and matches the pre-crash count.
**AC-4: Corruption surfaces explicit error**
Given a snapshot file with truncated content
When `load_snapshot()` runs
Then it returns `Err(CorruptSnapshot)` and `snapshot_errors_total` increments; the store does NOT silently start empty.
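The "never silent" rule in AC-4 amounts to returning a typed error and bumping a counter instead of falling back to an empty store. The sketch below assumes hypothetical names (`LoadError`, `toy_decode`, and this `load_snapshot` signature) and takes the decoder as a parameter; `toy_decode` is a crude stand-in for a real JSON deserializer that only checks for balanced outer braces. Counting I/O failures in `snapshot_errors_total` is also an assumption.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical error type; the variant name follows the AC text.
#[derive(Debug)]
pub enum LoadError {
    Io(io::Error),
    CorruptSnapshot(String),
}

#[derive(Default)]
pub struct PersistenceMetrics {
    pub snapshot_errors_total: AtomicU64,
}

/// Read and decode a snapshot. Any failure is returned to the caller and
/// counted; there is deliberately no fallback to a default (empty) state.
pub fn load_snapshot<T>(
    path: &Path,
    metrics: &PersistenceMetrics,
    decode: impl Fn(&[u8]) -> Result<T, String>,
) -> Result<T, LoadError> {
    let result = fs::read(path)
        .map_err(LoadError::Io)
        .and_then(|bytes| decode(&bytes).map_err(LoadError::CorruptSnapshot));
    if result.is_err() {
        metrics.snapshot_errors_total.fetch_add(1, Ordering::Relaxed);
    }
    result
}

/// Stand-in decoder (NOT a JSON parser): accepts UTF-8 text wrapped in `{ }`.
pub fn toy_decode(bytes: &[u8]) -> Result<String, String> {
    let text = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
    let trimmed = text.trim();
    if trimmed.starts_with('{') && trimmed.ends_with('}') {
        Ok(trimmed.to_string())
    } else {
        Err("snapshot appears truncated".to_string())
    }
}
```

The caller at the composition root can then match on the error and refuse to start, rather than continuing with a store that has silently lost its pending observations.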
## Non-Functional Requirements
**Performance**
- Snapshot of a 30 km × 30 km mission (≤1 000 MapObjects): ≤1 s.
- Crash recovery: ≤2 s to a usable state (per `description.md §9`).
**Reliability**
- Atomic rename — no partial-write corruption.
- Corruption never silent.
## Runtime Completeness
- **Named capability**: persistent MapObjects state with crash recovery — default engine in-memory + JSON snapshot per Q3.
- **Production code that must exist**: real disk write; real atomic rename; real corruption-detection on load.
- **Allowed external stubs**: `tempfile` for test fixtures.
- **Unacceptable substitutes**: a no-op persistence in production is unacceptable (crash mid-flight loses the diff).