mirror of
https://github.com/azaion/autopilot.git
synced 2026-06-22 02:41:11 +00:00
[AZ-651] [AZ-668] lost-link failsafe ladder + mapobjects persistence (batch 7)
AZ-651 (mission_executor lost-link ladder):
- LostLinkLadder pure-logic state machine (LinkOk -> Degraded -> Lost
-> LinkLostInFollow + MavlinkLost branch). Configurable thresholds
via LostLinkConfig.
- LostLinkCommandIssuer trait + MavlinkCommandIssuer production impl
emitting MAV_CMD_NAV_RETURN_TO_LAUNCH via MavlinkHandle::send_command.
- LostLinkDriver task wires the ladder to operator-link watch, MAVLink
LinkEvent broadcast, and optional target-follow signal. On RTL,
driver calls the issuer THEN MissionExecutorHandle::failsafe_trigger.
- failsafe_trigger(LinkLost | LinkLostInFollow) short-circuits FlyMission
-> Land via direct FSM state mutation + TransitionEvent emission;
Paused state is intentionally NOT overridden.
- Tests: 4/4 ACs locally green (degraded-no-rtl; lost-fires-once;
follow-grace; mavlink-loss-no-rtl) plus driver + FSM integration.
AZ-668 (mapobjects_store persistence):
- Snapshot serializable shape + Store::{to_snapshot,from_snapshot}
round trip.
- MapObjectsPersistence async trait + JsonSnapshotEngine default impl
(write to .tmp, sync_all, atomic rename, best-effort parent fsync).
- PersistenceError::{Corrupt, SchemaMismatch} surfaces explicit errors
on bad blob; PersistenceMetrics tracks last_snapshot_ts,
snapshot_size_bytes, snapshot_errors_total.
- MapObjectsStore::from_snapshot factory for crash recovery from the
composition root.
- Tests: 4/4 ACs locally green (round-trip; atomic rename ignores
partial .tmp; crash recovery preserves pending; corruption returns
explicit error) plus schema-mismatch + metrics smoke checks.
Quality gates:
- cargo fmt: clean.
- cargo clippy -p mission_executor -p mapobjects_store --tests: 0 warns.
- cargo test --workspace: all green.
Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -1,72 +0,0 @@
# Lost-Link Failsafe Ladder (F10)

**Task**: AZ-651_mission_executor_lost_link_ladder
**Name**: Lost-link ladder LinkOk → LinkDegraded → LinkLost → LinkLostInFollow
**Description**: Per-tick evaluation of the operator/Ground-Station modem link state. By default, RTL is issued after a 30 s grace window; all thresholds are configurable. MAVLink-link loss to ArduPilot itself is a separate, more severe event: health → red, and the airframe failsafe takes over (we do NOT override it).
**Complexity**: 3 points
**Dependencies**: AZ-640_initial_structure, AZ-648_mission_executor_state_machine, AZ-649_mission_executor_telemetry_forwarding
**Component**: mission_executor
**Tracker**: AZ-651
**Epic**: AZ-636

## Problem

The operator's modem link is critical to safe operation but inherently flaky. The failsafe must escalate predictably from `LinkOk` to `LinkDegraded` (5–30 s) to `LinkLost` (>30 s) to `LinkLostInFollow` (the target-follow special case), each step with a defined behaviour. The default action on `LinkLost` is RTL after a grace window. Crucially, MAVLink-link loss to ArduPilot is a different event: autopilot does NOT override the airframe's built-in failsafe in that case.

## Outcome

- `LostLinkLadder::tick(now, link_state)` updates an enum `LadderState ∈ {LinkOk, LinkDegraded, LinkLost, LinkLostInFollow}` deterministically based on the elapsed time since the last operator-link heartbeat.
- `LinkDegraded` for 5–30 s: health → yellow; events queued; no command to airframe.
- `LinkLost` for >30 s (configurable): trigger RTL via `mavlink_layer`; transition to `LAND`.
- `LinkLostInFollow` (active `TargetFollow` + >30 s): 30 s grace, then RTL.
- MAVLink-link loss to ArduPilot: detected via `mavlink_layer`'s `LinkLost`; health → red; do NOT issue RTL (airframe handles it).
- Health surface: current `LadderState`, time-in-state, RTL trigger count.
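
The escalation listed above can be sketched as a pure-logic state machine. This is a minimal illustration, not the shipped implementation: the field names on `LostLinkConfig`, the `LadderAction` type, and the `tick(silence, in_follow)` signature are assumptions (the spec's `tick` takes `now` and `link_state`). It also assumes that the follow grace is counted on top of the lost threshold, which is one reading of the `LinkLostInFollow` bullet.

```rust
use std::time::Duration;

/// Ladder states named in the task description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LadderState {
    LinkOk,
    LinkDegraded,
    LinkLost,
    LinkLostInFollow,
}

/// What the caller should do after a tick (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LadderAction {
    NoAction,
    IssueRtl,
}

/// Thresholds; field names are assumptions, default values come from the spec.
#[derive(Debug, Clone, Copy)]
pub struct LostLinkConfig {
    pub degraded_after: Duration,
    pub lost_after: Duration,
    pub follow_grace: Duration,
}

impl Default for LostLinkConfig {
    fn default() -> Self {
        Self {
            degraded_after: Duration::from_secs(5),
            lost_after: Duration::from_secs(30),
            follow_grace: Duration::from_secs(30),
        }
    }
}

pub struct LostLinkLadder {
    cfg: LostLinkConfig,
    state: LadderState,
    rtl_issued: bool,
}

impl LostLinkLadder {
    pub fn new(cfg: LostLinkConfig) -> Self {
        Self { cfg, state: LadderState::LinkOk, rtl_issued: false }
    }

    pub fn state(&self) -> LadderState {
        self.state
    }

    /// `silence` is the time since the last operator-link heartbeat;
    /// `in_follow` says whether target-follow is currently active.
    pub fn tick(&mut self, silence: Duration, in_follow: bool) -> LadderAction {
        if silence < self.cfg.degraded_after {
            // Link is healthy again: re-arm the one-shot RTL latch.
            self.state = LadderState::LinkOk;
            self.rtl_issued = false;
            return LadderAction::NoAction;
        }
        if silence <= self.cfg.lost_after {
            self.state = LadderState::LinkDegraded;
            return LadderAction::NoAction;
        }
        // Past the lost threshold: RTL is due at once, unless target-follow
        // is active, in which case the extra grace must also elapse.
        let rtl_due = if in_follow {
            self.state = LadderState::LinkLostInFollow;
            silence > self.cfg.lost_after + self.cfg.follow_grace
        } else {
            self.state = LadderState::LinkLost;
            true
        };
        if rtl_due && !self.rtl_issued {
            self.rtl_issued = true;
            LadderAction::IssueRtl
        } else {
            LadderAction::NoAction
        }
    }
}
```

Keeping the ladder free of I/O means a caller can drive it with synthetic durations in tests, and the `rtl_issued` latch is what keeps a long outage from producing repeated RTL commands.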

## Scope

### Included
- Ladder state machine.
- Subscribe to operator-link state from `telemetry_stream` (forwarded by `operator_bridge` health).
- Subscribe to MAVLink-link state from `mavlink_layer`.
- Configurable thresholds (defaults: degraded=5 s, lost=30 s, follow-grace=30 s).
- RTL command issuance via `mavlink_layer::send_command(MAV_CMD_NAV_RETURN_TO_LAUNCH)`.

### Excluded
- Operator command auth checks (`operator_bridge`).
- Target-follow state ownership (`scan_controller`).

## Acceptance Criteria

**AC-1: Operator-link degraded then recovers**
Given a healthy link
When the operator-link heartbeat stops for 10 s and resumes
Then the ladder reports `LinkOk → LinkDegraded → LinkOk` with correct dwell times; no RTL is issued.

**AC-2: Operator-link lost triggers RTL**
Given a healthy link
When the operator-link heartbeat stops for 31 s
Then the ladder reports `LinkLost`, `send_command(MAV_CMD_NAV_RETURN_TO_LAUNCH)` is issued exactly once, and the state machine transitions to `LAND`.

**AC-3: Lost-in-follow grace then RTL**
Given the system is in `TargetFollow` and the operator-link drops
When the link is down for 30 s (grace), then continues to be down past the grace
Then RTL is triggered after the grace fires, not earlier.

**AC-4: MAVLink loss does NOT trigger autopilot-side RTL**
Given the MAVLink link to ArduPilot is lost (`mavlink_layer` reports `LinkLost`)
When the ladder tick runs
Then health → red, no `MAV_CMD_NAV_RETURN_TO_LAUNCH` is issued by autopilot (airframe failsafe owns the response), and the event is observable.
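
AC-2 and AC-4 can be exercised without a real MAVLink connection by putting command issuance behind a trait and substituting a recording double. The sketch below assumes this shape: `LostLinkCommandIssuer` is named in the commit message, but the `issue_rtl` method, the `RecordingIssuer` double, the `Health` and `LinkEvent` enums, and the synchronous `LostLinkDriver::handle` are hypothetical simplifications of the async driver task.

```rust
/// Health colours used in the acceptance criteria (type name is an assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Green,
    Yellow,
    Red,
}

/// Trait named in the commit message; the method name is hypothetical.
pub trait LostLinkCommandIssuer {
    fn issue_rtl(&mut self);
}

/// Test double that counts RTL commands instead of sending MAVLink.
#[derive(Debug, Default)]
pub struct RecordingIssuer {
    pub rtl_count: u32,
}

impl LostLinkCommandIssuer for RecordingIssuer {
    fn issue_rtl(&mut self) {
        self.rtl_count += 1;
    }
}

/// Simplified inputs to the driver (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LinkEvent {
    OperatorLinkLost,
    MavlinkLost,
}

pub struct LostLinkDriver<I: LostLinkCommandIssuer> {
    pub issuer: I,
    pub health: Health,
    rtl_sent: bool,
}

impl<I: LostLinkCommandIssuer> LostLinkDriver<I> {
    pub fn new(issuer: I) -> Self {
        Self { issuer, health: Health::Green, rtl_sent: false }
    }

    pub fn handle(&mut self, event: LinkEvent) {
        match event {
            // Operator link lost: issue RTL, but only the first time.
            LinkEvent::OperatorLinkLost => {
                if !self.rtl_sent {
                    self.issuer.issue_rtl();
                    self.rtl_sent = true;
                }
            }
            // MAVLink link lost: mark health red and leave the response
            // to the airframe failsafe; no RTL is issued from here.
            LinkEvent::MavlinkLost => {
                self.health = Health::Red;
            }
        }
    }
}
```

With this split, a test can assert `rtl_count == 1` after repeated `OperatorLinkLost` events and `rtl_count == 0` with `Health::Red` after `MavlinkLost`.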

## Non-Functional Requirements

**Performance**
- Ladder tick: ≤5 ms.

**Reliability**
- All thresholds configurable; no hardcoded values beyond the defaults documented above.

## Runtime Completeness

- **Named capability**: F10 lost-link failsafe ladder.
- **Production code that must exist**: real state machine; real RTL command issuance.
- **Unacceptable substitutes**: omitting the `LinkLostInFollow` grace is not acceptable (the operator link may glitch momentarily mid-follow).
@@ -1,76 +0,0 @@
# Persistence — In-Memory + JSON Snapshot (Q3 Default)

**Task**: AZ-668_mapobjects_store_persistence
**Name**: In-memory + JSON snapshot persistence (default per Q3)
**Description**: Crash-recovery and post-flight upload durability for the in-memory MapObjects state. Default engine: in-memory + atomic JSON snapshot to `${state_dir}/mapobjects/<mission_id>.json` per checkpoint. Q3 reserves the slot for SQLite+H3 / KV alternatives.
**Complexity**: 3 points
**Dependencies**: AZ-640_initial_structure, AZ-665_mapobjects_store_h3_classify, AZ-667_mapobjects_store_hydrate_and_pending
**Component**: mapobjects_store
**Tracker**: AZ-668
**Epic**: AZ-633

## Problem

The in-memory hashmap is authoritative for the active mission, but a crash mid-mission must not lose the pending diff. The choice of persistence engine is still open (Q3). The default is in-memory + JSON snapshot (atomic rename), and the engine sits behind a `MapObjectsPersistence` trait so that SQLite+H3 or RocksDB can be swapped in later without touching call sites.

## Outcome

- `MapObjectsPersistence` trait with `save_snapshot(state) -> Result<()>` and `load_snapshot(path) -> Result<State>`.
- `JsonSnapshotEngine` impl that writes to `${state_dir}/mapobjects/<mission_id>.json` via atomic rename (write to `.tmp` then rename).
- Snapshot cadence: configurable; default every 30 s OR on every N pending-observation appends, whichever comes first.
- Crash recovery: at startup, load the most recent snapshot for any mission that did not reach `POST_FLIGHT_SYNC`.
- Health surface: `last_snapshot_ts`, `snapshot_size_bytes`, `snapshot_errors_total`.
- Persistence corruption on startup: refuse to start with stale state; surface explicit error to the operator.
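
The write-to-`.tmp`-then-rename pattern from the list above can be sketched with the standard library alone. The function name `write_snapshot_atomically` and the `<name>.json.tmp` naming are assumptions for illustration. The ordering (write, `sync_all`, rename, best-effort parent fsync) follows the commit message.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `bytes` to `final_path` so that readers see either the previous
/// snapshot or the complete new one, never a partial file.
pub fn write_snapshot_atomically(final_path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Assumed naming: "<mission_id>.json" -> "<mission_id>.json.tmp".
    let tmp_path = final_path.with_extension("json.tmp");

    if let Some(parent) = final_path.parent() {
        fs::create_dir_all(parent)?;
    }

    {
        // Write the full payload to the temporary file and flush it to disk
        // before it becomes visible under the final name.
        let mut tmp = File::create(&tmp_path)?;
        tmp.write_all(bytes)?;
        tmp.sync_all()?;
    }

    // Rename replaces the destination in one step on POSIX filesystems.
    fs::rename(&tmp_path, final_path)?;

    // Best-effort fsync of the parent directory so the rename itself is
    // durable; errors are ignored because not every platform supports it.
    if let Some(parent) = final_path.parent() {
        if let Ok(dir) = File::open(parent) {
            let _ = dir.sync_all();
        }
    }
    Ok(())
}
```

A process killed between `File::create` and `fs::rename` leaves only a stray `.tmp` file behind. A loader that opens only the final path therefore still reads the previous good snapshot, which is the behaviour AC-2 asks for.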

## Scope

### Included
- `MapObjectsPersistence` trait.
- `JsonSnapshotEngine` (default impl).
- Atomic rename pattern.
- Crash-recovery load.
- Snapshot cadence policy.
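
The cadence policy ("every 30 s OR every N pending-observation appends, whichever comes first") reduces to two counters and an OR. The sketch below is one possible shape. `SnapshotCadence` and its method names are hypothetical, and `max_appends` stands in for the spec's N, which the task does not give a value for.

```rust
use std::time::Duration;

/// Tracks elapsed time and pending-observation appends since the last snapshot.
pub struct SnapshotCadence {
    interval: Duration,
    max_appends: u32,
    since_snapshot: Duration,
    appends: u32,
}

impl SnapshotCadence {
    pub fn new(interval: Duration, max_appends: u32) -> Self {
        Self { interval, max_appends, since_snapshot: Duration::ZERO, appends: 0 }
    }

    /// Advance the clock by the time elapsed since the previous call.
    pub fn advance(&mut self, elapsed: Duration) {
        self.since_snapshot += elapsed;
    }

    /// Record one pending-observation append.
    pub fn record_append(&mut self) {
        self.appends += 1;
    }

    /// A snapshot is due when either threshold is reached.
    pub fn is_due(&self) -> bool {
        self.since_snapshot >= self.interval || self.appends >= self.max_appends
    }

    /// Reset both counters after a snapshot has been written.
    pub fn mark_snapshot(&mut self) {
        self.since_snapshot = Duration::ZERO;
        self.appends = 0;
    }
}
```

Feeding elapsed time in explicitly, rather than reading a clock inside the policy, keeps the decision deterministic under test.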

### Excluded
- SQLite+H3 alternative (Q3 follow-up if chosen later).
- KV alternative (Q3 follow-up).
- The post-flight push itself (`mission_client` task 08).

## Acceptance Criteria

**AC-1: Snapshot + reload round-trip**
Given a store with 100 MapObjects + 10 IgnoredItems + 5 pending observations
When `save_snapshot()` writes to disk and a fresh process calls `load_snapshot()`
Then the loaded state equals the saved state.

**AC-2: Atomic rename prevents partial writes**
Given a snapshot write is interrupted mid-write (simulated kill -9)
When a fresh process starts
Then it loads the previous good snapshot, not the partial one (no corruption observed).

**AC-3: Crash recovery loads pending**
Given a previous run terminated with non-empty pending_observations
When the new process calls `load_snapshot()` for the same mission_id
Then pending_observations is non-empty and matches the pre-crash count.

**AC-4: Corruption surfaces explicit error**
Given a snapshot file with truncated content
When `load_snapshot()` runs
Then it returns `Err(CorruptSnapshot)` and `snapshot_errors_total` increments; the store does NOT silently start empty.
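
The "never silently start empty" rule in AC-4 comes down to mapping every decode failure to an explicit error and counting it. The sketch below illustrates that shape under several assumptions. The variant names follow the commit message (`PersistenceError::{Corrupt, SchemaMismatch}`), but their payloads, the `SCHEMA_VERSION` constant, and the caller-supplied `decode` closure are hypothetical. The closure stands in for the JSON decoding so the example stays dependency-free.

```rust
/// Variant names follow the commit message; payloads are assumptions.
#[derive(Debug, PartialEq, Eq)]
pub enum PersistenceError {
    Corrupt(String),
    SchemaMismatch { found: u32, expected: u32 },
}

/// Only the error counter from the health surface is modelled here.
#[derive(Debug, Default)]
pub struct PersistenceMetrics {
    pub snapshot_errors_total: u64,
}

/// Hypothetical schema version for this sketch.
pub const SCHEMA_VERSION: u32 = 1;

/// Decodes a snapshot blob. `decode` returns `(schema_version, state)` or a
/// human-readable reason why the blob could not be parsed.
pub fn load_snapshot<S>(
    bytes: &[u8],
    decode: impl Fn(&[u8]) -> Result<(u32, S), String>,
    metrics: &mut PersistenceMetrics,
) -> Result<S, PersistenceError> {
    let result = match decode(bytes) {
        // Unparseable or truncated blob: explicit corruption error.
        Err(reason) => Err(PersistenceError::Corrupt(reason)),
        // Parseable, but written by an incompatible schema version.
        Ok((found, _)) if found != SCHEMA_VERSION => {
            Err(PersistenceError::SchemaMismatch { found, expected: SCHEMA_VERSION })
        }
        Ok((_, state)) => Ok(state),
    };
    // Every failure is counted; no path falls back to an empty default state.
    if result.is_err() {
        metrics.snapshot_errors_total += 1;
    }
    result
}
```

Because the function has no `Default` fallback, the composition root must handle the `Err` itself, for example by refusing to start and reporting to the operator.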

## Non-Functional Requirements

**Performance**
- Snapshot of a 30 km × 30 km mission (≤1 000 MapObjects): ≤1 s.
- Crash recovery: ≤2 s to a usable state (per `description.md §9`).

**Reliability**
- Atomic rename: no partial-write corruption.
- Corruption is never silent.

## Runtime Completeness

- **Named capability**: persistent MapObjects state with crash recovery; default engine: in-memory + JSON snapshot per Q3.
- **Production code that must exist**: real disk write; real atomic rename; real corruption-detection on load.
- **Allowed external stubs**: `tempfile` for test fixtures.
- **Unacceptable substitutes**: a no-op persistence in production is unacceptable (crash mid-flight loses the diff).