[AZ-335] C1 warm-start hint persistence + F8 reboot recovery wiring

Adds JsonSidecarWarmStartHintStore (atomic JSON + SHA-256 sidecar via
AZ-280) inside c1_vio, plus the cross-strategy WarmStartWiredStrategy
wrapper + prime_warm_start_from_disk / prime_warm_start_from_fc hooks
at runtime_root. AC-7 post-reset covariance inflation and AC-8 "no
fake confidence" baseline floor are enforced at the wiring layer so
no strategy module needed edits. Adds three c1_vio config fields
(warm_start_store_dir, warm_start_save_period_frames,
post_reset_covariance_inflation_factor) and registers the new FDR
kind vio.warm_start. 34 unit tests cover all 10 ACs + 3 NFRs.

Verdict PASS_WITH_WARNINGS — see
_docs/03_implementation/reviews/batch_56_review.md for the four
non-blocking documentation findings (F1 cold-start log kind shorthand,
F2 strategy-frame pose semantics, F3 dev-hardware perf smoke, F4
runtime_root importing c1-internal _facade_spine for shared FDR
conventions).

Closes AZ-335; depends on AZ-528 (batch 55).

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-14 03:30:46 +03:00
parent f12789ebf0
commit 06f655d8fb
10 changed files with 2239 additions and 3 deletions
@@ -1,191 +0,0 @@
# C1 Warm-Start Hint Persistence + F8 Reboot Recovery Wiring
**Task**: AZ-335_c1_warm_start_recovery
**Name**: C1 Warm-Start + F8 Reboot Recovery
**Description**: Implement the cross-cutting wiring that lets every `VioStrategy` recover from F8 (companion reboot) without fake confidence and lets F2 (takeoff load) seed the strategy with the FC EKF's last valid GPS + IMU-extrapolated pose. Adds a small `WarmStartHintStore` (atomic JSON sidecar persistence, written after every successful `VioOutput`, read once at process startup before the first `process_frame`), plus the runtime composition glue that captures the hint flow at the appropriate flight-state boundaries. The strategy implementations (AZ-332 / AZ-333 / AZ-334) already implement `reset_to_warm_start`; this task delivers the orchestration around them — what lives where on disk between flights, when `reset_to_warm_start` is invoked, and how AC-5.1 (converge within 5 frames) and AC-5.3 (no fake confidence after reboot) are satisfied at the wiring layer rather than per-strategy.
**Complexity**: 3 points
**Dependencies**: AZ-331_c1_vio_strategy_protocol, AZ-332_c1_okvis2_strategy, AZ-333_c1_vins_mono_strategy, AZ-334_c1_klt_ransac_strategy, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-270_compose_root, AZ-280_sha256_sidecar, AZ-272_fdr_record_schema
**Component**: c1_vio (epic AZ-254 / E-C1)
**Tracker**: AZ-335
**Epic**: AZ-254 (E-C1)
### Document Dependencies
- `_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md` — `WarmStartPose` DTO + `reset_to_warm_start` Protocol method (AZ-331).
- `_docs/02_document/contracts/shared_helpers/sha256_sidecar.md` — atomic write + sidecar pattern (AZ-280).
- `_docs/02_document/contracts/shared_helpers/se3_utils.md` — SE(3) ↔ JSON-serialisable form (AZ-277; relied on indirectly via `WarmStartPose` field types).
- `_docs/02_document/components/01_c1_vio/description.md` — § 1 mentions F2 takeoff load and F8 reboot recovery; § 5 notes strategy lives for the duration of a flight; reset on `reset_to_warm_start` for F8 reboot.
- `_docs/02_document/components/01_c1_vio/tests.md` — C1-IT-05 (warm-start convergence within 5 frames; AC-5.1) and C1-IT-06 (F8 reboot recovery; AC-5.3) bind this task's behaviour at the test layer (deferred to Step 9 / E-BBT).
## Problem
Without this wiring:
- AC-5.1 (initialisation from the FC EKF's last valid GPS + IMU-extrapolated pose) has no producer at the runtime layer; each strategy's `reset_to_warm_start` is a stub that no one calls.
- AC-5.3 (on F8 reboot, re-init from FC IMU-extrapolated pose without fake confidence) collapses; the companion process restart path lands in a cold-start that takes minutes to converge — outside the AC-NEW-1 30 s budget.
- The composition root would have to grow per-strategy F2/F8 logic, violating the "interface-first composition root" principle (ADR-009): cross-strategy concerns belong in shared wiring, not duplicated across strategy modules.
- The hint file format and on-disk location would drift across operator deployments (every site picking its own path) — making post-flight forensics and operator playbooks inconsistent.
- "No fake confidence" (AC-5.3) — the requirement that post-reboot covariance must NOT be smaller than pre-reboot — has no enforcement point; each strategy implementation could silently emit a tighter post-reboot covariance because it "knows" the hint is good, defeating the safety invariant.
This task delivers the cross-strategy orchestration: a small persistence helper, the F2 / F8 hooks in the runtime composition root, and the AC-5.3 enforcement that the post-reboot strategy emits inflated covariance until it has independently re-converged.
## Outcome
- A `WarmStartHintStore` interface + default implementation at `src/gps_denied_onboard/components/c1_vio/warm_start_store.py`:
  - `WarmStartHintStore` Protocol (PEP 544): `save(hint: WarmStartPose) -> None` + `load() -> WarmStartPose | None` + `clear() -> None`.
  - `JsonSidecarWarmStartHintStore`: writes a JSON file `{store_dir}/c1_warm_start.json` via the AZ-280 `Sha256Sidecar` atomic-write+sidecar pattern (file + `.sha256`); `load()` verifies the sidecar before returning the hint (corrupted file → `load()` returns `None` and emits a WARN log; the wiring path treats this as "no hint" and falls through to cold-start).
  - The store is constructor-injected into the strategy through the composition root; the strategy itself does NOT touch the filesystem.
- Runtime composition root extension at `src/gps_denied_onboard/runtime_root/vio_factory.py` (already extended by AZ-331; this task adds two hooks):
  1. **F2 takeoff hook** (`prime_warm_start_from_fc(strategy, fc_adapter, store)`): reads the FC EKF's last valid GPS + IMU-extrapolated pose via the C8 `FcAdapter` interface (consumed via the constructor-injected interface, NOT a direct C8 module import; the Layer 3→4 ban is respected via the interface-at-producer pattern), constructs a `WarmStartPose`, calls `strategy.reset_to_warm_start(hint)`, and saves the hint to the store. This hook is invoked once at takeoff (operator-side or auto-detected via the FC's `flight_state` transition to `IN_AIR`).
  2. **F8 reboot hook** (`prime_warm_start_from_disk(strategy, store)`): at every process startup before the first `process_frame`, calls `store.load()`; if the result is non-`None`, calls `strategy.reset_to_warm_start(hint)`. If `load()` returns `None` (cold start; no prior hint, or the hint is corrupted), `reset_to_warm_start` is not invoked; the strategy emits its INIT-state behaviour for the first `warm_start_max_frames` frames (AC-5.1 per AZ-331's contract).
- Per-frame save hook (cross-cutting): every emitted `VioOutput` from `process_frame` is converted into a `WarmStartPose` (relative-pose chained against the prior baseline by the runtime root, plus the latest `imu_bias` from the same `VioOutput`) and saved via `store.save(hint)`. Save throughput is bounded — `config.vio.warm_start_save_period_frames` (default 5) limits how often the disk write is incurred (every Nth frame).
- AC-5.3 enforcement at the wiring layer: after `prime_warm_start_from_disk` injects a hint, the runtime root sets a small `consecutive_post_reset_frames` counter on the strategy facade (NOT mutating the strategy itself; the counter lives in the wiring). For the first `config.vio.warm_start_max_frames` (default 5) frames after a `reset_to_warm_start`, the runtime root post-processes each emitted `VioOutput`, inflating `pose_covariance_6x6` by a configurable factor (default 2.0× the strategy's emitted value). Together with the persisted pre-reboot baseline floor (AC-8), this guarantees that no post-reboot output carries a covariance smaller than the pre-reboot one, regardless of what the strategy itself reports. The inflation is removed once the counter elapses.
- Config schema extension to AZ-269: `config.vio.warm_start_store_dir` (default `/var/lib/gps_denied_onboard/warm_start/`), `config.vio.warm_start_save_period_frames` (default 5), `config.vio.post_reset_covariance_inflation_factor` (default 2.0).
- INFO log on every successful `prime_warm_start_*` invocation (with the source: `f2_takeoff_fc` / `f8_reboot_disk` / `cold_start_no_hint`); WARN log on hint file corruption; ERROR log on any strategy `reset_to_warm_start` failure.
- FDR record `kind="vio.warm_start"` emitted on every prime invocation, with the source label and the `bias_norm` of the loaded hint (lets post-flight forensics see whether the hint was used and how stale it was).
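The three new config fields and their defaults can be pictured as a small frozen dataclass. This is only a sketch: the class name `WarmStartConfigSketch` and the `problems()` check are hypothetical, the real schema lives in the AZ-269 config loader, and the rule that the inflation factor must exceed 1.0 is an assumption drawn from the "Unacceptable substitutes" note in Runtime Completeness.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WarmStartConfigSketch:
    """Hypothetical mirror of the three fields added to the config schema."""

    warm_start_store_dir: str = "/var/lib/gps_denied_onboard/warm_start/"
    warm_start_save_period_frames: int = 5
    post_reset_covariance_inflation_factor: float = 2.0

    def problems(self) -> List[str]:
        """Return human-readable validation problems (empty list means OK)."""
        found: List[str] = []
        if self.warm_start_save_period_frames < 1:
            found.append("warm_start_save_period_frames must be >= 1")
        # Assumption: a factor of 1.0 or less would not inflate anything.
        if self.post_reset_covariance_inflation_factor <= 1.0:
            found.append("post_reset_covariance_inflation_factor must be > 1.0")
        return found
```

Returning a list of problems rather than raising keeps the sketch easy to test; the real loader may well validate differently.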
## Scope
### Included
- `WarmStartHintStore` Protocol + `JsonSidecarWarmStartHintStore` default implementation.
- `prime_warm_start_from_fc(strategy, fc_adapter, store)` runtime composition function.
- `prime_warm_start_from_disk(strategy, store)` runtime composition function.
- Per-frame save hook integration (called by the runtime root after every successful `process_frame` emission).
- Post-reset covariance inflation wrapper at the wiring layer (NOT inside any strategy).
- Config schema extension to AZ-269 for the three new fields.
- INFO / WARN / ERROR logging per description.md § 9.
- FDR `kind="vio.warm_start"` record emission via the injected `FdrClient`.
- Atomic write + sidecar verification via AZ-280 (no naked `Path.write_bytes` / `open().write` in this task).
- Unit tests covering hint round-trip, corruption handling, post-reset inflation, F8 cold-start fall-through.
### Excluded
- The `VioStrategy` Protocol itself — owned by AZ-331.
- The three strategy implementations of `reset_to_warm_start` — owned by AZ-332 / AZ-333 / AZ-334.
- C8 `FcAdapter` interface — owned by E-C8 (AZ-261); this task consumes the interface, does NOT define it.
- AC-5.1 / AC-5.3 / C1-IT-05 / C1-IT-06 component-internal tests themselves — deferred to Step 9 / E-BBT per greenfield flow Step 6 rule.
- Multi-flight hint history (only the latest hint is persisted; older hints are overwritten by atomic write).
- Operator UI for inspecting hint freshness — out of scope; operator reads the FDR record.
- Hint encryption — the warm-start hint contains pose + bias, not credentials; on-disk encryption is outside the threat model this cycle.
## Acceptance Criteria
**AC-1: `WarmStartHintStore` round-trip**
Given an empty store directory and a constructed `WarmStartPose` instance
When `store.save(hint)` is called and then `store.load()` is called
Then `load()` returns a `WarmStartPose` deep-equal to the original hint; the on-disk file at `{store_dir}/c1_warm_start.json` exists; the sidecar at `{store_dir}/c1_warm_start.json.sha256` exists and verifies
**AC-2: Corrupted hint file → `load()` returns `None` + WARN log**
Given a `c1_warm_start.json` whose actual sha256 does not match the sidecar
When `store.load()` is called
Then `None` is returned; ONE WARN log `kind="c1.warm_start.corrupted"` with the offending path is emitted; the file is NOT silently deleted (operator may want to forensically inspect)
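AC-1 and AC-2 together describe a save / verify / load cycle. The sketch below illustrates that cycle with the standard library only. It is not the AZ-280 `Sha256Sidecar` API (which this document does not show): `SketchHintStore` and `_replace_into` are hypothetical names, hints are plain dicts rather than `WarmStartPose`, and the temp-file rename is a simplified stand-in for the real atomic write.

```python
import hashlib
import json
import logging
import os
import tempfile
from pathlib import Path
from typing import Optional

log = logging.getLogger("c1.warm_start")


def _replace_into(path: Path, data: bytes) -> None:
    # Stand-in for the AZ-280 helper: write a sibling temp file, then rename it.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(data)
    os.replace(tmp, path)


class SketchHintStore:
    """Hypothetical store; the directory is assumed to exist already."""

    def __init__(self, store_dir: Path) -> None:
        self._json = Path(store_dir) / "c1_warm_start.json"
        self._sidecar = Path(store_dir) / "c1_warm_start.json.sha256"

    def save(self, hint: dict) -> None:
        payload = json.dumps(hint, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        _replace_into(self._json, payload)
        _replace_into(self._sidecar, digest.encode("ascii"))

    def load(self) -> Optional[dict]:
        if not (self._json.exists() and self._sidecar.exists()):
            return None
        payload = self._json.read_bytes()
        expected = self._sidecar.read_text(encoding="ascii").strip()
        if hashlib.sha256(payload).hexdigest() != expected:
            # The file is left in place so an operator can inspect it.
            log.warning("c1.warm_start.corrupted path=%s", self._json)
            return None
        return json.loads(payload)

    def clear(self) -> None:
        for path in (self._json, self._sidecar):
            path.unlink(missing_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        store = SketchHintStore(Path(d))
        store.save({"imu_bias": [0.01, 0.02, 0.03]})
        print(store.load())
```

One caveat of this simplified version: a crash between the two renames leaves a new JSON file next to an old sidecar, which `load()` then reports as corrupted. How AZ-280 closes that window is not described here.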
**AC-3: Cold-start path — no hint, no reset**
Given an empty store directory at process startup
When `prime_warm_start_from_disk(strategy, store)` is called
Then `store.load()` returns `None`; `strategy.reset_to_warm_start` is NOT invoked (verifiable via spy); ONE INFO log `kind="c1.warm_start.cold_start"` is emitted; the strategy proceeds with its own INIT-state behaviour
**AC-4: F8 reboot path — hint loaded, `reset_to_warm_start` invoked**
Given a populated store directory with a known hint
When `prime_warm_start_from_disk(strategy, store)` is called
Then `store.load()` returns the hint; `strategy.reset_to_warm_start(hint)` is invoked exactly once with the loaded hint (verifiable via spy); ONE INFO log `kind="c1.warm_start.f8_reboot_disk"` and ONE FDR record `kind="vio.warm_start"` are emitted
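The branching in AC-3 and AC-4 can be sketched as follows. Everything beyond the names quoted in the criteria is an assumption: the extra `fdr` parameter is a plain list standing in for the injected `FdrClient`, the returned source label is for illustration, and `SpyStrategy` / `FakeStore` are test doubles. An FDR record is emitted only on the hint path here, following AC-4; whether the cold-start path also emits one is not settled by the criteria.

```python
import logging
from typing import Any, List, Optional

log = logging.getLogger("c1.warm_start")


class SpyStrategy:
    """Test double that records every reset_to_warm_start call."""

    def __init__(self) -> None:
        self.calls: List[Any] = []

    def reset_to_warm_start(self, hint: Any) -> None:
        self.calls.append(hint)


class FakeStore:
    """Test double whose load() returns a scripted hint (or None)."""

    def __init__(self, hint: Optional[Any]) -> None:
        self._hint = hint

    def load(self) -> Optional[Any]:
        return self._hint


def prime_warm_start_from_disk(strategy: Any, store: Any, fdr: List[dict]) -> str:
    """Prime the strategy from the stored hint; return the source label."""
    hint = store.load()
    if hint is None:
        log.info("c1.warm_start.cold_start")
        return "cold_start_no_hint"
    try:
        strategy.reset_to_warm_start(hint)
    except Exception:
        # ERROR-level log; the process keeps running and falls back to cold start.
        log.exception("reset_to_warm_start failed; degrading to cold start")
        return "cold_start_no_hint"
    log.info("c1.warm_start.f8_reboot_disk")
    fdr.append({"kind": "vio.warm_start", "source": "f8_reboot_disk"})
    return "f8_reboot_disk"
```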
**AC-5: F2 takeoff path — FC adapter queried, hint persisted**
Given a constructed `FcAdapter` (test double) returning a known last-valid-GPS + IMU-extrapolated pose
When `prime_warm_start_from_fc(strategy, fc_adapter, store)` is called
Then a `WarmStartPose` constructed from the FC data is passed to `strategy.reset_to_warm_start`; the same hint is then saved via `store.save`; ONE INFO log `kind="c1.warm_start.f2_takeoff_fc"` and ONE FDR record `kind="vio.warm_start"` are emitted
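A rough sketch of the F2 path in AC-5, including the degrade-to-cold-start behaviour required under Reliability. The `FcAdapter` interface is owned by E-C8 and not shown in this document, so the method `last_valid_pose()`, the reply keys, and the `SketchHint` fields are all hypothetical. FDR emission is left out for brevity.

```python
import logging
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple

log = logging.getLogger("c1.warm_start")


@dataclass(frozen=True)
class SketchHint:
    """Hypothetical stand-in for WarmStartPose."""

    position: Tuple[float, ...]
    imu_bias: Tuple[float, ...]


class FakeFcAdapter:
    """Test double; the real interface and method name may differ."""

    def __init__(self, reply: Optional[dict]) -> None:
        self._reply = reply

    def last_valid_pose(self) -> Optional[dict]:
        return self._reply


class Recorder:
    """Acts as both the strategy spy and the store spy."""

    def __init__(self) -> None:
        self.resets: List[Any] = []
        self.saved: List[Any] = []

    def reset_to_warm_start(self, hint: Any) -> None:
        self.resets.append(hint)

    def save(self, hint: Any) -> None:
        self.saved.append(hint)


def prime_warm_start_from_fc(strategy: Any, fc_adapter: Any, store: Any) -> bool:
    """Return True if a hint was applied and persisted, False on cold start."""
    try:
        raw = fc_adapter.last_valid_pose()
        hint = SketchHint(tuple(raw["position"]), tuple(raw["imu_bias"]))
    except (TypeError, KeyError) as exc:
        # Missing or malformed FC data degrades to cold start; no crash.
        log.warning("FC response unusable, falling back to cold start: %r", exc)
        return False
    strategy.reset_to_warm_start(hint)
    store.save(hint)
    log.info("c1.warm_start.f2_takeoff_fc")
    return True
```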
**AC-6: Per-frame save respects period**
Given `config.vio.warm_start_save_period_frames = 5` and a strategy emitting 12 successful `VioOutput`s
When the per-frame save hook is invoked once per emission
Then `store.save` is called exactly 2 times (after frames 5 and 10; frame 12 is mid-period); the on-disk hint reflects frame 10's `VioOutput`; the next save will occur after frame 15
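The throttling in AC-6 amounts to a frame counter in front of `store.save`. A minimal sketch, assuming a hypothetical `ThrottledSaver` wrapper that receives the save callable by injection:

```python
from typing import Any, Callable


class ThrottledSaver:
    """Forward every Nth hint to the save callable; drop the rest."""

    def __init__(self, save: Callable[[Any], None], period_frames: int = 5) -> None:
        if period_frames < 1:
            raise ValueError("period_frames must be >= 1")
        self._save = save
        self._period = period_frames
        self._count = 0

    def on_output(self, hint: Any) -> bool:
        """Call once per successful VioOutput; return True if a save happened."""
        self._count += 1
        if self._count % self._period != 0:
            return False
        self._save(hint)
        return True
```

With a period of 5, twelve calls trigger saves on the 5th and 10th only, and the third save lands on the 15th call.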
**AC-7: Post-reset covariance inflation — first N frames inflated**
Given `config.vio.warm_start_max_frames = 5` and `config.vio.post_reset_covariance_inflation_factor = 2.0`, after `prime_warm_start_from_disk` invokes `reset_to_warm_start`
When the next 5 `VioOutput`s flow through the runtime root
Then each output's `pose_covariance_6x6` Frobenius norm is exactly 2.0× the strategy's emitted norm; the 6th frame's covariance is the strategy's unmodified emitted norm; the inflation is reflected in the consumer's view (C5 fusion sees the inflated covariance)
**AC-8: AC-5.3 — post-reboot covariance never below pre-reboot**
Given a saved hint with `||pose_covariance_6x6||_F = X` (the last pre-reboot value, captured by the wiring at save time as a "baseline" alongside the hint)
When `prime_warm_start_from_disk` runs and the strategy emits 5 post-reset frames
Then every post-reset `VioOutput.pose_covariance_6x6` Frobenius norm is ≥ X (after the 2.0× inflation in AC-7); the AC-5.3 "no fake confidence" invariant is enforced at the wiring layer regardless of strategy behaviour
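AC-7 and AC-8 can be read as one wrapper: scale the covariance for the first N frames, then make sure the result does not fall below the stored baseline norm. The sketch below uses plain nested lists to stay dependency-free. `PostResetInflator` is a hypothetical name, and rescaling the matrix up to the baseline when the inflated norm is still below it is an assumption; this document fixes the invariant but not the mechanism. Note that when the floor kicks in, the output norm exceeds 2.0× the emitted norm.

```python
import math
from typing import List, Optional

Matrix = List[List[float]]


def frobenius(m: Matrix) -> float:
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in m for v in row))


def scale(m: Matrix, k: float) -> Matrix:
    return [[v * k for v in row] for row in m]


class PostResetInflator:
    def __init__(self, max_frames: int = 5, factor: float = 2.0,
                 baseline_norm: Optional[float] = None) -> None:
        self._max_frames = max_frames
        self._factor = factor
        self._baseline = baseline_norm  # pre-reboot norm loaded with the hint
        self._remaining = 0

    def arm(self) -> None:
        """Call right after reset_to_warm_start."""
        self._remaining = self._max_frames

    def apply(self, cov: Matrix) -> Matrix:
        """Return the covariance the consumer should see for this frame."""
        if self._remaining <= 0:
            return cov  # counter elapsed: pass through unmodified
        self._remaining -= 1
        out = scale(cov, self._factor)
        norm = frobenius(out)
        # Assumed floor mechanism: rescale so the norm reaches the baseline.
        if self._baseline is not None and 0.0 < norm < self._baseline:
            out = scale(out, self._baseline / norm)
        return out
```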
**AC-9: `store.clear()` removes file + sidecar**
Given a populated store directory
When `store.clear()` is called
Then both `c1_warm_start.json` and `c1_warm_start.json.sha256` are removed; subsequent `store.load()` returns `None`; ONE INFO log `kind="c1.warm_start.cleared"` is emitted
**AC-10: Atomic write — process kill mid-save leaves no half-written file**
Given a save in progress (mid-write)
When the process is killed
Then on next startup `store.load()` either returns the prior valid hint (the temp-file rename was not yet committed) or `None` if no prior hint existed; there is NO scenario where a half-written file is loaded as a "valid" hint (AZ-280 `Sha256Sidecar` atomic write + sidecar verify guarantee this)
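The property in AC-10 rests on the usual temp-file-then-rename idiom. The sketch below is a generic illustration of that idiom, not the AZ-280 implementation: `atomic_write` and its `crash_before_rename` switch (used only to simulate a kill) are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes, crash_before_rename: bool = False) -> None:
    """Write data to a temp file in the same directory, then rename over path."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())  # push the bytes to disk before the rename
    if crash_before_rename:
        raise RuntimeError("simulated kill before rename")
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def demo(store_dir: Path) -> bytes:
    """Show that an interrupted second write leaves the first file intact."""
    target = store_dir / "c1_warm_start.json"
    atomic_write(target, b'{"v": 1}')
    try:
        atomic_write(target, b'{"v": 2}', crash_before_rename=True)
    except RuntimeError:
        pass
    return target.read_bytes()
```

A real kill would also leave the orphaned temp file behind; cleaning those up at startup is a separate concern that this sketch ignores.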
## Non-Functional Requirements
**Performance**
- `store.save(hint)` p99 ≤ 50 ms on Tier-2 NVMe (a single atomic JSON write of ~1 KB + 64-byte sidecar). At a 3 Hz frame rate with `warm_start_save_period_frames = 5`, one save occurs every 5 / 3 Hz ≈ 1.67 s, so the amortised cost is ≤ 50 ms / 5 frames = 10 ms per frame, or about 3 % of wall-clock time.
- `store.load()` p99 ≤ 20 ms on Tier-2 NVMe (one read + one sha256 verify of ~1 KB).
- Post-reset covariance inflation is a single matrix scalar multiplication per `VioOutput` — sub-microsecond cost; no measurable latency impact on the C1-PT-01 budget.
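The worst-case save cost can be worked through numerically from the figures quoted in this section (3 Hz frame rate, save period of 5 frames, 50 ms p99). The function name below is illustrative only.

```python
from typing import Tuple


def amortised_save_cost(frame_rate_hz: float, period_frames: int,
                        save_p99_ms: float) -> Tuple[float, float, float]:
    """Return (seconds between saves, ms per frame, fraction of wall-clock time)."""
    save_interval_s = period_frames / frame_rate_hz
    per_frame_ms = save_p99_ms / period_frames
    duty_cycle = (save_p99_ms / 1000.0) / save_interval_s
    return save_interval_s, per_frame_ms, duty_cycle


interval_s, per_frame_ms, duty = amortised_save_cost(3.0, 5, 50.0)
print(f"one save every {interval_s:.2f} s, {per_frame_ms:.0f} ms/frame, {duty:.0%} duty")
```

For the defaults this gives one save roughly every 1.67 s, a worst case of 10 ms per frame when spread over five frames, and about 3 % of wall-clock time spent saving.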
**Compatibility**
- JSON schema for the hint file is fixed at v1 and recorded in the file's `version` field; future schema changes bump that field, and the AZ-280 sidecar pattern continues to handle bit-rot detection.
- The store directory MUST be on a writable mount with sufficient space (a few KB suffices); the operator deployment ensures this via the systemd unit.
**Reliability**
- Atomic write + sidecar verify defends against process kill mid-save and against bit-flip.
- The post-reset covariance inflation is the only safety invariant enforced at the wiring layer; per-strategy honest-covariance behaviour during steady-state is enforced by the strategies themselves (AZ-332 / AZ-333 / AZ-334 each have an AC-9 honest-covariance contract).
- Failure of `prime_warm_start_*` MUST NOT crash the process — a malformed hint or a missing FC adapter response degrades to cold-start with a WARN log; the process continues.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | `save` then `load` round-trip | Loaded hint deep-equal to original; file + sidecar exist |
| AC-2 | Corrupt hint file (flip 1 byte) | `load()` returns `None`; WARN log emitted; file NOT deleted |
| AC-3 | `prime_warm_start_from_disk` with empty store | `reset_to_warm_start` NOT called (spy); INFO log `cold_start` |
| AC-4 | `prime_warm_start_from_disk` with valid hint | `reset_to_warm_start(hint)` called once; INFO log `f8_reboot_disk`; FDR record emitted |
| AC-5 | `prime_warm_start_from_fc` with fake FC adapter | Hint constructed from FC data; `reset_to_warm_start` called; `store.save` called; INFO log `f2_takeoff_fc`; FDR record emitted |
| AC-6 | Per-frame save with period=5, 12 frames | `store.save` called exactly twice (after frames 5 and 10) |
| AC-7 | Post-reset inflation × 5 frames | Each output's covariance Frobenius norm = 2.0× strategy's emitted norm; 6th frame is unmodified |
| AC-8 | Pre-reboot baseline X; post-reboot 5 frames | Every post-reset covariance ≥ X (after inflation) |
| AC-9 | `store.clear()` then `load()` | Both files removed; `load()` returns `None`; INFO log emitted |
| AC-10 | Mock process-kill mid-save | On restart, `load()` returns prior valid hint OR `None`; no half-written file ever loaded |
| NFR-perf-save | Microbench `store.save` × 1000 | p99 ≤ 50 ms on Tier-2 NVMe |
| NFR-perf-load | Microbench `store.load` × 1000 | p99 ≤ 20 ms on Tier-2 NVMe |
| NFR-reliability-no-crash | Inject malformed FC adapter response | `prime_warm_start_from_fc` logs WARN and returns; process does NOT crash |
## Constraints
- The persistence path uses AZ-280's `Sha256Sidecar` for atomic write + verify — no naked `Path.write_bytes` / `open().write` (per `coderule.mdc` "follow established project patterns").
- The store interface is a Protocol; the JSON-sidecar implementation is the default but a future operator-managed store (e.g., Redis-backed) could plug in via the same interface.
- The post-reset covariance inflation lives at the wiring layer — NOT inside any strategy. Adding inflation to a strategy is forbidden (would double-inflate when the wiring also inflates).
- The runtime root reads the FC adapter via the constructor-injected `FcAdapter` interface (Layer 3 → Layer 4 interface-at-producer pattern; documented in `module-layout.md` Layering notes); direct import of any C8 concrete adapter is forbidden in this task's source.
- The hint file's JSON schema is owned by this task; its `version` field is `1` and any future change requires a major bump per the standard versioning rule.
- Per-frame save throttling defaults to every 5 frames (0.6 Hz at 3 Hz frame rate); the value is config-driven.
- The post-reset baseline (the pre-reboot Frobenius norm used as the AC-8 floor) is persisted alongside the hint in the JSON file under a `pre_reboot_covariance_norm` field; AC-8's enforcement reads it back at load time.
## Risks & Mitigation
**Risk 1: The store directory is not writable in the airborne deployment**
- *Risk*: Read-only root filesystem (a hardening choice some operators make) defeats `store.save`; every flight reverts to cold-start, blowing the AC-NEW-1 budget.
- *Mitigation*: `store.save` failures emit ERROR logs but do NOT crash the process; the operator's deployment playbook (out of scope here) ensures `config.vio.warm_start_store_dir` points at a writable mount. NFR-reliability-no-crash covers the no-crash case.
**Risk 2: Hint goes stale when the operator changes camera calibration between flights**
- *Risk*: Saved hint is for `adti26` calibration; operator swaps to a new camera; the hint's pose / bias are no longer applicable.
- *Mitigation*: The JSON schema includes a `calibration_id` field (the calibration's content hash); `load()` returns `None` if the current `CameraCalibration.id` does not match the saved hint's `calibration_id`; ONE WARN log `kind="c1.warm_start.calibration_mismatch"` is emitted. This forces a clean cold-start when calibration changes — correct behaviour.
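The calibration guard described in this mitigation is a simple equality check applied after the sidecar has verified. A minimal sketch, assuming the decoded hint is a dict carrying a `calibration_id` key; the function name `accept_hint` is hypothetical:

```python
import logging
from typing import Optional

log = logging.getLogger("c1.warm_start")


def accept_hint(record: Optional[dict], current_calibration_id: str) -> Optional[dict]:
    """Return the record only if it was saved under the current calibration."""
    if record is None:
        return None
    saved_id = record.get("calibration_id")
    if saved_id != current_calibration_id:
        log.warning(
            "c1.warm_start.calibration_mismatch saved=%r current=%r",
            saved_id, current_calibration_id,
        )
        return None  # force a clean cold start
    return record
```

A record that lacks the field altogether is also rejected, which errs on the side of a cold start.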
**Risk 3: Per-frame save throughput pressure on slow disks**
- *Risk*: A slow operator-provided storage device makes `store.save` exceed the 50 ms budget at default period; per-frame DEBUG log records the slowness but the sustained pressure could starve other I/O.
- *Mitigation*: The throttle period (`warm_start_save_period_frames`) is config-driven; an operator with slow storage can raise it to 30 (one save per 10 s at 3 Hz). The save itself is sync — no async queue this cycle.
**Risk 4: Post-reset covariance inflation is too aggressive (or not aggressive enough)**
- *Risk*: The 2.0× factor is a heuristic; if a strategy's natural post-reset behaviour is already inflated 3×, the wiring inflates further to 6×, which is over-cautious. If it is 1.1×, the wiring is barely honest at 2.2×.
- *Mitigation*: The factor is config-driven (`post_reset_covariance_inflation_factor`); a future cycle's calibration test (Step 9 / E-BBT) will tune it per strategy. The 2.0× default is a conservative safety baseline; the AC-8 floor (post-reset ≥ pre-reboot) is the hard invariant.
## Runtime Completeness
- **Named capability**: cross-strategy warm-start hint persistence + F2 takeoff + F8 reboot recovery wiring + AC-5.3 honest-covariance enforcement at the wiring layer (architecture / E-C1 / `solution.md` "F2 takeoff load" + "F8 Companion-reboot recovery" / AC-5.1 + AC-5.3).
- **Production code that must exist**: real `JsonSidecarWarmStartHintStore` using AZ-280's atomic write + verify; real `prime_warm_start_from_fc` consuming a real `FcAdapter` interface; real `prime_warm_start_from_disk` invoked at process startup; real per-frame save hook in the runtime composition root; real post-reset covariance inflation wrapper.
- **Allowed external stubs**: tests MAY use a fake `FcAdapter` returning scripted FC data (AC-5); a fake `WarmStartHintStore` for testing the runtime hooks in isolation (AC-3 / AC-4 / AC-7 / AC-8); production wiring uses the real AZ-280 store + the real C8 FC adapter selected at composition root.
- **Unacceptable substitutes**: in-memory store that loses state across process restart (would defeat AC-4 / AC-5.3 entirely); naked `open().write` in place of AZ-280's atomic-write pattern (would lose AC-10 atomicity); per-strategy warm-start logic that bypasses the runtime root (would force every new strategy to reinvent the wiring); a 1.0× inflation factor (would defeat AC-8); reading the FC adapter via direct C8 module import (would violate Layer 3 → Layer 4 ban).