[AZ-332] C1 OKVIS2 Strategy: facade + binding skeleton

Python facade (`Okvis2Strategy`) is production-quality and satisfies
AZ-331's `VioStrategy` protocol; full AC-1..10 coverage with
AC-9 + NFR-perf marked `tier2`. The C++ pybind11 binding compiles
and loads but throws `OkvisFatalException("estimator not yet wired")`
on first `add_frame` — the `okvis::ThreadedKFVio` wiring is a tier2
follow-up the Step-15 Product Completeness Gate is expected to track
as a remediation task.

Resolved contradictions:

* Constructor signature aligned with the AZ-331 factory: `(config, *,
  fdr_client, clock=None)`. Calibration / preintegrator / logger
  built internally from config. No churn on AZ-331.
* IMU substrate: OKVIS2 owns its internal estimator IMU integration;
  the AZ-276 `ImuPreintegrator` is a separate substrate consumed by
  E-C5's fusion graph. Single source of truth lives at the sample
  stream, not the integrator instance.
* FDR API: `FdrClient.enqueue(record)` with new `vio.health` kind
  added to AZ-272 `KNOWN_PAYLOAD_KEYS`.

CI matrix forces `-DBUILD_OKVIS2=OFF` until the tier2 wiring task
brings Ceres / SuiteSparse / OKVIS2 vendored submodules into the
Linux build.

Files: 17 added/modified across `c1_vio/`, `fdr_client/records.py`,
`cpp/okvis2/CMakeLists.txt`, CI workflow, AZ-332 task spec
(implementation-notes section), batch 23 report.

Tests: 17 new (15 tier1 + 2 tier2). Full Tier-1 suite: 1109 pass,
2 skipped (env), 2 deselected (tier2). No regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-12 09:56:45 +03:00
parent 9c35776bcb
commit 1ebab29a4f
19 changed files with 2083 additions and 49 deletions
@@ -1,202 +0,0 @@
# C1 OKVIS2 Strategy — Production-Default VIO
**Task**: AZ-332_c1_okvis2_strategy
**Name**: C1 OKVIS2 Strategy
**Description**: Implement `Okvis2Strategy`, the production-default `VioStrategy` for E-C1. The class is a Python facade over the OKVIS2 C++ tightly-coupled keyframe-based VIO core (sliding window of K=10–20 keyframes per D-C5-3) accessed via a pybind11 wrapper around `cpp/okvis2/`. The strategy owns the per-flight OKVIS2 estimator instance, feeds it nav-camera frames + IMU samples (via the AZ-276 `ImuPreintegrator` helper for the GTSAM `CombinedImuFactor` substrate that C5 also reads), and emits `VioOutput` with honest 6×6 covariance per AC-1.4 and per-frame `VioHealth`. Per `_docs/02_document/components/01_c1_vio/description.md` § 5: per-frame cost is dominated by feature extraction + matching, sliding-window optimisation is `O(F·log K)`; per-frame p95 latency must stay ≤ 80 ms on Tier-2 with C2 backbone running concurrently (C1-PT-01). Build-time gated by `BUILD_OKVIS2`.
**Complexity**: 5 points
**Dependencies**: AZ-331_c1_vio_strategy_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-276_imu_preintegrator, AZ-277_se3_utils, AZ-272_fdr_record_schema, AZ-273_fdr_client_ringbuf
**Component**: c1_vio (epic AZ-254 / E-C1)
**Tracker**: AZ-332
**Epic**: AZ-254 (E-C1)
### Document Dependencies
- `_docs/02_document/contracts/c1_vio/vio_strategy_protocol.md` — the Protocol this task implements; produced by AZ-331.
- `_docs/02_document/contracts/shared_helpers/imu_preintegrator.md` — IMU substrate (AZ-276); consumer of the GTSAM `CombinedImuFactor` per-keyframe.
- `_docs/02_document/contracts/shared_helpers/se3_utils.md` — SE(3) ↔ pose-matrix conversion utilities (AZ-277).
- `_docs/02_document/components/01_c1_vio/description.md` — § 5 implementation details + § 6 helpers + § 7 caveats (Okvis2 latency spike behaviour under thermal throttle).
## Problem
Without a production-default `Okvis2Strategy`:
- The default airborne binary cannot operate — only the KLT/RANSAC simple-baseline (mandatory engine-rule path) would be available, and C1-PT-01 / AC-2.2 frame-to-frame MRE bounds were specified against OKVIS2.
- The honest 6×6 covariance contract (AC-1.4 / AC-NEW-4) loses its production producer; KLT/RANSAC's covariance is a documented degraded fallback, not the primary signal C5's iSAM2 graph fuses.
- D-CROSS-LATENCY-1's hybrid covariance auto-degrade decision in C4 has no `VioHealth` source-of-truth at production-quality numbers.
- The architecture's "tightly-coupled VIO with sliding-window optimisation" claim becomes documentation-only.
- Mode-B FT-P-04 / FT-P-05 suite-level scenarios cannot run against the production stack; FT-P-04 expects ≥ 95 % tracked-frame ratio on the Derkachi normal segment.
This task delivers the canonical production VIO. The other two strategies (VINS-Mono research-only, KLT/RANSAC simple-baseline) are separate tasks; the contract task (AZ-331) defines the boundary all three share.
## Outcome
- An `Okvis2Strategy` class at `src/gps_denied_onboard/components/c1_vio/okvis2.py` conforming to the `VioStrategy` Protocol from AZ-331; `current_strategy_label() == "okvis2"`.
- A pybind11 wrapper at `src/gps_denied_onboard/components/c1_vio/_native/okvis2_binding.cpp` exposing the OKVIS2 C++ estimator (`okvis::ThreadedKFVio` or equivalent in the pinned upstream HEAD) to Python. The wrapper is built by CMake under `cpp/okvis2/` (build-time gated by `BUILD_OKVIS2`); the resulting `.so` is imported lazily inside `okvis2.py`.
- Constructor `__init__(self, *, calibration: CameraCalibration, preintegrator: ImuPreintegrator, fdr_client: FdrClient, logger: Logger, config: Okvis2Config)` — all dependencies constructor-injected per ADR-009. `Okvis2Config` (`@dataclass(frozen=True)`) carries the OKVIS2-specific knobs (sliding-window size K ∈ [10, 20], keyframe-decision parallax threshold, RANSAC inlier ratio, max optimisation iterations) loaded from `config.vio.okvis2.*` via AZ-269.
- `process_frame(frame, imu, calibration) -> VioOutput`:
1. Append IMU samples to the injected `ImuPreintegrator` (strict-monotonic guarded; `ImuPreintegrationError` rewraps to `VioFatalError`).
2. Feed the nav-camera frame to OKVIS2 via the pybind11 `add_frame` wrapper.
3. If OKVIS2 emits a new estimator update, extract the relative pose (SE(3) via `helpers.se3_utils`), the 6×6 covariance from OKVIS2's internal Hessian (or marginalised block per upstream API), the latest IMU bias, and the feature-quality summary (tracked / new / lost / mean parallax / per-frame MRE).
4. Build and return `VioOutput` with `frame_id` echoed.
5. Emit per-frame DEBUG log (off by default) with backbone identity + elapsed milliseconds; emit WARN log when degraded covariance is detected (per `health_snapshot` heuristic); emit ERROR log on `VioFatalError`.
- `reset_to_warm_start(hint)`: tears down the current OKVIS2 estimator instance (releases C++ resources), constructs a fresh estimator, seeds the IMU bias from `hint.bias`, seeds the initial body-to-world pose from `hint.body_T_world`, and seeds the velocity from `hint.velocity_b`. The next `config.vio.warm_start_max_frames` frames are allowed to converge before the strategy reports `state == TRACKING` (AC-5.1). Calling `reset_to_warm_start` is idempotent across consecutive calls (the second call re-resets cleanly).
- `health_snapshot()` returns `VioHealth(state, consecutive_lost, bias_norm)` derived from OKVIS2's internal tracker state: `INIT` until enough keyframes are accumulated, `TRACKING` while the optimisation converges, `DEGRADED` when feature count drops below `config.vio.okvis2.degraded_feature_threshold` or covariance Frobenius norm exceeds 2× steady-state, `LOST` after `config.vio.lost_frame_threshold` consecutive frames without a successful update.
- The honest-covariance invariant (Protocol Invariant) is enforced behaviourally: the strategy MUST NOT shrink the reported covariance during a `DEGRADED` window (the OKVIS2 estimator's covariance is read directly; no smoothing or floor is applied that would mask degradation).
- Error envelope is closed: every OKVIS2 / pybind11 / Eigen exception is caught inside `process_frame` / `reset_to_warm_start` and rewrapped into the `VioError` family (`VioInitializingError` while INIT, `VioFatalError` on backend-init failure or sustained LOST).
- All FDR records emitted via the injected `FdrClient` use the `kind="vio.health"` schema from AZ-272; per-frame DEBUG goes to stdout/journald only (per description.md § 9 logging strategy).
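The `Okvis2Config` knobs named in the constructor bullet above could be sketched roughly as follows. This is a minimal illustration, not the shipped schema: every field name except `degraded_feature_threshold` is hypothetical, and all default values are placeholders chosen only so the example runs. The one constraint taken from the spec is that the sliding-window size K lies in [10, 20].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Okvis2Config:
    """Illustrative OKVIS2 knob bundle; names and defaults are assumptions."""

    window_size_k: int = 15                    # sliding-window size, spec range [10, 20]
    keyframe_parallax_threshold: float = 10.0  # hypothetical name and default
    ransac_inlier_ratio: float = 0.5           # hypothetical name and default
    max_optimisation_iterations: int = 10      # hypothetical name and default
    degraded_feature_threshold: int = 30       # key named in the spec, default assumed

    def __post_init__(self) -> None:
        # Reject out-of-range values at construction time so a bad config
        # fails at the composition root instead of inside the estimator.
        if not 10 <= self.window_size_k <= 20:
            raise ValueError(
                f"window_size_k must be in [10, 20], got {self.window_size_k}"
            )
        if not 0.0 < self.ransac_inlier_ratio <= 1.0:
            raise ValueError("ransac_inlier_ratio must be in (0, 1]")
        if self.max_optimisation_iterations < 1:
            raise ValueError("max_optimisation_iterations must be >= 1")
```

Because the dataclass is frozen, a validated instance cannot drift after construction, which fits the constructor-injection rule from ADR-009.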
## Scope
### Included
- `Okvis2Strategy` class implementation + the `Okvis2Config` dataclass + the `_native/okvis2_binding.cpp` pybind11 wrapper.
- CMake target under `cpp/okvis2/` that links the OKVIS2 upstream pin (BSD-3-Clause) and produces the binding `.so`. Build flag `BUILD_OKVIS2`.
- The full `process_frame` / `reset_to_warm_start` / `health_snapshot` / `current_strategy_label` surface conforming to AZ-331's Protocol.
- IMU substrate via the constructor-injected `ImuPreintegrator` (AZ-276); this strategy never imports GTSAM directly.
- Honest-covariance reading from OKVIS2's internal estimator state (no client-side smoothing).
- Lazy import of the `_native` binding inside `okvis2.py` so a Tier-0 build with `BUILD_OKVIS2=OFF` does not force the OKVIS2 native lib to be present.
- Per-frame DEBUG log gated by `config.vio.per_frame_debug_log` (default OFF).
- WARN / ERROR / INFO logging per description.md § 9.
- Health-state transitions emitted as FDR records via the `kind="vio.health"` schema.
- Composition-root wiring (entry to the AZ-331 `build_vio_strategy` factory's `okvis2` branch).
- Standalone microbench script `python -m gps_denied_onboard.components.c1_vio.bench.okvis2 <fixture>` for C1-PT-01 latency measurements (referenced by Step 9 / E-BBT perf tests, not implemented as the test itself here — only the benchable surface).
### Excluded
- VINS-Mono strategy — separate task in this epic.
- KLT/RANSAC simple-baseline strategy — separate task in this epic.
- Warm-start hint persistence (write at takeoff, read at F8 reboot) — separate task in this epic; this strategy only consumes a constructed `WarmStartPose`.
- C5 fusion of `VioOutput` — owned by E-C5 (AZ-260).
- C13 FDR writer-thread / segment rotation — owned by E-C13 (AZ-248); this strategy only emits via the producer-side `FdrClient`.
- IMU preintegration mathematics — owned by AZ-276.
- The C1-IT-01..06 / C1-PT-01 tests themselves — deferred to Step 9 (E-BBT) per greenfield flow Step 6 rule.
- Honest-covariance contract test that sweeps all three strategies — that's a Step 9 / E-BBT cross-strategy test (epic child issue #7), not part of this single-strategy task.
- OKVIS2 upstream-source modifications — upstream HEAD is pinned per Plan-phase; deviations require an explicit ADR.
- Multi-camera OKVIS2 — out of scope (single nav-camera per RESTRICT-UAV-3).
## Acceptance Criteria
**AC-1: `current_strategy_label()` returns `"okvis2"`**
Given an `Okvis2Strategy` constructed via the AZ-331 factory with `config.vio.strategy = "okvis2"`
When `current_strategy_label()` is called
Then the returned string is exactly `"okvis2"`
**AC-2: `process_frame` returns `VioOutput` with `frame_id` echoed**
Given a `NavCameraFrame` with `frame_id = "uuid-abc"` and a populated `ImuWindow`
When `process_frame(frame, imu, calibration)` is called and reaches a successful estimator update
Then the returned `VioOutput.frame_id == "uuid-abc"`; `pose_covariance_6x6` is symmetric and positive-definite; `imu_bias` is non-`None`
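The "symmetric and positive-definite" condition in AC-2 can be checked with a Cholesky factorisation, which succeeds only for SPD matrices. The sketch below is a dependency-free illustration that operates on nested lists; the function name and tolerance are assumptions, and a real test would more likely use NumPy on the actual `pose_covariance_6x6` array.

```python
import math
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def is_symmetric_positive_definite(m: Matrix, tol: float = 1e-9) -> bool:
    """Return True if m is square, symmetric within tol, and positive-definite."""
    n = len(m)
    if n == 0 or any(len(row) != n for row in m):
        return False
    # Symmetry check on the strictly lower triangle.
    for i in range(n):
        for j in range(i):
            if abs(m[i][j] - m[j][i]) > tol:
                return False
    # Cholesky factorisation: a non-positive pivot means m is not SPD.
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                lower[i][j] = math.sqrt(s)
            else:
                lower[i][j] = s / lower[j][j]
    return True
```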
**AC-3: `process_frame` rewraps every backend exception into `VioError`**
Given a malformed input that triggers an OKVIS2 / pybind11 / Eigen exception inside the backend
When `process_frame` is called
Then the raised exception is one of `VioInitializingError` / `VioDegradedError` / `VioFatalError`; the original exception is chained via `raise ... from`; no raw `RuntimeError` / `ValueError` from the backend leaks to the caller
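One way to close the error envelope described in AC-3 is a context manager wrapped around every call into the binding. The sketch below is an assumption-laden illustration: the exception classes are local stand-ins for the AZ-331 `VioError` family, `vio_error_envelope` is a hypothetical helper name, and the mapping policy (initialising maps to `VioInitializingError`, everything else to `VioFatalError`) is a simplification of the spec's rules.

```python
from contextlib import contextmanager
from typing import Iterator


class VioError(Exception):
    """Stand-in for the AZ-331 base error."""


class VioInitializingError(VioError):
    pass


class VioDegradedError(VioError):
    pass


class VioFatalError(VioError):
    pass


@contextmanager
def vio_error_envelope(initializing: bool) -> Iterator[None]:
    """Rewrap any non-VioError exception raised in the body into the VioError family."""
    try:
        yield
    except VioError:
        # Already inside the envelope; let it propagate unchanged.
        raise
    except Exception as exc:
        wrapper = VioInitializingError if initializing else VioFatalError
        # `from exc` keeps the backend exception reachable via __cause__.
        raise wrapper(f"backend failure: {exc!r}") from exc
```

A caller would wrap the binding call, for example `with vio_error_envelope(initializing=False): binding.add_frame(...)`, where `binding` is whatever object the lazy import returned.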
**AC-4: `reset_to_warm_start` clears state and seeds the hint**
Given a strategy with N processed frames and a non-default IMU bias
When `reset_to_warm_start(hint)` is called with a known `hint.bias` and `hint.body_T_world`
Then the next `process_frame` call's `VioOutput.imu_bias` reflects `hint.bias` (within numerical tolerance) and the resulting `relative_pose_T` is consistent with starting from `hint.body_T_world`; calling `reset_to_warm_start` a second time without intervening frames does not raise
**AC-5: `health_snapshot()` reports `INIT` initially and `TRACKING` after warm-up**
Given a freshly-constructed strategy
When `health_snapshot()` is called before any `process_frame` invocation
Then `state == INIT`; after `config.vio.warm_start_max_frames` (default 5) successful `process_frame` calls on a normal-segment fixture, the next `health_snapshot()` returns `state == TRACKING`
**AC-6: `health_snapshot()` reports `DEGRADED` on feature loss**
Given a strategy in TRACKING state
When `process_frame` is fed a frame with feature count below `config.vio.okvis2.degraded_feature_threshold`
Then the returned `VioOutput.pose_covariance_6x6` Frobenius norm is strictly greater than the prior frame's; the next `health_snapshot()` returns `state == DEGRADED`; the strategy MUST emit a `VioOutput` (not raise) so C5 can down-weight rather than fall back
**AC-7: Sustained loss raises `VioFatalError`**
Given a strategy in DEGRADED state
When `config.vio.lost_frame_threshold` (default 9) consecutive frames fail to update the estimator
Then the next `process_frame` call raises `VioFatalError`; subsequent `health_snapshot()` returns `state == LOST`; the AC-5.2 fallback path (FC IMU-only after 3 s) is the consumer's responsibility
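The state transitions exercised by AC-5 through AC-7 can be pictured as a small state machine. The following is a simplified sketch under stated assumptions: `HealthTracker` and `observe` are hypothetical names, the `degraded_feature_threshold` default is a placeholder, the covariance-norm criterion for `DEGRADED` is omitted, and `LOST` is treated as sticky until a reset. Only the defaults of 5 and 9 come from the acceptance criteria.

```python
from dataclasses import dataclass
from enum import Enum


class VioState(Enum):
    INIT = "INIT"
    TRACKING = "TRACKING"
    DEGRADED = "DEGRADED"
    LOST = "LOST"


@dataclass
class HealthTracker:
    warm_start_max_frames: int = 5        # default quoted in AC-5
    lost_frame_threshold: int = 9         # default quoted in AC-7
    degraded_feature_threshold: int = 30  # placeholder default (assumption)
    state: VioState = VioState.INIT
    successful_updates: int = 0
    consecutive_lost: int = 0

    def observe(self, updated: bool, feature_count: int) -> VioState:
        """Fold one frame's outcome into the health state and return it."""
        if self.state is VioState.LOST:
            return self.state  # sticky until the strategy is reset
        if not updated:
            self.consecutive_lost += 1
            if self.consecutive_lost >= self.lost_frame_threshold:
                self.state = VioState.LOST
            return self.state
        self.consecutive_lost = 0
        self.successful_updates += 1
        if self.successful_updates < self.warm_start_max_frames:
            self.state = VioState.INIT
        elif feature_count < self.degraded_feature_threshold:
            self.state = VioState.DEGRADED
        else:
            self.state = VioState.TRACKING
        return self.state
```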
**AC-8: `BUILD_OKVIS2=OFF` does not import OKVIS2 native libs**
Given the binary is built with `BUILD_OKVIS2=OFF`
When `gps_denied_onboard.components.c1_vio` is imported (NOT the `okvis2` submodule directly)
Then `sys.modules` does NOT contain `gps_denied_onboard.components.c1_vio.okvis2` or any `_native.okvis2_binding` entry; AZ-331's factory raises `StrategyNotAvailableError("okvis2", missing_flag="BUILD_OKVIS2")` if `okvis2` is requested
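The lazy-import behaviour in AC-8 might look roughly like the sketch below. It assumes the binding is importable under a dotted path derived from the `_native/okvis2_binding.cpp` location, which the spec does not state explicitly, and it uses a local stand-in for AZ-331's `StrategyNotAvailableError` with the call shape quoted in AC-8. `load_okvis2_binding` is a hypothetical helper name.

```python
import importlib
from types import ModuleType


class StrategyNotAvailableError(RuntimeError):
    """Stand-in for the AZ-331 error; the real class may carry more context."""

    def __init__(self, strategy: str, missing_flag: str) -> None:
        super().__init__(
            f"strategy {strategy!r} is unavailable; rebuild with {missing_flag}=ON"
        )
        self.strategy = strategy
        self.missing_flag = missing_flag


# Assumed import path for the compiled extension (not confirmed by the spec).
_DEFAULT_BINDING = "gps_denied_onboard.components.c1_vio._native.okvis2_binding"


def load_okvis2_binding(module_name: str = _DEFAULT_BINDING) -> ModuleType:
    """Import the native binding on first use instead of at package import time."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # A failed import leaves no entry in sys.modules, so a build with
        # BUILD_OKVIS2=OFF never pulls the native library in as a side effect.
        raise StrategyNotAvailableError("okvis2", missing_flag="BUILD_OKVIS2") from exc
```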
**AC-9: Honest covariance — no shrinkage during DEGRADED**
Given a controlled-degradation 60 s synthetic input (same source as the deferred C1-IT-01 test fixture)
When `process_frame` runs through the degradation event
Then `||pose_covariance_6x6||_F` is monotonically non-decreasing from the moment `health_snapshot().state` first transitions to `DEGRADED` until either `TRACKING` is restored or `LOST` is reached; this is enforced by reading OKVIS2's internal covariance directly without any client-side floor or smoother
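The AC-9 assertion reduces to a monotonicity check on the Frobenius norm over each contiguous `DEGRADED` window. The sketch below is one possible test-side helper, not the project's actual test code; the function names, the string state labels, and the tolerance are assumptions, and matrices are plain nested lists to keep it dependency-free.

```python
import math
from typing import Optional, Sequence

Matrix = Sequence[Sequence[float]]


def frobenius_norm(m: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in m for v in row))


def covariance_never_shrinks_while_degraded(
    covariances: Sequence[Matrix],
    states: Sequence[str],
    tol: float = 1e-12,
) -> bool:
    """True if the norm is non-decreasing inside every DEGRADED window."""
    previous: Optional[float] = None
    for cov, state in zip(covariances, states):
        if state != "DEGRADED":
            previous = None  # window ended (TRACKING restored or LOST reached)
            continue
        norm = frobenius_norm(cov)
        if previous is not None and norm < previous - tol:
            return False
        previous = norm
    return True
```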
**AC-10: FDR `vio.health` records emitted on every state transition**
Given the strategy is configured with a real `FdrClient` (or test double)
When `health_snapshot().state` transitions (`INIT → TRACKING`, `TRACKING → DEGRADED`, `DEGRADED → LOST`, etc.)
Then exactly one FDR record with `kind="vio.health"` and the new state is emitted via the `FdrClient.emit` API; no records are emitted on steady-state frames
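The "exactly one record per transition, none on steady state" rule in AC-10 can be sketched as an edge detector in front of the FDR client. This is an illustrative sketch only: `HealthTransitionEmitter` is a hypothetical name, the record is a plain dict whose `state` key is an assumption, and the client is injected as a bare callable because this spec names the method `emit` while the commit message above names it `enqueue(record)`.

```python
from typing import Callable, Dict


class HealthTransitionEmitter:
    """Emit one vio.health record whenever the observed state changes."""

    def __init__(self, emit_record: Callable[[Dict[str, str]], None]) -> None:
        self._emit_record = emit_record
        self._last_state = "INIT"  # a fresh strategy starts in INIT (AC-5)

    def on_frame(self, state: str) -> None:
        if state == self._last_state:
            return  # steady state: no record
        self._last_state = state
        # Payload shape is an assumption; only kind="vio.health" comes from the spec.
        self._emit_record({"kind": "vio.health", "state": state})
```

In a unit test, passing `records.append` of a plain list as `emit_record` acts as the spy.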
## Non-Functional Requirements
**Performance**
- `process_frame` p95 ≤ 80 ms on Tier-2 with C2 backbone running concurrently (C1-PT-01 / NFT-PERF-01 component partition); failure threshold 120 ms.
- `process_frame` p50 ≤ 25 ms on Tier-2 (description.md C1-PT-01).
- Throughput ≥ 3 Hz sustained; failure threshold < 2.5 Hz.
- CPU ≤ 30 % of one core; memory ≤ 1.5 GB resident (description.md § 6 + epic NFR).
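The latency and throughput targets above can be evaluated from a list of per-frame timings. The sketch below uses the nearest-rank percentile definition, which is an assumption (the spec does not say which percentile estimator C1-PT-01 uses), and `check_latency_budget` is a hypothetical helper, not part of the microbench script.

```python
from typing import Dict, Sequence, Union


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in [1, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be in [1, 100]")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def check_latency_budget(
    latencies_ms: Sequence[float], duration_s: float
) -> Dict[str, Union[float, bool]]:
    """Compare measured timings against the p50 / p95 / throughput targets."""
    p50 = percentile_nearest_rank(latencies_ms, 50)
    p95 = percentile_nearest_rank(latencies_ms, 95)
    throughput_hz = len(latencies_ms) / duration_s
    return {
        "p50_ms": p50,
        "p95_ms": p95,
        "throughput_hz": throughput_hz,
        # Targets from the Performance NFR: p50 <= 25 ms, p95 <= 80 ms, >= 3 Hz.
        "pass": p50 <= 25.0 and p95 <= 80.0 and throughput_hz >= 3.0,
    }
```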
**Compatibility**
- OKVIS2 upstream HEAD pinned per Plan-phase. No upstream-source modifications.
- pybind11 version matches the OKVIS2 / VINS-Mono / GTSAM build (description.md § 5 dependency table).
- Eigen version matches OKVIS2 / GTSAM pin.
**Reliability**
- The error envelope is closed at the `VioError` family. No raw OKVIS2 / pybind11 / Eigen exceptions cross the Python boundary.
- `process_frame` is idempotent w.r.t. state when it raises: a raised exception leaves the estimator in a recoverable state; the next valid frame integrates as if the bad one never came.
- The strategy is single-threaded by Protocol contract; the composition root binds one instance to the camera ingest thread.
**Concurrency**
- One `Okvis2Strategy` instance per camera ingest thread; concurrent calls to `process_frame` on the same instance are undefined behaviour (matches Protocol invariant).
- The injected `ImuPreintegrator` is also single-threaded; the same composition-root binding rule applies.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | `current_strategy_label()` after factory build with `okvis2` config | Returns `"okvis2"` |
| AC-2 | `process_frame` with a fixture frame + IMU window | `VioOutput.frame_id` echoed; covariance SPD; `imu_bias` non-None |
| AC-3 | Inject a malformed frame that triggers a backend exception (mocked binding) | `VioError`-family exception raised; original chained via `__cause__` |
| AC-4 | `reset_to_warm_start` then `process_frame` × N | Bias reflects hint; second `reset_to_warm_start` does not raise |
| AC-5 | Cold construct → `health_snapshot` × N | `INIT` initially; `TRACKING` after `warm_start_max_frames` |
| AC-6 | Feed degraded fixture | Covariance Frobenius norm strictly increases; `health_snapshot` returns `DEGRADED`; `VioOutput` IS emitted (not raised) |
| AC-7 | Feed `lost_frame_threshold` consecutive failed frames | `VioFatalError` on the next `process_frame`; `health_snapshot` returns `LOST` |
| AC-8 | `BUILD_OKVIS2=OFF` import + factory call | Module not in `sys.modules`; factory raises `StrategyNotAvailableError` |
| AC-9 | 60 s controlled-degradation synthetic | Covariance Frobenius norm monotonically non-decreasing during DEGRADED window |
| AC-10 | Real / fake `FdrClient` spy through state transitions | Exactly one `vio.health` record per transition; no spam on steady-state |
| NFR-perf | C1-PT-01 microbench against the Derkachi normal segment fixture (Tier-2) | p95 ≤ 80 ms; p50 ≤ 25 ms; throughput ≥ 3 Hz |
| NFR-reliability-error-envelope | Raise each backend exception type via mock binding; assert no leakage | All caught and rewrapped to `VioError` family |
## Constraints
- This task implements (does NOT define) the AZ-331 Protocol; any signature mismatch is a Spec-Gap finding (High) per code-review skill Phase 2.
- The pybind11 binding lives under `_native/` per `module-layout.md`; the `.so` import path is CMake-known and lazy-imported inside `okvis2.py`.
- OKVIS2 native source lives under `cpp/okvis2/` (parallel to `src/`, NOT nested inside the Python package), per `module-layout.md` rule #4.
- The strategy MUST consume IMU via the AZ-276 `ImuPreintegrator` helper; constructing a second IMU integration path is forbidden (defeats the "single source of IMU truth" invariant).
- This task introduces no new third-party dependencies beyond OKVIS2 + pybind11 + Eigen (already pinned).
- Per-frame DEBUG logging defaults OFF (would flood at 3 Hz); enabled only via `config.vio.per_frame_debug_log`.
- The strategy MUST NOT apply a covariance floor or smoother on the read path — honest covariance is the safety floor for AC-NEW-4; smoothing is a Risks-and-Mitigation discussion only.
- The `Okvis2Config` schema extension to AZ-269 is owned by this task; the field set is documented above.
## Risks & Mitigation
**Risk 1: OKVIS2 latency spike on thermally-throttled Jetson breaks AC-4.1**
- *Risk*: description.md § 7 notes OKVIS2's sliding-window optimisation can spike to 80–120 ms on a thermally-throttled Jetson; the C1-PT-01 p95 ≤ 80 ms budget is the wire boundary.
- *Mitigation*: D-CROSS-LATENCY-1 hybrid auto-degrades **C4** covariance recovery (not C1) under thermal throttle, freeing budget. This task does NOT implement thermal-aware behaviour — it just measures and reports latency; the C4 task owns the degradation decision. AC-9 covers the honest-covariance side; AC-NFR-perf measures the latency.
**Risk 2: pybind11 type marshalling overhead dominates the per-frame budget**
- *Risk*: Marshalling a 5472×3648×3 uint8 frame across the Python ↔ C++ boundary on every `process_frame` could add tens of milliseconds.
- *Mitigation*: The pybind11 binding accepts the frame as a `numpy.ndarray` with `py::array::c_style | py::array::forcecast`, so a C-contiguous input of the expected dtype shares its data buffer (zero-copy), whereas any other layout or dtype is silently converted into a fresh copy by `forcecast`. The composition root binds the camera ingest path to emit `c_style` buffers (handled in `frame_source/LiveCameraFrameSource`, AZ-265 cycle-1 deliverable). If the zero-copy path is broken, the AC-NFR-perf microbench shows it immediately.
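To put a number on the marshalling risk, the arithmetic below works out how many bytes a single frame occupies and how much data an accidental per-frame copy would move at the 3 Hz throughput target. It also shows a cheap Python-side contiguity guard using the built-in `memoryview`. The helper names are hypothetical, and a real ingest path would more likely inspect `ndarray.flags` on the NumPy frame.

```python
WIDTH, HEIGHT, CHANNELS = 5472, 3648, 3  # frame shape quoted in Risk 2
BYTES_PER_SAMPLE = 1                     # uint8
FRAME_RATE_HZ = 3                        # throughput target from the NFRs


def frame_size_bytes() -> int:
    """Bytes occupied by one uncompressed frame."""
    return WIDTH * HEIGHT * CHANNELS * BYTES_PER_SAMPLE  # 59_885_568


def copy_traffic_mib_per_s() -> float:
    """Extra memory traffic if every frame were copied once at the target rate."""
    return frame_size_bytes() * FRAME_RATE_HZ / (1024 * 1024)


def is_c_contiguous(buffer) -> bool:
    """True if the buffer-protocol object is laid out C-contiguously."""
    return memoryview(buffer).c_contiguous
```

At roughly 57 MiB per frame, one unintended copy per frame is enough to matter against an 80 ms p95 budget, which is why the contiguity of the ingest buffer is worth asserting before the frame reaches the binding.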
**Risk 3: OKVIS2 internal covariance is reported in a frame-convention C5 does not expect**
- *Risk*: OKVIS2 reports covariance in its own body-frame; C5 expects body-to-world. A frame-convention bug would silently produce wrong covariance to iSAM2.
- *Mitigation*: The strategy uses `helpers.se3_utils` (AZ-277) to convert OKVIS2's frame to the canonical body-to-world convention; the conversion is unit-tested at the helper level and asserted by AC-2 (covariance SPD) + the deferred C1-IT-02 (cross-strategy invariants test).
**Risk 4: OKVIS2 BSD-3-Clause license attribution missed**
- *Risk*: Failing to include OKVIS2's license notice in the airborne binary's NOTICE file violates BSD-3-Clause.
- *Mitigation*: The CMake target under `cpp/okvis2/` includes the upstream LICENSE file in the build artifact's NOTICE bundle; CI's SBOM step (existing infra) verifies presence. Tracked in the project NOTICE generation pipeline (out of scope here).
## Runtime Completeness
- **Named capability**: OKVIS2 tightly-coupled keyframe-based VIO + sliding-window optimisation + honest 6×6 covariance via OKVIS2's internal Hessian (architecture / E-C1 / `solution.md` "Strategy: Okvis2 production-default" / D-C5-3).
- **Production code that must exist**: real `Okvis2Strategy` class implementing the AZ-331 Protocol; real pybind11 binding to `cpp/okvis2/` (real OKVIS2 upstream, not a mock); real per-frame OKVIS2 estimator update; real covariance read from OKVIS2's internal Hessian; real bias propagation through the AZ-276 `ImuPreintegrator`.
- **Allowed external stubs**: tests MAY use a fake pybind11 binding that returns scripted `VioOutput` payloads (AC-3 / AC-6 / AC-7 use this for backend-exception injection); production wiring uses the real OKVIS2 upstream pinned by Plan-phase.
- **Unacceptable substitutes**: a Python-level "OKVIS2" wrapper that re-implements the optimisation loop in pure Python (would defeat C1-PT-01 ≤ 80 ms p95); a covariance floor or smoother on the read path (would break AC-9 honest-covariance contract); skipping the AZ-276 `ImuPreintegrator` and integrating IMU samples internally (would break the single-IMU-truth invariant); using a pre-built deterministic-fallback `VioOutput` while OKVIS2 is "compiled out" (would silently break C5 fusion at deploy time without the BUILD-flag gate firing first).