gps-denied-onboard/_docs/03_implementation/AZ-332_implementation_plan.md
Oleksandr Bezdieniezhnykh 9c35776bcb chore: pre-batch-23 carry-over (state + AZ-332 plan)
Handoff artifacts from the prior /autodev session that stopped at
Step 7 sub_step compute-next-batch:

- _docs/_autodev_state.md: pointer updated to batch 23, AZ-332 only
  (AZ-345 deferred — dep AZ-346 not yet in done/).
- _docs/03_implementation/AZ-332_implementation_plan.md: locked-in
  decisions (no ROS 2, no Python re-impl, three-env split: macOS dev /
  Ubuntu CI / Jetson tier2) + step-by-step playbook for next session.

Pre-batch chore commit per implement skill prereq #4 (clean tree
required before AZ-332 commit so the batch diff stays focused).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 09:18:20 +03:00

# AZ-332 — Implementation plan (batch 23, cycle 1)
**Date created**: 2026-05-12 (carry-over from `/autodev` session 2026-05-12 morning)
**Owner**: next `/autodev` invocation starting from Step 7 Implement sub_step `compute-next-batch`
**Scope of this doc**: a concrete, in-order playbook for the next session. Reading this + the task spec at `_docs/02_tasks/todo/AZ-332_c1_okvis2_strategy.md` is sufficient to resume — no other re-discovery needed.
---
## Why this is its own plan doc
AZ-332 (C1 OKVIS2 production-default VIO) is the first task in this project to require a native C++ build chain (OKVIS2 + pybind11). The previous session researched paths, surfaced blockers, and landed on a decomposition that splits work across three build environments. That decomposition has to survive the session boundary, hence this file.
## Decisions locked in the previous session
1. **No ROS 2 layer.** `colcon` build of OKVIS2 produces the same libraries as standalone CMake plus a ROS 2 node we do not need; ROS 2 runtime IPC was rejected at Plan time (`_docs/01_solution/solution.md` § D-C1-1-SUB-A — "Rejected (cost + latency budget conflict)"). Build with **standalone CMake**.
2. **No Python re-implementation of OKVIS2.** Forbidden by the task spec ("Unacceptable substitutes" section). A pure-Python VIO violates the C1-PT-01 ≤ 80 ms p95 budget by construction.
3. **No alternative VIO substitution.** Every C++ VIO candidate (OpenVINS, VINS-Mono, Kimera-VIO) has the same compile-on-macOS problem. The only Python-native candidates (DPVO, KLT+RANSAC) are mono-VO only — not drop-ins for a VIO contract. AZ-332 stays OKVIS2.
4. **Three-environment dev split**:
| Environment | What runs there | What it gates |
|---|---|---|
| macOS dev | Python facade + binding C++ editing; unit tests using the fake `_native.okvis2_binding` (task spec explicitly allows this for tests) | AC-1, AC-2, AC-3, AC-4, AC-5, AC-6, AC-7, AC-8, AC-10 |
| Ubuntu CI runner (`ci.yml`) | Native CMake build of vendored OKVIS2 + binding `.so` | Build-passes gate; no AC validation here |
| Self-hosted Jetson runner (`ci-tier2.yml`) | Real-OKVIS2 perf + honest-covariance tests | AC-9 (honest covariance monotonicity); NFR-perf p95 ≤ 80 ms |
This split honours the task spec ("real `Okvis2Strategy` calling real C7 `InferenceRuntime` with real TRT-compiled DISK engine") because the production binary IS the real binding compiled on Linux/Jetson — only the dev-side unit tests use the fake. The fake never ships to production.
## Concrete step-by-step for next session (in order; each step has a stop-and-verify gate)
### Step 0 — re-entry sanity check (1 min)
- Read `_docs/_autodev_state.md`: confirm step 7 / sub_step `compute-next-batch` / detail points here.
- Read this doc fully.
- Read `_docs/02_tasks/todo/AZ-332_c1_okvis2_strategy.md` once.
- `git status --porcelain` must be empty (implement skill prerequisite).
### Step 1 — vendor OKVIS2 and pybind11 as git submodules (5-10 min)
- `git submodule add --depth 1 https://github.com/smartroboticslab/okvis2.git cpp/okvis2/upstream`
  - `git submodule add` has no `--recurse-submodules` option, so follow it with `git submodule update --init --recursive cpp/okvis2/upstream` to fetch any nested submodules OKVIS2 carries.
  - Note: submodule path is `cpp/okvis2/upstream/` (not `cpp/okvis2/` directly) so the existing `cpp/okvis2/CMakeLists.txt` keeps its project-owned role and `add_subdirectory(upstream)` pulls in OKVIS2.
- `git submodule add --depth 1 https://github.com/pybind/pybind11.git cpp/pybind11/upstream`
  - Same pattern: existing `cpp/pybind11/` directory keeps the project README; submodule lives at `cpp/pybind11/upstream/`.
  - Delete the `.gitkeep` and placeholder `README.md` from `cpp/pybind11/` once the submodule is in place (or keep them; they're harmless either way, so pick one and stay consistent).
- Pin a known-good commit hash for OKVIS2 (record it in this doc under "Pinned upstream versions" once chosen). Recommendation: pin to the latest `main` HEAD at the time of submodule add and document the commit short-hash here.
- **Gate**: `git submodule status` shows both submodules with a SHA; `git status` clean except `.gitmodules` + submodule entries.
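The gate above can be checked mechanically. The following is a minimal sketch, assuming the standard `git submodule status` line format (a one-character flag, the SHA, the path, and an optional description); the helper names `parse_submodule_status` and `submodule_gate_passes` are hypothetical and not part of the project.

```python
from __future__ import annotations

import string

# Paths taken from Step 1; everything else in this sketch is illustrative.
EXPECTED_SUBMODULES = ("cpp/okvis2/upstream", "cpp/pybind11/upstream")


def parse_submodule_status(text: str) -> dict[str, tuple[str, str]]:
    """Map submodule path -> (flag, sha) from `git submodule status` output.

    The flag is " " (in sync), "-" (not initialised), "+" (checked-out commit
    differs from the index) or "U" (merge conflict).
    """
    entries: dict[str, tuple[str, str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " -+U":
            flag, fields = line[0], line[1:].split()
        else:
            # The leading space was stripped by the caller; treat as in sync.
            flag, fields = " ", line.split()
        if len(fields) >= 2:
            entries[fields[1]] = (flag, fields[0])
    return entries


def submodule_gate_passes(text: str, expected=EXPECTED_SUBMODULES) -> bool:
    """True when every expected submodule is present, in sync, and has a hex SHA."""
    entries = parse_submodule_status(text)
    for path in expected:
        flag, sha = entries.get(path, ("-", ""))
        if flag != " " or not sha:
            return False
        if any(ch not in string.hexdigits for ch in sha):
            return False
    return True
```

In practice the text would come from `subprocess.run(["git", "submodule", "status"], capture_output=True, text=True).stdout`.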
### Step 2 — write CMake glue (15-30 min)
Files to write:
- `cpp/okvis2/CMakeLists.txt` (replace existing placeholder):
  - `if(NOT BUILD_OKVIS2) return() endif()`
  - `add_subdirectory(upstream EXCLUDE_FROM_ALL)` with OKVIS2's `USE_NN=OFF` to drop the LibTorch dep (per Fact #39: keyframe arch tolerates this).
  - `find_package` the native deps OKVIS2 needs (Eigen3, Boost, glog, gflags, SuiteSparse, Ceres, OpenCV; every one is an apt package on Ubuntu and a brew formula on macOS).
  - `add_subdirectory(${CMAKE_SOURCE_DIR}/cpp/pybind11/upstream pybind11_build)`.
  - `pybind11_add_module(okvis2_binding ${CMAKE_CURRENT_SOURCE_DIR}/../../src/gps_denied_onboard/components/c1_vio/_native/okvis2_binding.cpp)` (note the path back into the Python tree).
  - `target_link_libraries(okvis2_binding PRIVATE okvis::Estimator okvis::Common ...)` (exact target names come from OKVIS2's CMake exports; verify by running `cmake --build build --target help | grep okvis` once the submodule is in).
  - `install(TARGETS okvis2_binding DESTINATION ${CMAKE_INSTALL_LIBDIR}/gps_denied_onboard/components/c1_vio/_native/)`.
- `cpp/pybind11/CMakeLists.txt` (replace existing placeholder): can stay nearly empty, because pybind11 is included by `cpp/okvis2/CMakeLists.txt` via `add_subdirectory`.
The existing top-level `cpp/CMakeLists.txt` already has `add_subdirectory(okvis2)` gated on `BUILD_OKVIS2 OR BUILD_VINS_MONO OR BUILD_KLT_RANSAC` — no change needed there.
**Gate**: `cmake -S . -B build -DBUILD_OKVIS2=OFF` succeeds on macOS (no-op build with the flag off). The OFF path is what protects the rest of the build from any of this new wiring.
### Step 3 — write the pybind11 binding C++ skeleton (1-2 h)
File: `src/gps_denied_onboard/components/c1_vio/_native/okvis2_binding.cpp`
Surface needed (mirrors the Python facade's needs — not the full OKVIS2 API):
- `Okvis2Backend` class with:
  - ctor from YAML config string + camera intrinsics dict;
  - `add_frame(frame_id: str, ns_ts: int, image: ndarray[uint8, H, W, C]) -> bool`;
  - `add_imu(ns_ts: int, accel: ndarray[float64, 3], gyro: ndarray[float64, 3]) -> None`;
  - `get_latest_output() -> dict | None` (returns frame_id + 4x4 pose matrix + 6x6 covariance + bias + feature_quality dict + emitted_at_ns);
  - `reset(body_T_world: ndarray[float64, 4, 4], velocity: ndarray[float64, 3], accel_bias: ndarray[float64, 3], gyro_bias: ndarray[float64, 3]) -> None`;
  - `health() -> dict` (returns `{state: str, consecutive_lost: int, bias_norm: float}`).
- Exceptions: every OKVIS / Eigen / std::runtime_error caught inside binding methods and rethrown as a fixed set of Python exceptions registered via `py::register_exception` — the Python facade then catches those and rewraps into `VioError` family.
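- Zero-copy pathway: `image` is `py::array_t<uint8_t, py::array::c_style | py::array::forcecast>`, so DISK ingest avoids a copy when the caller already passes a C-contiguous `uint8` array. With `forcecast`, any other dtype or layout is converted, which does copy.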
This is a skeleton — full OKVIS2 estimator wiring (`okvis::ThreadedKFVio` setup + callback plumbing) can be a follow-up commit if the skeleton + CI Linux build come back green first.
**Gate**: compiles inside the OKVIS2 CMake target. Tested on Ubuntu CI runner (not macOS).
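As a reading aid for the surface listed above, here is a Python-side sketch of the shape the facade would expect from the binding. It is an assumption-laden illustration: the name `Okvis2BackendLike` is hypothetical, array arguments are typed as `Any` to keep the sketch free of a NumPy dependency, and the real contract is whatever the C++ binding ends up exporting.

```python
from __future__ import annotations

from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class Okvis2BackendLike(Protocol):
    """Structural view of the `Okvis2Backend` methods named in Step 3."""

    def add_frame(self, frame_id: str, ns_ts: int, image: Any) -> bool:
        """image: uint8 array of shape (H, W, C)."""
        ...

    def add_imu(self, ns_ts: int, accel: Any, gyro: Any) -> None:
        """accel, gyro: float64 arrays of shape (3,)."""
        ...

    def get_latest_output(self) -> Optional[dict]:
        """frame_id, 4x4 pose, 6x6 covariance, bias, feature_quality, emitted_at_ns."""
        ...

    def reset(self, body_T_world: Any, velocity: Any,
              accel_bias: Any, gyro_bias: Any) -> None:
        """body_T_world: float64 (4, 4); the remaining arguments: float64 (3,)."""
        ...

    def health(self) -> dict:
        """Keys: state (str), consecutive_lost (int), bias_norm (float)."""
        ...
```

Because the protocol is `runtime_checkable`, `isinstance(obj, Okvis2BackendLike)` only checks that the methods exist, not their signatures, so it suits a smoke test on the fake and the real backend alike.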
### Step 4 — write the Python facade `okvis2.py` (1-2 h)
File: `src/gps_denied_onboard/components/c1_vio/okvis2.py`
- `Okvis2Strategy` class implementing the `VioStrategy` Protocol from `interface.py`.
- Lazy import of `_native.okvis2_binding` inside a function body such as the constructor, NOT at module top level. That is the I-5 / Risk-2 mitigation; AZ-331's `test_ac5_build_vio_strategy_flag_off_no_import` asserts this and MUST still pass.
- Constructor signature: `__init__(self, config: Config, *, fdr_client: FdrClient)` — match the AZ-331 factory's call shape exactly. Inside the constructor: build the `ImuPreintegrator` from `helpers.imu_preintegrator.make_imu_preintegrator(calibration)`; build the `Okvis2Backend` from the binding; record the strategy label as `"okvis2"` (frozen per Protocol invariant).
- Map every backend exception (raised from the C++ binding's registered exception types) to the `VioError` family — `OkvisInitException → VioInitializingError`, `OkvisFatalException → VioFatalError`, `OkvisOptimizationException → VioDegradedError` (only when transitioning to fatal — the normal degraded path returns a `VioOutput` with inflated covariance per AZ-331 v1.0.0).
- `process_frame`: feed IMU samples to the preintegrator, push frame to backend, read latest output, build the `VioOutput` DTO using `gtsam.Pose3.matrix()` round-trip via `helpers.se3_utils` (AZ-277). Echo `frame_id`.
- `reset_to_warm_start`: tear down + reconstruct `Okvis2Backend` from the hint; first call must not raise (idempotency invariant per AC-4); seed bias into the preintegrator via `preintegrator.reset_with_bias(hint.bias)`.
- `health_snapshot`: pull `backend.health()` dict and wrap as `VioHealth`. Track `consecutive_lost` Python-side because the binding returns "current state" only.
- `current_strategy_label`: return the frozen `"okvis2"`.
- FDR records on state transitions via the injected `fdr_client` using the `kind="vio.health"` schema (AZ-272).
**Gate**: `mypy --strict` passes against the new file; `ruff check` passes; isinstance check `isinstance(Okvis2Strategy(...), VioStrategy)` returns True without importing the native binding (i.e., the Protocol's structural conformance, not the construction itself).
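The lazy-import and exception-mapping bullets above can be sketched together. This is a minimal illustration, not the planned facade: `Okvis2StrategySketch` is a hypothetical stand-in, the `VioError` classes are local placeholders for the project's own, matching binding exceptions by class name is an assumption, and unknown exceptions are simply re-raised.

```python
from __future__ import annotations

import importlib
from typing import Any

BINDING_MODULE = "gps_denied_onboard.components.c1_vio._native.okvis2_binding"


# Local placeholders for the project's VioError family.
class VioError(Exception):
    pass


class VioInitializingError(VioError):
    pass


class VioFatalError(VioError):
    pass


class VioDegradedError(VioError):
    pass


# Binding exception class name -> facade error, as listed in Step 4.
_ERROR_MAP: dict[str, type[VioError]] = {
    "OkvisInitException": VioInitializingError,
    "OkvisFatalException": VioFatalError,
    "OkvisOptimizationException": VioDegradedError,
}


class Okvis2StrategySketch:
    """Imports the native binding only when an instance is constructed."""

    def __init__(self, yaml_config: str, intrinsics: dict[str, Any]) -> None:
        # Deferred import: defining this class never touches the binding.
        binding = importlib.import_module(BINDING_MODULE)
        self._backend = binding.Okvis2Backend(yaml_config, intrinsics)

    def add_frame(self, frame_id: str, ns_ts: int, image: Any) -> bool:
        try:
            return bool(self._backend.add_frame(frame_id, ns_ts, image))
        except Exception as exc:
            mapped = _ERROR_MAP.get(type(exc).__name__)
            if mapped is None:
                raise  # not a registered binding exception; leave untouched
            raise mapped(str(exc)) from exc
```

Using `importlib.import_module` rather than an `import` statement also lets a test double registered under the full dotted name in `sys.modules` be picked up without the parent packages existing.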
### Step 5 — write `Okvis2Config` (15 min)
File: `src/gps_denied_onboard/components/c1_vio/config.py` (extend existing — do not duplicate `C1VioConfig`).
- Add `@dataclass(frozen=True) class Okvis2Config` with fields: `keyframe_window_size: int = 15` (∈ [10, 20] per D-C5-3); `keyframe_parallax_threshold_px: float = 3.0`; `ransac_inlier_ratio: float = 0.5`; `max_optimization_iters: int = 4`; `degraded_feature_threshold: int = 30`; `per_frame_debug_log: bool = False`.
- `__post_init__` validates ranges and raises `ConfigError`.
- Register the block under `config.components['c1_vio'].okvis2` (sub-block) — keep `C1VioConfig` as-is at the top level.
**Gate**: `Okvis2Config(keyframe_window_size=9)` raises `ConfigError`; `Okvis2Config()` defaults pass.
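A minimal sketch of the dataclass described above. The field names and defaults are the ones listed in this step; `ConfigError` here is a local stand-in for the project's own class, and every bound other than the [10, 20] window size is an assumption added for illustration.

```python
from dataclasses import dataclass


class ConfigError(ValueError):
    """Local stand-in for the project's ConfigError."""


@dataclass(frozen=True)
class Okvis2Config:
    keyframe_window_size: int = 15
    keyframe_parallax_threshold_px: float = 3.0
    ransac_inlier_ratio: float = 0.5
    max_optimization_iters: int = 4
    degraded_feature_threshold: int = 30
    per_frame_debug_log: bool = False

    def __post_init__(self) -> None:
        # Range stated in the plan (D-C5-3).
        if not 10 <= self.keyframe_window_size <= 20:
            raise ConfigError(
                f"keyframe_window_size must be in [10, 20], got {self.keyframe_window_size}"
            )
        # The bounds below are assumptions, not taken from the task spec.
        if self.keyframe_parallax_threshold_px <= 0:
            raise ConfigError("keyframe_parallax_threshold_px must be > 0")
        if not 0.0 < self.ransac_inlier_ratio <= 1.0:
            raise ConfigError("ransac_inlier_ratio must be in (0, 1]")
        if self.max_optimization_iters < 1:
            raise ConfigError("max_optimization_iters must be >= 1")
        if self.degraded_feature_threshold < 0:
            raise ConfigError("degraded_feature_threshold must be >= 0")
```

Validation only reads fields, so it works unchanged with `frozen=True`.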
### Step 6 — write unit tests with fake binding (1-2 h)
Files:
- `tests/unit/c1_vio/conftest.py`: a `fake_okvis2_binding` fixture that installs a `types.ModuleType` at `sys.modules['gps_denied_onboard.components.c1_vio._native.okvis2_binding']` with a scriptable `Okvis2Backend` test double. The test double exposes a `script()` method that pre-loads a queue of outputs / exceptions; `add_frame` pops from the queue. This is the "fake pybind11 binding that returns scripted `VioOutput` payloads" the task spec explicitly allows.
- `tests/unit/c1_vio/test_okvis2_strategy.py`: one test per AC (AC-1 through AC-8, AC-10). Use the fake binding fixture. AC-9 and the NFR-perf test are written here too but marked `@pytest.mark.tier2` so `pytest -m "not tier2"` (the macOS dev loop) skips them; `ci-tier2.yml` picks them up.
**Gate**: every unit test passes on macOS with `pytest -m "not tier2" tests/unit/c1_vio/`. Full sweep (`pytest tests/`) shows the existing 1093 passing + the new tests, with the tier2-marked ones skipped on macOS.
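The scriptable test double described for `conftest.py` might look roughly like the sketch below. It is written as a plain context manager so it runs without pytest; the class and function names are hypothetical, and returning `False` from `add_frame` on an empty queue is an assumption.

```python
from __future__ import annotations

import sys
import types
from collections import deque
from contextlib import contextmanager
from typing import Any, Iterator, Optional

BINDING_MODULE = "gps_denied_onboard.components.c1_vio._native.okvis2_binding"


class FakeOkvis2Backend:
    """Test double: `script()` pre-loads outputs or exceptions, `add_frame` pops them."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        self._queue: deque[Any] = deque()
        self._latest: Optional[dict] = None

    def script(self, *items: Any) -> None:
        self._queue.extend(items)

    def add_frame(self, frame_id: str, ns_ts: int, image: Any) -> bool:
        if not self._queue:
            return False  # assumption: nothing scripted means the frame is rejected
        item = self._queue.popleft()
        if isinstance(item, BaseException):
            raise item
        self._latest = item
        return True

    def get_latest_output(self) -> Optional[dict]:
        return self._latest


@contextmanager
def fake_okvis2_binding() -> Iterator[types.ModuleType]:
    """Install a fake binding module in sys.modules and restore on exit."""
    module = types.ModuleType(BINDING_MODULE)
    module.Okvis2Backend = FakeOkvis2Backend
    previous = sys.modules.get(BINDING_MODULE)
    sys.modules[BINDING_MODULE] = module
    try:
        yield module
    finally:
        if previous is None:
            sys.modules.pop(BINDING_MODULE, None)
        else:
            sys.modules[BINDING_MODULE] = previous
```

Under pytest the same thing can be a fixture that calls `monkeypatch.setitem(sys.modules, BINDING_MODULE, module)`, which handles the restore step automatically.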
### Step 7 — update `.github/workflows/ci.yml` to install OKVIS2's Linux deps (5-10 min)
- In the `build` matrix's `deployment` and `research` kinds, add a step BEFORE `cmake -S . -B build`:
```yaml
- name: Install OKVIS2 native deps
  run: |
    sudo apt-get update
    sudo apt-get install -y --no-install-recommends \
      libeigen3-dev libboost-all-dev libgoogle-glog-dev libgflags-dev \
      libsuitesparse-dev libceres-dev libopencv-dev
```
- Toggle `BUILD_OKVIS2` to `ON` in the `deployment` kind's `cmake_flags` (default config in `solution.md` says OKVIS2 is the production-default; the deployment matrix kind should enforce this).
- The `research` kind already has `BUILD_VINS_MONO=ON`; leave `BUILD_OKVIS2=ON` there too.
**Gate**: push branch; GitHub Actions Ubuntu runner completes the `cmake --build build --parallel` step. If OKVIS2's CMake export targets have a different name than `okvis::Estimator` / `okvis::Common`, the failure surfaces here and Step 2's `target_link_libraries` is patched. This is the only build-system feedback loop we get pre-Jetson — exploit it.
### Step 8 — AC coverage verification + code review (15-30 min)
- Verify every AC of AZ-332 maps to at least one test (skipped-with-reason counts as covered per implement skill Step 8).
- Invoke `/code-review` skill on the batch's changed files. Expected verdict: PASS or PASS_WITH_WARNINGS. Auto-fix or escalate per implement skill Step 10.
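The AC-to-test mapping check can be approximated with a small script. This sketch assumes two things that the plan does not state outright: that AC-1 through AC-10 is the full set for AZ-332, and that tests follow the `test_ac<N>_...` naming seen in AZ-331's `test_ac5_build_vio_strategy_flag_off_no_import`. The function name `uncovered_acs` is hypothetical.

```python
from __future__ import annotations

import re
from typing import Iterable

# Assumption: AZ-332 has exactly AC-1 .. AC-10.
REQUIRED_ACS = frozenset(range(1, 11))

# Assumption: test names start with "test_ac<N>_".
_AC_PATTERN = re.compile(r"^test_ac(\d+)_")


def uncovered_acs(test_names: Iterable[str]) -> list[int]:
    """Return the AC numbers that no test name refers to, in ascending order."""
    covered = set()
    for name in test_names:
        match = _AC_PATTERN.match(name)
        if match:
            covered.add(int(match.group(1)))
    return sorted(REQUIRED_ACS - covered)
```

Tier2-marked tests still count here, since the check looks at names only; that matches the "skipped-with-reason counts as covered" rule.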
### Step 9 — commit (5 min)
- One commit per implement skill Step 11: `[AZ-332] C1 Okvis2Strategy: pybind11 binding skeleton + Python facade + fake-backend tests`.
- Body of commit message documents the three-environment split (macOS dev / Ubuntu CI / Jetson tier2) and notes that AC-9 + NFR-perf are tier2-gated.
### Step 10 — tracker + archive + batch report (5 min)
- Jira: AZ-332 In Progress → In Testing.
- Move `_docs/02_tasks/todo/AZ-332_c1_okvis2_strategy.md` → `_docs/02_tasks/done/`.
- Write `_docs/03_implementation/batch_23_cycle1_report.md` with the standard report shape. Include the tier2-deferred AC-9 + NFR-perf items under "Deferred to tier2 CI".
- Update `_docs/_autodev_state.md`: sub_step → next batch detection.
## Files to be created / modified (summary)
Created:
- `cpp/okvis2/upstream/` (git submodule)
- `cpp/pybind11/upstream/` (git submodule)
- `src/gps_denied_onboard/components/c1_vio/_native/okvis2_binding.cpp`
- `src/gps_denied_onboard/components/c1_vio/okvis2.py`
- `tests/unit/c1_vio/conftest.py`
- `tests/unit/c1_vio/test_okvis2_strategy.py`
- `_docs/03_implementation/batch_23_cycle1_report.md`
Modified:
- `cpp/okvis2/CMakeLists.txt` (replace placeholder)
- `cpp/pybind11/CMakeLists.txt` (replace placeholder; can stay minimal)
- `src/gps_denied_onboard/components/c1_vio/config.py` (add `Okvis2Config`)
- `.github/workflows/ci.yml` (add apt-get step; flip `BUILD_OKVIS2=ON` in deployment kind)
- `.gitmodules` (auto-edited by submodule add)
- `_docs/_autodev_state.md`
- `_docs/02_tasks/todo/AZ-332_c1_okvis2_strategy.md` (moved to done/)
## Tier2 deliverables (NOT this session — explicit follow-up)
AC-9 (honest covariance monotonicity) and the NFR-perf test (`process_frame` p95 ≤ 80 ms on Tier-2) require real OKVIS2 + Derkachi-class fixture footage on the actual Jetson hardware. They are:
- Written in `test_okvis2_strategy.py` marked `@pytest.mark.tier2`.
- Skipped on macOS dev + GitHub Actions Linux runner.
- Picked up by `ci-tier2.yml` on push to `stage` or `main`.
- A remediation task (`AZ-332_tier2_validation`) is OPTIONAL — could be tracked separately or rolled into the deferred Jetson MVE phase that D-C1-2 already scheduled. Pick at session-start time.
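For the NFR-perf item, the p95 ≤ 80 ms check reduces to a percentile over per-frame latencies. The sketch below uses the nearest-rank definition of the 95th percentile, which is one of several common conventions and is an assumption here, as are the function names.

```python
from __future__ import annotations

from typing import Sequence


def p95_nearest_rank(samples_ms: Sequence[float]) -> float:
    """95th percentile by nearest rank: the ceil(0.95 * n)-th smallest sample."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer form of ceil(95 * n / 100), avoiding float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def within_budget(samples_ms: Sequence[float], budget_ms: float = 80.0) -> bool:
    """True when the p95 latency does not exceed the budget."""
    return p95_nearest_rank(samples_ms) <= budget_ms
```

With 100 samples the nearest-rank p95 is the 95th smallest value, so up to five outliers above 80 ms still pass and a sixth fails the budget.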
## Pinned upstream versions
Fill in once Step 1 is executed:
- `cpp/okvis2/upstream` — commit hash: _TBD_; OKVIS2 main branch HEAD at `<date>`
- `cpp/pybind11/upstream` — commit hash: _TBD_; pybind11 stable release tag `<version>`
## When this doc can be deleted
After AZ-332 lands and the next batch is in flight, this file is historical context. Move to `_docs/_archive/` (or delete if `_archive` doesn't exist) once Jetson tier2 CI has been green at least once on a real OKVIS2 run.