# GPS-Denied Onboard Pose Estimation — Architecture

> Date: 2026-05-09 (Plan Phase 2a — initial draft).
> Inputs: `_docs/00_problem/{problem,acceptance_criteria,restrictions}.md`, `_docs/00_problem/input_data/*`, `_docs/01_solution/solution.md`, `_docs/02_document/glossary.md`, `_docs/02_document/tests/*`.

## Architecture Vision

> User-confirmed in Plan Phase 2a.0 (2026-05-09). This section is the spine of the document; nothing below it may contradict it without a recorded ADR.

The system is a **Jetson Orin Nano Super-hosted onboard companion** that delivers a GPS-equivalent WGS84 position (with honest 6×6 covariance and provenance label `{satellite_anchored | visual_propagated | dead_reckoned}`) to a fixed-wing UAV's flight controller in GPS-denied or GPS-spoofed environments. It runs as a **single Python-with-C++-extensions monolithic process per binary track** on the companion PC, fusing pre-flight-cached satellite tiles served by the parent-suite `satellite-provider` with live nav-camera frames (3 Hz) and FC-supplied IMU/attitude (100–200 Hz). A canonical hierarchical pipeline `VIO → retrieval → re-rank → matching → AdHoP-conditional refinement → pose → fusion` drives the per-frame loop within a 400 ms p95 latency budget. Cross-component coupling routes through a shared GTSAM substrate so posterior covariance is recovered natively (D-C5-5 = (c)). The companion is **read-only against `satellite-provider` while airborne** — both the pre-flight tile download and the post-landing tile upload run from the operator-side `Tile Manager` (C11), a separate binary that is excluded from the airborne CMake target so the companion image cannot load either code path even via reflection or config error (process-level isolation, AC-8.4).
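One way to read "honest covariance" at the output boundary is that the scalar accuracy reported to the flight controller should never be smaller than the worst horizontal direction of the posterior. The sketch below collapses the horizontal 2×2 block of a covariance matrix into a single 1-sigma figure by taking the larger eigenvalue. This section does not specify how `horiz_accuracy` is derived, so the function name, the caller-supplied indices, and the choice of the major-axis sigma are all illustrative assumptions.

```python
import math
from typing import Sequence


def horizontal_sigma_m(cov: Sequence[Sequence[float]], ix: int, iy: int) -> float:
    """Largest 1-sigma horizontal standard deviation (metres).

    cov    : square covariance matrix (e.g. 6x6) as nested sequences
    ix, iy : indices of the two horizontal position components; the ordering
             of the state vector is an assumption the caller must supply
    """
    a, b, c = cov[ix][ix], cov[ix][iy], cov[iy][iy]
    if a < 0 or c < 0:
        raise ValueError("negative variance on the diagonal")
    # Larger eigenvalue of the symmetric 2x2 block [[a, b], [b, c]].
    lam_max = 0.5 * (a + c) + math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    return math.sqrt(lam_max)


# Toy 6x6 covariance: 3 m and 4 m uncorrelated std-devs on indices 3 and 4.
cov = [[0.0] * 6 for _ in range(6)]
cov[3][3], cov[4][4] = 9.0, 16.0
sigma = horizontal_sigma_m(cov, 3, 4)
```

Using the major axis rather than an average is the conservative choice here: it can over-report along the minor axis but does not under-report in any direction.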

### Components — intent-level (formal decomposition belongs to Step 3)

- **C1 — Visual / Visual-Inertial Odometry**: pluggable `VioStrategy` (Okvis2 architecturally-nominated production-default, VinsMono in research builds only, KltRansac mandatory simple-baseline), config-selected at startup, not hot-swappable mid-flight. **Cycle-1 operational reality**: AZ-332 (Okvis2) and AZ-333 (VinsMono) shipped as facade-only — both require Tier-2 prerequisites (CI build env + Jetson hardware + DBoW2 vocab artifact) that cycle 1 did not deliver, so the production-default selection is **KltRansac** (AZ-334) until AZ-592 / AZ-593 (Tier-2 follow-ups) land. ADR-001 / ADR-002 are unchanged — the seam holds; the *selection* shifted.
- **C2 — Visual Place Recognition**: pre-cached satellite-tile retrieval (UltraVPR primary, MegaLoc secondary, MixVPR / SelaVPR / EigenPlaces / NetVLAD / SALAD additional candidates), all behind a single `VprStrategy` interface; concrete implementation chosen by config at startup.
- **C2.5 — Top-N inlier-based re-rank**: re-ranks the top-K=10 VPR candidates by single-pair LightGlue inlier count down to top-N=3.
- **C3 — Cross-domain matcher**: DISK+LightGlue (D-C3-1 = (a)) over the N=3 retained candidates; ALIKED+LightGlue secondary; XFeat alternate.
- **C3.5 — AdHoP-conditional refinement**: invoked only when initial reprojection residual exceeds threshold; bypassed otherwise to preserve AC-4.1.
- **C4 — Pose estimation**: OpenCV `solvePnPRansac` (IPPE; target pin ≥ 4.12.0, relaxed in cycle 1 to ≥ 4.11.0.86, < 4.12, see § 2 Technology Stack) wrapped in GTSAM `Marginals` for native 6×6 covariance recovery (D-C4-2 = (b); auto-degrades to Jacobian-based covariance D-C4-2 = (a) under thermal throttle per D-CROSS-LATENCY-1).
- **C5 — State estimator**: GTSAM iSAM2 + `CombinedImuFactor` + `IncrementalFixedLagSmoother` (K=10–20 keyframes, D-C5-3); native posterior covariance via `Marginals`; **AC-4.5 = internal smoothing only**, not FC retroactive correction.
- **C6 — Tile cache + spatial index**: PostgreSQL btree spatial index over filesystem `./tiles/{zoomLevel}/{x}/{y}.jpg` mirroring `satellite-provider`'s on-disk layout, plus FAISS HNSW index for VPR descriptors (`.index` written via `faiss.write_index` + atomicwrites + SHA-256 content-hash gate, D-C10-3).
- **C7 — On-Jetson inference runtime**: TensorRT 10.3 engines (Polygraphy / trtexec / IBuilderConfig hybrid orchestration), JetPack 6.2, SM 87; ONNX Runtime + TRT EP fallback; pure PyTorch FP16 baseline.
- **C8 — Flight-Controller adapter**: `pymavlink` `GPS_INPUT` for ArduPilot Plane (MAVLink 2.0 message signing on the companion ↔ AP wired channel, D-C8-9 = (d)) and `YAMSPy` / INAV-Toolkit `MSP2_SENSOR_GPS` for iNav (signing-gap accepted residual risk).
- **C10 — Pre-flight cache provisioning**: builds the **model-derived** cache artifacts (descriptor generation, engine compilation, manifest + content-hash) on top of an already-populated tile store; F2 takeoff verifier (D-C10-1, D-C10-3, D-C10-6, D-C10-7). C10 does NOT touch `satellite-provider` — tile network I/O lives in C11.
- **C11 — Tile Manager** (operator-side, distinct binary/image, ADR-004 process-isolated): owns operator-side network I/O against `satellite-provider` in **both directions**. `TileDownloader` interface fetches tiles into C6 during F1 (TLS + service-internal API key); `TileUploader` interface pushes mid-flight tiles to `satellite-provider`'s ingest endpoint (D-PROJ-2 contract; not yet implemented service-side). C11 carries **no flight-state gating** of its own (Batch 44 SRP refactor) — the post-landing safety check lives in C12 (single source of truth). The component bundles both interfaces because they share auth, HTTP client, deployment unit, and the airborne-exclusion property.
- **C12 — Operator Pre-flight Orchestrator** (operator-side, same image as C11): orchestrates the operator-side workflows that C11 implements. Hosts the pre-flight cache provisioning UI, sector classification (active-conflict vs stable rear), the `Flight` resolver (`FlightsApiClient` → bbox + takeoff origin), the **post-landing upload orchestrator** (gates `TileUploader` on the `flight_footer` FDR record's `clean_shutdown` field — AZ-329), and the **operator re-localization service** (AC-3.4 visual-loss hint dispatched to the airborne companion over the GCS link via the `OperatorCommandTransport` Protocol; concrete pymavlink-backed impl is an E-C8 deliverable — AZ-330). The C12 ⇄ C11 boundary is a thin Protocol cut (`TileUploaderCut`) so C12 does not import C11 directly (AZ-507).
- **C13 — Flight Data Recorder (FDR)**: per-flight ≤64 GB NVM record of estimates + IMU traces + emitted MAVLink + system health + mid-flight tiles + ≤0.1 Hz failed-tile thumbnails; raw nav/AI-cam frames excluded (AC-8.5, AC-NEW-3).
- **External: `satellite-provider`** (parent-suite .NET 8 service): tile producer pre-flight; tile sink post-landing (D-PROJ-2). Treated as a planned external dependency on the upload + voting paths.
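The C2 to C2.5 hand-off in the list above amounts to a sort-and-truncate over the VPR shortlist. A minimal sketch follows, assuming a hypothetical `count_inliers(tile_id)` callable that stands in for the single-pair LightGlue match; the real matcher interface is not defined in this section, and the tile IDs and inlier counts are toy data.

```python
from typing import Callable, List, Sequence


def rerank_by_inliers(
    candidates: Sequence[str],
    count_inliers: Callable[[str], int],
    top_n: int = 3,
) -> List[str]:
    """Keep the top_n candidates with the highest inlier count.

    candidates are assumed to arrive in VPR order (best descriptor distance
    first). Python's sort is stable, so ties keep that order.
    """
    scored = [(count_inliers(tile_id), tile_id) for tile_id in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tile_id for _, tile_id in scored[:top_n]]


# Toy shortlist of K=10 tile IDs and made-up inlier counts.
vpr_top_k = [f"tile_{i}" for i in range(10)]
fake_inliers = {"tile_0": 12, "tile_4": 85, "tile_7": 85, "tile_2": 40}
retained = rerank_by_inliers(vpr_top_k, lambda t: fake_inliers.get(t, 0))
```

Breaking ties by the original VPR order is a design choice of this sketch, not something the document prescribes.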

### Architectural principles / non-negotiables

1. **Camera-specific math enters only via a `Camera calibration artifact` JSON** (intrinsics + distortion + body-to-camera extrinsics + acquisition method `factory_sheet | checkerboard_refined | hybrid`). No hard-coded camera math anywhere; test fixtures (`adti26`) and production deployments (`adti20`) load different artifacts on the same code path.
2. **VioStrategy is selected at startup via config; not hot-swappable mid-flight.**
3. **Build-time exclusion of unused `Strategy` implementations.** A given binary links only the implementations it actually uses at runtime. The default deployment binary links the production-default strategies (architecturally OKVIS2 on C1; **operationally KltRansac in cycle 1** while AZ-332 / AZ-333 are BLOCKED awaiting Tier-2 prerequisites — see Components C1 above and FINAL_report § "Cycle 1 Implementation Status") plus the engine-rule-mandatory simple-baseline (KltRansac on C1); the IT-12 comparative-study binary links all C1 implementations side-by-side. The mechanism is per-component CMake `BUILD_*` flags (`BUILD_VINS_MONO`, `BUILD_SALAD`, …) plus the per-binary composition root choosing among the linked implementations at startup. **Justification is technical** — binary size on the 8 GB shared Jetson, boot/load time inside the AC-NEW-1 30 s budget, deployed dependency / attack surface, and accidental-selection risk reduction (a binary with only OKVIS2 + KltRansac linked cannot be misconfigured into running VINS-Mono). **Component licenses do not drive this decision** — see ADR-002. CI emits both the deployment binary and the research binary on every PR.
4. **In-air network I/O against `satellite-provider` is forbidden — in BOTH directions.** Enforced primarily by **process-level isolation** — the Tile Manager (C11), which carries both the `TileDownloader` and the `TileUploader` interfaces, is not loaded in the airborne companion image. The defense-in-depth software guard is a C12-side `flight_footer.clean_shutdown == True` check (read by `PostLandingUploadOrchestrator` from the post-flight FDR via `FdrFooterReader`); C11 itself no longer gates (Batch 44 SRP refactor). The companion is read-only against C6 in flight; both pre-flight tile fetching and post-landing tile upload happen on the operator workstation.
5. **All persistent imagery is in `satellite-provider`'s on-disk tile format** (`./tiles/{zoomLevel}/{x}/{y}.jpg` + matching metadata) so post-landing upload is byte-identical. No raw frames on disk except the AC-8.5 forensic ≤0.1 Hz failed-tile thumbnail log inside FDR.
6. **Honest 6×6 posterior covariance via GTSAM `Marginals`** is the safety floor for AC-NEW-4 and AC-NEW-7. Under-reported `horiz_accuracy` is a defect, not a tuning knob.
7. **MAVLink 2.0 message signing on the companion ↔ ArduPilot wired channel**, with per-flight key rotation (D-C8-9 = (d)). iNav has no signing equivalent — accepted residual risk, Plan-phase carryforward proposes an iNav firmware feature request.
8. **D-CROSS-LATENCY-1 hybrid**: K=3 baseline auto-degrades to K=2 + Jacobian covariance under Jetson thermal throttle, preserving AC-4.1 at +50 °C ambient at the cost of ~5–10 % accuracy loss (still inside AC-NEW-4).
9. **Two execution tiers** (Tier-1 workstation Docker = fast/cheap; Tier-2 Jetson hardware = AC-bound) appear in the deployment plan and CI matrix per finding F6.
10. **Camera intrinsics and full-altitude footage are calibration prerequisites**, not implementation gaps. Production accuracy claims are gated on D-PROJ-1 closure (hybrid factory + checkerboard refinement). Test fixtures use `adti26` calibration sourced from public/factory references.
11. **Spoofed GPS never re-enters the estimator** unless the FC GPS report passes a three-part gate (AC-NEW-8 + AZ-490 follow-up): (a) FC GPS health stable + non-spoofed for ≥ 10 s, (b) a visual/satellite consistency check has succeeded on the next anchor frame, AND (c) the FC's reported position is within ≤ 200 m of the companion's last emitted `PoseEstimate`. The third clause is the **mid-flight bounded-delta gate** — even a "stable, non-spoofed" GPS frame is rejected if it disagrees with the companion's posterior by more than the configurable budget. Real GPS that passes the gate is fused via `add_pose_anchor` with the FC's covariance (treated as one more anchor source, never overriding the visual pipeline without the gate).
12. **AC-4.5 is internal smoothing only.** GTSAM iSAM2 retroactively refines past keyframes onboard and emits the corrected current frame; the FC log is forward-time only — neither ArduPilot nor iNav supports FC-side retroactive correction (Mode B Fact #107).
13. **Interface-first components with constructor-injected dependencies.** Every component is **defined as an interface (Python `Protocol` or `ABC`) before any concrete implementation exists**, lives in its **own folder under `src/components/<component>/`**, and is wired together via **constructor injection** at a single composition root. Components never reach out to a global registry, a singleton, or `import` a sibling component's concrete class directly — they receive their collaborators as `__init__` arguments typed against the sibling's interface. Multiple interchangeable implementations of the same interface MUST be supported by design (e.g., C1 has three `VioStrategy` implementations; C2 has UltraVPR + MegaLoc + MixVPR + … behind a single `VprStrategy`; C8 has two FC-adapter implementations behind a single `FcAdapter`). Selection happens once, at startup, by config; the composition root resolves config → concrete implementation → wires the graph; the rest of the runtime sees only interfaces. **Side benefit (NOTE)**: this design also gives the project **packaging optionality** — different combinations of `BUILD_*` flags can produce binaries tailored to specific deployment targets, customer bundles, or (if/when relevant later) end-product licensing strategies, **without any source-level change in application code**. That optionality is a *consequence* of the interface-first design, not a driver — the architectural decisions in this document are made on technical grounds; component licenses do not influence them. See ADR-002 § Consequences and ADR-009.
14. **Operator-planned mission is the primary cold-start trust anchor**, not the FC EKF (AZ-490 follow-up). The operator authors the route in the parent-suite **Mission Planner UI** (`suite/ui`), the route persists in the parent-suite **`flights` REST service** (`suite/flights`), and C12 (operator tooling) reads the `Flight` from that service to: (a) derive the cache bbox as the envelope of the waypoint lat/lon plus a configurable buffer, (b) extract the first-ordered waypoint as the **takeoff origin** (lat / lon / alt), and (c) bake the takeoff origin into the C10 Manifest so the airborne C5 can warm-start from it via `set_takeoff_origin(origin, sigma_horiz_m, sigma_vert_m)` **before** any FC IMU / VIO sample arrives. This unblocks the GPS-jammed-at-takeoff scenario the FC-EKF-only cold-start path (AZ-419 today) cannot handle. The FC EKF's last valid GPS becomes a **secondary** cold-start input — used only when the operator origin is missing from the Manifest OR when the FC EKF reading passes the same bounded-delta consistency check against the operator origin.
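The three-part GPS re-entry gate in principle 11 can be sketched as a pure predicate. This is an illustration under stated assumptions: the `GpsGateInputs` container and `gps_may_reenter` function are hypothetical names, the distance uses a spherical haversine formula with a mean Earth radius, and the 10 s and 200 m defaults are taken from the principle's text, which describes the distance budget as configurable.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; adequate for a ~200 m gate


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two WGS84 lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


@dataclass(frozen=True)
class GpsGateInputs:  # hypothetical container, not a type from the code base
    healthy_for_s: float       # (a) seconds the FC GPS has been stable + non-spoofed
    anchor_check_passed: bool  # (b) visual/satellite consistency on the next anchor frame
    fc_lat: float              # FC-reported position
    fc_lon: float
    est_lat: float             # companion's last emitted PoseEstimate
    est_lon: float


def gps_may_reenter(g: GpsGateInputs, min_healthy_s: float = 10.0,
                    max_delta_m: float = 200.0) -> bool:
    """All three clauses must hold; any single failure keeps GPS out."""
    if g.healthy_for_s < min_healthy_s:
        return False
    if not g.anchor_check_passed:
        return False
    delta = haversine_m(g.fc_lat, g.fc_lon, g.est_lat, g.est_lon)
    return delta <= max_delta_m


# 0.001 deg of latitude is about 111 m; 0.003 deg is about 334 m.
near = GpsGateInputs(12.0, True, 48.001, 37.0, 48.000, 37.0)
far = GpsGateInputs(12.0, True, 48.003, 37.0, 48.000, 37.0)
```

Here `gps_may_reenter(near)` passes all three clauses, while `gps_may_reenter(far)` is rejected by the bounded-delta clause even though the health and consistency clauses hold.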

### Open architectural items (tracked, NOT blocking Phase 2a)

- **D-PROJ-1** (camera calibration acquisition): CLOSED in this Plan cycle as hybrid factory + checkerboard refinement (~1 day per deployed unit). No physical hardware available this cycle, so production calibration is documented as instructions only; runtime path uses test-fixture calibration for `adti26` images.
- **D-PROJ-2** (parent-suite `satellite-provider` ingest endpoint + multi-flight voting layer): open, parent-suite work, tracked in `_docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md`. Onboard-side proceeds against the real `satellite-provider` — and uses an e2e-test-only `mock-suite-sat-service` fixture (under `tests/fixtures/`) to stand in for the not-yet-shipped POST contract during integration tests.
- **D-PROJ-3** (multi-flight fixture acquisition for AC-NEW-4 / AC-NEW-7 statistical headroom): not pursued this cycle; AC-text was relaxed 2026-05-09 to Monte-Carlo-over-current-data with stated 95 % CI; multi-flight statistical headroom is residual risk in the Step 4 risk register.
- **D-C8-2 runtime gate** (companion-driven `MAV_CMD_SET_EKF_SOURCE_SET` switch): pattern is firmware-supported but not deployed-precedent. ArduPilot Plane SITL validation (IT-3) is the lock gate; D-C8-2-FALLBACK options recorded.
- **D-C2-12** (DINOv2-feature-based matcher evaluation): carryforward research item; potentially closes D-C3-1 retrain cost.

---

## 1. System Context

**Problem being solved**: a fixed-wing UAV operating in eastern/southern Ukraine must continue to navigate and report position to its flight controller when GPS is **denied** (no fix) or **spoofed** (false fix). The onboard system replaces real GPS with a WGS84 position estimate derived from pre-cached satellite tiles + live nav-camera frames + FC IMU/attitude, with honest covariance and a provenance label. Mission profile: 8 h flights, ~60 km/h cruise, ≤1 km AGL, ≤400 km² total cached area.

**System boundaries** (inside vs outside):

| Inside the system (this project) | Outside the system |
|---|---|
| Companion PC runtime (Jetson Orin Nano Super, JetPack 6.2) | Flight controller firmware (ArduPilot Plane, iNav) |
| All onboard pose-estimation logic (C1–C8, C13) | Parent-suite `satellite-provider` (.NET 8 REST microservice) |
| Pre-flight cache artifact build (C10 — engines + descriptors + manifest) | Parent-suite `flights` REST service (.NET 8; owns the `Flight` + `Waypoint` DTOs) |
| Operator-side Tile Manager (C11 — pre-flight download + post-landing upload) | Parent-suite Mission Planner UI (`suite/ui` — where operators plan the route) |
| Operator Pre-flight Orchestrator (C12) | GCS (QGroundControl) |
| FDR writer (C13) | Nav camera hardware (`adti20`); AI-camera hardware |
| Camera calibration artifact format + loader | UAV airframe / FC IMU / sensors |
| | Operator's workstation OS / authentication |
| | The act of calibration itself (operator runs checkerboard rig) |

**External systems**:

| System | Integration Type | Direction | Purpose |
|---|---|---|---|
| `satellite-provider` (parent-suite .NET 8) | REST + filesystem (read), REST (post-landing write, D-PROJ-2) | Both | Pre-flight tile source; post-landing tile sink (planned) |
| `flights` REST service (parent-suite .NET 8) | REST (read) over HTTPS | Inbound to C12 | Source of the operator-planned `Flight` (waypoints, ordering, altitudes). C12 derives bbox + takeoff origin from the Flight. **Operator workstation only** — never reached from the airborne companion |
| Mission Planner UI (`suite/ui`) | Indirect via `flights` REST | Inbound (mediated) | Where the operator authors the route before C12 consumes it. Out of scope for this project, but the API contract it produces IS in scope |
| ArduPilot Plane FC | MAVLink 2.0 over UART/USB (signed) | Both | Inbound: external position via `GPS_INPUT`. Outbound: IMU, attitude, GPS health, EKF source-set commands |
| iNav FC | MSP2 over UART (unsigned), MAVLink outbound | Both | Inbound: external position via `MSP2_SENSOR_GPS` (companion is sole GPS source on iNav). Outbound: IMU/attitude/telemetry |
| QGroundControl (GCS) | MAVLink 2.0 (link-bandwidth-limited) | Both | 1–2 Hz downsampled summary out (AC-6.1); operator commands in (AC-6.2) |
| Nav camera (USB/MIPI-CSI/GigE) | Camera SDK / V4L2 | Inbound | 3 Hz nadir frames at 5472×3648 px |
| AI camera | Camera SDK + gimbal/zoom telemetry | Inbound | AC-7.x object localization (deferred to follow-up cycle) |
| Operator workstation | Filesystem + USB/Ethernet | Both | Pre-flight: stages cache + calibration onto companion. Post-flight: triggers upload tool, reads FDR |
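The `flights` row above says C12 derives the cache bbox and the takeoff origin from the operator-planned `Flight`. A minimal sketch of that derivation follows. The `Waypoint` dataclass is a hypothetical stand-in for the parent-suite DTO, the 1000 m default buffer is an arbitrary placeholder for the configurable value, and the metres-to-degrees conversion is a small-area approximation that ignores poles and the antimeridian.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple

M_PER_DEG_LAT = 111_320.0  # approximate metres per degree of latitude


@dataclass(frozen=True)
class Waypoint:  # hypothetical stand-in for the `flights` service DTO
    order: int
    lat: float
    lon: float
    alt_m: float


BBox = Tuple[float, float, float, float]  # (min_lat, min_lon, max_lat, max_lon)


def plan_cache_inputs(waypoints: Sequence[Waypoint],
                      buffer_m: float = 1000.0) -> Tuple[BBox, Waypoint]:
    """Return (buffered bbox, takeoff origin) for an operator-planned route."""
    if not waypoints:
        raise ValueError("a Flight needs at least one waypoint")
    ordered = sorted(waypoints, key=lambda w: w.order)
    lats = [w.lat for w in ordered]
    lons = [w.lon for w in ordered]
    dlat = buffer_m / M_PER_DEG_LAT
    # Longitude degrees shrink with latitude; size the buffer at the
    # highest |lat| so it is never narrower than requested.
    widest_lat = max(abs(min(lats)), abs(max(lats)))
    dlon = buffer_m / (M_PER_DEG_LAT * math.cos(math.radians(widest_lat)))
    bbox = (min(lats) - dlat, min(lons) - dlon, max(lats) + dlat, max(lons) + dlon)
    return bbox, ordered[0]  # first-ordered waypoint is the takeoff origin


# Toy route; waypoints deliberately supplied out of order.
route = [
    Waypoint(order=2, lat=48.10, lon=37.20, alt_m=600.0),
    Waypoint(order=1, lat=48.00, lon=37.00, alt_m=150.0),
    Waypoint(order=3, lat=48.05, lon=37.40, alt_m=600.0),
]
bbox, origin = plan_cache_inputs(route)
```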

---

## 2. Technology Stack

| Layer | Technology | Version | Rationale |
|---|---|---|---|
| Language (host) | Python | 3.10 (JetPack 6.2 default) | Glue layer for GTSAM/FAISS/OpenCV/pymavlink/YAMSPy; matches every selected library's primary binding |
| Language (perf-critical) | C++ | C++17 | OKVIS2, VINS-Mono, GTSAM core, OpenCV, FAISS native; Python wrappers cross the boundary |
| Inference runtime | TensorRT | 10.3 (JetPack 6.2 pin) | Pinned per D-C7-9; fallback ONNX Runtime + TRT EP; pure PyTorch FP16 baseline for mandatory simple-baseline track |
| Visual matching | DISK + LightGlue | upstream HEAD pinned per Plan-phase | D-C3-1 = (a); replaces SuperPoint+SuperGlue (Magic Leap noncommercial canonical) |
| VPR (primary) | UltraVPR | RAL 2025 / ICRA 2026 (cbbhuxx/UltraVPR) | Documentary Lead PRIMARY; rotation-invariant, unsupervised aerial pretrain (multi-heading aerial flight + closes D-C2-1 retrain cost) |
| VPR (secondary) | MegaLoc, MixVPR, SelaVPR, EigenPlaces, NetVLAD | upstream HEAD pinned per Plan-phase | Mode B Fact #110/#113 + mandatory simple-baseline (NetVLAD/MixVPR) |
| State estimator | GTSAM + `gtsam_unstable.IncrementalFixedLagSmoother` | per Plan-phase pin (no published CVE at audit time) | Native 6×6 covariance; D-C5-5 = (c) `PriorFactorPose3` only |
| Image / pose math | OpenCV (Python+C++) | **≥ 4.11.0.86, < 4.12** (cycle-1 relaxation; original target ≥ 4.12.0) | CVE-2025-53644 mitigation target was ≥ 4.12.0 (Mode B Fact #112); cycle 1 relaxed the floor because `gtsam==4.2.1` only ships numpy<2 wheels and `opencv-python>=4.12` requires numpy>=2 — see `_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md`. 4.11.0.86 is in the supported 4.x line and receives security patches; the ≥ 4.12.0 pin replays once gtsam ships numpy-2 wheels or an alternative SE(3) backend lands. IPPE flags for D-C4-1 = (b) unaffected. |
| VPR descriptor index | FAISS HNSW | upstream HEAD pinned per Plan-phase | `faiss.write_index` + atomicwrites + SHA-256 content-hash gate (D-C10-3) |
| FC adapter (ArduPilot) | `pymavlink` + MAVLink 2.0 signing | bundled unmodified per D-C8-3 | Verified Source #4; ArduPilot canonical signing per Source #128 |
| FC adapter (iNav) | YAMSPy + INAV-Toolkit MSP2 | MIT throughout | iNav has no inbound MAVLink ext-positioning handler (SQ6) |
| VIO (production) | OKVIS2 (BSD-3-Clause) | upstream HEAD pinned per Plan-phase | D-C1-1-SUB-A = (a) architecturally-nominated production-default. **Cycle-1**: AZ-332 BLOCKED — facade + pybind11 skeleton ship; first `add_frame` raises until Tier-2 prerequisites (CI build env + Jetson hardware + DBoW2 vocab) and AZ-592 follow-up land. |
| VIO (research / IT-12) | VINS-Mono | upstream HEAD pinned per Plan-phase | Research binary only (`BUILD_VINS_MONO=ON`) for IT-12 comparative study; build-time exclusion from deployment binary per ADR-002. **Cycle-1**: AZ-333 BLOCKED — same skeleton-only state as AZ-332, plus pending upstream-vendoring decision (HKUST + ROS-strip vs. community fork); AZ-593 follow-up. |
| VIO (mandatory baseline) | KLT+RANSAC over OpenCV | OpenCV ≥ 4.11.0.86 (cycle-1 relaxation; see OpenCV row) | Engine-rule-required mandatory simple-baseline. **Cycle-1**: serves as the operational airborne `VioStrategy` default while AZ-332 / AZ-333 remain BLOCKED. |
| Tile cache backend | PostgreSQL + filesystem | PostgreSQL 16 (mirror of `satellite-provider`) | C6 mirrors `satellite-provider`'s on-disk and table layout so C11 `TileUploader`'s post-landing payload is byte-identical to what the parent suite already serves |
| Container runtime | Docker (Tier-1) + bare JetPack (Tier-2) | Docker 27.x; JetPack 6.2 | Tier-1 workstation Docker; Tier-2 Jetson native (no Docker — direct JetPack to keep INT8 calibration cache trustworthy per D-C10-6) |
| Build system | CMake + Python `pyproject.toml` | CMake ≥ 3.27 | CMake `option(BUILD_VINS_MONO ...)` D-C1-1-SUB-A; Python wheels built per Jetson via cibuildwheel-equivalent recipe |
| CI/CD | GitHub Actions (Tier-1) + self-hosted Jetson runner (Tier-2) | latest pinned action versions | Two-binary emit on every PR (production + research); Tier-2 runs are AC-bound jobs only |
| Configuration | YAML (per-flight) + Camera calibration JSON | n/a | Single config root; the only camera-specific entry point is the calibration JSON |

**Key constraints from `restrictions.md` and how they shape the stack**:

- **Hardware pinned to Jetson Orin Nano Super (8 GB shared, 25 W)** → forces TensorRT engine compilation on-device + INT8/FP16 mix per D-C7-1; rules out heavy multi-process stacks (D-C1-1-SUB-A = (b) was rejected on latency budget).
- **Python is the host language but ROS-bound C++ is unavoidable for VIO** → both production and research binaries are CMake projects that produce a Python-importable `.so` per `VioStrategy`; the rest of the runtime is pure Python.
- **PX4 is out of scope, ArduPilot Plane + iNav both required** → C8 must split per FC, with no single message contract spanning both.
- **Build-time exclusion of unused `Strategy` implementations (ADR-002)** → CMake `BUILD_*` flags (`BUILD_VINS_MONO`, `BUILD_SALAD`, …) determine which implementations are linked into each binary; the deployment binary links the production-default + the mandatory simple-baseline; the IT-12 research binary links all strategies. Justification is technical (binary size on 8 GB shared Jetson, AC-NEW-1 boot budget, dependency surface, accidental-selection risk). Component licenses do not influence this decision.
- **MAVLink message-signing posture asymmetry** → `pymavlink` signing handshake is part of takeoff load on the AP path; iNav unsigned link is documented as accepted residual risk in `security_analysis.md` carryforward.
- **No raw-frame storage (AC-8.5)** → all camera ingestion is streaming; the only persistence path for frame imagery is via tile orthorectification (AC-8.4).
- **8 h continuous duty cycle at 25 W up to +50 °C ambient** → the auto-degrade hybrid (D-CROSS-LATENCY-1) is a first-class concern of every latency-sensitive component, not an afterthought.
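The technology table pairs the FAISS index write with an atomic write and a SHA-256 content-hash gate (D-C10-3). The sketch below shows the general shape of that pattern using only the standard library: a temp file plus `os.replace` stands in for the `atomicwrites` package named in the table, and the index is treated as opaque bytes rather than going through `faiss.write_index`. The function names and the idea that the expected digest is kept alongside the artifact (for example in the manifest) are assumptions for illustration.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def write_atomic_with_hash(path: Path, payload: bytes) -> str:
    """Write payload so readers see either the old file or the complete new one.

    Returns the SHA-256 hex digest to record alongside the artifact.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    # Temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic rename over the destination
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return sha256_hex(payload)


def load_if_hash_matches(path: Path, expected_hex: str) -> bytes:
    """Content-hash gate: refuse to hand back bytes that do not match."""
    data = path.read_bytes()
    if sha256_hex(data) != expected_hex:
        raise ValueError(f"content hash mismatch for {path.name}")
    return data


# Round trip, then simulate on-disk corruption.
with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "vpr.index"
    digest = write_atomic_with_hash(target, b"fake-index-bytes")
    roundtrip = load_if_hash_matches(target, digest)
    target.write_bytes(b"tampered")
    try:
        load_if_hash_matches(target, digest)
        tamper_detected = False
    except ValueError:
        tamper_detected = True
```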

---

## 3. Deployment Model

**Environments**:

| Environment | Purpose | Hardware |
|---|---|---|
| `dev-tier1` | Fast iterative development; unit + most integration tests | Workstation (any Linux x86_64 + NVIDIA GPU optional); Docker |
| `dev-tier2` | Hardware-bound development checks | Jetson Orin Nano Super dev kit (developer's desk) |
| `staging-tier1` | CI runs that don't require Jetson hardware | GitHub-hosted runner (x86_64); Docker |
| `staging-tier2` | CI runs that require Jetson (AC-bound jobs only) | Self-hosted Jetson runner; bare JetPack (no Docker) |
| `production` | Deployed companion image on a UAV | Jetson Orin Nano Super (pinned); bare JetPack; no inbound network listening (defense-in-depth, NFT-SEC-05) |
| `production-operator-workstation` | Operator-side workflows orchestrated by C12: pre-flight tile download (C11 `TileDownloader`), cache artifact build (C10), post-landing tile upload (C12 `PostLandingUploadOrchestrator` → C11 `TileUploader`), AC-3.4 re-loc hint dispatch (C12 `OperatorReLocService`), FDR retrieval | Operator's Linux workstation; Docker for `satellite-provider` mirror |

**Infrastructure**:

- **No cloud orchestration**. The companion is an embedded edge device; the operator's workstation is a single host that runs the operator tooling (C11 Tile Manager + C12 Operator Pre-flight Orchestrator) and a local `satellite-provider` mirror or VPN-reaches the lab `satellite-provider`.
- **Two airborne binaries shipped on every PR** (ADR-002): `deployment-binary` (links the production-default strategy on each component + the mandatory simple-baseline; CMake `BUILD_VINS_MONO=OFF`, `BUILD_SALAD=OFF`, …) and `research-binary` (links every available strategy on every component; all `BUILD_*` flags `ON`, used for the IT-12 comparative study). The deployment binary is what installs onto an operational Jetson; the research binary runs on dev/lab Jetson hardware for the comparative-study report. The same code base produces both — ADR-002 mechanism scales to additional binary variants later if packaging strategy requires it. **Replay is not a separate binary** (ADR-011): the deployment-binary runs both live and replay modes from the same image, swapping `FrameSource` / `FcAdapter` / `MavlinkTransport` strategies at startup based on `config.mode`. A third binary — `operator-orchestrator` (C10 + C11 + C12) — ships from the same source tree for the operator workstation; the airborne deployment-binary does NOT contain the operator-orchestrator components (ADR-004 process isolation).
- **Container scope**: Tier-1 uses Docker (`docker compose` for the developer setup including a `mock-suite-sat-service` container, the operator-orchestrator container, and a Postgres for C6). **Tier-2 (Jetson) does NOT use Docker** — TensorRT INT8 calibration caches and `jetson-stats` thermal telemetry are most reliable without a container layer, per D-C7-9 + D-C10-6. The deployed image on the Jetson is a JetPack-based system image with the deployment binary preinstalled.
- **Scaling**: not applicable (per-UAV, single companion). Failover is per-airframe (the FC's IMU-only fallback at AC-5.2 is the system's "scale-out").

**Environment-specific configuration**:

| Config | dev-tier1 | staging-tier2 | production |
|---|---|---|---|
| `satellite-provider` host | local Docker (`satellite-provider:5100`) | real `satellite-provider` Docker (download path; existing) + e2e-test `mock-suite-sat-service` fixture (POST/upload only, until D-PROJ-2 lands) | operator workstation (pre-flight only) |
| Camera calibration source | test-fixture artifact (`adti26.json`) | test-fixture artifact | `adti20.json` (D-PROJ-1 hybrid output) |
| Logging sink | console (DEBUG) | journald + FDR | FDR (per-flight, ≤ 64 GB rolling) |
| MAVLink signing key | dev key (committed to test fixtures) | per-flight key from test config | per-flight key generated at takeoff load, rotated per flight |
| Inference engine source | pre-built engines OR on-the-fly compile | pre-built (Tier-2 cache) | pre-built (verified content-hash gate) |
| `BUILD_VINS_MONO` (binary track) | both (developer's choice) | both | OFF (production-only) |
| Network egress | unrestricted | locked to test endpoints | **none in flight** (DNS blackhole + iptables OUTPUT REJECT, NFT-SEC-05) |

**Image / artifact pipeline**:

```
source repo
   ├─→ CI matrix
   │     ├─ tier1 lint + unit + most integration → Docker
   │     ├─ tier1 build deployment-binary + research-binary (CMake split)
   │     ├─ tier1 SBOM diff (deployment-binary must NOT include vins_mono)
   │     └─ tier2 (self-hosted Jetson) AC-bound suite (NFT-PERF-*, NFT-LIM-*, IT-12)
   │
   ├─→ release artifacts:
   │     ├─ deployment-binary tarball (production-default strategies + mandatory baselines + replay strategies, ADR-002 + ADR-011; runs both live and replay modes from a single image)
   │     ├─ research-binary tarball (all strategies linked; for IT-12 comparative study; also includes replay strategies)
   │     ├─ JetPack image (deployment-binary preinstalled)
   │     └─ operator-orchestrator tarball (C11 + C12 + e2e-test mock-suite-sat-service compose for offline integration testing)
   │
   └─→ deploy paths:
         ├─ Jetson operational deploy: JetPack image flash (deployment-binary)
         ├─ Lab/research deploy: research-binary install on dev Jetson
         └─ Operator workstation: Docker compose for C11+C12+local satellite-provider mirror
```
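The startup-time strategy swap described under Infrastructure (ADR-011: one image, live or replay chosen by `config.mode`) can be sketched with a small composition root. Only `FrameSource` and `config.mode` are names from this document. The concrete classes, the `Config` and `Pipeline` types, the `replay_path` field, and the literal mode strings `"live"` and `"replay"` are hypothetical, and the same pattern would apply to `FcAdapter` and `MavlinkTransport`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, Protocol


class FrameSource(Protocol):
    """Interface the rest of the runtime is typed against."""
    def frames(self) -> Iterator[bytes]: ...


class LiveCameraSource:  # hypothetical live implementation
    def frames(self) -> Iterator[bytes]:
        yield b"live-frame"


class ReplayFileSource:  # hypothetical replay implementation
    def __init__(self, path: str) -> None:
        self._path = path

    def frames(self) -> Iterator[bytes]:
        yield b"replayed-frame"


@dataclass(frozen=True)
class Config:
    mode: str               # assumed values: "live" or "replay"
    replay_path: str = ""


class Pipeline:
    """Receives its collaborator by constructor injection, typed on the interface."""
    def __init__(self, frame_source: FrameSource) -> None:
        self._frame_source = frame_source

    def first_frame(self) -> bytes:
        return next(iter(self._frame_source.frames()))


def build_pipeline(config: Config) -> Pipeline:
    """Composition root: resolve config to a concrete class once, at startup."""
    factories: Dict[str, Callable[[], FrameSource]] = {
        "live": LiveCameraSource,
        "replay": lambda: ReplayFileSource(config.replay_path),
    }
    try:
        make_source = factories[config.mode]
    except KeyError:
        raise ValueError(f"unknown config.mode: {config.mode!r}") from None
    return Pipeline(frame_source=make_source())


live = build_pipeline(Config(mode="live"))
replay = build_pipeline(Config(mode="replay", replay_path="flight_0001.fdr"))
```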
|
||
|
||
---
|
||
|
||
## 4. Data Model Overview
|
||
|
||
> Detailed per-component data models live in component specs (Step 3); per-entity migration strategies live in `data_model.md` (Phase 2b).
|
||
|
||
**Core entities**:
|
||
|
||
| Entity | Description | Owned by component |
|
||
|---|---|---|
|
||
| `NavCameraFrame` | 5472×3648 px nadir RGB frame + capture timestamp + camera ID | Camera ingest → C1, C2 |
|
||
| `ImuSample` / `ImuWindow` | IMU sample (accel + gyro + timestamp) at 100–200 Hz; windowed view sent to C1 | FC adapter (C8 inbound side) |
|
||
| `VioOutput` | Per-frame relative pose SE(3) + 6×6 covariance + IMU bias estimate + feature quality | C1 |
|
||
| `VprQuery` | Image embedding (UltraVPR/MegaLoc/etc) | C2 |
|
||
| `VprResult` | Top-K=10 candidate tile IDs ranked by descriptor distance | C2 |
| `RerankResult` | Top-N=3 candidate tiles ranked by inlier count | C2.5 |
| `MatchResult` | 2D-3D correspondences with RANSAC inliers from C3 / C3.5 | C3, C3.5 |
| `CameraCalibration` | Intrinsics K + distortion + body-to-camera extrinsics + acquisition method | Loaded once at startup; consumed by C1, C3, C4 |
| `PoseEstimate` | WGS84 position + 6×6 covariance + provenance label + `last_satellite_anchor_age_ms` | C4 → C5 |
| `Tile` | JPEG body + center lat/lon + zoomLevel + tile_size_meters/pixels + capture_timestamp + source + freshness flag + (mid-flight only) quality_metadata | C6 |
| `TileQualityMetadata` | `estimator_label`, 2×2 covariance sub-matrix, `last_anchor_age_ms`, MRE, IMU bias norm — sufficient for D-PROJ-2 voting | C6 (write side from C5/C4 outputs) |
| `EmittedExternalPosition` | WGS84 + honest `horiz_accuracy` + per-FC encoding (MAVLink `GPS_INPUT` for AP, MSP2 `MSP2_SENSOR_GPS` for iNav) | C8 |
| `FlightStateSignal` | `IN_AIR \| ON_GROUND` two-state signal derived from FC `MAV_STATE` | C8 inbound side; used internally by C8/C5 for live-flight state machines. **Not** consumed by C11/C12: post-landing gating reads the C13-written `flight_footer` FDR record instead (Batch 44 SRP refactor) |
| `FlightFooterRecord` | `{flight_id, clean_shutdown, total_records, segment_count, …}` — single FDR record written by C13 on clean shutdown | C13 (writer) → C12 `PostLandingUploadOrchestrator` (reader, via `FdrFooterReader`) |
| `PostLandingUploadRequest` | `{flight_id, satellite_provider_url, api_key, batch_size}` | C12 CLI → C12 `PostLandingUploadOrchestrator` |
| `ReLocHint` | Operator-supplied position hint for AC-3.4 visual-loss re-localization: `{approximate_position_wgs84: LatLonAlt, confidence_radius_m, reason}`; validated at construction (lat ∈ [-90,90]; lon ∈ (-180,180]; radius > 0; reason non-empty); emitted to airborne companion via `OperatorCommandTransport` Protocol (E-C8 concrete) | Operator CLI → C12 `OperatorReLocService` → (GCS link) airborne companion |
| `FdrRecord` | Estimates + IMU traces + emitted MAVLink + system health + tiles + thumbnails (≤ 64 GB / flight) | C13 |
| `Manifest` | Hash of (model + calibration + corpus + sector classification + takeoff origin) for D-C10-1 idempotence | C10 |
| `EngineCacheEntry` | TRT engine + INT8 calibration cache keyed by SM/JP/TRT/precision tuple (D-C10-7) | C10, C7 |
| `SectorClassification` | `active_conflict \| stable_rear` per area, drives freshness threshold | C12 (operator-set) → C6, C10 |
| `Flight` | Operator-planned mission: ordered `Waypoint` list + metadata, persisted in the parent-suite `flights` REST service. Read by C12 via `FlightsApiClient`; never reached from the airborne companion | External (`suite/flights`) → C12 |
| `Waypoint` | Ordered `(lat, lon, alt, objective, source)` entry inside a `Flight`. C12 envelopes waypoint lat/lon → bbox; first-ordered waypoint → takeoff origin | External (`suite/flights`) → C12 |
| `TakeoffOrigin` | `LatLonAlt` carried in the C10 Manifest; baked in by C12 at build time from `Flight.waypoints[0]`; consumed at boot by C5 via `set_takeoff_origin(origin, sigma_horiz_m, sigma_vert_m)` (AZ-490) | C12 → C10 Manifest → C5 |

**Key relationships**:

- `NavCameraFrame` → `VioOutput` (via C1) and `VprQuery` (via C2): same frame, two consumers.
- `VprResult.tileIds` ⊆ `Tile.id` (FK into the tile cache).
- `MatchResult` references both `NavCameraFrame.id` and `Tile.id` (cross-domain pair).
- `PoseEstimate` aggregates `MatchResult` + `VioOutput` + `ImuWindow` through C4 + C5.
- `EmittedExternalPosition` is a per-FC projection of `PoseEstimate`; the projection rule lives in C8 (per-FC unit conversion D-C8-8 = (b)).
- `Tile` (mid-flight) is produced from `NavCameraFrame` + `PoseEstimate` via orthorectification; carries `TileQualityMetadata` referencing the `PoseEstimate` it was emitted from.
- `FdrRecord` is the union of all emitted streams + all inputs (excluding raw nav/AI-cam frames); rollover policy = oldest segment dropped first.

**Data flow summary** (one-line each; full sequences in `system-flows.md`):

- Pre-flight: `satellite-provider` → C11 `TileDownloader` → `Tile` cache (C6) → C10 → `EngineCacheEntry` + `Manifest` + descriptor `.index` (atomic write + content-hash gate).
- Takeoff load: `Manifest` content-hash verify + FAISS mmap + TRT deserialize + MAVLink signing handshake → ready.
- Per-frame runtime: `NavCameraFrame` + `ImuWindow` → C1 (`VioOutput`) → C2 → C2.5 → C3 → C3.5 → C4 → C5 → C8 → `EmittedExternalPosition` to FC.
- Mid-flight tile gen: `NavCameraFrame` + `PoseEstimate` → orthorectify → dedup → write to local C6 (no upload).
- GCS telemetry: C5 → C8 → 1–2 Hz downsampled summary to QGroundControl.
- FDR: every emitted/received stream → C13 ring with per-flight ≤ 64 GB cap.
- Post-landing: operator triggers C12 `PostLandingUploadOrchestrator` → reads `flight_footer` from FDR via `FdrFooterReader` → on `clean_shutdown == True` invokes C11 `TileUploader` (via `TileUploaderCut` Protocol) → reads C6 → uploads to `satellite-provider` ingest endpoint (D-PROJ-2 contract). Refusal modes (`footer_missing`, `unclean_shutdown`, `flight_id_not_found`, `fdr_unreadable`) raise `FlightStateNotConfirmedError` with operator-actionable remediation text and a distinct CLI exit code per mode.
- Operator re-loc (AC-3.4 visual-loss path): operator submits `ReLocHint` via the `reloc-confirm` CLI → C12 `OperatorReLocService` validates the DTO → forwards to airborne companion via `OperatorCommandTransport` (E-C8 concrete) → records `c12.reloc.requested` FDR record (`outcome ∈ {sent, failed}`). Live logs are redacted (lat/lon rounded to 5 decimals; `reason` truncated to 200 chars), while the FDR record persists the full, un-redacted hint for post-flight forensics.
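The `ReLocHint` validation and live-log redaction rules described above can be sketched as follows. This is a minimal illustration, not the project's DTO: the flattened field names (`lat_deg`, `lon_deg`, `alt_m`) stand in for `approximate_position_wgs84: LatLonAlt`, and `redact_for_live_log` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReLocHint:
    """Operator-supplied position hint (field layout is an assumption)."""
    lat_deg: float
    lon_deg: float
    alt_m: float
    confidence_radius_m: float
    reason: str

    def __post_init__(self) -> None:
        # Bounds quoted in the entity table: lat in [-90, 90], lon in (-180, 180].
        if not -90.0 <= self.lat_deg <= 90.0:
            raise ValueError(f"lat out of range: {self.lat_deg}")
        if not -180.0 < self.lon_deg <= 180.0:
            raise ValueError(f"lon out of range: {self.lon_deg}")
        if not self.confidence_radius_m > 0.0:
            raise ValueError("confidence_radius_m must be > 0")
        if not self.reason.strip():
            raise ValueError("reason must be non-empty")


def redact_for_live_log(hint: ReLocHint) -> dict:
    """Live-log view: lat/lon rounded to 5 decimals, reason cut to 200 chars."""
    return {
        "lat": round(hint.lat_deg, 5),
        "lon": round(hint.lon_deg, 5),
        "confidence_radius_m": hint.confidence_radius_m,
        "reason": hint.reason[:200],
    }
```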

---

## 5. Integration Points

### Internal Communication

> All in-process Python calls; the system is a single host process per binary track. "Pattern" describes the interaction shape.

| From | To | Protocol | Pattern | Notes |
|---|---|---|---|---|
| Camera ingest thread | C1 (`VioStrategy.process_frame`) | In-process queue (bounded, drop-oldest) | Producer-consumer | Frame skip is allowed under sustained load (AC-4.1 "~10% may drop") |
| Camera ingest thread | C2 (`vpr_pipeline.query`) | In-process queue (bounded, drop-oldest) | Producer-consumer | Same frame fan-out, distinct queue depths |
| C2 | C2.5 | Direct call | Function call | C2.5 wraps C3 matcher; no queue |
| C2.5 | C3 / C3.5 | Direct call | Function call | C3.5 invoked iff `MatchResult.reprojection_residual > threshold` |
| C3 / C3.5 | C4 | Direct call | Function call | `MatchResult` passed as DTO |
| C1 + C4 | C5 | In-process queue (timestamp-ordered merge) | Pub/sub | C5 holds the GTSAM `iSAM2` state; one writer thread |
| C5 | C8 (FC outbound) | In-process queue (per-FC encoder) | Pub/sub | One encoder per active FC profile; selected at startup |
| C8 (FC inbound) | C1 (`ImuWindow`), C5 (FC IMU/attitude prior) | In-process pub/sub (timestamp-aligned) | Pub/sub | Single source of truth for FC IMU; both consumers see the same window |
| C8 (FC inbound) | flight-state guard (process boundary) | In-process pub/sub | Event | Used by FDR + GCS heartbeat; airborne companion does not load C11 at all |
| C5 → orthorectifier → C6 | C6 (write-only while airborne) | In-process function call | Command | Write path is in-process; the in-air image has no upload code path |
| All components | C13 (FDR writer) | In-process queue (lossy on overrun) | Pub/sub | Overrun = logged rollover, never silent drop (AC-NEW-3) |
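The "bounded, drop-oldest" queue named in the camera-ingest rows can be illustrated with the standard library alone. The sketch below is an assumption about the shape of such a queue, not the project's implementation; the class name `DropOldestQueue` and the `dropped` counter are hypothetical.

```python
import threading
from collections import deque


class DropOldestQueue:
    """Bounded FIFO that evicts the oldest item instead of blocking the producer."""

    def __init__(self, maxsize: int) -> None:
        if maxsize <= 0:
            raise ValueError("maxsize must be positive")
        # A deque with maxlen discards from the left when a full deque is appended to.
        self._items = deque(maxlen=maxsize)
        self._cond = threading.Condition()
        self.dropped = 0  # counted so an overrun can be logged rather than silent

    def put(self, item) -> None:
        with self._cond:
            if len(self._items) == self._items.maxlen:
                self.dropped += 1
            self._items.append(item)
            self._cond.notify()

    def get(self, timeout=None):
        with self._cond:
            ready = self._cond.wait_for(lambda: len(self._items) > 0, timeout=timeout)
            if not ready:
                raise TimeoutError("no item available")
            return self._items.popleft()
```

With a depth of 2, pushing frames 1, 2, 3 leaves frames 2 and 3 for the consumer and records one drop. This keeps the producer thread non-blocking, which matches the frame-skip allowance noted for the camera ingest path.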

### External Integrations

| External system | Protocol | Auth | Rate limits | Failure mode |
|---|---|---|---|---|
| ArduPilot Plane FC | MAVLink 2.0 (`GPS_INPUT` 5 Hz; `MAV_CMD_SET_EKF_SOURCE_SET`; `STATUSTEXT` / `NAMED_VALUE_FLOAT`) over UART/USB | MAVLink 2.0 message signing, per-flight key (D-C8-9 = (d)) | 5 Hz periodic emit; signing handshake at takeoff load (≤ 5 s, AC-NEW-1) | Signing handshake fail → companion refuses takeoff; mid-flight signing key compromise → FC ignores unsigned messages, AC-5.2 takes over |
| iNav FC | MSP2 `MSP2_SENSOR_GPS` over UART; MAVLink outbound for telemetry | None (iNav has no signing) — accepted residual risk per Mode B Source #129 | 5 Hz periodic emit | Mid-flight bad-frame → iNav `mspGPSReceiveNewData()` receives only the latest frame; honest `hPosAccuracy` is the only safety net |
| QGroundControl (GCS) | MAVLink 2.0 (`STATUSTEXT`, `NAMED_VALUE_FLOAT`, `GPS_RAW_INT`) | Same MAVLink 2.0 signing as the AP path (AP profile); no signing on iNav profile | 1–2 Hz downsampled (AC-6.1); operator commands are best-effort | GCS link drop → companion continues; no mid-flight reconfiguration is required from GCS |
| `satellite-provider` (pre-flight read — bbox + slippy-map) | REST `POST /api/satellite/tiles/inventory` (bulk lookup by `(z,x,y)`, ≤ 5000 entries / request) + `GET /tiles/{z}/{x}/{y}` (slippy-map JPEG fetch); OpenAPI at `/swagger`; filesystem access if co-located | JWT Bearer (`SATELLITE_PROVIDER_API_KEY`) over TLS; the dev-only `SATELLITE_PROVIDER_TLS_INSECURE=1` env knob accepts the self-signed dev cert. The companion never reaches `satellite-provider` directly while airborne. | Off-line pre-flight; not time-critical | Cache miss → C11 `TileDownloader` fails fast pre-flight; C10 build is blocked downstream; takeoff blocked |
| `satellite-provider` (pre-flight route seed — cycle 3 / Epic AZ-835) | REST `POST /api/satellite/route` (corridor onboarding; body per `CreateRouteRequest.cs` DTO) + `GET /api/satellite/route/{id}` (status polling; terminal-success `mapsReady=true`) | Same JWT Bearer / TLS-insecure as the read path; validated pre-emptively against AZ-809 `CreateRouteRequestValidator` bounds | Off-line pre-flight; bounded by `poll_max_attempts × poll_interval_s` (default 60 × 5 s) | Terminal failure → `RouteTerminalFailureError`; transient → `RouteTransientError`; validation → `RouteValidationError`. C11's `SatelliteProviderRouteClient` (AZ-838) owns the surface. |
| `satellite-provider` (post-landing ingest, D-PROJ-2, **planned**) | REST `POST /api/satellite/tiles/ingest` (multipart) | Per-flight onboard signing key (carried with each tile); rate-limited | Bursty post-landing | Endpoint not yet implemented service-side → C11 keeps batches queued locally; never blocks the pre-flight cycle |
| Operator workstation (pre-flight stage) | Filesystem (USB / Ethernet) | OS-level (operator login) | Not time-critical | Bad-stage detection via Manifest content-hash gate (D-C10-3) |
| Nav camera | USB / MIPI-CSI / GigE (lens-module dependent) | n/a | 3 Hz | Frame drop / hardware fault → "VISUAL_BLACKOUT" path (AC-3.5, AC-NEW-8) |

### `satellite-provider` integration (cycle-3 ground truth)

**The Jetson e2e harness now consumes the REAL parent-suite `satellite-provider` .NET service** (lineage AZ-688 / AZ-691 / AZ-692; `satellite-provider` + `satellite-provider-postgres` services in `docker-compose.test.jetson.yml`). The legacy `mock-sat` fixture is retired from the Jetson compose; D-PROJ-2 `POST /api/satellite/upload` has shipped service-side (`Program.cs:211`). Tier-1 `docker-compose.test.yml` has been deprecated as of 2026-05-20 per `_docs/02_document/tests/environment.md`.

Two consequences for the architecture:
1. **C11 read contract adapted to the v1.0.0 inventory shape (AZ-777 Phase 1)** — `POST /api/satellite/tiles/inventory` + `GET /tiles/{z}/{x}/{y}` replace the historical `GET /api/satellite/tiles?bbox=…&zoom=…` shape. The bbox-driven `download_tiles_for_area` entry point and its DTOs are unchanged at the call-site level; the contract adaptation is internal to `HttpTileDownloader`. Auth is JWT Bearer (`SATELLITE_PROVIDER_API_KEY`) over TLS; `SATELLITE_PROVIDER_TLS_INSECURE=1` is a documented dev-only knob for self-signed certs.
2. **Route-driven seeding (Epic AZ-835 — C11's third interface, `SatelliteProviderRouteClient`)** — the operator can now submit a tlog-derived `RouteSpec` (waypoints + region size; produced by `replay_input.tlog_route.extract_route_from_tlog` — AZ-836; canonical DTO at `_types/route.py` per AZ-845) via `POST /api/satellite/route` and have `satellite-provider` materialise just the corridor tiles, polling `GET /api/satellite/route/{id}` until `mapsReady=true`. This is ~100× more tile-efficient than the bbox path on long, narrow flights. Pre-emptive validation mirrors the AZ-809 `CreateRouteRequestValidator` bounds. The route-driven path is exercised today by the cycle-3 e2e fixture `operator_pre_flight_setup` (AZ-839) and the orchestrator test `test_az835_e2e_real_flight.py` (AZ-840); the C12 production CLI binding is a future-cycle integration.

**Imagery source license attribution (cycle 3)**: the Jetson `satellite-provider` instance downloads from the **Google Maps satellite layer** (`lyrs=s`), governed by Google Maps Platform Terms of Service. Dev/research use only; production deployment requires either a Google Maps Platform licensing review or migration to a true CC-BY satellite source on the parent-suite side (parent-suite ticket TBD). Operator-side seed scripts (`tests/fixtures/derkachi_c6/seed_region.py`, `seed_route.py`) propagate the "Imagery © Google" attribution.

No new ADR — this is execution of existing decisions (architectural principle #5 satellite-provider on-disk layout end-to-end; ADR-004 process-level isolation unchanged; ADR-011 replay is a configuration unchanged). The architectural surface gained the route-driven seeding path inside C11; nothing else moved.
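The route-seed polling loop (poll `GET /api/satellite/route/{id}` until `mapsReady=true`, bounded by `poll_max_attempts × poll_interval_s`) might look roughly like the sketch below. Several things here are assumptions: `fetch_status` is an injected callable standing in for the HTTP call, the `failed` flag is a hypothetical way for the service to signal terminal failure, mapping an exhausted budget to `RouteTransientError` is a guess, and the exception classes are local stand-ins for the ones C11 owns.

```python
import time
from typing import Callable, Mapping


class RouteTransientError(Exception):
    """Stand-in: polling budget exhausted before the route reported ready."""


class RouteTerminalFailureError(Exception):
    """Stand-in: the service reported a state the route cannot recover from."""


def wait_for_maps_ready(
    fetch_status: Callable[[], Mapping[str, object]],
    poll_max_attempts: int = 60,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until `mapsReady` is true; return the number of polls used."""
    for attempt in range(1, poll_max_attempts + 1):
        status = fetch_status()
        if status.get("mapsReady") is True:
            return attempt
        if status.get("failed") is True:  # hypothetical terminal-failure flag
            raise RouteTerminalFailureError(f"route failed on poll {attempt}")
        if attempt < poll_max_attempts:
            sleep(poll_interval_s)  # no sleep after the final attempt
    raise RouteTransientError(
        f"route not ready after {poll_max_attempts} polls "
        f"at {poll_interval_s} s intervals"
    )
```

Injecting `sleep` keeps the loop testable without waiting on a real clock; with the defaults the worst-case wait is about 60 × 5 s.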

### `satellite-provider` upload contract (per D-PROJ-2 carryforward)

The onboard side of D-PROJ-2 is fully specified in `_docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md`. From this architecture's standpoint:

- **`Tile` writes are append-only and idempotent** (the same `(zoomLevel, lat, lon, capture_timestamp, companion_id, flight_id)` tuple is the dedup key).
- **Quality metadata is mandatory on every uploaded tile** so the planned voting layer can promote `pending → trusted` without re-deriving statistics on the service side.
- **Onboard tiles never claim the `trusted` status**; they are uploaded as `pending` and the parent-suite voting layer (D-PROJ-2 design task #2) decides promotion.
- **Test substitute**: `mock-suite-sat-service` is an e2e-test-only fixture (under `tests/fixtures/mock-suite-sat-service/`) that implements the upload contract for NFT-SEC-01 / FT-P-17 / IT runs until D-PROJ-2 lands service-side. It is **not a component** in the architectural sense — the production architectural counterparty for both download and upload is the real `satellite-provider`. The fixture is retired the moment the real ingest endpoint ships. (Download + route-seed integration tests on the Jetson harness already run against the real service as of cycle 3.)
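The append-only, idempotent write rule and the "never claim `trusted`" rule from the list above can be sketched with an in-memory store. Everything except the dedup-key tuple is an assumption made for illustration: the `TileRecord` field types, the `AppendOnlyTileStore` class, and the choice to reject non-`pending` writes with `ValueError`.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

DedupKey = Tuple[int, float, float, int, str, str]


@dataclass(frozen=True)
class TileRecord:
    zoom_level: int
    lat: float
    lon: float
    capture_timestamp_ms: int  # timestamp unit is an assumption
    companion_id: str
    flight_id: str
    body: bytes
    status: str = "pending"  # onboard tiles never claim "trusted"

    def dedup_key(self) -> DedupKey:
        return (self.zoom_level, self.lat, self.lon,
                self.capture_timestamp_ms, self.companion_id, self.flight_id)


class AppendOnlyTileStore:
    def __init__(self) -> None:
        self._tiles: Dict[DedupKey, TileRecord] = {}

    def write(self, tile: TileRecord) -> bool:
        """Return True if stored, False if the same key was already present."""
        if tile.status != "pending":
            raise ValueError("onboard tiles must be written as 'pending'")
        key = tile.dedup_key()
        if key in self._tiles:
            return False  # idempotent: the first write wins, nothing is overwritten
        self._tiles[key] = tile
        return True

    def __len__(self) -> int:
        return len(self._tiles)
```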

---

## 6. Non-Functional Requirements

> Targets are taken verbatim from `acceptance_criteria.md` and `tests/traceability-matrix.md`. The Tests column points to the canonical `tests/` files where each NFR is exercised.

| Requirement | Target | Measurement | Priority | Tests |
|---|---|---|---|---|
| End-to-end latency (AC-4.1) | p95 ≤ 400 ms (steady-state and thermal-throttle hybrid) | NFT-PERF-01 (Tier-2); D-CROSS-LATENCY-1 partition | High | `tests/performance-tests.md` |
| Tail latency under thermal stress (AC-NEW-5 + AC-4.1) | p99 ≤ 600 ms; p95 ≤ 400 ms at +50 °C 8 h | NFT-9 hot-soak | High | `tests/performance-tests.md` |
| Memory cap (AC-4.2) | < 8 GB shared (CPU + GPU) on Jetson Orin Nano Super | NFT-LIM-01 8 h replay | High | `tests/resource-limit-tests.md` |
| Cold-start TTFF (AC-NEW-1) | p95 < 30 s from companion boot to first valid frame | NFT-PERF-03 (50× cold boot) | High | `tests/performance-tests.md` |
| Spoofing-promotion latency (AC-NEW-2) | p95 < 3 s on each FC | NFT-PERF-04 (SITL on AP + iNav) | High | `tests/performance-tests.md` |
| FDR storage (AC-NEW-3) | ≤ 64 GB / flight; no silent drops | NFT-LIM-02 8 h synthetic | Medium | `tests/resource-limit-tests.md` |
| False-position safety (AC-NEW-4) | P(err > 500 m) < 0.1 %; P(err > 1 km) < 0.01 %, with stated 95 % CI over current corpus | NFT-RES-03 Monte Carlo | High | `tests/resilience-tests.md` |
| Operating envelope (AC-NEW-5) | −20 °C to +50 °C; 25 W; 8 h no throttle | NFT-LIM-04 workstation baseline (chamber deferred) | High | `tests/resource-limit-tests.md` |
| Imagery freshness (AC-NEW-6, AC-8.2) | Reject/downgrade tiles violating 6 mo / 12 mo thresholds | FT-N-05 / FT-N-06 | High | `tests/blackbox-tests.md` |
| Cache-poisoning safety (AC-NEW-7) | Onboard-side: P(misalign > 30 m) < 1 %, P(> 100 m) < 0.1 %, with stated 95 % CI | NFT-SEC-01 onboard Monte Carlo + synthetic over-confidence injection | High | `tests/security-tests.md` |
| Visual blackout failsafe (AC-NEW-8) | Mode transition ≤ 400 ms; covariance grows monotonically; spoofed GPS never re-promoted without 10 s + visual consistency gate | FT-N-04 + NFT-RES-04 | High | `tests/resilience-tests.md` + `tests/blackbox-tests.md` |
| Cross-FC covariance honesty (AC-NEW-4 cross-FC) | `horiz_accuracy` (m, AP) and `hPosAccuracy` (mm, iNav) carry mathematically equivalent values from the same 2×2 sub-matrix | IT-10 cross-FC | High | `tests/blackbox-tests.md` |
| MAVLink message-signing posture (AC-4.3 + D-C8-9) | Signing enabled on AP wired channel; per-flight key rotation logged to FDR; iNav documented residual risk | NFT-8 + NFT-SEC-03 | High | `tests/security-tests.md` |
| Dependency CVE pinning (D-CROSS-CVE-1) | Target: OpenCV ≥ 4.12.0; SBOM clean of unpatched CVEs at audit time; monthly re-scan. **Cycle-1**: relaxed to `>=4.11.0.86,<4.12` per `_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md` (gtsam-4.2.1/numpy-1.x ABI block); CVE-2025-53644 to be re-validated against 4.11.0.86 before close. | NFT-10 SBOM CVE audit | High | `tests/security-tests.md` |
| GCS bandwidth budget (AC-6.1) | 1–2 Hz downsampled summary | FT-P-12 | Medium | `tests/blackbox-tests.md` |
| Frame-by-frame streaming (AC-4.4) | No batching/delay; estimates emitted per frame | NFT-PERF-02 | High | `tests/performance-tests.md` |
| Smoothing-loop look-back (AC-4.5, Mode B Fact #107) | FDR contains smoothed past-frame estimates; smoothing horizon converges within X m of ground truth at K = 10–20 keyframes | IT-11 | Medium | `tests/blackbox-tests.md` |
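The cross-FC covariance honesty row requires the metre-valued and millimetre-valued accuracy fields to come from the same 2×2 sub-matrix. One way to picture that is the sketch below. The reduction from covariance to a scalar is an assumption (here: the 1-sigma radius along the major axis, i.e. the square root of the largest eigenvalue); the document does not state which reduction C8 uses, and the dictionary keys are illustrative names, not wire-format fields.

```python
import math


def horizontal_sigma_m(cov_xx: float, cov_xy: float, cov_yy: float) -> float:
    """1-sigma radius along the major axis of a symmetric 2x2 covariance (m^2 in, m out)."""
    half_trace = 0.5 * (cov_xx + cov_yy)
    # Eigenvalues of [[a, b], [b, c]] are half_trace +/- sqrt(((a - c) / 2)^2 + b^2).
    spread = math.hypot(0.5 * (cov_xx - cov_yy), cov_xy)
    return math.sqrt(max(half_trace + spread, 0.0))


def project_accuracy(cov_xx: float, cov_xy: float, cov_yy: float) -> dict:
    """Derive both per-FC accuracy values from one scalar so they cannot diverge."""
    sigma_m = horizontal_sigma_m(cov_xx, cov_xy, cov_yy)
    return {
        "ap_horiz_accuracy_m": sigma_m,                         # metres, AP path
        "inav_h_pos_accuracy_mm": int(round(sigma_m * 1000.0)),  # millimetres, iNav path
    }
```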

---

## 7. Security Architecture

**Threat model** (one-page summary; full extraction lives in carryforward `security_analysis.md`):

- The companion is a **remote untrusted endpoint** from the parent-suite's standpoint: a downed UAV's companion can be physically captured. Persistent secrets must therefore be **per-flight ephemeral** wherever feasible.
- The **wired companion ↔ FC link** is the only physical-access-required attack surface for in-flight injection. MAVLink 2.0 signing on the AP path mitigates CVE-2026-1579 (D-C8-9 = (d)). iNav has no signing — accepted residual risk.
- The **GCS link** is bandwidth-limited and best-effort; a hostile GCS can spoof operator commands but cannot inject pose data (the system never accepts pose from GCS).
- **GPS spoofing** is treated as expected, not anomalous (AC-3.5, AC-NEW-2, AC-NEW-8). The system never lets a spoofed GPS source re-enter the estimator without a 10 s + visual-consistency gate.
- **Cache poisoning** is the dominant cross-flight attack vector (AC-NEW-7): a compromised companion could write a misaligned tile that becomes the next flight's anchor. The mitigation has two halves: onboard (honest covariance + quality metadata) and parent-suite (D-PROJ-2 voting layer, not yet implemented).
- **Pre-flight cache stage** is on the operator's workstation; the SHA-256 content-hash gate (D-C10-3) detects in-place tampering between stage and takeoff.
- **In-flight network egress is forbidden** (defense-in-depth: DNS blackhole + iptables OUTPUT REJECT, NFT-SEC-05). The only outbound paths from the companion are MAVLink to the FC and signed STATUSTEXT to the GCS.

**Authentication** (per integration):

| Integration | Mechanism |
|---|---|
| Companion ↔ ArduPilot Plane FC | MAVLink 2.0 message signing, per-flight key rotation (D-C8-9 = (d)) |
| Companion ↔ iNav FC | None (iNav has no signing implementation; accepted residual risk per Mode B Source #129) |
| Companion ↔ GCS (AP profile) | MAVLink 2.0 signing inherited from the FC channel |
| Operator workstation ↔ `satellite-provider` (pre-flight) | TLS + service-internal API key (workstation only; never on the airborne companion) |
| Companion ↔ `satellite-provider` (post-landing upload, **D-PROJ-2 planned**) | Per-flight onboard signing key carried with each uploaded tile; the planned ingest endpoint verifies the key |
| Operator workstation pre-flight stage | OS-level (operator login + workstation hardening — operator-orchestrator concern, C12) |

**Authorization**:
- **Onboard runtime**: a single principal (the runtime process); no in-process privilege boundaries. The Tile Manager (C11) runs as a different principal on the operator workstation, holding the only credentials that reach `satellite-provider` (TLS API key for download; per-flight onboard signing key for post-landing upload). The airborne image does not contain the C11 binary at all.
- **GCS**: operator commands (`AC-6.2`) are best-effort hints; the operator cannot promote a pose, override covariance, or reach the `satellite-provider` write path. Operator re-loc requests (C12 `OperatorReLocService` → `OperatorCommandTransport` over the GCS link) trigger the satellite re-localization flow (F6) but do not bypass any safety gate — the airborne pipeline still validates the hint against the visual/satellite consistency check before promoting any pose.

**Data protection**:
- **At rest**: tile cache + descriptor index + FDR are written to the companion's local NVM. No application-level encryption (the threat model treats a captured companion as compromised; encryption would buy little against physical access). Operator-side `satellite-provider` storage is the parent-suite's concern.
- **In transit**: MAVLink 2.0 message signing on the AP channel; MSP2 unsigned on iNav. The post-landing upload runs over TLS to `satellite-provider`.
- **Secrets management**:
  - **Per-flight MAVLink signing key**: generated at takeoff load; rotated per flight; logged to FDR.
  - **Per-flight onboard signing key for tile upload**: generated at takeoff load; baked into mid-flight tile metadata; consumed by C11 post-landing.
  - **Pre-flight service API key**: stays on the operator workstation; never written to the companion image.
  - **No long-lived secrets on the companion image** beyond firmware-level boot signatures (out of scope).

**Audit logging**:

| What | Where | Retention |
|---|---|---|
| All emitted external-position frames + covariance + provenance label | FDR (C13) | per flight (≤ 64 GB; rollover oldest-first) |
| All received MAVLink + MSP2 frames (raw `tlog` stream) | FDR | per flight |
| MAVLink 2.0 signing key rotation events | FDR | per flight |
| Spoofing-promotion / spoofing-rejection events | FDR + GCS STATUSTEXT | per flight + best-effort GCS link |
| `VISUAL_BLACKOUT_*` STATUSTEXT events (AC-3.5, AC-NEW-8) | FDR + GCS STATUSTEXT | per flight + best-effort |
| C10 content-hash gate fail events | FDR + companion refuses takeoff | per flight |
| Mid-flight tile-gen failures | ≤ 0.1 Hz thumbnail log inside FDR (AC-8.5 forensic exception) | per flight |
| Component health (CPU/GPU/temp/throttle) | FDR | per flight |
| Source-set switch events (D-C8-2 EKF source-set) | FDR + GCS STATUSTEXT | per flight |
| Production binary SBOM provenance | release artifacts; not on the deployed companion | per release |

---

## 8. Key Architectural Decisions

> These ADRs distill the user-confirmed Mode-B locks plus this architecture's first-time choices. ADRs are also tracked in `_docs/00_research/06_component_fit_matrix/MODEB_revisions.md` and (for cross-component gates) `99_cross_component_gates.md`. Step 4 (Risk Review) iterates on them; this section is the authoritative entry point.
### ADR-001 — VioStrategy is selected at startup via config; not hot-swappable
**Context**: Three VIO implementations are required (OKVIS2 production-default, VINS-Mono research-only, KLT+RANSAC mandatory simple-baseline). Hot-swap mid-flight would add re-initialisation cost on every switch and would require keeping multiple solvers warm in 8 GB shared memory.

**Decision**: VioStrategy is selected at startup from a single config knob (`vio.strategy: okvis2 | vins_mono | klt_ransac`), and the choice is constant for the flight. The `VioStrategy` interface owns the abstraction; concrete strategies own their per-strategy concerns (OKVIS2's ROS bring-up, VINS-Mono's build flag, KLT's degraded covariance). Build-time inclusion / exclusion of individual strategies is governed separately by ADR-002.

**Alternatives considered**:
1. Hot-swap at runtime — rejected: re-init cost + memory footprint inside AC-4.2.
2. Single-strategy build per binary — rejected: defeats the IT-12 comparative-study objective on the research binary.

**Consequences**: A flight is locked to one VIO; failure of the active strategy = AC-5.2 fallback (FC IMU-only). The comparative study is a per-replay artifact, not a runtime decision.
**Cycle-1 operational note (2026-05-19, post-Implement)**: AZ-332 (OKVIS2) and AZ-333 (VINS-Mono) shipped as facade-only with `BLOCKED` terminal classification per the implement skill's PASS-with-BLOCKED policy (Tier-2 prerequisites: CI build env + Jetson hardware + DBoW2 vocab artifact for AZ-332; same plus upstream-vendoring decision for AZ-333). The `_STRATEGY_REGISTRY` (see ADR-009 cycle-1 note below) registers all three slots so the seam stays correct, but selecting `okvis2` or `vins_mono` raises `StrategyNotAvailableError` from `vio_factory.py` until the gating `BUILD_*` flag turns on. The cycle-1 production-default selection is **`klt_ransac`** (AZ-334). Follow-ups: **AZ-592** (Tier-2 OKVIS2 wiring) and **AZ-593** (VINS-Mono vendoring + wiring) — both parked in `_docs/02_tasks/backlog/`. Closed Won't-Fix during cycle 1: AZ-589 + AZ-590 (original remediation — they targeted upstream APIs that don't exist in the actually-checked-in OKVIS2 submodule). Full post-mortem in `_docs/03_implementation/implementation_completeness_cycle1_report.md` § "Verdict — Revised 2026-05-16".
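The startup-selection behaviour described in this ADR (all three slots registered, unbuilt ones raising `StrategyNotAvailableError`) can be pictured with a small registry. This is a sketch under assumptions, not the contents of `vio_factory.py`: the `None`-means-not-built convention, the `KltRansacVio` placeholder class, and the `create_vio_strategy` function name are all hypothetical.

```python
from typing import Callable, Dict, Optional


class StrategyNotAvailableError(RuntimeError):
    """Stand-in for the error raised when a registered strategy is not built."""


class KltRansacVio:
    """Placeholder for the KLT+RANSAC baseline; the real class does far more."""
    name = "klt_ransac"


# One slot per known strategy. None marks a slot whose implementation is not
# linked into this binary (its BUILD_* flag is off); the slot still exists so
# the config schema stays the same across binaries.
_STRATEGY_REGISTRY: Dict[str, Optional[Callable[[], object]]] = {
    "okvis2": None,
    "vins_mono": None,
    "klt_ransac": KltRansacVio,
}


def create_vio_strategy(config: dict) -> object:
    """Resolve `vio.strategy` once at startup; the result is fixed for the flight."""
    name = config["vio"]["strategy"]
    if name not in _STRATEGY_REGISTRY:
        raise ValueError(f"unknown vio.strategy: {name!r}")
    factory = _STRATEGY_REGISTRY[name]
    if factory is None:
        raise StrategyNotAvailableError(
            f"{name!r} is registered but not built into this binary"
        )
    return factory()
```

Distinguishing "unknown name" from "known but not built" lets the config validator give the operator a precise fail-fast message in each case.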
### ADR-002 — Build-time exclusion of unused `Strategy` implementations (D-C1-1-SUB-A = (a))
**Context**: The architecture deliberately requires multiple interchangeable implementations per component (three `VioStrategy` for C1; multiple `VprStrategy` for C2; two FC adapters for C8). At runtime each binary uses exactly one of them per component. Linking *all* implementations into every binary would inflate binary size on the 8 GB shared Jetson, increase boot/load time inside the AC-NEW-1 ≤ 30 s p95 budget, expand the deployed dependency / attack surface, and create accidental-selection risk (a misconfigured runtime accidentally booting a non-deployment-default strategy). A single binary with all strategies present is also harder to reason about for the IT-12 comparative study, which deliberately wants the *opposite* — every strategy present and replayed against the same footage.

This decision is made on **technical grounds only**. Component licenses (BSD/Apache/MIT/LGPL/GPL/etc.) **do not influence** which strategy is the deployment-default — that choice is the IT-12 measured-performance verdict on the project's operating context (Jetson Orin Nano Super + ADTi 20MP 20L V1 + Derkachi-class footage).

**Decision**:
1. **Per-component CMake `BUILD_*` flag** controls whether each implementation is linked into a given binary (`BUILD_VINS_MONO`, `BUILD_SALAD`, etc.). The default deployment binary links the production-default strategy (OKVIS2 on C1 today, pending IT-12 verdict) plus the engine-rule-mandatory simple-baseline (KltRansac on C1). The research binary links every available strategy of every component for IT-12.
2. **The Strategy interface boundary makes the exclusion architectural** rather than configurational: sibling components import only the `Strategy` interface, never a concrete implementation. The composition root (one per binary, see ADR-009) is the only place that names concrete classes, and a class whose file is not part of the CMake target cannot be named there — so a misconfigured deployment cannot accidentally pull in an unintended strategy.
3. **Selection at startup** (config-driven; ADR-001) picks among the linked-in strategies. A binary with only OKVIS2 + KltRansac linked exposes only those two values for `vio.strategy`; the config validator fails fast if asked for `vins_mono`.
4. **CI emits both binaries on every PR** (deployment + research) so the comparative-study artifact is always reproducible alongside the deployable artifact.

**Alternatives considered**:
1. **Single binary with all strategies linked, runtime config picks one** — rejected on binary size + boot time + accidental-selection risk + unnecessary dependency surface on the deployed device.
2. **Process-isolation IPC for the unused strategies** — rejected on latency budget conflict (D-CROSS-LATENCY-1) and operational complexity of two-process deployments on a 25 W edge device.
3. **Multiple deployment-binary variants tailored to specific customer bundles** — out of scope of this ADR; supported as a *consequence* (see Consequences NOTE) but not a driver of the decision.

**Consequences**:
- Two CI binaries on every PR; both must build and test green.
- Adding any new strategy to a component is a folder-add + a CMake `BUILD_*` flag + an entry in the relevant binary's composition root. No call-site changes anywhere.
- The deployment binary's SBOM is what it is — a *consequence* of which `BUILD_*` flags were `ON`, not a driver of which flags should be `ON`.
- **NOTE — packaging optionality (deferred / non-binding).** Because the exclusion is per-implementation per-CMake-flag, the same code base can produce additional binaries — for different deployment targets, different customer bundles, or different end-product licensing bundles **if and when product licensing is decided later**. This architecture **deliberately makes no licensing decisions today**: component licenses do not influence which strategy is the deployment-default, and the decision above is purely technical. When packaging strategy is finalized, the same `BUILD_*` flag mechanism produces the right bundle without source-level changes — that optionality is a *side benefit* of the interface-first design (Principle #13 + ADR-009), not a justification for it.
### ADR-003 — Honest 6×6 covariance via GTSAM Marginals is the safety floor (D-C5-5 = (c))
**Context**: AC-NEW-4 and the cross-FC covariance honesty (IT-10) require a single, mathematically-recoverable 6×6 posterior covariance per emitted frame. ESKF-style Jacobian-based covariance is faster but loses information across the C4–C5 boundary.

**Decision**: C5 is GTSAM iSAM2 + `CombinedImuFactor` + `BetweenFactorPose3` + `GenericProjectionFactorCal3DS2`, with `Marginals.marginalCovariance(pose_key)` recovering the 6×6 posterior. C4 is OpenCV `solvePnPRansac` wrapped in a GTSAM factor so C4 and C5 share the same substrate. D-CROSS-LATENCY-1 hybrid auto-degrades C4 covariance to Jacobian-based (D-C4-2 = (a)) under thermal throttle, but C5 stays on Marginals.

**Alternatives considered**:
|
||
1. ESKF-only with Jacobian covariance — rejected: loses cross-component covariance honesty; engine-rule mandatory simple-baseline only.
|
||
2. Dual estimators (ESKF + iSAM2) — rejected: memory + complexity + the hybrid auto-degrade already covers thermal stress.
|
||
|
||
**Consequences**: GTSAM is a hard runtime dependency; AC-4.5 internal smoothing is for free; per-frame covariance recovery costs 30–90 ms in steady state (auto-degrades to 5–15 ms under thermal throttle).
|
||
|
||
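As a reminder of what "recovering the posterior covariance" means for a Gaussian factor graph, the following is a textbook sketch and not a description of GTSAM's internal algorithm. It assumes the linearized posterior is written in information form, with $p$ the 6-DoF pose variable and $r$ every remaining variable in the graph; the symbols are introduced here for illustration only.

```latex
\Lambda =
\begin{bmatrix}
\Lambda_{pp} & \Lambda_{pr} \\
\Lambda_{rp} & \Lambda_{rr}
\end{bmatrix},
\qquad
\Sigma_{pp}
= \left(\Lambda^{-1}\right)_{pp}
= \left(\Lambda_{pp} - \Lambda_{pr}\,\Lambda_{rr}^{-1}\,\Lambda_{rp}\right)^{-1}
\in \mathbb{R}^{6\times 6}

% Dropping the coupling term can only shrink the reported covariance:
\Lambda_{pp}^{-1} \preceq \Sigma_{pp}
```

The second line is one way to see why ignoring cross-variable coupling is risky: because $\Lambda_{pr}\Lambda_{rr}^{-1}\Lambda_{rp}$ is positive semi-definite, a covariance computed from the pose block alone is never larger than the true marginal, so it can only err on the over-confident side.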
### ADR-004 — Process-level isolation for in-air upload prevention (AC-8.4 enforcement)

**Context**: AC-8.4 forbids in-air outbound writes to `satellite-provider` for drone-security reasons. The companion is also read-only against `satellite-provider` while airborne — there is no operational reason to fetch tiles in flight either, since the pre-flight cache is the contract. A software guard checking `flight_state == ON_GROUND` can be bypassed by code injection if the network I/O code path is ever loaded.

**Decision**: The Tile Manager (C11) is a **separate binary / image** that runs only on the operator's workstation; the airborne companion image does not contain the C11 binary at all — neither the `TileDownloader` (pre-flight) nor the `TileUploader` (post-landing) code paths can be reached from the airborne process. The defense-in-depth software guard is owned by **C12's `PostLandingUploadOrchestrator`**, which reads the `flight_footer` FDR record's `clean_shutdown` field before invoking C11's `TileUploader` (Batch 44 SRP refactor — the gate's single source of truth is the FDR footer C13 writes only on clean shutdown; C11 itself no longer gates). The local mid-flight tile format is byte-identical to `satellite-provider`'s on-disk layout so no transformation is needed at upload time.

**Why the gate moved to C12 (Batch 44)**: An earlier iteration placed the gate inside C11's `TileUploader` (consuming a live `FlightStateSignal` from C8). That duplicated the safety invariant on both sides of the C11/C12 boundary and coupled C11 to C8 just for the post-landing check. The current design (a) consolidates ownership on the operator-side workflow head (C12) — single responsibility per component, single source of truth for "vehicle is fully stopped" (= C13's footer write decision), and (b) collapses an arbitrary 30-second hold-down heuristic to an exact boolean (`clean_shutdown`). The `TileUploader` Protocol contract is frozen at v2.0.0 with the gate parameters removed; AZ-317 is superseded.

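The footer-driven gate described above could look roughly like the sketch below. Only `PostLandingUploadOrchestrator`, `TileUploader`, `flight_footer`, and `clean_shutdown` come from this ADR; the `FlightFooter` dataclass, the `upload(flight_id)` signature, the `run` method, and `UploadRefused` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class FlightFooter:
    """Hypothetical in-memory view of the `flight_footer` FDR record."""
    flight_id: str
    clean_shutdown: bool


class TileUploader(Protocol):
    """Assumed shape of the gate-free uploader; the real v2.0.0 signature may differ."""
    def upload(self, flight_id: str) -> int: ...


class UploadRefused(RuntimeError):
    """Raised when the footer does not attest a clean shutdown."""


class PostLandingUploadOrchestrator:
    def __init__(self, *, uploader: TileUploader) -> None:
        self._uploader = uploader

    def run(self, footer: Optional[FlightFooter]) -> int:
        # A missing footer means C13 never wrote one, so it is treated as unclean.
        if footer is None or not footer.clean_shutdown:
            raise UploadRefused("no clean_shutdown footer; refusing to upload")
        # The gate is an exact boolean; the uploader itself performs no check.
        return self._uploader.upload(footer.flight_id)
```

In this shape the uploader never sees flight state at all, which mirrors the "C11 itself no longer gates" statement: the only way to reach `upload` is through a footer whose `clean_shutdown` is true.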
**Enforcement gates (per R02 risk register)**:

1. **CI SBOM diff**: the build pipeline fails the airborne `production-binary` artifact if any symbol from `c11_tilemanager/` (or any module that transitively imports `c11_tilemanager`) appears in the linked image. This is an extension of the per-implementation SBOM enforcement already in ADR-002.
2. **Runtime self-check in `runtime_root.py`**: at startup, before opening the FC adapter, the airborne composition root attempts `importlib.util.find_spec("c11_tilemanager")` and panics if the spec resolves to anything other than `None`. Cost: one import lookup at startup; benefit: catches a build-system regression even if SBOM diff was bypassed.
3. **Network egress test (NFT-SEC-02)**: the airborne process is run inside a network namespace with no route to `satellite-provider`'s host; any attempted outbound TCP connection to it is a release-blocking test failure.

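Gate 2 can be sketched in a few lines of standard-library Python. The `importlib.util.find_spec` call is the one named above; the function name, the exception class, and the `FORBIDDEN_MODULES` tuple are hypothetical and only illustrate the shape of the check.

```python
import importlib.util
from typing import Iterable

# Module names that must never be importable from the airborne image (per ADR-004).
FORBIDDEN_MODULES = ("c11_tilemanager",)


class ForbiddenModuleLinkedError(RuntimeError):
    """Hypothetical error type: an operator-side module is reachable in this image."""


def assert_not_importable(module_names: Iterable[str] = FORBIDDEN_MODULES) -> None:
    """Abort startup if any forbidden top-level module can be resolved."""
    for name in module_names:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is not None:
            raise ForbiddenModuleLinkedError(
                f"{name!r} is importable in the airborne image; refusing to start"
            )
```

A composition root would call `assert_not_importable()` before opening the FC adapter. The check only looks up the import spec and never executes the forbidden module, so it costs one lookup per name.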
**Alternatives considered**:

1. Single binary with software-only guard — rejected on principle: a runtime guard cannot be the primary control for an "is the system airborne?" safety property.
2. Hardware-level switch (e.g., physical write-enable jumper) — rejected: adds operations cost; software-image-isolation gives equivalent assurance for this threat model.

**Consequences**: Two binaries to maintain (companion image + operator-orchestrator image). CI builds and tests both. The operator workflow has an explicit post-landing step ("run the upload tool") which is itself a feature, not a bug.

### ADR-005 — Two execution tiers (Tier-1 / Tier-2) are first-class architectural concerns (F6)

**Context**: AC-4.1 latency, AC-4.2 memory, AC-NEW-1 cold-start, AC-NEW-3 FDR storage, AC-NEW-5 thermal envelope, and AC-NEW-7 cache-poisoning can only be validated on Jetson hardware; a workstation cannot replicate those conditions. Conversely, most logic, integration, and contract tests run in seconds on Tier-1 and would take orders of magnitude longer on Tier-2.

**Decision**: Tier-1 = workstation Docker (fast/cheap; runs lint + unit + most integration + Mock `satellite-provider`); Tier-2 = Jetson hardware (AC-bound jobs only; runs NFT-PERF-* + NFT-LIM-* + NFT-RES-* + IT-12). Both tiers are documented in the deployment plan and the CI matrix; failure on either tier is release-blocking. Tier-2 runner availability is itself a risk-register entry.

**Alternatives considered**:

1. Tier-2-only — rejected: order-of-magnitude slower iteration loop; runner-availability risk dominates.
2. Tier-1-only — rejected: AC-bound NFTs cannot pass without Jetson hardware in the loop.

**Consequences**: CI is split; some tests have an explicit `tier: 2` annotation in `tests/environment.md`; release artifacts include both tier results.

### ADR-006 — D-CROSS-LATENCY-1 hybrid is the AC-4.1 budget strategy

**Context**: At +50 °C ambient (AC-NEW-5 upper-temp), the Jetson auto-throttles, collapsing the steady-state K=3 latency budget. AC-4.1 has no thermal carve-out — the 400 ms p95 must hold across the operating envelope.

**Decision**: K=3 baseline (DISK+LightGlue × 3 candidates from C2.5; GTSAM Marginals 6×6 covariance recovery in C4) auto-degrades to K=2 + Jacobian-based covariance under thermal throttle. The trigger is the Jetson's thermal-throttle telemetry crossing a configurable temperature/clock threshold (set per D-C7-9 JetPack 6.2 + TensorRT 10.3 lock). NFT-9 hot-soak validates the hybrid.

**Alternatives considered**:

1. K=3 fixed + larger latency budget — rejected: AC-4.1 is the contract.
2. K=2 always — rejected: ~5–10 % accuracy loss at steady state hurts AC-NEW-4 headroom.

**Consequences**: ~5–10 % accuracy loss at the upper thermal envelope (still inside AC-NEW-4). The hybrid is part of the runtime, not a config knob; the threshold is.

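The "hybrid is runtime, threshold is config" split can be illustrated with a minimal selection function. The K=3 / K=2 values and the Marginals / Jacobian modes come from this ADR; every name below (`MatchingPlan`, `select_plan`, the telemetry arguments) is hypothetical, and the real trigger may also consider clock telemetry or hysteresis that this sketch leaves out.

```python
from dataclasses import dataclass
from enum import Enum


class CovarianceMode(Enum):
    MARGINALS = "marginals"   # GTSAM Marginals 6x6 recovery
    JACOBIAN = "jacobian"     # cheaper Jacobian-based covariance


@dataclass(frozen=True)
class MatchingPlan:
    k_candidates: int
    c4_covariance: CovarianceMode


STEADY_STATE = MatchingPlan(k_candidates=3, c4_covariance=CovarianceMode.MARGINALS)
DEGRADED = MatchingPlan(k_candidates=2, c4_covariance=CovarianceMode.JACOBIAN)


def select_plan(soc_temp_c: float, throttled: bool, threshold_c: float) -> MatchingPlan:
    """Pick the per-frame plan from thermal telemetry.

    `threshold_c` is the only configurable value; the two plans are fixed in code.
    """
    if throttled or soc_temp_c >= threshold_c:
        return DEGRADED
    return STEADY_STATE
```

For example, with an illustrative threshold of 85 °C (not a value from this document), `select_plan(60.0, False, 85.0)` keeps K=3 with Marginals, while `select_plan(90.0, False, 85.0)` or any call with `throttled=True` drops to K=2 with Jacobian covariance.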
### ADR-007 — `mock-suite-sat-service` is an e2e-test fixture, not a first-class component (REVERSED 2026-05-09)

**Context**: D-PROJ-2 (parent-suite ingest endpoint + voting layer) is not yet implemented. NFT-SEC-01 / FT-P-17 / IT runs need a counterparty for the post-landing upload contract. An earlier iteration of this ADR promoted the mock to a first-class component boundary peer of `satellite-provider`, with its own description under `components/` and its own deployable image — to make the contract auditable.

**Decision (current)**: the mock is **an e2e-test fixture only**, scoped under `tests/fixtures/mock-suite-sat-service/`. The architectural counterparty for both the existing download path and the planned D-PROJ-2 upload path is the **real** `satellite-provider`. The contract sketch lives in `_docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md` (the source of truth for the parent-suite work) and is mirrored in C11 Tile Manager's external API section (the onboard consumer's view). The mock implements that contract in tests; production never reaches it.

**Why reversed**: promoting an e2e-test fixture to a component boundary inflated the architectural surface and risked the test fixture drifting away from the real contract once D-PROJ-2 lands. The contract sketch in the leftover file is sufficient as the auditable source of truth without a separate component spec.

**Alternatives considered**:

1. Keep ADR-007 as originally written — rejected: see "Why reversed".
2. Wait for D-PROJ-2 service-side implementation before any tests — rejected: blocks the onboard cycle.

**Consequences**: The mock continues to ship in the operator-orchestrator tarball's compose file as a test-time service, but it is no longer documented under `_docs/02_document/components/`. Test specs and CI references treat it as a fixture. When `satellite-provider` ships the real endpoint, the fixture is replaced by pointing tests at the real service; no architectural changes flow from that switch.

### ADR-008 — D-C8-2 source-set switch is `Selected with runtime gate` (Mode B Fact #111)

**Context**: AC-NEW-2 requires spoofing-promotion latency < 3 s. The companion-driven `MAV_CMD_SET_EKF_SOURCE_SET` switch (D-C8-2 = (b)) is firmware-supported but has no production-deployed precedent — the project would establish the canonical pattern.

**Decision**: D-C8-2 = (b) is selected with a runtime gate: ArduPilot Plane SITL validation (IT-3) is the lock gate. If IT-3 fails, D-C8-2-FALLBACK options are recorded — (a) operator-manual RC aux switch with relaxed AC-NEW-2 wording; (b) operator-warning STATUSTEXT instead of automated switch; (c) escalate to ArduPilot dev community.

**Alternatives considered**: see D-C8-2-FALLBACK above.

**Consequences**: AC-NEW-2 contractual latency is contingent on IT-3 passing. If IT-3 fails, AC-NEW-2 wording is renegotiated as part of D-C8-2-FALLBACK = (a).

### ADR-009 — Interface-first components, constructor injection, one folder per component

**Context**: The architecture deliberately requires multiple interchangeable implementations per component (three `VioStrategy` for C1; UltraVPR / MegaLoc / MixVPR / SelaVPR / EigenPlaces / NetVLAD / SALAD candidates for C2; pymavlink-AP and YAMSPy-iNav adapters for C8). ADR-002 further mandates that the **same logical component** ship in different concrete forms across binaries (deployment binary vs IT-12 research binary; future packaging variants if/when needed). Without a strict interface boundary, sibling components import each other's concrete classes; build-time exclusion via `BUILD_*` flags becomes a fragile compile-time afterthought rather than an architectural property; testing each strategy in isolation requires monkey-patching; and adding a new strategy ripples into every call site. The interface-first pattern is the architectural mechanism that makes ADR-001 (runtime selection) and ADR-002 (build-time exclusion) tractable simultaneously.

**Decision**:

1. **Interface first.** Every component is specified as a Python `Protocol` (or `abc.ABC`, when concrete defaults are useful) **before** any concrete implementation is written. The interface is the contract; concrete implementations satisfy it. Step 3 component specs document the interface signature; concrete implementations are documented under their own header inside the component spec.

2. **One folder per component.** Source layout (per `coderule.mdc` "place source code under `src/`"):

```
src/
  components/
    c1_vio/
      __init__.py
      interface.py  # VioStrategy Protocol + VioOutput, VioConfig DTOs
      okvis2_strategy.py  # deployment-default (pending IT-12 verdict)
      vins_mono_strategy.py  # research-only; behind BUILD_VINS_MONO (ADR-002)
      klt_ransac_strategy.py  # engine-rule-mandatory simple-baseline
      tests/
    c2_vpr/
      __init__.py
      interface.py  # VprStrategy Protocol
      ultra_vpr.py  # deployment-default (Documentary Lead PRIMARY)
      mega_loc.py
      mix_vpr.py  # mandatory simple-baseline alternate
      sela_vpr.py
      eigen_places.py
      net_vlad.py  # mandatory simple-baseline classical
      salad.py  # additional candidate; behind BUILD_SALAD (ADR-002)
      tests/
    c2_5_rerank/
      interface.py  # ReRankStrategy
      inlier_count_rerank.py
      tests/
    c3_matcher/
      interface.py  # CrossDomainMatcher
      disk_lightglue.py
      aliked_lightglue.py
      xfeat.py
      tests/
    c3_5_adhop/
      interface.py  # ConditionalRefiner
      adhop_refiner.py
      passthrough_refiner.py  # for non-conditional baseline
      tests/
    c4_pose/
      interface.py  # PoseEstimator
      opencv_gtsam_estimator.py
      tests/
    c5_state/
      interface.py  # StateEstimator
      gtsam_isam2_estimator.py
      eskf_estimator.py  # mandatory simple-baseline
      tests/
    c6_tile_cache/
      interface.py  # TileStore + TileMetadataStore + DescriptorIndex
      postgres_filesystem_store.py
      faiss_descriptor_index.py
      tests/
    c7_inference/
      interface.py  # InferenceRuntime
      tensorrt_runtime.py
      onnx_trt_ep_runtime.py
      pytorch_fp16_runtime.py
      tests/
    c8_fc_adapter/
      interface.py  # FcAdapter (in+out), GcsAdapter
      pymavlink_ardupilot_adapter.py
      msp2_inav_adapter.py
      qgc_telemetry_adapter.py
      tests/
    c10_cache_provisioning/
      interface.py  # CacheProvisioner, ManifestVerifier
      provisioner.py
      tests/
    c11_tilemanager/  # SEPARATE BINARY — never linked into airborne image
      interface.py  # TileDownloader, TileUploader (two interfaces in one component)
      http_tile_downloader.py
      http_tile_uploader.py
      tests/
    c13_fdr/
      interface.py  # FdrWriter
      file_fdr_writer.py
      tests/
  composition/
    runtime_root.py  # composition root: config -> concrete graph
    tilemanager_root.py  # composition root for the C11 operator-side tool (download + upload)
    research_root.py  # composition root for the research/dev binary
```

3. **Constructor injection only.** Every component class declares its collaborators as **typed `__init__` arguments**, against the sibling's interface (not the concrete class). Example sketch:

```python
# src/components/c4_pose/interface.py
# Sketch only: the DTOs and interfaces named in annotations live in other modules.
from __future__ import annotations  # annotations stay unevaluated, so the sketch parses as-is

from typing import Protocol


class PoseEstimator(Protocol):
    def estimate(self, match: MatchResult, calibration: CameraCalibration) -> PoseEstimate: ...


# src/components/c5_state/gtsam_isam2_estimator.py
class GtsamIsam2StateEstimator:
    def __init__(
        self,
        *,
        pose_estimator: PoseEstimator,  # interface, not concrete
        imu_source: ImuSource,          # interface
        fdr: FdrWriter,                 # interface
        config: StateEstimatorConfig,
    ) -> None:
        # Collaborators are stored as given; no concrete class is imported here.
        self._pose = pose_estimator
        self._imu = imu_source
        self._fdr = fdr
        self._cfg = config
```

4. **Composition root** (`src/composition/runtime_root.py`) is the **only** place that knows about concrete classes. It reads config, picks each concrete implementation, validates that every named implementation is actually linked into the active binary (fails fast otherwise), and wires the graph. Every other module sees only interfaces. **Build-time exclusion (ADR-002) becomes architectural**, not configurational: the deployment binary's composition root literally cannot wire `VinsMonoVioStrategy` because that file is not linked into the deployment binary (`BUILD_VINS_MONO=OFF`). Future packaging variants (e.g., a customer bundle with a different `VprStrategy` set) work the same way — a different `BUILD_*` flag combination + the same composition root code.

5. **Python DI mechanism**: hand-rolled constructor injection in the composition root is the default — it has no extra dependency, is trivially understandable, and matches the pattern of "select once at startup, never hot-swap". A heavier DI library (`dependency-injector`, `injector`, `punq`) is **only** introduced if the composition root grows past ~150 lines or test-side wiring becomes repetitive; that is a Plan-phase deferred decision (carryforward), not a current architectural commitment. Mocking in tests is via simple stub classes that satisfy the same `Protocol` — no monkey-patching, no `unittest.mock.patch`.

6. **Test wiring**: each component's `tests/` folder owns the test composition for that component. Test composition roots wire the unit-under-test against in-memory / fake implementations of every interface dependency. Cross-component integration tests (Tier-1) compose multiple real components with a fake `FcAdapter` + fake `TileStore` + fake `InferenceRuntime`. End-to-end Tier-2 tests run against the real composition root.

**Alternatives considered**:

1. **Sibling concrete imports** (`from c5_state.gtsam_isam2 import GtsamIsam2StateEstimator`) — rejected: makes ADR-002 build-time exclusion a CMake / SBOM artifact rather than an architectural property; couples C4 to a specific C5 implementation and vice versa; defeats the per-component test wiring; ripples into every call site whenever a new strategy is added.
2. **Service locator / global registry** (e.g., a process-wide DI singleton accessed via `get_service(VioStrategy)`) — rejected: hides the dependency graph from constructors, makes test isolation harder, and re-introduces the singletons banned in coderule.mdc.
3. **Function-based DI** (passing factories instead of instances) — rejected as the default: more cognitive overhead than constructor injection for a startup-bound, never-hot-swapped runtime. Reserved for the few call sites where lazy construction is genuinely required (e.g., the per-flight MAVLink signing key generator).
4. **Heavy DI framework** (`dependency-injector`, `injector`, `punq`) from day one — rejected as default: introduces a runtime dependency for a problem the composition root can solve in plain Python; reserved as an opt-in if the composition root outgrows hand-rolled wiring.

**Consequences**:

- Step 3 component decomposition produces, for **every** component: an `interface.py` description + ≥ 1 concrete implementation description + a test composition.
- The composition root is itself a reviewable artifact (a single Python file per binary track) that documents which concrete implementations a given binary contains.
- Build-time exclusion (ADR-002) becomes architectural: the deployment composition root *cannot* `import` a strategy whose file is not part of the deployment binary's CMake target. The same property scales to any future packaging variant — including, if/when product licensing strategy is decided, license-driven bundles (Principle #13 NOTE), without any source-level change in application code.
- Per-component folders give each implementation a natural home for its own `tests/`, fixtures, and adapter-specific helpers — matching coderule.mdc's "logic specific to a platform, variant, or environment belongs in the class that owns that variant".
- Adding a new C2 VPR backbone (e.g., a future foundation-model retrieval backbone via D-C2-12) is a folder-add + interface-conformance change; no other component is touched.

#### Cross-Component Contract Surface (AZ-507)

The ADR-009 "interface, not concrete" rule has an architectural sibling: cross-component imports go through `_types/*.py` (DTOs + typed-error envelopes such as `_types.inference_errors`), never through `components.X (Public API)`. The only exception is `runtime_root/*` (the composition root), which is allowed to import concrete strategies across components precisely because it is the single place that resolves Protocol parameters to concrete classes. Every other module under `components/**/*.py` consumes cross-component contracts via (a) shared DTOs in `_types/*`, and (b) consumer-side structural `Protocol` cuts defined locally inside the consuming component (e.g. `c10_provisioning.engine_compiler.CompileEngineCallable` for the narrow `compile_engine` surface of the C7 InferenceRuntime). This is the same architectural property as constructor-injection-against-interface, applied to the import graph rather than the call graph. The AZ-270 `test_az270_compose_root.test_ac6_only_compose_root_imports_concrete_strategies` lint enforces this on every `components/**/*.py`; AZ-507 reconciles `module-layout.md` with the lint so the documentation and the build gate agree.

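A consumer-side structural `Protocol` cut might look like the sketch below. Only the name `CompileEngineCallable` and the idea of a narrow `compile_engine` surface come from the text; the call signature, the `EngineProvisioner` class, and its `provision` method are hypothetical.

```python
from pathlib import Path
from typing import List, Protocol


class CompileEngineCallable(Protocol):
    """Narrow cut defined inside the consuming component.

    The consumer never imports the C7 runtime; it only states the one call it needs.
    The signature here is an assumption made for illustration.
    """
    def __call__(self, onnx_path: Path, *, fp16: bool) -> Path: ...


class EngineProvisioner:
    """Hypothetical consumer that depends on the cut, not on the concrete runtime."""

    def __init__(self, *, compile_engine: CompileEngineCallable) -> None:
        self._compile_engine = compile_engine

    def provision(self, onnx_paths: List[Path]) -> List[Path]:
        # Any callable with a matching shape satisfies the Protocol structurally.
        return [self._compile_engine(path, fp16=True) for path in onnx_paths]
```

Because the match is structural, the composition root can pass a bound method of the real runtime while a test passes a plain function, and neither side needs an import across the component boundary.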
#### Cycle-1 implementation: `_STRATEGY_REGISTRY` + `pre_constructed` (AZ-591, AZ-618)

Two cross-cutting Tier-1 mechanisms shipped inside `runtime_root/` during cycle 1 that the Plan-era ADR-009 sketch did not anticipate. Both are operational prerequisites for `compose_root()` reaching takeoff and are extensions of — not deviations from — the constructor-injection-against-interface rule above.

1. **`_STRATEGY_REGISTRY` + `register_strategy(...)` API (AZ-591).** A module-level `dict[(component_slug, strategy_name) → _Registration]` populated per-binary. The airborne entrypoint calls `runtime_root.airborne_bootstrap.register_airborne_strategies()` once at process start, which fills 7 strategy-selecting airborne component slots (`c1_vio`, `c2_vpr`, `c2_5_rerank`, `c3_matcher`, `c3_5_adhop`, `c4_pose`, `c5_state`) with `tier="airborne"`. Without this, `compose_root()` raises `StrategyNotLinkedError` on the first config-driven strategy lookup. The registry is the **runtime-side complement to ADR-002 build-time exclusion**: the build chooses which strategies are even available to register; the registry chooses which one this binary serves; the config chooses which registered slot to wire. A misconfigured runtime asking for an unlinked strategy still fails fast (`StrategyNotLinkedError` carries the offending strategy name + component slug + actually-linked alternatives — operator gets a clear next step). The `register_strategy` call site is restricted by lint (AZ-270): only the composition root or a binary-specific bootstrap module may call it; calls from component modules are an architecture violation.

2. **`pre_constructed` kwarg + `build_pre_constructed(config)` (AZ-618 umbrella → subtasks AZ-619..AZ-624).** `compose_root(config, *, pre_constructed=...)` now accepts a dict of pre-built infrastructure objects keyed by documented strategy slug, consumed by the airborne wrapper factories registered in step 1. The airborne entrypoint builds these via `airborne_bootstrap.build_pre_constructed(config)` in 6 dependency-ordered phases (A–F; the table below also lists the E.5 sub-phase added under AZ-625):

| Phase | Slugs seeded | Notes |
|-------|--------------|-------|
| A (AZ-619) | `c13_fdr`, `clock` | `c13_fdr` is per-producer-cached; `clock` is fresh `WallClock` |
| B (AZ-620) | `c6_descriptor_index`, `c6_tile_store` | gated on `BUILD_FAISS_INDEX` per consumer |
| C (AZ-621) | `c7_inference` | gated on `BUILD_TENSORRT_RUNTIME` / `BUILD_PYTORCH_FP16_RUNTIME` |
| D (AZ-622) | `c3_lightglue_runtime`, `c3_feature_extractor` | LightGlue runtime reuses Phase C `c7_inference` engine (no double build); gated on `C3_MATCHER_BUILD_FLAGS[strategy]` |
| E (AZ-623) | `c282_ransac_filter`, `c5_imu_preintegrator`, `c5_se3_utils`, `c5_wgs_converter` | IMU preintegrator cached at module level keyed by camera-calibration path |
| E.5 (AZ-625) | `c5_isam2_graph_handle` (+ internal `_c5_prebuilt_estimator`) | eager `(StateEstimator, ISam2GraphHandle)` build so C4 receives the handle (C4 runs before C5 in topo order) and the C5 wrapper short-circuits without re-invoking the factory; gated on `C5_STATE_BUILD_FLAGS[strategy]` |
| F (AZ-624) | (no slot keys; wires `runtime_root.main()` and verifies AC-1..AC-5 end-to-end) | terminal phase |

The expected per-component dependency keys are documented in `airborne_bootstrap.AIRBORNE_REQUIRED_PRE_CONSTRUCTED_KEYS`. Missing keys raise `AirborneBootstrapError` with the missing-key name + the consuming component slug + the relevant gating `BUILD_*` flag, so the operator-facing error names exactly which build flag or which input is wrong. Tests stub by passing the same `pre_constructed=...` kwarg with mock objects; the bootstrap's caching makes two calls within a process return the same `c13_fdr` object (AC-619.2) without changing the contract. In replay mode (ADR-011), `compose_root` merges replay-built `frame_source` / `fc_adapter` / `clock` / `mavlink_transport` / `replay_sink` over `pre_constructed` so the replay branch's `TlogDerivedClock` correctly overrides the bootstrap's `WallClock`. AZ-687 added a guard for the minimal replay `Config` that omits strategy-component blocks — the bootstrap skips the `_build_c6_*` / `_build_c7_*` / `_build_c5_*` seeds when their component block is absent, since the corresponding wrappers do not run.

Both additions sit inside `runtime_root/`; no component crosses the AZ-507 import boundary. They preserve every ADR-009 invariant — interface-first components, constructor-injected dependencies, single composition root, build-time-exclusion-as-architectural-property — and add the runtime mechanics needed to make a 12+ infrastructure-dependency graph wirable without losing fail-fast behaviour. `module-layout.md` § shared/runtime_root carries the file-level ownership; this section is the architectural rationale.

### ADR-010 — Operator-planned mission is the cold-start trust anchor; FC GPS is secondary

**Context**: The original cold-start design (AZ-419 / FT-P-11) assumed the FC EKF's last valid GPS fix is available at takeoff to seed C5. Field reality contradicts this: a UAV operating in a contested-EW environment may have GPS jammed **before** takeoff (the jamming radius reaches the launch site, the unit launches under a jammer's umbrella, etc.). In that case the FC EKF has no GPS fix to give, and the companion has nothing to anchor the initial pose to — the entire downstream pipeline (VIO bootstrap, VPR retrieval scope, satellite anchoring) collapses or runs blind. At the same time, the parent suite already requires the operator to author a route in the **Mission Planner UI** (`suite/ui`) and persist it to the **`flights` REST service** (`suite/flights`) before any flight runs. The waypoint ordering is operationally meaningful: waypoint[0] is the planned takeoff point. The operator therefore already declares the takeoff position with operationally relevant accuracy (typically a few tens of metres) hours before launch, in a context that has no dependency on GPS at all. This information is the natural cold-start trust anchor.

**Decision**:

1. **`Flight` is read pre-flight, not in-flight.** C12 (the operator-side tool, separate binary from the airborne companion — per ADR-002) calls the parent-suite `flights` REST service via a typed client (AZ-489 `FlightsApiClient`) when the operator runs `gps-denied-cli build-cache --flight-id <Guid>`. An offline path (`--flight-file <path>`) reads the same DTO shape from a JSON export so the workflow survives operator workstations that have no path to the flights service. The companion binary **never** depends on the flights service at runtime (Principle #9 — denied-environment operation).
2. **C12 derives bbox + takeoff origin from the `Flight`.** The bbox is the envelope of waypoint lat/lon plus a configurable buffer (default 1 km, AZ-489 AC-3). The takeoff origin is `Flight.waypoints[0].(lat, lon, alt)` — the operator's authored launch point.
3. **Both fields are baked into the C10 Manifest.** `BuildRequest` and `Manifest` carry `takeoff_origin: LatLonAlt | None` (AZ-323 / AZ-325 / AZ-324 amendments). The hash that drives D-C10-1 idempotence includes `takeoff_origin`, so a re-plan of the route produces a new cache identity and the verifier (AZ-324) rejects a mismatched cache at boot.
4. **C5 consumes the origin before any sensor sample.** The companion's composition root reads `takeoff_origin` from the cache manifest at boot and invokes `set_takeoff_origin(origin, sigma_horiz_m, sigma_vert_m)` on the active `StateEstimator` (AZ-490) **before** the first `add_vio` / `add_fc_imu` call. Both `GtsamIsam2StateEstimator` and `EskfStateEstimator` accept the origin as a Bayesian prior — iSAM2 attaches a `PriorFactorPose3` at `Pose3.Identity()` (the operator origin BECOMES the local-ENU (0,0,0) anchor) with diagonal sigmas `[5°, 5°, 5°, sigma_horiz_m, sigma_horiz_m, sigma_vert_m]`; ESKF seeds the nominal position to (0,0,0) and writes the position block of the error covariance to `diag(sigma_horiz_m², sigma_horiz_m², sigma_vert_m²)`. Defaults are `sigma_horiz_m = 5.0 m`, `sigma_vert_m = 10.0 m` from `C5StateConfig`.
5. **FC GPS is a secondary, gated input.** If the FC EKF later produces a GPS reading (in-flight or at takeoff), it is fused through the existing `add_pose_anchor` machinery only after passing the three-part gate of Principle #11 — **including the ≤ 200 m bounded-delta check against the companion's last emitted `PoseEstimate`**. Real GPS that passes the gate is one more measurement, never an override.
6. **Failure modes.** If the Manifest has no `takeoff_origin` AND the FC EKF has no usable GPS at takeoff, C5 stays in `INITIALIZING` and the FC adapter (C8) emits a non-fused source label; the FT-P-11 takeoff-abort policy (AZ-419 amended) applies. If the Manifest has `takeoff_origin` AND the FC EKF GPS is wildly inconsistent with it at takeoff (e.g., > 200 m), the operator origin wins and the FC GPS is logged as suspect — this is the GPS-spoofed-at-takeoff case and is the entire point of this ADR.

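Items 2 and 4 above can be sketched as two small helpers. The `LatLonAlt` name, the 1 km default buffer, the waypoint[0] rule, and the sigma layout come from the text; everything else (the `BBox` type, the function names, the flat metres-per-degree approximation, and the conversion of the 5° rotation sigmas to radians) is an assumption made for illustration. The sketch ignores the antimeridian and is not valid near the poles.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class LatLonAlt:
    lat: float
    lon: float
    alt: float


@dataclass(frozen=True)
class BBox:
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float


METERS_PER_DEG_LAT = 111_320.0  # spherical approximation, assumed good enough for a buffer


def derive_bbox_and_origin(
    waypoints: Sequence[LatLonAlt], buffer_m: float = 1000.0
) -> Tuple[BBox, LatLonAlt]:
    """Envelope of the waypoints plus a buffer; the origin is the first waypoint."""
    if not waypoints:
        raise ValueError("a Flight needs at least one waypoint")
    lats = [w.lat for w in waypoints]
    lons = [w.lon for w in waypoints]
    dlat = buffer_m / METERS_PER_DEG_LAT
    # Longitude degrees shrink with cos(lat); use the latitude farthest from the
    # equator so the buffer is never under-sized.
    worst_lat = max(abs(min(lats)), abs(max(lats)))
    dlon = buffer_m / (METERS_PER_DEG_LAT * math.cos(math.radians(worst_lat)))
    bbox = BBox(min(lats) - dlat, min(lons) - dlon, max(lats) + dlat, max(lons) + dlon)
    return bbox, waypoints[0]


def origin_prior_sigmas(sigma_horiz_m: float = 5.0, sigma_vert_m: float = 10.0) -> List[float]:
    """Diagonal sigmas in [rot, rot, rot, x, y, z] order; rotation assumed in radians."""
    rot = math.radians(5.0)
    return [rot, rot, rot, sigma_horiz_m, sigma_horiz_m, sigma_vert_m]
```

For a two-waypoint route the helper returns the first waypoint as the origin and a box that extends 1 km past the southernmost, northernmost, westernmost, and easternmost waypoints.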
**Alternatives considered**:

1. **Keep FC EKF as primary** (status quo of AZ-419) — rejected: cannot survive GPS-denied takeoff, which is in scope per Principles #1 and #9. Field reports of pre-launch jamming make this a realistic, not edge-case, failure mode.
2. **Operator types the origin into a CLI prompt at build-cache time** — rejected: duplicates information the Mission Planner UI already captures, drifts from the canonical route, and breaks if the operator re-plans without re-typing. The `Flight` DTO is the single source of truth.
3. **Pull `Flight` from the companion at runtime over a back-channel** — rejected: violates Principle #9 (denied-environment operation; no egress from the companion to anything other than the FC). The flights service is an **operator-workstation** concern only.
4. **Treat operator origin as a hard assignment instead of a prior** — rejected: a hard assignment cannot be fused with a later high-quality posterior, breaks ADR-003's "honest covariance" property, and prevents the `add_pose_anchor` fusion path from ever correcting the origin if it was authored with imprecision.

**Consequences**:

- AZ-419 (FT-P-11) is amended: the primary cold-start path is operator-origin-from-manifest; FC-EKF-GPS is the fallback path with its own sub-AC.
- C10 contracts gain a `takeoff_origin` field in `BuildRequest`, `Manifest`, and the verifier's validation set (AZ-323 / AZ-325 / AZ-324). Contract version bumps to v1.1.0.
- C5 gains a `set_takeoff_origin(origin, sigma_horiz_m, sigma_vert_m)` method on the `StateEstimator` protocol (AZ-490). Protocol contract version bumps to v1.1.0.
- C12 gains the `FlightsApiClient` boundary + offline `--flight-file` path (AZ-489).
- Principle #11 (the spoofed-GPS gate) is extended with the bounded-delta clause; the gate now serves both takeoff and mid-flight.
- The companion binary's network surface is unchanged — only C12 (operator-side, separate binary) talks to the flights service.

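The bounded-delta clause added to Principle #11 reduces to a great-circle distance check. The sketch below covers only that one clause of the three-part gate; the 200 m bound is from this ADR, while the function names, the tuple-based inputs, and the choice of the haversine formula with a mean Earth radius are assumptions.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation

LatLon = Tuple[float, float]  # (latitude_deg, longitude_deg)


def haversine_m(a: LatLon, b: LatLon) -> float:
    """Great-circle distance in metres between two WGS84 lat/lon points."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi = phi2 - phi1
    dlmb = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def passes_bounded_delta(
    fc_gps: LatLon, last_emitted: LatLon, max_delta_m: float = 200.0
) -> bool:
    """True when the FC GPS fix lies within the bound of the last emitted pose."""
    return haversine_m(fc_gps, last_emitted) <= max_delta_m
```

A fix about 111 m north of the last emitted pose (0.001° of latitude) passes, while one about 334 m away (0.003°) is rejected and would be logged as suspect rather than fused.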
### ADR-011 — Replay is a configuration of the airborne binary, not a separate image (REVERSES the v1.0.0 four-binary design)
|
||
|
||
**Context**: The original Decompose Step 2 design for epic AZ-265 (E-DEMO-REPLAY) treated replay as a **fourth Docker image** (`gps-denied-replay-cli`) built from the same source tree with a different `BUILD_*` flag combination — specifically `BUILD_C6=OFF`, `BUILD_C10=OFF`, `BUILD_C11=OFF`, `BUILD_C12=OFF`, plus the new replay-only build flags ON. The justification was the same as ADR-002 for the live/research/operator split: minimize binary size, attack surface, and accidental-selection risk. An SBOM-diff CI step was specified (AZ-403) to enforce the exclusion of the four "off" components from the replay binary.
Two facts surfaced during the Step 7 (Implement) batch loop that contradicted this design:
1. **The C2 (VPR) → C6 dependency cannot be honestly removed.** C2 retrieves candidate tiles by querying the C6 `DescriptorIndex` (FAISS HNSW over pre-built per-tile descriptors). With C6 absent the index has no host, and C2's `VprStrategy.lookup(c1)` either returns empty (replay produces no positioning fixes, defeating epic AC-3 of ≤ 100 m for ≥ 80 % of ticks) or has to be backed by a parallel "lite" index variant (which is not the production code path and therefore destroys the epic's premise that demo confidence equals field-test confidence on the same footage). Either way the v1.0.0 design's `BUILD_C6=OFF` flag for replay conflicts with the v1.0.0 epic AC-3.
2. **The user requirement is the opposite of binary isolation.** Replay's purpose is "demo confidence equals field-test confidence on the same footage" — i.e., the demo and the real flight should run **exactly** the same code path. Reducing the binary's component set (even with a sound technical justification such as ADR-002's) actively works against that purpose: any divergence between the replay image and the airborne image becomes a potential source of demo↔field drift that no SBOM diff can detect once the two binaries' source trees evolve independently.
**Decision**:
1. **Replay is a configuration of the airborne binary.** The airborne Docker image is the replay image. No fourth Docker image, no SBOM-diff CI step, no `BUILD_C6=OFF` for replay. The operator runs the same image with the same `gps-denied-onboard` entry point (or its sibling `gps-denied-replay` console-script wrapper) — only the config differs.
2. **The mode-aware decision is `config.mode = "live" | "replay"` resolved once at startup in `compose_root`.** The composition root branch (the single point of mode awareness in the codebase) swaps three strategies and adds one observer:
- `FrameSource`: `LiveCameraFrameSource` ↔ `VideoFileFrameSource`.
- `FcAdapter`: `PymavlinkArdupilotAdapter` / `Msp2InavAdapter` ↔ `TlogReplayFcAdapter`.
- `MavlinkTransport`: `SerialMavlinkTransport` ↔ `NoopMavlinkTransport` (the outbound bytes go nowhere in replay; the C8 encoder code path is unchanged — see Invariant 5 of the replay protocol).
- **Adds** `JsonlReplaySink` as an additional listener on C5's `EstimatorOutput` stream (replay-only; the UI consumes the JSONL file). The live binary's downstream sinks (C8 outbound to FC, QGC telemetry adapter, C13 FDR) are unchanged.
3. **A new `replay_input/` Layer-4 cross-cutting module owns `(video, tlog)` → `(FrameSource, FcAdapter, Clock)` convergence.** It instantiates the replay strategies, applies the time-offset (manual or auto via AZ-405), and hands the composition root a `ReplayInputBundle`. Beyond the single startup branch in Decision 2, the composition root carries no `if mode == "replay"` plumbing — it sees standard `FrameSource` + `FcAdapter` + `Clock` instances. This is the architectural mechanism that delivers Principle #13's interface-first promise for the replay-vs-live boundary.
4. **Operator pre-flight workflow is identical between replay and live.** The operator plans a route in the parent-suite Mission Planner UI (`suite/ui`); the route persists in the `flights` REST service; C12 reads the `Flight`, derives the bbox + takeoff origin, calls C11 `TileDownloader` against `satellite-provider`, builds the C10 cache (descriptor index + engines + manifest). The only step that differs is "go fly" → "run `gps-denied-replay` against video + tlog". The companion image consumes the cache identically in both modes (Invariant 12 of the replay protocol).
5. **MAVLink emit destinations in replay are no-op sinks for non-UI consumers.** The C8 outbound encoders (`GPS_INPUT`, GCS `STATUSTEXT`, `NAMED_VALUE_FLOAT`, `MAV_CMD_SET_EKF_SOURCE_SET`) run unchanged; their byte streams hit `NoopMavlinkTransport` and disappear. The user-confirmed design intent: the **only** position output the UI cares about in replay is the per-tick C5 `EstimatorOutput`, which is captured by `JsonlReplaySink` and tailed by the parent-suite UI. MAVLink signing key is mandatory in both modes (Invariant 11 of the replay protocol — the operator supplies a dummy key file for replay; the signing handshake runs and its bytes are dropped by the noop transport).
6. **Three binaries, not four.** The active build matrix returns to the ADR-002 three-binary layout: **airborne** (Tier-1 + Tier-2 production; live + replay both run from this image), **research** (IT-12 comparative-study, mirrors airborne plus the additional VioStrategy / VprStrategy variants), **operator-orchestrator** (pre-flight workflows on operator workstation). The replay-cli column is removed from `module-layout.md`'s Build-Time Exclusion Map; the replay-only `BUILD_*` flags (`BUILD_VIDEO_FILE_FRAME_SOURCE`, `BUILD_TLOG_REPLAY_ADAPTER`, `BUILD_REPLAY_SINK_JSONL`) are ON in airborne and research, OFF in operator-orchestrator.
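The mode switch described in Decision 2 can be sketched as follows. Only the class names come from this ADR; the class bodies, the `Config` and `Composition` dataclasses, and the shape of `compose_root` are placeholder assumptions. The sketch also leaves out the `replay_input/` coordinator, `ReplayInputBundle`, and the `Msp2InavAdapter` alternative for brevity.

```python
from dataclasses import dataclass, field
from typing import Literal


# Placeholder classes: only the names are taken from the ADR text.
class LiveCameraFrameSource: ...
class VideoFileFrameSource: ...
class PymavlinkArdupilotAdapter: ...
class TlogReplayFcAdapter: ...
class SerialMavlinkTransport: ...
class NoopMavlinkTransport: ...
class JsonlReplaySink: ...


@dataclass(frozen=True)
class Config:
    """Hypothetical config holder; the real config has many more blocks."""
    mode: Literal["live", "replay"] = "live"


@dataclass
class Composition:
    """Hypothetical result type: what the rest of the pipeline is wired against."""
    frame_source: object
    fc_adapter: object
    transport: object
    extra_sinks: list = field(default_factory=list)


def compose_root(config: Config) -> Composition:
    """Resolve the mode once at startup; downstream code sees only interfaces."""
    if config.mode == "replay":
        return Composition(
            frame_source=VideoFileFrameSource(),
            fc_adapter=TlogReplayFcAdapter(),
            transport=NoopMavlinkTransport(),
            extra_sinks=[JsonlReplaySink()],  # the one replay-only observer
        )
    if config.mode == "live":
        return Composition(
            frame_source=LiveCameraFrameSource(),
            fc_adapter=PymavlinkArdupilotAdapter(),
            transport=SerialMavlinkTransport(),
        )
    raise ValueError(f"unknown mode: {config.mode!r}")
```

The point of the shape is that three strategies are swapped and one sink is added in a single place, so every component downstream of the composition root is identical in both modes.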
**Alternatives considered**:
1. **Keep the fourth `gps-denied-replay-cli` binary with `BUILD_C6=OFF`** (status quo of v1.0.0) — rejected for the two reasons in the Context section: the C2→C6 dependency makes `BUILD_C6=OFF` incompatible with epic AC-3, and the very purpose of replay (demo↔field fidelity) is undermined by any source-tree divergence the SBOM-diff step cannot detect.
2. **Keep the fourth binary but with `BUILD_C6=ON`** — rejected: the result is the airborne code minus C10/C11/C12, which is exactly what airborne already is (the airborne binary already excludes C10/C11/C12 per ADR-002 / ADR-004). The fourth binary would be byte-identical to the airborne image; maintaining it as a separate CI artifact adds work for zero gain.
3. **Make replay an HTTP service rather than a CLI** — rejected as out of scope for this ADR: the parent-suite UI subprocess + JSONL tail design predates this decision. The replay CLI / live entry-point split is a CLI shape concern, not an architectural concern; the airborne binary remains a long-lived process with no HTTP listener.
4. **Move the JSONL sink to a different output (e.g., stdout or a unix socket)** — deferred. The current `results.jsonl` file output is the simplest UI-tailable contract and matches the parent-suite UI's subprocess assumption. If the UI later needs streaming-without-disk, the sink Protocol allows a `StdoutReplaySink` or `UnixSocketReplaySink` strategy without any change to the composition root.
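A minimal sketch of how a sink Protocol keeps the deferred option in Alternative 4 cheap. The `ReplaySink` name, the `on_output` method, the dict-shaped output, and the stream-injecting constructor are assumptions made for illustration; the ADR only states that `JsonlReplaySink` writes `results.jsonl` and that a `StdoutReplaySink` would fit the same Protocol.

```python
import io
import json
import sys
from typing import Protocol, TextIO


class ReplaySink(Protocol):
    """Hypothetical listener interface on C5's EstimatorOutput stream."""
    def on_output(self, output: dict) -> None: ...


class JsonlReplaySink:
    """Writes one JSON object per line to any text stream (a file in practice)."""
    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def on_output(self, output: dict) -> None:
        self._stream.write(json.dumps(output, sort_keys=True) + "\n")
        self._stream.flush()  # keep the file tail-able by the UI


class StdoutReplaySink(JsonlReplaySink):
    """Same line format, bound to stdout instead of a file on disk."""
    def __init__(self) -> None:
        super().__init__(sys.stdout)


def publish(sinks: list[ReplaySink], output: dict) -> None:
    """Fan one estimator output out to every registered sink."""
    for sink in sinks:
        sink.on_output(output)


# Field names below are illustrative, not the replay protocol schema.
buffer = io.StringIO()
publish([JsonlReplaySink(buffer)], {"tick": 1, "provenance": "visual_propagated"})
```

Swapping the output destination then means constructing a different sink; `publish` and its caller stay untouched.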
**Consequences**:
- `_docs/02_document/contracts/replay/replay_protocol.md` is at **v2.0.0** (replaces v1.0.0). New invariants 5, 11, 12 codify the encoder-mode-agnosticism, the signing-key mandate, and the real-C6-cache-in-replay properties.
- `module-layout.md` Build-Time Exclusion Map drops the `Replay-cli` column; airborne column gains `BUILD_VIDEO_FILE_FRAME_SOURCE=ON`, `BUILD_TLOG_REPLAY_ADAPTER=ON`, `BUILD_REPLAY_SINK_JSONL=ON`. The narrative reduces "Four binaries…" to "Three binaries…".
- `module-layout.md` Cross-Cutting section gains a `replay_input/` entry (Layer-4 coordinator, owned by AZ-405).
- AZ-403 (replay-cli Dockerfile + SBOM diff CI step) is **cancelled**; its task file moves to `done/` with a cancellation banner pointing at this ADR. Its dependency edges (incoming from AZ-404, outgoing to nothing) are removed from `_docs/02_tasks/_dependencies_table.md`. The Jira ticket transition to "Cancelled" is recorded in `_docs/_process_leftovers/` if the tracker MCP is unavailable at execution time.
- AZ-401 shrinks: it no longer authors a separate `compose_replay` function; it extends `compose_root` with the `config.mode == "replay"` branch and wires `JsonlReplaySink` + `NoopMavlinkTransport`. Complexity drops from 3 to 2 points.
- AZ-402 shrinks: it is a thin mode-config wrapper that dispatches into the live entry point, not a standalone CLI.
- AZ-405 grows slightly: it now also owns the `replay_input/` coordinator (the natural home for the auto-sync logic + the time-offset application).
- AZ-404 (E2E replay test) is unchanged in scope but reworded: it asserts mode-agnosticism (Invariant 1) and runs against the unified airborne image — no fourth-image entrypoint to verify.
- C8 gains a thin `MavlinkTransport` Protocol seam introduced by AZ-400: `SerialMavlinkTransport` (live) and `NoopMavlinkTransport` (replay) implement it. This is a behaviour-preserving restructure of the existing C8 transport code; the encoders are unchanged. The Protocol seam is the architectural mechanism for Invariant 5 (encoders are byte-identical).
- Demo↔field fidelity is now structurally guaranteed: the same binary runs in both contexts; any drift between them is a behavioural-test failure, not an SBOM-diff failure.
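The transport seam behind Invariant 5 can be sketched as below. The `send` method name, the `RecordingTransport` test double (standing in for `SerialMavlinkTransport` so no serial port is needed), and the toy `encode_position` function are all assumptions; in particular, `encode_position` is not the real `GPS_INPUT` wire layout.

```python
import struct
from typing import Protocol


class MavlinkTransport(Protocol):
    """Hypothetical seam: the encoders hand finished bytes to this interface."""
    def send(self, payload: bytes) -> None: ...


class NoopMavlinkTransport:
    """Replay: accept the bytes and drop them."""
    def send(self, payload: bytes) -> None:
        return None


class RecordingTransport:
    """Test double for the live serial transport; keeps what it was given."""
    def __init__(self) -> None:
        self.sent: list[bytes] = []

    def send(self, payload: bytes) -> None:
        self.sent.append(payload)


def encode_position(lat_e7: int, lon_e7: int) -> bytes:
    """Toy encoder (two little-endian int32 values), for illustration only."""
    return struct.pack("<ii", lat_e7, lon_e7)


def emit(transport: MavlinkTransport, lat_e7: int, lon_e7: int) -> bytes:
    """Encode first, then hand off: the encoder never learns the mode."""
    payload = encode_position(lat_e7, lon_e7)
    transport.send(payload)
    return payload


live_like = RecordingTransport()
live_bytes = emit(live_like, 500000000, 300000000)
replay_bytes = emit(NoopMavlinkTransport(), 500000000, 300000000)
```

Because encoding happens before the transport is touched, the bytes produced in both modes are identical, which is the property a behavioural test can assert directly.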
### ADR-012 — Open-loop ESKF composition profile via `c4_pose.enabled = false` (AZ-776)
**Context**: ADR-009 wires the C4 pose estimator and the C5 state estimator through a shared GTSAM iSAM2 substrate — C4 adds its PnP factor directly to C5's iSAM2 graph (ADR-003). The `c4_pose` slot in `runtime_root/airborne_bootstrap.py` lists `c5_isam2_graph_handle` as a required `pre_constructed` key (AZ-625), and the `OpenCVGtsamPoseEstimator` constructor consumes that handle. This wiring was sound for the steady-state GTSAM-iSAM2 build of C5.
When C5 ships a second strategy — `eskf` (ESKF baseline, AZ-588) — the substrate is **not** an iSAM2 graph: ESKF propagates an IMU-driven covariance forward in closed form, with no factor graph behind it. Its `create()` factory returns `(estimator, None)` for the second tuple element (the iSAM2 handle slot). Two facts surfaced from this:
1. **`c4_pose` cannot be the gate.** C4 owns satellite-anchored pose estimation. ESKF runs satellite-free open-loop. Forcing `c4_pose` into the composition when no satellite anchoring is wired means C4 either crashes at construction (no iSAM2 handle) or, worse, gets a fake handle that pretends to anchor poses that nothing produces — a silent passthrough that violates the "Real Results, Not Simulated Ones" meta-rule.
2. **The replay Tier-2 smoke profile needs an honest minimum.** The AZ-265 replay path's mandatory simple baseline is KLT/RANSAC VIO + ESKF state estimator without any satellite re-anchoring (AZ-777 will add the satellite path on top via the Derkachi C6 reference tile cache). Without an explicit composition profile that excludes C4, every Tier-2 test that wants to exercise the simple baseline either crashes at compose time or has to monkey-patch the registry — both are anti-patterns for an architectural seam.
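To make "open-loop" concrete, here is a deliberately simplified sketch of closed-form covariance propagation, `P ← F P Fᵀ + Q`, for a one-dimensional constant-velocity model. It is not the AZ-588 ESKF (which carries a full error state); the model, the function names, and every numeric value are assumptions chosen only to show that position uncertainty grows monotonically when no anchoring measurement arrives.

```python
def propagate(P, dt, q_vel):
    """One prediction step P <- F P F^T + Q with F = [[1, dt], [0, 1]].

    P is a 2x2 covariance over (position, velocity) given as nested tuples;
    Q adds process noise q_vel to the velocity variance only.
    """
    (p00, p01), (p10, p11) = P
    # F P F^T expanded by hand to keep the sketch dependency-free.
    n00 = p00 + dt * (p10 + p01) + dt * dt * p11
    n01 = p01 + dt * p11
    n10 = p10 + dt * p11
    n11 = p11 + q_vel
    return ((n00, n01), (n10, n11))


def position_sigma_history(steps, dt=0.01, q_vel=1e-4):
    """Position standard deviation (metres) after each prediction-only step."""
    P = ((1.0, 0.0), (0.0, 0.01))  # hypothetical initial covariance
    history = []
    for _ in range(steps):
        P = propagate(P, dt, q_vel)
        history.append(P[0][0] ** 0.5)
    return history


sigmas = position_sigma_history(steps=1000)
```

With prediction steps only, every entry in `sigmas` is larger than the one before it. Bounding that growth requires a measurement update, which in this architecture is the satellite anchoring that the open-loop profile leaves out.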
**Decision**:
1. **`C4PoseConfig.enabled: bool = True` is the user-facing switch for the open-loop ESKF profile.** Default ON preserves the ADR-003 steady-state airborne path. Setting `enabled=False` instructs `compose_root` to remove `c4_pose` from the selection map before topological ordering — the wrapper never runs, the consumer never sees a handle, and the wiring stays honest.
2. **`compose_root` enforces the C4↔C5 pairing matrix at compose time.** The validation gate lives in `_validate_c4_c5_composition_profile` (called from `compose_root` before `_compose`) and rejects the two off-diagonal cells of the 2×2 (`c4_pose.enabled`, `c5_state.strategy`) matrix with a `CompositionError` naming both blocks. The two valid combinations are:
- `c4_pose.enabled=True` + `c5_state.strategy="gtsam_isam2"` — the ADR-003 / ADR-009 steady-state airborne path.
- `c4_pose.enabled=False` + `c5_state.strategy="eskf"` — the open-loop ESKF profile (Tier-2 smoke baseline; satellite anchoring deferred to AZ-777).
The two **invalid** combinations are rejected with explicit error text:
- `enabled=False` + `gtsam_isam2` (an iSAM2 graph with no PnP anchors degrades to drift-prone visual-only odometry; the production deployment intent is that gtsam_isam2 always coexists with C4).
- `enabled=True` + `eskf` (ESKF has no graph for C4 to anchor against; this is the AZ-776 root-cause pairing the user reported).
3. **`build_pre_constructed` honours `c4_pose.enabled`.** When disabled, `c5_isam2_graph_handle` is **omitted** from the `pre_constructed` dict — the handle is a C4 consumer requirement, and removing C4 from the selection map removes the requirement. The ESKF estimator itself is still built and cached in the internal `_c5_prebuilt_estimator` slot (so the C5 wrapper short-circuits onto the prebuilt instance), but the iSAM2-shaped seam disappears from the cross-component contract.
4. **Component selection is the only thing that changes.** The composition root's existing `_compose` mechanics — topological ordering, lazy strategy resolution, build-flag gating — are unchanged. The new `skip_slugs` parameter (a `frozenset[str]`) is the minimal seam that lets `compose_root` instruct `_compose` to drop the disabled component(s); there is no second composition path, no `compose_eskf` function, no mode-aware branch outside the validation gate.
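The compose-time gate and the `skip_slugs` seam from the decisions above can be sketched as follows. The identifiers `C4PoseConfig`, `CompositionError`, `_validate_c4_c5_composition_profile`, `compose_root`, and `skip_slugs` come from this ADR; `C5StateConfig`, the function signatures, the error wording, and the list-based selection are simplifying assumptions. Note that this sketch rejects any strategy outside the two named pairings, which may be stricter than the real gate.

```python
from dataclasses import dataclass


class CompositionError(Exception):
    """Raised when the configured component combination is not allowed."""


@dataclass(frozen=True)
class C4PoseConfig:
    enabled: bool = True


@dataclass(frozen=True)
class C5StateConfig:
    """Hypothetical stand-in for the c5_state config block."""
    strategy: str = "gtsam_isam2"


# The two valid cells of the (c4_pose.enabled, c5_state.strategy) matrix.
_VALID_PAIRINGS = frozenset({(True, "gtsam_isam2"), (False, "eskf")})


def _validate_c4_c5_composition_profile(c4: C4PoseConfig, c5: C5StateConfig) -> None:
    """Reject invalid pairings, naming both config blocks in the message."""
    if (c4.enabled, c5.strategy) not in _VALID_PAIRINGS:
        raise CompositionError(
            f"invalid pairing: c4_pose.enabled={c4.enabled} with "
            f"c5_state.strategy={c5.strategy!r}"
        )


def compose_root(c4: C4PoseConfig, c5: C5StateConfig, selection: list[str]) -> list[str]:
    """Validate first, then drop disabled components via skip_slugs."""
    _validate_c4_c5_composition_profile(c4, c5)
    skip_slugs: frozenset[str] = frozenset() if c4.enabled else frozenset({"c4_pose"})
    return [slug for slug in selection if slug not in skip_slugs]


selection = ["c1_vio", "c4_pose", "c5_state"]
steady_state = compose_root(C4PoseConfig(), C5StateConfig(), selection)
open_loop = compose_root(C4PoseConfig(enabled=False), C5StateConfig("eskf"), selection)
```

Here `steady_state` keeps `c4_pose` and `open_loop` drops it, while either invalid pairing raises `CompositionError` before any component is constructed.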
**Alternatives considered**:
1. **Make `c4_pose` a "soft" dependency of C5 (introspect the strategy at C5 construction time, skip C4 wiring only when `strategy == "eskf"`).** Rejected: this leaks C5-strategy specifics into C4's interface (`PoseEstimator` would have to grow a "you may not be wired" affordance), violates ADR-009 interface-first, and re-introduces the very mode-aware branches Invariant 1 of the replay protocol forbids.
2. **Make `compose_root` derive `c4_pose.enabled` automatically from `c5_state.strategy` (no user-facing flag).** Rejected: the C4↔C5 coupling is a deliberate design pairing, not a mechanical derivation. Future research strategies (e.g. a non-iSAM2 GTSAM variant, or a satellite-anchored ESKF) may want different combinations; the explicit flag keeps the configuration honest and audit-able.
3. **Keep the wiring as-is and rely on the registry mechanism to skip C4.** Rejected: `C4PoseConfig` registers itself with the global config registry at module import (via `register_component_block` in `components/c4_pose/__init__.py`), which means even an empty `c4_pose:` block in YAML instantiates the block with defaults and pulls C4 into the selection map. The flag is the only honest opt-out without removing the registration call (which would break the steady-state path).
4. **Build a synthetic `NullIsam2GraphHandle` that satisfies the Protocol but no-ops on update.** Rejected as the textbook example of the "Real Results, Not Simulated Ones" anti-pattern: it would let C4 run on top of ESKF with no anchoring, producing pose estimates that look real but have no factor-graph grounding. The composition-time gate is the honest answer.
**Consequences**:
- `tests/e2e/replay/conftest.py` writes `c4_pose: { enabled: false }` into the Tier-2 replay `config.yaml`, alongside the existing `c1_vio: klt_ransac` + `c5_state: eskf` block. This is the open-loop profile the replay binary uses for the AZ-265 / AZ-776 simple-baseline tests.
- `tests/e2e/replay/test_derkachi_1min.py` un-xfails AC-1 (clean exit + per-frame JSONL), AC-2 (schema), AC-5 (determinism), AC-6 realtime, and AC-6 ASAP — these tests only needed compose-time success to pass, which AZ-776 delivers. AC-3 (≤ 100 m for ≥ 80 % of ticks) **remains** xfailed for AZ-777: ESKF integrates open-loop and drifts without bound absent C2/C3/C4 satellite re-anchoring, so the ≤ 100 m threshold is physically out of reach until the Derkachi C6 reference tile cache lands.
- `_docs/02_document/contracts/replay/replay_protocol.md` gains a new "Open-loop ESKF composition profile" sub-section in **Composition root extension** plus a new **Invariant 13** ("C4↔C5 pairing matrix is enforced at compose time") that the AZ-776 unit tests own.
- `_docs/02_document/components/06_c4_pose/description.md` gains an "Enabled flag" sub-section that points at this ADR; the rest of the component contract is unchanged.
- The unit-test surface at `tests/unit/runtime_root/test_az776_open_loop_eskf_composition.py` owns the seven invariants AZ-776 introduces: `C4PoseConfig.enabled` default-true, AC-1 (open-loop ESKF composes without C4), AC-2 (default GTSAM profile still includes C4), AC-3a + AC-3b (the two forbidden pairings raise `CompositionError`), and the two `pre_constructed` behaviours (`c5_isam2_graph_handle` omitted when C4 disabled, present when C4 enabled). The full suite passes in ~4 s.
- The composition root's contract surface in `runtime_root/__init__.py` gains one public helper (`CompositionError` was already public; the new `skip_slugs` parameter to `_compose` is module-private). No public CLI flag is added — operators set `c4_pose.enabled = false` in YAML.
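The two `pre_constructed` behaviours listed among the unit-test invariants can be sketched like this. The key names `c5_isam2_graph_handle` and `_c5_prebuilt_estimator` come from this ADR; the function signature, storing the prebuilt estimator in the same dict, and the `ValueError` guard are assumptions made to keep the example self-contained.

```python
from typing import Any, Optional


def build_pre_constructed(
    c4_enabled: bool,
    estimator: Any,
    isam2_handle: Optional[Any],
) -> dict:
    """Build the cross-component dict; the iSAM2 handle exists only for C4."""
    # Assumption: the prebuilt estimator is kept alongside the other entries.
    pre_constructed = {"_c5_prebuilt_estimator": estimator}
    if c4_enabled:
        if isam2_handle is None:
            # Guard added for the sketch: C4 cannot be wired without a graph.
            raise ValueError("c4_pose is enabled but C5 produced no iSAM2 handle")
        pre_constructed["c5_isam2_graph_handle"] = isam2_handle
    return pre_constructed


# ESKF-style factory result: (estimator, None), per the Context section.
eskf_estimator, eskf_handle = object(), None
open_loop = build_pre_constructed(False, eskf_estimator, eskf_handle)

# GTSAM-style factory result: (estimator, graph handle).
gtsam_estimator, gtsam_handle = object(), object()
steady_state = build_pre_constructed(True, gtsam_estimator, gtsam_handle)
```

In the open-loop case the handle key is absent rather than present with a `None` or dummy value, so nothing downstream can mistake it for a usable graph.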