# Risk Assessment — gps-denied-onboard Plan cycle — Iteration 1

**Date**: 2026-05-09

**Scope**: technical / schedule / external risks identified during the 4a evaluator pass over architecture, system flows, data model, deployment, and the 14-component decomposition.

**Iteration policy**: if the user requests another round, this file becomes `risk_mitigations_02.md` and so on.

## Risk Scoring Matrix

| | Low Impact | Medium Impact | High Impact |
|--|------------|---------------|-------------|
| **High Probability** | Medium | High | Critical |
| **Medium Probability** | Low | Medium | High |
| **Low Probability** | Low | Low | Medium |

## Acceptance Criteria by Risk Level

| Level | Action Required |
|-------|----------------|
| Low | Accepted, monitored quarterly |
| Medium | Mitigation plan required before implementation |
| High | Mitigation + contingency plan required, reviewed weekly |
| Critical | Must be resolved before proceeding to next planning step |

## Risk Register

| ID | Risk | Category | Probability | Impact | Score | Mitigation | Owner | Status |
|----|------|----------|-------------|--------|-------|------------|-------|--------|
| R01 | D-PROJ-2 ingest endpoint not yet implemented service-side; F10 post-landing upload cannot reach a real `satellite-provider` until parent-suite work lands | External | High | Medium | High | C11 `TileUploader` keeps batches in pending-upload journal; e2e `mock-suite-sat-service` fixture exercises the contract in tests; leftover file tracks the cross-workspace dep; onboard release does not block on D-PROJ-2 | Parent suite | Mitigated |
| R02 | ADR-004 process-isolation invariant broken by accidental CMake link of C11 into the airborne image | Technical | Low | High | Medium | CI SBOM diff explicitly fails if any `c11_tilemanager/` symbol appears in the airborne `production-binary` artifact (see ADR-002 + CI security gate); runtime check in `runtime_root.py` panics if `c11_tilemanager` module imports non-test paths | Onboard team | Mitigated |
| R03 | MAVLink 2.0 per-flight signing key handshake on AP wired channel (D-C8-9 = (d)) has no production-deployed precedent | Technical | Medium | High | High | IT-3 ArduPilot SITL validation gate; D-C8-2-FALLBACK options recorded ((a) operator-manual RC aux + relaxed AC-NEW-2; (b) STATUSTEXT instead of automated switch; (c) escalate to ArduPilot dev community) per ADR-008 | Onboard team | Open (gated by IT-3) |
| R04 | TensorRT engine cache hardware-tied to SM 87; deserialising on a different SM corrupts inference silently | Technical | Medium | High | High | D-C10-7 self-describing filename `__sm_jp_trt_.engine` + D-C10-3 SHA-256 content-hash gate at F2 takeoff refuses deserialise on tuple mismatch; CI Tier-2 runs on the pinned JetPack 6.2 / TRT 10.3 / SM 87 image | Onboard team | Mitigated |
| R05 | iSAM2 numerical instability silently swallows factor-add failures, producing wrong covariance | Technical | Medium | Medium | Medium | C5 logs every `add_*` call's success/failure; `EstimatorHealth.cov_norm_growing_for_s` monotonicity check feeds the AC-NEW-8 spoof gate; `EstimatorFatalError` triggers AC-5.2 (3 s no estimate → FC IMU-only) | Onboard team | Mitigated |
| R06 | VPR top-1 false positive: visually-similar but geographically-wrong tile ranks above the true match | Technical | Medium | Medium | Medium | C2.5 inlier-count rerank narrows K=10 → N=3; C3 RANSAC + reprojection residual filter; C3.5 conditional AdHoP refinement on hard frames; D-CROSS-LATENCY-1 hybrid keeps the budget under thermal throttle. Cross-flight cache-poisoning safety budget (AC-NEW-7) gates downstream | Onboard team | Mitigated |
| R07 | FC GPS source promoted back from spoofed → trusted prematurely (AC-NEW-2 / AC-NEW-8 violation) | Technical | Low | High | Medium | C5 `SourceLabelStateMachine`: never re-promote until ≥10 s of `gps_health == STABLE_NON_SPOOFED` AND next satellite-anchored frame agrees with FC GPS within configurable tolerance; every reject logged to FDR + GCS STATUSTEXT (ADR-008) | Onboard team | Mitigated |
| R08 | Tile freshness drift inside `active_conflict` sectors between download and flight time (AC-NEW-6 violation in stale theatre) | External | Medium | Medium | Medium | C11 `TileDownloader` applies sector-classified freshness gate at fetch (reject in `active_conflict`, downgrade-no-`satellite_anchored`-label in `stable_rear`); `DownloadBatchReport` surfaces stale-rejection counts so the operator can re-pull. Manifest-hash idempotence (D-C10-1) makes re-runs cheap | Operator + onboard team | Mitigated |
| R09 | Per-flight onboard signing key compromise: a captured companion lets an attacker poison `satellite-provider` ingest | External | Low | Medium | Low | Per-flight ephemeral keys deleted on flight-ring rollover (≥30 days post-landing default); parent-suite voting layer (D-PROJ-2 design task #2) requires multi-companion agreement before promoting `pending → trusted`; operator can revoke a compromised `companion_id` | Parent suite + operator | Mitigated (depends on D-PROJ-2 #2) |
| R10 | C5 covariance recovery (`Marginals.marginalCovariance`) exceeds AC-4.1 latency budget under thermal throttle | Technical | Medium | Medium | Medium | D-CROSS-LATENCY-1 hybrid: `ThermalState` from C7 triggers C4 to switch from MARGINALS → JACOBIAN per-frame; ~5–10% accuracy loss accepted under throttle, never on the steady-state path (ADR-006). Once thermal returns below threshold, switch back next frame | Onboard team | Mitigated |
| R11 | AC-NEW-4 / AC-NEW-7 multi-flight statistical headroom is reduced-confidence with the current single-flight (Derkachi) fixture (D-PROJ-3 deferred) | Schedule / Data | High | Medium | High | Validation requirement relaxed from "≥100 flights" literal to Monte-Carlo-with-stated-CI over current corpus; multi-flight statistical residual risk recorded in this register; D-PROJ-3 carryforward to next cycle (Maxar Open Data Ukraine + AerialVL S03 + own multi-flight data) | Project lead | Open (carryforward) |
| R12 | Single deployment camera (`adti20`, one unit available) becomes a single-point-of-failure for Tier-2 NFTs | Resource | Medium | Medium | Medium | D-PROJ-1 hybrid calibration acquired per-unit; bench Jetson at HQ retains a copy of every successfully-built engine cache (D-C10-8); IT-12 comparative study uses static fixtures so it is camera-unit-agnostic. If the deployed unit fails, the cache rebuilds on a replacement unit | Onboard team | Accepted |
| R13 | C13 FDR queue overrun on a sustained burst (e.g., F4 mid-flight tile gen + per-frame estimates + IMU traces) | Technical | Low | Medium | Low | Per-producer drop-oldest queues; rollover event itself is always logged (`FdrQueueOverrunError` writes a structured "overrun" record with producer-id and dropped count); NFT-LIM-02 (8 h synthetic AC-NEW-3) validates aggregate throughput | Onboard team | Mitigated |
| R14 | Apparent C2.5 ↔ C3 build-time circular dependency around the LightGlue runtime | Technical | Low | Low | Low | Resolved: ownership of `LightGlueRuntime` moved to the shared helper (`_docs/02_document/common-helpers/03_helper_lightglue_runtime.md`); both C2.5 and C3 are sibling consumers; data flow remains C2.5 → C3 (one-way). C2.5 spec § 8 updated to reference the helper, not C3 | Onboard team | Mitigated (this iteration) |

## Risk Categories

### Technical Risks (R02, R03, R04, R05, R06, R07, R10, R13, R14)

Concentrated in three areas: (a) the GPU/TRT stack and its hardware-tied engine cache; (b) the GTSAM substrate shared between C4 and C5; (c) MAVLink signing handshake. Each has a concrete mitigation already wired into the architecture.

### Schedule / Data Risks (R11)

Single risk: D-PROJ-3 multi-flight fixture acquisition deferred. Validation strategy adjusted from literal-count to Monte-Carlo-with-CI. Carryforward to next cycle.

### Resource Risks (R12)

Single deployment camera. Accepted with bench Jetson + HQ engine archive as the contingency.

### External Risks (R01, R08, R09)

All three depend on parent-suite work — D-PROJ-2 (#1 ingest endpoint, #2 voting layer) plus the existing `satellite-provider` GET surface. Cross-workspace dependency tracked in `_docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md`.

## Detailed Risk Analysis

### R02: ADR-004 process-isolation invariant broken by accidental airborne C11 link

**Description**: The architecture's primary defence against in-air outbound writes to `satellite-provider` is **process-level isolation** — the C11 Tile Manager (both `TileDownloader` and `TileUploader`) is excluded from the airborne CMake target so the airborne image cannot load that code path even via reflection or config error. A regression that adds a `c11_tilemanager/` import path to a shared module the airborne binary depends on would silently re-introduce the upload code into airborne memory, defeating ADR-004.

**Trigger conditions**:

- A future refactor moves a helper used by C11 into a shared module.
- A new feature accidentally adds `c11_tilemanager` to the airborne CMake target list.
- A reflection-based plugin loader scans all `src/components/` and instantiates whatever it finds.

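
As one illustration of the runtime check that the register lists for R02, the sketch below asks the import system whether a ground-only package can be located and refuses to continue if it can. It is a minimal sketch and not the contents of `runtime_root.py`: the names `FORBIDDEN_TOP_LEVEL`, `IsolationViolation`, `find_forbidden_modules`, and `assert_airborne_isolation` are hypothetical, and raising an exception stands in for whatever "panic" means in the real composition root.

```python
import importlib.util

# Assumption: the ground-only code lives in a top-level package with this name.
FORBIDDEN_TOP_LEVEL = ("c11_tilemanager",)


class IsolationViolation(RuntimeError):
    """Raised when a ground-only package is locatable from the airborne process."""


def find_forbidden_modules(names=FORBIDDEN_TOP_LEVEL,
                           finder=importlib.util.find_spec):
    """Return the subset of `names` that the import system can locate.

    find_spec() only looks a top-level module up; it does not execute it,
    so the check itself does not load the code it is guarding against.
    """
    return [name for name in names if finder(name) is not None]


def assert_airborne_isolation(names=FORBIDDEN_TOP_LEVEL,
                              finder=importlib.util.find_spec):
    """Raise IsolationViolation if any forbidden package is importable."""
    found = find_forbidden_modules(names, finder)
    if found:
        raise IsolationViolation(
            f"ground-only packages importable in airborne process: {found}"
        )
```

Calling `assert_airborne_isolation()` first thing at startup would catch all three trigger conditions above, because each of them ends with the package being locatable on the airborne import path. The `finder` parameter exists only so the check can be unit-tested without installing the real package.
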
**Affected components**: C11 (the deletion target), the airborne `production-binary` build, the runtime composition root.

**Mitigation strategy**:

1. CI security gate: SBOM diff fails if any `c11_tilemanager/` symbol appears in the airborne `production-binary` artifact (extension of ADR-002's existing per-implementation enforcement).
2. Runtime self-check in `runtime_root.py`: if any `c11_tilemanager.*` module is importable from inside the airborne process, panic at startup before opening the FC adapter.
3. Test gate (NFT-SEC-02): explicit "egress test" verifies the airborne process cannot reach `satellite-provider` over the network.

**Contingency plan**: If the runtime self-check fires in flight, the airborne process refuses to publish `GPS_INPUT` / `MSP2_SENSOR_GPS` and alerts via STATUSTEXT — defaulting back to FC IMU-only (AC-5.2 fallback path).

**Residual risk after mitigation**: Low.

**Documents updated**: `architecture.md` ADR-004 (extended to cover both `TileDownloader` and `TileUploader`); `architecture.md` ADR-002 (SBOM diff scope); `tests/security-tests.md` NFT-SEC-02.

---

### R03: MAVLink 2.0 per-flight signing handshake (D-C8-9 = (d)) has no production precedent

**Description**: D-C8-9 selects MAVLink 2.0 message signing (option d) on the AP wired channel with per-flight ephemeral keys. The AP firmware supports it but no flight-deployed system the project has reviewed runs it in a per-flight rotation pattern; the contract for the handshake at takeoff is project-defined.

**Trigger conditions**: ArduPilot Plane firmware behaviour differs from documentation; the handshake fails on a real airframe.

**Affected components**: C8 (per-FC adapter), F2 (takeoff load).

**Mitigation strategy**:

1. **IT-3 ArduPilot SITL validation gate** — IT-3 must pass before D-C8-9 is locked. The test exercises full handshake + signed-channel + key rotation in SITL.
2. **D-C8-2-FALLBACK options recorded** (per ADR-008): (a) operator-manual RC aux switch with relaxed AC-NEW-2 wording; (b) operator-warning STATUSTEXT instead of automated switch; (c) escalate to ArduPilot dev community.

**Contingency plan**: If IT-3 fails, fall back to (a) per ADR-008 and accept the relaxed AC-NEW-2 wording for this cycle. This is a documented planned fallback, not an emergency response.

**Residual risk after mitigation**: Medium-Low until IT-3 passes; Low once IT-3 is green.

**Documents updated**: `architecture.md` ADR-008; `components/10_c8_fc_adapter/description.md` § 5 (Error Handling); `tests/integration-tests.md` IT-3.

---

### R11: AC-NEW-4 / AC-NEW-7 reduced-confidence statistical headroom

**Description**: AC-NEW-4 (per-flight pose-error CDF) and AC-NEW-7 (cross-flight cache-poisoning safety budget) were originally framed as ≥100-flight statistical claims. With current data limited to a single Derkachi flight + 60 still images, the project cannot meet the literal bound and instead validates with Monte-Carlo over the current corpus, with stated CIs.

**Trigger conditions**: Reviewer or operator interprets the original AC text strictly; the relaxed Monte-Carlo wording is not formally accepted upstream.

**Affected components**: validation strategy across NFT-PERF-01, NFT-LIM-03, NFT-SEC-01.

**Mitigation strategy**:

1. AC text already revised on 2026-05-09 to "Monte-Carlo over currently-available data corpus with stated CI" (per `acceptance_criteria.md` revision note).
2. D-PROJ-3 (multi-flight fixture acquisition) is the carryforward: AerialVL S03 + Maxar Open Data Ukraine + own multi-flight data when next cycle resources permit.
3. NFT specs explicitly mark this as PARTIAL coverage with traceability flag in the matrix.

**Contingency plan**: If the relaxed wording is rejected, the project escalates D-PROJ-3 to a blocking pre-cycle dep and pauses validation NFTs until multi-flight data is acquired.

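
To make "Monte-Carlo over the current corpus, with stated CIs" concrete, the sketch below computes a percentile-bootstrap confidence interval for one summary statistic of per-frame pose error. It is only an illustration of the general technique: the function name `bootstrap_ci`, the sample values, and the choice of the median as the statistic are assumptions for the example, not the procedure defined in the NFT specs.

```python
import random
import statistics


def bootstrap_ci(samples, statistic, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for `statistic` over `samples`.

    Resamples with replacement, evaluates the statistic on each resample,
    and returns the (lower, upper) percentiles of those estimates.
    """
    rng = random.Random(seed)  # fixed seed so a reported CI is reproducible
    n = len(samples)
    estimates = sorted(
        statistic(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = estimates[int(alpha * (n_resamples - 1))]
    upper = estimates[int((1.0 - alpha) * (n_resamples - 1))]
    return lower, upper


# Hypothetical per-frame horizontal pose errors in metres (illustrative only).
errors_m = [3.1, 4.7, 2.9, 6.2, 5.0, 3.8, 7.4, 4.1, 3.3, 5.6, 4.9, 6.8]

low, high = bootstrap_ci(errors_m, statistics.median)
print(f"median error {statistics.median(errors_m):.2f} m, "
      f"95% CI [{low:.2f}, {high:.2f}] m")
```

Resampling frames from a single flight captures within-flight variability only; it cannot represent flight-to-flight variation, which is why the residual risk for R11 stays open until multi-flight data is available.
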
**Residual risk after mitigation**: Medium (acknowledged headroom; acceptable for this cycle but tracked).

**Documents updated**: `00_problem/acceptance_criteria.md` revision note; `tests/traceability-matrix.md`; `01_solution/solution.md` Plan-phase carryforward section.

## Architecture / Component Changes Applied (this iteration)

| Risk ID | Document Modified | Change Description |
|---------|------------------|--------------------|
| R02 | `architecture.md` ADR-004 | Extended scope: process isolation now excludes both C11 download and upload code paths from the airborne image; runtime self-check requirement noted |
| R02 | `architecture.md` ADR-002 | SBOM diff already covers per-implementation linking; explicit C11 boundary added to enforcement scope |
| R14 | `components/03_c2_5_rerank/description.md` § 1 + § 8 | LightGlueRuntime ownership moved from C3 to the shared helper; C2.5's "Must be implemented after" no longer references C3 directly |

(R03, R04, R05, R06, R07, R08, R10, R13 already had mitigations wired into the architecture/component specs in earlier iterations of Plan; this register only cross-references them.)

## Summary

**Total risks identified**: 14

**Critical**: 0 | **High**: 4 (R01, R03, R04, R11) | **Medium**: 7 (R02, R05, R06, R07, R08, R10, R12) | **Low**: 3 (R09, R13, R14)

**Risks mitigated this iteration**: 2 net-new (R02 enforcement scope extended; R14 resolved via helper ownership)

**Risks requiring user decision**: none in this iteration. R03 is gated by IT-3 (a future test event); R11 is the documented D-PROJ-3 carryforward already accepted earlier in the cycle.
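
Because each Score in the register follows mechanically from the Risk Scoring Matrix, the per-level grouping in the Summary can be recomputed rather than maintained by hand. The sketch below transcribes the matrix and the Probability / Impact cells of the register into plain Python data; the names `SCORING_MATRIX`, `REGISTER`, `score`, and `tally` are hypothetical helpers for illustration, not part of the project's tooling.

```python
# Risk Scoring Matrix transcribed from this document:
# (probability, impact) -> risk level.
SCORING_MATRIX = {
    ("High", "Low"): "Medium",
    ("High", "Medium"): "High",
    ("High", "High"): "Critical",
    ("Medium", "Low"): "Low",
    ("Medium", "Medium"): "Medium",
    ("Medium", "High"): "High",
    ("Low", "Low"): "Low",
    ("Low", "Medium"): "Low",
    ("Low", "High"): "Medium",
}

# (id, probability, impact) copied from the Risk Register rows.
REGISTER = [
    ("R01", "High", "Medium"),
    ("R02", "Low", "High"),
    ("R03", "Medium", "High"),
    ("R04", "Medium", "High"),
    ("R05", "Medium", "Medium"),
    ("R06", "Medium", "Medium"),
    ("R07", "Low", "High"),
    ("R08", "Medium", "Medium"),
    ("R09", "Low", "Medium"),
    ("R10", "Medium", "Medium"),
    ("R11", "High", "Medium"),
    ("R12", "Medium", "Medium"),
    ("R13", "Low", "Medium"),
    ("R14", "Low", "Low"),
]


def score(probability, impact):
    """Look up the risk level for one register row."""
    return SCORING_MATRIX[(probability, impact)]


def tally(register):
    """Group risk IDs by computed level, preserving register order."""
    by_level = {}
    for risk_id, probability, impact in register:
        by_level.setdefault(score(probability, impact), []).append(risk_id)
    return by_level


for level, ids in tally(REGISTER).items():
    print(f"{level}: {len(ids)} ({', '.join(ids)})")
```

Running a check like this whenever a row's Probability or Impact changes keeps the Score column and the Summary counts aligned with the matrix.
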