

Risk Assessment — gps-denied-onboard Plan cycle — Iteration 1

Date: 2026-05-09

Scope: technical / schedule / external risks identified during the 4a evaluator pass over architecture, system flows, data model, deployment, and the 14-component decomposition.

Iteration policy: if the user requests another round, this file becomes risk_mitigations_02.md and so on.

Risk Scoring Matrix

| | Low Impact | Medium Impact | High Impact |
|---|---|---|---|
| High Probability | Medium | High | Critical |
| Medium Probability | Low | Medium | High |
| Low Probability | Low | Low | Medium |
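
The matrix can also be read as a lookup table, which makes it easy to cross-check the Score column of the register against the Probability and Impact columns. The following is a minimal sketch; `risk_score` and `_MATRIX` are hypothetical names used only for illustration and are not part of the project.

```python
# Sketch: the scoring matrix above as a nested lookup (probability -> impact -> score).
# `_MATRIX` and `risk_score` are hypothetical names, not project code.
_MATRIX = {
    "High":   {"Low": "Medium", "Medium": "High",   "High": "Critical"},
    "Medium": {"Low": "Low",    "Medium": "Medium", "High": "High"},
    "Low":    {"Low": "Low",    "Medium": "Low",    "High": "Medium"},
}


def risk_score(probability: str, impact: str) -> str:
    """Return the score for a (probability, impact) pair per the matrix."""
    try:
        return _MATRIX[probability][impact]
    except KeyError as exc:
        raise ValueError(f"unknown level: {exc.args[0]!r}") from None


if __name__ == "__main__":
    print(risk_score("Low", "High"))     # R02: Low probability, High impact
    print(risk_score("High", "Medium"))  # R01: High probability, Medium impact
```

Running each register row through such a lookup is a cheap way to catch a Score cell that drifts from its Probability and Impact cells.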

Acceptance Criteria by Risk Level

| Level | Action Required |
|---|---|
| Low | Accepted, monitored quarterly |
| Medium | Mitigation plan required before implementation |
| High | Mitigation + contingency plan required, reviewed weekly |
| Critical | Must be resolved before proceeding to next planning step |

Risk Register

| ID | Risk | Category | Probability | Impact | Score | Mitigation | Owner | Status |
|---|---|---|---|---|---|---|---|---|
| R01 | D-PROJ-2 ingest endpoint not yet implemented service-side; F10 post-landing upload cannot reach a real satellite-provider until parent-suite work lands | External | High | Medium | High | C11 TileUploader keeps batches in pending-upload journal; e2e mock-suite-sat-service fixture exercises the contract in tests; leftover file tracks the cross-workspace dep; onboard release does not block on D-PROJ-2 | Parent suite | Mitigated |
| R02 | ADR-004 process-isolation invariant broken by accidental CMake link of C11 into the airborne image | Technical | Low | High | Medium | CI SBOM diff explicitly fails if any c11_tilemanager/ symbol appears in the airborne production-binary artifact (see ADR-002 + CI security gate); runtime check in runtime_root.py panics if c11_tilemanager module imports non-test paths | Onboard team | Mitigated |
| R03 | MAVLink 2.0 per-flight signing key handshake on AP wired channel (D-C8-9 = (d)) has no production-deployed precedent | Technical | Medium | High | High | IT-3 ArduPilot SITL validation gate; D-C8-2-FALLBACK options recorded ((a) operator-manual RC aux + relaxed AC-NEW-2; (b) STATUSTEXT instead of automated switch; (c) escalate to ArduPilot dev community) per ADR-008 | Onboard team | Open (gated by IT-3) |
| R04 | TensorRT engine cache hardware-tied to SM 87; deserialising on a different SM corrupts inference silently | Technical | Medium | High | High | D-C10-7 self-describing filename <model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine + D-C10-3 SHA-256 content-hash gate at F2 takeoff refuses deserialise on tuple mismatch; CI Tier-2 runs on the pinned JetPack 6.2 / TRT 10.3 / SM 87 image | Onboard team | Mitigated |
| R05 | iSAM2 numerical instability silently swallows factor-add failures, producing wrong covariance | Technical | Medium | Medium | Medium | C5 logs every add_* call's success/failure; EstimatorHealth.cov_norm_growing_for_s monotonicity check feeds the AC-NEW-8 spoof gate; EstimatorFatalError triggers AC-5.2 (3 s no estimate → FC IMU-only) | Onboard team | Mitigated |
| R06 | VPR top-1 false positive: visually-similar but geographically-wrong tile ranks above the true match | Technical | Medium | Medium | Medium | C2.5 inlier-count rerank narrows K=10 → N=3; C3 RANSAC + reprojection residual filter; C3.5 conditional AdHoP refinement on hard frames; D-CROSS-LATENCY-1 hybrid keeps the budget under thermal throttle. Cross-flight cache-poisoning safety budget (AC-NEW-7) gates downstream | Onboard team | Mitigated |
| R07 | FC GPS source promoted back from spoofed → trusted prematurely (AC-NEW-2 / AC-NEW-8 violation) | Technical | Low | High | Medium | C5 SourceLabelStateMachine: never re-promote until ≥10 s of gps_health == STABLE_NON_SPOOFED AND next satellite-anchored frame agrees with FC GPS within configurable tolerance; every reject logged to FDR + GCS STATUSTEXT (ADR-008) | Onboard team | Mitigated |
| R08 | Tile freshness drift inside active_conflict sectors between download and flight time (AC-NEW-6 violation in stale theatre) | External | Medium | Medium | Medium | C11 TileDownloader applies sector-classified freshness gate at fetch (reject in active_conflict, downgrade-no-satellite_anchored-label in stable_rear); DownloadBatchReport surfaces stale-rejection counts so the operator can re-pull. Manifest-hash idempotence (D-C10-1) makes re-runs cheap | Operator + onboard team | Mitigated |
| R09 | Per-flight onboard signing key compromise: a captured companion lets an attacker poison satellite-provider ingest | External | Low | Medium | Low | Per-flight ephemeral keys deleted on flight-ring rollover (≥30 days post-landing default); parent-suite voting layer (D-PROJ-2 design task #2) requires multi-companion agreement before promoting pending → trusted; operator can revoke a compromised companion_id | Parent suite + operator | Mitigated (depends on D-PROJ-2 #2) |
| R10 | C5 covariance recovery (Marginals.marginalCovariance) exceeds AC-4.1 latency budget under thermal throttle | Technical | Medium | Medium | Medium | D-CROSS-LATENCY-1 hybrid: ThermalState from C7 triggers C4 to switch from MARGINALS → JACOBIAN per-frame; ~5–10% accuracy loss accepted under throttle, never on the steady-state path (ADR-006). Once thermal returns below threshold, switch back next frame | Onboard team | Mitigated |
| R11 | AC-NEW-4 / AC-NEW-7 multi-flight statistical headroom is reduced-confidence with the current single-flight (Derkachi) fixture (D-PROJ-3 deferred) | Schedule / Data | High | Medium | High | Validation requirement relaxed from "≥100 flights" literal to Monte-Carlo-with-stated-CI over current corpus; multi-flight statistical residual risk recorded in this register; D-PROJ-3 carryforward to next cycle (Maxar Open Data Ukraine + AerialVL S03 + own multi-flight data) | Project lead | Open (carryforward) |
| R12 | Single deployment camera (adti20, one unit available) becomes a single-point-of-failure for Tier-2 NFTs | Resource | Medium | Medium | Medium | D-PROJ-1 hybrid calibration acquired per-unit; bench Jetson at HQ retains a copy of every successfully-built engine cache (D-C10-8); IT-12 comparative study uses static fixtures so it is camera-unit-agnostic. If the deployed unit fails, the cache rebuilds on a replacement unit | Onboard team | Accepted |
| R13 | C13 FDR queue overrun on a sustained burst (e.g., F4 mid-flight tile gen + per-frame estimates + IMU traces) | Technical | Low | Medium | Low | Per-producer drop-oldest queues; rollover event itself is always logged (FdrQueueOverrunError writes a structured "overrun" record with producer-id and dropped count); NFT-LIM-02 (8 h synthetic AC-NEW-3) validates aggregate throughput | Onboard team | Mitigated |
| R14 | Apparent C2.5 ↔ C3 build-time circular dependency around the LightGlue runtime | Technical | Low | Low | Low | Resolved: ownership of LightGlueRuntime moved to the shared helper (_docs/02_document/common-helpers/03_helper_lightglue_runtime.md); both C2.5 and C3 are sibling consumers; data flow remains C2.5 → C3 (one-way). C2.5 spec § 8 updated to reference the helper, not C3 | Onboard team | Mitigated |
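
The R07 re-promotion rule combines two conditions: a continuous window of stable GPS health and agreement between the next satellite-anchored frame and the FC GPS. The sketch below illustrates that conjunction only. It is not the C5 SourceLabelStateMachine; the class name, method names, and the 25 m default tolerance are assumptions (the register says only that the tolerance is configurable).

```python
from dataclasses import dataclass
from typing import Optional

STABLE = "STABLE_NON_SPOOFED"  # health value named in the register


@dataclass
class RepromotionGate:
    """Hypothetical sketch of the R07 rule; not the C5 implementation."""

    min_stable_s: float = 10.0   # register: at least 10 s of stable health
    tolerance_m: float = 25.0    # assumed value; the source says "configurable"
    _stable_since: Optional[float] = None

    def observe_health(self, t_s: float, gps_health: str) -> None:
        """Track the start of the current continuous stable window."""
        if gps_health == STABLE:
            if self._stable_since is None:
                self._stable_since = t_s
        else:
            # Any non-stable sample resets the window.
            self._stable_since = None

    def may_repromote(self, t_s: float, anchor_vs_fc_gps_m: float) -> bool:
        """True only when both conditions of the rule hold at time t_s."""
        if self._stable_since is None:
            return False
        stable_long_enough = (t_s - self._stable_since) >= self.min_stable_s
        agrees = anchor_vs_fc_gps_m <= self.tolerance_m
        return stable_long_enough and agrees
```

A caller would feed every health sample into `observe_health` and consult `may_repromote` only when a new satellite-anchored frame arrives, logging each rejection as the register requires.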

Risk Categories

Technical Risks (R02, R03, R04, R05, R06, R07, R10, R13, R14)

Concentrated in three areas: (a) the GPU/TRT stack and its hardware-tied engine cache; (b) the GTSAM substrate shared between C4 and C5; (c) MAVLink signing handshake. Each has a concrete mitigation already wired into the architecture.
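
For the hardware-tied engine cache in area (a), the R04 gate can be pictured as two checks before deserialising: the (SM, JetPack, TensorRT) tuple parsed from the self-describing filename must match the host, and the file content must match a recorded SHA-256. The sketch below assumes how version numbers are spelled inside the filename (digits and dots) and uses hypothetical function names; only the filename template and the SHA-256 gate come from the register.

```python
import hashlib
import re
from typing import NamedTuple


class EngineTuple(NamedTuple):
    sm: str
    jetpack: str
    tensorrt: str


# Assumed encoding of <model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine;
# the exact spelling of version fields is not given in the source.
_ENGINE_RE = re.compile(
    r"^(?P<model>.+?)__sm(?P<sm>\d+)_jp(?P<jp>[\d.]+)"
    r"_trt(?P<trt>[\d.]+)_(?P<precision>[A-Za-z0-9]+)\.engine$"
)


def parse_engine_name(filename: str) -> EngineTuple:
    """Extract the hardware/software tuple from a self-describing filename."""
    m = _ENGINE_RE.match(filename)
    if m is None:
        raise ValueError(f"not a self-describing engine name: {filename!r}")
    return EngineTuple(m["sm"], m["jp"], m["trt"])


def may_deserialise(filename: str, content: bytes,
                    expected_sha256: str, host: EngineTuple) -> bool:
    """Refuse on tuple mismatch first, then on content-hash mismatch."""
    if parse_engine_name(filename) != host:
        return False
    return hashlib.sha256(content).hexdigest() == expected_sha256


if __name__ == "__main__":
    host = EngineTuple("87", "6.2", "10.3")  # pinned CI image from the register
    name = "example__sm87_jp6.2_trt10.3_fp16.engine"  # hypothetical filename
    blob = b"placeholder engine bytes"
    print(may_deserialise(name, blob, hashlib.sha256(blob).hexdigest(), host))
```

Checking the tuple before hashing keeps the cheap rejection path first; the hash then guards against a correctly named but corrupted or substituted file.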

Schedule / Data Risks (R11)

Single risk: D-PROJ-3 multi-flight fixture acquisition deferred. Validation strategy adjusted from literal-count to Monte-Carlo-with-CI. Carryforward to next cycle.

Resource Risks (R12)

Single deployment camera. Accepted with bench Jetson + HQ engine archive as the contingency.

External Risks (R01, R08, R09)

All three depend on parent-suite work — D-PROJ-2 (#1 ingest endpoint, #2 voting layer) plus the existing satellite-provider GET surface. Cross-workspace dependency tracked in _docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md.

Detailed Risk Analysis

R02: ADR-004 process-isolation invariant broken by accidental CMake link of C11

Description: The architecture's primary defence against in-air outbound writes to satellite-provider is process-level isolation — the C11 Tile Manager (both TileDownloader and TileUploader) is excluded from the airborne CMake target so the airborne image cannot load that code path even via reflection or config error. A regression that adds a c11_tilemanager/ import path to a shared module the airborne binary depends on would silently re-introduce the upload code into airborne memory, defeating ADR-004.

Trigger conditions:

  • A future refactor moves a helper used by C11 into a shared module.
  • A new feature accidentally adds c11_tilemanager to the airborne CMake target list.
  • A reflection-based plugin loader scans all src/components/ and instantiates whatever it finds.

Affected components: C11 (the component excluded from the airborne build), the airborne production-binary build, the runtime composition root.

Mitigation strategy:

  1. CI security gate: SBOM diff fails if any c11_tilemanager/ symbol appears in the airborne production-binary artifact (extension of ADR-002's existing per-implementation enforcement).
  2. Runtime self-check in runtime_root.py: if any c11_tilemanager.* module is importable from inside the airborne process, panic at startup before opening the FC adapter.
  3. Test gate (NFT-SEC-02): explicit "egress test" verifies the airborne process cannot reach satellite-provider over the network.
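
The runtime self-check in step 2 can be approximated with the standard import machinery: ask whether the forbidden package is findable at all, and refuse to start if it is. This is a minimal sketch under assumptions, not the contents of runtime_root.py. The package name is taken from the c11_tilemanager/ path in the text, and `IsolationViolation`, `importable`, and `assert_airborne_isolation` are hypothetical names.

```python
import importlib.util
from typing import Iterable, List

# Assumed top-level package name, inferred from the c11_tilemanager/ path.
FORBIDDEN_PACKAGES = ("c11_tilemanager",)


class IsolationViolation(RuntimeError):
    """Hypothetical error type for a failed startup self-check."""


def importable(names: Iterable[str]) -> List[str]:
    """Return the top-level package names the import system can locate."""
    return [n for n in names if importlib.util.find_spec(n) is not None]


def assert_airborne_isolation(names: Iterable[str] = FORBIDDEN_PACKAGES) -> None:
    """Raise before any adapter is opened if a forbidden package is present."""
    found = importable(names)
    if found:
        raise IsolationViolation(f"forbidden packages importable: {found}")
```

`importlib.util.find_spec` locates a top-level package without executing it, so the check itself does not load the upload code path it is guarding against.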

Contingency plan: If the runtime self-check fires in flight, the airborne process refuses to publish GPS_INPUT / MSP2_SENSOR_GPS and alerts via STATUSTEXT — defaulting back to FC IMU-only (AC-5.2 fallback path).

Residual risk after mitigation: Low.

Documents updated: architecture.md ADR-004 (extended to cover both TileDownloader and TileUploader); architecture.md ADR-002 (SBOM diff scope); tests/security-tests.md NFT-SEC-02.

R03: MAVLink 2.0 per-flight signing key handshake has no production-deployed precedent


Description: D-C8-9 selects MAVLink 2.0 message signing (option d) on the AP wired channel with per-flight ephemeral keys. The AP firmware supports it but no flight-deployed system the project has reviewed runs it in a per-flight rotation pattern; the contract for the handshake at takeoff is project-defined.

Trigger conditions: ArduPilot Plane firmware behaviour differs from documentation; the handshake fails on a real airframe.

Affected components: C8 (per-FC adapter), F2 (takeoff load).

Mitigation strategy:

  1. IT-3 ArduPilot SITL validation gate — IT-3 must pass before D-C8-9 is locked. The test exercises full handshake + signed-channel + key rotation in SITL.
  2. D-C8-2-FALLBACK options recorded (per ADR-008): (a) operator-manual RC aux switch with relaxed AC-NEW-2 wording; (b) operator-warning STATUSTEXT instead of automated switch; (c) escalate to ArduPilot dev community.
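
For orientation, the signed channel that IT-3 exercises rests on two primitives: a 32-byte secret key, and a 48-bit signature formed from the first six bytes of a SHA-256 over the key, the packet, the link ID, and a 48-bit timestamp. The sketch below shows only those primitives as described by the MAVLink 2 signing scheme. It is not the project's handshake contract, the function names are hypothetical, and a real deployment would rely on a MAVLink library rather than hand-rolled hashing.

```python
import hashlib
import secrets
import struct


def new_flight_key() -> bytes:
    """Generate a fresh 32-byte secret key for one flight (illustrative)."""
    return secrets.token_bytes(32)


def signature48(secret_key: bytes, packet_without_signature: bytes,
                link_id: int, timestamp_10us: int) -> bytes:
    """First 48 bits of SHA-256(key + packet + link id + timestamp).

    packet_without_signature is header + payload + CRC. The timestamp is a
    48-bit little-endian count of 10-microsecond units.
    """
    if len(secret_key) != 32:
        raise ValueError("MAVLink 2 signing uses a 32-byte secret key")
    if not 0 <= link_id <= 255:
        raise ValueError("link_id must fit in one byte")
    if not 0 <= timestamp_10us < 2 ** 48:
        raise ValueError("timestamp must fit in 48 bits")
    ts = struct.pack("<Q", timestamp_10us)[:6]
    digest = hashlib.sha256(
        secret_key + packet_without_signature + bytes([link_id]) + ts
    ).digest()
    return digest[:6]
```

Because the key is an input to every signature, rotating it per flight invalidates any signature an attacker could replay from an earlier flight, which is the property the per-flight rotation pattern depends on.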

Contingency plan: If IT-3 fails, fall back to (a) per ADR-008 and accept the relaxed AC-NEW-2 wording for this cycle. This is a documented planned fallback, not an emergency response.

Residual risk after mitigation: Medium-Low until IT-3 passes; Low once IT-3 is green.

Documents updated: architecture.md ADR-008; components/10_c8_fc_adapter/description.md § 5 (Error Handling); tests/integration-tests.md IT-3.


R11: AC-NEW-4 / AC-NEW-7 reduced-confidence statistical headroom

Description: AC-NEW-4 (per-flight pose-error CDF) and AC-NEW-7 (cross-flight cache-poisoning safety budget) were originally framed as ≥100-flight statistical claims. With current data limited to a single Derkachi flight + 60 still images, the project cannot meet the literal bound and instead validates with Monte-Carlo over the current corpus, with stated CIs.

Trigger conditions: Reviewer or operator interprets the original AC text strictly; the relaxed Monte-Carlo wording is not formally accepted upstream.

Affected components: validation strategy across NFT-PERF-01, NFT-LIM-03, NFT-SEC-01.

Mitigation strategy:

  1. AC text already revised on 2026-05-09 to "Monte-Carlo over currently-available data corpus with stated CI" (per acceptance_criteria.md revision note).
  2. D-PROJ-3 (multi-flight fixture acquisition) is the carryforward: AerialVL S03 + Maxar Open Data Ukraine + own multi-flight data when next cycle resources permit.
  3. NFT specs explicitly mark this as PARTIAL coverage with traceability flag in the matrix.
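
One common way to attach a stated CI to a small corpus is a percentile bootstrap, which resamples the available measurements with replacement and reads the interval off the resampled statistic. The source does not say which Monte-Carlo procedure the project uses, so the sketch below is an assumption offered for illustration; `bootstrap_ci` is a hypothetical name and the sample values in the usage are made up.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def bootstrap_ci(samples: Sequence[float],
                 stat: Callable[[Sequence[float]], float] = statistics.mean,
                 n_resamples: int = 2000,
                 confidence: float = 0.95,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for `stat` over `samples`."""
    if not samples:
        raise ValueError("need at least one sample")
    rng = random.Random(seed)  # fixed seed keeps the reported CI reproducible
    n = len(samples)
    estimates = sorted(
        stat(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lo = estimates[int(alpha * (n_resamples - 1))]
    hi = estimates[int((1.0 - alpha) * (n_resamples - 1))]
    return lo, hi


if __name__ == "__main__":
    pose_errors_m = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up values, not project data
    print(bootstrap_ci(pose_errors_m))
```

With a single-flight corpus the interval will be wide, and resampling cannot add independent flights; that limitation is the residual risk this entry records.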

Contingency plan: If the relaxed wording is rejected, the project escalates D-PROJ-3 to a blocking pre-cycle dep and pauses validation NFTs until multi-flight data is acquired.

Residual risk after mitigation: Medium (acknowledged headroom; acceptable for this cycle but tracked).

Documents updated: 00_problem/acceptance_criteria.md revision note; tests/traceability-matrix.md; 01_solution/solution.md Plan-phase carryforward section.

Architecture / Component Changes Applied (this iteration)

| Risk ID | Document Modified | Change Description |
|---|---|---|
| R02 | architecture.md ADR-004 | Extended scope: process isolation now excludes both C11 download and upload code paths from the airborne image; runtime self-check requirement noted |
| R02 | architecture.md ADR-002 | SBOM diff already covers per-implementation linking; explicit C11 boundary added to enforcement scope |
| R14 | components/03_c2_5_rerank/description.md § 1 + § 8 | LightGlueRuntime ownership moved from C3 to the shared helper; C2.5's "Must be implemented after" no longer references C3 directly |

(R03, R04, R05, R06, R07, R08, R10, R13 already had mitigations wired into the architecture/component specs in earlier iterations of Plan; this register only cross-references them.)

Summary

Total risks identified: 14

Critical: 0 | High: 4 (R01, R03, R04, R11) | Medium: 7 (R02, R05, R06, R07, R08, R10, R12) | Low: 3 (R09, R13, R14)

Risks mitigated this iteration: 2 net-new (R02 enforcement scope extended; R14 resolved via helper ownership)

Risks requiring user decision: none in this iteration. R03 is gated by IT-3 (a future test event); R11 is the documented D-PROJ-3 carryforward already accepted earlier in the cycle.