mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-21 23:31:13 +00:00
64542d32fc
Transitioned the autodev state to phase 21, reflecting the completion of Step 5 and the drafting of Step 6 epics. Revised the architecture documentation to clarify the roles of the Tile Manager and its components, ensuring accurate representation of the system's operational flow. Updated glossary entries for Flight State and Operator to incorporate recent changes and enhance clarity on component interactions and responsibilities.
142 lines
14 KiB
Markdown
# Risk Assessment — gps-denied-onboard Plan cycle — Iteration 1

**Date**: 2026-05-09
**Scope**: technical / schedule / external risks identified during the 4a evaluator pass over architecture, system flows, data model, deployment, and the 14-component decomposition.
**Iteration policy**: if the user requests another round, this file becomes `risk_mitigations_02.md` and so on.

## Risk Scoring Matrix

| | Low Impact | Medium Impact | High Impact |
|--|------------|---------------|-------------|
| **High Probability** | Medium | High | Critical |
| **Medium Probability** | Low | Medium | High |
| **Low Probability** | Low | Low | Medium |

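The matrix can be read as a plain lookup from (probability, impact) to a score. The sketch below transcribes it cell for cell; the names `Level`, `SCORE_MATRIX`, and `risk_score` are illustrative and do not come from the project code.

```python
from enum import IntEnum


class Level(IntEnum):
    """Three-step scale used for both probability and impact (hypothetical name)."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2


# Rows are probability; each tuple is indexed by impact (LOW, MEDIUM, HIGH).
# Values are transcribed from the Risk Scoring Matrix table.
SCORE_MATRIX = {
    Level.HIGH: ("Medium", "High", "Critical"),
    Level.MEDIUM: ("Low", "Medium", "High"),
    Level.LOW: ("Low", "Low", "Medium"),
}


def risk_score(probability: Level, impact: Level) -> str:
    """Return the matrix score for a probability/impact pair."""
    return SCORE_MATRIX[probability][impact]
```

For example, `risk_score(Level.LOW, Level.HIGH)` yields `"Medium"`, which is the cell the register uses for R02 and R07.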
## Acceptance Criteria by Risk Level

| Level | Action Required |
|-------|----------------|
| Low | Accepted, monitored quarterly |
| Medium | Mitigation plan required before implementation |
| High | Mitigation + contingency plan required, reviewed weekly |
| Critical | Must be resolved before proceeding to next planning step |

## Risk Register

| ID | Risk | Category | Probability | Impact | Score | Mitigation | Owner | Status |
|----|------|----------|-------------|--------|-------|------------|-------|--------|
| R01 | D-PROJ-2 ingest endpoint not yet implemented service-side; F10 post-landing upload cannot reach a real `satellite-provider` until parent-suite work lands | External | High | Medium | High | C11 `TileUploader` keeps batches in pending-upload journal; e2e `mock-suite-sat-service` fixture exercises the contract in tests; leftover file tracks the cross-workspace dep; onboard release does not block on D-PROJ-2 | Parent suite | Mitigated |
| R02 | ADR-004 process-isolation invariant broken by accidental CMake link of C11 into the airborne image | Technical | Low | High | Medium | CI SBOM diff explicitly fails if any `c11_tilemanager/` symbol appears in the airborne `production-binary` artifact (see ADR-002 + CI security gate); runtime check in `runtime_root.py` panics if `c11_tilemanager` module imports non-test paths | Onboard team | Mitigated |
| R03 | MAVLink 2.0 per-flight signing key handshake on AP wired channel (D-C8-9 = (d)) has no production-deployed precedent | Technical | Medium | High | High | IT-3 ArduPilot SITL validation gate; D-C8-2-FALLBACK options recorded ((a) operator-manual RC aux + relaxed AC-NEW-2; (b) STATUSTEXT instead of automated switch; (c) escalate to ArduPilot dev community) per ADR-008 | Onboard team | Open (gated by IT-3) |
| R04 | TensorRT engine cache hardware-tied to SM 87; deserialising on a different SM corrupts inference silently | Technical | Medium | High | High | D-C10-7 self-describing filename `<model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine` + D-C10-3 SHA-256 content-hash gate at F2 takeoff refuses deserialise on tuple mismatch; CI Tier-2 runs on the pinned JetPack 6.2 / TRT 10.3 / SM 87 image | Onboard team | Mitigated |
| R05 | iSAM2 numerical instability silently swallows factor-add failures, producing wrong covariance | Technical | Medium | Medium | Medium | C5 logs every `add_*` call's success/failure; `EstimatorHealth.cov_norm_growing_for_s` monotonicity check feeds the AC-NEW-8 spoof gate; `EstimatorFatalError` triggers AC-5.2 (3 s no estimate → FC IMU-only) | Onboard team | Mitigated |
| R06 | VPR top-1 false positive: visually-similar but geographically-wrong tile ranks above the true match | Technical | Medium | Medium | Medium | C2.5 inlier-count rerank narrows K=10 → N=3; C3 RANSAC + reprojection residual filter; C3.5 conditional AdHoP refinement on hard frames; D-CROSS-LATENCY-1 hybrid keeps the budget under thermal throttle. Cross-flight cache-poisoning safety budget (AC-NEW-7) gates downstream | Onboard team | Mitigated |
| R07 | FC GPS source promoted back from spoofed → trusted prematurely (AC-NEW-2 / AC-NEW-8 violation) | Technical | Low | High | Medium | C5 `SourceLabelStateMachine`: never re-promote until ≥10 s of `gps_health == STABLE_NON_SPOOFED` AND next satellite-anchored frame agrees with FC GPS within configurable tolerance; every reject logged to FDR + GCS STATUSTEXT (ADR-008) | Onboard team | Mitigated |
| R08 | Tile freshness drift inside `active_conflict` sectors between download and flight time (AC-NEW-6 violation in stale theatre) | External | Medium | Medium | Medium | C11 `TileDownloader` applies sector-classified freshness gate at fetch (reject in `active_conflict`, downgrade-no-`satellite_anchored`-label in `stable_rear`); `DownloadBatchReport` surfaces stale-rejection counts so the operator can re-pull. Manifest-hash idempotence (D-C10-1) makes re-runs cheap | Operator + onboard team | Mitigated |
| R09 | Per-flight onboard signing key compromise: a captured companion lets an attacker poison `satellite-provider` ingest | External | Low | Medium | Low | Per-flight ephemeral keys deleted on flight-ring rollover (≥30 days post-landing default); parent-suite voting layer (D-PROJ-2 design task #2) requires multi-companion agreement before promoting `pending → trusted`; operator can revoke a compromised `companion_id` | Parent suite + operator | Mitigated (depends on D-PROJ-2 #2) |
| R10 | C5 covariance recovery (`Marginals.marginalCovariance`) exceeds AC-4.1 latency budget under thermal throttle | Technical | Medium | Medium | Medium | D-CROSS-LATENCY-1 hybrid: `ThermalState` from C7 triggers C4 to switch from MARGINALS → JACOBIAN per-frame; ~5–10% accuracy loss accepted under throttle, never on the steady-state path (ADR-006). Once thermal returns below threshold, switch back next frame | Onboard team | Mitigated |
| R11 | AC-NEW-4 / AC-NEW-7 multi-flight statistical headroom is reduced-confidence with the current single-flight (Derkachi) fixture (D-PROJ-3 deferred) | Schedule / Data | High | Medium | High | Validation requirement relaxed from "≥100 flights" literal to Monte-Carlo-with-stated-CI over current corpus; multi-flight statistical residual risk recorded in this register; D-PROJ-3 carryforward to next cycle (Maxar Open Data Ukraine + AerialVL S03 + own multi-flight data) | Project lead | Open (carryforward) |
| R12 | Single deployment camera (`adti20`, one unit available) becomes a single-point-of-failure for Tier-2 NFTs | Resource | Medium | Medium | Medium | D-PROJ-1 hybrid calibration acquired per-unit; bench Jetson at HQ retains a copy of every successfully-built engine cache (D-C10-8); IT-12 comparative study uses static fixtures so it is camera-unit-agnostic. If the deployed unit fails, the cache rebuilds on a replacement unit | Onboard team | Accepted |
| R13 | C13 FDR queue overrun on a sustained burst (e.g., F4 mid-flight tile gen + per-frame estimates + IMU traces) | Technical | Low | Medium | Low | Per-producer drop-oldest queues; rollover event itself is always logged (`FdrQueueOverrunError` writes a structured "overrun" record with producer-id and dropped count); NFT-LIM-02 (8 h synthetic AC-NEW-3) validates aggregate throughput | Onboard team | Mitigated |
| R14 | Apparent C2.5 ↔ C3 build-time circular dependency around the LightGlue runtime | Technical | Low | Low | Low | Resolved: ownership of `LightGlueRuntime` moved to the shared helper (`_docs/02_document/common-helpers/03_helper_lightglue_runtime.md`); both C2.5 and C3 are sibling consumers; data flow remains C2.5 → C3 (one-way). C2.5 spec § 8 updated to reference the helper, not C3 | Onboard team | Mitigated (this iteration) |

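The R07 re-promotion rule combines a dwell time with an agreement check. The sketch below is one minimal way to express it, not the actual C5 `SourceLabelStateMachine`. The class name `RepromotionGate`, its method names, and the use of metres for the tolerance are assumptions made for illustration. Only the 10 s threshold and the `STABLE_NON_SPOOFED` value come from the register.

```python
from dataclasses import dataclass
from typing import Optional

MIN_STABLE_S = 10.0  # ">= 10 s" dwell from the R07 mitigation


@dataclass
class RepromotionGate:
    """Hypothetical helper: decides whether FC GPS may be trusted again."""
    tolerance_m: float  # configurable tolerance; the unit is an assumption
    stable_since_s: Optional[float] = None

    def observe_health(self, t_s: float, gps_health: str) -> None:
        """Track the start of the current uninterrupted stable window."""
        if gps_health == "STABLE_NON_SPOOFED":
            if self.stable_since_s is None:
                self.stable_since_s = t_s
        else:
            # Any other health value resets the dwell timer.
            self.stable_since_s = None

    def may_repromote(self, t_s: float, anchor_vs_gps_error_m: float) -> bool:
        """Both conditions must hold: dwell time met AND anchor agrees with GPS."""
        if self.stable_since_s is None:
            return False
        stable_long_enough = (t_s - self.stable_since_s) >= MIN_STABLE_S
        return stable_long_enough and anchor_vs_gps_error_m <= self.tolerance_m
```

Resetting the timer on any non-stable sample keeps the gate conservative: a single unhealthy reading forces a full new 10 s window before re-promotion is considered.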
## Risk Categories

### Technical Risks (R02, R03, R04, R05, R06, R07, R10, R13, R14)

Concentrated in three areas: (a) the GPU/TRT stack and its hardware-tied engine cache; (b) the GTSAM substrate shared between C4 and C5; (c) MAVLink signing handshake. Each has a concrete mitigation already wired into the architecture.

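For area (a), the R04 mitigation pairs the self-describing filename with a content-hash gate. The sketch below shows how such a gate could be checked before deserialising. It assumes the `<SM>`, `<JP>`, and `<TRT>` fields are underscore-free strings such as `87`, `6.2`, and `10.3`; the function names and the model name `examplemodel` are hypothetical.

```python
import hashlib
import re
from typing import NamedTuple, Tuple


class EngineTag(NamedTuple):
    model: str
    sm: str
    jetpack: str
    tensorrt: str
    precision: str


# Mirrors <model>__sm<SM>_jp<JP>_trt<TRT>_<precision>.engine from the register.
# The exact field encoding is an assumption of this sketch.
_ENGINE_NAME = re.compile(
    r"^(?P<model>.+)__sm(?P<sm>[^_]+)_jp(?P<jetpack>[^_]+)"
    r"_trt(?P<tensorrt>[^_]+)_(?P<precision>[^_.]+)\.engine$"
)


def parse_engine_filename(name: str) -> EngineTag:
    """Split a self-describing engine filename into its fields."""
    match = _ENGINE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a self-describing engine filename: {name!r}")
    return EngineTag(**match.groupdict())


def may_deserialise(
    name: str,
    payload: bytes,
    host: Tuple[str, str, str],  # (SM, JetPack, TensorRT) of the running device
    expected_sha256: str,
) -> bool:
    """Refuse on any tuple mismatch or content-hash mismatch."""
    tag = parse_engine_filename(name)
    if (tag.sm, tag.jetpack, tag.tensorrt) != host:
        return False
    return hashlib.sha256(payload).hexdigest() == expected_sha256


# Usage with placeholder bytes standing in for a real engine file.
example_name = "examplemodel__sm87_jp6.2_trt10.3_fp16.engine"
example_payload = b"placeholder engine bytes"
example_digest = hashlib.sha256(example_payload).hexdigest()
ok = may_deserialise(example_name, example_payload, ("87", "6.2", "10.3"), example_digest)
```

Checking the tuple before hashing means an engine built for a different SM is rejected without reading its contents.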
### Schedule / Data Risks (R11)

Single risk: D-PROJ-3 multi-flight fixture acquisition deferred. Validation strategy adjusted from literal-count to Monte-Carlo-with-CI. Carryforward to next cycle.

### Resource Risks (R12)

Single deployment camera. Accepted with bench Jetson + HQ engine archive as the contingency.

### External Risks (R01, R08, R09)

All three depend on parent-suite work — D-PROJ-2 (#1 ingest endpoint, #2 voting layer) plus the existing `satellite-provider` GET surface. Cross-workspace dependency tracked in `_docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md`.

## Detailed Risk Analysis

### R02: ADR-004 process-isolation invariant broken by accidental airborne C11 link

**Description**: The architecture's primary defence against in-air outbound writes to `satellite-provider` is **process-level isolation** — the C11 Tile Manager (both `TileDownloader` and `TileUploader`) is excluded from the airborne CMake target so the airborne image cannot load that code path even via reflection or config error. A regression that adds a `c11_tilemanager/` import path to a shared module the airborne binary depends on would silently re-introduce the upload code into airborne memory, defeating ADR-004.

**Trigger conditions**:
- A future refactor moves a helper used by C11 into a shared module.
- A new feature accidentally adds `c11_tilemanager` to the airborne CMake target list.
- A reflection-based plugin loader scans all `src/components/` and instantiates whatever it finds.

**Affected components**: C11 (the component excluded from the airborne image), the airborne `production-binary` build, the runtime composition root.

**Mitigation strategy**:
1. CI security gate: SBOM diff fails if any `c11_tilemanager/` symbol appears in the airborne `production-binary` artifact (extension of ADR-002's existing per-implementation enforcement).
2. Runtime self-check in `runtime_root.py`: if any `c11_tilemanager.*` module is importable from inside the airborne process, panic at startup before opening the FC adapter.
3. Test gate (NFT-SEC-02): explicit "egress test" verifies the airborne process cannot reach `satellite-provider` over the network.

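The runtime self-check in step 2 can be approximated with the standard library's import machinery. This is a sketch under assumptions, not the contents of `runtime_root.py`: the names `IsolationViolation`, `package_importable`, and `assert_airborne_isolation` are hypothetical, and "panic" is modelled as raising an exception that the composition root would leave unhandled.

```python
import importlib.util
import sys

FORBIDDEN_PACKAGE = "c11_tilemanager"  # package name taken from the register


class IsolationViolation(RuntimeError):
    """Raised when ground-only code is reachable from the airborne process."""


def package_importable(package: str) -> bool:
    """True if the top-level package is loaded or could be imported.

    find_spec only locates a top-level package; it does not execute it.
    """
    if package in sys.modules:
        return True
    return importlib.util.find_spec(package) is not None


def assert_airborne_isolation(package: str = FORBIDDEN_PACKAGE) -> None:
    """Intended to run at startup, before the FC adapter is opened."""
    if package_importable(package):
        raise IsolationViolation(
            f"{package!r} is importable in the airborne process (ADR-004)"
        )
```

The check only covers the Python import path. It complements the CI SBOM diff and the NFT-SEC-02 egress test and does not replace them.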
**Contingency plan**: If the runtime self-check fires in flight, the airborne process refuses to publish `GPS_INPUT` / `MSP2_SENSOR_GPS` and alerts via STATUSTEXT — defaulting back to FC IMU-only (AC-5.2 fallback path).

**Residual risk after mitigation**: Low.

**Documents updated**: `architecture.md` ADR-004 (extended to cover both `TileDownloader` and `TileUploader`); `architecture.md` ADR-002 (SBOM diff scope); `tests/security-tests.md` NFT-SEC-02.

---

### R03: MAVLink 2.0 per-flight signing handshake (D-C8-9 = (d)) has no production precedent

**Description**: D-C8-9 selects MAVLink 2.0 message signing (option d) on the AP wired channel with per-flight ephemeral keys. The AP firmware supports it but no flight-deployed system the project has reviewed runs it in a per-flight rotation pattern; the contract for the handshake at takeoff is project-defined.

**Trigger conditions**: ArduPilot Plane firmware behaviour differs from documentation; the handshake fails on a real airframe.

**Affected components**: C8 (per-FC adapter), F2 (takeoff load).

**Mitigation strategy**:
1. **IT-3 ArduPilot SITL validation gate** — IT-3 must pass before D-C8-9 is locked. The test exercises full handshake + signed-channel + key rotation in SITL.
2. **D-C8-2-FALLBACK options recorded** (per ADR-008): (a) operator-manual RC aux switch with relaxed AC-NEW-2 wording; (b) operator-warning STATUSTEXT instead of automated switch; (c) escalate to ArduPilot dev community.

**Contingency plan**: If IT-3 fails, fall back to (a) per ADR-008 and accept the relaxed AC-NEW-2 wording for this cycle. This is a documented planned fallback, not an emergency response.

**Residual risk after mitigation**: Medium-Low until IT-3 passes; Low once IT-3 is green.

**Documents updated**: `architecture.md` ADR-008; `components/10_c8_fc_adapter/description.md` § 5 (Error Handling); `tests/integration-tests.md` IT-3.

---

### R11: AC-NEW-4 / AC-NEW-7 reduced-confidence statistical headroom

**Description**: AC-NEW-4 (per-flight pose-error CDF) and AC-NEW-7 (cross-flight cache-poisoning safety budget) were originally framed as ≥100-flight statistical claims. With current data limited to a single Derkachi flight + 60 still images, the project cannot meet the literal bound and instead validates with Monte-Carlo over the current corpus, with stated CIs.

**Trigger conditions**: Reviewer or operator interprets the original AC text strictly; the relaxed Monte-Carlo wording is not formally accepted upstream.

**Affected components**: validation strategy across NFT-PERF-01, NFT-LIM-03, NFT-SEC-01.

**Mitigation strategy**:
1. AC text already revised on 2026-05-09 to "Monte-Carlo over currently-available data corpus with stated CI" (per `acceptance_criteria.md` revision note).
2. D-PROJ-3 (multi-flight fixture acquisition) is the carryforward: AerialVL S03 + Maxar Open Data Ukraine + own multi-flight data when next cycle resources permit.
3. NFT specs explicitly mark this as PARTIAL coverage with traceability flag in the matrix.

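The register does not specify which Monte-Carlo procedure produces the stated CI. One common choice, shown here purely as an assumption, is a percentile bootstrap over the per-sample errors in the current corpus. The function name `bootstrap_ci` and the placeholder error values are illustrative; they are not project data.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def bootstrap_ci(
    samples: Sequence[float],
    statistic: Callable[[List[float]], float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for `statistic`."""
    rng = random.Random(seed)  # fixed seed so a stated CI is reproducible
    n = len(samples)
    estimates = sorted(
        statistic([samples[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = estimates[int(alpha * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


# Placeholder values only: 60 synthetic "errors" standing in for a real corpus.
placeholder_errors = [float(i) for i in range(1, 61)]
ci_low, ci_high = bootstrap_ci(placeholder_errors, statistics.median)
```

A bootstrap over a single flight quantifies sampling noise within that flight only. It cannot capture flight-to-flight variation, which is why the residual risk remains tracked under D-PROJ-3.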
**Contingency plan**: If the relaxed wording is rejected, the project escalates D-PROJ-3 to a blocking pre-cycle dep and pauses validation NFTs until multi-flight data is acquired.

**Residual risk after mitigation**: Medium (acknowledged headroom; acceptable for this cycle but tracked).

**Documents updated**: `00_problem/acceptance_criteria.md` revision note; `tests/traceability-matrix.md`; `01_solution/solution.md` Plan-phase carryforward section.

## Architecture / Component Changes Applied (this iteration)

| Risk ID | Document Modified | Change Description |
|---------|------------------|--------------------|
| R02 | `architecture.md` ADR-004 | Extended scope: process isolation now excludes both C11 download and upload code paths from the airborne image; runtime self-check requirement noted |
| R02 | `architecture.md` ADR-002 | SBOM diff already covers per-implementation linking; explicit C11 boundary added to enforcement scope |
| R14 | `components/03_c2_5_rerank/description.md` § 1 + § 8 | LightGlueRuntime ownership moved from C3 to the shared helper; C2.5's "Must be implemented after" no longer references C3 directly |

(R03, R04, R05, R06, R07, R08, R10, R13 already had mitigations wired into the architecture/component specs in earlier iterations of Plan; this register only cross-references them.)

## Summary

**Total risks identified**: 14
**Critical**: 0 | **High**: 4 (R01, R03, R04, R11) | **Medium**: 7 (R02, R05, R06, R07, R08, R10, R12) | **Low**: 3 (R09, R13, R14)
**Risks mitigated this iteration**: 2 net-new (R02 enforcement scope extended; R14 resolved via helper ownership)
**Risks requiring user decision**: none in this iteration. R03 is gated by IT-3 (a future test event); R11 is the documented D-PROJ-3 carryforward already accepted earlier in the cycle.