Risk Assessment — Architecture Review — Iteration 01

Evaluator Pass Summary

| Check | Result | Notes |
|---|---|---|
| Single Responsibility | Pass | Components each own one primary concern: ingest, VIO, safety, Satellite Service sync/retrieval, verification, Tile Manager storage/generation, MAVLink, FDR, validation |
| Dumb Code / Smart Data | Pass | Complex behavior is mostly expressed through DTOs, mode labels, covariance fields, manifests, and gates |
| Interface Consistency | Pass with fix | Safety wrapper no longer directly depends on Tile Manager for anchor acceptance; cache freshness/provenance travels through AnchorDecision |
| Circular Dependencies | Pass with caution | Runtime flow is acyclic at component ownership level; MAVLink remains a bidirectional protocol adapter but owns no localization policy |
| Missing Interactions | Pass | Pre-VIO occlusion, IMU-only blackout, relocalization, tile writes, FDR, and SITL validation are all represented |
| Security Considerations | Pass | Signed cache sidecars, source/system ID checks, spoofing rejection, and no in-flight satellite-provider or Satellite Service access are covered |
| Performance Bottlenecks | Pass | Jetson latency, VPR/local matching, FDR append pressure, PostgreSQL availability, and thermal limits are identified |
| API Contracts | Pass | Core DTO handoffs are documented: FramePacket, VioStatePacket, AnchorDecision, PositionEstimate, FdrEvent |

Risk Scoring Matrix

| | Low Impact | Medium Impact | High Impact |
|---|---|---|---|
| High Probability | Medium | High | Critical |
| Medium Probability | Low | Medium | High |
| Low Probability | Low | Low | Medium |

Acceptance Criteria by Risk Level

| Level | Action Required |
|---|---|
| Low | Accepted and monitored |
| Medium | Mitigation plan required before implementation |
| High | Mitigation + contingency plan required, reviewed during implementation |
| Critical | Must be resolved before proceeding to next planning step |
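
The scoring matrix and acceptance criteria above can be encoded as a small lookup so that register entries are checked mechanically. This is a minimal sketch; the names `Level`, `SCORE_MATRIX`, `ACTION`, and `risk_score` are illustrative and do not come from the project.

```python
from enum import IntEnum


class Level(IntEnum):
    """Ordinal scale shared by probability and impact (illustrative name)."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2


# Rows are probability, columns are impact (Low, Medium, High),
# copied from the Risk Scoring Matrix.
SCORE_MATRIX = {
    Level.HIGH: ("Medium", "High", "Critical"),
    Level.MEDIUM: ("Low", "Medium", "High"),
    Level.LOW: ("Low", "Low", "Medium"),
}

# Required action per score, copied from Acceptance Criteria by Risk Level.
ACTION = {
    "Low": "Accepted and monitored",
    "Medium": "Mitigation plan required before implementation",
    "High": "Mitigation + contingency plan required, reviewed during implementation",
    "Critical": "Must be resolved before proceeding to next planning step",
}


def risk_score(probability: Level, impact: Level) -> str:
    """Return the score label for a probability/impact pair."""
    return SCORE_MATRIX[probability][impact]


if __name__ == "__main__":
    score = risk_score(Level.MEDIUM, Level.HIGH)
    print(score, "->", ACTION[score])
```

For example, a Medium-probability, High-impact risk such as R01 resolves to High, which requires both a mitigation and a contingency plan.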

Risk Register

| ID | Risk | Category | Probability | Impact | Score | Mitigation | Owner | Status |
|---|---|---|---|---|---|---|---|---|
| R01 | ADTi 20MP 20L V1 public specs conflict with planning assumptions for resolution, FPS, lens, interface, and temperature | Technical / External | Medium | High | High | Pin manufacturer datasheet and exact lens/interface before implementation; make camera calibration/spec task a bootstrap blocker | Camera ingest/calibration | Mitigated by gate |
| R02 | BASALT may underperform or lose tracking on nadir fixed-wing low-parallax terrain | Technical | Medium | High | High | Public replay with MUN-FRL/ALTO/Kagaru/EPFL where applicable, representative target replay, OpenVINS reference comparison, Kimera backup path | BASALT VIO adapter | Mitigated by validation |
| R03 | BASALT confidence/covariance may under-report real error | Safety | Medium | High | High | Wrapper owns covariance calibration; compare against ground truth, satellite residuals, and OpenVINS reference; never emit optimistic horiz_accuracy | Safety/anchor wrapper | Mitigated by wrapper design |
| R04 | Total occlusion detector may false-negative and feed unusable frames into VIO | Safety / Technical | Medium | High | High | Conservative pre-VIO occlusion gate, FDR status, tests for total blackout, and fallback to IMU-only dead_reckoned mode | Camera ingest/calibration | Mitigated by spec/test |
| R05 | IMU-only blackout propagation could be trusted too long | Safety | Medium | High | High | Monotonic covariance growth, dead_reckoned label, fix_type=0/horiz_accuracy=999.0 when >30 s or covariance >500 m | Safety/anchor wrapper | Mitigated by AC gate |
| R06 | DINOv2-VLAD + ALIKED/DISK-LightGlue exceeds Jetson latency/memory budget | Performance | Medium | High | High | Trigger-only execution, CPU FAISS first, top-K caps, model profiling, TensorRT only after fidelity checks | Satellite Service / Anchor verification | Mitigated by profiling gates |
| R07 | PostgreSQL/PostGIS local DB is unavailable or too heavy for onboard runtime | Technical / Operational | Medium | High | High | Run local onboard PostgreSQL, health-check before flight, keep large payloads in files, fail mission cache validation if DB unavailable | Tile Manager / FDR | Mitigated by deployment gates |
| R08 | Generated tile cache poisoning corrupts future anchors | Security / Safety | Low | High | Medium | Sigma gate, provenance sidecars, post-flight Satellite Service voting, no direct promotion to trusted basemap | Tile Manager | Mitigated by policy |
| R09 | Public datasets do not cover final target terrain or commercial license needs | External / Schedule | Medium | Medium | Medium | Use public data for de-risking only; representative synchronized target data remains mandatory for acceptance | Validation harness | Mitigated by acceptance rule |
| R10 | MAVLink GPS_INPUT parameters or Plane behavior differs from assumptions | Integration | Medium | High | High | Plane SITL release gate with production parameters, spoofing/failsafe tests, raw field validation with pymavlink | MAVLink/GCS integration | Mitigated by SITL gate |
| R11 | FDR appends or PostgreSQL indexing interferes with hot-path latency | Performance | Medium | Medium | Medium | Append asynchronously, use CBOR payload segments for high-volume data, keep PostgreSQL as event index/query surface | FDR/observability | Mitigated by design |
| R12 | GPL/non-commercial tooling accidentally enters production or acceptance evidence | Legal / Compliance | Low | High | Medium | Keep OpenVINS/ORB-SLAM3 reference-only; license-tag datasets before CI; SuperPoint only after legal approval | Validation harness / Architecture | Mitigated by gates |

Detailed Risk Analysis

R01: Camera Specification Mismatch

Description: Public ADTi pages show 5456 × 3632 stills, 2 fps continuous capture, a Sony E mount, and a -10 to 40 °C operating range. The project needs the exact production lens, camera interface, sustained capture behavior, thermal behavior, and calibration model.

Trigger conditions: Manufacturer documentation or hardware testing contradicts assumed FPS, interface, temperature, or lens characteristics.

Affected components: Camera ingest/calibration, BASALT VIO adapter, validation harness, deployment procedures.

Mitigation strategy:

  1. Make camera specification verification a bootstrap task.
  2. Require a manufacturer datasheet or hardware measurement before the implementation claims 3 fps (public pages list 2 fps) or hot-environment operation (above the published 40 °C limit).
  3. Version calibration data by exact camera/lens/interface.

Contingency plan: Reduce frame rate assumptions, adjust latency tests, or select a different navigation camera/lens/interface.

Residual risk after mitigation: Medium.

Documents updated: glossary.md, architecture.md, components/01_camera_ingest_calibration/description.md, deployment/deployment_procedures.md.


R02: BASALT Nadir Fixed-Wing Fit

Description: BASALT is a strong VIO candidate, but fixed downward cameras over planar terrain can cause low-parallax and texture-degeneracy cases.

Trigger conditions: Public or representative replay shows high drift, frequent tracking loss, or poor initialization.

Affected components: BASALT VIO adapter, safety/anchor wrapper, validation harness.

Mitigation strategy:

  1. Run MUN-FRL first for synchronized nadir camera + IMU + ground truth.
  2. Add ALTO/Kagaru/EPFL slices where available for aerial/fixed-wing realism.
  3. Compare against OpenVINS reference and Kimera backup.

Contingency plan: Keep Kimera backup or build a project-owned fallback estimator around OpenCV + IMU only after replay evidence requires it.

Residual risk after mitigation: Medium.

Documents updated: architecture.md, components/02_basalt_vio_adapter/description.md, tests/test-data.md.


R03: Covariance Under-Reporting

Description: Incorrect confidence is more dangerous than no estimate because the flight controller may trust a false fix.

Trigger conditions: Replay error exceeds reported covariance, or anchors are accepted despite inconsistent residuals.

Affected components: Safety/anchor wrapper, MAVLink/GCS integration, FDR/observability.

Mitigation strategy:

  1. Make wrapper covariance the product authority, not BASALT raw confidence.
  2. Validate calibration against ground truth, satellite residuals, and OpenVINS reference.
  3. Map horiz_accuracy so it never under-reports the 95% semi-major covariance axis.
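
One way to read step 3 is sketched below. It assumes that the "95% semi-major covariance axis" means the semi-major axis of the 95% confidence ellipse of a 2D Gaussian horizontal error, and that the wrapper holds a 2×2 east/north covariance in m². The function name, argument names, and the 0.1 m rounding step are illustrative only.

```python
import math

# 95% quantile of the chi-square distribution with 2 degrees of freedom.
# For 2 DOF the quantile has the closed form -2 * ln(1 - p), about 5.991.
CHI2_95_2DOF = -2.0 * math.log(1.0 - 0.95)


def horiz_accuracy_m(var_e: float, var_n: float, cov_en: float) -> float:
    """Conservative horiz_accuracy (metres) from a 2x2 horizontal covariance.

    var_e, var_n: east/north variances in m^2; cov_en: their covariance.
    The result is rounded UP to 0.1 m so it never under-reports the
    semi-major axis of the 95% confidence ellipse.
    """
    if var_e < 0.0 or var_n < 0.0:
        raise ValueError("variances must be non-negative")
    # Largest eigenvalue of the symmetric matrix [[var_e, cov_en], [cov_en, var_n]].
    mean = 0.5 * (var_e + var_n)
    half_diff = 0.5 * (var_e - var_n)
    lambda_max = mean + math.hypot(half_diff, cov_en)
    semi_major = math.sqrt(CHI2_95_2DOF * lambda_max)
    return math.ceil(semi_major * 10.0) / 10.0


if __name__ == "__main__":
    # 2 m east sigma, 1 m north sigma, uncorrelated.
    print(horiz_accuracy_m(4.0, 1.0, 0.0))
```

Rounding up rather than to nearest is a deliberate choice here: any quantization error then lands on the pessimistic side, consistent with the rule that the wrapper never emits an optimistic horiz_accuracy.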

Contingency plan: Degrade to no-fix sooner and require operator relocalization or mission abort behavior.

Residual risk after mitigation: Medium.

Documents updated: architecture.md, components/03_safety_anchor_wrapper/description.md, tests/blackbox-tests.md.


R04: Total Occlusion Detection Failure

Description: If total occlusion is not detected before VIO, BASALT may receive unusable frames and produce misleading state updates.

Trigger conditions: Lens cover, cloud/whiteout, decode failure, underexposure/overexposure, or textureless frame reaches VIO as usable.

Affected components: Camera ingest/calibration, safety/anchor wrapper, BASALT VIO adapter.

Mitigation strategy:

  1. Camera ingest exposes OcclusionReport and sets usable_for_vio=false for total occlusion/blackout.
  2. Total occlusion bypasses BASALT for that frame.
  3. Safety wrapper switches to IMU-only dead_reckoned propagation with monotonic covariance growth.
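
The pre-VIO gate in steps 1 and 2 might look like the sketch below. Only `OcclusionReport`, `usable_for_vio`, and `detect_occlusion` are named in the project documents; the `reason` field, the flat grayscale input, and every threshold are assumptions chosen for illustration and would need tuning on real imagery.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev
from typing import Optional, Sequence


@dataclass(frozen=True)
class OcclusionReport:
    usable_for_vio: bool
    reason: str  # hypothetical field, not taken from the project documents


def detect_occlusion(
    gray: Optional[Sequence[int]],
    dark_mean: float = 10.0,    # assumed threshold, 8-bit intensity
    bright_mean: float = 245.0, # assumed threshold, 8-bit intensity
    min_std: float = 3.0,       # assumed texture threshold
) -> OcclusionReport:
    """Classify a frame given as a flat sequence of 8-bit gray values.

    Conservative by design: any doubtful frame is marked unusable so that
    it bypasses VIO instead of producing misleading state updates.
    """
    if not gray:
        return OcclusionReport(False, "decode_failure")
    mean = fmean(gray)
    if mean <= dark_mean:
        return OcclusionReport(False, "underexposed_or_covered")
    if mean >= bright_mean:
        return OcclusionReport(False, "overexposed_or_whiteout")
    if pstdev(gray) < min_std:
        return OcclusionReport(False, "textureless")
    return OcclusionReport(True, "ok")


if __name__ == "__main__":
    print(detect_occlusion([0] * 64))          # covered lens
    print(detect_occlusion(list(range(256))))  # textured frame
```

A caller would forward the frame to BASALT only when `usable_for_vio` is true and otherwise hand control to the IMU-only path.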

Contingency plan: Tune detector conservatively and accept temporary false-positive IMU-only degradation over false VIO confidence.

Residual risk after mitigation: Medium.

Documents updated: components/01_camera_ingest_calibration/description.md, components/03_safety_anchor_wrapper/description.md, system-flows.md, diagrams/flows/flow_normal_localization.md, tests/resilience-tests.md.


R05: IMU-Only Mode Over-Trust

Description: IMU-only propagation drifts quickly and must be treated as an emergency bridge, not a long-duration solution.

Trigger conditions: Blackout lasts longer than 30 seconds or the covariance-derived position uncertainty exceeds 500 m.

Affected components: Safety/anchor wrapper, MAVLink/GCS integration, FDR/observability.

Mitigation strategy:

  1. Emit source_label=dead_reckoned during IMU-only mode.
  2. Grow covariance monotonically.
  3. Emit fix_type=0, horiz_accuracy=999.0, and VISUAL_BLACKOUT_FAILSAFE at thresholds.
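
The three steps above can be sketched as a single output function. The thresholds, `dead_reckoned`, `fix_type=0`, `horiz_accuracy=999.0`, and `VISUAL_BLACKOUT_FAILSAFE` come from this document. The function and field names, the reading of "covariance" as a horizontal uncertainty in metres, and the `degraded_fix_type` parameter (the document does not state which fix type is emitted before the thresholds) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

MAX_BLACKOUT_S = 30.0          # from the R05 trigger conditions
MAX_UNCERTAINTY_M = 500.0      # from the R05 trigger conditions
NO_FIX_HORIZ_ACCURACY = 999.0  # from the R05 mitigation


@dataclass(frozen=True)
class ImuOnlyOutput:
    source_label: str
    fix_type: int
    horiz_accuracy: float
    failsafe: Optional[str]
    uncertainty_m: float  # carried forward to enforce monotonic growth


def imu_only_output(
    blackout_s: float,
    predicted_uncertainty_m: float,
    last_uncertainty_m: float,
    degraded_fix_type: int,  # assumption: supplied by the caller
) -> ImuOnlyOutput:
    """Compute the emitted state for one IMU-only propagation step."""
    # Monotonic growth: uncertainty never shrinks while vision is unavailable.
    uncertainty = max(predicted_uncertainty_m, last_uncertainty_m)
    if blackout_s > MAX_BLACKOUT_S or uncertainty > MAX_UNCERTAINTY_M:
        return ImuOnlyOutput(
            "dead_reckoned", 0, NO_FIX_HORIZ_ACCURACY,
            "VISUAL_BLACKOUT_FAILSAFE", uncertainty,
        )
    return ImuOnlyOutput(
        "dead_reckoned", degraded_fix_type, uncertainty, None, uncertainty,
    )


if __name__ == "__main__":
    print(imu_only_output(5.0, 40.0, 50.0, degraded_fix_type=2))
    print(imu_only_output(31.0, 40.0, 50.0, degraded_fix_type=2))
```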

Contingency plan: Stop publishing valid fixes and require relocalization/operator action.

Residual risk after mitigation: Low.

Documents updated: components/03_safety_anchor_wrapper/description.md, system-flows.md, tests/blackbox-tests.md, tests/resilience-tests.md, tests/traceability-matrix.md.


R06: Trigger Path Performance

Description: DINOv2-VLAD and learned local matching can exceed Jetson latency/memory limits.

Trigger conditions: Relocalization exceeds p95 latency, memory budget, or causes thermal throttling.

Affected components: Satellite Service, anchor verification, validation harness.

Mitigation strategy:

  1. Keep VPR/local matching trigger-based.
  2. Use CPU FAISS first and bounded top-K.
  3. Accept optimized engines only after descriptor-fidelity tests pass.
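
The document does not define the descriptor-fidelity test in step 3. One plausible form, sketched below, compares descriptors from the reference model with those from the optimized engine on the same inputs and requires a minimum cosine similarity for every pair. The function names and the 0.99 threshold are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_descriptor_fidelity(
    reference: Sequence[Sequence[float]],
    optimized: Sequence[Sequence[float]],
    min_cosine: float = 0.99,  # hypothetical acceptance threshold
) -> bool:
    """True when every optimized descriptor stays close to its reference."""
    if not reference or len(reference) != len(optimized):
        raise ValueError("descriptor sets must be non-empty and equally sized")
    return all(
        cosine(ref, opt) >= min_cosine for ref, opt in zip(reference, optimized)
    )


if __name__ == "__main__":
    ref = [[1.0, 0.0], [0.0, 1.0]]
    opt = [[0.999, 0.01], [0.02, 0.998]]
    print(passes_descriptor_fidelity(ref, opt))
```

A retrieval-level check, such as top-K overlap between the two engines, would be a reasonable complement because small descriptor shifts can still reorder near-tied candidates.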

Contingency plan: Reduce descriptor resolution/model size, reduce top-K, or fall back to classical features for emergency operation.

Residual risk after mitigation: Medium.

Documents updated: architecture.md, components/04_satellite_retrieval/description.md, components/05_anchor_verification/description.md, tests/performance-tests.md.


R07: Onboard PostgreSQL/PostGIS Availability

Description: PostgreSQL/PostGIS is now the structured metadata store. If local DB availability or resource use is poor, cache/FDR queries may fail.

Trigger conditions: Local DB does not start, DB files corrupt, DB consumes too much memory/I/O, or migrations fail.

Affected components: Tile Manager, FDR/observability, deployment procedures.

Mitigation strategy:

  1. Require local onboard PostgreSQL health check before flight.
  2. Store large imagery/descriptors/CBOR payloads as files, not DB blobs.
  3. Treat DB unavailability as a mission-cache validation blocker.

Contingency plan: Abort mission-cache activation and run only no-cache degraded modes or resync/rebuild DB before flight.

Residual risk after mitigation: Medium.

Documents updated: data_model.md, architecture.md, components/06_cache_tile_lifecycle/description.md, components/08_fdr_observability/description.md, deployment/environment_strategy.md.


R08: Cache Poisoning

Description: A bad generated tile could be written back and later used as a trusted anchor.

Trigger conditions: Generated tile is promoted despite high parent covariance, stale source, bad sidecar, or inconsistent overlap voting.

Affected components: Tile Manager, safety/anchor wrapper, Satellite Service integration.

Mitigation strategy:

  1. Require tile-write sigma gates.
  2. Store generated tiles as candidates with signed sidecars.
  3. Promote only through post-flight Satellite Service validation/voting.
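
Steps 1 and 2 can be illustrated with the sketch below. The document does not specify the signature scheme, so HMAC-SHA256 with a shared key stands in for it here; an asymmetric signature may well be what the project intends. The sidecar layout, field names, and function names are all hypothetical.

```python
import hashlib
import hmac
import json


def may_write_tile(parent_sigma_m: float, max_sigma_m: float) -> bool:
    """Sigma gate: refuse tile writes when the parent pose is too uncertain."""
    return parent_sigma_m <= max_sigma_m


def _canonical(body: dict) -> bytes:
    # Stable serialization so signer and verifier hash identical bytes.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def sign_sidecar(tile_bytes: bytes, metadata: dict, key: bytes) -> dict:
    """Build a sidecar that marks the tile as a candidate, never as trusted."""
    body = dict(metadata)
    body["tile_sha256"] = hashlib.sha256(tile_bytes).hexdigest()
    body["status"] = "candidate"
    signature = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}


def verify_sidecar(tile_bytes: bytes, sidecar: dict, key: bytes) -> bool:
    """Check both the sidecar signature and the tile content hash."""
    body = sidecar["body"]
    expected = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    tile_hash = hashlib.sha256(tile_bytes).hexdigest()
    return hmac.compare_digest(expected, sidecar["signature"]) and hmac.compare_digest(
        tile_hash, body["tile_sha256"]
    )


if __name__ == "__main__":
    key = b"demo-key-not-for-production"
    sidecar = sign_sidecar(b"tile-bytes", {"parent_sigma_m": 3.0}, key)
    print(verify_sidecar(b"tile-bytes", sidecar, key))
    print(verify_sidecar(b"tampered", sidecar, key))
```

Promotion from `candidate` to trusted is intentionally absent from the sketch, since the document reserves that step for post-flight Satellite Service validation/voting.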

Contingency plan: Quarantine generated tiles and invalidate affected cache regions.

Residual risk after mitigation: Low.

Documents updated: architecture.md, components/06_cache_tile_lifecycle/description.md, tests/security-tests.md.


R09: Dataset Coverage / Licensing

Description: Public datasets may not match target terrain, may lack raw synchronized IMU, or may have non-commercial restrictions.

Trigger conditions: MUN-FRL/ALTO/Kagaru/EPFL slices are unavailable, unrepresentative, or license-incompatible for acceptance.

Affected components: Validation harness, BASALT VIO adapter, anchor verification.

Mitigation strategy:

  1. Use public datasets for de-risking only.
  2. License-tag datasets before CI jobs.
  3. Require representative synchronized target data for final acceptance.

Contingency plan: Collect a target replay dataset before final acceptance.

Residual risk after mitigation: Medium.

Documents updated: tests/test-data.md, deployment/environment_strategy.md, deployment/ci_cd_pipeline.md.


R10: Plane GPS_INPUT Integration

Description: ArduPilot Plane EKF and GPS_INPUT handling may differ from assumptions, especially around accuracy fields, ignore flags, velocity fields, and spoofing transitions.

Trigger conditions: Plane SITL rejects or mishandles emitted GPS_INPUT, or QGC status is insufficient.

Affected components: MAVLink/GCS integration, safety/anchor wrapper, validation harness.

Mitigation strategy:

  1. Use pymavlink for exact GPS_INPUT field control.
  2. Gate release on Plane SITL with production parameters.
  3. Validate spoofing/failsafe and QGC status behavior.
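
For step 1, exact field control largely comes down to units and scaling. The sketch below builds a dictionary keyed by the MAVLink GPS_INPUT field names (lat/lon in degE7, accuracies in metres) without importing pymavlink. The function name and the default values are illustrative assumptions, not project parameters.

```python
# Bit values from the MAVLink GPS_INPUT_IGNORE_FLAGS enum.
GPS_INPUT_IGNORE_FLAG_VEL_HORIZ = 8
GPS_INPUT_IGNORE_FLAG_VEL_VERT = 16


def build_gps_input_fields(
    *,
    time_usec: int,
    time_week: int,
    time_week_ms: int,
    lat_deg: float,
    lon_deg: float,
    alt_m: float,
    fix_type: int,
    horiz_accuracy_m: float,
    vert_accuracy_m: float,
    ignore_flags: int = 0,
    gps_id: int = 0,
    hdop: float = 1.0,           # placeholder default (assumption)
    vdop: float = 1.0,           # placeholder default (assumption)
    vn: float = 0.0,
    ve: float = 0.0,
    vd: float = 0.0,
    speed_accuracy: float = 0.0,
    satellites_visible: int = 0,  # placeholder default (assumption)
) -> dict:
    """Return GPS_INPUT fields with MAVLink units and basic range checks."""
    if not (-90.0 <= lat_deg <= 90.0 and -180.0 <= lon_deg <= 180.0):
        raise ValueError("latitude/longitude out of range")
    if not 0 <= fix_type <= 255:
        raise ValueError("fix_type must fit in uint8")
    if horiz_accuracy_m < 0.0 or vert_accuracy_m < 0.0:
        raise ValueError("accuracies must be non-negative")
    return {
        "time_usec": time_usec,
        "gps_id": gps_id,
        "ignore_flags": ignore_flags,
        "time_week_ms": time_week_ms,
        "time_week": time_week,
        "fix_type": fix_type,
        "lat": int(round(lat_deg * 1e7)),  # degE7
        "lon": int(round(lon_deg * 1e7)),  # degE7
        "alt": alt_m,                      # metres
        "hdop": hdop,
        "vdop": vdop,
        "vn": vn,
        "ve": ve,
        "vd": vd,
        "speed_accuracy": speed_accuracy,
        "horiz_accuracy": horiz_accuracy_m,
        "vert_accuracy": vert_accuracy_m,
        "satellites_visible": satellites_visible,
    }


if __name__ == "__main__":
    fields = build_gps_input_fields(
        time_usec=1, time_week=2300, time_week_ms=1000,
        lat_deg=50.4501, lon_deg=30.5234, alt_m=120.0,
        fix_type=0, horiz_accuracy_m=999.0, vert_accuracy_m=999.0,
        ignore_flags=GPS_INPUT_IGNORE_FLAG_VEL_HORIZ | GPS_INPUT_IGNORE_FLAG_VEL_VERT,
    )
    print(fields["lat"], fields["lon"], fields["ignore_flags"])
```

With pymavlink, the dictionary should be passable as keyword arguments to the generated `gps_input_send` method on a connection's `mav` object. That call, and Plane's handling of each field, is exactly what the SITL gate in step 2 is meant to confirm.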

Contingency plan: Adjust parameter guidance/output fields before hardware deployment.

Residual risk after mitigation: Medium.

Documents updated: components/07_mavlink_gcs_integration/description.md, tests/environment.md, deployment/ci_cd_pipeline.md.

Architecture/Component Changes Applied

| Risk ID | Document Modified | Change Description |
|---|---|---|
| R04 | components/01_camera_ingest_calibration/description.md | Added explicit detect_occlusion, OcclusionReport, and pre-VIO bypass behavior |
| R04/R05 | components/03_safety_anchor_wrapper/description.md | Added propagate_imu_only, total_occlusion, monotonic covariance behavior, and no direct Tile Manager dependency |
| R07 | data_model.md | Replaced embedded DB references with PostgreSQL/PostGIS structured metadata and CBOR FDR payload segments |
| R07 | architecture.md | Added PostgreSQL/PostGIS ADR and FDR storage decision |
| R05 | tests/blackbox-tests.md / tests/resilience-tests.md | Made total occlusion and IMU-only blackout behavior explicit |

Summary

Total risks identified: 12
Critical: 0 | High: 8 | Medium: 4 | Low: 0
Risks mitigated this iteration: 12
Risks requiring user decision: None immediately. Future decisions are tied to exact camera hardware proof, dataset license approval, and representative data collection timing.