Risk Assessment — Architecture Review — Iteration 01
Evaluator Pass Summary
| Check | Result | Notes |
|---|---|---|
| Single Responsibility | Pass | Components each own one primary concern: ingest, VIO, safety, retrieval, verification, cache, MAVLink, FDR, validation |
| Dumb Code / Smart Data | Pass | Complex behavior is mostly expressed through DTOs, mode labels, covariance fields, manifests, and gates |
| Interface Consistency | Pass with fix | Safety wrapper no longer directly depends on cache lifecycle for anchor acceptance; cache freshness/provenance travels through AnchorDecision |
| Circular Dependencies | Pass with caution | Runtime flow is acyclic at component ownership level; MAVLink remains a bidirectional protocol adapter but owns no localization policy |
| Missing Interactions | Pass | Pre-VIO occlusion, IMU-only blackout, relocalization, tile writes, FDR, and SITL validation are all represented |
| Security Considerations | Pass | Signed cache sidecars, source/system ID checks, spoofing rejection, and no in-flight satellite-provider access are covered |
| Performance Bottlenecks | Pass | Jetson latency, VPR/local matching, FDR append pressure, PostgreSQL availability, and thermal limits are identified |
| API Contracts | Pass | Core DTO handoffs are documented: FramePacket, VioStatePacket, AnchorDecision, PositionEstimate, FdrEvent |
Risk Scoring Matrix
| | Low Impact | Medium Impact | High Impact |
|---|---|---|---|
| High Probability | Medium | High | Critical |
| Medium Probability | Low | Medium | High |
| Low Probability | Low | Low | Medium |
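
The matrix above can be encoded as a small lookup so that register scores can be checked mechanically. This is a minimal sketch; the names `SCORE_MATRIX` and `risk_score` are illustrative and not part of the project.

```python
# Rows are probability levels, columns are impact levels.
# Values are copied from the Risk Scoring Matrix above.
SCORE_MATRIX = {
    "High":   {"Low": "Medium", "Medium": "High",   "High": "Critical"},
    "Medium": {"Low": "Low",    "Medium": "Medium", "High": "High"},
    "Low":    {"Low": "Low",    "Medium": "Low",    "High": "Medium"},
}


def risk_score(probability: str, impact: str) -> str:
    """Return the register score for a probability/impact pair."""
    try:
        return SCORE_MATRIX[probability][impact]
    except KeyError as exc:
        raise ValueError(f"unknown level: {exc.args[0]!r}") from exc
```

For example, `risk_score("Low", "High")` yields `"Medium"`, which matches how R08 and R12 are scored in the register.
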
Acceptance Criteria by Risk Level
| Level | Action Required |
|---|---|
| Low | Accepted and monitored |
| Medium | Mitigation plan required before implementation |
| High | Mitigation + contingency plan required, reviewed during implementation |
| Critical | Must be resolved before proceeding to next planning step |
Risk Register
| ID | Risk | Category | Probability | Impact | Score | Mitigation | Owner | Status |
|---|---|---|---|---|---|---|---|---|
| R01 | ADTi 20MP 20L V1 public specs conflict with planning assumptions for resolution, FPS, lens, interface, and temperature | Technical / External | Medium | High | High | Pin manufacturer datasheet and exact lens/interface before implementation; make camera calibration/spec task a bootstrap blocker | Camera ingest/calibration | Mitigated by gate |
| R02 | BASALT may underperform or lose tracking on nadir fixed-wing low-parallax terrain | Technical | Medium | High | High | Public replay with MUN-FRL/ALTO/Kagaru/EPFL where applicable, representative target replay, OpenVINS reference comparison, Kimera backup path | BASALT VIO adapter | Mitigated by validation |
| R03 | BASALT confidence/covariance may under-report real error | Safety | Medium | High | High | Wrapper owns covariance calibration; compare against ground truth, satellite residuals, and OpenVINS reference; never emit optimistic horiz_accuracy | Safety/anchor wrapper | Mitigated by wrapper design |
| R04 | Total occlusion detector may false-negative and feed unusable frames into VIO | Safety / Technical | Medium | High | High | Conservative pre-VIO occlusion gate, FDR status, tests for total blackout, and fallback to IMU-only dead_reckoned mode | Camera ingest/calibration | Mitigated by spec/test |
| R05 | IMU-only blackout propagation could be trusted too long | Safety | Medium | High | High | Monotonic covariance growth, dead_reckoned label, fix_type=0/horiz_accuracy=999.0 when >30 s or covariance >500 m | Safety/anchor wrapper | Mitigated by AC gate |
| R06 | DINOv2-VLAD + ALIKED/DISK-LightGlue exceeds Jetson latency/memory budget | Performance | Medium | High | High | Trigger-only execution, CPU FAISS first, top-K caps, model profiling, TensorRT only after fidelity checks | Satellite retrieval / Anchor verification | Mitigated by profiling gates |
| R07 | PostgreSQL/PostGIS local DB is unavailable or too heavy for onboard runtime | Technical / Operational | Medium | High | High | Run local onboard PostgreSQL, health-check before flight, keep large payloads in files, fail mission cache validation if DB unavailable | Cache lifecycle / FDR | Mitigated by deployment gates |
| R08 | Generated tile cache poisoning corrupts future anchors | Security / Safety | Low | High | Medium | Sigma gate, provenance sidecars, post-flight Satellite Service voting, no direct promotion to trusted basemap | Cache/tile lifecycle | Mitigated by policy |
| R09 | Public datasets do not cover final target terrain or commercial license needs | External / Schedule | Medium | Medium | Medium | Use public data for de-risking only; representative synchronized target data remains mandatory for acceptance | Validation harness | Mitigated by acceptance rule |
| R10 | MAVLink GPS_INPUT parameters or Plane behavior differs from assumptions | Integration | Medium | High | High | Plane SITL release gate with production parameters, spoofing/failsafe tests, raw field validation with pymavlink | MAVLink/GCS integration | Mitigated by SITL gate |
| R11 | FDR appends or PostgreSQL indexing interferes with hot-path latency | Performance | Medium | Medium | Medium | Append asynchronously, use CBOR payload segments for high-volume data, keep PostgreSQL as event index/query surface | FDR/observability | Mitigated by design |
| R12 | GPL/non-commercial tooling accidentally enters production or acceptance evidence | Legal / Compliance | Low | High | Medium | Keep OpenVINS/ORB-SLAM3 reference-only; license-tag datasets before CI; SuperPoint only after legal approval | Validation harness / Architecture | Mitigated by gates |
Detailed Risk Analysis
R01: Camera Specification Mismatch
Description: Public ADTi pages show 5456 x 3632 stills, 2 fps continuous capture, a Sony E mount, and an operating range of -10 to 40 °C. The project needs the exact production lens, camera interface, sustained capture behavior, thermal behavior, and calibration model.
Trigger conditions: Manufacturer documentation or hardware testing contradicts assumed FPS, interface, temperature, or lens characteristics.
Affected components: Camera ingest/calibration, BASALT VIO adapter, validation harness, deployment procedures.
Mitigation strategy:
- Make camera specification verification a bootstrap task.
- Require manufacturer datasheet or hardware measurement before implementation claims 3 fps or hot-environment operation.
- Version calibration data by exact camera/lens/interface.
Contingency plan: Reduce frame rate assumptions, adjust latency tests, or select a different navigation camera/lens/interface.
Residual risk after mitigation: Medium.
Documents updated: glossary.md, architecture.md, components/01_camera_ingest_calibration/description.md, deployment/deployment_procedures.md.
R02: BASALT Nadir Fixed-Wing Fit
Description: BASALT is a strong VIO candidate, but fixed downward cameras over planar terrain can cause low-parallax and texture-degeneracy cases.
Trigger conditions: Public or representative replay shows high drift, frequent tracking loss, or poor initialization.
Affected components: BASALT VIO adapter, safety/anchor wrapper, validation harness.
Mitigation strategy:
- Run MUN-FRL first for synchronized nadir camera + IMU + ground truth.
- Add ALTO/Kagaru/EPFL slices where available for aerial/fixed-wing realism.
- Compare against OpenVINS reference and Kimera backup.
Contingency plan: Keep Kimera backup or build a project-owned fallback estimator around OpenCV + IMU only after replay evidence requires it.
Residual risk after mitigation: Medium.
Documents updated: architecture.md, components/02_basalt_vio_adapter/description.md, tests/test-data.md.
R03: Covariance Under-Reporting
Description: Incorrect confidence is more dangerous than no estimate because the flight controller may trust a false fix.
Trigger conditions: Replay error exceeds reported covariance, or anchors are accepted despite inconsistent residuals.
Affected components: Safety/anchor wrapper, MAVLink/GCS integration, FDR/observability.
Mitigation strategy:
- Make wrapper covariance the product authority, not BASALT raw confidence.
- Validate calibration against ground truth, satellite residuals, and OpenVINS reference.
- Map `horiz_accuracy` so it never under-reports the 95% semi-major covariance axis.
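
One way to read the `horiz_accuracy` rule is sketched below, assuming the wrapper holds a 2x2 horizontal position covariance in square metres and that position error is roughly Gaussian. The function name, the `inflation` parameter, and the centimetre rounding are illustrative assumptions, not the project's calibrated mapping.

```python
import math

# Chi-square 95% quantile for 2 degrees of freedom: -2 * ln(1 - 0.95) ≈ 5.991.
CHI2_95_2DOF = -2.0 * math.log(1.0 - 0.95)


def horiz_accuracy_m(sxx: float, syy: float, sxy: float, inflation: float = 1.0) -> float:
    """Semi-major axis of the 95% error ellipse, rounded up to the next centimetre.

    sxx, syy, sxy are the entries of the horizontal covariance matrix in m^2.
    inflation (hypothetical) lets calibration widen the value; it may never shrink it.
    """
    if inflation < 1.0:
        raise ValueError("inflation must be >= 1.0 so the estimate is never optimistic")
    if sxx < 0.0 or syy < 0.0 or sxy * sxy > sxx * syy * (1.0 + 1e-9):
        raise ValueError("covariance is not positive semi-definite")
    # Largest eigenvalue of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    lam_max = 0.5 * (sxx + syy) + math.hypot(0.5 * (sxx - syy), sxy)
    semi_major = math.sqrt(CHI2_95_2DOF * lam_max)
    # Round up, never down, so the reported value cannot under-state the axis.
    return math.ceil(semi_major * inflation * 100.0) / 100.0
```

Rounding with `math.ceil` rather than `round` is the point of the sketch: any quantisation error falls on the pessimistic side.
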
Contingency plan: Degrade to no-fix sooner and require operator relocalization or mission abort behavior.
Residual risk after mitigation: Medium.
Documents updated: architecture.md, components/03_safety_anchor_wrapper/description.md, tests/blackbox-tests.md.
R04: Total Occlusion Detection Failure
Description: If total occlusion is not detected before VIO, BASALT may receive unusable frames and produce misleading state updates.
Trigger conditions: Lens cover, cloud/whiteout, decode failure, underexposure/overexposure, or textureless frame reaches VIO as usable.
Affected components: Camera ingest/calibration, safety/anchor wrapper, BASALT VIO adapter.
Mitigation strategy:
- Camera ingest exposes `OcclusionReport` and sets `usable_for_vio=false` for total occlusion/blackout.
- Total occlusion bypasses BASALT for that frame.
- Safety wrapper switches to IMU-only `dead_reckoned` propagation with monotonic covariance growth.
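
The routing implied by this mitigation could look like the sketch below. `OcclusionReport` and `usable_for_vio` come from the text; the `total_occlusion` flag is taken from the component change log, while `reason`, `Route`, and `route_frame` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    VIO = "vio"            # frame is handed to BASALT
    IMU_ONLY = "imu_only"  # frame bypasses BASALT; wrapper propagates dead_reckoned


@dataclass(frozen=True)
class OcclusionReport:
    total_occlusion: bool
    usable_for_vio: bool
    reason: str = ""  # hypothetical free-text field for the FDR


def route_frame(report: OcclusionReport) -> Route:
    """Conservative gate: any doubt about the frame keeps it away from VIO."""
    if report.total_occlusion or not report.usable_for_vio:
        return Route.IMU_ONLY
    return Route.VIO
```

The gate deliberately treats either flag as sufficient to bypass VIO, which trades occasional false-positive IMU-only degradation for protection against false VIO confidence.
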
Contingency plan: Tune detector conservatively and accept temporary false-positive IMU-only degradation over false VIO confidence.
Residual risk after mitigation: Medium.
Documents updated: components/01_camera_ingest_calibration/description.md, components/03_safety_anchor_wrapper/description.md, system-flows.md, diagrams/flows/flow_normal_localization.md, tests/resilience-tests.md.
R05: IMU-Only Mode Over-Trust
Description: IMU-only propagation drifts quickly and must be treated as an emergency bridge, not a long-duration solution.
Trigger conditions: Blackout lasts longer than 30 seconds or covariance exceeds 500 m.
Affected components: Safety/anchor wrapper, MAVLink/GCS integration, FDR/observability.
Mitigation strategy:
- Emit `source_label=dead_reckoned` during IMU-only mode.
- Grow covariance monotonically.
- Emit `fix_type=0`, `horiz_accuracy=999.0`, and `VISUAL_BLACKOUT_FAILSAFE` at thresholds.
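
A minimal sketch of the threshold logic, using the 30 s and 500 m limits stated above. The document does not say which `fix_type` is emitted below the thresholds, so `degraded_fix_type` is a caller-supplied assumption; `ImuOnlyOutput` and `imu_only_output` are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional

BLACKOUT_LIMIT_S = 30.0          # from the trigger conditions above
UNCERTAINTY_LIMIT_M = 500.0      # from the trigger conditions above
NO_FIX_HORIZ_ACCURACY_M = 999.0  # value emitted together with fix_type=0


@dataclass(frozen=True)
class ImuOnlyOutput:
    source_label: str
    fix_type: int
    horiz_accuracy: float
    failsafe: Optional[str]


def imu_only_output(blackout_s: float, uncertainty_m: float,
                    previous_uncertainty_m: float, degraded_fix_type: int) -> ImuOnlyOutput:
    """Decide what to publish for one IMU-only step."""
    # Monotonic growth: uncertainty may never shrink while vision is unavailable.
    uncertainty = max(uncertainty_m, previous_uncertainty_m)
    if blackout_s > BLACKOUT_LIMIT_S or uncertainty > UNCERTAINTY_LIMIT_M:
        return ImuOnlyOutput("dead_reckoned", 0, NO_FIX_HORIZ_ACCURACY_M,
                             "VISUAL_BLACKOUT_FAILSAFE")
    return ImuOnlyOutput("dead_reckoned", degraded_fix_type, uncertainty, None)
```

Passing the previous uncertainty explicitly keeps the function pure, so replay tests can drive it with recorded blackout sequences.
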
Contingency plan: Stop publishing valid fixes and require relocalization/operator action.
Residual risk after mitigation: Low.
Documents updated: components/03_safety_anchor_wrapper/description.md, system-flows.md, tests/blackbox-tests.md, tests/resilience-tests.md, tests/traceability-matrix.md.
R06: Trigger Path Performance
Description: DINOv2-VLAD and learned local matching can exceed Jetson latency/memory limits.
Trigger conditions: Relocalization exceeds p95 latency, memory budget, or causes thermal throttling.
Affected components: Satellite retrieval, anchor verification, validation harness.
Mitigation strategy:
- Keep VPR/local matching trigger-based.
- Use CPU FAISS first and bounded top-K.
- Accept optimized engines only after descriptor-fidelity tests pass.
Contingency plan: Reduce descriptor resolution/model size, reduce top-K, or fall back to classical features for emergency operation.
Residual risk after mitigation: Medium.
Documents updated: architecture.md, components/04_satellite_retrieval/description.md, components/05_anchor_verification/description.md, tests/performance-tests.md.
R07: Onboard PostgreSQL/PostGIS Availability
Description: PostgreSQL/PostGIS is now the structured metadata store. If local DB availability or resource use is poor, cache/FDR queries may fail.
Trigger conditions: Local DB does not start, DB files corrupt, DB consumes too much memory/I/O, or migrations fail.
Affected components: Cache/tile lifecycle, FDR/observability, deployment procedures.
Mitigation strategy:
- Require local onboard PostgreSQL health check before flight.
- Store large imagery/descriptors/CBOR payloads as files, not DB blobs.
- Treat DB unavailability as a mission-cache validation blocker.
Contingency plan: Abort mission-cache activation and run only no-cache degraded modes or resync/rebuild DB before flight.
Residual risk after mitigation: Medium.
Documents updated: data_model.md, architecture.md, components/06_cache_tile_lifecycle/description.md, components/08_fdr_observability/description.md, deployment/environment_strategy.md.
R08: Cache Poisoning
Description: A bad generated tile could be written back and later used as a trusted anchor.
Trigger conditions: Generated tile is promoted despite high parent covariance, stale source, bad sidecar, or inconsistent overlap voting.
Affected components: Cache/tile lifecycle, safety/anchor wrapper, Satellite Service integration.
Mitigation strategy:
- Require tile-write sigma gates.
- Store generated tiles as candidates with signed sidecars.
- Promote only through post-flight Satellite Service validation/voting.
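
The candidate-then-vote policy can be pictured as a small state machine. The sketch below is an assumption-heavy illustration: the state names, both function names, and every threshold parameter are hypothetical, and the document does not define the actual sigma limit or voting rule.

```python
from enum import Enum


class TileState(Enum):
    REJECTED = "rejected"
    CANDIDATE = "candidate"
    TRUSTED = "trusted"
    QUARANTINED = "quarantined"


def gate_tile_write(parent_sigma_m: float, sigma_limit_m: float,
                    sidecar_signature_ok: bool) -> TileState:
    """Onboard gate: a generated tile can at best become a candidate, never trusted."""
    if not sidecar_signature_ok or parent_sigma_m > sigma_limit_m:
        return TileState.REJECTED
    return TileState.CANDIDATE


def review_candidate(state: TileState, votes_agree: int, votes_total: int,
                     min_votes: int, min_agreement: float) -> TileState:
    """Post-flight review: promote only with enough consistent overlap votes."""
    if state is not TileState.CANDIDATE:
        return state
    if votes_total < min_votes:
        return TileState.CANDIDATE  # not enough evidence yet; stay a candidate
    if votes_agree / votes_total >= min_agreement:
        return TileState.TRUSTED
    return TileState.QUARANTINED
```

Keeping promotion in a separate function mirrors the rule that trust is granted only post-flight, never by the onboard writer.
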
Contingency plan: Quarantine generated tiles and invalidate affected cache regions.
Residual risk after mitigation: Low.
Documents updated: architecture.md, components/06_cache_tile_lifecycle/description.md, tests/security-tests.md.
R09: Dataset Coverage / Licensing
Description: Public datasets may not match target terrain, may lack raw synchronized IMU, or may have non-commercial restrictions.
Trigger conditions: MUN-FRL/ALTO/Kagaru/EPFL slices are unavailable, unrepresentative, or license-incompatible for acceptance.
Affected components: Validation harness, BASALT VIO adapter, anchor verification.
Mitigation strategy:
- Use public datasets for de-risking only.
- License-tag datasets before CI jobs.
- Require representative synchronized target data for final acceptance.
Contingency plan: Collect a target replay dataset before final acceptance.
Residual risk after mitigation: Medium.
Documents updated: tests/test-data.md, deployment/environment_strategy.md, deployment/ci_cd_pipeline.md.
R10: Plane GPS_INPUT Integration
Description: ArduPilot Plane EKF and GPS_INPUT handling may differ from assumptions, especially around accuracy fields, ignore flags, velocity fields, and spoofing transitions.
Trigger conditions: Plane SITL rejects or mishandles emitted GPS_INPUT, or QGC status is insufficient.
Affected components: MAVLink/GCS integration, safety/anchor wrapper, validation harness.
Mitigation strategy:
- Use pymavlink for exact `GPS_INPUT` field control.
- Gate release on Plane SITL with production parameters.
- Validate spoofing/failsafe and QGC status behavior.
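
Raw field validation can start before any link is opened. The sketch below builds the `GPS_INPUT` fields as a plain dictionary, using the field names, units, and ignore-flag bits of the MAVLink common message set. The function name and the choice of which fields to ignore are assumptions for illustration; the resulting dictionary should be usable as keyword arguments to pymavlink's generated `gps_input_send`, though that is worth confirming against the installed dialect.

```python
# Bit values of GPS_INPUT_IGNORE_FLAGS in the MAVLink common message set.
IGNORE_ALT = 1
IGNORE_HDOP = 2
IGNORE_VDOP = 4
IGNORE_VEL_HORIZ = 8
IGNORE_VEL_VERT = 16
IGNORE_SPEED_ACCURACY = 32
IGNORE_HORIZONTAL_ACCURACY = 64
IGNORE_VERTICAL_ACCURACY = 128


def build_gps_input_fields(time_usec: int, lat_deg: float, lon_deg: float, alt_m: float,
                           fix_type: int, horiz_accuracy_m: float, vert_accuracy_m: float,
                           satellites_visible: int) -> dict:
    """Validate and convert a position estimate into GPS_INPUT wire fields."""
    if not -90.0 <= lat_deg <= 90.0 or not -180.0 <= lon_deg <= 180.0:
        raise ValueError("latitude/longitude out of range")
    if not 0 <= fix_type <= 5:
        raise ValueError("fix_type must be in 0..5")
    if horiz_accuracy_m < 0.0 or vert_accuracy_m < 0.0:
        raise ValueError("accuracy must be non-negative")
    # Assumption for this sketch: DOP, velocity, and speed accuracy are not supplied.
    ignore = (IGNORE_HDOP | IGNORE_VDOP | IGNORE_VEL_HORIZ
              | IGNORE_VEL_VERT | IGNORE_SPEED_ACCURACY)
    return {
        "time_usec": int(time_usec),
        "gps_id": 0,
        "ignore_flags": ignore,
        "time_week_ms": 0,
        "time_week": 0,
        "fix_type": fix_type,
        "lat": round(lat_deg * 1e7),  # degE7
        "lon": round(lon_deg * 1e7),  # degE7
        "alt": float(alt_m),          # metres above mean sea level
        "hdop": 0.0, "vdop": 0.0,     # ignored via ignore_flags
        "vn": 0.0, "ve": 0.0, "vd": 0.0,
        "speed_accuracy": 0.0,
        "horiz_accuracy": float(horiz_accuracy_m),
        "vert_accuracy": float(vert_accuracy_m),
        "satellites_visible": int(satellites_visible),
    }
```

How Plane's EKF treats the ignored fields and the satellite count is exactly what the SITL gate is meant to establish, so the constants here should be read as placeholders until that evidence exists.
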
Contingency plan: Adjust parameter guidance/output fields before hardware deployment.
Residual risk after mitigation: Medium.
Documents updated: components/07_mavlink_gcs_integration/description.md, tests/environment.md, deployment/ci_cd_pipeline.md.
Architecture/Component Changes Applied
| Risk ID | Document Modified | Change Description |
|---|---|---|
| R04 | components/01_camera_ingest_calibration/description.md | Added explicit detect_occlusion, OcclusionReport, and pre-VIO bypass behavior |
| R04/R05 | components/03_safety_anchor_wrapper/description.md | Added propagate_imu_only, total_occlusion, monotonic covariance behavior, and no direct cache lifecycle dependency |
| R07 | data_model.md | Replaced embedded DB references with PostgreSQL/PostGIS structured metadata and CBOR FDR payload segments |
| R07 | architecture.md | Added PostgreSQL/PostGIS ADR and FDR storage decision |
| R05 | tests/blackbox-tests.md / tests/resilience-tests.md | Made total occlusion and IMU-only blackout behavior explicit |
Summary
Total risks identified: 12
Critical: 0 | High: 8 | Medium: 4 | Low: 0
Risks mitigated this iteration: 12
Risks requiring user decision: None immediately. Future decisions are tied to exact camera hardware proof, dataset license approval, and representative data collection timing.