mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-21 20:41:13 +00:00
# Risk Assessment — Architecture Review — Iteration 01

## Evaluator Pass Summary

| Check | Result | Notes |
|-------|--------|-------|
| Single Responsibility | Pass | Components each own one primary concern: ingest, VIO, safety, Satellite Service sync/retrieval, verification, Tile Manager storage/generation, MAVLink, FDR, validation |
| Dumb Code / Smart Data | Pass | Complex behavior is mostly expressed through DTOs, mode labels, covariance fields, manifests, and gates |
| Interface Consistency | Pass with fix | Safety wrapper no longer directly depends on Tile Manager for anchor acceptance; cache freshness/provenance travels through `AnchorDecision` |
| Circular Dependencies | Pass with caution | Runtime flow is acyclic at component ownership level; MAVLink remains a bidirectional protocol adapter but owns no localization policy |
| Missing Interactions | Pass | Pre-VIO occlusion, IMU-only blackout, relocalization, tile writes, FDR, and SITL validation are all represented |
| Security Considerations | Pass | Signed cache sidecars, source/system ID checks, spoofing rejection, and no in-flight satellite-provider or Satellite Service access are covered |
| Performance Bottlenecks | Pass | Jetson latency, VPR/local matching, FDR append pressure, PostgreSQL availability, and thermal limits are identified |
| API Contracts | Pass | Core DTO handoffs are documented: `FramePacket`, `VioStatePacket`, `AnchorDecision`, `PositionEstimate`, `FdrEvent` |

## Risk Scoring Matrix

| | Low Impact | Medium Impact | High Impact |
|--|------------|---------------|-------------|
| **High Probability** | Medium | High | Critical |
| **Medium Probability** | Low | Medium | High |
| **Low Probability** | Low | Low | Medium |

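
The matrix above can be encoded as a small lookup so that register scores are derived rather than typed by hand. This is only a sketch: the names `RISK_MATRIX` and `risk_score` are illustrative and do not come from the project.

```python
from typing import Dict

# Rows are probability, columns are impact, copied from the scoring matrix.
RISK_MATRIX: Dict[str, Dict[str, str]] = {
    "High": {"Low": "Medium", "Medium": "High", "High": "Critical"},
    "Medium": {"Low": "Low", "Medium": "Medium", "High": "High"},
    "Low": {"Low": "Low", "Medium": "Low", "High": "Medium"},
}


def risk_score(probability: str, impact: str) -> str:
    """Return the risk level for a probability/impact pair."""
    try:
        return RISK_MATRIX[probability][impact]
    except KeyError as exc:
        raise ValueError(f"unknown probability or impact: {exc}") from exc


# Example: a Medium-probability, High-impact risk scores as High.
print(risk_score("Medium", "High"))
```

A lookup like this could also be used to cross-check the Score column and the summary counts of a register.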

## Acceptance Criteria by Risk Level

| Level | Action Required |
|-------|-----------------|
| Low | Accepted and monitored |
| Medium | Mitigation plan required before implementation |
| High | Mitigation + contingency plan required, reviewed during implementation |
| Critical | Must be resolved before proceeding to next planning step |

## Risk Register

| ID | Risk | Category | Probability | Impact | Score | Mitigation | Owner | Status |
|----|------|----------|-------------|--------|-------|------------|-------|--------|
| R01 | ADTi 20MP 20L V1 public specs conflict with planning assumptions for resolution, FPS, lens, interface, and temperature | Technical / External | Medium | High | High | Pin manufacturer datasheet and exact lens/interface before implementation; make camera calibration/spec task a bootstrap blocker | Camera ingest/calibration | Mitigated by gate |
| R02 | BASALT may underperform or lose tracking on nadir fixed-wing low-parallax terrain | Technical | Medium | High | High | Public replay with MUN-FRL/ALTO/Kagaru/EPFL where applicable, representative target replay, OpenVINS reference comparison, Kimera backup path | VIO adapter | Mitigated by validation |
| R03 | BASALT confidence/covariance may under-report real error | Safety | Medium | High | High | Wrapper owns covariance calibration; compare against ground truth, satellite residuals, and OpenVINS reference; never emit optimistic `horiz_accuracy` | Safety/anchor wrapper | Mitigated by wrapper design |
| R04 | Total occlusion detector may false-negative and feed unusable frames into VIO | Safety / Technical | Medium | High | High | Conservative pre-VIO occlusion gate, FDR status, tests for total blackout, and fallback to IMU-only `dead_reckoned` mode | Camera ingest/calibration | Mitigated by spec/test |
| R05 | IMU-only blackout propagation could be trusted too long | Safety | Medium | High | High | Monotonic covariance growth, `dead_reckoned` label, `fix_type=0`/`horiz_accuracy=999.0` when >30 s or covariance >500 m | Safety/anchor wrapper | Mitigated by AC gate |
| R06 | DINOv2-VLAD + ALIKED/DISK-LightGlue exceeds Jetson latency/memory budget | Performance | Medium | High | High | Trigger-only execution, CPU FAISS first, top-K caps, model profiling, TensorRT only after fidelity checks | Satellite Service / Anchor verification | Mitigated by profiling gates |
| R07 | PostgreSQL/PostGIS local DB is unavailable or too heavy for onboard runtime | Technical / Operational | Medium | High | High | Run local onboard PostgreSQL, health-check before flight, keep large payloads in files, fail mission cache validation if DB unavailable | Tile Manager / FDR | Mitigated by deployment gates |
| R08 | Generated tile cache poisoning corrupts future anchors | Security / Safety | Low | High | Medium | Sigma gate, provenance sidecars, post-flight Satellite Service voting, no direct promotion to trusted basemap | Tile Manager | Mitigated by policy |
| R09 | Public datasets do not cover final target terrain or commercial license needs | External / Schedule | Medium | Medium | Medium | Use public data for de-risking only; representative synchronized target data remains mandatory for acceptance | Validation harness | Mitigated by acceptance rule |
| R10 | MAVLink `GPS_INPUT` parameters or Plane behavior differs from assumptions | Integration | Medium | High | High | Plane SITL release gate with production parameters, spoofing/failsafe tests, raw field validation with pymavlink | MAVLink/GCS integration | Mitigated by SITL gate |
| R11 | FDR appends or PostgreSQL indexing interferes with hot-path latency | Performance | Medium | Medium | Medium | Append asynchronously, use CBOR payload segments for high-volume data, keep PostgreSQL as event index/query surface | FDR/observability | Mitigated by design |
| R12 | GPL/non-commercial tooling accidentally enters production or acceptance evidence | Legal / Compliance | Low | High | Medium | Keep OpenVINS/ORB-SLAM3 reference-only; license-tag datasets before CI; SuperPoint only after legal approval | Validation harness / Architecture | Mitigated by gates |

## Detailed Risk Analysis

### R01: Camera Specification Mismatch

**Description**: Public ADTi pages show 5456 × 3632 stills, 2 fps continuous capture, Sony E mount, and -10 to 40 °C operation. The project needs the exact production lens, camera interface, sustained capture behavior, thermal behavior, and calibration model.

**Trigger conditions**: Manufacturer documentation or hardware testing contradicts assumed FPS, interface, temperature, or lens characteristics.

**Affected components**: Camera ingest/calibration, VIO adapter, separate e2e test suite, deployment procedures.

**Mitigation strategy**:
1. Make camera specification verification a bootstrap task.
2. Require manufacturer datasheet or hardware measurement before implementation claims 3 fps or hot-environment operation.
3. Version calibration data by exact camera/lens/interface.

**Contingency plan**: Reduce frame rate assumptions, adjust latency tests, or select a different navigation camera/lens/interface.

**Residual risk after mitigation**: Medium.

**Documents updated**: `glossary.md`, `architecture.md`, `components/01_camera_ingest_calibration/description.md`, `deployment/deployment_procedures.md`.

---

### R02: BASALT Nadir Fixed-Wing Fit

**Description**: BASALT is a strong VIO candidate, but fixed downward cameras over planar terrain can cause low-parallax and texture-degeneracy cases.

**Trigger conditions**: Public or representative replay shows high drift, frequent tracking loss, or poor initialization.

**Affected components**: VIO adapter, safety/anchor wrapper, separate e2e test suite.

**Mitigation strategy**:
1. Run MUN-FRL first for synchronized nadir camera + IMU + ground truth.
2. Add ALTO/Kagaru/EPFL slices where available for aerial/fixed-wing realism.
3. Compare against OpenVINS reference and Kimera backup.

**Contingency plan**: Keep Kimera backup or build a project-owned fallback estimator around OpenCV + IMU only after replay evidence requires it.

**Residual risk after mitigation**: Medium.

**Documents updated**: `architecture.md`, `components/02_vio_adapter/description.md`, `tests/test-data.md`.

---

### R03: Covariance Under-Reporting

**Description**: Incorrect confidence is more dangerous than no estimate because the flight controller may trust a false fix.

**Trigger conditions**: Replay error exceeds reported covariance, or anchors are accepted despite inconsistent residuals.

**Affected components**: Safety/anchor wrapper, MAVLink/GCS integration, FDR/observability.

**Mitigation strategy**:
1. Make wrapper covariance the product authority, not BASALT raw confidence.
2. Validate calibration against ground truth, satellite residuals, and OpenVINS reference.
3. Map `horiz_accuracy` so it never under-reports the 95% semi-major covariance axis.

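
One way to read step 3 is as the semi-major axis of the 95% confidence ellipse of a 2D Gaussian, rounded up instead of to nearest. The sketch below makes that assumption explicit; the function name, the east/north argument layout, and the 0.1 m rounding step are hypothetical and not taken from the wrapper specification.

```python
import math

# Chi-square quantile for 2 degrees of freedom at 95%.
# For 2 DOF the CDF is 1 - exp(-x / 2), so the quantile is -2 * ln(1 - p).
CHI2_2DOF_95 = -2.0 * math.log(1.0 - 0.95)


def horiz_accuracy_95(cov_ee: float, cov_en: float, cov_nn: float) -> float:
    """95% semi-major axis in metres of a 2x2 east/north covariance, rounded up.

    Assumption: inputs are variances/covariance in square metres.
    """
    if cov_ee < 0.0 or cov_nn < 0.0:
        raise ValueError("variances must be non-negative")
    # Largest eigenvalue of [[cov_ee, cov_en], [cov_en, cov_nn]] in closed form.
    mean = 0.5 * (cov_ee + cov_nn)
    spread = math.hypot(0.5 * (cov_ee - cov_nn), cov_en)
    lambda_max = mean + spread
    semi_major = math.sqrt(CHI2_2DOF_95 * lambda_max)
    # Round up to the next 0.1 m so the reported value never under-reports.
    return math.ceil(semi_major * 10.0) / 10.0


# Example: sigma_east = 2 m, sigma_north = 1 m, no correlation.
print(horiz_accuracy_95(4.0, 0.0, 1.0))
```

Rounding with `math.ceil` rather than `round` is the part that matters here: any quantization step applied before the value reaches `GPS_INPUT` should err toward the larger number.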

**Contingency plan**: Degrade to no-fix sooner and require operator relocalization or mission abort behavior.

**Residual risk after mitigation**: Medium.

**Documents updated**: `architecture.md`, `components/03_safety_anchor_wrapper/description.md`, `tests/blackbox-tests.md`.

---

### R04: Total Occlusion Detection Failure

**Description**: If total occlusion is not detected before VIO, BASALT may receive unusable frames and produce misleading state updates.

**Trigger conditions**: Lens cover, cloud/whiteout, decode failure, underexposure/overexposure, or textureless frame reaches VIO as usable.

**Affected components**: Camera ingest/calibration, safety/anchor wrapper, VIO adapter.

**Mitigation strategy**:
1. Camera ingest exposes `OcclusionReport` and sets `usable_for_vio=false` for total occlusion/blackout.
2. Total occlusion bypasses BASALT for that frame.
3. Safety wrapper switches to IMU-only `dead_reckoned` propagation with monotonic covariance growth.

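
The routing implied by these three steps can be sketched as follows. Only `OcclusionReport`, `usable_for_vio`, `total_occlusion`, and the `dead_reckoned` label appear in the project documents; the `reason` field, the helper names, and the returned route strings are assumptions made for illustration, and the detector itself is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcclusionReport:
    """Per-frame result of the pre-VIO occlusion gate (field set is illustrative)."""
    total_occlusion: bool
    usable_for_vio: bool
    reason: str = ""


def make_occlusion_report(total_occlusion: bool, reason: str = "") -> OcclusionReport:
    # A totally occluded frame must never be marked usable for VIO.
    return OcclusionReport(
        total_occlusion=total_occlusion,
        usable_for_vio=not total_occlusion,
        reason=reason,
    )


def route_frame(report: OcclusionReport) -> str:
    """Decide where a frame goes: the VIO backend or the IMU-only path."""
    if report.usable_for_vio:
        return "vio"
    # The frame bypasses the VIO backend; the safety wrapper propagates on IMU only.
    return "imu_only_dead_reckoned"


print(route_frame(make_occlusion_report(True, reason="lens_cover")))
```

Keeping the decision in a single boolean on the report means the VIO adapter never has to re-inspect pixels to decide whether a frame is safe to consume.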

**Contingency plan**: Tune detector conservatively and accept temporary false-positive IMU-only degradation over false VIO confidence.

**Residual risk after mitigation**: Medium.

**Documents updated**: `components/01_camera_ingest_calibration/description.md`, `components/03_safety_anchor_wrapper/description.md`, `system-flows.md`, `diagrams/flows/flow_normal_localization.md`, `tests/resilience-tests.md`.

---

### R05: IMU-Only Mode Over-Trust

**Description**: IMU-only propagation drifts quickly and must be treated as an emergency bridge, not a long-duration solution.

**Trigger conditions**: Blackout lasts longer than 30 seconds or covariance exceeds 500 m.

**Affected components**: Safety/anchor wrapper, MAVLink/GCS integration, FDR/observability.

**Mitigation strategy**:
1. Emit `source_label=dead_reckoned` during IMU-only mode.
2. Grow covariance monotonically.
3. Emit `fix_type=0`, `horiz_accuracy=999.0`, and `VISUAL_BLACKOUT_FAILSAFE` at thresholds.

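
A minimal sketch of the threshold logic follows. The 30 s and 500 m limits, the `dead_reckoned` label, `fix_type=0`, `horiz_accuracy=999.0`, and `VISUAL_BLACKOUT_FAILSAFE` come from this document. Everything else is assumed: the "500 m covariance" limit is read as a horizontal accuracy in metres, the class and function names are hypothetical, and `fix_type` is left as `None` below the thresholds because the normal mapping is not specified here.

```python
from dataclasses import dataclass
from typing import Any, Dict

MAX_BLACKOUT_S = 30.0
MAX_HORIZ_ACCURACY_M = 500.0  # assumption: the 500 m limit is a horizontal accuracy
NO_FIX_HORIZ_ACCURACY = 999.0


@dataclass(frozen=True)
class BlackoutState:
    elapsed_s: float = 0.0
    horiz_accuracy_m: float = 0.0


def advance(state: BlackoutState, dt_s: float, predicted_accuracy_m: float) -> BlackoutState:
    """Advance the blackout timer; uncertainty may grow but never shrink."""
    return BlackoutState(
        elapsed_s=state.elapsed_s + dt_s,
        horiz_accuracy_m=max(state.horiz_accuracy_m, predicted_accuracy_m),
    )


def blackout_output(state: BlackoutState) -> Dict[str, Any]:
    """Fields to publish while in IMU-only mode."""
    failsafe = (
        state.elapsed_s > MAX_BLACKOUT_S
        or state.horiz_accuracy_m > MAX_HORIZ_ACCURACY_M
    )
    return {
        "source_label": "dead_reckoned",
        # None means "use the wrapper's normal fix_type mapping" (not shown here).
        "fix_type": 0 if failsafe else None,
        "horiz_accuracy": NO_FIX_HORIZ_ACCURACY if failsafe else state.horiz_accuracy_m,
        "event": "VISUAL_BLACKOUT_FAILSAFE" if failsafe else None,
    }


state = advance(advance(BlackoutState(), 1.0, 10.0), 1.0, 5.0)
print(state.horiz_accuracy_m)  # stays at the larger earlier value
```

Taking the `max` in `advance` is what enforces the monotonic-growth rule even if the propagation model briefly predicts a smaller uncertainty.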

**Contingency plan**: Stop publishing valid fixes and require relocalization/operator action.

**Residual risk after mitigation**: Low.

**Documents updated**: `components/03_safety_anchor_wrapper/description.md`, `system-flows.md`, `tests/blackbox-tests.md`, `tests/resilience-tests.md`, `tests/traceability-matrix.md`.

---

### R06: Trigger Path Performance

**Description**: DINOv2-VLAD and learned local matching can exceed Jetson latency/memory limits.

**Trigger conditions**: Relocalization exceeds p95 latency, memory budget, or causes thermal throttling.

**Affected components**: Satellite Service, anchor verification, separate e2e test suite.

**Mitigation strategy**:
1. Keep VPR/local matching trigger-based.
2. Use CPU FAISS first and bounded top-K.
3. Accept optimized engines only after descriptor-fidelity tests pass.

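
The document does not define what a descriptor-fidelity test measures. One plausible form, shown here purely as an assumption, compares descriptors from the reference model and the optimized engine on the same inputs and requires a minimum cosine similarity for every pair. The 0.99 threshold and both function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("descriptor dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def engine_passes_fidelity(
    reference: Sequence[Sequence[float]],
    optimized: Sequence[Sequence[float]],
    min_similarity: float = 0.99,  # hypothetical threshold
) -> bool:
    """Accept the optimized engine only if every descriptor pair stays close."""
    if not reference or len(reference) != len(optimized):
        return False
    return all(
        cosine_similarity(ref, opt) >= min_similarity
        for ref, opt in zip(reference, optimized)
    )


reference = [[1.0, 0.0], [0.0, 1.0]]
print(engine_passes_fidelity(reference, [[0.999, 0.01], [0.0, 2.0]]))
```

A worst-case check over all pairs is stricter than a mean and may suit a safety gate better, but retrieval-level metrics such as top-K agreement could be a more meaningful criterion for the real pipeline.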

**Contingency plan**: Reduce descriptor resolution/model size, reduce top-K, or fall back to classical features for emergency operation.

**Residual risk after mitigation**: Medium.

**Documents updated**: `architecture.md`, `components/04_satellite_retrieval/description.md`, `components/05_anchor_verification/description.md`, `tests/performance-tests.md`.

---

### R07: Onboard PostgreSQL/PostGIS Availability

**Description**: PostgreSQL/PostGIS is now the structured metadata store. If local DB availability or resource use is poor, cache/FDR queries may fail.

**Trigger conditions**: Local DB does not start, DB files corrupt, DB consumes too much memory/I/O, or migrations fail.

**Affected components**: Tile Manager, FDR/observability, deployment procedures.

**Mitigation strategy**:
1. Require local onboard PostgreSQL health check before flight.
2. Store large imagery/descriptors/CBOR payloads as files, not DB blobs.
3. Treat DB unavailability as a mission-cache validation blocker.

**Contingency plan**: Abort mission-cache activation and run only no-cache degraded modes or resync/rebuild DB before flight.

**Residual risk after mitigation**: Medium.

**Documents updated**: `data_model.md`, `architecture.md`, `components/06_cache_tile_lifecycle/description.md`, `components/08_fdr_observability/description.md`, `deployment/environment_strategy.md`.

---

### R08: Cache Poisoning

**Description**: A bad generated tile could be written back and later used as a trusted anchor.

**Trigger conditions**: Generated tile is promoted despite high parent covariance, stale source, bad sidecar, or inconsistent overlap voting.

**Affected components**: Tile Manager, safety/anchor wrapper, Satellite Service integration.

**Mitigation strategy**:
1. Require tile-write sigma gates.
2. Store generated tiles as candidates with signed sidecars.
3. Promote only through post-flight Satellite Service validation/voting.

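
The admission rule in steps 1 and 2 might look like the sketch below. The source only says sidecars are "signed"; HMAC-SHA256 over canonical JSON is an assumption chosen because it needs nothing beyond the standard library, and the sigma limit, status strings, and function names are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def sign_sidecar(sidecar: Dict[str, Any], key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the sidecar (assumed scheme)."""
    payload = json.dumps(sidecar, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_sidecar(sidecar: Dict[str, Any], signature: str, key: bytes) -> bool:
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(sign_sidecar(sidecar, key), signature)


def admit_generated_tile(
    parent_sigma_m: float,
    max_sigma_m: float,
    sidecar: Dict[str, Any],
    signature: str,
    key: bytes,
) -> str:
    """Return 'rejected', 'quarantined', or 'candidate'; never 'trusted' in flight."""
    if parent_sigma_m > max_sigma_m:
        return "rejected"  # sigma gate: parent pose too uncertain to write a tile
    if not verify_sidecar(sidecar, signature, key):
        return "quarantined"  # provenance cannot be verified
    return "candidate"  # promotion happens only post-flight


key = b"demo-key-not-for-production"
sidecar = {"tile_id": "z18/x1/y2", "parent_sigma_m": 3.5}
signature = sign_sidecar(sidecar, key)
print(admit_generated_tile(3.5, 10.0, sidecar, signature, key))
```

The function deliberately has no code path that returns a trusted status, which mirrors the policy that promotion happens only through post-flight validation.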

**Contingency plan**: Quarantine generated tiles and invalidate affected cache regions.

**Residual risk after mitigation**: Low.

**Documents updated**: `architecture.md`, `components/06_cache_tile_lifecycle/description.md`, `tests/security-tests.md`.

---

### R09: Dataset Coverage / Licensing

**Description**: Public datasets may not match target terrain, may lack raw synchronized IMU, or may have non-commercial restrictions.

**Trigger conditions**: MUN-FRL/ALTO/Kagaru/EPFL slices are unavailable, unrepresentative, or license-incompatible for acceptance.

**Affected components**: Validation harness, VIO adapter, anchor verification.

**Mitigation strategy**:
1. Use public datasets for de-risking only.
2. License-tag datasets before CI jobs.
3. Require representative synchronized target data for final acceptance.

**Contingency plan**: Collect a target replay dataset before final acceptance.

**Residual risk after mitigation**: Medium.

**Documents updated**: `tests/test-data.md`, `deployment/environment_strategy.md`, `deployment/ci_cd_pipeline.md`.

---

### R10: Plane `GPS_INPUT` Integration

**Description**: ArduPilot Plane EKF and `GPS_INPUT` handling may differ from assumptions, especially around accuracy fields, ignore flags, velocity fields, and spoofing transitions.

**Trigger conditions**: Plane SITL rejects or mishandles emitted `GPS_INPUT`, or QGC status is insufficient.

**Affected components**: MAVLink/GCS integration, safety/anchor wrapper, separate e2e test suite.

**Mitigation strategy**:
1. Use pymavlink for exact `GPS_INPUT` field control.
2. Gate release on Plane SITL with production parameters.
3. Validate spoofing/failsafe and QGC status behavior.

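
For raw field validation, it can help to build the `GPS_INPUT` payload as plain data before handing it to pymavlink. The sketch below uses the field names and ignore-flag values of the MAVLink common `GPS_INPUT` message and deliberately does not import pymavlink, so it can run in unit tests. The helper name, the defaults, and the choice of which flags to set are assumptions that a Plane SITL run would need to confirm.

```python
from typing import Any, Dict

# GPS_INPUT_IGNORE_FLAGS bit values from the MAVLink common message set.
IGNORE_ALT = 1
IGNORE_HDOP = 2
IGNORE_VDOP = 4
IGNORE_VEL_HORIZ = 8
IGNORE_VEL_VERT = 16
IGNORE_SPEED_ACCURACY = 32
IGNORE_HORIZONTAL_ACCURACY = 64
IGNORE_VERTICAL_ACCURACY = 128


def gps_input_fields(
    time_usec: int, lat_deg: float, lon_deg: float, alt_m: float,
    fix_type: int, horiz_accuracy_m: float, vert_accuracy_m: float,
    ignore_flags: int, gps_id: int = 0,
    time_week: int = 0, time_week_ms: int = 0,  # placeholders, not real GPS time
) -> Dict[str, Any]:
    """Build GPS_INPUT fields in message order; lat/lon are scaled to degE7."""
    if not -90.0 <= lat_deg <= 90.0 or not -180.0 <= lon_deg <= 180.0:
        raise ValueError("latitude/longitude out of range")
    if horiz_accuracy_m < 0.0 or vert_accuracy_m < 0.0:
        raise ValueError("accuracy values must be non-negative")
    return {
        "time_usec": time_usec, "gps_id": gps_id, "ignore_flags": ignore_flags,
        "time_week_ms": time_week_ms, "time_week": time_week, "fix_type": fix_type,
        "lat": int(round(lat_deg * 1e7)), "lon": int(round(lon_deg * 1e7)),
        "alt": alt_m, "hdop": 0.0, "vdop": 0.0,
        "vn": 0.0, "ve": 0.0, "vd": 0.0, "speed_accuracy": 0.0,
        "horiz_accuracy": horiz_accuracy_m, "vert_accuracy": vert_accuracy_m,
        "satellites_visible": 0,
    }


# Illustrative flag choice: velocity and DOP fields are not populated in this sketch.
flags = (IGNORE_HDOP | IGNORE_VDOP | IGNORE_VEL_HORIZ
         | IGNORE_VEL_VERT | IGNORE_SPEED_ACCURACY)
fields = gps_input_fields(1_000_000, 50.4501, 30.5234, 120.0, 3, 4.9, 10.0, flags)
print(fields["lat"], fields["lon"], fields["ignore_flags"])
```

With a pymavlink connection named `master` (hypothetical here), the dictionary could be passed as `master.mav.gps_input_send(**fields)`, since the keys match that method's parameter names.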

**Contingency plan**: Adjust parameter guidance/output fields before hardware deployment.

**Residual risk after mitigation**: Medium.

**Documents updated**: `components/07_mavlink_gcs_integration/description.md`, `tests/environment.md`, `deployment/ci_cd_pipeline.md`.

## Architecture/Component Changes Applied

| Risk ID | Document Modified | Change Description |
|---------|-------------------|--------------------|
| R04 | `components/01_camera_ingest_calibration/description.md` | Added explicit `detect_occlusion`, `OcclusionReport`, and pre-VIO bypass behavior |
| R04/R05 | `components/03_safety_anchor_wrapper/description.md` | Added `propagate_imu_only`, `total_occlusion`, monotonic covariance behavior, and no direct Tile Manager dependency |
| R07 | `data_model.md` | Replaced embedded DB references with PostgreSQL/PostGIS structured metadata and CBOR FDR payload segments |
| R07 | `architecture.md` | Added PostgreSQL/PostGIS ADR and FDR storage decision |
| R05 | `tests/blackbox-tests.md` / `tests/resilience-tests.md` | Made total occlusion and IMU-only blackout behavior explicit |

## Summary

**Total risks identified**: 12

**Critical**: 0 | **High**: 8 | **Medium**: 4 | **Low**: 0

**Risks mitigated this iteration**: 12

**Risks requiring user decision**: None immediately. Future decisions are tied to exact camera hardware proof, dataset license approval, and representative data collection timing.