Mirror of https://github.com/azaion/detections-semantic.git (synced 2026-04-23 00:26:37 +00:00)
# Risk Assessment — Semantic Detection System — Iteration 01
## Risk Scoring Matrix

| | Low Impact | Medium Impact | High Impact |
|--|------------|---------------|-------------|
| **High Probability** | Medium | High | Critical |
| **Medium Probability** | Low | Medium | High |
| **Low Probability** | Low | Low | Medium |

## Risk Register

| ID | Risk | Category | Prob | Impact | Score | Mitigation | Status |
|----|------|----------|------|--------|-------|------------|--------|
| R01 | YOLOE backbone accuracy on aerial concealment data | Technical | Med | High | **High** | Benchmark sprint; dual backbone; empirical selection | Open |
| R02 | VLM model load latency (5-10 s) delays first L2 VLM analysis | Technical | High | Med | **High** | Predictive loading when first POI queued | Open |
| R03 | V1 heuristic false positive rate overwhelms operator | Technical | High | Med | **High** | High initial thresholds; per-season tuning; VLM filter | Open |
| R04 | GPU memory pressure during YOLOE→VLM transitions | Technical | Med | High | **High** | Sequential scheduling; explicit load/unload; memory monitoring | Open |
| R05 | Seasonal model generalization failure | Technical | High | High | **Critical** | Phased rollout (winter first); config-driven season; continuous data collection | Open |
| R06 | Search scenario config complexity causes runtime errors | Technical | Med | Med | **Medium** | Config validation at startup; scenario integration tests; good defaults | Open |
| R07 | Path following on fragmented/noisy segmentation masks | Technical | Med | Med | **Medium** | Aggressive pruning; min-length filter; fallback to area_sweep | Open |
| R08 | ViewLink protocol implementation effort underestimated | Schedule | Med | Med | **Medium** | Mock mode for parallel dev; allocate extra time; check for community implementations | Open |
| R09 | Operator information overload from too many detections | Technical | Med | Med | **Medium** | Confidence thresholds; priority ranking; scenario-based filtering | Open |
| R10 | Python GIL blocking in future code additions | Technical | Low | Low | **Low** | All hot-path compute in C/Cython/TRT (GIL released); document convention | Accepted |
| R11 | NVMe write latency during high L2 recording rate | Technical | Low | Med | **Low** | 3 MB/s well within NVMe bandwidth; async writes; drop frames if queue backs up | Accepted |
| R12 | py_trees performance overhead on complex trees | Technical | Low | Low | **Low** | <1 ms measured for ~30 nodes; monitor if tree grows | Accepted |
| R13 | Dynamic search scenarios not extensible enough | Technical | Low | Med | **Low** | BT architecture allows new subtree types without changing existing code | Accepted |

**Total**: 13 risks (1 Critical, 4 High, 4 Medium, 4 Low)

---

## Detailed Risk Analysis
### R01: YOLOE Backbone Accuracy on Aerial Concealment Data

**Description**: Neither YOLO11 nor YOLO26 has been validated on aerial imagery of camouflaged/concealed military positions. YOLO26 has a reported accuracy regression on custom datasets (GitHub #23206). Zero-shot YOLOE performance on concealment classes is unknown.

**Trigger conditions**: mAP50 on the validation set < 50% for target classes (footpath, branch_pile, dark_entrance)

**Affected components**: Tier1Detector, all downstream pipeline

**Mitigation strategy**:
1. Sprint 1 benchmark: 200 annotated frames, both backbones, same hyperparameters
2. Evaluate zero-shot YOLOE with text/visual prompts first (no training data needed)
3. If both backbones underperform: fall back to standard YOLO with custom training
4. Keep both TRT engines on NVMe; config switch
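
The benchmark decision in the steps above could be sketched roughly as follows. The target class names and the 50% mAP50 trigger come from this section; the function names, the "worst class decides" rule, and the benchmark numbers are illustrative assumptions, not results.

```python
from typing import Dict, Optional

# Target classes and the 50% trigger are taken from the risk description.
TARGET_CLASSES = ("footpath", "branch_pile", "dark_entrance")
MAP50_TRIGGER = 0.50


def min_target_map50(per_class_map50: Dict[str, float]) -> float:
    """Worst mAP50 across the target classes (a missing class counts as 0)."""
    return min(per_class_map50.get(c, 0.0) for c in TARGET_CLASSES)


def select_backbone(results: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Pick the backbone whose weakest target class scores highest.

    Returns None when every backbone is below the trigger, which maps to
    step 3 above (fall back to standard YOLO with custom training).
    """
    best_name, best_score = None, -1.0
    for name, per_class in results.items():
        score = min_target_map50(per_class)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= MAP50_TRIGGER else None


# Made-up numbers, for illustration only.
benchmark = {
    "yolo11": {"footpath": 0.61, "branch_pile": 0.55, "dark_entrance": 0.52},
    "yolo26": {"footpath": 0.66, "branch_pile": 0.47, "dark_entrance": 0.58},
}
selected = select_backbone(benchmark)
print(selected)
```

Scoring by the weakest class is one possible reading of the trigger condition; averaging across classes would be an equally defensible choice.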

**Contingency plan**: If the YOLOE open-vocabulary approach fails entirely, train a standard YOLO model from scratch on annotated data (Phase 2 timeline).

**Residual risk after mitigation**: Medium; benchmark data will be limited initially

**Documents updated**: architecture.md ADR-003

---

### R02: VLM Model Load Latency

**Description**: NanoLLM VILA1.5-3B takes 5-10 s to load. During the L1→L2 transition, if the VLM is needed for the first time in a session, the investigation is delayed.

**Trigger conditions**: First POI requiring VLM analysis in a session

**Affected components**: VLMClient, ScanController

**Mitigation strategy**:
1. When first POI is queued (even before L2 starts), begin VLM loading in background
2. If VLM not ready when Tier 2 result is ambiguous, proceed with Tier 2 result only
3. Subsequent analyses will have VLM warm
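
A minimal sketch of the predictive-loading idea above, assuming the load can be triggered from a background thread. `PredictiveVLMLoader`, `resolve_ambiguous`, and the stand-in `slow_load` are hypothetical names; the real VLMClient interface is not shown in this document.

```python
import threading
import time
from typing import Callable, Optional


class PredictiveVLMLoader:
    """Starts loading the VLM in the background when the first POI is queued."""

    def __init__(self, load_fn: Callable[[], object]) -> None:
        self._load_fn = load_fn  # hypothetical loader callback
        self._model: Optional[object] = None
        self._ready = threading.Event()
        self._lock = threading.Lock()
        self._thread: Optional[threading.Thread] = None

    def on_poi_queued(self) -> None:
        """Kick off loading once; later calls are no-ops."""
        with self._lock:
            if self._thread is None:
                self._thread = threading.Thread(target=self._load, daemon=True)
                self._thread.start()

    def _load(self) -> None:
        self._model = self._load_fn()
        self._ready.set()

    def is_ready(self) -> bool:
        return self._ready.is_set()

    def wait_ready(self, timeout: Optional[float] = None) -> bool:
        return self._ready.wait(timeout)


def resolve_ambiguous(loader: PredictiveVLMLoader, tier2_result: str,
                      vlm_analyze: Callable[[str], str]) -> str:
    """Use the VLM only if it is already warm; otherwise keep the Tier 2 result."""
    if loader.is_ready():
        return vlm_analyze(tier2_result)
    return tier2_result


def slow_load() -> str:
    time.sleep(0.05)  # stands in for the 5-10 s model load
    return "vlm"


loader = PredictiveVLMLoader(slow_load)
loader.on_poi_queued()                      # first POI queued: loading starts
first = resolve_ambiguous(loader, "ambiguous", lambda r: "vlm_confirmed")
loader.wait_ready(timeout=1.0)              # later in the session the VLM is warm
later = resolve_ambiguous(loader, "ambiguous", lambda r: "vlm_confirmed")
print(first, later)
```

If the VLM lives in a separate process, the background thread here would only issue the load request and wait for an acknowledgement; the fallback logic stays the same.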

**Contingency plan**: If load time is unacceptable, keep VLM loaded at startup and accept higher memory usage during L1 (requires verifying YOLOE fits in remaining memory).

**Residual risk after mitigation**: Low; only the first VLM request is affected

**Documents updated**: VLMClient description.md (lifecycle section), ScanController description.md (L2 subtree)

---

### R03: V1 Heuristic False Positive Rate

**Description**: The darkness + contrast heuristic will flag shadows, water puddles, dark soil, and tree shade as potential concealed positions. The operator may be overwhelmed.

**Trigger conditions**: FP rate > 80% during initial field testing

**Affected components**: Tier2SpatialAnalyzer, OutputManager, operator workflow

**Mitigation strategy**:
1. Start with conservative thresholds (higher darkness, higher contrast required)
2. Per-season threshold configs (winter/summer/autumn differ significantly)
3. VLM as secondary filter for ambiguous cases (when available)
4. Priority ranking: scenarios with higher confidence bubble up
5. Scenario-based filtering: operator sees scenario name, can mentally filter
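
To make the per-season threshold idea concrete, here is a small sketch. The two-threshold structure mirrors the "darkness + contrast" description; the field names, the 0-255 intensity scale, and every numeric value are assumptions for illustration, not tuned settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeuristicThresholds:
    max_mean_intensity: float  # region must be at least this dark (0-255 scale)
    min_contrast: float        # and darker than its surroundings by at least this much


# Illustrative values only; real thresholds would come from per-season tuning.
SEASON_THRESHOLDS = {
    "winter": HeuristicThresholds(max_mean_intensity=60.0, min_contrast=80.0),
    "summer": HeuristicThresholds(max_mean_intensity=45.0, min_contrast=50.0),
    "autumn": HeuristicThresholds(max_mean_intensity=50.0, min_contrast=55.0),
}


def is_candidate(region_mean: float, surround_mean: float, season: str) -> bool:
    """True when the region is both dark enough and contrasts enough for the season."""
    t = SEASON_THRESHOLDS[season]
    contrast = surround_mean - region_mean
    return region_mean <= t.max_mean_intensity and contrast >= t.min_contrast


# A dark patch on bright snow passes; the same patch on mid-grey ground does not.
print(is_candidate(40.0, 200.0, "winter"))
print(is_candidate(40.0, 100.0, "winter"))
```

Raising `min_contrast` or lowering `max_mean_intensity` is what "conservative thresholds" would mean in this sketch: fewer candidates reach the operator, at the cost of missed detections.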

**Contingency plan**: If the heuristic is unusable, fast-track a simple binary classifier trained on collected FP/TP data from field testing.

**Residual risk after mitigation**: Medium; some FP expected and accepted per design

**Documents updated**: Tier2SpatialAnalyzer description.md, Config helper (per-season thresholds)

---

### R04: GPU Memory Pressure During Transitions

**Description**: The YOLOE TRT engine (~2 GB GPU) must coexist with the VLM (~3 GB GPU) in 8 GB of shared LPDDR5. The same pool also serves the CPU side, so if both models are loaded simultaneously, the combined ~5 GB can exceed the memory actually available to the GPU.

**Trigger conditions**: YOLOE engine stays loaded while VLM loads, or VLM doesn't fully unload

**Affected components**: Tier1Detector, VLMClient, ScanController

**Mitigation strategy**:
1. Sequential GPU scheduling: YOLOE processes current frame → VLM loads → VLM analyzes → VLM unloads → YOLOE resumes
2. Explicit `unload_model()` call before any `load_model()` for a different model
3. Monitor GPU memory via tegrastats in health check; set `semantic_available=false` if memory exceeds threshold
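
The "unload before load" rule above can be enforced by routing all model access through a single-slot holder. This is a sketch under assumptions: `GpuModelSlot`, `acquire`, and `release` are hypothetical names, and the loaders here only record events instead of touching the GPU.

```python
from typing import Callable, Dict, List, Optional


class GpuModelSlot:
    """Holds at most one model at a time; unloads before loading another."""

    def __init__(self, loaders: Dict[str, Callable[[], object]],
                 unloaders: Dict[str, Callable[[object], None]]) -> None:
        self._loaders = loaders
        self._unloaders = unloaders
        self._name: Optional[str] = None
        self._model: Optional[object] = None

    def acquire(self, name: str) -> object:
        """Return the requested model, swapping out the current one if needed."""
        if self._name == name:
            return self._model
        self.release()  # explicit unload before any load of a different model
        self._model = self._loaders[name]()
        self._name = name
        return self._model

    def release(self) -> None:
        if self._name is not None:
            self._unloaders[self._name](self._model)
            self._name, self._model = None, None


def semantic_available(used_mb: float, limit_mb: float) -> bool:
    """Health-check flag: False once GPU memory use exceeds the threshold."""
    return used_mb <= limit_mb


events: List[str] = []


def make_loader(name: str) -> Callable[[], object]:
    def load() -> object:
        events.append(f"load {name}")
        return name
    return load


def make_unloader(name: str) -> Callable[[object], None]:
    def unload(model: object) -> None:
        events.append(f"unload {name}")
    return unload


slot = GpuModelSlot(
    loaders={n: make_loader(n) for n in ("yoloe", "vlm")},
    unloaders={n: make_unloader(n) for n in ("yoloe", "vlm")},
)
slot.acquire("yoloe")   # L1 detection
slot.acquire("vlm")     # L2 analysis: YOLOE is unloaded first
slot.acquire("yoloe")   # back to L1: VLM is unloaded first
print(events)
```

Because callers never call a loader directly, two models cannot be resident at once unless an unloader fails, which is the case the memory health check is meant to catch.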

**Contingency plan**: If memory management is unreliable, use a smaller VLM (Obsidian-3B at ~1.5 GB) that can coexist with YOLOE.

**Residual risk after mitigation**: Low; sequential scheduling is well-defined

**Documents updated**: architecture.md §5 (sequential GPU note), VLMClient description.md (lifecycle)

---

### R05: Seasonal Model Generalization Failure (CRITICAL)

**Description**: Models trained on winter imagery will fail in spring/summer/autumn. Footpaths on snow look completely different from footpaths on mud/grass. Branch piles vary by season.

**Trigger conditions**: Deploying winter-trained model in spring/summer without retraining

**Affected components**: Tier1Detector, Tier2SpatialAnalyzer, all search scenarios

**Mitigation strategy**:
1. Phased rollout: winter only for initial release
2. Season config in YAML (`season: winter`) adjusts thresholds and enables season-specific scenarios
3. Continuous data collection from every flight via frame recorder
4. Season-specific YOLOE classes: `footpath_winter`, `footpath_autumn` etc.
5. Retraining pipeline per season (Phase 4 in training strategy)
6. Search scenarios are season-aware (different trigger classes per season)
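
As a rough illustration of season-aware trigger classes, the sketch below derives class names such as `footpath_winter` from a `season` setting. The `<base>_<season>` naming follows the two examples in step 4; the scenario names, the dictionary layout, and the assumption that every base class has a seasonal variant are hypothetical.

```python
from typing import Dict, List

SEASONS = ("winter", "spring", "summer", "autumn")


def seasonal_class(base: str, season: str) -> str:
    """Build a season-specific class name, e.g. footpath + winter -> footpath_winter."""
    if season not in SEASONS:
        raise ValueError(f"unknown season: {season!r}")
    return f"{base}_{season}"


def resolve_triggers(scenarios: Dict[str, List[str]], season: str) -> Dict[str, List[str]]:
    """Map each scenario's base trigger classes to their seasonal variants."""
    return {
        name: [seasonal_class(base, season) for base in bases]
        for name, bases in scenarios.items()
    }


# Stand-in for the parsed YAML config; scenario names are invented for the example.
config = {"season": "winter"}
scenarios = {
    "trail_search": ["footpath"],
    "shelter_search": ["branch_pile", "dark_entrance"],
}
triggers = resolve_triggers(scenarios, config["season"])
print(triggers)
```

Switching a deployment to another season then means changing one config value, provided a model trained on that season's classes is available.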

**Contingency plan**: If multi-season models don't converge, maintain separate model files per season. Config-switch per deployment.

**Residual risk after mitigation**: Medium; the phased approach manages exposure, but spring/summer models will need real data

**Documents updated**: Config helper (season field), ScanController (season-aware scenarios), training strategy in solution.md

---

### R06: Search Scenario Config Complexity

**Description**: Data-driven search scenarios add YAML complexity. Invalid configs (missing follow_class for path_follow, unknown investigation type, empty trigger classes) could cause runtime errors or silent failures.

**Trigger conditions**: Operator modifies config YAML incorrectly

**Affected components**: ScanController, Config helper

**Mitigation strategy**:
1. Config validation at startup: reject invalid scenarios with clear error messages
2. Ship with well-tested default scenarios per season
3. Scenario integration tests: verify each investigation type with mock data
4. Unknown investigation type → log error, skip scenario, continue with others
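
The three invalid-config cases named in the description could be caught by a startup check along these lines. The investigation type names are the ones mentioned elsewhere in this document; the field names `name`, `investigation`, and `trigger_classes`, and the exact set of valid types, are assumptions about the config schema.

```python
from typing import Dict, List, Tuple

# Names that appear in this document; the authoritative list lives in the Config helper.
KNOWN_INVESTIGATION_TYPES = {"path_follow", "cluster_follow", "area_sweep", "zoom_classify"}


def validate_scenario(scenario: Dict) -> List[str]:
    """Return human-readable errors for one scenario (empty list means valid)."""
    errors: List[str] = []
    name = scenario.get("name", "<unnamed>")
    investigation = scenario.get("investigation")
    if investigation not in KNOWN_INVESTIGATION_TYPES:
        errors.append(f"{name}: unknown investigation type {investigation!r}")
    if not scenario.get("trigger_classes"):
        errors.append(f"{name}: trigger_classes must not be empty")
    if investigation == "path_follow" and not scenario.get("follow_class"):
        errors.append(f"{name}: path_follow requires follow_class")
    return errors


def load_scenarios(raw: List[Dict]) -> Tuple[List[Dict], List[str]]:
    """Keep valid scenarios, collect errors for the rest, and continue."""
    valid: List[Dict] = []
    errors: List[str] = []
    for scenario in raw:
        found = validate_scenario(scenario)
        if found:
            errors.extend(found)  # skip this scenario, continue with the others
        else:
            valid.append(scenario)
    return valid, errors


raw_config = [
    {"name": "trail", "investigation": "path_follow",
     "trigger_classes": ["footpath_winter"], "follow_class": "footpath_winter"},
    {"name": "broken_trail", "investigation": "path_follow",
     "trigger_classes": ["footpath_winter"]},
    {"name": "typo", "investigation": "zoom_clasify", "trigger_classes": []},
]
valid, errors = load_scenarios(raw_config)
for message in errors:
    print(message)
```

Whether startup should abort on the first invalid scenario or skip it and continue is a policy choice; this sketch follows the skip-and-continue behaviour of step 4.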

**Contingency plan**: If config complexity proves too error-prone, build a simple scenario editor tool.

**Residual risk after mitigation**: Low; validation catches most errors

**Documents updated**: Config helper (validation rules expanded)

---

### R07: Spatial Analysis on Noisy/Sparse Input

**Description**: For mask tracing: noisy or fragmented segmentation masks produce broken skeletons with spurious branches, leading to erratic path following. For cluster tracing: too few detections are visible in a single frame (e.g., wide-area L1 at medium zoom can't resolve small objects), or false positive detections create phantom clusters.

**Trigger conditions**: YOLOE footpath segmentation has many disconnected components; or the cluster_follow scenario triggers on fewer than `min_cluster_size` detections

**Affected components**: Tier2SpatialAnalyzer, GimbalDriver (PID receives erratic targets)

**Mitigation strategy**:
1. Mask trace: morphological closing before skeletonization to connect nearby fragments
2. Mask trace: aggressive pruning of branches shorter than `min_branch_length`; select longest connected component
3. Mask trace: if skeleton quality too low, fall back to area_sweep
4. Cluster trace: configurable `min_cluster_size` (default 2) filters noise
5. Cluster trace: `cluster_radius_px` prevents grouping unrelated detections
6. Cluster trace: if no valid cluster found, return empty result (ScanController falls through to next investigation type or returns to L1)
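
For the cluster-trace steps, a minimal sketch of how `cluster_radius_px` and `min_cluster_size` might interact is shown below. Single-linkage grouping (two detections belong together if a chain of neighbours within the radius connects them) is an assumption; the document does not say which clustering rule Tier2SpatialAnalyzer uses.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def cluster_detections(points: List[Point], cluster_radius_px: float,
                       min_cluster_size: int = 2) -> List[List[Point]]:
    """Group detection centres that are chained together within the radius."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= cluster_radius_px:
                parent[find(i)] = find(j)

    groups: Dict[int, List[Point]] = {}
    for i, point in enumerate(points):
        groups.setdefault(find(i), []).append(point)

    # Groups below min_cluster_size are treated as noise and dropped.
    return [g for g in groups.values() if len(g) >= min_cluster_size]


detections = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (500.0, 500.0)]
clusters = cluster_detections(detections, cluster_radius_px=15.0)
print(clusters)
```

An empty return value corresponds to step 6: the caller gets no cluster and can fall through to the next investigation type.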

**Contingency plan**: Mask trace: use centroid + bounding box direction as rough path direction. Cluster trace: fall back to zoom_classify on individual detections instead of cluster_follow.

**Residual risk after mitigation**: Low; multiple fallback layers for both strategies

**Documents updated**: Tier2SpatialAnalyzer description.md (preprocessing, cluster error handling)

---

### R08: ViewLink Protocol Implementation Effort

**Description**: No open-source Python implementation of ViewLink Serial Protocol V3.3.3 exists. Custom implementation requires parsing the PDF specification, handling binary packet format, and testing on real hardware.

**Trigger conditions**: Implementation takes >2 weeks; edge cases in protocol not documented

**Affected components**: GimbalDriver

**Mitigation strategy**:
1. Mock mode (TCP socket) enables parallel development without real hardware
2. ArduPilot has a ViewPro driver in C++ that can serve as a reference for the packet format
3. Allocate 2 weeks for GimbalDriver implementation + bench testing
4. Start with basic commands (pan/tilt/zoom) before advanced features

**Contingency plan**: If ViewLink implementation stalls, use MAVLink gimbal protocol via ArduPilot as an intermediary (less control but faster to implement).

**Residual risk after mitigation**: Low; the ArduPilot reference reduces uncertainty

**Documents updated**: GimbalDriver description.md

---

### R09: Operator Information Overload

**Description**: High FP rate + continuous scanning + multiple active scenarios = many detection candidates. Operator may miss real targets in the noise.

**Trigger conditions**: >20 detections per minute with FP rate >70%

**Affected components**: OutputManager (operator delivery), ScanController

**Mitigation strategy**:
1. Confidence threshold per scenario (configurable)
2. Priority ranking: higher-confidence, VLM-confirmed detections shown first
3. Scenario name in detection output: operator knows context
4. Configurable detection throttle: max N detections per minute to operator
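
The ranking and throttling steps above might look like the following sketch. `DetectionThrottle`, `rank`, and the detection dictionary keys are hypothetical; timestamps are passed in explicitly so the behaviour is easy to test, and VLM confirmation is assumed to outrank raw confidence.

```python
from collections import deque
from typing import Deque, Dict, List


class DetectionThrottle:
    """Sliding-window limiter: at most max_per_minute detections per window."""

    def __init__(self, max_per_minute: int, window_s: float = 60.0) -> None:
        self.max_per_minute = max_per_minute
        self.window_s = window_s
        self._sent: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Forget deliveries that have left the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) < self.max_per_minute:
            self._sent.append(now)
            return True
        return False


def rank(detections: List[Dict]) -> List[Dict]:
    """VLM-confirmed detections first, then by descending confidence."""
    return sorted(detections, key=lambda d: (not d["vlm_confirmed"], -d["confidence"]))


candidates = [
    {"scenario": "trail", "confidence": 0.90, "vlm_confirmed": False},
    {"scenario": "shelter", "confidence": 0.70, "vlm_confirmed": True},
    {"scenario": "trail", "confidence": 0.95, "vlm_confirmed": False},
]
throttle = DetectionThrottle(max_per_minute=2)
delivered = [d for d in rank(candidates) if throttle.allow(now=0.0)]
print([(d["scenario"], d["confidence"]) for d in delivered])
```

Ranking before throttling matters: when the limit is hit, the detections that get dropped are the lowest-priority ones rather than whichever arrived last.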

**Contingency plan**: Add a simple client-side filter in operator display (by scenario, by confidence).

**Residual risk after mitigation**: Medium; some overload expected initially, tuning required

**Documents updated**: OutputManager description.md, Config helper

---

### R10: Python GIL (Low — Accepted)

**Description**: Python's Global Interpreter Lock prevents true parallel execution of Python threads.

**Why it's not a concern for this system**:
1. **All compute-heavy operations release the GIL**: TensorRT inference (C++ backend), OpenCV (C backend), scikit-image skeletonization (Cython), pyserial I/O (blocking OS read/write calls), NVMe file writes (OS-level I/O)
2. **VLM runs in a separate Docker process**, entirely outside the GIL
3. **Architecture is deliberately single-threaded**: BT tick loop processes one frame at a time; no need for threading
4. **Pure Python in the hot path is minimal**: only py_trees traversal (<1 ms) and dict/list operations
5. **I/O operations** (UART, NVMe writes) release the GIL natively

**Caveat**: If future developers add Python-heavy computation in a BT leaf node without using C/Cython, it could block other operations. This is a coding practice issue, not an architectural one.

**Mitigation**: Document in coding guidelines: "All compute-heavy leaf nodes must use C-extension libraries or Cython. Pure Python processing must complete in <5ms."
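
One way to make the 5 ms guideline observable rather than purely a convention is a timing guard around leaf-node logic. This is only a sketch: the `budgeted` decorator and its attributes are hypothetical, it measures wall-clock time (so it also counts time spent inside C extensions), and it is not tied to py_trees.

```python
import functools
import logging
import time

PURE_PYTHON_BUDGET_S = 0.005  # the 5 ms limit from the coding guideline

log = logging.getLogger("bt.budget")


def budgeted(budget_s: float = PURE_PYTHON_BUDGET_S):
    """Decorator that counts and logs calls exceeding the time budget."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                wrapper.last_elapsed_s = elapsed
                if elapsed > budget_s:
                    wrapper.overruns += 1
                    log.warning("%s took %.1f ms (budget %.1f ms)",
                                fn.__name__, elapsed * 1e3, budget_s * 1e3)
        wrapper.overruns = 0
        wrapper.last_elapsed_s = 0.0
        return wrapper
    return decorator


@budgeted()
def slow_leaf() -> str:
    time.sleep(0.02)  # stands in for accidental heavy pure-Python work
    return "done"


@budgeted(budget_s=1.0)
def fast_leaf() -> int:
    return sum(range(10))


slow_leaf()
fast_leaf()
print(slow_leaf.overruns, fast_leaf.overruns)
```

The guard only reports overruns; deciding whether an overrun should fail a test run or merely log a warning in flight is left open.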

**Residual risk**: Low

---

## Architecture/Component Changes Applied (This Iteration)

| Risk ID | Document Modified | Change Description |
|---------|------------------|--------------------|
| R01 | `architecture.md` ADR-003 | Already documented: dual backbone strategy |
| R02 | `components/04_vlm_client/description.md` | Lifecycle notes: predictive loading when first POI queued |
| R03 | `components/03_tier2_spatial_analyzer/description.md` | V2 CNN removed; heuristic is permanent Tier 2 approach |
| R03 | `architecture.md` tech stack | MobileNetV3-Small removed |
| R04 | `architecture.md` §2, §5 | Sequential GPU scheduling documented |
| R05 | `common-helpers/01_helper_config.md` | Season-specific search scenarios |
| R05 | `components/01_scan_controller/description.md` | Dynamic search scenarios with season-aware trigger classes |
| R06 | `common-helpers/01_helper_config.md` | Scenario validation rules expanded |
| R07 | `components/03_tier2_spatial_analyzer/description.md` | Preprocessing, cluster error handling, and fallback documented |
| R10 | n/a | No change needed; documented as accepted |
| n/a | `data_model.md` | Fixed: POI.status added "timeout", POI max size config-driven, HealthLogEntry uses capability flags |
| n/a | `common-helpers/02_helper_types.md` | Added SearchScenario struct; POI includes scenario_name and investigation_type |

## Summary

**Total risks identified**: 13
**Critical**: 1 (R05, seasonal generalization)
**High**: 4 (R01 backbone accuracy, R02 VLM load latency, R03 heuristic FP rate, R04 GPU memory)
**Medium**: 4 (R06 config complexity, R07 fragmented masks, R08 ViewLink effort, R09 operator overload)
**Low**: 4 (R10 GIL, R11 NVMe writes, R12 py_trees overhead, R13 scenario extensibility)

**Risks mitigated this iteration**: All 13 have mitigation strategies documented
**Risks requiring user decision**: None; all mitigations are actionable without further input
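
The counts in this summary can be cross-checked mechanically against the scoring matrix and the register. In the sketch below the matrix cells and each risk's probability/impact pair are copied from the tables in this document; only the variable names are invented.

```python
from collections import Counter

# MATRIX[probability][impact] -> score, copied from the Risk Scoring Matrix.
MATRIX = {
    "High": {"Low": "Medium", "Med": "High", "High": "Critical"},
    "Med": {"Low": "Low", "Med": "Medium", "High": "High"},
    "Low": {"Low": "Low", "Med": "Low", "High": "Medium"},
}

# (probability, impact) per risk, copied from the Risk Register.
REGISTER = {
    "R01": ("Med", "High"), "R02": ("High", "Med"), "R03": ("High", "Med"),
    "R04": ("Med", "High"), "R05": ("High", "High"), "R06": ("Med", "Med"),
    "R07": ("Med", "Med"), "R08": ("Med", "Med"), "R09": ("Med", "Med"),
    "R10": ("Low", "Low"), "R11": ("Low", "Med"), "R12": ("Low", "Low"),
    "R13": ("Low", "Med"),
}

scores = {rid: MATRIX[prob][impact] for rid, (prob, impact) in REGISTER.items()}
totals = Counter(scores.values())

for level in ("Critical", "High", "Medium", "Low"):
    ids = [rid for rid, score in scores.items() if score == level]
    print(f"{level}: {totals[level]} ({', '.join(ids)})")
```

Running a check like this whenever a probability or impact rating changes keeps the Score column and the summary totals from drifting apart.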