detections-semantic/_docs/02_plans/risk_mitigations.md
Oleksandr Bezdieniezhnykh 8e2ecf50fd Initial commit
Made-with: Cursor
2026-03-26 00:20:30 +02:00

Risk Assessment — Semantic Detection System — Iteration 01

Risk Scoring Matrix

| | Low Impact | Medium Impact | High Impact |
|---|---|---|---|
| High Probability | Medium | High | Critical |
| Medium Probability | Low | Medium | High |
| Low Probability | Low | Low | Medium |
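
The matrix above can be read as a simple lookup from (probability, impact) to score. A minimal sketch follows; the names `SCORE_MATRIX` and `risk_score` are illustrative and not part of the project code.

```python
# Keys are (probability, impact); values are the resulting score,
# copied cell by cell from the Risk Scoring Matrix.
SCORE_MATRIX = {
    ("High", "Low"): "Medium",
    ("High", "Medium"): "High",
    ("High", "High"): "Critical",
    ("Medium", "Low"): "Low",
    ("Medium", "Medium"): "Medium",
    ("Medium", "High"): "High",
    ("Low", "Low"): "Low",
    ("Low", "Medium"): "Low",
    ("Low", "High"): "Medium",
}


def risk_score(probability: str, impact: str) -> str:
    """Return the matrix score for a probability/impact pair."""
    return SCORE_MATRIX[(probability, impact)]


# R05 is rated High probability and High impact in the register.
print(risk_score("High", "High"))
```

Such a lookup makes it easy to check that each Score entry in the register matches its Prob and Impact columns.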

Risk Register

| ID | Risk | Category | Prob | Impact | Score | Mitigation | Status |
|---|---|---|---|---|---|---|---|
| R01 | YOLOE backbone accuracy on aerial concealment data | Technical | Med | High | High | Benchmark sprint; dual backbone; empirical selection | Open |
| R02 | VLM model load latency (5-10s) delays first L2 VLM analysis | Technical | High | Med | High | Predictive loading when first POI queued | Open |
| R03 | V1 heuristic false positive rate overwhelms operator | Technical | High | Med | High | High initial thresholds; per-season tuning; VLM filter | Open |
| R04 | GPU memory pressure during YOLOE→VLM transitions | Technical | Med | High | High | Sequential scheduling; explicit load/unload; memory monitoring | Open |
| R05 | Seasonal model generalization failure | Technical | High | High | Critical | Phased rollout (winter first); config-driven season; continuous data collection | Open |
| R06 | Search scenario config complexity causes runtime errors | Technical | Med | Med | Medium | Config validation at startup; scenario integration tests; good defaults | Open |
| R07 | Path following on fragmented/noisy segmentation masks | Technical | Med | Med | Medium | Aggressive pruning; min-length filter; fallback to area_sweep | Open |
| R08 | ViewLink protocol implementation effort underestimated | Schedule | Med | Med | Medium | Mock mode for parallel dev; allocate extra time; check for community implementations | Open |
| R09 | Operator information overload from too many detections | Technical | Med | Med | Medium | Confidence thresholds; priority ranking; scenario-based filtering | Open |
| R10 | Python GIL blocking in future code additions | Technical | Low | Low | Low | All hot-path compute in C/Cython/TRT (GIL released); document convention | Accepted |
| R11 | NVMe write latency during high L2 recording rate | Technical | Low | Med | Low | 3MB/s well within NVMe bandwidth; async writes; drop frames if queue backs up | Accepted |
| R12 | py_trees performance overhead on complex trees | Technical | Low | Low | Low | <1ms measured for ~30 nodes; monitor if tree grows | Accepted |
| R13 | Dynamic search scenarios not extensible enough | Technical | Low | Med | Low | BT architecture allows new subtree types without changing existing code | Accepted |

Total: 13 risks — 1 Critical, 4 High, 4 Medium, 4 Low


Detailed Risk Analysis

R01: YOLOE Backbone Accuracy on Aerial Concealment Data

Description: Neither YOLO11 nor YOLO26 has been validated on aerial imagery of camouflaged/concealed military positions. YOLO26 has reported accuracy regression on custom datasets (GitHub #23206). Zero-shot YOLOE performance on concealment classes is unknown.

Trigger conditions: mAP50 on validation set < 50% for target classes (footpath, branch_pile, dark_entrance)

Affected components: Tier1Detector, all downstream pipeline

Mitigation strategy:

  1. Sprint 1 benchmark: 200 annotated frames, both backbones, same hyperparameters
  2. Evaluate zero-shot YOLOE with text/visual prompts first (no training data needed)
  3. If both backbones underperform: fall back to standard YOLO with custom training
  4. Keep both TRT engines on NVMe; config switch
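
The empirical selection step could be sketched as below. This assumes the 50% mAP50 trigger applies to each target class individually (the section does not say whether it is per class or averaged), and the function name and result layout are hypothetical.

```python
from typing import Dict, Optional

TARGET_CLASSES = ("footpath", "branch_pile", "dark_entrance")
MIN_MAP50 = 0.50  # trigger threshold from this section


def select_backbone(results: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Pick the backbone with the best mean mAP50 among those that qualify.

    `results` maps backbone name -> {class name -> mAP50 in 0..1}.
    Returns None when no backbone clears the threshold on every target
    class, which corresponds to the fallback in step 3.
    """
    best_name, best_mean = None, -1.0
    for name, per_class in results.items():
        # Assumption: a missing class counts as a failed class.
        if any(per_class.get(c, 0.0) < MIN_MAP50 for c in TARGET_CLASSES):
            continue
        mean = sum(per_class[c] for c in TARGET_CLASSES) / len(TARGET_CLASSES)
        if mean > best_mean:
            best_name, best_mean = name, mean
    return best_name


# Illustrative numbers only, not benchmark results.
example = {
    "yolo11": {"footpath": 0.62, "branch_pile": 0.55, "dark_entrance": 0.51},
    "yolo26": {"footpath": 0.70, "branch_pile": 0.48, "dark_entrance": 0.66},
}
print(select_backbone(example))
```

The selected name would then drive the config switch between the two TRT engines kept on NVMe.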

Contingency plan: If YOLOE open-vocabulary approach fails entirely, train a standard YOLO model from scratch on annotated data (Phase 2 timeline).

Residual risk after mitigation: Medium — benchmark data will be limited initially

Documents updated: architecture.md ADR-003


R02: VLM Model Load Latency

Description: NanoLLM VILA1.5-3B takes 5-10s to load. During L1→L2 transition, if VLM is needed for the first time in a session, the investigation is delayed.

Trigger conditions: First POI requiring VLM analysis in a session

Affected components: VLMClient, ScanController

Mitigation strategy:

  1. When first POI is queued (even before L2 starts), begin VLM loading in background
  2. If VLM not ready when Tier 2 result is ambiguous, proceed with Tier 2 result only
  3. Subsequent analyses will have VLM warm
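
A minimal sketch of the predictive-loading idea, assuming the load is triggered from a helper thread while the main loop keeps running. `PredictiveLoader` and `load_fn` are hypothetical names; the real VLMClient interface may differ.

```python
import threading
from typing import Callable, Optional


class PredictiveLoader:
    """Starts a slow model load once, in the background, on the first POI."""

    def __init__(self, load_fn: Callable[[], object]) -> None:
        self._load_fn = load_fn
        self._model: Optional[object] = None
        self._ready = threading.Event()
        self._started = False
        self._lock = threading.Lock()

    def on_poi_queued(self) -> None:
        """Call whenever a POI is queued; only the first call starts loading."""
        with self._lock:
            if self._started:
                return
            self._started = True
        threading.Thread(target=self._load, daemon=True).start()

    def _load(self) -> None:
        self._model = self._load_fn()
        self._ready.set()

    def get_if_ready(self) -> Optional[object]:
        """Return the model if loaded, else None (caller uses Tier 2 result only)."""
        return self._model if self._ready.is_set() else None

    def wait(self, timeout: float) -> bool:
        """Block up to `timeout` seconds; True if the model became ready."""
        return self._ready.wait(timeout)
```

When `get_if_ready()` returns `None`, the caller proceeds with the Tier 2 result alone, as in step 2.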

Contingency plan: If load time is unacceptable, keep VLM loaded at startup and accept higher memory usage during L1 (requires verifying YOLOE fits in remaining memory).

Residual risk after mitigation: Low — only first VLM request affected

Documents updated: VLMClient description.md (lifecycle section), ScanController description.md (L2 subtree)


R03: V1 Heuristic False Positive Rate

Description: Darkness + contrast heuristic will flag shadows, water puddles, dark soil, tree shade as potential concealed positions. Operator may be overwhelmed.

Trigger conditions: FP rate > 80% during initial field testing

Affected components: Tier2SpatialAnalyzer, OutputManager, operator workflow

Mitigation strategy:

  1. Start with conservative thresholds (higher darkness, higher contrast required)
  2. Per-season threshold configs (winter/summer/autumn differ significantly)
  3. VLM as secondary filter for ambiguous cases (when available)
  4. Priority ranking: scenarios with higher confidence bubble up
  5. Scenario-based filtering: operator sees scenario name, can mentally filter
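
Per-season thresholds for the darkness + contrast heuristic might look like the sketch below. The field names, the 0..1 scale, and every numeric value are placeholders chosen for illustration, not tuned settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeuristicThresholds:
    min_darkness: float  # 0..1; higher means a darker region is required
    min_contrast: float  # 0..1; contrast of the region against its surroundings


# Placeholder values: start conservative (high) and tune per season in the field.
SEASON_THRESHOLDS = {
    "winter": HeuristicThresholds(min_darkness=0.80, min_contrast=0.60),
    "summer": HeuristicThresholds(min_darkness=0.85, min_contrast=0.65),
    "autumn": HeuristicThresholds(min_darkness=0.85, min_contrast=0.70),
}


def is_candidate(darkness: float, contrast: float, season: str) -> bool:
    """Flag a region only if it clears both thresholds for the given season."""
    t = SEASON_THRESHOLDS[season]
    return darkness >= t.min_darkness and contrast >= t.min_contrast


print(is_candidate(0.90, 0.70, "winter"))  # clears both winter thresholds
print(is_candidate(0.90, 0.50, "winter"))  # dark but low contrast, e.g. a soft shadow
```

Requiring both conditions is what makes higher thresholds conservative: a dark but low-contrast patch is not flagged.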

Contingency plan: If heuristic is unusable, fast-track a simple binary classifier trained on collected FP/TP data from field testing.

Residual risk after mitigation: Medium — some FP expected and accepted per design

Documents updated: Tier2SpatialAnalyzer description.md, Config helper (per-season thresholds)


R04: GPU Memory Pressure During Transitions

Description: The YOLOE TRT engine (~2GB GPU) and the VLM (~3GB GPU) share 8GB of LPDDR5 with the CPU. If both are loaded simultaneously, their combined ~5GB footprint, on top of CPU-side usage of the same memory, can exceed the memory available to the GPU.

Trigger conditions: YOLOE engine stays loaded while VLM loads, or VLM doesn't fully unload

Affected components: Tier1Detector, VLMClient, ScanController

Mitigation strategy:

  1. Sequential GPU scheduling: YOLOE processes current frame → VLM loads → VLM analyzes → VLM unloads → YOLOE resumes
  2. Explicit unload_model() call before any load_model() for different model
  3. Monitor GPU memory via tegrastats in health check; set semantic_available=false if memory exceeds threshold
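
The "unload before load" rule can be enforced in one place instead of relying on every caller. A minimal sketch, assuming a single-threaded caller; `GpuModelSlot` and the `load_fn`/`unload_fn` callables are hypothetical stand-ins for the real `load_model()`/`unload_model()` calls.

```python
from typing import Callable, List, Optional


class GpuModelSlot:
    """Holds at most one GPU model; switching models always unloads first."""

    def __init__(self) -> None:
        self._name: Optional[str] = None
        self._unload_fn: Optional[Callable[[], None]] = None
        self.log: List[str] = []  # ordered record of load/unload events

    def load(self, name: str, load_fn: Callable[[], None],
             unload_fn: Callable[[], None]) -> None:
        if self._name == name:
            return  # already resident, nothing to do
        self.unload()  # explicit unload before loading a different model
        load_fn()
        self._name, self._unload_fn = name, unload_fn
        self.log.append(f"load:{name}")

    def unload(self) -> None:
        if self._name is None:
            return
        assert self._unload_fn is not None
        self._unload_fn()
        self.log.append(f"unload:{self._name}")
        self._name, self._unload_fn = None, None


slot = GpuModelSlot()
noop = lambda: None
slot.load("yoloe", noop, noop)  # L1 scanning
slot.load("vlm", noop, noop)    # VLM analysis: YOLOE is unloaded first
slot.load("yoloe", noop, noop)  # resume scanning: VLM is unloaded first
print(slot.log)
```

The event log gives the health check a simple way to confirm that two models were never resident at once.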

Contingency plan: If memory management is unreliable, use a smaller VLM (Obsidian-3B at ~1.5GB) that can coexist with YOLOE.

Residual risk after mitigation: Low — sequential scheduling is well-defined

Documents updated: architecture.md §5 (sequential GPU note), VLMClient description.md (lifecycle)


R05: Seasonal Model Generalization Failure (CRITICAL)

Description: Models trained on winter imagery will fail in spring/summer/autumn. Footpaths on snow look completely different from footpaths on mud/grass. Branch piles vary by season.

Trigger conditions: Deploying winter-trained model in spring/summer without retraining

Affected components: Tier1Detector, Tier2SpatialAnalyzer, all search scenarios

Mitigation strategy:

  1. Phased rollout: winter only for initial release
  2. Season config in YAML: season: winter — adjusts thresholds, enables season-specific scenarios
  3. Continuous data collection from every flight via frame recorder
  4. Season-specific YOLOE classes: footpath_winter, footpath_autumn etc.
  5. Retraining pipeline per season (Phase 4 in training strategy)
  6. Search scenarios are season-aware (different trigger classes per season)
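
As an illustration of steps 1, 2 and 4, the season setting could both gate deployment and derive season-specific class names. The helper below is a sketch: the set of enabled seasons, and the assumption that `branch_pile` gets a seasonal suffix in the same way as `footpath_winter`, are not confirmed by this document.

```python
ENABLED_SEASONS = ("winter",)  # phased rollout: winter only for the initial release
SEASONAL_BASE_CLASSES = ("footpath", "branch_pile")  # assumed to vary by season


def yoloe_classes(season: str) -> list:
    """Return season-specific class names, e.g. footpath_winter."""
    if season not in ENABLED_SEASONS:
        # Refuse to run a season that has no trained model yet.
        raise ValueError(f"season {season!r} is not enabled in this release")
    return [f"{base}_{season}" for base in SEASONAL_BASE_CLASSES]


print(yoloe_classes("winter"))
```

Failing loudly on an unsupported season guards against the trigger condition above: a winter-trained model silently deployed in another season.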

Contingency plan: If multi-season models don't converge, maintain separate model files per season. Config-switch per deployment.

Residual risk after mitigation: Medium — phased approach manages exposure, but spring/summer models will need real data

Documents updated: Config helper (season field), ScanController (season-aware scenarios), training strategy in solution.md


R06: Search Scenario Config Complexity

Description: Data-driven search scenarios add YAML complexity. Invalid configs (missing follow_class for path_follow, unknown investigation type, empty trigger classes) could cause runtime errors or silent failures.

Trigger conditions: Operator modifies config YAML incorrectly

Affected components: ScanController, Config helper

Mitigation strategy:

  1. Config validation at startup: reject invalid scenarios with clear error messages
  2. Ship with well-tested default scenarios per season
  3. Scenario integration tests: verify each investigation type with mock data
  4. Unknown investigation type → log error, skip scenario, continue with others
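
A sketch of the validation rules listed above. The investigation type names are the ones mentioned in this document; the dictionary keys (`name`, `investigation`, `trigger_classes`, `follow_class`) are assumptions about the YAML schema.

```python
from typing import Any, Dict, List, Tuple

KNOWN_INVESTIGATIONS = {"path_follow", "cluster_follow", "area_sweep", "zoom_classify"}


def validate_scenarios(
    scenarios: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Split scenarios into (valid, error messages); invalid ones are skipped."""
    valid: List[Dict[str, Any]] = []
    errors: List[str] = []
    for index, scenario in enumerate(scenarios):
        label = scenario.get("name", f"#{index}")
        kind = scenario.get("investigation")
        if kind not in KNOWN_INVESTIGATIONS:
            errors.append(f"{label}: unknown investigation type {kind!r}")
            continue
        if not scenario.get("trigger_classes"):
            errors.append(f"{label}: trigger_classes must not be empty")
            continue
        if kind == "path_follow" and not scenario.get("follow_class"):
            errors.append(f"{label}: path_follow requires follow_class")
            continue
        valid.append(scenario)
    return valid, errors


valid, errors = validate_scenarios([
    {"name": "paths", "investigation": "path_follow",
     "trigger_classes": ["footpath"], "follow_class": "footpath"},
    {"name": "broken", "investigation": "path_follow", "trigger_classes": ["footpath"]},
    {"name": "typo", "investigation": "path_folow", "trigger_classes": ["footpath"]},
])
for message in errors:
    print(message)
```

Whether an error aborts startup or only skips the scenario is a policy choice for the caller; the function reports both lists so either behavior is possible.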

Contingency plan: If config complexity proves too error-prone, build a simple scenario editor tool.

Residual risk after mitigation: Low — validation catches most errors

Documents updated: Config helper (validation rules expanded)


R07: Spatial Analysis on Noisy/Sparse Input

Description: For mask tracing: noisy or fragmented segmentation masks produce broken skeletons with spurious branches, leading to erratic path following. For cluster tracing: too few detections visible in a single frame (e.g., wide-area L1 at medium zoom can't resolve small objects), or false positive detections create phantom clusters.

Trigger conditions: YOLOE footpath segmentation has many disconnected components; or cluster_follow scenario triggers on <min_cluster_size detections

Affected components: Tier2SpatialAnalyzer, GimbalDriver (PID receives erratic targets)

Mitigation strategy:

  1. Mask trace: morphological closing before skeletonization to connect nearby fragments
  2. Mask trace: aggressive pruning of branches shorter than min_branch_length; select longest connected component
  3. Mask trace: if skeleton quality too low, fall back to area_sweep
  4. Cluster trace: configurable min_cluster_size (default 2) filters noise
  5. Cluster trace: cluster_radius_px prevents grouping unrelated detections
  6. Cluster trace: if no valid cluster found, return empty result (ScanController falls through to next investigation type or returns to L1)
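
The cluster-trace filters in steps 4 to 6 can be sketched as single-linkage grouping on detection centers. This is one possible reading of `cluster_radius_px` (two detections are linked if their centers are within the radius), not necessarily the project's algorithm.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def find_clusters(points: Sequence[Point], cluster_radius_px: float,
                  min_cluster_size: int = 2) -> List[List[int]]:
    """Group point indices whose centers chain together within the radius.

    Groups smaller than min_cluster_size are dropped as noise. An empty
    list means no valid cluster was found.
    """
    unvisited = set(range(len(points)))
    clusters: List[List[int]] = []
    while unvisited:
        seed = unvisited.pop()
        group, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= cluster_radius_px]
            for j in near:
                unvisited.remove(j)
            group.extend(near)
            frontier.extend(near)
        if len(group) >= min_cluster_size:
            clusters.append(sorted(group))
    return sorted(clusters)


detections = [(0, 0), (10, 0), (20, 0), (500, 500)]
print(find_clusters(detections, cluster_radius_px=15))
```

In the example the three nearby detections form one cluster and the distant one is discarded. With a radius too small to link any pair, the result is empty and the caller falls through to the next investigation type.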

Contingency plan: Mask trace: use centroid + bounding box direction as rough path direction. Cluster trace: fall back to zoom_classify on individual detections instead of cluster_follow.

Residual risk after mitigation: Low — multiple fallback layers for both strategies

Documents updated: Tier2SpatialAnalyzer description.md (preprocessing, cluster error handling)


R08: ViewLink Protocol Implementation Effort Underestimated

Description: No open-source Python implementation of ViewLink Serial Protocol V3.3.3 exists. A custom implementation requires parsing the PDF specification, handling the binary packet format, and testing on real hardware.

Trigger conditions: Implementation takes >2 weeks; edge cases in protocol not documented

Affected components: GimbalDriver

Mitigation strategy:

  1. Mock mode (TCP socket) enables parallel development without real hardware
  2. ArduPilot has a ViewPro driver in C++ — can reference for packet format
  3. Allocate 2 weeks for GimbalDriver implementation + bench testing
  4. Start with basic commands (pan/tilt/zoom) before advanced features

Contingency plan: If ViewLink implementation stalls, use MAVLink gimbal protocol via ArduPilot as an intermediary (less control but faster to implement).

Residual risk after mitigation: Low — ArduPilot reference reduces uncertainty

Documents updated: GimbalDriver description.md


R09: Operator Information Overload

Description: High FP rate + continuous scanning + multiple active scenarios = many detection candidates. Operator may miss real targets in the noise.

Trigger conditions: >20 detections per minute with FP rate >70%

Affected components: OutputManager (operator delivery), ScanController

Mitigation strategy:

  1. Confidence threshold per scenario (configurable)
  2. Priority ranking: higher-confidence, VLM-confirmed detections shown first
  3. Scenario name in detection output: operator knows context
  4. Configurable detection throttle: max N detections per minute to operator
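
Steps 2 and 4 can be sketched as a ranking function plus a sliding-window throttle. The detection fields (`confidence`, `vlm_confirmed`) and the class name are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


def rank(detections: List[Dict]) -> List[Dict]:
    """VLM-confirmed detections first, then by descending confidence."""
    return sorted(detections, key=lambda d: (not d["vlm_confirmed"], -d["confidence"]))


class DetectionThrottle:
    """Allows at most max_per_minute detections in any sliding 60 s window."""

    def __init__(self, max_per_minute: int, window_s: float = 60.0) -> None:
        self.max_per_minute = max_per_minute
        self.window_s = window_s
        self._sent: Deque[float] = deque()  # timestamps of delivered detections

    def allow(self, now: float) -> bool:
        # Forget deliveries that have left the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) >= self.max_per_minute:
            return False
        self._sent.append(now)
        return True


ranked = rank([
    {"id": "a", "confidence": 0.9, "vlm_confirmed": False},
    {"id": "b", "confidence": 0.6, "vlm_confirmed": True},
    {"id": "c", "confidence": 0.7, "vlm_confirmed": False},
])
print([d["id"] for d in ranked])
```

Ranking before throttling means that when the limit is hit, the detections that get dropped are the lowest-priority ones.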

Contingency plan: Add a simple client-side filter in operator display (by scenario, by confidence).

Residual risk after mitigation: Medium — some overload expected initially, tuning required

Documents updated: OutputManager description.md, Config helper


R10: Python GIL (Low — Accepted)

Description: Python's Global Interpreter Lock prevents true parallel execution of Python threads.

Why it's not a concern for this system:

  1. All compute-heavy operations release the GIL: TensorRT inference (C++ backend), OpenCV (C++ backend), scikit-image skeletonization (Cython), pyserial I/O (blocking OS-level reads and writes), NVMe file writes (OS-level I/O)
  2. VLM runs in a separate Docker process — entirely outside the GIL
  3. Architecture is deliberately single-threaded: BT tick loop processes one frame at a time; no need for threading
  4. Pure Python in hot path: only py_trees traversal (<1ms) and dict/list operations
  5. I/O operations (UART, NVMe writes) release the GIL natively

Caveat: If future developers add Python-heavy computation in a BT leaf node without using C/Cython, it could block other operations. This is a coding practice issue, not an architectural one.

Mitigation: Document in coding guidelines: "All compute-heavy leaf nodes must use C-extension libraries or Cython. Pure Python processing must complete in <5ms."
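
The 5ms convention could also be checked at runtime rather than only documented. The decorator below is a sketch; `budgeted`, the logger name, and the `overruns` counter are illustrative, not existing project code.

```python
import functools
import logging
import time

logger = logging.getLogger("bt.leaf")  # hypothetical logger name


def budgeted(max_ms: float = 5.0):
    """Warn (and count) whenever the wrapped function exceeds its time budget."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                if elapsed_ms > max_ms:
                    wrapper.overruns += 1
                    logger.warning("%s took %.1f ms (budget %.1f ms)",
                                   fn.__name__, elapsed_ms, max_ms)

        wrapper.overruns = 0
        return wrapper

    return decorator


@budgeted(max_ms=5.0)
def pure_python_step(values):
    # Stand-in for the Python-side work of a BT leaf node.
    return sum(values)
```

Applied to leaf-node update functions, this turns the guideline into a measurable signal during bench testing, at the cost of two timer reads per call.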

Residual risk: Low


Architecture/Component Changes Applied (This Iteration)

| Risk ID | Document Modified | Change Description |
|---|---|---|
| R01 | architecture.md ADR-003 | Already documented: dual backbone strategy |
| R02 | components/04_vlm_client/description.md | Lifecycle notes: predictive loading when first POI queued |
| R03 | components/03_tier2_spatial_analyzer/description.md | V2 CNN removed; heuristic is permanent Tier 2 approach |
| R03 | architecture.md tech stack | MobileNetV3-Small removed |
| R04 | architecture.md §2, §5 | Sequential GPU scheduling documented |
| R05 | common-helpers/01_helper_config.md | Season-specific search scenarios |
| R05 | components/01_scan_controller/description.md | Dynamic search scenarios with season-aware trigger classes |
| R06 | common-helpers/01_helper_config.md | Scenario validation rules expanded |
| R07 | components/03_tier2_spatial_analyzer/description.md | Preprocessing, cluster error handling, and fallback documented |
| R10 | | No change needed; documented as accepted |
| | data_model.md | Fixed: POI.status added "timeout", POI max size config-driven, HealthLogEntry uses capability flags |
| | common-helpers/02_helper_types.md | Added SearchScenario struct; POI includes scenario_name and investigation_type |

Summary

Total risks identified: 13

Critical: 1 (R05: seasonal generalization)

High: 4 (R01 backbone accuracy, R02 VLM load latency, R03 heuristic FP rate, R04 GPU memory)

Medium: 4 (R06 config complexity, R07 fragmented masks, R08 ViewLink effort, R09 operator overload)

Low: 4 (R10 GIL, R11 NVMe writes, R12 py_trees overhead, R13 scenario extensibility)

Risks mitigated this iteration: All 13 have mitigation strategies documented

Risks requiring user decision: None; all mitigations are actionable without further input