mirror of https://github.com/azaion/detections-semantic.git synced 2026-04-22 22:46:37 +00:00


Oleksandr Bezdieniezhnykh 8e2ecf50fd Initial commit

Made-with: Cursor

2026-03-26 00:20:30 +02:00


Risk Assessment — Semantic Detection System — Iteration 01

Risk Scoring Matrix

	Low Impact	Medium Impact	High Impact
High Probability	Medium	High	Critical
Medium Probability	Low	Medium	High
Low Probability	Low	Low	Medium
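
The matrix can be read as a lookup from a (probability, impact) pair to a score. A minimal sketch follows; `RISK_MATRIX` and `risk_score` are illustrative names and not part of the project, and the `Med` abbreviation follows the Risk Register columns.

```python
# Hypothetical helper: encodes the scoring matrix above as a lookup table.
RISK_MATRIX = {
    ("High", "Low"): "Medium",
    ("High", "Med"): "High",
    ("High", "High"): "Critical",
    ("Med", "Low"): "Low",
    ("Med", "Med"): "Medium",
    ("Med", "High"): "High",
    ("Low", "Low"): "Low",
    ("Low", "Med"): "Low",
    ("Low", "High"): "Medium",
}


def risk_score(probability: str, impact: str) -> str:
    """Return the score for a probability/impact pair, e.g. ("Med", "High")."""
    try:
        return RISK_MATRIX[(probability, impact)]
    except KeyError:
        raise ValueError(
            f"unknown probability/impact: {probability!r}, {impact!r}"
        ) from None
```

A check like this could be run over the register to confirm that each Score column entry agrees with its Prob and Impact columns.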

Risk Register

ID	Risk	Category	Prob	Impact	Score	Mitigation	Status
R01	YOLOE backbone accuracy on aerial concealment data	Technical	Med	High	High	Benchmark sprint; dual backbone; empirical selection	Open
R02	VLM model load latency (5-10s) delays first L2 VLM analysis	Technical	High	Med	High	Predictive loading when first POI queued	Open
R03	V1 heuristic false positive rate overwhelms operator	Technical	High	Med	High	High initial thresholds; per-season tuning; VLM filter	Open
R04	GPU memory pressure during YOLOE→VLM transitions	Technical	Med	High	High	Sequential scheduling; explicit load/unload; memory monitoring	Open
R05	Seasonal model generalization failure	Technical	High	High	Critical	Phased rollout (winter first); config-driven season; continuous data collection	Open
R06	Search scenario config complexity causes runtime errors	Technical	Med	Med	Medium	Config validation at startup; scenario integration tests; good defaults	Open
R07	Path following on fragmented/noisy segmentation masks	Technical	Med	Med	Medium	Aggressive pruning; min-length filter; fallback to area_sweep	Open
R08	ViewLink protocol implementation effort underestimated	Schedule	Med	Med	Medium	Mock mode for parallel dev; allocate extra time; check for community implementations	Open
R09	Operator information overload from too many detections	Technical	Med	Med	Medium	Confidence thresholds; priority ranking; scenario-based filtering	Open
R10	Python GIL blocking in future code additions	Technical	Low	Low	Low	All hot-path compute in C/Cython/TRT (GIL released); document convention	Accepted
R11	NVMe write latency during high L2 recording rate	Technical	Low	Med	Low	3MB/s well within NVMe bandwidth; async writes; drop frames if queue backs up	Accepted
R12	py_trees performance overhead on complex trees	Technical	Low	Low	Low	<1ms measured for ~30 nodes; monitor if tree grows	Accepted
R13	Dynamic search scenarios not extensible enough	Technical	Low	Med	Low	BT architecture allows new subtree types without changing existing code	Accepted

Total: 13 risks — 1 Critical, 4 High, 4 Medium, 4 Low

Detailed Risk Analysis

R01: YOLOE Backbone Accuracy on Aerial Concealment Data

Description: Neither YOLO11 nor YOLO26 has been validated on aerial imagery of camouflaged/concealed military positions. YOLO26 has reported accuracy regression on custom datasets (GitHub #23206). Zero-shot YOLOE performance on concealment classes is unknown.

Trigger conditions: mAP50 on validation set < 50% for target classes (footpath, branch_pile, dark_entrance)

Affected components: Tier1Detector, all downstream pipeline

Mitigation strategy:

Sprint 1 benchmark: 200 annotated frames, both backbones, same hyperparameters
Evaluate zero-shot YOLOE with text/visual prompts first (no training data needed)
If both backbones underperform: fall back to standard YOLO with custom training
Keep both TRT engines on NVMe; config switch

Contingency plan: If YOLOE open-vocabulary approach fails entirely, train a standard YOLO model from scratch on annotated data (Phase 2 timeline).

Residual risk after mitigation: Medium — benchmark data will be limited initially

Documents updated: architecture.md ADR-003

R02: VLM Model Load Latency

Description: NanoLLM VILA1.5-3B takes 5-10s to load. During L1→L2 transition, if VLM is needed for the first time in a session, the investigation is delayed.

Trigger conditions: First POI requiring VLM analysis in a session

Affected components: VLMClient, ScanController

Mitigation strategy:

When first POI is queued (even before L2 starts), begin VLM loading in background
If VLM not ready when Tier 2 result is ambiguous, proceed with Tier 2 result only
Subsequent analyses will have VLM warm
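
The predictive-loading idea above can be sketched as follows. This is an illustration under assumptions, not the VLMClient implementation: `PredictiveLoader`, `resolve`, and the `"ambiguous"` marker are hypothetical, and a background thread stands in for whatever asynchronous mechanism actually talks to the VLM process.

```python
import threading


class PredictiveLoader:
    """Hypothetical wrapper: starts a slow model load in the background once."""

    def __init__(self, load_fn):
        self._load_fn = load_fn          # assumed blocking call (5-10s in practice)
        self._ready = threading.Event()
        self._started = False
        self._lock = threading.Lock()
        self.model = None

    def on_poi_queued(self):
        """Call whenever a POI is queued; only the first call starts loading."""
        with self._lock:
            if self._started:
                return
            self._started = True
        threading.Thread(target=self._load, daemon=True).start()

    def _load(self):
        self.model = self._load_fn()
        self._ready.set()

    def is_ready(self) -> bool:
        return self._ready.is_set()

    def wait_ready(self, timeout: float) -> bool:
        """Block up to `timeout` seconds; mainly useful in tests."""
        return self._ready.wait(timeout)


def resolve(tier2_result, loader, analyze_fn):
    """Use the VLM only for ambiguous Tier 2 results, and only if it is warm."""
    if tier2_result != "ambiguous" or not loader.is_ready():
        return tier2_result              # proceed with the Tier 2 result only
    return analyze_fn(loader.model)
```

The point of the design is that `resolve` never waits: if the model is still loading, the ambiguous Tier 2 result is passed through unchanged.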

Contingency plan: If load time is unacceptable, keep VLM loaded at startup and accept higher memory usage during L1 (requires verifying YOLOE fits in remaining memory).

Residual risk after mitigation: Low — only first VLM request affected

Documents updated: VLMClient description.md (lifecycle section), ScanController description.md (L2 subtree)

R03: V1 Heuristic False Positive Rate

Description: The darkness + contrast heuristic will flag shadows, water puddles, dark soil, and tree shade as potential concealed positions. The operator may be overwhelmed.

Trigger conditions: FP rate > 80% during initial field testing

Affected components: Tier2SpatialAnalyzer, OutputManager, operator workflow

Mitigation strategy:

Start with conservative thresholds (higher darkness, higher contrast required)
Per-season threshold configs (winter/summer/autumn differ significantly)
VLM as secondary filter for ambiguous cases (when available)
Priority ranking: detections with higher confidence are shown first
Scenario-based filtering: operator sees scenario name, can mentally filter
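
Per-season thresholds for the darkness + contrast check could look roughly like the sketch below. Everything here is assumed for illustration: the `Thresholds` type, the 0..1 score ranges, and especially the numbers, which are placeholders and not tuned values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_darkness: float   # 0..1, higher means the region must be darker
    min_contrast: float   # 0..1, higher means a sharper edge against surroundings


# Placeholder numbers for illustration only; real values come from field tuning.
SEASON_THRESHOLDS = {
    "winter": Thresholds(min_darkness=0.80, min_contrast=0.60),
    "summer": Thresholds(min_darkness=0.90, min_contrast=0.70),
    "autumn": Thresholds(min_darkness=0.85, min_contrast=0.65),
}


def is_candidate(darkness: float, contrast: float, season: str) -> bool:
    """Flag a region only if it clears both thresholds for the given season."""
    t = SEASON_THRESHOLDS[season]
    return darkness >= t.min_darkness and contrast >= t.min_contrast
```

Starting conservative means raising both minimums, so fewer regions qualify until field data justifies loosening them.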

Contingency plan: If heuristic is unusable, fast-track a simple binary classifier trained on collected FP/TP data from field testing.

Residual risk after mitigation: Medium — some FP expected and accepted per design

Documents updated: Tier2SpatialAnalyzer description.md, Config helper (per-season thresholds)

R04: GPU Memory Pressure During Transitions

Description: The YOLOE TRT engine (~2GB GPU) and the VLM (~3GB GPU) share 8GB of LPDDR5 with the CPU, the OS, and the rest of the application. If both are loaded simultaneously, their combined ~5GB footprint can exceed the memory actually available to the GPU.

Trigger conditions: YOLOE engine stays loaded while VLM loads, or VLM doesn't fully unload

Affected components: Tier1Detector, VLMClient, ScanController

Mitigation strategy:

Sequential GPU scheduling: YOLOE processes current frame → VLM loads → VLM analyzes → VLM unloads → YOLOE resumes
Explicit unload_model() call before any load_model() for different model
Monitor GPU memory via tegrastats in health check; set semantic_available=false if memory exceeds threshold
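
The "unload before load" rule can be enforced by routing every model switch through a single-slot manager. The sketch below is hypothetical: `GpuModelSlot` and `semantic_available` are illustrative names, and the load/unload callables stand in for the real `load_model()` / `unload_model()` calls.

```python
class GpuModelSlot:
    """Hypothetical single-slot manager: at most one model resident at a time."""

    def __init__(self, loaders, unloaders):
        self._loaders = loaders      # name -> callable that loads the model
        self._unloaders = unloaders  # name -> callable that frees it
        self.resident = None         # name of the currently loaded model

    def acquire(self, name):
        """Make `name` the resident model, unloading any other model first."""
        if self.resident == name:
            return
        if self.resident is not None:
            self._unloaders[self.resident]()   # explicit unload before load
            self.resident = None
        self._loaders[name]()
        self.resident = name


def semantic_available(used_mb: float, limit_mb: float) -> bool:
    """Health-check rule: disable semantic analysis above the memory threshold."""
    return used_mb <= limit_mb
```

With this shape, the YOLOE → VLM → YOLOE sequence becomes three `acquire` calls, and a forgotten unload cannot happen at a call site.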

Contingency plan: If memory management is unreliable, use a smaller VLM (Obsidian-3B at ~1.5GB) that can coexist with YOLOE.

Residual risk after mitigation: Low — sequential scheduling is well-defined

Documents updated: architecture.md §5 (sequential GPU note), VLMClient description.md (lifecycle)

R05: Seasonal Model Generalization Failure (CRITICAL)

Description: Models trained on winter imagery will fail in spring/summer/autumn. Footpaths on snow look completely different from footpaths on mud/grass. Branch piles vary by season.

Trigger conditions: Deploying winter-trained model in spring/summer without retraining

Affected components: Tier1Detector, Tier2SpatialAnalyzer, all search scenarios

Mitigation strategy:

Phased rollout: winter only for initial release
Season config in YAML: season: winter — adjusts thresholds, enables season-specific scenarios
Continuous data collection from every flight via frame recorder
Season-specific YOLOE classes: footpath_winter, footpath_autumn etc.
Retraining pipeline per season (Phase 4 in training strategy)
Search scenarios are season-aware (different trigger classes per season)

Contingency plan: If multi-season models don't converge, maintain separate model files per season. Config-switch per deployment.

Residual risk after mitigation: Medium — phased approach manages exposure, but spring/summer models will need real data

Documents updated: Config helper (season field), ScanController (season-aware scenarios), training strategy in solution.md

R06: Search Scenario Config Complexity

Description: Data-driven search scenarios add YAML complexity. Invalid configs (missing follow_class for path_follow, unknown investigation type, empty trigger classes) could cause runtime errors or silent failures.

Trigger conditions: Operator modifies config YAML incorrectly

Affected components: ScanController, Config helper

Mitigation strategy:

Config validation at startup: reject invalid scenarios with clear error messages
Ship with well-tested default scenarios per season
Scenario integration tests: verify each investigation type with mock data
Unknown investigation type → log error, skip scenario, continue with others
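
A startup validator along these lines would cover the failure modes listed in the description. It is a sketch under assumptions: the dictionary keys `name`, `investigation_type`, `trigger_classes`, and `follow_class` are guesses at the config shape, and the set of known types is taken from the investigation types named in this document.

```python
# Investigation types named in this document; the real set may differ.
KNOWN_TYPES = {"path_follow", "cluster_follow", "area_sweep", "zoom_classify"}


def validate_scenario(scenario: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    name = scenario.get("name", "<unnamed>")
    inv = scenario.get("investigation_type")
    if inv not in KNOWN_TYPES:
        errors.append(f"{name}: unknown investigation type {inv!r}")
    if not scenario.get("trigger_classes"):
        errors.append(f"{name}: trigger_classes is missing or empty")
    if inv == "path_follow" and not scenario.get("follow_class"):
        errors.append(f"{name}: path_follow requires follow_class")
    return errors


def load_scenarios(raw: list):
    """Keep valid scenarios, collect errors for the rest, never raise."""
    valid, problems = [], []
    for scenario in raw:
        errs = validate_scenario(scenario)
        if errs:
            problems.extend(errs)   # caller logs these and continues
        else:
            valid.append(scenario)
    return valid, problems
```

Returning the problems instead of raising matches the "log error, skip scenario, continue with others" behavior; a stricter startup mode could refuse to run when the list is non-empty.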

Contingency plan: If config complexity proves too error-prone, build a simple scenario editor tool.

Residual risk after mitigation: Low — validation catches most errors

Documents updated: Config helper (validation rules expanded)

R07: Spatial Analysis on Noisy/Sparse Input

Description: For mask tracing: noisy or fragmented segmentation masks produce broken skeletons with spurious branches, leading to erratic path following. For cluster tracing: too few detections visible in a single frame (e.g., wide-area L1 at medium zoom can't resolve small objects), or false positive detections create phantom clusters.

Trigger conditions: YOLOE footpath segmentation has many disconnected components; or cluster_follow scenario triggers on <min_cluster_size detections

Affected components: Tier2SpatialAnalyzer, GimbalDriver (PID receives erratic targets)

Mitigation strategy:

Mask trace: morphological closing before skeletonization to connect nearby fragments
Mask trace: aggressive pruning of branches shorter than min_branch_length; select longest connected component
Mask trace: if skeleton quality too low, fall back to area_sweep
Cluster trace: configurable min_cluster_size (default 2) filters noise
Cluster trace: cluster_radius_px prevents grouping unrelated detections
Cluster trace: if no valid cluster found, return empty result (ScanController falls through to next investigation type or returns to L1)
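
The cluster-trace filters can be illustrated with a small grouping routine. Single-linkage grouping is an assumption made for this sketch; the analyzer may use a different clustering rule. Only the parameter names `cluster_radius_px` and `min_cluster_size` come from the text above.

```python
from math import hypot


def find_clusters(points, cluster_radius_px, min_cluster_size=2):
    """Group (x, y) detection centers linked by hops of at most the radius.

    Returns clusters (lists of points) with at least min_cluster_size members,
    largest first. An empty list means no valid cluster was found.
    """
    unvisited = list(range(len(points)))
    clusters = []
    while unvisited:
        frontier = [unvisited.pop()]
        members = []
        while frontier:
            i = frontier.pop()
            members.append(i)
            # Neighbors of point i that have not been assigned to a cluster yet.
            near = [j for j in unvisited
                    if hypot(points[i][0] - points[j][0],
                             points[i][1] - points[j][1]) <= cluster_radius_px]
            for j in near:
                unvisited.remove(j)
            frontier.extend(near)
        if len(members) >= min_cluster_size:
            clusters.append([points[i] for i in sorted(members)])
    clusters.sort(key=len, reverse=True)
    return clusters
```

An isolated false positive forms a cluster of size one and is dropped by the default `min_cluster_size` of 2, which is the noise filter the mitigation describes.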

Contingency plan: Mask trace: use centroid + bounding box direction as rough path direction. Cluster trace: fall back to zoom_classify on individual detections instead of cluster_follow.

Residual risk after mitigation: Low — multiple fallback layers for both strategies

Documents updated: Tier2SpatialAnalyzer description.md (preprocessing, cluster error handling)

R08: ViewLink Protocol Implementation Effort

Description: No open-source Python implementation of ViewLink Serial Protocol V3.3.3 exists. Custom implementation requires parsing the PDF specification, handling binary packet format, and testing on real hardware.

Trigger conditions: Implementation takes >2 weeks; edge cases in protocol not documented

Affected components: GimbalDriver

Mitigation strategy:

Mock mode (TCP socket) enables parallel development without real hardware
ArduPilot has a ViewPro driver in C++ — can reference for packet format
Allocate 2 weeks for GimbalDriver implementation + bench testing
Start with basic commands (pan/tilt/zoom) before advanced features

Contingency plan: If ViewLink implementation stalls, use MAVLink gimbal protocol via ArduPilot as an intermediary (less control but faster to implement).

Residual risk after mitigation: Low — ArduPilot reference reduces uncertainty

Documents updated: GimbalDriver description.md

R09: Operator Information Overload

Description: High FP rate + continuous scanning + multiple active scenarios = many detection candidates. Operator may miss real targets in the noise.

Trigger conditions: >20 detections per minute with FP rate >70%

Affected components: OutputManager (operator delivery), ScanController

Mitigation strategy:

Confidence threshold per scenario (configurable)
Priority ranking: higher-confidence, VLM-confirmed detections shown first
Scenario name in detection output: operator knows context
Configurable detection throttle: max N detections per minute to operator
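
Ranking and throttling could be combined roughly as below. The detection keys `vlm_confirmed` and `confidence`, the ordering (VLM-confirmed first, then by confidence), and the sliding 60-second window are all assumptions of this sketch and not the OutputManager design.

```python
from collections import deque


def rank(detections):
    """Sort so VLM-confirmed detections come first, then by confidence."""
    return sorted(detections,
                  key=lambda d: (d["vlm_confirmed"], d["confidence"]),
                  reverse=True)


class Throttle:
    """Sliding-window limiter: at most max_per_minute items per 60 seconds."""

    def __init__(self, max_per_minute):
        self.max_per_minute = max_per_minute
        self._sent = deque()   # timestamps (seconds) of delivered detections

    def allow(self, now):
        """Return True and record the send if the window still has room."""
        while self._sent and now - self._sent[0] >= 60.0:
            self._sent.popleft()
        if len(self._sent) < self.max_per_minute:
            self._sent.append(now)
            return True
        return False
```

Ranking before throttling means that when the limit is hit, the detections that get dropped are the lowest-priority ones.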

Contingency plan: Add a simple client-side filter in operator display (by scenario, by confidence).

Residual risk after mitigation: Medium — some overload expected initially, tuning required

Documents updated: OutputManager description.md, Config helper

R10: Python GIL (Low — Accepted)

Description: Python's Global Interpreter Lock prevents true parallel execution of Python threads.

Why it's not a concern for this system:

All compute-heavy operations release the GIL: TensorRT inference (C++ backend), OpenCV (C backend), scikit-image skeletonization (Cython), pyserial I/O (C backend), NVMe file writes (OS-level I/O)
VLM runs in a separate Docker process — entirely outside the GIL
Architecture is deliberately single-threaded: BT tick loop processes one frame at a time; no need for threading
Pure Python in hot path: only py_trees traversal (<1ms) and dict/list operations
I/O operations (UART, NVMe writes) release the GIL natively

Caveat: If future developers add Python-heavy computation in a BT leaf node without using C/Cython, it could block other operations. This is a coding practice issue, not an architectural one.

Mitigation: Document in coding guidelines: "All compute-heavy leaf nodes must use C-extension libraries or Cython. Pure Python processing must complete in <5ms."
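
One way to make the 5ms guideline observable is a timing wrapper around pure-Python leaf logic. This is a sketch only: `budgeted` and its `on_overrun` callback are hypothetical names and not part of py_trees or the project.

```python
import time
from functools import wraps


def budgeted(limit_ms=5.0, on_overrun=print):
    """Wrap a pure-Python leaf so overruns of the time budget get reported.

    `on_overrun` receives (function_name, elapsed_ms).
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                if elapsed_ms > limit_ms:
                    on_overrun(fn.__name__, elapsed_ms)
        return wrapper
    return decorator
```

The wrapper only reports; it does not interrupt a slow leaf, so it serves as a development-time check and not as an enforcement mechanism.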

Residual risk: Low

Architecture/Component Changes Applied (This Iteration)

Risk ID	Document Modified	Change Description
R01	`architecture.md` ADR-003	Already documented: dual backbone strategy
R02	`components/04_vlm_client/description.md`	Lifecycle notes: predictive loading when first POI queued
R03	`components/03_tier2_spatial_analyzer/description.md`	V2 CNN removed; heuristic is permanent Tier 2 approach
R03	`architecture.md` tech stack	MobileNetV3-Small removed
R04	`architecture.md` §2, §5	Sequential GPU scheduling documented
R05	`common-helpers/01_helper_config.md`	Season-specific search scenarios
R05	`components/01_scan_controller/description.md`	Dynamic search scenarios with season-aware trigger classes
R06	`common-helpers/01_helper_config.md`	Scenario validation rules expanded
R07	`components/03_tier2_spatial_analyzer/description.md`	Preprocessing, cluster error handling, and fallback documented
R10	—	No change needed; documented as accepted
—	`data_model.md`	Fixed: POI.status added "timeout", POI max size config-driven, HealthLogEntry uses capability flags
—	`common-helpers/02_helper_types.md`	Added SearchScenario struct; POI includes scenario_name and investigation_type

Summary

Total risks identified: 13
Critical: 1 (R05, seasonal generalization)
High: 4 (R01 backbone accuracy, R02 VLM load latency, R03 heuristic FP rate, R04 GPU memory)
Medium: 4 (R06 config complexity, R07 fragmented masks, R08 ViewLink effort, R09 operator overload)
Low: 4 (R10 GIL, R11 NVMe writes, R12 py_trees overhead, R13 scenario extensibility)

Risks mitigated this iteration: All 13 have mitigation strategies documented
Risks requiring user decision: None; all mitigations are actionable without further input
