Oleksandr Bezdieniezhnykh 8e2ecf50fd Initial commit
Made-with: Cursor
2026-03-26 00:20:30 +02:00

# Reasoning Chain
## Dimension 1: Memory Budget Feasibility
### Fact Confirmation
The Jetson Orin Nano Super has 8GB unified (CPU+GPU shared) memory. After OS overhead, only ~5.2GB is usable for GPU workloads (Fact #19). A single YOLO TRT engine consumes ~2.6GB including cuDNN/CUDA overhead (Fact #2). The free GPU memory in a Docker container is ~3.7GB (Fact #1).
### Reference Comparison
Draft01 assumes the existing YOLO TRT, YOLOE-26 TRT, MobileNetV3-Small TRT, and UAV-VL-R1 are all resident simultaneously. In reality, the two TRT engines alone likely consume 4-5GB (2 × ~2.5GB per engine, less ~1GB of CUDA overhead shared between them), against ~3.7GB free. The VLM (UAV-VL-R1 INT8, 2.5GB) cannot fit alongside both YOLO engines.
### Conclusion
**The draft01 architecture is memory-infeasible as designed.** Two YOLO TRT engines + CNN already saturate available memory. VLM cannot run concurrently. Solution: (A) time-multiplex — unload YOLO engines before loading VLM, (B) use a single merged TRT engine where YOLOE-26 is re-parameterized into the existing YOLO pipeline, (C) offload VLM to a companion device or cloud.
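The budget arithmetic above can be sanity-checked with a short script. The sizes and the count-shared-overhead-once assumption are estimates taken from the cited facts, not measurements:

```python
# Rough GPU-memory feasibility check for the Jetson Orin Nano Super budget.
# All sizes in GB; figures are estimates from the facts above, not measurements.
FREE_GPU_GB = 3.7          # free memory inside the Docker container (Fact #1)

def fits(components: dict[str, float], shared_overhead_gb: float = 1.0) -> bool:
    """True if the component set fits, assuming cuDNN/CUDA overhead (~1GB)
    is counted once rather than once per engine."""
    total = sum(components.values())
    # Each per-engine estimate already bakes in the overhead, so refund it
    # for every engine beyond the first.
    refund = shared_overhead_gb * max(len(components) - 1, 0)
    return total - refund <= FREE_GPU_GB

single_trt    = {"yolo": 2.6}
dual_trt      = {"yolo": 2.6, "yoloe26": 2.6}
dual_plus_vlm = {"yolo": 2.6, "yoloe26": 2.6, "uav_vl_r1_int8": 2.5}

print(fits(single_trt))      # → True:  one engine fits comfortably
print(fits(dual_trt))        # → False: two engines already overflow
print(fits(dual_plus_vlm))   # → False: adding the VLM is hopeless
```

Even under the optimistic shared-overhead assumption, two concurrent engines exceed the 3.7GB container budget, which is what forces options (A)-(C).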
### Confidence
✅ High — multiple sources confirm memory constraints
---
## Dimension 2: YOLO26/YOLOE-26 TRT Deployment Stability
### Fact Confirmation
YOLO26 has confirmed bugs when deployed via TensorRT on Jetson: confidence misalignment in C++ (Fact #5), INT8 export crash (Fact #6). These are architecture-specific issues in YOLO26's end-to-end NMS-free design when converted through ONNX→TRT pipeline.
### Reference Comparison
Draft01 assumes YOLOE-26 TRT deployment works. Because YOLOE-26 inherits YOLO26's architecture, it likely inherits these TRT issues as well. YOLOv8 TRT deployment, by contrast, is stable and proven on Jetson.
### Conclusion
**YOLOE-26 TRT deployment on Jetson is a high-risk path.** Mitigation: (A) use YOLOE-v8-seg (YOLOE built on YOLOv8 backbone) instead of YOLOE-26-seg for initial deployment — proven TRT stability, (B) use Python Ultralytics predict() API instead of C++ TRT initially (avoids confidence issue but slower), (C) monitor Ultralytics issue tracker for YOLO26 TRT fixes before transitioning.
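The mitigation order can be captured as a small decision function. The risk table and backend labels below are illustrative placeholders, not a real Ultralytics or TensorRT API:

```python
# Sketch of a deployment-path selector implementing mitigations (A)-(C).
# KNOWN_TRT_RISKS encodes the documented YOLO26 issues; names are invented
# for illustration only.
KNOWN_TRT_RISKS = {
    "yolo26":   ["C++ confidence misalignment (Fact #5)", "INT8 export crash (Fact #6)"],
    "yoloe-26": ["inherits YOLO26 end-to-end NMS-free head, likely same TRT issues"],
}

def select_backend(model: str, trt_fixed: bool = False) -> str:
    """Prefer C++ TensorRT; fall back per the mitigation plan while the
    model family has open TRT bugs not yet fixed upstream (mitigation C:
    flip trt_fixed once the issue tracker confirms fixes)."""
    family = model.lower()
    if family in KNOWN_TRT_RISKS and not trt_fixed:
        if family.startswith("yoloe"):
            # Mitigation (A): swap to the proven YOLOv8-backbone variant.
            return "yoloe-v8-seg + cpp-tensorrt"
        # Mitigation (B): Python predict() avoids the C++ confidence bug.
        return "python-ultralytics-predict"
    return "cpp-tensorrt"

print(select_backend("yoloe-26"))                 # → yoloe-v8-seg + cpp-tensorrt
print(select_backend("yolov8"))                   # → cpp-tensorrt
print(select_backend("yolo26", trt_fixed=True))   # → cpp-tensorrt
```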
### Confidence
✅ High — bugs are documented with issue numbers
---
## Dimension 3: YOLOE-26 Zero-Shot Accuracy for Domain
### Fact Confirmation
YOLOE is trained on LVIS (1203 categories) and COCO data (Fact #8). Military concealment classes are not in LVIS/COCO. Generic concepts like "footpath" and "road" are in-distribution and should work. Domain-specific concepts like "dugout", "FPV hideout", "camouflage netting" are far out-of-distribution.
### Reference Comparison
Draft01 relies heavily on YOLOE-26 text prompts as the bootstrapping mechanism. If text prompts fail for concealment-specific classes, the entire zero-shot Phase 1 is compromised. However, visual prompts (SAVPE) and multimodal fusion (Fact #7) offer a stronger alternative for domain-specific detection.
### Conclusion
**Text prompts alone are unreliable for concealment classes.** Solution: (A) prioritize visual prompts (SAVPE) using semantic01-04.png reference images — visually similar detection is less dependent on LVIS vocabulary, (B) use multimodal fusion (text + visual) for robustness, (C) use generic text prompts only for in-distribution classes ("footpath", "road", "trail") and visual prompts for concealment-specific patterns, (D) measure zero-shot recall in first week and have fallback to heuristic detectors.
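Solution (C) amounts to a routing rule: text prompts for in-distribution vocabulary, visual prompts (SAVPE) for everything else. A minimal sketch, where the allowlist is a tiny illustrative subset, not the real 1203-class LVIS vocabulary:

```python
# Route each requested class to text or visual prompting based on whether
# it is plausibly inside the LVIS/COCO training distribution (Fact #8).
# The vocabulary below is a hypothetical stand-in for a proper LVIS lookup.
LVIS_LIKE_VOCAB = {"road", "trail", "footpath", "path", "person", "truck", "tent"}

def route_prompts(classes: list[str]) -> dict[str, list[str]]:
    """Split classes into text-promptable vs visual-prompt (SAVPE) groups.
    OOD concealment classes go to visual prompts backed by reference
    images (e.g. the semantic01-04.png crops mentioned above)."""
    routed: dict[str, list[str]] = {"text": [], "visual": []}
    for name in classes:
        bucket = "text" if name.lower() in LVIS_LIKE_VOCAB else "visual"
        routed[bucket].append(name)
    return routed

print(route_prompts(["footpath", "road", "dugout", "camouflage netting"]))
# → {'text': ['footpath', 'road'], 'visual': ['dugout', 'camouflage netting']}
```

Measuring zero-shot recall per bucket in the first week (solution D) would then tell you whether the "text" bucket can be trusted at all.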
### Confidence
⚠️ Medium — no direct benchmarks for this domain, reasoning from training data distribution
---
## Dimension 4: Path Tracing Algorithm Robustness
### Fact Confirmation
Classical Zhang-Suen skeletonization is sensitive to boundary noise — small perturbations create spurious skeletal branches (Fact #15). Aerial segmentation masks from YOLOE-26 will have noise, partial segments, and broken paths. GraphMorph (2025) provides topology-aware centerline extraction with fewer false positives (Fact #16).
### Reference Comparison
Draft01 uses basic skeletonization + hit-miss endpoint detection. This works on clean synthetic masks but will fail on real-world noisy masks with multiple branches, gaps, and artifacts.
### Conclusion
**Add morphological preprocessing before skeletonization.** Solution: (A) apply Gaussian blur + binary threshold + morphological closing to clean segmentation masks before skeletonization, (B) prune short skeleton branches (< threshold length) to remove noise-induced artifacts, (C) consider GraphMorph as a more robust alternative if simple pruning is insufficient, (D) increase ROI crop from 128×128 to 256×256 for more spatial context in CNN classification.
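Mitigation (B), spur pruning, can be sketched in pure NumPy, assuming the skeleton arrives as a binary array from whatever skeletonizer is used. With 8-connectivity a one-pixel stub can survive at the junction, but it no longer has a single neighbour, so it stops registering as an endpoint under hit-miss detection, which is what matters here:

```python
import numpy as np

def neighbor_count(skel: np.ndarray) -> np.ndarray:
    """8-neighbour count for every pixel of a binary skeleton (zero-padded)."""
    p = np.pad(skel, 1)
    return (p[:-2, :-2] + p[:-2, 1:-1] + p[:-2, 2:] +
            p[1:-1, :-2]               + p[1:-1, 2:] +
            p[2:, :-2]  + p[2:, 1:-1]  + p[2:, 2:])

def prune_branches(skel: np.ndarray, max_len: int = 5) -> np.ndarray:
    """Iteratively strip endpoint pixels max_len times, removing spurious
    side branches shorter than max_len. This also shaves up to max_len
    pixels off genuine path ends, which is acceptable for coarse tracing."""
    out = skel.astype(np.uint8).copy()
    for _ in range(max_len):
        endpoints = (out == 1) & (neighbor_count(out) == 1)
        if not endpoints.any():
            break
        out[endpoints] = 0
    return out
```

In the full pipeline this would sit between the Gaussian-blur/threshold/closing step (A) and endpoint detection; GraphMorph (C) replaces the whole stage only if pruning proves insufficient.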
### Confidence
✅ High — skeletonization noise sensitivity is well-documented
---
## Dimension 5: VLM Runtime & Integration Viability
### Fact Confirmation
TensorRT-LLM explicitly does not support edge devices (Fact #11). vLLM works on Jetson but requires careful memory management (Fact #12). VLM inference speed for 2B models: Cosmos-Reason2-2B achieves only 4.7 tok/s (Fact #13). For a 50-100 token response, this means 10-21 seconds — far exceeding the 5s target. A 1.5B model failed to load due to KV cache buffer requirements exceeding available memory (Fact #14).
### Reference Comparison
Draft01 assumes UAV-VL-R1 runs in ≤5s via vLLM or TRT-LLM. Reality: TRT-LLM is not an option. vLLM can work but inference will be 10-20s, not 5s. VLM cannot coexist with YOLO engines in memory.
### Conclusion
**VLM tier needs fundamental redesign.** Solutions: (A) accept 10-20s VLM latency — change from "optional real-time" to "background analysis" mode where VLM results arrive after operator has moved on, (B) use SmolVLM-500M (1.8GB) or Moondream-0.5B (816 MiB INT4) instead of UAV-VL-R1 for faster inference and smaller footprint, (C) implement VLM as demand-loaded: unload all TRT engines → load VLM → infer → unload VLM → reload TRT engines (adds 2-3s per switch but ensures memory fits), (D) defer VLM to a ground station connected via datalink if latency is acceptable.
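Option (C) is essentially a swap protocol. A bookkeeping-only sketch, where `load_engine`/`free_engine` are hypothetical stand-ins for the real TensorRT deserialization and vLLM startup calls (each of which costs seconds on Jetson):

```python
from contextlib import contextmanager

FREE_GPU_GB = 3.7                 # container budget (Fact #1)
loaded: dict[str, float] = {}     # name -> GB currently resident

def load_engine(name: str, gb: float) -> None:
    """Stand-in for engine deserialization; enforces the memory budget."""
    if sum(loaded.values()) + gb > FREE_GPU_GB:
        raise MemoryError(f"{name} does not fit in {FREE_GPU_GB}GB")
    loaded[name] = gb

def free_engine(name: str) -> None:
    loaded.pop(name, None)

@contextmanager
def vlm_window(vlm_name: str, vlm_gb: float):
    """Unload every resident engine, load the VLM, restore on exit.
    Each direction of the swap adds load latency (est. 2-3s)."""
    parked = dict(loaded)
    for name in parked:
        free_engine(name)
    load_engine(vlm_name, vlm_gb)
    try:
        yield
    finally:
        free_engine(vlm_name)
        for name, gb in parked.items():
            load_engine(name, gb)

load_engine("yolo_trt", 2.6)
with vlm_window("smolvlm_500m", 1.8):     # option (B) model fits easily
    assert "yolo_trt" not in loaded       # detector parked during inference
assert "yolo_trt" in loaded               # detector restored afterwards
```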
### Confidence
✅ High — benchmarks directly confirm constraints
---
## Dimension 6: Gimbal Control Adequacy
### Fact Confirmation
PID controllers work for gimbal stabilization when the platform is relatively stable (Fact #17). However, during UAV flight, attitude changes and mounting errors cause drift that PID alone cannot compensate. Kalman filter + coordinate transformation is proven to eliminate these errors (Fact #17).
### Reference Comparison
Draft01 uses PID-only control. This works if the UAV is hovering with minimal movement. During active flight (Level 1 sweep), the UAV is moving, causing gimbal drift and path-following errors.
### Conclusion
**Add Kalman filter to gimbal control pipeline.** Solution: cascade architecture: Kalman filter for state estimation (compensates UAV attitude) → PID for error correction → gimbal actuator. Use UAV IMU data as Kalman filter input. This is standard practice for aerial gimbal systems.
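A minimal numeric sketch of the cascade, with a scalar random-walk Kalman filter standing in for the full attitude-compensation filter and toy gains and plant that would need tuning on real hardware:

```python
import random

class Kalman1D:
    """Scalar Kalman filter: estimates the true pointing error from noisy
    readings (stand-in for the IMU-fed attitude-compensation filter)."""
    def __init__(self, q: float = 1e-3, r: float = 0.1):
        self.x, self.p, self.q, self.r = 0.0, 1.0, q, r
    def update(self, z: float) -> float:
        self.p += self.q                  # predict (random-walk model)
        k = self.p / (self.p + self.r)    # Kalman gain
        self.x += k * (z - self.x)        # correct toward measurement
        self.p *= (1 - k)
        return self.x

class PID:
    def __init__(self, kp: float, ki: float, kd: float, dt: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.i, self.prev = 0.0, 0.0
    def step(self, error: float) -> float:
        self.i += error * self.dt
        d = (error - self.prev) / self.dt
        self.prev = error
        return self.kp * error + self.ki * self.i + self.kd * d

# Cascade: noisy measurement -> Kalman estimate -> PID command -> actuator.
random.seed(0)
kf = Kalman1D()
pid = PID(kp=2.0, ki=0.0, kd=0.0, dt=0.02)   # P-only toy gains to start
true_error = 5.0                             # degrees off target
for _ in range(300):
    z = true_error + random.gauss(0, 0.5)    # noisy gimbal-error reading
    cmd = pid.step(kf.update(z))
    true_error -= 0.01 * cmd                 # toy first-order plant response
```

Filtering before the PID keeps measurement noise out of the command; on the real system the Kalman state would be driven by the UAV IMU, not a synthetic signal.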
### Confidence
✅ High — well-established engineering practice
---
## Dimension 7: Training Data Realism
### Fact Confirmation
Ultralytics recommends ≥1500 images/class and ≥10,000 instances/class for good YOLO performance (Source #15). Synthetic data generation for camouflage detection is validated: GenCAMO and CamouflageAnything both improve detection baselines (Fact #18).
### Reference Comparison
Draft01 targets 500+ images by week 6, 1500+ by week 8. For military concealment in winter conditions, this is optimistic: images are hard to collect (active conflict zone), annotation requires domain expertise, and class diversity is limited.
### Conclusion
**Supplement real data with synthetic augmentation.** Solution: (A) use CamouflageAnything or GenCAMO to generate synthetic concealment training data, (B) apply cut-paste augmentation of annotated concealment objects onto clean terrain backgrounds, (C) reduce initial target to 300-500 real images + 1000+ synthetic images per class for first model iteration, (D) implement active learning: YOLOE-26 zero-shot flags candidates → human annotates → feeds into training loop.
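Solution (B), cut-paste augmentation, is mechanically simple. A bare-bones NumPy sketch (real pipelines would add boundary blending, scaling, and color jitter to avoid trivially detectable paste artifacts):

```python
import numpy as np

def cut_paste(background: np.ndarray, patch: np.ndarray, mask: np.ndarray,
              top: int, left: int) -> tuple[np.ndarray, tuple[int, int, int, int]]:
    """Paste a masked object crop onto a terrain background and return the
    composite plus its bounding box (x, y, w, h) for YOLO-style labels."""
    out = background.copy()
    h, w = patch.shape[:2]
    region = out[top:top + h, left:left + w]   # view into the composite
    m = mask.astype(bool)
    region[m] = patch[m]                       # copy only masked pixels
    return out, (left, top, w, h)

bg  = np.zeros((64, 64, 3), dtype=np.uint8)           # clean terrain stand-in
obj = np.full((8, 8, 3), 200, dtype=np.uint8)         # annotated object crop
msk = np.ones((8, 8), dtype=np.uint8)                 # its binary mask
img, bbox = cut_paste(bg, obj, msk, top=10, left=20)
print(bbox)   # → (20, 10, 8, 8)
```

Because the box is known by construction, pasted samples arrive pre-labeled, which is what makes the 1000+ synthetic images per class target cheap relative to manual annotation.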
### Confidence
⚠️ Medium — synthetic data quality for this specific domain is untested
---
## Dimension 8: Security & Adversarial Resilience
### Fact Confirmation
Smaller YOLO models are more vulnerable to adversarial patches (Fact #9). PatchBlock defense recovers up to 77% accuracy (Fact #10). No security considerations in draft01 for model weight protection, inference pipeline integrity, or operational data.
### Reference Comparison
Draft01 has zero security measures. For a military edge device that could be physically captured, this is a significant gap.
### Conclusion
**Add three security layers.** (A) Adversarial defense: integrate PatchBlock or equivalent CPU-based preprocessing to detect anomalous input patches, (B) Model protection: encrypt TRT engine files at rest, decrypt into tmpfs at boot, use secure boot chain, (C) Operational data: encrypted circular buffer for captured imagery, auto-wipe on tamper detection, transmit only coordinates + confidence (not raw imagery) over datalink.
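Layer (C)'s retention policy can be prototyped independently of the cryptography. A sketch of the bounded buffer and wipe path, with encryption-at-rest assumed to happen elsewhere (e.g. AES-GCM applied before any flush to disk):

```python
from collections import deque

class CaptureBuffer:
    """Bounded in-memory store for captured frames with wipe-on-tamper.
    Encryption and secure-boot concerns are outside this sketch; only the
    bounded-retention and wipe mechanics are shown."""
    def __init__(self, max_frames: int = 100):
        # deque(maxlen=...) silently evicts the oldest frame on overflow,
        # giving circular-buffer behaviour for free.
        self._frames: deque = deque(maxlen=max_frames)
    def push(self, frame: bytes) -> None:
        self._frames.append(frame)
    def wipe(self) -> None:
        """Hook for the tamper-detection interrupt: drop all imagery.
        Only coordinates + confidence ever leave over the datalink."""
        self._frames.clear()
    def __len__(self) -> int:
        return len(self._frames)

buf = CaptureBuffer(max_frames=3)
for i in range(5):
    buf.push(b"frame%d" % i)
print(len(buf))   # → 3, only the most recent frames are retained
buf.wipe()
print(len(buf))   # → 0 after a tamper wipe
```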
### Confidence
✅ High — threats are well-documented, mitigations exist