detections-semantic/_docs/00_research/04_reasoning_chain.md
Oleksandr Bezdieniezhnykh 8e2ecf50fd Initial commit
Made-with: Cursor
2026-03-26 00:20:30 +02:00


Reasoning Chain

Dimension 1: Memory Budget Feasibility

Fact Confirmation

The Jetson Orin Nano Super has 8GB unified (CPU+GPU shared) memory. After OS overhead, only ~5.2GB is usable for GPU workloads (Fact #19). A single YOLO TRT engine consumes ~2.6GB including cuDNN/CUDA overhead (Fact #2). The free GPU memory in a Docker container is ~3.7GB (Fact #1).

Reference Comparison

Draft01 assumes that the existing YOLO TRT engine, YOLOE-26 TRT, MobileNetV3-Small TRT, and UAV-VL-R1 are all running or ready. Reality: two TRT engines alone likely consume 4-5GB (2×~2.6GB per engine, minus ~1GB of shared CUDA overhead). The VLM (UAV-VL-R1 INT8 = 2.5GB) cannot fit alongside both YOLO engines.

Conclusion

The draft01 architecture is memory-infeasible as designed. Two YOLO TRT engines plus the CNN already saturate available memory, so the VLM cannot run concurrently. Options: (A) time-multiplex by unloading the YOLO engines before loading the VLM, (B) use a single merged TRT engine where YOLOE-26 is re-parameterized into the existing YOLO pipeline, (C) offload the VLM to a companion device or the cloud.
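
The arithmetic behind this conclusion can be laid out as a short sketch. The figures are the ones quoted in this dimension; the constant and function names are illustrative, and the MobileNetV3-Small footprint is left out because no figure is given for it here.

```python
# Figures quoted in this dimension, in GB. Names are illustrative only.
USABLE_GPU_GB = 5.2       # usable for GPU workloads after OS overhead (Fact #19)
FREE_IN_DOCKER_GB = 3.7   # free GPU memory inside a Docker container (Fact #1)
YOLO_ENGINE_GB = 2.6      # one YOLO TRT engine incl. cuDNN/CUDA overhead (Fact #2)
SHARED_CUDA_GB = 1.0      # overhead assumed to be shared between two engines
VLM_INT8_GB = 2.5         # UAV-VL-R1 INT8


def two_engine_footprint(engine_gb=YOLO_ENGINE_GB, shared_gb=SHARED_CUDA_GB):
    """Two engines, counting the shared CUDA overhead only once."""
    return 2 * engine_gb - shared_gb


def fits(required_gb, budget_gb):
    """True when the requested memory stays within the budget."""
    return required_gb <= budget_gb


engines = two_engine_footprint()
print(f"two engines: {engines:.1f} GB")
print(f"engines fit in usable memory: {fits(engines, USABLE_GPU_GB)}")
print(f"engines + VLM fit in usable memory: {fits(engines + VLM_INT8_GB, USABLE_GPU_GB)}")
print(f"VLM alone fits in Docker free memory: {fits(VLM_INT8_GB, FREE_IN_DOCKER_GB)}")
```

Under these numbers the VLM only fits once the detection engines are unloaded, which is the case that option (A) is meant to cover.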

Confidence

High — multiple sources confirm memory constraints


Dimension 2: YOLO26/YOLOE-26 TRT Deployment Stability

Fact Confirmation

YOLO26 has confirmed bugs when deployed via TensorRT on Jetson: confidence misalignment in C++ (Fact #5) and an INT8 export crash (Fact #6). These are architecture-specific issues in YOLO26's end-to-end NMS-free design when it is converted through the ONNX→TRT pipeline.

Reference Comparison

Draft01 assumes YOLOE-26 TRT deployment works. YOLOE-26 inherits YOLO26's architecture. If YOLO26 has TRT issues, YOLOE-26 likely inherits them. YOLOv8 TRT deployment is stable and proven on Jetson.

Conclusion

YOLOE-26 TRT deployment on Jetson is a high-risk path. Mitigation: (A) use YOLOE-v8-seg (YOLOE built on YOLOv8 backbone) instead of YOLOE-26-seg for initial deployment — proven TRT stability, (B) use Python Ultralytics predict() API instead of C++ TRT initially (avoids confidence issue but slower), (C) monitor Ultralytics issue tracker for YOLO26 TRT fixes before transitioning.

Confidence

High — bugs are documented with issue numbers


Dimension 3: YOLOE-26 Zero-Shot Accuracy for Domain

Fact Confirmation

YOLOE is trained on LVIS (1203 categories) and COCO data (Fact #8). Military concealment classes are not in LVIS/COCO. Generic concepts like "footpath" and "road" are in-distribution and should work. Domain-specific concepts like "dugout", "FPV hideout", "camouflage netting" are far out-of-distribution.

Reference Comparison

Draft01 relies heavily on YOLOE-26 text prompts as the bootstrapping mechanism. If text prompts fail for concealment-specific classes, the entire zero-shot Phase 1 is compromised. However, visual prompts (SAVPE) and multimodal fusion (Fact #7) offer a stronger alternative for domain-specific detection.

Conclusion

Text prompts alone are unreliable for concealment classes. Solution: (A) prioritize visual prompts (SAVPE) using the semantic01-04.png reference images, since matching by visual similarity depends less on the LVIS vocabulary, (B) use multimodal fusion (text + visual) for robustness, (C) use generic text prompts only for in-distribution classes ("footpath", "road", "trail") and visual prompts for concealment-specific patterns, (D) measure zero-shot recall in the first week and keep heuristic detectors as a fallback.
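
Step (D) needs a concrete recall number. A minimal sketch of how it could be computed from annotated boxes is shown below; the 0.5 IoU threshold and the greedy one-to-one matching are assumptions, not something draft01 specifies.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall(ground_truth, detections, iou_threshold=0.5):
    """Fraction of ground-truth boxes matched by a not-yet-used detection."""
    unused = list(detections)
    matched = 0
    for gt in ground_truth:
        # Greedy matching: take the best remaining detection for this box.
        best = max(unused, key=lambda d: iou(gt, d), default=None)
        if best is not None and iou(gt, best) >= iou_threshold:
            matched += 1
            unused.remove(best)
    return matched / len(ground_truth) if ground_truth else 0.0


# Illustrative boxes: one annotated object is found, one is missed.
annotated = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted = [(1, 1, 10, 10), (50, 50, 60, 60)]
print(recall(annotated, predicted))
```

Running this per class, separately for text-prompted and visually prompted detections, would show whether the concealment classes actually fall behind the in-distribution ones.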

Confidence

⚠️ Medium — no direct benchmarks for this domain, reasoning from training data distribution


Dimension 4: Path Tracing Algorithm Robustness

Fact Confirmation

Classical Zhang-Suen skeletonization is sensitive to boundary noise — small perturbations create spurious skeletal branches (Fact #15). Aerial segmentation masks from YOLOE-26 will have noise, partial segments, and broken paths. GraphMorph (2025) provides topology-aware centerline extraction with fewer false positives (Fact #16).

Reference Comparison

Draft01 uses basic skeletonization + hit-miss endpoint detection. This works on clean synthetic masks but will fail on real-world noisy masks with multiple branches, gaps, and artifacts.

Conclusion

Add morphological preprocessing before skeletonization. Solution: (A) apply Gaussian blur + binary threshold + morphological closing to clean segmentation masks before skeletonization, (B) prune short skeleton branches (< threshold length) to remove noise-induced artifacts, (C) consider GraphMorph as a more robust alternative if simple pruning is insufficient, (D) increase ROI crop from 128×128 to 256×256 for more spatial context in CNN classification.
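
The closing step in (A) can be illustrated with a dependency-free sketch on a tiny binary mask. This is only meant to show what closing does to a broken path; on the device one would more likely call OpenCV's `cv2.morphologyEx` with `cv2.MORPH_CLOSE`. The 3×3 structuring element and the border handling (out-of-bounds pixels are ignored) are assumptions.

```python
def _window(mask, y, x):
    """Values of the in-bounds 3x3 neighbourhood around (y, x)."""
    h, w = len(mask), len(mask[0])
    return [mask[ny][nx]
            for ny in range(max(0, y - 1), min(h, y + 2))
            for nx in range(max(0, x - 1), min(w, x + 2))]


def dilate(mask):
    """A pixel becomes 1 if any pixel in its neighbourhood is 1."""
    return [[int(any(_window(mask, y, x))) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def erode(mask):
    """A pixel stays 1 only if every pixel in its neighbourhood is 1."""
    return [[int(all(_window(mask, y, x))) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def close_mask(mask):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(mask))


# A one-pixel-wide path with a one-pixel gap in the middle.
broken_path = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
closed = close_mask(broken_path)
for row in closed:
    print(row)
```

Closing bridges gaps narrower than the structuring element without thickening the path, so the skeleton that follows sees one connected segment instead of two.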

Confidence

High — skeletonization noise sensitivity is well-documented


Dimension 5: VLM Runtime & Integration Viability

Fact Confirmation

TensorRT-LLM explicitly does not support edge devices (Fact #11). vLLM works on Jetson but requires careful memory management (Fact #12). VLM inference speed for 2B models: Cosmos-Reason2-2B achieves only 4.7 tok/s (Fact #13). For a 50-100 token response, this means 10-21 seconds — far exceeding the 5s target. A 1.5B model failed to load due to KV cache buffer requirements exceeding available memory (Fact #14).
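
The latency estimate follows directly from the decode rate. The sketch below reproduces it and also shows how many tokens would fit in the 5s target; it counts decoding only and ignores prompt and image prefill time, so real latency would be higher.

```python
TOKENS_PER_SECOND = 4.7   # Cosmos-Reason2-2B decode rate (Fact #13)
TARGET_SECONDS = 5.0      # latency target for the VLM tier


def response_latency(tokens, tok_per_s=TOKENS_PER_SECOND):
    """Seconds needed to decode `tokens` output tokens."""
    return tokens / tok_per_s


def max_tokens_within(target_s, tok_per_s=TOKENS_PER_SECOND):
    """Largest whole number of tokens that can be decoded in `target_s`."""
    return int(target_s * tok_per_s)


for n in (50, 100):
    print(f"{n} tokens: {response_latency(n):.1f} s")
print(f"tokens within target: {max_tokens_within(TARGET_SECONDS)}")
```

At this rate the 5s target leaves room for only a couple of dozen output tokens, which is too short for a reasoned description of a scene.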

Reference Comparison

Draft01 assumes UAV-VL-R1 runs in ≤5s via vLLM or TRT-LLM. Reality: TRT-LLM is not an option. vLLM can work but inference will be 10-20s, not 5s. VLM cannot coexist with YOLO engines in memory.

Conclusion

VLM tier needs fundamental redesign. Solutions: (A) accept 10-20s VLM latency — change from "optional real-time" to "background analysis" mode where VLM results arrive after operator has moved on, (B) use SmolVLM-500M (1.8GB) or Moondream-0.5B (816 MiB INT4) instead of UAV-VL-R1 for faster inference and smaller footprint, (C) implement VLM as demand-loaded: unload all TRT engines → load VLM → infer → unload VLM → reload TRT engines (adds 2-3s per switch but ensures memory fits), (D) defer VLM to a ground station connected via datalink if latency is acceptable.
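
The demand-loading sequence in (C) is easy to get wrong if VLM inference raises midway and the detectors are never restored. One way to structure it is a context manager, sketched below with stub callbacks; all four callbacks are hypothetical placeholders for whatever load and unload routines the real pipeline provides.

```python
from contextlib import contextmanager


@contextmanager
def vlm_session(unload_detectors, load_vlm, unload_vlm, reload_detectors):
    """Swap the detectors out, yield the VLM, and always swap back."""
    unload_detectors()
    vlm = None
    try:
        vlm = load_vlm()
        yield vlm
    finally:
        # Runs even if loading or inference fails, so detection resumes.
        if vlm is not None:
            unload_vlm(vlm)
        reload_detectors()


# Stub callbacks so the sketch runs without a GPU (hypothetical names).
events = []


def _unload_detectors():
    events.append("unload TRT engines")


def _load_vlm():
    events.append("load VLM")
    return lambda prompt: f"answer to: {prompt}"


def _unload_vlm(vlm):
    events.append("unload VLM")


def _reload_detectors():
    events.append("reload TRT engines")


with vlm_session(_unload_detectors, _load_vlm, _unload_vlm, _reload_detectors) as vlm:
    events.append("infer")
    answer = vlm("describe the ROI")

print(events)
```

Keeping the reload in a `finally` block means a failed VLM load costs one switch cycle but never leaves the device without its detection engines.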

Confidence

High — benchmarks directly confirm constraints


Dimension 6: Gimbal Control Adequacy

Fact Confirmation

PID controllers work for gimbal stabilization when the platform is relatively stable (Fact #17). However, during UAV flight, attitude changes and mounting errors cause drift that PID alone cannot compensate for. Combining a Kalman filter with coordinate transformation is reported to eliminate these errors (Fact #17).

Reference Comparison

Draft01 uses PID-only control. This works if the UAV is hovering with minimal movement. During active flight (Level 1 sweep), the UAV is moving, causing gimbal drift and path-following errors.

Conclusion

Add a Kalman filter to the gimbal control pipeline. Solution: a cascade architecture in which a Kalman filter performs state estimation (compensating for UAV attitude), a PID controller corrects the remaining error, and the result drives the gimbal actuator. Use UAV IMU data as the Kalman filter input. This is standard practice for aerial gimbal systems.
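
A single-axis sketch of the cascade is given below. It is a simplification under stated assumptions: a scalar Kalman filter with a random-walk state model stands in for the full attitude estimator, the coordinate transformation is reduced to adding UAV pitch to the gimbal angle, and all gains and variances are placeholder values.

```python
class ScalarKalman:
    """1-D Kalman filter with a random-walk state model."""

    def __init__(self, process_var, measurement_var, estimate=0.0, error_var=1.0):
        self.q = process_var
        self.r = measurement_var
        self.x = estimate
        self.p = error_var

    def update(self, measurement):
        self.p += self.q                      # predict: uncertainty grows
        k = self.p / (self.p + self.r)        # Kalman gain
        self.x += k * (measurement - self.x)  # correct with the measurement
        self.p *= 1.0 - k
        return self.x


class PID:
    """Textbook PID controller on an error signal."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def gimbal_command(target_angle, gimbal_angle, imu_pitch, kalman, pid, dt):
    """Cascade: filtered UAV attitude -> attitude-corrected error -> PID output."""
    uav_pitch = kalman.update(imu_pitch)
    # Single-axis assumption: world-frame line of sight = gimbal angle + UAV pitch.
    error = target_angle - (gimbal_angle + uav_pitch)
    return pid.step(error, dt)


kalman = ScalarKalman(process_var=0.01, measurement_var=0.5)
pid = PID(kp=1.2, ki=0.1, kd=0.05)
for imu_pitch in (2.1, 1.8, 2.3, 2.0):  # noisy IMU pitch readings, degrees
    print(round(gimbal_command(10.0, 6.0, imu_pitch, kalman, pid, dt=0.02), 3))
```

The point of the ordering is that the PID loop never sees raw IMU noise: it acts on an error that already accounts for the estimated UAV attitude.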

Confidence

High — well-established engineering practice


Dimension 7: Training Data Realism

Fact Confirmation

Ultralytics recommends ≥1500 images/class and ≥10,000 instances/class for good YOLO performance (Source #15). Synthetic data generation for camouflage detection is validated: GenCAMO and CamouflageAnything both improve detection baselines (Fact #18).

Reference Comparison

Draft01 targets 500+ images by week 6, 1500+ by week 8. For military concealment in winter conditions, this is optimistic: images are hard to collect (active conflict zone), annotation requires domain expertise, and class diversity is limited.

Conclusion

Supplement real data with synthetic augmentation. Solution: (A) use CamouflageAnything or GenCAMO to generate synthetic concealment training data, (B) apply cut-paste augmentation of annotated concealment objects onto clean terrain backgrounds, (C) reduce initial target to 300-500 real images + 1000+ synthetic images per class for first model iteration, (D) implement active learning: YOLOE-26 zero-shot flags candidates → human annotates → feeds into training loop.
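
Cut-paste augmentation in (B) amounts to a masked copy plus a generated label. The sketch below works on small grayscale images stored as nested lists so it runs without dependencies; the function name and the choice to label the full patch extent rather than a tight mask box are assumptions.

```python
def cut_paste(background, patch, mask, top, left):
    """Paste `patch` onto a copy of `background` wherever `mask` is 1.

    Returns the augmented image and the pasted object's (x1, y1, x2, y2)
    box, taken as the full patch extent.
    """
    out = [row[:] for row in background]  # leave the original untouched
    ph, pw = len(patch), len(patch[0])
    if top < 0 or left < 0 or top + ph > len(out) or left + pw > len(out[0]):
        raise ValueError("patch does not fit inside the background")
    for y in range(ph):
        for x in range(pw):
            if mask[y][x]:
                out[top + y][left + x] = patch[y][x]
    return out, (left, top, left + pw, top + ph)


terrain = [[0] * 4 for _ in range(4)]   # clean background
obj = [[9, 9], [9, 9]]                  # annotated object crop
obj_mask = [[1, 0], [1, 1]]             # object silhouette inside the crop
augmented, box = cut_paste(terrain, obj, obj_mask, top=1, left=2)
for row in augmented:
    print(row)
print(box)
```

Each pasted object yields a label for free, which is what makes this cheaper than collecting and annotating new imagery.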

Confidence

⚠️ Medium — synthetic data quality for this specific domain is untested


Dimension 8: Security & Adversarial Resilience

Fact Confirmation

Smaller YOLO models are more vulnerable to adversarial patches (Fact #9). PatchBlock defense recovers up to 77% accuracy (Fact #10). Draft01 contains no security considerations for model weight protection, inference pipeline integrity, or operational data.

Reference Comparison

Draft01 has zero security measures. For a military edge device that could be physically captured, this is a significant gap.

Conclusion

Add three security layers. (A) Adversarial defense: integrate PatchBlock or equivalent CPU-based preprocessing to detect anomalous input patches, (B) Model protection: encrypt TRT engine files at rest, decrypt into tmpfs at boot, use secure boot chain, (C) Operational data: encrypted circular buffer for captured imagery, auto-wipe on tamper detection, transmit only coordinates + confidence (not raw imagery) over datalink.
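
For the datalink part of (C), the simplest safeguard is a message schema that has no field for imagery at all. The sketch below shows one possible shape; the field names and the JSON encoding are assumptions, and encryption of the link itself is out of scope here.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DetectionReport:
    """Everything that leaves the device for one detection (hypothetical schema)."""
    lat: float
    lon: float
    confidence: float


def encode_report(report):
    """Serialize to compact JSON bytes; the schema carries no image data."""
    return json.dumps(asdict(report), separators=(",", ":")).encode("utf-8")


def decode_report(payload):
    """Inverse of encode_report, for the receiving side."""
    return DetectionReport(**json.loads(payload.decode("utf-8")))


report = DetectionReport(lat=1.5, lon=2.5, confidence=0.82)  # illustrative values
payload = encode_report(report)
print(payload)
print(len(payload), "bytes")
```

Because the dataclass is the only thing the encoder accepts, adding raw imagery to the link would require a deliberate schema change rather than an accidental extra field.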

Confidence

High — threats are well-documented, mitigations exist