# Solution Draft

## Product Solution Description

A three-tier semantic detection system for identifying concealed/camouflaged positions in reconnaissance UAV aerial imagery, running on a Jetson Orin Nano Super alongside the existing YOLO detection pipeline.

```
┌────────────────────────────────────────────────────────────────────────┐
│                         JETSON ORIN NANO SUPER                         │
│                                                                        │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────┐  │
│  │ ViewPro  │───▶│    Tier 1    │───▶│    Tier 2    │───▶│  Tier 3  │  │
│  │   A40    │    │  YOLO26-Seg  │    │   Spatial    │    │   VLM    │  │
│  │  Camera  │    │  + YOLOE-26  │    │  Reasoning   │    │  UAV-VL  │  │
│  │          │    │    ≤100ms    │    │    + CNN     │    │   -R1    │  │
│  │          │    │              │    │    ≤200ms    │    │   ≤5s    │  │
│  └────▲─────┘    └──────────────┘    └──────────────┘    └──────────┘  │
│       │                                                                │
│  ┌────┴─────┐    ┌──────────────┐                                      │
│  │  Gimbal  │◀───│     Scan     │                                      │
│  │ Control  │    │  Controller  │                                      │
│  │  Module  │    │   (L1/L2)    │                                      │
│  └──────────┘    └──────────────┘                                      │
│                                                                        │
│  ┌──────────────────────────────┐                                      │
│  │  Existing YOLO Detection     │  (separate service, provides         │
│  │  Cython + TRT                │   context)                           │
│  └──────────────────────────────┘                                      │
└────────────────────────────────────────────────────────────────────────┘
```

The system operates in two scan levels:

- **Level 1 (Wide Sweep)**: Camera at medium zoom, left-right swing. YOLOE-26 text/visual prompts detect POIs in real time. The existing YOLO provides scene context.
- **Level 2 (Detailed Scan)**: Camera zooms into a POI. Spatial reasoning traces footpaths and finds their endpoints. A CNN classifies potential hideouts. An optional VLM provides deep analysis for ambiguous cases.

Three submodules: (1) Semantic Detection AI, (2) Camera Gimbal Control, (3) Integration with the existing detections service.

## Existing/Competitor Solutions Analysis

No direct commercial or open-source competitor exists for this specific combination of requirements (concealed-position detection from a UAV with edge inference).
Related work:

| Solution | Approach | Limitations for This Use Case |
|----------|----------|-------------------------------|
| Standard YOLO object detection | Bounding box classification of known object types | Cannot detect camouflaged/concealed targets without explicit visual features |
| CAMOUFLAGE-Net (YOLOv7-based) | Attention mechanisms + ELAN for camouflage detection | Designed for ground-level imagery, not aerial; academic datasets only |
| Open-Vocabulary Camouflaged Object Segmentation | VLM + SAM cascaded segmentation | Too slow for real-time edge inference; requires cloud GPU |
| UAV-YOLO12 | Multi-scale road segmentation from UAV imagery | Roads only, no concealment reasoning |
| FootpathSeg (DINO-MC + UNet) | Footpath segmentation with self-supervised learning | Pedestrian context, not military aerial; no path-following logic |
| YOLO-World / YOLOE | Open-vocabulary detection | Closest fit — YOLOE-26 is our primary Tier 1 mechanism |

**Key insight**: No existing solution combines footpath detection + path tracing + endpoint analysis + concealment classification in a single pipeline. This requires a custom multi-stage system.

## Architecture

### Component 1: Tier 1 — Real-Time Detection (YOLOE-26 + YOLO26-Seg)

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **YOLOE-26 text/visual prompts (recommended)** | yoloe-26s-seg.pt, Ultralytics, TensorRT | Zero-shot detection from day 1. Text prompts: "footpath", "branch pile", "dark entrance". Visual prompts: reference images of hideouts. Zero overhead when re-parameterized. NMS-free. | Open-vocabulary accuracy lower than a custom-trained model. Text prompts may not capture all concealment patterns. | Ultralytics ≥8.4, TensorRT, JetPack 6.2 | Model weights stored locally, no cloud | Free (open source) | **Best fit for bootstrapping. Immediate capability.** |
| YOLO26-Seg custom-trained | yolo26s-seg.pt fine-tuned on custom dataset | Higher accuracy for known classes after training. Instance segmentation masks for footpaths. | Requires annotated training data (1500+ images/class). No zero-shot capability. | Custom dataset, GPU for training | Same | Free (open source) + annotation labor | **Best fit for production after data collection.** |
| UNet + MambaOut | PyTorch, TensorRT | Best published accuracy for trail segmentation from aerial photos. | Separate model, additional memory. No built-in detection head. | Custom integration | Same | Free (open source) | Backup option if YOLO26-Seg underperforms on trails |

**Recommended approach**: Start with YOLOE-26 text/visual prompts. A parallel annotation effort builds the custom dataset. Transition to fine-tuned YOLO26-Seg once data is sufficient. YOLOE-26 zero-shot capability provides immediate usability.

**YOLOE-26 configuration for this project**:

Text prompts for Level 1 detection:

- `"footpath"`, `"trail"`, `"path in snow"`, `"road"`, `"track"`
- `"pile of branches"`, `"tree branches"`, `"camouflage netting"`
- `"dark entrance"`, `"hole"`, `"dugout"`, `"dark opening"`
- `"tree row"`, `"tree line"`, `"group of trees"`
- `"clearing"`, `"open area near forest"`

Visual prompts: Annotated reference images (semantic01-04.png) as visual prompt sources. Crop ROIs around known hideouts and use them as references for SAVPE visual-prompted detection.
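A small helper can turn the annotated hideout boxes into the `visual_prompts` dict that YOLOE-style `predict()` calls accept. This is a sketch only: `build_visual_prompts` and the `(x1, y1, x2, y2, class_id)` tuple format are illustrative conventions, not part of the existing tooling.

```python
import numpy as np

def build_visual_prompts(annotations):
    """Convert annotated hideout boxes into the visual-prompts dict
    used by YOLOE-style predict() calls.

    `annotations` is a list of (x1, y1, x2, y2, class_id) tuples taken
    from the reference images (e.g. semantic01-04.png).
    """
    boxes = np.array([a[:4] for a in annotations], dtype=np.float32)
    classes = np.array([a[4] for a in annotations], dtype=np.int64)
    return {"bboxes": boxes, "cls": classes}

# Example: two hideout ROIs annotated on one reference frame
# (coordinates are placeholders, not real annotations)
prompts = build_visual_prompts([
    (120, 80, 260, 210, 0),   # branch-pile hideout
    (400, 300, 520, 420, 0),  # dugout entrance
])
```

Keeping the annotation-to-prompt conversion in one place makes it easy to regenerate prompts when the reference set grows.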
API usage (the `set_classes` call follows the current Ultralytics YOLOE API, which takes the class names plus their text embeddings):

```python
from ultralytics import YOLOE

model = YOLOE("yoloe-26s-seg.pt")
names = ["footpath", "branch pile", "dark entrance", "tree row", "road"]
# set_classes() expects the names together with their text embeddings
model.set_classes(names, model.get_text_pe(names))
results = model.predict(frame, conf=0.15)  # low threshold, high recall
```

Visual prompt usage (the boxes reference regions in `refer_image`, not in `frame`; the visual-prompt predictor follows the Ultralytics YOLOE API):

```python
import numpy as np
from ultralytics.models.yolo.yoloe import YOLOEVPSegPredictor

results = model.predict(
    frame,
    refer_image="reference_hideout.jpg",
    visual_prompts={"bboxes": np.array([[x1, y1, x2, y2]]),  # ROI of a known hideout
                    "cls": np.array([0])},
    predictor=YOLOEVPSegPredictor,
)
```

### Component 2: Tier 2 — Spatial Reasoning & CNN Confirmation

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **Path tracing + CNN classifier (recommended)** | OpenCV skeletonization, MobileNetV3-Small TensorRT | Fast (<200ms total). Path tracing from segmentation masks. Binary classifier: "concealed position yes/no" on ROI crop. Well-understood algorithms. | Requires custom annotated data for CNN classifier. Path tracing quality depends on segmentation quality. | OpenCV, scikit-image, PyTorch → TRT | Offline inference | Free + annotation labor | **Best fit. Modular, fast, interpretable.** |
| Heuristic rules only | OpenCV, NumPy | No training data needed. Rule-based: "if footpath ends at dark mass → flag." | Brittle. Hard to tune. Cannot generalize across seasons/terrain. | None | Offline | Free | Baseline/fallback for initial version |
| End-to-end custom model | PyTorch, TensorRT | Single model handles everything. | Requires massive training data. Black box. Hard to debug. | Large annotated dataset | Offline | Free + GPU time | Not recommended for initial release |

**Path tracing algorithm**:

1. Take the footpath segmentation mask from Tier 1
2. Skeletonize it using the Zhang-Suen algorithm (`skimage.morphology.skeletonize`)
3. Detect endpoints using hit-miss morphological operations (8 kernel patterns)
4. Detect junctions using branch-point kernels
5. Trace path segments between junctions/endpoints
6. For each endpoint: extract a 128×128 ROI crop centered on the endpoint from the original image
7. Feed the ROI crop to a MobileNetV3-Small binary classifier: "concealed structure" vs "natural terminus"

**Freshness assessment approach**: Visual features for fresh-vs-stale classification (binary classifier on the path ROI):

- Edge sharpness (Laplacian variance on the path boundary)
- Contrast ratio (path intensity vs surrounding terrain)
- Fill ratio (percentage of path area with snow/vegetation coverage)
- Path width consistency (fresh paths have more uniform width)

Implementation: Extract these features per path segment and feed them to a lightweight classifier (Random Forest or small CNN). The initial version can use hand-tuned thresholds, then train with annotated data.

### Component 3: Tier 3 — VLM Deep Analysis (Optional)

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **UAV-VL-R1 (recommended)** | Qwen2-VL-2B fine-tuned, vLLM, INT8 quantization | Purpose-built for aerial reasoning. 48% better than generic Qwen2-VL-2B on UAV tasks. 2.5GB INT8. Open source. | 3-5s per analysis. Competes for GPU memory with YOLO (sequential scheduling). | vLLM or TRT-LLM, GPTQ-INT8 weights | Local inference only | Free (open source) | **Best fit for aerial VLM tasks.** |
| SmolVLM2-500M | HuggingFace Transformers, ONNX | Smallest memory (1.8GB). Fastest inference (~1-2s estimated). | Weakest reasoning. May lack nuance for concealment analysis. | ONNX Runtime or TRT | Local only | Free | Fallback if memory is tight |
| Moondream 2B | moondream API, PyTorch | Built-in detect()/point() APIs. Strong grounded detection (refcoco 91.1). | Not aerial-specialized. Same size class as UAV-VL-R1 but less relevant. | PyTorch or ONNX | Local only | Free | Alternative if UAV-VL-R1 underperforms |
| No VLM | N/A | Simpler system. Less memory. No latency for Tier 3. | No zero-shot capability for novel patterns. No operator explanations. | None | N/A | Free | Viable for production if Tier 1+2 accuracy is sufficient |

**VLM prompting strategy for concealment analysis**:

```
Analyze this aerial UAV image crop. A footpath was detected leading to
this area. Is there a concealed military position visible? Look for:
- Dark entrances or openings
- Piles of cut tree branches used as camouflage
- Dugout structures
- Signs of recent human activity
Answer: YES or NO, then one sentence explaining why.
```

**Integration**: The VLM runs as a separate Python process and communicates with the main Cython pipeline via a Unix domain socket or shared memory. It is triggered only when the Tier 2 CNN confidence is between 30% and 70% (ambiguous cases).

### Component 4: Camera Gimbal Control

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **Custom ViewLink serial driver + PID controller (recommended)** | Python serial, ViewLink Protocol V3.3.3, PID library | Direct hardware control. Closed-loop tracking with detection feedback. Low latency (<10ms serial command). | Must implement the ViewLink protocol from spec. PID tuning needed. | ViewPro A40 documentation, pyserial | Physical hardware access only | Free + hardware | **Best fit. Direct control, no middleware.** |
| ArduPilot integration via MAVLink | ArduPilot, MAVLink, pymavlink | Battle-tested gimbal driver. Well-documented. | Requires an ArduPilot flight controller. Additional latency through the FC. May conflict with the mission planner. | Pixhawk or similar FC running ArduPilot 4.5+ | MAVLink protocol | Pixhawk hardware ($50-200) | Alternative if ArduPilot is already used for flight control |

**Scan controller state machine**:

```
┌─────────────────┐
│     LEVEL 1     │
│   Wide Sweep    │◀──────────────────────────┐
│   Medium Zoom   │                           │
└────────┬────────┘                           │
         │ POI detected                       │ Analysis complete
         ▼                                    │ or timeout (5s)
┌─────────────────┐                           │
│      ZOOM       │                           │
│   TRANSITION    │                           │
│   1-2 seconds   │                           │
└────────┬────────┘                           │
         │ Zoom complete                      │
         ▼                                    │
┌─────────────────┐     ┌──────────────┐      │
│     LEVEL 2     │────▶│     PATH     │──────┤
│  Detailed Scan  │     │  FOLLOWING   │      │
│    High Zoom    │     │  Pan along   │      │
└────────┬────────┘     │   detected   │      │
         │              │     path     │      │
         │              └──────────────┘      │
         │ Endpoint found                     │
         ▼                                    │
┌─────────────────┐                           │
│    ENDPOINT     │                           │
│    ANALYSIS     │───────────────────────────┘
│   Hold + VLM    │
└─────────────────┘
```

**Level 1 sweep pattern**: Sinusoidal yaw oscillation centered on the flight heading. Amplitude: ±30° (configurable). Period: matched to ground speed so adjacent sweeps overlap by 20%. Pitch: slightly downward (configurable based on altitude and desired ground coverage).

**Path-following control loop**:

1. Tier 1 outputs the footpath segmentation mask
2. Extract the path centerline direction (from the skeleton)
3. Compute the error: path center vs frame center
4. A PID controller adjusts gimbal yaw/pitch to minimize the error
5. Update rate: tied to the detection frame rate (10-30 FPS)
6. When the path endpoint is reached or the path leaves the frame: stop following and analyze the endpoint

**POI queue management**: Priority queue sorted by: (1) detection confidence, (2) proximity to the current camera position (minimize slew time), (3) recency. Max queue size: 20 POIs. Older/lower-confidence entries expire after 30 seconds.
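The POI queue policy above can be sketched with the standard-library `heapq`. This is a minimal illustration under stated assumptions: `POIQueue` and its ordering tuple are hypothetical, and proximity is reduced to a precomputed `slew_cost` number rather than a real gimbal-angle calculation.

```python
import heapq
import itertools
import time

class POIQueue:
    """Sketch of the Level 1 POI queue: highest confidence first, then
    smallest slew cost, then most recent; bounded size; entries expire
    after `ttl` seconds. Names and weights are illustrative."""

    def __init__(self, max_size=20, ttl=30.0):
        self.max_size = max_size
        self.ttl = ttl
        self._heap = []
        self._counter = itertools.count()  # stable tie-breaker

    def push(self, confidence, slew_cost, t=None):
        t = time.monotonic() if t is None else t
        # heapq is a min-heap: negate confidence so high confidence pops
        # first; negate time so newer entries beat older ones on ties.
        heapq.heappush(self._heap,
                       (-confidence, slew_cost, -t, next(self._counter), t))
        if len(self._heap) > self.max_size:
            self._heap.remove(max(self._heap))  # drop the worst entry
            heapq.heapify(self._heap)

    def pop(self, now=None):
        """Return (confidence, slew_cost) of the best live POI, or None."""
        now = time.monotonic() if now is None else now
        while self._heap:
            neg_conf, slew, _, _, born = heapq.heappop(self._heap)
            if now - born <= self.ttl:  # silently discard expired POIs
                return -neg_conf, slew
        return None

# Usage: a high-confidence POI wins even with a larger slew cost.
q = POIQueue()
q.push(0.9, slew_cost=5.0, t=0.0)
q.push(0.6, slew_cost=1.0, t=0.0)
best = q.pop(now=1.0)  # (0.9, 5.0)
```

Expiry is handled lazily at pop time, which keeps `push` cheap during the high-rate Level 1 sweep.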
### Component 5: Integration with Existing Detections Service

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **Extend existing Cython codebase + separate VLM process (recommended)** | Cython, TensorRT, Unix socket IPC | Maintains existing architecture. YOLO26/YOLOE-26 fits naturally into the TRT pipeline. VLM isolated in a separate process. | VLM IPC adds small latency. Two processes to manage. | Cython extensions, process management | Process isolation | Free | **Best fit. Minimal disruption to existing system.** |
| Microservice architecture | FastAPI, Docker, gRPC | Clean separation. Independent scaling. | Overhead for a single Jetson. Over-engineered for edge. | Docker, networking | Service mesh | Free | Over-engineered for a single device |

**Integration architecture**:

```
┌─────────────────────────────────────────────────────────┐
│              Main Process (Cython + TRT)                │
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │   Existing   │  │   YOLOE-26   │  │     Scan     │   │
│  │   YOLO Det   │  │   Semantic   │  │  Controller  │   │
│  │ (TRT Engine) │  │ (TRT Engine) │  │   + Gimbal   │   │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘   │
│         │                 │                 │           │
│         └────────┬────────┘                 │           │
│                  ▼                          │           │
│          ┌──────────────┐                   │           │
│          │   Spatial    │◀──────────────────┘           │
│          │  Reasoning   │                               │
│          │ + CNN Class. │                               │
│          └──────┬───────┘                               │
│                 │ ambiguous cases                       │
│                 ▼                                       │
│          ┌──────────────┐                               │
│          │  IPC (Unix   │                               │
│          │   Socket)    │                               │
│          └──────┬───────┘                               │
└─────────────────┼───────────────────────────────────────┘
                  │
                  ▼
      ┌────────────────────────────┐
      │   VLM Process (Python)     │
      │  UAV-VL-R1 (vLLM/TRT-LLM)  │
      │      INT8 quantized        │
      └────────────────────────────┘
```

**Data flow**:

1. Camera frame → existing YOLO detection (scene context: vehicles, debris, structures)
2. Same frame → YOLOE-26 semantic detection (footpaths, branch piles, entrances)
3. YOLO context + YOLOE-26 detections → spatial reasoning module
4. Spatial reasoning: path tracing, endpoint analysis, CNN classification
5. High-confidence detections → operator notification (bounding box + coordinates)
6. Ambiguous detections → VLM process via IPC → response → operator notification
7. All detections → scan controller → gimbal commands (Level 1/2 transitions, path following)

**GPU scheduling**: YOLO and YOLOE-26 can share a single TRT engine (YOLOE-26 re-parameterized to standard YOLO26 weights for fixed classes). VLM inference is sequential: pause YOLO frames, run the VLM, resume YOLO. VLM analysis typically lasts 3-5s, during which the camera holds position (endpoint analysis phase).

## Training & Data Strategy

### Phase 1: Zero-shot (Weeks 1-2)

- Deploy YOLOE-26 with text/visual prompts
- Use semantic01-04.png as visual prompt references
- Tune text prompt class names and confidence thresholds
- Collect false positive/negative data for annotation

### Phase 2: Annotation & Fine-tuning (Weeks 3-8)

- Annotate collected data using existing annotation tooling
- Priority order: footpaths (segmentation masks) → branch piles (bboxes) → entrances (bboxes) → roads (segmentation) → trees (bboxes)
- Use SAM (Segment Anything Model) for semi-automated segmentation mask generation
- Target: 500+ images per class by week 6, 1500+ by week 8
- Fine-tune YOLO26-Seg on the custom dataset using linear probing first, then full fine-tuning

### Phase 3: Custom CNN classifier (Weeks 6-10)

- Train a MobileNetV3-Small binary classifier on ROI crops from endpoint analysis
- Positive: annotated concealed positions. Negative: natural path termini, random terrain
- Target: 300+ positive, 1000+ negative samples
- Export to TensorRT FP16

### Phase 4: VLM integration (Weeks 8-12)

- Deploy UAV-VL-R1 INT8 as a separate process
- Tune the prompting strategy on collected ambiguous cases
- Optional: fine-tune UAV-VL-R1 on domain-specific data if base accuracy is insufficient

### Phase 5: Seasonal expansion (Month 3+)

- Winter data → spring/summer annotation campaigns
- Re-train all models with multi-season data
- Expect accuracy degradation in summer (vegetation occlusion); mitigate with a larger dataset

## Testing Strategy

### Integration / Functional Tests

- YOLOE-26 text prompt detection on reference images (semantic01-04.png) — verify footpath and hideout regions are flagged
- Path tracing on synthetic segmentation masks — verify skeleton, endpoint, and junction detection
- CNN classifier on known positive/negative ROI crops — verify binary output correctness
- Gimbal control loop on a simulated camera feed — verify PID convergence and path-following accuracy
- VLM IPC round trip — verify request/response latency and correctness
- Full pipeline test: frame → YOLOE detection → path tracing → CNN → VLM → operator notification
- Scan controller state machine: verify Level 1 → Level 2 transitions, timeout, and return to Level 1

### Non-Functional Tests

- Tier 1 latency: measure end-to-end YOLOE-26 inference on the Jetson Orin Nano Super, ≤100ms
- Tier 2 latency: path tracing + CNN classification ≤200ms
- Tier 3 latency: VLM IPC + inference ≤5 seconds
- Memory profiling: total RAM usage with YOLO + YOLOE + CNN + VLM concurrently loaded
- Thermal stress test: continuous inference for 30+ minutes without thermal throttling
- False positive rate measurement on known clean terrain
- False negative rate measurement on annotated concealed positions
- Gimbal response test: measure physical camera movement latency vs command latency

## References

- YOLO26 docs: https://docs.ultralytics.com/models/yolo26/
- YOLOE docs: https://docs.ultralytics.com/models/yoloe/
- YOLOE-26 paper: https://arxiv.org/abs/2602.00168
- YOLO26 Jetson benchmarks: https://docs.ultralytics.com/guides/nvidia-jetson
- UAV-VL-R1: https://arxiv.org/pdf/2508.11196, https://github.com/Leke-G/UAV-VL-R1
- SmolVLM2: https://huggingface.co/blog/smolervlm
- Moondream: https://moondream.ai/blog/introducing-moondream-0-5b
- ViewPro Protocol: https://www.viewprotech.com/index.php?ac=article&at=read&did=510
- ArduPilot ViewPro: https://ardupilot.org/copter/docs/common-viewpro-gimbal.html
- FootpathSeg: https://github.com/WennyXY/FootpathSeg
- UAV-YOLO12: https://www.mdpi.com/2072-4292/17/9/1539
- Trail segmentation (UNet + MambaOut): https://arxiv.org/pdf/2504.12121
- servopilot PID: https://pypi.org/project/servopilot/
- Camouflage detection (OVCOS): https://arxiv.org/html/2506.19300v1
- Jetson AI Lab benchmarks: https://www.jetson-ai-lab.com/tutorials/genai-benchmarking/
- Cosmos-Reason2-2B benchmarks: Embedl blog (Feb 2026)

## Related Artifacts

- AC assessment: `_docs/00_research/00_ac_assessment.md`
- Tech stack evaluation: `_docs/01_solution/tech_stack.md` (if Phase 3 was executed)
- Security analysis: `_docs/01_solution/security_analysis.md` (if Phase 4 was executed)