Oleksandr Bezdieniezhnykh 8e2ecf50fd Initial commit
Made-with: Cursor
2026-03-26 00:20:30 +02:00

# Solution Draft
## Product Solution Description
A three-tier semantic detection system for identifying concealed/camouflaged positions from reconnaissance UAV aerial imagery, running on Jetson Orin Nano Super alongside the existing YOLO detection pipeline.
```
┌─────────────────────────────────────────────────────────────────────────┐
│ JETSON ORIN NANO SUPER │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ ViewPro │───▶│ Tier 1 │───▶│ Tier 2 │───▶│ Tier 3 │ │
│ │ A40 │ │ YOLO26-Seg │ │ Spatial │ │ VLM │ │
│ │ Camera │ │ + YOLOE-26 │ │ Reasoning │ │ UAV-VL │ │
│ │ │ │ ≤100ms │ │ + CNN │ │ -R1 │ │
│ │ │ │ │ │ ≤200ms │ │ ≤5s │ │
│ └────▲─────┘ └──────────────┘ └──────────────┘ └──────────┘ │
│ │ │
│ ┌────┴─────┐ ┌──────────────┐ │
│ │ Gimbal │◀───│ Scan │ │
│ │ Control │ │ Controller │ │
│ │ Module │ │ (L1/L2) │ │
│ └──────────┘ └──────────────┘ │
│ │
│ ┌──────────────────────────────┐ │
│ │ Existing YOLO Detection │ (separate service, provides context) │
│ │ Cython + TRT │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
```
The system operates in two scan levels:
- **Level 1 (Wide Sweep)**: Camera at medium zoom, left-right swing. YOLOE-26 text/visual prompts detect POIs in real-time. Existing YOLO provides scene context.
- **Level 2 (Detailed Scan)**: Camera zooms into POI. Spatial reasoning traces footpaths, finds endpoints. CNN classifies potential hideouts. Optional VLM provides deep analysis for ambiguous cases.
Three submodules: (1) Semantic Detection AI, (2) Camera Gimbal Control, (3) Integration with existing detections service.
## Existing/Competitor Solutions Analysis
No direct commercial or open-source competitor exists for this specific combination of requirements (concealed position detection from UAV with edge inference). Related work:
| Solution | Approach | Limitations for This Use Case |
|----------|----------|-------------------------------|
| Standard YOLO object detection | Bounding box classification of known object types | Cannot detect camouflaged/concealed targets without explicit visual features |
| CAMOUFLAGE-Net (YOLOv7-based) | Attention mechanisms + ELAN for camouflage detection | Designed for ground-level imagery, not aerial; academic datasets only |
| Open-Vocabulary Camouflaged Object Segmentation | VLM + SAM cascaded segmentation | Too slow for real-time edge inference; requires cloud GPU |
| UAV-YOLO12 | Multi-scale road segmentation from UAV imagery | Roads only, no concealment reasoning |
| FootpathSeg (DINO-MC + UNet) | Footpath segmentation with self-supervised learning | Pedestrian context, not military aerial; no path-following logic |
| YOLO-World / YOLOE | Open-vocabulary detection | Closest fit — YOLOE-26 is our primary Tier 1 mechanism |
**Key insight**: No existing solution combines footpath detection + path tracing + endpoint analysis + concealment classification in a single pipeline. This requires a custom multi-stage system.
## Architecture
### Component 1: Tier 1 — Real-Time Detection (YOLOE-26 + YOLO26-Seg)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **YOLOE-26 text/visual prompts (recommended)** | yoloe-26s-seg.pt, Ultralytics, TensorRT | Zero-shot detection from day 1. Text prompts: "footpath", "branch pile", "dark entrance". Visual prompts: reference images of hideouts. Zero overhead when re-parameterized. NMS-free. | Open-vocabulary accuracy lower than custom-trained model. Text prompts may not capture all concealment patterns. | Ultralytics ≥8.4, TensorRT, JetPack 6.2 | Model weights stored locally, no cloud | Free (open source) | **Best fit for bootstrapping. Immediate capability.** |
| YOLO26-Seg custom-trained | yolo26s-seg.pt fine-tuned on custom dataset | Higher accuracy for known classes after training. Instance segmentation masks for footpaths. | Requires annotated training data (1500+ images/class). No zero-shot capability. | Custom dataset, GPU for training | Same | Free (open source) + annotation labor | **Best fit for production after data collection.** |
| UNet + MambaOut | PyTorch, TensorRT | Best published accuracy for trail segmentation from aerial photos. | Separate model, additional memory. No built-in detection head. | Custom integration | Same | Free (open source) | Backup option if YOLO26-Seg underperforms on trails |
**Recommended approach**: Start with YOLOE-26 text/visual prompts. Parallel annotation effort builds custom dataset. Transition to fine-tuned YOLO26-Seg once data is sufficient. YOLOE-26 zero-shot capability provides immediate usability.
**YOLOE-26 configuration for this project**:
Text prompts for Level 1 detection:
- `"footpath"`, `"trail"`, `"path in snow"`, `"road"`, `"track"`
- `"pile of branches"`, `"tree branches"`, `"camouflage netting"`
- `"dark entrance"`, `"hole"`, `"dugout"`, `"dark opening"`
- `"tree row"`, `"tree line"`, `"group of trees"`
- `"clearing"`, `"open area near forest"`
Visual prompts: Annotated reference images (semantic01-04.png) as visual prompt sources. Crop ROIs around known hideouts, use as reference for SAVPE visual-prompted detection.
API usage:
```python
from ultralytics import YOLOE

model = YOLOE("yoloe-26s-seg.pt")
classes = ["footpath", "branch pile", "dark entrance", "tree row", "road"]
model.set_classes(classes, model.get_text_pe(classes))  # text prompts -> class embeddings
results = model.predict(frame, conf=0.15)  # low threshold: favor recall over precision
```
Visual prompt usage:
```python
import numpy as np

# x1, y1, x2, y2: pixel bbox of a known hideout in the reference image
results = model.predict(
    frame,
    refer_image="reference_hideout.jpg",
    visual_prompts={"bboxes": np.array([[x1, y1, x2, y2]]), "cls": np.array([0])},
)
```
### Component 2: Tier 2 — Spatial Reasoning & CNN Confirmation
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **Path tracing + CNN classifier (recommended)** | OpenCV skeletonization, MobileNetV3-Small TensorRT | Fast (<200ms total). Path tracing from segmentation masks. Binary classifier: "concealed position yes/no" on ROI crop. Well-understood algorithms. | Requires custom annotated data for CNN classifier. Path tracing quality depends on segmentation quality. | OpenCV, scikit-image, PyTorch → TRT | Offline inference | Free + annotation labor | **Best fit. Modular, fast, interpretable.** |
| Heuristic rules only | OpenCV, NumPy | No training data needed. Rule-based: "if footpath ends at dark mass → flag." | Brittle. Hard to tune. Cannot generalize across seasons/terrain. | None | Offline | Free | Baseline/fallback for initial version |
| End-to-end custom model | PyTorch, TensorRT | Single model handles everything. | Requires massive training data. Black box. Hard to debug. | Large annotated dataset | Offline | Free + GPU time | Not recommended for initial release |
**Path tracing algorithm**:
1. Take footpath segmentation mask from Tier 1
2. Skeletonize using Zhang-Suen algorithm (`skimage.morphology.skeletonize`)
3. Detect endpoints using hit-miss morphological operations (8 kernel patterns)
4. Detect junctions using branch-point kernels
5. Trace path segments between junctions/endpoints
6. For each endpoint: extract 128×128 ROI crop centered on endpoint from original image
7. Feed ROI crop to MobileNetV3-Small binary classifier: "concealed structure" vs "natural terminus"
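Steps 2-4 can be sketched with `skimage`/`scipy` (Zhang's thinning is the 2-D default of `skimage.morphology.skeletonize`); the neighbor-count convolution below is a compact equivalent of the eight endpoint hit-miss kernels:

```python
import numpy as np
from scipy.ndimage import convolve
from skimage.morphology import skeletonize

def skeleton_topology(mask: np.ndarray):
    """Return (skeleton, endpoints, junctions) for a binary footpath mask."""
    skel = skeletonize(mask.astype(bool))  # Zhang's thinning (2-D default)
    # Count 8-connected skeleton neighbors of every pixel.
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],
                       [1, 1, 1]])
    neighbors = convolve(skel.astype(np.uint8), kernel, mode="constant")
    endpoints = np.argwhere(skel & (neighbors == 1))  # one neighbor -> path end
    junctions = np.argwhere(skel & (neighbors >= 3))  # three+ neighbors -> branch
    return skel, endpoints, junctions
```

Each endpoint row is the (y, x) center for the 128×128 ROI crop in step 6.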
**Freshness assessment approach**:
Visual features for fresh vs stale classification (binary classifier on path ROI):
- Edge sharpness (Laplacian variance on path boundary)
- Contrast ratio (path intensity vs surrounding terrain)
- Fill ratio (percentage of path area with snow/vegetation coverage)
- Path width consistency (fresh paths have more uniform width)
Implementation: Extract these features per path segment, feed to lightweight classifier (Random Forest or small CNN). Initial version can use hand-tuned thresholds, then train with annotated data.
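A minimal sketch of the feature extractor, using `scipy`/`skimage` equivalents of the OpenCV calls; the `cover_mask` input (snow/vegetation coverage inside the ROI) and the exact feature definitions are assumptions for the hand-tuned first version:

```python
import numpy as np
from scipy.ndimage import laplace
from skimage.morphology import medial_axis

def freshness_features(gray, path_mask, cover_mask):
    """Freshness cues for one path ROI: grayscale image, binary path mask,
    and a snow/vegetation coverage mask (all the same shape)."""
    on = path_mask > 0
    # Edge sharpness: Laplacian variance over the path region (crisp edges = fresh).
    edge_sharpness = float(laplace(gray.astype(float))[on].var())
    # Contrast ratio: mean path intensity vs mean surrounding terrain.
    contrast_ratio = float(gray[on].mean()) / max(float(gray[~on].mean()), 1e-6)
    # Fill ratio: fraction of path area already re-covered by snow/vegetation.
    fill_ratio = float((on & (cover_mask > 0)).sum()) / max(int(on.sum()), 1)
    # Width consistency: coefficient of variation of widths along the medial axis.
    skel, dist = medial_axis(on, return_distance=True)
    widths = 2.0 * dist[skel]
    width_cv = float(widths.std() / max(widths.mean(), 1e-6)) if widths.size else 0.0
    return {"edge_sharpness": edge_sharpness, "contrast_ratio": contrast_ratio,
            "fill_ratio": fill_ratio, "width_cv": width_cv}
```

The returned dict feeds either the hand-tuned thresholds or, later, the Random Forest / small CNN.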
### Component 3: Tier 3 — VLM Deep Analysis (Optional)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **UAV-VL-R1 (recommended)** | Qwen2-VL-2B fine-tuned, vLLM, INT8 quantization | Purpose-built for aerial reasoning. 48% better than generic Qwen2-VL-2B on UAV tasks. 2.5GB INT8. Open source. | 3-5s per analysis. Competes for GPU memory with YOLO (sequential scheduling). | vLLM or TRT-LLM, GPTQ-INT8 weights | Local inference only | Free (open source) | **Best fit for aerial VLM tasks.** |
| SmolVLM2-500M | HuggingFace Transformers, ONNX | Smallest memory (1.8GB). Fastest inference (~1-2s estimated). | Weakest reasoning. May lack nuance for concealment analysis. | ONNX Runtime or TRT | Local only | Free | Fallback if memory is tight |
| Moondream 2B | moondream API, PyTorch | Built-in detect()/point() APIs. Strong grounded detection (refcoco 91.1). | Not aerial-specialized. Same size class as UAV-VL-R1 but less relevant. | PyTorch or ONNX | Local only | Free | Alternative if UAV-VL-R1 underperforms |
| No VLM | N/A | Simpler system. Less memory. No latency for Tier 3. | No zero-shot capability for novel patterns. No operator explanations. | None | N/A | Free | Viable for production if Tier 1+2 accuracy is sufficient |
**VLM prompting strategy for concealment analysis**:
```
Analyze this aerial UAV image crop. A footpath was detected leading to this area.
Is there a concealed military position visible? Look for:
- Dark entrances or openings
- Piles of cut tree branches used as camouflage
- Dugout structures
- Signs of recent human activity
Answer: YES or NO, then one sentence explaining why.
```
**Integration**: The VLM runs as a separate Python process and communicates with the main Cython pipeline via a Unix domain socket or shared memory. It is triggered only when Tier 2 CNN confidence falls between 30% and 70% (ambiguous cases).
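A minimal sketch of the socket round-trip; the socket path and JSON fields (`poi_id`, `crop_path`, `answer`, `reason`) are illustrative assumptions, not a fixed schema, and a real server would wrap UAV-VL-R1 inference instead of the stub verdict:

```python
import json
import os
import socket
import tempfile

SOCK_PATH = os.path.join(tempfile.gettempdir(), "vlm_tier3.sock")  # hypothetical path

def serve_one_request():
    """Stand-in for the VLM process: answer a single Tier-2 escalation, then exit."""
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCK_PATH)
    srv.listen(1)
    conn, _ = srv.accept()
    req = json.loads(conn.recv(65536).decode())
    # A real server would run UAV-VL-R1 on req["crop_path"] here.
    conn.sendall(json.dumps({"poi_id": req["poi_id"], "answer": "NO",
                             "reason": "stub verdict"}).encode())
    conn.close()
    srv.close()

def escalate_to_vlm(poi_id: int, crop_path: str, timeout: float = 5.0) -> dict:
    """Called from the main pipeline when Tier-2 confidence is ambiguous."""
    cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    cli.settimeout(timeout)  # matches the Tier-3 <=5 s budget
    cli.connect(SOCK_PATH)
    cli.sendall(json.dumps({"poi_id": poi_id, "crop_path": crop_path}).encode())
    resp = json.loads(cli.recv(65536).decode())
    cli.close()
    return resp
```

The `settimeout` call turns a stalled VLM into a recoverable exception rather than a blocked pipeline.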
### Component 4: Camera Gimbal Control
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **Custom ViewLink serial driver + PID controller (recommended)** | Python serial, ViewLink Protocol V3.3.3, PID library | Direct hardware control. Closed-loop tracking with detection feedback. Low latency (<10ms serial command). | Must implement ViewLink protocol from spec. PID tuning needed. | ViewPro A40 documentation, pyserial | Physical hardware access only | Free + hardware | **Best fit. Direct control, no middleware.** |
| ArduPilot integration via MAVLink | ArduPilot, MAVLink, pymavlink | Battle-tested gimbal driver. Well-documented. | Requires ArduPilot flight controller. Additional latency through FC. May conflict with mission planner. | Pixhawk or similar FC running ArduPilot 4.5+ | MAVLink protocol | Pixhawk hardware ($50-200) | Alternative if ArduPilot is already used for flight control |
**Scan controller state machine**:
```
┌─────────────────┐
│ LEVEL 1 │
│ Wide Sweep │◀──────────────────────────┐
│ Medium Zoom │ │
└────────┬────────┘ │
│ POI detected │ Analysis complete
▼ │ or timeout (5s)
┌─────────────────┐ │
│ ZOOM │ │
│ TRANSITION │ │
│ 1-2 seconds │ │
└────────┬────────┘ │
│ Zoom complete │
▼ │
┌─────────────────┐ ┌──────────────┐ │
│ LEVEL 2 │────▶│ PATH │───────┤
│ Detailed Scan │ │ FOLLOWING │ │
│ High Zoom │ │ Pan along │ │
└────────┬────────┘ │ detected │ │
│ │ path │ │
│ └──────────────┘ │
│ Endpoint found │
▼ │
┌─────────────────┐ │
│ ENDPOINT │ │
│ ANALYSIS │────────────────────────────┘
│ Hold + VLM │
└─────────────────┘
```
**Level 1 sweep pattern**: Sinusoidal yaw oscillation centered on flight heading. Amplitude: ±30° (configurable). Period: matched to ground speed so adjacent sweeps overlap by 20%. Pitch: slightly downward (configurable based on altitude and desired ground coverage).
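The sweep geometry reduces to two small functions; the default amplitude/period values and the speed floor below are illustrative assumptions:

```python
import math

def sweep_yaw(t: float, amplitude_deg: float = 30.0, period_s: float = 8.0) -> float:
    """Yaw offset from the flight heading at time t during a Level 1 sweep."""
    return amplitude_deg * math.sin(2.0 * math.pi * t / period_s)

def sweep_period(swath_width_m: float, ground_speed_mps: float,
                 overlap: float = 0.2) -> float:
    """Period chosen so each oscillation advances by (1 - overlap) swath widths."""
    return (1.0 - overlap) * swath_width_m / max(ground_speed_mps, 0.1)
```

For example, a 100 m swath at 10 m/s ground speed with 20% overlap gives an 8 s sweep period.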
**Path-following control loop**:
1. Tier 1 outputs footpath segmentation mask
2. Extract path centerline direction (from skeleton)
3. Compute error: path center vs frame center
4. PID controller adjusts gimbal yaw/pitch to minimize error
5. Update rate: tied to detection frame rate (10-30 FPS)
6. When path endpoint reached or path leaves frame: stop following, analyze endpoint
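The loop above reduces to a per-axis PID on the centering error; the gains, the rate-command interface, and the ±30°/s output clamp are placeholder assumptions to be tuned against the real gimbal:

```python
class GimbalPID:
    """Minimal PID for one gimbal axis; gains here are illustrative, not tuned."""

    def __init__(self, kp: float, ki: float, kd: float, out_limit_dps: float = 30.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_limit = out_limit_dps
        self.integral = 0.0
        self.prev_err = None

    def update(self, error: float, dt: float) -> float:
        """error: path center minus frame center (deg); returns a rate command."""
        self.integral += error * dt
        deriv = 0.0 if self.prev_err is None else (error - self.prev_err) / dt
        self.prev_err = error
        out = self.kp * error + self.ki * self.integral + self.kd * deriv
        # Clamp to the gimbal's slew-rate limit.
        return max(-self.out_limit, min(self.out_limit, out))
```

One `GimbalPID` instance per axis (yaw, pitch), stepped at the detection frame rate.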
**POI queue management**: Priority queue sorted by: (1) detection confidence, (2) proximity to current camera position (minimize slew time), (3) recency. Max queue size: 20 POIs. Older/lower-confidence entries expire after 30 seconds.
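A sketch of the queue with `heapq`; the scoring weights and eviction policy are illustrative assumptions (confidence dominates, slew adds a small penalty, insertion order breaks ties, and TTL expiry is applied lazily on pop):

```python
import heapq
import itertools
import time

class POIQueue:
    """Bounded priority queue of Level-2 candidates; weights are placeholders."""
    MAX_SIZE = 20
    TTL_S = 30.0
    SLEW_WEIGHT = 0.01  # score penalty per degree of required camera slew

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-break: earlier insertions first

    def push(self, confidence, yaw, pitch, cam_yaw, cam_pitch):
        slew = abs(yaw - cam_yaw) + abs(pitch - cam_pitch)
        # Lower score is served first: high confidence and short slew win.
        score = -confidence + self.SLEW_WEIGHT * slew
        heapq.heappush(self._heap, (score, next(self._counter),
                                    time.monotonic(), confidence, yaw, pitch))
        if len(self._heap) > self.MAX_SIZE:
            self._heap.remove(max(self._heap))  # evict worst-scoring entry
            heapq.heapify(self._heap)

    def pop(self):
        """Next POI as (confidence, yaw, pitch); entries older than TTL are dropped."""
        now = time.monotonic()
        while self._heap:
            _, _, created, conf, yaw, pitch = heapq.heappop(self._heap)
            if now - created <= self.TTL_S:
                return conf, yaw, pitch
        return None
```

The scan controller calls `pop()` whenever it returns to Level 1 and is ready for the next zoom target.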
### Component 5: Integration with Existing Detections Service
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **Extend existing Cython codebase + separate VLM process (recommended)** | Cython, TensorRT, Unix socket IPC | Maintains existing architecture. YOLO26/YOLOE-26 fits naturally into TRT pipeline. VLM isolated in separate process. | VLM IPC adds small latency. Two processes to manage. | Cython extensions, process management | Process isolation | Free | **Best fit. Minimal disruption to existing system.** |
| Microservice architecture | FastAPI, Docker, gRPC | Clean separation. Independent scaling. | Overhead for single Jetson. Over-engineered for edge. | Docker, networking | Service mesh | Free | Over-engineered for single device |
**Integration architecture**:
```
┌─────────────────────────────────────────────────────────┐
│ Main Process (Cython + TRT) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Existing │ │ YOLOE-26 │ │ Scan │ │
│ │ YOLO Det │ │ Semantic │ │ Controller │ │
│ │ (TRT Engine) │ │ (TRT Engine) │ │ + Gimbal │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └─────────┬────────┘ │ │
│ ▼ │ │
│ ┌──────────────┐ │ │
│ │ Spatial │◀──────────────────────┘ │
│ │ Reasoning │ │
│ │ + CNN Class. │ │
│ └──────┬───────┘ │
│ │ ambiguous cases │
│ ▼ │
│ ┌──────────────┐ │
│ │ IPC (Unix │ │
│ │ Socket) │ │
│ └──────┬───────┘ │
└────────────────┼─────────────────────────────────────────┘
┌────────────────────────────┐
│ VLM Process (Python) │
│ UAV-VL-R1 (vLLM/TRT-LLM) │
│ INT8 quantized │
└────────────────────────────┘
```
**Data flow**:
1. Camera frame → existing YOLO detection (scene context: vehicles, debris, structures)
2. Same frame → YOLOE-26 semantic detection (footpaths, branch piles, entrances)
3. YOLO context + YOLOE-26 detections → Spatial reasoning module
4. Spatial reasoning: path tracing, endpoint analysis, CNN classification
5. High-confidence detections → operator notification (bounding box + coordinates)
6. Ambiguous detections → VLM process via IPC → response → operator notification
7. All detections → scan controller → gimbal commands (Level 1/2 transitions, path following)
**GPU scheduling**: YOLO and YOLOE-26 can share a single TRT engine (YOLOE-26 re-parameterized to standard YOLO26 weights for fixed classes). VLM inference is sequential: pause YOLO frames, run VLM, resume YOLO. VLM analysis typically lasts 3-5s during which the camera holds position (endpoint analysis phase).
## Training & Data Strategy
### Phase 1: Zero-shot (Week 1-2)
- Deploy YOLOE-26 with text/visual prompts
- Use semantic01-04.png as visual prompt references
- Tune text prompt class names and confidence thresholds
- Collect false positive/negative data for annotation
### Phase 2: Annotation & Fine-tuning (Week 3-8)
- Annotate collected data using existing annotation tooling
- Priority order: footpaths (segmentation masks) → branch piles (bboxes) → entrances (bboxes) → roads (segmentation) → trees (bboxes)
- Use SAM (Segment Anything Model) for semi-automated segmentation mask generation
- Target: 500+ images per class by week 6, 1500+ by week 8
- Fine-tune YOLO26-Seg on custom dataset using linear probing first, then full fine-tuning
### Phase 3: Custom CNN classifier (Week 6-10)
- Train MobileNetV3-Small binary classifier on ROI crops from endpoint analysis
- Positive: annotated concealed positions. Negative: natural path termini, random terrain
- Target: 300+ positive, 1000+ negative samples
- Export to TensorRT FP16
### Phase 4: VLM integration (Week 8-12)
- Deploy UAV-VL-R1 INT8 as separate process
- Tune prompting strategy on collected ambiguous cases
- Optional: fine-tune UAV-VL-R1 on domain-specific data if base accuracy insufficient
### Phase 5: Seasonal expansion (Month 3+)
- Winter data → spring/summer annotation campaigns
- Re-train all models with multi-season data
- Expect accuracy degradation in summer (vegetation occlusion), mitigate with larger dataset
## Testing Strategy
### Integration / Functional Tests
- YOLOE-26 text prompt detection on reference images (semantic01-04.png) — verify footpath and hideout regions are flagged
- Path tracing on synthetic segmentation masks — verify skeleton, endpoint, junction detection
- CNN classifier on known positive/negative ROI crops — verify binary output correctness
- Gimbal control loop on simulated camera feed — verify PID convergence and path-following accuracy
- VLM IPC round-trip — verify request/response latency and correctness
- Full pipeline test: frame → YOLOE detection → path tracing → CNN → VLM → operator notification
- Scan controller state machine: verify Level 1 → Level 2 transitions, timeout, return to Level 1
### Non-Functional Tests
- Tier 1 latency: verify end-to-end YOLOE-26 inference on Jetson Orin Nano Super stays ≤100 ms
- Tier 2 latency: path tracing + CNN classification ≤200 ms
- Tier 3 latency: VLM IPC round-trip + inference ≤5 s
- Memory profiling: total RAM usage under YOLO + YOLOE + CNN + VLM concurrent loading
- Thermal stress test: continuous inference for 30+ minutes without thermal throttling
- False positive rate measurement on known clean terrain
- False negative rate measurement on annotated concealed positions
- Gimbal response test: measure physical camera movement latency vs command
## References
- YOLO26 docs: https://docs.ultralytics.com/models/yolo26/
- YOLOE docs: https://docs.ultralytics.com/models/yoloe/
- YOLOE-26 paper: https://arxiv.org/abs/2602.00168
- YOLO26 Jetson benchmarks: https://docs.ultralytics.com/guides/nvidia-jetson
- UAV-VL-R1: https://arxiv.org/pdf/2508.11196, https://github.com/Leke-G/UAV-VL-R1
- SmolVLM2: https://huggingface.co/blog/smolervlm
- Moondream: https://moondream.ai/blog/introducing-moondream-0-5b
- ViewPro Protocol: https://www.viewprotech.com/index.php?ac=article&at=read&did=510
- ArduPilot ViewPro: https://ardupilot.org/copter/docs/common-viewpro-gimbal.html
- FootpathSeg: https://github.com/WennyXY/FootpathSeg
- UAV-YOLO12: https://www.mdpi.com/2072-4292/17/9/1539
- Trail segmentation (UNet+MambaOut): https://arxiv.org/pdf/2504.12121
- servopilot PID: https://pypi.org/project/servopilot/
- Camouflage detection (OVCOS): https://arxiv.org/html/2506.19300v1
- Jetson AI Lab benchmarks: https://www.jetson-ai-lab.com/tutorials/genai-benchmarking/
- Cosmos-Reason2-2B benchmarks: Embedl blog (Feb 2026)
## Related Artifacts
- AC assessment: `_docs/00_research/00_ac_assessment.md`
- Tech stack evaluation: `_docs/01_solution/tech_stack.md` (if Phase 3 was executed)
- Security analysis: `_docs/01_solution/security_analysis.md` (if Phase 4 was executed)