mirror of
https://github.com/azaion/detections-semantic.git
synced 2026-04-22 22:26:39 +00:00
8e2ecf50fd
# Acceptance Criteria Assessment (Revised)
## Acceptance Criteria
| Criterion | Our Values | Researched Values | Cost/Timeline Impact | Status |
|-----------|-----------|-------------------|---------------------|--------|
| Tier 1 latency | ≤100ms per frame | YOLO26n TensorRT FP16 on Jetson Orin Nano Super: ~7ms at 640px. YOLOE-26 adds zero inference overhead when re-parameterized. Combined detection + segmentation well under 100ms. | Low risk | **Confirmed** |
| Tier 2 detailed analysis | ≤2 seconds (originally VLM) | See VLM assessment below. Recommend tiered approach: Tier 2 = custom CNN classifier (≤200ms), Tier 3 = optional VLM (3-5s). | Architecture change reduces risk | **Modified: split into Tier 2 (CNN ≤200ms) + Tier 3 (VLM ≤5s optional)** |
| YOLO new classes P≥80%, R≥80% | P≥80%, R≥80% | Use YOLO26-Seg for footpaths/roads (instance segmentation), YOLO26 detection for compact objects. YOLOE-26 open-vocabulary as bootstrap. | Reduced training data dependency initially via YOLOE-26 | **Modified: YOLO26-Seg for linear features, YOLOE-26 text prompts for bootstrapping** |
| Semantic recall ≥60% concealed positions | ≥60% recall | Reasonable for initial release. YOLOE-26 zero-shot + custom-trained YOLO26 models should reach this. | Medium risk | **Confirmed** |
| Semantic precision ≥20% | ≥20% initial | Reasonable. YOLOE-26 text prompts will start noisy, custom training improves over time. | Low risk | **Confirmed** |
| Footpath recall ≥70% | ≥70% | Confirmed achievable. UAV-YOLO12: F1=0.825. YOLO26-Seg NMS-free: likely competitive. | Low risk, seasonal caveat | **Confirmed** |
| Level 1→2 transition | ≤1 second | Full 40x optical zoom traversal takes 1-3s on the ViewPro A40; a partial medium→high transition is ~1-2s. | Physical constraint | **Modified: ≤2 seconds** |
| Gimbal command latency ≤500ms | ≤500ms | ViewPro A40 UART 115200 + physical response: well under 500ms. | Low risk | **Confirmed** |
| RAM ≤6GB for semantic + VLM | ≤6GB | YOLO26 models: ~0.1-0.5GB. YOLOE-26: same as YOLO26 (zero overhead). Custom CNN: ~0.1GB. VLM (optional): SmolVLM2-500M ~1.8GB or UAV-VL-R1 INT8 ~2.5GB. Total worst case: ~4GB. | Low risk — well within budget | **Confirmed** |
| Dataset 1.5 months × 5h/day | ~225 hours | YOLOE-26 text prompts reduce initial data dependency. Recommend SAM-assisted annotation. | Reduced risk via YOLOE-26 bootstrapping | **Modified: phased data collection strategy** |
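The Tier 1 latency criterion above is straightforward to verify on-device. A minimal measurement harness for the median per-frame latency of any inference callable is sketched below; the model name and the Ultralytics export call in the trailing comment are assumptions to be checked against the YOLO26 docs.

```python
import time
from statistics import median

def per_frame_latency_ms(infer, frames, warmup: int = 3) -> float:
    """Median per-frame latency in milliseconds for any inference callable."""
    for frame in frames[:warmup]:
        infer(frame)                      # warm-up: exclude engine spin-up
    samples = []
    for frame in frames:
        t0 = time.perf_counter()
        infer(frame)
        samples.append((time.perf_counter() - t0) * 1e3)
    return median(samples)

# On the Jetson, `infer` would wrap a TensorRT FP16 engine exported via
# the Ultralytics API, e.g. (model name assumed):
#   model = YOLO("yolo26n.pt"); model.export(format="engine", half=True)
```

Using the median rather than the mean keeps one-off scheduler hiccups from skewing the ≤100ms check.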
## VLM Options Assessment
| Model | Params | Memory (quantized) | Estimated Speed (Jetson Orin Nano Super) | Strengths | Weaknesses | Fit for This Task |
|-------|--------|-------------------|----------------------------------------|-----------|------------|-------------------|
| **UAV-VL-R1** | 2B | 2.5 GB (INT8) | ~10-15 tok/s (estimated from Qwen2-VL-2B base) | **Specifically trained for aerial reasoning.** 48% better than Qwen2-VL-2B on UAV tasks. Supports object counting, spatial reasoning. Open source. | Largest memory footprint of candidates. ~3-5s per analysis with image. | **Best fit for aerial scene understanding.** Use as Tier 3 for ambiguous cases. |
| **SmolVLM2-500M** | 500M | 1.8 GB (FP16) | ~30-50 tok/s (estimated, 4x smaller than 2B) | Tiny memory footprint. Fast inference. ONNX export available. | Weakest reasoning capability. Video-MME: 42.2 (mediocre). May lack nuance for concealment analysis. | **Marginal.** Fast but may be too weak for meaningful contextual reasoning on concealed positions. |
| **SmolVLM2-256M** | 256M | <1 GB | ~50-80 tok/s (estimated) | Smallest available VLM. Near-instant inference. | Very limited reasoning. Outperforms Idefics-80B on simple benchmarks but not on complex spatial reasoning. | **Not recommended.** Likely too weak for this task. |
| **Moondream 0.5B** | 500M | 816 MiB (INT4) | ~30-50 tok/s (estimated) | Built-in detect() and point() APIs. Tiny. | No specific aerial training. Detection benchmarks unverified for small model. | **Interesting for pointing/localization.** Could complement YOLO by pointing at "suspicious area at end of path." |
| **Moondream 2B** | 2B | ~2 GB (INT4) | ~10-15 tok/s (estimated) | Strong grounded detection (refcoco: 91.1). detect() and point() APIs. | No aerial-specific training. Similar size to UAV-VL-R1 but less specialized. | **Good general VLM, but UAV-VL-R1 is better for aerial tasks.** |
| **Cosmos-Reason2-2B** | 2B | ~2 GB (W4A16) | 4.7 tok/s (benchmarked) | NVIDIA-optimized for Jetson. Reasoning focused. | Slowest of candidates. Not aerial-specific. | **Slow. Not recommended over UAV-VL-R1.** |
### VLM Verdict
**Is a VLM helpful at all for this task?**
**Yes, but as a supplementary layer, not the primary mechanism.** The core detection pipeline should be:
- **Tier 1 (real-time)**: YOLO26-Seg + YOLOE-26 text prompts — detect footpaths, roads, branch piles, entrances, trees
- **Tier 2 (fast confirmation)**: Custom lightweight CNN classifier — binary "concealed position: yes/no" on cropped ROI (≤200ms)
- **Tier 3 (optional deep analysis)**: Small VLM — contextual reasoning on ambiguous cases, operator-facing descriptions
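The hand-off between tiers can be sketched as a small routing function. The 0.30-0.70 ambiguity band mirrors the gating described in this document; the threshold values are placeholders to be calibrated against real Tier 2 data.

```python
def route_detection(tier2_confidence: float,
                    low: float = 0.30, high: float = 0.70) -> str:
    """Route a Tier 1 detection based on the Tier 2 classifier score."""
    if tier2_confidence >= high:
        return "alert"       # confident positive: flag to the operator
    if tier2_confidence <= low:
        return "discard"     # confident negative: drop the ROI
    return "tier3_vlm"       # ambiguous: escalate to the optional VLM
```

Keeping the routing logic in one pure function makes the thresholds easy to tune and test without touching either model.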
VLM adds value for:
1. **Zero-shot bootstrapping** — before enough custom training data exists, VLM can reason about "is this a hidden position?"
2. **Ambiguous cases** — when Tier 2 CNN confidence falls between 30% and 70%, the VLM provides a second opinion
3. **Operator trust** — VLM can generate natural-language explanations ("footpath terminates at dark mass consistent with concealed entrance")
4. **Novel patterns** — VLM generalizes to new types of concealment without retraining
**Recommended VLM: UAV-VL-R1** (2.5GB INT8, aerial-specialized, open source). Fallback: SmolVLM2-500M if memory is too tight.
## YOLO26 Key Capabilities for This Project
| Feature | Relevance |
|---------|-----------|
| **NMS-free end-to-end inference** | Simpler deployment on Jetson, more predictable latency, no post-processing tuning |
| **Instance segmentation (YOLO26-Seg)** | Precise pixel-level masks for footpaths and roads — far better than bounding boxes for linear features |
| **YOLOE-26 open-vocabulary** | Text-prompted detection: "footpath", "pile of branches", "dark entrance" — zero-shot, no training needed initially |
| **YOLOE-26 visual prompts** | Use reference images of hideouts to find visually similar structures — directly applicable |
| **YOLOE-26 prompt-free mode** | 1200+ built-in categories for autonomous discovery — useful for Level 1 wide scan |
| **43% faster CPU inference** | Better edge performance than previous YOLO versions |
| **MuSGD optimizer** | Better convergence for custom training with small datasets |
| **Improved small object detection** | ProgLoss + STAL loss — directly relevant for detecting small entrances and path features |
**YOLOE-26 is a potential game-changer for this project.** It enables:
- Immediate zero-shot detection with text prompts before any custom training
- Visual prompt mode: provide example images of hideouts, system finds similar patterns
- Gradual transition to custom-trained YOLO26 as annotated data accumulates
- No inference overhead vs standard YOLO26 when re-parameterized for fixed classes
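As a sketch of the bootstrap path, assuming the current Ultralytics YOLOE interface (`set_classes` / `get_text_pe`) carries over to YOLOE-26: the weights filename and the prompt strings below are illustrative assumptions, not confirmed names.

```python
# Zero-shot bootstrapping with text prompts, before any custom training.
TEXT_PROMPTS = [
    "footpath",
    "dirt road",
    "pile of branches",
    "dark entrance",
    "camouflage netting",
]

def load_prompted_model(weights: str = "yoloe-26s-seg.pt"):
    """Load a YOLOE model restricted to the prompt vocabulary."""
    from ultralytics import YOLOE  # deferred: heavy optional dependency
    model = YOLOE(weights)
    # set_classes() bakes the text embeddings into fixed classes; this
    # re-parameterization is what gives zero overhead vs. plain YOLO26.
    model.set_classes(TEXT_PROMPTS, model.get_text_pe(TEXT_PROMPTS))
    return model

# On-device usage (needs weights + GPU):
#   model = load_prompted_model()
#   results = model.predict(frame, imgsz=640)
```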
## Restrictions Assessment
| Restriction | Our Values | Researched Values | Cost/Timeline Impact | Status |
|-------------|-----------|-------------------|---------------------|--------|
| Jetson Orin Nano Super 67 TOPS, 8GB | 67 TOPS INT8, 8GB LPDDR5, 102 GB/s | Memory budget: YOLO26 (~0.3GB) + YOLOE-26 (zero overhead) + CNN (~0.1GB) + VLM (~2.5GB) = ~3GB, leaving ~5GB for the OS, CUDA runtime, and buffers. VLM and YOLO should run sequentially to avoid contention. | Manageable with scheduling | **Confirmed, add sequential inference scheduling** |
| ViewPro A40 zoom transition | Not specified | 40x optical zoom full traversal: 1-3 seconds. Partial zoom (medium→high): ~1-2s. | Physical constraint, must account for in scan timing | **Add: zoom transition time 1-2s as physical constraint** |
| All seasons | No seasonal restriction | Phased rollout: winter first (highest contrast), then spring/summer/autumn. | Multi-season dataset collection is long-term effort | **Add: phased seasonal rollout** |
| Cython + TensorRT | Must extend existing | YOLO26 deploys natively to TensorRT. YOLOE-26 also TensorRT compatible. VLM may need separate process (vLLM or TensorRT-LLM). | Low complexity for YOLO26. Medium for VLM. | **Modified: VLM as separate process with IPC** |
| VLM local only | No cloud | UAV-VL-R1 INT8: 2.5GB, open source, fits on Jetson. SmolVLM2-500M: 1.8GB, ONNX available. | Feasible | **Confirmed** |
## Key Findings (Revised)
1. **YOLOE-26 open-vocabulary is the biggest opportunity.** Text-prompted and visual-prompted detection enables immediate zero-shot capability without custom training. Transition to custom-trained YOLO26 as data accumulates.
2. **Three-tier architecture is more realistic than two-tier.** Tier 1: YOLO26/YOLOE-26 real-time (≤100ms). Tier 2: Custom CNN confirmation (≤200ms). Tier 3: VLM deep analysis (≤5s, optional).
3. **UAV-VL-R1 is the best VLM candidate.** Purpose-built for aerial reasoning, 48% better than generic Qwen2-VL-2B, fits on Jetson at 2.5GB INT8. SmolVLM2-500M is a lighter fallback.
4. **VLM is valuable but supplementary.** Zero-shot bootstrapping, ambiguous case analysis, and operator-facing explanations. Not the primary detection mechanism.
5. **YOLO26 NMS-free design simplifies edge deployment.** No NMS tuning, more predictable latency, native TensorRT support. Instance segmentation mode ideal for footpaths.
6. **Phased approach reduces risk.** Start with YOLOE-26 text prompts (no training needed), then train custom YOLO26 models, then add VLM for edge cases.
## Sources
- Ultralytics YOLO26 docs: https://docs.ultralytics.com/models/yolo26/ (Jan 2026)
- YOLOE-26 paper: arXiv:2602.00168 (Feb 2026)
- YOLOE docs: https://v8docs.ultralytics.com/models/yoloe/ (2025)
- Ultralytics YOLO26 Jetson benchmarks: https://docs.ultralytics.com/guides/nvidia-jetson (2026)
- Cosmos-Reason2-2B on Jetson Orin Nano Super: 4.7 tok/s W4A16 (Embedl, Feb 2026)
- UAV-VL-R1: arXiv:2508.11196 (2025), 2.5GB INT8, open source
- SmolVLM2-500M: HuggingFace blog (Feb 2025), 1.8GB GPU RAM
- SmolVLM-256M: HuggingFace blog (Jan 2025), <1GB
- Moondream 0.5B: moondream.ai (Dec 2024), 816 MiB INT4
- Moondream 2B: moondream.ai (2025), refcoco 91.1
- UAV-YOLO12 road segmentation: F1=0.825, 11.1ms (MDPI Drones, 2025)
- ViewPro ViewLink Serial Protocol V3.3.3 (viewprotech.com)
- YOLO training best practices: ≥1500 images/class (Ultralytics docs)
- Open-Vocabulary Camouflaged Object Segmentation: arXiv:2506.19300 (2025)