detections-semantic/_docs/00_research/00_ac_assessment.md
Oleksandr Bezdieniezhnykh · Initial commit 8e2ecf50fd · 2026-03-26

# Acceptance Criteria Assessment (Revised)
## Acceptance Criteria
| Criterion | Our Values | Researched Values | Cost/Timeline Impact | Status |
|-----------|-----------|-------------------|---------------------|--------|
| Tier 1 latency | ≤100ms per frame | YOLO26n TensorRT FP16 on Jetson Orin Nano Super: ~7ms at 640px. YOLOE-26 adds zero inference overhead when re-parameterized. Combined detection + segmentation well under 100ms. | Low risk | **Confirmed** |
| Tier 2 detailed analysis | ≤2 seconds (originally VLM) | See VLM assessment below. Recommend tiered approach: Tier 2 = custom CNN classifier (≤200ms), Tier 3 = optional VLM (3-5s). | Architecture change reduces risk | **Modified: split into Tier 2 (CNN ≤200ms) + Tier 3 (VLM ≤5s optional)** |
| YOLO new classes P≥80%, R≥80% | P≥80%, R≥80% | Use YOLO26-Seg for footpaths/roads (instance segmentation), YOLO26 detection for compact objects. YOLOE-26 open-vocabulary as bootstrap. | Reduced training data dependency initially via YOLOE-26 | **Modified: YOLO26-Seg for linear features, YOLOE-26 text prompts for bootstrapping** |
| Semantic recall ≥60% concealed positions | ≥60% recall | Reasonable for initial release. YOLOE-26 zero-shot + custom-trained YOLO26 models should reach this. | Medium risk | **Confirmed** |
| Semantic precision ≥20% | ≥20% initial | Reasonable. YOLOE-26 text prompts will start noisy, custom training improves over time. | Low risk | **Confirmed** |
| Footpath recall ≥70% | ≥70% | Confirmed achievable. UAV-YOLO12: F1=0.825. YOLO26-Seg NMS-free: likely competitive. | Low risk, seasonal caveat | **Confirmed** |
| Level 1→2 transition | ≤1 second | Physical zoom takes 1-3s on ViewPro A40. | Physical constraint | **Modified: ≤2 seconds** |
| Gimbal command latency ≤500ms | ≤500ms | ViewPro A40 UART 115200 + physical response: well under 500ms. | Low risk | **Confirmed** |
| RAM ≤6GB for semantic + VLM | ≤6GB | YOLO26 models: ~0.1-0.5GB. YOLOE-26: same as YOLO26 (zero overhead). Custom CNN: ~0.1GB. VLM (optional): SmolVLM2-500M ~1.8GB or UAV-VL-R1 INT8 ~2.5GB. Total worst case: ~4GB. | Low risk — well within budget | **Confirmed** |
| Dataset 1.5 months × 5h/day | ~225 hours | YOLOE-26 text prompts reduce initial data dependency. Recommend SAM-assisted annotation. | Reduced risk via YOLOE-26 bootstrapping | **Modified: phased data collection strategy** |
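As a sanity check on the Tier 1 row, a minimal latency-budget sketch. The detection figure is the ~7ms YOLO26n benchmark estimate quoted above; the segmentation and pre/post-processing terms are assumptions for illustration:

```python
def tier1_frame_time_ms(det_ms: float = 7.0, seg_ms: float = 7.0,
                        pre_post_ms: float = 5.0) -> float:
    """Per-frame Tier 1 latency: YOLO26n detection (~7 ms, benchmark estimate),
    YOLO26-Seg segmentation (assumed comparable), plus an assumed
    preprocessing/copy overhead. YOLOE-26 adds no separate term once
    re-parameterized for fixed classes."""
    return det_ms + seg_ms + pre_post_ms

frame_ms = tier1_frame_time_ms()
assert frame_ms <= 100.0        # fits the ≤100 ms acceptance criterion
max_fps = 1000.0 / frame_ms     # headroom: ~52 fps at these estimates
```

Even with the assumed overheads doubled, the budget holds with a wide margin, which is why the Tier 1 row is marked low risk.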
## VLM Options Assessment
| Model | Params | Memory (quantized) | Estimated Speed (Jetson Orin Nano Super) | Strengths | Weaknesses | Fit for This Task |
|-------|--------|-------------------|----------------------------------------|-----------|------------|-------------------|
| **UAV-VL-R1** | 2B | 2.5 GB (INT8) | ~10-15 tok/s (estimated from Qwen2-VL-2B base) | **Specifically trained for aerial reasoning.** 48% better than Qwen2-VL-2B on UAV tasks. Supports object counting, spatial reasoning. Open source. | Largest memory footprint of candidates. ~3-5s per analysis with image. | **Best fit for aerial scene understanding.** Use as Tier 3 for ambiguous cases. |
| **SmolVLM2-500M** | 500M | 1.8 GB (FP16) | ~30-50 tok/s (estimated, 4x smaller than 2B) | Tiny memory footprint. Fast inference. ONNX export available. | Weakest reasoning capability. Video-MME: 42.2 (mediocre). May lack nuance for concealment analysis. | **Marginal.** Fast but may be too weak for meaningful contextual reasoning on concealed positions. |
| **SmolVLM2-256M** | 256M | <1 GB | ~50-80 tok/s (estimated) | Smallest available VLM. Near-instant inference. | Very limited reasoning. Outperforms Idefics-80B on simple benchmarks but not on complex spatial reasoning. | **Not recommended.** Likely too weak for this task. |
| **Moondream 0.5B** | 500M | 816 MiB (INT4) | ~30-50 tok/s (estimated) | Built-in detect() and point() APIs. Tiny. | No specific aerial training. Detection benchmarks unverified for small model. | **Interesting for pointing/localization.** Could complement YOLO by pointing at "suspicious area at end of path." |
| **Moondream 2B** | 2B | ~2 GB (INT4) | ~10-15 tok/s (estimated) | Strong grounded detection (refcoco: 91.1). detect() and point() APIs. | No aerial-specific training. Similar size to UAV-VL-R1 but less specialized. | **Good general VLM, but UAV-VL-R1 is better for aerial tasks.** |
| **Cosmos-Reason2-2B** | 2B | ~2 GB (W4A16) | 4.7 tok/s (benchmarked) | NVIDIA-optimized for Jetson. Reasoning focused. | Slowest of candidates. Not aerial-specific. | **Slow. Not recommended over UAV-VL-R1.** |
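The "~3-5s per analysis" figures in the table follow directly from the token-rate estimates; a back-of-envelope sketch (the response length and image-prefill cost are assumptions, not benchmarks):

```python
def vlm_analysis_time_s(tokens_out: int, tok_per_s: float,
                        image_prefill_s: float = 1.0) -> float:
    """Rough end-to-end time for one VLM analysis: an assumed image
    encoding/prefill cost plus autoregressive decoding at the
    estimated token rate."""
    return image_prefill_s + tokens_out / tok_per_s

# UAV-VL-R1 at the estimated 10-15 tok/s, with a ~40-token answer:
fast = vlm_analysis_time_s(40, 15.0)   # ~3.7 s
slow = vlm_analysis_time_s(40, 10.0)   # 5.0 s
```

The same arithmetic explains why Cosmos-Reason2-2B's benchmarked 4.7 tok/s is disqualifying: a 40-token answer alone would take ~8.5s before prefill.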
### VLM Verdict
**Is a VLM helpful at all for this task?**
**Yes, but as a supplementary layer, not the primary mechanism.** The core detection pipeline should be:
- **Tier 1 (real-time)**: YOLO26-Seg + YOLOE-26 text prompts — detect footpaths, roads, branch piles, entrances, trees
- **Tier 2 (fast confirmation)**: Custom lightweight CNN classifier — binary "concealed position: yes/no" on cropped ROI (≤200ms)
- **Tier 3 (optional deep analysis)**: Small VLM — contextual reasoning on ambiguous cases, operator-facing descriptions
VLM adds value for:
1. **Zero-shot bootstrapping** — before enough custom training data exists, VLM can reason about "is this a hidden position?"
2. **Ambiguous cases** — when Tier 2 CNN confidence falls between 30% and 70%, the VLM provides a second opinion
3. **Operator trust** — VLM can generate natural-language explanations ("footpath terminates at dark mass consistent with concealed entrance")
4. **Novel patterns** — VLM generalizes to new types of concealment without retraining
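Point 2 above amounts to a simple confidence-band router; a minimal sketch, with the 30%/70% band taken from the text (function and label names are illustrative):

```python
def route_detection(tier2_confidence: float,
                    low: float = 0.30, high: float = 0.70) -> str:
    """Route a Tier 2 CNN score: decisions outside the ambiguous band
    are made locally; the 30-70% band is escalated to the Tier 3 VLM."""
    if tier2_confidence >= high:
        return "confirmed"          # flag to operator, no VLM cost
    if tier2_confidence <= low:
        return "rejected"           # discard, no VLM cost
    return "escalate_to_vlm"        # second opinion from Tier 3

assert route_detection(0.85) == "confirmed"
assert route_detection(0.10) == "rejected"
assert route_detection(0.50) == "escalate_to_vlm"
```

Keeping the VLM behind this gate is what makes its 3-5s latency tolerable: only the ambiguous minority of detections ever pays it.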
**Recommended VLM: UAV-VL-R1** (2.5GB INT8, aerial-specialized, open source). Fallback: SmolVLM2-500M if memory is too tight.
## YOLO26 Key Capabilities for This Project
| Feature | Relevance |
|---------|-----------|
| **NMS-free end-to-end inference** | Simpler deployment on Jetson, more predictable latency, no post-processing tuning |
| **Instance segmentation (YOLO26-Seg)** | Precise pixel-level masks for footpaths and roads — far better than bounding boxes for linear features |
| **YOLOE-26 open-vocabulary** | Text-prompted detection: "footpath", "pile of branches", "dark entrance" — zero-shot, no training needed initially |
| **YOLOE-26 visual prompts** | Use reference images of hideouts to find visually similar structures — directly applicable |
| **YOLOE-26 prompt-free mode** | 1200+ built-in categories for autonomous discovery — useful for Level 1 wide scan |
| **43% faster CPU inference** | Better edge performance than previous YOLO versions |
| **MuSGD optimizer** | Better convergence for custom training with small datasets |
| **Improved small object detection** | ProgLoss + STAL loss — directly relevant for detecting small entrances and path features |
**YOLOE-26 is a potential game-changer for this project.** It enables:
- Immediate zero-shot detection with text prompts before any custom training
- Visual prompt mode: provide example images of hideouts, system finds similar patterns
- Gradual transition to custom-trained YOLO26 as annotated data accumulates
- No inference overhead vs standard YOLO26 when re-parameterized for fixed classes
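The "gradual transition" bullet can be made concrete as a per-class graduation rule, using the ~1500 images/class guideline cited under Sources (names and the hard threshold are illustrative; in practice the cutover would also depend on validation P/R):

```python
def serving_backend(annotated_images: dict[str, int],
                    min_per_class: int = 1500) -> dict[str, str]:
    """Per class: serve the custom-trained YOLO26 model once enough
    annotated images exist (Ultralytics suggests ~1500/class); until
    then, fall back to zero-shot YOLOE-26 text prompts."""
    return {cls: ("custom_yolo26" if n >= min_per_class
                  else "yoloe26_text_prompt")
            for cls, n in annotated_images.items()}

backends = serving_backend({"footpath": 2100, "branch_pile": 340})
assert backends == {"footpath": "custom_yolo26",
                    "branch_pile": "yoloe26_text_prompt"}
```

Because a re-parameterized YOLOE-26 and a custom YOLO26 share the same inference cost, this cutover changes accuracy, not latency.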
## Restrictions Assessment
| Restriction | Our Values | Researched Values | Cost/Timeline Impact | Status |
|-------------|-----------|-------------------|---------------------|--------|
| Jetson Orin Nano Super 67 TOPS, 8GB | 67 TOPS INT8, 8GB LPDDR5, 102 GB/s | Memory budget: YOLO26 (~0.3GB) + YOLOE-26 (zero overhead) + CNN (~0.1GB) + VLM (~2.5GB) = ~3GB. Leaves ~3GB for OS, buffers. VLM and YOLO should run sequentially to avoid contention. | Manageable with scheduling | **Confirmed, add sequential inference scheduling** |
| ViewPro A40 zoom transition | Not specified | 40x optical zoom full traversal: 1-3 seconds. Partial zoom (medium→high): ~1-2s. | Physical constraint, must account for in scan timing | **Add: zoom transition time 1-2s as physical constraint** |
| All seasons | No seasonal restriction | Phased rollout: winter first (highest contrast), then spring/summer/autumn. | Multi-season dataset collection is long-term effort | **Add: phased seasonal rollout** |
| Cython + TensorRT | Must extend existing | YOLO26 deploys natively to TensorRT. YOLOE-26 also TensorRT compatible. VLM may need separate process (vLLM or TensorRT-LLM). | Low complexity for YOLO26. Medium for VLM. | **Modified: VLM as separate process with IPC** |
| VLM local only | No cloud | UAV-VL-R1 INT8: 2.5GB, open source, fits on Jetson. SmolVLM2-500M: 1.8GB, ONNX available. | Feasible | **Confirmed** |
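The "sequential inference scheduling" note in the first row boils down to never letting YOLO and the VLM contend for the GPU at once. A minimal in-process sketch using a single lock (inference calls are stubbed; per the Cython/TensorRT row, a real deployment would run the VLM in a separate process and replace the lock with an IPC semaphore):

```python
import threading

gpu_lock = threading.Lock()   # one lock serializes all GPU inference

def run_yolo(frame: str) -> str:
    """Tier 1: short and frequent -- holds the GPU for a few ms."""
    with gpu_lock:
        return f"detections({frame})"   # stub for the TensorRT call

def run_vlm(roi: str) -> str:
    """Tier 3: long and rare -- never overlaps Tier 1, avoiding the
    memory/bandwidth contention noted in the table above."""
    with gpu_lock:
        return f"analysis({roi})"       # stub for the VLM process call

assert run_yolo("frame0") == "detections(frame0)"
assert run_vlm("roi0") == "analysis(roi0)"
```

One caveat this sketch makes visible: while the VLM holds the lock for 3-5s, Tier 1 frames queue up, so the scheduler should drop or subsample frames during Tier 3 analysis rather than block the real-time loop.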
## Key Findings (Revised)
1. **YOLOE-26 open-vocabulary is the biggest opportunity.** Text-prompted and visual-prompted detection enables immediate zero-shot capability without custom training. Transition to custom-trained YOLO26 as data accumulates.
2. **Three-tier architecture is more realistic than two-tier.** Tier 1: YOLO26/YOLOE-26 real-time (≤100ms). Tier 2: Custom CNN confirmation (≤200ms). Tier 3: VLM deep analysis (≤5s, optional).
3. **UAV-VL-R1 is the best VLM candidate.** Purpose-built for aerial reasoning, 48% better than generic Qwen2-VL-2B, fits on Jetson at 2.5GB INT8. SmolVLM2-500M is a lighter fallback.
4. **VLM is valuable but supplementary.** Zero-shot bootstrapping, ambiguous case analysis, and operator-facing explanations. Not the primary detection mechanism.
5. **YOLO26 NMS-free design simplifies edge deployment.** No NMS tuning, more predictable latency, native TensorRT support. Instance segmentation mode ideal for footpaths.
6. **Phased approach reduces risk.** Start with YOLOE-26 text prompts (no training needed), then train custom YOLO26 models, then add VLM for edge cases.
## Sources
- Ultralytics YOLO26 docs: https://docs.ultralytics.com/models/yolo26/ (Jan 2026)
- YOLOE-26 paper: arXiv:2602.00168 (Feb 2026)
- YOLOE docs: https://v8docs.ultralytics.com/models/yoloe/ (2025)
- Ultralytics YOLO26 Jetson benchmarks: https://docs.ultralytics.com/guides/nvidia-jetson (2026)
- Cosmos-Reason2-2B on Jetson Orin Nano Super: 4.7 tok/s W4A16 (Embedl, Feb 2026)
- UAV-VL-R1: arXiv:2508.11196 (2025), 2.5GB INT8, open source
- SmolVLM2-500M: HuggingFace blog (Feb 2025), 1.8GB GPU RAM
- SmolVLM-256M: HuggingFace blog (Jan 2025), <1GB
- Moondream 0.5B: moondream.ai (Dec 2024), 816 MiB INT4
- Moondream 2B: moondream.ai (2025), refcoco 91.1
- UAV-YOLO12 road segmentation: F1=0.825, 11.1ms (MDPI Drones, 2025)
- ViewPro ViewLink Serial Protocol V3.3.3 (viewprotech.com)
- YOLO training best practices: ≥1500 images/class (Ultralytics docs)
- Open-Vocabulary Camouflaged Object Segmentation: arXiv:2506.19300 (2025)