Oleksandr Bezdieniezhnykh 8e2ecf50fd Initial commit
Made-with: Cursor
2026-03-26 00:20:30 +02:00


Acceptance Criteria Assessment (Revised)

Acceptance Criteria

| Criterion | Our Values | Researched Values | Cost/Timeline Impact | Status |
| --- | --- | --- | --- | --- |
| Tier 1 latency | ≤100ms per frame | YOLO26n TensorRT FP16 on Jetson Orin Nano Super: ~7ms at 640px. YOLOE-26 adds zero inference overhead when re-parameterized. Combined detection + segmentation well under 100ms. | Low risk | Confirmed |
| Tier 2 detailed analysis | ≤2 seconds (originally VLM) | See VLM assessment below. Recommend tiered approach: Tier 2 = custom CNN classifier (≤200ms), Tier 3 = optional VLM (3-5s). | Architecture change reduces risk | Modified: split into Tier 2 (CNN ≤200ms) + Tier 3 (VLM ≤5s, optional) |
| YOLO new classes | P≥80%, R≥80% | P≥80%, R≥80% achievable. Use YOLO26-Seg for footpaths/roads (instance segmentation), YOLO26 detection for compact objects. YOLOE-26 open-vocabulary as bootstrap. | Reduced initial training-data dependency via YOLOE-26 | Modified: YOLO26-Seg for linear features, YOLOE-26 text prompts for bootstrapping |
| Semantic recall (concealed positions) | ≥60% | ≥60% recall is reasonable for an initial release. YOLOE-26 zero-shot + custom-trained YOLO26 models should reach this. | Medium risk | Confirmed |
| Semantic precision | ≥20% | ≥20% initial precision is reasonable. YOLOE-26 text prompts will start noisy; custom training improves precision over time. | Low risk | Confirmed |
| Footpath recall | ≥70% | ≥70% confirmed achievable. UAV-YOLO12: F1=0.825. YOLO26-Seg NMS-free: likely competitive. | Low risk, with a seasonal caveat | Confirmed |
| Level 1→2 transition | ≤1 second | Physical zoom takes 1-3s on the ViewPro A40. | Physical constraint | Modified: ≤2 seconds |
| Gimbal command latency | ≤500ms | ViewPro A40 UART at 115200 baud plus physical response: well under 500ms. | Low risk | Confirmed |
| RAM (semantic + VLM) | ≤6GB | YOLO26 models: ~0.1-0.5GB. YOLOE-26: same as YOLO26 (zero overhead). Custom CNN: ~0.1GB. VLM (optional): SmolVLM2-500M ~1.8GB or UAV-VL-R1 INT8 ~2.5GB. Total worst case: ~4GB. | Low risk, well within budget | Confirmed |
| Dataset | 1.5 months × 5h/day (~225 hours) | YOLOE-26 text prompts reduce initial data dependency. Recommend SAM-assisted annotation. | Reduced risk via YOLOE-26 bootstrapping | Modified: phased data collection strategy |
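The worst-case RAM figure above is a straight sum over components plus headroom. A quick sanity check, using the per-component estimates from the table (the figures and the 1 GB buffer allowance are assumptions, not measurements):

```python
# Worst-case memory budget check for the Jetson Orin Nano Super (8 GB shared).
# Per-component figures (in GB) are the assumed estimates from the table above.
COMPONENTS_GB = {
    "yolo26_models": 0.5,       # upper end of the ~0.1-0.5 GB range
    "yoloe26_overhead": 0.0,    # zero overhead once re-parameterized
    "custom_cnn": 0.1,
    "vlm_uav_vl_r1_int8": 2.5,  # largest optional VLM candidate
}

def worst_case_gb(components: dict[str, float], headroom_gb: float = 1.0) -> float:
    """Sum component footprints plus an assumed allowance for buffers/activations."""
    return sum(components.values()) + headroom_gb

total = worst_case_gb(COMPONENTS_GB)
print(f"worst case ≈ {total:.1f} GB (budget ≤6 GB: {'OK' if total <= 6.0 else 'OVER'})")
```

With these assumptions the total lands near the ~4 GB worst case quoted in the table, comfortably inside the 6 GB budget.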

VLM Options Assessment

| Model | Params | Memory (quantized) | Estimated Speed (Jetson Orin Nano Super) | Strengths | Weaknesses | Fit for This Task |
| --- | --- | --- | --- | --- | --- | --- |
| UAV-VL-R1 | 2B | 2.5 GB (INT8) | ~10-15 tok/s (estimated from Qwen2-VL-2B base) | Specifically trained for aerial reasoning. 48% better than Qwen2-VL-2B on UAV tasks. Supports object counting and spatial reasoning. Open source. | Largest memory footprint of the candidates. ~3-5s per analysis with image. | Best fit for aerial scene understanding. Use as Tier 3 for ambiguous cases. |
| SmolVLM2-500M | 500M | 1.8 GB (FP16) | ~30-50 tok/s (estimated, 4x smaller than 2B) | Tiny memory footprint. Fast inference. ONNX export available. | Weakest reasoning capability. Video-MME: 42.2 (mediocre). May lack nuance for concealment analysis. | Marginal. Fast, but may be too weak for meaningful contextual reasoning on concealed positions. |
| SmolVLM2-256M | 256M | <1 GB | ~50-80 tok/s (estimated) | Smallest available VLM. Near-instant inference. | Very limited reasoning. Outperforms Idefics-80B on simple benchmarks but not on complex spatial reasoning. | Not recommended. Likely too weak for this task. |
| Moondream 0.5B | 500M | 816 MiB (INT4) | ~30-50 tok/s (estimated) | Built-in detect() and point() APIs. Tiny. | No specific aerial training. Detection benchmarks unverified for the small model. | Interesting for pointing/localization. Could complement YOLO by pointing at a "suspicious area at end of path." |
| Moondream 2B | 2B | ~2 GB (INT4) | ~10-15 tok/s (estimated) | Strong grounded detection (refcoco: 91.1). detect() and point() APIs. | No aerial-specific training. Similar size to UAV-VL-R1 but less specialized. | Good general VLM, but UAV-VL-R1 is better for aerial tasks. |
| Cosmos-Reason2-2B | 2B | ~2 GB (W4A16) | 4.7 tok/s (benchmarked) | NVIDIA-optimized for Jetson. Reasoning-focused. | Slowest of the candidates. Not aerial-specific. | Slow. Not recommended over UAV-VL-R1. |
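The per-analysis latencies above follow from decode throughput. A rough conversion, assuming a short structured response of ~50 output tokens (an assumption, not a benchmark) and ignoring prefill and vision-encoder time:

```python
# Convert decode throughput (tok/s) into per-analysis latency, assuming a
# fixed response length. Both numbers are assumptions for sizing, not benchmarks.
def analysis_seconds(tokens_out: int, tok_per_s: float) -> float:
    """Seconds to generate `tokens_out` tokens at a given decode rate."""
    return tokens_out / tok_per_s

# UAV-VL-R1 at the estimated 10-15 tok/s with ~50-token responses:
fast = analysis_seconds(50, 15.0)  # best case
slow = analysis_seconds(50, 10.0)  # worst case
print(f"UAV-VL-R1: {fast:.1f}-{slow:.1f} s per analysis")
```

This reproduces the ~3-5s figure in the table; longer, free-form answers would scale the latency linearly.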

VLM Verdict

Is a VLM helpful at all for this task?

Yes, but as a supplementary layer, not the primary mechanism. The core detection pipeline should be:

  • Tier 1 (real-time): YOLO26-Seg + YOLOE-26 text prompts — detect footpaths, roads, branch piles, entrances, trees
  • Tier 2 (fast confirmation): Custom lightweight CNN classifier — binary "concealed position: yes/no" on cropped ROI (≤200ms)
  • Tier 3 (optional deep analysis): Small VLM — contextual reasoning on ambiguous cases, operator-facing descriptions
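The hand-off between tiers can be sketched as a routing function. This is a minimal illustration: the 0.3/0.7 thresholds mirror the 30-70% ambiguity band used for VLM escalation, and the `Candidate` structure is a hypothetical stand-in for real detector output.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    label: str          # Tier 1 class, e.g. "footpath_terminus" (illustrative)
    tier2_conf: float   # Tier 2 CNN confidence in [0, 1]

def route(c: Candidate, vlm_available: bool = True) -> str:
    """Decide how far up the tiers a Tier 1 candidate travels.

    Confident accepts/rejects are settled at Tier 2; the ambiguous
    30-70% band escalates to the optional Tier 3 VLM when one is loaded.
    """
    if c.tier2_conf >= 0.7:
        return "alert"       # CNN confirms a concealed position
    if c.tier2_conf <= 0.3:
        return "discard"     # CNN rejects the candidate
    return "tier3_vlm" if vlm_available else "alert_low_confidence"

print(route(Candidate("footpath_terminus", 0.55)))  # → tier3_vlm
```

Keeping the decision in one small function makes the escalation policy easy to tune per deployment without touching the detectors themselves.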

VLM adds value for:

  1. Zero-shot bootstrapping — before enough custom training data exists, VLM can reason about "is this a hidden position?"
  2. Ambiguous cases — when Tier 2 CNN confidence falls between 30% and 70%, the VLM provides a second opinion
  3. Operator trust — VLM can generate natural-language explanations ("footpath terminates at dark mass consistent with concealed entrance")
  4. Novel patterns — VLM generalizes to new types of concealment without retraining

Recommended VLM: UAV-VL-R1 (2.5GB INT8, aerial-specialized, open source). Fallback: SmolVLM2-500M if memory is too tight.

YOLO26 Key Capabilities for This Project

| Feature | Relevance |
| --- | --- |
| NMS-free end-to-end inference | Simpler deployment on Jetson, more predictable latency, no post-processing tuning |
| Instance segmentation (YOLO26-Seg) | Precise pixel-level masks for footpaths and roads — far better than bounding boxes for linear features |
| YOLOE-26 open-vocabulary | Text-prompted detection: "footpath", "pile of branches", "dark entrance" — zero-shot, no training needed initially |
| YOLOE-26 visual prompts | Use reference images of hideouts to find visually similar structures — directly applicable |
| YOLOE-26 prompt-free mode | 1200+ built-in categories for autonomous discovery — useful for Level 1 wide scan |
| 43% faster CPU inference | Better edge performance than previous YOLO versions |
| MuSGD optimizer | Better convergence for custom training on small datasets |
| Improved small-object detection | ProgLoss + STAL losses — directly relevant for detecting small entrances and path features |

YOLOE-26 is a potential game-changer for this project. It enables:

  • Immediate zero-shot detection with text prompts before any custom training
  • Visual prompt mode: provide example images of hideouts, system finds similar patterns
  • Gradual transition to custom-trained YOLO26 as annotated data accumulates
  • No inference overhead vs standard YOLO26 when re-parameterized for fixed classes
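The gradual transition from text prompts to custom-trained classes can be expressed as a per-class routing rule. The 1500 images/class threshold follows the Ultralytics training guidance cited in Sources; the counts and function names here are illustrative assumptions:

```python
# Sketch of the phased transition: serve each class via YOLOE-26 text prompts
# until enough annotated data exists to trust a custom-trained YOLO26 class.
# The 1500 images/class floor follows Ultralytics training best practices;
# the example counts below are hypothetical.
ANNOTATED_IMAGES = {
    "footpath": 2100,
    "pile of branches": 900,
    "dark entrance": 150,
}

TRAIN_THRESHOLD = 1500  # minimum annotated images before a custom class is trusted

def backend_for(cls: str, counts: dict[str, int]) -> str:
    """Route a class to the custom model once enough data exists,
    otherwise keep serving it zero-shot via text prompts."""
    if counts.get(cls, 0) >= TRAIN_THRESHOLD:
        return "yolo26_custom"
    return "yoloe26_text_prompt"

for cls in ANNOTATED_IMAGES:
    print(cls, "→", backend_for(cls, ANNOTATED_IMAGES))
```

Classes migrate one at a time as annotation accumulates, so the pipeline never loses coverage during the transition.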

Restrictions Assessment

| Restriction | Our Values | Researched Values | Cost/Timeline Impact | Status |
| --- | --- | --- | --- | --- |
| Jetson Orin Nano Super | 67 TOPS, 8GB | 67 TOPS INT8, 8GB LPDDR5, 102 GB/s. Memory budget: YOLO26 (~0.3GB) + YOLOE-26 (zero overhead) + CNN (~0.1GB) + VLM (~2.5GB) = ~3GB, leaving ~5GB for OS, buffers, and headroom. VLM and YOLO should run sequentially to avoid contention. | Manageable with scheduling | Confirmed; add sequential inference scheduling |
| ViewPro A40 zoom transition | Not specified | 40x optical zoom full traversal: 1-3 seconds. Partial zoom (medium→high): ~1-2s. | Physical constraint; must be accounted for in scan timing | Add: zoom transition time of 1-2s as a physical constraint |
| All seasons | No seasonal restriction | Phased rollout: winter first (highest contrast), then spring/summer/autumn. | Multi-season dataset collection is a long-term effort | Add: phased seasonal rollout |
| Cython + TensorRT | Must extend existing | YOLO26 deploys natively to TensorRT; YOLOE-26 is also TensorRT-compatible. The VLM may need a separate process (vLLM or TensorRT-LLM). | Low complexity for YOLO26, medium for the VLM | Modified: VLM as a separate process with IPC |
| VLM local only | No cloud | UAV-VL-R1 INT8: 2.5GB, open source, fits on Jetson. SmolVLM2-500M: 1.8GB, ONNX available. | Feasible | Confirmed |
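The sequential inference scheduling noted for the Jetson row can be sketched with a single shared lock: the YOLO pipeline and the VLM never hold the GPU at the same time. The loop bodies below are stand-ins for real TensorRT / VLM calls, and the counters exist only to demonstrate mutual exclusion:

```python
import threading

# Minimal sketch of sequential inference scheduling: one lock guards the GPU,
# so the YOLO engine and the VLM never run concurrently and never contend
# for memory bandwidth. Inference calls here are placeholders.
gpu_lock = threading.Lock()
state_lock = threading.Lock()
active = 0
max_active = 0

def run_engine(iterations: int) -> None:
    """Run `iterations` inference calls, each holding the GPU exclusively."""
    global active, max_active
    for _ in range(iterations):
        with gpu_lock:
            with state_lock:
                active += 1
                max_active = max(max_active, active)
            # ... a real YOLO26 / VLM inference call would run here ...
            with state_lock:
                active -= 1

yolo = threading.Thread(target=run_engine, args=(50,))
vlm = threading.Thread(target=run_engine, args=(50,))
yolo.start(); vlm.start()
yolo.join(); vlm.join()
print("max concurrent GPU users:", max_active)  # → 1
```

A priority queue in front of the lock (Tier 1 frames before Tier 3 requests) would be the natural next step, so real-time detection never starves behind a slow VLM analysis.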

Key Findings (Revised)

  1. YOLOE-26 open-vocabulary is the biggest opportunity. Text-prompted and visual-prompted detection enables immediate zero-shot capability without custom training. Transition to custom-trained YOLO26 as data accumulates.

  2. Three-tier architecture is more realistic than two-tier. Tier 1: YOLO26/YOLOE-26 real-time (≤100ms). Tier 2: Custom CNN confirmation (≤200ms). Tier 3: VLM deep analysis (≤5s, optional).

  3. UAV-VL-R1 is the best VLM candidate. Purpose-built for aerial reasoning, 48% better than generic Qwen2-VL-2B, fits on Jetson at 2.5GB INT8. SmolVLM2-500M is a lighter fallback.

  4. VLM is valuable but supplementary. Zero-shot bootstrapping, ambiguous case analysis, and operator-facing explanations. Not the primary detection mechanism.

  5. YOLO26 NMS-free design simplifies edge deployment. No NMS tuning, more predictable latency, native TensorRT support. Instance segmentation mode ideal for footpaths.

  6. Phased approach reduces risk. Start with YOLOE-26 text prompts (no training needed), then train custom YOLO26 models, then add VLM for edge cases.

Sources

  • Ultralytics YOLO26 docs: https://docs.ultralytics.com/models/yolo26/ (Jan 2026)
  • YOLOE-26 paper: arXiv:2602.00168 (Feb 2026)
  • YOLOE docs: https://v8docs.ultralytics.com/models/yoloe/ (2025)
  • Ultralytics YOLO26 Jetson benchmarks: https://docs.ultralytics.com/guides/nvidia-jetson (2026)
  • Cosmos-Reason2-2B on Jetson Orin Nano Super: 4.7 tok/s W4A16 (Embedl, Feb 2026)
  • UAV-VL-R1: arXiv:2508.11196 (2025), 2.5GB INT8, open source
  • SmolVLM2-500M: HuggingFace blog (Feb 2025), 1.8GB GPU RAM
  • SmolVLM-256M: HuggingFace blog (Jan 2025), <1GB
  • Moondream 0.5B: moondream.ai (Dec 2024), 816 MiB INT4
  • Moondream 2B: moondream.ai (2025), refcoco 91.1
  • UAV-YOLO12 road segmentation: F1=0.825, 11.1ms (MDPI Drones, 2025)
  • ViewPro ViewLink Serial Protocol V3.3.3 (viewprotech.com)
  • YOLO training best practices: ≥1500 images/class (Ultralytics docs)
  • Open-Vocabulary Camouflaged Object Segmentation: arXiv:2506.19300 (2025)