Oleksandr Bezdieniezhnykh 8e2ecf50fd Initial commit
Made-with: Cursor
2026-03-26 00:20:30 +02:00


Acceptance Criteria Assessment (Revised)

Acceptance Criteria

| Criterion | Our Values | Researched Values | Cost/Timeline Impact | Status |
| --- | --- | --- | --- | --- |
| Tier 1 latency | ≤100ms per frame | YOLO26n TensorRT FP16 on Jetson Orin Nano Super: ~7ms at 640px. YOLOE-26 adds zero inference overhead when re-parameterized. Combined detection + segmentation well under 100ms. | Low risk | Confirmed |
| Tier 2 detailed analysis | ≤2 seconds (originally VLM) | See VLM assessment below. Recommend tiered approach: Tier 2 = custom CNN classifier (≤200ms), Tier 3 = optional VLM (3-5s). | Architecture change reduces risk | Modified: split into Tier 2 (CNN ≤200ms) + Tier 3 (VLM ≤5s, optional) |
| YOLO new classes | P≥80%, R≥80% | P≥80%, R≥80% achievable. Use YOLO26-Seg for footpaths/roads (instance segmentation), YOLO26 detection for compact objects. YOLOE-26 open-vocabulary as bootstrap. | Reduced initial training-data dependency via YOLOE-26 | Modified: YOLO26-Seg for linear features, YOLOE-26 text prompts for bootstrapping |
| Semantic recall (concealed positions) | ≥60% | ≥60% recall is reasonable for an initial release. YOLOE-26 zero-shot + custom-trained YOLO26 models should reach this. | Medium risk | Confirmed |
| Semantic precision | ≥20% | ≥20% initial precision is reasonable. YOLOE-26 text prompts will start noisy; custom training improves precision over time. | Low risk | Confirmed |
| Footpath recall | ≥70% | ≥70% confirmed achievable. UAV-YOLO12: F1=0.825. YOLO26-Seg NMS-free: likely competitive. | Low risk, with a seasonal caveat | Confirmed |
| Level 1→2 transition | ≤1 second | Physical zoom takes 1-3s on the ViewPro A40. | Physical constraint | Modified: ≤2 seconds |
| Gimbal command latency | ≤500ms | ViewPro A40 UART at 115200 baud plus physical response: well under 500ms. | Low risk | Confirmed |
| RAM (semantic + VLM) | ≤6GB | YOLO26 models: ~0.1-0.5GB. YOLOE-26: same as YOLO26 (zero overhead). Custom CNN: ~0.1GB. VLM (optional): SmolVLM2-500M ~1.8GB or UAV-VL-R1 INT8 ~2.5GB. Total worst case: ~4GB. | Low risk, well within budget | Confirmed |
| Dataset | 1.5 months × 5h/day (~225 hours) | YOLOE-26 text prompts reduce initial data dependency. Recommend SAM-assisted annotation. | Reduced risk via YOLOE-26 bootstrapping | Modified: phased data collection strategy |
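The worst-case RAM figure above is a straight sum over components plus headroom. A quick sanity check, using the per-component estimates from the table (the figures and the 1 GB buffer allowance are assumptions, not measurements):

```python
# Worst-case memory budget check for the Jetson Orin Nano Super (8 GB shared).
# Per-component figures (in GB) are the assumed estimates from the table above.
COMPONENTS_GB = {
    "yolo26_models": 0.5,       # upper end of the ~0.1-0.5 GB range
    "yoloe26_overhead": 0.0,    # zero overhead once re-parameterized
    "custom_cnn": 0.1,
    "vlm_uav_vl_r1_int8": 2.5,  # largest optional VLM candidate
}

def worst_case_gb(components: dict[str, float], headroom_gb: float = 1.0) -> float:
    """Sum component footprints plus an assumed allowance for buffers/activations."""
    return sum(components.values()) + headroom_gb

total = worst_case_gb(COMPONENTS_GB)
print(f"worst case ≈ {total:.1f} GB (budget ≤6 GB: {'OK' if total <= 6.0 else 'OVER'})")
```

With these assumptions the total lands near the ~4 GB worst case quoted in the table, comfortably inside the 6 GB budget.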

VLM Options Assessment

| Model | Params | Memory (quantized) | Estimated Speed (Jetson Orin Nano Super) | Strengths | Weaknesses | Fit for This Task |
| --- | --- | --- | --- | --- | --- | --- |
| UAV-VL-R1 | 2B | 2.5 GB (INT8) | ~10-15 tok/s (estimated from Qwen2-VL-2B base) | Specifically trained for aerial reasoning. 48% better than Qwen2-VL-2B on UAV tasks. Supports object counting and spatial reasoning. Open source. | Largest memory footprint of the candidates. ~3-5s per analysis with image. | Best fit for aerial scene understanding. Use as Tier 3 for ambiguous cases. |
| SmolVLM2-500M | 500M | 1.8 GB (FP16) | ~30-50 tok/s (estimated, 4x smaller than 2B) | Tiny memory footprint. Fast inference. ONNX export available. | Weakest reasoning capability. Video-MME: 42.2 (mediocre). May lack nuance for concealment analysis. | Marginal. Fast, but may be too weak for meaningful contextual reasoning on concealed positions. |
| SmolVLM2-256M | 256M | <1 GB | ~50-80 tok/s (estimated) | Smallest available VLM. Near-instant inference. | Very limited reasoning. Outperforms Idefics-80B on simple benchmarks but not on complex spatial reasoning. | Not recommended. Likely too weak for this task. |
| Moondream 0.5B | 500M | 816 MiB (INT4) | ~30-50 tok/s (estimated) | Built-in detect() and point() APIs. Tiny. | No specific aerial training. Detection benchmarks unverified for the small model. | Interesting for pointing/localization. Could complement YOLO by pointing at a "suspicious area at end of path." |
| Moondream 2B | 2B | ~2 GB (INT4) | ~10-15 tok/s (estimated) | Strong grounded detection (refcoco: 91.1). detect() and point() APIs. | No aerial-specific training. Similar size to UAV-VL-R1 but less specialized. | Good general VLM, but UAV-VL-R1 is better for aerial tasks. |
| Cosmos-Reason2-2B | 2B | ~2 GB (W4A16) | 4.7 tok/s (benchmarked) | NVIDIA-optimized for Jetson. Reasoning-focused. | Slowest of the candidates. Not aerial-specific. | Slow. Not recommended over UAV-VL-R1. |
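The per-analysis latencies above follow from decode throughput. A rough conversion, assuming a short structured response of ~50 output tokens (an assumption, not a benchmark) and ignoring prefill and vision-encoder time:

```python
# Convert decode throughput (tok/s) into per-analysis latency, assuming a
# fixed response length. Both numbers are assumptions for sizing, not benchmarks.
def analysis_seconds(tokens_out: int, tok_per_s: float) -> float:
    """Seconds to generate `tokens_out` tokens at a given decode rate."""
    return tokens_out / tok_per_s

# UAV-VL-R1 at the estimated 10-15 tok/s with ~50-token responses:
fast = analysis_seconds(50, 15.0)  # best case
slow = analysis_seconds(50, 10.0)  # worst case
print(f"UAV-VL-R1: {fast:.1f}-{slow:.1f} s per analysis")
```

This reproduces the ~3-5s figure in the table; longer, free-form answers would scale the latency linearly.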

VLM Verdict

Is a VLM helpful at all for this task?

Yes, but as a supplementary layer, not the primary mechanism. The core detection pipeline should be:

  • Tier 1 (real-time): YOLO26-Seg + YOLOE-26 text prompts — detect footpaths, roads, branch piles, entrances, trees
  • Tier 2 (fast confirmation): Custom lightweight CNN classifier — binary "concealed position: yes/no" on cropped ROI (≤200ms)
  • Tier 3 (optional deep analysis): Small VLM — contextual reasoning on ambiguous cases, operator-facing descriptions
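The hand-off between tiers can be sketched as a routing function. This is a minimal illustration: the 0.3/0.7 thresholds mirror the 30-70% ambiguity band used for VLM escalation, and the `Candidate` structure is a hypothetical stand-in for real detector output.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    label: str          # Tier 1 class, e.g. "footpath_terminus" (illustrative)
    tier2_conf: float   # Tier 2 CNN confidence in [0, 1]

def route(c: Candidate, vlm_available: bool = True) -> str:
    """Decide how far up the tiers a Tier 1 candidate travels.

    Confident accepts/rejects are settled at Tier 2; the ambiguous
    30-70% band escalates to the optional Tier 3 VLM when one is loaded.
    """
    if c.tier2_conf >= 0.7:
        return "alert"       # CNN confirms a concealed position
    if c.tier2_conf <= 0.3:
        return "discard"     # CNN rejects the candidate
    return "tier3_vlm" if vlm_available else "alert_low_confidence"

print(route(Candidate("footpath_terminus", 0.55)))  # → tier3_vlm
```

Keeping the decision in one small function makes the escalation policy easy to tune per deployment without touching the detectors themselves.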

VLM adds value for:

  1. Zero-shot bootstrapping — before enough custom training data exists, VLM can reason about "is this a hidden position?"
  2. Ambiguous cases — when Tier 2 CNN confidence falls between 30% and 70%, the VLM provides a second opinion
  3. Operator trust — VLM can generate natural-language explanations ("footpath terminates at dark mass consistent with concealed entrance")
  4. Novel patterns — VLM generalizes to new types of concealment without retraining

Recommended VLM: UAV-VL-R1 (2.5GB INT8, aerial-specialized, open source). Fallback: SmolVLM2-500M if memory is too tight.

YOLO26 Key Capabilities for This Project

| Feature | Relevance |
| --- | --- |
| NMS-free end-to-end inference | Simpler deployment on Jetson, more predictable latency, no post-processing tuning |
| Instance segmentation (YOLO26-Seg) | Precise pixel-level masks for footpaths and roads — far better than bounding boxes for linear features |
| YOLOE-26 open-vocabulary | Text-prompted detection: "footpath", "pile of branches", "dark entrance" — zero-shot, no training needed initially |
| YOLOE-26 visual prompts | Use reference images of hideouts to find visually similar structures — directly applicable |
| YOLOE-26 prompt-free mode | 1200+ built-in categories for autonomous discovery — useful for Level 1 wide scan |
| 43% faster CPU inference | Better edge performance than previous YOLO versions |
| MuSGD optimizer | Better convergence for custom training on small datasets |
| Improved small-object detection | ProgLoss + STAL losses — directly relevant for detecting small entrances and path features |

YOLOE-26 is a potential game-changer for this project. It enables:

  • Immediate zero-shot detection with text prompts before any custom training
  • Visual prompt mode: provide example images of hideouts, system finds similar patterns
  • Gradual transition to custom-trained YOLO26 as annotated data accumulates
  • No inference overhead vs standard YOLO26 when re-parameterized for fixed classes
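The gradual transition from text prompts to custom-trained classes can be expressed as a per-class routing rule. The 1500 images/class threshold follows the Ultralytics training guidance cited in Sources; the counts and function names here are illustrative assumptions:

```python
# Sketch of the phased transition: serve each class via YOLOE-26 text prompts
# until enough annotated data exists to trust a custom-trained YOLO26 class.
# The 1500 images/class floor follows Ultralytics training best practices;
# the example counts below are hypothetical.
ANNOTATED_IMAGES = {
    "footpath": 2100,
    "pile of branches": 900,
    "dark entrance": 150,
}

TRAIN_THRESHOLD = 1500  # minimum annotated images before a custom class is trusted

def backend_for(cls: str, counts: dict[str, int]) -> str:
    """Route a class to the custom model once enough data exists,
    otherwise keep serving it zero-shot via text prompts."""
    if counts.get(cls, 0) >= TRAIN_THRESHOLD:
        return "yolo26_custom"
    return "yoloe26_text_prompt"

for cls in ANNOTATED_IMAGES:
    print(cls, "→", backend_for(cls, ANNOTATED_IMAGES))
```

Classes migrate one at a time as annotation accumulates, so the pipeline never loses coverage during the transition.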

Restrictions Assessment

| Restriction | Our Values | Researched Values | Cost/Timeline Impact | Status |
| --- | --- | --- | --- | --- |
| Jetson Orin Nano Super | 67 TOPS, 8GB | 67 TOPS INT8, 8GB LPDDR5, 102 GB/s. Memory budget: YOLO26 (~0.3GB) + YOLOE-26 (zero overhead) + CNN (~0.1GB) + VLM (~2.5GB) = ~3GB, leaving ~5GB for OS, buffers, and headroom. VLM and YOLO should run sequentially to avoid contention. | Manageable with scheduling | Confirmed; add sequential inference scheduling |
| ViewPro A40 zoom transition | Not specified | 40x optical zoom full traversal: 1-3 seconds. Partial zoom (medium→high): ~1-2s. | Physical constraint; must be accounted for in scan timing | Add: zoom transition time of 1-2s as a physical constraint |
| All seasons | No seasonal restriction | Phased rollout: winter first (highest contrast), then spring/summer/autumn. | Multi-season dataset collection is a long-term effort | Add: phased seasonal rollout |
| Cython + TensorRT | Must extend existing | YOLO26 deploys natively to TensorRT; YOLOE-26 is also TensorRT-compatible. The VLM may need a separate process (vLLM or TensorRT-LLM). | Low complexity for YOLO26, medium for the VLM | Modified: VLM as a separate process with IPC |
| VLM local only | No cloud | UAV-VL-R1 INT8: 2.5GB, open source, fits on Jetson. SmolVLM2-500M: 1.8GB, ONNX available. | Feasible | Confirmed |
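The sequential inference scheduling noted for the Jetson row can be sketched with a single shared lock: the YOLO pipeline and the VLM never hold the GPU at the same time. The loop bodies below are stand-ins for real TensorRT / VLM calls, and the counters exist only to demonstrate mutual exclusion:

```python
import threading

# Minimal sketch of sequential inference scheduling: one lock guards the GPU,
# so the YOLO engine and the VLM never run concurrently and never contend
# for memory bandwidth. Inference calls here are placeholders.
gpu_lock = threading.Lock()
state_lock = threading.Lock()
active = 0
max_active = 0

def run_engine(iterations: int) -> None:
    """Run `iterations` inference calls, each holding the GPU exclusively."""
    global active, max_active
    for _ in range(iterations):
        with gpu_lock:
            with state_lock:
                active += 1
                max_active = max(max_active, active)
            # ... a real YOLO26 / VLM inference call would run here ...
            with state_lock:
                active -= 1

yolo = threading.Thread(target=run_engine, args=(50,))
vlm = threading.Thread(target=run_engine, args=(50,))
yolo.start(); vlm.start()
yolo.join(); vlm.join()
print("max concurrent GPU users:", max_active)  # → 1
```

A priority queue in front of the lock (Tier 1 frames before Tier 3 requests) would be the natural next step, so real-time detection never starves behind a slow VLM analysis.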

Key Findings (Revised)

  1. YOLOE-26 open-vocabulary is the biggest opportunity. Text-prompted and visual-prompted detection enables immediate zero-shot capability without custom training. Transition to custom-trained YOLO26 as data accumulates.

  2. Three-tier architecture is more realistic than two-tier. Tier 1: YOLO26/YOLOE-26 real-time (≤100ms). Tier 2: Custom CNN confirmation (≤200ms). Tier 3: VLM deep analysis (≤5s, optional).

  3. UAV-VL-R1 is the best VLM candidate. Purpose-built for aerial reasoning, 48% better than generic Qwen2-VL-2B, fits on Jetson at 2.5GB INT8. SmolVLM2-500M is a lighter fallback.

  4. VLM is valuable but supplementary. Zero-shot bootstrapping, ambiguous case analysis, and operator-facing explanations. Not the primary detection mechanism.

  5. YOLO26 NMS-free design simplifies edge deployment. No NMS tuning, more predictable latency, native TensorRT support. Instance segmentation mode ideal for footpaths.

  6. Phased approach reduces risk. Start with YOLOE-26 text prompts (no training needed), then train custom YOLO26 models, then add VLM for edge cases.

Sources

  • Ultralytics YOLO26 docs: https://docs.ultralytics.com/models/yolo26/ (Jan 2026)
  • YOLOE-26 paper: arXiv:2602.00168 (Feb 2026)
  • YOLOE docs: https://v8docs.ultralytics.com/models/yoloe/ (2025)
  • Ultralytics YOLO26 Jetson benchmarks: https://docs.ultralytics.com/guides/nvidia-jetson (2026)
  • Cosmos-Reason2-2B on Jetson Orin Nano Super: 4.7 tok/s W4A16 (Embedl, Feb 2026)
  • UAV-VL-R1: arXiv:2508.11196 (2025), 2.5GB INT8, open source
  • SmolVLM2-500M: HuggingFace blog (Feb 2025), 1.8GB GPU RAM
  • SmolVLM-256M: HuggingFace blog (Jan 2025), <1GB
  • Moondream 0.5B: moondream.ai (Dec 2024), 816 MiB INT4
  • Moondream 2B: moondream.ai (2025), refcoco 91.1
  • UAV-YOLO12 road segmentation: F1=0.825, 11.1ms (MDPI Drones, 2025)
  • ViewPro ViewLink Serial Protocol V3.3.3 (viewprotech.com)
  • YOLO training best practices: ≥1500 images/class (Ultralytics docs)
  • Open-Vocabulary Camouflaged Object Segmentation: arXiv:2506.19300 (2025)