Mirror of https://github.com/azaion/detections-semantic.git, synced 2026-04-22.
Initial commit
Made-with: Cursor
This commit is contained in:
@@ -0,0 +1,104 @@
|
||||
# Acceptance Criteria Assessment (Revised)

## Acceptance Criteria

| Criterion | Our Values | Researched Values | Cost/Timeline Impact | Status |
|-----------|-----------|-------------------|---------------------|--------|
| Tier 1 latency | ≤100ms per frame | YOLO26n TensorRT FP16 on Jetson Orin Nano Super: ~7ms at 640px. YOLOE-26 adds zero inference overhead when re-parameterized. Combined detection + segmentation well under 100ms. | Low risk | **Confirmed** |
| Tier 2 detailed analysis | ≤2 seconds (originally VLM) | See VLM assessment below. Recommend tiered approach: Tier 2 = custom CNN classifier (≤200ms), Tier 3 = optional VLM (3-5s). | Architecture change reduces risk | **Modified: split into Tier 2 (CNN ≤200ms) + Tier 3 (VLM ≤5s optional)** |
| YOLO new classes P≥80%, R≥80% | P≥80%, R≥80% | Use YOLO26-Seg for footpaths/roads (instance segmentation), YOLO26 detection for compact objects. YOLOE-26 open-vocabulary as bootstrap. | Reduced training data dependency initially via YOLOE-26 | **Modified: YOLO26-Seg for linear features, YOLOE-26 text prompts for bootstrapping** |
| Semantic recall ≥60% concealed positions | ≥60% recall | Reasonable for initial release. YOLOE-26 zero-shot + custom-trained YOLO26 models should reach this. | Medium risk | **Confirmed** |
| Semantic precision ≥20% | ≥20% initial | Reasonable. YOLOE-26 text prompts will start noisy; custom training improves over time. | Low risk | **Confirmed** |
| Footpath recall ≥70% | ≥70% | Confirmed achievable. UAV-YOLO12: F1=0.825. YOLO26-Seg NMS-free: likely competitive. | Low risk, seasonal caveat | **Confirmed** |
| Level 1→2 transition | ≤1 second | Physical zoom takes 1-3s on ViewPro A40. | Physical constraint | **Modified: ≤2 seconds** |
| Gimbal command latency ≤500ms | ≤500ms | ViewPro A40 UART 115200 + physical response: well under 500ms. | Low risk | **Confirmed** |
| RAM ≤6GB for semantic + VLM | ≤6GB | YOLO26 models: ~0.1-0.5GB. YOLOE-26: same as YOLO26 (zero overhead). Custom CNN: ~0.1GB. VLM (optional): SmolVLM2-500M ~1.8GB or UAV-VL-R1 INT8 ~2.5GB. Total worst case: ~4GB. | Low risk — well within budget | **Confirmed** |
| Dataset 1.5 months × 5h/day | ~225 hours | YOLOE-26 text prompts reduce initial data dependency. Recommend SAM-assisted annotation. | Reduced risk via YOLOE-26 bootstrapping | **Modified: phased data collection strategy** |

## VLM Options Assessment

| Model | Params | Memory (quantized) | Estimated Speed (Jetson Orin Nano Super) | Strengths | Weaknesses | Fit for This Task |
|-------|--------|-------------------|----------------------------------------|-----------|------------|-------------------|
| **UAV-VL-R1** | 2B | 2.5 GB (INT8) | ~10-15 tok/s (estimated from Qwen2-VL-2B base) | **Specifically trained for aerial reasoning.** 48% better than Qwen2-VL-2B on UAV tasks. Supports object counting, spatial reasoning. Open source. | Largest memory footprint of candidates. ~3-5s per analysis with image. | **Best fit for aerial scene understanding.** Use as Tier 3 for ambiguous cases. |
| **SmolVLM2-500M** | 500M | 1.8 GB (FP16) | ~30-50 tok/s (estimated, 4x smaller than 2B) | Tiny memory footprint. Fast inference. ONNX export available. | Weakest reasoning capability. Video-MME: 42.2 (mediocre). May lack nuance for concealment analysis. | **Marginal.** Fast but may be too weak for meaningful contextual reasoning on concealed positions. |
| **SmolVLM2-256M** | 256M | <1 GB | ~50-80 tok/s (estimated) | Smallest available VLM. Near-instant inference. | Very limited reasoning. Outperforms Idefics-80B on simple benchmarks but not on complex spatial reasoning. | **Not recommended.** Likely too weak for this task. |
| **Moondream 0.5B** | 500M | 816 MiB (INT4) | ~30-50 tok/s (estimated) | Built-in detect() and point() APIs. Tiny. | No specific aerial training. Detection benchmarks unverified for small model. | **Interesting for pointing/localization.** Could complement YOLO by pointing at "suspicious area at end of path." |
| **Moondream 2B** | 2B | ~2 GB (INT4) | ~10-15 tok/s (estimated) | Strong grounded detection (refcoco: 91.1). detect() and point() APIs. | No aerial-specific training. Similar size to UAV-VL-R1 but less specialized. | **Good general VLM, but UAV-VL-R1 is better for aerial tasks.** |
| **Cosmos-Reason2-2B** | 2B | ~2 GB (W4A16) | 4.7 tok/s (benchmarked) | NVIDIA-optimized for Jetson. Reasoning focused. | Slowest of candidates. Not aerial-specific. | **Slow. Not recommended over UAV-VL-R1.** |

### VLM Verdict

**Is a VLM helpful at all for this task?**

**Yes, but as a supplementary layer, not the primary mechanism.** The core detection pipeline should be:

- **Tier 1 (real-time)**: YOLO26-Seg + YOLOE-26 text prompts — detect footpaths, roads, branch piles, entrances, trees
- **Tier 2 (fast confirmation)**: Custom lightweight CNN classifier — binary "concealed position: yes/no" on cropped ROI (≤200ms)
- **Tier 3 (optional deep analysis)**: Small VLM — contextual reasoning on ambiguous cases, operator-facing descriptions
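
The hand-off between the tiers can be sketched as a simple routing policy. The thresholds and return labels below are illustrative assumptions, not part of any existing codebase:

```python
# Hypothetical sketch of the three-tier routing described above.
# The 30-70% ambiguity band matches the "VLM adds value" criteria below.

VLM_BAND = (0.30, 0.70)  # Tier 2 confidence band that escalates to the VLM

def route_detection(tier1_hit: bool, cnn_confidence: float) -> str:
    """Decide which tier produces the final verdict for one ROI."""
    if not tier1_hit:
        return "discard"            # Tier 1 found nothing worth confirming
    lo, hi = VLM_BAND
    if cnn_confidence >= hi:
        return "confirmed"          # Tier 2 is confident enough on its own
    if cnn_confidence <= lo:
        return "rejected"           # Tier 2 confidently rules it out
    return "escalate_to_vlm"        # ambiguous: ask Tier 3 for a second opinion

print(route_detection(True, 0.5))
```

Only the ambiguous band ever pays the 3-5s VLM cost, which is what keeps the VLM supplementary rather than load-bearing.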

VLM adds value for:

1. **Zero-shot bootstrapping** — before enough custom training data exists, the VLM can reason about "is this a hidden position?"
2. **Ambiguous cases** — when Tier 2 CNN confidence falls between 30% and 70%, the VLM provides a second opinion
3. **Operator trust** — the VLM can generate natural-language explanations ("footpath terminates at dark mass consistent with concealed entrance")
4. **Novel patterns** — the VLM generalizes to new types of concealment without retraining

**Recommended VLM: UAV-VL-R1** (2.5GB INT8, aerial-specialized, open source). Fallback: SmolVLM2-500M if memory is too tight.

## YOLO26 Key Capabilities for This Project

| Feature | Relevance |
|---------|-----------|
| **NMS-free end-to-end inference** | Simpler deployment on Jetson, more predictable latency, no post-processing tuning |
| **Instance segmentation (YOLO26-Seg)** | Precise pixel-level masks for footpaths and roads — far better than bounding boxes for linear features |
| **YOLOE-26 open-vocabulary** | Text-prompted detection: "footpath", "pile of branches", "dark entrance" — zero-shot, no training needed initially |
| **YOLOE-26 visual prompts** | Use reference images of hideouts to find visually similar structures — directly applicable |
| **YOLOE-26 prompt-free mode** | 1200+ built-in categories for autonomous discovery — useful for Level 1 wide scan |
| **43% faster CPU inference** | Better edge performance than previous YOLO versions |
| **MuSGD optimizer** | Better convergence for custom training with small datasets |
| **Improved small object detection** | ProgLoss + STAL loss — directly relevant for detecting small entrances and path features |

**YOLOE-26 is a potential game-changer for this project.** It enables:

- Immediate zero-shot detection with text prompts before any custom training
- Visual prompt mode: provide example images of hideouts and the system finds similar patterns
- Gradual transition to custom-trained YOLO26 as annotated data accumulates
- No inference overhead vs standard YOLO26 when re-parameterized for fixed classes

## Restrictions Assessment

| Restriction | Our Values | Researched Values | Cost/Timeline Impact | Status |
|-------------|-----------|-------------------|---------------------|--------|
| Jetson Orin Nano Super 67 TOPS, 8GB | 67 TOPS INT8, 8GB LPDDR5, 102 GB/s | Memory budget: YOLO26 (~0.3GB) + YOLOE-26 (zero overhead) + CNN (~0.1GB) + VLM (~2.5GB) = ~3GB. Leaves ~3GB for OS, buffers. VLM and YOLO should run sequentially to avoid contention. | Manageable with scheduling | **Confirmed, add sequential inference scheduling** |
| ViewPro A40 zoom transition | Not specified | 40x optical zoom full traversal: 1-3 seconds. Partial zoom (medium→high): ~1-2s. | Physical constraint, must account for in scan timing | **Add: zoom transition time 1-2s as physical constraint** |
| All seasons | No seasonal restriction | Phased rollout: winter first (highest contrast), then spring/summer/autumn. | Multi-season dataset collection is long-term effort | **Add: phased seasonal rollout** |
| Cython + TensorRT | Must extend existing | YOLO26 deploys natively to TensorRT. YOLOE-26 also TensorRT compatible. VLM may need separate process (vLLM or TensorRT-LLM). | Low complexity for YOLO26. Medium for VLM. | **Modified: VLM as separate process with IPC** |
| VLM local only | No cloud | UAV-VL-R1 INT8: 2.5GB, open source, fits on Jetson. SmolVLM2-500M: 1.8GB, ONNX available. | Feasible | **Confirmed** |
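
Running the VLM as a separate process implies a small IPC protocol between the detection service and the VLM worker. A minimal length-prefixed JSON framing over a stream socket could look like the following; message names and fields are illustrative assumptions, and in deployment the `socketpair` demo would be replaced by an `AF_UNIX` socket bound to a filesystem path:

```python
import json
import socket
import struct

# Length-prefixed JSON framing for the detection-process <-> VLM-process link.
def send_msg(sock: socket.socket, payload: dict) -> None:
    data = json.dumps(payload).encode()
    sock.sendall(struct.pack("!I", len(data)) + data)  # 4-byte big-endian length

def recv_msg(sock: socket.socket) -> dict:
    header = sock.recv(4, socket.MSG_WAITALL)
    (length,) = struct.unpack("!I", header)
    return json.loads(sock.recv(length, socket.MSG_WAITALL).decode())

# Demo over a socketpair standing in for the real Unix socket.
a, b = socket.socketpair()
send_msg(a, {"cmd": "analyze_roi", "roi_id": 7})
request = recv_msg(b)
send_msg(b, {"roi_id": request["roi_id"], "verdict": "likely concealed entrance"})
reply = recv_msg(a)
print(reply)
```

Explicit framing matters here because a stream socket gives no message boundaries, and the detection loop must never block on a half-read VLM reply.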

## Key Findings (Revised)

1. **YOLOE-26 open-vocabulary is the biggest opportunity.** Text-prompted and visual-prompted detection enables immediate zero-shot capability without custom training. Transition to custom-trained YOLO26 as data accumulates.
2. **Three-tier architecture is more realistic than two-tier.** Tier 1: YOLO26/YOLOE-26 real-time (≤100ms). Tier 2: Custom CNN confirmation (≤200ms). Tier 3: VLM deep analysis (≤5s, optional).
3. **UAV-VL-R1 is the best VLM candidate.** Purpose-built for aerial reasoning, 48% better than generic Qwen2-VL-2B, fits on Jetson at 2.5GB INT8. SmolVLM2-500M is a lighter fallback.
4. **VLM is valuable but supplementary.** Zero-shot bootstrapping, ambiguous case analysis, and operator-facing explanations. Not the primary detection mechanism.
5. **YOLO26 NMS-free design simplifies edge deployment.** No NMS tuning, more predictable latency, native TensorRT support. Instance segmentation mode ideal for footpaths.
6. **Phased approach reduces risk.** Start with YOLOE-26 text prompts (no training needed), then train custom YOLO26 models, then add VLM for edge cases.

## Sources

- Ultralytics YOLO26 docs: https://docs.ultralytics.com/models/yolo26/ (Jan 2026)
- YOLOE-26 paper: arXiv:2602.00168 (Feb 2026)
- YOLOE docs: https://v8docs.ultralytics.com/models/yoloe/ (2025)
- Ultralytics YOLO26 Jetson benchmarks: https://docs.ultralytics.com/guides/nvidia-jetson (2026)
- Cosmos-Reason2-2B on Jetson Orin Nano Super: 4.7 tok/s W4A16 (Embedl, Feb 2026)
- UAV-VL-R1: arXiv:2508.11196 (2025), 2.5GB INT8, open source
- SmolVLM2-500M: HuggingFace blog (Feb 2025), 1.8GB GPU RAM
- SmolVLM-256M: HuggingFace blog (Jan 2025), <1GB
- Moondream 0.5B: moondream.ai (Dec 2024), 816 MiB INT4
- Moondream 2B: moondream.ai (2025), refcoco 91.1
- UAV-YOLO12 road segmentation: F1=0.825, 11.1ms (MDPI Drones, 2025)
- ViewPro ViewLink Serial Protocol V3.3.3 (viewprotech.com)
- YOLO training best practices: ≥1500 images/class (Ultralytics docs)
- Open-Vocabulary Camouflaged Object Segmentation: arXiv:2506.19300 (2025)

# Question Decomposition

## Original Question

Assess solution_draft01.md for weak points, performance bottlenecks, and security issues, and produce a revised solution draft.

## Active Mode

Mode B — Solution Assessment of draft01

Rationale: solution_draft01.md exists in OUTPUT_DIR. Assessing and improving.

## Problem Context Summary

- Three-tier semantic detection (YOLOE-26 → Spatial Reasoning + CNN → VLM) on Jetson Orin Nano Super (8GB, 67 TOPS)
- Two-level camera scan (wide sweep → detailed investigation) with ViewPro A40 gimbal
- Integration with existing Cython+TRT YOLO detection service
- YOLOE-26 zero-shot bootstrapping → custom YOLO26 fine-tuning transition
- VLM (UAV-VL-R1) as separate process via Unix socket IPC
- Winter-first seasonal rollout

## Question Type

**Problem Diagnosis** — root cause analysis of weak points

Combined with **Decision Support** — weighing alternative solutions for identified issues

## Research Subject Boundary Definition

- **Population**: Edge AI semantic detection pipelines on Jetson-class hardware
- **Geography**: Deployment in Eastern European winter conditions (Ukraine conflict)
- **Timeframe**: 2025-2026 technology (YOLO26, YOLOE-26, VLMs, JetPack 6.2)
- **Level**: Single Jetson Orin Nano Super device (8GB unified memory, 67 TOPS INT8)

## Decomposed Sub-Questions

### Memory & Resource Contention

1. What is the actual GPU memory footprint of the YOLOE-26s-seg TRT engine + existing YOLO TRT engine + MobileNetV3-Small TRT + UAV-VL-R1 INT8 running in 8GB unified memory?
2. Can two TRT engines (existing YOLO + YOLOE-26) share the same GPU execution context, or do they need separate CUDA streams?
3. Is sequential VLM scheduling (pause YOLO → run VLM → resume) viable without dropping detection frames?

### YOLOE-26 Zero-Shot Accuracy

4. How well do YOLOE text prompts perform on out-of-distribution domains (military concealment vs COCO/LVIS training data)?
5. Are visual prompts (SAVPE) more reliable than text prompts for this domain? What are the reference image requirements?
6. What is the fallback if YOLOE-26 zero-shot produces unacceptable false positive rates?

### Path Tracing & Spatial Reasoning

7. How robust is morphological skeletonization on noisy aerial segmentation masks (partial paths, broken segments)?
8. What happens with dense path networks (villages, supply routes)? How to filter relevant paths?
9. Is a 128×128 ROI sufficient for endpoint classification, or does the CNN need more spatial context?
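
To make sub-questions 7 and 9 concrete: on a clean one-pixel-wide skeleton, a path endpoint is simply a skeleton pixel with exactly one 8-connected neighbour, and each endpoint is where the 128×128 ROI would be cropped for the Tier 2 classifier. A pure-numpy sketch under that clean-skeleton assumption (noisy masks, the subject of sub-question 7, would produce spurious endpoints):

```python
import numpy as np

def skeleton_endpoints(skel: np.ndarray) -> list:
    """Return (row, col) of skeleton pixels with exactly one 8-connected neighbour."""
    padded = np.pad(skel.astype(np.uint8), 1)
    endpoints = []
    for r, c in zip(*np.nonzero(padded)):
        # sum of the 3x3 window minus the centre pixel = neighbour count
        neighbours = padded[r - 1 : r + 2, c - 1 : c + 2].sum() - 1
        if neighbours == 1:
            endpoints.append((int(r) - 1, int(c) - 1))  # undo padding offset
    return endpoints

# A short horizontal path: endpoints at both ends.
skel = np.zeros((5, 7), dtype=np.uint8)
skel[2, 1:6] = 1
print(skeleton_endpoints(skel))  # [(2, 1), (2, 5)]
```

Branch points (three or more neighbours) fall out of the same neighbour count, which is what dense path networks in sub-question 8 would need to filter.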

### VLM Integration

10. What is the actual inference latency of a 2B-parameter VLM (INT8) on Jetson Orin Nano Super?
11. Is vLLM the right runtime for Jetson, or should we use TRT-LLM / llama.cpp / MLC-LLM?
12. What is the memory overhead of keeping a VLM loaded but idle vs loading on-demand?

### Gimbal Control & Scan Strategy

13. Is PID control sufficient for path-following, or do we need a more sophisticated controller (Kalman filter, predictive)?
14. What happens when the UAV itself is moving during the Level 2 detailed scan? How to compensate?
15. Is the POI queue strategy (20 max, 30s expiry) well-calibrated for typical mission profiles?
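
As a baseline answer to sub-question 13, a PID loop with anti-windup clamping (the approach the servopilot library takes) is easy to sketch and simulate. Gains, limits, and the toy integrator plant below are illustrative assumptions, not tuned values for the A40:

```python
class PID:
    """PID controller with anti-windup: the integral stops accumulating
    while the output is saturated."""

    def __init__(self, kp: float, ki: float, kd: float, out_limit: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_limit = out_limit
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error: float, dt: float) -> float:
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        self.integral += error * dt
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        if abs(output) > self.out_limit:
            self.integral -= error * dt  # undo accumulation while saturated
            output = max(-self.out_limit, min(self.out_limit, output))
        return output

# Drive a gimbal-angle error toward a 10-degree target over simulated 20ms steps.
pid = PID(kp=2.0, ki=0.5, kd=0.05, out_limit=30.0)
angle, target = 0.0, 10.0
for _ in range(500):
    angle += pid.update(target - angle, dt=0.02) * 0.02
print(round(angle, 2))
```

What this sketch cannot answer is sub-question 14: PID on pixel error alone has no model of UAV ego-motion, which is where a Kalman filter (Source #34) earns its place.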

### Training Data Strategy

16. Is 1500 images/class realistic for military concealment data? What are actual annotation throughput estimates?
17. Can synthetic data augmentation (cut-paste, style transfer) meaningfully boost concealment detection training?
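
The arithmetic behind sub-question 16 is worth making explicit. The throughput figures below are assumptions for illustration (60s per manually annotated image, a 3x speedup from SAM-assisted labeling), not measured values:

```python
def annotation_hours(classes: int, images_per_class: int,
                     seconds_per_image: float, assisted_speedup: float = 1.0) -> float:
    """Total annotator hours, optionally discounted for assisted labeling."""
    total_images = classes * images_per_class
    return total_images * seconds_per_image / assisted_speedup / 3600

# 5 classes at the 1500 images/class guideline from Source #15.
manual = annotation_hours(classes=5, images_per_class=1500, seconds_per_image=60)
assisted = annotation_hours(classes=5, images_per_class=1500,
                            seconds_per_image=60, assisted_speedup=3.0)
print(round(manual), round(assisted))  # 125 42
```

Even the assisted figure exceeds what a single annotator produces casually, which is why the phased YOLOE-26-first strategy matters: zero-shot prompts buy time while the dataset grows.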

### Security

18. What adversarial attack vectors exist against edge-deployed YOLO models?
19. How to protect model weights and the inference pipeline on a physical device that could be captured?
20. What operational security measures are needed for the data pipeline (captured imagery, detection logs)?

## Timeliness Sensitivity Assessment

- **Research Topic**: Edge AI resource management, VLM deployment on Jetson, YOLOE accuracy assessment
- **Sensitivity Level**: Critical
- **Rationale**: Tools (vLLM, TRT-LLM, MLC-LLM for Jetson) are actively evolving. JetPack 6.2 is the latest release. YOLOE-26 is weeks old.
- **Source Time Window**: 6 months (Sep 2025 — Mar 2026)
- **Priority official sources**:
  1. NVIDIA Jetson AI Lab (memory/performance benchmarks)
  2. Ultralytics docs (YOLOE-26 accuracy, TRT export)
  3. vLLM / TRT-LLM / MLC-LLM Jetson compatibility docs
  4. TensorRT 10.x memory management documentation

# Source Registry

## Source #1

- **Title**: Ultralytics YOLO26 Documentation
- **Link**: https://docs.ultralytics.com/models/yolo26/
- **Tier**: L1
- **Publication Date**: 2026-01-14
- **Timeliness Status**: Currently valid
- **Version Info**: YOLO26, Ultralytics 8.4.x
- **Summary**: Official YOLO26 docs — NMS-free, edge-first, MuSGD optimizer, improved small object detection, instance segmentation with semantic loss.

## Source #2

- **Title**: YOLOE: Real-Time Seeing Anything — Ultralytics Docs
- **Link**: https://docs.ultralytics.com/models/yoloe/
- **Tier**: L1
- **Publication Date**: 2025-2026
- **Timeliness Status**: Currently valid
- **Version Info**: YOLOE, YOLOE-26 (yoloe-26n-seg.pt through yoloe-26x-seg.pt)
- **Summary**: Official YOLOE docs — open-vocabulary detection/segmentation, text/visual/prompt-free modes, RepRTA, SAVPE, LRPC, zero inference overhead when re-parameterized.

## Source #3

- **Title**: YOLOE-26 Paper
- **Link**: https://arxiv.org/abs/2602.00168
- **Tier**: L1
- **Publication Date**: 2026-02
- **Timeliness Status**: Currently valid
- **Summary**: Integration of YOLO26 with YOLOE for real-time open-vocabulary instance segmentation. NMS-free, end-to-end.

## Source #4

- **Title**: Ultralytics YOLO26 Jetson Benchmarks
- **Link**: https://docs.ultralytics.com/guides/nvidia-jetson
- **Tier**: L1
- **Publication Date**: 2026
- **Timeliness Status**: Currently valid
- **Version Info**: YOLO11 benchmarks on Jetson Orin Nano Super, TensorRT FP16
- **Summary**: YOLO11n TensorRT FP16 on Jetson Orin Nano Super: 6.93ms at 640px. YOLO11s: 13.50ms. YOLO11m: 17.48ms.

## Source #5

- **Title**: Cosmos-Reason2-2B on Jetson Orin Nano Super
- **Link**: https://www.thenextgentechinsider.com/pulse/cosmos-reason2-runs-on-jetson-orin-nano-super-with-w4a16-quantization
- **Tier**: L2
- **Publication Date**: 2026-02
- **Timeliness Status**: Currently valid
- **Summary**: 4.7 tok/s on Jetson Orin Nano Super with W4A16 quantization.

## Source #6

- **Title**: UAV-VL-R1 Paper
- **Link**: https://arxiv.org/pdf/2508.11196
- **Tier**: L1
- **Publication Date**: 2025
- **Timeliness Status**: Currently valid
- **Summary**: Lightweight VLM for aerial reasoning. 48% better zero-shot than Qwen2-VL-2B. 2.5GB INT8, 3.9GB FP16. Open source.

## Source #7

- **Title**: SmolVLM 256M & 500M Blog
- **Link**: https://huggingface.co/blog/smolervlm
- **Tier**: L1
- **Publication Date**: 2025-01
- **Timeliness Status**: Currently valid
- **Summary**: SmolVLM-500M: 1.8GB GPU RAM, ONNX/WebGPU support, 93M SigLIP vision encoder.

## Source #8

- **Title**: Moondream 0.5B Blog
- **Link**: https://moondream.ai/blog/introducing-moondream-0-5b
- **Tier**: L1
- **Publication Date**: 2024-12
- **Timeliness Status**: Currently valid
- **Summary**: 500M params, 816 MiB INT4, detect()/point() APIs, Raspberry Pi compatible.

## Source #9

- **Title**: ViewPro ViewLink Serial Protocol V3.3.3
- **Link**: https://www.viewprotech.com/index.php?ac=article&at=read&did=510
- **Tier**: L1
- **Publication Date**: 2024
- **Timeliness Status**: Currently valid
- **Summary**: Serial command protocol for ViewPro gimbal cameras. UART 115200.

## Source #10

- **Title**: ArduPilot ViewPro Gimbal Integration
- **Link**: https://ardupilot.org/copter/docs/common-viewpro-gimbal.html
- **Tier**: L1
- **Publication Date**: 2025
- **Version Info**: ArduPilot 4.5+
- **Summary**: MNT1_TYPE=11 (Viewpro), SERIAL2_PROTOCOL=8, TTL serial, MAVLink 10Hz.

## Source #11

- **Title**: UAV-YOLO12 Road Segmentation
- **Link**: https://www.mdpi.com/2072-4292/17/9/1539
- **Tier**: L1
- **Publication Date**: 2025
- **Summary**: F1=0.825 for paths from UAV imagery. 11.1ms inference. SKNet + PConv modules.

## Source #12

- **Title**: FootpathSeg GitHub
- **Link**: https://github.com/WennyXY/FootpathSeg
- **Tier**: L3
- **Publication Date**: 2025
- **Summary**: DINO-MC pre-training + UNet fine-tuning for footpath segmentation. GIS layer generation.

## Source #13

- **Title**: Herbivore Trail Segmentation (UNet+MambaOut)
- **Link**: https://arxiv.org/pdf/2504.12121
- **Tier**: L1
- **Publication Date**: 2025-04
- **Summary**: UNet+MambaOut achieves best accuracy for trail detection from aerial photographs.

## Source #14

- **Title**: Open-Vocabulary Camouflaged Object Segmentation
- **Link**: https://arxiv.org/html/2506.19300v1
- **Tier**: L1
- **Publication Date**: 2025
- **Summary**: VLM + SAM cascaded approach for camouflage detection. VLM-derived features as prompts to SAM.

## Source #15

- **Title**: YOLO Training Best Practices
- **Link**: https://docs.ultralytics.com/yolov5/tutorials/tips_for_best_training_results
- **Tier**: L1
- **Publication Date**: 2025
- **Summary**: ≥1500 images/class, ≥10,000 instances/class. 0-10% background images. Pretrained weights recommended.

## Source #16

- **Title**: Jetson AI Lab LLM/VLM Benchmarks
- **Link**: https://www.jetson-ai-lab.com/tutorials/genai-benchmarking/
- **Tier**: L1
- **Publication Date**: 2025-2026
- **Summary**: Llama-3.1-8B W4A16 on Jetson Orin Nano Super: 44.19 tok/s output, 32ms TTFT. vLLM as inference engine.

## Source #17

- **Title**: servopilot Python Library
- **Link**: https://pypi.org/project/servopilot/
- **Tier**: L3
- **Publication Date**: 2025
- **Summary**: Anti-windup PID controller for gimbal control. Dual-axis support. Zero dependencies.

## Source #18

- **Title**: Multi-Model AI Resource Allocation for Humanoid Robots: A Survey on Jetson Orin Nano Super
- **Link**: https://dev.to/ankk98/multi-model-ai-resource-allocation-for-humanoid-robots-a-survey-on-jetson-orin-nano-super-310i
- **Tier**: L3
- **Publication Date**: 2025
- **Summary**: Running VLA + YOLO concurrently on Orin Nano Super is "mostly theoretical". GPU sharing causes 10-40% latency jitter. Needs lighter edge-optimized models.

## Source #19

- **Title**: TensorRT Multiple Engines on Single GPU
- **Link**: https://github.com/NVIDIA/TensorRT/issues/4358
- **Tier**: L2
- **Publication Date**: 2025
- **Summary**: NVIDIA recommends a single engine with async CUDA streams over multiple separate engines. CUDA context push/pop needed for multiple engines.

## Source #20

- **Title**: TensorRT High Memory Usage on Jetson Orin Nano (Ultralytics)
- **Link**: https://github.com/ultralytics/ultralytics/issues/21562
- **Tier**: L2
- **Publication Date**: 2025
- **Summary**: YOLOv8-OBB TRT engine consumes ~2.6GB on Jetson Orin Nano. cuDNN/CUDA binary loading adds ~940MB-1.1GB overhead per engine.

## Source #21

- **Title**: NVIDIA Forum: Jetson Orin Nano Super Insufficient GPU Memory
- **Link**: https://forums.developer.nvidia.com/t/jetson-orin-nano-super-insufficient-gpu-memory/330777
- **Tier**: L2
- **Publication Date**: 2025-04
- **Summary**: Orin Nano Super shows 3.7GB/7.6GB free GPU memory after OS. Even a 1.5B Q4 model fails to load due to KV cache buffer requirements (model weight 876MB + temp buffer 10.7GB needed).

## Source #22

- **Title**: YOLO26 TensorRT Confidence Misalignment on Jetson
- **Link**: https://www.hackster.io/qwe018931/pushing-limits-yolov8-vs-v26-on-jetson-orin-nano-b89267
- **Tier**: L2
- **Publication Date**: 2026
- **Summary**: YOLO26 exhibits bounding box drift and inaccurate confidence scores when converted to TRT for C++ deployment on Jetson. YOLOv8 works fine. Architecture-specific export issue.

## Source #23

- **Title**: YOLO26 INT8 TensorRT Export Fails on Jetson Orin (Ultralytics Issue #23841)
- **Link**: https://github.com/ultralytics/ultralytics/issues/23841
- **Tier**: L2
- **Publication Date**: 2026
- **Summary**: YOLO26n INT8 TRT export fails with a checkLinks error during calibration on Jetson Orin with TensorRT 10.3.0 / JetPack 6.

## Source #24

- **Title**: PatchBlock: Lightweight Defense Against Adversarial Patches for Edge AI
- **Link**: https://arxiv.org/abs/2601.00367
- **Tier**: L1
- **Publication Date**: 2026-01
- **Summary**: CPU-based preprocessing module recovers up to 77% model accuracy under adversarial patch attacks. Minimal clean accuracy loss. Suitable for edge deployment.

## Source #25

- **Title**: Qrypt Quantum-Secure Encryption for NVIDIA Jetson Edge AI
- **Link**: https://thequantuminsider.com/2026/03/12/qrypt-quantum-secure-encryption-nvidia-jetson-edge-ai/
- **Tier**: L2
- **Publication Date**: 2026-03
- **Summary**: BLAST encryption protocol for Jetson Orin Nano and Thor. Quantum-secure end-to-end encryption, independent key generation.

## Source #26

- **Title**: Adversarial Patch Attacks on YOLO Edge Deployment (Springer)
- **Link**: https://link.springer.com/article/10.1007/s10207-025-01067-3
- **Tier**: L1
- **Publication Date**: 2025
- **Summary**: Smaller YOLO models on edge devices are more vulnerable to adversarial attacks. Trade-off between latency and security.

## Source #27

- **Title**: Synthetic Data for Military Camouflaged Object Detection (IEEE)
- **Link**: https://ieeexplore.ieee.org/document/10660900/
- **Tier**: L1
- **Publication Date**: 2024
- **Summary**: Synthetic data generation approach for military camouflage detection training.

## Source #28

- **Title**: GenCAMO: Environment-Aware Camouflage Image Generation
- **Link**: https://arxiv.org/abs/2601.01181
- **Tier**: L1
- **Publication Date**: 2026-01
- **Summary**: Scene graph + generative models for synthetic camouflage data with multi-modal annotations. Improves complex scene detection.

## Source #29

- **Title**: Camouflage Anything (CVPR 2025)
- **Link**: https://openaccess.thecvf.com/content/CVPR2025/html/Das_Camouflage_Anything_...
- **Tier**: L1
- **Publication Date**: 2025
- **Summary**: Controlled out-painting for realistic camouflage dataset generation. CamOT metric. Improves detection baselines when used for fine-tuning.

## Source #30

- **Title**: YOLOE Visual+Text Multimodal Fusion PR (Ultralytics)
- **Link**: https://github.com/ultralytics/ultralytics/pull/21966
- **Tier**: L2
- **Publication Date**: 2025
- **Summary**: Multimodal fusion of text + visual prompts for YOLOE. Concat mode (zero overhead) and weighted-sum mode (fuse_alpha). Merged into Ultralytics.

## Source #31

- **Title**: Learnable Morphological Skeleton for Remote Sensing (IEEE TGRS 2025)
- **Link**: https://ui.adsabs.harvard.edu/abs/2025ITGRS..63S1458X
- **Tier**: L1
- **Publication Date**: 2025
- **Summary**: Learnable morphological skeleton priors integrated into SAM for slender object segmentation. Addresses downsampling information loss.

## Source #32

- **Title**: GraphMorph: Topologically Accurate Tubular Structure Extraction
- **Link**: https://arxiv.org/pdf/2502.11731
- **Tier**: L1
- **Publication Date**: 2025
- **Summary**: Branch-level graph decoder + SkeletonDijkstra for centerline extraction. Reduces false positives vs pixel-level segmentation.

## Source #33

- **Title**: UAV Gimbal PID Control for Camera Stabilization (IEEE 2024)
- **Link**: https://ieeexplore.ieee.org/document/10569310/
- **Tier**: L1
- **Publication Date**: 2024
- **Summary**: PID controllers applied in gimbal construction for stabilization and tracking.

## Source #34

- **Title**: Kalman Filter Steady Aiming for UAV Gimbal (IEEE)
- **Link**: https://ieeexplore.ieee.org/ielx7/6287639/10005208/10160027.pdf
- **Tier**: L1
- **Publication Date**: 2023
- **Summary**: Kalman filter + coordinate transformation eliminates attitude and mounting errors in UAV gimbal. Better accuracy than PID alone during flight.

## Source #35

- **Title**: vLLM on Jetson Orin Nano Deployment Guide
- **Link**: https://learnopencv.com/deployment-on-edge-vllm-on-jetson/
- **Tier**: L2
- **Publication Date**: 2026
- **Summary**: vLLM can run 2B models on Orin Nano 8GB. Shared memory must be increased to 8GB. Memory management critical.

## Source #36

- **Title**: Jetson Orin Nano LLM Bottleneck Analysis
- **Link**: https://ericxliu.me/posts/benchmarking-llms-on-jetson-orin-nano/
- **Tier**: L2
- **Publication Date**: 2025
- **Summary**: Bottleneck is memory bandwidth (68 GB/s), not compute. Only 5.2GB usable VRAM after OS overhead. 40 TOPS largely underutilized for LLM inference.

## Source #37

- **Title**: TRT-LLM: No Edge Device Support Statement
- **Link**: https://github.com/NVIDIA/TensorRT-LLM/issues/7978
- **Tier**: L1
- **Publication Date**: 2025
- **Summary**: TensorRT-LLM developers explicitly state they do not aim to support edge devices/platforms.

## Source #38

- **Title**: Qwen3-VL-2B on Orin Nano Super (NVIDIA Forum)
- **Link**: https://forums.developer.nvidia.com/t/performance-inquiry-optimizing-qwen3-vl-2b-inference-for-2-qps-target-on-orin-nano-super/359639
- **Tier**: L2
- **Publication Date**: 2026
- **Summary**: Performance inquiry for Qwen3-VL-2B targeting 2 QPS on Orin Nano Super. Indicates active community attempts to deploy 2B VLMs on this hardware.

# Fact Cards

## Fact #1

- **Statement**: Jetson Orin Nano Super has 7.6GB total unified memory, but only ~3.7GB free GPU memory after OS/system overhead in a Docker container.
- **Source**: Source #21
- **Phase**: Assessment
- **Target Audience**: Edge AI multi-model deployment on Orin Nano Super
- **Confidence**: ✅ High
- **Related Dimension**: Memory contention

## Fact #2

- **Statement**: A single TensorRT engine (YOLOv8-OBB) consumes ~2.6GB on Jetson Orin Nano. cuDNN/CUDA binary loading adds ~940MB-1.1GB overhead per engine initialization.
- **Source**: Source #20
- **Phase**: Assessment
- **Target Audience**: TRT multi-engine memory planning
- **Confidence**: ✅ High
- **Related Dimension**: Memory contention
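
Facts #1 and #2 together pin down the memory arithmetic. A rough sanity check, with component figures echoed from the facts above and the per-engine overhead approximated at 1.0GB (component names are illustrative):

```python
# Worst-case concurrent residency vs free GPU memory after OS overhead.
FREE_GPU_GB = 3.7  # Fact #1

components_gb = {
    "yolo26_trt_engine": 0.3,
    "cuda_cudnn_overhead": 1.0,   # Fact #2: per-engine binary loading
    "cnn_classifier": 0.1,
    "uav_vl_r1_int8": 2.5,
}

total = sum(components_gb.values())
concurrent_ok = total <= FREE_GPU_GB
print(f"planned {total:.1f} GB vs {FREE_GPU_GB} GB free -> ok={concurrent_ok}")
```

The plan already overshoots the free budget before counting the VLM's KV cache, which supports the sequential-scheduling and load-on-demand conclusions in Facts #3 and #4 below.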

## Fact #3

- **Statement**: Running VLA + YOLO detection concurrently on Orin Nano Super is described as "mostly theoretical" in 2025 surveys. GPU sharing causes 10-40% latency jitter.
- **Source**: Source #18
- **Phase**: Assessment
- **Target Audience**: Multi-model concurrent inference
- **Confidence**: ⚠️ Medium (survey, not primary benchmark)
- **Related Dimension**: Memory contention, performance

## Fact #4

- **Statement**: NVIDIA recommends using a single TRT engine with async CUDA streams over multiple separate engines for GPU efficiency. Multiple engines need CUDA context push/pop management.
- **Source**: Source #19
- **Phase**: Assessment
- **Target Audience**: TRT engine management
- **Confidence**: ✅ High
- **Related Dimension**: Memory contention, architecture

## Fact #5

- **Statement**: YOLO26 exhibits bounding box drift and inaccurate confidence scores when deployed via TensorRT on Jetson Orin Nano in C++. This is an architecture-specific export issue not present in YOLOv8.
- **Source**: Source #22
- **Phase**: Assessment
- **Target Audience**: YOLO26/YOLOE-26 TRT deployment
- **Confidence**: ✅ High
- **Related Dimension**: YOLOE-26 viability, deployment risk

## Fact #6

- **Statement**: YOLO26n INT8 TensorRT export fails during calibration graph optimization on Jetson Orin with TensorRT 10.3.0 / JetPack 6. ONNX export succeeds but the TRT build crashes.
- **Source**: Source #23
- **Phase**: Assessment
- **Target Audience**: YOLO26 edge deployment
- **Confidence**: ✅ High
- **Related Dimension**: YOLOE-26 viability, deployment risk

## Fact #7

- **Statement**: YOLOE supports multimodal fusion of text + visual prompts with two modes: concat (zero overhead) and weighted-sum (fuse_alpha). This can improve robustness over text-only or visual-only prompts.
- **Source**: Source #30
- **Phase**: Assessment
- **Target Audience**: YOLOE prompt strategy
- **Confidence**: ✅ High
- **Related Dimension**: YOLOE-26 accuracy
|
||||
|
||||
## Fact #8
|
||||
- **Statement**: YOLOE text prompts are trained on LVIS (1203 categories) and COCO. Military concealment classes (dugouts, branch camouflage, FPV hideouts) are far out-of-distribution from training data. No published benchmarks for this domain.
|
||||
- **Source**: Sources #2, #3 (inferred from training data descriptions)
|
||||
- **Phase**: Assessment
|
||||
- **Target Audience**: YOLOE-26 zero-shot accuracy
|
||||
- **Confidence**: ⚠️ Medium (inference from known training data)
|
||||
- **Related Dimension**: YOLOE-26 accuracy
|
||||
|
||||
## Fact #9
|
||||
- **Statement**: Smaller YOLO models (commonly used on edge devices) are more vulnerable to adversarial patch attacks than larger counterparts, creating a latency-security trade-off.
|
||||
- **Source**: Source #26
|
||||
- **Phase**: Assessment
|
||||
- **Target Audience**: Edge AI security
|
||||
- **Confidence**: ✅ High
|
||||
- **Related Dimension**: Security
|
||||
|
||||
## Fact #10
|
||||
- **Statement**: PatchBlock is a lightweight CPU-based preprocessing module that recovers up to 77% of model accuracy under adversarial patch attacks with minimal clean accuracy loss.
|
||||
- **Source**: Source #24
|
||||
- **Phase**: Assessment
|
||||
- **Target Audience**: Edge AI adversarial defense
|
||||
- **Confidence**: ✅ High
|
||||
- **Related Dimension**: Security
|
||||
|
||||
## Fact #11

- **Statement**: TensorRT-LLM developers explicitly stated that they "do not aim to support models on edge devices/platforms" when asked about VLM support on the Orin NX.
- **Source**: Source #37
- **Phase**: Assessment
- **Target Audience**: VLM runtime selection
- **Confidence**: ✅ High
- **Related Dimension**: VLM integration

## Fact #12

- **Statement**: vLLM can deploy 2B models on the Jetson Orin Nano 8GB. Shared memory must be increased to 8GB and memory management is critical. The bottleneck is memory bandwidth (68 GB/s), not compute (67 TOPS).
- **Source**: Sources #35, #36
- **Phase**: Assessment
- **Target Audience**: VLM runtime on Jetson
- **Confidence**: ✅ High
- **Related Dimension**: VLM integration

## Fact #13

- **Statement**: Cosmos-Reason2-2B achieves 4.7 tok/s on the Jetson Orin Nano Super with W4A16 quantization. Llama-3.1-8B W4A16 achieves 44.19 tok/s (text-only). VLMs are significantly slower due to vision-encoder overhead.
- **Source**: Sources #5, #16
- **Phase**: Assessment
- **Target Audience**: VLM inference speed estimation
- **Confidence**: ✅ High
- **Related Dimension**: VLM integration, performance

## Fact #14

- **Statement**: A 1.5B Q4 model on the Jetson Orin Nano Super failed to load because the KV-cache temp buffer required 10.7GB while only 6.5GB was available. The model weights alone were only 876MB.
- **Source**: Source #21
- **Phase**: Assessment
- **Target Audience**: VLM memory management
- **Confidence**: ✅ High
- **Related Dimension**: Memory contention, VLM integration

## Fact #15

- **Statement**: Morphological skeletonization suffers from noise-induced boundary variations that cause spurious skeletal branches. Recent methods (2025) use scale-space hierarchical simplification for controllable robustness.
- **Source**: Source #31 (related search results)
- **Phase**: Assessment
- **Target Audience**: Path tracing robustness
- **Confidence**: ✅ High
- **Related Dimension**: Path tracing

## Fact #16

- **Statement**: GraphMorph (2025) operates at the branch level using a Graph Decoder + SkeletonDijkstra, producing topology-aware centerline masks. It reduces false positives versus pixel-level segmentation approaches.
- **Source**: Source #32
- **Phase**: Assessment
- **Target Audience**: Path extraction algorithms
- **Confidence**: ✅ High
- **Related Dimension**: Path tracing

## Fact #17

- **Statement**: Kalman filtering + coordinate transformation in UAV gimbal systems eliminates attitude and mounting errors that PID controllers alone cannot compensate for during flight.
- **Source**: Source #34
- **Phase**: Assessment
- **Target Audience**: Gimbal control algorithm
- **Confidence**: ✅ High
- **Related Dimension**: Gimbal control

## Fact #18

- **Statement**: Synthetic data generation for camouflage detection is a validated approach: GenCAMO (2026) uses scene graphs + generative models; CamouflageAnything (CVPR 2025) uses controlled out-painting. Both improve detection baselines.
- **Source**: Sources #28, #29
- **Phase**: Assessment
- **Target Audience**: Training data strategy
- **Confidence**: ✅ High
- **Related Dimension**: Training data

## Fact #19

- **Statement**: Usable VRAM on the Jetson Orin Nano Super is approximately 5.2GB after OS overhead (not the advertised 8GB). The 8GB is shared between CPU and GPU.
- **Source**: Source #36
- **Phase**: Assessment
- **Target Audience**: Memory budget planning
- **Confidence**: ✅ High
- **Related Dimension**: Memory contention

## Fact #20

- **Statement**: FP8 quantization for Qwen2-VL-2B performs worse than FP16 on vLLM. INT8/W4A16 are the recommended quantization formats for 2B VLMs on constrained hardware.
- **Source**: vLLM Issue #9992
- **Phase**: Assessment
- **Target Audience**: VLM quantization strategy
- **Confidence**: ✅ High
- **Related Dimension**: VLM integration

@@ -0,0 +1,28 @@

# Comparison Framework

## Selected Framework Type

Problem Diagnosis + Decision Support

## Selected Dimensions

1. Memory Budget Feasibility
2. YOLO26/YOLOE-26 TRT Deployment Stability
3. YOLOE-26 Zero-Shot Accuracy for Domain
4. Path Tracing Algorithm Robustness
5. VLM Runtime & Integration Viability
6. Gimbal Control Adequacy
7. Training Data Realism
8. Security & Adversarial Resilience

## Initial Population

| Dimension | Draft01 Assumption | Researched Reality | Risk Level | Factual Basis |
|-----------|-------------------|-------------------|------------|---------------|
| Memory Budget | YOLO + YOLOE-26 + CNN + VLM coexist on 8GB | Only ~5.2GB usable VRAM. A single YOLO TRT engine is ~2.6GB; two engines + CNN ≈ 5-6GB. No room for a VLM simultaneously. | **CRITICAL** | Fact #1, #2, #3, #14, #19 |
| YOLO26 TRT Stability | YOLO26-Seg TRT export assumed working | YOLO26 has confirmed confidence misalignment in TRT C++ and INT8 export crashes on Jetson. Active bugs remain unfixed. | **HIGH** | Fact #5, #6 |
| YOLOE-26 Zero-Shot | Text prompts "footpath", "branch pile" assumed effective | Trained on LVIS/COCO; military concealment is far OOD. No published domain benchmarks. Generic prompts may work for "footpath" but not "dugout" or "camouflage netting". | **HIGH** | Fact #7, #8 |
| Path Tracing | Zhang-Suen skeletonization assumed robust | Classical skeletonization is noise-sensitive — spurious branches from noisy segmentation masks. GraphMorph/learnable skeletons are more robust alternatives. | **MEDIUM** | Fact #15, #16 |
| VLM Runtime | vLLM or TRT-LLM assumed viable | TRT-LLM explicitly does not support edge devices. vLLM works but requires careful memory management. The VLM cannot run concurrently with YOLO — it must be unloaded/reloaded. | **HIGH** | Fact #11, #12, #14 |
| VLM Speed | UAV-VL-R1 ≤5s assumed | Cosmos-Reason2-2B: 4.7 tok/s on Orin Nano Super. A 50-100 token response takes 10-21s, significantly exceeding the 5s target. | **HIGH** | Fact #13 |
| Gimbal Control | PID assumed sufficient | PID works for a stationary UAV. During flight a Kalman filter is needed to compensate attitude/mounting errors; PID alone causes drift. | **MEDIUM** | Fact #17 |
| Training Data | 1500 images/class in 8 weeks assumed | Realistic for generic objects; challenging for military concealment (access, annotation complexity). Synthetic augmentation (GenCAMO, CamouflageAnything) can significantly help. | **MEDIUM** | Fact #18 |
| Security | No security measures in draft01 | Small edge YOLO models are more vulnerable to adversarial patches. Physical device capture risks model weights and logs. The PatchBlock defense is available. | **HIGH** | Fact #9, #10 |

@@ -0,0 +1,127 @@

# Reasoning Chain

## Dimension 1: Memory Budget Feasibility

### Fact Confirmation

The Jetson Orin Nano Super has 8GB of unified (CPU+GPU shared) memory. After OS overhead, only ~5.2GB is usable for GPU workloads (Fact #19). A single YOLO TRT engine consumes ~2.6GB including cuDNN/CUDA overhead (Fact #2). Free GPU memory inside a Docker container is ~3.7GB (Fact #1).

### Reference Comparison

Draft01 assumes the existing YOLO TRT engine, a YOLOE-26 TRT engine, a MobileNetV3-Small TRT engine, and UAV-VL-R1 are all running or ready. In reality, two TRT engines alone likely consume 4-5GB (2 × ~2.5GB per engine, minus roughly 1GB of shared CUDA overhead). The VLM (UAV-VL-R1 INT8 = 2.5GB) cannot fit alongside both YOLO engines.

### Conclusion

**The draft01 architecture is memory-infeasible as designed.** Two YOLO TRT engines plus the CNN already saturate available memory, and the VLM cannot run concurrently. Solutions: (A) time-multiplex — unload the YOLO engines before loading the VLM, (B) use a single merged TRT engine in which YOLOE-26 is re-parameterized into the existing YOLO pipeline, (C) offload the VLM to a companion device or the cloud.
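The arithmetic behind this conclusion can be made explicit. The sketch below is a pure-Python feasibility check using the approximate figures from the fact cards (~5.2GB usable VRAM, ~2.6GB per YOLO TRT engine); the component names and sizes are planning assumptions, not measurements of this exact stack.

```python
# Rough memory-budget check for the Orin Nano Super pipeline.
# Figures are the approximate values from the fact cards; treat them
# as planning numbers, not guarantees.

USABLE_VRAM_GB = 5.2  # Fact #19: usable VRAM after OS overhead

def fits(components: dict, budget: float = USABLE_VRAM_GB):
    """Return (fits?, total GB) for a set of simultaneously resident models."""
    total = sum(components.values())
    return total <= budget, round(total, 2)

draft01 = {            # everything resident at once, as draft01 assumes
    "yolo_trt": 2.6,   # Fact #2
    "yoloe_trt": 2.6,  # assumed comparable to the YOLO engine
    "cnn_trt": 0.05,   # MobileNetV3-Small engine (assumed)
    "vlm_int8": 2.5,   # UAV-VL-R1 INT8
}
revised = {            # option (B): single merged engine, VLM demand-loaded
    "merged_trt": 3.5,
    "cnn_trt": 0.05,
}

print(fits(draft01))   # (False, 7.75) — draft01 blows the budget
print(fits(revised))   # (True, 3.55) — headroom for the demand-loaded VLM
```

Even granting generous sharing of CUDA overhead, the draft01 stack overshoots the usable budget by more than 2GB, which is why options (A)-(C) all reduce to "only one large model resident at a time."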

### Confidence

✅ High — multiple sources confirm the memory constraints

---

## Dimension 2: YOLO26/YOLOE-26 TRT Deployment Stability

### Fact Confirmation

YOLO26 has confirmed bugs when deployed via TensorRT on Jetson: confidence misalignment in C++ (Fact #5) and an INT8 export crash (Fact #6). These are architecture-specific issues in YOLO26's end-to-end NMS-free design when converted through the ONNX→TRT pipeline.

### Reference Comparison

Draft01 assumes YOLOE-26 TRT deployment works. YOLOE-26 inherits YOLO26's architecture, so if YOLO26 has TRT issues, YOLOE-26 likely inherits them. YOLOv8 TRT deployment, by contrast, is stable and proven on Jetson.

### Conclusion

**YOLOE-26 TRT deployment on Jetson is a high-risk path.** Mitigations: (A) use YOLOE-v8-seg (YOLOE built on the YOLOv8 backbone) instead of YOLOE-26-seg for initial deployment — proven TRT stability, (B) use the Python Ultralytics predict() API instead of C++ TRT initially (avoids the confidence issue but is slower), (C) monitor the Ultralytics issue tracker for YOLO26 TRT fixes before transitioning.

### Confidence

✅ High — the bugs are documented with issue numbers

---

## Dimension 3: YOLOE-26 Zero-Shot Accuracy for Domain

### Fact Confirmation

YOLOE is trained on LVIS (1203 categories) and COCO data (Fact #8). Military concealment classes are not in LVIS/COCO. Generic concepts such as "footpath" and "road" are in-distribution and should work; domain-specific concepts such as "dugout", "FPV hideout", and "camouflage netting" are far out-of-distribution.

### Reference Comparison

Draft01 relies heavily on YOLOE-26 text prompts as the bootstrapping mechanism. If text prompts fail for concealment-specific classes, the entire zero-shot Phase 1 is compromised. However, visual prompts (SAVPE) and multimodal fusion (Fact #7) offer a stronger alternative for domain-specific detection.

### Conclusion

**Text prompts alone are unreliable for concealment classes.** Solutions: (A) prioritize visual prompts (SAVPE) using the semantic01-04.png reference images — visual-similarity detection is less dependent on the LVIS vocabulary, (B) use multimodal fusion (text + visual) for robustness, (C) use generic text prompts only for in-distribution classes ("footpath", "road", "trail") and visual prompts for concealment-specific patterns, (D) measure zero-shot recall in the first week and keep a fallback to heuristic detectors.

### Confidence

⚠️ Medium — no direct benchmarks for this domain; reasoning from the training-data distribution

---

## Dimension 4: Path Tracing Algorithm Robustness

### Fact Confirmation

Classical Zhang-Suen skeletonization is sensitive to boundary noise — small perturbations create spurious skeletal branches (Fact #15). Aerial segmentation masks from YOLOE-26 will contain noise, partial segments, and broken paths. GraphMorph (2025) provides topology-aware centerline extraction with fewer false positives (Fact #16).

### Reference Comparison

Draft01 uses basic skeletonization plus hit-or-miss endpoint detection. This works on clean synthetic masks but will fail on real-world noisy masks with multiple branches, gaps, and artifacts.

### Conclusion

**Add morphological preprocessing before skeletonization.** Solutions: (A) apply Gaussian blur + binary threshold + morphological closing to clean segmentation masks before skeletonization, (B) prune short skeleton branches (below a length threshold) to remove noise-induced artifacts, (C) consider GraphMorph as a more robust alternative if simple pruning is insufficient, (D) increase the ROI crop from 128×128 to 256×256 for more spatial context in CNN classification.
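Step (B) — branch pruning — is the part most often skipped, so a minimal sketch follows. The skeleton is represented as a set of (row, col) pixels with 8-connectivity; in a real pipeline it would come from e.g. skimage.morphology.skeletonize, and the threshold and toy data here are purely illustrative.

```python
# Minimal branch-pruning sketch: walk from each endpoint toward the nearest
# junction and delete the branch if it is shorter than min_len pixels.

def neighbors(p, skel):
    """8-connected neighbors of pixel p that are on the skeleton."""
    r, c = p
    return [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and (r + dr, c + dc) in skel]

def prune(skel, min_len):
    """Remove endpoint branches shorter than min_len pixels."""
    skel = set(skel)
    changed = True
    while changed:
        changed = False
        for p in [q for q in skel if len(neighbors(q, skel)) == 1]:  # endpoints
            branch, cur = [p], p
            while True:
                # extend along degree-<=2 pixels; stop at a junction
                nxt = [n for n in neighbors(cur, skel)
                       if n not in branch and len(neighbors(n, skel)) <= 2]
                if not nxt:
                    break
                cur = nxt[0]
                branch.append(cur)
            if len(branch) < min_len:          # a noise spur: remove it
                skel -= set(branch)
                changed = True
                break
    return skel

main = {(5, c) for c in range(10)}             # a clean 10-pixel path
spur = {(3, 4), (4, 4)}                        # a 2-pixel noise branch
cleaned = prune(main | spur, min_len=3)
print(main <= cleaned, (3, 4) in cleaned)      # True False
```

The path survives while the spur tip is deleted; note that the spur pixel abutting the junction can remain, so production code would follow up with a hit-or-miss spur-removal pass.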

### Confidence

✅ High — skeletonization noise sensitivity is well-documented

---

## Dimension 5: VLM Runtime & Integration Viability

### Fact Confirmation

TensorRT-LLM explicitly does not support edge devices (Fact #11). vLLM works on Jetson but requires careful memory management (Fact #12). VLM inference speed for 2B models is low: Cosmos-Reason2-2B achieves only 4.7 tok/s (Fact #13), so a 50-100 token response takes 10-21 seconds — far exceeding the 5s target. A 1.5B model failed to load because its KV-cache buffer requirements exceeded available memory (Fact #14).

### Reference Comparison

Draft01 assumes UAV-VL-R1 runs in ≤5s via vLLM or TRT-LLM. In reality, TRT-LLM is not an option; vLLM can work, but inference will take 10-20s, not 5s, and the VLM cannot coexist with the YOLO engines in memory.

### Conclusion

**The VLM tier needs a fundamental redesign.** Solutions: (A) accept 10-20s VLM latency — change from "optional real-time" to a "background analysis" mode in which VLM results arrive after the operator has moved on, (B) use SmolVLM-500M (1.8GB) or Moondream-0.5B (816 MiB INT4) instead of UAV-VL-R1 for faster inference and a smaller footprint, (C) implement the VLM as demand-loaded: unload all TRT engines → load the VLM → infer → unload the VLM → reload the TRT engines (adds 2-3s per switch but guarantees the memory fits), (D) defer the VLM to a ground station connected via datalink if the latency is acceptable.
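Option (C)'s swap discipline and the latency arithmetic can be sketched together. Everything below is illustrative — the slot abstraction, the 5s swap overhead, and the loader callables are assumptions, not a real runtime API; only the 4.7 tok/s rate comes from Fact #13.

```python
# "One large model resident at a time" enforced by a single slot, plus the
# wall-clock estimate that rules out the original 5 s VLM target.

TOK_PER_S = 4.7          # measured Cosmos-Reason2-2B rate (Fact #13)

def vlm_response_seconds(n_tokens: int, swap_overhead_s: float = 5.0) -> float:
    """Estimated wall-clock time: engine swap both ways + token generation."""
    return round(swap_overhead_s + n_tokens / TOK_PER_S, 1)

class ModelSlot:
    """Holds at most one model; swap() enforces unload-before-load."""
    def __init__(self):
        self.loaded = None

    def swap(self, name, loader, unloader):
        if self.loaded is not None:
            unloader(self.loaded)   # free VRAM before the next engine
        loader(name)
        self.loaded = name

events = []
slot = ModelSlot()
slot.swap("yolo_trt", events.append, lambda m: events.append("unload:" + m))
slot.swap("vlm", events.append, lambda m: events.append("unload:" + m))
print(events)                      # ['yolo_trt', 'unload:yolo_trt', 'vlm']
print(vlm_response_seconds(100))   # 26.3 — a 100-token answer, swaps included
```

Even a short 50-token answer lands around 15-16s with swap overhead, which is why option (A) reframes the VLM as background analysis rather than an interactive tier.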

### Confidence

✅ High — benchmarks directly confirm the constraints

---

## Dimension 6: Gimbal Control Adequacy

### Fact Confirmation

PID controllers handle gimbal stabilization when the platform is relatively stable, but during UAV flight, attitude changes and mounting errors cause drift that PID alone cannot compensate. A Kalman filter plus coordinate transformation is proven to eliminate these errors (Fact #17).

### Reference Comparison

Draft01 uses PID-only control. This works if the UAV is hovering with minimal movement; during active flight (the Level 1 sweep), the UAV is moving, causing gimbal drift and path-following errors.

### Conclusion

**Add a Kalman filter to the gimbal control pipeline.** Solution: a cascade architecture — Kalman filter for state estimation (compensating UAV attitude) → PID for error correction → gimbal actuator — with UAV IMU data as the Kalman filter input. This is standard practice for aerial gimbal systems.
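The proposed cascade can be illustrated in one dimension: a scalar Kalman filter estimates the pointing error from noisy IMU-derived measurements, and the PID acts on the filtered estimate rather than the raw signal. All gains, noise parameters, and the actuator model below are illustrative, not tuned for any real gimbal.

```python
# Toy 1-D sketch of the Kalman -> PID cascade for gimbal pointing.
import random

class Kalman1D:
    def __init__(self, q=1e-3, r=0.04):
        self.x, self.p, self.q, self.r = 0.0, 1.0, q, r

    def update(self, z):
        self.p += self.q                 # predict (static process model)
        k = self.p / (self.p + self.r)   # Kalman gain
        self.x += k * (z - self.x)       # correct with measurement z
        self.p *= (1 - k)
        return self.x

class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.i, self.prev = 0.0, 0.0

    def step(self, error):
        self.i += error * self.dt
        d = (error - self.prev) / self.dt
        self.prev = error
        return self.kp * error + self.ki * self.i + self.kd * d

random.seed(0)
kf, pid = Kalman1D(), PID(kp=2.0, ki=0.5, kd=0.05, dt=0.02)
angle = 5.0                              # start 5 degrees off target
for _ in range(500):                     # 10 s of 50 Hz control
    meas = angle + random.gauss(0, 0.2)  # noisy attitude measurement
    cmd = pid.step(kf.update(meas))      # filter first, then PID
    angle -= cmd * pid.dt                # trivial actuator model
print(abs(angle) < 0.5)                  # True — converged near the target
```

Feeding the PID the filtered estimate instead of the raw measurement suppresses the derivative term's noise amplification, which is the main practical benefit of the cascade.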

### Confidence

✅ High — well-established engineering practice

---

## Dimension 7: Training Data Realism

### Fact Confirmation

Ultralytics recommends ≥1500 images/class and ≥10,000 instances/class for good YOLO performance (Source #15). Synthetic data generation for camouflage detection is validated: GenCAMO and CamouflageAnything both improve detection baselines (Fact #18).

### Reference Comparison

Draft01 targets 500+ images by week 6 and 1500+ by week 8. For military concealment in winter conditions this is optimistic: images are hard to collect (an active conflict zone), annotation requires domain expertise, and class diversity is limited.

### Conclusion

**Supplement real data with synthetic augmentation.** Solutions: (A) use CamouflageAnything or GenCAMO to generate synthetic concealment training data, (B) apply cut-paste augmentation of annotated concealment objects onto clean terrain backgrounds, (C) reduce the initial target to 300-500 real images + 1000+ synthetic images per class for the first model iteration, (D) implement active learning: YOLOE-26 zero-shot flags candidates → a human annotates → the labels feed back into the training loop.
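The core of option (B) is mechanically simple: paste an annotated object crop onto a clean background tile and emit the matching normalized box. The sketch below uses plain nested lists to stay dependency-free; a real pipeline would use numpy/PIL and blend the paste seam (e.g. Poisson blending) rather than hard-pasting, and all values here are illustrative.

```python
# Cut-paste augmentation sketch: background tile + object crop ->
# augmented image + YOLO-style (cx, cy, w, h) box normalized to [0, 1].

def paste(background, patch, top, left):
    """Return (augmented image, normalized center-format box)."""
    img = [row[:] for row in background]   # copy, don't mutate the original
    ph, pw = len(patch), len(patch[0])
    for r in range(ph):
        for c in range(pw):
            img[top + r][left + c] = patch[r][c]
    h_img, w_img = len(img), len(img[0])
    box = ((left + pw / 2) / w_img, (top + ph / 2) / h_img,
           pw / w_img, ph / h_img)
    return img, box

terrain = [[0] * 8 for _ in range(8)]      # clean 8x8 background tile
target = [[9, 9], [9, 9]]                  # 2x2 "concealment object" crop
aug, box = paste(terrain, target, top=3, left=2)
print(aug[3][2], aug[4][3])                # 9 9 — patch landed in place
print(box)                                 # (0.375, 0.5, 0.25, 0.25)
```

Because the label is derived from the paste coordinates, each synthetic image arrives pre-annotated, which is what makes this strategy cheap relative to hand-labeling real imagery.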

### Confidence

⚠️ Medium — synthetic data quality for this specific domain is untested

---

## Dimension 8: Security & Adversarial Resilience

### Fact Confirmation

Smaller YOLO models are more vulnerable to adversarial patches (Fact #9); the PatchBlock defense recovers up to 77% of accuracy (Fact #10). Draft01 contains no security considerations for model-weight protection, inference-pipeline integrity, or operational data.

### Reference Comparison

Draft01 has zero security measures. For a military edge device that could be physically captured, this is a significant gap.

### Conclusion

**Add three security layers.** (A) Adversarial defense: integrate PatchBlock or an equivalent CPU-based preprocessing step to detect anomalous input patches. (B) Model protection: encrypt TRT engine files at rest, decrypt into tmpfs at boot, and use a secure boot chain. (C) Operational data: an encrypted circular buffer for captured imagery, auto-wipe on tamper detection, and transmission of only coordinates + confidence (not raw imagery) over the datalink.
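The retention/wipe behavior of layer (C) can be sketched independently of the cryptography. The class below is an illustrative assumption, not a hardened design: encryption-at-rest is deliberately omitted (it belongs in a vetted crypto layer), and the tamper hook would in practice be wired to a hardware interrupt.

```python
# Fixed-size circular buffer for captured frames with a wipe-on-tamper hook.
from collections import deque

class CaptureBuffer:
    def __init__(self, max_frames: int):
        # oldest frames are discarded automatically once maxlen is reached
        self._buf = deque(maxlen=max_frames)

    def add(self, frame):
        self._buf.append(frame)

    def wipe(self):
        """Called by the tamper-detection path: discard all retained frames."""
        self._buf.clear()

    def __len__(self):
        return len(self._buf)

buf = CaptureBuffer(max_frames=3)
for i in range(5):
    buf.add(f"frame-{i}")
print(len(buf))   # 3 — only the newest frames are ever retained
buf.wipe()
print(len(buf))   # 0 — nothing recoverable after the tamper signal
```

Bounding retention this way limits what a captured device can leak even before the wipe fires, which complements rather than replaces encryption at rest.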

### Confidence

✅ High — the threats are well-documented and mitigations exist

@@ -0,0 +1,55 @@

# Validation Log

## Validation Scenario

Same scenario as draft01: a winter reconnaissance flight at 700m altitude over a forested area — but now accounting for memory constraints, TRT bugs, and the revised VLM latency.

## Expected Based on Revised Conclusions

**Using the revised architecture (YOLOE-v8-seg, demand-loaded VLM, Kalman+PID gimbal):**

1. The Level 1 sweep begins. A single TRT engine runs YOLOE-v8-seg (re-parameterized with fixed classes) plus the existing YOLO detection in a shared engine context. Memory: ~3-3.5GB for the combined engine. Inference: ~13ms (s-size).

2. YOLOE-v8-seg detects footpaths via text prompts ("footpath", "trail") plus visual prompts (reference images of paths). It also detects "road" and "tree row". Concealment-specific classes are handled by visual prompts only.

3. The path segmentation mask is preprocessed: Gaussian blur → binary threshold → morphological closing → skeletonization → branch pruning. Endpoints are extracted and 256×256 ROI crops taken.

4. The MobileNetV3-Small CNN classifies endpoints. Memory: ~50MB TRT engine. Total pipeline (mask preprocessing + skeleton + CNN): ~150ms.

5. A high-confidence detection triggers an operator alert with coordinates. An ambiguous detection (CNN 30-70%) is queued for VLM analysis.

6. VLM analysis runs in **background/batch mode**: the scan controller continues the Level 1 sweep. When a batch of 3-5 ambiguous detections accumulates, or the operator requests deep analysis: pause YOLO TRT → unload the engine → load Moondream-0.5B (816 MiB) → analyze the batch → unload → reload YOLO TRT. Total pause: ~20-40s. The operator receives delayed analysis results.

7. Gimbal: a Kalman filter fuses IMU data for state estimation → PID corrects → the gimbal actuates. Path-following during Level 2 is smoother and compensates for UAV drift.

## Actual Validation Results

Cannot be validated against real-world data. Validation is based on:

- YOLOE-v8-seg TRT deployment on Jetson is proven stable (unlike YOLO26)
- Memory budget: ~3.5GB (YOLO engine) + 0.8GB (Moondream) = 4.3GB peak during the VLM phase, within the 5.2GB usable
- Moondream 0.5B is confirmed to run on a Raspberry Pi — Jetson will be faster
- Kalman+PID gimbal control is standard aerospace engineering

## Counterexamples

1. **VLM delay unacceptable**: If the 20-40s batch VLM delay is unacceptable, Moondream's detect() API could provide a faster binary yes/no (~2-5s for 0.5B) instead of full text generation. Or skip the VLM entirely and rely on the CNN plus operator judgment.

2. **YOLOE-v8-seg accuracy lower than YOLOE-26-seg**: YOLOE-v8 is the older architecture; YOLOE-26 should have better accuracy. Mitigation: use YOLOE-v8 for stable deployment now, switch to YOLOE-26 once the TRT bugs are fixed.

3. **Model switching latency**: Loading/unloading TRT engines adds 2-3s in each direction. For frequent VLM requests this overhead accumulates. Mitigation: batch VLM requests; implement predictive pre-loading.

4. **Single-engine approach limits flexibility**: Merging YOLOE + the existing YOLO into one engine may require re-exporting when classes change. Mitigation: use YOLOE re-parameterization — once classes are fixed, YOLOE becomes a standard YOLO with zero overhead.

## Review Checklist

- [x] Draft conclusions consistent with fact cards
- [x] No important dimensions missed
- [x] No over-extrapolation
- [x] Conclusions actionable/verifiable
- [x] Memory budget calculated from documented values
- [x] TRT deployment risk based on documented bugs
- [ ] Note: YOLOE-v8-seg TRT stability on Jetson not directly tested (inferred from YOLOv8 stability)
- [ ] Note: Moondream 0.5B accuracy for aerial concealment analysis is unknown

## Conclusions Requiring Revision

- The VLM latency target must change from ≤5s to "background batch" (20-40s)
- Consider dropping the VLM entirely for the MVP and adding it later as hardware/software matures
- YOLOE-26 should be replaced with YOLOE-v8 for initial deployment
- The memory architecture needs an explicit budget table in the solution draft