Solution Draft
Assessment Findings
| Old Component Solution | Weak Point (functional/security/performance) | New Solution |
|---|---|---|
| YOLOE-26-seg TRT engine | YOLO26 has confirmed TRT confidence misalignment and INT8 export crashes on Jetson (bugs #23841, Hackster.io report). YOLOE-26 inherits these bugs. | Use YOLOE-v8-seg for initial deployment (proven TRT stability). Transition to YOLOE-26 once Ultralytics fixes TRT issues. |
| Two separate TRT engines (existing YOLO + YOLOE-26) | Combined memory ~5-6GB exceeds usable 5.2GB VRAM. cuDNN overhead ~1GB per engine. | Single merged TRT engine: YOLOE-v8-seg re-parameterized with fixed classes merges into existing YOLO pipeline. One engine, one CUDA context. |
| UAV-VL-R1 (2B) via vLLM ≤5s | TensorRT-LLM does not support Jetson-class edge devices. A 2B VLM runs at ~4.7 tok/s → 10-21s for a useful response. The 2.5GB VLM cannot fit alongside YOLO in memory. | Moondream 0.5B (816 MiB INT4) as primary VLM. Demand-loaded: unload YOLO → load VLM → analyze batch → unload → reload YOLO. Background mode, not real-time. |
| Text prompts for concealment classes | Military concealment classes are far OOD from LVIS/COCO training data. "dugout", "camouflage netting" unlikely to work. | Visual prompts (SAVPE) primary for concealment. Text prompts only for in-distribution classes (footpath, road, trail). Multimodal fusion (text+visual) for robustness. |
| Zhang-Suen skeletonization raw | Noise-sensitive: spurious branches from noisy aerial segmentation masks. | Add preprocessing pipeline: Gaussian blur → threshold → morphological closing → skeletonization → branch pruning (remove < 20px branches). Increase ROI to 256×256. |
| PID-only gimbal control | PID cannot compensate UAV attitude drift and mounting errors during flight. | Kalman filter + PID cascade: Kalman estimates state from IMU → PID corrects error → gimbal actuates. |
| 1500 images/class in 8 weeks | Optimistic for military concealment data collection. Access constraints, annotation complexity. | 300-500 real + 1000+ synthetic (GenCAMO/CamouflageAnything) per class. Active learning loop from YOLOE zero-shot. |
| No security measures | Small edge YOLO models vulnerable to adversarial patches. Physical device capture risk. No data protection. | Three layers: PatchBlock adversarial defense, encrypted model weights at rest, auto-wipe on tamper. |
Product Solution Description
A three-tier semantic detection system for identifying concealed/camouflaged positions from reconnaissance UAV aerial imagery, running on Jetson Orin Nano Super alongside the existing YOLO detection pipeline. Redesigned for the 5.2GB usable VRAM budget with demand-loaded VLM.
┌─────────────────────────────────────────────────────────────────────────┐
│ JETSON ORIN NANO SUPER │
│ (5.2 GB usable VRAM) │
│ │
│ ┌──────────┐ ┌──────────────────────┐ ┌───────────────────────┐ │
│ │ ViewPro │───▶│ Tier 1 │───▶│ Tier 2 │ │
│ │ A40 │ │ Merged TRT Engine │ │ Path Preprocessing │ │
│ │ Camera │ │ YOLOE-v8-seg │ │ + Skeletonization │ │
│ │ │ │ + Existing YOLO │ │ + MobileNetV3-Small │ │
│ │ │ │ ≤15ms │ │ ≤200ms │ │
│ └────▲─────┘ └──────────────────────┘ └───────────┬───────────┘ │
│ │ │ │
│ ┌────┴─────┐ ┌──────────────┐ │ ambiguous │
│ │ Gimbal │◀───│ Scan │ ▼ │
│ │ Kalman │ │ Controller │ ┌───────────────────┐ │
│ │ + PID │ │ (L1/L2) │ │ VLM Queue │ │
│ └──────────┘ └──────────────┘ │ (batch when ≥3 │ │
│ │ or on demand) │ │
│ ┌──────────────────────────────┐ └────────┬──────────┘ │
│ │ PatchBlock Adversarial │ │ │
│ │ Defense (CPU preprocessing)│ [demand-load cycle] │
│ └──────────────────────────────┘ ┌────────▼──────────┐ │
│ │ Tier 3 │ │
│ │ Moondream 0.5B │ │
│ │ 816 MiB INT4 │ │
│ │ ~5-10s per image │ │
│ └───────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
The system operates in two scan levels:
- Level 1 (Wide Sweep): Camera at medium zoom. Merged TRT engine runs YOLOE-v8-seg (visual + text prompts) and existing YOLO detection simultaneously. POIs queued by confidence.
- Level 2 (Detailed Scan): Camera zooms into POI. Path preprocessing → skeletonization → endpoint CNN. High-confidence → immediate alert. Ambiguous → VLM queue.
- VLM Batch Analysis: When queue reaches 3+ detections or operator requests: scan pauses, YOLO engine unloads, Moondream loads, batch analyzes, unloads, YOLO reloads. ~30-45s total cycle.
Three submodules: (1) Semantic Detection AI, (2) Camera Gimbal Control, (3) Integration with existing detections service.
Memory Budget
| Component | Mode | GPU Memory | Notes |
|---|---|---|---|
| OS + System | Always | ~2.4 GB | From 8GB total, leaves 5.2GB usable |
| Merged TRT Engine (YOLOE-v8-seg + YOLO) | Detection mode | ~2.8 GB | Single engine, shared CUDA context |
| MobileNetV3-Small TRT (FP16) | Detection mode | ~50 MB | Tiny binary classifier |
| OpenCV + NumPy buffers | Always | ~200 MB | Frame buffers, masks |
| PatchBlock defense | Always | ~50 MB | CPU-based, minimal GPU |
| Total in Detection Mode | Detection mode | ~3.1 GB | 1.7 GB headroom |
| Moondream 0.5B INT4 | VLM mode | ~816 MB | Demand-loaded |
| vLLM overhead + KV cache | VLM mode | ~500 MB | Minimal for 0.5B model |
| Total in VLM Mode | VLM mode | ~1.6 GB | After unloading TRT engine |
Architecture
Component 1: Tier 1 — Real-Time Detection (YOLOE-v8-seg, merged engine)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| YOLOE-v8-seg re-parameterized (recommended) | yoloe-v8s-seg.pt, Ultralytics, TensorRT FP16 | Proven TRT stability on Jetson. Zero inference overhead when re-parameterized. Visual+text multimodal fusion. Merges into existing YOLO engine. | Older architecture than YOLO26 (slightly lower base accuracy). | Ultralytics ≥8.4, TensorRT, JetPack 6.2 | PatchBlock CPU preprocessing | ~13ms FP16 (s-size) | Best fit for stable deployment. |
| YOLOE-26-seg (future upgrade) | yoloe-26s-seg.pt, TensorRT | Better accuracy (YOLO26 architecture). NMS-free. | Active TRT bugs on Jetson: confidence misalignment, INT8 crash. | Wait for Ultralytics fix | Same | ~7ms FP16 (estimated) | Future upgrade when TRT bugs resolved. |
| YOLO26-Seg custom-trained (production) | yolo26s-seg.pt fine-tuned | Highest accuracy for known classes. | Requires 1500+ annotated images/class. Same TRT bugs. | Custom dataset, GPU for training | Same | ~7ms FP16 | Long-term production model. |
Prompt strategy (revised):
Text prompts (in-distribution classes only):
"footpath", "trail", "path", "road", "track", "tree row", "tree line", "clearing"
Visual prompts (SAVPE, for concealment-specific detection):
- Reference images cropped from semantic01-04.png: branch piles, dark entrances, dugout structures
- Use multimodal fusion mode: `concat` (zero overhead)

A minimal usage sketch (`frame` is the current camera frame; `x1, y1, x2, y2` are placeholder coordinates of the reference box in `reference_hideout.jpg`):

```python
import numpy as np
from ultralytics import YOLOE

model = YOLOE("yoloe-v8s-seg.pt")

# Text prompts for in-distribution classes only
text_classes = ["footpath", "trail", "road", "tree row", "clearing"]
model.set_classes(text_classes)

# Visual prompts (SAVPE) from a reference image, fused with text prompts
results = model.predict(
    frame,
    conf=0.15,  # deliberately low threshold; Tiers 2-3 filter false positives
    refer_image="reference_hideout.jpg",
    visual_prompts={"bboxes": np.array([[x1, y1, x2, y2]]), "cls": np.array([0])},
    fusion_mode="concat",
)
```
Re-parameterization for production: Once classes are fixed after training, re-parameterize YOLOE-v8 to standard YOLOv8 weights. This eliminates the open-vocabulary overhead entirely, and the model becomes a regular YOLO inference engine. Merge with existing YOLO detection into a single TRT engine using TensorRT's multi-model support or batch inference.
Component 2: Tier 2 — Spatial Reasoning & CNN Confirmation
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Robust path tracing + CNN classifier (recommended) | OpenCV, scikit-image, MobileNetV3-Small TRT | Preprocessing removes noise. Branch pruning eliminates artifacts. 256×256 ROI for better context. | Still depends on segmentation quality. | OpenCV, scikit-image, PyTorch → TRT | Offline inference | ~150ms total | Best fit. Robust against noisy masks. |
| GraphMorph centerline extraction | PyTorch, custom model | Topology-aware. Reduces false positives. | Requires additional model in memory. More complex integration. | PyTorch, custom training | Offline | ~200ms estimated | Upgrade path if basic approach fails |
| Heuristic rules only | OpenCV, NumPy | No training data. Immediate. | Brittle. Cannot generalize. | None | Offline | ~50ms | Baseline/fallback for day-1 |
Revised path tracing pipeline:
- Take footpath segmentation mask from Tier 1
- Preprocessing: Gaussian blur (σ=1.5) → binary threshold (Otsu) → morphological closing (5×5 kernel, 2 iterations) → remove small connected components (< 100px area)
- Skeletonize using Zhang-Suen algorithm
- Branch pruning: Remove skeleton branches shorter than 20 pixels (noise artifacts)
- Detect endpoints using hit-miss morphological operations (8 kernel patterns)
- Detect junctions using branch-point kernels
- Trace path segments between junctions/endpoints
- For each endpoint: extract 256×256 ROI crop centered on endpoint from original image
- Feed ROI crop to MobileNetV3-Small binary classifier
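The endpoint/junction step of the pipeline above can be illustrated with a simple 8-neighbour count on the skeleton — a cruder stand-in for the 8-kernel hit-miss bank, but it shows the classification rule (pure NumPy, so it runs anywhere):

```python
import numpy as np

def classify_skeleton_pixels(skel):
    """Classify pixels of a binary skeleton by 8-neighbour count:
    1 neighbour -> endpoint, >= 3 neighbours -> junction.
    Simplified stand-in for the hit-miss kernel bank."""
    skel = skel.astype(bool)
    padded = np.pad(skel.astype(np.uint8), 1)
    endpoints, junctions = [], []
    for y, x in zip(*np.nonzero(skel)):
        # 3x3 window around the pixel (padded coords); subtract the centre pixel
        neighbours = padded[y:y + 3, x:x + 3].sum() - 1
        if neighbours == 1:
            endpoints.append((y, x))
        elif neighbours >= 3:
            junctions.append((y, x))
    return endpoints, junctions
```

In production this would run after branch pruning, so short noise spurs never reach the endpoint list.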
Freshness assessment (unchanged from draft01, validated approach):
- Edge sharpness, contrast ratio, fill ratio, path width consistency
- Initial hand-tuned thresholds → Random Forest with annotated data
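A sketch of the hand-tuned feature extraction, assuming a grayscale ROI and its path mask. `freshness_features` is a hypothetical helper, not an existing API; edge sharpness is omitted here because it needs an OpenCV Sobel pass:

```python
import numpy as np

def freshness_features(gray_roi, path_mask):
    """Hand-tuned freshness features for a path ROI.

    Hypothetical helper: thresholds on these features are the day-1
    heuristic, later replaced by a Random Forest on annotated data.
    """
    path_mask = path_mask.astype(bool)
    path = gray_roi[path_mask].astype(float)
    background = gray_roi[~path_mask].astype(float)
    # Contrast between the path and its surroundings (fresh paths stand out)
    contrast_ratio = (path.mean() + 1.0) / (background.mean() + 1.0)
    # Fill ratio: mask area over bounding-box area (fragmented masks = older)
    ys, xs = np.nonzero(path_mask)
    bbox_area = (ys.ptp() + 1) * (xs.ptp() + 1)
    fill_ratio = path_mask.sum() / bbox_area
    # Width consistency: coefficient of variation of per-row path width
    widths = path_mask.sum(axis=1)
    widths = widths[widths > 0]
    width_consistency = widths.std() / (widths.mean() + 1e-6)
    return {
        "contrast_ratio": contrast_ratio,
        "fill_ratio": fill_ratio,
        "width_consistency": width_consistency,
    }
```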
Component 3: Tier 3 — VLM Deep Analysis (Background Batch Mode)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Moondream 0.5B INT4 demand-loaded (recommended) | Moondream, ONNX/PyTorch, INT4 | 816 MiB memory. Built-in detect()/point() APIs. Runs on Raspberry Pi. | Weaker reasoning than 2B models. Not aerial-specialized. | ONNX Runtime or PyTorch | Local only | ~2-5s per image (0.5B) | Best fit for memory-constrained edge. |
| SmolVLM2-500M | HuggingFace, ONNX | 1.8GB. Small. ONNX support. | Less capable than Moondream for detection. No detect() API. | ONNX Runtime | Local only | ~3-7s estimated | Alternative if Moondream underperforms |
| UAV-VL-R1 (2B) demand-loaded | vLLM, W4A16 | Aerial-specialized. Best reasoning for UAV imagery. | 2.5GB INT8. ~10-21s per analysis. Tight memory fit. | vLLM, W4A16 weights | Local only | ~10-21s | Upgrade path if Moondream insufficient. |
| No VLM | N/A | Simplest. Most memory. Zero latency impact. | No fallback for ambiguous CNN outputs. No explanations. | None | N/A | N/A | Viable MVP if Tier 1+2 accuracy is sufficient. |
Demand-loading protocol:
1. VLM queue reaches threshold (≥3 detections or operator request)
2. Scan controller transitions to HOLD state (camera fixed position)
3. Signal main process to unload TRT engine
4. Wait for GPU memory release (~1s)
5. Launch VLM process: load Moondream 0.5B INT4
6. Process all queued detections sequentially (~2-5s each)
7. Collect results, send to operator
8. Unload VLM, release GPU memory
9. Reload TRT engine (~2s)
10. Resume scan from HOLD position
Total cycle: ~30-45s for 3-5 detections
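The protocol above reduces to a small state machine. A minimal sketch (the GPU load/unload callbacks are injected, so the manager itself stays hardware-independent; all names here are illustrative, not an existing API):

```python
from enum import Enum, auto

class VLMState(Enum):
    IDLE = auto()
    LOADING = auto()
    ANALYZING = auto()
    UNLOADING = auto()

class VLMManager:
    """Queues ambiguous detections and runs one demand-load cycle
    when the batch threshold is reached."""

    def __init__(self, batch_threshold=3):
        self.state = VLMState.IDLE
        self.queue = []
        self.batch_threshold = batch_threshold

    def enqueue(self, detection):
        """Add a detection; return True when a batch cycle should start."""
        self.queue.append(detection)
        return len(self.queue) >= self.batch_threshold

    def run_cycle(self, unload_trt, load_vlm, analyze, unload_vlm, reload_trt):
        # Steps 3-5: free the TRT engine, then bring up the VLM
        self.state = VLMState.LOADING
        unload_trt()
        vlm = load_vlm()
        # Step 6: process queued detections sequentially
        self.state = VLMState.ANALYZING
        results = [analyze(vlm, d) for d in self.queue]
        self.queue.clear()
        # Steps 8-9: tear down the VLM and restore the detection engine
        self.state = VLMState.UNLOADING
        unload_vlm(vlm)
        reload_trt()
        self.state = VLMState.IDLE
        return results
```

The scan controller would call `run_cycle` only after entering HOLD, and resume Level 1/2 scanning once the state returns to `IDLE`.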
VLM prompting strategy (adapted for Moondream's capabilities):
Using the detect() API for a fast binary check:

```python
model.detect(image, "concealed military position")
model.detect(image, "dugout covered with branches")
```

Using caption() for detailed analysis:

```python
model.caption(image, length="normal")
```
Component 4: Camera Gimbal Control
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Kalman+PID cascade with ViewLink (recommended) | pyserial, ViewLink V3.3.3, filterpy (Kalman), servopilot (PID) | Compensates UAV attitude drift. Proven in aerospace. Smooth path-following. | More complex than PID-only. Requires IMU data feed. | ViewPro A40, pyserial, IMU data access | Physical only | <10ms command latency | Best fit. Flight-grade control. |
| PID-only with ViewLink | pyserial, ViewLink V3.3.3, servopilot | Simple. Works for hovering UAV. | Drifts during flight. Cannot compensate mounting errors. | ViewPro A40, pyserial | Physical only | <10ms | Acceptable for testing only |
Revised control architecture:
UAV IMU Data ──▶ Kalman Filter ──▶ State Estimate (attitude, angular velocity)
Camera Frame ──▶ Detection ──▶ Target Position

State Estimate + Target Position ──▶ Error Calculation
Error Calculation ──▶ PID Controller ──▶ Gimbal Command ──▶ ViewLink Serial
Kalman filter design:
- State vector: [yaw, pitch, yaw_rate, pitch_rate]
- Measurement inputs: IMU gyroscope (yaw_rate, pitch_rate), detection-derived angles
- Process model: constant angular velocity with noise
Scan patterns (unchanged from draft01): sinusoidal yaw oscillation, POI queue management.
Path-following (revised): Kalman-filtered state estimate provides smoother tracking. PID gains can be lower (less aggressive) because state estimate is already stabilized. Update rate: tied to detection frame rate.
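A minimal NumPy sketch of the cascade's estimator (in practice filterpy would supply this; the sketch assumes all four states are directly measurable — angles from detection geometry, rates from the IMU gyro — which is a simplification of the real measurement model):

```python
import numpy as np

class GimbalKalman:
    """Constant-angular-velocity Kalman filter over
    [yaw, pitch, yaw_rate, pitch_rate]."""

    def __init__(self, dt=0.05, q=1e-3, r=1e-2):
        self.x = np.zeros(4)          # state estimate
        self.P = np.eye(4)            # estimate covariance
        self.F = np.eye(4)            # constant-velocity process model
        self.F[0, 2] = self.F[1, 3] = dt
        self.Q = q * np.eye(4)        # process noise
        self.H = np.eye(4)            # simplifying assumption: all states measured
        self.R = r * np.eye(4)        # measurement noise

    def step(self, z):
        # Predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x
```

The PID stage then acts on the filtered state rather than on raw detections, which is what allows the less aggressive gains mentioned above.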
Component 5: Integration with Existing Detections Service
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Single merged Cython+TRT process + demand-loaded VLM (recommended) | Cython, TensorRT, ONNX Runtime | Single TRT engine. Minimal memory. VLM isolated. | VLM loading pauses detection (30-45s). | Cython extensions, process management | Process isolation + encryption | Minimal overhead | Best fit for 5.2GB VRAM. |
Revised integration architecture:
┌───────────────────────────────────────────────────────────────────┐
│ Main Process (Cython + TRT) │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Single Merged TRT Engine │ │
│ │ ├─ Existing YOLO Detection heads │ │
│ │ ├─ YOLOE-v8-seg (re-parameterized) │ │
│ │ └─ MobileNetV3-Small classifier │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────────┐│
│ │ Path Tracing │ │ Scan │ │ PatchBlock Defense ││
│ │ + Skeleton │ │ Controller │ │ (CPU parallel) ││
│ │ (CPU) │ │ + Kalman+PID │ │ ││
│ └──────────────┘ └──────────────┘ └──────────────────────────┘│
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ VLM Manager │ │
│ │ state: IDLE | LOADING | ANALYZING | UNLOAD │ │
│ │ queue: [detection_1, detection_2, ...] │ │
│ └──────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
VLM mode (demand-loaded, replaces TRT engine temporarily):
┌───────────────────────────────────────────────────────────────────┐
│ ┌──────────────────────────────────────────────┐ │
│ │ Moondream 0.5B INT4 │ │
│ │ (ONNX Runtime or PyTorch) │ │
│ └──────────────────────────────────────────────┘ │
│ Detection paused. Camera in HOLD state. │
└───────────────────────────────────────────────────────────────────┘
Data flow (revised):
- PatchBlock preprocesses frame on CPU (parallel with GPU inference)
- Cleaned frame → merged TRT engine → YOLO detections + YOLOE-v8 semantic detections
- Semantic detections → path preprocessing → skeletonization → endpoint extraction → CNN
- High-confidence → operator alert (coordinates + bounding box + confidence)
- Ambiguous → VLM queue
- VLM queue management: batch-process when queue ≥ 3 or operator triggers
- During VLM mode: detection paused, camera holds, operator notified of pause
GPU scheduling (revised): No concurrent multi-model GPU sharing. Single TRT engine runs during detection mode. VLM demand-loaded exclusively during analysis mode. This eliminates the 10-40% latency jitter from GPU sharing.
Component 6: Security
| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Three-layer security (recommended) | PatchBlock, LUKS/dm-crypt, tmpfs | Adversarial defense + model protection + data protection | Adds ~5ms CPU overhead for PatchBlock | PatchBlock library, Linux crypto | Full stack | Minimal GPU impact | Required for military edge deployment. |
Layer 1: Adversarial Input Defense
- PatchBlock CPU preprocessing on every frame before GPU inference
- Detects anomalous patches via outlier detection and dimensionality reduction
- Recovers up to 77% accuracy under adversarial attack
- Runs in parallel with GPU inference (the ~5ms CPU cost overlaps the GPU pass, so end-to-end pipeline latency is unchanged)
Layer 2: Model & Weight Protection
- TRT engine files encrypted at rest using LUKS on a dedicated partition
- At boot: decrypt into tmpfs (RAM disk) — never written to persistent storage unencrypted
- Secure boot chain via Jetson's secure boot (fuse-based, hardware root of trust)
- If device is captured powered-off: encrypted models, no plaintext weights accessible
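A boot-script sketch of the decrypt-into-tmpfs flow. This is a configuration fragment, not a tested script; the device path, mapper name, mount points, and key location are placeholders for illustration:

```shell
#!/bin/sh
# Boot-time model decryption: plaintext engine files exist only in RAM.
# Placeholder names: /dev/nvme0n1p3, models_crypt, /etc/keys/models.key
set -e

mkdir -p /run/models /mnt/models_enc
mount -t tmpfs -o size=512M,mode=0700 tmpfs /run/models   # RAM disk

# Open the LUKS partition holding the encrypted TRT engines
cryptsetup open /dev/nvme0n1p3 models_crypt --key-file /etc/keys/models.key
mount -o ro /dev/mapper/models_crypt /mnt/models_enc

cp /mnt/models_enc/*.engine /run/models/   # copy into tmpfs only

# Close the encrypted source; nothing plaintext remains on disk
umount /mnt/models_enc
cryptsetup close models_crypt
```

On tamper detection, wiping amounts to unmounting the tmpfs and destroying the key file, since no plaintext weights ever touch persistent storage.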
Layer 3: Operational Data Protection
- Captured imagery stored in encrypted circular buffer (last N minutes only)
- Detection logs (coordinates, confidence, timestamps) encrypted at rest
- Over datalink: transmit only coordinates + confidence + small thumbnail (not raw frames)
- Tamper detection: if enclosure opened or unauthorized boot detected → auto-wipe keys + detection logs
Training & Data Strategy (Revised)
Phase 1: Zero-shot (Week 1-2)
- Deploy YOLOE-v8-seg with multimodal prompts (text for paths, visual for concealment)
- Use semantic01-04.png as visual prompt references via SAVPE
- Tune confidence thresholds per class type
- Collect false positive/negative data for annotation
- Benchmark YOLOE-v8-seg TRT on Jetson: confirm inference time, memory, stability
Phase 2: Annotation & Fine-tuning (Week 3-8)
- Annotate collected real data (target: 300-500 images/class)
- Generate 1000+ synthetic images per class using GenCAMO/CamouflageAnything
- Priority: footpaths (segmentation) → branch piles (bboxes) → entrances (bboxes)
- Active learning: YOLOE zero-shot flags candidates → human reviews → annotates
- Fine-tune YOLOv8-Seg (or YOLO26-Seg if TRT fixed) on real + synthetic dataset
- Use linear probing first, then full fine-tuning
Phase 3: CNN classifier (Week 4-8, parallel with Phase 2)
- Train MobileNetV3-Small on ROI crops: 256×256 from endpoint analysis
- Positive: annotated concealed positions + synthetic. Negative: natural termini, random terrain
- Target: 200+ real positive + 500+ synthetic positive, 1000+ negative
- Export to TensorRT FP16
Phase 4: VLM integration (Week 8-12)
- Deploy Moondream 0.5B INT4 in demand-loaded mode
- Test demand-load cycle timing: measure unload → load → infer → unload → reload
- Tune detect() prompts and caption prompts on collected ambiguous cases
- If Moondream accuracy insufficient: test UAV-VL-R1 (2B) demand-loaded
- If YOLO26 TRT bugs fixed: test YOLOE-26-seg as Tier 1 upgrade
Phase 5: Seasonal expansion (Month 3+)
- Winter data → spring/summer annotation campaigns
- Re-train all models with multi-season data + seasonal synthetic augmentation
Testing Strategy
Integration / Functional Tests
- YOLOE-v8-seg multimodal prompt detection on reference images — verify text+visual fusion
- TRT engine stability test: 1000 consecutive inferences without confidence drift
- Path preprocessing pipeline on synthetic noisy masks — verify cleaning + skeletonization
- Branch pruning: verify short spurious branches removed, real path branches preserved
- CNN classifier on known positive/negative 256×256 ROI crops
- Demand-load VLM cycle: measure timing of unload TRT → load Moondream → infer → unload → reload TRT
- Memory monitoring during demand-load: confirm no memory leak across 10+ cycles
- Kalman+PID gimbal control with simulated IMU data — verify drift compensation
- Full pipeline: frame → PatchBlock → YOLOE-v8 → path tracing → CNN → (VLM) → alert
- Scan controller: Level 1 → Level 2 → HOLD (for VLM) → resume Level 1
Non-Functional Tests
- Tier 1 latency: YOLOE-v8-seg TRT FP16 ≤15ms on Jetson Orin Nano Super
- Tier 2 latency: preprocessing + skeletonization + CNN ≤200ms
- VLM demand-load cycle: ≤45s for 3 detections (including load/unload overhead)
- Memory profiling: peak detection mode ≤3.5GB GPU, peak VLM mode ≤2.0GB GPU
- Thermal stress test: 30+ minutes continuous detection without thermal throttling
- PatchBlock adversarial test: inject test adversarial patches, measure accuracy recovery
- False positive/negative rate on annotated reference images
- Gimbal path-following accuracy with and without Kalman filter (measure improvement)
- Demand-load memory leak test: 50+ VLM cycles without memory growth
References
- YOLOE-v8 docs: https://docs.ultralytics.com/models/yoloe/
- YOLOE-26 paper: https://arxiv.org/abs/2602.00168
- YOLO26 TRT confidence bug: https://www.hackster.io/qwe018931/pushing-limits-yolov8-vs-v26-on-jetson-orin-nano-b89267
- YOLO26 INT8 crash: https://github.com/ultralytics/ultralytics/issues/23841
- YOLOE multimodal fusion: https://github.com/ultralytics/ultralytics/pull/21966
- Jetson Orin Nano Super memory: https://forums.developer.nvidia.com/t/jetson-orin-nano-super-insufficient-gpu-memory/330777
- Multi-model survey on Orin Nano: https://dev.to/ankk98/multi-model-ai-resource-allocation-for-humanoid-robots-a-survey-on-jetson-orin-nano-super-310i
- TRT multiple engines: https://github.com/NVIDIA/TensorRT/issues/4358
- TRT memory on Jetson: https://github.com/ultralytics/ultralytics/issues/21562
- Moondream: https://moondream.ai/blog/introducing-moondream-0-5b
- Cosmos-Reason2-2B Jetson benchmark: https://www.thenextgentechinsider.com/pulse/cosmos-reason2-runs-on-jetson-orin-nano-super-with-w4a16-quantization
- Jetson AI Lab benchmarks: https://www.jetson-ai-lab.com/tutorials/genai-benchmarking/
- Jetson LLM bottleneck: https://ericxliu.me/posts/benchmarking-llms-on-jetson-orin-nano/
- vLLM on Jetson: https://learnopencv.com/deployment-on-edge-vllm-on-jetson/
- TRT-LLM no edge support: https://github.com/NVIDIA/TensorRT-LLM/issues/7978
- PatchBlock defense: https://arxiv.org/abs/2601.00367
- Adversarial patches on YOLO: https://link.springer.com/article/10.1007/s10207-025-01067-3
- GenCAMO synthetic data: https://arxiv.org/abs/2601.01181
- CamouflageAnything (CVPR 2025): https://openaccess.thecvf.com/content/CVPR2025/html/Das_Camouflage_Anything_...
- GraphMorph centerlines: https://arxiv.org/pdf/2502.11731
- Learnable skeleton + SAM: https://ui.adsabs.harvard.edu/abs/2025ITGRS..63S1458X
- Kalman filter gimbal: https://ieeexplore.ieee.org/ielx7/6287639/10005208/10160027.pdf
- UAV-VL-R1: https://arxiv.org/pdf/2508.11196
- ViewPro Protocol: https://www.viewprotech.com/index.php?ac=article&at=read&did=510
- servopilot PID: https://pypi.org/project/servopilot/
Related Artifacts
- AC assessment: _docs/00_research/00_ac_assessment.md
- Previous draft: _docs/01_solution/solution_draft01.md