detections-semantic/_docs/01_solution/solution_draft02.md
Oleksandr Bezdieniezhnykh, Initial commit, 2026-03-26 00:20:30 +02:00

Solution Draft

Assessment Findings

| Old Component | Weak Point (functional/security/performance) | New Solution |
|---|---|---|
| YOLOE-26-seg TRT engine | YOLO26 has confirmed TRT confidence misalignment and INT8 export crashes on Jetson (bug #23841, Hackster.io report); YOLOE-26 inherits these bugs. | Use YOLOE-v8-seg for initial deployment (proven TRT stability); transition to YOLOE-26 once Ultralytics fixes the TRT issues. |
| Two separate TRT engines (existing YOLO + YOLOE-26) | Combined memory of ~5-6 GB exceeds the usable 5.2 GB VRAM; cuDNN overhead is ~1 GB per engine. | Single merged TRT engine: YOLOE-v8-seg, re-parameterized with fixed classes, merges into the existing YOLO pipeline. One engine, one CUDA context. |
| UAV-VL-R1 (2B) via vLLM, ≤5 s | TRT-LLM does not support this edge platform; a 2B VLM at ~4.7 tok/s needs 10-21 s for a useful response, and at 2.5 GB it cannot fit alongside YOLO in memory. | Moondream 0.5B (816 MiB INT4) as primary VLM, demand-loaded: unload YOLO → load VLM → analyze batch → unload → reload YOLO. Background mode, not real-time. |
| Text prompts for concealment classes | Military concealment classes are far out-of-distribution from LVIS/COCO training data; "dugout" or "camouflage netting" are unlikely to work. | Visual prompts (SAVPE) as primary for concealment; text prompts only for in-distribution classes (footpath, road, trail). Multimodal fusion (text + visual) for robustness. |
| Raw Zhang-Suen skeletonization | Noise-sensitive: spurious branches from noisy aerial segmentation masks. | Add a preprocessing pipeline: Gaussian blur → threshold → morphological closing → skeletonization → branch pruning (remove branches < 20 px). Increase ROI to 256×256. |
| PID-only gimbal control | PID cannot compensate for UAV attitude drift and mounting errors during flight. | Kalman filter + PID cascade: Kalman estimates state from IMU → PID corrects the error → gimbal actuates. |
| 1500 images/class in 8 weeks | Optimistic for military concealment data collection: access constraints, annotation complexity. | 300-500 real + 1000+ synthetic images (GenCAMO/CamouflageAnything) per class; active-learning loop seeded from YOLOE zero-shot. |
| No security measures | Small edge YOLO models are vulnerable to adversarial patches; physical device-capture risk; no data protection. | Three layers: PatchBlock adversarial defense, encrypted model weights at rest, auto-wipe on tamper. |

Product Solution Description

A three-tier semantic detection system for identifying concealed/camouflaged positions from reconnaissance UAV aerial imagery, running on Jetson Orin Nano Super alongside the existing YOLO detection pipeline. Redesigned for the 5.2GB usable VRAM budget with demand-loaded VLM.

┌─────────────────────────────────────────────────────────────────────────┐
│                        JETSON ORIN NANO SUPER                          │
│                        (5.2 GB usable VRAM)                            │
│                                                                        │
│  ┌──────────┐    ┌──────────────────────┐    ┌───────────────────────┐ │
│  │ ViewPro  │───▶│  Tier 1              │───▶│  Tier 2               │ │
│  │ A40      │    │  Merged TRT Engine   │    │  Path Preprocessing   │ │
│  │ Camera   │    │  YOLOE-v8-seg        │    │  + Skeletonization    │ │
│  │          │    │  + Existing YOLO     │    │  + MobileNetV3-Small  │ │
│  │          │    │  ≤15ms               │    │  ≤200ms               │ │
│  └────▲─────┘    └──────────────────────┘    └───────────┬───────────┘ │
│       │                                                  │             │
│  ┌────┴─────┐    ┌──────────────┐                        │ ambiguous   │
│  │ Gimbal   │◀───│  Scan        │                        ▼             │
│  │ Kalman   │    │  Controller  │              ┌───────────────────┐   │
│  │ + PID    │    │  (L1/L2)     │              │ VLM Queue         │   │
│  └──────────┘    └──────────────┘              │ (batch when ≥3    │   │
│                                                │  or on demand)    │   │
│  ┌──────────────────────────────┐              └────────┬──────────┘   │
│  │  PatchBlock Adversarial     │                        │              │
│  │  Defense (CPU preprocessing)│              [demand-load cycle]      │
│  └──────────────────────────────┘              ┌────────▼──────────┐   │
│                                                │ Tier 3            │   │
│                                                │ Moondream 0.5B    │   │
│                                                │ 816 MiB INT4      │   │
│                                                │ ~5-10s per image  │   │
│                                                └───────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘

The system operates in two scan levels:

  • Level 1 (Wide Sweep): Camera at medium zoom. Merged TRT engine runs YOLOE-v8-seg (visual + text prompts) and existing YOLO detection simultaneously. POIs queued by confidence.
  • Level 2 (Detailed Scan): Camera zooms into POI. Path preprocessing → skeletonization → endpoint CNN. High-confidence → immediate alert. Ambiguous → VLM queue.
  • VLM Batch Analysis: When queue reaches 3+ detections or operator requests: scan pauses, YOLO engine unloads, Moondream loads, batch analyzes, unloads, YOLO reloads. ~30-45s total cycle.
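The confidence-ordered POI queue from Level 1 can be sketched as a small priority queue (a minimal sketch: `POIQueue` and its fields are illustrative names, not the existing service's API):

```python
import heapq
import itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class POI:
    sort_key: float            # heapq is a min-heap, so store negated confidence
    seq: int                   # insertion counter breaks confidence ties FIFO
    label: str = field(compare=False)
    bbox: tuple = field(compare=False)

class POIQueue:
    """Confidence-ordered queue of points of interest from Level 1 sweeps."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, label, bbox, confidence):
        heapq.heappush(self._heap, POI(-confidence, next(self._counter), label, bbox))

    def pop(self):
        poi = heapq.heappop(self._heap)
        return poi.label, poi.bbox, -poi.sort_key

q = POIQueue()
q.push("footpath", (10, 20, 50, 60), 0.42)
q.push("branch pile", (100, 40, 140, 90), 0.81)
label, bbox, conf = q.pop()  # highest-confidence POI comes out first
```

The Level 2 scan controller would pop from this queue between sweeps and hand each POI to the zoom/skeletonization stage.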

Three submodules: (1) Semantic Detection AI, (2) Camera Gimbal Control, (3) Integration with existing detections service.

Memory Budget

| Component | Mode | GPU Memory | Notes |
|---|---|---|---|
| OS + system | Always | ~2.4 GB | From 8 GB total; leaves 5.2 GB usable |
| Merged TRT engine (YOLOE-v8-seg + YOLO) | Detection mode | ~2.8 GB | Single engine, shared CUDA context |
| MobileNetV3-Small TRT (FP16) | Detection mode | ~50 MB | Tiny binary classifier |
| OpenCV + NumPy buffers | Always | ~200 MB | Frame buffers, masks |
| PatchBlock defense | Always | ~50 MB | CPU-based, minimal GPU |
| **Total in detection mode** | | ~3.1 GB | ~2.1 GB headroom (1.7 GB against the 3.5 GB peak budget) |
| Moondream 0.5B INT4 | VLM mode | ~816 MB | Demand-loaded |
| vLLM overhead + KV cache | VLM mode | ~500 MB | Minimal for a 0.5B model |
| **Total in VLM mode** | | ~1.6 GB | After unloading the TRT engine |

Architecture

Component 1: Tier 1 — Real-Time Detection (YOLOE-v8-seg, merged engine)

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| YOLOE-v8-seg re-parameterized (recommended) | yoloe-v8s-seg.pt, Ultralytics, TensorRT FP16 | Proven TRT stability on Jetson; zero inference overhead when re-parameterized; visual + text multimodal fusion; merges into existing YOLO engine | Older architecture than YOLO26 (slightly lower base accuracy) | Ultralytics ≥8.4, TensorRT, JetPack 6.2 | PatchBlock CPU preprocessing | ~13 ms FP16 (s-size) | Best fit for stable deployment |
| YOLOE-26-seg (future upgrade) | yoloe-26s-seg.pt, TensorRT | Better accuracy (YOLO26 architecture); NMS-free | Active TRT bugs on Jetson: confidence misalignment, INT8 crash | Wait for Ultralytics fix | Same | ~7 ms FP16 (estimated) | Future upgrade once TRT bugs are resolved |
| YOLO26-Seg custom-trained (production) | yolo26s-seg.pt fine-tuned | Highest accuracy for known classes | Requires 1500+ annotated images/class; same TRT bugs | Custom dataset, GPU for training | Same | ~7 ms FP16 | Long-term production model |

Prompt strategy (revised):

Text prompts (in-distribution classes only):

  • "footpath", "trail", "path", "road", "track"
  • "tree row", "tree line", "clearing"

Visual prompts (SAVPE, for concealment-specific detection):

  • Reference images cropped from semantic01-04.png: branch piles, dark entrances, dugout structures
  • Use multimodal fusion mode: concat (zero overhead)
```python
import numpy as np
from ultralytics import YOLOE

model = YOLOE("yoloe-v8s-seg.pt")

# Text prompts: in-distribution path/terrain classes only.
# YOLOE's set_classes also takes the text prompt embeddings.
text_classes = ["footpath", "trail", "road", "tree row", "clearing"]
model.set_classes(text_classes, model.get_text_pe(text_classes))

# Visual prompt: bounding box of a reference hideout crop.
# Placeholder coordinates; use the real reference-crop box in practice.
x1, y1, x2, y2 = 100, 100, 300, 300

# frame: current camera image (numpy array) from the capture pipeline.
# Exact keyword arguments depend on the installed Ultralytics version.
results = model.predict(
    frame,
    conf=0.15,                              # low threshold for far-OOD classes
    refer_image="reference_hideout.jpg",
    visual_prompts={"bboxes": np.array([[x1, y1, x2, y2]]), "cls": np.array([0])},
    fusion_mode="concat",                   # multimodal fusion, zero overhead
)
```

Re-parameterization for production: Once classes are fixed after training, re-parameterize YOLOE-v8 to standard YOLOv8 weights. This eliminates the open-vocabulary overhead entirely, and the model becomes a regular YOLO inference engine. Merge with existing YOLO detection into a single TRT engine using TensorRT's multi-model support or batch inference.
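As a sketch of the production step above, once the re-parameterized checkpoint is saved (here under the hypothetical filename `yoloe-v8s-seg-fixed.pt`), the Ultralytics CLI can build the FP16 TensorRT engine; run it on the Jetson itself so TensorRT optimizes for the Orin GPU:

```shell
# yoloe-v8s-seg-fixed.pt: hypothetical name for the checkpoint saved
# after set_classes() fixed the vocabulary.
yolo export model=yoloe-v8s-seg-fixed.pt format=engine half=True device=0
```

The resulting `.engine` file is what the merged detection pipeline loads at startup.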

Component 2: Tier 2 — Spatial Reasoning & CNN Confirmation

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Robust path tracing + CNN classifier (recommended) | OpenCV, scikit-image, MobileNetV3-Small TRT | Preprocessing removes noise; branch pruning eliminates artifacts; 256×256 ROI for better context | Still depends on segmentation quality | OpenCV, scikit-image, PyTorch → TRT | Offline inference | ~150 ms total | Best fit; robust against noisy masks |
| GraphMorph centerline extraction | PyTorch, custom model | Topology-aware; reduces false positives | Requires an additional model in memory; more complex integration | PyTorch, custom training | Offline | ~200 ms estimated | Upgrade path if the basic approach fails |
| Heuristic rules only | OpenCV, NumPy | No training data; immediate | Brittle; cannot generalize | None | Offline | ~50 ms | Baseline/fallback for day 1 |

Revised path tracing pipeline:

  1. Take footpath segmentation mask from Tier 1
  2. Preprocessing: Gaussian blur (σ=1.5) → binary threshold (Otsu) → morphological closing (5×5 kernel, 2 iterations) → remove small connected components (< 100px area)
  3. Skeletonize using Zhang-Suen algorithm
  4. Branch pruning: Remove skeleton branches shorter than 20 pixels (noise artifacts)
  5. Detect endpoints using hit-miss morphological operations (8 kernel patterns)
  6. Detect junctions using branch-point kernels
  7. Trace path segments between junctions/endpoints
  8. For each endpoint: extract 256×256 ROI crop centered on endpoint from original image
  9. Feed ROI crop to MobileNetV3-Small binary classifier

Freshness assessment (unchanged from draft01, validated approach):

  • Edge sharpness, contrast ratio, fill ratio, path width consistency
  • Initial hand-tuned thresholds → Random Forest with annotated data

Component 3: Tier 3 — VLM Deep Analysis (Background Batch Mode)

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Moondream 0.5B INT4 demand-loaded (recommended) | Moondream, ONNX/PyTorch, INT4 | 816 MiB memory; built-in detect()/point() APIs; runs on a Raspberry Pi | Weaker reasoning than 2B models; not aerial-specialized | ONNX Runtime or PyTorch | Local only | ~2-5 s per image (0.5B) | Best fit for memory-constrained edge |
| SmolVLM2-500M | HuggingFace, ONNX | Small (~1.8 GB); ONNX support | Less capable than Moondream for detection; no detect() API | ONNX Runtime | Local only | ~3-7 s estimated | Alternative if Moondream underperforms |
| UAV-VL-R1 (2B) demand-loaded | vLLM, W4A16 | Aerial-specialized; best reasoning for UAV imagery | 2.5 GB INT8; ~10-21 s per analysis; tight memory fit | vLLM, W4A16 weights | Local only | ~10-21 s | Upgrade path if Moondream is insufficient |
| No VLM | N/A | Simplest; frees the most memory; zero latency impact | No fallback for ambiguous CNN outputs; no explanations | None | N/A | N/A | Viable MVP if Tier 1+2 accuracy is sufficient |

Demand-loading protocol:

1. VLM queue reaches threshold (≥3 detections or operator request)
2. Scan controller transitions to HOLD state (camera fixed position)
3. Signal main process to unload TRT engine
4. Wait for GPU memory release (~1s)
5. Launch VLM process: load Moondream 0.5B INT4
6. Process all queued detections sequentially (~2-5s each)
7. Collect results, send to operator
8. Unload VLM, release GPU memory
9. Reload TRT engine (~2s)
10. Resume scan from HOLD position
Total cycle: ~30-45s for 3-5 detections
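The protocol above can be sketched as a small state machine (illustrative: `VLMManager` and the injected callables are hypothetical names; in the real system they would wrap the TRT engine teardown and the Moondream loader):

```python
from enum import Enum, auto

class VLMState(Enum):
    IDLE = auto()
    LOADING = auto()
    ANALYZING = auto()
    UNLOADING = auto()

class VLMManager:
    """Drives the unload-TRT -> load-VLM -> batch -> reload-TRT cycle.
    The four callables are injected so the GPU-specific steps stay testable."""
    def __init__(self, unload_trt, load_vlm, unload_vlm, load_trt, threshold=3):
        self.queue = []
        self.state = VLMState.IDLE
        self.threshold = threshold
        self._unload_trt, self._load_vlm = unload_trt, load_vlm
        self._unload_vlm, self._load_trt = unload_vlm, load_trt

    def enqueue(self, detection, force=False):
        """Queue a detection; run the cycle at the threshold or on operator demand."""
        self.queue.append(detection)
        if force or len(self.queue) >= self.threshold:
            return self.run_cycle()
        return None

    def run_cycle(self):
        self.state = VLMState.LOADING
        self._unload_trt()                       # step 3: free TRT engine VRAM
        vlm = self._load_vlm()                   # step 5: load Moondream 0.5B INT4
        self.state = VLMState.ANALYZING
        results = [vlm(d) for d in self.queue]   # step 6: sequential batch analysis
        self.queue.clear()
        self.state = VLMState.UNLOADING
        self._unload_vlm()                       # step 8: release VLM memory
        self._load_trt()                         # step 9: restore detection mode
        self.state = VLMState.IDLE               # step 10: scan resumes from HOLD
        return results
```

The scan controller watches `state`: anything other than `IDLE` means the camera must stay in HOLD and the operator is shown the pause notice.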

VLM prompting strategy (adapted for Moondream's capabilities):

Using the `detect()` API for a fast binary check:

```python
# Each call returns Moondream's detections for the given natural-language query
model.detect(image, "concealed military position")
model.detect(image, "dugout covered with branches")
```

Using `caption()` for detailed analysis:

```python
model.caption(image, length="normal")
```

Component 4: Camera Gimbal Control

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Kalman+PID cascade with ViewLink (recommended) | pyserial, ViewLink V3.3.3, filterpy (Kalman), servopilot (PID) | Compensates UAV attitude drift; proven in aerospace; smooth path-following | More complex than PID-only; requires an IMU data feed | ViewPro A40, pyserial, IMU data access | Physical only | <10 ms command latency | Best fit; flight-grade control |
| PID-only with ViewLink | pyserial, ViewLink V3.3.3, servopilot | Simple; works for a hovering UAV | Drifts during flight; cannot compensate mounting errors | ViewPro A40, pyserial | Physical only | <10 ms | Acceptable for testing only |

Revised control architecture:

UAV IMU Data ──▶ Kalman Filter ──▶ State Estimate (attitude, angular velocity)
                     │
Camera Frame ──▶ Detection ──▶ Target Position ──▶ Error Calculation
                                                        │
                                    State Estimate ─────▶│
                                                        │
                                                   PID Controller
                                                        │
                                                   Gimbal Command
                                                        │
                                                   ViewLink Serial

Kalman filter design:

  • State vector: [yaw, pitch, yaw_rate, pitch_rate]
  • Measurement inputs: IMU gyroscope (yaw_rate, pitch_rate) and detection-derived angles
  • Process model: constant angular velocity with noise
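A self-contained sketch of this filter, written directly in NumPy rather than filterpy to keep the example dependency-free (the noise magnitudes `q` and `r` are placeholders to be tuned against real IMU data):

```python
import numpy as np

class GimbalKalman:
    """Constant-angular-velocity Kalman filter over [yaw, pitch, yaw_rate, pitch_rate]."""
    def __init__(self, dt: float, q: float = 1e-3, r: float = 1e-2):
        self.x = np.zeros(4)              # state estimate
        self.P = np.eye(4)                # state covariance
        self.F = np.eye(4)                # process model: angle += rate * dt
        self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.eye(4)                # measure angles (detection) + rates (IMU gyro)
        self.Q = q * np.eye(4)            # process noise (placeholder magnitude)
        self.R = r * np.eye(4)            # measurement noise (placeholder magnitude)

    def step(self, z: np.ndarray) -> np.ndarray:
        # Predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x

kf = GimbalKalman(dt=0.033)  # update tied to the ~30 fps detection rate
estimate = kf.step(np.array([0.12, -0.05, 0.01, 0.0]))  # [yaw, pitch, yaw_rate, pitch_rate]
```

The smoothed `estimate` is what feeds the PID error calculation in the diagram above, which is why the PID gains can be less aggressive.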

Scan patterns (unchanged from draft01): sinusoidal yaw oscillation, POI queue management.

Path-following (revised): Kalman-filtered state estimate provides smoother tracking. PID gains can be lower (less aggressive) because state estimate is already stabilized. Update rate: tied to detection frame rate.

Component 5: Integration with Existing Detections Service

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Single merged Cython+TRT process + demand-loaded VLM (recommended) | Cython, TensorRT, ONNX Runtime | Single TRT engine; minimal memory; VLM isolated | VLM loading pauses detection (30-45 s) | Cython extensions, process management | Process isolation + encryption | Minimal overhead | Best fit for the 5.2 GB VRAM budget |

Revised integration architecture:

┌───────────────────────────────────────────────────────────────────┐
│                 Main Process (Cython + TRT)                       │
│                                                                   │
│  ┌──────────────────────────────────────────────┐                 │
│  │ Single Merged TRT Engine                     │                 │
│  │  ├─ Existing YOLO Detection heads            │                 │
│  │  ├─ YOLOE-v8-seg (re-parameterized)         │                 │
│  │  └─ MobileNetV3-Small classifier            │                 │
│  └──────────────────────────────────────────────┘                 │
│                                                                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐│
│  │ Path Tracing │  │ Scan         │  │ PatchBlock Defense       ││
│  │ + Skeleton   │  │ Controller   │  │ (CPU parallel)           ││
│  │ (CPU)        │  │ + Kalman+PID │  │                          ││
│  └──────────────┘  └──────────────┘  └──────────────────────────┘│
│                                                                   │
│  ┌──────────────────────────────────────────────┐                 │
│  │ VLM Manager                                  │                 │
│  │  state: IDLE | LOADING | ANALYZING | UNLOAD  │                 │
│  │  queue: [detection_1, detection_2, ...]      │                 │
│  └──────────────────────────────────────────────┘                 │
└───────────────────────────────────────────────────────────────────┘

VLM mode (demand-loaded, replaces TRT engine temporarily):
┌───────────────────────────────────────────────────────────────────┐
│  ┌──────────────────────────────────────────────┐                 │
│  │ Moondream 0.5B INT4                          │                 │
│  │ (ONNX Runtime or PyTorch)                    │                 │
│  └──────────────────────────────────────────────┘                 │
│  Detection paused. Camera in HOLD state.                          │
└───────────────────────────────────────────────────────────────────┘

Data flow (revised):

  1. PatchBlock preprocesses frame on CPU (parallel with GPU inference)
  2. Cleaned frame → merged TRT engine → YOLO detections + YOLOE-v8 semantic detections
  3. Semantic detections → path preprocessing → skeletonization → endpoint extraction → CNN
  4. High-confidence → operator alert (coordinates + bounding box + confidence)
  5. Ambiguous → VLM queue
  6. VLM queue management: batch-process when queue ≥ 3 or operator triggers
  7. During VLM mode: detection paused, camera holds, operator notified of pause

GPU scheduling (revised): No concurrent multi-model GPU sharing. Single TRT engine runs during detection mode. VLM demand-loaded exclusively during analysis mode. This eliminates the 10-40% latency jitter from GPU sharing.

Component 6: Security

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Three-layer security (recommended) | PatchBlock, LUKS/dm-crypt, tmpfs | Adversarial defense + model protection + data protection | Adds ~5 ms CPU overhead for PatchBlock | PatchBlock library, Linux crypto | Full stack | Minimal GPU impact | Required for military edge deployment |

Layer 1: Adversarial Input Defense

  • PatchBlock CPU preprocessing on every frame before GPU inference
  • Detects anomalous patches via outlier detection and dimensionality reduction
  • Recovers up to 77% accuracy under adversarial attack
  • Runs in parallel with GPU inference (no latency addition to pipeline)

Layer 2: Model & Weight Protection

  • TRT engine files encrypted at rest using LUKS on a dedicated partition
  • At boot: decrypt into tmpfs (RAM disk) — never written to persistent storage unencrypted
  • Secure boot chain via Jetson's secure boot (fuse-based, hardware root of trust)
  • If device is captured powered-off: encrypted models, no plaintext weights accessible
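A boot-time provisioning sketch of this layer (the device path, mount points, and key location are assumptions for illustration; the key itself would be released by the secure-boot chain, not stored in plaintext):

```shell
# Decrypt the model partition and stage engines into RAM only.
cryptsetup open /dev/mmcblk0p5 models --key-file /run/model.key
mount -t tmpfs -o size=512M,mode=0700 tmpfs /run/models    # RAM-backed, never persisted
mount -o ro /dev/mapper/models /mnt/models-crypt
cp /mnt/models-crypt/*.engine /run/models/
umount /mnt/models-crypt && cryptsetup close models        # keep only the RAM copies
```

On power loss or a tamper-triggered wipe, the tmpfs copies vanish and only the LUKS-encrypted partition remains on persistent storage.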

Layer 3: Operational Data Protection

  • Captured imagery stored in encrypted circular buffer (last N minutes only)
  • Detection logs (coordinates, confidence, timestamps) encrypted at rest
  • Over datalink: transmit only coordinates + confidence + small thumbnail (not raw frames)
  • Tamper detection: if enclosure opened or unauthorized boot detected → auto-wipe keys + detection logs

Training & Data Strategy (Revised)

Phase 1: Zero-shot (Week 1-2)

  • Deploy YOLOE-v8-seg with multimodal prompts (text for paths, visual for concealment)
  • Use semantic01-04.png as visual prompt references via SAVPE
  • Tune confidence thresholds per class type
  • Collect false positive/negative data for annotation
  • Benchmark YOLOE-v8-seg TRT on Jetson: confirm inference time, memory, stability

Phase 2: Annotation & Fine-tuning (Week 3-8)

  • Annotate collected real data (target: 300-500 images/class)
  • Generate 1000+ synthetic images per class using GenCAMO/CamouflageAnything
  • Priority: footpaths (segmentation) → branch piles (bboxes) → entrances (bboxes)
  • Active learning: YOLOE zero-shot flags candidates → human reviews → annotates
  • Fine-tune YOLOv8-Seg (or YOLO26-Seg if TRT fixed) on real + synthetic dataset
  • Use linear probing first, then full fine-tuning

Phase 3: CNN classifier (Week 4-8, parallel with Phase 2)

  • Train MobileNetV3-Small on ROI crops: 256×256 from endpoint analysis
  • Positive: annotated concealed positions + synthetic. Negative: natural termini, random terrain
  • Target: 200+ real positive + 500+ synthetic positive, 1000+ negative
  • Export to TensorRT FP16

Phase 4: VLM integration (Week 8-12)

  • Deploy Moondream 0.5B INT4 in demand-loaded mode
  • Test demand-load cycle timing: measure unload → load → infer → unload → reload
  • Tune detect() prompts and caption prompts on collected ambiguous cases
  • If Moondream accuracy insufficient: test UAV-VL-R1 (2B) demand-loaded
  • If YOLO26 TRT bugs fixed: test YOLOE-26-seg as Tier 1 upgrade

Phase 5: Seasonal expansion (Month 3+)

  • Winter data → spring/summer annotation campaigns
  • Re-train all models with multi-season data + seasonal synthetic augmentation

Testing Strategy

Integration / Functional Tests

  • YOLOE-v8-seg multimodal prompt detection on reference images — verify text+visual fusion
  • TRT engine stability test: 1000 consecutive inferences without confidence drift
  • Path preprocessing pipeline on synthetic noisy masks — verify cleaning + skeletonization
  • Branch pruning: verify short spurious branches removed, real path branches preserved
  • CNN classifier on known positive/negative 256×256 ROI crops
  • Demand-load VLM cycle: measure timing of unload TRT → load Moondream → infer → unload → reload TRT
  • Memory monitoring during demand-load: confirm no memory leak across 10+ cycles
  • Kalman+PID gimbal control with simulated IMU data — verify drift compensation
  • Full pipeline: frame → PatchBlock → YOLOE-v8 → path tracing → CNN → (VLM) → alert
  • Scan controller: Level 1 → Level 2 → HOLD (for VLM) → resume Level 1

Non-Functional Tests

  • Tier 1 latency: YOLOE-v8-seg TRT FP16 ≤15ms on Jetson Orin Nano Super
  • Tier 2 latency: preprocessing + skeletonization + CNN ≤200ms
  • VLM demand-load cycle: ≤45s for 3 detections (including load/unload overhead)
  • Memory profiling: peak detection mode ≤3.5GB GPU, peak VLM mode ≤2.0GB GPU
  • Thermal stress test: 30+ minutes continuous detection without thermal throttling
  • PatchBlock adversarial test: inject test adversarial patches, measure accuracy recovery
  • False positive/negative rate on annotated reference images
  • Gimbal path-following accuracy with and without Kalman filter (measure improvement)
  • Demand-load memory leak test: 50+ VLM cycles without memory growth

References

  • AC assessment: _docs/00_research/00_ac_assessment.md
  • Previous draft: _docs/01_solution/solution_draft01.md