detections-semantic/_docs/01_solution/solution_draft02.md
Oleksandr Bezdieniezhnykh, Initial commit, 2026-03-26 00:20:30 +02:00

Solution Draft

Assessment Findings

| Old Component | Weak Point (functional/security/performance) | New Solution |
|---|---|---|
| YOLOE-26-seg TRT engine | YOLO26 has confirmed TRT confidence misalignment and INT8 export crashes on Jetson (bug #23841, Hackster.io report); YOLOE-26 inherits these bugs. | Use YOLOE-v8-seg for initial deployment (proven TRT stability); transition to YOLOE-26 once Ultralytics fixes the TRT issues. |
| Two separate TRT engines (existing YOLO + YOLOE-26) | Combined memory of ~5-6 GB exceeds the usable 5.2 GB VRAM; cuDNN overhead is ~1 GB per engine. | Single merged TRT engine: YOLOE-v8-seg, re-parameterized with fixed classes, merges into the existing YOLO pipeline. One engine, one CUDA context. |
| UAV-VL-R1 (2B) via vLLM, ≤5 s | TRT-LLM does not support this edge platform; a 2B VLM at ~4.7 tok/s needs 10-21 s for a useful response, and at 2.5 GB it cannot fit alongside YOLO in memory. | Moondream 0.5B (816 MiB INT4) as primary VLM, demand-loaded: unload YOLO → load VLM → analyze batch → unload → reload YOLO. Background mode, not real-time. |
| Text prompts for concealment classes | Military concealment classes are far out-of-distribution from LVIS/COCO training data; "dugout" or "camouflage netting" are unlikely to work. | Visual prompts (SAVPE) as primary for concealment; text prompts only for in-distribution classes (footpath, road, trail). Multimodal fusion (text + visual) for robustness. |
| Raw Zhang-Suen skeletonization | Noise-sensitive: spurious branches from noisy aerial segmentation masks. | Add a preprocessing pipeline: Gaussian blur → threshold → morphological closing → skeletonization → branch pruning (remove branches < 20 px). Increase ROI to 256×256. |
| PID-only gimbal control | PID cannot compensate for UAV attitude drift and mounting errors during flight. | Kalman filter + PID cascade: Kalman estimates state from IMU → PID corrects the error → gimbal actuates. |
| 1500 images/class in 8 weeks | Optimistic for military concealment data collection: access constraints, annotation complexity. | 300-500 real + 1000+ synthetic images (GenCAMO/CamouflageAnything) per class; active-learning loop seeded from YOLOE zero-shot. |
| No security measures | Small edge YOLO models are vulnerable to adversarial patches; physical device-capture risk; no data protection. | Three layers: PatchBlock adversarial defense, encrypted model weights at rest, auto-wipe on tamper. |

Product Solution Description

A three-tier semantic detection system for identifying concealed/camouflaged positions from reconnaissance UAV aerial imagery, running on Jetson Orin Nano Super alongside the existing YOLO detection pipeline. Redesigned for the 5.2GB usable VRAM budget with demand-loaded VLM.

┌─────────────────────────────────────────────────────────────────────────┐
│                        JETSON ORIN NANO SUPER                          │
│                        (5.2 GB usable VRAM)                            │
│                                                                        │
│  ┌──────────┐    ┌──────────────────────┐    ┌───────────────────────┐ │
│  │ ViewPro  │───▶│  Tier 1              │───▶│  Tier 2               │ │
│  │ A40      │    │  Merged TRT Engine   │    │  Path Preprocessing   │ │
│  │ Camera   │    │  YOLOE-v8-seg        │    │  + Skeletonization    │ │
│  │          │    │  + Existing YOLO     │    │  + MobileNetV3-Small  │ │
│  │          │    │  ≤15ms               │    │  ≤200ms               │ │
│  └────▲─────┘    └──────────────────────┘    └───────────┬───────────┘ │
│       │                                                  │             │
│  ┌────┴─────┐    ┌──────────────┐                        │ ambiguous   │
│  │ Gimbal   │◀───│  Scan        │                        ▼             │
│  │ Kalman   │    │  Controller  │              ┌───────────────────┐   │
│  │ + PID    │    │  (L1/L2)     │              │ VLM Queue         │   │
│  └──────────┘    └──────────────┘              │ (batch when ≥3    │   │
│                                                │  or on demand)    │   │
│  ┌──────────────────────────────┐              └────────┬──────────┘   │
│  │  PatchBlock Adversarial     │                        │              │
│  │  Defense (CPU preprocessing)│              [demand-load cycle]      │
│  └──────────────────────────────┘              ┌────────▼──────────┐   │
│                                                │ Tier 3            │   │
│                                                │ Moondream 0.5B    │   │
│                                                │ 816 MiB INT4      │   │
│                                                │ ~5-10s per image  │   │
│                                                └───────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘

The system operates in two scan levels:

  • Level 1 (Wide Sweep): Camera at medium zoom. Merged TRT engine runs YOLOE-v8-seg (visual + text prompts) and existing YOLO detection simultaneously. POIs queued by confidence.
  • Level 2 (Detailed Scan): Camera zooms into POI. Path preprocessing → skeletonization → endpoint CNN. High-confidence → immediate alert. Ambiguous → VLM queue.
  • VLM Batch Analysis: When queue reaches 3+ detections or operator requests: scan pauses, YOLO engine unloads, Moondream loads, batch analyzes, unloads, YOLO reloads. ~30-45s total cycle.
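The confidence-ordered POI queue from Level 1 can be sketched as a small priority queue (a minimal sketch: `POIQueue` and its fields are illustrative names, not the existing service's API):

```python
import heapq
import itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class POI:
    sort_key: float            # heapq is a min-heap, so store negated confidence
    seq: int                   # insertion counter breaks confidence ties FIFO
    label: str = field(compare=False)
    bbox: tuple = field(compare=False)

class POIQueue:
    """Confidence-ordered queue of points of interest from Level 1 sweeps."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, label, bbox, confidence):
        heapq.heappush(self._heap, POI(-confidence, next(self._counter), label, bbox))

    def pop(self):
        poi = heapq.heappop(self._heap)
        return poi.label, poi.bbox, -poi.sort_key

q = POIQueue()
q.push("footpath", (10, 20, 50, 60), 0.42)
q.push("branch pile", (100, 40, 140, 90), 0.81)
label, bbox, conf = q.pop()  # highest-confidence POI comes out first
```

The Level 2 scan controller would pop from this queue between sweeps and hand each POI to the zoom/skeletonization stage.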

Three submodules: (1) Semantic Detection AI, (2) Camera Gimbal Control, (3) Integration with existing detections service.

Memory Budget

| Component | Mode | GPU Memory | Notes |
|---|---|---|---|
| OS + system | Always | ~2.4 GB | From 8 GB total; leaves 5.2 GB usable |
| Merged TRT engine (YOLOE-v8-seg + YOLO) | Detection mode | ~2.8 GB | Single engine, shared CUDA context |
| MobileNetV3-Small TRT (FP16) | Detection mode | ~50 MB | Tiny binary classifier |
| OpenCV + NumPy buffers | Always | ~200 MB | Frame buffers, masks |
| PatchBlock defense | Always | ~50 MB | CPU-based, minimal GPU |
| **Total in detection mode** | | ~3.1 GB | ~2.1 GB headroom (1.7 GB against the 3.5 GB peak budget) |
| Moondream 0.5B INT4 | VLM mode | ~816 MB | Demand-loaded |
| vLLM overhead + KV cache | VLM mode | ~500 MB | Minimal for a 0.5B model |
| **Total in VLM mode** | | ~1.6 GB | After unloading the TRT engine |

Architecture

Component 1: Tier 1 — Real-Time Detection (YOLOE-v8-seg, merged engine)

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| YOLOE-v8-seg re-parameterized (recommended) | yoloe-v8s-seg.pt, Ultralytics, TensorRT FP16 | Proven TRT stability on Jetson; zero inference overhead when re-parameterized; visual + text multimodal fusion; merges into existing YOLO engine | Older architecture than YOLO26 (slightly lower base accuracy) | Ultralytics ≥8.4, TensorRT, JetPack 6.2 | PatchBlock CPU preprocessing | ~13 ms FP16 (s-size) | Best fit for stable deployment |
| YOLOE-26-seg (future upgrade) | yoloe-26s-seg.pt, TensorRT | Better accuracy (YOLO26 architecture); NMS-free | Active TRT bugs on Jetson: confidence misalignment, INT8 crash | Wait for Ultralytics fix | Same | ~7 ms FP16 (estimated) | Future upgrade once TRT bugs are resolved |
| YOLO26-Seg custom-trained (production) | yolo26s-seg.pt fine-tuned | Highest accuracy for known classes | Requires 1500+ annotated images/class; same TRT bugs | Custom dataset, GPU for training | Same | ~7 ms FP16 | Long-term production model |

Prompt strategy (revised):

Text prompts (in-distribution classes only):

  • "footpath", "trail", "path", "road", "track"
  • "tree row", "tree line", "clearing"

Visual prompts (SAVPE, for concealment-specific detection):

  • Reference images cropped from semantic01-04.png: branch piles, dark entrances, dugout structures
  • Use multimodal fusion mode: concat (zero overhead)
```python
import numpy as np
from ultralytics import YOLOE

model = YOLOE("yoloe-v8s-seg.pt")

# Text prompts: in-distribution path/terrain classes only.
# YOLOE's set_classes also takes the text prompt embeddings.
text_classes = ["footpath", "trail", "road", "tree row", "clearing"]
model.set_classes(text_classes, model.get_text_pe(text_classes))

# Visual prompt: bounding box of a reference hideout crop.
# Placeholder coordinates; use the real reference-crop box in practice.
x1, y1, x2, y2 = 100, 100, 300, 300

# frame: current camera image (numpy array) from the capture pipeline.
# Exact keyword arguments depend on the installed Ultralytics version.
results = model.predict(
    frame,
    conf=0.15,                              # low threshold for far-OOD classes
    refer_image="reference_hideout.jpg",
    visual_prompts={"bboxes": np.array([[x1, y1, x2, y2]]), "cls": np.array([0])},
    fusion_mode="concat",                   # multimodal fusion, zero overhead
)
```

Re-parameterization for production: Once classes are fixed after training, re-parameterize YOLOE-v8 to standard YOLOv8 weights. This eliminates the open-vocabulary overhead entirely, and the model becomes a regular YOLO inference engine. Merge with existing YOLO detection into a single TRT engine using TensorRT's multi-model support or batch inference.
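As a sketch of the production step above, once the re-parameterized checkpoint is saved (here under the hypothetical filename `yoloe-v8s-seg-fixed.pt`), the Ultralytics CLI can build the FP16 TensorRT engine; run it on the Jetson itself so TensorRT optimizes for the Orin GPU:

```shell
# yoloe-v8s-seg-fixed.pt: hypothetical name for the checkpoint saved
# after set_classes() fixed the vocabulary.
yolo export model=yoloe-v8s-seg-fixed.pt format=engine half=True device=0
```

The resulting `.engine` file is what the merged detection pipeline loads at startup.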

Component 2: Tier 2 — Spatial Reasoning & CNN Confirmation

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Robust path tracing + CNN classifier (recommended) | OpenCV, scikit-image, MobileNetV3-Small TRT | Preprocessing removes noise; branch pruning eliminates artifacts; 256×256 ROI for better context | Still depends on segmentation quality | OpenCV, scikit-image, PyTorch → TRT | Offline inference | ~150 ms total | Best fit; robust against noisy masks |
| GraphMorph centerline extraction | PyTorch, custom model | Topology-aware; reduces false positives | Requires an additional model in memory; more complex integration | PyTorch, custom training | Offline | ~200 ms estimated | Upgrade path if the basic approach fails |
| Heuristic rules only | OpenCV, NumPy | No training data; immediate | Brittle; cannot generalize | None | Offline | ~50 ms | Baseline/fallback for day 1 |

Revised path tracing pipeline:

  1. Take footpath segmentation mask from Tier 1
  2. Preprocessing: Gaussian blur (σ=1.5) → binary threshold (Otsu) → morphological closing (5×5 kernel, 2 iterations) → remove small connected components (< 100px area)
  3. Skeletonize using Zhang-Suen algorithm
  4. Branch pruning: Remove skeleton branches shorter than 20 pixels (noise artifacts)
  5. Detect endpoints using hit-miss morphological operations (8 kernel patterns)
  6. Detect junctions using branch-point kernels
  7. Trace path segments between junctions/endpoints
  8. For each endpoint: extract 256×256 ROI crop centered on endpoint from original image
  9. Feed ROI crop to MobileNetV3-Small binary classifier

Freshness assessment (unchanged from draft01, validated approach):

  • Edge sharpness, contrast ratio, fill ratio, path width consistency
  • Initial hand-tuned thresholds → Random Forest with annotated data

Component 3: Tier 3 — VLM Deep Analysis (Background Batch Mode)

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Moondream 0.5B INT4 demand-loaded (recommended) | Moondream, ONNX/PyTorch, INT4 | 816 MiB memory; built-in detect()/point() APIs; runs on a Raspberry Pi | Weaker reasoning than 2B models; not aerial-specialized | ONNX Runtime or PyTorch | Local only | ~2-5 s per image (0.5B) | Best fit for memory-constrained edge |
| SmolVLM2-500M | HuggingFace, ONNX | Small (~1.8 GB); ONNX support | Less capable than Moondream for detection; no detect() API | ONNX Runtime | Local only | ~3-7 s estimated | Alternative if Moondream underperforms |
| UAV-VL-R1 (2B) demand-loaded | vLLM, W4A16 | Aerial-specialized; best reasoning for UAV imagery | 2.5 GB INT8; ~10-21 s per analysis; tight memory fit | vLLM, W4A16 weights | Local only | ~10-21 s | Upgrade path if Moondream is insufficient |
| No VLM | N/A | Simplest; frees the most memory; zero latency impact | No fallback for ambiguous CNN outputs; no explanations | None | N/A | N/A | Viable MVP if Tier 1+2 accuracy is sufficient |

Demand-loading protocol:

1. VLM queue reaches threshold (≥3 detections or operator request)
2. Scan controller transitions to HOLD state (camera fixed position)
3. Signal main process to unload TRT engine
4. Wait for GPU memory release (~1s)
5. Launch VLM process: load Moondream 0.5B INT4
6. Process all queued detections sequentially (~2-5s each)
7. Collect results, send to operator
8. Unload VLM, release GPU memory
9. Reload TRT engine (~2s)
10. Resume scan from HOLD position
Total cycle: ~30-45s for 3-5 detections
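The protocol above can be sketched as a small state machine (illustrative: `VLMManager` and the injected callables are hypothetical names; in the real system they would wrap the TRT engine teardown and the Moondream loader):

```python
from enum import Enum, auto

class VLMState(Enum):
    IDLE = auto()
    LOADING = auto()
    ANALYZING = auto()
    UNLOADING = auto()

class VLMManager:
    """Drives the unload-TRT -> load-VLM -> batch -> reload-TRT cycle.
    The four callables are injected so the GPU-specific steps stay testable."""
    def __init__(self, unload_trt, load_vlm, unload_vlm, load_trt, threshold=3):
        self.queue = []
        self.state = VLMState.IDLE
        self.threshold = threshold
        self._unload_trt, self._load_vlm = unload_trt, load_vlm
        self._unload_vlm, self._load_trt = unload_vlm, load_trt

    def enqueue(self, detection, force=False):
        """Queue a detection; run the cycle at the threshold or on operator demand."""
        self.queue.append(detection)
        if force or len(self.queue) >= self.threshold:
            return self.run_cycle()
        return None

    def run_cycle(self):
        self.state = VLMState.LOADING
        self._unload_trt()                       # step 3: free TRT engine VRAM
        vlm = self._load_vlm()                   # step 5: load Moondream 0.5B INT4
        self.state = VLMState.ANALYZING
        results = [vlm(d) for d in self.queue]   # step 6: sequential batch analysis
        self.queue.clear()
        self.state = VLMState.UNLOADING
        self._unload_vlm()                       # step 8: release VLM memory
        self._load_trt()                         # step 9: restore detection mode
        self.state = VLMState.IDLE               # step 10: scan resumes from HOLD
        return results
```

The scan controller watches `state`: anything other than `IDLE` means the camera must stay in HOLD and the operator is shown the pause notice.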

VLM prompting strategy (adapted for Moondream's capabilities):

Using the `detect()` API for a fast binary check:

```python
# Each call returns Moondream's detections for the given natural-language query
model.detect(image, "concealed military position")
model.detect(image, "dugout covered with branches")
```

Using `caption()` for detailed analysis:

```python
model.caption(image, length="normal")
```

Component 4: Camera Gimbal Control

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Kalman+PID cascade with ViewLink (recommended) | pyserial, ViewLink V3.3.3, filterpy (Kalman), servopilot (PID) | Compensates UAV attitude drift; proven in aerospace; smooth path-following | More complex than PID-only; requires an IMU data feed | ViewPro A40, pyserial, IMU data access | Physical only | <10 ms command latency | Best fit; flight-grade control |
| PID-only with ViewLink | pyserial, ViewLink V3.3.3, servopilot | Simple; works for a hovering UAV | Drifts during flight; cannot compensate mounting errors | ViewPro A40, pyserial | Physical only | <10 ms | Acceptable for testing only |

Revised control architecture:

UAV IMU Data ──▶ Kalman Filter ──▶ State Estimate (attitude, angular velocity)
                     │
Camera Frame ──▶ Detection ──▶ Target Position ──▶ Error Calculation
                                                        │
                                    State Estimate ─────▶│
                                                        │
                                                   PID Controller
                                                        │
                                                   Gimbal Command
                                                        │
                                                   ViewLink Serial

Kalman filter design:

  • State vector: [yaw, pitch, yaw_rate, pitch_rate]
  • Measurement inputs: IMU gyroscope (yaw_rate, pitch_rate) and detection-derived angles
  • Process model: constant angular velocity with noise
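A self-contained sketch of this filter, written directly in NumPy rather than filterpy to keep the example dependency-free (the noise magnitudes `q` and `r` are placeholders to be tuned against real IMU data):

```python
import numpy as np

class GimbalKalman:
    """Constant-angular-velocity Kalman filter over [yaw, pitch, yaw_rate, pitch_rate]."""
    def __init__(self, dt: float, q: float = 1e-3, r: float = 1e-2):
        self.x = np.zeros(4)              # state estimate
        self.P = np.eye(4)                # state covariance
        self.F = np.eye(4)                # process model: angle += rate * dt
        self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.eye(4)                # measure angles (detection) + rates (IMU gyro)
        self.Q = q * np.eye(4)            # process noise (placeholder magnitude)
        self.R = r * np.eye(4)            # measurement noise (placeholder magnitude)

    def step(self, z: np.ndarray) -> np.ndarray:
        # Predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x

kf = GimbalKalman(dt=0.033)  # update tied to the ~30 fps detection rate
estimate = kf.step(np.array([0.12, -0.05, 0.01, 0.0]))  # [yaw, pitch, yaw_rate, pitch_rate]
```

The smoothed `estimate` is what feeds the PID error calculation in the diagram above, which is why the PID gains can be less aggressive.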

Scan patterns (unchanged from draft01): sinusoidal yaw oscillation, POI queue management.

Path-following (revised): Kalman-filtered state estimate provides smoother tracking. PID gains can be lower (less aggressive) because state estimate is already stabilized. Update rate: tied to detection frame rate.

Component 5: Integration with Existing Detections Service

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Single merged Cython+TRT process + demand-loaded VLM (recommended) | Cython, TensorRT, ONNX Runtime | Single TRT engine; minimal memory; VLM isolated | VLM loading pauses detection (30-45 s) | Cython extensions, process management | Process isolation + encryption | Minimal overhead | Best fit for the 5.2 GB VRAM budget |

Revised integration architecture:

┌───────────────────────────────────────────────────────────────────┐
│                 Main Process (Cython + TRT)                       │
│                                                                   │
│  ┌──────────────────────────────────────────────┐                 │
│  │ Single Merged TRT Engine                     │                 │
│  │  ├─ Existing YOLO Detection heads            │                 │
│  │  ├─ YOLOE-v8-seg (re-parameterized)         │                 │
│  │  └─ MobileNetV3-Small classifier            │                 │
│  └──────────────────────────────────────────────┘                 │
│                                                                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐│
│  │ Path Tracing │  │ Scan         │  │ PatchBlock Defense       ││
│  │ + Skeleton   │  │ Controller   │  │ (CPU parallel)           ││
│  │ (CPU)        │  │ + Kalman+PID │  │                          ││
│  └──────────────┘  └──────────────┘  └──────────────────────────┘│
│                                                                   │
│  ┌──────────────────────────────────────────────┐                 │
│  │ VLM Manager                                  │                 │
│  │  state: IDLE | LOADING | ANALYZING | UNLOAD  │                 │
│  │  queue: [detection_1, detection_2, ...]      │                 │
│  └──────────────────────────────────────────────┘                 │
└───────────────────────────────────────────────────────────────────┘

VLM mode (demand-loaded, replaces TRT engine temporarily):
┌───────────────────────────────────────────────────────────────────┐
│  ┌──────────────────────────────────────────────┐                 │
│  │ Moondream 0.5B INT4                          │                 │
│  │ (ONNX Runtime or PyTorch)                    │                 │
│  └──────────────────────────────────────────────┘                 │
│  Detection paused. Camera in HOLD state.                          │
└───────────────────────────────────────────────────────────────────┘

Data flow (revised):

  1. PatchBlock preprocesses frame on CPU (parallel with GPU inference)
  2. Cleaned frame → merged TRT engine → YOLO detections + YOLOE-v8 semantic detections
  3. Semantic detections → path preprocessing → skeletonization → endpoint extraction → CNN
  4. High-confidence → operator alert (coordinates + bounding box + confidence)
  5. Ambiguous → VLM queue
  6. VLM queue management: batch-process when queue ≥ 3 or operator triggers
  7. During VLM mode: detection paused, camera holds, operator notified of pause

GPU scheduling (revised): No concurrent multi-model GPU sharing. Single TRT engine runs during detection mode. VLM demand-loaded exclusively during analysis mode. This eliminates the 10-40% latency jitter from GPU sharing.

Component 6: Security

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|---|---|---|---|---|---|---|---|
| Three-layer security (recommended) | PatchBlock, LUKS/dm-crypt, tmpfs | Adversarial defense + model protection + data protection | Adds ~5 ms CPU overhead for PatchBlock | PatchBlock library, Linux crypto | Full stack | Minimal GPU impact | Required for military edge deployment |

Layer 1: Adversarial Input Defense

  • PatchBlock CPU preprocessing on every frame before GPU inference
  • Detects anomalous patches via outlier detection and dimensionality reduction
  • Recovers up to 77% accuracy under adversarial attack
  • Runs in parallel with GPU inference (no latency addition to pipeline)

Layer 2: Model & Weight Protection

  • TRT engine files encrypted at rest using LUKS on a dedicated partition
  • At boot: decrypt into tmpfs (RAM disk) — never written to persistent storage unencrypted
  • Secure boot chain via Jetson's secure boot (fuse-based, hardware root of trust)
  • If device is captured powered-off: encrypted models, no plaintext weights accessible
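A boot-time provisioning sketch of this layer (the device path, mount points, and key location are assumptions for illustration; the key itself would be released by the secure-boot chain, not stored in plaintext):

```shell
# Decrypt the model partition and stage engines into RAM only.
cryptsetup open /dev/mmcblk0p5 models --key-file /run/model.key
mount -t tmpfs -o size=512M,mode=0700 tmpfs /run/models    # RAM-backed, never persisted
mount -o ro /dev/mapper/models /mnt/models-crypt
cp /mnt/models-crypt/*.engine /run/models/
umount /mnt/models-crypt && cryptsetup close models        # keep only the RAM copies
```

On power loss or a tamper-triggered wipe, the tmpfs copies vanish and only the LUKS-encrypted partition remains on persistent storage.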

Layer 3: Operational Data Protection

  • Captured imagery stored in encrypted circular buffer (last N minutes only)
  • Detection logs (coordinates, confidence, timestamps) encrypted at rest
  • Over datalink: transmit only coordinates + confidence + small thumbnail (not raw frames)
  • Tamper detection: if enclosure opened or unauthorized boot detected → auto-wipe keys + detection logs

Training & Data Strategy (Revised)

Phase 1: Zero-shot (Week 1-2)

  • Deploy YOLOE-v8-seg with multimodal prompts (text for paths, visual for concealment)
  • Use semantic01-04.png as visual prompt references via SAVPE
  • Tune confidence thresholds per class type
  • Collect false positive/negative data for annotation
  • Benchmark YOLOE-v8-seg TRT on Jetson: confirm inference time, memory, stability

Phase 2: Annotation & Fine-tuning (Week 3-8)

  • Annotate collected real data (target: 300-500 images/class)
  • Generate 1000+ synthetic images per class using GenCAMO/CamouflageAnything
  • Priority: footpaths (segmentation) → branch piles (bboxes) → entrances (bboxes)
  • Active learning: YOLOE zero-shot flags candidates → human reviews → annotates
  • Fine-tune YOLOv8-Seg (or YOLO26-Seg if TRT fixed) on real + synthetic dataset
  • Use linear probing first, then full fine-tuning

Phase 3: CNN classifier (Week 4-8, parallel with Phase 2)

  • Train MobileNetV3-Small on ROI crops: 256×256 from endpoint analysis
  • Positive: annotated concealed positions + synthetic. Negative: natural termini, random terrain
  • Target: 200+ real positive + 500+ synthetic positive, 1000+ negative
  • Export to TensorRT FP16

Phase 4: VLM integration (Week 8-12)

  • Deploy Moondream 0.5B INT4 in demand-loaded mode
  • Test demand-load cycle timing: measure unload → load → infer → unload → reload
  • Tune detect() prompts and caption prompts on collected ambiguous cases
  • If Moondream accuracy insufficient: test UAV-VL-R1 (2B) demand-loaded
  • If YOLO26 TRT bugs fixed: test YOLOE-26-seg as Tier 1 upgrade

Phase 5: Seasonal expansion (Month 3+)

  • Winter data → spring/summer annotation campaigns
  • Re-train all models with multi-season data + seasonal synthetic augmentation

Testing Strategy

Integration / Functional Tests

  • YOLOE-v8-seg multimodal prompt detection on reference images — verify text+visual fusion
  • TRT engine stability test: 1000 consecutive inferences without confidence drift
  • Path preprocessing pipeline on synthetic noisy masks — verify cleaning + skeletonization
  • Branch pruning: verify short spurious branches removed, real path branches preserved
  • CNN classifier on known positive/negative 256×256 ROI crops
  • Demand-load VLM cycle: measure timing of unload TRT → load Moondream → infer → unload → reload TRT
  • Memory monitoring during demand-load: confirm no memory leak across 10+ cycles
  • Kalman+PID gimbal control with simulated IMU data — verify drift compensation
  • Full pipeline: frame → PatchBlock → YOLOE-v8 → path tracing → CNN → (VLM) → alert
  • Scan controller: Level 1 → Level 2 → HOLD (for VLM) → resume Level 1

Non-Functional Tests

  • Tier 1 latency: YOLOE-v8-seg TRT FP16 ≤15ms on Jetson Orin Nano Super
  • Tier 2 latency: preprocessing + skeletonization + CNN ≤200ms
  • VLM demand-load cycle: ≤45s for 3 detections (including load/unload overhead)
  • Memory profiling: peak detection mode ≤3.5GB GPU, peak VLM mode ≤2.0GB GPU
  • Thermal stress test: 30+ minutes continuous detection without thermal throttling
  • PatchBlock adversarial test: inject test adversarial patches, measure accuracy recovery
  • False positive/negative rate on annotated reference images
  • Gimbal path-following accuracy with and without Kalman filter (measure improvement)
  • Demand-load memory leak test: 50+ VLM cycles without memory growth

References

  • AC assessment: _docs/00_research/00_ac_assessment.md
  • Previous draft: _docs/01_solution/solution_draft01.md