# Solution Draft

## Assessment Findings

| Old Component Solution | Weak Point (functional/security/performance) | New Solution |
|------------------------|----------------------------------------------|--------------|
| YOLOE-26-seg TRT engine | YOLO26 has confirmed TRT confidence misalignment and INT8 export crashes on Jetson (bug #23841, Hackster.io report). YOLOE-26 inherits these bugs. | Use YOLOE-v8-seg for initial deployment (proven TRT stability). Transition to YOLOE-26 once Ultralytics fixes the TRT issues. |
| Two separate TRT engines (existing YOLO + YOLOE-26) | Combined memory of ~5-6 GB exceeds the usable 5.2 GB VRAM; cuDNN overhead is ~1 GB per engine. | Single merged TRT engine: YOLOE-v8-seg re-parameterized with fixed classes merges into the existing YOLO pipeline. One engine, one CUDA context. |
| UAV-VL-R1 (2B) via vLLM ≤5 s | TRT-LLM does not support edge devices. A 2B VLM runs at ~4.7 tok/s → 10-21 s for a useful response, and the VLM (2.5 GB) cannot fit alongside YOLO in memory. | Moondream 0.5B (816 MiB INT4) as the primary VLM. Demand-loaded: unload YOLO → load VLM → analyze batch → unload → reload YOLO. Background mode, not real-time. |
| Text prompts for concealment classes | Military concealment classes are far out-of-distribution for LVIS/COCO training data; "dugout" and "camouflage netting" are unlikely to work. | Visual prompts (SAVPE) as primary for concealment. Text prompts only for in-distribution classes (footpath, road, trail). Multimodal fusion (text+visual) for robustness. |
| Raw Zhang-Suen skeletonization | Noise-sensitive: spurious branches from noisy aerial segmentation masks. | Add a preprocessing pipeline: Gaussian blur → threshold → morphological closing → skeletonization → branch pruning (remove branches < 20 px). Increase ROI to 256×256. |
| PID-only gimbal control | PID alone cannot compensate for UAV attitude drift and mounting errors during flight. | Kalman filter + PID cascade: the Kalman filter estimates state from the IMU → PID corrects the error → the gimbal actuates. |
| 1500 images/class in 8 weeks | Optimistic for military concealment data collection given access constraints and annotation complexity. | 300-500 real + 1000+ synthetic images (GenCAMO/CamouflageAnything) per class. Active learning loop seeded from YOLOE zero-shot. |
| No security measures | Small edge YOLO models are vulnerable to adversarial patches; physical device capture risk; no data protection. | Three layers: PatchBlock adversarial defense, model weights encrypted at rest, auto-wipe on tamper. |

## Product Solution Description

A three-tier semantic detection system for identifying concealed/camouflaged positions in reconnaissance UAV aerial imagery, running on a Jetson Orin Nano Super alongside the existing YOLO detection pipeline. Redesigned for the 5.2 GB usable VRAM budget with a demand-loaded VLM.

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         JETSON ORIN NANO SUPER                          │
│                          (5.2 GB usable VRAM)                           │
│                                                                         │
│  ┌──────────┐    ┌──────────────────────┐    ┌───────────────────────┐  │
│  │ ViewPro  │───▶│ Tier 1               │───▶│ Tier 2                │  │
│  │ A40      │    │ Merged TRT Engine    │    │ Path Preprocessing    │  │
│  │ Camera   │    │ YOLOE-v8-seg         │    │ + Skeletonization     │  │
│  │          │    │ + Existing YOLO      │    │ + MobileNetV3-Small   │  │
│  │          │    │ ≤15 ms               │    │ ≤200 ms               │  │
│  └────▲─────┘    └──────────────────────┘    └───────────┬───────────┘  │
│       │                                                  │              │
│  ┌────┴─────┐    ┌──────────────┐                        │ ambiguous    │
│  │ Gimbal   │◀───│ Scan         │                        ▼              │
│  │ Kalman   │    │ Controller   │             ┌───────────────────┐     │
│  │ + PID    │    │ (L1/L2)      │             │ VLM Queue         │     │
│  └──────────┘    └──────────────┘             │ (batch when ≥3    │     │
│                                               │  or on demand)    │     │
│  ┌──────────────────────────────┐             └────────┬──────────┘     │
│  │ PatchBlock Adversarial       │                      │                │
│  │ Defense (CPU preprocessing)  │            [demand-load cycle]        │
│  └──────────────────────────────┘             ┌────────▼──────────┐     │
│                                               │ Tier 3            │     │
│                                               │ Moondream 0.5B    │     │
│                                               │ 816 MiB INT4      │     │
│                                               │ ~2-5 s per image  │     │
│                                               └───────────────────┘     │
└─────────────────────────────────────────────────────────────────────────┘
```

The system operates at two scan levels:

- **Level 1 (Wide Sweep)**: camera at medium zoom. The merged TRT engine runs YOLOE-v8-seg (visual + text prompts) and the existing YOLO detection simultaneously. POIs are queued by confidence.
- **Level 2 (Detailed Scan)**: camera zooms into a POI. Path preprocessing → skeletonization → endpoint CNN. High confidence → immediate alert; ambiguous → VLM queue.
- **VLM Batch Analysis**: when the queue reaches 3+ detections or the operator requests analysis: the scan pauses, the YOLO engine unloads, Moondream loads, the batch is analyzed, Moondream unloads, YOLO reloads. ~30-45 s total cycle.

Three submodules: (1) Semantic Detection AI, (2) Camera Gimbal Control, (3) Integration with the existing detections service.

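The confidence-ordered POI queue behind the Level 1 → Level 2 handoff can be sketched with the stdlib heap; the `POI` fields and the values below are illustrative, not from the actual codebase:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class POI:
    sort_key: float = field(init=False, repr=False)  # heapq is a min-heap,
    confidence: float                                # so store -confidence
    yaw: float                                       # gimbal angles (deg)
    pitch: float

    def __post_init__(self):
        self.sort_key = -self.confidence

queue: list[POI] = []
heapq.heappush(queue, POI(0.42, yaw=10.0, pitch=-35.0))
heapq.heappush(queue, POI(0.81, yaw=-5.0, pitch=-30.0))
heapq.heappush(queue, POI(0.63, yaw=22.0, pitch=-40.0))

best = heapq.heappop(queue)  # highest-confidence POI is scanned first at Level 2
```

Negating the confidence in the first compared field turns Python's min-heap into the max-queue the scan controller needs.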
### Memory Budget

| Component | Mode | GPU Memory | Notes |
|-----------|------|------------|-------|
| OS + system | Always | ~2.4 GB | From 8 GB total; leaves 5.2 GB usable |
| Merged TRT engine (YOLOE-v8-seg + YOLO) | Detection mode | ~2.8 GB | Single engine, shared CUDA context |
| MobileNetV3-Small TRT (FP16) | Detection mode | ~50 MB | Tiny binary classifier |
| OpenCV + NumPy buffers | Always | ~200 MB | Frame buffers, masks |
| PatchBlock defense | Always | ~50 MB | CPU-based, minimal GPU |
| **Total in detection mode** | | **~3.1 GB** | **~2.1 GB headroom** |
| Moondream 0.5B INT4 | VLM mode | ~816 MB | Demand-loaded |
| vLLM overhead + KV cache | VLM mode | ~500 MB | Minimal for a 0.5B model |
| **Total in VLM mode** | | **~1.6 GB** | **After unloading the TRT engine** |

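As a sanity check, the table's totals can be reproduced with quick arithmetic (values in GB, copied from the rows above; the always-resident rows are counted in both modes):

```python
usable = 5.2  # usable VRAM per the table (8 GB total minus OS/system)

detection_mode = {
    "merged_trt_engine": 2.8,
    "mobilenetv3_trt": 0.05,
    "opencv_numpy_buffers": 0.2,
    "patchblock": 0.05,
}
vlm_mode = {
    "moondream_int4": 0.816,
    "vllm_overhead_kv_cache": 0.5,
    "opencv_numpy_buffers": 0.2,  # always-resident components stay loaded
    "patchblock": 0.05,
}

detection_total = sum(detection_mode.values())  # ~3.1 GB
vlm_total = sum(vlm_mode.values())              # ~1.6 GB
headroom = usable - detection_total             # ~2.1 GB in detection mode
```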
## Architecture

### Component 1: Tier 1 — Real-Time Detection (YOLOE-v8-seg, merged engine)

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|------------|-------------|--------------|----------|-------------|-----|
| **YOLOE-v8-seg re-parameterized (recommended)** | yoloe-v8s-seg.pt, Ultralytics, TensorRT FP16 | Proven TRT stability on Jetson. Zero inference overhead when re-parameterized. Visual+text multimodal fusion. Merges into the existing YOLO engine. | Older architecture than YOLO26 (slightly lower base accuracy). | Ultralytics ≥8.4, TensorRT, JetPack 6.2 | PatchBlock CPU preprocessing | ~13 ms FP16 (s-size) | **Best fit for stable deployment.** |
| YOLOE-26-seg (future upgrade) | yoloe-26s-seg.pt, TensorRT | Better accuracy (YOLO26 architecture). NMS-free. | Active TRT bugs on Jetson: confidence misalignment, INT8 crash. | Wait for Ultralytics fix | Same | ~7 ms FP16 (estimated) | **Future upgrade once the TRT bugs are resolved.** |
| YOLO26-Seg custom-trained (production) | yolo26s-seg.pt fine-tuned | Highest accuracy for known classes. | Requires 1500+ annotated images/class. Same TRT bugs. | Custom dataset, GPU for training | Same | ~7 ms FP16 | **Long-term production model.** |

**Prompt strategy (revised)**:

Text prompts (in-distribution classes only):

- `"footpath"`, `"trail"`, `"path"`, `"road"`, `"track"`
- `"tree row"`, `"tree line"`, `"clearing"`

Visual prompts (SAVPE, for concealment-specific detection):

- Reference images cropped from semantic01-04.png: branch piles, dark entrances, dugout structures
- Use multimodal fusion mode `concat` (zero overhead)

```python
import numpy as np
from ultralytics import YOLOE

model = YOLOE("yoloe-v8s-seg.pt")

# Text prompts: in-distribution classes only
text_classes = ["footpath", "trail", "road", "tree row", "clearing"]
model.set_classes(text_classes, model.get_text_pe(text_classes))

# Visual prompt (SAVPE): bbox of the reference object in reference_hideout.jpg
x1, y1, x2, y2 = 120, 80, 260, 210  # example coordinates

frame = "aerial_frame.jpg"  # path or numpy array of the current camera frame
results = model.predict(
    frame,
    conf=0.15,
    refer_image="reference_hideout.jpg",
    visual_prompts={"bboxes": np.array([[x1, y1, x2, y2]]), "cls": np.array([0])},
    fusion_mode="concat",  # text+visual fusion (Ultralytics PR #21966)
)
```

**Re-parameterization for production**: once the classes are fixed after training, re-parameterize YOLOE-v8 into standard YOLOv8 weights. This eliminates the open-vocabulary overhead entirely, and the model becomes a regular YOLO inference engine. Merge it with the existing YOLO detection into a single TRT engine using TensorRT's multi-model support or batch inference.

### Component 2: Tier 2 — Spatial Reasoning & CNN Confirmation

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|------------|-------------|--------------|----------|-------------|-----|
| **Robust path tracing + CNN classifier (recommended)** | OpenCV, scikit-image, MobileNetV3-Small TRT | Preprocessing removes noise. Branch pruning eliminates artifacts. 256×256 ROI for better context. | Still depends on segmentation quality. | OpenCV, scikit-image, PyTorch → TRT | Offline inference | ~150 ms total | **Best fit. Robust against noisy masks.** |
| GraphMorph centerline extraction | PyTorch, custom model | Topology-aware. Reduces false positives. | Requires an additional model in memory. More complex integration. | PyTorch, custom training | Offline | ~200 ms estimated | Upgrade path if the basic approach fails |
| Heuristic rules only | OpenCV, NumPy | No training data. Immediate. | Brittle. Cannot generalize. | None | Offline | ~50 ms | Baseline/fallback for day 1 |

**Revised path tracing pipeline**:

1. Take the footpath segmentation mask from Tier 1
2. **Preprocessing**: Gaussian blur (σ=1.5) → binary threshold (Otsu) → morphological closing (5×5 kernel, 2 iterations) → remove small connected components (< 100 px area)
3. Skeletonize using the Zhang-Suen algorithm
4. **Branch pruning**: remove skeleton branches shorter than 20 pixels (noise artifacts)
5. Detect endpoints using hit-or-miss morphological operations (8 kernel patterns)
6. Detect junctions using branch-point kernels
7. Trace path segments between junctions/endpoints
8. For each endpoint: extract a **256×256** ROI crop centered on the endpoint from the original image
9. Feed the ROI crop to the MobileNetV3-Small binary classifier

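Steps 5-6 reduce to 8-neighbor counting on the skeleton (the hit-or-miss kernels encode exactly these patterns). A numpy-only sketch on a hand-made toy skeleton, assuming the OpenCV/scikit-image preprocessing and skeletonization have already run:

```python
import numpy as np

def neighbor_count(skel: np.ndarray) -> np.ndarray:
    """8-neighbor count at every pixel, masked to the skeleton."""
    p = np.pad(skel.astype(np.uint8), 1)  # zero-padded borders
    h, w = skel.shape
    n = np.zeros((h, w), dtype=np.uint8)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy or dx:
                n += p[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
    return n * (skel > 0)

def endpoints(skel: np.ndarray) -> np.ndarray:
    return np.argwhere(neighbor_count(skel) == 1)   # exactly one neighbor

def junctions(skel: np.ndarray) -> np.ndarray:
    return np.argwhere(neighbor_count(skel) >= 3)   # branch points

# Toy skeleton: a horizontal path with a short spur (a noise branch)
skel = np.zeros((7, 9), dtype=np.uint8)
skel[3, 1:8] = 1              # main path
skel[2, 4] = skel[1, 4] = 1   # 2 px spur branching off at (3, 4)

ep = endpoints(skel)          # the two path ends and the spur tip
jc = junctions(skel)          # includes the branch point at (3, 4)
```

Note that naive neighbor counting also flags pixels diagonally adjacent to the true branch point; one common pruning approach for step 4 walks from each endpoint toward the nearest junction and deletes runs shorter than the 20 px threshold.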
**Freshness assessment** (unchanged from draft01, validated approach):

- Edge sharpness, contrast ratio, fill ratio, path width consistency
- Initial hand-tuned thresholds → Random Forest once annotated data is available

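Each cue can be computed from the ROI in a few lines of numpy; a sketch with illustrative feature definitions (the hand-tuned versions may differ):

```python
import numpy as np

def freshness_features(roi: np.ndarray, mask: np.ndarray) -> dict:
    """roi: grayscale image in [0, 1]; mask: binary path mask, same shape."""
    gy, gx = np.gradient(roi)
    grad_mag = np.hypot(gx, gy)
    path, ground = roi[mask == 1], roi[mask == 0]
    widths = mask.sum(axis=1).astype(float)  # crude per-row path width
    widths = widths[widths > 0]
    return {
        "edge_sharpness": float(grad_mag[mask == 1].mean()),
        "contrast_ratio": float(path.mean() / (ground.mean() + 1e-6)),
        "fill_ratio": float(mask.mean()),
        "width_consistency": float(widths.std() / (widths.mean() + 1e-6)),
    }

# Toy ROI: a bright, crisp, uniformly 4 px wide path on dark ground
roi = np.full((64, 64), 0.2)
mask = np.zeros((64, 64), dtype=np.uint8)
roi[30:34, :] = 0.8
mask[30:34, :] = 1
feats = freshness_features(roi, mask)
```

A fresh, heavily used path should score high on sharpness and contrast and low on width variance; these four numbers are the feature vector the Random Forest would consume.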
### Component 3: Tier 3 — VLM Deep Analysis (Background Batch Mode)

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|------------|-------------|--------------|----------|-------------|-----|
| **Moondream 0.5B INT4 demand-loaded (recommended)** | Moondream, ONNX/PyTorch, INT4 | 816 MiB memory. Built-in detect()/point() APIs. Runs on a Raspberry Pi. | Weaker reasoning than 2B models. Not aerial-specialized. | ONNX Runtime or PyTorch | Local only | ~2-5 s per image | **Best fit for memory-constrained edge.** |
| SmolVLM2-500M | HuggingFace, ONNX | 1.8 GB. Small. ONNX support. | Less capable than Moondream for detection. No detect() API. | ONNX Runtime | Local only | ~3-7 s estimated | Alternative if Moondream underperforms |
| UAV-VL-R1 (2B) demand-loaded | vLLM, W4A16 | Aerial-specialized. Best reasoning for UAV imagery. | 2.5 GB INT8. ~10-21 s per analysis. Tight memory fit. | vLLM, W4A16 weights | Local only | ~10-21 s | **Upgrade path if Moondream is insufficient.** |
| No VLM | N/A | Simplest. Most memory. Zero latency impact. | No fallback for ambiguous CNN outputs. No explanations. | None | N/A | N/A | **Viable MVP if Tier 1+2 accuracy is sufficient.** |

**Demand-loading protocol**:

```
1. VLM queue reaches threshold (≥3 detections or operator request)
2. Scan controller transitions to HOLD state (camera holds position)
3. Signal the main process to unload the TRT engine
4. Wait for GPU memory release (~1 s)
5. Launch the VLM process: load Moondream 0.5B INT4
6. Process all queued detections sequentially (~2-5 s each)
7. Collect results, send to operator
8. Unload the VLM, release GPU memory
9. Reload the TRT engine (~2 s)
10. Resume scan from the HOLD position

Total cycle: ~30-45 s for 3-5 detections
```

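The protocol above is a small state machine; a stdlib-only sketch with the GPU steps stubbed out as injected callbacks (all names here are hypothetical; the real manager would call into TensorRT and the Moondream runtime):

```python
from enum import Enum, auto

class VLMState(Enum):
    IDLE = auto()
    LOADING = auto()
    ANALYZING = auto()
    UNLOADING = auto()

class VLMManager:
    BATCH_THRESHOLD = 3  # step 1: batch when the queue reaches 3 detections

    def __init__(self, unload_trt, load_vlm, analyze, unload_vlm, reload_trt):
        self.state = VLMState.IDLE
        self.queue = []
        # Injected callbacks keep the cycle order testable without a GPU
        self._steps = (unload_trt, load_vlm, analyze, unload_vlm, reload_trt)

    def enqueue(self, detection, operator_request=False):
        self.queue.append(detection)
        if operator_request or len(self.queue) >= self.BATCH_THRESHOLD:
            return self.run_cycle()
        return None

    def run_cycle(self):
        unload_trt, load_vlm, analyze, unload_vlm, reload_trt = self._steps
        unload_trt()                                # step 3: free TRT memory
        self.state = VLMState.LOADING
        load_vlm()                                  # step 5: load Moondream
        self.state = VLMState.ANALYZING
        results = [analyze(d) for d in self.queue]  # step 6: sequential batch
        self.state = VLMState.UNLOADING
        unload_vlm()                                # step 8: release VLM memory
        reload_trt()                                # step 9: detection resumes
        self.state = VLMState.IDLE
        self.queue.clear()
        return results
```

Injecting the load/unload steps also makes the memory-leak and cycle-timing tests in the Testing Strategy straightforward to automate off-target.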
**VLM prompting strategy** (adapted to Moondream's capabilities):

Using the detect() API for a fast binary check:

```python
model.detect(image, "concealed military position")
model.detect(image, "dugout covered with branches")
```

Using caption for detailed analysis:

```python
model.caption(image, length="normal")
```

### Component 4: Camera Gimbal Control

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|------------|-------------|--------------|----------|-------------|-----|
| **Kalman+PID cascade with ViewLink (recommended)** | pyserial, ViewLink V3.3.3, filterpy (Kalman), servopilot (PID) | Compensates for UAV attitude drift. Proven in aerospace. Smooth path-following. | More complex than PID-only. Requires an IMU data feed. | ViewPro A40, pyserial, IMU data access | Physical only | <10 ms command latency | **Best fit. Flight-grade control.** |
| PID-only with ViewLink | pyserial, ViewLink V3.3.3, servopilot | Simple. Works for a hovering UAV. | Drifts during flight. Cannot compensate for mounting errors. | ViewPro A40, pyserial | Physical only | <10 ms | Acceptable for testing only |

**Revised control architecture**:

```
UAV IMU Data ──▶ Kalman Filter ──▶ State Estimate (attitude, angular velocity)
                                                      │
Camera Frame ──▶ Detection ──▶ Target Position ──▶ Error Calculation
                                                      │
                              State Estimate ────────▶│
                                                      ▼
                                               PID Controller
                                                      │
                                                      ▼
                                               Gimbal Command
                                                      │
                                                      ▼
                                              ViewLink Serial
```

- Kalman filter state vector: `[yaw, pitch, yaw_rate, pitch_rate]`
- Measurement inputs: IMU gyroscope (yaw_rate, pitch_rate), detection-derived angles
- Process model: constant angular velocity with noise

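The filter as specified above can be sketched in plain numpy (state `[yaw, pitch, yaw_rate, pitch_rate]`, constant-angular-velocity process). The noise magnitudes and the 30 FPS update rate are placeholder assumptions; a deployment would more likely use `filterpy`:

```python
import numpy as np

dt = 1 / 30.0  # update tied to the detection frame rate (assumed 30 FPS)

# State transition: angles integrate their rates, rates stay constant
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)

# Measure all four components: detection-derived angles + IMU gyro rates
H = np.eye(4)
Q = np.eye(4) * 1e-4                    # process noise (placeholder)
R = np.diag([0.05, 0.05, 0.01, 0.01])   # measurement noise (placeholder)

x = np.zeros(4)   # [yaw, pitch, yaw_rate, pitch_rate]
P = np.eye(4)

def kf_step(x, P, z):
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Feed a constant 3 deg/s yaw rate; the estimate converges onto the motion
for k in range(1, 200):
    z = np.array([3.0 * k * dt, 0.0, 3.0, 0.0])
    x, P = kf_step(x, P, z)
```

The stabilized `x` is what the PID stage consumes, which is why its gains can be less aggressive than in the PID-only design.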
**Scan patterns** (unchanged from draft01): sinusoidal yaw oscillation, POI queue management.

**Path-following** (revised): the Kalman-filtered state estimate provides smoother tracking; PID gains can be lower (less aggressive) because the state estimate is already stabilized. Update rate: tied to the detection frame rate.

### Component 5: Integration with Existing Detections Service

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|------------|-------------|--------------|----------|-------------|-----|
| **Single merged Cython+TRT process + demand-loaded VLM (recommended)** | Cython, TensorRT, ONNX Runtime | Single TRT engine. Minimal memory. VLM isolated. | VLM loading pauses detection (30-45 s). | Cython extensions, process management | Process isolation + encryption | Minimal overhead | **Best fit for 5.2 GB VRAM.** |

**Revised integration architecture**:

```
┌───────────────────────────────────────────────────────────────────┐
│                  Main Process (Cython + TRT)                      │
│                                                                   │
│   ┌──────────────────────────────────────────────┐                │
│   │ Single Merged TRT Engine                     │                │
│   │  ├─ Existing YOLO Detection heads            │                │
│   │  ├─ YOLOE-v8-seg (re-parameterized)          │                │
│   │  └─ MobileNetV3-Small classifier             │                │
│   └──────────────────────────────────────────────┘                │
│                                                                   │
│   ┌──────────────┐ ┌──────────────┐ ┌──────────────────────────┐  │
│   │ Path Tracing │ │ Scan         │ │ PatchBlock Defense       │  │
│   │ + Skeleton   │ │ Controller   │ │ (CPU parallel)           │  │
│   │ (CPU)        │ │ + Kalman+PID │ │                          │  │
│   └──────────────┘ └──────────────┘ └──────────────────────────┘  │
│                                                                   │
│   ┌──────────────────────────────────────────────┐                │
│   │ VLM Manager                                  │                │
│   │  state: IDLE | LOADING | ANALYZING | UNLOAD  │                │
│   │  queue: [detection_1, detection_2, ...]      │                │
│   └──────────────────────────────────────────────┘                │
└───────────────────────────────────────────────────────────────────┘

VLM mode (demand-loaded, temporarily replaces the TRT engine):
┌───────────────────────────────────────────────────────────────────┐
│   ┌──────────────────────────────────────────────┐                │
│   │ Moondream 0.5B INT4                          │                │
│   │ (ONNX Runtime or PyTorch)                    │                │
│   └──────────────────────────────────────────────┘                │
│   Detection paused. Camera in HOLD state.                         │
└───────────────────────────────────────────────────────────────────┘
```

**Data flow** (revised):

1. PatchBlock preprocesses the frame on the CPU (in parallel with GPU inference)
2. Cleaned frame → merged TRT engine → YOLO detections + YOLOE-v8 semantic detections
3. Semantic detections → path preprocessing → skeletonization → endpoint extraction → CNN
4. High confidence → operator alert (coordinates + bounding box + confidence)
5. Ambiguous → VLM queue
6. VLM queue management: batch-process when the queue reaches ≥3 or the operator triggers analysis
7. During VLM mode: detection paused, camera holds, operator notified of the pause

**GPU scheduling** (revised): no concurrent multi-model GPU sharing. The single TRT engine runs during detection mode; the VLM is demand-loaded exclusively during analysis mode. This eliminates the 10-40% latency jitter caused by GPU sharing.

### Component 6: Security

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|------------|-------------|--------------|----------|-------------|-----|
| **Three-layer security (recommended)** | PatchBlock, LUKS/dm-crypt, tmpfs | Adversarial defense + model protection + data protection | Adds ~5 ms CPU overhead for PatchBlock | PatchBlock library, Linux crypto | Full stack | Minimal GPU impact | **Required for military edge deployment.** |

**Layer 1: Adversarial Input Defense**

- PatchBlock CPU preprocessing on every frame before GPU inference
- Detects anomalous patches via outlier detection and dimensionality reduction
- Recovers up to 77% accuracy under adversarial attack
- Runs in parallel with GPU inference (adds no latency to the pipeline)

**Layer 2: Model & Weight Protection**

- TRT engine files encrypted at rest using LUKS on a dedicated partition
- At boot: decrypt into tmpfs (RAM disk); never written unencrypted to persistent storage
- Secure boot chain via Jetson's secure boot (fuse-based, hardware root of trust)
- If the device is captured powered off: models stay encrypted, no plaintext weights accessible

**Layer 3: Operational Data Protection**

- Captured imagery stored in an encrypted circular buffer (last N minutes only)
- Detection logs (coordinates, confidence, timestamps) encrypted at rest
- Over the datalink: transmit only coordinates + confidence + a small thumbnail (not raw frames)
- Tamper detection: if the enclosure is opened or an unauthorized boot is detected → auto-wipe keys + detection logs

## Training & Data Strategy (Revised)

### Phase 1: Zero-shot (Weeks 1-2)

- Deploy YOLOE-v8-seg with multimodal prompts (text for paths, visual for concealment)
- Use semantic01-04.png as visual prompt references via SAVPE
- Tune confidence thresholds per class type
- Collect false positive/negative data for annotation
- **Benchmark YOLOE-v8-seg TRT on Jetson: confirm inference time, memory, stability**

### Phase 2: Annotation & Fine-tuning (Weeks 3-8)

- Annotate collected real data (target: 300-500 images/class)
- **Generate 1000+ synthetic images per class using GenCAMO/CamouflageAnything**
- Priority: footpaths (segmentation) → branch piles (bboxes) → entrances (bboxes)
- Active learning: YOLOE zero-shot flags candidates → human reviews → annotates
- Fine-tune YOLOv8-Seg (or YOLO26-Seg if TRT is fixed) on the real + synthetic dataset
- Use linear probing first, then full fine-tuning

### Phase 3: CNN classifier (Weeks 4-8, parallel with Phase 2)

- Train MobileNetV3-Small on 256×256 ROI crops from endpoint analysis
- Positives: annotated concealed positions + synthetic. Negatives: natural path termini, random terrain
- Target: 200+ real positives + 500+ synthetic positives, 1000+ negatives
- Export to TensorRT FP16

### Phase 4: VLM integration (Weeks 8-12)

- Deploy Moondream 0.5B INT4 in demand-loaded mode
- Test demand-load cycle timing: measure unload → load → infer → unload → reload
- Tune detect() and caption prompts on collected ambiguous cases
- **If Moondream accuracy is insufficient: test UAV-VL-R1 (2B) demand-loaded**
- **If the YOLO26 TRT bugs are fixed: test YOLOE-26-seg as a Tier 1 upgrade**

### Phase 5: Seasonal expansion (Month 3+)

- Winter data → spring/summer annotation campaigns
- Re-train all models with multi-season data + seasonal synthetic augmentation

## Testing Strategy

### Integration / Functional Tests

- YOLOE-v8-seg multimodal prompt detection on reference images: verify text+visual fusion
- **TRT engine stability test**: 1000 consecutive inferences without confidence drift
- Path preprocessing pipeline on synthetic noisy masks: verify cleaning + skeletonization
- Branch pruning: verify short spurious branches are removed while real path branches are preserved
- CNN classifier on known positive/negative 256×256 ROI crops
- **Demand-load VLM cycle**: measure timing of unload TRT → load Moondream → infer → unload → reload TRT
- **Memory monitoring during demand-load**: confirm no memory leak across 10+ cycles
- Kalman+PID gimbal control with simulated IMU data: verify drift compensation
- Full pipeline: frame → PatchBlock → YOLOE-v8 → path tracing → CNN → (VLM) → alert
- Scan controller: Level 1 → Level 2 → HOLD (for VLM) → resume Level 1

### Non-Functional Tests

- Tier 1 latency: YOLOE-v8-seg TRT FP16 ≤15 ms on Jetson Orin Nano Super
- Tier 2 latency: preprocessing + skeletonization + CNN ≤200 ms
- **VLM demand-load cycle: ≤45 s for 3 detections (including load/unload overhead)**
- **Memory profiling: peak detection mode ≤3.5 GB GPU, peak VLM mode ≤2.0 GB GPU**
- Thermal stress test: 30+ minutes of continuous detection without thermal throttling
- PatchBlock adversarial test: inject test adversarial patches, measure accuracy recovery
- False positive/negative rate on annotated reference images
- Gimbal path-following accuracy with and without the Kalman filter (measure the improvement)
- **Demand-load memory leak test: 50+ VLM cycles without memory growth**

## References

- YOLOE-v8 docs: https://docs.ultralytics.com/models/yoloe/
- YOLOE-26 paper: https://arxiv.org/abs/2602.00168
- YOLO26 TRT confidence bug: https://www.hackster.io/qwe018931/pushing-limits-yolov8-vs-v26-on-jetson-orin-nano-b89267
- YOLO26 INT8 crash: https://github.com/ultralytics/ultralytics/issues/23841
- YOLOE multimodal fusion: https://github.com/ultralytics/ultralytics/pull/21966
- Jetson Orin Nano Super memory: https://forums.developer.nvidia.com/t/jetson-orin-nano-super-insufficient-gpu-memory/330777
- Multi-model survey on Orin Nano: https://dev.to/ankk98/multi-model-ai-resource-allocation-for-humanoid-robots-a-survey-on-jetson-orin-nano-super-310i
- TRT multiple engines: https://github.com/NVIDIA/TensorRT/issues/4358
- TRT memory on Jetson: https://github.com/ultralytics/ultralytics/issues/21562
- Moondream: https://moondream.ai/blog/introducing-moondream-0-5b
- Cosmos-Reason2-2B Jetson benchmark: https://www.thenextgentechinsider.com/pulse/cosmos-reason2-runs-on-jetson-orin-nano-super-with-w4a16-quantization
- Jetson AI Lab benchmarks: https://www.jetson-ai-lab.com/tutorials/genai-benchmarking/
- Jetson LLM bottleneck: https://ericxliu.me/posts/benchmarking-llms-on-jetson-orin-nano/
- vLLM on Jetson: https://learnopencv.com/deployment-on-edge-vllm-on-jetson/
- TRT-LLM no edge support: https://github.com/NVIDIA/TensorRT-LLM/issues/7978
- PatchBlock defense: https://arxiv.org/abs/2601.00367
- Adversarial patches on YOLO: https://link.springer.com/article/10.1007/s10207-025-01067-3
- GenCAMO synthetic data: https://arxiv.org/abs/2601.01181
- CamouflageAnything (CVPR 2025): https://openaccess.thecvf.com/content/CVPR2025/html/Das_Camouflage_Anything_...
- GraphMorph centerlines: https://arxiv.org/pdf/2502.11731
- Learnable skeleton + SAM: https://ui.adsabs.harvard.edu/abs/2025ITGRS..63S1458X
- Kalman filter gimbal: https://ieeexplore.ieee.org/ielx7/6287639/10005208/10160027.pdf
- UAV-VL-R1: https://arxiv.org/pdf/2508.11196
- ViewPro Protocol: https://www.viewprotech.com/index.php?ac=article&at=read&did=510
- servopilot PID: https://pypi.org/project/servopilot/

## Related Artifacts

- AC assessment: `_docs/00_research/00_ac_assessment.md`
- Previous draft: `_docs/01_solution/solution_draft01.md`