mirror of
https://github.com/azaion/detections-semantic.git
synced 2026-04-22 08:36:36 +00:00
# Solution Draft

## Assessment Findings

| Old Component Solution | Weak Point (functional/security/performance) | New Solution |
|------------------------|----------------------------------------------|--------------|
| YOLO26 as sole detection backbone | **Accuracy regression on custom datasets**: Reported YOLO26s "much less accurate" than YOLO11s on identical training data (GitHub #23206). YOLO26 is 3 months old — less battle-tested than YOLO11. | Benchmark YOLO26 vs YOLO11 on initial annotated data before committing. YOLO11 as fallback. YOLOE supports both backbones (yoloe-11s-seg, yoloe-26s-seg). |
| YOLO26 TensorRT INT8 export | **INT8 export fails on Jetson** (TRT Error Code 2, OOM). Fix merged (PR #23928) but indicates fragile tooling. | Use FP16 only for initial deployment (confirmed stable). INT8 as future optimization after tooling matures. Pin Ultralytics version + JetPack version. |
| vLLM as VLM runtime | **Unstable on Jetson Orin Nano**: system freezes, reboots, installation crashes, excessive memory (multiple open issues). Not production-ready for 8GB devices. | **Replace with NanoLLM/NanoVLM** — purpose-built for Jetson by NVIDIA's Dusty-NV team. Docker containers for JetPack 5/6. Supports VILA, LLaVA. Stable. Or use llama.cpp with GGUF models (proven on Jetson). |
| No storage strategy | **SD card corruption**: Recurring corruption documented across multiple Jetson Orin Nano users. SD cards unsuitable for production. | **Mandatory NVMe SSD** for OS + models + logging. No SD card in production. Ruggedized NVMe mount for vibration resistance. |
| No EMI protection on UART | **ViewPro documents EMI issues**: antennas cause random gimbal panning if within 35cm. Standard UART parity bit insufficient for noisy UAV environment. | Add CRC-16 checksum layer on gimbal commands. Enforce 35cm antenna separation in physical design. Consider shielded UART cable. Command retry on CRC failure (max 3 retries, then log error). |
| No environmental hardening addressed | **UAV environment**: vibration, temperature extremes (-20°C to +50°C), dust, EMI, power fluctuations. Dev kit form factor is not field-deployable. | Use ruggedized carrier board (MILBOX-ORNX or similar) with vibration dampening. Conformal coating on exposed connectors. External temperature sensor for environmental monitoring. |
| No logging or telemetry | **No post-flight review capability**: field system must log all detections with metadata for model iteration, operator review, and evidence collection. | Add detection logging: timestamp, GPS-denied coordinates, confidence score, detection class, JPEG thumbnail, tier that triggered, freshness metadata. Log to NVMe SSD. Export as structured format (JSON lines) after flight. |
| No frame recording for offline replay | **Training data collection depends on field recording**: Without recording, no way to build training dataset from real flights. | Record all camera frames to NVMe at configurable rate (1-5 FPS during Level 1, full rate during Level 2). Include detection overlay option. Post-flight: use recordings for annotation. |
| No power management | **UAV power budget is finite**: Jetson at 15W + gimbal + camera + radio. No monitoring of power draw or load shedding. | Monitor power consumption via Jetson's INA sensors. Power budget alert at 80% of allocated watts. Load shedding: disable VLM first, then reduce inference rate, then disable semantic detection. |
| YOLO26 not validated for this domain | **No benchmark on aerial concealment detection**: All YOLO26 numbers are on COCO/LVIS. Concealment detection may behave very differently. | First sprint deliverable: benchmark YOLOE-26 (both 11 and 26 backbones) on semantic01-04.png with text/visual prompts. Report AP on initial annotated validation set before committing to backbone. |
| Freshness and path tracing are untested algorithms | **No proven prior art**: Both freshness assessment and path-following via skeletonization are novel combinations. Risk of over-engineering before validation. | Implement minimal viable versions first. V1 path tracing: skeleton + endpoint only, no freshness, no junction following. Validate on real flight data before adding complexity. |

## Product Solution Description

A three-tier semantic detection system for identifying concealed/camouflaged positions from reconnaissance UAV aerial imagery, running on Jetson Orin Nano Super with NVMe SSD storage, active cooling, and ruggedized carrier board, alongside the existing YOLO detection pipeline.

```
┌────────────────────────────────────────────────────────────────────────────┐
│  JETSON ORIN NANO SUPER (ruggedized carrier, NVMe, 15W)                    │
│                                                                            │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────┐    ┌───────────┐     │
│  │ ViewPro  │───▶│ Tier 1       │───▶│ Tier 2       │───▶│ Tier 3    │     │
│  │ A40      │    │ YOLOE        │    │ Path Trace   │    │ VLM       │     │
│  │ Camera   │    │ (11 or 26    │    │ + CNN        │    │ NanoLLM   │     │
│  │ + Frame  │    │  backbone)   │    │ ≤200ms       │    │ (L2 only) │     │
│  │ Quality  │    │ TRT FP16     │    │              │    │ ≤5s       │     │
│  │ Gate     │    │ ≤100ms       │    │              │    │           │     │
│  └────▲─────┘    └──────────────┘    └──────────────┘    └───────────┘     │
│       │                                                                    │
│  ┌────┴─────┐    ┌──────────────┐    ┌──────────────┐    ┌───────────┐     │
│  │ Gimbal   │◀───│ Scan         │    │ Watchdog     │    │ Recorder  │     │
│  │ Control  │    │ Controller   │    │ + Thermal    │    │ + Logger  │     │
│  │ + CRC    │    │ (L1/L2 FSM)  │    │ + Power      │    │ (NVMe)    │     │
│  └──────────┘    └──────────────┘    └──────────────┘    └───────────┘     │
│                                                                            │
│  ┌──────────────────────────────┐                                          │
│  │ Existing YOLO Detection      │  (always running, scene context)         │
│  │ Cython + TRT                 │                                          │
│  └──────────────────────────────┘                                          │
└────────────────────────────────────────────────────────────────────────────┘
```

Key changes from draft02:

- **YOLOE backbone is configurable** (YOLO11 or YOLO26) — benchmark before committing
- **NanoLLM replaces vLLM** as VLM runtime (purpose-built for Jetson, stable)
- **NVMe SSD mandatory** — no SD card in production
- **CRC-16 on gimbal UART** — EMI protection
- **Detection logger + frame recorder** — post-flight review and training data collection
- **Ruggedized carrier board** — vibration, temperature, dust protection
- **Power monitoring + load shedding** — finite UAV power budget
- **FP16 only** for initial deployment (INT8 export unstable on Jetson)
- **Minimal V1 for unproven components** — path tracing and freshness start simple

## Architecture

### Component 1: Tier 1 — Real-Time Detection

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|------------|-------------|--------------|----------|-------------|-----|
| **YOLOE with configurable backbone (recommended)** | yoloe-11s-seg.pt or yoloe-26s-seg.pt, set_classes() → TRT FP16 | Supports both YOLO11 and YOLO26 backbones: benchmark on real data, pick the winner. set_classes() bakes CLIP embeddings in for zero runtime overhead. | YOLO26 may regress on custom data vs YOLO11; needs empirical comparison. | Ultralytics ≥8.4 (pinned version), TensorRT, JetPack 6.2 | Local only | YOLO11s TRT FP16: ~7ms (640px). YOLO26s: similar or slightly faster. | **Best fit. Hedges against backbone risk.** |

**Version pinning strategy**:

- Pin `ultralytics==8.4.X` (a specific patch version validated on Jetson)
- Pin the JetPack 6.2 and TensorRT versions
- Test every Ultralytics update in staging before deploying to production
- Keep both yoloe-11s-seg and yoloe-26s-seg TRT engines on NVMe; switch via config
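The engine switch can live in the deployment YAML; a minimal sketch (key names and paths below are illustrative assumptions, not an existing schema):

```yaml
# Illustrative config fragment — keys and paths are placeholders
tier1:
  backbone: yolo26            # or yolo11; selects which TRT engine to load
  engines:
    yolo11: /nvme/models/yoloe-11s-seg-fp16.engine
    yolo26: /nvme/models/yoloe-26s-seg-fp16.engine
  conf_threshold: 0.15
```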

**YOLO backbone selection process (Sprint 1)**:

1. Annotate 200 frames from real flight footage (footpaths, branch piles, entrances)
2. Fine-tune YOLOE-11s-seg and YOLOE-26s-seg on the same dataset with the same hyperparameters
3. Evaluate on a held-out validation set (50 frames)
4. Pick the backbone with the higher mAP50
5. If the delta is < 2%: pick YOLO26 (faster CPU inference, NMS-free deployment)
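The tie-break rule in steps 4-5 is easy to get backwards in the field; a small helper pins it down (the function name is illustrative, and the mAP50 values come from whatever evaluation harness is used):

```python
def select_backbone(map50_yolo11: float, map50_yolo26: float,
                    tie_delta: float = 0.02) -> str:
    """Sprint 1 selection rule: the higher mAP50 wins, but YOLO26 is
    preferred when the gap is under the tie delta (it is NMS-free and
    faster on CPU)."""
    if abs(map50_yolo11 - map50_yolo26) < tie_delta:
        return "yolo26"  # near-tie: take the deployment-friendlier backbone
    return "yolo11" if map50_yolo11 > map50_yolo26 else "yolo26"
```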

### Component 2: Tier 2 — Spatial Reasoning & CNN Confirmation

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|------------|-------------|--------------|----------|-------------|-----|
| **V1 minimal path tracing + heuristic classifier (recommended for initial release)** | OpenCV, scikit-image | No training data needed. Skeleton + endpoint detection + simple heuristic: "dark mass at endpoint → flag." Fast to implement and validate. | Low accuracy. Many false positives. | OpenCV, scikit-image | Offline | ~30ms | **V1: ship fast, validate on real data.** |
| **V2 trained CNN (after data collection)** | MobileNetV3-Small, TensorRT FP16 | Higher accuracy after training. Dynamic ROI sizing. | Needs 300+ positive, 1000+ negative annotated ROI crops. | PyTorch, TRT export | Offline | ~5-10ms classification | **V2: replace heuristic once data exists.** |
**V1 heuristic for endpoint analysis** (no training data needed):

1. Skeletonize the footpath mask with branch pruning
2. Find endpoints
3. For each endpoint: extract an ROI (size chosen dynamically from the GSD)
4. Compute: `mean_darkness` = mean intensity in the central 50% of the ROI; `contrast` = (surrounding_mean - center_mean) / surrounding_mean; `area_ratio` = dark_pixel_count / total_pixels
5. If `mean_darkness` < threshold_dark AND `contrast` > threshold_contrast → flag as a potential concealed position
6. Thresholds are configurable and tuned per season; start with winter values
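Steps 3-5 amount to a few NumPy reductions. A sketch, assuming grayscale uint8 ROIs; the threshold values are placeholders to be tuned on winter footage per step 6:

```python
import numpy as np

def endpoint_heuristic(roi, threshold_dark=60.0, threshold_contrast=0.3):
    """Flag a potential concealed position at a path endpoint.
    roi: grayscale ROI (uint8) centered on the skeleton endpoint.
    Compares the central 50% of the ROI against the surrounding ring."""
    h, w = roi.shape
    cy, cx = h // 4, w // 4
    center = roi[cy:h - cy, cx:w - cx].astype(float)
    ring_mask = np.ones_like(roi, dtype=bool)
    ring_mask[cy:h - cy, cx:w - cx] = False
    surround = roi[ring_mask].astype(float)

    mean_darkness = center.mean()
    contrast = (surround.mean() - center.mean()) / max(surround.mean(), 1e-6)
    return bool(mean_darkness < threshold_dark and contrast > threshold_contrast)
```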

**V1 freshness** (metadata only, not a filter):

- contrast_ratio of the path vs the surrounding terrain
- Reported as "high contrast" (likely fresh) / "low contrast" (likely stale)
- No binary classification; the operator sees all detections with a freshness tag
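The tag can be derived from the same statistics; a sketch (the threshold is a placeholder, and the returned strings match the `freshness` field of the Component 5 detection log format):

```python
def freshness_tag(path_mean, surround_mean, contrast_threshold=0.25):
    """Metadata-only freshness tag: relative contrast of the path
    against the surrounding terrain. Threshold is tuned per season."""
    ratio = abs(surround_mean - path_mean) / max(surround_mean, 1e-6)
    return "high_contrast" if ratio > contrast_threshold else "low_contrast"
```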

### Component 3: Tier 3 — VLM Deep Analysis

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|------------|-------------|--------------|----------|-------------|-----|
| **NanoLLM with VILA-2.7B or VILA1.5-3B (recommended)** | NanoLLM Docker container, MLC/TVM quantization | Purpose-built for Jetson by NVIDIA team. Stable Docker containers. Optimized memory management. Supports VLMs natively. | Limited model selection (VILA, LLaVA, Obsidian). Not all VLMs available. | Docker, JetPack 6, NVMe for container storage | Local only, container isolation | ~15-25 tok/s on Orin Nano (4-bit MLC) | **Most stable Jetson VLM option.** |
| llama.cpp with GGUF VLM | llama.cpp, GGUF model files | Lightweight. No Docker needed. Proven stability on Jetson. Wide model support. | Manual build. Less optimized than NanoLLM for Jetson GPU. | llama.cpp build, GGUF weights | Local only | ~10-20 tok/s estimated | **Fallback if NanoLLM doesn't support needed model.** |
| ~~vLLM~~ | ~~vLLM Docker~~ | ~~High throughput~~ | **System freezes, reboots, installation crashes on Orin Nano. Multiple open bugs. Not production-ready.** | N/A | N/A | N/A | **Not recommended.** |

**Model selection for NanoLLM**:

- Primary: VILA1.5-3B (confirmed on Orin Nano, multimodal, 4-bit MLC)
- If UAV-VL-R1 GGUF weights become available: use via llama.cpp (aerial-specialized)
- Fallback: Obsidian-3B (mini VLM; lower accuracy but very fast)
### Component 4: Camera Gimbal Control

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|------------|-------------|--------------|----------|-------------|-----|
| **ViewLink serial driver + CRC-16 + PID + watchdog (recommended)** | pyserial, crcmod, PID library, threading | Robust communication. CRC catches EMI-corrupted commands. Retry logic. Watchdog. | ViewLink protocol must be implemented from the spec. Physical EMI mitigation still required. | ViewPro docs, UART, shielded cable, 35cm antenna separation | Physical only | <10ms per command; CRC overhead negligible | **Production-grade.** |

**UART reliability layer**:

```
Packet format: [SOF(2)] [CMD(N)] [CRC16(2)]
- SOF: 0xAA 0x55 (start of frame)
- CMD: ViewLink command bytes per the protocol spec
- CRC16: CRC-CCITT over the CMD bytes
```

- On send: compute the CRC-16 and append it to the ViewLink command packet
- On receive (gimbal feedback): validate the CRC-16; discard corrupted frames
- On CRC failure (send): retry up to 3 times with a 10ms delay; log the failure after 3 retries
- Note: check whether the ViewLink protocol already includes a checksum (read the full spec first). If it does, use the native checksum; don't add a redundant CRC.
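A pure-Python sketch of this framing (a bit-by-bit CRC-16/CCITT-FALSE is inlined for self-containment; a production build would more likely use `crcmod` as listed in the table above, and per the note the native ViewLink checksum may make this layer redundant):

```python
def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """CRC-16/CCITT-FALSE: poly 0x1021, init 0xFFFF, no reflection."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

SOF = b"\xAA\x55"  # start of frame per the packet format above

def pack_frame(cmd: bytes) -> bytes:
    """[SOF(2)] [CMD(N)] [CRC16(2)], CRC over the CMD bytes only."""
    return SOF + cmd + crc16_ccitt(cmd).to_bytes(2, "big")

def validate_frame(frame: bytes) -> bool:
    """Check SOF and CRC on a received frame; corrupted frames are dropped."""
    if len(frame) < 4 or frame[:2] != SOF:
        return False
    cmd, rx_crc = frame[2:-2], int.from_bytes(frame[-2:], "big")
    return crc16_ccitt(cmd) == rx_crc
```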

**Physical EMI mitigation checklist**:

- [ ] Gimbal UART cable: shielded, shortest possible run
- [ ] Video/data transmitter antenna: ≥35cm from the gimbal (ViewPro recommendation)
- [ ] Independent power supply for the gimbal (or filtered from the main bus)
- [ ] Ferrite beads on the UART cable near the Jetson connector
### Component 5: Recording, Logging & Telemetry

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|------------|-------------|--------------|----------|-------------|-----|
| **NVMe-backed frame recorder + JSON-lines detection logger (recommended)** | OpenCV VideoWriter / JPEG sequences, JSON lines, NVMe SSD | Post-flight review. Training data collection. Evidence. Detection audit trail. | Storage consumption: ~2GB/min at 1080p 5FPS JPEG (NVMe write bandwidth of ~500 MB/s is not the bottleneck). | NVMe SSD ≥256GB | Physical access to NVMe | ~5ms per frame write (async) | **Essential for field deployment.** |

**Detection log format** (JSON lines, one object per detection):

```json
{
  "ts": "2026-03-19T14:32:01.234Z",
  "frame_id": 12345,
  "gps_denied_lat": 48.123456,
  "gps_denied_lon": 37.654321,
  "tier": 1,
  "class": "footpath",
  "confidence": 0.72,
  "bbox": [0.12, 0.34, 0.45, 0.67],
  "freshness": "high_contrast",
  "tier2_result": "concealed_position",
  "tier2_confidence": 0.85,
  "tier3_used": false,
  "thumbnail_path": "frames/12345_det_0.jpg"
}
```
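A minimal writer for this format (a synchronous sketch; the flight build would write from a background thread so logging never blocks the inference loop):

```python
import json
from pathlib import Path

class DetectionLogger:
    """Append-only JSON-lines detection log on the NVMe SSD."""

    def __init__(self, log_path):
        self.path = Path(log_path)

    def log(self, record: dict) -> None:
        # One JSON object per line, flushed so a power loss after a
        # detection loses at most the OS cache, not the whole flight.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
```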

**Frame recording strategy**:

- Level 1: record every 5th frame (1-2 FPS) — overview coverage
- Level 2: record every frame (30 FPS) — detailed analysis footage
- Storage budget: 256GB NVMe ≈ 2 hours at Level 2 full rate, or 10+ hours at Level 1 rate
- Circular buffer: when storage is >80% full, overwrite the oldest Level 1 frames (keep Level 2)
### Component 6: System Health & Resilience

**Monitoring threads**:

| Monitor | Check Interval | Threshold | Action |
|---------|----------------|-----------|--------|
| Thermal (T_junction) | 1s | >75°C | Degrade to Level 1 only |
| Thermal (T_junction) | 1s | >80°C | Disable semantic detection |
| Power (Jetson INA) | 2s | >80% budget | Disable VLM |
| Power (Jetson INA) | 2s | >90% budget | Reduce inference rate to 5 FPS |
| Gimbal heartbeat | 2s | No response | Force Level 1 sweep pattern |
| Semantic process | 5s | No heartbeat | Restart with 5s backoff, max 3 attempts |
| VLM process | 5s | No heartbeat | Mark Tier 3 unavailable, continue Tier 1+2 |
| NVMe free space | 60s | <20% free | Switch to Level 1 recording rate only |
| Frame quality | per frame | Laplacian variance < threshold | Skip frame, use buffered good frame |
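The monitor thresholds and the graceful-degradation table reduce to one decision function; a sketch (the source leaves the 70-75°C band unspecified between Level 0 and Level 1, so this sketch treats it as Level 0):

```python
def degradation_level(t_junction_c, power_pct, vlm_alive,
                      semantic_crashes, gimbal_failures):
    """Map monitor readings to a degradation level (0-3).
    Conditions mirror the graceful degradation table; most severe wins."""
    if gimbal_failures >= 3:
        return 3  # no gimbal: existing YOLO only, fixed camera
    if semantic_crashes >= 3 or t_junction_c > 80:
        return 2  # no semantic: existing YOLO only, Level 1 sweep
    if (not vlm_alive) or t_junction_c > 75 or power_pct > 80:
        return 1  # no VLM: Tier 1+2 continue
    return 0      # full capability
```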

**Graceful degradation** (4 levels, unchanged from draft02):

| Level | Condition | Capability |
|-------|-----------|------------|
| 0 — Full | All nominal, T < 70°C | Tier 1+2+3, Level 1+2, gimbal, recording |
| 1 — No VLM | VLM unavailable or T > 75°C or power > 80% | Tier 1+2, Level 1+2, gimbal, recording |
| 2 — No semantic | Semantic crashed 3x or T > 80°C | Existing YOLO only, Level 1 sweep, recording |
| 3 — No gimbal | Gimbal UART failed 3x | Existing YOLO only, fixed camera, recording |
### Component 7: Integration & Deployment

**Hardware BOM additions** (beyond the existing system):

| Item | Purpose | Estimated Cost |
|------|---------|----------------|
| NVMe SSD ≥256GB (industrial grade) | OS + models + recording + logging | $40-80 |
| Active cooling fan (30mm+) | Prevent thermal throttling | $10-20 |
| Ruggedized carrier board (e.g., MILBOX-ORNX or custom) | Vibration, temperature, dust protection | $200-500 |
| Shielded UART cable + ferrite beads | EMI protection for gimbal communication | $10-20 |
| **Total additional hardware** | | **~$260-620** |

**Software deployment**:

- OS: JetPack 6.2 on NVMe SSD
- YOLOE models: TRT FP16 engines on NVMe (both 11 and 26 backbone variants)
- VLM: NanoLLM Docker container on NVMe
- Existing YOLO: current Cython + TRT pipeline (unchanged)
- New Cython modules: semantic detection, gimbal control, scan controller, recorder
- VLM process: separate Docker container, IPC via Unix socket
- Config: YAML file for all thresholds, class names, scan parameters, degradation thresholds

**Version control & update strategy**:

- Pin all dependency versions (Ultralytics, TensorRT, NanoLLM, OpenCV)
- Model updates: swap TRT engine files on NVMe, restart the service
- Config updates: edit the YAML, restart the service
- No over-the-air updates (air-gapped system); USB drive for field updates
## Training & Data Strategy

### Phase 0: Benchmark Sprint (Weeks 1-2)

- Deploy YOLOE-26s-seg and YOLOE-11s-seg in open-vocabulary mode
- Test text/visual prompts on semantic01-04.png + 50 additional frames
- Record results; pick the backbone with the better qualitative detection
- Deploy the V1 heuristic endpoint analysis (no CNN, no training data needed)
- First field test flight with recording enabled

### Phase 1: Field Validation & Data Collection (Weeks 2-6)

- Deploy the TRT FP16 engine with the best backbone
- Record all flights to NVMe
- Operator marks detections as true/false positives in post-flight review
- Build the annotation backlog from recorded frames
- Target: 500 annotated frames by week 6

### Phase 2: Custom Model Training (Weeks 6-10)

- Fine-tune YOLOE-Seg on the custom dataset (linear probing → full fine-tune)
- Train the MobileNetV3-Small CNN on endpoint ROI crops
- A/B test: custom model vs YOLOE zero-shot on the validation set
- Deploy the winning model as the new TRT engine

### Phase 3: VLM & Refinement (Weeks 8-14)

- Deploy NanoLLM with VILA1.5-3B
- Tune prompting on collected ambiguous cases
- Train the freshness classifier (if enough annotated freshness labels exist)
- Target: 1500+ images per class

### Phase 4: Seasonal Expansion (Month 4+)

- Spring/summer annotation campaigns
- Re-train all models with multi-season data
- Adjust heuristic thresholds per season (configurable via YAML)
## Testing Strategy

### Integration / Functional Tests

- YOLOE text prompt detection on reference images (both 11 and 26 backbones)
- TRT FP16 export on Jetson Orin Nano Super (verify no OOM, no crash)
- V1 heuristic endpoint analysis on 20 synthetic masks (10 with hideouts, 10 without)
- Frame quality gate: inject blurry frames, verify rejection
- Gimbal CRC layer: inject corrupted commands, verify retry + log
- Gimbal watchdog: simulate a hang, verify forced Level 1 within 2.5s
- NanoLLM VLM: load the model, run inference on 10 aerial images, verify output + memory
- VLM load/unload cycle: 10 cycles without a memory leak
- Detection logger: verify JSON-lines format, all fields populated
- Frame recorder: verify NVMe write speed, no dropped frames at 30 FPS
- Full pipeline end-to-end on recorded flight footage (offline replay)
- Graceful degradation: simulate each failure mode, verify the correct degradation level

### Non-Functional Tests

- Tier 1 latency on Jetson Orin Nano Super TRT FP16: ≤100ms (both backbones)
- Tier 2 latency: ≤50ms (V1 heuristic), ≤200ms (V2 CNN)
- Tier 3 latency (NanoLLM VLM): ≤5 seconds
- Memory peak: all components loaded < 7GB
- Thermal: 60-minute sustained inference, T_junction < 75°C with active cooling
- NVMe endurance: continuous recording for 2 hours, verify no write errors
- Power draw: measure at each degradation level, verify within the UAV power budget
- EMI test: operate near the data transmitter antenna, verify no gimbal anomalies with the CRC layer
- Cold start: power on → first detection within 60 seconds (model load time)
- Vibration: mount the Jetson on a vibration table, run inference, compare detection accuracy vs static
## Technology Maturity Assessment

| Component | Technology | Maturity | Risk | Mitigation |
|-----------|-----------|----------|------|------------|
| Tier 1 Detection | YOLOE/YOLO26/YOLO11 | **Medium** — YOLO26 is 3 months old, reported regressions on custom data. YOLOE-26 even newer. | Medium | Benchmark both backbones. YOLO11 is battle-tested fallback. Pin versions. |
| TRT FP16 Export | TensorRT on JetPack 6.2 | **High** — FP16 is stable on Jetson. Well-documented. | Low | FP16 only. Avoid INT8 initially. |
| TRT INT8 Export | TensorRT on JetPack 6.2 | **Low** — Documented crashes (PR #23928). Calibration issues. | High | Defer to Phase 3+. FP16 sufficient for now. |
| VLM (NanoLLM) | NanoLLM + VILA-3B | **Medium-High** — Purpose-built for Jetson by NVIDIA team. Docker-based. Monthly releases. | Low | More stable than vLLM. Use Docker containers. |
| VLM (vLLM) | vLLM on Jetson | **Low** — System freezes, crashes, open bugs. | **High** | **Do not use.** NanoLLM instead. |
| Path Tracing | Skeletonization + OpenCV | **High** — Decades-old algorithms. Well-understood. | Low | Pruning needed for noisy inputs. |
| CNN Classifier | MobileNetV3-Small + TRT | **High** — Proven architecture. TRT FP16 stable. | Low | Standard transfer learning. |
| Gimbal Control | ViewLink Serial Protocol | **Medium** — Protocol documented. ArduPilot driver exists. | Medium | EMI mitigation critical. CRC layer. |
| Freshness Assessment | Novel heuristic | **Low** — No prior art. Experimental. | High | V1: metadata only, not a filter. Iterate with data. |
| NVMe Storage | Industrial NVMe on Jetson | **High** — Production standard. SD card alternative is unreliable. | Low | Use industrial-grade SSD. |
| Ruggedized Hardware | MILBOX-ORNX or custom | **High** — Established product. Designed for Jetson + UAV. | Low | Standard procurement. |
## References

- YOLO26 docs: https://docs.ultralytics.com/models/yolo26/
- YOLOE docs: https://docs.ultralytics.com/models/yoloe/
- YOLO26 accuracy regression: https://github.com/ultralytics/ultralytics/issues/23206
- YOLO26 TRT INT8 crash: https://github.com/ultralytics/ultralytics/issues/23841
- YOLO26 TRT OOM fix: https://github.com/ultralytics/ultralytics/pull/23928
- vLLM Jetson freezes: https://github.com/dusty-nv/jetson-containers/issues/800
- vLLM Jetson install crash: https://github.com/vllm-project/vllm/issues/23376
- NanoLLM docs: https://dusty-nv.github.io/NanoLLM/
- NanoVLM: https://jetson-ai-lab.com/tutorial_nano-vlm.html
- MILBOX-ORNX: https://forecr.io/products/jetson-orin-nx-orin-nano-rugged-compact-pc-milbox-ornx
- Jetson SD card corruption: https://forums.developer.nvidia.com/t/corrupted-sd-cards/265418
- Jetson thermal throttling: https://www.alibaba.com/product-insights/how-to-run-private-llama-3-inference-on-a-200-jetson-orin-nano-without-thermal-throttling.html
- ViewPro EMI issues: https://www.viewprouav.com/help/gimbal/
- UART CRC reliability: https://link.springer.com/chapter/10.1007/978-981-19-8563-8_23
- Military edge AI thermal: https://www.mobilityengineeringtech.com/component/content/article/53967
- UAV-VL-R1: https://arxiv.org/pdf/2508.11196
- Qwen2-VL-2B-GPTQ-INT8: https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int8
- Ultralytics Jetson guide: https://docs.ultralytics.com/guides/nvidia-jetson
- Skelite: https://arxiv.org/html/2503.07369v1
- YOLO FP reduction: https://docs.ultralytics.com/yolov5/tutorials/tips_for_best_training_results
# Solution Draft (draft02)

## Product Solution Description

A three-tier semantic detection system for identifying concealed/camouflaged positions from reconnaissance UAV aerial imagery, running on Jetson Orin Nano Super alongside the existing YOLO detection pipeline.

```
┌─────────────────────────────────────────────────────────────────────────┐
│  JETSON ORIN NANO SUPER                                                 │
│                                                                         │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────┐   │
│  │ ViewPro  │───▶│ Tier 1       │───▶│ Tier 2       │───▶│ Tier 3   │   │
│  │ A40      │    │ YOLO26-Seg   │    │ Spatial      │    │ VLM      │   │
│  │ Camera   │    │ + YOLOE-26   │    │ Reasoning    │    │ UAV-VL   │   │
│  │          │    │ ≤100ms       │    │ + CNN        │    │ -R1      │   │
│  │          │    │              │    │ ≤200ms       │    │ ≤5s      │   │
│  └────▲─────┘    └──────────────┘    └──────────────┘    └──────────┘   │
│       │                                                                 │
│  ┌────┴─────┐    ┌──────────────┐                                       │
│  │ Gimbal   │◀───│ Scan         │                                       │
│  │ Control  │    │ Controller   │                                       │
│  │ Module   │    │ (L1/L2)      │                                       │
│  └──────────┘    └──────────────┘                                       │
│                                                                         │
│  ┌──────────────────────────────┐                                       │
│  │ Existing YOLO Detection      │  (separate service, provides context) │
│  │ Cython + TRT                 │                                       │
│  └──────────────────────────────┘                                       │
└─────────────────────────────────────────────────────────────────────────┘
```

The system operates in two scan levels:

- **Level 1 (Wide Sweep)**: Camera at medium zoom, swinging left-right. YOLOE-26 text/visual prompts detect POIs in real time. The existing YOLO provides scene context.
- **Level 2 (Detailed Scan)**: Camera zooms into a POI. Spatial reasoning traces footpaths and finds endpoints. A CNN classifies potential hideouts. An optional VLM provides deep analysis for ambiguous cases.
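The Level 1/Level 2 switching can be sketched as a tiny state machine (method names below are illustrative, not an existing API):

```python
class ScanController:
    """Minimal L1/L2 scan state machine: wide sweep until Tier 1 finds
    a POI, detailed scan until it completes or the watchdog intervenes."""

    LEVEL1_WIDE_SWEEP = 1
    LEVEL2_DETAILED_SCAN = 2

    def __init__(self):
        self.level = self.LEVEL1_WIDE_SWEEP  # start sweeping

    def on_poi_detected(self):
        # Tier 1 flagged a point of interest: zoom in
        if self.level == self.LEVEL1_WIDE_SWEEP:
            self.level = self.LEVEL2_DETAILED_SCAN

    def on_scan_complete(self):
        # Detailed scan finished (or timed out): resume the sweep
        self.level = self.LEVEL1_WIDE_SWEEP

    def on_gimbal_lost(self):
        # Watchdog path: force the Level 1 sweep pattern
        self.level = self.LEVEL1_WIDE_SWEEP
```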
Three submodules: (1) Semantic Detection AI, (2) Camera Gimbal Control, (3) Integration with the existing detections service.

## Existing/Competitor Solutions Analysis

No direct commercial or open-source competitor exists for this specific combination of requirements (concealed position detection from a UAV with edge inference). Related work:

| Solution | Approach | Limitations for This Use Case |
|----------|----------|-------------------------------|
| Standard YOLO object detection | Bounding-box classification of known object types | Cannot detect camouflaged/concealed targets without explicit visual features |
| CAMOUFLAGE-Net (YOLOv7-based) | Attention mechanisms + ELAN for camouflage detection | Designed for ground-level imagery, not aerial; academic datasets only |
| Open-Vocabulary Camouflaged Object Segmentation | VLM + SAM cascaded segmentation | Too slow for real-time edge inference; requires a cloud GPU |
| UAV-YOLO12 | Multi-scale road segmentation from UAV imagery | Roads only, no concealment reasoning |
| FootpathSeg (DINO-MC + UNet) | Footpath segmentation with self-supervised learning | Pedestrian context, not military aerial; no path-following logic |
| YOLO-World / YOLOE | Open-vocabulary detection | Closest fit — YOLOE-26 is our primary Tier 1 mechanism |

**Key insight**: No existing solution combines footpath detection + path tracing + endpoint analysis + concealment classification in a single pipeline. This requires a custom multi-stage system.
## Architecture

### Component 1: Tier 1 — Real-Time Detection (YOLOE-26 + YOLO26-Seg)

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|--------------|----------|------|-----|
| **YOLOE-26 text/visual prompts (recommended)** | yoloe-26s-seg.pt, Ultralytics, TensorRT | Zero-shot detection from day 1. Text prompts: "footpath", "branch pile", "dark entrance". Visual prompts: reference images of hideouts. Zero overhead when re-parameterized. NMS-free. | Open-vocabulary accuracy lower than a custom-trained model. Text prompts may not capture all concealment patterns. | Ultralytics ≥8.4, TensorRT, JetPack 6.2 | Model weights stored locally, no cloud | Free (open source) | **Best fit for bootstrapping. Immediate capability.** |
| YOLO26-Seg custom-trained | yolo26s-seg.pt fine-tuned on custom dataset | Higher accuracy for known classes after training. Instance segmentation masks for footpaths. | Requires annotated training data (1500+ images/class). No zero-shot capability. | Custom dataset, GPU for training | Same | Free (open source) + annotation labor | **Best fit for production after data collection.** |
| UNet + MambaOut | PyTorch, TensorRT | Best published accuracy for trail segmentation from aerial photos. | Separate model, additional memory. No built-in detection head. | Custom integration | Same | Free (open source) | Backup option if YOLO26-Seg underperforms on trails |

**Recommended approach**: Start with YOLOE-26 text/visual prompts. A parallel annotation effort builds the custom dataset. Transition to fine-tuned YOLO26-Seg once data is sufficient. YOLOE-26's zero-shot capability provides immediate usability.
**YOLOE-26 configuration for this project**:

Text prompts for Level 1 detection:

- `"footpath"`, `"trail"`, `"path in snow"`, `"road"`, `"track"`
- `"pile of branches"`, `"tree branches"`, `"camouflage netting"`
- `"dark entrance"`, `"hole"`, `"dugout"`, `"dark opening"`
- `"tree row"`, `"tree line"`, `"group of trees"`
- `"clearing"`, `"open area near forest"`

Visual prompts: Annotated reference images (semantic01-04.png) as visual prompt sources. Crop ROIs around known hideouts and use them as references for SAVPE visual-prompted detection.
API usage:

```python
from ultralytics import YOLOE

model = YOLOE("yoloe-26s-seg.pt")
# Bake the prompt classes into the model via their text embeddings
names = ["footpath", "branch pile", "dark entrance", "tree row", "road"]
model.set_classes(names, model.get_text_pe(names))
results = model.predict(frame, conf=0.15)  # low threshold, high recall
```

Visual prompt usage:

```python
import numpy as np

# x1, y1, x2, y2: pixel coordinates of a known hideout in the
# reference image (example values)
x1, y1, x2, y2 = 120, 80, 260, 210
results = model.predict(
    frame,
    refer_image="reference_hideout.jpg",
    visual_prompts={"bboxes": np.array([[x1, y1, x2, y2]]), "cls": np.array([0])},
)
```
### Component 2: Tier 2 — Spatial Reasoning & CNN Confirmation

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|--------------|----------|------|-----|
| **Path tracing + CNN classifier (recommended)** | OpenCV skeletonization, MobileNetV3-Small TensorRT | Fast (<200ms total). Path tracing from segmentation masks. Binary classifier: "concealed position yes/no" on ROI crop. Well-understood algorithms. | Requires custom annotated data for the CNN classifier. Path tracing quality depends on segmentation quality. | OpenCV, scikit-image, PyTorch → TRT | Offline inference | Free + annotation labor | **Best fit. Modular, fast, interpretable.** |
| Heuristic rules only | OpenCV, NumPy | No training data needed. Rule-based: "if footpath ends at dark mass → flag." | Brittle. Hard to tune. Cannot generalize across seasons/terrain. | None | Offline | Free | Baseline/fallback for initial version |
| End-to-end custom model | PyTorch, TensorRT | Single model handles everything. | Requires massive training data. Black box. Hard to debug. | Large annotated dataset | Offline | Free + GPU time | Not recommended for initial release |
**Path tracing algorithm**:

1. Take footpath segmentation mask from Tier 1
2. Skeletonize using Zhang-Suen algorithm (`skimage.morphology.skeletonize`)
3. Detect endpoints using hit-miss morphological operations (8 kernel patterns)
4. Detect junctions using branch-point kernels
5. Trace path segments between junctions/endpoints
6. For each endpoint: extract 128×128 ROI crop centered on endpoint from original image
7. Feed ROI crop to MobileNetV3-Small binary classifier: "concealed structure" vs "natural terminus"
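Steps 3-4 use hit-miss kernels in production; the same endpoint/junction classification can be sketched with a plain 8-neighbour count on the binary skeleton. This is a NumPy-only illustration, not the OpenCV/scikit-image code path named above:

```python
import numpy as np

def classify_skeleton_pixels(skel: np.ndarray):
    """Classify skeleton pixels by 8-neighbour count:
    exactly 1 neighbour -> endpoint, 3 or more -> junction."""
    s = (skel > 0).astype(np.uint8)
    p = np.pad(s, 1)
    # Sum of the 8 neighbours for every pixel (vectorised shift-and-add).
    nbrs = sum(
        p[1 + dy : 1 + dy + s.shape[0], 1 + dx : 1 + dx + s.shape[1]]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)
    )
    endpoints = np.argwhere((s == 1) & (nbrs == 1))
    junctions = np.argwhere((s == 1) & (nbrs >= 3))
    return endpoints, junctions

# A tiny T-shaped skeleton: three endpoints meeting at one junction area.
skel = np.zeros((7, 7), dtype=np.uint8)
skel[3, 1:6] = 1      # horizontal bar
skel[4:6, 3] = 1      # vertical stem
eps, juncs = classify_skeleton_pixels(skel)
```

Note that with 8-connectivity, pixels adjacent to a branch point can also exceed the junction threshold; the production hit-miss kernels are stricter about which patterns count.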

**Freshness assessment approach**:

Visual features for fresh vs stale classification (binary classifier on path ROI):
- Edge sharpness (Laplacian variance on path boundary)
- Contrast ratio (path intensity vs surrounding terrain)
- Fill ratio (percentage of path area with snow/vegetation coverage)
- Path width consistency (fresh paths have more uniform width)

Implementation: Extract these features per path segment and feed them to a lightweight classifier (Random Forest or small CNN). The initial version can use hand-tuned thresholds, then train with annotated data.
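As an illustration of the hand-tuned starting point, a minimal NumPy sketch of three of the features. The function name is hypothetical, and the fill ratio here is a simplified brightness proxy rather than the snow/vegetation-coverage measure described above:

```python
import numpy as np

def freshness_features(img: np.ndarray, path_mask: np.ndarray) -> dict:
    """Hand-tuned freshness cues for one path ROI.
    img: grayscale float image; path_mask: boolean mask of the path."""
    img = img.astype(np.float64)
    # Edge sharpness: variance of a 4-neighbour Laplacian over the ROI interior.
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0)
           + np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
    edge_sharpness = float(lap[1:-1, 1:-1].var())
    # Contrast ratio: mean path intensity vs mean surrounding intensity.
    path_mean = img[path_mask].mean()
    bg_mean = img[~path_mask].mean()
    contrast_ratio = float(path_mean / (bg_mean + 1e-9))
    # Simplified fill-ratio proxy: fraction of path pixels brighter than the
    # background mean (stand-in for measuring snow/vegetation creep).
    fill_ratio = float((img[path_mask] > bg_mean).mean())
    return {"edge_sharpness": edge_sharpness,
            "contrast_ratio": contrast_ratio,
            "fill_ratio": fill_ratio}

# Synthetic ROI: dark background (0.2) with a bright path stripe (0.9).
img = np.full((32, 32), 0.2)
mask = np.zeros((32, 32), dtype=bool)
mask[:, 14:18] = True
img[mask] = 0.9
feats = freshness_features(img, mask)
```

These scalars feed directly into the Random Forest once annotated data exists; path width consistency needs the skeleton and is omitted here.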

### Component 3: Tier 3 — VLM Deep Analysis (Optional)

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **UAV-VL-R1 (recommended)** | Qwen2-VL-2B fine-tuned, vLLM, INT8 quantization | Purpose-built for aerial reasoning. 48% better than generic Qwen2-VL-2B on UAV tasks. 2.5GB INT8. Open source. | 3-5s per analysis. Competes for GPU memory with YOLO (sequential scheduling). | vLLM or TRT-LLM, GPTQ-INT8 weights | Local inference only | Free (open source) | **Best fit for aerial VLM tasks.** |
| SmolVLM2-500M | HuggingFace Transformers, ONNX | Smallest memory (1.8GB). Fastest inference (~1-2s estimated). | Weakest reasoning. May lack nuance for concealment analysis. | ONNX Runtime or TRT | Local only | Free | Fallback if memory is tight |
| Moondream 2B | moondream API, PyTorch | Built-in detect()/point() APIs. Strong grounded detection (refcoco 91.1). | Not aerial-specialized. Same size class as UAV-VL-R1 but less relevant. | PyTorch or ONNX | Local only | Free | Alternative if UAV-VL-R1 underperforms |
| No VLM | N/A | Simpler system. Less memory. No latency for Tier 3. | No zero-shot capability for novel patterns. No operator explanations. | None | N/A | Free | Viable for production if Tier 1+2 accuracy is sufficient |

**VLM prompting strategy for concealment analysis**:

```
Analyze this aerial UAV image crop. A footpath was detected leading to this area.
Is there a concealed military position visible? Look for:
- Dark entrances or openings
- Piles of cut tree branches used as camouflage
- Dugout structures
- Signs of recent human activity

Answer: YES or NO, then one sentence explaining why.
```

**Integration**: The VLM runs as a separate Python process and communicates with the main Cython pipeline via a Unix domain socket or shared memory. It is triggered only when Tier 2 CNN confidence falls between 30% and 70% (ambiguous cases).
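A minimal sketch of the IPC framing such a split could use: length-prefixed JSON over a Unix domain socket. The message fields and helper names are illustrative assumptions, and a `socketpair` stands in for the real socket between the two processes:

```python
import json
import socket
import struct

def send_msg(sock: socket.socket, obj: dict) -> None:
    """Length-prefixed JSON framing: 4-byte big-endian length, then payload."""
    payload = json.dumps(obj).encode()
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def recv_msg(sock: socket.socket) -> dict:
    (length,) = struct.unpack(">I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length))

# Demo: a socketpair standing in for the pipeline <-> VLM socket.
pipeline_end, vlm_end = socket.socketpair()
send_msg(pipeline_end, {"roi": "crop_0001.jpg", "cnn_conf": 0.52})
req = recv_msg(vlm_end)
send_msg(vlm_end, {"verdict": "NO", "reason": "natural terminus"})
resp = recv_msg(pipeline_end)
```

Explicit length prefixes avoid partial-read bugs on the stream socket; shared memory would only be needed if full frames, rather than ROI file paths, crossed the boundary.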

### Component 4: Camera Gimbal Control

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **Custom ViewLink serial driver + PID controller (recommended)** | Python serial, ViewLink Protocol V3.3.3, PID library | Direct hardware control. Closed-loop tracking with detection feedback. Low latency (<10ms serial command). | Must implement ViewLink protocol from spec. PID tuning needed. | ViewPro A40 documentation, pyserial | Physical hardware access only | Free + hardware | **Best fit. Direct control, no middleware.** |
| ArduPilot integration via MAVLink | ArduPilot, MAVLink, pymavlink | Battle-tested gimbal driver. Well-documented. | Requires ArduPilot flight controller. Additional latency through FC. May conflict with mission planner. | Pixhawk or similar FC running ArduPilot 4.5+ | MAVLink protocol | Pixhawk hardware ($50-200) | Alternative if ArduPilot is already used for flight control |

**Scan controller state machine**:

```
┌─────────────────┐
│    LEVEL 1      │
│   Wide Sweep    │◀──────────────────────────┐
│  Medium Zoom    │                           │
└────────┬────────┘                           │
         │ POI detected                       │ Analysis complete
         ▼                                    │ or timeout (5s)
┌─────────────────┐                           │
│     ZOOM        │                           │
│  TRANSITION     │                           │
│  1-2 seconds    │                           │
└────────┬────────┘                           │
         │ Zoom complete                      │
         ▼                                    │
┌─────────────────┐     ┌──────────────┐      │
│    LEVEL 2      │────▶│    PATH      │──────┤
│ Detailed Scan   │     │  FOLLOWING   │      │
│   High Zoom     │     │  Pan along   │      │
└────────┬────────┘     │  detected    │      │
         │              │  path        │      │
         │              └──────────────┘      │
         │ Endpoint found                     │
         ▼                                    │
┌─────────────────┐                           │
│    ENDPOINT     │                           │
│    ANALYSIS     │───────────────────────────┘
│   Hold + VLM    │
└─────────────────┘
```
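The diagram above can be restated as a transition skeleton. This is an illustrative sketch: only the 5 s analysis timeout comes from the diagram, and the event and method names are assumptions:

```python
import enum
import time

class ScanState(enum.Enum):
    LEVEL1_SWEEP = "level1_sweep"
    ZOOM_TRANSITION = "zoom_transition"
    LEVEL2_SCAN = "level2_scan"
    PATH_FOLLOWING = "path_following"
    ENDPOINT_ANALYSIS = "endpoint_analysis"

class ScanController:
    ANALYSIS_TIMEOUT_S = 5.0          # "Analysis complete or timeout (5s)"

    def __init__(self):
        self.state = ScanState.LEVEL1_SWEEP
        self._entered = time.monotonic()

    def _goto(self, state: ScanState) -> None:
        self.state = state
        self._entered = time.monotonic()

    def on_poi_detected(self):
        if self.state is ScanState.LEVEL1_SWEEP:
            self._goto(ScanState.ZOOM_TRANSITION)

    def on_zoom_complete(self):
        if self.state is ScanState.ZOOM_TRANSITION:
            self._goto(ScanState.LEVEL2_SCAN)

    def on_path_detected(self):
        if self.state is ScanState.LEVEL2_SCAN:
            self._goto(ScanState.PATH_FOLLOWING)

    def on_endpoint_found(self):
        if self.state is ScanState.PATH_FOLLOWING:
            self._goto(ScanState.ENDPOINT_ANALYSIS)

    def tick(self):
        # Analysis complete or timeout returns the camera to the wide sweep.
        if (self.state is ScanState.ENDPOINT_ANALYSIS
                and time.monotonic() - self._entered > self.ANALYSIS_TIMEOUT_S):
            self._goto(ScanState.LEVEL1_SWEEP)

sm = ScanController()
sm.on_poi_detected()
sm.on_zoom_complete()
sm.on_path_detected()
sm.on_endpoint_found()
```

Guarding every transition on the current state keeps stray detector events (e.g. a POI seen mid-zoom) from corrupting the scan cycle.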

**Level 1 sweep pattern**: Sinusoidal yaw oscillation centered on the flight heading. Amplitude: ±30° (configurable). Period: matched to ground speed so adjacent sweeps overlap by 20%. Pitch: slightly downward (configurable based on altitude and desired ground coverage).
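A sketch of how the period could be derived from ground speed and overlap. The `swath_width` parameter and the period formula are assumptions, since the text only fixes the ±30° amplitude and the 20% overlap:

```python
import math

def sweep_yaw(t: float, ground_speed: float, swath_width: float,
              amplitude_deg: float = 30.0, overlap: float = 0.2) -> float:
    """Yaw offset (deg) from the flight heading at time t.
    The period is chosen so each full oscillation advances the UAV by
    (1 - overlap) of the swath width, giving 20% overlap between sweeps."""
    period = swath_width * (1.0 - overlap) / max(ground_speed, 1e-6)
    return amplitude_deg * math.sin(2.0 * math.pi * t / period)
```

For example, at 10 m/s over a 40 m swath the period is 3.2 s; slower flight stretches the period automatically, keeping the overlap constant.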

**Path-following control loop**:
1. Tier 1 outputs footpath segmentation mask
2. Extract path centerline direction (from skeleton)
3. Compute error: path center vs frame center
4. PID controller adjusts gimbal yaw/pitch to minimize error
5. Update rate: tied to detection frame rate (10-30 FPS)
6. When path endpoint reached or path leaves frame: stop following, analyze endpoint
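Steps 3-4 reduce to a standard PID update. A minimal sketch with a toy gimbal model follows; the gains and the plant model are illustrative, and the references list servopilot as a maintained alternative:

```python
class PID:
    """Minimal PID for the gimbal loop (illustrative gains, output
    clamped to the gimbal's slew-rate limit)."""
    def __init__(self, kp: float, ki: float, kd: float, out_limit: float = 30.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_limit = out_limit
        self._integral = 0.0
        self._prev_err = None

    def update(self, error: float, dt: float) -> float:
        self._integral += error * dt
        deriv = 0.0 if self._prev_err is None else (error - self._prev_err) / dt
        self._prev_err = error
        out = self.kp * error + self.ki * self._integral + self.kd * deriv
        return max(-self.out_limit, min(self.out_limit, out))

# Drive the "path centre vs frame centre" error to zero on a toy plant
# where the commanded rate simply integrates into gimbal position.
pid = PID(kp=0.8, ki=0.1, kd=0.05)
pos = 0.4                                # path centre 40% right of frame centre
for _ in range(400):                     # 400 steps at dt = 0.05 s (20 FPS)
    rate = pid.update(-pos, dt=0.05)     # error = setpoint(0) - pos
    pos += rate * 0.05
```

In the real loop `pos` comes from step 3 (skeleton centreline vs frame centre) and `rate` becomes a ViewLink rate command; `dt` tracks the detection frame interval rather than a fixed constant.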

**POI queue management**: Priority queue sorted by: (1) detection confidence, (2) proximity to current camera position (minimize slew time), (3) recency. Max queue size: 20 POIs. Older/lower-confidence entries expire after 30 seconds.
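A sketch of that queue using `heapq`. The tuple ordering encodes confidence, then slew distance, then recency; the size limit and TTL follow the spec above, and everything else (names, newest-first recency) is an illustrative assumption:

```python
import heapq
import itertools
import time

class POIQueue:
    """Bounded priority queue for POIs: highest confidence first, then
    smallest slew angle, then most recent. Entries expire after TTL_S."""
    MAX_SIZE = 20
    TTL_S = 30.0

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # total order tie-breaker

    def push(self, conf: float, slew_deg: float, poi) -> None:
        now = time.monotonic()
        # Min-heap: negate confidence and timestamp so "bigger" pops first.
        item = (-conf, slew_deg, -now, next(self._counter), now, poi)
        heapq.heappush(self._heap, item)
        if len(self._heap) > self.MAX_SIZE:
            self._heap.remove(max(self._heap))  # drop the worst entry
            heapq.heapify(self._heap)

    def pop(self):
        now = time.monotonic()
        while self._heap:
            item = heapq.heappop(self._heap)
            if now - item[4] <= self.TTL_S:     # lazily skip expired entries
                return item[5]
        return None

q = POIQueue()
q.push(0.4, 12.0, "poi_low")
q.push(0.9, 50.0, "poi_high")
q.push(0.9, 5.0, "poi_high_near")
```

Expiry is handled lazily at pop time, which avoids a background sweep thread; with at most 20 entries the linear eviction in `push` is negligible.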

### Component 5: Integration with Existing Detections Service

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **Extend existing Cython codebase + separate VLM process (recommended)** | Cython, TensorRT, Unix socket IPC | Maintains existing architecture. YOLO26/YOLOE-26 fits naturally into TRT pipeline. VLM isolated in separate process. | VLM IPC adds small latency. Two processes to manage. | Cython extensions, process management | Process isolation | Free | **Best fit. Minimal disruption to existing system.** |
| Microservice architecture | FastAPI, Docker, gRPC | Clean separation. Independent scaling. | Overhead for single Jetson. Over-engineered for edge. | Docker, networking | Service mesh | Free | Over-engineered for single device |

**Integration architecture**:

```
┌─────────────────────────────────────────────────────────┐
│                Main Process (Cython + TRT)              │
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │ Existing     │  │ YOLOE-26     │  │ Scan         │   │
│  │ YOLO Det     │  │ Semantic     │  │ Controller   │   │
│  │ (TRT Engine) │  │ (TRT Engine) │  │ + Gimbal     │   │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘   │
│         │                 │                 │           │
│         └────────┬────────┘                 │           │
│                  ▼                          │           │
│           ┌──────────────┐                  │           │
│           │ Spatial      │◀─────────────────┘           │
│           │ Reasoning    │                              │
│           │ + CNN Class. │                              │
│           └──────┬───────┘                              │
│                  │ ambiguous cases                      │
│                  ▼                                      │
│           ┌──────────────┐                              │
│           │ IPC (Unix    │                              │
│           │ Socket)      │                              │
│           └──────┬───────┘                              │
└──────────────────┼──────────────────────────────────────┘
                   │
                   ▼
     ┌────────────────────────────┐
     │  VLM Process (Python)      │
     │  UAV-VL-R1 (vLLM/TRT-LLM)  │
     │  INT8 quantized            │
     └────────────────────────────┘
```

**Data flow**:
1. Camera frame → existing YOLO detection (scene context: vehicles, debris, structures)
2. Same frame → YOLOE-26 semantic detection (footpaths, branch piles, entrances)
3. YOLO context + YOLOE-26 detections → spatial reasoning module
4. Spatial reasoning: path tracing, endpoint analysis, CNN classification
5. High-confidence detections → operator notification (bounding box + coordinates)
6. Ambiguous detections → VLM process via IPC → response → operator notification
7. All detections → scan controller → gimbal commands (Level 1/2 transitions, path following)

**GPU scheduling**: YOLO and YOLOE-26 can share a single TRT engine (YOLOE-26 re-parameterized to standard YOLO26 weights for fixed classes). VLM inference is sequential: pause YOLO frames, run VLM, resume YOLO. VLM analysis typically lasts 3-5s, during which the camera holds position (endpoint analysis phase).

## Training & Data Strategy

### Phase 1: Zero-shot (Week 1-2)
- Deploy YOLOE-26 with text/visual prompts
- Use semantic01-04.png as visual prompt references
- Tune text prompt class names and confidence thresholds
- Collect false positive/negative data for annotation

### Phase 2: Annotation & Fine-tuning (Week 3-8)
- Annotate collected data using existing annotation tooling
- Priority order: footpaths (segmentation masks) → branch piles (bboxes) → entrances (bboxes) → roads (segmentation) → trees (bboxes)
- Use SAM (Segment Anything Model) for semi-automated segmentation mask generation
- Target: 500+ images per class by week 6, 1500+ by week 8
- Fine-tune YOLO26-Seg on custom dataset using linear probing first, then full fine-tuning

### Phase 3: Custom CNN classifier (Week 6-10)
- Train MobileNetV3-Small binary classifier on ROI crops from endpoint analysis
- Positive: annotated concealed positions. Negative: natural path termini, random terrain
- Target: 300+ positive, 1000+ negative samples
- Export to TensorRT FP16

### Phase 4: VLM integration (Week 8-12)
- Deploy UAV-VL-R1 INT8 as separate process
- Tune prompting strategy on collected ambiguous cases
- Optional: fine-tune UAV-VL-R1 on domain-specific data if base accuracy insufficient

### Phase 5: Seasonal expansion (Month 3+)
- Winter data → spring/summer annotation campaigns
- Re-train all models with multi-season data
- Expect accuracy degradation in summer (vegetation occlusion); mitigate with larger dataset

## Testing Strategy

### Integration / Functional Tests
- YOLOE-26 text prompt detection on reference images (semantic01-04.png) — verify footpath and hideout regions are flagged
- Path tracing on synthetic segmentation masks — verify skeleton, endpoint, junction detection
- CNN classifier on known positive/negative ROI crops — verify binary output correctness
- Gimbal control loop on simulated camera feed — verify PID convergence and path-following accuracy
- VLM IPC round-trip — verify request/response latency and correctness
- Full pipeline test: frame → YOLOE detection → path tracing → CNN → VLM → operator notification
- Scan controller state machine: verify Level 1 → Level 2 transitions, timeout, return to Level 1

### Non-Functional Tests
- Tier 1 latency: verify end-to-end YOLOE-26 inference on Jetson Orin Nano Super is ≤100ms
- Tier 2 latency: path tracing + CNN classification ≤200ms
- Tier 3 latency: VLM IPC + inference ≤5 seconds
- Memory profiling: total RAM usage under YOLO + YOLOE + CNN + VLM concurrent loading
- Thermal stress test: continuous inference for 30+ minutes without thermal throttling
- False positive rate measurement on known clean terrain
- False negative rate measurement on annotated concealed positions
- Gimbal response test: measure physical camera movement latency vs command

## References

- YOLO26 docs: https://docs.ultralytics.com/models/yolo26/
- YOLOE docs: https://docs.ultralytics.com/models/yoloe/
- YOLOE-26 paper: https://arxiv.org/abs/2602.00168
- YOLO26 Jetson benchmarks: https://docs.ultralytics.com/guides/nvidia-jetson
- UAV-VL-R1: https://arxiv.org/pdf/2508.11196, https://github.com/Leke-G/UAV-VL-R1
- SmolVLM2: https://huggingface.co/blog/smolervlm
- Moondream: https://moondream.ai/blog/introducing-moondream-0-5b
- ViewPro Protocol: https://www.viewprotech.com/index.php?ac=article&at=read&did=510
- ArduPilot ViewPro: https://ardupilot.org/copter/docs/common-viewpro-gimbal.html
- FootpathSeg: https://github.com/WennyXY/FootpathSeg
- UAV-YOLO12: https://www.mdpi.com/2072-4292/17/9/1539
- Trail segmentation (UNet+MambaOut): https://arxiv.org/pdf/2504.12121
- servopilot PID: https://pypi.org/project/servopilot/
- Camouflage detection (OVCOS): https://arxiv.org/html/2506.19300v1
- Jetson AI Lab benchmarks: https://www.jetson-ai-lab.com/tutorials/genai-benchmarking/
- Cosmos-Reason2-2B benchmarks: Embedl blog (Feb 2026)

## Related Artifacts
- AC assessment: `_docs/00_research/00_ac_assessment.md`
- Tech stack evaluation: `_docs/01_solution/tech_stack.md` (if Phase 3 was executed)
- Security analysis: `_docs/01_solution/security_analysis.md` (if Phase 4 was executed)

# Solution Draft

## Assessment Findings

| Old Component Solution | Weak Point (functional/security/performance) | New Solution |
|------------------------|----------------------------------------------|-------------|
| YOLOE-26-seg TRT engine | YOLO26 has confirmed TRT confidence misalignment and INT8 export crashes on Jetson (bugs #23841, Hackster.io report). YOLOE-26 inherits these bugs. | Use YOLOE-v8-seg for initial deployment (proven TRT stability). Transition to YOLOE-26 once Ultralytics fixes TRT issues. |
| Two separate TRT engines (existing YOLO + YOLOE-26) | Combined memory ~5-6GB exceeds usable 5.2GB VRAM. cuDNN overhead ~1GB per engine. | Single merged TRT engine: YOLOE-v8-seg re-parameterized with fixed classes merges into existing YOLO pipeline. One engine, one CUDA context. |
| UAV-VL-R1 (2B) via vLLM ≤5s | TRT-LLM does not support edge devices. 2B VLM: ~4.7 tok/s → 10-21s for a useful response. VLM (2.5GB) cannot fit alongside YOLO in memory. | Moondream 0.5B (816 MiB INT4) as primary VLM. Demand-loaded: unload YOLO → load VLM → analyze batch → unload → reload YOLO. Background mode, not real-time. |
| Text prompts for concealment classes | Military concealment classes are far OOD from LVIS/COCO training data. "dugout", "camouflage netting" unlikely to work. | Visual prompts (SAVPE) primary for concealment. Text prompts only for in-distribution classes (footpath, road, trail). Multimodal fusion (text+visual) for robustness. |
| Zhang-Suen skeletonization raw | Noise-sensitive: spurious branches from noisy aerial segmentation masks. | Add preprocessing pipeline: Gaussian blur → threshold → morphological closing → skeletonization → branch pruning (remove < 20px branches). Increase ROI to 256×256. |
| PID-only gimbal control | PID cannot compensate for UAV attitude drift and mounting errors during flight. | Kalman filter + PID cascade: Kalman estimates state from IMU → PID corrects error → gimbal actuates. |
| 1500 images/class in 8 weeks | Optimistic for military concealment data collection. Access constraints, annotation complexity. | 300-500 real + 1000+ synthetic (GenCAMO/CamouflageAnything) per class. Active learning loop from YOLOE zero-shot. |
| No security measures | Small edge YOLO models vulnerable to adversarial patches. Physical device capture risk. No data protection. | Three layers: PatchBlock adversarial defense, encrypted model weights at rest, auto-wipe on tamper. |

## Product Solution Description

A three-tier semantic detection system for identifying concealed/camouflaged positions from reconnaissance UAV aerial imagery, running on Jetson Orin Nano Super alongside the existing YOLO detection pipeline. Redesigned for the 5.2GB usable VRAM budget with a demand-loaded VLM.

```
┌─────────────────────────────────────────────────────────────────────────┐
│                        JETSON ORIN NANO SUPER                           │
│                        (5.2 GB usable VRAM)                             │
│                                                                         │
│  ┌──────────┐   ┌──────────────────────┐   ┌───────────────────────┐    │
│  │ ViewPro  │──▶│  Tier 1              │──▶│  Tier 2               │    │
│  │ A40      │   │  Merged TRT Engine   │   │  Path Preprocessing   │    │
│  │ Camera   │   │  YOLOE-v8-seg        │   │  + Skeletonization    │    │
│  │          │   │  + Existing YOLO     │   │  + MobileNetV3-Small  │    │
│  │          │   │  ≤15ms               │   │  ≤200ms               │    │
│  └────▲─────┘   └──────────────────────┘   └───────────┬───────────┘    │
│       │                                                │                │
│  ┌────┴─────┐   ┌──────────────┐                       │ ambiguous      │
│  │ Gimbal   │◀──│ Scan         │                       ▼                │
│  │ Kalman   │   │ Controller   │            ┌───────────────────┐       │
│  │ + PID    │   │ (L1/L2)      │            │  VLM Queue        │       │
│  └──────────┘   └──────────────┘            │  (batch when ≥3   │       │
│                                             │   or on demand)   │       │
│  ┌──────────────────────────────┐           └────────┬──────────┘       │
│  │  PatchBlock Adversarial      │                    │                  │
│  │  Defense (CPU preprocessing) │           [demand-load cycle]         │
│  └──────────────────────────────┘           ┌────────▼──────────┐       │
│                                             │  Tier 3           │       │
│                                             │  Moondream 0.5B   │       │
│                                             │  816 MiB INT4     │       │
│                                             │  ~5-10s per image │       │
│                                             └───────────────────┘       │
└─────────────────────────────────────────────────────────────────────────┘
```

The system operates in two scan levels:
- **Level 1 (Wide Sweep)**: Camera at medium zoom. Merged TRT engine runs YOLOE-v8-seg (visual + text prompts) and existing YOLO detection simultaneously. POIs queued by confidence.
- **Level 2 (Detailed Scan)**: Camera zooms into POI. Path preprocessing → skeletonization → endpoint CNN. High-confidence → immediate alert. Ambiguous → VLM queue.
- **VLM Batch Analysis**: When the queue reaches 3+ detections or the operator requests it: scan pauses, YOLO engine unloads, Moondream loads, batch analyzes, unloads, YOLO reloads. ~30-45s total cycle.

Three submodules: (1) Semantic Detection AI, (2) Camera Gimbal Control, (3) Integration with existing detections service.

### Memory Budget

| Component | Mode | GPU Memory | Notes |
|-----------|------|-----------|-------|
| OS + System | Always | ~2.4 GB | From 8GB total, leaves 5.2GB usable |
| Merged TRT Engine (YOLOE-v8-seg + YOLO) | Detection mode | ~2.8 GB | Single engine, shared CUDA context |
| MobileNetV3-Small TRT (FP16) | Detection mode | ~50 MB | Tiny binary classifier |
| OpenCV + NumPy buffers | Always | ~200 MB | Frame buffers, masks |
| PatchBlock defense | Always | ~50 MB | CPU-based, minimal GPU |
| **Total in Detection Mode** | | **~3.1 GB** | **~2.1 GB headroom** |
| Moondream 0.5B INT4 | VLM mode | ~816 MB | Demand-loaded |
| vLLM overhead + KV cache | VLM mode | ~500 MB | Minimal for 0.5B model |
| **Total in VLM Mode** | | **~1.6 GB** | **After unloading TRT engine; always-on buffers included** |

## Architecture

### Component 1: Tier 1 — Real-Time Detection (YOLOE-v8-seg, merged engine)

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| **YOLOE-v8-seg re-parameterized (recommended)** | yoloe-v8s-seg.pt, Ultralytics, TensorRT FP16 | Proven TRT stability on Jetson. Zero inference overhead when re-parameterized. Visual+text multimodal fusion. Merges into existing YOLO engine. | Older architecture than YOLO26 (slightly lower base accuracy). | Ultralytics ≥8.4, TensorRT, JetPack 6.2 | PatchBlock CPU preprocessing | ~13ms FP16 (s-size) | **Best fit for stable deployment.** |
| YOLOE-26-seg (future upgrade) | yoloe-26s-seg.pt, TensorRT | Better accuracy (YOLO26 architecture). NMS-free. | Active TRT bugs on Jetson: confidence misalignment, INT8 crash. | Wait for Ultralytics fix | Same | ~7ms FP16 (estimated) | **Future upgrade when TRT bugs resolved.** |
| YOLO26-Seg custom-trained (production) | yolo26s-seg.pt fine-tuned | Highest accuracy for known classes. | Requires 1500+ annotated images/class. Same TRT bugs. | Custom dataset, GPU for training | Same | ~7ms FP16 | **Long-term production model.** |

**Prompt strategy (revised)**:

Text prompts (in-distribution classes only):
- `"footpath"`, `"trail"`, `"path"`, `"road"`, `"track"`
- `"tree row"`, `"tree line"`, `"clearing"`

Visual prompts (SAVPE, for concealment-specific detection):
- Reference images cropped from semantic01-04.png: branch piles, dark entrances, dugout structures
- Use multimodal fusion mode: `concat` (zero overhead)

```python
import numpy as np
from ultralytics import YOLOE

model = YOLOE("yoloe-v8s-seg.pt")

text_classes = ["footpath", "trail", "road", "tree row", "clearing"]
model.set_classes(text_classes)

# x1, y1, x2, y2: pixel coordinates of the reference bounding box
results = model.predict(
    frame,
    conf=0.15,
    refer_image="reference_hideout.jpg",
    visual_prompts={"bboxes": np.array([[x1, y1, x2, y2]]), "cls": np.array([0])},
    fusion_mode="concat"
)
```

**Re-parameterization for production**: Once classes are fixed after training, re-parameterize YOLOE-v8 to standard YOLOv8 weights. This eliminates the open-vocabulary overhead entirely, and the model becomes a regular YOLO inference engine. Merge with the existing YOLO detection into a single TRT engine using TensorRT's multi-model support or batch inference.

### Component 2: Tier 2 — Spatial Reasoning & CNN Confirmation

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| **Robust path tracing + CNN classifier (recommended)** | OpenCV, scikit-image, MobileNetV3-Small TRT | Preprocessing removes noise. Branch pruning eliminates artifacts. 256×256 ROI for better context. | Still depends on segmentation quality. | OpenCV, scikit-image, PyTorch → TRT | Offline inference | ~150ms total | **Best fit. Robust against noisy masks.** |
| GraphMorph centerline extraction | PyTorch, custom model | Topology-aware. Reduces false positives. | Requires additional model in memory. More complex integration. | PyTorch, custom training | Offline | ~200ms estimated | Upgrade path if basic approach fails |
| Heuristic rules only | OpenCV, NumPy | No training data. Immediate. | Brittle. Cannot generalize. | None | Offline | ~50ms | Baseline/fallback for day-1 |

**Revised path tracing pipeline**:

1. Take footpath segmentation mask from Tier 1
2. **Preprocessing**: Gaussian blur (σ=1.5) → binary threshold (Otsu) → morphological closing (5×5 kernel, 2 iterations) → remove small connected components (< 100px area)
3. Skeletonize using Zhang-Suen algorithm
4. **Branch pruning**: Remove skeleton branches shorter than 20 pixels (noise artifacts)
5. Detect endpoints using hit-miss morphological operations (8 kernel patterns)
6. Detect junctions using branch-point kernels
7. Trace path segments between junctions/endpoints
8. For each endpoint: extract **256×256** ROI crop centered on endpoint from original image
9. Feed ROI crop to MobileNetV3-Small binary classifier
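Step 2's cleanup stage could look like the following sketch, with `scipy.ndimage` standing in for the OpenCV calls and the Otsu threshold simplified to a fixed 0.5 cut (the function name is hypothetical; parameters follow the spec above):

```python
import numpy as np
from scipy import ndimage

def clean_mask(soft_mask: np.ndarray, min_area: int = 100) -> np.ndarray:
    """Blur -> threshold -> closing -> drop components smaller than min_area px."""
    blurred = ndimage.gaussian_filter(soft_mask.astype(np.float64), sigma=1.5)
    binary = blurred > 0.5                    # Otsu in production; fixed cut here
    closed = ndimage.binary_closing(binary, structure=np.ones((5, 5)), iterations=2)
    labels, n = ndimage.label(closed)
    if n == 0:
        return closed
    areas = np.bincount(labels.ravel())
    keep = np.zeros_like(areas, dtype=bool)
    keep[1:] = areas[1:] >= min_area          # label 0 is background
    return keep[labels]

# Synthetic mask: one genuine path blob (20x20 = 400 px) plus speckle noise.
m = np.zeros((64, 64))
m[10:30, 10:30] = 1.0        # genuine footpath region
m[50, 50] = 1.0              # isolated noise pixel
cleaned = clean_mask(m)
```

The blur already attenuates single-pixel speckle below the threshold; the component filter then removes any surviving clusters before skeletonization, which is what keeps step 4's pruning workload small.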

**Freshness assessment** (unchanged from draft01, validated approach):
- Edge sharpness, contrast ratio, fill ratio, path width consistency
- Initial hand-tuned thresholds → Random Forest with annotated data

### Component 3: Tier 3 — VLM Deep Analysis (Background Batch Mode)

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| **Moondream 0.5B INT4 demand-loaded (recommended)** | Moondream, ONNX/PyTorch, INT4 | 816 MiB memory. Built-in detect()/point() APIs. Runs on Raspberry Pi. | Weaker reasoning than 2B models. Not aerial-specialized. | ONNX Runtime or PyTorch | Local only | ~2-5s per image (0.5B) | **Best fit for memory-constrained edge.** |
| SmolVLM2-500M | HuggingFace, ONNX | 1.8GB. Small. ONNX support. | Less capable than Moondream for detection. No detect() API. | ONNX Runtime | Local only | ~3-7s estimated | Alternative if Moondream underperforms |
| UAV-VL-R1 (2B) demand-loaded | vLLM, W4A16 | Aerial-specialized. Best reasoning for UAV imagery. | 2.5GB INT8. ~10-21s per analysis. Tight memory fit. | vLLM, W4A16 weights | Local only | ~10-21s | **Upgrade path if Moondream insufficient.** |
| No VLM | N/A | Simplest. Most memory. Zero latency impact. | No fallback for ambiguous CNN outputs. No explanations. | None | N/A | N/A | **Viable MVP if Tier 1+2 accuracy is sufficient.** |

**Demand-loading protocol**:

```
1. VLM queue reaches threshold (≥3 detections or operator request)
2. Scan controller transitions to HOLD state (camera fixed position)
3. Signal main process to unload TRT engine
4. Wait for GPU memory release (~1s)
5. Launch VLM process: load Moondream 0.5B INT4
6. Process all queued detections sequentially (~2-5s each)
7. Collect results, send to operator
8. Unload VLM, release GPU memory
9. Reload TRT engine (~2s)
10. Resume scan from HOLD position

Total cycle: ~30-45s for 3-5 detections
```
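The protocol maps onto a small manager object. The hook names below are placeholders for the real TRT/Moondream load/unload calls; only the batching threshold and the ordering of steps come from the protocol above:

```python
import enum

class VLMState(enum.Enum):
    IDLE = "IDLE"
    ANALYZING = "ANALYZING"

class VLMManager:
    """Demand-load cycle sketch: queue detections, then swap the TRT engine
    out for the VLM, analyze the batch, and swap back."""
    BATCH_THRESHOLD = 3

    def __init__(self, unload_trt, load_vlm, analyze, unload_vlm, reload_trt):
        self.state = VLMState.IDLE
        self.queue = []
        self._hooks = (unload_trt, load_vlm, analyze, unload_vlm, reload_trt)

    def enqueue(self, detection, operator_request: bool = False):
        self.queue.append(detection)
        if operator_request or len(self.queue) >= self.BATCH_THRESHOLD:
            return self._run_batch()
        return None                        # below threshold: keep queueing

    def _run_batch(self):
        unload_trt, load_vlm, analyze, unload_vlm, reload_trt = self._hooks
        self.state = VLMState.ANALYZING
        unload_trt()                       # free GPU memory (~1 s in practice)
        load_vlm()                         # Moondream 0.5B INT4
        results = [analyze(d) for d in self.queue]
        self.queue.clear()
        unload_vlm()
        reload_trt()                       # ~2 s in practice
        self.state = VLMState.IDLE
        return results

calls = []
mgr = VLMManager(
    unload_trt=lambda: calls.append("unload_trt"),
    load_vlm=lambda: calls.append("load_vlm"),
    analyze=lambda d: f"verdict:{d}",
    unload_vlm=lambda: calls.append("unload_vlm"),
    reload_trt=lambda: calls.append("reload_trt"),
)
mgr.enqueue("det1")
mgr.enqueue("det2")
out = mgr.enqueue("det3")                  # third detection triggers the batch
```

Keeping the swap sequence inside one method guarantees the TRT engine is always reloaded before the manager reports IDLE, so the scan controller can key its HOLD/resume transitions off that state.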

**VLM prompting strategy** (adapted for Moondream's capabilities):

Using the detect() API for a fast binary check:
```python
model.detect(image, "concealed military position")
model.detect(image, "dugout covered with branches")
```

Using caption for detailed analysis:
```python
model.caption(image, length="normal")
```

### Component 4: Camera Gimbal Control

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| **Kalman+PID cascade with ViewLink (recommended)** | pyserial, ViewLink V3.3.3, filterpy (Kalman), servopilot (PID) | Compensates UAV attitude drift. Proven in aerospace. Smooth path-following. | More complex than PID-only. Requires IMU data feed. | ViewPro A40, pyserial, IMU data access | Physical only | <10ms command latency | **Best fit. Flight-grade control.** |
| PID-only with ViewLink | pyserial, ViewLink V3.3.3, servopilot | Simple. Works for hovering UAV. | Drifts during flight. Cannot compensate mounting errors. | ViewPro A40, pyserial | Physical only | <10ms | Acceptable for testing only |

**Revised control architecture**:

```
UAV IMU Data ──▶ Kalman Filter ──▶ State Estimate (attitude, angular velocity)
                                           │
Camera Frame ──▶ Detection ──▶ Target Position ──▶ Error Calculation
                                                          │
                                  State Estimate ────────▶│
                                                          │
                                                   PID Controller
                                                          │
                                                   Gimbal Command
                                                          │
                                                   ViewLink Serial
```

- Kalman filter state vector: `[yaw, pitch, yaw_rate, pitch_rate]`
- Measurement inputs: IMU gyroscope (`yaw_rate`, `pitch_rate`), detection-derived angles
- Process model: constant angular velocity with noise
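The filterpy package named in the table provides a ready-made `KalmanFilter`; the same predict/update cycle for this state vector can be sketched directly in NumPy. The noise covariances below are illustrative guesses, not tuned values:

```python
import numpy as np

class GimbalKalman:
    """Constant-angular-velocity Kalman filter over
    x = [yaw, pitch, yaw_rate, pitch_rate]."""
    def __init__(self, dt: float):
        self.x = np.zeros(4)
        self.P = np.eye(4)
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = dt            # angle += rate * dt
        # Measure yaw/pitch from detections and rates from the IMU gyro.
        self.H = np.eye(4)
        self.Q = np.eye(4) * 1e-3                   # process noise (guess)
        self.R = np.diag([0.05, 0.05, 0.01, 0.01])  # measurement noise (guess)

    def step(self, z: np.ndarray) -> np.ndarray:
        # Predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x

# Track a constant 2 deg/s yaw drift through noisy measurements.
kf = GimbalKalman(dt=0.05)
rng = np.random.default_rng(0)
for i in range(100):
    t = i * 0.05
    z = np.array([2.0 * t, 0.0, 2.0, 0.0]) + rng.normal(0, [0.2, 0.2, 0.1, 0.1])
    est = kf.step(z)
```

The PID stage then consumes the filtered angles (target minus `est[0]`/`est[1]`) instead of raw detections, which is what allows the lower, less aggressive gains mentioned below.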

**Scan patterns** (unchanged from draft01): sinusoidal yaw oscillation, POI queue management.

**Path-following** (revised): The Kalman-filtered state estimate provides smoother tracking. PID gains can be lower (less aggressive) because the state estimate is already stabilized. Update rate: tied to detection frame rate.

### Component 5: Integration with Existing Detections Service

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| **Single merged Cython+TRT process + demand-loaded VLM (recommended)** | Cython, TensorRT, ONNX Runtime | Single TRT engine. Minimal memory. VLM isolated. | VLM loading pauses detection (30-45s). | Cython extensions, process management | Process isolation + encryption | Minimal overhead | **Best fit for 5.2GB VRAM.** |

**Revised integration architecture**:

```
┌───────────────────────────────────────────────────────────────────┐
│                    Main Process (Cython + TRT)                    │
│                                                                   │
│  ┌──────────────────────────────────────────────┐                 │
│  │          Single Merged TRT Engine            │                 │
│  │  ├─ Existing YOLO Detection heads            │                 │
│  │  ├─ YOLOE-v8-seg (re-parameterized)          │                 │
│  │  └─ MobileNetV3-Small classifier             │                 │
│  └──────────────────────────────────────────────┘                 │
│                                                                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐ │
│  │ Path Tracing │  │ Scan         │  │  PatchBlock Defense      │ │
│  │ + Skeleton   │  │ Controller   │  │  (CPU parallel)          │ │
│  │ (CPU)        │  │ + Kalman+PID │  │                          │ │
│  └──────────────┘  └──────────────┘  └──────────────────────────┘ │
│                                                                   │
│  ┌──────────────────────────────────────────────┐                 │
│  │                VLM Manager                   │                 │
│  │  state: IDLE | LOADING | ANALYZING | UNLOAD  │                 │
│  │  queue: [detection_1, detection_2, ...]      │                 │
│  └──────────────────────────────────────────────┘                 │
└───────────────────────────────────────────────────────────────────┘

VLM mode (demand-loaded, replaces TRT engine temporarily):
┌───────────────────────────────────────────────────────────────────┐
│  ┌──────────────────────────────────────────────┐                 │
│  │           Moondream 0.5B INT4                │                 │
│  │        (ONNX Runtime or PyTorch)             │                 │
│  └──────────────────────────────────────────────┘                 │
│  Detection paused. Camera in HOLD state.                          │
└───────────────────────────────────────────────────────────────────┘
```
|
||||
|
||||
**Data flow** (revised):
1. PatchBlock preprocesses frame on CPU (parallel with GPU inference)
2. Cleaned frame → merged TRT engine → YOLO detections + YOLOE-v8 semantic detections
3. Semantic detections → path preprocessing → skeletonization → endpoint extraction → CNN
4. High-confidence → operator alert (coordinates + bounding box + confidence)
5. Ambiguous → VLM queue
6. VLM queue management: batch-process when queue ≥ 3 or operator triggers
7. During VLM mode: detection paused, camera holds, operator notified of pause

**GPU scheduling** (revised): No concurrent multi-model GPU sharing. Single TRT engine runs during detection mode. VLM demand-loaded exclusively during analysis mode. This eliminates the 10-40% latency jitter from GPU sharing.
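
The VLM Manager's states and batch trigger described above can be sketched as follows. This is a minimal sketch, not the production module: the load/unload callbacks are hypothetical stand-ins for the real TRT engine and Moondream load/unload routines.

```python
from enum import Enum, auto


class VLMState(Enum):
    IDLE = auto()
    LOADING = auto()
    ANALYZING = auto()
    UNLOADING = auto()


class VLMManager:
    """Demand-loads the VLM: release the TRT engine, run the VLM over the
    queued detections, then reload the TRT engine. Callbacks are injected
    so this sketch stays runtime-agnostic."""

    BATCH_THRESHOLD = 3  # batch-process when the queue reaches 3 detections

    def __init__(self, unload_trt, load_vlm, infer_vlm, unload_vlm, reload_trt):
        self.state = VLMState.IDLE
        self.queue = []
        self._unload_trt = unload_trt
        self._load_vlm = load_vlm
        self._infer_vlm = infer_vlm
        self._unload_vlm = unload_vlm
        self._reload_trt = reload_trt

    def enqueue(self, detection, operator_trigger=False):
        """Queue an ambiguous detection; run a batch when full or on demand."""
        self.queue.append(detection)
        if len(self.queue) >= self.BATCH_THRESHOLD or operator_trigger:
            return self.run_batch()
        return None

    def run_batch(self):
        # Detection is paused for the whole cycle; camera must be in HOLD.
        self.state = VLMState.LOADING
        self._unload_trt()
        self._load_vlm()
        self.state = VLMState.ANALYZING
        results = [self._infer_vlm(d) for d in self.queue]
        self.queue.clear()
        self.state = VLMState.UNLOADING
        self._unload_vlm()
        self._reload_trt()
        self.state = VLMState.IDLE
        return results
```

Because the callbacks are injected, the unload → load → infer → unload → reload ordering can be unit-tested without any GPU present.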

### Component 6: Security

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| **Three-layer security (recommended)** | PatchBlock, LUKS/dm-crypt, tmpfs | Adversarial defense + model protection + data protection | Adds ~5ms CPU overhead for PatchBlock | PatchBlock library, Linux crypto | Full stack | Minimal GPU impact | **Required for military edge deployment.** |

**Layer 1: Adversarial Input Defense**
- PatchBlock CPU preprocessing on every frame before GPU inference
- Detects anomalous patches via outlier detection and dimensionality reduction
- Recovers up to 77% accuracy under adversarial attack
- Runs in parallel with GPU inference (no latency addition to pipeline)

**Layer 2: Model & Weight Protection**
- TRT engine files encrypted at rest using LUKS on a dedicated partition
- At boot: decrypt into tmpfs (RAM disk) — never written to persistent storage unencrypted
- Secure boot chain via Jetson's secure boot (fuse-based, hardware root of trust)
- If device is captured powered-off: encrypted models, no plaintext weights accessible

**Layer 3: Operational Data Protection**
- Captured imagery stored in encrypted circular buffer (last N minutes only)
- Detection logs (coordinates, confidence, timestamps) encrypted at rest
- Over datalink: transmit only coordinates + confidence + small thumbnail (not raw frames)
- Tamper detection: if enclosure opened or unauthorized boot detected → auto-wipe keys + detection logs

## Training & Data Strategy (Revised)

### Phase 1: Zero-shot (Week 1-2)
- Deploy YOLOE-v8-seg with multimodal prompts (text for paths, visual for concealment)
- Use semantic01-04.png as visual prompt references via SAVPE
- Tune confidence thresholds per class type
- Collect false positive/negative data for annotation
- **Benchmark YOLOE-v8-seg TRT on Jetson: confirm inference time, memory, stability**

### Phase 2: Annotation & Fine-tuning (Week 3-8)
- Annotate collected real data (target: 300-500 images/class)
- **Generate 1000+ synthetic images per class using GenCAMO/CamouflageAnything**
- Priority: footpaths (segmentation) → branch piles (bboxes) → entrances (bboxes)
- Active learning: YOLOE zero-shot flags candidates → human reviews → annotates
- Fine-tune YOLOv8-Seg (or YOLO26-Seg if TRT fixed) on real + synthetic dataset
- Use linear probing first, then full fine-tuning

### Phase 3: CNN classifier (Week 4-8, parallel with Phase 2)
- Train MobileNetV3-Small on ROI crops: 256×256 from endpoint analysis
- Positive: annotated concealed positions + synthetic. Negative: natural termini, random terrain
- Target: 200+ real positive + 500+ synthetic positive, 1000+ negative
- Export to TensorRT FP16

### Phase 4: VLM integration (Week 8-12)
- Deploy Moondream 0.5B INT4 in demand-loaded mode
- Test demand-load cycle timing: measure unload → load → infer → unload → reload
- Tune detect() prompts and caption prompts on collected ambiguous cases
- **If Moondream accuracy insufficient: test UAV-VL-R1 (2B) demand-loaded**
- **If YOLO26 TRT bugs fixed: test YOLOE-26-seg as Tier 1 upgrade**

### Phase 5: Seasonal expansion (Month 3+)
- Winter data → spring/summer annotation campaigns
- Re-train all models with multi-season data + seasonal synthetic augmentation

## Testing Strategy

### Integration / Functional Tests
- YOLOE-v8-seg multimodal prompt detection on reference images — verify text+visual fusion
- **TRT engine stability test**: 1000 consecutive inferences without confidence drift
- Path preprocessing pipeline on synthetic noisy masks — verify cleaning + skeletonization
- Branch pruning: verify short spurious branches removed, real path branches preserved
- CNN classifier on known positive/negative 256×256 ROI crops
- **Demand-load VLM cycle**: measure timing of unload TRT → load Moondream → infer → unload → reload TRT
- **Memory monitoring during demand-load**: confirm no memory leak across 10+ cycles
- Kalman+PID gimbal control with simulated IMU data — verify drift compensation
- Full pipeline: frame → PatchBlock → YOLOE-v8 → path tracing → CNN → (VLM) → alert
- Scan controller: Level 1 → Level 2 → HOLD (for VLM) → resume Level 1

### Non-Functional Tests
- Tier 1 latency: YOLOE-v8-seg TRT FP16 ≤15ms on Jetson Orin Nano Super
- Tier 2 latency: preprocessing + skeletonization + CNN ≤200ms
- **VLM demand-load cycle: ≤45s for 3 detections (including load/unload overhead)**
- **Memory profiling: peak detection mode ≤3.5GB GPU, peak VLM mode ≤2.0GB GPU**
- Thermal stress test: 30+ minutes continuous detection without thermal throttling
- PatchBlock adversarial test: inject test adversarial patches, measure accuracy recovery
- False positive/negative rate on annotated reference images
- Gimbal path-following accuracy with and without Kalman filter (measure improvement)
- **Demand-load memory leak test: 50+ VLM cycles without memory growth**

## References

- YOLOE-v8 docs: https://docs.ultralytics.com/models/yoloe/
- YOLOE-26 paper: https://arxiv.org/abs/2602.00168
- YOLO26 TRT confidence bug: https://www.hackster.io/qwe018931/pushing-limits-yolov8-vs-v26-on-jetson-orin-nano-b89267
- YOLO26 INT8 crash: https://github.com/ultralytics/ultralytics/issues/23841
- YOLOE multimodal fusion: https://github.com/ultralytics/ultralytics/pull/21966
- Jetson Orin Nano Super memory: https://forums.developer.nvidia.com/t/jetson-orin-nano-super-insufficient-gpu-memory/330777
- Multi-model survey on Orin Nano: https://dev.to/ankk98/multi-model-ai-resource-allocation-for-humanoid-robots-a-survey-on-jetson-orin-nano-super-310i
- TRT multiple engines: https://github.com/NVIDIA/TensorRT/issues/4358
- TRT memory on Jetson: https://github.com/ultralytics/ultralytics/issues/21562
- Moondream: https://moondream.ai/blog/introducing-moondream-0-5b
- Cosmos-Reason2-2B Jetson benchmark: https://www.thenextgentechinsider.com/pulse/cosmos-reason2-runs-on-jetson-orin-nano-super-with-w4a16-quantization
- Jetson AI Lab benchmarks: https://www.jetson-ai-lab.com/tutorials/genai-benchmarking/
- Jetson LLM bottleneck: https://ericxliu.me/posts/benchmarking-llms-on-jetson-orin-nano/
- vLLM on Jetson: https://learnopencv.com/deployment-on-edge-vllm-on-jetson/
- TRT-LLM no edge support: https://github.com/NVIDIA/TensorRT-LLM/issues/7978
- PatchBlock defense: https://arxiv.org/abs/2601.00367
- Adversarial patches on YOLO: https://link.springer.com/article/10.1007/s10207-025-01067-3
- GenCAMO synthetic data: https://arxiv.org/abs/2601.01181
- CamouflageAnything (CVPR 2025): https://openaccess.thecvf.com/content/CVPR2025/html/Das_Camouflage_Anything_...
- GraphMorph centerlines: https://arxiv.org/pdf/2502.11731
- Learnable skeleton + SAM: https://ui.adsabs.harvard.edu/abs/2025ITGRS..63S1458X
- Kalman filter gimbal: https://ieeexplore.ieee.org/ielx7/6287639/10005208/10160027.pdf
- UAV-VL-R1: https://arxiv.org/pdf/2508.11196
- ViewPro Protocol: https://www.viewprotech.com/index.php?ac=article&at=read&did=510
- servopilot PID: https://pypi.org/project/servopilot/

## Related Artifacts
- AC assessment: `_docs/00_research/00_ac_assessment.md`
- Previous draft: `_docs/01_solution/solution_draft01.md`
@@ -0,0 +1,319 @@
# Solution Draft

## Assessment Findings

| Old Component Solution | Weak Point (functional/security/performance) | New Solution |
|------------------------|----------------------------------------------|-------------|
| YOLO26 as sole detection backbone | **Accuracy regression on custom datasets**: Reported YOLO26s "much less accurate" than YOLO11s on identical training data (GitHub #23206). YOLO26 is 3 months old — less battle-tested than YOLO11. | Benchmark YOLO26 vs YOLO11 on initial annotated data before committing. YOLO11 as fallback. YOLOE supports both backbones (yoloe-11s-seg, yoloe-26s-seg). |
| YOLO26 TensorRT INT8 export | **INT8 export fails on Jetson** (TRT Error Code 2, OOM). Fix merged (PR #23928) but indicates fragile tooling. | Use FP16 only for initial deployment (confirmed stable). INT8 as future optimization after tooling matures. Pin Ultralytics version + JetPack version. |
| vLLM as VLM runtime | **Unstable on Jetson Orin Nano**: system freezes, reboots, installation crashes, excessive memory (multiple open issues). Not production-ready for 8GB devices. | **Replace with NanoLLM/NanoVLM** — purpose-built for Jetson by NVIDIA's Dusty-NV team. Docker containers for JetPack 5/6. Supports VILA, LLaVA. Stable. Or use llama.cpp with GGUF models (proven on Jetson). |
| No storage strategy | **SD card corruption**: Recurring corruption documented across multiple Jetson Orin Nano users. SD cards unsuitable for production. | **Mandatory NVMe SSD** for OS + models + logging. No SD card in production. Ruggedized NVMe mount for vibration resistance. |
| No EMI protection on UART | **ViewPro documents EMI issues**: antennas cause random gimbal panning if within 35cm. Standard UART parity bit insufficient for noisy UAV environment. | Add CRC-16 checksum layer on gimbal commands. Enforce 35cm antenna separation in physical design. Consider shielded UART cable. Command retry on CRC failure (max 3 retries, then log error). |
| No environmental hardening addressed | **UAV environment**: vibration, temperature extremes (-20°C to +50°C), dust, EMI, power fluctuations. Dev kit form factor is not field-deployable. | Use ruggedized carrier board (MILBOX-ORNX or similar) with vibration dampening. Conformal coating on exposed connectors. External temperature sensor for environmental monitoring. |
| No logging or telemetry | **No post-flight review capability**: field system must log all detections with metadata for model iteration, operator review, and evidence collection. | Add detection logging: timestamp, GPS-denied coordinates, confidence score, detection class, JPEG thumbnail, tier that triggered, freshness metadata. Log to NVMe SSD. Export as structured format (JSON lines) after flight. |
| No frame recording for offline replay | **Training data collection depends on field recording**: Without recording, no way to build training dataset from real flights. | Record all camera frames to NVMe at configurable rate (1-5 FPS during Level 1, full rate during Level 2). Include detection overlay option. Post-flight: use recordings for annotation. |
| No power management | **UAV power budget is finite**: Jetson at 15W + gimbal + camera + radio. No monitoring of power draw or load shedding. | Monitor power consumption via Jetson's INA sensors. Power budget alert at 80% of allocated watts. Load shedding: disable VLM first, then reduce inference rate, then disable semantic detection. |
| YOLO26 not validated for this domain | **No benchmark on aerial concealment detection**: All YOLO26 numbers are on COCO/LVIS. Concealment detection may behave very differently. | First sprint deliverable: benchmark YOLOE (both 11 and 26 backbones) on semantic01-04.png with text/visual prompts. Report AP on initial annotated validation set before committing to backbone. |
| Freshness and path tracing are untested algorithms | **No proven prior art**: Both freshness assessment and path-following via skeletonization are novel combinations. Risk of over-engineering before validation. | Implement minimal viable versions first. V1 path tracing: skeleton + endpoint only, no freshness, no junction following. Validate on real flight data before adding complexity. |

## Product Solution Description

A three-tier semantic detection system for identifying concealed/camouflaged positions from reconnaissance UAV aerial imagery, running on Jetson Orin Nano Super with NVMe SSD storage, active cooling, and ruggedized carrier board, alongside the existing YOLO detection pipeline.

```
┌──────────────────────────────────────────────────────────────────────────┐
│ JETSON ORIN NANO SUPER (ruggedized carrier, NVMe, 15W) │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ ViewPro │───▶│ Tier 1 │───▶│ Tier 2 │───▶│ Tier 3 │ │
│ │ A40 │ │ YOLOE │ │ Path Trace │ │ VLM │ │
│ │ Camera │ │ (11 or 26 │ │ + CNN │ │ NanoLLM │ │
│ │ + Frame │ │ backbone) │ │ ≤200ms │ │ (L2 only) │ │
│ │ Quality │ │ TRT FP16 │ │ │ │ ≤5s │ │
│ │ Gate │ │ ≤100ms │ │ │ │ │ │
│ └────▲─────┘ └──────────────┘ └──────────────┘ └───────────┘ │
│ │ │
│ ┌────┴─────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ Gimbal │◀───│ Scan │ │ Watchdog │ │ Recorder │ │
│ │ Control │ │ Controller │ │ + Thermal │ │ + Logger │ │
│ │ + CRC │ │ (L1/L2 FSM) │ │ + Power │ │ (NVMe) │ │
│ └──────────┘ └──────────────┘ └──────────────┘ └───────────┘ │
│ │
│ ┌──────────────────────────────┐ │
│ │ Existing YOLO Detection │ (always running, scene context) │
│ │ Cython + TRT │ │
│ └──────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
```

Key changes from draft02:
- **YOLOE backbone is configurable** (YOLO11 or YOLO26) — benchmark before committing
- **NanoLLM replaces vLLM** as VLM runtime (purpose-built for Jetson, stable)
- **NVMe SSD mandatory** — no SD card in production
- **CRC-16 on gimbal UART** — EMI protection
- **Detection logger + frame recorder** — post-flight review and training data collection
- **Ruggedized carrier board** — vibration, temperature, dust protection
- **Power monitoring + load shedding** — finite UAV power budget
- **FP16 only** for initial deployment (INT8 export unstable on Jetson)
- **Minimal V1 for unproven components** — path tracing and freshness start simple

## Architecture

### Component 1: Tier 1 — Real-Time Detection

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| **YOLOE with configurable backbone (recommended)** | yoloe-11s-seg.pt or yoloe-26s-seg.pt, set_classes() → TRT FP16 | Supports both YOLO11 and YOLO26 backbones. Benchmark on real data, pick winner. set_classes() bakes CLIP embeddings for zero overhead. | YOLO26 may regress on custom data vs YOLO11. Needs empirical comparison. | Ultralytics ≥8.4 (pinned version), TensorRT, JetPack 6.2 | Local only | YOLO11s TRT FP16: ~7ms (640px). YOLO26s: similar or slightly faster. | **Best fit. Hedge against backbone risk.** |

**Version pinning strategy**:
- Pin `ultralytics==8.4.X` (specific patch version validated on Jetson)
- Pin JetPack 6.2 + TensorRT version
- Test every Ultralytics update in staging before deploying to production
- Keep both yoloe-11s-seg and yoloe-26s-seg TRT engines on NVMe; switch via config

**YOLO backbone selection process (Sprint 1)**:
1. Annotate 200 frames from real flight footage (footpaths, branch piles, entrances)
2. Fine-tune YOLOE-11s-seg and YOLOE-26s-seg on same dataset, same hyperparameters
3. Evaluate on held-out validation set (50 frames)
4. Pick backbone with higher mAP50
5. If delta < 2%: pick YOLO26 (faster CPU inference, NMS-free deployment)

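
The decision rule in steps 4-5 can be captured as a small helper. A sketch under stated assumptions: the function name is hypothetical, and the "2% delta" is interpreted as absolute mAP50 points.

```python
def pick_backbone(map50_y11: float, map50_y26: float,
                  tie_margin: float = 0.02) -> str:
    """Sprint 1 backbone decision: higher mAP50 wins, but within the
    tie margin prefer YOLO26 (NMS-free deployment, faster CPU path).

    tie_margin is assumed to mean absolute mAP50 points (0.02 = 2%).
    """
    if abs(map50_y11 - map50_y26) < tie_margin:
        return "yoloe-26s-seg"
    return "yoloe-11s-seg" if map50_y11 > map50_y26 else "yoloe-26s-seg"
```

Encoding the rule keeps the selection auditable: the chosen engine name can go straight into the config that switches between the two TRT engines kept on NVMe.
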
### Component 2: Tier 2 — Spatial Reasoning & CNN Confirmation

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| **V1 minimal path tracing + heuristic classifier (recommended for initial release)** | OpenCV, scikit-image | No training data needed. Skeleton + endpoint detection + simple heuristic: "dark mass at endpoint → flag." Fast to implement and validate. | Low accuracy. Many false positives. | OpenCV, scikit-image | Offline | ~30ms | **V1: ship fast, validate on real data.** |
| **V2 trained CNN (after data collection)** | MobileNetV3-Small, TensorRT FP16 | Higher accuracy after training. Dynamic ROI sizing. | Needs 300+ positive, 1000+ negative annotated ROI crops. | PyTorch, TRT export | Offline | ~5-10ms classification | **V2: replace heuristic once data exists.** |

**V1 heuristic for endpoint analysis** (no training data needed):
1. Skeletonize footpath mask with branch pruning
2. Find endpoints
3. For each endpoint: extract ROI (dynamic size based on GSD)
4. Compute:
   - `mean_darkness` = mean intensity in the ROI's central 50%
   - `contrast` = (surrounding_mean - center_mean) / surrounding_mean
   - `area_ratio` = dark_pixel_count / total_pixels
5. If mean_darkness < threshold_dark AND contrast > threshold_contrast → flag as potential concealed position
6. Thresholds: configurable, tuned per season. Start with winter values.
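
The feature computation and decision in steps 4-5 can be sketched in pure NumPy. This is a minimal sketch, assuming the ROI is already a grayscale crop centred on a skeleton endpoint; the threshold defaults are placeholders to be tuned per season, as step 6 requires.

```python
import numpy as np


def analyze_endpoint_roi(roi: np.ndarray,
                         threshold_dark: float = 60.0,
                         threshold_contrast: float = 0.3,
                         dark_value: float = 80.0):
    """V1 heuristic from steps 4-5. `roi` is a grayscale uint8 crop centred
    on a skeleton endpoint. All thresholds are placeholder values."""
    h, w = roi.shape
    # central 50% window of the ROI
    cy0, cy1 = h // 4, h - h // 4
    cx0, cx1 = w // 4, w - w // 4
    center = roi[cy0:cy1, cx0:cx1].astype(float)
    ring = roi.astype(float)
    ring[cy0:cy1, cx0:cx1] = np.nan          # mask out the centre
    mean_darkness = float(center.mean())
    surrounding_mean = float(np.nanmean(ring))
    contrast = (surrounding_mean - mean_darkness) / max(surrounding_mean, 1e-6)
    area_ratio = float((roi < dark_value).sum()) / roi.size
    flagged = mean_darkness < threshold_dark and contrast > threshold_contrast
    return flagged, {"mean_darkness": mean_darkness,
                     "contrast": contrast, "area_ratio": area_ratio}
```

Returning the raw features alongside the flag makes it easy to log them per detection and later reuse the same crops as labelled training data for the V2 CNN.
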

**V1 freshness** (metadata only, not a filter):
- contrast_ratio of path vs surrounding terrain
- Report as: "high contrast" (likely fresh) / "low contrast" (likely stale)
- No binary classification. Operator sees all detections with freshness tag.

### Component 3: Tier 3 — VLM Deep Analysis

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| **NanoLLM with VILA-2.7B or VILA1.5-3B (recommended)** | NanoLLM Docker container, MLC/TVM quantization | Purpose-built for Jetson by NVIDIA team. Stable Docker containers. Optimized memory management. Supports VLMs natively. | Limited model selection (VILA, LLaVA, Obsidian). Not all VLMs available. | Docker, JetPack 6, NVMe for container storage | Local only, container isolation | ~15-25 tok/s on Orin Nano (4-bit MLC) | **Most stable Jetson VLM option.** |
| llama.cpp with GGUF VLM | llama.cpp, GGUF model files | Lightweight. No Docker needed. Proven stability on Jetson. Wide model support. | Manual build. Less optimized than NanoLLM for Jetson GPU. | llama.cpp build, GGUF weights | Local only | ~10-20 tok/s estimated | **Fallback if NanoLLM doesn't support needed model.** |
| ~~vLLM~~ | ~~vLLM Docker~~ | ~~High throughput~~ | **System freezes, reboots, installation crashes on Orin Nano. Multiple open bugs. Not production-ready.** | N/A | N/A | N/A | **Not recommended.** |

**Model selection for NanoLLM**:
- Primary: VILA1.5-3B (confirmed on Orin Nano, multimodal, 4-bit MLC)
- If UAV-VL-R1 GGUF weights become available: use via llama.cpp (aerial-specialized)
- Fallback: Obsidian-3B (mini VLM, lower accuracy but very fast)

### Component 4: Camera Gimbal Control

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| **ViewLink serial driver + CRC-16 + PID + watchdog (recommended)** | pyserial, crcmod, PID library, threading | Robust communication. CRC catches EMI-corrupted commands. Retry logic. Watchdog. | ViewLink protocol implementation from spec. Physical EMI mitigation required. | ViewPro docs, UART, shielded cable, 35cm antenna separation | Physical only | <10ms command + CRC overhead negligible | **Production-grade.** |

**UART reliability layer**:
```
Packet format: [SOF(2)] [CMD(N)] [CRC16(2)]
- SOF: 0xAA 0x55 (start of frame)
- CMD: ViewLink command bytes per protocol spec
- CRC16: CRC-CCITT over CMD bytes
```
- On send: compute CRC-16, append to ViewLink command packet
- On receive (gimbal feedback): validate CRC-16. Discard corrupted frames.
- On CRC failure (send): retry up to 3 times with 10ms delay. Log failure after 3 retries.
- Note: Check if ViewLink protocol already includes checksums (read full spec first). If so, use native checksum; don't add redundant CRC.
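
The framing and checksum layer above can be sketched with a pure-Python CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF); in production the same checksum is also available from the crcmod library via `crcmod.predefined.mkCrcFun('crc-ccitt-false')`. The function names here are illustrative, and the serial retry loop is omitted.

```python
SOF = b"\xAA\x55"  # start of frame, per the packet format above


def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF), bitwise reference form."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc


def frame_command(cmd: bytes) -> bytes:
    """Build [SOF(2)] [CMD(N)] [CRC16(2)]; CRC is computed over CMD only."""
    return SOF + cmd + crc16_ccitt(cmd).to_bytes(2, "big")


def parse_frame(frame: bytes):
    """Validate SOF and CRC on received feedback.

    Returns the CMD bytes, or None for a corrupted frame (caller discards
    it, and on the send side triggers the 3-retry logic described above).
    """
    if len(frame) < 4 or frame[:2] != SOF:
        return None
    cmd, rx_crc = frame[2:-2], int.from_bytes(frame[-2:], "big")
    return cmd if crc16_ccitt(cmd) == rx_crc else None
```

The standard check value for CRC-16/CCITT-FALSE over `b"123456789"` is 0x29B1, which gives a quick self-test that the implementation (or a crcmod replacement) is configured correctly.
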

**Physical EMI mitigation checklist**:
- [ ] Gimbal UART cable: shielded, shortest possible run
- [ ] Video/data transmitter antenna: ≥35cm from gimbal (ViewPro recommendation)
- [ ] Independent power supply for gimbal (or filtered from main bus)
- [ ] Ferrite beads on UART cable near Jetson connector

### Component 5: Recording, Logging & Telemetry

| Solution | Tools | Advantages | Limitations | Requirements | Security | Performance | Fit |
|----------|-------|-----------|-------------|-------------|----------|------------|-----|
| **NVMe-backed frame recorder + JSON-lines detection logger (recommended)** | OpenCV VideoWriter / JPEG sequences, JSON lines, NVMe SSD | Post-flight review. Training data collection. Evidence. Detection audit trail. | NVMe write bandwidth (~500 MB/s) more than sufficient. Storage: ~2GB/min at full rate (1080p 30 FPS JPEG). | NVMe SSD ≥256GB | Physical access to NVMe | ~5ms per frame write (async) | **Essential for field deployment.** |

**Detection log format** (JSON lines, one per detection):
```json
{
  "ts": "2026-03-19T14:32:01.234Z",
  "frame_id": 12345,
  "gps_denied_lat": 48.123456,
  "gps_denied_lon": 37.654321,
  "tier": 1,
  "class": "footpath",
  "confidence": 0.72,
  "bbox": [0.12, 0.34, 0.45, 0.67],
  "freshness": "high_contrast",
  "tier2_result": "concealed_position",
  "tier2_confidence": 0.85,
  "tier3_used": false,
  "thumbnail_path": "frames/12345_det_0.jpg"
}
```
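
A minimal writer for this format might look as follows. The function name is hypothetical and the field handling is illustrative, not a fixed schema; the caller supplies the per-detection fields shown in the sample record.

```python
import io
import json
from datetime import datetime, timezone


def log_detection(fp, frame_id, det, tier=1):
    """Append one detection as a JSON line to an open text file handle.

    `fp` would be the NVMe-backed log file in production; `det` is a dict
    of per-detection fields (class, confidence, bbox, ...).
    """
    record = {
        "ts": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "frame_id": frame_id,
        "tier": tier,
        **det,
    }
    fp.write(json.dumps(record) + "\n")
    fp.flush()  # minimise loss on power cut; os.fsync would be stronger


# Usage with an in-memory buffer standing in for the NVMe-backed file:
buf = io.StringIO()
log_detection(buf, 12345, {"class": "footpath", "confidence": 0.72,
                           "bbox": [0.12, 0.34, 0.45, 0.67]})
```

One record per line means a truncated final line (e.g. from a power cut) corrupts at most one entry, and post-flight export is a straight file copy.
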

**Frame recording strategy**:
- Level 1: record every 5th frame (1-2 FPS) — overview coverage
- Level 2: record every frame (30 FPS) — detailed analysis footage
- Storage budget: 256GB NVMe ≈ 2 hours at Level 2 full rate, or 10+ hours at Level 1 rate
- Circular buffer: when storage >80% full, overwrite oldest Level 1 frames (keep Level 2)
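
The storage budget above can be sanity-checked with a small helper. A sketch under stated assumptions: the ~1.2 MB average JPEG size is an assumption implied by the 2-hour figure, not a measured value.

```python
def recording_hours(ssd_gb: float, fps: float, frame_mb: float,
                    reserve_frac: float = 0.0) -> float:
    """Rough recording endurance: hours of JPEG footage that fit on the SSD.

    frame_mb is the average compressed frame size (assumed ~1.2 MB for
    1080p); reserve_frac holds back space for OS, models, and logs.
    """
    usable_gb = ssd_gb * (1.0 - reserve_frac)
    gb_per_hour = fps * frame_mb * 3600 / 1024
    return usable_gb / gb_per_hour
```

With these assumptions, 256 GB at Level 2 (30 FPS) comes out near the quoted 2 hours, and Level 1 at 1-2 FPS comfortably exceeds the 10-hour figure.
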

### Component 6: System Health & Resilience

**Monitoring threads**:

| Monitor | Check Interval | Threshold | Action |
|---------|---------------|-----------|--------|
| Thermal (T_junction) | 1s | >75°C | Degrade to Level 1 only |
| Thermal (T_junction) | 1s | >80°C | Disable semantic detection |
| Power (Jetson INA) | 2s | >80% budget | Disable VLM |
| Power (Jetson INA) | 2s | >90% budget | Reduce inference rate to 5 FPS |
| Gimbal heartbeat | 2s | No response | Force Level 1 sweep pattern |
| Semantic process | 5s | No heartbeat | Restart with 5s backoff, max 3 attempts |
| VLM process | 5s | No heartbeat | Mark Tier 3 unavailable, continue Tier 1+2 |
| NVMe free space | 60s | <20% free | Switch to Level 1 recording rate only |
| Frame quality | per frame | Laplacian var < threshold | Skip frame, use buffered good frame |
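
The frame-quality row's Laplacian-variance check can be sketched in pure NumPy, equivalent in spirit to `cv2.Laplacian(gray, cv2.CV_64F).var()`; the threshold value is a placeholder to calibrate on real footage.

```python
import numpy as np


def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour Laplacian over the interior pixels —
    the sharpness proxy used by the frame quality gate. Low variance
    means few edges, i.e. a blurry or defocused frame."""
    g = gray.astype(float)
    lap = (-4.0 * g[1:-1, 1:-1]
           + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return float(lap.var())


def frame_passes_quality_gate(gray: np.ndarray, threshold: float = 100.0) -> bool:
    """Reject blurry frames; the caller then reuses the last buffered good
    frame, per the monitoring table. Threshold is a placeholder."""
    return laplacian_variance(gray) >= threshold
```

Running this on CPU before GPU inference is cheap (a few array slices per frame) and avoids wasting an inference slot on motion-blurred frames during gimbal slews.
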

**Graceful degradation** (4 levels, unchanged from draft02):

| Level | Condition | Capability |
|-------|-----------|-----------|
| 0 — Full | All nominal, T < 70°C | Tier 1+2+3, Level 1+2, gimbal, recording |
| 1 — No VLM | VLM unavailable or T > 75°C or power > 80% | Tier 1+2, Level 1+2, gimbal, recording |
| 2 — No semantic | Semantic crashed 3x or T > 80°C | Existing YOLO only, Level 1 sweep, recording |
| 3 — No gimbal | Gimbal UART failed 3x | Existing YOLO only, fixed camera, recording |
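
The table above maps directly onto a worst-condition-first selection function (hypothetical signature; the watchdog threads would feed it their latest readings):

```python
def degradation_level(t_junction_c: float, power_frac: float,
                      vlm_alive: bool, semantic_crashes: int,
                      gimbal_failures: int) -> int:
    """Map monitored conditions to the degradation level in the table,
    evaluated worst-first so the most severe condition dominates."""
    if gimbal_failures >= 3:
        return 3                       # no gimbal: fixed camera
    if semantic_crashes >= 3 or t_junction_c > 80.0:
        return 2                       # existing YOLO only
    if not vlm_alive or t_junction_c > 75.0 or power_frac > 0.80:
        return 1                       # Tier 1+2, no VLM
    return 0                           # full capability
```

Keeping the mapping in one pure function makes each row of the table a one-line unit test, so the degradation tests listed later can run without hardware.
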

### Component 7: Integration & Deployment

**Hardware BOM additions** (beyond existing system):

| Item | Purpose | Estimated Cost |
|------|---------|----------------|
| NVMe SSD ≥256GB (industrial grade) | OS + models + recording + logging | $40-80 |
| Active cooling fan (30mm+) | Prevent thermal throttling | $10-20 |
| Ruggedized carrier board (e.g., MILBOX-ORNX or custom) | Vibration, temperature, dust protection | $200-500 |
| Shielded UART cable + ferrite beads | EMI protection for gimbal communication | $10-20 |
| Total additional hardware | | ~$260-620 |

**Software deployment**:
- OS: JetPack 6.2 on NVMe SSD
- YOLOE models: TRT FP16 engines on NVMe (both 11 and 26 backbone variants)
- VLM: NanoLLM Docker container on NVMe
- Existing YOLO: current Cython + TRT pipeline (unchanged)
- New Cython modules: semantic detection, gimbal control, scan controller, recorder
- VLM process: separate Docker container, IPC via Unix socket
- Config: YAML file for all thresholds, class names, scan parameters, degradation thresholds

**Version control & update strategy**:
- Pin all dependency versions (Ultralytics, TensorRT, NanoLLM, OpenCV)
- Model updates: swap TRT engine files on NVMe, restart service
- Config updates: edit YAML, restart service
- No over-the-air updates (air-gapped system). USB drive for field updates.

## Training & Data Strategy

### Phase 0: Benchmark Sprint (Week 1-2)
- Deploy YOLOE-26s-seg and YOLOE-11s-seg in open-vocab mode
- Test text/visual prompts on semantic01-04.png + 50 additional frames
- Record results. Pick backbone with better qualitative detection.
- Deploy V1 heuristic endpoint analysis (no CNN, no training data needed)
- First field test flight with recording enabled

### Phase 1: Field validation & data collection (Week 2-6)
- Deploy TRT FP16 engine with best backbone
- Record all flights to NVMe
- Operator marks detections as true/false positive in post-flight review
- Build annotation backlog from recorded frames
- Target: 500 annotated frames by week 6

### Phase 2: Custom model training (Week 6-10)
- Fine-tune YOLOE-Seg on custom dataset (linear probing → full fine-tune)
- Train MobileNetV3-Small CNN on endpoint ROI crops
- A/B test: custom model vs YOLOE zero-shot on validation set
- Deploy winning model as new TRT engine

### Phase 3: VLM & refinement (Week 8-14)
- Deploy NanoLLM with VILA1.5-3B
- Tune prompting on collected ambiguous cases
- Train freshness classifier (if enough annotated freshness labels exist)
- Target: 1500+ images per class

### Phase 4: Seasonal expansion (Month 4+)
- Spring/summer annotation campaigns
- Re-train all models with multi-season data
- Adjust heuristic thresholds per season (configurable via YAML)

## Testing Strategy

### Integration / Functional Tests
- YOLOE text prompt detection on reference images (both 11 and 26 backbones)
- TRT FP16 export on Jetson Orin Nano Super (verify no OOM, no crash)
- V1 heuristic endpoint analysis on 20 synthetic masks (10 with hideouts, 10 without)
- Frame quality gate: inject blurry frames, verify rejection
- Gimbal CRC layer: inject corrupted commands, verify retry + log
- Gimbal watchdog: simulate hang, verify forced Level 1 within 2.5s
- NanoLLM VLM: load model, run inference on 10 aerial images, verify output + memory
- VLM load/unload cycle: 10 cycles without memory leak
- Detection logger: verify JSON-lines format, all fields populated
- Frame recorder: verify NVMe write speed, no dropped frames at 30 FPS
- Full pipeline end-to-end on recorded flight footage (offline replay)
- Graceful degradation: simulate each failure mode, verify correct degradation level

### Non-Functional Tests
- Tier 1 latency on Jetson Orin Nano Super TRT FP16: ≤100ms (both backbones)
- Tier 2 latency (V1 heuristic): ≤50ms. (V2 CNN): ≤200ms
- Tier 3 latency (NanoLLM VLM): ≤5 seconds
- Memory peak: all components loaded < 7GB
- Thermal: 60-minute sustained inference, T_junction < 75°C with active cooling
- NVMe endurance: continuous recording for 2 hours, verify no write errors
- Power draw: measure at each degradation level, verify within UAV power budget
- EMI test: operate near data transmitter antenna, verify no gimbal anomalies with CRC layer
- Cold start: power on → first detection within 60 seconds (model load time)
- Vibration: mount Jetson on vibration table, run inference, compare detection accuracy vs static

## Technology Maturity Assessment

| Component | Technology | Maturity | Risk | Mitigation |
|-----------|-----------|----------|------|------------|
| Tier 1 Detection | YOLOE/YOLO26/YOLO11 | **Medium** — YOLO26 is 3 months old, reported regressions on custom data. YOLOE-26 even newer. | Medium | Benchmark both backbones. YOLO11 is battle-tested fallback. Pin versions. |
| TRT FP16 Export | TensorRT on JetPack 6.2 | **High** — FP16 is stable on Jetson. Well-documented. | Low | FP16 only. Avoid INT8 initially. |
| TRT INT8 Export | TensorRT on JetPack 6.2 | **Low** — Documented crashes (issue #23841; fix merged in PR #23928). Calibration issues. | High | Defer to Phase 3+. FP16 sufficient for now. |
| VLM (NanoLLM) | NanoLLM + VILA-3B | **Medium-High** — Purpose-built for Jetson by NVIDIA team. Docker-based. Monthly releases. | Low | More stable than vLLM. Use Docker containers. |
| VLM (vLLM) | vLLM on Jetson | **Low** — System freezes, crashes, open bugs. | **High** | **Do not use.** NanoLLM instead. |
| Path Tracing | Skeletonization + OpenCV | **High** — Decades-old algorithms. Well-understood. | Low | Pruning needed for noisy inputs. |
| CNN Classifier | MobileNetV3-Small + TRT | **High** — Proven architecture. TRT FP16 stable. | Low | Standard transfer learning. |
| Gimbal Control | ViewLink Serial Protocol | **Medium** — Protocol documented. ArduPilot driver exists. | Medium | EMI mitigation critical. CRC layer. |
| Freshness Assessment | Novel heuristic | **Low** — No prior art. Experimental. | High | V1: metadata only, not a filter. Iterate with data. |
| NVMe Storage | Industrial NVMe on Jetson | **High** — Production standard. SD card alternative is unreliable. | Low | Use industrial-grade SSD. |
| Ruggedized Hardware | MILBOX-ORNX or custom | **High** — Established product. Designed for Jetson + UAV. | Low | Standard procurement. |
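The "Path Tracing" row pairs skeletonization with endpoint analysis: on an already-skeletonized binary mask, an endpoint is simply a skeleton pixel with exactly one 8-connected neighbour. A pure-Python sketch of that second step; in the real pipeline the mask would come from `skimage.morphology.skeletonize`, and the toy grid below merely stands in for it:

```python
# Endpoint analysis on a skeletonized path mask: a skeleton pixel with
# exactly one 8-connected neighbour is a path endpoint. In production
# the mask would come from skimage.morphology.skeletonize; this small
# hand-drawn grid is a stand-in.

def path_endpoints(skel):
    """Return (row, col) of skeleton pixels with exactly one neighbour."""
    rows, cols = len(skel), len(skel[0])
    ends = []
    for r in range(rows):
        for c in range(cols):
            if not skel[r][c]:
                continue
            neighbours = sum(
                skel[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            )
            if neighbours == 1:
                ends.append((r, c))
    return ends

# A simple L-shaped footpath skeleton: endpoints at (0, 2) and (3, 0).
skeleton = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
]
```

The same neighbour count distinguishes junctions (three or more neighbours), which is where the pruning mentioned in the mitigation column would act on noisy spurs.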
## References

- YOLO26 docs: https://docs.ultralytics.com/models/yolo26/
- YOLOE docs: https://docs.ultralytics.com/models/yoloe/
- YOLO26 accuracy regression: https://github.com/ultralytics/ultralytics/issues/23206
- YOLO26 TRT INT8 crash: https://github.com/ultralytics/ultralytics/issues/23841
- YOLO26 TRT OOM fix: https://github.com/ultralytics/ultralytics/pull/23928
- vLLM Jetson freezes: https://github.com/dusty-nv/jetson-containers/issues/800
- vLLM Jetson install crash: https://github.com/vllm-project/vllm/issues/23376
- NanoLLM docs: https://dusty-nv.github.io/NanoLLM/
- NanoVLM: https://jetson-ai-lab.com/tutorial_nano-vlm.html
- MILBOX-ORNX: https://forecr.io/products/jetson-orin-nx-orin-nano-rugged-compact-pc-milbox-ornx
- Jetson SD card corruption: https://forums.developer.nvidia.com/t/corrupted-sd-cards/265418
- Jetson thermal throttling: https://www.alibaba.com/product-insights/how-to-run-private-llama-3-inference-on-a-200-jetson-orin-nano-without-thermal-throttling.html
- ViewPro EMI issues: https://www.viewprouav.com/help/gimbal/
- UART CRC reliability: https://link.springer.com/chapter/10.1007/978-981-19-8563-8_23
- Military edge AI thermal: https://www.mobilityengineeringtech.com/component/content/article/53967
- UAV-VL-R1: https://arxiv.org/pdf/2508.11196
- Qwen2-VL-2B-GPTQ-INT8: https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int8
- Ultralytics Jetson guide: https://docs.ultralytics.com/guides/nvidia-jetson
- Skelite: https://arxiv.org/html/2503.07369v1
- YOLO FP reduction: https://docs.ultralytics.com/yolov5/tutorials/tips_for_best_training_results
# Tech Stack Evaluation

## Requirements Analysis

### Functional Requirements

- Real-time open-vocabulary detection from UAV aerial imagery
- Footpath segmentation and path tracing with endpoint analysis
- Binary concealment classification on ROI crops
- On-demand VLM analysis for ambiguous detections
- Camera gimbal control with path-following
- Integration with existing Cython+TRT YOLO pipeline
### Non-Functional Requirements

- Tier 1 inference ≤15ms, Tier 2 ≤200ms
- 5.2GB usable VRAM budget (Jetson Orin Nano Super 8GB)
- Field-deployable: thermal resilience, tamper protection
- Offline operation (no cloud dependency)
### Constraints

- Jetson Orin Nano Super: 67 TOPS INT8, 8GB LPDDR5 unified, 102 GB/s bandwidth
- JetPack 6.2, CUDA 12.6, TensorRT 10.3
- Existing codebase: Cython + TensorRT (must extend, not replace)
- ViewPro A40 camera with ViewLink Serial Protocol V3.3.3
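The 5.2GB budget can be sanity-checked with a simple ledger before any model is committed. Only the Moondream figure (816 MiB) comes from the VLM evaluation later in this document; the other entries are illustrative planning estimates, not measurements:

```python
# Rough VRAM ledger against the 5.2 GB usable budget on the 8 GB Orin
# Nano Super. Only the Moondream figure (816 MiB) is taken from this
# document; the remaining entries are illustrative planning estimates.

BUDGET_MIB = 5.2 * 1024

resident = {
    "yoloe_v8_seg_trt_fp16": 800,   # estimate: detection engine + activations
    "mobilenetv3_small_trt": 50,    # estimate: classifier engine + activations
    "cuda_context_overhead": 600,   # estimate: CUDA/TRT runtime overhead
}
demand_loaded = {
    "moondream_0p5b_int4": 816,     # from the VLM evaluation table
}

# Worst case is all resident engines plus the demand-loaded VLM in
# memory at once, which is exactly the moment the budget must hold.
worst_case = sum(resident.values()) + sum(demand_loaded.values())
headroom = BUDGET_MIB - worst_case
```

Repeating this check with measured numbers after each model export is cheap insurance on a unified-memory device, where the GPU budget also competes with the OS.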
## Technology Evaluation

### Detection Framework

| Option | Fitness | Maturity | Security | Team Fit | Cost | Scalability | Score |
|--------|---------|----------|----------|----------|------|-------------|-------|
| **YOLOE-v8-seg (Ultralytics)** | 9/10 — open-vocab + segmentation | 9/10 — YOLOv8 TRT proven | 7/10 — PatchBlock compatible | 9/10 — existing Cython+TRT expertise | Free | 8/10 | **8.5** |
| YOLOE-26-seg (Ultralytics) | 10/10 — latest arch, NMS-free | 4/10 — TRT bugs on Jetson | 7/10 | 7/10 — new arch, less familiar | Free | 9/10 | **6.5** |
| YOLO-World v2 | 7/10 — open-vocab, no seg | 7/10 — stable but older | 7/10 | 8/10 | Free | 7/10 | **7.0** |

**Selected**: YOLOE-v8-seg. Upgrade path to YOLOE-26 when TRT issues resolved.
### CNN Classifier

| Option | Fitness | Maturity | Security | Team Fit | Cost | Scalability | Score |
|--------|---------|----------|----------|----------|------|-------------|-------|
| **MobileNetV3-Small** | 9/10 — binary classification, tiny | 10/10 — battle-tested | 8/10 | 9/10 | Free | 8/10 | **9.0** |
| EfficientNet-B0 | 8/10 — slightly more accurate | 10/10 | 8/10 | 8/10 | Free | 7/10 — larger | **8.0** |
| ResNet-18 | 7/10 — overkill for binary | 10/10 | 8/10 | 9/10 | Free | 6/10 — 44MB | **7.5** |

**Selected**: MobileNetV3-Small. ~5MB as a TRT FP16 engine (~2.5M params). Best size/accuracy trade-off.
### VLM

| Option | Fitness | Maturity | Security | Team Fit | Cost | Scalability | Score |
|--------|---------|----------|----------|----------|------|-------------|-------|
| **Moondream 0.5B INT4** | 7/10 — detect()/point() APIs | 7/10 — active development | 8/10 — local only | 7/10 — new, learning curve | Free | 9/10 — 816 MiB | **7.5** |
| SmolVLM2-500M | 6/10 — no detect API | 6/10 — newer | 8/10 | 6/10 | Free | 8/10 — 1.8GB | **6.5** |
| UAV-VL-R1 2B | 9/10 — aerial-specialized | 5/10 — not tested on Jetson | 8/10 | 5/10 | Free | 4/10 — 2.5GB | **5.5** |
| No VLM (MVP) | 5/10 — no fallback | 10/10 | 10/10 | 10/10 | Free | 10/10 | **8.0** |

**Selected**: Moondream 0.5B for VLM tier. "No VLM" as MVP fallback if Moondream insufficient.
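"On-demand VLM analysis for ambiguous detections" (the functional requirement above) implies a confidence gate between the classifier and the VLM tier: only mid-confidence results pay the VLM cost. A minimal sketch; the band edges are illustrative tuning parameters, not values from this document:

```python
# Routing gate for on-demand VLM analysis: only detections whose
# classifier confidence falls inside an ambiguous band are queued for
# the Moondream tier. Band edges are illustrative tuning parameters.

AMBIGUOUS_LOW, AMBIGUOUS_HIGH = 0.35, 0.75

def route(confidence: float) -> str:
    """Decide what to do with one classifier result."""
    if confidence >= AMBIGUOUS_HIGH:
        return "accept"   # confident positive: report directly
    if confidence <= AMBIGUOUS_LOW:
        return "reject"   # confident negative: drop
    return "vlm"          # ambiguous: send the ROI crop to the VLM tier
```

Widening the band trades Tier-2 latency for fewer missed concealments; the thresholds should be calibrated on the annotated validation set.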
### VLM Runtime

| Option | Fitness | Maturity | Security | Team Fit | Cost | Scalability | Score |
|--------|---------|----------|----------|----------|------|-------------|-------|
| **ONNX Runtime** | 8/10 — lightweight, cross-platform | 9/10 | 8/10 | 8/10 | Free | 9/10 | **8.5** |
| vLLM | 7/10 — server-oriented, overkill for 0.5B | 4/10 — documented freezes and install crashes on Jetson Orin | 7/10 | 6/10 — complex setup | Free | 7/10 | **6.0** |
| PyTorch direct | 7/10 — simplest integration | 10/10 | 8/10 | 9/10 | Free | 6/10 — no optimization | **7.5** |
| MLC-LLM | 6/10 — declining adoption | 5/10 | 7/10 | 5/10 | Free | 7/10 | **5.5** |

**Selected**: ONNX Runtime for Moondream 0.5B. Lightweight, no server overhead.
### Gimbal Control

| Option | Fitness | Maturity | Security | Team Fit | Cost | Scalability | Score |
|--------|---------|----------|----------|----------|------|-------------|-------|
| **filterpy (Kalman) + servopilot (PID)** | 9/10 — cascade control | 8/10 — proven libraries | 8/10 | 7/10 — Kalman is new | Free | 8/10 | **8.0** |
| Custom Kalman + PID from scratch | 8/10 | 5/10 — unproven | 8/10 | 6/10 | Free | 7/10 | **6.5** |
| PID only (servopilot) | 6/10 — no drift compensation | 9/10 | 8/10 | 9/10 | Free | 7/10 | **7.5** |

**Selected**: filterpy + servopilot cascade.
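The selected cascade works in two stages: a Kalman filter smooths the noisy pixel-error measurement of the tracked target, and an anti-windup PID converts the smoothed error into a gimbal rate command. A one-axis, pure-Python sketch of that structure, standing in for what `filterpy.kalman.KalmanFilter` and servopilot would provide; all gains and noise parameters are illustrative:

```python
# One-axis sketch of the selected cascade: a scalar Kalman filter
# (filterpy's role) smooths the measured pixel error, and a clamped
# anti-windup PID (servopilot's role) turns it into a gimbal rate
# command. Gains and noise parameters are illustrative.

class ScalarKalman:
    """1-D Kalman filter with a static state model: x' = x, z = x + noise."""
    def __init__(self, q=1e-3, r=0.25):
        self.x, self.p, self.q, self.r = 0.0, 1.0, q, r
    def update(self, z):
        self.p += self.q                   # predict: inflate variance
        k = self.p / (self.p + self.r)     # Kalman gain
        self.x += k * (z - self.x)         # correct toward measurement
        self.p *= (1.0 - k)
        return self.x

class PID:
    """PID with integral clamping (anti-windup) and output limiting."""
    def __init__(self, kp=0.8, ki=0.1, kd=0.05, limit=30.0):
        self.kp, self.ki, self.kd, self.limit = kp, ki, kd, limit
        self.i, self.prev = 0.0, 0.0
    def step(self, err, dt=0.02):
        self.i = max(-self.limit, min(self.limit, self.i + err * dt))
        d = (err - self.prev) / dt
        self.prev = err
        out = self.kp * err + self.ki * self.i + self.kd * d
        return max(-self.limit, min(self.limit, out))

kf, pid = ScalarKalman(), PID()
# Noisy, shrinking pixel-error samples for one axis; the rate command
# should decay as the target is centered.
commands = [pid.step(kf.update(z)) for z in (12.0, 9.5, 10.5, 6.0, 3.0, 1.0)]
```

The Kalman stage is what gives the cascade its drift compensation over the PID-only option: the PID acts on a filtered error rather than raw, EMI-prone measurements.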
### Adversarial Defense

| Option | Fitness | Maturity | Security | Team Fit | Cost | Scalability | Score |
|--------|---------|----------|----------|----------|------|-------------|-------|
| **PatchBlock** | 9/10 — designed for edge YOLO | 7/10 — 2026 paper | 9/10 | 7/10 | Free | 9/10 — CPU-based | **8.0** |
| Custom input validation | 5/10 — ad-hoc | 3/10 | 6/10 | 8/10 | Free | 7/10 | **5.5** |
| None | 0/10 | 10/10 | 0/10 | 10/10 | Free | 10/10 | **3.0** |

**Selected**: PatchBlock. Integrate as CPU preprocessing step.
### Synthetic Data Generation

| Option | Fitness | Maturity | Security | Team Fit | Cost | Scalability | Score |
|--------|---------|----------|----------|----------|------|-------------|-------|
| **CamouflageAnything** | 8/10 — CVPR 2025, camouflage-specific | 7/10 | 8/10 | 6/10 | Free | 8/10 | **7.5** |
| GenCAMO | 8/10 — environment-aware, 2026 | 6/10 — newer | 8/10 | 6/10 | Free | 8/10 | **7.0** |
| Cut-paste augmentation | 6/10 — simple but effective | 10/10 | 8/10 | 9/10 | Free | 7/10 | **7.5** |

**Selected**: CamouflageAnything (primary) + cut-paste (supplementary).
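The supplementary cut-paste augmentation is simple enough to sketch directly: a masked foreground cutout (e.g. a concealed-object crop with its segmentation mask) is pasted into a background frame at a chosen offset. A toy version on small integer grids; production code would do the same with OpenCV/NumPy on RGB arrays:

```python
# Toy sketch of cut-paste augmentation: paste a masked foreground
# patch (e.g. a camouflaged-object cutout) into a background image at
# a given offset. Real use would operate on OpenCV/NumPy RGB frames;
# small 2-D integer grids stand in for images here.

def cut_paste(background, patch, mask, top, left):
    """Return a copy of background with patch pasted where mask == 1."""
    out = [row[:] for row in background]       # leave the original intact
    for r, mask_row in enumerate(mask):
        for c, m in enumerate(mask_row):
            if m:
                out[top + r][left + c] = patch[r][c]
    return out

bg = [[0] * 5 for _ in range(4)]
patch = [[7, 7], [7, 7]]
mask = [[1, 0], [1, 1]]    # irregular cutout, as from a segmentation mask
aug = cut_paste(bg, patch, mask, top=1, left=2)
```

Randomising the offset, scale, and background per sample, and recording the pasted mask as the new ground-truth label, is what turns this into a training-data generator.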
## Tech Stack Summary

| Layer | Technology | Version | Justification |
|-------|-----------|---------|---------------|
| Hardware | Jetson Orin Nano Super | 8GB | Existing constraint |
| OS / SDK | JetPack | 6.2 | Latest for Orin Nano Super |
| GPU Runtime | TensorRT | 10.3 (FP16) | Existing pipeline, proven stability |
| Detection | YOLOE-v8-seg | Ultralytics ≥8.4 | Stable TRT, open-vocab + segmentation |
| Classifier | MobileNetV3-Small | torchvision → TRT FP16 | Tiny footprint, binary classification |
| VLM | Moondream 0.5B | INT4, ONNX | 816 MiB, detect()/point() APIs |
| VLM Runtime | ONNX Runtime | ≥1.17 | Lightweight, no server overhead |
| Path Tracing | OpenCV + scikit-image | OpenCV 4.x, skimage 0.22+ | Preprocessing + skeletonization |
| Gimbal Kalman | filterpy | ≥1.4 | Kalman filter state estimation |
| Gimbal PID | servopilot | latest | Anti-windup PID, dual-axis |
| Serial | pyserial | ≥3.5 | ViewLink protocol communication |
| Adversarial Defense | PatchBlock | 2026 release | CPU-based, edge-optimized |
| Synthetic Data | CamouflageAnything | CVPR 2025 | Camouflage-specific generation |
| Encryption | LUKS / dm-crypt | Linux kernel | Model weight encryption at rest |
| Core Language | Cython + Python | 3.10+ | Existing codebase extension |
## Risk Assessment

| Technology | Risk | Mitigation |
|-----------|------|------------|
| YOLOE-v8-seg | Lower accuracy than YOLOE-26 | Monitor YOLO26 TRT fix; upgrade when stable |
| Moondream 0.5B | Untested for aerial concealment | Empirical testing Week 8; fallback to no-VLM MVP |
| PatchBlock | New (2026), limited field testing | Can disable if causes false positives; low integration risk |
| filterpy Kalman | Team unfamiliar | Well-documented library; standard aerospace algorithm |
| CamouflageAnything | Synthetic-to-real domain gap | Supplement with real data; validate FP/FN rates |
| Demand-loaded VLM | 30-45s detection pause | Batch requests; operator-triggered only; async notification |
| ONNX Runtime on Jetson | Less optimized than TRT for vision models | For 0.5B model, ONNX overhead is acceptable |
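The mitigation for the demand-loaded VLM row (batch requests, operator-triggered, async notification) amounts to a small queueing component: ambiguous ROIs accumulate, and the expensive model load happens once per operator-triggered flush instead of once per detection. A minimal sketch, where `analyze` stands in for the real Moondream call:

```python
# Sketch of the demand-loaded VLM mitigation: ambiguous ROIs queue up
# and the expensive model load is paid once per operator-triggered
# flush, amortising the 30-45 s pause over the whole batch. analyze()
# is a stand-in for the real Moondream call.

class VlmBatcher:
    def __init__(self, analyze):
        self.analyze = analyze   # callable: roi -> result
        self.pending = []
        self.loads = 0           # how many times the VLM was (re)loaded

    def submit(self, roi):
        """Queue an ambiguous detection; returns current queue depth."""
        self.pending.append(roi)
        return len(self.pending)

    def flush(self):
        """Operator-triggered: load the model once, analyse the batch."""
        if not self.pending:
            return []
        self.loads += 1          # one load pause per batch, not per ROI
        results = [self.analyze(roi) for roi in self.pending]
        self.pending.clear()
        return results

batcher = VlmBatcher(analyze=lambda roi: f"analysed:{roi}")
batcher.submit("roi_a")
batcher.submit("roi_b")
batch_results = batcher.flush()
```

In deployment, `flush` would run on a background thread and push its results through the async notification path so Tier 1 detection never blocks on the VLM.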
## Learning Requirements

| Technology | Effort | Who |
|-----------|--------|-----|
| YOLOE visual prompts (SAVPE) | Low — API-based | Detection engineer |
| Moondream detect()/caption() | Low — simple API | ML engineer |
| filterpy Kalman filter | Medium — state estimation theory | Controls engineer |
| PatchBlock integration | Low — preprocessing module | Detection engineer |
| CamouflageAnything pipeline | Medium — generative model setup | Data engineer |
| LUKS encryption + secure boot | Medium — Linux security | DevOps / platform |