mirror of
https://github.com/azaion/detections-semantic.git
synced 2026-04-22 08:16:38 +00:00
8e2ecf50fd
Made-with: Cursor
# Solution Draft

## Product Solution Description

A three-tier semantic detection system for identifying concealed/camouflaged positions from reconnaissance UAV aerial imagery, running on Jetson Orin Nano Super alongside the existing YOLO detection pipeline.

```
┌─────────────────────────────────────────────────────────────────────────┐
│ JETSON ORIN NANO SUPER │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ ViewPro │───▶│ Tier 1 │───▶│ Tier 2 │───▶│ Tier 3 │ │
│ │ A40 │ │ YOLO26-Seg │ │ Spatial │ │ VLM │ │
│ │ Camera │ │ + YOLOE-26 │ │ Reasoning │ │ UAV-VL │ │
│ │ │ │ ≤100ms │ │ + CNN │ │ -R1 │ │
│ │ │ │ │ │ ≤200ms │ │ ≤5s │ │
│ └────▲─────┘ └──────────────┘ └──────────────┘ └──────────┘ │
│ │ │
│ ┌────┴─────┐ ┌──────────────┐ │
│ │ Gimbal │◀───│ Scan │ │
│ │ Control │ │ Controller │ │
│ │ Module │ │ (L1/L2) │ │
│ └──────────┘ └──────────────┘ │
│ │
│ ┌──────────────────────────────┐ │
│ │ Existing YOLO Detection │ (separate service, provides context) │
│ │ Cython + TRT │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
```

The system operates in two scan levels:

- **Level 1 (Wide Sweep)**: Camera at medium zoom, sweeping left-right. YOLOE-26 text/visual prompts detect POIs in real time. The existing YOLO provides scene context.
- **Level 2 (Detailed Scan)**: Camera zooms into a POI. Spatial reasoning traces footpaths and finds endpoints. A CNN classifies potential hideouts. An optional VLM provides deep analysis for ambiguous cases.

Three submodules: (1) Semantic Detection AI, (2) Camera Gimbal Control, (3) Integration with the existing detections service.

## Existing/Competitor Solutions Analysis

No direct commercial or open-source competitor exists for this specific combination of requirements (concealed position detection from UAV with edge inference). Related work:

| Solution | Approach | Limitations for This Use Case |
|----------|----------|-------------------------------|
| Standard YOLO object detection | Bounding box classification of known object types | Cannot detect camouflaged/concealed targets without explicit visual features |
| CAMOUFLAGE-Net (YOLOv7-based) | Attention mechanisms + ELAN for camouflage detection | Designed for ground-level imagery, not aerial; academic datasets only |
| Open-Vocabulary Camouflaged Object Segmentation | VLM + SAM cascaded segmentation | Too slow for real-time edge inference; requires cloud GPU |
| UAV-YOLO12 | Multi-scale road segmentation from UAV imagery | Roads only, no concealment reasoning |
| FootpathSeg (DINO-MC + UNet) | Footpath segmentation with self-supervised learning | Pedestrian context, not military aerial; no path-following logic |
| YOLO-World / YOLOE | Open-vocabulary detection | Closest fit — YOLOE-26 is our primary Tier 1 mechanism |

**Key insight**: No existing solution combines footpath detection + path tracing + endpoint analysis + concealment classification in a single pipeline. This requires a custom multi-stage system.

## Architecture

### Component 1: Tier 1 — Real-Time Detection (YOLOE-26 + YOLO26-Seg)

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **YOLOE-26 text/visual prompts (recommended)** | yoloe-26s-seg.pt, Ultralytics, TensorRT | Zero-shot detection from day 1. Text prompts: "footpath", "branch pile", "dark entrance". Visual prompts: reference images of hideouts. Zero overhead when re-parameterized. NMS-free. | Open-vocabulary accuracy lower than custom-trained model. Text prompts may not capture all concealment patterns. | Ultralytics ≥8.4, TensorRT, JetPack 6.2 | Model weights stored locally, no cloud | Free (open source) | **Best fit for bootstrapping. Immediate capability.** |
| YOLO26-Seg custom-trained | yolo26s-seg.pt fine-tuned on custom dataset | Higher accuracy for known classes after training. Instance segmentation masks for footpaths. | Requires annotated training data (1500+ images/class). No zero-shot capability. | Custom dataset, GPU for training | Same | Free (open source) + annotation labor | **Best fit for production after data collection.** |
| UNet + MambaOut | PyTorch, TensorRT | Best published accuracy for trail segmentation from aerial photos. | Separate model, additional memory. No built-in detection head. | Custom integration | Same | Free (open source) | Backup option if YOLO26-Seg underperforms on trails |

**Recommended approach**: Start with YOLOE-26 text/visual prompts. A parallel annotation effort builds the custom dataset. Transition to fine-tuned YOLO26-Seg once data is sufficient. YOLOE-26's zero-shot capability provides immediate usability.

**YOLOE-26 configuration for this project**:

Text prompts for Level 1 detection:

- `"footpath"`, `"trail"`, `"path in snow"`, `"road"`, `"track"`
- `"pile of branches"`, `"tree branches"`, `"camouflage netting"`
- `"dark entrance"`, `"hole"`, `"dugout"`, `"dark opening"`
- `"tree row"`, `"tree line"`, `"group of trees"`
- `"clearing"`, `"open area near forest"`

Visual prompts: Annotated reference images (semantic01-04.png) serve as visual prompt sources. Crop ROIs around known hideouts and use them as references for SAVPE visual-prompted detection.

API usage:

```python
from ultralytics import YOLOE

model = YOLOE("yoloe-26s-seg.pt")
model.set_classes(["footpath", "branch pile", "dark entrance", "tree row", "road"])
results = model.predict(frame, conf=0.15)  # low threshold, high recall
```

Visual prompt usage (continuing from the snippet above; `x1, y1, x2, y2` is the reference box around a known hideout in `refer_image` coordinates):

```python
import numpy as np

results = model.predict(
    frame,
    refer_image="reference_hideout.jpg",
    visual_prompts={"bboxes": np.array([[x1, y1, x2, y2]]), "cls": np.array([0])},
)
```

### Component 2: Tier 2 — Spatial Reasoning & CNN Confirmation

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **Path tracing + CNN classifier (recommended)** | OpenCV skeletonization, MobileNetV3-Small TensorRT | Fast (<200ms total). Path tracing from segmentation masks. Binary classifier: "concealed position yes/no" on ROI crop. Well-understood algorithms. | Requires custom annotated data for CNN classifier. Path tracing quality depends on segmentation quality. | OpenCV, scikit-image, PyTorch → TRT | Offline inference | Free + annotation labor | **Best fit. Modular, fast, interpretable.** |
| Heuristic rules only | OpenCV, NumPy | No training data needed. Rule-based: "if footpath ends at dark mass → flag." | Brittle. Hard to tune. Cannot generalize across seasons/terrain. | None | Offline | Free | Baseline/fallback for initial version |
| End-to-end custom model | PyTorch, TensorRT | Single model handles everything. | Requires massive training data. Black box. Hard to debug. | Large annotated dataset | Offline | Free + GPU time | Not recommended for initial release |

**Path tracing algorithm**:

1. Take the footpath segmentation mask from Tier 1
2. Skeletonize using the Zhang-Suen algorithm (`skimage.morphology.skeletonize`)
3. Detect endpoints using hit-miss morphological operations (8 kernel patterns)
4. Detect junctions using branch-point kernels
5. Trace path segments between junctions/endpoints
6. For each endpoint: extract a 128×128 ROI crop centered on the endpoint from the original image
7. Feed the ROI crop to the MobileNetV3-Small binary classifier: "concealed structure" vs "natural terminus"

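The steps above can be sketched with scikit-image and SciPy. This is a minimal illustration, not the production implementation: endpoint/junction detection here uses a neighbor-count convolution, a common equivalent to the 8 hit-miss kernels in step 3, and `trace_path`/`endpoint_roi` are hypothetical helper names.

```python
import numpy as np
from scipy.ndimage import convolve
from skimage.morphology import skeletonize

def trace_path(mask: np.ndarray):
    """mask: binary footpath segmentation mask (H, W)."""
    skel = skeletonize(mask.astype(bool))
    # Weight the centre pixel by 10 so the convolved value encodes
    # "on-pixel + neighbour count": 11 = endpoint, >= 13 = junction.
    kernel = np.array([[1, 1, 1], [1, 10, 1], [1, 1, 1]])
    conv = convolve(skel.astype(int), kernel, mode="constant")
    endpoints = np.argwhere(conv == 11)
    junctions = np.argwhere(conv >= 13)
    return skel, endpoints, junctions

def endpoint_roi(image: np.ndarray, endpoint, size: int = 128):
    """size x size crop centred on an endpoint, clamped to image bounds."""
    y, x = endpoint
    h, w = image.shape[:2]
    y0 = max(0, min(y - size // 2, h - size))
    x0 = max(0, min(x - size // 2, w - size))
    return image[y0:y0 + size, x0:x0 + size]
```

Each ROI crop then goes to the binary classifier as in step 7.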
**Freshness assessment approach**:

Visual features for fresh vs stale classification (binary classifier on path ROI):

- Edge sharpness (Laplacian variance on path boundary)
- Contrast ratio (path intensity vs surrounding terrain)
- Fill ratio (percentage of path area with snow/vegetation coverage)
- Path width consistency (fresh paths have more uniform width)

Implementation: Extract these features per path segment and feed them to a lightweight classifier (Random Forest or small CNN). The initial version can use hand-tuned thresholds, then train with annotated data.

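The first three features can be prototyped with NumPy/SciPy alone. A sketch under two simplifying assumptions: fill ratio is approximated as mask coverage of its bounding box (not snow/vegetation coverage specifically), and path-width consistency is omitted; the function name and return format are illustrative.

```python
import numpy as np
from scipy import ndimage

def freshness_features(gray: np.ndarray, path_mask: np.ndarray) -> dict:
    """gray: grayscale ROI; path_mask: binary mask of the path within it."""
    on = path_mask.astype(bool)
    off = ~on
    # Edge sharpness: variance of the Laplacian over the path boundary ring.
    boundary = ndimage.binary_dilation(on) & ~ndimage.binary_erosion(on)
    lap = ndimage.laplace(gray.astype(float))
    edge_sharpness = float(lap[boundary].var()) if boundary.any() else 0.0
    # Contrast ratio: mean path intensity vs mean surrounding terrain.
    contrast = float(gray[on].mean() / (gray[off].mean() + 1e-6))
    # Fill ratio (simplified): fraction of the mask's bounding box covered.
    ys, xs = np.nonzero(on)
    box_area = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)
    fill_ratio = float(on.sum() / box_area)
    return {"edge_sharpness": edge_sharpness,
            "contrast": contrast,
            "fill_ratio": fill_ratio}
```

The resulting feature vector feeds the Random Forest / small CNN (or hand-tuned thresholds in the initial version).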
### Component 3: Tier 3 — VLM Deep Analysis (Optional)

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **UAV-VL-R1 (recommended)** | Qwen2-VL-2B fine-tuned, vLLM, INT8 quantization | Purpose-built for aerial reasoning. 48% better than generic Qwen2-VL-2B on UAV tasks. 2.5GB INT8. Open source. | 3-5s per analysis. Competes for GPU memory with YOLO (sequential scheduling). | vLLM or TRT-LLM, GPTQ-INT8 weights | Local inference only | Free (open source) | **Best fit for aerial VLM tasks.** |
| SmolVLM2-500M | HuggingFace Transformers, ONNX | Smallest memory (1.8GB). Fastest inference (~1-2s estimated). | Weakest reasoning. May lack nuance for concealment analysis. | ONNX Runtime or TRT | Local only | Free | Fallback if memory is tight |
| Moondream 2B | moondream API, PyTorch | Built-in detect()/point() APIs. Strong grounded detection (refcoco 91.1). | Not aerial-specialized. Same size class as UAV-VL-R1 but less relevant. | PyTorch or ONNX | Local only | Free | Alternative if UAV-VL-R1 underperforms |
| No VLM | N/A | Simpler system. Less memory. No latency for Tier 3. | No zero-shot capability for novel patterns. No operator explanations. | None | N/A | Free | Viable for production if Tier 1+2 accuracy is sufficient |

**VLM prompting strategy for concealment analysis**:

```
Analyze this aerial UAV image crop. A footpath was detected leading to this area.
Is there a concealed military position visible? Look for:
- Dark entrances or openings
- Piles of cut tree branches used as camouflage
- Dugout structures
- Signs of recent human activity

Answer: YES or NO, then one sentence explaining why.
```

**Integration**: The VLM runs as a separate Python process and communicates with the main Cython pipeline via a Unix domain socket or shared memory. It is triggered only when Tier 2 CNN confidence falls between 30% and 70% (ambiguous cases).

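The Unix-socket hand-off might look like the following length-prefixed JSON protocol; the socket path and message fields are illustrative assumptions, not a defined service contract.

```python
import json
import socket
import struct

SOCKET_PATH = "/tmp/semantic_vlm.sock"  # hypothetical path

def send_msg(sock: socket.socket, payload: dict) -> None:
    # 4-byte big-endian length prefix, then the JSON body.
    data = json.dumps(payload).encode()
    sock.sendall(struct.pack("!I", len(data)) + data)

def recv_msg(sock: socket.socket) -> dict:
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length))

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("VLM process closed the socket")
        buf += chunk
    return buf

def query_vlm(roi_jpeg_b64: str, cnn_conf: float, path: str = SOCKET_PATH) -> dict:
    """Blocking round-trip; the timeout enforces the 5 s Tier 3 budget."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(5.0)
        sock.connect(path)
        send_msg(sock, {"roi": roi_jpeg_b64, "cnn_confidence": cnn_conf})
        return recv_msg(sock)  # e.g. {"verdict": "YES", "reason": "..."}
```

The VLM process runs the mirror image of this loop: accept, `recv_msg`, run inference, `send_msg`.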
### Component 4: Camera Gimbal Control

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **Custom ViewLink serial driver + PID controller (recommended)** | Python serial, ViewLink Protocol V3.3.3, PID library | Direct hardware control. Closed-loop tracking with detection feedback. Low latency (<10ms serial command). | Must implement ViewLink protocol from spec. PID tuning needed. | ViewPro A40 documentation, pyserial | Physical hardware access only | Free + hardware | **Best fit. Direct control, no middleware.** |
| ArduPilot integration via MAVLink | ArduPilot, MAVLink, pymavlink | Battle-tested gimbal driver. Well-documented. | Requires ArduPilot flight controller. Additional latency through FC. May conflict with mission planner. | Pixhawk or similar FC running ArduPilot 4.5+ | MAVLink protocol | Pixhawk hardware ($50-200) | Alternative if ArduPilot is already used for flight control |

**Scan controller state machine**:

```
┌─────────────────┐
│ LEVEL 1 │
│ Wide Sweep │◀──────────────────────────┐
│ Medium Zoom │ │
└────────┬────────┘ │
│ POI detected │ Analysis complete
▼ │ or timeout (5s)
┌─────────────────┐ │
│ ZOOM │ │
│ TRANSITION │ │
│ 1-2 seconds │ │
└────────┬────────┘ │
│ Zoom complete │
▼ │
┌─────────────────┐ ┌──────────────┐ │
│ LEVEL 2 │────▶│ PATH │───────┤
│ Detailed Scan │ │ FOLLOWING │ │
│ High Zoom │ │ Pan along │ │
└────────┬────────┘ │ detected │ │
│ │ path │ │
│ └──────────────┘ │
│ Endpoint found │
▼ │
┌─────────────────┐ │
│ ENDPOINT │ │
│ ANALYSIS │────────────────────────────┘
│ Hold + VLM │
└─────────────────┘
```

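The state machine above can be encoded as a small transition table. State names mirror the diagram; the event names and the unknown-event behavior are illustrative assumptions.

```python
from enum import Enum, auto

class ScanState(Enum):
    LEVEL1_SWEEP = auto()
    ZOOM_TRANSITION = auto()
    LEVEL2_SCAN = auto()
    PATH_FOLLOWING = auto()
    ENDPOINT_ANALYSIS = auto()

# (state, event) -> next state, following the diagram's arrows.
TRANSITIONS = {
    (ScanState.LEVEL1_SWEEP, "poi_detected"): ScanState.ZOOM_TRANSITION,
    (ScanState.ZOOM_TRANSITION, "zoom_complete"): ScanState.LEVEL2_SCAN,
    (ScanState.LEVEL2_SCAN, "path_found"): ScanState.PATH_FOLLOWING,
    (ScanState.LEVEL2_SCAN, "endpoint_found"): ScanState.ENDPOINT_ANALYSIS,
    (ScanState.PATH_FOLLOWING, "endpoint_found"): ScanState.ENDPOINT_ANALYSIS,
    (ScanState.PATH_FOLLOWING, "analysis_complete"): ScanState.LEVEL1_SWEEP,
    (ScanState.ENDPOINT_ANALYSIS, "analysis_complete"): ScanState.LEVEL1_SWEEP,
    (ScanState.ENDPOINT_ANALYSIS, "timeout"): ScanState.LEVEL1_SWEEP,
}

def step(state: ScanState, event: str) -> ScanState:
    # Unknown events leave the state unchanged.
    return TRANSITIONS.get((state, event), state)
```

An explicit table like this keeps the controller testable without hardware in the loop.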
**Level 1 sweep pattern**: Sinusoidal yaw oscillation centered on the flight heading. Amplitude: ±30° (configurable). Period: matched to ground speed so adjacent sweeps overlap by 20%. Pitch: slightly downward (configurable based on altitude and desired ground coverage).

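A sketch of the sweep command generation under those parameters; `swath_width_m` (the ground footprint of one sweep at the current zoom/altitude) is a hypothetical input, and the period formula is one plausible reading of "adjacent sweeps overlap by 20%".

```python
import math

def sweep_yaw_deg(t: float, heading_deg: float, amplitude_deg: float = 30.0,
                  period_s: float = 8.0) -> float:
    """Commanded gimbal yaw at time t (seconds), centred on flight heading."""
    return heading_deg + amplitude_deg * math.sin(2 * math.pi * t / period_s)

def sweep_period_s(ground_speed_mps: float, swath_width_m: float,
                   overlap: float = 0.2) -> float:
    """Period so the UAV advances (1 - overlap) swath widths per cycle."""
    return swath_width_m * (1.0 - overlap) / max(ground_speed_mps, 0.1)
```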
**Path-following control loop**:

1. Tier 1 outputs the footpath segmentation mask
2. Extract the path centerline direction (from the skeleton)
3. Compute error: path center vs frame center
4. PID controller adjusts gimbal yaw/pitch to minimize the error
5. Update rate: tied to the detection frame rate (10-30 FPS)
6. When the path endpoint is reached or the path leaves the frame: stop following, analyze the endpoint

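Steps 3-5 reduce to a PID on the pixel error between the path centre and the frame centre. A sketch: the gains (and the implicit pixels-to-degrees scaling folded into them) are placeholders to be tuned on hardware.

```python
class PID:
    def __init__(self, kp: float, ki: float, kd: float, dt: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = 0.0

    def update(self, err: float) -> float:
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def follow_step(path_center_px, frame_size, yaw_pid: PID, pitch_pid: PID):
    """One control-loop iteration, run at the detection frame rate."""
    cx, cy = frame_size[0] / 2, frame_size[1] / 2
    # Positive error = path centre is right of / below the frame centre.
    yaw_cmd = yaw_pid.update(path_center_px[0] - cx)
    pitch_cmd = pitch_pid.update(path_center_px[1] - cy)
    return yaw_cmd, pitch_cmd  # incremental gimbal commands, degrees
```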
**POI queue management**: Priority queue sorted by: (1) detection confidence, (2) proximity to current camera position (minimize slew time), (3) recency. Max queue size: 20 POIs. Older/lower-confidence entries expire after 30 seconds.

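One possible realization of this policy: a bounded list re-scored on each pop by the three criteria. The scoring weights and the L1 slew-cost model are illustrative assumptions, not tuned values.

```python
import time
from dataclasses import dataclass, field

MAX_POIS = 20
TTL_S = 30.0

@dataclass
class POI:
    x_deg: float          # gimbal yaw toward the POI
    y_deg: float          # gimbal pitch toward the POI
    confidence: float
    ts: float = field(default_factory=time.monotonic)

class POIQueue:
    def __init__(self):
        self._items: list[POI] = []

    def push(self, poi: POI) -> None:
        self._items.append(poi)
        # Evict the lowest-confidence entry when over capacity.
        if len(self._items) > MAX_POIS:
            self._items.remove(min(self._items, key=lambda p: p.confidence))

    def pop_best(self, cam_yaw: float, cam_pitch: float):
        now = time.monotonic()
        self._items = [p for p in self._items if now - p.ts < TTL_S]
        if not self._items:
            return None
        def score(p: POI) -> float:
            slew = abs(p.x_deg - cam_yaw) + abs(p.y_deg - cam_pitch)
            age = now - p.ts
            # Confidence dominates; slew distance and age subtract.
            return p.confidence - 0.01 * slew - 0.02 * age
        best = max(self._items, key=score)
        self._items.remove(best)
        return best
```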
### Component 5: Integration with Existing Detections Service

| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **Extend existing Cython codebase + separate VLM process (recommended)** | Cython, TensorRT, Unix socket IPC | Maintains existing architecture. YOLO26/YOLOE-26 fits naturally into TRT pipeline. VLM isolated in separate process. | VLM IPC adds small latency. Two processes to manage. | Cython extensions, process management | Process isolation | Free | **Best fit. Minimal disruption to existing system.** |
| Microservice architecture | FastAPI, Docker, gRPC | Clean separation. Independent scaling. | Overhead for single Jetson. Over-engineered for edge. | Docker, networking | Service mesh | Free | Over-engineered for single device |

**Integration architecture**:

```
┌─────────────────────────────────────────────────────────┐
│ Main Process (Cython + TRT) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Existing │ │ YOLOE-26 │ │ Scan │ │
│ │ YOLO Det │ │ Semantic │ │ Controller │ │
│ │ (TRT Engine) │ │ (TRT Engine) │ │ + Gimbal │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └─────────┬────────┘ │ │
│ ▼ │ │
│ ┌──────────────┐ │ │
│ │ Spatial │◀──────────────────────┘ │
│ │ Reasoning │ │
│ │ + CNN Class. │ │
│ └──────┬───────┘ │
│ │ ambiguous cases │
│ ▼ │
│ ┌──────────────┐ │
│ │ IPC (Unix │ │
│ │ Socket) │ │
│ └──────┬───────┘ │
└────────────────┼─────────────────────────────────────────┘
│
▼
┌────────────────────────────┐
│ VLM Process (Python) │
│ UAV-VL-R1 (vLLM/TRT-LLM) │
│ INT8 quantized │
└────────────────────────────┘
```

**Data flow**:

1. Camera frame → existing YOLO detection (scene context: vehicles, debris, structures)
2. Same frame → YOLOE-26 semantic detection (footpaths, branch piles, entrances)
3. YOLO context + YOLOE-26 detections → spatial reasoning module
4. Spatial reasoning: path tracing, endpoint analysis, CNN classification
5. High-confidence detections → operator notification (bounding box + coordinates)
6. Ambiguous detections → VLM process via IPC → response → operator notification
7. All detections → scan controller → gimbal commands (Level 1/2 transitions, path following)

**GPU scheduling**: YOLO and YOLOE-26 can share a single TRT engine (YOLOE-26 re-parameterized to standard YOLO26 weights for fixed classes). VLM inference is sequential: pause YOLO frames, run VLM, resume YOLO. VLM analysis typically lasts 3-5s, during which the camera holds position (endpoint analysis phase).

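The sequential scheduling can be enforced with a single lock shared by the detection loop and the VLM entry point. A sketch with placeholder callables (`get_frame`, `run_yolo`, `run_vlm` are stand-ins for the real pipeline functions):

```python
import threading

gpu_lock = threading.Lock()

def detection_loop(get_frame, run_yolo):
    """Main per-frame loop; blocks while a VLM request holds the GPU."""
    while True:
        frame = get_frame()
        with gpu_lock:
            run_yolo(frame)

def run_vlm_analysis(run_vlm, roi):
    # The camera is holding position during endpoint analysis, so
    # pausing detection for the 3-5 s VLM call is acceptable.
    with gpu_lock:
        return run_vlm(roi)
```

In the real system the lock would live on the Cython side, with the VLM process signalling over the IPC channel rather than sharing the lock directly.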
## Training & Data Strategy

### Phase 1: Zero-shot (Week 1-2)

- Deploy YOLOE-26 with text/visual prompts
- Use semantic01-04.png as visual prompt references
- Tune text prompt class names and confidence thresholds
- Collect false positive/negative data for annotation

### Phase 2: Annotation & Fine-tuning (Week 3-8)

- Annotate collected data using existing annotation tooling
- Priority order: footpaths (segmentation masks) → branch piles (bboxes) → entrances (bboxes) → roads (segmentation) → trees (bboxes)
- Use SAM (Segment Anything Model) for semi-automated segmentation mask generation
- Target: 500+ images per class by week 6, 1500+ by week 8
- Fine-tune YOLO26-Seg on the custom dataset using linear probing first, then full fine-tuning

### Phase 3: Custom CNN classifier (Week 6-10)

- Train a MobileNetV3-Small binary classifier on ROI crops from endpoint analysis
- Positive: annotated concealed positions. Negative: natural path termini, random terrain
- Target: 300+ positive, 1000+ negative samples
- Export to TensorRT FP16

### Phase 4: VLM integration (Week 8-12)

- Deploy UAV-VL-R1 INT8 as a separate process
- Tune the prompting strategy on collected ambiguous cases
- Optional: fine-tune UAV-VL-R1 on domain-specific data if base accuracy is insufficient

### Phase 5: Seasonal expansion (Month 3+)

- Winter data → spring/summer annotation campaigns
- Re-train all models with multi-season data
- Expect accuracy degradation in summer (vegetation occlusion); mitigate with a larger dataset

## Testing Strategy

### Integration / Functional Tests

- YOLOE-26 text prompt detection on reference images (semantic01-04.png) — verify footpath and hideout regions are flagged
- Path tracing on synthetic segmentation masks — verify skeleton, endpoint, junction detection
- CNN classifier on known positive/negative ROI crops — verify binary output correctness
- Gimbal control loop on simulated camera feed — verify PID convergence and path-following accuracy
- VLM IPC round-trip — verify request/response latency and correctness
- Full pipeline test: frame → YOLOE detection → path tracing → CNN → VLM → operator notification
- Scan controller state machine: verify Level 1 → Level 2 transitions, timeout, return to Level 1

### Non-Functional Tests

- Tier 1 latency: end-to-end YOLOE-26 inference on Jetson Orin Nano Super ≤100ms
- Tier 2 latency: path tracing + CNN classification ≤200ms
- Tier 3 latency: VLM IPC + inference ≤5 seconds
- Memory profiling: total RAM usage under YOLO + YOLOE + CNN + VLM concurrent loading
- Thermal stress test: continuous inference for 30+ minutes without thermal throttling
- False positive rate measurement on known clean terrain
- False negative rate measurement on annotated concealed positions
- Gimbal response test: measure physical camera movement latency vs command

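A minimal harness for the tier latency tests, using the budgets above. Reporting p95 over warm runs is an assumption about the acceptance metric (the targets do not specify a percentile), and the warm-up runs stand in for TRT engine warm-up.

```python
import time

BUDGETS_MS = {"tier1": 100.0, "tier2": 200.0, "tier3": 5000.0}

def measure(tier: str, fn, *args, warmup: int = 3, runs: int = 50) -> dict:
    """Time fn(*args) and compare its p95 against the tier's budget."""
    for _ in range(warmup):          # discard cold-start runs
        fn(*args)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    p95 = samples[int(0.95 * len(samples)) - 1]
    return {"p95_ms": p95, "within_budget": p95 <= BUDGETS_MS[tier]}
```

On target hardware, `fn` would wrap the actual TRT/IPC call for each tier.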
## References

- YOLO26 docs: https://docs.ultralytics.com/models/yolo26/
- YOLOE docs: https://docs.ultralytics.com/models/yoloe/
- YOLOE-26 paper: https://arxiv.org/abs/2602.00168
- YOLO26 Jetson benchmarks: https://docs.ultralytics.com/guides/nvidia-jetson
- UAV-VL-R1: https://arxiv.org/pdf/2508.11196, https://github.com/Leke-G/UAV-VL-R1
- SmolVLM2: https://huggingface.co/blog/smolervlm
- Moondream: https://moondream.ai/blog/introducing-moondream-0-5b
- ViewPro Protocol: https://www.viewprotech.com/index.php?ac=article&at=read&did=510
- ArduPilot ViewPro: https://ardupilot.org/copter/docs/common-viewpro-gimbal.html
- FootpathSeg: https://github.com/WennyXY/FootpathSeg
- UAV-YOLO12: https://www.mdpi.com/2072-4292/17/9/1539
- Trail segmentation (UNet+MambaOut): https://arxiv.org/pdf/2504.12121
- servopilot PID: https://pypi.org/project/servopilot/
- Camouflage detection (OVCOS): https://arxiv.org/html/2506.19300v1
- Jetson AI Lab benchmarks: https://www.jetson-ai-lab.com/tutorials/genai-benchmarking/
- Cosmos-Reason2-2B benchmarks: Embedl blog (Feb 2026)

## Related Artifacts

- AC assessment: `_docs/00_research/00_ac_assessment.md`
- Tech stack evaluation: `_docs/01_solution/tech_stack.md` (if Phase 3 was executed)
- Security analysis: `_docs/01_solution/security_analysis.md` (if Phase 4 was executed)