Oleksandr Bezdieniezhnykh 8e2ecf50fd Initial commit
Made-with: Cursor
2026-03-26 00:20:30 +02:00

# Solution Draft
## Product Solution Description
A three-tier semantic detection system for identifying concealed/camouflaged positions from reconnaissance UAV aerial imagery, running on Jetson Orin Nano Super alongside the existing YOLO detection pipeline.
```
┌─────────────────────────────────────────────────────────────────────────┐
│ JETSON ORIN NANO SUPER │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ ViewPro │───▶│ Tier 1 │───▶│ Tier 2 │───▶│ Tier 3 │ │
│ │ A40 │ │ YOLO26-Seg │ │ Spatial │ │ VLM │ │
│ │ Camera │ │ + YOLOE-26 │ │ Reasoning │ │ UAV-VL │ │
│ │ │ │ ≤100ms │ │ + CNN │ │ -R1 │ │
│ │ │ │ │ │ ≤200ms │ │ ≤5s │ │
│ └────▲─────┘ └──────────────┘ └──────────────┘ └──────────┘ │
│ │ │
│ ┌────┴─────┐ ┌──────────────┐ │
│ │ Gimbal │◀───│ Scan │ │
│ │ Control │ │ Controller │ │
│ │ Module │ │ (L1/L2) │ │
│ └──────────┘ └──────────────┘ │
│ │
│ ┌──────────────────────────────┐ │
│ │ Existing YOLO Detection │ (separate service, provides context) │
│ │ Cython + TRT │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
```
The system operates in two scan levels:
- **Level 1 (Wide Sweep)**: Camera at medium zoom, left-right swing. YOLOE-26 text/visual prompts detect POIs in real-time. Existing YOLO provides scene context.
- **Level 2 (Detailed Scan)**: Camera zooms into POI. Spatial reasoning traces footpaths, finds endpoints. CNN classifies potential hideouts. Optional VLM provides deep analysis for ambiguous cases.
Three submodules: (1) Semantic Detection AI, (2) Camera Gimbal Control, (3) Integration with existing detections service.
## Existing/Competitor Solutions Analysis
No direct commercial or open-source competitor exists for this specific combination of requirements (concealed position detection from UAV with edge inference). Related work:
| Solution | Approach | Limitations for This Use Case |
|----------|----------|-------------------------------|
| Standard YOLO object detection | Bounding box classification of known object types | Cannot detect camouflaged/concealed targets without explicit visual features |
| CAMOUFLAGE-Net (YOLOv7-based) | Attention mechanisms + ELAN for camouflage detection | Designed for ground-level imagery, not aerial; academic datasets only |
| Open-Vocabulary Camouflaged Object Segmentation | VLM + SAM cascaded segmentation | Too slow for real-time edge inference; requires cloud GPU |
| UAV-YOLO12 | Multi-scale road segmentation from UAV imagery | Roads only, no concealment reasoning |
| FootpathSeg (DINO-MC + UNet) | Footpath segmentation with self-supervised learning | Pedestrian context, not military aerial; no path-following logic |
| YOLO-World / YOLOE | Open-vocabulary detection | Closest fit — YOLOE-26 is our primary Tier 1 mechanism |
**Key insight**: No existing solution combines footpath detection + path tracing + endpoint analysis + concealment classification in a single pipeline. This requires a custom multi-stage system.
## Architecture
### Component 1: Tier 1 — Real-Time Detection (YOLOE-26 + YOLO26-Seg)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **YOLOE-26 text/visual prompts (recommended)** | yoloe-26s-seg.pt, Ultralytics, TensorRT | Zero-shot detection from day 1. Text prompts: "footpath", "branch pile", "dark entrance". Visual prompts: reference images of hideouts. Zero overhead when re-parameterized. NMS-free. | Open-vocabulary accuracy lower than custom-trained model. Text prompts may not capture all concealment patterns. | Ultralytics ≥8.4, TensorRT, JetPack 6.2 | Model weights stored locally, no cloud | Free (open source) | **Best fit for bootstrapping. Immediate capability.** |
| YOLO26-Seg custom-trained | yolo26s-seg.pt fine-tuned on custom dataset | Higher accuracy for known classes after training. Instance segmentation masks for footpaths. | Requires annotated training data (1500+ images/class). No zero-shot capability. | Custom dataset, GPU for training | Same | Free (open source) + annotation labor | **Best fit for production after data collection.** |
| UNet + MambaOut | PyTorch, TensorRT | Best published accuracy for trail segmentation from aerial photos. | Separate model, additional memory. No built-in detection head. | Custom integration | Same | Free (open source) | Backup option if YOLO26-Seg underperforms on trails |
**Recommended approach**: Start with YOLOE-26 text/visual prompts. Parallel annotation effort builds custom dataset. Transition to fine-tuned YOLO26-Seg once data is sufficient. YOLOE-26 zero-shot capability provides immediate usability.
**YOLOE-26 configuration for this project**:
Text prompts for Level 1 detection:
- `"footpath"`, `"trail"`, `"path in snow"`, `"road"`, `"track"`
- `"pile of branches"`, `"tree branches"`, `"camouflage netting"`
- `"dark entrance"`, `"hole"`, `"dugout"`, `"dark opening"`
- `"tree row"`, `"tree line"`, `"group of trees"`
- `"clearing"`, `"open area near forest"`
Visual prompts: Annotated reference images (semantic01-04.png) as visual prompt sources. Crop ROIs around known hideouts, use as reference for SAVPE visual-prompted detection.
API usage:
```python
from ultralytics import YOLOE

model = YOLOE("yoloe-26s-seg.pt")
classes = ["footpath", "branch pile", "dark entrance", "tree row", "road"]
model.set_classes(classes, model.get_text_pe(classes))  # text prompts -> class embeddings
results = model.predict(frame, conf=0.15)  # low threshold: favor recall over precision
```
Visual prompt usage:
```python
import numpy as np

# x1, y1, x2, y2: pixel bbox of a known hideout in the reference image
results = model.predict(
    frame,
    refer_image="reference_hideout.jpg",
    visual_prompts={"bboxes": np.array([[x1, y1, x2, y2]]), "cls": np.array([0])},
)
```
### Component 2: Tier 2 — Spatial Reasoning & CNN Confirmation
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **Path tracing + CNN classifier (recommended)** | OpenCV skeletonization, MobileNetV3-Small TensorRT | Fast (<200ms total). Path tracing from segmentation masks. Binary classifier: "concealed position yes/no" on ROI crop. Well-understood algorithms. | Requires custom annotated data for CNN classifier. Path tracing quality depends on segmentation quality. | OpenCV, scikit-image, PyTorch → TRT | Offline inference | Free + annotation labor | **Best fit. Modular, fast, interpretable.** |
| Heuristic rules only | OpenCV, NumPy | No training data needed. Rule-based: "if footpath ends at dark mass → flag." | Brittle. Hard to tune. Cannot generalize across seasons/terrain. | None | Offline | Free | Baseline/fallback for initial version |
| End-to-end custom model | PyTorch, TensorRT | Single model handles everything. | Requires massive training data. Black box. Hard to debug. | Large annotated dataset | Offline | Free + GPU time | Not recommended for initial release |
**Path tracing algorithm**:
1. Take footpath segmentation mask from Tier 1
2. Skeletonize using Zhang-Suen algorithm (`skimage.morphology.skeletonize`)
3. Detect endpoints using hit-miss morphological operations (8 kernel patterns)
4. Detect junctions using branch-point kernels
5. Trace path segments between junctions/endpoints
6. For each endpoint: extract 128×128 ROI crop centered on endpoint from original image
7. Feed ROI crop to MobileNetV3-Small binary classifier: "concealed structure" vs "natural terminus"
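Steps 2-4 can be sketched with `skimage`/`scipy` (Zhang's thinning is the 2-D default of `skimage.morphology.skeletonize`); the neighbor-count convolution below is a compact equivalent of the eight endpoint hit-miss kernels:

```python
import numpy as np
from scipy.ndimage import convolve
from skimage.morphology import skeletonize

def skeleton_topology(mask: np.ndarray):
    """Return (skeleton, endpoints, junctions) for a binary footpath mask."""
    skel = skeletonize(mask.astype(bool))  # Zhang's thinning (2-D default)
    # Count 8-connected skeleton neighbors of every pixel.
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],
                       [1, 1, 1]])
    neighbors = convolve(skel.astype(np.uint8), kernel, mode="constant")
    endpoints = np.argwhere(skel & (neighbors == 1))  # one neighbor -> path end
    junctions = np.argwhere(skel & (neighbors >= 3))  # three+ neighbors -> branch
    return skel, endpoints, junctions
```

Each endpoint row is the (y, x) center for the 128×128 ROI crop in step 6.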
**Freshness assessment approach**:
Visual features for fresh vs stale classification (binary classifier on path ROI):
- Edge sharpness (Laplacian variance on path boundary)
- Contrast ratio (path intensity vs surrounding terrain)
- Fill ratio (percentage of path area with snow/vegetation coverage)
- Path width consistency (fresh paths have more uniform width)
Implementation: Extract these features per path segment, feed to lightweight classifier (Random Forest or small CNN). Initial version can use hand-tuned thresholds, then train with annotated data.
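A minimal sketch of the feature extractor, using `scipy`/`skimage` equivalents of the OpenCV calls; the `cover_mask` input (snow/vegetation coverage inside the ROI) and the exact feature definitions are assumptions for the hand-tuned first version:

```python
import numpy as np
from scipy.ndimage import laplace
from skimage.morphology import medial_axis

def freshness_features(gray, path_mask, cover_mask):
    """Freshness cues for one path ROI: grayscale image, binary path mask,
    and a snow/vegetation coverage mask (all the same shape)."""
    on = path_mask > 0
    # Edge sharpness: Laplacian variance over the path region (crisp edges = fresh).
    edge_sharpness = float(laplace(gray.astype(float))[on].var())
    # Contrast ratio: mean path intensity vs mean surrounding terrain.
    contrast_ratio = float(gray[on].mean()) / max(float(gray[~on].mean()), 1e-6)
    # Fill ratio: fraction of path area already re-covered by snow/vegetation.
    fill_ratio = float((on & (cover_mask > 0)).sum()) / max(int(on.sum()), 1)
    # Width consistency: coefficient of variation of widths along the medial axis.
    skel, dist = medial_axis(on, return_distance=True)
    widths = 2.0 * dist[skel]
    width_cv = float(widths.std() / max(widths.mean(), 1e-6)) if widths.size else 0.0
    return {"edge_sharpness": edge_sharpness, "contrast_ratio": contrast_ratio,
            "fill_ratio": fill_ratio, "width_cv": width_cv}
```

The returned dict feeds either the hand-tuned thresholds or, later, the Random Forest / small CNN.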
### Component 3: Tier 3 — VLM Deep Analysis (Optional)
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **UAV-VL-R1 (recommended)** | Qwen2-VL-2B fine-tuned, vLLM, INT8 quantization | Purpose-built for aerial reasoning. 48% better than generic Qwen2-VL-2B on UAV tasks. 2.5GB INT8. Open source. | 3-5s per analysis. Competes for GPU memory with YOLO (sequential scheduling). | vLLM or TRT-LLM, GPTQ-INT8 weights | Local inference only | Free (open source) | **Best fit for aerial VLM tasks.** |
| SmolVLM2-500M | HuggingFace Transformers, ONNX | Smallest memory (1.8GB). Fastest inference (~1-2s estimated). | Weakest reasoning. May lack nuance for concealment analysis. | ONNX Runtime or TRT | Local only | Free | Fallback if memory is tight |
| Moondream 2B | moondream API, PyTorch | Built-in detect()/point() APIs. Strong grounded detection (refcoco 91.1). | Not aerial-specialized. Same size class as UAV-VL-R1 but less relevant. | PyTorch or ONNX | Local only | Free | Alternative if UAV-VL-R1 underperforms |
| No VLM | N/A | Simpler system. Less memory. No latency for Tier 3. | No zero-shot capability for novel patterns. No operator explanations. | None | N/A | Free | Viable for production if Tier 1+2 accuracy is sufficient |
**VLM prompting strategy for concealment analysis**:
```
Analyze this aerial UAV image crop. A footpath was detected leading to this area.
Is there a concealed military position visible? Look for:
- Dark entrances or openings
- Piles of cut tree branches used as camouflage
- Dugout structures
- Signs of recent human activity
Answer: YES or NO, then one sentence explaining why.
```
**Integration**: The VLM runs as a separate Python process and communicates with the main Cython pipeline via a Unix domain socket or shared memory. It is triggered only when Tier 2 CNN confidence falls between 30% and 70% (ambiguous cases).
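A minimal sketch of the socket round-trip; the socket path and JSON fields (`poi_id`, `crop_path`, `answer`, `reason`) are illustrative assumptions, not a fixed schema, and a real server would wrap UAV-VL-R1 inference instead of the stub verdict:

```python
import json
import os
import socket
import tempfile

SOCK_PATH = os.path.join(tempfile.gettempdir(), "vlm_tier3.sock")  # hypothetical path

def serve_one_request():
    """Stand-in for the VLM process: answer a single Tier-2 escalation, then exit."""
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCK_PATH)
    srv.listen(1)
    conn, _ = srv.accept()
    req = json.loads(conn.recv(65536).decode())
    # A real server would run UAV-VL-R1 on req["crop_path"] here.
    conn.sendall(json.dumps({"poi_id": req["poi_id"], "answer": "NO",
                             "reason": "stub verdict"}).encode())
    conn.close()
    srv.close()

def escalate_to_vlm(poi_id: int, crop_path: str, timeout: float = 5.0) -> dict:
    """Called from the main pipeline when Tier-2 confidence is ambiguous."""
    cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    cli.settimeout(timeout)  # matches the Tier-3 <=5 s budget
    cli.connect(SOCK_PATH)
    cli.sendall(json.dumps({"poi_id": poi_id, "crop_path": crop_path}).encode())
    resp = json.loads(cli.recv(65536).decode())
    cli.close()
    return resp
```

The `settimeout` call turns a stalled VLM into a recoverable exception rather than a blocked pipeline.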
### Component 4: Camera Gimbal Control
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **Custom ViewLink serial driver + PID controller (recommended)** | Python serial, ViewLink Protocol V3.3.3, PID library | Direct hardware control. Closed-loop tracking with detection feedback. Low latency (<10ms serial command). | Must implement ViewLink protocol from spec. PID tuning needed. | ViewPro A40 documentation, pyserial | Physical hardware access only | Free + hardware | **Best fit. Direct control, no middleware.** |
| ArduPilot integration via MAVLink | ArduPilot, MAVLink, pymavlink | Battle-tested gimbal driver. Well-documented. | Requires ArduPilot flight controller. Additional latency through FC. May conflict with mission planner. | Pixhawk or similar FC running ArduPilot 4.5+ | MAVLink protocol | Pixhawk hardware ($50-200) | Alternative if ArduPilot is already used for flight control |
**Scan controller state machine**:
```
┌─────────────────┐
│ LEVEL 1 │
│ Wide Sweep │◀──────────────────────────┐
│ Medium Zoom │ │
└────────┬────────┘ │
│ POI detected │ Analysis complete
▼ │ or timeout (5s)
┌─────────────────┐ │
│ ZOOM │ │
│ TRANSITION │ │
│ 1-2 seconds │ │
└────────┬────────┘ │
│ Zoom complete │
▼ │
┌─────────────────┐ ┌──────────────┐ │
│ LEVEL 2 │────▶│ PATH │───────┤
│ Detailed Scan │ │ FOLLOWING │ │
│ High Zoom │ │ Pan along │ │
└────────┬────────┘ │ detected │ │
│ │ path │ │
│ └──────────────┘ │
│ Endpoint found │
▼ │
┌─────────────────┐ │
│ ENDPOINT │ │
│ ANALYSIS │────────────────────────────┘
│ Hold + VLM │
└─────────────────┘
```
**Level 1 sweep pattern**: Sinusoidal yaw oscillation centered on flight heading. Amplitude: ±30° (configurable). Period: matched to ground speed so adjacent sweeps overlap by 20%. Pitch: slightly downward (configurable based on altitude and desired ground coverage).
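The sweep geometry reduces to two small functions; the default amplitude/period values and the speed floor below are illustrative assumptions:

```python
import math

def sweep_yaw(t: float, amplitude_deg: float = 30.0, period_s: float = 8.0) -> float:
    """Yaw offset from the flight heading at time t during a Level 1 sweep."""
    return amplitude_deg * math.sin(2.0 * math.pi * t / period_s)

def sweep_period(swath_width_m: float, ground_speed_mps: float,
                 overlap: float = 0.2) -> float:
    """Period chosen so each oscillation advances by (1 - overlap) swath widths."""
    return (1.0 - overlap) * swath_width_m / max(ground_speed_mps, 0.1)
```

For example, a 100 m swath at 10 m/s ground speed with 20% overlap gives an 8 s sweep period.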
**Path-following control loop**:
1. Tier 1 outputs footpath segmentation mask
2. Extract path centerline direction (from skeleton)
3. Compute error: path center vs frame center
4. PID controller adjusts gimbal yaw/pitch to minimize error
5. Update rate: tied to detection frame rate (10-30 FPS)
6. When path endpoint reached or path leaves frame: stop following, analyze endpoint
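The loop above reduces to a per-axis PID on the centering error; the gains, the rate-command interface, and the ±30°/s output clamp are placeholder assumptions to be tuned against the real gimbal:

```python
class GimbalPID:
    """Minimal PID for one gimbal axis; gains here are illustrative, not tuned."""

    def __init__(self, kp: float, ki: float, kd: float, out_limit_dps: float = 30.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_limit = out_limit_dps
        self.integral = 0.0
        self.prev_err = None

    def update(self, error: float, dt: float) -> float:
        """error: path center minus frame center (deg); returns a rate command."""
        self.integral += error * dt
        deriv = 0.0 if self.prev_err is None else (error - self.prev_err) / dt
        self.prev_err = error
        out = self.kp * error + self.ki * self.integral + self.kd * deriv
        # Clamp to the gimbal's slew-rate limit.
        return max(-self.out_limit, min(self.out_limit, out))
```

One `GimbalPID` instance per axis (yaw, pitch), stepped at the detection frame rate.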
**POI queue management**: Priority queue sorted by: (1) detection confidence, (2) proximity to current camera position (minimize slew time), (3) recency. Max queue size: 20 POIs. Older/lower-confidence entries expire after 30 seconds.
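A sketch of the queue with `heapq`; the scoring weights and eviction policy are illustrative assumptions (confidence dominates, slew adds a small penalty, insertion order breaks ties, and TTL expiry is applied lazily on pop):

```python
import heapq
import itertools
import time

class POIQueue:
    """Bounded priority queue of Level-2 candidates; weights are placeholders."""
    MAX_SIZE = 20
    TTL_S = 30.0
    SLEW_WEIGHT = 0.01  # score penalty per degree of required camera slew

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-break: earlier insertions first

    def push(self, confidence, yaw, pitch, cam_yaw, cam_pitch):
        slew = abs(yaw - cam_yaw) + abs(pitch - cam_pitch)
        # Lower score is served first: high confidence and short slew win.
        score = -confidence + self.SLEW_WEIGHT * slew
        heapq.heappush(self._heap, (score, next(self._counter),
                                    time.monotonic(), confidence, yaw, pitch))
        if len(self._heap) > self.MAX_SIZE:
            self._heap.remove(max(self._heap))  # evict worst-scoring entry
            heapq.heapify(self._heap)

    def pop(self):
        """Next POI as (confidence, yaw, pitch); entries older than TTL are dropped."""
        now = time.monotonic()
        while self._heap:
            _, _, created, conf, yaw, pitch = heapq.heappop(self._heap)
            if now - created <= self.TTL_S:
                return conf, yaw, pitch
        return None
```

The scan controller calls `pop()` whenever it returns to Level 1 and is ready for the next zoom target.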
### Component 5: Integration with Existing Detections Service
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|----------|-------|-----------|-------------|-------------|----------|------|-----|
| **Extend existing Cython codebase + separate VLM process (recommended)** | Cython, TensorRT, Unix socket IPC | Maintains existing architecture. YOLO26/YOLOE-26 fits naturally into TRT pipeline. VLM isolated in separate process. | VLM IPC adds small latency. Two processes to manage. | Cython extensions, process management | Process isolation | Free | **Best fit. Minimal disruption to existing system.** |
| Microservice architecture | FastAPI, Docker, gRPC | Clean separation. Independent scaling. | Overhead for single Jetson. Over-engineered for edge. | Docker, networking | Service mesh | Free | Over-engineered for single device |
**Integration architecture**:
```
┌─────────────────────────────────────────────────────────┐
│ Main Process (Cython + TRT) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Existing │ │ YOLOE-26 │ │ Scan │ │
│ │ YOLO Det │ │ Semantic │ │ Controller │ │
│ │ (TRT Engine) │ │ (TRT Engine) │ │ + Gimbal │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └─────────┬────────┘ │ │
│ ▼ │ │
│ ┌──────────────┐ │ │
│ │ Spatial │◀──────────────────────┘ │
│ │ Reasoning │ │
│ │ + CNN Class. │ │
│ └──────┬───────┘ │
│ │ ambiguous cases │
│ ▼ │
│ ┌──────────────┐ │
│ │ IPC (Unix │ │
│ │ Socket) │ │
│ └──────┬───────┘ │
└────────────────┼─────────────────────────────────────────┘
┌────────────────────────────┐
│ VLM Process (Python) │
│ UAV-VL-R1 (vLLM/TRT-LLM) │
│ INT8 quantized │
└────────────────────────────┘
```
**Data flow**:
1. Camera frame → existing YOLO detection (scene context: vehicles, debris, structures)
2. Same frame → YOLOE-26 semantic detection (footpaths, branch piles, entrances)
3. YOLO context + YOLOE-26 detections → Spatial reasoning module
4. Spatial reasoning: path tracing, endpoint analysis, CNN classification
5. High-confidence detections → operator notification (bounding box + coordinates)
6. Ambiguous detections → VLM process via IPC → response → operator notification
7. All detections → scan controller → gimbal commands (Level 1/2 transitions, path following)
**GPU scheduling**: YOLO and YOLOE-26 can share a single TRT engine (YOLOE-26 re-parameterized to standard YOLO26 weights for fixed classes). VLM inference is sequential: pause YOLO frames, run VLM, resume YOLO. VLM analysis typically lasts 3-5s during which the camera holds position (endpoint analysis phase).
## Training & Data Strategy
### Phase 1: Zero-shot (Week 1-2)
- Deploy YOLOE-26 with text/visual prompts
- Use semantic01-04.png as visual prompt references
- Tune text prompt class names and confidence thresholds
- Collect false positive/negative data for annotation
### Phase 2: Annotation & Fine-tuning (Week 3-8)
- Annotate collected data using existing annotation tooling
- Priority order: footpaths (segmentation masks) → branch piles (bboxes) → entrances (bboxes) → roads (segmentation) → trees (bboxes)
- Use SAM (Segment Anything Model) for semi-automated segmentation mask generation
- Target: 500+ images per class by week 6, 1500+ by week 8
- Fine-tune YOLO26-Seg on custom dataset using linear probing first, then full fine-tuning
### Phase 3: Custom CNN classifier (Week 6-10)
- Train MobileNetV3-Small binary classifier on ROI crops from endpoint analysis
- Positive: annotated concealed positions. Negative: natural path termini, random terrain
- Target: 300+ positive, 1000+ negative samples
- Export to TensorRT FP16
### Phase 4: VLM integration (Week 8-12)
- Deploy UAV-VL-R1 INT8 as separate process
- Tune prompting strategy on collected ambiguous cases
- Optional: fine-tune UAV-VL-R1 on domain-specific data if base accuracy insufficient
### Phase 5: Seasonal expansion (Month 3+)
- Winter data → spring/summer annotation campaigns
- Re-train all models with multi-season data
- Expect accuracy degradation in summer (vegetation occlusion), mitigate with larger dataset
## Testing Strategy
### Integration / Functional Tests
- YOLOE-26 text prompt detection on reference images (semantic01-04.png) — verify footpath and hideout regions are flagged
- Path tracing on synthetic segmentation masks — verify skeleton, endpoint, junction detection
- CNN classifier on known positive/negative ROI crops — verify binary output correctness
- Gimbal control loop on simulated camera feed — verify PID convergence and path-following accuracy
- VLM IPC round-trip — verify request/response latency and correctness
- Full pipeline test: frame → YOLOE detection → path tracing → CNN → VLM → operator notification
- Scan controller state machine: verify Level 1 → Level 2 transitions, timeout, return to Level 1
### Non-Functional Tests
- Tier 1 latency: verify end-to-end YOLOE-26 inference on Jetson Orin Nano Super stays ≤100 ms
- Tier 2 latency: path tracing + CNN classification ≤200 ms
- Tier 3 latency: VLM IPC round-trip + inference ≤5 s
- Memory profiling: total RAM usage under YOLO + YOLOE + CNN + VLM concurrent loading
- Thermal stress test: continuous inference for 30+ minutes without thermal throttling
- False positive rate measurement on known clean terrain
- False negative rate measurement on annotated concealed positions
- Gimbal response test: measure physical camera movement latency vs command
## References
- YOLO26 docs: https://docs.ultralytics.com/models/yolo26/
- YOLOE docs: https://docs.ultralytics.com/models/yoloe/
- YOLOE-26 paper: https://arxiv.org/abs/2602.00168
- YOLO26 Jetson benchmarks: https://docs.ultralytics.com/guides/nvidia-jetson
- UAV-VL-R1: https://arxiv.org/pdf/2508.11196, https://github.com/Leke-G/UAV-VL-R1
- SmolVLM2: https://huggingface.co/blog/smolervlm
- Moondream: https://moondream.ai/blog/introducing-moondream-0-5b
- ViewPro Protocol: https://www.viewprotech.com/index.php?ac=article&at=read&did=510
- ArduPilot ViewPro: https://ardupilot.org/copter/docs/common-viewpro-gimbal.html
- FootpathSeg: https://github.com/WennyXY/FootpathSeg
- UAV-YOLO12: https://www.mdpi.com/2072-4292/17/9/1539
- Trail segmentation (UNet+MambaOut): https://arxiv.org/pdf/2504.12121
- servopilot PID: https://pypi.org/project/servopilot/
- Camouflage detection (OVCOS): https://arxiv.org/html/2506.19300v1
- Jetson AI Lab benchmarks: https://www.jetson-ai-lab.com/tutorials/genai-benchmarking/
- Cosmos-Reason2-2B benchmarks: Embedl blog (Feb 2026)
## Related Artifacts
- AC assessment: `_docs/00_research/00_ac_assessment.md`
- Tech stack evaluation: `_docs/01_solution/tech_stack.md` (if Phase 3 was executed)
- Security analysis: `_docs/01_solution/security_analysis.md` (if Phase 4 was executed)