Oleksandr Bezdieniezhnykh 8e2ecf50fd Initial commit
2026-03-26 00:20:30 +02:00
The existing fixed-wing reconnaissance UAV system uses YOLO-based object detection to identify vehicles and military equipment in aerial imagery. This approach is now ineffective: current high-value targets are camouflaged positions such as FPV operator hideouts, hidden artillery emplacements, and dugouts masked by tree branches. These targets cannot be detected by visual similarity to known object classes.
The detection approach has two layers. First, the existing YOLO detection model must be extended with new object classes that serve as building blocks for semantic reasoning: black entrances to hideouts (various sizes), piles of tree branches, footpaths, roads, individual trees and tree blocks. These are the raw detections — the primitives. Second, the semantic detection module takes these primitives plus scene context and applies contextual reasoning: tracing footpaths and assessing their freshness, following them to their endpoints, and identifying concealed structures at those endpoints.
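The split between raw primitives and contextual reasoning can be sketched as plain data structures plus a rule over them. Everything below is illustrative: the class names, confidence fields, distance threshold, and the toy "entrance near a footpath end" rule are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Hypothetical label set extending the existing YOLO model (names assumed).
class Primitive(Enum):
    BLACK_ENTRANCE = auto()
    BRANCH_PILE = auto()
    FOOTPATH = auto()
    ROAD = auto()
    TREE = auto()
    TREE_BLOCK = auto()

@dataclass
class Detection:
    cls: Primitive
    confidence: float
    bbox: tuple[float, float, float, float]  # x, y, w, h in image coordinates

@dataclass
class SceneContext:
    detections: list[Detection]
    frame_id: int

def entrances_near_footpath_end(ctx: SceneContext, max_dist: float = 50.0) -> list[Detection]:
    """Toy contextual rule: flag black entrances close to the far end of a footpath bbox."""
    paths = [d for d in ctx.detections if d.cls is Primitive.FOOTPATH]
    hits = []
    for d in ctx.detections:
        if d.cls is not Primitive.BLACK_ENTRANCE:
            continue
        for p in paths:
            path_end_x = p.bbox[0] + p.bbox[2]
            if abs(d.bbox[0] - path_end_x) < max_dist and abs(d.bbox[1] - p.bbox[1]) < max_dist:
                hits.append(d)
                break
    return hits
```

The real semantic module would apply such rules over time (path tracing across frames) rather than per-frame, but the same primitive/reasoning separation holds.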
The system operates a two-level scan controlled by the semantic module:
Level 1 — Wide-area sweep. The camera points along the UAV route at medium zoom, sweeping left-right perpendicular to the flight path. YOLO continuously detects primitives: footpaths, roads, tree rows, branch piles, black entrances, buildings, vehicles. When the system detects a point of interest (a footpath starting point, a suspicious branch pile, a tree row that could conceal a position), it marks the point's estimated coordinates (computed without GPS, since the system operates in GPS-denied conditions) and queues it for a detailed scan.
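The left-right sweep can be sketched as a triangle-wave pan command fed to the gimbal at each control tick. The period and pan limits below are illustrative assumptions, not values from the project.

```python
def sweep_pan_angle(t: float, period_s: float = 8.0, max_pan_deg: float = 45.0) -> float:
    """Pan angle for the Level 1 sweep at time t (seconds).

    Produces a triangle wave: -max_pan_deg -> +max_pan_deg -> -max_pan_deg
    over one period, i.e. a steady left-right swing across the flight path.
    """
    phase = (t % period_s) / period_s      # 0..1 position within one cycle
    tri = 1.0 - abs(2.0 * phase - 1.0)     # 0 -> 1 -> 0 triangle
    return (2.0 * tri - 1.0) * max_pan_deg # scale to [-max, +max]
```

A triangle wave (constant angular rate, instant reversal) keeps ground coverage uniform; a real controller would also rate-limit the reversal to respect gimbal dynamics.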
Level 2 — Detailed scan. The camera zooms into the queued point of interest. The semantic module takes control of the gimbal and executes a targeted investigation:
- Path following: when a footpath is detected, the camera pans along the path direction, tracing it from origin to endpoint. The gimbal adjusts pan and tilt to keep the path centered as the UAV moves. At intersections, the system follows the freshest or most promising branch.
- Endpoint analysis: when the path terminates at a structure (branch pile, dark entrance, dugout), the camera holds position and the VLM performs detailed semantic analysis — assessing whether the endpoint is a concealed position.
- Area sweep: for broader POIs (tree rows, clearings), the camera performs a slow pan across the area at high zoom, letting both YOLO and semantic detection scan for hidden details.
- Return: after analysis completes or a timeout is reached, the camera returns to Level 1 wide-area sweep and moves to the next queued POI or continues the route.
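The two-level loop above can be sketched as a small state machine around a POI queue. The class names, FIFO queue discipline, and timeout value are assumptions for illustration, not the project's actual design.

```python
from collections import deque
from enum import Enum, auto

class ScanState(Enum):
    WIDE_SWEEP = auto()  # Level 1: camera sweeps perpendicular to the route
    DETAILED = auto()    # Level 2: semantic module drives the gimbal

class TwoLevelScanner:
    """Minimal sketch of the scan controller (hypothetical API)."""

    def __init__(self, detail_timeout_s: float = 30.0):
        self.state = ScanState.WIDE_SWEEP
        self.poi_queue: deque = deque()
        self.detail_timeout_s = detail_timeout_s
        self.elapsed = 0.0
        self.current = None

    def on_poi(self, poi) -> None:
        # Level 1 found something worth a closer look; queue it.
        self.poi_queue.append(poi)

    def tick(self, dt: float) -> ScanState:
        """Advance the controller by dt seconds and return the active state."""
        if self.state is ScanState.WIDE_SWEEP and self.poi_queue:
            # Switch to Level 2 on the next queued POI.
            self.current = self.poi_queue.popleft()
            self.state = ScanState.DETAILED
            self.elapsed = 0.0
        elif self.state is ScanState.DETAILED:
            self.elapsed += dt
            if self.elapsed >= self.detail_timeout_s:
                # Return rule: fall back to the wide sweep; the next queued
                # POI (if any) is picked up on a subsequent tick.
                self.state = ScanState.WIDE_SWEEP
                self.current = None
        return self.state
```

In the real system the "analysis complete" signal from the VLM would also end a detailed scan; only the timeout path is shown here for brevity.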
The project splits into three submodules: semantic detection AI (models + inference), camera gimbal control (scan patterns + gimbal commands), and integration with the existing detections service. Annotation tooling and training pipelines exist in separate repositories.