detections-semantic/_docs/02_plans/architecture.md (Oleksandr Bezdieniezhnykh, initial commit, 2026-03-26)

# Semantic Detection System — Architecture

## 1. System Context

Problem being solved: Reconnaissance UAVs with YOLO-based object detection cannot identify camouflaged/concealed military positions (FPV operator hideouts, hidden artillery, dugouts masked by branches). A semantic detection layer is needed that detects footpaths, traces them to endpoints, and identifies concealed structures — controlling the camera gimbal through a two-level scan strategy (wide sweep + detailed investigation).

System boundaries:

- Inside: Semantic detection pipeline (Tier 1/2/3 inference), scan controller (L1/L2 Behavior Tree), gimbal driver (ViewLink serial), frame recorder, detection logger, system health monitor
- Outside: Existing YOLO detection pipeline, GPS-denied navigation, mission planning, annotation tooling, training pipelines, operator display

External systems:

| System | Integration Type | Direction | Purpose |
|---|---|---|---|
| Existing YOLO Pipeline | REST API (in-process or local HTTP) | Inbound | Provides scene-level detections (vehicles, roads, buildings) as context |
| ViewPro A40 Gimbal | UART serial (ViewLink protocol) | Outbound | Camera pan/tilt/zoom commands |
| GPS-Denied System | Shared memory / API | Inbound | Provides current GPS-denied coordinates for detection logging |
| Operator Display | REST API / shared detection output | Outbound | Delivers detection results (bounding boxes + metadata) |
| NVMe Storage | Filesystem | Both | Frame recording, detection logs, model files, config |

## 2. Technology Stack

| Layer | Technology | Version | Rationale |
|---|---|---|---|
| Language (core) | Cython / C | — | Extends existing detection codebase; maximum performance |
| Language (VLM) | Python | 3.11 | NanoLLM and VLM libraries are Python-native |
| Language (tools) | Python | 3.11 | Configuration, logging, frame recording utilities |
| Inference (Tier 1) | TensorRT FP16 | JetPack 6.2 bundled | Fastest inference on Jetson; FP16 is stable (INT8 deferred) |
| Inference (Tier 3) | NanoLLM (MLC/TVM) | 24.7+ | Purpose-built for Jetson VLM inference; Docker-based |
| Detection model | YOLOE (yoloe-11s-seg or yoloe-26s-seg) | Ultralytics 8.4.x (pinned) | Open-vocabulary segmentation; backbone selected empirically |
| VLM model | VILA1.5-3B (4-bit MLC) | — | Confirmed on Orin Nano; multimodal; stable via NanoLLM |
| Image processing | OpenCV + scikit-image | 4.x | Skeletonization, morphology, frame quality assessment |
| Orchestration | py_trees | 2.4.0 | Behavior tree for scan controller; extensible, preemptive |
| Serial comm | pyserial + crcmod | — | ViewLink gimbal protocol with CRC-16 |
| IPC | Unix domain socket | — | Semantic process ↔ VLM process communication |
| Containerization | Docker | JetPack 6.2 container | VLM runs in NanoLLM Docker; main service in existing Docker |
| Configuration | YAML | — | All thresholds, class names, scan parameters, degradation levels |
| Platform | Jetson Orin Nano Super | JetPack 6.2 | 67 TOPS, 8GB LPDDR5, NVMe SSD boot |

Key constraints from restrictions.md:

- 8GB shared RAM: YOLO takes ~2GB; semantic + VLM must fit in ~6GB. GPU workloads are scheduled sequentially (no concurrent YOLO + VLM).
- Cython + TRT codebase: new modules must integrate with the existing Cython build system
- Air-gapped: no cloud connectivity; all inference is local. Updates arrive via USB drive.
- ViewPro A40 zoom transition: zoom changes take 1-2 seconds, a physical constraint that affects L1→L2 timing

## 3. Deployment Model

Environments: Development (workstation with GPU), Production (Jetson Orin Nano Super on UAV)

Infrastructure:

- Production: Jetson Orin Nano Super with ruggedized carrier board (MILBOX-ORNX or similar), NVMe SSD, active cooling
- Development: x86 workstation with NVIDIA GPU (for model training and testing)
- No cloud, no staging environment — field-deployed edge device

Environment-specific configuration:

| Config | Development | Production |
|---|---|---|
| Inference engine | ONNX Runtime (CPU/GPU) or TRT on dev GPU | TensorRT FP16 on Jetson |
| Gimbal | Mock serial (TCP socket) | Real UART to ViewPro A40 |
| VLM | NanoLLM Docker or direct Python | NanoLLM Docker on Jetson |
| Storage | Local filesystem | NVMe SSD (industrial grade) |
| Logging | Console + file | JSON-lines to NVMe |
| Thermal monitor | Disabled | Active (tegrastats) |
| Power monitor | Disabled | Active (INA sensors) |
| Config file | config.dev.yaml | config.prod.yaml |
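A production config along these lines could capture the table above. The key names below are illustrative assumptions — the real schema lives in the project's `config.prod.yaml`:

```yaml
# config.prod.yaml — illustrative fragment; key names are assumptions
inference:
  engine: tensorrt
  precision: fp16
gimbal:
  port: /dev/ttyTHS1     # assumed UART device node
  baud: 115200
  retries: 3
vlm:
  enabled: true
  socket: /run/semantic/vlm.sock
  timeout_s: 5
monitoring:
  thermal: true          # tegrastats
  power: true            # INA sensors
logging:
  format: jsonl
```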

## 4. Data Model Overview

Core entities:

See data_model.md for full details. Summary:

- Runtime structs (in-memory only): `FrameContext`, `YoloDetection` (external input), `POI`, `GimbalState`
- Persistent (NVMe flat files): `DetectionLogEntry` (JSON-lines), `HealthLogEntry` (JSON-lines), `RecordedFrames` (JPEG), `Config` (YAML)

No database. Transient processing artifacts (segmentation masks, skeletons, endpoint crops) are created, consumed, and discarded within a single frame's processing cycle.
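The runtime structs might look roughly like the following. These shapes are illustrative only — field names are assumptions; data_model.md holds the authoritative definitions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GimbalState:
    pan_deg: float = 0.0
    tilt_deg: float = 0.0
    zoom_level: int = 1

@dataclass
class POI:
    """Point of interest produced by Tier 2, queued for L2 investigation."""
    x: float
    y: float
    confidence: float
    source_tier: int = 2
    gps_denied_coords: Optional[tuple] = None  # None if nav system unavailable

@dataclass
class FrameContext:
    """Per-frame working state; discarded after the processing cycle."""
    frame_id: int
    timestamp: float
    gimbal: GimbalState = field(default_factory=GimbalState)
    pois: list = field(default_factory=list)
```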

Data flow summary:

- Camera → Frame → YOLO (external) → detections → SemanticPipeline → detection log + operator
- SemanticPipeline → ScanController → GimbalDriver → ViewPro A40
- Frame → Recorder → NVMe (JPEG) + Logger → NVMe (JSON-lines)

## 5. Integration Points

### Internal Communication

| From | To | Protocol | Pattern | Notes |
|---|---|---|---|---|
| ScanController | Tier1Detector | Direct function call (Cython) | Sync pipeline | Same process, frame buffer shared |
| Tier1Detector | Tier2SpatialAnalyzer | Direct function call | Sync pipeline | Segmentation mask or detection list passed in memory |
| ScanController | VLMProcess | Unix domain socket (JSON) | Async request-response | VLM in separate Docker container; 5s timeout |
| ScanController | GimbalDriver | Direct function call | Command queue | Scan controller pushes target angles |
| GimbalDriver | ViewPro A40 | UART serial (ViewLink protocol) | Command-response | 115200 baud; use native ViewLink checksum if available, add CRC-16 only if protocol lacks integrity checks |
| ScanController | Health monitor (inline) | `health_check()` at top of main loop | Capability flags | Reads tegrastats, gimbal heartbeat, VLM status; no separate thread |

### External Integrations

| External System | Protocol | Auth | Rate Limits | Failure Mode |
|---|---|---|---|---|
| Existing YOLO Pipeline | In-process call or local HTTP (localhost) | None (same device) | Frame rate (10-30 FPS) | `semantic_available=false` → YOLO-only mode |
| GPS-Denied System | Shared memory or local API | None (same device) | Per-frame | Coordinates logged as null if unavailable |
| Operator Display | Detection output format (same as YOLO) | None (same device) | Per-detection | Detections queued if display unavailable |
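The inline `health_check()` row can be sketched as a function returning capability flags rather than raising, so the main loop degrades instead of stopping. The probe callables below are hypothetical stand-ins for the real tegrastats/heartbeat/IPC checks:

```python
# Hedged sketch of the inline health check; probe functions are placeholders.
def health_check(thermal_ok=lambda: True,
                 gimbal_heartbeat=lambda: True,
                 vlm_responsive=lambda: True) -> dict:
    """Run at the top of each main-loop iteration; no separate thread."""
    flags = {
        "semantic": True,               # this process is alive if we got here
        "gimbal": gimbal_heartbeat(),   # e.g. last serial heartbeat is recent
        "vlm": vlm_responsive(),        # e.g. last IPC ping succeeded
    }
    if not thermal_ok():                # e.g. tegrastats junction-temp check
        flags["vlm"] = False            # shed the heaviest GPU load first
    return flags
```

Consumers read the flags each tick; a dropped flag disables the corresponding tier or behavior rather than halting the pipeline.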

## 6. Non-Functional Requirements

| Requirement | Target | Measurement | Priority |
|---|---|---|---|
| Tier 1 latency (p95) | ≤100ms per frame | TRT inference time on Jetson | High |
| Tier 2 latency (p95) | ≤200ms per ROI (V2 CNN) / ≤50ms (V1 heuristic) | Processing time from mask to classification | High |
| Tier 3 latency (p95) | ≤5s per ROI | VLM request-to-response via IPC | Medium |
| Memory (semantic+VLM) | ≤6GB peak | tegrastats monitoring | High |
| Thermal (sustained) | T_junction < 75°C | tegrastats, 60-min test | High |
| Throughput | ≥8 FPS sustained (Tier 1) | Frames processed per second | High |
| Availability | Capability-flag degradation (vlm, gimbal, semantic) | Continuous operation despite component failures | High |
| Cold start | ≤60s to first detection | Power-on to first result | Medium |
| Recording endurance | ≥2 hours at Level 2 rate | NVMe write, 256GB SSD | Medium |
| Data retention | Until NVMe full (circular buffer) | Oldest L1 frames overwritten first | Low |
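The oldest-first retention policy amounts to deleting recorded frames until enough space is free. A minimal sketch, assuming a flat directory of JPEGs and an injectable free-space probe (the real recorder's layout and thresholds may differ):

```python
import os
from pathlib import Path

def enforce_retention(frame_dir, min_free_bytes, free_bytes=None):
    """Delete oldest recorded frames until enough space is free."""
    if free_bytes is None:
        def free_bytes():
            st = os.statvfs(frame_dir)          # POSIX free-space query
            return st.f_bavail * st.f_frsize
    deleted = 0
    for frame in sorted(Path(frame_dir).glob("*.jpg"),
                        key=lambda p: p.stat().st_mtime):  # oldest first
        if free_bytes() >= min_free_bytes:
            break
        frame.unlink()
        deleted += 1
    return deleted
```

Called periodically by the recorder, this keeps the newest frames while older L1 sweep frames age out first.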

## 7. Security Architecture

Authentication: None required — all components are local on the same Jetson device, air-gapped network.

Authorization: N/A — single-user system, operator interacts via separate display system.

Data protection:

- At rest: No encryption (performance priority on edge device; physical security assumed via UAV possession)
- In transit: N/A (all communication is local — UART, Unix socket, localhost)
- Secrets management: No secrets — no API keys, no cloud credentials. Model files are not sensitive (publicly available architectures).

Audit logging: Detection log (JSON-lines) records every detection with timestamp, coordinates, confidence, tier. Gimbal command log records every command sent. Both stored on NVMe. Retained until overwritten by circular buffer or manually extracted via USB.
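A JSON-lines detection log entry can be appended as a single line per detection. The field names here are illustrative assumptions consistent with the entities listed in section 4, not the committed schema:

```python
import json
import time

def log_detection(fh, *, cls, confidence, tier, bbox, coords=None):
    """Append one detection as a single JSON line (fire-and-forget write)."""
    entry = {
        "ts": time.time(),
        "class": cls,
        "confidence": confidence,
        "tier": tier,        # 1, 2, or 3
        "bbox": bbox,        # [x, y, w, h] in pixels
        "coords": coords,    # GPS-denied fix, or None if unavailable
    }
    fh.write(json.dumps(entry) + "\n")
    return entry
```

One line per record keeps writes append-only and crash-tolerant, and each line stays independently parseable when extracted via USB.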

## 8. Key Architectural Decisions

### ADR-001: Three-tier inference architecture

Context: Need both fast initial detection (≤100ms) and deep semantic analysis (≤5s). Single model cannot achieve both.

Decision: Three tiers — Tier 1 (YOLOE TRT, ≤100ms), Tier 2 (path tracing + heuristic/CNN, ≤200ms), Tier 3 (VLM, ≤5s, optional). Each tier runs only when the previous tier triggers it.

Alternatives considered:

  1. Single VLM for all analysis — rejected: too slow for real-time scanning (>2s per frame)
  2. YOLO + VLM only (no Tier 2) — rejected: VLM would be invoked too frequently, saturating GPU

Consequences: More complex pipeline; three models to manage; but enables real-time scanning with deep analysis only when needed.
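The gating rule — each tier runs only when the previous one triggers it — can be sketched as follows. The detector callables and return shapes are illustrative placeholders, not the real pipeline API:

```python
# Sketch of the ADR-001 tier cascade; budgets are from the NFR table.
def run_cascade(frame, tier1, tier2, tier3, vlm_available=True):
    """Return (result, deepest_tier_run) for one frame."""
    dets = tier1(frame)               # ≤100 ms: YOLOE TRT detections
    if not dets:
        return None, 1                # nothing found — stay in Tier 1
    pois = tier2(frame, dets)         # ≤200 ms: path tracing / heuristic-CNN
    if not pois or not vlm_available:
        return pois, 2                # no POIs, or VLM capability flag down
    verdicts = tier3(frame, pois)     # ≤5 s: VLM analysis (optional)
    return verdicts, 3
```

The `vlm_available` flag is where capability-flag degradation plugs in: Tier 3 is simply skipped when the VLM is unloaded or unhealthy.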

### ADR-002: NanoLLM instead of vLLM for VLM runtime

Context: VLM process needs stable inference on Jetson Orin Nano 8GB. vLLM has documented system freezes and crashes on this hardware.

Decision: Use NanoLLM (NVIDIA's Jetson-optimized library) with Docker containers and MLC/TVM quantization.

Alternatives considered:

  1. vLLM — rejected: system freezes, reboots, installation crashes (multiple open GitHub issues)
  2. llama.cpp — kept as fallback for GGUF models not supported by NanoLLM

Consequences: Limited model selection (VILA, LLaVA, Obsidian); UAV-VL-R1 only available via llama.cpp fallback.

### ADR-003: YOLOE backbone selection deferred to empirical benchmark

Context: YOLO26 has reported accuracy regression on custom datasets vs YOLO11. Both are supported by YOLOE.

Decision: Support both yoloe-11s-seg and yoloe-26s-seg as configurable backends. Sprint 1 benchmarks on real annotated data determine the winner.

Alternatives considered:

  1. Commit to YOLO26 — rejected: reported regression risk
  2. Commit to YOLO11 — rejected: YOLO26 has better NMS-free deployment and small-object features

Consequences: Must maintain two TRT engine files; config switch; slightly more build complexity.

### ADR-004: FP16 only, INT8 deferred

Context: TensorRT INT8 export crashes on Jetson Orin (JetPack 6, TRT 10.3.0) during calibration.

Decision: Use FP16 for all TRT engines in initial deployment. INT8 optimization deferred to Phase 3+.

Alternatives considered:

  1. INT8 from day one — rejected: documented crashes, unstable tooling
  2. Mixed precision (FP16 backbone, INT8 head) — rejected: adds complexity without proven stability

Consequences: ~2x slower than INT8 theoretical maximum; acceptable given FP16 already meets latency targets.

### ADR-005: VLM as separate Docker process with IPC

Context: VLM (NanoLLM) runs in a Docker container with specific CUDA/MLC dependencies. Cannot be compiled into Cython codebase.

Decision: VLM runs as a separate Docker container. Communication via Unix domain socket (JSON messages). Loaded dynamically during Level 2 only; unloaded to free GPU memory during Level 1.

Alternatives considered:

  1. VLM compiled into main process — rejected: dependency incompatibility with Cython + TRT pipeline
  2. VLM always loaded — rejected: consumes ~3GB GPU memory that's needed for YOLO during Level 1

Consequences: IPC latency overhead (~10ms); container management complexity; but clean separation and memory efficiency.
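JSON over a stream socket needs message framing. One common scheme — assumed here, since the document does not pin a wire format — is a 4-byte big-endian length prefix followed by the UTF-8 JSON payload:

```python
import json
import socket
import struct

# Hedged sketch of UDS message framing; the real wire format may differ.
def send_msg(sock: socket.socket, obj) -> None:
    payload = json.dumps(obj).encode("utf-8")
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_msg(sock: socket.socket):
    (length,) = struct.unpack(">I", _recv_exactly(sock, 4))
    return json.loads(_recv_exactly(sock, length))

def _recv_exactly(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return buf
```

On the client side, `sock.settimeout(5.0)` before `recv_msg` enforces the 5s timeout from the integration table; a `socket.timeout` then drops the `vlm` capability flag.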

### ADR-006: NVMe SSD mandatory, no SD card

Context: Recurring SD card corruption documented on Jetson Orin Nano. Production module has no eMMC.

Decision: NVMe SSD for OS, models, recording, logging. Industrial-grade SSD with vibration-resistant mount.

Alternatives considered:

  1. SD card — rejected: documented corruption issues across multiple brands
  2. USB drive — rejected: slower, less reliable under vibration

Consequences: Additional hardware cost (~$40-80); requires NVMe-compatible carrier board.

### ADR-007: UART integrity for gimbal communication

Context: ViewPro documents EMI-induced random gimbal panning from antenna interference. UART communication needs error detection.

Decision: First, check if ViewLink Serial Protocol V3.3.3 includes native checksums (read full spec during implementation). If yes, use the native checksum and add retry logic on checksum failure. If no native checksum exists, add CRC-16 (CRC-CCITT) wrapper. Either way: retry up to 3 times on integrity failure, log errors.

Alternatives considered:

  1. No error detection — rejected: EMI is a documented real-world issue
  2. Always add custom CRC regardless — rejected: may conflict with native protocol

Consequences: Depends on spec reading; physical EMI mitigation (shielded cable, 35cm antenna separation) still needed regardless.
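If the spec reading lands on the CRC-16 wrapper branch, the frame logic is small. In the codebase this would come from `crcmod`; the stdlib-only sketch below shows CRC-16/CCITT-FALSE, and the frame layout (payload followed by a 2-byte big-endian CRC) is an assumption:

```python
def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
        crc &= 0xFFFF
    return crc

def frame(payload: bytes) -> bytes:
    """Append the CRC to an outgoing gimbal command."""
    return payload + crc16_ccitt(payload).to_bytes(2, "big")

def verify(framed: bytes) -> bytes:
    """Check an incoming frame; caller retries up to 3 times and logs."""
    payload, received = framed[:-2], int.from_bytes(framed[-2:], "big")
    if crc16_ccitt(payload) != received:
        raise ValueError("CRC mismatch")
    return payload
```

A single bit flipped by EMI makes `verify` raise, triggering the retry-and-log path instead of a spurious gimbal command.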

### ADR-008: Behavior Tree for ScanController orchestration

Context: ScanController manages two scan levels (L1 sweep, L2 investigation), health preemption, POI queueing, and future extensions (spiral search, thermal scan). Need a pattern that handles preemption cleanly and is extensible.

Decision: Use py_trees (2.4.0) Behavior Tree. Root Selector tries HealthGuard → L2Investigation → L1Sweep → Idle. Leaf nodes are simple procedural calls into existing components. Shared state via py_trees Blackboard.

Alternatives considered:

  1. Flat state machine — rejected: adding new scan modes requires rewiring transitions; preemption logic becomes tangled
  2. Hierarchical state machine — viable but less standard for autonomous vehicles; less tooling support
  3. Hybrid (BT + procedural leaves) — this is essentially what we chose; BT structure with procedural leaf logic

Consequences: Adds py_trees dependency (~150KB); tree tick overhead negligible (<1ms); ASCII tree rendering aids debugging; new scan behaviors added as subtrees without modifying existing ones.
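py_trees supplies the real `Selector`, `Sequence`, and `Blackboard` classes; the dependency-free sketch below only illustrates the tick semantics of the root Selector described above (HealthGuard → L2Investigation → L1Sweep → Idle). The leaf behaviours are hypothetical stand-ins:

```python
# Priority selection: tick children in order; first non-FAILURE wins.
def selector_tick(children, blackboard):
    for name, behaviour in children:
        status = behaviour(blackboard)  # "SUCCESS" / "RUNNING" / "FAILURE"
        if status != "FAILURE":
            return name, status
    return None, "FAILURE"

# Hypothetical leaves: HealthGuard preempts when degradation is flagged,
# L2Investigation runs while POIs are queued, otherwise L1Sweep sweeps.
children = [
    ("HealthGuard", lambda bb: "RUNNING" if bb.get("degraded") else "FAILURE"),
    ("L2Investigation", lambda bb: "RUNNING" if bb.get("poi_queue") else "FAILURE"),
    ("L1Sweep", lambda bb: "RUNNING"),
    ("Idle", lambda bb: "SUCCESS"),
]
```

Because higher-priority children are re-evaluated on every tick, a health failure or a newly queued POI preempts the L1 sweep without any explicit transition wiring — the property that motivated choosing a BT over a flat state machine.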