detections-semantic/_docs/02_plans/epics.md

# Jira Epics — Semantic Detection System

> Epics created in Jira project AZ (AZAION) on 2026-03-20.

## Epic → Jira ID Mapping

| # | Epic | Jira ID |
|---|------|---------|
| 1 | Bootstrap & Initial Structure | AZ-130 |
| 2 | Tier1Detector — YOLOE TensorRT Inference | AZ-131 |
| 3 | Tier2SpatialAnalyzer — Spatial Pattern Analysis | AZ-132 |
| 4 | VLMClient — NanoLLM IPC Client | AZ-133 |
| 5 | GimbalDriver — ViewLink Serial Control | AZ-134 |
| 6 | OutputManager — Recording & Logging | AZ-135 |
| 7 | ScanController — Behavior Tree Orchestrator | AZ-136 |
| 8 | Integration Tests — End-to-End System Testing | AZ-137 |

## Dependency Order

```
1. AZ-130 Bootstrap & Initial Structure  (no dependencies)
2. AZ-131 Tier1Detector                  (depends on AZ-130)
3. AZ-132 Tier2SpatialAnalyzer           (depends on AZ-130)  ← parallel with 2,4,5,6
4. AZ-133 VLMClient                      (depends on AZ-130)  ← parallel with 2,3,5,6
5. AZ-134 GimbalDriver                   (depends on AZ-130)  ← parallel with 2,3,4,6
6. AZ-135 OutputManager                  (depends on AZ-130)  ← parallel with 2,3,4,5
7. AZ-136 ScanController                 (depends on AZ-130–AZ-135)
8. AZ-137 Integration Tests              (depends on AZ-136)
```
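
The ordering above can be checked mechanically. The following is a minimal sketch that uses the standard-library `graphlib` module to group the epics into waves that can run in parallel; the names `DEPENDENCIES` and `execution_waves` are illustrative and not part of the project.

```python
from graphlib import TopologicalSorter

# Epic dependencies copied from the list above: epic -> set of prerequisites.
COMPONENT_EPICS = ["AZ-131", "AZ-132", "AZ-133", "AZ-134", "AZ-135"]
DEPENDENCIES = {
    "AZ-130": set(),
    **{epic: {"AZ-130"} for epic in COMPONENT_EPICS},
    "AZ-136": {"AZ-130", *COMPONENT_EPICS},
    "AZ-137": {"AZ-136"},
}


def execution_waves(deps):
    """Group epics into waves; epics within one wave have no mutual dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


WAVES = execution_waves(DEPENDENCIES)
```

With the dependencies as listed, `WAVES` has four entries: AZ-130 alone, then AZ-131 through AZ-135 together, then AZ-136, then AZ-137.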

---

## Epic 1: Bootstrap & Initial Structure

**Summary**: Scaffold the project: folder structure, shared models, interfaces, stubs, CI/CD config, Docker setup, test infrastructure.

**Problem / Context**: The semantic detection module needs a clean project scaffold that integrates with the existing Cython + TensorRT codebase. All components share Config and Types helpers that must exist before any implementation.

**Scope**:

In Scope:
- Project folder structure matching architecture
- Config helper: YAML loading, validation, typed access, dev/prod configs
- Types helper: all shared dataclasses (FrameContext, Detection, POI, GimbalState, CapabilityFlags, SpatialAnalysisResult, Waypoint, VLMResponse, SearchScenario)
- Interface stubs for all 6 components
- Docker setup: dev Dockerfile, docker-compose with mock services
- CI pipeline config: lint, test, build stages
- Test infrastructure: pytest setup, fixture directory, mock factories

Out of Scope:
- Component implementation (handled in component epics)
- Real hardware integration
- Model training / export

**Dependencies**:
- Epic dependencies: None (first epic)
- External: Existing detections repository access

**Effort Estimation**: M / 5-8 points

**Acceptance Criteria**:

| # | Criterion | Measurable Condition |
|---|-----------|---------------------|
| 1 | Config helper loads and validates YAML | Unit tests pass for valid/invalid configs |
| 2 | Types helper defines all shared structs | All dataclasses importable, fields match spec |
| 3 | Docker dev environment boots | `docker compose up` succeeds, health endpoint responds |
| 4 | CI pipeline runs lint + test | Pipeline passes on scaffold code |

**Risks**:

| # | Risk | Mitigation |
|---|------|------------|
| 1 | Cython build system integration | Start with pure Python, Cython-ize later |
| 2 | Config schema changes during development | Version field in config, validation tolerant of additions |
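
The second mitigation (a version field plus validation that tolerates additions) could look roughly like the sketch below. It operates on an already-parsed mapping, so it is independent of the YAML loader. The section name `detector`, the keys `engine_path` and `confidence_threshold`, and the supported version `1` are hypothetical placeholders, not the project's schema.

```python
from dataclasses import dataclass

SUPPORTED_VERSIONS = {1}  # hypothetical schema versions


@dataclass(frozen=True)
class DetectorConfig:
    """Hypothetical typed view over one config section."""
    engine_path: str
    confidence_threshold: float


def load_detector_config(raw):
    """Validate a parsed config mapping; return (typed config, warnings)."""
    version = raw.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported config version: {version!r}")

    section = raw.get("detector")
    if not isinstance(section, dict):
        raise ValueError("missing 'detector' section")

    known = {"engine_path", "confidence_threshold"}
    missing = known - set(section)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    threshold = float(section["confidence_threshold"])
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("confidence_threshold must be within [0, 1]")

    # Tolerate additions: unknown keys produce warnings, not errors.
    unknown = sorted(set(section) - known)
    warnings = [f"ignoring unknown key: {key}" for key in unknown]
    return DetectorConfig(str(section["engine_path"]), threshold), warnings
```

Required keys and value ranges fail fast, while keys added later in development only generate warnings, so older code keeps loading newer config files.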

**Labels**: `component:bootstrap`, `type:platform`

**Child Issues**:

| Type | Title | Points |
|------|-------|--------|
| Task | Create project folder structure and pyproject.toml | 1 |
| Task | Implement Config helper with YAML validation | 3 |
| Task | Implement Types helper with all shared dataclasses | 2 |
| Task | Create interface stubs for all 6 components | 2 |
| Task | Docker dev setup + docker-compose | 3 |
| Task | CI pipeline config (lint, test, build) | 2 |
| Task | Test infrastructure (pytest, fixtures, mock factories) | 2 |

---

## Epic 2: Tier1Detector — YOLOE TensorRT Inference

**Summary**: Wrap YOLOE TensorRT FP16 inference for detection + segmentation on aerial frames.

**Problem / Context**: The system needs fast (≤100ms) object detection, including segmentation masks for footpaths and concealment indicators. It must support both YOLOE-11 and YOLOE-26 backbones.

**Scope**:

In Scope:
- TRT engine loading with class name configuration
- Frame preprocessing (resize, normalize)
- Detection + segmentation inference
- NMS handling for YOLOE-11 (NMS-free for YOLOE-26)
- ONNX Runtime fallback for dev environment

Out of Scope:
- Model training and export (separate repo)
- Custom class fine-tuning
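
The NMS item in scope implies a postprocessing branch that depends on the backbone. A minimal pure-Python sketch of that branch follows, assuming class-agnostic greedy NMS and a boolean `nms_free` flag supplied by the engine loader. The `Box` type, the flag, and the 0.5 IoU threshold are illustrative assumptions; a production path would more likely run vectorized on the GPU.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float
    score: float


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def postprocess(boxes, nms_free, iou_threshold=0.5):
    """Sort by score; apply greedy NMS only when the model does not do it itself."""
    ranked = sorted(boxes, key=lambda b: b.score, reverse=True)
    if nms_free:
        return ranked  # assumed YOLOE-26 case: outputs are already de-duplicated
    kept = []
    for candidate in ranked:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(candidate, k) <= iou_threshold for k in kept):
            kept.append(candidate)
    return kept
```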

**Dependencies**:
- Epic dependencies: Bootstrap
- External: Pre-exported TRT engine files, Ultralytics 8.4.x

**Effort Estimation**: M / 5-8 points

**Acceptance Criteria**:

| # | Criterion | Measurable Condition |
|---|-----------|---------------------|
| 1 | Inference ≤100ms on Jetson Orin Nano Super | PT-01 passes |
| 2 | New classes reach precision ≥80% and recall ≥80% on validation set | AT-01 passes |
| 3 | Existing classes not degraded | AT-02 passes (mAP50 within 2% of baseline) |
| 4 | GPU memory ≤2.5GB | PT-03 passes |

**Risks**:

| # | Risk | Mitigation |
|---|------|------------|
| 1 | R01: Backbone accuracy on concealment data | Benchmark sprint; dual backbone strategy |
| 2 | YOLOE-26 NMS-free model packaging | Detect NMS-free engines from engine metadata at load time and validate |

**Labels**: `component:tier1-detector`, `type:inference`

**Child Issues**:

| Type | Title | Points |
|------|-------|--------|
| Spike | Benchmark YOLOE-11 vs YOLOE-26 on 200 annotated frames | 3 |
| Task | Implement TRT engine loader with class name config | 3 |
| Task | Implement detect() with preprocessing + postprocessing | 3 |
| Task | Add ONNX Runtime fallback for dev environment | 2 |
| Task | Write unit + integration tests for Tier1Detector | 2 |

---

## Epic 3: Tier2SpatialAnalyzer — Spatial Pattern Analysis

**Summary**: Analyze spatial patterns in Tier 1 detections by tracing footpath masks and clustering discrete objects, producing waypoints for gimbal navigation.

**Problem / Context**: After Tier 1 detects objects, spatial reasoning is needed to trace footpaths to endpoints (concealed positions) and group clustered objects (defense networks). This is the core semantic reasoning layer.

**Scope**:

In Scope:
- Mask tracing: skeletonize → prune → endpoints → classify
- Cluster tracing: spatial clustering → visit order → per-point classify
- ROI classification heuristic (darkness + contrast)
- Freshness tagging for mask traces

Out of Scope:
- CNN classifier (removed from V1)
- Machine learning-based classification (V2+)
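
The "endpoints" step of the mask tracing pipeline can be illustrated on an already skeletonized mask. The sketch below marks a skeleton pixel as an endpoint when it has exactly one 8-connected neighbor, which is a common convention and an assumption here, not a confirmed project rule. Skeletonization and pruning (for example via scikit-image) are assumed to have run beforehand.

```python
def skeleton_endpoints(skeleton):
    """Return (row, col) of skeleton pixels with exactly one 8-connected neighbor."""
    rows, cols = len(skeleton), len(skeleton[0])
    endpoints = []
    for r in range(rows):
        for c in range(cols):
            if not skeleton[r][c]:
                continue
            # Count set pixels in the 3x3 window around (r, c), excluding itself.
            neighbors = sum(
                1
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c) and skeleton[rr][cc]
            )
            if neighbors == 1:
                endpoints.append((r, c))
    return endpoints


# Toy L-shaped path: one end at the top left, the other at the bottom right.
SKELETON = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
```

For the toy mask, `skeleton_endpoints(SKELETON)` returns `[(1, 1), (3, 3)]`, the two ends of the path. Each endpoint would then be handed to the classification step.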

**Dependencies**:
- Epic dependencies: Bootstrap
- External: scikit-image, scipy

**Effort Estimation**: M / 5-8 points

**Acceptance Criteria**:

| # | Criterion | Measurable Condition |
|---|-----------|---------------------|
| 1 | trace_mask ≤200ms on 1080p mask | PT-01 passes |
| 2 | Concealed position recall ≥60% | AT-01 passes |
| 3 | Footpath endpoint detection ≥70% | AT-02 passes |
| 4 | Freshness tags correctly assigned | AT-03 passes (≥80% high-contrast correct) |

**Risks**:

| # | Risk | Mitigation |
|---|------|------------|
| 1 | R03: High false positive rate from heuristic | Conservative thresholds; per-season config |
| 2 | R07: Fragmented masks | Morphological closing; min-branch pruning |

**Labels**: `component:tier2-spatial-analyzer`, `type:inference`

**Child Issues**:

| Type | Title | Points |
|------|-------|--------|
| Task | Implement mask tracing pipeline (skeletonize, prune, endpoints) | 5 |
| Task | Implement cluster tracing (spatial clustering, visit order) | 3 |
| Task | Implement analyze_roi heuristic (darkness + contrast + freshness) | 3 |
| Task | Write unit + integration tests for Tier2SpatialAnalyzer | 2 |

---

## Epic 4: VLMClient — NanoLLM IPC Client

**Summary**: IPC client for communicating with the NanoLLM Docker container via Unix domain socket for Tier 3 visual language model analysis.

**Problem / Context**: Ambiguous Tier 2 results need deep visual analysis via VILA1.5-3B VLM. The VLM runs in a separate Docker container; this client manages the IPC protocol and model lifecycle (load/unload for GPU memory management).

**Scope**:

In Scope:
- Unix domain socket client (connect, disconnect)
- JSON IPC protocol (analyze, load_model, unload_model, status)
- Model lifecycle management (load on demand, unload to free GPU)
- Timeout handling, retry logic, availability tracking

Out of Scope:
- VLM model selection/training
- NanoLLM Docker container itself (pre-built)
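
The wire format of the JSON IPC protocol is not specified here. As an illustration only, the sketch below assumes newline-delimited JSON messages with hypothetical `command` and `params` fields, sent over an already connected stream socket. The four command names come from the scope list; everything else is an assumption.

```python
import json
import socket

COMMANDS = {"analyze", "load_model", "unload_model", "status"}


def encode_request(command, **params):
    """Serialize one request as a single JSON line (assumed framing)."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command}")
    message = json.dumps({"command": command, "params": params})
    return (message + "\n").encode("utf-8")


def read_response(sock, max_bytes=1 << 20):
    """Read until the terminating newline and parse the JSON body."""
    buffer = bytearray()
    while not buffer.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed before a full response arrived")
        buffer.extend(chunk)
        if len(buffer) > max_bytes:
            raise ValueError("response too large")
    return json.loads(buffer.decode("utf-8"))


def call(sock, command, timeout_s=5.0, **params):
    """Send one request and wait for one response, bounded by a timeout."""
    sock.settimeout(timeout_s)  # a timeout error (an OSError) is raised on expiry
    sock.sendall(encode_request(command, **params))
    return read_response(sock)
```

A client would create the socket with `socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)` and connect it to the container's socket path before invoking `call(sock, "status")`.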

**Dependencies**:
- Epic dependencies: Bootstrap
- External: NanoLLM Docker container with VILA1.5-3B

**Effort Estimation**: S / 3-5 points

**Acceptance Criteria**:

| # | Criterion | Measurable Condition |
|---|-----------|---------------------|
| 1 | Round-trip analyze ≤5s | PT-01 passes |
| 2 | GPU memory released on unload | PT-02 passes (≤baseline+50MB) |
| 3 | 3 consecutive failures → unavailable | IT-06 passes |
| 4 | Temp files cleaned after analyze | ST-02 passes |
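
Criterion 3 (three consecutive failures mark the VLM unavailable) can be sketched as a small counter that resets on success. The class and function names are hypothetical, and how the client later recovers from the unavailable state is not specified in this plan, so the sketch leaves it out.

```python
class AvailabilityTracker:
    """Marks a service unavailable after N consecutive failures."""

    def __init__(self, max_consecutive_failures=3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    @property
    def available(self):
        return self.consecutive_failures < self.max_consecutive_failures

    def record_success(self):
        self.consecutive_failures = 0  # any success resets the streak

    def record_failure(self):
        self.consecutive_failures += 1


def guarded_call(tracker, fn, *args, **kwargs):
    """Run fn and update the tracker; return None when unavailable or on failure."""
    if not tracker.available:
        return None
    try:
        result = fn(*args, **kwargs)
    except (OSError, ValueError):  # socket timeouts and resets are OSError subclasses
        tracker.record_failure()
        return None
    tracker.record_success()
    return result
```

Callers such as the scan controller can then check `tracker.available` to decide whether Tier 3 analysis should be skipped.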

**Risks**:

| # | Risk | Mitigation |
|---|------|------------|
| 1 | R02: VLM load latency (5-10s) | Predictive loading when first POI queued |
| 2 | R04: GPU memory pressure | Sequential scheduling; explicit unload |

**Labels**: `component:vlm-client`, `type:integration`

**Child Issues**:

| Type | Title | Points |
|------|-------|--------|
| Task | Implement Unix socket client (connect, disconnect, protocol) | 3 |
| Task | Implement model lifecycle management (load, unload, status) | 2 |
| Task | Implement analyze() with timeout, retry, availability tracking | 3 |
| Task | Write unit + integration tests for VLMClient | 2 |

---

## Epic 5: GimbalDriver — ViewLink Serial Control

**Summary**: Hardware adapter for the ViewPro A40 gimbal, implementing the ViewLink serial protocol for pan/tilt/zoom control and PID-based path following.

**Problem / Context**: The scan controller needs to point the camera at specific angles and follow paths. This requires implementing the ViewLink Serial Protocol V3.3.3 over UART, plus a PID controller for smooth path tracking. A mock TCP mode enables development without hardware.

**Scope**:

In Scope:
- ViewLink protocol: send commands, receive state, checksum validation
- Pan/tilt/zoom absolute control
- PID dual-axis path following
- Mock TCP mode for development
- Retry logic on checksum validation failure

Out of Scope:
- Advanced gimbal features (tracking modes, stabilization tuning)
- Hardware EMI mitigation (physical, not software)
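
The dual-axis PID item, together with the anti-windup child task, could be sketched as two independent single-axis controllers. The anti-windup variant shown is conditional integration (the integral is frozen while the output is saturated in the direction of the error), chosen here for illustration; the plan does not say which scheme is intended. Gains, limits, and units are placeholders.

```python
class PID:
    """Single-axis PID with output clamping and conditional-integration anti-windup."""

    def __init__(self, kp, ki, kd, out_limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_limit = out_limit
        self.integral = 0.0
        self._prev_error = None

    def update(self, error, dt):
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error

        candidate = self.integral + error * dt
        raw = self.kp * error + self.ki * candidate + self.kd * derivative
        output = max(-self.out_limit, min(self.out_limit, raw))

        # Accept the new integral only if the output is not saturated, or if
        # the error is pulling the output back toward the allowed range.
        if raw == output or error * raw < 0:
            self.integral = candidate
        return output


class DualAxisPID:
    """Independent pan and tilt controllers driven by per-axis errors."""

    def __init__(self, pan, tilt):
        self.pan, self.tilt = pan, tilt

    def update(self, pan_error, tilt_error, dt):
        return self.pan.update(pan_error, dt), self.tilt.update(tilt_error, dt)
```

Because the integral does not grow during saturation, the controller responds immediately once the error changes sign instead of first unwinding accumulated error.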

**Dependencies**:
- Epic dependencies: Bootstrap
- External: ViewLink Protocol V3.3.3 specification, pyserial

**Effort Estimation**: L / 8-13 points

**Acceptance Criteria**:

| # | Criterion | Measurable Condition |
|---|-----------|---------------------|
| 1 | Command latency ≤500ms | PT-01 passes |
| 2 | Zoom transition ≤2s | PT-02, AT-02 pass |
| 3 | PID keeps path within the center 50% of the frame | AT-03 passes (≥90% of cycles) |
| 4 | Smooth transitions (jerk ≤50 deg/s³) | PT-03 passes |
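
The jerk limit in criterion 4 can be estimated from logged gimbal angles with a third-order finite difference. This is a sketch of one possible measurement, assuming uniformly spaced samples in degrees; the function name and sampling setup are illustrative, and real encoder data would likely need filtering first.

```python
def max_abs_jerk(angles_deg, dt_s):
    """Peak |jerk| in deg/s^3 from uniformly sampled angles.

    Uses the third forward difference:
    jerk_i ~ (a[i+3] - 3*a[i+2] + 3*a[i+1] - a[i]) / dt^3
    """
    if len(angles_deg) < 4:
        raise ValueError("need at least four samples")
    dt3 = dt_s ** 3
    return max(
        abs(angles_deg[i + 3] - 3 * angles_deg[i + 2] + 3 * angles_deg[i + 1] - angles_deg[i]) / dt3
        for i in range(len(angles_deg) - 3)
    )


def is_smooth(angles_deg, dt_s, limit_deg_s3=50.0):
    """True when the estimated peak jerk stays within the limit."""
    return max_abs_jerk(angles_deg, dt_s) <= limit_deg_s3
```

As a sanity check, a cubic trajectory `angle = t**3` has a constant jerk of 6 deg/s³, which the third difference reproduces up to floating-point error.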

**Risks**:

| # | Risk | Mitigation |
|---|------|------------|
| 1 | R08: ViewLink protocol implementation effort | ArduPilot C++ reference; mock mode for parallel dev |
| 2 | PID tuning on real hardware | Configurable gains; bench test phase |

**Labels**: `component:gimbal-driver`, `type:hardware`

**Child Issues**:

| Type | Title | Points |
|------|-------|--------|
| Spike | Parse ViewLink V3.3.3 protocol spec, document packet format | 3 |
| Task | Implement UART/TCP connection layer with mock mode | 3 |
| Task | Implement command send/receive with checksum and retry | 5 |
| Task | Implement PID dual-axis controller with anti-windup | 3 |
| Task | Implement zoom_to_poi, return_to_sweep, follow_path | 3 |
| Task | Write unit + integration tests for GimbalDriver | 2 |
| Task | Mock gimbal TCP server for dev/test | 2 |

---

## Epic 6: OutputManager — Recording & Logging

**Summary**: Facade over all persistent output: detection logging, frame recording, health logging, gimbal logging, and operator detection delivery.

**Problem / Context**: Every flight produces data needed for operator situational awareness, post-flight review, and training data collection. Recording must never block inference. NVMe storage requires circular buffer management.

**Scope**:

In Scope:
- Detection JSON-lines logger (append, flush)
- JPEG frame recorder (L1 at 2 FPS, L2 at 30 FPS)
- Health and gimbal command logging
- Operator delivery in YOLO-compatible format
- NVMe storage monitoring and circular buffer

Out of Scope:
- Data export/upload tools
- Long-term storage management
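
Storage monitoring and the circular buffer could be approximated with standard-library calls. The sketch below assumes frames are stored as individual files in one directory, that "oldest" means lowest modification time, and that the buffer is bounded by a byte budget. None of these details are confirmed by the plan. The 20% default mirrors the storage-warning acceptance criterion.

```python
import os
import shutil


def free_fraction(path):
    """Fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def storage_warning(path, min_free=0.20):
    """True when free space has fallen below the warning threshold."""
    return free_fraction(path) < min_free


def prune_oldest(directory, max_bytes):
    """Delete oldest files until the directory fits in max_bytes; return deleted names."""
    entries = sorted(
        (entry for entry in os.scandir(directory) if entry.is_file()),
        key=lambda entry: entry.stat().st_mtime,
    )
    total = sum(entry.stat().st_size for entry in entries)
    deleted = []
    for entry in entries:
        if total <= max_bytes:
            break
        total -= entry.stat().st_size
        os.remove(entry.path)
        deleted.append(entry.name)
    return deleted
```

Running `prune_oldest` from a background task rather than the recording path keeps deletion latency away from inference.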

**Dependencies**:
- Epic dependencies: Bootstrap
- External: NVMe SSD, OpenCV for JPEG encoding

**Effort Estimation**: S / 3-5 points

**Acceptance Criteria**:

| # | Criterion | Measurable Condition |
|---|-----------|---------------------|
| 1 | 30 FPS frame recording without dropped frames | PT-01 passes |
| 2 | No memory leak under sustained load | PT-02 passes (≤10MB growth) |
| 3 | Storage warning triggers at <20% free | IT-07 passes |
| 4 | Write failures don't block caller | IT-09 passes |

**Risks**:

| # | Risk | Mitigation |
|---|------|------------|
| 1 | R11: NVMe write latency at 30 FPS | Async writes; drop frames if queue backs up |
| 2 | R09: Operator overload | Confidence thresholds; detection throttle |
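
The R11 mitigation (async writes, dropping frames if the queue backs up) might be structured as a bounded queue drained by a worker thread. This is a sketch under assumptions: `write_fn` stands in for the actual JPEG writer, and the queue size is a placeholder. Counting drops keeps them visible to health logging.

```python
import queue
import threading


class AsyncRecorder:
    """Hands frames to a background writer without ever blocking the caller."""

    def __init__(self, write_fn, max_pending=64):
        self._queue = queue.Queue(maxsize=max_pending)
        self._write_fn = write_fn
        self.dropped = 0
        self.write_errors = 0
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, frame):
        """Enqueue a frame; return False and count a drop if the queue is full."""
        try:
            self._queue.put_nowait(frame)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _run(self):
        while True:
            frame = self._queue.get()
            if frame is None:  # sentinel posted by close()
                break
            try:
                self._write_fn(frame)
            except OSError:
                # Write failures are counted, never raised to the caller.
                self.write_errors += 1

    def close(self):
        """Flush remaining frames, then stop the worker (may block)."""
        self._queue.put(None)
        self._worker.join()
```

Only `close()` can block, and it is meant for shutdown; `submit()` returns immediately in every case.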

**Labels**: `component:output-manager`, `type:data`

**Child Issues**:

| Type | Title | Points |
|------|-------|--------|
| Task | Implement detection JSON-lines logger | 2 |
| Task | Implement JPEG frame recorder with rate control | 3 |
| Task | Implement health + gimbal command logging | 1 |
| Task | Implement operator delivery in YOLO format | 2 |
| Task | Implement NVMe storage monitor + circular buffer | 3 |
| Task | Write unit + integration tests for OutputManager | 2 |

---

## Epic 7: ScanController — Behavior Tree Orchestrator

**Summary**: Central orchestrator implementing the two-level scan strategy via a py_trees behavior tree with data-driven search scenarios.

**Problem / Context**: All components need coordination: L1 sweep finds POIs, L2 investigation analyzes them via the appropriate subtree (path_follow, cluster_follow, area_sweep, zoom_classify), and results flow to the operator. Health monitoring provides graceful degradation.

**Scope**:

In Scope:
- Behavior tree structure (Root, HealthGuard, L2Investigation, L1Sweep, Idle)
- All 4 investigation subtrees (PathFollow, ClusterFollow, AreaSweep, ZoomClassify)
- POI queue with priority management and deduplication
- EvaluatePOI with scenario-aware trigger matching
- Search scenario YAML loading and dispatching
- Health API endpoint (/api/v1/health)
- Capability flags and graceful degradation

Out of Scope:
- Component internals (delegated to respective components)
- New investigation types (extensible via BT subtrees)
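
The POI queue item (priority, deduplication, bounded size) could be sketched with `heapq`. The sketch assumes each POI carries a stable identifier used for deduplication and a numeric priority where higher means more urgent. The real system may deduplicate spatially instead, so treat the identifier scheme and the eviction policy as assumptions.

```python
import heapq
import itertools


class POIQueue:
    """Bounded max-priority queue that rejects duplicate POI ids."""

    def __init__(self, max_size=32):
        self.max_size = max_size
        self._heap = []  # entries: (-priority, insertion order, poi_id)
        self._ids = set()
        self._order = itertools.count()

    def push(self, poi_id, priority):
        """Queue a POI; return False if it is a duplicate or ranks too low."""
        if poi_id in self._ids:
            return False
        entry = (-priority, next(self._order), poi_id)
        if len(self._heap) >= self.max_size:
            worst = max(self._heap)  # lowest priority, newest on ties
            if entry > worst:
                return False  # new POI ranks below everything queued
            self._heap.remove(worst)
            self._ids.discard(worst[2])
            heapq.heapify(self._heap)
        heapq.heappush(self._heap, entry)
        self._ids.add(poi_id)
        return True

    def pop(self):
        """Return the highest-priority POI id, or None when empty."""
        if not self._heap:
            return None
        _, _, poi_id = heapq.heappop(self._heap)
        self._ids.discard(poi_id)
        return poi_id
```

Ties are broken by insertion order, so equally ranked POIs are investigated first-in, first-out.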

**Dependencies**:
- Epic dependencies: Bootstrap, Tier1Detector, Tier2SpatialAnalyzer, VLMClient, GimbalDriver, OutputManager
- External: py_trees 2.4.0, FastAPI

**Effort Estimation**: L / 8-13 points

**Acceptance Criteria**:

| # | Criterion | Measurable Condition |
|---|-----------|---------------------|
| 1 | L1→L2 transition ≤2s | PT-01 passes |
| 2 | Full L1→L2→L1 cycle works | AT-03 passes |
| 3 | POI queue orders by priority | IT-09 passes |
| 4 | HealthGuard degrades gracefully | IT-11 passes |
| 5 | Coexists with YOLO (≤5% FPS reduction) | PT-02 passes |

**Risks**:

| # | Risk | Mitigation |
|---|------|------------|
| 1 | R06: Config complexity → runtime errors | Validation at startup; skip invalid scenarios |
| 2 | Single-threaded BT bottleneck | Leaf nodes delegate to optimized C/TRT backends |

**Labels**: `component:scan-controller`, `type:orchestration`

**Child Issues**:

| Type | Title | Points |
|------|-------|--------|
| Task | Implement BT skeleton: Root, HealthGuard, L1Sweep, L2Investigation, Idle | 5 |
| Task | Implement EvaluatePOI with scenario-aware matching + cluster aggregation | 3 |
| Task | Implement PathFollowSubtree (TraceMask, PIDFollow, WaypointAnalysis) | 5 |
| Task | Implement ClusterFollowSubtree (TraceCluster, VisitLoop, ClassifyWaypoint) | 3 |
| Task | Implement AreaSweepSubtree and ZoomClassifySubtree | 3 |
| Task | Implement POI queue (priority, deduplication, max size) | 2 |
| Task | Implement health API endpoint + capability flags | 2 |
| Task | Write unit + integration tests for ScanController | 3 |

---

## Epic 8: Integration Tests — End-to-End System Testing

**Summary**: Implement the black-box integration test suite defined in `_docs/02_plans/integration_tests/`.

**Problem / Context**: The system must be validated end-to-end using Docker-based tests that treat the semantic detection module as a black box, verifying all acceptance criteria and cross-component interactions.

**Scope**:

In Scope:
- Functional integration tests (all scenarios from integration_tests/functional_tests.md)
- Non-functional tests (performance, resilience)
- Docker test environment setup
- Test data management
- CI integration

Out of Scope:
- Unit tests (covered in component epics)
- Field testing on real hardware

**Dependencies**:
- Epic dependencies: ScanController (all components integrated)
- External: Docker, test data set

**Effort Estimation**: M / 5-8 points

**Acceptance Criteria**:

| # | Criterion | Measurable Condition |
|---|-----------|---------------------|
| 1 | All functional test scenarios pass | Green CI for functional suite |
| 2 | ≥76% AC coverage in traceability matrix | Coverage report |
| 3 | Docker test env boots with `docker compose up` | Setup documented and reproducible |

**Risks**:

| # | Risk | Mitigation |
|---|------|------------|
| 1 | Test data availability (annotated imagery) | Use synthetic data for CI; real data for acceptance |
| 2 | Mock services diverge from real behavior | Keep mocks minimal; integration tests catch drift |

**Labels**: `component:integration-tests`, `type:testing`

**Child Issues**:

| Type | Title | Points |
|------|-------|--------|
| Task | Docker test environment + compose setup | 3 |
| Task | Implement functional integration tests (positive scenarios) | 5 |
| Task | Implement functional integration tests (negative/edge scenarios) | 3 |
| Task | Implement non-functional tests (performance, resilience) | 3 |
| Task | CI integration and test reporting | 2 |

---

## Summary

| # | Jira ID | Epic | T-shirt | Points Range | Dependencies |
|---|---------|------|---------|-------------|-------------|
| 1 | AZ-130 | Bootstrap & Initial Structure | M | 5-8 | None |
| 2 | AZ-131 | Tier1Detector | M | 5-8 | AZ-130 |
| 3 | AZ-132 | Tier2SpatialAnalyzer | M | 5-8 | AZ-130 |
| 4 | AZ-133 | VLMClient | S | 3-5 | AZ-130 |
| 5 | AZ-134 | GimbalDriver | L | 8-13 | AZ-130 |
| 6 | AZ-135 | OutputManager | S | 3-5 | AZ-130 |
| 7 | AZ-136 | ScanController | L | 8-13 | AZ-130–AZ-135 |
| 8 | AZ-137 | Integration Tests | M | 5-8 | AZ-136 |
| **Total** | | | | **42-68** | |