mirror of https://github.com/azaion/detections-semantic.git
synced 2026-04-23 01:46:38 +00:00
# CI/CD Pipeline

## Pipeline Overview

| Stage | Trigger | Runner | Duration | Gate |
|-------|---------|--------|----------|------|
| Lint + Unit Tests | PR to dev | x86 cloud | ~4 min | Block merge |
| Build + E2E Tests | PR to dev, nightly | x86 cloud | ~15 min | Block merge |
| Build (Jetson) | Merge to dev | Jetson self-hosted OR cross-compile | ~15 min | Block deploy |
| Package | Manual trigger | x86 cloud | ~5 min | Block deploy |
## Stage Details

### 1. Lint + Unit Tests

- Python: `ruff check` + `ruff format --check`
- Cython: `cython-lint` on `.pyx` files
- pytest on Python modules (path tracing, freshness heuristic, config parsing, POI queue, detection logger)
- No GPU required (mocked inference)
- Coverage threshold: 70%
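The stage above can be sketched as CI config. This doc does not name the CI system; Azure Pipelines is a guess based on the registry note in the Secrets section, and `requirements-dev.txt` is a hypothetical file for the lint/test tooling:

```yaml
# azure-pipelines snippet (illustrative sketch, Stage 1 only)
pr:
  branches:
    include: [dev]
pool:
  vmImage: ubuntu-latest   # x86 cloud runner
steps:
  - script: pip install -r requirements.txt -r requirements-dev.txt
    displayName: Install dependencies
  - script: ruff check . && ruff format --check .
    displayName: Lint (Python)
  - script: cython-lint src/**/*.pyx
    displayName: Lint (Cython)
  - script: pytest --cov=src --cov-fail-under=70 --junitxml=junit.xml
    displayName: Unit tests with 70% coverage gate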
### 2. Build + E2E Tests

- `docker build` for semantic-detection (x86 target)
- `docker compose -f docker-compose.test.yaml up --abort-on-container-exit`
- Runs all FT-P-*, FT-N-*, non-HIL NFT tests
- JUnit XML report artifact
- Timeout: 10 minutes
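A plausible shape for `docker-compose.test.yaml`, reusing the mock services from the dev compose file below; the `test-runner` service and paths are assumptions:

```yaml
# docker-compose.test.yaml (illustrative sketch)
services:
  semantic-detection:
    build: .
    environment:
      - ENV=test
      - GIMBAL_MODE=mock_tcp
      - INFERENCE_ENGINE=onnxruntime
    depends_on: [mock-gimbal, vlm-stub]

  test-runner:
    build: ./tests
    command: pytest tests/e2e --junitxml=/reports/junit.xml
    volumes:
      - ./reports:/reports

  mock-gimbal:
    build: ./tests/mock_gimbal

  vlm-stub:
    build: ./tests/vlm_stub
```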
### 3. Build (Jetson)

- Cross-compile for aarch64 OR build on a self-hosted Jetson runner
- TRT engine export not part of CI (engines pre-built, stored as artifacts)
- Docker image tagged with the git SHA
### 4. Package

- Build final Docker images for Jetson (aarch64)
- Export as a tar archive for USB-based field deployment
- Include: Docker images, TRT engines, config files, update script
- Output: `semantic-detection-{version}-jetson.tar.gz`
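The packaging step amounts to bundling a staging directory into the named archive. A minimal sketch (the flat file layout inside the archive is an assumption; the real update script defines the authoritative layout):

```python
import tarfile
from pathlib import Path


def build_package(version: str, staging: Path, out_dir: Path) -> Path:
    """Bundle images, engines, config, and the update script into one archive.

    `staging` is expected to hold the files produced by earlier stages:
    docker image tars, *.engine files, config.yaml, update.sh.
    """
    out = out_dir / f"semantic-detection-{version}-jetson.tar.gz"
    with tarfile.open(out, "w:gz") as tar:
        for item in sorted(staging.iterdir()):
            tar.add(item, arcname=item.name)
    return out
```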
## HIL Testing (not a CI stage)

Hardware-in-the-loop tests run manually on a physical Jetson Orin Nano Super:

- Latency benchmarks (NFT-PERF-01)
- Memory/thermal endurance (NFT-RES-LIM-01, NFT-RES-LIM-02)
- Cold start (NFT-RES-LIM-04)
- Results documented but do not gate deployment
## Caching

| Cache | Key | Contents |
|-------|-----|----------|
| pip | requirements.txt hash | Python dependencies |
| Docker layers | Dockerfile hash | Base image + system deps |
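Cache keys of this kind are just content hashes over the dependency file, so any edit invalidates the cache and identical contents reuse it. A minimal sketch:

```python
import hashlib
from pathlib import Path


def cache_key(lockfile: Path, prefix: str = "pip") -> str:
    """Derive a cache key from a dependency file's contents.

    Any change to requirements.txt yields a new key (forcing a rebuild);
    identical contents yield the same key (cache hit).
    """
    digest = hashlib.sha256(lockfile.read_bytes()).hexdigest()[:16]
    return f"{prefix}-{digest}"
```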
## Artifacts

| Artifact | Stage | Retention |
|----------|-------|-----------|
| JUnit XML test report | Build + E2E | 30 days |
| Docker images (Jetson) | Build (Jetson) | 90 days |
| Deployment package (.tar.gz) | Package | Permanent |
## Secrets

None needed — air-gapped system. Docker registry is internal (Azure DevOps Artifacts or local).
# Containerization Plan

## Container Architecture

| Container | Base Image | Purpose | GPU Access |
|-----------|------------|---------|------------|
| semantic-detection | nvcr.io/nvidia/l4t-tensorrt:r36.x (JetPack 6.2) | Main detection service (Cython + TRT + scan controller + gimbal + recorder) | Yes (TRT inference) |
| vlm-service | dustynv/nanollm:r36 (NanoLLM for JetPack 6) | VLM inference (VILA1.5-3B, 4-bit MLC) | Yes (GPU inference) |
## Dockerfile: semantic-detection

```dockerfile
# Outline — not runnable, for planning purposes
FROM nvcr.io/nvidia/l4t-tensorrt:r36.x

# System dependencies
RUN apt-get update && apt-get install -y python3.11 python3-pip libopencv-dev

# Python dependencies
COPY requirements.txt .
RUN pip3 install -r requirements.txt  # pyserial, crcmod, scikit-image, pyyaml

# Cython build
COPY src/ /app/src/
RUN cd /app/src && python3 setup.py build_ext --inplace

# Config and models mounted as volumes
VOLUME ["/models", "/etc/semantic-detection", "/data/output"]

ENTRYPOINT ["python3", "/app/src/main.py"]
```
## Dockerfile: vlm-service

Uses the NanoLLM pre-built Docker image. No custom Dockerfile needed — configuration via environment variables and volume mounts.

```yaml
# docker-compose snippet
vlm-service:
  image: dustynv/nanollm:r36
  runtime: nvidia
  environment:
    - MODEL=VILA1.5-3B
    - QUANTIZATION=w4a16
  volumes:
    - vlm-models:/models
    - vlm-socket:/tmp
  ipc: host
  shm_size: 8g
```
## Volume Strategy

| Volume | Mount Point | Contents | Persistence |
|--------|-------------|----------|-------------|
| models | /models | TRT FP16 engines (yoloe-11s-seg.engine, yoloe-26s-seg.engine, mobilenetv3.engine) | Persistent on NVMe |
| config | /etc/semantic-detection | config.yaml, class definitions | Persistent on NVMe |
| output | /data/output | Detection logs, recorded frames, gimbal logs | Persistent on NVMe (circular buffer) |
| vlm-models | /models (vlm-service) | VILA1.5-3B MLC weights | Persistent on NVMe |
| vlm-socket | /tmp (both containers) | Unix domain socket for IPC | Ephemeral |
## GPU Sharing

Both containers share the same GPU. Sequential scheduling is enforced at the application level:

- During Level 1: only semantic-detection uses the GPU (YOLOE inference)
- During Level 2 Tier 3: semantic-detection pauses YOLOE, vlm-service runs VLM inference
- `--runtime=nvidia` on both containers, but application logic prevents concurrent GPU access
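One way to sketch that hand-off logic (class and method names are illustrative, not the actual code):

```python
import threading


class GpuGate:
    """Cooperative gate so only one workload runs GPU inference at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.owner = None

    def acquire(self, who: str) -> None:
        self._lock.acquire()  # blocks until the other workload releases
        self.owner = who

    def release(self) -> None:
        self.owner = None
        self._lock.release()


gate = GpuGate()
gate.acquire("yoloe")   # Level 1: YOLOE owns the GPU
gate.release()          # pause YOLOE before Tier 3
gate.acquire("vlm")     # vlm-service runs VLM inference
gate.release()
gate.acquire("yoloe")   # resume Level 1 scanning
gate.release()
```

In reality the hand-off crosses a container boundary, so the "lock" would be a protocol message over the shared Unix socket in `/tmp` rather than an in-process threading primitive; the sequencing is the same.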
## Resource Limits

| Container | Memory Limit | CPU Limit | GPU |
|-----------|--------------|-----------|-----|
| semantic-detection | 4GB | No limit (all 6 cores available) | Shared |
| vlm-service | 4GB | No limit | Shared |

Note: Limits are soft — shared LPDDR5 means actual allocation is dynamic. Application-level monitoring (HealthMonitor) tracks actual usage.
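Expressed in compose terms, this is just a memory cap per service (a sketch; whether limits belong under `mem_limit` or `deploy.resources` depends on how compose is invoked):

```yaml
# docker-compose snippet: soft memory caps only, no CPU limit
semantic-detection:
  mem_limit: 4g
vlm-service:
  mem_limit: 4g
```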
## Development Environment

```yaml
# docker-compose.dev.yaml
services:
  semantic-detection:
    build: .
    environment:
      - ENV=development
      - GIMBAL_MODE=mock_tcp
      - INFERENCE_ENGINE=onnxruntime
    volumes:
      - ./src:/app/src
      - ./config/config.dev.yaml:/etc/semantic-detection/config.yaml
    ports:
      - "8080:8080"

  vlm-stub:
    build: ./tests/vlm_stub
    volumes:
      - vlm-socket:/tmp

  mock-gimbal:
    build: ./tests/mock_gimbal
    ports:
      - "9090:9090"

volumes:
  vlm-socket:
```
# Deployment Procedures

## Pre-Deployment Checklist

- [ ] All CI tests pass (lint, unit, E2E)
- [ ] Docker images built for aarch64 (Jetson)
- [ ] TRT engines exported on the matching JetPack version
- [ ] Config file updated if needed
- [ ] USB drive prepared with: images, engines, config, update.sh
## Standard Deployment

```
1. Connect USB to Jetson
2. Run: sudo /mnt/usb/update.sh
   - Stops running containers
   - docker load < semantic-detection-{version}.tar
   - docker load < vlm-service-{version}.tar (if VLM update)
   - Copies config + engines to /models/ and /etc/semantic-detection/
   - Restarts containers
3. Wait 60s for cold start
4. Verify: curl http://localhost:8080/api/v1/health → 200 OK
5. Remove USB
```
## Model-Only Update

```
1. Stop the semantic-detection container
2. Copy the new .engine file to /models/
3. Update the config.yaml engine path if the filename changed
4. Restart the container
5. Verify health
```
## Health Checks

| Check | Method | Expected | Timeout |
|-------|--------|----------|---------|
| Service alive | GET /api/v1/health | 200 OK | 5s |
| Tier 1 loaded | health response: tier1_ready=true | true | 30s |
| Gimbal connected | health response: gimbal_alive=true | true | 10s |
| First detection | POST test frame | Non-empty result | 60s |
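The first three gates can be evaluated as a pure check over the health payload. A sketch: the field names come from the table, but the overall shape of the JSON is an assumption:

```python
def health_ok(status_code: int, payload: dict) -> tuple[bool, list[str]]:
    """Evaluate a /api/v1/health response against the deployment gates."""
    failures = []
    if status_code != 200:
        failures.append(f"service alive: got HTTP {status_code}")
    if not payload.get("tier1_ready"):
        failures.append("tier1_ready is not true")
    if not payload.get("gimbal_alive"):
        failures.append("gimbal_alive is not true")
    return (not failures, failures)
```

The fourth gate (POST a test frame, expect a non-empty result) needs the running service and is not captured here.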
## Recovery

If deployment fails or the system is unresponsive:

1. Try restarting containers: `docker compose restart`
2. If still failing: re-deploy from the previous known-good USB package
3. Last resort: re-flash the Jetson with SDK Manager and deploy from scratch (~30 min)
# Environment Strategy

## Environments

| Environment | Purpose | Hardware | Inference | Gimbal |
|-------------|---------|----------|-----------|--------|
| Development | Local dev + tests | x86 workstation with NVIDIA GPU | ONNX Runtime or TRT on dev GPU | Mock (TCP socket) |
| Production | Field deployment on UAV | Jetson Orin Nano Super (ruggedized) | TensorRT FP16 | Real ViewPro A40 |

CI testing uses the Development environment with Docker (everything mocked).
HIL testing uses the Production environment on a bench Jetson.
## Configuration

A single YAML config file per environment:

- `config.dev.yaml` — mock gimbal, console logging, ONNX Runtime
- `config.prod.yaml` — real gimbal, NVMe logging, thermal monitoring

Config is a file on disk, not environment variables. It is updated via USB in production.
## Environment-Specific Overrides

| Config Key | Development | Production |
|------------|-------------|------------|
| inference_engine | onnxruntime | tensorrt_fp16 |
| gimbal_mode | mock_tcp | real_uart |
| vlm_enabled | false (or stub) | true |
| recording_enabled | false | true |
| thermal_monitor | false | true |
| log_output | console | file (NVMe) |
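A `config.dev.yaml` consistent with the table might look like this (only the keys above are grounded; flat nesting is an assumption):

```yaml
# config.dev.yaml (illustrative sketch)
inference_engine: onnxruntime
gimbal_mode: mock_tcp
vlm_enabled: false
recording_enabled: false
thermal_monitor: false
log_output: console
```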
## Field Update Procedure

1. Prepare USB drive: Docker image tar, TRT engines (if changed), config.yaml, update.sh
2. Connect USB to Jetson
3. Run `update.sh`: stops services → loads new images → copies files → restarts
4. Verify: health endpoint returns 200, test frame produces detection
5. Remove USB

If the update fails: re-run with the previous package from the USB, or re-flash the Jetson from a known-good image.
# Observability

## Logging

### Detection Log

| Field | Type | Description |
|-------|------|-------------|
| ts | ISO 8601 | Detection timestamp |
| frame_id | uint64 | Source frame |
| gps_denied_lat | float64 | GPS-denied latitude |
| gps_denied_lon | float64 | GPS-denied longitude |
| tier | uint8 | Tier that produced the detection |
| class | string | Detection class label |
| confidence | float32 | Detection confidence |
| bbox | float32[4] | centerX, centerY, width, height (normalized) |
| freshness | string | Freshness tag (footpaths only) |
| tier2_result | string | Tier 2 classification |
| tier2_confidence | float32 | Tier 2 confidence |
| tier3_used | bool | Whether the VLM was invoked |
| thumbnail_path | string | Path to the ROI thumbnail |

**Format**: JSON-lines, append-only
**Location**: `/data/output/detections.jsonl`
**Rotation**: None (circular buffer at the filesystem level for L1 frames)
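Appending to a JSON-lines log is one write per event. A minimal sketch; only the field names come from the table, the sample values are invented:

```python
import json
from datetime import datetime, timezone


def log_detection(path: str, record: dict) -> None:
    """Append one detection as a single JSON line (append-only, no rotation)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


record = {
    "ts": datetime.now(timezone.utc).isoformat(),
    "frame_id": 1042,
    "tier": 1,
    "class": "footpath",
    "confidence": 0.87,
    "bbox": [0.52, 0.31, 0.08, 0.04],  # centerX, centerY, width, height (normalized)
    "tier3_used": False,
}
```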
### Gimbal Command Log

**Format**: Text, one line per command (timestamp, command type, target angles, CRC status, retry count)
**Location**: `/data/output/gimbal.log`
### System Health Log

**Format**: JSON-lines, one entry per second
**Fields**: timestamp, t_junction, power_watts, gpu_mem_mb, cpu_mem_mb, degradation_level, gimbal_alive, semantic_alive, vlm_alive, nvme_free_pct
**Location**: `/data/output/health.jsonl`
### Application Error Log

**Format**: Text with severity levels (ERROR, WARN, INFO)
**Location**: `/data/output/app.log`
**Content**: Exceptions, timeouts, CRC failures, frame skips, VLM errors
## Metrics (In-Memory)

No external metrics service (air-gapped). Metrics are computed in-memory and exposed via the health API endpoint:

| Metric | Type | Description |
|--------|------|-------------|
| frames_processed_total | Counter | Total frames through Tier 1 |
| frames_skipped_quality | Counter | Frames rejected by the quality gate |
| detections_total | Counter | Total detections produced (all tiers) |
| tier1_latency_ms | Histogram | Tier 1 inference time |
| tier2_latency_ms | Histogram | Tier 2 processing time |
| tier3_latency_ms | Histogram | Tier 3 VLM time |
| poi_queue_depth | Gauge | Current POI queue size |
| degradation_level | Gauge | Current degradation level |
| t_junction_celsius | Gauge | Current junction temperature |
| power_draw_watts | Gauge | Current power draw |
| gpu_memory_used_mb | Gauge | Current GPU memory |
| gimbal_crc_failures | Counter | Total CRC failures on UART |
| vlm_crashes | Counter | VLM process crash count |

**Exposed via**: GET /api/v1/health (JSON response with all metrics)
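A minimal in-memory registry covering the three metric kinds above, snapshotted for the health endpoint (class and method names are illustrative, not the actual code):

```python
from collections import defaultdict


class Metrics:
    """In-memory counters, gauges, and simple latency histograms."""

    def __init__(self) -> None:
        self.counters = defaultdict(int)
        self.gauges = {}
        self.samples = defaultdict(list)  # raw latency samples per histogram

    def inc(self, name: str, by: int = 1) -> None:
        self.counters[name] += by

    def set(self, name: str, value: float) -> None:
        self.gauges[name] = value

    def observe(self, name: str, ms: float) -> None:
        self.samples[name].append(ms)

    def snapshot(self) -> dict:
        """Flatten everything into one dict for the /api/v1/health response."""
        hists = {
            name: {"count": len(v), "mean_ms": sum(v) / len(v), "max_ms": max(v)}
            for name, v in self.samples.items() if v
        }
        return {**self.counters, **self.gauges, **hists}
```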
## Alerting

No external alerting system. Alerts are:

1. Degradation level changes → logged to the health log + detection log
2. Critical events (VLM crash, gimbal loss, thermal critical) → logged with severity ERROR
3. Operator display shows the current degradation level as a status indicator
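Item 1 is edge-triggered: an alert entry is written only when the level changes, not on every one-second health sample. A sketch (the function name is illustrative):

```python
def degradation_alerts(levels):
    """Yield (previous, new) pairs whenever the degradation level changes."""
    previous = None
    for level in levels:
        if previous is not None and level != previous:
            yield (previous, level)
        previous = level
```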
## Post-Flight Analysis

After landing, NVMe data is extracted via USB for offline analysis:

- `detections.jsonl` → import into annotation tool for TP/FP labeling
- `frames/` → source material for training dataset expansion
- `health.jsonl` → thermal/power profile for hardware optimization
- `gimbal.log` → PID tuning analysis
- `app.log` → debugging and issue diagnosis
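A first-pass triage of `detections.jsonl` before annotation can be a few lines, e.g. a per-class detection count (hypothetical helper, not part of the shipped tooling):

```python
import json
from collections import Counter


def summarize_detections(path: str) -> Counter:
    """Count detections per class from a JSON-lines detection log."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                counts[json.loads(line)["class"]] += 1
    return counts
```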