[AZ-626] Decompose complete: 47 tasks + docs + module layout

Greenfield Steps 1-6 baseline for the autopilot rewrite from legacy
Qt/C++ to a Rust workspace.

- Remove legacy Qt/C++ tree (ai_controller, drone_controller,
  misc/camera, python_scaffold, root Dockerfile, autopilot.pro,
  legacy main.py / requirements.txt).
- Add _docs/00_problem (problem, restrictions, acceptance criteria,
  security approach, input data + fixtures).
- Add _docs/01_solution/solution_draft01.
- Add _docs/02_document (architecture, system-flows, data_model,
  glossary, decision-rationale, deployment, 13 component descriptions,
  tests/ specs, FINAL_report, module-layout).
- Add _docs/02_tasks/todo with 47 task specs (AZ-640..AZ-686, one
  bootstrap + 46 component tasks) and _dependencies_table.md.
- Add .cursor/rules/artifact-srp.mdc (single-responsibility rule for
  canonical _docs artifacts).
- Track autodev state in _docs/_autodev_state.md (Step 6 completed,
  ready for Step 7 Implement).

Jira: bootstrap AZ-626; component epics AZ-627..AZ-639; tasks
AZ-640..AZ-686. Total complexity 173 points across 12 epics.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-19 11:02:01 +03:00
parent f7d6cb4a3a
commit bc40ea7300
235 changed files with 12585 additions and 15097 deletions
@@ -0,0 +1,90 @@
# CI / CD Pipeline
**Status**: forward-looking design (Rust). Final pipeline file lands during build-system bring-up. The shape below describes the intent.
## 1. Goals
The pipeline must:
- Build the autopilot Rust binary cross-compiled for `aarch64-unknown-linux-gnu`.
- Run the full Rust test suite (unit + integration + replay-based) on every commit.
- Run a software-in-the-loop conformance gate against an ArduPilot SITL instance (covers `mavlink_layer` + `mission_executor`).
- Run a benchmark gate on representative target hardware (covers Tier 1 / Tier 2 / VLM / gimbal latency budgets — see `architecture.md §7.6 Benchmark gate`).
- Sign and publish artefacts (binary + container image) on tagged releases.
- Never auto-deploy to the airframe. Deployment is a human-driven operation tied to the suite's flight-gate convention (`/run/azaion/in-flight`).
## 2. Pipeline stages
Single Woodpecker pipeline, multi-stage. Stages run sequentially; a failed stage stops the run.
| Stage | Purpose | Notes |
|---|---|---|
| **fetch** | Clone, restore Cargo cache | `cargo fetch` with a remote cache key. |
| **lint** | `cargo fmt --check`, `cargo clippy --all-targets --all-features -- -D warnings` | Hard fail on any warning. |
| **unit-test** | `cargo test --workspace` (host-arch) | Most logic is platform-independent; the tests themselves run in parallel on the host. |
| **build-arm64** | Cross-compile for `aarch64-unknown-linux-gnu` | `cross` or `cargo zigbuild` depending on Rust toolchain. Produces the production binary + a debug symbol artefact. |
| **integration-test** | Replay-based integration tests under emulation | Fixtures: pre-recorded RTSP clip, MAVLink replay, synthetic telemetry. No hardware required. |
| **sitl-conformance** | ArduPilot SITL conformance gate | Spins up ArduPilot SITL + autopilot binary in a container; runs a fixed mission script; asserts MAVLink command surface (per `architecture.md §7.7`) and geofence enforcement. |
| **benchmark-gate** *(opt-in, manual / nightly)* | Tier 1 / 2 / VLM / gimbal latency on real Jetson | Runs on a self-hosted Jetson Orin Nano runner. Asserts `architecture.md §6 NFR` budgets. Slow; not on every PR. |
| **package** | Build container image (Option B from `containerization.md`) | Arch-suffixed tag: `azaion/autopilot:<branch>-arm64`. |
| **sign** | Sign binary + image | Cosign for the image; OS-vendor signing flow for the binary if used in native deployment. |
| **publish** | Push image + binary to internal registry | Tagged builds only. |
## 3. Artefacts
| Artefact | Where | Retention |
|---|---|---|
| `autopilot` binary (aarch64) | internal artefact store | last 10 builds per branch; tagged builds kept indefinitely |
| Debug symbols (`.dwp`) | internal artefact store, separate path | matched to binary lifetime |
| Container image | internal Docker registry | last 10 dev builds; tagged builds kept indefinitely |
| Cosign signature | next to image | matched to image lifetime |
| Test logs | CI run | per Woodpecker retention |
| Benchmark gate report | internal artefact store (Markdown + JSON) | per-tag retention |
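The retention rule for binaries and images ("last 10 builds per branch; tagged builds kept indefinitely") can be sketched as a small pruning function. This is an illustration only: the `Build` record and the `prunable` helper are hypothetical, and it assumes that tagged builds do not count towards the 10-build window, which the table does not state.

```rust
/// One published build on a single branch (hypothetical record shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Build {
    pub id: u32,      // monotonically increasing build number
    pub tagged: bool, // true for `v*` release builds
}

/// Returns the ids of builds that may be pruned: every untagged build
/// older than the newest `keep` untagged builds. Tagged builds are never
/// returned. Assumption: tagged builds do not occupy a slot in the window.
pub fn prunable(builds: &[Build], keep: usize) -> Vec<u32> {
    let mut untagged: Vec<u32> = builds
        .iter()
        .filter(|b| !b.tagged)
        .map(|b| b.id)
        .collect();
    untagged.sort_unstable_by(|a, b| b.cmp(a)); // newest first
    untagged.into_iter().skip(keep).collect()
}
```

With twelve builds of which one is tagged, `prunable(&builds, 10)` returns only the oldest untagged build.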
## 4. Build matrix
Single matrix entry today:
| Toolchain | Target | Tier-1 dep | VLM feature |
|---|---|---|---|
| Rust stable | `aarch64-unknown-linux-gnu` | `../detections` (Cython service consumed via gRPC; not built here) | `cargo build --features vlm` and plain `cargo build` (both must build) |
The `--features vlm` and the no-feature path are both built and tested to enforce the optionality contract from `architecture.md §7.6 Local VLM confirmation`.
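A minimal sketch of how the optionality contract might look in code, assuming the Cargo feature is named `vlm` (as in the matrix above) and that the runtime flag arrives as a plain `bool`. The `VlmOutcome` type and `vlm_outcome` function are hypothetical names, not taken from `architecture.md`.

```rust
/// Outcome of the optional VLM confirmation step for one POI.
#[derive(Debug, PartialEq)]
pub enum VlmOutcome {
    /// Built without the `vlm` feature, or `vlm_enabled` is false at runtime.
    Disabled,
    /// The VLM path is compiled in and enabled; a request would be issued here.
    Requested,
}

/// Combines the build-time feature with the runtime flag.
/// `cfg!(feature = "vlm")` is resolved at compile time, so a build without
/// the feature always takes the `Disabled` branch.
pub fn vlm_outcome(vlm_enabled: bool) -> VlmOutcome {
    if cfg!(feature = "vlm") && vlm_enabled {
        VlmOutcome::Requested
    } else {
        VlmOutcome::Disabled
    }
}
```

Because both branches type-check in either configuration, the same source compiles with and without the feature, which is what the build matrix is meant to enforce.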
## 5. SITL conformance gate (in detail)
Stage runs in CI; produces a pass/fail signal that gates merge to `dev`.
**Setup:**
1. Start ArduPilot SITL in a container, listening on `udp://0.0.0.0:14550`.
2. Start autopilot binary configured for SITL endpoint.
3. Pre-load a fixture mission via the missions API mock (`mission_client` HTTP target).
4. Pre-load a fixture RTSP source (looped clip).
5. Mock the `../detections` service with deterministic detections.
**Assertions:**
- All MAVLink message kinds in `architecture.md §7.7` succeed at least once.
- Mission upload + start completes within the configured retry budget.
- INCLUSION geofence violation triggers RTL.
- EXCLUSION geofence violation triggers RTL (regression gate against the earlier silent-ignore behaviour).
- Middle-waypoint POST + re-upload completes within 2 s.
- Health endpoint returns `green` once steady state is reached.
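The two geofence assertions above can be illustrated with a small check that treats leaving an INCLUSION fence and entering an EXCLUSION fence as violations of equal weight. This is a sketch under stated assumptions: fences are simple polygons in a local planar frame (metres), and `Fence`, `FenceKind`, and `violation` are hypothetical names rather than the `mission_executor` API.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FenceKind {
    Inclusion,
    Exclusion,
}

pub struct Fence {
    pub kind: FenceKind,
    pub vertices: Vec<(f64, f64)>, // polygon vertices, local planar frame
}

/// Even-odd ray casting: true when `p` lies inside the polygon.
fn contains(poly: &[(f64, f64)], p: (f64, f64)) -> bool {
    if poly.len() < 3 {
        return false;
    }
    let mut inside = false;
    let mut j = poly.len() - 1;
    for i in 0..poly.len() {
        let (xi, yi) = poly[i];
        let (xj, yj) = poly[j];
        // Toggle when the edge (j, i) crosses the horizontal ray to the right of p.
        if (yi > p.1) != (yj > p.1) && p.0 < (xj - xi) * (p.1 - yi) / (yj - yi) + xi {
            inside = !inside;
        }
        j = i;
    }
    inside
}

/// Returns the kind of the first violated fence, if any.
/// Either kind is expected to end in RTL, so neither is silently ignored.
pub fn violation(fences: &[Fence], p: (f64, f64)) -> Option<FenceKind> {
    for f in fences {
        let inside = contains(&f.vertices, p);
        match f.kind {
            FenceKind::Inclusion if !inside => return Some(FenceKind::Inclusion),
            FenceKind::Exclusion if inside => return Some(FenceKind::Exclusion),
            _ => {}
        }
    }
    None
}
```

A conformance test would then drive the simulated vehicle to one point per case and assert that RTL follows for both `Some(Inclusion)` and `Some(Exclusion)`.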
## 6. Branch policy
| Branch | Triggers | Required gates |
|---|---|---|
| feature branches (PR) | on push | fetch → lint → unit-test → build-arm64 → integration-test → sitl-conformance |
| `dev` | on merge | all PR gates + package |
| tagged release (`v*`) | on tag | all `dev` gates + sign + publish + benchmark-gate (manual approval) |
`main` and `dev` are protected. Force-push is forbidden. Merges require a green pipeline.
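The branch policy table can be read as a function from trigger to required gates. The sketch below restates it in Rust purely for illustration; the real policy would live in the Woodpecker pipeline file, and the `Stage`, `Trigger`, and `required_gates` names are hypothetical. The returned order mirrors the table, not necessarily the execution order.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Stage {
    Fetch,
    Lint,
    UnitTest,
    BuildArm64,
    IntegrationTest,
    SitlConformance,
    Package,
    Sign,
    Publish,
    BenchmarkGate,
}

#[derive(Debug, Clone, Copy)]
pub enum Trigger {
    PullRequest,
    DevMerge,
    TaggedRelease,
}

/// Gates required for a trigger, as listed in the branch policy table.
pub fn required_gates(t: Trigger) -> Vec<Stage> {
    use Stage::*;
    // Every trigger runs the PR gates.
    let mut gates = vec![Fetch, Lint, UnitTest, BuildArm64, IntegrationTest, SitlConformance];
    if matches!(t, Trigger::DevMerge | Trigger::TaggedRelease) {
        gates.push(Package);
    }
    if matches!(t, Trigger::TaggedRelease) {
        // The benchmark gate additionally needs manual approval.
        gates.extend([Sign, Publish, BenchmarkGate]);
    }
    gates
}
```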
## 7. Out of scope here
- Airframe deployment automation (manual; tied to flight-gate).
- Ground Station and `../detections` pipelines (each owns its own).
- AI training pipeline — `../_docs/12_ai_training.md`.
- Model-sync to the airframe (`model-sync.service`, suite-level — `../_docs/00_top_level_architecture.md`).
@@ -0,0 +1,142 @@
# Containerisation
**Status**: forward-looking design (Rust). Final shape will surface during build-system bring-up; treat the choices below as the current intent, not commitments.
## 1. Deployment shape
`autopilot` is a single Rust binary. Two delivery options are considered:
| Option | Form | Pros | Cons |
|---|---|---|---|
| **A — native systemd unit** | bare binary deployed to `/usr/local/bin/autopilot` + a `.service` unit | minimum overhead on Jetson; closest to airframe constraints; trivial flight-gate integration | per-host installation discipline; less reproducible across nodes |
| **B — single container image** | `azaion/autopilot:<branch>-arm64` | consistent across environments; matches the suite's existing OTA model (Watchtower) | container runtime adds startup latency and one more failure surface on the airframe |
The current intent is **Option A** for the on-airframe deployment (lowest overhead, closest to the autopilot's real-time constraints), and **Option B** for development / CI / emulated-hardware testing (reproducibility wins). The same Rust binary is built once and packaged into both.
## 2. Target hardware
| Item | Value |
|---|---|
| Edge device | NVIDIA Jetson Orin Nano Super 8 GB |
| Architecture | aarch64 |
| OS | Ubuntu 22.04 (JetPack-bundled) — locked JetPack version + power mode |
| Camera | ViewPro A40 (RTSP + UDP control) |
| Autopilot | ArduPilot or PX4 over MAVLink v2 (UDP or serial) |
## 3. Native deployment (Option A — production)
**Layout:**
```text
/usr/local/bin/autopilot Rust binary
/etc/azaion/autopilot/config.toml runtime config
/etc/systemd/system/autopilot.service systemd unit
/var/lib/autopilot/ persistent state (mapobjects_store)
/run/azaion/in-flight flight-gate marker (per ../_docs/00_top_level_architecture.md)
```
**systemd unit highlights:**
- `Type=notify` — autopilot signals readiness once Tier 1, gimbal, and MAVLink links are healthy.
- `Restart=on-failure`, `RestartSec=2s`, `StartLimitBurst=5` — bounded restart (so a hard-broken binary doesn't loop forever).
- `MemoryMax=` — enforces the on-airframe memory budget (~6 GB; Tier-1 YOLO container holds ~2 GB).
- `LimitNOFILE`, `LimitNPROC` set explicitly.
- `ExecStartPre=/bin/sh -c 'mkdir -p /run/azaion && touch /run/azaion/in-flight'` — asserts the suite-wide flight-gate so `model-sync.service` does not pull a new model mid-flight.
- `ExecStopPost=/bin/rm -f /run/azaion/in-flight` — clears the flight-gate on shutdown.
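The flight-gate lines above amount to "create a marker file on start, remove it on stop, and let peers test for its existence". A minimal sketch of that protocol in Rust follows, for readers who want to see it outside shell. The function names are hypothetical, and the marker path is passed in rather than hard-coded so the logic can be exercised against a temporary directory.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Creates the marker and its parent directory, like the `ExecStartPre=`
/// line (`mkdir -p` + `touch`). An existing marker is left untouched.
pub fn assert_flight_gate(marker: &Path) -> io::Result<()> {
    if let Some(dir) = marker.parent() {
        fs::create_dir_all(dir)?;
    }
    fs::OpenOptions::new()
        .create(true)
        .append(true)
        .open(marker)
        .map(|_| ())
}

/// Removes the marker; a missing file is not an error (mirrors `rm -f`).
pub fn clear_flight_gate(marker: &Path) -> io::Result<()> {
    match fs::remove_file(marker) {
        Err(e) if e.kind() != io::ErrorKind::NotFound => Err(e),
        _ => Ok(()),
    }
}

/// What a peer such as `model-sync.service` would check before a model swap.
pub fn in_flight(marker: &Path) -> bool {
    marker.exists()
}
```

In production the marker would be `/run/azaion/in-flight`; a peer defers its work while `in_flight` returns `true`.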
**Runtime config** (`/etc/azaion/autopilot/config.toml`) is the single source for non-secret configuration: RTSP URL, gimbal endpoint, MAVLink connection URI, missions API endpoint, Ground Station endpoint, VLM IPC socket path, `vlm_enabled` flag, log level. Secrets (if any — TBD per `../_docs/02_missions.md` auth model) come from the systemd `EnvironmentFile=` pointing at a permission-restricted file.
## 4. Container image (Option B — dev / CI / emulation)
**Base image:** `nvcr.io/nvidia/l4t-base:<JetPack-pinned-tag>` for production-equivalent NVDEC + TensorRT plumbing; `ubuntu:22.04` for emulation (no GPU acceleration).
**Image layout:**
```text
/usr/local/bin/autopilot Rust binary (built outside the image)
/etc/azaion/autopilot/config.toml runtime config (mounted at runtime)
/var/lib/autopilot/ persistent state (volume-mounted)
```
**Image is non-root.** Default `USER` is `autopilot:autopilot`; `/var/lib/autopilot/` is owned by that user.
**Compose example** (development):
```yaml
services:
  autopilot:
    image: azaion/autopilot:dev-arm64
    restart: unless-stopped
    environment:
      AUTOPILOT_CONFIG: /etc/azaion/autopilot/config.toml
    volumes:
      - ./config/autopilot.toml:/etc/azaion/autopilot/config.toml:ro
      - autopilot-state:/var/lib/autopilot
      - /run/azaion:/run/azaion
    devices:
      - /dev/ttyUSB0:/dev/ttyUSB0 # MAVLink serial (if used)
    network_mode: host # RTSP / UDP gimbal / Ground Station modem all on host
volumes:
  autopilot-state: {}
```
`network_mode: host` is intentional on Jetson: RTSP, gimbal UDP, MAVLink UDP, and the modem-link to the Ground Station all share the airframe's network namespace.
## 5. External dependencies on the airframe
`autopilot` itself is the only autopilot-owned process. The on-airframe tier also runs (separately):
- **`../detections`** — Tier 1 YOLO service. Container delivered from its own pipeline. Bi-directional gRPC endpoint consumed by `detection_client`.
- **NanoLLM / VILA1.5-3B** (optional) — local IPC peer of `vlm_client`. Separate container or process; not embedded in the autopilot binary. Surfaces a Unix-domain socket; peer-credential check is mandatory when supported.
- **GPS-Denied service** — separate edge service, owned by `gps-denied-onboard`; consumed indirectly through the shared edge data path (per `../_docs/11_gps_denied.md`).
- **`model-sync.service`** — suite-wide rclone-driven model puller. Reads `/run/azaion/in-flight` to defer model swaps during flight (per `../_docs/00_top_level_architecture.md`).
## 6. Configuration surface
All configuration is declarative (`config.toml`); there is no compile-time configuration of endpoints, addresses, or feature switches **except** the `vlm_client` build-time feature flag (see `architecture.md §7.6 Local VLM confirmation > Optionality model`).
| Concern | Mechanism |
|---|---|
| RTSP / gimbal / MAVLink endpoints | `config.toml` |
| `missions` API endpoint + auth | `config.toml` (auth pulled from `EnvironmentFile=`) |
| Ground Station endpoint | `config.toml` |
| VLM IPC socket path | `config.toml` |
| `vlm_enabled` runtime flag | `config.toml` |
| `vlm_client` build-time feature | `--features vlm` passed to `cargo build` |
| Log level + format | `RUST_LOG` env (`tracing-subscriber` honours it) |
| Mission ID for the current flight | CLI arg (per-flight, not per-host) |
## 7. Health endpoint
`autopilot` exposes a single HTTP health endpoint (port and bind address from `config.toml`; default `127.0.0.1:8080`). It aggregates per-component readiness:
```json
{
  "status": "green | yellow | red",
  "components": {
    "frame_ingest": "green",
    "detection_client": "green",
    "movement_detector": "green",
    "semantic_analyzer": "green",
    "vlm_client": "disabled",
    "scan_controller": "green",
    "mapobjects_store": "green",
    "gimbal_controller": "green",
    "operator_bridge": "yellow",
    "mission_executor": "green",
    "mavlink_layer": "green",
    "mission_client": "green",
    "telemetry_stream": "green"
  },
  "last_state_change": "2026-05-17T12:00:00Z"
}
```
`yellow` is degraded-but-running; `red` is unrecoverable for at least one essential component. The aggregator surfaces details on each transition through `tracing` (see `observability.md`).
## 8. Out of scope here
- Provisioning the Jetson host itself (Ansible / Kickstart / disk imaging) — owned by airframe ops.
- Build pipeline (cross-compile, signing, registry push) — see `ci_cd_pipeline.md`.
- Observability stack (tracing exporter, log shipper, metrics scraper) — see `observability.md`.
- Mission delivery to the airframe — owned by `missions` API.
@@ -0,0 +1,142 @@
# Observability
**Status**: forward-looking design (Rust). Treat the choices below as the intended approach; the exact tracing exporter / metrics scraper / log-shipping target depend on the suite's overall observability stack at deploy time.
## 1. Posture
- **One binary, one process.** Per-component instrumentation is structured (each component listed in `architecture.md §3` is a `tracing` target).
- **Structured logs are primary.** Metrics are derived from log spans and counters; traces follow a frame's journey through the pipeline end to end.
- **No silent error swallowing.** Every failure path increments a counter, emits a span event, or both.
- **Health is aggregated**, not derived from logs. The HTTP health endpoint (`containerization.md §7`) is the source of truth for live readiness.
## 2. Logs
**Library**: `tracing` + `tracing-subscriber`.
**Format**: JSON to stdout. Captured by the host's journald (Option A) or by the container runtime (Option B), then shipped to the suite's log aggregator.
**Per-line fields:**
| Field | Source | Notes |
|---|---|---|
| `ts` | wall clock | ISO-8601 UTC. |
| `ts_mono_ns` | monotonic clock | For ordering across components without clock-skew artefacts. |
| `level` | `tracing` | `error \| warn \| info \| debug \| trace`. |
| `target` | component name | One of `frame_ingest`, `detection_client`, `movement_detector`, `semantic_analyzer`, `vlm_client`, `scan_controller`, `mapobjects_store`, `gimbal_controller`, `operator_bridge`, `mission_executor`, `mavlink_layer`, `mission_client`, `telemetry_stream`. |
| `frame_seq` | propagated context | Where applicable. Lets us reconstruct one frame's journey. |
| `poi_id`, `roi_id`, `target_id`, `mission_id`, `command_id` | propagated context | Where applicable. |
| `event` | message | Short, machine-friendly identifier (e.g., `frame.dropped`, `vlm.timeout`, `mission.geofence_violation`, `bit.check_failed`, `failsafe.lost_link`, `mapobjects.push_failed`, `operator.auth_rejected`). |
| `model_version` | propagated context | Version string for `tier1_model_version` and `vlm_model_version`. Required on every `vlm.response` and on every Tier-2 evidence span for forensic correlation. |
| `wall_clock_source` | telemetry frame | `gnss \| host \| coast`; emitted on every state-transition span and on every operator-command audit log line. |
| `reason` | message | Free-form for human readers. |
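To make the per-line shape concrete, the sketch below renders a subset of the fields as one JSON line using only the standard library. It is an illustration of the intended output, not of how `tracing-subscriber` produces it; the `LogLine` struct, `render`, and `json_escape` are hypothetical, and the optional context fields are reduced to `frame_seq`.

```rust
use std::fmt::Write;

/// Escapes a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => {
                let _ = write!(out, "\\u{:04x}", c as u32);
            }
            c => out.push(c),
        }
    }
    out
}

/// A subset of the per-line fields from the table above.
pub struct LogLine<'a> {
    pub ts: &'a str,
    pub ts_mono_ns: u64,
    pub level: &'a str,
    pub target: &'a str,
    pub event: &'a str,
    pub frame_seq: Option<u64>, // emitted only where applicable
    pub reason: &'a str,
}

/// Renders one log record as a single JSON line (no trailing newline).
pub fn render(l: &LogLine) -> String {
    let mut s = format!(
        "{{\"ts\":\"{}\",\"ts_mono_ns\":{},\"level\":\"{}\",\"target\":\"{}\",\"event\":\"{}\"",
        json_escape(l.ts),
        l.ts_mono_ns,
        json_escape(l.level),
        json_escape(l.target),
        json_escape(l.event)
    );
    if let Some(seq) = l.frame_seq {
        let _ = write!(s, ",\"frame_seq\":{}", seq);
    }
    let _ = write!(s, ",\"reason\":\"{}\"}}", json_escape(l.reason));
    s
}
```

Keeping `event` short and machine-friendly while `reason` stays free-form lets the aggregator index on the former without parsing the latter.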
**Log level defaults:**
- `info`: lifecycle (startup / shutdown / state transitions), all error and security events.
- `warn`: degraded-but-running events (yellow health, retries, drops).
- `error`: red health, hard failures, schema violations, security violations.
- `debug` / `trace`: off in production; enabled per-target via `RUST_LOG`.
**Always logged at `warn` or higher** (per `coderule.mdc`):
- Every exception path that the operator could care about.
- Authentication / authorisation failures (peer-cred check failures on VLM IPC, malformed Ground Station session, MAVLink-2 signing rejection).
- Geofence violations.
- Schema validation failures (Tier 1 response, VLM response, mission JSON).
## 3. Metrics
Derived from log spans + a small set of explicit counters. Exporter: Prometheus-compatible (per the suite's stack).
**Per-component counters** (illustrative — exact names finalised at implementation):
| Component | Counter | Type |
|---|---|---|
| `frame_ingest` | `frames_received_total`, `frames_dropped_total{reason}`, `decode_errors_total` | counter |
| `frame_ingest` | `decode_ms` | histogram |
| `detection_client` | `requests_total`, `errors_total{kind}`, `latency_ms` | counter / histogram |
| `movement_detector` | `candidates_total`, `telemetry_skew_drops_total` | counter |
| `semantic_analyzer` | `tier2_runs_total`, `tier2_latency_ms`, `tier2_oversize_total` | counter / histogram |
| `vlm_client` | `vlm_requests_total{status}`, `vlm_latency_ms` | counter / histogram |
| `scan_controller` | `state_transitions_total{from,to}`, `pois_in_queue`, `pois_per_min`, `tick_latency_ms` | counter / gauge / histogram |
| `mapobjects_store` | `classify_total{result}`, `ignored_items_total`, `removed_candidates_total` | counter |
| `gimbal_controller` | `commands_total`, `decision_to_movement_ms`, `zoom_transition_ms`, `vendor_faults_total` | counter / histogram |
| `mavlink_layer` | `messages_in_total{kind}`, `messages_out_total{kind}`, `command_acks_total{result}`, `parse_errors_total`, `link_state` | counter / gauge |
| `mission_executor` | `state_transitions_total{from,to}`, `mission_uploads_total{result}`, `geofence_violations_total{kind}` | counter |
| `mission_client` | `fetches_total{result}`, `middle_waypoint_posts_total{result}`, `mapobjects_pull_total{result}`, `mapobjects_push_total{result}`, `mapobjects_pull_bytes`, `mapobjects_push_bytes`, `mapobjects_sync_lag_s` | counter / gauge |
| `mission_executor` (BIT) | `bit_runs_total{result}`, `bit_check_failures_total{check}` | counter |
| `mission_executor` (failsafe) | `link_loss_events_total{trigger}`, `failsafe_action_total{action}` | counter |
| `operator_bridge` | `pois_surfaced_total`, `commands_received_total{kind,result}`, `decision_latency_ms`, `auth_rejections_total{reason}`, `command_e2e_ms` | counter / histogram |
| `telemetry_stream` | `bytes_out_total`, `frames_out_total`, `link_state`, `bandwidth_used_mbps` | counter / gauge |
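A labelled counter such as `frames_dropped_total{reason}` could be modelled as below. This is a std-only sketch, not a recommendation to avoid an existing metrics crate; `LabeledCounter` is a hypothetical type, and the renderer assumes label values contain no quotes, backslashes, or newlines (the Prometheus text format would require escaping them).

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// A counter family keyed by a single label, e.g. `frames_dropped_total{reason}`.
pub struct LabeledCounter {
    name: &'static str,
    label: &'static str,
    values: Mutex<BTreeMap<String, u64>>,
}

impl LabeledCounter {
    pub fn new(name: &'static str, label: &'static str) -> Self {
        Self { name, label, values: Mutex::new(BTreeMap::new()) }
    }

    /// Increments the series for one label value, creating it at 0 if absent.
    pub fn inc(&self, label_value: &str) {
        let mut v = self.values.lock().unwrap();
        *v.entry(label_value.to_string()).or_insert(0) += 1;
    }

    /// Renders the family in the Prometheus text exposition format,
    /// one sample line per label value, sorted by label value.
    pub fn render(&self) -> String {
        let v = self.values.lock().unwrap();
        let mut out = format!("# TYPE {} counter\n", self.name);
        for (k, n) in v.iter() {
            out.push_str(&format!("{}{{{}=\"{}\"}} {}\n", self.name, self.label, k, n));
        }
        out
    }
}
```

The `BTreeMap` keeps the output order stable across scrapes, which makes diffs and tests simpler.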
**Aggregated:**
- `health_state{component}` — 0 (red) / 1 (yellow) / 2 (green); enables alerting per-component.
- `process_uptime_seconds`, `process_resident_memory_bytes` — standard.
## 4. Traces
`tracing` spans cover the path of a single frame and the path of a single POI.
**Frame trace** (per `Frame`):
```text
frame_ingest.publish
detection_client.request
detection_client.response
movement_detector.tick
[movement_detector.emit_candidate]
telemetry_stream.push
```
**POI trace** (per `POI`):
```text
scan_controller.enqueue
scan_controller.dequeue
gimbal_controller.zoom
semantic_analyzer.tier2
[vlm_client.request -> vlm_client.response]
operator_bridge.surface
[operator_bridge.confirm | decline | timeout]
mission_executor.middle_waypoint # confirm path
mapobjects_store.append_ignored # decline path
```
Spans propagate via context across in-process channels. Trace export target depends on the suite's stack (OTLP / Jaeger / Tempo).
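One way to picture context propagation across an in-process channel is to wrap every payload in an envelope that carries the trace context. The sketch below uses `std::sync::mpsc` and plain strings in place of real `tracing` spans; `TraceCtx`, `Envelope`, and `run_pipeline` are hypothetical names chosen for illustration.

```rust
use std::sync::mpsc;
use std::thread;

/// Minimal trace context: just the frame sequence number.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TraceCtx {
    pub frame_seq: u64,
}

/// A payload together with the context it travels under.
pub struct Envelope<T> {
    pub ctx: TraceCtx,
    pub payload: T,
}

/// Sends frames through a channel to a worker thread and returns the span
/// names both sides would record, each tagged with the propagated frame_seq.
pub fn run_pipeline(frames: Vec<(u64, Vec<u8>)>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<Envelope<Vec<u8>>>();

    // Consumer side: reads the context from the envelope, not from globals.
    let worker = thread::spawn(move || {
        let mut spans = Vec::new();
        for env in rx {
            spans.push(format!(
                "detection_client.request frame_seq={} bytes={}",
                env.ctx.frame_seq,
                env.payload.len()
            ));
        }
        spans
    });

    // Producer side: attaches the context when publishing.
    let mut spans = Vec::new();
    for (seq, data) in frames {
        spans.push(format!("frame_ingest.publish frame_seq={}", seq));
        tx.send(Envelope { ctx: TraceCtx { frame_seq: seq }, payload: data })
            .expect("worker alive");
    }
    drop(tx); // closing the channel lets the worker loop end

    spans.extend(worker.join().expect("worker panicked"));
    spans
}
```

Because the context rides with the message, a consumer on another thread can tag its span with the same `frame_seq` without any shared mutable state.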
## 5. Health endpoint
See `containerization.md §7`. The endpoint is the operator-facing readiness API; metrics + logs are the engineer-facing investigation API.
A red health state for any of these components is unrecoverable for the current flight:
- `frame_ingest` red → no input → cannot operate.
- `mavlink_layer` red → no UAV control → RTL is triggered by the flight controller's failsafe (ArduPilot / PX4 enforces this itself when the MAVLink heartbeat stops).
- `mission_executor` red → mission lifecycle stuck → operator must take RC control.
A red health state for these components is degraded-but-survivable:
- `detection_client` → continue zoom-out sweep; lose Tier 1.
- `movement_detector` → continue; lose movement-candidate POI source.
- `semantic_analyzer` → continue; surface Tier-1-only POIs.
- `vlm_client` → fail-closed (POIs surfaced without VLM evidence).
- `mapobjects_store` → continue with in-memory state; persistent diff lost on restart. Sync state may transition to `Stale` (operator visible).
- `mapobjects_sync` (logical, owned by `mission_client`) → mission proceeds with stale snapshot; post-flight push retries via leftover spool. Operator sees `mapobjects_sync = degraded`.
- `operator_bridge` / `telemetry_stream` → continue zoom-out sweep; pause POI surfacing; resume on reconnect. F10 lost-link ladder owns the larger response.
- `gimbal_controller` → pause zoom-in / target-follow; zoom-out sweep stops.
- `mission_client` → continue current mission from in-memory copy.
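The two lists above suggest an aggregation rule along these lines: overall status is red only when an essential component is red, and any other red or yellow degrades it to yellow. The sketch below encodes that reading; it is an interpretation, not a confirmed rule, and `Health`, `ESSENTIAL`, `aggregate`, and `gauge_value` are hypothetical names. The numeric encoding follows the `health_state{component}` gauge in §3.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Red,
    Yellow,
    Green,
    Disabled, // e.g. `vlm_client` when the feature or flag is off
}

/// Components whose red state is listed as unrecoverable for the flight.
const ESSENTIAL: [&str; 3] = ["frame_ingest", "mavlink_layer", "mission_executor"];

/// Encoding of the `health_state{component}` gauge: 0 red, 1 yellow, 2 green.
/// Assumption: a disabled component exports no sample.
pub fn gauge_value(h: Health) -> Option<u8> {
    match h {
        Health::Red => Some(0),
        Health::Yellow => Some(1),
        Health::Green => Some(2),
        Health::Disabled => None,
    }
}

/// Aggregates per-component states into the overall status.
/// Assumption: a red non-essential component degrades the whole to yellow.
pub fn aggregate(components: &[(&str, Health)]) -> Health {
    let mut overall = Health::Green;
    for (name, h) in components {
        match h {
            Health::Red if ESSENTIAL.contains(name) => return Health::Red,
            Health::Red | Health::Yellow => overall = Health::Yellow,
            Health::Green | Health::Disabled => {}
        }
    }
    overall
}
```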
## 6. Replay-driven debugging
All non-trivial decisions in `scan_controller`, `movement_detector`, `semantic_analyzer`, `vlm_client`, and `mission_executor` are reconstructable from logs + the (size-capped) raw inputs that drove them:
- Frame seq, gimbal state at decode, telemetry sample used, Tier-1 detections returned, Tier-2 score, VLM raw response (size-capped), operator command, resulting state transition.
This is the foundation of the replay-based integration tests in `ci_cd_pipeline.md §2`.
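Replay works when each decision is a pure function of its recorded inputs. A minimal sketch, assuming a much-reduced input record and placeholder thresholds (neither comes from the design docs): `DecisionInput`, `Decision`, `decide`, and `replay` are hypothetical.

```rust
/// Reduced stand-in for the recorded inputs of one decision.
#[derive(Debug, Clone, PartialEq)]
pub struct DecisionInput {
    pub frame_seq: u64,
    pub tier1_confidence: f32,
    pub tier2_score: f32,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Decision {
    Ignore,
    SurfaceToOperator,
}

/// A pure decision function: same input, same output.
/// The thresholds are placeholders for illustration only.
pub fn decide(input: &DecisionInput) -> Decision {
    if input.tier1_confidence >= 0.5 && input.tier2_score >= 0.7 {
        Decision::SurfaceToOperator
    } else {
        Decision::Ignore
    }
}

/// Re-runs every recorded input and returns the `frame_seq` of each record
/// whose logged decision differs from what the current code decides.
pub fn replay(log: &[(DecisionInput, Decision)]) -> Vec<u64> {
    log.iter()
        .filter(|(input, logged)| decide(input) != *logged)
        .map(|(input, _)| input.frame_seq)
        .collect()
}
```

An empty result from `replay` means the current code reproduces the logged behaviour; any returned `frame_seq` points at a decision to investigate.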
## 7. Out of scope here
- Suite-wide observability stack choice (OTLP vs Loki vs Tempo vs Promtail) — owned by suite ops.
- Persistent log retention policy — owned by suite ops.
- Alerting routing (Slack / PagerDuty / email) — owned by suite ops.