mirror of
https://github.com/azaion/autopilot.git
synced 2026-06-21 13:11:11 +00:00
[AZ-626] Decompose complete: 47 tasks + docs + module layout
Greenfield Steps 1-6 baseline for the autopilot rewrite from legacy Qt/C++ to a Rust workspace.

- Remove legacy Qt/C++ tree (ai_controller, drone_controller, misc/camera, python_scaffold, root Dockerfile, autopilot.pro, legacy main.py / requirements.txt).
- Add _docs/00_problem (problem, restrictions, acceptance criteria, security approach, input data + fixtures).
- Add _docs/01_solution/solution_draft01.
- Add _docs/02_document (architecture, system-flows, data_model, glossary, decision-rationale, deployment, 13 component descriptions, tests/ specs, FINAL_report, module-layout).
- Add _docs/02_tasks/todo with 47 task specs (AZ-640..AZ-686, one bootstrap + 46 component tasks) and _dependencies_table.md.
- Add .cursor/rules/artifact-srp.mdc (single-responsibility rule for canonical _docs artifacts).
- Track autodev state in _docs/_autodev_state.md (Step 6 completed, ready for Step 7 Implement).

Jira: bootstrap AZ-626; component epics AZ-627..AZ-639; tasks AZ-640..AZ-686. Total complexity 173 points across 12 epics.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -0,0 +1,90 @@
# CI / CD Pipeline

**Status**: forward-looking design (Rust). Final pipeline file lands during build-system bring-up. The shape below describes the intent.

## 1. Goals

The pipeline must:

- Build the autopilot Rust binary cross-compiled for `aarch64-unknown-linux-gnu`.
- Run the full Rust test suite (unit + integration + replay-based) on every commit.
- Run a software-in-the-loop conformance gate against an ArduPilot SITL instance (covers `mavlink_layer` + `mission_executor`).
- Run a benchmark gate on representative target hardware (covers Tier 1 / Tier 2 / VLM / gimbal latency budgets — see `architecture.md §7.6 Benchmark gate`).
- Sign and publish artefacts (binary + container image) on tagged releases.
- Never auto-deploy to the airframe. Deployment is a human-driven operation tied to the suite's flight-gate convention (`/run/azaion/in-flight`).

## 2. Pipeline stages

Single Woodpecker pipeline, multi-stage. Stages run sequentially; a failed stage stops the run.

| Stage | Purpose | Notes |
|---|---|---|
| **fetch** | Clone, restore Cargo cache | `cargo fetch` with a remote cache key. |
| **lint** | `cargo fmt --check`, `cargo clippy --all-targets --all-features -- -D warnings` | Hard fail on any warning. |
| **unit-test** | `cargo test --workspace` (host-arch) | Most logic is platform-independent; tests run in parallel on the host. |
| **build-arm64** | Cross-compile for `aarch64-unknown-linux-gnu` | `cross` or `cargo zigbuild` depending on Rust toolchain. Produces the production binary + a debug symbol artefact. |
| **integration-test** | Replay-based integration tests under emulation | Fixtures: pre-recorded RTSP clip, MAVLink replay, synthetic telemetry. No hardware required. |
| **sitl-conformance** | ArduPilot SITL conformance gate | Spins up ArduPilot SITL + autopilot binary in a container; runs a fixed mission script; asserts MAVLink command surface (per `architecture.md §7.7`) and geofence enforcement. |
| **benchmark-gate** *(opt-in, manual / nightly)* | Tier 1 / 2 / VLM / gimbal latency on real Jetson | Runs on a self-hosted Jetson Orin Nano runner. Asserts `architecture.md §6 NFR` budgets. Slow; not on every PR. |
| **package** | Build container image (Option B from `containerization.md`) | Arch-suffixed tag: `azaion/autopilot:<branch>-arm64`. |
| **sign** | Sign binary + image | Cosign for the image; OS-vendor signing flow for the binary if used in native deployment. |
| **publish** | Push image + binary to internal registry | Tagged builds only. |

## 3. Artefacts

| Artefact | Where | Retention |
|---|---|---|
| `autopilot` binary (aarch64) | internal artefact store | last 10 builds per branch; tagged builds kept indefinitely |
| Debug symbols (`.dwp`) | internal artefact store, separate path | matched to binary lifetime |
| Container image | internal Docker registry | last 10 dev builds; tagged builds kept indefinitely |
| Cosign signature | next to image | matched to image lifetime |
| Test logs | CI run | per Woodpecker retention |
| Benchmark gate report | internal artefact store (Markdown + JSON) | per-tag retention |

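The binary retention rule in the table ("last 10 builds per branch; tagged builds kept indefinitely") could be expressed roughly as below. This is only a sketch: the `Build` struct, its fields, and `prune_candidates` are hypothetical names, and it assumes tagged builds do not count toward the per-branch limit, which the table does not state.

```rust
use std::collections::BTreeMap;

/// Hypothetical record for one stored build; not an existing type in the workspace.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Build {
    pub branch: String,
    pub number: u64,
    pub tagged: bool,
}

/// Returns the builds that may be deleted: untagged builds beyond the newest
/// `keep` per branch. Tagged builds are never returned (kept indefinitely).
pub fn prune_candidates(builds: &[Build], keep: usize) -> Vec<Build> {
    // Group untagged builds by branch name.
    let mut by_branch: BTreeMap<&str, Vec<&Build>> = BTreeMap::new();
    for b in builds.iter().filter(|b| !b.tagged) {
        by_branch.entry(b.branch.as_str()).or_default().push(b);
    }

    let mut doomed = Vec::new();
    for (_, mut list) in by_branch {
        // Newest first, then everything past the first `keep` entries is prunable.
        list.sort_by(|a, b| b.number.cmp(&a.number));
        doomed.extend(list.into_iter().skip(keep).cloned());
    }
    doomed
}
```

Calling `prune_candidates(&builds, 10)` would yield the deletion list for the artefact store; how the store is actually queried and pruned is outside this sketch.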
## 4. Build matrix

Single matrix entry today:

| Toolchain | Target | Tier-1 dep | VLM feature |
|---|---|---|---|
| Rust stable | `aarch64-unknown-linux-gnu` | `../detections` (Cython service consumed via gRPC; not built here) | `cargo --features vlm` (also `cargo` without — both must build) |

The `--features vlm` and the no-feature path are both built and tested to enforce the optionality contract from `architecture.md §7.6 Local VLM confirmation`.

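As an illustration of why both paths must build, the sketch below gates a type on the `vlm` Cargo feature while keeping the rest of the code compiling without it. The names `VlmClient`, `vlm_compiled_in`, and `vlm_active` are hypothetical; only the feature name `vlm` and the runtime flag `vlm_enabled` come from this document.

```rust
/// Hypothetical handle for the optional VLM client.
/// Only exists when the crate is built with `--features vlm`.
#[cfg(feature = "vlm")]
pub struct VlmClient {
    pub socket_path: String,
}

/// True when the binary was compiled with the `vlm` feature.
pub const fn vlm_compiled_in() -> bool {
    cfg!(feature = "vlm")
}

/// VLM confirmation is active only if it was compiled in AND enabled at
/// runtime (the `vlm_enabled` flag from `config.toml`).
pub fn vlm_active(vlm_enabled: bool) -> bool {
    vlm_compiled_in() && vlm_enabled
}
```

Code that references `VlmClient` outside a `#[cfg(feature = "vlm")]` item fails to compile in the no-feature build, which is the kind of regression that building both matrix paths is meant to catch.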
## 5. SITL conformance gate (in detail)

Stage runs in CI; produces a pass/fail signal that gates merge to `dev`.

**Setup:**

1. Start ArduPilot SITL in a container, listening on `udp://0.0.0.0:14550`.
2. Start autopilot binary configured for SITL endpoint.
3. Pre-load a fixture mission via the missions API mock (`mission_client` HTTP target).
4. Pre-load a fixture RTSP source (looped clip).
5. Mock the `../detections` service with deterministic detections.

**Assertions:**

- All MAVLink message kinds in `architecture.md §7.7` succeed at least once.
- Mission upload + start completes within the configured retry budget.
- INCLUSION geofence violation triggers RTL.
- EXCLUSION geofence violation triggers RTL (regression gate against the earlier silent-ignore behaviour).
- Middle-waypoint POST + re-upload completes within 2 s.
- Health endpoint returns `green` once steady state is reached.

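The first assertion above (every required MAVLink message kind succeeds at least once) is a set-coverage check. A minimal sketch, assuming the harness collects the names of message kinds that succeeded during the run; `missing_kinds` is a hypothetical helper, and the real list of required kinds lives in `architecture.md §7.7`, not here.

```rust
use std::collections::BTreeSet;

/// Returns the required message kinds that never succeeded during the run.
/// An empty result means the coverage assertion passes.
pub fn missing_kinds<'a>(required: &[&'a str], succeeded: &[&str]) -> Vec<&'a str> {
    // Deduplicate the observed successes; repeats do not matter for coverage.
    let seen: BTreeSet<&str> = succeeded.iter().copied().collect();
    required
        .iter()
        .copied()
        .filter(|kind| !seen.contains(*kind))
        .collect()
}
```

A conformance test could then fail with the returned list in its message, so a missing kind is named rather than reported as a bare boolean.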
## 6. Branch policy

| Branch | Triggers | Required gates |
|---|---|---|
| feature branches (PR) | on push | fetch → lint → unit-test → build-arm64 → integration-test → sitl-conformance |
| `dev` | on merge | all PR gates + package |
| tagged release (`v*`) | on tag | all `dev` gates + sign + publish + benchmark-gate (manual approval) |

`main` and `dev` are protected. Force-push is forbidden. Merges require a green pipeline.

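The branch table and the "a failed stage stops the run" rule from §2 can be modelled as follows. This is a sketch, not the Woodpecker configuration: the `Stage` and `Trigger` enums and both functions are hypothetical, and gates are listed in the stage-table order from §2 (benchmark-gate before package), which is an assumption for tagged releases.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    Fetch,
    Lint,
    UnitTest,
    BuildArm64,
    IntegrationTest,
    SitlConformance,
    BenchmarkGate,
    Package,
    Sign,
    Publish,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Trigger {
    PullRequestPush,
    DevMerge,
    ReleaseTag,
}

/// Required gates per trigger, following the branch-policy table.
pub fn required_gates(trigger: Trigger) -> Vec<Stage> {
    let mut gates = vec![
        Stage::Fetch,
        Stage::Lint,
        Stage::UnitTest,
        Stage::BuildArm64,
        Stage::IntegrationTest,
        Stage::SitlConformance,
    ];
    match trigger {
        Trigger::PullRequestPush => {}
        Trigger::DevMerge => gates.push(Stage::Package),
        Trigger::ReleaseTag => gates.extend([
            Stage::BenchmarkGate,
            Stage::Package,
            Stage::Sign,
            Stage::Publish,
        ]),
    }
    gates
}

/// Runs the gates in order and stops at the first failure,
/// returning the stage that failed.
pub fn run_pipeline<F: FnMut(Stage) -> bool>(gates: &[Stage], mut run: F) -> Result<(), Stage> {
    for &stage in gates {
        if !run(stage) {
            return Err(stage);
        }
    }
    Ok(())
}
```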
## 7. Out of scope here

- Airframe deployment automation (manual; tied to flight-gate).
- Ground Station and `../detections` pipelines (each owns its own).
- AI training pipeline — `../_docs/12_ai_training.md`.
- Model-sync to the airframe (`model-sync.service`, suite-level — `../_docs/00_top_level_architecture.md`).

@@ -0,0 +1,142 @@
# Containerisation

**Status**: forward-looking design (Rust). Final shape will surface during build-system bring-up; treat the choices below as the current intent, not commitments.

## 1. Deployment shape

`autopilot` is a single Rust binary. Two delivery options are considered:

| Option | Form | Pros | Cons |
|---|---|---|---|
| **A — native systemd unit** | bare binary deployed to `/usr/local/bin/autopilot` + a `.service` unit | minimum overhead on Jetson; closest to airframe constraints; trivial flight-gate integration | per-host installation discipline; less reproducible across nodes |
| **B — single container image** | `azaion/autopilot:<branch>-arm64` | consistent across environments; matches the suite's existing OTA model (Watchtower) | container runtime adds startup latency and one more failure surface on the airframe |

The decision is **Option A** for the on-airframe deployment (lowest overhead, closest to the autopilot's real-time constraints), and **Option B** for development / CI / emulated-hardware testing (reproducibility wins). The same Rust binary is built once and packaged into both.

## 2. Target hardware

| Item | Value |
|---|---|
| Edge device | NVIDIA Jetson Orin Nano Super 8 GB |
| Architecture | aarch64 |
| OS | Ubuntu 22.04 (JetPack-bundled) — locked JetPack version + power mode |
| Camera | ViewPro A40 (RTSP + UDP control) |
| Autopilot | ArduPilot or PX4 over MAVLink v2 (UDP or serial) |

## 3. Native deployment (Option A — production)

**Layout:**

```text
/usr/local/bin/autopilot                 Rust binary
/etc/azaion/autopilot/config.toml        runtime config
/etc/systemd/system/autopilot.service    systemd unit
/var/lib/autopilot/                      persistent state (mapobjects_store)
/run/azaion/in-flight                    flight-gate marker (per ../_docs/00_top_level_architecture.md)
```

**systemd unit highlights:**

- `Type=notify` — autopilot signals readiness once Tier 1, gimbal, and MAVLink links are healthy.
- `Restart=on-failure`, `RestartSec=2s`, `StartLimitBurst=5` — bounded restart (so a hard-broken binary doesn't loop forever).
- `MemoryMax=` — enforces the on-airframe memory budget (~6 GB; Tier-1 YOLO container holds ~2 GB).
- `LimitNOFILE`, `LimitNPROC` set explicitly.
- `ExecStartPre=/bin/sh -c 'mkdir -p /run/azaion && touch /run/azaion/in-flight'` — asserts the suite-wide flight-gate so `model-sync.service` does not pull a new model mid-flight.
- `ExecStopPost=/bin/rm -f /run/azaion/in-flight` — clears the flight-gate on shutdown.

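For `Type=notify`, the service tells systemd it is ready by sending the datagram `READY=1` to the Unix socket named in the `NOTIFY_SOCKET` environment variable. The sketch below shows that handshake with the standard library only. The function names and the three boolean inputs are hypothetical, and abstract-namespace sockets (a `NOTIFY_SOCKET` value starting with `@`) are not handled here; a real implementation might use an existing sd_notify crate instead.

```rust
use std::env;
use std::io;
use std::os::unix::net::UnixDatagram;
use std::path::Path;

/// Sends the `READY=1` datagram to the notification socket at `socket_path`.
pub fn notify_ready_at(socket_path: &Path) -> io::Result<()> {
    let sock = UnixDatagram::unbound()?;
    sock.send_to(b"READY=1", socket_path)?;
    Ok(())
}

/// Signals readiness only when all gating links are healthy.
/// Returns Ok(true) if the notification was sent, Ok(false) if the links are
/// not healthy yet or the process is not running under a notify-type unit.
pub fn notify_ready_if_healthy(
    tier1_ok: bool,
    gimbal_ok: bool,
    mavlink_ok: bool,
) -> io::Result<bool> {
    if !(tier1_ok && gimbal_ok && mavlink_ok) {
        return Ok(false);
    }
    match env::var_os("NOTIFY_SOCKET") {
        Some(path) => {
            notify_ready_at(Path::new(&path))?;
            Ok(true)
        }
        None => Ok(false),
    }
}
```

Until readiness is sent, systemd keeps the unit in the `activating` state, so dependants ordered after `autopilot.service` would not start against a half-initialised process.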
**Runtime config** (`/etc/azaion/autopilot/config.toml`) is the single source for non-secret configuration: RTSP URL, gimbal endpoint, MAVLink connection URI, missions API endpoint, Ground Station endpoint, VLM IPC socket path, `vlm_enabled` flag, log level. Secrets (if any — TBD per `../_docs/02_missions.md` auth model) come from the systemd `EnvironmentFile=` pointing at a permission-restricted file.

## 4. Container image (Option B — dev / CI / emulation)

**Base image:** `nvcr.io/nvidia/l4t-base:<JetPack-pinned-tag>` for production-equivalent NVDEC + TensorRT plumbing; `ubuntu:22.04` for emulation (no GPU acceleration).

**Image layout:**

```text
/usr/local/bin/autopilot                 Rust binary (built outside the image)
/etc/azaion/autopilot/config.toml        runtime config (mounted at runtime)
/var/lib/autopilot/                      persistent state (volume-mounted)
```

**Image is non-root.** Default `USER` is `autopilot:autopilot`; `/var/lib/autopilot/` is owned by that user.

**Compose example** (development):

```yaml
services:
  autopilot:
    image: azaion/autopilot:dev-arm64
    restart: unless-stopped
    environment:
      AUTOPILOT_CONFIG: /etc/azaion/autopilot/config.toml
    volumes:
      - ./config/autopilot.toml:/etc/azaion/autopilot/config.toml:ro
      - autopilot-state:/var/lib/autopilot
      - /run/azaion:/run/azaion
    devices:
      - /dev/ttyUSB0:/dev/ttyUSB0   # MAVLink serial (if used)
    network_mode: host              # RTSP / UDP gimbal / Ground Station modem all on host

volumes:
  autopilot-state: {}
```

`network_mode: host` is intentional on Jetson: RTSP, gimbal UDP, MAVLink UDP, and the modem-link to the Ground Station all share the airframe's network namespace.

## 5. External dependencies on the airframe

`autopilot` itself is the only autopilot-owned process. The on-airframe tier also runs (separately):

- **`../detections`** — Tier 1 YOLO service. Container delivered from its own pipeline. Bi-directional gRPC endpoint consumed by `detection_client`.
- **NanoLLM / VILA1.5-3B** (optional) — local IPC peer of `vlm_client`. Separate container or process; not embedded in the autopilot binary. Surfaces a Unix-domain socket; peer-credential check is mandatory when supported.
- **GPS-Denied service** — separate edge service, owned by `gps-denied-onboard`; consumed indirectly through the shared edge data path (per `../_docs/11_gps_denied.md`).
- **`model-sync.service`** — suite-wide rclone-driven model puller. Reads `/run/azaion/in-flight` to defer model swaps during flight (per `../_docs/00_top_level_architecture.md`).

## 6. Configuration surface

All configuration is declarative (`config.toml`); there is no compile-time configuration of endpoints, addresses, or feature switches **except** the `vlm_client` build-time feature flag (see `architecture.md §7.6 Local VLM confirmation > Optionality model`).

| Concern | Mechanism |
|---|---|
| RTSP / gimbal / MAVLink endpoints | `config.toml` |
| `missions` API endpoint + auth | `config.toml` (auth pulled from `EnvironmentFile=`) |
| Ground Station endpoint | `config.toml` |
| VLM IPC socket path | `config.toml` |
| `vlm_enabled` runtime flag | `config.toml` |
| `vlm_client` build-time feature | `cargo --features vlm` at build |
| Log level + format | `RUST_LOG` env (`tracing-subscriber` honours it) |
| Mission ID for the current flight | CLI arg (per-flight, not per-host) |

## 7. Health endpoint

`autopilot` exposes a single HTTP health endpoint (port and bind address from `config.toml`; default `127.0.0.1:8080`). It aggregates per-component readiness:

```json
{
  "status": "green | yellow | red",
  "components": {
    "frame_ingest": "green",
    "detection_client": "green",
    "movement_detector": "green",
    "semantic_analyzer": "green",
    "vlm_client": "disabled",
    "scan_controller": "green",
    "mapobjects_store": "green",
    "gimbal_controller": "green",
    "operator_bridge": "yellow",
    "mission_executor": "green",
    "mavlink_layer": "green",
    "mission_client": "green",
    "telemetry_stream": "green"
  },
  "last_state_change": "2026-05-17T12:00:00Z"
}
```

`yellow` is degraded-but-running; `red` is unrecoverable for at least one essential component. The aggregator surfaces details on each transition through `tracing` (see `observability.md`).

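One possible reading of the aggregation rule is sketched below. It assumes the essential set is the three components that `observability.md §5` lists as unrecoverable when red (`frame_ingest`, `mavlink_layer`, `mission_executor`), that a red non-essential component degrades the overall status to `yellow`, and that `disabled` components are ignored. The `Health` enum and `aggregate` function are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Green,
    Yellow,
    Red,
    Disabled,
}

/// Components whose red state is treated as unrecoverable (assumed from
/// `observability.md §5`).
const ESSENTIAL: [&str; 3] = ["frame_ingest", "mavlink_layer", "mission_executor"];

/// Folds per-component states into the top-level `status` value.
pub fn aggregate(components: &[(&str, Health)]) -> Health {
    let mut overall = Health::Green;
    for &(name, state) in components {
        match state {
            // An essential component in red makes the whole process red.
            Health::Red if ESSENTIAL.contains(&name) => return Health::Red,
            // Any other degradation is survivable: report yellow.
            Health::Red | Health::Yellow => overall = Health::Yellow,
            // Healthy or intentionally disabled components do not degrade status.
            Health::Green | Health::Disabled => {}
        }
    }
    overall
}
```

With the example payload above (`operator_bridge` yellow, `vlm_client` disabled, everything else green), this sketch would report `yellow`.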
## 8. Out of scope here

- Provisioning the Jetson host itself (Ansible / Kickstart / disk imaging) — owned by airframe ops.
- Build pipeline (cross-compile, signing, registry push) — see `ci_cd_pipeline.md`.
- Observability stack (tracing exporter, log shipper, metrics scraper) — see `observability.md`.
- Mission delivery to the airframe — owned by `missions` API.

@@ -0,0 +1,142 @@
# Observability

**Status**: forward-looking design (Rust). Treat the choices below as the intended approach; the exact tracing exporter / metrics scraper / log-shipping target depend on the suite's overall observability stack at deploy time.

## 1. Posture

- **One binary, one process.** Per-component instrumentation is structured (each component listed in `architecture.md §3` is a `tracing` target).
- **Structured logs are primary.** Metrics are derived from log spans and counters; traces follow a frame's journey through the pipeline end to end.
- **No silent error swallowing.** Every failure path increments a counter, emits a span event, or both.
- **Health is aggregated**, not derived from logs. The HTTP health endpoint (`containerization.md §7`) is the source of truth for live readiness.

## 2. Logs

**Library**: `tracing` + `tracing-subscriber`.

**Format**: JSON to stdout. Captured by the host's journald (Option A) or by the container runtime (Option B), then shipped to the suite's log aggregator.

**Per-line fields:**

| Field | Source | Notes |
|---|---|---|
| `ts` | wall clock | ISO-8601 UTC. |
| `ts_mono_ns` | monotonic clock | For ordering across components without clock-skew artefacts. |
| `level` | `tracing` | `error \| warn \| info \| debug \| trace`. |
| `target` | component name | One of `frame_ingest`, `detection_client`, `movement_detector`, `semantic_analyzer`, `vlm_client`, `scan_controller`, `mapobjects_store`, `gimbal_controller`, `operator_bridge`, `mission_executor`, `mavlink_layer`, `mission_client`, `telemetry_stream`. |
| `frame_seq` | propagated context | Where applicable. Lets us reconstruct one frame's journey. |
| `poi_id`, `roi_id`, `target_id`, `mission_id`, `command_id` | propagated context | Where applicable. |
| `event` | message | Short, machine-friendly identifier (e.g., `frame.dropped`, `vlm.timeout`, `mission.geofence_violation`, `bit.check_failed`, `failsafe.lost_link`, `mapobjects.push_failed`, `operator.auth_rejected`). |
| `model_version` | propagated context | Version string for `tier1_model_version` and `vlm_model_version`. Required on every `vlm.response` and on every Tier-2 evidence span for forensic correlation. |
| `wall_clock_source` | telemetry frame | `gnss \| host \| coast`; emitted on every state-transition span and on every operator-command audit log line. |
| `reason` | message | Free-form for human readers. |

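To make the field table concrete, the sketch below renders one log line with a subset of those fields as a single JSON object. It is an illustration of the line shape only: in the design above the formatting is done by `tracing-subscriber`, and the `LogLine` struct, its `to_json` method, and the hand-written `escape` helper are hypothetical stand-ins so the example runs with the standard library alone.

```rust
/// Minimal JSON string escaping for this sketch
/// (quotes, backslashes, newlines, other control characters).
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Hypothetical in-memory form of one log line (subset of the field table).
pub struct LogLine<'a> {
    pub ts: &'a str,            // ISO-8601 UTC, produced elsewhere
    pub ts_mono_ns: u64,        // monotonic clock reading
    pub level: &'a str,
    pub target: &'a str,        // component name
    pub event: &'a str,         // machine-friendly identifier
    pub frame_seq: Option<u64>, // only "where applicable"
    pub reason: &'a str,        // free-form, for humans
}

impl<'a> LogLine<'a> {
    /// Renders the line as one JSON object; `frame_seq` is omitted when absent.
    pub fn to_json(&self) -> String {
        let mut s = format!(
            "{{\"ts\":\"{}\",\"ts_mono_ns\":{},\"level\":\"{}\",\"target\":\"{}\",\"event\":\"{}\"",
            escape(self.ts),
            self.ts_mono_ns,
            escape(self.level),
            escape(self.target),
            escape(self.event)
        );
        if let Some(seq) = self.frame_seq {
            s.push_str(&format!(",\"frame_seq\":{}", seq));
        }
        s.push_str(&format!(",\"reason\":\"{}\"}}", escape(self.reason)));
        s
    }
}
```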
**Log level defaults:**

- `info`: lifecycle (startup / shutdown / state transitions), all error and security events.
- `warn`: degraded-but-running events (yellow health, retries, drops).
- `error`: red health, hard failures, schema violations, security violations.
- `debug` / `trace`: off in production; enabled per-target via `RUST_LOG`.

**Always logged at `warn` or higher** (per `coderule.mdc`):

- Every exception path that the operator could care about.
- Authentication / authorisation failures (peer-cred check failures on VLM IPC, malformed Ground Station session, MAVLink-2 signing rejection).
- Geofence violations.
- Schema validation failures (Tier 1 response, VLM response, mission JSON).

## 3. Metrics

Derived from log spans + a small set of explicit counters. Exporter: Prometheus-compatible (per the suite's stack).

**Per-component counters** (illustrative — exact names finalised at implementation):

| Component | Counter | Type |
|---|---|---|
| `frame_ingest` | `frames_received_total`, `frames_dropped_total{reason}`, `decode_errors_total` | counter |
| `frame_ingest` | `decode_ms` | histogram |
| `detection_client` | `requests_total`, `errors_total{kind}`, `latency_ms` | counter / histogram |
| `movement_detector` | `candidates_total`, `telemetry_skew_drops_total` | counter |
| `semantic_analyzer` | `tier2_runs_total`, `tier2_latency_ms`, `tier2_oversize_total` | counter / histogram |
| `vlm_client` | `vlm_requests_total{status}`, `vlm_latency_ms` | counter / histogram |
| `scan_controller` | `state_transitions_total{from,to}`, `pois_in_queue`, `pois_per_min`, `tick_latency_ms` | counter / gauge / histogram |
| `mapobjects_store` | `classify_total{result}`, `ignored_items_total`, `removed_candidates_total` | counter |
| `gimbal_controller` | `commands_total`, `decision_to_movement_ms`, `zoom_transition_ms`, `vendor_faults_total` | counter / histogram |
| `mavlink_layer` | `messages_in_total{kind}`, `messages_out_total{kind}`, `command_acks_total{result}`, `parse_errors_total`, `link_state` | counter / gauge |
| `mission_executor` | `state_transitions_total{from,to}`, `mission_uploads_total{result}`, `geofence_violations_total{kind}` | counter |
| `mission_client` | `fetches_total{result}`, `middle_waypoint_posts_total{result}`, `mapobjects_pull_total{result}`, `mapobjects_push_total{result}`, `mapobjects_pull_bytes`, `mapobjects_push_bytes`, `mapobjects_sync_lag_s` | counter / gauge |
| `mission_executor` (BIT) | `bit_runs_total{result}`, `bit_check_failures_total{check}` | counter |
| `mission_executor` (failsafe) | `link_loss_events_total{trigger}`, `failsafe_action_total{action}` | counter |
| `operator_bridge` | `pois_surfaced_total`, `commands_received_total{kind,result}`, `decision_latency_ms`, `auth_rejections_total{reason}`, `command_e2e_ms` | counter / histogram |
| `telemetry_stream` | `bytes_out_total`, `frames_out_total`, `link_state`, `bandwidth_used_mbps` | counter / gauge |

**Aggregated:**

- `health_state{component}` — 0 (red) / 1 (yellow) / 2 (green); enables alerting per-component.
- `process_uptime_seconds`, `process_resident_memory_bytes` — standard.

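The labelled counters and the `health_state` encoding above can be sketched with a tiny in-process registry that renders Prometheus text exposition lines. This is a standard-library illustration under stated assumptions: the `Counters` type, its methods, and `health_state_value` are hypothetical, the label value `queue_full` is made up, at most one label per sample is supported, and label-value escaping is skipped. A real exporter would more likely come from an existing metrics crate.

```rust
use std::collections::BTreeMap;

/// Hypothetical counter registry keyed by (metric name, optional single label).
#[derive(Default)]
pub struct Counters {
    values: BTreeMap<(String, Option<(String, String)>), u64>,
}

impl Counters {
    /// Increments a counter, creating it at zero on first use.
    pub fn inc(&mut self, name: &str, label: Option<(&str, &str)>) {
        let key = (
            name.to_string(),
            label.map(|(k, v)| (k.to_string(), v.to_string())),
        );
        *self.values.entry(key).or_insert(0) += 1;
    }

    /// Renders one sample per line, sorted by name then label,
    /// in the form `name{label="value"} count`.
    pub fn render(&self) -> String {
        let mut out = String::new();
        for ((name, label), value) in &self.values {
            match label {
                Some((k, v)) => {
                    out.push_str(&format!("{}{{{}=\"{}\"}} {}\n", name, k, v, value))
                }
                None => out.push_str(&format!("{} {}\n", name, value)),
            }
        }
        out
    }
}

/// Numeric encoding of `health_state{component}`: 0 red, 1 yellow, 2 green.
pub fn health_state_value(state: &str) -> Option<u8> {
    match state {
        "red" => Some(0),
        "yellow" => Some(1),
        "green" => Some(2),
        _ => None,
    }
}
```

Encoding health as an ordered number lets an alert rule fire on a threshold such as "below 2 for some duration" instead of matching strings.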
## 4. Traces

`tracing` spans cover the path of a single frame and the path of a single POI.

**Frame trace** (per `Frame`):

```text
frame_ingest.publish
detection_client.request
detection_client.response
movement_detector.tick
[movement_detector.emit_candidate]
telemetry_stream.push
```

**POI trace** (per `POI`):

```text
scan_controller.enqueue
scan_controller.dequeue
gimbal_controller.zoom
semantic_analyzer.tier2
[vlm_client.request -> vlm_client.response]
operator_bridge.surface
[operator_bridge.confirm | decline | timeout]
mission_executor.middle_waypoint # confirm path
mapobjects_store.append_ignored # decline path
```

Spans propagate via context across in-process channels. Trace export target depends on the suite's stack (OTLP / Jaeger / Tempo).

## 5. Health endpoint

See `containerization.md §7`. The endpoint is the operator-facing readiness API; metrics + logs are the engineer-facing investigation API.

A red health state for any of these components is unrecoverable for the current flight:

- `frame_ingest` red → no input → cannot operate.
- `mavlink_layer` red → no UAV control → RTL is triggered by the flight controller's failsafe (ArduPilot / PX4 enforces this itself when the MAVLink heartbeat stops).
- `mission_executor` red → mission lifecycle stuck → operator must take RC control.

A red health state for these components is degraded-but-survivable:

- `detection_client` → continue zoom-out sweep; lose Tier 1.
- `movement_detector` → continue; lose movement-candidate POI source.
- `semantic_analyzer` → continue; surface Tier-1-only POIs.
- `vlm_client` → fail-closed (POIs surfaced without VLM evidence).
- `mapobjects_store` → continue with in-memory state; persistent diff lost on restart. Sync state may transition to `Stale` (operator visible).
- `mapobjects_sync` (logical, owned by `mission_client`) → mission proceeds with stale snapshot; post-flight push retries via leftover spool. Operator sees `mapobjects_sync = degraded`.
- `operator_bridge` / `telemetry_stream` → continue zoom-out sweep; pause POI surfacing; resume on reconnect. F10 lost-link ladder owns the larger response.
- `gimbal_controller` → pause zoom-in / target-follow; zoom-out sweep stops.
- `mission_client` → continue current mission from in-memory copy.

## 6. Replay-driven debugging

All non-trivial decisions in `scan_controller`, `movement_detector`, `semantic_analyzer`, `vlm_client`, and `mission_executor` are reconstructable from logs + the (size-capped) raw inputs that drove them:

- Frame seq, gimbal state at decode, telemetry sample used, Tier-1 detections returned, Tier-2 score, VLM raw response (size-capped), operator command, resulting state transition.

This is the foundation of the replay-based integration tests in `ci_cd_pipeline.md §2`.

## 7. Out of scope here

- Suite-wide observability stack choice (OTLP vs Loki vs Tempo vs Promtail) — owned by suite ops.
- Persistent log retention policy — owned by suite ops.
- Alerting routing (Slack / PagerDuty / email) — owned by suite ops.