# Observability

**Status**: forward-looking design (Rust). Treat the choices below as the intended approach; the exact tracing exporter / metrics scraper / log-shipping target depend on the suite's overall observability stack at deploy time.

## 1. Posture

- **One binary, one process.** Per-component instrumentation is structured (each component listed in `architecture.md §3` is a `tracing` target).
- **Structured logs are primary.** Metrics are derived from log spans and counters; traces follow a frame end-to-end through the pipeline.
- **No silent error swallowing.** Every failure path increments a counter, emits a span event, or both.
- **Health is aggregated**, not derived from logs. The HTTP health endpoint (`containerization.md §7`) is the source of truth for live readiness.

## 2. Logs

**Library**: `tracing` + `tracing-subscriber`.

**Format**: JSON to stdout. Captured by the host's journald (Option A) or by the container runtime (Option B), then shipped to the suite's log aggregator.

**Per-line fields:**

| Field | Source | Notes |
|---|---|---|
| `ts` | wall clock | ISO-8601 UTC. |
| `ts_mono_ns` | monotonic clock | For ordering across components without clock-skew artefacts. |
| `level` | `tracing` | `error \| warn \| info \| debug \| trace`. |
| `target` | component name | One of `frame_ingest`, `detection_client`, `movement_detector`, `semantic_analyzer`, `vlm_client`, `scan_controller`, `mapobjects_store`, `gimbal_controller`, `operator_bridge`, `mission_executor`, `mavlink_layer`, `mission_client`, `telemetry_stream`. |
| `frame_seq` | propagated context | Where applicable. Lets us reconstruct one frame's journey. |
| `poi_id`, `roi_id`, `target_id`, `mission_id`, `command_id` | propagated context | Where applicable. |
| `event` | message | Short, machine-friendly identifier (e.g., `frame.dropped`, `vlm.timeout`, `mission.geofence_violation`, `bit.check_failed`, `failsafe.lost_link`, `mapobjects.push_failed`, `operator.auth_rejected`). |
| `model_version` | propagated context | Version string for `tier1_model_version` and `vlm_model_version`. Required on every `vlm.response` and on every Tier-2 evidence span for forensic correlation. |
| `wall_clock_source` | telemetry frame | `gnss \| host \| coast`; emitted on every state-transition span and on every operator-command audit log line. |
| `reason` | message | Free-form for human readers. |

**Log level defaults:**

- `info`: lifecycle (startup / shutdown / state transitions), all error and security events.
- `warn`: degraded-but-running events (yellow health, retries, drops).
- `error`: red health, hard failures, schema violations, security violations.
- `debug` / `trace`: off in production; enabled per-target via `RUST_LOG`.

**Always logged at `warn` or higher** (per `coderule.mdc`):

- Every exception path that the operator could care about.
- Authentication / authorisation failures (peer-cred check failures on VLM IPC, malformed Ground Station session, MAVLink-2 signing rejection).
- Geofence violations.
- Schema validation failures (Tier 1 response, VLM response, mission JSON).

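The "always logged at `warn` or higher" rule can be read as a severity floor applied per event class. The following is a minimal std-only sketch of that idea, not the project's implementation: the `Level` enum is a stand-in for the `tracing` level type, and `EventClass` and `effective_level` are hypothetical names introduced here for illustration.

```rust
/// Severity levels, declared from least to most severe so that `Ord::max`
/// returns the more severe of two levels. This is a local stand-in, not the
/// `tracing` crate's own level type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Trace,
    Debug,
    Info,
    Warn,
    Error,
}

/// Hypothetical event classes mirroring the "always logged" list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EventClass {
    Lifecycle,
    OperatorRelevantError,
    AuthFailure,
    GeofenceViolation,
    SchemaValidationFailure,
    Other,
}

/// Raises the requested level to at least `Warn` for classes that must
/// never be logged below `warn`; other classes keep the requested level.
pub fn effective_level(class: EventClass, requested: Level) -> Level {
    match class {
        EventClass::OperatorRelevantError
        | EventClass::AuthFailure
        | EventClass::GeofenceViolation
        | EventClass::SchemaValidationFailure => requested.max(Level::Warn),
        EventClass::Lifecycle | EventClass::Other => requested,
    }
}
```

A floor (rather than a fixed level) keeps call sites free to log a schema violation at `error` while still preventing an accidental `debug`.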
## 3. Metrics

Metrics are derived from log spans plus a small set of explicit counters. Exporter: Prometheus-compatible (per the suite's stack).

**Per-component metrics** (illustrative; exact names are finalised at implementation):

| Component | Metrics | Type |
|---|---|---|
| `frame_ingest` | `frames_received_total`, `frames_dropped_total{reason}`, `decode_errors_total` | counter |
| `frame_ingest` | `decode_ms` | histogram |
| `detection_client` | `requests_total`, `errors_total{kind}`, `latency_ms` | counter / histogram |
| `movement_detector` | `candidates_total`, `telemetry_skew_drops_total` | counter |
| `semantic_analyzer` | `tier2_runs_total`, `tier2_latency_ms`, `tier2_oversize_total` | counter / histogram |
| `vlm_client` | `vlm_requests_total{status}`, `vlm_latency_ms` | counter / histogram |
| `scan_controller` | `state_transitions_total{from,to}`, `pois_in_queue`, `pois_per_min`, `tick_latency_ms` | counter / gauge / histogram |
| `mapobjects_store` | `classify_total{result}`, `ignored_items_total`, `removed_candidates_total` | counter |
| `gimbal_controller` | `commands_total`, `decision_to_movement_ms`, `zoom_transition_ms`, `vendor_faults_total` | counter / histogram |
| `mavlink_layer` | `messages_in_total{kind}`, `messages_out_total{kind}`, `command_acks_total{result}`, `parse_errors_total`, `link_state` | counter / gauge |
| `mission_executor` | `state_transitions_total{from,to}`, `mission_uploads_total{result}`, `geofence_violations_total{kind}` | counter |
| `mission_client` | `fetches_total{result}`, `middle_waypoint_posts_total{result}`, `mapobjects_pull_total{result}`, `mapobjects_push_total{result}`, `mapobjects_pull_bytes`, `mapobjects_push_bytes`, `mapobjects_sync_lag_s` | counter / gauge |
| `mission_executor` (BIT) | `bit_runs_total{result}`, `bit_check_failures_total{check}` | counter |
| `mission_executor` (failsafe) | `link_loss_events_total{trigger}`, `failsafe_action_total{action}` | counter |
| `operator_bridge` | `pois_surfaced_total`, `commands_received_total{kind,result}`, `decision_latency_ms`, `auth_rejections_total{reason}`, `command_e2e_ms` | counter / histogram |
| `telemetry_stream` | `bytes_out_total`, `frames_out_total`, `link_state`, `bandwidth_used_mbps` | counter / gauge |

**Aggregated:**

- `health_state{component}`: 0 (red) / 1 (yellow) / 2 (green); enables alerting per-component.
- `process_uptime_seconds`, `process_resident_memory_bytes`: standard.

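To make the `name{label}` notation in the table concrete, here is a minimal std-only sketch of a single-label counter that renders itself in the Prometheus text exposition format. It is an illustration under assumptions, not the intended implementation: `LabeledCounter` and its methods are hypothetical, a real build would more likely use an existing metrics crate, and the label values used below (`queue_full`, `decode_error`) are invented examples. Label values are assumed to be simple identifiers, so no escaping is performed.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// A counter with one label dimension, e.g. `frames_dropped_total{reason}`.
/// Hypothetical helper for illustration only.
pub struct LabeledCounter {
    name: &'static str,
    label: &'static str,
    values: Mutex<BTreeMap<String, u64>>,
}

impl LabeledCounter {
    pub fn new(name: &'static str, label: &'static str) -> Self {
        Self { name, label, values: Mutex::new(BTreeMap::new()) }
    }

    /// Increments the sample for the given label value, creating it at 0 first.
    pub fn inc(&self, label_value: &str) {
        let mut values = self.values.lock().unwrap();
        *values.entry(label_value.to_string()).or_insert(0) += 1;
    }

    /// Returns the current count for a label value (0 if never incremented).
    pub fn get(&self, label_value: &str) -> u64 {
        self.values.lock().unwrap().get(label_value).copied().unwrap_or(0)
    }

    /// Renders a `# TYPE` line followed by one sample per label value,
    /// sorted by label value (BTreeMap iteration order).
    /// Assumes label values need no escaping.
    pub fn render(&self) -> String {
        let values = self.values.lock().unwrap();
        let mut out = format!("# TYPE {} counter\n", self.name);
        for (value, count) in values.iter() {
            out.push_str(&format!(
                "{}{{{}=\"{}\"}} {}\n",
                self.name, self.label, value, count
            ));
        }
        out
    }
}
```

Keeping the label set small and enumerable (a reason or result code, never a free-form string) is what keeps such counters cheap to scrape.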
## 4. Traces

`tracing` spans cover the path of a single frame and the path of a single POI.

**Frame trace** (per `Frame`):

```text
frame_ingest.publish
detection_client.request
detection_client.response
movement_detector.tick
[movement_detector.emit_candidate]
telemetry_stream.push
```

**POI trace** (per `POI`):

```text
scan_controller.enqueue
scan_controller.dequeue
gimbal_controller.zoom
semantic_analyzer.tier2
[vlm_client.request -> vlm_client.response]
operator_bridge.surface
[operator_bridge.confirm | decline | timeout]
mission_executor.middle_waypoint # confirm path
mapobjects_store.append_ignored # decline path
```

Spans propagate via context across in-process channels. Trace export target depends on the suite's stack (OTLP / Jaeger / Tempo).

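The branching in the POI trace can be written down as an expected span sequence, which is handy as a fixture for replay tests. The sketch below is one reading of the listing, with assumptions stated plainly: bracketed entries are treated as optional, the operator outcomes are expanded to `operator_bridge.confirm`, `operator_bridge.decline`, and `operator_bridge.timeout`, and no follow-up span is emitted after a timeout because the listing names none. `OperatorOutcome` and `poi_trace` are hypothetical names.

```rust
/// How the operator resolved a surfaced POI.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperatorOutcome {
    Confirm,
    Decline,
    Timeout,
}

/// Builds the expected span sequence for one POI.
/// `vlm_consulted` models the optional VLM request/response pair.
pub fn poi_trace(vlm_consulted: bool, outcome: OperatorOutcome) -> Vec<&'static str> {
    let mut spans = vec![
        "scan_controller.enqueue",
        "scan_controller.dequeue",
        "gimbal_controller.zoom",
        "semantic_analyzer.tier2",
    ];
    if vlm_consulted {
        spans.push("vlm_client.request");
        spans.push("vlm_client.response");
    }
    spans.push("operator_bridge.surface");
    match outcome {
        OperatorOutcome::Confirm => {
            spans.push("operator_bridge.confirm");
            spans.push("mission_executor.middle_waypoint"); // confirm path
        }
        OperatorOutcome::Decline => {
            spans.push("operator_bridge.decline");
            spans.push("mapobjects_store.append_ignored"); // decline path
        }
        OperatorOutcome::Timeout => {
            // Assumption: the listing names no follow-up span for a timeout.
            spans.push("operator_bridge.timeout");
        }
    }
    spans
}
```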
## 5. Health endpoint

See `containerization.md §7`. The endpoint is the operator-facing readiness API; metrics + logs are the engineer-facing investigation API.

A red health state for any of these components is unrecoverable for the current flight:

- `frame_ingest` red → no input → cannot operate.
- `mavlink_layer` red → no UAV control → trigger RTL via the autopilot's failsafe (the autopilot itself enforces this when MAVLink heartbeat stops).
- `mission_executor` red → mission lifecycle stuck → operator must take RC control.

A red health state for these components is degraded-but-survivable:

- `detection_client` → continue zoom-out sweep; lose Tier 1.
- `movement_detector` → continue; lose movement-candidate POI source.
- `semantic_analyzer` → continue; surface Tier-1-only POIs.
- `vlm_client` → fail-closed (POIs surfaced without VLM evidence).
- `mapobjects_store` → continue with in-memory state; persistent diff lost on restart. Sync state may transition to `Stale` (operator visible).
- `mapobjects_sync` (logical, owned by `mission_client`) → mission proceeds with stale snapshot; post-flight push retries via leftover spool. Operator sees `mapobjects_sync = degraded`.
- `operator_bridge` / `telemetry_stream` → continue zoom-out sweep; pause POI surfacing; resume on reconnect. F10 lost-link ladder owns the larger response.
- `gimbal_controller` → pause zoom-in / target-follow; zoom-out sweep stops.
- `mission_client` → continue current mission from in-memory copy.

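The two lists above amount to an aggregation rule: a red state on a flight-critical component dominates, and any other non-green state leaves the system degraded but running. A minimal sketch of that rule follows. The type and function names (`HealthState`, `FlightVerdict`, `flight_verdict`, `gauge_value`) are hypothetical, and treating yellow as "degraded" is an assumption drawn from the `warn` level description rather than something the health endpoint contract states.

```rust
/// Per-component health; the discriminants match the `health_state` gauge
/// encoding (0 red, 1 yellow, 2 green).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum HealthState {
    Red = 0,
    Yellow = 1,
    Green = 2,
}

/// Aggregated verdict for the current flight (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FlightVerdict {
    Nominal,
    Degraded,
    Unrecoverable,
}

/// Components whose red state is unrecoverable for the current flight.
const FLIGHT_CRITICAL: [&str; 3] = ["frame_ingest", "mavlink_layer", "mission_executor"];

/// Numeric value to export as `health_state{component}`.
pub fn gauge_value(state: HealthState) -> u8 {
    state as u8
}

/// Folds per-component states into one verdict.
pub fn flight_verdict(states: &[(&str, HealthState)]) -> FlightVerdict {
    let mut verdict = FlightVerdict::Nominal;
    for (component, state) in states {
        match state {
            // A red flight-critical component ends the evaluation immediately.
            HealthState::Red if FLIGHT_CRITICAL.iter().any(|c| c == component) => {
                return FlightVerdict::Unrecoverable;
            }
            // Any other red, or any yellow, is survivable (assumption for yellow).
            HealthState::Red | HealthState::Yellow => verdict = FlightVerdict::Degraded,
            HealthState::Green => {}
        }
    }
    verdict
}
```

Computing the verdict from component states directly, rather than from log lines, matches the posture that health is aggregated and not derived from logs.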
## 6. Replay-driven debugging

All non-trivial decisions in `scan_controller`, `movement_detector`, `semantic_analyzer`, `vlm_client`, and `mission_executor` are reconstructable from logs + the (size-capped) raw inputs that drove them:

- Frame seq, gimbal state at decode, telemetry sample used, Tier-1 detections returned, Tier-2 score, VLM raw response (size-capped), operator command, resulting state transition.

This is the foundation of the replay-based integration tests in `ci_cd_pipeline.md §2`.

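Reconstructing one frame's journey reduces to filtering log records on `frame_seq` and ordering them by `ts_mono_ns`, which avoids wall-clock skew between components. The sketch below assumes the JSON lines have already been parsed into a record type; `LogRecord` and `frame_journey` are hypothetical names, and only a subset of the per-line fields is modelled.

```rust
/// A parsed log line, reduced to the fields needed for replay ordering.
/// Hypothetical type; field names follow the per-line fields table.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LogRecord {
    pub ts_mono_ns: u64,
    pub target: String,
    pub event: String,
    pub frame_seq: Option<u64>,
}

impl LogRecord {
    pub fn new(ts_mono_ns: u64, target: &str, event: &str, frame_seq: Option<u64>) -> Self {
        Self {
            ts_mono_ns,
            target: target.to_string(),
            event: event.to_string(),
            frame_seq,
        }
    }
}

/// Returns the records belonging to one frame, ordered by monotonic time.
/// Records without a `frame_seq` are ignored.
pub fn frame_journey(records: &[LogRecord], frame_seq: u64) -> Vec<&LogRecord> {
    let mut journey: Vec<&LogRecord> = records
        .iter()
        .filter(|r| r.frame_seq == Some(frame_seq))
        .collect();
    // Stable sort: records with equal timestamps keep their input order.
    journey.sort_by_key(|r| r.ts_mono_ns);
    journey
}
```

The same shape works for a POI by keying on `poi_id` instead of `frame_seq`.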
## 7. Out of scope here

- Suite-wide observability stack choice (OTLP vs Loki vs Tempo vs Promtail): owned by suite ops.
- Persistent log retention policy: owned by suite ops.
- Alerting routing (Slack / PagerDuty / email): owned by suite ops.