# Observability

**Status**: forward-looking design (Rust). Treat the choices below as the intended approach; the exact tracing exporter / metrics scraper / log-shipping target depend on the suite's overall observability stack at deploy time.

## 1. Posture

- **One binary, one process.** Per-component instrumentation is structured (each component listed in `architecture.md §3` is a `tracing` target).
- **Structured logs are primary**, metrics are derived from log spans and counters, traces are end-to-end on a frame's journey through the pipeline.
- **No silent error swallowing.** Every failure path increments a counter, emits a span event, or both.
- **Health is aggregated**, not derived from logs. The HTTP health endpoint (`containerization.md §7`) is the source of truth for live readiness.

## 2. Logs

**Library**: `tracing` + `tracing-subscriber`.

**Format**: JSON to stdout. Captured by the host's journald (Option A) or by the container runtime (Option B), then shipped to the suite's log aggregator.

**Per-line fields:**

| Field | Source | Notes |
|---|---|---|
| `ts` | wall clock | ISO-8601 UTC. |
| `ts_mono_ns` | monotonic clock | For ordering across components without clock-skew artefacts. |
| `level` | `tracing` | `error \| warn \| info \| debug \| trace`. |
| `target` | component name | One of `frame_ingest`, `detection_client`, `movement_detector`, `semantic_analyzer`, `vlm_client`, `scan_controller`, `mapobjects_store`, `gimbal_controller`, `operator_bridge`, `mission_executor`, `mavlink_layer`, `mission_client`, `telemetry_stream`. |
| `frame_seq` | propagated context | Where applicable. Lets us reconstruct one frame's journey. |
| `poi_id`, `roi_id`, `target_id`, `mission_id`, `command_id` | propagated context | Where applicable. |
| `event` | message | Short, machine-friendly identifier (e.g., `frame.dropped`, `vlm.timeout`, `mission.geofence_violation`, `bit.check_failed`, `failsafe.lost_link`, `mapobjects.push_failed`, `operator.auth_rejected`). |
| `model_version` | propagated context | Version string for `tier1_model_version` and `vlm_model_version`. Required on every `vlm.response` and on every Tier-2 evidence span for forensic correlation. |
| `wall_clock_source` | telemetry frame | `gnss \| host \| coast`; emitted on every state-transition span and on every operator-command audit log line. |
| `reason` | message | Free-form for human readers. |

**Log level defaults:**

- `info`: lifecycle (startup / shutdown / state transitions), all error and security events.
- `warn`: degraded-but-running events (yellow health, retries, drops).
- `error`: red health, hard failures, schema violations, security violations.
- `debug` / `trace`: off in production; enabled per-target via `RUST_LOG`.

**Always logged at `warn` or higher** (per `coderule.mdc`):

- Every exception path that the operator could care about.
- Authentication / authorisation failures (peer-cred check failures on VLM IPC, malformed Ground Station session, MAVLink-2 signing rejection).
- Geofence violations.
- Schema validation failures (Tier 1 response, VLM response, mission JSON).

## 3. Metrics

Derived from log spans + a small set of explicit counters. Exporter: Prometheus-compatible (per the suite's stack).
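As a rough illustration of what one explicit, labelled counter could look like when exposed to a Prometheus-compatible scraper, the sketch below keeps per-label counts in memory and renders them in the Prometheus text exposition format. It is hand-rolled on the standard library only. The `LabelledCounter` type and the label values `queue_full` and `decode_error` are hypothetical names introduced for this example, not part of the design or of any crate. A real implementation would more likely use an existing metrics crate, would need to be thread-safe, and would have to escape label values.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// Hypothetical helper: a monotonically increasing counter with a single
/// label dimension, e.g. `frames_dropped_total{reason="..."}`.
pub struct LabelledCounter {
    name: &'static str,
    label: &'static str,
    // BTreeMap keeps label values sorted, so the rendered output is stable.
    values: BTreeMap<String, u64>,
}

impl LabelledCounter {
    pub fn new(name: &'static str, label: &'static str) -> Self {
        Self { name, label, values: BTreeMap::new() }
    }

    /// Increment the series for one label value by 1.
    pub fn inc(&mut self, label_value: &str) {
        *self.values.entry(label_value.to_string()).or_insert(0) += 1;
    }

    /// Render all series in the Prometheus text exposition format.
    /// Assumption: label values are plain identifiers, so no escaping is done.
    pub fn render(&self) -> String {
        let mut out = String::new();
        writeln!(out, "# TYPE {} counter", self.name).unwrap();
        for (value, count) in &self.values {
            writeln!(out, "{}{{{}=\"{}\"}} {}", self.name, self.label, value, count).unwrap();
        }
        out
    }
}
```

Creating the counter with `LabelledCounter::new("frames_dropped_total", "reason")`, calling `inc("queue_full")` on each drop, and serving `render()` from the scrape endpoint would yield one line per distinct reason. Keeping the label set small and enumerable matters here, since every distinct label value becomes its own time series on the scraper side.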
**Per-component counters** (illustrative; exact names finalised at implementation):

| Component | Counter | Type |
|---|---|---|
| `frame_ingest` | `frames_received_total`, `frames_dropped_total{reason}`, `decode_errors_total` | counter |
| `frame_ingest` | `decode_ms` | histogram |
| `detection_client` | `requests_total`, `errors_total{kind}`, `latency_ms` | counter / histogram |
| `movement_detector` | `candidates_total`, `telemetry_skew_drops_total` | counter |
| `semantic_analyzer` | `tier2_runs_total`, `tier2_latency_ms`, `tier2_oversize_total` | counter / histogram |
| `vlm_client` | `vlm_requests_total{status}`, `vlm_latency_ms` | counter / histogram |
| `scan_controller` | `state_transitions_total{from,to}`, `pois_in_queue`, `pois_per_min`, `tick_latency_ms` | counter / gauge / histogram |
| `mapobjects_store` | `classify_total{result}`, `ignored_items_total`, `removed_candidates_total` | counter |
| `gimbal_controller` | `commands_total`, `decision_to_movement_ms`, `zoom_transition_ms`, `vendor_faults_total` | counter / histogram |
| `mavlink_layer` | `messages_in_total{kind}`, `messages_out_total{kind}`, `command_acks_total{result}`, `parse_errors_total`, `link_state` | counter / gauge |
| `mission_executor` | `state_transitions_total{from,to}`, `mission_uploads_total{result}`, `geofence_violations_total{kind}` | counter |
| `mission_client` | `fetches_total{result}`, `middle_waypoint_posts_total{result}`, `mapobjects_pull_total{result}`, `mapobjects_push_total{result}`, `mapobjects_pull_bytes`, `mapobjects_push_bytes`, `mapobjects_sync_lag_s` | counter / gauge |
| `mission_executor` (BIT) | `bit_runs_total{result}`, `bit_check_failures_total{check}` | counter |
| `mission_executor` (failsafe) | `link_loss_events_total{trigger}`, `failsafe_action_total{action}` | counter |
| `operator_bridge` | `pois_surfaced_total`, `commands_received_total{kind,result}`, `decision_latency_ms`, `auth_rejections_total{reason}`, `command_e2e_ms` | counter / histogram |
| `telemetry_stream` | `bytes_out_total`, `frames_out_total`, `link_state`, `bandwidth_used_mbps` | counter / gauge |

**Aggregated:**

- `health_state{component}`: 0 (red) / 1 (yellow) / 2 (green); enables alerting per-component.
- `process_uptime_seconds`, `process_resident_memory_bytes`: standard.

## 4. Traces

`tracing` spans cover the path of a single frame and the path of a single POI.

**Frame trace** (per `Frame`):

```text
frame_ingest.publish
detection_client.request
detection_client.response
movement_detector.tick
[movement_detector.emit_candidate]
telemetry_stream.push
```

**POI trace** (per `POI`):

```text
scan_controller.enqueue
scan_controller.dequeue
gimbal_controller.zoom
semantic_analyzer.tier2
[vlm_client.request -> vlm_client.response]
operator_bridge.surface
[operator_bridge.confirm | decline | timeout]
mission_executor.middle_waypoint   # confirm path
mapobjects_store.append_ignored    # decline path
```

Spans propagate via context across in-process channels. Trace export target depends on the suite's stack (OTLP / Jaeger / Tempo).

## 5. Health endpoint

See `containerization.md §7`. The endpoint is the operator-facing readiness API; metrics + logs are the engineer-facing investigation API.

A red health state for any of these components is unrecoverable for the current flight:

- `frame_ingest` red → no input → cannot operate.
- `mavlink_layer` red → no UAV control → trigger RTL via the autopilot's failsafe (the autopilot itself enforces this when MAVLink heartbeat stops).
- `mission_executor` red → mission lifecycle stuck → operator must take RC control.

A red health state for these components is degraded-but-survivable:

- `detection_client` → continue zoom-out sweep; lose Tier 1.
- `movement_detector` → continue; lose movement-candidate POI source.
- `semantic_analyzer` → continue; surface Tier-1-only POIs.
- `vlm_client` → fail-closed (POIs surfaced without VLM evidence).
- `mapobjects_store` → continue with in-memory state; persistent diff lost on restart. Sync state may transition to `Stale` (operator visible).
- `mapobjects_sync` (logical, owned by `mission_client`) → mission proceeds with stale snapshot; post-flight push retries via leftover spool. Operator sees `mapobjects_sync = degraded`.
- `operator_bridge` / `telemetry_stream` → continue zoom-out sweep; pause POI surfacing; resume on reconnect. F10 lost-link ladder owns the larger response.
- `gimbal_controller` → pause zoom-in / target-follow; zoom-out sweep stops.
- `mission_client` → continue current mission from in-memory copy.

## 6. Replay-driven debugging

All non-trivial decisions in `scan_controller`, `movement_detector`, `semantic_analyzer`, `vlm_client`, and `mission_executor` are reconstructable from logs + the (size-capped) raw inputs that drove them:

- Frame seq, gimbal state at decode, telemetry sample used, Tier-1 detections returned, Tier-2 score, VLM raw response (size-capped), operator command, resulting state transition.

This is the foundation of the replay-based integration tests in `ci_cd_pipeline.md §2`.

## 7. Out of scope here

- Suite-wide observability stack choice (OTLP vs Loki vs Tempo vs Promtail): owned by suite ops.
- Persistent log retention policy: owned by suite ops.
- Alerting routing (Slack / PagerDuty / email): owned by suite ops.