Observability
Status: forward-looking design (Rust). Treat the choices below as the intended approach; the exact tracing exporter / metrics scraper / log-shipping target depend on the suite's overall observability stack at deploy time.
1. Posture
- One binary, one process. Per-component instrumentation is structured (each component listed in architecture.md §3 is a tracing target).
- Structured logs are primary; metrics are derived from log spans and counters; traces follow a single frame end to end through the pipeline.
- No silent error swallowing. Every failure path increments a counter, emits a span event, or both.
- Health is aggregated, not derived from logs. The HTTP health endpoint (containerization.md §7) is the source of truth for live readiness.
2. Logs
Library: tracing + tracing-subscriber.
Format: JSON to stdout. Captured by the host's journald (Option A) or by the container runtime (Option B), then shipped to the suite's log aggregator.
Per-line fields:
| Field | Source | Notes |
|---|---|---|
| ts | wall clock | ISO-8601 UTC. |
| ts_mono_ns | monotonic clock | For ordering across components without clock-skew artefacts. |
| level | tracing | error \| warn \| info \| debug \| trace. |
| target | component name | One of frame_ingest, detection_client, movement_detector, semantic_analyzer, vlm_client, scan_controller, mapobjects_store, gimbal_controller, operator_bridge, mission_executor, mavlink_layer, mission_client, telemetry_stream. |
| frame_seq | propagated context | Where applicable. Lets us reconstruct one frame's journey. |
| poi_id, roi_id, target_id, mission_id, command_id | propagated context | Where applicable. |
| event | message | Short, machine-friendly identifier (e.g., frame.dropped, vlm.timeout, mission.geofence_violation, bit.check_failed, failsafe.lost_link, mapobjects.push_failed, operator.auth_rejected). |
| model_version | propagated context | Version string for tier1_model_version and vlm_model_version. Required on every vlm.response and on every Tier-2 evidence span for forensic correlation. |
| wall_clock_source | telemetry frame | gnss \| host \| coast; emitted on every state-transition span and on every operator-command audit log line. |
| reason | message | Free-form for human readers. |
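As a rough illustration of the shape of one such line, the sketch below models a log record and renders it as a single JSON object using only the standard library. It is not the output format of tracing-subscriber; the `LogLine` struct, `escape_json`, and `to_json` are hypothetical names, and only a subset of the fields in the table is modelled.

```rust
use std::fmt::Write;

/// Hypothetical in-memory form of one log line; field names follow the table.
pub struct LogLine<'a> {
    pub ts: &'a str, // ISO-8601 UTC, pre-formatted by the caller
    pub ts_mono_ns: u64,
    pub level: &'a str,
    pub target: &'a str,
    pub event: &'a str,
    pub frame_seq: Option<u64>,  // present only where applicable
    pub reason: Option<&'a str>, // free-form, for human readers
}

/// Escape the characters that JSON requires to be escaped inside a string.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => {
                let _ = write!(out, "\\u{:04x}", c as u32);
            }
            c => out.push(c),
        }
    }
    out
}

impl<'a> LogLine<'a> {
    /// Render the record as one JSON object on a single line.
    pub fn to_json(&self) -> String {
        let mut s = format!(
            "{{\"ts\":\"{}\",\"ts_mono_ns\":{},\"level\":\"{}\",\"target\":\"{}\",\"event\":\"{}\"",
            escape_json(self.ts),
            self.ts_mono_ns,
            escape_json(self.level),
            escape_json(self.target),
            escape_json(self.event),
        );
        // Optional fields are omitted entirely when absent.
        if let Some(seq) = self.frame_seq {
            let _ = write!(s, ",\"frame_seq\":{}", seq);
        }
        if let Some(reason) = self.reason {
            let _ = write!(s, ",\"reason\":\"{}\"", escape_json(reason));
        }
        s.push('}');
        s
    }
}
```

Omitting absent optional fields, rather than emitting nulls, is one possible convention; the document does not prescribe either.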
Log level defaults:
- info: lifecycle (startup / shutdown / state transitions), all error and security events.
- warn: degraded-but-running events (yellow health, retries, drops).
- error: red health, hard failures, schema violations, security violations.
- debug / trace: off in production; enabled per-target via RUST_LOG.
Always logged at warn or higher (per coderule.mdc):
- Every exception path that the operator could care about.
- Authentication / authorisation failures (peer-cred check failures on VLM IPC, malformed Ground Station session, MAVLink-2 signing rejection).
- Geofence violations.
- Schema validation failures (Tier 1 response, VLM response, mission JSON).
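One way to read the rule above is as a floor on the log level for certain event classes. The sketch below assumes a hypothetical `EventClass` enum and `effective_level` helper; neither is part of the design, and the class names simply mirror the four bullets.

```rust
/// Severity ordering from least to most severe, so derived `Ord` can compare levels.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Trace,
    Debug,
    Info,
    Warn,
    Error,
}

/// Hypothetical classification of an event; the first four variants mirror
/// the "always logged at warn or higher" list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EventClass {
    OperatorRelevantFailure,
    AuthFailure,
    GeofenceViolation,
    SchemaValidationFailure,
    Other,
}

/// Raise the requested level to at least `Warn` for the protected classes;
/// leave every other event at the level the caller asked for.
pub fn effective_level(class: EventClass, requested: Level) -> Level {
    match class {
        EventClass::Other => requested,
        _ => requested.max(Level::Warn),
    }
}
```

With this shape, a call site that mistakenly logs a geofence violation at debug would still surface at warn, while an event already at error is left unchanged.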
3. Metrics
Derived from log spans + a small set of explicit counters. Exporter: Prometheus-compatible (per the suite's stack).
Per-component counters (illustrative — exact names finalised at implementation):
| Component | Counter | Type |
|---|---|---|
| frame_ingest | frames_received_total, frames_dropped_total{reason}, decode_errors_total | counter |
| frame_ingest | decode_ms | histogram |
| detection_client | requests_total, errors_total{kind}, latency_ms | counter / histogram |
| movement_detector | candidates_total, telemetry_skew_drops_total | counter |
| semantic_analyzer | tier2_runs_total, tier2_latency_ms, tier2_oversize_total | counter / histogram |
| vlm_client | vlm_requests_total{status}, vlm_latency_ms | counter / histogram |
| scan_controller | state_transitions_total{from,to}, pois_in_queue, pois_per_min, tick_latency_ms | counter / gauge / histogram |
| mapobjects_store | classify_total{result}, ignored_items_total, removed_candidates_total | counter |
| gimbal_controller | commands_total, decision_to_movement_ms, zoom_transition_ms, vendor_faults_total | counter / histogram |
| mavlink_layer | messages_in_total{kind}, messages_out_total{kind}, command_acks_total{result}, parse_errors_total, link_state | counter / gauge |
| mission_executor | state_transitions_total{from,to}, mission_uploads_total{result}, geofence_violations_total{kind} | counter |
| mission_client | fetches_total{result}, middle_waypoint_posts_total{result}, mapobjects_pull_total{result}, mapobjects_push_total{result}, mapobjects_pull_bytes, mapobjects_push_bytes, mapobjects_sync_lag_s | counter / gauge |
| mission_executor (BIT) | bit_runs_total{result}, bit_check_failures_total{check} | counter |
| mission_executor (failsafe) | link_loss_events_total{trigger}, failsafe_action_total{action} | counter |
| operator_bridge | pois_surfaced_total, commands_received_total{kind,result}, decision_latency_ms, auth_rejections_total{reason}, command_e2e_ms | counter / histogram |
| telemetry_stream | bytes_out_total, frames_out_total, link_state, bandwidth_used_mbps | counter / gauge |
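To make the `{reason}`-style label notation concrete, here is a minimal single-threaded sketch of a counter with one label, rendered as Prometheus text-format sample lines. `LabelledCounter` is a hypothetical type, the label values in the usage are invented, and label-value escaping is left out; a real implementation would more likely use an existing metrics crate with atomic counters.

```rust
use std::collections::BTreeMap;

/// Hypothetical counter with a single label, e.g. `frames_dropped_total{reason}`.
pub struct LabelledCounter {
    name: &'static str,
    label: &'static str,
    values: BTreeMap<String, u64>, // one running total per label value
}

impl LabelledCounter {
    pub fn new(name: &'static str, label: &'static str) -> Self {
        LabelledCounter { name, label, values: BTreeMap::new() }
    }

    /// Increment the series identified by `label_value`, creating it at zero if new.
    pub fn inc(&mut self, label_value: &str) {
        *self.values.entry(label_value.to_string()).or_insert(0) += 1;
    }

    /// Render one sample line per label value, sorted by label value.
    /// Note: label values are not escaped in this sketch.
    pub fn render(&self) -> String {
        let mut out = String::new();
        for (value, count) in &self.values {
            out.push_str(&format!(
                "{}{{{}=\"{}\"}} {}\n",
                self.name, self.label, value, count
            ));
        }
        out
    }
}
```

For example, incrementing `frames_dropped_total` twice with a `queue_full` reason and once with `decode_timeout` (both invented reasons) yields two sample lines, one per reason.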
Aggregated:
- health_state{component}: 0 (red) / 1 (yellow) / 2 (green); enables per-component alerting.
- process_uptime_seconds, process_resident_memory_bytes: standard.
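The numeric encoding of health_state could be captured in a small enum, as in the sketch below. The `Health` type and `render_health` helper are hypothetical; only the 0 / 1 / 2 mapping comes from the text.

```rust
/// Three-level component health, as used by the `health_state` gauge.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Red,
    Yellow,
    Green,
}

impl Health {
    /// Gauge encoding from the text: 0 (red), 1 (yellow), 2 (green).
    pub fn gauge_value(self) -> u8 {
        match self {
            Health::Red => 0,
            Health::Yellow => 1,
            Health::Green => 2,
        }
    }
}

/// Render a single `health_state` sample line for one component.
pub fn render_health(component: &str, health: Health) -> String {
    format!(
        "health_state{{component=\"{}\"}} {}",
        component,
        health.gauge_value()
    )
}
```

Because green is the highest value, a per-component alert can be phrased as a threshold on the gauge dropping below 2.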
4. Traces
tracing spans cover the path of a single frame and the path of a single POI.
Frame trace (per Frame):
frame_ingest.publish
detection_client.request
detection_client.response
movement_detector.tick
[movement_detector.emit_candidate]
telemetry_stream.push
POI trace (per POI):
scan_controller.enqueue
scan_controller.dequeue
gimbal_controller.zoom
semantic_analyzer.tier2
[vlm_client.request -> vlm_client.response]
operator_bridge.surface
[operator_bridge.confirm | decline | timeout]
mission_executor.middle_waypoint # confirm path
mapobjects_store.append_ignored # decline path
Spans propagate via context across in-process channels. Trace export target depends on the suite's stack (OTLP / Jaeger / Tempo).
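The sketch below illustrates the idea of carrying context across an in-process channel using only the standard library: each message travels in an envelope that holds the frame's identity, so the receiving stage can tag its own span with the same frame_seq. `FrameCtx`, `Envelope`, and `run_pipeline` are hypothetical names, and the "spans" here are plain records; with the tracing crate the envelope would more plausibly carry a `tracing::Span` handle instead.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical propagated context: just the frame's sequence number.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FrameCtx {
    pub frame_seq: u64,
}

/// A payload plus the context it travels with.
pub struct Envelope<T> {
    pub ctx: FrameCtx,
    pub payload: T,
}

/// Run a two-stage pipeline and return the (span name, frame_seq) records
/// it produced: all producer records first, then all consumer records.
pub fn run_pipeline(frame_seqs: &[u64]) -> Vec<(String, u64)> {
    let (tx, rx) = mpsc::channel::<Envelope<Vec<u8>>>();

    // Consumer stage: reads the context from the envelope, not from globals.
    let consumer = thread::spawn(move || {
        let mut spans = Vec::new();
        for env in rx {
            let _bytes = env.payload.len(); // stand-in for real work on the frame
            spans.push(("detection_client.request".to_string(), env.ctx.frame_seq));
        }
        spans
    });

    // Producer stage: attaches the context when publishing.
    let mut spans = Vec::new();
    for &seq in frame_seqs {
        spans.push(("frame_ingest.publish".to_string(), seq));
        tx.send(Envelope { ctx: FrameCtx { frame_seq: seq }, payload: vec![0u8; 4] })
            .expect("consumer should still be running");
    }
    drop(tx); // close the channel so the consumer loop ends

    spans.extend(consumer.join().expect("consumer thread panicked"));
    spans
}
```

Keeping the context inside the message, rather than in thread-local state, is what lets the frame's identity survive the hop between stages.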
5. Health endpoint
See containerization.md §7. The endpoint is the operator-facing readiness API; metrics + logs are the engineer-facing investigation API.
A red health state for any of these components is unrecoverable for the current flight:
- frame_ingest red → no input → cannot operate.
- mavlink_layer red → no UAV control → trigger RTL via the autopilot's failsafe (the autopilot itself enforces this when MAVLink heartbeat stops).
- mission_executor red → mission lifecycle stuck → operator must take RC control.
A red health state for these components is degraded-but-survivable:
- detection_client → continue zoom-out sweep; lose Tier 1.
- movement_detector → continue; lose movement-candidate POI source.
- semantic_analyzer → continue; surface Tier-1-only POIs.
- vlm_client → fail-closed (POIs surfaced without VLM evidence).
- mapobjects_store → continue with in-memory state; persistent diff lost on restart. Sync state may transition to Stale (operator visible).
- mapobjects_sync (logical, owned by mission_client) → mission proceeds with stale snapshot; post-flight push retries via leftover spool. Operator sees mapobjects_sync = degraded.
- operator_bridge / telemetry_stream → continue zoom-out sweep; pause POI surfacing; resume on reconnect. F10 lost-link ladder owns the larger response.
- gimbal_controller → pause zoom-in / target-follow; zoom-out sweep stops.
- mission_client → continue current mission from in-memory copy.
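The two lists above amount to a classification of what a red state means per component, which could be sketched as a lookup. `RedImpact`, `red_impact`, and `flight_can_continue` are hypothetical names. Components that this section does not classify (scan_controller appears in neither list) return `None`, and treating an unclassified red component as blocking is an assumption of the sketch, not a stated rule.

```rust
/// What a red health state means for the current flight.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RedImpact {
    Unrecoverable,
    Degraded,
}

/// Classify a component by the two lists in this section.
/// Returns `None` for components the section does not mention.
pub fn red_impact(component: &str) -> Option<RedImpact> {
    match component {
        "frame_ingest" | "mavlink_layer" | "mission_executor" => {
            Some(RedImpact::Unrecoverable)
        }
        "detection_client" | "movement_detector" | "semantic_analyzer" | "vlm_client"
        | "mapobjects_store" | "mapobjects_sync" | "operator_bridge"
        | "telemetry_stream" | "gimbal_controller" | "mission_client" => {
            Some(RedImpact::Degraded)
        }
        _ => None,
    }
}

/// True only if every red component is in the degraded-but-survivable list.
/// Unclassified components are conservatively treated as blocking (assumption).
pub fn flight_can_continue(red_components: &[&str]) -> bool {
    red_components
        .iter()
        .all(|&c| red_impact(c) == Some(RedImpact::Degraded))
}
```

An empty list of red components trivially allows the flight to continue, which matches the all-green case.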
6. Replay-driven debugging
All non-trivial decisions in scan_controller, movement_detector, semantic_analyzer, vlm_client, and mission_executor are reconstructable from logs + the (size-capped) raw inputs that drove them:
- Frame seq, gimbal state at decode, telemetry sample used, Tier-1 detections returned, Tier-2 score, VLM raw response (size-capped), operator command, resulting state transition.
This is the foundation of the replay-based integration tests in ci_cd_pipeline.md §2.
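As a sketch of what "size-capped" might look like for the VLM raw response, the helper below truncates a string to a byte budget without splitting a UTF-8 character, and a record type applies it on construction. `cap_utf8`, `DecisionRecord`, and the 4096-byte cap are all assumptions for illustration; the document does not state the actual cap or record layout.

```rust
/// Assumed cap for illustration only; the real limit is not specified here.
pub const VLM_RAW_CAP_BYTES: usize = 4096;

/// Return the longest prefix of `s` that fits in `max_bytes`
/// and ends on a UTF-8 character boundary.
pub fn cap_utf8(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Hypothetical subset of the inputs recorded for one decision.
pub struct DecisionRecord {
    pub frame_seq: u64,
    pub tier2_score: f32,
    pub vlm_raw_response: String, // stored already capped
}

impl DecisionRecord {
    /// Build a record, capping the raw VLM response at `VLM_RAW_CAP_BYTES`.
    pub fn new(frame_seq: u64, tier2_score: f32, vlm_raw: &str) -> Self {
        DecisionRecord {
            frame_seq,
            tier2_score,
            vlm_raw_response: cap_utf8(vlm_raw, VLM_RAW_CAP_BYTES).to_string(),
        }
    }
}
```

Capping at construction time means every stored record respects the budget, so a replay never has to handle an oversized input.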
7. Out of scope here
- Suite-wide observability stack choice (OTLP vs Loki vs Tempo vs Promtail) — owned by suite ops.
- Persistent log retention policy — owned by suite ops.
- Alerting routing (Slack / PagerDuty / email) — owned by suite ops.