Oleksandr Bezdieniezhnykh bc40ea7300 [AZ-626] Decompose complete: 47 tasks + docs + module layout
Greenfield Steps 1-6 baseline for the autopilot rewrite from legacy
Qt/C++ to a Rust workspace.

- Remove legacy Qt/C++ tree (ai_controller, drone_controller,
  misc/camera, python_scaffold, root Dockerfile, autopilot.pro,
  legacy main.py / requirements.txt).
- Add _docs/00_problem (problem, restrictions, acceptance criteria,
  security approach, input data + fixtures).
- Add _docs/01_solution/solution_draft01.
- Add _docs/02_document (architecture, system-flows, data_model,
  glossary, decision-rationale, deployment, 13 component descriptions,
  tests/ specs, FINAL_report, module-layout).
- Add _docs/02_tasks/todo with 47 task specs (AZ-640..AZ-686, one
  bootstrap + 46 component tasks) and _dependencies_table.md.
- Add .cursor/rules/artifact-srp.mdc (single-responsibility rule for
  canonical _docs artifacts).
- Track autodev state in _docs/_autodev_state.md (Step 6 completed,
  ready for Step 7 Implement).

Jira: bootstrap AZ-626; component epics AZ-627..AZ-639; tasks
AZ-640..AZ-686. Total complexity 173 points across 12 epics.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-19 11:02:01 +03:00

Observability

Status: forward-looking design (Rust). Treat the choices below as the intended approach; the exact tracing exporter / metrics scraper / log-shipping target depend on the suite's overall observability stack at deploy time.

1. Posture

  • One binary, one process. Per-component instrumentation is structured (each component listed in architecture.md §3 is a tracing target).
  • Structured logs are primary; metrics are derived from log spans and counters; traces follow a frame end-to-end through the pipeline.
  • No silent error swallowing. Every failure path increments a counter, emits a span event, or both.
  • Health is aggregated, not derived from logs. The HTTP health endpoint (containerization.md §7) is the source of truth for live readiness.

2. Logs

Library: tracing + tracing-subscriber.

Format: JSON to stdout. Captured by the host's journald (Option A) or by the container runtime (Option B), then shipped to the suite's log aggregator.

Per-line fields:

| Field | Source | Notes |
| --- | --- | --- |
| ts | wall clock | ISO-8601 UTC. |
| ts_mono_ns | monotonic clock | For ordering across components without clock-skew artefacts. |
| level | tracing | One of error, warn, info, debug, trace. |
| target | component name | One of frame_ingest, detection_client, movement_detector, semantic_analyzer, vlm_client, scan_controller, mapobjects_store, gimbal_controller, operator_bridge, mission_executor, mavlink_layer, mission_client, telemetry_stream. |
| frame_seq | propagated context | Where applicable. Lets us reconstruct one frame's journey. |
| poi_id, roi_id, target_id, mission_id, command_id | propagated context | Where applicable. |
| event | message | Short, machine-friendly identifier (e.g., frame.dropped, vlm.timeout, mission.geofence_violation, bit.check_failed, failsafe.lost_link, mapobjects.push_failed, operator.auth_rejected). |
| model_version | propagated context | Version string for tier1_model_version and vlm_model_version. Required on every vlm.response and on every Tier-2 evidence span for forensic correlation. |
| wall_clock_source | telemetry frame | One of gnss, host, coast; emitted on every state-transition span and on every operator-command audit log line. |
| reason | message | Free-form for human readers. |
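To make the per-line shape concrete, here is a minimal std-only sketch of one JSON log line carrying a subset of the fields above. The `LogLine` struct, `escape_json`, and `to_json` are hypothetical names for illustration only; in the intended design the JSON formatter of tracing-subscriber would produce the line, and its exact key layout may differ from this sketch.

```rust
use std::fmt::Write;

/// Hypothetical per-line record; field names follow the table above.
pub struct LogLine<'a> {
    pub ts: &'a str,            // ISO-8601 UTC, formatted by the caller
    pub ts_mono_ns: u64,        // monotonic nanoseconds
    pub level: &'a str,
    pub target: &'a str,
    pub frame_seq: Option<u64>, // omitted from the output when not applicable
    pub event: &'a str,
    pub reason: &'a str,
}

/// Escape the characters JSON requires inside a string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => {
                let _ = write!(out, "\\u{:04x}", c as u32);
            }
            c => out.push(c),
        }
    }
    out
}

impl LogLine<'_> {
    /// Render the record as a single JSON object on one line.
    pub fn to_json(&self) -> String {
        let mut s = String::new();
        let _ = write!(
            s,
            "{{\"ts\":\"{}\",\"ts_mono_ns\":{},\"level\":\"{}\",\"target\":\"{}\"",
            escape_json(self.ts),
            self.ts_mono_ns,
            escape_json(self.level),
            escape_json(self.target)
        );
        if let Some(seq) = self.frame_seq {
            let _ = write!(s, ",\"frame_seq\":{}", seq);
        }
        let _ = write!(
            s,
            ",\"event\":\"{}\",\"reason\":\"{}\"}}",
            escape_json(self.event),
            escape_json(self.reason)
        );
        s
    }
}
```

Keeping `event` machine-friendly and `reason` free-form means aggregator queries can match on `event` exactly, while `reason` is free to change wording without breaking dashboards.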

Log level defaults:

  • info: lifecycle (startup / shutdown / state transitions), all error and security events.
  • warn: degraded-but-running events (yellow health, retries, drops).
  • error: red health, hard failures, schema violations, security violations.
  • debug / trace: off in production; enabled per-target via RUST_LOG.
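The per-target override can be pictured with a small std-only sketch of how a directive string such as `info,vlm_client=debug` might be evaluated. This is a simplification, not the tracing-subscriber implementation: the `Filter` type is hypothetical, targets are matched exactly rather than by prefix, and a bare directive is always read as the default level.

```rust
use std::collections::HashMap;

/// Levels ordered from most to least severe, so `level <= max` means "enabled".
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

fn parse_level(s: &str) -> Option<Level> {
    match s.trim().to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// Hypothetical filter: one default level plus exact-match per-target overrides.
pub struct Filter {
    default: Level,
    per_target: HashMap<String, Level>,
}

impl Filter {
    /// Parse comma-separated directives; unparseable directives are skipped.
    pub fn parse(spec: &str) -> Filter {
        let mut f = Filter { default: Level::Info, per_target: HashMap::new() };
        for directive in spec.split(',').map(str::trim).filter(|d| !d.is_empty()) {
            match directive.split_once('=') {
                Some((target, level)) => {
                    if let Some(l) = parse_level(level) {
                        f.per_target.insert(target.trim().to_string(), l);
                    }
                }
                None => {
                    if let Some(l) = parse_level(directive) {
                        f.default = l;
                    }
                }
            }
        }
        f
    }

    /// True when an event at `level` from `target` should be emitted.
    pub fn enabled(&self, target: &str, level: Level) -> bool {
        let max = self.per_target.get(target).copied().unwrap_or(self.default);
        level <= max
    }
}
```

With `RUST_LOG=info,vlm_client=debug`, this sketch emits debug lines from vlm_client only, while every other component stays at info.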

Always logged at warn or higher (per coderule.mdc):

  • Every exception path that the operator could care about.
  • Authentication / authorisation failures (peer-cred check failures on VLM IPC, malformed Ground Station session, MAVLink-2 signing rejection).
  • Geofence violations.
  • Schema validation failures (Tier 1 response, VLM response, mission JSON).

3. Metrics

Derived from log spans + a small set of explicit counters. Exporter: Prometheus-compatible (per the suite's stack).

Per-component counters (illustrative — exact names finalised at implementation):

| Component | Metrics | Type |
| --- | --- | --- |
| frame_ingest | frames_received_total, frames_dropped_total{reason}, decode_errors_total | counter |
| frame_ingest | decode_ms | histogram |
| detection_client | requests_total, errors_total{kind}, latency_ms | counter / histogram |
| movement_detector | candidates_total, telemetry_skew_drops_total | counter |
| semantic_analyzer | tier2_runs_total, tier2_latency_ms, tier2_oversize_total | counter / histogram |
| vlm_client | vlm_requests_total{status}, vlm_latency_ms | counter / histogram |
| scan_controller | state_transitions_total{from,to}, pois_in_queue, pois_per_min, tick_latency_ms | counter / gauge / histogram |
| mapobjects_store | classify_total{result}, ignored_items_total, removed_candidates_total | counter |
| gimbal_controller | commands_total, decision_to_movement_ms, zoom_transition_ms, vendor_faults_total | counter / histogram |
| mavlink_layer | messages_in_total{kind}, messages_out_total{kind}, command_acks_total{result}, parse_errors_total, link_state | counter / gauge |
| mission_executor | state_transitions_total{from,to}, mission_uploads_total{result}, geofence_violations_total{kind} | counter |
| mission_client | fetches_total{result}, middle_waypoint_posts_total{result}, mapobjects_pull_total{result}, mapobjects_push_total{result}, mapobjects_pull_bytes, mapobjects_push_bytes, mapobjects_sync_lag_s | counter / gauge |
| mission_executor (BIT) | bit_runs_total{result}, bit_check_failures_total{check} | counter |
| mission_executor (failsafe) | link_loss_events_total{trigger}, failsafe_action_total{action} | counter |
| operator_bridge | pois_surfaced_total, commands_received_total{kind,result}, decision_latency_ms, auth_rejections_total{reason}, command_e2e_ms | counter / histogram |
| telemetry_stream | bytes_out_total, frames_out_total, link_state, bandwidth_used_mbps | counter / gauge |
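As an illustration of a labelled counter such as frames_dropped_total{reason}, the following std-only sketch keeps one count per label value and renders it in the Prometheus text exposition format. `CounterVec` is a hypothetical type, the label values `queue_full` and `decode_error` are made up for the example, and label-value escaping is omitted; a real build would more likely use an existing metrics crate.

```rust
use std::collections::BTreeMap;

/// Hypothetical counter family keyed by a single label.
pub struct CounterVec {
    name: &'static str,
    label: &'static str,
    values: BTreeMap<String, u64>, // BTreeMap keeps the rendered output in a stable order
}

impl CounterVec {
    pub fn new(name: &'static str, label: &'static str) -> Self {
        CounterVec { name, label, values: BTreeMap::new() }
    }

    /// Increment the series identified by `label_value`, creating it at zero if absent.
    pub fn inc(&mut self, label_value: &str) {
        *self.values.entry(label_value.to_string()).or_insert(0) += 1;
    }

    /// Render a TYPE line followed by one sample line per label value.
    pub fn render(&self) -> String {
        let mut out = format!("# TYPE {} counter\n", self.name);
        for (value, count) in &self.values {
            out.push_str(&format!(
                "{}{{{}=\"{}\"}} {}\n",
                self.name, self.label, value, count
            ));
        }
        out
    }
}
```

Keeping the label set small and enumerable (a fixed list of reasons, kinds, or results) bounds the number of series each counter can create.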

Aggregated:

  • health_state{component} — 0 (red) / 1 (yellow) / 2 (green); enables alerting per-component.
  • process_uptime_seconds, process_resident_memory_bytes — standard.
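The numeric encoding of health_state can be sketched as an enum whose discriminants match the 0 / 1 / 2 values above, rendered as a Prometheus-style gauge. The `Health` enum and `render_health` function are hypothetical names used only for this example.

```rust
/// Discriminants mirror the gauge values: 0 (red), 1 (yellow), 2 (green).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Red = 0,
    Yellow = 1,
    Green = 2,
}

/// Render one gauge sample per component in the Prometheus text format.
pub fn render_health(states: &[(&str, Health)]) -> String {
    let mut out = String::from("# TYPE health_state gauge\n");
    for (component, health) in states {
        out.push_str(&format!(
            "health_state{{component=\"{}\"}} {}\n",
            component, *health as u8
        ));
    }
    out
}
```

Because red is the lowest value, a per-component alert can be written as a simple threshold on the gauge, for example firing when a component's value equals 0.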

4. Traces

tracing spans cover the path of a single frame and the path of a single POI.

Frame trace (per Frame):

frame_ingest.publish
  detection_client.request
    detection_client.response
  movement_detector.tick
    [movement_detector.emit_candidate]
  telemetry_stream.push

POI trace (per POI):

scan_controller.enqueue
  scan_controller.dequeue
    gimbal_controller.zoom
    semantic_analyzer.tier2
      [vlm_client.request -> vlm_client.response]
    operator_bridge.surface
      [operator_bridge.confirm | decline | timeout]
        mission_executor.middle_waypoint    # confirm path
        mapobjects_store.append_ignored     # decline path

Spans propagate via context across in-process channels. Trace export target depends on the suite's stack (OTLP / Jaeger / Tempo).
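One way to picture context propagation across an in-process channel is to send the trace context together with the payload, so the receiving component can open its span as a child of the sender's. The sketch below uses only std; `TraceCtx`, `Envelope`, and `run_pipeline` are hypothetical names, and with the tracing crate the envelope would more likely carry a span handle than a list of span names.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical trace context: the frame it belongs to and the spans entered so far.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TraceCtx {
    pub frame_seq: u64,
    pub path: Vec<&'static str>,
}

impl TraceCtx {
    pub fn root(frame_seq: u64, span: &'static str) -> Self {
        TraceCtx { frame_seq, path: vec![span] }
    }

    /// Derive a child context that records one more span on the same frame.
    pub fn child(&self, span: &'static str) -> Self {
        let mut next = self.clone();
        next.path.push(span);
        next
    }
}

/// The context travels with the payload through the channel.
pub struct Envelope<T> {
    pub ctx: TraceCtx,
    pub payload: T,
}

/// Send one frame from a producer to a worker thread and return the span path seen by the worker.
pub fn run_pipeline(frame_seq: u64) -> Vec<&'static str> {
    let (tx, rx) = mpsc::channel::<Envelope<Vec<u8>>>();

    let worker = thread::spawn(move || {
        let msg = rx.recv().expect("sender dropped before sending");
        // The receiver continues the trace instead of starting a new one.
        msg.ctx.child("detection_client.request").path
    });

    let ctx = TraceCtx::root(frame_seq, "frame_ingest.publish");
    tx.send(Envelope { ctx, payload: vec![0u8; 4] })
        .expect("receiver dropped");

    worker.join().expect("worker panicked")
}
```

Carrying the context in the message, rather than in thread-local state, keeps the parent-child relationship intact when the consumer runs on a different thread or task.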

5. Health endpoint

See containerization.md §7. The endpoint is the operator-facing readiness API; metrics + logs are the engineer-facing investigation API.

A red health state for any of these components is unrecoverable for the current flight:

  • frame_ingest red → no input → cannot operate.
  • mavlink_layer red → no UAV control → trigger RTL via the autopilot's failsafe (the autopilot itself enforces this when MAVLink heartbeat stops).
  • mission_executor red → mission lifecycle stuck → operator must take RC control.

A red health state for these components is degraded-but-survivable:

  • detection_client → continue zoom-out sweep; lose Tier 1.
  • movement_detector → continue; lose movement-candidate POI source.
  • semantic_analyzer → continue; surface Tier-1-only POIs.
  • vlm_client → fail-closed (POIs surfaced without VLM evidence).
  • mapobjects_store → continue with in-memory state; persistent diff lost on restart. Sync state may transition to Stale (operator visible).
  • mapobjects_sync (logical, owned by mission_client) → mission proceeds with stale snapshot; post-flight push retries via leftover spool. Operator sees mapobjects_sync = degraded.
  • operator_bridge / telemetry_stream → continue zoom-out sweep; pause POI surfacing; resume on reconnect. F10 lost-link ladder owns the larger response.
  • gimbal_controller → pause zoom-in / target-follow; zoom-out sweep stops.
  • mission_client → continue current mission from in-memory copy.
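The two lists above amount to a lookup from component name to the impact of a red state. A minimal sketch, assuming the lists are exhaustive as written: `RedImpact`, `red_impact`, and `flight_can_continue` are hypothetical names, and treating an unlisted component (for example scan_controller) as blocking is a conservative assumption of this sketch, not something the lists specify.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RedImpact {
    /// Unrecoverable for the current flight.
    Unrecoverable,
    /// Degraded but survivable.
    Degraded,
    /// Not covered by either list.
    Unknown,
}

/// Classify what a red health state means for the given component.
pub fn red_impact(component: &str) -> RedImpact {
    match component {
        "frame_ingest" | "mavlink_layer" | "mission_executor" => RedImpact::Unrecoverable,
        "detection_client"
        | "movement_detector"
        | "semantic_analyzer"
        | "vlm_client"
        | "mapobjects_store"
        | "mapobjects_sync"
        | "operator_bridge"
        | "telemetry_stream"
        | "gimbal_controller"
        | "mission_client" => RedImpact::Degraded,
        _ => RedImpact::Unknown,
    }
}

/// True when every red component is in the degraded-but-survivable list.
/// Unknown components are treated as blocking (assumption of this sketch).
pub fn flight_can_continue(red_components: &[&str]) -> bool {
    red_components
        .iter()
        .all(|c| red_impact(c) == RedImpact::Degraded)
}
```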

6. Replay-driven debugging

All non-trivial decisions in scan_controller, movement_detector, semantic_analyzer, vlm_client, and mission_executor are reconstructable from logs + the (size-capped) raw inputs that drove them:

  • Frame seq, gimbal state at decode, telemetry sample used, Tier-1 detections returned, Tier-2 score, VLM raw response (size-capped), operator command, resulting state transition.
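Size-capping a raw input such as the VLM response needs some care when the payload is text, because cutting at an arbitrary byte offset can split a multi-byte UTF-8 character. A minimal sketch, assuming the cap is expressed in bytes; `cap_utf8` is a hypothetical helper, and the returned flag lets the log line record that truncation happened.

```rust
/// Return at most `max_bytes` of `raw`, cut on a character boundary,
/// plus a flag that is true when the input was truncated.
pub fn cap_utf8(raw: &str, max_bytes: usize) -> (&str, bool) {
    if raw.len() <= max_bytes {
        return (raw, false);
    }
    // Walk back to the nearest boundary; index 0 is always a boundary, so this terminates.
    let mut end = max_bytes;
    while !raw.is_char_boundary(end) {
        end -= 1;
    }
    (&raw[..end], true)
}
```

Recording the truncation flag alongside the capped payload lets a replay distinguish "the model returned this much" from "the log kept this much".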

This is the foundation of the replay-based integration tests in ci_cd_pipeline.md §2.

7. Out of scope here

  • Suite-wide observability stack choice (OTLP vs Loki vs Tempo vs Promtail) — owned by suite ops.
  • Persistent log retention policy — owned by suite ops.
  • Alerting routing (Slack / PagerDuty / email) — owned by suite ops.