Oleksandr Bezdieniezhnykh bc40ea7300 [AZ-626] Decompose complete: 47 tasks + docs + module layout
Greenfield Steps 1-6 baseline for the autopilot rewrite from legacy
Qt/C++ to a Rust workspace.

- Remove legacy Qt/C++ tree (ai_controller, drone_controller,
  misc/camera, python_scaffold, root Dockerfile, autopilot.pro,
  legacy main.py / requirements.txt).
- Add _docs/00_problem (problem, restrictions, acceptance criteria,
  security approach, input data + fixtures).
- Add _docs/01_solution/solution_draft01.
- Add _docs/02_document (architecture, system-flows, data_model,
  glossary, decision-rationale, deployment, 13 component descriptions,
  tests/ specs, FINAL_report, module-layout).
- Add _docs/02_tasks/todo with 47 task specs (AZ-640..AZ-686, one
  bootstrap + 46 component tasks) and _dependencies_table.md.
- Add .cursor/rules/artifact-srp.mdc (single-responsibility rule for
  canonical _docs artifacts).
- Track autodev state in _docs/_autodev_state.md (Step 6 completed,
  ready for Step 7 Implement).

Jira: bootstrap AZ-626; component epics AZ-627..AZ-639; tasks
AZ-640..AZ-686. Total complexity 173 points across 12 epics.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-19 11:02:01 +03:00

# Observability
**Status**: forward-looking design (Rust). Treat the choices below as the intended approach; the exact tracing exporter / metrics scraper / log-shipping target depend on the suite's overall observability stack at deploy time.
## 1. Posture
- **One binary, one process.** Instrumentation is organised per component: each component listed in `architecture.md §3` is its own `tracing` target.
- **Structured logs are primary.** Metrics are derived from log spans and a small set of explicit counters; traces follow a frame end-to-end through the pipeline.
- **No silent error swallowing.** Every failure path increments a counter, emits a span event, or both.
- **Health is aggregated**, not derived from logs. The HTTP health endpoint (`containerization.md §7`) is the source of truth for live readiness.
## 2. Logs
**Library**: `tracing` + `tracing-subscriber`.
**Format**: JSON to stdout. Captured by the host's journald (Option A) or by the container runtime (Option B), then shipped to the suite's log aggregator.
**Per-line fields:**
| Field | Source | Notes |
|---|---|---|
| `ts` | wall clock | ISO-8601 UTC. |
| `ts_mono_ns` | monotonic clock | For ordering across components without clock-skew artefacts. |
| `level` | `tracing` | `error \| warn \| info \| debug \| trace`. |
| `target` | component name | One of `frame_ingest`, `detection_client`, `movement_detector`, `semantic_analyzer`, `vlm_client`, `scan_controller`, `mapobjects_store`, `gimbal_controller`, `operator_bridge`, `mission_executor`, `mavlink_layer`, `mission_client`, `telemetry_stream`. |
| `frame_seq` | propagated context | Where applicable. Lets us reconstruct one frame's journey. |
| `poi_id`, `roi_id`, `target_id`, `mission_id`, `command_id` | propagated context | Where applicable. |
| `event` | message | Short, machine-friendly identifier (e.g., `frame.dropped`, `vlm.timeout`, `mission.geofence_violation`, `bit.check_failed`, `failsafe.lost_link`, `mapobjects.push_failed`, `operator.auth_rejected`). |
| `model_version` | propagated context | Version string for `tier1_model_version` and `vlm_model_version`. Required on every `vlm.response` and on every Tier-2 evidence span for forensic correlation. |
| `wall_clock_source` | telemetry frame | `gnss \| host \| coast`; emitted on every state-transition span and on every operator-command audit log line. |
| `reason` | message | Free-form for human readers. |
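To make the per-line shape concrete, here is a minimal std-only sketch of one log line rendered as JSON. It only illustrates the fields in the table; in the real binary the line would come from the `tracing` subscriber, not a hand-written serializer. The `LogLine` struct, its `to_json` method, and the choice of which fields are optional are assumptions made for the example.

```rust
use std::fmt::Write;

/// Hypothetical in-memory form of one log line; field names follow the table above.
pub struct LogLine<'a> {
    pub ts: &'a str,            // ISO-8601 UTC, already formatted by the caller
    pub ts_mono_ns: u64,        // monotonic clock reading in nanoseconds
    pub level: &'a str,
    pub target: &'a str,
    pub event: &'a str,
    pub frame_seq: Option<u64>, // present only where applicable
    pub reason: Option<&'a str>,
}

/// Escape the characters JSON requires inside a string literal.
fn escape_json(s: &str, out: &mut String) {
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => {
                let _ = write!(out, "\\u{:04x}", c as u32);
            }
            c => out.push(c),
        }
    }
}

/// Append `"key":"value"` with the value escaped.
fn push_str_field(out: &mut String, key: &str, value: &str) {
    out.push('"');
    out.push_str(key);
    out.push_str("\":\"");
    escape_json(value, out);
    out.push('"');
}

impl LogLine<'_> {
    /// Render the line as a single JSON object; optional fields are omitted when absent.
    pub fn to_json(&self) -> String {
        let mut out = String::from("{");
        push_str_field(&mut out, "ts", self.ts);
        let _ = write!(out, ",\"ts_mono_ns\":{}", self.ts_mono_ns);
        for (key, value) in [("level", self.level), ("target", self.target), ("event", self.event)] {
            out.push(',');
            push_str_field(&mut out, key, value);
        }
        if let Some(seq) = self.frame_seq {
            let _ = write!(out, ",\"frame_seq\":{}", seq);
        }
        if let Some(reason) = self.reason {
            out.push(',');
            push_str_field(&mut out, "reason", reason);
        }
        out.push('}');
        out
    }
}
```

Omitting absent context fields, rather than emitting `null`, keeps lines short and matches the "where applicable" wording in the table; the aggregator would need to tolerate missing keys either way.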
**Log level defaults:**
- `info`: lifecycle (startup / shutdown / state transitions), all error and security events.
- `warn`: degraded-but-running events (yellow health, retries, drops).
- `error`: red health, hard failures, schema violations, security violations.
- `debug` / `trace`: off in production; enabled per-target via `RUST_LOG`.
**Always logged at `warn` or higher** (per `coderule.mdc`):
- Every exception path that the operator could care about.
- Authentication / authorisation failures (peer-cred check failures on VLM IPC, malformed Ground Station session, MAVLink-2 signing rejection).
- Geofence violations.
- Schema validation failures (Tier 1 response, VLM response, mission JSON).
## 3. Metrics
Derived from log spans + a small set of explicit counters. Exporter: Prometheus-compatible (per the suite's stack).
**Per-component counters** (illustrative — exact names finalised at implementation):
| Component | Counter | Type |
|---|---|---|
| `frame_ingest` | `frames_received_total`, `frames_dropped_total{reason}`, `decode_errors_total` | counter |
| `frame_ingest` | `decode_ms` | histogram |
| `detection_client` | `requests_total`, `errors_total{kind}`, `latency_ms` | counter / histogram |
| `movement_detector` | `candidates_total`, `telemetry_skew_drops_total` | counter |
| `semantic_analyzer` | `tier2_runs_total`, `tier2_latency_ms`, `tier2_oversize_total` | counter / histogram |
| `vlm_client` | `vlm_requests_total{status}`, `vlm_latency_ms` | counter / histogram |
| `scan_controller` | `state_transitions_total{from,to}`, `pois_in_queue`, `pois_per_min`, `tick_latency_ms` | counter / gauge / histogram |
| `mapobjects_store` | `classify_total{result}`, `ignored_items_total`, `removed_candidates_total` | counter |
| `gimbal_controller` | `commands_total`, `decision_to_movement_ms`, `zoom_transition_ms`, `vendor_faults_total` | counter / histogram |
| `mavlink_layer` | `messages_in_total{kind}`, `messages_out_total{kind}`, `command_acks_total{result}`, `parse_errors_total`, `link_state` | counter / gauge |
| `mission_executor` | `state_transitions_total{from,to}`, `mission_uploads_total{result}`, `geofence_violations_total{kind}` | counter |
| `mission_client` | `fetches_total{result}`, `middle_waypoint_posts_total{result}`, `mapobjects_pull_total{result}`, `mapobjects_push_total{result}`, `mapobjects_pull_bytes`, `mapobjects_push_bytes`, `mapobjects_sync_lag_s` | counter / gauge |
| `mission_executor` (BIT) | `bit_runs_total{result}`, `bit_check_failures_total{check}` | counter |
| `mission_executor` (failsafe) | `link_loss_events_total{trigger}`, `failsafe_action_total{action}` | counter |
| `operator_bridge` | `pois_surfaced_total`, `commands_received_total{kind,result}`, `decision_latency_ms`, `auth_rejections_total{reason}`, `command_e2e_ms` | counter / histogram |
| `telemetry_stream` | `bytes_out_total`, `frames_out_total`, `link_state`, `bandwidth_used_mbps` | counter / gauge |
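As an illustration of how a labelled counter from the table could surface to a Prometheus-compatible scraper, the sketch below keeps one count per label value and renders it in the Prometheus text exposition format. `LabelledCounter` is a hypothetical std-only stand-in rather than a chosen metrics crate, and the `reason` values used in the test (`queue_full`, `decode_error`) are invented for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical counter with a single label, e.g. `frames_dropped_total{reason}`.
pub struct LabelledCounter {
    name: &'static str,
    label: &'static str,
    values: BTreeMap<String, u64>,
}

impl LabelledCounter {
    pub fn new(name: &'static str, label: &'static str) -> Self {
        Self { name, label, values: BTreeMap::new() }
    }

    /// Increment the series identified by `label_value`, creating it at zero if needed.
    pub fn inc(&mut self, label_value: &str) {
        *self.values.entry(label_value.to_string()).or_insert(0) += 1;
    }

    /// Render all series in the text exposition format, sorted by label value.
    /// Assumes label values contain no quotes, backslashes, or newlines,
    /// which would otherwise need escaping.
    pub fn render(&self) -> String {
        let mut out = format!("# TYPE {} counter\n", self.name);
        for (value, count) in &self.values {
            out.push_str(&format!(
                "{}{{{}=\"{}\"}} {}\n",
                self.name, self.label, value, count
            ));
        }
        out
    }
}
```

Keeping label values to a small, fixed vocabulary (drop reasons, result kinds) is what keeps series cardinality bounded; free-form text belongs in the log line's `reason` field instead.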
**Aggregated:**
- `health_state{component}` — 0 (red) / 1 (yellow) / 2 (green); enables alerting per-component.
- `process_uptime_seconds`, `process_resident_memory_bytes` — standard.
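The `health_state` encoding maps naturally onto a Rust enum with explicit discriminants. The sketch below shows that mapping plus a worst-of roll-up across components. The roll-up rule is an assumption made for illustration; the actual aggregation behaviour is defined with the health endpoint in `containerization.md §7`.

```rust
/// Health levels with the gauge values listed above: 0 (red), 1 (yellow), 2 (green).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum HealthState {
    Red = 0,
    Yellow = 1,
    Green = 2,
}

impl HealthState {
    /// Value exported on the `health_state{component}` gauge.
    pub fn gauge_value(self) -> u8 {
        self as u8
    }
}

/// Assumed roll-up rule: the overall state is the worst component state.
/// Returns `None` when no component has reported yet.
pub fn worst_of(states: &[HealthState]) -> Option<HealthState> {
    states.iter().copied().min()
}
```

Because the variants are declared from worst to best, the derived `Ord` lets `min()` pick the worst state without a hand-written comparison.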
## 4. Traces
`tracing` spans cover the path of a single frame and the path of a single POI.
**Frame trace** (per `Frame`):
```text
frame_ingest.publish
detection_client.request
detection_client.response
movement_detector.tick
[movement_detector.emit_candidate]
telemetry_stream.push
```
**POI trace** (per `POI`):
```text
scan_controller.enqueue
scan_controller.dequeue
gimbal_controller.zoom
semantic_analyzer.tier2
[vlm_client.request -> vlm_client.response]
operator_bridge.surface
[operator_bridge.confirm | decline | timeout]
mission_executor.middle_waypoint # confirm path
mapobjects_store.append_ignored # decline path
```
Spans propagate via context across in-process channels. Trace export target depends on the suite's stack (OTLP / Jaeger / Tempo).
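One way to picture context propagation across an in-process channel is to send the context alongside the payload, so the receiving component can attach the same `frame_seq` to its own span. The sketch below uses only `std::sync::mpsc`. `TraceCtx`, `Traced`, and `run_pipeline` are hypothetical names, and a real implementation would carry a `tracing` span or its identifiers rather than a bare sequence number.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical propagated context; only `frame_seq` is shown.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TraceCtx {
    pub frame_seq: u64,
}

/// Envelope that keeps the context attached to the payload across a channel.
pub struct Traced<T> {
    pub ctx: TraceCtx,
    pub payload: T,
}

/// Send one dummy frame per sequence number from a producer to a consumer thread
/// and return the span-like records the consumer emitted.
pub fn run_pipeline(frame_seqs: &[u64]) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<Traced<Vec<u8>>>();

    // Consumer side: stands in for `detection_client` picking frames off the channel.
    let consumer = thread::spawn(move || {
        rx.iter()
            .map(|msg| {
                format!(
                    "detection_client.request frame_seq={} bytes={}",
                    msg.ctx.frame_seq,
                    msg.payload.len()
                )
            })
            .collect::<Vec<String>>()
    });

    // Producer side: stands in for `frame_ingest.publish`.
    for &seq in frame_seqs {
        let msg = Traced { ctx: TraceCtx { frame_seq: seq }, payload: vec![0u8; 4] };
        tx.send(msg).expect("consumer should still be receiving");
    }
    drop(tx); // closing the channel lets the consumer loop finish

    consumer.join().expect("consumer thread panicked")
}
```

Putting the context in the message envelope, rather than in thread-local state, means it survives any hop between threads or tasks without relying on which executor runs the receiver.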
## 5. Health endpoint
See `containerization.md §7`. The endpoint is the operator-facing readiness API; metrics + logs are the engineer-facing investigation API.
A red health state for any of these components is unrecoverable for the current flight:
- `frame_ingest` red → no input → cannot operate.
- `mavlink_layer` red → no UAV control → trigger RTL via the autopilot's failsafe (the autopilot itself enforces this when MAVLink heartbeat stops).
- `mission_executor` red → mission lifecycle stuck → operator must take RC control.
A red health state for these components is degraded-but-survivable:
- `detection_client` → continue zoom-out sweep; lose Tier 1.
- `movement_detector` → continue; lose movement-candidate POI source.
- `semantic_analyzer` → continue; surface Tier-1-only POIs.
- `vlm_client` → fail-closed (POIs surfaced without VLM evidence).
- `mapobjects_store` → continue with in-memory state; persistent diff lost on restart. Sync state may transition to `Stale` (operator visible).
- `mapobjects_sync` (logical, owned by `mission_client`) → mission proceeds with stale snapshot; post-flight push retries via leftover spool. Operator sees `mapobjects_sync = degraded`.
- `operator_bridge` / `telemetry_stream` → continue zoom-out sweep; pause POI surfacing; resume on reconnect. F10 lost-link ladder owns the larger response.
- `gimbal_controller` → pause zoom-in / target-follow; zoom-out sweep stops.
- `mission_client` → continue current mission from in-memory copy.
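The two lists above amount to a lookup from component name to the impact of a red state. A minimal sketch of that lookup follows; `RedImpact` and `red_impact` are hypothetical names, and components the lists do not mention (for example `scan_controller`) deliberately map to `None` instead of a guessed category.

```rust
/// Impact of a component going red, per the two lists above.
#[derive(Debug, PartialEq, Eq)]
pub enum RedImpact {
    /// The current flight cannot continue normally.
    Unrecoverable,
    /// The system keeps running with reduced capability.
    Degraded,
}

/// Classify a red health state by component name.
/// Returns `None` for names that neither list covers.
pub fn red_impact(component: &str) -> Option<RedImpact> {
    match component {
        "frame_ingest" | "mavlink_layer" | "mission_executor" => Some(RedImpact::Unrecoverable),
        "detection_client"
        | "movement_detector"
        | "semantic_analyzer"
        | "vlm_client"
        | "mapobjects_store"
        | "mapobjects_sync"
        | "operator_bridge"
        | "telemetry_stream"
        | "gimbal_controller"
        | "mission_client" => Some(RedImpact::Degraded),
        _ => None,
    }
}
```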
## 6. Replay-driven debugging
All non-trivial decisions in `scan_controller`, `movement_detector`, `semantic_analyzer`, `vlm_client`, and `mission_executor` are reconstructable from logs + the (size-capped) raw inputs that drove them:
- Frame seq, gimbal state at decode, telemetry sample used, Tier-1 detections returned, Tier-2 score, VLM raw response (size-capped), operator command, resulting state transition.
This is the foundation of the replay-based integration tests in `ci_cd_pipeline.md §2`.
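Size-capping a raw input such as the VLM response has one subtlety in Rust: a `&str` can only be cut at a UTF-8 character boundary. The helper below is a sketch of one way to do it; the name `cap_utf8` and the returned truncation flag are assumptions, and the actual cap value is not specified in this document.

```rust
/// Return at most `max_bytes` of `input`, cut at a UTF-8 character boundary,
/// together with a flag saying whether anything was dropped.
pub fn cap_utf8(input: &str, max_bytes: usize) -> (&str, bool) {
    if input.len() <= max_bytes {
        return (input, false);
    }
    // Walk back from the cap until we land on a boundary; index 0 always is one.
    let mut end = max_bytes;
    while !input.is_char_boundary(end) {
        end -= 1;
    }
    (&input[..end], true)
}
```

Logging the flag next to the capped text lets a replay distinguish a short response from a truncated one.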
## 7. Out of scope here
- Suite-wide observability stack choice (OTLP vs Loki vs Tempo vs Promtail) — owned by suite ops.
- Persistent log retention policy — owned by suite ops.
- Alerting routing (Slack / PagerDuty / email) — owned by suite ops.