gps-denied-onboard/_docs/04_deploy/observability.md
Oleksandr Bezdieniezhnykh bf13549b32
[autodev] Update configuration and documentation for cycle-1
- Enhanced `.env.example` with detailed CMake build flags and replay-mode strategy flags for development and CI environments.
- Updated `.gitignore` to include a new deploy rollback bookmark.
- Revised `_docs/_autodev_state.md` to reflect the current task status and steps.
- Added new lessons to `_docs/LESSONS.md` regarding testing and architectural improvements.
- Documented changes in `_docs/02_document/deployment/ci_cd_pipeline.md` to reflect the relaxed OpenCV version pin.
- Updated test data documentation in `_docs/02_document/tests/test-data.md` to clarify fixture usage and paths.

This commit continues the cycle-1 documentation sync and addresses various configuration updates for improved clarity and functionality.
2026-05-20 08:05:35 +03:00


GPS-Denied Onboard — Observability

Generated by /autodev greenfield Step 16 (Deploy) — Step 5. Builds on Step 1 (reports/deploy_status_report.md), Step 2 (containerization.md), Step 3 (ci_cd_pipeline.md), and Step 4 (environment_strategy.md). The deploy skill's standard observability template (Prometheus /metrics + OpenTelemetry + PagerDuty) is adapted here for an airborne autonomous system: the airborne image has no inbound listeners (NFT-SEC-05 in-flight egress lockdown), so the canonical observability surface is the on-device Flight Data Recorder (FDR) binary ring buffer, replayed off-flight by post-landing tooling. Operator workstation + CI keep the conventional logging-to-stdout / journald patterns.

Observability Architecture (one-paragraph)

The airborne image (companion-jetson / companion-tier1) writes structured FDR records to a 64 GB ring buffer (/var/lib/gps-denied/fdr) via the shared_fdr_client (producer → SPSC ring → C13 writer). Logs at WARN and above are forwarded into FDR as kind="log" records by the fdr_log_bridge (AZ-267); below-WARN logs go only to LOG_SINK (console in dev, journald on the operator workstation, fdr on airborne; never to a file). Telemetry is captured as kind-specific FDR records (vio.tick, state.tick, tile_match, c6.write, c6.eviction_batch, etc.) rather than via a Prometheus endpoint, because no inbound TCP is permitted in flight. Post-flight tooling on the operator workstation parses the FDR segments using the frozen, versioned fdr_record_schema v1.3.0 and feeds Grafana / Jupyter / one-off scripts. The suite-mandated AZAION_UPDATE_EVENT journald audit chain + OCI image labels (org.opencontainers.image.revision/created/source) + ENV AZAION_REVISION=$CI_COMMIT_SHA form the deploy-side audit trail (AZ-204). jetson-stats (jtop) device telemetry (thermal zones, CPU/GPU clocks, power rails) is sampled by C7 + C4 to drive the D-CROSS-LATENCY-1 auto-degrade hybrid trigger; samples land in FDR alongside the matcher / pose ticks.

Logging

Format

Structured records to LOG_SINK. No file-based logging in containers. The LOG_SINK env var (Step 4) selects the destination per environment.

Common log envelope (per-record fields)

Source of truth: _docs/02_document/contracts/shared_log_bridge/log_record_schema.md v1.0.0 — referenced by the fdr_log_bridge (AZ-267). Every onboard log record carries:

{
  "timestamp": "2026-05-10T03:14:15.123456Z",
  "level": "INFO",
  "service": "gps-denied-onboard",
  "component": "c2_vpr",
  "flight_id": "<uuid>",
  "frame_id": 12345,
  "kind": "vpr.warmup",
  "msg": "loaded",
  "kv": {"model": "salad"},
  "exc": null
}

| Field | Purpose | Notes |
| --- | --- | --- |
| timestamp | ISO 8601 UTC, microsecond precision | RFC 3339 with Z suffix |
| level | DEBUG, INFO, WARN, or ERROR | WARN + ERROR are also mirrored into FDR via fdr_log_bridge |
| service | gps-denied-onboard | Constant per submodule |
| component | Module slug from module-layout.md (c2_vpr, c6_tile_cache.store, shared.fdr_client, …) | Matches producer_id on the corresponding FDR record |
| flight_id | UUID assigned at flight open by C13 (flight_header) | Correlation across all components within one flight |
| frame_id | Monotonic per-frame counter from runtime_root | Cross-component frame correlation (VIO ↔ matcher ↔ state) |
| kind | Dotted snake_case event tag (closed enum per component) | E.g. vpr.warmup, c6.evict.budget, c8.signing_key_rotation |
| msg | Short human-readable event description | No PII; no secrets; no file payloads |
| kv | Bag of typed scalars | JSON-safe; no nested blobs > 4 KiB |
| exc | Optional exception class + traceback | Present only on ERROR; truncated to 4 KiB |
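
As an illustration of the envelope described above, here is a minimal Python sketch that assembles one record. The helper name `make_log_record` and its signature are assumptions made for this example, not part of log_record_schema.md; only the ten field names, the timestamp format, and the 4 KiB `exc` cap come from the table.

```python
import traceback
from datetime import datetime, timezone

MAX_EXC_BYTES = 4 * 1024  # "truncated to 4 KiB" in the field table
LEVELS = ("DEBUG", "INFO", "WARN", "ERROR")


def make_log_record(level, component, flight_id, frame_id, kind, msg,
                    kv=None, exc=None, now=None):
    """Return a dict carrying the ten envelope fields (hypothetical helper)."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    now = now if now is not None else datetime.now(timezone.utc)
    exc_text = None
    if exc is not None and level == "ERROR":  # exc is present only on ERROR
        lines = traceback.format_exception(type(exc), exc, exc.__traceback__)
        raw = "".join(lines).encode("utf-8")[:MAX_EXC_BYTES]
        exc_text = raw.decode("utf-8", errors="ignore")  # drop a split code point
    return {
        # RFC 3339, UTC, microsecond precision, Z suffix
        "timestamp": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
        "level": level,
        "service": "gps-denied-onboard",
        "component": component,
        "flight_id": flight_id,
        "frame_id": frame_id,
        "kind": kind,
        "msg": msg,
        "kv": dict(kv or {}),
        "exc": exc_text,
    }
```

Passing `now` explicitly keeps the helper deterministic in tests; in production code the default wall-clock path would be used.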

Log Levels

| Level | Usage | Example |
| --- | --- | --- |
| ERROR | Exceptions, failures requiring offline review | c5.solver.diverged, c8.signing_handshake_failed, c6.write_failed |
| WARN | Degraded operation, retry, fallback engaged | c4.pose.degraded_to_pnp, c6.freshness.rejected, c7.tensorrt_engine_rebuild |
| INFO | Significant in-flight business events | c8.signing_key_rotation, flight_header, flight_footer, c11.upload_batch_queued |
| DEBUG | Detailed diagnostics (dev only) | Per-frame VIO covariance dump, full matcher correspondences list |

WARN + ERROR are mirrored into FDR via fdr_log_bridge (AZ-267) so they survive a post-landing journalctl clear. INFO + DEBUG go only to LOG_SINK.
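
The mirroring rule can be sketched as a small routing function. This is an illustrative sketch only: `route` and `to_fdr_log_record` are hypothetical names, and the FDR wrapper shape (a `payload` field holding the envelope) is an assumption, since the real layout is defined by fdr_record_schema.md.

```python
MIRRORED_LEVELS = frozenset({"WARN", "ERROR"})


def route(record):
    """Return the destinations for one log record, in emission order."""
    destinations = ["LOG_SINK"]           # every level reaches LOG_SINK
    if record["level"] in MIRRORED_LEVELS:
        destinations.append("FDR")        # WARN + ERROR are also mirrored
    return destinations


def to_fdr_log_record(record):
    """Wrap a WARN/ERROR envelope as an FDR kind="log" record (shape assumed)."""
    return {
        "kind": "log",
        "producer_id": record["component"],  # component matches producer_id
        "payload": record,                   # hypothetical field name
    }
```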

Destinations and Retention

| Environment | LOG_SINK | Destination | Retention |
| --- | --- | --- | --- |
| Development (Tier-1 Docker) | console | Docker container stdout (docker compose logs companion) | Session; cleared on docker compose down |
| CI (Woodpecker) | console | Woodpecker UI stdout capture | Per the suite Woodpecker retention policy (operator-managed; today ≤ 30 days) |
| Staging (lab Jetson) | journald | Host journald | Per the host's journald.conf (suite default: ~7 days rolling) |
| Production (airborne) | fdr | FDR ring buffer at /var/lib/gps-denied/fdr (≥ 64 GB) | Bounded by ring capacity; rolls over per segment_rollover FDR record. Post-flight operator pulls segments to long-term storage on the operator workstation. |
| Production (operator workstation) | journald | Host journald | Per the host's journald.conf (operator-managed; recommendation: 30 days for the operator-orchestrator service unit) |
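
A minimal sketch of how a process might resolve its sink from the environment, assuming the three LOG_SINK values in the table are the only valid ones. The function name `resolve_log_sink` and the fail-fast behaviour on a missing or unknown value are assumptions for illustration; the source does not say how an invalid value is handled.

```python
import os

VALID_SINKS = ("console", "journald", "fdr")  # values listed in the table


def resolve_log_sink(environ=None):
    """Read LOG_SINK and reject anything outside the documented set."""
    environ = os.environ if environ is None else environ
    sink = environ.get("LOG_SINK")
    if sink not in VALID_SINKS:
        # Assumption: refuse to start rather than silently fall back to a file.
        raise ValueError(f"LOG_SINK must be one of {VALID_SINKS}, got {sink!r}")
    return sink
```

Failing fast keeps the "no file-based logging in containers" rule from being bypassed by a typo in the environment.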

"PII" Rules (read: operational secrets)

This system has no end-user PII surface — flights, MAVLink, and tile data are operational rather than personal. The equivalent restrictions are operational-secret leakage controls:

  • Never log MAVLink 2.0 signing key bytes, per-flight onboard signing key bytes, satellite-provider API tokens, registry tokens, or Postgres credentials. The KeySource Protocol (C8) is the only component that ever holds key material, and its log path emits only the rotation event tag + key fingerprint (SHA-256 first 8 bytes), never the key.
  • Mask absolute file paths in any record that references operator-specific layouts (e.g. /Users/<operator>/… collapsed to ~/…).
  • Never log raw camera frame bytes or full tile JPEGs inline — they go to sidecar paths via FDR's failed_tile_thumbnail (≤ 0.1 Hz rate cap) or mid_flight_tile_snapshot.
  • Never log raw GPS coordinates unless the flight's restricted_geographic_log_redaction config is off (operator-set at takeoff load).
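
Two of these controls lend themselves to a short sketch: the key fingerprint (first 8 bytes of SHA-256) and path masking. The helper names are hypothetical, and extending the mask to `/home/<user>` in addition to `/Users/<operator>` is an assumption made here for Linux hosts.

```python
import hashlib
import re


def key_fingerprint(key_bytes: bytes) -> str:
    """Hex of the first 8 bytes of SHA-256(key); safe to log, unlike the key."""
    return hashlib.sha256(key_bytes).digest()[:8].hex()


# Assumption: /home/<user> is masked the same way as /Users/<operator>.
_HOME_PREFIX = re.compile(r"/(?:Users|home)/[^/\s]+")


def mask_home_paths(text: str) -> str:
    """Collapse operator-specific home directories to ~ before logging."""
    return _HOME_PREFIX.sub("~", text)
```

For example, `mask_home_paths("/Users/alice/fdr/seg0.bin")` yields `~/fdr/seg0.bin`, and a rotation event would carry `key_fingerprint(key)` in its `kv` bag instead of the key material.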

Telemetry (FDR-based, not Prometheus)

Why FDR, not Prometheus / OTel

The airborne image runs under NFT-SEC-05 (in-flight egress lockdown: no inbound listeners; outbound only to the FC over UART/USB and to QGroundControl as a 12 Hz downsampled MAVLink 2.0 summary). A /metrics HTTP endpoint would violate this lockdown, and a push-mode OTel exporter has no in-flight collector to reach. The FDR ring is the canonical telemetry sink; post-flight tooling converts FDR records into whatever observability backend the operator prefers (Grafana, Jupyter, ad-hoc scripts).

The operator workstation is not in-flight-locked-down; cycle-2 may add a Prometheus /metrics endpoint on the operator-orchestrator service (see "Future Work" below). Cycle-1 leaves both the operator-orchestrator and airborne side on the FDR + structured logs path for consistency.

FDR Record Kinds (cycle-1 metrics surface)

Source of truth: _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md v1.3.0. Each kind is the metric.

| Metric (FDR kind) | Producer | Type (intent) | What it tells the operator |
| --- | --- | --- | --- |
| vio.tick | C1 | per-frame snapshot | VIO output (R, t), pose covariance proxies, last-anchor age, monocular reproj error, IMU bias norm |
| state.tick | C5 | per-frame snapshot | Smoothed fused-pose tick from iSAM2 (or ESKF baseline) + 2x2 covariance + estimator label |
| tile_match | C2.5 / C3 | per-match snapshot | Tile id, VPR score, match count, RANSAC inlier count |
| c6.write | C6 | counter-ish (per-tile) | Successful write_tile: tile id, source, disk bytes, content SHA-256 |
| c6.write_failed | C6 | counter-ish (per-failure) | Failed write_tile: reason ∈ {content_hash_mismatch, freshness_reject, metadata_error, fs_error} |
| c6.freshness.rejected | C6 | counter-ish (per-reject) | Active-conflict-stale tile rejected: tile_id, age_seconds, threshold |
| c6.freshness.downgraded | C6 | counter-ish (per-downgrade) | Stable-rear-stale tile downgraded; same shape as rejected |
| c6.eviction_batch | C6 | batch counter (per sweep) | Cache budget enforcer evicted N tiles to make room: trigger tile, freed bytes, count, first 5 evicted ids |
| overrun | shared.fdr_client | counter (per drop) | FDR ring overrun: producer_id of the originating queue + dropped count (> 0). AC-NEW-3: never silent. |
| segment_rollover | C13 writer | counter (per rotation) | Segment file rotated (including 64 GB cap drops) |
| failed_tile_thumbnail | C6 / C11 | rate-capped sample | Forensic JPEG thumbnail (≤ 0.1 Hz). AC-8.5 |
| mid_flight_tile_snapshot | C13 snapshot path | sample pointer | Mid-flight tile snapshot pointer (sidecar). AC-8.4 |
| flight_header | C13 writer | once-per-flight | flight_id, start ISO/monotonic, config snapshot, signing-key rotation event, manifest content hashes, build info |
| flight_footer | C13 writer | once-per-flight | flight_id, end ISO/monotonic, records written / dropped (overrun) / bytes / rollover count / clean-shutdown flag |

Device Telemetry (jetson-stats / jtop)

D-CROSS-LATENCY-1 requires runtime thermal + power + GPU clock telemetry to drive the auto-degrade hybrid trigger (frame deadline missed × thermal headroom). Cycle-1 source: jetson-stats (jtop) accessed inside the companion-jetson container via runtime: nvidia + the nvidia-container-runtime device passthrough — same pattern the suite's detections service uses on the same hardware.

| Signal | Source | Sample rate | Consumer |
| --- | --- | --- | --- |
| GPU clock (MHz) | jtop.gpu | 1 Hz | C7 (degrade gate); recorded into FDR via c7.device_telemetry log records (kind="c7.thermal_headroom") |
| GPU/CPU temperature (°C) | jtop.temperature | 1 Hz | C4 / C7 hybrid trigger |
| Power draw (mW) | jtop.power | 1 Hz | Cycle-2 derate hysteresis |
| Memory pressure | jtop.memory | 1 Hz | C6 eviction batch hysteresis |

Cycle-1: jtop runs in-process inside the companion container; samples are emitted as FDR kind="c7.thermal_headroom" records. Cycle-2 may move this to a sidecar Python thread once the Step 2 BLOCKING gate "jetson-stats thermal telemetry under Docker" (containerization.md § Step 2 Validation Gates) is signed off on the real Tier-2 Jetson.
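
The 1 Hz in-process sampling can be sketched as a fixed-period loop. This is a hedged illustration, not the C7 implementation: `read_sample` stands in for whatever wraps the jtop reads, `emit` stands in for the FDR client, and the `kv` payload layout is an assumption. Only the record kind and the 1 Hz period come from the text above.

```python
import time


def sample_device_telemetry(read_sample, emit, n_samples, period_s=1.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Emit n_samples records at a fixed period, without accumulating drift.

    read_sample: hypothetical callable returning a dict of device readings.
    emit:        hypothetical callable that hands a record to the FDR client.
    """
    next_deadline = clock()
    for _ in range(n_samples):
        emit({"kind": "c7.thermal_headroom", "kv": read_sample()})
        next_deadline += period_s
        delay = next_deadline - clock()
        if delay > 0:  # if a sample ran long, skip the sleep and catch up
            sleep(delay)
```

Scheduling against an absolute deadline rather than sleeping a flat second keeps the cadence at 1 Hz even when a read takes a few milliseconds. The injectable `sleep` and `clock` exist only so the loop can be exercised without real time passing.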

Collection Interval

| Source | Interval |
| --- | --- |
| Per-frame producers (C1 vio.tick, C5 state.tick, C3 tile_match) | Camera frame cadence (target ≥ 4 Hz on Tier-2; per _docs/02_document/architecture.md Vision) |
| Per-write producers (C6 c6.write, c6.write_failed, c6.freshness.*) | Per-event (write-path triggered) |
| Per-batch producers (C6 c6.eviction_batch) | Per-sweep (only when ≥ 1 tile evicted) |
| jetson-stats (jtop) | 1 Hz |
| flight_header / flight_footer | Once per flight |
| segment_rollover | Per segment rotation |

There is no Prometheus-style "scrape interval" because there is no scraping endpoint — the FDR ring is push-only from producers, drained by C13's writer thread.
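
To make the push-and-drain model concrete, here is a simplified sketch of a bounded queue with drop-oldest semantics that reports its drops as an `overrun` record, so that an overrun is never silent. It is a plain single-threaded stand-in, not the lock-free SPSC ring; the class name and the `dropped` field name are assumptions.

```python
from collections import deque


class DropOldestRing:
    """Bounded producer queue; full pushes evict the oldest record."""

    def __init__(self, capacity, producer_id):
        if capacity < 1:
            raise ValueError("capacity must be >= 1")
        self._buf = deque()
        self._capacity = capacity
        self._producer_id = producer_id
        self._dropped = 0

    def push(self, record):
        """Producer side: never blocks, counts what it had to drop."""
        if len(self._buf) >= self._capacity:
            self._buf.popleft()
            self._dropped += 1
        self._buf.append(record)

    def drain(self):
        """Writer side: take everything, plus one overrun record if needed."""
        out = list(self._buf)
        self._buf.clear()
        if self._dropped:
            out.append({"kind": "overrun",
                        "producer_id": self._producer_id,
                        "dropped": self._dropped})  # hypothetical field name
            self._dropped = 0
        return out
```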

Distributed Tracing

Architecture stance (cycle-1)

No W3C Trace Context. No OpenTelemetry SDK. The airborne image's correlation key is the pair (flight_id, frame_id):

  • flight_id (UUID) is assigned at flight open by C13 and written into flight_header. Every log record and FDR record within that flight carries it.
  • frame_id (monotonic per-frame counter) is assigned by the composition root's frame pipeline. Every per-frame FDR record (vio.tick, state.tick, tile_match, c6.write …) carries it.

This is sufficient because the airborne pipeline is in-process, single-camera, single-FC — there are no inter-service RPC hops to trace. Post-flight tooling reconstructs the per-frame causal chain by joining FDR records on (flight_id, frame_id).
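
The join itself is simple once records are decoded. The sketch below assumes records have already been parsed into dicts with `flight_id`, `frame_id`, and `kind` keys; the binary decoding is out of scope here, and `join_by_frame` is a hypothetical name.

```python
from collections import defaultdict


def join_by_frame(records):
    """Group decoded FDR records by (flight_id, frame_id), then by kind."""
    frames = defaultdict(lambda: defaultdict(list))
    for rec in records:
        if "frame_id" not in rec:
            continue  # once-per-flight records carry no frame_id
        frames[(rec["flight_id"], rec["frame_id"])][rec["kind"]].append(rec)
    return {key: dict(kinds) for key, kinds in frames.items()}
```

Looking up `joined[(flight_id, 12345)]` then returns the vio.tick, tile_match, and state.tick records for that frame side by side, which is the per-frame causal chain described above.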

The operator workstation has more conventional inter-service traffic (C12 ↔ flights REST, C11 ↔ satellite-provider REST). Cycle-1 traces these by:

  • Per-request log records with the request URL + status + duration_ms + a generated correlation_id.
  • FlightsApiClient and the satellite-provider HTTP client both stamp this correlation id on the request line + response log.
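
A minimal sketch of this per-request pattern follows. The `send` callable is a stand-in for the real HTTP client, and `traced_request`, the `http.request` kind, and the record layout are assumptions for illustration; only the logged fields (URL, status, duration_ms, correlation_id) come from the list above.

```python
import time
import uuid


def traced_request(send, method, url, log):
    """Call send(method, url, correlation_id) and log one record per request.

    send: hypothetical callable returning an HTTP status code.
    log:  hypothetical callable accepting one log-record dict.
    """
    correlation_id = uuid.uuid4().hex
    started = time.monotonic()
    status = None
    try:
        status = send(method, url, correlation_id)
        return status
    finally:
        # Logged on success and on failure; status stays None if send raised.
        log({
            "kind": "http.request",  # hypothetical event tag
            "msg": f"{method} {url}",
            "kv": {
                "correlation_id": correlation_id,
                "status": status,
                "duration_ms": round((time.monotonic() - started) * 1000.0, 3),
            },
        })
```

Using `time.monotonic()` for the duration avoids negative or inflated values when the wall clock is adjusted mid-request.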

OpenTelemetry SDK + W3C Trace Context propagation is a cycle-2 polish item for the operator-orchestrator only — not for the airborne image. Logged in "Future Work" below.

Sampling

| Environment | Effective sampling rate | Rationale |
| --- | --- | --- |
| Development | 100% | FDR + logs both on |
| Staging (lab Jetson) | 100% | Full visibility for IT-12 / NFT-PERF runs |
| Production (airborne) | 100% per-frame for vio.tick/state.tick/tile_match; failed_tile_thumbnail rate-capped at ≤ 0.1 Hz | FDR ring is the only post-landing forensic record; full per-frame capture is mandatory. Rate caps live on byte-heavy forensic records only. |
| Production (operator workstation) | 100% INFO+; DEBUG off | Operator workstation has ample disk capacity; storage cost is not a concern. |
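
The ≤ 0.1 Hz cap on failed_tile_thumbnail amounts to allowing at most one record per 10 seconds. A minimal sketch of such a limiter, with the hypothetical name `RateCap` and the caller supplying the current time:

```python
class RateCap:
    """Allow at most one event per min_interval_s seconds."""

    def __init__(self, min_interval_s):
        self._min_interval_s = min_interval_s
        self._last_allowed = None

    def allow(self, now_s):
        """Return True if an event at now_s fits under the cap."""
        if (self._last_allowed is None
                or now_s - self._last_allowed >= self._min_interval_s):
            self._last_allowed = now_s
            return True
        return False


# 0.1 Hz corresponds to a 10 s minimum spacing between thumbnails.
thumbnail_cap = RateCap(min_interval_s=10.0)
```

A producer would call `thumbnail_cap.allow(time.monotonic())` before emitting a thumbnail and skip the emission when it returns False.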

Alerting

Airborne (in-flight)

No real-time alerting originates from the airborne image. The FC handles in-flight failsafe autonomously (SAFE_DEAD_RECKONING, RTL, LAND etc. per AC-FC-FAILSAFE-1). The companion has no network path to a human operator in flight; its only operator-facing outbound channel is the MAVLink 2.0 12 Hz downsampled summary to QGroundControl, which surfaces companion health via STATUSTEXT messages and the parent suite's GpsDeniedHealth MAVLink message.

Alert-equivalents on the airborne side:

| Event | Detected by | In-flight signal |
| --- | --- | --- |
| Companion process died | FC adapter watchdog timeout | FC drops to SAFE_DEAD_RECKONING; operator sees lost telemetry in QGC |
| D-CROSS-LATENCY-1 deadline miss + thermal headroom low | C4 / C7 hybrid trigger | Auto-degrade to lower-cost C7 backend; STATUSTEXT to QGC + FDR kind="c7.degrade" |
| C8 signing handshake failed | C8 FC adapter | Refuses takeoff; STATUSTEXT to QGC + FDR kind="c8.signing_handshake_failed" |
| FDR ring overrun | shared.fdr_client drop-oldest hook | Emits kind="overrun" (AC-NEW-3); post-flight forensics tag |
| Segment cap reached (64 GB) | C13 writer | Emits kind="segment_rollover" with cap-drop flag; oldest data lost; flag surfaces post-flight |

Post-Flight (operator workstation)

Post-flight analysis runs the FDR segments through the post-landing tooling. Alerts surface in the operator's environment:

| Severity | Response time | Condition | Cycle-1 channel |
| --- | --- | --- | --- |
| Critical | Pre-next-flight gate (≤ 10 min before takeoff) | flight_footer.clean_shutdown == false; kind="c8.signing_handshake_failed" observed; FDR overrun count (> 0) above the per-flight threshold | Operator UI block + Slack #gps-denied-ops (cycle-2 once the channel is wired); cycle-1: operator's local terminal output from post-landing tooling |
| High | Same-day | C6 eviction batch > 100 in one flight; tile_match score histogram drifted vs operator baseline | Same as above |
| Medium | Within 1 week | Cumulative thermal-headroom-low events trending up across recent flights | Operator dashboard (cycle-2) |
| Low | Recorded in flight summary only | Non-critical warnings (FDR kind="log" at WARN level) | Flight summary PDF / Markdown |
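
The Critical row can be read as a pre-next-flight gate over the decoded records of the previous flight. The sketch below is one possible reading: the function name, the `dropped` and `clean_shutdown` field names, the default threshold of 0, and treating a missing footer as an unclean shutdown are all assumptions.

```python
def critical_findings(records, overrun_threshold=0):
    """Return the Critical conditions met by one flight's decoded records."""
    findings = []
    footers = [r for r in records if r["kind"] == "flight_footer"]
    # Assumption: a missing footer is treated like clean_shutdown == false.
    if not footers or not footers[-1].get("clean_shutdown", False):
        findings.append("unclean shutdown")
    if any(r["kind"] == "c8.signing_handshake_failed" for r in records):
        findings.append("signing handshake failed")
    dropped = sum(r.get("dropped", 0) for r in records if r["kind"] == "overrun")
    if dropped > overrun_threshold:
        findings.append(f"FDR overrun: {dropped} records dropped")
    return findings
```

An empty list means the gate passes; any entry would block the next takeoff until the operator has reviewed it.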

CI (Woodpecker pipelines)

| Severity | Response time | Condition | Channel |
| --- | --- | --- | --- |
| Critical | Same business day | 01-test.yml failure on main branch | Woodpecker UI; per-repo Slack channel (cycle-2 follow-up; ci_cd_pipeline.md Future Work #8) |
| High | Within 24 h | 02-build-push.yml build failure on any push branch | Woodpecker UI |
| Medium | Next business day | Lint / coverage gate fail (cycle-2; cycle-1 has neither) | n/a in cycle-1 |
| Low | Next sprint review | Non-critical pipeline warnings | n/a |

Deploy / Update (Watchtower)

| Severity | Response time | Condition | Channel |
| --- | --- | --- | --- |
| Critical | Immediate | Watchtower post-update hook emits AZAION_UPDATE_EVENT severity=error to journald (image pull failed, container crash on restart) | journald + suite operator's journalctl -g AZAION_UPDATE_EVENT audit chain |
| Informational | None | Watchtower applied an update during a non-flight window (/run/azaion/in-flight cleared) | AZAION_UPDATE_EVENT severity=info to journald (audit only) |

Dashboards

Operations (cycle-1 — what exists today)

  • Suite Woodpecker UI — CI pipeline status per branch + commit; the only "live" operations dashboard cycle-1 ships.
  • jtop on the bench — operator runs sudo jtop on the lab / airborne Jetson during staging / pre-flight to observe thermal + GPU clock + power. Not a service dashboard; it's a CLI tool.
  • docker ps + docker compose logs, which serve as the workstation operator's dev-environment dashboard.

Operations (cycle-2 polish, planned)

  • Grafana dashboard fed by post-landing-parsed FDR records — service health per component (FDR record kinds rolled up into rates), thermal trend, eviction count, tile_match score distribution.
  • Prometheus /metrics on operator-orchestrator: once cycle-2 wires this on the operator workstation, the Grafana dashboard pulls live operator-side metrics alongside post-landing FDR rollups.

Flight Analytics (cycle-1 — what exists today)

  • Per-flight summary generated by post-landing tooling (Markdown / PDF) — records written / dropped, segment count, top-N error log lines, eviction count, signing-key rotation event log, flight_footer.clean_shutdown flag. Stored alongside the FDR segments under _docs/06_metrics/flights/<flight_id>/ (cycle-2 publishes; cycle-1 staging dir is operator-local).
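
As a rough sketch of what such a summary could compute from decoded records (the post-landing tooling itself is not shown in this document): the function name, the output keys, and the `level` / `msg` / `clean_shutdown` field names are assumptions, and eviction activity is counted here as the number of c6.eviction_batch records.

```python
from collections import Counter


def flight_summary(records, top_n=5):
    """Aggregate one flight's decoded FDR records into summary fields."""
    records = list(records)
    by_kind = Counter(r["kind"] for r in records)
    error_msgs = Counter(
        r["msg"] for r in records
        if r["kind"] == "log" and r.get("level") == "ERROR"
    )
    footer = next((r for r in reversed(records)
                   if r["kind"] == "flight_footer"), {})
    return {
        "records_by_kind": dict(by_kind),
        "eviction_batches": by_kind.get("c6.eviction_batch", 0),
        "segment_rollovers": by_kind.get("segment_rollover", 0),
        "top_errors": error_msgs.most_common(top_n),
        "clean_shutdown": footer.get("clean_shutdown"),  # None if no footer
    }
```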

Flight Analytics (cycle-2 polish, planned)

  • FDR replay viewer — interactive timeline of (flight_id, frame_id) correlated records.
  • NFT-PERF baseline tracker — frame deadline miss rate, thermal headroom, end-to-end pose latency tracked across flights.

Deploy Audit (suite-mandated)

Per ../_infra/ci/README.md → "OCI image labels and commit provenance (AZ-204)" and ../_infra/deploy/jetson/README.md → "Audit: what is this device running?":

  • Every image (companion-jetson, companion-tier1, operator-orchestrator) is built with:
    • OCI labels: org.opencontainers.image.revision=$CI_COMMIT_SHA, org.opencontainers.image.created=<UTC RFC 3339>, org.opencontainers.image.source=$CI_REPO_URL.
    • ENV AZAION_SERVICE=gps-denied-onboard + ENV AZAION_REVISION=$CI_COMMIT_SHA.
  • Watchtower's post-update hook emits one AZAION_UPDATE_EVENT line per applied update into journald, carrying the new revision SHA + service name + timestamp + outcome.
  • The operator runs journalctl -g AZAION_UPDATE_EVENT on any Jetson to answer "what is this device running and when did it last update?".

Self-verification

  • Structured logging format defined with required fields (timestamp, level, service, component, flight_id, frame_id, kind, msg, kv, exc)
  • Per-environment LOG_SINK destination + retention tabulated
  • FDR-based metrics surface enumerated (every fdr_record_schema v1.3.0 kind mapped to its operator-relevant meaning)
  • Device telemetry (jetson-stats / jtop) source + sample rate + consumer (D-CROSS-LATENCY-1 hybrid trigger)
  • Tracing stance recorded — no W3C Trace Context / OTel SDK on airborne (justified by single-process pipeline + NFT-SEC-05); operator-side correlation_id pattern documented; OTel deferred to cycle-2 polish
  • Alert severities + response times defined across the four touchpoints: airborne in-flight, post-flight operator workstation, CI, deploy/update audit (AZAION_UPDATE_EVENT)
  • Operational-secret leakage controls in place (no key bytes / API tokens / Postgres credentials in logs; KeySource is the only key holder)
  • Dashboards inventoried — cycle-1 reality (Woodpecker UI, jtop, post-landing summary) explicit; cycle-2 polish (Grafana, FDR replay viewer, NFT-PERF tracker) logged as follow-ups
  • Suite-mandated deploy audit chain (AZAION_UPDATE_EVENT + OCI labels + AZAION_REVISION env) referenced from ../_infra/ docs

Future Work (cycle-2 polish)

  1. Prometheus /metrics on operator-orchestrator — cycle-2 wires an in-process exporter for operator-workstation-side metrics (flights REST round-trip latency, satellite-provider download throughput, tile manifest content-hash failures). The airborne image stays off this path per NFT-SEC-05.
  2. Grafana dashboard fed by post-landing-parsed FDR rollups — single pane of glass for per-flight + cross-flight trends.
  3. OpenTelemetry SDK on operator-orchestrator only — instruments FlightsApiClient + satellite-provider HTTP client with W3C Trace Context propagation. Out of scope for airborne.
  4. Per-repo Slack channel (#gps-denied-ci for CI, #gps-denied-ops for post-flight): ci_cd_pipeline.md Future Work #8 already logs the CI half; this doc adds the ops half.
  5. FDR replay viewer — interactive timeline of (flight_id, frame_id) correlated records; consumes FDR segments via the fdr_record_schema v1.3.0 parser.
  6. NFT-PERF baseline tracker — automated frame-deadline-miss-rate + thermal-headroom + end-to-end pose latency trending across flights, gated by AZ-595 SITL replay fixture + AZ-592/AZ-593 Tier-2 OKVIS2/VINS-Mono wiring.
  7. Centralised log aggregator on the operator workstation — Loki / journald-export-to-cloud once the operator network egress allows it; cycle-1 leaves journald at host-default retention.