# GPS-Denied Onboard — Observability

> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 5. Builds on
> Step 1 (`reports/deploy_status_report.md`), Step 2 (`containerization.md`),
> Step 3 (`ci_cd_pipeline.md`), and Step 4 (`environment_strategy.md`). The
> deploy skill's standard observability template (Prometheus `/metrics` +
> OpenTelemetry + PagerDuty) is adapted here for an airborne autonomous
> system: the airborne image has **no inbound listeners** (NFT-SEC-05
> in-flight egress lockdown), so the canonical observability surface is the
> on-device **Flight Data Recorder (FDR)** binary ring buffer, replayed
> off-flight by post-landing tooling. Operator workstation + CI keep the
> conventional logging-to-stdout / journald patterns.

## Observability Architecture (one-paragraph)

The airborne image (`companion-jetson` / `companion-tier1`) writes **structured FDR records** to a 64 GB ring buffer (`/var/lib/gps-denied/fdr`) via the `shared_fdr_client` (`producer → SPSC ring → C13 writer`). Logs at `WARN` and above are forwarded into FDR as `kind="log"` records by the `fdr_log_bridge` (AZ-267); below-WARN logs go to `LOG_SINK` (`console` in dev, `journald` on the operator workstation, `fdr` on airborne — never to file). Telemetry is captured as kind-specific FDR records (`vio.tick`, `state.tick`, `tile_match`, `c6.write`, `c6.eviction_batch`, etc.) rather than via a Prometheus endpoint, because no inbound TCP is permitted in flight. Post-flight tooling on the operator workstation parses the FDR segments using the **frozen, versioned `fdr_record_schema` v1.3.0** and feeds Grafana / Jupyter / one-off scripts. The suite-mandated **`AZAION_UPDATE_EVENT` journald audit chain** + OCI image labels (`org.opencontainers.image.revision/created/source`) + `ENV AZAION_REVISION=$CI_COMMIT_SHA` form the deploy-side audit trail (AZ-204).
**`jetson-stats` (`jtop`) device telemetry** (thermal zones, CPU/GPU clocks, power rails) is sampled by C7 + C4 to drive the `D-CROSS-LATENCY-1` auto-degrade hybrid trigger; samples land in FDR alongside the matcher / pose ticks.

## Logging

### Format

Structured records to `LOG_SINK`. No file-based logging in containers. The `LOG_SINK` env var (Step 4) selects the destination per environment.

#### Common log envelope (per-record fields)

Source of truth: `_docs/02_document/contracts/shared_log_bridge/log_record_schema.md` v1.0.0 — referenced by the `fdr_log_bridge` (AZ-267). Every onboard log record carries:

```json
{
  "timestamp": "2026-05-10T03:14:15.123456Z",
  "level": "INFO",
  "service": "gps-denied-onboard",
  "component": "c2_vpr",
  "flight_id": "",
  "frame_id": 12345,
  "kind": "vpr.warmup",
  "msg": "loaded",
  "kv": {"model": "salad"},
  "exc": null
}
```

| Field | Purpose | Notes |
|-------|---------|-------|
| `timestamp` | ISO 8601 UTC, microsecond precision | RFC 3339 with `Z` suffix |
| `level` | `DEBUG \| INFO \| WARN \| ERROR` | `WARN` + `ERROR` are also mirrored into FDR via `fdr_log_bridge` |
| `service` | `gps-denied-onboard` | Constant per submodule |
| `component` | Module slug from `module-layout.md` (`c2_vpr`, `c6_tile_cache.store`, `shared.fdr_client`, …) | Matches `producer_id` on the corresponding FDR record |
| `flight_id` | UUID assigned at flight open by C13 (`flight_header`) | Correlation across all components within one flight |
| `frame_id` | Monotonic per-frame counter from `runtime_root` | Cross-component frame correlation (VIO ↔ matcher ↔ state) |
| `kind` | Dotted snake_case event tag (closed enum per component) | E.g. `vpr.warmup`, `c6.evict.budget`, `c8.signing_key_rotation` |
| `msg` | Short human-readable event description | No PII; no secrets; no file payloads |
| `kv` | Bag of typed scalars | JSON-safe; no nested blobs > 4 KiB |
| `exc` | Optional exception class + traceback | Present only on `ERROR`; truncated to 4 KiB |

### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring offline review | `c5.solver.diverged`, `c8.signing_handshake_failed`, `c6.write_failed` |
| WARN | Degraded operation, retry, fallback engaged | `c4.pose.degraded_to_pnp`, `c6.freshness.rejected`, `c7.tensorrt_engine_rebuild` |
| INFO | Significant in-flight business events | `c8.signing_key_rotation`, `flight_header`, `flight_footer`, `c11.upload_batch_queued` |
| DEBUG | Detailed diagnostics (dev only) | Per-frame VIO covariance dump, full matcher correspondences list |

`WARN` + `ERROR` are mirrored into FDR via `fdr_log_bridge` (AZ-267) so they survive a post-landing `journalctl` clear. `INFO` + `DEBUG` go only to `LOG_SINK`.

### Destinations and Retention

| Environment | `LOG_SINK` | Destination | Retention |
|-------------|------------|-------------|-----------|
| Development (Tier-1 Docker) | `console` | Docker container stdout (`docker compose logs companion`) | Session — cleared on `docker compose down` |
| CI (Woodpecker) | `console` | Woodpecker UI stdout capture | Per the suite Woodpecker retention policy (operator-managed; today ≤ 30 days) |
| Staging (lab Jetson) | `journald` | Host journald | Per the host's `journald.conf` (suite default: ~7 days rolling) |
| Production — airborne | `fdr` | FDR ring buffer at `/var/lib/gps-denied/fdr` (≥ 64 GB) | Bounded by ring capacity; rolls over per `segment_rollover` FDR record. Post-flight operator pulls segments to long-term storage on the operator workstation. |
| Production — operator workstation | `journald` | Host journald | Per the host's `journald.conf` (operator-managed; recommendation: 30 days for the operator-orchestrator service unit) |

### "PII" Rules (read: operational secrets)

This system has no end-user PII surface — flights, MAVLink, and tile data are operational rather than personal. The equivalent restrictions are **operational-secret leakage** controls:

- **Never log** MAVLink 2.0 signing key bytes, per-flight onboard signing key bytes, `satellite-provider` API tokens, registry tokens, or Postgres credentials. The `KeySource` Protocol (C8) is the only component that ever holds key material, and its log path emits **only** the rotation event tag + key fingerprint (SHA-256 first 8 bytes), never the key.
- **Mask** absolute file paths in any record that references operator-specific layouts (e.g. `/Users//…` collapsed to `~/…`).
- **Never log** raw camera frame bytes or full tile JPEGs inline — they go to sidecar paths via FDR's `failed_tile_thumbnail` (≤ 0.1 Hz rate cap) or `mid_flight_tile_snapshot`.
- **Never log** raw GPS coordinates unless the flight's `restricted_geographic_log_redaction` config is `off` (operator-set at takeoff load).

## Telemetry (FDR-based, not Prometheus)

### Why FDR, not Prometheus / OTel

The airborne image runs under NFT-SEC-05 (in-flight egress lockdown — no inbound listeners, outbound only to the FC over UART/USB and to QGroundControl over MAVLink 2.0 1–2 Hz downsampled summary). A `/metrics` HTTP endpoint would violate this, and a push-mode OTel exporter has no in-flight collector to reach. The FDR ring is the canonical telemetry sink; post-flight tooling converts FDR records into whatever observability backend the operator prefers (Grafana, Jupyter, ad-hoc scripts).

The **operator workstation** is *not* in-flight-locked-down; cycle-2 may add a Prometheus `/metrics` endpoint on the `operator-orchestrator` service (see "Future Work" below).
Cycle-1 leaves both the operator-orchestrator and airborne side on the FDR + structured logs path for consistency.

### FDR Record Kinds (cycle-1 metrics surface)

Source of truth: `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` v1.3.0. Each `kind` is the metric.

| Metric (FDR `kind`) | Producer | Type (intent) | What it tells the operator |
|---------------------|----------|----------------|----------------------------|
| `vio.tick` | C1 | per-frame snapshot | VIO output (`R`, `t`), pose covariance proxies, last-anchor age, monocular reproj error, IMU bias norm |
| `state.tick` | C5 | per-frame snapshot | Smoothed fused-pose tick from iSAM2 (or ESKF baseline) + 2x2 covariance + estimator label |
| `tile_match` | C2.5 / C3 | per-match snapshot | Tile id, VPR score, match count, RANSAC inlier count |
| `c6.write` | C6 | counter-ish (per-tile) | Successful `write_tile` — tile id, source, disk bytes, content SHA-256 |
| `c6.write_failed` | C6 | counter-ish (per-failure) | Failed `write_tile` — `reason ∈ {content_hash_mismatch, freshness_reject, metadata_error, fs_error}` |
| `c6.freshness.rejected` | C6 | counter-ish (per-reject) | Active-conflict-stale tile rejected — `tile_id`, `age_seconds`, threshold |
| `c6.freshness.downgraded` | C6 | counter-ish (per-downgrade) | Stable-rear-stale tile downgraded — same shape as rejected |
| `c6.eviction_batch` | C6 | batch counter (per sweep) | Cache budget enforcer evicted N tiles to make room — trigger tile, freed bytes, count, first 5 evicted ids |
| `overrun` | `shared.fdr_client` | counter (per drop) | FDR ring overrun — `producer_id` of the originating queue + dropped count (`> 0`). AC-NEW-3: never silent. |
| `segment_rollover` | C13 writer | counter (per rotation) | Segment file rotated (including 64 GB cap drops) |
| `failed_tile_thumbnail` | C6 / C11 | rate-capped sample | Forensic JPEG thumbnail (≤ 0.1 Hz). AC-8.5 |
| `mid_flight_tile_snapshot` | C13 snapshot path | sample pointer | Mid-flight tile snapshot pointer (sidecar). AC-8.4 |
| `flight_header` | C13 writer | once-per-flight | `flight_id`, start ISO/monotonic, config snapshot, signing-key rotation event, manifest content hashes, build info |
| `flight_footer` | C13 writer | once-per-flight | `flight_id`, end ISO/monotonic, records written / dropped (overrun) / bytes / rollover count / clean-shutdown flag |

### Device Telemetry (`jetson-stats` / `jtop`)

`D-CROSS-LATENCY-1` requires runtime thermal + power + GPU clock telemetry to drive the auto-degrade hybrid trigger (frame deadline missed × thermal headroom). Cycle-1 source: `jetson-stats` (`jtop`) accessed inside the `companion-jetson` container via `runtime: nvidia` + the nvidia-container-runtime device passthrough — same pattern the suite's `detections` service uses on the same hardware.

| Signal | Source | Sample rate | Consumer |
|--------|--------|-------------|----------|
| GPU clock (MHz) | `jtop.gpu` | 1 Hz | C7 (degrade gate); recorded into FDR via `c7.device_telemetry` log records (`kind="c7.thermal_headroom"`) |
| GPU/CPU temperature (°C) | `jtop.temperature` | 1 Hz | C4 / C7 hybrid trigger |
| Power draw (mW) | `jtop.power` | 1 Hz | Cycle-2 derate hysteresis |
| Memory pressure | `jtop.memory` | 1 Hz | C6 eviction batch hysteresis |

Cycle-1: `jtop` runs in-process inside the companion container; samples are emitted as FDR `kind="c7.thermal_headroom"` records. Cycle-2 may move this to a sidecar Python thread once the Step 2 BLOCKING gate "`jetson-stats` thermal telemetry under Docker" (`containerization.md` § Step 2 Validation Gates) is signed off on the real Tier-2 Jetson.
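
As an illustration of the hybrid trigger described above (frame deadline missed combined with low thermal headroom), the sketch below shows one plausible shape for the gate. The `DeviceSample` type, the function names, and both thresholds (`throttle_limit_c`, `min_headroom_c`) are hypothetical and are not taken from the C4 / C7 implementation or from `jtop`; the real trigger may weight or debounce these signals differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceSample:
    """One 1 Hz device sample (hypothetical field names)."""
    gpu_temp_c: float
    gpu_clock_mhz: float


def thermal_headroom_c(sample: DeviceSample, throttle_limit_c: float) -> float:
    """Degrees Celsius left before the assumed throttle limit."""
    return throttle_limit_c - sample.gpu_temp_c


def should_degrade(
    deadline_missed: bool,
    sample: DeviceSample,
    throttle_limit_c: float = 85.0,  # assumed value, not from the source
    min_headroom_c: float = 5.0,     # assumed value, not from the source
) -> bool:
    """Fire only when BOTH conditions hold: a missed frame deadline
    and thermal headroom below the minimum."""
    low_headroom = thermal_headroom_c(sample, throttle_limit_c) < min_headroom_c
    return deadline_missed and low_headroom
```

Requiring both conditions keeps a single slow frame on a cool device, or a hot device that still meets its deadlines, from forcing a backend downgrade.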
### Collection Interval

| Source | Interval |
|--------|----------|
| Per-frame producers (C1 `vio.tick`, C5 `state.tick`, C3 `tile_match`) | Camera frame cadence (target ≥ 4 Hz on Tier-2; per `_docs/02_document/architecture.md` Vision) |
| Per-write producers (C6 `c6.write`, `c6.write_failed`, `c6.freshness.*`) | Per-event (write-path triggered) |
| Per-batch producers (C6 `c6.eviction_batch`) | Per-sweep (only when ≥ 1 tile evicted) |
| `jetson-stats` (`jtop`) | 1 Hz |
| `flight_header` / `flight_footer` | Once per flight |
| `segment_rollover` | Per segment rotation |

There is no Prometheus-style "scrape interval" because there is no scraping endpoint — the FDR ring is push-only from producers, drained by C13's writer thread.

## Distributed Tracing

### Architecture stance (cycle-1)

**No W3C Trace Context. No OpenTelemetry SDK.** The airborne image's correlation key is the pair `(flight_id, frame_id)`:

- `flight_id` (UUID) is assigned at flight open by C13 and written into `flight_header`. Every log record and FDR record within that flight carries it.
- `frame_id` (monotonic per-frame counter) is assigned by the composition root's frame pipeline. Every per-frame FDR record (`vio.tick`, `state.tick`, `tile_match`, `c6.write` …) carries it.

This is sufficient because the airborne pipeline is **in-process, single-camera, single-FC** — there are no inter-service RPC hops to trace. Post-flight tooling reconstructs the per-frame causal chain by joining FDR records on `(flight_id, frame_id)`.

The **operator workstation** has more conventional inter-service traffic (C12 ↔ `flights` REST, C11 ↔ `satellite-provider` REST). Cycle-1 traces these by:

- Per-request log records with the request URL + status + duration_ms + a generated `correlation_id`.
- `FlightsApiClient` and the `satellite-provider` HTTP client both stamp this correlation id on the request line + response log.
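
The airborne-side join on `(flight_id, frame_id)` described above can be sketched as follows. This is a minimal illustration that assumes FDR records have already been decoded into plain dictionaries carrying `flight_id`, `frame_id`, and `kind` keys; the actual decoded shape is defined by `fdr_record_schema` v1.3.0 and may differ, and `join_by_frame` is a hypothetical helper name.

```python
from collections import defaultdict


def join_by_frame(records):
    """Group decoded FDR records by (flight_id, frame_id).

    Returns {(flight_id, frame_id): {kind: [records...]}}.
    Records without a frame_id (for example a once-per-flight header)
    are skipped because they do not belong to a per-frame chain.
    """
    chains = defaultdict(dict)
    for rec in records:
        frame_id = rec.get("frame_id")
        if frame_id is None:
            continue
        key = (rec["flight_id"], frame_id)
        chains[key].setdefault(rec["kind"], []).append(rec)
    return dict(chains)
```

Looking up one key then gives every record kind produced for that frame, which is enough to follow a frame from VIO through matching to the fused state without any trace-context propagation.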
OpenTelemetry SDK + W3C Trace Context propagation is a **cycle-2 polish item** for the operator-orchestrator only — not for the airborne image. Logged in "Future Work" below.

### Sampling

| Environment | Effective sampling rate | Rationale |
|-------------|--------------------------|-----------|
| Development | 100% | FDR + logs both on |
| Staging (lab Jetson) | 100% | Full visibility for IT-12 / NFT-PERF runs |
| Production — airborne | 100% per-frame for `vio.tick`/`state.tick`/`tile_match`; `failed_tile_thumbnail` rate-capped at ≤ 0.1 Hz | FDR ring is the only post-landing forensic record; full per-frame capture is mandatory. Rate caps live on byte-heavy forensic records only. |
| Production — operator workstation | 100% INFO+; DEBUG off | Operator workstation has full disk; cost is not a concern. |

## Alerting

### Airborne (in-flight)

**No real-time alerting from the airborne image.** Autonomy: the FC handles in-flight failsafe (`SAFE_DEAD_RECKONING`, `RTL`, `LAND` etc. per AC-FC-FAILSAFE-1). The companion does not have a network path to a human operator in flight — its only outbound channel is the MAVLink 2.0 1–2 Hz downsampled summary to QGroundControl, which surfaces companion health via STATUSTEXT messages and the parent suite's `GpsDeniedHealth` MAVLink message.
Alert-equivalents on the airborne side:

| Event | Detected by | In-flight signal |
|-------|-------------|------------------|
| Companion process died | FC adapter watchdog timeout | FC drops to `SAFE_DEAD_RECKONING`; operator sees lost telemetry in QGC |
| `D-CROSS-LATENCY-1` deadline miss + thermal headroom low | C4 / C7 hybrid trigger | Auto-degrade to lower-cost C7 backend; STATUSTEXT to QGC + FDR `kind="c7.degrade"` |
| C8 signing handshake failed | C8 FC adapter | Refuses takeoff; STATUSTEXT to QGC + FDR `kind="c8.signing_handshake_failed"` |
| FDR ring overrun | `shared.fdr_client` drop-oldest hook | Emits `kind="overrun"` (AC-NEW-3); post-flight forensics tag |
| Segment cap reached (64 GB) | C13 writer | Emits `kind="segment_rollover"` with cap-drop flag; oldest data lost — flag surfaces post-flight |

### Post-Flight (operator workstation)

Post-flight analysis runs the FDR segments through the post-landing tooling. Alerts surface in the operator's environment:

| Severity | Response time | Condition | Cycle-1 channel |
|----------|---------------|-----------|------------------|
| Critical | Pre-next-flight gate (≤ 10 min before takeoff) | `flight_footer.clean_shutdown == false`; `kind="c8.signing_handshake_failed"` observed; FDR overrun count > 0 above per-flight threshold | Operator UI block + Slack `#gps-denied-ops` (cycle-2 once the channel is wired); cycle-1: operator's local terminal output from post-landing tooling |
| High | Same-day | C6 eviction batch > 100 in one flight; tile_match score histogram drifted vs operator baseline | Same as above |
| Medium | Within 1 week | Cumulative thermal-headroom-low events trending up across recent flights | Operator dashboard (cycle-2) |
| Low | Recorded in flight summary only | Non-critical warnings (FDR `kind="log"` at WARN level) | Flight summary PDF / Markdown |

### CI (Woodpecker pipelines)

| Severity | Response time | Condition | Channel |
|----------|---------------|-----------|---------|
| Critical | Same business day | `01-test.yml` failure on `main` branch | Woodpecker UI; per-repo Slack channel (cycle-2 follow-up — `ci_cd_pipeline.md` Future Work #8) |
| High | Within 24 h | `02-build-push.yml` build failure on any push branch | Woodpecker UI |
| Medium | Next business day | Lint / coverage gate fail (cycle-2; cycle-1 has neither) | n/a in cycle-1 |
| Low | Next sprint review | Non-critical pipeline warnings | n/a |

### Deploy / Update (Watchtower)

| Severity | Response time | Condition | Channel |
|----------|---------------|-----------|---------|
| Critical | Immediate | Watchtower post-update hook emits `AZAION_UPDATE_EVENT severity=error` to journald (image pull failed, container crash on restart) | journald + suite operator's `journalctl -g AZAION_UPDATE_EVENT` audit chain |
| Informational | None | Watchtower applied an update during a non-flight window (`/run/azaion/in-flight` cleared) | `AZAION_UPDATE_EVENT severity=info` to journald — audit only |

## Dashboards

### Operations (cycle-1 — what exists today)

- **Suite Woodpecker UI** — CI pipeline status per branch + commit; the only "live" operations dashboard cycle-1 ships.
- **`jtop` on the bench** — operator runs `sudo jtop` on the lab / airborne Jetson during staging / pre-flight to observe thermal + GPU clock + power. Not a service dashboard; it's a CLI tool.
- **`docker ps` + `docker compose logs`** — the operator's `dev`-environment dashboard on the operator workstation.

### Operations (cycle-2 polish, planned)

- **Grafana dashboard** fed by post-landing-parsed FDR records — service health per component (FDR record kinds rolled up into rates), thermal trend, eviction count, tile_match score distribution.
- **Prometheus `/metrics` on operator-orchestrator** — once the operator workstation cycle-2 wires this, the Grafana dashboard pulls live operator-side metrics alongside post-landing FDR rollups.
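
The "FDR record kinds rolled up into rates" idea for the planned Grafana dashboard could look roughly like the sketch below. It assumes records are already decoded into dictionaries with a `kind` key and that the flight duration is known in seconds (for example from the `flight_header` / `flight_footer` pair); `kind_rates_hz` is a hypothetical helper, not part of the existing post-landing tooling.

```python
from collections import Counter


def kind_rates_hz(records, flight_duration_s):
    """Return the mean emission rate (Hz) of each FDR kind over one flight."""
    if flight_duration_s <= 0:
        raise ValueError("flight_duration_s must be positive")
    counts = Counter(rec["kind"] for rec in records)
    return {kind: n / flight_duration_s for kind, n in counts.items()}
```

A per-kind rate like this makes gaps easy to spot: a `vio.tick` rate well under the camera cadence, or a non-zero `overrun` rate, stands out without reading individual records.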
### Flight Analytics (cycle-1 — what exists today)

- **Per-flight summary** generated by post-landing tooling (Markdown / PDF) — records written / dropped, segment count, top-N error log lines, eviction count, signing-key rotation event log, `flight_footer.clean_shutdown` flag. Stored alongside the FDR segments under `_docs/06_metrics/flights//` (cycle-2 publishes; cycle-1 staging dir is operator-local).

### Flight Analytics (cycle-2 polish, planned)

- **FDR replay viewer** — interactive timeline of `(flight_id, frame_id)` correlated records.
- **NFT-PERF baseline tracker** — frame deadline miss rate, thermal headroom, end-to-end pose latency tracked across flights.

## Deploy Audit (suite-mandated)

Per `../_infra/ci/README.md` → "OCI image labels and commit provenance (AZ-204)" and `../_infra/deploy/jetson/README.md` → "Audit: what is this device running?":

- Every image (`companion-jetson`, `companion-tier1`, `operator-orchestrator`) is built with:
  - OCI labels: `org.opencontainers.image.revision=$CI_COMMIT_SHA`, `org.opencontainers.image.created=`, `org.opencontainers.image.source=$CI_REPO_URL`.
  - `ENV AZAION_SERVICE=gps-denied-onboard` + `ENV AZAION_REVISION=$CI_COMMIT_SHA`.
- Watchtower's post-update hook emits one `AZAION_UPDATE_EVENT` line per applied update into journald, carrying the new revision SHA + service name + timestamp + outcome.
- The operator runs `journalctl -g AZAION_UPDATE_EVENT` on any Jetson to answer "what is this device running and when did it last update?".
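
For post-processing the output of `journalctl -g AZAION_UPDATE_EVENT`, a small parser along these lines may be useful. Only the `AZAION_UPDATE_EVENT` marker and the `severity=` field appear in this document; the whitespace-separated `key=value` layout and the `service` / `revision` field names used in the usage note are assumptions for illustration, so check the hook's real output before relying on them.

```python
import shlex

MARKER = "AZAION_UPDATE_EVENT"


def parse_update_event(line):
    """Extract key=value fields that follow the audit marker.

    Returns None when the line does not contain the marker.
    Assumes fields are whitespace-separated key=value tokens
    (values may be shell-style quoted); this layout is a guess.
    """
    _, found, rest = line.partition(MARKER)
    if not found:
        return None
    fields = {}
    for token in shlex.split(rest):
        key, eq, value = token.partition("=")
        if eq:  # ignore tokens that are not key=value
            fields[key] = value
    return fields
```

Feeding it a line such as `... AZAION_UPDATE_EVENT severity=info service=gps-denied-onboard revision=abc123` would yield a dictionary keyed by `severity`, `service`, and `revision`, which can then be sorted by timestamp to answer what a device is running.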
## Self-verification

- [x] Structured logging format defined with required fields (timestamp, level, service, component, `flight_id`, `frame_id`, kind, msg, kv, exc)
- [x] Per-environment `LOG_SINK` destination + retention tabulated
- [x] FDR-based metrics surface enumerated (every `fdr_record_schema` v1.3.0 kind mapped to its operator-relevant meaning)
- [x] Device telemetry (`jetson-stats` / `jtop`) source + sample rate + consumer (D-CROSS-LATENCY-1 hybrid trigger)
- [x] Tracing stance recorded — no W3C Trace Context / OTel SDK on airborne (justified by single-process pipeline + NFT-SEC-05); operator-side correlation_id pattern documented; OTel deferred to cycle-2 polish
- [x] Alert severities + response times defined across the four touchpoints: airborne in-flight, post-flight operator workstation, CI, deploy/update audit (`AZAION_UPDATE_EVENT`)
- [x] Operational-secret leakage controls in place (no key bytes / API tokens / Postgres credentials in logs; `KeySource` is the only key holder)
- [x] Dashboards inventoried — cycle-1 reality (Woodpecker UI, `jtop`, post-landing summary) explicit; cycle-2 polish (Grafana, FDR replay viewer, NFT-PERF tracker) logged as follow-ups
- [x] Suite-mandated deploy audit chain (`AZAION_UPDATE_EVENT` + OCI labels + `AZAION_REVISION` env) referenced from `../_infra/` docs

## Future Work (cycle-2 polish)

1. **Prometheus `/metrics` on `operator-orchestrator`** — cycle-2 wires an in-process exporter for operator-workstation-side metrics (`flights` REST round-trip latency, `satellite-provider` download throughput, tile manifest content-hash failures). The airborne image stays off this path per NFT-SEC-05.
2. **Grafana dashboard fed by post-landing-parsed FDR rollups** — single pane of glass for per-flight + cross-flight trends.
3. **OpenTelemetry SDK on `operator-orchestrator` only** — instruments `FlightsApiClient` + `satellite-provider` HTTP client with W3C Trace Context propagation. Out of scope for airborne.
4. **Per-repo Slack channel (`#gps-denied-ci` for CI, `#gps-denied-ops` for post-flight)** — `ci_cd_pipeline.md` Future Work #8 already logs the CI half; this doc adds the ops half.
5. **FDR replay viewer** — interactive timeline of `(flight_id, frame_id)` correlated records; consumes FDR segments via the `fdr_record_schema` v1.3.0 parser.
6. **NFT-PERF baseline tracker** — automated frame-deadline-miss-rate + thermal-headroom + end-to-end pose latency trending across flights, gated by AZ-595 SITL replay fixture + AZ-592/AZ-593 Tier-2 OKVIS2/VINS-Mono wiring.
7. **Centralised log aggregator on the operator workstation** — Loki / journald-export-to-cloud once the operator network egress allows it; cycle-1 leaves journald at host-default retention.