# GPS-Denied Onboard — Observability

> Date: 2026-05-09 (Plan Phase 2c — initial draft).
> Inputs: `_docs/02_document/architecture.md` § 7 (Audit logging) + § 6 (NFRs); `_docs/02_document/data_model.md` § 2.8 (FDR); ADR-005 (Tier-1 / Tier-2); AC-NEW-3 (FDR ≤ 64 GB / no silent drops); AC-NEW-5 (operating envelope).

## Observability is asymmetric by design

Most CI/CD templates assume a network-connected service that pushes structured logs to an aggregator and exposes Prometheus metrics for live scraping. **This project's airborne profile does not.** Architecture.md ADR-004 + § 7 + Principle #4 require **no inbound network listening and no outbound network egress in flight** (NFT-SEC-05 enforces). The Jetson is operating as an embedded edge device, not a service.

Observability therefore splits into three regimes:

| Regime | Where | Live or post-flight | Primary mechanism |
|---|---|---|---|
| **In-flight onboard** | Production Jetson, in flight | Live (to FDR ring) + best-effort live (to GCS) | FDR binary record stream + GCS STATUSTEXT / NAMED_VALUE_FLOAT |
| **Post-flight onboard** | Operator workstation after pulling the FDR | Post-flight | FDR replay + visualization in operator-orchestrator C12 |
| **CI / dev (Tier-1, Tier-2)** | Workstation Docker / Jetson CI runner | Live | Standard structured logging + Prometheus metrics endpoint where applicable |

The sections below are organized by regime.

## 1. In-flight onboard (production Jetson)

### 1.1 FDR (Flight Data Recorder) — primary observability sink

Schema is in `data_model.md` § 2.8. Every observable event in flight goes through FDR. The FDR is **append-only**, **lossy on overrun (logged, never silent)**, and **per-flight ring-bounded at ≤ 64 GB** (AC-NEW-3).

Observability events that emit FDR records:

| Component | Event | FDR record type |
|---|---|---|
| C8 outbound | Every emitted `EmittedExternalPosition` to FC | `0x0001 EmittedExternalPosition` |
| C8 inbound | Every received MAVLink frame (raw `tlog`-style) | `0x0003 ReceivedMavlinkRaw` |
| C8 inbound (iNav) | Every received MSP2 frame | `0x0004 ReceivedMsp2Raw` |
| C8 inbound | IMU window forwarded to C1 / C5 | `0x0002 ImuTrace` |
| C5 | Source-label transition (`satellite_anchored` ↔ `visual_propagated` ↔ `dead_reckoned`) | `0x0006 SourceLabelTransition` |
| C5 + C8 | Spoofing-promotion / -rejection event | `0x000C SpoofingPromotionEvent` |
| C5 | VISUAL_BLACKOUT entry / exit (AC-3.5, AC-NEW-8) | `0x000B VisualBlackoutEvent` |
| C6 | Mid-flight tile emit | `0x0007 MidFlightTileEmitted` |
| C6 | Mid-flight tile failure (with thumbnail filename, AC-8.5 forensic exception) | `0x0008 MidFlightTileFailed` |
| C7 (inference) | Thermal-throttle hybrid switch K=3 ↔ K=2 | `0x000E ThermalThrottleHybridSwitch` |
| C8 | MAVLink-2.0 signing key rotation event (D-C8-9) | `0x0009 MavlinkSigningKeyRotated` |
| C8 | EKF source-set switch event (D-C8-2 = (b)) | `0x000A EkfSourceSetCommand` |
| C10 | Pre-flight content-hash gate fail | `0x000D ContentHashGateFail` |
| All components | Lifecycle events (start / stop / fail) | `0x000F ComponentLifecycleEvent` |
| `jetson-stats` collector (driven by C7 or a dedicated thread) | Per-second sample of CPU%, GPU%, temp, throttle flag, RAM, VRAM, NVM remaining | `0x0005 SystemHealth` |

**Lossy-on-overrun rule (AC-NEW-3 enforcement)**: if the FDR writer cannot keep up (NVM I/O bound), the writer drops the **oldest segment** in the current flight's ring AND emits a `0x000F ComponentLifecycleEvent` of type `fdr_segment_dropped` to the new head segment. A segment drop is a hard observability signal — it appears in the post-flight report and in the GCS STATUSTEXT stream. There is no path that silently discards an event.

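The drop rule can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the project's writer: `FdrRing`, `Segment`, the byte budgets, and the text payload of the lifecycle record are hypothetical; a byte budget stands in for NVM back-pressure as the drop trigger; and "new head segment" is read here as the segment currently being written.

```python
from collections import deque
from dataclasses import dataclass, field

LIFECYCLE = 0x000F  # ComponentLifecycleEvent record type from the table above


@dataclass
class Segment:
    index: int
    records: list = field(default_factory=list)
    size: int = 0

    def add(self, record_type: int, payload: bytes) -> None:
        self.records.append((record_type, payload))
        self.size += len(payload)


class FdrRing:
    """Toy model: drop the oldest segment and log the drop, never silently."""

    def __init__(self, max_bytes: int, segment_bytes: int) -> None:
        self.max_bytes = max_bytes          # hypothetical ring budget
        self.segment_bytes = segment_bytes  # hypothetical segment size
        self.segments = deque([Segment(index=0)])
        self.drops = 0

    def total_bytes(self) -> int:
        return sum(s.size for s in self.segments)

    def append(self, record_type: int, payload: bytes) -> None:
        head = self.segments[-1]
        # Roll to a fresh segment when the current one is full.
        if head.records and head.size + len(payload) > self.segment_bytes:
            head = Segment(index=head.index + 1)
            self.segments.append(head)
        head.add(record_type, payload)
        # Over budget: drop oldest segments, recording each drop in the
        # segment being written so the loss is visible post-flight.
        while self.total_bytes() > self.max_bytes and len(self.segments) > 1:
            dropped = self.segments.popleft()
            self.drops += 1
            note = f"fdr_segment_dropped seg={dropped.index}".encode()
            head.add(LIFECYCLE, note)


ring = FdrRing(max_bytes=100, segment_bytes=40)
for _ in range(10):
    ring.append(0x0005, b"x" * 20)  # SystemHealth-sized dummy payloads
print(ring.drops, ring.total_bytes())
```

The drop notice is written into the live segment without rolling it, so the loop removes one segment per iteration and always terminates. A real writer would presumably also mirror the drop to STATUSTEXT, as § 1.2 describes.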
**Format**: length-prefixed binary stream with `record_header` (magic `0x47464452 "GFDR"` + version + type + monotonic_ms) followed by a per-type body and a CRC32. New record types are additive (data_model.md § 6.5).

**Storage path**: `/var/lib/gps-denied/fdr/{flight_id}/segments/seg_NNNNN.bin`. Thumbnails (AC-8.5) live at `/var/lib/gps-denied/fdr/{flight_id}/thumbnails/`. A flight's `manifest.json` (the FDR-side mirror, distinct from the PostgreSQL `manifests` row) sits at the flight's root and carries the flight metadata snapshot.

### 1.2 GCS telemetry (best-effort, bandwidth-limited)

The GCS link is the only outbound channel from the airborne companion (per architecture.md § 7). Bandwidth is bounded (AC-6.1: 1–2 Hz downsampled summary). The companion emits:

| MAVLink message | Rate | Content |
|---|---|---|
| `STATUSTEXT` | event-driven (only when something changes) | Source label transitions; spoofing-promotion / -rejection; VISUAL_BLACKOUT entry / exit; signing key rotation; FDR segment drop; component start / fail; thermal-throttle hybrid switch |
| `NAMED_VALUE_FLOAT` | 1 Hz | `horiz_accuracy_m`, `vert_accuracy_m`, `vio_health` (frame-quality 0..1), `last_anchor_age_s`, `cpu_pct`, `gpu_pct`, `temp_c` |
| `GPS_RAW_INT` | 1–2 Hz (AC-6.1) | Mirror of the AP `GPS_INPUT` we just emitted, downsampled — gives the operator a live position view in QGC |

These are **best-effort** — packet loss on the GCS link is treated as normal. The FDR remains the source of truth.

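Both message types have fixed-width text fields in the MAVLink common message set: `NAMED_VALUE_FLOAT.name` is `char[10]` and `STATUSTEXT.text` is `char[50]` (MAVLink 2 adds `id` / `chunk_seq` fields for multi-chunk text). Several of the value names listed above are longer than 10 characters, so some on-wire alias is needed. A minimal sketch of fitting payloads to those limits follows; the alias table and both helper functions are hypothetical and do not come from the project's code.

```python
from __future__ import annotations

NAMED_VALUE_NAME_LEN = 10  # NAMED_VALUE_FLOAT.name is char[10]
STATUSTEXT_LEN = 50        # STATUSTEXT.text is char[50]

# Hypothetical on-wire aliases for the values in the table above.
WIRE_NAMES = {
    "horiz_accuracy_m": "h_acc_m",
    "vert_accuracy_m": "v_acc_m",
    "vio_health": "vio_health",
    "last_anchor_age_s": "anchor_s",
    "cpu_pct": "cpu_pct",
    "gpu_pct": "gpu_pct",
    "temp_c": "temp_c",
}


def wire_name(metric: str) -> bytes:
    """Return the alias for a metric, refusing anything over 10 bytes."""
    alias = WIRE_NAMES[metric].encode("ascii")
    if len(alias) > NAMED_VALUE_NAME_LEN:
        raise ValueError(f"alias for {metric!r} exceeds {NAMED_VALUE_NAME_LEN} bytes")
    return alias


def split_statustext(text: str) -> list[bytes]:
    """Split a status string into chunks that each fit one STATUSTEXT.

    Non-ASCII characters are replaced with '?' so chunk boundaries never
    fall inside a multi-byte sequence.
    """
    raw = text.encode("ascii", errors="replace")
    return [raw[i:i + STATUSTEXT_LEN]
            for i in range(0, len(raw), STATUSTEXT_LEN)] or [b""]


msg = "GPS-DENIED: spoofed FC GPS rejected (last visual consistency PASS 0.4s ago)"
for seq, chunk in enumerate(split_statustext(msg)):
    print(seq, chunk)
```

Each chunk would be sent as one `STATUSTEXT` sharing the same `id` with an increasing `chunk_seq`; because the link is best-effort, a receiver may see only part of a long message, which is one reason to keep the most important words at the front of the text.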
**STATUSTEXT severity mapping**:

| FDR event | STATUSTEXT severity | Example text |
|---|---|---|
| Source label → `dead_reckoned` | `MAV_SEVERITY_WARNING` | `"GPS-DENIED: dead-reckoned (last anchor 12.3s ago)"` |
| VISUAL_BLACKOUT entry | `MAV_SEVERITY_NOTICE` | `"GPS-DENIED: VISUAL_BLACKOUT entered (reason=low_features)"` |
| Spoofing rejected | `MAV_SEVERITY_NOTICE` | `"GPS-DENIED: spoofed FC GPS rejected (last visual consistency PASS 0.4s ago)"` |
| Spoofing promoted (10 s + visual gate passed) | `MAV_SEVERITY_INFO` | `"GPS-DENIED: FC GPS promoted to fused source"` |
| FDR segment dropped | `MAV_SEVERITY_WARNING` | `"GPS-DENIED: FDR segment 47 dropped (NVM bound)"` |
| Signing key rotation | `MAV_SEVERITY_INFO` | `"GPS-DENIED: MAVLink signing key rotated"` |
| Component fail | `MAV_SEVERITY_CRITICAL` | `"GPS-DENIED: VIO strategy fault — failover to FC IMU-only (AC-5.2)"` |

### 1.3 No console logging in flight

Production deployment binary refuses `LOG_LEVEL=DEBUG` by default (environment_strategy.md § Variable validation). The airborne companion has no operator-readable console — even ERROR-level logs go to journald + FDR rather than stdout. journald retention is 7 days on a rolling buffer (separate from the FDR's per-flight retention).

### 1.4 In-flight metrics are NOT scraped

There is no Prometheus endpoint on the production airborne companion. The justification matches § 1.3: there is no scraper to scrape it; metrics are recorded into FDR and visible via NAMED_VALUE_FLOAT only. CI / dev environments DO expose `/metrics` (see § 3 below).

## 2. Post-flight onboard (operator workstation)

When the operator plugs the companion in post-landing:

1. **FDR retrieval** (operator tooling C12 — feature, not in scope of this document's structure but observability-impacting): operator-orchestrator reads the FDR ring, copies it to the workstation, and seals the in-flight ring. The companion's per-flight ephemeral keys are deleted at this step (environment_strategy.md § Per-flight key lifecycle).
2. **Visualization** (operator tooling C12): the workstation renders:
   - Time-series of `horiz_accuracy`, `vert_accuracy`, `last_anchor_age_ms`, source label timeline, thermal-throttle hybrid switches, and CPU / GPU / temp.
   - Map view: emitted positions vs. (when available) FC `GLOBAL_POSITION_INT` ground truth.
   - Spoofing / VISUAL_BLACKOUT event markers overlaid on the timeline.
   - Per-flight summary: total mid-flight tiles emitted, FDR segment drops (if any), AC-NEW-4 / AC-NEW-7 statistics for this flight.
3. **NFT-RES-03 / NFT-SEC-01 corpus contribution**: if the operator opts in, the flight's emitted positions + FC ground truth are added to the AC-NEW-4 / AC-NEW-7 Monte-Carlo corpus for the next CI run.
4. **Forensic thumbnail review** (AC-8.5 exception): failed-tile thumbnails are visible in the operator UI for human review; this is the only image-data review surface.

## 3. CI / dev environments (Tier-1 / Tier-2)

Tier-1 dev / staging containers DO expose conventional observability surfaces, because they're being driven by humans and CI orchestrators that need them. The airborne profile of § 1 is the **production-only** profile.

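The split between the production profile and the CI / dev profiles can be pictured as a small gate evaluated at startup. The sketch below is illustrative only: `ObservabilityProfile`, `profile_for`, and `check_log_level` are hypothetical names, the environment names are the ones used in § 3.1 and § 3.2, and it assumes that `production` is the only environment that refuses `DEBUG`.

```python
from dataclasses import dataclass

# Environments that expose /metrics according to § 3.2.
METRICS_ENVS = frozenset({"dev-tier1", "staging-tier1", "staging-tier2"})
KNOWN_ENVS = METRICS_ENVS | {"dev-tier2", "production"}


@dataclass(frozen=True)
class ObservabilityProfile:
    environment: str
    metrics_endpoint: bool  # may a /metrics listener be started?
    debug_allowed: bool     # may LOG_LEVEL=DEBUG be honoured?


def profile_for(environment: str) -> ObservabilityProfile:
    """Map an environment name to the surfaces it may expose."""
    if environment not in KNOWN_ENVS:
        raise ValueError(f"unknown environment: {environment!r}")
    return ObservabilityProfile(
        environment=environment,
        metrics_endpoint=environment in METRICS_ENVS,
        # Assumption: only the airborne production profile refuses DEBUG.
        debug_allowed=environment != "production",
    )


def check_log_level(profile: ObservabilityProfile, requested: str) -> str:
    """Return the requested level, or fail fast if the profile forbids it."""
    if requested == "DEBUG" and not profile.debug_allowed:
        raise ValueError(f"{profile.environment} refuses LOG_LEVEL=DEBUG")
    return requested


print(profile_for("production"))
print(profile_for("staging-tier2"))
```

Failing fast at startup, rather than silently downgrading the level, matches the "refuses" wording in § 1.3, though the actual validation lives in environment_strategy.md and may differ.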
### 3.1 Logging (Tier-1 / Tier-2)

Structured JSON to stdout/stderr (consumed by the developer's `docker compose logs` or by CI's log collector):

```json
{
  "timestamp": "2026-05-09T08:42:11.234Z",
  "level": "INFO",
  "service": "gps-denied-companion",
  "component": "C5",
  "flight_id": "",
  "monotonic_ms": 12345,
  "message": "Source label transition",
  "context": {
    "from": "satellite_anchored",
    "to": "visual_propagated",
    "reason": "vpr_no_match"
  }
}
```

Log levels:

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions; component fault that triggered AC-5.2 fallback | "VIO strategy initialization failed: GTSAM dlopen failed" |
| WARN | Degraded behavior; FDR segment drop; thermal-throttle hybrid switch | "Thermal throttle active; downgrading K=3 → K=2" |
| INFO | Significant lifecycle events; source label transition | "Source label: satellite_anchored → visual_propagated" |
| DEBUG | Per-frame diagnostic — Tier-1 / dev only; production refuses this level (environment_strategy.md § Variable validation) | "MatchResult: 47 inliers, residual=2.3px" |

**PII / safety-sensitive content**: no GPS coordinates in DEBUG / INFO logs by default. Only `horiz_accuracy` (a scalar) is INFO-loggable; the actual lat/lon is FDR-only. WARN / ERROR log records may include lat/lon when the operator's troubleshooting requires it; in that case the FDR still has the canonical record.

Log retention:

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| `dev-tier1` | Docker stdout | Container lifetime |
| `dev-tier2` | journald (Jetson) | 7 days |
| `staging-tier1` (CI) | GitHub Actions log artifact | 30 days (matches CI artifact retention) |
| `staging-tier2` (Jetson CI) | Self-hosted runner journald + uploaded report | 30 days |
| `production` | journald (Jetson) | 7 days, see § 1.3 |

### 3.2 Metrics (Tier-1 / Tier-2)

Prometheus-compatible `/metrics` endpoint on `dev-tier1`, `staging-tier1`, `staging-tier2`. **Disabled on `production`** (no listener on the airborne companion, NFT-SEC-05).

Application metrics:

| Metric | Type | Description |
|--------|------|-------------|
| `gps_denied_frame_processed_total` | Counter | Total nav frames processed (per `GPS_DENIED_VIO_STRATEGY` label) |
| `gps_denied_frame_emit_latency_seconds` | Histogram | End-to-end frame → emit latency (the AC-4.1 metric) |
| `gps_denied_source_label_total` | Counter | Counter per `satellite_anchored \| visual_propagated \| dead_reckoned` label |
| `gps_denied_vpr_match_rate` | Gauge | Rolling-1-minute rate of successful VPR matches |
| `gps_denied_thermal_hybrid_active` | Gauge | 0/1 — is the K=2 thermal-throttle hybrid active? (D-CROSS-LATENCY-1) |
| `gps_denied_fdr_segment_drops_total` | Counter | Total FDR segment drops this run (AC-NEW-3 audit) |
| `gps_denied_fdr_size_bytes` | Gauge | Current FDR ring size in bytes (must stay ≤ 64 GB) |
| `gps_denied_signing_key_rotations_total` | Counter | MAVLink signing key rotation count |

System metrics: standard `process_*`, `python_*` exporters; on Tier-2 also `jetson_stats_*` exposed via `jtop` exporter.

Business metrics (i.e., AC-derived):

| Metric | AC | Use |
|--------|------|-------------|
| `gps_denied_horiz_accuracy_m` (gauge, last value) | AC-NEW-4 | Live operator dashboard on operator workstation post-flight; CI threshold checks |
| `gps_denied_cold_start_seconds` | AC-NEW-1 | Set once at takeoff load completion; NFT-PERF-03 reads it |
| `gps_denied_spoofing_promotion_latency_seconds` | AC-NEW-2 | Set on each promotion / rejection event; NFT-PERF-04 reads it |

Collection interval: 15 s (typical Prometheus default; Tier-2 NFT runs may use 1 s for AC-bound timing).

### 3.3 Distributed tracing — NOT applicable

The runtime is a single in-process Python program with no cross-service hops in flight (architecture.md § 5 internal communication is all in-process). Distributed tracing is therefore not applicable to the production runtime.
The Tier-1 integration setup DOES involve cross-container hops (companion ↔ mock-sat ↔ db ↔ e2e-runner), but those are exercised by the e2e test framework's own log + status capture; OpenTelemetry is not provisioned for this project. If a future cycle introduces a multi-process companion (which ADR-004 explicitly rejected for the airborne profile but might appear on the operator workstation for C11 Tile Manager + C12 Operator Pre-flight Orchestrator), tracing can be reconsidered then.

## 4. Alerting (post-flight, not in-flight)

There is no live in-flight alerting from the airborne companion. The operator's **GCS** is the live human-loop interface (STATUSTEXT severity stream § 1.2). All other alerting is **post-flight**:

| Source | Severity | Response Time | Conditions |
|----------|---------------|-----------|----------|
| FDR review (operator workstation) | Critical | Same-day human review | FDR segment drop count > 0; component fail event; spoofing-promotion latency > 3 s; AC-NEW-4 outliers (P(err > 1 km) > 0.01 % in this flight's window) |
| FDR review | High | Next-day | AC-NEW-1 cold-start TTFF > 30 s p95 in this flight's window; thermal-throttle hybrid active > 25 % of the flight |
| FDR review | Medium | Within 1 week | Mid-flight tile failure rate > 5 %; high VPR no-match rate; sustained `dead_reckoned` periods > 10 s |
| CI (Tier-2) | Critical | Block PR merge | Any AC-bound NFT failure (architecture.md § 6 NFR list) |
| CI (Tier-1) | Critical | Block PR merge | Build failure; security CVE; SBOM diff fail (ADR-002) |

Notification channels:

| Severity | Channel |
|----------|---------|
| Critical (FDR or CI) | Slack `#gps-denied-ops` + email |
| High | Slack `#gps-denied-ops` |
| Medium | Slack `#gps-denied-ops` (digest) |

There is no PagerDuty / on-call rotation for this project; in-flight failures are handled by the FC's IMU-only fallback (AC-5.2), not by an operations team.

## 5. Dashboards

### 5.1 Operator workstation post-flight dashboard

Built into operator-orchestrator C12. Per flight:

- Time series: source label, `horiz_accuracy`, `last_anchor_age_ms`, CPU%, GPU%, temp.
- Event markers: VISUAL_BLACKOUT entries, spoofing events, signing key rotations, thermal hybrid switches.
- Map: emitted track + FC ground truth (when available) + pre-flight cache footprint + mid-flight tile coverage.
- Statistics: per-flight error CDF; AC-NEW-4 contribution; mid-flight tile counts.
- FDR audit table: any `0x000F` lifecycle events of severity ≥ WARN.

### 5.2 CI dashboard (Tier-2)

GitHub Actions job summary plus a per-NFT report uploaded as workflow artifact. The summary includes:

- Pass / fail per NFT scenario.
- For NFT-PERF-*: histogram of latencies + comparison to threshold.
- For NFT-LIM-*: peak memory / FDR size traces.
- For NFT-RES-*: AC-NEW-4 / AC-NEW-7 statistical summary with stated 95 % CI.
- For IT-12: comparative-study summary across all VIO / VPR strategies in the research binary.

There is no live CI dashboard separate from the GitHub Actions UI; the project is small enough that the per-PR job summary is sufficient.

### 5.3 No live in-flight dashboard

Out of scope by design. The GCS is the only live operator surface; all other inspection is post-flight.

## 6. Open Items / Plan-Phase Carryforward

- **Long-term FDR archive** (multi-flight statistical headroom): D-PROJ-3 (multi-flight fixture acquisition for AC-NEW-4 / AC-NEW-7) is not pursued this cycle. If pursued in a future cycle, post-flight FDR archives become a corpus contribution path; the operator-orchestrator FDR-retrieval step would need an explicit "contribute to corpus" toggle.
- **Telemetry-link encryption** beyond MAVLink-2.0 signing: out of scope; addressed by physical link assumptions in the threat model (architecture.md § 7).
- **iNav signing**: still has no equivalent to MAVLink-2.0 signing (Mode B Source #129). Carryforward Plan-phase action: file a feature request upstream; meanwhile observability for iNav-profile flights is the same as AP-profile minus the `MavlinkSigningKeyRotated` records (which are NULL on iNav flights per data_model.md § 2.2).