
GPS-Denied Onboard — Observability

Date: 2026-05-09 (Plan Phase 2c — initial draft). Inputs: _docs/02_document/architecture.md § 7 (Audit logging) + § 6 (NFRs); _docs/02_document/data_model.md § 2.8 (FDR); ADR-005 (Tier-1 / Tier-2); AC-NEW-3 (FDR ≤ 64 GB / no silent drops); AC-NEW-5 (operating envelope).

Observability is asymmetric by design

Most CI/CD templates assume a network-connected service that pushes structured logs to an aggregator and exposes Prometheus metrics for live scraping. This project's airborne profile does neither. architecture.md ADR-004, § 7, and Principle #4 require no inbound network listening and no outbound network egress in flight (enforced by NFT-SEC-05). The Jetson operates as an embedded edge device, not a service.

Observability therefore splits into three regimes:

| Regime | Where | Live or post-flight | Primary mechanism |
|---|---|---|---|
| In-flight onboard | Production Jetson, in flight | Live (to FDR ring) + best-effort live (to GCS) | FDR binary record stream + GCS STATUSTEXT / NAMED_VALUE_FLOAT |
| Post-flight onboard | Operator workstation after pulling the FDR | Post-flight | FDR replay + visualization in operator-tooling C12 |
| CI / dev (Tier-1, Tier-2) | Workstation Docker / Jetson CI runner | Live | Standard structured logging + Prometheus metrics endpoint where applicable |

The sections below are organized by regime.

1. In-flight onboard (production Jetson)

1.1 FDR (Flight Data Recorder) — primary observability sink

Schema is in data_model.md § 2.8. Every observable event in flight goes through FDR. The FDR is append-only, lossy on overrun (logged, never silent), and per-flight ring-bounded at ≤ 64 GB (AC-NEW-3).

Observability events that emit FDR records:

| Component | Event | FDR record type |
|---|---|---|
| C8 outbound | Every emitted EmittedExternalPosition to FC | 0x0001 EmittedExternalPosition |
| C8 inbound | Every received MAVLink frame (raw tlog-style) | 0x0003 ReceivedMavlinkRaw |
| C8 inbound (iNav) | Every received MSP2 frame | 0x0004 ReceivedMsp2Raw |
| C8 inbound | IMU window forwarded to C1 / C5 | 0x0002 ImuTrace |
| C5 | Source-label transition (satellite_anchored / visual_propagated / dead_reckoned) | 0x0006 SourceLabelTransition |
| C5 + C8 | Spoofing-promotion / -rejection event | 0x000C SpoofingPromotionEvent |
| C5 | VISUAL_BLACKOUT entry / exit (AC-3.5, AC-NEW-8) | 0x000B VisualBlackoutEvent |
| C6 | Mid-flight tile emit | 0x0007 MidFlightTileEmitted |
| C6 | Mid-flight tile failure (with thumbnail filename, AC-8.5 forensic exception) | 0x0008 MidFlightTileFailed |
| C7 (inference) | Thermal-throttle hybrid switch K=3 ↔ K=2 | 0x000E ThermalThrottleHybridSwitch |
| C8 | MAVLink-2.0 signing key rotation event (D-C8-9) | 0x0009 MavlinkSigningKeyRotated |
| C8 | EKF source-set switch event (D-C8-2 = (b)) | 0x000A EkfSourceSetCommand |
| C10 | Pre-flight content-hash gate fail | 0x000D ContentHashGateFail |
| All components | Lifecycle events (start / stop / fail) | 0x000F ComponentLifecycleEvent |
| jetson-stats collector (driven by C7 or a dedicated thread) | Per-second sample of CPU%, GPU%, temp, throttle flag, RAM, VRAM, NVM remaining | 0x0005 SystemHealth |

Lossy-on-overrun rule (AC-NEW-3 enforcement): if the FDR writer cannot keep up (NVM I/O bound), the writer drops the oldest segment in the current flight's ring AND emits a 0x000F ComponentLifecycleEvent of type fdr_segment_dropped to the new head segment. A segment drop is a hard observability signal — it appears in the post-flight report and in the GCS STATUSTEXT stream. There is no path that silently discards an event.
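The drop-oldest rule can be illustrated with a minimal in-memory sketch. It is not the project's writer: the class name `SegmentRing`, the `max_segments` parameter, and the payload dictionary are hypothetical, and the trigger is modelled as a simple capacity limit rather than an NVM I/O stall. What it shows is the invariant stated above: every eviction leaves a `fdr_segment_dropped` lifecycle record in the new head segment.

```python
from collections import deque

LIFECYCLE_EVENT = 0x000F  # ComponentLifecycleEvent record type


class SegmentRing:
    """Bounded ring of segments; an eviction is always recorded, never silent."""

    def __init__(self, max_segments):
        self.max_segments = max_segments  # hypothetical capacity knob
        self.segments = deque()           # items are (segment_index, [records])
        self.next_index = 0
        self.drops = 0
        self.open_segment()

    def open_segment(self):
        """Open a new head segment, evicting the oldest one if the ring is full."""
        dropped = None
        if len(self.segments) >= self.max_segments:
            dropped, _ = self.segments.popleft()
            self.drops += 1
        self.segments.append((self.next_index, []))
        self.next_index += 1
        if dropped is not None:
            # The drop notice is the first record of the new head segment.
            self.append(LIFECYCLE_EVENT,
                        {"event": "fdr_segment_dropped", "segment": dropped})
        return dropped

    def append(self, record_type, payload):
        """Append one record to the current head segment."""
        self.segments[-1][1].append((record_type, payload))
```

In this sketch the `drops` counter is what a post-flight report or a STATUSTEXT emitter would read; how the real writer detects that it is I/O bound is outside what the sketch covers.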

Format: length-prefixed binary stream with record_header (magic 0x47464452 "GFDR" + version + type + monotonic_ms) followed by a per-type body and a CRC32. New record types are additive (data_model.md § 6.5).
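As a rough illustration of this framing, the sketch below encodes and decodes one record. The authoritative layout is in data_model.md § 2.8; everything about field widths here is an assumption made for the example: big-endian byte order (so the magic reads "GFDR" in a hex dump), a 16-bit version, a 16-bit type, a 64-bit `monotonic_ms`, a 32-bit length prefix, and a CRC32 computed over header plus body.

```python
import struct
import zlib

MAGIC = 0x47464452                 # "GFDR"
HEADER = struct.Struct(">IHHQ")    # assumed: magic, version, type, monotonic_ms
CRC = struct.Struct(">I")
LENGTH = struct.Struct(">I")       # assumed 32-bit length prefix


def encode_record(record_type, monotonic_ms, body, version=1):
    """Frame one record as: length | header | body | crc32(header + body)."""
    header = HEADER.pack(MAGIC, version, record_type, monotonic_ms)
    crc = zlib.crc32(header + body) & 0xFFFFFFFF
    frame = header + body + CRC.pack(crc)
    return LENGTH.pack(len(frame)) + frame


def decode_record(buf):
    """Parse one framed record; raise ValueError on truncation or corruption."""
    (length,) = LENGTH.unpack_from(buf, 0)
    frame = bytes(buf[LENGTH.size:LENGTH.size + length])
    if len(frame) != length or length < HEADER.size + CRC.size:
        raise ValueError("truncated record")
    header = frame[:HEADER.size]
    body = frame[HEADER.size:-CRC.size]
    (stored_crc,) = CRC.unpack(frame[-CRC.size:])
    magic, version, record_type, monotonic_ms = HEADER.unpack(header)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if zlib.crc32(header + body) & 0xFFFFFFFF != stored_crc:
        raise ValueError("CRC mismatch")
    return version, record_type, monotonic_ms, body
```

Because the type travels in the header and the body is opaque to the framing layer, a reader built this way can skip record types it does not know, which is one way the additive-types rule could be honoured.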

Storage path: /var/lib/gps-denied/fdr/{flight_id}/segments/seg_NNNNN.bin. Thumbnails (AC-8.5) live at /var/lib/gps-denied/fdr/{flight_id}/thumbnails/. A flight's manifest.json (the FDR-side mirror, distinct from the PostgreSQL manifests row) sits at the flight's root and carries the flight metadata snapshot.
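The layout above can be expressed as a few path helpers. This is a sketch only: the function names are hypothetical, and it assumes `NNNNN` means a zero-padded five-digit decimal segment index, which the source does not state explicitly.

```python
from pathlib import PurePosixPath

FDR_ROOT = PurePosixPath("/var/lib/gps-denied/fdr")


def segment_path(flight_id, index):
    """Path of one segment file; assumes a zero-padded 5-digit index."""
    return FDR_ROOT / flight_id / "segments" / f"seg_{index:05d}.bin"


def thumbnails_dir(flight_id):
    """Directory holding the AC-8.5 failed-tile thumbnails."""
    return FDR_ROOT / flight_id / "thumbnails"


def manifest_path(flight_id):
    """FDR-side manifest.json at the flight's root."""
    return FDR_ROOT / flight_id / "manifest.json"
```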

1.2 GCS telemetry (best-effort, bandwidth-limited)

The GCS link is the only outbound channel from the airborne companion (per architecture.md § 7). Bandwidth is bounded (AC-6.1: 12 Hz downsampled summary). The companion emits:

| MAVLink message | Rate | Content |
|---|---|---|
| STATUSTEXT | event-driven (only when something changes) | Source label transitions; spoofing-promotion / -rejection; VISUAL_BLACKOUT entry / exit; signing key rotation; FDR segment drop; component start / fail; thermal-throttle hybrid switch |
| NAMED_VALUE_FLOAT | 1 Hz | horiz_accuracy_m, vert_accuracy_m, vio_health (frame-quality 0..1), last_anchor_age_s, cpu_pct, gpu_pct, temp_c |
| GPS_RAW_INT | 12 Hz (AC-6.1) | Mirror of the AP GPS_INPUT we just emitted, downsampled; gives the operator a live position view in QGC |

These are best-effort — packet loss on the GCS link is treated as normal. The FDR remains the source of truth.
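One practical detail when emitting these values: the MAVLink NAMED_VALUE_FLOAT message carries its name in a 10-character field, so several of the metric names listed above would not fit on the wire unchanged. The sketch below shows one way to handle that with an explicit mapping. The short wire names and the function name are hypothetical; the source does not say how the project resolves this.

```python
NAME_FIELD_LEN = 10  # size of NAMED_VALUE_FLOAT.name in the MAVLink common set

# Hypothetical wire names; only the left-hand keys come from this document.
WIRE_NAMES = {
    "horiz_accuracy_m": "h_acc_m",
    "vert_accuracy_m": "v_acc_m",
    "vio_health": "vio_health",
    "last_anchor_age_s": "anchor_age",
    "cpu_pct": "cpu_pct",
    "gpu_pct": "gpu_pct",
    "temp_c": "temp_c",
}


def wire_name(metric):
    """Return the on-wire name for a metric, refusing names that cannot fit."""
    name = WIRE_NAMES.get(metric, metric)
    if len(name.encode("ascii")) > NAME_FIELD_LEN:
        raise ValueError(f"{name!r} exceeds {NAME_FIELD_LEN} characters")
    return name
```

Failing loudly on an unmapped long name keeps a truncated, ambiguous label from reaching the GCS.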

STATUSTEXT severity mapping:

| FDR event | STATUSTEXT severity | Example text |
|---|---|---|
| Source label → dead_reckoned | MAV_SEVERITY_WARNING | "GPS-DENIED: dead-reckoned (last anchor 12.3s ago)" |
| VISUAL_BLACKOUT entry | MAV_SEVERITY_NOTICE | "GPS-DENIED: VISUAL_BLACKOUT entered (reason=low_features)" |
| Spoofing rejected | MAV_SEVERITY_NOTICE | "GPS-DENIED: spoofed FC GPS rejected (last visual consistency PASS 0.4s ago)" |
| Spoofing promoted (10 s + visual gate passed) | MAV_SEVERITY_INFO | "GPS-DENIED: FC GPS promoted to fused source" |
| FDR segment dropped | MAV_SEVERITY_WARNING | "GPS-DENIED: FDR segment 47 dropped (NVM bound)" |
| Signing key rotation | MAV_SEVERITY_INFO | "GPS-DENIED: MAVLink signing key rotated" |
| Component fail | MAV_SEVERITY_CRITICAL | "GPS-DENIED: VIO strategy fault — failover to FC IMU-only (AC-5.2)" |
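The mapping above can be sketched as a lookup plus a chunker. The severity numbers are the standard MAV_SEVERITY enum values, and the 50-byte limit is the size of the STATUSTEXT text field; several example texts above are longer than that, so they would need the MAVLink 2 `id` / `chunk_seq` fields. The event keys and the function name are hypothetical, and the sketch does not handle the terminator chunk needed when a text is an exact multiple of 50 bytes.

```python
MAV_SEVERITY = {"CRITICAL": 2, "WARNING": 4, "NOTICE": 5, "INFO": 6}

# Hypothetical event keys mirroring the rows of the severity table.
EVENT_SEVERITY = {
    "source_label_dead_reckoned": "WARNING",
    "visual_blackout_entry": "NOTICE",
    "spoofing_rejected": "NOTICE",
    "spoofing_promoted": "INFO",
    "fdr_segment_dropped": "WARNING",
    "signing_key_rotated": "INFO",
    "component_fail": "CRITICAL",
}

STATUSTEXT_LEN = 50  # bytes in the STATUSTEXT text field


def statustext_chunks(event, text, message_id):
    """Return (severity, text_bytes, id, chunk_seq) tuples for one event."""
    severity = MAV_SEVERITY[EVENT_SEVERITY[event]]
    data = text.encode("ascii", errors="replace")
    pieces = [data[i:i + STATUSTEXT_LEN]
              for i in range(0, len(data), STATUSTEXT_LEN)] or [b""]
    if len(pieces) == 1:
        message_id = 0  # id 0 marks a single-chunk message
    return [(severity, piece, message_id, seq)
            for seq, piece in enumerate(pieces)]
```

For example, the VISUAL_BLACKOUT text in the table is 57 bytes long and would go out as two chunks sharing one non-zero `id`.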

1.3 No console logging in flight

The production deployment binary refuses LOG_LEVEL=DEBUG by default (environment_strategy.md § Variable validation). The airborne companion has no operator-readable console, so even ERROR-level logs go to journald + FDR rather than stdout. journald retention is 7 days on a rolling buffer (separate from the FDR's per-flight retention).
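A start-up check of this kind might look like the sketch below. The actual rules live in environment_strategy.md § Variable validation; the function name, the `profile` argument, and the INFO fallback when LOG_LEVEL is unset are assumptions, and the sketch does not model whatever override "by default" implies.

```python
ALLOWED_LEVELS = ("ERROR", "WARN", "INFO", "DEBUG")


def resolve_log_level(env, profile):
    """Validate LOG_LEVEL from an environment mapping for the given profile."""
    level = env.get("LOG_LEVEL", "INFO").upper()  # INFO fallback is assumed
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unknown LOG_LEVEL: {level!r}")
    if profile == "production" and level == "DEBUG":
        raise ValueError("LOG_LEVEL=DEBUG is refused on the production profile")
    return level
```

Calling it as `resolve_log_level(os.environ, "production")` at process start would stop a misconfigured binary before any component comes up.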

1.4 In-flight metrics are NOT scraped

There is no Prometheus endpoint on the production airborne companion. The justification matches § 1.3: there is no scraper to scrape it; metrics are recorded into FDR and visible via NAMED_VALUE_FLOAT only. CI / dev environments DO expose /metrics (see § 3 below).

2. Post-flight onboard (operator workstation)

When the operator plugs the companion in post-landing:

  1. FDR retrieval (operator tooling C12; the feature itself is outside the scope of this document, but it affects observability): operator-tooling reads the FDR ring, copies it to the workstation, and seals the in-flight ring. The companion's per-flight ephemeral keys are deleted at this step (environment_strategy.md § Per-flight key lifecycle).
  2. Visualization (operator tooling C12): the workstation renders:
    • Time-series of horiz_accuracy, vert_accuracy, last_anchor_age_ms, source label timeline, thermal-throttle hybrid switches, and CPU / GPU / temp.
    • Map view: emitted positions vs. (when available) FC GLOBAL_POSITION_INT ground truth.
    • Spoofing / VISUAL_BLACKOUT event markers overlaid on the timeline.
    • Per-flight summary: total mid-flight tiles emitted, FDR segment drops (if any), AC-NEW-4 / AC-NEW-7 statistics for this flight.
  3. NFT-RES-03 / NFT-SEC-01 corpus contribution: if the operator opts in, the flight's emitted positions + FC ground truth are added to the AC-NEW-4 / AC-NEW-7 Monte-Carlo corpus for the next CI run.
  4. Forensic thumbnail review (AC-8.5 exception): failed-tile thumbnails are visible in the operator UI for human review; this is the only image-data review surface.

3. CI / dev environments (Tier-1 / Tier-2)

Tier-1 dev / staging containers DO expose conventional observability surfaces, because they're being driven by humans and CI orchestrators that need them. The airborne profile of § 1 is the production-only profile.

3.1 Logging (Tier-1 / Tier-2)

Structured JSON to stdout/stderr (consumed by the developer's docker compose logs or by CI's log collector):

{
  "timestamp": "2026-05-09T08:42:11.234Z",
  "level": "INFO",
  "service": "gps-denied-companion",
  "component": "C5",
  "flight_id": "<uuid>",
  "monotonic_ms": 12345,
  "message": "Source label transition",
  "context": {
    "from": "satellite_anchored",
    "to": "visual_propagated",
    "reason": "vpr_no_match"
  }
}
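A record of this shape can be produced with the standard library alone. The sketch below is one possible formatter, not the project's: the class name is hypothetical, and it assumes that `component`, `flight_id`, `monotonic_ms`, and `context` are attached to each log call through `extra=`.

```python
import json
import logging
from datetime import datetime, timezone

LEVEL_NAMES = {"WARNING": "WARN"}  # stdlib says WARNING; this document says WARN


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": "gps-denied-companion",
            "component": getattr(record, "component", None),
            "flight_id": getattr(record, "flight_id", None),
            "monotonic_ms": getattr(record, "monotonic_ms", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)
```

A handler would be wired up with `handler.setFormatter(JsonFormatter())`, and a call such as `log.info("Source label transition", extra={"component": "C5", "context": {...}})` then yields the fields shown above.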

Log levels:

| Level | Usage | Example |
|---|---|---|
| ERROR | Exceptions; component fault that triggered AC-5.2 fallback | "VIO strategy initialization failed: GTSAM dlopen failed" |
| WARN | Degraded behavior; FDR segment drop; thermal-throttle hybrid switch | "Thermal throttle active; downgrading K=3 → K=2" |
| INFO | Significant lifecycle events; source label transition | "Source label: satellite_anchored → visual_propagated" |
| DEBUG | Per-frame diagnostic (Tier-1 / dev only; production refuses this level, environment_strategy.md § Variable validation) | "MatchResult: 47 inliers, residual=2.3px" |

PII / safety-sensitive content: no GPS coordinates in DEBUG / INFO logs by default. Only horiz_accuracy (a scalar) is INFO-loggable; the actual lat/lon is FDR-only. WARN / ERROR log records may include lat/lon when the operator's troubleshooting requires it; in that case the FDR still has the canonical record.
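The coordinate rule can be enforced at the point where a log context is assembled. The sketch below is illustrative only: the key names `lat` and `lon` and the function name are assumptions, and it always allows coordinates at WARN / ERROR, whereas the text above makes that conditional on troubleshooting need.

```python
COORDINATE_KEYS = frozenset({"lat", "lon"})  # assumed context key names
COORDINATE_LEVELS = ("WARN", "ERROR")


def scrub_context(level, context):
    """Return a copy of the log context with lat/lon removed below WARN."""
    if level in COORDINATE_LEVELS:
        return dict(context)
    return {k: v for k, v in context.items() if k not in COORDINATE_KEYS}
```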

Log retention:

| Environment | Destination | Retention |
|---|---|---|
| dev-tier1 | Docker stdout | Container lifetime |
| dev-tier2 | journald (Jetson) | 7 days |
| staging-tier1 (CI) | GitHub Actions log artifact | 30 days (matches CI artifact retention) |
| staging-tier2 (Jetson CI) | Self-hosted runner journald + uploaded report | 30 days |
| production | journald (Jetson) | 7 days, see § 1.3 |

3.2 Metrics (Tier-1 / Tier-2)

Prometheus-compatible /metrics endpoint on dev-tier1, staging-tier1, staging-tier2. Disabled on production (no listener on the airborne companion, NFT-SEC-05).

Application metrics:

| Metric | Type | Description |
|---|---|---|
| gps_denied_frame_processed_total | Counter | Total nav frames processed (per GPS_DENIED_VIO_STRATEGY label) |
| gps_denied_frame_emit_latency_seconds | Histogram | End-to-end frame → emit latency (the AC-4.1 metric) |
| gps_denied_source_label_total | Counter | Counter per source label (satellite_anchored, visual_propagated, dead_reckoned) |
| gps_denied_vpr_match_rate | Gauge | Rolling-1-minute rate of successful VPR matches |
| gps_denied_thermal_hybrid_active | Gauge | 0/1: is the K=2 thermal-throttle hybrid active? (D-CROSS-LATENCY-1) |
| gps_denied_fdr_segment_drops_total | Counter | Total FDR segment drops this run (AC-NEW-3 audit) |
| gps_denied_fdr_size_bytes | Gauge | Current FDR ring size in bytes (must stay ≤ 64 GB) |
| gps_denied_signing_key_rotations_total | Counter | MAVLink signing key rotation count |
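To make the counter rows concrete, the sketch below renders samples in the Prometheus text exposition format using only the standard library. A real deployment would more likely use a client library; the function name, the `vio_strategy` label name, and the label value are hypothetical.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def render_metrics(frames_by_strategy, segment_drops):
    """Build a /metrics body for two of the counters listed above."""
    frames = [({"vio_strategy": s}, n)  # label name is an assumption
              for s, n in sorted(frames_by_strategy.items())]
    return (
        render_counter("gps_denied_frame_processed_total",
                       "Total nav frames processed", frames)
        + render_counter("gps_denied_fdr_segment_drops_total",
                         "Total FDR segment drops this run",
                         [({}, segment_drops)])
    )
```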

System metrics: standard process_*, python_* exporters; on Tier-2 also jetson_stats_* exposed via jtop exporter.

Business metrics (i.e., AC-derived):

| Metric | AC | Use |
|---|---|---|
| gps_denied_horiz_accuracy_m (gauge, last value) | AC-NEW-4 | Operator workstation post-flight dashboard; CI threshold checks |
| gps_denied_cold_start_seconds | AC-NEW-1 | Set once at takeoff load completion; NFT-PERF-03 reads it |
| gps_denied_spoofing_promotion_latency_seconds | AC-NEW-2 | Set on each promotion / rejection event; NFT-PERF-04 reads it |

Collection interval: 15 s (a commonly used Prometheus scrape interval; Tier-2 NFT runs may use 1 s for AC-bound timing).

3.3 Distributed tracing — NOT applicable

The runtime is a single in-process Python program with no cross-service hops in flight (architecture.md § 5 internal communication is all in-process). Distributed tracing is therefore not applicable to the production runtime.

The Tier-1 integration setup DOES involve cross-container hops (companion ↔ mock-sat ↔ db ↔ e2e-runner), but those are exercised by the e2e test framework's own log + status capture; OpenTelemetry is not provisioned for this project. If a future cycle introduces a multi-process companion (which ADR-004 explicitly rejected for the airborne profile but might appear on the operator workstation for C11 Tile Manager + C12 Operator Pre-flight Tooling), tracing can be reconsidered then.

4. Alerting (post-flight, not in-flight)

There is no live in-flight alerting from the airborne companion. The operator's GCS is the live human-loop interface (STATUSTEXT severity stream § 1.2). All other alerting is post-flight:

| Source | Severity | Response Time | Conditions |
|---|---|---|---|
| FDR review (operator workstation) | Critical | Same-day human review | FDR segment drop count > 0; component fail event; spoofing-promotion latency > 3 s; AC-NEW-4 outliers (P(err > 1 km) > 0.01 % in this flight's window) |
| FDR review | High | Next-day | AC-NEW-1 cold-start TTFF > 30 s p95 in this flight's window; thermal-throttle hybrid active > 25 % of the flight |
| FDR review | Medium | Within 1 week | Mid-flight tile failure rate > 5 %; high VPR no-match rate; sustained dead_reckoned periods > 10 s |
| CI (Tier-2) | Critical | Block PR merge | Any AC-bound NFT failure (architecture.md § 6 NFR list) |
| CI (Tier-1) | Critical | Block PR merge | Build failure; security CVE; SBOM diff fail (ADR-002) |
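The FDR-review rows can be read as a triage function over a per-flight summary. The sketch below encodes only the thresholds stated in the table; the `FlightSummary` field names are hypothetical, and the "high VPR no-match rate" condition is left out because the table gives no number for it.

```python
from dataclasses import dataclass


@dataclass
class FlightSummary:
    """Hypothetical per-flight figures derived from an FDR review."""
    fdr_segment_drops: int = 0
    component_fail_events: int = 0
    max_spoofing_promotion_latency_s: float = 0.0
    p_err_over_1km_pct: float = 0.0       # P(err > 1 km), in percent
    cold_start_p95_s: float = 0.0
    thermal_hybrid_fraction: float = 0.0  # share of flight with hybrid active
    tile_failure_rate: float = 0.0        # failed / attempted mid-flight tiles
    longest_dead_reckoned_s: float = 0.0


def triage(s):
    """Return 'Critical', 'High', 'Medium', or None for one flight."""
    if (s.fdr_segment_drops > 0
            or s.component_fail_events > 0
            or s.max_spoofing_promotion_latency_s > 3.0
            or s.p_err_over_1km_pct > 0.01):
        return "Critical"
    if s.cold_start_p95_s > 30.0 or s.thermal_hybrid_fraction > 0.25:
        return "High"
    if s.tile_failure_rate > 0.05 or s.longest_dead_reckoned_s > 10.0:
        return "Medium"
    return None
```

Checking the tiers from most to least severe means a flight that trips several conditions is reported once, at its highest severity.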

Notification channels:

| Severity | Channel |
|---|---|
| Critical (FDR or CI) | Slack #gps-denied-ops + email |
| High | Slack #gps-denied-ops |
| Medium | Slack #gps-denied-ops (digest) |

There is no PagerDuty / on-call rotation for this project; in-flight failures are handled by the FC's IMU-only fallback (AC-5.2), not by an operations team.

5. Dashboards

5.1 Operator workstation post-flight dashboard

Built into operator-tooling C12. Per flight:

  • Time series: source label, horiz_accuracy, last_anchor_age_ms, CPU%, GPU%, temp.
  • Event markers: VISUAL_BLACKOUT entries, spoofing events, signing key rotations, thermal hybrid switches.
  • Map: emitted track + FC ground truth (when available) + pre-flight cache footprint + mid-flight tile coverage.
  • Statistics: per-flight error CDF; AC-NEW-4 contribution; mid-flight tile counts.
  • FDR audit table: any 0x000F lifecycle events of severity ≥ WARN.

5.2 CI dashboard (Tier-2)

GitHub Actions job summary plus a per-NFT report uploaded as workflow artifact. The summary includes:

  • Pass / fail per NFT scenario.
  • For NFT-PERF-*: histogram of latencies + comparison to threshold.
  • For NFT-LIM-*: peak memory / FDR size traces.
  • For NFT-RES-*: AC-NEW-4 / AC-NEW-7 statistical summary with stated 95 % CI.
  • For IT-12: comparative-study summary across all VIO / VPR strategies in the research binary.

There is no live CI dashboard separate from the GitHub Actions UI; the project is small enough that the per-PR job summary is sufficient.

5.3 No live in-flight dashboard

Out of scope by design. The GCS is the only live operator surface; all other inspection is post-flight.

6. Open Items / Plan-Phase Carryforward

  • Long-term FDR archive (multi-flight statistical headroom): D-PROJ-3 (multi-flight fixture acquisition for AC-NEW-4 / AC-NEW-7) is not pursued this cycle. If pursued in a future cycle, post-flight FDR archives become a corpus contribution path; the operator-tooling FDR-retrieval step would need an explicit "contribute to corpus" toggle.
  • Telemetry-link encryption beyond MAVLink-2.0 signing: out of scope; addressed by physical link assumptions in the threat model (architecture.md § 7).
  • iNav signing: still has no equivalent to MAVLink-2.0 signing (Mode B Source #129). Carryforward Plan-phase action: file a feature request upstream; meanwhile observability for iNav-profile flights is the same as AP-profile minus the MavlinkSigningKeyRotated records (which are NULL on iNav flights per data_model.md § 2.2).