GPS-Denied Onboard — Observability
Date: 2026-05-09 (Plan Phase 2c — initial draft). Inputs:
_docs/02_document/architecture.md § 7 (Audit logging) + § 6 (NFRs); _docs/02_document/data_model.md § 2.8 (FDR); ADR-005 (Tier-1 / Tier-2); AC-NEW-3 (FDR ≤ 64 GB / no silent drops); AC-NEW-5 (operating envelope).
Observability is asymmetric by design
Most CI/CD templates assume a network-connected service that pushes structured logs to an aggregator and exposes Prometheus metrics for live scraping. This project's airborne profile does neither. Architecture.md ADR-004 + § 7 + Principle #4 require no inbound network listening and no outbound network egress in flight (enforced by NFT-SEC-05). The Jetson operates as an embedded edge device, not a service.
Observability therefore splits into three regimes:
| Regime | Where | Live or post-flight | Primary mechanism |
|---|---|---|---|
| In-flight onboard | Production Jetson, in flight | Live (to FDR ring) + best-effort live (to GCS) | FDR binary record stream + GCS STATUSTEXT / NAMED_VALUE_FLOAT |
| Post-flight onboard | Operator workstation after pulling the FDR | Post-flight | FDR replay + visualization in operator-orchestrator C12 |
| CI / dev (Tier-1, Tier-2) | Workstation Docker / Jetson CI runner | Live | Standard structured logging + Prometheus metrics endpoint where applicable |
The sections below are organized by regime.
1. In-flight onboard (production Jetson)
1.1 FDR (Flight Data Recorder) — primary observability sink
Schema is in data_model.md § 2.8. Every observable event in flight goes through FDR. The FDR is append-only, lossy on overrun (logged, never silent), and per-flight ring-bounded at ≤ 64 GB (AC-NEW-3).
Observability events that emit FDR records:
| Component | Event | FDR record type |
|---|---|---|
| C8 outbound | Every emitted EmittedExternalPosition to FC | 0x0001 EmittedExternalPosition |
| C8 inbound | Every received MAVLink frame (raw tlog-style) | 0x0003 ReceivedMavlinkRaw |
| C8 inbound (iNav) | Every received MSP2 frame | 0x0004 ReceivedMsp2Raw |
| C8 inbound | IMU window forwarded to C1 / C5 | 0x0002 ImuTrace |
| C5 | Source-label transition (satellite_anchored ↔ visual_propagated ↔ dead_reckoned) | 0x0006 SourceLabelTransition |
| C5 + C8 | Spoofing-promotion / -rejection event | 0x000C SpoofingPromotionEvent |
| C5 | VISUAL_BLACKOUT entry / exit (AC-3.5, AC-NEW-8) | 0x000B VisualBlackoutEvent |
| C6 | Mid-flight tile emit | 0x0007 MidFlightTileEmitted |
| C6 | Mid-flight tile failure (with thumbnail filename, AC-8.5 forensic exception) | 0x0008 MidFlightTileFailed |
| C7 (inference) | Thermal-throttle hybrid switch K=3 ↔ K=2 | 0x000E ThermalThrottleHybridSwitch |
| C8 | MAVLink-2.0 signing key rotation event (D-C8-9) | 0x0009 MavlinkSigningKeyRotated |
| C8 | EKF source-set switch event (D-C8-2 = (b)) | 0x000A EkfSourceSetCommand |
| C10 | Pre-flight content-hash gate fail | 0x000D ContentHashGateFail |
| All components | Lifecycle events (start / stop / fail) | 0x000F ComponentLifecycleEvent |
| jetson-stats collector (driven by C7 or a dedicated thread) | Per-second sample of CPU%, GPU%, temp, throttle flag, RAM, VRAM, NVM remaining | 0x0005 SystemHealth |
Lossy-on-overrun rule (AC-NEW-3 enforcement): if the FDR writer cannot keep up (NVM I/O bound), the writer drops the oldest segment in the current flight's ring AND emits a 0x000F ComponentLifecycleEvent of type fdr_segment_dropped to the new head segment. A segment drop is a hard observability signal — it appears in the post-flight report and in the GCS STATUSTEXT stream. There is no path that silently discards an event.
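The drop-oldest rule can be illustrated with a small in-memory model. This is a minimal sketch, not the project's writer: the `FdrRing` and `Segment` classes, the byte budgets, and the text payload of the drop notice are all assumptions made for illustration. Only the record type 0x000F and the `fdr_segment_dropped` event name come from the text above.

```python
from collections import deque
from dataclasses import dataclass, field

LIFECYCLE_EVENT = 0x000F  # ComponentLifecycleEvent (from the table above)


@dataclass
class Segment:
    index: int
    records: list = field(default_factory=list)  # (record_type, payload) tuples
    size_bytes: int = 0


class FdrRing:
    """Hypothetical byte-bounded segment ring that never drops silently."""

    def __init__(self, max_bytes: int, segment_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.segment_bytes = segment_bytes
        self.segments = deque([Segment(index=0)])
        self.dropped = 0

    @property
    def size_bytes(self) -> int:
        return sum(s.size_bytes for s in self.segments)

    def append(self, record_type: int, payload: bytes) -> None:
        head = self.segments[-1]
        # Roll over to a new head segment when the current one is full.
        if head.size_bytes + len(payload) > self.segment_bytes:
            head = Segment(index=head.index + 1)
            self.segments.append(head)
        head.records.append((record_type, payload))
        head.size_bytes += len(payload)
        # Over budget: drop the oldest segment and log the drop in the head.
        while self.size_bytes > self.max_bytes and len(self.segments) > 1:
            victim = self.segments.popleft()
            self.dropped += 1
            note = f"fdr_segment_dropped:{victim.index}".encode()
            head.records.append((LIFECYCLE_EVENT, note))
            head.size_bytes += len(note)
```

The drop notice is itself counted against the budget, so the loop re-checks the total after writing it. A real writer would also have to forward the drop to the GCS STATUSTEXT stream, which this sketch leaves out.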
Format: length-prefixed binary stream with record_header (magic 0x47464452 "GFDR" + version + type + monotonic_ms) followed by a per-type body and a CRC32. New record types are additive (data_model.md § 6.5).
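A framing along these lines could look like the sketch below. The field widths (32-bit length, 16-bit version and type, 64-bit monotonic_ms), the big-endian byte order, and the CRC coverage (header plus body) are assumptions chosen for illustration; the authoritative layout is the one in data_model.md § 2.8. The function names are hypothetical.

```python
import struct
import zlib

MAGIC = 0x47464452  # "GFDR"
# Assumed layout: magic (u32), version (u16), type (u16), monotonic_ms (u64).
HEADER = struct.Struct(">IHHQ")
LENGTH = struct.Struct(">I")  # assumed length prefix covering header+body+CRC
CRC = struct.Struct(">I")


def encode_record(record_type: int, monotonic_ms: int, body: bytes,
                  version: int = 1) -> bytes:
    """Frame one record as length | header | body | crc32(header + body)."""
    header = HEADER.pack(MAGIC, version, record_type, monotonic_ms)
    framed = header + body + CRC.pack(zlib.crc32(header + body))
    return LENGTH.pack(len(framed)) + framed


def decode_record(buf: bytes):
    """Return (version, record_type, monotonic_ms, body) or raise ValueError."""
    (length,) = LENGTH.unpack_from(buf, 0)
    framed = buf[LENGTH.size:LENGTH.size + length]
    if len(framed) != length or length < HEADER.size + CRC.size:
        raise ValueError("truncated record")
    header = framed[:HEADER.size]
    body = framed[HEADER.size:-CRC.size]
    (crc,) = CRC.unpack(framed[-CRC.size:])
    magic, version, record_type, monotonic_ms = HEADER.unpack(header)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if crc != zlib.crc32(header + body):
        raise ValueError("CRC mismatch")
    return version, record_type, monotonic_ms, body
```

With big-endian packing the magic appears on disk as the ASCII bytes `GFDR`, which makes segments easy to recognize in a hex dump.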
Storage path: /var/lib/gps-denied/fdr/{flight_id}/segments/seg_NNNNN.bin. Thumbnails (AC-8.5) live at /var/lib/gps-denied/fdr/{flight_id}/thumbnails/. A flight's manifest.json (the FDR-side mirror, distinct from the PostgreSQL manifests row) sits at the flight's root and carries the flight metadata snapshot.
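The layout above can be captured in a few path helpers. This is a sketch only: the helper names are hypothetical, and the five-digit zero padding is read off the `seg_NNNNN.bin` pattern.

```python
from pathlib import PurePosixPath

FDR_ROOT = PurePosixPath("/var/lib/gps-denied/fdr")


def flight_root(flight_id: str) -> PurePosixPath:
    return FDR_ROOT / flight_id


def segment_path(flight_id: str, index: int) -> PurePosixPath:
    # seg_NNNNN.bin: five zero-padded digits, so the index must fit in them.
    if not 0 <= index <= 99999:
        raise ValueError("segment index out of range")
    return flight_root(flight_id) / "segments" / f"seg_{index:05d}.bin"


def thumbnails_dir(flight_id: str) -> PurePosixPath:
    return flight_root(flight_id) / "thumbnails"


def manifest_path(flight_id: str) -> PurePosixPath:
    return flight_root(flight_id) / "manifest.json"
```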
1.2 GCS telemetry (best-effort, bandwidth-limited)
The GCS link is the only outbound channel from the airborne companion (per architecture.md § 7). Bandwidth is bounded (AC-6.1: 1–2 Hz downsampled summary). The companion emits:
| MAVLink message | Rate | Content |
|---|---|---|
| STATUSTEXT | event-driven (only when something changes) | Source label transitions; spoofing-promotion / -rejection; VISUAL_BLACKOUT entry / exit; signing key rotation; FDR segment drop; component start / fail; thermal-throttle hybrid switch |
| NAMED_VALUE_FLOAT | 1 Hz | horiz_accuracy_m, vert_accuracy_m, vio_health (frame-quality 0..1), last_anchor_age_s, cpu_pct, gpu_pct, temp_c |
| GPS_RAW_INT | 1–2 Hz (AC-6.1) | Mirror of the AP GPS_INPUT we just emitted, downsampled; gives the operator a live position view in QGC |
These are best-effort — packet loss on the GCS link is treated as normal. The FDR remains the source of truth.
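The fixed-rate streams imply some form of downsampling between the internal frame rate and the GCS link. One simple way to do this is a minimum-interval gate, sketched below. The `RateLimiter` class is hypothetical and the source does not say how the companion actually downsamples.

```python
class RateLimiter:
    """Pass at most `max_hz` samples per second, keyed on monotonic time."""

    def __init__(self, max_hz: float) -> None:
        self.min_interval_ms = 1000.0 / max_hz
        self.last_ms = None

    def allow(self, now_ms: int) -> bool:
        # The first sample always passes; later ones must respect the interval.
        if self.last_ms is None or now_ms - self.last_ms >= self.min_interval_ms:
            self.last_ms = now_ms
            return True
        return False


# Usage: a 10 Hz internal stream gated down to 2 Hz for GPS_RAW_INT.
limiter = RateLimiter(max_hz=2.0)
sent = [t for t in range(0, 2000, 100) if limiter.allow(t)]
```

Samples rejected by the gate are not lost: they are still written to the FDR, which stays the source of truth.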
STATUSTEXT severity mapping:
| FDR event | STATUSTEXT severity | Example text |
|---|---|---|
| Source label → dead_reckoned | MAV_SEVERITY_WARNING | "GPS-DENIED: dead-reckoned (last anchor 12.3s ago)" |
| VISUAL_BLACKOUT entry | MAV_SEVERITY_NOTICE | "GPS-DENIED: VISUAL_BLACKOUT entered (reason=low_features)" |
| Spoofing rejected | MAV_SEVERITY_NOTICE | "GPS-DENIED: spoofed FC GPS rejected (last visual consistency PASS 0.4s ago)" |
| Spoofing promoted (10 s + visual gate passed) | MAV_SEVERITY_INFO | "GPS-DENIED: FC GPS promoted to fused source" |
| FDR segment dropped | MAV_SEVERITY_WARNING | "GPS-DENIED: FDR segment 47 dropped (NVM bound)" |
| Signing key rotation | MAV_SEVERITY_INFO | "GPS-DENIED: MAVLink signing key rotated" |
| Component fail | MAV_SEVERITY_CRITICAL | "GPS-DENIED: VIO strategy fault; failover to FC IMU-only (AC-5.2)" |
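The severity table can be expressed as a lookup, sketched below. The numeric values are those of the MAVLink MAV_SEVERITY enum. The event keys and the `build_statustext` helper are hypothetical. The text field of a single STATUSTEXT message holds 50 characters, so this sketch truncates; a real emitter might instead split long messages across chunks.

```python
from enum import IntEnum


class MavSeverity(IntEnum):
    """MAVLink MAV_SEVERITY values (lower number means more severe)."""
    EMERGENCY = 0
    ALERT = 1
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    NOTICE = 5
    INFO = 6
    DEBUG = 7


# Hypothetical event keys; the severities mirror the table above.
SEVERITY_BY_EVENT = {
    "source_label_dead_reckoned": MavSeverity.WARNING,
    "visual_blackout_entry": MavSeverity.NOTICE,
    "spoofing_rejected": MavSeverity.NOTICE,
    "spoofing_promoted": MavSeverity.INFO,
    "fdr_segment_dropped": MavSeverity.WARNING,
    "signing_key_rotated": MavSeverity.INFO,
    "component_fail": MavSeverity.CRITICAL,
}

STATUSTEXT_MAX_CHARS = 50  # size of the STATUSTEXT text field


def build_statustext(event: str, detail: str):
    """Return (severity, text) for one event; unknown events raise KeyError."""
    text = f"GPS-DENIED: {detail}"
    return int(SEVERITY_BY_EVENT[event]), text[:STATUSTEXT_MAX_CHARS]
```

Several of the example texts in the table are longer than 50 characters, so whichever strategy is chosen (truncation or chunking) matters for what the operator actually sees.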
1.3 No console logging in flight
The production deployment binary refuses LOG_LEVEL=DEBUG by default (environment_strategy.md § Variable validation). The airborne companion has no operator-readable console; even ERROR-level logs go to journald + FDR rather than stdout. journald retention is 7 days on a rolling buffer (separate from the FDR's per-flight retention).
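A start-up check of this kind might look like the following sketch. The `validate_log_level` function, the `profile` argument, and the default of INFO are assumptions for illustration; the actual rules live in environment_strategy.md § Variable validation.

```python
ALLOWED_LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"}


def validate_log_level(env: dict, profile: str) -> str:
    """Return the effective log level or raise ValueError at start-up."""
    level = env.get("LOG_LEVEL", "INFO").upper()  # default is an assumption
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unknown LOG_LEVEL: {level}")
    # Production refuses per-frame DEBUG output outright.
    if profile == "production" and level == "DEBUG":
        raise ValueError("LOG_LEVEL=DEBUG is refused on the production profile")
    return level
```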
1.4 In-flight metrics are NOT scraped
There is no Prometheus endpoint on the production airborne companion. The justification matches § 1.3: there is no scraper to scrape it; metrics are recorded into FDR and visible via NAMED_VALUE_FLOAT only. CI / dev environments DO expose /metrics (see § 3 below).
2. Post-flight onboard (operator workstation)
When the operator plugs the companion in post-landing:
- FDR retrieval (operator tooling C12; a feature outside this document's scope, but one that affects observability): operator-orchestrator reads the FDR ring, copies it to the workstation, and seals the in-flight ring. The companion's per-flight ephemeral keys are deleted at this step (environment_strategy.md § Per-flight key lifecycle).
- Visualization (operator tooling C12): the workstation renders:
  - Time-series of horiz_accuracy, vert_accuracy, last_anchor_age_ms, source label timeline, thermal-throttle hybrid switches, and CPU / GPU / temp.
  - Map view: emitted positions vs. (when available) FC GLOBAL_POSITION_INT ground truth.
  - Spoofing / VISUAL_BLACKOUT event markers overlaid on the timeline.
  - Per-flight summary: total mid-flight tiles emitted, FDR segment drops (if any), AC-NEW-4 / AC-NEW-7 statistics for this flight.
- NFT-RES-03 / NFT-SEC-01 corpus contribution: if the operator opts in, the flight's emitted positions + FC ground truth are added to the AC-NEW-4 / AC-NEW-7 Monte-Carlo corpus for the next CI run.
- Forensic thumbnail review (AC-8.5 exception): failed-tile thumbnails are visible in the operator UI for human review; this is the only image-data review surface.
3. CI / dev environments (Tier-1 / Tier-2)
Tier-1 dev / staging containers DO expose conventional observability surfaces, because they're being driven by humans and CI orchestrators that need them. The airborne profile of § 1 is the production-only profile.
3.1 Logging (Tier-1 / Tier-2)
Structured JSON to stdout/stderr (consumed by the developer's docker compose logs or by CI's log collector):
{
  "timestamp": "2026-05-09T08:42:11.234Z",
  "level": "INFO",
  "service": "gps-denied-companion",
  "component": "C5",
  "flight_id": "<uuid>",
  "monotonic_ms": 12345,
  "message": "Source label transition",
  "context": {
    "from": "satellite_anchored",
    "to": "visual_propagated",
    "reason": "vpr_no_match"
  }
}
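A record of this shape can be produced with the standard `logging` module. The sketch below is one possible formatter, not the project's: the `JsonFormatter` class is hypothetical, and it assumes that `component`, `flight_id`, `monotonic_ms`, and `context` are passed through the logger's `extra` argument.

```python
import json
import logging
from datetime import datetime, timezone

# Python names the level WARNING; the level table in this section uses WARN.
LEVEL_NAMES = {"WARNING": "WARN"}


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": "gps-denied-companion",
            # The fields below are expected via logger.info(..., extra={...}).
            "component": getattr(record, "component", None),
            "flight_id": getattr(record, "flight_id", None),
            "monotonic_ms": getattr(record, "monotonic_ms", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)
```

Attaching it is a matter of `handler.setFormatter(JsonFormatter())` on a `logging.StreamHandler`.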
Log levels:
| Level | Usage | Example |
|---|---|---|
| ERROR | Exceptions; component fault that triggered AC-5.2 fallback | "VIO strategy initialization failed: GTSAM dlopen failed" |
| WARN | Degraded behavior; FDR segment drop; thermal-throttle hybrid switch | "Thermal throttle active; downgrading K=3 → K=2" |
| INFO | Significant lifecycle events; source label transition | "Source label: satellite_anchored → visual_propagated" |
| DEBUG | Per-frame diagnostic — Tier-1 / dev only; production refuses this level (environment_strategy.md § Variable validation) | "MatchResult: 47 inliers, residual=2.3px" |
PII / safety-sensitive content: no GPS coordinates in DEBUG / INFO logs by default. Only horiz_accuracy (a scalar) is INFO-loggable; the actual lat/lon is FDR-only. WARN / ERROR log records may include lat/lon when the operator's troubleshooting requires it; in that case the FDR still has the canonical record.
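The coordinate rule could be enforced centrally with a `logging.Filter`, as in this sketch. The filter class and the `lat` / `lon` key names are assumptions; it also assumes coordinates only ever appear in a `context` dict attached to the record, not in the message string.

```python
import logging

SENSITIVE_KEYS = frozenset({"lat", "lon"})  # assumed context key names


class CoordinateRedactionFilter(logging.Filter):
    """Strip lat/lon from the context of records below WARNING."""

    def filter(self, record: logging.LogRecord) -> bool:
        context = getattr(record, "context", None)
        if isinstance(context, dict) and record.levelno < logging.WARNING:
            record.context = {
                key: value for key, value in context.items()
                if key not in SENSITIVE_KEYS
            }
        return True  # never suppress the record itself
```

WARN and ERROR records pass through unchanged, matching the troubleshooting exception described above.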
Log retention:
| Environment | Destination | Retention |
|---|---|---|
| dev-tier1 | Docker stdout | Container lifetime |
| dev-tier2 | journald (Jetson) | 7 days |
| staging-tier1 (CI) | GitHub Actions log artifact | 30 days (matches CI artifact retention) |
| staging-tier2 (Jetson CI) | Self-hosted runner journald + uploaded report | 30 days |
| production | journald (Jetson) | 7 days, see § 1.3 |
3.2 Metrics (Tier-1 / Tier-2)
Prometheus-compatible /metrics endpoint on dev-tier1, staging-tier1, staging-tier2. Disabled on production (no listener on the airborne companion, NFT-SEC-05).
Application metrics:
| Metric | Type | Description |
|---|---|---|
| gps_denied_frame_processed_total | Counter | Total nav frames processed (per GPS_DENIED_VIO_STRATEGY label) |
| gps_denied_frame_emit_latency_seconds | Histogram | End-to-end frame → emit latency (the AC-4.1 metric) |
| gps_denied_source_label_total | Counter | Counter per source label (satellite_anchored, visual_propagated, dead_reckoned) |
| gps_denied_vpr_match_rate | Gauge | Rolling-1-minute rate of successful VPR matches |
| gps_denied_thermal_hybrid_active | Gauge | 0/1: whether the K=2 thermal-throttle hybrid is active (D-CROSS-LATENCY-1) |
| gps_denied_fdr_segment_drops_total | Counter | Total FDR segment drops this run (AC-NEW-3 audit) |
| gps_denied_fdr_size_bytes | Gauge | Current FDR ring size in bytes (must stay ≤ 64 GB) |
| gps_denied_signing_key_rotations_total | Counter | MAVLink signing key rotation count |
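For orientation, the sketch below renders two of these metrics in the Prometheus text exposition format using only the standard library. The `render_metric` helper, the label name `label`, and the help strings are hypothetical; a real /metrics endpoint would more likely use a Prometheus client library than hand-rendered text.

```python
def render_metric(name: str, metric_type: str, help_text: str, samples) -> str:
    """Render one metric family; `samples` is a list of (labels, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Usage: one labelled counter and one unlabelled gauge.
page = render_metric(
    "gps_denied_source_label_total", "counter", "Source label count",
    [({"label": "dead_reckoned"}, 3)],
) + render_metric(
    "gps_denied_fdr_segment_drops_total", "counter", "FDR segment drops", [({}, 0)],
)
```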
System metrics: standard process_*, python_* exporters; on Tier-2 also jetson_stats_* exposed via jtop exporter.
Business metrics (i.e., AC-derived):
| Metric | AC | Use |
|---|---|---|
| gps_denied_horiz_accuracy_m (gauge, last value) | AC-NEW-4 | Live operator dashboard on operator workstation post-flight; CI threshold checks |
| gps_denied_cold_start_seconds | AC-NEW-1 | Set once at takeoff load completion; NFT-PERF-03 reads it |
| gps_denied_spoofing_promotion_latency_seconds | AC-NEW-2 | Set on each promotion / rejection event; NFT-PERF-04 reads it |
Collection interval: 15 s (typical Prometheus default; Tier-2 NFT runs may use 1 s for AC-bound timing).
3.3 Distributed tracing — NOT applicable
The runtime is a single in-process Python program with no cross-service hops in flight (architecture.md § 5 internal communication is all in-process). Distributed tracing is therefore not applicable to the production runtime.
The Tier-1 integration setup DOES involve cross-container hops (companion ↔ mock-sat ↔ db ↔ e2e-runner), but those are exercised by the e2e test framework's own log + status capture; OpenTelemetry is not provisioned for this project. If a future cycle introduces a multi-process companion (which ADR-004 explicitly rejected for the airborne profile but might appear on the operator workstation for C11 Tile Manager + C12 Operator Pre-flight Orchestrator), tracing can be reconsidered then.
4. Alerting (post-flight, not in-flight)
There is no live in-flight alerting from the airborne companion. The operator's GCS is the live human-loop interface (STATUSTEXT severity stream § 1.2). All other alerting is post-flight:
| Source | Severity | Response Time | Conditions |
|---|---|---|---|
| FDR review (operator workstation) | Critical | Same-day human review | FDR segment drop count > 0; component fail event; spoofing-promotion latency > 3 s; AC-NEW-4 outliers (P(err > 1 km) > 0.01 % in this flight's window) |
| FDR review | High | Next-day | AC-NEW-1 cold-start TTFF > 30 s p95 in this flight's window; thermal-throttle hybrid active > 25 % of the flight |
| FDR review | Medium | Within 1 week | Mid-flight tile failure rate > 5 %; high VPR no-match rate; sustained dead_reckoned periods > 10 s |
| CI (Tier-2) | Critical | Block PR merge | Any AC-bound NFT failure (architecture.md § 6 NFR list) |
| CI (Tier-1) | Critical | Block PR merge | Build failure; security CVE; SBOM diff fail (ADR-002) |
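The FDR-review rows of the table translate into a small classifier, sketched below. The `FlightSummary` fields are hypothetical names for figures the post-flight tooling would have to compute. Only the numeric thresholds stated in the table are encoded; the AC-NEW-4 outlier check, the cold-start p95 check, and the unquantified "high VPR no-match rate" condition are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlightSummary:
    """Per-flight figures; all field names are hypothetical."""
    fdr_segment_drops: int = 0
    component_fail_events: int = 0
    max_spoofing_promotion_latency_s: float = 0.0
    thermal_hybrid_fraction: float = 0.0       # share of flight time, 0..1
    mid_flight_tile_failure_rate: float = 0.0  # 0..1
    longest_dead_reckoned_s: float = 0.0


def classify(summary: FlightSummary) -> Optional[str]:
    """Return the highest severity triggered by this flight, or None."""
    if (summary.fdr_segment_drops > 0
            or summary.component_fail_events > 0
            or summary.max_spoofing_promotion_latency_s > 3.0):
        return "critical"
    if summary.thermal_hybrid_fraction > 0.25:
        return "high"
    if (summary.mid_flight_tile_failure_rate > 0.05
            or summary.longest_dead_reckoned_s > 10.0):
        return "medium"
    return None
```

Checking the conditions from most to least severe means a flight is reported once, at its highest severity.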
Notification channels:
| Severity | Channel |
|---|---|
| Critical (FDR or CI) | Slack #gps-denied-ops + email |
| High | Slack #gps-denied-ops |
| Medium | Slack #gps-denied-ops (digest) |
There is no PagerDuty / on-call rotation for this project; in-flight failures are handled by the FC's IMU-only fallback (AC-5.2), not by an operations team.
5. Dashboards
5.1 Operator workstation post-flight dashboard
Built into operator-orchestrator C12. Per flight:
- Time series: source label, horiz_accuracy, last_anchor_age_ms, CPU%, GPU%, temp.
- Event markers: VISUAL_BLACKOUT entries, spoofing events, signing key rotations, thermal hybrid switches.
- Map: emitted track + FC ground truth (when available) + pre-flight cache footprint + mid-flight tile coverage.
- Statistics: per-flight error CDF; AC-NEW-4 contribution; mid-flight tile counts.
- FDR audit table: any 0x000F lifecycle events of severity ≥ WARN.
5.2 CI dashboard (Tier-2)
GitHub Actions job summary plus a per-NFT report uploaded as workflow artifact. The summary includes:
- Pass / fail per NFT scenario.
- For NFT-PERF-*: histogram of latencies + comparison to threshold.
- For NFT-LIM-*: peak memory / FDR size traces.
- For NFT-RES-*: AC-NEW-4 / AC-NEW-7 statistical summary with stated 95 % CI.
- For IT-12: comparative-study summary across all VIO / VPR strategies in the research binary.
There is no live CI dashboard separate from the GitHub Actions UI; the project is small enough that the per-PR job summary is sufficient.
5.3 No live in-flight dashboard
Out of scope by design. The GCS is the only live operator surface; all other inspection is post-flight.
6. Open Items / Plan-Phase Carryforward
- Long-term FDR archive (multi-flight statistical headroom): D-PROJ-3 (multi-flight fixture acquisition for AC-NEW-4 / AC-NEW-7) is not pursued this cycle. If pursued in a future cycle, post-flight FDR archives become a corpus contribution path; the operator-orchestrator FDR-retrieval step would need an explicit "contribute to corpus" toggle.
- Telemetry-link encryption beyond MAVLink-2.0 signing: out of scope; addressed by physical link assumptions in the threat model (architecture.md § 7).
- iNav signing: still has no equivalent to MAVLink-2.0 signing (Mode B Source #129). Carryforward Plan-phase action: file a feature request upstream; meanwhile observability for iNav-profile flights is the same as AP-profile minus the MavlinkSigningKeyRotated records (which are NULL on iNav flights per data_model.md § 2.2).