Oleksandr Bezdieniezhnykh 5fe67023b2 [AZ-329] [AZ-330] [AZ-523] [AZ-524] Batch 44 atomic refactor
Implements two new C12 services and rebalances the C11/C12 boundary
in one atomic commit:

* AZ-329 PostLandingUploadOrchestrator — gates C11 upload on the
  `flight_footer` FDR record's `clean_shutdown` field; 4 refusal
  modes; new FdrFooterReader Protocol + LocalFdrFooterReader.
* AZ-330 OperatorReLocService — AC-3.4 visual-loss re-localization
  hint; reuses shared LatLonAlt; OperatorCommandTransport Protocol
  cut (E-C8 owns the future pymavlink concrete); new FDR record
  kind `c12.reloc.requested`; log redaction (lat/lon 5 decimals,
  reason 200 chars).
* AZ-523 C11 internal flight-state gate removed (SRP refactor):
  `confirm_flight_state` / `FlightStateSignal` use /
  `FlightStateNotOnGroundError` deleted from C11; TileUploader
  contract bumped to v2.0.0 (frozen) with migration note; AZ-317
  superseded.
* AZ-524 Package rename `c12_operator_tooling` →
  `c12_operator_orchestrator` across source, tests, pyproject,
  CMake, Dockerfile, compose, CI, runtime-root services class
  (`OperatorOrchestratorServices`) + factory function
  (`build_operator_orchestrator`), logger namespaces, config slug,
  docs, and the E-C12 epic title.

Tests: 1543 passed, 80 skipped (all environment gates). Targeted
AC suite (AZ-329 + AZ-330 + FdrFooterReader): 37 passed. Cold-start
NFR-perf still ≤ 500 ms p99.

Tracker: AZ-317 → Done (superseded); AZ-319 v2.0.0 contract bump
comment; AZ-329/AZ-330 → In Testing; AZ-253 epic renamed; AZ-523
+ AZ-524 created and closed as audit-trail tickets.

See `_docs/03_implementation/batch_44_cycle1_report.md`.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-13 19:42:46 +03:00


GPS-Denied Onboard — Observability

Date: 2026-05-09 (Plan Phase 2c — initial draft). Inputs: _docs/02_document/architecture.md § 7 (Audit logging) + § 6 (NFRs); _docs/02_document/data_model.md § 2.8 (FDR); ADR-005 (Tier-1 / Tier-2); AC-NEW-3 (FDR ≤ 64 GB / no silent drops); AC-NEW-5 (operating envelope).

Observability is asymmetric by design

Most CI/CD templates assume a network-connected service that pushes structured logs to an aggregator and exposes Prometheus metrics for live scraping. This project's airborne profile does not fit that model: architecture.md (ADR-004, § 7, Principle #4) requires no inbound network listening and no outbound network egress in flight, enforced by NFT-SEC-05. The Jetson operates as an embedded edge device, not a service.

Observability therefore splits into three regimes:

| Regime | Where | Live or post-flight | Primary mechanism |
|---|---|---|---|
| In-flight onboard | Production Jetson, in flight | Live (to FDR ring) + best-effort live (to GCS) | FDR binary record stream + GCS STATUSTEXT / NAMED_VALUE_FLOAT |
| Post-flight onboard | Operator workstation after pulling the FDR | Post-flight | FDR replay + visualization in operator-orchestrator C12 |
| CI / dev (Tier-1, Tier-2) | Workstation Docker / Jetson CI runner | Live | Standard structured logging + Prometheus metrics endpoint where applicable |

The sections below are organized by regime.

1. In-flight onboard (production Jetson)

1.1 FDR (Flight Data Recorder) — primary observability sink

Schema is in data_model.md § 2.8. Every observable event in flight goes through FDR. The FDR is append-only, lossy on overrun (logged, never silent), and per-flight ring-bounded at ≤ 64 GB (AC-NEW-3).

Observability events that emit FDR records:

| Component | Event | FDR record type |
|---|---|---|
| C8 outbound | Every emitted EmittedExternalPosition to FC | 0x0001 EmittedExternalPosition |
| C8 inbound | Every received MAVLink frame (raw tlog-style) | 0x0003 ReceivedMavlinkRaw |
| C8 inbound (iNav) | Every received MSP2 frame | 0x0004 ReceivedMsp2Raw |
| C8 inbound | IMU window forwarded to C1 / C5 | 0x0002 ImuTrace |
| C5 | Source-label transition (satellite_anchored ↔ visual_propagated ↔ dead_reckoned) | 0x0006 SourceLabelTransition |
| C5 + C8 | Spoofing-promotion / -rejection event | 0x000C SpoofingPromotionEvent |
| C5 | VISUAL_BLACKOUT entry / exit (AC-3.5, AC-NEW-8) | 0x000B VisualBlackoutEvent |
| C6 | Mid-flight tile emit | 0x0007 MidFlightTileEmitted |
| C6 | Mid-flight tile failure (with thumbnail filename, AC-8.5 forensic exception) | 0x0008 MidFlightTileFailed |
| C7 (inference) | Thermal-throttle hybrid switch K=3 ↔ K=2 | 0x000E ThermalThrottleHybridSwitch |
| C8 | MAVLink-2.0 signing key rotation event (D-C8-9) | 0x0009 MavlinkSigningKeyRotated |
| C8 | EKF source-set switch event (D-C8-2 = (b)) | 0x000A EkfSourceSetCommand |
| C10 | Pre-flight content-hash gate fail | 0x000D ContentHashGateFail |
| All components | Lifecycle events (start / stop / fail) | 0x000F ComponentLifecycleEvent |
| jetson-stats collector (driven by C7 or a dedicated thread) | Per-second sample of CPU%, GPU%, temp, throttle flag, RAM, VRAM, NVM remaining | 0x0005 SystemHealth |

Lossy-on-overrun rule (AC-NEW-3 enforcement): if the FDR writer cannot keep up (NVM I/O bound), the writer drops the oldest segment in the current flight's ring AND emits a 0x000F ComponentLifecycleEvent of type fdr_segment_dropped to the new head segment. A segment drop is a hard observability signal — it appears in the post-flight report and in the GCS STATUSTEXT stream. There is no path that silently discards an event.
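
The drop-and-record behavior can be illustrated with a minimal sketch. It is not the FDR writer itself: the `Segment` and `SegmentRing` names are hypothetical, and a segment-count limit stands in for the real trigger (NVM I/O back-pressure). The only points it demonstrates are that the oldest segment is the one evicted and that the eviction always leaves a 0x000F record in the new head segment plus a STATUSTEXT line.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field

LIFECYCLE_EVENT = 0x000F  # ComponentLifecycleEvent record type


@dataclass
class Segment:
    index: int
    records: list = field(default_factory=list)


class SegmentRing:
    """Bounded ring of FDR segments; every drop leaves a record behind.

    Hypothetical stand-in: max_segments replaces the real overrun trigger.
    """

    def __init__(self, max_segments: int) -> None:
        self.max_segments = max_segments
        self.segments = deque([Segment(0)])
        self.drops = 0

    def append(self, record_type: int, body: dict) -> None:
        # Records always go to the current head (newest) segment.
        self.segments[-1].records.append((record_type, body))

    def roll(self) -> list[str]:
        """Open a new head segment and return STATUSTEXT lines to emit."""
        new_head = Segment(self.segments[-1].index + 1)
        dropped = None
        if len(self.segments) >= self.max_segments:
            dropped = self.segments.popleft()  # oldest segment goes first
            self.drops += 1
        self.segments.append(new_head)
        if dropped is None:
            return []
        # The drop is recorded in the new head segment, never silently.
        self.append(
            LIFECYCLE_EVENT,
            {"type": "fdr_segment_dropped", "segment": dropped.index},
        )
        return [f"GPS-DENIED: FDR segment {dropped.index} dropped (NVM bound)"]
```

The `drops` counter is the kind of value a post-flight report or a CI metric could read back; how the real writer exposes it is not specified here.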

Format: length-prefixed binary stream with record_header (magic 0x47464452 "GFDR" + version + type + monotonic_ms) followed by a per-type body and a CRC32. New record types are additive (data_model.md § 6.5).
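
As a rough illustration of that framing, the sketch below encodes and decodes records with Python's `struct` and `zlib` modules. The field widths, the big-endian byte order, and the choice to cover header plus body with the CRC are all assumptions made for the example; the authoritative layout is whatever data_model.md § 2.8 specifies. The function names are hypothetical.

```python
import struct
import zlib

MAGIC = 0x47464452  # ASCII "GFDR"

# Assumed layout (NOT taken from data_model.md): big-endian,
# u32 magic, u16 version, u16 record type, u64 monotonic_ms.
HEADER = struct.Struct(">IHHQ")
LENGTH = struct.Struct(">I")  # assumed u32 length prefix
CRC = struct.Struct(">I")     # CRC32 trailer


def encode_record(record_type: int, monotonic_ms: int, body: bytes,
                  version: int = 1) -> bytes:
    """Frame one record as: length | header | body | crc32."""
    payload = HEADER.pack(MAGIC, version, record_type, monotonic_ms) + body
    crc = zlib.crc32(payload)
    return LENGTH.pack(len(payload) + CRC.size) + payload + CRC.pack(crc)


def decode_records(stream: bytes):
    """Yield (record_type, monotonic_ms, body) for each framed record."""
    offset = 0
    while offset < len(stream):
        (length,) = LENGTH.unpack_from(stream, offset)
        offset += LENGTH.size
        frame = stream[offset:offset + length]
        offset += length
        payload = frame[:-CRC.size]
        (crc,) = CRC.unpack(frame[-CRC.size:])
        if zlib.crc32(payload) != crc:
            raise ValueError("CRC mismatch")
        magic, _version, record_type, monotonic_ms = HEADER.unpack_from(payload)
        if magic != MAGIC:
            raise ValueError("bad magic")
        yield record_type, monotonic_ms, payload[HEADER.size:]
```

Because unknown record types pass through the decoder untouched, a reader written this way tolerates additive record types without changes.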

Storage path: /var/lib/gps-denied/fdr/{flight_id}/segments/seg_NNNNN.bin. Thumbnails (AC-8.5) live at /var/lib/gps-denied/fdr/{flight_id}/thumbnails/. A flight's manifest.json (the FDR-side mirror, distinct from the PostgreSQL manifests row) sits at the flight's root and carries the flight metadata snapshot.

1.2 GCS telemetry (best-effort, bandwidth-limited)

The GCS link is the only outbound channel from the airborne companion (per architecture.md § 7). Bandwidth is bounded (AC-6.1: 12 Hz downsampled summary). The companion emits:

| MAVLink message | Rate | Content |
|---|---|---|
| STATUSTEXT | event-driven (only when something changes) | Source label transitions; spoofing-promotion / -rejection; VISUAL_BLACKOUT entry / exit; signing key rotation; FDR segment drop; component start / fail; thermal-throttle hybrid switch |
| NAMED_VALUE_FLOAT | 1 Hz | horiz_accuracy_m, vert_accuracy_m, vio_health (frame-quality 0..1), last_anchor_age_s, cpu_pct, gpu_pct, temp_c |
| GPS_RAW_INT | 12 Hz (AC-6.1) | Mirror of the AP GPS_INPUT we just emitted, downsampled; gives the operator a live position view in QGC |

These are best-effort — packet loss on the GCS link is treated as normal. The FDR remains the source of truth.

STATUSTEXT severity mapping:

| FDR event | STATUSTEXT severity | Example text |
|---|---|---|
| Source label → dead_reckoned | MAV_SEVERITY_WARNING | "GPS-DENIED: dead-reckoned (last anchor 12.3s ago)" |
| VISUAL_BLACKOUT entry | MAV_SEVERITY_NOTICE | "GPS-DENIED: VISUAL_BLACKOUT entered (reason=low_features)" |
| Spoofing rejected | MAV_SEVERITY_NOTICE | "GPS-DENIED: spoofed FC GPS rejected (last visual consistency PASS 0.4s ago)" |
| Spoofing promoted (10 s + visual gate passed) | MAV_SEVERITY_INFO | "GPS-DENIED: FC GPS promoted to fused source" |
| FDR segment dropped | MAV_SEVERITY_WARNING | "GPS-DENIED: FDR segment 47 dropped (NVM bound)" |
| Signing key rotation | MAV_SEVERITY_INFO | "GPS-DENIED: MAVLink signing key rotated" |
| Component fail | MAV_SEVERITY_CRITICAL | "GPS-DENIED: VIO strategy fault — failover to FC IMU-only (AC-5.2)" |
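
The mapping above could be held in a small lookup table, sketched here under a few assumptions: the event keys are hypothetical names invented for the example, and the helper does not send anything; it only pairs a severity with text. The numeric values follow the MAVLink `MAV_SEVERITY` enum, and the 50-character chunking reflects the size of the STATUSTEXT `text` field, which several of the example strings exceed.

```python
from enum import IntEnum


class MavSeverity(IntEnum):
    # Subset of the MAVLink MAV_SEVERITY enum used by this document.
    CRITICAL = 2
    WARNING = 4
    NOTICE = 5
    INFO = 6


# Hypothetical event keys; the severities mirror the table above.
SEVERITY_BY_EVENT = {
    "source_label_dead_reckoned": MavSeverity.WARNING,
    "visual_blackout_entry": MavSeverity.NOTICE,
    "spoofing_rejected": MavSeverity.NOTICE,
    "spoofing_promoted": MavSeverity.INFO,
    "fdr_segment_dropped": MavSeverity.WARNING,
    "signing_key_rotated": MavSeverity.INFO,
    "component_fail": MavSeverity.CRITICAL,
}

STATUSTEXT_MAX_CHARS = 50  # size of the STATUSTEXT text field


def statustext_for(event: str, detail: str):
    """Return (severity, chunks), each chunk fitting one STATUSTEXT."""
    severity = SEVERITY_BY_EVENT[event]
    text = f"GPS-DENIED: {detail}"
    chunks = [
        text[i:i + STATUSTEXT_MAX_CHARS]
        for i in range(0, len(text), STATUSTEXT_MAX_CHARS)
    ]
    return severity, chunks
```

A transport layer would still need to attach the chunk sequencing fields that MAVLink 2 defines for multi-part STATUSTEXT; that part is left out of the sketch.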

1.3 No console logging in flight

The production deployment binary refuses LOG_LEVEL=DEBUG by default (environment_strategy.md § Variable validation). The airborne companion has no operator-readable console, so even ERROR-level logs go to journald + FDR rather than stdout. journald keeps a rolling 7-day buffer, separate from the FDR's per-flight retention.
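
A startup check along these lines could enforce the DEBUG refusal. This is a sketch only: `resolve_log_level`, the `profile` argument, and the `allow_debug` escape hatch are hypothetical names, and the actual validation rules live in environment_strategy.md.

```python
ALLOWED_LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"}


def resolve_log_level(env: dict, profile: str, allow_debug: bool = False) -> str:
    """Validate LOG_LEVEL for a deployment profile.

    `profile` and `allow_debug` are hypothetical parameters for this sketch.
    """
    level = env.get("LOG_LEVEL", "INFO").upper()
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unknown LOG_LEVEL: {level!r}")
    if profile == "production" and level == "DEBUG" and not allow_debug:
        # Fail fast at startup rather than run with per-frame logging.
        raise ValueError("LOG_LEVEL=DEBUG is refused on the production profile")
    return level
```

In a real entry point the `env` argument would typically be `os.environ`; taking it as a parameter keeps the check easy to test.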

1.4 In-flight metrics are NOT scraped

There is no Prometheus endpoint on the production airborne companion. The justification matches § 1.3: there is no scraper to scrape it; metrics are recorded into FDR and visible via NAMED_VALUE_FLOAT only. CI / dev environments DO expose /metrics (see § 3 below).

2. Post-flight onboard (operator workstation)

When the operator plugs the companion in post-landing:

  1. FDR retrieval (operator tooling C12; the feature itself is outside this document's scope, but it affects observability): operator-orchestrator reads the FDR ring, copies it to the workstation, and seals the in-flight ring. The companion's per-flight ephemeral keys are deleted at this step (environment_strategy.md § Per-flight key lifecycle).
  2. Visualization (operator tooling C12): the workstation renders:
    • Time-series of horiz_accuracy, vert_accuracy, last_anchor_age_ms, source label timeline, thermal-throttle hybrid switches, and CPU / GPU / temp.
    • Map view: emitted positions vs. (when available) FC GLOBAL_POSITION_INT ground truth.
    • Spoofing / VISUAL_BLACKOUT event markers overlaid on the timeline.
    • Per-flight summary: total mid-flight tiles emitted, FDR segment drops (if any), AC-NEW-4 / AC-NEW-7 statistics for this flight.
  3. NFT-RES-03 / NFT-SEC-01 corpus contribution: if the operator opts in, the flight's emitted positions + FC ground truth are added to the AC-NEW-4 / AC-NEW-7 Monte-Carlo corpus for the next CI run.
  4. Forensic thumbnail review (AC-8.5 exception): failed-tile thumbnails are visible in the operator UI for human review; this is the only image-data review surface.

3. CI / dev environments (Tier-1 / Tier-2)

Tier-1 dev / staging containers DO expose conventional observability surfaces, because they're being driven by humans and CI orchestrators that need them. The airborne profile of § 1 is the production-only profile.

3.1 Logging (Tier-1 / Tier-2)

Structured JSON to stdout/stderr (consumed by the developer's docker compose logs or by CI's log collector):

{
  "timestamp": "2026-05-09T08:42:11.234Z",
  "level": "INFO",
  "service": "gps-denied-companion",
  "component": "C5",
  "flight_id": "<uuid>",
  "monotonic_ms": 12345,
  "message": "Source label transition",
  "context": {
    "from": "satellite_anchored",
    "to": "visual_propagated",
    "reason": "vpr_no_match"
  }
}
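
One way to produce records of this shape with the standard `logging` module is a custom formatter, sketched below. The class name is hypothetical, the per-record fields are assumed to arrive through the `extra=` argument, and `monotonic_ms` is sampled at format time here, which may differ from how the companion actually stamps it.

```python
import json
import logging
import time
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render a LogRecord as one JSON object per line (sketch)."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            # The document uses WARN; Python's level name is WARNING.
            "level": "WARN" if record.levelname == "WARNING" else record.levelname,
            "service": "gps-denied-companion",
            # The next three fields are expected via logger.info(..., extra={...}).
            "component": getattr(record, "component", None),
            "flight_id": getattr(record, "flight_id", None),
            "context": getattr(record, "context", {}),
            "monotonic_ms": int(time.monotonic() * 1000),
            "message": record.getMessage(),
        }
        return json.dumps(entry)
```

A caller would attach it to a stream handler and log with, for example, `logger.info("Source label transition", extra={"component": "C5", "flight_id": flight_id, "context": {...}})`.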

Log levels:

| Level | Usage | Example |
|---|---|---|
| ERROR | Exceptions; component fault that triggered AC-5.2 fallback | "VIO strategy initialization failed: GTSAM dlopen failed" |
| WARN | Degraded behavior; FDR segment drop; thermal-throttle hybrid switch | "Thermal throttle active; downgrading K=3 → K=2" |
| INFO | Significant lifecycle events; source label transition | "Source label: satellite_anchored → visual_propagated" |
| DEBUG | Per-frame diagnostic (Tier-1 / dev only; production refuses this level, see environment_strategy.md § Variable validation) | "MatchResult: 47 inliers, residual=2.3px" |

PII / safety-sensitive content: no GPS coordinates in DEBUG / INFO logs by default. Only horiz_accuracy (a scalar) is INFO-loggable; the actual lat/lon is FDR-only. WARN / ERROR log records may include lat/lon when the operator's troubleshooting requires it; in that case the FDR still has the canonical record.
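
The coordinate rule lends itself to a `logging.Filter`, sketched here under assumptions: the filter class and the set of key names treated as coordinates are hypothetical, and the sketch only inspects a `context` dict attached to the record, not free-form message text.

```python
import logging

# Hypothetical key names; the real context schema is not specified here.
COORDINATE_KEYS = {"lat", "lon", "latitude", "longitude"}


class CoordinateRedactionFilter(logging.Filter):
    """Strip coordinate keys from record.context below WARNING level."""

    def filter(self, record: logging.LogRecord) -> bool:
        context = getattr(record, "context", None)
        if isinstance(context, dict) and record.levelno < logging.WARNING:
            record.context = {
                key: value
                for key, value in context.items()
                if key not in COORDINATE_KEYS
            }
        return True  # never suppress the record itself
```

Scalars such as horiz_accuracy pass through unchanged, and WARN / ERROR records keep their coordinates, which matches the rule as stated.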

Log retention:

| Environment | Destination | Retention |
|---|---|---|
| dev-tier1 | Docker stdout | Container lifetime |
| dev-tier2 | journald (Jetson) | 7 days |
| staging-tier1 (CI) | GitHub Actions log artifact | 30 days (matches CI artifact retention) |
| staging-tier2 (Jetson CI) | Self-hosted runner journald + uploaded report | 30 days |
| production | journald (Jetson) | 7 days, see § 1.3 |

3.2 Metrics (Tier-1 / Tier-2)

Prometheus-compatible /metrics endpoint on dev-tier1, staging-tier1, staging-tier2. Disabled on production (no listener on the airborne companion, NFT-SEC-05).

Application metrics:

| Metric | Type | Description |
|---|---|---|
| gps_denied_frame_processed_total | Counter | Total nav frames processed (per GPS_DENIED_VIO_STRATEGY label) |
| gps_denied_frame_emit_latency_seconds | Histogram | End-to-end frame → emit latency (the AC-4.1 metric) |
| gps_denied_source_label_total | Counter | Counter per source label (satellite_anchored / visual_propagated / dead_reckoned) |
| gps_denied_vpr_match_rate | Gauge | Rolling-1-minute rate of successful VPR matches |
| gps_denied_thermal_hybrid_active | Gauge | 0/1 flag: is the K=2 thermal-throttle hybrid active? (D-CROSS-LATENCY-1) |
| gps_denied_fdr_segment_drops_total | Counter | Total FDR segment drops this run (AC-NEW-3 audit) |
| gps_denied_fdr_size_bytes | Gauge | Current FDR ring size in bytes (must stay ≤ 64 GB) |
| gps_denied_signing_key_rotations_total | Counter | MAVLink signing key rotation count |

System metrics: standard process_*, python_* exporters; on Tier-2 also jetson_stats_* exposed via jtop exporter.
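
To show what a scrape of one of these metrics might look like, here is a dependency-free sketch that renders a labelled counter in the Prometheus text exposition format. In practice a client library would normally do this; the class name, the `source_label` label name, and the help text are all assumptions for the example.

```python
from collections import Counter

SOURCE_LABELS = ("satellite_anchored", "visual_propagated", "dead_reckoned")


class SourceLabelCounter:
    """Minimal labelled counter with text-format rendering (sketch)."""

    name = "gps_denied_source_label_total"

    def __init__(self) -> None:
        # Pre-seed every label so a scrape always shows all three series.
        self._counts = Counter({label: 0 for label in SOURCE_LABELS})

    def inc(self, label: str) -> None:
        if label not in SOURCE_LABELS:
            raise ValueError(f"unknown source label: {label!r}")
        self._counts[label] += 1

    def render(self) -> str:
        lines = [
            f"# HELP {self.name} Count per source label",
            f"# TYPE {self.name} counter",
        ]
        for label in SOURCE_LABELS:
            # "source_label" is a hypothetical label name.
            lines.append(
                f'{self.name}{{source_label="{label}"}} {self._counts[label]}'
            )
        return "\n".join(lines) + "\n"
```

Rejecting unknown labels keeps the series set bounded, which matters for any counter keyed by a free-form string.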

Business metrics (i.e., AC-derived):

| Metric | AC | Use |
|---|---|---|
| gps_denied_horiz_accuracy_m (gauge, last value) | AC-NEW-4 | Live operator dashboard on operator workstation post-flight; CI threshold checks |
| gps_denied_cold_start_seconds | AC-NEW-1 | Set once at takeoff load completion; NFT-PERF-03 reads it |
| gps_denied_spoofing_promotion_latency_seconds | AC-NEW-2 | Set on each promotion / rejection event; NFT-PERF-04 reads it |

Collection interval: 15 s (typical Prometheus default; Tier-2 NFT runs may use 1 s for AC-bound timing).

3.3 Distributed tracing — NOT applicable

The runtime is a single in-process Python program with no cross-service hops in flight (architecture.md § 5 internal communication is all in-process). Distributed tracing is therefore not applicable to the production runtime.

The Tier-1 integration setup DOES involve cross-container hops (companion ↔ mock-sat ↔ db ↔ e2e-runner), but those are exercised by the e2e test framework's own log + status capture; OpenTelemetry is not provisioned for this project. If a future cycle introduces a multi-process companion (which ADR-004 explicitly rejected for the airborne profile but might appear on the operator workstation for C11 Tile Manager + C12 Operator Pre-flight Orchestrator), tracing can be reconsidered then.

4. Alerting (post-flight, not in-flight)

There is no live in-flight alerting from the airborne companion. The operator's GCS is the live human-loop interface (STATUSTEXT severity stream § 1.2). All other alerting is post-flight:

| Source | Severity | Response Time | Conditions |
|---|---|---|---|
| FDR review (operator workstation) | Critical | Same-day human review | FDR segment drop count > 0; component fail event; spoofing-promotion latency > 3 s; AC-NEW-4 outliers (P(err > 1 km) > 0.01 % in this flight's window) |
| FDR review | High | Next-day | AC-NEW-1 cold-start TTFF > 30 s p95 in this flight's window; thermal-throttle hybrid active > 25 % of the flight |
| FDR review | Medium | Within 1 week | Mid-flight tile failure rate > 5 %; high VPR no-match rate; sustained dead_reckoned periods > 10 s |
| CI (Tier-2) | Critical | Block PR merge | Any AC-bound NFT failure (architecture.md § 6 NFR list) |
| CI (Tier-1) | Critical | Block PR merge | Build failure; security CVE; SBOM diff fail (ADR-002) |
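
The FDR-review rows can be read as a set of threshold checks over a per-flight summary. The sketch below encodes them that way; the `FlightSummary` fields and the `classify` function are hypothetical, and the "high VPR no-match rate" condition is left out because the table gives no numeric threshold for it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FlightSummary:
    # Hypothetical per-flight statistics derived from FDR replay.
    fdr_segment_drops: int = 0
    component_fail_events: int = 0
    max_spoofing_promotion_latency_s: float = 0.0
    p_err_over_1km: float = 0.0           # probability in 0..1
    cold_start_ttff_p95_s: float = 0.0
    thermal_hybrid_fraction: float = 0.0  # share of flight time, 0..1
    tile_failure_rate: float = 0.0        # 0..1
    longest_dead_reckoned_s: float = 0.0


def classify(s: FlightSummary) -> tuple[str | None, list[tuple[str, str]]]:
    """Return (highest severity, all findings) for one flight."""
    checks = [
        ("Critical", "FDR segment drop", s.fdr_segment_drops > 0),
        ("Critical", "component fail event", s.component_fail_events > 0),
        ("Critical", "spoofing-promotion latency > 3 s",
         s.max_spoofing_promotion_latency_s > 3.0),
        ("Critical", "P(err > 1 km) > 0.01 %", s.p_err_over_1km > 0.0001),
        ("High", "cold-start TTFF p95 > 30 s", s.cold_start_ttff_p95_s > 30.0),
        ("High", "thermal hybrid active > 25 % of flight",
         s.thermal_hybrid_fraction > 0.25),
        ("Medium", "mid-flight tile failure rate > 5 %",
         s.tile_failure_rate > 0.05),
        ("Medium", "dead_reckoned period > 10 s",
         s.longest_dead_reckoned_s > 10.0),
    ]
    findings = [(sev, reason) for sev, reason, hit in checks if hit]
    for severity in ("Critical", "High", "Medium"):
        if any(sev == severity for sev, _ in findings):
            return severity, findings
    return None, findings
```

Returning every finding, not just the highest one, lets a report list Medium items alongside the Critical item that set the response time.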

Notification channels:

| Severity | Channel |
|---|---|
| Critical (FDR or CI) | Slack #gps-denied-ops + email |
| High | Slack #gps-denied-ops |
| Medium | Slack #gps-denied-ops (digest) |

There is no PagerDuty / on-call rotation for this project; in-flight failures are handled by the FC's IMU-only fallback (AC-5.2), not by an operations team.

5. Dashboards

5.1 Operator workstation post-flight dashboard

Built into operator-orchestrator C12. Per flight:

  • Time series: source label, horiz_accuracy, last_anchor_age_ms, CPU%, GPU%, temp.
  • Event markers: VISUAL_BLACKOUT entries, spoofing events, signing key rotations, thermal hybrid switches.
  • Map: emitted track + FC ground truth (when available) + pre-flight cache footprint + mid-flight tile coverage.
  • Statistics: per-flight error CDF; AC-NEW-4 contribution; mid-flight tile counts.
  • FDR audit table: any 0x000F lifecycle events of severity ≥ WARN.

5.2 CI dashboard (Tier-2)

GitHub Actions job summary plus a per-NFT report uploaded as workflow artifact. The summary includes:

  • Pass / fail per NFT scenario.
  • For NFT-PERF-*: histogram of latencies + comparison to threshold.
  • For NFT-LIM-*: peak memory / FDR size traces.
  • For NFT-RES-*: AC-NEW-4 / AC-NEW-7 statistical summary with stated 95 % CI.
  • For IT-12: comparative-study summary across all VIO / VPR strategies in the research binary.

There is no live CI dashboard separate from the GitHub Actions UI; the project is small enough that the per-PR job summary is sufficient.

5.3 No live in-flight dashboard

Out of scope by design. The GCS is the only live operator surface; all other inspection is post-flight.

6. Open Items / Plan-Phase Carryforward

  • Long-term FDR archive (multi-flight statistical headroom): D-PROJ-3 (multi-flight fixture acquisition for AC-NEW-4 / AC-NEW-7) is not pursued this cycle. If pursued in a future cycle, post-flight FDR archives become a corpus contribution path; the operator-orchestrator FDR-retrieval step would need an explicit "contribute to corpus" toggle.
  • Telemetry-link encryption beyond MAVLink-2.0 signing: out of scope; addressed by physical link assumptions in the threat model (architecture.md § 7).
  • iNav signing: iNav still has no equivalent to MAVLink-2.0 signing (Mode B Source #129). Plan-phase carryforward action: file a feature request upstream. Meanwhile, observability for iNav-profile flights is the same as for AP-profile flights, minus the MavlinkSigningKeyRotated records (which are NULL on iNav flights per data_model.md § 2.2).