# GPS-Denied Onboard — Observability

> Date: 2026-05-09 (Plan Phase 2c — initial draft).
> Inputs: `_docs/02_document/architecture.md` § 7 (Audit logging) + § 6 (NFRs); `_docs/02_document/data_model.md` § 2.8 (FDR); ADR-005 (Tier-1 / Tier-2); AC-NEW-3 (FDR ≤ 64 GB / no silent drops); AC-NEW-5 (operating envelope).

## Observability is asymmetric by design

Most CI/CD templates assume a network-connected service that pushes structured logs to an aggregator and exposes Prometheus metrics for live scraping. **This project's airborne profile does not.** Architecture.md ADR-004 + § 7 + Principle #4 require **no inbound network listening and no outbound network egress in flight** (NFT-SEC-05 enforces). The Jetson is operating as an embedded edge device, not a service.

Observability therefore splits into three regimes:

| Regime | Where | Live or post-flight | Primary mechanism |
|---|---|---|---|
| **In-flight onboard** | Production Jetson, in flight | Live (to FDR ring) + best-effort live (to GCS) | FDR binary record stream + GCS STATUSTEXT / NAMED_VALUE_FLOAT |
| **Post-flight onboard** | Operator workstation after pulling the FDR | Post-flight | FDR replay + visualization in operator-orchestrator C12 |
| **CI / dev (Tier-1, Tier-2)** | Workstation Docker / Jetson CI runner | Live | Standard structured logging + Prometheus metrics endpoint where applicable |

The sections below are organized by regime.

## 1. In-flight onboard (production Jetson)

### 1.1 FDR (Flight Data Recorder) — primary observability sink

Schema is in `data_model.md` § 2.8. Every observable event in flight goes through FDR. The FDR is **append-only**, **lossy on overrun (logged, never silent)**, and **per-flight ring-bounded at ≤ 64 GB** (AC-NEW-3).

Observability events that emit FDR records:

| Component | Event | FDR record type |
|---|---|---|
| C8 outbound | Every emitted `EmittedExternalPosition` to FC | `0x0001 EmittedExternalPosition` |
| C8 inbound | Every received MAVLink frame (raw `tlog`-style) | `0x0003 ReceivedMavlinkRaw` |
| C8 inbound (iNav) | Every received MSP2 frame | `0x0004 ReceivedMsp2Raw` |
| C8 inbound | IMU window forwarded to C1 / C5 | `0x0002 ImuTrace` |
| C5 | Source-label transition (`satellite_anchored` ↔ `visual_propagated` ↔ `dead_reckoned`) | `0x0006 SourceLabelTransition` |
| C5 + C8 | Spoofing-promotion / -rejection event | `0x000C SpoofingPromotionEvent` |
| C5 | VISUAL_BLACKOUT entry / exit (AC-3.5, AC-NEW-8) | `0x000B VisualBlackoutEvent` |
| C6 | Mid-flight tile emit | `0x0007 MidFlightTileEmitted` |
| C6 | Mid-flight tile failure (with thumbnail filename, AC-8.5 forensic exception) | `0x0008 MidFlightTileFailed` |
| C7 (inference) | Thermal-throttle hybrid switch K=3 ↔ K=2 | `0x000E ThermalThrottleHybridSwitch` |
| C8 | MAVLink-2.0 signing key rotation event (D-C8-9) | `0x0009 MavlinkSigningKeyRotated` |
| C8 | EKF source-set switch event (D-C8-2 = (b)) | `0x000A EkfSourceSetCommand` |
| C10 | Pre-flight content-hash gate fail | `0x000D ContentHashGateFail` |
| All components | Lifecycle events (start / stop / fail) | `0x000F ComponentLifecycleEvent` |
| `jetson-stats` collector (driven by C7 or a dedicated thread) | Per-second sample of CPU%, GPU%, temp, throttle flag, RAM, VRAM, NVM remaining | `0x0005 SystemHealth` |

**Lossy-on-overrun rule (AC-NEW-3 enforcement)**: if the FDR writer cannot keep up (NVM I/O bound), it drops the **oldest segment** in the current flight's ring and writes a `0x000F ComponentLifecycleEvent` of type `fdr_segment_dropped` into the new head segment. A segment drop is a hard observability signal: it appears in the post-flight report and in the GCS STATUSTEXT stream. There is no path that silently discards an event.

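
The drop-and-record rule can be sketched as a bounded ring of segments. This is a minimal illustration, not the project's writer: the `SegmentRing` and `Segment` names, the payload keys, and the choice to evict when the ring is full at rotation time (rather than on I/O back-pressure) are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field

LIFECYCLE_EVENT = 0x000F  # ComponentLifecycleEvent, from the record-type table


@dataclass
class Segment:
    index: int
    records: list = field(default_factory=list)  # (record_type, payload) tuples


class SegmentRing:
    """Bounded ring of segments; every eviction leaves a record behind."""

    def __init__(self, max_segments: int):
        self.max_segments = max_segments
        self.segments = deque([Segment(0)])
        self.dropped = []  # evicted segment indices, for the post-flight report

    def append(self, record_type: int, payload: dict) -> None:
        """Append a record to the current head segment."""
        self.segments[-1].records.append((record_type, payload))

    def rotate(self) -> None:
        """Open a new head segment, evicting the oldest one if the ring is full."""
        new_head = Segment(self.segments[-1].index + 1)
        evicted = None
        if len(self.segments) >= self.max_segments:
            evicted = self.segments.popleft()
            self.dropped.append(evicted.index)
        self.segments.append(new_head)
        if evicted is not None:
            # The drop is recorded in the new head, so it is never silent.
            self.append(
                LIFECYCLE_EVENT,
                {"type": "fdr_segment_dropped", "segment": evicted.index},
            )
```

The important property is ordering: the lifecycle record is written only after the new head exists, so the evidence of the drop cannot be lost with the evicted segment.
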
**Format**: length-prefixed binary stream with `record_header` (magic `0x47464452 "GFDR"` + version + type + monotonic_ms) followed by a per-type body and a CRC32. New record types are additive (data_model.md § 6.5).

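
A framing of this shape could look like the sketch below. The field widths (32-bit length and magic, 16-bit version and type, 64-bit `monotonic_ms`), the big-endian byte order, and the CRC covering header plus body are assumptions for illustration; the authoritative layout is the one in `data_model.md` § 2.8.

```python
import struct
import zlib

MAGIC = 0x47464452  # ASCII "GFDR"
# Assumed layout: magic (u32), version (u16), type (u16), monotonic_ms (u64).
HEADER = struct.Struct(">IHHQ")
LENGTH = struct.Struct(">I")
CRC = struct.Struct(">I")


def encode_record(record_type: int, monotonic_ms: int, body: bytes, version: int = 1) -> bytes:
    """Frame one record as: length prefix, header, body, CRC32."""
    header = HEADER.pack(MAGIC, version, record_type, monotonic_ms)
    crc = zlib.crc32(header + body)
    frame = header + body + CRC.pack(crc)
    return LENGTH.pack(len(frame)) + frame


def decode_record(buf: bytes, offset: int = 0):
    """Return ((record_type, monotonic_ms, body), next_offset) or raise ValueError."""
    (length,) = LENGTH.unpack_from(buf, offset)
    start = offset + LENGTH.size
    frame = buf[start : start + length]
    if len(frame) != length or length < HEADER.size + CRC.size:
        raise ValueError("truncated record")
    payload = frame[: -CRC.size]
    (crc,) = CRC.unpack(frame[-CRC.size :])
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch")
    magic, _version, record_type, monotonic_ms = HEADER.unpack_from(payload)
    if magic != MAGIC:
        raise ValueError("bad magic")
    return (record_type, monotonic_ms, payload[HEADER.size :]), start + length
```

Because the length prefix sits outside the checksummed frame, a reader can skip record types it does not know, which fits the additive-types rule.
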
**Storage path**: `/var/lib/gps-denied/fdr/{flight_id}/segments/seg_NNNNN.bin`. Thumbnails (AC-8.5) live at `/var/lib/gps-denied/fdr/{flight_id}/thumbnails/`. A flight's `manifest.json` (the FDR-side mirror, distinct from the PostgreSQL `manifests` row) sits at the flight's root and carries the flight metadata snapshot.

### 1.2 GCS telemetry (best-effort, bandwidth-limited)

The GCS link is the only outbound channel from the airborne companion (per architecture.md § 7). Bandwidth is bounded (AC-6.1: 1–2 Hz downsampled summary). The companion emits:

| MAVLink message | Rate | Content |
|---|---|---|
| `STATUSTEXT` | event-driven (only when something changes) | Source label transitions; spoofing-promotion / -rejection; VISUAL_BLACKOUT entry / exit; signing key rotation; FDR segment drop; component start / fail; thermal-throttle hybrid switch |
| `NAMED_VALUE_FLOAT` | 1 Hz | `horiz_accuracy_m`, `vert_accuracy_m`, `vio_health` (frame-quality 0..1), `last_anchor_age_s`, `cpu_pct`, `gpu_pct`, `temp_c` |
| `GPS_RAW_INT` | 1–2 Hz (AC-6.1) | Mirror of the AP `GPS_INPUT` we just emitted, downsampled — gives the operator a live position view in QGC |

These are **best-effort** — packet loss on the GCS link is treated as normal. The FDR remains the source of truth.

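
One detail worth keeping in mind for the 1 Hz stream: the MAVLink `NAMED_VALUE_FLOAT` message carries its name in a 10-character field, so longer names such as `horiz_accuracy_m` need a shorter on-wire alias. The sketch below shows one way to build the per-second batch; the alias table and the `build_named_values` helper are hypothetical and not taken from the project.

```python
MAX_NAME_LEN = 10  # size of the NAMED_VALUE_FLOAT name field

# Hypothetical wire aliases; only the left-hand names come from this document.
WIRE_NAMES = {
    "horiz_accuracy_m": "h_acc_m",
    "vert_accuracy_m": "v_acc_m",
    "vio_health": "vio_health",
    "last_anchor_age_s": "anch_age_s",
    "cpu_pct": "cpu_pct",
    "gpu_pct": "gpu_pct",
    "temp_c": "temp_c",
}


def build_named_values(sample: dict, time_boot_ms: int) -> list:
    """Turn one health sample into (time_boot_ms, name, value) message tuples."""
    messages = []
    for name, alias in WIRE_NAMES.items():
        if name not in sample:
            continue  # best-effort: missing values are simply not sent
        if len(alias.encode("ascii")) > MAX_NAME_LEN:
            raise ValueError(f"alias too long for NAMED_VALUE_FLOAT: {alias}")
        messages.append((time_boot_ms, alias, float(sample[name])))
    return messages
```

The tuples mirror the three fields of the message, so they can be handed to whatever MAVLink transport the companion uses.
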
**STATUSTEXT severity mapping**:

| FDR event | STATUSTEXT severity | Example text |
|---|---|---|
| Source label → `dead_reckoned` | `MAV_SEVERITY_WARNING` | `"GPS-DENIED: dead-reckoned (last anchor 12.3s ago)"` |
| VISUAL_BLACKOUT entry | `MAV_SEVERITY_NOTICE` | `"GPS-DENIED: VISUAL_BLACKOUT entered (reason=low_features)"` |
| Spoofing rejected | `MAV_SEVERITY_NOTICE` | `"GPS-DENIED: spoofed FC GPS rejected (last visual consistency PASS 0.4s ago)"` |
| Spoofing promoted (10 s + visual gate passed) | `MAV_SEVERITY_INFO` | `"GPS-DENIED: FDR GPS promoted to fused source"` |
| FDR segment dropped | `MAV_SEVERITY_WARNING` | `"GPS-DENIED: FDR segment 47 dropped (NVM bound)"` |
| Signing key rotation | `MAV_SEVERITY_INFO` | `"GPS-DENIED: MAVLink signing key rotated"` |
| Component fail | `MAV_SEVERITY_CRITICAL` | `"GPS-DENIED: VIO strategy fault — failover to FC IMU-only (AC-5.2)"` |

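
The mapping above can be held in a small lookup table. The sketch below also splits long texts, because a single `STATUSTEXT` message carries at most 50 characters and several of the example texts are longer; MAVLink 2 reassembles chunks through the `id` and `chunk_seq` fields, with `id` 0 meaning a single-chunk message. The event keys and the `to_statustext_chunks` helper are hypothetical names chosen for the example.

```python
from enum import IntEnum


class MavSeverity(IntEnum):
    """MAV_SEVERITY values from the MAVLink common message set."""
    EMERGENCY = 0
    ALERT = 1
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    NOTICE = 5
    INFO = 6
    DEBUG = 7


# Hypothetical event keys; the severities follow the table above.
SEVERITY_BY_EVENT = {
    "source_label_dead_reckoned": MavSeverity.WARNING,
    "visual_blackout_entered": MavSeverity.NOTICE,
    "spoofing_rejected": MavSeverity.NOTICE,
    "spoofing_promoted": MavSeverity.INFO,
    "fdr_segment_dropped": MavSeverity.WARNING,
    "signing_key_rotated": MavSeverity.INFO,
    "component_fail": MavSeverity.CRITICAL,
}

STATUSTEXT_LEN = 50  # size of the STATUSTEXT text field


def to_statustext_chunks(event: str, text: str, msg_id: int) -> list:
    """Return (severity, text_bytes, id, chunk_seq) tuples for one event."""
    severity = int(SEVERITY_BY_EVENT[event])
    data = text.encode("ascii", errors="replace")  # non-ASCII becomes "?"
    parts = [data[i : i + STATUSTEXT_LEN] for i in range(0, len(data), STATUSTEXT_LEN)] or [b""]
    chunk_id = msg_id if len(parts) > 1 else 0  # 0 marks a single-chunk message
    return [(severity, part, chunk_id, seq) for seq, part in enumerate(parts)]
```

For chunked messages the caller is expected to pass a non-zero `msg_id` that is unique per logical message.
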
### 1.3 No console logging in flight

The production deployment binary refuses `LOG_LEVEL=DEBUG` by default (environment_strategy.md § Variable validation). The airborne companion has no operator-readable console, so even ERROR-level logs go to journald + FDR rather than stdout. journald retention is 7 days on a rolling buffer (separate from the FDR's per-flight retention).

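
A start-up check of this kind might look like the following sketch. The `resolve_log_level` function, the `profile` argument, and the `allow_debug` override are hypothetical; the actual rule lives in environment_strategy.md § Variable validation.

```python
import logging


def resolve_log_level(env: dict, profile: str, allow_debug: bool = False) -> int:
    """Validate LOG_LEVEL and refuse sub-INFO levels on the production profile."""
    name = env.get("LOG_LEVEL", "INFO").upper()
    level = logging.getLevelName(name)  # returns an int for known level names
    if not isinstance(level, int):
        raise ValueError(f"unknown LOG_LEVEL: {name}")
    if profile == "production" and level < logging.INFO and not allow_debug:
        raise ValueError(f"LOG_LEVEL={name} is refused on the production profile")
    return level
```

Failing at start-up, rather than silently clamping the level, keeps a misconfigured production image from ever reaching flight.
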
### 1.4 In-flight metrics are NOT scraped

There is no Prometheus endpoint on the production airborne companion. The justification matches § 1.3: nothing in flight could scrape it, so metrics are recorded into the FDR and are visible live only via NAMED_VALUE_FLOAT. CI / dev environments DO expose `/metrics` (see § 3 below).

## 2. Post-flight onboard (operator workstation)

When the operator plugs the companion in post-landing:

1. **FDR retrieval** (operator tooling C12 — feature, not in scope of this document's structure but observability-impacting): operator-orchestrator reads the FDR ring, copies it to the workstation, and seals the in-flight ring. The companion's per-flight ephemeral keys are deleted at this step (environment_strategy.md § Per-flight key lifecycle).
2. **Visualization** (operator tooling C12): the workstation renders:
   - Time-series of `horiz_accuracy`, `vert_accuracy`, `last_anchor_age_ms`, source label timeline, thermal-throttle hybrid switches, and CPU / GPU / temp.
   - Map view: emitted positions vs. (when available) FC `GLOBAL_POSITION_INT` ground truth.
   - Spoofing / VISUAL_BLACKOUT event markers overlaid on the timeline.
   - Per-flight summary: total mid-flight tiles emitted, FDR segment drops (if any), AC-NEW-4 / AC-NEW-7 statistics for this flight.
3. **NFT-RES-03 / NFT-SEC-01 corpus contribution**: if the operator opts in, the flight's emitted positions + FC ground truth are added to the AC-NEW-4 / AC-NEW-7 Monte-Carlo corpus for the next CI run.
4. **Forensic thumbnail review** (AC-8.5 exception): failed-tile thumbnails are visible in the operator UI for human review; this is the only image-data review surface.

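
As an illustration of the per-flight summary in step 2, the counts can be derived from decoded FDR records alone. The sketch assumes records arrive as `(record_type, payload)` pairs and that the segment-drop lifecycle event carries a `type` key; both are assumptions, as is defining the failure rate as failed tiles over attempted tiles.

```python
from collections import Counter

# Record type IDs from the table in § 1.1.
MID_FLIGHT_TILE_EMITTED = 0x0007
MID_FLIGHT_TILE_FAILED = 0x0008
LIFECYCLE_EVENT = 0x000F


def summarize_flight(records) -> dict:
    """Count tile emits, tile failures, and FDR segment drops for one flight."""
    records = list(records)  # allow generators; the data is scanned twice
    counts = Counter(record_type for record_type, _ in records)
    drops = sum(
        1
        for record_type, payload in records
        if record_type == LIFECYCLE_EVENT and payload.get("type") == "fdr_segment_dropped"
    )
    emitted = counts[MID_FLIGHT_TILE_EMITTED]
    failed = counts[MID_FLIGHT_TILE_FAILED]
    attempted = emitted + failed
    return {
        "tiles_emitted": emitted,
        "tiles_failed": failed,
        "tile_failure_rate": failed / attempted if attempted else 0.0,
        "segment_drops": drops,
    }
```
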
## 3. CI / dev environments (Tier-1 / Tier-2)

Tier-1 dev / staging containers DO expose conventional observability surfaces, because they're being driven by humans and CI orchestrators that need them. The airborne profile of § 1 is the **production-only** profile.

### 3.1 Logging (Tier-1 / Tier-2)

Structured JSON to stdout/stderr (consumed by the developer's `docker compose logs` or by CI's log collector):

```json
{
  "timestamp": "2026-05-09T08:42:11.234Z",
  "level": "INFO",
  "service": "gps-denied-companion",
  "component": "C5",
  "flight_id": "<uuid>",
  "monotonic_ms": 12345,
  "message": "Source label transition",
  "context": {
    "from": "satellite_anchored",
    "to": "visual_propagated",
    "reason": "vpr_no_match"
  }
}
```

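
A record of this shape can be produced with the standard `logging` module alone. The formatter below is a minimal sketch; the `JsonFormatter` class and the convention of passing `component`, `flight_id`, `monotonic_ms`, and `context` through `extra=` are assumptions, not the project's logging setup.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            # Fields supplied by the caller via `extra=`; absent ones become null.
            "component": getattr(record, "component", None),
            "flight_id": getattr(record, "flight_id", None),
            "monotonic_ms": getattr(record, "monotonic_ms", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)
```

A caller would attach the formatter to a stream handler and log with, for example, `logger.info("Source label transition", extra={"component": "C5", "context": {"from": "satellite_anchored", "to": "visual_propagated"}})`. Python names the level `WARNING`, so a mapping step is needed if the output must say `WARN`.
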
Log levels:

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions; component fault that triggered AC-5.2 fallback | "VIO strategy initialization failed: GTSAM dlopen failed" |
| WARN | Degraded behavior; FDR segment drop; thermal-throttle hybrid switch | "Thermal throttle active; downgrading K=3 → K=2" |
| INFO | Significant lifecycle events; source label transition | "Source label: satellite_anchored → visual_propagated" |
| DEBUG | Per-frame diagnostic — Tier-1 / dev only; production refuses this level (environment_strategy.md § Variable validation) | "MatchResult: 47 inliers, residual=2.3px" |

**PII / safety-sensitive content**: no GPS coordinates in DEBUG / INFO logs by default. Only `horiz_accuracy` (a scalar) is INFO-loggable; the actual lat/lon is FDR-only. WARN / ERROR log records may include lat/lon when the operator's troubleshooting requires it; in that case the FDR still has the canonical record.

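
One way to enforce this rule centrally is a `logging.Filter` that strips coordinate fields from the structured context of records below WARNING. This is a sketch under assumptions: the key names in `COORD_KEYS` and the idea that coordinates only ever appear in a `context` dict are hypothetical.

```python
import logging

# Assumed key names under which coordinates might appear in the log context.
COORD_KEYS = {"lat", "lon", "latitude", "longitude"}


class CoordinateRedactionFilter(logging.Filter):
    """Drop lat/lon fields from the context of DEBUG and INFO records."""

    def filter(self, record: logging.LogRecord) -> bool:
        context = getattr(record, "context", None)
        if isinstance(context, dict) and record.levelno < logging.WARNING:
            record.context = {k: v for k, v in context.items() if k not in COORD_KEYS}
        return True  # never suppress the record itself, only redact it
```

Attached to the handler, the filter applies to every logger that writes through it, so individual call sites do not need to remember the rule. It does not catch coordinates interpolated into the message string.
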
Log retention:

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| `dev-tier1` | Docker stdout | Container lifetime |
| `dev-tier2` | journald (Jetson) | 7 days |
| `staging-tier1` (CI) | GitHub Actions log artifact | 30 days (matches CI artifact retention) |
| `staging-tier2` (Jetson CI) | Self-hosted runner journald + uploaded report | 30 days |
| `production` | journald (Jetson) | 7 days, see § 1.3 |

### 3.2 Metrics (Tier-1 / Tier-2)

Prometheus-compatible `/metrics` endpoint on `dev-tier1`, `staging-tier1`, `staging-tier2`. **Disabled on `production`** (no listener on the airborne companion, NFT-SEC-05).

Application metrics:

| Metric | Type | Description |
|--------|------|-------------|
| `gps_denied_frame_processed_total` | Counter | Total nav frames processed (per `GPS_DENIED_VIO_STRATEGY` label) |
| `gps_denied_frame_emit_latency_seconds` | Histogram | End-to-end frame → emit latency (the AC-4.1 metric) |
| `gps_denied_source_label_total` | Counter | One series per source label: `satellite_anchored`, `visual_propagated`, `dead_reckoned` |
| `gps_denied_vpr_match_rate` | Gauge | Rolling-1-minute rate of successful VPR matches |
| `gps_denied_thermal_hybrid_active` | Gauge | 0/1 — is the K=2 thermal-throttle hybrid active? (D-CROSS-LATENCY-1) |
| `gps_denied_fdr_segment_drops_total` | Counter | Total FDR segment drops this run (AC-NEW-3 audit) |
| `gps_denied_fdr_size_bytes` | Gauge | Current FDR ring size in bytes (must stay ≤ 64 GB) |
| `gps_denied_signing_key_rotations_total` | Counter | MAVLink signing key rotation count |

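
For orientation, this is roughly what a scraper sees for two of these metrics. The sketch renders the Prometheus text exposition format with the standard library only; a real service would normally use a client library instead. The `render_metric` helper and the `label` label name are hypothetical, and label-value escaping is omitted for brevity.

```python
def render_metric(name: str, metric_type: str, help_text: str, samples: list) -> str:
    """Render one metric family; `samples` is a list of (labels_dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ""
        if labels:
            inner = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            label_str = "{" + inner + "}"
        lines.append(f"{name}{label_str} {value}")
    return "\n".join(lines) + "\n"


body = render_metric(
    "gps_denied_source_label_total",
    "counter",
    "Frames emitted per source label",
    [
        ({"label": "satellite_anchored"}, 120),
        ({"label": "visual_propagated"}, 30),
        ({"label": "dead_reckoned"}, 2),
    ],
) + render_metric(
    "gps_denied_fdr_segment_drops_total",
    "counter",
    "Total FDR segment drops this run",
    [({}, 0)],
)
```
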
System metrics: standard `process_*`, `python_*` exporters; on Tier-2 also `jetson_stats_*`, exposed via a `jtop` exporter.

Business metrics (i.e., AC-derived):

| Metric | AC | Use |
|--------|------|-------------|
| `gps_denied_horiz_accuracy_m` (gauge, last value) | AC-NEW-4 | Operator workstation post-flight dashboard; CI threshold checks |
| `gps_denied_cold_start_seconds` | AC-NEW-1 | Set once at takeoff load completion; NFT-PERF-03 reads it |
| `gps_denied_spoofing_promotion_latency_seconds` | AC-NEW-2 | Set on each promotion / rejection event; NFT-PERF-04 reads it |

Collection interval: 15 s (typical Prometheus default; Tier-2 NFT runs may use 1 s for AC-bound timing).

### 3.3 Distributed tracing — NOT applicable

The runtime is a single in-process Python program with no cross-service hops in flight (architecture.md § 5 internal communication is all in-process). Distributed tracing is therefore not applicable to the production runtime.

The Tier-1 integration setup DOES involve cross-container hops (companion ↔ mock-sat ↔ db ↔ e2e-runner), but those are exercised by the e2e test framework's own log + status capture; OpenTelemetry is not provisioned for this project. If a future cycle introduces a multi-process companion (which ADR-004 explicitly rejected for the airborne profile but might appear on the operator workstation for C11 Tile Manager + C12 Operator Pre-flight Orchestrator), tracing can be reconsidered then.

## 4. Alerting (post-flight, not in-flight)

There is no live in-flight alerting from the airborne companion. The operator's **GCS** is the live human-loop interface (the STATUSTEXT severity stream, § 1.2). All other alerting is **post-flight**:

| Source | Severity | Response Time | Conditions |
|----------|---------------|-----------|----------|
| FDR review (operator workstation) | Critical | Same-day human review | FDR segment drop count > 0; component fail event; spoofing-promotion latency > 3 s; AC-NEW-4 outliers (P(err > 1 km) > 0.01 % in this flight's window) |
| FDR review | High | Next-day | AC-NEW-1 cold-start TTFF > 30 s p95 in this flight's window; thermal-throttle hybrid active > 25 % of the flight |
| FDR review | Medium | Within 1 week | Mid-flight tile failure rate > 5 %; high VPR no-match rate; sustained `dead_reckoned` periods > 10 s |
| CI (Tier-2) | Critical | Block PR merge | Any AC-bound NFT failure (architecture.md § 6 NFR list) |
| CI (Tier-1) | Critical | Block PR merge | Build failure; security CVE; SBOM diff fail (ADR-002) |

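
The FDR-review rows translate naturally into a small classification step over per-flight statistics. The sketch below covers only the conditions that have a numeric threshold in the table; the AC-NEW-4 outlier test, the cold-start p95 check, and the VPR no-match rate are left out. The `FlightStats` fields and the `classify_flight` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlightStats:
    segment_drops: int
    component_fails: int
    max_spoofing_promotion_latency_s: float
    thermal_hybrid_fraction: float  # share of the flight spent in K=2, 0..1
    tile_failure_rate: float        # 0..1
    longest_dead_reckoned_s: float


def classify_flight(stats: FlightStats) -> Optional[str]:
    """Return the highest alert severity triggered by this flight, if any."""
    if (
        stats.segment_drops > 0
        or stats.component_fails > 0
        or stats.max_spoofing_promotion_latency_s > 3.0
    ):
        return "critical"
    if stats.thermal_hybrid_fraction > 0.25:
        return "high"
    if stats.tile_failure_rate > 0.05 or stats.longest_dead_reckoned_s > 10.0:
        return "medium"
    return None
```

Checking the severities from highest to lowest means a flight with both a segment drop and a high tile failure rate is reported once, as critical.
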
Notification channels:

| Severity | Channel |
|----------|---------|
| Critical (FDR or CI) | Slack `#gps-denied-ops` + email |
| High | Slack `#gps-denied-ops` |
| Medium | Slack `#gps-denied-ops` (digest) |

There is no PagerDuty / on-call rotation for this project; in-flight failures are handled by the FC's IMU-only fallback (AC-5.2), not by an operations team.

## 5. Dashboards

### 5.1 Operator workstation post-flight dashboard

Built into operator-orchestrator C12. Per flight:

- Time series: source label, `horiz_accuracy`, `last_anchor_age_ms`, CPU%, GPU%, temp.
- Event markers: VISUAL_BLACKOUT entries, spoofing events, signing key rotations, thermal hybrid switches.
- Map: emitted track + FC ground truth (when available) + pre-flight cache footprint + mid-flight tile coverage.
- Statistics: per-flight error CDF; AC-NEW-4 contribution; mid-flight tile counts.
- FDR audit table: any `0x000F` lifecycle events of severity ≥ WARN.

### 5.2 CI dashboard (Tier-2)

GitHub Actions job summary plus a per-NFT report uploaded as a workflow artifact. The summary includes:

- Pass / fail per NFT scenario.
- For NFT-PERF-*: histogram of latencies + comparison to threshold.
- For NFT-LIM-*: peak memory / FDR size traces.
- For NFT-RES-*: AC-NEW-4 / AC-NEW-7 statistical summary with stated 95 % CI.
- For IT-12: comparative-study summary across all VIO / VPR strategies in the research binary.

There is no live CI dashboard separate from the GitHub Actions UI; the project is small enough that the per-PR job summary is sufficient.

### 5.3 No live in-flight dashboard

Out of scope by design. The GCS is the only live operator surface; all other inspection is post-flight.

## 6. Open Items / Plan-Phase Carryforward

- **Long-term FDR archive** (multi-flight statistical headroom): D-PROJ-3 (multi-flight fixture acquisition for AC-NEW-4 / AC-NEW-7) is not pursued this cycle. If pursued in a future cycle, post-flight FDR archives become a corpus contribution path; the operator-orchestrator FDR-retrieval step would need an explicit "contribute to corpus" toggle.
- **Telemetry-link encryption** beyond MAVLink-2.0 signing: out of scope; addressed by physical link assumptions in the threat model (architecture.md § 7).
- **iNav signing**: still has no equivalent to MAVLink-2.0 signing (Mode B Source #129). Carryforward Plan-phase action: file a feature request upstream; meanwhile observability for iNav-profile flights is the same as AP-profile minus the `MavlinkSigningKeyRotated` records (which are NULL on iNav flights per data_model.md § 2.2).