# GPS-Denied Onboard — Observability

> Date: 2026-05-09 (Plan Phase 2c — initial draft).
> Inputs: `_docs/02_document/architecture.md` § 7 (Audit logging) + § 6 (NFRs); `_docs/02_document/data_model.md` § 2.8 (FDR); ADR-005 (Tier-1 / Tier-2); AC-NEW-3 (FDR ≤ 64 GB / no silent drops); AC-NEW-5 (operating envelope).

## Observability is asymmetric by design

Most CI/CD templates assume a network-connected service that pushes structured logs to an aggregator and exposes Prometheus metrics for live scraping. **This project's airborne profile does not fit that model.** Architecture.md ADR-004 + § 7 + Principle #4 require **no inbound network listening and no outbound network egress in flight** (enforced by NFT-SEC-05). The Jetson operates as an embedded edge device, not a service.

Observability therefore splits into three regimes:

| Regime | Where | Live or post-flight | Primary mechanism |
|---|---|---|---|
| **In-flight onboard** | Production Jetson, in flight | Live (to FDR ring) + best-effort live (to GCS) | FDR binary record stream + GCS STATUSTEXT / NAMED_VALUE_FLOAT |
| **Post-flight onboard** | Operator workstation after pulling the FDR | Post-flight | FDR replay + visualization in operator-tooling C12 |
| **CI / dev (Tier-1, Tier-2)** | Workstation Docker / Jetson CI runner | Live | Standard structured logging + Prometheus metrics endpoint where applicable |

The sections below are organized by regime.

## 1. In-flight onboard (production Jetson)

### 1.1 FDR (Flight Data Recorder) — primary observability sink

Schema is in `data_model.md` § 2.8. Every observable event in flight goes through FDR. The FDR is **append-only**, **lossy on overrun (logged, never silent)**, and **per-flight ring-bounded at ≤ 64 GB** (AC-NEW-3).

Observability events that emit FDR records:

| Component | Event | FDR record type |
|---|---|---|
| C8 outbound | Every emitted `EmittedExternalPosition` to FC | `0x0001 EmittedExternalPosition` |
| C8 inbound | Every received MAVLink frame (raw `tlog`-style) | `0x0003 ReceivedMavlinkRaw` |
| C8 inbound (iNav) | Every received MSP2 frame | `0x0004 ReceivedMsp2Raw` |
| C8 inbound | IMU window forwarded to C1 / C5 | `0x0002 ImuTrace` |
| C5 | Source-label transition (`satellite_anchored` ↔ `visual_propagated` ↔ `dead_reckoned`) | `0x0006 SourceLabelTransition` |
| C5 + C8 | Spoofing-promotion / -rejection event | `0x000C SpoofingPromotionEvent` |
| C5 | VISUAL_BLACKOUT entry / exit (AC-3.5, AC-NEW-8) | `0x000B VisualBlackoutEvent` |
| C6 | Mid-flight tile emit | `0x0007 MidFlightTileEmitted` |
| C6 | Mid-flight tile failure (with thumbnail filename, AC-8.5 forensic exception) | `0x0008 MidFlightTileFailed` |
| C7 (inference) | Thermal-throttle hybrid switch K=3 ↔ K=2 | `0x000E ThermalThrottleHybridSwitch` |
| C8 | MAVLink-2.0 signing key rotation event (D-C8-9) | `0x0009 MavlinkSigningKeyRotated` |
| C8 | EKF source-set switch event (D-C8-2 = (b)) | `0x000A EkfSourceSetCommand` |
| C10 | Pre-flight content-hash gate fail | `0x000D ContentHashGateFail` |
| All components | Lifecycle events (start / stop / fail) | `0x000F ComponentLifecycleEvent` |
| `jetson-stats` collector (driven by C7 or a dedicated thread) | Per-second sample of CPU%, GPU%, temp, throttle flag, RAM, VRAM, NVM remaining | `0x0005 SystemHealth` |

**Lossy-on-overrun rule (AC-NEW-3 enforcement)**: if the FDR writer cannot keep up (NVM I/O bound), the writer drops the **oldest segment** in the current flight's ring AND emits a `0x000F ComponentLifecycleEvent` of type `fdr_segment_dropped` to the new head segment. A segment drop is a hard observability signal — it appears in the post-flight report and in the GCS STATUSTEXT stream. There is no path that silently discards an event.

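
The lossy-on-overrun rule can be illustrated with a toy ring. This is a minimal sketch under stated assumptions, not the project's writer: `SegmentRing`, `max_segments`, and the dict-shaped record are hypothetical, and a simple segment-count cap stands in for the real "writer cannot keep up" condition. The point it shows is that the drop notice is written into the new head segment in the same step that evicts the oldest one, so a drop cannot happen without a record of it.

```python
from collections import deque


class SegmentRing:
    """Toy per-flight ring: evicts the oldest segment and records the drop."""

    def __init__(self, max_segments):
        self.max_segments = max_segments   # hypothetical cap, stands in for back-pressure
        self.segments = deque()            # entries are (segment_index, records)
        self.next_index = 0
        self.drops = 0

    def open_segment(self):
        """Open a new head segment, evicting the oldest one if the ring is full."""
        dropped = None
        if len(self.segments) >= self.max_segments:
            dropped, _ = self.segments.popleft()
            self.drops += 1
        head = []
        self.segments.append((self.next_index, head))
        self.next_index += 1
        if dropped is not None:
            # The notice is the first record of the new head segment.
            head.append({"type": 0x000F, "event": "fdr_segment_dropped", "segment": dropped})
        return head
```
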
**Format**: length-prefixed binary stream with `record_header` (magic `0x47464452 "GFDR"` + version + type + monotonic_ms) followed by a per-type body and a CRC32. New record types are additive (data_model.md § 6.5).

**Storage path**: `/var/lib/gps-denied/fdr/{flight_id}/segments/seg_NNNNN.bin`. Thumbnails (AC-8.5) live at `/var/lib/gps-denied/fdr/{flight_id}/thumbnails/`. A flight's `manifest.json` (the FDR-side mirror, distinct from the PostgreSQL `manifests` row) sits at the flight's root and carries the flight metadata snapshot.

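
The record framing described under **Format** above could be sketched as follows. The field widths (32-bit length and magic, 16-bit version and type, 64-bit `monotonic_ms`), the big-endian byte order, and the CRC32 coverage (header plus body) are all assumptions made for illustration; the authoritative layout is in `data_model.md` § 2.8. `encode_record` and `decode_record` are hypothetical names.

```python
import struct
import zlib

MAGIC = 0x47464452                      # spells "GFDR" when packed big-endian
HEADER = struct.Struct(">IHHQ")         # magic, version, type, monotonic_ms (widths assumed)
LENGTH = struct.Struct(">I")            # length prefix (width assumed)
CRC = struct.Struct(">I")


def encode_record(rec_type, monotonic_ms, body, version=1):
    """Frame one record as: length | header | body | crc32(header + body)."""
    payload = HEADER.pack(MAGIC, version, rec_type, monotonic_ms) + body
    framed = payload + CRC.pack(zlib.crc32(payload))
    return LENGTH.pack(len(framed)) + framed


def decode_record(buf, offset=0):
    """Return ((type, monotonic_ms, body), next_offset) or raise ValueError."""
    (length,) = LENGTH.unpack_from(buf, offset)
    start = offset + LENGTH.size
    framed = buf[start:start + length]
    if len(framed) != length:
        raise ValueError("truncated record")
    payload = framed[:-CRC.size]
    (crc,) = CRC.unpack(framed[-CRC.size:])
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch")
    magic, _version, rec_type, monotonic_ms = HEADER.unpack_from(payload)
    if magic != MAGIC:
        raise ValueError("bad magic")
    return (rec_type, monotonic_ms, payload[HEADER.size:]), start + length
```

Because each record carries its own length, a reader can walk a segment by feeding the returned offset back into `decode_record`, and an unknown record type can be skipped without understanding its body, which is consistent with new record types being additive.
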
### 1.2 GCS telemetry (best-effort, bandwidth-limited)

The GCS link is the only outbound channel from the airborne companion (per architecture.md § 7). Bandwidth is bounded (AC-6.1: 1–2 Hz downsampled summary). The companion emits:

| MAVLink message | Rate | Content |
|---|---|---|
| `STATUSTEXT` | event-driven (only when something changes) | Source label transitions; spoofing-promotion / -rejection; VISUAL_BLACKOUT entry / exit; signing key rotation; FDR segment drop; component start / fail; thermal-throttle hybrid switch |
| `NAMED_VALUE_FLOAT` | 1 Hz | `horiz_accuracy_m`, `vert_accuracy_m`, `vio_health` (frame-quality 0..1), `last_anchor_age_s`, `cpu_pct`, `gpu_pct`, `temp_c` |
| `GPS_RAW_INT` | 1–2 Hz (AC-6.1) | Mirror of the AP `GPS_INPUT` we just emitted, downsampled — gives the operator a live position view in QGC |

These are **best-effort** — packet loss on the GCS link is treated as normal. The FDR remains the source of truth.

**STATUSTEXT severity mapping**:

| FDR event | STATUSTEXT severity | Example text |
|---|---|---|
| Source label → `dead_reckoned` | `MAV_SEVERITY_WARNING` | `"GPS-DENIED: dead-reckoned (last anchor 12.3s ago)"` |
| VISUAL_BLACKOUT entry | `MAV_SEVERITY_NOTICE` | `"GPS-DENIED: VISUAL_BLACKOUT entered (reason=low_features)"` |
| Spoofing rejected | `MAV_SEVERITY_NOTICE` | `"GPS-DENIED: spoofed FC GPS rejected (last visual consistency PASS 0.4s ago)"` |
| Spoofing promoted (10 s + visual gate passed) | `MAV_SEVERITY_INFO` | `"GPS-DENIED: FC GPS promoted to fused source"` |
| FDR segment dropped | `MAV_SEVERITY_WARNING` | `"GPS-DENIED: FDR segment 47 dropped (NVM bound)"` |
| Signing key rotation | `MAV_SEVERITY_INFO` | `"GPS-DENIED: MAVLink signing key rotated"` |
| Component fail | `MAV_SEVERITY_CRITICAL` | `"GPS-DENIED: VIO strategy fault — failover to FC IMU-only (AC-5.2)"` |

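
Two MAVLink wire limits bear on the tables above: `STATUSTEXT.text` is a 50-byte field (MAVLink 2 adds the `id` and `chunk_seq` extension fields so longer text can be sent as several chunks), and `NAMED_VALUE_FLOAT.name` is a 10-byte field. Several of the example strings and value names in this section are longer than that. The sketch below only illustrates the arithmetic. `statustext_chunks` and `oversized_names` are hypothetical helpers, no MAVLink library is called, and how the companion actually chunks or abbreviates these values is not specified in this document.

```python
STATUSTEXT_TEXT_LEN = 50     # size of STATUSTEXT.text in bytes
NAMED_VALUE_NAME_LEN = 10    # size of NAMED_VALUE_FLOAT.name in bytes

# Subset of the MAV_SEVERITY enum used in the severity mapping table.
MAV_SEVERITY = {"CRITICAL": 2, "WARNING": 4, "NOTICE": 5, "INFO": 6}


def statustext_chunks(severity, text, msg_id):
    """Split text into (severity, chunk_bytes, id, chunk_seq) tuples.

    Text that fits in one message is sent unchunked with id 0. Longer text
    shares one non-zero id; the final chunk is the first one shorter than
    the field, so a receiver sees a terminator there.
    """
    sev = MAV_SEVERITY[severity]
    data = text.encode("ascii", errors="replace")
    if len(data) <= STATUSTEXT_TEXT_LEN:
        return [(sev, data, 0, 0)]
    chunks = []
    seq = 0
    while True:
        part = data[seq * STATUSTEXT_TEXT_LEN:(seq + 1) * STATUSTEXT_TEXT_LEN]
        chunks.append((sev, part, msg_id, seq))
        if len(part) < STATUSTEXT_TEXT_LEN:
            return chunks
        seq += 1


def oversized_names(names):
    """Return the names that do not fit the NAMED_VALUE_FLOAT name field."""
    return [n for n in names if len(n.encode("ascii")) > NAMED_VALUE_NAME_LEN]
```

Running `oversized_names` over the value names listed for `NAMED_VALUE_FLOAT` flags `horiz_accuracy_m`, `vert_accuracy_m`, and `last_anchor_age_s`, so the on-wire names would need to be shorter than the names used in this document.
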
### 1.3 No console logging in flight

The production deployment binary refuses `LOG_LEVEL=DEBUG` by default (environment_strategy.md § Variable validation). The airborne companion has no operator-readable console: even ERROR-level logs go to journald and the FDR rather than stdout. journald retention is 7 days on a rolling buffer, separate from the FDR's per-flight retention.

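
A startup check of this kind might look like the sketch below. Only `LOG_LEVEL` comes from this document; the `GPS_DENIED_ENV` variable name, the default values, and the `resolve_log_level` helper are hypothetical, and the actual rules live in environment_strategy.md § Variable validation.

```python
import os

ALLOWED_LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"}


def resolve_log_level(env=None):
    """Validate LOG_LEVEL against the active profile and return it."""
    env = os.environ if env is None else env
    profile = env.get("GPS_DENIED_ENV", "production")   # hypothetical variable name
    level = env.get("LOG_LEVEL", "INFO").upper()
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unknown LOG_LEVEL: {level!r}")
    if profile == "production" and level == "DEBUG":
        # Fail at startup rather than silently downgrading the level.
        raise ValueError("LOG_LEVEL=DEBUG is refused on the production profile")
    return level
```

Defaulting the profile to `production` when the variable is unset is a deliberate choice in this sketch: a misconfigured unit then gets the strictest behaviour rather than the most verbose one.
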
### 1.4 In-flight metrics are NOT scraped

There is no Prometheus endpoint on the production airborne companion. The justification matches § 1.3: nothing is available in flight to scrape it, so metrics are recorded into the FDR and surfaced live only via `NAMED_VALUE_FLOAT`. CI / dev environments DO expose `/metrics` (see § 3.2 below).

## 2. Post-flight onboard (operator workstation)

After landing, when the operator connects the companion to the workstation:

1. **FDR retrieval** (operator tooling C12; a feature outside this document's scope, but one that affects observability): operator-tooling reads the FDR ring, copies it to the workstation, and seals the in-flight ring. The companion's per-flight ephemeral keys are deleted at this step (environment_strategy.md § Per-flight key lifecycle).
2. **Visualization** (operator tooling C12): the workstation renders:
   - Time-series of `horiz_accuracy`, `vert_accuracy`, `last_anchor_age_ms`, source label timeline, thermal-throttle hybrid switches, and CPU / GPU / temp.
   - Map view: emitted positions vs. (when available) FC `GLOBAL_POSITION_INT` ground truth.
   - Spoofing / VISUAL_BLACKOUT event markers overlaid on the timeline.
   - Per-flight summary: total mid-flight tiles emitted, FDR segment drops (if any), AC-NEW-4 / AC-NEW-7 statistics for this flight.
3. **NFT-RES-03 / NFT-SEC-01 corpus contribution**: if the operator opts in, the flight's emitted positions + FC ground truth are added to the AC-NEW-4 / AC-NEW-7 Monte-Carlo corpus for the next CI run.
4. **Forensic thumbnail review** (AC-8.5 exception): failed-tile thumbnails are visible in the operator UI for human review; this is the only image-data review surface.

## 3. CI / dev environments (Tier-1 / Tier-2)

Tier-1 dev / staging containers DO expose conventional observability surfaces, because they're being driven by humans and CI orchestrators that need them. The airborne profile of § 1 is the **production-only** profile.

### 3.1 Logging (Tier-1 / Tier-2)

Structured JSON to stdout/stderr (consumed by the developer's `docker compose logs` or by CI's log collector):

```json
{
  "timestamp": "2026-05-09T08:42:11.234Z",
  "level": "INFO",
  "service": "gps-denied-companion",
  "component": "C5",
  "flight_id": "<uuid>",
  "monotonic_ms": 12345,
  "message": "Source label transition",
  "context": {
    "from": "satellite_anchored",
    "to": "visual_propagated",
    "reason": "vpr_no_match"
  }
}
```

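
One way to produce records of this shape with the standard `logging` module is a custom formatter. This is a sketch, not the project's logger: the `JsonFormatter` class is hypothetical, `component` and `context` are assumed to arrive via `extra={...}` on the logging call (for example `log.info("Source label transition", extra={"component": "C5", "context": {...}})`), and `monotonic_ms` is taken at format time rather than at event time.

```python
import json
import logging
import time
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render a LogRecord as one JSON object per line."""

    def __init__(self, service, flight_id):
        super().__init__()
        self.service = service
        self.flight_id = flight_id

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        # Python names the level "WARNING"; the level table here uses "WARN".
        level = "WARN" if record.levelname == "WARNING" else record.levelname
        doc = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": level,
            "service": self.service,
            "component": getattr(record, "component", record.name),
            "flight_id": self.flight_id,
            "monotonic_ms": int(time.monotonic() * 1000),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(doc)
```
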
Log levels:

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions; component fault that triggered AC-5.2 fallback | "VIO strategy initialization failed: GTSAM dlopen failed" |
| WARN | Degraded behavior; FDR segment drop; thermal-throttle hybrid switch | "Thermal throttle active; downgrading K=3 → K=2" |
| INFO | Significant lifecycle events; source label transition | "Source label: satellite_anchored → visual_propagated" |
| DEBUG | Per-frame diagnostic — Tier-1 / dev only; production refuses this level (environment_strategy.md § Variable validation) | "MatchResult: 47 inliers, residual=2.3px" |

**PII / safety-sensitive content**: no GPS coordinates in DEBUG / INFO logs by default. Only `horiz_accuracy` (a scalar) is INFO-loggable; the actual lat/lon is FDR-only. WARN / ERROR log records may include lat/lon when the operator's troubleshooting requires it; in that case the FDR still has the canonical record.

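
The coordinate rule above could be enforced centrally with a `logging.Filter` instead of relying on every call site. The sketch assumes coordinates travel in a `context` dict attached to the record and that the keys are named `lat` and `lon`; both are illustrative assumptions, as is the `CoordinateFilter` name.

```python
import logging

COORD_KEYS = {"lat", "lon"}   # hypothetical context key names


class CoordinateFilter(logging.Filter):
    """Strip lat/lon from the context of records below WARNING."""

    def filter(self, record):
        context = getattr(record, "context", None)
        if isinstance(context, dict) and record.levelno < logging.WARNING:
            record.context = {k: v for k, v in context.items() if k not in COORD_KEYS}
        return True   # never suppress the record itself, only redact it
```
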
Log retention:

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| `dev-tier1` | Docker stdout | Container lifetime |
| `dev-tier2` | journald (Jetson) | 7 days |
| `staging-tier1` (CI) | GitHub Actions log artifact | 30 days (matches CI artifact retention) |
| `staging-tier2` (Jetson CI) | Self-hosted runner journald + uploaded report | 30 days |
| `production` | journald (Jetson) | 7 days, see § 1.3 |

### 3.2 Metrics (Tier-1 / Tier-2)

Prometheus-compatible `/metrics` endpoint on `dev-tier1`, `staging-tier1`, `staging-tier2`. **Disabled on `production`** (no listener on the airborne companion, NFT-SEC-05).

Application metrics:

| Metric | Type | Description |
|--------|------|-------------|
| `gps_denied_frame_processed_total` | Counter | Total nav frames processed (per `GPS_DENIED_VIO_STRATEGY` label) |
| `gps_denied_frame_emit_latency_seconds` | Histogram | End-to-end frame → emit latency (the AC-4.1 metric) |
| `gps_denied_source_label_total` | Counter | Counter per `satellite_anchored` / `visual_propagated` / `dead_reckoned` label |
| `gps_denied_vpr_match_rate` | Gauge | Rolling-1-minute rate of successful VPR matches |
| `gps_denied_thermal_hybrid_active` | Gauge | 0/1 — is the K=2 thermal-throttle hybrid active? (D-CROSS-LATENCY-1) |
| `gps_denied_fdr_segment_drops_total` | Counter | Total FDR segment drops this run (AC-NEW-3 audit) |
| `gps_denied_fdr_size_bytes` | Gauge | Current FDR ring size in bytes (must stay ≤ 64 GB) |
| `gps_denied_signing_key_rotations_total` | Counter | MAVLink signing key rotation count |

System metrics: standard `process_*`, `python_*` exporters; on Tier-2 also `jetson_stats_*` exposed via `jtop` exporter.

Business metrics (i.e., AC-derived):

| Metric | AC | Use |
|--------|------|-------------|
| `gps_denied_horiz_accuracy_m` (gauge, last value) | AC-NEW-4 | Live operator dashboard on operator workstation post-flight; CI threshold checks |
| `gps_denied_cold_start_seconds` | AC-NEW-1 | Set once at takeoff load completion; NFT-PERF-03 reads it |
| `gps_denied_spoofing_promotion_latency_seconds` | AC-NEW-2 | Set on each promotion / rejection event; NFT-PERF-04 reads it |

Collection interval: 15 s (a common Prometheus scrape interval; Tier-2 NFT runs may use 1 s for AC-bound timing).

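
For orientation, this is roughly what a scrape of `/metrics` returns for one of the counters above, rendered in the Prometheus text exposition format. The sketch uses only the standard library to show the wire format; a real endpoint would more likely use a Prometheus client library. `render_metric` is a hypothetical helper, the `source_label` label key is an assumption, and label-value escaping is omitted.

```python
def render_metric(name, kind, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is an iterable of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {kind}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def render_source_labels(counts):
    """Render gps_denied_source_label_total from a {label: count} mapping."""
    samples = [({"source_label": label}, n) for label, n in sorted(counts.items())]
    return render_metric(
        "gps_denied_source_label_total",
        "counter",
        "Frames emitted per source label",
        samples,
    )
```

Calling `render_source_labels({"dead_reckoned": 3, "satellite_anchored": 120})` yields a `# HELP` line, a `# TYPE ... counter` line, and one sample line per label value.
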
### 3.3 Distributed tracing — NOT applicable

The runtime is a single in-process Python program with no cross-service hops in flight (architecture.md § 5 internal communication is all in-process). Distributed tracing is therefore not applicable to the production runtime.

The Tier-1 integration setup DOES involve cross-container hops (companion ↔ mock-sat ↔ db ↔ e2e-runner), but those are exercised by the e2e test framework's own log + status capture; OpenTelemetry is not provisioned for this project. If a future cycle introduces a multi-process companion (which ADR-004 explicitly rejected for the airborne profile but might appear on the operator workstation for C11 Tile Manager + C12 Operator Pre-flight Tooling), tracing can be reconsidered then.

## 4. Alerting (post-flight, not in-flight)

There is no live in-flight alerting from the airborne companion. The operator's **GCS** is the live human-loop interface (STATUSTEXT severity stream § 1.2). All other alerting is **post-flight**:

| Source | Severity | Response Time | Conditions |
|----------|---------------|-----------|----------|
| FDR review (operator workstation) | Critical | Same-day human review | FDR segment drop count > 0; component fail event; spoofing-promotion latency > 3 s; AC-NEW-4 outliers (P(err > 1 km) > 0.01 % in this flight's window) |
| FDR review | High | Next-day | AC-NEW-1 cold-start TTFF > 30 s p95 in this flight's window; thermal-throttle hybrid active > 25 % of the flight |
| FDR review | Medium | Within 1 week | Mid-flight tile failure rate > 5 %; high VPR no-match rate; sustained `dead_reckoned` periods > 10 s |
| CI (Tier-2) | Critical | Block PR merge | Any AC-bound NFT failure (architecture.md § 6 NFR list) |
| CI (Tier-1) | Critical | Block PR merge | Build failure; security CVE; SBOM diff fail (ADR-002) |

Notification channels:

| Severity | Channel |
|----------|---------|
| Critical (FDR or CI) | Slack `#gps-denied-ops` + email |
| High | Slack `#gps-denied-ops` |
| Medium | Slack `#gps-denied-ops` (digest) |

There is no PagerDuty / on-call rotation for this project; in-flight failures are handled by the FC's IMU-only fallback (AC-5.2), not by an operations team.

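
The FDR-review rows of the alert table reduce to a small set of threshold checks. The sketch below covers only the conditions that have a single numeric threshold in the table; the AC-NEW-4 outlier probability, the cold-start p95, and the VPR no-match rate are left out because they need statistics that this document does not define. `FlightSummary` and its field names are a hypothetical shape for a per-flight summary, not the project's schema.

```python
from dataclasses import dataclass


@dataclass
class FlightSummary:
    """Hypothetical per-flight figures derived from an FDR replay."""
    segment_drops: int = 0
    component_failures: int = 0
    spoofing_promotion_latency_s: float = 0.0
    thermal_hybrid_fraction: float = 0.0     # share of flight time, 0..1
    tile_failure_rate: float = 0.0           # share of mid-flight tiles, 0..1
    longest_dead_reckoned_s: float = 0.0


def classify(summary):
    """Return the highest alert severity the flight triggers, or None."""
    if (summary.segment_drops > 0
            or summary.component_failures > 0
            or summary.spoofing_promotion_latency_s > 3.0):
        return "Critical"
    if summary.thermal_hybrid_fraction > 0.25:
        return "High"
    if summary.tile_failure_rate > 0.05 or summary.longest_dead_reckoned_s > 10.0:
        return "Medium"
    return None
```

Checking severities from highest to lowest means a flight with both a segment drop and a long `dead_reckoned` period is reported once, as Critical, which matches the single-channel-per-severity routing in the notification table.
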
## 5. Dashboards

### 5.1 Operator workstation post-flight dashboard

Built into operator-tooling C12. Per flight:

- Time series: source label, `horiz_accuracy`, `last_anchor_age_ms`, CPU%, GPU%, temp.
- Event markers: VISUAL_BLACKOUT entries, spoofing events, signing key rotations, thermal hybrid switches.
- Map: emitted track + FC ground truth (when available) + pre-flight cache footprint + mid-flight tile coverage.
- Statistics: per-flight error CDF; AC-NEW-4 contribution; mid-flight tile counts.
- FDR audit table: any `0x000F` lifecycle events of severity ≥ WARN.

### 5.2 CI dashboard (Tier-2)

GitHub Actions job summary plus a per-NFT report uploaded as workflow artifact. The summary includes:

- Pass / fail per NFT scenario.
- For NFT-PERF-*: histogram of latencies + comparison to threshold.
- For NFT-LIM-*: peak memory / FDR size traces.
- For NFT-RES-*: AC-NEW-4 / AC-NEW-7 statistical summary with stated 95 % CI.
- For IT-12: comparative-study summary across all VIO / VPR strategies in the research binary.

There is no live CI dashboard separate from the GitHub Actions UI; the project is small enough that the per-PR job summary is sufficient.

### 5.3 No live in-flight dashboard

Out of scope by design. The GCS is the only live operator surface; all other inspection is post-flight.

## 6. Open Items / Plan-Phase Carryforward

- **Long-term FDR archive** (multi-flight statistical headroom): D-PROJ-3 (multi-flight fixture acquisition for AC-NEW-4 / AC-NEW-7) is not pursued this cycle. If pursued in a future cycle, post-flight FDR archives become a corpus contribution path; the operator-tooling FDR-retrieval step would need an explicit "contribute to corpus" toggle.
- **Telemetry-link encryption** beyond MAVLink-2.0 signing: out of scope; addressed by physical link assumptions in the threat model (architecture.md § 7).
- **iNav signing**: still has no equivalent to MAVLink-2.0 signing (Mode B Source #129). Carryforward Plan-phase action: file a feature request upstream; meanwhile observability for iNav-profile flights is the same as AP-profile minus the `MavlinkSigningKeyRotated` records (which are NULL on iNav flights per data_model.md § 2.2).