GPS-Denied Onboard — Observability
Generated by /autodevgreenfield Step 16 (Deploy) — Step 5. Builds on Step 1 (reports/deploy_status_report.md), Step 2 (containerization.md), Step 3 (ci_cd_pipeline.md), and Step 4 (environment_strategy.md). The deploy skill's standard observability template (Prometheus /metrics + OpenTelemetry + PagerDuty) is adapted here for an airborne autonomous system: the airborne image has no inbound listeners (NFT-SEC-05 in-flight egress lockdown), so the canonical observability surface is the on-device Flight Data Recorder (FDR) binary ring buffer, replayed off-flight by post-landing tooling. Operator workstation + CI keep the conventional logging-to-stdout / journald patterns.
Observability Architecture (one-paragraph)
The airborne image (companion-jetson / companion-tier1) writes
structured FDR records to a 64 GB ring buffer (/var/lib/gps-denied/fdr)
via the shared_fdr_client (producer → SPSC ring → C13 writer). Logs
at WARN and above are forwarded into FDR as kind="log" records by the
fdr_log_bridge (AZ-267); below-WARN logs go to LOG_SINK (console in
dev, journald on the operator workstation, fdr on airborne — never to
file). Telemetry is captured as kind-specific FDR records (vio.tick,
state.tick, tile_match, c6.write, c6.eviction_batch, etc.) rather
than via a Prometheus endpoint, because no inbound TCP is permitted in
flight. Post-flight tooling on the operator workstation parses the FDR
segments using the frozen, versioned fdr_record_schema v1.3.0 and
feeds Grafana / Jupyter / one-off scripts. The suite-mandated
AZAION_UPDATE_EVENT journald audit chain + OCI image labels
(org.opencontainers.image.revision/created/source) + ENV AZAION_REVISION=$CI_COMMIT_SHA form the deploy-side audit trail (AZ-204).
jetson-stats (jtop) device telemetry (thermal zones, CPU/GPU
clocks, power rails) is sampled by C7 + C4 to drive the
D-CROSS-LATENCY-1 auto-degrade hybrid trigger; samples land in FDR
alongside the matcher / pose ticks.
Logging
Format
Structured records to LOG_SINK. No file-based logging in containers.
The LOG_SINK env var (Step 4) selects the destination per environment.
Common log envelope (per-record fields)
Source of truth: _docs/02_document/contracts/shared_log_bridge/log_record_schema.md v1.0.0 — referenced by the fdr_log_bridge (AZ-267). Every onboard log record carries:
{
"timestamp": "2026-05-10T03:14:15.123456Z",
"level": "INFO",
"service": "gps-denied-onboard",
"component": "c2_vpr",
"flight_id": "<uuid>",
"frame_id": 12345,
"kind": "vpr.warmup",
"msg": "loaded",
"kv": {"model": "salad"},
"exc": null
}
| Field | Purpose | Notes |
|---|---|---|
| timestamp | ISO 8601 UTC, microsecond precision | RFC 3339 with Z suffix |
| level | DEBUG \| INFO \| WARN \| ERROR | WARN + ERROR are also mirrored into FDR via fdr_log_bridge |
| service | gps-denied-onboard | Constant per submodule |
| component | Module slug from module-layout.md (c2_vpr, c6_tile_cache.store, shared.fdr_client, …) | Matches producer_id on the corresponding FDR record |
| flight_id | UUID assigned at flight open by C13 (flight_header) | Correlation across all components within one flight |
| frame_id | Monotonic per-frame counter from runtime_root | Cross-component frame correlation (VIO ↔ matcher ↔ state) |
| kind | Dotted snake_case event tag (closed enum per component) | E.g. vpr.warmup, c6.evict.budget, c8.signing_key_rotation |
| msg | Short human-readable event description | No PII; no secrets; no file payloads |
| kv | Bag of typed scalars | JSON-safe; no nested blobs > 4 KiB |
| exc | Optional exception class + traceback | Present only on ERROR; truncated to 4 KiB |
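As an illustration of the envelope above, here is a minimal Python sketch that builds one record with the listed fields. The function name `make_log_record` and its argument order are hypothetical and are not taken from the log_record_schema contract; only the field names, the level set, the RFC 3339 `Z` timestamp, and the 4 KiB `exc` cap come from the table.

```python
import json
from datetime import datetime, timezone

LEVELS = ("DEBUG", "INFO", "WARN", "ERROR")
MAX_EXC_BYTES = 4 * 1024  # field table: exc is truncated to 4 KiB


def make_log_record(level, component, flight_id, frame_id, kind, msg,
                    kv=None, exc=None, now=None):
    """Build one log record dict (hypothetical helper, not the real API)."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    if exc is not None and level != "ERROR":
        raise ValueError("exc is only present on ERROR records")
    if exc is not None:
        # Truncate on a byte budget, dropping any split trailing character.
        exc = exc.encode("utf-8")[:MAX_EXC_BYTES].decode("utf-8", errors="ignore")
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "timestamp": now.isoformat(timespec="microseconds").replace("+00:00", "Z"),
        "level": level,
        "service": "gps-denied-onboard",
        "component": component,
        "flight_id": flight_id,
        "frame_id": frame_id,
        "kind": kind,
        "msg": msg,
        "kv": dict(kv or {}),
        "exc": exc,
    }


record = make_log_record(
    "INFO", "c2_vpr", "<uuid>", 12345, "vpr.warmup", "loaded",
    kv={"model": "salad"},
    now=datetime(2026, 5, 10, 3, 14, 15, 123456, tzinfo=timezone.utc),
)
print(json.dumps(record))
```

Passing `now` explicitly keeps the sketch deterministic; a real producer would presumably take the clock from the runtime.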
Log Levels
| Level | Usage | Example |
|---|---|---|
| ERROR | Exceptions, failures requiring offline review | c5.solver.diverged, c8.signing_handshake_failed, c6.write_failed |
| WARN | Degraded operation, retry, fallback engaged | c4.pose.degraded_to_pnp, c6.freshness.rejected, c7.tensorrt_engine_rebuild |
| INFO | Significant in-flight business events | c8.signing_key_rotation, flight_header, flight_footer, c11.upload_batch_queued |
| DEBUG | Detailed diagnostics (dev only) | Per-frame VIO covariance dump, full matcher correspondences list |
WARN + ERROR are mirrored into FDR via fdr_log_bridge (AZ-267) so they survive a post-landing journalctl clear. INFO + DEBUG go only to LOG_SINK.
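The mirroring rule can be sketched as a small filter. This is an assumption-laden illustration, not the fdr_log_bridge implementation: the function name `to_fdr_log_record` and the `payload` field are hypothetical, while `kind="log"`, the WARN/ERROR cut-off, and the component-to-`producer_id` mapping come from this document.

```python
MIRRORED_LEVELS = frozenset({"WARN", "ERROR"})


def to_fdr_log_record(log_record):
    """Return a kind="log" FDR record for WARN/ERROR logs, else None.

    The output shape is hypothetical; the frozen fdr_record_schema is the
    source of truth for the real field layout.
    """
    if log_record["level"] not in MIRRORED_LEVELS:
        return None  # INFO and DEBUG go only to LOG_SINK
    return {
        "kind": "log",
        "producer_id": log_record["component"],
        "flight_id": log_record["flight_id"],
        "frame_id": log_record["frame_id"],
        "payload": log_record,
    }


warn = {"level": "WARN", "component": "c6_tile_cache.store",
        "flight_id": "f-1", "frame_id": 7, "kind": "c6.freshness.rejected"}
info = dict(warn, level="INFO", kind="flight_header")
mirrored = to_fdr_log_record(warn)
skipped = to_fdr_log_record(info)
```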
Destinations and Retention
| Environment | LOG_SINK | Destination | Retention |
|---|---|---|---|
| Development (Tier-1 Docker) | console | Docker container stdout (docker compose logs companion) | Session — cleared on docker compose down |
| CI (Woodpecker) | console | Woodpecker UI stdout capture | Per the suite Woodpecker retention policy (operator-managed; today ≤ 30 days) |
| Staging (lab Jetson) | journald | Host journald | Per the host's journald.conf (suite default: ~7 days rolling) |
| Production — airborne | fdr | FDR ring buffer at /var/lib/gps-denied/fdr (≥ 64 GB) | Bounded by ring capacity; rolls over per segment_rollover FDR record. Post-flight operator pulls segments to long-term storage on the operator workstation. |
| Production — operator workstation | journald | Host journald | Per the host's journald.conf (operator-managed; recommendation: 30 days for the operator-orchestrator service unit) |
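A minimal sketch of how a process might validate `LOG_SINK` at startup, assuming the three values in the table are the complete set. The helper name `resolve_log_sink` is hypothetical. Rejecting anything else is one way to keep the "never to file" rule from being bypassed by a typo or an ad-hoc value.

```python
import os

ALLOWED_SINKS = ("console", "journald", "fdr")  # values from the table above


def resolve_log_sink(environ=None):
    """Read LOG_SINK and fail fast on unset or unsupported values."""
    environ = os.environ if environ is None else environ
    sink = environ.get("LOG_SINK")
    if sink not in ALLOWED_SINKS:
        raise ValueError(f"LOG_SINK must be one of {ALLOWED_SINKS}, got {sink!r}")
    return sink


airborne_sink = resolve_log_sink({"LOG_SINK": "fdr"})
```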
"PII" Rules (read: operational secrets)
This system has no end-user PII surface — flights, MAVLink, and tile data are operational rather than personal. The equivalent restrictions are operational-secret leakage controls:
- Never log MAVLink 2.0 signing key bytes, per-flight onboard signing key bytes, satellite-provider API tokens, registry tokens, or Postgres credentials. The KeySource Protocol (C8) is the only component that ever holds key material, and its log path emits only the rotation event tag + key fingerprint (SHA-256 first 8 bytes), never the key.
- Mask absolute file paths in any record that references operator-specific layouts (e.g. /Users/<operator>/… collapsed to ~/…).
- Never log raw camera frame bytes or full tile JPEGs inline — they go to sidecar paths via FDR's failed_tile_thumbnail (≤ 0.1 Hz rate cap) or mid_flight_tile_snapshot.
- Never log raw GPS coordinates unless the flight's restricted_geographic_log_redaction config is off (operator-set at takeoff load).
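Two of the rules above lend themselves to a short sketch: the key fingerprint (first 8 bytes of SHA-256) and home-directory path masking. The helper names are hypothetical, the hex encoding of the fingerprint is an assumption, and matching `/home/<user>` in addition to the `/Users/<operator>` form named above is also an assumption.

```python
import hashlib
import re


def key_fingerprint(key_bytes: bytes) -> str:
    """Hex of the first 8 bytes of SHA-256(key); safe to log, unlike the key."""
    return hashlib.sha256(key_bytes).digest()[:8].hex()


# /Users/<operator> comes from the rule above; /home/<user> is an assumed
# Linux equivalent added for illustration.
_HOME_PREFIX = re.compile(r"^/(?:Users|home)/[^/]+")


def mask_home_path(path: str) -> str:
    """Collapse an operator-specific home prefix to ~ before logging."""
    return _HOME_PREFIX.sub("~", path, count=1)


print(key_fingerprint(b"example-key-material"))
print(mask_home_path("/Users/alice/flights/segment-0001.fdr"))
```

Paths outside a home directory, such as `/var/lib/gps-denied/fdr`, pass through unchanged.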
Telemetry (FDR-based, not Prometheus)
Why FDR, not Prometheus / OTel
The airborne image runs under NFT-SEC-05 (in-flight egress lockdown — no inbound listeners, outbound only to the FC over UART/USB and to QGroundControl over MAVLink 2.0 1–2 Hz downsampled summary). A /metrics HTTP endpoint would violate this, and a push-mode OTel exporter has no in-flight collector to reach. The FDR ring is the canonical telemetry sink; post-flight tooling converts FDR records into whatever observability backend the operator prefers (Grafana, Jupyter, ad-hoc scripts).
The operator workstation is not in-flight-locked-down; cycle-2 may add a Prometheus /metrics endpoint on the operator-orchestrator service (see "Future Work" below). Cycle-1 leaves both the operator-orchestrator and airborne side on the FDR + structured logs path for consistency.
FDR Record Kinds (cycle-1 metrics surface)
Source of truth: _docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md v1.3.0. Each kind is the metric.
| Metric (FDR kind) | Producer | Type (intent) | What it tells the operator |
|---|---|---|---|
| vio.tick | C1 | per-frame snapshot | VIO output (R, t), pose covariance proxies, last-anchor age, monocular reproj error, IMU bias norm |
| state.tick | C5 | per-frame snapshot | Smoothed fused-pose tick from iSAM2 (or ESKF baseline) + 2x2 covariance + estimator label |
| tile_match | C2.5 / C3 | per-match snapshot | Tile id, VPR score, match count, RANSAC inlier count |
| c6.write | C6 | counter-ish (per-tile) | Successful write_tile — tile id, source, disk bytes, content SHA-256 |
| c6.write_failed | C6 | counter-ish (per-failure) | Failed write_tile — reason ∈ {content_hash_mismatch, freshness_reject, metadata_error, fs_error} |
| c6.freshness.rejected | C6 | counter-ish (per-reject) | Active-conflict-stale tile rejected — tile_id, age_seconds, threshold |
| c6.freshness.downgraded | C6 | counter-ish (per-downgrade) | Stable-rear-stale tile downgraded — same shape as rejected |
| c6.eviction_batch | C6 | batch counter (per sweep) | Cache budget enforcer evicted N tiles to make room — trigger tile, freed bytes, count, first 5 evicted ids |
| overrun | shared.fdr_client | counter (per drop) | FDR ring overrun — producer_id of the originating queue + dropped count (> 0). AC-NEW-3: never silent. |
| segment_rollover | C13 writer | counter (per rotation) | Segment file rotated (including 64 GB cap drops) |
| failed_tile_thumbnail | C6 / C11 | rate-capped sample | Forensic JPEG thumbnail (≤ 0.1 Hz). AC-8.5 |
| mid_flight_tile_snapshot | C13 snapshot path | sample pointer | Mid-flight tile snapshot pointer (sidecar). AC-8.4 |
| flight_header | C13 writer | once-per-flight | flight_id, start ISO/monotonic, config snapshot, signing-key rotation event, manifest content hashes, build info |
| flight_footer | C13 writer | once-per-flight | flight_id, end ISO/monotonic, records written / dropped (overrun) / bytes / rollover count / clean-shutdown flag |
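Because each kind is the metric, post-flight tooling can derive rates by counting records per kind. The sketch below assumes the FDR segments have already been parsed into dicts with a `kind` key; that in-memory shape and the function name `kind_rates_hz` are hypothetical, and the binary parsing governed by fdr_record_schema is out of scope here.

```python
from collections import Counter


def kind_rates_hz(records, duration_s):
    """Roll parsed FDR records up into per-kind rates (records per second)."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    counts = Counter(rec["kind"] for rec in records)
    return {kind: n / duration_s for kind, n in counts.items()}


# Four VIO ticks and one tile write observed over a 2-second window.
parsed = [{"kind": "vio.tick"}] * 4 + [{"kind": "c6.write"}]
rates = kind_rates_hz(parsed, duration_s=2.0)
```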
Device Telemetry (jetson-stats / jtop)
D-CROSS-LATENCY-1 requires runtime thermal + power + GPU clock telemetry to drive the auto-degrade hybrid trigger (frame deadline missed × thermal headroom). Cycle-1 source: jetson-stats (jtop) accessed inside the companion-jetson container via runtime: nvidia + the nvidia-container-runtime device passthrough — same pattern the suite's detections service uses on the same hardware.
| Signal | Source | Sample rate | Consumer |
|---|---|---|---|
| GPU clock (MHz) | jtop.gpu | 1 Hz | C7 (degrade gate); recorded into FDR via c7.device_telemetry log records (kind="c7.thermal_headroom") |
| GPU/CPU temperature (°C) | jtop.temperature | 1 Hz | C4 / C7 hybrid trigger |
| Power draw (mW) | jtop.power | 1 Hz | Cycle-2 derate hysteresis |
| Memory pressure | jtop.memory | 1 Hz | C6 eviction batch hysteresis |
Cycle-1: jtop runs in-process inside the companion container; samples are emitted as FDR kind="c7.thermal_headroom" records. Cycle-2 may move this to a sidecar Python thread once the Step 2 BLOCKING gate "jetson-stats thermal telemetry under Docker" (containerization.md § Step 2 Validation Gates) is signed off on the real Tier-2 Jetson.
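To make the hybrid trigger concrete, here is a sketch that reads "frame deadline missed × thermal headroom" as a logical AND of a missed deadline and low headroom. That reading, the function names, and both threshold defaults (85 °C throttle point, 10 °C minimum headroom) are assumptions for illustration only; D-CROSS-LATENCY-1 defines the real rule. The sketch takes a temperature value as input and does not call jtop.

```python
def thermal_headroom_c(temp_c, throttle_temp_c):
    """Degrees Celsius remaining before the (assumed) throttle point."""
    return throttle_temp_c - temp_c


def should_degrade(deadline_missed, temp_c,
                   throttle_temp_c=85.0,   # hypothetical threshold
                   min_headroom_c=10.0):   # hypothetical threshold
    """True when a frame deadline was missed AND thermal headroom is low."""
    low_headroom = thermal_headroom_c(temp_c, throttle_temp_c) < min_headroom_c
    return bool(deadline_missed) and low_headroom


hot_and_late = should_degrade(True, temp_c=80.0)   # headroom 5 °C
cool_and_late = should_degrade(True, temp_c=60.0)  # headroom 25 °C
hot_on_time = should_degrade(False, temp_c=80.0)
```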
Collection Interval
| Source | Interval |
|---|---|
| Per-frame producers (C1 vio.tick, C5 state.tick, C3 tile_match) | Camera frame cadence (target ≥ 4 Hz on Tier-2; per _docs/02_document/architecture.md Vision) |
| Per-write producers (C6 c6.write, c6.write_failed, c6.freshness.*) | Per-event (write-path triggered) |
| Per-batch producers (C6 c6.eviction_batch) | Per-sweep (only when ≥ 1 tile evicted) |
| jetson-stats (jtop) | 1 Hz |
| flight_header / flight_footer | Once per flight |
| segment_rollover | Per segment rotation |
There is no Prometheus-style "scrape interval" because there is no scraping endpoint — the FDR ring is push-only from producers, drained by C13's writer thread.
Distributed Tracing
Architecture stance (cycle-1)
No W3C Trace Context. No OpenTelemetry SDK. The airborne image's correlation key is the pair (flight_id, frame_id):
- flight_id (UUID) is assigned at flight open by C13 and written into flight_header. Every log record and FDR record within that flight carries it.
- frame_id (monotonic per-frame counter) is assigned by the composition root's frame pipeline. Every per-frame FDR record (vio.tick, state.tick, tile_match, c6.write …) carries it.
This is sufficient because the airborne pipeline is in-process, single-camera, single-FC — there are no inter-service RPC hops to trace. Post-flight tooling reconstructs the per-frame causal chain by joining FDR records on (flight_id, frame_id).
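The join described above can be sketched as a group-by over parsed records. The dict shape of a parsed record and the function name `group_by_frame` are assumptions; only the `(flight_id, frame_id)` key and the record kinds come from this document.

```python
from collections import defaultdict


def group_by_frame(records):
    """Group parsed FDR records by (flight_id, frame_id).

    Records without a frame_id (for example flight_header) are skipped,
    since they are per-flight rather than per-frame.
    """
    frames = defaultdict(list)
    for rec in records:
        if rec.get("frame_id") is None:
            continue
        frames[(rec["flight_id"], rec["frame_id"])].append(rec)
    return dict(frames)


parsed = [
    {"flight_id": "f-1", "kind": "flight_header"},
    {"flight_id": "f-1", "frame_id": 10, "kind": "vio.tick"},
    {"flight_id": "f-1", "frame_id": 10, "kind": "tile_match"},
    {"flight_id": "f-1", "frame_id": 10, "kind": "state.tick"},
    {"flight_id": "f-1", "frame_id": 11, "kind": "vio.tick"},
]
chains = group_by_frame(parsed)
frame_10_kinds = [rec["kind"] for rec in chains[("f-1", 10)]]
```

Each value preserves record order within the frame, which gives the per-frame causal chain as long as the input is in write order.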
The operator workstation has more conventional inter-service traffic (C12 ↔ flights REST, C11 ↔ satellite-provider REST). Cycle-1 traces these by:
- Per-request log records with the request URL + status + duration_ms + a generated correlation_id.
- FlightsApiClient and the satellite-provider HTTP client both stamp this correlation id on the request line + response log.
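A minimal sketch of the per-request pattern, with the transport injected as a callable so the example stays self-contained. The wrapper name `logged_request`, the `http.request` kind, and the `send` signature are hypothetical; the logged fields (URL, status, duration_ms, correlation_id) are the ones listed above.

```python
import time
import uuid


def logged_request(method, url, send, clock=time.monotonic):
    """Run one request and return (status, log_record).

    `send(method, url, correlation_id)` is a stand-in for the real HTTP
    client call and must return the response status code.
    """
    correlation_id = uuid.uuid4().hex
    started = clock()
    status = send(method, url, correlation_id)
    duration_ms = (clock() - started) * 1000.0
    log_record = {
        "level": "INFO",
        "kind": "http.request",  # hypothetical event tag
        "msg": f"{method} {url}",
        "kv": {
            "url": url,
            "status": status,
            "duration_ms": duration_ms,
            "correlation_id": correlation_id,
        },
    }
    return status, log_record


seen_ids = []


def fake_send(method, url, correlation_id):
    seen_ids.append(correlation_id)  # a real client would set a header here
    return 200


ticks = iter([1.0, 1.25])
status, log_record = logged_request(
    "GET", "https://flights.example/api", fake_send, clock=lambda: next(ticks)
)
```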
OpenTelemetry SDK + W3C Trace Context propagation is a cycle-2 polish item for the operator-orchestrator only — not for the airborne image. Logged in "Future Work" below.
Sampling
| Environment | Effective sampling rate | Rationale |
|---|---|---|
| Development | 100% | FDR + logs both on |
| Staging (lab Jetson) | 100% | Full visibility for IT-12 / NFT-PERF runs |
| Production — airborne | 100% per-frame for vio.tick/state.tick/tile_match; failed_tile_thumbnail rate-capped at ≤ 0.1 Hz | FDR ring is the only post-landing forensic record; full per-frame capture is mandatory. Rate caps live on byte-heavy forensic records only. |
| Production — operator workstation | 100% INFO+; DEBUG off | Operator workstation has full disk; cost is not a concern. |
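The ≤ 0.1 Hz cap on failed_tile_thumbnail amounts to allowing at most one record per 10 seconds. A minimal sketch of such a limiter follows; the class name `RateCap` is hypothetical, and timestamps are passed in explicitly so the behaviour is deterministic.

```python
class RateCap:
    """Allow an event only if at least 1/max_hz seconds passed since the last."""

    def __init__(self, max_hz):
        if max_hz <= 0:
            raise ValueError("max_hz must be positive")
        self._min_interval_s = 1.0 / max_hz
        self._last_allowed_s = None

    def allow(self, now_s):
        if (self._last_allowed_s is None
                or now_s - self._last_allowed_s >= self._min_interval_s):
            self._last_allowed_s = now_s
            return True
        return False  # caller drops (or counts) the suppressed sample


thumbnail_cap = RateCap(max_hz=0.1)
decisions = [thumbnail_cap.allow(t) for t in (0.0, 5.0, 10.0, 12.0, 20.0)]
```

In a real producer `now_s` would come from a monotonic clock rather than wall time, so clock adjustments cannot open or close the window.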
Alerting
Airborne (in-flight)
No real-time alerting from the airborne image. Autonomy: the FC handles in-flight failsafe (SAFE_DEAD_RECKONING, RTL, LAND etc. per AC-FC-FAILSAFE-1). The companion does not have a network path to a human operator in flight — its only outbound channel is the MAVLink 2.0 1–2 Hz downsampled summary to QGroundControl, which surfaces companion health via STATUSTEXT messages and the parent suite's GpsDeniedHealth MAVLink message.
Alert-equivalents on the airborne side:
| Event | Detected by | In-flight signal |
|---|---|---|
| Companion process died | FC adapter watchdog timeout | FC drops to SAFE_DEAD_RECKONING; operator sees lost telemetry in QGC |
| D-CROSS-LATENCY-1 deadline miss + thermal headroom low | C4 / C7 hybrid trigger | Auto-degrade to lower-cost C7 backend; STATUSTEXT to QGC + FDR kind="c7.degrade" |
| C8 signing handshake failed | C8 FC adapter | Refuses takeoff; STATUSTEXT to QGC + FDR kind="c8.signing_handshake_failed" |
| FDR ring overrun | shared.fdr_client drop-oldest hook | Emits kind="overrun" (AC-NEW-3); post-flight forensics tag |
| Segment cap reached (64 GB) | C13 writer | Emits kind="segment_rollover" with cap-drop flag; oldest data lost — flag surfaces post-flight |
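The "never silent" overrun rule (AC-NEW-3) can be illustrated with a bounded drop-oldest queue that counts what it discards and reports the count on the next drain. This is a single-threaded sketch under stated assumptions: the class name `DropOldestQueue` and the `dropped` field name are hypothetical, and the real shared.fdr_client is an SPSC ring whose concurrency handling is not modelled here.

```python
from collections import deque


class DropOldestQueue:
    """Bounded queue that drops the oldest item and records the loss."""

    def __init__(self, capacity, producer_id):
        self._items = deque()
        self._capacity = capacity
        self._producer_id = producer_id
        self._dropped = 0

    def push(self, record):
        if len(self._items) == self._capacity:
            self._items.popleft()  # drop-oldest policy
            self._dropped += 1
        self._items.append(record)

    def drain(self):
        """Return queued records, followed by an overrun record if any drops."""
        out = list(self._items)
        self._items.clear()
        if self._dropped > 0:
            out.append({"kind": "overrun",
                        "producer_id": self._producer_id,
                        "dropped": self._dropped})
            self._dropped = 0
        return out


queue = DropOldestQueue(capacity=2, producer_id="c1_vio")
for frame_id in (1, 2, 3):
    queue.push({"kind": "vio.tick", "frame_id": frame_id})
drained = queue.drain()
```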
Post-Flight (operator workstation)
Post-flight analysis runs the FDR segments through the post-landing tooling. Alerts surface in the operator's environment:
| Severity | Response time | Condition | Cycle-1 channel |
|---|---|---|---|
| Critical | Pre-next-flight gate (≤ 10 min before takeoff) | flight_footer.clean_shutdown == false; kind="c8.signing_handshake_failed" observed; FDR overrun count > 0 above per-flight threshold | Operator UI block + Slack #gps-denied-ops (cycle-2 once the channel is wired); cycle-1: operator's local terminal output from post-landing tooling |
| High | Same-day | C6 eviction batch > 100 in one flight; tile_match score histogram drifted vs operator baseline | Same as above |
| Medium | Within 1 week | Cumulative thermal-headroom-low events trending up across recent flights | Operator dashboard (cycle-2) |
| Low | Recorded in flight summary only | Non-critical warnings (FDR kind="log" at WARN level) | Flight summary PDF / Markdown |
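The Critical row can be read as a pre-next-flight gate over three conditions. The sketch below is one possible encoding: the function name, the finding labels, and the default `overrun_threshold` of 0 are hypothetical, and the real per-flight threshold is not specified in this document.

```python
def critical_findings(footer, kinds_seen, overrun_dropped, overrun_threshold=0):
    """Return the list of Critical conditions that block the next flight."""
    findings = []
    if not footer.get("clean_shutdown", False):
        findings.append("unclean_shutdown")
    if "c8.signing_handshake_failed" in kinds_seen:
        findings.append("signing_handshake_failed")
    if overrun_dropped > overrun_threshold:
        findings.append("fdr_overrun")
    return findings


clean_flight = critical_findings(
    {"clean_shutdown": True}, {"vio.tick", "state.tick"}, overrun_dropped=0
)
bad_flight = critical_findings(
    {"clean_shutdown": False}, {"c8.signing_handshake_failed"}, overrun_dropped=3
)
```

An empty list means the gate passes. Treating a missing `clean_shutdown` flag as a failure is a deliberately conservative choice in this sketch.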
CI (Woodpecker pipelines)
| Severity | Response time | Condition | Channel |
|---|---|---|---|
| Critical | Same business day | 01-test.yml failure on main branch | Woodpecker UI; per-repo Slack channel (cycle-2 follow-up — ci_cd_pipeline.md Future Work #8) |
| High | Within 24 h | 02-build-push.yml build failure on any push branch | Woodpecker UI |
| Medium | Next business day | Lint / coverage gate fail (cycle-2; cycle-1 has neither) | n/a in cycle-1 |
| Low | Next sprint review | Non-critical pipeline warnings | n/a |
Deploy / Update (Watchtower)
| Severity | Response time | Condition | Channel |
|---|---|---|---|
| Critical | Immediate | Watchtower post-update hook emits AZAION_UPDATE_EVENT severity=error to journald (image pull failed, container crash on restart) | journald + suite operator's journalctl -g AZAION_UPDATE_EVENT audit chain |
| Informational | None | Watchtower applied an update during a non-flight window (/run/azaion/in-flight cleared) | AZAION_UPDATE_EVENT severity=info to journald — audit only |
Dashboards
Operations (cycle-1 — what exists today)
- Suite Woodpecker UI — CI pipeline status per branch + commit; the only "live" operations dashboard cycle-1 ships.
- jtop on the bench — operator runs sudo jtop on the lab / airborne Jetson during staging / pre-flight to observe thermal + GPU clock + power. Not a service dashboard; it's a CLI tool.
- docker ps + docker compose logs — the workstation operator's dev-environment dashboard.
Operations (cycle-2 polish, planned)
- Grafana dashboard fed by post-landing-parsed FDR records — service health per component (FDR record kinds rolled up into rates), thermal trend, eviction count, tile_match score distribution.
- Prometheus /metrics on operator-orchestrator — once cycle-2 wires this on the operator workstation, the Grafana dashboard pulls live operator-side metrics alongside post-landing FDR rollups.
Flight Analytics (cycle-1 — what exists today)
- Per-flight summary generated by post-landing tooling (Markdown / PDF) — records written / dropped, segment count, top-N error log lines, eviction count, signing-key rotation event log, flight_footer.clean_shutdown flag. Stored alongside the FDR segments under _docs/06_metrics/flights/<flight_id>/ (cycle-2 publishes; cycle-1 staging dir is operator-local).
Flight Analytics (cycle-2 polish, planned)
- FDR replay viewer — interactive timeline of (flight_id, frame_id) correlated records.
- NFT-PERF baseline tracker — frame deadline miss rate, thermal headroom, end-to-end pose latency tracked across flights.
Deploy Audit (suite-mandated)
Per ../_infra/ci/README.md → "OCI image labels and commit provenance (AZ-204)" and ../_infra/deploy/jetson/README.md → "Audit: what is this device running?":
- Every image (companion-jetson, companion-tier1, operator-orchestrator) is built with:
  - OCI labels: org.opencontainers.image.revision=$CI_COMMIT_SHA, org.opencontainers.image.created=<UTC RFC 3339>, org.opencontainers.image.source=$CI_REPO_URL.
  - ENV AZAION_SERVICE=gps-denied-onboard + ENV AZAION_REVISION=$CI_COMMIT_SHA.
- Watchtower's post-update hook emits one AZAION_UPDATE_EVENT line per applied update into journald, carrying the new revision SHA + service name + timestamp + outcome.
- The operator runs journalctl -g AZAION_UPDATE_EVENT on any Jetson to answer "what is this device running and when did it last update?".
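As an illustration of the label set above, the following sketch assembles the three OCI labels as `--label key=value` argument pairs for `docker build`. Whether the suite's pipeline passes labels on the command line or sets them in the Dockerfile is not stated here, so treat the mechanism and the helper name `oci_label_args` as assumptions; only the label keys and their sources come from this document.

```python
from datetime import datetime, timezone


def oci_label_args(commit_sha, repo_url, created=None):
    """Return ["--label", "key=value", ...] for the three audit labels."""
    created = (created or datetime.now(timezone.utc)).astimezone(timezone.utc)
    labels = {
        "org.opencontainers.image.revision": commit_sha,
        "org.opencontainers.image.created": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "org.opencontainers.image.source": repo_url,
    }
    args = []
    for key, value in labels.items():
        args += ["--label", f"{key}={value}"]
    return args


label_args = oci_label_args(
    "0123abc",
    "https://git.example/gps-denied-onboard",
    created=datetime(2026, 5, 10, 3, 14, 15, tzinfo=timezone.utc),
)
```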
Self-verification
- Structured logging format defined with required fields (timestamp, level, service, component, flight_id, frame_id, kind, msg, kv, exc)
- Per-environment LOG_SINK destination + retention tabulated
- FDR-based metrics surface enumerated (every fdr_record_schema v1.3.0 kind mapped to its operator-relevant meaning)
- Device telemetry (jetson-stats / jtop) source + sample rate + consumer (D-CROSS-LATENCY-1 hybrid trigger)
- Tracing stance recorded — no W3C Trace Context / OTel SDK on airborne (justified by single-process pipeline + NFT-SEC-05); operator-side correlation_id pattern documented; OTel deferred to cycle-2 polish
- Alert severities + response times defined across the four touchpoints: airborne in-flight, post-flight operator workstation, CI, deploy/update audit (AZAION_UPDATE_EVENT)
- Operational-secret leakage controls in place (no key bytes / API tokens / Postgres credentials in logs; KeySource is the only key holder)
- Dashboards inventoried — cycle-1 reality (Woodpecker UI, jtop, post-landing summary) explicit; cycle-2 polish (Grafana, FDR replay viewer, NFT-PERF tracker) logged as follow-ups
- Suite-mandated deploy audit chain (AZAION_UPDATE_EVENT + OCI labels + AZAION_REVISION env) referenced from ../_infra/ docs
Future Work (cycle-2 polish)
- Prometheus /metrics on operator-orchestrator — cycle-2 wires an in-process exporter for operator-workstation-side metrics (flights REST round-trip latency, satellite-provider download throughput, tile manifest content-hash failures). The airborne image stays off this path per NFT-SEC-05.
- Grafana dashboard fed by post-landing-parsed FDR rollups — single pane of glass for per-flight + cross-flight trends.
- OpenTelemetry SDK on operator-orchestrator only — instruments FlightsApiClient + satellite-provider HTTP client with W3C Trace Context propagation. Out of scope for airborne.
- Per-repo Slack channel (#gps-denied-ci for CI, #gps-denied-ops for post-flight) — ci_cd_pipeline.md Future Work #8 already logs the CI half; this doc adds the ops half.
- FDR replay viewer — interactive timeline of (flight_id, frame_id) correlated records; consumes FDR segments via the fdr_record_schema v1.3.0 parser.
- NFT-PERF baseline tracker — automated frame-deadline-miss-rate + thermal-headroom + end-to-end pose latency trending across flights, gated by AZ-595 SITL replay fixture + AZ-592/AZ-593 Tier-2 OKVIS2/VINS-Mono wiring.
- Centralised log aggregator on the operator workstation — Loki / journald-export-to-cloud once the operator network egress allows it; cycle-1 leaves journald at host-default retention.