Mirror of https://github.com/azaion/gps-denied-onboard.git, synced 2026-06-21 11:41:13 +00:00 (commit bf13549b32).

# GPS-Denied Onboard — Observability

> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 5. Builds on
> Step 1 (`reports/deploy_status_report.md`), Step 2 (`containerization.md`),
> Step 3 (`ci_cd_pipeline.md`), and Step 4 (`environment_strategy.md`). The
> deploy skill's standard observability template (Prometheus `/metrics` +
> OpenTelemetry + PagerDuty) is adapted here for an airborne autonomous
> system: the airborne image has **no inbound listeners** (NFT-SEC-05
> in-flight egress lockdown), so the canonical observability surface is the
> on-device **Flight Data Recorder (FDR)** binary ring buffer, replayed
> off-flight by post-landing tooling. Operator workstation + CI keep the
> conventional logging-to-stdout / journald patterns.

## Observability Architecture (one-paragraph)

The airborne image (`companion-jetson` / `companion-tier1`) writes
**structured FDR records** to a 64 GB ring buffer (`/var/lib/gps-denied/fdr`)
via the `shared_fdr_client` (`producer → SPSC ring → C13 writer`). Logs
at `WARN` and above are mirrored into FDR as `kind="log"` records by the
`fdr_log_bridge` (AZ-267); below-WARN logs go only to `LOG_SINK` (`console` in
dev, `journald` on the operator workstation, `fdr` on airborne — never to
file). Telemetry is captured as kind-specific FDR records (`vio.tick`,
`state.tick`, `tile_match`, `c6.write`, `c6.eviction_batch`, etc.) rather
than via a Prometheus endpoint, because no inbound TCP is permitted in
flight. Post-flight tooling on the operator workstation parses the FDR
segments using the **frozen, versioned `fdr_record_schema` v1.3.0** and
feeds Grafana / Jupyter / one-off scripts. The suite-mandated
**`AZAION_UPDATE_EVENT` journald audit chain** + OCI image labels
(`org.opencontainers.image.revision/created/source`) +
`ENV AZAION_REVISION=$CI_COMMIT_SHA` form the deploy-side audit trail (AZ-204).
**`jetson-stats` (`jtop`) device telemetry** (thermal zones, CPU/GPU
clocks, power rails) is sampled by C7 + C4 to drive the
`D-CROSS-LATENCY-1` auto-degrade hybrid trigger; samples land in FDR
alongside the matcher / pose ticks.

## Logging

### Format

Structured records to `LOG_SINK`. No file-based logging in containers.
The `LOG_SINK` env var (Step 4) selects the destination per environment.

#### Common log envelope (per-record fields)

Source of truth: `_docs/02_document/contracts/shared_log_bridge/log_record_schema.md` v1.0.0 — referenced by the `fdr_log_bridge` (AZ-267). Every onboard log record carries:

```json
{
  "timestamp": "2026-05-10T03:14:15.123456Z",
  "level": "INFO",
  "service": "gps-denied-onboard",
  "component": "c2_vpr",
  "flight_id": "<uuid>",
  "frame_id": 12345,
  "kind": "vpr.warmup",
  "msg": "loaded",
  "kv": {"model": "salad"},
  "exc": null
}
```

| Field | Purpose | Notes |
|-------|---------|-------|
| `timestamp` | ISO 8601 UTC, microsecond precision | RFC 3339 with `Z` suffix |
| `level` | `DEBUG \| INFO \| WARN \| ERROR` | `WARN` + `ERROR` are also mirrored into FDR via `fdr_log_bridge` |
| `service` | `gps-denied-onboard` | Constant per submodule |
| `component` | Module slug from `module-layout.md` (`c2_vpr`, `c6_tile_cache.store`, `shared.fdr_client`, …) | Matches `producer_id` on the corresponding FDR record |
| `flight_id` | UUID assigned at flight open by C13 (`flight_header`) | Correlation across all components within one flight |
| `frame_id` | Monotonic per-frame counter from `runtime_root` | Cross-component frame correlation (VIO ↔ matcher ↔ state) |
| `kind` | Dotted snake_case event tag (closed enum per component) | E.g. `vpr.warmup`, `c6.evict.budget`, `c8.signing_key_rotation` |
| `msg` | Short human-readable event description | No PII; no secrets; no file payloads |
| `kv` | Bag of typed scalars | JSON-safe; no nested blobs > 4 KiB |
| `exc` | Optional exception class + traceback | Present only on `ERROR`; truncated to 4 KiB |

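
The envelope above can be sketched as a small builder. This is a minimal illustration, not the project's logging API: `make_log_record`, `LEVELS`, and `MAX_FIELD_BYTES` are hypothetical names, and the per-value 4 KiB check is one possible reading of the "no nested blobs > 4 KiB" note.

```python
import json
from datetime import datetime, timezone

LEVELS = ("DEBUG", "INFO", "WARN", "ERROR")
MAX_FIELD_BYTES = 4096  # the 4 KiB cap quoted for `kv` blobs and `exc`


def make_log_record(level, component, flight_id, frame_id, kind, msg,
                    kv=None, exc=None, now=None):
    """Build one envelope dict (hypothetical helper, not the project's API)."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    if exc is not None and level != "ERROR":
        raise ValueError("exc is only allowed on ERROR records")
    kv = dict(kv or {})
    for key, value in kv.items():
        # json.dumps raises TypeError for values that are not JSON-safe.
        if len(json.dumps(value).encode("utf-8")) > MAX_FIELD_BYTES:
            raise ValueError(f"kv[{key!r}] exceeds 4 KiB")
    now = now or datetime.now(timezone.utc)
    if exc is not None:
        # Truncate by bytes, dropping any partial trailing character.
        exc = exc.encode("utf-8")[:MAX_FIELD_BYTES].decode("utf-8", errors="ignore")
    return {
        # RFC 3339, UTC, microsecond precision, `Z` suffix.
        "timestamp": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
        "level": level,
        "service": "gps-denied-onboard",
        "component": component,
        "flight_id": flight_id,
        "frame_id": frame_id,
        "kind": kind,
        "msg": msg,
        "kv": kv,
        "exc": exc,
    }
```

Calling `make_log_record("INFO", "c2_vpr", flight_id, 12345, "vpr.warmup", "loaded", kv={"model": "salad"})` yields a dict with the same shape as the JSON example; `json.dumps` of the result is what a sink would receive.
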
### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring offline review | `c5.solver.diverged`, `c8.signing_handshake_failed`, `c6.write_failed` |
| WARN | Degraded operation, retry, fallback engaged | `c4.pose.degraded_to_pnp`, `c6.freshness.rejected`, `c7.tensorrt_engine_rebuild` |
| INFO | Significant in-flight business events | `c8.signing_key_rotation`, `flight_header`, `flight_footer`, `c11.upload_batch_queued` |
| DEBUG | Detailed diagnostics (dev only) | Per-frame VIO covariance dump, full matcher correspondences list |

`WARN` + `ERROR` are mirrored into FDR via `fdr_log_bridge` (AZ-267) so they survive a post-landing `journalctl` clear. `INFO` + `DEBUG` go only to `LOG_SINK`.

### Destinations and Retention

| Environment | `LOG_SINK` | Destination | Retention |
|-------------|------------|-------------|-----------|
| Development (Tier-1 Docker) | `console` | Docker container stdout (`docker compose logs companion`) | Session — cleared on `docker compose down` |
| CI (Woodpecker) | `console` | Woodpecker UI stdout capture | Per the suite Woodpecker retention policy (operator-managed; today ≤ 30 days) |
| Staging (lab Jetson) | `journald` | Host journald | Per the host's `journald.conf` (suite default: ~7 days rolling) |
| Production — airborne | `fdr` | FDR ring buffer at `/var/lib/gps-denied/fdr` (≥ 64 GB) | Bounded by ring capacity; rolls over per `segment_rollover` FDR record. Post-flight operator pulls segments to long-term storage on the operator workstation. |
| Production — operator workstation | `journald` | Host journald | Per the host's `journald.conf` (operator-managed; recommendation: 30 days for the operator-orchestrator service unit) |

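
A minimal sketch of how the level rule and the `LOG_SINK` table could combine into a routing decision. The function names, the `"fdr_log_bridge"` destination label, and the `console` default for an unset variable are assumptions made for illustration; the source does not specify them.

```python
import os

FDR_MIRRORED_LEVELS = {"WARN", "ERROR"}       # mirrored via fdr_log_bridge
VALID_SINKS = {"console", "journald", "fdr"}  # values from the table above


def resolve_sink(environ=None):
    """Read LOG_SINK; defaulting to `console` is an assumption of this sketch."""
    environ = os.environ if environ is None else environ
    sink = environ.get("LOG_SINK", "console")
    if sink not in VALID_SINKS:
        raise ValueError(f"unsupported LOG_SINK: {sink!r}")
    return sink


def destinations(level, sink):
    """Every record goes to the sink; WARN and ERROR are also mirrored to FDR."""
    targets = [sink]
    if level in FDR_MIRRORED_LEVELS:
        targets.append("fdr_log_bridge")
    return targets
```

With `LOG_SINK=journald`, an `INFO` record resolves to `["journald"]` and a `WARN` record to `["journald", "fdr_log_bridge"]`. How the airborne image avoids double-writing when the sink is already `fdr` is not covered here.
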
### "PII" Rules (read: operational secrets)

This system has no end-user PII surface — flights, MAVLink, and tile data are operational rather than personal. The equivalent restrictions are **operational-secret leakage** controls:

- **Never log** MAVLink 2.0 signing key bytes, per-flight onboard signing key bytes, `satellite-provider` API tokens, registry tokens, or Postgres credentials. The `KeySource` Protocol (C8) is the only component that ever holds key material, and its log path emits **only** the rotation event tag + key fingerprint (SHA-256 first 8 bytes), never the key.
- **Mask** absolute file paths in any record that references operator-specific layouts (e.g. `/Users/<operator>/…` collapsed to `~/…`).
- **Never log** raw camera frame bytes or full tile JPEGs inline — they go to sidecar paths via FDR's `failed_tile_thumbnail` (≤ 0.1 Hz rate cap) or `mid_flight_tile_snapshot`.
- **Never log** raw GPS coordinates unless the flight's `restricted_geographic_log_redaction` config is `off` (operator-set at takeoff load).

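
Two of the rules above lend themselves to a short sketch: the key fingerprint and the path mask. Both helpers are hypothetical, and extending the mask to `/home/<operator>/` in addition to the `/Users/<operator>/` example is an assumption of this sketch.

```python
import hashlib
import re


def key_fingerprint(key_bytes: bytes) -> str:
    """Hex of the first 8 bytes of SHA-256(key); loggable, unlike the key."""
    return hashlib.sha256(key_bytes).digest()[:8].hex()


# Matches an operator-specific home prefix at the start of an absolute path.
_HOME_PREFIX = re.compile(r"^/(?:Users|home)/[^/]+")


def mask_home(path: str) -> str:
    """Collapse `/Users/<operator>/...` (or `/home/<operator>/...`) to `~/...`."""
    return _HOME_PREFIX.sub("~", path)
```

A rotation log line would then carry `key_fingerprint(key)` in its `kv` bag, and any path field would pass through `mask_home` before serialization. Paths outside a home directory, such as `/var/lib/gps-denied/fdr`, are left unchanged.
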
## Telemetry (FDR-based, not Prometheus)

### Why FDR, not Prometheus / OTel

The airborne image runs under NFT-SEC-05 (in-flight egress lockdown — no inbound listeners, outbound only to the FC over UART/USB and to QGroundControl as a 1–2 Hz downsampled MAVLink 2.0 summary). A `/metrics` HTTP endpoint would violate this, and a push-mode OTel exporter has no in-flight collector to reach. The FDR ring is the canonical telemetry sink; post-flight tooling converts FDR records into whatever observability backend the operator prefers (Grafana, Jupyter, ad-hoc scripts).

The **operator workstation** is *not* in-flight-locked-down; cycle-2 may add a Prometheus `/metrics` endpoint on the `operator-orchestrator` service (see "Future Work" below). Cycle-1 leaves both the operator-orchestrator and the airborne side on the FDR + structured logs path for consistency.

### FDR Record Kinds (cycle-1 metrics surface)

Source of truth: `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` v1.3.0. Each `kind` is the metric.

| Metric (FDR `kind`) | Producer | Type (intent) | What it tells the operator |
|---------------------|----------|----------------|----------------------------|
| `vio.tick` | C1 | per-frame snapshot | VIO output (`R`, `t`), pose covariance proxies, last-anchor age, monocular reproj error, IMU bias norm |
| `state.tick` | C5 | per-frame snapshot | Smoothed fused-pose tick from iSAM2 (or ESKF baseline) + 2x2 covariance + estimator label |
| `tile_match` | C2.5 / C3 | per-match snapshot | Tile id, VPR score, match count, RANSAC inlier count |
| `c6.write` | C6 | counter-ish (per-tile) | Successful `write_tile` — tile id, source, disk bytes, content SHA-256 |
| `c6.write_failed` | C6 | counter-ish (per-failure) | Failed `write_tile` — `reason ∈ {content_hash_mismatch, freshness_reject, metadata_error, fs_error}` |
| `c6.freshness.rejected` | C6 | counter-ish (per-reject) | Active-conflict-stale tile rejected — `tile_id`, `age_seconds`, threshold |
| `c6.freshness.downgraded` | C6 | counter-ish (per-downgrade) | Stable-rear-stale tile downgraded — same shape as rejected |
| `c6.eviction_batch` | C6 | batch counter (per sweep) | Cache budget enforcer evicted N tiles to make room — trigger tile, freed bytes, count, first 5 evicted ids |
| `overrun` | `shared.fdr_client` | counter (per drop) | FDR ring overrun — `producer_id` of the originating queue + dropped count (`> 0`). AC-NEW-3: never silent. |
| `segment_rollover` | C13 writer | counter (per rotation) | Segment file rotated (including 64 GB cap drops) |
| `failed_tile_thumbnail` | C6 / C11 | rate-capped sample | Forensic JPEG thumbnail (≤ 0.1 Hz). AC-8.5 |
| `mid_flight_tile_snapshot` | C13 snapshot path | sample pointer | Mid-flight tile snapshot pointer (sidecar). AC-8.4 |
| `flight_header` | C13 writer | once-per-flight | `flight_id`, start ISO/monotonic, config snapshot, signing-key rotation event, manifest content hashes, build info |
| `flight_footer` | C13 writer | once-per-flight | `flight_id`, end ISO/monotonic, records written / dropped (overrun) / bytes / rollover count / clean-shutdown flag |

### Device Telemetry (`jetson-stats` / `jtop`)

`D-CROSS-LATENCY-1` requires runtime thermal + power + GPU clock telemetry to drive the auto-degrade hybrid trigger (frame deadline missed × thermal headroom). Cycle-1 source: `jetson-stats` (`jtop`) accessed inside the `companion-jetson` container via `runtime: nvidia` + the nvidia-container-runtime device passthrough — same pattern the suite's `detections` service uses on the same hardware.

| Signal | Source | Sample rate | Consumer |
|--------|--------|-------------|----------|
| GPU clock (MHz) | `jtop.gpu` | 1 Hz | C7 (degrade gate); recorded into FDR via `c7.device_telemetry` log records (`kind="c7.thermal_headroom"`) |
| GPU/CPU temperature (°C) | `jtop.temperature` | 1 Hz | C4 / C7 hybrid trigger |
| Power draw (mW) | `jtop.power` | 1 Hz | Cycle-2 derate hysteresis |
| Memory pressure | `jtop.memory` | 1 Hz | C6 eviction batch hysteresis |

Cycle-1: `jtop` runs in-process inside the companion container; samples are emitted as FDR `kind="c7.thermal_headroom"` records. Cycle-2 may move this to a sidecar Python thread once the Step 2 BLOCKING gate "`jetson-stats` thermal telemetry under Docker" (`containerization.md` § Step 2 Validation Gates) is signed off on the real Tier-2 Jetson.

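
The hybrid trigger (frame deadline missed × thermal headroom) can be read as a conjunction of two conditions. The sketch below assumes that reading and works on a sample already read from `jtop`. `DeviceSample`, `should_degrade`, and both threshold values are hypothetical; the real `D-CROSS-LATENCY-1` thresholds are not given in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceSample:
    """One 1 Hz device sample (field names are illustrative)."""
    gpu_temp_c: float
    gpu_clock_mhz: float


def thermal_headroom_c(sample, throttle_temp_c=85.0):
    """Degrees Celsius left before the assumed throttle temperature."""
    return throttle_temp_c - sample.gpu_temp_c


def should_degrade(deadline_missed, sample,
                   min_headroom_c=10.0, throttle_temp_c=85.0):
    """Degrade only when a deadline was missed AND headroom is low."""
    headroom = thermal_headroom_c(sample, throttle_temp_c)
    return bool(deadline_missed) and headroom < min_headroom_c
```

Requiring both conditions keeps a single slow frame on a cool device, or a warm device that still meets its deadlines, from switching the C7 backend.
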
### Collection Interval

| Source | Interval |
|--------|----------|
| Per-frame producers (C1 `vio.tick`, C5 `state.tick`, C3 `tile_match`) | Camera frame cadence (target ≥ 4 Hz on Tier-2; per `_docs/02_document/architecture.md` Vision) |
| Per-write producers (C6 `c6.write`, `c6.write_failed`, `c6.freshness.*`) | Per-event (write-path triggered) |
| Per-batch producers (C6 `c6.eviction_batch`) | Per-sweep (only when ≥ 1 tile evicted) |
| `jetson-stats` (`jtop`) | 1 Hz |
| `flight_header` / `flight_footer` | Once per flight |
| `segment_rollover` | Per segment rotation |

There is no Prometheus-style "scrape interval" because there is no scraping endpoint — the FDR ring is push-only from producers, drained by C13's writer thread.

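
The push-and-drain model, together with the "never silent" overrun rule (AC-NEW-3), can be illustrated with a bounded queue that drops the oldest record and reports the drop. This is a single-threaded sketch, not the lock-free SPSC ring the project uses. `DropOldestRing` and the `dropped` field name are assumptions.

```python
from collections import deque


class DropOldestRing:
    """Bounded producer queue that drops the oldest record when full."""

    def __init__(self, producer_id, capacity):
        self.producer_id = producer_id
        self._capacity = capacity
        self._buf = deque()
        self._dropped = 0

    def push(self, record):
        """Producer side: never blocks; evicts the oldest record if full."""
        if len(self._buf) >= self._capacity:
            self._buf.popleft()
            self._dropped += 1
        self._buf.append(record)

    def drain(self):
        """Writer side: pending records, led by an overrun record if any dropped."""
        out = []
        if self._dropped:
            out.append({"kind": "overrun",
                        "producer_id": self.producer_id,
                        "dropped": self._dropped})
            self._dropped = 0
        out.extend(self._buf)
        self._buf.clear()
        return out
```

Because the drop count is carried in an explicit `overrun` record, post-flight tooling can tell a quiet producer apart from one that lost data.
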
## Distributed Tracing

### Architecture stance (cycle-1)

**No W3C Trace Context. No OpenTelemetry SDK.** The airborne image's correlation key is the pair `(flight_id, frame_id)`:

- `flight_id` (UUID) is assigned at flight open by C13 and written into `flight_header`. Every log record and FDR record within that flight carries it.
- `frame_id` (monotonic per-frame counter) is assigned by the composition root's frame pipeline. Every per-frame FDR record (`vio.tick`, `state.tick`, `tile_match`, `c6.write` …) carries it.

This is sufficient because the airborne pipeline is **in-process, single-camera, single-FC** — there are no inter-service RPC hops to trace. Post-flight tooling reconstructs the per-frame causal chain by joining FDR records on `(flight_id, frame_id)`.

The **operator workstation** has more conventional inter-service traffic (C12 ↔ `flights` REST, C11 ↔ `satellite-provider` REST). Cycle-1 traces these by:

- Per-request log records with the request URL + status + duration_ms + a generated `correlation_id`.
- `FlightsApiClient` and the `satellite-provider` HTTP client both stamp this correlation id on the request line + response log.

OpenTelemetry SDK + W3C Trace Context propagation is a **cycle-2 polish item** for the operator-orchestrator only — not for the airborne image. Logged in "Future Work" below.

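
The `(flight_id, frame_id)` join used by post-flight tooling can be sketched as a grouping pass over already-parsed records. The function name and the dict-based record shape are assumptions; the real parser follows `fdr_record_schema` v1.3.0.

```python
def join_by_frame(records):
    """Group parsed FDR records by (flight_id, frame_id), then by kind.

    Records without a frame_id (for example flight_header) are skipped.
    """
    frames = {}
    for rec in records:
        if rec.get("frame_id") is None:
            continue
        key = (rec["flight_id"], rec["frame_id"])
        # A frame may carry several records of one kind, so keep a list.
        frames.setdefault(key, {}).setdefault(rec["kind"], []).append(rec)
    return frames
```

Looking up `frames[(flight_id, frame_id)]` then gives the `vio.tick`, `tile_match`, and `state.tick` records that belong to one frame, which is the per-frame causal chain described above.
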
### Sampling

| Environment | Effective sampling rate | Rationale |
|-------------|--------------------------|-----------|
| Development | 100% | FDR + logs both on |
| Staging (lab Jetson) | 100% | Full visibility for IT-12 / NFT-PERF runs |
| Production — airborne | 100% per-frame for `vio.tick`/`state.tick`/`tile_match`; `failed_tile_thumbnail` rate-capped at ≤ 0.1 Hz | FDR ring is the only post-landing forensic record; full per-frame capture is mandatory. Rate caps live on byte-heavy forensic records only. |
| Production — operator workstation | 100% INFO+; DEBUG off | Operator workstation has ample disk space; storage cost is not a concern. |

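
A rate cap such as the ≤ 0.1 Hz limit on `failed_tile_thumbnail` amounts to a minimum spacing between emissions (10 s at 0.1 Hz). A minimal sketch follows, with a hypothetical `RateCap` class that takes the current time as an argument so it stays deterministic; a caller could pass `time.monotonic()`.

```python
class RateCap:
    """Allow at most `max_hz` emissions per second, on average spacing."""

    def __init__(self, max_hz):
        self._min_interval_s = 1.0 / max_hz
        self._last_emit_s = None

    def allow(self, now_s):
        """Return True and record the emission if enough time has passed."""
        if (self._last_emit_s is not None
                and now_s - self._last_emit_s < self._min_interval_s):
            return False
        self._last_emit_s = now_s
        return True
```

Suppressed samples are simply skipped in this sketch; whether the project counts or logs suppressed thumbnails is not stated in the source.
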
## Alerting

### Airborne (in-flight)

**No real-time alerting from the airborne image.** Autonomy: the FC handles in-flight failsafe (`SAFE_DEAD_RECKONING`, `RTL`, `LAND` etc. per AC-FC-FAILSAFE-1). The companion does not have a network path to a human operator in flight — its only outbound channel is the MAVLink 2.0 1–2 Hz downsampled summary to QGroundControl, which surfaces companion health via STATUSTEXT messages and the parent suite's `GpsDeniedHealth` MAVLink message.

Alert-equivalents on the airborne side:

| Event | Detected by | In-flight signal |
|-------|-------------|------------------|
| Companion process died | FC adapter watchdog timeout | FC drops to `SAFE_DEAD_RECKONING`; operator sees lost telemetry in QGC |
| `D-CROSS-LATENCY-1` deadline miss + thermal headroom low | C4 / C7 hybrid trigger | Auto-degrade to lower-cost C7 backend; STATUSTEXT to QGC + FDR `kind="c7.degrade"` |
| C8 signing handshake failed | C8 FC adapter | Refuses takeoff; STATUSTEXT to QGC + FDR `kind="c8.signing_handshake_failed"` |
| FDR ring overrun | `shared.fdr_client` drop-oldest hook | Emits `kind="overrun"` (AC-NEW-3); post-flight forensics tag |
| Segment cap reached (64 GB) | C13 writer | Emits `kind="segment_rollover"` with cap-drop flag; oldest data lost — flag surfaces post-flight |

### Post-Flight (operator workstation)

Post-flight analysis runs the FDR segments through the post-landing tooling. Alerts surface in the operator's environment:

| Severity | Response time | Condition | Cycle-1 channel |
|----------|---------------|-----------|------------------|
| Critical | Pre-next-flight gate (≤ 10 min before takeoff) | `flight_footer.clean_shutdown == false`; `kind="c8.signing_handshake_failed"` observed; FDR overrun count > 0 above per-flight threshold | Operator UI block + Slack `#gps-denied-ops` (cycle-2 once the channel is wired); cycle-1: operator's local terminal output from post-landing tooling |
| High | Same-day | C6 eviction batch > 100 in one flight; tile_match score histogram drifted vs operator baseline | Same as above |
| Medium | Within 1 week | Cumulative thermal-headroom-low events trending up across recent flights | Operator dashboard (cycle-2) |
| Low | Recorded in flight summary only | Non-critical warnings (FDR `kind="log"` at WARN level) | Flight summary PDF / Markdown |

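
The Critical and High rows of the table can be sketched as a single classification pass over one flight's parsed FDR data. The function name, the `records_dropped` footer field, and the default threshold of 0 are assumptions. The tile_match histogram drift check and the cross-flight Medium row need a baseline and are left out.

```python
def classify_flight(footer, kinds_seen, eviction_batches, overrun_threshold=0):
    """Return "critical", "high", or None for one flight.

    footer           -- dict from flight_footer (field names illustrative)
    kinds_seen       -- set of FDR kinds observed in the flight
    eviction_batches -- number of c6.eviction_batch records in the flight
    """
    critical = (
        not footer["clean_shutdown"]
        or "c8.signing_handshake_failed" in kinds_seen
        or footer["records_dropped"] > overrun_threshold
    )
    if critical:
        return "critical"
    if eviction_batches > 100:
        return "high"
    return None
```

In cycle-1 the result would simply be printed by the post-landing tooling in the operator's terminal, matching the channel column above.
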
### CI (Woodpecker pipelines)

| Severity | Response time | Condition | Channel |
|----------|---------------|-----------|---------|
| Critical | Same business day | `01-test.yml` failure on `main` branch | Woodpecker UI; per-repo Slack channel (cycle-2 follow-up — `ci_cd_pipeline.md` Future Work #8) |
| High | Within 24 h | `02-build-push.yml` build failure on any push branch | Woodpecker UI |
| Medium | Next business day | Lint / coverage gate fail (cycle-2; cycle-1 has neither) | n/a in cycle-1 |
| Low | Next sprint review | Non-critical pipeline warnings | n/a |

### Deploy / Update (Watchtower)

| Severity | Response time | Condition | Channel |
|----------|---------------|-----------|---------|
| Critical | Immediate | Watchtower post-update hook emits `AZAION_UPDATE_EVENT severity=error` to journald (image pull failed, container crash on restart) | journald + suite operator's `journalctl -g AZAION_UPDATE_EVENT` audit chain |
| Informational | None | Watchtower applied an update during a non-flight window (`/run/azaion/in-flight` cleared) | `AZAION_UPDATE_EVENT severity=info` to journald — audit only |

## Dashboards

### Operations (cycle-1 — what exists today)

- **Suite Woodpecker UI** — CI pipeline status per branch + commit; the only "live" operations dashboard cycle-1 ships.
- **`jtop` on the bench** — operator runs `sudo jtop` on the lab / airborne Jetson during staging / pre-flight to observe thermal + GPU clock + power. Not a service dashboard; it's a CLI tool.
- **`docker ps` + `docker compose logs`** — the workstation operator's `dev`-environment dashboard.

### Operations (cycle-2 polish, planned)

- **Grafana dashboard** fed by post-landing-parsed FDR records — service health per component (FDR record kinds rolled up into rates), thermal trend, eviction count, tile_match score distribution.
- **Prometheus `/metrics` on operator-orchestrator** — once the operator workstation cycle-2 wires this, the Grafana dashboard pulls live operator-side metrics alongside post-landing FDR rollups.

### Flight Analytics (cycle-1 — what exists today)

- **Per-flight summary** generated by post-landing tooling (Markdown / PDF) — records written / dropped, segment count, top-N error log lines, eviction count, signing-key rotation event log, `flight_footer.clean_shutdown` flag. Stored alongside the FDR segments under `_docs/06_metrics/flights/<flight_id>/` (cycle-2 publishes; cycle-1 staging dir is operator-local).

### Flight Analytics (cycle-2 polish, planned)

- **FDR replay viewer** — interactive timeline of `(flight_id, frame_id)` correlated records.
- **NFT-PERF baseline tracker** — frame deadline miss rate, thermal headroom, end-to-end pose latency tracked across flights.

## Deploy Audit (suite-mandated)

Per `../_infra/ci/README.md` → "OCI image labels and commit provenance (AZ-204)" and `../_infra/deploy/jetson/README.md` → "Audit: what is this device running?":

- Every image (`companion-jetson`, `companion-tier1`, `operator-orchestrator`) is built with:
  - OCI labels: `org.opencontainers.image.revision=$CI_COMMIT_SHA`, `org.opencontainers.image.created=<UTC RFC 3339>`, `org.opencontainers.image.source=$CI_REPO_URL`.
  - `ENV AZAION_SERVICE=gps-denied-onboard` + `ENV AZAION_REVISION=$CI_COMMIT_SHA`.
- Watchtower's post-update hook emits one `AZAION_UPDATE_EVENT` line per applied update into journald, carrying the new revision SHA + service name + timestamp + outcome.
- The operator runs `journalctl -g AZAION_UPDATE_EVENT` on any Jetson to answer "what is this device running and when did it last update?".

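
For scripted audits, the journald lines returned by `journalctl -g AZAION_UPDATE_EVENT` could be parsed along these lines. Only the `AZAION_UPDATE_EVENT severity=...` form appears in this document; the `key=value` layout for the remaining fields and the key names used in the example are assumptions.

```python
import shlex

MARKER = "AZAION_UPDATE_EVENT"


def parse_update_event(line):
    """Parse `... AZAION_UPDATE_EVENT key=value ...` into a dict.

    Returns None when the marker is absent. Tokens without `=` are ignored.
    """
    _, found, tail = line.partition(MARKER)
    if not found:
        return None
    fields = {}
    for token in shlex.split(tail):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# Hypothetical journald line; field names other than `severity` are made up.
example = ("watchtower-hook: AZAION_UPDATE_EVENT severity=info "
           "service=gps-denied-onboard revision=abc1234 outcome=applied")
event = parse_update_event(example)
```

Filtering the parsed events on `severity == "error"` would reproduce the Critical row of the Deploy / Update alert table.
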
## Self-verification

- [x] Structured logging format defined with required fields (timestamp, level, service, component, `flight_id`, `frame_id`, kind, msg, kv, exc)
- [x] Per-environment `LOG_SINK` destination + retention tabulated
- [x] FDR-based metrics surface enumerated (every `fdr_record_schema` v1.3.0 kind mapped to its operator-relevant meaning)
- [x] Device telemetry (`jetson-stats` / `jtop`) source + sample rate + consumer (D-CROSS-LATENCY-1 hybrid trigger)
- [x] Tracing stance recorded — no W3C Trace Context / OTel SDK on airborne (justified by single-process pipeline + NFT-SEC-05); operator-side correlation_id pattern documented; OTel deferred to cycle-2 polish
- [x] Alert severities + response times defined across the four touchpoints: airborne in-flight, post-flight operator workstation, CI, deploy/update audit (`AZAION_UPDATE_EVENT`)
- [x] Operational-secret leakage controls in place (no key bytes / API tokens / Postgres credentials in logs; `KeySource` is the only key holder)
- [x] Dashboards inventoried — cycle-1 reality (Woodpecker UI, `jtop`, post-landing summary) explicit; cycle-2 polish (Grafana, FDR replay viewer, NFT-PERF tracker) logged as follow-ups
- [x] Suite-mandated deploy audit chain (`AZAION_UPDATE_EVENT` + OCI labels + `AZAION_REVISION` env) referenced from `../_infra/` docs

## Future Work (cycle-2 polish)

1. **Prometheus `/metrics` on `operator-orchestrator`** — cycle-2 wires an in-process exporter for operator-workstation-side metrics (`flights` REST round-trip latency, `satellite-provider` download throughput, tile manifest content-hash failures). The airborne image stays off this path per NFT-SEC-05.
2. **Grafana dashboard fed by post-landing-parsed FDR rollups** — single pane of glass for per-flight + cross-flight trends.
3. **OpenTelemetry SDK on `operator-orchestrator` only** — instruments `FlightsApiClient` + `satellite-provider` HTTP client with W3C Trace Context propagation. Out of scope for airborne.
4. **Per-repo Slack channel (`#gps-denied-ci` for CI, `#gps-denied-ops` for post-flight)** — `ci_cd_pipeline.md` Future Work #8 already logs the CI half; this doc adds the ops half.
5. **FDR replay viewer** — interactive timeline of `(flight_id, frame_id)` correlated records; consumes FDR segments via the `fdr_record_schema` v1.3.0 parser.
6. **NFT-PERF baseline tracker** — automated frame-deadline-miss-rate + thermal-headroom + end-to-end pose latency trending across flights, gated by AZ-595 SITL replay fixture + AZ-592/AZ-593 Tier-2 OKVIS2/VINS-Mono wiring.
7. **Centralised log aggregator on the operator workstation** — Loki / journald-export-to-cloud once the operator network egress allows it; cycle-1 leaves journald at host-default retention.