Oleksandr Bezdieniezhnykh bf13549b32
[autodev] Update configuration and documentation for cycle-1
- Enhanced `.env.example` with detailed CMake build flags and replay-mode strategy flags for development and CI environments.
- Updated `.gitignore` to include a new deploy rollback bookmark.
- Revised `_docs/_autodev_state.md` to reflect the current task status and steps.
- Added new lessons to `_docs/LESSONS.md` regarding testing and architectural improvements.
- Documented changes in `_docs/02_document/deployment/ci_cd_pipeline.md` to reflect the relaxed OpenCV version pin.
- Updated test data documentation in `_docs/02_document/tests/test-data.md` to clarify fixture usage and paths.

This commit continues the cycle-1 documentation sync and addresses various configuration updates for improved clarity and functionality.
2026-05-20 08:05:35 +03:00

# GPS-Denied Onboard — Observability
> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 5. Builds on
> Step 1 (`reports/deploy_status_report.md`), Step 2 (`containerization.md`),
> Step 3 (`ci_cd_pipeline.md`), and Step 4 (`environment_strategy.md`). The
> deploy skill's standard observability template (Prometheus `/metrics` +
> OpenTelemetry + PagerDuty) is adapted here for an airborne autonomous
> system: the airborne image has **no inbound listeners** (NFT-SEC-05
> in-flight egress lockdown), so the canonical observability surface is the
> on-device **Flight Data Recorder (FDR)** binary ring buffer, replayed
> off-flight by post-landing tooling. Operator workstation + CI keep the
> conventional logging-to-stdout / journald patterns.
## Observability Architecture (one-paragraph)
The airborne image (`companion-jetson` / `companion-tier1`) writes
**structured FDR records** to a 64 GB ring buffer (`/var/lib/gps-denied/fdr`)
via the `shared_fdr_client` (`producer → SPSC ring → C13 writer`). Logs at
`WARN` and above are forwarded into FDR as `kind="log"` records by the
`fdr_log_bridge` (AZ-267); below-WARN logs go to `LOG_SINK` (`console` in
dev, `journald` on the operator workstation, `fdr` on airborne — never to
file). Telemetry is captured as kind-specific FDR records (`vio.tick`,
`state.tick`, `tile_match`, `c6.write`, `c6.eviction_batch`, etc.) rather
than via a Prometheus endpoint, because no inbound TCP is permitted in
flight. Post-flight tooling on the operator workstation parses the FDR
segments using the **frozen, versioned `fdr_record_schema` v1.3.0** and
feeds Grafana / Jupyter / one-off scripts. The suite-mandated
**`AZAION_UPDATE_EVENT` journald audit chain** + OCI image labels
(`org.opencontainers.image.revision/created/source`) + `ENV
AZAION_REVISION=$CI_COMMIT_SHA` form the deploy-side audit trail (AZ-204).
**`jetson-stats` (`jtop`) device telemetry** (thermal zones, CPU/GPU
clocks, power rails) is sampled by C7 + C4 to drive the
`D-CROSS-LATENCY-1` auto-degrade hybrid trigger; samples land in FDR
alongside the matcher / pose ticks.
## Logging
### Format
Structured records to `LOG_SINK`. No file-based logging in containers.
The `LOG_SINK` env var (Step 4) selects the destination per environment.
#### Common log envelope (per-record fields)
Source of truth: `_docs/02_document/contracts/shared_log_bridge/log_record_schema.md` v1.0.0 — referenced by the `fdr_log_bridge` (AZ-267). Every onboard log record carries:
```json
{
"timestamp": "2026-05-10T03:14:15.123456Z",
"level": "INFO",
"service": "gps-denied-onboard",
"component": "c2_vpr",
"flight_id": "<uuid>",
"frame_id": 12345,
"kind": "vpr.warmup",
"msg": "loaded",
"kv": {"model": "salad"},
"exc": null
}
```
| Field | Purpose | Notes |
|-------|---------|-------|
| `timestamp` | ISO 8601 UTC, microsecond precision | RFC 3339 with `Z` suffix |
| `level` | `DEBUG \| INFO \| WARN \| ERROR` | `WARN` + `ERROR` are also mirrored into FDR via `fdr_log_bridge` |
| `service` | `gps-denied-onboard` | Constant per submodule |
| `component` | Module slug from `module-layout.md` (`c2_vpr`, `c6_tile_cache.store`, `shared.fdr_client`, …) | Matches `producer_id` on the corresponding FDR record |
| `flight_id` | UUID assigned at flight open by C13 (`flight_header`) | Correlation across all components within one flight |
| `frame_id` | Monotonic per-frame counter from `runtime_root` | Cross-component frame correlation (VIO ↔ matcher ↔ state) |
| `kind` | Dotted snake_case event tag (closed enum per component) | E.g. `vpr.warmup`, `c6.evict.budget`, `c8.signing_key_rotation` |
| `msg` | Short human-readable event description | No PII; no secrets; no file payloads |
| `kv` | Bag of typed scalars | JSON-safe; no nested blobs > 4 KiB |
| `exc` | Optional exception class + traceback | Present only on `ERROR`; truncated to 4 KiB |
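A minimal sketch of how a record matching the envelope above could be assembled. The helper name `make_log_record` and its signature are hypothetical (they are not taken from `log_record_schema.md`); only the field names, the level set, the timestamp shape, and the 4 KiB `exc` cap come from the table.

```python
import json
from datetime import datetime, timezone

MAX_EXC_BYTES = 4 * 1024  # "truncated to 4 KiB" per the envelope table
LEVELS = ("DEBUG", "INFO", "WARN", "ERROR")


def make_log_record(level, component, flight_id, frame_id, kind, msg,
                    kv=None, exc=None, now=None):
    """Build one envelope-shaped record as a plain dict (hypothetical helper)."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    if exc is not None:
        # Truncate on a byte budget and drop any partial trailing character.
        exc = exc.encode("utf-8")[:MAX_EXC_BYTES].decode("utf-8", errors="ignore")
    return {
        # RFC 3339, UTC, microsecond precision, `Z` suffix.
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
        "level": level,
        "service": "gps-denied-onboard",
        "component": component,
        "flight_id": flight_id,
        "frame_id": frame_id,
        "kind": kind,
        "msg": msg,
        "kv": dict(kv or {}),
        # `exc` is present only on ERROR records.
        "exc": exc if level == "ERROR" else None,
    }


if __name__ == "__main__":
    record = make_log_record("INFO", "c2_vpr", "<uuid>", 12345,
                             "vpr.warmup", "loaded", kv={"model": "salad"})
    print(json.dumps(record))
```

Passing `now` explicitly keeps the helper deterministic in tests; production code would normally let it default to the current UTC time.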
### Log Levels
| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring offline review | `c5.solver.diverged`, `c8.signing_handshake_failed`, `c6.write_failed` |
| WARN | Degraded operation, retry, fallback engaged | `c4.pose.degraded_to_pnp`, `c6.freshness.rejected`, `c7.tensorrt_engine_rebuild` |
| INFO | Significant in-flight business events | `c8.signing_key_rotation`, `flight_header`, `flight_footer`, `c11.upload_batch_queued` |
| DEBUG | Detailed diagnostics (dev only) | Per-frame VIO covariance dump, full matcher correspondences list |

`WARN` + `ERROR` are mirrored into FDR via `fdr_log_bridge` (AZ-267) so they survive a post-landing `journalctl` clear. `INFO` + `DEBUG` go only to `LOG_SINK`.
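The routing rule can be sketched as a small pure function. The function name `sinks_for` and the numeric level ordering are assumptions made for illustration; the actual `fdr_log_bridge` interface is not described in this document.

```python
# Numeric ordering is an assumption; only the four level names come from the doc.
LEVEL_ORDER = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}


def sinks_for(level: str) -> list[str]:
    """Return the destinations a record at `level` is written to."""
    sinks = ["LOG_SINK"]  # every record goes to the configured LOG_SINK
    if LEVEL_ORDER[level] >= LEVEL_ORDER["WARN"]:
        sinks.append("FDR")  # WARN and ERROR are also mirrored into FDR
    return sinks
```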
### Destinations and Retention
| Environment | `LOG_SINK` | Destination | Retention |
|-------------|------------|-------------|-----------|
| Development (Tier-1 Docker) | `console` | Docker container stdout (`docker compose logs companion`) | Session — cleared on `docker compose down` |
| CI (Woodpecker) | `console` | Woodpecker UI stdout capture | Per the suite Woodpecker retention policy (operator-managed; today ≤ 30 days) |
| Staging (lab Jetson) | `journald` | Host journald | Per the host's `journald.conf` (suite default: ~7 days rolling) |
| Production — airborne | `fdr` | FDR ring buffer at `/var/lib/gps-denied/fdr` (≥ 64 GB) | Bounded by ring capacity; rolls over per `segment_rollover` FDR record. Post-flight operator pulls segments to long-term storage on the operator workstation. |
| Production — operator workstation | `journald` | Host journald | Per the host's `journald.conf` (operator-managed; recommendation: 30 days for the operator-orchestrator service unit) |
### "PII" Rules (read: operational secrets)
This system has no end-user PII surface — flights, MAVLink, and tile data are operational rather than personal. The equivalent restrictions are **operational-secret leakage** controls:
- **Never log** MAVLink 2.0 signing key bytes, per-flight onboard signing key bytes, `satellite-provider` API tokens, registry tokens, or Postgres credentials. The `KeySource` Protocol (C8) is the only component that ever holds key material, and its log path emits **only** the rotation event tag + key fingerprint (SHA-256 first 8 bytes), never the key.
- **Mask** absolute file paths in any record that references operator-specific layouts (e.g. `/Users/<operator>/…` collapsed to `~/…`).
- **Never log** raw camera frame bytes or full tile JPEGs inline — they go to sidecar paths via FDR's `failed_tile_thumbnail` (≤ 0.1 Hz rate cap) or `mid_flight_tile_snapshot`.
- **Never log** raw GPS coordinates unless the flight's `restricted_geographic_log_redaction` config is `off` (operator-set at takeoff load).
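Two of the rules above lend themselves to a short sketch: the key fingerprint (first 8 bytes of SHA-256) and path masking. The function names are hypothetical, and treating `/home/<user>` the same as `/Users/<operator>` is an assumption; the document only names the `/Users/<operator>/…` case.

```python
import hashlib
import re


def key_fingerprint(key_bytes: bytes) -> str:
    """Hex of the first 8 bytes of SHA-256(key); safe to log in place of the key."""
    return hashlib.sha256(key_bytes).digest()[:8].hex()


# Matches an operator home directory prefix such as /Users/alice or /home/alice.
_HOME_PREFIX = re.compile(r"/(?:Users|home)/[^/\s]+")


def mask_home_paths(text: str) -> str:
    """Collapse operator-specific home prefixes to `~` before logging."""
    return _HOME_PREFIX.sub("~", text)


if __name__ == "__main__":
    print(key_fingerprint(b"example-key-material"))
    print(mask_home_paths("loaded /Users/alice/fdr/segment_0001.bin"))
```

A fingerprint lets an operator confirm which key was active during a rotation event without the log ever containing material that could reconstruct it.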
## Telemetry (FDR-based, not Prometheus)
### Why FDR, not Prometheus / OTel
The airborne image runs under NFT-SEC-05 (in-flight egress lockdown: no inbound listeners; outbound only to the FC over UART/USB and to QGroundControl via the 12 Hz downsampled MAVLink 2.0 summary). A `/metrics` HTTP endpoint would violate this, and a push-mode OTel exporter has no in-flight collector to reach. The FDR ring is the canonical telemetry sink; post-flight tooling converts FDR records into whatever observability backend the operator prefers (Grafana, Jupyter, ad-hoc scripts).
The **operator workstation** is *not* in-flight-locked-down; cycle-2 may add a Prometheus `/metrics` endpoint on the `operator-orchestrator` service (see "Future Work" below). Cycle-1 leaves both the operator-orchestrator and airborne side on the FDR + structured logs path for consistency.
### FDR Record Kinds (cycle-1 metrics surface)
Source of truth: `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` v1.3.0. Each `kind` is the metric.
| Metric (FDR `kind`) | Producer | Type (intent) | What it tells the operator |
|---------------------|----------|----------------|----------------------------|
| `vio.tick` | C1 | per-frame snapshot | VIO output (`R`, `t`), pose covariance proxies, last-anchor age, monocular reproj error, IMU bias norm |
| `state.tick` | C5 | per-frame snapshot | Smoothed fused-pose tick from iSAM2 (or ESKF baseline) + 2x2 covariance + estimator label |
| `tile_match` | C2.5 / C3 | per-match snapshot | Tile id, VPR score, match count, RANSAC inlier count |
| `c6.write` | C6 | counter-ish (per-tile) | Successful `write_tile` — tile id, source, disk bytes, content SHA-256 |
| `c6.write_failed` | C6 | counter-ish (per-failure) | Failed `write_tile`: `reason ∈ {content_hash_mismatch, freshness_reject, metadata_error, fs_error}` |
| `c6.freshness.rejected` | C6 | counter-ish (per-reject) | Active-conflict-stale tile rejected — `tile_id`, `age_seconds`, threshold |
| `c6.freshness.downgraded` | C6 | counter-ish (per-downgrade) | Stable-rear-stale tile downgraded — same shape as rejected |
| `c6.eviction_batch` | C6 | batch counter (per sweep) | Cache budget enforcer evicted N tiles to make room — trigger tile, freed bytes, count, first 5 evicted ids |
| `overrun` | `shared.fdr_client` | counter (per drop) | FDR ring overrun — `producer_id` of the originating queue + dropped count (`> 0`). AC-NEW-3: never silent. |
| `segment_rollover` | C13 writer | counter (per rotation) | Segment file rotated (including 64 GB cap drops) |
| `failed_tile_thumbnail` | C6 / C11 | rate-capped sample | Forensic JPEG thumbnail (≤ 0.1 Hz). AC-8.5 |
| `mid_flight_tile_snapshot` | C13 snapshot path | sample pointer | Mid-flight tile snapshot pointer (sidecar). AC-8.4 |
| `flight_header` | C13 writer | once-per-flight | `flight_id`, start ISO/monotonic, config snapshot, signing-key rotation event, manifest content hashes, build info |
| `flight_footer` | C13 writer | once-per-flight | `flight_id`, end ISO/monotonic, records written / dropped (overrun) / bytes / rollover count / clean-shutdown flag |
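To illustrate the "never silent" overrun behaviour (AC-NEW-3), here is a simplified sketch of a drop-oldest buffer that reports its drops as an `overrun` record. It is a plain single-threaded `deque`, not the SPSC ring used by `shared_fdr_client`; the class name and the `dropped` field name are assumptions.

```python
from collections import deque


class DropOldestRing:
    """Bounded buffer that drops the oldest record when full and counts the drops."""

    def __init__(self, capacity: int, producer_id: str):
        self._buf = deque()
        self._capacity = capacity
        self._producer_id = producer_id
        self._dropped = 0

    def push(self, record: dict) -> None:
        if len(self._buf) >= self._capacity:
            self._buf.popleft()  # drop-oldest policy
            self._dropped += 1
        self._buf.append(record)

    def drain(self) -> list:
        """Return buffered records, preceded by an `overrun` record if drops occurred."""
        out = []
        if self._dropped > 0:
            out.append({
                "kind": "overrun",
                "producer_id": self._producer_id,
                "dropped": self._dropped,  # always > 0 when emitted
            })
            self._dropped = 0
        out.extend(self._buf)
        self._buf.clear()
        return out
```

Emitting the drop count as an ordinary record means post-flight tooling sees data loss in the same stream it already parses, rather than having to infer it from gaps.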
### Device Telemetry (`jetson-stats` / `jtop`)
`D-CROSS-LATENCY-1` requires runtime thermal + power + GPU clock telemetry to drive the auto-degrade hybrid trigger (frame deadline missed × thermal headroom). Cycle-1 source: `jetson-stats` (`jtop`) accessed inside the `companion-jetson` container via `runtime: nvidia` + the nvidia-container-runtime device passthrough — same pattern the suite's `detections` service uses on the same hardware.
| Signal | Source | Sample rate | Consumer |
|--------|--------|-------------|----------|
| GPU clock (MHz) | `jtop.gpu` | 1 Hz | C7 (degrade gate); recorded into FDR via `c7.device_telemetry` log records (`kind="c7.thermal_headroom"`) |
| GPU/CPU temperature (°C) | `jtop.temperature` | 1 Hz | C4 / C7 hybrid trigger |
| Power draw (mW) | `jtop.power` | 1 Hz | Cycle-2 derate hysteresis |
| Memory pressure | `jtop.memory` | 1 Hz | C6 eviction batch hysteresis |

Cycle-1: `jtop` runs in-process inside the companion container; samples are emitted as FDR `kind="c7.thermal_headroom"` records. Cycle-2 may move this to a sidecar Python thread once the Step 2 BLOCKING gate "`jetson-stats` thermal telemetry under Docker" (`containerization.md` § Step 2 Validation Gates) is signed off on the real Tier-2 Jetson.
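The hybrid trigger (frame deadline missed combined with low thermal headroom) might look like the sketch below. The throttle limit, the minimum-headroom margin, the function names, and the `kv` field names are all illustrative assumptions; the real thresholds belong to `D-CROSS-LATENCY-1`. The sketch deliberately takes already-sampled values as inputs instead of calling `jtop`, so it runs without Jetson hardware.

```python
def thermal_headroom_c(temp_c: float, throttle_limit_c: float) -> float:
    """Degrees Celsius remaining before the (assumed) throttle limit."""
    return throttle_limit_c - temp_c


def should_degrade(deadline_missed: bool, temp_c: float,
                   throttle_limit_c: float = 95.0,   # assumed limit
                   min_headroom_c: float = 10.0) -> bool:  # assumed margin
    """Degrade only when a frame deadline was missed AND headroom is low."""
    low_headroom = thermal_headroom_c(temp_c, throttle_limit_c) < min_headroom_c
    return deadline_missed and low_headroom


def headroom_record(flight_id: str, frame_id: int, temp_c: float,
                    gpu_mhz: float, throttle_limit_c: float = 95.0) -> dict:
    """Shape one 1 Hz sample as a `c7.thermal_headroom` record (kv names assumed)."""
    return {
        "kind": "c7.thermal_headroom",
        "flight_id": flight_id,
        "frame_id": frame_id,
        "kv": {
            "temp_c": temp_c,
            "gpu_mhz": gpu_mhz,
            "headroom_c": thermal_headroom_c(temp_c, throttle_limit_c),
        },
    }
```

Requiring both conditions avoids degrading on a single slow frame while the device is cool, and avoids degrading on a warm device that is still meeting its deadlines.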
### Collection Interval
| Source | Interval |
|--------|----------|
| Per-frame producers (C1 `vio.tick`, C5 `state.tick`, C2.5 / C3 `tile_match`) | Camera frame cadence (target ≥ 4 Hz on Tier-2; per `_docs/02_document/architecture.md` Vision) |
| Per-write producers (C6 `c6.write`, `c6.write_failed`, `c6.freshness.*`) | Per-event (write-path triggered) |
| Per-batch producers (C6 `c6.eviction_batch`) | Per-sweep (only when ≥ 1 tile evicted) |
| `jetson-stats` (`jtop`) | 1 Hz |
| `flight_header` / `flight_footer` | Once per flight |
| `segment_rollover` | Per segment rotation |

There is no Prometheus-style "scrape interval" because there is no scraping endpoint — the FDR ring is push-only from producers, drained by C13's writer thread.
## Distributed Tracing
### Architecture stance (cycle-1)
**No W3C Trace Context. No OpenTelemetry SDK.** The airborne image's correlation key is the pair `(flight_id, frame_id)`:
- `flight_id` (UUID) is assigned at flight open by C13 and written into `flight_header`. Every log record and FDR record within that flight carries it.
- `frame_id` (monotonic per-frame counter) is assigned by the composition root's frame pipeline. Every per-frame FDR record (`vio.tick`, `state.tick`, `tile_match`, `c6.write` …) carries it.
This is sufficient because the airborne pipeline is **in-process, single-camera, single-FC** — there are no inter-service RPC hops to trace. Post-flight tooling reconstructs the per-frame causal chain by joining FDR records on `(flight_id, frame_id)`.
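The post-flight join on `(flight_id, frame_id)` can be sketched as a simple grouping pass. This assumes FDR records have already been decoded into dicts carrying `flight_id`, `frame_id`, and `kind` keys; the function name `group_by_frame` is hypothetical and the real parser lives behind `fdr_record_schema`.

```python
from collections import defaultdict


def group_by_frame(records):
    """Group decoded records by (flight_id, frame_id), then by kind.

    Records without a frame_id (e.g. flight_header) are skipped.
    """
    frames = defaultdict(lambda: defaultdict(list))
    for rec in records:
        if rec.get("frame_id") is None:
            continue
        key = (rec["flight_id"], rec["frame_id"])
        frames[key][rec["kind"]].append(rec)
    # Convert to plain dicts so callers do not create keys by accident.
    return {key: dict(kinds) for key, kinds in frames.items()}


if __name__ == "__main__":
    sample = [
        {"flight_id": "f-1", "frame_id": 10, "kind": "vio.tick"},
        {"flight_id": "f-1", "frame_id": 10, "kind": "tile_match"},
        {"flight_id": "f-1", "frame_id": 10, "kind": "state.tick"},
        {"flight_id": "f-1", "kind": "flight_header"},
    ]
    for key, kinds in group_by_frame(sample).items():
        print(key, sorted(kinds))
```

Each kind maps to a list because a single frame may produce more than one record of the same kind, for example several `tile_match` candidates.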
The **operator workstation** has more conventional inter-service traffic (C12 ↔ `flights` REST, C11 ↔ `satellite-provider` REST). Cycle-1 traces these by:
- Per-request log records with the request URL + status + duration_ms + a generated `correlation_id`.
- `FlightsApiClient` and the `satellite-provider` HTTP client both stamp this correlation id on the request line + response log.
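As a rough illustration of the operator-side pattern, the wrapper below stamps one generated `correlation_id` on both the request and response log records. The function name, the `send` and `log` callables, and the `http.request` / `http.response` kind tags are assumptions; the actual `FlightsApiClient` interface is not shown in this document.

```python
import time
import uuid


def traced_request(send, method: str, url: str, log) -> int:
    """Call `send(method, url)` and log request + response with a shared id.

    `send` returns an HTTP status code; `log` accepts one dict per record.
    Both are injected so the sketch stays independent of any HTTP library.
    """
    correlation_id = uuid.uuid4().hex
    log({"kind": "http.request", "correlation_id": correlation_id,
         "method": method, "url": url})
    start = time.monotonic()
    status = send(method, url)
    duration_ms = (time.monotonic() - start) * 1000.0
    log({"kind": "http.response", "correlation_id": correlation_id,
         "url": url, "status": status, "duration_ms": duration_ms})
    return status
```

Filtering logs on a single `correlation_id` then yields the request line and its matching response, which is what cycle-1 relies on in place of W3C Trace Context.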
OpenTelemetry SDK + W3C Trace Context propagation is a **cycle-2 polish item** for the operator-orchestrator only — not for the airborne image. Logged in "Future Work" below.
### Sampling
| Environment | Effective sampling rate | Rationale |
|-------------|--------------------------|-----------|
| Development | 100% | FDR + logs both on |
| Staging (lab Jetson) | 100% | Full visibility for IT-12 / NFT-PERF runs |
| Production — airborne | 100% per-frame for `vio.tick`/`state.tick`/`tile_match`; `failed_tile_thumbnail` rate-capped at ≤ 0.1 Hz | FDR ring is the only post-landing forensic record; full per-frame capture is mandatory. Rate caps live on byte-heavy forensic records only. |
| Production — operator workstation | 100% INFO+; DEBUG off | The operator workstation has ample disk space, so storage cost is not a concern. |
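The ≤ 0.1 Hz cap on `failed_tile_thumbnail` amounts to a minimum-interval gate. A minimal sketch, with a hypothetical class name and an injected clock value so it can be exercised deterministically:

```python
class RateCap:
    """Allow at most `max_hz` events per second by enforcing a minimum interval."""

    def __init__(self, max_hz: float):
        self._min_interval_s = 1.0 / max_hz
        self._last_s = None  # time of the last allowed event

    def allow(self, now_s: float) -> bool:
        """Return True if an event at monotonic time `now_s` may be emitted."""
        if self._last_s is None or now_s - self._last_s >= self._min_interval_s:
            self._last_s = now_s
            return True
        return False


if __name__ == "__main__":
    cap = RateCap(max_hz=0.1)  # one thumbnail every 10 s at most
    for t in (0.0, 3.0, 9.9, 10.0, 12.0, 20.0):
        print(t, cap.allow(t))
```

In real use `now_s` would come from a monotonic clock such as `time.monotonic()`, so wall-clock adjustments cannot open or close the gate unexpectedly.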
## Alerting
### Airborne (in-flight)
**No real-time alerting from the airborne image.** The FC handles in-flight failsafe autonomously (`SAFE_DEAD_RECKONING`, `RTL`, `LAND` etc. per AC-FC-FAILSAFE-1). The companion has no network path to a human operator in flight: its only outbound channel is the 12 Hz downsampled MAVLink 2.0 summary to QGroundControl, which surfaces companion health via STATUSTEXT messages and the parent suite's `GpsDeniedHealth` MAVLink message.
Alert-equivalents on the airborne side:
| Event | Detected by | In-flight signal |
|-------|-------------|------------------|
| Companion process died | FC adapter watchdog timeout | FC drops to `SAFE_DEAD_RECKONING`; operator sees lost telemetry in QGC |
| `D-CROSS-LATENCY-1` deadline miss + thermal headroom low | C4 / C7 hybrid trigger | Auto-degrade to lower-cost C7 backend; STATUSTEXT to QGC + FDR `kind="c7.degrade"` |
| C8 signing handshake failed | C8 FC adapter | Refuses takeoff; STATUSTEXT to QGC + FDR `kind="c8.signing_handshake_failed"` |
| FDR ring overrun | `shared.fdr_client` drop-oldest hook | Emits `kind="overrun"` (AC-NEW-3); post-flight forensics tag |
| Segment cap reached (64 GB) | C13 writer | Emits `kind="segment_rollover"` with cap-drop flag; oldest data lost — flag surfaces post-flight |
### Post-Flight (operator workstation)
Post-flight analysis runs the FDR segments through the post-landing tooling. Alerts surface in the operator's environment:
| Severity | Response time | Condition | Cycle-1 channel |
|----------|---------------|-----------|------------------|
| Critical | Pre-next-flight gate (≤ 10 min before takeoff) | `flight_footer.clean_shutdown == false`; `kind="c8.signing_handshake_failed"` observed; FDR overrun count > 0 above per-flight threshold | Operator UI block + Slack `#gps-denied-ops` (cycle-2 once the channel is wired); cycle-1: operator's local terminal output from post-landing tooling |
| High | Same-day | C6 eviction batch > 100 in one flight; tile_match score histogram drifted vs operator baseline | Same as above |
| Medium | Within 1 week | Cumulative thermal-headroom-low events trending up across recent flights | Operator dashboard (cycle-2) |
| Low | Recorded in flight summary only | Non-critical warnings (FDR `kind="log"` at WARN level) | Flight summary PDF / Markdown |
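The single-flight rows of the table above can be sketched as a classifier over decoded records. This is an assumption-heavy illustration: the function name, the default thresholds, the `dropped` and `level` field names, and the `"None"` fallback are hypothetical. The Medium row (a cross-flight trend) and the `tile_match` drift check need history beyond one flight and are left out.

```python
def classify_flight(footer: dict, records: list,
                    overrun_threshold: int = 0,      # assumed per-flight threshold
                    eviction_threshold: int = 100) -> str:
    """Return the highest severity triggered by a single flight's records."""
    kinds = [r["kind"] for r in records]
    overruns = sum(r.get("dropped", 0) for r in records if r["kind"] == "overrun")

    # Critical: blocks the next flight.
    if (not footer.get("clean_shutdown", False)
            or "c8.signing_handshake_failed" in kinds
            or overruns > overrun_threshold):
        return "Critical"

    # High: more than `eviction_threshold` eviction batches in one flight.
    if kinds.count("c6.eviction_batch") > eviction_threshold:
        return "High"

    # Low: WARN-level log records only appear in the flight summary.
    if any(r["kind"] == "log" and r.get("level") == "WARN" for r in records):
        return "Low"

    return "None"
```

Checking severities from most to least severe means a flight with both an unclean shutdown and stray warnings is reported once, at the level that actually gates the next takeoff.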
### CI (Woodpecker pipelines)
| Severity | Response time | Condition | Channel |
|----------|---------------|-----------|---------|
| Critical | Same business day | `01-test.yml` failure on `main` branch | Woodpecker UI; per-repo Slack channel (cycle-2 follow-up — `ci_cd_pipeline.md` Future Work #8) |
| High | Within 24 h | `02-build-push.yml` build failure on any push branch | Woodpecker UI |
| Medium | Next business day | Lint / coverage gate fail (cycle-2; cycle-1 has neither) | n/a in cycle-1 |
| Low | Next sprint review | Non-critical pipeline warnings | n/a |
### Deploy / Update (Watchtower)
| Severity | Response time | Condition | Channel |
|----------|---------------|-----------|---------|
| Critical | Immediate | Watchtower post-update hook emits `AZAION_UPDATE_EVENT severity=error` to journald (image pull failed, container crash on restart) | journald + suite operator's `journalctl -g AZAION_UPDATE_EVENT` audit chain |
| Informational | None | Watchtower applied an update during a non-flight window (`/run/azaion/in-flight` cleared) | `AZAION_UPDATE_EVENT severity=info` to journald — audit only |
## Dashboards
### Operations (cycle-1 — what exists today)
- **Suite Woodpecker UI** — CI pipeline status per branch + commit; the only "live" operations dashboard cycle-1 ships.
- **`jtop` on the bench** — operator runs `sudo jtop` on the lab / airborne Jetson during staging / pre-flight to observe thermal + GPU clock + power. Not a service dashboard; it's a CLI tool.
- **`docker ps` + `docker compose logs`**: the operator's `dev`-environment dashboard on the operator workstation.
### Operations (cycle-2 polish, planned)
- **Grafana dashboard** fed by post-landing-parsed FDR records — service health per component (FDR record kinds rolled up into rates), thermal trend, eviction count, tile_match score distribution.
- **Prometheus `/metrics` on operator-orchestrator**: once cycle-2 wires this on the operator workstation, the Grafana dashboard pulls live operator-side metrics alongside post-landing FDR rollups.
### Flight Analytics (cycle-1 — what exists today)
- **Per-flight summary** generated by post-landing tooling (Markdown / PDF) — records written / dropped, segment count, top-N error log lines, eviction count, signing-key rotation event log, `flight_footer.clean_shutdown` flag. Stored alongside the FDR segments under `_docs/06_metrics/flights/<flight_id>/` (cycle-2 publishes; cycle-1 staging dir is operator-local).
### Flight Analytics (cycle-2 polish, planned)
- **FDR replay viewer** — interactive timeline of `(flight_id, frame_id)` correlated records.
- **NFT-PERF baseline tracker** — frame deadline miss rate, thermal headroom, end-to-end pose latency tracked across flights.
## Deploy Audit (suite-mandated)
Per `../_infra/ci/README.md` → "OCI image labels and commit provenance (AZ-204)" and `../_infra/deploy/jetson/README.md` → "Audit: what is this device running?":
- Every image (`companion-jetson`, `companion-tier1`, `operator-orchestrator`) is built with:
- OCI labels: `org.opencontainers.image.revision=$CI_COMMIT_SHA`, `org.opencontainers.image.created=<UTC RFC 3339>`, `org.opencontainers.image.source=$CI_REPO_URL`.
- `ENV AZAION_SERVICE=gps-denied-onboard` + `ENV AZAION_REVISION=$CI_COMMIT_SHA`.
- Watchtower's post-update hook emits one `AZAION_UPDATE_EVENT` line per applied update into journald, carrying the new revision SHA + service name + timestamp + outcome.
- The operator runs `journalctl -g AZAION_UPDATE_EVENT` on any Jetson to answer "what is this device running and when did it last update?".
## Self-verification
- [x] Structured logging format defined with required fields (timestamp, level, service, component, `flight_id`, `frame_id`, kind, msg, kv, exc)
- [x] Per-environment `LOG_SINK` destination + retention tabulated
- [x] FDR-based metrics surface enumerated (every `fdr_record_schema` v1.3.0 kind mapped to its operator-relevant meaning)
- [x] Device telemetry (`jetson-stats` / `jtop`) source + sample rate + consumer (D-CROSS-LATENCY-1 hybrid trigger)
- [x] Tracing stance recorded — no W3C Trace Context / OTel SDK on airborne (justified by single-process pipeline + NFT-SEC-05); operator-side correlation_id pattern documented; OTel deferred to cycle-2 polish
- [x] Alert severities + response times defined across the four touchpoints: airborne in-flight, post-flight operator workstation, CI, deploy/update audit (`AZAION_UPDATE_EVENT`)
- [x] Operational-secret leakage controls in place (no key bytes / API tokens / Postgres credentials in logs; `KeySource` is the only key holder)
- [x] Dashboards inventoried — cycle-1 reality (Woodpecker UI, `jtop`, post-landing summary) explicit; cycle-2 polish (Grafana, FDR replay viewer, NFT-PERF tracker) logged as follow-ups
- [x] Suite-mandated deploy audit chain (`AZAION_UPDATE_EVENT` + OCI labels + `AZAION_REVISION` env) referenced from `../_infra/` docs
## Future Work (cycle-2 polish)
1. **Prometheus `/metrics` on `operator-orchestrator`** — cycle-2 wires an in-process exporter for operator-workstation-side metrics (`flights` REST round-trip latency, `satellite-provider` download throughput, tile manifest content-hash failures). The airborne image stays off this path per NFT-SEC-05.
2. **Grafana dashboard fed by post-landing-parsed FDR rollups** — single pane of glass for per-flight + cross-flight trends.
3. **OpenTelemetry SDK on `operator-orchestrator` only** — instruments `FlightsApiClient` + `satellite-provider` HTTP client with W3C Trace Context propagation. Out of scope for airborne.
4. **Per-repo Slack channel (`#gps-denied-ci` for CI, `#gps-denied-ops` for post-flight)**: `ci_cd_pipeline.md` Future Work #8 already logs the CI half; this doc adds the ops half.
5. **FDR replay viewer** — interactive timeline of `(flight_id, frame_id)` correlated records; consumes FDR segments via the `fdr_record_schema` v1.3.0 parser.
6. **NFT-PERF baseline tracker** — automated frame-deadline-miss-rate + thermal-headroom + end-to-end pose latency trending across flights, gated by AZ-595 SITL replay fixture + AZ-592/AZ-593 Tier-2 OKVIS2/VINS-Mono wiring.
7. **Centralised log aggregator on the operator workstation** — Loki / journald-export-to-cloud once the operator network egress allows it; cycle-1 leaves journald at host-default retention.