[autodev] Update configuration and documentation for cycle-1

- Enhanced `.env.example` with detailed CMake build flags and replay-mode strategy flags for development and CI environments.
- Updated `.gitignore` to include a new deploy rollback bookmark.
- Revised `_docs/_autodev_state.md` to reflect the current task status and steps.
- Added new lessons to `_docs/LESSONS.md` regarding testing and architectural improvements.
- Documented changes in `_docs/02_document/deployment/ci_cd_pipeline.md` to reflect the relaxed OpenCV version pin.
- Updated test data documentation in `_docs/02_document/tests/test-data.md` to clarify fixture usage and paths.

This commit continues the cycle-1 documentation sync and addresses various configuration updates for improved clarity and functionality.
Oleksandr Bezdieniezhnykh
2026-05-20 08:05:35 +03:00
parent ab92946833
commit bf13549b32
34 changed files with 3689 additions and 42 deletions
@@ -0,0 +1,160 @@
# GPS-Denied Onboard — CI/CD Pipeline
> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 3 (CI/CD).
> Builds on Step 1 (`reports/deploy_status_report.md`) and Step 2
> (`containerization.md`). **This document is the deployment-pipeline spec
> for THIS submodule under the parent-suite Woodpecker CI + Gitea Packages
> stack** (`../_infra/ci/README.md`). The Plan-phase doc at
> `_docs/02_document/deployment/ci_cd_pipeline.md` (GitHub Actions framing)
> is now stale and will be reconciled in autodev's existing-code Step 13
> (Update Docs); the operative CI contract is here.
## Decision Record (cycle-1 scope)
| Decision | Choice | Rationale |
|----------|--------|-----------|
| CI platform | **Woodpecker CI** (suite-mandated) | The parent suite ships Woodpecker + Gitea Packages + Caddy TLS already; no greenfield CI tooling is added |
| Pipeline layout | **Two-workflow contract** (`01-test.yml` + `02-build-push.yml`) | Suite contract per `../_infra/ci/README.md` → "Pipeline configuration — two-workflow contract" |
| Test trigger (cycle-1) | **`event: [manual]` only** | The Tier-1 e2e harness (`docker-compose.test.yml` + `tests/e2e/Dockerfile`) is heavy (TensorRT-class pytorch fp16, gtsam, Postgres, Derkachi replay clip). Cycle-1 ships it as opt-in until amd64 agent availability is known and the per-run wall-clock is characterised on the colocated arm64 Jetson agent. **Flip-back path**: change the trigger in `01-test.yml` to `event: [push, pull_request, manual]` and add `depends_on: [01-test]` to `02-build-push.yml`. |
| Build-push gating (cycle-1) | **Un-gated** (no `depends_on: [01-test]`) | Mirrors the `detections` deferral pattern documented in `../_infra/ci/README.md` → "`detections` deferral". Build path proves out independently while the test path is manual-only. Re-gates when the test path flips to `[push, pull_request, manual]`. |
| Images pushed (cycle-1) | `companion-tier1` + `operator-orchestrator` (two distinct registry repos) | `containerization.md` → Next Steps #4: "ship only `operator-orchestrator` + `companion-tier1` for the test path" until `docker/companion-jetson.Dockerfile` lands in next cycle |
| Production-name tag reservation | **`azaion/gps-denied-onboard:<branch>-arm` is RESERVED for `companion-jetson`** (next cycle) | The parent-suite Jetson compose's `gps-denied-onboard` service block (`../_infra/deploy/jetson/docker-compose.yml`) expects this exact tag. Pushing a Tier-1 dev build under it would mis-route Watchtower; cycle-1 uses explicit-suffix tags instead. |
| Multi-arch matrix | **arm64 active; amd64 commented** | Matches the template default. Uncomment when the operator-orchestrator deploy target (amd64 workstations) becomes the canonical pull path. |
| OCI labels | **`org.opencontainers.image.revision/created/source` + `ENV AZAION_REVISION`** | Suite-mandated per AZ-204 (`../_infra/ci/README.md` → "OCI image labels and commit provenance") |
| Secrets | Suite-provisioned Woodpecker global secrets: `registry_host`, `registry_user`, `registry_token` | Provisioned by `../_infra/ci/install-woodpecker.sh`; this submodule consumes them via `from_secret:` references |
## Pipeline Overview (cycle-1)
| Stage | Trigger | Runner | Quality Gate |
|-------|---------|--------|--------------|
| **Test** (`01-test.yml`) | `event: [manual]` (cycle-1; flip to `[push, pull_request, manual]` when test budget is characterised) | arm64 agent (colocated Jetson; `labels: platform: arm64`) | `pytest -q /opt/tests/e2e/` exits 0 in the `e2e-runner` container; `--exit-code-from e2e-runner` enforces this at the compose layer |
| **Build + Push** (`02-build-push.yml`) | `event: [push, manual]` on `branch: [dev, stage, main]` | arm64 agent (matrix entry; amd64 commented) | Both `companion-tier1` and `operator-orchestrator` builds succeed; both `docker push` succeed |
There is no separate Lint stage in cycle-1: `ruff` and other linters are run pre-commit and inside the `e2e-runner` container's `pytest` invocation (test collection fails on import errors caused by lint-class issues). Adding an explicit lint stage is a cycle-2 polish item logged in §Future Work.
There is no separate Security stage in cycle-1: `pip-audit`, the OpenCV pin gate (per `_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md`), and the Trivy image scan are owned by the `/security` skill (greenfield Step 14 is done; see `_docs/05_security/`) and run on operator invocation, not per-build. Adding them as a CI stage is a cycle-2 polish item.
## Stage Details — Test (`01-test.yml`)
**File**: `.woodpecker/01-test.yml`
**Trigger (cycle-1)**: `event: [manual]` — run from the Woodpecker UI on demand
**Runner**: arm64 agent (`labels: platform: arm64`)
**Working directory**: repo root (the test compose lives at root, not under `e2e/`)
**Steps**:
1. **`e2e`** — Brings up the full Tier-1 e2e stack via the existing `docker-compose.test.yml`:
```
docker compose -f docker-compose.test.yml up \
  --abort-on-container-exit \
  --exit-code-from e2e-runner \
  --build
```
- `--abort-on-container-exit` shuts the compose down the moment any service exits (a crashed `companion` or `mock-sat` surfaces immediately instead of hanging the runner waiting for `e2e-runner` to time out).
- `--exit-code-from e2e-runner` makes the pipeline exit code reflect pytest's result, not `companion`'s.
- `--build` rebuilds images if any source changed.
- The `e2e-runner` ENTRYPOINT is `pytest -q /opt/tests/e2e/` (see `tests/e2e/Dockerfile`); both `tests/e2e/replay/` (Reality Gate, gated by `RUN_REPLAY_E2E=1`) and any future `tests/e2e/scenarios/` are exercised.
2. **`down`** — Always runs (`when: status: [success, failure]`), tears the compose down to release volumes and DB state:
```
docker compose -f docker-compose.test.yml down -v
```
`down -v` drops `db-data`, `fdr-data`, `tile-data` so the next run starts clean.
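The two steps above can be reproduced outside Woodpecker with a small wrapper. This is a minimal sketch, not the contents of `.woodpecker/01-test.yml`: the function name `run_e2e` is hypothetical, and it assumes `docker compose` (v2) is on the PATH and that the script runs from the repo root.

```bash
#!/bin/sh
# Hypothetical local equivalent of the `e2e` + `down` steps.
run_e2e() {
  rc=0
  # Same flags as the `e2e` step; capture the exit code instead of aborting.
  docker compose -f docker-compose.test.yml up \
    --abort-on-container-exit \
    --exit-code-from e2e-runner \
    --build || rc=$?
  # Mirrors `when: status: [success, failure]`: tear down whatever rc is.
  docker compose -f docker-compose.test.yml down -v \
    || echo "teardown failed" >&2
  # Report the e2e-runner result, not the teardown result.
  return "$rc"
}
```

Calling `run_e2e` and inspecting `$?` gives the same pass/fail signal the pipeline uses, while the volumes are dropped on both paths.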
**No report-artifact step in cycle-1**: `pytest -q` output goes to stdout (captured by Woodpecker). A CSV/JUnit report step is a cycle-2 polish item; it would require adding `pytest-csv` or `--junit-xml` to the e2e-runner Dockerfile plus a write-mount under `e2e/results/`.
## Stage Details — Build + Push (`02-build-push.yml`)
**File**: `.woodpecker/02-build-push.yml`
**Trigger**: `event: [push, manual]` on `branch: [dev, stage, main]`
**`depends_on`**: **none** in cycle-1 (un-gated, per `detections` deferral pattern). Re-add `depends_on: [01-test]` when `01-test.yml` flips to push triggers.
**Runner**: arm64 agent (matrix; amd64 commented)
**Matrix block**:
```yaml
matrix:
  include:
    - PLATFORM: arm64
      TAG_SUFFIX: arm
    # - PLATFORM: amd64
    #   TAG_SUFFIX: amd
labels:
  platform: ${PLATFORM}
```
Adding amd64 means uncommenting the two-line matrix entry and ensuring the amd64 agent host has Docker access to the registry.
**Steps** — two sequential `build-push` invocations (both must succeed for the workflow to pass):
1. **`build-push-companion-tier1`** —
- Dockerfile: `docker/companion-tier1.Dockerfile` (4-stage, existing)
- Image: `${REGISTRY_HOST}/azaion/gps-denied-onboard-companion-tier1:${CI_COMMIT_BRANCH}-${TAG_SUFFIX}`
- OCI labels: `revision=$CI_COMMIT_SHA`, `created=<UTC RFC 3339>`, `source=$CI_REPO_URL`
- Build-arg: `CI_COMMIT_SHA=$CI_COMMIT_SHA` (Dockerfile reads into `ENV AZAION_REVISION`)
2. **`build-push-operator-orchestrator`** —
- Dockerfile: `docker/operator-orchestrator.Dockerfile` (single-stage, existing)
- Image: `${REGISTRY_HOST}/azaion/gps-denied-onboard-operator-orchestrator:${CI_COMMIT_BRANCH}-${TAG_SUFFIX}`
- OCI labels + build-arg: same suite contract as above
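As a rough sketch of what each `build-push` step amounts to, the helpers below assemble the image reference and the `docker build` command line with the suite labels. The function names (`image_ref`, `build_cmd`) are hypothetical, and the command is printed rather than executed; the operative definitions are whatever `.woodpecker/02-build-push.yml` contains. `REGISTRY_HOST`, `TAG_SUFFIX`, and the `CI_*` variables are assumed to be set by the pipeline environment.

```bash
#!/bin/sh
# Hypothetical helper: compose the cycle-1 image reference for one component.
# $1 = component suffix, e.g. companion-tier1 or operator-orchestrator
image_ref() {
  printf '%s/azaion/gps-denied-onboard-%s:%s-%s\n' \
    "$REGISTRY_HOST" "$1" "$CI_COMMIT_BRANCH" "$TAG_SUFFIX"
}

# Print (do not run) the docker build command for one component.
# $1 = component suffix, $2 = RFC 3339 UTC timestamp,
#      e.g. "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
build_cmd() {
  printf 'docker build -f docker/%s.Dockerfile' "$1"
  # The Dockerfile is expected to read this into ENV AZAION_REVISION.
  printf ' --build-arg CI_COMMIT_SHA=%s' "$CI_COMMIT_SHA"
  printf ' --label org.opencontainers.image.revision=%s' "$CI_COMMIT_SHA"
  printf ' --label org.opencontainers.image.created=%s' "$2"
  printf ' --label org.opencontainers.image.source=%s' "$CI_REPO_URL"
  printf ' -t %s .\n' "$(image_ref "$1")"
}
```

Printing the command keeps the sketch side-effect free; a real step would run it and follow with `docker push` on the same reference.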
**Image NOT pushed**: `mock-suite-sat-service` (test fixture per `containerization.md`; not a production artefact).
**Image NOT pushed in cycle-1, reserved for cycle-2**: `azaion/gps-denied-onboard:<branch>-arm` — the parent-suite Jetson compose's `gps-denied-onboard` service block already references this exact tag. Cycle-2 (when `docker/companion-jetson.Dockerfile` lands) writes to it; cycle-1 must NOT, otherwise Watchtower on fielded Jetsons would pull a Tier-1 dev build under the production tag.
## Registry Layout (cycle-1 → cycle-2)
| Tag | Cycle-1 (today) | Cycle-2 (after `companion-jetson.Dockerfile` lands) |
|-----|-----------------|------------------------------------------------------|
| `azaion/gps-denied-onboard:<branch>-arm` | **Not pushed** (reserved) | Built from `docker/companion-jetson.Dockerfile`; Watchtower-tracked by parent-suite Jetson compose |
| `azaion/gps-denied-onboard-companion-tier1:<branch>-arm` | **Built + pushed** | Continues to be pushed (Tier-1 dev / CI image; consumed by `docker-compose.test.yml` and by CI agents that don't rebuild locally) |
| `azaion/gps-denied-onboard-operator-orchestrator:<branch>-arm` | **Built + pushed** | Continues to be pushed; becomes Watchtower-tracked on operator workstations once that deploy target is wired (cycle-2 Step 4 / Environment Strategy follow-up) |
| `azaion/gps-denied-onboard-companion-jetson:<branch>-arm` | n/a | **NOT used**: cycle-2 collapses companion-jetson onto the canonical `azaion/gps-denied-onboard:<branch>-arm` tag (so the existing parent-suite Jetson compose works without edit) |
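One way to make the reservation mechanical rather than a convention is a guard that runs before any `docker push`. The sketch below is an assumption, not part of the suite tooling: `is_reserved_ref` and `guarded_push_cmd` are hypothetical names, and the pattern only covers the `-arm` production tag named in the table.

```bash
#!/bin/sh
# Hypothetical guard: succeed (exit 0) when an image reference targets the
# production tag reserved for companion-jetson.
is_reserved_ref() {
  case "$1" in
    */azaion/gps-denied-onboard:*-arm) return 0 ;;
    *) return 1 ;;
  esac
}

# Refuse a reserved reference; otherwise print the push command.
guarded_push_cmd() {
  if is_reserved_ref "$1"; then
    echo "refusing to push reserved tag: $1" >&2
    return 1
  fi
  printf 'docker push %s\n' "$1"
}
```

The explicit-suffix repositories (`-companion-tier1`, `-operator-orchestrator`) do not match the pattern, so cycle-1 pushes pass through unchanged.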
## Caching Strategy
| Cache | Mechanism (cycle-1) | Notes |
|-------|---------------------|-------|
| Docker layer cache | Host Docker daemon on the arm64 agent (shared via mounted `/var/run/docker.sock`) | Suite-standard: all build steps mount `/var/run/docker.sock` so the host daemon's layer cache survives across pipeline runs |
| Python wheel cache (Tier-1 e2e) | Implicit via Docker layer cache on the `python-deps` stage | A persistent pip cache volume is a cycle-2 polish (would speed up first-run after `pyproject.toml` bumps) |
| Replay-fixture (`_docs/00_problem/input_data/...`) | Bind-mount from repo checkout | The checkout is shallow per Woodpecker default; the Derkachi clip lives in the repo (committed), no LFS fetch needed |
## Notifications
Suite-default: build failure surfaces in the Woodpecker UI. Per-repo Slack / email integration is owned by the suite operator and applied at the Woodpecker server config layer (not per-repo); cycle-1 inherits the suite default. Adding a per-repo Slack channel is a follow-up logged in §Future Work.
## Quality Gates — Coverage / Security
Cycle-1 ships **without** an in-pipeline coverage gate or security scan. Both are owned by out-of-pipeline skills today:
- **Coverage**: `pytest --cov` is available in the dev image but is not a CI gate yet. Adding `--cov-fail-under=75` (with a stricter `--cov-fail-under=90` for safety-critical code) is logged for cycle-2.
- **Security (CVE / SBOM)**: `/security` skill already produced `_docs/05_security/dependency_scan.md` + per-area reports as part of greenfield Step 14. Re-running the scan in CI is a cycle-2 polish item — the rationale is that the dependency surface is small and changes infrequently, so out-of-pipeline `pip-audit` + `trivy image` is acceptable for cycle-1.
The Plan-phase doc (`_docs/02_document/deployment/ci_cd_pipeline.md`) describes a richer pipeline (lint / unit / integration / SBOM diff / security / Tier-2 NFTs). That document is the **architectural target**; this cycle-1 spec is the **operational reality** that the suite Woodpecker stack supports today. The two are reconciled in autodev's existing-code Step 13 (Update Docs).
## Self-Verification
- [x] Pipeline stages defined for cycle-1 with explicit triggers and gates
- [x] Two-workflow contract honoured (`01-test.yml` + `02-build-push.yml`)
- [x] OCI labels + `AZAION_REVISION` build-arg specified for both push stages (AZ-204)
- [x] Multi-arch matrix block included (arm64 active, amd64 commented per template default)
- [x] Suite global secrets (`registry_host`, `registry_user`, `registry_token`) referenced via `from_secret:`
- [x] Cycle-1 vs cycle-2 tag separation explicit (production `azaion/gps-denied-onboard:<branch>-arm` reserved for `companion-jetson`)
- [x] Deferral rationale documented (manual-only test, un-gated build-push) with flip-back instructions
- [x] Docker layer caching addressed (host daemon socket mount)
- [ ] Coverage gate enforced in CI — **DEFERRED to cycle-2** (logged)
- [ ] Security scanning in CI — **DEFERRED to cycle-2** (logged; out-of-pipeline scans exist today)
- [ ] Multi-environment deployment (staging → production) — **N/A in cycle-1**; suite registry is the only deploy target. Cycle-2 wires environment promotion via branch-tag convention (`dev-arm` → `stage-arm` → `main-arm`)
- [ ] Notifications channel configured — **DEFERRED**; inherits suite default
## Future Work (cycle-2 polish)
1. **Flip `01-test.yml` to `event: [push, pull_request, manual]`** once the per-run wall-clock on the arm64 agent is characterised (target: ≤ 15 min for the Reality Gate replay set). Re-add `depends_on: [01-test]` to `02-build-push.yml`.
2. **Author `docker/companion-jetson.Dockerfile`** (containerization.md Next Steps #2) → add a third `build-push` step writing to `azaion/gps-denied-onboard:<branch>-arm`. Once this lands, the cycle-1 `companion-tier1` push may continue or be retired depending on whether dev workflows need a registry-served Tier-1 image.
3. **Coordinate parent-suite Jetson compose edit** (containerization.md Next Steps #3) — add `fdr-data`, `tile-data`, `/run/azaion`, FC + camera device passthrough mounts to the `gps-denied-onboard` service block in `../_infra/deploy/jetson/docker-compose.yml`. Cross-submodule; record in `_docs/_process_leftovers/` if not editable in this cycle.
4. **Reconcile Plan-phase CI doc** — rewrite `_docs/02_document/deployment/ci_cd_pipeline.md` against this cycle-1 Woodpecker reality (or formally retain it as the architectural target with a "current state" pointer to this file). Owned by autodev's existing-code Step 13 (Update Docs).
5. **In-pipeline lint stage** — add a `ruff check` + `mypy --strict` lane (parallel to `e2e`, before it) so lint failures gate `01-test.yml` at the cheap end.
6. **In-pipeline coverage gate** — extend the `e2e-runner` ENTRYPOINT to `pytest --cov=src/gps_denied_onboard --cov-fail-under=75 --cov-report=xml:/results/coverage.xml -q /opt/tests/e2e/` + a `report` step publishing the XML.
7. **In-pipeline security gate** — add `pip-audit` + `trivy image` steps; gate on the OpenCV pin per `_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md`.
8. **Per-repo Slack notification** — wire the suite Slack channel (`#gps-denied-ci` per Plan-phase doc).
9. **Tier-2 e2e on Jetson hardware** (NFT lane per Plan-phase doc) — separate Woodpecker pipeline or matrix entry once the Tier-2 runner availability is confirmed (deploy_status_report.md blocker #3, AZ-592 / AZ-593).
@@ -0,0 +1,231 @@
# GPS-Denied Onboard — Containerization
> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 2.
> Builds on Step 1 output (`reports/deploy_status_report.md`) and the
> parent-suite CI/CD reality at `../_infra/ci/README.md`. Tier-2 delivery
> shape: **Option B (Docker on Jetson via Watchtower) — autodev-resolved
> 2026-05-19; reversible per Step 1 report**.
## Containerization Stance
| Tier | Production runtime | Image source |
|------|--------------------|--------------|
| Tier-1 (workstation dev + CI + replay) | Docker via `docker-compose.yml` / `docker-compose.test.yml` | This submodule (`docker/companion-tier1.Dockerfile`, `docker/operator-orchestrator.Dockerfile`, `docker/mock-suite-sat-service.Dockerfile`) |
| Tier-2 (Jetson Orin Nano Super production) | Docker via parent-suite `_infra/deploy/jetson/docker-compose.yml` + Watchtower auto-update | This submodule's new `docker/companion-jetson.Dockerfile` (NEW under Option B) pushed to `${REGISTRY_HOST}/azaion/gps-denied-onboard:<branch>-arm` |
| Tier-2 (lab/research IT-12 binary) | Docker (same `companion-jetson.Dockerfile` with research strategy flags ON) or bare JetPack install via tarball | Optional separate image tag `:research-arm`; cycle-1 ships only the deployment binary path |
Three architectural binary tracks (per ADR-002 + ADR-011) collapse onto
**two production Docker images** in this plan:
1. **`gps-denied-onboard` (airborne)** — `docker/companion-jetson.Dockerfile` for Tier-2 production + `docker/companion-tier1.Dockerfile` for Tier-1. Same Python module entrypoint (`python3 -m gps_denied_onboard.runtime_root`); runs both **live mode** and **replay mode** from a single image per ADR-011 — config (`config.mode = live | replay`) selects strategies at startup.
2. **`gps-denied-operator-orchestrator`** — `docker/operator-orchestrator.Dockerfile` for the operator workstation (C10 + C11 + C12).
Test fixtures (`mock-suite-sat-service`, `e2e-runner`) and test infrastructure (Tier-1 + Tier-2 runners) ship as separate non-deployable images. The research binary is a build-flag variant of the airborne image, not a separate Dockerfile.
## ADR-005 Amendment (DRAFT — pending Step 12 / Update Docs sync)
> Draft language for the architecture follow-up flagged in Step 1's
> Cross-Cutting Decision. Lands in `architecture.md` ADR-005 (amendment)
> or a new ADR-012 when Step 12 (Test-Spec Sync) / autodev's existing-code
> Step 13 (Update Docs) picks this up. The current `architecture.md`
> ADR-005 paragraph "Tier-2 (Jetson) does NOT use Docker" becomes
> inconsistent with this plan and must be reconciled.
> **Container scope (amended)**: Tier-1 uses Docker (`docker compose` for
> the developer setup). **Tier-2 (Jetson production) ALSO uses Docker**,
> via the parent-suite `_infra/deploy/jetson/docker-compose.yml` +
> Watchtower flow, with `runtime: nvidia` for GPU access and explicit
> volume mounts for the TensorRT INT8 calibration cache
> (`model-cache:/data/models`) and the C13 FDR ring
> (`fdr-data:/var/lib/gps-denied/fdr`). The two technical concerns the
> original ADR-005 cited — INT8 calibration cache stability and
> `jetson-stats` thermal telemetry access — are addressed by (a) the
> calibration cache living in a host-mounted volume that survives
> container restarts and (b) `jetson-stats` accessed via the
> nvidia-container-runtime's standard device passthrough (same pattern
> the parent-suite `detections` service already uses successfully on the
> same hardware). The deployment binary is the Docker image; the JetPack
> 6.2 system image is the **host** OS, not the runtime layer.
### Step 2 Validation Gates (BLOCKING — must pass before Step 3)
If either of these gates fails, **fall back to Option A** (bare-JetPack
systemd unit) and re-write this containerization plan:
| Gate | What it validates | Pass criteria | Owner |
|------|-------------------|---------------|-------|
| **TensorRT INT8 cache durability under Docker** | Build a calibration cache inside the running container; restart the container; verify the cache is reused and inference output is byte-equivalent | SHA-256 of the calibration cache file before and after restart matches; first-frame inference timing post-restart is within 5% of pre-restart timing (cache hit) | C7 owner; runs against the `companion-jetson` image on the actual Tier-2 Jetson |
| **`jetson-stats` thermal telemetry under Docker** | Run `jtop` (jetson-stats CLI) inside the container with `runtime: nvidia`; verify thermal + power + GPU clock readings match `sudo jtop` on the host within 1% | All thermal zones reported; CPU/GPU clock readings present; D-CROSS-LATENCY-1 hybrid trigger threshold readable | C7 / C5 owners; runs against the `companion-jetson` image |
Both gates land as task tickets when Step 16 chains into the next-cycle
existing-code flow (autodev resumes at existing-code Step 9 New Task per
the Done state). They are **deferred to next cycle** and recorded here so
they are not lost; the cycle-1 deploy plan ships Option B with the
validation marked as "validation pending" in `deploy_status_report.md`.
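The pass criteria for the first gate reduce to two checks: an unchanged SHA-256 and a timing delta of at most 5%. A minimal sketch, assuming GNU `sha256sum` and `awk` are available on the Jetson host; the function names and the choice to pass timings as millisecond arguments are illustrative, not part of the gate definition.

```bash
#!/bin/sh
# Print the SHA-256 hex digest of a file (GNU coreutils sha256sum).
file_sha256() {
  sha256sum "$1" | cut -d ' ' -f 1
}

# Succeed when |after - before| is within pct percent of before.
# $1 = before, $2 = after, $3 = pct
within_pct() {
  awk -v b="$1" -v a="$2" -v p="$3" \
    'BEGIN { d = a - b; if (d < 0) d = -d; exit !(d * 100 <= p * b) }'
}

# Hypothetical gate check.
# $1 = digest recorded before the restart
# $2 = calibration cache path, read after the restart
# $3 = first-frame timing before (ms), $4 = first-frame timing after (ms)
cache_gate() {
  [ "$1" = "$(file_sha256 "$2")" ] && within_pct "$3" "$4" 5
}
```

A run would record `file_sha256` and the first-frame timing before `docker restart`, then call `cache_gate` with the post-restart measurements.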
## Component-to-Image Mapping
Per ADR-009, components are folders under `src/gps_denied_onboard/components/`. They are not separate processes / containers in this monolithic Python-with-C++-extensions architecture. The mapping below shows which component code paths each image links.
| Image | Components linked | BUILD_* flags (defaults) |
|-------|-------------------|---------------------------|
| `companion-jetson` (Tier-2 prod) + `companion-tier1` (Tier-1 dev) | C1 (`KltRansac` default), C2 (`UltraVPR` default), C2.5, C3 (`DISK+LightGlue`), C3.5, C4, C5 (`GtsamIsam2`), C6, C7 (`tensorrt` on Tier-2, `pytorch_fp16` on Tier-1), C8 (per `GPS_DENIED_FC_PROFILE`), C13 + replay strategies (`BUILD_VIDEO_FILE_FRAME_SOURCE=ON`, `BUILD_TLOG_REPLAY_ADAPTER=ON`, `BUILD_REPLAY_SINK_JSONL=ON`) | `BUILD_VINS_MONO=OFF`, `BUILD_SALAD=OFF`, `BUILD_C11_TILE_MANAGER=OFF` (ADR-004 enforcement), `BUILD_DEV_STATIC_KEY=OFF`, `BUILD_STATE_ESKF=OFF` |
| `operator-orchestrator` (operator workstation) | C10, C11 (`TileDownloader` + `TileUploader`), C12 | `BUILD_C11_TILE_MANAGER=ON` |
| `mock-suite-sat-service` (test fixture) | NONE (FastAPI stub of the parent-suite `satellite-provider` D-PROJ-2 contract) | — |
| `e2e-runner` Tier-1 (`tests/e2e/Dockerfile`) | Full SUT (editable install) + pytest entrypoint | Test profile defaults |
| `e2e-runner` Tier-2 (`tests/e2e/Dockerfile.jetson`) | Full SUT (editable install) + pytest entrypoint; `dustynv/l4t-pytorch:r36.4.0` base | Test profile defaults |
## Per-Image Dockerfile Specifications
### `companion-jetson` — **NEW under Option B**
| Property | Value |
|----------|-------|
| File | `docker/companion-jetson.Dockerfile` (new in next cycle's Step 7 — Implementation; this plan specifies the contents) |
| Base image | `dustynv/l4t-pytorch:r36.4.0` (digest-pinned per suite follow-up #1) — same base proven by `tests/e2e/Dockerfile.jetson` |
| Stages | (1) system-deps (apt: `build-essential`, `cmake`, `libpq-dev`, `libspatialindex-dev`, `libgl1`, `libglib2.0-0`) → (2) python-deps (`pip install -e ".[inference]"` with the Tegra-tuned torch preserved per the existing Tier-2 e2e Dockerfile rationale) → (3) cpp-build (CMake build of the native VIO / matcher extensions with `BUILD_VINS_MONO=OFF`, `BUILD_C11_TILE_MANAGER=OFF`) → (4) runtime (slim image carrying the venv + native libs + SUT source) |
| User | `gps-denied` non-root uid 10001 (companion does not need root inside the container; volume mounts owned by the same uid on the host) |
| Build args | `CI_COMMIT_SHA` (suite-mandated; stamped as OCI labels + `ENV AZAION_REVISION`); `BRANCH` (carried into image labels) |
| OCI labels | `org.opencontainers.image.revision=$CI_COMMIT_SHA`, `org.opencontainers.image.created=<UTC RFC 3339>`, `org.opencontainers.image.source=$CI_REPO_URL` (suite-mandated per `../_infra/ci/README.md` → "OCI image labels and commit provenance (AZ-204)") |
| ENV | `AZAION_SERVICE=gps-denied-onboard`, `AZAION_REVISION=$CI_COMMIT_SHA`, `PYTHONPATH=/opt/gps-denied/src`, `PATH=/opt/venv/bin:$PATH` |
| Health check | `python3 -m gps_denied_onboard.healthcheck` with `--interval=10s --timeout=3s --start-period=30s --retries=3` (longer start-period than Tier-1 because TensorRT engine deserialize takes seconds on Jetson) |
| Exposed ports | `8080` (HTTP healthz + future replay-mode JSONL stream socket; mapped to host `5040:8080` per parent-suite compose). MAVLink + camera I/O is **not** TCP — it is host-bound (`/dev/ttyUSB*`, `/dev/video*`) via device passthrough. |
| Volume mounts (declared in parent-suite compose) | `model-cache:/data/models` (TensorRT engines + calibration cache + descriptor index); `fdr-data:/var/lib/gps-denied/fdr` (C13 ring, ≥ 64 GB); `tile-data:/var/lib/gps-denied/tiles` (C6 filesystem store, ≥ 10 GB); `/run/azaion:/run/azaion` (flight-state flag, read-only); device passthrough for `/dev/ttyUSB*` (FC UART) + `/dev/video*` (nav camera) |
| Watchtower labels | `com.centurylinklabs.watchtower.enable=true` + post-update hook emitting `AZAION_UPDATE_EVENT` per suite `x-update-logger` template |
| ENTRYPOINT | `python3 -m gps_denied_onboard.runtime_root` (same as Tier-1) |
| Flight-state gate | Honoured via `/run/azaion/in-flight` bind mount — Watchtower restart hook MUST check the flag before restarting (suite-managed; the image itself only honors the flag when transitioning between strategies at boot — there is no in-process restart logic) |
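The gate itself is suite-managed, but the check a restart hook has to perform is small. The sketch below assumes that the presence of `/run/azaion/in-flight` means a flight is in progress; the table does not spell out the flag's semantics, so treat that, and the function name `restart_allowed`, as assumptions.

```bash
#!/bin/sh
# Hypothetical pre-restart check. ASSUMPTION: the flag file exists while the
# aircraft is in flight and is absent otherwise.
# $1 = flag path (defaults to the bind-mounted suite location)
restart_allowed() {
  flag=${1:-/run/azaion/in-flight}
  if [ -e "$flag" ]; then
    echo "in-flight flag present at $flag; deferring restart" >&2
    return 1
  fi
  return 0
}
```

How the hook is wired into Watchtower is left to the suite operator; the function only answers whether a restart may proceed.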
### `companion-tier1` (existing — `docker/companion-tier1.Dockerfile`)
| Property | Value |
|----------|-------|
| Base image | `ubuntu:22.04` (system-deps stage) → `ubuntu:22.04` (runtime) |
| Stages | 4 (`system-deps` → `python-deps` → `cpp-build` → `runtime`), already documented in the file header |
| User | Currently root (acceptable for Tier-1 dev / CI containers — Tier-2 production hardens this in `companion-jetson`) |
| Health check | `python3 -m gps_denied_onboard.healthcheck` with `--interval=10s --timeout=3s --start-period=15s --retries=3` |
| Exposed ports | None (Tier-1 healthcheck is in-process; CI exposes nothing) |
| Notes | **No change required for cycle-1.** Next cycle: add `BRANCH` + `CI_COMMIT_SHA` build args + OCI labels for parity with `companion-jetson`. |
### `operator-orchestrator` (existing — `docker/operator-orchestrator.Dockerfile`)
| Property | Value |
|----------|-------|
| Base image | `python:3.10-slim` |
| Stages | 1 (`runtime`) — single-stage is acceptable here because the operator-orchestrator has no native C++ extensions and the slim base keeps it lean |
| User | Currently root — same Tier-1 caveat as `companion-tier1` |
| Health check | `python3 -m gps_denied_onboard.healthcheck` with `--interval=10s --timeout=3s --start-period=10s --retries=3` |
| Exposed ports | TBD (next cycle adds the C12 CLI's HTTP control surface for the operator UI; today the CLI runs as a one-shot invocation) |
| Notes | **No change required for cycle-1.** |
### `mock-suite-sat-service` (existing — `docker/mock-suite-sat-service.Dockerfile`)
| Property | Value |
|----------|-------|
| Base image | `python:3.10-slim` |
| User | Currently root — acceptable, this is an e2e test fixture only |
| Health check | `urllib.request.urlopen('http://127.0.0.1:5100/healthz')` with `--interval=5s --timeout=2s --retries=3` |
| Exposed ports | `5100` (HTTP) |
| Notes | **Not a production image.** Retired when parent-suite D-PROJ-2 ships the real ingest endpoint. |
### `e2e-runner` Tier-1 (existing — `tests/e2e/Dockerfile`)
Test runner for the Reality Gate on Colima / Tier-1 workstation Docker. Not a production image. ENTRYPOINT: `pytest -q /opt/tests/e2e/`. **No change for cycle-1.**
### `e2e-runner` Tier-2 (existing — `tests/e2e/Dockerfile.jetson`)
Test runner for the Reality Gate on the Jetson. `dustynv/l4t-pytorch:r36.4.0` base. The new `companion-jetson` production image inherits its base image choice and Tegra-pip rationale from this file. **No change for cycle-1.**
## Docker Compose — Local Development (existing `docker-compose.yml`)
The existing root `docker-compose.yml` already covers Tier-1 dev: `companion` + `operator-orchestrator` + `mock-sat` + `db` (Postgres 16), with healthchecks, named volumes (`db-data`, `fdr-data`, `tile-data`), and a `tests/fixtures:/fixtures:ro` bind mount for the dev calibration JSON + signing key.
**No structural change required.** Optional cycle-2 polish:
- Add an explicit `networks:` declaration for `gps-denied-dev` (the compose currently relies on Docker Compose's default network) so the suite-level e2e harness can join it explicitly when needed.
- Reference `${BRANCH:-main}` for image tags so the dev compose can pull from the suite registry instead of always building.
## Docker Compose — Blackbox Tests (existing)
| File | Purpose | Status |
|------|---------|--------|
| `docker-compose.test.yml` | Tier-1 e2e (Replay + Reality Gate); sets `BUILD_VIDEO_FILE_FRAME_SOURCE=ON`, `BUILD_TLOG_REPLAY_ADAPTER=ON`, `BUILD_REPLAY_SINK_JSONL=ON` | ✅ working |
| `docker-compose.test.jetson.yml` | Tier-2 e2e on Jetson; same flags ON | ✅ working |
| `e2e/docker/docker-compose.test.yml` | Suite-level e2e harness's internal compose | ✅ owned by the e2e harness |
| `e2e/docker/docker-compose.tier2-bridge.yml` | Tier-2 host-network bridge for direct hardware access | ✅ in tree |
**Run patterns** (suite-mandated per Woodpecker two-workflow contract):
```bash
# Tier-1 e2e (CI 01-test.yml):
docker compose -f docker-compose.test.yml up --build --abort-on-container-exit --exit-code-from e2e-runner
# Tier-2 e2e (manual / Tier-2 lane):
docker compose -f docker-compose.test.jetson.yml up --abort-on-container-exit --exit-code-from e2e-runner
```
The exit code of the `e2e-runner` service is the pipeline result. This contract matches the suite's `detections` e2e variant verbatim.
## Docker Compose — Tier-2 Production (parent-suite, NOT in this submodule)
This submodule does **not** ship a Tier-2 production compose file. The Tier-2 production stack is `../_infra/deploy/jetson/docker-compose.yml` (already shipping). This submodule contributes:
1. The published image at `${REGISTRY_HOST}/azaion/gps-denied-onboard:<branch>-arm` (via `companion-jetson.Dockerfile` + the upcoming `.woodpecker/02-build-push.yml`).
2. The healthcheck endpoint (`python3 -m gps_denied_onboard.healthcheck`).
3. The flight-state gate honour (`/run/azaion/in-flight` bind mount in the suite compose — read by the image at boot).
4. The audit chain — OCI labels + `AZAION_REVISION` env + Watchtower post-update hook emitting `AZAION_UPDATE_EVENT` to journald.
**Cross-cutting suggestion logged but not actioned in cycle-1**: the parent-suite Jetson compose's `gps-denied-onboard` service block is minimal (no volume mounts beyond `model-cache`). Under Option B, it needs the additional mounts listed in the `companion-jetson` Dockerfile table above (`fdr-data`, `tile-data`, `/run/azaion`, FC + camera device passthrough). This is a **parent-suite edit** that the GPS-Denied Onboard team must coordinate with the suite operator — recorded in Next Steps below.
## Image Tagging Strategy (Suite-Mandated)
| Context | Tag Format | Example |
|---------|-----------|---------|
| Per-PR CI (test only, not pushed) | n/a | n/a |
| Per-branch CI build-push | `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` | `git.azaion.com/azaion/gps-denied-onboard:dev-arm` |
| Release | `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` (suite uses floating branch tags + Watchtower; semver is not used at suite level today) | `git.azaion.com/azaion/gps-denied-onboard:main-arm` |
| Local dev | Image name without registry prefix | `gps-denied-onboard/companion:dev` (current local compose), `gps-denied-onboard/operator-orchestrator:dev`, `gps-denied-onboard/mock-suite-sat-service:dev` |
**No `:latest` tag in CI.** Suite contract is `<branch>-<arch>` only; Watchtower polls these floating tags.
## .dockerignore (existing — audit + recommended addenda)
The current `.dockerignore` (33 lines, root) covers `.git`, `.venv`, build artefacts, `*.engine` / `*.calib` / `*.index` / `*.faiss` / `*.onnx`, large test fixtures, `_docs/`, and editor noise. **Adequate for cycle-1.** Recommended next-cycle additions (logged here, not applied):
```
# Next-cycle additions to .dockerignore (not applied in cycle-1); a comment must start its own line
# rules + skills do not belong in any image
.cursor/
# already excluded; keep
_docs/
# don't accidentally ship dev compose into the production image
docker-compose*.yml
# test harness compose + fixtures stay out of production images
e2e/
# test code stays out of production images (currently NOT excluded)
tests/
# README / docs are not needed at runtime
*.md
```
Note: `tests/` is currently NOT in `.dockerignore`, which is **intentional for cycle-1**: the e2e-runner images (`tests/e2e/Dockerfile`, `tests/e2e/Dockerfile.jetson`) COPY `tests/` into the image. Splitting the ignore rules per image is a next-cycle refactor; it relies on BuildKit, which picks up a Dockerfile-specific ignore file named `<Dockerfile-name>.dockerignore` placed next to that Dockerfile.
## Health Checks — Inventory
| Image | Endpoint / Command | Cadence |
|-------|---------------------|---------|
| `companion-tier1`, `companion-jetson`, `operator-orchestrator` | `python3 -m gps_denied_onboard.healthcheck` (the module already exists per the existing Dockerfiles) | `--interval=10s --timeout=3s --start-period={15,30,10}s --retries=3` |
| `mock-suite-sat-service` | HTTP GET `/healthz` on port 5100 | `--interval=5s --timeout=2s --retries=3` |
| `db` (Postgres 16, suite-managed under Tier-2; root compose for Tier-1) | `pg_isready -U gps_denied -d gps_denied` | `--interval=5s --timeout=3s --retries=10` |
## Self-verification
- [x] Every component is mapped to its image (`companion-tier1` / `companion-jetson` for C1–C8 + C13; `operator-orchestrator` for C10 + C11 + C12; `mock-suite-sat-service` for the e2e fixture)
- [x] Multi-stage builds specified for `companion-tier1` (4 stages, existing) and `companion-jetson` (4 stages, planned)
- [x] Non-root user planned for `companion-jetson` (Tier-2 production); Tier-1 dev / operator-orchestrator stay root for now (hardening deferred to the next cycle)
- [x] Health checks defined for every service
- [x] `docker-compose.yml` covers all components + dependencies (existing)
- [x] `docker-compose.test.yml` enables black-box testing (existing; Tier-1 + Tier-2 jetson variants)
- [x] `.dockerignore` defined (existing; next-cycle additions logged)
- [x] Tier-2 production delivery shape resolved (Option B; ADR-005 amendment drafted; Step 2 validation gates queued)
- [x] Image tagging strategy aligned with suite-mandated `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` contract
## Next Steps
1. **User confirms this containerization plan** (BLOCKING gate per the deploy skill Step 2).
2. **Author `docker/companion-jetson.Dockerfile`** — implementation task for the next cycle (existing-code Step 9 New Task → Step 10 Implement). Will be one of the first follow-up tickets when autodev's Done step reroutes to the existing-code flow.
3. **Coordinate parent-suite edit**: the `gps-denied-onboard` service block in `../_infra/deploy/jetson/docker-compose.yml` needs the additional volume mounts (`fdr-data`, `tile-data`, `/run/azaion`, FC + camera device passthrough). This is a cross-submodule change tracked as a follow-up; record in `_docs/_process_leftovers/` if not editable in this cycle.
4. **Proceed to Step 3 (CI/CD pipeline)** — author `.woodpecker/01-test.yml` (Python `pytest` + Tier-1 e2e via existing `docker-compose.test.yml`) + `.woodpecker/02-build-push.yml` (multi-arch matrix, `companion-jetson.Dockerfile` once it lands; until then, ship only `operator-orchestrator` + `companion-tier1` for the test path). Rewrite `_docs/02_document/deployment/ci_cd_pipeline.md` against the actual Woodpecker + Gitea Packages stack per suite `../_infra/ci/README.md`.
+198
@@ -0,0 +1,198 @@
# GPS-Denied Onboard — Deployment Scripts
> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 7. Five
> bash scripts under `scripts/` automate the procedures in
> `deployment_procedures.md`. Step 7 is the only step in the deploy
> skill that produces executable artefacts; all five scripts honour the
> `/run/azaion/in-flight` flight-state gate documented in Step 6.
## Overview
| Script | Purpose | Location |
|--------|---------|----------|
| `deploy.sh` | Main orchestrator (pull → flight-state-check → stop → start → health); `--rollback` flag restores the previous image set | `scripts/deploy.sh` |
| `pull-images.sh` | Pull images from `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` (suite Gitea Packages registry) | `scripts/pull-images.sh` |
| `start-services.sh` | `docker compose up -d`; waits for HEALTHCHECK; emits `AZAION_UPDATE_EVENT` via journald | `scripts/start-services.sh` |
| `stop-services.sh` | Graceful `docker compose down`; saves current image digests to `.previous-tags.env` for rollback | `scripts/stop-services.sh` |
| `health-check.sh` | Reads Docker HEALTHCHECK status across the stack (no HTTP endpoint — NFT-SEC-05) | `scripts/health-check.sh` |
## Prerequisites
- Docker + Docker Compose v2 installed on the target host (Tier-1 workstation, lab Jetson, airborne Jetson, or operator workstation).
- For remote operation: SSH access to the target via `DEPLOY_HOST` env var (same pattern as `scripts/run-tests-jetson.sh` uses).
- Registry credentials: `REGISTRY_HOST` + `REGISTRY_USER` + `REGISTRY_TOKEN` (suite-provisioned Woodpecker global secrets per `../_infra/ci/README.md`). Loaded from `.env` at the repo root or passed via the environment.
- `.env` file populated from `.env.example`. See `environment_strategy.md` § Environment Variables for per-environment guidance.
## Environment Variables Consumed
All scripts source `.env` from the project root if present. The deploy-side variables they consume, beyond those already documented in `.env.example`, are:
| Variable | Required by | Purpose |
|----------|-------------|---------|
| `REGISTRY_HOST` | `pull-images.sh`, `deploy.sh` | Suite Gitea Packages registry hostname (e.g. `git.azaion.com`) |
| `REGISTRY_USER` | `pull-images.sh` | Registry user (Woodpecker global secret on CI; operator credentials locally) |
| `REGISTRY_TOKEN` | `pull-images.sh` | Registry token (matches Woodpecker global secret); passed to `docker login --password-stdin` |
| `DEPLOY_HOST` | All (remote mode) | SSH alias / `user@host` for remote execution. Unset = local execution. |
| `AIRBORNE_COMPOSE_FILE` | `start-services.sh`, `stop-services.sh`, `health-check.sh` (when `--target=airborne`) | Override the default airborne compose path (`/etc/gps-denied/docker-compose.airborne.yml`) |
| `AZAION_REVISION` | `start-services.sh` (for the audit `AZAION_UPDATE_EVENT` line) | Inherited from the image's `ENV AZAION_REVISION=$CI_COMMIT_SHA` per AZ-204 |
| `BRANCH`, `ARCH` | `pull-images.sh`, `deploy.sh` | Tag selector (defaults: `main`, `arm`) |
| `WAIT_SECS` | `start-services.sh`, `deploy.sh` | HEALTHCHECK wait budget (default: 120 s) |
`.previous-tags.env` is written at the project root by `stop-services.sh` and is git-ignored (added to `.gitignore` in this step).
## Targets
Every script accepts `--target <dev|airborne|operator-workstation>` and picks a sensible compose file by default:
| `--target` | Default compose file | Purpose |
|------------|----------------------|---------|
| `dev` | `docker-compose.yml` | Tier-1 workstation Docker (developer + CI) |
| `operator-workstation` | `docker-compose.yml` (reused; operator workstation runs only `operator-orchestrator` + `db`) | Operator deploy of `operator-orchestrator`. Cycle-2 may add a dedicated `docker-compose.operator.yml` that excludes the `companion` service. |
| `airborne` | `${AIRBORNE_COMPOSE_FILE:-/etc/gps-denied/docker-compose.airborne.yml}` | Tier-2 airborne Jetson. Cycle-1 ships no compose file at this path; Watchtower drives updates via the parent-suite `_infra/deploy/jetson/docker-compose.yml`. In cycle-1 the scripts remain usable for manual, operator-issued restarts on the bench Jetson: pass `--compose-file ./docker-compose.test.jetson.yml` or point `AIRBORNE_COMPOSE_FILE` at the parent-suite compose. |
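The target-to-compose-file mapping in the table can be sketched as follows. This is an illustration only: `resolve_compose_file` is a hypothetical name, and the assumption that an explicit `--compose-file` value takes precedence over `AIRBORNE_COMPOSE_FILE` is mine, not confirmed by the scripts.

```python
import os

DEFAULT_AIRBORNE_COMPOSE = "/etc/gps-denied/docker-compose.airborne.yml"


def resolve_compose_file(target, env=None, override=None):
    """Pick the compose file for a --target value.

    `override` models --compose-file (assumed to win over everything else).
    `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    if override:
        return override
    if target in ("dev", "operator-workstation"):
        # Both targets reuse the root compose file in cycle-1.
        return "docker-compose.yml"
    if target == "airborne":
        # Mirrors ${AIRBORNE_COMPOSE_FILE:-default}: unset or empty falls back.
        return env.get("AIRBORNE_COMPOSE_FILE") or DEFAULT_AIRBORNE_COMPOSE
    raise ValueError(f"unknown target: {target!r}")
```

Using `or` rather than a plain `get` default keeps the behaviour aligned with the shell `:-` expansion, which also treats an empty variable as unset.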
## Script Details
### `deploy.sh`
Main orchestrator. Runs:
1. `pull-images.sh --target <target> --branch <branch> --arch <arch>` (skipped on `--rollback`)
2. Flight-state check (in-band — invokes `stop-services.sh` which performs the actual `/run/azaion/in-flight` probe)
3. `stop-services.sh --target <target>` (also writes `.previous-tags.env`)
4. `start-services.sh --target <target> --wait-secs <N>`
5. `health-check.sh --target <target>`
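The sequence above can be sketched as a plan builder that returns the commands in order. This is a hedged illustration, not the contents of `deploy.sh`: `deploy_plan` is a hypothetical name, and the rollback branch omits the digest pulls from `.previous-tags.env` that the real script performs.

```python
def deploy_plan(target, branch="main", arch="arm", wait_secs=120, rollback=False):
    """Return the ordered argv lists that a deploy run would execute."""
    steps = []
    if not rollback:
        # Step 1 is skipped on --rollback.
        steps.append(["scripts/pull-images.sh", "--target", target,
                      "--branch", branch, "--arch", arch])
    # Steps 2-3: stop-services.sh performs the in-flight probe itself.
    steps.append(["scripts/stop-services.sh", "--target", target])
    steps.append(["scripts/start-services.sh", "--target", target,
                  "--wait-secs", str(wait_secs)])
    steps.append(["scripts/health-check.sh", "--target", target])
    return steps


for argv in deploy_plan("dev", branch="dev"):
    print(" ".join(argv))
```

Building the plan as data before running anything makes a dry-run mode trivial and keeps the ordering testable without Docker.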
**Usage**:
```
scripts/deploy.sh [--target dev|airborne|operator-workstation]
[--branch <branch>] [--arch <arch>]
[--compose-file <path>]
[--wait-secs N]
[--rollback] [--force] [--help]
```
**Rollback**: when `--rollback` is passed, `deploy.sh` reads `.previous-tags.env` (written by the most recent `stop-services.sh` run), `docker pull`s each saved image digest, then proceeds with the stop → start → health pipeline. Cycle-1 does not retag — the operator owns the registry-side tag promotion per `deployment_procedures.md` § Rollback Procedures.
**Force flag** (`--force`): bypasses the `/run/azaion/in-flight` safety gate. **Never pass during a live flight** — this is an emergency escape hatch for stuck flag scenarios (e.g. autopilot service died holding the flag).
### `pull-images.sh`
Pulls the cycle-1 image set from the suite registry. Cycle-2 will pick up the airborne `companion-jetson` image automatically when `--target=airborne` is selected (the image name template is already coded for it).
**Usage**:
```
scripts/pull-images.sh [--branch <branch>] [--arch <arch>]
[--target dev|airborne|operator-workstation]
[--verify] [--help]
```
**`--verify`**: after pull, prints each image's RepoDigest + `AZAION_REVISION` env var (per the OCI labels mandated by AZ-204).
### `start-services.sh`
`docker compose up -d --remove-orphans`. Polls `docker compose ps --format` until every service that declares a HEALTHCHECK reports `healthy` (default budget: 120 s). On success, emits a structured `AZAION_UPDATE_EVENT` line via journald (`logger -t gps-denied-onboard -p user.notice`).
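A minimal sketch of the wait loop described above, assuming the health of each service is available as a mapping from service name to the `Health` column value (an empty string when no HEALTHCHECK is declared). The function name, the poll interval, and the injected `get_health` callable are illustrative assumptions.

```python
import time


def wait_until_healthy(get_health, wait_secs=120.0, poll_secs=2.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until every service that declares a HEALTHCHECK is healthy.

    Returns True on success and False once the wait budget is spent.
    """
    deadline = clock() + wait_secs
    while True:
        health = get_health()
        # Services with an empty health value declare no HEALTHCHECK; skip them.
        pending = [svc for svc, state in health.items()
                   if state and state != "healthy"]
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_secs)
```

Injecting `clock` and `sleep` lets the loop be exercised in tests without real delays; a real caller would pass a function that shells out to `docker compose ps`.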
**Usage**:
```
scripts/start-services.sh [--target dev|airborne|operator-workstation]
[--compose-file <path>]
[--wait-secs N] [--force] [--help]
```
**Refuses to start the airborne stack when `/run/azaion/in-flight` is set** (unless `--force` is passed) — this matches `deployment_procedures.md` § Deployment Strategy "ground-only safety gate".
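The gate can be sketched as below. Two assumptions are made explicit: the flag counts as "set" when the file exists and is non-empty, and an unreadable flag is treated as set so that the check fails closed. Both function names are hypothetical.

```python
import os

FLAG_PATH = "/run/azaion/in-flight"


def in_flight(flag_path=FLAG_PATH):
    """Return True when the flight-state flag is considered set."""
    try:
        return os.stat(flag_path).st_size > 0
    except FileNotFoundError:
        return False
    except OSError:
        # Assumption: if the flag cannot be inspected, assume a live flight.
        return True


def gate_allows(target, force=False, flag_path=FLAG_PATH):
    """Only the airborne target is gated; --force bypasses the gate."""
    if target != "airborne" or force:
        return True
    return not in_flight(flag_path)
```

Failing closed trades convenience for safety: a permissions problem on the bind mount blocks a restart instead of allowing one during a possible flight.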
### `stop-services.sh`
Graceful `docker compose down --remove-orphans`. The companion's stop sequence is governed by Docker's default 10 s grace period in cycle-1; cycle-2 adds `stop_grace_period: 30s` to the `companion` service block (see `deployment_procedures.md` § Graceful Shutdown — Cycle-1 status).
Before stopping, writes the current image set to `.previous-tags.env` in the repo root:
```
# Saved by scripts/stop-services.sh on 2026-05-20T05:54:00Z
# Used by deploy.sh --rollback to restore the previous image set.
# Service tag layout: PREV_<SERVICE>_IMAGE=<repo>@<sha256-digest>
PREV_COMPANION_IMAGE=gps-denied-onboard/companion@sha256:abc…
PREV_OPERATOR_ORCHESTRATOR_IMAGE=gps-denied-onboard/operator-orchestrator@sha256:def…
PREV_MOCK_SAT_IMAGE=gps-denied-onboard/mock-suite-sat-service@sha256:…
PREV_DB_IMAGE=postgres@sha256:…
```
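For illustration, the file layout shown above could be read back like this. The parser is a sketch under assumptions: `parse_previous_tags` is a hypothetical name, and deriving a service key by lower-casing the middle of `PREV_<SERVICE>_IMAGE` is my convention, not necessarily what `deploy.sh --rollback` does.

```python
import re

_LINE = re.compile(r"^PREV_([A-Z0-9_]+)_IMAGE=(\S+)$")


def parse_previous_tags(text):
    """Map service key -> saved image reference from .previous-tags.env text."""
    images = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment header
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognised line: {line!r}")
        key = match.group(1).lower().replace("_", "-")
        images[key] = match.group(2)
    return images
```

Raising on an unrecognised line, rather than skipping it, keeps a corrupted bookmark from silently producing a partial rollback set.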
**Refuses to stop the airborne stack when `/run/azaion/in-flight` is set** (unless `--force` is passed).
**Usage**:
```
scripts/stop-services.sh [--target dev|airborne|operator-workstation]
[--compose-file <path>] [--force] [--help]
```
### `health-check.sh`
Reads Docker HEALTHCHECK status across the stack via `docker compose ps --format '{{.Service}}\t{{.State}}\t{{.Health}}'`. No HTTP endpoints (NFT-SEC-05 — the companion has no inbound listener).
**Usage**:
```
scripts/health-check.sh [--target dev|airborne|operator-workstation]
[--compose-file <path>] [--help]
```
**Exit codes**:
- `0` — all services healthy (or running with no declared HEALTHCHECK, which is the case for services that intentionally have none, e.g. `mock-sat` in test profiles where the HEALTHCHECK is declared elsewhere).
- `1` — at least one service is `running` but `unhealthy`.
- `2` — at least one service is not `running` (exited, dead, or never started).
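The exit-code rules above can be expressed as a small function over `(service, state, health)` rows. This is a sketch: the name `health_exit_code` is hypothetical, and two details are assumptions, namely that code `2` takes precedence over `1` and that a `starting` health value is reported as `1`.

```python
def health_exit_code(rows):
    """Derive the exit code from (service, state, health) tuples."""
    worst = 0
    for _service, state, health in rows:
        if state != "running":
            return 2  # exited, dead, or never started
        if health not in ("", "healthy"):
            worst = 1  # running, but HEALTHCHECK is not passing
    return worst
```

An empty health value maps to `0` because a running service without a declared HEALTHCHECK is treated as healthy in the list above.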
## Common Properties
All five scripts:
- `#!/usr/bin/env bash` + `set -euo pipefail`.
- Support `--help` / `-h` (heredoc-based usage block — robust to source-line reordering).
- Source `.env` from the project root if present (`set -a` / `set +a` around the source so the variables are exported into the script's environment + subprocesses).
- Support **remote execution** via `DEPLOY_HOST=<ssh-alias>` env var. When set, every docker command is run via `ssh ${DEPLOY_HOST}`. The pre-flight SSH check uses `-o BatchMode=yes -o ConnectTimeout=5` (same pattern as `scripts/run-tests-jetson.sh`).
- Are **idempotent** for the running-stack case: `start-services.sh` is safe to re-run on an already-healthy stack; `stop-services.sh` is safe to re-run on an already-stopped stack; `pull-images.sh` is safe to re-run (docker will report "Image is up to date").
- Return stable exit codes, documented in each script's `--help`; the `health-check.sh` codes are also listed under its section above.
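The remote-execution property can be sketched as an argv wrapper. This is an assumption-laden illustration: `wrap_remote` is a hypothetical name, and applying the `BatchMode` / `ConnectTimeout` options to every command (the text only states them for the pre-flight check) is my simplification.

```python
import shlex


def wrap_remote(argv, deploy_host=None):
    """Return argv unchanged for local runs, or wrapped in ssh when a host is set."""
    if not deploy_host:
        return list(argv)
    # The remote side receives one shell-quoted command string.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
            deploy_host, shlex.join(argv)]


print(wrap_remote(["docker", "compose", "ps"], deploy_host="jetson-lab"))
```

Quoting with `shlex.join` matters because ssh concatenates its trailing arguments into a single string for the remote shell, so unquoted spaces or metacharacters would be reinterpreted there.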
## Local Smoke Test (Tier-1 dev)
After authoring, the operator can smoke-test the full chain on a Tier-1 workstation:
```bash
# Reset
docker compose -f docker-compose.yml down -v
# Manual pipeline (does what deploy.sh does, step by step)
scripts/pull-images.sh --target dev --branch dev --arch arm # optional in dev; the dev compose builds locally
scripts/start-services.sh --target dev --wait-secs 180 # gives 3 min for pip / cmake on first build
scripts/health-check.sh --target dev # exit 0 when companion + operator-orchestrator + db + mock-sat are healthy
scripts/stop-services.sh --target dev # writes .previous-tags.env
```
A `docker compose ps` between each step verifies the expected service state. Cycle-2 will add an automated smoke test under `tests/e2e/scripts/` that runs this sequence on a CI-clean host.
## Self-verification
- [x] All five scripts created under `scripts/` and marked executable (`chmod +x`).
- [x] Scripts source `.env` from the project root (when present); `REGISTRY_HOST` / `REGISTRY_USER` / `REGISTRY_TOKEN` consumed by `pull-images.sh`.
- [x] `deploy.sh` orchestrates pull → flight-state-check → stop → start → health; `--rollback` restores `.previous-tags.env`.
- [x] `pull-images.sh` handles `docker login` via `--password-stdin` and tags images per `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` (suite contract).
- [x] `start-services.sh` brings up `docker compose up -d` and waits for HEALTHCHECK; emits `AZAION_UPDATE_EVENT` via `logger` on systemd hosts.
- [x] `stop-services.sh` writes `.previous-tags.env` then runs `docker compose down --remove-orphans`; honours the `/run/azaion/in-flight` gate.
- [x] `health-check.sh` reads HEALTHCHECK status via `docker compose ps` (no HTTP endpoint — NFT-SEC-05).
- [x] Rollback supported via `deploy.sh --rollback`.
- [x] Remote deployment via SSH supported through `DEPLOY_HOST` (same pattern as `scripts/run-tests-jetson.sh`).
- [x] `.previous-tags.env` added to `.gitignore` (rollback bookmark; not a committed artefact).
- [x] All scripts use heredoc-based `--help` (robust to source-line shifts) and `set -euo pipefail`.
- [x] `bash -n` syntax-checks pass on all five scripts.
## Cycle-2 Polish (logged, not implemented in cycle-1)
1. **`stop_grace_period: 30s`** on the `companion` service in `docker-compose.yml` + the parent-suite Jetson compose, once the Step 2 BLOCKING gate "TensorRT INT8 cache durability under Docker" measures the actual drain budget on Tier-2 hardware (`deployment_procedures.md` § Graceful Shutdown — Cycle-1 status).
2. **`docker-compose.operator.yml`** — operator-only compose that excludes the `companion` service so `--target=operator-workstation` doesn't pull / start the airborne binary at all.
3. **Tag-rotation helper**: `scripts/promote-tag.sh <sha> <branch>`, which retags the registry-side `${REGISTRY_HOST}/azaion/<service>:<branch>-arm` for production rollouts. Cycle-1 keeps this operator-manual.
4. **`scripts/post-flight-pull.sh`** — pulls FDR segments from the airborne Jetson to the operator workstation and runs `python3 -m gps_denied_onboard.post_flight.summarise` (per `observability.md` § Flight Analytics).
5. **CI-clean smoke test**: `tests/e2e/scripts/test_deploy_pipeline.sh` exercising `pull → start → health → stop → rollback` against a clean Docker host (gated by `RUN_DEPLOY_E2E=1`).
6. **Watchtower post-update hook on the operator workstation** — cycle-2 may add a Watchtower instance on the operator workstation that polls the suite registry and applies updates automatically. Cycle-1 leaves the operator workstation on the `scripts/deploy.sh` operator-driven path.
+207
@@ -0,0 +1,207 @@
# GPS-Denied Onboard — Deployment Procedures
> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 6. Builds on
> Steps 1–5 (`reports/deploy_status_report.md`, `containerization.md`,
> `ci_cd_pipeline.md`, `environment_strategy.md`, `observability.md`). The
> deploy skill's standard procedure template (load-balanced HTTP service
> with blue-green / rolling / canary patterns) is adapted here for the
> system's actual topology: single airborne instance + single operator
> workstation, ground-only updates, FC-managed in-flight failsafe, and the
> parent-suite Watchtower flow with a flight-state gate.
## Deployment Strategy
### Pattern: **Floating-tag pull-on-ground (Watchtower-managed)**
| Aspect | Choice | Rationale |
|--------|--------|-----------|
| Update mechanism (airborne Jetson) | Parent-suite Watchtower polls `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm`; pulls + restarts when SHA changes | Suite-mandated pattern per `../_infra/deploy/jetson/README.md`. The fielded Jetson stack has Watchtower already running, polling all 9 application services on the same cadence. |
| Update mechanism (operator workstation) | Operator runs `docker compose pull && docker compose up -d` from `scripts/start-services.sh` | The operator workstation is single-user; cycle-1 does not need automatic updates. Cycle-2 may add a Watchtower instance on the workstation. |
| Update mechanism (lab Jetson — staging) | Same as airborne (Watchtower polling `dev-arm` or `stage-arm`) | Mirrors airborne so the bench rig validates the exact same update path. |
| Blue-green / rolling / canary | **None of the above**: N=1 instance per role | The airborne side has one Jetson per aircraft (no fleet); the operator workstation has one instance per operator. There is no load-balanced replica to roll over. |
| Zero-downtime requirement | **Not applicable in flight**; ground-only | Flights are discrete + bounded; the FC handles in-flight failsafe (AC-FC-FAILSAFE-1) if the companion is unavailable mid-flight. Updates do not happen during flight. |
| Ground-only safety gate | `/run/azaion/in-flight` flag (parent-suite `autopilot` service writes it on arm/disarm) | **Watchtower's post-update hook MUST refuse to restart the `gps-denied-onboard` container when this flag is set.** Honoured at the suite-compose layer, not in this submodule's image (the image only honours the flag at boot when transitioning between strategies). |
| Multi-aircraft rollout | Tag-based per-aircraft (operator can pin `:rev-<sha>-arm` instead of `:main-arm`) | Floating tag is the default; explicit SHA pinning is the manual override. Suite operator owns per-aircraft pinning. |
### Graceful Shutdown
The companion has **no inbound HTTP connections** (NFT-SEC-05 in-flight egress lockdown). "Graceful shutdown" means: drain in-flight FDR writes, flush the C13 segment, emit `flight_footer`, close MAVLink connection cleanly.
| Step | Action | Owner |
|------|--------|-------|
| 1 | systemd / Docker sends `SIGTERM` to PID 1 (`python3 -m gps_denied_onboard.runtime_root`) | OS layer |
| 2 | Runtime root sets the global `shutting_down` flag; all per-frame producers stop enqueuing new FDR records | runtime root |
| 3 | C13 writer drains the FDR SPSC ring (≤ 200 ms target — bounded by ring depth + writer throughput) | C13 |
| 4 | C13 emits `flight_footer` with `clean_shutdown=true`, `records_written`, `records_dropped_overrun`, `bytes_written`, `rollover_count` | C13 |
| 5 | C13 closes the active segment file (fsync, rename `.tmp` → final) | C13 |
| 6 | C8 sends final MAVLink `STATUSTEXT` and closes the FC serial connection | C8 |
| 7 | Process exits 0 | runtime root |
**Termination grace period (target)**: 30 seconds for the above sequence. If exceeded, Docker / systemd sends `SIGKILL`; `flight_footer.clean_shutdown` will be `false` on the next boot's recovery write, flagging the unclean shutdown for the post-flight summary.
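Steps 2 to 4 of the sequence above can be sketched in a few lines. This is not the runtime root's code: the module-level event, the handler, and `drain_and_close` are hypothetical stand-ins, the ring is modelled as a `deque`, and `records_written` here counts only the records drained at shutdown.

```python
import signal
import threading
from collections import deque

shutting_down = threading.Event()  # stand-in for the global `shutting_down` flag


def handle_sigterm(signum, frame):
    """Step 2: flip the flag so producers stop enqueuing new FDR records."""
    shutting_down.set()


def install_handler():
    """Register the handler; call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def drain_and_close(ring, write_record):
    """Steps 3-4: drain the ring in FIFO order, then emit the footer record."""
    drained = 0
    while ring:
        write_record(ring.popleft())
        drained += 1
    footer = {"kind": "flight_footer", "clean_shutdown": True,
              "records_written": drained}
    write_record(footer)
    return footer
```

Keeping the signal handler down to a single flag flip is deliberate: the drain itself runs in normal program flow, where file I/O and fsync are safe.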
**Cycle-1 status**: docker-compose.yml does **not** yet declare `stop_grace_period: 30s` — cycle-1 inherits Docker's default 10 s grace. The C13 ring drain target (≤ 200 ms) fits comfortably inside 10 s for the dev profile, but TensorRT engine teardown + gtsam factor cleanup on Tier-2 hardware are not yet measured. **Cycle-2 follow-up** (recorded in `_docs/_process_leftovers/` when this deploy plan lands): add `stop_grace_period: 30s` to the `companion` service in `docker-compose.yml` and to the `gps-denied-onboard` service in the parent-suite `../_infra/deploy/jetson/docker-compose.yml` once the Step 2 validation gate "TensorRT INT8 cache durability under Docker" (`containerization.md` § Step 2 Validation Gates) measures the actual drain budget on the Jetson.
### Database Migration Ordering
Cycle-1 ships **no migration runner** — C6 bootstrap uses idempotent `CREATE TABLE IF NOT EXISTS`. Cycle-2+ rules (from `environment_strategy.md` § Migration Rules):
| Rule | Cycle-1 status | Cycle-2+ enforcement |
|------|----------------|----------------------|
| Migrations run **before** new code deploys | n/a — bootstrap-only | Alembic (or equivalent) migration step runs against staging first, then production, before the corresponding image pull is enabled |
| All migrations must be backward-compatible | n/a | Required: new schema works with previous image's read path until next release rotates both |
| Irreversible migrations require explicit operator approval | n/a | Required: Woodpecker UI approval gate + recorded in `_docs/04_deploy/migration_log.md` |
| Production migrations on the airborne Jetson refuse to run when `/run/azaion/in-flight` is set | n/a | Required: migration tool reads the flag at start; aborts with exit 0 + journald audit line if the flag is set |
| Production migrations on the operator workstation require operator approval | n/a | Required: interactive prompt in `start-services.sh` before applying |
## Health Checks
The companion has no HTTP `/health/live` or `/health/ready` endpoint (NFT-SEC-05). The Docker `HEALTHCHECK` is an **exec check** that re-runs the startup validation matrix (`environment_strategy.md` § Variable Validation) and inspects in-process liveness signals.
| Check | Type | Command / mechanism | Interval | Failure threshold | Action |
|-------|------|----------------------|----------|--------------------|--------|
| Liveness / Readiness | `HEALTHCHECK` exec | `python3 -m gps_denied_onboard.healthcheck` | 10 s on every image (companion-jetson adds `--start-period=30s` for TensorRT engine deserialise) | 3 consecutive failures → Docker marks container `unhealthy` → systemd / Watchtower restarts | Restart in place; there is no load balancer to drain. Watchtower honours `/run/azaion/in-flight` before restarting. |
| Startup probe | Same exec | Same command | 5 s once `--start-period` elapses | 30 attempts max | Kill + recreate; Watchtower retries the pull on next poll |
| FC adapter health (in-flight) | C8 watchdog from the FC | MAVLink heartbeat loss > 1 s | n/a | Handled by the FC | FC drops to `SAFE_DEAD_RECKONING` or `RTL` per AC-FC-FAILSAFE-1 |
| FDR ring liveness | `shared.fdr_client` overrun monitor | Producer enqueue failure | n/a | Emits `kind="overrun"` record (AC-NEW-3); never silent | Post-flight forensics surface; no in-flight action |
| `db` Postgres health (operator workstation + dev compose) | `HEALTHCHECK` exec | `pg_isready -U gps_denied -d gps_denied` | 5 s | 10 failures | Docker / systemd restart the `db` service; the companion's healthcheck fails until DB is back |
| `mock-suite-sat-service` health (Tier-1 e2e only) | `HEALTHCHECK` HTTP | GET `/healthz` on port 5100 | 5 s | 3 failures | Compose marks unhealthy; e2e-runner `--exit-code-from e2e-runner` surfaces failure |
### `python3 -m gps_denied_onboard.healthcheck` contract
The healthcheck module (already exists per `containerization.md`) re-runs:
1. **Required env vars validation** — same set as the composition root, but read-only (no side effects).
2. **C6 DB reachability**: `psycopg2.connect(DB_URL) → SELECT 1`.
3. **C13 FDR mount writability**: `os.access(FDR_PATH, os.W_OK)` + a probe write to a `.healthcheck` file.
4. **C7 backend availability** — for `INFERENCE_BACKEND=tensorrt`, validates the engine cache directory exists + is readable; for `pytorch_fp16`, no extra check (libtorch in-process).
5. **C8 FC adapter** — best-effort: attempts a non-blocking serial open if `GPS_DENIED_FC_PROFILE` is set + the device path is present. Absent device path is not a failure (dev / CI containers).
Exit codes: `0` healthy; `1` config-invalid; `2` dependency-unreachable; `3` resource-bound (e.g. FDR full). Docker treats any non-zero as `unhealthy`.
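A reduced sketch of the contract above, covering checks 1 to 3 and the exit-code mapping. It is not the existing module: treating `DB_URL` and `FDR_PATH` as environment variable names, injecting the database probe as `db_ping`, and mapping an unwritable mount to `2` and a full disk to `3` are all assumptions made for illustration.

```python
import errno
import os

REQUIRED_ENV = ("DB_URL", "FDR_PATH")  # hypothetical subset of the real matrix


def run_healthcheck(env, db_ping):
    """Return 0 healthy, 1 config-invalid, 2 dependency-unreachable, 3 resource-bound."""
    if any(not env.get(name) for name in REQUIRED_ENV):
        return 1
    try:
        db_ping(env["DB_URL"])  # e.g. connect and run SELECT 1
    except Exception:
        return 2
    fdr_path = env["FDR_PATH"]
    if not os.access(fdr_path, os.W_OK):
        return 2
    probe = os.path.join(fdr_path, ".healthcheck")
    try:
        with open(probe, "w", encoding="utf-8") as handle:
            handle.write("ok")
    except OSError as exc:
        # A full FDR volume is reported as resource-bound.
        return 3 if exc.errno == errno.ENOSPC else 2
    return 0
```

A real entry point would pass `os.environ` and a probe built on the database driver, then hand the result to `sys.exit`; injecting both keeps the logic testable without Postgres.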
## Staging Deployment (lab Jetson HITL)
Treat the lab Jetson as a **mirror of production** for image promotion. Operator runs the procedure manually; cycle-2 may automate via the suite.
1. **CI/CD** has already built + pushed `${REGISTRY_HOST}/azaion/gps-denied-onboard-companion-tier1:dev-arm` + `…-operator-orchestrator:dev-arm` via `.woodpecker/02-build-push.yml` (cycle-1) or `companion-jetson:dev-arm` via cycle-2.
2. **Verify the flag**: `cat /run/azaion/in-flight` should be empty / absent on the lab Jetson (no live FC there). If a HITL session is running, wait for the bench session to end.
3. **Pull the new image**: `scripts/pull-images.sh dev` (Step 7). Watchtower may have already pulled if running on the lab Jetson.
4. **Restart the service**: `scripts/start-services.sh dev` (Step 7). Honours stop-grace-period; waits for HEALTHCHECK to report healthy.
5. **Run the HITL e2e suite**: `docker compose -f docker-compose.test.jetson.yml up --abort-on-container-exit --exit-code-from e2e-runner --build`. This runs the **Reality Gate** replay (Derkachi clip + recorded tlog) against the new image on Tier-2 hardware.
6. **Verify FDR output**: `python3 -m gps_denied_onboard.post_flight.summarise --segment /var/lib/gps-denied/fdr/segment-*.fdr` (cycle-1 ad-hoc tool; cycle-2 polish lands the full replay viewer). Confirm `flight_footer.clean_shutdown == true` and `records_dropped_overrun == 0`.
7. **If gates pass** → promote: tag `${REGISTRY_HOST}/azaion/gps-denied-onboard:<sha>-arm` (or promote by branch from `dev-arm` → `stage-arm` once cycle-2 wires environment branches per `ci_cd_pipeline.md` Quality Gates `Multi-environment deployment` row).
8. **If gates fail** → file a Jira issue under E-DEPLOY; roll back the lab Jetson per § Rollback Procedures.
## Production Deployment (airborne Jetson + operator workstation)
Production deployment lands on each aircraft individually + on each operator workstation. The aircraft side is Watchtower-driven; the operator workstation side is operator-driven.
### Pre-deploy checks (operator-owned)
- [ ] **CI gates green**: `01-test.yml` passed on the target branch (cycle-1: manual trigger; cycle-2: push gate).
- [ ] **Security scan recent**: `_docs/05_security/dependency_scan.md` re-validated against the build SHA. The OpenCV pin per `_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md` is honoured.
- [ ] **HITL gate passed**: Staging deployment steps 5–6 confirmed `clean_shutdown=true` and `records_dropped_overrun=0`.
- [ ] **Per-aircraft acceptance** — operator confirms the build's strategy flags (`BUILD_VINS_MONO`, `BUILD_SALAD`, `BUILD_C11_TILE_MANAGER`, replay flags, `BUILD_DEV_STATIC_KEY=OFF`) match the operational profile for the destination aircraft.
- [ ] **Calibration JSON onboard**: `/etc/gps-denied/calibration/adti20.json` (operator-acquired per D-PROJ-1) is staged on the aircraft Jetson NVM.
- [ ] **Signing key path provisioned**: `MAVLINK_SIGNING_KEY` resolves to a per-host writable path that `KeySource` will rotate at takeoff; no static key from `tests/fixtures/`.
- [ ] **Postgres credentials in `/etc/gps-denied/.pgpass`** — per-host random password (Step 7 `start-services.sh` writes this on first run).
- [ ] **`/run/azaion/in-flight` is clear** — no live flight in progress on the target aircraft.
- [ ] **Rollback target identified** — previous successful SHA recorded for the target aircraft (operator notebook + `journalctl -g AZAION_UPDATE_EVENT` on the Jetson).
- [ ] **Stakeholders notified** — flight operator + suite operator informed of the deploy window.
### Production Deployment — Airborne Jetson (Watchtower-driven)
1. **Tag promotion** — operator pushes the validated SHA to `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm` (or per-aircraft SHA pin if rolling out partial fleet).
2. **Wait for Watchtower poll** — default poll interval per suite config (typically ≤ 5 min).
3. **Watchtower pre-restart check** — Watchtower's post-update hook checks `/run/azaion/in-flight`; if set, defers the restart until the next poll.
4. **Container stop** — Docker sends `SIGTERM`; companion drains FDR (≤ 200 ms target) + emits `flight_footer` per § Graceful Shutdown. Exit must complete within 30 s grace period.
5. **Image pull complete** — Watchtower pulls the new image (already verified-by-tag; OCI labels embed the SHA).
6. **Container start** — Docker starts the new container; `HEALTHCHECK` `--start-period=30s` allows TensorRT engine deserialise + Postgres reconnect.
7. **Audit event emitted** — Watchtower's post-update hook emits `AZAION_UPDATE_EVENT` to journald (`observability.md` § Deploy Audit).
8. **Verify on the aircraft**: operator runs `journalctl -g AZAION_UPDATE_EVENT --since "10 min ago"` on the Jetson; confirms the new revision SHA matches the intended tag.
9. **Run a ground HITL pre-flight** — operator brings up the bench-mounted aircraft, runs the standard pre-flight checklist (FC heartbeat, signing handshake, camera focus, NFT-SEC-04 image-decode smoke). Pre-flight refusal-to-arm on any gate failure is the production safety net.
10. **Monitor the first flight** — operator watches QGroundControl for STATUSTEXT messages from the companion + the `GpsDeniedHealth` MAVLink message stream during the first flight under the new image.
11. **Post-flight forensics** — after landing, operator pulls FDR segments + runs `post_flight.summarise`; confirms no regression vs the previous-SHA baseline (NFT-PERF gates per `_docs/02_document/tests/` baselines).
### Production Deployment — Operator Workstation (operator-driven)
1. **Pre-deploy checks** — same checklist as above, scoped to the operator-orchestrator image.
2. **Pull** — operator runs `scripts/pull-images.sh main` (Step 7).
3. **Stop**: `scripts/stop-services.sh` (Step 7) gracefully stops the operator-orchestrator service.
4. **Start**: `scripts/start-services.sh main` (Step 7) brings the new image up. `HEALTHCHECK` `--start-period=10s` allows DB reconnect.
5. **Audit**: `journalctl -g AZAION_UPDATE_EVENT --since "10 min ago"` on the operator workstation confirms the new revision.
6. **Smoke test** — operator runs the C12 `--flight-file <offline_fixture>` path against a known-good flight DTO; verifies the `FlightsApiClient` round-trip succeeds.
### Post-deploy monitoring window
| Window | What to watch | Action on regression |
|--------|---------------|----------------------|
| First 15 min | journald `AZAION_UPDATE_EVENT` cadence; container `HEALTHCHECK` status | Roll back immediately (§ Rollback Procedures) |
| First flight (airborne) | QGC STATUSTEXT + `GpsDeniedHealth` MAVLink stream; FDR `overrun` count | Operator aborts flight if `GpsDeniedHealth` degrades; FC failsafe is the safety net |
| First post-flight pull (airborne) | FDR `flight_footer.clean_shutdown` flag; `records_dropped_overrun`; per-component `tile_match`, `c6.eviction_batch` baselines | If `clean_shutdown=false` or baselines drifted → roll back; required post-mortem |
## Rollback Procedures
### Trigger Criteria
| Severity | Trigger | Decision lead |
|----------|---------|---------------|
| **Immediate rollback** | New image fails `HEALTHCHECK` within 5 minutes of `AZAION_UPDATE_EVENT`; or `flight_footer.clean_shutdown=false` on the first flight under the new image | Flight operator (airborne) / Suite operator (workstation) |
| **Same-day rollback** | NFT-PERF baseline regression > 10% (frame deadline miss rate, end-to-end pose latency); FDR `records_dropped_overrun` non-zero and above the per-flight threshold; sustained `c6.eviction_batch` activity above baseline | Operator + GPS-Denied Onboard owner |
| **Manual rollback** | Operator judgement (visible operational anomaly without a clear FDR signal) | Operator |
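The trigger table can be read as a small decision function. The following is a minimal sketch, assuming a hypothetical `DeployObservation` summary object and treating "within 5 minutes" as inclusive; neither the class nor the field names come from this repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeployObservation:
    """Hypothetical summary of what was observed after a deploy."""
    minutes_to_healthcheck_failure: Optional[float] = None  # None: HEALTHCHECK never failed
    clean_shutdown: bool = True             # flight_footer.clean_shutdown on the first flight
    perf_regression_pct: float = 0.0        # worst NFT-PERF regression vs baseline
    overrun_above_threshold: bool = False   # records_dropped_overrun over per-flight threshold
    eviction_above_baseline: bool = False   # sustained c6.eviction_batch activity
    operator_flagged: bool = False          # operator judgement without a clear FDR signal


def rollback_severity(obs: DeployObservation) -> Optional[str]:
    """Map an observation to the most severe matching row of the trigger table."""
    failed_fast = (
        obs.minutes_to_healthcheck_failure is not None
        and obs.minutes_to_healthcheck_failure <= 5  # assumption: boundary is inclusive
    )
    if failed_fast or not obs.clean_shutdown:
        return "immediate"
    if (obs.perf_regression_pct > 10
            or obs.overrun_above_threshold
            or obs.eviction_above_baseline):
        return "same-day"
    if obs.operator_flagged:
        return "manual"
    return None  # no rollback trigger matched
```

Checking the most severe rows first means an observation that matches several rows resolves to the shortest decision window.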
### Rollback Steps (airborne Jetson)
1. **Confirm the flag** — `/run/azaion/in-flight` is clear. If a flight is live, the FC's failsafe + operator's QGC abort path take precedence; rollback happens after landing.
2. **Identify the previous-good SHA** — `journalctl -g AZAION_UPDATE_EVENT --since 24h` on the affected Jetson shows the last successful revision.
3. **Tag rollback** — operator retags the registry: `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm` → previous SHA. (Cycle-1: operator pulls + retags via the registry UI; cycle-2: scripted via `scripts/deploy.sh rollback <sha>`.)
4. **Wait for Watchtower** — next poll detects the SHA change + pulls the previous image.
5. **Verify** — `journalctl -g AZAION_UPDATE_EVENT --since 10min` shows the rollback revision; companion `HEALTHCHECK` is healthy.
6. **DB rollback** — cycle-1: not applicable (bootstrap-only schema). Cycle-2+: if the new image applied a migration, run the DOWN script if reversible; otherwise escalate to GPS-Denied Onboard owner + suite operator before proceeding.
7. **Notify** — stakeholders informed; rollback flagged for post-mortem within 24 hours.
### Rollback Steps (operator workstation)
1. `scripts/stop-services.sh` (Step 7) stops the operator-orchestrator service.
2. Operator runs `scripts/pull-images.sh <previous_sha>` (Step 7).
3. `scripts/start-services.sh <previous_sha>` (Step 7) brings the previous image up.
4. Verify via `HEALTHCHECK` + offline `--flight-file` smoke.
5. DB rollback as above (cycle-1 n/a; cycle-2+ per migration tool).
6. Notify suite operator.
### Post-mortem (required after every production rollback)
Recorded in `_docs/_process_leftovers/<YYYY-MM-DD>_<topic>_rollback.md` and replayed at the next `/autodev` invocation per `.cursor/rules/tracker.mdc` Leftovers Mechanism. Contents:
- **Timeline** — `AZAION_UPDATE_EVENT` deploy event → first failure observation → rollback completion.
- **Root cause** — pulled from FDR + journald + Woodpecker pipeline.
- **What went wrong** — gate that should have caught it (CI? HITL? Pre-flight checklist?).
- **Prevention** — concrete checklist edit or test addition. Lessons appended to `_docs/LESSONS.md` per the autodev retrospective conventions.
## Deployment Checklist
The pre-deploy checklist above is the canonical one; it is repeated here in the standard skill format for traceability:
- [ ] All CI tests pass on the target branch (cycle-1: `01-test.yml` manual run; cycle-2: push gate)
- [ ] Security scan clean — re-validated against current pins; OpenCV CVE replay condition checked (`_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md`)
- [ ] Docker images built + pushed under `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>`; OCI labels + `AZAION_REVISION` env stamped per AZ-204
- [ ] Database migrations (cycle-2+): reviewed, tested, backward-compatible, flight-state-gated, operator-approved
- [ ] Environment variables configured per-environment per `environment_strategy.md` § Environment Variables
- [ ] Health check (`python3 -m gps_denied_onboard.healthcheck`) returns 0 on a dry-run against the target image
- [ ] Observability touchpoints active: `LOG_SINK` honoured, FDR mount writable, `jetson-stats` accessible inside the container (Tier-2)
- [ ] Rollback plan documented — previous-good SHA recorded; rollback steps reviewed
- [ ] Stakeholders notified of deployment window (flight operator + suite operator + GPS-Denied Onboard owner)
- [ ] Operator available during the post-deploy monitoring window (first 15 minutes + first flight)
## Self-verification
- [x] Deployment strategy chosen (Watchtower floating-tag pull-on-ground) and justified (single instance per role, ground-only updates, FC-managed in-flight failsafe)
- [x] Zero-downtime stance: **not applicable in flight**; ground-only — explicitly justified
- [x] Health checks defined (exec-based `HEALTHCHECK` covering liveness + readiness; FC watchdog covers in-flight liveness via FC failsafe)
- [x] Rollback trigger criteria (immediate / same-day / manual) + steps for both airborne and operator workstation
- [x] Deployment checklist complete and grounded in the project's actual gates (`AZAION_UPDATE_EVENT` audit, CVE replay, `/run/azaion/in-flight` flag, signing key provisioning)
- [x] Post-mortem path defined and tied to the `_docs/_process_leftovers/` + `_docs/LESSONS.md` mechanism
- [x] Graceful-shutdown sequence covers the FDR-flush + `flight_footer.clean_shutdown` invariants
## BLOCKING — User Confirmation Required
This is the deploy skill Step 6 BLOCKING gate per `.cursor/skills/deploy/SKILL.md` § Methodology Quick Reference. Step 7 (Deployment Scripts) writes executable shell scripts that automate the procedures above; user confirmation that the procedure is correct is required before scripts are generated.
+132
@@ -0,0 +1,132 @@
# GPS-Denied Onboard — Environment Strategy
> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 4. Builds on
> Step 1 (`reports/deploy_status_report.md`), Step 2 (`containerization.md`),
> and Step 3 (`ci_cd_pipeline.md`). The deploy skill's standard
> Dev/Staging/Production template is adapted here for a Jetson-airborne
> system: production has two distinct targets (airborne Jetson + operator
> workstation), and "staging" maps to a lab Jetson HITL rig rather than a
> classical cloud pre-prod environment.
## Environments
| Environment | Purpose | Infrastructure | Data Source |
|-------------|---------|----------------|-------------|
| **Development** | Local developer workflow on a Tier-1 workstation (Linux/macOS-Colima). Runs the full Tier-1 stack (`companion-tier1` + `operator-orchestrator` + `mock-suite-sat-service` + `db`) for unit + integration + Tier-1 e2e (Reality Gate replay). | Docker Compose (`docker-compose.yml`, `docker-compose.test.yml`); named volumes (`db-data`, `fdr-data`, `tile-data`); bind-mount `tests/fixtures:/fixtures:ro`. Optional dev Postgres on host. | Seed data via Docker init scripts; **mocked `satellite-provider`** via `mock-suite-sat-service`; **dev MAVLink signing key** from `tests/fixtures/mavlink_signing/dev_key` (with `BUILD_DEV_STATIC_KEY=ON` on dev containers only); **Derkachi replay clip + tlog** committed under `_docs/00_problem/input_data/`. |
| **Staging** | Lab / research Jetson HITL rig — same Jetson Orin Nano Super hardware as airborne, but on the bench: SITL or recorded tlog as the FC source, recorded video as the camera source, no live flight. Used for pre-flight validation, NFT-PERF-* Tier-2 runs (when AZ-592 / AZ-593 land), and IT-12 comparative study. | Tier-2 hardware (Jetson Orin Nano Super) running JetPack 6.2 host OS + Docker via `runtime: nvidia`; image pulled from suite registry (`${REGISTRY_HOST}/azaion/gps-denied-onboard:dev-arm` per cycle-1 tag-suffix, eventually `:stage-arm`); compose file `docker-compose.test.jetson.yml` for HITL e2e; Postgres 16 native on host. | Recorded Derkachi clip + SITL tlog (deterministic); test calibration JSON (`adti26.json`); **dev signing key** (per-flight rotation disabled — staging FC is SITL, not signed). Mirrors Production volume mount layout (`/var/lib/gps-denied/{fdr,tiles}`, `/data/models`) so calibration-cache + INT8-engine artefacts are interchangeable between bench and field. |
| **Production** | Two distinct deploy targets, both handling real (non-anonymized) flight data: (a) **airborne Jetson Orin Nano Super** carried on the aircraft, running the `companion-jetson` image under the parent-suite Watchtower flow per `containerization.md` ADR-005 amendment; (b) **operator workstation** running `operator-orchestrator` for pre-flight tile provisioning + post-landing upload via `FlightsApiClient` / `TileUploader`. | (a) Airborne: parent-suite `_infra/deploy/jetson/docker-compose.yml`, `runtime: nvidia`, Watchtower polling `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm`, host-mounted volumes for FDR (≥ 64 GB) + tile cache (≥ 10 GB) + model cache; native Postgres 16 on the Jetson NVM. (b) Operator workstation: `docker compose up` with `gps-denied-onboard/operator-orchestrator:main` or installed via `pull-images.sh` → `start-services.sh`; native Postgres 16 on the workstation. | Real flight data — live FC (ArduPilot Plane signed MAVLink 2.0, or iNav MSP2 unsigned), live nav camera (ADTi 20MP), live `satellite-provider` REST + on-disk tiles. **Per-flight ephemeral MAVLink + onboard signing keys** generated at takeoff load, rotated per flight, logged to FDR. Operator workstation reads `satellite-provider` API token from OS keyring; never written to any image. |
### Tier ↔ Environment Mapping
| Environment | Tier-1 image(s) used | Tier-2 image(s) used | Notes |
|-------------|----------------------|------------------------|-------|
| Development | `companion-tier1`, `operator-orchestrator`, `mock-suite-sat-service` | — | All four services (these three images plus `db`) via `docker-compose.yml`. |
| Staging (lab Jetson) | — | `companion-jetson` (when cycle-2 ships), or `companion-tier1` in Tier-1-on-Jetson interim | Tier-2 Jetson HITL pulls the arm64 image; `docker-compose.test.jetson.yml` orchestrates. |
| Production — airborne | — | `companion-jetson` (cycle-2) | Watchtower-managed; cycle-1 ships only the planning + Tier-1 images per `ci_cd_pipeline.md` Registry Layout. |
| Production — operator workstation | `operator-orchestrator` | — | Cycle-1 already builds + pushes `${REGISTRY_HOST}/azaion/gps-denied-onboard-operator-orchestrator:<branch>-arm`. |
## Environment Variables
### Required Variables (companion + operator-orchestrator)
> Source of truth: `.env.example` at repo root (extended in Step 1). The
> table below references that file; do NOT re-declare variable names here.
| Variable | Purpose | Dev Default (Tier-1 Docker) | Staging Source (lab Jetson) | Production Source |
|----------|---------|------------------------------|------------------------------|--------------------|
| `GPS_DENIED_FC_PROFILE` | FC adapter selection | `ardupilot_plane` | Per-rig fixed (matches the SITL profile in use) | Per-flight config from operator; written into the per-flight bundle on the operator workstation |
| `GPS_DENIED_TIER` | Runtime tier gate | `1` | `2` | `2` (baked into the Jetson image manifest) |
| `DB_URL` | Postgres connection | `postgresql://gps_denied:dev@db:5432/gps_denied` (dev Docker creds) | Lab Postgres init script — per-host random password | Per-host native Postgres init with random password; written to `/etc/gps-denied/.pgpass` (root:gps-denied, 0640) and exported by the systemd / Docker run hook |
| `SATELLITE_PROVIDER_URL` | Pre-flight tile download | `http://mock-sat:5100` | Lab `satellite-provider` (LAN-resolved); blank on airborne | Operator workstation env / VPN-resolved hostname; **empty on airborne** (defence-in-depth NFT-SEC-05 — in-flight egress lockdown) |
| `CAMERA_CALIBRATION_PATH` | Camera calibration JSON | `/fixtures/calibration/adti26.json` | `/etc/gps-denied/calibration/adti26.json` (operator copies the test fixture for HITL) | `/etc/gps-denied/calibration/adti20.json` (operator-acquired per D-PROJ-1) |
| `LOG_LEVEL` | Log verbosity | `DEBUG` | `INFO` | `INFO` |
| `LOG_SINK` | Log destination | `console` | `journald` (lab) | `fdr` on airborne; `journald` on operator workstation |
| `MAVLINK_SIGNING_KEY` | Per-flight signing key | `tests/fixtures/mavlink_signing/dev_key` (with `BUILD_DEV_STATIC_KEY=ON`) | `tests/fixtures/mavlink_signing/dev_key` (lab SITL, signing disabled or static-dev) | **Per-flight ephemeral key**, generated at takeoff load, rotated per flight, logged to FDR. Never committed; never written to the image. |
| `INFERENCE_BACKEND` | C7 backend selection | `pytorch_fp16` | `tensorrt` (Tier-2 hardware) | `tensorrt` |
| `FDR_PATH` | C13 ring writer | `/var/lib/gps-denied/fdr` (named volume `fdr-data`) | Host-mounted `/var/lib/gps-denied/fdr` on the lab Jetson | Host-mounted `/var/lib/gps-denied/fdr` on the airborne Jetson NVM partition (≥ 64 GB) |
| `TILE_CACHE_PATH` | C6 tile filesystem store | `/var/lib/gps-denied/tiles` (named volume `tile-data`) | Host-mounted `/var/lib/gps-denied/tiles` on the lab Jetson | Host-mounted `/var/lib/gps-denied/tiles` on the airborne Jetson NVM (≥ 10 GB) |
Optional / build-time strategy gating flags (`BUILD_VINS_MONO`, `BUILD_SALAD`, `BUILD_C11_TILE_MANAGER`, `BUILD_VIDEO_FILE_FRAME_SOURCE`, `BUILD_TLOG_REPLAY_ADAPTER`, `BUILD_REPLAY_SINK_JSONL`, `BUILD_DEV_STATIC_KEY`, `BUILD_STATE_ESKF`) are documented in `.env.example` and in `deploy_status_report.md` → "Required Environment Variables". Operative defaults per ADR-002 + ADR-004 + ADR-011:
- Airborne / operator-orchestrator binaries: `BUILD_C11_TILE_MANAGER=OFF` on airborne (ADR-004 process-level isolation — CI SBOM-diff + runtime self-check + NFT-SEC-02 egress test enforce); `BUILD_C11_TILE_MANAGER=ON` on operator-orchestrator only.
- Replay-mode strategy flags: `ON` on airborne + research; explicitly set in `docker-compose.test*.yml` for CI.
- `BUILD_DEV_STATIC_KEY`: **MUST stay OFF on production images.** Dev / CI containers only.
### `.env.example`
Source of truth lives at the repo root (`.env.example`), version-controlled. It contains placeholder values for all required variables plus comments for build-time gating flags. Operators copy it to `.env` (git-ignored) and fill in values per environment. Tier-2 production deploys do **not** use `.env` at all — environment variables are stamped into the systemd / Docker run hook by `start-services.sh` (Step 7) from `/etc/gps-denied/env.d/` files owned `root:gps-denied 0640`.
### Variable Validation (fail-fast at startup)
All services validate required environment variables at startup and exit non-zero with a clear error message if any are missing. Implementation lives in each component's config module:
| Component | Config module | Variables validated |
|-----------|---------------|---------------------|
| Composition root | `src/gps_denied_onboard/runtime_root/__main__.py` | `GPS_DENIED_TIER`, `GPS_DENIED_FC_PROFILE`, `LOG_LEVEL`, `LOG_SINK` |
| C6 (tile cache) | `src/gps_denied_onboard/components/c6_tile_cache/config.py` | `DB_URL`, `TILE_CACHE_PATH` |
| C7 (inference) | `src/gps_denied_onboard/components/c7_inference/config.py` | `INFERENCE_BACKEND` (must be one of `tensorrt`, `pytorch_fp16`, `onnx_trt_ep`); `INFERENCE_BACKEND=tensorrt` requires the model cache volume mount |
| C8 (FC adapter) | `src/gps_denied_onboard/components/c8_fc_adapter/config.py` | `MAVLINK_SIGNING_KEY` (when `GPS_DENIED_FC_PROFILE=ardupilot_plane`) |
| C10 (provisioning) | `src/gps_denied_onboard/components/c10_provisioning/config.py` | `SATELLITE_PROVIDER_URL` (operator-orchestrator only; **must be empty on airborne**); `CAMERA_CALIBRATION_PATH` |
| C13 (FDR) | `src/gps_denied_onboard/components/c13_fdr/config.py` | `FDR_PATH` (must be writable, ≥ 64 GB free on production) |
Health check (`python3 -m gps_denied_onboard.healthcheck`, declared in each Dockerfile) re-runs the same validation set after startup so a Docker `HEALTHY` transition is conditioned on configuration validity, not just process liveness.
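The fail-fast rule above can be sketched as a single pure function that both the startup path and the health check call. This is an illustration only: `validate_env` and `exit_code` are hypothetical names, the variable set is a subset of the table, and the real checks live in the per-component config modules.

```python
import sys
from typing import List, Mapping

# Values taken from the validation table above.
ALLOWED_BACKENDS = {"tensorrt", "pytorch_fp16", "onnx_trt_ep"}
REQUIRED = ("GPS_DENIED_TIER", "GPS_DENIED_FC_PROFILE", "LOG_LEVEL",
            "LOG_SINK", "INFERENCE_BACKEND")


def validate_env(env: Mapping[str, str], airborne: bool = False) -> List[str]:
    """Return human-readable problems; an empty list means the config is valid."""
    errors = [f"missing required variable: {name}"
              for name in REQUIRED if not env.get(name)]
    backend = env.get("INFERENCE_BACKEND")
    if backend and backend not in ALLOWED_BACKENDS:
        errors.append(f"INFERENCE_BACKEND={backend!r} not in {sorted(ALLOWED_BACKENDS)}")
    if (env.get("GPS_DENIED_FC_PROFILE") == "ardupilot_plane"
            and not env.get("MAVLINK_SIGNING_KEY")):
        errors.append("MAVLINK_SIGNING_KEY is required for ardupilot_plane")
    if airborne and env.get("SATELLITE_PROVIDER_URL"):
        errors.append("SATELLITE_PROVIDER_URL must be empty on airborne")
    return errors


def exit_code(env: Mapping[str, str], airborne: bool = False) -> int:
    """Print each problem to stderr and return a process exit code (0 = valid)."""
    errors = validate_env(env, airborne=airborne)
    for problem in errors:
        print(f"config error: {problem}", file=sys.stderr)
    return 1 if errors else 0
```

A service entry point would call `sys.exit(exit_code(os.environ))` before doing any work; a health check could call the same function so that both paths agree on what "valid" means.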
## Secrets Management
| Environment | Method | Tool / Location | Rotation |
|-------------|--------|-----------------|----------|
| Development | `.env` file (git-ignored) + `tests/fixtures/mavlink_signing/dev_key` (allow-listed in `.gitignore`) | dotenv loaded by Docker Compose; fixture key read directly by tests with `BUILD_DEV_STATIC_KEY=ON` | None — dev fixture is static. |
| Staging (lab Jetson) | `.env` file (git-ignored) on the Jetson host + same dev fixture signing key (lab SITL is not a signing-attack target) | `/etc/gps-denied/env.d/*.env` on the Jetson, `root:gps-denied 0640` | None — lab fixture is static. |
| Production — airborne | **Per-flight ephemeral MAVLink + onboard signing key generated at takeoff load, rotated per flight, logged to FDR.** The Postgres password is generated per-host at JetPack provisioning and stored in `/etc/gps-denied/.pgpass` (`root:gps-denied 0640`). The airborne image has **no inbound listeners** (NFT-SEC-05 in-flight egress lockdown) so no API secrets live on it. | Onboard secret generation: `KeySource` Protocol implemented in `src/gps_denied_onboard/components/c8_fc_adapter/key_source.py` (per-flight rotation). Postgres password: provisioning script on the Jetson host writes once at first boot. | **Per-flight rotation** for MAVLink + onboard signing keys (Principle #7). Postgres password rotated on operator-issued re-provisioning only. |
| Production — operator workstation | Operator's local credential store / OS keyring for the `satellite-provider` API token + per-flight onboard signing key staging. Suite Woodpecker global secrets (`registry_host`, `registry_user`, `registry_token`) for image pulls — already provisioned per `../_infra/ci/install-woodpecker.sh`; this submodule consumes them via `from_secret:` references in `.woodpecker/02-build-push.yml`. | macOS Keychain / GNOME-Keyring / Windows Credential Manager via a thin wrapper invoked by `start-services.sh`; Woodpecker global secrets injected as env vars at pipeline runtime. | `satellite-provider` API token: rotated by the suite operator (out-of-band); per-flight onboard signing keys rotated per flight (above). Registry token: rotated by suite operator on schedule. |
| CI | Suite-provisioned Woodpecker global secrets (`registry_host`, `registry_user`, `registry_token`) | Consumed by `.woodpecker/02-build-push.yml` via `from_secret:` references — never committed | Rotated by suite operator (out-of-band, ≤ 90 days target per suite policy). |
**Rotation policy (companion-side, normative)**:
- **Per-flight** (MAVLink 2.0 signing key + onboard signing key): mandatory; new keypair generated at takeoff load by `KeySource`, rotated even if the previous flight ended normally. Logged to FDR for chain-of-custody.
- **Per-host** (Postgres password on Jetson + operator workstation): rotated on operator-issued re-provisioning; no scheduled rotation.
- **Per-operator-credential** (`satellite-provider` API token, registry token): owned and rotated by the suite operator out-of-band; this submodule consumes whatever is provisioned.
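The per-flight rule can be illustrated with a small key source that mints a fresh random key whenever the flight changes. This is a sketch under assumptions: the method name `key_for_flight` and the class `EphemeralKeySource` are hypothetical, and the real `KeySource` Protocol in `key_source.py` may have a different shape. The 32-byte length matches the secret key size used by MAVLink 2.0 message signing.

```python
import secrets
from typing import Optional, Protocol


class KeySource(Protocol):
    """Illustrative shape only; not the repository's actual Protocol."""

    def key_for_flight(self, flight_id: str) -> bytes: ...


class EphemeralKeySource:
    """Generates a new random key per flight and never persists it."""

    KEY_BYTES = 32  # MAVLink 2.0 signing secret key length

    def __init__(self) -> None:
        self._flight_id: Optional[str] = None
        self._key: Optional[bytes] = None

    def key_for_flight(self, flight_id: str) -> bytes:
        # Rotate whenever the flight changes, even if the previous flight
        # ended normally; within one flight the key stays stable.
        if flight_id != self._flight_id or self._key is None:
            self._key = secrets.token_bytes(self.KEY_BYTES)
            self._flight_id = flight_id
        return self._key
```

Keeping the key only in memory means a restart between flights also forces a new key, which is consistent with the "never written to the image" rule.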
**No external cloud secret manager** (AWS Secrets Manager / Azure Key Vault / HashiCorp Vault) is used. The combination of (a) per-flight ephemeral signing keys generated on-device, (b) no inbound network listeners on the airborne image, (c) per-host Postgres password with no shared state across hosts, and (d) suite-managed Woodpecker secrets for CI is sufficient for the operational risk model and matches `deploy_status_report.md` → "Secret manager — Per-flight ephemeral, no external manager".
**Never commit**: real MAVLink signing keys (the dev fixture `tests/fixtures/mavlink_signing/dev_key` is the allow-listed exception); real Postgres credentials (the committed `DB_URL` in `.env.example` uses the local Docker `dev` password placeholder); `satellite-provider` API tokens; `.env` files (`.gitignore` line 64 confirms).
## Database Management
| Environment | Type | Migrations | Data |
|-------------|------|-----------|------|
| Development | Docker Postgres 16 (`db` service in `docker-compose.yml`), named volume `db-data` | Applied on container start by C6 bootstrap (idempotent `CREATE TABLE IF NOT EXISTS` for tile + descriptor index) | Seed data via the C6 bootstrap on first run; `docker compose down -v` drops the volume so the next `docker compose up --build` starts from an empty database |
| Staging (lab Jetson) | Native Postgres 16 on JetPack 6.2 host, sized ≤ 10 GB on a dedicated NVM partition | Applied via the same C6 bootstrap on first run; subsequent migrations applied via CI/CD lane (when cycle-2 lands an explicit migration runner) | Recorded Derkachi clip tile-set + descriptor index pre-loaded by `e2e/fixtures/tile-cache-builder/` |
| Production — airborne | Native Postgres 16 on the Jetson Orin Nano Super NVM partition (≥ 10 GB tile cache budget + descriptor index) | Applied via the C6 bootstrap at first systemd unit start; cycle-1 schema is bootstrap-only with no breaking migrations. Future migrations (cycle-2+): reversible, backward-compatible, applied by a dedicated migration job that is **gated by the flight-state flag** (`/run/azaion/in-flight` — no DB writes during flight) | Real flight data: pre-flight tile + descriptor index seeded by `TileDownloader` on the operator workstation, packaged by C10, and copied to the Jetson NVM at provisioning |
| Production — operator workstation | Native Postgres 16 on the operator workstation | Applied via the same C6 bootstrap; future migrations applied via CI/CD with operator approval | Operator-managed: tile downloads via `satellite-provider`, post-landing uploads via `TileUploader` |
### Migration Rules (cycle-2+ — not yet exercised)
- **Reversible**: every migration ships with an explicit DOWN / rollback script.
- **Backward-compatible**: a new schema version must continue to work with the previous binary's read path until the next release rotates both. Sequence: deploy migration → wait one release cycle → remove old code path.
- **Production gate**: production migrations require operator approval recorded in the Woodpecker UI before apply.
- **Flight-state gate**: migration jobs on the airborne Jetson refuse to run when `/run/azaion/in-flight` is set. The post-landing operator-issued reconcile path is the only window for schema changes on the airborne side.
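The flight-state gate can be sketched as a guard that wraps the migration call. This sketch assumes the flag is "set" when the file `/run/azaion/in-flight` exists; `InFlightError` and `run_migration_if_grounded` are hypothetical names, and no migration runner exists yet in cycle-1.

```python
from pathlib import Path
from typing import Callable

IN_FLIGHT_FLAG = Path("/run/azaion/in-flight")


class InFlightError(RuntimeError):
    """Raised when a schema change is attempted while the flight flag is set."""


def run_migration_if_grounded(apply: Callable[[], None],
                              flag: Path = IN_FLIGHT_FLAG) -> None:
    # Assumption: the flag is a marker file whose presence means "in flight".
    if flag.exists():
        raise InFlightError(f"{flag} is set; refusing to apply migration")
    apply()
```

Passing the flag path as a parameter keeps the guard testable on a workstation where `/run/azaion` does not exist.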
### Cycle-1 Migration Status
Cycle-1 ships **without a migration runner**. The C6 bootstrap path uses idempotent `CREATE TABLE IF NOT EXISTS` for the tile + descriptor index schema, which is enough for cycle-1 because no schema change has happened since the initial bootstrap. Adding a dedicated migration tool (Alembic / similar) is logged as a cycle-2 follow-up — recorded here so it is not lost.
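To show why an idempotent bootstrap is enough while the schema is unchanged, here is a small sketch using Python's built-in `sqlite3` as a stand-in for Postgres. The table and column names are invented for illustration and are not the C6 schema.

```python
import sqlite3

# Hypothetical schema; only the IF NOT EXISTS pattern mirrors the text above.
BOOTSTRAP_SQL = """
CREATE TABLE IF NOT EXISTS tiles (
    tile_id        TEXT PRIMARY KEY,
    content_sha256 TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS descriptor_index (
    tile_id    TEXT NOT NULL REFERENCES tiles(tile_id),
    descriptor BLOB NOT NULL
);
"""


def bootstrap(conn: sqlite3.Connection) -> None:
    """Create the schema if it is missing; safe to call on every start."""
    conn.executescript(BOOTSTRAP_SQL)
```

Running `bootstrap` on every start leaves existing rows untouched. The pattern stops being sufficient as soon as a column or constraint changes, because `CREATE TABLE IF NOT EXISTS` silently skips a table that already exists with the old shape.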
## Self-verification
- [x] All three environments (Development / Staging / Production) defined with clear purpose
- [x] Tier-1 ↔ Tier-2 mapping explicit (which image runs where)
- [x] Operator workstation called out as a distinct production target alongside airborne Jetson
- [x] Environment variable documentation references `.env.example` (source of truth) without re-declaring names
- [x] Per-variable Dev / Staging / Production sources tabulated
- [x] No secrets in this document (only placeholders + locations)
- [x] Secret manager strategy specified — per-flight ephemeral generation, no external cloud manager, suite-managed Woodpecker secrets for CI; rotation policy normative for per-flight rotation
- [x] Database strategy per environment (Docker Postgres → native Postgres on Jetson + operator workstation); cycle-1 bootstrap-only migration stance recorded; cycle-2 migration rules drafted
- [x] Flight-state gate (`/run/azaion/in-flight`) honoured in production-side migration rules
- [x] Variable validation strategy (fail-fast + healthcheck re-run) mapped to per-component config modules
## Next Steps
1. **Proceed to Step 5 (Observability)** — define structured logging (`LOG_SINK`), metrics (per-component counters, Prometheus-compatible exposition if cycle-2 adds it), tracing (out-of-scope for cycle-1; FDR records serve as the airborne audit trail), and the `AZAION_UPDATE_EVENT` journald audit chain.
2. **Step 6 (Deployment Procedures)** must reference this environment matrix when documenting per-environment deploy procedures (Tier-1 dev `docker compose up`, lab Jetson HITL `docker-compose.test.jetson.yml`, airborne Watchtower-driven update, operator workstation `docker compose up` with image pull).
3. **Step 7 (Deployment Scripts)** must implement the env-loader hook (`start-services.sh` reading `/etc/gps-denied/env.d/*.env` per-host on production targets), the per-host Postgres password generation hook, and the `KeySource` per-flight ephemeral key invocation contract.
4. **Cycle-2 follow-up**: introduce a dedicated migration runner (Alembic or equivalent) with the flight-state-gated apply path and operator-approval gate.
+282
@@ -0,0 +1,282 @@
# GPS-Denied Onboard — Observability
> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 5. Builds on
> Step 1 (`reports/deploy_status_report.md`), Step 2 (`containerization.md`),
> Step 3 (`ci_cd_pipeline.md`), and Step 4 (`environment_strategy.md`). The
> deploy skill's standard observability template (Prometheus `/metrics` +
> OpenTelemetry + PagerDuty) is adapted here for an airborne autonomous
> system: the airborne image has **no inbound listeners** (NFT-SEC-05
> in-flight egress lockdown), so the canonical observability surface is the
> on-device **Flight Data Recorder (FDR)** binary ring buffer, replayed
> off-flight by post-landing tooling. Operator workstation + CI keep the
> conventional logging-to-stdout / journald patterns.
## Observability Architecture (one-paragraph)
The airborne image (`companion-jetson` / `companion-tier1`) writes
**structured FDR records** to a 64 GB ring buffer (`/var/lib/gps-denied/fdr`)
via the `shared_fdr_client` (`producer → SPSC ring → C13 writer`). Logs at
`WARN` and above are mirrored into FDR as `kind="log"` records by the
`fdr_log_bridge` (AZ-267); `INFO` and `DEBUG` logs go only to `LOG_SINK` (`console` in
dev, `journald` on the operator workstation, `fdr` on airborne — never to
file). Telemetry is captured as kind-specific FDR records (`vio.tick`,
`state.tick`, `tile_match`, `c6.write`, `c6.eviction_batch`, etc.) rather
than via a Prometheus endpoint, because no inbound TCP is permitted in
flight. Post-flight tooling on the operator workstation parses the FDR
segments using the **frozen, versioned `fdr_record_schema` v1.3.0** and
feeds Grafana / Jupyter / one-off scripts. The suite-mandated
**`AZAION_UPDATE_EVENT` journald audit chain** + OCI image labels
(`org.opencontainers.image.revision/created/source`) + `ENV
AZAION_REVISION=$CI_COMMIT_SHA` form the deploy-side audit trail (AZ-204).
**`jetson-stats` (`jtop`) device telemetry** (thermal zones, CPU/GPU
clocks, power rails) is sampled by C7 + C4 to drive the
`D-CROSS-LATENCY-1` auto-degrade hybrid trigger; samples land in FDR
alongside the matcher / pose ticks.
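The "never silent" overrun behaviour of the producer-side ring can be sketched as follows. This is a single-threaded illustration, not a lock-free SPSC implementation; `BoundedRing`, the drop-newest policy, and the `dropped` field name are assumptions made for the example.

```python
from collections import deque
from typing import Any, Deque, Dict, List

Record = Dict[str, Any]


class BoundedRing:
    """Fixed-capacity queue that counts drops instead of losing them silently."""

    def __init__(self, producer_id: str, capacity: int) -> None:
        self.producer_id = producer_id
        self._capacity = capacity
        self._items: Deque[Record] = deque()
        self._dropped = 0

    def push(self, record: Record) -> bool:
        """Enqueue a record; return False (and count the drop) when full."""
        if len(self._items) >= self._capacity:
            self._dropped += 1  # assumption: the newest record is the one dropped
            return False
        self._items.append(record)
        return True

    def drain(self) -> List[Record]:
        """Return queued records, followed by one overrun record if any were dropped."""
        out = list(self._items)
        self._items.clear()
        if self._dropped:
            out.append({"kind": "overrun",
                        "producer_id": self.producer_id,
                        "dropped": self._dropped})
            self._dropped = 0
        return out
```

For example, pushing three records into a ring of capacity two and then draining yields the two accepted records plus a single `overrun` record with `dropped` equal to 1.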
## Logging
### Format
Structured records to `LOG_SINK`. No file-based logging in containers.
The `LOG_SINK` env var (Step 4) selects the destination per environment.
#### Common log envelope (per-record fields)
Source of truth: `_docs/02_document/contracts/shared_log_bridge/log_record_schema.md` v1.0.0 — referenced by the `fdr_log_bridge` (AZ-267). Every onboard log record carries:
```json
{
"timestamp": "2026-05-10T03:14:15.123456Z",
"level": "INFO",
"service": "gps-denied-onboard",
"component": "c2_vpr",
"flight_id": "<uuid>",
"frame_id": 12345,
"kind": "vpr.warmup",
"msg": "loaded",
"kv": {"model": "salad"},
"exc": null
}
```
| Field | Purpose | Notes |
|-------|---------|-------|
| `timestamp` | ISO 8601 UTC, microsecond precision | RFC 3339 with `Z` suffix |
| `level` | `DEBUG \| INFO \| WARN \| ERROR` | `WARN` + `ERROR` are also mirrored into FDR via `fdr_log_bridge` |
| `service` | `gps-denied-onboard` | Constant per submodule |
| `component` | Module slug from `module-layout.md` (`c2_vpr`, `c6_tile_cache.store`, `shared.fdr_client`, …) | Matches `producer_id` on the corresponding FDR record |
| `flight_id` | UUID assigned at flight open by C13 (`flight_header`) | Correlation across all components within one flight |
| `frame_id` | Monotonic per-frame counter from `runtime_root` | Cross-component frame correlation (VIO ↔ matcher ↔ state) |
| `kind` | Dotted snake_case event tag (closed enum per component) | E.g. `vpr.warmup`, `c6.evict.budget`, `c8.signing_key_rotation` |
| `msg` | Short human-readable event description | No PII; no secrets; no file payloads |
| `kv` | Bag of typed scalars | JSON-safe; no nested blobs > 4 KiB |
| `exc` | Optional exception class + traceback | Present only on `ERROR`; truncated to 4 KiB |
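A record in this envelope can be produced with the standard library alone. The sketch below assumes a hypothetical helper named `make_log_record`; the 4 KiB cap on `exc` is approximated by counting characters rather than bytes.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional

LEVELS = ("DEBUG", "INFO", "WARN", "ERROR")
MAX_EXC_CHARS = 4096  # stand-in for the 4 KiB cap; counts characters, not bytes


def make_log_record(level: str, component: str, flight_id: str, frame_id: int,
                    kind: str, msg: str,
                    kv: Optional[Dict[str, Any]] = None,
                    exc: Optional[str] = None,
                    now: Optional[datetime] = None) -> str:
    """Serialise one log record using the field order of the envelope above."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    # `now` is expected to be timezone-aware; it is normalised to UTC.
    ts = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {
        # %f always yields six digits, giving microsecond precision.
        "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%S.%f") + "Z",
        "level": level,
        "service": "gps-denied-onboard",
        "component": component,
        "flight_id": flight_id,
        "frame_id": frame_id,
        "kind": kind,
        "msg": msg,
        "kv": kv or {},
        # exc is only carried on ERROR records, truncated to the cap.
        "exc": exc[:MAX_EXC_CHARS] if (exc and level == "ERROR") else None,
    }
    return json.dumps(record)
```

Injecting `now` keeps the function deterministic in tests while production callers rely on the default clock.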
### Log Levels
| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring offline review | `c5.solver.diverged`, `c8.signing_handshake_failed`, `c6.write_failed` |
| WARN | Degraded operation, retry, fallback engaged | `c4.pose.degraded_to_pnp`, `c6.freshness.rejected`, `c7.tensorrt_engine_rebuild` |
| INFO | Significant in-flight business events | `c8.signing_key_rotation`, `flight_header`, `flight_footer`, `c11.upload_batch_queued` |
| DEBUG | Detailed diagnostics (dev only) | Per-frame VIO covariance dump, full matcher correspondences list |
`WARN` + `ERROR` are mirrored into FDR via `fdr_log_bridge` (AZ-267) so they survive a post-landing `journalctl` clear. `INFO` + `DEBUG` go only to `LOG_SINK`.
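The mirroring rule maps naturally onto a `logging.Handler` with a level threshold. The sketch below is an assumption about how such a bridge could look, not the `fdr_log_bridge` implementation; a plain list stands in for the FDR writer, and `FdrMirrorHandler` and `make_logger` are hypothetical names.

```python
import logging
from typing import Any, Dict, List


class FdrMirrorHandler(logging.Handler):
    """Copies WARNING-and-above records into a sink as kind="log" entries."""

    def __init__(self, sink: List[Dict[str, Any]]) -> None:
        super().__init__(level=logging.WARNING)  # INFO and DEBUG never reach emit()
        self._sink = sink

    def emit(self, record: logging.LogRecord) -> None:
        # Python names the level WARNING; the envelope in this document uses WARN.
        level = "WARN" if record.levelname == "WARNING" else record.levelname
        self._sink.append({"kind": "log", "level": level,
                           "component": record.name,
                           "msg": record.getMessage()})


def make_logger(name: str, fdr_sink: List[Dict[str, Any]]) -> logging.Logger:
    """Build a logger whose WARN and ERROR records are mirrored into fdr_sink."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.addHandler(FdrMirrorHandler(fdr_sink))
    return logger
```

A second handler for `LOG_SINK` would be attached to the same logger without a threshold, so every level reaches the sink while only `WARN` and `ERROR` are duplicated into FDR.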
### Destinations and Retention
| Environment | `LOG_SINK` | Destination | Retention |
|-------------|------------|-------------|-----------|
| Development (Tier-1 Docker) | `console` | Docker container stdout (`docker compose logs companion`) | Session — cleared on `docker compose down` |
| CI (Woodpecker) | `console` | Woodpecker UI stdout capture | Per the suite Woodpecker retention policy (operator-managed; today ≤ 30 days) |
| Staging (lab Jetson) | `journald` | Host journald | Per the host's `journald.conf` (suite default: ~7 days rolling) |
| Production — airborne | `fdr` | FDR ring buffer at `/var/lib/gps-denied/fdr` (≥ 64 GB) | Bounded by ring capacity; rolls over per `segment_rollover` FDR record. Post-flight operator pulls segments to long-term storage on the operator workstation. |
| Production — operator workstation | `journald` | Host journald | Per the host's `journald.conf` (operator-managed; recommendation: 30 days for the operator-orchestrator service unit) |
### "PII" Rules (read: operational secrets)
This system has no end-user PII surface — flights, MAVLink, and tile data are operational rather than personal. The equivalent restrictions are **operational-secret leakage** controls:
- **Never log** MAVLink 2.0 signing key bytes, per-flight onboard signing key bytes, `satellite-provider` API tokens, registry tokens, or Postgres credentials. The `KeySource` Protocol (C8) is the only component that ever holds key material, and its log path emits **only** the rotation event tag + key fingerprint (SHA-256 first 8 bytes), never the key.
- **Mask** absolute file paths in any record that references operator-specific layouts (e.g. `/Users/<operator>/…` collapsed to `~/…`).
- **Never log** raw camera frame bytes or full tile JPEGs inline — they go to sidecar paths via FDR's `failed_tile_thumbnail` (≤ 0.1 Hz rate cap) or `mid_flight_tile_snapshot`.
- **Never log** raw GPS coordinates unless the flight's `restricted_geographic_log_redaction` config is `off` (operator-set at takeoff load).
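Two of the rules above are mechanical enough to sketch: the key fingerprint (first 8 bytes of the SHA-256 digest) and home-directory masking. The helper names are hypothetical, and treating `/home/<user>` the same as `/Users/<operator>` is an assumption that goes beyond the example in the list.

```python
import hashlib
import re


def key_fingerprint(key: bytes) -> str:
    """Hex of the first 8 bytes of SHA-256(key); safe to log in place of the key."""
    return hashlib.sha256(key).digest()[:8].hex()


# Assumption: Linux-style /home/<user> is masked like macOS-style /Users/<user>.
_HOME_PREFIX = re.compile(r"^/(?:Users|home)/[^/]+")


def mask_home(path: str) -> str:
    """Collapse an operator-specific home directory prefix to '~'."""
    return _HOME_PREFIX.sub("~", path)
```

For example, `mask_home("/Users/alice/flights/a.fdr")` returns `~/flights/a.fdr`, while a system path such as `/var/lib/gps-denied/fdr` is returned unchanged.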
## Telemetry (FDR-based, not Prometheus)
### Why FDR, not Prometheus / OTel
The airborne image runs under NFT-SEC-05 (in-flight egress lockdown — no inbound listeners, outbound only to the FC over UART/USB and to QGroundControl as a 12 Hz downsampled summary over MAVLink 2.0). A `/metrics` HTTP endpoint would violate this, and a push-mode OTel exporter has no in-flight collector to reach. The FDR ring is the canonical telemetry sink; post-flight tooling converts FDR records into whatever observability backend the operator prefers (Grafana, Jupyter, ad-hoc scripts).
The **operator workstation** is *not* in-flight-locked-down; cycle-2 may add a Prometheus `/metrics` endpoint on the `operator-orchestrator` service (see "Future Work" below). Cycle-1 leaves both the operator-orchestrator and airborne side on the FDR + structured logs path for consistency.
### FDR Record Kinds (cycle-1 metrics surface)
Source of truth: `_docs/02_document/contracts/shared_fdr_client/fdr_record_schema.md` v1.3.0. Each `kind` is the metric.
| Metric (FDR `kind`) | Producer | Type (intent) | What it tells the operator |
|---------------------|----------|----------------|----------------------------|
| `vio.tick` | C1 | per-frame snapshot | VIO output (`R`, `t`), pose covariance proxies, last-anchor age, monocular reproj error, IMU bias norm |
| `state.tick` | C5 | per-frame snapshot | Smoothed fused-pose tick from iSAM2 (or ESKF baseline) + 2x2 covariance + estimator label |
| `tile_match` | C2.5 / C3 | per-match snapshot | Tile id, VPR score, match count, RANSAC inlier count |
| `c6.write` | C6 | counter-ish (per-tile) | Successful `write_tile` — tile id, source, disk bytes, content SHA-256 |
| `c6.write_failed` | C6 | counter-ish (per-failure) | Failed `write_tile`; `reason ∈ {content_hash_mismatch, freshness_reject, metadata_error, fs_error}` |
| `c6.freshness.rejected` | C6 | counter-ish (per-reject) | Active-conflict-stale tile rejected — `tile_id`, `age_seconds`, threshold |
| `c6.freshness.downgraded` | C6 | counter-ish (per-downgrade) | Stable-rear-stale tile downgraded — same shape as rejected |
| `c6.eviction_batch` | C6 | batch counter (per sweep) | Cache budget enforcer evicted N tiles to make room — trigger tile, freed bytes, count, first 5 evicted ids |
| `overrun` | `shared.fdr_client` | counter (per drop) | FDR ring overrun — `producer_id` of the originating queue + dropped count (`> 0`). AC-NEW-3: never silent. |
| `segment_rollover` | C13 writer | counter (per rotation) | Segment file rotated (including 64 GB cap drops) |
| `failed_tile_thumbnail` | C6 / C11 | rate-capped sample | Forensic JPEG thumbnail (≤ 0.1 Hz). AC-8.5 |
| `mid_flight_tile_snapshot` | C13 snapshot path | sample pointer | Mid-flight tile snapshot pointer (sidecar). AC-8.4 |
| `flight_header` | C13 writer | once-per-flight | `flight_id`, start ISO/monotonic, config snapshot, signing-key rotation event, manifest content hashes, build info |
| `flight_footer` | C13 writer | once-per-flight | `flight_id`, end ISO/monotonic, records written / dropped (overrun) / bytes / rollover count / clean-shutdown flag |
### Device Telemetry (`jetson-stats` / `jtop`)
`D-CROSS-LATENCY-1` requires runtime thermal + power + GPU clock telemetry to drive the auto-degrade hybrid trigger (frame deadline missed × thermal headroom). Cycle-1 source: `jetson-stats` (`jtop`) accessed inside the `companion-jetson` container via `runtime: nvidia` + the nvidia-container-runtime device passthrough — same pattern the suite's `detections` service uses on the same hardware.
| Signal | Source | Sample rate | Consumer |
|--------|--------|-------------|----------|
| GPU clock (MHz) | `jtop.gpu` | 1 Hz | C7 (degrade gate); recorded into FDR via `c7.device_telemetry` log records (`kind="c7.thermal_headroom"`) |
| GPU/CPU temperature (°C) | `jtop.temperature` | 1 Hz | C4 / C7 hybrid trigger |
| Power draw (mW) | `jtop.power` | 1 Hz | Cycle-2 derate hysteresis |
| Memory pressure | `jtop.memory` | 1 Hz | C6 eviction batch hysteresis |
Cycle-1: `jtop` runs in-process inside the companion container; samples are emitted as FDR `kind="c7.thermal_headroom"` records. Cycle-2 may move this to a sidecar Python thread once the Step 2 BLOCKING gate "`jetson-stats` thermal telemetry under Docker" (`containerization.md` § Step 2 Validation Gates) is signed off on the real Tier-2 Jetson.
### Collection Interval
| Source | Interval |
|--------|----------|
| Per-frame producers (C1 `vio.tick`, C5 `state.tick`, C3 `tile_match`) | Camera frame cadence (target ≥ 4 Hz on Tier-2; per `_docs/02_document/architecture.md` Vision) |
| Per-write producers (C6 `c6.write`, `c6.write_failed`, `c6.freshness.*`) | Per-event (write-path triggered) |
| Per-batch producers (C6 `c6.eviction_batch`) | Per-sweep (only when ≥ 1 tile evicted) |
| `jetson-stats` (`jtop`) | 1 Hz |
| `flight_header` / `flight_footer` | Once per flight |
| `segment_rollover` | Per segment rotation |
There is no Prometheus-style "scrape interval" because there is no scraping endpoint — the FDR ring is push-only from producers, drained by C13's writer thread.
## Distributed Tracing
### Architecture stance (cycle-1)
**No W3C Trace Context. No OpenTelemetry SDK.** The airborne image's correlation key is the pair `(flight_id, frame_id)`:
- `flight_id` (UUID) is assigned at flight open by C13 and written into `flight_header`. Every log record and FDR record within that flight carries it.
- `frame_id` (monotonic per-frame counter) is assigned by the composition root's frame pipeline. Every per-frame FDR record (`vio.tick`, `state.tick`, `tile_match`, `c6.write` …) carries it.
This is sufficient because the airborne pipeline is **in-process, single-camera, single-FC** — there are no inter-service RPC hops to trace. Post-flight tooling reconstructs the per-frame causal chain by joining FDR records on `(flight_id, frame_id)`.
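
The join described above can be sketched as follows, assuming each parsed FDR record is a plain `dict` carrying `flight_id`, `kind`, and (for per-frame kinds) `frame_id`. The dict shape and the function name are assumptions for illustration; the real parser is defined by `fdr_record_schema`.

```python
from collections import defaultdict


def group_by_frame(records):
    """Group FDR records by (flight_id, frame_id).

    Records without a frame_id (e.g. flight_header) are skipped.
    If a kind repeats within one frame, the last record wins.
    """
    frames = defaultdict(dict)
    for rec in records:
        if "frame_id" not in rec:
            continue
        frames[(rec["flight_id"], rec["frame_id"])][rec["kind"]] = rec
    return dict(frames)
```

Each value in the result is the per-frame causal chain keyed by `kind`, so a `vio.tick` can be read next to the `state.tick` and `tile_match` produced from the same frame.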
The **operator workstation** has more conventional inter-service traffic (C12 ↔ `flights` REST, C11 ↔ `satellite-provider` REST). Cycle-1 traces these by:
- Per-request log records with the request URL + status + duration_ms + a generated `correlation_id`.
- `FlightsApiClient` and the `satellite-provider` HTTP client both stamp this correlation id on the request line + response log.
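
As an illustration of the correlation-id pattern, the sketch below wraps an arbitrary `send` callable and emits a request record and a response record that share one generated id. The `traced_call` helper, the `event` field, and the callable-based design are hypothetical; only `correlation_id`, URL, status, and `duration_ms` come from the description above.

```python
import time
import uuid


def traced_call(send, url, log):
    """Call send(url) and emit request/response log records sharing one correlation_id."""
    correlation_id = uuid.uuid4().hex
    log({"event": "request", "url": url, "correlation_id": correlation_id})
    start = time.monotonic()
    status = send(url)
    duration_ms = (time.monotonic() - start) * 1000.0
    log({
        "event": "response",
        "url": url,
        "status": status,
        "duration_ms": duration_ms,
        "correlation_id": correlation_id,
    })
    return status
```

Filtering the operator-workstation logs on a single `correlation_id` then yields the request and its outcome as one pair.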
OpenTelemetry SDK + W3C Trace Context propagation is a **cycle-2 polish item** for the operator-orchestrator only — not for the airborne image. Logged in "Future Work" below.
### Sampling
| Environment | Effective sampling rate | Rationale |
|-------------|--------------------------|-----------|
| Development | 100% | FDR + logs both on |
| Staging (lab Jetson) | 100% | Full visibility for IT-12 / NFT-PERF runs |
| Production — airborne | 100% per-frame for `vio.tick`/`state.tick`/`tile_match`; `failed_tile_thumbnail` rate-capped at ≤ 0.1 Hz | FDR ring is the only post-landing forensic record; full per-frame capture is mandatory. Rate caps live on byte-heavy forensic records only. |
| Production — operator workstation | 100% INFO+; DEBUG off | Operator workstation has full disk; cost is not a concern. |
## Alerting
### Airborne (in-flight)
**No real-time alerting from the airborne image.** Autonomy: the FC handles in-flight failsafe (`SAFE_DEAD_RECKONING`, `RTL`, `LAND` etc. per AC-FC-FAILSAFE-1). The companion does not have a network path to a human operator in flight — its only outbound channel is the MAVLink 2.0 12 Hz downsampled summary to QGroundControl, which surfaces companion health via STATUSTEXT messages and the parent suite's `GpsDeniedHealth` MAVLink message.
Alert-equivalents on the airborne side:
| Event | Detected by | In-flight signal |
|-------|-------------|------------------|
| Companion process died | FC adapter watchdog timeout | FC drops to `SAFE_DEAD_RECKONING`; operator sees lost telemetry in QGC |
| `D-CROSS-LATENCY-1` deadline miss + thermal headroom low | C4 / C7 hybrid trigger | Auto-degrade to lower-cost C7 backend; STATUSTEXT to QGC + FDR `kind="c7.degrade"` |
| C8 signing handshake failed | C8 FC adapter | Refuses takeoff; STATUSTEXT to QGC + FDR `kind="c8.signing_handshake_failed"` |
| FDR ring overrun | `shared.fdr_client` drop-oldest hook | Emits `kind="overrun"` (AC-NEW-3); post-flight forensics tag |
| Segment cap reached (64 GB) | C13 writer | Emits `kind="segment_rollover"` with cap-drop flag; oldest data lost — flag surfaces post-flight |
### Post-Flight (operator workstation)
Post-flight analysis runs the FDR segments through the post-landing tooling. Alerts surface in the operator's environment:
| Severity | Response time | Condition | Cycle-1 channel |
|----------|---------------|-----------|------------------|
| Critical | Pre-next-flight gate (≤ 10 min before takeoff) | `flight_footer.clean_shutdown == false`; `kind="c8.signing_handshake_failed"` observed; FDR overrun count > 0 above per-flight threshold | Operator UI block + Slack `#gps-denied-ops` (cycle-2 once the channel is wired); cycle-1: operator's local terminal output from post-landing tooling |
| High | Same-day | C6 eviction batch > 100 in one flight; tile_match score histogram drifted vs operator baseline | Same as above |
| Medium | Within 1 week | Cumulative thermal-headroom-low events trending up across recent flights | Operator dashboard (cycle-2) |
| Low | Recorded in flight summary only | Non-critical warnings (FDR `kind="log"` at WARN level) | Flight summary PDF / Markdown |
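
The Critical and High rows above could be evaluated by post-landing tooling along these lines. This is a sketch under stated assumptions: records are dicts with a `kind` field, the footer is a dict with `clean_shutdown`, "overrun count" and "eviction batch" are read as counts of `overrun` and `c6.eviction_batch` records, and both thresholds are hypothetical parameters.

```python
from collections import Counter


def classify_flight(footer, records, overrun_threshold=0, eviction_threshold=100):
    """Return "critical", "high", or "ok" for one flight's FDR records."""
    counts = Counter(r["kind"] for r in records)
    # A missing clean_shutdown flag is treated as an unclean shutdown.
    if (
        not footer.get("clean_shutdown", False)
        or counts["c8.signing_handshake_failed"] > 0
        or counts["overrun"] > overrun_threshold
    ):
        return "critical"
    if counts["c6.eviction_batch"] > eviction_threshold:
        return "high"
    return "ok"
```

The Medium and Low rows depend on cross-flight trends and WARN-level log records, so they are left out of this single-flight check.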
### CI (Woodpecker pipelines)
| Severity | Response time | Condition | Channel |
|----------|---------------|-----------|---------|
| Critical | Same business day | `01-test.yml` failure on `main` branch | Woodpecker UI; per-repo Slack channel (cycle-2 follow-up — `ci_cd_pipeline.md` Future Work #8) |
| High | Within 24 h | `02-build-push.yml` build failure on any push branch | Woodpecker UI |
| Medium | Next business day | Lint / coverage gate fail (cycle-2; cycle-1 has neither) | n/a in cycle-1 |
| Low | Next sprint review | Non-critical pipeline warnings | n/a |
### Deploy / Update (Watchtower)
| Severity | Response time | Condition | Channel |
|----------|---------------|-----------|---------|
| Critical | Immediate | Watchtower post-update hook emits `AZAION_UPDATE_EVENT severity=error` to journald (image pull failed, container crash on restart) | journald + suite operator's `journalctl -g AZAION_UPDATE_EVENT` audit chain |
| Informational | None | Watchtower applied an update during a non-flight window (`/run/azaion/in-flight` cleared) | `AZAION_UPDATE_EVENT severity=info` to journald — audit only |
## Dashboards
### Operations (cycle-1 — what exists today)
- **Suite Woodpecker UI** — CI pipeline status per branch + commit; the only "live" operations dashboard cycle-1 ships.
- **`jtop` on the bench** — operator runs `sudo jtop` on the lab / airborne Jetson during staging / pre-flight to observe thermal + GPU clock + power. Not a service dashboard; it's a CLI tool.
- **`docker ps` + `docker compose logs`** — the operator workstation operator's `dev`-environment dashboard.
### Operations (cycle-2 polish, planned)
- **Grafana dashboard** fed by post-landing-parsed FDR records — service health per component (FDR record kinds rolled up into rates), thermal trend, eviction count, tile_match score distribution.
- **Prometheus `/metrics` on operator-orchestrator** — once the operator workstation cycle-2 wires this, the Grafana dashboard pulls live operator-side metrics alongside post-landing FDR rollups.
### Flight Analytics (cycle-1 — what exists today)
- **Per-flight summary** generated by post-landing tooling (Markdown / PDF) — records written / dropped, segment count, top-N error log lines, eviction count, signing-key rotation event log, `flight_footer.clean_shutdown` flag. Stored alongside the FDR segments under `_docs/06_metrics/flights/<flight_id>/` (cycle-2 publishes; cycle-1 staging dir is operator-local).
### Flight Analytics (cycle-2 polish, planned)
- **FDR replay viewer** — interactive timeline of `(flight_id, frame_id)` correlated records.
- **NFT-PERF baseline tracker** — frame deadline miss rate, thermal headroom, end-to-end pose latency tracked across flights.
## Deploy Audit (suite-mandated)
Per `../_infra/ci/README.md` → "OCI image labels and commit provenance (AZ-204)" and `../_infra/deploy/jetson/README.md` → "Audit: what is this device running?":
- Every image (`companion-jetson`, `companion-tier1`, `operator-orchestrator`) is built with:
- OCI labels: `org.opencontainers.image.revision=$CI_COMMIT_SHA`, `org.opencontainers.image.created=<UTC RFC 3339>`, `org.opencontainers.image.source=$CI_REPO_URL`.
- `ENV AZAION_SERVICE=gps-denied-onboard` + `ENV AZAION_REVISION=$CI_COMMIT_SHA`.
- Watchtower's post-update hook emits one `AZAION_UPDATE_EVENT` line per applied update into journald, carrying the new revision SHA + service name + timestamp + outcome.
- The operator runs `journalctl -g AZAION_UPDATE_EVENT` on any Jetson to answer "what is this device running and when did it last update?".
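
For scripted audits, the `journalctl` output can be post-processed with a small parser. The sketch below assumes the event line carries space-separated `key=value` tokens after the `AZAION_UPDATE_EVENT` marker, as the `severity=error` example in the alerting table suggests; the `service` and `revision` key names in the usage example are hypothetical, and the authoritative format is whatever `../_infra/deploy/jetson/README.md` specifies.

```python
MARKER = "AZAION_UPDATE_EVENT"


def parse_update_event(line):
    """Parse key=value tokens following the marker; return None if the marker is absent."""
    idx = line.find(MARKER)
    if idx < 0:
        return None
    fields = {}
    for token in line[idx + len(MARKER):].split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields
```

For example, `parse_update_event("host watchtower: AZAION_UPDATE_EVENT severity=info service=gps-denied-onboard revision=abc123")` returns a dict with those three keys.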
## Self-verification
- [x] Structured logging format defined with required fields (timestamp, level, service, component, `flight_id`, `frame_id`, kind, msg, kv, exc)
- [x] Per-environment `LOG_SINK` destination + retention tabulated
- [x] FDR-based metrics surface enumerated (every `fdr_record_schema` v1.3.0 kind mapped to its operator-relevant meaning)
- [x] Device telemetry (`jetson-stats` / `jtop`) source + sample rate + consumer (D-CROSS-LATENCY-1 hybrid trigger)
- [x] Tracing stance recorded — no W3C Trace Context / OTel SDK on airborne (justified by single-process pipeline + NFT-SEC-05); operator-side correlation_id pattern documented; OTel deferred to cycle-2 polish
- [x] Alert severities + response times defined across the four touchpoints: airborne in-flight, post-flight operator workstation, CI, deploy/update audit (`AZAION_UPDATE_EVENT`)
- [x] Operational-secret leakage controls in place (no key bytes / API tokens / Postgres credentials in logs; `KeySource` is the only key holder)
- [x] Dashboards inventoried — cycle-1 reality (Woodpecker UI, `jtop`, post-landing summary) explicit; cycle-2 polish (Grafana, FDR replay viewer, NFT-PERF tracker) logged as follow-ups
- [x] Suite-mandated deploy audit chain (`AZAION_UPDATE_EVENT` + OCI labels + `AZAION_REVISION` env) referenced from `../_infra/` docs
## Future Work (cycle-2 polish)
1. **Prometheus `/metrics` on `operator-orchestrator`** — cycle-2 wires an in-process exporter for operator-workstation-side metrics (`flights` REST round-trip latency, `satellite-provider` download throughput, tile manifest content-hash failures). The airborne image stays off this path per NFT-SEC-05.
2. **Grafana dashboard fed by post-landing-parsed FDR rollups** — single pane of glass for per-flight + cross-flight trends.
3. **OpenTelemetry SDK on `operator-orchestrator` only** — instruments `FlightsApiClient` + `satellite-provider` HTTP client with W3C Trace Context propagation. Out of scope for airborne.
4. **Per-repo Slack channel (`#gps-denied-ci` for CI, `#gps-denied-ops` for post-flight)**: `ci_cd_pipeline.md` Future Work #8 already logs the CI half; this doc adds the ops half.
5. **FDR replay viewer** — interactive timeline of `(flight_id, frame_id)` correlated records; consumes FDR segments via the `fdr_record_schema` v1.3.0 parser.
6. **NFT-PERF baseline tracker** — automated frame-deadline-miss-rate + thermal-headroom + end-to-end pose latency trending across flights, gated by AZ-595 SITL replay fixture + AZ-592/AZ-593 Tier-2 OKVIS2/VINS-Mono wiring.
7. **Centralised log aggregator on the operator workstation** — Loki / journald-export-to-cloud once the operator network egress allows it; cycle-1 leaves journald at host-default retention.
@@ -0,0 +1,242 @@
# GPS-Denied Onboard — Deployment Status Report
> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 1 status & env
> assessment, 2026-05-19. Inputs: `_docs/02_document/architecture.md`,
> 14 component specs in `_docs/02_document/components/`,
> `_docs/00_problem/restrictions.md`, existing root-level Docker artefacts
> (`docker-compose.yml`, `docker-compose.test*.yml`, `docker/*.Dockerfile`),
> and `.env.example`.
## Deployment Readiness Summary
| Aspect | Status | Notes |
|--------|--------|-------|
| Architecture defined | ✅ | `architecture.md` v1 + 11 ADRs; vision section is the spine, no drift detected |
| Component specs complete | ✅ | 14 components (C1–C8, C10–C13) with description.md present |
| Infrastructure prerequisites met | ⚠️ Partial | Tier-1 (workstation Docker + Postgres 16 + mock-sat) ready and committed; **parent-suite CI/CD (Woodpecker + Gitea Packages registry + Caddy TLS) already exists** at `../_infra/ci/` — this submodule needs to author `.woodpecker/01-test.yml` + `.woodpecker/02-build-push.yml` per the suite-mandated two-workflow contract; Tier-2 Jetson runner availability tracked as cycle-1 follow-up (AZ-592 / AZ-593) |
| External dependencies identified | ✅ | parent-suite `satellite-provider` (read pre-flight, write post-landing via planned D-PROJ-2), parent-suite `flights` REST, ArduPilot Plane FC (signed MAVLink 2.0), iNav FC (MSP2), QGroundControl, nav camera (ADTi 20MP) |
| Blockers | 4 | (1) **Cross-cutting ADR-005 ↔ parent-suite Jetson Docker compose contradiction** — see "Cross-Cutting Decision" section below; (2) D-PROJ-2 ingest endpoint planned, parent-suite work; (3) AZ-592/AZ-593 Tier-2 wiring deferred to follow-up cycle; (4) D-CROSS-CVE-1 opencv pin replay deferred on upstream `gtsam` numpy-2 wheels |
The system is **deploy-plannable today** at the Tier-1 / dev level — but
production Tier-2 delivery shape (bare JetPack per ADR-005 vs Docker
container under the parent-suite Watchtower flow) needs a user decision
before the deploy plan steps 2–7 can be authored without drift. See the
new "Cross-Cutting Decision" section below.
## Parent-Suite Context (Authoritative Discovery)
This submodule lives inside the **Azaion suite meta-repo** at `../`. The
suite already has a fully-installed CI/CD + production-deploy stack the
GPS-Denied Onboard plan **did not previously account for**. Citations below.
| Suite artefact | Path | What it mandates for this submodule |
|----------------|------|--------------------------------------|
| Woodpecker CI + Gitea Packages + Caddy TLS | `../_infra/ci/README.md` | Two-workflow per-repo pattern: `.woodpecker/01-test.yml` (test on push/PR) + `.woodpecker/02-build-push.yml` (build+push, gated `depends_on: [01-test]`, multi-arch matrix). All images go to `${REGISTRY_HOST}/azaion/<service>:<branch>-arm` (e.g., `git.azaion.com/azaion/gps-denied-onboard:dev-arm`). Registry secrets (`registry_host`, `registry_user`, `registry_token`) are already provisioned as Woodpecker global secrets — this submodule consumes them. |
| Jetson production compose | `../_infra/deploy/jetson/docker-compose.yml` | The fielded Jetson runs **9 application services + Postgres + Watchtower** via `docker compose up -d`. One of those services is already declared: `gps-denied-onboard: image: ${REGISTRY_HOST}/azaion/gps-denied-onboard:${BRANCH:-main}-arm`, `runtime: nvidia`, port `5040:8080`, env `AUTOPILOT_URL: http://autopilot:8080`, `MODELS_DIR: /data/models`. **This contradicts ADR-005's "bare JetPack, no Docker" stance** — see Cross-Cutting Decision below. |
| Flight-state safety gate | `../_infra/deploy/jetson/README.md` → "Flight-state convention" | All on-Jetson model syncs and Watchtower-driven container restarts are gated by `/run/azaion/in-flight` (written by `autopilot` service on arm/disarm). Any GPS-Denied Onboard production deploy on Jetson must honour the same flag. |
| Audit logging | Same README → "Audit: what is this device running?" | OCI labels (`org.opencontainers.image.revision/created/source`) + per-service env `AZAION_REVISION=$CI_COMMIT_SHA` + journald-captured `AZAION_UPDATE_EVENT` lines. Every submodule's Dockerfile must accept `--build-arg CI_COMMIT_SHA` and stamp the OCI labels + `ENV AZAION_REVISION`. |
| Suite-level e2e | `../.woodpecker/suite-e2e.yml` | Manual / nightly cron pipeline that brings up `_infra/deploy/jetson/docker-compose.yml` + `e2e/docker-compose.suite-e2e.yml`; downstream signal only, does not gate this submodule. Already references `gps-denied-onboard` as one of the services pulled. |
| Outstanding suite follow-up #4 | `../_infra/ci/README.md` → Follow-ups | "Missing Dockerfiles for Jetson edge services. `detections-semantic/`, `gps-denied-onboard/`, `gps-denied-desktop/` have no `Dockerfile` / `Dockerfile.jetson` today." This submodule's `docker/companion-tier1.Dockerfile` exists for Tier-1; **a `Dockerfile.jetson` for the arm64 Watchtower image does not exist yet**. |
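
A minimal sketch of honouring the flight-state gate from the table above, assuming the flag is a file whose presence means "in flight" and whose absence means updates are safe. That presence semantics and the `update_allowed` helper are assumptions; the authoritative convention lives in `../_infra/deploy/jetson/README.md`.

```python
from pathlib import Path

# Flag path from the suite's flight-state convention.
IN_FLIGHT_FLAG = Path("/run/azaion/in-flight")


def update_allowed(flag: Path = IN_FLIGHT_FLAG) -> bool:
    """Return True only when the in-flight flag is absent (assumed semantics)."""
    return not flag.exists()
```

Any model sync or restart hook in this submodule would call `update_allowed()` first and defer the action while it returns `False`.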
## Cross-Cutting Decision — ADR-005 vs Parent-Suite Jetson Docker Compose
**The conflict in one paragraph.** ADR-005 in `architecture.md` says:
"Tier-2 (Jetson) does NOT use Docker — TensorRT INT8 calibration caches
and `jetson-stats` thermal telemetry are most reliable without a container
layer, per D-C7-9 + D-C10-6. The deployed image on the Jetson is a
JetPack-based system image with the deployment binary preinstalled." The
parent suite's `_infra/deploy/jetson/docker-compose.yml` declares
`gps-denied-onboard` as a Docker service pulled by Watchtower, with
`runtime: nvidia` for GPU access, alongside 8 other suite services. Both
cannot be the production deploy path simultaneously — this needs a user
call before Step 2 (Containerization) writes the production
containerization plan.
**Resolution options:**
| Option | What it means | Implications |
|--------|---------------|--------------|
| **A** | Keep ADR-005 — GPS-Denied Onboard is **NOT** in the Jetson Docker compose. It runs as a bare-metal systemd service on the same Jetson, beside the Docker stack. Watchtower does not manage it. | Parent-suite `_infra/deploy/jetson/docker-compose.yml` must drop the `gps-denied-onboard` service (a parent-suite edit). This submodule ships a JetPack-flashable tarball + systemd unit instead of an image. Deploy procedure becomes operator-side `apt`-/`tarball`-install, not `docker compose up`. CI builds a release tarball, not an image. Updates lose the Watchtower + journald audit chain — we need an equivalent. |
| **B** | Reverse ADR-005 — GPS-Denied Onboard ships a `Dockerfile.jetson` and runs as a Docker container under the parent-suite Watchtower flow. The ADR is rewritten to "Docker on Jetson with `runtime: nvidia` + explicit calibration-cache + jetson-stats volume mounts to preserve the D-C7-9 / D-C10-6 properties". | Suite follow-up #4 is closed by this submodule. CI fits the suite two-workflow pattern. Flight-state gate honoured via `/run/azaion/in-flight` volume mount. TensorRT INT8 calibration cache + jetson-stats telemetry must be validated under Docker (not just bare JetPack) — Step 2 of this deploy plan owns that validation; if it fails, fall back to (A). |
| **C** | Hybrid — GPS-Denied Onboard ships **both** a Docker image (for Tier-1 + dev + e2e + replay) **and** a JetPack bare-metal artefact (for Tier-2 production). | Two release artefacts to maintain; two CI lanes; matches ADR-005 + ADR-002 mechanism for "binary tracks". Parent-suite compose still drops the Watchtower-managed `gps-denied-onboard` service (operator runs the bare-metal artefact alongside the Docker stack). |
**Autodev-resolved (2026-05-19 19:09 UTC+3): Option B.** The user
explicitly skipped the structured BLOCKING gate, directing the autodev to
continue with available information. Option B is selected because:
1. **Existence proof on the same platform.** The parent suite's
`detections` service already runs as a Docker container on the Jetson
with `runtime: nvidia` (`Dockerfile.jetson` + suite production compose).
GPU access + INT8-class inference in Docker on Jetson is a working
pattern in this suite, not a hypothetical.
2. **Suite follow-up #4** in `../_infra/ci/README.md` explicitly lists
"Missing Dockerfiles for Jetson edge services. … `gps-denied-onboard/`"
— the parent-suite operator expects this submodule to ship a
`Dockerfile.jetson` and join the Watchtower flow.
3. **Audit + flight-gate chain reuse.** Option B inherits
`AZAION_UPDATE_EVENT` journald audit + `/run/azaion/in-flight`
flight-state gate + per-flight ephemeral secret rotation patterns
without re-inventing them at bare-metal level.
4. **ADR-005 concerns are validatable in Step 2.** The two technical
concerns ADR-005 cited (TensorRT INT8 calibration cache stability +
`jetson-stats` thermal telemetry access) become explicit Step 2
validation gates: model-cache mounted as a named Docker volume (same
pattern `detections` uses for `model-cache:/data/models`); jetson-stats
accessed via `runtime: nvidia` + the standard nvidia container toolkit
device passthrough. **If either validation fails in Step 2**, the
autodev falls back to Option A and reopens this section.
5. **Step 3 CI/CD authoring is straightforward** under Option B — the
suite already provides the two-workflow `.woodpecker/` templates and
registry secrets; this submodule plugs into the existing pipeline.
**To reverse this decision later**: edit this section to record the new
choice, restore ADR-005's bare-JetPack language in `architecture.md`, and
re-run `/autodev` — Step 2 will detect the change via the rewritten
section and rebuild the containerization plan accordingly.
**Required architecture follow-up under Option B**: the `architecture.md`
ADR-005 paragraph "Container scope: …Tier-2 (Jetson) does NOT use Docker"
becomes inconsistent with this decision. Step 2 of the deploy plan will
draft the ADR-005 amendment (or replacement ADR-012 — "Docker on Jetson
with explicit calibration-cache + jetson-stats passthrough") and the
amendment lands in Step 12 (Test-Spec Sync / Update Docs equivalent)
output. Recording the architectural drift here so it is not lost.
The originally-listed registry decision is **already settled by the
parent suite** — `${REGISTRY_HOST}` is the Gitea Packages registry behind
Caddy TLS (`git.azaion.com` per the example in
`../_infra/deploy/jetson/README.md`); no operator choice needed for this
submodule.
## Component Status
> Docker-ready column means: does the component run inside the Tier-1
> Docker images? Tier-2 production deploys via JetPack image flash, not
> Docker (ADR-005); that column is N/A for Tier-2-only paths.
| Component | State | Docker-ready (Tier-1) | Notes |
|-----------|-------|-----------------------|-------|
| C1 — VIO (`c1_vio`) | ✅ implemented + tested (operational default = `KltRansac` AZ-334) | yes | `Okvis2`/`VinsMono` ship as facade-only — AZ-332/AZ-333 BLOCKED on Tier-2 prereqs; follow-ups AZ-592/AZ-593 in backlog (ADR-001 cycle-1 note). `_STRATEGY_REGISTRY` registers all three slots; selecting an unlinked strategy raises `StrategyNotLinkedError` |
| C2 — VPR (`c2_vpr`) | ✅ implemented + tested | yes | `UltraVPR` primary; `MegaLoc`/`MixVPR`/`SelaVPR`/`EigenPlaces`/`NetVLAD` secondaries behind `BUILD_*` flags per ADR-002 |
| C2.5 — Re-rank (`c2_5_rerank`) | ✅ implemented + tested | yes | inlier-count re-rank top-K=10 → top-N=3 |
| C3 — Matcher (`c3_matcher`) | ✅ implemented + tested | yes | `DISK+LightGlue` primary; `ALIKED+LightGlue` / `XFeat` secondaries |
| C3.5 — AdHoP (`c3_5_adhop`) | ✅ implemented + tested | yes | conditional refinement; `passthrough` baseline path |
| C4 — Pose (`c4_pose`) | ✅ implemented + tested | yes | OpenCV `solvePnPRansac` + GTSAM Marginals; D-CROSS-LATENCY-1 auto-degrade |
| C5 — State (`c5_state`) | ✅ implemented + tested | yes | GTSAM iSAM2 + `IncrementalFixedLagSmoother`; ESKF baseline behind `BUILD_STATE_ESKF` |
| C6 — Tile cache (`c6_tile_cache`) | ✅ implemented + tested | yes | Postgres 16 btree spatial index + filesystem tiles + FAISS HNSW descriptor index |
| C7 — Inference (`c7_inference`) | ✅ implemented + tested (Tier-1 PyTorch FP16); Tier-2 TensorRT path pinned | yes (PyTorch FP16); N/A (TensorRT runs on bare JetPack) | `INFERENCE_BACKEND={tensorrt|pytorch_fp16|onnx_trt_ep}`; ONNX+TRT EP fallback |
| C8 — FC adapter (`c8_fc_adapter`) | ✅ implemented + tested | yes | `pymavlink` ArduPilot Plane (signed) + `MSP2` iNav (unsigned, accepted risk); `MavlinkTransport` Protocol seam (Serial / Noop for replay per ADR-011) |
| C10 — Provisioning (`c10_provisioning`) | ✅ implemented + tested | yes (operator-orchestrator image) | engine + descriptor + manifest build with SHA-256 content-hash gate |
| C11 — Tile Manager (`c11_tilemanager`) | ✅ implemented + tested | yes (operator-orchestrator image ONLY) | airborne image MUST NOT link C11 (ADR-004 process-level isolation); CI SBOM-diff + runtime self-check + NFT-SEC-02 egress test enforce |
| C12 — Operator orchestrator (`c12_operator_orchestrator`) | ✅ implemented + tested | yes (operator-orchestrator image ONLY) | `FlightsApiClient` + `PostLandingUploadOrchestrator` + `OperatorReLocService` |
| C13 — FDR (`c13_fdr`) | ✅ implemented + tested | yes | ≤ 64 GB / flight ring; `flight_footer` record drives C12 post-landing gate |
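
The C1 row notes that `_STRATEGY_REGISTRY` registers all three VIO slots and that selecting an unlinked one raises `StrategyNotLinkedError`. The sketch below shows one way such a registry could behave; the registry shape (name mapped to a factory, or `None` for a facade-only slot), the placeholder factory, and `select_strategy` are hypothetical and not taken from the `c1_vio` source.

```python
class StrategyNotLinkedError(RuntimeError):
    """Raised when a registered strategy was not linked into this build."""


# Hypothetical shape: name -> factory, or None when the slot is facade-only.
_STRATEGY_REGISTRY = {
    "KltRansac": lambda: "klt-ransac-instance",  # placeholder for the real strategy object
    "Okvis2": None,
    "VinsMono": None,
}


def select_strategy(name):
    """Instantiate a registered strategy, failing loudly for unlinked slots."""
    if name not in _STRATEGY_REGISTRY:
        raise KeyError(f"unknown strategy: {name}")
    factory = _STRATEGY_REGISTRY[name]
    if factory is None:
        raise StrategyNotLinkedError(f"strategy {name!r} is registered but not linked")
    return factory()
```

Keeping unlinked names in the registry lets configuration validation distinguish a typo (`KeyError`) from a strategy that exists but is not built into the current binary track.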
### Binary tracks (three, per ADR-002 + ADR-011)
| Binary | Image / target | Contents | Where it runs |
|--------|---------------|----------|---------------|
| `airborne` | Tier-2: bare JetPack 6.2 system image / Tier-1: `gps-denied-onboard/companion:dev` Docker image | C1–C8 + C13 + replay strategies (`BUILD_VIDEO_FILE_FRAME_SOURCE`, `BUILD_TLOG_REPLAY_ADAPTER`, `BUILD_REPLAY_SINK_JSONL` ON); same image runs live and replay modes (config-selected) | Jetson Orin Nano Super (prod); workstation Docker (dev/CI) |
| `research` | Tier-1 Docker / Tier-2 bare JetPack | airborne contents + every non-default strategy linked (IT-12 comparative study) | Lab Jetson, CI Tier-2 jobs |
| `operator-orchestrator` | Tier-1 Docker image `gps-denied-onboard/operator-orchestrator:dev` | C10 + C11 + C12; ships with mock-suite-sat-service compose for offline tests | Operator workstation |
## External Dependencies
| Dependency | Type | Required For | Status |
|------------|------|--------------|--------|
| PostgreSQL 16 | Database (C6 tile + descriptor metadata) | All deployments | ✅ Tier-1: `db` service in `docker-compose.yml`; Tier-2: native Postgres on operator workstation + Jetson (sized for ≤ 10 GB cache budget) |
| Filesystem `./tiles/{zoomLevel}/{x}/{y}.jpg` | Tile binary store mirroring `satellite-provider` on-disk layout | All deployments (C6) | ✅ Tier-1: `tile-data` volume; Tier-2: NVM partition (≥ 10 GB) |
| Parent-suite `satellite-provider` (.NET 8 REST + on-disk tiles) | External service | Operator workstation only (pre-flight `TileDownloader` via C11; post-landing `TileUploader` via C11/C12) | ✅ pre-flight read path is live; ⚠️ post-landing POST contract (D-PROJ-2) **planned**, parent-suite work — see `_docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md` |
| Parent-suite `flights` REST service (.NET 8) | External service | Operator workstation only (C12 reads `Flight` DTO via `FlightsApiClient`) | ✅ contract owned by parent-suite; offline `--flight-file` path implemented (AZ-489) as fallback |
| ArduPilot Plane FC | MAVLink 2.0 over UART/USB (signed) | Production (airborne ↔ FC) | ✅ adapter implemented; signing handshake validated via NFT-SEC-03; per-flight key rotation logged to FDR |
| iNav FC | MSP2 over UART (unsigned, accepted risk) | Production (airborne ↔ FC) | ✅ adapter implemented; no signing — documented residual risk |
| QGroundControl (GCS) | MAVLink 2.0 12 Hz downsampled summary | Production (operator monitoring) | ✅ outbound encoder + STATUSTEXT path covered |
| Nav camera (ADTi 20MP 20L V1) | Camera SDK / V4L2 over USB / MIPI-CSI / GigE | Production (airborne) | ⚠️ live driver per deployed lens module — calibration JSON (`adti20.json`) is operator-acquired per D-PROJ-1 (hybrid factory + checkerboard); `adti26.json` test-fixture used in dev / CI |
| GitHub Actions runner (Tier-1) | CI | Build + lint + unit + most integration + Tier-1 e2e | ✅ GitHub-hosted x86_64 runner; pinned actions per `_docs/02_document/deployment/ci_cd_pipeline.md` |
| Self-hosted Jetson runner (Tier-2) | CI | AC-bound NFTs (NFT-PERF-* + NFT-LIM-* + IT-12) | ⚠️ runner availability tracked as a risk-register entry (ADR-005). Cycle-1 perf probe ran Tier-1 only — NFT-PERF-01/03 Tier-2 hardware required, NFT-PERF-02/04 SITL replay fixture pending AZ-595 |
## Infrastructure Prerequisites
| Prerequisite | Status | Action Needed |
|--------------|--------|---------------|
| Container registry | ✅ **Already set by parent suite** | `${REGISTRY_HOST}` (Gitea Packages behind Caddy TLS, e.g. `git.azaion.com`). Images: `${REGISTRY_HOST}/azaion/gps-denied-onboard:<branch>-arm`. Woodpecker global secrets `registry_host` / `registry_user` / `registry_token` already provisioned per `../_infra/ci/README.md`. **No operator choice needed.** |
| Cloud account | N/A | No cloud orchestration. The CI/CD server itself is a self-hosted Jetson colocated with the registry — see `../_infra/ci/README.md` → "Architecture". |
| DNS configuration | ✅ **Already set by parent suite** | `REGISTRY_DOMAIN` + `WOODPECKER_DOMAIN` resolve to the CI host's public IP. Operator workstation reaches `satellite-provider` over LAN / VPN; no public DNS for the airborne / operator side from this submodule. |
| SSL certificates | ✅ **Already set by parent suite** (Caddy + Let's Encrypt / internal / external-file modes) | Suite operator chooses the mode in `../_infra/ci/.env`. The companion has no inbound listeners (NFT-SEC-05 in-flight egress lockdown). |
| CI/CD platform | ⚠️ Suite-mandated (Woodpecker CI two-workflow pattern); **submodule pipeline files missing** | This submodule has **no `.woodpecker/` folder yet**. Suite follow-up #4 in `../_infra/ci/README.md` confirms `gps-denied-onboard` is one of the services awaiting CI integration. Step 3 of this deploy plan must author `.woodpecker/01-test.yml` (Python `pytest` + Tier-1 e2e via the existing `docker-compose.test.yml`) and `.woodpecker/02-build-push.yml` (multi-arch matrix → `${REGISTRY_HOST}/azaion/gps-denied-onboard:<branch>-arm`). The existing pre-cycle-1 `_docs/02_document/deployment/ci_cd_pipeline.md` was written against an assumed GitHub Actions runner — Step 3 must rewrite it against the actual suite Woodpecker pattern. |
| Secret manager | ⚠️ Per-flight ephemeral, no external manager | Per-flight MAVLink signing key + per-flight onboard signing key are **generated at takeoff load**, rotated per flight, logged to FDR. Pre-flight `satellite-provider` API key lives on the operator workstation only; never written to companion image. **No external secret manager required** for the companion. For the operator workstation, the operator's local credential store / OS keyring is sufficient. |
| Image build host | ⚠️ Depends on the Cross-Cutting Decision above | Option B (Docker on Jetson) requires arm64 build agents (already provisioned at the suite level — Jetson colocated agent + optional remote amd64). Option A (bare JetPack) requires a JetPack 6.2 SDK build host with `pyproject.toml` wheel build + native CMake build; CI lane is different (release-tarball lane, not registry push). |
| JetPack 6.2 system image | ⚠️ Required for Tier-2 hardware regardless of option | Operator burns the JetPack 6.2 + Jetson Linux base image; Step 6 documents the procedure. Under Option A this image hosts a bare-metal install; under Option B it hosts Docker + `runtime: nvidia` + the suite-level compose. |
| Flight-state gate (`/run/azaion/in-flight`) | ⚠️ Suite-mandated for any Watchtower-managed production deploy | Under Option B, the GPS-Denied Onboard image must accept the same volume mount + honour the flag. Under Option A, the bare-metal systemd unit must also gate on it (the parent-suite `autopilot` service still writes the flag). Step 6 documents this. |
| Audit / OCI labels (`AZAION_REVISION`, `org.opencontainers.image.revision/created/source`) | ⚠️ Suite-mandated under Option B; recommended under Option A | The suite `journalctl -g AZAION_UPDATE_EVENT` audit chain depends on these. Step 2 must add them to the Dockerfile under Option B; Step 7 deployment scripts must emit an equivalent under Option A. |
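The registry and audit-label rows above can be sketched as a small helper that composes the image reference and the label set for a build. This is a minimal sketch, not the suite's actual tooling: the function names, the `/` to `-` branch sanitisation, and passing `AZAION_REVISION` as a `--build-arg` are assumptions made for illustration. Only the image path pattern and the three `org.opencontainers.image.*` keys come from the table.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional

IMAGE_REPO = "azaion/gps-denied-onboard"  # repository path from the registry row


def image_ref(registry_host: str, branch: str, arch: str = "arm") -> str:
    """Compose `<registry>/<repo>:<branch>-<arch>`."""
    # Assumption: '/' in a branch name becomes '-' because image tags cannot contain '/'.
    tag = f"{branch.replace('/', '-')}-{arch}"
    return f"{registry_host}/{IMAGE_REPO}:{tag}"


def oci_labels(revision: str, source: str,
               created: Optional[datetime] = None) -> Dict[str, str]:
    """Return the three OCI labels named in the audit row."""
    created = created or datetime.now(timezone.utc)
    return {
        "org.opencontainers.image.revision": revision,
        "org.opencontainers.image.created": created.isoformat(timespec="seconds"),
        "org.opencontainers.image.source": source,
    }


def docker_build_args(revision: str, source: str,
                      created: Optional[datetime] = None) -> List[str]:
    """Flatten the revision and labels into `docker build` arguments."""
    args = ["--build-arg", f"AZAION_REVISION={revision}"]
    for key, value in oci_labels(revision, source, created).items():
        args += ["--label", f"{key}={value}"]
    return args
```

A pipeline step could splice `docker_build_args(...)` into its `docker build` invocation and tag the result with `image_ref(...)`, so the revision recorded in the image matches the commit that produced it.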
## Deployment Blockers
| Blocker | Severity | Resolution |
|---------|----------|------------|
| **ADR-005 ↔ parent-suite Jetson Docker compose contradiction** | High (blocks Step 2 Containerization) | See "Cross-Cutting Decision" section above. User picks A / B / C; the choice determines whether Step 2 writes a Docker-on-Jetson plan or a bare-metal JetPack plan. |
| **D-PROJ-2** — parent-suite `satellite-provider` ingest endpoint + voting layer not yet implemented | Medium (production-blocking for post-landing upload only; airborne path is unaffected) | Parent-suite work tracked in `_docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md`. The onboard side ships against the real service (download) + e2e-test-only `mock-suite-sat-service` fixture (upload). Post-landing upload tool keeps batches queued locally until D-PROJ-2 lands. |
| **AZ-592 / AZ-593** — Tier-2 OKVIS2 / VINS-Mono wiring (build env + Jetson + DBoW2 vocab) | Medium (no impact on cycle-1 production deploy — operational default is `KltRansac` AZ-334) | Both parked in `_docs/02_tasks/backlog/`; follow-up cycle (ADR-001 cycle-1 note). Cycle-1 deployment ships with `KltRansac` as the operational `VioStrategy`. |
| **D-CROSS-CVE-1**: `opencv-python ≥ 4.12.0` pin deferred on `gtsam==4.2` numpy<2 ABI block | Low (CVE-2025-53644 re-validated against 4.11.0.86 — no advisory ties it to the current pin band; NFT-SEC-04 fuzz fixture is the executable confirmation) | Replay condition tracked in `_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md`. Replay lands when upstream `gtsam` ships numpy-2 wheels (or an alternative SE(3) backend) — at that point also bump `cryptography ≥ 46.0.7` per Phase 1 finding F1. |
## Required Environment Variables
> Production-required variables on the companion image are the smaller set
> below (12 entries). The operator-orchestrator image consumes the same
> set plus the C12-specific knobs documented in
> `_docs/02_document/components/13_c12_operator_orchestrator/description.md`.

| Variable | Purpose | Required In | Default (Dev) | Source (Staging / Prod) |
|----------|---------|-------------|---------------|--------------------------|
| `GPS_DENIED_FC_PROFILE` | Selects FC adapter at composition root: `ardupilot_plane \| inav` | Airborne, operator-orchestrator | `ardupilot_plane` | Per-flight config from the operator |
| `GPS_DENIED_TIER` | Runtime tier gate: `1`=workstation/CI, `2`=Jetson production | All | `1` | `1` for CI containers, `2` baked into the JetPack image |
| `DB_URL` | Postgres connection (C6 tile + descriptor metadata) | All | `postgresql://gps_denied:dev@db:5432/gps_denied` | Operator workstation: local Postgres credentials; Jetson production: local Postgres init script with random per-host password |
| `SATELLITE_PROVIDER_URL` | Pre-flight tile download endpoint | Operator-orchestrator only (never on airborne) | `http://mock-sat:5100` | Operator workstation env / VPN-resolved hostname; **must be empty on airborne** (defence-in-depth NFT-SEC-05) |
| `CAMERA_CALIBRATION_PATH` | Path to JSON camera calibration loaded at startup | Airborne, operator-orchestrator | `/fixtures/calibration/adti26.json` | Production: `/etc/gps-denied/calibration/adti20.json` (operator-acquired per D-PROJ-1) |
| `LOG_LEVEL` | Structured log level (`DEBUG \| INFO \| WARNING \| ERROR`) | All | `DEBUG` | Production: `INFO` |
| `LOG_SINK` | Structured log destination (`console \| journald \| fdr`) | All | `console` | Production: `fdr` (companion); `journald` (operator workstation) |
| `MAVLINK_SIGNING_KEY` | Per-flight MAVLink 2.0 signing key path | Airborne (ArduPilot profile) | `tests/fixtures/mavlink_signing/dev_key` | Production: per-flight ephemeral key generated at takeoff load, rotated per flight, logged to FDR (Principle #7) |
| `INFERENCE_BACKEND` | Selects C7 backend (`tensorrt \| pytorch_fp16 \| onnx_trt_ep`) | Airborne, operator-orchestrator | `pytorch_fp16` | Tier-2 production: `tensorrt`; Tier-1 CI: `pytorch_fp16` |
| `FDR_PATH` | C13 ring writer location | Airborne | `/var/lib/gps-denied/fdr` | Production: `/var/lib/gps-denied/fdr` on the companion NVM partition (≥ 64 GB) |
| `TILE_CACHE_PATH` | C6 filesystem tile root | Airborne, operator-orchestrator | `/var/lib/gps-denied/tiles` | Production: `/var/lib/gps-denied/tiles` on the companion NVM (≥ 10 GB) |
| `BUILD_VINS_MONO`, `BUILD_SALAD`, `BUILD_C11_TILE_MANAGER` | Build-time strategy / component gating (ADR-002) | Build host | `OFF` for deployment binary | `OFF` on airborne (`BUILD_C11_TILE_MANAGER` MUST stay OFF per ADR-004); `ON` on research binary |
| `BUILD_VIDEO_FILE_FRAME_SOURCE`, `BUILD_TLOG_REPLAY_ADAPTER`, `BUILD_REPLAY_SINK_JSONL` (optional) | Replay-mode strategy gating (ADR-011) | Replay-capable images | unset (defaults to ON in the airborne / research binaries) | `ON` in airborne + research; explicitly set in `docker-compose.test*.yml` for CI |
| `BUILD_DEV_STATIC_KEY` (optional, dev-only) | Gates the AP adapter's `signing_key_source='dev_static'` path | Dev / CI containers only | unset / `OFF` | **MUST stay OFF on production images.** |
| `BUILD_STATE_ESKF` (optional) | Links the ESKF state estimator (mandatory simple-baseline) | Research binary | unset / `OFF` | `ON` on research binary; `OFF` on airborne |
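The enumerated values and the "must be empty on airborne" rule in the table lend themselves to a start-up check. The following is a minimal sketch under stated assumptions: `validate_env` and its `airborne` parameter are hypothetical names, and treating an unset enumerated variable as an error is an assumption, not documented behaviour of the composition root.

```python
from typing import List, Mapping

# Allowed values copied from the table above.
ALLOWED = {
    "GPS_DENIED_FC_PROFILE": {"ardupilot_plane", "inav"},
    "GPS_DENIED_TIER": {"1", "2"},
    "LOG_LEVEL": {"DEBUG", "INFO", "WARNING", "ERROR"},
    "LOG_SINK": {"console", "journald", "fdr"},
    "INFERENCE_BACKEND": {"tensorrt", "pytorch_fp16", "onnx_trt_ep"},
}


def validate_env(env: Mapping[str, str], airborne: bool) -> List[str]:
    """Return human-readable problems; an empty list means the env looks sane."""
    problems = []
    for name, allowed in ALLOWED.items():
        value = env.get(name)
        if value is None:
            # Assumption: every enumerated variable must be set explicitly.
            problems.append(f"{name} is not set")
        elif value not in allowed:
            problems.append(f"{name}={value!r} is not one of {sorted(allowed)}")
    # Defence-in-depth rule from the SATELLITE_PROVIDER_URL row (NFT-SEC-05).
    if airborne and env.get("SATELLITE_PROVIDER_URL", ""):
        problems.append("SATELLITE_PROVIDER_URL must be empty on airborne")
    return problems
```

Calling `validate_env(os.environ, airborne=True)` early in start-up and refusing to continue on a non-empty result would surface a misconfigured image before any flight-critical component is constructed.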
### Sensitive variables — never committed
| Variable | Why |
|----------|-----|
| `MAVLINK_SIGNING_KEY` (real key) | Per-flight key, generated at takeoff. `.env.example` points at the dev test fixture only. |
| Real Postgres credentials | The committed `DB_URL` uses the local Docker `dev` password. Production credentials live on the host outside the image. |
| `SATELLITE_PROVIDER_URL` API token (when D-PROJ-2 lands) | Per-flight onboard signing key carried with each uploaded tile; never written to the companion image. |
## .env Files Created
- `.env.example` — committed to VCS, contains all variable names with placeholder values (extended with optional / build-flag rows in this step).
- `.env` — git-ignored (`.gitignore` line 64 confirms), contains development defaults that mirror `docker-compose.yml`. Safe to use for `docker compose up`, `python -m gps_denied_onboard.healthcheck`, and the existing test runner scripts.
- `.gitignore` already excludes `.env`, `.env.local`, and `*.key` while allow-listing the dev-fixture signing key (`!tests/fixtures/mavlink_signing/dev_key`). No changes needed.
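Because `.env.example` is meant to carry every variable name, drift between it and a local `.env` can be detected mechanically. The sketch below is illustrative only: `env_keys` and `missing_from_example` are hypothetical helpers, and the parser handles only plain `KEY=value` lines, `#` comments, and an optional `export ` prefix.

```python
from typing import Set


def env_keys(text: str) -> Set[str]:
    """Extract variable names from dotenv-style text."""
    keys = set()
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        keys.add(line.split("=", 1)[0].strip())
    return keys


def missing_from_example(example_text: str, env_text: str) -> Set[str]:
    """Names present in `.env` but absent from `.env.example`."""
    return env_keys(env_text) - env_keys(example_text)
```

A non-empty result from `missing_from_example` would indicate that a variable was added locally without being documented in the committed example file.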
## Pre-existing Deployment Artefacts (Discovered)
This is **not** a from-scratch deployment plan — the cycle-1 implementation already shipped working containerization scaffolding. Subsequent deploy-plan steps will harmonise these against the documents being produced rather than recreate them.
| File | Purpose | Status |
|------|---------|--------|
| `docker-compose.yml` | Tier-1 dev compose: `companion` + `operator-orchestrator` + `mock-sat` + `db` | ✅ working, healthchecks present |
| `docker-compose.test.yml` | Tier-1 e2e test compose (replay mode flags ON) | ✅ working |
| `docker-compose.test.jetson.yml` | Tier-2 Jetson e2e test compose | ✅ working |
| `e2e/docker/docker-compose.test.yml`, `e2e/docker/docker-compose.tier2-bridge.yml` | Suite-level e2e harness | ✅ owned by the e2e harness, referenced by `_docs/02_document/deployment/ci_cd_pipeline.md` |
| `docker/companion-tier1.Dockerfile`, `docker/operator-orchestrator.Dockerfile`, `docker/mock-suite-sat-service.Dockerfile` | Per-binary Dockerfiles | ✅ in tree (referenced by compose files) |
| `tests/e2e/Dockerfile`, `tests/e2e/Dockerfile.jetson` | Test runner images | ✅ in tree |
| `e2e/fixtures/tile-cache-builder/Dockerfile`, `e2e/fixtures/mock-suite-sat/Dockerfile`, `e2e/runner/Dockerfile` | Test fixtures | ✅ in tree |
| `scripts/run-tests.sh`, `scripts/run-tests-jetson.sh`, `scripts/run-performance-tests.sh` | Test entry points | ✅ in tree (Step 7 will add `deploy.sh`, `pull-images.sh`, `start-services.sh`, `stop-services.sh`, `health-check.sh`) |
| `_docs/02_document/deployment/ci_cd_pipeline.md` | Pre-existing CI/CD doc | ✅ exists (per Step 12 Test-Spec Sync output); Step 3 will reconcile against this status report |
## Next Steps
1. **User confirms this status report** (BLOCKING gate per the deploy skill Step 1).
2. **User picks the Cross-Cutting Decision option (A / B / C)** — this determines the production Tier-2 delivery shape and is required input for Step 2.
3. **Proceed to Step 2 (Containerization)** — under Option A: write the bare-metal JetPack production plan (tarball + systemd unit + flight-state gate) and the Tier-1 Docker plan (existing `docker-compose.yml`) separately. Under Option B: author `docker/Dockerfile.jetson` matching the suite-mandated OCI labels + `AZAION_REVISION` build-arg, and reconcile ADR-005 in `architecture.md` to the new "Docker on Jetson" stance. Under Option C: both artefacts, two CI lanes.
4. **Step 3 (CI/CD pipeline) is no longer "pick a platform"** — author `.woodpecker/01-test.yml` + `.woodpecker/02-build-push.yml` per the suite two-workflow contract (`../_infra/ci/README.md` → "Pipeline configuration — two-workflow contract"). Rewrite `_docs/02_document/deployment/ci_cd_pipeline.md` against the actual Woodpecker + Gitea Packages stack instead of the previously-assumed GitHub Actions runner.
5. After Step 3, auto-chain through Steps 4–7 (environment strategy, observability, deployment procedures, deployment scripts) per the deploy skill's workflow. Step 6 procedures must include the flight-state gate (`/run/azaion/in-flight`) and the audit-log chain (`AZAION_UPDATE_EVENT` via journald) regardless of which option wins above.
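The flight-state gate and audit-log chain required of the Step 6 procedures can be sketched as follows. This is a hedged illustration, not the suite's implementation: it assumes the flag file exists while the aircraft is in flight, and the JSON payload after the `AZAION_UPDATE_EVENT` marker is a hypothetical record format. Under systemd, a service's stdout is captured by journald by default, so printing the line is enough for `journalctl -g AZAION_UPDATE_EVENT` to find it.

```python
import json
from pathlib import Path
from typing import Callable

IN_FLIGHT_FLAG = Path("/run/azaion/in-flight")


def update_allowed(flag: Path = IN_FLIGHT_FLAG) -> bool:
    """Allow updates only when the in-flight flag is absent."""
    # Assumption: the flag's existence (not its content) signals "in flight".
    return not flag.exists()


def audit_line(action: str, revision: str, allowed: bool) -> str:
    """Build one greppable audit record (payload layout is hypothetical)."""
    payload = {"action": action, "revision": revision, "allowed": allowed}
    return "AZAION_UPDATE_EVENT " + json.dumps(payload, sort_keys=True)


def gated_update(revision: str, apply_update: Callable[[], None],
                 flag: Path = IN_FLIGHT_FLAG) -> bool:
    """Log the decision, then run `apply_update` only if the gate is open."""
    allowed = update_allowed(flag)
    print(audit_line("update", revision, allowed), flush=True)
    if allowed:
        apply_update()
    return allowed
```

Logging the decision before acting means a refused update still leaves an audit record, which keeps the chain complete under either delivery option.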