# GPS-Denied Onboard — Deployment Scripts

> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 7. Five
> bash scripts under `scripts/` automate the procedures in
> `deployment_procedures.md`. Step 7 is the only step in the deploy
> skill that produces executable artefacts; all five scripts honour the
> `/run/azaion/in-flight` flight-state gate documented in Step 6.

## Overview

| Script | Purpose | Location |
|--------|---------|----------|
| `deploy.sh` | Main orchestrator (pull → flight-state-check → stop → start → health); `--rollback` flag restores the previous image set | `scripts/deploy.sh` |
| `pull-images.sh` | Pull images from `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` (suite Gitea Packages registry) | `scripts/pull-images.sh` |
| `start-services.sh` | `docker compose up -d`; waits for HEALTHCHECK; emits `AZAION_UPDATE_EVENT` via journald | `scripts/start-services.sh` |
| `stop-services.sh` | Graceful `docker compose down`; saves current image digests to `.previous-tags.env` for rollback | `scripts/stop-services.sh` |
| `health-check.sh` | Reads Docker HEALTHCHECK status across the stack (no HTTP endpoint — NFT-SEC-05) | `scripts/health-check.sh` |

## Prerequisites

- Docker + Docker Compose v2 installed on the target host (Tier-1 workstation, lab Jetson, airborne Jetson, or operator workstation).
- For remote operation: SSH access to the target via `DEPLOY_HOST` env var (same pattern as `scripts/run-tests-jetson.sh` uses).
- Registry credentials: `REGISTRY_HOST` + `REGISTRY_USER` + `REGISTRY_TOKEN` (suite-provisioned Woodpecker global secrets per `../_infra/ci/README.md`). Loaded from `.env` at the repo root or passed via the environment.
- `.env` file populated from `.env.example`. See `environment_strategy.md` § Environment Variables for per-environment guidance.

## Environment Variables Consumed

All scripts source `.env` from the project root if present. The deploy-side variables they consume, beyond those already documented in `.env.example`, are:

| Variable | Required by | Purpose |
|----------|-------------|---------|
| `REGISTRY_HOST` | `pull-images.sh`, `deploy.sh` | Suite Gitea Packages registry hostname (e.g. `git.azaion.com`) |
| `REGISTRY_USER` | `pull-images.sh` | Registry user (Woodpecker global secret on CI; operator credentials locally) |
| `REGISTRY_TOKEN` | `pull-images.sh` | Registry token (matches Woodpecker global secret); passed to `docker login --password-stdin` |
| `DEPLOY_HOST` | All (remote mode) | SSH alias / `user@host` for remote execution. Unset = local execution. |
| `AIRBORNE_COMPOSE_FILE` | `start-services.sh`, `stop-services.sh`, `health-check.sh` (when `--target=airborne`) | Override the default airborne compose path (`/etc/gps-denied/docker-compose.airborne.yml`) |
| `AZAION_REVISION` | `start-services.sh` (for the audit `AZAION_UPDATE_EVENT` line) | Inherited from the image's `ENV AZAION_REVISION=$CI_COMMIT_SHA` per AZ-204 |
| `BRANCH`, `ARCH` | `pull-images.sh`, `deploy.sh` | Tag selector (defaults: `main`, `arm`) |
| `WAIT_SECS` | `start-services.sh`, `deploy.sh` | HEALTHCHECK wait budget (default: 120 s) |

`.previous-tags.env` is written at the project root by `stop-services.sh` and is git-ignored (added to `.gitignore` in this step).
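
The `.env` sourcing described in this section can be sketched as below. This is a minimal illustration, not the scripts' actual code: the helper name `load_env` and its directory argument are assumptions made for the example.

```bash
#!/usr/bin/env bash
set -euo pipefail

# load_env DIR
# Sources DIR/.env when it exists. `set -a` marks every variable assigned
# while sourcing for export, so child processes (docker, ssh) see them too.
# Hypothetical helper name; the real scripts may inline this logic.
load_env() {
  local env_file="$1/.env"
  if [[ -f "$env_file" ]]; then
    set -a
    # shellcheck disable=SC1090
    source "$env_file"
    set +a
  fi
}

# Example (assumed layout): load_env "$(cd "$(dirname "$0")/.." && pwd)"
```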

## Targets

Every script accepts `--target <dev|airborne|operator-workstation>` and picks a sensible compose file by default:

| `--target` | Default compose file | Purpose |
|------------|----------------------|---------|
| `dev` | `docker-compose.yml` | Tier-1 workstation Docker (developer + CI) |
| `operator-workstation` | `docker-compose.yml` (reused; operator workstation runs only `operator-orchestrator` + `db`) | Operator deploy of `operator-orchestrator`. Cycle-2 may add a dedicated `docker-compose.operator.yml` that excludes the `companion` service. |
| `airborne` | `${AIRBORNE_COMPOSE_FILE:-/etc/gps-denied/docker-compose.airborne.yml}` | Tier-2 airborne Jetson. Cycle-1 ships no compose file at this path — Watchtower drives updates via the parent-suite `_infra/deploy/jetson/docker-compose.yml`. In cycle-1 the scripts are still usable for manual, operator-issued stop/start cycles on the bench Jetson: pass `--compose-file ./docker-compose.test.jetson.yml` or point `AIRBORNE_COMPOSE_FILE` at the parent-suite compose. |
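
The target-to-compose-file mapping in the table above can be sketched as a small resolver. The function name `resolve_compose_file`, the exit status 64 for an unknown target, and the rule that an explicit `--compose-file` value takes precedence over `AIRBORNE_COMPOSE_FILE` are assumptions for illustration; the scripts themselves may order these differently.

```bash
#!/usr/bin/env bash
set -euo pipefail

# resolve_compose_file TARGET [OVERRIDE]
# Prints the compose file for TARGET. OVERRIDE stands in for the value of
# --compose-file and, in this sketch, always wins (assumed precedence).
resolve_compose_file() {
  local target="$1" override="${2:-}"
  if [[ -n "$override" ]]; then
    printf '%s\n' "$override"
    return 0
  fi
  case "$target" in
    dev|operator-workstation)
      printf '%s\n' "docker-compose.yml" ;;
    airborne)
      printf '%s\n' "${AIRBORNE_COMPOSE_FILE:-/etc/gps-denied/docker-compose.airborne.yml}" ;;
    *)
      printf 'unknown target: %s\n' "$target" >&2
      return 64 ;;
  esac
}
```

For the bench Jetson case, `resolve_compose_file airborne ./docker-compose.test.jetson.yml` prints the override path unchanged.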

## Script Details

### `deploy.sh`

Main orchestrator. Runs:

1. `pull-images.sh --target <target> --branch <branch> --arch <arch>` (skipped on `--rollback`)
2. Flight-state check (in-band — invokes `stop-services.sh` which performs the actual `/run/azaion/in-flight` probe)
3. `stop-services.sh --target <target>` (also writes `.previous-tags.env`)
4. `start-services.sh --target <target> --wait-secs <N>`
5. `health-check.sh --target <target>`

**Usage**:

```
scripts/deploy.sh [--target dev|airborne|operator-workstation]
                  [--branch <branch>] [--arch <arch>]
                  [--compose-file <path>]
                  [--wait-secs N]
                  [--rollback] [--force] [--help]
```

**Rollback**: when `--rollback` is passed, `deploy.sh` reads `.previous-tags.env` (written by the most recent `stop-services.sh` run), `docker pull`s each saved image digest, then proceeds with the stop → start → health pipeline. Cycle-1 does not retag — the operator owns the registry-side tag promotion per `deployment_procedures.md` § Rollback Procedures.
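
As a rough sketch of the rollback read step, the helper below extracts the saved image references from a bookmark file in the `PREV_<SERVICE>_IMAGE=<repo>@<sha256-digest>` layout. The function name `list_previous_images` is hypothetical, and the pull loop in the trailing comment is only one way `deploy.sh` might consume the output.

```bash
#!/usr/bin/env bash
set -euo pipefail

# list_previous_images FILE
# Prints one image reference per PREV_*_IMAGE line; comments and any other
# lines are skipped. Returns 1 when the bookmark file does not exist.
list_previous_images() {
  local file="$1" line
  if [[ ! -f "$file" ]]; then
    printf 'no rollback bookmark at %s\n' "$file" >&2
    return 1
  fi
  while IFS= read -r line || [[ -n "$line" ]]; do
    [[ "$line" =~ ^PREV_[A-Z0-9_]+_IMAGE=(.+)$ ]] || continue
    printf '%s\n' "${BASH_REMATCH[1]}"
  done < "$file"
}

# Possible use (assumption, not the script's actual code):
#   list_previous_images .previous-tags.env | while IFS= read -r ref; do
#     docker pull "$ref"
#   done
```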

**Force flag** (`--force`): bypasses the `/run/azaion/in-flight` safety gate. **Never pass during a live flight** — this is an emergency escape hatch for stuck flag scenarios (e.g. autopilot service died holding the flag).
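
The gate and its `--force` bypass might look roughly like the sketch below. It assumes that the flag being "set" means the path `/run/azaion/in-flight` exists, and that the gate applies only to the airborne target; the function name `flight_gate_allows` and its argument order are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

# flight_gate_allows TARGET FORCE [FLAG_PATH]
# Returns 0 when it is safe to proceed and 1 when the in-flight flag blocks.
# FORCE is "1" when --force was passed. FLAG_PATH is overridable for testing.
flight_gate_allows() {
  local target="$1" force="$2" flag="${3:-/run/azaion/in-flight}"
  [[ "$target" == "airborne" ]] || return 0   # gate only guards the airborne stack
  [[ -e "$flag" ]] || return 0                # assumption: "set" means the path exists
  if [[ "$force" == "1" ]]; then
    printf 'WARNING: %s is set; continuing because --force was passed\n' "$flag" >&2
    return 0
  fi
  printf 'refusing to continue: %s is set\n' "$flag" >&2
  return 1
}
```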

### `pull-images.sh`

Pulls the cycle-1 image set from the suite registry. Cycle-2 will pick up the airborne `companion-jetson` image automatically when `--target=airborne` is selected (the image name template is already coded for it).

**Usage**:

```
scripts/pull-images.sh [--branch <branch>] [--arch <arch>]
                       [--target dev|airborne|operator-workstation]
                       [--verify] [--help]
```

**`--verify`**: after pull, prints each image's RepoDigest + `AZAION_REVISION` env var (per the OCI labels mandated by AZ-204).
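
The image name template `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` can be expanded with a helper like the one below. The function name `image_ref` is an assumption; the `main` and `arm` defaults come from the environment variable table in this document.

```bash
#!/usr/bin/env bash
set -euo pipefail

# image_ref SERVICE
# Expands the suite tag template for one service, using BRANCH and ARCH from
# the environment with the documented defaults. Aborts if REGISTRY_HOST is unset.
image_ref() {
  local service="$1"
  local branch="${BRANCH:-main}" arch="${ARCH:-arm}"
  : "${REGISTRY_HOST:?REGISTRY_HOST must be set}"
  printf '%s/azaion/%s:%s-%s\n' "$REGISTRY_HOST" "$service" "$branch" "$arch"
}

# Possible use (assumption about how the script combines the pieces):
#   printf '%s' "$REGISTRY_TOKEN" | docker login "$REGISTRY_HOST" -u "$REGISTRY_USER" --password-stdin
#   docker pull "$(image_ref companion)"
```

Passing the token on stdin keeps it out of the process list and shell history, which is why `--password-stdin` is preferred over `-p`.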

### `start-services.sh`

`docker compose up -d --remove-orphans`. Polls `docker compose ps --format` until every service that declares a HEALTHCHECK reports `healthy` (default budget: 120 s). On success, emits a structured `AZAION_UPDATE_EVENT` line via journald (`logger -t gps-denied-onboard -p user.notice`).

**Usage**:

```
scripts/start-services.sh [--target dev|airborne|operator-workstation]
                          [--compose-file <path>]
                          [--wait-secs N] [--force] [--help]
```

**Refuses to start the airborne stack when `/run/azaion/in-flight` is set** (unless `--force` is passed) — this matches `deployment_procedures.md` § Deployment Strategy "ground-only safety gate".
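
The HEALTHCHECK wait budget can be illustrated with a generic polling loop. This sketch does not reproduce the script's own `docker compose ps` parsing; instead it takes any status command as arguments, and both the name `wait_until_healthy` and the one-second poll interval are assumptions.

```bash
#!/usr/bin/env bash
set -euo pipefail

# wait_until_healthy WAIT_SECS STATUS_CMD...
# Runs STATUS_CMD once per second until it exits 0 or WAIT_SECS elapse.
# Returns 0 on success and 1 on timeout.
wait_until_healthy() {
  local budget="$1"; shift
  local deadline=$(( SECONDS + budget ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      printf 'timed out after %ss waiting for healthy services\n' "$budget" >&2
      return 1
    fi
    sleep 1
  done
}

# Possible use, relying on health-check.sh exiting 0 only when all is healthy:
#   wait_until_healthy "${WAIT_SECS:-120}" scripts/health-check.sh --target dev
```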

### `stop-services.sh`

Graceful `docker compose down --remove-orphans`. The companion's stop sequence is governed by Docker's default 10 s grace period in cycle-1; cycle-2 adds `stop_grace_period: 30s` to the `companion` service block (see `deployment_procedures.md` § Graceful Shutdown — Cycle-1 status).

Before stopping, writes the current image set to `.previous-tags.env` in the repo root:

```
# Saved by scripts/stop-services.sh on 2026-05-20T05:54:00Z
# Used by deploy.sh --rollback to restore the previous image set.
# Service tag layout: PREV_<SERVICE>_IMAGE=<repo>@<sha256-digest>
PREV_COMPANION_IMAGE=gps-denied-onboard/companion@sha256:abc…
PREV_OPERATOR_ORCHESTRATOR_IMAGE=gps-denied-onboard/operator-orchestrator@sha256:def…
PREV_MOCK_SAT_IMAGE=gps-denied-onboard/mock-suite-sat-service@sha256:…
PREV_DB_IMAGE=postgres@sha256:…
```
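
The `PREV_<SERVICE>_IMAGE` keys above appear to be derived from the compose service name by upper-casing it and replacing hyphens with underscores (for example `operator-orchestrator` becomes `PREV_OPERATOR_ORCHESTRATOR_IMAGE`). A minimal sketch of that derivation follows; the helper names `prev_var_name` and `bookmark_line` are hypothetical, and the derivation rule is inferred from the sample file rather than confirmed by the script source.

```bash
#!/usr/bin/env bash
set -euo pipefail

# prev_var_name SERVICE
# Maps a compose service name to its bookmark variable name.
prev_var_name() {
  local service="$1" upper
  upper="$(printf '%s' "$service" | tr '[:lower:]' '[:upper:]' | tr '-' '_')"
  printf 'PREV_%s_IMAGE\n' "$upper"
}

# bookmark_line SERVICE IMAGE_REF
# Formats one line of .previous-tags.env for a service and its digest reference.
bookmark_line() {
  printf '%s=%s\n' "$(prev_var_name "$1")" "$2"
}
```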

**Refuses to stop the airborne stack when `/run/azaion/in-flight` is set** (unless `--force` is passed).

**Usage**:

```
scripts/stop-services.sh [--target dev|airborne|operator-workstation]
                         [--compose-file <path>] [--force] [--help]
```

### `health-check.sh`

Reads Docker HEALTHCHECK status across the stack via `docker compose ps --format '{{.Service}}\t{{.State}}\t{{.Health}}'`. No HTTP endpoints (NFT-SEC-05 — the companion has no inbound listener).

**Usage**:

```
scripts/health-check.sh [--target dev|airborne|operator-workstation]
                        [--compose-file <path>] [--help]
```

**Exit codes**:

- `0` — all services healthy (or running with no declared HEALTHCHECK, which is the case for services that intentionally have none, e.g. `mock-sat` in test profiles where the HEALTHCHECK is declared elsewhere).
- `1` — at least one service is `running` but `unhealthy`.
- `2` — at least one service is not `running` (exited, dead, or never started).
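
The exit-code rules above can be sketched as a classifier over the tab-separated `service / state / health` lines that the `--format` string produces. The function name `classify_health` is hypothetical. Two details are assumptions because the document does not state them: a not-running service (code 2) takes precedence over an unhealthy one (code 1), and a health value other than `unhealthy` (including an empty one) is treated as acceptable.

```bash
#!/usr/bin/env bash
set -euo pipefail

# classify_health
# Reads "service<TAB>state<TAB>health" lines on stdin and returns 0, 1, or 2
# following the documented exit-code table.
classify_health() {
  local service state health rc=0
  while IFS=$'\t' read -r service state health; do
    [[ -n "$service" ]] || continue
    if [[ "$state" != "running" ]]; then
      rc=2                                   # exited, dead, or never started
    elif [[ "$health" == "unhealthy" && $rc -lt 2 ]]; then
      rc=1                                   # running but failing its HEALTHCHECK
    fi
  done
  return "$rc"
}

# Possible use (assumption):
#   docker compose ps --format '{{.Service}}\t{{.State}}\t{{.Health}}' | classify_health
```

Because the function returns non-zero by design, callers running under `set -e` should capture the status, for example with `classify_health || rc=$?`.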

## Common Properties

All five scripts:

- Start with `#!/usr/bin/env bash` and `set -euo pipefail`.
- Support `--help` / `-h` (heredoc-based usage block — robust to source-line reordering).
- Source `.env` from the project root if present (`set -a` / `set +a` around the source so the variables are exported into the script's environment + subprocesses).
- Support **remote execution** via `DEPLOY_HOST=<ssh-alias>` env var. When set, every docker command is run via `ssh ${DEPLOY_HOST}`. The pre-flight SSH check uses `-o BatchMode=yes -o ConnectTimeout=5` (same pattern as `scripts/run-tests-jetson.sh`).
- Are **idempotent** for the running-stack case: `start-services.sh` is safe to re-run on an already-healthy stack; `stop-services.sh` is safe to re-run on an already-stopped stack; `pull-images.sh` is safe to re-run (docker will report "Image is up to date").
- Have stable exit codes, documented in each script's `--help` (the `health-check.sh` codes are also listed in its section above).
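
The remote-execution behaviour can be sketched as a thin wrapper that every docker call goes through. The names `ssh_preflight` and `run_docker` are hypothetical, and quoting the remote command with `printf '%q'` is a choice made for this sketch, not something the document states about the real scripts.

```bash
#!/usr/bin/env bash
set -euo pipefail

# ssh_preflight
# Fails fast when DEPLOY_HOST is set but not reachable without a password prompt.
ssh_preflight() {
  [[ -n "${DEPLOY_HOST:-}" ]] || return 0
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$DEPLOY_HOST" true
}

# run_docker ARGS...
# Runs `docker ARGS` locally, or on DEPLOY_HOST over SSH when it is set.
run_docker() {
  if [[ -n "${DEPLOY_HOST:-}" ]]; then
    local remote_cmd
    # %q re-quotes each argument so it survives the remote shell's word splitting.
    remote_cmd="$(printf '%q ' docker "$@")"
    ssh "$DEPLOY_HOST" "$remote_cmd"
  else
    docker "$@"
  fi
}
```

With `DEPLOY_HOST` unset, `run_docker compose ps` is equivalent to `docker compose ps`; with it set, the same call is forwarded to the remote host.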

## Local Smoke Test (Tier-1 dev)

After authoring, the operator can smoke-test the full chain on a Tier-1 workstation:

```bash
# Reset
docker compose -f docker-compose.yml down -v

# Manual pipeline (does what deploy.sh does, step by step)
scripts/pull-images.sh --target dev --branch dev --arch arm   # optional in dev; the dev compose builds locally
scripts/start-services.sh --target dev --wait-secs 180        # gives 3 min for pip / cmake on first build
scripts/health-check.sh --target dev                          # exit 0 when companion + operator-orchestrator + db + mock-sat are healthy
scripts/stop-services.sh --target dev                         # writes .previous-tags.env
```

A `docker compose ps` between each step verifies the expected service state. Cycle-2 will add an automated smoke test under `tests/e2e/scripts/` that runs this sequence on a CI-clean host.

## Self-verification

- [x] All five scripts created under `scripts/` and marked executable (`chmod +x`).
- [x] Scripts source `.env` from the project root (when present); `REGISTRY_HOST` / `REGISTRY_USER` / `REGISTRY_TOKEN` consumed by `pull-images.sh`.
- [x] `deploy.sh` orchestrates pull → flight-state-check → stop → start → health; `--rollback` restores `.previous-tags.env`.
- [x] `pull-images.sh` handles `docker login` via `--password-stdin` and tags images per `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` (suite contract).
- [x] `start-services.sh` brings up `docker compose up -d` and waits for HEALTHCHECK; emits `AZAION_UPDATE_EVENT` via `logger` on systemd hosts.
- [x] `stop-services.sh` writes `.previous-tags.env` then runs `docker compose down --remove-orphans`; honours the `/run/azaion/in-flight` gate.
- [x] `health-check.sh` reads HEALTHCHECK status via `docker compose ps` (no HTTP endpoint — NFT-SEC-05).
- [x] Rollback supported via `deploy.sh --rollback`.
- [x] Remote deployment via SSH supported through `DEPLOY_HOST` (same pattern as `scripts/run-tests-jetson.sh`).
- [x] `.previous-tags.env` added to `.gitignore` (rollback bookmark; not a committed artefact).
- [x] All scripts use heredoc-based `--help` (robust to source-line shifts) and `set -euo pipefail`.
- [x] `bash -n` syntax-checks pass on all five scripts.

## Cycle-2 Polish (logged, not implemented in cycle-1)

1. **`stop_grace_period: 30s`** on the `companion` service in `docker-compose.yml` + the parent-suite Jetson compose, once the Step 2 BLOCKING gate "TensorRT INT8 cache durability under Docker" measures the actual drain budget on Tier-2 hardware (`deployment_procedures.md` § Graceful Shutdown — Cycle-1 status).
2. **`docker-compose.operator.yml`** — operator-only compose that excludes the `companion` service so `--target=operator-workstation` doesn't pull / start the airborne binary at all.
3. **Tag-rotation helper** — `scripts/promote-tag.sh <sha> <branch>` that retags the registry-side `${REGISTRY_HOST}/azaion/<service>:<branch>-arm` for production rollouts. Cycle-1 keeps this operator-manual.
4. **`scripts/post-flight-pull.sh`** — pulls FDR segments from the airborne Jetson to the operator workstation and runs `python3 -m gps_denied_onboard.post_flight.summarise` (per `observability.md` § Flight Analytics).
5. **CI-clean smoke test** — `tests/e2e/scripts/test_deploy_pipeline.sh` exercising `pull → start → health → stop → rollback` against a clean Docker host (gated by `RUN_DEPLOY_E2E=1`).
6. **Watchtower post-update hook on the operator workstation** — cycle-2 may add a Watchtower instance on the operator workstation that polls the suite registry and applies updates automatically. Cycle-1 leaves the operator workstation on the `scripts/deploy.sh` operator-driven path.