mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-21 13:41:14 +00:00
bf13549b32
# GPS-Denied Onboard — Environment Strategy

> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 4. Builds on
> Step 1 (`reports/deploy_status_report.md`), Step 2 (`containerization.md`),
> and Step 3 (`ci_cd_pipeline.md`). The deploy skill's standard
> Dev/Staging/Production template is adapted here for a Jetson-airborne
> system: production has two distinct targets (airborne Jetson + operator
> workstation), and "staging" maps to a lab Jetson HITL rig rather than a
> classical cloud pre-prod environment.

## Environments

| Environment | Purpose | Infrastructure | Data Source |
|-------------|---------|----------------|-------------|
| **Development** | Local developer workflow on a Tier-1 workstation (Linux/macOS-Colima). Runs the full Tier-1 stack (`companion-tier1` + `operator-orchestrator` + `mock-suite-sat-service` + `db`) for unit + integration + Tier-1 e2e (Reality Gate replay). | Docker Compose (`docker-compose.yml`, `docker-compose.test.yml`); named volumes (`db-data`, `fdr-data`, `tile-data`); bind-mount `tests/fixtures:/fixtures:ro`. Optional dev Postgres on host. | Seed data via Docker init scripts; **mocked `satellite-provider`** via `mock-suite-sat-service`; **dev MAVLink signing key** from `tests/fixtures/mavlink_signing/dev_key` (with `BUILD_DEV_STATIC_KEY=ON` on dev containers only); **Derkachi replay clip + tlog** committed under `_docs/00_problem/input_data/`. |
| **Staging** | Lab / research Jetson HITL rig — same Jetson Orin Nano Super hardware as airborne, but on the bench: SITL or recorded tlog as the FC source, recorded video as the camera source, no live flight. Used for pre-flight validation, NFT-PERF-* Tier-2 runs (when AZ-592 / AZ-593 land), and IT-12 comparative study. | Tier-2 hardware (Jetson Orin Nano Super) running JetPack 6.2 host OS + Docker via `runtime: nvidia`; image pulled from suite registry (`${REGISTRY_HOST}/azaion/gps-denied-onboard:dev-arm` per cycle-1 tag-suffix, eventually `:stage-arm`); compose file `docker-compose.test.jetson.yml` for HITL e2e; Postgres 16 native on host. | Recorded Derkachi clip + SITL tlog (deterministic); test calibration JSON (`adti26.json`); **dev signing key** (per-flight rotation disabled — staging FC is SITL, not signed). Mirrors Production volume mount layout (`/var/lib/gps-denied/{fdr,tiles}`, `/data/models`) so calibration-cache + INT8-engine artefacts are interchangeable between bench and field. |
| **Production** | Two distinct deploy targets, both handling real, non-anonymized flight data: (a) **airborne Jetson Orin Nano Super** carried on the aircraft, running the `companion-jetson` image under the parent-suite Watchtower flow per `containerization.md` ADR-005 amendment; (b) **operator workstation** running `operator-orchestrator` for pre-flight tile provisioning + post-landing upload via `FlightsApiClient` / `TileUploader`. | (a) Airborne: parent-suite `_infra/deploy/jetson/docker-compose.yml`, `runtime: nvidia`, Watchtower polling `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm`, host-mounted volumes for FDR (≥ 64 GB) + tile cache (≥ 10 GB) + model cache; native Postgres 16 on the Jetson NVM. (b) Operator workstation: `docker compose up` with `gps-denied-onboard/operator-orchestrator:main` or installed via `pull-images.sh` → `start-services.sh`; native Postgres 16 on the workstation. | Real flight data — live FC (ArduPilot Plane signed MAVLink 2.0, or iNav MSP2 unsigned), live nav camera (ADTi 20MP), live `satellite-provider` REST + on-disk tiles. **Per-flight ephemeral MAVLink + onboard signing keys** generated at takeoff load, rotated per flight, logged to FDR. Operator workstation reads `satellite-provider` API token from OS keyring; never written to any image. |

### Tier ↔ Environment Mapping

| Environment | Tier-1 image(s) used | Tier-2 image(s) used | Notes |
|-------------|----------------------|------------------------|-------|
| Development | `companion-tier1`, `operator-orchestrator`, `mock-suite-sat-service` | — | All four services (these three images plus `db`) via `docker-compose.yml`. |
| Staging (lab Jetson) | — | `companion-jetson` (when cycle-2 ships), or `companion-tier1` in Tier-1-on-Jetson interim | Tier-2 Jetson HITL pulls the arm64 image; `docker-compose.test.jetson.yml` orchestrates. |
| Production — airborne | — | `companion-jetson` (cycle-2) | Watchtower-managed; cycle-1 ships only the planning + Tier-1 images per `ci_cd_pipeline.md` Registry Layout. |
| Production — operator workstation | `operator-orchestrator` | — | Cycle-1 already builds + pushes `${REGISTRY_HOST}/azaion/gps-denied-onboard-operator-orchestrator:<branch>-arm`. |

## Environment Variables

### Required Variables (companion + operator-orchestrator)

> Source of truth: `.env.example` at repo root (extended in Step 1). The
> table below references that file; do NOT re-declare variable names here.

| Variable | Purpose | Dev Default (Tier-1 Docker) | Staging Source (lab Jetson) | Production Source |
|----------|---------|------------------------------|------------------------------|--------------------|
| `GPS_DENIED_FC_PROFILE` | FC adapter selection | `ardupilot_plane` | Per-rig fixed (matches the SITL profile in use) | Per-flight config from operator; written into the per-flight bundle on the operator workstation |
| `GPS_DENIED_TIER` | Runtime tier gate | `1` | `2` | `2` (baked into the Jetson image manifest) |
| `DB_URL` | Postgres connection | `postgresql://gps_denied:dev@db:5432/gps_denied` (dev Docker creds) | Lab Postgres init script — per-host random password | Per-host native Postgres init with random password; written to `/etc/gps-denied/.pgpass` (root:gps-denied, 0640) and exported by the systemd / Docker run hook |
| `SATELLITE_PROVIDER_URL` | Pre-flight tile download | `http://mock-sat:5100` | Lab `satellite-provider` (LAN-resolved); blank on airborne | Operator workstation env / VPN-resolved hostname; **empty on airborne** (defence-in-depth NFT-SEC-05 — in-flight egress lockdown) |
| `CAMERA_CALIBRATION_PATH` | Camera calibration JSON | `/fixtures/calibration/adti26.json` | `/etc/gps-denied/calibration/adti26.json` (operator copies the test fixture for HITL) | `/etc/gps-denied/calibration/adti20.json` (operator-acquired per D-PROJ-1) |
| `LOG_LEVEL` | Log verbosity | `DEBUG` | `INFO` | `INFO` |
| `LOG_SINK` | Log destination | `console` | `journald` (lab) | `fdr` on airborne; `journald` on operator workstation |
| `MAVLINK_SIGNING_KEY` | Per-flight signing key | `tests/fixtures/mavlink_signing/dev_key` (with `BUILD_DEV_STATIC_KEY=ON`) | `tests/fixtures/mavlink_signing/dev_key` (lab SITL, signing disabled or static-dev) | **Per-flight ephemeral key**, generated at takeoff load, rotated per flight, logged to FDR. Never committed; never written to the image. |
| `INFERENCE_BACKEND` | C7 backend selection | `pytorch_fp16` | `tensorrt` (Tier-2 hardware) | `tensorrt` |
| `FDR_PATH` | C13 ring writer | `/var/lib/gps-denied/fdr` (named volume `fdr-data`) | Host-mounted `/var/lib/gps-denied/fdr` on the lab Jetson | Host-mounted `/var/lib/gps-denied/fdr` on the airborne Jetson NVM partition (≥ 64 GB) |
| `TILE_CACHE_PATH` | C6 tile filesystem store | `/var/lib/gps-denied/tiles` (named volume `tile-data`) | Host-mounted `/var/lib/gps-denied/tiles` on the lab Jetson | Host-mounted `/var/lib/gps-denied/tiles` on the airborne Jetson NVM (≥ 10 GB) |

Optional / build-time strategy gating flags (`BUILD_VINS_MONO`, `BUILD_SALAD`, `BUILD_C11_TILE_MANAGER`, `BUILD_VIDEO_FILE_FRAME_SOURCE`, `BUILD_TLOG_REPLAY_ADAPTER`, `BUILD_REPLAY_SINK_JSONL`, `BUILD_DEV_STATIC_KEY`, `BUILD_STATE_ESKF`) are documented in `.env.example` and in `deploy_status_report.md` → "Required Environment Variables". Operative defaults per ADR-002 + ADR-004 + ADR-011:

- Airborne / operator-orchestrator binaries: `BUILD_C11_TILE_MANAGER=OFF` on airborne (ADR-004 process-level isolation — CI SBOM-diff + runtime self-check + NFT-SEC-02 egress test enforce); `BUILD_C11_TILE_MANAGER=ON` on operator-orchestrator only.
- Replay-mode strategy flags: `ON` on airborne + research; explicitly set in `docker-compose.test*.yml` for CI.
- `BUILD_DEV_STATIC_KEY`: **MUST stay OFF on production images.** Dev / CI containers only.

### `.env.example`

Source of truth lives at the repo root (`.env.example`), version-controlled. It contains placeholder values for all required variables plus comments for build-time gating flags. Operators copy it to `.env` (git-ignored) and fill in values per environment. Tier-2 production deploys do **not** use `.env` at all — environment variables are stamped into the systemd / Docker run hook by `start-services.sh` (Step 7) from `/etc/gps-denied/env.d/` files owned `root:gps-denied 0640`.

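
As a rough illustration of the env-loader idea described above, the sketch below merges `KEY=VALUE` files from a directory into one mapping. The helper name `load_env_dir`, the sorted `*.env` ordering, and the parsing rules (comments and blank lines skipped, no quoting support) are assumptions made for this example. The real hook is `start-services.sh`, and its file format may differ.

```python
from pathlib import Path


def load_env_dir(env_dir: str) -> dict[str, str]:
    """Merge KEY=VALUE pairs from every *.env file in env_dir (hypothetical helper).

    Files are read in sorted order, so a later file overrides an earlier one.
    """
    merged: dict[str, str] = {}
    for env_file in sorted(Path(env_dir).glob("*.env")):
        for raw_line in env_file.read_text(encoding="utf-8").splitlines():
            line = raw_line.strip()
            # Skip blank lines and comments.
            if not line or line.startswith("#"):
                continue
            key, sep, value = line.partition("=")
            if not sep:
                raise ValueError(f"{env_file}: expected KEY=VALUE, got {raw_line!r}")
            merged[key.strip()] = value.strip()
    return merged
```

Calling `load_env_dir("/etc/gps-denied/env.d")` on a production host would return the merged mapping. Raising on malformed lines, rather than skipping them, keeps the behaviour in line with the fail-fast stance taken elsewhere in this document.
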
### Variable Validation (fail-fast at startup)

All services validate required environment variables at startup and exit non-zero with a clear error message if any are missing. Implementation lives in each component's config module:

| Component | Config module | Variables validated |
|-----------|---------------|---------------------|
| Composition root | `src/gps_denied_onboard/runtime_root/__main__.py` | `GPS_DENIED_TIER`, `GPS_DENIED_FC_PROFILE`, `LOG_LEVEL`, `LOG_SINK` |
| C6 (tile cache) | `src/gps_denied_onboard/components/c6_tile_cache/config.py` | `DB_URL`, `TILE_CACHE_PATH` |
| C7 (inference) | `src/gps_denied_onboard/components/c7_inference/config.py` | `INFERENCE_BACKEND` (must be one of `tensorrt`, `pytorch_fp16`, `onnx_trt_ep`); `INFERENCE_BACKEND=tensorrt` requires the model cache volume mount |
| C8 (FC adapter) | `src/gps_denied_onboard/components/c8_fc_adapter/config.py` | `MAVLINK_SIGNING_KEY` (when `GPS_DENIED_FC_PROFILE=ardupilot_plane`) |
| C10 (provisioning) | `src/gps_denied_onboard/components/c10_provisioning/config.py` | `SATELLITE_PROVIDER_URL` (operator-orchestrator only; **must be empty on airborne**); `CAMERA_CALIBRATION_PATH` |
| C13 (FDR) | `src/gps_denied_onboard/components/c13_fdr/config.py` | `FDR_PATH` (must be writable, ≥ 64 GB free on production) |

Health check (`python3 -m gps_denied_onboard.healthcheck`, declared in each Dockerfile) re-runs the same validation set after startup so a Docker `HEALTHY` transition is conditioned on configuration validity, not just process liveness.

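
A minimal sketch of the fail-fast pattern described above, assuming a single shared helper. The names `validate_env` and `main` are hypothetical and do not reflect the actual per-component config modules. Only the variable names and the allowed `INFERENCE_BACKEND` values are taken from the table.

```python
import os
import sys
from collections.abc import Mapping

# Allowed values taken from the C7 row of the table above.
ALLOWED_INFERENCE_BACKENDS = {"tensorrt", "pytorch_fp16", "onnx_trt_ep"}


def validate_env(required: list[str], env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the config is valid."""
    problems = [
        f"missing required variable: {name}" for name in required if not env.get(name)
    ]
    backend = env.get("INFERENCE_BACKEND")
    if backend and backend not in ALLOWED_INFERENCE_BACKENDS:
        problems.append(
            f"INFERENCE_BACKEND={backend!r} is not one of {sorted(ALLOWED_INFERENCE_BACKENDS)}"
        )
    return problems


def main() -> int:
    """Report every problem on stderr and return a non-zero code if any exist."""
    required = ["GPS_DENIED_TIER", "GPS_DENIED_FC_PROFILE", "LOG_LEVEL", "LOG_SINK"]
    problems = validate_env(required, os.environ)
    for problem in problems:
        print(f"config error: {problem}", file=sys.stderr)
    return 1 if problems else 0
```

A service entry point would call `sys.exit(main())` before starting any work. A health check could call `validate_env` again with the same arguments, which is one way to tie the `HEALTHY` transition to configuration validity.
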
## Secrets Management

| Environment | Method | Tool / Location | Rotation |
|-------------|--------|-----------------|----------|
| Development | `.env` file (git-ignored) + `tests/fixtures/mavlink_signing/dev_key` (allow-listed in `.gitignore`) | dotenv loaded by Docker Compose; fixture key read directly by tests with `BUILD_DEV_STATIC_KEY=ON` | None — dev fixture is static. |
| Staging (lab Jetson) | `.env` file (git-ignored) on the Jetson host + same dev fixture signing key (lab SITL is not a signing-attack target) | `/etc/gps-denied/env.d/*.env` on the Jetson, `root:gps-denied 0640` | None — lab fixture is static. |
| Production — airborne | **Per-flight ephemeral MAVLink + onboard signing key generated at takeoff load, rotated per flight, logged to FDR.** The Postgres password is generated per-host at JetPack provisioning and stored in `/etc/gps-denied/.pgpass` (`root:gps-denied 0640`). The airborne image has **no inbound listeners** (NFT-SEC-05 in-flight egress lockdown) so no API secrets live on it. | Onboard secret generation: `KeySource` Protocol implemented in `src/gps_denied_onboard/components/c8_fc_adapter/key_source.py` (per-flight rotation). Postgres password: provisioning script on the Jetson host writes once at first boot. | **Per-flight rotation** for MAVLink + onboard signing keys (Principle #7). Postgres password rotated on operator-issued re-provisioning only. |
| Production — operator workstation | Operator's local credential store / OS keyring for the `satellite-provider` API token + per-flight onboard signing key staging. Suite Woodpecker global secrets (`registry_host`, `registry_user`, `registry_token`) for image pulls — already provisioned per `../_infra/ci/install-woodpecker.sh`; this submodule consumes them via `from_secret:` references in `.woodpecker/02-build-push.yml`. | macOS Keychain / GNOME-Keyring / Windows Credential Manager via a thin wrapper invoked by `start-services.sh`; Woodpecker global secrets injected as env vars at pipeline runtime. | `satellite-provider` API token: rotated by the suite operator (out-of-band); per-flight onboard signing keys rotated per flight (above). Registry token: rotated by suite operator on schedule. |
| CI | Suite-provisioned Woodpecker global secrets (`registry_host`, `registry_user`, `registry_token`) | Consumed by `.woodpecker/02-build-push.yml` via `from_secret:` references — never committed | Rotated by suite operator (out-of-band, ≤ 90 days target per suite policy). |

**Rotation policy (companion-side, normative)**:

- **Per-flight** (MAVLink 2.0 signing key + onboard signing key): mandatory; new keypair generated at takeoff load by `KeySource`, rotated even if the previous flight ended normally. Logged to FDR for chain-of-custody.
- **Per-host** (Postgres password on Jetson + operator workstation): rotated on operator-issued re-provisioning; no scheduled rotation.
- **Per-operator-credential** (`satellite-provider` API token, registry token): owned and rotated by the suite operator out-of-band; this submodule consumes whatever is provisioned.

**No external cloud secret manager** (AWS Secrets Manager / Azure Key Vault / HashiCorp Vault) is used. The combination of (a) per-flight ephemeral signing keys generated on-device, (b) no inbound network listeners on the airborne image, (c) per-host Postgres password with no shared state across hosts, and (d) suite-managed Woodpecker secrets for CI is sufficient for the operational risk model and matches `deploy_status_report.md` → "Secret manager — Per-flight ephemeral, no external manager".

**Never commit**: real MAVLink signing keys (the dev fixture `tests/fixtures/mavlink_signing/dev_key` is the allow-listed exception); real Postgres credentials (the committed `DB_URL` in `.env.example` uses the local Docker `dev` password placeholder); `satellite-provider` API tokens; `.env` files (`.gitignore` line 64 confirms).

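
The per-flight rotation rule can be pictured with a small sketch of a `KeySource`-style interface. Only the `KeySource` name and the idea of generating a fresh key at takeoff load come from this document. The method name `new_flight_key`, the `EphemeralKeySource` class, and `key_for_takeoff` are hypothetical, and the real `key_source.py` may look quite different. FDR logging is deliberately left out.

```python
import secrets
from typing import Protocol


class KeySource(Protocol):
    """Structural interface: any object with new_flight_key() qualifies."""

    def new_flight_key(self) -> bytes: ...


class EphemeralKeySource:
    """Generates a fresh random key on every call; nothing is read from disk or the image."""

    KEY_BYTES = 32  # MAVLink 2.0 message signing uses a 32-byte secret key

    def new_flight_key(self) -> bytes:
        return secrets.token_bytes(self.KEY_BYTES)


def key_for_takeoff(source: KeySource) -> bytes:
    """Request a new key at takeoff load; callers should never reuse it across flights."""
    return source.new_flight_key()
```

Depending on a `Protocol` instead of a concrete class is one way to let dev and CI containers swap in a static fixture-backed source while production uses the ephemeral one.
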
## Database Management

| Environment | Type | Migrations | Data |
|-------------|------|-----------|------|
| Development | Docker Postgres 16 (`db` service in `docker-compose.yml`), named volume `db-data` | Applied on container start by C6 bootstrap (idempotent `CREATE TABLE IF NOT EXISTS` for tile + descriptor index) | Seed data via the C6 bootstrap on first run; `docker compose down -v` drops the volume cleanly ahead of a fresh `docker compose up --build` |
| Staging (lab Jetson) | Native Postgres 16 on JetPack 6.2 host, sized ≤ 10 GB on a dedicated NVM partition | Applied via the same C6 bootstrap on first run; subsequent migrations applied via CI/CD lane (when cycle-2 lands an explicit migration runner) | Recorded Derkachi clip tile-set + descriptor index pre-loaded by `e2e/fixtures/tile-cache-builder/` |
| Production — airborne | Native Postgres 16 on the Jetson Orin Nano Super NVM partition (≥ 10 GB tile cache budget + descriptor index) | Applied via the C6 bootstrap at first systemd unit start; cycle-1 schema is bootstrap-only with no breaking migrations. Future migrations (cycle-2+): reversible, backward-compatible, applied by a dedicated migration job that is **gated by the flight-state flag** (`/run/azaion/in-flight` — no DB writes during flight) | Real flight data: pre-flight tile + descriptor index seeded by `TileDownloader` on the operator workstation, packaged by C10, and copied to the Jetson NVM at provisioning |
| Production — operator workstation | Native Postgres 16 on the operator workstation | Applied via the same C6 bootstrap; future migrations applied via CI/CD with operator approval | Operator-managed: tile downloads via `satellite-provider`, post-landing uploads via `TileUploader` |

### Migration Rules (cycle-2+ — not yet exercised)

- **Reversible**: every migration ships with an explicit DOWN / rollback script.
- **Backward-compatible**: a new schema version must continue to work with the previous binary's read path until the next release rotates both. Sequence: deploy migration → wait one release cycle → remove old code path.
- **Production gate**: production migrations require operator approval recorded in the Woodpecker UI before apply.
- **Flight-state gate**: migration jobs on the airborne Jetson refuse to run when `/run/azaion/in-flight` is set. The post-landing operator-issued reconcile path is the only window for schema changes on the airborne side.

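
The flight-state gate above could be sketched as a guard that runs before any migration. The flag path is taken from the rules. Everything else (`InFlightError`, `ensure_not_in_flight`, `run_migration`, and the reading of "set" as "the flag file exists") is an assumption for illustration, since no migration runner exists yet in cycle-1.

```python
from pathlib import Path

IN_FLIGHT_FLAG = Path("/run/azaion/in-flight")  # flag path taken from the rules above


class InFlightError(RuntimeError):
    """Raised when a migration is attempted while the flight-state flag is set."""


def ensure_not_in_flight(flag: Path = IN_FLIGHT_FLAG) -> None:
    # Assumption: "set" means the flag file exists; the real contract may differ.
    if flag.exists():
        raise InFlightError(f"refusing to migrate: {flag} is set")


def run_migration(apply, flag: Path = IN_FLIGHT_FLAG):
    """Check the gate first, then call the supplied migration callable."""
    ensure_not_in_flight(flag)
    return apply()
```

Passing the flag path as a parameter keeps the guard testable on a workstation, where `/run/azaion/` does not exist.
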
### Cycle-1 Migration Status

Cycle-1 ships **without a migration runner**. The C6 bootstrap path uses idempotent `CREATE TABLE IF NOT EXISTS` for the tile + descriptor index schema, which is enough for cycle-1 because no schema change has happened since the initial bootstrap. Adding a dedicated migration tool (Alembic / similar) is logged as a cycle-2 follow-up — recorded here so it is not lost.

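
To show why an idempotent bootstrap is safe to run on every start, here is a small sketch that uses the standard-library `sqlite3` module as a stand-in for Postgres. The table and column names are hypothetical and simplified. The real C6 schema is not shown in this document.

```python
import sqlite3

# Hypothetical, simplified schema; the real C6 tables are not shown in this document.
BOOTSTRAP_SQL = """
CREATE TABLE IF NOT EXISTS tiles (
    tile_id TEXT PRIMARY KEY,
    path    TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS descriptor_index (
    tile_id    TEXT NOT NULL REFERENCES tiles(tile_id),
    descriptor BLOB NOT NULL
);
"""


def bootstrap(conn: sqlite3.Connection) -> None:
    """Create the schema if absent; running it again leaves existing data untouched."""
    conn.executescript(BOOTSTRAP_SQL)
    conn.commit()
```

Running `bootstrap` twice on the same connection is a no-op the second time. That property only holds while the schema never changes, which is why a dedicated migration runner is needed once cycle-2 alters a table.
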
## Self-verification

- [x] All three environments (Development / Staging / Production) defined with clear purpose
- [x] Tier-1 ↔ Tier-2 mapping explicit (which image runs where)
- [x] Operator workstation called out as a distinct production target alongside airborne Jetson
- [x] Environment variable documentation references `.env.example` (source of truth) without re-declaring names
- [x] Per-variable Dev / Staging / Production sources tabulated
- [x] No secrets in this document (only placeholders + locations)
- [x] Secret manager strategy specified — per-flight ephemeral generation, no external cloud manager, suite-managed Woodpecker secrets for CI; rotation policy normative for per-flight rotation
- [x] Database strategy per environment (Docker Postgres → native Postgres on Jetson + operator workstation); cycle-1 bootstrap-only migration stance recorded; cycle-2 migration rules drafted
- [x] Flight-state gate (`/run/azaion/in-flight`) honoured in production-side migration rules
- [x] Variable validation strategy (fail-fast + healthcheck re-run) mapped to per-component config modules

## Next Steps

1. **Proceed to Step 5 (Observability)** — define structured logging (`LOG_SINK`), metrics (per-component counters, Prometheus-compatible exposition if cycle-2 adds it), tracing (out-of-scope for cycle-1; FDR records serve as the airborne audit trail), and the `AZAION_UPDATE_EVENT` journald audit chain.
2. **Step 6 (Deployment Procedures)** must reference this environment matrix when documenting per-environment deploy procedures (Tier-1 dev `docker compose up`, lab Jetson HITL `docker-compose.test.jetson.yml`, airborne Watchtower-driven update, operator workstation `docker compose up` with image pull).
3. **Step 7 (Deployment Scripts)** must implement the env-loader hook (`start-services.sh` reading `/etc/gps-denied/env.d/*.env` per-host on production targets), the per-host Postgres password generation hook, and the `KeySource` per-flight ephemeral key invocation contract.
4. **Cycle-2 follow-up**: introduce a dedicated migration runner (Alembic or equivalent) with the flight-state-gated apply path and operator-approval gate.