# GPS-Denied Onboard — Containerization

> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 2.
> Builds on Step 1 output (`reports/deploy_status_report.md`) and the
> parent-suite CI/CD reality at `../_infra/ci/README.md`. Tier-2 delivery
> shape: **Option B (Docker on Jetson via Watchtower) — autodev-resolved
> 2026-05-19; reversible per Step 1 report**.

## Containerization Stance

| Tier | Production runtime | Image source |
|------|--------------------|--------------|
| Tier-1 (workstation dev + CI + replay) | Docker via `docker-compose.yml` / `docker-compose.test.yml` | This submodule (`docker/companion-tier1.Dockerfile`, `docker/operator-orchestrator.Dockerfile`, `docker/mock-suite-sat-service.Dockerfile`) |
| Tier-2 (Jetson Orin Nano Super production) | Docker via parent-suite `_infra/deploy/jetson/docker-compose.yml` + Watchtower auto-update | This submodule's new `docker/companion-jetson.Dockerfile` (NEW under Option B) pushed to `${REGISTRY_HOST}/azaion/gps-denied-onboard:<branch>-arm` |
| Tier-2 (lab/research IT-12 binary) | Docker (same `companion-jetson.Dockerfile` with research strategy flags ON) or bare JetPack install via tarball | Optional separate image tag `:research-arm`; cycle-1 ships only the deployment binary path |

Three architectural binary tracks (per ADR-002 + ADR-011) collapse onto
**two production Docker images** in this plan:

1. **`gps-denied-onboard` (airborne)** — `docker/companion-jetson.Dockerfile` for Tier-2 production + `docker/companion-tier1.Dockerfile` for Tier-1. Same Python module entrypoint (`python3 -m gps_denied_onboard.runtime_root`); runs both **live mode** and **replay mode** from a single image per ADR-011 — config (`config.mode = live | replay`) selects strategies at startup.
2. **`gps-denied-operator-orchestrator`** — `docker/operator-orchestrator.Dockerfile` for the operator workstation (C10 + C11 + C12).

Test fixtures (`mock-suite-sat-service`, `e2e-runner`) and test infrastructure (Tier-1 + Tier-2 runners) ship as separate non-deployable images. The research binary is a build-flag variant of the airborne image, not a separate Dockerfile.

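The single-image, config-selected mode described above can be sketched roughly as follows. This is an illustration only, not the project's `runtime_root`: the function `select_strategies`, the `Strategies` type, and the strategy labels are hypothetical names; only the live/replay split and the three replay strategies (video-file frame source, tlog replay adapter, JSONL sink) come from this plan.

```python
from dataclasses import dataclass

# Hypothetical strategy labels. The replay entries mirror the three replay
# build flags named in this plan; the live entries are placeholders.
_STRATEGIES = {
    "live": {"frame_source": "camera", "fc_adapter": "mavlink_serial", "sink": "fc_output"},
    "replay": {"frame_source": "video_file", "fc_adapter": "tlog_replay", "sink": "jsonl"},
}


@dataclass(frozen=True)
class Strategies:
    frame_source: str
    fc_adapter: str
    sink: str


def select_strategies(mode: str) -> Strategies:
    """Return the strategy set for a config mode, rejecting unknown modes early."""
    try:
        return Strategies(**_STRATEGIES[mode])
    except KeyError:
        raise ValueError(f"config.mode must be 'live' or 'replay', got {mode!r}") from None
```

Failing at startup on an unknown mode keeps a misconfigured container from silently running the wrong strategy set.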
## ADR-005 Amendment (DRAFT — pending Step 12 / Update Docs sync)

> Draft language for the architecture follow-up flagged in Step 1's
> Cross-Cutting Decision. Lands in `architecture.md` ADR-005 (amendment)
> or a new ADR-012 when Step 12 (Test-Spec Sync) / autodev's existing-code
> Step 13 (Update Docs) picks this up. The current `architecture.md`
> ADR-005 paragraph "Tier-2 (Jetson) does NOT use Docker" becomes
> inconsistent with this plan and must be reconciled.

> **Container scope (amended)**: Tier-1 uses Docker (`docker compose` for
> the developer setup). **Tier-2 (Jetson production) ALSO uses Docker**,
> via the parent-suite `_infra/deploy/jetson/docker-compose.yml` +
> Watchtower flow, with `runtime: nvidia` for GPU access and explicit
> volume mounts for the TensorRT INT8 calibration cache
> (`model-cache:/data/models`) and the C13 FDR ring
> (`fdr-data:/var/lib/gps-denied/fdr`). The two technical concerns the
> original ADR-005 cited — INT8 calibration cache stability and
> `jetson-stats` thermal telemetry access — are addressed by (a) the
> calibration cache living in a host-mounted volume that survives
> container restarts and (b) `jetson-stats` accessed via the
> nvidia-container-runtime's standard device passthrough (same pattern
> the parent-suite `detections` service already uses successfully on the
> same hardware). The deployment binary is the Docker image; the JetPack
> 6.2 system image is the **host** OS, not the runtime layer.

### Step 2 Validation Gates (BLOCKING — must pass before Step 3)

If either of these gates fails, **fall back to Option A** (bare-JetPack
systemd unit) and re-write this containerization plan:

| Gate | What it validates | Pass criteria | Owner |
|------|-------------------|---------------|-------|
| **TensorRT INT8 cache durability under Docker** | Build a calibration cache inside the running container; restart the container; verify the cache is reused and inference output is byte-equivalent | SHA-256 of the calibration cache file before and after restart matches; first-frame inference timing post-restart is within 5% of pre-restart timing (cache hit) | C7 owner; runs against the `companion-jetson` image on the actual Tier-2 Jetson |
| **`jetson-stats` thermal telemetry under Docker** | Run `jtop` (jetson-stats CLI) inside the container with `runtime: nvidia`; verify thermal + power + GPU clock readings match `sudo jtop` on the host within 1% | All thermal zones reported; CPU/GPU clock readings present; D-CROSS-LATENCY-1 hybrid trigger threshold readable | C7 / C5 owners; runs against the `companion-jetson` image |

Both gates land as task tickets when Step 16 chains into the next-cycle
existing-code flow (autodev resumes at existing-code Step 9 New Task per
the Done state). They are **deferred to next cycle** and recorded here so
they are not lost; the cycle-1 deploy plan ships Option B with the
validation marked as "validation pending" in `deploy_status_report.md`.

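The pass criteria for the cache-durability gate (matching SHA-256, timing within 5%) could be checked with something like the sketch below. It assumes the caller has already captured the cache file and the first-frame timings before and after the restart; the helper names `sha256_of`, `within_tolerance`, and `cache_gate_passes` are hypothetical and not part of the codebase.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large cache files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def within_tolerance(before: float, after: float, tolerance: float) -> bool:
    """True when `after` deviates from `before` by at most `tolerance` (a fraction)."""
    if before <= 0:
        raise ValueError("reference value must be positive")
    return abs(after - before) / before <= tolerance


def cache_gate_passes(hash_before: str, hash_after: str,
                      first_frame_ms_before: float, first_frame_ms_after: float) -> bool:
    """Gate passes when the cache is byte-identical and timing stays within 5%."""
    return hash_before == hash_after and within_tolerance(
        first_frame_ms_before, first_frame_ms_after, 0.05
    )
```

The same `within_tolerance` helper, called with `0.01`, would cover the 1% comparison between in-container and host `jtop` readings in the second gate.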
## Component-to-Image Mapping

Per ADR-009, components are folders under `src/gps_denied_onboard/components/`. They are not separate processes / containers in this monolithic Python-with-C++-extensions architecture. The mapping below shows which component code paths each image links.

| Image | Components linked | BUILD_* flags (defaults) |
|-------|-------------------|---------------------------|
| `companion-jetson` (Tier-2 prod) + `companion-tier1` (Tier-1 dev) | C1 (`KltRansac` default), C2 (`UltraVPR` default), C2.5, C3 (`DISK+LightGlue`), C3.5, C4, C5 (`GtsamIsam2`), C6, C7 (`tensorrt` on Tier-2, `pytorch_fp16` on Tier-1), C8 (per `GPS_DENIED_FC_PROFILE`), C13 + replay strategies (`BUILD_VIDEO_FILE_FRAME_SOURCE=ON`, `BUILD_TLOG_REPLAY_ADAPTER=ON`, `BUILD_REPLAY_SINK_JSONL=ON`) | `BUILD_VINS_MONO=OFF`, `BUILD_SALAD=OFF`, `BUILD_C11_TILE_MANAGER=OFF` (ADR-004 enforcement), `BUILD_DEV_STATIC_KEY=OFF`, `BUILD_STATE_ESKF=OFF` |
| `operator-orchestrator` (operator workstation) | C10, C11 (`TileDownloader` + `TileUploader`), C12 | `BUILD_C11_TILE_MANAGER=ON` |
| `mock-suite-sat-service` (test fixture) | NONE (FastAPI stub of the parent-suite `satellite-provider` D-PROJ-2 contract) | — |
| `e2e-runner` Tier-1 (`tests/e2e/Dockerfile`) | Full SUT (editable install) + pytest entrypoint | Test profile defaults |
| `e2e-runner` Tier-2 (`tests/e2e/Dockerfile.jetson`) | Full SUT (editable install) + pytest entrypoint; `dustynv/l4t-pytorch:r36.4.0` base | Test profile defaults |

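As a rough illustration of how the airborne flag defaults in the table could be turned into CMake cache definitions, consider the sketch below. The flag names and values are taken from the table; the helper names `cmake_flag_args` and `check_airborne_flags`, and the idea of assembling the arguments in Python at all, are assumptions made for illustration.

```python
# Airborne image defaults, copied from the mapping table above.
AIRBORNE_DEFAULTS = {
    "BUILD_VINS_MONO": False,
    "BUILD_SALAD": False,
    "BUILD_C11_TILE_MANAGER": False,
    "BUILD_DEV_STATIC_KEY": False,
    "BUILD_STATE_ESKF": False,
    "BUILD_VIDEO_FILE_FRAME_SOURCE": True,
    "BUILD_TLOG_REPLAY_ADAPTER": True,
    "BUILD_REPLAY_SINK_JSONL": True,
}


def check_airborne_flags(flags: dict[str, bool]) -> None:
    """Refuse an airborne build that links the C11 tile manager (ADR-004)."""
    if flags.get("BUILD_C11_TILE_MANAGER", False):
        raise ValueError("BUILD_C11_TILE_MANAGER must stay OFF for the airborne image")


def cmake_flag_args(flags: dict[str, bool]) -> list[str]:
    """Render flags as -D<NAME>=ON|OFF arguments in a stable (sorted) order."""
    return [f"-D{name}={'ON' if enabled else 'OFF'}" for name, enabled in sorted(flags.items())]
```

Sorting the flags keeps the rendered command line deterministic, which helps when comparing build logs across CI runs.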
## Per-Image Dockerfile Specifications

### `companion-jetson` — **NEW under Option B**

| Property | Value |
|----------|-------|
| File | `docker/companion-jetson.Dockerfile` (new in next cycle's Step 7 — Implementation; this plan specifies the contents) |
| Base image | `dustynv/l4t-pytorch:r36.4.0` (digest-pinned per suite follow-up #1) — same base proven by `tests/e2e/Dockerfile.jetson` |
| Stages | (1) system-deps (apt: `build-essential`, `cmake`, `libpq-dev`, `libspatialindex-dev`, `libgl1`, `libglib2.0-0`) → (2) python-deps (`pip install -e ".[inference]"` with the Tegra-tuned torch preserved per the existing Tier-2 e2e Dockerfile rationale) → (3) cpp-build (CMake build of the native VIO / matcher extensions with `BUILD_VINS_MONO=OFF`, `BUILD_C11_TILE_MANAGER=OFF`) → (4) runtime (slim image carrying the venv + native libs + SUT source) |
| User | `gps-denied` non-root uid 10001 (companion does not need root inside the container; volume mounts owned by the same uid on the host) |
| Build args | `CI_COMMIT_SHA` (suite-mandated; stamped as OCI labels + `ENV AZAION_REVISION`); `BRANCH` (carried into image labels) |
| OCI labels | `org.opencontainers.image.revision=$CI_COMMIT_SHA`, `org.opencontainers.image.created=<UTC RFC 3339>`, `org.opencontainers.image.source=$CI_REPO_URL` (suite-mandated per `../_infra/ci/README.md` → "OCI image labels and commit provenance (AZ-204)") |
| ENV | `AZAION_SERVICE=gps-denied-onboard`, `AZAION_REVISION=$CI_COMMIT_SHA`, `PYTHONPATH=/opt/gps-denied/src`, `PATH=/opt/venv/bin:$PATH` |
| Health check | `python3 -m gps_denied_onboard.healthcheck` — `--interval=10s --timeout=3s --start-period=30s --retries=3` (longer start-period than Tier-1 because TensorRT engine deserialization takes seconds on Jetson) |
| Exposed ports | `8080` (HTTP healthz + future replay-mode JSONL stream socket; mapped to host `5040:8080` per parent-suite compose). MAVLink + camera I/O is **not** TCP — it is host-bound (`/dev/ttyUSB*`, `/dev/video*`) via device passthrough. |
| Volume mounts (declared in parent-suite compose) | `model-cache:/data/models` (TensorRT engines + calibration cache + descriptor index); `fdr-data:/var/lib/gps-denied/fdr` (C13 ring, ≥ 64 GB); `tile-data:/var/lib/gps-denied/tiles` (C6 filesystem store, ≥ 10 GB); `/run/azaion:/run/azaion` (flight-state flag, read-only); device passthrough for `/dev/ttyUSB*` (FC UART) + `/dev/video*` (nav camera) |
| Watchtower labels | `com.centurylinklabs.watchtower.enable=true` + post-update hook emitting `AZAION_UPDATE_EVENT` per suite `x-update-logger` template |
| ENTRYPOINT | `python3 -m gps_denied_onboard.runtime_root` (same as Tier-1) |
| Flight-state gate | Honoured via `/run/azaion/in-flight` bind mount — Watchtower restart hook MUST check the flag before restarting (suite-managed; the image itself only honours the flag when transitioning between strategies at boot — there is no in-process restart logic) |

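The three suite-mandated OCI labels can be illustrated with the sketch below, which builds the label values and renders them as `docker build --label` arguments. This is one possible way to stamp them; the plan itself stamps them from the `CI_COMMIT_SHA` build arg inside the Dockerfile. The helper names `oci_labels` and `label_args` are hypothetical.

```python
from datetime import datetime, timezone


def oci_labels(commit_sha: str, repo_url: str, created: datetime) -> dict[str, str]:
    """Build the revision / created / source labels; `created` is rendered as UTC RFC 3339."""
    if created.tzinfo is None:
        raise ValueError("created must be timezone-aware")
    stamp = created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "org.opencontainers.image.revision": commit_sha,
        "org.opencontainers.image.created": stamp,
        "org.opencontainers.image.source": repo_url,
    }


def label_args(labels: dict[str, str]) -> list[str]:
    """Render labels as repeated `--label key=value` arguments for `docker build`."""
    args: list[str] = []
    for key, value in labels.items():
        args += ["--label", f"{key}={value}"]
    return args
```

Requiring a timezone-aware timestamp avoids stamping a local time that would be misread as UTC.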
### `companion-tier1` (existing — `docker/companion-tier1.Dockerfile`)

| Property | Value |
|----------|-------|
| Base image | `ubuntu:22.04` (system-deps stage) → `ubuntu:22.04` (runtime) |
| Stages | 4 (`system-deps` → `python-deps` → `cpp-build` → `runtime`) — already documented in the file header |
| User | Currently root (acceptable for Tier-1 dev / CI containers — Tier-2 production hardens this in `companion-jetson`) |
| Health check | `python3 -m gps_denied_onboard.healthcheck` — `--interval=10s --timeout=3s --start-period=15s --retries=3` |
| Exposed ports | None (Tier-1 healthcheck is in-process; CI exposes nothing) |
| Notes | **No change required for cycle-1.** Next cycle: add `BRANCH` + `CI_COMMIT_SHA` build args + OCI labels for parity with `companion-jetson`. |

### `operator-orchestrator` (existing — `docker/operator-orchestrator.Dockerfile`)

| Property | Value |
|----------|-------|
| Base image | `python:3.10-slim` |
| Stages | 1 (`runtime`) — single-stage is acceptable here because the operator-orchestrator has no native C++ extensions and the slim base keeps it lean |
| User | Currently root — same Tier-1 caveat as `companion-tier1` |
| Health check | `python3 -m gps_denied_onboard.healthcheck` — `--interval=10s --timeout=3s --start-period=10s --retries=3` |
| Exposed ports | TBD (next cycle adds the C12 CLI's HTTP control surface for the operator UI; today the CLI runs as a one-shot invocation) |
| Notes | **No change required for cycle-1.** |

### `mock-suite-sat-service` (existing — `docker/mock-suite-sat-service.Dockerfile`)

| Property | Value |
|----------|-------|
| Base image | `python:3.10-slim` |
| User | Currently root — acceptable, this is an e2e test fixture only |
| Health check | `urllib.request.urlopen('http://127.0.0.1:5100/healthz')` — `--interval=5s --timeout=2s --retries=3` |
| Exposed ports | `5100` (HTTP) |
| Notes | **Not a production image.** Retired when parent-suite D-PROJ-2 ships the real ingest endpoint. |

### `e2e-runner` Tier-1 (existing — `tests/e2e/Dockerfile`)

Test runner for the Reality Gate on Colima / Tier-1 workstation Docker. Not a production image. ENTRYPOINT: `pytest -q /opt/tests/e2e/`. **No change for cycle-1.**

### `e2e-runner` Tier-2 (existing — `tests/e2e/Dockerfile.jetson`)

Test runner for the Reality Gate on the Jetson. `dustynv/l4t-pytorch:r36.4.0` base. The new `companion-jetson` production image inherits its base image choice and Tegra-pip rationale from this file. **No change for cycle-1.**

## Docker Compose — Local Development (existing `docker-compose.yml`)

The existing root `docker-compose.yml` already covers Tier-1 dev: `companion` + `operator-orchestrator` + `mock-sat` + `db` (Postgres 16), with healthchecks, named volumes (`db-data`, `fdr-data`, `tile-data`), and a `tests/fixtures:/fixtures:ro` bind mount for the dev calibration JSON + signing key.

**No structural change required.** Optional cycle-2 polish:

- Declare an explicit named network (`gps-denied-dev`) under the top-level `networks:` key (the file currently relies on Docker Compose's default network) so the suite-level e2e harness can join it explicitly when needed.
- Reference `${BRANCH:-main}` for image tags so the dev compose can pull from the suite registry instead of always building.

## Docker Compose — Blackbox Tests (existing)

| File | Purpose | Status |
|------|---------|--------|
| `docker-compose.test.yml` | Tier-1 e2e (Replay + Reality Gate); sets `BUILD_VIDEO_FILE_FRAME_SOURCE=ON`, `BUILD_TLOG_REPLAY_ADAPTER=ON`, `BUILD_REPLAY_SINK_JSONL=ON` | ✅ working |
| `docker-compose.test.jetson.yml` | Tier-2 e2e on Jetson; same flags ON | ✅ working |
| `e2e/docker/docker-compose.test.yml` | Suite-level e2e harness's internal compose | ✅ owned by the e2e harness |
| `e2e/docker/docker-compose.tier2-bridge.yml` | Tier-2 host-network bridge for direct hardware access | ✅ in tree |

**Run patterns** (suite-mandated per Woodpecker two-workflow contract):

```bash
# Tier-1 e2e (CI 01-test.yml):
docker compose -f docker-compose.test.yml up --build --abort-on-container-exit --exit-code-from e2e-runner

# Tier-2 e2e (manual / Tier-2 lane):
docker compose -f docker-compose.test.jetson.yml up --abort-on-container-exit --exit-code-from e2e-runner
```

The exit code of the `e2e-runner` service is the pipeline result. This contract matches the suite's `detections` e2e variant verbatim.

## Docker Compose — Tier-2 Production (parent-suite, NOT in this submodule)

This submodule does **not** ship a Tier-2 production compose file. The Tier-2 production stack is `../_infra/deploy/jetson/docker-compose.yml` (already shipping). This submodule contributes:

1. The published image at `${REGISTRY_HOST}/azaion/gps-denied-onboard:<branch>-arm` (via `companion-jetson.Dockerfile` + the upcoming `.woodpecker/02-build-push.yml`).
2. The healthcheck endpoint (`python3 -m gps_denied_onboard.healthcheck`).
3. The flight-state gate honour (`/run/azaion/in-flight` bind mount in the suite compose — read by the image at boot).
4. The audit chain — OCI labels + `AZAION_REVISION` env + Watchtower post-update hook emitting `AZAION_UPDATE_EVENT` to journald.

**Cross-cutting suggestion logged but not actioned in cycle-1**: the parent-suite Jetson compose's `gps-denied-onboard` service block is minimal (no volume mounts beyond `model-cache`). Under Option B, it needs the additional mounts listed in the `companion-jetson` Dockerfile table above (`fdr-data`, `tile-data`, `/run/azaion`, FC + camera device passthrough). This is a **parent-suite edit** that the GPS-Denied Onboard team must coordinate with the suite operator — recorded in Next Steps below.

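A minimal sketch of the flight-state check follows. It assumes that the mere presence of `/run/azaion/in-flight` means the aircraft is in flight; this plan names the flag path but does not spell out its semantics, so treat that as an assumption. The function name `restart_allowed` is hypothetical, and the real restart hook is suite-managed.

```python
from pathlib import Path

# Flag path from this plan; "file exists means in flight" is an assumption.
IN_FLIGHT_FLAG = Path("/run/azaion/in-flight")


def restart_allowed(flag: Path = IN_FLIGHT_FLAG) -> bool:
    """Allow a container restart only while the in-flight flag is absent."""
    return not flag.exists()
```

Because the bind mount is read-only inside the container, a check like this can only read the flag; setting and clearing it stays with the host side.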
## Image Tagging Strategy (Suite-Mandated)

| Context | Tag Format | Example |
|---------|-----------|---------|
| Per-PR CI (test only, not pushed) | n/a | n/a |
| Per-branch CI build-push | `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` | `git.azaion.com/azaion/gps-denied-onboard:dev-arm` |
| Release | `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` (suite uses floating branch tags + Watchtower; semver is not used at suite level today) | `git.azaion.com/azaion/gps-denied-onboard:main-arm` |
| Local dev | Image name without registry prefix | `gps-denied-onboard/companion:dev` (current local compose), `gps-denied-onboard/operator-orchestrator:dev`, `gps-denied-onboard/mock-suite-sat-service:dev` |

**No `:latest` tag in CI.** Suite contract is `<branch>-<arch>` only; Watchtower polls these floating tags.

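The `<branch>-<arch>` tag contract can be expressed as a small helper, sketched below. The function name `image_ref` is hypothetical, and rejecting branch names that do not form a valid Docker tag (for example names containing `/`) is an assumption; this plan does not say how the suite handles such branches.

```python
import re

# Docker tag grammar: letters, digits, "_", ".", "-"; must not start with
# "." or "-"; at most 128 characters.
_TAG_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def image_ref(registry_host: str, service: str, branch: str, arch: str) -> str:
    """Compose `<registry>/azaion/<service>:<branch>-<arch>` and validate the tag."""
    tag = f"{branch}-{arch}"
    if not _TAG_RE.fullmatch(tag):
        raise ValueError(f"{tag!r} is not a valid Docker tag")
    return f"{registry_host}/azaion/{service}:{tag}"
```

For instance, `image_ref("git.azaion.com", "gps-denied-onboard", "dev", "arm")` yields the per-branch example shown in the table.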
## .dockerignore (existing — audit + recommended addenda)

The current `.dockerignore` (33 lines, root) covers `.git`, `.venv`, build artefacts, `*.engine` / `*.calib` / `*.index` / `*.faiss` / `*.onnx`, large test fixtures, `_docs/`, and editor noise. **Adequate for cycle-1.** Recommended next-cycle additions (logged here, not applied):

```
# Next-cycle additions to .dockerignore (not applied in cycle-1)
# rules + skills do not belong in any image
.cursor/
# already excluded; keep
_docs/
# don't accidentally ship dev compose into the production image
docker-compose*.yml
# test harness compose + fixtures stay out of production images
e2e/
# test code stays out of production images (currently NOT excluded)
tests/
# README / docs are not needed at runtime
*.md
```

Comments sit on their own lines because `.dockerignore` only treats a line as a comment when it starts with `#`; a trailing `# ...` would become part of the pattern.

Note: `tests/` is currently NOT in `.dockerignore`, which is **intentional for cycle-1** because the e2e-runner images (`tests/e2e/Dockerfile`, `tests/e2e/Dockerfile.jetson`) COPY `tests/` into the image. Splitting the ignore rules per image is a next-cycle refactor: BuildKit picks up a Dockerfile-specific ignore file named `<Dockerfile name>.dockerignore` placed next to the Dockerfile (for example `docker/companion-jetson.Dockerfile.dockerignore`), and that file takes precedence over the root `.dockerignore` for that build.

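A quick audit of which recommended patterns are still missing from a `.dockerignore` could look like the sketch below. It compares literal pattern lines only and does not evaluate glob semantics or `!` negations; the function name `missing_patterns` is hypothetical.

```python
def missing_patterns(dockerignore_text: str, wanted: list[str]) -> list[str]:
    """Return the wanted patterns that do not appear as a literal line in the file."""
    present = {
        line.strip()
        for line in dockerignore_text.splitlines()
        if line.strip() and not line.startswith("#")  # skip blanks and comment lines
    }
    return [pattern for pattern in wanted if pattern not in present]
```

Running it against the current root file with the list above would be expected to report `tests/` among the missing entries, which matches the intentional cycle-1 state.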
## Health Checks — Inventory

| Image | Endpoint / Command | Cadence |
|-------|---------------------|---------|
| `companion-tier1`, `companion-jetson`, `operator-orchestrator` | `python3 -m gps_denied_onboard.healthcheck` (the module already exists per the existing Dockerfiles) | `--interval=10s --timeout=3s --start-period={15,30,10}s --retries=3` |
| `mock-suite-sat-service` | HTTP GET `/healthz` on port 5100 | `--interval=5s --timeout=2s --retries=3` |
| `db` (Postgres 16, suite-managed under Tier-2; root compose for Tier-1) | `pg_isready -U gps_denied -d gps_denied` | `--interval=5s --timeout=3s --retries=10` |

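The HTTP-style probe used for `mock-suite-sat-service` can be sketched with the standard library as below. The URL and the 2-second timeout come from the inventory above; the function name `healthz_ok` and the rule "only HTTP 200 counts as healthy" are assumptions for illustration, not the fixture's actual check.

```python
import urllib.error
import urllib.request


def healthz_ok(url: str = "http://127.0.0.1:5100/healthz", timeout_s: float = 2.0) -> bool:
    """Return True only when the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False
```

A Docker `HEALTHCHECK` wrapper would then exit 0 or 1 depending on the result, for example via `sys.exit(0 if healthz_ok() else 1)`.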
## Self-verification

- [x] Every component is mapped to its image (`companion-tier1` / `companion-jetson` for C1–C8 + C13; `operator-orchestrator` for C10 + C11 + C12; `mock-suite-sat-service` for the e2e fixture)
- [x] Multi-stage builds specified for `companion-tier1` (4 stages, existing) and `companion-jetson` (4 stages, planned)
- [x] Non-root user planned for `companion-jetson` (Tier-2 production); Tier-1 dev / operator-orchestrator stays root for now (next-cycle harden)
- [x] Health checks defined for every service
- [x] `docker-compose.yml` covers all components + dependencies (existing)
- [x] `docker-compose.test.yml` enables black-box testing (existing; Tier-1 + Tier-2 jetson variants)
- [x] `.dockerignore` defined (existing; next-cycle additions logged)
- [x] Tier-2 production delivery shape resolved (Option B; ADR-005 amendment drafted; Step 2 validation gates queued)
- [x] Image tagging strategy aligned with suite-mandated `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` contract

## Next Steps

1. **User confirms this containerization plan** (BLOCKING gate per the deploy skill Step 2).
2. **Author `docker/companion-jetson.Dockerfile`** — implementation task for the next cycle (existing-code Step 9 New Task → Step 10 Implement). Will be one of the first follow-up tickets when autodev's Done step reroutes to the existing-code flow.
3. **Coordinate parent-suite edit** — `../_infra/deploy/jetson/docker-compose.yml` `gps-denied-onboard` service block needs the additional volume mounts (`fdr-data`, `tile-data`, `/run/azaion`, FC + camera device passthrough). This is a cross-submodule change tracked as a follow-up; record in `_docs/_process_leftovers/` if not editable in this cycle.
4. **Proceed to Step 3 (CI/CD pipeline)** — author `.woodpecker/01-test.yml` (Python `pytest` + Tier-1 e2e via existing `docker-compose.test.yml`) + `.woodpecker/02-build-push.yml` (multi-arch matrix, `companion-jetson.Dockerfile` once it lands; until then, ship only `operator-orchestrator` + `companion-tier1` for the test path). Rewrite `_docs/02_document/deployment/ci_cd_pipeline.md` against the actual Woodpecker + Gitea Packages stack per suite `../_infra/ci/README.md`.