Oleksandr Bezdieniezhnykh bf13549b32
2026-05-20 08:05:35 +03:00

# GPS-Denied Onboard — Containerization
> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 2.
> Builds on Step 1 output (`reports/deploy_status_report.md`) and the
> parent-suite CI/CD reality at `../_infra/ci/README.md`. Tier-2 delivery
> shape: **Option B (Docker on Jetson via Watchtower) — autodev-resolved
> 2026-05-19; reversible per Step 1 report**.
## Containerization Stance
| Tier | Production runtime | Image source |
|------|--------------------|--------------|
| Tier-1 (workstation dev + CI + replay) | Docker via `docker-compose.yml` / `docker-compose.test.yml` | This submodule (`docker/companion-tier1.Dockerfile`, `docker/operator-orchestrator.Dockerfile`, `docker/mock-suite-sat-service.Dockerfile`) |
| Tier-2 (Jetson Orin Nano Super production) | Docker via parent-suite `_infra/deploy/jetson/docker-compose.yml` + Watchtower auto-update | This submodule's new `docker/companion-jetson.Dockerfile` (NEW under Option B) pushed to `${REGISTRY_HOST}/azaion/gps-denied-onboard:<branch>-arm` |
| Tier-2 (lab/research IT-12 binary) | Docker (same `companion-jetson.Dockerfile` with research strategy flags ON) or bare JetPack install via tarball | Optional separate image tag `:research-arm`; cycle-1 ships only the deployment binary path |
Three architectural binary tracks (per ADR-002 + ADR-011) collapse onto
**two production Docker images** in this plan:
1. **`gps-denied-onboard` (airborne)** — `docker/companion-jetson.Dockerfile` for Tier-2 production + `docker/companion-tier1.Dockerfile` for Tier-1. Same Python module entrypoint (`python3 -m gps_denied_onboard.runtime_root`); runs both **live mode** and **replay mode** from a single image per ADR-011 — config (`config.mode = live | replay`) selects strategies at startup.
2. **`gps-denied-operator-orchestrator`** — `docker/operator-orchestrator.Dockerfile` for the operator workstation (C10 + C11 + C12).
Test fixtures (`mock-suite-sat-service`, `e2e-runner`) and test infrastructure (Tier-1 + Tier-2 runners) ship as separate non-deployable images. The research binary is a build-flag variant of the airborne image, not a separate Dockerfile.
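The single-image, two-mode arrangement can be pictured as a small lookup performed once at startup. The sketch below is illustrative only: the `RuntimeStrategies` type, the `select_strategies` function, and every strategy name in the table are hypothetical and are not taken from `gps_denied_onboard.runtime_root`; only the `live | replay` mode values come from this plan.

```python
from dataclasses import dataclass

# Hypothetical strategy names for illustration; the real module may differ.
_STRATEGIES = {
    "live": {
        "frame_source": "camera",
        "fc_adapter": "mavlink_serial",
        "sink": "fc_output",
    },
    "replay": {
        "frame_source": "video_file",
        "fc_adapter": "tlog_replay",
        "sink": "replay_jsonl",
    },
}


@dataclass(frozen=True)
class RuntimeStrategies:
    frame_source: str
    fc_adapter: str
    sink: str


def select_strategies(mode: str) -> RuntimeStrategies:
    """Resolve the strategy set for config.mode once, at startup."""
    try:
        return RuntimeStrategies(**_STRATEGIES[mode])
    except KeyError:
        raise ValueError(
            f"config.mode must be 'live' or 'replay', got {mode!r}"
        ) from None
```

Rejecting unknown modes at startup keeps a misconfigured container from silently falling back to one of the two behaviours.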
## ADR-005 Amendment (DRAFT — pending Step 12 / Update Docs sync)
> Draft language for the architecture follow-up flagged in Step 1's
> Cross-Cutting Decision. Lands in `architecture.md` ADR-005 (amendment)
> or a new ADR-012 when Step 12 (Test-Spec Sync) / autodev's existing-code
> Step 13 (Update Docs) picks this up. The current `architecture.md`
> ADR-005 paragraph "Tier-2 (Jetson) does NOT use Docker" becomes
> inconsistent with this plan and must be reconciled.
> **Container scope (amended)**: Tier-1 uses Docker (`docker compose` for
> the developer setup). **Tier-2 (Jetson production) ALSO uses Docker**,
> via the parent-suite `_infra/deploy/jetson/docker-compose.yml` +
> Watchtower flow, with `runtime: nvidia` for GPU access and explicit
> volume mounts for the TensorRT INT8 calibration cache
> (`model-cache:/data/models`) and the C13 FDR ring
> (`fdr-data:/var/lib/gps-denied/fdr`). The two technical concerns the
> original ADR-005 cited — INT8 calibration cache stability and
> `jetson-stats` thermal telemetry access — are addressed by (a) the
> calibration cache living in a host-mounted volume that survives
> container restarts and (b) `jetson-stats` accessed via the
> nvidia-container-runtime's standard device passthrough (same pattern
> the parent-suite `detections` service already uses successfully on the
> same hardware). The deployment binary is the Docker image; the JetPack
> 6.2 system image is the **host** OS, not the runtime layer.
### Step 2 Validation Gates (BLOCKING — must pass before Step 3)
If either of these gates fails, **fall back to Option A** (bare-JetPack
systemd unit) and re-write this containerization plan:
| Gate | What it validates | Pass criteria | Owner |
|------|-------------------|---------------|-------|
| **TensorRT INT8 cache durability under Docker** | Build a calibration cache inside the running container; restart the container; verify the cache is reused and inference output is byte-equivalent | SHA-256 of the calibration cache file before and after restart matches; first-frame inference timing post-restart is within 5% of pre-restart timing (cache hit) | C7 owner; runs against the `companion-jetson` image on the actual Tier-2 Jetson |
| **`jetson-stats` thermal telemetry under Docker** | Run `jtop` (jetson-stats CLI) inside the container with `runtime: nvidia`; verify thermal + power + GPU clock readings match `sudo jtop` on the host within 1% | All thermal zones reported; CPU/GPU clock readings present; D-CROSS-LATENCY-1 hybrid trigger threshold readable | C7 / C5 owners; runs against the `companion-jetson` image |
Both gates land as task tickets when Step 16 chains into the next-cycle
existing-code flow (autodev resumes at existing-code Step 9 New Task per
the Done state). They are **deferred to next cycle** and recorded here so
they are not lost; the cycle-1 deploy plan ships Option B with the
validation marked as "validation pending" in `deploy_status_report.md`.
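As a rough illustration of the first gate's pass criteria, the check reduces to a hash comparison plus a relative timing tolerance. This is a minimal sketch, assuming the calibration cache is a single file and that the two first-frame timings are measured elsewhere; the function names `sha256_file` and `cache_gate_passes` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large cache files are not loaded at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_gate_passes(hash_before, hash_after, t_before_s, t_after_s,
                      tolerance=0.05):
    """Pass when the cache is unchanged and timing stays within tolerance.

    t_before_s / t_after_s are first-frame inference timings in seconds,
    measured before and after the container restart.
    """
    if t_before_s <= 0:
        raise ValueError("pre-restart timing must be positive")
    if hash_before != hash_after:
        return False
    return abs(t_after_s - t_before_s) / t_before_s <= tolerance
```

In use, `sha256_file` would be called on the cache inside the `model-cache` volume once before and once after the restart, and both hashes passed to `cache_gate_passes` together with the measured timings.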
## Component-to-Image Mapping
Per ADR-009, components are folders under `src/gps_denied_onboard/components/`. They are not separate processes / containers in this monolithic Python-with-C++-extensions architecture. The mapping below shows which component code paths each image links.
| Image | Components linked | BUILD_* flags (defaults) |
|-------|-------------------|---------------------------|
| `companion-jetson` (Tier-2 prod) + `companion-tier1` (Tier-1 dev) | C1 (`KltRansac` default), C2 (`UltraVPR` default), C2.5, C3 (`DISK+LightGlue`), C3.5, C4, C5 (`GtsamIsam2`), C6, C7 (`tensorrt` on Tier-2, `pytorch_fp16` on Tier-1), C8 (per `GPS_DENIED_FC_PROFILE`), C13 + replay strategies (`BUILD_VIDEO_FILE_FRAME_SOURCE=ON`, `BUILD_TLOG_REPLAY_ADAPTER=ON`, `BUILD_REPLAY_SINK_JSONL=ON`) | `BUILD_VINS_MONO=OFF`, `BUILD_SALAD=OFF`, `BUILD_C11_TILE_MANAGER=OFF` (ADR-004 enforcement), `BUILD_DEV_STATIC_KEY=OFF`, `BUILD_STATE_ESKF=OFF` |
| `operator-orchestrator` (operator workstation) | C10, C11 (`TileDownloader` + `TileUploader`), C12 | `BUILD_C11_TILE_MANAGER=ON` |
| `mock-suite-sat-service` (test fixture) | NONE (FastAPI stub of the parent-suite `satellite-provider` D-PROJ-2 contract) | — |
| `e2e-runner` Tier-1 (`tests/e2e/Dockerfile`) | Full SUT (editable install) + pytest entrypoint | Test profile defaults |
| `e2e-runner` Tier-2 (`tests/e2e/Dockerfile.jetson`) | Full SUT (editable install) + pytest entrypoint; `dustynv/l4t-pytorch:r36.4.0` base | Test profile defaults |
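The flag defaults in the table can be turned into CMake `-D` arguments mechanically, which also gives a convenient place to assert the ADR-004 rule that the airborne image never links the C11 tile manager. The sketch below assumes the flags are plain ON/OFF booleans; `COMPANION_FLAGS`, `cmake_args`, and `check_airborne_flags` are hypothetical names, not part of the existing build scripts.

```python
# Defaults for the airborne image, copied from the mapping table above.
COMPANION_FLAGS = {
    "BUILD_VINS_MONO": False,
    "BUILD_SALAD": False,
    "BUILD_C11_TILE_MANAGER": False,
    "BUILD_DEV_STATIC_KEY": False,
    "BUILD_STATE_ESKF": False,
    "BUILD_VIDEO_FILE_FRAME_SOURCE": True,
    "BUILD_TLOG_REPLAY_ADAPTER": True,
    "BUILD_REPLAY_SINK_JSONL": True,
}


def cmake_args(flags):
    """Render boolean build flags as sorted -DNAME=ON|OFF arguments."""
    return [
        f"-D{name}={'ON' if enabled else 'OFF'}"
        for name, enabled in sorted(flags.items())
    ]


def check_airborne_flags(flags):
    """Fail fast if the airborne build would link the C11 tile manager."""
    if flags.get("BUILD_C11_TILE_MANAGER", False):
        raise ValueError(
            "BUILD_C11_TILE_MANAGER must be OFF in the airborne image (ADR-004)"
        )
```

The operator-orchestrator build would use its own flag set with `BUILD_C11_TILE_MANAGER` enabled and would skip the airborne check.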
## Per-Image Dockerfile Specifications
### `companion-jetson` — **NEW under Option B**
| Property | Value |
|----------|-------|
| File | `docker/companion-jetson.Dockerfile` (new in next cycle's Step 7 — Implementation; this plan specifies the contents) |
| Base image | `dustynv/l4t-pytorch:r36.4.0` (digest-pinned per suite follow-up #1) — same base proven by `tests/e2e/Dockerfile.jetson` |
| Stages | (1) system-deps (apt: `build-essential`, `cmake`, `libpq-dev`, `libspatialindex-dev`, `libgl1`, `libglib2.0-0`) → (2) python-deps (`pip install -e ".[inference]"` with the Tegra-tuned torch preserved per the existing Tier-2 e2e Dockerfile rationale) → (3) cpp-build (CMake build of the native VIO / matcher extensions with `BUILD_VINS_MONO=OFF`, `BUILD_C11_TILE_MANAGER=OFF`) → (4) runtime (slim image carrying the venv + native libs + SUT source) |
| User | `gps-denied` non-root uid 10001 (companion does not need root inside the container; volume mounts owned by the same uid on the host) |
| Build args | `CI_COMMIT_SHA` (suite-mandated; stamped as OCI labels + `ENV AZAION_REVISION`); `BRANCH` (carried into image labels) |
| OCI labels | `org.opencontainers.image.revision=$CI_COMMIT_SHA`, `org.opencontainers.image.created=<UTC RFC 3339>`, `org.opencontainers.image.source=$CI_REPO_URL` (suite-mandated per `../_infra/ci/README.md` → "OCI image labels and commit provenance (AZ-204)") |
| ENV | `AZAION_SERVICE=gps-denied-onboard`, `AZAION_REVISION=$CI_COMMIT_SHA`, `PYTHONPATH=/opt/gps-denied/src`, `PATH=/opt/venv/bin:$PATH` |
| Health check | `python3 -m gps_denied_onboard.healthcheck`; `--interval=10s --timeout=3s --start-period=30s --retries=3` (longer start-period than Tier-1 because TensorRT engine deserialization takes seconds on Jetson) |
| Exposed ports | `8080` (HTTP healthz + future replay-mode JSONL stream socket; mapped to host `5040:8080` per parent-suite compose). MAVLink + camera I/O is **not** TCP — it is host-bound (`/dev/ttyUSB*`, `/dev/video*`) via device passthrough. |
| Volume mounts (declared in parent-suite compose) | `model-cache:/data/models` (TensorRT engines + calibration cache + descriptor index); `fdr-data:/var/lib/gps-denied/fdr` (C13 ring, ≥ 64 GB); `tile-data:/var/lib/gps-denied/tiles` (C6 filesystem store, ≥ 10 GB); `/run/azaion:/run/azaion` (flight-state flag, read-only); device passthrough for `/dev/ttyUSB*` (FC UART) + `/dev/video*` (nav camera) |
| Watchtower labels | `com.centurylinklabs.watchtower.enable=true` + post-update hook emitting `AZAION_UPDATE_EVENT` per suite `x-update-logger` template |
| ENTRYPOINT | `python3 -m gps_denied_onboard.runtime_root` (same as Tier-1) |
| Flight-state gate | Honoured via `/run/azaion/in-flight` bind mount — Watchtower restart hook MUST check the flag before restarting (suite-managed; the image itself only honors the flag when transitioning between strategies at boot — there is no in-process restart logic) |
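The flight-state check that a restart hook has to perform can be very small. The sketch below assumes that the mere presence of `/run/azaion/in-flight` means the aircraft is airborne; this plan does not define the flag's format, so that semantics and the `restart_allowed` name are assumptions.

```python
from pathlib import Path

IN_FLIGHT_FLAG = Path("/run/azaion/in-flight")


def restart_allowed(flag_path=IN_FLIGHT_FLAG):
    """Return True only when the in-flight flag is absent.

    Assumption: the flag file exists while the aircraft is in flight and is
    removed on landing. The suite-managed hook owns the real contract.
    """
    return not Path(flag_path).exists()
```

Because the mount is read-only inside the container, code like this can only observe the flag; setting and clearing it stays with the host.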
### `companion-tier1` (existing — `docker/companion-tier1.Dockerfile`)
| Property | Value |
|----------|-------|
| Base image | `ubuntu:22.04` (system-deps stage) → `ubuntu:22.04` (runtime) |
| Stages | 4 (`system-deps` → `python-deps` → `cpp-build` → `runtime`), already documented in the file header |
| User | Currently root (acceptable for Tier-1 dev / CI containers — Tier-2 production hardens this in `companion-jetson`) |
| Health check | `python3 -m gps_denied_onboard.healthcheck`; `--interval=10s --timeout=3s --start-period=15s --retries=3` |
| Exposed ports | None (Tier-1 healthcheck is in-process; CI exposes nothing) |
| Notes | **No change required for cycle-1.** Next cycle: add `BRANCH` + `CI_COMMIT_SHA` build args + OCI labels for parity with `companion-jetson`. |
### `operator-orchestrator` (existing — `docker/operator-orchestrator.Dockerfile`)
| Property | Value |
|----------|-------|
| Base image | `python:3.10-slim` |
| Stages | 1 (`runtime`) — single-stage is acceptable here because the operator-orchestrator has no native C++ extensions and the slim base keeps it lean |
| User | Currently root — same Tier-1 caveat as `companion-tier1` |
| Health check | `python3 -m gps_denied_onboard.healthcheck`; `--interval=10s --timeout=3s --start-period=10s --retries=3` |
| Exposed ports | TBD (next cycle adds the C12 CLI's HTTP control surface for the operator UI; today the CLI runs as a one-shot invocation) |
| Notes | **No change required for cycle-1.** |
### `mock-suite-sat-service` (existing — `docker/mock-suite-sat-service.Dockerfile`)
| Property | Value |
|----------|-------|
| Base image | `python:3.10-slim` |
| User | Currently root — acceptable, this is an e2e test fixture only |
| Health check | `urllib.request.urlopen('http://127.0.0.1:5100/healthz')`; `--interval=5s --timeout=2s --retries=3` |
| Exposed ports | `5100` (HTTP) |
| Notes | **Not a production image.** Retired when parent-suite D-PROJ-2 ships the real ingest endpoint. |
### `e2e-runner` Tier-1 (existing — `tests/e2e/Dockerfile`)
Test runner for the Reality Gate on Colima / Tier-1 workstation Docker. Not a production image. ENTRYPOINT: `pytest -q /opt/tests/e2e/`. **No change for cycle-1.**
### `e2e-runner` Tier-2 (existing — `tests/e2e/Dockerfile.jetson`)
Test runner for the Reality Gate on the Jetson. `dustynv/l4t-pytorch:r36.4.0` base. The new `companion-jetson` production image inherits its base image choice and Tegra-pip rationale from this file. **No change for cycle-1.**
## Docker Compose — Local Development (existing `docker-compose.yml`)
The existing root `docker-compose.yml` already covers Tier-1 dev: `companion` + `operator-orchestrator` + `mock-sat` + `db` (Postgres 16), with healthchecks, named volumes (`db-data`, `fdr-data`, `tile-data`), and a `tests/fixtures:/fixtures:ro` bind mount for the dev calibration JSON + signing key.
**No structural change required.** Optional cycle-2 polish:
- Add a top-level `networks:` entry named `gps-denied-dev` (the stack currently relies on Docker Compose's default network) so the suite-level e2e harness can join it explicitly when needed.
- Reference `${BRANCH:-main}` for image tags so the dev compose can pull from the suite registry instead of always building.
## Docker Compose — Blackbox Tests (existing)
| File | Purpose | Status |
|------|---------|--------|
| `docker-compose.test.yml` | Tier-1 e2e (Replay + Reality Gate); sets `BUILD_VIDEO_FILE_FRAME_SOURCE=ON`, `BUILD_TLOG_REPLAY_ADAPTER=ON`, `BUILD_REPLAY_SINK_JSONL=ON` | ✅ working |
| `docker-compose.test.jetson.yml` | Tier-2 e2e on Jetson; same flags ON | ✅ working |
| `e2e/docker/docker-compose.test.yml` | Suite-level e2e harness's internal compose | ✅ owned by the e2e harness |
| `e2e/docker/docker-compose.tier2-bridge.yml` | Tier-2 host-network bridge for direct hardware access | ✅ in tree |
**Run patterns** (suite-mandated per Woodpecker two-workflow contract):
```bash
# Tier-1 e2e (CI 01-test.yml):
docker compose -f docker-compose.test.yml up --build --abort-on-container-exit --exit-code-from e2e-runner
# Tier-2 e2e (manual / Tier-2 lane):
docker compose -f docker-compose.test.jetson.yml up --abort-on-container-exit --exit-code-from e2e-runner
```
The exit code of the `e2e-runner` service is the pipeline result. This contract matches the suite's `detections` e2e variant verbatim.
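For callers that drive these runs from Python rather than from a shell step, the same contract can be sketched as a thin wrapper that builds the command and returns the runner's exit code unchanged. The helper names `e2e_command` and `run_e2e` are hypothetical; only the `docker compose` flags shown in the run patterns above are used.

```python
import subprocess


def e2e_command(compose_file, build=False, service="e2e-runner"):
    """Build the docker compose argv for an e2e run."""
    cmd = ["docker", "compose", "-f", compose_file, "up"]
    if build:
        cmd.append("--build")
    cmd += ["--abort-on-container-exit", "--exit-code-from", service]
    return cmd


def run_e2e(compose_file, build=False, run=subprocess.run):
    """Run the stack and return the e2e-runner exit code as the result.

    `run` is injectable so the wrapper can be exercised without Docker.
    """
    completed = run(e2e_command(compose_file, build=build), check=False)
    return completed.returncode
```

Passing `check=False` is deliberate: a non-zero exit code is the test result to report, not an exception to raise.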
## Docker Compose — Tier-2 Production (parent-suite, NOT in this submodule)
This submodule does **not** ship a Tier-2 production compose file. The Tier-2 production stack is `../_infra/deploy/jetson/docker-compose.yml` (already shipping). This submodule contributes:
1. The published image at `${REGISTRY_HOST}/azaion/gps-denied-onboard:<branch>-arm` (via `companion-jetson.Dockerfile` + the upcoming `.woodpecker/02-build-push.yml`).
2. The healthcheck endpoint (`python3 -m gps_denied_onboard.healthcheck`).
3. Flight-state gate support (the `/run/azaion/in-flight` bind mount in the suite compose, read by the image at boot).
4. The audit chain — OCI labels + `AZAION_REVISION` env + Watchtower post-update hook emitting `AZAION_UPDATE_EVENT` to journald.
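The provenance part of the audit chain comes down to three OCI labels. One way to stamp them, sketched here, is to pass `--label` arguments to `docker build`; the plan itself stamps them from build args inside the Dockerfile, so treat this as an equivalent illustration. The `oci_label_args` helper is a hypothetical name.

```python
from datetime import datetime, timezone


def oci_label_args(commit_sha, repo_url, now=None):
    """Return `--label key=value` pairs for the three suite-mandated labels.

    `now` must be timezone-aware when supplied; it defaults to current UTC.
    The created timestamp is rendered as RFC 3339 in UTC.
    """
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    labels = {
        "org.opencontainers.image.revision": commit_sha,
        "org.opencontainers.image.created": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "org.opencontainers.image.source": repo_url,
    }
    args = []
    for key, value in labels.items():
        args += ["--label", f"{key}={value}"]
    return args
```

In CI the two inputs would come from the `CI_COMMIT_SHA` and `CI_REPO_URL` environment variables named in the `companion-jetson` table.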
**Cross-cutting suggestion logged but not actioned in cycle-1**: the parent-suite Jetson compose's `gps-denied-onboard` service block is minimal (no volume mounts beyond `model-cache`). Under Option B, it needs the additional mounts listed in the `companion-jetson` Dockerfile table above (`fdr-data`, `tile-data`, `/run/azaion`, FC + camera device passthrough). This is a **parent-suite edit** that the GPS-Denied Onboard team must coordinate with the suite operator — recorded in Next Steps below.
## Image Tagging Strategy (Suite-Mandated)
| Context | Tag Format | Example |
|---------|-----------|---------|
| Per-PR CI (test only, not pushed) | n/a | n/a |
| Per-branch CI build-push | `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` | `git.azaion.com/azaion/gps-denied-onboard:dev-arm` |
| Release | `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` (suite uses floating branch tags + Watchtower; semver is not used at suite level today) | `git.azaion.com/azaion/gps-denied-onboard:main-arm` |
| Local dev | Image name without registry prefix | `gps-denied-onboard/companion:dev` (current local compose), `gps-denied-onboard/operator-orchestrator:dev`, `gps-denied-onboard/mock-suite-sat-service:dev` |
**No `:latest` tag in CI.** Suite contract is `<branch>-<arch>` only; Watchtower polls these floating tags.
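The tag contract is simple enough to express as a formatter with one guard. The sketch below assumes branch names are used verbatim; how the suite handles branch names containing characters that are invalid in a Docker tag (such as `/`) is not stated here, so this version simply rejects them. `image_ref` is a hypothetical helper name.

```python
import re

# Docker tag grammar: up to 128 chars, no leading "." or "-".
_TAG_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def image_ref(registry_host, service, branch, arch):
    """Format <registry>/azaion/<service>:<branch>-<arch>."""
    tag = f"{branch}-{arch}"
    if not _TAG_RE.fullmatch(tag):
        raise ValueError(f"{tag!r} is not a valid Docker tag")
    return f"{registry_host}/azaion/{service}:{tag}"
```

For example, `image_ref("git.azaion.com", "gps-denied-onboard", "dev", "arm")` yields the per-branch example from the table.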
## .dockerignore (existing — audit + recommended addenda)
The current `.dockerignore` (33 lines, root) covers `.git`, `.venv`, build artefacts, `*.engine` / `*.calib` / `*.index` / `*.faiss` / `*.onnx`, large test fixtures, `_docs/`, and editor noise. **Adequate for cycle-1.** Recommended next-cycle additions (logged here, not applied):
```
# Next-cycle additions to .dockerignore (not applied in cycle-1).
# Comments sit on their own lines: a trailing "# ..." would become part of the pattern.
# rules + skills do not belong in any image
.cursor/
# already excluded, keep
_docs/
# don't accidentally ship dev compose into the production image
docker-compose*.yml
# test harness compose + fixtures stay out of production images
e2e/
# test code stays out of production images (currently NOT excluded)
tests/
# README / docs are not needed at runtime
*.md
```
Note: `tests/` is currently NOT in `.dockerignore`, which is **intentional for cycle-1**: the e2e-runner images (`tests/e2e/Dockerfile`, `tests/e2e/Dockerfile.jetson`) COPY `tests/` into the image. Splitting the ignore rules per image is a next-cycle refactor; BuildKit supports this through a Dockerfile-specific ignore file named `<Dockerfile-name>.dockerignore` placed next to the Dockerfile (for example `docker/companion-jetson.Dockerfile.dockerignore`).
## Health Checks — Inventory
| Image | Endpoint / Command | Cadence |
|-------|---------------------|---------|
| `companion-tier1`, `companion-jetson`, `operator-orchestrator` | `python3 -m gps_denied_onboard.healthcheck` (the module already exists per the existing Dockerfiles) | `--interval=10s --timeout=3s --start-period={15,30,10}s --retries=3` |
| `mock-suite-sat-service` | HTTP GET `/healthz` on port 5100 | `--interval=5s --timeout=2s --retries=3` |
| `db` (Postgres 16, suite-managed under Tier-2; root compose for Tier-1) | `pg_isready -U gps_denied -d gps_denied` | `--interval=5s --timeout=3s --retries=10` |
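The HTTP probe used by `mock-suite-sat-service` can be sketched as a function that maps the response to a process exit code, which is what Docker's `HEALTHCHECK` evaluates. This is an illustrative version, not the contents of the existing Dockerfile; the `probe` name and its defaults are assumptions modelled on the table row above.

```python
import urllib.error
import urllib.request


def probe(url="http://127.0.0.1:5100/healthz", timeout_s=2.0):
    """Return 0 when the endpoint answers HTTP 200, otherwise 1.

    urlopen raises HTTPError for 4xx/5xx responses and URLError/OSError for
    connection failures and timeouts; all of these count as unhealthy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 0 if response.status == 200 else 1
    except (urllib.error.URLError, OSError):
        return 1
```

A health check command would call `sys.exit(probe())` so that the container is marked unhealthy after the configured number of failed retries.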
## Self-verification
- [x] Every component is mapped to its image (`companion-tier1` / `companion-jetson` for C1–C8 + C13; `operator-orchestrator` for C10 + C11 + C12; `mock-suite-sat-service` for the e2e fixture)
- [x] Multi-stage builds specified for `companion-tier1` (4 stages, existing) and `companion-jetson` (4 stages, planned)
- [x] Non-root user planned for `companion-jetson` (Tier-2 production); Tier-1 dev / operator-orchestrator stays root for now (next-cycle harden)
- [x] Health checks defined for every service
- [x] `docker-compose.yml` covers all components + dependencies (existing)
- [x] `docker-compose.test.yml` enables black-box testing (existing; Tier-1 + Tier-2 jetson variants)
- [x] `.dockerignore` defined (existing; next-cycle additions logged)
- [x] Tier-2 production delivery shape resolved (Option B; ADR-005 amendment drafted; Step 2 validation gates queued)
- [x] Image tagging strategy aligned with suite-mandated `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` contract
## Next Steps
1. **User confirms this containerization plan** (BLOCKING gate per the deploy skill Step 2).
2. **Author `docker/companion-jetson.Dockerfile`** — implementation task for the next cycle (existing-code Step 9 New Task → Step 10 Implement). Will be one of the first follow-up tickets when autodev's Done step reroutes to the existing-code flow.
3. **Coordinate parent-suite edit**: the `gps-denied-onboard` service block in `../_infra/deploy/jetson/docker-compose.yml` needs the additional volume mounts (`fdr-data`, `tile-data`, `/run/azaion`, FC + camera device passthrough). This is a cross-submodule change tracked as a follow-up; record in `_docs/_process_leftovers/` if not editable in this cycle.
4. **Proceed to Step 3 (CI/CD pipeline)** — author `.woodpecker/01-test.yml` (Python `pytest` + Tier-1 e2e via existing `docker-compose.test.yml`) + `.woodpecker/02-build-push.yml` (multi-arch matrix, `companion-jetson.Dockerfile` once it lands; until then, ship only `operator-orchestrator` + `companion-tier1` for the test path). Rewrite `_docs/02_document/deployment/ci_cd_pipeline.md` against the actual Woodpecker + Gitea Packages stack per suite `../_infra/ci/README.md`.