GPS-Denied Onboard — Containerization
Generated by /autodev greenfield Step 16 (Deploy), Step 2. Builds on Step 1 output (reports/deploy_status_report.md) and the parent-suite CI/CD reality at ../_infra/ci/README.md. Tier-2 delivery shape: Option B (Docker on Jetson via Watchtower), autodev-resolved 2026-05-19; reversible per Step 1 report.
Containerization Stance
| Tier | Production runtime | Image source |
|---|---|---|
| Tier-1 (workstation dev + CI + replay) | Docker via docker-compose.yml / docker-compose.test.yml | This submodule (docker/companion-tier1.Dockerfile, docker/operator-orchestrator.Dockerfile, docker/mock-suite-sat-service.Dockerfile) |
| Tier-2 (Jetson Orin Nano Super production) | Docker via parent-suite _infra/deploy/jetson/docker-compose.yml + Watchtower auto-update | This submodule's new docker/companion-jetson.Dockerfile (NEW under Option B) pushed to ${REGISTRY_HOST}/azaion/gps-denied-onboard:<branch>-arm |
| Tier-2 (lab/research IT-12 binary) | Docker (same companion-jetson.Dockerfile with research strategy flags ON) or bare JetPack install via tarball | Optional separate image tag :research-arm; cycle-1 ships only the deployment binary path |
Three architectural binary tracks (per ADR-002 + ADR-011) collapse onto two production Docker images in this plan:
- gps-denied-onboard (airborne): docker/companion-jetson.Dockerfile for Tier-2 production + docker/companion-tier1.Dockerfile for Tier-1. Same Python module entrypoint (python3 -m gps_denied_onboard.runtime_root); runs both live mode and replay mode from a single image per ADR-011, with config (config.mode = live | replay) selecting strategies at startup.
- gps-denied-operator-orchestrator: docker/operator-orchestrator.Dockerfile for the operator workstation (C10 + C11 + C12).
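To illustrate the single-image, two-mode design, the sketch below shows one way a config.mode value could pick a strategy set at startup. The StrategySet type, the STRATEGIES table, and every strategy name in it are hypothetical placeholders; only the two mode values (live, replay) come from this plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrategySet:
    """Hypothetical bundle of the strategies chosen at startup."""
    frame_source: str
    fc_adapter: str
    sink: str


# Hypothetical strategy names, for illustration only.
STRATEGIES = {
    "live": StrategySet("nav_camera", "mavlink_uart", "fc_output"),
    "replay": StrategySet("video_file", "tlog_replay", "jsonl"),
}


def select_strategies(mode: str) -> StrategySet:
    """Return the strategy set for config.mode; reject unknown modes early."""
    try:
        return STRATEGIES[mode]
    except KeyError:
        raise ValueError(
            f"config.mode must be one of {sorted(STRATEGIES)}, got {mode!r}"
        ) from None
```

Failing fast on an unknown mode keeps a misconfigured container from starting with a half-selected strategy set.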
Test fixtures (mock-suite-sat-service, e2e-runner) and test infrastructure (Tier-1 + Tier-2 runners) ship as separate non-deployable images. The research binary is a build-flag variant of the airborne image, not a separate Dockerfile.
ADR-005 Amendment (DRAFT — pending Step 12 / Update Docs sync)
Draft language for the architecture follow-up flagged in Step 1's Cross-Cutting Decision. Lands in architecture.md ADR-005 (amendment) or a new ADR-012 when Step 12 (Test-Spec Sync) / autodev's existing-code Step 13 (Update Docs) picks this up. The current architecture.md ADR-005 paragraph "Tier-2 (Jetson) does NOT use Docker" becomes inconsistent with this plan and must be reconciled.
Container scope (amended): Tier-1 uses Docker (docker compose for the developer setup). Tier-2 (Jetson production) ALSO uses Docker, via the parent-suite _infra/deploy/jetson/docker-compose.yml + Watchtower flow, with runtime: nvidia for GPU access and explicit volume mounts for the TensorRT INT8 calibration cache (model-cache:/data/models) and the C13 FDR ring (fdr-data:/var/lib/gps-denied/fdr). The two technical concerns the original ADR-005 cited (INT8 calibration cache stability and jetson-stats thermal telemetry access) are addressed by (a) the calibration cache living in a host-mounted volume that survives container restarts and (b) jetson-stats accessed via the nvidia-container-runtime's standard device passthrough (same pattern the parent-suite detections service already uses successfully on the same hardware). The deployment binary is the Docker image; the JetPack 6.2 system image is the host OS, not the runtime layer.
Step 2 Validation Gates (BLOCKING — must pass before Step 3)
If either of these gates fails, fall back to Option A (bare-JetPack systemd unit) and re-write this containerization plan:
| Gate | What it validates | Pass criteria | Owner |
|---|---|---|---|
| TensorRT INT8 cache durability under Docker | Build a calibration cache inside the running container; restart the container; verify the cache is reused and inference output is byte-equivalent | SHA-256 of the calibration cache file before and after restart matches; first-frame inference timing post-restart is within 5% of pre-restart timing (cache hit) | C7 owner; runs against the companion-jetson image on the actual Tier-2 Jetson |
| jetson-stats thermal telemetry under Docker | Run jtop (jetson-stats CLI) inside the container with runtime: nvidia; verify thermal + power + GPU clock readings match sudo jtop on the host within 1% | All thermal zones reported; CPU/GPU clock readings present; D-CROSS-LATENCY-1 hybrid trigger threshold readable | C7 / C5 owners; runs against the companion-jetson image |
Both gates land as task tickets when Step 16 chains into the next-cycle
existing-code flow (autodev resumes at existing-code Step 9 New Task per
the Done state). They are deferred to next cycle and recorded here so
they are not lost; the cycle-1 deploy plan ships Option B with the
validation marked as "validation pending" in deploy_status_report.md.
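The pass criteria of the two gates can be expressed as small checks. The sketch below is one possible shape for them, not the project's validation harness: the function names and the flat dictionary of telemetry readings are assumptions, while the 5% timing tolerance and the 1% telemetry tolerance are taken from the table above.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def within_tolerance(reference: float, measured: float, rel_tol: float) -> bool:
    """True when measured deviates from reference by at most rel_tol (relative)."""
    if reference == 0:
        return measured == 0
    return abs(measured - reference) / abs(reference) <= rel_tol


def cache_gate_passes(digest_before: str, digest_after: str,
                      t_before_ms: float, t_after_ms: float) -> bool:
    """Gate 1: identical cache digest and first-frame timing within 5%."""
    return digest_before == digest_after and within_tolerance(
        t_before_ms, t_after_ms, 0.05
    )


def telemetry_gate_passes(host: dict, container: dict) -> bool:
    """Gate 2: every host reading is present in the container and within 1%."""
    return all(
        key in container and within_tolerance(value, container[key], 0.01)
        for key, value in host.items()
    )
```

Hashing before and after the restart with sha256_of and feeding both digests to cache_gate_passes covers the first gate; the second gate takes two snapshots of readings, one from the host and one from inside the container.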
Component-to-Image Mapping
Per ADR-009, components are folders under src/gps_denied_onboard/components/. They are not separate processes / containers in this monolithic Python-with-C++-extensions architecture. The mapping below shows which component code paths each image links.
| Image | Components linked | BUILD_* flags (defaults) |
|---|---|---|
| companion-jetson (Tier-2 prod) + companion-tier1 (Tier-1 dev) | C1 (KltRansac default), C2 (UltraVPR default), C2.5, C3 (DISK+LightGlue), C3.5, C4, C5 (GtsamIsam2), C6, C7 (tensorrt on Tier-2, pytorch_fp16 on Tier-1), C8 (per GPS_DENIED_FC_PROFILE), C13 + replay strategies (BUILD_VIDEO_FILE_FRAME_SOURCE=ON, BUILD_TLOG_REPLAY_ADAPTER=ON, BUILD_REPLAY_SINK_JSONL=ON) | BUILD_VINS_MONO=OFF, BUILD_SALAD=OFF, BUILD_C11_TILE_MANAGER=OFF (ADR-004 enforcement), BUILD_DEV_STATIC_KEY=OFF, BUILD_STATE_ESKF=OFF |
| operator-orchestrator (operator workstation) | C10, C11 (TileDownloader + TileUploader), C12 | BUILD_C11_TILE_MANAGER=ON |
| mock-suite-sat-service (test fixture) | NONE (FastAPI stub of the parent-suite satellite-provider D-PROJ-2 contract) | n/a |
| e2e-runner Tier-1 (tests/e2e/Dockerfile) | Full SUT (editable install) + pytest entrypoint | Test profile defaults |
| e2e-runner Tier-2 (tests/e2e/Dockerfile.jetson) | Full SUT (editable install) + pytest entrypoint; dustynv/l4t-pytorch:r36.4.0 base | Test profile defaults |
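The flag defaults in the mapping table lend themselves to a mechanical check. The following sketch turns the companion defaults into CMake -D arguments and rejects a companion (airborne) flag set that enables the tile manager, which is how this plan reads the ADR-004 enforcement note. The helper names and the idea of checking this in a script are assumptions; the flag names and values are the ones listed above.

```python
# Flag defaults for the companion images, copied from the mapping table.
COMPANION_DEFAULTS = {
    "BUILD_VINS_MONO": "OFF",
    "BUILD_SALAD": "OFF",
    "BUILD_C11_TILE_MANAGER": "OFF",
    "BUILD_DEV_STATIC_KEY": "OFF",
    "BUILD_STATE_ESKF": "OFF",
    "BUILD_VIDEO_FILE_FRAME_SOURCE": "ON",
    "BUILD_TLOG_REPLAY_ADAPTER": "ON",
    "BUILD_REPLAY_SINK_JSONL": "ON",
}


def cmake_args(flags: dict) -> list:
    """Render a flag mapping as sorted CMake cache definitions."""
    return [f"-D{name}={value}" for name, value in sorted(flags.items())]


def check_companion_flags(flags: dict) -> None:
    """Raise if a companion (airborne) flag set links the C11 tile manager."""
    if flags.get("BUILD_C11_TILE_MANAGER", "OFF") != "OFF":
        raise ValueError(
            "companion images must keep BUILD_C11_TILE_MANAGER=OFF (ADR-004)"
        )
```

Running check_companion_flags before invoking CMake would surface an accidental override at build time rather than in a shipped image.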
Per-Image Dockerfile Specifications
companion-jetson — NEW under Option B
| Property | Value |
|---|---|
| File | docker/companion-jetson.Dockerfile (new in next cycle's Step 7 — Implementation; this plan specifies the contents) |
| Base image | dustynv/l4t-pytorch:r36.4.0 (digest-pinned per suite follow-up #1) — same base proven by tests/e2e/Dockerfile.jetson |
| Stages | (1) system-deps (apt: build-essential, cmake, libpq-dev, libspatialindex-dev, libgl1, libglib2.0-0) → (2) python-deps (pip install -e ".[inference]" with the Tegra-tuned torch preserved per the existing Tier-2 e2e Dockerfile rationale) → (3) cpp-build (CMake build of the native VIO / matcher extensions with BUILD_VINS_MONO=OFF, BUILD_C11_TILE_MANAGER=OFF) → (4) runtime (slim image carrying the venv + native libs + SUT source) |
| User | gps-denied non-root uid 10001 (companion does not need root inside the container; volume mounts owned by the same uid on the host) |
| Build args | CI_COMMIT_SHA (suite-mandated; stamped as OCI labels + ENV AZAION_REVISION); BRANCH (carried into image labels) |
| OCI labels | org.opencontainers.image.revision=$CI_COMMIT_SHA, org.opencontainers.image.created=<UTC RFC 3339>, org.opencontainers.image.source=$CI_REPO_URL (suite-mandated per ../_infra/ci/README.md → "OCI image labels and commit provenance (AZ-204)") |
| ENV | AZAION_SERVICE=gps-denied-onboard, AZAION_REVISION=$CI_COMMIT_SHA, PYTHONPATH=/opt/gps-denied/src, PATH=/opt/venv/bin:$PATH |
| Health check | python3 -m gps_denied_onboard.healthcheck with --interval=10s --timeout=3s --start-period=30s --retries=3 (longer start-period than Tier-1 because TensorRT engine deserialization takes several seconds on Jetson) |
| Exposed ports | 8080 (HTTP healthz + future replay-mode JSONL stream socket; mapped to host 5040:8080 per parent-suite compose). MAVLink + camera I/O is not TCP — it is host-bound (/dev/ttyUSB*, /dev/video*) via device passthrough. |
| Volume mounts (declared in parent-suite compose) | model-cache:/data/models (TensorRT engines + calibration cache + descriptor index); fdr-data:/var/lib/gps-denied/fdr (C13 ring, ≥ 64 GB); tile-data:/var/lib/gps-denied/tiles (C6 filesystem store, ≥ 10 GB); /run/azaion:/run/azaion (flight-state flag, read-only); device passthrough for /dev/ttyUSB* (FC UART) + /dev/video* (nav camera) |
| Watchtower labels | com.centurylinklabs.watchtower.enable=true + post-update hook emitting AZAION_UPDATE_EVENT per suite x-update-logger template |
| ENTRYPOINT | python3 -m gps_denied_onboard.runtime_root (same as Tier-1) |
| Flight-state gate | Honoured via the /run/azaion/in-flight bind mount. The Watchtower restart hook MUST check the flag before restarting (suite-managed); the image itself only honours the flag when transitioning between strategies at boot, and there is no in-process restart logic |
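As an illustration of the suite-mandated OCI labels, the sketch below builds the three label values from a commit SHA and repository URL and renders them as docker build --label arguments. This plan stamps the labels inside the Dockerfile from build args; passing them on the command line is only an alternative shown here for clarity, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def oci_labels(commit_sha: str, repo_url: str, now: datetime = None) -> dict:
    """Return the three OCI labels, with 'created' as a UTC RFC 3339 timestamp."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "org.opencontainers.image.revision": commit_sha,
        "org.opencontainers.image.created": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "org.opencontainers.image.source": repo_url,
    }


def label_args(labels: dict) -> list:
    """Render labels as repeated '--label key=value' arguments for docker build."""
    args = []
    for key, value in labels.items():
        args += ["--label", f"{key}={value}"]
    return args
```

In CI the two inputs would come from the CI_COMMIT_SHA and CI_REPO_URL variables named in the table.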
companion-tier1 (existing — docker/companion-tier1.Dockerfile)
| Property | Value |
|---|---|
| Base image | ubuntu:22.04 (system-deps stage) → ubuntu:22.04 (runtime) |
| Stages | 4 (system-deps → python-deps → cpp-build → runtime) — already documented in the file header |
| User | Currently root (acceptable for Tier-1 dev / CI containers — Tier-2 production hardens this in companion-jetson) |
| Health check | python3 -m gps_denied_onboard.healthcheck — --interval=10s --timeout=3s --start-period=15s --retries=3 |
| Exposed ports | None (Tier-1 healthcheck is in-process; CI exposes nothing) |
| Notes | No change required for cycle-1. Next cycle: add BRANCH + CI_COMMIT_SHA build args + OCI labels for parity with companion-jetson. |
operator-orchestrator (existing — docker/operator-orchestrator.Dockerfile)
| Property | Value |
|---|---|
| Base image | python:3.10-slim |
| Stages | 1 (runtime) — single-stage is acceptable here because the operator-orchestrator has no native C++ extensions and the slim base keeps it lean |
| User | Currently root — same Tier-1 caveat as companion-tier1 |
| Health check | python3 -m gps_denied_onboard.healthcheck — --interval=10s --timeout=3s --start-period=10s --retries=3 |
| Exposed ports | TBD (next cycle adds the C12 CLI's HTTP control surface for the operator UI; today the CLI runs as a one-shot invocation) |
| Notes | No change required for cycle-1. |
mock-suite-sat-service (existing — docker/mock-suite-sat-service.Dockerfile)
| Property | Value |
|---|---|
| Base image | python:3.10-slim |
| User | Currently root — acceptable, this is an e2e test fixture only |
| Health check | urllib.request.urlopen('http://127.0.0.1:5100/healthz') — --interval=5s --timeout=2s --retries=3 |
| Exposed ports | 5100 (HTTP) |
| Notes | Not a production image. Retired when parent-suite D-PROJ-2 ships the real ingest endpoint. |
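The urllib-based probe used for this fixture can be written as a small function that maps the HTTP outcome onto the exit codes Docker's HEALTHCHECK expects (0 healthy, 1 unhealthy). This is a sketch of the pattern, not the content of the existing Dockerfile; the function name and the default timeout are assumptions.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0) -> int:
    """Return 0 when the endpoint answers HTTP 200, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if response.status == 200 else 1
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx responses
        # (urlopen raises HTTPError, a URLError subclass, for those).
        return 1
```

A health check script would pass the result to sys.exit, for example sys.exit(probe("http://127.0.0.1:5100/healthz")).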
e2e-runner Tier-1 (existing — tests/e2e/Dockerfile)
Test runner for the Reality Gate on Colima / Tier-1 workstation Docker. Not a production image. ENTRYPOINT: pytest -q /opt/tests/e2e/. No change for cycle-1.
e2e-runner Tier-2 (existing — tests/e2e/Dockerfile.jetson)
Test runner for the Reality Gate on the Jetson. dustynv/l4t-pytorch:r36.4.0 base. The new companion-jetson production image inherits its base image choice and Tegra-pip rationale from this file. No change for cycle-1.
Docker Compose — Local Development (existing docker-compose.yml)
The existing root docker-compose.yml already covers Tier-1 dev: companion + operator-orchestrator + mock-sat + db (Postgres 16), with healthchecks, named volumes (db-data, fdr-data, tile-data), and a tests/fixtures:/fixtures:ro bind mount for the dev calibration JSON + signing key.
No structural change required. Optional cycle-2 polish:
- Add a gps-denied-dev entry under networks: (the compose file currently relies on Docker Compose's default network) so the suite-level e2e harness can join it explicitly when needed.
- Reference ${BRANCH:-main} for image tags so the dev compose can pull from the suite registry instead of always building.
Docker Compose — Blackbox Tests (existing)
| File | Purpose | Status |
|---|---|---|
| docker-compose.test.yml | Tier-1 e2e (Replay + Reality Gate); sets BUILD_VIDEO_FILE_FRAME_SOURCE=ON, BUILD_TLOG_REPLAY_ADAPTER=ON, BUILD_REPLAY_SINK_JSONL=ON | ✅ working |
| docker-compose.test.jetson.yml | Tier-2 e2e on Jetson; same flags ON | ✅ working |
| e2e/docker/docker-compose.test.yml | Suite-level e2e harness's internal compose | ✅ owned by the e2e harness |
| e2e/docker/docker-compose.tier2-bridge.yml | Tier-2 host-network bridge for direct hardware access | ✅ in tree |
Run patterns (suite-mandated per Woodpecker two-workflow contract):
```bash
# Tier-1 e2e (CI 01-test.yml):
docker compose -f docker-compose.test.yml up --build --abort-on-container-exit --exit-code-from e2e-runner
# Tier-2 e2e (manual / Tier-2 lane):
docker compose -f docker-compose.test.jetson.yml up --abort-on-container-exit --exit-code-from e2e-runner
```
The exit code of the e2e-runner service is the pipeline result. This contract matches the suite's detections e2e variant verbatim.
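For callers that drive the e2e run from a script rather than a shell step, the contract can be sketched as a thin wrapper that builds the same docker compose command line and returns the runner's exit code unchanged. The function names are hypothetical; the flags are the ones used in the run patterns above.

```python
import subprocess


def compose_test_command(compose_file: str, build: bool = False,
                         runner: str = "e2e-runner") -> list:
    """Build the 'docker compose up' argv whose exit code mirrors the runner."""
    command = ["docker", "compose", "-f", compose_file, "up"]
    if build:
        command.append("--build")
    command += ["--abort-on-container-exit", "--exit-code-from", runner]
    return command


def run_e2e(compose_file: str, build: bool = False) -> int:
    """Run the compose stack and return the e2e-runner exit code as-is."""
    return subprocess.run(compose_test_command(compose_file, build)).returncode
```

A CI step would end with sys.exit(run_e2e("docker-compose.test.yml", build=True)) so that the pipeline result stays tied to the runner's exit code.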
Docker Compose — Tier-2 Production (parent-suite, NOT in this submodule)
This submodule does not ship a Tier-2 production compose file. The Tier-2 production stack is ../_infra/deploy/jetson/docker-compose.yml (already shipping). This submodule contributes:
- The published image at ${REGISTRY_HOST}/azaion/gps-denied-onboard:<branch>-arm (via companion-jetson.Dockerfile + the upcoming .woodpecker/02-build-push.yml).
- The healthcheck endpoint (python3 -m gps_denied_onboard.healthcheck).
- The flight-state gate honour (/run/azaion/in-flight bind mount in the suite compose, read by the image at boot).
- The audit chain: OCI labels + AZAION_REVISION env + Watchtower post-update hook emitting AZAION_UPDATE_EVENT to journald.
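The flight-state gate in the list above reduces to a file check. The sketch below assumes that the presence of /run/azaion/in-flight means the vehicle is airborne and that its absence permits a restart; this plan names the path but does not spell out the semantics, so treat that reading, and the function name, as assumptions.

```python
from pathlib import Path

# Path named in this plan; presence = "in flight" is an assumption.
IN_FLIGHT_FLAG = Path("/run/azaion/in-flight")


def restart_allowed(flag: Path = IN_FLIGHT_FLAG) -> bool:
    """Allow a restart only while the in-flight flag is absent."""
    return not flag.exists()
```

Because the mount is read-only inside the container, a check like this only ever reads the flag; setting and clearing it stays with the suite.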
Cross-cutting suggestion logged but not actioned in cycle-1: the parent-suite Jetson compose's gps-denied-onboard service block is minimal (no volume mounts beyond model-cache). Under Option B, it needs the additional mounts listed in the companion-jetson Dockerfile table above (fdr-data, tile-data, /run/azaion, FC + camera device passthrough). This is a parent-suite edit that the GPS-Denied Onboard team must coordinate with the suite operator — recorded in Next Steps below.
Image Tagging Strategy (Suite-Mandated)
| Context | Tag Format | Example |
|---|---|---|
| Per-PR CI (test only, not pushed) | n/a | n/a |
| Per-branch CI build-push | ${REGISTRY_HOST}/azaion/<service>:<branch>-<arch> | git.azaion.com/azaion/gps-denied-onboard:dev-arm |
| Release | ${REGISTRY_HOST}/azaion/<service>:<branch>-<arch> (suite uses floating branch tags + Watchtower; semver is not used at suite level today) | git.azaion.com/azaion/gps-denied-onboard:main-arm |
| Local dev | Image name without registry prefix | gps-denied-onboard/companion:dev (current local compose), gps-denied-onboard/operator-orchestrator:dev, gps-denied-onboard/mock-suite-sat-service:dev |
No :latest tag in CI. Suite contract is <branch>-<arch> only; Watchtower polls these floating tags.
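The tag contract is simple enough to encode as a formatter. The sketch below builds the image reference from its four parts and rejects a tag that Docker would not accept (tags are limited to letters, digits, underscore, period, and hyphen, at most 128 characters, and may not start with a period or hyphen). The function name is hypothetical, and rejecting an invalid branch name rather than sanitizing it is a design choice made here, not suite policy.

```python
import re

# Docker tag grammar: first char is a word char, then up to 127 of [A-Za-z0-9_.-].
_TAG_RE = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$")


def image_reference(registry_host: str, service: str, branch: str, arch: str) -> str:
    """Format <registry>/azaion/<service>:<branch>-<arch>, validating the tag."""
    tag = f"{branch}-{arch}"
    if not _TAG_RE.match(tag):
        raise ValueError(f"{tag!r} is not a valid Docker tag")
    return f"{registry_host}/azaion/{service}:{tag}"
```

For example, image_reference("git.azaion.com", "gps-denied-onboard", "dev", "arm") yields the per-branch example from the table, while a branch such as feature/x is rejected because of the slash.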
.dockerignore (existing — audit + recommended addenda)
The current .dockerignore (33 lines, root) covers .git, .venv, build artefacts, *.engine / *.calib / *.index / *.faiss / *.onnx, large test fixtures, _docs/, and editor noise. Adequate for cycle-1. Recommended next-cycle additions (logged here, not applied):
```
# Next-cycle additions to .dockerignore (not applied in cycle-1).
# Comments sit on their own lines: .dockerignore only treats a line as a
# comment when it starts with '#'.
# rules + skills do not belong in any image
.cursor/
# already excluded, keep
_docs/
# don't accidentally ship dev compose into the production image
docker-compose*.yml
# test harness compose + fixtures stay out of production images
e2e/
# test code stays out of production images (currently NOT excluded)
tests/
# README / docs, not needed at runtime
*.md
```
Note: tests/ is currently NOT in .dockerignore, which is intentional for cycle-1: the e2e-runner images (tests/e2e/Dockerfile, tests/e2e/Dockerfile.jetson) COPY tests/ into the image. Splitting .dockerignore per image (BuildKit picks up a Dockerfile-specific ignore file named <Dockerfile-name>.dockerignore placed next to the Dockerfile) is a next-cycle refactor.
Health Checks — Inventory
| Image | Endpoint / Command | Cadence |
|---|---|---|
| companion-tier1, companion-jetson, operator-orchestrator | python3 -m gps_denied_onboard.healthcheck (the module already exists per the existing Dockerfiles) | --interval=10s --timeout=3s --retries=3, with --start-period of 15s, 30s, and 10s respectively |
| mock-suite-sat-service | HTTP GET /healthz on port 5100 | --interval=5s --timeout=2s --retries=3 |
| db (Postgres 16, suite-managed under Tier-2; root compose for Tier-1) | pg_isready -U gps_denied -d gps_denied | --interval=5s --timeout=3s --retries=10 |
Self-verification
- Every component is mapped to its image (companion-tier1 / companion-jetson for C1–C8 + C13; operator-orchestrator for C10 + C11 + C12; mock-suite-sat-service for the e2e fixture)
- Multi-stage builds specified for companion-tier1 (4 stages, existing) and companion-jetson (4 stages, planned)
- Non-root user planned for companion-jetson (Tier-2 production); Tier-1 dev / operator-orchestrator stays root for now (next-cycle harden)
- Health checks defined for every service
- docker-compose.yml covers all components + dependencies (existing)
- docker-compose.test.yml enables black-box testing (existing; Tier-1 + Tier-2 jetson variants)
- .dockerignore defined (existing; next-cycle additions logged)
- Tier-2 production delivery shape resolved (Option B; ADR-005 amendment drafted; Step 2 validation gates queued)
- Image tagging strategy aligned with suite-mandated ${REGISTRY_HOST}/azaion/<service>:<branch>-<arch> contract
Next Steps
- User confirms this containerization plan (BLOCKING gate per the deploy skill Step 2).
- Author docker/companion-jetson.Dockerfile: implementation task for the next cycle (existing-code Step 9 New Task → Step 10 Implement). Will be one of the first follow-up tickets when autodev's Done step reroutes to the existing-code flow.
- Coordinate parent-suite edit: the gps-denied-onboard service block in ../_infra/deploy/jetson/docker-compose.yml needs the additional volume mounts (fdr-data, tile-data, /run/azaion, FC + camera device passthrough). This is a cross-submodule change tracked as a follow-up; record in _docs/_process_leftovers/ if not editable in this cycle.
- Proceed to Step 3 (CI/CD pipeline): author .woodpecker/01-test.yml (Python pytest + Tier-1 e2e via existing docker-compose.test.yml) + .woodpecker/02-build-push.yml (multi-arch matrix, companion-jetson.Dockerfile once it lands; until then, ship only operator-orchestrator + companion-tier1 for the test path). Rewrite _docs/02_document/deployment/ci_cd_pipeline.md against the actual Woodpecker + Gitea Packages stack per suite ../_infra/ci/README.md.