Oleksandr Bezdieniezhnykh bf13549b32
2026-05-20 08:05:35 +03:00


GPS-Denied Onboard — Containerization

Generated by /autodev greenfield Step 16 (Deploy) — Step 2. Builds on Step 1 output (reports/deploy_status_report.md) and the parent-suite CI/CD reality at ../_infra/ci/README.md. Tier-2 delivery shape: Option B (Docker on Jetson via Watchtower) — autodev-resolved 2026-05-19; reversible per Step 1 report.

Containerization Stance

| Tier | Production runtime | Image source |
|---|---|---|
| Tier-1 (workstation dev + CI + replay) | Docker via docker-compose.yml / docker-compose.test.yml | This submodule (docker/companion-tier1.Dockerfile, docker/operator-orchestrator.Dockerfile, docker/mock-suite-sat-service.Dockerfile) |
| Tier-2 (Jetson Orin Nano Super production) | Docker via parent-suite _infra/deploy/jetson/docker-compose.yml + Watchtower auto-update | This submodule's new docker/companion-jetson.Dockerfile (NEW under Option B) pushed to ${REGISTRY_HOST}/azaion/gps-denied-onboard:<branch>-arm |
| Tier-2 (lab/research IT-12 binary) | Docker (same companion-jetson.Dockerfile with research strategy flags ON) or bare JetPack install via tarball | Optional separate image tag :research-arm; cycle-1 ships only the deployment binary path |

Three architectural binary tracks (per ADR-002 + ADR-011) collapse onto two production Docker images in this plan:

  1. gps-denied-onboard (airborne): docker/companion-jetson.Dockerfile for Tier-2 production + docker/companion-tier1.Dockerfile for Tier-1. Same Python module entrypoint (python3 -m gps_denied_onboard.runtime_root); runs both live mode and replay mode from a single image per ADR-011, with config (config.mode = live | replay) selecting strategies at startup.
  2. gps-denied-operator-orchestrator: docker/operator-orchestrator.Dockerfile for the operator workstation (C10 + C11 + C12).

Test fixtures (mock-suite-sat-service, e2e-runner) and test infrastructure (Tier-1 + Tier-2 runners) ship as separate non-deployable images. The research binary is a build-flag variant of the airborne image, not a separate Dockerfile.
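
The single-image, two-mode behaviour described above can be pictured as a small lookup performed once at startup. The sketch below is illustrative only: the replay strategy names are derived from the BUILD_* flag names in this plan, while the live strategy names, the StrategySet type, and select_strategies are hypothetical and not taken from the code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrategySet:
    """Names of the strategies wired in for one run (illustrative only)."""
    frame_source: str
    fc_adapter: str
    sink: str


# Replay names mirror the BUILD_* flags in this plan.
# The live names are placeholders (assumption), not real class names.
_STRATEGIES = {
    "replay": StrategySet("VideoFileFrameSource", "TlogReplayAdapter", "ReplaySinkJsonl"),
    "live": StrategySet("LiveCameraFrameSource", "LiveMavlinkAdapter", "LiveFcSink"),
}


def select_strategies(mode: str) -> StrategySet:
    """Pick the strategy set for config.mode; reject anything else early."""
    try:
        return _STRATEGIES[mode]
    except KeyError:
        raise ValueError(f"config.mode must be 'live' or 'replay', got {mode!r}") from None
```

Failing fast on an unknown mode keeps a misconfigured container from starting with a silently defaulted strategy set.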

ADR-005 Amendment (DRAFT — pending Step 12 / Update Docs sync)

Draft language for the architecture follow-up flagged in Step 1's Cross-Cutting Decision. Lands in architecture.md ADR-005 (amendment) or a new ADR-012 when Step 12 (Test-Spec Sync) / autodev's existing-code Step 13 (Update Docs) picks this up. The current architecture.md ADR-005 paragraph "Tier-2 (Jetson) does NOT use Docker" becomes inconsistent with this plan and must be reconciled.

Container scope (amended): Tier-1 uses Docker (docker compose for the developer setup). Tier-2 (Jetson production) ALSO uses Docker, via the parent-suite _infra/deploy/jetson/docker-compose.yml + Watchtower flow, with runtime: nvidia for GPU access and explicit volume mounts for the TensorRT INT8 calibration cache (model-cache:/data/models) and the C13 FDR ring (fdr-data:/var/lib/gps-denied/fdr). The two technical concerns the original ADR-005 cited — INT8 calibration cache stability and jetson-stats thermal telemetry access — are addressed by (a) the calibration cache living in a host-mounted volume that survives container restarts and (b) jetson-stats accessed via the nvidia-container-runtime's standard device passthrough (same pattern the parent-suite detections service already uses successfully on the same hardware). The deployment binary is the Docker image; the JetPack 6.2 system image is the host OS, not the runtime layer.

Step 2 Validation Gates (BLOCKING — must pass before Step 3)

If either of these gates fails, fall back to Option A (bare-JetPack systemd unit) and re-write this containerization plan:

| Gate | What it validates | Pass criteria | Owner |
|---|---|---|---|
| TensorRT INT8 cache durability under Docker | Build a calibration cache inside the running container; restart the container; verify the cache is reused and inference output is byte-equivalent | SHA-256 of the calibration cache file before and after restart matches; first-frame inference timing post-restart is within 5% of pre-restart timing (cache hit) | C7 owner; runs against the companion-jetson image on the actual Tier-2 Jetson |
| jetson-stats thermal telemetry under Docker | Run jtop (jetson-stats CLI) inside the container with runtime: nvidia; verify thermal + power + GPU clock readings match sudo jtop on the host within 1% | All thermal zones reported; CPU/GPU clock readings present; D-CROSS-LATENCY-1 hybrid trigger threshold readable | C7 / C5 owners; runs against the companion-jetson image |

Both gates land as task tickets when Step 16 chains into the next-cycle existing-code flow (autodev resumes at existing-code Step 9 New Task per the Done state). They are deferred to next cycle and recorded here so they are not lost; the cycle-1 deploy plan ships Option B with the validation marked as "validation pending" in deploy_status_report.md.
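
The pass criteria for the TensorRT cache gate reduce to one hash comparison and one timing comparison. The sketch below is a minimal illustration, not the gate's actual implementation: sha256_of and cache_gate_passes are hypothetical helper names, and reading "within 5%" as a symmetric relative difference is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so a large cache is never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_gate_passes(
    hash_before: str,
    hash_after: str,
    first_frame_before_s: float,
    first_frame_after_s: float,
    tolerance: float = 0.05,
) -> bool:
    """True when the cache is unchanged and first-frame timing stayed within tolerance.

    Assumption: the 5% window is symmetric around the pre-restart timing.
    """
    if first_frame_before_s <= 0:
        raise ValueError("pre-restart timing must be positive")
    if hash_before != hash_after:
        return False
    drift = abs(first_frame_after_s - first_frame_before_s) / first_frame_before_s
    return drift <= tolerance
```

In use, the hash would be taken once before the container restart and once after, with both first-frame timings measured on the Tier-2 Jetson itself.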

Component-to-Image Mapping

Per ADR-009, components are folders under src/gps_denied_onboard/components/. They are not separate processes / containers in this monolithic Python-with-C++-extensions architecture. The mapping below shows which component code paths each image links.

| Image | Components linked | BUILD_* flags (defaults) |
|---|---|---|
| companion-jetson (Tier-2 prod) + companion-tier1 (Tier-1 dev) | C1 (KltRansac default), C2 (UltraVPR default), C2.5, C3 (DISK+LightGlue), C3.5, C4, C5 (GtsamIsam2), C6, C7 (tensorrt on Tier-2, pytorch_fp16 on Tier-1), C8 (per GPS_DENIED_FC_PROFILE), C13 + replay strategies (BUILD_VIDEO_FILE_FRAME_SOURCE=ON, BUILD_TLOG_REPLAY_ADAPTER=ON, BUILD_REPLAY_SINK_JSONL=ON) | BUILD_VINS_MONO=OFF, BUILD_SALAD=OFF, BUILD_C11_TILE_MANAGER=OFF (ADR-004 enforcement), BUILD_DEV_STATIC_KEY=OFF, BUILD_STATE_ESKF=OFF |
| operator-orchestrator (operator workstation) | C10, C11 (TileDownloader + TileUploader), C12 | BUILD_C11_TILE_MANAGER=ON |
| mock-suite-sat-service (test fixture) | NONE (FastAPI stub of the parent-suite satellite-provider D-PROJ-2 contract) | |
| e2e-runner Tier-1 (tests/e2e/Dockerfile) | Full SUT (editable install) + pytest entrypoint | Test profile defaults |
| e2e-runner Tier-2 (tests/e2e/Dockerfile.jetson) | Full SUT (editable install) + pytest entrypoint; dustynv/l4t-pytorch:r36.4.0 base | Test profile defaults |
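
To make the flag defaults in the table concrete, the sketch below turns the airborne image's flag set into CMake -D arguments and rejects a set that would link the C11 tile manager into the airborne binary (the ADR-004 rule noted in the table). The flag names come from the table; AIRBORNE_FLAGS, cmake_args, and check_airborne are hypothetical names, and how the real build passes these flags is not specified in this plan.

```python
# Flag values copied from the component-to-image table (airborne images).
AIRBORNE_FLAGS = {
    "BUILD_VINS_MONO": False,
    "BUILD_SALAD": False,
    "BUILD_C11_TILE_MANAGER": False,
    "BUILD_DEV_STATIC_KEY": False,
    "BUILD_STATE_ESKF": False,
    "BUILD_VIDEO_FILE_FRAME_SOURCE": True,
    "BUILD_TLOG_REPLAY_ADAPTER": True,
    "BUILD_REPLAY_SINK_JSONL": True,
}


def check_airborne(flags: dict) -> None:
    """Refuse a flag set that enables the tile manager in the airborne image."""
    if flags.get("BUILD_C11_TILE_MANAGER", False):
        raise ValueError("BUILD_C11_TILE_MANAGER must stay OFF for the airborne image")


def cmake_args(flags: dict) -> list:
    """Render flags as -DNAME=ON|OFF arguments, sorted for reproducible logs."""
    return [f"-D{name}={'ON' if enabled else 'OFF'}" for name, enabled in sorted(flags.items())]
```

A build script could call check_airborne(AIRBORNE_FLAGS) first and then append cmake_args(AIRBORNE_FLAGS) to its cmake invocation.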

Per-Image Dockerfile Specifications

companion-jetson (NEW under Option B)

| Property | Value |
|---|---|
| File | docker/companion-jetson.Dockerfile (new in next cycle's Step 7, Implementation; this plan specifies the contents) |
| Base image | dustynv/l4t-pytorch:r36.4.0 (digest-pinned per suite follow-up #1); same base proven by tests/e2e/Dockerfile.jetson |
| Stages | (1) system-deps (apt: build-essential, cmake, libpq-dev, libspatialindex-dev, libgl1, libglib2.0-0) → (2) python-deps (pip install -e ".[inference]" with the Tegra-tuned torch preserved per the existing Tier-2 e2e Dockerfile rationale) → (3) cpp-build (CMake build of the native VIO / matcher extensions with BUILD_VINS_MONO=OFF, BUILD_C11_TILE_MANAGER=OFF) → (4) runtime (slim image carrying the venv + native libs + SUT source) |
| User | gps-denied, non-root, uid 10001 (companion does not need root inside the container; volume mounts owned by the same uid on the host) |
| Build args | CI_COMMIT_SHA (suite-mandated; stamped as OCI labels + ENV AZAION_REVISION); BRANCH (carried into image labels) |
| OCI labels | org.opencontainers.image.revision=$CI_COMMIT_SHA, org.opencontainers.image.created=<UTC RFC 3339>, org.opencontainers.image.source=$CI_REPO_URL (suite-mandated per ../_infra/ci/README.md → "OCI image labels and commit provenance (AZ-204)") |
| ENV | AZAION_SERVICE=gps-denied-onboard, AZAION_REVISION=$CI_COMMIT_SHA, PYTHONPATH=/opt/gps-denied/src, PATH=/opt/venv/bin:$PATH |
| Health check | python3 -m gps_denied_onboard.healthcheck with --interval=10s --timeout=3s --start-period=30s --retries=3 (longer start-period than Tier-1 because TensorRT engine deserialization takes seconds on Jetson) |
| Exposed ports | 8080 (HTTP healthz + future replay-mode JSONL stream socket; mapped to host 5040:8080 per parent-suite compose). MAVLink + camera I/O is not TCP; it is host-bound (/dev/ttyUSB*, /dev/video*) via device passthrough. |
| Volume mounts (declared in parent-suite compose) | model-cache:/data/models (TensorRT engines + calibration cache + descriptor index); fdr-data:/var/lib/gps-denied/fdr (C13 ring, ≥ 64 GB); tile-data:/var/lib/gps-denied/tiles (C6 filesystem store, ≥ 10 GB); /run/azaion:/run/azaion (flight-state flag, read-only); device passthrough for /dev/ttyUSB* (FC UART) + /dev/video* (nav camera) |
| Watchtower labels | com.centurylinklabs.watchtower.enable=true + post-update hook emitting AZAION_UPDATE_EVENT per suite x-update-logger template |
| ENTRYPOINT | python3 -m gps_denied_onboard.runtime_root (same as Tier-1) |
| Flight-state gate | Honoured via /run/azaion/in-flight bind mount. The Watchtower restart hook MUST check the flag before restarting (suite-managed; the image itself only honours the flag when transitioning between strategies at boot, and there is no in-process restart logic) |
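
The three suite-mandated OCI labels can be assembled from the build args before the image build runs. The following is a hedged sketch: build_oci_labels and label_args are hypothetical helpers, the label keys are the ones listed in the table, and rendering them as repeated --label key=value arguments for docker build is one possible wiring rather than the suite's documented one.

```python
from datetime import datetime, timezone
from typing import Optional


def build_oci_labels(commit_sha: str, repo_url: str, now: Optional[datetime] = None) -> dict:
    """Return the provenance labels; `now` must be timezone-aware when supplied."""
    moment = now if now is not None else datetime.now(timezone.utc)
    # RFC 3339 timestamp in UTC, second precision, with the "Z" suffix.
    created = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "org.opencontainers.image.revision": commit_sha,
        "org.opencontainers.image.created": created,
        "org.opencontainers.image.source": repo_url,
    }


def label_args(labels: dict) -> list:
    """Flatten labels into repeated `--label key=value` CLI arguments."""
    args = []
    for key, value in labels.items():
        args += ["--label", f"{key}={value}"]
    return args
```

Passing a fixed `now` keeps the output reproducible in tests; a CI step would normally omit it and take the current UTC time.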

companion-tier1 (existing — docker/companion-tier1.Dockerfile)

| Property | Value |
|---|---|
| Base image | ubuntu:22.04 (system-deps stage) → ubuntu:22.04 (runtime) |
| Stages | 4 (system-deps → python-deps → cpp-build → runtime); already documented in the file header |
| User | Currently root (acceptable for Tier-1 dev / CI containers; Tier-2 production hardens this in companion-jetson) |
| Health check | python3 -m gps_denied_onboard.healthcheck with --interval=10s --timeout=3s --start-period=15s --retries=3 |
| Exposed ports | None (Tier-1 healthcheck is in-process; CI exposes nothing) |
| Notes | No change required for cycle-1. Next cycle: add BRANCH + CI_COMMIT_SHA build args + OCI labels for parity with companion-jetson. |

operator-orchestrator (existing — docker/operator-orchestrator.Dockerfile)

| Property | Value |
|---|---|
| Base image | python:3.10-slim |
| Stages | 1 (runtime); single-stage is acceptable here because the operator-orchestrator has no native C++ extensions and the slim base keeps it lean |
| User | Currently root; same Tier-1 caveat as companion-tier1 |
| Health check | python3 -m gps_denied_onboard.healthcheck with --interval=10s --timeout=3s --start-period=10s --retries=3 |
| Exposed ports | TBD (next cycle adds the C12 CLI's HTTP control surface for the operator UI; today the CLI runs as a one-shot invocation) |
| Notes | No change required for cycle-1. |

mock-suite-sat-service (existing — docker/mock-suite-sat-service.Dockerfile)

| Property | Value |
|---|---|
| Base image | python:3.10-slim |
| User | Currently root; acceptable, this is an e2e test fixture only |
| Health check | urllib.request.urlopen('http://127.0.0.1:5100/healthz') with --interval=5s --timeout=2s --retries=3 |
| Exposed ports | 5100 (HTTP) |
| Notes | Not a production image. Retired when parent-suite D-PROJ-2 ships the real ingest endpoint. |

e2e-runner Tier-1 (existing — tests/e2e/Dockerfile)

Test runner for the Reality Gate on Colima / Tier-1 workstation Docker. Not a production image. ENTRYPOINT: pytest -q /opt/tests/e2e/. No change for cycle-1.

e2e-runner Tier-2 (existing — tests/e2e/Dockerfile.jetson)

Test runner for the Reality Gate on the Jetson. dustynv/l4t-pytorch:r36.4.0 base. The new companion-jetson production image inherits its base image choice and Tegra-pip rationale from this file. No change for cycle-1.

Docker Compose — Local Development (existing docker-compose.yml)

The existing root docker-compose.yml already covers Tier-1 dev: companion + operator-orchestrator + mock-sat + db (Postgres 16), with healthchecks, named volumes (db-data, fdr-data, tile-data), and a tests/fixtures:/fixtures:ro bind mount for the dev calibration JSON + signing key.

No structural change required. Optional cycle-2 polish:

  • Declare a named gps-denied-dev network under the top-level networks: key (the stack currently relies on Docker Compose's default network) so the suite-level e2e harness can join it explicitly when needed.
  • Reference ${BRANCH:-main} for image tags so the dev compose can pull from the suite registry instead of always building.

Docker Compose — Blackbox Tests (existing)

| File | Purpose | Status |
|---|---|---|
| docker-compose.test.yml | Tier-1 e2e (Replay + Reality Gate); sets BUILD_VIDEO_FILE_FRAME_SOURCE=ON, BUILD_TLOG_REPLAY_ADAPTER=ON, BUILD_REPLAY_SINK_JSONL=ON | working |
| docker-compose.test.jetson.yml | Tier-2 e2e on Jetson; same flags ON | working |
| e2e/docker/docker-compose.test.yml | Suite-level e2e harness's internal compose | owned by the e2e harness |
| e2e/docker/docker-compose.tier2-bridge.yml | Tier-2 host-network bridge for direct hardware access | in tree |

Run patterns (suite-mandated per Woodpecker two-workflow contract):

# Tier-1 e2e (CI 01-test.yml):
docker compose -f docker-compose.test.yml up --build --abort-on-container-exit --exit-code-from e2e-runner

# Tier-2 e2e (manual / Tier-2 lane):
docker compose -f docker-compose.test.jetson.yml up --abort-on-container-exit --exit-code-from e2e-runner

The exit code of the e2e-runner service is the pipeline result. This contract matches the suite's detections e2e variant verbatim.

Docker Compose — Tier-2 Production (parent-suite, NOT in this submodule)

This submodule does not ship a Tier-2 production compose file. The Tier-2 production stack is ../_infra/deploy/jetson/docker-compose.yml (already shipping). This submodule contributes:

  1. The published image at ${REGISTRY_HOST}/azaion/gps-denied-onboard:<branch>-arm (via companion-jetson.Dockerfile + the upcoming .woodpecker/02-build-push.yml).
  2. The healthcheck endpoint (python3 -m gps_denied_onboard.healthcheck).
  3. The flight-state gate honour (/run/azaion/in-flight bind mount in the suite compose — read by the image at boot).
  4. The audit chain — OCI labels + AZAION_REVISION env + Watchtower post-update hook emitting AZAION_UPDATE_EVENT to journald.
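
The flight-state gate in item 3 amounts to a file-existence check made before any restart. The sketch below shows the kind of check a restart hook could perform. It rests on assumptions: that the presence of /run/azaion/in-flight means the vehicle is airborne (this plan names the path but does not define the flag's semantics), and that the hook signals "do not restart" through a non-zero exit code. restart_allowed and hook_exit_code are hypothetical names.

```python
from pathlib import Path

# Path taken from this plan; treating "file exists" as "in flight" is an assumption.
IN_FLIGHT_FLAG = Path("/run/azaion/in-flight")


def restart_allowed(flag: Path = IN_FLIGHT_FLAG) -> bool:
    """A restart is allowed only while the in-flight flag file is absent."""
    return not flag.exists()


def hook_exit_code(flag: Path = IN_FLIGHT_FLAG) -> int:
    """Exit code for a pre-restart hook: 0 lets the restart proceed, 1 blocks it."""
    return 0 if restart_allowed(flag) else 1
```

Because the mount is read-only inside the container, the check only ever reads the flag; creating and clearing it stays with whichever suite component owns the flight state.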

Cross-cutting suggestion logged but not actioned in cycle-1: the parent-suite Jetson compose's gps-denied-onboard service block is minimal (no volume mounts beyond model-cache). Under Option B, it needs the additional mounts listed in the companion-jetson Dockerfile table above (fdr-data, tile-data, /run/azaion, FC + camera device passthrough). This is a parent-suite edit that the GPS-Denied Onboard team must coordinate with the suite operator — recorded in Next Steps below.

Image Tagging Strategy (Suite-Mandated)

| Context | Tag Format | Example |
|---|---|---|
| Per-PR CI (test only, not pushed) | n/a | n/a |
| Per-branch CI build-push | ${REGISTRY_HOST}/azaion/<service>:<branch>-<arch> | git.azaion.com/azaion/gps-denied-onboard:dev-arm |
| Release | ${REGISTRY_HOST}/azaion/<service>:<branch>-<arch> (suite uses floating branch tags + Watchtower; semver is not used at suite level today) | git.azaion.com/azaion/gps-denied-onboard:main-arm |
| Local dev | Image name without registry prefix | gps-denied-onboard/companion:dev (current local compose), gps-denied-onboard/operator-orchestrator:dev, gps-denied-onboard/mock-suite-sat-service:dev |

No :latest tag in CI. Suite contract is <branch>-<arch> only; Watchtower polls these floating tags.
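
The tag contract is simple enough to express as one function. This is a minimal sketch under stated assumptions: image_ref is a hypothetical helper, and the validation pattern follows Docker's general tag grammar (letters, digits, underscore, period, hyphen; at most 128 characters; not starting with a period or hyphen) rather than any rule stated in the suite README.

```python
import re

# Docker tag grammar: first char is a word character, then up to 127 of [A-Za-z0-9_.-].
_TAG_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def image_ref(registry_host: str, service: str, branch: str, arch: str) -> str:
    """Build <registry>/azaion/<service>:<branch>-<arch> and reject invalid tags."""
    tag = f"{branch}-{arch}"
    if not _TAG_RE.fullmatch(tag):
        raise ValueError(f"{tag!r} is not a valid image tag")
    return f"{registry_host}/azaion/{service}:{tag}"
```

For example, image_ref("git.azaion.com", "gps-denied-onboard", "dev", "arm") yields the per-branch example from the table, while a branch name containing a slash is rejected rather than silently producing an unpushable reference.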

.dockerignore

The current .dockerignore (33 lines, root) covers .git, .venv, build artefacts, *.engine / *.calib / *.index / *.faiss / *.onnx, large test fixtures, _docs/, and editor noise. Adequate for cycle-1. Recommended next-cycle additions (logged here, not applied):

# Next-cycle additions to .dockerignore (not applied in cycle-1)
# rules + skills do not belong in any image
.cursor/
# already excluded; keep
_docs/
# don't accidentally ship dev compose into the production image
docker-compose*.yml
# test harness compose + fixtures stay out of production images
e2e/
# test code stays out of production images (currently NOT excluded)
tests/
# README / docs are not needed at runtime
*.md

Note: tests/ is currently NOT in .dockerignore, which is intentional for cycle-1: the e2e-runner images (tests/e2e/Dockerfile, tests/e2e/Dockerfile.jetson) COPY tests/ into the image. Splitting the ignore rules per image is a next-cycle refactor; it relies on a per-Dockerfile ignore file (named <Dockerfile-name>.dockerignore and placed next to that Dockerfile), which only BuildKit supports.

Health Checks — Inventory

| Image | Endpoint / Command | Cadence |
|---|---|---|
| companion-tier1, companion-jetson, operator-orchestrator | python3 -m gps_denied_onboard.healthcheck (the module already exists per the existing Dockerfiles) | --interval=10s --timeout=3s --start-period={15,30,10}s --retries=3 |
| mock-suite-sat-service | HTTP GET /healthz on port 5100 | --interval=5s --timeout=2s --retries=3 |
| db (Postgres 16, suite-managed under Tier-2; root compose for Tier-1) | pg_isready -U gps_denied -d gps_denied | --interval=5s --timeout=3s --retries=10 |
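
For the HTTP-based check on mock-suite-sat-service, the probe boils down to "GET /healthz, exit 0 on HTTP 200, exit 1 otherwise". The sketch below illustrates that shape with the standard library only. The probe function and its default timeout are illustrative assumptions (the 2 s default mirrors the --timeout in the table); this is not the contents of the existing Dockerfile's HEALTHCHECK line.

```python
import urllib.request


def probe(url: str = "http://127.0.0.1:5100/healthz", timeout_s: float = 2.0) -> int:
    """Return 0 when the endpoint answers HTTP 200, 1 on any other outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 0 if resp.status == 200 else 1
    except OSError:
        # URLError, HTTPError (4xx/5xx), refused connections and timeouts
        # are all OSError subclasses, so they all count as "unhealthy".
        return 1


if __name__ == "__main__":
    raise SystemExit(probe())
```

Docker treats a zero exit status from the health command as healthy and a non-zero one as unhealthy, which is why the function returns an exit code rather than a boolean.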

Self-verification

  • Every component is mapped to its image (companion-tier1 / companion-jetson for C1–C8 + C13; operator-orchestrator for C10 + C11 + C12; mock-suite-sat-service for the e2e fixture)
  • Multi-stage builds specified for companion-tier1 (4 stages, existing) and companion-jetson (4 stages, planned)
  • Non-root user planned for companion-jetson (Tier-2 production); Tier-1 dev / operator-orchestrator stays root for now (next-cycle harden)
  • Health checks defined for every service
  • docker-compose.yml covers all components + dependencies (existing)
  • docker-compose.test.yml enables black-box testing (existing; Tier-1 + Tier-2 jetson variants)
  • .dockerignore defined (existing; next-cycle additions logged)
  • Tier-2 production delivery shape resolved (Option B; ADR-005 amendment drafted; Step 2 validation gates queued)
  • Image tagging strategy aligned with suite-mandated ${REGISTRY_HOST}/azaion/<service>:<branch>-<arch> contract

Next Steps

  1. User confirms this containerization plan (BLOCKING gate per the deploy skill Step 2).
  2. Author docker/companion-jetson.Dockerfile — implementation task for the next cycle (existing-code Step 9 New Task → Step 10 Implement). Will be one of the first follow-up tickets when autodev's Done step reroutes to the existing-code flow.
  3. Coordinate parent-suite edit: the gps-denied-onboard service block in ../_infra/deploy/jetson/docker-compose.yml needs the additional volume mounts (fdr-data, tile-data, /run/azaion, FC + camera device passthrough). This is a cross-submodule change tracked as a follow-up; record in _docs/_process_leftovers/ if not editable in this cycle.
  4. Proceed to Step 3 (CI/CD pipeline) — author .woodpecker/01-test.yml (Python pytest + Tier-1 e2e via existing docker-compose.test.yml) + .woodpecker/02-build-push.yml (multi-arch matrix, companion-jetson.Dockerfile once it lands; until then, ship only operator-orchestrator + companion-tier1 for the test path). Rewrite _docs/02_document/deployment/ci_cd_pipeline.md against the actual Woodpecker + Gitea Packages stack per suite ../_infra/ci/README.md.