azaion/gps-denied-onboard

mirror of https://github.com/azaion/gps-denied-onboard.git synced 2026-06-21 13:31:12 +00:00

Oleksandr Bezdieniezhnykh bf13549b32

[autodev] Update configuration and documentation for cycle-1

- Enhanced `.env.example` with detailed CMake build flags and replay-mode strategy flags for development and CI environments.
- Updated `.gitignore` to include a new deploy rollback bookmark.
- Revised `_docs/_autodev_state.md` to reflect the current task status and steps.
- Added new lessons to `_docs/LESSONS.md` regarding testing and architectural improvements.
- Documented changes in `_docs/02_document/deployment/ci_cd_pipeline.md` to reflect the relaxed OpenCV version pin.
- Updated test data documentation in `_docs/02_document/tests/test-data.md` to clarify fixture usage and paths.

This commit continues the cycle-1 documentation sync and addresses various configuration updates for improved clarity and functionality.

2026-05-20 08:05:35 +03:00

GPS-Denied Onboard — Containerization

Generated by /autodev greenfield Step 16 (Deploy) — Step 2. Builds on Step 1 output (reports/deploy_status_report.md) and the parent-suite CI/CD reality at ../_infra/ci/README.md. Tier-2 delivery shape: Option B (Docker on Jetson via Watchtower) — autodev-resolved 2026-05-19; reversible per Step 1 report.

Containerization Stance

Tier	Production runtime	Image source
Tier-1 (workstation dev + CI + replay)	Docker via `docker-compose.yml` / `docker-compose.test.yml`	This submodule (`docker/companion-tier1.Dockerfile`, `docker/operator-orchestrator.Dockerfile`, `docker/mock-suite-sat-service.Dockerfile`)
Tier-2 (Jetson Orin Nano Super production)	Docker via parent-suite `_infra/deploy/jetson/docker-compose.yml` + Watchtower auto-update	This submodule's new `docker/companion-jetson.Dockerfile` (NEW under Option B) pushed to `${REGISTRY_HOST}/azaion/gps-denied-onboard:<branch>-arm`
Tier-2 (lab/research IT-12 binary)	Docker (same `companion-jetson.Dockerfile` with research strategy flags ON) or bare JetPack install via tarball	Optional separate image tag `:research-arm`; cycle-1 ships only the deployment binary path

Three architectural binary tracks (per ADR-002 + ADR-011) collapse onto two production Docker images in this plan:

gps-denied-onboard (airborne) — docker/companion-jetson.Dockerfile for Tier-2 production + docker/companion-tier1.Dockerfile for Tier-1. Same Python module entrypoint (python3 -m gps_denied_onboard.runtime_root); runs both live mode and replay mode from a single image per ADR-011 — config (config.mode = live | replay) selects strategies at startup.
gps-denied-operator-orchestrator — docker/operator-orchestrator.Dockerfile for the operator workstation (C10 + C11 + C12).

Test fixtures (mock-suite-sat-service, e2e-runner) and test infrastructure (Tier-1 + Tier-2 runners) ship as separate non-deployable images. The research binary is a build-flag variant of the airborne image, not a separate Dockerfile.
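The single-image, two-mode behaviour described above can be sketched as a lookup that runs once at startup. This is only an illustration: the Mode enum, the Strategies dataclass, and every strategy label are hypothetical names, not the SUT's real classes. Only the config.mode values (live, replay) come from this plan.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    LIVE = "live"
    REPLAY = "replay"


@dataclass(frozen=True)
class Strategies:
    frame_source: str
    telemetry_adapter: str
    output_sink: str


# Hypothetical strategy labels; the real implementations live in the SUT.
_STRATEGIES = {
    Mode.LIVE: Strategies("camera_device", "fc_serial", "fc_link"),
    Mode.REPLAY: Strategies("video_file", "tlog_replay", "jsonl"),
}


def select_strategies(mode_value: str) -> Strategies:
    """Resolve the strategy set once, at startup, from config.mode."""
    try:
        mode = Mode(mode_value)
    except ValueError:
        raise ValueError(
            f"config.mode must be 'live' or 'replay', got {mode_value!r}"
        ) from None
    return _STRATEGIES[mode]
```

Rejecting unknown modes at startup keeps a mistyped config from silently falling back to live mode on an airborne image.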

ADR-005 Amendment (DRAFT — pending Step 12 / Update Docs sync)

Draft language for the architecture follow-up flagged in Step 1's Cross-Cutting Decision. Lands in architecture.md ADR-005 (amendment) or a new ADR-012 when Step 12 (Test-Spec Sync) / autodev's existing-code Step 13 (Update Docs) picks this up. The current architecture.md ADR-005 paragraph "Tier-2 (Jetson) does NOT use Docker" becomes inconsistent with this plan and must be reconciled.

Container scope (amended): Tier-1 uses Docker (docker compose for the developer setup). Tier-2 (Jetson production) ALSO uses Docker, via the parent-suite _infra/deploy/jetson/docker-compose.yml + Watchtower flow, with runtime: nvidia for GPU access and explicit volume mounts for the TensorRT INT8 calibration cache (model-cache:/data/models) and the C13 FDR ring (fdr-data:/var/lib/gps-denied/fdr). The two technical concerns the original ADR-005 cited — INT8 calibration cache stability and jetson-stats thermal telemetry access — are addressed by (a) the calibration cache living in a host-mounted volume that survives container restarts and (b) jetson-stats accessed via the nvidia-container-runtime's standard device passthrough (same pattern the parent-suite detections service already uses successfully on the same hardware). The deployment binary is the Docker image; the JetPack 6.2 system image is the host OS, not the runtime layer.

Step 2 Validation Gates (BLOCKING — must pass before Step 3)

If either of these gates fails, fall back to Option A (bare-JetPack systemd unit) and re-write this containerization plan:

Gate	What it validates	Pass criteria	Owner
TensorRT INT8 cache durability under Docker	Build a calibration cache inside the running container; restart the container; verify the cache is reused and inference output is byte-equivalent	SHA-256 of the calibration cache file before and after restart matches; first-frame inference timing post-restart is within 5% of pre-restart timing (cache hit)	C7 owner; runs against the `companion-jetson` image on the actual Tier-2 Jetson
`jetson-stats` thermal telemetry under Docker	Run `jtop` (jetson-stats CLI) inside the container with `runtime: nvidia`; verify thermal + power + GPU clock readings match `sudo jtop` on the host within 1%	All thermal zones reported; CPU/GPU clock readings present; D-CROSS-LATENCY-1 hybrid trigger threshold readable	C7 / C5 owners; runs against the `companion-jetson` image

Both gates land as task tickets when Step 16 chains into the next-cycle existing-code flow (autodev resumes at existing-code Step 9 New Task per the Done state). They are deferred to next cycle and recorded here so they are not lost; the cycle-1 deploy plan ships Option B with the validation marked as "validation pending" in deploy_status_report.md.
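The numeric pass criteria in the gate table could be checked with helpers along the following lines. This is a minimal sketch, not the gate implementation: the function names are assumptions, and the timing and telemetry values would come from whatever harness the C7 / C5 owners write.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large calibration caches are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def within_relative_tolerance(measured: float, reference: float, tolerance: float) -> bool:
    """True when measured deviates from reference by at most tolerance (a fraction)."""
    if reference == 0:
        return measured == 0
    return abs(measured - reference) / abs(reference) <= tolerance


def cache_gate_passes(hash_before: str, hash_after: str,
                      ms_before: float, ms_after: float) -> bool:
    """Gate 1: identical cache hash and first-frame timing within 5%."""
    return hash_before == hash_after and within_relative_tolerance(
        ms_after, ms_before, 0.05
    )
```

The same within_relative_tolerance helper covers the 1% comparison between container and host jtop readings in the second gate by passing tolerance=0.01.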

Component-to-Image Mapping

Per ADR-009, components are folders under src/gps_denied_onboard/components/. They are not separate processes / containers in this monolithic Python-with-C++-extensions architecture. The mapping below shows which component code paths each image links.

Image	Components linked	BUILD_* flags (defaults)
`companion-jetson` (Tier-2 prod) + `companion-tier1` (Tier-1 dev)	C1 (`KltRansac` default), C2 (`UltraVPR` default), C2.5, C3 (`DISK+LightGlue`), C3.5, C4, C5 (`GtsamIsam2`), C6, C7 (`tensorrt` on Tier-2, `pytorch_fp16` on Tier-1), C8 (per `GPS_DENIED_FC_PROFILE`), C13 + replay strategies (`BUILD_VIDEO_FILE_FRAME_SOURCE=ON`, `BUILD_TLOG_REPLAY_ADAPTER=ON`, `BUILD_REPLAY_SINK_JSONL=ON`)	`BUILD_VINS_MONO=OFF`, `BUILD_SALAD=OFF`, `BUILD_C11_TILE_MANAGER=OFF` (ADR-004 enforcement), `BUILD_DEV_STATIC_KEY=OFF`, `BUILD_STATE_ESKF=OFF`
`operator-orchestrator` (operator workstation)	C10, C11 (`TileDownloader` + `TileUploader`), C12	`BUILD_C11_TILE_MANAGER=ON`
`mock-suite-sat-service` (test fixture)	NONE (FastAPI stub of the parent-suite `satellite-provider` D-PROJ-2 contract)	—
`e2e-runner` Tier-1 (`tests/e2e/Dockerfile`)	Full SUT (editable install) + pytest entrypoint	Test profile defaults
`e2e-runner` Tier-2 (`tests/e2e/Dockerfile.jetson`)	Full SUT (editable install) + pytest entrypoint; `dustynv/l4t-pytorch:r36.4.0` base	Test profile defaults
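One way to keep the BUILD_* defaults from the table in a single place is to render them into CMake -D definitions. The sketch below assumes a helper named cmake_definitions and a dictionary named AIRBORNE_DEFAULTS, both hypothetical; the flag names and ON/OFF values are the ones listed for the airborne image.

```python
from __future__ import annotations

# Defaults for the airborne image, copied from the mapping table above.
AIRBORNE_DEFAULTS: dict[str, bool] = {
    "BUILD_VIDEO_FILE_FRAME_SOURCE": True,
    "BUILD_TLOG_REPLAY_ADAPTER": True,
    "BUILD_REPLAY_SINK_JSONL": True,
    "BUILD_VINS_MONO": False,
    "BUILD_SALAD": False,
    "BUILD_C11_TILE_MANAGER": False,
    "BUILD_DEV_STATIC_KEY": False,
    "BUILD_STATE_ESKF": False,
}


def cmake_definitions(flags: dict[str, bool]) -> list[str]:
    """Render boolean flags as -DNAME=ON|OFF arguments, sorted for stable output."""
    return [
        f"-D{name}={'ON' if enabled else 'OFF'}"
        for name, enabled in sorted(flags.items())
    ]
```

Sorting the flags keeps the generated command line stable between runs, which makes CI logs easier to diff.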

Per-Image Dockerfile Specifications

`companion-jetson` — NEW under Option B

Property	Value
File	`docker/companion-jetson.Dockerfile` (new in next cycle's Step 7 — Implementation; this plan specifies the contents)
Base image	`dustynv/l4t-pytorch:r36.4.0` (digest-pinned per suite follow-up #1) — same base proven by `tests/e2e/Dockerfile.jetson`
Stages	(1) system-deps (apt: `build-essential`, `cmake`, `libpq-dev`, `libspatialindex-dev`, `libgl1`, `libglib2.0-0`) → (2) python-deps (`pip install -e ".[inference]"` with the Tegra-tuned torch preserved per the existing Tier-2 e2e Dockerfile rationale) → (3) cpp-build (CMake build of the native VIO / matcher extensions with `BUILD_VINS_MONO=OFF`, `BUILD_C11_TILE_MANAGER=OFF`) → (4) runtime (slim image carrying the venv + native libs + SUT source)
User	`gps-denied` non-root uid 10001 (companion does not need root inside the container; volume mounts owned by the same uid on the host)
Build args	`CI_COMMIT_SHA` (suite-mandated; stamped as OCI labels + `ENV AZAION_REVISION`); `BRANCH` (carried into image labels)
OCI labels	`org.opencontainers.image.revision=$CI_COMMIT_SHA`, `org.opencontainers.image.created=<UTC RFC 3339>`, `org.opencontainers.image.source=$CI_REPO_URL` (suite-mandated per `../_infra/ci/README.md` → "OCI image labels and commit provenance (AZ-204)")
ENV	`AZAION_SERVICE=gps-denied-onboard`, `AZAION_REVISION=$CI_COMMIT_SHA`, `PYTHONPATH=/opt/gps-denied/src`, `PATH=/opt/venv/bin:$PATH`
Health check	`python3 -m gps_denied_onboard.healthcheck` — `--interval=10s --timeout=3s --start-period=30s --retries=3` (longer start-period than Tier-1 because TensorRT engine deserialize takes seconds on Jetson)
Exposed ports	`8080` (HTTP healthz + future replay-mode JSONL stream socket; mapped to host `5040:8080` per parent-suite compose). MAVLink + camera I/O is not TCP — it is host-bound (`/dev/ttyUSB`, `/dev/video`) via device passthrough.
Volume mounts (declared in parent-suite compose)	`model-cache:/data/models` (TensorRT engines + calibration cache + descriptor index); `fdr-data:/var/lib/gps-denied/fdr` (C13 ring, ≥ 64 GB); `tile-data:/var/lib/gps-denied/tiles` (C6 filesystem store, ≥ 10 GB); `/run/azaion:/run/azaion` (flight-state flag, read-only); device passthrough for `/dev/ttyUSB` (FC UART) + `/dev/video` (nav camera)
Watchtower labels	`com.centurylinklabs.watchtower.enable=true` + post-update hook emitting `AZAION_UPDATE_EVENT` per suite `x-update-logger` template
ENTRYPOINT	`python3 -m gps_denied_onboard.runtime_root` (same as Tier-1)
Flight-state gate	Honoured via the `/run/azaion/in-flight` bind mount. The Watchtower restart hook MUST check the flag before restarting (suite-managed); the image itself only honours the flag when transitioning between strategies at boot, and has no in-process restart logic
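The flight-state check that a restart hook would perform can be sketched as a single file-existence test. This assumes the flag file exists while the aircraft is airborne and is absent on the ground; the plan names the path but does not spell out that convention, and the function name restart_allowed is hypothetical.

```python
from pathlib import Path

# Path taken from the plan; it is bind-mounted read-only into the container.
IN_FLIGHT_FLAG = Path("/run/azaion/in-flight")


def restart_allowed(flag_path: Path = IN_FLIGHT_FLAG) -> bool:
    """Assumption: the flag file is present in flight and absent on the ground."""
    return not flag_path.exists()
```

A hook built on this would skip the update when restart_allowed() returns False and retry on the next Watchtower poll.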

`companion-tier1` (existing — `docker/companion-tier1.Dockerfile`)

Property	Value
Base image	`ubuntu:22.04` (system-deps stage) → `ubuntu:22.04` (runtime)
Stages	4 (`system-deps` → `python-deps` → `cpp-build` → `runtime`) — already documented in the file header
User	Currently root (acceptable for Tier-1 dev / CI containers — Tier-2 production hardens this in `companion-jetson`)
Health check	`python3 -m gps_denied_onboard.healthcheck` — `--interval=10s --timeout=3s --start-period=15s --retries=3`
Exposed ports	None (Tier-1 healthcheck is in-process; CI exposes nothing)
Notes	No change required for cycle-1. Next cycle: add `BRANCH` + `CI_COMMIT_SHA` build args + OCI labels for parity with `companion-jetson`.

`operator-orchestrator` (existing — `docker/operator-orchestrator.Dockerfile`)

Property	Value
Base image	`python:3.10-slim`
Stages	1 (`runtime`) — single-stage is acceptable here because the operator-orchestrator has no native C++ extensions and the slim base keeps it lean
User	Currently root — same Tier-1 caveat as `companion-tier1`
Health check	`python3 -m gps_denied_onboard.healthcheck` — `--interval=10s --timeout=3s --start-period=10s --retries=3`
Exposed ports	TBD (next cycle adds the C12 CLI's HTTP control surface for the operator UI; today the CLI runs as a one-shot invocation)
Notes	No change required for cycle-1.

`mock-suite-sat-service` (existing — `docker/mock-suite-sat-service.Dockerfile`)

Property	Value
Base image	`python:3.10-slim`
User	Currently root — acceptable, this is an e2e test fixture only
Health check	`urllib.request.urlopen('http://127.0.0.1:5100/healthz')` — `--interval=5s --timeout=2s --retries=3`
Exposed ports	`5100` (HTTP)
Notes	Not a production image. Retired when parent-suite D-PROJ-2 ships the real ingest endpoint.
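The urllib-based probe named in the health-check row could look roughly like the sketch below. The function name probe and its defaults are assumptions; the URL and the 2-second timeout mirror the table. The return value follows the Docker HEALTHCHECK convention of 0 for healthy and 1 for unhealthy.

```python
import http.client
import urllib.error
import urllib.request


def probe(url: str = "http://127.0.0.1:5100/healthz", timeout_s: float = 2.0) -> int:
    """Return 0 when the endpoint answers 2xx, 1 on any error or other status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 0 if 200 <= response.status < 300 else 1
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, HTTP 4xx/5xx, and malformed replies.
        return 1
```

A container health command would call sys.exit(probe()) so that the process exit code carries the result.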

`e2e-runner` Tier-1 (existing — `tests/e2e/Dockerfile`)

Test runner for the Reality Gate on Colima / Tier-1 workstation Docker. Not a production image. ENTRYPOINT: pytest -q /opt/tests/e2e/. No change for cycle-1.

`e2e-runner` Tier-2 (existing — `tests/e2e/Dockerfile.jetson`)

Test runner for the Reality Gate on the Jetson. dustynv/l4t-pytorch:r36.4.0 base. The new companion-jetson production image inherits its base image choice and Tegra-pip rationale from this file. No change for cycle-1.

Docker Compose — Local Development (existing `docker-compose.yml`)

The existing root docker-compose.yml already covers Tier-1 dev: companion + operator-orchestrator + mock-sat + db (Postgres 16), with healthchecks, named volumes (db-data, fdr-data, tile-data), and a tests/fixtures:/fixtures:ro bind mount for the dev calibration JSON + signing key.

No structural change required. Optional cycle-2 polish:

Add an explicit networks: entry named gps-denied-dev (the stack currently relies on Docker Compose's default network) so the suite-level e2e harness can join it explicitly when needed.
Reference ${BRANCH:-main} for image tags so the dev compose can pull from the suite registry instead of always building.

Docker Compose — Blackbox Tests (existing)

File	Purpose	Status
`docker-compose.test.yml`	Tier-1 e2e (Replay + Reality Gate); sets `BUILD_VIDEO_FILE_FRAME_SOURCE=ON`, `BUILD_TLOG_REPLAY_ADAPTER=ON`, `BUILD_REPLAY_SINK_JSONL=ON`	✅ working
`docker-compose.test.jetson.yml`	Tier-2 e2e on Jetson; same flags ON	✅ working
`e2e/docker/docker-compose.test.yml`	Suite-level e2e harness's internal compose	✅ owned by the e2e harness
`e2e/docker/docker-compose.tier2-bridge.yml`	Tier-2 host-network bridge for direct hardware access	✅ in tree

Run patterns (suite-mandated per Woodpecker two-workflow contract):

# Tier-1 e2e (CI 01-test.yml):
docker compose -f docker-compose.test.yml up --build --abort-on-container-exit --exit-code-from e2e-runner

# Tier-2 e2e (manual / Tier-2 lane):
docker compose -f docker-compose.test.jetson.yml up --abort-on-container-exit --exit-code-from e2e-runner

The exit code of the e2e-runner service is the pipeline result. This contract matches the suite's detections e2e variant verbatim.
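For callers that wrap these commands in a script, the exit-code contract can be sketched as follows. The helper names e2e_command and run_e2e are hypothetical; the flags are exactly the ones in the run patterns above.

```python
import subprocess


def e2e_command(compose_file: str, build: bool) -> list[str]:
    """Assemble the docker compose invocation used by the run patterns."""
    cmd = ["docker", "compose", "-f", compose_file, "up"]
    if build:
        cmd.append("--build")
    cmd += ["--abort-on-container-exit", "--exit-code-from", "e2e-runner"]
    return cmd


def run_e2e(compose_file: str, build: bool = True) -> int:
    """Run the stack and return the e2e-runner exit code as the pipeline result."""
    completed = subprocess.run(e2e_command(compose_file, build), check=False)
    return completed.returncode
```

Using check=False keeps the runner's own exit code intact instead of turning a test failure into a Python exception.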

Docker Compose — Tier-2 Production (parent-suite, NOT in this submodule)

This submodule does not ship a Tier-2 production compose file. The Tier-2 production stack is ../_infra/deploy/jetson/docker-compose.yml (already shipping). This submodule contributes:

The published image at ${REGISTRY_HOST}/azaion/gps-denied-onboard:<branch>-arm (via companion-jetson.Dockerfile + the upcoming .woodpecker/02-build-push.yml).
The healthcheck endpoint (python3 -m gps_denied_onboard.healthcheck).
Flight-state gate support (the /run/azaion/in-flight bind mount in the suite compose, read by the image at boot).
The audit chain — OCI labels + AZAION_REVISION env + Watchtower post-update hook emitting AZAION_UPDATE_EVENT to journald.
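The provenance labels in the audit chain can be illustrated by generating docker build --label arguments. This is a sketch under assumptions: the plan stamps labels from the CI_COMMIT_SHA build arg inside the Dockerfile, whereas this hypothetical helper oci_label_args shows the equivalent values being passed on the command line.

```python
from __future__ import annotations

from datetime import datetime, timezone


def oci_label_args(revision: str, source_url: str,
                   created: datetime | None = None) -> list[str]:
    """Build --label arguments for the three OCI labels named in this plan.

    created must be timezone-aware; it defaults to the current UTC time.
    """
    if created is None:
        created = datetime.now(timezone.utc)
    # RFC 3339 timestamp in UTC, second precision.
    stamp = created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    labels = {
        "org.opencontainers.image.revision": revision,
        "org.opencontainers.image.created": stamp,
        "org.opencontainers.image.source": source_url,
    }
    args: list[str] = []
    for key, value in labels.items():
        args += ["--label", f"{key}={value}"]
    return args
```

The returned list can be spliced into a docker build argument vector ahead of the build context path.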

Cross-cutting suggestion logged but not actioned in cycle-1: the parent-suite Jetson compose's gps-denied-onboard service block is minimal (no volume mounts beyond model-cache). Under Option B, it needs the additional mounts listed in the companion-jetson Dockerfile table above (fdr-data, tile-data, /run/azaion, FC + camera device passthrough). This is a parent-suite edit that the GPS-Denied Onboard team must coordinate with the suite operator — recorded in Next Steps below.

Image Tagging Strategy (Suite-Mandated)

Context	Tag Format	Example
Per-PR CI (test only, not pushed)	n/a	n/a
Per-branch CI build-push	`${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>`	`git.azaion.com/azaion/gps-denied-onboard:dev-arm`
Release	`${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>` (suite uses floating branch tags + Watchtower; semver is not used at suite level today)	`git.azaion.com/azaion/gps-denied-onboard:main-arm`
Local dev	Image name without registry prefix	`gps-denied-onboard/companion:dev` (current local compose), `gps-denied-onboard/operator-orchestrator:dev`, `gps-denied-onboard/mock-suite-sat-service:dev`

No :latest tag in CI. Suite contract is <branch>-<arch> only; Watchtower polls these floating tags.
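The tag contract can be expressed as a small formatter. The function name image_ref is hypothetical, and the rejection of branch names that are not valid Docker tags (for example ones containing a slash) is an assumption about how a caller might want to fail; the plan does not say how the suite handles such branches.

```python
import re

# Docker tag grammar: letters, digits, underscore, period, hyphen;
# must not start with a period or hyphen; at most 128 characters.
_TAG = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def image_ref(registry_host: str, service: str, branch: str, arch: str) -> str:
    """Format <registry>/azaion/<service>:<branch>-<arch> per the suite contract."""
    tag = f"{branch}-{arch}"
    if not _TAG.fullmatch(tag):
        raise ValueError(f"{tag!r} is not a valid Docker tag")
    return f"{registry_host}/azaion/{service}:{tag}"
```

For example, image_ref("git.azaion.com", "gps-denied-onboard", "dev", "arm") yields the per-branch example from the table.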

.dockerignore (existing — audit + recommended addenda)

The current .dockerignore (33 lines, root) covers .git, .venv, build artefacts, *.engine / *.calib / *.index / *.faiss / *.onnx, large test fixtures, _docs/, and editor noise. Adequate for cycle-1. Recommended next-cycle additions (logged here, not applied):

# Next-cycle additions to .dockerignore (not applied in cycle-1)
# Rules + skills do not belong in any image.
.cursor/
# Already excluded; keep.
_docs/
# Don't accidentally ship dev compose into the production image.
docker-compose*.yml
# Test harness compose + fixtures stay out of production images.
e2e/
# Test code stays out of production images (currently NOT excluded).
tests/
# README / docs are not needed at runtime.
*.md

Note: tests/ is currently NOT in .dockerignore, which is intentional for cycle-1: the e2e-runner images (tests/e2e/Dockerfile, tests/e2e/Dockerfile.jetson) COPY tests/ into the image, so excluding it at the root would break those builds. Splitting the ignore rules per image is a next-cycle refactor; BuildKit supports a Dockerfile-specific ignore file named <Dockerfile-name>.dockerignore placed next to the Dockerfile.

Health Checks — Inventory

Image	Endpoint / Command	Cadence
`companion-tier1`, `companion-jetson`, `operator-orchestrator`	`python3 -m gps_denied_onboard.healthcheck` (the module already exists per the existing Dockerfiles)	`--interval=10s --timeout=3s --start-period={15,30,10}s --retries=3`
`mock-suite-sat-service`	HTTP GET `/healthz` on port 5100	`--interval=5s --timeout=2s --retries=3`
`db` (Postgres 16, suite-managed under Tier-2; root compose for Tier-1)	`pg_isready -U gps_denied -d gps_denied`	`--interval=5s --timeout=3s --retries=10`

Self-verification

Every component is mapped to its image (companion-tier1 / companion-jetson for C1–C8 + C13; operator-orchestrator for C10 + C11 + C12; mock-suite-sat-service for the e2e fixture)
Multi-stage builds specified for companion-tier1 (4 stages, existing) and companion-jetson (4 stages, planned)
Non-root user planned for companion-jetson (Tier-2 production); Tier-1 dev / operator-orchestrator stays root for now (next-cycle harden)
Health checks defined for every service
docker-compose.yml covers all components + dependencies (existing)
docker-compose.test.yml enables black-box testing (existing; Tier-1 + Tier-2 jetson variants)
.dockerignore defined (existing; next-cycle additions logged)
Tier-2 production delivery shape resolved (Option B; ADR-005 amendment drafted; Step 2 validation gates queued)
Image tagging strategy aligned with suite-mandated ${REGISTRY_HOST}/azaion/<service>:<branch>-<arch> contract

Next Steps

User confirms this containerization plan (BLOCKING gate per the deploy skill Step 2).
Author docker/companion-jetson.Dockerfile — implementation task for the next cycle (existing-code Step 9 New Task → Step 10 Implement). Will be one of the first follow-up tickets when autodev's Done step reroutes to the existing-code flow.
Coordinate parent-suite edit — ../_infra/deploy/jetson/docker-compose.yml gps-denied-onboard service block needs the additional volume mounts (fdr-data, tile-data, /run/azaion, FC + camera device passthrough). This is a cross-submodule change tracked as a follow-up; record in _docs/_process_leftovers/ if not editable in this cycle.
Proceed to Step 3 (CI/CD pipeline) — author .woodpecker/01-test.yml (Python pytest + Tier-1 e2e via existing docker-compose.test.yml) + .woodpecker/02-build-push.yml (multi-arch matrix, companion-jetson.Dockerfile once it lands; until then, ship only operator-orchestrator + companion-tier1 for the test path). Rewrite _docs/02_document/deployment/ci_cd_pipeline.md against the actual Woodpecker + Gitea Packages stack per suite ../_infra/ci/README.md.
