# GPS-Denied Onboard — Deployment Procedures

> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 6. Builds on
> Steps 1–5 (`reports/deploy_status_report.md`, `containerization.md`,
> `ci_cd_pipeline.md`, `environment_strategy.md`, `observability.md`). The
> deploy skill's standard procedure template (load-balanced HTTP service
> with blue-green / rolling / canary patterns) is adapted here for the
> system's actual topology: single airborne instance + single operator
> workstation, ground-only updates, FC-managed in-flight failsafe, and the
> parent-suite Watchtower flow with a flight-state gate.

## Deployment Strategy

### Pattern: **Floating-tag pull-on-ground (Watchtower-managed)**

| Aspect | Choice | Rationale |
|--------|--------|-----------|
| Update mechanism (airborne Jetson) | Parent-suite Watchtower polls `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm`; pulls + restarts when SHA changes | Suite-mandated pattern per `../_infra/deploy/jetson/README.md`. The fielded Jetson stack has Watchtower already running, polling all 9 application services on the same cadence. |
| Update mechanism (operator workstation) | Operator runs `docker compose pull && docker compose up -d` from `scripts/start-services.sh` | The operator workstation is single-user; cycle-1 does not need automatic updates. Cycle-2 may add a Watchtower instance on the workstation. |
| Update mechanism (lab Jetson — staging) | Same as airborne (Watchtower polling `dev-arm` or `stage-arm`) | Mirrors airborne so the bench rig validates the exact same update path. |
| Blue-green / rolling / canary | **None of the above** — N=1 instance per role | The airborne side has one Jetson per aircraft (no fleet); the operator workstation has one instance per operator. There is no load-balanced replica to roll over. |
| Zero-downtime requirement | **Not applicable in flight**; ground-only | Flights are discrete + bounded; the FC handles in-flight failsafe (AC-FC-FAILSAFE-1) if the companion is unavailable mid-flight. Updates do not happen during flight. |
| Ground-only safety gate | `/run/azaion/in-flight` flag (parent-suite `autopilot` service writes it on arm/disarm) | **Watchtower's post-update hook MUST refuse to restart the `gps-denied-onboard` container when this flag is set.** Honoured at the suite-compose layer, not in this submodule's image (the image only honours the flag at boot when transitioning between strategies). |
| Multi-aircraft rollout | Tag-based per-aircraft (operator can pin `:rev--arm` instead of `:main-arm`) | Floating tag is the default; explicit SHA pinning is the manual override. Suite operator owns per-aircraft pinning. |

### Graceful Shutdown

The companion has **no inbound HTTP connections** (NFT-SEC-05 in-flight egress lockdown). "Graceful shutdown" means: drain in-flight FDR writes, flush the C13 segment, emit `flight_footer`, close MAVLink connection cleanly.

| Step | Action | Owner |
|------|--------|-------|
| 1 | systemd / Docker sends `SIGTERM` to PID 1 (`python3 -m gps_denied_onboard.runtime_root`) | OS layer |
| 2 | Runtime root sets the global `shutting_down` flag; all per-frame producers stop enqueuing new FDR records | runtime root |
| 3 | C13 writer drains the FDR SPSC ring (≤ 200 ms target — bounded by ring depth + writer throughput) | C13 |
| 4 | C13 emits `flight_footer` with `clean_shutdown=true`, `records_written`, `records_dropped_overrun`, `bytes_written`, `rollover_count` | C13 |
| 5 | C13 closes the active segment file (fsync, rename `.tmp` → final) | C13 |
| 6 | C8 sends final MAVLink `STATUSTEXT` and closes the FC serial connection | C8 |
| 7 | Process exits 0 | runtime root |

**Termination grace period (target)**: 30 seconds for the above sequence.
If exceeded, Docker / systemd sends `SIGKILL`; `flight_footer.clean_shutdown` will be `false` on the next boot's recovery write, flagging the unclean shutdown for the post-flight summary.

**Cycle-1 status**: `docker-compose.yml` does **not** yet declare `stop_grace_period: 30s` — cycle-1 inherits Docker's default 10 s grace. The C13 ring drain target (≤ 200 ms) fits comfortably inside 10 s for the dev profile, but TensorRT engine teardown + gtsam factor cleanup on Tier-2 hardware are not yet measured.

**Cycle-2 follow-up** (recorded in `_docs/_process_leftovers/` when this deploy plan lands): add `stop_grace_period: 30s` to the `companion` service in `docker-compose.yml` and to the `gps-denied-onboard` service in the parent-suite `../_infra/deploy/jetson/docker-compose.yml` once the Step 2 validation gate "TensorRT INT8 cache durability under Docker" (`containerization.md` § Step 2 Validation Gates) measures the actual drain budget on the Jetson.

### Database Migration Ordering

Cycle-1 ships **no migration runner** — C6 bootstrap uses idempotent `CREATE TABLE IF NOT EXISTS`.
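
To illustrate why a bootstrap-only schema needs no migration runner, and where that stops being true, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for Postgres, and the `tiles` table and its columns are hypothetical, not the real C6 schema.

```python
import sqlite3

# Hypothetical table; the real C6 schema is not shown in this document.
DDL_V1 = "CREATE TABLE IF NOT EXISTS tiles (tile_id TEXT PRIMARY KEY, payload BLOB)"
# A later release that wants an extra column.
DDL_V2 = (
    "CREATE TABLE IF NOT EXISTS tiles "
    "(tile_id TEXT PRIMARY KEY, payload BLOB, fetched_at TEXT)"
)


def bootstrap(conn, ddl):
    """Run the bootstrap DDL inside a transaction; safe to call on every boot."""
    with conn:
        conn.execute(ddl)


def column_names(conn, table):
    """Return the column names of an existing table (SQLite-specific PRAGMA)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


conn = sqlite3.connect(":memory:")
bootstrap(conn, DDL_V1)                      # first boot creates the table
with conn:
    conn.execute("INSERT INTO tiles VALUES (?, ?)", ("t1", b"\x00"))
bootstrap(conn, DDL_V1)                      # second boot: no error, data kept
row_count = conn.execute("SELECT COUNT(*) FROM tiles").fetchone()[0]

bootstrap(conn, DDL_V2)                      # new DDL is skipped: table exists
columns = column_names(conn, "tiles")
```

Re-running the same DDL is a no-op, which is what makes the cycle-1 bootstrap safe on every boot. The same property means a changed table definition is silently ignored once the table exists, which is the gap the cycle-2+ migration rules address.
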
Cycle-2+ rules (from `environment_strategy.md` § Migration Rules):

| Rule | Cycle-1 status | Cycle-2+ enforcement |
|------|----------------|----------------------|
| Migrations run **before** new code deploys | n/a — bootstrap-only | Alembic (or equivalent) migration step runs against staging first, then production, before the corresponding image pull is enabled |
| All migrations must be backward-compatible | n/a | Required: new schema works with previous image's read path until next release rotates both |
| Irreversible migrations require explicit operator approval | n/a | Required: Woodpecker UI approval gate + recorded in `_docs/04_deploy/migration_log.md` |
| Production migrations on the airborne Jetson refuse to run when `/run/azaion/in-flight` is set | n/a | Required: migration tool reads the flag at start; aborts with exit 0 + journald audit line if the flag is set |
| Production migrations on the operator workstation require operator approval | n/a | Required: interactive prompt in `start-services.sh` before applying |

## Health Checks

The companion has no HTTP `/health/live` or `/health/ready` endpoint (NFT-SEC-05). The Docker `HEALTHCHECK` is an **exec check** that re-runs the startup validation matrix (`environment_strategy.md` § Variable Validation) and inspects in-process liveness signals.

| Check | Type | Command / mechanism | Interval | Failure threshold | Action |
|-------|------|----------------------|----------|--------------------|--------|
| Liveness / Readiness | `HEALTHCHECK` exec | `python3 -m gps_denied_onboard.healthcheck` | 10 s (companion-tier1 / operator-orchestrator); 10 s (companion-jetson, with `--start-period=30s` for TensorRT engine deserialise) | 3 consecutive failures → Docker marks container `unhealthy` → systemd / Watchtower restarts | Restart only; no load balancer to drain. Watchtower honours `/run/azaion/in-flight` before restarting. |
| Startup probe | Same exec | Same command | 5 s once `--start-period` elapses | 30 attempts max | Kill + recreate; Watchtower retries the pull on next poll |
| FC adapter health (in-flight) | C8 watchdog from the FC | MAVLink heartbeat | n/a — handled by the FC | Heartbeat loss > 1 s | FC drops to `SAFE_DEAD_RECKONING` or `RTL` per AC-FC-FAILSAFE-1 |
| FDR ring liveness | `shared.fdr_client` overrun monitor | Emits `kind="overrun"` record (AC-NEW-3); never silent | n/a | Producer enqueue failure | Post-flight forensics surface; no in-flight action |
| `db` Postgres health (operator workstation + dev compose) | exec | `pg_isready -U gps_denied -d gps_denied` | 5 s | 10 failures | Docker / systemd restart the `db` service; the companion's healthcheck fails until DB is back |
| `mock-suite-sat-service` health (Tier-1 e2e only) | HTTP | GET `/healthz` on port 5100 | 5 s | 3 failures | Compose marks unhealthy; e2e-runner `--exit-code-from e2e-runner` surfaces failure |

### `python3 -m gps_denied_onboard.healthcheck` contract

The healthcheck module (already exists per `containerization.md`) re-runs:

1. **Required env vars validation** — same set as the composition root, but read-only (no side effects).
2. **C6 DB reachability** — `psycopg2.connect(DB_URL) → SELECT 1`.
3. **C13 FDR mount writability** — `os.access(FDR_PATH, os.W_OK)` + a probe write to a `.healthcheck` file.
4. **C7 backend availability** — for `INFERENCE_BACKEND=tensorrt`, validates the engine cache directory exists + is readable; for `pytorch_fp16`, no extra check (libtorch in-process).
5. **C8 FC adapter** — best-effort: attempts a non-blocking serial open if `GPS_DENIED_FC_PROFILE` is set + the device path is present. Absent device path is not a failure (dev / CI containers).

Exit codes: `0` healthy; `1` config-invalid; `2` dependency-unreachable; `3` resource-bound (e.g. FDR full). Docker treats any non-zero as `unhealthy`. Note that the Dockerfile `HEALTHCHECK` reference documents only `0` and `1` and marks `2` as reserved, so the finer-grained codes are a project convention, not a Docker one.
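
The contract above can be sketched as a small exit-code dispatcher. This is a minimal illustration under stated assumptions, not the module's actual code: `HealthExit`, `REQUIRED_ENV`, `TRT_ENGINE_CACHE_DIR`, and the injectable `db_probe` callable are hypothetical names, the C8 serial check is omitted, and mapping an unwritable FDR mount to exit code `3` is an assumption.

```python
import os
from enum import IntEnum
from pathlib import Path


class HealthExit(IntEnum):
    HEALTHY = 0
    CONFIG_INVALID = 1
    DEPENDENCY_UNREACHABLE = 2
    RESOURCE_BOUND = 3


# Hypothetical required set; the real list lives in the composition root.
REQUIRED_ENV = ("DB_URL", "FDR_PATH", "INFERENCE_BACKEND")


def missing_env(env):
    """Read-only validation: list required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]


def fdr_writable(fdr_path):
    """os.access() plus a real probe write, since access() alone can be optimistic."""
    if not os.access(fdr_path, os.W_OK):
        return False
    probe = Path(fdr_path) / ".healthcheck"
    try:
        probe.write_bytes(b"ok")
        probe.unlink()
        return True
    except OSError:
        return False


def backend_available(env):
    """TensorRT needs a readable engine cache directory; pytorch_fp16 needs nothing."""
    if env.get("INFERENCE_BACKEND") != "tensorrt":
        return True
    cache_dir = env.get("TRT_ENGINE_CACHE_DIR", "")  # hypothetical variable name
    return os.path.isdir(cache_dir) and os.access(cache_dir, os.R_OK)


def run_healthcheck(env, db_probe=lambda url: True):
    """Return the exit code of the first failing check, or HEALTHY.

    db_probe stands in for the `SELECT 1` round-trip so the sketch stays
    free of third-party imports.
    """
    if missing_env(env):
        return HealthExit.CONFIG_INVALID
    try:
        db_ok = db_probe(env["DB_URL"])
    except Exception:
        db_ok = False
    if not db_ok or not backend_available(env):
        return HealthExit.DEPENDENCY_UNREACHABLE
    if not fdr_writable(env["FDR_PATH"]):  # assumption: treated as resource-bound
        return HealthExit.RESOURCE_BOUND
    return HealthExit.HEALTHY
```

A module entry point would then call `sys.exit(int(run_healthcheck(dict(os.environ), db_probe=...)))`. Checking configuration first keeps the codes ordered from cheapest to most expensive check, so a misconfigured container fails fast without touching the database or the FDR mount.
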

## Staging Deployment (lab Jetson HITL)

Treat the lab Jetson as a **mirror of production** for image promotion. Operator runs the procedure manually; cycle-2 may automate via the suite.

1. **CI/CD** has already built + pushed `${REGISTRY_HOST}/azaion/gps-denied-onboard-companion-tier1:dev-arm` + `…-operator-orchestrator:dev-arm` via `.woodpecker/02-build-push.yml` (cycle-1) or `companion-jetson:dev-arm` via cycle-2.
2. **Verify the flag** — `cat /run/azaion/in-flight` should be empty / absent on the lab Jetson (no live FC there). If a HITL session is running, wait for the bench session to end.
3. **Pull the new image** — `scripts/pull-images.sh dev` (Step 7). Watchtower may have already pulled if running on the lab Jetson.
4. **Restart the service** — `scripts/start-services.sh dev` (Step 7). Honours stop-grace-period; waits for HEALTHCHECK to report healthy.
5. **Run the HITL e2e suite** — `docker compose -f docker-compose.test.jetson.yml up --abort-on-container-exit --exit-code-from e2e-runner --build`. This runs the **Reality Gate** replay (Derkachi clip + recorded tlog) against the new image on Tier-2 hardware.
6. **Verify FDR output** — `python3 -m gps_denied_onboard.post_flight.summarise --segment /var/lib/gps-denied/fdr/segment-*.fdr` (cycle-1 ad-hoc tool; cycle-2 polish lands the full replay viewer). Confirm `flight_footer.clean_shutdown == true` and `records_dropped_overrun == 0`.
7. **If gates pass** → promote: tag `${REGISTRY_HOST}/azaion/gps-denied-onboard:-arm` (or repurpose by branch promotion from `dev-arm` → `stage-arm` once cycle-2 wires environment branches per `ci_cd_pipeline.md` Quality Gates `Multi-environment deployment` row).
8. **If gates fail** → file a Jira issue under E-DEPLOY; roll back the lab Jetson per § Rollback Procedures.

## Production Deployment (airborne Jetson + operator workstation)

Production deployment lands on each aircraft individually + on each operator workstation.
The aircraft side is Watchtower-driven; the operator workstation side is operator-driven.

### Pre-deploy checks (operator-owned)

- [ ] **CI gates green** — `01-test.yml` passed on the target branch (cycle-1: manual trigger; cycle-2: push gate).
- [ ] **Security scan recent** — `_docs/05_security/dependency_scan.md` re-validated against the build SHA. The OpenCV pin per `_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md` is honoured.
- [ ] **HITL gate passed** — Staging deployment § 5–6 confirmed `clean_shutdown=true` and `records_dropped_overrun=0`.
- [ ] **Per-aircraft acceptance** — operator confirms the build's strategy flags (`BUILD_VINS_MONO`, `BUILD_SALAD`, `BUILD_C11_TILE_MANAGER`, replay flags, `BUILD_DEV_STATIC_KEY=OFF`) match the operational profile for the destination aircraft.
- [ ] **Calibration JSON onboard** — `/etc/gps-denied/calibration/adti20.json` (operator-acquired per D-PROJ-1) is staged on the aircraft Jetson NVM.
- [ ] **Signing key path provisioned** — `MAVLINK_SIGNING_KEY` resolves to a per-host writable path that `KeySource` will rotate at takeoff; no static key from `tests/fixtures/`.
- [ ] **Postgres credentials in `/etc/gps-denied/.pgpass`** — per-host random password (Step 7 `start-services.sh` writes this on first run).
- [ ] **`/run/azaion/in-flight` is clear** — no live flight in progress on the target aircraft.
- [ ] **Rollback target identified** — previous successful SHA recorded for the target aircraft (operator notebook + `journalctl -g AZAION_UPDATE_EVENT` on the Jetson).
- [ ] **Stakeholders notified** — flight operator + suite operator informed of the deploy window.

### Production Deployment — Airborne Jetson (Watchtower-driven)

1. **Tag promotion** — operator pushes the validated SHA to `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm` (or per-aircraft SHA pin if rolling out partial fleet).
2. **Wait for Watchtower poll** — default poll interval per suite config (typically ≤ 5 min).
3. **Watchtower pre-restart check** — Watchtower's post-update hook checks `/run/azaion/in-flight`; if set, defers the restart until the next poll.
4. **Container stop** — Docker sends `SIGTERM`; companion drains FDR (≤ 200 ms target) + emits `flight_footer` per § Graceful Shutdown. Exit must complete within 30 s grace period.
5. **Image pull complete** — Watchtower pulls the new image (already verified-by-tag; OCI labels embed the SHA).
6. **Container start** — Docker starts the new container; `HEALTHCHECK` `--start-period=30s` allows TensorRT engine deserialise + Postgres reconnect.
7. **Audit event emitted** — Watchtower's post-update hook emits `AZAION_UPDATE_EVENT` to journald (`observability.md` § Deploy Audit).
8. **Verify on the aircraft** — operator runs `journalctl -g AZAION_UPDATE_EVENT --since "10 min ago"` on the Jetson; confirms the new revision SHA matches the intended tag.
9. **Run a ground HITL pre-flight** — operator brings up the bench-mounted aircraft, runs the standard pre-flight checklist (FC heartbeat, signing handshake, camera focus, NFT-SEC-04 image-decode smoke). Pre-flight refusal-to-arm on any gate failure is the production safety net.
10. **Monitor the first flight** — operator watches QGroundControl for STATUSTEXT messages from the companion + the `GpsDeniedHealth` MAVLink message stream during the first flight under the new image.
11. **Post-flight forensics** — after landing, operator pulls FDR segments + runs `post_flight.summarise`; confirms no regression vs the previous-SHA baseline (NFT-PERF gates per `_docs/02_document/tests/` baselines).

### Production Deployment — Operator Workstation (operator-driven)

1. **Pre-deploy checks** — same checklist as above, scoped to the operator-orchestrator image.
2. **Pull** — operator runs `scripts/pull-images.sh main` (Step 7).
3. **Stop** — `scripts/stop-services.sh` (Step 7) gracefully stops the operator-orchestrator service.
4. **Start** — `scripts/start-services.sh main` (Step 7) brings the new image up. `HEALTHCHECK` `--start-period=10s` allows DB reconnect.
5. **Audit** — `journalctl -g AZAION_UPDATE_EVENT --since "10 min ago"` on the operator workstation confirms the new revision.
6. **Smoke test** — operator runs the C12 `--flight-file ` path against a known-good flight DTO; verifies the `FlightsApiClient` round-trip succeeds.

### Post-deploy monitoring window

| Window | What to watch | Action on regression |
|--------|---------------|----------------------|
| First 15 min | journald `AZAION_UPDATE_EVENT` cadence; container `HEALTHCHECK` status | Roll back immediately (§ Rollback Procedures) |
| First flight (airborne) | QGC STATUSTEXT + `GpsDeniedHealth` MAVLink stream; FDR `overrun` count | Operator aborts flight if `GpsDeniedHealth` degrades; FC failsafe is the safety net |
| First post-flight pull (airborne) | FDR `flight_footer.clean_shutdown` flag; `records_dropped_overrun`; per-component `tile_match`, `c6.eviction_batch` baselines | If `clean_shutdown=false` or baselines drifted → roll back; required post-mortem |

## Rollback Procedures

### Trigger Criteria

| Severity | Trigger | Decision lead |
|----------|---------|---------------|
| **Immediate rollback** | New image fails `HEALTHCHECK` within 5 minutes of `AZAION_UPDATE_EVENT`; or `flight_footer.clean_shutdown=false` on the first flight under the new image | Flight operator (airborne) / Suite operator (workstation) |
| **Same-day rollback** | NFT-PERF baseline regression > 10% (frame deadline miss rate, end-to-end pose latency); FDR `records_dropped_overrun` > 0 above per-flight threshold; sustained `c6.eviction_batch` activity > baseline | Operator + GPS-Denied Onboard owner |
| **Manual rollback** | Operator judgement (visible operational anomaly without a clear FDR signal) | Operator |

### Rollback Steps (airborne Jetson)

1. **Confirm the flag** — `/run/azaion/in-flight` is clear.
If a flight is live, the FC's failsafe + operator's QGC abort path take precedence; rollback happens after landing.
2. **Identify the previous-good SHA** — `journalctl -g AZAION_UPDATE_EVENT --since "24 hours ago"` on the affected Jetson shows the last successful revision.
3. **Tag rollback** — operator retags the registry: `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm` → previous SHA. (Cycle-1: operator pulls + retags via the registry UI; cycle-2: scripted via `scripts/deploy.sh rollback `.)
4. **Wait for Watchtower** — next poll detects the SHA change + pulls the previous image.
5. **Verify** — `journalctl -g AZAION_UPDATE_EVENT --since "10 min ago"` shows the rollback revision; companion `HEALTHCHECK` is healthy.
6. **DB rollback** — cycle-1: not applicable (bootstrap-only schema). Cycle-2+: if the new image applied a migration, run the DOWN script if reversible; otherwise escalate to GPS-Denied Onboard owner + suite operator before proceeding.
7. **Notify** — stakeholders informed; rollback flagged for post-mortem within 24 hours.

### Rollback Steps (operator workstation)

1. `scripts/stop-services.sh` (Step 7) stops the operator-orchestrator service.
2. Operator runs `scripts/pull-images.sh ` (Step 7).
3. `scripts/start-services.sh ` (Step 7) brings the previous image up.
4. Verify via `HEALTHCHECK` + offline `--flight-file` smoke.
5. DB rollback as above (cycle-1 n/a; cycle-2+ per migration tool).
6. Notify suite operator.

### Post-mortem (required after every production rollback)

Recorded in `_docs/_process_leftovers/__rollback.md` and replayed at the next `/autodev` invocation per `.cursor/rules/tracker.mdc` Leftovers Mechanism. Contents:

- **Timeline** — `AZAION_UPDATE_EVENT` deploy event → first failure observation → rollback completion.
- **Root cause** — pulled from FDR + journald + Woodpecker pipeline.
- **What went wrong** — gate that should have caught it (CI? HITL? Pre-flight checklist?).
- **Prevention** — concrete checklist edit or test addition.

Lessons appended to `_docs/LESSONS.md` per the autodev retrospective conventions.

## Deployment Checklist

The pre-deploy checklist above is the canonical one. Repeating it here in the standard skill format for traceability:

- [ ] All CI tests pass on the target branch (cycle-1: `01-test.yml` manual run; cycle-2: push gate)
- [ ] Security scan clean — re-validated against current pins; OpenCV CVE replay condition checked (`_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md`)
- [ ] Docker images built + pushed under `${REGISTRY_HOST}/azaion/:-`; OCI labels + `AZAION_REVISION` env stamped per AZ-204
- [ ] Database migrations (cycle-2+): reviewed, tested, backward-compatible, flight-state-gated, operator-approved
- [ ] Environment variables configured per-environment per `environment_strategy.md` § Environment Variables
- [ ] Health check (`python3 -m gps_denied_onboard.healthcheck`) returns 0 on a dry-run against the target image
- [ ] Observability touchpoints active: `LOG_SINK` honoured, FDR mount writable, `jetson-stats` accessible inside the container (Tier-2)
- [ ] Rollback plan documented — previous-good SHA recorded; rollback steps reviewed
- [ ] Stakeholders notified of deployment window (flight operator + suite operator + GPS-Denied Onboard owner)
- [ ] Operator available during the post-deploy monitoring window (first 15 minutes + first flight)

## Self-verification

- [x] Deployment strategy chosen (Watchtower floating-tag pull-on-ground) and justified (single instance per role, ground-only updates, FC-managed in-flight failsafe)
- [x] Zero-downtime stance: **not applicable in flight**; ground-only — explicitly justified
- [x] Health checks defined (exec-based `HEALTHCHECK` covering liveness + readiness; FC watchdog covers in-flight liveness via FC failsafe)
- [x] Rollback trigger criteria (immediate / same-day / manual) + steps for both airborne and operator workstation
- [x] Deployment checklist complete and grounded in the project's actual gates (`AZAION_UPDATE_EVENT` audit, CVE replay, `/run/azaion/in-flight` flag, signing key provisioning)
- [x] Post-mortem path defined and tied to the `_docs/_process_leftovers/` + `_docs/LESSONS.md` mechanism
- [x] Graceful-shutdown sequence covers the FDR-flush + `flight_footer.clean_shutdown` invariants

## BLOCKING — User Confirmation Required

This is the deploy skill Step 6 BLOCKING gate per `.cursor/skills/deploy/SKILL.md` § Methodology Quick Reference. Step 7 (Deployment Scripts) writes executable shell scripts that automate the procedures above; user confirmation that the procedure is correct is required before scripts are generated.