gps-denied-onboard/_docs/04_deploy/deployment_procedures.md
Oleksandr Bezdieniezhnykh bf13549b32
[autodev] Update configuration and documentation for cycle-1
- Enhanced `.env.example` with detailed CMake build flags and replay-mode strategy flags for development and CI environments.
- Updated `.gitignore` to include a new deploy rollback bookmark.
- Revised `_docs/_autodev_state.md` to reflect the current task status and steps.
- Added new lessons to `_docs/LESSONS.md` regarding testing and architectural improvements.
- Documented changes in `_docs/02_document/deployment/ci_cd_pipeline.md` to reflect the relaxed OpenCV version pin.
- Updated test data documentation in `_docs/02_document/tests/test-data.md` to clarify fixture usage and paths.

This commit continues the cycle-1 documentation sync and addresses various configuration updates for improved clarity and functionality.
2026-05-20 08:05:35 +03:00


GPS-Denied Onboard — Deployment Procedures

Generated by /autodev greenfield Step 16 (Deploy) — Step 6. Builds on Step 15 (reports/deploy_status_report.md, containerization.md, ci_cd_pipeline.md, environment_strategy.md, observability.md). The deploy skill's standard procedure template (load-balanced HTTP service with blue-green / rolling / canary patterns) is adapted here for the system's actual topology: single airborne instance + single operator workstation, ground-only updates, FC-managed in-flight failsafe, and the parent-suite Watchtower flow with a flight-state gate.

Deployment Strategy

Pattern: Floating-tag pull-on-ground (Watchtower-managed)

| Aspect | Choice | Rationale |
|---|---|---|
| Update mechanism (airborne Jetson) | Parent-suite Watchtower polls ${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm; pulls + restarts when SHA changes | Suite-mandated pattern per ../_infra/deploy/jetson/README.md. The fielded Jetson stack has Watchtower already running, polling all 9 application services on the same cadence. |
| Update mechanism (operator workstation) | Operator runs docker compose pull && docker compose up -d from scripts/start-services.sh | The operator workstation is single-user; cycle-1 does not need automatic updates. Cycle-2 may add a Watchtower instance on the workstation. |
| Update mechanism (lab Jetson — staging) | Same as airborne (Watchtower polling dev-arm or stage-arm) | Mirrors airborne so the bench rig validates the exact same update path. |
| Blue-green / rolling / canary | None of the above — N=1 instance per role | The airborne side has one Jetson per aircraft (no fleet); the operator workstation has one instance per operator. There is no load-balanced replica to roll over. |
| Zero-downtime requirement | Not applicable in flight; ground-only | Flights are discrete + bounded; the FC handles in-flight failsafe (AC-FC-FAILSAFE-1) if the companion is unavailable mid-flight. Updates do not happen during flight. |
| Ground-only safety gate | /run/azaion/in-flight flag (parent-suite autopilot service writes it on arm/disarm) | Watchtower's post-update hook MUST refuse to restart the gps-denied-onboard container when this flag is set. Honoured at the suite-compose layer, not in this submodule's image (the image only honours the flag at boot when transitioning between strategies). |
| Multi-aircraft rollout | Tag-based per-aircraft (operator can pin :rev-<sha>-arm instead of :main-arm) | Floating tag is the default; explicit SHA pinning is the manual override. Suite operator owns per-aircraft pinning. |
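
The ground-only safety gate can be illustrated with a minimal Python sketch. It assumes the flag counts as "set" when the file exists and has non-empty content; the source only names the path, so that semantics and the helper names `in_flight` and `restart_allowed` are hypothetical.

```python
from pathlib import Path

# Path taken from the strategy table; the "set" semantics below are an assumption.
IN_FLIGHT_FLAG = Path("/run/azaion/in-flight")


def in_flight(flag: Path = IN_FLIGHT_FLAG) -> bool:
    """Return True when the flag file exists and is non-empty (assumed meaning of "set")."""
    try:
        return flag.read_text().strip() != ""
    except FileNotFoundError:
        return False


def restart_allowed(flag: Path = IN_FLIGHT_FLAG) -> bool:
    """A restart of the companion container is only allowed on the ground."""
    return not in_flight(flag)
```

A hook at the suite-compose layer could call `restart_allowed()` and defer the restart to the next poll when it returns `False`.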

Graceful Shutdown

The companion has no inbound HTTP connections (NFT-SEC-05 in-flight egress lockdown). "Graceful shutdown" means: drain in-flight FDR writes, flush the C13 segment, emit flight_footer, close MAVLink connection cleanly.

| Step | Action | Owner |
|---|---|---|
| 1 | systemd / Docker sends SIGTERM to PID 1 (python3 -m gps_denied_onboard.runtime_root) | OS layer |
| 2 | Runtime root sets the global shutting_down flag; all per-frame producers stop enqueuing new FDR records | runtime root |
| 3 | C13 writer drains the FDR SPSC ring (≤ 200 ms target — bounded by ring depth + writer throughput) | C13 |
| 4 | C13 emits flight_footer with clean_shutdown=true, records_written, records_dropped_overrun, bytes_written, rollover_count | C13 |
| 5 | C13 closes the active segment file (fsync, rename .tmp → final) | C13 |
| 6 | C8 sends final MAVLink STATUSTEXT and closes the FC serial connection | C8 |
| 7 | Process exits 0 | runtime root |
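
Steps 2 to 4 of this sequence can be sketched in Python as follows. This is an illustration only: a standard `queue.Queue` stands in for the FDR SPSC ring, a list stands in for the segment file, and the function names are hypothetical rather than the project's actual runtime-root or C13 API.

```python
import queue
import signal
import threading

shutting_down = threading.Event()  # stands in for the global shutting_down flag


def handle_sigterm(signum, frame):
    """Step 2: only set the flag; producers and the writer observe it."""
    shutting_down.set()


def install_handler():
    """Register the SIGTERM handler (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def drain_and_footer(ring, sink, written_so_far=0, dropped=0, rollovers=0):
    """Steps 3-4: drain pending records into the sink, then build the footer."""
    written = written_so_far
    nbytes = 0
    while True:
        try:
            record = ring.get_nowait()
        except queue.Empty:
            break
        sink.append(record)
        written += 1
        nbytes += len(record)
    return {
        "clean_shutdown": True,
        "records_written": written,
        "records_dropped_overrun": dropped,
        "bytes_written": nbytes,  # bytes written during the drain only, in this sketch
        "rollover_count": rollovers,
    }
```

Keeping the signal handler down to a single flag write means the drain itself runs in normal program flow, where it can be bounded and timed against the 200 ms target.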

Termination grace period (target): 30 seconds for the above sequence. If exceeded, Docker / systemd sends SIGKILL; flight_footer.clean_shutdown will be false on the next boot's recovery write, flagging the unclean shutdown for the post-flight summary.

Cycle-1 status: docker-compose.yml does not yet declare stop_grace_period: 30s — cycle-1 inherits Docker's default 10 s grace. The C13 ring drain target (≤ 200 ms) fits comfortably inside 10 s for the dev profile, but TensorRT engine teardown + gtsam factor cleanup on Tier-2 hardware are not yet measured. Cycle-2 follow-up (recorded in _docs/_process_leftovers/ when this deploy plan lands): add stop_grace_period: 30s to the companion service in docker-compose.yml and to the gps-denied-onboard service in the parent-suite ../_infra/deploy/jetson/docker-compose.yml once the Step 2 validation gate "TensorRT INT8 cache durability under Docker" (containerization.md § Step 2 Validation Gates) measures the actual drain budget on the Jetson.

Database Migration Ordering

Cycle-1 ships no migration runner — C6 bootstrap uses idempotent CREATE TABLE IF NOT EXISTS. Cycle-2+ rules (from environment_strategy.md § Migration Rules):

| Rule | Cycle-1 status | Cycle-2+ enforcement |
|---|---|---|
| Migrations run before new code deploys | n/a — bootstrap-only | Alembic (or equivalent) migration step runs against staging first, then production, before the corresponding image pull is enabled |
| All migrations must be backward-compatible | n/a | Required: new schema works with previous image's read path until next release rotates both |
| Irreversible migrations require explicit operator approval | n/a | Required: Woodpecker UI approval gate + recorded in _docs/04_deploy/migration_log.md |
| Production migrations on the airborne Jetson refuse to run when /run/azaion/in-flight is set | n/a | Required: migration tool reads the flag at start; aborts with exit 0 + journald audit line if the flag is set |
| Production migrations on the operator workstation require operator approval | n/a | Required: interactive prompt in start-services.sh before applying |
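
The flight-state rule for airborne migrations could look roughly like the sketch below. Cycle-1 ships no migration runner, so everything here is hypothetical: the function names, the audit wording, and the assumption that a line printed to stdout ends up in journald.

```python
from pathlib import Path


def migration_gate(flag: Path, log=print) -> bool:
    """Return True if migrations may run; otherwise log an audit line and return False."""
    if flag.exists() and flag.read_text().strip():
        # Hypothetical audit wording; the real line format is not defined in this document.
        log("migration skipped: in-flight flag is set")
        return False
    return True


def main(flag: Path = Path("/run/azaion/in-flight")) -> int:
    """Entry point sketch: a deferred migration exits 0 because it is not a failure."""
    if not migration_gate(flag):
        return 0
    # The real migration runner would be invoked here.
    return 0
```

Exiting 0 on a deferral keeps the supervising layer from treating "aircraft is flying" as a crash that needs a restart.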

Health Checks

The companion has no HTTP /health/live or /health/ready endpoint (NFT-SEC-05). The Docker HEALTHCHECK is an exec check that re-runs the startup validation matrix (environment_strategy.md § Variable Validation) and inspects in-process liveness signals.

| Check | Type and command / mechanism | Interval | Failure threshold | Action |
|---|---|---|---|---|
| Liveness / Readiness | HEALTHCHECK exec: python3 -m gps_denied_onboard.healthcheck | 10 s (companion-tier1 / operator-orchestrator); 10 s (companion-jetson, with --start-period=30s for TensorRT engine deserialise) | 3 consecutive failures → Docker marks container unhealthy → systemd / Watchtower restarts | Same as readiness — no load balancer to drain. Watchtower honours /run/azaion/in-flight before restarting. |
| Startup probe | Same exec, same command | 5 s once --start-period elapses | 30 attempts max | Kill + recreate; Watchtower retries the pull on next poll |
| FC adapter health (in-flight) | C8 watchdog from the FC | MAVLink heartbeat loss > 1 s | n/a — handled by the FC | FC drops to SAFE_DEAD_RECKONING or RTL per AC-FC-FAILSAFE-1 |
| FDR ring liveness | shared.fdr_client overrun monitor | Producer enqueue failure | n/a — emits kind="overrun" record (AC-NEW-3); never silent | Post-flight forensics surface; no in-flight action |
| db Postgres health (operator workstation + dev compose) | pg_isready -U gps_denied -d gps_denied | 5 s | 10 failures | Docker / systemd restart the db service; the companion's healthcheck fails until DB is back |
| mock-suite-sat-service health (Tier-1 e2e only) | HTTP GET /healthz on port 5100 | 5 s | 3 failures | Compose marks unhealthy; e2e-runner --exit-code-from e2e-runner surfaces failure |

python3 -m gps_denied_onboard.healthcheck contract

The healthcheck module (already exists per containerization.md) re-runs:

  1. Required env vars validation — same set as the composition root, but read-only (no side effects).
  2. C6 DB reachability — psycopg2.connect(DB_URL) → SELECT 1.
  3. C13 FDR mount writability — os.access(FDR_PATH, os.W_OK) + a probe write to a .healthcheck file.
  4. C7 backend availability — for INFERENCE_BACKEND=tensorrt, validates the engine cache directory exists + is readable; for pytorch_fp16, no extra check (libtorch in-process).
  5. C8 FC adapter — best-effort: attempts a non-blocking serial open if GPS_DENIED_FC_PROFILE is set + the device path is present. Absent device path is not a failure (dev / CI containers).

Exit codes: 0 healthy; 1 config-invalid; 2 dependency-unreachable; 3 resource-bound (e.g. FDR full). Docker treats any non-zero as unhealthy.
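
The exit-code contract can be sketched as a small check runner. This is a minimal illustration and not the existing healthcheck module: the names `Exit`, `check_env`, `check_fdr_writable`, and `run_checks` are hypothetical, and mapping a failed probe write to exit code 3 is an assumption based on the "FDR full" example.

```python
import os
from enum import IntEnum
from typing import Callable, Iterable, Mapping


class Exit(IntEnum):
    HEALTHY = 0
    CONFIG_INVALID = 1
    DEPENDENCY_UNREACHABLE = 2
    RESOURCE_BOUND = 3


def check_env(required: Iterable[str], env: Mapping[str, str] = os.environ) -> Exit:
    """Read-only validation: every required variable must be present and non-empty."""
    missing = [name for name in required if not env.get(name)]
    return Exit.CONFIG_INVALID if missing else Exit.HEALTHY


def check_fdr_writable(fdr_path: str) -> Exit:
    """Probe-write a .healthcheck file; any OSError is treated as resource-bound (assumption)."""
    probe = os.path.join(fdr_path, ".healthcheck")
    try:
        with open(probe, "w") as fh:
            fh.write("ok")
        os.remove(probe)
    except OSError:
        return Exit.RESOURCE_BOUND
    return Exit.HEALTHY


def run_checks(checks: Iterable[Callable[[], Exit]]) -> Exit:
    """Run checks in order and return the first non-healthy result."""
    for check in checks:
        result = check()
        if result is not Exit.HEALTHY:
            return result
    return Exit.HEALTHY
```

A module entry point would pass the result to `sys.exit(int(...))`; since Docker only distinguishes zero from non-zero, the distinct codes mainly help an operator reading the logs.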

Staging Deployment (lab Jetson HITL)

Treat the lab Jetson as a mirror of production for image promotion. Operator runs the procedure manually; cycle-2 may automate via the suite.

  1. CI/CD has already built + pushed ${REGISTRY_HOST}/azaion/gps-denied-onboard-companion-tier1:dev-arm + …-operator-orchestrator:dev-arm via .woodpecker/02-build-push.yml (cycle-1) or companion-jetson:dev-arm via cycle-2.
  2. Verify the flag — cat /run/azaion/in-flight should be empty / absent on the lab Jetson (no live FC there). If a HITL session is running, wait for the bench session to end.
  3. Pull the new image — scripts/pull-images.sh dev (Step 7). Watchtower may have already pulled if running on the lab Jetson.
  4. Restart the service — scripts/start-services.sh dev (Step 7). Honours stop-grace-period; waits for HEALTHCHECK to report healthy.
  5. Run the HITL e2e suite — docker compose -f docker-compose.test.jetson.yml up --abort-on-container-exit --exit-code-from e2e-runner --build. This runs the Reality Gate replay (Derkachi clip + recorded tlog) against the new image on Tier-2 hardware.
  6. Verify FDR output — python3 -m gps_denied_onboard.post_flight.summarise --segment /var/lib/gps-denied/fdr/segment-*.fdr (cycle-1 ad-hoc tool; cycle-2 polish lands the full replay viewer). Confirm flight_footer.clean_shutdown == true and records_dropped_overrun == 0.
  7. If gates pass → promote: tag ${REGISTRY_HOST}/azaion/gps-denied-onboard:<sha>-arm (or repurpose by branch promotion from dev-arm → stage-arm once cycle-2 wires environment branches per ci_cd_pipeline.md Quality Gates Multi-environment deployment row).
  8. If gates fail → file a Jira issue under E-DEPLOY; roll back the lab Jetson per § Rollback Procedures.
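
The FDR check in step 6 reduces to two conditions on the flight footer. A minimal sketch, assuming the footer has already been parsed into a dictionary (the on-disk segment format is not described here, and `footer_gate` is a hypothetical helper rather than part of post_flight.summarise):

```python
from typing import List


def footer_gate(footer: dict) -> List[str]:
    """Return the list of gate violations; an empty list means the gate passes."""
    problems = []
    if footer.get("clean_shutdown") is not True:
        problems.append("clean_shutdown is not true")
    if footer.get("records_dropped_overrun") != 0:
        problems.append("records_dropped_overrun is not 0")
    return problems
```

A missing field counts as a violation in this sketch, so a truncated footer cannot pass the gate by accident.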

Production Deployment (airborne Jetson + operator workstation)

Production deployment lands on each aircraft individually + on each operator workstation. The aircraft side is Watchtower-driven; the operator workstation side is operator-driven.

Pre-deploy checks (operator-owned)

  • CI gates green — 01-test.yml passed on the target branch (cycle-1: manual trigger; cycle-2: push gate).
  • Security scan recent — _docs/05_security/dependency_scan.md re-validated against the build SHA. The OpenCV pin per _docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md is honoured.
  • HITL gate passed — Staging Deployment § 5–6 confirmed clean_shutdown=true and records_dropped_overrun=0.
  • Per-aircraft acceptance — operator confirms the build's strategy flags (BUILD_VINS_MONO, BUILD_SALAD, BUILD_C11_TILE_MANAGER, replay flags, BUILD_DEV_STATIC_KEY=OFF) match the operational profile for the destination aircraft.
  • Calibration JSON onboard — /etc/gps-denied/calibration/adti20.json (operator-acquired per D-PROJ-1) is staged on the aircraft Jetson NVM.
  • Signing key path provisioned — MAVLINK_SIGNING_KEY resolves to a per-host writable path that KeySource will rotate at takeoff; no static key from tests/fixtures/.
  • Postgres credentials in /etc/gps-denied/.pgpass — per-host random password (Step 7 start-services.sh writes this on first run).
  • /run/azaion/in-flight is clear — no live flight in progress on the target aircraft.
  • Rollback target identified — previous successful SHA recorded for the target aircraft (operator notebook + journalctl -g AZAION_UPDATE_EVENT on the Jetson).
  • Stakeholders notified — flight operator + suite operator informed of the deploy window.
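
The per-aircraft acceptance item amounts to comparing the build's strategy flags with the destination aircraft's profile. A minimal sketch, assuming both are available as plain dictionaries of ON/OFF strings; the profile format and the helper name `flag_mismatches` are hypothetical.

```python
def flag_mismatches(build: dict, profile: dict) -> dict:
    """Map each mismatching flag to (actual, expected); an empty dict means acceptance."""
    expected = dict(profile)
    # The checklist requires BUILD_DEV_STATIC_KEY=OFF regardless of the aircraft profile.
    expected["BUILD_DEV_STATIC_KEY"] = "OFF"
    return {
        name: (build.get(name), want)
        for name, want in expected.items()
        if build.get(name) != want
    }
```

Forcing `BUILD_DEV_STATIC_KEY` into the expected set means a profile that forgets to mention it still cannot accept a build with the dev key enabled.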

Production Deployment — Airborne Jetson (Watchtower-driven)

  1. Tag promotion — operator pushes the validated SHA to ${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm (or per-aircraft SHA pin if rolling out partial fleet).
  2. Wait for Watchtower poll — default poll interval per suite config (typically ≤ 5 min).
  3. Watchtower pre-restart check — Watchtower's post-update hook checks /run/azaion/in-flight; if set, defers the restart until the next poll.
  4. Container stop — Docker sends SIGTERM; companion drains FDR (≤ 200 ms target) + emits flight_footer per § Graceful Shutdown. Exit must complete within 30 s grace period.
  5. Image pull complete — Watchtower pulls the new image (already verified-by-tag; OCI labels embed the SHA).
  6. Container start — Docker starts the new container; HEALTHCHECK --start-period=30s allows TensorRT engine deserialise + Postgres reconnect.
  7. Audit event emitted — Watchtower's post-update hook emits AZAION_UPDATE_EVENT to journald (observability.md § Deploy Audit).
  8. Verify on the aircraft — operator runs journalctl -g AZAION_UPDATE_EVENT --since "10min ago" on the Jetson; confirms the new revision SHA matches the intended tag.
  9. Run a ground HITL pre-flight — operator brings up the bench-mounted aircraft, runs the standard pre-flight checklist (FC heartbeat, signing handshake, camera focus, NFT-SEC-04 image-decode smoke). Pre-flight refusal-to-arm on any gate failure is the production safety net.
  10. Monitor the first flight — operator watches QGroundControl for STATUSTEXT messages from the companion + the GpsDeniedHealth MAVLink message stream during the first flight under the new image.
  11. Post-flight forensics — after landing, operator pulls FDR segments + runs post_flight.summarise; confirms no regression vs the previous-SHA baseline (NFT-PERF gates per _docs/02_document/tests/ baselines).

Production Deployment — Operator Workstation (operator-driven)

  1. Pre-deploy checks — same checklist as above, scoped to the operator-orchestrator image.
  2. Pull — operator runs scripts/pull-images.sh main (Step 7).
  3. Stop — scripts/stop-services.sh (Step 7) gracefully stops the operator-orchestrator service.
  4. Start — scripts/start-services.sh main (Step 7) brings the new image up. HEALTHCHECK --start-period=10s allows DB reconnect.
  5. Audit — journalctl -g AZAION_UPDATE_EVENT --since "10min ago" on the operator workstation confirms the new revision.
  6. Smoke test — operator runs the C12 --flight-file <offline_fixture> path against a known-good flight DTO; verifies the FlightsApiClient round-trip succeeds.

Post-deploy monitoring window

| Window | What to watch | Action on regression |
|---|---|---|
| First 15 min | journald AZAION_UPDATE_EVENT cadence; container HEALTHCHECK status | Roll back immediately (§ Rollback Procedures) |
| First flight (airborne) | QGC STATUSTEXT + GpsDeniedHealth MAVLink stream; FDR overrun count | Operator aborts flight if GpsDeniedHealth degrades; FC failsafe is the safety net |
| First post-flight pull (airborne) | FDR flight_footer.clean_shutdown flag; records_dropped_overrun; per-component tile_match, c6.eviction_batch baselines | If clean_shutdown=false or baselines drifted → roll back; required post-mortem |

Rollback Procedures

Trigger Criteria

| Severity | Trigger | Decision lead |
|---|---|---|
| Immediate rollback | New image fails HEALTHCHECK within 5 minutes of AZAION_UPDATE_EVENT; or flight_footer.clean_shutdown=false on the first flight under the new image | Flight operator (airborne) / Suite operator (workstation) |
| Same-day rollback | NFT-PERF baseline regression > 10% (frame deadline miss rate, end-to-end pose latency); FDR records_dropped_overrun > 0 above per-flight threshold; sustained c6.eviction_batch activity > baseline | Operator + GPS-Denied Onboard owner |
| Manual rollback | Operator judgement (visible operational anomaly without a clear FDR signal) | Operator |
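
The two automatic severities can be expressed as a small classifier. This is a sketch under stated assumptions: it uses a single lower-is-better metric (pose latency) to stand in for the NFT-PERF baselines, it omits the c6.eviction_batch criterion, and the function and parameter names are hypothetical.

```python
from typing import Optional

REGRESSION_LIMIT = 0.10  # "> 10%" from the trigger table


def relative_regression(baseline: float, observed: float) -> float:
    """Relative increase of a lower-is-better metric over its baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (observed - baseline) / baseline


def rollback_severity(
    healthcheck_failed_early: bool,
    clean_shutdown: bool,
    baseline_latency_ms: float,
    observed_latency_ms: float,
    dropped_overrun: int,
    dropped_threshold: int = 0,
) -> Optional[str]:
    """Return "immediate", "same-day", or None when no automatic trigger fires."""
    if healthcheck_failed_early or not clean_shutdown:
        return "immediate"
    if relative_regression(baseline_latency_ms, observed_latency_ms) > REGRESSION_LIMIT:
        return "same-day"
    if dropped_overrun > dropped_threshold:
        return "same-day"
    return None
```

For example, a latency baseline of 100 ms with 115 ms observed is a 15% regression and classifies as a same-day rollback. Manual rollback stays outside the classifier because it rests on operator judgement.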

Rollback Steps (airborne Jetson)

  1. Confirm the flag — /run/azaion/in-flight is clear. If a flight is live, the FC's failsafe + operator's QGC abort path take precedence; rollback happens after landing.
  2. Identify the previous-good SHA — journalctl -g AZAION_UPDATE_EVENT --since "24 hours ago" on the affected Jetson shows the last successful revision.
  3. Tag rollback — operator retags the registry: ${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm → previous SHA. (Cycle-1: operator pulls + retags via the registry UI; cycle-2: scripted via scripts/deploy.sh rollback <sha>.)
  4. Wait for Watchtower — next poll detects the SHA change + pulls the previous image.
  5. Verify — journalctl -g AZAION_UPDATE_EVENT --since "10min ago" shows the rollback revision; companion HEALTHCHECK is healthy.
  6. DB rollback — cycle-1: not applicable (bootstrap-only schema). Cycle-2+: if the new image applied a migration, run the DOWN script if reversible; otherwise escalate to GPS-Denied Onboard owner + suite operator before proceeding.
  7. Notify — stakeholders informed; rollback flagged for post-mortem within 24 hours.
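
Step 2 can be illustrated with a small selection helper. The AZAION_UPDATE_EVENT payload format is not specified in this document, so the sketch assumes the audit events have already been parsed into an oldest-to-newest list of (sha, healthy) tuples; that record shape and the name `previous_good_sha` are hypothetical.

```python
from typing import List, Optional, Tuple


def previous_good_sha(history: List[Tuple[str, bool]], current: str) -> Optional[str]:
    """Return the most recent healthy revision other than the current one, if any."""
    for sha, healthy in reversed(history):
        if healthy and sha != current:
            return sha
    return None
```

Returning `None` rather than guessing leaves the decision with the operator notebook when the journal holds no usable earlier revision.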

Rollback Steps (operator workstation)

  1. scripts/stop-services.sh (Step 7) stops the operator-orchestrator service.
  2. Operator runs scripts/pull-images.sh <previous_sha> (Step 7).
  3. scripts/start-services.sh <previous_sha> (Step 7) brings the previous image up.
  4. Verify via HEALTHCHECK + offline --flight-file smoke.
  5. DB rollback as above (cycle-1 n/a; cycle-2+ per migration tool).
  6. Notify suite operator.

Post-mortem (required after every production rollback)

Recorded in _docs/_process_leftovers/<YYYY-MM-DD>_<topic>_rollback.md and replayed at the next /autodev invocation per .cursor/rules/tracker.mdc Leftovers Mechanism. Contents:

  • Timeline — AZAION_UPDATE_EVENT deploy event → first failure observation → rollback completion.
  • Root cause — pulled from FDR + journald + Woodpecker pipeline.
  • What went wrong — gate that should have caught it (CI? HITL? Pre-flight checklist?).
  • Prevention — concrete checklist edit or test addition. Lessons appended to _docs/LESSONS.md per the autodev retrospective conventions.
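
The report location follows the pattern _docs/_process_leftovers/<YYYY-MM-DD>_<topic>_rollback.md. A minimal sketch of building that path, assuming topics are slugged to lowercase with underscores (a convention inferred from the existing leftover file name, not stated in the source); `rollback_report_path` is a hypothetical helper.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def rollback_report_path(topic: str, day: date) -> PurePosixPath:
    """Build the post-mortem path for a rollback on the given day."""
    # Assumed slug rule: lowercase, runs of non-alphanumerics collapsed to "_".
    slug = re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")
    return PurePosixPath("_docs/_process_leftovers") / f"{day.isoformat()}_{slug}_rollback.md"
```

Deriving the name from the date and topic keeps reports sortable by date in the leftovers directory.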

Deployment Checklist

The pre-deploy checklist above is the canonical one. Repeating it here in the standard skill format for traceability:

  • All CI tests pass on the target branch (cycle-1: 01-test.yml manual run; cycle-2: push gate)
  • Security scan clean — re-validated against current pins; OpenCV CVE replay condition checked (_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md)
  • Docker images built + pushed under ${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>; OCI labels + AZAION_REVISION env stamped per AZ-204
  • Database migrations (cycle-2+): reviewed, tested, backward-compatible, flight-state-gated, operator-approved
  • Environment variables configured per-environment per environment_strategy.md § Environment Variables
  • Health check (python3 -m gps_denied_onboard.healthcheck) returns 0 on a dry-run against the target image
  • Observability touchpoints active: LOG_SINK honoured, FDR mount writable, jetson-stats accessible inside the container (Tier-2)
  • Rollback plan documented — previous-good SHA recorded; rollback steps reviewed
  • Stakeholders notified of deployment window (flight operator + suite operator + GPS-Denied Onboard owner)
  • Operator available during the post-deploy monitoring window (first 15 minutes + first flight)

Self-verification

  • Deployment strategy chosen (Watchtower floating-tag pull-on-ground) and justified (single instance per role, ground-only updates, FC-managed in-flight failsafe)
  • Zero-downtime stance: not applicable in flight; ground-only — explicitly justified
  • Health checks defined (exec-based HEALTHCHECK covering liveness + readiness; FC watchdog covers in-flight liveness via FC failsafe)
  • Rollback trigger criteria (immediate / same-day / manual) + steps for both airborne and operator workstation
  • Deployment checklist complete and grounded in the project's actual gates (AZAION_UPDATE_EVENT audit, CVE replay, /run/azaion/in-flight flag, signing key provisioning)
  • Post-mortem path defined and tied to the _docs/_process_leftovers/ + _docs/LESSONS.md mechanism
  • Graceful-shutdown sequence covers the FDR-flush + flight_footer.clean_shutdown invariants

BLOCKING — User Confirmation Required

This is the deploy skill Step 6 BLOCKING gate per .cursor/skills/deploy/SKILL.md § Methodology Quick Reference. Step 7 (Deployment Scripts) writes executable shell scripts that automate the procedures above; user confirmation that the procedure is correct is required before scripts are generated.