GPS-Denied Onboard — Deployment Procedures
Generated by `/autodev` greenfield Step 16 (Deploy), Step 6. Builds on Step 1–5 (`reports/deploy_status_report.md`, `containerization.md`, `ci_cd_pipeline.md`, `environment_strategy.md`, `observability.md`). The deploy skill's standard procedure template (load-balanced HTTP service with blue-green / rolling / canary patterns) is adapted here for the system's actual topology: single airborne instance + single operator workstation, ground-only updates, FC-managed in-flight failsafe, and the parent-suite Watchtower flow with a flight-state gate.
Deployment Strategy
Pattern: Floating-tag pull-on-ground (Watchtower-managed)
| Aspect | Choice | Rationale |
|---|---|---|
| Update mechanism (airborne Jetson) | Parent-suite Watchtower polls `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm`; pulls + restarts when SHA changes | Suite-mandated pattern per `../_infra/deploy/jetson/README.md`. The fielded Jetson stack has Watchtower already running, polling all 9 application services on the same cadence. |
| Update mechanism (operator workstation) | Operator runs `docker compose pull && docker compose up -d` from `scripts/start-services.sh` | The operator workstation is single-user; cycle-1 does not need automatic updates. Cycle-2 may add a Watchtower instance on the workstation. |
| Update mechanism (lab Jetson, staging) | Same as airborne (Watchtower polling `dev-arm` or `stage-arm`) | Mirrors airborne so the bench rig validates the exact same update path. |
| Blue-green / rolling / canary | None of the above: N=1 instance per role | The airborne side has one Jetson per aircraft (no fleet); the operator workstation has one instance per operator. There is no load-balanced replica to roll over. |
| Zero-downtime requirement | Not applicable in flight; ground-only | Flights are discrete + bounded; the FC handles in-flight failsafe (AC-FC-FAILSAFE-1) if the companion is unavailable mid-flight. Updates do not happen during flight. |
| Ground-only safety gate | `/run/azaion/in-flight` flag (parent-suite autopilot service writes it on arm/disarm) | Watchtower's post-update hook MUST refuse to restart the gps-denied-onboard container when this flag is set. Honoured at the suite-compose layer, not in this submodule's image (the image only honours the flag at boot when transitioning between strategies). |
| Multi-aircraft rollout | Tag-based per-aircraft (operator can pin `:rev-<sha>-arm` instead of `:main-arm`) | Floating tag is the default; explicit SHA pinning is the manual override. Suite operator owns per-aircraft pinning. |
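The ground-only safety gate in the table above can be sketched as a small check. This is a minimal illustration, not the suite's actual hook: only the flag path comes from this document, while the function names and the rule that an absent or empty file means "on the ground" are assumptions (the staging procedure describes a clear flag as "empty / absent").

```python
from pathlib import Path

# Flag path as named in the strategy table; everything else is illustrative.
IN_FLIGHT_FLAG = Path("/run/azaion/in-flight")


def in_flight(flag_path: Path = IN_FLIGHT_FLAG) -> bool:
    """Return True when the flag file holds non-whitespace content.

    Assumption: an absent or empty file means the aircraft is on the ground.
    Any other read error is treated as "in flight" so the gate fails closed.
    """
    try:
        return flag_path.read_text().strip() != ""
    except FileNotFoundError:
        return False
    except OSError:
        return True


def restart_allowed(flag_path: Path = IN_FLIGHT_FLAG) -> bool:
    """Hypothetical gate: permit a container restart only on the ground."""
    return not in_flight(flag_path)
```

Failing closed on unreadable flags is a design choice of this sketch: a deferred restart costs one poll interval, while a restart during flight is the case the gate exists to prevent.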
Graceful Shutdown
The companion has no inbound HTTP connections (NFT-SEC-05 in-flight egress lockdown). "Graceful shutdown" means: drain in-flight FDR writes, flush the C13 segment, emit flight_footer, close MAVLink connection cleanly.
| Step | Action | Owner |
|---|---|---|
| 1 | systemd / Docker sends `SIGTERM` to PID 1 (`python3 -m gps_denied_onboard.runtime_root`) | OS layer |
| 2 | Runtime root sets the global `shutting_down` flag; all per-frame producers stop enqueuing new FDR records | runtime root |
| 3 | C13 writer drains the FDR SPSC ring (≤ 200 ms target, bounded by ring depth + writer throughput) | C13 |
| 4 | C13 emits `flight_footer` with `clean_shutdown=true`, `records_written`, `records_dropped_overrun`, `bytes_written`, `rollover_count` | C13 |
| 5 | C13 closes the active segment file (`fsync`, rename `.tmp` → final) | C13 |
| 6 | C8 sends final MAVLink `STATUSTEXT` and closes the FC serial connection | C8 |
| 7 | Process exits 0 | runtime root |
Termination grace period (target): 30 seconds for the above sequence. If exceeded, Docker / systemd sends SIGKILL; flight_footer.clean_shutdown will be false on the next boot's recovery write, flagging the unclean shutdown for the post-flight summary.
Cycle-1 status: docker-compose.yml does not yet declare stop_grace_period: 30s — cycle-1 inherits Docker's default 10 s grace. The C13 ring drain target (≤ 200 ms) fits comfortably inside 10 s for the dev profile, but TensorRT engine teardown + gtsam factor cleanup on Tier-2 hardware are not yet measured. Cycle-2 follow-up (recorded in _docs/_process_leftovers/ when this deploy plan lands): add stop_grace_period: 30s to the companion service in docker-compose.yml and to the gps-denied-onboard service in the parent-suite ../_infra/deploy/jetson/docker-compose.yml once the Step 2 validation gate "TensorRT INT8 cache durability under Docker" (containerization.md § Step 2 Validation Gates) measures the actual drain budget on the Jetson.
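Steps 1–5 of the shutdown sequence can be sketched as follows. This is an illustration under stated assumptions rather than the C13 implementation: `queue.Queue` stands in for the SPSC ring, the footer is written as a JSON line (the real FDR record format is not described here), and `install_handler`, `handle_sigterm`, and `drain_and_close` are hypothetical names.

```python
import json
import os
import queue
import signal
import threading
from pathlib import Path

shutting_down = threading.Event()  # step 2: the global flag


def handle_sigterm(signum, frame):
    """Steps 1-2: the handler only flips the flag; producers poll it."""
    shutting_down.set()


def install_handler() -> None:
    """Register the SIGTERM handler (must run in the main thread)."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def drain_and_close(ring: "queue.Queue[bytes]", final_path: Path) -> dict:
    """Steps 3-5: drain pending records, append the footer, fsync, rename."""
    tmp_path = final_path.with_name(final_path.name + ".tmp")
    records_written = 0
    bytes_written = 0  # counts record payload bytes, not the footer itself
    with open(tmp_path, "ab") as segment:
        while True:
            try:
                record = ring.get_nowait()
            except queue.Empty:
                break
            segment.write(record)
            records_written += 1
            bytes_written += len(record)
        footer = {
            "kind": "flight_footer",  # assumed JSON-line encoding
            "clean_shutdown": True,
            "records_written": records_written,
            "bytes_written": bytes_written,
        }
        segment.write(json.dumps(footer).encode() + b"\n")
        segment.flush()
        os.fsync(segment.fileno())
    # Atomic rename: readers see either no final file or a complete one.
    os.replace(tmp_path, final_path)
    return footer
```

Writing to a `.tmp` name and renaming only after `fsync` means a `SIGKILL` during the drain leaves no final segment with a partial footer, which is what lets the next boot detect the unclean shutdown.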
Database Migration Ordering
Cycle-1 ships no migration runner — C6 bootstrap uses idempotent CREATE TABLE IF NOT EXISTS. Cycle-2+ rules (from environment_strategy.md § Migration Rules):
| Rule | Cycle-1 status | Cycle-2+ enforcement |
|---|---|---|
| Migrations run before new code deploys | n/a (bootstrap-only) | Alembic (or equivalent) migration step runs against staging first, then production, before the corresponding image pull is enabled |
| All migrations must be backward-compatible | n/a | Required: new schema works with previous image's read path until next release rotates both |
| Irreversible migrations require explicit operator approval | n/a | Required: Woodpecker UI approval gate + recorded in `_docs/04_deploy/migration_log.md` |
| Production migrations on the airborne Jetson refuse to run when `/run/azaion/in-flight` is set | n/a | Required: migration tool reads the flag at start; aborts with exit 0 + journald audit line if the flag is set |
| Production migrations on the operator workstation require operator approval | n/a | Required: interactive prompt in `start-services.sh` before applying |
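The cycle-1 bootstrap-only approach relies on `CREATE TABLE IF NOT EXISTS` being safe to re-run on every start. The sketch below shows that property using the standard-library `sqlite3` module as a stand-in for Postgres; the `tiles` table and the `bootstrap` function are hypothetical, since the real C6 schema is not shown in this document.

```python
import sqlite3

# Hypothetical table; the real C6 schema is not part of this document.
BOOTSTRAP_DDL = """
CREATE TABLE IF NOT EXISTS tiles (
    tile_id TEXT PRIMARY KEY,
    fetched_at TEXT NOT NULL
)
"""


def bootstrap(conn: sqlite3.Connection) -> None:
    """Safe to call on every start: a no-op once the table exists."""
    conn.execute(BOOTSTRAP_DDL)
    conn.commit()
```

Calling `bootstrap` twice on the same connection leaves existing rows untouched, which is why no migration runner is needed until a schema change has to alter an existing table.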
Health Checks
The companion has no HTTP /health/live or /health/ready endpoint (NFT-SEC-05). The Docker HEALTHCHECK is an exec check that re-runs the startup validation matrix (environment_strategy.md § Variable Validation) and inspects in-process liveness signals.
| Check | Type | Command / mechanism | Interval | Failure threshold | Action |
|---|---|---|---|---|---|
| Liveness / Readiness | `HEALTHCHECK` exec | `python3 -m gps_denied_onboard.healthcheck` | 10 s (companion-tier1 / operator-orchestrator); 10 s (companion-jetson, with `--start-period=30s` for TensorRT engine deserialise) | 3 consecutive failures → Docker marks container unhealthy → systemd / Watchtower restarts | Same as readiness; no load balancer to drain. Watchtower honours `/run/azaion/in-flight` before restarting. |
| Startup probe | Same exec | Same command | 5 s once `--start-period` elapses | 30 attempts max | Kill + recreate; Watchtower retries the pull on next poll |
| FC adapter health (in-flight) | C8 watchdog from the FC | MAVLink heartbeat loss > 1 s | n/a (handled by the FC) | FC drops to `SAFE_DEAD_RECKONING` or RTL per AC-FC-FAILSAFE-1 | |
| FDR ring liveness | `shared.fdr_client` overrun monitor | Producer enqueue failure | n/a (emits `kind="overrun"` record per AC-NEW-3; never silent) | Post-flight forensics surface; no in-flight action | |
| `db` Postgres health (operator workstation + dev compose) | `pg_isready -U gps_denied -d gps_denied` | 5 s | 10 failures | Docker / systemd restart the `db` service; the companion's healthcheck fails until DB is back | |
| `mock-suite-sat-service` health (Tier-1 e2e only) | HTTP GET `/healthz` on port 5100 | 5 s | 3 failures | Compose marks unhealthy; `e2e-runner` `--exit-code-from e2e-runner` surfaces failure | |
python3 -m gps_denied_onboard.healthcheck contract
The healthcheck module (already exists per containerization.md) re-runs:
- Required env vars validation: same set as the composition root, but read-only (no side effects).
- C6 DB reachability: `psycopg2.connect(DB_URL)` → `SELECT 1`.
- C13 FDR mount writability: `os.access(FDR_PATH, os.W_OK)` + a probe write to a `.healthcheck` file.
- C7 backend availability: for `INFERENCE_BACKEND=tensorrt`, validates the engine cache directory exists + is readable; for `pytorch_fp16`, no extra check (libtorch in-process).
- C8 FC adapter: best-effort; attempts a non-blocking serial open if `GPS_DENIED_FC_PROFILE` is set + the device path is present. Absent device path is not a failure (dev / CI containers).
Exit codes: 0 healthy; 1 config-invalid; 2 dependency-unreachable; 3 resource-bound (e.g. FDR full). Docker treats any non-zero as unhealthy.
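The exit-code contract above could be wired up roughly as follows. This is a sketch under assumptions, not the existing module: the `REQUIRED_VARS` set is hypothetical (the real list lives in `environment_strategy.md`), the database probe is injected as a callable instead of importing `psycopg2`, and mapping an unwritable FDR mount to exit code 3 is a guess based on the "FDR full" example.

```python
import os
from pathlib import Path

EXIT_HEALTHY = 0
EXIT_CONFIG_INVALID = 1
EXIT_DEPENDENCY_UNREACHABLE = 2
EXIT_RESOURCE_BOUND = 3

# Hypothetical required set; the real one is defined elsewhere.
REQUIRED_VARS = ("DB_URL", "FDR_PATH", "INFERENCE_BACKEND")


def run_checks(env: dict, db_probe=lambda url: True) -> int:
    """Return the healthcheck exit code for the given environment mapping.

    db_probe is a stand-in for the connect + SELECT 1 round-trip.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        return EXIT_CONFIG_INVALID
    if not db_probe(env["DB_URL"]):
        return EXIT_DEPENDENCY_UNREACHABLE
    fdr_dir = Path(env["FDR_PATH"])
    if not os.access(fdr_dir, os.W_OK):
        return EXIT_RESOURCE_BOUND  # assumed mapping for an unusable mount
    try:
        # Probe write catches a full or read-only filesystem.
        (fdr_dir / ".healthcheck").write_bytes(b"ok")
    except OSError:
        return EXIT_RESOURCE_BOUND
    return EXIT_HEALTHY
```

A module entry point would then call `sys.exit(run_checks(dict(os.environ), db_probe=...))`. Because Docker treats every non-zero code as unhealthy, the distinct codes matter only to an operator reading the healthcheck log.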
Staging Deployment (lab Jetson HITL)
Treat the lab Jetson as a mirror of production for image promotion. Operator runs the procedure manually; cycle-2 may automate via the suite.
1. CI/CD has already built + pushed `${REGISTRY_HOST}/azaion/gps-denied-onboard-companion-tier1:dev-arm` + `…-operator-orchestrator:dev-arm` via `.woodpecker/02-build-push.yml` (cycle-1) or `companion-jetson:dev-arm` via cycle-2.
2. Verify the flag: `cat /run/azaion/in-flight` should be empty / absent on the lab Jetson (no live FC there). If a HITL session is running, wait for the bench session to end.
3. Pull the new image: `scripts/pull-images.sh dev` (Step 7). Watchtower may have already pulled if running on the lab Jetson.
4. Restart the service: `scripts/start-services.sh dev` (Step 7). Honours stop-grace-period; waits for `HEALTHCHECK` to report healthy.
5. Run the HITL e2e suite: `docker compose -f docker-compose.test.jetson.yml up --abort-on-container-exit --exit-code-from e2e-runner --build`. This runs the Reality Gate replay (Derkachi clip + recorded tlog) against the new image on Tier-2 hardware.
6. Verify FDR output: `python3 -m gps_denied_onboard.post_flight.summarise --segment /var/lib/gps-denied/fdr/segment-*.fdr` (cycle-1 ad-hoc tool; cycle-2 polish lands the full replay viewer). Confirm `flight_footer.clean_shutdown == true` and `records_dropped_overrun == 0`.
7. If gates pass → promote: tag `${REGISTRY_HOST}/azaion/gps-denied-onboard:<sha>-arm` (or repurpose by branch promotion from `dev-arm` → `stage-arm` once cycle-2 wires environment branches per `ci_cd_pipeline.md` Quality Gates, `Multi-environment deployment` row).
8. If gates fail → file a Jira issue under E-DEPLOY; roll back the lab Jetson per § Rollback Procedures.
Production Deployment (airborne Jetson + operator workstation)
Production deployment lands on each aircraft individually + on each operator workstation. The aircraft side is Watchtower-driven; the operator workstation side is operator-driven.
Pre-deploy checks (operator-owned)
- CI gates green: `01-test.yml` passed on the target branch (cycle-1: manual trigger; cycle-2: push gate).
- Security scan recent: `_docs/05_security/dependency_scan.md` re-validated against the build SHA. The OpenCV pin per `_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md` is honoured.
- HITL gate passed: Staging deployment § 5–6 confirmed `clean_shutdown=true` and `records_dropped_overrun=0`.
- Per-aircraft acceptance: operator confirms the build's strategy flags (`BUILD_VINS_MONO`, `BUILD_SALAD`, `BUILD_C11_TILE_MANAGER`, replay flags, `BUILD_DEV_STATIC_KEY=OFF`) match the operational profile for the destination aircraft.
- Calibration JSON onboard: `/etc/gps-denied/calibration/adti20.json` (operator-acquired per D-PROJ-1) is staged on the aircraft Jetson NVM.
- Signing key path provisioned: `MAVLINK_SIGNING_KEY` resolves to a per-host writable path that `KeySource` will rotate at takeoff; no static key from `tests/fixtures/`.
- Postgres credentials in `/etc/gps-denied/.pgpass`: per-host random password (Step 7 `start-services.sh` writes this on first run).
- `/run/azaion/in-flight` is clear: no live flight in progress on the target aircraft.
- Rollback target identified: previous successful SHA recorded for the target aircraft (operator notebook + `journalctl -g AZAION_UPDATE_EVENT` on the Jetson).
- Stakeholders notified: flight operator + suite operator informed of the deploy window.
Production Deployment — Airborne Jetson (Watchtower-driven)
1. Tag promotion: operator pushes the validated SHA to `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm` (or per-aircraft SHA pin if rolling out partial fleet).
2. Wait for Watchtower poll: default poll interval per suite config (typically ≤ 5 min).
3. Watchtower pre-restart check: Watchtower's post-update hook checks `/run/azaion/in-flight`; if set, defers the restart until the next poll.
4. Container stop: Docker sends `SIGTERM`; companion drains FDR (≤ 200 ms target) + emits `flight_footer` per § Graceful Shutdown. Exit must complete within 30 s grace period.
5. Image pull complete: Watchtower pulls the new image (already verified-by-tag; OCI labels embed the SHA).
6. Container start: Docker starts the new container; `HEALTHCHECK` `--start-period=30s` allows TensorRT engine deserialise + Postgres reconnect.
7. Audit event emitted: Watchtower's post-update hook emits `AZAION_UPDATE_EVENT` to journald (`observability.md` § Deploy Audit).
8. Verify on the aircraft: operator runs `journalctl -g AZAION_UPDATE_EVENT --since 10min` on the Jetson; confirms the new revision SHA matches the intended tag.
9. Run a ground HITL pre-flight: operator brings up the bench-mounted aircraft, runs the standard pre-flight checklist (FC heartbeat, signing handshake, camera focus, NFT-SEC-04 image-decode smoke). Pre-flight refusal-to-arm on any gate failure is the production safety net.
10. Monitor the first flight: operator watches QGroundControl for STATUSTEXT messages from the companion + the `GpsDeniedHealth` MAVLink message stream during the first flight under the new image.
11. Post-flight forensics: after landing, operator pulls FDR segments + runs `post_flight.summarise`; confirms no regression vs the previous-SHA baseline (NFT-PERF gates per `_docs/02_document/tests/baselines`).
Production Deployment — Operator Workstation (operator-driven)
1. Pre-deploy checks: same checklist as above, scoped to the operator-orchestrator image.
2. Pull: operator runs `scripts/pull-images.sh main` (Step 7).
3. Stop: `scripts/stop-services.sh` (Step 7) gracefully stops the operator-orchestrator service.
4. Start: `scripts/start-services.sh main` (Step 7) brings the new image up. `HEALTHCHECK` `--start-period=10s` allows DB reconnect.
5. Audit: `journalctl -g AZAION_UPDATE_EVENT --since 10min` on the operator workstation confirms the new revision.
6. Smoke test: operator runs the C12 `--flight-file <offline_fixture>` path against a known-good flight DTO; verifies the `FlightsApiClient` round-trip succeeds.
Post-deploy monitoring window
| Window | What to watch | Action on regression |
|---|---|---|
| First 15 min | journald `AZAION_UPDATE_EVENT` cadence; container `HEALTHCHECK` status | Roll back immediately (§ Rollback Procedures) |
| First flight (airborne) | QGC STATUSTEXT + `GpsDeniedHealth` MAVLink stream; FDR overrun count | Operator aborts flight if `GpsDeniedHealth` degrades; FC failsafe is the safety net |
| First post-flight pull (airborne) | FDR `flight_footer.clean_shutdown` flag; `records_dropped_overrun`; per-component `tile_match`, `c6.eviction_batch` baselines | If `clean_shutdown=false` or baselines drifted → roll back; required post-mortem |
Rollback Procedures
Trigger Criteria
| Severity | Trigger | Decision lead |
|---|---|---|
| Immediate rollback | New image fails `HEALTHCHECK` within 5 minutes of `AZAION_UPDATE_EVENT`; or `flight_footer.clean_shutdown=false` on the first flight under the new image | Flight operator (airborne) / Suite operator (workstation) |
| Same-day rollback | NFT-PERF baseline regression > 10% (frame deadline miss rate, end-to-end pose latency); FDR `records_dropped_overrun > 0` above per-flight threshold; sustained `c6.eviction_batch` activity > baseline | Operator + GPS-Denied Onboard owner |
| Manual rollback | Operator judgement (visible operational anomaly without a clear FDR signal) | Operator |
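The trigger table can be read as a priority-ordered decision rule. The sketch below encodes it for illustration only; `PostDeploySignals`, its field names, and `rollback_severity` are hypothetical, and the per-flight overrun threshold is left as a parameter because this document does not state its value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PostDeploySignals:
    """Hypothetical summary of what the operator observed after a deploy."""
    healthcheck_failed_within_5_min: bool = False
    clean_shutdown: bool = True
    perf_regression_pct: float = 0.0   # worst NFT-PERF regression vs baseline
    records_dropped_overrun: int = 0
    overrun_threshold: int = 0         # per-flight threshold (value assumed)
    eviction_above_baseline: bool = False
    operator_requests_rollback: bool = False


def rollback_severity(s: PostDeploySignals) -> Optional[str]:
    """Return 'immediate', 'same-day', 'manual', or None, checked in that order."""
    if s.healthcheck_failed_within_5_min or not s.clean_shutdown:
        return "immediate"
    if (
        s.perf_regression_pct > 10.0
        or s.records_dropped_overrun > s.overrun_threshold
        or s.eviction_above_baseline
    ):
        return "same-day"
    if s.operator_requests_rollback:
        return "manual"
    return None
```

For example, `rollback_severity(PostDeploySignals(clean_shutdown=False))` evaluates to `"immediate"` regardless of the performance fields, because the most severe matching row wins.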
Rollback Steps (airborne Jetson)
1. Confirm the flag: `/run/azaion/in-flight` is clear. If a flight is live, the FC's failsafe + operator's QGC abort path take precedence; rollback happens after landing.
2. Identify the previous-good SHA: `journalctl -g AZAION_UPDATE_EVENT --since 24h` on the affected Jetson shows the last successful revision.
3. Tag rollback: operator retags the registry, `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm` → previous SHA. (Cycle-1: operator pulls + retags via the registry UI; cycle-2: scripted via `scripts/deploy.sh rollback <sha>`.)
4. Wait for Watchtower: next poll detects the SHA change + pulls the previous image.
5. Verify: `journalctl -g AZAION_UPDATE_EVENT --since 10min` shows the rollback revision; companion `HEALTHCHECK` is healthy.
6. DB rollback: cycle-1: not applicable (bootstrap-only schema). Cycle-2+: if the new image applied a migration, run the DOWN script if reversible; otherwise escalate to GPS-Denied Onboard owner + suite operator before proceeding.
7. Notify: stakeholders informed; rollback flagged for post-mortem within 24 hours.
Rollback Steps (operator workstation)
1. `scripts/stop-services.sh` (Step 7) stops the operator-orchestrator service.
2. Operator runs `scripts/pull-images.sh <previous_sha>` (Step 7).
3. `scripts/start-services.sh <previous_sha>` (Step 7) brings the previous image up.
4. Verify via `HEALTHCHECK` + offline `--flight-file` smoke.
5. DB rollback as above (cycle-1 n/a; cycle-2+ per migration tool).
6. Notify suite operator.
Post-mortem (required after every production rollback)
Recorded in _docs/_process_leftovers/<YYYY-MM-DD>_<topic>_rollback.md and replayed at the next /autodev invocation per .cursor/rules/tracker.mdc Leftovers Mechanism. Contents:
- Timeline: `AZAION_UPDATE_EVENT` deploy event → first failure observation → rollback completion.
- Root cause: pulled from FDR + journald + Woodpecker pipeline.
- What went wrong: gate that should have caught it (CI? HITL? Pre-flight checklist?).
- Prevention: concrete checklist edit or test addition. Lessons appended to `_docs/LESSONS.md` per the autodev retrospective conventions.
Deployment Checklist
The pre-deploy checklist above is the canonical one. Repeating it here in the standard skill format for traceability:
- All CI tests pass on the target branch (cycle-1: `01-test.yml` manual run; cycle-2: push gate)
- Security scan clean: re-validated against current pins; OpenCV CVE replay condition checked (`_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md`)
- Docker images built + pushed under `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>`; OCI labels + `AZAION_REVISION` env stamped per AZ-204
- Database migrations (cycle-2+): reviewed, tested, backward-compatible, flight-state-gated, operator-approved
- Environment variables configured per-environment per `environment_strategy.md` § Environment Variables
- Health check (`python3 -m gps_denied_onboard.healthcheck`) returns 0 on a dry-run against the target image
- Observability touchpoints active: `LOG_SINK` honoured, FDR mount writable, `jetson-stats` accessible inside the container (Tier-2)
- Rollback plan documented: previous-good SHA recorded; rollback steps reviewed
- Stakeholders notified of deployment window (flight operator + suite operator + GPS-Denied Onboard owner)
- Operator available during the post-deploy monitoring window (first 15 minutes + first flight)
Self-verification
- Deployment strategy chosen (Watchtower floating-tag pull-on-ground) and justified (single instance per role, ground-only updates, FC-managed in-flight failsafe)
- Zero-downtime stance: not applicable in flight; ground-only, explicitly justified
- Health checks defined (exec-based `HEALTHCHECK` covering liveness + readiness; FC watchdog covers in-flight liveness via FC failsafe)
- Rollback trigger criteria (immediate / same-day / manual) + steps for both airborne and operator workstation
- Deployment checklist complete and grounded in the project's actual gates (`AZAION_UPDATE_EVENT` audit, CVE replay, `/run/azaion/in-flight` flag, signing key provisioning)
- Post-mortem path defined and tied to the `_docs/_process_leftovers/` + `_docs/LESSONS.md` mechanism
- Graceful-shutdown sequence covers the FDR-flush + `flight_footer.clean_shutdown` invariants
BLOCKING — User Confirmation Required
This is the deploy skill Step 6 BLOCKING gate per .cursor/skills/deploy/SKILL.md § Methodology Quick Reference. Step 7 (Deployment Scripts) writes executable shell scripts that automate the procedures above; user confirmation that the procedure is correct is required before scripts are generated.