# GPS-Denied Onboard — Deployment Procedures

> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 6. Builds on
> Steps 1–5 (`reports/deploy_status_report.md`, `containerization.md`,
> `ci_cd_pipeline.md`, `environment_strategy.md`, `observability.md`). The
> deploy skill's standard procedure template (load-balanced HTTP service
> with blue-green / rolling / canary patterns) is adapted here for the
> system's actual topology: single airborne instance + single operator
> workstation, ground-only updates, FC-managed in-flight failsafe, and the
> parent-suite Watchtower flow with a flight-state gate.

## Deployment Strategy

### Pattern: **Floating-tag pull-on-ground (Watchtower-managed)**

| Aspect | Choice | Rationale |
|--------|--------|-----------|
| Update mechanism (airborne Jetson) | Parent-suite Watchtower polls `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm`; pulls + restarts when SHA changes | Suite-mandated pattern per `../_infra/deploy/jetson/README.md`. The fielded Jetson stack has Watchtower already running, polling all 9 application services on the same cadence. |
| Update mechanism (operator workstation) | Operator runs `docker compose pull && docker compose up -d` from `scripts/start-services.sh` | The operator workstation is single-user; cycle-1 does not need automatic updates. Cycle-2 may add a Watchtower instance on the workstation. |
| Update mechanism (lab Jetson — staging) | Same as airborne (Watchtower polling `dev-arm` or `stage-arm`) | Mirrors airborne so the bench rig validates the exact same update path. |
| Blue-green / rolling / canary | **None of the above** — N=1 instance per role | The airborne side has one Jetson per aircraft (no fleet); the operator workstation has one instance per operator. There is no load-balanced replica to roll over. |
| Zero-downtime requirement | **Not applicable in flight**; ground-only | Flights are discrete + bounded; the FC handles in-flight failsafe (AC-FC-FAILSAFE-1) if the companion is unavailable mid-flight. Updates do not happen during flight. |
| Ground-only safety gate | `/run/azaion/in-flight` flag (parent-suite `autopilot` service writes it on arm/disarm) | **Watchtower's post-update hook MUST refuse to restart the `gps-denied-onboard` container when this flag is set.** Honoured at the suite-compose layer, not in this submodule's image (the image only honours the flag at boot when transitioning between strategies). |
| Multi-aircraft rollout | Tag-based per-aircraft (operator can pin `:rev-<sha>-arm` instead of `:main-arm`) | Floating tag is the default; explicit SHA pinning is the manual override. Suite operator owns per-aircraft pinning. |

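
The ground-only safety gate in the table above can be sketched as a small predicate. This is an illustrative sketch, not the suite's actual hook: the function names are hypothetical, and it assumes the flag counts as "set" when the file exists and is non-empty (the staging procedure treats an empty or absent file as clear).

```python
from pathlib import Path

# Flag path as given in this document.
IN_FLIGHT_FLAG = Path("/run/azaion/in-flight")


def in_flight(flag_path: Path = IN_FLIGHT_FLAG) -> bool:
    """Return True when the flag file exists and is non-empty.

    Assumption: an empty or absent file means "not in flight".
    """
    try:
        return flag_path.stat().st_size > 0
    except FileNotFoundError:
        return False


def may_restart(flag_path: Path = IN_FLIGHT_FLAG) -> bool:
    """Hypothetical hook decision: only restart the container on the ground."""
    return not in_flight(flag_path)
```

A hook built on this would defer the restart and try again on the next poll whenever `may_restart()` returns `False`.
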
### Graceful Shutdown

The companion has **no inbound HTTP connections** (NFT-SEC-05 in-flight egress lockdown). "Graceful shutdown" means: drain in-flight FDR writes, flush the C13 segment, emit `flight_footer`, close MAVLink connection cleanly.

| Step | Action | Owner |
|------|--------|-------|
| 1 | systemd / Docker sends `SIGTERM` to PID 1 (`python3 -m gps_denied_onboard.runtime_root`) | OS layer |
| 2 | Runtime root sets the global `shutting_down` flag; all per-frame producers stop enqueuing new FDR records | runtime root |
| 3 | C13 writer drains the FDR SPSC ring (≤ 200 ms target — bounded by ring depth + writer throughput) | C13 |
| 4 | C13 emits `flight_footer` with `clean_shutdown=true`, `records_written`, `records_dropped_overrun`, `bytes_written`, `rollover_count` | C13 |
| 5 | C13 closes the active segment file (fsync, rename `.tmp` → final) | C13 |
| 6 | C8 sends final MAVLink `STATUSTEXT` and closes the FC serial connection | C8 |
| 7 | Process exits 0 | runtime root |

**Termination grace period (target)**: 30 seconds for the above sequence. If exceeded, Docker / systemd sends `SIGKILL`; `flight_footer.clean_shutdown` will be `false` on the next boot's recovery write, flagging the unclean shutdown for the post-flight summary.

**Cycle-1 status**: `docker-compose.yml` does **not** yet declare `stop_grace_period: 30s` — cycle-1 inherits Docker's default 10 s grace. The C13 ring drain target (≤ 200 ms) fits comfortably inside 10 s for the dev profile, but TensorRT engine teardown + gtsam factor cleanup on Tier-2 hardware are not yet measured. **Cycle-2 follow-up** (recorded in `_docs/_process_leftovers/` when this deploy plan lands): add `stop_grace_period: 30s` to the `companion` service in `docker-compose.yml` and to the `gps-denied-onboard` service in the parent-suite `../_infra/deploy/jetson/docker-compose.yml` once the Step 2 validation gate "TensorRT INT8 cache durability under Docker" (`containerization.md` § Step 2 Validation Gates) measures the actual drain budget on the Jetson.

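
Steps 1 to 4 of the shutdown sequence might look roughly like the sketch below. It is a simplified stand-in, not the runtime root's code: `queue.Queue` substitutes for the FDR SPSC ring, records are written as JSON lines purely for illustration, and every function name is hypothetical.

```python
import json
import queue
import signal
import threading

# Global flag from step 2; producers consult it before enqueuing.
shutting_down = threading.Event()


def request_shutdown(signum=None, frame=None):
    """SIGTERM handler: only sets the flag, the drain happens elsewhere."""
    shutting_down.set()


def install_handler():
    """Register the handler for SIGTERM (call once from the main thread)."""
    signal.signal(signal.SIGTERM, request_shutdown)


def enqueue(ring, record):
    """Producers stop enqueuing new records once shutdown has begun."""
    if shutting_down.is_set():
        return False
    ring.put_nowait(record)
    return True


def drain_and_footer(ring, out):
    """Drain whatever is still queued, then emit a flight_footer record."""
    written = 0
    while True:
        try:
            record = ring.get_nowait()
        except queue.Empty:
            break
        out.write(json.dumps(record) + "\n")
        written += 1
    footer = {
        "kind": "flight_footer",
        "clean_shutdown": True,
        "records_written": written,
    }
    out.write(json.dumps(footer) + "\n")
    out.flush()
    return footer
```

The handler does nothing but set the flag, so the drain runs in normal program flow and is not subject to signal-handler restrictions.
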
### Database Migration Ordering

Cycle-1 ships **no migration runner** — C6 bootstrap uses idempotent `CREATE TABLE IF NOT EXISTS`. Cycle-2+ rules (from `environment_strategy.md` § Migration Rules):

| Rule | Cycle-1 status | Cycle-2+ enforcement |
|------|----------------|----------------------|
| Migrations run **before** new code deploys | n/a — bootstrap-only | Alembic (or equivalent) migration step runs against staging first, then production, before the corresponding image pull is enabled |
| All migrations must be backward-compatible | n/a | Required: new schema works with previous image's read path until next release rotates both |
| Irreversible migrations require explicit operator approval | n/a | Required: Woodpecker UI approval gate + recorded in `_docs/04_deploy/migration_log.md` |
| Production migrations on the airborne Jetson refuse to run when `/run/azaion/in-flight` is set | n/a | Required: migration tool reads the flag at start; aborts with exit 0 + journald audit line if the flag is set |
| Production migrations on the operator workstation require operator approval | n/a | Required: interactive prompt in `start-services.sh` before applying |

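
The flight-state rule for cycle-2+ migrations could be wired as in the sketch below. Cycle-1 has no migration runner, so everything here is hypothetical: the `run_migrations` name, the audit message text, and the assumption that a line written to stderr reaches journald (true for a systemd-managed process, not guaranteed for every container log driver).

```python
import sys
from pathlib import Path


def run_migrations(flag_path, apply,
                   audit=lambda msg: print(msg, file=sys.stderr)):
    """Return a process exit code.

    While the in-flight flag is set, the migration is skipped, an audit
    line is emitted, and the exit code is still 0, as the rule requires.
    """
    flag = Path(flag_path)
    # Assumption: "set" means the file exists and is non-empty.
    if flag.exists() and flag.stat().st_size > 0:
        audit(f"migration skipped: {flag} is set")
        return 0
    apply()
    return 0
```

Exiting 0 on the skip path means a deferred migration is not mistaken for a failed one; the audit line is the only record that nothing was applied.
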
## Health Checks

The companion has no HTTP `/health/live` or `/health/ready` endpoint (NFT-SEC-05). The Docker `HEALTHCHECK` is an **exec check** that re-runs the startup validation matrix (`environment_strategy.md` § Variable Validation) and inspects in-process liveness signals.

| Check | Type | Command / mechanism | Interval | Failure threshold | Action |
|-------|------|----------------------|----------|--------------------|--------|
| Liveness / Readiness | `HEALTHCHECK` exec | `python3 -m gps_denied_onboard.healthcheck` | 10 s (companion-tier1 / operator-orchestrator); 10 s (companion-jetson, with `--start-period=30s` for TensorRT engine deserialise) | 3 consecutive failures → Docker marks container `unhealthy` → systemd / Watchtower restarts | Liveness and readiness share one action — no load balancer to drain. Watchtower honours `/run/azaion/in-flight` before restarting. |
| Startup probe | Same exec | Same command | 5 s once `--start-period` elapses | 30 attempts max | Kill + recreate; Watchtower retries the pull on next poll |
| FC adapter health (in-flight) | C8 watchdog from the FC | MAVLink heartbeat monitoring | n/a — handled by the FC | Heartbeat loss > 1 s | FC drops to `SAFE_DEAD_RECKONING` or `RTL` per AC-FC-FAILSAFE-1 |
| FDR ring liveness | `shared.fdr_client` overrun monitor | Producer enqueue failure detection | n/a | Any enqueue failure emits a `kind="overrun"` record (AC-NEW-3); never silent | Post-flight forensics surface; no in-flight action |
| `db` Postgres health (operator workstation + dev compose) | Command | `pg_isready -U gps_denied -d gps_denied` | 5 s | 10 failures | Docker / systemd restart the `db` service; the companion's healthcheck fails until DB is back |
| `mock-suite-sat-service` health (Tier-1 e2e only) | HTTP | GET `/healthz` on port 5100 | 5 s | 3 failures | Compose marks unhealthy; e2e-runner `--exit-code-from e2e-runner` surfaces failure |

### `python3 -m gps_denied_onboard.healthcheck` contract

The healthcheck module (already exists per `containerization.md`) re-runs:

1. **Required env vars validation** — same set as the composition root, but read-only (no side effects).
2. **C6 DB reachability** — `psycopg2.connect(DB_URL) → SELECT 1`.
3. **C13 FDR mount writability** — `os.access(FDR_PATH, os.W_OK)` + a probe write to a `.healthcheck` file.
4. **C7 backend availability** — for `INFERENCE_BACKEND=tensorrt`, validates the engine cache directory exists + is readable; for `pytorch_fp16`, no extra check (libtorch in-process).
5. **C8 FC adapter** — best-effort: attempts a non-blocking serial open if `GPS_DENIED_FC_PROFILE` is set + the device path is present. Absent device path is not a failure (dev / CI containers).

Exit codes: `0` healthy; `1` config-invalid; `2` dependency-unreachable; `3` resource-bound (e.g. FDR full). Docker treats any non-zero as `unhealthy`.

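
The exit-code contract above can be illustrated with a reduced sketch covering checks 1 to 3 only. It is not the existing healthcheck module: the `check` function, the injected `db_probe` callable, and the two-variable "required" set are hypothetical, and mapping a missing FDR directory to code `2` versus a failed write to code `3` is an assumption.

```python
import os
import shutil
from enum import IntEnum


class Health(IntEnum):
    HEALTHY = 0
    CONFIG_INVALID = 1
    DEPENDENCY_UNREACHABLE = 2
    RESOURCE_BOUND = 3


def check(env, db_probe, min_free_bytes=1):
    """Run a reduced version of checks 1-3 and return a Health code."""
    # 1. Required env vars (illustrative subset, read-only).
    for name in ("DB_URL", "FDR_PATH"):
        if not env.get(name):
            return Health.CONFIG_INVALID
    # 2. DB reachability through an injected probe (e.g. connect + SELECT 1).
    try:
        db_probe(env["DB_URL"])
    except Exception:
        return Health.DEPENDENCY_UNREACHABLE
    # 3. FDR mount: must exist, be writable, and have free space.
    fdr = env["FDR_PATH"]
    if not os.path.isdir(fdr):
        return Health.DEPENDENCY_UNREACHABLE
    if not os.access(fdr, os.W_OK):
        return Health.RESOURCE_BOUND
    if shutil.disk_usage(fdr).free < min_free_bytes:
        return Health.RESOURCE_BOUND
    probe = os.path.join(fdr, ".healthcheck")
    try:
        with open(probe, "w") as fh:
            fh.write("ok")
        os.remove(probe)
    except OSError:
        return Health.RESOURCE_BOUND
    return Health.HEALTHY
```

An entry point would end with something like `sys.exit(int(check(os.environ, probe)))`, so Docker sees any non-zero code as `unhealthy`.
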
## Staging Deployment (lab Jetson HITL)

Treat the lab Jetson as a **mirror of production** for image promotion. The operator runs the procedure manually; cycle-2 may automate it via the suite.

1. **CI/CD** has already built + pushed `${REGISTRY_HOST}/azaion/gps-denied-onboard-companion-tier1:dev-arm` + `…-operator-orchestrator:dev-arm` via `.woodpecker/02-build-push.yml` (cycle-1) or `companion-jetson:dev-arm` in cycle-2.
2. **Verify the flag** — `cat /run/azaion/in-flight` should be empty / absent on the lab Jetson (no live FC there). If a HITL session is running, wait for the bench session to end.
3. **Pull the new image** — `scripts/pull-images.sh dev` (Step 7). Watchtower may have already pulled if running on the lab Jetson.
4. **Restart the service** — `scripts/start-services.sh dev` (Step 7). Honours stop-grace-period; waits for HEALTHCHECK to report healthy.
5. **Run the HITL e2e suite** — `docker compose -f docker-compose.test.jetson.yml up --abort-on-container-exit --exit-code-from e2e-runner --build`. This runs the **Reality Gate** replay (Derkachi clip + recorded tlog) against the new image on Tier-2 hardware.
6. **Verify FDR output** — `python3 -m gps_denied_onboard.post_flight.summarise --segment /var/lib/gps-denied/fdr/segment-*.fdr` (cycle-1 ad-hoc tool; cycle-2 polish lands the full replay viewer). Confirm `flight_footer.clean_shutdown == true` and `records_dropped_overrun == 0`.
7. **If gates pass** → promote: tag `${REGISTRY_HOST}/azaion/gps-denied-onboard:<sha>-arm` (or promote by branch from `dev-arm` → `stage-arm` once cycle-2 wires environment branches per `ci_cd_pipeline.md` Quality Gates `Multi-environment deployment` row).
8. **If gates fail** → file a Jira issue under E-DEPLOY; roll back the lab Jetson per § Rollback Procedures.

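
The FDR check in step 6 reduces to two assertions on the footer. The sketch below assumes the footer has already been parsed into a dictionary with the field names used in this document; the `footer_gate` helper is hypothetical and is not part of `post_flight.summarise`.

```python
def footer_gate(footer):
    """Return a list of gate violations; an empty list means the gate passes."""
    problems = []
    # A missing field counts as a failure, not as a pass.
    if footer.get("clean_shutdown") is not True:
        problems.append("clean_shutdown is not true")
    if footer.get("records_dropped_overrun") != 0:
        problems.append("records_dropped_overrun is not 0")
    return problems
```

Returning the list of violations instead of a bare boolean gives the operator something concrete to paste into the issue if the gate fails.
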
## Production Deployment (airborne Jetson + operator workstation)

Production deployment lands on each aircraft individually + on each operator workstation. The aircraft side is Watchtower-driven; the operator workstation side is operator-driven.

### Pre-deploy checks (operator-owned)

- [ ] **CI gates green** — `01-test.yml` passed on the target branch (cycle-1: manual trigger; cycle-2: push gate).
- [ ] **Security scan recent** — `_docs/05_security/dependency_scan.md` re-validated against the build SHA. The OpenCV pin per `_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md` is honoured.
- [ ] **HITL gate passed** — Staging Deployment steps 5–6 confirmed `clean_shutdown=true` and `records_dropped_overrun=0`.
- [ ] **Per-aircraft acceptance** — operator confirms the build's strategy flags (`BUILD_VINS_MONO`, `BUILD_SALAD`, `BUILD_C11_TILE_MANAGER`, replay flags, `BUILD_DEV_STATIC_KEY=OFF`) match the operational profile for the destination aircraft.
- [ ] **Calibration JSON onboard** — `/etc/gps-denied/calibration/adti20.json` (operator-acquired per D-PROJ-1) is staged on the aircraft Jetson NVM.
- [ ] **Signing key path provisioned** — `MAVLINK_SIGNING_KEY` resolves to a per-host writable path that `KeySource` will rotate at takeoff; no static key from `tests/fixtures/`.
- [ ] **Postgres credentials in `/etc/gps-denied/.pgpass`** — per-host random password (Step 7 `start-services.sh` writes this on first run).
- [ ] **`/run/azaion/in-flight` is clear** — no live flight in progress on the target aircraft.
- [ ] **Rollback target identified** — previous successful SHA recorded for the target aircraft (operator notebook + `journalctl -g AZAION_UPDATE_EVENT` on the Jetson).
- [ ] **Stakeholders notified** — flight operator + suite operator informed of the deploy window.

### Production Deployment — Airborne Jetson (Watchtower-driven)

1. **Tag promotion** — operator pushes the validated SHA to `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm` (or per-aircraft SHA pin if rolling out to part of the fleet).
2. **Wait for Watchtower poll** — default poll interval per suite config (typically ≤ 5 min).
3. **Watchtower pre-restart check** — Watchtower's post-update hook checks `/run/azaion/in-flight`; if set, defers the restart until the next poll.
4. **Container stop** — Docker sends `SIGTERM`; companion drains FDR (≤ 200 ms target) + emits `flight_footer` per § Graceful Shutdown. Exit must complete within the 30 s grace period.
5. **Image pull complete** — Watchtower pulls the new image (already verified-by-tag; OCI labels embed the SHA).
6. **Container start** — Docker starts the new container; `HEALTHCHECK` `--start-period=30s` allows TensorRT engine deserialise + Postgres reconnect.
7. **Audit event emitted** — Watchtower's post-update hook emits `AZAION_UPDATE_EVENT` to journald (`observability.md` § Deploy Audit).
8. **Verify on the aircraft** — operator runs `journalctl -g AZAION_UPDATE_EVENT --since "10 min ago"` on the Jetson; confirms the new revision SHA matches the intended tag.
9. **Run a ground HITL pre-flight** — operator brings up the bench-mounted aircraft, runs the standard pre-flight checklist (FC heartbeat, signing handshake, camera focus, NFT-SEC-04 image-decode smoke). Pre-flight refusal-to-arm on any gate failure is the production safety net.
10. **Monitor the first flight** — operator watches QGroundControl for STATUSTEXT messages from the companion + the `GpsDeniedHealth` MAVLink message stream during the first flight under the new image.
11. **Post-flight forensics** — after landing, operator pulls FDR segments + runs `post_flight.summarise`; confirms no regression vs the previous-SHA baseline (NFT-PERF gates per `_docs/02_document/tests/` baselines).

### Production Deployment — Operator Workstation (operator-driven)

1. **Pre-deploy checks** — same checklist as above, scoped to the operator-orchestrator image.
2. **Pull** — operator runs `scripts/pull-images.sh main` (Step 7).
3. **Stop** — `scripts/stop-services.sh` (Step 7) gracefully stops the operator-orchestrator service.
4. **Start** — `scripts/start-services.sh main` (Step 7) brings the new image up. `HEALTHCHECK` `--start-period=10s` allows DB reconnect.
5. **Audit** — `journalctl -g AZAION_UPDATE_EVENT --since "10 min ago"` on the operator workstation confirms the new revision.
6. **Smoke test** — operator runs the C12 `--flight-file <offline_fixture>` path against a known-good flight DTO; verifies the `FlightsApiClient` round-trip succeeds.

### Post-deploy monitoring window

| Window | What to watch | Action on regression |
|--------|---------------|----------------------|
| First 15 min | journald `AZAION_UPDATE_EVENT` cadence; container `HEALTHCHECK` status | Roll back immediately (§ Rollback Procedures) |
| First flight (airborne) | QGC STATUSTEXT + `GpsDeniedHealth` MAVLink stream; FDR `overrun` count | Operator aborts flight if `GpsDeniedHealth` degrades; FC failsafe is the safety net |
| First post-flight pull (airborne) | FDR `flight_footer.clean_shutdown` flag; `records_dropped_overrun`; per-component `tile_match`, `c6.eviction_batch` baselines | If `clean_shutdown=false` or baselines drifted → roll back; required post-mortem |

## Rollback Procedures

### Trigger Criteria

| Severity | Trigger | Decision lead |
|----------|---------|---------------|
| **Immediate rollback** | New image fails `HEALTHCHECK` within 5 minutes of `AZAION_UPDATE_EVENT`; or `flight_footer.clean_shutdown=false` on the first flight under the new image | Flight operator (airborne) / Suite operator (workstation) |
| **Same-day rollback** | NFT-PERF baseline regression > 10% (frame deadline miss rate, end-to-end pose latency); FDR `records_dropped_overrun` non-zero and above the per-flight threshold; sustained `c6.eviction_batch` activity > baseline | Operator + GPS-Denied Onboard owner |
| **Manual rollback** | Operator judgement (visible operational anomaly without a clear FDR signal) | Operator |

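
The three severities above form a simple precedence order, which the following sketch makes explicit. It is only an illustration of the table: the `Observation` fields and the `classify` function are hypothetical names, and how each boolean is derived from FDR or journald data is left out.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    healthcheck_failed_within_5_min: bool = False
    clean_shutdown: bool = True
    perf_regression_pct: float = 0.0         # NFT-PERF regression vs baseline
    dropped_over_flight_threshold: bool = False
    eviction_above_baseline: bool = False     # sustained c6.eviction_batch
    operator_concern: bool = False


def classify(obs: Observation) -> str:
    """Map an observation to the highest applicable rollback severity."""
    if obs.healthcheck_failed_within_5_min or not obs.clean_shutdown:
        return "immediate"
    if (obs.perf_regression_pct > 10.0
            or obs.dropped_over_flight_threshold
            or obs.eviction_above_baseline):
        return "same-day"
    if obs.operator_concern:
        return "manual"
    return "none"
```

Checking the immediate triggers first means a failed `HEALTHCHECK` is never downgraded because a softer signal is also present.
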
### Rollback Steps (airborne Jetson)

1. **Confirm the flag** — `/run/azaion/in-flight` is clear. If a flight is live, the FC's failsafe + operator's QGC abort path take precedence; rollback happens after landing.
2. **Identify the previous-good SHA** — `journalctl -g AZAION_UPDATE_EVENT --since "24 hours ago"` on the affected Jetson shows the last successful revision.
3. **Tag rollback** — operator retags the registry: `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm` → previous SHA. (Cycle-1: operator pulls + retags via the registry UI; cycle-2: scripted via `scripts/deploy.sh rollback <sha>`.)
4. **Wait for Watchtower** — next poll detects the SHA change + pulls the previous image.
5. **Verify** — `journalctl -g AZAION_UPDATE_EVENT --since "10 min ago"` shows the rollback revision; companion `HEALTHCHECK` is healthy.
6. **DB rollback** — cycle-1: not applicable (bootstrap-only schema). Cycle-2+: if the new image applied a migration, run the DOWN script if reversible; otherwise escalate to GPS-Denied Onboard owner + suite operator before proceeding.
7. **Notify** — stakeholders informed; rollback flagged for post-mortem within 24 hours.

### Rollback Steps (operator workstation)

1. `scripts/stop-services.sh` (Step 7) stops the operator-orchestrator service.
2. Operator runs `scripts/pull-images.sh <previous_sha>` (Step 7).
3. `scripts/start-services.sh <previous_sha>` (Step 7) brings the previous image up.
4. Verify via `HEALTHCHECK` + offline `--flight-file` smoke.
5. DB rollback as above (cycle-1 n/a; cycle-2+ per migration tool).
6. Notify suite operator.

### Post-mortem (required after every production rollback)

Recorded in `_docs/_process_leftovers/<YYYY-MM-DD>_<topic>_rollback.md` and replayed at the next `/autodev` invocation per `.cursor/rules/tracker.mdc` Leftovers Mechanism. Contents:

- **Timeline** — `AZAION_UPDATE_EVENT` deploy event → first failure observation → rollback completion.
- **Root cause** — pulled from FDR + journald + Woodpecker pipeline.
- **What went wrong** — gate that should have caught it (CI? HITL? Pre-flight checklist?).
- **Prevention** — concrete checklist edit or test addition. Lessons appended to `_docs/LESSONS.md` per the autodev retrospective conventions.

## Deployment Checklist

The pre-deploy checklist above is the canonical one. It is repeated here in the standard skill format for traceability:

- [ ] All CI tests pass on the target branch (cycle-1: `01-test.yml` manual run; cycle-2: push gate)
- [ ] Security scan clean — re-validated against current pins; OpenCV CVE replay condition checked (`_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md`)
- [ ] Docker images built + pushed under `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>`; OCI labels + `AZAION_REVISION` env stamped per AZ-204
- [ ] Database migrations (cycle-2+): reviewed, tested, backward-compatible, flight-state-gated, operator-approved
- [ ] Environment variables configured per-environment per `environment_strategy.md` § Environment Variables
- [ ] Health check (`python3 -m gps_denied_onboard.healthcheck`) returns 0 on a dry-run against the target image
- [ ] Observability touchpoints active: `LOG_SINK` honoured, FDR mount writable, `jetson-stats` accessible inside the container (Tier-2)
- [ ] Rollback plan documented — previous-good SHA recorded; rollback steps reviewed
- [ ] Stakeholders notified of deployment window (flight operator + suite operator + GPS-Denied Onboard owner)
- [ ] Operator available during the post-deploy monitoring window (first 15 minutes + first flight)

## Self-verification

- [x] Deployment strategy chosen (Watchtower floating-tag pull-on-ground) and justified (single instance per role, ground-only updates, FC-managed in-flight failsafe)
- [x] Zero-downtime stance: **not applicable in flight**; ground-only — explicitly justified
- [x] Health checks defined (exec-based `HEALTHCHECK` covering liveness + readiness; FC watchdog covers in-flight liveness via FC failsafe)
- [x] Rollback trigger criteria (immediate / same-day / manual) + steps for both airborne and operator workstation
- [x] Deployment checklist complete and grounded in the project's actual gates (`AZAION_UPDATE_EVENT` audit, CVE replay, `/run/azaion/in-flight` flag, signing key provisioning)
- [x] Post-mortem path defined and tied to the `_docs/_process_leftovers/` + `_docs/LESSONS.md` mechanism
- [x] Graceful-shutdown sequence covers the FDR-flush + `flight_footer.clean_shutdown` invariants

## BLOCKING — User Confirmation Required

This is the deploy skill Step 6 BLOCKING gate per `.cursor/skills/deploy/SKILL.md` § Methodology Quick Reference. Step 7 (Deployment Scripts) writes executable shell scripts that automate the procedures above; user confirmation that the procedure is correct is required before scripts are generated.