Oleksandr Bezdieniezhnykh bf13549b32
[autodev] Update configuration and documentation for cycle-1
- Enhanced `.env.example` with detailed CMake build flags and replay-mode strategy flags for development and CI environments.
- Updated `.gitignore` to include a new deploy rollback bookmark.
- Revised `_docs/_autodev_state.md` to reflect the current task status and steps.
- Added new lessons to `_docs/LESSONS.md` regarding testing and architectural improvements.
- Documented changes in `_docs/02_document/deployment/ci_cd_pipeline.md` to reflect the relaxed OpenCV version pin.
- Updated test data documentation in `_docs/02_document/tests/test-data.md` to clarify fixture usage and paths.

This commit continues the cycle-1 documentation sync and addresses various configuration updates for improved clarity and functionality.
2026-05-20 08:05:35 +03:00

# GPS-Denied Onboard — Deployment Procedures
> Generated by `/autodev` greenfield Step 16 (Deploy) — Step 6. Builds on
> Step 15 (`reports/deploy_status_report.md`, `containerization.md`,
> `ci_cd_pipeline.md`, `environment_strategy.md`, `observability.md`). The
> deploy skill's standard procedure template (load-balanced HTTP service
> with blue-green / rolling / canary patterns) is adapted here for the
> system's actual topology: single airborne instance + single operator
> workstation, ground-only updates, FC-managed in-flight failsafe, and the
> parent-suite Watchtower flow with a flight-state gate.
## Deployment Strategy
### Pattern: **Floating-tag pull-on-ground (Watchtower-managed)**
| Aspect | Choice | Rationale |
|--------|--------|-----------|
| Update mechanism (airborne Jetson) | Parent-suite Watchtower polls `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm`; pulls + restarts when SHA changes | Suite-mandated pattern per `../_infra/deploy/jetson/README.md`. The fielded Jetson stack has Watchtower already running, polling all 9 application services on the same cadence. |
| Update mechanism (operator workstation) | Operator runs `docker compose pull && docker compose up -d` from `scripts/start-services.sh` | The operator workstation is single-user; cycle-1 does not need automatic updates. Cycle-2 may add a Watchtower instance on the workstation. |
| Update mechanism (lab Jetson — staging) | Same as airborne (Watchtower polling `dev-arm` or `stage-arm`) | Mirrors airborne so the bench rig validates the exact same update path. |
| Blue-green / rolling / canary | **None of the above** — N=1 instance per role | The airborne side has one Jetson per aircraft (no fleet); the operator workstation has one instance per operator. There is no load-balanced replica to roll over. |
| Zero-downtime requirement | **Not applicable in flight**; ground-only | Flights are discrete + bounded; the FC handles in-flight failsafe (AC-FC-FAILSAFE-1) if the companion is unavailable mid-flight. Updates do not happen during flight. |
| Ground-only safety gate | `/run/azaion/in-flight` flag (parent-suite `autopilot` service writes it on arm/disarm) | **Watchtower's post-update hook MUST refuse to restart the `gps-denied-onboard` container when this flag is set.** Honoured at the suite-compose layer, not in this submodule's image (the image only honours the flag at boot when transitioning between strategies). |
| Multi-aircraft rollout | Tag-based per-aircraft (operator can pin `:rev-<sha>-arm` instead of `:main-arm`) | Floating tag is the default; explicit SHA pinning is the manual override. Suite operator owns per-aircraft pinning. |
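The ground-only safety gate can be sketched in a few lines of Python. This is a minimal illustration, not the suite's hook: the helper names `in_flight` and `may_restart` are hypothetical, and the sketch assumes the flag counts as "set" when the file exists and holds non-whitespace content (consistent with the staging check that expects it to be empty or absent on the ground).

```python
from pathlib import Path

# Path taken from this document; the helpers below are hypothetical illustrations.
IN_FLIGHT_FLAG = Path("/run/azaion/in-flight")


def in_flight(flag_path: Path = IN_FLIGHT_FLAG) -> bool:
    """Return True when the flag file exists and holds non-whitespace content.

    Assumption: an absent or empty file means "on the ground".
    """
    try:
        return flag_path.read_text().strip() != ""
    except FileNotFoundError:
        return False


def may_restart(flag_path: Path = IN_FLIGHT_FLAG) -> bool:
    """Ground-only gate: a restart is allowed only when no flight is live."""
    return not in_flight(flag_path)
```

A hook built on this would defer the restart whenever `may_restart()` returns `False` and retry on the next poll.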
### Graceful Shutdown
The companion has **no inbound HTTP connections** (NFT-SEC-05 in-flight egress lockdown). "Graceful shutdown" means: drain in-flight FDR writes, flush the C13 segment, emit `flight_footer`, close MAVLink connection cleanly.
| Step | Action | Owner |
|------|--------|-------|
| 1 | systemd / Docker sends `SIGTERM` to PID 1 (`python3 -m gps_denied_onboard.runtime_root`) | OS layer |
| 2 | Runtime root sets the global `shutting_down` flag; all per-frame producers stop enqueuing new FDR records | runtime root |
| 3 | C13 writer drains the FDR SPSC ring (≤ 200 ms target — bounded by ring depth + writer throughput) | C13 |
| 4 | C13 emits `flight_footer` with `clean_shutdown=true`, `records_written`, `records_dropped_overrun`, `bytes_written`, `rollover_count` | C13 |
| 5 | C13 closes the active segment file (fsync, rename `.tmp` → final) | C13 |
| 6 | C8 sends final MAVLink `STATUSTEXT` and closes the FC serial connection | C8 |
| 7 | Process exits 0 | runtime root |
**Termination grace period (target)**: 30 seconds for the above sequence. If exceeded, Docker / systemd sends `SIGKILL`; `flight_footer.clean_shutdown` will be `false` on the next boot's recovery write, flagging the unclean shutdown for the post-flight summary.
**Cycle-1 status**: docker-compose.yml does **not** yet declare `stop_grace_period: 30s` — cycle-1 inherits Docker's default 10 s grace. The C13 ring drain target (≤ 200 ms) fits comfortably inside 10 s for the dev profile, but TensorRT engine teardown + gtsam factor cleanup on Tier-2 hardware are not yet measured. **Cycle-2 follow-up** (recorded in `_docs/_process_leftovers/` when this deploy plan lands): add `stop_grace_period: 30s` to the `companion` service in `docker-compose.yml` and to the `gps-denied-onboard` service in the parent-suite `../_infra/deploy/jetson/docker-compose.yml` once the Step 2 validation gate "TensorRT INT8 cache durability under Docker" (`containerization.md` § Step 2 Validation Gates) measures the actual drain budget on the Jetson.
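Steps 1 to 4 of the shutdown sequence can be sketched as follows. This is an illustrative outline under stated assumptions, not the runtime root's code: `queue.Queue` stands in for the FDR SPSC ring, a plain list stands in for the segment file, and the function names are hypothetical. Only the `shutting_down` flag and the footer field names come from this document.

```python
import queue
import signal
import threading

# Stand-in for the global `shutting_down` flag set by the runtime root.
shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Step 2: the handler only sets the flag; draining happens elsewhere."""
    shutting_down.set()


def install_handler():
    """Step 1 counterpart: register the handler (must run in the main thread)."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def enqueue(ring: queue.Queue, record: dict) -> bool:
    """Producers stop enqueuing new records once shutdown has begun."""
    if shutting_down.is_set():
        return False
    ring.put_nowait(record)
    return True


def drain_and_footer(ring: queue.Queue, sink: list) -> dict:
    """Steps 3-4: drain pending records, then append the footer record."""
    written = 0
    while True:
        try:
            sink.append(ring.get_nowait())
        except queue.Empty:
            break
        written += 1
    footer = {
        "kind": "flight_footer",
        "clean_shutdown": True,
        "records_written": written,
    }
    sink.append(footer)
    return footer
```

Keeping the signal handler down to a single flag write means the drain and footer logic runs in normal control flow, where it can be bounded and timed against the grace period.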
### Database Migration Ordering
Cycle-1 ships **no migration runner** — C6 bootstrap uses idempotent `CREATE TABLE IF NOT EXISTS`. Cycle-2+ rules (from `environment_strategy.md` § Migration Rules):
| Rule | Cycle-1 status | Cycle-2+ enforcement |
|------|----------------|----------------------|
| Migrations run **before** new code deploys | n/a — bootstrap-only | Alembic (or equivalent) migration step runs against staging first, then production, before the corresponding image pull is enabled |
| All migrations must be backward-compatible | n/a | Required: new schema works with previous image's read path until next release rotates both |
| Irreversible migrations require explicit operator approval | n/a | Required: Woodpecker UI approval gate + recorded in `_docs/04_deploy/migration_log.md` |
| Production migrations on the airborne Jetson refuse to run when `/run/azaion/in-flight` is set | n/a | Required: migration tool reads the flag at start; aborts with exit 0 + journald audit line if the flag is set |
| Production migrations on the operator workstation require operator approval | n/a | Required: interactive prompt in `start-services.sh` before applying |
## Health Checks
The companion has no HTTP `/health/live` or `/health/ready` endpoint (NFT-SEC-05). The Docker `HEALTHCHECK` is an **exec check** that re-runs the startup validation matrix (`environment_strategy.md` § Variable Validation) and inspects in-process liveness signals.
| Check | Type | Command / mechanism | Interval | Failure threshold | Action |
|-------|------|----------------------|----------|--------------------|--------|
| Liveness / Readiness | `HEALTHCHECK` exec | `python3 -m gps_denied_onboard.healthcheck` | 10 s (companion-tier1 / operator-orchestrator); 10 s (companion-jetson, with `--start-period=30s` for TensorRT engine deserialise) | 3 consecutive failures → Docker marks container `unhealthy` → systemd / Watchtower restarts | Same as readiness — no load balancer to drain. Watchtower honours `/run/azaion/in-flight` before restarting. |
| Startup probe | Same exec | Same command | 5 s once `--start-period` elapses | 30 attempts max | Kill + recreate; Watchtower retries the pull on next poll |
| FC adapter health (in-flight) | Watchdog | C8 watchdog from the FC | n/a — handled by the FC | MAVLink heartbeat loss > 1 s | FC drops to `SAFE_DEAD_RECKONING` or `RTL` per AC-FC-FAILSAFE-1 |
| FDR ring liveness | In-process monitor | `shared.fdr_client` overrun monitor | n/a | Producer enqueue failure | Emits `kind="overrun"` record (AC-NEW-3); never silent. Post-flight forensics surface; no in-flight action |
| `db` Postgres health (operator workstation + dev compose) | exec | `pg_isready -U gps_denied -d gps_denied` | 5 s | 10 failures | Docker / systemd restart the `db` service; the companion's healthcheck fails until DB is back |
| `mock-suite-sat-service` health (Tier-1 e2e only) | HTTP | GET `/healthz` on port 5100 | 5 s | 3 failures | Compose marks unhealthy; e2e-runner `--exit-code-from e2e-runner` surfaces failure |
### `python3 -m gps_denied_onboard.healthcheck` contract
The healthcheck module (already exists per `containerization.md`) re-runs:
1. **Required env vars validation** — same set as the composition root, but read-only (no side effects).
2. **C6 DB reachability** — `psycopg2.connect(DB_URL) → SELECT 1`.
3. **C13 FDR mount writability** — `os.access(FDR_PATH, os.W_OK)` + a probe write to a `.healthcheck` file.
4. **C7 backend availability** — for `INFERENCE_BACKEND=tensorrt`, validates the engine cache directory exists + is readable; for `pytorch_fp16`, no extra check (libtorch in-process).
5. **C8 FC adapter** — best-effort: attempts a non-blocking serial open if `GPS_DENIED_FC_PROFILE` is set + the device path is present. Absent device path is not a failure (dev / CI containers).
Exit codes: `0` healthy; `1` config-invalid; `2` dependency-unreachable; `3` resource-bound (e.g. FDR full). Docker treats any non-zero as `unhealthy`.
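The exit-code contract can be illustrated with a reduced sketch that covers only the env-var and FDR-mount checks (the DB, C7, and C8 checks are omitted because they need external dependencies). Everything here is hypothetical except the exit-code values, the `.healthcheck` probe file, and the variable names `DB_URL`, `FDR_PATH`, and `INFERENCE_BACKEND`. In particular, mapping a missing mount to `2` and a failed probe write to `3` is an assumption, not the module's documented behaviour.

```python
import os
from enum import IntEnum
from pathlib import Path


class Exit(IntEnum):
    HEALTHY = 0
    CONFIG_INVALID = 1
    DEPENDENCY_UNREACHABLE = 2
    RESOURCE_BOUND = 3


# Hypothetical subset; the full required set lives in the composition root.
REQUIRED_VARS = ("DB_URL", "FDR_PATH", "INFERENCE_BACKEND")


def check_env(env) -> bool:
    """Read-only validation: every required variable is present and non-empty."""
    return all(env.get(name) for name in REQUIRED_VARS)


def check_fdr(fdr_path: str) -> Exit:
    """os.access plus a probe write to a `.healthcheck` file."""
    if not os.path.isdir(fdr_path):
        return Exit.DEPENDENCY_UNREACHABLE  # assumption: missing mount maps to 2
    if not os.access(fdr_path, os.W_OK):
        return Exit.RESOURCE_BOUND
    try:
        (Path(fdr_path) / ".healthcheck").write_text("ok")
    except OSError:
        return Exit.RESOURCE_BOUND  # assumption: e.g. disk full maps to 3
    return Exit.HEALTHY


def run_checks(env) -> Exit:
    """Return the first failing check's code, or HEALTHY."""
    if not check_env(env):
        return Exit.CONFIG_INVALID
    return check_fdr(env["FDR_PATH"])
```

An entry point would pass `os.environ` to `run_checks` and hand the integer value to `sys.exit`, so that Docker sees any non-zero result as `unhealthy`.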
## Staging Deployment (lab Jetson HITL)
Treat the lab Jetson as a **mirror of production** for image promotion. Operator runs the procedure manually; cycle-2 may automate via the suite.
1. **CI/CD** has already built + pushed `${REGISTRY_HOST}/azaion/gps-denied-onboard-companion-tier1:dev-arm` + `…-operator-orchestrator:dev-arm` via `.woodpecker/02-build-push.yml` (cycle-1) or `companion-jetson:dev-arm` via cycle-2.
2. **Verify the flag** — `cat /run/azaion/in-flight` should be empty / absent on the lab Jetson (no live FC there). If a HITL session is running, wait for the bench session to end.
3. **Pull the new image** — `scripts/pull-images.sh dev` (Step 7). Watchtower may have already pulled if running on the lab Jetson.
4. **Restart the service** — `scripts/start-services.sh dev` (Step 7). Honours stop-grace-period; waits for HEALTHCHECK to report healthy.
5. **Run the HITL e2e suite** — `docker compose -f docker-compose.test.jetson.yml up --abort-on-container-exit --exit-code-from e2e-runner --build`. This runs the **Reality Gate** replay (Derkachi clip + recorded tlog) against the new image on Tier-2 hardware.
6. **Verify FDR output** — `python3 -m gps_denied_onboard.post_flight.summarise --segment /var/lib/gps-denied/fdr/segment-*.fdr` (cycle-1 ad-hoc tool; cycle-2 polish lands the full replay viewer). Confirm `flight_footer.clean_shutdown == true` and `records_dropped_overrun == 0`.
7. **If gates pass** → promote: tag `${REGISTRY_HOST}/azaion/gps-denied-onboard:<sha>-arm` (or repurpose by branch promotion from `dev-arm` → `stage-arm` once cycle-2 wires environment branches per `ci_cd_pipeline.md` Quality Gates `Multi-environment deployment` row).
8. **If gates fail** → file a Jira issue under E-DEPLOY; roll back the lab Jetson per § Rollback Procedures.
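The FDR verification in step 6 reduces to two checks on the footer. A minimal sketch, assuming the footer has already been parsed into a structure: the `FlightFooter` dataclass and `hitl_gate_failures` function are hypothetical and do not reflect the actual output format of `post_flight.summarise`. Only the field names come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlightFooter:
    """Hypothetical parsed view of the `flight_footer` record."""
    clean_shutdown: bool
    records_written: int
    records_dropped_overrun: int


def hitl_gate_failures(footer: FlightFooter) -> list[str]:
    """Return the reasons the staging gate fails; an empty list means pass."""
    failures = []
    if not footer.clean_shutdown:
        failures.append("clean_shutdown is false")
    if footer.records_dropped_overrun != 0:
        failures.append(
            f"records_dropped_overrun={footer.records_dropped_overrun}"
        )
    return failures
```

Returning the list of reasons rather than a bare boolean gives the operator something concrete to paste into the issue filed in step 8.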
## Production Deployment (airborne Jetson + operator workstation)
Production deployment lands on each aircraft individually + on each operator workstation. The aircraft side is Watchtower-driven; the operator workstation side is operator-driven.
### Pre-deploy checks (operator-owned)
- [ ] **CI gates green** — `01-test.yml` passed on the target branch (cycle-1: manual trigger; cycle-2: push gate).
- [ ] **Security scan recent** — `_docs/05_security/dependency_scan.md` re-validated against the build SHA. The OpenCV pin per `_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md` is honoured.
- [ ] **HITL gate passed** — Staging Deployment steps 5–6 confirmed `clean_shutdown=true` and `records_dropped_overrun=0`.
- [ ] **Per-aircraft acceptance** — operator confirms the build's strategy flags (`BUILD_VINS_MONO`, `BUILD_SALAD`, `BUILD_C11_TILE_MANAGER`, replay flags, `BUILD_DEV_STATIC_KEY=OFF`) match the operational profile for the destination aircraft.
- [ ] **Calibration JSON onboard** — `/etc/gps-denied/calibration/adti20.json` (operator-acquired per D-PROJ-1) is staged on the aircraft Jetson NVM.
- [ ] **Signing key path provisioned** — `MAVLINK_SIGNING_KEY` resolves to a per-host writable path that `KeySource` will rotate at takeoff; no static key from `tests/fixtures/`.
- [ ] **Postgres credentials in `/etc/gps-denied/.pgpass`** — per-host random password (Step 7 `start-services.sh` writes this on first run).
- [ ] **`/run/azaion/in-flight` is clear** — no live flight in progress on the target aircraft.
- [ ] **Rollback target identified** — previous successful SHA recorded for the target aircraft (operator notebook + `journalctl -g AZAION_UPDATE_EVENT` on the Jetson).
- [ ] **Stakeholders notified** — flight operator + suite operator informed of the deploy window.
### Production Deployment — Airborne Jetson (Watchtower-driven)
1. **Tag promotion** — operator pushes the validated SHA to `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm` (or per-aircraft SHA pin if rolling out partial fleet).
2. **Wait for Watchtower poll** — default poll interval per suite config (typically ≤ 5 min).
3. **Watchtower pre-restart check** — Watchtower's post-update hook checks `/run/azaion/in-flight`; if set, defers the restart until the next poll.
4. **Container stop** — Docker sends `SIGTERM`; companion drains FDR (≤ 200 ms target) + emits `flight_footer` per § Graceful Shutdown. Exit must complete within 30 s grace period.
5. **Image pull complete** — Watchtower pulls the new image (already verified-by-tag; OCI labels embed the SHA).
6. **Container start** — Docker starts the new container; `HEALTHCHECK` `--start-period=30s` allows TensorRT engine deserialise + Postgres reconnect.
7. **Audit event emitted** — Watchtower's post-update hook emits `AZAION_UPDATE_EVENT` to journald (`observability.md` § Deploy Audit).
8. **Verify on the aircraft** — operator runs `journalctl -g AZAION_UPDATE_EVENT --since "10min ago"` on the Jetson; confirms the new revision SHA matches the intended tag.
9. **Run a ground HITL pre-flight** — operator brings up the bench-mounted aircraft, runs the standard pre-flight checklist (FC heartbeat, signing handshake, camera focus, NFT-SEC-04 image-decode smoke). Pre-flight refusal-to-arm on any gate failure is the production safety net.
10. **Monitor the first flight** — operator watches QGroundControl for STATUSTEXT messages from the companion + the `GpsDeniedHealth` MAVLink message stream during the first flight under the new image.
11. **Post-flight forensics** — after landing, operator pulls FDR segments + runs `post_flight.summarise`; confirms no regression vs the previous-SHA baseline (NFT-PERF gates per `_docs/02_document/tests/` baselines).
### Production Deployment — Operator Workstation (operator-driven)
1. **Pre-deploy checks** — same checklist as above, scoped to the operator-orchestrator image.
2. **Pull** — operator runs `scripts/pull-images.sh main` (Step 7).
3. **Stop** — `scripts/stop-services.sh` (Step 7) gracefully stops the operator-orchestrator service.
4. **Start** — `scripts/start-services.sh main` (Step 7) brings the new image up. `HEALTHCHECK` `--start-period=10s` allows DB reconnect.
5. **Audit** — `journalctl -g AZAION_UPDATE_EVENT --since "10min ago"` on the operator workstation confirms the new revision.
6. **Smoke test** — operator runs the C12 `--flight-file <offline_fixture>` path against a known-good flight DTO; verifies the `FlightsApiClient` round-trip succeeds.
### Post-deploy monitoring window
| Window | What to watch | Action on regression |
|--------|---------------|----------------------|
| First 15 min | journald `AZAION_UPDATE_EVENT` cadence; container `HEALTHCHECK` status | Roll back immediately (§ Rollback Procedures) |
| First flight (airborne) | QGC STATUSTEXT + `GpsDeniedHealth` MAVLink stream; FDR `overrun` count | Operator aborts flight if `GpsDeniedHealth` degrades; FC failsafe is the safety net |
| First post-flight pull (airborne) | FDR `flight_footer.clean_shutdown` flag; `records_dropped_overrun`; per-component `tile_match`, `c6.eviction_batch` baselines | If `clean_shutdown=false` or baselines drifted → roll back; required post-mortem |
## Rollback Procedures
### Trigger Criteria
| Severity | Trigger | Decision lead |
|----------|---------|---------------|
| **Immediate rollback** | New image fails `HEALTHCHECK` within 5 minutes of `AZAION_UPDATE_EVENT`; or `flight_footer.clean_shutdown=false` on the first flight under the new image | Flight operator (airborne) / Suite operator (workstation) |
| **Same-day rollback** | NFT-PERF baseline regression > 10% (frame deadline miss rate, end-to-end pose latency); FDR `records_dropped_overrun` > 0 above per-flight threshold; sustained `c6.eviction_batch` activity > baseline | Operator + GPS-Denied Onboard owner |
| **Manual rollback** | Operator judgement (visible operational anomaly without a clear FDR signal) | Operator |
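The automatic rows of the trigger table can be expressed as a small classifier. This is a sketch under assumptions: the `Observation` fields are hypothetical summaries of the signals named in the table, and the per-flight overrun threshold and eviction baseline are taken as already evaluated inputs because this document does not give their values. Manual rollback is left out since it rests on operator judgement.

```python
from dataclasses import dataclass
from enum import Enum


class Rollback(Enum):
    NONE = "none"
    SAME_DAY = "same-day"
    IMMEDIATE = "immediate"


@dataclass(frozen=True)
class Observation:
    """Hypothetical summary of post-deploy signals for one aircraft."""
    healthcheck_failed_within_5_min: bool
    clean_shutdown: bool
    perf_regression_pct: float      # vs the previous-SHA NFT-PERF baseline
    dropped_over_threshold: bool    # records_dropped_overrun above per-flight threshold
    eviction_above_baseline: bool   # sustained c6.eviction_batch activity


def classify(obs: Observation) -> Rollback:
    """Map observations to the most severe matching trigger."""
    if obs.healthcheck_failed_within_5_min or not obs.clean_shutdown:
        return Rollback.IMMEDIATE
    if (
        obs.perf_regression_pct > 10.0
        or obs.dropped_over_threshold
        or obs.eviction_above_baseline
    ):
        return Rollback.SAME_DAY
    return Rollback.NONE
```

Checking the immediate criteria first ensures that a deploy matching both rows is escalated to the stricter one.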
### Rollback Steps (airborne Jetson)
1. **Confirm the flag** — `/run/azaion/in-flight` is clear. If a flight is live, the FC's failsafe + operator's QGC abort path take precedence; rollback happens after landing.
2. **Identify the previous-good SHA** — `journalctl -g AZAION_UPDATE_EVENT --since "24h ago"` on the affected Jetson shows the last successful revision.
3. **Tag rollback** — operator retags the registry: `${REGISTRY_HOST}/azaion/gps-denied-onboard:main-arm` → previous SHA. (Cycle-1: operator pulls + retags via the registry UI; cycle-2: scripted via `scripts/deploy.sh rollback <sha>`.)
4. **Wait for Watchtower** — next poll detects the SHA change + pulls the previous image.
5. **Verify** — `journalctl -g AZAION_UPDATE_EVENT --since "10min ago"` shows the rollback revision; companion `HEALTHCHECK` is healthy.
6. **DB rollback** — cycle-1: not applicable (bootstrap-only schema). Cycle-2+: if the new image applied a migration, run the DOWN script if reversible; otherwise escalate to GPS-Denied Onboard owner + suite operator before proceeding.
7. **Notify** — stakeholders informed; rollback flagged for post-mortem within 24 hours.
### Rollback Steps (operator workstation)
1. `scripts/stop-services.sh` (Step 7) stops the operator-orchestrator service.
2. Operator runs `scripts/pull-images.sh <previous_sha>` (Step 7).
3. `scripts/start-services.sh <previous_sha>` (Step 7) brings the previous image up.
4. Verify via `HEALTHCHECK` + offline `--flight-file` smoke.
5. DB rollback as above (cycle-1 n/a; cycle-2+ per migration tool).
6. Notify suite operator.
### Post-mortem (required after every production rollback)
Recorded in `_docs/_process_leftovers/<YYYY-MM-DD>_<topic>_rollback.md` and replayed at the next `/autodev` invocation per `.cursor/rules/tracker.mdc` Leftovers Mechanism. Contents:
- **Timeline** — `AZAION_UPDATE_EVENT` deploy event → first failure observation → rollback completion.
- **Root cause** — pulled from FDR + journald + Woodpecker pipeline.
- **What went wrong** — gate that should have caught it (CI? HITL? Pre-flight checklist?).
- **Prevention** — concrete checklist edit or test addition. Lessons appended to `_docs/LESSONS.md` per the autodev retrospective conventions.
## Deployment Checklist
The pre-deploy checklist above is the canonical one. It is repeated here in the standard skill format for traceability:
- [ ] All CI tests pass on the target branch (cycle-1: `01-test.yml` manual run; cycle-2: push gate)
- [ ] Security scan clean — re-validated against current pins; OpenCV CVE replay condition checked (`_docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md`)
- [ ] Docker images built + pushed under `${REGISTRY_HOST}/azaion/<service>:<branch>-<arch>`; OCI labels + `AZAION_REVISION` env stamped per AZ-204
- [ ] Database migrations (cycle-2+): reviewed, tested, backward-compatible, flight-state-gated, operator-approved
- [ ] Environment variables configured per-environment per `environment_strategy.md` § Environment Variables
- [ ] Health check (`python3 -m gps_denied_onboard.healthcheck`) returns 0 on a dry-run against the target image
- [ ] Observability touchpoints active: `LOG_SINK` honoured, FDR mount writable, `jetson-stats` accessible inside the container (Tier-2)
- [ ] Rollback plan documented — previous-good SHA recorded; rollback steps reviewed
- [ ] Stakeholders notified of deployment window (flight operator + suite operator + GPS-Denied Onboard owner)
- [ ] Operator available during the post-deploy monitoring window (first 15 minutes + first flight)
## Self-verification
- [x] Deployment strategy chosen (Watchtower floating-tag pull-on-ground) and justified (single instance per role, ground-only updates, FC-managed in-flight failsafe)
- [x] Zero-downtime stance: **not applicable in flight**; ground-only — explicitly justified
- [x] Health checks defined (exec-based `HEALTHCHECK` covering liveness + readiness; FC watchdog covers in-flight liveness via FC failsafe)
- [x] Rollback trigger criteria (immediate / same-day / manual) + steps for both airborne and operator workstation
- [x] Deployment checklist complete and grounded in the project's actual gates (`AZAION_UPDATE_EVENT` audit, CVE replay, `/run/azaion/in-flight` flag, signing key provisioning)
- [x] Post-mortem path defined and tied to the `_docs/_process_leftovers/` + `_docs/LESSONS.md` mechanism
- [x] Graceful-shutdown sequence covers the FDR-flush + `flight_footer.clean_shutdown` invariants
## BLOCKING — User Confirmation Required
This is the deploy skill Step 6 BLOCKING gate per `.cursor/skills/deploy/SKILL.md` § Methodology Quick Reference. Step 7 (Deployment Scripts) writes executable shell scripts that automate the procedures above; user confirmation that the procedure is correct is required before scripts are generated.