gps-denied-onboard/_docs/02_document/deployment/deployment_procedures.md
Oleksandr Bezdieniezhnykh 5fe67023b2 [AZ-329] [AZ-330] [AZ-523] [AZ-524] Batch 44 atomic refactor
Implements two new C12 services and rebalances the C11/C12 boundary
in one atomic commit:

* AZ-329 PostLandingUploadOrchestrator — gates C11 upload on the
  `flight_footer` FDR record's `clean_shutdown` field; 4 refusal
  modes; new FdrFooterReader Protocol + LocalFdrFooterReader.
* AZ-330 OperatorReLocService — AC-3.4 visual-loss re-localization
  hint; reuses shared LatLonAlt; OperatorCommandTransport Protocol
  cut (E-C8 owns the future pymavlink concrete); new FDR record
  kind `c12.reloc.requested`; log redaction (lat/lon 5 decimals,
  reason 200 chars).
* AZ-523 C11 internal flight-state gate removed (SRP refactor):
  `confirm_flight_state` / `FlightStateSignal` use /
  `FlightStateNotOnGroundError` deleted from C11; TileUploader
  contract bumped to v2.0.0 (frozen) with migration note; AZ-317
  superseded.
* AZ-524 Package rename `c12_operator_tooling` →
  `c12_operator_orchestrator` across source, tests, pyproject,
  CMake, Dockerfile, compose, CI, runtime-root services class
  (`OperatorOrchestratorServices`) + factory function
  (`build_operator_orchestrator`), logger namespaces, config slug,
  docs, and the E-C12 epic title.

Tests: 1543 passed, 80 skipped (all environment gates). Targeted
AC suite (AZ-329 + AZ-330 + FdrFooterReader): 37 passed. Cold-start
NFR-perf still ≤ 500 ms p99.

Tracker: AZ-317 → Done (superseded); AZ-319 v2.0.0 contract bump
comment; AZ-329/AZ-330 → In Testing; AZ-253 epic renamed; AZ-523
+ AZ-524 created and closed as audit-trail tickets.

See `_docs/03_implementation/batch_44_cycle1_report.md`.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-13 19:42:46 +03:00


# GPS-Denied Onboard — Deployment Procedures
> Date: 2026-05-09 (Plan Phase 2c — initial draft).
> Inputs: `_docs/02_document/architecture.md` § 3 (Deployment Model) + § 7 (Security); `_docs/02_document/data_model.md` § 4 (Migration Strategy); environment_strategy.md; ADR-002, ADR-004, ADR-005; AC-NEW-1, AC-NEW-3, AC-NEW-4, AC-NEW-5.
## Deployment scope and model
This project does **not** ship a service; it ships an **embedded edge image** plus an **operator-orchestrator bundle**. The "deployment" patterns from the standard template (blue-green / rolling / canary) are not applicable. Deployment for this project means:
| Artifact | Target | Deployment mechanism |
|---|---|---|
| **JetPack image** (`gps-denied-jetpack-<semver>-<sha>.img`) | Production Jetson Orin Nano Super on a UAV | Operator flashes the image onto the Jetson via NVIDIA `sdkmanager` or `Etcher`-style `dd` from the operator workstation |
| **Operator-orchestrator tarball** (`operator-orchestrator-<semver>-<sha>.tar.gz`) | Operator workstation | Operator extracts; `docker compose up -d` brings up `mock-suite-sat-service` (when offline) + `operator-orchestrator` |
| **Tier-1 dev compose** | Developer workstation | Developer runs `docker compose up` from repo root |
**Zero-downtime is not a goal**: a UAV is not in service while it is being re-flashed. The deployment cadence is per-airframe maintenance, not per-request availability.
**Strategy**: the closest analogue to a "rolling deploy" is the operator's fleet-management process (re-flash one UAV at a time across the fleet). The fleet-management process is the operator's concern, not this project's; this document covers the per-airframe procedure.
## Pre-deployment artifact assembly (release engineer)
Performed once per release on Tier-1 + Tier-2 CI; produces signed artifacts stored in the release bucket.
1. Tag a commit on `main`. CI runs the full pipeline (`ci_cd_pipeline.md`).
2. **Tier-1 produces**:
   - `companion-tier1:deployment-<sha>` and `companion-tier1:research-<sha>` Docker images (pushed to registry).
   - `mock-suite-sat-service:<sha>` Docker image.
   - `operator-orchestrator:<sha>` Docker image.
   - SBOM artifacts for both binaries (deployment and research).
   - `operator-orchestrator-<semver>-<sha>.tar.gz` containing the operator-orchestrator image + mock-sat image + their compose file + verification script + relevant docs.
3. **Tier-2 produces**:
   - Native deployment-binary build on the self-hosted Jetson runner.
   - SBOM verification: byte-equal (after canonicalization) to Tier-1's deployment-binary SBOM. Mismatch fails the release.
   - **JetPack image build**: a JetPack 6.2 base image with the deployment binary + PostgreSQL 16 + base migrations + `/etc/gps-denied/runtime.yaml` template preinstalled. Output: `gps-denied-jetpack-<semver>-<sha>.img`.
4. **Signing** (Tier-1):
   - Both Docker image manifests are signed with the project's release key.
   - The JetPack image is signed; checksum is published as a separate signed file (`gps-denied-jetpack-<semver>-<sha>.img.sha256.sig`).
   - The operator-orchestrator tarball is signed.
5. **Release bucket**: artifacts uploaded; release notes published; the previous release's artifacts retained for at least 90 days for rollback support.
A release fails if any step above fails — including any AC-bound NFT failure on Tier-2 (`ci_cd_pipeline.md` § AC-bound NFTs).
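The Tier-2 SBOM check ("byte-equal after canonicalization") can be sketched as below. This is a minimal illustration, not the project's verifier: it assumes the SBOMs are JSON documents, and the set of volatile top-level keys (`VOLATILE_KEYS`) and both function names are hypothetical. It also leaves array order untouched, so a real canonicalizer may need an additional sort step.

```python
import json

# Hypothetical: top-level fields assumed to differ between otherwise identical builds.
VOLATILE_KEYS = frozenset({"serialNumber", "metadata"})


def canonicalize_sbom(raw: bytes, volatile_keys=VOLATILE_KEYS) -> bytes:
    """Return a canonical byte form: volatile keys dropped, object keys sorted."""
    doc = json.loads(raw)
    stable = {k: v for k, v in doc.items() if k not in volatile_keys}
    # sort_keys applies at every nesting level; compact separators remove
    # whitespace differences between producers.
    return json.dumps(stable, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sboms_match(tier1_raw: bytes, tier2_raw: bytes) -> bool:
    """Byte equality on the canonical forms; a False result would fail the release."""
    return canonicalize_sbom(tier1_raw) == canonicalize_sbom(tier2_raw)
```

Comparing canonical bytes rather than parsed objects keeps the check aligned with the "byte-equal" wording of the release rule.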
## Pre-takeoff readiness gate ("health check" analog)
Production has no `/health/live` HTTP endpoint (no listener; NFT-SEC-05). The companion's "health check" is the **pre-takeoff readiness gate**: a sequence of checks that runs at takeoff load and decides whether the companion is ready to emit external position to the FC.
| Check | What it validates | Action on failure |
|---|---|---|
| Manifest content-hash gate (D-C10-3) | The on-disk manifest matches the operator-staged manifest hash (data_model.md § 2.4) | FDR record `0x000D ContentHashGateFail` + STATUSTEXT critical + companion refuses to publish a `GPS_INPUT` / `MSP2_SENSOR_GPS` source |
| Camera calibration JSON validation | File present + schema-valid + content-hash matches `manifests.calibration_artifact_hash` | Same |
| FAISS `.index` mmap + content-hash | mmap succeeds + content-hash matches `manifests.descriptor_index_hash` | Same |
| TRT engine cache verification | All required engines present per `engine_cache_entries`; each engine's content-hash matches `engine_hash` | Same |
| `alembic current == head` | DB schema is up-to-date for this binary | Same |
| MAVLink-2.0 signing handshake (AP profile) | Signed handshake with the FC succeeds within AC-NEW-1 30 s budget (D-C8-9 = (d)) | FDR record `MavlinkSigningKeyRotated` with reason "handshake_failed" + STATUSTEXT critical + companion refuses to emit |
| Per-flight key generation | Both per-flight ephemeral keys (MAVLink signing + onboard tile signing) generated and persisted under `/var/lib/gps-denied/per-flight/` | Same |
| Initial frame → emit pipeline test | First nav-camera frame reaches C8 outbound encoder; `EmittedExternalPosition` produced | Same |
| Network egress is denied | Verify no outbound network egress is possible (DNS blackhole effective, iptables OUTPUT REJECT loaded) — defense-in-depth on architecture.md § 7 + NFT-SEC-05 | FDR critical + STATUSTEXT + refuse to emit |
The gate completes within the AC-NEW-1 30 s p95 budget; failure produces a clear FDR + STATUSTEXT trail and the companion's `GPS_INPUT` / `MSP2_SENSOR_GPS` channel stays silent — the FC operates as if no companion-GPS source is available, which is the correct safe-default.
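The fail-closed behaviour of the gate can be illustrated with a short sketch. It assumes each check is a zero-argument callable returning `True` on success; `GateResult` and `run_readiness_gate` are hypothetical names, and treating an exception inside a check as a failure is an assumption made here, not something the gate table specifies.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class GateResult:
    ready: bool                   # True only if every check passed
    failed_check: Optional[str]   # name of the first failing check, if any
    elapsed_s: float              # wall time, to compare against the 30 s budget


def run_readiness_gate(checks: List[Tuple[str, Callable[[], bool]]]) -> GateResult:
    """Run the checks in order and stop at the first failure (fail closed)."""
    start = time.monotonic()
    for name, check in checks:
        try:
            passed = bool(check())
        except Exception:
            # Assumption: a crashing check counts as a failed check.
            passed = False
        if not passed:
            return GateResult(False, name, time.monotonic() - start)
    return GateResult(True, None, time.monotonic() - start)
```

A caller would emit position to the FC only when `ready` is `True`; on any other outcome the channel stays silent, which matches the safe default described above.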
## Production deployment procedure (per-airframe)
This is the per-airframe deployment procedure performed by the operator, NOT by CI.
### 1. Pre-deploy approval
Required before any production-bound flight:
- [ ] Release notes for the target version reviewed; AC-NEW-4 / AC-NEW-7 statistical summaries reviewed.
- [ ] All Tier-2 AC-bound NFTs green at the target version (`ci_cd_pipeline.md` § AC-bound NFTs).
- [ ] Security audit of the target version completed (Tier-1 SBOM clean of unpatched CVEs; D-CROSS-CVE-1).
- [ ] D-PROJ-1 calibration step performed on the target Jetson + UAV pairing (hybrid factory + checkerboard-refined; ~1 day per deployed unit).
- [ ] Rollback artifact (the previous release's JetPack image) is staged on the operator workstation.
- [ ] FDR retention policy for this airframe confirmed (default 30 days; environment_strategy.md § Database Management).
- [ ] If switching FC profile (between `ardupilot_plane` and `inav`), FC firmware compatibility confirmed.
### 2. Pre-deploy checks (operator workstation)
```sh
# Verify the artifact bundle integrity.
cosign verify-blob \
  --signature gps-denied-jetpack-<semver>-<sha>.img.sha256.sig \
  --key gps-denied-release-key.pub \
  gps-denied-jetpack-<semver>-<sha>.img.sha256
sha256sum -c gps-denied-jetpack-<semver>-<sha>.img.sha256
# Verify the operator-orchestrator tarball.
cosign verify-blob \
  --signature operator-orchestrator-<semver>-<sha>.tar.gz.sig \
  --key gps-denied-release-key.pub \
  operator-orchestrator-<semver>-<sha>.tar.gz
```
### 3. Pre-flight cache build (operator-orchestrator C12)
Performed on the operator workstation, with `satellite-provider` reachable (locally mirrored or via lab VPN).
```sh
docker compose -f operator-orchestrator-compose.yml up -d
# Operator opens http://127.0.0.1:8080
```
The C12 UI walks the operator through:
1. Upload / select the target operational sector (GeoJSON polygon).
2. Set sector classifications (`active_conflict` / `stable_rear`), which drive the freshness threshold (data_model.md § 2.3).
3. Tile download from `satellite-provider` (parent suite) — produces `tiles` rows with `source='googlemaps'` + filesystem JPEGs.
4. Descriptor (FAISS) index generation across the loaded tile corpus.
5. TRT engine compilation on the workstation (Tier-2 emulation if no Jetson is present, or directly on a co-located Jetson dev kit).
6. Manifest generation: hash over (model bundle + calibration JSON + corpus + sector classifications + descriptor index + engine cache).
7. Output: a sealed pre-flight bundle on a USB drive or staged for direct ethernet transfer.
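Step 6 (one manifest hash over six inputs) can be sketched as a hash over per-component digests. The digest algorithm (SHA-256), the component key names, and the `name=digest` line encoding are all assumptions for illustration; the authoritative definition is in data_model.md.

```python
import hashlib
from typing import Mapping

# Hypothetical key names for the six inputs listed in step 6, in a fixed order.
COMPONENTS = (
    "model_bundle",
    "calibration_json",
    "tile_corpus",
    "sector_classifications",
    "descriptor_index",
    "engine_cache",
)


def manifest_hash(component_hashes: Mapping[str, str]) -> str:
    """Combine per-component hex digests into one manifest digest."""
    missing = [c for c in COMPONENTS if c not in component_hashes]
    if missing:
        raise ValueError(f"missing component hashes: {missing}")
    h = hashlib.sha256()
    for name in COMPONENTS:
        # Fixed order plus a name prefix, so two components cannot swap
        # digests without changing the result.
        h.update(f"{name}={component_hashes[name]}\n".encode("utf-8"))
    return h.hexdigest()
```

Because every input feeds the digest, changing any single component (for example, recompiling one TRT engine) yields a different manifest hash, which is what the content-hash gate at takeoff relies on.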
### 4. JetPack image flash
Operator flashes the target JetPack image onto the Jetson:
```sh
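# Option A: raw write from the operator workstation (replace /dev/sdX with the target device).
sudo dd if=gps-denied-jetpack-<semver>-<sha>.img of=/dev/sdX bs=4M status=progress
sync
# Option B: use NVIDIA SDK Manager for a more guided flow.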
```
The flashed image contains:
- JetPack 6.2 base
- The deployment binary preinstalled at `/opt/gps-denied/`
- PostgreSQL 16 with `alembic` schema initialized at the target migration head
- `/etc/gps-denied/runtime.yaml` template (the operator fills in airframe-specific values: `fc_profile`, `companion_id`)
- A systemd unit `gps-denied.service` that auto-starts at boot
The image is **identical across UAVs**; per-airframe configuration (`/etc/gps-denied/runtime.yaml`) is filled in after flash.
### 5. Per-airframe configuration
Operator boots the Jetson in maintenance mode, connects over SSH (this is the only time the Jetson exposes any inbound network surface; it is closed before takeoff), and runs:
```sh
sudo $EDITOR /etc/gps-denied/runtime.yaml
# Set: fc_profile, companion_id, fdr_retention_days, log_level
sudo gps-denied-cli stage-cache /mnt/usb/gps-denied-cache-<sector-id>.tar.gz
# Stages the operator-prepared cache + calibration + manifest into /var/lib/gps-denied/.
sudo gps-denied-cli verify-readiness
# Runs all gate checks except MAVLink signing handshake (which requires the FC to be powered).
```
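Before staging the cache, the edited values can be sanity-checked. The sketch below validates an already-parsed configuration mapping; the four key names come from the comment in the block above, while the validation rules themselves (allowed `fc_profile` values, positive integer retention) and the function name are assumptions, not the behaviour of `gps-denied-cli`.

```python
from typing import Any, List, Mapping

# The two FC profiles named in this document.
FC_PROFILES = {"ardupilot_plane", "inav"}
REQUIRED_KEYS = ("fc_profile", "companion_id", "fdr_retention_days", "log_level")


def validate_runtime_config(cfg: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in cfg]
    if "fc_profile" in cfg and cfg["fc_profile"] not in FC_PROFILES:
        errors.append(f"unknown fc_profile: {cfg['fc_profile']!r}")
    if "companion_id" in cfg and not str(cfg["companion_id"]).strip():
        errors.append("companion_id must not be empty")
    if "fdr_retention_days" in cfg:
        days = cfg["fdr_retention_days"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(days, bool) or not isinstance(days, int) or days <= 0:
            errors.append("fdr_retention_days must be a positive integer")
    return errors
```

For example, `validate_runtime_config({"fc_profile": "inav", "companion_id": "uav-07", "fdr_retention_days": 30, "log_level": "info"})` reports no problems; the `companion_id` and `log_level` values shown are made up.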
### 6. UAV integration
- Wire the Jetson UART/USB to the FC.
- For ArduPilot Plane: configure FC parameters per the AP-side checklist (`EK3_SRC1_POSXY = 3` or per the D-C8-2 = (b) configuration; `AHRS_EKF_TYPE = 3`).
- For iNav: configure `gps_provider = MSP`, `gps_ublox_use_galileo = OFF`.
- Power up the FC; verify MAVLink signing handshake completes within 30 s (AC-NEW-1).
### 7. First-flight commissioning
The first flight on a freshly-deployed airframe is a **commissioning flight**, not a production flight:
- Operator stays in line-of-sight.
- AC-5.2 fallback (FC IMU-only) is the primary safety net during commissioning.
- Operator manually triggers a `MAV_CMD_REQUEST_MESSAGE` to confirm `GPS_INPUT` is being received and the FC's EKF source-set switch responds correctly.
- If everything looks healthy on the GCS dashboard for 5+ minutes of cruise, the airframe is cleared for production flights.
### 8. Post-deploy monitoring
Post first commissioning flight:
- [ ] FDR retrieved and visualized on operator workstation (operator-orchestrator C12 dashboard, observability.md § 5.1).
- [ ] AC-NEW-4 statistics for the commissioning flight reviewed; outliers investigated.
- [ ] No FDR segment drops; no `ContentHashGateFail` events.
- [ ] Mid-flight tile generation working (the post-landing upload is verified separately; see § Post-landing tile upload).
- [ ] If everything green, the deployment is finalised; the previous release's JetPack image can be archived (still kept for rollback).
## Post-landing tile upload (per-flight, ADR-004)
Per AC-8.4 + ADR-004, upload of tiles generated mid-flight to `satellite-provider` happens **post-landing only**, and uses the operator-orchestrator's C11 Tile Manager (`TileUploader` interface; a separate binary, never linked into the airborne image).
```sh
# Operator either plugs the companion's NVM into the workstation or connects over SSH
# to the Jetson after it has been powered off and rebooted.
docker compose run operator-orchestrator \
  python -m operator_orchestrator.tilemanager upload \
  --flight-id <uuid> \
  --satellite-provider $SATELLITE_PROVIDER_URL \
  --signing-pubkey-fingerprint <fingerprint>
```
Behavior:
- Reads the local `tiles` rows where `source='onboard_ingest' AND voting_status='pending' AND flight_id=<uuid>`.
- Reads the corresponding JPEG body + sidecar JSON from filesystem.
- Reads the per-flight onboard tile-signing private key (still on the companion's NVM until FDR rolls over).
- Submits to `satellite-provider`'s `POST /api/satellite/tiles/ingest` endpoint (D-PROJ-2 contract).
- On 2xx success: deletes local row + JPEG + sidecar + emits FDR event `tile_uploaded`.
- On 4xx: leaves local data; emits FDR event `tile_upload_failed` with reason; operator decides next steps (likely a parent-suite issue).
- On 5xx: retries with exponential backoff; persistent failure → `tile_upload_failed` + operator review.
When the parent-suite voting layer (D-PROJ-2 design task #2) ships, this flow does NOT change on the onboard side — the parent suite's promotion logic is invisible to onboard-side upload.
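The per-status handling listed under "Behavior" can be sketched as a small retry loop. This is an illustration only: `submit` stands in for the real HTTP call and returns a status code, and the attempt limit, base delay, and treatment of statuses outside 2xx/4xx/5xx as failures are assumptions, not documented `TileUploader` behaviour.

```python
import time
from typing import Callable

UPLOADED = "tile_uploaded"
FAILED = "tile_upload_failed"


def upload_tile(
    submit: Callable[[], int],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the FDR event name that the outcome maps to."""
    for attempt in range(max_attempts):
        status = submit()
        if 200 <= status < 300:
            return UPLOADED   # caller then deletes row + JPEG + sidecar
        if not 500 <= status < 600:
            return FAILED     # 4xx (or unexpected status): keep local data, no retry
        if attempt + 1 < max_attempts:
            # 5xx: exponential backoff, base_delay_s * 1, 2, 4, ...
            sleep(base_delay_s * 2 ** attempt)
    return FAILED             # persistent 5xx: operator review
```

Injecting `sleep` keeps the function testable without real delays; a production version would likely also add jitter and carry the failure reason into the FDR event.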
## Rollback Procedures
### Trigger criteria
| Severity | Trigger | Decision-maker |
|---|---|---|
| Critical (per-airframe) | Commissioning flight fails AC-5.2 fallback (the FC IMU-only fallback also failed; airframe lost) | Safety review board (out of scope of this project) |
| Critical (fleet-wide) | Any post-deploy AC-NEW-4 outlier indicates a regression: P(err > 1 km) measured on a real flight > AC threshold by ≥ 2x | Suite security + onboard team lead |
| High (per-airframe) | Commissioning flight passes but post-flight FDR analysis shows AC-NEW-4 / AC-NEW-7 regression vs. prior release | Onboard team lead |
| High (per-airframe) | Operator unable to complete pre-flight readiness gate (manifest hash gate fails repeatedly) | Operator + onboard team lead |
| Medium (per-airframe) | Sustained `dead_reckoned` periods longer than expected; FDR segment drops occurring | Operator + onboard team lead (post-flight investigation; may not warrant immediate rollback) |
### Rollback steps (per-airframe)
1. **Re-flash** the previous release's JetPack image onto the affected Jetson (same procedure as § 4 with the previous artifact).
2. **Re-stage** the previous release's pre-flight bundle (the operator workstation retains it in the operator-orchestrator cache for ≥ 30 days).
3. **Re-run** the pre-takeoff readiness gate.
4. **Confirm** AC-5.2 fallback is still functional (it is FC firmware behavior; rolling back the companion image cannot break it, but verify on the GCS).
5. **Document** the rollback in the post-mortem template; include FDR snapshots from the offending flight (if any) plus the versions of the rollback artifacts.
### Database rollback (data_model.md § 4.2 reversibility)
Per data_model.md § 4.2, every Alembic migration MUST implement a working `downgrade()`. Rolling back the JetPack image to the previous release rolls back the schema to whatever migration head the previous release uses. Concretely:
- The previous release's JetPack image contains its own Alembic migration tree.
- On boot, the previous-release runtime asserts `alembic current == head_for_that_release`. If the database is on a NEWER head (because the airframe ran the new release between deployments), the runtime invokes `alembic downgrade <previous-release-head>` automatically.
- If a migration is **not reversible** (which requires an explicit ADR — data_model.md § 4.2), the rollback must be manually adjudicated by the operator + onboard team lead. This case is rare by policy.
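The boot-time decision described above reduces to comparing the database's current revision with the head expected by the running release. The sketch below assumes a linear revision history, ordered oldest to newest, that is available to the runtime and includes the newer revisions; the function name and revision ids are hypothetical.

```python
from typing import Optional, Sequence


def downgrade_target(
    history: Sequence[str], current: str, release_head: str
) -> Optional[str]:
    """Return the revision to downgrade to, or None if the schema already matches.

    `history` lists revision ids oldest to newest. Raises ValueError if either
    revision is unknown, and RuntimeError if the database is behind the head.
    """
    current_idx = history.index(current)
    head_idx = history.index(release_head)
    if current_idx > head_idx:
        return release_head   # database is ahead: roll the schema back
    if current_idx == head_idx:
        return None           # already at this release's head
    raise RuntimeError("database is behind this release's head; an upgrade is needed")
```

The returned revision is what would be passed to `alembic downgrade <previous-release-head>`; a non-reversible migration in the downgrade path still requires the manual adjudication described in the last bullet.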
### Post-mortem
Required after every rollback (per-airframe or fleet-wide):
- Timeline: when was the new release flashed; when did the failure surface; when was rollback initiated.
- Root cause: which AC was missed; which component is implicated; was it a regression introduced by this release or by a hardware/operational variable change.
- What went wrong in the release process: did Tier-2 CI catch it; if not, why not.
- Prevention: new test scenario added to NFT suite; new lint check; new rule in `_docs/LESSONS.md`.
- Distribution: post-mortem report stored under `_docs/06_metrics/incident_<YYYY-MM-DD>_<topic>.md` (per autodev failure-handling protocol).
## Deployment Checklist
Pre-flash:
- [ ] All Tier-2 AC-bound NFTs green at target version
- [ ] Security scan clean (zero critical / high CVEs; SBOM diff passes ADR-002 enforcement)
- [ ] Both Docker images built and pushed (deployment + research)
- [ ] JetPack image built, signed, checksummed
- [ ] Operator-orchestrator tarball built, signed
- [ ] Pre-flight bundle prepared by operator (cache + calibration + manifest)
- [ ] Pre-takeoff readiness gate behavior verified on a bench Jetson before flashing onto the production unit
- [ ] Rollback artifact (previous release JetPack image) staged on operator workstation
- [ ] FDR retention policy confirmed for the target airframe
Post-flash:
- [ ] Commissioning flight cleared per § 7
- [ ] FDR retrieved and analyzed; AC-NEW-4 / AC-NEW-7 statistics within expected envelope
- [ ] Post-landing upload procedure tested end-to-end (companion → operator workstation → `satellite-provider`)
- [ ] Operator runbook updated with airframe-specific notes (e.g., "this airframe has UART2 wired to FC")
## Tier-2 enablement
Until the Tier-2 self-hosted Jetson runner is fully provisioned:
- AC-bound NFTs are gated as **manual trigger only** on PRs (`ci_cd_pipeline.md` § Manual-trigger override).
- The merge gate on `dev` excludes Tier-2 NFTs; the merge gate on `stage` and `main` retains the full gate.
- The pre-takeoff readiness gate (§ Pre-takeoff readiness gate) is unaffected — it runs on the Jetson at every takeoff regardless of CI gating posture.
When the Tier-2 runner is in steady state, this section is removed and the merge gates harmonize across `dev` / `stage` / `main`.