GPS-Denied Onboard — Deployment Procedures

Date: 2026-05-09 (Plan Phase 2c — initial draft). Inputs: _docs/02_document/architecture.md § 3 (Deployment Model) + § 7 (Security); _docs/02_document/data_model.md § 4 (Migration Strategy); environment_strategy.md; ADR-002, ADR-004, ADR-005; AC-NEW-1, AC-NEW-3, AC-NEW-4, AC-NEW-5.

Deployment scope and model

This project does not ship a service; it ships an embedded edge image plus an operator-orchestrator bundle. The "deployment" patterns from the standard template (blue-green / rolling / canary) are not applicable. Deployment for this project means:

| Artifact | Target | Deployment mechanism |
|---|---|---|
| JetPack image (`gps-denied-jetpack-<semver>-<sha>.img`) | Production Jetson Orin Nano Super on a UAV | Operator flashes the image onto the Jetson via NVIDIA `sdkmanager` or Etcher-style `dd` from the operator workstation |
| Operator-orchestrator tarball (`operator-orchestrator-<semver>-<sha>.tar.gz`) | Operator workstation | Operator extracts; `docker compose up -d` brings up `mock-suite-sat-service` (when offline) + `operator-orchestrator` |
| Tier-1 dev compose | Developer workstation | Developer runs `docker compose up` from repo root |

Zero-downtime is not a goal: a UAV is not in service while it is being re-flashed. The deployment cadence is per-airframe maintenance, not per-request availability.

Strategy: the closest analogue to a "rolling deploy" is the operator's fleet-management process (re-flash one UAV at a time across the fleet). The fleet-management process is the operator's concern, not this project's; this document covers the per-airframe procedure.

Pre-deployment artifact assembly (release engineer)

Performed once per release on Tier-1 + Tier-2 CI; produces signed artifacts stored in the release bucket.

  1. Tag a commit on main. CI runs the full pipeline (ci_cd_pipeline.md).
  2. Tier-1 produces:
    • companion-tier1:deployment-<sha> and companion-tier1:research-<sha> Docker images (pushed to registry).
    • mock-suite-sat-service:<sha> Docker image.
    • operator-orchestrator:<sha> Docker image.
    • SBOM artifacts for both binaries (deployment and research).
    • operator-orchestrator-<semver>-<sha>.tar.gz containing the operator-orchestrator image + mock-sat image + their compose file + verification script + relevant docs.
  3. Tier-2 produces:
    • Native deployment-binary build on the self-hosted Jetson runner.
    • SBOM verification: byte-equal (after canonicalization) to Tier-1's deployment-binary SBOM. Mismatch fails the release.
    • JetPack image build: a JetPack 6.2 base image with the deployment binary + PostgreSQL 16 + base migrations + /etc/gps-denied/runtime.yaml template preinstalled. Output: gps-denied-jetpack-<semver>-<sha>.img.
  4. Signing (Tier-1):
    • Both companion-tier1 Docker image manifests (deployment and research) are signed with the project's release key.
    • The JetPack image is signed; checksum is published as a separate signed file (gps-denied-jetpack-<semver>-<sha>.img.sha256.sig).
    • The operator-orchestrator tarball is signed.
  5. Release bucket: artifacts uploaded; release notes published; the previous release's artifacts retained for at least 90 days for rollback support.

A release fails if any step above fails — including any AC-bound NFT failure on Tier-2 (ci_cd_pipeline.md § AC-bound NFTs).
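
The SBOM check in step 3 compares the two SBOMs byte-for-byte after canonicalization. The source does not specify the SBOM format or the canonicalization rules, so the following is only a minimal sketch under stated assumptions: the SBOM is JSON, canonical form means sorted keys and compact separators, and the keys in `VOLATILE_KEYS` (a hypothetical set) are the only fields allowed to differ between builds. List order is treated as significant.

```python
import json

# Hypothetical: keys assumed to vary between otherwise identical builds.
VOLATILE_KEYS = frozenset({"timestamp", "serialNumber"})


def _strip_volatile(node):
    """Recursively drop volatile keys so only build content is compared."""
    if isinstance(node, dict):
        return {k: _strip_volatile(v) for k, v in node.items() if k not in VOLATILE_KEYS}
    if isinstance(node, list):
        return [_strip_volatile(v) for v in node]
    return node


def canonical_sbom_bytes(sbom_text: str) -> bytes:
    """Canonical byte form: volatile keys removed, keys sorted, compact separators."""
    doc = _strip_volatile(json.loads(sbom_text))
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sboms_match(tier1_text: str, tier2_text: str) -> bool:
    """True when both SBOMs are byte-equal after canonicalization."""
    return canonical_sbom_bytes(tier1_text) == canonical_sbom_bytes(tier2_text)
```

A release pipeline following this sketch would fail the release whenever `sboms_match` returns `False`.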

Pre-takeoff readiness gate ("health check" analog)

Production has no /health/live HTTP endpoint (no listener; NFT-SEC-05). The companion's "health check" is the pre-takeoff readiness gate: a sequence of checks that runs at takeoff load and decides whether the companion is ready to emit external position to the FC.

| Check | What it validates | Action on failure |
|---|---|---|
| Manifest content-hash gate (D-C10-3) | The on-disk manifest matches the operator-staged manifest hash (data_model.md § 2.4) | FDR record 0x000D ContentHashGateFail + STATUSTEXT critical + companion refuses to publish a GPS_INPUT / MSP2_SENSOR_GPS source |
| Camera calibration JSON validation | File present + schema-valid + content-hash matches `manifests.calibration_artifact_hash` | Same |
| FAISS .index mmap + content-hash | mmap succeeds + content-hash matches `manifests.descriptor_index_hash` | Same |
| TRT engine cache verification | All required engines present per `engine_cache_entries`; each engine's content-hash matches `engine_hash` | Same |
| `alembic current == head` | DB schema is up-to-date for this binary | Same |
| MAVLink-2.0 signing handshake (AP profile) | Signed handshake with the FC succeeds within AC-NEW-1 30 s budget (D-C8-9 = (d)) | FDR record MavlinkSigningKeyRotated with reason "handshake_failed" + STATUSTEXT critical + companion refuses to emit |
| Per-flight key generation | Both per-flight ephemeral keys (MAVLink signing + onboard tile signing) generated and persisted under `/var/lib/gps-denied/per-flight/` | Same |
| Initial frame → emit pipeline test | First nav-camera frame reaches C8 outbound encoder; EmittedExternalPosition produced | Same |
| Network egress is denied | Verify no outbound network egress is possible (DNS blackhole effective, iptables OUTPUT REJECT loaded); defense-in-depth on architecture.md § 7 + NFT-SEC-05 | FDR critical + STATUSTEXT + refuse to emit |

The gate completes within the AC-NEW-1 30 s p95 budget. A failure produces a clear FDR + STATUSTEXT trail, and the companion's GPS_INPUT / MSP2_SENSOR_GPS channel stays silent: the FC operates as if no companion-GPS source is available, which is the correct safe default.
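
The gate's fail-closed behavior can be illustrated with a small runner. This is a sketch, not the companion's actual implementation: `run_readiness_gate` and `GateResult` are hypothetical names, the checks are plain callables, the gate stops at the first failure, and the 30 s budget (a p95 figure in the text) is treated here as a hard deadline for simplicity.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

GATE_BUDGET_S = 30.0  # AC-NEW-1 budget quoted in the text


@dataclass
class GateResult:
    ready: bool
    failures: List[str] = field(default_factory=list)
    elapsed_s: float = 0.0


def run_readiness_gate(
    checks: List[Tuple[str, Callable[[], bool]]],
    budget_s: float = GATE_BUDGET_S,
    clock: Callable[[], float] = time.monotonic,
) -> GateResult:
    """Run checks in order; stop at the first failure or once the budget is exceeded."""
    start = clock()
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failed check
        elapsed = clock() - start
        if not ok:
            return GateResult(False, [name], elapsed)
        if elapsed > budget_s:
            return GateResult(False, ["budget_exceeded"], elapsed)
    return GateResult(True, [], clock() - start)
```

In this sketch the caller would only enable the external-position channel when `ready` is `True`; any other outcome leaves the channel silent.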

Production deployment procedure (per-airframe)

This is the per-airframe deployment procedure performed by the operator, NOT by CI.

1. Pre-deploy approval

Required before any production-bound flight:

  • Release notes for the target version reviewed; AC-NEW-4 / AC-NEW-7 statistical summaries reviewed.
  • All Tier-2 AC-bound NFTs green at the target version (ci_cd_pipeline.md § AC-bound NFTs).
  • Security audit of the target version completed (Tier-1 SBOM clean of unpatched CVEs; D-CROSS-CVE-1).
  • D-PROJ-1 calibration step performed on the target Jetson + UAV pairing (hybrid factory + checkerboard-refined; ~1 day per deployed unit).
  • Rollback artifact (the previous release's JetPack image) is staged on the operator workstation.
  • FDR retention policy for this airframe confirmed (default 30 days; environment_strategy.md § Database Management).
  • If switching FC profile (between ardupilot_plane and inav), FC firmware compatibility confirmed.

2. Pre-deploy checks (operator workstation)

# Verify the artifact bundle integrity.
cosign verify-blob \
  --signature gps-denied-jetpack-<semver>-<sha>.img.sha256.sig \
  --key gps-denied-release-key.pub \
  gps-denied-jetpack-<semver>-<sha>.img.sha256

sha256sum -c gps-denied-jetpack-<semver>-<sha>.img.sha256

# Verify the operator-orchestrator tarball.
cosign verify-blob \
  --signature operator-orchestrator-<semver>-<sha>.tar.gz.sig \
  --key gps-denied-release-key.pub \
  operator-orchestrator-<semver>-<sha>.tar.gz
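
Where `sha256sum` is unavailable on the workstation, the checksum step could be approximated in Python. This is a minimal sketch, assuming the `.sha256` file uses the usual `sha256sum` line format (`<hex digest>  <file name>`) and sits next to the image; `sha256_of` and `verify_checksum_file` are hypothetical helper names. It covers only the checksum comparison and does not replace the `cosign` signature verification.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so a multi-gigabyte image is never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum_file(checksum_file: Path) -> bool:
    """Check every '<hex>  <name>' line against the file next to the checksum file."""
    lines = [ln for ln in checksum_file.read_text().splitlines() if ln.strip()]
    for line in lines:
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            return False  # malformed line
        expected, name = parts
        target = checksum_file.parent / name.lstrip("*")  # '*' marks binary mode
        if not target.is_file() or sha256_of(target) != expected.lower():
            return False
    return bool(lines)  # an empty checksum file verifies nothing
```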

3. Pre-flight cache build (operator-orchestrator C12)

Performed on the operator workstation, with satellite-provider reachable (locally mirrored or via lab VPN).

docker compose -f operator-orchestrator-compose.yml up -d
# Operator opens http://127.0.0.1:8080

The C12 UI walks the operator through:

  1. Upload / select the target operational sector (GeoJSON polygon).
  2. Set sector classifications (active_conflict / stable_rear), which drive the freshness threshold (data_model.md § 2.3).
  3. Tile download from satellite-provider (parent suite) — produces tiles rows with source='googlemaps' + filesystem JPEGs.
  4. Descriptor (FAISS) index generation across the loaded tile corpus.
  5. TRT engine compilation on the workstation (Tier-2 emulation if no Jetson is present, or directly on a co-located Jetson dev kit).
  6. Manifest generation: hash over (model bundle + calibration JSON + corpus + sector classifications + descriptor index + engine cache).
  7. Output: a sealed pre-flight bundle on a USB drive or staged for direct ethernet transfer.
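
Step 6 describes the manifest hash as a hash over six inputs. The source does not state the algorithm or the encoding, so the following is a sketch under assumptions: each input already has its own hex digest, SHA-256 is used, and the digests are combined in a fixed order. `COMPONENTS` and `manifest_hash` are hypothetical names.

```python
import hashlib
from typing import Mapping

# Component names follow the list in step 6; the fixed order is an assumption.
COMPONENTS = (
    "model_bundle",
    "calibration_json",
    "corpus",
    "sector_classifications",
    "descriptor_index",
    "engine_cache",
)


def manifest_hash(component_hashes: Mapping[str, str]) -> str:
    """Combine per-component hex digests into one manifest digest."""
    missing = [c for c in COMPONENTS if c not in component_hashes]
    if missing:
        raise ValueError(f"missing component hashes: {missing}")
    digest = hashlib.sha256()
    for name in COMPONENTS:
        # Unambiguous framing: name, NUL byte, lowercase hex digest, newline.
        digest.update(name.encode() + b"\0" + component_hashes[name].lower().encode() + b"\n")
    return digest.hexdigest()
```

Because the order is fixed by `COMPONENTS` rather than by the input mapping, the result does not depend on dictionary ordering, and changing any single component changes the manifest hash that the pre-takeoff gate compares against.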

4. JetPack image flash

Operator flashes the target JetPack image onto the Jetson:

# Replace /dev/sdX with the target device; dd overwrites it without confirmation.
sudo dd if=gps-denied-jetpack-<semver>-<sha>.img of=/dev/sdX bs=4M status=progress
sync
# OR use NVIDIA SDK Manager for a more guided flow.

The flashed image contains:

  • JetPack 6.2 base
  • The deployment binary preinstalled at /opt/gps-denied/
  • PostgreSQL 16 with alembic schema initialized at the target migration head
  • /etc/gps-denied/runtime.yaml template (the operator fills in airframe-specific values: fc_profile, companion_id)
  • A systemd unit gps-denied.service that auto-starts at boot

The image is identical across UAVs; per-airframe configuration (/etc/gps-denied/runtime.yaml) is filled in after flash.

5. Per-airframe configuration

Operator boots the Jetson in maintenance mode, ssh's in (this is the only time the Jetson has any inbound network surface; closed before takeoff), and:

sudo $EDITOR /etc/gps-denied/runtime.yaml
# Set: fc_profile, companion_id, fdr_retention_days, log_level
sudo gps-denied-cli stage-cache /mnt/usb/gps-denied-cache-<sector-id>.tar.gz
# Stages the operator-prepared cache + calibration + manifest into /var/lib/gps-denied/.
sudo gps-denied-cli verify-readiness
# Runs all gate checks except MAVLink signing handshake (which requires the FC to be powered).
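
Before staging the cache, the edited values could be sanity-checked. The sketch below validates an already-parsed configuration mapping. The four key names come from the comment above; the allowed values for `fc_profile` and `log_level` and the helper name `validate_runtime_config` are assumptions, not the project's actual schema.

```python
from typing import Any, Dict, List

# Key names come from the procedure; the allowed values are assumptions.
ALLOWED_FC_PROFILES = {"ardupilot_plane", "inav"}
ALLOWED_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


def validate_runtime_config(cfg: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the config looks sane."""
    problems = []
    fc_profile = cfg.get("fc_profile")
    if not isinstance(fc_profile, str) or fc_profile not in ALLOWED_FC_PROFILES:
        problems.append(f"fc_profile must be one of {sorted(ALLOWED_FC_PROFILES)}")
    companion_id = cfg.get("companion_id")
    if not isinstance(companion_id, str) or not companion_id.strip():
        problems.append("companion_id must be a non-empty string")
    days = cfg.get("fdr_retention_days", 30)  # 30 days is the default quoted in the text
    if isinstance(days, bool) or not isinstance(days, int) or days < 1:
        problems.append("fdr_retention_days must be a positive integer")
    if str(cfg.get("log_level", "INFO")).upper() not in ALLOWED_LOG_LEVELS:
        problems.append(f"log_level must be one of {sorted(ALLOWED_LOG_LEVELS)}")
    return problems
```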

6. UAV integration

  • Wire the Jetson UART/USB to the FC.
  • For ArduPilot Plane: configure FC parameters per the AP-side checklist (EK3_SRC1_POSXY = 3, or as required by the D-C8-2 = (b) configuration; AHRS_EKF_TYPE = 3).
  • For iNav: configure gps_provider = MSP, gps_ublox_use_galileo = OFF.
  • Power up the FC; verify MAVLink signing handshake completes within 30 s (AC-NEW-1).

7. First-flight commissioning

The first flight on a freshly-deployed airframe is a commissioning flight, not a production flight:

  • Operator stays in line-of-sight.
  • AC-5.2 fallback (FC IMU-only) is the primary safety net during commissioning.
  • Operator manually triggers a MAV_CMD_REQUEST_MESSAGE to confirm GPS_INPUT is being received and the FC's EKF source-set switch responds correctly.
  • If everything looks healthy on the GCS dashboard for 5+ minutes of cruise, the airframe is cleared for production flights.

8. Post-deploy monitoring

After the first commissioning flight:

  • FDR retrieved and visualized on operator workstation (operator-orchestrator C12 dashboard, observability.md § 5.1).
  • AC-NEW-4 statistics for the commissioning flight reviewed; outliers investigated.
  • No FDR segment drops; no ContentHashGateFail events.
  • Mid-flight tile generation is working (the post-landing upload is verified separately; see "Post-landing tile upload").
  • If everything green, the deployment is finalised; the previous release's JetPack image can be archived (still kept for rollback).

Post-landing tile upload (per-flight, ADR-004)

Per AC-8.4 + ADR-004, tiles generated mid-flight are uploaded to satellite-provider post-landing only, using the operator-orchestrator's C11 Tile Manager (TileUploader interface; a separate binary, never linked into the airborne image).

# Operator plugs the companion's NVM into the workstation OR ssh's into the Jetson after it has been powered off and rebooted.
docker compose run operator-orchestrator \
  python -m operator_orchestrator.tilemanager upload \
  --flight-id <uuid> \
  --satellite-provider "$SATELLITE_PROVIDER_URL" \
  --signing-pubkey-fingerprint <fingerprint>

Behavior:

  • Reads the local tiles rows where source='onboard_ingest' AND voting_status='pending' AND flight_id=<uuid>.
  • Reads the corresponding JPEG body + sidecar JSON from filesystem.
  • Reads the per-flight onboard tile-signing private key (still on the companion's NVM until FDR rolls over).
  • Submits to satellite-provider's POST /api/satellite/tiles/ingest endpoint (D-PROJ-2 contract).
  • On 2xx success: deletes local row + JPEG + sidecar + emits FDR event tile_uploaded.
  • On 4xx: leaves local data; emits FDR event tile_upload_failed with reason; operator decides next steps (likely a parent-suite issue).
  • On 5xx: retries with exponential backoff; persistent failure → tile_upload_failed + operator review.
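
The status handling above can be sketched as a small retry loop. This is an illustration only: `upload_tile` is a hypothetical helper, `submit` stands in for the HTTP call and returns a status code, and the attempt count and base delay are assumptions because the source does not give the backoff parameters.

```python
import time
from typing import Callable


def upload_tile(
    submit: Callable[[], int],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return 'tile_uploaded' or 'tile_upload_failed' following the status rules above."""
    for attempt in range(max_attempts):
        status = submit()
        if 200 <= status < 300:
            return "tile_uploaded"  # caller may now delete row + JPEG + sidecar
        if 400 <= status < 500:
            return "tile_upload_failed"  # no retry; local data stays in place
        # 5xx (and any other unexpected status) is retried with exponential backoff.
        if attempt < max_attempts - 1:
            sleep(base_delay_s * (2 ** attempt))  # 1 s, 2 s, 4 s, ...
    return "tile_upload_failed"  # persistent failure: hand over to operator review
```

Injecting `sleep` keeps the loop testable without real delays; the returned string matches the FDR event name the caller would emit.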

When the parent-suite voting layer (D-PROJ-2 design task #2) ships, this flow does NOT change on the onboard side — the parent suite's promotion logic is invisible to onboard-side upload.

Rollback Procedures

Trigger criteria

| Severity | Trigger | Decision-maker |
|---|---|---|
| Critical (per-airframe) | Commissioning flight fails AC-5.2 fallback (the FC IMU-only fallback also failed; airframe lost) | Safety review board (out of scope of this project) |
| Critical (fleet-wide) | Any post-deploy AC-NEW-4 outlier indicates a regression: P(err > 1 km) measured on a real flight exceeds the AC threshold by a factor of ≥ 2 | Suite security + onboard team lead |
| High (per-airframe) | Commissioning flight passes but post-flight FDR analysis shows AC-NEW-4 / AC-NEW-7 regression vs. prior release | Onboard team lead |
| High (per-airframe) | Operator unable to complete pre-flight readiness gate (manifest hash gate fails repeatedly) | Operator + onboard team lead |
| Medium (per-airframe) | Sustained dead_reckoned periods longer than expected; FDR segment drops occurring | Operator + onboard team lead (post-flight investigation; may not warrant immediate rollback) |
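
The fleet-wide critical trigger is a simple ratio test. A minimal sketch follows, assuming "by ≥ 2x" means the measured probability is at least twice the AC threshold; the threshold value itself is not given in this document, so it is a parameter, and `is_fleet_wide_critical` is a hypothetical name.

```python
def is_fleet_wide_critical(measured_p_err: float, ac_threshold: float, factor: float = 2.0) -> bool:
    """True when the measured P(err > 1 km) is at least `factor` times the AC threshold."""
    if not 0.0 <= measured_p_err <= 1.0:
        raise ValueError("measured_p_err must be a probability in [0, 1]")
    if not 0.0 < ac_threshold <= 1.0:
        raise ValueError("ac_threshold must be a probability in (0, 1]")
    return measured_p_err >= factor * ac_threshold
```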

Rollback steps (per-airframe)

  1. Re-flash the previous release's JetPack image onto the affected Jetson (same procedure as § 4 with the previous artifact).
  2. Re-stage the previous release's pre-flight bundle (the operator workstation retains it in the operator-orchestrator cache for ≥ 30 days).
  3. Re-run the pre-takeoff readiness gate.
  4. Confirm AC-5.2 fallback is still functional (it is FC firmware behavior; rolling back the companion image cannot break it, but verify on the GCS).
  5. Document the rollback in the post-mortem template; include FDR snapshots from the offending flight (if any) plus the versions of the rollback artifacts.

Database rollback (data_model.md § 4.2 reversibility)

Per data_model.md § 4.2, every Alembic migration MUST implement a working downgrade(). Rolling back the JetPack image to the previous release rolls back the schema to whatever migration head the previous release uses. Concretely:

  • The previous release's JetPack image contains its own Alembic migration tree.
  • On boot, the previous-release runtime asserts alembic current == head_for_that_release. If the database is on a NEWER head (because the airframe ran the new release between deployments), the runtime invokes alembic downgrade <previous-release-head> automatically.
  • If a migration is not reversible (which requires an explicit ADR — data_model.md § 4.2), the rollback must be manually adjudicated by the operator + onboard team lead. This case is rare by policy.
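
The boot-time decision described above can be sketched as pure logic, independent of Alembic itself. The sketch assumes a linear revision history (oldest first) that is known to the booting runtime, including revisions newer than its own head; `plan_schema_action` is a hypothetical name, and the "upgrade" branch for an older database is an assumption that the text does not cover.

```python
from typing import List, Tuple


def plan_schema_action(history: List[str], db_revision: str, release_head: str) -> Tuple[str, str]:
    """Decide what a booting runtime should do, given a linear history (oldest first)."""
    if db_revision not in history or release_head not in history:
        return ("refuse", "unknown revision; manual adjudication required")
    db_idx = history.index(db_revision)
    head_idx = history.index(release_head)
    if db_idx == head_idx:
        return ("proceed", release_head)
    if db_idx > head_idx:
        # Database is newer than this release: alembic downgrade <release_head>
        return ("downgrade", release_head)
    # Assumption: an older database is brought forward: alembic upgrade <release_head>
    return ("upgrade", release_head)
```

For example, with history `["r1", "r2", "r3"]`, a database at `r3` and a previous-release head of `r2`, the plan is `("downgrade", "r2")`.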

Post-mortem

Required after every rollback (per-airframe or fleet-wide):

  • Timeline: when was the new release flashed; when did the failure surface; when was rollback initiated.
  • Root cause: which AC was missed; which component is implicated; was it a regression introduced by this release or by a hardware/operational variable change.
  • What went wrong in the release process: did Tier-2 CI catch it; if not, why not.
  • Prevention: new test scenario added to NFT suite; new lint check; new rule in _docs/LESSONS.md.
  • Distribution: post-mortem report stored under _docs/06_metrics/incident_<YYYY-MM-DD>_<topic>.md (per autodev failure-handling protocol).

Deployment Checklist

Pre-flash:

  • All Tier-2 AC-bound NFTs green at target version
  • Security scan clean (zero critical / high CVEs; SBOM diff passes ADR-002 enforcement)
  • Both Docker images built and pushed (deployment + research)
  • JetPack image built, signed, checksummed
  • Operator-orchestrator tarball built, signed
  • Pre-flight bundle prepared by operator (cache + calibration + manifest)
  • Pre-takeoff readiness gate behavior verified on a bench Jetson before flashing onto the production unit
  • Rollback artifact (previous release JetPack image) staged on operator workstation
  • FDR retention policy confirmed for the target airframe

Post-flash:

  • Commissioning flight cleared per § 7
  • FDR retrieved and analyzed; AC-NEW-4 / AC-NEW-7 statistics within expected envelope
  • Post-landing upload procedure tested end-to-end (companion → operator workstation → satellite-provider)
  • Operator runbook updated with airframe-specific notes (e.g., "this airframe has UART2 wired to FC")

Tier-2 enablement

Until the Tier-2 self-hosted Jetson runner is fully provisioned:

  • AC-bound NFTs are gated as manual trigger only on PRs (ci_cd_pipeline.md § Manual-trigger override).
  • The merge gate on dev excludes Tier-2 NFTs; the merge gate on stage and main retains the full gate.
  • The pre-takeoff readiness gate (§ Pre-takeoff readiness gate) is unaffected — it runs on the Jetson at every takeoff regardless of CI gating posture.

When the Tier-2 runner is in steady state, this section is removed and the merge gates harmonize across dev / stage / main.