GPS-Denied Onboard — Deployment Procedures
Date: 2026-05-09 (Plan Phase 2c — initial draft). Inputs:
_docs/02_document/architecture.md § 3 (Deployment Model) + § 7 (Security); _docs/02_document/data_model.md § 4 (Migration Strategy); environment_strategy.md; ADR-002, ADR-004, ADR-005; AC-NEW-1, AC-NEW-3, AC-NEW-4, AC-NEW-5.
Deployment scope and model
This project does not ship a service; it ships an embedded edge image plus an operator-tooling bundle. The "deployment" patterns from the standard template (blue-green / rolling / canary) are not applicable. Deployment for this project means:
| Artifact | Target | Deployment mechanism |
|---|---|---|
| JetPack image (gps-denied-jetpack-<semver>-<sha>.img) | Production Jetson Orin Nano Super on a UAV | Operator flashes the image onto the Jetson via NVIDIA sdkmanager or Etcher-style dd from the operator workstation |
| Operator tooling tarball | Operator workstation | Operator extracts; docker compose up -d brings up mock-suite-sat-service (when offline) + operator-tooling |
| Tier-1 dev compose | Developer workstation | Developer runs docker compose up from repo root |
Zero-downtime is not a goal: a UAV is not in service while it is being re-flashed. The deployment cadence is per-airframe maintenance, not per-request availability.
Strategy: the closest analogue to a "rolling deploy" is the operator's fleet-management process (re-flash one UAV at a time across the fleet). The fleet-management process is the operator's concern, not this project's; this document covers the per-airframe procedure.
Pre-deployment artifact assembly (release engineer)
Performed once per release on Tier-1 + Tier-2 CI; produces signed artifacts stored in the release bucket.
- Tag a commit on main. CI runs the full pipeline (ci_cd_pipeline.md).
- Tier-1 produces:
  - companion-tier1:deployment-<sha> and companion-tier1:research-<sha> Docker images (pushed to registry).
  - mock-suite-sat-service:<sha> Docker image.
  - operator-tooling:<sha> Docker image.
  - SBOM artifacts for both binaries (deployment and research).
  - operator-tooling-<semver>-<sha>.tar.gz containing the operator-tooling image + mock-sat image + their compose file + verification script + relevant docs.
- Tier-2 produces:
  - Native deployment-binary build on the self-hosted Jetson runner.
  - SBOM verification: byte-equal (after canonicalization) to Tier-1's deployment-binary SBOM. Mismatch fails the release.
  - JetPack image build: a JetPack 6.2 base image with the deployment binary + PostgreSQL 16 + base migrations + /etc/gps-denied/runtime.yaml template preinstalled. Output: gps-denied-jetpack-<semver>-<sha>.img.
- Signing (Tier-1):
  - Both Docker image manifests are signed with the project's release key.
  - The JetPack image is signed; checksum is published as a separate signed file (gps-denied-jetpack-<semver>-<sha>.img.sha256.sig).
  - The operator-tooling tarball is signed.
- Release bucket: artifacts uploaded; release notes published; the previous release's artifacts retained for at least 90 days for rollback support.
A release fails if any step above fails — including any AC-bound NFT failure on Tier-2 (ci_cd_pipeline.md § AC-bound NFTs).
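The Tier-2 SBOM check compares the two SBOMs byte for byte after canonicalization. The following is a minimal sketch of such a comparison, assuming the SBOMs are JSON documents and that canonicalization means dropping volatile fields and serializing with sorted keys. The field names in `VOLATILE_KEYS` and both function names are hypothetical; the project's actual canonicalization rules are not stated in this document.

```python
import json

# Hypothetical volatile fields that differ between otherwise identical builds.
VOLATILE_KEYS = {"timestamp", "serialNumber"}


def canonicalize(sbom: dict) -> bytes:
    """Drop volatile keys at every nesting level and serialize deterministically."""
    def strip(node):
        if isinstance(node, dict):
            return {k: strip(v) for k, v in node.items() if k not in VOLATILE_KEYS}
        if isinstance(node, list):
            return [strip(v) for v in node]  # list order is preserved as-is
        return node

    return json.dumps(strip(sbom), sort_keys=True, separators=(",", ":")).encode("utf-8")


def sboms_match(tier1: dict, tier2: dict) -> bool:
    """True when both SBOMs are byte-equal after canonicalization."""
    return canonicalize(tier1) == canonicalize(tier2)
```

A `False` result corresponds to the "mismatch fails the release" rule above.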
Pre-takeoff readiness gate ("health check" analog)
Production has no /health/live HTTP endpoint (no listener; NFT-SEC-05). The companion's "health check" is the pre-takeoff readiness gate: a sequence of checks that runs at takeoff load and decides whether the companion is ready to emit external position to the FC.
| Check | What it validates | Action on failure |
|---|---|---|
| Manifest content-hash gate (D-C10-3) | The on-disk manifest matches the operator-staged manifest hash (data_model.md § 2.4) | FDR record 0x000D ContentHashGateFail + STATUSTEXT critical + companion refuses to publish a GPS_INPUT / MSP2_SENSOR_GPS source |
| Camera calibration JSON validation | File present + schema-valid + content-hash matches manifests.calibration_artifact_hash | Same |
| FAISS .index mmap + content-hash | mmap succeeds + content-hash matches manifests.descriptor_index_hash | Same |
| TRT engine cache verification | All required engines present per engine_cache_entries; each engine's content-hash matches engine_hash | Same |
| alembic current == head | DB schema is up-to-date for this binary | Same |
| MAVLink-2.0 signing handshake (AP profile) | Signed handshake with the FC succeeds within AC-NEW-1 30 s budget (D-C8-9 = (d)) | FDR record MavlinkSigningKeyRotated with reason "handshake_failed" + STATUSTEXT critical + companion refuses to emit |
| Per-flight key generation | Both per-flight ephemeral keys (MAVLink signing + onboard tile signing) generated and persisted under /var/lib/gps-denied/per-flight/ | Same |
| Initial frame → emit pipeline test | First nav-camera frame reaches C8 outbound encoder; EmittedExternalPosition produced | Same |
| Network egress is denied | Verify no outbound network egress is possible (DNS blackhole effective, iptables OUTPUT REJECT loaded) — defense-in-depth on architecture.md § 7 + NFT-SEC-05 | FDR critical + STATUSTEXT + refuse to emit |
The gate completes within the AC-NEW-1 30 s p95 budget; failure produces a clear FDR + STATUSTEXT trail and the companion's GPS_INPUT / MSP2_SENSOR_GPS channel stays silent — the FC operates as if no companion-GPS source is available, which is the correct safe-default.
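The gate's fail-closed behavior can be sketched as an ordered list of checks where the first failure stops the sequence and the companion stays silent. This is an illustrative sketch only: `GateResult`, `run_readiness_gate`, and the check names are hypothetical, and treating the 30 s p95 budget as a hard deadline is an assumption made here for simplicity.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class GateResult:
    ready: bool                  # True only if every check passed within budget
    failed_check: Optional[str]  # name of the first failing check, if any
    elapsed_s: float


def run_readiness_gate(
    checks: List[Tuple[str, Callable[[], bool]]],
    budget_s: float = 30.0,  # assumption: budget enforced as a hard limit
    clock: Callable[[], float] = time.monotonic,
) -> GateResult:
    """Run checks in order; stop at the first failure or budget overrun."""
    start = clock()
    for name, check in checks:
        ok = check()
        elapsed = clock() - start
        if not ok:
            return GateResult(False, name, elapsed)
        if elapsed > budget_s:
            return GateResult(False, "budget_exceeded", elapsed)
    return GateResult(True, None, clock() - start)
```

A caller would publish a position source only when `ready` is true; in every other case it would log the failure and emit nothing, which matches the safe default described above.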
Production deployment procedure (per-airframe)
This is the per-airframe deployment procedure performed by the operator, NOT by CI.
1. Pre-deploy approval
Required before any production-bound flight:
- Release notes for the target version reviewed; AC-NEW-4 / AC-NEW-7 statistical summaries reviewed.
- All Tier-2 AC-bound NFTs green at the target version (ci_cd_pipeline.md § AC-bound NFTs).
- Security audit of the target version completed (Tier-1 SBOM clean of unpatched CVEs; D-CROSS-CVE-1).
- D-PROJ-1 calibration step performed on the target Jetson + UAV pairing (hybrid factory + checkerboard-refined; ~1 day per deployed unit).
- Rollback artifact (the previous release's JetPack image) is staged on the operator workstation.
- FDR retention policy for this airframe confirmed (default 30 days; environment_strategy.md § Database Management).
- If switching FC profile (ardupilot_plane ↔ inav), FC firmware compatibility confirmed.
2. Pre-deploy checks (operator workstation)
# Verify the artifact bundle integrity.
cosign verify-blob \
--signature gps-denied-jetpack-<semver>-<sha>.img.sha256.sig \
--key gps-denied-release-key.pub \
gps-denied-jetpack-<semver>-<sha>.img.sha256
sha256sum -c gps-denied-jetpack-<semver>-<sha>.img.sha256
# Verify the operator-tooling tarball.
cosign verify-blob \
--signature operator-tooling-<semver>-<sha>.tar.gz.sig \
--key gps-denied-release-key.pub \
operator-tooling-<semver>-<sha>.tar.gz
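Where `sha256sum` is not available on the workstation, the checksum comparison (not the signature verification) can be reproduced with the Python standard library. A minimal sketch, assuming the `.sha256` file uses the usual `sha256sum` layout of `<hex digest>  <filename>`; both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-gigabyte images do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(image: Path, checksum_file: Path) -> bool:
    """Compare the image digest with the first field of the checksum file."""
    expected = checksum_file.read_text().split()[0].lower()
    return sha256_of(image) == expected
```

This only confirms that the image matches the published checksum; the checksum file itself still has to be verified against the release key with `cosign verify-blob` as shown above.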
3. Pre-flight cache build (operator-tooling C12)
Performed on the operator workstation, with satellite-provider reachable (locally mirrored or via lab VPN).
docker compose -f operator-tooling-compose.yml up -d
# Operator opens http://127.0.0.1:8080
The C12 UI walks the operator through:
- Upload / select the target operational sector (GeoJSON polygon).
- Set sector classifications (active_conflict ↔ stable_rear); this drives the freshness threshold (data_model.md § 2.3).
- Tile download from satellite-provider (parent suite); produces tiles rows with source='googlemaps' + filesystem JPEGs.
- Descriptor (FAISS) index generation across the loaded tile corpus.
- TRT engine compilation on the workstation (Tier-2 emulation if no Jetson is present, or directly on a co-located Jetson dev kit).
- Manifest generation: hash over (model bundle + calibration JSON + corpus + sector classifications + descriptor index + engine cache).
- Output: a sealed pre-flight bundle on a USB drive or staged for direct ethernet transfer.
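The manifest hash covers six inputs. One way to combine them is a hash over the per-component hashes in a fixed order, sketched below. SHA-256, the component key names, the ordering, and the `name=hash` line encoding are all assumptions for illustration; the authoritative definition is in data_model.md.

```python
import hashlib

# Hypothetical component keys, in the order listed in the manifest-generation step.
COMPONENT_ORDER = (
    "model_bundle",
    "calibration_json",
    "corpus",
    "sector_classifications",
    "descriptor_index",
    "engine_cache",
)


def manifest_hash(component_hashes: dict) -> str:
    """Combine per-component hex digests into one digest, independent of dict order."""
    missing = [name for name in COMPONENT_ORDER if name not in component_hashes]
    if missing:
        raise ValueError(f"missing component hashes: {missing}")
    outer = hashlib.sha256()
    for name in COMPONENT_ORDER:
        outer.update(f"{name}={component_hashes[name]}\n".encode("utf-8"))
    return outer.hexdigest()
```

Because any change to a single component changes the combined digest, a scheme like this lets the pre-takeoff content-hash gate detect a stale or tampered bundle with one comparison.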
4. JetPack image flash
Operator flashes the target JetPack image onto the Jetson:
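# Replace /dev/sdX with the target device and double-check it: dd overwrites it without confirmation.
sudo dd if=gps-denied-jetpack-<semver>-<sha>.img of=/dev/sdX bs=4M status=progress
# Flush write buffers before removing the device.
sync
# OR via NVIDIA SDK Manager for a more guided flow.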
The flashed image contains:
- JetPack 6.2 base
- The deployment binary preinstalled at /opt/gps-denied/
- PostgreSQL 16 with alembic schema initialized at the target migration head
- /etc/gps-denied/runtime.yaml template (the operator fills in airframe-specific values: fc_profile, companion_id)
- A systemd unit gps-denied.service that auto-starts at boot
The image is identical across UAVs; per-airframe configuration (/etc/gps-denied/runtime.yaml) is filled in after flash.
5. Per-airframe configuration
Operator boots the Jetson in maintenance mode, ssh's in (this is the only time the Jetson has any inbound network surface; closed before takeoff), and:
sudo $EDITOR /etc/gps-denied/runtime.yaml
# Set: fc_profile, companion_id, fdr_retention_days, log_level
sudo gps-denied-cli stage-cache /mnt/usb/gps-denied-cache-<sector-id>.tar.gz
# Stages the operator-prepared cache + calibration + manifest into /var/lib/gps-denied/.
sudo gps-denied-cli verify-readiness
# Runs all gate checks except MAVLink signing handshake (which requires the FC to be powered).
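A quick sanity check of the edited values can catch typos before the readiness gate runs. The sketch below validates an already-parsed configuration dictionary (for example, the result of PyYAML's `yaml.safe_load`). The function name, the allowed log levels, and the rule that `fdr_retention_days` is a positive integer are assumptions; only the four keys and the two FC profile names come from this document.

```python
ALLOWED_FC_PROFILES = {"ardupilot_plane", "inav"}
# Assumption: log levels follow the common DEBUG/INFO/WARNING/ERROR naming.
ALLOWED_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


def validate_runtime_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means the config looks sane."""
    errors = []
    if cfg.get("fc_profile") not in ALLOWED_FC_PROFILES:
        errors.append(f"fc_profile must be one of {sorted(ALLOWED_FC_PROFILES)}")
    companion_id = cfg.get("companion_id")
    if not isinstance(companion_id, str) or not companion_id.strip():
        errors.append("companion_id must be a non-empty string")
    days = cfg.get("fdr_retention_days", 30)  # 30 days is the documented default
    if isinstance(days, bool) or not isinstance(days, int) or days <= 0:
        errors.append("fdr_retention_days must be a positive integer")
    if str(cfg.get("log_level", "INFO")).upper() not in ALLOWED_LOG_LEVELS:
        errors.append(f"log_level must be one of {sorted(ALLOWED_LOG_LEVELS)}")
    return errors
```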
6. UAV integration
- Wire the Jetson UART/USB to the FC.
- For ArduPilot Plane: configure FC parameters per the AP-side checklist (EK3_SRC1_POSXY = 3 or per D-C8-2 = (b) configuration, AHRS_EKF_TYPE = 3).
- For iNav: configure gps_provider = MSP, gps_ublox_use_galileo = OFF.
- Power up the FC; verify MAVLink signing handshake completes within 30 s (AC-NEW-1).
7. First-flight commissioning
The first flight on a freshly-deployed airframe is a commissioning flight, not a production flight:
- Operator stays in line-of-sight.
- AC-5.2 fallback (FC IMU-only) is the primary safety net during commissioning.
- Operator manually triggers a MAV_CMD_REQUEST_MESSAGE to confirm GPS_INPUT is being received and the FC's EKF source-set switch responds correctly.
- If everything looks healthy on the GCS dashboard for 5+ minutes of cruise, the airframe is cleared for production flights.
8. Post-deploy monitoring
Post first commissioning flight:
- FDR retrieved and visualized on operator workstation (operator-tooling C12 dashboard, observability.md § 5.1).
- AC-NEW-4 statistics for the commissioning flight reviewed; outliers investigated.
- No FDR segment drops; no ContentHashGateFail events.
- Mid-flight tile generation working (the post-landing upload is handled separately).
- If everything is green, the deployment is finalised; the previous release's JetPack image can be archived (still kept for rollback).
Post-landing tile upload (per-flight, ADR-004)
Per AC-8.4 + ADR-004, tiles generated mid-flight are uploaded to satellite-provider post-landing only, using the operator-tooling's C11 Tile Manager (TileUploader interface; a separate binary, never linked into the airborne image).
# Operator plugs the companion's NVM into the workstation OR ssh's into the powered-off-then-re-booted Jetson.
docker compose run operator-tooling \
python -m operator_tooling.tilemanager upload \
--flight-id <uuid> \
--satellite-provider $SATELLITE_PROVIDER_URL \
--signing-pubkey-fingerprint <fingerprint>
Behavior:
- Reads the local tiles rows where source='onboard_ingest' AND voting_status='pending' AND flight_id=<uuid>.
- Reads the corresponding JPEG body + sidecar JSON from filesystem.
- Reads the per-flight onboard tile-signing private key (still on the companion's NVM until FDR rolls over).
- Submits to satellite-provider's POST /api/satellite/tiles/ingest endpoint (D-PROJ-2 contract).
- On 2xx success: deletes local row + JPEG + sidecar + emits FDR event tile_uploaded.
- On 4xx: leaves local data; emits FDR event tile_upload_failed with reason; operator decides next steps (likely a parent-suite issue).
- On 5xx: retries with exponential backoff; persistent failure → tile_upload_failed + operator review.
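The status-code handling above can be sketched as a small retry loop. This is not the TileUploader implementation: `upload_with_retry`, the attempt limit, and the base delay are hypothetical, and `send` stands in for whatever performs the HTTP request and returns its status code.

```python
import time
from typing import Callable


def upload_with_retry(
    send: Callable[[], int],
    max_attempts: int = 5,      # assumption: the real limit is not documented here
    base_delay_s: float = 1.0,  # assumption: delays double from this value
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the FDR event name that the outcome maps to."""
    for attempt in range(max_attempts):
        status = send()
        if 200 <= status < 300:
            return "tile_uploaded"       # caller deletes row + JPEG + sidecar
        if 400 <= status < 500:
            return "tile_upload_failed"  # no retry; local data is left in place
        # 5xx (and any other status) is treated as transient: back off and retry.
        if attempt < max_attempts - 1:
            sleep(base_delay_s * 2 ** attempt)
    return "tile_upload_failed"          # persistent failure, operator review
```

Injecting `sleep` keeps the function testable without real delays; the local cleanup is deliberately left to the caller so that data is only deleted after a confirmed 2xx.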
When the parent-suite voting layer (D-PROJ-2 design task #2) ships, this flow does NOT change on the onboard side — the parent suite's promotion logic is invisible to onboard-side upload.
Rollback Procedures
Trigger criteria
| Severity | Trigger | Decision-maker |
|---|---|---|
| Critical (per-airframe) | Commissioning flight fails AC-5.2 fallback (the FC IMU-only fallback also failed; airframe lost) | Safety review board (out of scope of this project) |
| Critical (fleet-wide) | Any post-deploy AC-NEW-4 outlier indicates a regression: P(err > 1 km) measured on a real flight > AC threshold by ≥ 2x | Suite security + onboard team lead |
| High (per-airframe) | Commissioning flight passes but post-flight FDR analysis shows AC-NEW-4 / AC-NEW-7 regression vs. prior release | Onboard team lead |
| High (per-airframe) | Operator unable to complete pre-flight readiness gate (manifest hash gate fails repeatedly) | Operator + onboard team lead |
| Medium (per-airframe) | Sustained dead_reckoned periods longer than expected; FDR segment drops occurring | Operator + onboard team lead (post-flight investigation; may not warrant immediate rollback) |
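The fleet-wide trigger compares a measured exceedance probability with the AC threshold. A minimal sketch of that arithmetic, assuming P(err > 1 km) is estimated as the fraction of per-sample position errors above 1000 m in one flight's FDR data; the function names are hypothetical and the AC threshold itself is passed in because its value is not given in this document.

```python
from typing import Sequence


def exceedance_probability(errors_m: Sequence[float], limit_m: float = 1000.0) -> float:
    """Fraction of samples whose position error exceeds the limit."""
    if not errors_m:
        raise ValueError("no error samples")
    return sum(e > limit_m for e in errors_m) / len(errors_m)


def is_fleet_wide_regression(
    errors_m: Sequence[float], ac_threshold: float, factor: float = 2.0
) -> bool:
    """True when the measured probability is at least `factor` times the AC threshold."""
    return exceedance_probability(errors_m) >= factor * ac_threshold
```

For example, two of four samples above 1 km give a measured probability of 0.5, which trips the trigger for a threshold of 0.2 (2 × 0.2 = 0.4) but not for 0.3 (2 × 0.3 = 0.6).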
Rollback steps (per-airframe)
- Re-flash the previous release's JetPack image onto the affected Jetson (same procedure as § 4 with the previous artifact).
- Re-stage the previous release's pre-flight bundle (the operator workstation retains it in the operator-tooling cache for ≥ 30 days).
- Re-run the pre-takeoff readiness gate.
- Confirm AC-5.2 fallback is still functional (it is FC firmware behavior; rolling back the companion image cannot break it, but verify on the GCS).
- Document the rollback in the post-mortem template; include FDR snapshots from the offending flight (if any) plus the rollback artifact versions.
Database rollback (data_model.md § 4.2 reversibility)
Per data_model.md § 4.2, every Alembic migration MUST implement a working downgrade(). Rolling back the JetPack image to the previous release rolls back the schema to whatever migration head the previous release uses. Concretely:
- The previous release's JetPack image contains its own Alembic migration tree.
- On boot, the previous-release runtime asserts alembic current == head_for_that_release. If the database is on a NEWER head (because the airframe ran the new release between deployments), the runtime invokes alembic downgrade <previous-release-head> automatically.
- If a migration is not reversible (which requires an explicit ADR; data_model.md § 4.2), the rollback must be manually adjudicated by the operator + onboard team lead. This case is rare by policy.
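The boot-time decision can be sketched as a comparison of two positions in the migration history. This sketch assumes a linear revision history and assumes the runtime can see the full ordering, including revisions newer than its own head; `plan_schema_action` is a hypothetical name, and the upgrade branch is included only for completeness.

```python
from typing import List


def plan_schema_action(current: str, release_head: str, history: List[str]) -> str:
    """Decide what to run given the DB revision and the release's expected head.

    `history` lists revision ids from oldest to newest (linear history assumed).
    """
    cur_i = history.index(current)       # raises ValueError for unknown revisions
    head_i = history.index(release_head)
    if cur_i == head_i:
        return "none"
    if cur_i > head_i:
        # Database is ahead of this release: roll the schema back.
        return f"alembic downgrade {release_head}"
    return f"alembic upgrade {release_head}"
```

An unknown revision raises instead of guessing, which corresponds to the manual-adjudication path for cases the runtime cannot resolve on its own.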
Post-mortem
Required after every rollback (per-airframe or fleet-wide):
- Timeline: when was the new release flashed; when did the failure surface; when was rollback initiated.
- Root cause: which AC was missed; which component is implicated; was it a regression introduced by this release or by a hardware/operational variable change.
- What went wrong in the release process: did Tier-2 CI catch it; if not, why not.
- Prevention: new test scenario added to NFT suite; new lint check; new rule in _docs/LESSONS.md.
- Distribution: post-mortem report stored under _docs/06_metrics/incident_<YYYY-MM-DD>_<topic>.md (per autodev failure-handling protocol).
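To keep report names consistent, the path can be generated rather than typed by hand. A small sketch, assuming the topic is reduced to lowercase letters, digits, and underscores; that slug rule and the function name are assumptions, while the directory and filename pattern come from the list above.

```python
import re
from datetime import date


def incident_report_path(day: date, topic: str) -> str:
    """Build the post-mortem path for a given date and free-text topic."""
    slug = re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")
    if not slug:
        raise ValueError("topic must contain at least one letter or digit")
    return f"_docs/06_metrics/incident_{day.isoformat()}_{slug}.md"
```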
Deployment Checklist
Pre-flash:
- All Tier-2 AC-bound NFTs green at target version
- Security scan clean (zero critical / high CVEs; SBOM diff passes ADR-002 enforcement)
- Both Docker images built and pushed (deployment + research)
- JetPack image built, signed, checksummed
- Operator-tooling tarball built, signed
- Pre-flight bundle prepared by operator (cache + calibration + manifest)
- Pre-takeoff readiness gate behavior verified on a bench Jetson before flashing onto the production unit
- Rollback artifact (previous release JetPack image) staged on operator workstation
- FDR retention policy confirmed for the target airframe
Post-flash:
- Commissioning flight cleared per § 7
- FDR retrieved and analyzed; AC-NEW-4 / AC-NEW-7 statistics within expected envelope
- Post-landing upload procedure tested end-to-end (companion → operator workstation → satellite-provider)
- Operator runbook updated with airframe-specific notes (e.g., "this airframe has UART2 wired to FC")
Tier-2 enablement
Until the Tier-2 self-hosted Jetson runner is fully provisioned:
- AC-bound NFTs are gated as manual trigger only on PRs (ci_cd_pipeline.md § Manual-trigger override).
- The merge gate on dev excludes Tier-2 NFTs; the merge gate on stage and main retains the full gate.
- The pre-takeoff readiness gate (§ Pre-takeoff readiness gate) is unaffected; it runs on the Jetson at every takeoff regardless of CI gating posture.
When the Tier-2 runner is in steady state, this section is removed and the merge gates harmonize across dev / stage / main.
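The interim gating rule reduces to a small decision function. In the sketch below the check-group labels `tier1` and `tier2_ac_nfts` and the function name are hypothetical; only the branch names and the rule itself come from this section.

```python
def required_checks(branch: str, tier2_runner_ready: bool) -> set:
    """Return the check groups that must pass before merging into `branch`."""
    checks = {"tier1"}
    # Until the Tier-2 runner is in steady state, only stage and main keep the full gate.
    if tier2_runner_ready or branch in {"stage", "main"}:
        checks.add("tier2_ac_nfts")
    return checks
```

Once `tier2_runner_ready` is permanently true, the function returns the same set for every branch, which is the harmonized state described above.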