# GPS-Denied Onboard — Deployment Procedures

> Date: 2026-05-09 (Plan Phase 2c, initial draft).
> Inputs: `_docs/02_document/architecture.md` § 3 (Deployment Model) + § 7 (Security); `_docs/02_document/data_model.md` § 4 (Migration Strategy); environment_strategy.md; ADR-002, ADR-004, ADR-005; AC-NEW-1, AC-NEW-3, AC-NEW-4, AC-NEW-5.

## Deployment scope and model

This project does **not** ship a service; it ships an **embedded edge image** plus an **operator-tooling bundle**. The "deployment" patterns from the standard template (blue-green / rolling / canary) are not applicable. Deployment for this project means:

| Artifact | Target | Deployment mechanism |
|---|---|---|
| **JetPack image** (`gps-denied-jetpack-<semver>-<sha>.img`) | Production Jetson Orin Nano Super on a UAV | Operator flashes the image onto the Jetson via NVIDIA `sdkmanager` or `Etcher`-style `dd` from the operator workstation |
| **Operator tooling tarball** | Operator workstation | Operator extracts; `docker compose up -d` brings up `mock-suite-sat-service` (when offline) + `operator-tooling` |
| **Tier-1 dev compose** | Developer workstation | Developer runs `docker compose up` from repo root |

**Zero-downtime is not a goal**: a UAV is not in service while it is being re-flashed. The deployment cadence is per-airframe maintenance, not per-request availability.

**Strategy**: the closest analogue to a "rolling deploy" is the operator's fleet-management process (re-flash one UAV at a time across the fleet). The fleet-management process is the operator's concern, not this project's; this document covers the per-airframe procedure.

## Pre-deployment artifact assembly (release engineer)

Performed once per release on Tier-1 + Tier-2 CI; produces signed artifacts stored in the release bucket.

1. Tag a commit on `main`. CI runs the full pipeline (`ci_cd_pipeline.md`).
2. **Tier-1 produces**:
   - `companion-tier1:deployment-<sha>` and `companion-tier1:research-<sha>` Docker images (pushed to registry).
   - `mock-suite-sat-service:<sha>` Docker image.
   - `operator-tooling:<sha>` Docker image.
   - SBOM artifacts for both binaries (deployment and research).
   - `operator-tooling-<semver>-<sha>.tar.gz` containing the operator-tooling image + mock-sat image + their compose file + verification script + relevant docs.
3. **Tier-2 produces**:
   - Native deployment-binary build on the self-hosted Jetson runner.
   - SBOM verification: the Tier-2 SBOM must be byte-equal (after canonicalization) to Tier-1's deployment-binary SBOM. A mismatch fails the release.
   - **JetPack image build**: a JetPack 6.2 base image with the deployment binary + PostgreSQL 16 + base migrations + `/etc/gps-denied/runtime.yaml` template preinstalled. Output: `gps-denied-jetpack-<semver>-<sha>.img`.
4. **Signing** (Tier-1):
   - Both companion Docker image manifests (deployment and research) are signed with the project's release key.
   - The JetPack image is signed; its checksum is published as a separate signed file (`gps-denied-jetpack-<semver>-<sha>.img.sha256.sig`).
   - The operator-tooling tarball is signed.
5. **Release bucket**: artifacts are uploaded; release notes are published; the previous release's artifacts are retained for at least 90 days for rollback support.

A release fails if any step above fails, including any AC-bound NFT failure on Tier-2 (`ci_cd_pipeline.md` § AC-bound NFTs).

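
The Tier-2 SBOM check in step 3 compares canonicalized bytes, not raw files. The sketch below shows one way such a comparison could look for a JSON SBOM. The canonicalization rules here (sorted keys, compact separators, dropping the hypothetical volatile fields `timestamp` and `serialNumber`) are assumptions for illustration; the project's actual rules are not described in this document.

```python
import json

# Hypothetical list of volatile top-level fields to drop before comparing;
# the project's real canonicalization rules are not given in this document.
VOLATILE_FIELDS = ("timestamp", "serialNumber")


def canonicalize_sbom(raw: bytes) -> bytes:
    """Return a deterministic byte form of a JSON SBOM (top level must be an object)."""
    doc = json.loads(raw)
    for field in VOLATILE_FIELDS:
        doc.pop(field, None)
    # Sorted keys and fixed separators make the encoding independent of key order.
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sboms_match(tier1_raw: bytes, tier2_raw: bytes) -> bool:
    """True when both SBOMs are byte-equal after canonicalization."""
    return canonicalize_sbom(tier1_raw) == canonicalize_sbom(tier2_raw)


tier1 = b'{"components": [{"name": "libfoo", "version": "1.2"}], "timestamp": "t1"}'
tier2 = b'{"timestamp": "t2", "components": [{"version": "1.2", "name": "libfoo"}]}'
tier2_drift = b'{"components": [{"name": "libfoo", "version": "1.3"}]}'

print(sboms_match(tier1, tier2))        # same content, different key order
print(sboms_match(tier1, tier2_drift))  # a version drifted between tiers
```

Note that this sketch keeps list order significant, so two SBOMs that list the same components in a different order would not match; a real canonicalizer would need a rule for that.
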
## Pre-takeoff readiness gate ("health check" analog)

Production has no `/health/live` HTTP endpoint (no listener; NFT-SEC-05). The companion's "health check" is the **pre-takeoff readiness gate**: a sequence of checks that runs at takeoff load and decides whether the companion is ready to emit external position to the FC.

| Check | What it validates | Action on failure |
|---|---|---|
| Manifest content-hash gate (D-C10-3) | The on-disk manifest matches the operator-staged manifest hash (data_model.md § 2.4) | FDR record `0x000D ContentHashGateFail` + STATUSTEXT critical + companion refuses to publish a `GPS_INPUT` / `MSP2_SENSOR_GPS` source |
| Camera calibration JSON validation | File present + schema-valid + content-hash matches `manifests.calibration_artifact_hash` | Same |
| FAISS `.index` mmap + content-hash | mmap succeeds + content-hash matches `manifests.descriptor_index_hash` | Same |
| TRT engine cache verification | All required engines present per `engine_cache_entries`; each engine's content-hash matches `engine_hash` | Same |
| `alembic current == head` | DB schema is up-to-date for this binary | Same |
| MAVLink-2.0 signing handshake (AP profile) | Signed handshake with the FC succeeds within AC-NEW-1 30 s budget (D-C8-9 = (d)) | FDR record `MavlinkSigningKeyRotated` with reason "handshake_failed" + STATUSTEXT critical + companion refuses to emit |
| Per-flight key generation | Both per-flight ephemeral keys (MAVLink signing + onboard tile signing) generated and persisted under `/var/lib/gps-denied/per-flight/` | Same |
| Initial frame → emit pipeline test | First nav-camera frame reaches C8 outbound encoder; `EmittedExternalPosition` produced | Same |
| Network egress is denied | Verify no outbound network egress is possible (DNS blackhole effective, iptables OUTPUT REJECT loaded); defense-in-depth on architecture.md § 7 + NFT-SEC-05 | FDR critical + STATUSTEXT + refuse to emit |

The gate completes within the AC-NEW-1 30 s p95 budget; failure produces a clear FDR + STATUSTEXT trail and the companion's `GPS_INPUT` / `MSP2_SENSOR_GPS` channel stays silent. The FC then operates as if no companion-GPS source is available, which is the correct safe default.

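
The fail-closed behavior of the gate can be sketched as a small runner that executes the checks in order and refuses to emit on the first failure or when the time budget is exceeded. This is a minimal illustration, not the companion's implementation: `run_readiness_gate`, `GateResult`, and the check names are hypothetical, and the real checks would also write FDR records and STATUSTEXT messages.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

GATE_BUDGET_S = 30.0  # AC-NEW-1 budget quoted in this section


@dataclass
class GateResult:
    ready: bool
    failed_check: Optional[str]
    elapsed_s: float


def run_readiness_gate(
    checks: List[Tuple[str, Callable[[], bool]]],
    budget_s: float = GATE_BUDGET_S,
    clock: Callable[[], float] = time.monotonic,
) -> GateResult:
    """Run checks in order; fail closed on the first failure or on budget overrun."""
    start = clock()
    for name, check in checks:
        try:
            passed = check()
        except Exception:
            passed = False  # a check that raises counts as a failure
        elapsed = clock() - start
        if not passed or elapsed > budget_s:
            return GateResult(False, name, elapsed)
    return GateResult(True, None, clock() - start)


# Hypothetical stand-ins for three of the checks in the table above.
demo_checks = [
    ("manifest_content_hash", lambda: True),
    ("alembic_current_is_head", lambda: False),
    ("network_egress_denied", lambda: True),
]
result = run_readiness_gate(demo_checks)
print(result.ready, result.failed_check)
```

Stopping at the first failure keeps the FDR trail unambiguous: the recorded check name is the one that blocked emission.
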
## Production deployment procedure (per-airframe)

This is the per-airframe deployment procedure performed by the operator, NOT by CI.

### 1. Pre-deploy approval

Required before any production-bound flight:

- [ ] Release notes for the target version reviewed; AC-NEW-4 / AC-NEW-7 statistical summaries reviewed.
- [ ] All Tier-2 AC-bound NFTs green at the target version (`ci_cd_pipeline.md` § AC-bound NFTs).
- [ ] Security audit of the target version completed (Tier-1 SBOM clean of unpatched CVEs; D-CROSS-CVE-1).
- [ ] D-PROJ-1 calibration step performed on the target Jetson + UAV pairing (hybrid factory + checkerboard-refined; ~1 day per deployed unit).
- [ ] Rollback artifact (the previous release's JetPack image) is staged on the operator workstation.
- [ ] FDR retention policy for this airframe confirmed (default 30 days; environment_strategy.md § Database Management).
- [ ] If switching FC profile (`ardupilot_plane` ↔ `inav`), FC firmware compatibility confirmed.

### 2. Pre-deploy checks (operator workstation)

```sh
# Replace <semver> and <sha> with the release values before running;
# the angle brackets are placeholders, not shell syntax.

# Verify the artifact bundle integrity.
cosign verify-blob \
  --signature gps-denied-jetpack-<semver>-<sha>.img.sha256.sig \
  --key gps-denied-release-key.pub \
  gps-denied-jetpack-<semver>-<sha>.img.sha256

sha256sum -c gps-denied-jetpack-<semver>-<sha>.img.sha256

# Verify the operator-tooling tarball.
cosign verify-blob \
  --signature operator-tooling-<semver>-<sha>.tar.gz.sig \
  --key gps-denied-release-key.pub \
  operator-tooling-<semver>-<sha>.tar.gz
```

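
For workstations where `sha256sum` is unavailable, the checksum step can be approximated in Python. The sketch below assumes the checksum file uses the usual `sha256sum` line format (`<hex digest>  <file name>`) and that the listed files sit next to it; the function names are illustrative and not part of the operator tooling. It does not replace the `cosign` signature check.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so a multi-gigabyte image never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum_file(checksum_file: Path) -> bool:
    """Check every '<hex>  <name>' line against files next to the checksum file."""
    lines = [ln.strip() for ln in checksum_file.read_text().splitlines() if ln.strip()]
    for line in lines:
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with '*'
        if sha256_of(checksum_file.parent / name) != expected.lower():
            return False
    return bool(lines)  # an empty checksum file verifies nothing


# Demo with a throwaway file standing in for the JetPack image.
with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "demo.img"
    image.write_bytes(b"not a real JetPack image")
    sums = Path(tmp) / "demo.img.sha256"
    sums.write_text(f"{sha256_of(image)}  demo.img\n")
    intact = verify_checksum_file(sums)
    image.write_bytes(b"tampered")
    tampered = verify_checksum_file(sums)

print(intact, tampered)
```
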
### 3. Pre-flight cache build (operator-tooling C12)

Performed on the operator workstation, with `satellite-provider` reachable (locally mirrored or via lab VPN).

```sh
docker compose -f operator-tooling-compose.yml up -d
# Operator opens http://127.0.0.1:8080
```

The C12 UI walks the operator through:

1. Upload / select the target operational sector (GeoJSON polygon).
2. Set sector classifications (`active_conflict` ↔ `stable_rear`), which drive the freshness threshold (data_model.md § 2.3).
3. Tile download from `satellite-provider` (parent suite); produces `tiles` rows with `source='googlemaps'` + filesystem JPEGs.
4. Descriptor (FAISS) index generation across the loaded tile corpus.
5. TRT engine compilation on the workstation (Tier-2 emulation if no Jetson is present, or directly on a co-located Jetson dev kit).
6. Manifest generation: hash over (model bundle + calibration JSON + corpus + sector classifications + descriptor index + engine cache).
7. Output: a sealed pre-flight bundle on a USB drive or staged for direct ethernet transfer.

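
Step 6 hashes over six inputs. One way to combine them deterministically is sketched below: each component is first reduced to its own SHA-256 digest, then the digests are fed into an outer hash in a fixed order with length-prefixed fields. The component order, the length-prefix framing, and the name `manifest_hash` are assumptions made for this example; the actual manifest format is defined in data_model.md, not here.

```python
import hashlib
from typing import Mapping

# Component names follow step 6 above; the order and framing are assumptions.
MANIFEST_COMPONENTS = (
    "model_bundle",
    "calibration_json",
    "corpus",
    "sector_classifications",
    "descriptor_index",
    "engine_cache",
)


def manifest_hash(component_hashes: Mapping[str, str]) -> str:
    """Combine per-component SHA-256 hex digests into one manifest digest."""
    outer = hashlib.sha256()
    for name in MANIFEST_COMPONENTS:  # fixed order keeps the result deterministic
        digest = component_hashes[name]  # KeyError if a component is missing
        # Length-prefix each field so adjacent values cannot be confused.
        for field in (name, digest):
            data = field.encode("utf-8")
            outer.update(len(data).to_bytes(4, "big"))
            outer.update(data)
    return outer.hexdigest()


# Placeholder digests; real ones would come from hashing the staged files.
demo = {name: hashlib.sha256(name.encode()).hexdigest() for name in MANIFEST_COMPONENTS}
baseline = manifest_hash(demo)
recalibrated = {**demo, "calibration_json": hashlib.sha256(b"recalibrated").hexdigest()}
changed = manifest_hash(recalibrated)
print(baseline != changed)
```

Because any single component change alters the outer digest, the content-hash gate at takeoff can detect a swapped calibration file or a stale engine cache with one comparison.
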
### 4. JetPack image flash

Operator flashes the target JetPack image onto the Jetson:

```sh
# Replace /dev/sdX with the target device; dd overwrites it without confirmation.
sudo dd if=gps-denied-jetpack-<semver>-<sha>.img of=/dev/sdX bs=4M status=progress conv=fsync
sync
# OR use NVIDIA SDK Manager for a more guided flow.
```

The flashed image contains:

- JetPack 6.2 base
- The deployment binary preinstalled at `/opt/gps-denied/`
- PostgreSQL 16 with `alembic` schema initialized at the target migration head
- `/etc/gps-denied/runtime.yaml` template (the operator fills in airframe-specific values: `fc_profile`, `companion_id`)
- A systemd unit `gps-denied.service` that auto-starts at boot

The image is **identical across UAVs**; per-airframe configuration (`/etc/gps-denied/runtime.yaml`) is filled in after flash.

### 5. Per-airframe configuration

Operator boots the Jetson in maintenance mode, connects over SSH (this is the only time the Jetson has any inbound network surface; it is closed before takeoff), and runs:

```sh
sudo "${EDITOR:-vi}" /etc/gps-denied/runtime.yaml
# Set: fc_profile, companion_id, fdr_retention_days, log_level
sudo gps-denied-cli stage-cache /mnt/usb/gps-denied-cache-<sector-id>.tar.gz
# Stages the operator-prepared cache + calibration + manifest into /var/lib/gps-denied/.
sudo gps-denied-cli verify-readiness
# Runs all gate checks except MAVLink signing handshake (which requires the FC to be powered).
```

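
Before running `verify-readiness`, it can help to sanity-check the four values edited in `runtime.yaml`. The sketch below validates an already-parsed configuration (a plain dict, so no YAML library is needed). The two `fc_profile` values and the 30-day retention default come from this document; the set of log levels and the function name `validate_runtime_config` are assumptions for illustration.

```python
from typing import Any, Dict, List

ALLOWED_FC_PROFILES = {"ardupilot_plane", "inav"}  # profiles named in this document
ALLOWED_LOG_LEVELS = {"debug", "info", "warning", "error"}  # assumed value set


def validate_runtime_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config looks complete."""
    problems = []
    if cfg.get("fc_profile") not in ALLOWED_FC_PROFILES:
        problems.append("fc_profile must be one of " + ", ".join(sorted(ALLOWED_FC_PROFILES)))
    if not str(cfg.get("companion_id") or "").strip():
        problems.append("companion_id must be set")
    days = cfg.get("fdr_retention_days", 30)  # 30 days is the documented default
    if isinstance(days, bool) or not isinstance(days, int) or days <= 0:
        problems.append("fdr_retention_days must be a positive integer")
    if cfg.get("log_level", "info") not in ALLOWED_LOG_LEVELS:
        problems.append("log_level must be one of " + ", ".join(sorted(ALLOWED_LOG_LEVELS)))
    return problems


good = {"fc_profile": "inav", "companion_id": "uav-007", "fdr_retention_days": 30, "log_level": "info"}
bad = {"fc_profile": "px4", "companion_id": "", "fdr_retention_days": 0}

print(validate_runtime_config(good))
print(validate_runtime_config(bad))
```
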
### 6. UAV integration

- Wire the Jetson UART/USB to the FC.
- For ArduPilot Plane: configure FC parameters per the AP-side checklist (`EK3_SRC1_POSXY = 3` or per D-C8-2 = (b) configuration, `AHRS_EKF_TYPE = 3`).
- For iNav: configure `gps_provider = MSP`, `gps_ublox_use_galileo = OFF`.
- Power up the FC; verify MAVLink signing handshake completes within 30 s (AC-NEW-1).

### 7. First-flight commissioning

The first flight on a freshly deployed airframe is a **commissioning flight**, not a production flight:

- Operator stays in line-of-sight.
- AC-5.2 fallback (FC IMU-only) is the primary safety net during commissioning.
- Operator manually triggers a `MAV_CMD_REQUEST_MESSAGE` to confirm `GPS_INPUT` is being received and the FC's EKF source-set switch responds correctly.
- If everything looks healthy on the GCS dashboard for at least 5 minutes of cruise, the airframe is cleared for production flights.

### 8. Post-deploy monitoring

After the first commissioning flight:

- [ ] FDR retrieved and visualized on operator workstation (operator-tooling C12 dashboard, observability.md § 5.1).
- [ ] AC-NEW-4 statistics for the commissioning flight reviewed; outliers investigated.
- [ ] No FDR segment drops; no `ContentHashGateFail` events.
- [ ] Mid-flight tile generation working (the post-landing upload is verified separately; see § Post-landing tile upload).
- [ ] If everything is green, the deployment is finalized; the previous release's JetPack image can be archived (still kept for rollback).

## Post-landing tile upload (per-flight, ADR-004)

Per AC-8.4 + ADR-004, tiles generated mid-flight are uploaded to `satellite-provider` **post-landing only**, using the operator-tooling's C11 Tile Manager (`TileUploader` interface; a separate binary, never linked into the airborne image).

```sh
# Operator plugs the companion's NVM into the workstation OR ssh's into the powered-off-then-re-booted Jetson.
docker compose run operator-tooling \
  python -m operator_tooling.tilemanager upload \
  --flight-id <uuid> \
  --satellite-provider "$SATELLITE_PROVIDER_URL" \
  --signing-pubkey-fingerprint <fingerprint>
```

Behavior:

- Reads the local `tiles` rows where `source='onboard_ingest' AND voting_status='pending' AND flight_id=<uuid>`.
- Reads the corresponding JPEG body + sidecar JSON from filesystem.
- Reads the per-flight onboard tile-signing private key (still on the companion's NVM until FDR rolls over).
- Submits to `satellite-provider`'s `POST /api/satellite/tiles/ingest` endpoint (D-PROJ-2 contract).
- On 2xx success: deletes local row + JPEG + sidecar + emits FDR event `tile_uploaded`.
- On 4xx: leaves local data; emits FDR event `tile_upload_failed` with reason; operator decides next steps (likely a parent-suite issue).
- On 5xx: retries with exponential backoff; persistent failure → `tile_upload_failed` + operator review.

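
The status-code handling in the list above can be sketched as a small retry loop. This is an illustration under stated assumptions, not the Tile Manager's code: `upload_tile` is a hypothetical name, `post` stands in for one HTTP request to the ingest endpoint, and the attempt count and 1 s base delay are placeholders because the document does not specify the backoff parameters.

```python
import time
from typing import Callable


def upload_tile(
    post: Callable[[], int],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Map HTTP status codes to the outcomes listed above.

    `post` performs one ingest request and returns the HTTP status code.
    """
    for attempt in range(max_attempts):
        status = post()
        if 200 <= status < 300:
            return "tile_uploaded"  # caller deletes row + JPEG + sidecar
        if 400 <= status < 500:
            return "tile_upload_failed"  # keep local data; retrying will not help
        # Anything else (5xx in practice) is retried with a doubling delay.
        if attempt < max_attempts - 1:
            sleep(base_delay_s * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
    return "tile_upload_failed"  # persistent server failure: operator review


# Demo: two server errors, then success. Delays are recorded, not slept.
responses = iter([503, 502, 201])
delays = []
outcome = upload_tile(lambda: next(responses), sleep=delays.append)
print(outcome, delays)
```

Injecting `sleep` keeps the loop testable without real waiting; a production version would likely add jitter and a cap on the delay.
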

When the parent-suite voting layer (D-PROJ-2 design task #2) ships, this flow does NOT change on the onboard side: the parent suite's promotion logic is invisible to onboard-side upload.

## Rollback Procedures

### Trigger criteria

| Severity | Trigger | Decision-maker |
|---|---|---|
| Critical (per-airframe) | Commissioning flight fails AC-5.2 fallback (the FC IMU-only fallback also failed; airframe lost) | Safety review board (out of scope of this project) |
| Critical (fleet-wide) | Any post-deploy AC-NEW-4 outlier indicates a regression: P(err > 1 km) measured on a real flight exceeds the AC threshold by a factor of 2 or more | Suite security + onboard team lead |
| High (per-airframe) | Commissioning flight passes but post-flight FDR analysis shows AC-NEW-4 / AC-NEW-7 regression vs. prior release | Onboard team lead |
| High (per-airframe) | Operator unable to complete pre-takeoff readiness gate (manifest hash gate fails repeatedly) | Operator + onboard team lead |
| Medium (per-airframe) | Sustained `dead_reckoned` periods longer than expected; FDR segment drops occurring | Operator + onboard team lead (post-flight investigation; may not warrant immediate rollback) |

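
The fleet-wide critical trigger is a simple ratio test once the per-flight error samples are available. The sketch below reads "exceeds the AC threshold by 2x" as "measured probability is at least twice the threshold"; that reading, the function names, and the example threshold of 0.01 are assumptions, since the AC-NEW-4 threshold value is not stated in this document.

```python
from typing import Sequence


def exceedance_probability(errors_km: Sequence[float], limit_km: float = 1.0) -> float:
    """Empirical P(err > limit_km) over the position-error samples of one flight."""
    if not errors_km:
        raise ValueError("no error samples")
    return sum(1 for e in errors_km if e > limit_km) / len(errors_km)


def is_fleet_wide_regression(
    errors_km: Sequence[float], ac_threshold: float, factor: float = 2.0
) -> bool:
    """True when the measured probability is at least `factor` times the AC threshold."""
    return exceedance_probability(errors_km) >= factor * ac_threshold


# Synthetic flight: 3 of 100 samples exceed 1 km.
samples = [0.2] * 97 + [1.5] * 3

# 0.01 is a placeholder threshold, not the real AC-NEW-4 value.
print(exceedance_probability(samples))
print(is_fleet_wide_regression(samples, ac_threshold=0.01))
```
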
### Rollback steps (per-airframe)

1. **Re-flash** the previous release's JetPack image onto the affected Jetson (same procedure as § 4 with the previous artifact).
2. **Re-stage** the previous release's pre-flight bundle (the operator workstation retains it in the operator-tooling cache for ≥ 30 days).
3. **Re-run** the pre-takeoff readiness gate.
4. **Confirm** AC-5.2 fallback is still functional (it is FC firmware behavior; rolling back the companion image cannot break it, but verify on the GCS).
5. **Document** the rollback in the post-mortem template; include FDR snapshots from the offending flight (if any) plus the rollback artifact versions.

### Database rollback (data_model.md § 4.2 reversibility)

Per data_model.md § 4.2, every Alembic migration MUST implement a working `downgrade()`. Rolling back the JetPack image to the previous release rolls back the schema to whatever migration head the previous release uses. Concretely:

- The previous release's JetPack image contains its own Alembic migration tree.
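- On boot, the previous-release runtime asserts `alembic current == head_for_that_release`. If the database is on a NEWER head (because the airframe ran the new release between deployments), the runtime invokes `alembic downgrade <previous-release-head>` automatically. Alembic can only run a `downgrade()` whose revision script it can load, so this step assumes the newer release's migration scripts remain available to the runtime after the re-flash.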
- If a migration is **not reversible** (which requires an explicit ADR per data_model.md § 4.2), the rollback must be manually adjudicated by the operator + onboard team lead. This case is rare by policy.

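
The boot-time decision described above can be sketched as a pure function over revision ids. This is an illustration only: `plan_schema_action` is a hypothetical name, the history is assumed to be linear (no branches), and refusing to start when the database is older than the release head is an assumption, since the document only covers the "database is newer" case.

```python
from typing import List


def plan_schema_action(db_revision: str, release_head: str, history: List[str]) -> str:
    """Decide what a booting runtime should do about the schema.

    `history` lists revision ids oldest-first and must include any newer
    revisions the database may be on (Alembic needs their scripts to downgrade).
    """
    if db_revision not in history or release_head not in history:
        return "refuse: unknown revision, manual adjudication"
    db_pos, head_pos = history.index(db_revision), history.index(release_head)
    if db_pos == head_pos:
        return "ok"
    if db_pos > head_pos:
        return f"alembic downgrade {release_head}"
    return "refuse: database older than this release's head"


# Hypothetical revision ids: the rolled-back release ends at "b2",
# while the database was migrated to "c3" by the newer release.
history = ["a1", "b2", "c3"]
print(plan_schema_action("c3", "b2", history))
print(plan_schema_action("b2", "b2", history))
```
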
### Post-mortem

Required after every rollback (per-airframe or fleet-wide):

- Timeline: when was the new release flashed; when did the failure surface; when was rollback initiated.
- Root cause: which AC was missed; which component is implicated; was it a regression introduced by this release or by a hardware/operational variable change.
- What went wrong in the release process: did Tier-2 CI catch it; if not, why not.
- Prevention: new test scenario added to NFT suite; new lint check; new rule in `_docs/LESSONS.md`.
- Distribution: post-mortem report stored under `_docs/06_metrics/incident_<YYYY-MM-DD>_<topic>.md` (per autodev failure-handling protocol).

## Deployment Checklist

Pre-flash:

- [ ] All Tier-2 AC-bound NFTs green at target version
- [ ] Security scan clean (zero critical / high CVEs; SBOM diff passes ADR-002 enforcement)
- [ ] Both Docker images built and pushed (deployment + research)
- [ ] JetPack image built, signed, checksummed
- [ ] Operator-tooling tarball built, signed
- [ ] Pre-flight bundle prepared by operator (cache + calibration + manifest)
- [ ] Pre-takeoff readiness gate behavior verified on a bench Jetson before flashing onto the production unit
- [ ] Rollback artifact (previous release JetPack image) staged on operator workstation
- [ ] FDR retention policy confirmed for the target airframe

Post-flash:

- [ ] First-flight commissioning flight cleared per § 7
- [ ] FDR retrieved and analyzed; AC-NEW-4 / AC-NEW-7 statistics within expected envelope
- [ ] Post-landing upload procedure tested end-to-end (companion → operator workstation → `satellite-provider`)
- [ ] Operator runbook updated with airframe-specific notes (e.g., "this airframe has UART2 wired to FC")

## Tier-2 enablement

Until the Tier-2 self-hosted Jetson runner is fully provisioned:

- AC-bound NFTs are gated as **manual trigger only** on PRs (`ci_cd_pipeline.md` § Manual-trigger override).
- The merge gate on `dev` excludes Tier-2 NFTs; the merge gates on `stage` and `main` retain the full gate.
- The pre-takeoff readiness gate (§ Pre-takeoff readiness gate) is unaffected: it runs on the Jetson at every takeoff regardless of CI gating posture.

Once the Tier-2 runner reaches steady state, this section will be removed and the merge gates harmonized across `dev` / `stage` / `main`.