# GPS-Denied Onboard: Deployment Procedures

> Date: 2026-05-09 (Plan Phase 2c, initial draft).
> Inputs: `_docs/02_document/architecture.md` § 3 (Deployment Model) + § 7 (Security); `_docs/02_document/data_model.md` § 4 (Migration Strategy); environment_strategy.md; ADR-002, ADR-004, ADR-005; AC-NEW-1, AC-NEW-3, AC-NEW-4, AC-NEW-5.
## Deployment scope and model

This project does **not** ship a service; it ships an **embedded edge image** plus an **operator-orchestrator bundle**. The "deployment" patterns from the standard template (blue-green / rolling / canary) are not applicable. Deployment for this project means:

| Artifact | Target | Deployment mechanism |
|---|---|---|
| **JetPack image** (`gps-denied-jetpack-<semver>-<sha>.img`) | Production Jetson Orin Nano Super on a UAV | Operator flashes the image onto the Jetson via NVIDIA `sdkmanager` or `Etcher`-style `dd` from the operator workstation |
| **Operator-orchestrator tarball** (`operator-orchestrator-<semver>-<sha>.tar.gz`) | Operator workstation | Operator extracts; `docker compose up -d` brings up `mock-suite-sat-service` (when offline) + `operator-orchestrator` |
| **Tier-1 dev compose** | Developer workstation | Developer runs `docker compose up` from repo root |

**Zero-downtime is not a goal**: a UAV is not in service while it is being re-flashed. The deployment cadence is per-airframe maintenance, not per-request availability.

**Strategy**: the closest analogue to a "rolling deploy" is the operator's fleet-management process (re-flash one UAV at a time across the fleet). The fleet-management process is the operator's concern, not this project's; this document covers the per-airframe procedure.
## Pre-deployment artifact assembly (release engineer)

Performed once per release on Tier-1 + Tier-2 CI; produces signed artifacts stored in the release bucket.

1. Tag a commit on `main`. CI runs the full pipeline (`ci_cd_pipeline.md`).
2. **Tier-1 produces**:
   - `companion-tier1:deployment-<sha>` and `companion-tier1:research-<sha>` Docker images (pushed to registry).
   - `mock-suite-sat-service:<sha>` Docker image.
   - `operator-orchestrator:<sha>` Docker image.
   - SBOM artifacts for both binaries (deployment and research).
   - `operator-orchestrator-<semver>-<sha>.tar.gz` containing the operator-orchestrator image + mock-sat image + their compose file + verification script + relevant docs.
3. **Tier-2 produces**:
   - Native deployment-binary build on the self-hosted Jetson runner.
   - SBOM verification: byte-equal (after canonicalization) to Tier-1's deployment-binary SBOM. Mismatch fails the release.
   - **JetPack image build**: a JetPack 6.2 base image with the deployment binary + PostgreSQL 16 + base migrations + `/etc/gps-denied/runtime.yaml` template preinstalled. Output: `gps-denied-jetpack-<semver>-<sha>.img`.
4. **Signing** (Tier-1):
   - Both `companion-tier1` Docker image manifests (deployment and research) are signed with the project's release key.
   - The JetPack image is signed; its checksum is published as a separate signed file (`gps-denied-jetpack-<semver>-<sha>.img.sha256.sig`).
   - The operator-orchestrator tarball is signed.
5. **Release bucket**: artifacts uploaded; release notes published; the previous release's artifacts retained for at least 90 days for rollback support.

A release fails if any step above fails, including any AC-bound NFT failure on Tier-2 (`ci_cd_pipeline.md` § AC-bound NFTs).
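The "byte-equal (after canonicalization)" SBOM comparison in step 3 can be sketched as follows. This is a minimal illustration, not the project's verification script: it assumes the SBOMs are JSON documents, that `serialNumber` and `timestamp` are the only fields allowed to differ between builds, and that list order carries no meaning. The function names are hypothetical.

```python
import json

# Assumption: keys that legitimately differ between two builds of the same source.
VOLATILE_KEYS = {"serialNumber", "timestamp"}


def canonicalize(node):
    """Drop volatile keys and sort lists so ordering cannot cause a mismatch."""
    if isinstance(node, dict):
        return {k: canonicalize(v) for k, v in node.items() if k not in VOLATILE_KEYS}
    if isinstance(node, list):
        items = [canonicalize(v) for v in node]
        # Sort by the serialized form so lists of dicts get a stable order.
        return sorted(items, key=lambda v: json.dumps(v, sort_keys=True))
    return node


def canonical_bytes(sbom: dict) -> bytes:
    """Serialize with sorted keys and fixed separators for a stable byte form."""
    text = json.dumps(canonicalize(sbom), sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def sboms_match(tier1: dict, tier2: dict) -> bool:
    """True when both SBOMs are byte-equal after canonicalization."""
    return canonical_bytes(tier1) == canonical_bytes(tier2)
```

Under these assumptions, two SBOMs that list the same components in a different order and carry different build timestamps compare equal, while any version difference makes the comparison fail.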
## Pre-takeoff readiness gate ("health check" analog)

Production has no `/health/live` HTTP endpoint (no listener; NFT-SEC-05). The companion's "health check" is the **pre-takeoff readiness gate**: a sequence of checks that runs at every takeoff and decides whether the companion is ready to emit external position to the FC.

| Check | What it validates | Action on failure |
|---|---|---|
| Manifest content-hash gate (D-C10-3) | The on-disk manifest matches the operator-staged manifest hash (data_model.md § 2.4) | FDR record `0x000D ContentHashGateFail` + STATUSTEXT critical + companion refuses to publish a `GPS_INPUT` / `MSP2_SENSOR_GPS` source |
| Camera calibration JSON validation | File present + schema-valid + content-hash matches `manifests.calibration_artifact_hash` | Same |
| FAISS `.index` mmap + content-hash | mmap succeeds + content-hash matches `manifests.descriptor_index_hash` | Same |
| TRT engine cache verification | All required engines present per `engine_cache_entries`; each engine's content-hash matches `engine_hash` | Same |
| `alembic current == head` | DB schema is up-to-date for this binary | Same |
| MAVLink-2.0 signing handshake (AP profile) | Signed handshake with the FC succeeds within AC-NEW-1 30 s budget (D-C8-9 = (d)) | FDR record `MavlinkSigningKeyRotated` with reason "handshake_failed" + STATUSTEXT critical + companion refuses to emit |
| Per-flight key generation | Both per-flight ephemeral keys (MAVLink signing + onboard tile signing) generated and persisted under `/var/lib/gps-denied/per-flight/` | Same |
| Initial frame → emit pipeline test | First nav-camera frame reaches C8 outbound encoder; `EmittedExternalPosition` produced | Same |
| Network egress is denied | Verify no outbound network egress is possible (DNS blackhole effective, iptables OUTPUT REJECT loaded); defense-in-depth on architecture.md § 7 + NFT-SEC-05 | FDR critical + STATUSTEXT + refuse to emit |

The gate completes within the AC-NEW-1 30 s p95 budget. A failure produces a clear FDR + STATUSTEXT trail and the companion's `GPS_INPUT` / `MSP2_SENSOR_GPS` channel stays silent; the FC then operates as if no companion-GPS source is available, which is the correct safe default.
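The fail-closed behavior of the gate can be sketched as a sequential runner. This is an illustrative sketch only: `run_gate`, `GateResult`, and the check names are hypothetical, and treating an overrun of the 30 s budget as a gate failure is an assumption rather than something this document specifies.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

BUDGET_S = 30.0  # AC-NEW-1 budget quoted in this section


@dataclass
class GateResult:
    ready: bool
    failed_check: Optional[str]
    elapsed_s: float


def run_gate(
    checks: List[Tuple[str, Callable[[], bool]]],
    budget_s: float = BUDGET_S,
    clock: Callable[[], float] = time.monotonic,
) -> GateResult:
    """Run checks in order; stop at the first failure and refuse to emit."""
    start = clock()
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failed check
        elapsed = clock() - start
        if not ok:
            return GateResult(False, name, elapsed)
        if elapsed > budget_s:
            # Assumption: running past the budget is itself a failure.
            return GateResult(False, "budget_exceeded", elapsed)
    return GateResult(True, None, clock() - start)
```

A caller would publish position only when `ready` is true, and would otherwise write the FDR record and STATUSTEXT for `failed_check`; the default outcome of any error is a silent channel.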
## Production deployment procedure (per-airframe)

This is the per-airframe deployment procedure performed by the operator, NOT by CI.

### 1. Pre-deploy approval

Required before any production-bound flight:

- [ ] Release notes for the target version reviewed; AC-NEW-4 / AC-NEW-7 statistical summaries reviewed.
- [ ] All Tier-2 AC-bound NFTs green at the target version (`ci_cd_pipeline.md` § AC-bound NFTs).
- [ ] Security audit of the target version completed (Tier-1 SBOM clean of unpatched CVEs; D-CROSS-CVE-1).
- [ ] D-PROJ-1 calibration step performed on the target Jetson + UAV pairing (hybrid factory + checkerboard-refined; ~1 day per deployed unit).
- [ ] Rollback artifact (the previous release's JetPack image) is staged on the operator workstation.
- [ ] FDR retention policy for this airframe confirmed (default 30 days; environment_strategy.md § Database Management).
- [ ] If switching FC profile (`ardupilot_plane` ↔ `inav`), FC firmware compatibility confirmed.
### 2. Pre-deploy checks (operator workstation)

```sh
# Verify the artifact bundle integrity.
cosign verify-blob \
  --signature gps-denied-jetpack-<semver>-<sha>.img.sha256.sig \
  --key gps-denied-release-key.pub \
  gps-denied-jetpack-<semver>-<sha>.img.sha256

sha256sum -c gps-denied-jetpack-<semver>-<sha>.img.sha256

# Verify the operator-orchestrator tarball.
cosign verify-blob \
  --signature operator-orchestrator-<semver>-<sha>.tar.gz.sig \
  --key gps-denied-release-key.pub \
  operator-orchestrator-<semver>-<sha>.tar.gz
```
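For workstations where `sha256sum` is unavailable, the checksum step (not the signature step) can be approximated in Python. This is a sketch under the assumption that the `.sha256` file uses the usual `<hex digest>  <filename>` layout written by `sha256sum`; the function names are hypothetical, and the signature on the checksum file must still be verified separately.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so multi-gigabyte images fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum_file(checksum_file: Path) -> bool:
    """Check every '<hex digest>  <filename>' entry next to the checksum file."""
    checked = 0
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # sha256sum marks binary mode with a leading '*' on the file name.
        target = checksum_file.parent / name.lstrip("*")
        if not target.is_file() or sha256_of(target) != expected.lower():
            return False
        checked += 1
    return checked > 0  # an empty checksum file verifies nothing
```

Calling `verify_checksum_file(Path("gps-denied-jetpack-<semver>-<sha>.img.sha256"))` with the real file name returns `True` only when every listed file exists and matches its digest.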
### 3. Pre-flight cache build (operator-orchestrator C12)

Performed on the operator workstation, with `satellite-provider` reachable (locally mirrored or via lab VPN).

```sh
docker compose -f operator-orchestrator-compose.yml up -d
# Operator opens http://127.0.0.1:8080
```

The C12 UI walks the operator through:

1. Upload / select the target operational sector (GeoJSON polygon).
2. Set sector classifications (`active_conflict` ↔ `stable_rear`), which drive the freshness threshold (data_model.md § 2.3).
3. Tile download from `satellite-provider` (parent suite), which produces `tiles` rows with `source='googlemaps'` + filesystem JPEGs.
4. Descriptor (FAISS) index generation across the loaded tile corpus.
5. TRT engine compilation on the workstation (Tier-2 emulation if no Jetson is present, or directly on a co-located Jetson dev kit).
6. Manifest generation: hash over (model bundle + calibration JSON + corpus + sector classifications + descriptor index + engine cache).
7. Output: a sealed pre-flight bundle on a USB drive or staged for direct ethernet transfer.
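Step 6 hashes several components into one manifest hash. The actual scheme is defined in data_model.md and is not reproduced here; the sketch below only illustrates one way to combine per-component digests deterministically. The function name, the component keys, and the length-prefixed encoding are all assumptions.

```python
import hashlib
from typing import Mapping


def manifest_hash(component_hashes: Mapping[str, str]) -> str:
    """Combine named component digests into a single order-independent digest."""
    digest = hashlib.sha256()
    for name in sorted(component_hashes):
        # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
        for field in (name, component_hashes[name]):
            data = field.encode("utf-8")
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()
```

With a scheme like this, changing any single component (for example the calibration JSON) changes the manifest hash, which is what lets the pre-takeoff content-hash gate detect a partially updated cache.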
### 4. JetPack image flash

Operator flashes the target JetPack image onto the Jetson:

```sh
# Replace /dev/sdX with the target device; dd overwrites it without confirmation.
sudo dd if=gps-denied-jetpack-<semver>-<sha>.img of=/dev/sdX bs=4M status=progress
sync
# OR use NVIDIA SDK Manager for a more guided flow.
```

The flashed image contains:

- JetPack 6.2 base
- The deployment binary preinstalled at `/opt/gps-denied/`
- PostgreSQL 16 with `alembic` schema initialized at the target migration head
- `/etc/gps-denied/runtime.yaml` template (the operator fills in airframe-specific values: `fc_profile`, `companion_id`)
- A systemd unit `gps-denied.service` that auto-starts at boot

The image is **identical across UAVs**; per-airframe configuration (`/etc/gps-denied/runtime.yaml`) is filled in after flash.
### 5. Per-airframe configuration

Operator boots the Jetson in maintenance mode, SSHes in (this is the only time the Jetson has any inbound network surface; it is closed before takeoff), and runs:

```sh
sudo $EDITOR /etc/gps-denied/runtime.yaml
# Set: fc_profile, companion_id, fdr_retention_days, log_level
sudo gps-denied-cli stage-cache /mnt/usb/gps-denied-cache-<sector-id>.tar.gz
# Stages the operator-prepared cache + calibration + manifest into /var/lib/gps-denied/.
sudo gps-denied-cli verify-readiness
# Runs all gate checks except the MAVLink signing handshake (which requires the FC to be powered).
```
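Before staging the cache it can help to sanity-check the four values set in `runtime.yaml`. The sketch below validates an already-parsed mapping; it is not the project's schema. The `fc_profile` values come from the approval checklist in § 1, while the accepted `log_level` names and the "non-empty string" rule for `companion_id` are assumptions, and `validate_runtime_config` is a hypothetical helper.

```python
ALLOWED_FC_PROFILES = {"ardupilot_plane", "inav"}
# Assumption: log levels follow the standard Python logging names.
ALLOWED_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def validate_runtime_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if cfg.get("fc_profile") not in ALLOWED_FC_PROFILES:
        errors.append("fc_profile must be one of " + ", ".join(sorted(ALLOWED_FC_PROFILES)))
    companion_id = cfg.get("companion_id")
    if not isinstance(companion_id, str) or not companion_id.strip():
        errors.append("companion_id must be a non-empty string")
    days = cfg.get("fdr_retention_days")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(days, bool) or not isinstance(days, int) or days < 1:
        errors.append("fdr_retention_days must be a positive integer")
    if cfg.get("log_level") not in ALLOWED_LOG_LEVELS:
        errors.append("log_level must be one of " + ", ".join(sorted(ALLOWED_LOG_LEVELS)))
    return errors
```

A mapping such as `{"fc_profile": "inav", "companion_id": "uav-07", "fdr_retention_days": 30, "log_level": "INFO"}` (all values illustrative) yields an empty list.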
### 6. UAV integration

- Wire the Jetson UART/USB to the FC.
- For ArduPilot Plane: configure FC parameters per the AP-side checklist (`EK3_SRC1_POSXY = 3` or per D-C8-2 = (b) configuration, `AHRS_EKF_TYPE = 3`).
- For iNav: configure `gps_provider = MSP`, `gps_ublox_use_galileo = OFF`.
- Power up the FC; verify MAVLink signing handshake completes within 30 s (AC-NEW-1).
### 7. First-flight commissioning

The first flight on a freshly-deployed airframe is a **commissioning flight**, not a production flight:

- Operator stays in line-of-sight.
- AC-5.2 fallback (FC IMU-only) is the primary safety net during commissioning.
- Operator manually triggers a `MAV_CMD_REQUEST_MESSAGE` to confirm `GPS_INPUT` is being received and the FC's EKF source-set switch responds correctly.
- If everything looks healthy on the GCS dashboard for 5+ minutes of cruise, the airframe is cleared for production flights.
### 8. Post-deploy monitoring

After the first commissioning flight:

- [ ] FDR retrieved and visualized on operator workstation (operator-orchestrator C12 dashboard, observability.md § 5.1).
- [ ] AC-NEW-4 statistics for the commissioning flight reviewed; outliers investigated.
- [ ] No FDR segment drops; no `ContentHashGateFail` events.
- [ ] Mid-flight tile generation working (post-landing upload is handled separately; see § Post-landing tile upload).
- [ ] If everything is green, the deployment is finalised; the previous release's JetPack image can be archived (still kept for rollback).
## Post-landing tile upload (per-flight, ADR-004)

Per AC-8.4 + ADR-004, upload of tiles generated mid-flight to `satellite-provider` is **post-landing only** and uses the operator-orchestrator's C11 Tile Manager (`TileUploader` interface; a separate binary, never linked into the airborne image).

```sh
# Operator plugs the companion's NVM into the workstation OR SSHes into the powered-off-then-rebooted Jetson.
docker compose run operator-orchestrator \
  python -m operator_orchestrator.tilemanager upload \
    --flight-id <uuid> \
    --satellite-provider "$SATELLITE_PROVIDER_URL" \
    --signing-pubkey-fingerprint <fingerprint>
```

Behavior:

- Reads the local `tiles` rows where `source='onboard_ingest' AND voting_status='pending' AND flight_id=<uuid>`.
- Reads the corresponding JPEG body + sidecar JSON from filesystem.
- Reads the per-flight onboard tile-signing private key (still on the companion's NVM until FDR rolls over).
- Submits to `satellite-provider`'s `POST /api/satellite/tiles/ingest` endpoint (D-PROJ-2 contract).
- On 2xx success: deletes local row + JPEG + sidecar + emits FDR event `tile_uploaded`.
- On 4xx: leaves local data; emits FDR event `tile_upload_failed` with reason; operator decides next steps (likely a parent-suite issue).
- On 5xx: retries with exponential backoff; persistent failure → `tile_upload_failed` + operator review.

When the parent-suite voting layer (D-PROJ-2 design task #2) ships, this flow does NOT change on the onboard side; the parent suite's promotion logic is invisible to onboard-side upload.
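The status handling listed under "Behavior" can be sketched as a small retry loop. This is a hedged illustration, not the `TileUploader` implementation: `upload_tile`, its parameters, the attempt limit, and the base delay are hypothetical, and retrying any status outside 2xx/4xx is a simplifying assumption.

```python
import time
from typing import Callable


def upload_tile(
    send: Callable[[], int],
    max_attempts: int = 5,        # assumption: the real limit is not given here
    base_delay_s: float = 1.0,    # assumption: the real base delay is not given here
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call send() until it settles; return the FDR event name to emit."""
    for attempt in range(max_attempts):
        status = send()  # send() performs the POST and returns the HTTP status
        if 200 <= status < 300:
            return "tile_uploaded"       # caller deletes row + JPEG + sidecar
        if 400 <= status < 500:
            return "tile_upload_failed"  # client error: retrying will not help
        if attempt < max_attempts - 1:
            # 5xx (and anything unexpected): back off 1 s, 2 s, 4 s, ...
            sleep(base_delay_s * 2 ** attempt)
    return "tile_upload_failed"          # persistent failure: operator review
```

Injecting `send` and `sleep` keeps the loop testable without a network or real delays; local data is only deleted by the caller after `tile_uploaded` is returned.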
## Rollback Procedures

### Trigger criteria

| Severity | Trigger | Decision-maker |
|---|---|---|
| Critical (per-airframe) | Commissioning flight fails AC-5.2 fallback (the FC IMU-only fallback also failed; airframe lost) | Safety review board (out of scope of this project) |
| Critical (fleet-wide) | A post-deploy AC-NEW-4 outlier indicates a regression: P(err > 1 km) measured on a real flight exceeds the AC threshold by a factor of ≥ 2 | Suite security + onboard team lead |
| High (per-airframe) | Commissioning flight passes but post-flight FDR analysis shows AC-NEW-4 / AC-NEW-7 regression vs. prior release | Onboard team lead |
| High (per-airframe) | Operator unable to complete pre-flight readiness gate (manifest hash gate fails repeatedly) | Operator + onboard team lead |
| Medium (per-airframe) | Sustained `dead_reckoned` periods longer than expected; FDR segment drops occurring | Operator + onboard team lead (post-flight investigation; may not warrant immediate rollback) |
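The fleet-wide trigger in the table above is a simple ratio test. The sketch below shows the arithmetic only; the AC-NEW-4 threshold value is not stated in this document, so it is passed in as a parameter, and both function names are hypothetical.

```python
from typing import Sequence


def exceedance_fraction(errors_m: Sequence[float], limit_m: float = 1000.0) -> float:
    """Empirical P(err > limit) over the position-error samples of one flight."""
    if not errors_m:
        raise ValueError("no error samples")
    return sum(1 for e in errors_m if e > limit_m) / len(errors_m)


def is_fleet_wide_regression(
    errors_m: Sequence[float],
    ac_threshold: float,   # the AC-NEW-4 probability bound, supplied by the caller
    factor: float = 2.0,   # "by a factor of >= 2" from the trigger table
) -> bool:
    """True when the measured exceedance is at least factor times the threshold."""
    return exceedance_fraction(errors_m) >= factor * ac_threshold
```

For example, with a purely illustrative threshold of 0.01, a flight where 5 of 100 samples exceed 1 km gives 0.05, which is above 0.02 and would meet the trigger; a flight with 1 of 100 would not.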
### Rollback steps (per-airframe)

1. **Re-flash** the previous release's JetPack image onto the affected Jetson (same procedure as § 4 with the previous artifact).
2. **Re-stage** the previous release's pre-flight bundle (the operator workstation retains it in the operator-orchestrator cache for ≥ 30 days).
3. **Re-run** the pre-takeoff readiness gate.
4. **Confirm** AC-5.2 fallback is still functional (it is FC firmware behavior; rolling back the companion image cannot break it, but verify on the GCS).
5. **Document** the rollback in the post-mortem template; include FDR snapshots from the offending flight (if any) plus the rollback artifact versions.
### Database rollback (data_model.md § 4.2 reversibility)

Per data_model.md § 4.2, every Alembic migration MUST implement a working `downgrade()`. Rolling back the JetPack image to the previous release rolls back the schema to whatever migration head the previous release uses. Concretely:

- The previous release's JetPack image contains its own Alembic migration tree.
- On boot, the previous-release runtime asserts `alembic current == head_for_that_release`. If the database is on a NEWER head (because the airframe ran the new release between deployments), the runtime invokes `alembic downgrade <previous-release-head>` automatically.
- If a migration is **not reversible** (which requires an explicit ADR; data_model.md § 4.2), the rollback must be manually adjudicated by the operator + onboard team lead. This case is rare by policy.
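The boot-time decision described above can be sketched as a pure function. This is an illustration under two assumptions: the revision history is linear, and the runtime can see the full history, including revisions newer than its own head. `schema_action` and the `"behind"` outcome are hypothetical; this document only specifies the downgrade case.

```python
from typing import List


def schema_action(history: List[str], current: str, release_head: str) -> str:
    """Decide what to do at boot; history is ordered oldest to newest."""
    cur_i = history.index(current)        # raises ValueError for unknown revisions
    head_i = history.index(release_head)
    if cur_i == head_i:
        return "none"                      # alembic current == head: gate passes
    if cur_i > head_i:
        # Database was written by a newer release: step the schema back.
        return f"alembic downgrade {release_head}"
    # Assumption: a database older than this release's head is not handled
    # automatically; it is surfaced to the operator instead.
    return "behind"
```

With `history = ["r1", "r2", "r3"]`, a database at `r3` booted by a release whose head is `r2` yields `alembic downgrade r2`.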
### Post-mortem

Required after every rollback (per-airframe or fleet-wide):

- Timeline: when was the new release flashed; when did the failure surface; when was rollback initiated.
- Root cause: which AC was missed; which component is implicated; was it a regression introduced by this release or by a hardware/operational variable change.
- What went wrong in the release process: did Tier-2 CI catch it; if not, why not.
- Prevention: new test scenario added to NFT suite; new lint check; new rule in `_docs/LESSONS.md`.
- Distribution: post-mortem report stored under `_docs/06_metrics/incident_<YYYY-MM-DD>_<topic>.md` (per autodev failure-handling protocol).
## Deployment Checklist

Pre-flash:

- [ ] All Tier-2 AC-bound NFTs green at target version
- [ ] Security scan clean (zero critical / high CVEs; SBOM diff passes ADR-002 enforcement)
- [ ] Both Docker images built and pushed (deployment + research)
- [ ] JetPack image built, signed, checksummed
- [ ] Operator-orchestrator tarball built, signed
- [ ] Pre-flight bundle prepared by operator (cache + calibration + manifest)
- [ ] Pre-takeoff readiness gate behavior verified on a bench Jetson before flashing onto the production unit
- [ ] Rollback artifact (previous release JetPack image) staged on operator workstation
- [ ] FDR retention policy confirmed for the target airframe

Post-flash:

- [ ] First-flight commissioning cleared per § 7
- [ ] FDR retrieved and analyzed; AC-NEW-4 / AC-NEW-7 statistics within expected envelope
- [ ] Post-landing upload procedure tested end-to-end (companion → operator workstation → `satellite-provider`)
- [ ] Operator runbook updated with airframe-specific notes (e.g., "this airframe has UART2 wired to FC")
## Tier-2 enablement

Until the Tier-2 self-hosted Jetson runner is fully provisioned:

- AC-bound NFTs are gated as **manual trigger only** on PRs (`ci_cd_pipeline.md` § Manual-trigger override).
- The merge gate on `dev` excludes Tier-2 NFTs; the merge gate on `stage` and `main` retains the full gate.
- The pre-takeoff readiness gate (§ Pre-takeoff readiness gate) is unaffected: it runs on the Jetson at every takeoff regardless of CI gating posture.

When the Tier-2 runner is in steady state, this section is removed and the merge gates harmonize across `dev` / `stage` / `main`.