Update autodev state, architecture documentation, and glossary terms

Transitioned the autodev state to phase 21, reflecting the completion of Step 5 and the drafting of Step 6 epics. Revised the architecture documentation to clarify the roles of the Tile Manager and its components, ensuring accurate representation of the system's operational flow. Updated glossary entries for Flight State and Operator to incorporate recent changes and enhance clarity on component interactions and responsibilities.
Oleksandr Bezdieniezhnykh
2026-05-10 00:21:34 +03:00
parent 723f574b14
commit 64542d32fc
52 changed files with 8789 additions and 88 deletions
@@ -0,0 +1,194 @@
# GPS-Denied Onboard — CI/CD Pipeline
> Date: 2026-05-09 (Plan Phase 2c — initial draft).
> Inputs: `_docs/02_document/architecture.md` § 3 (Deployment Model); ADR-002 (build-time exclusion); ADR-005 (Tier-1 / Tier-2 are first-class); ADR-007 (`mock-suite-sat-service` is an e2e-test fixture; reversed 2026-05-09 from the earlier "real component boundary" framing).
## Pipeline Overview
The pipeline has **two execution tiers** (architecture.md ADR-005), reflected in two CI runner pools that share the same workflow definitions but differ in runner labels and active job set:
| Stage | Trigger | Runner | Quality Gate |
|-------|---------|--------|-------------|
| Lint | Every push, every PR | Tier-1 (GitHub-hosted x86_64) | Zero lint errors (Python: `ruff` + `mypy --strict`; C++: `clang-format --dry-run` + `clang-tidy`; CMake: `cmakelang`) |
| Unit | Every push, every PR | Tier-1 | All unit tests pass; coverage ≥ 75 % per component, ≥ 90 % on safety-critical (C5 state estimator, C8 FC adapters) |
| Integration (Tier-1) | Every push, every PR | Tier-1 | Tier-1 integration suite passes (uses `docker-compose.test.yml` — companion + mock-sat + db + e2e-runner) |
| Build (Tier-1, both binaries) | Every push, every PR | Tier-1 | `companion-tier1:deployment-<sha>` AND `companion-tier1:research-<sha>` build green (ADR-002 dual-emit) |
| SBOM diff | After build | Tier-1 | Deployment SBOM excludes `vins_mono`, `salad`, etc.; research SBOM includes all strategies; PR fails on mismatch |
| Security | After build | Tier-1 | Zero unpatched critical / high CVEs (`pip-audit` + `dotnet list package --vulnerable` for mock-sat + Trivy on images) |
| Push images (Tier-1) | PR merge to `dev`, `stage`, `main` | Tier-1 | Push succeeds; PRs do NOT push (avoids polluting registry) |
| Build (Tier-2 deployment binary) | PR merge to `dev`, `stage`, `main` | Tier-2 (self-hosted Jetson) | Native build on Jetson green; deployment binary SBOM matches Tier-1 deployment SBOM |
| AC-bound NFTs (Tier-2) | PR merge to `dev`, `stage`, `main`; manual on PR | Tier-2 | NFT-PERF-* (AC-4.1, AC-NEW-1, AC-NEW-2), NFT-LIM-* (AC-4.2, AC-NEW-3), NFT-RES-* (AC-NEW-4, AC-NEW-7), IT-12 (comparative study) all pass thresholds in `tests/traceability-matrix.md` |
| JetPack image build | Tag on `main` | Tier-2 | JetPack 6.2 image built with deployment binary preinstalled, signed, and attested |
| Operator tooling tarball | Tag on `main` | Tier-1 | Tarball contains C11 Tile Manager (both `TileDownloader` and `TileUploader`) + C12 Operator Pre-flight Tooling + mock-sat-service compose + verification script |
Tier-2 jobs are the **only** AC-bound jobs. Everything else runs on Tier-1.
## Stage Details
### Lint
Linters run in parallel, one job per language, inside a single Tier-1 workflow. Each job reports its findings in per-file order so that a single failure is easy to grep in the log.
| Language | Tool | Rules |
|---|---|---|
| Python | `ruff` (formatter + linter) | Project's `pyproject.toml` configures rules; `ruff check` enforces the lint rules and `ruff format --check` enforces that the committed code is formatted |
| Python types | `mypy --strict` | Strict mode; all components must type-check (CI fails on `error: ...`) |
| C++ | `clang-format --dry-run` + `clang-tidy` | `.clang-format` lives at repo root; `clang-tidy` checks listed in `.clang-tidy` |
| CMake | `cmakelang` (`cmake-format --check`) | `.cmake-format.yaml` lives at repo root |
| YAML / Markdown | `yamllint`, `markdownlint-cli` | Used for `.github/`, `_docs/`, `docker-compose*.yml` |
### Unit
| Component | Framework | Coverage gate |
|---|---|---|
| Python (host code) | `pytest` + `pytest-cov` | `--cov-fail-under=75` per component; safety-critical (C5, C8) at `--cov-fail-under=90` |
| C++ (per-strategy native builds) | `gtest` + `lcov` | Per-strategy library `≥ 75 %` line coverage; `klt_ransac` (mandatory simple-baseline) at `≥ 90 %` |
| Mock sat service (.NET) | `dotnet test` + `coverlet` | `≥ 75 %` line coverage on the mock |
Coverage report is published as a pipeline artifact (`coverage/index.html`). CI fails fast on threshold violation.
### Integration (Tier-1)
Drives the autodev e2e contract: runs `docker compose -f docker-compose.test.yml up --abort-on-container-exit --exit-code-from e2e-runner --build` from `e2e/` and captures `e2e/results/report.csv`.
Scenarios covered on Tier-1:
- All FT (Functional Test) and IT (Integration Test) scenarios that DO NOT require Jetson hardware (per `tests/traceability-matrix.md` "Tier" column).
- `mock-suite-sat-service` interactions including failure injection (latency, 5xx, partial responses, cache poisoning replay).
- Cross-FC adapter behavior on SITL: ArduPilot Plane SITL runs as a sidecar container; iNav SITL runs as a sidecar container; companion's MAVLink and MSP2 paths are exercised against both.
- D-PROJ-2 contract: post-landing upload payload assembly + signature verification against the mock.
### Build (Tier-1, both binaries)
Per ADR-002, every PR produces both binaries. The build job uses two parallel matrix entries with identical Dockerfile + different `BUILD_*` flags:
```yaml
matrix:
  build_kind:
    - { tag: deployment, args: "BUILD_VINS_MONO=OFF BUILD_SALAD=OFF" }
    - { tag: research, args: "BUILD_VINS_MONO=ON BUILD_SALAD=ON" }
```
The Dockerfile receives the args; `cmake -DBUILD_VINS_MONO=$BUILD_VINS_MONO -DBUILD_SALAD=$BUILD_SALAD` enforces the exclusion at the C++ build layer; `setup.py` / `pyproject.toml` reads the same env to skip importing excluded modules in the composition root validator. Both images are built; both must build green; both go through SBOM and security gates.
### SBOM diff (ADR-002 enforcement)
```yaml
- name: sbom-deployment
  run: syft packages docker:gps-denied/companion-tier1:deployment-${{ github.sha }} -o spdx-json > sbom-deployment.json
- name: sbom-research
  run: syft packages docker:gps-denied/companion-tier1:research-${{ github.sha }} -o spdx-json > sbom-research.json
- name: sbom-diff
  run: python ci/sbom_diff.py --deployment sbom-deployment.json --research sbom-research.json
```
`ci/sbom_diff.py` enforces:
- `vins_mono`, `salad`, and any module flagged "research-only" in `_docs/02_document/components/` MUST appear in research SBOM and MUST NOT appear in deployment SBOM.
- The deployment SBOM is a strict subset of the research SBOM (i.e., the research binary contains everything the deployment binary contains plus the research-only modules).
- Both SBOMs are attached as workflow artifacts and as release artifacts on tag.
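The first two rules can be sketched in a few lines of Python. This is a minimal illustration, not the actual `ci/sbom_diff.py`: it assumes SPDX-JSON input with a top-level `packages` array, and the `RESEARCH_ONLY` set and the function names are hypothetical (the real list would be derived from the component docs).

```python
# Hypothetical research-only list; the real one would come from the component docs.
RESEARCH_ONLY = {"vins_mono", "salad"}


def package_names(sbom: dict) -> set:
    """Collect package names from an SPDX-JSON document."""
    return {pkg["name"] for pkg in sbom.get("packages", [])}


def check_sboms(deployment: dict, research: dict, research_only=RESEARCH_ONLY) -> list:
    """Return a list of violations; an empty list means the gate passes."""
    dep, res = package_names(deployment), package_names(research)
    errors = []
    for name in sorted(research_only):
        if name not in res:
            errors.append(f"{name}: missing from research SBOM")
        if name in dep:
            errors.append(f"{name}: present in deployment SBOM")
    # Subset rule: everything in the deployment binary must also be in research.
    extra = dep - res
    if extra:
        errors.append(f"deployment-only packages: {sorted(extra)}")
    return errors
```

A CI wrapper would load both JSON files, call `check_sboms`, print the violations, and exit non-zero if the list is not empty.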
### Security
| Check | Tool | Block on |
|---|---|---|
| Python dependency CVEs | `pip-audit` against `pyproject.toml` lockfile | Critical / High severity |
| .NET dependency CVEs | `dotnet list package --vulnerable --include-transitive` | Critical / High severity |
| C++ dependency CVEs | Manual audit via SBOM matched against NVD; `osv-scanner` for known submodule pins | Critical / High severity |
| Image scan | Trivy on all CI-built images | Critical / High severity |
| OpenCV pin gate | CI step asserts the resolved OpenCV version is `≥ 4.12.0` (D-CROSS-CVE-1) | Any version `< 4.12.0` |
| GTSAM CVE re-scan | Monthly scheduled workflow against the GTSAM commit pinned in `cmake/dependencies.cmake` | Any new published CVE |
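The OpenCV pin gate is a version comparison, and it is worth doing numerically rather than on strings (as strings, `"4.9.0"` sorts after `"4.12.0"`). A minimal sketch, assuming the resolved version is available as a dotted string; the function names are illustrative, not part of the project:

```python
import re

MIN_OPENCV = (4, 12, 0)  # floor from D-CROSS-CVE-1


def parse_version(text: str) -> tuple:
    """Parse the first three numeric components, e.g. '4.12.0.88' -> (4, 12, 0)."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def opencv_pin_ok(resolved: str) -> bool:
    """True when the resolved version is at or above the pinned floor."""
    return parse_version(resolved) >= MIN_OPENCV
```

For the Python bindings the resolved string could be read from `cv2.__version__`; how the C++ side reports its version is left open here.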
### Push images (Tier-1)
On `push` to `dev`, `stage`, `main`: tag images with `${BRANCH_NAME}-${BUILD_KIND}-${SHORT_SHA}` and push to the registry. PR events do NOT push — PRs get test signal only.
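As an illustration of the tagging and push rule, the sketch below builds the tag and decides whether an event pushes. The 7-character short SHA matches the examples used elsewhere in these docs but is otherwise an assumption, and the function names are hypothetical.

```python
PUSH_BRANCHES = ("dev", "stage", "main")


def image_tag(branch: str, build_kind: str, sha: str, short_len: int = 7) -> str:
    """Build '<branch>-<build_kind>-<short_sha>' for a pushable branch."""
    if branch not in PUSH_BRANCHES:
        raise ValueError(f"images are not pushed from branch {branch!r}")
    return f"{branch}-{build_kind}-{sha[:short_len]}"


def should_push(event: str, branch: str) -> bool:
    """Only push events on the protected branches publish images; PR events never do."""
    return event == "push" and branch in PUSH_BRANCHES
```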
### Build (Tier-2 deployment binary)
Self-hosted Jetson runner (`labels: [self-hosted, jetson, orin-nano-super]`) builds the deployment binary natively. The build is **not** containerized (see architecture.md § 3 for the rationale). After build:
1. Compute the deployment-binary SBOM on Jetson.
2. Compare it byte-for-byte (after canonicalization) against the Tier-1 deployment-binary SBOM. If they diverge, the PR fails — the two binaries must be built from the same source / same dependency pins.
3. Cache the TRT engine builds on the Jetson runner's persistent cache (keyed by manifest hash) so subsequent CI runs reuse them.
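The canonicalized comparison in step 2 could look roughly like this. It is a sketch under assumptions: the SBOMs are SPDX-JSON, the only run-specific fields are `creationInfo` and `documentNamespace`, and package order is not significant. The real canonicalization may need to normalize more than this.

```python
import hashlib
import json

# Assumed run-specific SPDX fields; the real canonicalizer may drop more.
VOLATILE_KEYS = {"creationInfo", "documentNamespace"}


def canonical_bytes(sbom: dict) -> bytes:
    """Drop volatile fields, sort packages, and serialize deterministically."""
    doc = {k: v for k, v in sbom.items() if k not in VOLATILE_KEYS}
    doc["packages"] = sorted(
        doc.get("packages", []),
        key=lambda p: (p.get("name", ""), p.get("versionInfo", "")),
    )
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sboms_match(tier1: dict, tier2: dict) -> bool:
    """Compare the two SBOMs by the SHA-256 of their canonical form."""
    digest = lambda d: hashlib.sha256(canonical_bytes(d)).hexdigest()
    return digest(tier1) == digest(tier2)
```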
### AC-bound NFTs (Tier-2)
Run only on the Tier-2 runner. Each NFT corresponds to one or more acceptance-criterion entries in `tests/traceability-matrix.md`. The runner:
1. Pulls the freshly-built deployment binary.
2. Mounts the curated `tests/fixtures/flight_derkachi/` replay corpus.
3. Runs each NFT scenario, captures jetson-stats telemetry (CPU, GPU, temp, throttle, RAM, VRAM), and compares against the AC threshold.
4. Publishes a per-NFT report; pipeline fails if any threshold is missed.
| NFT scenario | AC | Pass criterion |
|---|---|---|
| NFT-PERF-01 | AC-4.1 | E2E p95 ≤ 400 ms over 1000-frame replay (steady state) |
| NFT-PERF-02 | AC-4.4 | No frame batching detected (per-frame emit gap < 50 ms) |
| NFT-PERF-03 | AC-NEW-1 | Cold-start TTFF p95 < 30 s over 50 cold boots |
| NFT-PERF-04 | AC-NEW-2 | Spoofing-promotion latency p95 < 3 s on AP SITL + iNav SITL |
| NFT-LIM-01 | AC-4.2 | Memory < 8 GB shared (CPU + GPU) over 8 h replay |
| NFT-LIM-02 | AC-NEW-3 | FDR ring stays ≤ 64 GB; no silent drops |
| NFT-LIM-04 | AC-NEW-5 | Workstation thermal-baseline (chamber test deferred) |
| NFT-RES-03 | AC-NEW-4 | Monte Carlo: P(err > 500 m) < 0.1 %, P(err > 1 km) < 0.01 %, with stated 95 % CI |
| NFT-RES-04 | AC-NEW-8 | VISUAL_BLACKOUT mode transition ≤ 400 ms; covariance grows monotonically |
| NFT-SEC-01 | AC-NEW-7 | Cache-poisoning Monte Carlo on onboard side: P(misalign > 30 m) < 1 %, P(> 100 m) < 0.1 %, with 95 % CI |
| NFT-SEC-03 | D-C8-9 | MAVLink 2.0 signing handshake exercised; per-flight rotation logged to FDR |
| NFT-SEC-05 | architecture.md Threat Model | Network-egress-deny on production profile validated (DNS blackhole + iptables OUTPUT REJECT effective) |
| NFT-9 hot-soak | AC-NEW-5 + AC-4.1 | 8 h at +50 °C ambient (chamber if available, else throttle-injection): p95 ≤ 400 ms throughout |
| NFT-10 SBOM CVE audit | D-CROSS-CVE-1 | SBOM clean of unpatched CVEs at audit time; a failed scan blocks the pipeline |
| IT-12 | architecture.md ADR-001 + ADR-002 | Comparative study replays the same fixture against research-binary's all-VIO matrix; report published |
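Several pass criteria above are p95 thresholds. A minimal sketch of such a gate, using the nearest-rank definition of a percentile (the doc does not say which percentile method the NFT harness uses, so that choice and the function names are assumptions):

```python
import math


def percentile_nearest_rank(samples, pct: int):
    """Nearest-rank percentile: the smallest sample with at least pct % of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def passes_latency_gate(latencies_ms, limit_ms: float = 400.0) -> bool:
    """NFT-PERF-01 style check: p95 end-to-end latency must not exceed the limit."""
    return percentile_nearest_rank(latencies_ms, 95) <= limit_ms
```

With nearest-rank, a 1000-frame replay tolerates at most 50 frames above the limit before the gate fails.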
### JetPack image build (release-only)
Runs on tag push to `main`. Produces `gps-denied-jetpack-<semver>-<sha>.img` (the deployable JetPack image) plus a signed checksum. The image is uploaded to the release bucket; the checksum is signed with a release key stored in the Tier-1 secret manager.
### Operator tooling tarball (release-only)
Bundles `operator-tooling` Docker image + `mock-suite-sat-service` Docker image + their compose file + a verification script + the documentation under `_docs/02_document/`. The tarball is uploaded to the release bucket alongside the JetPack image.
## Caching Strategy
| Cache | Key | Restore Keys |
|-------|-----|-------------|
| Python deps (Tier-1) | `pyproject.toml` hash + Python version | Python version only |
| C++ build deps (Tier-1) | `cmake/dependencies.cmake` hash | n/a — full rebuild on change |
| Docker layers (Tier-1) | `Dockerfile` hash + dep-file hashes | Dockerfile hash |
| TRT engine cache (Tier-2) | manifest hash from `_docs/02_document/data_model.md` § 2.4 (`engine_cache_bundle_hash`) | none (engine cache is per-tuple; reuse only on exact tuple match) |
| Tier-1 build artifacts | `git-sha` | branch name |
| Replay fixtures | `tests/fixtures/flight_derkachi/` content hash | n/a |
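To make the key / restore-key relationship concrete, here is a sketch for the Python-deps row. The `python-deps-` prefix and the 16-character digest are assumptions for illustration; the workflow files define the actual key format.

```python
import hashlib


def content_digest(blobs) -> str:
    """Short SHA-256 digest over one or more file contents."""
    h = hashlib.sha256()
    for blob in blobs:
        h.update(len(blob).to_bytes(8, "big"))  # length prefix keeps blob boundaries unambiguous
        h.update(blob)
    return h.hexdigest()[:16]


def python_deps_keys(pyproject: bytes, python_version: str):
    """Return (exact key, restore-key prefix) for the Python dependency cache."""
    restore_prefix = f"python-deps-{python_version}-"
    return restore_prefix + content_digest([pyproject]), restore_prefix
```

An exact hit needs the same `pyproject.toml` content and Python version; the restore prefix lets a run fall back to the most recent cache for the same Python version when only the dependency file changed.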
## Parallelization
```
push → [ lint || unit (parallel per component) ] (Tier-1)
     → integration (Tier-1; sequential)
     → build matrix [deployment, research] (Tier-1; parallel)
     → [ SBOM diff || security ] (Tier-1; parallel)
     → push images (Tier-1; merge events only)
     → [ Tier-2 build || Tier-1 release prep (on tag) ] (parallel)
     → AC-bound NFTs (Tier-2; on merge events; sequential per scenario, parallel where the AC allows)
     → release (on tag; sequential)
```
Tier-1 stages from `lint` through `push images` typically complete in ≤ 12 min; Tier-2 NFTs take 14 h depending on the replay corpus length and the active scenario set.
## Notifications
| Event | Channel | Recipients |
|-------|---------|-----------|
| Build failure (Tier-1) | Slack `#gps-denied-ci` | onboard team |
| Tier-2 NFT failure | Slack `#gps-denied-ci` + email | onboard team + safety reviewer |
| Security alert (CVE block) | Slack `#gps-denied-ci` + email | onboard team + suite security |
| SBOM diff fail (ADR-002) | Slack `#gps-denied-ci` + PR comment | PR author |
| Deploy success (release) | Slack `#gps-denied-releases` | suite-wide |
| JetPack image signature mismatch | Slack `#gps-denied-ci` + email + page | release engineer + safety reviewer |
## Manual-trigger override
Initially, AC-bound NFTs may run on manual trigger only while the Tier-2 runner is being provisioned and the test fixtures are being authored. Until that gating is removed, the merge gate on `dev` excludes Tier-2; `stage` and `main` retain the full gate. The exception is documented in `_docs/02_document/deployment/deployment_procedures.md` § Tier-2 enablement.
## Reference: Woodpecker CI two-workflow contract
The parent suite uses Woodpecker for some sibling components. If the project decides to migrate from GitHub Actions to Woodpecker, the canonical contract from `.cursor/skills/deploy/templates/ci_cd_pipeline.md` § Reference Implementation applies (`.woodpecker/01-test.yml` + `.woodpecker/02-build-push.yml`, multi-arch matrix). Migration is an explicit decision, NOT current state — current pipeline is GitHub Actions plus a self-hosted Jetson runner.
@@ -0,0 +1,245 @@
# GPS-Denied Onboard — Containerization
> Date: 2026-05-09 (Plan Phase 2c — initial draft).
> Inputs: `_docs/02_document/architecture.md` § 3 (Deployment Model); `_docs/00_problem/restrictions.md` § Onboard Hardware; ADR-002 (build-time exclusion of unused strategies); ADR-005 (Tier-1 / Tier-2 are first-class).
## Containerization scope
This project has **asymmetric containerization** by design (architecture.md § 3, ADR-005):
- **Tier-1** (workstation): Docker is the universal runtime. Dev, lint, unit, most integration, and `mock-suite-sat-service` all run in Docker compose.
- **Tier-2 (Jetson)**: **NO Docker**. The deployed JetPack image runs the deployment binary natively. TensorRT INT8 calibration caches and `jetson-stats` thermal telemetry are most reliable without a container layer (D-C7-9 + D-C10-6). The "image" is a JetPack 6.2 system image with the deployment binary preinstalled.
- **Operator workstation**: Docker is used for the local `satellite-provider` mirror, the `mock-suite-sat-service` (when offline), and the operator-tooling stack (C11 Tile Manager + C12 Operator Pre-flight Tooling).
Three Dockerfiles are maintained; the airborne companion uses **none of them** in production.
## Component Dockerfiles
### `gps-denied-companion-tier1` (Tier-1 dev / CI only)
This image is for fast iterative development on a workstation. It is **never** flashed onto a Jetson.
| Property | Value |
|----------|-------|
| Base image | `nvidia/cuda:12.6.0-runtime-ubuntu22.04` (or `python:3.10-slim` if no GPU on dev box) |
| Build image | `nvidia/cuda:12.6.0-devel-ubuntu22.04` |
| Stages | `system-deps` → `python-deps` → `cpp-build` (CMake + GTSAM + FAISS + OpenCV + OKVIS2 + KltRansac) → `runtime` |
| User | `companion` (UID 1000, non-root) |
| Health check | `python -m gps_denied.healthcheck` (validates calibration JSON loadable + DB reachable + FAISS index mmap-able). 30 s interval. |
| Exposed ports | `5101/tcp` (companion control plane — Tier-1 only; Tier-2 production has no inbound network) |
| Key build args | `BUILD_VINS_MONO=OFF` (deployment build), `BUILD_SALAD=OFF`; `BUILD_VINS_MONO=ON BUILD_SALAD=ON` for the research build |
| Notes | Two distinct image tags built on every PR: `companion-tier1:deployment-<sha>` and `companion-tier1:research-<sha>` (ADR-002). |
### `mock-suite-sat-service` (Tier-1 e2e-test fixture; ADR-007 reversed 2026-05-09 — fixture only, not a component)
e2e-test fixture only — implements the planned D-PROJ-2 ingest contract (`POST /api/satellite/tiles/ingest`) so upload integration tests can run before the real endpoint ships service-side. Production never reaches it; the architectural counterparty for upload is the real `satellite-provider`. Download integration tests target the real `satellite-provider` directly (its GET surface is already implemented), not this fixture. Source lives under `tests/fixtures/mock-suite-sat-service/`, NOT `src/components/`.
| Property | Value |
|----------|-------|
| Base image | `mcr.microsoft.com/dotnet/aspnet:8.0-alpine` (matches the parent suite's stack) |
| Build image | `mcr.microsoft.com/dotnet/sdk:8.0-alpine` |
| Stages | `restore` → `build` → `publish` → `runtime` |
| User | `mock` (non-root) |
| Health check | HTTP `GET /healthz` (returns 200 if listening + storage backend mounted). 10 s interval. |
| Exposed ports | `5100/tcp` (matches `satellite-provider`'s port so the same client config works) |
| Key build args | `MOCK_FAILURE_PROFILE` (default `none`; used by NFT-SEC-01 to inject latency / 5xx / partial responses) |
| Notes | The mock is a release artifact (operator-tooling tarball includes its compose file). When the real `satellite-provider` D-PROJ-2 endpoint ships, the mock is retired. |
### `operator-tooling` (Operator workstation Tile Manager + pre-flight UI, C11 + C12)
| Property | Value |
|----------|-------|
| Base image | `python:3.10-slim` |
| Build image | `python:3.10-slim` (no native deps; pure Python plus `httpx` for both download and upload, `psycopg` for read/write of C6 mirror, `cryptography` for upload signing) |
| Stages | `python-deps` → `runtime` |
| User | `operator` (non-root) |
| Health check | `python -m operator_tooling.healthcheck` (validates `satellite-provider` reachable). 30 s interval. |
| Exposed ports | `8080/tcp` (operator pre-flight UI, C12); no inbound network for C11 Tile Manager (it's a CLI / one-shot tool, both directions) |
| Key build args | `INCLUDE_PRE_FLIGHT_UI=true` (default; can be turned off for headless CLI-only deployments) |
| Notes | **C11 Tile Manager (both `TileDownloader` and `TileUploader`) is in this image, NEVER in `gps-denied-companion-tier1`** (ADR-004 process-level isolation). The airborne deployment binary on Tier-2 also does not contain C11. |
## Docker Compose — Local Development
```yaml
# docker-compose.yml
services:
  companion:
    build:
      context: .
      dockerfile: docker/companion-tier1.Dockerfile
      args:
        BUILD_VINS_MONO: "OFF"
        BUILD_SALAD: "OFF"
    image: gps-denied/companion-tier1:dev
    environment:
      - DB_URL=postgresql://gps_denied:dev@db:5432/gps_denied
      - SATELLITE_PROVIDER_URL=http://mock-sat:5100
      - CAMERA_CALIBRATION_PATH=/fixtures/calibration/adti26.json
      - LOG_LEVEL=DEBUG
      - GPS_DENIED_FC_PROFILE=ardupilot_plane
    volumes:
      - ./tests/fixtures:/fixtures:ro
      - tile-cache:/var/lib/gps-denied/tiles
      - fdr:/var/lib/gps-denied/fdr
    depends_on:
      db: { condition: service_healthy }
      mock-sat: { condition: service_healthy }
    healthcheck:
      test: ["CMD", "python", "-m", "gps_denied.healthcheck"]
      interval: 30s
      timeout: 10s
      retries: 3
    networks: [ gps-denied-net ]
  mock-sat:
    build:
      context: ./tests/fixtures/mock-suite-sat-service
      dockerfile: Dockerfile
    image: gps-denied/mock-suite-sat-service:dev
    environment:
      - ASPNETCORE_URLS=http://+:5100
      - MOCK_FAILURE_PROFILE=none
    volumes:
      - mock-sat-tiles:/srv/tiles
    healthcheck:
      test: ["CMD", "wget", "-q", "-O-", "http://localhost:5100/healthz"]
      interval: 10s
    networks: [ gps-denied-net ]
  db:
    image: postgres:16-alpine
    environment:
      - POSTGRES_DB=gps_denied
      - POSTGRES_USER=gps_denied
      - POSTGRES_PASSWORD=dev
    volumes:
      - db-data:/var/lib/postgresql/data
      - ./docker/db-init:/docker-entrypoint-initdb.d:ro
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "gps_denied"]
      interval: 5s
    networks: [ gps-denied-net ]
  operator-tooling:
    build:
      context: .
      dockerfile: docker/operator-tooling.Dockerfile
    image: gps-denied/operator-tooling:dev
    environment:
      - SATELLITE_PROVIDER_URL=http://mock-sat:5100
      - COMPANION_DB_URL=postgresql://gps_denied:dev@db:5432/gps_denied
    ports:
      - "8080:8080"
    depends_on:
      mock-sat: { condition: service_healthy }
    networks: [ gps-denied-net ]
volumes:
  tile-cache:
  fdr:
  db-data:
  mock-sat-tiles:
networks:
  gps-denied-net:
```
## Docker Compose — Tier-1 Integration & Blackbox Tests
```yaml
# docker-compose.test.yml
services:
  companion:
    extends:
      file: docker-compose.yml
      service: companion
    environment:
      - LOG_LEVEL=INFO
      - GPS_DENIED_REPLAY_FIXTURE=/fixtures/flight_derkachi
      - GPS_DENIED_TIER=1
  mock-sat:
    extends:
      file: docker-compose.yml
      service: mock-sat
    volumes:
      - ./tests/fixtures/tiles_corpus:/srv/tiles:ro
  db:
    extends:
      file: docker-compose.yml
      service: db
    volumes:
      - ./tests/fixtures/seed-db.sql:/docker-entrypoint-initdb.d/01_seed.sql:ro
  e2e-runner:
    build:
      context: ./e2e
      dockerfile: Dockerfile
    image: gps-denied/e2e-runner:dev
    depends_on:
      companion: { condition: service_healthy }
      mock-sat: { condition: service_healthy }
      db: { condition: service_healthy }
    environment:
      - PYTEST_ARGS=--csv=/results/report.csv -v
    volumes:
      - ./e2e/results:/results
    # Same network as the services under test so their hostnames resolve.
    networks: [ gps-denied-net ]
# `extends` copies service definitions only, so the named volumes and the
# network that the extended services reference are declared again here.
volumes:
  tile-cache:
  fdr:
  db-data:
  mock-sat-tiles:
networks:
  gps-denied-net:
```
Run: `docker compose -f docker-compose.test.yml up --abort-on-container-exit --exit-code-from e2e-runner --build`.
## Tier-2 — Jetson runtime (NO Docker)
The Tier-2 deployment is a **JetPack 6.2 system image**, not a container. Its assembly is documented in `deployment_procedures.md` § Production Deployment. Key constraints driving the no-Docker decision (architecture.md § 3, D-C7-9 + D-C10-6):
1. **TensorRT INT8 calibration caches**: most reliable when the SM/JetPack/TRT triple matches the host kernel exactly; container-host abstraction is a known source of drift.
2. **`jetson-stats` thermal telemetry**: needs root + sysfs access; runs cleanest on bare metal.
3. **AC-NEW-1 cold-start budget (30 s p95)**: container start adds 12 s overhead the budget cannot afford.
4. **AC-NEW-3 FDR storage (≤ 64 GB)**: the FDR ring is mounted on the host's NVM directly; a container layer would either bind-mount (no benefit) or copy (defeats the storage guarantee).
Tier-2 CI runs the same deployment binary directly on the self-hosted Jetson runner, with no container shim.
## Image Tagging Strategy
| Context | Tag Format | Example |
|---------|-----------|---------|
| CI build (deployment binary) | `<registry>/gps-denied/companion-tier1:deployment-<git-sha>` | `ghcr.io/azaion/gps-denied/companion-tier1:deployment-a1b2c3d` |
| CI build (research binary) | `<registry>/gps-denied/companion-tier1:research-<git-sha>` | `ghcr.io/azaion/gps-denied/companion-tier1:research-a1b2c3d` |
| Mock sat service | `<registry>/gps-denied/mock-suite-sat-service:<git-sha>` | `ghcr.io/azaion/gps-denied/mock-suite-sat-service:a1b2c3d` |
| Operator tooling | `<registry>/gps-denied/operator-tooling:<git-sha>` | `ghcr.io/azaion/gps-denied/operator-tooling:a1b2c3d` |
| Release | `<registry>/gps-denied/<image>:<semver>` | `ghcr.io/azaion/gps-denied/companion-tier1:deployment-1.2.0` |
| Local dev | `gps-denied/<image>:dev` | `gps-denied/companion-tier1:dev` |
| JetPack image (Tier-2) | `gps-denied-jetpack-<semver>-<sha>.img` | `gps-denied-jetpack-1.2.0-a1b2c3d.img` (file artifact, not a container tag) |
## SBOM and binary track
CI emits both Tier-1 binary tracks on every PR (ADR-002). After build, an SBOM diff step asserts:
- The deployment-binary SBOM **must NOT** include `vins_mono`, `salad`, or any other research-only library.
- The research-binary SBOM **must** include every strategy listed in the architecture.
A failing SBOM diff fails the PR. SBOM artifacts are attached to the release; they are NOT shipped on the deployed Jetson image (they live only in the release artifacts directory).
## .dockerignore
```
.git
.cursor
_docs
_standalone
node_modules
**/bin
**/obj
**/__pycache__
**/.venv
**/venv
**/.pytest_cache
**/.mypy_cache
*.md
.env*
docker-compose*.yml
tests/fixtures/large_replays/
```
The `tests/fixtures/large_replays/` exclusion is critical: that directory holds the Derkachi flight footage (multi-GB) which is mounted into the test runner via `volumes:` rather than baked into images.
@@ -0,0 +1,265 @@
# GPS-Denied Onboard — Deployment Procedures
> Date: 2026-05-09 (Plan Phase 2c — initial draft).
> Inputs: `_docs/02_document/architecture.md` § 3 (Deployment Model) + § 7 (Security); `_docs/02_document/data_model.md` § 4 (Migration Strategy); environment_strategy.md; ADR-002, ADR-004, ADR-005; AC-NEW-1, AC-NEW-3, AC-NEW-4, AC-NEW-5.
## Deployment scope and model
This project does **not** ship a service; it ships an **embedded edge image** plus an **operator-tooling bundle**. The "deployment" patterns from the standard template (blue-green / rolling / canary) are not applicable. Deployment for this project means:
| Artifact | Target | Deployment mechanism |
|---|---|---|
| **JetPack image** (`gps-denied-jetpack-<semver>-<sha>.img`) | Production Jetson Orin Nano Super on a UAV | Operator flashes the image onto the Jetson via NVIDIA `sdkmanager` or `Etcher`-style `dd` from the operator workstation |
| **Operator tooling tarball** | Operator workstation | Operator extracts; `docker compose up -d` brings up `mock-suite-sat-service` (when offline) + `operator-tooling` |
| **Tier-1 dev compose** | Developer workstation | Developer runs `docker compose up` from repo root |
**Zero-downtime is not a goal**: a UAV is not in service while it is being re-flashed. The deployment cadence is per-airframe maintenance, not per-request availability.
**Strategy**: the closest analogue to a "rolling deploy" is the operator's fleet-management process (re-flash one UAV at a time across the fleet). The fleet-management process is the operator's concern, not this project's; this document covers the per-airframe procedure.
## Pre-deployment artifact assembly (release engineer)
Performed once per release on Tier-1 + Tier-2 CI; produces signed artifacts stored in the release bucket.
1. Tag a commit on `main`. CI runs the full pipeline (`ci_cd_pipeline.md`).
2. **Tier-1 produces**:
   - `companion-tier1:deployment-<sha>` and `companion-tier1:research-<sha>` Docker images (pushed to registry).
   - `mock-suite-sat-service:<sha>` Docker image.
   - `operator-tooling:<sha>` Docker image.
   - SBOM artifacts for both binaries (deployment and research).
   - `operator-tooling-<semver>-<sha>.tar.gz` containing the operator-tooling image + mock-sat image + their compose file + verification script + relevant docs.
3. **Tier-2 produces**:
   - Native deployment-binary build on the self-hosted Jetson runner.
   - SBOM verification: byte-equal (after canonicalization) to Tier-1's deployment-binary SBOM. Mismatch fails the release.
   - **JetPack image build**: a JetPack 6.2 base image with the deployment binary + PostgreSQL 16 + base migrations + `/etc/gps-denied/runtime.yaml` template preinstalled. Output: `gps-denied-jetpack-<semver>-<sha>.img`.
4. **Signing** (Tier-1):
   - Both Docker image manifests are signed with the project's release key.
   - The JetPack image is signed; checksum is published as a separate signed file (`gps-denied-jetpack-<semver>-<sha>.img.sha256.sig`).
   - The operator-tooling tarball is signed.
5. **Release bucket**: artifacts uploaded; release notes published; the previous release's artifacts retained for at least 90 days for rollback support.
A release fails if any step above fails — including any AC-bound NFT failure on Tier-2 (`ci_cd_pipeline.md` § AC-bound NFTs).
## Pre-takeoff readiness gate ("health check" analog)
Production has no `/health/live` HTTP endpoint (no listener; NFT-SEC-05). The companion's "health check" is the **pre-takeoff readiness gate**: a sequence of checks that runs at takeoff load and decides whether the companion is ready to emit external position to the FC.
| Check | What it validates | Action on failure |
|---|---|---|
| Manifest content-hash gate (D-C10-3) | The on-disk manifest matches the operator-staged manifest hash (data_model.md § 2.4) | FDR record `0x000D ContentHashGateFail` + STATUSTEXT critical + companion refuses to publish a `GPS_INPUT` / `MSP2_SENSOR_GPS` source |
| Camera calibration JSON validation | File present + schema-valid + content-hash matches `manifests.calibration_artifact_hash` | Same |
| FAISS `.index` mmap + content-hash | mmap succeeds + content-hash matches `manifests.descriptor_index_hash` | Same |
| TRT engine cache verification | All required engines present per `engine_cache_entries`; each engine's content-hash matches `engine_hash` | Same |
| `alembic current == head` | DB schema is up-to-date for this binary | Same |
| MAVLink-2.0 signing handshake (AP profile) | Signed handshake with the FC succeeds within AC-NEW-1 30 s budget (D-C8-9 = (d)) | FDR record `MavlinkSigningKeyRotated` with reason "handshake_failed" + STATUSTEXT critical + companion refuses to emit |
| Per-flight key generation | Both per-flight ephemeral keys (MAVLink signing + onboard tile signing) generated and persisted under `/var/lib/gps-denied/per-flight/` | Same |
| Initial frame → emit pipeline test | First nav-camera frame reaches C8 outbound encoder; `EmittedExternalPosition` produced | Same |
| Network egress is denied | Verify no outbound network egress is possible (DNS blackhole effective, iptables OUTPUT REJECT loaded) — defense-in-depth on architecture.md § 7 + NFT-SEC-05 | FDR critical + STATUSTEXT + refuse to emit |
The gate completes within the AC-NEW-1 30 s p95 budget; failure produces a clear FDR + STATUSTEXT trail and the companion's `GPS_INPUT` / `MSP2_SENSOR_GPS` channel stays silent — the FC operates as if no companion-GPS source is available, which is the correct safe-default.
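The gate's fail-closed behavior can be sketched as a list of named checks where any failure (or exception) keeps the position channel silent. This is an illustration only: the `Check` type, the check names, and the choice to run every check rather than stop at the first failure are assumptions, not the companion's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    name: str
    run: Callable[[], bool]  # returns True when the check passes


def run_readiness_gate(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run every check; return (may_emit, names of failed checks)."""
    failures = []
    for check in checks:
        try:
            ok = check.run()
        except Exception:
            ok = False  # a crashing check counts as a failure (fail closed)
        if not ok:
            failures.append(check.name)
    return (not failures, failures)
```

A caller would log each failed name to the FDR and STATUSTEXT, and enable the `GPS_INPUT` / `MSP2_SENSOR_GPS` output only when the first element is `True`.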
## Production deployment procedure (per-airframe)
This is the per-airframe deployment procedure performed by the operator, NOT by CI.
### 1. Pre-deploy approval
Required before any production-bound flight:
- [ ] Release notes for the target version reviewed; AC-NEW-4 / AC-NEW-7 statistical summaries reviewed.
- [ ] All Tier-2 AC-bound NFTs green at the target version (`ci_cd_pipeline.md` § AC-bound NFTs).
- [ ] Security audit of the target version completed (Tier-1 SBOM clean of unpatched CVEs; D-CROSS-CVE-1).
- [ ] D-PROJ-1 calibration step performed on the target Jetson + UAV pairing (hybrid factory + checkerboard-refined; ~1 day per deployed unit).
- [ ] Rollback artifact (the previous release's JetPack image) is staged on the operator workstation.
- [ ] FDR retention policy for this airframe confirmed (default 30 days; environment_strategy.md § Database Management).
- [ ] If switching FC profile (`ardupilot_plane` ↔ `inav`), FC firmware compatibility confirmed.
### 2. Pre-deploy checks (operator workstation)
```sh
# Verify the artifact bundle integrity.
cosign verify-blob \
--signature gps-denied-jetpack-<semver>-<sha>.img.sha256.sig \
--key gps-denied-release-key.pub \
gps-denied-jetpack-<semver>-<sha>.img.sha256
sha256sum -c gps-denied-jetpack-<semver>-<sha>.img.sha256
# Verify the operator-tooling tarball.
cosign verify-blob \
--signature operator-tooling-<semver>-<sha>.tar.gz.sig \
--key gps-denied-release-key.pub \
operator-tooling-<semver>-<sha>.tar.gz
```
### 3. Pre-flight cache build (operator-tooling C12)
Performed on the operator workstation, with `satellite-provider` reachable (locally mirrored or via lab VPN).
```sh
docker compose -f operator-tooling-compose.yml up -d
# Operator opens http://127.0.0.1:8080
```
The C12 UI walks the operator through:
1. Upload / select the target operational sector (GeoJSON polygon).
2. Set sector classifications (`active_conflict` / `stable_rear`), which drive the freshness threshold (data_model.md § 2.3).
3. Tile download from `satellite-provider` (parent suite) — produces `tiles` rows with `source='googlemaps'` + filesystem JPEGs.
4. Descriptor (FAISS) index generation across the loaded tile corpus.
5. TRT engine compilation on the workstation (Tier-2 emulation if no Jetson is present, or directly on a co-located Jetson dev kit).
6. Manifest generation: hash over (model bundle + calibration JSON + corpus + sector classifications + descriptor index + engine cache).
7. Output: a sealed pre-flight bundle on a USB drive or staged for direct ethernet transfer.
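The manifest hash in step 6 can be sketched as follows. This is a minimal illustration, not the C12 implementation: it assumes SHA-256, assumes each component has already been packed into a single file, and the names `COMPONENTS`, `file_digest`, and `manifest_hash` are hypothetical.

```python
import hashlib
from pathlib import Path

# Hypothetical component labels, kept in a fixed order so the hash is reproducible.
COMPONENTS = (
    "model_bundle",
    "calibration",
    "corpus",
    "sector_classifications",
    "descriptor_index",
    "engine_cache",
)


def file_digest(path: Path) -> bytes:
    """SHA-256 of one file, read in 1 MiB chunks to bound memory use."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()


def manifest_hash(paths: dict[str, Path]) -> str:
    """Hash over all components; raises KeyError if one is missing."""
    outer = hashlib.sha256()
    for name in COMPONENTS:
        # Mix in the label so two components cannot be swapped unnoticed.
        outer.update(name.encode() + b"\0")
        outer.update(file_digest(paths[name]))
    return outer.hexdigest()
```

Hashing the per-component digests in a fixed order means a change to any single input (for example a re-classified sector) yields a different manifest hash, which is what the pre-takeoff gate relies on.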
### 4. JetPack image flash
Operator flashes the target JetPack image onto the Jetson:
```sh
# Replace /dev/sdX with the target device (confirm with lsblk before writing).
sudo dd if=gps-denied-jetpack-<semver>-<sha>.img of=/dev/sdX bs=4M status=progress
sync
# Alternatively, flash via NVIDIA SDK Manager for a more guided flow.
```
The flashed image contains:
- JetPack 6.2 base
- The deployment binary preinstalled at `/opt/gps-denied/`
- PostgreSQL 16 with `alembic` schema initialized at the target migration head
- `/etc/gps-denied/runtime.yaml` template (the operator fills in airframe-specific values: `fc_profile`, `companion_id`)
- A systemd unit `gps-denied.service` that auto-starts at boot
The image is **identical across UAVs**; per-airframe configuration (`/etc/gps-denied/runtime.yaml`) is filled in after flash.
### 5. Per-airframe configuration
Operator boots the Jetson in maintenance mode, connects over SSH (this is the only time the Jetson has any inbound network surface; it is closed before takeoff), and runs:
```sh
sudo $EDITOR /etc/gps-denied/runtime.yaml
# Set: fc_profile, companion_id, fdr_retention_days, log_level
sudo gps-denied-cli stage-cache /mnt/usb/gps-denied-cache-<sector-id>.tar.gz
# Stages the operator-prepared cache + calibration + manifest into /var/lib/gps-denied/.
sudo gps-denied-cli verify-readiness
# Runs all gate checks except MAVLink signing handshake (which requires the FC to be powered).
```
### 6. UAV integration
- Wire the Jetson UART/USB to the FC.
- For ArduPilot Plane: configure FC parameters per the AP-side checklist (`EK3_SRC1_POSXY = 3` or per the D-C8-2 = (b) configuration, `AHRS_EKF_TYPE = 3`).
- For iNav: configure `gps_provider = MSP`, `gps_ublox_use_galileo = OFF`.
- Power up the FC; verify MAVLink signing handshake completes within 30 s (AC-NEW-1).
### 7. First-flight commissioning
The first flight on a freshly-deployed airframe is a **commissioning flight**, not a production flight:
- Operator stays in line-of-sight.
- AC-5.2 fallback (FC IMU-only) is the primary safety net during commissioning.
- Operator manually triggers a `MAV_CMD_REQUEST_MESSAGE` to confirm `GPS_INPUT` is being received and the FC's EKF source-set switch responds correctly.
- If everything looks healthy on the GCS dashboard for 5+ minutes of cruise, the airframe is cleared for production flights.
### 8. Post-deploy monitoring
Post first commissioning flight:
- [ ] FDR retrieved and visualized on operator workstation (operator-tooling C12 dashboard, observability.md § 5.1).
- [ ] AC-NEW-4 statistics for the commissioning flight reviewed; outliers investigated.
- [ ] No FDR segment drops; no `ContentHashGateFail` events.
- [ ] Mid-flight tile generation confirmed working (the post-landing upload is verified separately; see § Post-landing tile upload).
- [ ] If everything green, the deployment is finalised; the previous release's JetPack image can be archived (still kept for rollback).
## Post-landing tile upload (per-flight, ADR-004)
Per AC-8.4 + ADR-004, mid-flight tiles are uploaded to `satellite-provider` **post-landing only**, using the operator-tooling's C11 Tile Manager (`TileUploader` interface; a separate binary, never linked into the airborne image).
```sh
# Operator plugs the companion's NVM into the workstation, or connects over SSH to the rebooted Jetson.
docker compose run operator-tooling \
python -m operator_tooling.tilemanager upload \
--flight-id <uuid> \
--satellite-provider $SATELLITE_PROVIDER_URL \
--signing-pubkey-fingerprint <fingerprint>
```
Behavior:
- Reads the local `tiles` rows where `source='onboard_ingest' AND voting_status='pending' AND flight_id=<uuid>`.
- Reads the corresponding JPEG body + sidecar JSON from filesystem.
- Reads the per-flight onboard tile-signing private key (still on the companion's NVM until FDR rolls over).
- Submits to `satellite-provider`'s `POST /api/satellite/tiles/ingest` endpoint (D-PROJ-2 contract).
- On 2xx success: deletes local row + JPEG + sidecar + emits FDR event `tile_uploaded`.
- On 4xx: leaves local data; emits FDR event `tile_upload_failed` with reason; operator decides next steps (likely a parent-suite issue).
- On 5xx: retries with exponential backoff; persistent failure → `tile_upload_failed` + operator review.
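The status-code handling above can be sketched as a small retry loop. This is an illustration under stated assumptions: `upload_tile`, `max_attempts`, and `base_delay_s` are hypothetical names and values, and the HTTP call is injected as `send` so the sketch does not presume any particular client library.

```python
import time
from typing import Callable


def upload_tile(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the FDR event name for one tile upload.

    `send` performs the POST and returns the HTTP status code.
    """
    for attempt in range(max_attempts):
        status = send()
        if 200 <= status < 300:
            return "tile_uploaded"       # caller deletes row + JPEG + sidecar
        if 400 <= status < 500:
            return "tile_upload_failed"  # no retry; local data is kept
        if attempt < max_attempts - 1:
            # 5xx (or any other unexpected status): back off 1 s, 2 s, 4 s, ...
            sleep(base_delay_s * 2 ** attempt)
    return "tile_upload_failed"          # persistent failure: operator review
```

Keeping the deletion of local data in the caller, and only after `tile_uploaded` is returned, preserves the rule that a failed upload never destroys the tile.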
When the parent-suite voting layer (D-PROJ-2 design task #2) ships, this flow does NOT change on the onboard side — the parent suite's promotion logic is invisible to onboard-side upload.
## Rollback Procedures
### Trigger criteria
| Severity | Trigger | Decision-maker |
|---|---|---|
| Critical (per-airframe) | Commissioning flight fails AC-5.2 fallback (the FC IMU-only fallback also failed; airframe lost) | Safety review board (out of scope of this project) |
| Critical (fleet-wide) | A post-deploy AC-NEW-4 outlier indicates a regression: P(err > 1 km) measured on a real flight exceeds the AC threshold by a factor of ≥ 2 | Suite security + onboard team lead |
| High (per-airframe) | Commissioning flight passes but post-flight FDR analysis shows AC-NEW-4 / AC-NEW-7 regression vs. prior release | Onboard team lead |
| High (per-airframe) | Operator unable to complete pre-flight readiness gate (manifest hash gate fails repeatedly) | Operator + onboard team lead |
| Medium (per-airframe) | Sustained `dead_reckoned` periods longer than expected; FDR segment drops occurring | Operator + onboard team lead (post-flight investigation; may not warrant immediate rollback) |
### Rollback steps (per-airframe)
1. **Re-flash** the previous release's JetPack image onto the affected Jetson (same procedure as § 4 with the previous artifact).
2. **Re-stage** the previous release's pre-flight bundle (the operator workstation retains it in the operator-tooling cache for ≥ 30 days).
3. **Re-run** the pre-takeoff readiness gate.
4. **Confirm** AC-5.2 fallback is still functional (it is FC firmware behavior; rolling back the companion image cannot break it, but verify on the GCS).
5. **Document** the rollback in the post-mortem template; include FDR snapshots from the offending flight (if any) plus the versions of the rollback artifacts.
### Database rollback (data_model.md § 4.2 reversibility)
Per data_model.md § 4.2, every Alembic migration MUST implement a working `downgrade()`. Rolling back the JetPack image to the previous release rolls back the schema to whatever migration head the previous release uses. Concretely:
- The previous release's JetPack image contains its own Alembic migration tree.
- On boot, the previous-release runtime asserts `alembic current == head_for_that_release`. If the database is on a NEWER head (because the airframe ran the new release between deployments), the runtime invokes `alembic downgrade <previous-release-head>` automatically.
- If a migration is **not reversible** (which requires an explicit ADR — data_model.md § 4.2), the rollback must be manually adjudicated by the operator + onboard team lead. This case is rare by policy.
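The boot-time decision described above can be sketched as a pure function. This is a hedged illustration, not the runtime's code: `plan_schema_action`, `SchemaAction`, and the idea of passing the revision history as an oldest-first list are assumptions made for the example, and treating an older-than-expected database as a manual case is likewise an assumption.

```python
from enum import Enum


class SchemaAction(Enum):
    PROCEED = "proceed"
    DOWNGRADE = "downgrade"
    MANUAL = "manual_adjudication"


def plan_schema_action(
    current: str,
    release_head: str,
    history: list[str],
    irreversible: set[str],
) -> SchemaAction:
    """Decide what a rolled-back release does at boot.

    `history` lists revision ids oldest-first; `irreversible` holds the
    revisions whose downgrade() cannot be applied.
    """
    cur, head = history.index(current), history.index(release_head)
    if cur == head:
        return SchemaAction.PROCEED
    if cur < head:
        # Database is older than this release expects: not a rollback case.
        return SchemaAction.MANUAL
    # Revisions that would have to be undone to reach the release head.
    to_undo = history[head + 1 : cur + 1]
    if any(rev in irreversible for rev in to_undo):
        return SchemaAction.MANUAL
    return SchemaAction.DOWNGRADE
```

Only the `DOWNGRADE` outcome corresponds to the automatic `alembic downgrade <previous-release-head>` path; every other mismatch stops for a human decision.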
### Post-mortem
Required after every rollback (per-airframe or fleet-wide):
- Timeline: when was the new release flashed; when did the failure surface; when was rollback initiated.
- Root cause: which AC was missed; which component is implicated; was it a regression introduced by this release or by a hardware/operational variable change.
- What went wrong in the release process: did Tier-2 CI catch it; if not, why not.
- Prevention: new test scenario added to NFT suite; new lint check; new rule in `_docs/LESSONS.md`.
- Distribution: post-mortem report stored under `_docs/06_metrics/incident_<YYYY-MM-DD>_<topic>.md` (per autodev failure-handling protocol).
## Deployment Checklist
Pre-flash:
- [ ] All Tier-2 AC-bound NFTs green at target version
- [ ] Security scan clean (zero critical / high CVEs; SBOM diff passes ADR-002 enforcement)
- [ ] Both Docker images built and pushed (deployment + research)
- [ ] JetPack image built, signed, checksummed
- [ ] Operator-tooling tarball built, signed
- [ ] Pre-flight bundle prepared by operator (cache + calibration + manifest)
- [ ] Pre-takeoff readiness gate behavior verified on a bench Jetson before flashing onto the production unit
- [ ] Rollback artifact (previous release JetPack image) staged on operator workstation
- [ ] FDR retention policy confirmed for the target airframe
Post-flash:
- [ ] First-flight commissioning flight cleared per § 7
- [ ] FDR retrieved and analyzed; AC-NEW-4 / AC-NEW-7 statistics within expected envelope
- [ ] Post-landing upload procedure tested end-to-end (companion → operator workstation → `satellite-provider`)
- [ ] Operator runbook updated with airframe-specific notes (e.g., "this airframe has UART2 wired to FC")
## Tier-2 enablement
Until the Tier-2 self-hosted Jetson runner is fully provisioned:
- AC-bound NFTs are gated as **manual trigger only** on PRs (`ci_cd_pipeline.md` § Manual-trigger override).
- The merge gate on `dev` excludes Tier-2 NFTs; the merge gate on `stage` and `main` retains the full gate.
- The pre-takeoff readiness gate (§ Pre-takeoff readiness gate) is unaffected — it runs on the Jetson at every takeoff regardless of CI gating posture.
When the Tier-2 runner is in steady state, this section is removed and the merge gates harmonize across `dev` / `stage` / `main`.
@@ -0,0 +1,178 @@
# GPS-Denied Onboard — Environment Strategy
> Date: 2026-05-09 (Plan Phase 2c — initial draft).
> Inputs: `_docs/02_document/architecture.md` § 3 (Deployment Model) + § 7 (Security Architecture); `_docs/02_document/data_model.md` § 5 (Seed Data); `_docs/00_problem/restrictions.md`; ADR-002, ADR-004, ADR-005.
## Environments
This project has **six environments**, not the canonical three (dev / staging / prod). The asymmetry reflects ADR-005 (Tier-1 / Tier-2) and ADR-004 (process-level isolation between airborne companion image and operator-side upload tool).
| Environment | Purpose | Infrastructure | Data Source |
|-------------|---------|---------------|-------------|
| `dev-tier1` | Local developer iteration; lint + unit + most integration tests | Workstation (Linux x86_64; NVIDIA GPU optional); Docker compose | Test fixtures (`adti26.json` calibration; `tests/fixtures/flight_derkachi/`) + `mock-suite-sat-service` |
| `dev-tier2` | Hardware-bound developer checks | Jetson Orin Nano Super dev kit on developer's desk; bare JetPack | Test fixtures + locally-mirrored `satellite-provider` |
| `staging-tier1` | CI runs that don't require Jetson hardware | GitHub-hosted runner (x86_64); Docker | Sealed test fixtures committed to the repo |
| `staging-tier2` | CI runs that require Jetson (AC-bound NFT-PERF-*, NFT-LIM-*, NFT-RES-*, NFT-SEC-*, IT-12) | Self-hosted Jetson runner; bare JetPack 6.2 | Same sealed fixtures + cached TRT engines per manifest hash |
| `production` | Deployed onboard companion image on a UAV | Jetson Orin Nano Super (pinned); bare JetPack 6.2; **no inbound network listening; no outbound network egress in flight** (NFT-SEC-05) | Operator-staged pre-flight cache + per-flight in-flight orthorectified tiles |
| `production-operator-workstation` | Pre-flight tile download (C11 `TileDownloader`); pre-flight cache artifact build (C10 driven by C12); post-landing tile upload (C11 `TileUploader`); FDR retrieval | Operator's Linux workstation; Docker for `satellite-provider` mirror | Operator-managed `satellite-provider` instance + the companion's NVM contents post-landing |
Notes:
- **No "staging" deployment of the companion**. Staging is purely a CI mode — there is no live staging Jetson UAV. Production is one-step from CI release artifacts → operator workstation → flashed Jetson.
- **The airborne companion never sees `staging-*` environments at runtime**. Staging is exclusively a CI gating concept.
- **The operator workstation is its own environment** with its own secrets posture (operator login + workstation hardening) — see § Secrets Management.
## Environment Variables
Variables are categorized by which environment(s) consume them. Production has the **shortest** required list because in-flight network egress is forbidden — most of the typical "service URL" variables disappear.
### Required variables — companion runtime (all environments)
| Variable | Purpose | dev-tier1 default | dev-tier2 default | production source |
|---|---|---|---|---|
| `DB_URL` | Local PostgreSQL connection | `postgresql://gps_denied:dev@db:5432/gps_denied` | `postgresql://gps_denied:dev@localhost:5432/gps_denied` | `postgresql://gps_denied@/gps_denied?host=/var/run/postgresql` (UNIX socket on Jetson, no password) |
| `CAMERA_CALIBRATION_PATH` | Camera calibration JSON path (Principle #1, data_model.md § 2.6) | `/fixtures/calibration/adti26.json` | `/fixtures/calibration/adti26.json` | `/etc/gps-denied/calibration/adti20.json` (per-deployed-unit, post D-PROJ-1 hybrid) |
| `GPS_DENIED_FC_PROFILE` | `ardupilot_plane` or `inav` | `ardupilot_plane` | per developer's bench setup | per UAV airframe (set via JetPack image's `/etc/gps-denied/runtime.yaml`) |
| `GPS_DENIED_VIO_STRATEGY` | `okvis2`, `vins_mono`, `klt_ransac` (ADR-001 startup-locked) | `okvis2` | `okvis2` | `okvis2` (production-default; pending IT-12 verdict) |
| `GPS_DENIED_VPR_STRATEGY` | `ultra_vpr`, `mega_loc`, `mix_vpr`, ... | `ultra_vpr` | `ultra_vpr` | `ultra_vpr` (Documentary Lead PRIMARY) |
| `GPS_DENIED_BUILD_KIND` | `deployment` or `research` (ADR-002; matches the binary's CMake flag set; the runtime validator fails fast if config asks for a strategy not linked into the binary) | `deployment` | `deployment` | `deployment` (research binary is dev-tier2 / staging-tier2 only) |
| `GPS_DENIED_FDR_RETENTION_DAYS` | FDR ring retention (data_model.md § 2.8) | `7` | `30` | `30` (operator-configurable per UAV) |
| `LOG_LEVEL` | `DEBUG` / `INFO` / `WARN` / `ERROR` | `DEBUG` | `INFO` | `INFO` (DEBUG is forbidden on the airborne image: there is no operator-readable console, and DEBUG output on the FDR ring would inflate it beyond the 64 GB AC-NEW-3 envelope) |
| `MAVLINK_SIGNING_KEY_PATH` | Per-flight MAVLink-2.0 signing key file (regenerated at takeoff load; see § Secrets Management) | `/fixtures/keys/dev_mavlink_signing.key` | `/fixtures/keys/dev_mavlink_signing.key` | `/var/lib/gps-denied/per-flight/mavlink_signing.key` (generated at takeoff, deleted on flight ring rollover) |
| `ONBOARD_TILE_SIGNING_KEY_PATH` | Per-flight onboard tile-signing private key | `/fixtures/keys/dev_onboard_signing.key` | `/fixtures/keys/dev_onboard_signing.key` | `/var/lib/gps-denied/per-flight/onboard_tile_signing.key` (generated at takeoff, deleted on flight ring rollover) |
### Required variables — Tier-1 / staging only (NOT on production)
| Variable | Purpose | dev-tier1 default | staging-tier1 default | production |
|---|---|---|---|---|
| `SATELLITE_PROVIDER_URL` | Where to reach the tile source for pre-flight runs (CI / dev) | `http://mock-sat:5100` | `http://mock-sat:5100` | **NOT SET** — production never reaches a satellite-provider directly while airborne |
| `MOCK_FAILURE_PROFILE` | Failure injection for `mock-suite-sat-service` | `none` | per CI scenario | n/a |
| `GPS_DENIED_REPLAY_FIXTURE` | Path to replay corpus | `/fixtures/flight_derkachi` | `/fixtures/flight_derkachi` | n/a |
### Required variables — operator workstation
| Variable | Purpose | Source |
|---|---|---|
| `SATELLITE_PROVIDER_URL` | Operator's local mirror or VPN-reached lab service | Operator config (operator workstation `.env` file) |
| `SATELLITE_PROVIDER_API_KEY` | TLS + service-internal API key for `satellite-provider` (architecture.md § 7) | Operator workstation secret manager (file or system keyring) — NEVER copied onto the companion image |
| `COMPANION_DB_URL` | Direct DB connection to the companion (post-landing) | Set transiently when the operator plugs the companion in for FDR retrieval / upload |
| `OPERATOR_TOOLING_BIND_ADDR` | Pre-flight UI bind address (C12) | `127.0.0.1:8080` (workstation-local; never exposed to network) |
### `.env.example`
Two example files are committed:
`.env.example.dev-tier1`:
```env
# dev-tier1 - workstation Docker compose
DB_URL=postgresql://gps_denied:dev@db:5432/gps_denied
SATELLITE_PROVIDER_URL=http://mock-sat:5100
CAMERA_CALIBRATION_PATH=/fixtures/calibration/adti26.json
GPS_DENIED_FC_PROFILE=ardupilot_plane
GPS_DENIED_VIO_STRATEGY=okvis2
GPS_DENIED_VPR_STRATEGY=ultra_vpr
GPS_DENIED_BUILD_KIND=deployment
GPS_DENIED_FDR_RETENTION_DAYS=7
GPS_DENIED_REPLAY_FIXTURE=/fixtures/flight_derkachi
LOG_LEVEL=DEBUG
MAVLINK_SIGNING_KEY_PATH=/fixtures/keys/dev_mavlink_signing.key
ONBOARD_TILE_SIGNING_KEY_PATH=/fixtures/keys/dev_onboard_signing.key
MOCK_FAILURE_PROFILE=none
```
`.env.example.operator-workstation`:
```env
# operator workstation
# Local mirror, or replace with the lab VPN URL.
SATELLITE_PROVIDER_URL=http://localhost:5100
# Populate from the workstation secret manager; NEVER commit.
SATELLITE_PROVIDER_API_KEY=
# Set when the companion is plugged in for FDR retrieval.
COMPANION_DB_URL=
OPERATOR_TOOLING_BIND_ADDR=127.0.0.1:8080
```
### Variable validation
The runtime composition root (`src/composition/runtime_root.py`, ADR-009) validates every required variable at startup and fails fast with a clear error message. Specifically:
- **Type validation** for enums (`GPS_DENIED_FC_PROFILE`, `GPS_DENIED_VIO_STRATEGY`, etc.) against the strategies linked into the binary (ADR-002 enforcement at config layer).
- **Path validation** for every `*_PATH` variable: file must exist + (where applicable) content-hash must match `manifests` table entry.
- **Forbidden-pair validation**: `GPS_DENIED_BUILD_KIND=deployment` AND `GPS_DENIED_VIO_STRATEGY=vins_mono` is rejected at startup ("vins_mono is not linked into the deployment binary"). The same check is repeated for any research-only strategy.
- **Production hardening**: when `LOG_LEVEL=DEBUG` is set on a binary built with `GPS_DENIED_BUILD_KIND=deployment` AND a manifest indicates a production deployment, the runtime emits a warning and downgrades to `INFO`. A flag `GPS_DENIED_ALLOW_DEBUG_IN_PROD=1` is required to override (only set when an engineer is debugging a returned-from-flight unit on the bench).
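The forbidden-pair and production-hardening rules can be sketched as below. This is a minimal sketch, not the contents of `runtime_root.py`: `validate_env`, `ConfigError`, and `RESEARCH_ONLY_VIO` are hypothetical names, and the `production` flag stands in for "a manifest indicates a production deployment".

```python
import logging

# Assumption: vins_mono is the only research-only VIO strategy named in this
# document; a real build would derive this set from the linked strategies.
RESEARCH_ONLY_VIO = {"vins_mono"}


class ConfigError(ValueError):
    """Raised at startup so the runtime fails fast with a clear message."""


def validate_env(env: dict[str, str], production: bool) -> dict[str, str]:
    """Return a validated copy of `env`, or raise ConfigError."""
    cfg = dict(env)
    build = cfg.get("GPS_DENIED_BUILD_KIND")
    if build not in {"deployment", "research"}:
        raise ConfigError(f"GPS_DENIED_BUILD_KIND invalid: {build!r}")
    vio = cfg.get("GPS_DENIED_VIO_STRATEGY")
    if build == "deployment" and vio in RESEARCH_ONLY_VIO:
        raise ConfigError(f"{vio} is not linked into the deployment binary")
    debug_ok = cfg.get("GPS_DENIED_ALLOW_DEBUG_IN_PROD") == "1"
    if (production and build == "deployment"
            and cfg.get("LOG_LEVEL") == "DEBUG" and not debug_ok):
        logging.warning("LOG_LEVEL=DEBUG downgraded to INFO on production")
        cfg["LOG_LEVEL"] = "INFO"
    return cfg
```

Returning a copy rather than mutating the input keeps the original environment available for the `runtime_config_resolved` audit record.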
## Secrets Management
The threat model (architecture.md § 7) treats the airborne companion as a **remote untrusted endpoint**: a downed UAV's companion can be physically captured. Persistent secrets must therefore be **per-flight ephemeral** wherever feasible.
| Environment | Mechanism | Tool |
|-------------|--------|------|
| `dev-tier1` | `.env` file (git-ignored) + dev keys (committed test fixtures, clearly marked) | dotenv |
| `dev-tier2` | `.env` file (git-ignored) + dev keys | dotenv |
| `staging-tier1` | GitHub Actions secrets | GitHub-managed |
| `staging-tier2` | GitHub Actions secrets injected onto the self-hosted Jetson runner | GitHub-managed |
| `production` (companion) | **Per-flight ephemeral keys** generated at takeoff load by the takeoff bring-up sequence (C8 signing handshake + per-flight tile signing key seed); written to `/var/lib/gps-denied/per-flight/`; logged to FDR; deleted on flight-ring rollover (≥ 30 days post-landing default) | Local filesystem; no external secret manager |
| `production-operator-workstation` | OS-level secret store (keyring / GNOME secrets / macOS keychain) for the long-lived `SATELLITE_PROVIDER_API_KEY` | OS keyring + workstation hardening |
### Per-flight key lifecycle (production companion)
1. **Pre-flight**: operator stages cache + calibration + manifests. NO secrets are baked into the JetPack image — the image is identical across all UAVs the operator deploys.
2. **Takeoff load (F2)**: the takeoff sequence generates two ephemeral keypairs:
- MAVLink-2.0 per-flight signing key (D-C8-9 = (d), driven by C8) — only used on the AP wired channel; iNav has no signing.
- Onboard tile-signing keypair (D-PROJ-2 design task #1 contract) — used to sign every mid-flight tile so the parent suite's planned voting layer can authenticate the source.
3. **In flight**: keys live at `/var/lib/gps-denied/per-flight/*.key` (mode 0600, owned by the runtime UID). The MAVLink signing key fingerprint is logged to FDR record `MavlinkSigningKeyRotated`; the onboard signing pubkey hash is recorded in the `flights` table.
4. **Post-landing**: the operator's C11 `TileUploader` uses the onboard tile-signing private key to assemble the upload payload; it's the only post-flight consumer.
5. **Rollover**: when the FDR ring drops a flight, the per-flight key files for that flight are deleted by the same atomic step.
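Steps 2, 3, and 5 of the lifecycle can be sketched for the symmetric MAVLink signing secret. This is an illustration only: `write_per_flight_key` and `delete_per_flight_keys` are hypothetical helpers, the SHA-256 fingerprint format is an assumption, and the asymmetric tile-signing keypair is left out because it needs a cryptography library.

```python
import hashlib
import os
import secrets
from pathlib import Path


def write_per_flight_key(directory: Path, name: str, n_bytes: int = 32) -> str:
    """Create `<directory>/<name>` with mode 0600; return a SHA-256 fingerprint."""
    directory.mkdir(parents=True, exist_ok=True)
    key = secrets.token_bytes(n_bytes)
    path = directory / name
    # O_EXCL refuses to overwrite a key left over from a previous flight.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(key)
    return hashlib.sha256(key).hexdigest()


def delete_per_flight_keys(directory: Path) -> int:
    """Remove every *.key file for a flight that dropped out of the FDR ring."""
    removed = 0
    for path in directory.glob("*.key"):
        path.unlink()
        removed += 1
    return removed
```

Only the fingerprint leaves the function, so the value logged to FDR never contains key material.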
### No long-lived secrets on the production companion image
| Type | Where it lives |
|---|---|
| `SATELLITE_PROVIDER_API_KEY` | Operator workstation only; never on the companion image (architecture.md § 7) |
| Per-flight MAVLink signing key | Generated on companion at takeoff; per-flight ephemeral |
| Per-flight onboard tile-signing key | Generated on companion at takeoff; per-flight ephemeral |
| Production deployment binary signing key | Release-time; lives only in the Tier-1 release secret manager |
| JetPack image signing key | Same as above |
This means the threat surface on a captured companion reduces to "what is in the FDR for the current flight" plus "the public keys of the upstream signing roots" — the latter is publishable without harm.
### Rotation policy
| Secret | Rotation cadence | Procedure |
|---|---|---|
| Per-flight MAVLink signing key | Every flight (per-flight ephemeral) | Automated at takeoff load |
| Per-flight onboard tile-signing key | Every flight (per-flight ephemeral) | Automated at takeoff load |
| `SATELLITE_PROVIDER_API_KEY` | Operator-managed; rotated when an operator workstation is reissued or compromise is suspected | Operator workstation hardening procedure (out of scope of this document; operator-tooling C12 owns it) |
| Production binary signing key | Per release cycle or on suspected compromise | Release engineer rotates; new key fingerprint is published in release notes; verification scripts on the operator workstation pull the latest fingerprint |
| JetPack image signing key | Same as production binary signing key | Same |
## Database Management
Each companion has its **own local PostgreSQL 16** instance: no shared upstream database, no cluster, no replication. data_model.md § 1 makes this explicit: the companion DB is per-companion, and cross-companion coordination happens via `satellite-provider` post-landing only.
| Environment | Type | Migrations | Data |
|-------------|------|-----------|------|
| `dev-tier1` | Docker `postgres:16-alpine`, named volume | Applied on container start by an init script; Alembic-managed (data_model.md § 4) | Seed data via `tests/fixtures/seed-db.sql` |
| `dev-tier2` | PostgreSQL 16 native on the Jetson (or via developer-installed deb packages) | Applied via `alembic upgrade head` invoked by the takeoff-load script | Same seed fixtures |
| `staging-tier1` | Docker `postgres:16-alpine` | Applied by the test runner before scenarios start | Sealed fixture rows |
| `staging-tier2` | PostgreSQL 16 on the Jetson runner | Applied by the test runner | Sealed fixture rows + per-scenario synthetic injections (NFT-SEC-01 cache-poisoning Monte Carlo, etc.) |
| `production` | PostgreSQL 16 on the Jetson, native install (part of the JetPack image) | Applied at JetPack image build time by the image builder; companion runtime asserts `alembic current == head` at takeoff load and refuses takeoff on mismatch | Live data only (data_model.md § 5 hard rule: production NEVER seeds) |
| `production-operator-workstation` | Workstation's local `satellite-provider` mirror has its own DB; operator tooling does NOT run a separate DB | Mirror DB is `satellite-provider`'s concern; operator tooling reads it but does not migrate it | Mirror data |
### Migration rules (data_model.md § 4 + § 6)
- All migrations must be **additive-only by default** (data_model.md § 6.1).
- All migrations must be **reversible by default** (data_model.md § 4.2). Non-reversible migrations require an ADR + user sign-off.
- The `tiles` schema specifically has its **canonical columns frozen** (data_model.md § 6.3) — coordinate any change with `satellite-provider`'s schema owner.
- Production migrations are applied at JetPack image build time, not at runtime. The companion never invokes `alembic upgrade` against a live database in flight; it only verifies `alembic current == head`.
- Migration scripts are reviewed in the same PR that adds the schema change; a PR-level checklist line in the PR template references this rule.
## Configuration Loading Order
Composition root (`src/composition/runtime_root.py`) loads configuration in this strict order — later sources override earlier ones:
1. `_docs/02_document/runtime_config_defaults.yaml` (project-wide defaults; committed)
2. `/etc/gps-denied/runtime.yaml` (per-airframe overrides; the template ships in the JetPack image and the operator fills it in after flash)
3. Environment variables (highest precedence on production; second-highest in dev where the next item exists)
4. `--config-override KEY=VALUE` CLI flags (developer convenience; rejected on production by the manifest validator)
The full resolved configuration is logged to FDR as a `ComponentLifecycleEvent` of type `runtime_config_resolved` at takeoff load — this is the audit record for "what config did this flight actually run with".
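The precedence rule can be sketched as a layered merge. This is a minimal sketch under assumptions: `resolve_config` is a hypothetical name, each source is assumed to be already parsed into a flat mapping, and the `production` flag stands in for the manifest validator's verdict.

```python
from typing import Mapping


def resolve_config(
    defaults: Mapping[str, str],
    airframe: Mapping[str, str],
    env: Mapping[str, str],
    cli_overrides: Mapping[str, str],
    production: bool,
) -> dict[str, str]:
    """Merge the four sources; later sources override earlier ones."""
    if production and cli_overrides:
        raise ValueError("--config-override is rejected on production")
    resolved: dict[str, str] = {}
    for layer in (defaults, airframe, env, cli_overrides):
        resolved.update(layer)
    return resolved
```

The returned dictionary is the value that would be serialized into the `runtime_config_resolved` event.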
@@ -0,0 +1,232 @@
# GPS-Denied Onboard — Observability
> Date: 2026-05-09 (Plan Phase 2c — initial draft).
> Inputs: `_docs/02_document/architecture.md` § 7 (Audit logging) + § 6 (NFRs); `_docs/02_document/data_model.md` § 2.8 (FDR); ADR-005 (Tier-1 / Tier-2); AC-NEW-3 (FDR ≤ 64 GB / no silent drops); AC-NEW-5 (operating envelope).
## Observability is asymmetric by design
Most CI/CD templates assume a network-connected service that pushes structured logs to an aggregator and exposes Prometheus metrics for live scraping. **This project's airborne profile does not.** Per architecture.md (ADR-004, § 7, Principle #4), the airborne companion has **no inbound network listening and no outbound network egress in flight** (enforced by NFT-SEC-05). The Jetson operates as an embedded edge device, not a service.
Observability therefore splits into three regimes:
| Regime | Where | Live or post-flight | Primary mechanism |
|---|---|---|---|
| **In-flight onboard** | Production Jetson, in flight | Live (to FDR ring) + best-effort live (to GCS) | FDR binary record stream + GCS STATUSTEXT / NAMED_VALUE_FLOAT |
| **Post-flight onboard** | Operator workstation after pulling the FDR | Post-flight | FDR replay + visualization in operator-tooling C12 |
| **CI / dev (Tier-1, Tier-2)** | Workstation Docker / Jetson CI runner | Live | Standard structured logging + Prometheus metrics endpoint where applicable |
The sections below are organized by regime.
## 1. In-flight onboard (production Jetson)
### 1.1 FDR (Flight Data Recorder) — primary observability sink
Schema is in `data_model.md` § 2.8. Every observable event in flight goes through FDR. The FDR is **append-only**, **lossy on overrun (logged, never silent)**, and **per-flight ring-bounded at ≤ 64 GB** (AC-NEW-3).
Observability events that emit FDR records:
| Component | Event | FDR record type |
|---|---|---|
| C8 outbound | Every emitted `EmittedExternalPosition` to FC | `0x0001 EmittedExternalPosition` |
| C8 inbound | Every received MAVLink frame (raw `tlog`-style) | `0x0003 ReceivedMavlinkRaw` |
| C8 inbound (iNav) | Every received MSP2 frame | `0x0004 ReceivedMsp2Raw` |
| C8 inbound | IMU window forwarded to C1 / C5 | `0x0002 ImuTrace` |
| C5 | Source-label transition (`satellite_anchored` ↔ `visual_propagated` ↔ `dead_reckoned`) | `0x0006 SourceLabelTransition` |
| C5 + C8 | Spoofing-promotion / -rejection event | `0x000C SpoofingPromotionEvent` |
| C5 | VISUAL_BLACKOUT entry / exit (AC-3.5, AC-NEW-8) | `0x000B VisualBlackoutEvent` |
| C6 | Mid-flight tile emit | `0x0007 MidFlightTileEmitted` |
| C6 | Mid-flight tile failure (with thumbnail filename, AC-8.5 forensic exception) | `0x0008 MidFlightTileFailed` |
| C7 (inference) | Thermal-throttle hybrid switch K=3 ↔ K=2 | `0x000E ThermalThrottleHybridSwitch` |
| C8 | MAVLink-2.0 signing key rotation event (D-C8-9) | `0x0009 MavlinkSigningKeyRotated` |
| C8 | EKF source-set switch event (D-C8-2 = (b)) | `0x000A EkfSourceSetCommand` |
| C10 | Pre-flight content-hash gate fail | `0x000D ContentHashGateFail` |
| All components | Lifecycle events (start / stop / fail) | `0x000F ComponentLifecycleEvent` |
| `jetson-stats` collector (driven by C7 or a dedicated thread) | Per-second sample of CPU%, GPU%, temp, throttle flag, RAM, VRAM, NVM remaining | `0x0005 SystemHealth` |
**Lossy-on-overrun rule (AC-NEW-3 enforcement)**: if the FDR writer cannot keep up (NVM I/O bound), the writer drops the **oldest segment** in the current flight's ring AND emits a `0x000F ComponentLifecycleEvent` of type `fdr_segment_dropped` to the new head segment. A segment drop is a hard observability signal — it appears in the post-flight report and in the GCS STATUSTEXT stream. There is no path that silently discards an event.
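The drop-oldest-and-log behavior can be sketched with an in-memory ring. This is an illustration only: `FdrRing` is a hypothetical class, the ring is bounded by segment count rather than by bytes or I/O pressure, and records are plain tuples instead of the binary format.

```python
from collections import deque


class FdrRing:
    """Bounded ring of segments; each segment holds (type, payload) records."""

    def __init__(self, max_segments: int) -> None:
        self.max_segments = max_segments
        self.segments: deque[tuple[int, list]] = deque()
        self.next_id = 0

    def open_segment(self) -> None:
        dropped = None
        if len(self.segments) == self.max_segments:
            dropped, _ = self.segments.popleft()  # oldest segment goes first
        self.segments.append((self.next_id, []))
        self.next_id += 1
        if dropped is not None:
            # The drop is recorded in the new head segment, never silently.
            self.append(0x000F, {"type": "fdr_segment_dropped", "segment": dropped})

    def append(self, record_type: int, payload: dict) -> None:
        self.segments[-1][1].append((record_type, payload))
```

Writing the drop event as the first record of the new head segment guarantees that the evidence of the drop cannot itself be lost with the dropped segment.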
**Format**: length-prefixed binary stream with `record_header` (magic `0x47464452 "GFDR"` + version + type + monotonic_ms) followed by a per-type body and a CRC32. New record types are additive (data_model.md § 6.5).
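One possible framing for such a record is sketched below. The field widths, the big-endian byte order, and the CRC coverage (header plus body) are all assumptions made for illustration; the authoritative layout is data_model.md § 2.8. `encode_record` and `decode_record` are hypothetical names.

```python
import struct
import zlib

MAGIC = 0x47464452  # the bytes "GFDR" when written big-endian
# Assumed header layout: magic u32, version u16, type u16, monotonic_ms u64.
HEADER = struct.Struct(">IHHQ")


def encode_record(rec_type: int, monotonic_ms: int, body: bytes, version: int = 1) -> bytes:
    """Frame one record as: length prefix, header, body, CRC32."""
    payload = HEADER.pack(MAGIC, version, rec_type, monotonic_ms) + body
    framed = payload + struct.pack(">I", zlib.crc32(payload))
    return struct.pack(">I", len(framed)) + framed


def decode_record(buf: bytes) -> tuple[int, int, bytes]:
    """Return (type, monotonic_ms, body); raise ValueError on corruption."""
    (length,) = struct.unpack_from(">I", buf, 0)
    framed = buf[4 : 4 + length]
    if len(framed) != length or length < HEADER.size + 4:
        raise ValueError("truncated record")
    payload = framed[:-4]
    (crc,) = struct.unpack(">I", framed[-4:])
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch")
    magic, _version, rec_type, monotonic_ms = HEADER.unpack_from(payload, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    return rec_type, monotonic_ms, payload[HEADER.size :]
```

Because the body is opaque to the framing layer, a reader that meets an unknown record type can skip it by length, which is what makes additive record types safe.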
**Storage path**: `/var/lib/gps-denied/fdr/{flight_id}/segments/seg_NNNNN.bin`. Thumbnails (AC-8.5) live at `/var/lib/gps-denied/fdr/{flight_id}/thumbnails/`. A flight's `manifest.json` (the FDR-side mirror, distinct from the PostgreSQL `manifests` row) sits at the flight's root and carries the flight metadata snapshot.
### 1.2 GCS telemetry (best-effort, bandwidth-limited)
The GCS link is the only outbound channel from the airborne companion (per architecture.md § 7). Bandwidth is bounded (AC-6.1: 12 Hz downsampled summary). The companion emits:
| MAVLink message | Rate | Content |
|---|---|---|
| `STATUSTEXT` | event-driven (only when something changes) | Source label transitions; spoofing-promotion / -rejection; VISUAL_BLACKOUT entry / exit; signing key rotation; FDR segment drop; component start / fail; thermal-throttle hybrid switch |
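| `NAMED_VALUE_FLOAT` | 1 Hz | `horiz_accuracy_m`, `vert_accuracy_m`, `vio_health` (frame-quality 0..1), `last_anchor_age_s`, `cpu_pct`, `gpu_pct`, `temp_c` (logical names; the MAVLink `name` field is `char[10]`, so longer names must be abbreviated on the wire) |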
| `GPS_RAW_INT` | 12 Hz (AC-6.1) | Mirror of the AP `GPS_INPUT` we just emitted, downsampled — gives the operator a live position view in QGC |
These are **best-effort** — packet loss on the GCS link is treated as normal. The FDR remains the source of truth.
**STATUSTEXT severity mapping**:
| FDR event | STATUSTEXT severity | Example text |
|---|---|---|
| Source label → `dead_reckoned` | `MAV_SEVERITY_WARNING` | `"GPS-DENIED: dead-reckoned (last anchor 12.3s ago)"` |
| VISUAL_BLACKOUT entry | `MAV_SEVERITY_NOTICE` | `"GPS-DENIED: VISUAL_BLACKOUT entered (reason=low_features)"` |
| Spoofing rejected | `MAV_SEVERITY_NOTICE` | `"GPS-DENIED: spoofed FC GPS rejected (last visual consistency PASS 0.4s ago)"` |
| Spoofing promoted (10 s + visual gate passed) | `MAV_SEVERITY_INFO` | `"GPS-DENIED: FC GPS promoted to fused source"` |
| FDR segment dropped | `MAV_SEVERITY_WARNING` | `"GPS-DENIED: FDR segment 47 dropped (NVM bound)"` |
| Signing key rotation | `MAV_SEVERITY_INFO` | `"GPS-DENIED: MAVLink signing key rotated"` |
| Component fail | `MAV_SEVERITY_CRITICAL` | `"GPS-DENIED: VIO strategy fault — failover to FC IMU-only (AC-5.2)"` |
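The mapping above can be sketched as a lookup plus text chunking. The severity numbers are the MAVLink `MAV_SEVERITY` enum values and 50 bytes is the size of the `STATUSTEXT` `text` field; the event keys and the `statustext_chunks` helper are hypothetical. Several example texts in the table are longer than 50 bytes, so this sketch assumes they are split across MAVLink 2 chunks (the `id` / `chunk_seq` extension fields) rather than truncated.

```python
# Numeric values from the MAVLink MAV_SEVERITY enum.
MAV_SEVERITY = {"CRITICAL": 2, "WARNING": 4, "NOTICE": 5, "INFO": 6}

# Hypothetical event keys; severities follow the table above.
EVENT_SEVERITY = {
    "dead_reckoned": "WARNING",
    "visual_blackout_entry": "NOTICE",
    "spoofing_rejected": "NOTICE",
    "spoofing_promoted": "INFO",
    "fdr_segment_dropped": "WARNING",
    "signing_key_rotated": "INFO",
    "component_fail": "CRITICAL",
}

STATUSTEXT_CHUNK = 50  # size of the STATUSTEXT `text` field in bytes


def statustext_chunks(event: str, text: str) -> list[tuple[int, int, bytes]]:
    """Return (severity, chunk_seq, text_bytes) tuples for one event."""
    severity = MAV_SEVERITY[EVENT_SEVERITY[event]]
    raw = text.encode("utf-8")
    # Note: a multi-byte UTF-8 character may be split across two chunks.
    return [
        (severity, seq, raw[i : i + STATUSTEXT_CHUNK])
        for seq, i in enumerate(range(0, len(raw), STATUSTEXT_CHUNK))
    ]
```

Keeping the texts ASCII-only and under 50 bytes would avoid chunking altogether, which may be preferable on a lossy GCS link.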
### 1.3 No console logging in flight
The production deployment binary refuses `LOG_LEVEL=DEBUG` by default (environment_strategy.md § Variable validation). The airborne companion has no operator-readable console: even ERROR-level logs go to journald + FDR rather than stdout. journald retention is 7 days on a rolling buffer (separate from the FDR's per-flight retention).
### 1.4 In-flight metrics are NOT scraped
There is no Prometheus endpoint on the production airborne companion. The justification matches § 1.3: nothing in flight could scrape it, so metrics are recorded into the FDR and surfaced only via `NAMED_VALUE_FLOAT`. CI / dev environments DO expose `/metrics` (see § 3 below).
## 2. Post-flight onboard (operator workstation)
When the operator plugs the companion in post-landing:
1. **FDR retrieval** (operator tooling C12; a feature outside this document's scope, but it affects observability): the operator tooling reads the FDR ring, copies it to the workstation, and seals the in-flight ring. The companion's per-flight ephemeral keys are deleted at this step (environment_strategy.md § Per-flight key lifecycle).
2. **Visualization** (operator tooling C12): the workstation renders:
- Time-series of `horiz_accuracy`, `vert_accuracy`, `last_anchor_age_ms`, source label timeline, thermal-throttle hybrid switches, and CPU / GPU / temp.
- Map view: emitted positions vs. (when available) FC `GLOBAL_POSITION_INT` ground truth.
- Spoofing / VISUAL_BLACKOUT event markers overlaid on the timeline.
- Per-flight summary: total mid-flight tiles emitted, FDR segment drops (if any), AC-NEW-4 / AC-NEW-7 statistics for this flight.
3. **NFT-RES-03 / NFT-SEC-01 corpus contribution**: if the operator opts in, the flight's emitted positions + FC ground truth are added to the AC-NEW-4 / AC-NEW-7 Monte-Carlo corpus for the next CI run.
4. **Forensic thumbnail review** (AC-8.5 exception): failed-tile thumbnails are visible in the operator UI for human review; this is the only image-data review surface.
## 3. CI / dev environments (Tier-1 / Tier-2)
Tier-1 dev / staging containers DO expose conventional observability surfaces, because they're being driven by humans and CI orchestrators that need them. The airborne profile of § 1 is the **production-only** profile.
### 3.1 Logging (Tier-1 / Tier-2)
Structured JSON to stdout/stderr (consumed by the developer's `docker compose logs` or by CI's log collector):
```json
{
"timestamp": "2026-05-09T08:42:11.234Z",
"level": "INFO",
"service": "gps-denied-companion",
"component": "C5",
"flight_id": "<uuid>",
"monotonic_ms": 12345,
"message": "Source label transition",
"context": {
"from": "satellite_anchored",
"to": "visual_propagated",
"reason": "vpr_no_match"
}
}
```
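A record of this shape can be produced with the standard `logging` module. The formatter below is a sketch, not the project's logger: `JsonLineFormatter` is a hypothetical name, `component` and `context` are assumed to arrive through the `extra=` argument, and `monotonic_ms` is sampled when the record is formatted rather than when it was created.

```python
import json
import logging
import time
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields shown above."""

    def __init__(self, service: str, flight_id: str) -> None:
        super().__init__()
        self._service = service
        self._flight_id = flight_id

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            # Python names the level WARNING; this document uses WARN.
            "level": "WARN" if record.levelname == "WARNING" else record.levelname,
            "service": self._service,
            # `component` and `context` are supplied by the caller via `extra=`.
            "component": getattr(record, "component", None),
            "flight_id": self._flight_id,
            "monotonic_ms": int(time.monotonic() * 1000),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)
```

A caller would attach it to a `logging.StreamHandler` and log with, for example, `log.info("Source label transition", extra={"component": "C5", "context": {"reason": "vpr_no_match"}})`.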
Log levels:
| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions; component fault that triggered AC-5.2 fallback | "VIO strategy initialization failed: GTSAM dlopen failed" |
| WARN | Degraded behavior; FDR segment drop; thermal-throttle hybrid switch | "Thermal throttle active; downgrading K=3 → K=2" |
| INFO | Significant lifecycle events; source label transition | "Source label: satellite_anchored → visual_propagated" |
| DEBUG | Per-frame diagnostic — Tier-1 / dev only; production refuses this level (environment_strategy.md § Variable validation) | "MatchResult: 47 inliers, residual=2.3px" |
**PII / safety-sensitive content**: By default, DEBUG / INFO logs contain no GPS coordinates. Only `horiz_accuracy` (a scalar) is INFO-loggable; the actual lat/lon is FDR-only. WARN / ERROR log records may include lat/lon when the operator's troubleshooting requires it; in that case the FDR still has the canonical record.
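One way to enforce the coordinate rule is to filter the log context by level before it is serialized. This is a minimal sketch: `redact_context` and the key names in `SENSITIVE_KEYS` are assumptions, not identifiers from the codebase.

```python
import logging

# Hypothetical context keys that would carry coordinates.
SENSITIVE_KEYS = {"lat", "lon", "latitude", "longitude"}


def redact_context(level: int, context: dict) -> dict:
    """Drop coordinate fields from `context` unless the record is WARN or above."""
    if level >= logging.WARNING:
        return dict(context)
    return {k: v for k, v in context.items() if k not in SENSITIVE_KEYS}
```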
Log retention:
| Environment | Destination | Retention |
|-------------|-------------|-----------|
| `dev-tier1` | Docker stdout | Container lifetime |
| `dev-tier2` | journald (Jetson) | 7 days |
| `staging-tier1` (CI) | GitHub Actions log artifact | 30 days (matches CI artifact retention) |
| `staging-tier2` (Jetson CI) | Self-hosted runner journald + uploaded report | 30 days |
| `production` | journald (Jetson) | 7 days, see § 1.3 |
### 3.2 Metrics (Tier-1 / Tier-2)
Prometheus-compatible `/metrics` endpoint on `dev-tier1`, `staging-tier1`, `staging-tier2`. **Disabled on `production`** (no listener on the airborne companion, NFT-SEC-05).
Application metrics:
| Metric | Type | Description |
|--------|------|-------------|
| `gps_denied_frame_processed_total` | Counter | Total nav frames processed (per `GPS_DENIED_VIO_STRATEGY` label) |
| `gps_denied_frame_emit_latency_seconds` | Histogram | End-to-end frame → emit latency (the AC-4.1 metric) |
| `gps_denied_source_label_total` | Counter | One counter per source label: `satellite_anchored`, `visual_propagated`, `dead_reckoned` |
| `gps_denied_vpr_match_rate` | Gauge | Rolling-1-minute rate of successful VPR matches |
| `gps_denied_thermal_hybrid_active` | Gauge | 0/1 — is the K=2 thermal-throttle hybrid active? (D-CROSS-LATENCY-1) |
| `gps_denied_fdr_segment_drops_total` | Counter | Total FDR segment drops this run (AC-NEW-3 audit) |
| `gps_denied_fdr_size_bytes` | Gauge | Current FDR ring size in bytes (must stay ≤ 64 GB) |
| `gps_denied_signing_key_rotations_total` | Counter | MAVLink signing key rotation count |
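The rolling gauge `gps_denied_vpr_match_rate` needs a sliding window behind it. The sketch below assumes "rate" means the fraction of VPR attempts that matched within the last 60 s; `RollingMatchRate` is a hypothetical helper, and exporting its value through the metrics endpoint is left out.

```python
from collections import deque
from typing import Deque, Tuple


class RollingMatchRate:
    """Fraction of successful VPR matches over a sliding time window."""

    def __init__(self, window_s: float = 60.0) -> None:
        self._window_s = window_s
        self._events: Deque[Tuple[float, bool]] = deque()
        self._successes = 0

    def record(self, now_s: float, matched: bool) -> None:
        """Register one VPR attempt observed at `now_s` (monotonic seconds)."""
        self._events.append((now_s, matched))
        self._successes += int(matched)
        self._evict(now_s)

    def value(self, now_s: float) -> float:
        """Current match fraction; 0.0 when the window holds no attempts."""
        self._evict(now_s)
        return self._successes / len(self._events) if self._events else 0.0

    def _evict(self, now_s: float) -> None:
        # Remove attempts older than the window and keep the success count in sync.
        while self._events and now_s - self._events[0][0] > self._window_s:
            _, matched = self._events.popleft()
            self._successes -= int(matched)
```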
System metrics: standard `process_*`, `python_*` exporters; on Tier-2 also `jetson_stats_*` exposed via `jtop` exporter.
Business metrics (i.e., AC-derived):
| Metric | AC | Use |
|--------|------|-------------|
| `gps_denied_horiz_accuracy_m` (gauge, last value) | AC-NEW-4 | Operator workstation post-flight dashboard (§ 5.1); CI threshold checks |
| `gps_denied_cold_start_seconds` | AC-NEW-1 | Set once at takeoff load completion; NFT-PERF-03 reads it |
| `gps_denied_spoofing_promotion_latency_seconds` | AC-NEW-2 | Set on each promotion / rejection event; NFT-PERF-04 reads it |
Collection interval: 15 s (a commonly used Prometheus scrape interval; Tier-2 NFT runs may use 1 s for AC-bound timing).
### 3.3 Distributed tracing — NOT applicable
The runtime is a single in-process Python program with no cross-service hops in flight (architecture.md § 5 internal communication is all in-process). Distributed tracing is therefore not applicable to the production runtime.
The Tier-1 integration setup DOES involve cross-container hops (companion ↔ mock-sat ↔ db ↔ e2e-runner), but those are exercised by the e2e test framework's own log + status capture; OpenTelemetry is not provisioned for this project. If a future cycle introduces a multi-process companion (which ADR-004 explicitly rejected for the airborne profile but might appear on the operator workstation for C11 Tile Manager + C12 Operator Pre-flight Tooling), tracing can be reconsidered then.
## 4. Alerting (post-flight, not in-flight)
There is no live in-flight alerting from the airborne companion. The operator's **GCS** is the live human-loop interface (the STATUSTEXT severity stream, § 1.2). All other alerting is **post-flight**:
| Source | Severity | Response Time | Conditions |
|----------|---------------|-----------|----------|
| FDR review (operator workstation) | Critical | Same-day human review | FDR segment drop count > 0; component fail event; spoofing-promotion latency > 3 s; AC-NEW-4 outliers (P(err > 1 km) > 0.01 % in this flight's window) |
| FDR review | High | Next-day | AC-NEW-1 cold-start TTFF > 30 s p95 in this flight's window; thermal-throttle hybrid active > 25 % of the flight |
| FDR review | Medium | Within 1 week | Mid-flight tile failure rate > 5 %; high VPR no-match rate; sustained `dead_reckoned` periods > 10 s |
| CI (Tier-2) | Critical | Block PR merge | Any AC-bound NFT failure (architecture.md § 6 NFR list) |
| CI (Tier-1) | Critical | Block PR merge | Build failure; security CVE; SBOM diff fail (ADR-002) |
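The FDR-review rows of the table can be evaluated mechanically once per flight. The sketch below is an assumption-laden illustration: `FlightSummary` and its field names are hypothetical, the cold-start check uses a single per-flight value instead of a p95, and the "high VPR no-match rate" condition is omitted because the table gives no threshold for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlightSummary:
    """Hypothetical per-flight figures extracted from the FDR."""
    fdr_segment_drops: int = 0
    component_fail_events: int = 0
    max_spoofing_promotion_latency_s: float = 0.0
    frac_err_over_1km: float = 0.0           # fraction of samples with error > 1 km
    cold_start_ttff_s: float = 0.0
    thermal_hybrid_active_frac: float = 0.0  # fraction of flight time
    tile_failure_rate: float = 0.0           # fraction of mid-flight tiles that failed
    longest_dead_reckoned_s: float = 0.0


def classify(s: FlightSummary) -> Optional[str]:
    """Return the highest severity triggered by the summary, or None."""
    if (s.fdr_segment_drops > 0 or s.component_fail_events > 0
            or s.max_spoofing_promotion_latency_s > 3.0
            or s.frac_err_over_1km > 0.0001):  # 0.01 %
        return "Critical"
    if s.cold_start_ttff_s > 30.0 or s.thermal_hybrid_active_frac > 0.25:
        return "High"
    if s.tile_failure_rate > 0.05 or s.longest_dead_reckoned_s > 10.0:
        return "Medium"
    return None
```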
Notification channels:
| Severity | Channel |
|----------|---------|
| Critical (FDR or CI) | Slack `#gps-denied-ops` + email |
| High | Slack `#gps-denied-ops` |
| Medium | Slack `#gps-denied-ops` (digest) |
There is no PagerDuty / on-call rotation for this project; in-flight failures are handled by the FC's IMU-only fallback (AC-5.2), not by an operations team.
## 5. Dashboards
### 5.1 Operator workstation post-flight dashboard
Built into operator-tooling C12. Per flight:
- Time series: source label, `horiz_accuracy`, `last_anchor_age_ms`, CPU%, GPU%, temp.
- Event markers: VISUAL_BLACKOUT entries, spoofing events, signing key rotations, thermal hybrid switches.
- Map: emitted track + FC ground truth (when available) + pre-flight cache footprint + mid-flight tile coverage.
- Statistics: per-flight error CDF; AC-NEW-4 contribution; mid-flight tile counts.
- FDR audit table: any `0x000F` lifecycle events of severity ≥ WARN.
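The per-flight error CDF in the statistics panel can be derived from pairs of emitted and ground-truth positions. This is a sketch under assumptions: it uses the haversine formula on a spherical Earth (adequate for dashboard purposes, not a geodetic reference), and `haversine_m` and `error_cdf` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def error_cdf(errors_m: Sequence[float]) -> List[Tuple[float, float]]:
    """Empirical CDF: sorted (error_m, fraction of samples <= error_m) pairs."""
    ordered = sorted(errors_m)
    n = len(ordered)
    return [(e, (i + 1) / n) for i, e in enumerate(ordered)]
```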
### 5.2 CI dashboard (Tier-2)
GitHub Actions job summary plus a per-NFT report uploaded as workflow artifact. The summary includes:
- Pass / fail per NFT scenario.
- For NFT-PERF-*: histogram of latencies + comparison to threshold.
- For NFT-LIM-*: peak memory / FDR size traces.
- For NFT-RES-*: AC-NEW-4 / AC-NEW-7 statistical summary with stated 95 % CI.
- For IT-12: comparative-study summary across all VIO / VPR strategies in the research binary.
There is no live CI dashboard separate from the GitHub Actions UI; the project is small enough that the per-PR job summary is sufficient.
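For the NFT-RES-* summaries, a 95 % confidence interval on an observed event rate can be computed without external dependencies. This document does not specify the CI method; the sketch below assumes a Wilson score interval for a binomial proportion, and `wilson_interval` is a hypothetical helper name.

```python
import math
from typing import Tuple


def wilson_interval(events: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95 %)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = events / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)
```

The Wilson interval behaves sensibly when zero events are observed, which is the expected case for rare-outlier criteria; the upper bound then shows how much the corpus size limits what can be claimed.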
### 5.3 No live in-flight dashboard
Out of scope by design. The GCS is the only live operator surface; all other inspection is post-flight.
## 6. Open Items / Plan-Phase Carryforward
- **Long-term FDR archive** (multi-flight statistical headroom): D-PROJ-3 (multi-flight fixture acquisition for AC-NEW-4 / AC-NEW-7) is not pursued this cycle. If pursued in a future cycle, post-flight FDR archives become a corpus contribution path; the operator-tooling FDR-retrieval step would need an explicit "contribute to corpus" toggle.
- **Telemetry-link encryption** beyond MAVLink-2.0 signing: out of scope; addressed by physical link assumptions in the threat model (architecture.md § 7).
- **iNav signing**: still has no equivalent to MAVLink-2.0 signing (Mode B Source #129). Carryforward Plan-phase action: file a feature request upstream; meanwhile observability for iNav-profile flights is the same as AP-profile minus the `MavlinkSigningKeyRotated` records (which are NULL on iNav flights per data_model.md § 2.2).