[AZ-329] [AZ-330] [AZ-523] [AZ-524] Batch 44 atomic refactor

Implements two new C12 services and rebalances the C11/C12 boundary
in one atomic commit:

* AZ-329 PostLandingUploadOrchestrator — gates C11 upload on the
  `flight_footer` FDR record's `clean_shutdown` field; 4 refusal
  modes; new FdrFooterReader Protocol + LocalFdrFooterReader.
* AZ-330 OperatorReLocService — AC-3.4 visual-loss re-localization
  hint; reuses shared LatLonAlt; OperatorCommandTransport Protocol
  cut (E-C8 owns the future pymavlink concrete); new FDR record
  kind `c12.reloc.requested`; log redaction (lat/lon 5 decimals,
  reason 200 chars).
* AZ-523 C11 internal flight-state gate removed (SRP refactor):
  `confirm_flight_state` / `FlightStateSignal` use /
  `FlightStateNotOnGroundError` deleted from C11; TileUploader
  contract bumped to v2.0.0 (frozen) with migration note; AZ-317
  superseded.
* AZ-524 Package rename `c12_operator_tooling` →
  `c12_operator_orchestrator` across source, tests, pyproject,
  CMake, Dockerfile, compose, CI, runtime-root services class
  (`OperatorOrchestratorServices`) + factory function
  (`build_operator_orchestrator`), logger namespaces, config slug,
  docs, and the E-C12 epic title.

Tests: 1543 passed, 80 skipped (all environment gates). Targeted
AC suite (AZ-329 + AZ-330 + FdrFooterReader): 37 passed. Cold-start
NFR-perf still ≤ 500 ms p99.

Tracker: AZ-317 → Done (superseded); AZ-319 v2.0.0 contract bump
comment; AZ-329/AZ-330 → In Testing; AZ-253 epic renamed; AZ-523
+ AZ-524 created and closed as audit-trail tickets.

See `_docs/03_implementation/batch_44_cycle1_report.md`.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-13 19:42:46 +03:00
parent 2d88d3d674
commit 5fe67023b2
112 changed files with 3409 additions and 1311 deletions
@@ -19,7 +19,7 @@ The pipeline has **two execution tiers** (architecture.md ADR-005), reflected in
| Build (Tier-2 deployment binary) | PR merge to `dev`, `stage`, `main` | Tier-2 (self-hosted Jetson) | Native build on Jetson green; deployment binary SBOM matches Tier-1 deployment SBOM |
| AC-bound NFTs (Tier-2) | PR merge to `dev`, `stage`, `main`; manual on PR | Tier-2 | NFT-PERF-* (AC-4.1, AC-NEW-1, AC-NEW-2), NFT-LIM-* (AC-4.2, AC-NEW-3), NFT-RES-* (AC-NEW-4, AC-NEW-7), IT-12 (comparative study) all pass thresholds in `tests/traceability-matrix.md` |
| JetPack image build | Tag on `main` | Tier-2 | JetPack 6.2 image built with deployment binary preinstalled, signed, and attested |
| Operator tooling tarball | Tag on `main` | Tier-1 | Tarball contains C11 Tile Manager (both `TileDownloader` and `TileUploader`) + C12 Operator Pre-flight Tooling + mock-sat-service compose + verification script |
| Operator tooling tarball | Tag on `main` | Tier-1 | Tarball contains C11 Tile Manager (both `TileDownloader` and `TileUploader`) + C12 Operator Pre-flight Orchestrator + mock-sat-service compose + verification script |
Tier-2 jobs are the **only** AC-bound jobs. Everything else runs on Tier-1.
@@ -146,7 +146,7 @@ Runs on tag push to `main`. Produces `gps-denied-jetpack-<semver>-<sha>.img` (th
### Operator tooling tarball (release-only)
Bundles `operator-tooling` Docker image + `mock-suite-sat-service` Docker image + their compose file + a verification script + the documentation under `_docs/02_document/`. The tarball is uploaded to the release bucket alongside the JetPack image.
Bundles `operator-orchestrator` Docker image + `mock-suite-sat-service` Docker image + their compose file + a verification script + the documentation under `_docs/02_document/`. The tarball is uploaded to the release bucket alongside the JetPack image.
## Caching Strategy
@@ -9,7 +9,7 @@ This project has **asymmetric containerization** by design (architecture.md § 3
- **Tier-1** (workstation): Docker is the universal runtime. Dev, lint, unit, most integration, and `mock-suite-sat-service` all run in Docker compose.
- **Tier-2 (Jetson)**: **NO Docker**. The deployed JetPack image runs the deployment binary natively. TensorRT INT8 calibration caches and `jetson-stats` thermal telemetry are most reliable without a container layer (D-C7-9 + D-C10-6). The "image" is a JetPack 6.2 system image with the deployment binary preinstalled.
- **Operator workstation**: Docker is used for the local `satellite-provider` mirror, the `mock-suite-sat-service` (when offline), and the operator-tooling stack (C11 Tile Manager + C12 Operator Pre-flight Tooling).
- **Operator workstation**: Docker is used for the local `satellite-provider` mirror, the `mock-suite-sat-service` (when offline), and the operator-orchestrator stack (C11 Tile Manager + C12 Operator Pre-flight Orchestrator).
Three Dockerfiles are maintained; the airborne companion uses **none of them** in production.
@@ -43,9 +43,9 @@ e2e-test fixture only — implements the planned D-PROJ-2 ingest contract (`POST
| Health check | HTTP `GET /healthz` (returns 200 if listening + storage backend mounted). 10 s interval. |
| Exposed ports | `5100/tcp` (matches `satellite-provider`'s port so the same client config works) |
| Key build args | `MOCK_FAILURE_PROFILE` (default `none`; used by NFT-SEC-01 to inject latency / 5xx / partial responses) |
| Notes | The mock is a release artifact (operator-tooling tarball includes its compose file). When the real `satellite-provider` D-PROJ-2 endpoint ships, the mock is retired. |
| Notes | The mock is a release artifact (operator-orchestrator tarball includes its compose file). When the real `satellite-provider` D-PROJ-2 endpoint ships, the mock is retired. |
### `operator-tooling` (Operator workstation Tile Manager + pre-flight UI, C11 + C12)
### `operator-orchestrator` (Operator workstation Tile Manager + pre-flight UI, C11 + C12)
| Property | Value |
|----------|-------|
@@ -53,7 +53,7 @@ e2e-test fixture only — implements the planned D-PROJ-2 ingest contract (`POST
| Build image | `python:3.10-slim` (no native deps; pure Python plus `httpx` for both download and upload, `psycopg` for read/write of C6 mirror, `cryptography` for upload signing) |
| Stages | `python-deps` → `runtime` |
| User | `operator` (non-root) |
| Health check | `python -m operator_tooling.healthcheck` (validates `satellite-provider` reachable). 30 s interval. |
| Health check | `python -m operator_orchestrator.healthcheck` (validates `satellite-provider` reachable). 30 s interval. |
| Exposed ports | `8080/tcp` (operator pre-flight UI, C12); no inbound network for C11 Tile Manager (it's a CLI / one-shot tool, both directions) |
| Key build args | `INCLUDE_PRE_FLIGHT_UI=true` (default; can be turned off for headless CLI-only deployments) |
| Notes | **C11 Tile Manager (both `TileDownloader` and `TileUploader`) is in this image, NEVER in `gps-denied-companion-tier1`** (ADR-004 process-level isolation). The airborne deployment binary on Tier-2 also does not contain C11. |
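
The health-check row above says the module validates that `satellite-provider` is reachable. A minimal sketch of such a probe follows, using only the standard library. The real `operator_orchestrator.healthcheck` is not shown in this diff, so the function names, the choice to treat any HTTP response as "reachable", and reading the target from `SATELLITE_PROVIDER_URL` are assumptions.

```python
import os
import sys
import urllib.error
import urllib.request


def is_reachable(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the server at `url` answers at all within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status (assumed to count).
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, or a malformed URL.
        return False


def main(url: str) -> int:
    """Map reachability to a process exit code, as a container health check expects."""
    return 0 if is_reachable(url) else 1


if __name__ == "__main__":
    sys.exit(main(os.environ.get("SATELLITE_PROVIDER_URL", "")))
```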
@@ -120,11 +120,11 @@ services:
interval: 5s
networks: [ gps-denied-net ]
operator-tooling:
operator-orchestrator:
build:
context: .
dockerfile: docker/operator-tooling.Dockerfile
image: gps-denied/operator-tooling:dev
dockerfile: docker/operator-orchestrator.Dockerfile
image: gps-denied/operator-orchestrator:dev
environment:
- SATELLITE_PROVIDER_URL=http://mock-sat:5100
- COMPANION_DB_URL=postgresql://gps_denied:dev@db:5432/gps_denied
@@ -207,7 +207,7 @@ Tier-2 CI runs the same deployment binary directly on the self-hosted Jetson run
| CI build (deployment binary) | `<registry>/gps-denied/companion-tier1:deployment-<git-sha>` | `ghcr.io/azaion/gps-denied/companion-tier1:deployment-a1b2c3d` |
| CI build (research binary) | `<registry>/gps-denied/companion-tier1:research-<git-sha>` | `ghcr.io/azaion/gps-denied/companion-tier1:research-a1b2c3d` |
| Mock sat service | `<registry>/gps-denied/mock-suite-sat-service:<git-sha>` | `ghcr.io/azaion/gps-denied/mock-suite-sat-service:a1b2c3d` |
| Operator tooling | `<registry>/gps-denied/operator-tooling:<git-sha>` | `ghcr.io/azaion/gps-denied/operator-tooling:a1b2c3d` |
| Operator tooling | `<registry>/gps-denied/operator-orchestrator:<git-sha>` | `ghcr.io/azaion/gps-denied/operator-orchestrator:a1b2c3d` |
| Release | `<registry>/gps-denied/<image>:<semver>` | `ghcr.io/azaion/gps-denied/companion-tier1:deployment-1.2.0` |
| Local dev | `gps-denied/<image>:dev` | `gps-denied/companion-tier1:dev` |
| JetPack image (Tier-2) | `gps-denied-jetpack-<semver>-<sha>.img` | `gps-denied-jetpack-1.2.0-a1b2c3d.img` (file artifact, not a container tag) |
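
The container rows of the tagging table can be expressed as one small helper. This is a sketch for illustration: `image_ref` and its parameters are hypothetical names, not part of the build tooling, and the JetPack file artifact is deliberately left out because it is not a container tag.

```python
from typing import Optional


def image_ref(
    image: str,
    version: str,
    registry: Optional[str] = None,
    variant: Optional[str] = None,
) -> str:
    """Build `<registry>/gps-denied/<image>:<variant>-<version>`.

    `version` is a git SHA, a semver, or `dev`; `variant` is `deployment` or
    `research` for the companion image and omitted for the other images.
    """
    tag = f"{variant}-{version}" if variant else version
    prefix = f"{registry}/" if registry else ""
    return f"{prefix}gps-denied/{image}:{tag}"


# Examples taken from the table rows above.
print(image_ref("operator-orchestrator", "a1b2c3d", registry="ghcr.io/azaion"))
print(image_ref("companion-tier1", "1.2.0", registry="ghcr.io/azaion", variant="deployment"))
print(image_ref("companion-tier1", "dev"))
```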
@@ -5,12 +5,12 @@
## Deployment scope and model
This project does **not** ship a service; it ships an **embedded edge image** plus an **operator-tooling bundle**. The "deployment" patterns from the standard template (blue-green / rolling / canary) are not applicable. Deployment for this project means:
This project does **not** ship a service; it ships an **embedded edge image** plus an **operator-orchestrator bundle**. The "deployment" patterns from the standard template (blue-green / rolling / canary) are not applicable. Deployment for this project means:
| Artifact | Target | Deployment mechanism |
|---|---|---|
| **JetPack image** (`gps-denied-jetpack-<semver>-<sha>.img`) | Production Jetson Orin Nano Super on a UAV | Operator flashes the image onto the Jetson via NVIDIA `sdkmanager` or `Etcher`-style `dd` from the operator workstation |
| **Operator tooling tarball** | Operator workstation | Operator extracts; `docker compose up -d` brings up `mock-suite-sat-service` (when offline) + `operator-tooling` |
| **Operator tooling tarball** | Operator workstation | Operator extracts; `docker compose up -d` brings up `mock-suite-sat-service` (when offline) + `operator-orchestrator` |
| **Tier-1 dev compose** | Developer workstation | Developer runs `docker compose up` from repo root |
**Zero-downtime is not a goal**: a UAV is not in service while it is being re-flashed. The deployment cadence is per-airframe maintenance, not per-request availability.
@@ -25,9 +25,9 @@ Performed once per release on Tier-1 + Tier-2 CI; produces signed artifacts stor
2. **Tier-1 produces**:
- `companion-tier1:deployment-<sha>` and `companion-tier1:research-<sha>` Docker images (pushed to registry).
- `mock-suite-sat-service:<sha>` Docker image.
- `operator-tooling:<sha>` Docker image.
- `operator-orchestrator:<sha>` Docker image.
- SBOM artifacts for both binaries (deployment and research).
- `operator-tooling-<semver>-<sha>.tar.gz` containing the operator-tooling image + mock-sat image + their compose file + verification script + relevant docs.
- `operator-orchestrator-<semver>-<sha>.tar.gz` containing the operator-orchestrator image + mock-sat image + their compose file + verification script + relevant docs.
3. **Tier-2 produces**:
- Native deployment-binary build on the self-hosted Jetson runner.
- SBOM verification: byte-equal (after canonicalization) to Tier-1's deployment-binary SBOM. Mismatch fails the release.
@@ -35,7 +35,7 @@ Performed once per release on Tier-1 + Tier-2 CI; produces signed artifacts stor
4. **Signing** (Tier-1):
- Both Docker image manifests are signed with the project's release key.
- The JetPack image is signed; checksum is published as a separate signed file (`gps-denied-jetpack-<semver>-<sha>.img.sha256.sig`).
- The operator-tooling tarball is signed.
- The operator-orchestrator tarball is signed.
5. **Release bucket**: artifacts uploaded; release notes published; the previous release's artifacts retained for at least 90 days for rollback support.
A release fails if any step above fails — including any AC-bound NFT failure on Tier-2 (`ci_cd_pipeline.md` § AC-bound NFTs).
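
The SBOM check in step 3 compares the two documents byte-for-byte after canonicalization. The sketch below shows one way that comparison could look, assuming the SBOMs are JSON and that canonicalization means sorted keys and fixed separators. The actual canonical form used by the pipeline is not specified here, and both function names are hypothetical.

```python
import json


def canonical_sbom_bytes(sbom_text: str) -> bytes:
    """Serialize a JSON SBOM deterministically: sorted keys, no extra whitespace."""
    document = json.loads(sbom_text)
    canonical = json.dumps(
        document, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return canonical.encode("utf-8")


def sboms_match(tier1_text: str, tier2_text: str) -> bool:
    """Return True when both SBOMs are byte-equal in canonical form."""
    return canonical_sbom_bytes(tier1_text) == canonical_sbom_bytes(tier2_text)
```

Sorting keys does not reorder arrays, so a component list emitted in a different order would still register as a mismatch under this assumption.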
@@ -85,19 +85,19 @@ cosign verify-blob \
sha256sum -c gps-denied-jetpack-<semver>-<sha>.img.sha256
# Verify the operator-tooling tarball.
# Verify the operator-orchestrator tarball.
cosign verify-blob \
--signature operator-tooling-<semver>-<sha>.tar.gz.sig \
--signature operator-orchestrator-<semver>-<sha>.tar.gz.sig \
--key gps-denied-release-key.pub \
operator-tooling-<semver>-<sha>.tar.gz
operator-orchestrator-<semver>-<sha>.tar.gz
```
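
For workstations where `sha256sum` is unavailable, the checksum step above can be approximated in Python. This is a sketch under the assumption that the `.sha256` file uses the usual `<hex digest>  <file name>` layout; the function names are hypothetical, and it does not replace the `cosign` signature verification.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a multi-gigabyte image is never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(artifact: Path, checksum_file: Path) -> bool:
    """Compare the artifact's digest with the first field of the checksum file."""
    # Only the digest field is used; the recorded file name is ignored here.
    expected = checksum_file.read_text().split()[0].lower()
    return sha256_of(artifact) == expected
```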
### 3. Pre-flight cache build (operator-tooling C12)
### 3. Pre-flight cache build (operator-orchestrator C12)
Performed on the operator workstation, with `satellite-provider` reachable (locally mirrored or via lab VPN).
```sh
docker compose -f operator-tooling-compose.yml up -d
docker compose -f operator-orchestrator-compose.yml up -d
# Operator opens http://127.0.0.1:8080
```
@@ -164,7 +164,7 @@ The first flight on a freshly-deployed airframe is a **commissioning flight**, n
Post first commissioning flight:
- [ ] FDR retrieved and visualized on operator workstation (operator-tooling C12 dashboard, observability.md § 5.1).
- [ ] FDR retrieved and visualized on operator workstation (operator-orchestrator C12 dashboard, observability.md § 5.1).
- [ ] AC-NEW-4 statistics for the commissioning flight reviewed; outliers investigated.
- [ ] No FDR segment drops; no `ContentHashGateFail` events.
- [ ] Mid-flight tile generation working (post-landing upload — handle that separately).
@@ -172,12 +172,12 @@ Post first commissioning flight:
## Post-landing tile upload (per-flight, ADR-004)
Per AC-8.4 + ADR-004, mid-flight tile upload to `satellite-provider` is **post-landing only**, and uses the operator-tooling's C11 Tile Manager (`TileUploader` interface; a separate binary, never linked into the airborne image).
Per AC-8.4 + ADR-004, mid-flight tile upload to `satellite-provider` is **post-landing only**, and uses the operator-orchestrator's C11 Tile Manager (`TileUploader` interface; a separate binary, never linked into the airborne image).
```sh
# Operator plugs the companion's NVM into the workstation OR ssh's into the powered-off-then-re-booted Jetson.
docker compose run operator-tooling \
python -m operator_tooling.tilemanager upload \
docker compose run operator-orchestrator \
python -m operator_orchestrator.tilemanager upload \
--flight-id <uuid> \
--satellite-provider $SATELLITE_PROVIDER_URL \
--signing-pubkey-fingerprint <fingerprint>
@@ -210,7 +210,7 @@ When the parent-suite voting layer (D-PROJ-2 design task #2) ships, this flow do
### Rollback steps (per-airframe)
1. **Re-flash** the previous release's JetPack image onto the affected Jetson (same procedure as § 4 with the previous artifact).
2. **Re-stage** the previous release's pre-flight bundle (the operator workstation retains it in the operator-tooling cache for ≥ 30 days).
2. **Re-stage** the previous release's pre-flight bundle (the operator workstation retains it in the operator-orchestrator cache for ≥ 30 days).
3. **Re-run** the pre-takeoff readiness gate.
4. **Confirm** AC-5.2 fallback is still functional (it is FC firmware behavior; rolling back the companion image cannot break it, but verify on the GCS).
5. **Document** the rollback in the post-mortem template; include FDR snapshots from the offending flight (if any) plus the rollback artifacts versions.
@@ -141,7 +141,7 @@ This means the threat surface on a captured companion reduces to "what is in the
|---|---|---|
| Per-flight MAVLink signing key | Every flight (per-flight ephemeral) | Automated at takeoff load |
| Per-flight onboard tile-signing key | Every flight (per-flight ephemeral) | Automated at takeoff load |
| `SATELLITE_PROVIDER_API_KEY` | Operator-managed; rotated when an operator workstation is reissued or compromise is suspected | Operator workstation hardening procedure (out of scope of this document; operator-tooling C12 owns it) |
| `SATELLITE_PROVIDER_API_KEY` | Operator-managed; rotated when an operator workstation is reissued or compromise is suspected | Operator workstation hardening procedure (out of scope of this document; operator-orchestrator C12 owns it) |
| Production binary signing key | Per release cycle or on suspected compromise | Release engineer rotates; new key fingerprint is published in release notes; verification scripts on the operator workstation pull the latest fingerprint |
| JetPack image signing key | Same as production binary signing key | Same |
@@ -12,7 +12,7 @@ Observability therefore splits into three regimes:
| Regime | Where | Live or post-flight | Primary mechanism |
|---|---|---|---|
| **In-flight onboard** | Production Jetson, in flight | Live (to FDR ring) + best-effort live (to GCS) | FDR binary record stream + GCS STATUSTEXT / NAMED_VALUE_FLOAT |
| **Post-flight onboard** | Operator workstation after pulling the FDR | Post-flight | FDR replay + visualization in operator-tooling C12 |
| **Post-flight onboard** | Operator workstation after pulling the FDR | Post-flight | FDR replay + visualization in operator-orchestrator C12 |
| **CI / dev (Tier-1, Tier-2)** | Workstation Docker / Jetson CI runner | Live | Standard structured logging + Prometheus metrics endpoint where applicable |
The sections below are organized by regime.
@@ -85,7 +85,7 @@ There is no Prometheus endpoint on the production airborne companion. The justif
When the operator plugs the companion in post-landing:
1. **FDR retrieval** (operator tooling C12 — feature, not in scope of this document's structure but observability-impacting): operator-tooling reads the FDR ring, copies it to the workstation, and seals the in-flight ring. The companion's per-flight ephemeral keys are deleted at this step (environment_strategy.md § Per-flight key lifecycle).
1. **FDR retrieval** (operator tooling C12 — feature, not in scope of this document's structure but observability-impacting): operator-orchestrator reads the FDR ring, copies it to the workstation, and seals the in-flight ring. The companion's per-flight ephemeral keys are deleted at this step (environment_strategy.md § Per-flight key lifecycle).
2. **Visualization** (operator tooling C12): the workstation renders:
- Time-series of `horiz_accuracy`, `vert_accuracy`, `last_anchor_age_ms`, source label timeline, thermal-throttle hybrid switches, and CPU / GPU / temp.
- Map view: emitted positions vs. (when available) FC `GLOBAL_POSITION_INT` ground truth.
@@ -173,7 +173,7 @@ Collection interval: 15 s (typical Prometheus default; Tier-2 NFT runs may use 1
The runtime is a single in-process Python program with no cross-service hops in flight (architecture.md § 5 internal communication is all in-process). Distributed tracing is therefore not applicable to the production runtime.
The Tier-1 integration setup DOES involve cross-container hops (companion ↔ mock-sat ↔ db ↔ e2e-runner), but those are exercised by the e2e test framework's own log + status capture; OpenTelemetry is not provisioned for this project. If a future cycle introduces a multi-process companion (which ADR-004 explicitly rejected for the airborne profile but might appear on the operator workstation for C11 Tile Manager + C12 Operator Pre-flight Tooling), tracing can be reconsidered then.
The Tier-1 integration setup DOES involve cross-container hops (companion ↔ mock-sat ↔ db ↔ e2e-runner), but those are exercised by the e2e test framework's own log + status capture; OpenTelemetry is not provisioned for this project. If a future cycle introduces a multi-process companion (which ADR-004 explicitly rejected for the airborne profile but might appear on the operator workstation for C11 Tile Manager + C12 Operator Pre-flight Orchestrator), tracing can be reconsidered then.
## 4. Alerting (post-flight, not in-flight)
@@ -201,7 +201,7 @@ There is no PagerDuty / on-call rotation for this project; in-flight failures ar
### 5.1 Operator workstation post-flight dashboard
Built into operator-tooling C12. Per flight:
Built into operator-orchestrator C12. Per flight:
- Time series: source label, `horiz_accuracy`, `last_anchor_age_ms`, CPU%, GPU%, temp.
- Event markers: VISUAL_BLACKOUT entries, spoofing events, signing key rotations, thermal hybrid switches.
@@ -227,6 +227,6 @@ Out of scope by design. The GCS is the only live operator surface; all other ins
## 6. Open Items / Plan-Phase Carryforward
- **Long-term FDR archive** (multi-flight statistical headroom): D-PROJ-3 (multi-flight fixture acquisition for AC-NEW-4 / AC-NEW-7) is not pursued this cycle. If pursued in a future cycle, post-flight FDR archives become a corpus contribution path; the operator-tooling FDR-retrieval step would need an explicit "contribute to corpus" toggle.
- **Long-term FDR archive** (multi-flight statistical headroom): D-PROJ-3 (multi-flight fixture acquisition for AC-NEW-4 / AC-NEW-7) is not pursued this cycle. If pursued in a future cycle, post-flight FDR archives become a corpus contribution path; the operator-orchestrator FDR-retrieval step would need an explicit "contribute to corpus" toggle.
- **Telemetry-link encryption** beyond MAVLink-2.0 signing: out of scope; addressed by physical link assumptions in the threat model (architecture.md § 7).
- **iNav signing**: still has no equivalent to MAVLink-2.0 signing (Mode B Source #129). Carryforward Plan-phase action: file a feature request upstream; meanwhile observability for iNav-profile flights is the same as AP-profile minus the `MavlinkSigningKeyRotated` records (which are NULL on iNav flights per data_model.md § 2.2).