[AZ-328] C12 BuildCacheOrchestrator + remote C10 invoker (Batch 43)

Implements F1 pre-flight cache build orchestrator on the operator
workstation. Composes C11 TileDownloader (AZ-316), C12 CompanionBringup
(AZ-327), C12 FlightsApiClient (AZ-489), and the new
RemoteCacheProvisionerInvoker into one sequenced flow guarded by a
filelock-backed workstation-side lockfile.

Architectural decisions:
- Phase-0 flight-resolve runs BEFORE the lockfile (ADR-010): a flight
  that cannot be resolved is an operator-input error, not a contended-
  resource error. Enforced by AC-11 + AC-14.
- Consumer-side cuts (AZ-507) for C11 + C10 types: local Protocols /
  mirror DTOs in tile_downloader_cut.py and _types.py; external errors
  matched by name-based whitelisting so unknown exceptions still
  propagate per AC-6. Cross-component type translation lives at the
  composition root (c12_factory).
- Failure surfacing: recognised operational failures (download error,
  companion not ready, build error, flight-resolve error) return as
  CacheBuildReport(outcome=failure, failure_phase=...). Only lockfile
  contention raises (BuildLockHeldError) since no phase ever ran.
- Workstation-side filelock library (project pin); no custom primitive.
- Remote C10 stdout streamed line-by-line as DEBUG with api_key /
  auth_token redacted before logging (defence-in-depth).
- CLI is now a thin adapter; all workflow logic lives in
  build_cache.py. operator-tool build-cache exit codes map per
  CacheBuildReport.failure_phase + failure_exception_type.

Tests: 116 c12 unit tests pass (29 new for AZ-328 covering 15/15 ACs +
NFR-perf-overhead microbench; 7 new for remote_c10_invoker; 3 new for
file_lock; test_cli_build_cache rewritten for new orchestrator
interface). Full repo suite: 1522 passed, 80 skipped.

Also: replays Batch 42's ruff format leftover for c12 flights_api +
test_az489 files (formatter ran over the c12 directory after new
files were added). Pure whitespace; no behaviour change.

Full report: _docs/03_implementation/batch_43_cycle1_report.md

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-13 11:03:46 +03:00
parent 099c75c6f8
commit 7644b25e8c
23 changed files with 3585 additions and 256 deletions
@@ -1,259 +0,0 @@
# C12 Build-Cache Orchestrator — F1 Sequencing + Actionable `CacheBuildReport`
**Task**: AZ-328_c12_build_cache_orchestrator
**Name**: C12 Build-Cache Orchestrator
**Description**: Implement `BuildCacheOrchestrator`, the public top-level F1 (pre-flight cache build) workflow. `build_cache(request: BuildCacheRequest) -> CacheBuildReport` does the following sequenced work, with strict ordering: **(0) Flight-resolve phase (ADR-010, AZ-489)** — the orchestrator either calls `flights_api_client.fetch_flight(flight_id, base_url, auth_token)` (online) or `flights_api_client.load_flight_file(path)` (offline) per the resolved CLI flag, then `bbox = flights_api_client.bbox_from_waypoints(flight.waypoints, buffer_m=config.flight_bbox_buffer_m)` and `takeoff_origin = flights_api_client.takeoff_origin_from_flight(flight)`. The resolved `(bbox, takeoff_origin, flight_id, raw_flight_dto)` is captured into `FlightResolveReport` for FDR/debug and forwarded into the downstream phases; any `FlightsApiUnreachableError` / `FlightsApiAuthError` / `FlightNotFoundError` / `FlightsApiSchemaError` / `FlightFileNotFoundError` / `EmptyWaypointsError` / `WaypointSchemaError` is wrapped as `CacheBuildError(failure_phase=flight_resolve, ...)` and aborts BEFORE the lockfile is even acquired (no point holding the lock while diagnosing operator inputs). 
(1) acquire a filesystem lockfile at `<cache_staging_root>/.c12.lock` per description.md § 7 (prevents concurrent F1 runs from stomping each other); (2) call `tile_downloader.fetch(...)` (AZ-316) on the operator workstation with `bbox` (computed in phase 0), `sector_class`, `freshness_threshold_months`, `satellite_provider_url`, `api_key`; (3) on download `failure` outcome → wrap as `CacheBuildError(failure_phase=download, ...)` and return `CacheBuildReport(outcome=failure, failure_phase=download, flight_resolve_report=..., download_report=..., build_report=None)` WITHOUT invoking C10; (4) on download `success` → call `companion_bringup.verify_companion_ready(...)` (AZ-327) — if `not_ready` → wrap and return `CacheBuildReport(outcome=failure, failure_phase=download, ...)`; (5) SSH-invoke `C10.CacheProvisioner.build_cache_artifacts` (AZ-325) on the companion via the `RemoteCacheProvisionerInvoker` helper, **passing `takeoff_origin` + `flight_id` along with bbox/sector_class** so AZ-325 / AZ-323 bake them into the Manifest. Stream the C10 stdout/stderr lines back as DEBUG logs and parse the final `BuildReport` JSON document the C10 process emits on stdout; (6) aggregate into `CacheBuildReport`; (7) release the lockfile in `finally`. Wraps any underlying error from C11/C10/C7/C6 as `CacheBuildError` with a `remediation` attribute populated per `failure_phase`. Owns the operator-facing C12-IT-02 acceptance test contract.
**Complexity**: 5 points
**Dependencies**: AZ-326_c12_cli_app, AZ-327_c12_companion_bringup, AZ-316_c11_tile_downloader, AZ-325_c10_cache_provisioner, AZ-489_c12_flights_api_client (Flight resolve + bbox-from-waypoints + takeoff origin), AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module
**Component**: c12_operator_tooling (epic AZ-253 / E-C12)
**Tracker**: AZ-328
**Epic**: AZ-253 (E-C12)
### Document Dependencies
- `_docs/02_document/contracts/c11_tilemanager/tile_downloader.md` — consumed: `fetch` API + `DownloadBatchReport` shape.
- `_docs/02_document/contracts/c10_provisioning/cache_provisioner.md` — consumed: `build_cache_artifacts` API + `BuildReport` shape (this task invokes the contract over SSH; the contract values are passed back as a JSON document).
- `_docs/02_document/components/13_c12_operator_tooling/description.md` — § 1 (Coordinator), § 2 (`build_cache`, `CacheBuildReport`), § 5 (`CacheBuildError`), § 7 (lockfile), § 8 (depends on C10 + C11).
- `_docs/02_document/contracts/shared_logging/log_record_schema.md` — INFO/WARN/ERROR + DEBUG log shapes (DEBUG is used for streamed C10 progress).
- `_docs/_process_leftovers/2026-05-09_satellite-provider-design-tasks.md` — the parent-suite `satellite-provider` URL + auth surface this task wires through (informational, no direct dep).
## Problem
Without a real `BuildCacheOrchestrator`:
- F1 has no head — operators cannot build a flight-ready cache; AC-8.3 (imagery pre-loaded onto companion before flight) collapses; AC-NEW-1 (cold-start TTFF) cannot be exercised.
- The download-vs-build phase distinction has no enforcement — without strict ordering, a build phase may start before the C6 cache has tiles, causing the C10 `DescriptorBatcher` to return `failure_reason="no tiles in C6 ..."` instead of the operator getting the actionable C11 download error first.
- Operators have no failure-phase signal — a `CacheBuildError` without `failure_phase` forces the operator to read tracebacks to determine whether to retry the download or rebuild the engines.
- C12-IT-02 (build_cache orchestrates C11 then C10; download failure aborts before C10) has no implementation.
- Concurrent operator runs of `build-cache` against the same area would race on C6 + on the companion's C10 cache root, producing inconsistent state. description.md § 7's lockfile mitigation has no producer.
- The CLI's `build-cache` subcommand has nothing to delegate to.
- C10's `BuildReport` is produced on the companion process; without a remote invoker that captures and parses its output, the operator workstation cannot aggregate it into `CacheBuildReport`.
This task delivers the F1 orchestrator + the remote C10 invoker + the lockfile + the unified `CacheBuildReport` aggregation. It does NOT own download (AZ-316), engine compile (AZ-321), descriptor generation (AZ-322), Manifest writing (AZ-323), takeoff verification (AZ-324), or the C10 orchestrator itself (AZ-325) — it composes them.
## Outcome
- A `BuildCacheOrchestrator` class at `src/operator_tool/build_cache.py`:
- Constructor: `__init__(self, *, flights_api_client: FlightsApiClient, tile_downloader: TileDownloader, companion_bringup: CompanionBringup, remote_c10_invoker: RemoteCacheProvisionerInvoker, freshness_table: FreshnessTable, lock_factory: FileLockFactory, logger: Logger, clock: Clock, config: C12BuildCacheConfig)`.
- `C12BuildCacheConfig` (`@dataclass(frozen=True)`): `cache_staging_root: Path`, `lock_filename: str = ".c12.lock"`, `lock_timeout_s: float = 5.0`, `companion_cache_root: PurePosixPath`, `flight_bbox_buffer_m: float = 1000.0`, `flights_api_base_url: str`, `flights_api_auth_token: SecretStr`.
- Public method: `build_cache(request: BuildCacheRequest) -> CacheBuildReport`.
- DTOs at `src/operator_tool/_types.py`:
- `BuildCacheRequest` (`@dataclass(frozen=True)`): `flight_source: FlightSource (one of `FlightById(flight_id: UUID)` or `FlightFromFile(path: Path)`)`, `sector_class: SectorClassification`, `calibration_path: Path`, `satellite_provider_url: str`, `api_key: SecretStr`, `companion_address: CompanionAddress`, `expected_engines: tuple[str, ...]`. **The legacy `bbox` field is removed — the orchestrator derives bbox from the resolved `FlightDto`.**
- `FlightResolveReport` (`@dataclass(frozen=True)`): `source: enum {flights_api, flight_file}`, `flight_id: UUID`, `waypoint_count: int`, `bbox: Bbox`, `takeoff_origin: LatLonAlt`, `raw_flight_dto: FlightDto`.
- `CacheBuildReport` (`@dataclass(frozen=True)`): `flight_resolve_report: FlightResolveReport | None`, `download_report: DownloadBatchReport | None`, `build_report: BuildReport | None`, `outcome: enum {success, failure, idempotent_no_op}`, `failure_phase: enum {none, flight_resolve, download, build}`, `failure_reason: str | None`, `wall_clock_s: float`.
- Errors at `src/operator_tool/errors.py`:
- `CacheBuildError(Exception)`: attributes `failure_phase: enum {flight_resolve, download, build}`, `wrapped_exception_repr: str`, `remediation: str`. The `remediation` attribute is populated at construction time per `failure_phase` (download → "Re-run with same args; check `satellite_provider_url` and `api_key`."; build → "Inspect companion `~/.azaion/onboard/c10-build.log`; consider `rm -rf <companion_cache_root>/engines/` to force a clean rebuild.").
- `BuildLockHeldError(CacheBuildError)`: subclass for the lock-held case with `remediation` = "Another `build-cache` is in progress; wait or kill the holding process and remove `<lock_path>`."
- A `RemoteCacheProvisionerInvoker` at `src/operator_tool/remote_c10_invoker.py`:
- Constructor: `__init__(self, *, ssh_factory: SshSessionFactory, logger: Logger)`.
- `invoke(session: SshSession, request: RemoteBuildRequest) -> BuildReport` — runs the C10 build entry point on the companion via `session.run("azaion-onboard c10 build --json-output --request <stdin>", ...)`, streams stdout line-by-line as DEBUG logs (`kind="c10.remote.progress"`), parses the final line as `BuildReport` JSON. The C10 entry point on the companion is the canonical CLI that AZ-325's `CacheProvisioner` ships (E-BOOT scaffolding established `azaion-onboard` as the airborne-image CLI; C10's build mode is `azaion-onboard c10 build`).
- A `FileLockFactory` Protocol at `src/operator_tool/file_lock.py`:
```python
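from pathlib import Path
from typing import Protocol, runtime_checkable

# Structural types: any object exposing these methods satisfies the Protocol.
@runtime_checkable
class FileLock(Protocol):
    def __enter__(self) -> "FileLock": ...
    def __exit__(self, exc_type, exc, tb) -> None: ...

@runtime_checkable
class FileLockFactory(Protocol):
    # Returns a lock usable as a context manager; raises if not acquired in time.
    def try_lock(self, path: Path, *, timeout_s: float) -> FileLock: ...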
```
Concrete: `FilelockFileLockFactory` wrapping the `filelock` library per the project pin (already used by E-C13 per epics.md C13 section). NOT a custom implementation.
- Method flow for `build_cache`:
0. **Flight resolve phase** (ADR-010 / AZ-489) — runs BEFORE the lockfile is acquired:
- Branch on `request.flight_source`:
- `FlightById(flight_id)` → `flight = flights_api_client.fetch_flight(flight_id=..., base_url=config.flights_api_base_url, auth_token=config.flights_api_auth_token)`.
- `FlightFromFile(path)` → `flight = flights_api_client.load_flight_file(path=path)`.
- Compute `bbox = flights_api_client.bbox_from_waypoints(flight.waypoints, buffer_m=config.flight_bbox_buffer_m)`.
- Compute `takeoff_origin = flights_api_client.takeoff_origin_from_flight(flight)`.
- Build `FlightResolveReport(source=..., flight_id=flight.flight_id, waypoint_count=len(flight.waypoints), bbox, takeoff_origin, raw_flight_dto=flight)`.
- Catch `FlightsApiUnreachableError`, `FlightsApiAuthError`, `FlightNotFoundError`, `FlightsApiSchemaError`, `FlightFileNotFoundError`, `EmptyWaypointsError`, `WaypointSchemaError` → wrap as `CacheBuildError(failure_phase=flight_resolve, ...)` and return `CacheBuildReport(outcome=failure, failure_phase=flight_resolve, flight_resolve_report=None, download_report=None, build_report=None, ...)`. INFO log `kind="c12.build_cache.flight_resolve.start"` before; ERROR log `kind="c12.build_cache.flight_resolve.failed"` on failure with the resolved error class name (auth_token NEVER logged).
1. Compute `lock_path = config.cache_staging_root / config.lock_filename`. Ensure `config.cache_staging_root` exists (mkdir parents=True).
2. Compute `freshness_threshold_months = freshness_table.threshold(request.sector_class)` (uses T1's helper).
3. Acquire lock: `with lock_factory.try_lock(lock_path, timeout_s=config.lock_timeout_s) as lock:` — on timeout, raise `BuildLockHeldError(failure_phase=download, ...)`.
4. Record `start_t = clock.monotonic()`.
5. INFO log `kind="c12.build_cache.start"` with the request (api_key + auth_token REDACTED) and the `flight_resolve_report` summary.
6. **Download phase**: `download_report = tile_downloader.fetch(DownloadRequest(bbox=flight_resolve_report.bbox, freshness_threshold_months=freshness_threshold_months, url=request.satellite_provider_url, api_key=request.api_key))` — the bbox is the one derived in phase 0; the orchestrator no longer accepts a caller-supplied bbox. Catch `SatelliteProviderError`, `RateLimitedError`, `ResolutionRejectionError`, `CacheBudgetExceededError`, `TileManagerError` → wrap as `CacheBuildError(failure_phase=download, ...)`. If `download_report.outcome == failure` → return `CacheBuildReport(outcome=failure, failure_phase=download, flight_resolve_report=..., download_report=..., build_report=None, failure_reason=download_report.failure_reason, wall_clock_s=...)`.
7. **Verify-ready phase**: `readiness = companion_bringup.verify_companion_ready(request.companion_address)`. Catch `CompanionUnreachableError`, `ContentHashMismatchError` → wrap as `CacheBuildError(failure_phase=download, ...)`. If `readiness.outcome == not_ready` → return `CacheBuildReport(outcome=failure, failure_phase=download, ..., failure_reason="companion not ready: " + ", ".join(readiness.not_ready_reasons))`.
8. **Build phase**: open SSH session via `ssh_factory.open(request.companion_address, ...)`; call `remote_c10_invoker.invoke(session, RemoteBuildRequest(bbox=flight_resolve_report.bbox, zoom_levels=..., sector_class=request.sector_class, calibration_path=request.calibration_path, expected_engines=request.expected_engines, companion_cache_root=config.companion_cache_root, takeoff_origin=flight_resolve_report.takeoff_origin, flight_id=flight_resolve_report.flight_id))` — the orchestrator forwards `takeoff_origin` + `flight_id` to the remote C10 build entry point so AZ-325 / AZ-323 bake them into the Manifest (ADR-010, AZ-490 consumes them on the companion at boot). Catch `EngineBuildError`, `CalibrationCacheError`, `ManifestSignatureError`, `ManifestCoverageError`, `BuildLockHeldError` (C10's lock, distinct from C12's) → wrap as `CacheBuildError(failure_phase=build, ...)`.
9. Aggregate: `build_report` from step 8. If `build_report.outcome == IDEMPOTENT_NO_OP` → return `CacheBuildReport(outcome=idempotent_no_op, failure_phase=none, download_report=..., build_report=..., failure_reason=None, wall_clock_s=...)`. Else if `build_report.outcome == FAILURE` → return `CacheBuildReport(outcome=failure, failure_phase=build, ..., failure_reason=build_report.failure_reason, ...)`.
10. INFO log `kind="c12.build_cache.success"` with the aggregated counts (tiles_downloaded, engines_built, engines_reused, descriptors_generated).
11. Return `CacheBuildReport(outcome=success, failure_phase=none, download_report=..., build_report=..., failure_reason=None, wall_clock_s=...)`.
12. Lockfile released by `__exit__` of the `with` block.
- Composition-root factory at `src/gps_denied_onboard/runtime_root/c12_factory.py` extends T1's `OperatorToolServices` dataclass with a `build_cache_orchestrator: BuildCacheOrchestrator` field. The factory `build_build_cache_orchestrator(config, services) -> BuildCacheOrchestrator` constructs the lock factory, the remote C10 invoker, and pulls T1's `freshness_table` + T2's `companion_bringup` from the existing services dataclass.
- T1's `cli.py` `build-cache` subcommand resolves `services.build_cache_orchestrator` and calls `.build_cache(request)`. Maps `CacheBuildError(failure_phase=download) → exit 20`; `CacheBuildError(failure_phase=build) → exit 21`; `BuildLockHeldError → exit 50`.
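
The phase ordering in the `build_cache` method flow can be sketched as below. This is a minimal illustration under stated assumptions, not the shipped `build_cache.py`: the collaborators are reduced to plain callables, `Report` is a simplified stand-in for `CacheBuildReport`, and `FlightResolveError`, `DownloadError`, and `BuildError` are hypothetical umbrella exceptions standing in for the error lists named in steps 0, 6, and 8.

```python
from contextlib import AbstractContextManager
from dataclasses import dataclass
from typing import Callable, List, Optional


class FlightResolveError(Exception):
    """Hypothetical umbrella for the phase-0 error classes."""


class DownloadError(Exception):
    """Hypothetical umbrella for the C11 download error classes."""


class BuildError(Exception):
    """Hypothetical umbrella for the C10 build error classes."""


@dataclass(frozen=True)
class Report:
    """Simplified stand-in for CacheBuildReport."""
    outcome: str
    failure_phase: str
    failure_reason: Optional[str] = None


def build_cache(
    resolve_flight: Callable[[], object],
    acquire_lock: Callable[[], AbstractContextManager],
    download: Callable[[object], None],
    verify_ready: Callable[[], List[str]],
    build: Callable[[object], None],
) -> Report:
    # Phase 0 runs BEFORE the lock: an unresolvable flight is an
    # operator-input error and must not contend for the lockfile.
    try:
        flight = resolve_flight()
    except FlightResolveError as exc:
        return Report("failure", "flight_resolve", str(exc))

    # The `with` block releases the lock on every exit path, including
    # unrecognised exceptions, which propagate to the caller unchanged.
    with acquire_lock():
        try:
            download(flight)
        except DownloadError as exc:
            return Report("failure", "download", str(exc))

        # Verify-ready failures are reported under the download phase.
        not_ready_reasons = verify_ready()
        if not_ready_reasons:
            reason = "companion not ready: " + ", ".join(not_ready_reasons)
            return Report("failure", "download", reason)

        try:
            build(flight)
        except BuildError as exc:
            return Report("failure", "build", str(exc))
        return Report("success", "none")
```

Because each recognised failure returns early from inside the `with` block, no later phase can run after an earlier one fails, which is the property AC-2, AC-3, and AC-11 check with spies.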
## Scope
### Included
- `BuildCacheOrchestrator` class with the single public method.
- The 3 DTOs (`BuildCacheRequest`, `FlightResolveReport`, `CacheBuildReport`) plus the `outcome` and `failure_phase` enums.
- The 2 error types (`CacheBuildError` with `remediation`, `BuildLockHeldError`).
- `RemoteCacheProvisionerInvoker` over SSH (using the `SshSessionFactory` Protocol from T2).
- `FileLockFactory` + `FileLock` Protocols + `FilelockFileLockFactory` concrete using the `filelock` library.
- Composition-root factory.
- Wiring of T1's `build-cache` subcommand to this service.
- Conformance unit tests using fakes for `FlightsApiClient`, `TileDownloader`, `CompanionBringup`, `RemoteCacheProvisionerInvoker`, `FileLockFactory` covering all 15 acceptance criteria.
### Excluded
- Anything internal to C11 download (AZ-316).
- Anything internal to C10 build (AZ-321..325).
- Anything internal to companion-side verification (AZ-327).
- The takeoff-time verification (AZ-324, airborne).
- Telemetry of build progress to a dashboard — DEBUG-log streaming only this cycle.
- Resumable downloads — AZ-316's idempotence handles partial downloads; this task does not retry on its own.
- Parallel multi-area builds — one area per `build_cache` call.
## Acceptance Criteria
**AC-1: Happy path — flight-resolve → download → verify-ready → build → `success`**
Given a fresh empty C6 + a clean companion + valid `BuildCacheRequest(flight_source=FlightById(...))` + fakes that all return `success` (including a 3-waypoint `FlightDto`)
When `build_cache(request)` is called
Then the call sequence is `flights_api_client.fetch_flight → bbox_from_waypoints → takeoff_origin_from_flight → lock acquire → tile_downloader.fetch (with derived bbox) → companion_bringup.verify_companion_ready → remote_c10_invoker.invoke (with takeoff_origin + flight_id) → lock release` (verifiable via spy on each fake); `CacheBuildReport(outcome=success, failure_phase=none, flight_resolve_report=..., download_report=..., build_report=..., failure_reason=None)` is returned; ONE INFO log `kind="c12.build_cache.flight_resolve.start"`; ONE INFO log `kind="c12.build_cache.start"`; ONE INFO log `kind="c12.build_cache.success"`
**AC-2: Download failure aborts before C10**
Given a fake `tile_downloader.fetch` that raises `SatelliteProviderError("503 Service Unavailable")`
When `build_cache(request)` is called
Then `CacheBuildReport(outcome=failure, failure_phase=download, download_report=None, build_report=None, failure_reason="503 Service Unavailable")` is returned (NOT raised); `companion_bringup.verify_companion_ready` is NEVER called; `remote_c10_invoker.invoke` is NEVER called; ONE ERROR log `kind="c12.build_cache.download.failed"`; lockfile is released
**AC-3: Verify-ready failure (`not_ready`) aborts before C10**
Given `tile_downloader.fetch` returns `success`, then `companion_bringup.verify_companion_ready` returns `ReadinessReport(outcome=not_ready, not_ready_reasons=("manifest missing",))`
When `build_cache(request)` is called
Then `CacheBuildReport(outcome=failure, failure_phase=download, ..., failure_reason="companion not ready: manifest missing")` is returned; `remote_c10_invoker.invoke` is NEVER called; ONE ERROR log `kind="c12.build_cache.companion.not_ready"`; lockfile released
**AC-4: Build failure surfaces `failure_phase=build`**
Given download + verify-ready return `success`/`ready`, then `remote_c10_invoker.invoke` raises `EngineBuildError("CUDA OOM on backbone dinov2_vpr")`
When `build_cache(request)` is called
Then `CacheBuildReport(outcome=failure, failure_phase=build, download_report=..., build_report=None, failure_reason="CUDA OOM on backbone dinov2_vpr")` is returned; ONE ERROR log `kind="c12.build_cache.build.failed"`; `CacheBuildError(failure_phase=build)`'s `remediation` attribute mentions cache cleanup; lockfile released
**AC-5: Lockfile prevents concurrent F1 runs**
Given a `FileLockFactory` whose `try_lock` raises `LockTimeout` after 5 s (simulated)
When `build_cache(request)` is called
Then `BuildLockHeldError(failure_phase=download, ...)` is raised; the `tile_downloader`, `companion_bringup`, `remote_c10_invoker` are NEVER called; ONE ERROR log `kind="c12.build_cache.lock.held"`
**AC-6: Lockfile released in `finally` even on exception**
Given any of the four service collaborators raises an unexpected exception (`KeyboardInterrupt`, `RuntimeError`)
When `build_cache(request)` is called
Then the exception propagates to the caller; the lockfile's `__exit__` was called exactly once (verifiable via spy on the fake `FileLock`); the next `build_cache` call against the same lock path acquires the lock immediately
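
AC-5 and AC-6 can be exercised with an in-memory fake that satisfies the `FileLock` / `FileLockFactory` Protocols. The sketch below is one possible shape for such test doubles; `SpyLock`, `SpyLockFactory`, `run_with_lock`, and the local `LockTimeout` class are hypothetical names introduced for illustration, and the lock path is an arbitrary example.

```python
from pathlib import Path


class LockTimeout(Exception):
    """Hypothetical stand-in for the timeout raised by a contended lock."""


class SpyLock:
    """Fake FileLock that counts context-manager entries and exits."""

    def __init__(self) -> None:
        self.enter_calls = 0
        self.exit_calls = 0

    def __enter__(self) -> "SpyLock":
        self.enter_calls += 1
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Returning None (falsy) lets any in-flight exception propagate.
        self.exit_calls += 1


class SpyLockFactory:
    """Fake FileLockFactory; pass held=True to simulate contention."""

    def __init__(self, held: bool = False) -> None:
        self.held = held
        self.lock = SpyLock()
        self.requested_paths = []

    def try_lock(self, path: Path, *, timeout_s: float) -> SpyLock:
        self.requested_paths.append(path)
        if self.held:
            raise LockTimeout(f"{path} still held after {timeout_s} s")
        return self.lock


def run_with_lock(factory: SpyLockFactory, work) -> None:
    """Mimics the orchestrator's `with lock_factory.try_lock(...)` shape."""
    with factory.try_lock(Path("/tmp/staging/.c12.lock"), timeout_s=5.0):
        work()
```

A test for AC-6 would pass a `work` callable that raises, assert the exception reaches the caller, and then assert `factory.lock.exit_calls == 1`.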
**AC-7: Idempotent no-op surfaces as `outcome=idempotent_no_op`**
Given `remote_c10_invoker.invoke` returns `BuildReport(outcome=IDEMPOTENT_NO_OP, ...)` (D-C10-1 hit per AZ-325)
When `build_cache(request)` is called
Then `CacheBuildReport(outcome=idempotent_no_op, failure_phase=none, ..., failure_reason=None)` is returned; ONE INFO log `kind="c12.build_cache.idempotent"`; CLI exit code is 0 (success-equivalent for idempotent re-runs)
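
The exit-code behaviour stated in this document (0 for `success` and `idempotent_no_op`, 20 for a download failure, 21 for a build failure, 50 for a held lock) can be sketched as a small adapter. The function names, the tuple-returning `build` callable, and the local `BuildLockHeldError` class are assumptions for illustration. No exit code for `failure_phase=flight_resolve` is given here, so the sketch returns `None` for it rather than guessing.

```python
from typing import Callable, Optional, Tuple

EXIT_OK = 0
EXIT_DOWNLOAD_FAILED = 20
EXIT_BUILD_FAILED = 21
EXIT_LOCK_HELD = 50


class BuildLockHeldError(Exception):
    """Stand-in for the orchestrator's lock-contention error."""


def exit_code_for(outcome: str, failure_phase: str) -> Optional[int]:
    """Map (outcome, failure_phase) to a documented exit code, else None."""
    if outcome in ("success", "idempotent_no_op"):
        return EXIT_OK
    documented = {"download": EXIT_DOWNLOAD_FAILED, "build": EXIT_BUILD_FAILED}
    return documented.get(failure_phase)


def run_build_cache(build: Callable[[], Tuple[str, str]]) -> Optional[int]:
    """Thin CLI-style adapter: lock contention is the only raised failure."""
    try:
        outcome, failure_phase = build()
    except BuildLockHeldError:
        return EXIT_LOCK_HELD
    return exit_code_for(outcome, failure_phase)
```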
**AC-8: `remediation` populated per `failure_phase`**
Given any `CacheBuildError` raised by the orchestrator
When the caller inspects `error.remediation`
Then for `failure_phase=download` the text mentions "Re-run with same args" + key/url checks; for `failure_phase=build` the text mentions cache cleanup + GPU diagnostics; for `BuildLockHeldError` the text mentions the lock path and how to clear it
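
One way to populate `remediation` at construction time is a lookup table keyed by phase. The sketch below is illustrative only: the constructor signatures are assumptions, and only the two phases whose remediation text is quoted in this document are included.

```python
from enum import Enum


class FailurePhase(Enum):
    DOWNLOAD = "download"
    BUILD = "build"


# Remediation strings copied from the error definitions in this document.
_REMEDIATION = {
    FailurePhase.DOWNLOAD: (
        "Re-run with same args; check `satellite_provider_url` and `api_key`."
    ),
    FailurePhase.BUILD: (
        "Inspect companion `~/.azaion/onboard/c10-build.log`; consider "
        "`rm -rf <companion_cache_root>/engines/` to force a clean rebuild."
    ),
}


class CacheBuildError(Exception):
    def __init__(self, failure_phase: FailurePhase, wrapped: BaseException) -> None:
        super().__init__(str(wrapped))
        self.failure_phase = failure_phase
        self.wrapped_exception_repr = repr(wrapped)
        self.remediation = _REMEDIATION[failure_phase]


class BuildLockHeldError(CacheBuildError):
    def __init__(self, lock_path: str) -> None:
        super().__init__(FailurePhase.DOWNLOAD, TimeoutError(lock_path))
        # Override the phase default with the lock-specific hint.
        self.remediation = (
            "Another `build-cache` is in progress; wait or kill the holding "
            f"process and remove `{lock_path}`."
        )
```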
**AC-9: api_key is REDACTED in all log output**
Given a `BuildCacheRequest` with `api_key=SecretStr("super-secret-token")`
When any log line is emitted by the orchestrator
Then no log line contains the literal token; `api_key` field appears as `"REDACTED"` or is omitted entirely
**AC-10: Aggregated `CacheBuildReport` carries all sub-reports on success**
Given a happy-path run
When the caller inspects the returned `CacheBuildReport`
Then `flight_resolve_report` is a populated `FlightResolveReport`; `download_report` is a populated `DownloadBatchReport` from C11; `build_report` is a populated `BuildReport` from C10; `wall_clock_s` is a positive float; all sub-reports' fields are accessible (no truncation)
**AC-11: Flight-resolve failure aborts BEFORE the lockfile (ADR-010)**
Given `flights_api_client.fetch_flight` raises `FlightNotFoundError`
When `build_cache(request)` is called
Then `CacheBuildReport(outcome=failure, failure_phase=flight_resolve, flight_resolve_report=None, download_report=None, build_report=None, failure_reason="flight not found: <uuid>")` is returned; `lock_factory.try_lock` is NEVER called; `tile_downloader.fetch` is NEVER called; `companion_bringup.verify_companion_ready` is NEVER called; `remote_c10_invoker.invoke` is NEVER called; ONE ERROR log `kind="c12.build_cache.flight_resolve.failed"`
**AC-12: Offline flight-file path used when `FlightFromFile` source is passed**
Given `BuildCacheRequest(flight_source=FlightFromFile(path=/tmp/flight.json))`
When `build_cache(request)` is called
Then `flights_api_client.load_flight_file(path=/tmp/flight.json)` is called once; `flights_api_client.fetch_flight` is NEVER called; the rest of the pipeline runs identically
**AC-13: `takeoff_origin` is forwarded to the remote C10 invoker**
Given a fake `FlightDto` with `waypoints[0] = (50.0, 36.2, 200.0)`
When `build_cache(request)` is called through to the build phase
Then `remote_c10_invoker.invoke` is called with `RemoteBuildRequest.takeoff_origin == LatLonAlt(50.0, 36.2, 200.0)` and `RemoteBuildRequest.flight_id == flight.flight_id`
**AC-14: `EmptyWaypointsError` surfaces with `failure_phase=flight_resolve`**
Given the resolved `FlightDto` has zero waypoints (so `bbox_from_waypoints` raises `EmptyWaypointsError`)
When `build_cache(request)` is called
Then `CacheBuildReport(outcome=failure, failure_phase=flight_resolve, ..., failure_reason="empty waypoints; re-plan in Mission Planner UI")` is returned; lockfile NOT acquired
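
For orientation, a metre-buffered bounding box over `(lat, lon, alt)` waypoints could look like the sketch below. The real `bbox_from_waypoints` belongs to AZ-489 and is not shown in this document. The `Bbox` field names, the 111,320 m-per-degree constant, and the equirectangular approximation are all assumptions, and the sketch ignores the poles and the antimeridian.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple

METERS_PER_DEG_LAT = 111_320.0  # approximate; assumption for this sketch


class EmptyWaypointsError(Exception):
    pass


@dataclass(frozen=True)
class Bbox:
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float


def bbox_from_waypoints(
    waypoints: Sequence[Tuple[float, float, float]], *, buffer_m: float
) -> Bbox:
    if not waypoints:
        raise EmptyWaypointsError("empty waypoints; re-plan in Mission Planner UI")
    lats = [wp[0] for wp in waypoints]
    lons = [wp[1] for wp in waypoints]
    dlat = buffer_m / METERS_PER_DEG_LAT
    # A degree of longitude shrinks with cos(latitude); use the latitude
    # farthest from the equator so the buffer is never under-sized.
    worst_lat = max(abs(min(lats)), abs(max(lats)))
    dlon = buffer_m / (METERS_PER_DEG_LAT * math.cos(math.radians(worst_lat)))
    return Bbox(min(lats) - dlat, min(lons) - dlon, max(lats) + dlat, max(lons) + dlon)
```

With the default `flight_bbox_buffer_m` of 1000.0, the latitude margin in this approximation is about 0.009 degrees on each side.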
**AC-15: `auth_token` is REDACTED in all log output (Phase 0)**
Given `config.flights_api_auth_token = SecretStr("bearer-xyz")`
When any log line is emitted by the flight-resolve phase
Then no log line contains the literal `bearer-xyz`; the field appears as `"REDACTED"` or is omitted entirely (same convention as AC-9 for `api_key`)
## Non-Functional Requirements
**Performance**
- The orchestrator's own overhead (lock acquire + verify-ready dispatch + result aggregation) is ≤ 1 s wall-clock; the dominant time is `tile_downloader.fetch` (minutes) + `remote_c10_invoker.invoke` (minutes), both owned upstream.
- Lock acquisition timeout default `5.0 s`; configurable for tests.
**Compatibility**
- `filelock` library per the project pin (used by E-C13 already). No new third-party dependencies.
- The `SshSessionFactory` Protocol is shared with T2 — the orchestrator MUST receive the same factory T2 uses (single composition-root construction).
**Reliability**
- Strict ordering: flight_resolve → lock → download → verify-ready → build. AC-2, AC-3, AC-11 enforce.
- Lockfile released in all paths (AC-6).
- `api_key` and `auth_token` never logged (AC-9, AC-15).
- The remote C10 invocation streams DEBUG logs but does NOT buffer the full stdout in memory — uses line-iteration so even multi-hour builds don't blow memory.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Happy path with all fakes returning success | Sequenced calls + `success` report + INFO logs |
| AC-2 | Fake `tile_downloader.fetch` raises `SatelliteProviderError` | `failure_phase=download`, no C10 call, lock released |
| AC-3 | Fake `verify_companion_ready` returns `not_ready` | `failure_phase=download`, no C10 call, lock released |
| AC-4 | Fake `remote_c10_invoker.invoke` raises `EngineBuildError` | `failure_phase=build`, ERROR log, remediation mentions cleanup |
| AC-5 | Fake `FileLockFactory.try_lock` raises `LockTimeout` | `BuildLockHeldError`, no service calls, ERROR log |
| AC-6 | Fake `tile_downloader.fetch` raises `KeyboardInterrupt` | `KeyboardInterrupt` propagates, `FileLock.__exit__` called once |
| AC-7 | Fake C10 returns `IDEMPOTENT_NO_OP` | `outcome=idempotent_no_op`, INFO log |
| AC-8 | Construct each error type, inspect `remediation` | Matches documented text per phase |
| AC-9 | Capture log output with `api_key="super-secret-token"` | Token not present in any log line |
| AC-10 | Happy-path inspect returned report | All three sub-reports (flight_resolve + download + build) present, fields accessible |
| AC-11 | Fake `fetch_flight` raises `FlightNotFoundError` | `failure_phase=flight_resolve`; lockfile NOT acquired; ZERO downstream calls |
| AC-12 | `FlightFromFile` source | `load_flight_file` called; `fetch_flight` NOT called |
| AC-13 | Inspect `RemoteBuildRequest` sent to invoker | `takeoff_origin` + `flight_id` forwarded |
| AC-14 | `EmptyWaypointsError` from `bbox_from_waypoints` | `failure_phase=flight_resolve`; lockfile NOT acquired |
| AC-15 | Capture log output with auth_token | Token not present |
| NFR-perf-overhead | Microbench orchestrator-only path with all-fake collaborators × 100 | p99 ≤ 50 ms (excludes real network/SSH) |
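
The NFR-perf-overhead row can be checked with a small timing helper. This is a sketch under assumptions: `p99_latency_s` is a hypothetical helper name, and the nearest-rank percentile definition is one reasonable choice rather than something the spec mandates.

```python
import time
from typing import Callable, List


def p99_latency_s(fn: Callable[[], object], runs: int = 100) -> float:
    """Run `fn` `runs` times and return the nearest-rank p99 in seconds."""
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    # Nearest-rank percentile: rank = ceil(0.99 * n), computed in integers
    # to avoid floating-point rounding at the boundary.
    rank = max(1, -(-99 * len(samples) // 100))
    return samples[rank - 1]
```

A test would wrap an orchestrator built from all-fake collaborators in a zero-argument callable and assert that the returned value is at most `0.050`.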
## Constraints
- Strict phase ordering is non-negotiable: flight_resolve → lock → download → verify-ready → build. Any reordering breaks AC-2/AC-3/AC-11 and causes operators to chase phantom errors. **The flight_resolve phase happens BEFORE the lockfile is acquired — a Flight that cannot be resolved is an operator-input error, not a contended-resource error, and should not block parallel builds.**
- `failure_phase` is a closed set `{none, flight_resolve, download, build}` — adding a new value requires Plan-cycle approval (operators script against these values).
- The lockfile lives in the operator workstation's cache staging area, NOT on the companion. Companion-side concurrent protection is C10's responsibility (CP-INV-4 in AZ-325).
- `api_key` field uses `pydantic.SecretStr` (or equivalent) and MUST NOT be `repr()`-logged anywhere in the orchestrator.
- The remote C10 invocation goes through the same `SshSessionFactory` as T2 — do NOT instantiate a second SSH client. Single composition root.
- `filelock` library — do NOT roll a custom file-locking primitive. Cross-platform correctness is hard.
## Risks & Mitigation
**Risk 1: Operator runs `build-cache` while a previous `build-cache` is still in progress**
- *Risk*: Two concurrent runs would race on the C6 spatial index + the companion's C10 cache root, producing inconsistent state.
- *Mitigation*: AC-5 + AC-6 — the lockfile is acquired with a 5-s timeout; the second invocation gets `BuildLockHeldError` with a clear remediation hint.
**Risk 2: Mid-build SSH session drops (operator disconnects USB)**
- *Risk*: The C10 build is hours long; an SSH disconnect surfaces as `paramiko.SSHException` in the middle of `remote_c10_invoker.invoke`.
- *Mitigation*: The exception propagates as `CacheBuildError(failure_phase=build, wrapped_exception_repr="...")`; `remediation` mentions reconnecting and re-running (D-C10-1 makes the next run cheap if the build was past the engine-compile phase). The lockfile is released so the retry is unblocked.
**Risk 3: C10's stdout stream is malformed or truncated**
- *Risk*: The companion's C10 process crashes mid-output; `RemoteCacheProvisionerInvoker` cannot find a valid `BuildReport` JSON line.
- *Mitigation*: `RemoteCacheProvisionerInvoker.invoke` raises `BuildReportParseError` (a `CacheBuildError(failure_phase=build)` subclass) with the captured stdout/stderr tail. Operator diagnoses via the companion's `c10-build.log`.
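
The stream-then-parse behaviour and the truncated-output failure mode can be sketched as follows. This is an illustration, not the invoker's actual code: `stream_and_parse`, the `tail_size` parameter, and the logger name are hypothetical, and `BuildReportParseError` is modelled as a plain exception here rather than as a `CacheBuildError` subclass.

```python
import json
import logging
from collections import deque
from typing import Iterable, List

logger = logging.getLogger("c12.remote_c10")


class BuildReportParseError(Exception):
    def __init__(self, message: str, stdout_tail: List[str]) -> None:
        super().__init__(message)
        self.stdout_tail = stdout_tail


def stream_and_parse(lines: Iterable[str], tail_size: int = 20) -> dict:
    """Log each line as DEBUG, then parse the final non-empty line as JSON."""
    # Bounded buffer: only the last few lines are kept, never the full stdout.
    tail = deque(maxlen=tail_size)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        logger.debug("c10.remote.progress %s", line)
        tail.append(line)
    if not tail:
        raise BuildReportParseError("C10 produced no output", [])
    try:
        report = json.loads(tail[-1])
    except json.JSONDecodeError as exc:
        raise BuildReportParseError(f"final line is not JSON: {exc}", list(tail)) from exc
    if not isinstance(report, dict):
        raise BuildReportParseError("final line is not a JSON object", list(tail))
    return report
```

Keeping only a bounded tail gives the operator enough context to diagnose a crash while preserving the requirement that long builds do not accumulate stdout in memory.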
**Risk 4: `freshness_threshold_months` lookup fails**
- *Risk*: A future cycle adds a `SectorClassification` enum value without updating `freshness_table` (T1).
- *Mitigation*: T1's `freshness_table.threshold(...)` raises `KeyError` for unknown values; this orchestrator surfaces it as `CacheBuildError(failure_phase=download, ...)` with `remediation` mentioning the missing classification. Tests on T1 cover this.
**Risk 5: api_key leaks into a DEBUG log via the C10 stdout stream**
- *Risk*: A mis-configured C10 prints the api_key in its own log; the orchestrator's DEBUG-streaming forwards it.
- *Mitigation*: AC-9 asserts the orchestrator does NOT emit the api_key; the `RemoteCacheProvisionerInvoker.invoke` filters incoming stdout lines through a redactor that replaces the literal api_key value with `<REDACTED>` before logging. Defence-in-depth — C10 SHOULD not log it either, but this guards against a regression.
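
A literal-value redactor of the kind described in the mitigation can be sketched in a few lines. The function names are hypothetical, and the sketch only replaces exact occurrences of the secret, so an encoded or split secret would not be caught.

```python
from typing import Iterable, Iterator

REDACTED = "<REDACTED>"


def redact(line: str, secrets: Iterable[str]) -> str:
    """Replace every literal occurrence of each secret in `line`."""
    for secret in secrets:
        if secret:  # an empty secret would match between every character
            line = line.replace(secret, REDACTED)
    return line


def redacted_lines(lines: Iterable[str], secrets: Iterable[str]) -> Iterator[str]:
    """Lazily redact a stream of lines before they reach the logger."""
    secret_list = [s for s in secrets if s]
    for line in lines:
        yield redact(line, secret_list)
```

Because `redacted_lines` is a generator, it composes with line-by-line stdout streaming without buffering the whole output.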
## Runtime Completeness
- **Named capability**: F1 pre-flight cache build orchestration per description.md § 1, § 2 (`build_cache`), § 8.
- **Production code that must exist**: real `BuildCacheOrchestrator` composing real `TileDownloader` (AZ-316) + real `CompanionBringup` (AZ-327) + real `RemoteCacheProvisionerInvoker` (this task) over real `paramiko` SSH (T2 owns the factory) + real `filelock` lockfile + real C10 build entry on the companion (AZ-325 ships the entry point).
- **Allowed external stubs**: tests MAY use fakes for all four service collaborators + the lock factory; production wiring uses real C11/C10 + real SSH + real filelock.
- **Unacceptable substitutes**: in-process fake C10 in production (description.md § 1 says C10 runs companion-side over USB/Eth — running in-process defeats the architecture); a custom file-locking primitive (correctness is non-trivial, use `filelock`); skipping verify-ready in production (defeats AC-NEW-1 takeoff verify); silently swallowing C10 errors instead of surfacing as `CacheBuildError(failure_phase=build)`.