# C10 CacheProvisioner: Idempotent Orchestrator + ManifestCoverageError

**Task**: AZ-325_c10_cache_provisioner
**Name**: C10 CacheProvisioner
**Description**: Implement `CacheProvisioner` (per the contract `_docs/02_document/contracts/c10_provisioning/cache_provisioner.md` v1.1.0), the public top-level orchestrator that composes AZ-321 (EngineCompiler), AZ-322 (DescriptorBatcher), and AZ-323 (ManifestBuilder) into a single idempotent F1 build pipeline. Acquires a `cache_root/.c10.lock` filesystem lockfile to enforce CP-INV-4. Computes the build-identity hash from the same canonical inputs AZ-323 hashes (model_ids + calibration_sha256 + tiles_coverage_sha256 + sector_class + bbox + zoom_levels **+ takeoff_origin + flight_id**) and compares to the existing `Manifest.json`'s `manifest_hash`; on match → `outcome=IDEMPOTENT_NO_OP`. On mismatch (or no prior Manifest) → run engine compile → descriptor population → Manifest build (passing `request.takeoff_origin` and `request.flight_id` to AZ-323), then walk `cache_root` to confirm every file is listed in the new Manifest's `artifacts` section, raising `ManifestCoverageError` on orphans (with rollback to prior-good Manifest). Empty corpus → `BuildReport(outcome=FAILURE, failure_reason="run C11 TileDownloader first")` per description.md § 5. **A request whose `takeoff_origin` differs from the prior Manifest's by ≥ 1 mm is treated as a new build identity (CP-INV-8); this is the contract that lets `ManifestVerifier` reject a re-planned route at boot.**
**Complexity**: 3 points
**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-303_c6_storage_interfaces, AZ-321_c10_engine_compiler, AZ-322_c10_descriptor_batcher, AZ-323_c10_manifest_builder
**Component**: c10_provisioning (epic AZ-252 / E-C10)
**Tracker**: AZ-325
**Epic**: AZ-252 (E-C10)

### Document Dependencies

- `_docs/02_document/contracts/c10_provisioning/cache_provisioner.md`: produced by this task (frozen Protocol + DTO shape, invariants, test cases).
- `_docs/02_document/components/11_c10_provisioning/description.md`: § 1 idempotence, § 5 error handling, § 7 lockfile race-condition mitigation.

## Problem

Without a real orchestrator:

- D-C10-1 (idempotent re-run via manifest hash) cannot be enforced: every operator invocation re-compiles every engine, blowing the C10-PT-01 ≤ 1 min warm target.
- D-C10-3 (`ManifestCoverageError` on orphan files / no smuggled artifacts) is unobservable: partial-build leftovers and out-of-band file drops at takeoff time go undetected.
- C10-IT-03 (idempotent re-run: same hash, no recompile) cannot be implemented.
- C10-IT-04 (`ManifestCoverageError` on orphan files) cannot be implemented.
- The race-condition mitigation per description.md § 7 (filesystem lockfile) has no producer.
- C12 OperatorTooling (E-C12) has no surface to call; its `c10 build` CLI command is a one-liner only after this task ships.
- The "missing tiles in C6" failure path (description.md § 5) has no surface: operators would see a stack trace from AZ-322 instead of a clear `failure_reason` directing them to C11.

This task delivers the orchestrator + its frozen contract. It does NOT compile engines (AZ-321), embed tiles (AZ-322), build Manifests (AZ-323), or verify at takeoff (AZ-324).
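The idempotent re-run behind D-C10-1 comes down to comparing one digest of the build-identity inputs against the `build.manifest_hash` stored in the existing Manifest. The following is a minimal sketch of that comparison, not the contract's algorithm: the function names, the payload key names, and the JSON canonicalization (sorted keys, compact separators) are assumptions made for illustration, and the authoritative formula is whatever AZ-323 defines.

```python
import hashlib
import json
from typing import Optional, Sequence, Tuple


def build_identity_hash(
    model_ids: Sequence[str],
    calibration_sha256: str,
    tiles_coverage_sha256: str,
    sector_class: str,
    bbox: Tuple[float, float, float, float],
    zoom_levels: Sequence[int],
    takeoff_origin: Optional[Tuple[float, float, float]] = None,
    flight_id: Optional[str] = None,
) -> str:
    """Digest the build-identity inputs (hypothetical helper).

    Assumption: key names and canonical JSON form are illustrative only;
    a real implementation must reuse AZ-323's exact formula.
    """
    origin = None
    if takeoff_origin is not None:
        # (lat_deg, lon_deg, alt_m) rounded to 9 decimal places; a missing
        # origin serializes as JSON null.
        origin = [round(value, 9) for value in takeoff_origin]
    payload = {
        "model_ids": list(model_ids),
        "calibration_sha256": calibration_sha256,
        "tiles_coverage_sha256": tiles_coverage_sha256,
        "sector_class": sector_class,
        "bbox": list(bbox),
        "zoom_levels": list(zoom_levels),
        "takeoff_origin": origin,
        "flight_id": flight_id,
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_idempotent_no_op(request_hash: str, existing_manifest: Optional[dict]) -> bool:
    """Compare against build.manifest_hash only; no full verification."""
    if existing_manifest is None:
        return False
    stored = existing_manifest.get("build", {}).get("manifest_hash")
    return stored == request_hash
```

Because `takeoff_origin` and `flight_id` are part of the hashed payload, changing either one yields a different digest and therefore a full rebuild rather than a no-op.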
## Outcome

- A `CacheProvisioner` class implementation at `src/gps_denied_onboard/components/c10_provisioning/provisioner.py` matching the Protocol in the contract.
- Constructor: `__init__(self, *, engine_compiler: EngineCompiler, descriptor_batcher: DescriptorBatcher, manifest_builder: ManifestBuilder, tile_metadata_store: TileMetadataStore, lock_factory: FileLockFactory, logger: Logger, clock: Clock, config: C10ProvisionerConfig)`.
- `C10ProvisionerConfig` (`@dataclass(frozen=True)`): `coverage_strict: bool = True`, `lock_timeout_s: float = 5.0`, `manifest_filename: str = "Manifest.json"`.
- Method `build_cache_artifacts(request: BuildRequest) -> BuildReport` flow:
  1. **Lock acquisition** (CP-INV-4):
     - Path: `request.cache_root / ".c10.lock"`.
     - Acquire via `lock_factory.try_lock(path, timeout_s=config.lock_timeout_s)`: a bounded wait with a short timeout, so that concurrent invocations surface as `BuildLockHeldError`.
  2. **Tile gathering**: call `tile_metadata_store.query_by_bbox(bbox, zoom_levels, sector_class)`.
     - If empty → return `BuildReport(outcome=FAILURE, failure_reason="no tiles in C6 for the requested scope; run C11 TileDownloader first", engines_built=0, ...)`. ERROR log; release lock.
  3. **Build-identity hash for idempotence check**:
     - Compute `request_hash = sha256(canonical_json(model_ids + calibration_sha256 + tiles_coverage_sha256 + sector_class + bbox + zoom_levels + takeoff_origin_tuple_or_none + flight_id_or_none))`. The `model_ids` come from the configured backbone list; `calibration_sha256` from streaming the calibration_path; `tiles_coverage_sha256` from sorting the tile rows by `(zoom, lat, lon, source)` and hashing per AZ-323's algorithm. `takeoff_origin_tuple_or_none` is `(lat_deg, lon_deg, alt_m)` rounded to 9 decimal places when `request.takeoff_origin is not None`, otherwise the JSON `null` sentinel (CP-INV-8). The hashing formula MUST match AZ-323 exactly so AZ-325's idempotence decision agrees with AZ-323's emitted `build.manifest_hash`.
     - Read existing `Manifest.json` if present; parse only the `build.manifest_hash` field (don't run full verification; that's AZ-324's job). If `existing.manifest_hash == request_hash` → return `BuildReport(outcome=IDEMPOTENT_NO_OP, manifest_hash=existing.manifest_hash, manifest_path=existing_path, engines_built=0, engines_reused=0, descriptors_generated=0, elapsed_s, failure_reason=None)`. INFO log; release lock.
  4. **Active build path**:
     - Snapshot prior-good Manifest (rename to `Manifest.json.prev` if present) for rollback.
     - Compose engine compile request from configured backbones; call `engine_compiler.compile_engines_for_corpus(...)` → `engine_entries`.
     - Compose descriptor populate request (filter, callback hooked to logger); call `descriptor_batcher.populate_descriptors(...)` → `DescriptorBatchReport`. If `outcome=failure` → restore prior Manifest, release lock, return `BuildReport(outcome=FAILURE, failure_reason=batch.failure_reason, ...)`.
     - Compose Manifest build input from engine entries + descriptor index path + calibration + key_path **+ `request.takeoff_origin` + `request.flight_id`** (ADR-010); call `manifest_builder.build_manifest(...)` → `ManifestArtifact`. Both fields default to `None` when the caller did not supply them (e.g., legacy C12 invocation without `--flight-id`).
  5. **Coverage check** (CP-INV-3 / D-C10-3):
     - Walk `cache_root` recursively (`pathlib.Path.rglob`); collect every regular file path EXCLUDING `Manifest.json`, `Manifest.json.sha256`, `Manifest.json.sig`, `Manifest.json.prev`, `.c10.lock`, and any `.sha256` sidecar (sidecars are implicit per the AZ-280 pattern, paired with their primary).
     - Build expected set: every `path` in `manifest.artifacts.engines + descriptor_index + calibration` (resolved relative to `cache_root`).
     - `orphans = walked - expected`.
     - If `orphans` non-empty AND `config.coverage_strict`:
       - Restore prior Manifest from `Manifest.json.prev` (delete current Manifest; rename prev back). If no prev existed, leave the new Manifest in place but raise.
       - Raise `ManifestCoverageError(f"orphan files in cache_root: {sorted(orphans)}")`. ERROR log.
     - If `orphans` non-empty AND NOT `coverage_strict`: WARN log with the orphan list; continue.
  6. **Cleanup**: delete `Manifest.json.prev` if present; release lock.
  7. Return `BuildReport(outcome=SUCCESS, engines_built, engines_reused, descriptors_generated, manifest_hash, manifest_path, failure_reason=None, elapsed_s)`.
- Method `compile_engines_for_corpus(request)` is a thin passthrough to `engine_compiler.compile_engines_for_corpus(request)` (per CP-TC-11; lets operators run engine-only re-compiles for D-C10-6 hardware-change scenarios without redoing descriptors).
- A `FileLockFactory` Protocol + a default `filelock`-library-backed impl (use `filelock` package, already pinned via shared helpers if present; if not, add to deps with a single pinned version).
- INFO logs on lock acquired / released, build start/end, idempotent no-op; ERROR on coverage error / build failure; WARN on non-strict coverage drift.

## Scope

### Included

- `CacheProvisioner` class implementing the Protocol from the contract.
- The contract document (frozen at v1.0.0).
- Filesystem lockfile (FileLockFactory Protocol + filelock-backed default impl).
- Idempotence check (parse existing Manifest's `manifest_hash` only; no full verify).
- Coverage walk + `ManifestCoverageError` with rollback to prior Manifest.
- Empty-corpus handling with explicit hint to run C11.
- `compile_engines_for_corpus` passthrough.
- Composition-root factory `build_cache_provisioner(config) -> CacheProvisioner`.
- Conformance test for the contract Protocol.

### Excluded

- The internal phases (AZ-321, AZ-322, AZ-323).
- Manifest verification at takeoff (AZ-324).
- Operator CLI / tooling (E-C12).
- C13 FDR emissions (build is offline).
- Resumable mid-build state (out of scope; restart from scratch).
- GC of stale engines (operator action).
- Multi-cache rotation.
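The coverage check in step 5 of the `build_cache_artifacts` flow can be sketched with `pathlib.Path.rglob` and a set difference. This is a minimal illustration under stated assumptions: `find_orphans` and `check_coverage` are hypothetical helper names, the `ManifestCoverageError` class here is a local stand-in for the contract's exception, the expected paths are passed in as plain `cache_root`-relative strings rather than read from a real Manifest, and rollback of `Manifest.json.prev` is left out.

```python
from pathlib import Path
from typing import Iterable, List, Set

# Top-level files the walk must ignore (the exclusion list from step 5).
EXCLUDED_TOP_LEVEL = {
    "Manifest.json",
    "Manifest.json.sha256",
    "Manifest.json.sig",
    "Manifest.json.prev",
    ".c10.lock",
}


class ManifestCoverageError(Exception):
    """Local stand-in for the contract's exception type."""


def find_orphans(cache_root: Path, manifest_paths: Iterable[str]) -> Set[str]:
    """Return cache_root-relative paths of files the Manifest does not list."""
    expected = {Path(p).as_posix() for p in manifest_paths}
    walked: Set[str] = set()
    for path in cache_root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(cache_root).as_posix()
        # Skip Manifest bookkeeping files and every .sha256 sidecar.
        if rel in EXCLUDED_TOP_LEVEL or path.suffix == ".sha256":
            continue
        walked.add(rel)
    return walked - expected


def check_coverage(
    cache_root: Path, manifest_paths: Iterable[str], *, strict: bool = True
) -> List[str]:
    """Raise in strict mode; otherwise return the orphans for a WARN log."""
    orphans = sorted(find_orphans(cache_root, manifest_paths))
    if orphans and strict:
        raise ManifestCoverageError(f"orphan files in cache_root: {orphans}")
    return orphans
```

In the real orchestrator the strict branch would first restore the prior Manifest before raising, and the non-strict branch would log the returned list at WARN level.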
## Acceptance Criteria

**AC-1: Cold build composes phases and writes Manifest**
Given an empty cache_root and C6 populated with tiles for the requested scope
When `build_cache_artifacts(request)` is called
Then `outcome=SUCCESS`; `engines_built > 0`; `descriptors_generated > 0`; `Manifest.json` + `Manifest.json.sig` + `Manifest.json.sha256` exist; `BuildReport.manifest_hash` matches the on-disk Manifest's `build.manifest_hash`; `elapsed_s` is positive

**AC-2: Warm idempotent re-run skips everything**
Given a prior successful build at the same cache_root with the same identity tuple
When `build_cache_artifacts` is called with an identical request
Then `outcome=IDEMPOTENT_NO_OP`; `engines_built=0, engines_reused=0, descriptors_generated=0`; ZERO calls to `engine_compiler.compile_engines_for_corpus` (verifiable via spy); ZERO calls to `descriptor_batcher.populate_descriptors`; ZERO calls to `manifest_builder.build_manifest`; the on-disk Manifest is byte-identical (mtime unchanged)

**AC-3: Different bbox triggers full rebuild and atomic replacement**
Given a prior Manifest at the cache_root for bbox A
When `build_cache_artifacts` is called with bbox B (B ≠ A)
Then `outcome=SUCCESS`; the new Manifest replaces the old (atomic via AZ-280); old `Manifest.json.prev` is cleaned up after coverage passes; `manifest_hash` differs from the prior

**AC-4: Empty corpus surfaces failure with operator hint**
Given C6 has zero tiles for the requested scope
When `build_cache_artifacts` is called
Then `outcome=FAILURE`; `failure_reason` contains "C11 TileDownloader"; ZERO compile / embed / Manifest calls; lock IS released (no leaked lockfile)

**AC-5: Concurrent invocation raises `BuildLockHeldError`**
Given another invocation holds `.c10.lock`
When a second `build_cache_artifacts` runs
Then `BuildLockHeldError` is raised within `lock_timeout_s`; the existing build is unaffected; the existing lockfile is NOT deleted

**AC-6: ManifestCoverageError rolls back to prior Manifest**
Given a prior-good Manifest exists; a build is run; before the coverage walk, an orphan file `cache_root/leftover.bin` is dropped (simulated)
When the coverage walk runs in strict mode
Then `ManifestCoverageError(...)` is raised naming the orphan; `Manifest.json` on disk is the prior-good one (prev was restored); ERROR log

**AC-7: Coverage non-strict mode warns but continues**
Given `coverage_strict=False` and an orphan
When the build completes
Then `outcome=SUCCESS`; ONE WARN log naming the orphan; the new Manifest is on disk

**AC-8: Lock released on every exit path**
Given any of: success / failure / IDEMPOTENT_NO_OP / `ManifestCoverageError` / `EngineBuildError` propagation
When `build_cache_artifacts` returns or raises
Then `cache_root/.c10.lock` is removed (or unlocked if the implementation uses fcntl); a subsequent call succeeds (no leftover lock)

**AC-9: Hard errors propagate without state corruption**
Given `engine_compiler.compile_engines_for_corpus` raises `EngineBuildError`
When `build_cache_artifacts` runs
Then the error propagates; on-disk Manifest is the prior-good one (prev restored); lock is released; partial engines that AZ-321 wrote ARE on disk (not deleted; operators may want them for diagnostic)

**AC-10: `compile_engines_for_corpus` passthrough**
Given a request configured for engine-only re-compile
When `compile_engines_for_corpus(req)` is called directly
Then `engine_compiler.compile_engines_for_corpus(req)` is invoked once with the same request; the return value is forwarded as a tuple; no lock is acquired (this is a thin diagnostic-mode call)

**AC-11: Conformance (`isinstance` returns True)**
Given the implementation
When `isinstance(impl, CacheProvisioner)` is checked under runtime_checkable
Then `True`

**AC-12: Cold build benchmark within C10-PT-01 envelope**
Given Tier-1 dev workstation with NVIDIA GPU + a 1000-tile corpus + 3 backbones
When a cold build runs
Then wall-clock ≤ 12 min (CP-TC-12 / NFR C10-PT-01); WARN log if exceeded (so operators see the regression in CI)

**AC-13: Warm idempotent benchmark within C10-PT-01 envelope**
Given a populated cache and identical request
When `build_cache_artifacts` runs
Then wall-clock ≤ 1 min (CP-TC-13 / NFR C10-PT-01); the bound work is the build-identity hash computation, which is dominated by `tiles_coverage_sha256` over 1000 tiles (~5 ms hashing)

**AC-14: `takeoff_origin` mismatch triggers full rebuild (ADR-010 / CP-INV-8)**
Given a prior Manifest built with `takeoff_origin = A`
When `build_cache_artifacts` is called with the SAME bbox / zooms / sector / calibration / tiles, but `takeoff_origin = B (B ≠ A by ≥ 1 mm)`
Then `outcome=SUCCESS` (NOT `IDEMPOTENT_NO_OP`); the new Manifest replaces the old; the new `manifest_hash` differs from the prior; the new Manifest's `flight.takeoff_origin` matches B

**AC-15: `takeoff_origin = None` propagates through with no flight block in Manifest (back-compat)**
Given a `BuildRequest` with `takeoff_origin = None` and `flight_id = None`
When `build_cache_artifacts` runs
Then `outcome=SUCCESS`; the produced Manifest has no `flight.takeoff_origin` key (AZ-323's AC-14); idempotence still works for subsequent identical-without-origin invocations

**AC-16: `flight_id` participation in idempotence**
Given a prior Manifest built with `flight_id = X, takeoff_origin = A`
When `build_cache_artifacts` runs with `flight_id = Y, takeoff_origin = A` (only `flight_id` differs)
Then `outcome=SUCCESS` (NOT `IDEMPOTENT_NO_OP`); `flight_id` is part of the build identity per CP-INV-8

## Non-Functional Requirements

**Performance**

- Cold path is bound by AZ-321 + AZ-322 (per their NFRs); this orchestrator adds ≤ 5 s coordination overhead.
- Warm path: build-identity hash + Manifest read + idempotence compare ≤ 5 s on Tier-1 dev workstation (1000-tile corpus).
- Coverage walk: O(N files); ≤ 1 s for ≤ 10k files in cache_root.

**Compatibility**

- `filelock` library: pin via `requirements.txt`. (Verify already present from a prior task's deps; if not, add. Same version across all C10 tasks.)
- `orjson` (already pinned via AZ-272), `hashlib` (stdlib), `pathlib` (stdlib).

**Reliability**

- CP-INV-2: failed build never leaves the cache in a worse state than at start.
- Lock release on every exit path (try/finally).
- Atomic Manifest replacement (rename prev → current rollback semantics); coverage error rolls back automatically.
- No silent failures: every error path logs at ERROR level with diagnostic.

## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Cold build with fakes for phases | All phases called once; SUCCESS |
| AC-2 | Warm re-run with identical request | IDEMPOTENT_NO_OP; zero phase calls |
| AC-3 | Different bbox after prior build | SUCCESS; atomic replace; old Manifest gone |
| AC-4 | Empty C6 query | FAILURE; hint string; lock released |
| AC-14 | Warm re-run with different takeoff_origin | SUCCESS; new manifest_hash; phases called |
| AC-15 | Build with takeoff_origin=None | SUCCESS; Manifest has no flight.takeoff_origin |
| AC-16 | Warm re-run with different flight_id only | SUCCESS; new manifest_hash |
| AC-5 | Pre-acquire lock externally; run | BuildLockHeldError |
| AC-6 | Inject orphan file before coverage walk | ManifestCoverageError; prior Manifest restored |
| AC-7 | Same as AC-6 with `coverage_strict=False` | SUCCESS; WARN log |
| AC-8 | Each error path | Lock released after each |
| AC-9 | engine_compiler raises | Error propagates; rollback; lock released |
| AC-10 | Direct call to compile_engines_for_corpus | Single passthrough; no lock |
| AC-11 | Conformance | True |
| AC-12 | Cold build bench (skipped on CI; manual) | ≤ 12 min |
| AC-13 | Warm bench | ≤ 1 min |
| NFR-perf-coverage-walk | 10k files in cache_root | ≤ 1 s |

## Constraints

- The orchestrator does NOT touch `satellite-provider` (CP-INV-6); all I/O is local.
- Lockfile is mandatory; bypassing the lock for testing is a config flag, NOT a separate code path.
- Idempotence check parses ONLY `build.manifest_hash` from the existing Manifest; full verification is AZ-324's job (separate code path).
- `Manifest.json.prev` is the rollback target; never two prevs deep (rebuilds are not stack-able).
- Coverage walk EXCLUDES the lockfile, the Manifest itself, its sidecar, its signature, and any `.prev` rollback file.
- The orchestrator never modifies engines compiled by AZ-321 (atomic on disk); it only touches the Manifest + .prev/.lock files.
- Operator key handling delegates entirely to AZ-323 (CP-INV-7).
- This task introduces at most ONE new third-party dependency (`filelock`); verify against existing deps first.

## Risks & Mitigation

**Risk 1: Stale lockfile after process kill**

- *Risk*: A SIGKILL'd build leaves `.c10.lock` on disk; subsequent runs always raise `BuildLockHeldError`.
- *Mitigation*: Use `filelock` library which uses fcntl flock (auto-released on process exit). On platforms without fcntl, document the manual cleanup step. AC-5 + AC-8 cover normal lock release; the SIGKILL case is an OS-level guarantee from filelock.

**Risk 2: Coverage walk slow on huge cache_root**

- *Risk*: 100k files in cache_root → coverage walk could take seconds.
- *Mitigation*: NFR-perf-coverage-walk benchmark; if exceeded, switch to streaming compare with a sorted Manifest path list. Out of scope for the initial impl.

**Risk 3: Idempotence check trusts prior Manifest's hash without verifying signature**

- *Risk*: A tampered Manifest could lie about its `manifest_hash`, fooling the orchestrator into IDEMPOTENT_NO_OP and skipping a needed rebuild.
- *Mitigation*: This is acceptable because AZ-324's `ManifestVerifier` runs at takeoff: a tampered Manifest fails verify and prevents arming. The orchestrator's role is to AVOID rebuilds when nothing changed; trusting `manifest_hash` is a performance optimization, not a security check. Documented in CP-INV-1.

**Risk 4: `coverage_strict=False` becomes the de-facto default**

- *Risk*: Operators set `coverage_strict=False` to ship faster, defeating D-C10-3.
- *Mitigation*: Default is True; the config flag is documented as "for forensic builds only"; CI runs always assert strict.

**Risk 5: Rollback corrupts state on partial coverage walk failure**

- *Risk*: If `Manifest.json.prev` rename fails (e.g., disk full), the cache is left in an in-between state.
- *Mitigation*: Use AZ-280's atomic rename helper; if the rename itself fails, surface a distinct `ManifestRollbackError` (subclass of `ManifestCoverageError`) so operators see the disk-level cause. Documented but not a blocker for v1.0.0.

**Risk 6: Lock acquisition races with operator's manual file ops**

- *Risk*: Operator manually edits a file in cache_root while a build is running.
- *Mitigation*: Coverage walk happens at the end of build; if operator drops a file mid-build, AC-6 catches it. The lockfile prevents two CONCURRENT builds, not operator-vs-build interference. Documented.

## Runtime Completeness

- **Named capability**: top-level F1 cache build with D-C10-1 idempotence + D-C10-3 ManifestCoverageError + lockfile race-condition mitigation (epic § Acceptance C10-IT-01, C10-IT-03, C10-IT-04; description.md § 1, § 5, § 7).
- **Production code that must exist**: real `CacheProvisioner` orchestrating real AZ-321/322/323 + real `filelock`-backed lock + real coverage walk + real rollback.
- **Allowed external stubs**: tests MAY use spy/fake versions of AZ-321/322/323 (already produced by their conformance tests) + an in-process `FileLockFactory` for deterministic concurrency tests.
- **Unacceptable substitutes**: skipping the lockfile (defeats CP-INV-4); skipping the coverage walk (defeats D-C10-3); a "soft" idempotence that re-builds anyway (defeats D-C10-1 and the C10-PT-01 1-min warm target); calling AZ-324's `ManifestVerifier` for the idempotence check (overkill: full verify on every operator invocation triples warm-path cost); deleting partial engines on failure (operators rely on them for diagnostic per AC-9).
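
The in-process `FileLockFactory` allowed as a test stub could look roughly like the sketch below, which also shows the lock being released on every exit path (AC-8) and a second holder being rejected (AC-5). Several things here are assumptions rather than the contract's shape: that `try_lock` returns a context manager, the `InProcessLockFactory` and `run_locked` names, and the local `BuildLockHeldError` stand-in. The production default would be backed by the `filelock` package instead of `threading.Lock`.

```python
import threading
from contextlib import contextmanager
from pathlib import Path
from typing import Callable, ContextManager, Dict, Iterator, Protocol


class BuildLockHeldError(RuntimeError):
    """Local stand-in for the contract's exception type."""


class FileLockFactory(Protocol):
    # Assumption: try_lock yields a context manager that releases on exit.
    def try_lock(self, path: Path, timeout_s: float) -> ContextManager[None]:
        ...


class InProcessLockFactory:
    """Test double: one threading.Lock per lock path, no filesystem I/O."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks: Dict[Path, threading.Lock] = {}

    @contextmanager
    def try_lock(self, path: Path, timeout_s: float) -> Iterator[None]:
        with self._guard:
            lock = self._locks.setdefault(path, threading.Lock())
        # Bounded wait: give up after timeout_s instead of blocking forever.
        if not lock.acquire(timeout=timeout_s):
            raise BuildLockHeldError(f"lock already held: {path}")
        try:
            yield
        finally:
            # Runs on return and on any exception raised by the build body.
            lock.release()


def run_locked(
    factory: FileLockFactory,
    cache_root: Path,
    build: Callable[[], str],
    timeout_s: float = 5.0,
) -> str:
    """Run a build body while holding cache_root/.c10.lock."""
    with factory.try_lock(cache_root / ".c10.lock", timeout_s=timeout_s):
        return build()
```

Because `threading.Lock` is not reentrant, a second `try_lock` on the same path times out deterministically even within a single test thread, which keeps concurrency tests free of sleeps and real processes.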