[AZ-325] C10 CacheProvisioner orchestrator

Implements the public top-level F1 build orchestrator for E-C10 per
contract v1.1.0. Composes EngineCompiler (AZ-321), DescriptorBatcher
(AZ-322), and ManifestBuilder (AZ-323) into a single idempotent
operation guarded by a fcntl-backed cache_root/.c10.lock and a
post-build coverage walk.

Adds:
- CacheProvisionerImpl + FilelockFileLockFactory (provisioner.py)
- BuildRequest/BuildReport/BuildOutcome/SectorClassification DTOs +
  FileLockFactory Protocol + replaced placeholder CacheProvisioner
  Protocol with v1.1.0 surface (interface.py)
- C10ProvisionerConfig wired into C10ProvisioningConfig (config.py)
- BuildLockHeldError + ManifestCoverageError (errors.py)
- build_cache_provisioner composition root (c10_factory.py)
- 18 tests covering AC-1..AC-16 + NFR-perf-coverage-walk
- filelock>=3.13,<4.0 (single new third-party dep)

Idempotence (CP-INV-1) reuses AZ-323's _compute_manifest_hash /
_aggregate_tile_hash so the build-identity decision agrees byte-for-
byte with the Manifest's recorded manifest_hash. Coverage rollback
uses a .prev rename snapshot. Diagnostic compile_engines_for_corpus
is lock-free per AC-10.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-13 05:00:16 +03:00
parent 684ec2601c
commit f7b2e70085
12 changed files with 2329 additions and 21 deletions
@@ -0,0 +1,251 @@
# C10 CacheProvisioner — Idempotent Orchestrator + ManifestCoverageError
**Task**: AZ-325_c10_cache_provisioner
**Name**: C10 CacheProvisioner
**Description**: Implement `CacheProvisioner` (per the contract `_docs/02_document/contracts/c10_provisioning/cache_provisioner.md` v1.1.0), the public top-level orchestrator that composes AZ-321 (EngineCompiler), AZ-322 (DescriptorBatcher), and AZ-323 (ManifestBuilder) into a single idempotent F1 build pipeline. Acquires a `cache_root/.c10.lock` filesystem lockfile to enforce CP-INV-4. Computes the build-identity hash from the same canonical inputs AZ-323 hashes (model_ids + calibration_sha256 + tiles_coverage_sha256 + sector_class + bbox + zoom_levels **+ takeoff_origin + flight_id**) and compares to the existing `Manifest.json`'s `manifest_hash`; on match → `outcome=IDEMPOTENT_NO_OP`. On mismatch (or no prior Manifest) → run engine compile → descriptor population → Manifest build (passing `request.takeoff_origin` and `request.flight_id` to AZ-323), then walk `cache_root` to confirm every file is listed in the new Manifest's `artifacts` section, raising `ManifestCoverageError` on orphans (with rollback to prior-good Manifest). Empty corpus → `BuildReport(outcome=FAILURE, failure_reason="run C11 TileDownloader first")` per description.md § 5. **A request whose `takeoff_origin` differs from the prior Manifest's by ≥ 1 mm is treated as a new build identity (CP-INV-8) — this is the contract that lets `ManifestVerifier` reject a re-planned route at boot.**
**Complexity**: 3 points
**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-303_c6_storage_interfaces, AZ-321_c10_engine_compiler, AZ-322_c10_descriptor_batcher, AZ-323_c10_manifest_builder
**Component**: c10_provisioning (epic AZ-252 / E-C10)
**Tracker**: AZ-325
**Epic**: AZ-252 (E-C10)
### Document Dependencies
- `_docs/02_document/contracts/c10_provisioning/cache_provisioner.md` — produced by this task (frozen Protocol + DTO shape, invariants, test cases).
- `_docs/02_document/components/11_c10_provisioning/description.md` — § 1 idempotence, § 5 error handling, § 7 lockfile race-condition mitigation.
## Problem
Without a real orchestrator:
- D-C10-1 (idempotent re-run via manifest hash) cannot be enforced — every operator invocation re-compiles every engine, blowing the C10-PT-01 ≤ 1 min warm target.
- D-C10-3 (`ManifestCoverageError` on orphan files / no smuggled artifacts) is unobservable — partial-build leftovers and out-of-band file drops at takeoff time go undetected.
- C10-IT-03 (idempotent re-run — same hash, no recompile) cannot be implemented.
- C10-IT-04 (`ManifestCoverageError` on orphan files) cannot be implemented.
- The race-condition mitigation per description.md § 7 (filesystem lockfile) has no producer.
- C12 OperatorTooling (E-C12) has no surface to call — its `c10 build` CLI command is a one-liner only after this task ships.
- The "missing tiles in C6" failure path (description.md § 5) has no surface — operators would see a stack trace from AZ-322 instead of a clear `failure_reason` directing them to C11.
This task delivers the orchestrator + its frozen contract. It does NOT compile engines (AZ-321), embed tiles (AZ-322), build Manifests (AZ-323), or verify at takeoff (AZ-324).
## Outcome
- A `CacheProvisioner` class implementation at `src/gps_denied_onboard/components/c10_provisioning/provisioner.py` matching the Protocol in the contract.
- Constructor: `__init__(self, *, engine_compiler: EngineCompiler, descriptor_batcher: DescriptorBatcher, manifest_builder: ManifestBuilder, tile_metadata_store: TileMetadataStore, lock_factory: FileLockFactory, logger: Logger, clock: Clock, config: C10ProvisionerConfig)`.
- `C10ProvisionerConfig` (`@dataclass(frozen=True)`): `coverage_strict: bool = True`, `lock_timeout_s: float = 5.0`, `manifest_filename: str = "Manifest.json"`.
- Method `build_cache_artifacts(request: BuildRequest) -> BuildReport` flow:
1. **Lock acquisition** (CP-INV-4):
- Path: `request.cache_root / ".c10.lock"`.
- Acquire via `lock_factory.try_lock(path, timeout_s=config.lock_timeout_s)`: a bounded wait with a short timeout, so that a concurrent invocation surfaces as `BuildLockHeldError` rather than blocking indefinitely.
2. **Tile gathering**: call `tile_metadata_store.query_by_bbox(bbox, zoom_levels, sector_class)`.
- If empty → return `BuildReport(outcome=FAILURE, failure_reason="no tiles in C6 for the requested scope; run C11 TileDownloader first", engines_built=0, ...)`. ERROR log; release lock.
3. **Build-identity hash for idempotence check**:
- Compute `request_hash = sha256(canonical_json(model_ids + calibration_sha256 + tiles_coverage_sha256 + sector_class + bbox + zoom_levels + takeoff_origin_tuple_or_none + flight_id_or_none))`. The `model_ids` come from the configured backbone list; `calibration_sha256` from streaming the calibration_path; `tiles_coverage_sha256` from sorting the tile rows by `(zoom, lat, lon, source)` and hashing per AZ-323's algorithm. `takeoff_origin_tuple_or_none` is `(lat_deg, lon_deg, alt_m)` rounded to 9 decimal places when `request.takeoff_origin is not None`, otherwise the JSON `null` sentinel (CP-INV-8). The hashing formula MUST match AZ-323 exactly so AZ-325's idempotence decision agrees with AZ-323's emitted `build.manifest_hash`.
- Read existing `Manifest.json` if present; parse only the `build.manifest_hash` field (don't run full verification — that's AZ-324's job). If `existing.manifest_hash == request_hash` → return `BuildReport(outcome=IDEMPOTENT_NO_OP, manifest_hash=existing.manifest_hash, manifest_path=existing_path, engines_built=0, engines_reused=0, descriptors_generated=0, elapsed_s, failure_reason=None)`. INFO log; release lock.
4. **Active build path**:
- Snapshot prior-good Manifest (rename to `Manifest.json.prev` if present) for rollback.
- Compose engine compile request from configured backbones; call `engine_compiler.compile_engines_for_corpus(...)` → `engine_entries`.
- Compose descriptor populate request (filter, callback hooked to logger); call `descriptor_batcher.populate_descriptors(...)` → `DescriptorBatchReport`. If `outcome=failure` → restore prior Manifest, release lock, return `BuildReport(outcome=FAILURE, failure_reason=batch.failure_reason, ...)`.
- Compose Manifest build input from engine entries + descriptor index path + calibration + key_path **+ `request.takeoff_origin` + `request.flight_id`** (ADR-010); call `manifest_builder.build_manifest(...)` → `ManifestArtifact`. Both fields default to `None` when the caller did not supply them (e.g., legacy C12 invocation without `--flight-id`).
5. **Coverage check** (CP-INV-3 / D-C10-3):
- Walk `cache_root` recursively (`pathlib.Path.rglob`); collect every regular file path EXCLUDING `Manifest.json`, `Manifest.json.sha256`, `Manifest.json.sig`, `Manifest.json.prev`, `.c10.lock`, and any `.sha256` sidecar (sidecars are implicit per the AZ-280 pattern, paired with their primary).
- Build expected set: every `path` in `manifest.artifacts.engines + descriptor_index + calibration` (resolved relative to `cache_root`).
- `orphans = walked - expected`.
- If `orphans` non-empty AND `config.coverage_strict`:
- Restore prior Manifest from `Manifest.json.prev` (delete current Manifest; rename prev back). If no prev existed, leave the new Manifest in place but raise.
- Raise `ManifestCoverageError(f"orphan files in cache_root: {sorted(orphans)}")`. ERROR log.
- If `orphans` non-empty AND NOT `coverage_strict`: WARN log with the orphan list; continue.
6. **Cleanup**: delete `Manifest.json.prev` if present; release lock.
7. Return `BuildReport(outcome=SUCCESS, engines_built, engines_reused, descriptors_generated, manifest_hash, manifest_path, failure_reason=None, elapsed_s)`.
- Method `compile_engines_for_corpus(request)` is a thin passthrough to `engine_compiler.compile_engines_for_corpus(request)` (per CP-TC-11; lets operators run engine-only re-compiles for D-C10-6 hardware-change scenarios without redoing descriptors).
- A `FileLockFactory` Protocol plus a default implementation backed by the `filelock` package (reuse the existing pin if a prior task already added it; otherwise add it to the dependencies with a single pinned version).
- INFO logs on lock acquired / released, build start/end, idempotent no-op; ERROR on coverage error / build failure; WARN on non-strict coverage drift.
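The build-identity hash from step 3 can be sketched as follows. This is a minimal illustration and not AZ-323's actual algorithm: the function name `compute_request_hash`, the payload key names, and the use of `json.dumps(sort_keys=True)` as the canonical JSON form are all assumptions. The real implementation must reuse AZ-323's helpers so the two hashes agree byte-for-byte.

```python
import hashlib
import json
from typing import Optional, Sequence, Tuple


def compute_request_hash(
    model_ids: Sequence[str],
    calibration_sha256: str,
    tiles_coverage_sha256: str,
    sector_class: str,
    bbox: Tuple[float, float, float, float],
    zoom_levels: Sequence[int],
    takeoff_origin: Optional[Tuple[float, float, float]] = None,
    flight_id: Optional[str] = None,
) -> str:
    """Hash the identity tuple; changing any field yields a new build identity."""
    # (lat_deg, lon_deg, alt_m) rounded to 9 decimal places, or JSON null (CP-INV-8).
    origin = (
        [round(v, 9) for v in takeoff_origin]
        if takeoff_origin is not None
        else None
    )
    payload = {
        "model_ids": list(model_ids),
        "calibration_sha256": calibration_sha256,
        "tiles_coverage_sha256": tiles_coverage_sha256,
        "sector_class": sector_class,
        "bbox": list(bbox),
        "zoom_levels": list(zoom_levels),
        "takeoff_origin": origin,
        "flight_id": flight_id,
    }
    # Assumed canonical form: sorted keys, no insignificant whitespace.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because `takeoff_origin` and `flight_id` are part of the payload, two requests that differ only in one of them hash differently, which is what AC-14 and AC-16 rely on.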
## Scope
### Included
- `CacheProvisioner` class implementing the Protocol from the contract.
- The contract document (frozen at v1.1.0).
- Filesystem lockfile (FileLockFactory Protocol + filelock-backed default impl).
- Idempotence check (parse existing Manifest's `manifest_hash` only; no full verify).
- Coverage walk + `ManifestCoverageError` with rollback to prior Manifest.
- Empty-corpus handling with explicit hint to run C11.
- `compile_engines_for_corpus` passthrough.
- Composition-root factory `build_cache_provisioner(config) -> CacheProvisioner`.
- Conformance test for the contract Protocol.
### Excluded
- The internal phases (AZ-321, AZ-322, AZ-323).
- Manifest verification at takeoff (AZ-324).
- Operator CLI / tooling (E-C12).
- C13 FDR emissions (build is offline).
- Resumable mid-build state (out of scope; restart from scratch).
- GC of stale engines (operator action).
- Multi-cache rotation.
## Acceptance Criteria
**AC-1: Cold build composes phases and writes Manifest**
Given an empty cache_root and C6 populated with tiles for the requested scope
When `build_cache_artifacts(request)` is called
Then `outcome=SUCCESS`; `engines_built > 0`; `descriptors_generated > 0`; `Manifest.json` + `Manifest.json.sig` + `Manifest.json.sha256` exist; `BuildReport.manifest_hash` matches the on-disk Manifest's `build.manifest_hash`; `elapsed_s` is positive
**AC-2: Warm idempotent re-run skips everything**
Given a prior successful build at the same cache_root with the same identity tuple
When `build_cache_artifacts` is called with an identical request
Then `outcome=IDEMPOTENT_NO_OP`; `engines_built=0, engines_reused=0, descriptors_generated=0`; ZERO calls to `engine_compiler.compile_engines_for_corpus` (verifiable via spy); ZERO calls to `descriptor_batcher.populate_descriptors`; ZERO calls to `manifest_builder.build_manifest`; the on-disk Manifest is byte-identical (mtime unchanged)
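A minimal sketch of the warm-path check, assuming the Manifest is a JSON object with a nested `build.manifest_hash` string as described above. The helper names are hypothetical, stdlib `json` stands in for `orjson`, and treating an unreadable or malformed Manifest as "no prior build" is an assumption rather than something the contract states.

```python
import json
from pathlib import Path
from typing import Optional


def read_existing_manifest_hash(manifest_path: Path) -> Optional[str]:
    """Return build.manifest_hash from an existing Manifest, or None."""
    try:
        doc = json.loads(manifest_path.read_bytes())
    except (FileNotFoundError, ValueError):
        # Missing or unparsable Manifest: assumed to mean "no prior build".
        return None
    build = doc.get("build") if isinstance(doc, dict) else None
    value = build.get("manifest_hash") if isinstance(build, dict) else None
    return value if isinstance(value, str) else None


def is_idempotent_no_op(manifest_path: Path, request_hash: str) -> bool:
    """True when the recorded hash equals the freshly computed request hash."""
    return read_existing_manifest_hash(manifest_path) == request_hash
```

Only one field is read and no signature is checked, which keeps the warm path cheap; full verification stays with AZ-324.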
**AC-3: Different bbox triggers full rebuild and atomic replacement**
Given a prior Manifest at the cache_root for bbox A
When `build_cache_artifacts` is called with bbox B (B ≠ A)
Then `outcome=SUCCESS`; the new Manifest replaces the old (atomic via AZ-280); old `Manifest.json.prev` is cleaned up after coverage passes; `manifest_hash` differs from the prior
**AC-4: Empty corpus surfaces failure with operator hint**
Given C6 has zero tiles for the requested scope
When `build_cache_artifacts` is called
Then `outcome=FAILURE`; `failure_reason` contains "C11 TileDownloader"; ZERO compile / embed / Manifest calls; lock IS released (no leaked lockfile)
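The empty-corpus path could look roughly like this. The field names come from the `BuildReport` usages in this document; the field types, the enum string values, and the helper `report_if_empty_corpus` are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional, Sequence


class BuildOutcome(Enum):
    SUCCESS = "success"  # string values are assumed
    IDEMPOTENT_NO_OP = "idempotent_no_op"
    FAILURE = "failure"


@dataclass(frozen=True)
class BuildReport:
    outcome: BuildOutcome
    engines_built: int
    engines_reused: int
    descriptors_generated: int
    manifest_hash: Optional[str]
    manifest_path: Optional[Path]
    failure_reason: Optional[str]
    elapsed_s: float


EMPTY_CORPUS_REASON = (
    "no tiles in C6 for the requested scope; run C11 TileDownloader first"
)


def report_if_empty_corpus(
    tiles: Sequence[object], elapsed_s: float
) -> Optional[BuildReport]:
    """Return a FAILURE report when the tile query is empty, else None."""
    if tiles:
        return None
    return BuildReport(
        outcome=BuildOutcome.FAILURE,
        engines_built=0,
        engines_reused=0,
        descriptors_generated=0,
        manifest_hash=None,
        manifest_path=None,
        failure_reason=EMPTY_CORPUS_REASON,
        elapsed_s=elapsed_s,
    )
```

Returning a report instead of raising lets the CLI print the hint directly, without a stack trace.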
**AC-5: Concurrent invocation raises `BuildLockHeldError`**
Given another invocation holds `.c10.lock`
When a second `build_cache_artifacts` runs
Then `BuildLockHeldError` is raised within `lock_timeout_s`; the existing build is unaffected; the existing lockfile is NOT deleted
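A sketch of the locking behaviour, using stdlib `fcntl.flock` directly as a stand-in for the `filelock` package so the example has no third-party dependency (Unix only). The free function `try_lock`, its context-manager shape, and the `poll_s` parameter are assumptions; the real surface is `FileLockFactory.try_lock` from the contract.

```python
import fcntl
import os
import time
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


class BuildLockHeldError(RuntimeError):
    """Stand-in for the error defined in errors.py."""


@contextmanager
def try_lock(path: Path, timeout_s: float, poll_s: float = 0.05) -> Iterator[None]:
    """Hold an exclusive flock on `path`, giving up after `timeout_s`."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        deadline = time.monotonic() + timeout_s
        while True:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    raise BuildLockHeldError(f"lock held: {path}") from None
                time.sleep(poll_s)
        try:
            yield
        finally:
            # Runs on return and on exception, so no exit path leaks the lock.
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        # The lockfile itself is left on disk; only the flock is dropped.
        os.close(fd)
```

The loser of the race never deletes or modifies the lockfile, so the build that holds the lock is unaffected.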
**AC-6: ManifestCoverageError rolls back to prior Manifest**
Given a prior-good Manifest exists; a build is run; before the coverage walk, an orphan file `cache_root/leftover.bin` is dropped (simulated)
When the coverage walk runs in strict mode
Then `ManifestCoverageError(...)` is raised naming the orphan; `Manifest.json` on disk is the prior-good one (prev was restored); ERROR log
**AC-7: Coverage non-strict mode warns but continues**
Given `coverage_strict=False` and an orphan
When the build completes
Then `outcome=SUCCESS`; ONE WARN log naming the orphan; the new Manifest is on disk
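The strict and non-strict coverage behaviours can be sketched as below. The function names are hypothetical, `manifest_paths` is assumed to be the list of artifact paths relative to `cache_root`, the Manifest-family exclusions are assumed to apply at the top level of `cache_root` only, and the rollback to `Manifest.json.prev` is left out to keep the example short.

```python
import logging
from pathlib import Path
from typing import Iterable, Set

# Files that never count as orphans (paths relative to cache_root).
EXCLUDED = {
    "Manifest.json",
    "Manifest.json.sha256",
    "Manifest.json.sig",
    "Manifest.json.prev",
    ".c10.lock",
}


class ManifestCoverageError(RuntimeError):
    """Stand-in for the error defined in errors.py."""


def find_orphans(cache_root: Path, manifest_paths: Iterable[str]) -> Set[str]:
    """Return files under cache_root that the Manifest does not list."""
    expected = {Path(p).as_posix() for p in manifest_paths}
    walked = set()
    for entry in cache_root.rglob("*"):
        if not entry.is_file():
            continue
        rel = entry.relative_to(cache_root).as_posix()
        # .sha256 sidecars are implicit, paired with their primary file.
        if rel in EXCLUDED or entry.suffix == ".sha256":
            continue
        walked.add(rel)
    return walked - expected


def check_coverage(
    cache_root: Path,
    manifest_paths: Iterable[str],
    strict: bool,
    logger: logging.Logger,
) -> None:
    """Raise in strict mode, warn otherwise, when orphans are present."""
    orphans = sorted(find_orphans(cache_root, manifest_paths))
    if not orphans:
        return
    if strict:
        logger.error("orphan files in cache_root: %s", orphans)
        raise ManifestCoverageError(f"orphan files in cache_root: {orphans}")
    logger.warning("orphan files in cache_root (non-strict): %s", orphans)
```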
**AC-8: Lock released on every exit path**
Given any of: success / failure / IDEMPOTENT_NO_OP / `ManifestCoverageError` / `EngineBuildError` propagation
When `build_cache_artifacts` returns or raises
Then `cache_root/.c10.lock` is removed (or unlocked if the implementation uses fcntl); a subsequent call succeeds (no leftover lock)
**AC-9: Hard errors propagate without state corruption**
Given `engine_compiler.compile_engines_for_corpus` raises `EngineBuildError`
When `build_cache_artifacts` runs
Then the error propagates; on-disk Manifest is the prior-good one (prev restored); lock is released; partial engines that AZ-321 wrote ARE still on disk (not deleted, because operators may want them for diagnostics)
**AC-10: `compile_engines_for_corpus` passthrough**
Given a request configured for engine-only re-compile
When `compile_engines_for_corpus(req)` is called directly
Then `engine_compiler.compile_engines_for_corpus(req)` is invoked once with the same request; the return value is forwarded as a tuple; no lock is acquired (this is a thin diagnostic-mode call)
**AC-11: Conformance — `isinstance` returns True**
Given the implementation
When `isinstance(impl, CacheProvisioner)` is checked under runtime_checkable
Then `True`
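For illustration, a conformance check of this kind might look like the sketch below. The Protocol body is reduced to the two method names used in this document, and `Any` replaces the real request and report types; `FakeProvisioner` and `Incomplete` are hypothetical test doubles.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CacheProvisioner(Protocol):
    def build_cache_artifacts(self, request: Any) -> Any: ...

    def compile_engines_for_corpus(self, request: Any) -> Any: ...


class FakeProvisioner:
    """Has both methods, so it satisfies the runtime check."""

    def build_cache_artifacts(self, request: Any) -> Any:
        return None

    def compile_engines_for_corpus(self, request: Any) -> Any:
        return ()


class Incomplete:
    """Missing compile_engines_for_corpus, so the runtime check fails."""

    def build_cache_artifacts(self, request: Any) -> Any:
        return None
```

Note that `isinstance` against a `runtime_checkable` Protocol only checks that the methods exist; it does not check signatures or return types, so the behavioural ACs are still needed.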
**AC-12: Cold build benchmark within C10-PT-01 envelope**
Given Tier-1 dev workstation with NVIDIA GPU + a 1000-tile corpus + 3 backbones
When a cold build runs
Then wall-clock ≤ 12 min (CP-TC-12 / NFR C10-PT-01); WARN log if exceeded (so operators see the regression in CI)
**AC-13: Warm idempotent benchmark within C10-PT-01 envelope**
Given a populated cache and identical request
When `build_cache_artifacts` runs
Then wall-clock ≤ 1 min (CP-TC-13 / NFR C10-PT-01); the work is bounded by the build-identity hash computation, which is dominated by `tiles_coverage_sha256` over 1000 tiles (~5 ms of hashing)
**AC-14: `takeoff_origin` mismatch triggers full rebuild (ADR-010 / CP-INV-8)**
Given a prior Manifest built with `takeoff_origin = A`
When `build_cache_artifacts` is called with the SAME bbox / zooms / sector / calibration / tiles, but `takeoff_origin = B (B ≠ A by ≥ 1 mm)`
Then `outcome=SUCCESS` (NOT `IDEMPOTENT_NO_OP`); the new Manifest replaces the old; the new `manifest_hash` differs from the prior; the new Manifest's `flight.takeoff_origin` matches B
**AC-15: `takeoff_origin = None` propagates through with no flight block in Manifest (back-compat)**
Given a `BuildRequest` with `takeoff_origin = None` and `flight_id = None`
When `build_cache_artifacts` runs
Then `outcome=SUCCESS`; the produced Manifest has no `flight.takeoff_origin` key (AZ-323's AC-14); idempotence still works for subsequent identical-without-origin invocations
**AC-16: `flight_id` participation in idempotence**
Given a prior Manifest built with `flight_id = X, takeoff_origin = A`
When `build_cache_artifacts` runs with `flight_id = Y, takeoff_origin = A` (only `flight_id` differs)
Then `outcome=SUCCESS` (NOT `IDEMPOTENT_NO_OP`); `flight_id` is part of the build identity per CP-INV-8
## Non-Functional Requirements
**Performance**
- Cold path is bound by AZ-321 + AZ-322 (per their NFRs); this orchestrator adds ≤ 5 s coordination overhead.
- Warm path: build-identity hash + Manifest read + idempotence compare ≤ 5 s on Tier-1 dev workstation (1000-tile corpus).
- Coverage walk: O(N files); ≤ 1 s for ≤ 10k files in cache_root.
**Compatibility**
- `filelock` library — pin via `requirements.txt`. (Verify already present from a prior task's deps; if not, add. Same version across all C10 tasks.)
- `orjson` (already pinned via AZ-272), `hashlib` (stdlib), `pathlib` (stdlib).
**Reliability**
- CP-INV-2: failed build never leaves the cache in a worse state than at start.
- Lock release on every exit path (try/finally).
- Atomic Manifest replacement, with rollback by renaming `Manifest.json.prev` back to `Manifest.json`; a coverage error triggers the rollback automatically.
- No silent failures: every error path logs at ERROR level with a diagnostic message.
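The `.prev` snapshot and rollback could be sketched as follows. The three helper names are hypothetical, and `os.replace` stands in for AZ-280's atomic rename helper, whose API is not shown in this document.

```python
import os
from pathlib import Path


def _prev_path(manifest: Path) -> Path:
    return manifest.with_name(manifest.name + ".prev")


def snapshot_prior(manifest: Path) -> bool:
    """Rename Manifest.json to Manifest.json.prev; True if one existed."""
    if not manifest.exists():
        return False
    os.replace(manifest, _prev_path(manifest))
    return True


def restore_prior(manifest: Path) -> bool:
    """Put the snapshot back, overwriting any newly written Manifest."""
    prev = _prev_path(manifest)
    if not prev.exists():
        return False
    # os.replace overwrites the destination in one step, so there is no
    # window in which neither the old nor the new Manifest exists.
    os.replace(prev, manifest)
    return True


def discard_snapshot(manifest: Path) -> None:
    """Cleanup after a successful build and coverage walk."""
    _prev_path(manifest).unlink(missing_ok=True)
```

Only one snapshot ever exists, which matches the constraint that rollbacks are never two prevs deep.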
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Cold build with fakes for phases | All phases called once; SUCCESS |
| AC-2 | Warm re-run with identical request | IDEMPOTENT_NO_OP; zero phase calls |
| AC-3 | Different bbox after prior build | SUCCESS; atomic replace; old Manifest gone |
| AC-4 | Empty C6 query | FAILURE; hint string; lock released |
| AC-14 | Warm re-run with different takeoff_origin | SUCCESS; new manifest_hash; phases called |
| AC-15 | Build with takeoff_origin=None | SUCCESS; Manifest has no flight.takeoff_origin |
| AC-16 | Warm re-run with different flight_id only | SUCCESS; new manifest_hash |
| AC-5 | Pre-acquire lock externally; run | BuildLockHeldError |
| AC-6 | Inject orphan file before coverage walk | ManifestCoverageError; prior Manifest restored |
| AC-7 | Same as AC-6 with `coverage_strict=False` | SUCCESS; WARN log |
| AC-8 | Each error path | Lock released after each |
| AC-9 | engine_compiler raises | Error propagates; rollback; lock released |
| AC-10 | Direct call to compile_engines_for_corpus | Single passthrough; no lock |
| AC-11 | Conformance | True |
| AC-12 | Cold build bench (skipped on CI; manual) | ≤ 12 min |
| AC-13 | Warm bench | ≤ 1 min |
| NFR-perf-coverage-walk | 10k files in cache_root | ≤ 1 s |
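The NFR-perf-coverage-walk row could be exercised with a helper along these lines. This is a sketch under assumptions: the helper name, the 100-shard directory layout, and the empty placeholder files are invented for illustration and do not reflect the real cache layout.

```python
import tempfile
import time
from pathlib import Path
from typing import Tuple


def time_coverage_walk(n_files: int) -> Tuple[int, float]:
    """Create n_files empty files, walk them, return (files seen, seconds)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for i in range(n_files):
            shard = root / f"shard_{i % 100:02d}"
            shard.mkdir(exist_ok=True)
            (shard / f"tile_{i}.bin").touch()
        start = time.perf_counter()
        walked = sum(1 for p in root.rglob("*") if p.is_file())
        elapsed = time.perf_counter() - start
    return walked, elapsed
```

A benchmark test would call `time_coverage_walk(10_000)` and assert that the elapsed time stays within the 1 s budget on the Tier-1 workstation.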
## Constraints
- The orchestrator does NOT touch `satellite-provider` (CP-INV-6); all I/O is local.
- Lockfile is mandatory; bypassing the lock for testing is a config flag, NOT a separate code path.
- Idempotence check parses ONLY `build.manifest_hash` from the existing Manifest; full verification is AZ-324's job (separate code path).
- `Manifest.json.prev` is the rollback target; never two prevs deep (rebuilds are not stack-able).
- Coverage walk EXCLUDES the lockfile, the Manifest itself, its signature, any `.prev` rollback file, and every `.sha256` sidecar (including the Manifest's own).
- The orchestrator never modifies engines compiled by AZ-321 (atomic on disk) — it only touches the Manifest + .prev/.lock files.
- Operator key handling delegates entirely to AZ-323 (CP-INV-7).
- This task introduces at most ONE new third-party dependency (`filelock`) — verify against existing deps first.
## Risks & Mitigation
**Risk 1: Stale lockfile after process kill**
- *Risk*: A SIGKILL'd build leaves `.c10.lock` on disk; subsequent runs always raise `BuildLockHeldError`.
- *Mitigation*: Use the `filelock` library, which on Unix takes an `fcntl.flock` lock that the kernel releases when the process exits. On platforms without fcntl, document the manual cleanup step. AC-5 + AC-8 cover normal lock release; the SIGKILL case relies on that OS-level release rather than on cleanup code.
**Risk 2: Coverage walk slow on huge cache_root**
- *Risk*: 100k files in cache_root → coverage walk could take seconds.
- *Mitigation*: NFR-perf-coverage-walk benchmark; if exceeded, switch to streaming compare with a sorted Manifest path list. Out of scope for the initial impl.
**Risk 3: Idempotence check trusts prior Manifest's hash without verifying signature**
- *Risk*: A tampered Manifest could lie about its `manifest_hash`, fooling the orchestrator into IDEMPOTENT_NO_OP and skipping a needed rebuild.
- *Mitigation*: This is acceptable because AZ-324's `ManifestVerifier` runs at takeoff — a tampered Manifest fails verify and prevents arming. The orchestrator's role is to AVOID rebuilds when nothing changed; trusting `manifest_hash` is a performance optimization, not a security check. Documented in CP-INV-1.
**Risk 4: `coverage_strict=False` becomes the de-facto default**
- *Risk*: Operators set `coverage_strict=False` to ship faster, defeating D-C10-3.
- *Mitigation*: Default is True; the config flag is documented as "for forensic builds only"; CI runs always assert strict.
**Risk 5: Rollback corrupts state on partial coverage walk failure**
- *Risk*: If `Manifest.json.prev` rename fails (e.g., disk full), the cache is left in an in-between state.
- *Mitigation*: Use AZ-280's atomic rename helper; if the rename itself fails, surface a distinct `ManifestRollbackError` (subclass of `ManifestCoverageError`) so operators see the disk-level cause. Documented but not a blocker for v1.0.0.
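A possible shape for the distinct rollback error, assuming `ManifestCoverageError` derives from `RuntimeError` (its real base class is not stated here). `restore_or_raise` is a hypothetical helper, and `os.replace` again stands in for AZ-280's rename helper.

```python
import os
from pathlib import Path


class ManifestCoverageError(RuntimeError):
    """Stand-in for the error defined in errors.py."""


class ManifestRollbackError(ManifestCoverageError):
    """Restoring Manifest.json.prev failed at the filesystem level."""


def restore_or_raise(prev: Path, manifest: Path) -> None:
    """Rename prev over manifest; wrap disk-level failures distinctly."""
    try:
        os.replace(prev, manifest)
    except OSError as exc:
        # Chain the original OSError so operators see the disk-level cause.
        raise ManifestRollbackError(
            f"could not restore {manifest.name} from {prev.name}: {exc}"
        ) from exc
```

Because the new error subclasses `ManifestCoverageError`, existing handlers keep working while callers that care can catch the narrower type.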
**Risk 6: Lock acquisition races with operator's manual file ops**
- *Risk*: Operator manually edits a file in cache_root while a build is running.
- *Mitigation*: Coverage walk happens at the end of build; if operator drops a file mid-build, AC-6 catches it. The lockfile prevents two CONCURRENT builds, not operator-vs-build interference. Documented.
## Runtime Completeness
- **Named capability**: top-level F1 cache build with D-C10-1 idempotence + D-C10-3 ManifestCoverageError + lockfile race-condition mitigation (epic § Acceptance C10-IT-01, C10-IT-03, C10-IT-04; description.md § 1, § 5, § 7).
- **Production code that must exist**: real `CacheProvisioner` orchestrating real AZ-321/322/323 + real `filelock`-backed lock + real coverage walk + real rollback.
- **Allowed external stubs**: tests MAY use spy/fake versions of AZ-321/322/323 (already produced by their conformance tests) + an in-process `FileLockFactory` for deterministic concurrency tests.
- **Unacceptable substitutes**: skipping the lockfile (defeats CP-INV-4); skipping the coverage walk (defeats D-C10-3); a "soft" idempotence that rebuilds anyway (defeats D-C10-1 and the C10-PT-01 1-min warm target); calling AZ-324's `ManifestVerifier` for the idempotence check (overkill: full verification on every operator invocation triples warm-path cost); deleting partial engines on failure (operators rely on them for diagnostics per AC-9).
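The in-process `FileLockFactory` permitted for tests might be sketched like this. The class name and the context-manager shape of `try_lock` are assumptions; only the method name and its `timeout_s` parameter come from this document.

```python
import threading
from contextlib import contextmanager
from pathlib import Path
from typing import Dict, Iterator


class BuildLockHeldError(RuntimeError):
    """Stand-in for the error defined in errors.py."""


class InProcessFileLockFactory:
    """Test double: one non-reentrant lock per path, no filesystem access."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks: Dict[Path, threading.Lock] = {}

    @contextmanager
    def try_lock(self, path: Path, timeout_s: float) -> Iterator[None]:
        with self._guard:
            lock = self._locks.setdefault(Path(path), threading.Lock())
        # threading.Lock is non-reentrant, so a second acquisition from the
        # same thread times out, which makes contention tests deterministic.
        if not lock.acquire(timeout=timeout_s):
            raise BuildLockHeldError(f"lock held: {path}")
        try:
            yield
        finally:
            lock.release()
```

A concurrency test can hold the lock from the test body and then invoke the provisioner, expecting `BuildLockHeldError` without spawning a second process.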