C10 CacheProvisioner — Idempotent Orchestrator + ManifestCoverageError
Task: AZ-325_c10_cache_provisioner
Name: C10 CacheProvisioner
Description: Implement CacheProvisioner (per the contract _docs/02_document/contracts/c10_provisioning/cache_provisioner.md), the public top-level orchestrator that composes AZ-321 (EngineCompiler), AZ-322 (DescriptorBatcher), and AZ-323 (ManifestBuilder) into a single idempotent F1 build pipeline. Acquires a cache_root/.c10.lock filesystem lockfile to enforce CP-INV-4. Computes the build-identity hash from the same canonical inputs AZ-323 hashes (model_ids + calibration_sha256 + tiles_coverage_sha256 + sector_class + bbox + zoom_levels) and compares to the existing Manifest.json's manifest_hash; on match → outcome=IDEMPOTENT_NO_OP. On mismatch (or no prior Manifest) → run engine compile → descriptor population → Manifest build, then walk cache_root to confirm every file is listed in the new Manifest's artifacts section, raising ManifestCoverageError on orphans (with rollback to prior-good Manifest). Empty corpus → BuildReport(outcome=FAILURE, failure_reason="run C11 TileDownloader first") per description.md § 5.
Complexity: 3 points
Dependencies: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-303_c6_storage_interfaces, AZ-321_c10_engine_compiler, AZ-322_c10_descriptor_batcher, AZ-323_c10_manifest_builder
Component: c10_provisioning (epic AZ-252 / E-C10)
Tracker: AZ-325
Epic: AZ-252 (E-C10)
Document Dependencies
- `_docs/02_document/contracts/c10_provisioning/cache_provisioner.md`: produced by this task (frozen Protocol + DTO shape, invariants, test cases).
- `_docs/02_document/components/11_c10_provisioning/description.md`: § 1 idempotence, § 5 error handling, § 7 lockfile race-condition mitigation.
Problem
Without a real orchestrator:
- D-C10-1 (idempotent re-run via manifest hash) cannot be enforced: every operator invocation re-compiles every engine, blowing the C10-PT-01 ≤ 1 min warm target.
- D-C10-3 (`ManifestCoverageError` on orphan files / no smuggled artifacts) is unobservable: partial-build leftovers and out-of-band file drops at takeoff time go undetected.
- C10-IT-03 (idempotent re-run: same hash, no recompile) cannot be implemented.
- C10-IT-04 (`ManifestCoverageError` on orphan files) cannot be implemented.
- The race-condition mitigation per description.md § 7 (filesystem lockfile) has no producer.
- C12 OperatorTooling (E-C12) has no surface to call; its `c10 build` CLI command is a one-liner only after this task ships.
- The "missing tiles in C6" failure path (description.md § 5) has no surface; operators would see a stack trace from AZ-322 instead of a clear `failure_reason` directing them to C11.
This task delivers the orchestrator + its frozen contract. It does NOT compile engines (AZ-321), embed tiles (AZ-322), build Manifests (AZ-323), or verify at takeoff (AZ-324).
Outcome
- A `CacheProvisioner` class implementation at `src/gps_denied_onboard/components/c10_provisioning/provisioner.py` matching the Protocol in the contract.
- Constructor: `__init__(self, *, engine_compiler: EngineCompiler, descriptor_batcher: DescriptorBatcher, manifest_builder: ManifestBuilder, tile_metadata_store: TileMetadataStore, lock_factory: FileLockFactory, logger: Logger, clock: Clock, config: C10ProvisionerConfig)`.
- `C10ProvisionerConfig` (`@dataclass(frozen=True)`): `coverage_strict: bool = True`, `lock_timeout_s: float = 5.0`, `manifest_filename: str = "Manifest.json"`.
- Method `build_cache_artifacts(request: BuildRequest) -> BuildReport` flow:
  - Lock acquisition (CP-INV-4):
    - Path: `request.cache_root / ".c10.lock"`.
    - Acquire via `lock_factory.try_lock(path, timeout_s=config.lock_timeout_s)`: non-blocking with a short timeout to surface concurrent invocations as `BuildLockHeldError`.
  - Tile gathering: call `tile_metadata_store.query_by_bbox(bbox, zoom_levels, sector_class)`.
    - If empty → return `BuildReport(outcome=FAILURE, failure_reason="no tiles in C6 for the requested scope; run C11 TileDownloader first", engines_built=0, ...)`. ERROR log; release lock.
  - Build-identity hash for idempotence check:
    - Compute `request_hash = sha256(canonical_json(model_ids + calibration_sha256 + tiles_coverage_sha256 + sector_class + bbox + zoom_levels))`. The `model_ids` come from the configured backbone list; `calibration_sha256` from streaming the calibration_path; `tiles_coverage_sha256` from sorting the tile rows by `(zoom, lat, lon, source)` and hashing per AZ-323's algorithm.
    - Read existing `Manifest.json` if present; parse only the `build.manifest_hash` field (don't run full verification; that's AZ-324's job). If `existing.manifest_hash == request_hash` → return `BuildReport(outcome=IDEMPOTENT_NO_OP, manifest_hash=existing.manifest_hash, manifest_path=existing_path, engines_built=0, engines_reused=0, descriptors_generated=0, elapsed_s, failure_reason=None)`. INFO log; release lock.
  - Active build path:
    - Snapshot prior-good Manifest (rename to `Manifest.json.prev` if present) for rollback.
    - Compose engine compile request from configured backbones; call `engine_compiler.compile_engines_for_corpus(...)` → `engine_entries`.
    - Compose descriptor populate request (filter, callback hooked to logger); call `descriptor_batcher.populate_descriptors(...)` → `DescriptorBatchReport`. If `outcome=failure` → restore prior Manifest, release lock, return `BuildReport(outcome=FAILURE, failure_reason=batch.failure_reason, ...)`.
    - Compose Manifest build input from engine entries + descriptor index path + calibration + key_path; call `manifest_builder.build_manifest(...)` → `ManifestArtifact`.
  - Coverage check (CP-INV-3 / D-C10-3):
    - Walk `cache_root` recursively (`pathlib.Path.rglob`); collect every regular file path EXCLUDING `Manifest.json`, `Manifest.json.sha256`, `Manifest.json.sig`, `Manifest.json.prev`, `.c10.lock`, and any `.sha256` sidecar (sidecars are implicit per the AZ-280 pattern, paired with their primary).
    - Build expected set: every `path` in `manifest.artifacts.engines + descriptor_index + calibration` (resolved relative to `cache_root`).
    - `orphans = walked - expected`.
    - If `orphans` non-empty AND `config.coverage_strict`:
      - Restore prior Manifest from `Manifest.json.prev` (delete current Manifest; rename prev back). If no prev existed, leave the new Manifest in place but raise.
      - Raise `ManifestCoverageError(f"orphan files in cache_root: {sorted(orphans)}")`. ERROR log.
    - If `orphans` non-empty AND NOT `coverage_strict`: WARN log with the orphan list; continue.
  - Cleanup: delete `Manifest.json.prev` if present; release lock.
  - Return `BuildReport(outcome=SUCCESS, engines_built, engines_reused, descriptors_generated, manifest_hash, manifest_path, failure_reason=None, elapsed_s)`.
- Method `compile_engines_for_corpus(request)` is a thin passthrough to `engine_compiler.compile_engines_for_corpus(request)` (per CP-TC-11; lets operators run engine-only re-compiles for D-C10-6 hardware-change scenarios without redoing descriptors).
- A `FileLockFactory` Protocol + a default `filelock`-library-backed impl (use `filelock` package, already pinned via shared helpers if present; if not, add to deps with a single pinned version).
- INFO logs on lock acquired / released, build start/end, idempotent no-op; ERROR on coverage error / build failure; WARN on non-strict coverage drift.
Scope
Included
- `CacheProvisioner` class implementing the Protocol from the contract.
- The contract document (frozen at v1.0.0).
- Filesystem lockfile (`FileLockFactory` Protocol + `filelock`-backed default impl).
- Idempotence check (parse existing Manifest's `manifest_hash` only; no full verify).
- Coverage walk + `ManifestCoverageError` with rollback to prior Manifest.
- Empty-corpus handling with explicit hint to run C11.
- `compile_engines_for_corpus` passthrough.
- Composition-root factory `build_cache_provisioner(config) -> CacheProvisioner`.
- Conformance test for the contract Protocol.
Excluded
- The internal phases (AZ-321, AZ-322, AZ-323).
- Manifest verification at takeoff (AZ-324).
- Operator CLI / tooling (E-C12).
- C13 FDR emissions (build is offline).
- Resumable mid-build state (out of scope; restart from scratch).
- GC of stale engines (operator action).
- Multi-cache rotation.
Acceptance Criteria
AC-1: Cold build composes phases and writes Manifest
Given an empty cache_root and C6 populated with tiles for the requested scope
When build_cache_artifacts(request) is called
Then outcome=SUCCESS; engines_built > 0; descriptors_generated > 0; Manifest.json + Manifest.json.sig + Manifest.json.sha256 exist; BuildReport.manifest_hash matches the on-disk Manifest's build.manifest_hash; elapsed_s is positive
AC-2: Warm idempotent re-run skips everything
Given a prior successful build at the same cache_root with the same identity tuple
When build_cache_artifacts is called with an identical request
Then outcome=IDEMPOTENT_NO_OP; engines_built=0, engines_reused=0, descriptors_generated=0; ZERO calls to engine_compiler.compile_engines_for_corpus (verifiable via spy); ZERO calls to descriptor_batcher.populate_descriptors; ZERO calls to manifest_builder.build_manifest; the on-disk Manifest is byte-identical (mtime unchanged)
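
A minimal sketch of the read side of this check, assuming the Manifest is a JSON object with a `build` object that carries `manifest_hash`. The helper names `read_prior_manifest_hash` and `is_idempotent_no_op` are hypothetical, and treating a missing or unparseable Manifest as "must rebuild" is an assumption rather than something the contract states. The file is only read, never rewritten, which is what keeps it byte-identical on the no-op path.

```python
import json
from pathlib import Path
from typing import Optional


def read_prior_manifest_hash(manifest_path: Path) -> Optional[str]:
    """Return build.manifest_hash from an existing Manifest, or None."""
    try:
        doc = json.loads(manifest_path.read_text(encoding="utf-8"))
    except (FileNotFoundError, ValueError):
        # No prior Manifest, or one that is not valid JSON/UTF-8:
        # assumed to mean "rebuild" (json.JSONDecodeError is a ValueError).
        return None
    build = doc.get("build") if isinstance(doc, dict) else None
    value = build.get("manifest_hash") if isinstance(build, dict) else None
    return value if isinstance(value, str) else None


def is_idempotent_no_op(manifest_path: Path, request_hash: str) -> bool:
    # Only the hash field is compared; signature and sidecar checks are
    # deliberately left to the takeoff-time verifier (AZ-324).
    return read_prior_manifest_hash(manifest_path) == request_hash
```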
AC-3: Different bbox triggers full rebuild and atomic replacement
Given a prior Manifest at the cache_root for bbox A
When build_cache_artifacts is called with bbox B (B ≠ A)
Then outcome=SUCCESS; the new Manifest replaces the old (atomic via AZ-280); old Manifest.json.prev is cleaned up after coverage passes; manifest_hash differs from the prior
AC-4: Empty corpus surfaces failure with operator hint
Given C6 has zero tiles for the requested scope
When build_cache_artifacts is called
Then outcome=FAILURE; failure_reason contains "C11 TileDownloader"; ZERO compile / embed / Manifest calls; lock IS released (no leaked lockfile)
AC-5: Concurrent invocation raises BuildLockHeldError
Given another invocation holds .c10.lock
When a second build_cache_artifacts runs
Then BuildLockHeldError is raised within lock_timeout_s; the existing build is unaffected; the existing lockfile is NOT deleted
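
For deterministic concurrency tests, an in-process lock factory is one way to exercise this criterion without touching the filesystem. The sketch below is an assumption-laden test double: the task fixes only the `try_lock(path, timeout_s=...)` call shape and the `BuildLockHeldError` name, so the `LockHandle` return type with a `release()` method and the `InProcessLockFactory` class are hypothetical.

```python
import threading
from pathlib import Path
from typing import Dict, Protocol


class BuildLockHeldError(RuntimeError):
    """Raised when another invocation already holds the build lock."""


class LockHandle(Protocol):  # hypothetical return type of try_lock
    def release(self) -> None: ...


class FileLockFactory(Protocol):
    def try_lock(self, path: Path, *, timeout_s: float) -> LockHandle: ...


class _Handle:
    def __init__(self, lock: threading.Lock) -> None:
        self._lock = lock

    def release(self) -> None:
        self._lock.release()


class InProcessLockFactory:
    """Test double: one threading.Lock per path, no filesystem access."""

    def __init__(self) -> None:
        self._locks: Dict[Path, threading.Lock] = {}
        self._guard = threading.Lock()

    def try_lock(self, path: Path, *, timeout_s: float) -> _Handle:
        with self._guard:
            lock = self._locks.setdefault(path, threading.Lock())
        # Wait at most timeout_s, then surface contention as an error.
        if not lock.acquire(timeout=timeout_s):
            raise BuildLockHeldError(f"build lock already held: {path}")
        return _Handle(lock)
```

Because the second caller never obtains the lock, it has nothing to release, so the first holder's lock stays intact.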
AC-6: ManifestCoverageError rolls back to prior Manifest
Given a prior-good Manifest exists; a build is run; before the coverage walk, an orphan file cache_root/leftover.bin is dropped (simulated)
When the coverage walk runs in strict mode
Then ManifestCoverageError(...) is raised naming the orphan; Manifest.json on disk is the prior-good one (prev was restored); ERROR log
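
The coverage walk behind this criterion can be sketched as a set difference. The helper names `find_orphans` and `check_coverage` are hypothetical, and two details are assumptions: expected paths are POSIX-style strings relative to `cache_root`, and the bookkeeping files are excluded only at the top level of `cache_root`. Rollback is omitted here; the sketch covers only detection and the strict/non-strict split.

```python
from pathlib import Path
from typing import Iterable, Set

# Bookkeeping files skipped by the walk (names taken from the task text).
EXCLUDED_NAMES = frozenset({
    "Manifest.json", "Manifest.json.sha256", "Manifest.json.sig",
    "Manifest.json.prev", ".c10.lock",
})


class ManifestCoverageError(Exception):
    """Raised when cache_root contains files the Manifest does not list."""


def find_orphans(cache_root: Path, expected: Iterable[str]) -> Set[str]:
    """Return relative paths present on disk but absent from `expected`."""
    expected_set = set(expected)
    walked: Set[str] = set()
    for path in cache_root.rglob("*"):
        if not path.is_file():
            continue  # directories are not artifacts
        rel = path.relative_to(cache_root).as_posix()
        if rel in EXCLUDED_NAMES or path.suffix == ".sha256":
            continue  # Manifest family, lockfile, and .sha256 sidecars
        walked.add(rel)
    return walked - expected_set


def check_coverage(cache_root: Path, expected: Iterable[str], *, strict: bool = True) -> Set[str]:
    orphans = find_orphans(cache_root, expected)
    if orphans and strict:
        raise ManifestCoverageError(f"orphan files in cache_root: {sorted(orphans)}")
    return orphans  # non-strict callers log these at WARN and continue
```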
AC-7: Coverage non-strict mode warns but continues
Given coverage_strict=False and an orphan
When the build completes
Then outcome=SUCCESS; ONE WARN log naming the orphan; the new Manifest is on disk
AC-8: Lock released on every exit path
Given any of: success / failure / IDEMPOTENT_NO_OP / ManifestCoverageError / EngineBuildError propagation
When build_cache_artifacts returns or raises
Then cache_root/.c10.lock is removed (or unlocked if the implementation uses fcntl); a subsequent call succeeds (no leftover lock)
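
One way to get lock release on every exit path is a single `try/finally` around the whole build body, sketched below. `RecordingLock` and `run_locked` are hypothetical names used only for illustration; the handle is assumed to expose a `release()` method.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class RecordingLock:
    """Hypothetical stand-in for the handle returned by lock_factory.try_lock."""

    def __init__(self) -> None:
        self.held = True

    def release(self) -> None:
        self.held = False


def run_locked(acquire: Callable[[], RecordingLock], body: Callable[[], T]) -> T:
    # If acquire() raises (e.g. BuildLockHeldError) there is nothing to
    # release, so it stays outside the try block.
    lock = acquire()
    try:
        # Covers SUCCESS, FAILURE reports, and IDEMPOTENT_NO_OP returns.
        return body()
    finally:
        # Also runs when body() raises, e.g. ManifestCoverageError.
        lock.release()
```

Keeping the early returns inside `body` means no individual branch has to remember to release the lock itself.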
AC-9: Hard errors propagate without state corruption
Given engine_compiler.compile_engines_for_corpus raises EngineBuildError
When build_cache_artifacts runs
Then the error propagates; on-disk Manifest is the prior-good one (prev restored); lock is released; partial engines that AZ-321 wrote ARE on disk (not deleted — operators may want them for diagnostic)
AC-10: compile_engines_for_corpus passthrough
Given a request configured for engine-only re-compile
When compile_engines_for_corpus(req) is called directly
Then engine_compiler.compile_engines_for_corpus(req) is invoked once with the same request; the return value is forwarded as a tuple; no lock is acquired (this is a thin diagnostic-mode call)
AC-11: Conformance — isinstance returns True
Given the implementation
When isinstance(impl, CacheProvisioner) is checked under runtime_checkable
Then True
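
A conformance check of this kind might look like the sketch below. The method names come from this task; the `Any` parameter and return types are simplifications, and `FakeProvisioner` and `MissingPassthrough` are hypothetical classes. Note that `isinstance` against a `runtime_checkable` Protocol only checks that the members exist, not their signatures, so signature drift still needs a type checker or dedicated tests.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CacheProvisioner(Protocol):
    # Types simplified to Any for this sketch; the frozen contract
    # defines the real BuildRequest / BuildReport shapes.
    def build_cache_artifacts(self, request: Any) -> Any: ...
    def compile_engines_for_corpus(self, request: Any) -> Any: ...


class FakeProvisioner:
    def build_cache_artifacts(self, request: Any) -> Any:
        return "report"

    def compile_engines_for_corpus(self, request: Any) -> Any:
        return ()


class MissingPassthrough:
    # Lacks compile_engines_for_corpus, so it should not conform.
    def build_cache_artifacts(self, request: Any) -> Any:
        return "report"
```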
AC-12: Cold build benchmark within C10-PT-01 envelope
Given Tier-1 dev workstation with NVIDIA GPU + a 1000-tile corpus + 3 backbones
When a cold build runs
Then wall-clock ≤ 12 min (CP-TC-12 / NFR C10-PT-01); WARN log if exceeded (so operators see the regression in CI)
AC-13: Warm idempotent benchmark within C10-PT-01 envelope
Given a populated cache and identical request
When build_cache_artifacts runs
Then wall-clock ≤ 1 min (CP-TC-13 / NFR C10-PT-01); the bound work is the build-identity hash computation, which is dominated by tiles_coverage_sha256 over 1000 tiles (~5 ms hashing)
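
To illustrate why the warm path is cheap, here is a minimal sketch of a sorted-row coverage hash. It is not AZ-323's algorithm: the `TileRow` shape, the `content_sha256` field, and the `|`-separated line encoding are all assumptions made for illustration. Only the sort key `(zoom, lat, lon, source)` comes from this task.

```python
import hashlib
from typing import Iterable, NamedTuple


class TileRow(NamedTuple):  # hypothetical row shape
    zoom: int
    lat: float
    lon: float
    source: str
    content_sha256: str  # assumed per-tile content digest


def tiles_coverage_sha256(rows: Iterable[TileRow]) -> str:
    """Hash tile rows in a fixed order so query order does not matter."""
    digest = hashlib.sha256()
    for row in sorted(rows, key=lambda r: (r.zoom, r.lat, r.lon, r.source)):
        # One line per tile; repr() keeps float formatting stable.
        line = f"{row.zoom}|{row.lat!r}|{row.lon!r}|{row.source}|{row.content_sha256}\n"
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()
```

Sorting before hashing makes the digest independent of the order in which the store returns rows, which the idempotence check depends on.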
Non-Functional Requirements
Performance
- Cold path is bound by AZ-321 + AZ-322 (per their NFRs); this orchestrator adds ≤ 5 s coordination overhead.
- Warm path: build-identity hash + Manifest read + idempotence compare ≤ 5 s on Tier-1 dev workstation (1000-tile corpus).
- Coverage walk: O(N files); ≤ 1 s for ≤ 10k files in cache_root.
Compatibility
- `filelock` library: pin via `requirements.txt`. (Verify already present from a prior task's deps; if not, add. Same version across all C10 tasks.)
- `orjson` (already pinned via AZ-272), `hashlib` (stdlib), `pathlib` (stdlib).
Reliability
- CP-INV-2: failed build never leaves the cache in a worse state than at start.
- Lock release on every exit path (try/finally).
- Atomic Manifest replacement (rename prev → current rollback semantics); coverage error rolls back automatically.
- No silent failures: every error path logs at ERROR level with diagnostic.
Unit Tests
| AC Ref | What to Test | Required Outcome |
|---|---|---|
| AC-1 | Cold build with fakes for phases | All phases called once; SUCCESS |
| AC-2 | Warm re-run with identical request | IDEMPOTENT_NO_OP; zero phase calls |
| AC-3 | Different bbox after prior build | SUCCESS; atomic replace; old Manifest gone |
| AC-4 | Empty C6 query | FAILURE; hint string; lock released |
| AC-5 | Pre-acquire lock externally; run | BuildLockHeldError |
| AC-6 | Inject orphan file before coverage walk | ManifestCoverageError; prior Manifest restored |
| AC-7 | Same as AC-6 with `coverage_strict=False` | SUCCESS; WARN log |
| AC-8 | Each error path | Lock released after each |
| AC-9 | engine_compiler raises | Error propagates; rollback; lock released |
| AC-10 | Direct call to compile_engines_for_corpus | Single passthrough; no lock |
| AC-11 | Conformance | True |
| AC-12 | Cold build bench (skipped on CI; manual) | ≤ 12 min |
| AC-13 | Warm bench | ≤ 1 min |
| NFR-perf-coverage-walk | 10k files in cache_root | ≤ 1 s |
Constraints
- The orchestrator does NOT touch `satellite-provider` (CP-INV-6); all I/O is local.
- Lockfile is mandatory; bypassing the lock for testing is a config flag, NOT a separate code path.
- Idempotence check parses ONLY `build.manifest_hash` from the existing Manifest; full verification is AZ-324's job (separate code path).
- `Manifest.json.prev` is the rollback target; never two prevs deep (rebuilds are not stack-able).
- Coverage walk EXCLUDES the lockfile, the Manifest itself, its sidecar, its signature, and any `.prev` rollback file.
- The orchestrator never modifies engines compiled by AZ-321 (atomic on disk); it only touches the Manifest + .prev/.lock files.
- Operator key handling delegates entirely to AZ-323 (CP-INV-7).
- This task introduces at most ONE new third-party dependency (`filelock`); verify against existing deps first.
Risks & Mitigation
Risk 1: Stale lockfile after process kill
- Risk: A SIGKILL'd build leaves `.c10.lock` on disk; subsequent runs always raise `BuildLockHeldError`.
- Mitigation: Use the `filelock` library, which on POSIX takes an fcntl flock that the OS releases when the process exits, so a leftover `.c10.lock` file does not by itself block the next run. On platforms without fcntl, document the manual cleanup step. AC-5 + AC-8 cover normal lock release; the SIGKILL case relies on that OS-level release rather than on cleanup code.
Risk 2: Coverage walk slow on huge cache_root
- Risk: 100k files in cache_root → coverage walk could take seconds.
- Mitigation: NFR-perf-coverage-walk benchmark; if exceeded, switch to streaming compare with a sorted Manifest path list. Out of scope for the initial impl.
Risk 3: Idempotence check trusts prior Manifest's hash without verifying signature
- Risk: A tampered Manifest could lie about its `manifest_hash`, fooling the orchestrator into IDEMPOTENT_NO_OP and skipping a needed rebuild.
- Mitigation: This is acceptable because AZ-324's `ManifestVerifier` runs at takeoff: a tampered Manifest fails verify and prevents arming. The orchestrator's role is to AVOID rebuilds when nothing changed; trusting `manifest_hash` is a performance optimization, not a security check. Documented in CP-INV-1.
Risk 4: Empty coverage_strict=False becomes the de-facto default
- Risk: Operators set `coverage_strict=False` to ship faster, defeating D-C10-3.
- Mitigation: Default is True; the config flag is documented as "for forensic builds only"; CI runs always assert strict.
Risk 5: Rollback corrupts state on partial coverage walk failure
- Risk: If the `Manifest.json.prev` rename fails (e.g., disk full), the cache is left in an in-between state.
- Mitigation: Use AZ-280's atomic rename helper; if the rename itself fails, surface a distinct `ManifestRollbackError` (subclass of `ManifestCoverageError`) so operators see the disk-level cause. Documented but not a blocker for v1.0.0.
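
The snapshot/restore pair and the distinct rollback error could be sketched as below. This uses the stdlib `os.replace` as a stand-in for AZ-280's atomic rename helper, whose interface is not shown in this task; `snapshot_prior` and `restore_prior` are hypothetical helper names.

```python
import os
from pathlib import Path


class ManifestCoverageError(Exception):
    """Raised when cache_root contains files the Manifest does not list."""


class ManifestRollbackError(ManifestCoverageError):
    """Rollback itself failed; the underlying OSError is chained as the cause."""


def snapshot_prior(manifest: Path) -> bool:
    """Rename Manifest.json to Manifest.json.prev; return True if one existed."""
    prev = manifest.with_name(manifest.name + ".prev")
    if not manifest.exists():
        return False
    os.replace(manifest, prev)  # stand-in for the AZ-280 helper
    return True


def restore_prior(manifest: Path) -> None:
    """Move Manifest.json.prev back over Manifest.json in a single rename."""
    prev = manifest.with_name(manifest.name + ".prev")
    try:
        os.replace(prev, manifest)  # overwrites the new Manifest if present
    except OSError as exc:
        raise ManifestRollbackError(f"could not restore {prev}: {exc}") from exc
```

Because `ManifestRollbackError` subclasses `ManifestCoverageError`, existing `except ManifestCoverageError` handlers still catch it, while callers that care about the disk-level cause can match the subclass first.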
Risk 6: Lock acquisition races with operator's manual file ops
- Risk: Operator manually edits a file in cache_root while a build is running.
- Mitigation: Coverage walk happens at the end of build; if operator drops a file mid-build, AC-6 catches it. The lockfile prevents two CONCURRENT builds, not operator-vs-build interference. Documented.
Runtime Completeness
- Named capability: top-level F1 cache build with D-C10-1 idempotence + D-C10-3 ManifestCoverageError + lockfile race-condition mitigation (epic § Acceptance C10-IT-01, C10-IT-03, C10-IT-04; description.md § 1, § 5, § 7).
- Production code that must exist: real `CacheProvisioner` orchestrating real AZ-321/322/323 + real `filelock`-backed lock + real coverage walk + real rollback.
- Allowed external stubs: tests MAY use spy/fake versions of AZ-321/322/323 (already produced by their conformance tests) + an in-process `FileLockFactory` for deterministic concurrency tests.
- Unacceptable substitutes: skipping the lockfile (defeats CP-INV-4); skipping the coverage walk (defeats D-C10-3); a "soft" idempotence that re-builds anyway (defeats D-C10-1 and the C10-PT-01 1-min warm target); calling AZ-324's `ManifestVerifier` for the idempotence check (overkill: full verify on every operator invocation triples warm-path cost); deleting partial engines on failure (operators rely on them for diagnostics per AC-9).