gps-denied-onboard/_docs/02_tasks/todo/AZ-325_c10_cache_provisioner.md
Oleksandr Bezdieniezhnykh e0be591b06 [AZ-489] [AZ-490] ADR-010 design pass: operator-mission as cold-start anchor
Architecture, contracts, and task amendments for the flight-route-driven
preflight + cold-start origin feature (ADR-010). No source code touched
in this commit; the implementation commits for AZ-489 / AZ-490 / AZ-419
land separately.

* architecture.md: ADR-010, new Principle #14, amended Principle #11,
  external systems gain flights service + Mission Planner UI, data
  model gains Flight / Waypoint / TakeoffOrigin.
* system-flows.md: F1 gains phase 0 (Flight resolve), F2 gains
  cold-start ladder, F7 gains mid-flight bounded-delta GPS gate.
* glossary.md: Flight, Flights API, Mid-flight bounded-delta GPS gate,
  Mission Planner UI, Takeoff origin, Waypoint.
* C10: description + cache_provisioner + manifest_verifier bumped to
  v1.1 carrying takeoff_origin + flight_id in the manifest hash.
* C12: description updated + new flights_api_client.md contract v1.0.
* C5: description + state_estimator_protocol bumped to v1.1 with
  set_takeoff_origin + 3-clause spoof-promotion gate.
* AZ-323/324/325/326/328/419 amended in place. AZ-490 spec created
  (C5 set_takeoff_origin entrypoint).
* Dependencies table: 142 tasks / 478 pts / 15 forward edges
  (2 new tasks, 2 backward deps, 2 forward deps from AZ-419).
* Leftovers cleared: 2026-05-11 Jira transition entries for AZ-355
  and AZ-386 are deleted (Jira reconnected; both already transitioned
  in their respective implementation commits).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 01:28:05 +03:00

C10 CacheProvisioner — Idempotent Orchestrator + ManifestCoverageError

Task: AZ-325_c10_cache_provisioner
Name: C10 CacheProvisioner
Description: Implement CacheProvisioner (per the contract _docs/02_document/contracts/c10_provisioning/cache_provisioner.md v1.1.0), the public top-level orchestrator that composes AZ-321 (EngineCompiler), AZ-322 (DescriptorBatcher), and AZ-323 (ManifestBuilder) into a single idempotent F1 build pipeline. Acquires a cache_root/.c10.lock filesystem lockfile to enforce CP-INV-4. Computes the build-identity hash from the same canonical inputs AZ-323 hashes (model_ids + calibration_sha256 + tiles_coverage_sha256 + sector_class + bbox + zoom_levels + takeoff_origin + flight_id) and compares it to the existing Manifest.json's manifest_hash; on match → outcome=IDEMPOTENT_NO_OP. On mismatch (or no prior Manifest) → run engine compile → descriptor population → Manifest build (passing request.takeoff_origin and request.flight_id to AZ-323), then walk cache_root to confirm every file is listed in the new Manifest's artifacts section, raising ManifestCoverageError on orphans (with rollback to prior-good Manifest). Empty corpus → BuildReport(outcome=FAILURE, failure_reason="run C11 TileDownloader first") per description.md § 5. A request whose takeoff_origin differs from the prior Manifest's by ≥ 1 mm is treated as a new build identity (CP-INV-8); this is the contract that lets ManifestVerifier reject a re-planned route at boot.
Complexity: 3 points
Dependencies: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-303_c6_storage_interfaces, AZ-321_c10_engine_compiler, AZ-322_c10_descriptor_batcher, AZ-323_c10_manifest_builder
Component: c10_provisioning (epic AZ-252 / E-C10)
Tracker: AZ-325
Epic: AZ-252 (E-C10)

Document Dependencies

  • _docs/02_document/contracts/c10_provisioning/cache_provisioner.md — produced by this task (frozen Protocol + DTO shape, invariants, test cases).
  • _docs/02_document/components/11_c10_provisioning/description.md — § 1 idempotence, § 5 error handling, § 7 lockfile race-condition mitigation.

Problem

Without a real orchestrator:

  • D-C10-1 (idempotent re-run via manifest hash) cannot be enforced — every operator invocation re-compiles every engine, blowing the C10-PT-01 ≤ 1 min warm target.
  • D-C10-3 (ManifestCoverageError on orphan files / no smuggled artifacts) is unobservable — partial-build leftovers and out-of-band file drops at takeoff time go undetected.
  • C10-IT-03 (idempotent re-run — same hash, no recompile) cannot be implemented.
  • C10-IT-04 (ManifestCoverageError on orphan files) cannot be implemented.
  • The race-condition mitigation per description.md § 7 (filesystem lockfile) has no producer.
  • C12 OperatorTooling (E-C12) has no surface to call — its c10 build CLI command is a one-liner only after this task ships.
  • The "missing tiles in C6" failure path (description.md § 5) has no surface — operators would see a stack trace from AZ-322 instead of a clear failure_reason directing them to C11.

This task delivers the orchestrator + its frozen contract. It does NOT compile engines (AZ-321), embed tiles (AZ-322), build Manifests (AZ-323), or verify at takeoff (AZ-324).

Outcome

  • A CacheProvisioner class implementation at src/gps_denied_onboard/components/c10_provisioning/provisioner.py matching the Protocol in the contract.
  • Constructor: __init__(self, *, engine_compiler: EngineCompiler, descriptor_batcher: DescriptorBatcher, manifest_builder: ManifestBuilder, tile_metadata_store: TileMetadataStore, lock_factory: FileLockFactory, logger: Logger, clock: Clock, config: C10ProvisionerConfig).
  • C10ProvisionerConfig (@dataclass(frozen=True)): coverage_strict: bool = True, lock_timeout_s: float = 5.0, manifest_filename: str = "Manifest.json".
  • Method build_cache_artifacts(request: BuildRequest) -> BuildReport flow:
    1. Lock acquisition (CP-INV-4):
      • Path: request.cache_root / ".c10.lock".
      • Acquire via lock_factory.try_lock(path, timeout_s=config.lock_timeout_s): a bounded wait with a short timeout, so a concurrent invocation surfaces as BuildLockHeldError instead of blocking indefinitely.
    2. Tile gathering: call tile_metadata_store.query_by_bbox(bbox, zoom_levels, sector_class).
      • If empty → return BuildReport(outcome=FAILURE, failure_reason="no tiles in C6 for the requested scope; run C11 TileDownloader first", engines_built=0, ...). ERROR log; release lock.
    3. Build-identity hash for idempotence check:
      • Compute request_hash = sha256(canonical_json(model_ids + calibration_sha256 + tiles_coverage_sha256 + sector_class + bbox + zoom_levels + takeoff_origin_tuple_or_none + flight_id_or_none)). The model_ids come from the configured backbone list; calibration_sha256 from streaming the calibration_path; tiles_coverage_sha256 from sorting the tile rows by (zoom, lat, lon, source) and hashing per AZ-323's algorithm. takeoff_origin_tuple_or_none is (lat_deg, lon_deg, alt_m) rounded to 9 decimal places when request.takeoff_origin is not None, otherwise the JSON null sentinel (CP-INV-8). The hashing formula MUST match AZ-323 exactly so AZ-325's idempotence decision agrees with AZ-323's emitted build.manifest_hash.
      • Read existing Manifest.json if present; parse only the build.manifest_hash field (don't run full verification — that's AZ-324's job). If existing.manifest_hash == request_hash → return BuildReport(outcome=IDEMPOTENT_NO_OP, manifest_hash=existing.manifest_hash, manifest_path=existing_path, engines_built=0, engines_reused=0, descriptors_generated=0, elapsed_s, failure_reason=None). INFO log; release lock.
    4. Active build path:
      • Snapshot prior-good Manifest (rename to Manifest.json.prev if present) for rollback.
      • Compose engine compile request from configured backbones; call engine_compiler.compile_engines_for_corpus(...) → engine_entries.
      • Compose descriptor populate request (filter, callback hooked to logger); call descriptor_batcher.populate_descriptors(...) → DescriptorBatchReport. If outcome=failure → restore prior Manifest, release lock, return BuildReport(outcome=FAILURE, failure_reason=batch.failure_reason, ...).
      • Compose Manifest build input from engine entries + descriptor index path + calibration + key_path + request.takeoff_origin + request.flight_id (ADR-010); call manifest_builder.build_manifest(...) → ManifestArtifact. Both fields default to None when the caller did not supply them (e.g., legacy C12 invocation without --flight-id).
    5. Coverage check (CP-INV-3 / D-C10-3):
      • Walk cache_root recursively (pathlib.Path.rglob); collect every regular file path EXCLUDING Manifest.json, Manifest.json.sha256, Manifest.json.sig, Manifest.json.prev, .c10.lock, and any .sha256 sidecar (sidecars are implicit per the AZ-280 pattern, paired with their primary).
      • Build expected set: every path in manifest.artifacts.engines + descriptor_index + calibration (resolved relative to cache_root).
      • orphans = walked - expected.
      • If orphans non-empty AND config.coverage_strict:
        • Restore prior Manifest from Manifest.json.prev (delete current Manifest; rename prev back). If no prev existed, leave the new Manifest in place but raise.
        • Raise ManifestCoverageError(f"orphan files in cache_root: {sorted(orphans)}"). ERROR log.
      • If orphans non-empty AND NOT coverage_strict: WARN log with the orphan list; continue.
    6. Cleanup: delete Manifest.json.prev if present; release lock.
    7. Return BuildReport(outcome=SUCCESS, engines_built, engines_reused, descriptors_generated, manifest_hash, manifest_path, failure_reason=None, elapsed_s).
  • Method compile_engines_for_corpus(request) is a thin passthrough to engine_compiler.compile_engines_for_corpus(request) (per CP-TC-11; lets operators run engine-only re-compiles for D-C10-6 hardware-change scenarios without redoing descriptors).
  • A FileLockFactory Protocol plus a default filelock-backed implementation (use the filelock package, already pinned via shared helpers if present; if not, add it to deps with a single pinned version).
  • INFO logs on lock acquired / released, build start/end, idempotent no-op; ERROR on coverage error / build failure; WARN on non-strict coverage drift.
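
The configuration and report shapes listed above can be sketched as plain dataclasses. This is a minimal illustration, not the frozen contract: the BuildOutcome enum, the Optional field types, and the empty_corpus_report helper are assumptions introduced here; only the field names, the config defaults, and the failure_reason text come from this spec.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional


class BuildOutcome(Enum):
    # Assumed enum; member names mirror the outcome values used in this spec.
    SUCCESS = "SUCCESS"
    IDEMPOTENT_NO_OP = "IDEMPOTENT_NO_OP"
    FAILURE = "FAILURE"


@dataclass(frozen=True)
class C10ProvisionerConfig:
    # Defaults as listed in the Outcome section.
    coverage_strict: bool = True
    lock_timeout_s: float = 5.0
    manifest_filename: str = "Manifest.json"


@dataclass(frozen=True)
class BuildReport:
    # Field names from the spec; the Optional typing is an assumption.
    outcome: BuildOutcome
    engines_built: int
    engines_reused: int
    descriptors_generated: int
    manifest_hash: Optional[str]
    manifest_path: Optional[Path]
    failure_reason: Optional[str]
    elapsed_s: float


def empty_corpus_report(elapsed_s: float) -> BuildReport:
    """Hypothetical helper: the step 2 report for an empty C6 query."""
    return BuildReport(
        outcome=BuildOutcome.FAILURE,
        engines_built=0,
        engines_reused=0,
        descriptors_generated=0,
        manifest_hash=None,
        manifest_path=None,
        failure_reason=(
            "no tiles in C6 for the requested scope; "
            "run C11 TileDownloader first"
        ),
        elapsed_s=elapsed_s,
    )
```

Freezing both dataclasses keeps a report immutable once returned, which makes it safe to log and compare in tests.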

Scope

Included

  • CacheProvisioner class implementing the Protocol from the contract.
  • The contract document (frozen at v1.1.0).
  • Filesystem lockfile (FileLockFactory Protocol + filelock-backed default impl).
  • Idempotence check (parse existing Manifest's manifest_hash only; no full verify).
  • Coverage walk + ManifestCoverageError with rollback to prior Manifest.
  • Empty-corpus handling with explicit hint to run C11.
  • compile_engines_for_corpus passthrough.
  • Composition-root factory build_cache_provisioner(config) -> CacheProvisioner.
  • Conformance test for the contract Protocol.

Excluded

  • The internal phases (AZ-321, AZ-322, AZ-323).
  • Manifest verification at takeoff (AZ-324).
  • Operator CLI / tooling (E-C12).
  • C13 FDR emissions (build is offline).
  • Resumable mid-build state (out of scope; restart from scratch).
  • GC of stale engines (operator action).
  • Multi-cache rotation.

Acceptance Criteria

AC-1: Cold build composes phases and writes Manifest
Given an empty cache_root and C6 populated with tiles for the requested scope
When build_cache_artifacts(request) is called
Then outcome=SUCCESS; engines_built > 0; descriptors_generated > 0; Manifest.json + Manifest.json.sig + Manifest.json.sha256 exist; BuildReport.manifest_hash matches the on-disk Manifest's build.manifest_hash; elapsed_s is positive

AC-2: Warm idempotent re-run skips everything
Given a prior successful build at the same cache_root with the same identity tuple
When build_cache_artifacts is called with an identical request
Then outcome=IDEMPOTENT_NO_OP; engines_built=0, engines_reused=0, descriptors_generated=0; ZERO calls to engine_compiler.compile_engines_for_corpus (verifiable via spy); ZERO calls to descriptor_batcher.populate_descriptors; ZERO calls to manifest_builder.build_manifest; the on-disk Manifest is byte-identical (mtime unchanged)

AC-3: Different bbox triggers full rebuild and atomic replacement
Given a prior Manifest at the cache_root for bbox A
When build_cache_artifacts is called with bbox B (B ≠ A)
Then outcome=SUCCESS; the new Manifest replaces the old (atomic via AZ-280); old Manifest.json.prev is cleaned up after coverage passes; manifest_hash differs from the prior

AC-4: Empty corpus surfaces failure with operator hint
Given C6 has zero tiles for the requested scope
When build_cache_artifacts is called
Then outcome=FAILURE; failure_reason contains "C11 TileDownloader"; ZERO compile / embed / Manifest calls; lock IS released (no leaked lockfile)

AC-5: Concurrent invocation raises BuildLockHeldError
Given another invocation holds .c10.lock
When a second build_cache_artifacts runs
Then BuildLockHeldError is raised within lock_timeout_s; the existing build is unaffected; the existing lockfile is NOT deleted

AC-6: ManifestCoverageError rolls back to prior Manifest
Given a prior-good Manifest exists; a build is run; before the coverage walk, an orphan file cache_root/leftover.bin is dropped (simulated)
When the coverage walk runs in strict mode
Then ManifestCoverageError(...) is raised naming the orphan; Manifest.json on disk is the prior-good one (prev was restored); ERROR log

AC-7: Coverage non-strict mode warns but continues
Given coverage_strict=False and an orphan
When the build completes
Then outcome=SUCCESS; ONE WARN log naming the orphan; the new Manifest is on disk

AC-8: Lock released on every exit path
Given any of: success / failure / IDEMPOTENT_NO_OP / ManifestCoverageError / EngineBuildError propagation
When build_cache_artifacts returns or raises
Then cache_root/.c10.lock is removed (or unlocked if the implementation uses fcntl); a subsequent call succeeds (no leftover lock)

AC-9: Hard errors propagate without state corruption
Given engine_compiler.compile_engines_for_corpus raises EngineBuildError
When build_cache_artifacts runs
Then the error propagates; on-disk Manifest is the prior-good one (prev restored); lock is released; partial engines that AZ-321 wrote ARE on disk (not deleted; operators may want them for diagnostics)

AC-10: compile_engines_for_corpus passthrough
Given a request configured for engine-only re-compile
When compile_engines_for_corpus(req) is called directly
Then engine_compiler.compile_engines_for_corpus(req) is invoked once with the same request; the return value is forwarded as a tuple; no lock is acquired (this is a thin diagnostic-mode call)

AC-11: Conformance: isinstance returns True
Given the implementation
When isinstance(impl, CacheProvisioner) is checked under runtime_checkable
Then True
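
A conformance check of this kind can be sketched with typing.runtime_checkable. The two method names come from this spec; the Any-typed signatures and the FakeProvisioner / IncompleteProvisioner classes are placeholders for illustration, not the contract's real DTO types.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CacheProvisioner(Protocol):
    # Signatures simplified to Any; the contract defines the real DTOs.
    def build_cache_artifacts(self, request: Any) -> Any: ...

    def compile_engines_for_corpus(self, request: Any) -> Any: ...


class FakeProvisioner:
    """Hypothetical implementation exposing both Protocol methods."""

    def build_cache_artifacts(self, request: Any) -> Any:
        return None

    def compile_engines_for_corpus(self, request: Any) -> Any:
        return ()


class IncompleteProvisioner:
    """Hypothetical class missing compile_engines_for_corpus."""

    def build_cache_artifacts(self, request: Any) -> Any:
        return None


conforms = isinstance(FakeProvisioner(), CacheProvisioner)
rejects = not isinstance(IncompleteProvisioner(), CacheProvisioner)
```

Note that a runtime_checkable isinstance check only verifies that the named methods exist; it does not validate parameter or return types, so signature drift still needs a static type checker or behavioral tests.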

AC-12: Cold build benchmark within C10-PT-01 envelope
Given Tier-1 dev workstation with NVIDIA GPU + a 1000-tile corpus + 3 backbones
When a cold build runs
Then wall-clock ≤ 12 min (CP-TC-12 / NFR C10-PT-01); WARN log if exceeded (so operators see the regression in CI)

AC-13: Warm idempotent benchmark within C10-PT-01 envelope
Given a populated cache and identical request
When build_cache_artifacts runs
Then wall-clock ≤ 1 min (CP-TC-13 / NFR C10-PT-01); the bound work is the build-identity hash computation, which is dominated by tiles_coverage_sha256 over 1000 tiles (~5 ms hashing)

AC-14: takeoff_origin mismatch triggers full rebuild (ADR-010 / CP-INV-8)
Given a prior Manifest built with takeoff_origin = A
When build_cache_artifacts is called with the SAME bbox / zooms / sector / calibration / tiles, but takeoff_origin = B (B ≠ A by ≥ 1 mm)
Then outcome=SUCCESS (NOT IDEMPOTENT_NO_OP); the new Manifest replaces the old; the new manifest_hash differs from the prior; the new Manifest's flight.takeoff_origin matches B

AC-15: takeoff_origin = None propagates through with no flight block in Manifest (back-compat)
Given a BuildRequest with takeoff_origin = None and flight_id = None
When build_cache_artifacts runs
Then outcome=SUCCESS; the produced Manifest has no flight.takeoff_origin key (AZ-323's AC-14); idempotence still works for subsequent identical-without-origin invocations

AC-16: flight_id participation in idempotence
Given a prior Manifest built with flight_id = X, takeoff_origin = A
When build_cache_artifacts runs with flight_id = Y, takeoff_origin = A (only flight_id differs)
Then outcome=SUCCESS (NOT IDEMPOTENT_NO_OP); flight_id is part of the build identity per CP-INV-8
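
How takeoff_origin and flight_id feed the build identity (AC-14 to AC-16) can be illustrated with a small hash sketch. This is an assumption-heavy illustration: the authoritative field layout and canonicalization belong to AZ-323, stdlib json stands in for orjson, and the function and key names below are hypothetical. Only the input list, the 9-decimal rounding, and the null sentinel come from this spec.

```python
import hashlib
import json
from typing import Optional, Sequence, Tuple

Origin = Tuple[float, float, float]  # (lat_deg, lon_deg, alt_m)


def canonical_origin(origin: Optional[Origin]) -> Optional[list]:
    """Round to 9 decimal places, or None (JSON null) when absent."""
    if origin is None:
        return None
    return [round(component, 9) for component in origin]


def build_identity_hash(
    *,
    model_ids: Sequence[str],
    calibration_sha256: str,
    tiles_coverage_sha256: str,
    sector_class: str,
    bbox: Sequence[float],
    zoom_levels: Sequence[int],
    takeoff_origin: Optional[Origin],
    flight_id: Optional[str],
) -> str:
    # Assumed payload layout; the real one must match AZ-323 exactly.
    payload = {
        "model_ids": list(model_ids),
        "calibration_sha256": calibration_sha256,
        "tiles_coverage_sha256": tiles_coverage_sha256,
        "sector_class": sector_class,
        "bbox": list(bbox),
        "zoom_levels": list(zoom_levels),
        "takeoff_origin": canonical_origin(takeoff_origin),
        "flight_id": flight_id,
    }
    # Sorted keys and fixed separators give a deterministic byte string.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this shape, a shift of about 1e-8 degrees of latitude (roughly 1 mm on the ground) survives the 9-decimal rounding and therefore changes the hash, while a request without an origin hashes stably through the null sentinel.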

Non-Functional Requirements

Performance

  • Cold path is bound by AZ-321 + AZ-322 (per their NFRs); this orchestrator adds ≤ 5 s coordination overhead.
  • Warm path: build-identity hash + Manifest read + idempotence compare ≤ 5 s on Tier-1 dev workstation (1000-tile corpus).
  • Coverage walk: O(N files); ≤ 1 s for ≤ 10k files in cache_root.
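
The O(N) coverage walk can be sketched with pathlib alone. This is a minimal sketch under stated assumptions: find_orphans is a hypothetical name, expected paths are taken as already relative to cache_root, and the Manifest-family and lockfile exclusions are applied at the top level of cache_root only.

```python
from pathlib import Path
from typing import Iterable, Set

# Files the walk must ignore, relative to cache_root (from step 5).
EXCLUDED_TOP_LEVEL = {
    Path("Manifest.json"),
    Path("Manifest.json.sha256"),
    Path("Manifest.json.sig"),
    Path("Manifest.json.prev"),
    Path(".c10.lock"),
}


def find_orphans(cache_root: Path, expected: Iterable[Path]) -> Set[Path]:
    """Return files under cache_root that the Manifest does not list."""
    expected_set = {Path(p) for p in expected}
    walked: Set[Path] = set()
    for path in cache_root.rglob("*"):
        if not path.is_file():
            continue  # skip directories
        rel = path.relative_to(cache_root)
        if rel in EXCLUDED_TOP_LEVEL:
            continue
        if rel.suffix == ".sha256":
            continue  # sidecars are implicit, paired with their primary
        walked.add(rel)
    return walked - expected_set
```

A caller would raise ManifestCoverageError when the returned set is non-empty and coverage_strict is True, and log a WARN otherwise.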

Compatibility

  • filelock library — pin via requirements.txt. (Verify already present from a prior task's deps; if not, add. Same version across all C10 tasks.)
  • orjson (already pinned via AZ-272), hashlib (stdlib), pathlib (stdlib).

Reliability

  • CP-INV-2: failed build never leaves the cache in a worse state than at start.
  • Lock release on every exit path (try/finally).
  • Atomic Manifest replacement (rename prev → current rollback semantics); coverage error rolls back automatically.
  • No silent failures: every error path logs at ERROR level with diagnostic.
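
The try/finally release guarantee can be sketched with an in-process lock factory of the kind the tests are allowed to use. The try_lock(path, timeout_s=...) call shape and BuildLockHeldError come from this spec; the LockHandle protocol with a release() method, InProcessLockFactory, and run_locked are assumptions made for illustration.

```python
import threading
from pathlib import Path
from typing import Callable, Dict, Protocol, TypeVar

T = TypeVar("T")


class BuildLockHeldError(RuntimeError):
    """Raised when the lock cannot be acquired within the timeout."""


class LockHandle(Protocol):
    def release(self) -> None: ...


class FileLockFactory(Protocol):
    def try_lock(self, path: Path, *, timeout_s: float) -> LockHandle: ...


class InProcessLockFactory:
    """Test double: one threading.Lock per path, no filesystem access."""

    def __init__(self) -> None:
        self._locks: Dict[Path, threading.Lock] = {}
        self._guard = threading.Lock()

    def try_lock(self, path: Path, *, timeout_s: float) -> LockHandle:
        with self._guard:
            lock = self._locks.setdefault(path, threading.Lock())
        if not lock.acquire(timeout=timeout_s):
            raise BuildLockHeldError(f"lock already held: {path}")
        return lock  # threading.Lock provides release()


def run_locked(
    factory: FileLockFactory, path: Path, timeout_s: float, body: Callable[[], T]
) -> T:
    """Run body under the lock; release on return and on any exception."""
    handle = factory.try_lock(path, timeout_s=timeout_s)
    try:
        return body()
    finally:
        handle.release()
```

Because the release sits in the finally clause, the success, FAILURE, IDEMPOTENT_NO_OP, and exception paths all converge on the same unlock, which is the property AC-8 asserts.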

Unit Tests

| AC Ref | What to Test | Required Outcome |
| --- | --- | --- |
| AC-1 | Cold build with fakes for phases | All phases called once; SUCCESS |
| AC-2 | Warm re-run with identical request | IDEMPOTENT_NO_OP; zero phase calls |
| AC-3 | Different bbox after prior build | SUCCESS; atomic replace; old Manifest gone |
| AC-4 | Empty C6 query | FAILURE; hint string; lock released |
| AC-14 | Warm re-run with different takeoff_origin | SUCCESS; new manifest_hash; phases called |
| AC-15 | Build with takeoff_origin=None | SUCCESS; Manifest has no flight.takeoff_origin |
| AC-16 | Warm re-run with different flight_id only | SUCCESS; new manifest_hash |
| AC-5 | Pre-acquire lock externally; run | BuildLockHeldError |
| AC-6 | Inject orphan file before coverage walk | ManifestCoverageError; prior Manifest restored |
| AC-7 | Same as AC-6 with coverage_strict=False | SUCCESS; WARN log |
| AC-8 | Each error path | Lock released after each |
| AC-9 | engine_compiler raises | Error propagates; rollback; lock released |
| AC-10 | Direct call to compile_engines_for_corpus | Single passthrough; no lock |
| AC-11 | Conformance | True |
| AC-12 | Cold build bench (skipped on CI; manual) | ≤ 12 min |
| AC-13 | Warm bench | ≤ 1 min |
| NFR-perf-coverage-walk | 10k files in cache_root | ≤ 1 s |

Constraints

  • The orchestrator does NOT touch satellite-provider (CP-INV-6); all I/O is local.
  • Lockfile is mandatory; bypassing the lock for testing is a config flag, NOT a separate code path.
  • Idempotence check parses ONLY build.manifest_hash from the existing Manifest; full verification is AZ-324's job (separate code path).
  • Manifest.json.prev is the rollback target; never two prevs deep (rebuilds are not stack-able).
  • Coverage walk EXCLUDES the lockfile, the Manifest itself, its sidecar, its signature, and any .prev rollback file.
  • The orchestrator never modifies engines compiled by AZ-321 (atomic on disk) — it only touches the Manifest + .prev/.lock files.
  • Operator key handling delegates entirely to AZ-323 (CP-INV-7).
  • This task introduces at most ONE new third-party dependency (filelock) — verify against existing deps first.

Risks & Mitigation

Risk 1: Stale lockfile after process kill

  • Risk: A SIGKILL'd build leaves .c10.lock on disk; subsequent runs always raise BuildLockHeldError.
  • Mitigation: Use filelock library which uses fcntl flock (auto-released on process exit). On platforms without fcntl, document the manual cleanup step. AC-5 + AC-8 cover normal lock release; the SIGKILL case is an OS-level guarantee from filelock.

Risk 2: Coverage walk slow on huge cache_root

  • Risk: 100k files in cache_root → coverage walk could take seconds.
  • Mitigation: NFR-perf-coverage-walk benchmark; if exceeded, switch to streaming compare with a sorted Manifest path list. Out of scope for the initial impl.

Risk 3: Idempotence check trusts prior Manifest's hash without verifying signature

  • Risk: A tampered Manifest could lie about its manifest_hash, fooling the orchestrator into IDEMPOTENT_NO_OP and skipping a needed rebuild.
  • Mitigation: This is acceptable because AZ-324's ManifestVerifier runs at takeoff — a tampered Manifest fails verify and prevents arming. The orchestrator's role is to AVOID rebuilds when nothing changed; trusting manifest_hash is a performance optimization, not a security check. Documented in CP-INV-1.
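
The "parse only build.manifest_hash, no verification" behavior can be sketched as follows. The function names are hypothetical, stdlib json stands in for orjson, and treating a missing or malformed Manifest as "no prior build" is an assumption of this sketch rather than something the spec states.

```python
import json
from pathlib import Path
from typing import Optional


def read_prior_manifest_hash(manifest_path: Path) -> Optional[str]:
    """Return build.manifest_hash from an existing Manifest, else None.

    No signature or artifact verification happens here; that stays
    with the takeoff-time verifier.
    """
    try:
        doc = json.loads(manifest_path.read_text(encoding="utf-8"))
    except (FileNotFoundError, ValueError):
        # ValueError covers json.JSONDecodeError and UnicodeDecodeError.
        return None
    build = doc.get("build") if isinstance(doc, dict) else None
    value = build.get("manifest_hash") if isinstance(build, dict) else None
    return value if isinstance(value, str) else None


def is_idempotent_no_op(manifest_path: Path, request_hash: str) -> bool:
    """True only when a prior hash exists and equals the request hash."""
    return read_prior_manifest_hash(manifest_path) == request_hash
```

Falling back to None on any parse problem biases the orchestrator toward rebuilding, which costs time but never skips a needed build.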

Risk 4: coverage_strict=False becomes the de facto default

  • Risk: Operators set coverage_strict=False to ship faster, defeating D-C10-3.
  • Mitigation: Default is True; the config flag is documented as "for forensic builds only"; CI runs always assert strict.

Risk 5: Rollback corrupts state on partial coverage walk failure

  • Risk: If Manifest.json.prev rename fails (e.g., disk full), the cache is left in an in-between state.
  • Mitigation: Use AZ-280's atomic rename helper; if the rename itself fails, surface a distinct ManifestRollbackError (subclass of ManifestCoverageError) so operators see the disk-level cause. Documented but not a blocker for v1.0.0.
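
The snapshot and restore steps, including the distinct rollback error, can be sketched like this. os.replace stands in for AZ-280's atomic rename helper, whose real interface is not shown in this document; snapshot_prior and restore_prior are hypothetical names.

```python
import os
from pathlib import Path


class ManifestCoverageError(Exception):
    """Orphan files were found in cache_root."""


class ManifestRollbackError(ManifestCoverageError):
    """Restoring the prior-good Manifest failed at the filesystem level."""


def _prev_path(manifest: Path) -> Path:
    return manifest.with_name(manifest.name + ".prev")


def snapshot_prior(manifest: Path) -> bool:
    """Rename Manifest.json to Manifest.json.prev; False if none existed."""
    if not manifest.exists():
        return False
    os.replace(manifest, _prev_path(manifest))
    return True


def restore_prior(manifest: Path) -> None:
    """Put the prior-good Manifest back, overwriting the new one."""
    prev = _prev_path(manifest)
    if not prev.exists():
        return  # no prior build: leave the new Manifest in place
    try:
        # os.replace overwrites the destination in a single rename.
        os.replace(prev, manifest)
    except OSError as exc:
        raise ManifestRollbackError(
            f"could not restore {prev} to {manifest}: {exc}"
        ) from exc
```

Subclassing ManifestCoverageError means existing handlers still catch the rollback failure, while callers that care can distinguish the disk-level cause.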

Risk 6: Lock acquisition races with operator's manual file ops

  • Risk: Operator manually edits a file in cache_root while a build is running.
  • Mitigation: Coverage walk happens at the end of build; if operator drops a file mid-build, AC-6 catches it. The lockfile prevents two CONCURRENT builds, not operator-vs-build interference. Documented.

Runtime Completeness

  • Named capability: top-level F1 cache build with D-C10-1 idempotence + D-C10-3 ManifestCoverageError + lockfile race-condition mitigation (epic § Acceptance C10-IT-01, C10-IT-03, C10-IT-04; description.md § 1, § 5, § 7).
  • Production code that must exist: real CacheProvisioner orchestrating real AZ-321/322/323 + real filelock-backed lock + real coverage walk + real rollback.
  • Allowed external stubs: tests MAY use spy/fake versions of AZ-321/322/323 (already produced by their conformance tests) + an in-process FileLockFactory for deterministic concurrency tests.
  • Unacceptable substitutes: skipping the lockfile (defeats CP-INV-4); skipping the coverage walk (defeats D-C10-3); a "soft" idempotence that re-builds anyway (defeats D-C10-1 and the C10-PT-01 1-min warm target); calling AZ-324's ManifestVerifier for the idempotence check (overkill: a full verify on every operator invocation triples warm-path cost); deleting partial engines on failure (operators rely on them for diagnostics per AC-9).