Mirror of https://github.com/azaion/gps-denied-onboard.git (synced 2026-06-22 19:11:14 +00:00)
Decompose Step 6 snapshot: 140 task specs + contract docs
Closes out greenfield Step 6 (Decompose) for all 14 components (C1-C13 + cross-cutting helpers/replay). Covers tasks AZ-266..AZ-446 plus the _dependencies_table.md and component contract documents. State file updated to greenfield Step 7 (Implement), not_started.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -0,0 +1,233 @@
# C10 CacheProvisioner: Idempotent Orchestrator + ManifestCoverageError

**Task**: AZ-325_c10_cache_provisioner
**Name**: C10 CacheProvisioner
**Description**: Implement `CacheProvisioner` (per the contract `_docs/02_document/contracts/c10_provisioning/cache_provisioner.md`), the public top-level orchestrator that composes AZ-321 (EngineCompiler), AZ-322 (DescriptorBatcher), and AZ-323 (ManifestBuilder) into a single idempotent F1 build pipeline. Acquires a `cache_root/.c10.lock` filesystem lockfile to enforce CP-INV-4. Computes the build-identity hash from the same canonical inputs AZ-323 hashes (model_ids + calibration_sha256 + tiles_coverage_sha256 + sector_class + bbox + zoom_levels) and compares it to the existing `Manifest.json`'s `manifest_hash`; on match → `outcome=IDEMPOTENT_NO_OP`. On mismatch (or no prior Manifest) → run engine compile → descriptor population → Manifest build, then walk `cache_root` to confirm every file is listed in the new Manifest's `artifacts` section, raising `ManifestCoverageError` on orphans (with rollback to the prior-good Manifest). Empty corpus → `BuildReport(outcome=FAILURE, failure_reason="no tiles in C6 for the requested scope; run C11 TileDownloader first")` per description.md § 5.
**Complexity**: 3 points
**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-303_c6_storage_interfaces, AZ-321_c10_engine_compiler, AZ-322_c10_descriptor_batcher, AZ-323_c10_manifest_builder
**Component**: c10_provisioning (epic AZ-252 / E-C10)
**Tracker**: AZ-325
**Epic**: AZ-252 (E-C10)

### Document Dependencies

- `_docs/02_document/contracts/c10_provisioning/cache_provisioner.md`: produced by this task (frozen Protocol + DTO shape, invariants, test cases).
- `_docs/02_document/components/11_c10_provisioning/description.md`: § 1 idempotence, § 5 error handling, § 7 lockfile race-condition mitigation.
## Problem

Without a real orchestrator:

- D-C10-1 (idempotent re-run via manifest hash) cannot be enforced: every operator invocation re-compiles every engine, blowing the C10-PT-01 ≤ 1 min warm target.
- D-C10-3 (`ManifestCoverageError` on orphan files / no smuggled artifacts) is unobservable: partial-build leftovers and out-of-band file drops at takeoff time go undetected.
- C10-IT-03 (idempotent re-run: same hash, no recompile) cannot be implemented.
- C10-IT-04 (`ManifestCoverageError` on orphan files) cannot be implemented.
- The race-condition mitigation per description.md § 7 (filesystem lockfile) has no producer.
- C12 OperatorTooling (E-C12) has no surface to call; its `c10 build` CLI command becomes a one-liner only once this task ships.
- The "missing tiles in C6" failure path (description.md § 5) has no surface; operators would see a stack trace from AZ-322 instead of a clear `failure_reason` directing them to C11.

This task delivers the orchestrator + its frozen contract. It does NOT compile engines (AZ-321), embed tiles (AZ-322), build Manifests (AZ-323), or verify at takeoff (AZ-324).
## Outcome

- A `CacheProvisioner` class implementation at `src/gps_denied_onboard/components/c10_provisioning/provisioner.py` matching the Protocol in the contract.
- Constructor: `__init__(self, *, engine_compiler: EngineCompiler, descriptor_batcher: DescriptorBatcher, manifest_builder: ManifestBuilder, tile_metadata_store: TileMetadataStore, lock_factory: FileLockFactory, logger: Logger, clock: Clock, config: C10ProvisionerConfig)`.
- `C10ProvisionerConfig` (`@dataclass(frozen=True)`): `coverage_strict: bool = True`, `lock_timeout_s: float = 5.0`, `manifest_filename: str = "Manifest.json"`.
- Method `build_cache_artifacts(request: BuildRequest) -> BuildReport` flow:
  1. **Lock acquisition** (CP-INV-4):
     - Path: `request.cache_root / ".c10.lock"`.
     - Acquire via `lock_factory.try_lock(path, timeout_s=config.lock_timeout_s)`: a bounded wait with a short timeout, so that a concurrent invocation surfaces as `BuildLockHeldError` instead of blocking indefinitely.
  2. **Tile gathering**: call `tile_metadata_store.query_by_bbox(bbox, zoom_levels, sector_class)`.
     - If empty → return `BuildReport(outcome=FAILURE, failure_reason="no tiles in C6 for the requested scope; run C11 TileDownloader first", engines_built=0, ...)`. ERROR log; release lock.
  3. **Build-identity hash for idempotence check**:
     - Compute `request_hash = sha256(canonical_json(model_ids + calibration_sha256 + tiles_coverage_sha256 + sector_class + bbox + zoom_levels))`. The `model_ids` come from the configured backbone list; `calibration_sha256` from streaming the file at `calibration_path`; `tiles_coverage_sha256` from sorting the tile rows by `(zoom, lat, lon, source)` and hashing per AZ-323's algorithm.
     - Read existing `Manifest.json` if present; parse only the `build.manifest_hash` field (don't run full verification; that's AZ-324's job). If `existing.manifest_hash == request_hash` → return `BuildReport(outcome=IDEMPOTENT_NO_OP, manifest_hash=existing.manifest_hash, manifest_path=existing_path, engines_built=0, engines_reused=0, descriptors_generated=0, elapsed_s, failure_reason=None)`. INFO log; release lock.
  4. **Active build path**:
     - Snapshot prior-good Manifest (rename to `Manifest.json.prev` if present) for rollback.
     - Compose engine compile request from configured backbones; call `engine_compiler.compile_engines_for_corpus(...)` → `engine_entries`.
     - Compose descriptor populate request (filter, callback hooked to logger); call `descriptor_batcher.populate_descriptors(...)` → `DescriptorBatchReport`. If `outcome=failure` → restore prior Manifest, release lock, return `BuildReport(outcome=FAILURE, failure_reason=batch.failure_reason, ...)`.
     - Compose Manifest build input from engine entries + descriptor index path + calibration + key_path; call `manifest_builder.build_manifest(...)` → `ManifestArtifact`.
  5. **Coverage check** (CP-INV-3 / D-C10-3):
     - Walk `cache_root` recursively (`pathlib.Path.rglob`); collect every regular file path EXCLUDING `Manifest.json`, `Manifest.json.sha256`, `Manifest.json.sig`, `Manifest.json.prev`, `.c10.lock`, and any `.sha256` sidecar (sidecars are implicit per the AZ-280 pattern, paired with their primary).
     - Build expected set: every `path` in `manifest.artifacts.engines + descriptor_index + calibration` (resolved relative to `cache_root`).
     - `orphans = walked - expected`.
     - If `orphans` non-empty AND `config.coverage_strict`:
       - Restore prior Manifest from `Manifest.json.prev` (delete current Manifest; rename prev back). If no prev existed, leave the new Manifest in place but raise.
       - Raise `ManifestCoverageError(f"orphan files in cache_root: {sorted(orphans)}")`. ERROR log.
     - If `orphans` non-empty AND NOT `coverage_strict`: WARN log with the orphan list; continue.
  6. **Cleanup**: delete `Manifest.json.prev` if present; release lock.
  7. Return `BuildReport(outcome=SUCCESS, engines_built, engines_reused, descriptors_generated, manifest_hash, manifest_path, failure_reason=None, elapsed_s)`.
- Method `compile_engines_for_corpus(request)` is a thin passthrough to `engine_compiler.compile_engines_for_corpus(request)` (per CP-TC-11; lets operators run engine-only re-compiles for D-C10-6 hardware-change scenarios without redoing descriptors).
- A `FileLockFactory` Protocol + a default `filelock`-library-backed impl (use the `filelock` package, already pinned via shared helpers if present; if not, add to deps with a single pinned version).
- INFO logs on lock acquired / released, build start/end, idempotent no-op; ERROR on coverage error / build failure; WARN on non-strict coverage drift.
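
Step 3 of the flow above (build-identity hash plus the `build.manifest_hash` comparison) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the AZ-323 algorithm: the JSON payload layout, the sorting of `model_ids` and `zoom_levels`, and the helper names `sha256_file`, `build_identity_hash`, `read_existing_manifest_hash`, and `is_idempotent_no_op` are all hypothetical, and stdlib `json` stands in for `orjson`.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional, Sequence, Tuple


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so a large calibration file never sits in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_identity_hash(
    *,
    model_ids: Sequence[str],
    calibration_sha256: str,
    tiles_coverage_sha256: str,
    sector_class: str,
    bbox: Tuple[float, float, float, float],
    zoom_levels: Sequence[int],
) -> str:
    """Hash the identity tuple. The payload layout is an assumption; AZ-323 owns the real one."""
    payload = {
        "model_ids": sorted(model_ids),  # assumption: ordering is not part of the identity
        "calibration_sha256": calibration_sha256,
        "tiles_coverage_sha256": tiles_coverage_sha256,
        "sector_class": sector_class,
        "bbox": list(bbox),
        "zoom_levels": sorted(zoom_levels),  # assumption: ordering is not part of the identity
    }
    # Sorted keys and fixed separators give a byte-stable serialization.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def read_existing_manifest_hash(manifest_path: Path) -> Optional[str]:
    """Read only build.manifest_hash; return None if the Manifest is absent or unparsable."""
    try:
        doc = json.loads(manifest_path.read_text(encoding="utf-8"))
        return doc["build"]["manifest_hash"]
    except (FileNotFoundError, KeyError, TypeError, json.JSONDecodeError):
        return None


def is_idempotent_no_op(manifest_path: Path, request_hash: str) -> bool:
    """True when the on-disk Manifest already records the requested build identity."""
    existing = read_existing_manifest_hash(manifest_path)
    return existing is not None and existing == request_hash
```

Treating an unreadable Manifest as "no prior build" is a design choice of this sketch: it falls through to the active build path rather than failing the idempotence check.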
## Scope

### Included

- `CacheProvisioner` class implementing the Protocol from the contract.
- The contract document (frozen at v1.0.0).
- Filesystem lockfile (FileLockFactory Protocol + filelock-backed default impl).
- Idempotence check (parse existing Manifest's `manifest_hash` only; no full verify).
- Coverage walk + `ManifestCoverageError` with rollback to prior Manifest.
- Empty-corpus handling with explicit hint to run C11.
- `compile_engines_for_corpus` passthrough.
- Composition-root factory `build_cache_provisioner(config) -> CacheProvisioner`.
- Conformance test for the contract Protocol.

### Excluded

- The internal phases (AZ-321, AZ-322, AZ-323).
- Manifest verification at takeoff (AZ-324).
- Operator CLI / tooling (E-C12).
- C13 FDR emissions (build is offline).
- Resumable mid-build state (out of scope; restart from scratch).
- GC of stale engines (operator action).
- Multi-cache rotation.
## Acceptance Criteria

**AC-1: Cold build composes phases and writes Manifest**
Given an empty cache_root and C6 populated with tiles for the requested scope
When `build_cache_artifacts(request)` is called
Then `outcome=SUCCESS`; `engines_built > 0`; `descriptors_generated > 0`; `Manifest.json` + `Manifest.json.sig` + `Manifest.json.sha256` exist; `BuildReport.manifest_hash` matches the on-disk Manifest's `build.manifest_hash`; `elapsed_s` is positive

**AC-2: Warm idempotent re-run skips everything**
Given a prior successful build at the same cache_root with the same identity tuple
When `build_cache_artifacts` is called with an identical request
Then `outcome=IDEMPOTENT_NO_OP`; `engines_built=0, engines_reused=0, descriptors_generated=0`; ZERO calls to `engine_compiler.compile_engines_for_corpus` (verifiable via spy); ZERO calls to `descriptor_batcher.populate_descriptors`; ZERO calls to `manifest_builder.build_manifest`; the on-disk Manifest is byte-identical (mtime unchanged)

**AC-3: Different bbox triggers full rebuild and atomic replacement**
Given a prior Manifest at the cache_root for bbox A
When `build_cache_artifacts` is called with bbox B (B ≠ A)
Then `outcome=SUCCESS`; the new Manifest replaces the old (atomic via AZ-280); old `Manifest.json.prev` is cleaned up after coverage passes; `manifest_hash` differs from the prior

**AC-4: Empty corpus surfaces failure with operator hint**
Given C6 has zero tiles for the requested scope
When `build_cache_artifacts` is called
Then `outcome=FAILURE`; `failure_reason` contains "C11 TileDownloader"; ZERO compile / embed / Manifest calls; lock IS released (no leaked lockfile)

**AC-5: Concurrent invocation raises `BuildLockHeldError`**
Given another invocation holds `.c10.lock`
When a second `build_cache_artifacts` runs
Then `BuildLockHeldError` is raised within `lock_timeout_s`; the existing build is unaffected; the existing lockfile is NOT deleted

**AC-6: ManifestCoverageError rolls back to prior Manifest**
Given a prior-good Manifest exists; a build is run; before the coverage walk, an orphan file `cache_root/leftover.bin` is dropped (simulated)
When the coverage walk runs in strict mode
Then `ManifestCoverageError(...)` is raised naming the orphan; `Manifest.json` on disk is the prior-good one (prev was restored); ERROR log

**AC-7: Coverage non-strict mode warns but continues**
Given `coverage_strict=False` and an orphan
When the build completes
Then `outcome=SUCCESS`; ONE WARN log naming the orphan; the new Manifest is on disk

**AC-8: Lock released on every exit path**
Given any of: success / failure / IDEMPOTENT_NO_OP / `ManifestCoverageError` / `EngineBuildError` propagation
When `build_cache_artifacts` returns or raises
Then `cache_root/.c10.lock` is removed (or unlocked if the implementation uses fcntl); a subsequent call succeeds (no leftover lock)

**AC-9: Hard errors propagate without state corruption**
Given `engine_compiler.compile_engines_for_corpus` raises `EngineBuildError`
When `build_cache_artifacts` runs
Then the error propagates; on-disk Manifest is the prior-good one (prev restored); lock is released; partial engines that AZ-321 wrote ARE on disk (not deleted; operators may want them for diagnostics)

**AC-10: `compile_engines_for_corpus` passthrough**
Given a request configured for engine-only re-compile
When `compile_engines_for_corpus(req)` is called directly
Then `engine_compiler.compile_engines_for_corpus(req)` is invoked once with the same request; the return value is forwarded as a tuple; no lock is acquired (this is a thin diagnostic-mode call)

**AC-11: Conformance (`isinstance` returns True)**
Given the implementation
When `isinstance(impl, CacheProvisioner)` is checked under `@runtime_checkable`
Then `True`

**AC-12: Cold build benchmark within C10-PT-01 envelope**
Given Tier-1 dev workstation with NVIDIA GPU + a 1000-tile corpus + 3 backbones
When a cold build runs
Then wall-clock ≤ 12 min (CP-TC-12 / NFR C10-PT-01); WARN log if exceeded (so operators see the regression in CI)

**AC-13: Warm idempotent benchmark within C10-PT-01 envelope**
Given a populated cache and identical request
When `build_cache_artifacts` runs
Then wall-clock ≤ 1 min (CP-TC-13 / NFR C10-PT-01); the bound work is the build-identity hash computation, which is dominated by `tiles_coverage_sha256` over 1000 tiles (~5 ms hashing)
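
The coverage walk that AC-6 and AC-7 exercise might look roughly like the sketch below. It assumes Manifest artifact paths are POSIX-style strings relative to `cache_root`, and that the excluded Manifest and lock files live at the top level of `cache_root`. The function names `find_orphans` and `check_coverage`, the logger name, and the local `ManifestCoverageError` class are stand-ins for illustration only.

```python
import logging
from pathlib import Path
from typing import Iterable, Set

logger = logging.getLogger("c10.provisioner")  # hypothetical logger name

# Top-level files that are never listed in the Manifest's artifacts section.
EXCLUDED_AT_ROOT = {
    "Manifest.json",
    "Manifest.json.sha256",
    "Manifest.json.sig",
    "Manifest.json.prev",
    ".c10.lock",
}


class ManifestCoverageError(Exception):
    """Stand-in for the contract's error type."""


def find_orphans(cache_root: Path, manifest_paths: Iterable[str]) -> Set[str]:
    """Return files under cache_root that the Manifest does not list."""
    expected = {Path(p).as_posix() for p in manifest_paths}
    walked = set()
    for path in cache_root.rglob("*"):
        if not path.is_file():
            continue  # skip directories
        rel = path.relative_to(cache_root).as_posix()
        if rel in EXCLUDED_AT_ROOT or path.suffix == ".sha256":
            continue  # Manifest family, lockfile, and implicit sidecars
        walked.add(rel)
    return walked - expected


def check_coverage(cache_root: Path, manifest_paths: Iterable[str], *, strict: bool = True) -> Set[str]:
    """Raise in strict mode, warn otherwise; returns the orphan set when it does not raise."""
    orphans = find_orphans(cache_root, manifest_paths)
    if orphans and strict:
        raise ManifestCoverageError(f"orphan files in cache_root: {sorted(orphans)}")
    if orphans:
        logger.warning("non-strict coverage drift; orphans: %s", sorted(orphans))
    return orphans
```

Rollback to `Manifest.json.prev` is deliberately left out of this sketch; in the flow described under Outcome it happens before the error is raised.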
## Non-Functional Requirements

**Performance**
- Cold path is bound by AZ-321 + AZ-322 (per their NFRs); this orchestrator adds ≤ 5 s coordination overhead.
- Warm path: build-identity hash + Manifest read + idempotence compare ≤ 5 s on Tier-1 dev workstation (1000-tile corpus).
- Coverage walk: O(N files); ≤ 1 s for ≤ 10k files in cache_root.

**Compatibility**
- `filelock` library: pin via `requirements.txt`. (Verify already present from a prior task's deps; if not, add. Same version across all C10 tasks.)
- `orjson` (already pinned via AZ-272), `hashlib` (stdlib), `pathlib` (stdlib).

**Reliability**
- CP-INV-2: failed build never leaves the cache in a worse state than at start.
- Lock release on every exit path (try/finally).
- Atomic Manifest replacement, with rollback by renaming `Manifest.json.prev` back to `Manifest.json`; a coverage error triggers the rollback automatically.
- No silent failures: every error path logs at ERROR level with a diagnostic message.
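
One way to get the snapshot and rollback semantics listed under Reliability is a context manager around the active build path. The sketch below is an assumption-laden illustration: `manifest_rollback_guard` is a hypothetical name, and `os.replace` stands in for the AZ-280 atomic rename helper, whose actual API is not shown in this spec.

```python
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def manifest_rollback_guard(cache_root: Path, manifest_filename: str = "Manifest.json") -> Iterator[None]:
    """Snapshot the prior-good Manifest; restore it if the guarded block raises."""
    current = cache_root / manifest_filename
    prev = cache_root / (manifest_filename + ".prev")
    had_prior = current.exists()
    if had_prior:
        os.replace(current, prev)  # snapshot: Manifest.json -> Manifest.json.prev
    try:
        yield
    except BaseException:
        if had_prior:
            # Overwrites any newly written Manifest with the prior-good one.
            os.replace(prev, current)
        # With no prior Manifest, the new one (if any) is left in place.
        raise
    else:
        if prev.exists():
            prev.unlink()  # cleanup once the build and coverage check have passed
```

The guard only reacts to exceptions. A phase that reports failure through a return value, such as a `DescriptorBatchReport` with a failure outcome, would still need an explicit restore (or a raise inside the guarded block) before the orchestrator returns its `BuildReport`.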
## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|--------------|------------------|
| AC-1 | Cold build with fakes for phases | All phases called once; SUCCESS |
| AC-2 | Warm re-run with identical request | IDEMPOTENT_NO_OP; zero phase calls |
| AC-3 | Different bbox after prior build | SUCCESS; atomic replace; old Manifest gone |
| AC-4 | Empty C6 query | FAILURE; hint string; lock released |
| AC-5 | Pre-acquire lock externally; run | BuildLockHeldError |
| AC-6 | Inject orphan file before coverage walk | ManifestCoverageError; prior Manifest restored |
| AC-7 | Same as AC-6 with `coverage_strict=False` | SUCCESS; WARN log |
| AC-8 | Each error path | Lock released after each |
| AC-9 | engine_compiler raises | Error propagates; rollback; lock released |
| AC-10 | Direct call to compile_engines_for_corpus | Single passthrough; no lock |
| AC-11 | Conformance | True |
| AC-12 | Cold build bench (skipped on CI; manual) | ≤ 12 min |
| AC-13 | Warm bench | ≤ 1 min |
| NFR-perf-coverage-walk | 10k files in cache_root | ≤ 1 s |
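
The AC-11 conformance row could be covered by a test along these lines. The Protocol body here is a guess at the frozen contract (only the two method names come from this spec; the `Any` annotations are placeholders), and `FakeProvisioner` and `IncompleteProvisioner` are hypothetical test doubles.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CacheProvisioner(Protocol):
    """Assumed shape of the contract Protocol; the real DTO types live in the contract doc."""

    def build_cache_artifacts(self, request: Any) -> Any: ...

    def compile_engines_for_corpus(self, request: Any) -> Any: ...


class FakeProvisioner:
    """Structurally conforming double: defines both Protocol methods."""

    def build_cache_artifacts(self, request: Any) -> Any:
        return "report"

    def compile_engines_for_corpus(self, request: Any) -> Any:
        return ()


class IncompleteProvisioner:
    """Missing compile_engines_for_corpus, so it should not conform."""

    def build_cache_artifacts(self, request: Any) -> Any:
        return "report"


def test_conformance() -> None:
    assert isinstance(FakeProvisioner(), CacheProvisioner)
    assert not isinstance(IncompleteProvisioner(), CacheProvisioner)
```

Note that `isinstance` against a `runtime_checkable` Protocol only checks that the named members exist; it does not validate signatures or return types, so the other rows in the table still carry the behavioral checks.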
## Constraints

- The orchestrator does NOT touch `satellite-provider` (CP-INV-6); all I/O is local.
- Lockfile is mandatory; bypassing the lock for testing is a config flag, NOT a separate code path.
- Idempotence check parses ONLY `build.manifest_hash` from the existing Manifest; full verification is AZ-324's job (separate code path).
- `Manifest.json.prev` is the rollback target; never two prevs deep (rebuilds are not stackable).
- Coverage walk EXCLUDES the lockfile, the Manifest itself, its sidecar, its signature, and any `.prev` rollback file.
- The orchestrator never modifies engines compiled by AZ-321 (atomic on disk); it only touches the Manifest + .prev/.lock files.
- Operator key handling delegates entirely to AZ-323 (CP-INV-7).
- This task introduces at most ONE new third-party dependency (`filelock`); verify against existing deps first.
## Risks & Mitigation

**Risk 1: Stale lockfile after process kill**
- *Risk*: A SIGKILL'd build leaves `.c10.lock` on disk; subsequent runs always raise `BuildLockHeldError`.
- *Mitigation*: Use `filelock` library which uses fcntl flock (auto-released on process exit). On platforms without fcntl, document the manual cleanup step. AC-5 + AC-8 cover normal lock release; the SIGKILL case is an OS-level guarantee from filelock.

**Risk 2: Coverage walk slow on huge cache_root**
- *Risk*: 100k files in cache_root → coverage walk could take seconds.
- *Mitigation*: NFR-perf-coverage-walk benchmark; if exceeded, switch to streaming compare with a sorted Manifest path list. Out of scope for the initial impl.

**Risk 3: Idempotence check trusts prior Manifest's hash without verifying signature**
- *Risk*: A tampered Manifest could lie about its `manifest_hash`, fooling the orchestrator into IDEMPOTENT_NO_OP and skipping a needed rebuild.
- *Mitigation*: This is acceptable because AZ-324's `ManifestVerifier` runs at takeoff: a tampered Manifest fails verify and prevents arming. The orchestrator's role is to AVOID rebuilds when nothing changed; trusting `manifest_hash` is a performance optimization, not a security check. Documented in CP-INV-1.

**Risk 4: `coverage_strict=False` becomes the de facto default**
- *Risk*: Operators set `coverage_strict=False` to ship faster, defeating D-C10-3.
- *Mitigation*: Default is True; the config flag is documented as "for forensic builds only"; CI runs always assert strict.

**Risk 5: Rollback corrupts state on partial coverage walk failure**
- *Risk*: If `Manifest.json.prev` rename fails (e.g., disk full), the cache is left in an in-between state.
- *Mitigation*: Use AZ-280's atomic rename helper; if the rename itself fails, surface a distinct `ManifestRollbackError` (subclass of `ManifestCoverageError`) so operators see the disk-level cause. Documented but not a blocker for v1.0.0.

**Risk 6: Lock acquisition races with operator's manual file ops**
- *Risk*: Operator manually edits a file in cache_root while a build is running.
- *Mitigation*: Coverage walk happens at the end of build; if operator drops a file mid-build, AC-6 catches it. The lockfile prevents two CONCURRENT builds, not operator-vs-build interference. Documented.
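
To make the "two concurrent builds" guarantee testable without touching the filesystem, an in-process `FileLockFactory` double can stand in for the `filelock`-backed default. The sketch below assumes `try_lock` returns a context manager, which this spec does not confirm; `InProcessLockFactory` and the local `BuildLockHeldError` class are hypothetical.

```python
import threading
from contextlib import contextmanager
from pathlib import Path
from typing import ContextManager, Dict, Iterator, Protocol


class BuildLockHeldError(RuntimeError):
    """Stand-in for the contract's error type."""


class FileLockFactory(Protocol):
    """Assumed Protocol shape: try_lock yields while the lock is held."""

    def try_lock(self, path: Path, timeout_s: float) -> ContextManager[None]: ...


class InProcessLockFactory:
    """Test double: one non-reentrant threading.Lock per lock path, no filesystem I/O."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks: Dict[Path, threading.Lock] = {}

    @contextmanager
    def try_lock(self, path: Path, timeout_s: float) -> Iterator[None]:
        with self._guard:
            lock = self._locks.setdefault(path, threading.Lock())
        # Bounded wait: give up after timeout_s instead of blocking forever.
        if not lock.acquire(timeout=timeout_s):
            raise BuildLockHeldError(f"lock already held: {path}")
        try:
            yield
        finally:
            lock.release()  # released on success and on any exception
```

Because the release sits in a `finally` block, the same double can back the AC-8 checks: after any exit path, a second `try_lock` on the same path should succeed.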
## Runtime Completeness

- **Named capability**: top-level F1 cache build with D-C10-1 idempotence + D-C10-3 ManifestCoverageError + lockfile race-condition mitigation (epic § Acceptance C10-IT-01, C10-IT-03, C10-IT-04; description.md § 1, § 5, § 7).
- **Production code that must exist**: real `CacheProvisioner` orchestrating real AZ-321/322/323 + real `filelock`-backed lock + real coverage walk + real rollback.
- **Allowed external stubs**: tests MAY use spy/fake versions of AZ-321/322/323 (already produced by their conformance tests) + an in-process `FileLockFactory` for deterministic concurrency tests.
- **Unacceptable substitutes**: skipping the lockfile (defeats CP-INV-4); skipping the coverage walk (defeats D-C10-3); a "soft" idempotence that re-builds anyway (defeats D-C10-1 and the C10-PT-01 1-min warm target); calling AZ-324's `ManifestVerifier` for the idempotence check (overkill: a full verify on every operator invocation triples warm-path cost); deleting partial engines on failure (operators rely on them for diagnostics per AC-9).