Decompose Step 6 snapshot: 140 task specs + contract docs

Closes out greenfield Step 6 (Decompose) for all 14 components
(C1-C13 + cross-cutting helpers/replay). Covers tasks AZ-266..AZ-446
plus the _dependencies_table.md and component contract documents.

State file updated to greenfield Step 7 (Implement), not_started.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-11 00:39:48 +03:00
parent 8171fcb29e
commit 880eabcb3f
172 changed files with 22897 additions and 35 deletions
@@ -0,0 +1,145 @@
# Contract: CacheProvisioner (C10)
**Type**: Python Protocol (`@runtime_checkable`) — local in-process API.
**Producer task**: AZ-325_c10_cache_provisioner
**Consumers**:
- C12 Operator Tooling — orchestrates the F1 build sequence `C11 TileDownloader → CacheProvisioner.build_cache_artifacts` and surfaces the `BuildReport` to the operator (E-C12 / AZ-253).
- C13 FDR — out of scope for build (F1 is offline / pre-flight); F2's verify is owned by the `ManifestVerifier` contract.
## Purpose
`CacheProvisioner` is the public top-level surface for the C10 build phase. It composes `EngineCompiler` (AZ-321), `DescriptorBatcher` (AZ-322), and `ManifestBuilder` (AZ-323) into a single idempotent operation that the operator runs after `C11 TileDownloader` has populated C6. The Provisioner enforces D-C10-1 idempotence (skip rebuild when the build-identity hash matches the prior Manifest), D-C10-3 ManifestCoverageError (every shipped artifact under `cache_root` MUST be in the Manifest — no smuggled files), and D-C10-6 hardware-tied engine reuse (delegated to AZ-321). It does NOT touch `satellite-provider` (per epic § Architecture notes); tile I/O is C11's responsibility.
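The idempotence rule (D-C10-1) can be pictured as hashing a canonical form of the request and comparing it with the hash recorded in the prior Manifest. The following is a minimal sketch under stated assumptions, not the AZ-323 implementation: the set of fields that feed the build-identity hash, the top-level `manifest_hash` key in `Manifest.json`, and the helper names `build_identity_hash` and `is_idempotent_no_op` are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def build_identity_hash(bbox: tuple[float, float, float, float],
                        zoom_levels: tuple[int, ...],
                        sector_class: str,
                        calibration_sha256: str) -> str:
    """Hash a canonical JSON form of the request (field set is an assumption)."""
    identity = {
        "bbox": list(bbox),
        # Assumption: zoom-level order does not change the build identity.
        "zoom_levels": sorted(zoom_levels),
        "sector_class": sector_class,
        "calibration_sha256": calibration_sha256,
    }
    # sort_keys plus fixed separators give a byte-stable encoding across runs.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_idempotent_no_op(cache_root: Path, identity_hash: str) -> bool:
    """True when a prior Manifest records the same build identity."""
    manifest_path = cache_root / "Manifest.json"
    if not manifest_path.is_file():
        return False  # cold build: nothing to compare against
    try:
        prior = json.loads(manifest_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return False  # unreadable prior Manifest: rebuild rather than trust it
    # Assumed layout: `manifest_hash` is a top-level key of Manifest.json.
    return isinstance(prior, dict) and prior.get("manifest_hash") == identity_hash
```

When `is_idempotent_no_op` returns `True`, an implementation would return `outcome=IDEMPOTENT_NO_OP` without touching the existing Manifest, which is what CP-INV-1 requires.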
## Public Surface
```python
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class CacheProvisioner(Protocol):
    """Public top-level orchestrator for C10 cache build.

    Idempotent: if the prior Manifest's build-identity hash matches the
    request's, returns `outcome=IDEMPOTENT_NO_OP` without rebuilding.
    Otherwise composes engine compile + descriptor population + Manifest
    write + coverage check.
    """

    def build_cache_artifacts(self, request: BuildRequest) -> BuildReport: ...

    def compile_engines_for_corpus(self, request: EngineCompileRequest) -> tuple[EngineCacheEntry, ...]: ...
```
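Because the Protocol is `@runtime_checkable`, an `isinstance` check passes for any object that has both methods, whether or not it inherits from `CacheProvisioner`. The sketch below illustrates this with a trimmed copy of the Protocol (DTO types replaced by `Any` for brevity); `FakeProvisioner` and `Incomplete` are hypothetical classes introduced only for the example.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CacheProvisioner(Protocol):
    # Trimmed copy of the contract; DTO types replaced by Any for brevity.
    def build_cache_artifacts(self, request: Any) -> Any: ...
    def compile_engines_for_corpus(self, request: Any) -> tuple[Any, ...]: ...


class FakeProvisioner:
    """Hypothetical test double; it does not inherit from the Protocol."""

    def build_cache_artifacts(self, request: Any) -> Any:
        return None

    def compile_engines_for_corpus(self, request: Any) -> tuple[Any, ...]:
        return ()


class Incomplete:
    """Lacks compile_engines_for_corpus, so it does not conform."""

    def build_cache_artifacts(self, request: Any) -> Any:
        return None


conforms = isinstance(FakeProvisioner(), CacheProvisioner)
rejects = not isinstance(Incomplete(), CacheProvisioner)
```

A runtime `isinstance` check only confirms that the methods exist; it does not validate signatures or return types, so a static type checker is still needed for that.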
### DTOs
```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class SectorClassification(Enum):
    ACTIVE_CONFLICT = "active_conflict"
    STABLE_REAR = "stable_rear"


class BuildOutcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    IDEMPOTENT_NO_OP = "idempotent_no_op"


@dataclass(frozen=True)
class Bbox:
    lat_min: float
    lon_min: float
    lat_max: float
    lon_max: float


@dataclass(frozen=True)
class BuildRequest:
    bbox: Bbox
    zoom_levels: tuple[int, ...]
    sector_class: SectorClassification
    calibration_path: Path
    cache_root: Path
    key_path: Path  # operator signing key per C10-ST-01


@dataclass(frozen=True)
class BuildReport:
    outcome: BuildOutcome
    engines_built: int
    engines_reused: int
    descriptors_generated: int
    manifest_hash: str | None
    manifest_path: Path | None
    failure_reason: str | None
    elapsed_s: float
```
(`EngineCompileRequest` and `EngineCacheEntry` are AZ-321's; re-exported for convenience.)
### Exceptions
| Exception | When raised | Caller action |
|-----------|------------|---------------|
| `BuildLockHeldError` | Another `build_cache_artifacts` invocation holds the cache_root lockfile (per description.md § 7 race-condition mitigation). | Operator waits / kills the other process; not retried automatically. |
| `ManifestCoverageError` | After build, an orphan file exists under `cache_root` that is not listed in the Manifest. | Build is rolled back to prior-good Manifest (if present); operator inspects the orphan. |
| `EngineBuildError`, `CalibrationCacheError` | Propagated from AZ-321 / AZ-298. | Operator triages GPU / calibration. |
| `DescriptorBatchError` | Propagated from AZ-322. | Operator triages GPU OOM / model. |
| `ManifestWriteError` | Propagated from AZ-323 (key fingerprint mismatch in operator mode, key load failure, atomic-write failure). | Operator inspects key / disk. |
`BuildOutcome.FAILURE` is reserved for soft failures captured in `BuildReport` (missing tiles in C6, coverage warning when configured non-strict). Hard errors raise.
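The coverage check behind `ManifestCoverageError` amounts to walking `cache_root` and comparing what is on disk with what the Manifest lists. The sketch below is one way to do that, under assumptions: the excluded file names are the ones this contract mentions, excluding the lockfile is a guess, the exception class is a local stand-in, and `find_orphans` / `assert_coverage` are hypothetical helpers.

```python
from pathlib import Path

# Names taken from this contract; excluding the lockfile is an assumption.
EXCLUDED = frozenset({"Manifest.json", "Manifest.json.sha256",
                      "Manifest.json.sig", ".c10.lock"})


class ManifestCoverageError(Exception):
    """Local stand-in for the contract's exception."""


def find_orphans(cache_root: Path, listed: set[str]) -> list[str]:
    """Return files under cache_root that the Manifest does not list."""
    orphans = []
    for path in sorted(cache_root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(cache_root).as_posix()
        if rel in EXCLUDED:
            continue  # the Manifest, its sidecar, its signature, the lock
        if rel not in listed:
            orphans.append(rel)
    return orphans


def assert_coverage(cache_root: Path, listed: set[str]) -> None:
    """Raise if any file on disk is missing from the Manifest listing."""
    orphans = find_orphans(cache_root, listed)
    if orphans:
        raise ManifestCoverageError(f"unlisted files: {orphans}")
```

Rollback to the prior-good Manifest is deliberately left out of the sketch; it only shows how an orphan is detected.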
## Invariants
| ID | Invariant | Why |
|----|-----------|-----|
| CP-INV-1 | Idempotence: if `Manifest.json` exists at `cache_root` AND its `manifest_hash` equals the build-identity hash for the new request → `outcome=IDEMPOTENT_NO_OP`, ZERO new compiles, ZERO new embeds, ZERO new Manifest writes; the existing Manifest is left untouched. | D-C10-1; warm re-run ≤ 1 min envelope (C10-PT-01). |
| CP-INV-2 | A failed `build_cache_artifacts` does NOT leave the cache in a worse state than at the start: new engines may exist (cache hits) but the Manifest is either the previous-good one OR rolled back; the FAISS index is either the previous-good one OR atomically replaced. | Operators can retry safely. |
| CP-INV-3 | A SUCCESS outcome implies the coverage check passed (no `ManifestCoverageError` condition): every file under `cache_root` (recursively, excluding the Manifest itself + sidecars + sig) is listed in the Manifest's artifacts. | D-C10-3 — no smuggled artifacts in the takeoff cache. |
| CP-INV-4 | Concurrent `build_cache_artifacts` calls on the same `cache_root` are mutually exclusive via a filesystem lockfile at `cache_root/.c10.lock`. | description.md § 7 race-condition mitigation. |
| CP-INV-5 | `cache_root` must already exist; `build_cache_artifacts` does NOT create the directory tree (operator workflow places it). | Avoids accidental builds in unintended paths. |
| CP-INV-6 | No network calls (no `satellite-provider`, no Postgres TLS to a remote DB beyond the local instance, no metric push). | Epic § Architecture notes: C10 is workstation-local. |
| CP-INV-7 | The operator key file at `request.key_path` is opened exactly once (via AZ-323's signer) and zeroized when out of scope; this contract does NOT cache the key in memory across calls. | Operator key hygiene. |
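
The mutual exclusion in CP-INV-4 can be sketched with an exclusive-create lockfile. This is an illustration only: the contract names the path `cache_root/.c10.lock` but not the locking mechanism, so the use of `O_CREAT | O_EXCL`, the PID written into the file, the local exception class, and the helper name `build_lock` are assumptions.

```python
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


class BuildLockHeldError(Exception):
    """Local stand-in for the contract's exception."""


@contextmanager
def build_lock(cache_root: Path) -> Iterator[Path]:
    """Hold cache_root/.c10.lock for the duration of the with-block."""
    lock_path = cache_root / ".c10.lock"
    try:
        # O_CREAT | O_EXCL fails atomically if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError as exc:
        raise BuildLockHeldError(f"lock held: {lock_path}") from exc
    try:
        # Record the owner PID to help the operator find the other process.
        with os.fdopen(fd, "w", encoding="ascii") as handle:
            handle.write(str(os.getpid()))
        yield lock_path
    finally:
        lock_path.unlink(missing_ok=True)
```

A process that is killed while holding the lock leaves the file behind. The sketch does not attempt stale-lock recovery, which matches the caller action for `BuildLockHeldError` (the operator resolves it manually).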
## Non-Goals
- Tile fetch from `satellite-provider` — owned by E-C11 / C11 TileDownloader.
- Engine deserialization at takeoff — owned by E-C7 / AZ-298 + C5 takeoff arming.
- Manifest verification — owned by AZ-324's `ManifestVerifier` (separate contract).
- Multi-cache management (rotating between sector caches) — operator runs `build_cache_artifacts` per cache_root.
- Garbage collection of stale engines — explicit operator action; not part of the build flow.
- Resumable build (mid-build process kill → resume from last batch) — out of scope; restart from scratch.
## Versioning
- v1.0.0 — initial Protocol surface (this document).
- Breaking changes: changing `BuildRequest` shape, removing a `BuildOutcome`, adding a required field — bump major.
- Additive changes: new optional kwarg, new `BuildOutcome` value, new field on `BuildReport` — bump minor. Consumers MUST handle unknown outcomes gracefully (treat as FAILURE).
- Patch: clarifications, doc edits.
| Version | Date | Notes | Author |
|---------|------|-------|--------|
| 1.0.0 | 2026-05-10 | Initial contract — produced by AZ-325 (E-C10 decomposition) | autodev |
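
The rule that consumers treat unknown outcomes as FAILURE is easiest to satisfy with an allowlist rather than a check for `FAILURE` itself. A minimal sketch, assuming a consumer written against v1.0.0 that later meets a value added in a minor release; `PARTIAL` and `cache_is_usable` are hypothetical names used only for illustration.

```python
from enum import Enum


class BuildOutcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    IDEMPOTENT_NO_OP = "idempotent_no_op"
    PARTIAL = "partial"  # hypothetical value from a future minor version


# Outcome values this consumer explicitly knows to be usable.
_USABLE = frozenset({"success", "idempotent_no_op"})


def cache_is_usable(outcome: Enum) -> bool:
    """Anything not explicitly known as usable is handled as FAILURE."""
    return outcome.value in _USABLE
```

Checking `outcome is not BuildOutcome.FAILURE` instead would silently accept any newly added value, which is the failure mode the versioning rule guards against.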
## Test Cases (consumer side)
| ID | Scenario | Expected Outcome |
|----|----------|------------------|
| CP-TC-1 | Cold build with all dependencies satisfied | `outcome=SUCCESS`; counts > 0; Manifest at `cache_root/Manifest.json` |
| CP-TC-2 | Warm build, identical request | `outcome=IDEMPOTENT_NO_OP`; counts all 0; Manifest unchanged on disk |
| CP-TC-3 | Warm build, different bbox | `outcome=SUCCESS`; rebuild happens; new Manifest replaces old (atomic) |
| CP-TC-4 | C6 has zero tiles for the requested scope | `outcome=FAILURE`; `failure_reason` directs operator to run C11 first |
| CP-TC-5 | Concurrent invocation while another build in progress | `BuildLockHeldError`; second invocation does not corrupt state |
| CP-TC-6 | An orphan file exists under `cache_root` after build | `ManifestCoverageError`; rolled back to prior Manifest if present |
| CP-TC-7 | Operator key file fingerprint not in allowlist (operator mode) | `ManifestWriteError` (propagated from AZ-323); ZERO file writes |
| CP-TC-8 | `EngineBuildError` mid-compile | Exception propagates; partial cache state consistent (atomic engines on disk for those that succeeded; Manifest NOT updated) |
| CP-TC-9 | `DescriptorBatchError` (persistent CUDA OOM) | Exception propagates; engines may be on disk; Manifest NOT updated |
| CP-TC-10 | Conformance: `isinstance(impl, CacheProvisioner)` | `True` |
| CP-TC-11 | `compile_engines_for_corpus` directly callable for re-compile-only flows | Returns `tuple[EngineCacheEntry, ...]`; no descriptor / Manifest work |
| CP-TC-12 | Cold build wall-clock benchmark on Tier-1 dev workstation, 1k tiles, 3 backbones | ≤ 12 min (NFR C10-PT-01) |
| CP-TC-13 | Warm idempotent re-run benchmark | ≤ 1 min (NFR C10-PT-01) |
@@ -0,0 +1,134 @@
# Contract: ManifestVerifier (C10)
**Type**: Python Protocol (`@runtime_checkable`) — local in-process API.
**Producer task**: AZ-324_c10_manifest_verifier
**Consumers**:
- C5 State Estimator / takeoff-arming gate (F2 phase) — refuses to arm if `verify_manifest` does not return `outcome=PASS`. (E-C5 / AZ-249.)
- C12 Operator Tooling — runs verify before flight handoff to surface drift between F1 build time and F2 takeoff (E-C12 / AZ-253).
- C13 FDR — emits a `manifest.verify` record on every airborne verify call (`outcome` field gates downstream).
## Purpose
`ManifestVerifier` is the read-only validator for the C10-produced cache Manifest. It is the takeoff trust anchor for AC-NEW-1 ("no engine deserialization at takeoff before manifest verify") and D-C10-3 ("SHA-256 content-hash gate over every shipped artifact"). At F2 takeoff, every artifact listed in the Manifest is re-hashed and compared to its recorded digest; any mismatch fails the verdict and prevents arming. The Ed25519 signature over the Manifest is verified against a pinned operator public key before any artifact is touched — defence-in-depth against a spliced Manifest pointing at attacker-chosen content hashes.
## Public Surface
```python
from pathlib import Path
from typing import Protocol, runtime_checkable

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey


@runtime_checkable
class ManifestVerifier(Protocol):
    """Read-only verifier for a C10-produced Manifest.json.

    Fail-closed: any deviation in signature, schema, or per-artifact hash
    yields `VerificationResult(outcome=FAIL, ...)`. Never raises on a verify
    failure — operators / takeoff arming code branch on `outcome`.

    A missing Manifest.json or signature file is a verify outcome
    (`MANIFEST_NOT_FOUND` / `SIGNATURE_NOT_FOUND`), not an exception.
    Raises only on resource errors that no `VerifyFailReason` covers;
    those are environment problems, not verify outcomes.
    """

    def verify_manifest(
        self,
        *,
        manifest_path: Path,
        trusted_public_keys: tuple[Ed25519PublicKey, ...],
    ) -> VerificationResult: ...
```
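On the consumer side, the fail-closed behaviour means arming code should accept only an unambiguous PASS. The sketch below shows one possible gate; it uses trimmed stand-ins for the contract's DTOs (only the fields the gate reads), and `may_arm` is a hypothetical function name, not part of the C5 design.

```python
from dataclasses import dataclass
from enum import Enum


class VerifyOutcome(Enum):
    PASS = "pass"
    FAIL = "fail"


@dataclass(frozen=True)
class VerificationResult:
    # Trimmed stand-in: only the fields the gate reads.
    outcome: VerifyOutcome
    fail_reasons: tuple[str, ...] = ()


def may_arm(result: VerificationResult) -> bool:
    """Arm only on an unambiguous PASS; everything else refuses."""
    if result.outcome is not VerifyOutcome.PASS:
        return False
    # Defensive re-check of MV-INV-1: a PASS must carry no fail reasons.
    return len(result.fail_reasons) == 0
```

The second check is redundant if the verifier honours MV-INV-1, but it keeps the gate safe against a non-conforming implementation.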
### DTOs
```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class VerifyOutcome(Enum):
    PASS = "pass"
    FAIL = "fail"


class VerifyFailReason(Enum):
    MANIFEST_NOT_FOUND = "manifest_not_found"
    SIGNATURE_NOT_FOUND = "signature_not_found"
    SIGNATURE_INVALID = "signature_invalid"
    UNTRUSTED_PUBLIC_KEY = "untrusted_public_key"
    SCHEMA_VIOLATION = "schema_violation"
    ARTIFACT_MISSING = "artifact_missing"
    ARTIFACT_HASH_MISMATCH = "artifact_hash_mismatch"
    TILES_COVERAGE_MISMATCH = "tiles_coverage_mismatch"
    MANIFEST_SELF_HASH_MISMATCH = "manifest_self_hash_mismatch"


@dataclass(frozen=True)
class ArtifactCheck:
    relative_path: str
    expected_sha256: str
    actual_sha256: str | None  # None if file missing
    matched: bool


@dataclass(frozen=True)
class VerificationResult:
    outcome: VerifyOutcome
    fail_reasons: tuple[VerifyFailReason, ...]
    fail_details: tuple[str, ...]  # human-readable diagnostic per reason
    signing_public_key_fingerprint: str | None  # populated when signature parses, even if untrusted
    per_artifact_checks: tuple[ArtifactCheck, ...]
    elapsed_ms: int
```
## Invariants
| ID | Invariant | Why |
|----|-----------|-----|
| MV-INV-1 | The verifier is fail-closed: any deviation produces `outcome=FAIL` with at least one `VerifyFailReason`; never returns `PASS` with non-empty `fail_reasons`. | AC-NEW-1 / D-C10-3 — takeoff cannot arm on a partial verify. |
| MV-INV-2 | Signature verification happens BEFORE per-artifact hashing. If the signature is invalid or untrusted, no file content is read beyond the Manifest itself. | Defence-in-depth: a malicious Manifest must not trick the verifier into hashing attacker-chosen file paths. |
| MV-INV-3 | The Manifest's own `Manifest.json.sha256` sidecar (written by AZ-323) must match `sha256(Manifest.json)`; mismatch is `MANIFEST_SELF_HASH_MISMATCH`. | The sidecar is the entry point of the chain of trust — drift here means tampering or atomic-write failure. |
| MV-INV-4 | Per-artifact paths are interpreted relative to `manifest_path.parent`; absolute paths in the Manifest are rejected as `SCHEMA_VIOLATION`. | Prevents a malicious Manifest from pointing outside `cache_root`. |
| MV-INV-5 | `tiles_coverage` mismatch is reported separately from `ARTIFACT_HASH_MISMATCH` because tiles are hashed in aggregate (per AZ-323). The verifier re-derives the aggregate hash from a `TileMetadataStore` query if available, OR (in airborne F2 mode) treats the recorded `tiles_coverage_sha256` as authoritative and only verifies the Manifest signature + non-tile artifacts. | Airborne C5 may not load 100k per-tile rows just to arm; the trust chain is signature → manifest_hash → tiles_coverage_sha256. C12 / operator mode does the full re-derivation. |
| MV-INV-6 | The verifier never writes to disk, never opens network sockets, never calls C13. Telemetry is the caller's responsibility. | Read-only contract — composable in airborne C5 + operator C12 contexts without side-effect surprise. |
| MV-INV-7 | `elapsed_ms` is recorded for every call (pass or fail) so operators and C5 can observe drift in verify cost on slow disks. | NFR for C10-PT-01's takeoff load budget. |
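
MV-INV-4 can be enforced with a small path check before any artifact is opened. The sketch below is one possible reading of that invariant, with two additions that go beyond the stated rule and are labelled as assumptions: it also rejects `..` segments and backslashes. `is_safe_relative_path` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def is_safe_relative_path(raw: str) -> bool:
    """Accept only relative paths that stay inside manifest_path.parent."""
    if not raw or "\\" in raw:
        return False  # assumption: Manifest paths use forward slashes only
    path = PurePosixPath(raw)
    if path.is_absolute():
        return False  # MV-INV-4: absolute paths are a SCHEMA_VIOLATION
    # Assumption beyond the stated invariant: ".." segments are rejected too,
    # since they could otherwise escape the cache directory.
    return ".." not in path.parts
```

For example, `is_safe_relative_path("engines/a.engine")` is accepted, while `"/etc/passwd"` and `"../outside.bin"` are rejected.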
## Non-Goals
- **Signature production** — owned by AZ-323's `ManifestSigner`. The verifier never signs.
- **Cache repair** — the verifier reports failures; rebuild is owned by AZ-325 (the orchestrator).
- **Trusted-key distribution / revocation** — `trusted_public_keys` is supplied by the caller; this contract does not define a key registry.
- **Coverage check (orphan files in cache_root)** — owned by AZ-325 (`ManifestCoverageError`); the verifier checks "every Manifest entry exists and matches", not "every cache_root file is in the Manifest".
- **Rollback to prior-good Manifest** — out of scope; caller decides next action on `FAIL`.
## Versioning
- v1.0.0 — initial Protocol surface (this document).
- Breaking changes — adding a required argument, removing a `VerifyFailReason`, changing semantics of an existing one — bump major.
- Additive changes — new `VerifyFailReason` value, new optional kwarg on `verify_manifest`, new field on `VerificationResult` — bump minor. Consumers MUST handle unknown reasons gracefully (default to FAIL).
- Patch — clarifications, doc edits, bug-fix tests.
| Version | Date | Notes | Author |
|---------|------|-------|--------|
| 1.0.0 | 2026-05-10 | Initial contract — produced by AZ-324 (E-C10 decomposition) | autodev |
## Test Cases (consumer side)
| ID | Scenario | Expected Outcome |
|----|----------|------------------|
| MV-TC-1 | Valid Manifest + trusted key + all artifacts present + hashes match | `outcome=PASS`, empty `fail_reasons`, `per_artifact_checks` all `matched=True` |
| MV-TC-2 | Manifest.json missing | `outcome=FAIL`, `fail_reasons=(MANIFEST_NOT_FOUND,)`; no further work |
| MV-TC-3 | Manifest.json.sig missing | `outcome=FAIL`, `fail_reasons=(SIGNATURE_NOT_FOUND,)`; `signing_public_key_fingerprint=None` |
| MV-TC-4 | Signature does not verify | `outcome=FAIL`, `fail_reasons=(SIGNATURE_INVALID,)`; no per-artifact checks performed |
| MV-TC-5 | Signature verifies but key is not in `trusted_public_keys` | `outcome=FAIL`, `fail_reasons=(UNTRUSTED_PUBLIC_KEY,)`; fingerprint populated |
| MV-TC-6 | Schema violation (missing required key, absolute path, wrong types) | `outcome=FAIL`, `fail_reasons=(SCHEMA_VIOLATION,)` with detail naming the field |
| MV-TC-7 | One engine missing on disk | `outcome=FAIL`, `fail_reasons=(ARTIFACT_MISSING,)`; `per_artifact_checks` shows that engine with `actual_sha256=None, matched=False` |
| MV-TC-8 | One engine present but bytes drifted | `outcome=FAIL`, `fail_reasons=(ARTIFACT_HASH_MISMATCH,)`; offending check has `matched=False` |
| MV-TC-9 | Multiple failures (missing + drifted + signature OK) | `fail_reasons` contains BOTH `ARTIFACT_MISSING` and `ARTIFACT_HASH_MISMATCH`; per-artifact checks complete (don't short-circuit on first failure) |
| MV-TC-10 | `Manifest.json.sha256` sidecar mismatch | `outcome=FAIL`, `fail_reasons=(MANIFEST_SELF_HASH_MISMATCH,)`; signature path NOT consulted |
| MV-TC-11 | Tampered Manifest body but matching sidecar | `outcome=FAIL`, `fail_reasons=(SIGNATURE_INVALID,)` (the signature cannot match if body changed even by 1 byte) |
| MV-TC-12 | Conformance: `isinstance(my_impl, ManifestVerifier)` | `True` |
| MV-TC-13 | Tier-2 Tile-coverage check (operator mode with TileMetadataStore) | If recomputed `tiles_coverage_sha256` differs → `TILES_COVERAGE_MISMATCH`; if matches → that part passes |
| MV-TC-14 | Empty `trusted_public_keys` | `outcome=FAIL`, `fail_reasons=(UNTRUSTED_PUBLIC_KEY,)` (every key is untrusted by definition) |
| MV-TC-15 | Pristine Manifest verified inside 100 ms on Tier-2 (excludes per-tile re-walk) | `elapsed_ms ≤ 100` for the signature + non-tile artifact path |