C10 — Pre-flight Cache Provisioning
1. High-Level Overview
Purpose: build the model-derived pre-flight cache artifacts on top of an already-populated tile store, and verify them at takeoff. After C11 TileDownloader has fetched tiles into C6, C10 orchestrates: compile/deserialize TensorRT engines via C7 → batch each tile through C2's backbone for descriptors → atomically write FAISS HNSW index with SHA-256 sidecars (D-C10-3) → write Manifest with hash of (model + calibration + corpus + sector_class + takeoff_origin) for D-C10-1 idempotence. The takeoff_origin is supplied by C12 (derived from Flight.waypoints[0] via the FlightsApiClient, ADR-010 + AZ-489); C10 treats it as one more identity field and bakes it into both the Manifest body and the manifest-hash. At F2 takeoff load, run verify_manifest (D-C10-3 SHA-256 content-hash gate) before allowing the system to arm; the verifier also surfaces takeoff_origin so the companion's composition root can pass it to C5.set_takeoff_origin(origin, sigma_horiz_m, sigma_vert_m) before any sensor sample (AZ-490).
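The D-C10-1 idempotence check can be sketched as a hash over the identity fields listed above. This is a minimal illustration, not the actual Manifest code: the function names, the argument names, and the JSON canonicalisation are assumptions, and the real field layout is defined in data_model.md.

```python
import hashlib
import json
from typing import Optional, Tuple


def manifest_hash(
    model_sha256: str,
    calibration_sha256: str,
    corpus_sha256: str,
    sector_class: str,
    takeoff_origin: Optional[Tuple[float, float, float]],
) -> str:
    """Hash the build's identity fields into one hex digest (hypothetical helper)."""
    identity = {
        "model": model_sha256,
        "calibration": calibration_sha256,
        "corpus": corpus_sha256,
        "sector_class": sector_class,
        # None is kept as None, so a build without an origin hashes
        # differently from any build that has one.
        "takeoff_origin": list(takeoff_origin) if takeoff_origin is not None else None,
    }
    # sort_keys plus fixed separators give a byte-stable serialization.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_idempotent_no_op(new_hash: str, prior_hash: Optional[str]) -> bool:
    """Skip the rebuild only when a prior build exists and its hash matches."""
    return prior_hash is not None and new_hash == prior_hash
```

Because takeoff_origin is part of the hashed identity, changing only the takeoff point produces a different hash and therefore a rebuild rather than a no-op.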
C10 does NOT touch satellite-provider. Tile I/O, both download (F1 inbound) and post-landing upload (F10), lives in C11 (Tile Manager). C10 reads tiles from C6 and writes engines, descriptors, and the manifest to the filesystem and Postgres. The split is operational: C11 carries the operator-side network identity (TLS API key for download, per-flight signing key for upload) and the airborne-exclusion property (ADR-004); C10 carries the model identity and the takeoff-load verifier, neither of which needs to leave the workstation/companion enclave at runtime.
Architectural Pattern: Coordinator — single concrete implementation CacheProvisioner behind two interfaces (CacheProvisioner for the F1 build phase, ManifestVerifier for F2's content-hash gate). The interfaces are split because F2 only needs the verifier and shouldn't pull in the full provisioning code path.
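One way to express the interface split is with typing.Protocol, so that F2 code is typed against the verifier only. The sketch below is illustrative: the DTOs are trimmed to one field each, the method bodies are placeholders, and DefaultCacheProvisioner and f2_takeoff_gate are hypothetical names chosen here to avoid clashing with the interface name.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass
class BuildRequest:  # trimmed to one field for the sketch
    cache_root: Path


@dataclass
class BuildReport:
    outcome: str


@dataclass
class VerificationResult:
    outcome: str


@runtime_checkable
class CacheProvisioner(Protocol):
    def build_cache_artifacts(self, request: BuildRequest) -> BuildReport: ...


@runtime_checkable
class ManifestVerifier(Protocol):
    def verify_manifest(self, manifest_path: Path) -> VerificationResult: ...


class DefaultCacheProvisioner:
    """Hypothetical name for the single concrete coordinator."""

    def build_cache_artifacts(self, request: BuildRequest) -> BuildReport:
        return BuildReport(outcome="success")  # placeholder body

    def verify_manifest(self, manifest_path: Path) -> VerificationResult:
        return VerificationResult(outcome="pass")  # placeholder body


def f2_takeoff_gate(verifier: ManifestVerifier, manifest_path: Path) -> bool:
    # F2 depends on the narrow interface only, never on build_cache_artifacts.
    return verifier.verify_manifest(manifest_path).outcome == "pass"
```

The same object satisfies both Protocols structurally, so the composition root can hand it to F1 as a CacheProvisioner and to F2 as a ManifestVerifier.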
Upstream dependencies:
- C12 OperatorTooling → triggers build_cache_artifacts(...) after C11 TileDownloader has populated C6.
- C6 TileStore + TileMetadataStore + DescriptorIndex → read source (tiles + metadata), write target (FAISS index).
- C7 InferenceRuntime → engine compile + deserialize.
- C2 backbone (via C7 engine) → descriptor batched generation.
Downstream consumers:
- F2 takeoff load → consumes verify_manifest outcome.
2. Internal Interfaces
Interface: CacheProvisioner
| Method | Input | Output | Async | Error Types |
|---|---|---|---|---|
| build_cache_artifacts | BuildRequest | BuildReport | No (offline; minutes) | EngineBuildError, DescriptorBatchError, ManifestWriteError, IdempotentNoOp |
| compile_engines_for_corpus | BackboneList | list[EngineCacheEntry] | No | EngineBuildError, CalibrationCacheError |
Interface: ManifestVerifier
| Method | Input | Output | Async | Error Types |
|---|---|---|---|---|
| verify_manifest | manifest_path: Path | VerificationResult | No | ManifestNotFoundError, ContentHashMismatchError |
Input/Output DTOs:
BuildRequest:
  bbox: BoundingBox (lat_min, lon_min, lat_max, lon_max)  # scopes which C6 tiles are in the manifest
  zoom_levels: list[int]
  sector_class: enum {active_conflict, stable_rear}  # baked into manifest
  calibration_path: Path
  cache_root: Path
  takeoff_origin: LatLonAlt | None  # ADR-010 / AZ-489; baked into manifest + hash
  flight_id: UUID | None  # ADR-010; pass-through provenance, baked into manifest
BuildReport:
  engines_built: int
  engines_reused: int
  descriptors_generated: int
  manifest_hash: sha256
  outcome: enum {success, failure, idempotent_no_op}
  failure_reason: string (optional)
Manifest: see data_model.md (carries takeoff_origin + flight_id when set; hash includes them)
EngineCacheEntry: see data_model.md
VerificationResult:
  manifest_hash_match: bool
  per_artifact_hash_match: dict[Path, bool]
  takeoff_origin: LatLonAlt | None  # passed through from manifest for C5 warm-start (AZ-490)
  flight_id: UUID | None
  outcome: enum {pass, fail}
  fail_reasons: list[string]
3. External API Specification
Not applicable. C10 has no network surface — all I/O is local filesystem + local Postgres.
4. Data Access Patterns
C10 reads tiles rows from C6 (scoped to the build's bbox + zoom_levels), writes the FAISS .index to filesystem via Sha256Sidecar, and writes Manifest + manifests row to Postgres via C6.
Storage Estimates
| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
|---|---|---|---|---|
| Manifest | one per build per cached area | ~10 KB (YAML/JSON) | negligible | per build |
| SHA-256 sidecars | one per artifact (.index, calibration JSON, manifest, .engine) | 64 B (hex digest) | negligible | per build |
Data Management
Seed data: none — C10 writes from scratch (or D-C10-1 idempotently no-ops). Tiles must already be in C6 (placed there by C11 TileDownloader); a missing-tiles condition is a build error, not a download trigger.
Rollback: D-C10-1 manifest-hash check makes provisioning idempotent. Atomic writes (atomicwrites package) prevent partial states; on partial failure, the previous-good cache remains until the new one is fully written.
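The atomic-write-plus-sidecar pattern can be sketched with the standard library alone. Note the substitution: this uses tempfile and os.replace rather than the atomicwrites package named above, and the function names and the `.sha256` sidecar suffix are assumptions for illustration, not the Sha256Sidecar helper's actual API.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: Path, payload: bytes) -> None:
    """Write payload to a temp file in the target's directory, then rename over it."""
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Rename is atomic when source and target are on the same filesystem.
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the temp file; the previous-good target is left untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def write_with_sidecar(target: Path, payload: bytes) -> str:
    """Write the artifact, then its SHA-256 hex digest next to it."""
    digest = hashlib.sha256(payload).hexdigest()
    atomic_write_bytes(target, payload)
    sidecar = target.with_name(target.name + ".sha256")  # assumed suffix
    atomic_write_bytes(sidecar, digest.encode("ascii"))
    return digest
```

In this sketch the artifact is replaced before its sidecar, so a crash between the two renames leaves a digest that no longer matches; a content-hash check would then fail closed instead of accepting a half-updated cache.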
5. Implementation Details
Algorithmic Complexity: dominated by descriptor batched generation on Jetson (GPU-bound). Worst-case ~400 km² provisioning is ≤ tens of minutes (offline, not time-critical per AC-8.3). Tile network bandwidth is not in C10's budget — that cost is in C11.
State Management: stateless w.r.t. flight lifetime. No connection state — all dependencies are local.
Key Dependencies:
| Library | Version | Purpose |
|---|---|---|
| atomicwrites | latest | Atomic file replacement for .index + Manifest (D-C10-3) |
| hashlib (stdlib) | stdlib | SHA-256 content-hash sidecars |
| PyYAML / orjson | per project pin | Manifest serialization |
| numpy | per project pin | Descriptor batch ndarray container (AZ-322 DescriptorBatcher) |
AZ-322 internal phase — DescriptorBatcher:
The populate_descriptors phase walks every tile in C6 for the requested
(bbox, zoom_levels, sector_class), embeds them through C7's InferenceRuntime
(via C7EngineBackboneEmbedder, the default BackboneEmbedder impl), and
hands the resulting (N, descriptor_dim) ndarray to AZ-306's
DescriptorIndex.rebuild_from_descriptors for atomic FAISS index write.
CUDA OOM is handled via halve-and-retry bounded by C10BatcherConfig.max_oom_retries
(default 1: 64 → 32, then succeed-or-fail-fast) so a real GPU regression
surfaces in seconds rather than via silent retries. Per-10% progress is
emitted both as DEBUG logs (c10.descriptor.progress) and via an optional
progress_callback so operator tooling can wire a TTY/GUI bar without
touching the batcher itself. The descriptor int64 id formula is the
canonical AZ-306 scheme (int.from_bytes(sha256("zoom|lat|lon").first8, "big", signed=True));
a formula invented locally to avoid a circular dependency back into C6 internals
would break AC-6.
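The halve-and-retry loop, the per-10% progress hook, and the id formula can be sketched together as follows. Several things are assumptions of this sketch rather than facts about DescriptorBatcher: plain Python lists stand in for the (N, descriptor_dim) ndarray, MemoryError stands in for a CUDA OOM, the retry budget is counted once across the whole run, an empty corpus is mapped to DescriptorBatchError, and the string formatting of lat/lon inside the hash input is a guess.

```python
import hashlib
from typing import Callable, List, Optional, Sequence

Descriptor = List[float]


class DescriptorBatchError(RuntimeError):
    """Raised when embedding cannot complete within the retry budget."""


def descriptor_id(zoom: int, lat: float, lon: float) -> int:
    # First 8 bytes of SHA-256 over "zoom|lat|lon", read as a signed big-endian int64.
    digest = hashlib.sha256(f"{zoom}|{lat}|{lon}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def embed_all(
    tiles: Sequence[object],
    embed: Callable[[Sequence[object]], List[Descriptor]],
    batch_size: int = 64,
    max_oom_retries: int = 1,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[Descriptor]:
    """Embed every tile in order, halving the batch size on MemoryError."""
    total = len(tiles)
    if total == 0:
        raise DescriptorBatchError("no tiles for this request; run C11 TileDownloader first")
    out: List[Descriptor] = []
    done = 0
    retries = 0
    next_decile = 1
    while done < total:
        batch = tiles[done:done + batch_size]
        try:
            descriptors = embed(batch)
        except MemoryError as exc:
            if retries >= max_oom_retries or batch_size <= 1:
                raise DescriptorBatchError("OOM persisted after halve-and-retry") from exc
            retries += 1
            batch_size //= 2
            continue  # retry the same offset with the smaller batch
        out.extend(descriptors)
        done += len(batch)
        # Report at most once per batch, whenever a new 10% boundary is crossed.
        decile = done * 10 // total
        if progress_callback is not None and decile >= next_decile:
            progress_callback(done, total)
            next_decile = decile + 1
    return out
```

With the defaults, a first OOM at batch size 64 drops to 32 and continues; a second OOM raises DescriptorBatchError immediately, which matches the succeed-or-fail-fast intent described above.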
Error Handling Strategy:
- EngineBuildError / CalibrationCacheError: surfaced from C7; never silently fall back; operator must intervene.
- DescriptorBatchError: CUDA OOM during descriptor generation. Halve batch size and retry once; if still OOM, surface to operator.
- ManifestWriteError: filesystem error or atomic-write rollback. Cache marked invalid; operator must re-run.
- IdempotentNoOp: D-C10-1 manifest-hash matched the prior build's hash; skip rebuild; emit no-op report.
- ContentHashMismatchError (F2): refuse takeoff; STATUSTEXT to GCS; FDR records the event; operator must re-run F1.
- Missing tiles in C6 for the requested bbox/zoom: surface as a BuildReport with outcome=failure and an explicit instruction to run C11 TileDownloader first; do not fall back to a network fetch, because that responsibility lives in C11.
6. Extensions and Helpers
| Helper | Purpose | Used By |
|---|---|---|
| Sha256Sidecar | atomic write + content-hash sidecar pattern | C6, C7, C10 |
| EngineFilenameSchema | self-describing filename per D-C10-7 | C7, C10 |
| WgsConverter | bbox math | C4, C5, C6, C8, C10 |
7. Caveats & Edge Cases
Known limitations:
- C10 depends on C6 already containing the tiles for the requested bbox + zoom levels. The F1 cache-build workflow (C12) sequences C11 TileDownloader → C10 build_cache_artifacts; C10 alone is not a complete F1.
- D-C10-3 SHA-256 content-hash gate must cover EVERY artifact: every tile (the per-tile hash is computed at C11 download time and stored in C6), the FAISS .index, the calibration JSON, and the Manifest itself. Missing sidecars are a release-blocking defect.
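A file-level version of the content-hash gate might look like the sketch below. It is a simplification under stated assumptions: the `.sha256` sidecar naming is hypothetical, VerificationResult is trimmed to the fields needed here, per-tile hashes stored in C6 are out of scope, and treating an empty artifact list as a failure is a choice made for this example.

```python
import hashlib
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Iterable, List


@dataclass
class VerificationResult:  # trimmed for the sketch
    per_artifact_hash_match: Dict[Path, bool] = field(default_factory=dict)
    fail_reasons: List[str] = field(default_factory=list)

    @property
    def outcome(self) -> str:
        return "pass" if not self.fail_reasons else "fail"


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large indexes are not loaded at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_artifacts(artifacts: Iterable[Path]) -> VerificationResult:
    """Check every artifact against its sidecar; a missing sidecar is a failure."""
    result = VerificationResult()
    for artifact in artifacts:
        sidecar = artifact.with_name(artifact.name + ".sha256")  # assumed naming
        if not artifact.is_file():
            reason = f"missing artifact: {artifact}"
        elif not sidecar.is_file():
            reason = f"missing sidecar: {sidecar}"
        elif sha256_file(artifact) != sidecar.read_text().strip():
            reason = f"content hash mismatch: {artifact}"
        else:
            reason = ""
        result.per_artifact_hash_match[artifact] = not reason
        if reason:
            result.fail_reasons.append(reason)
    if not result.per_artifact_hash_match:
        result.fail_reasons.append("no artifacts supplied")
    return result
```

Every artifact is checked even after the first failure, so fail_reasons gives the operator the full list in one pass.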
Potential race conditions:
- Concurrent build_cache_artifacts invocations on the same cache root would corrupt state. The single-process operator tool wraps each invocation in a filesystem lockfile (the same lockfile C11 honours); if a second invocation tries to start, it fails with an explicit error.
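A minimal lockfile guard, assuming exclusive file creation is the locking primitive, could look like this. The lock filename, the CacheRootLockedError type, and the cache_root_lock helper are all hypothetical; the real lockfile shared with C11 may use a different name and mechanism.

```python
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


class CacheRootLockedError(RuntimeError):
    """Hypothetical error raised when another invocation holds the lock."""


@contextmanager
def cache_root_lock(cache_root: Path, name: str = ".cache.lock") -> Iterator[Path]:
    lock_path = cache_root / name
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so creation is the lock.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise CacheRootLockedError(f"{lock_path} is held by another invocation") from exc
    try:
        # Record the owner's pid to help an operator diagnose a stuck lock.
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        yield lock_path
    finally:
        os.unlink(lock_path)
```

A crashed process would leave a stale lockfile behind in this sketch; detecting and clearing stale locks is deliberately left out.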
Performance bottlenecks:
- Descriptor batched generation is GPU-bound; batching is the main lever (D-C7-1 INT8/FP16 mix decision applies).
- Engine compile is workspace-bound on Jetson; D-C10-6 calibration cache reuse is the main lever.
8. Dependency Graph
Must be implemented after: C6 (read source for tiles, write target for FAISS), C7 (engine + descriptor runtime), C2 (backbone interface for descriptor generation; called via C7).
Can be implemented in parallel with: C8, C13.
Blocks: C12 (operator can't sequence F1 without C10 ready), F1, F2 (verify_manifest), F8 (warm-cache verify on reboot recovery).
9. Logging Strategy
| Log Level | When | Example |
|---|---|---|
| ERROR | EngineBuildError, DescriptorBatchError, ManifestWriteError, ContentHashMismatchError (F2) | C10 engine build failed: backbone=disk; takeoff blocked |
| WARN | engine cache miss falls through to build | C10 engine cache miss: model=ultra_vpr; sm=87, jp=6.2, trt=10.3, fp16; rebuild |
| INFO | Build start/end + report; verify_manifest pass | C10 build complete: engines=4, descriptors=87654, manifest_hash=…; outcome=success |
| DEBUG | per-tile descriptor batch progress | C10 descriptor batch progress: 12345/87654 (14%) |
Log format: structured JSON. Log storage: stdout (operator tool); journald (companion verify); FDR via C13 (only for F2 verify_manifest events — provisioning is offline and goes to operator-facing logs, not flight FDR).
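Structured JSON output can be approximated with a small logging.Formatter subclass. The key names (level, logger, message) and the convention of passing extra fields under a `fields` attribute are assumptions for this sketch, not the project's actual log schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured fields are passed via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name: str = "c10") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()  # writes to stderr by default
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger
```

A call such as `make_logger().info("C10 build complete", extra={"fields": {"engines": 4, "outcome": "success"}})` would then emit a single JSON line carrying both the message and the build counters.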