C10 — Pre-flight Cache Provisioning
1. High-Level Overview
Purpose: build the model-derived pre-flight cache artifacts on top of an already-populated tile store, and verify them at takeoff. After C11 TileDownloader has fetched tiles into C6, C10 orchestrates: compile/deserialize TensorRT engines via C7 → batch each tile through C2's backbone for descriptors → atomically write FAISS HNSW index with SHA-256 sidecars (D-C10-3) → write Manifest with hash of (model + calibration + corpus + sector_class) for D-C10-1 idempotence. At F2 takeoff load, run verify_manifest (D-C10-3 SHA-256 content-hash gate) before allowing the system to arm.
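The D-C10-1 idempotence check described above can be sketched as a single digest over the four build inputs. This is a minimal illustration, not the project's implementation: the function names, the canonical-JSON encoding, and the assumption that each input is already available as a SHA-256 hex string are all hypothetical.

```python
import hashlib
import json
from typing import Optional


def manifest_key(model_sha256: str, calibration_sha256: str,
                 corpus_sha256: str, sector_class: str) -> str:
    """Derive one SHA-256 over the four build inputs (hypothetical layout)."""
    # Canonical JSON (sorted keys, fixed separators) keeps the digest stable
    # regardless of dict ordering or whitespace.
    payload = json.dumps(
        {
            "model": model_sha256,
            "calibration": calibration_sha256,
            "corpus": corpus_sha256,
            "sector_class": sector_class,
        },
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_idempotent_no_op(new_key: str, previous_key: Optional[str]) -> bool:
    """True when a prior manifest exists and was built from identical inputs."""
    return previous_key is not None and new_key == previous_key
```

Changing any one input, including `sector_class`, yields a different key, so a rebuild is skipped only when all four inputs match the prior build.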
C10 does NOT touch satellite-provider. Tile I/O, both download (F1 inbound) and post-landing upload (F10), lives in C11 (Tile Manager). C10 reads tiles from C6 and writes engines, descriptors, and the manifest to the filesystem and Postgres. The split is operational: C11 carries the operator-side network identity (TLS API key for download, per-flight signing key for upload) and the airborne-exclusion property (ADR-004); C10 carries the model identity and the takeoff-load verifier, neither of which needs to leave the workstation/companion enclave at runtime.
Architectural Pattern: Coordinator — single concrete implementation CacheProvisioner behind two interfaces (CacheProvisioner for the F1 build phase, ManifestVerifier for F2's content-hash gate). The interfaces are split because F2 only needs the verifier and shouldn't pull in the full provisioning code path.
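One way to express the two-interface split in Python is with `typing.Protocol`, so that F2 code depends on the verifier only. The sketch below is illustrative: the concrete class is named `CacheProvisionerImpl` purely to avoid a name clash within one file, the `dict` and `bool` types stand in for the real DTOs, and the method bodies are placeholders.

```python
from pathlib import Path
from typing import Protocol


class CacheProvisioner(Protocol):
    """F1 build-phase interface."""

    def build_cache_artifacts(self, request: dict) -> dict: ...


class ManifestVerifier(Protocol):
    """F2 takeoff-gate interface, deliberately narrow."""

    def verify_manifest(self, manifest_path: Path) -> bool: ...


class CacheProvisionerImpl:
    """Hypothetical concrete class that satisfies both protocols structurally."""

    def build_cache_artifacts(self, request: dict) -> dict:
        # Placeholder body: the real build drives C7, C2 and C6.
        return {"outcome": "success", "request": request}

    def verify_manifest(self, manifest_path: Path) -> bool:
        # Placeholder body: the real gate compares SHA-256 content hashes.
        return manifest_path.exists()


def takeoff_gate(verifier: ManifestVerifier, manifest_path: Path) -> bool:
    """F2 code is typed against the narrow interface only."""
    return verifier.verify_manifest(manifest_path)
```

Because `takeoff_gate` is typed against `ManifestVerifier`, a separate lightweight verifier class could later be substituted without touching F2 callers.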
Upstream dependencies:
- C12 OperatorTooling → triggers `build_cache_artifacts(...)` after C11 `TileDownloader` has populated C6.
- C6 TileStore + TileMetadataStore + DescriptorIndex → read source (tiles + metadata), write target (FAISS index).
- C7 InferenceRuntime → engine compile + deserialize.
- C2 backbone (via C7 engine) → descriptor batched generation.
Downstream consumers:
- F2 takeoff load → consumes `verify_manifest` outcome.
2. Internal Interfaces
Interface: CacheProvisioner
| Method | Input | Output | Async | Error Types |
|---|---|---|---|---|
| `build_cache_artifacts` | `BuildRequest` | `BuildReport` | No (offline; minutes) | EngineBuildError, DescriptorBatchError, ManifestWriteError, IdempotentNoOp |
| `compile_engines_for_corpus` | `BackboneList` | `list[EngineCacheEntry]` | No | EngineBuildError, CalibrationCacheError |
Interface: ManifestVerifier
| Method | Input | Output | Async | Error Types |
|---|---|---|---|---|
| `verify_manifest` | `manifest_path: Path` | `VerificationResult` | No | ManifestNotFoundError, ContentHashMismatchError |
Input/Output DTOs:
BuildRequest:
    bbox: BoundingBox (lat_min, lon_min, lat_max, lon_max) # scopes which C6 tiles are in the manifest
    zoom_levels: list[int]
    sector_class: enum {active_conflict, stable_rear} # baked into manifest
    calibration_path: Path
    cache_root: Path
BuildReport:
    engines_built: int
    engines_reused: int
    descriptors_generated: int
    manifest_hash: sha256
    outcome: enum {success, failure, idempotent_no_op}
    failure_reason: string (optional)
Manifest: see data_model.md
EngineCacheEntry: see data_model.md
VerificationResult:
    manifest_hash_match: bool
    per_artifact_hash_match: dict[Path, bool]
    outcome: enum {pass, fail}
    fail_reasons: list[string]
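For illustration, the DTOs above map naturally onto Python dataclasses and enums. This is a sketch under assumptions: the enum class names, the use of `str` for the SHA-256 hex digest, and the latitude sanity check are not taken from the spec.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional


class SectorClass(Enum):
    ACTIVE_CONFLICT = "active_conflict"
    STABLE_REAR = "stable_rear"


class BuildOutcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    IDEMPOTENT_NO_OP = "idempotent_no_op"


@dataclass(frozen=True)
class BoundingBox:
    lat_min: float
    lon_min: float
    lat_max: float
    lon_max: float

    def __post_init__(self) -> None:
        # Assumed sanity check; longitude is left unchecked on purpose.
        if self.lat_min >= self.lat_max:
            raise ValueError("lat_min must be below lat_max")


@dataclass(frozen=True)
class BuildRequest:
    bbox: BoundingBox
    zoom_levels: list[int]
    sector_class: SectorClass
    calibration_path: Path
    cache_root: Path


@dataclass
class BuildReport:
    engines_built: int
    engines_reused: int
    descriptors_generated: int
    manifest_hash: str  # SHA-256 hex digest
    outcome: BuildOutcome
    failure_reason: Optional[str] = None
```

A request for a stable-rear sector might then be built as `BuildRequest(BoundingBox(48.0, 35.0, 48.2, 35.3), [16, 17], SectorClass.STABLE_REAR, Path("calib.json"), Path("/var/cache/c10"))`; the coordinates and paths here are made up.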
3. External API Specification
Not applicable. C10 has no network surface — all I/O is local filesystem + local Postgres.
4. Data Access Patterns
C10 reads `tiles` rows from C6 (scoped to the build's `bbox` + `zoom_levels`), writes the FAISS `.index` to the filesystem via `Sha256Sidecar`, and writes the Manifest plus a `manifests` row to Postgres via C6.
Storage Estimates
| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
|---|---|---|---|---|
| Manifest | one per build per cached area | ~10 KB (YAML/JSON) | negligible | per build |
| SHA-256 sidecars | one per artifact (.index, calibration JSON, manifest, .engine) | 64 B (hex digest) | negligible | per build |
Data Management
Seed data: none — C10 writes from scratch (or D-C10-1 idempotently no-ops). Tiles must already be in C6 (placed there by C11 TileDownloader); a missing-tiles condition is a build error, not a download trigger.
Rollback: D-C10-1 manifest-hash check makes provisioning idempotent. Atomic writes (atomicwrites package) prevent partial states; on partial failure, the previous-good cache remains until the new one is fully written.
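The atomic-write-plus-sidecar pattern can be sketched with the standard library alone. Note the substitution: this example uses `tempfile` and `os.replace` rather than the `atomicwrites` package the component pins, and the `.sha256` sidecar suffix and function names are assumptions made for illustration.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def _atomic_write(target: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename over target."""
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, target)
    except BaseException:
        # Leave the previous-good file untouched and drop the partial temp file.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def write_with_sidecar(target: Path, data: bytes) -> str:
    """Write an artifact and its SHA-256 sidecar; return the hex digest."""
    digest = hashlib.sha256(data).hexdigest()
    _atomic_write(target, data)
    # Assumed naming: <artifact>.sha256 next to the artifact.
    _atomic_write(target.with_name(target.name + ".sha256"), digest.encode("ascii"))
    return digest
```

Writing the artifact before its sidecar means a crash between the two steps leaves a stale sidecar, so a later content-hash check fails closed instead of passing on mismatched data.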
5. Implementation Details
Algorithmic Complexity: dominated by descriptor batched generation on Jetson (GPU-bound). Worst-case ~400 km² provisioning is ≤ tens of minutes (offline, not time-critical per AC-8.3). Tile network bandwidth is not in C10's budget — that cost is in C11.
State Management: stateless w.r.t. flight lifetime. No connection state — all dependencies are local.
Key Dependencies:
| Library | Version | Purpose |
|---|---|---|
| atomicwrites | latest | Atomic file replacement for .index + Manifest (D-C10-3) |
| hashlib (stdlib) | stdlib | SHA-256 content-hash sidecars |
| PyYAML / orjson | per project pin | Manifest serialization |
Error Handling Strategy:
- `EngineBuildError` / `CalibrationCacheError`: surfaced from C7; never silently fall back; operator must intervene.
- `DescriptorBatchError`: CUDA OOM during descriptor generation. Halve batch size and retry once; if still OOM, surface to operator.
- `ManifestWriteError`: filesystem error or atomic-write rollback. Cache marked invalid; operator must re-run.
- `IdempotentNoOp`: D-C10-1 manifest-hash matched the prior build's hash; skip rebuild; emit no-op report.
- `ContentHashMismatchError` (F2): refuse takeoff; STATUSTEXT to GCS; FDR records the event; operator must re-run F1.
- Missing tiles in C6 for the requested bbox/zoom: surface as `BuildReport.failure` with explicit instruction to run C11 `TileDownloader` first; do not fall back to a network fetch; that responsibility lives in C11.
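The "halve batch size and retry once" policy for `DescriptorBatchError` could look roughly like the sketch below. It makes several assumptions: Python's built-in `MemoryError` stands in for whatever exception the real runtime raises on CUDA OOM, `run_batch` is a hypothetical callable wrapping the C7 engine, and the retry restarts the whole tile sequence instead of resuming at the failed batch.

```python
from typing import Callable, Sequence


class DescriptorBatchError(RuntimeError):
    """Raised when descriptor generation still fails after the single retry."""


def generate_descriptors(
    tiles: Sequence[object],
    run_batch: Callable[[Sequence[object]], list],
    batch_size: int,
) -> list:
    """Run run_batch over tiles; on MemoryError, halve the batch size once."""
    halved = max(1, batch_size // 2)
    last_exc = None
    for attempt_size in (batch_size, halved):
        try:
            out: list = []
            for start in range(0, len(tiles), attempt_size):
                out.extend(run_batch(tiles[start:start + attempt_size]))
            return out
        except MemoryError as exc:  # stand-in for a CUDA OOM exception
            last_exc = exc
    # Both attempts ran out of memory: surface to the operator.
    raise DescriptorBatchError(
        f"out of memory at batch sizes {batch_size} and {halved}"
    ) from last_exc
```

Only the memory error is caught, so any other failure inside `run_batch` propagates unchanged and is never masked by the retry.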
6. Extensions and Helpers
| Helper | Purpose | Used By |
|---|---|---|
| `Sha256Sidecar` | atomic write + content-hash sidecar pattern | C6, C7, C10 |
| `EngineFilenameSchema` | self-describing filename per D-C10-7 | C7, C10 |
| `WgsConverter` | bbox math | C4, C5, C6, C8, C10 |
7. Caveats & Edge Cases
Known limitations:
- C10 depends on C6 already containing the tiles for the requested bbox + zoom levels. The F1 cache-build workflow (C12) sequences `C11 TileDownloader → C10 build_cache_artifacts`; C10 alone is not a complete F1.
- D-C10-3 SHA-256 content-hash gate must cover EVERY artifact: every tile (the per-tile hash is computed at C11 download time and stored in C6), the FAISS `.index`, the calibration JSON, and the Manifest itself. Missing sidecars are a release-blocking defect.
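A fail-closed check over a list of artifacts might be sketched as follows. The `.sha256` sidecar suffix, the function names, and the tuple return shape are assumptions for illustration; the actual `verify_manifest` returns a `VerificationResult` and also checks the manifest hash and the per-tile hashes stored in C6, which this sketch does not cover.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large artifacts are not loaded whole."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_artifacts(artifacts: list[Path]) -> tuple[bool, dict[Path, bool], list[str]]:
    """Return (passed, per-artifact result, failure reasons)."""
    per_artifact: dict[Path, bool] = {}
    reasons: list[str] = []
    if not artifacts:
        # Fail closed: an empty list must never count as verified.
        reasons.append("no artifacts listed")
    for artifact in artifacts:
        sidecar = artifact.with_name(artifact.name + ".sha256")  # assumed naming
        if not artifact.is_file():
            per_artifact[artifact] = False
            reasons.append(f"missing artifact: {artifact}")
        elif not sidecar.is_file():
            per_artifact[artifact] = False
            reasons.append(f"missing sidecar: {sidecar}")
        else:
            expected = sidecar.read_text(encoding="ascii").strip()
            ok = sha256_file(artifact) == expected
            per_artifact[artifact] = ok
            if not ok:
                reasons.append(f"hash mismatch: {artifact}")
    return (not reasons, per_artifact, reasons)
```

A missing sidecar is treated the same as a mismatch, which matches the requirement that every artifact be covered by the gate.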
Potential race conditions:
- Concurrent `build_cache_artifacts` invocations on the same cache root would corrupt state. The single-process operator tool wraps the build in a filesystem lockfile (the same lockfile C11 honours); a second invocation that tries to start fails with an explicit error.
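A lockfile guard of this kind can be sketched with an exclusive-create open. The lock filename `.cache.lock`, the `CacheLockedError` exception, and the `cache_lock` helper are hypothetical; the real lockfile path shared with C11 is not specified here, and this sketch does not handle stale locks left behind by a crashed process.

```python
import os
from contextlib import contextmanager
from pathlib import Path


class CacheLockedError(RuntimeError):
    """Raised when another invocation already holds the cache-root lock."""


@contextmanager
def cache_lock(cache_root: Path, name: str = ".cache.lock"):
    lock_path = cache_root / name
    try:
        # O_CREAT | O_EXCL fails if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise CacheLockedError(f"another build holds {lock_path}") from exc
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(str(os.getpid()))  # record the owner for diagnostics
        yield lock_path
    finally:
        lock_path.unlink()
```

Usage would be `with cache_lock(request.cache_root): ...` around the whole build, so that a second invocation raises immediately instead of interleaving writes.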
Performance bottlenecks:
- Descriptor batched generation is GPU-bound; batching is the main lever (D-C7-1 INT8/FP16 mix decision applies).
- Engine compile is workspace-bound on Jetson; D-C10-6 calibration cache reuse is the main lever.
8. Dependency Graph
Must be implemented after: C6 (read source for tiles, write target for FAISS), C7 (engine + descriptor runtime), C2 (backbone interface for descriptor generation; called via C7).
Can be implemented in parallel with: C8, C13.
Blocks: C12 (operator can't sequence F1 without C10 ready), F1, F2 (verify_manifest), F8 (warm-cache verify on reboot recovery).
9. Logging Strategy
| Log Level | When | Example |
|---|---|---|
| ERROR | EngineBuildError, DescriptorBatchError, ManifestWriteError, ContentHashMismatchError (F2) | `C10 engine build failed: backbone=disk; takeoff blocked` |
| WARN | engine cache miss falls through to build | C10 engine cache miss: model=ultra_vpr; sm=87, jp=6.2, trt=10.3, fp16; rebuild |
| INFO | Build start/end + report; verify_manifest pass | C10 build complete: engines=4, descriptors=87654, manifest_hash=…; outcome=success |
| DEBUG | per-tile descriptor batch progress | C10 descriptor batch progress: 12345/87654 (14%) |
Log format: structured JSON. Log storage: stdout (operator tool); journald (companion verify); FDR via C13 (only for F2 verify_manifest events — provisioning is offline and goes to operator-facing logs, not flight FDR).
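A structured-JSON formatter for the operator-tool stdout path could be sketched with the standard `logging` module. The logger name, the JSON field names, and the `fields` key passed through `extra` are assumptions, not the project's schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "component": "C10",
            "message": record.getMessage(),
        }
        # Structured fields arrive via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(stream=None) -> logging.Logger:
    """Build a C10 logger that writes JSON lines to stream (stdout by default)."""
    logger = logging.getLogger("c10.provisioning")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A build-complete event would then be emitted as `make_logger().info("build complete", extra={"fields": {"engines": 4, "descriptors": 87654, "outcome": "success"}})`.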