gps-denied-onboard/_docs/02_document/components/11_c10_provisioning/description.md
Oleksandr Bezdieniezhnykh 64542d32fc Update autodev state, architecture documentation, and glossary terms
Transitioned the autodev state to phase 21, reflecting the completion of Step 5 and the drafting of Step 6 epics. Revised the architecture documentation to clarify the roles of the Tile Manager and its components, ensuring accurate representation of the system's operational flow. Updated glossary entries for Flight State and Operator to incorporate recent changes and enhance clarity on component interactions and responsibilities.
2026-05-10 00:21:34 +03:00

C10 — Pre-flight Cache Provisioning

1. High-Level Overview

Purpose: build the model-derived pre-flight cache artifacts on top of an already-populated tile store, and verify them at takeoff. After C11 TileDownloader has fetched tiles into C6, C10 orchestrates: compile/deserialize TensorRT engines via C7 → batch each tile through C2's backbone for descriptors → atomically write FAISS HNSW index with SHA-256 sidecars (D-C10-3) → write Manifest with hash of (model + calibration + corpus + sector_class) for D-C10-1 idempotence. At F2 takeoff load, run verify_manifest (D-C10-3 SHA-256 content-hash gate) before allowing the system to arm.
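
The D-C10-1 idempotence check can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the per-file SHA-256 digests are already available as hex strings, and the function names, the label framing, and the sorting rule are all hypothetical choices made for the example.

```python
import hashlib
from typing import Iterable, Optional


def manifest_hash(
    model_hashes: Iterable[str],
    calibration_hash: str,
    tile_hashes: Iterable[str],
    sector_class: str,
) -> str:
    """Fold (model + calibration + corpus + sector_class) into one digest.

    Inputs are hex digests of the individual files. Sorting makes the result
    independent of enumeration order (an assumption of this sketch).
    """
    digest = hashlib.sha256()
    groups = (
        ("model", sorted(model_hashes)),
        ("calibration", [calibration_hash]),
        ("corpus", sorted(tile_hashes)),
        ("sector_class", [sector_class]),
    )
    for label, values in groups:
        # NUL-separated framing keeps adjacent fields from running together.
        digest.update(label.encode("utf-8") + b"\0")
        for value in values:
            digest.update(value.encode("utf-8") + b"\0")
    return digest.hexdigest()


def is_idempotent_no_op(previous_hash: Optional[str], new_hash: str) -> bool:
    """True when the prior build's manifest hash matches the new one."""
    return previous_hash is not None and previous_hash == new_hash
```

With a scheme like this, changing any one input (a retrained backbone, a new calibration file, an added tile, or a different sector_class) changes the digest and forces a rebuild, while an unchanged set of inputs yields the IdempotentNoOp outcome.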

C10 does NOT touch satellite-provider. Tile I/O, both download (F1 inbound) and post-landing upload (F10), lives in C11 (Tile Manager). C10 reads tiles from C6 and writes engines, descriptors, and the manifest to the filesystem and Postgres. The split is operational: C11 carries the operator-side network identity (TLS API key for download, per-flight signing key for upload) and the airborne-exclusion property (ADR-004); C10 carries the model identity and the takeoff-load verifier, neither of which needs to leave the workstation/companion enclave at runtime.

Architectural Pattern: Coordinator — single concrete implementation CacheProvisioner behind two interfaces (CacheProvisioner for the F1 build phase, ManifestVerifier for F2's content-hash gate). The interfaces are split because F2 only needs the verifier and shouldn't pull in the full provisioning code path.
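
One way to express the interface split is with structural typing, so that F2 code depends only on the verifier type. The sketch below is illustrative: the class name CacheProvisionerImpl, the takeoff_gate helper, and the placeholder return values are assumptions, and the real verify_manifest would run the D-C10-3 hash gate rather than a file-existence check.

```python
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class CacheProvisioner(Protocol):
    """F1 build-phase interface."""

    def build_cache_artifacts(self, request: object) -> object: ...

    def compile_engines_for_corpus(self, backbones: list) -> list: ...


@runtime_checkable
class ManifestVerifier(Protocol):
    """F2 content-hash gate interface."""

    def verify_manifest(self, manifest_path: Path) -> str: ...


class CacheProvisionerImpl:
    """Hypothetical concrete class that satisfies both protocols."""

    def build_cache_artifacts(self, request: object) -> object:
        raise NotImplementedError("build path omitted from this sketch")

    def compile_engines_for_corpus(self, backbones: list) -> list:
        raise NotImplementedError("engine compile omitted from this sketch")

    def verify_manifest(self, manifest_path: Path) -> str:
        # Placeholder only: the real gate compares SHA-256 content hashes.
        return "pass" if manifest_path.is_file() else "fail"


def takeoff_gate(verifier: ManifestVerifier, manifest_path: Path) -> bool:
    """F2-side caller: typed against the verifier interface only."""
    return verifier.verify_manifest(manifest_path) == "pass"
```

Because takeoff_gate is annotated with ManifestVerifier, the F2 code path never needs to name the build methods, which is the point of splitting the interfaces.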

Upstream dependencies:

  • C12 OperatorTooling → triggers build_cache_artifacts(...) after C11 TileDownloader has populated C6.
  • C6 TileStore + TileMetadataStore + DescriptorIndex → read source (tiles + metadata), write target (FAISS index).
  • C7 InferenceRuntime → engine compile + deserialize.
  • C2 backbone (via C7 engine) → descriptor batched generation.

Downstream consumers:

  • F2 takeoff load → consumes verify_manifest outcome.

2. Internal Interfaces

Interface: CacheProvisioner

| Method | Input | Output | Async | Error Types |
| --- | --- | --- | --- | --- |
| build_cache_artifacts | BuildRequest | BuildReport | No (offline; minutes) | EngineBuildError, DescriptorBatchError, ManifestWriteError, IdempotentNoOp |
| compile_engines_for_corpus | BackboneList | list[EngineCacheEntry] | No | EngineBuildError, CalibrationCacheError |

Interface: ManifestVerifier

| Method | Input | Output | Async | Error Types |
| --- | --- | --- | --- | --- |
| verify_manifest | manifest_path: Path | VerificationResult | No | ManifestNotFoundError, ContentHashMismatchError |

Input/Output DTOs:

BuildRequest:
  bbox:                       BoundingBox (lat_min, lon_min, lat_max, lon_max)  # scopes which C6 tiles are in the manifest
  zoom_levels:                list[int]
  sector_class:               enum {active_conflict, stable_rear}                # baked into manifest
  calibration_path:           Path
  cache_root:                 Path

BuildReport:
  engines_built:                    int
  engines_reused:                   int
  descriptors_generated:            int
  manifest_hash:                    sha256
  outcome:                          enum {success, failure, idempotent_no_op}
  failure_reason:                   string (optional)

Manifest:                       see data_model.md
EngineCacheEntry:               see data_model.md

VerificationResult:
  manifest_hash_match:        bool
  per_artifact_hash_match:    dict[Path, bool]
  outcome:                    enum {pass, fail}
  fail_reasons:               list[string]
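
The DTOs above map naturally onto Python dataclasses. The following is a sketch under assumptions: the field names and enum values come from the listing, while the class layout, the enum class names, and the latitude check in BoundingBox are illustrative additions. Manifest and EngineCacheEntry are omitted because they are defined in data_model.md.

```python
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Optional


class SectorClass(Enum):
    ACTIVE_CONFLICT = "active_conflict"
    STABLE_REAR = "stable_rear"


class BuildOutcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    IDEMPOTENT_NO_OP = "idempotent_no_op"


class VerifyOutcome(Enum):
    PASS = "pass"
    FAIL = "fail"


@dataclass(frozen=True)
class BoundingBox:
    lat_min: float
    lon_min: float
    lat_max: float
    lon_max: float

    def __post_init__(self) -> None:
        # Assumption of this sketch: reject inverted latitude ranges early.
        if self.lat_min > self.lat_max:
            raise ValueError("lat_min must not exceed lat_max")


@dataclass(frozen=True)
class BuildRequest:
    bbox: BoundingBox          # scopes which C6 tiles are in the manifest
    zoom_levels: list[int]
    sector_class: SectorClass  # baked into the manifest
    calibration_path: Path
    cache_root: Path


@dataclass
class BuildReport:
    engines_built: int
    engines_reused: int
    descriptors_generated: int
    manifest_hash: str         # SHA-256 hex digest
    outcome: BuildOutcome
    failure_reason: Optional[str] = None


@dataclass
class VerificationResult:
    manifest_hash_match: bool
    per_artifact_hash_match: dict[Path, bool]
    outcome: VerifyOutcome
    fail_reasons: list[str] = field(default_factory=list)
```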

3. External API Specification

Not applicable. C10 has no network surface — all I/O is local filesystem + local Postgres.

4. Data Access Patterns

C10 reads rows of the tiles table from C6 (scoped to the build's bbox + zoom_levels), writes the FAISS .index to the filesystem via Sha256Sidecar, and writes the Manifest plus a manifests row to Postgres via C6.

Storage Estimates

| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
| --- | --- | --- | --- | --- |
| Manifest | one per build per cached area | ~10 KB (YAML/JSON) | negligible | per build |
| SHA-256 sidecars | one per artifact (.index, calibration JSON, manifest, .engine) | 64 B (hex digest) | negligible | per build |

Data Management

Seed data: none — C10 writes from scratch (or D-C10-1 idempotently no-ops). Tiles must already be in C6 (placed there by C11 TileDownloader); a missing-tiles condition is a build error, not a download trigger.

Rollback: D-C10-1 manifest-hash check makes provisioning idempotent. Atomic writes (atomicwrites package) prevent partial states; on partial failure, the previous-good cache remains until the new one is fully written.
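
The atomic-write-plus-sidecar pattern can be illustrated with the standard library alone. This sketch deliberately swaps in tempfile and os.replace as a stand-in for the atomicwrites package named in this document, and it is not the project's Sha256Sidecar helper; the function names and the ".sha256" suffix are assumptions.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sidecar_path(target: Path) -> Path:
    """Hypothetical naming rule: '<artifact name>.sha256' next to the artifact."""
    return target.with_name(target.name + ".sha256")


def _atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def atomic_write_with_sidecar(target: Path, payload: bytes) -> str:
    """Write the artifact, then its SHA-256 sidecar; return the hex digest."""
    target.parent.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(payload).hexdigest()
    _atomic_write(target, payload)
    _atomic_write(sidecar_path(target), digest.encode("ascii"))
    return digest
```

A reader never observes a half-written file, because each rename either happens completely or not at all. If the process dies between the two renames, the artifact and its sidecar disagree, which a content-hash check would report as a mismatch rather than silently accept.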

5. Implementation Details

Algorithmic Complexity: dominated by batched descriptor generation on Jetson (GPU-bound). Worst-case provisioning (~400 km²) takes tens of minutes at most; it runs offline and is not time-critical per AC-8.3. Tile network bandwidth is not in C10's budget; that cost belongs to C11.

State Management: stateless w.r.t. flight lifetime. No connection state — all dependencies are local.

Key Dependencies:

| Library | Version | Purpose |
| --- | --- | --- |
| atomicwrites | latest | Atomic file replacement for .index + Manifest (D-C10-3) |
| hashlib | stdlib | SHA-256 content-hash sidecars |
| PyYAML / orjson | per project pin | Manifest serialization |

Error Handling Strategy:

  • EngineBuildError / CalibrationCacheError: surfaced from C7 — never silently fall back; operator must intervene.
  • DescriptorBatchError: CUDA OOM during descriptor generation. Halve batch size and retry once; if still OOM, surface to operator.
  • ManifestWriteError: filesystem error or atomic-write rollback. Cache marked invalid; operator must re-run.
  • IdempotentNoOp: D-C10-1 manifest-hash matched the prior build's hash; skip rebuild; emit no-op report.
  • ContentHashMismatchError (F2): refuse takeoff; STATUSTEXT to GCS; FDR records the event; operator must re-run F1.
  • Missing tiles in C6 for the requested bbox/zoom: surface as BuildReport.failure with explicit instruction to run C11 TileDownloader first; do not fall back to a network fetch — that responsibility lives in C11.
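
The DescriptorBatchError policy (halve the batch size, retry once, then surface) can be sketched as below. Several things here are assumptions: run_batch is a hypothetical callable standing in for the C2/C7 descriptor call, MemoryError stands in for whatever exception the GPU runtime raises on CUDA OOM, and "retry once" is read as one halving per build.

```python
from typing import Callable, Sequence


class DescriptorBatchError(RuntimeError):
    """Raised when descriptor generation still runs out of memory after one retry."""


def generate_descriptors(
    tiles: Sequence,
    run_batch: Callable[[Sequence], list],
    batch_size: int,
) -> list:
    """Run tiles through run_batch in chunks, halving the chunk size once on OOM."""
    descriptors: list = []
    index = 0
    halved = False
    while index < len(tiles):
        batch = tiles[index:index + batch_size]
        try:
            descriptors.extend(run_batch(batch))
        except MemoryError as exc:
            if halved or batch_size <= 1:
                # Second OOM (or nothing left to halve): surface to the operator.
                raise DescriptorBatchError(
                    f"OOM at batch_size={batch_size} after retry"
                ) from exc
            batch_size = max(1, batch_size // 2)
            halved = True
            continue  # retry the same tiles with the smaller batch
        index += len(batch)
    return descriptors
```

The reduced batch size is kept for the remainder of the run in this sketch, on the reasoning that a size that overflowed once is likely to overflow again.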

6. Extensions and Helpers

| Helper | Purpose | Used By |
| --- | --- | --- |
| Sha256Sidecar | atomic write + content-hash sidecar pattern | C6, C7, C10 |
| EngineFilenameSchema | self-describing filename per D-C10-7 | C7, C10 |
| WgsConverter | bbox math | C4, C5, C6, C8, C10 |

7. Caveats & Edge Cases

Known limitations:

  • C10 depends on C6 already containing the tiles for the requested bbox + zoom levels. The F1 cache-build workflow (C12) sequences C11 TileDownloader → C10 build_cache_artifacts; C10 alone is not a complete F1.
  • D-C10-3 SHA-256 content-hash gate must cover EVERY artifact: every tile (the per-tile hash is computed at C11 download time and stored in C6), the FAISS .index, the calibration JSON, and the Manifest itself. Missing sidecars are a release-blocking defect.
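
A per-artifact content-hash gate along these lines might look like the sketch below. It is illustrative only: the Manifest schema lives in data_model.md, so the function takes a plain list of artifact paths instead, and the ".sha256" sidecar suffix, the dictionary shape of the result, and the rule that an empty artifact list fails are all assumptions.

```python
import hashlib
from pathlib import Path


def verify_artifacts(artifacts: list[Path]) -> dict:
    """Check that every artifact has a sidecar whose digest matches its content."""
    per_artifact: dict[Path, bool] = {}
    fail_reasons: list[str] = []
    if not artifacts:
        fail_reasons.append("no artifacts listed")
    for artifact in artifacts:
        sidecar = artifact.with_name(artifact.name + ".sha256")
        if not artifact.is_file():
            ok, reason = False, f"missing artifact: {artifact}"
        elif not sidecar.is_file():
            ok, reason = False, f"missing sidecar: {sidecar}"
        else:
            expected = sidecar.read_text().strip()
            actual = hashlib.sha256(artifact.read_bytes()).hexdigest()
            ok, reason = expected == actual, f"hash mismatch: {artifact}"
        per_artifact[artifact] = ok
        if not ok:
            fail_reasons.append(reason)
    return {
        "per_artifact_hash_match": per_artifact,
        "outcome": "fail" if fail_reasons else "pass",
        "fail_reasons": fail_reasons,
    }
```

Treating a missing sidecar as a failure, rather than skipping the file, is what makes an uncovered artifact visible at takeoff instead of slipping past the gate.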

Potential race conditions:

  • Concurrent build_cache_artifacts invocations on the same cache root would corrupt state. The single-process operator tool wraps each build in a filesystem lockfile (the same lockfile C11 honours); a second invocation that tries to start while the lock is held fails with an explicit error.
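
A filesystem lockfile of this kind can be sketched with an exclusive create. The lock file name, the CacheLockedError class, and the context-manager shape are hypothetical; the actual lockfile shared with C11 is not specified here.

```python
import os
from contextlib import contextmanager
from pathlib import Path


class CacheLockedError(RuntimeError):
    """Raised when another invocation already holds the cache-root lock."""


@contextmanager
def cache_lock(cache_root: Path, name: str = ".provisioning.lock"):
    """Hold an exclusive lockfile under cache_root for the duration of the block."""
    lock_path = cache_root / name
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise CacheLockedError(f"cache root is locked: {lock_path}") from exc
    try:
        os.write(fd, str(os.getpid()).encode("ascii"))  # record the holder for diagnosis
        yield lock_path
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

A build would then run inside `with cache_lock(cache_root): ...`. One limitation of this simple form is that a crashed process leaves a stale lockfile behind, so an operator-facing tool would need a documented way to clear it.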

Performance bottlenecks:

  • Descriptor batched generation is GPU-bound; batching is the main lever (D-C7-1 INT8/FP16 mix decision applies).
  • Engine compile is workspace-bound on Jetson; D-C10-6 calibration cache reuse is the main lever.

8. Dependency Graph

Must be implemented after: C6 (read source for tiles, write target for FAISS), C7 (engine + descriptor runtime), C2 (backbone interface for descriptor generation; called via C7).

Can be implemented in parallel with: C8, C13.

Blocks: C12 (operator can't sequence F1 without C10 ready), F1, F2 (verify_manifest), F8 (warm-cache verify on reboot recovery).

9. Logging Strategy

| Log Level | When | Example |
| --- | --- | --- |
| ERROR | EngineBuildError, DescriptorBatchError, ManifestWriteError, ContentHashMismatchError (F2) | C10 engine build failed: backbone=disk; takeoff blocked |
| WARN | engine cache miss falls through to build | C10 engine cache miss: model=ultra_vpr; sm=87, jp=6.2, trt=10.3, fp16; rebuild |
| INFO | Build start/end + report; verify_manifest pass | C10 build complete: engines=4, descriptors=87654, manifest_hash=…; outcome=success |
| DEBUG | per-tile descriptor batch progress | C10 descriptor batch progress: 12345/87654 (14%) |

Log format: structured JSON. Log storage: stdout (operator tool); journald (companion verify); FDR via C13 (only for F2 verify_manifest events — provisioning is offline and goes to operator-facing logs, not flight FDR).
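
Structured JSON logging to stdout can be set up with the standard logging module, as in the sketch below. The JSON key names ("level", "component", "message"), the "fields" convention for extra data, and the make_logger helper are assumptions for illustration; the project's actual log schema is not given in this document.

```python
import json
import logging
import sys

# Python names the level "WARNING"; the table above uses "WARN".
_LEVEL_NAMES = {"WARNING": "WARN"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": _LEVEL_NAMES.get(record.levelname, record.levelname),
            "component": "C10",
            "message": record.getMessage(),
        }
        # Structured values passed via extra={"fields": {...}} are merged in.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name: str = "c10") -> logging.Logger:
    """Return a logger that writes JSON lines to stdout (operator tool case)."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger
```

A call such as `make_logger().info("build complete", extra={"fields": {"engines": 4, "outcome": "success"}})` would then emit a single JSON line carrying both the message and the structured fields.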