Oleksandr Bezdieniezhnykh 12aba8139f [autodev] Step 13 partial: c10/c11/c12/c13 cycle-1 doc sync
Batch 4 of the cycle-1 component-doc sync. For each of C10
(provisioning), C11 (tilemanager), C12 (operator_orchestrator),
and C13 (fdr):

- Append "Cycle-1 operational reality" paragraph to § 1
  documenting the actual cycle-1 wiring path:
  - C10: operator-side / cross-tier; NOT in _STRATEGY_REGISTRY;
    composed via runtime_root/c10_factory.py with six per-service
    factories; reuses C7 InferenceRuntime for engine compile;
    AZ-323 Ed25519 signer + C10ManifestConfig signing-mode gate;
    AZ-324 ManifestVerifierImpl with airborne/operator modes;
    AZ-507 c6 cuts kept in c10_factory; AZ-687 N/A.
  - C11: operator-workstation-only; airborne build target
    excludes source tree (ADR-004 / AC-8.4); composed via
    runtime_root/c11_factory.py with three per-service factories;
    distinct FdrClient producer_ids for signing_key + tile_uploader;
    AZ-320 IdempotentRetryTileUploader wraps by default;
    AZ-507 keeps c6 surfaces caller-injected; AZ-687 N/A.
  - C12: operator-workstation CLI binary; airborne build excludes
    source tree (ADR-004 + Principle #9); composed via
    runtime_root/c12_factory.py; OperatorOrchestratorServices
    dataclass aggregates AZ-326/327/328/329/330/489 services with
    sibling fields defaulting to None; AZ-507 cuts via
    RemoteCacheProvisionerInvoker + TileDownloaderCut/UploaderCut;
    AZ-687 N/A.
  - C13: airborne infrastructure; pre_constructed[c13_fdr] seeded
    FIRST via make_fdr_client(AIRBORNE_MAIN_PRODUCER_ID, config)
    (AZ-619 Phase A); per-producer _CACHE gives AC-619.2 singleton;
    AZ-274 drop-oldest overrun policy wired at construction;
    c1_vio / c5_state require it, c2_5/c3/c3_5/c4 optional; AZ-687
    guard explicitly does NOT apply — seed runs before any block
    presence check so replay binaries still write FDR.

Also bump _docs/_process_leftovers/2026-05-11_d_cross_cve_1_opencv_pin_deferred.md
replay timestamp to 17:18 (start of this /autodev invocation);
gtsam==4.2.1 still requires numpy<2.0.0 so the relaxed opencv pin
remains in effect.

Update _docs/_autodev_state.md sub_step.detail to record batch
4/~5 done; next batch is the 8 helpers under common-helpers/.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-19 17:25:53 +03:00


C10 — Pre-flight Cache Provisioning

1. High-Level Overview

Purpose: build the model-derived pre-flight cache artifacts on top of an already-populated tile store, and verify them at takeoff. After C11 TileDownloader has fetched tiles into C6, C10 orchestrates: compile/deserialize TensorRT engines via C7 → batch each tile through C2's backbone for descriptors → atomically write FAISS HNSW index with SHA-256 sidecars (D-C10-3) → write Manifest with hash of (model + calibration + corpus + sector_class + takeoff_origin) for D-C10-1 idempotence. The takeoff_origin is supplied by C12 (derived from Flight.waypoints[0] via the FlightsApiClient, ADR-010 + AZ-489); C10 treats it as one more identity field and bakes it into both the Manifest body and the manifest-hash. At F2 takeoff load, run verify_manifest (D-C10-3 SHA-256 content-hash gate) before allowing the system to arm; the verifier also surfaces takeoff_origin so the companion's composition root can pass it to C5.set_takeoff_origin(origin, sigma_horiz_m, sigma_vert_m) before any sensor sample (AZ-490).
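The D-C10-1 idempotence check described above can be sketched as a hash over a canonical encoding of the identity fields. This is only an illustration: the field names, the JSON canonicalisation, and the helper names `manifest_identity_hash` and `should_skip_build` are assumptions, not the project's actual manifest schema.

```python
import hashlib
import json
from typing import Optional


def manifest_identity_hash(identity: dict) -> str:
    """Return a SHA-256 hex digest over a canonical JSON encoding of the identity."""
    # sort_keys plus fixed separators make the encoding independent of dict order.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_skip_build(identity: dict, previous_hash: Optional[str]) -> bool:
    """Idempotence check: skip the rebuild when the identity hash is unchanged."""
    return previous_hash is not None and manifest_identity_hash(identity) == previous_hash


# Hypothetical identity record; the digests are placeholders, not real artifacts.
identity = {
    "model_sha256": "aa" * 32,
    "calibration_sha256": "bb" * 32,
    "corpus_sha256": "cc" * 32,
    "sector_class": "stable_rear",
    "takeoff_origin": [50.0, 30.0, 120.0],  # lat, lon, alt (illustrative values)
}
first_hash = manifest_identity_hash(identity)
```

Because takeoff_origin is part of the hashed identity, changing the first waypoint yields a different hash and therefore a fresh build rather than a no-op.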

C10 does NOT touch satellite-provider. Tile I/O, both download (F1 inbound) and post-landing upload (F10), lives in C11 (Tile Manager). C10 reads tiles from C6, writes engines + descriptors + manifest to filesystem and Postgres. The split is operational: C11 carries the operator-side network identity (TLS API key for download, per-flight signing key for upload) and the airborne-exclusion property (ADR-004); C10 carries the model identity and the takeoff-load verifier, neither of which needs to leave the workstation/companion enclave at runtime.

Architectural Pattern: Coordinator — single concrete implementation CacheProvisioner behind two interfaces (CacheProvisioner for the F1 build phase, ManifestVerifier for F2's content-hash gate). The interfaces are split because F2 only needs the verifier and shouldn't pull in the full provisioning code path.
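One way to picture the interface split is with two `typing.Protocol` classes and one class that satisfies both. The sketch below is a simplified assumption: the method bodies are placeholders, `CoordinatorSketch` and `takeoff_gate` are hypothetical names, and the DTOs are reduced to the minimum needed to run.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass(frozen=True)
class VerificationResult:
    outcome: str  # "pass" or "fail"


class ManifestVerifier(Protocol):
    def verify_manifest(self, manifest_path: Path) -> VerificationResult: ...


class CacheProvisioner(Protocol):
    def build_cache_artifacts(self, request: dict) -> dict: ...


class CoordinatorSketch:
    """Hypothetical single implementation that satisfies both protocols."""

    def build_cache_artifacts(self, request: dict) -> dict:
        # Placeholder: a real build compiles engines and writes descriptors.
        return {"outcome": "success"}

    def verify_manifest(self, manifest_path: Path) -> VerificationResult:
        # Placeholder check: only tests that the manifest file exists.
        return VerificationResult("pass" if manifest_path.is_file() else "fail")


def takeoff_gate(verifier: ManifestVerifier, manifest_path: Path) -> bool:
    """F2-side caller, typed against the verifier interface only."""
    return verifier.verify_manifest(manifest_path).outcome == "pass"
```

The F2 caller depends only on `ManifestVerifier`, so its type signature never mentions the build-phase methods.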

Cycle-1 operational reality: C10 is operator-side / cross-tier infrastructure, NOT an airborne strategy slot — it does not appear in _AIRBORNE_REGISTRATIONS and register_airborne_strategies() (AZ-591) never registers it; equivalently it has no row in AIRBORNE_REQUIRED_PRE_CONSTRUCTED_KEYS. The operator binary composes C10 via runtime_root/c10_factory.py, which exposes six tiny per-service factories (build_engine_compiler, build_backbone_specs, build_manifest_builder, build_manifest_verifier, build_descriptor_batcher, build_cache_provisioner) that the CLI wires directly. The factory reuses the C7 InferenceRuntime via inference_factory.build_inference_runtime for the engine-compile path (honouring BUILD_TENSORRT_RUNTIME / BUILD_PYTORCH_FP16_RUNTIME) and threads Sha256Sidecar, Ed25519ManifestSigner, and a structured logger explicitly — no global registry. The AZ-323 ManifestBuilder reads config.components['c10_provisioning'].manifest (C10ManifestConfig: signing_mode ∈ {operator, dev}, allowed_operator_fingerprints, schema_version="1.1"); operator-mode signs only with an allowlisted Ed25519 key fingerprint, dev-mode warns when an allowlisted key is used. AZ-324's ManifestVerifierImpl has two modes selected by with_tile_store: False (airborne C5 path, MV-INV-5: trust the Ed25519 signature + recorded tiles_coverage_sha256) and True (operator C12 path: re-derive the aggregate from C6 and report drift) — wired in build_manifest_verifier and never silently flipping. The AZ-507 cross-component cut keeps C10 from importing C6 directly: c10_factory.py owns three composition-root adapters (c6_tile_metadata_store_to_tiles_query, c6_tile_store_to_pixel_opener, c6_descriptor_index_to_rebuilder) that translate C6's DTOs into C10's narrow TileHashRecord / TileBboxRecord / TilePixelOpener / DescriptorIndexRebuilder cuts. AZ-687 replay-mode guard does not apply to C10 — replay-mode binaries are airborne-only and never invoke the C10 build path.
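The signing-mode gate described above (operator mode signs only with an allowlisted fingerprint, dev mode warns when an allowlisted key is used) might look roughly like the following. `ManifestConfigSketch`, `SigningKeyNotAllowed`, and `check_signing_key` are hypothetical names for illustration; the real C10ManifestConfig and ManifestBuilder may be structured differently.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("c10.manifest")


class SigningKeyNotAllowed(Exception):
    """Hypothetical error type used only in this sketch."""


@dataclass(frozen=True)
class ManifestConfigSketch:
    signing_mode: str  # "operator" or "dev"
    allowed_operator_fingerprints: frozenset = frozenset()


def check_signing_key(config: ManifestConfigSketch, fingerprint: str) -> bool:
    """Return True when signing may proceed; raise if operator mode rejects the key."""
    allowlisted = fingerprint in config.allowed_operator_fingerprints
    if config.signing_mode == "operator":
        if not allowlisted:
            raise SigningKeyNotAllowed(fingerprint)
        return True
    if config.signing_mode == "dev":
        if allowlisted:
            # Dev builds should not normally be signed with an operator key.
            log.warning("dev-mode manifest signed with an allowlisted operator key")
        return True
    raise ValueError(f"unknown signing_mode: {config.signing_mode!r}")
```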

Upstream dependencies:

  • C12 OperatorTooling → triggers build_cache_artifacts(...) after C11 TileDownloader has populated C6.
  • C6 TileStore + TileMetadataStore + DescriptorIndex → read source (tiles + metadata), write target (FAISS index).
  • C7 InferenceRuntime → engine compile + deserialize.
  • C2 backbone (via C7 engine) → descriptor batched generation.

Downstream consumers:

  • F2 takeoff load → consumes verify_manifest outcome.

2. Internal Interfaces

Interface: CacheProvisioner

| Method | Input | Output | Async | Error Types |
|---|---|---|---|---|
| build_cache_artifacts | BuildRequest | BuildReport | No (offline; minutes) | EngineBuildError, DescriptorBatchError, ManifestWriteError, IdempotentNoOp |
| compile_engines_for_corpus | BackboneList | list[EngineCacheEntry] | No | EngineBuildError, CalibrationCacheError |

Interface: ManifestVerifier

| Method | Input | Output | Async | Error Types |
|---|---|---|---|---|
| verify_manifest | manifest_path: Path | VerificationResult | No | ManifestNotFoundError, ContentHashMismatchError |

Input/Output DTOs:

BuildRequest:
  bbox:                       BoundingBox (lat_min, lon_min, lat_max, lon_max)  # scopes which C6 tiles are in the manifest
  zoom_levels:                list[int]
  sector_class:               enum {active_conflict, stable_rear}                # baked into manifest
  calibration_path:           Path
  cache_root:                 Path
  takeoff_origin:             LatLonAlt | None                                   # ADR-010 / AZ-489; baked into manifest + hash
  flight_id:                  UUID | None                                        # ADR-010; pass-through provenance, baked into manifest

BuildReport:
  engines_built:                    int
  engines_reused:                   int
  descriptors_generated:            int
  manifest_hash:                    sha256
  outcome:                          enum {success, failure, idempotent_no_op}
  failure_reason:                   string (optional)

Manifest:                       see data_model.md (carries takeoff_origin + flight_id when set; hash includes them)
EngineCacheEntry:               see data_model.md

VerificationResult:
  manifest_hash_match:        bool
  per_artifact_hash_match:    dict[Path, bool]
  takeoff_origin:             LatLonAlt | None              # passed through from manifest for C5 warm-start (AZ-490)
  flight_id:                  UUID | None
  outcome:                    enum {pass, fail}
  fail_reasons:               list[string]
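As a rough illustration of how the VerificationResult fields relate to each other, the sketch below derives outcome and fail_reasons from the two hash checks. It assumes that the outcome is pass only when every check matches; the `summarize` helper and the reason strings are hypothetical, and flight_id is omitted for brevity.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Optional, Tuple

LatLonAlt = Tuple[float, float, float]


@dataclass(frozen=True)
class VerificationResult:
    manifest_hash_match: bool
    per_artifact_hash_match: Dict[Path, bool]
    takeoff_origin: Optional[LatLonAlt]
    outcome: str  # "pass" or "fail"
    fail_reasons: List[str] = field(default_factory=list)


def summarize(
    manifest_hash_match: bool,
    per_artifact_hash_match: Dict[Path, bool],
    takeoff_origin: Optional[LatLonAlt] = None,
) -> VerificationResult:
    """Assumed rule: any mismatch produces a fail with one reason per mismatch."""
    reasons: List[str] = []
    if not manifest_hash_match:
        reasons.append("manifest hash mismatch")
    for path, ok in sorted(per_artifact_hash_match.items()):
        if not ok:
            reasons.append(f"artifact hash mismatch: {path}")
    return VerificationResult(
        manifest_hash_match=manifest_hash_match,
        per_artifact_hash_match=per_artifact_hash_match,
        takeoff_origin=takeoff_origin,
        outcome="fail" if reasons else "pass",
        fail_reasons=reasons,
    )
```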

3. External API Specification

Not applicable. C10 has no network surface — all I/O is local filesystem + local Postgres.

4. Data Access Patterns

C10 reads tiles rows from C6 (scoped to the build's bbox + zoom_levels), writes the FAISS .index to filesystem via Sha256Sidecar, and writes Manifest + manifests row to Postgres via C6.

Storage Estimates

| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
|---|---|---|---|---|
| Manifest | one per build per cached area | ~10 KB (YAML/JSON) | negligible | per build |
| SHA-256 sidecars | one per artifact (.index, calibration JSON, manifest, .engine) | 64 B (hex digest) | negligible | per build |

Data Management

Seed data: none — C10 writes from scratch (or D-C10-1 idempotently no-ops). Tiles must already be in C6 (placed there by C11 TileDownloader); a missing-tiles condition is a build error, not a download trigger.

Rollback: D-C10-1 manifest-hash check makes provisioning idempotent. Atomic writes (atomicwrites package) prevent partial states; on partial failure, the previous-good cache remains until the new one is fully written.
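The atomic-write-plus-sidecar pattern can be sketched with the standard library alone. The project itself names the atomicwrites package and a Sha256Sidecar helper; the version below swaps in `tempfile` and `os.replace` so the example has no third-party dependency, and the `.sha256` sidecar suffix is an assumption.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def _atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk before renaming
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def write_with_sidecar(target: Path, payload: bytes) -> str:
    """Write the artifact and a hex-digest sidecar; return the digest."""
    digest = hashlib.sha256(payload).hexdigest()
    _atomic_write(target, payload)
    _atomic_write(Path(str(target) + ".sha256"), digest.encode("ascii"))
    return digest
```

Each file is replaced atomically, but the pair is not: a crash between the two renames leaves a new artifact next to an old sidecar, which a later content-hash check would report as a mismatch rather than silently accept.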

5. Implementation Details

Algorithmic Complexity: dominated by batched descriptor generation on Jetson (GPU-bound). A worst-case build (~400 km²) is expected to finish within tens of minutes; provisioning is offline and not time-critical per AC-8.3. Tile network bandwidth is not in C10's budget; that cost belongs to C11.

State Management: stateless w.r.t. flight lifetime. No connection state — all dependencies are local.

Key Dependencies:

| Library | Version | Purpose |
|---|---|---|
| atomicwrites | latest | Atomic file replacement for .index + Manifest (D-C10-3) |
| hashlib (stdlib) | stdlib | SHA-256 content-hash sidecars |
| PyYAML / orjson | per project pin | Manifest serialization |
| numpy | per project pin | Descriptor batch ndarray container (AZ-322 DescriptorBatcher) |

AZ-322 internal phase — DescriptorBatcher:

The populate_descriptors phase walks every tile in C6 for the requested (bbox, zoom_levels, sector_class), embeds them through C7's InferenceRuntime (via C7EngineBackboneEmbedder, the default BackboneEmbedder impl), and hands the resulting (N, descriptor_dim) ndarray to AZ-306's DescriptorIndex.rebuild_from_descriptors for an atomic FAISS index write. CUDA OOM is handled via halve-and-retry bounded by C10BatcherConfig.max_oom_retries (default 1: 64 → 32, then succeed-or-fail-fast), so a real GPU regression surfaces in seconds rather than hiding behind silent retries. Per-10% progress is emitted both as DEBUG logs (c10.descriptor.progress) and via an optional progress_callback, so operator tooling can wire a TTY/GUI bar without touching the batcher itself. The descriptor int64 id formula is the canonical AZ-306 scheme (int.from_bytes(sha256("zoom|lat|lon").first8, "big", signed=True)). It is reproduced locally in C10 to avoid a circular dependency back into C6 internals; diverging from the AZ-306 formula would break AC-6.
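The id formula quoted above can be written out in runnable form as follows. The exact string formatting of zoom, lat, and lon is not specified here, so the sketch assumes plain f-string formatting; the authoritative encoding is whatever AZ-306 defines, and `descriptor_id` is a hypothetical name.

```python
import hashlib


def descriptor_id(zoom: int, lat: float, lon: float) -> int:
    """Signed 64-bit id from the first 8 bytes of sha256("zoom|lat|lon")."""
    # Assumption: fields are joined with "|" using Python's default formatting.
    key = f"{zoom}|{lat}|{lon}".encode("utf-8")
    first8 = hashlib.sha256(key).digest()[:8]
    return int.from_bytes(first8, "big", signed=True)
```

Interpreting the bytes as signed keeps every id inside the int64 range that FAISS id maps expect, at the cost of roughly half the ids being negative.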

Error Handling Strategy:

  • EngineBuildError / CalibrationCacheError: surfaced from C7 — never silently fall back; operator must intervene.
  • DescriptorBatchError: CUDA OOM during descriptor generation. Halve batch size and retry once; if still OOM, surface to operator.
  • ManifestWriteError: filesystem error or atomic-write rollback. Cache marked invalid; operator must re-run.
  • IdempotentNoOp: D-C10-1 manifest-hash matched the prior build's hash; skip rebuild; emit no-op report.
  • ContentHashMismatchError (F2): refuse takeoff; STATUSTEXT to GCS; FDR records the event; operator must re-run F1.
  • Missing tiles in C6 for the requested bbox/zoom: surface as BuildReport.failure with explicit instruction to run C11 TileDownloader first; do not fall back to a network fetch — that responsibility lives in C11.
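The halve-and-retry handling for DescriptorBatchError can be sketched as below. The `OutOfMemory` exception stands in for whatever the runtime raises on CUDA OOM, `embed` is a caller-supplied callable, and the retry budget is treated as a total for the whole run; all three are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


class OutOfMemory(RuntimeError):
    """Stand-in for the runtime's CUDA OOM exception (assumption)."""


class DescriptorBatchError(RuntimeError):
    """Raised when the retry budget is exhausted."""


def embed_with_oom_retry(
    embed: Callable[[Sequence], List],
    tiles: Sequence,
    batch_size: int = 64,
    max_oom_retries: int = 1,
) -> List:
    """Embed tiles in batches, halving the batch size on OOM up to the budget."""
    retries_left = max_oom_retries
    results: List = []
    start = 0
    while start < len(tiles):
        chunk = tiles[start:start + batch_size]
        try:
            results.extend(embed(chunk))
        except OutOfMemory as exc:
            if retries_left == 0 or batch_size == 1:
                raise DescriptorBatchError(
                    f"OOM at batch_size={batch_size}; retries exhausted"
                ) from exc
            retries_left -= 1
            batch_size //= 2  # e.g. 64 -> 32 with the default budget of 1
            continue  # retry the same chunk start with the smaller batch
        start += len(chunk)
    return results
```

With the default budget of one retry, a persistent OOM fails on the second attempt instead of shrinking the batch indefinitely.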

6. Extensions and Helpers

| Helper | Purpose | Used By |
|---|---|---|
| Sha256Sidecar | atomic write + content-hash sidecar pattern | C6, C7, C10 |
| EngineFilenameSchema | self-describing filename per D-C10-7 | C7, C10 |
| WgsConverter | bbox math | C4, C5, C6, C8, C10 |

7. Caveats & Edge Cases

Known limitations:

  • C10 depends on C6 already containing the tiles for the requested bbox + zoom levels. The F1 cache-build workflow (C12) sequences C11 TileDownloader → C10 build_cache_artifacts; C10 alone is not a complete F1.
  • D-C10-3 SHA-256 content-hash gate must cover EVERY artifact: every tile (the per-tile hash is computed at C11 download time and stored in C6), the FAISS .index, the calibration JSON, and the Manifest itself. Missing sidecars are a release-blocking defect.
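A per-artifact content-hash check in the spirit of the D-C10-3 gate could look like this. It assumes each artifact has a sibling file with a `.sha256` suffix holding the hex digest; the suffix and the `check_artifacts` name are illustrative, not the project's actual layout. A missing artifact or missing sidecar counts as a failure.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large artifacts are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def check_artifacts(artifacts: Iterable[Path]) -> Dict[Path, bool]:
    """Map each artifact to True only if its sidecar exists and matches."""
    results: Dict[Path, bool] = {}
    for artifact in artifacts:
        sidecar = Path(str(artifact) + ".sha256")
        if not artifact.is_file() or not sidecar.is_file():
            results[artifact] = False  # missing sidecar is treated as a failure
            continue
        recorded = sidecar.read_text(encoding="ascii").strip()
        results[artifact] = recorded == sha256_file(artifact)
    return results
```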

Potential race conditions:

  • Concurrent build_cache_artifacts invocations on the same cache root would corrupt state. The single-process operator orchestrator wraps each invocation in a filesystem lockfile (the same lockfile C11 honours); a second invocation that tries to start while the lock is held fails with an explicit error.
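A minimal lockfile guard of the kind described above can be built on exclusive file creation. The lock filename, the `CacheRootBusy` error, and the `cache_root_lock` helper are hypothetical; the sketch also does not handle stale locks left behind by a crashed process.

```python
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


class CacheRootBusy(RuntimeError):
    """Raised when another invocation already holds the lock."""


@contextmanager
def cache_root_lock(cache_root: Path, name: str = ".provision.lock") -> Iterator[Path]:
    """Hold an exclusive lockfile under cache_root for the duration of the block."""
    lock_path = cache_root / name
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so creation is the lock.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise CacheRootBusy(f"cache root is locked: {lock_path}") from exc
    try:
        os.write(fd, str(os.getpid()).encode("ascii"))  # record the owner for debugging
        yield lock_path
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

A caller would wrap the whole build in `with cache_root_lock(cache_root): ...` so the lock is released even when the build raises.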

Performance bottlenecks:

  • Descriptor batched generation is GPU-bound; batching is the main lever (D-C7-1 INT8/FP16 mix decision applies).
  • Engine compile is workspace-bound on Jetson; D-C10-6 calibration cache reuse is the main lever.

8. Dependency Graph

Must be implemented after: C6 (read source for tiles, write target for FAISS), C7 (engine + descriptor runtime), C2 (backbone interface for descriptor generation; called via C7).

Can be implemented in parallel with: C8, C13.

Blocks: C12 (operator can't sequence F1 without C10 ready), F1, F2 (verify_manifest), F8 (warm-cache verify on reboot recovery).

9. Logging Strategy

| Log Level | When | Example |
|---|---|---|
| ERROR | EngineBuildError, DescriptorBatchError, ManifestWriteError, ContentHashMismatchError (F2) | C10 engine build failed: backbone=disk; takeoff blocked |
| WARN | engine cache miss falls through to build | C10 engine cache miss: model=ultra_vpr; sm=87, jp=6.2, trt=10.3, fp16; rebuild |
| INFO | Build start/end + report; verify_manifest pass | C10 build complete: engines=4, descriptors=87654, manifest_hash=…; outcome=success |
| DEBUG | per-tile descriptor batch progress | C10 descriptor batch progress: 12345/87654 (14%) |

Log format: structured JSON. Log storage: stdout (operator tool); journald (companion verify); FDR via C13 (only for F2 verify_manifest events — provisioning is offline and goes to operator-facing logs, not flight FDR).
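Structured JSON logging of this kind can be approximated with the standard `logging` module. The field names, the `fields` key passed through `extra`, and the `make_logger` helper are assumptions for illustration; the project's actual structured logger may differ.

```python
import json
import logging
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    # Python names the level WARNING; the table above uses WARN.
    LEVEL_NAMES = {"WARNING": "WARN"}

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": self.LEVEL_NAMES.get(record.levelname, record.levelname),
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured fields arrive via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name: str, stream: TextIO) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger
```

For example, `make_logger("c10", sys.stdout).info("C10 build complete", extra={"fields": {"engines": 4, "outcome": "success"}})` would emit a single JSON line containing those fields alongside the level and message.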