mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 12:31:13 +00:00
Update autodev state, architecture documentation, and glossary terms
Transitioned the autodev state to phase 21, reflecting the completion of Step 5 and the drafting of Step 6 epics. Revised the architecture documentation to clarify the roles of the Tile Manager and its components, ensuring accurate representation of the system's operational flow. Updated glossary entries for Flight State and Operator to incorporate recent changes and enhance clarity on component interactions and responsibilities.
@@ -0,0 +1,151 @@
# C10 — Pre-flight Cache Provisioning
## 1. High-Level Overview
**Purpose**: build the **model-derived** pre-flight cache artifacts on top of an already-populated tile store, and verify them at takeoff. After C11 `TileDownloader` has fetched tiles into C6, C10 orchestrates: compile/deserialize TensorRT engines via C7 → batch each tile through C2's backbone for descriptors → atomically write FAISS HNSW index with SHA-256 sidecars (D-C10-3) → write Manifest with hash of (model + calibration + corpus + sector_class) for D-C10-1 idempotence. At F2 takeoff load, run `verify_manifest` (D-C10-3 SHA-256 content-hash gate) before allowing the system to arm.
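
The D-C10-1 idempotence key described above can be sketched as follows. This is a minimal illustration, not the project's manifest format: the function names, the canonical-JSON encoding, and the idea of passing each input as a pre-computed hex digest are all assumptions.

```python
import hashlib
import json
from typing import Optional


def manifest_hash(model_sha256: str, calibration_sha256: str,
                  corpus_tile_sha256: list[str], sector_class: str) -> str:
    """Fold the four build inputs into one hex digest (hypothetical layout)."""
    payload = {
        "model": model_sha256,
        "calibration": calibration_sha256,
        # Sort so the key does not depend on tile enumeration order.
        "corpus": sorted(corpus_tile_sha256),
        "sector_class": sector_class,
    }
    # Canonical encoding: sorted keys, no insignificant whitespace.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_idempotent_no_op(new_hash: str, previous_hash: Optional[str]) -> bool:
    """A rebuild can be skipped only when a prior hash exists and matches."""
    return previous_hash is not None and new_hash == previous_hash
```

Any change to the model, the calibration data, the tile corpus, or the sector class yields a different digest, so a matching digest is what would justify the no-op path.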
**C10 does NOT touch `satellite-provider`.** Tile I/O, both download (F1 inbound) and post-landing upload (F10), lives in C11 (Tile Manager). C10 reads tiles from C6 and writes engines, descriptors, and the manifest to the filesystem and Postgres. The split is operational: C11 carries the operator-side network identity (TLS API key for download, per-flight signing key for upload) and the airborne-exclusion property (ADR-004); C10 carries the model identity and the takeoff-load verifier, neither of which needs to leave the workstation/companion enclave at runtime.
**Architectural Pattern**: Coordinator — single concrete implementation `CacheProvisioner` behind two interfaces (`CacheProvisioner` for the F1 build phase, `ManifestVerifier` for F2's content-hash gate). The interfaces are split because F2 only needs the verifier and shouldn't pull in the full provisioning code path.
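
One way to picture the interface split is with structural typing. The sketch below assumes Python `Protocol` classes; `TakeoffOnlyVerifier`, `gate_takeoff`, and the dict-shaped result are hypothetical names used only for illustration.

```python
from pathlib import Path
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ManifestVerifier(Protocol):
    def verify_manifest(self, manifest_path: Path) -> Any: ...


@runtime_checkable
class CacheProvisioner(Protocol):
    def build_cache_artifacts(self, request: Any) -> Any: ...
    def compile_engines_for_corpus(self, backbones: Any) -> Any: ...


class TakeoffOnlyVerifier:
    """Hypothetical F2-side object exposing only the verifier method."""

    def verify_manifest(self, manifest_path: Path) -> dict:
        # Placeholder logic: a real verifier would check content hashes.
        return {"outcome": "pass" if manifest_path.exists() else "fail"}


def gate_takeoff(verifier: ManifestVerifier, manifest_path: Path) -> bool:
    """F2 depends only on the narrow interface, not on the build code path."""
    return verifier.verify_manifest(manifest_path)["outcome"] == "pass"
```

Because `gate_takeoff` is typed against `ManifestVerifier` alone, the takeoff path can be exercised with an object that carries none of the provisioning methods.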
**Upstream dependencies**:
- C12 OperatorTooling → triggers `build_cache_artifacts(...)` after C11 `TileDownloader` has populated C6.
- C6 TileStore + TileMetadataStore + DescriptorIndex → read source (tiles + metadata), write target (FAISS index).
- C7 InferenceRuntime → engine compile + deserialize.
- C2 backbone (via C7 engine) → descriptor batched generation.
**Downstream consumers**:
- F2 takeoff load → consumes `verify_manifest` outcome.
## 2. Internal Interfaces
### Interface: `CacheProvisioner`
| Method | Input | Output | Async | Error Types |
|--------|-------|--------|-------|-------------|
| `build_cache_artifacts` | `BuildRequest` | `BuildReport` | No (offline; minutes) | `EngineBuildError`, `DescriptorBatchError`, `ManifestWriteError`, `IdempotentNoOp` |
| `compile_engines_for_corpus` | `BackboneList` | `list[EngineCacheEntry]` | No | `EngineBuildError`, `CalibrationCacheError` |
### Interface: `ManifestVerifier`
| Method | Input | Output | Async | Error Types |
|--------|-------|--------|-------|-------------|
| `verify_manifest` | `manifest_path: Path` | `VerificationResult` | No | `ManifestNotFoundError`, `ContentHashMismatchError` |
**Input/Output DTOs**:
```
BuildRequest:
  bbox: BoundingBox (lat_min, lon_min, lat_max, lon_max)  # scopes which C6 tiles are in the manifest
  zoom_levels: list[int]
  sector_class: enum {active_conflict, stable_rear}  # baked into manifest
  calibration_path: Path
  cache_root: Path

BuildReport:
  engines_built: int
  engines_reused: int
  descriptors_generated: int
  manifest_hash: sha256
  outcome: enum {success, failure, idempotent_no_op}
  failure_reason: string (optional)

Manifest: see data_model.md
EngineCacheEntry: see data_model.md

VerificationResult:
  manifest_hash_match: bool
  per_artifact_hash_match: dict[Path, bool]
  outcome: enum {pass, fail}
  fail_reasons: list[string]
```
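
As a rough illustration, the two result DTOs could map onto Python dataclasses like this. The enum class names are assumptions, and deriving `VerificationResult.outcome` from the other fields (rather than storing it) is a design choice made here for the sketch, not something the spec states.

```python
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Optional


class BuildOutcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    IDEMPOTENT_NO_OP = "idempotent_no_op"


class VerifyOutcome(Enum):
    PASS = "pass"
    FAIL = "fail"


@dataclass(frozen=True)
class BuildReport:
    engines_built: int
    engines_reused: int
    descriptors_generated: int
    manifest_hash: str  # SHA-256 hex digest
    outcome: BuildOutcome
    failure_reason: Optional[str] = None


@dataclass(frozen=True)
class VerificationResult:
    manifest_hash_match: bool
    per_artifact_hash_match: dict[Path, bool]
    fail_reasons: list[str] = field(default_factory=list)

    @property
    def outcome(self) -> VerifyOutcome:
        # Assumption: pass only if the manifest and every artifact match.
        ok = self.manifest_hash_match and all(self.per_artifact_hash_match.values())
        return VerifyOutcome.PASS if ok else VerifyOutcome.FAIL
```

Computing the outcome from the per-artifact map keeps the two from disagreeing, at the cost of diverging slightly from the flat field list above.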
## 3. External API Specification
Not applicable. C10 has no network surface — all I/O is local filesystem + local Postgres.
## 4. Data Access Patterns
C10 reads `tiles` rows from C6 (scoped to the build's bbox + zoom_levels), writes the FAISS `.index` to filesystem via `Sha256Sidecar`, and writes Manifest + `manifests` row to Postgres via C6.
### Storage Estimates
| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
|-----------------|---------------------|----------|------------|-------------|
| Manifest | one per build per cached area | ~10 KB (YAML/JSON) | negligible | per build |
| SHA-256 sidecars | one per artifact (.index, calibration JSON, manifest, .engine) | 64 B (hex digest) | negligible | per build |
### Data Management
**Seed data**: none — C10 writes from scratch (or D-C10-1 idempotently no-ops). Tiles must already be in C6 (placed there by C11 `TileDownloader`); a missing-tiles condition is a build error, not a download trigger.
**Rollback**: D-C10-1 manifest-hash check makes provisioning idempotent. Atomic writes (atomicwrites package) prevent partial states; on partial failure, the previous-good cache remains until the new one is fully written.
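
The atomic-write-plus-sidecar pattern can be sketched with the standard library alone. Note the substitution: this uses `tempfile.mkstemp` and `os.replace` in place of the `atomicwrites` package, and the `.sha256` sidecar suffix and function names are assumptions.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def _atomic_write_bytes(path: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk before the rename
        os.replace(tmp_name, path)  # readers see the old or the new file, never a partial one
    except BaseException:
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def write_with_sidecar(path: Path, data: bytes) -> str:
    """Write an artifact and its SHA-256 sidecar; return the hex digest."""
    digest = hashlib.sha256(data).hexdigest()
    _atomic_write_bytes(path, data)
    _atomic_write_bytes(path.with_name(path.name + ".sha256"), digest.encode("ascii"))
    return digest
```

In this sketch the artifact is written before its sidecar, so a crash between the two leaves a new artifact next to a stale digest, which a content-hash check would reject rather than accept.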
## 5. Implementation Details
**Algorithmic Complexity**: dominated by batched descriptor generation on Jetson (GPU-bound). Worst-case provisioning of ~400 km² should finish within tens of minutes (offline, not time-critical per AC-8.3). Tile network bandwidth is **not** in C10's budget; that cost is in C11.
**State Management**: stateless w.r.t. flight lifetime. No connection state — all dependencies are local.
**Key Dependencies**:
| Library | Version | Purpose |
|---------|---------|---------|
| atomicwrites | latest | Atomic file replacement for `.index` + Manifest (D-C10-3) |
| hashlib (stdlib) | stdlib | SHA-256 content-hash sidecars |
| PyYAML / orjson | per project pin | Manifest serialization |
**Error Handling Strategy**:
- `EngineBuildError` / `CalibrationCacheError`: surfaced from C7; never silently fall back. The operator must intervene.
- `DescriptorBatchError`: CUDA OOM during descriptor generation. Halve the batch size and retry once; if it still runs out of memory, surface to the operator.
- `ManifestWriteError`: filesystem error or atomic-write rollback. The cache is marked invalid; the operator must re-run.
- `IdempotentNoOp`: the D-C10-1 manifest hash matched the prior build's hash; skip the rebuild and emit a no-op report.
- `ContentHashMismatchError` (F2): refuse takeoff; send STATUSTEXT to GCS; FDR records the event; the operator must re-run F1.
- **Missing tiles in C6 for the requested bbox/zoom**: surface as `BuildReport.failure` with an explicit instruction to run C11 `TileDownloader` first; do **not** fall back to a network fetch, because that responsibility lives in C11.
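
The halve-and-retry-once rule for `DescriptorBatchError` could look roughly like the sketch below. `run_batches` is a hypothetical callable standing in for the C7-backed backbone call, and `MemoryError` is only a placeholder for whatever exception the runtime actually raises on CUDA OOM.

```python
class DescriptorBatchError(RuntimeError):
    """Raised when descriptor generation still runs out of memory after one retry."""


def generate_descriptors(run_batches, tiles, batch_size, oom_error=MemoryError):
    """Run descriptor generation, halving the batch size once on OOM.

    run_batches(tiles, batch_size) and oom_error are assumptions for this
    sketch; substitute the real runtime call and its OOM exception type.
    """
    try:
        return run_batches(tiles, batch_size)
    except oom_error:
        halved = max(1, batch_size // 2)
    # Exactly one retry at the reduced batch size; no further back-off.
    try:
        return run_batches(tiles, halved)
    except oom_error as exc:
        raise DescriptorBatchError(
            f"OOM at batch_size={batch_size} and again at batch_size={halved}"
        ) from exc
```

Keeping the retry count fixed at one means a persistent memory problem reaches the operator quickly instead of degrading into very small batches.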
## 6. Extensions and Helpers
| Helper | Purpose | Used By |
|--------|---------|---------|
| `Sha256Sidecar` | atomic write + content-hash sidecar pattern | C6, C7, C10 |
| `EngineFilenameSchema` | self-describing filename per D-C10-7 | C7, C10 |
| `WgsConverter` | bbox math | C4, C5, C6, C8, C10 |
## 7. Caveats & Edge Cases
**Known limitations**:
- C10 depends on C6 already containing the tiles for the requested bbox + zoom levels. The F1 cache-build workflow (C12) sequences `C11 TileDownloader → C10 build_cache_artifacts`; C10 alone is not a complete F1.
- D-C10-3 SHA-256 content-hash gate must cover EVERY artifact: every tile (the per-tile hash is computed at C11 download time and stored in C6), the FAISS `.index`, the calibration JSON, and the Manifest itself. Missing sidecars are a release-blocking defect.
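
A per-artifact check in the spirit of the D-C10-3 gate might be sketched as follows, treating a missing sidecar as a failure rather than skipping it. The `.sha256` sidecar naming and the function names are assumptions; tile hashes stored in C6 are out of scope for this file-based illustration.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large artifacts are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_sidecars(artifacts: list[Path]) -> tuple[dict[Path, bool], list[str]]:
    """Return per-artifact results plus human-readable failure reasons."""
    per_artifact: dict[Path, bool] = {}
    fail_reasons: list[str] = []
    for artifact in artifacts:
        sidecar = artifact.with_name(artifact.name + ".sha256")  # assumed naming
        if not artifact.is_file():
            per_artifact[artifact] = False
            fail_reasons.append(f"missing artifact: {artifact}")
        elif not sidecar.is_file():
            per_artifact[artifact] = False
            fail_reasons.append(f"missing sidecar: {sidecar}")
        else:
            expected = sidecar.read_text(encoding="ascii").strip()
            ok = sha256_file(artifact) == expected
            per_artifact[artifact] = ok
            if not ok:
                fail_reasons.append(f"hash mismatch: {artifact}")
    return per_artifact, fail_reasons
```

The two return values line up with the `per_artifact_hash_match` and `fail_reasons` fields of `VerificationResult`.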
**Potential race conditions**:
- Concurrent `build_cache_artifacts` invocations on the same cache root would corrupt state. The single-process operator tool wraps each build in a filesystem lockfile (the same lockfile C11 honours); a second invocation that tries to start fails with an explicit error.
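
A fail-fast lockfile of this kind can be sketched with an exclusive create. The lockfile name, the `CacheRootBusyError` type, and the context-manager shape are assumptions, and the sketch does not handle stale locks left behind by a crashed process.

```python
import os
from contextlib import contextmanager
from pathlib import Path


class CacheRootBusyError(RuntimeError):
    """Raised when another invocation already holds the cache-root lock."""


@contextmanager
def cache_root_lock(cache_root: Path, name: str = ".provision.lock"):
    lock_path = cache_root / name  # hypothetical lockfile name
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise CacheRootBusyError(f"cache root is locked: {lock_path}") from exc
    try:
        os.write(fd, str(os.getpid()).encode("ascii"))  # record the holder for diagnostics
        yield lock_path
    finally:
        os.close(fd)
        lock_path.unlink()
```

A build would then run inside `with cache_root_lock(cache_root): ...`, and a second invocation gets `CacheRootBusyError` immediately instead of waiting.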
**Performance bottlenecks**:
- Descriptor batched generation is GPU-bound; batching is the main lever (D-C7-1 INT8/FP16 mix decision applies).
- Engine compile is workspace-bound on Jetson; D-C10-6 calibration cache reuse is the main lever.
## 8. Dependency Graph
**Must be implemented after**: C6 (read source for tiles, write target for FAISS), C7 (engine + descriptor runtime), C2 (backbone interface for descriptor generation; called via C7).
**Can be implemented in parallel with**: C8, C13.
**Blocks**: C12 (operator can't sequence F1 without C10 ready), F1, F2 (verify_manifest), F8 (warm-cache verify on reboot recovery).
## 9. Logging Strategy
| Log Level | When | Example |
|-----------|------|---------|
| ERROR | `EngineBuildError`, `DescriptorBatchError`, `ManifestWriteError`, `ContentHashMismatchError` (F2) | `C10 engine build failed: backbone=disk; takeoff blocked` |
| WARN | engine cache miss falls through to build | `C10 engine cache miss: model=ultra_vpr; sm=87, jp=6.2, trt=10.3, fp16; rebuild` |
| INFO | Build start/end + report; verify_manifest pass | `C10 build complete: engines=4, descriptors=87654, manifest_hash=…; outcome=success` |
| DEBUG | per-tile descriptor batch progress | `C10 descriptor batch progress: 12345/87654 (14%)` |
**Log format**: structured JSON.
**Log storage**: stdout (operator tool); journald (companion verify); FDR via C13 (only for F2 verify_manifest events — provisioning is offline and goes to operator-facing logs, not flight FDR).
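
Structured JSON logging to stdout can be sketched with the standard `logging` module. The JSON key names, the `fields` extra attribute, and the logger name are assumptions; note also that Python names the level `WARNING` where the table above says WARN.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (hypothetical schema)."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "component": "C10",
            "message": record.getMessage(),
        }
        # Structured key/value pairs passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)


def make_logger(name: str = "c10") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger
```

A build-complete event would then be emitted as `make_logger().info("build complete", extra={"fields": {"engines": 4, "outcome": "success"}})`.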