mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 11:41:12 +00:00
Update autodev state, architecture documentation, and glossary terms
Transitioned the autodev state to phase 21, reflecting the completion of Step 5 and the drafting of Step 6 epics. Revised the architecture documentation to clarify the roles of the Tile Manager and its components, ensuring accurate representation of the system's operational flow. Updated glossary entries for Flight State and Operator to incorporate recent changes and enhance clarity on component interactions and responsibilities.
@@ -0,0 +1,151 @@
# C10 — Pre-flight Cache Provisioning

## 1. High-Level Overview

**Purpose**: build the **model-derived** pre-flight cache artifacts on top of an already-populated tile store, and verify them at takeoff. After C11 `TileDownloader` has fetched tiles into C6, C10 orchestrates: compile/deserialize TensorRT engines via C7 → batch each tile through C2's backbone for descriptors → atomically write the FAISS HNSW index with SHA-256 sidecars (D-C10-3) → write the Manifest with a hash of (model + calibration + corpus + sector_class) for D-C10-1 idempotence. At F2 takeoff load, run `verify_manifest` (D-C10-3 SHA-256 content-hash gate) before allowing the system to arm.

**C10 does NOT touch `satellite-provider`.** Tile I/O — both download (F1 inbound) and post-landing upload (F10) — lives in C11 (Tile Manager). C10 reads tiles from C6 and writes engines + descriptors + manifest to the filesystem and Postgres. The split is operational: C11 carries the operator-side network identity (TLS API key for download, per-flight signing key for upload) and the airborne-exclusion property (ADR-004); C10 carries the model identity and the takeoff-load verifier, neither of which needs to leave the workstation/companion enclave at runtime.

**Architectural Pattern**: Coordinator — single concrete implementation `CacheProvisioner` behind two interfaces (`CacheProvisioner` for the F1 build phase, `ManifestVerifier` for F2's content-hash gate). The interfaces are split because F2 only needs the verifier and shouldn't pull in the full provisioning code path.

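
The interface split described above can be sketched with structural typing. This is a minimal illustration, not the project's code: the class name `CacheProvisionerImpl`, the `takeoff_gate` helper, the dict-based request/report, and the placeholder method bodies are all assumptions made for the example. Only the method names `build_cache_artifacts` and `verify_manifest` come from this document.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class VerificationResult:
    outcome: str  # "pass" or "fail"


class ManifestVerifier(Protocol):
    """Narrow interface: the only thing F2 takeoff load depends on."""

    def verify_manifest(self, manifest_path: Path) -> VerificationResult: ...


class CacheProvisioner(Protocol):
    """Build-phase interface used by the F1 operator workflow."""

    def build_cache_artifacts(self, request: dict) -> dict: ...


class CacheProvisionerImpl:
    """Hypothetical single concrete class that satisfies both protocols."""

    def build_cache_artifacts(self, request: dict) -> dict:
        return {"outcome": "success"}  # placeholder body for the sketch

    def verify_manifest(self, manifest_path: Path) -> VerificationResult:
        # Placeholder check: a real verifier would compare content hashes.
        return VerificationResult("pass" if manifest_path.exists() else "fail")


def takeoff_gate(verifier: ManifestVerifier, manifest_path: Path) -> bool:
    """F2-side caller: typed against the verifier protocol only."""
    return verifier.verify_manifest(manifest_path).outcome == "pass"
```

Typing `takeoff_gate` against `ManifestVerifier` keeps the build API out of the F2 caller's view, which is one way to express the intent of the split.
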
**Upstream dependencies**:

- C12 OperatorTooling → triggers `build_cache_artifacts(...)` after C11 `TileDownloader` has populated C6.
- C6 TileStore + TileMetadataStore + DescriptorIndex → read source (tiles + metadata), write target (FAISS index).
- C7 InferenceRuntime → engine compile + deserialize.
- C2 backbone (via C7 engine) → batched descriptor generation.

**Downstream consumers**:

- F2 takeoff load → consumes `verify_manifest` outcome.

## 2. Internal Interfaces

### Interface: `CacheProvisioner`

| Method | Input | Output | Async | Error Types |
|--------|-------|--------|-------|-------------|
| `build_cache_artifacts` | `BuildRequest` | `BuildReport` | No (offline; minutes) | `EngineBuildError`, `DescriptorBatchError`, `ManifestWriteError`, `IdempotentNoOp` |
| `compile_engines_for_corpus` | `BackboneList` | `list[EngineCacheEntry]` | No | `EngineBuildError`, `CalibrationCacheError` |

### Interface: `ManifestVerifier`

| Method | Input | Output | Async | Error Types |
|--------|-------|--------|-------|-------------|
| `verify_manifest` | `manifest_path: Path` | `VerificationResult` | No | `ManifestNotFoundError`, `ContentHashMismatchError` |

**Input/Output DTOs**:

```
BuildRequest:
  bbox: BoundingBox (lat_min, lon_min, lat_max, lon_max) # scopes which C6 tiles are in the manifest
  zoom_levels: list[int]
  sector_class: enum {active_conflict, stable_rear} # baked into manifest
  calibration_path: Path
  cache_root: Path

BuildReport:
  engines_built: int
  engines_reused: int
  descriptors_generated: int
  manifest_hash: sha256
  outcome: enum {success, failure, idempotent_no_op}
  failure_reason: string (optional)

Manifest: see data_model.md
EngineCacheEntry: see data_model.md

VerificationResult:
  manifest_hash_match: bool
  per_artifact_hash_match: dict[Path, bool]
  outcome: enum {pass, fail}
  fail_reasons: list[string]
```
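
One possible Python rendering of the DTOs above, as a hedged sketch. The enum class names (`SectorClass`, `BuildOutcome`), the use of frozen dataclasses, and deriving `VerificationResult.outcome` from the match flags are assumptions for illustration. The field names and enum values follow the schema above.

```python
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Optional


class SectorClass(Enum):
    ACTIVE_CONFLICT = "active_conflict"
    STABLE_REAR = "stable_rear"


class BuildOutcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    IDEMPOTENT_NO_OP = "idempotent_no_op"


@dataclass(frozen=True)
class BoundingBox:
    lat_min: float
    lon_min: float
    lat_max: float
    lon_max: float


@dataclass(frozen=True)
class BuildRequest:
    bbox: BoundingBox          # scopes which C6 tiles are in the manifest
    zoom_levels: list[int]
    sector_class: SectorClass  # baked into the manifest
    calibration_path: Path
    cache_root: Path


@dataclass
class BuildReport:
    engines_built: int
    engines_reused: int
    descriptors_generated: int
    manifest_hash: str  # hex SHA-256 digest
    outcome: BuildOutcome
    failure_reason: Optional[str] = None


@dataclass
class VerificationResult:
    manifest_hash_match: bool
    per_artifact_hash_match: dict[Path, bool] = field(default_factory=dict)
    fail_reasons: list[str] = field(default_factory=list)

    @property
    def outcome(self) -> str:
        # Assumption: "pass" requires the manifest and every artifact to match.
        ok = self.manifest_hash_match and all(self.per_artifact_hash_match.values())
        return "pass" if ok else "fail"
```
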
## 3. External API Specification

Not applicable. C10 has no network surface — all I/O is local filesystem + local Postgres.

## 4. Data Access Patterns

C10 reads `tiles` rows from C6 (scoped to the build's bbox + zoom_levels), writes the FAISS `.index` to filesystem via `Sha256Sidecar`, and writes Manifest + `manifests` row to Postgres via C6.

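
The atomic-write-plus-sidecar pattern behind `Sha256Sidecar` could look roughly like the sketch below. It is an illustration under assumptions: it uses the standard library (`tempfile.mkstemp` followed by `os.replace`) as a stand-in for the `atomicwrites` package named in this document, and the function names and the `.sha256` sidecar suffix are hypothetical.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sidecar_path(target: Path) -> Path:
    """Hypothetical naming rule: `<artifact name>.sha256` next to the artifact."""
    return target.with_name(target.name + ".sha256")


def _atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename over `path`."""
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def write_with_sidecar(target: Path, payload: bytes) -> str:
    """Write `payload` and its hex SHA-256 sidecar; return the digest."""
    digest = hashlib.sha256(payload).hexdigest()
    _atomic_write(target, payload)
    _atomic_write(sidecar_path(target), digest.encode("ascii"))
    return digest


def sidecar_matches(target: Path) -> bool:
    """Recompute the artifact hash and compare it with the sidecar content."""
    expected = sidecar_path(target).read_text(encoding="ascii").strip()
    return hashlib.sha256(target.read_bytes()).hexdigest() == expected
```

The two renames are individually atomic but not atomic as a pair, so a crash between them leaves an artifact whose sidecar is stale. In this sketch that state is caught by `sidecar_matches` rather than prevented.
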
### Storage Estimates

| Table/Collection | Est. Row Count (1yr) | Row Size | Total Size | Growth Rate |
|-----------------|---------------------|----------|------------|-------------|
| Manifest | one per build per cached area | ~10 KB (YAML/JSON) | negligible | per build |
| SHA-256 sidecars | one per artifact (.index, calibration JSON, manifest, .engine) | 64 B (hex digest) | negligible | per build |

### Data Management

**Seed data**: none — C10 writes from scratch (or D-C10-1 idempotently no-ops). Tiles must already be in C6 (placed there by C11 `TileDownloader`); a missing-tiles condition is a build error, not a download trigger.

**Rollback**: D-C10-1 manifest-hash check makes provisioning idempotent. Atomic writes (atomicwrites package) prevent partial states; on partial failure, the previous-good cache remains until the new one is fully written.

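
The D-C10-1 idempotence check can be illustrated with a small sketch. The exact manifest-hash inputs and their serialization live in `data_model.md` and are not reproduced here, so the canonical-JSON encoding, the argument names, and both function names below are assumptions. The only detail taken from this document is that the hash covers model + calibration + corpus + sector_class.

```python
import hashlib
import json
from typing import Iterable, Optional


def manifest_hash(
    model_hashes: Iterable[str],
    calibration_hash: str,
    tile_hashes: Iterable[str],
    sector_class: str,
) -> str:
    """Hash (model + calibration + corpus + sector_class) deterministically."""
    canonical = json.dumps(
        {
            "models": sorted(model_hashes),   # sorted: input order must not matter
            "calibration": calibration_hash,
            "corpus": sorted(tile_hashes),
            "sector_class": sector_class,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_idempotent_no_op(previous_hash: Optional[str], new_hash: str) -> bool:
    """True when a prior build exists and its manifest hash is unchanged."""
    return previous_hash is not None and previous_hash == new_hash
```

Sorting the per-model and per-tile hashes before serializing keeps the result independent of enumeration order, which is what lets a re-run over an unchanged C6 land on the same hash.
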
## 5. Implementation Details

**Algorithmic Complexity**: dominated by batched descriptor generation on Jetson (GPU-bound). Worst-case provisioning of ~400 km² takes at most tens of minutes (offline, not time-critical per AC-8.3). Tile network bandwidth is **not** in C10's budget — that cost is in C11.

**State Management**: stateless w.r.t. flight lifetime. No connection state — all dependencies are local.

**Key Dependencies**:

| Library | Version | Purpose |
|---------|---------|---------|
| atomicwrites | latest | Atomic file replacement for `.index` + Manifest (D-C10-3) |
| hashlib (stdlib) | stdlib | SHA-256 content-hash sidecars |
| PyYAML / orjson | per project pin | Manifest serialization |

**Error Handling Strategy**:

- `EngineBuildError` / `CalibrationCacheError`: surfaced from C7 — never silently fall back; operator must intervene.
- `DescriptorBatchError`: CUDA OOM during descriptor generation. Halve batch size and retry once; if still OOM, surface to operator.
- `ManifestWriteError`: filesystem error or atomic-write rollback. Cache marked invalid; operator must re-run.
- `IdempotentNoOp`: D-C10-1 manifest-hash matched the prior build's hash; skip rebuild; emit no-op report.
- `ContentHashMismatchError` (F2): refuse takeoff; STATUSTEXT to GCS; FDR records the event; operator must re-run F1.
- **Missing tiles in C6 for the requested bbox/zoom**: surface as `BuildReport.failure` with explicit instruction to run C11 `TileDownloader` first; do **not** fall back to a network fetch — that responsibility lives in C11.

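
The "halve batch size and retry once" rule for `DescriptorBatchError` can be sketched as follows. This is an assumption-laden illustration: Python's built-in `MemoryError` stands in for a CUDA out-of-memory condition, `run_batch` is a hypothetical callable wrapping the C7 engine, and the retry restarts the whole pass at the smaller batch size instead of resuming mid-corpus.

```python
from typing import Callable, Sequence


class DescriptorBatchError(RuntimeError):
    """Raised when descriptor generation still runs out of memory after one retry."""


def generate_descriptors(
    tiles: Sequence,
    run_batch: Callable[[Sequence], list],
    batch_size: int,
) -> list:
    """Run `run_batch` over `tiles`; on OOM, halve the batch size and retry once."""
    # Two attempts in total: the requested size, then half of it (at least 1).
    for attempt_size in (batch_size, max(1, batch_size // 2)):
        try:
            descriptors = []
            for start in range(0, len(tiles), attempt_size):
                descriptors.extend(run_batch(tiles[start:start + attempt_size]))
            return descriptors
        except MemoryError:  # stand-in for a CUDA OOM raised by the runtime
            continue
    raise DescriptorBatchError(
        f"out of memory at batch sizes {batch_size} and {max(1, batch_size // 2)}"
    )
```

Bounding the loop to exactly two attempts keeps the behaviour predictable for the operator: either the halved batch fits, or the error is surfaced.
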
## 6. Extensions and Helpers

| Helper | Purpose | Used By |
|--------|---------|---------|
| `Sha256Sidecar` | atomic write + content-hash sidecar pattern | C6, C7, C10 |
| `EngineFilenameSchema` | self-describing filename per D-C10-7 | C7, C10 |
| `WgsConverter` | bbox math | C4, C5, C6, C8, C10 |

## 7. Caveats & Edge Cases

**Known limitations**:

- C10 depends on C6 already containing the tiles for the requested bbox + zoom levels. The F1 cache-build workflow (C12) sequences `C11 TileDownloader → C10 build_cache_artifacts`; C10 alone is not a complete F1.
- D-C10-3 SHA-256 content-hash gate must cover EVERY artifact: every tile (the per-tile hash is computed at C11 download time and stored in C6), the FAISS `.index`, the calibration JSON, and the Manifest itself. Missing sidecars are a release-blocking defect.

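
A rough sketch of the per-artifact part of the D-C10-3 gate, assuming the Manifest can be reduced to a mapping from relative artifact path to expected hex SHA-256 digest. The function name `verify_artifacts`, the plain-dict return value, and the wording of the failure reasons are hypothetical. The real `verify_manifest` returns a `VerificationResult` and also checks the manifest's own hash.

```python
import hashlib
from pathlib import Path


def verify_artifacts(cache_root: Path, expected: dict[str, str]) -> dict:
    """Compare every listed artifact under `cache_root` with its expected digest."""
    per_artifact: dict[str, bool] = {}
    fail_reasons: list[str] = []
    for rel_path, want in sorted(expected.items()):
        artifact = cache_root / rel_path
        if not artifact.is_file():
            # A listed artifact that is absent fails the gate outright.
            per_artifact[rel_path] = False
            fail_reasons.append(f"missing artifact: {rel_path}")
            continue
        digest = hashlib.sha256()
        with artifact.open("rb") as fh:
            # Stream in 1 MiB chunks so large indexes are not loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        ok = digest.hexdigest() == want
        per_artifact[rel_path] = ok
        if not ok:
            fail_reasons.append(f"content hash mismatch: {rel_path}")
    return {
        "per_artifact_hash_match": per_artifact,
        "outcome": "fail" if fail_reasons else "pass",
        "fail_reasons": fail_reasons,
    }
```
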
**Potential race conditions**:

- Concurrent `build_cache_artifacts` invocations on the same cache root would corrupt state. The single-process operator tool wraps the build in a filesystem lockfile (the same lockfile C11 honours); a second invocation that tries to start fails with an explicit error.

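
A minimal sketch of such a lockfile guard, assuming exclusive file creation (`O_CREAT | O_EXCL`) is the locking mechanism. The lockfile name `.cache.lock`, the `CacheLockedError` exception, and the `cache_lock` helper are all hypothetical. This document does not specify how the shared C10/C11 lockfile is actually implemented.

```python
import os
from contextlib import contextmanager
from pathlib import Path


class CacheLockedError(RuntimeError):
    """Raised when another invocation already holds the cache-root lock."""


@contextmanager
def cache_lock(cache_root: Path, name: str = ".cache.lock"):
    """Hold an exclusive lockfile under `cache_root` for the duration of a build."""
    lock_path = cache_root / name
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise CacheLockedError(f"another build holds {lock_path}") from exc
    try:
        os.write(fd, str(os.getpid()).encode("ascii"))  # record the owner PID
        yield lock_path
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

A build would run inside `with cache_lock(request.cache_root): ...`. One known weakness of this approach is that a crashed process leaves a stale lockfile behind, so a real tool would need a way to detect and clear it.
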
**Performance bottlenecks**:

- Batched descriptor generation is GPU-bound; batching is the main lever (D-C7-1 INT8/FP16 mix decision applies).
- Engine compile is workspace-bound on Jetson; D-C10-6 calibration cache reuse is the main lever.

## 8. Dependency Graph

**Must be implemented after**: C6 (read source for tiles, write target for FAISS), C7 (engine + descriptor runtime), C2 (backbone interface for descriptor generation; called via C7).

**Can be implemented in parallel with**: C8, C13.

**Blocks**: C12 (operator can't sequence F1 without C10 ready), F1, F2 (verify_manifest), F8 (warm-cache verify on reboot recovery).

## 9. Logging Strategy

| Log Level | When | Example |
|-----------|------|---------|
| ERROR | `EngineBuildError`, `DescriptorBatchError`, `ManifestWriteError`, `ContentHashMismatchError` (F2) | `C10 engine build failed: backbone=disk; takeoff blocked` |
| WARN | engine cache miss falls through to build | `C10 engine cache miss: model=ultra_vpr; sm=87, jp=6.2, trt=10.3, fp16; rebuild` |
| INFO | Build start/end + report; verify_manifest pass | `C10 build complete: engines=4, descriptors=87654, manifest_hash=…; outcome=success` |
| DEBUG | per-tile descriptor batch progress | `C10 descriptor batch progress: 12345/87654 (14%)` |

**Log format**: structured JSON.
**Log storage**: stdout (operator tool); journald (companion verify); FDR via C13 (only for F2 verify_manifest events — provisioning is offline and goes to operator-facing logs, not flight FDR).

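
Structured JSON logging of this kind can be approximated with the standard `logging` module. The sketch below is illustrative only: the logger name `c10.sketch`, the `fields` key passed through `extra`, and the JSON key names are assumptions, since the project's actual log schema is not given here. Note that Python reports the WARN level as `WARNING`.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "component": "C10",
            "message": record.getMessage(),
        }
        # Structured key/value pairs arrive via `extra={"fields": {...}}`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(stream) -> logging.Logger:
    """Build a logger that writes JSON lines to `stream` (e.g. sys.stdout)."""
    logger = logging.getLogger("c10.sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.info("build complete", extra={"fields": {"engines": 4, "outcome": "success"}})
```
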
@@ -0,0 +1,151 @@
# Test Specification — C10 Pre-flight Cache Provisioning (engines + descriptors + manifest)

Component-scoped. Suite-level coverage in `_docs/02_document/tests/*.md`. C10 was narrowed in this Plan cycle: it builds model-derived artifacts (TensorRT engines, VPR descriptors, signed Manifest) from an **already-populated** C6 tile cache. Tile fetch is C11 `TileDownloader`'s concern.

## Acceptance Criteria Traceability

| AC ID | Acceptance Criterion (one-line) | Test IDs | Coverage |
|-------|---------------------------------|----------|----------|
| AC-8.3 | Imagery pre-loaded onto companion before flight (the manifest gate) | FT-P-15, FT-P-16, **C10-IT-01** | Covered |
| AC-NEW-1 | Cold-start TTFF <30 s p95 (pre-built engines required) | NFT-PERF-03, **C10-IT-02** | Covered |
| D-C10-1 | Manifest-hash idempotence on repeated build | **C10-IT-03** | Covered |
| D-C10-3 | Takeoff content-hash gate refuses mismatch | covered at C7-IT-03; **C10-IT-04** asserts the manifest signing path | Covered |
| D-C10-6 | Engine cache hardware-tied (SM 87 / JP 6.2 / TRT 10.3 / FP16) | C7-IT-04, **C10-IT-05** | Covered |
| D-C10-7 | Engine filename schema enforcement | covered at C7-IT-04 | Covered |

---
## Component-Internal Tests

### C10-IT-01: end-to-end build from a pre-populated C6

**Summary**: given a C6 tile cache populated by C11 `TileDownloader` (10 GB Derkachi area), C10 produces (a) all required TensorRT engines, (b) the FAISS HNSW index over VPR descriptors, (c) a signed Manifest, in under the operator-tooling time budget.

**Traces to**: AC-8.3, AC-NEW-1

**Description**: stage a C6 with the Derkachi corpus already populated; run `CacheProvisioner.build_artifacts`; assert (a) the engine set under `cache_artifacts/engines/` matches the configured model list, (b) `descriptor_index.faiss` is non-empty and queryable, (c) the Manifest is signed with the operator's signing key and content-hashes every artifact.

**Input data**: pre-populated C6 (`tests/fixtures/c6_populated_derkachi/`).

**Expected result**: all artifacts present + signed Manifest.

**Max execution time**: 12 min on Tier-1 (CPU TRT compile is slow; Tier-2 takes ~4 min and is the production path).

---
### C10-IT-02: ManifestVerifier refuses unsigned / wrong-signature Manifest

**Summary**: `ManifestVerifier.verify` rejects a Manifest whose signature doesn't validate against the operator's public key.

**Traces to**: AC-NEW-1, D-C10-3

**Description**: build a valid Manifest; copy it; tamper one byte; call `verify`; assert `ManifestSignatureError`. Repeat: copy + replace signature with one signed by an unauthorized key; assert `ManifestSignatureError`.

**Input data**: valid Manifest + 2 tampered copies.

**Expected result**: both tampered Manifests rejected.

**Max execution time**: 5 s.

---
### C10-IT-03: idempotence on repeated build

**Summary**: re-running `build_artifacts` against an unchanged C6 produces the same Manifest content-hash and skips already-built engines.

**Traces to**: D-C10-1

**Description**: run build once; record Manifest content-hash + engine compile timestamps. Re-run with no C6 changes; assert (a) Manifest content-hash unchanged, (b) engines reused (no recompile, asserted via timestamp comparison), (c) total wall-clock < 1 min on Tier-1.

**Input data**: as C10-IT-01.

**Expected result**: idempotent — same hash, no recompile.

**Max execution time**: 90 s (second-run only).

---
### C10-IT-04: Manifest covers every shipped artifact

**Summary**: the Manifest's content-hash table includes every file under `cache_artifacts/`; an artifact present on disk but missing from the Manifest is a build failure.

**Traces to**: D-C10-3 (no smuggled artifacts can pass the takeoff gate)

**Description**: after a successful build, plant an extra file in `cache_artifacts/`; re-run `build_artifacts` (or call the build's post-step audit hook); assert build refuses to sign — output `ManifestCoverageError` listing the orphan file.

**Input data**: as C10-IT-01 plus an extra file.

**Expected result**: build fails with `ManifestCoverageError`.

**Max execution time**: 60 s.

---
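
The coverage audit exercised by C10-IT-04 could be sketched like this. It assumes the Manifest's content-hash table can be reduced to a set of POSIX-style paths relative to `cache_artifacts/`. The function name `audit_coverage` and the `orphans` attribute on the exception are hypothetical, and how sidecars and the Manifest file itself are listed is left open.

```python
from pathlib import Path


class ManifestCoverageError(Exception):
    """Raised when files exist on disk that the Manifest does not list."""

    def __init__(self, orphans: list[str]):
        super().__init__("artifacts missing from Manifest: " + ", ".join(orphans))
        self.orphans = orphans


def audit_coverage(artifact_root: Path, manifest_entries: set[str]) -> None:
    """Fail if any file under `artifact_root` is absent from `manifest_entries`."""
    on_disk = {
        path.relative_to(artifact_root).as_posix()
        for path in artifact_root.rglob("*")
        if path.is_file()
    }
    orphans = sorted(on_disk - manifest_entries)
    if orphans:
        raise ManifestCoverageError(orphans)
```

In a test, planting one stray file after a clean audit and asserting that `orphans` contains exactly that path mirrors the procedure described for C10-IT-04.
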
### C10-IT-05: Tier-2 hardware-tied engine compile produces SM-87 / JP-6.2 / TRT-10.3 binary

**Summary**: when run on the bench Jetson, C10 produces engines whose internal TRT metadata reports `SM=87, JetPack=6.2, TRT=10.3, precision=FP16`.

**Traces to**: D-C10-6

**Description**: run `build_artifacts` on the bench Jetson; for each engine, parse the internal TRT version footer; assert the quadruple matches.

**Input data**: bench Jetson + Derkachi C6 fixture.

**Expected result**: all engines tagged correctly.

**Max execution time**: 6 min on Tier-2.

---
## Performance Tests

### C10-PT-01: build wall-clock budget on Tier-1 (operator-tooling laptop)

**Traces to**: operator-tooling UX (no AC trace; an operator-tooling SLO)

**Load scenario**: full Derkachi corpus (10 GB, ~87 654 tiles).

**Expected results**:

| Metric | Target | Failure Threshold |
|--------|--------|-------------------|
| Cold build wall-clock | ≤ 12 min on a developer laptop with NVIDIA GPU | 25 min |
| Warm idempotent re-run | ≤ 1 min | 3 min |

---
## Security Tests

### C10-ST-01: signing-key path uses operator-controlled key (not a baked-in dev key)

**Summary**: the build refuses to sign the Manifest if the configured signing-key path points to the baked-in dev key (caught via a hash-list check).

**Traces to**: defensive (production-key safety)

**Test procedure**:
1. Configure C10 with the dev-key path that's hard-coded into the dev fixtures.
2. Run `build_artifacts`.
3. Assert refusal with `OperatorKeyRequiredError`.

**Pass criteria**: refusal.
**Fail criteria**: build succeeds with the dev key.

---
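
The hash-list check that C10-ST-01 relies on might look like the sketch below. Everything apart from the `OperatorKeyRequiredError` name is an assumption: the helper name `load_signing_key`, the use of SHA-256 over the raw key file bytes, and passing the deny-list in as a `frozenset` are illustrative choices, not the documented mechanism.

```python
import hashlib
from pathlib import Path


class OperatorKeyRequiredError(Exception):
    """Raised when the configured signing key is a known development key."""


def load_signing_key(key_path: Path, dev_key_digests: frozenset[str]) -> bytes:
    """Return the key bytes unless their SHA-256 digest is on the dev-key list."""
    key_bytes = key_path.read_bytes()
    if hashlib.sha256(key_bytes).hexdigest() in dev_key_digests:
        raise OperatorKeyRequiredError(
            f"{key_path} matches a baked-in dev key; configure an operator key"
        )
    return key_bytes
```

Matching on the key's content rather than its path means the check still fires if the dev key is copied to a different location.
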
## Acceptance Tests

Covered transitively via FT-P-15 / FT-P-16 (operator workflow tests).

---

## Test Data Management

| Data Set | Source | Size |
|----------|--------|------|
| `tests/fixtures/c6_populated_derkachi/` | C11 `TileDownloader` build artifact | ~10 GB on disk |
| Operator signing key (test-only) | generated per test run | <1 KB |
| Dev key (for the negative test) | curated, in-repo | <1 KB |

**Setup**: C11 `TileDownloader` integration test (under C11) populates C6 once; that artifact is reused.
**Teardown**: per-test temp dirs for `cache_artifacts/` build outputs.
**Data isolation**: per-test temp `cache_artifacts/`.