mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 07:51:28 +00:00
[AZ-322] C10 DescriptorBatcher (faiss-cpu, OOM halve-retry)
Implements the C10 internal phase that walks every C6 tile, embeds through C2's backbone via the AZ-321-produced engine, and rebuilds the AZ-306 FAISS HNSW index in one atomic write.

- DescriptorBatcher with halve-and-retry OOM recovery (default 1 retry)
- BackboneEmbedder Protocol + C7EngineBackboneEmbedder default impl
- DescriptorBatchError for OOM / dim-mismatch / missing-output failures
- Empty-corpus surfaces as outcome=failure with explicit hint to run C11
- Per-10% progress callback + DEBUG logs (no engine bytes leaked)
- Consumer-side Protocol cuts (TilesByBboxBatchQuery, TilePixelOpener, DescriptorIndexRebuilder) so c10 stays within AZ-270 lint
- runtime_root.c10_factory adds build_descriptor_batcher + three C6->C10 adapters
- 16 unit tests covering AC-1..AC-10 + 2 NFRs + 4 supplemental (Protocol conformance, query pass-through, handle release, config)

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -100,6 +100,24 @@ C10 reads `tiles` rows from C6 (scoped to the build's bbox + zoom_levels), write
| atomicwrites | latest | Atomic file replacement for `.index` + Manifest (D-C10-3) |
| hashlib (stdlib) | stdlib | SHA-256 content-hash sidecars |
| PyYAML / orjson | per project pin | Manifest serialization |
| numpy | per project pin | Descriptor batch ndarray container (AZ-322 `DescriptorBatcher`) |

**AZ-322 internal phase: `DescriptorBatcher`**

The `populate_descriptors` phase walks every tile in C6 for the requested
`(bbox, zoom_levels, sector_class)`, embeds them through C7's `InferenceRuntime`
(via `C7EngineBackboneEmbedder`, the default `BackboneEmbedder` impl), and
hands the resulting `(N, descriptor_dim)` ndarray to AZ-306's
`DescriptorIndex.rebuild_from_descriptors` for an atomic FAISS index write.
CUDA OOM is handled via halve-and-retry bounded by `C10BatcherConfig.max_oom_retries`
(default 1: 64 → 32, then succeed-or-fail-fast) so a real GPU regression
surfaces in seconds instead of being masked by silent retries. Per-10% progress is
emitted both as DEBUG logs (`c10.descriptor.progress`) and via an optional
`progress_callback` so operator tooling can wire a TTY/GUI bar without
touching the batcher itself. The descriptor int64 id formula is the
canonical AZ-306 scheme (`int.from_bytes(sha256("zoom|lat|lon").first8, "big", signed=True)`).
The batcher does not recompute it: a formula invented locally, even to avoid
a dependency back into C6 internals, would break AC-6.

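The halve-and-retry policy described above can be sketched roughly as follows. This is a minimal illustration, not the `DescriptorBatcher` code: `FakeOutOfMemory`, `embed_with_halving`, and the choice to share one retry budget across the whole run are assumptions made for the example, and the real batcher may scope retries differently.

```python
from typing import Callable, List, Sequence


class FakeOutOfMemory(RuntimeError):
    """Hypothetical stand-in for whatever the embedder raises on CUDA OOM."""


def embed_with_halving(
    items: Sequence[int],
    embed_batch: Callable[[Sequence[int]], List[int]],
    batch_size: int = 64,
    max_oom_retries: int = 1,
) -> List[int]:
    """Embed all items; on OOM halve the batch size, at most max_oom_retries times."""
    out: List[int] = []
    retries_left = max_oom_retries  # assumption: one budget for the whole run
    start = 0
    while start < len(items):
        chunk = items[start : start + batch_size]
        try:
            out.extend(embed_batch(chunk))
        except FakeOutOfMemory:
            if retries_left == 0 or batch_size == 1:
                raise  # fail fast so a real regression stays visible
            retries_left -= 1
            batch_size //= 2
            continue  # retry the same offset with the smaller batch
        start += len(chunk)
    return out
```

With the defaults, an embedder that cannot fit 64 items gets exactly one second chance at 32; a second OOM propagates to the caller instead of shrinking the batch further.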
**Error Handling Strategy**:
@@ -209,14 +209,16 @@ Bootstrap reference: `_docs/02_tasks/todo/AZ-263_initial_structure.md`. Architec
 - **Epic**: AZ-252 (E-C10 Cache Provisioner)
 - **Directory**: `src/gps_denied_onboard/components/c10_provisioning/`
 - **Public API**:
-  - `__init__.py` (re-exports `CacheProvisioner`, `Manifest`, `EngineCacheEntry`, plus AZ-321 surface: `EngineCompiler`, `BackboneSpec`, `EngineCompileRequest`, `EngineCompileResult`, `CompileOutcome`, `EngineCompileSummary`, `CompileEngineCallable`, `BackboneConfig`, `C10ProvisioningConfig`)
-  - `interface.py` (`CacheProvisioner` Protocol)
+  - `__init__.py` (re-exports `CacheProvisioner`, `Manifest`, `EngineCacheEntry`, plus AZ-321 surface: `EngineCompiler`, `BackboneSpec`, `EngineCompileRequest`, `EngineCompileResult`, `CompileOutcome`, `EngineCompileSummary`, `CompileEngineCallable`, `BackboneConfig`, `C10ProvisioningConfig`, plus AZ-322 surface: `DescriptorBatcher`, `BackboneEmbedder`, `C7EngineBackboneEmbedder`, `C10BatcherConfig`, `CorpusFilter`, `DescriptorBatchReport`, `ProgressEvent`, `TileBboxRecord`, `BatcherTile`, `TilesByBboxBatchQuery`, `TilePixelOpener`, `DescriptorIndexRebuilder`, `DescriptorBatchError`)
+  - `interface.py` (`CacheProvisioner` Protocol, `BackboneEmbedder` Protocol from AZ-322)
   - Config block: `C10ProvisioningConfig` (registered on import)
 - **Internal**:
   - `engine_compiler.py` (AZ-321; per-model TRT compile + hardware-tied cache reuse + `CompileEngineCallable` structural cut of the C7 InferenceRuntime)
   - `config.py` (AZ-321; `BackboneConfig` + `C10ProvisioningConfig` dataclasses)
+  - `descriptor_batcher.py` (AZ-322; `DescriptorBatcher` + DTOs + consumer-side Protocols `TilesByBboxBatchQuery` / `TilePixelOpener` / `DescriptorIndexRebuilder`)
+  - `c7_engine_embedder.py` (AZ-322; `C7EngineBackboneEmbedder` adapter wrapping AZ-297 `InferenceRuntime` + AZ-321 engine path)
   - `default_provisioner.py` (engine compile + descriptors + manifest + content-hash gate, pending)
-  - Composition root: `runtime_root/c10_factory.py` (`build_engine_compiler`, `build_backbone_specs`)
+  - Composition root: `runtime_root/c10_factory.py` (`build_engine_compiler`, `build_backbone_specs`, `build_manifest_builder`, `build_manifest_verifier`, `build_descriptor_batcher` + the C6→C10 adapters `c6_tile_metadata_store_to_tiles_batch_query`, `c6_tile_store_to_pixel_opener`, `c6_descriptor_index_to_rebuilder`)
 - **Owns**: `src/gps_denied_onboard/components/c10_provisioning/**`, `tests/unit/c10_provisioning/**`
 - **Imports from**: `_types` (cross-component DTOs `EngineCacheEntry`, `BuildConfig`, `PrecisionMode`, `OptimizationProfile`, `HostCapabilities`, `TileMetadata`, etc.), `_types.inference_errors` (AZ-507 typed-error envelope for `EngineBuildError` + `CalibrationCacheError`), `helpers.sha256_sidecar`, `helpers.engine_filename_schema`, `helpers.wgs_converter`, `config`, `logging`, `fdr_client`. The `InferenceRuntime.compile_engine` surface (c7) and the `TileMetadataStore.query_by_bbox` surface (c6) are obtained via constructor-injected consumer-side structural Protocol cuts (the `CompileEngineCallable` cut already lives in `engine_compiler.py`; AZ-323 / AZ-324 will define analogous `query_by_bbox` cuts inside `c10_provisioning/`). NEVER `from gps_denied_onboard.components.c6_tile_cache import ...` or `from gps_denied_onboard.components.c7_inference import ...` inside `c10_provisioning/*.py`.
 - **Consumed by**: `c12_operator_tooling`, `runtime_root` (operator binary only; excluded from airborne via `BUILD_C10_PROVISIONING=OFF` for airborne build per ADR-002)

@@ -0,0 +1,141 @@
# Batch 36: Cycle 1 Report

**Date**: 2026-05-13
**Batch**: 36 (single task, direct AZ-306 follow-up)
**Tasks**: AZ-322 (C10 Descriptor Batcher, 3pt)
**Status**: complete; AZ-322 transitioned to "In Testing" pending operator review.

## Scope

AZ-322 implements `DescriptorBatcher`, the C10 phase that walks every C6 tile in the requested
`(bbox, zoom_levels, sector_class)`, embeds it through C2's VPR backbone (via the C7 engine produced
by AZ-321), and rebuilds the AZ-306 FAISS HNSW index in one atomic write.

This unblocks the airborne C2 VPR step's takeoff verify (AC-NEW-1) and makes the C10-PT-01
cold-build budget observable end-to-end.

## Architectural Decisions

### 1. Consumer-side Protocol cuts (AZ-270 / AZ-507 compliance)

The AZ-322 task spec listed direct C6 types (`TileMetadataStore`, `TileStore`, `DescriptorIndex`)
in the `DescriptorBatcher.__init__` signature. That contradicts AZ-270 (no cross-component
imports inside `components/*`) and the AZ-507 cross-component contract surface rule. The
established precedent (AZ-323's `ManifestBuilder` and AZ-324's `ManifestVerifierImpl`) declares
**consumer-side structural Protocol cuts** locally inside the C10 module and lets the composition
root (`runtime_root.c10_factory`) wire C6's concrete strategies in via thin adapters.

This batch follows that precedent with four local-to-C10 Protocols, three declared in
`descriptor_batcher.py` and one in `interface.py`:

- `BackboneEmbedder` (lifted to `interface.py` for re-use by future tasks)
- `TilesByBboxBatchQuery`: narrower than C6's `TileMetadataStore.query_by_bbox`, accepts
  `tuple[int, ...]` of zooms instead of a single zoom
- `TilePixelOpener`: narrower than C6's `TileStore.read_tile_pixels(TileId)`; takes
  `(zoom, lat, lon)` and returns a context manager
- `DescriptorIndexRebuilder`: narrower than C6's
  `DescriptorIndex.rebuild_from_descriptors(descriptors, tile_ids: list[TileId], hnsw_params: HnswParams)`;
  takes `tile_records: list[TileBboxRecord]` plus individual HNSW kwargs

The matching adapters live in `runtime_root/c10_factory.py`:

- `c6_tile_metadata_store_to_tiles_batch_query`: loops over `zoom_levels`, projects `TileMetadata`
  rows down to the four-field `TileBboxRecord`
- `c6_tile_store_to_pixel_opener`: builds `TileId` and returns the C6 `TilePixelHandle` (already
  a context manager)
- `c6_descriptor_index_to_rebuilder`: projects `TileBboxRecord` → `TileId` and folds HNSW kwargs
  into `HnswParams`

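To make the cut-plus-adapter pattern concrete, here is a minimal sketch of a multi-zoom query Protocol and the adapter that loops over zoom levels. The method name `query`, the four `TileBboxRecord` field names, the dict-shaped rows, and `FakeC6MetadataStore` are all assumptions for illustration; the source only states that the record has four fields and that the adapter loops over `zoom_levels`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass(frozen=True)
class TileBboxRecord:
    # Field names are hypothetical; the report only says "four-field".
    zoom: int
    lat: float
    lon: float
    sector_class: str


@runtime_checkable
class TilesByBboxBatchQuery(Protocol):
    """Consumer-side cut: many zooms per call (method name is assumed)."""

    def query(self, bbox: Any, zoom_levels: tuple[int, ...]) -> list[TileBboxRecord]: ...


class FakeC6MetadataStore:
    """Stand-in for the C6 store: answers one zoom at a time, ignores bbox."""

    def __init__(self, rows: list[dict]) -> None:
        self._rows = rows

    def query_by_bbox(self, bbox: Any, zoom: int) -> list[dict]:
        return [row for row in self._rows if row["zoom"] == zoom]


class MetadataStoreToBatchQuery:
    """Adapter living in the composition root, outside the C10 package."""

    def __init__(self, store: Any) -> None:
        self._store = store

    def query(self, bbox: Any, zoom_levels: tuple[int, ...]) -> list[TileBboxRecord]:
        records: list[TileBboxRecord] = []
        for zoom in zoom_levels:  # one single-zoom call per requested level
            for row in self._store.query_by_bbox(bbox, zoom):
                records.append(
                    TileBboxRecord(row["zoom"], row["lat"], row["lon"], row["sector_class"])
                )
        return records
```

Because the Protocol is structural, the C10 side never imports the store's type; only the composition root sees both halves.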
### 2. `C7EngineBackboneEmbedder` adapter: `Any`-typed at the c7 boundary

The default `BackboneEmbedder` impl wraps an AZ-297 `InferenceRuntime` + an AZ-321-compiled
`EngineHandle`. Importing those types, even under `TYPE_CHECKING`, fails the AZ-270 AST lint
because the lint walks `ast.ImportFrom` nodes regardless of context. We therefore type the
constructor parameters as `Any` and rely on structural duck-typing
(`inference_runtime.infer(handle, dict) -> dict`). The composition root wires the concrete C7
runtime in.

||||
### 3. JPEG → tensor preprocessing is injected, not owned
|
||||
|
||||
`C7EngineBackboneEmbedder` accepts a `tile_decoder: Callable[[Any], np.ndarray]` rather than
|
||||
hard-wiring OpenCV / Pillow / torchvision. Image preprocessing belongs to E-C2 (AZ-255); when
|
||||
it ships, the composition root injects a real decoder. Until then the adapter stays free of
|
||||
imaging-stack dependencies, keeping AZ-322's surface narrow and the test surface tiny.
|
||||
|
||||
### 4. Descriptor int64 id formula — reuse AZ-306, do not invent
|
||||
|
||||
`DescriptorBatcher` does NOT recompute the int64 id formula. It hands `TileBboxRecord` rows to
|
||||
the rebuilder; the rebuilder adapter projects to `TileId`; AZ-306's
|
||||
`FaissDescriptorIndex.rebuild_from_descriptors` uses the canonical
|
||||
`tile_id_to_int64(TileId)` helper. Test `test_ac6_descriptor_id_mapping_matches_az306_scheme`
|
||||
confirms by importing `tile_id_to_int64` directly and asserting against the
|
||||
`int.from_bytes(sha256("zoom|lat|lon").first8, "big", signed=True)` formula.
|
||||
|
## Files Changed

### Production code (new)

- `src/gps_denied_onboard/components/c10_provisioning/descriptor_batcher.py`: `DescriptorBatcher`
  class + `BatcherTile`, `TileBboxRecord`, `CorpusFilter`, `ProgressEvent`, `DescriptorBatchReport`,
  `BatcherOutcome`, `C10BatcherConfig` DTOs + `TilesByBboxBatchQuery`, `TilePixelOpener`,
  `DescriptorIndexRebuilder` consumer Protocols.
- `src/gps_denied_onboard/components/c10_provisioning/c7_engine_embedder.py`:
  `C7EngineBackboneEmbedder` adapter wrapping the AZ-297 `InferenceRuntime` surface; `Any`-typed
  to stay below the AZ-270 boundary.

### Production code (modified)

- `src/gps_denied_onboard/components/c10_provisioning/interface.py`: added `BackboneEmbedder`
  Protocol (`embed_batch` + `descriptor_dim`), `runtime_checkable`.
- `src/gps_denied_onboard/components/c10_provisioning/errors.py`: added `DescriptorBatchError`
  exception class extending `C10ProvisioningError`.
- `src/gps_denied_onboard/components/c10_provisioning/__init__.py`: re-exported all new symbols.
- `src/gps_denied_onboard/runtime_root/c10_factory.py`: added `build_descriptor_batcher` plus
  the three C6→C10 adapter functions.

### Tests (new)

- `tests/unit/c10_provisioning/test_descriptor_batcher.py`: 16 tests covering AC-1 through
  AC-10 + NFR-perf-overhead + NFR-reliability-bounded-retry, plus 4 supplemental tests
  (`Protocol` runtime-check for the four consumer cuts, query-args pass-through, handle
  release on embed failure, config validation).

### Documentation

- `_docs/02_document/module-layout.md`: c10 Public API + Internal section updated to list the
  AZ-322 surface; composition root section lists the new factory + adapters.
- `_docs/02_document/components/11_c10_provisioning/description.md`: §5 dependency table picks
  up `numpy`; new "AZ-322 internal phase" subsection summarises the batcher's
  contract / OOM behaviour / progress reporting / id formula.

## Test Results

- 16 / 16 AZ-322 tests pass (`tests/unit/c10_provisioning/test_descriptor_batcher.py`).
- 197 / 197 c10 + c6 + runtime-root targeted runs pass (59 docker-skip).
- Full project suite: **1352 passed, 79 skipped, 1 failed**.
  - 79 skipped: docker / Jetson / CUDA / actionlint env-gated (Tier-0 dev host).
  - 1 failed: `tests/unit/test_ac1_scaffold_layout.py::test_cmake_files_configure`,
    a pre-existing OKVIS2 git-submodule failure documented in the batch_35 cycle report; unrelated
    to this batch.

## Decisions Ledger

| Decision | Rationale |
|----------|-----------|
| `DescriptorBatcher.__init__` takes consumer-side Protocols, not raw C6 types | AZ-270 lint blocks direct cross-component imports; AZ-323 / AZ-324 set the precedent |
| `C7EngineBackboneEmbedder` parameters are `Any`-typed | AZ-270 AST lint flags `TYPE_CHECKING` imports too; structural duck-typing avoids the boundary |
| `tile_decoder` is injected, not bundled | JPEG preprocessing belongs to E-C2 (AZ-255); keeping it out of AZ-322 narrows scope and dependencies |
| Default `C10BatcherConfig.max_oom_retries=1` | Spec NFR-reliability-bounded-retry; one halve from 64 → 32 is the standard surface, deeper retries mask GPU regressions |
| Reuse AZ-306's `tile_id_to_int64` | Spec AC-6; inventing the formula here would diverge from C6's id scheme |
| Atomic FAISS rebuild guaranteed by AZ-306, not duplicated here | Spec AC-7; the batcher's role is to call `rebuild_from_descriptors` exactly once |

## Notes

- The `C7EngineBackboneEmbedder` is the default `BackboneEmbedder` impl, but production wiring
  to a real C7 engine awaits AZ-326 (T5 orchestrator) and AZ-255 (real C2 backbone preprocessing).
  The adapter is unit-tested via fakes today; integration tests land with AZ-326.
- `C10BatcherConfig` currently has no dedicated config-block hook in
  `C10ProvisioningConfig`; `build_descriptor_batcher` uses defaults. AZ-326 will add the
  config-block plumbing.
- The OKVIS2 cmake submodule failure remains and is independent of every batch-35 / batch-36
  change. It will resolve when the project's submodules are initialised on the dev host.

@@ -6,11 +6,11 @@ step: 7
 name: Implement
 status: in_progress
 sub_step:
-  phase: 3
-  name: compute-next-batch
-  detail: "batch 35 complete (AZ-306 5pt; faiss-cpu PyPI strategy chosen over custom pybind11 wrapper); awaiting next batch selection"
+  phase: 4
+  name: batch-complete
+  detail: "batch 36 complete: AZ-322 implemented + tests + factory wiring; ready to chain to next batch"
 retry_count: 0
 cycle: 1
 tracker: jira
-last_completed_batch: 35
+last_completed_batch: 36
 last_cumulative_review: batches_31-33