[AZ-322] C10 DescriptorBatcher (faiss-cpu, OOM halve-retry)

Implements the C10 internal phase that walks every C6 tile, embeds
through C2's backbone via the AZ-321-produced engine, and rebuilds
the AZ-306 FAISS HNSW index in one atomic write.

- DescriptorBatcher with halve-and-retry OOM recovery (default 1 retry)
- BackboneEmbedder Protocol + C7EngineBackboneEmbedder default impl
- DescriptorBatchError for OOM / dim-mismatch / missing-output failures
- Empty-corpus surfaces as outcome=failure with explicit hint to run C11
- Per-10% progress callback + DEBUG logs (no engine bytes leaked)
- Consumer-side Protocol cuts (TilesByBboxBatchQuery, TilePixelOpener,
  DescriptorIndexRebuilder) so c10 stays within AZ-270 lint
- runtime_root.c10_factory adds build_descriptor_batcher + three
  C6->C10 adapters
- 16 unit tests covering AC-1..AC-10 + 2 NFRs + 4 supplemental
  (Protocol conformance, query pass-through, handle release, config)

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-13 04:20:47 +03:00
parent 3b7265757b
commit f01a5058ab
12 changed files with 1733 additions and 10 deletions
@@ -0,0 +1,208 @@
# C10 Descriptor Batcher — Embed Corpus Tiles via C2 Backbone + Write FAISS
**Task**: AZ-322_c10_descriptor_batcher
**Name**: C10 Descriptor Batcher
**Description**: Implement `DescriptorBatcher`, the C10-internal phase that walks every tile in C6 for the requested `(bbox, zoom_levels)`, runs them through C2's VPR backbone (via the C7 engine produced by AZ-321) in batches sized for the operator workstation's GPU, collects the resulting fixed-dimension descriptors, and rebuilds the FAISS HNSW index via AZ-303's `DescriptorIndex.rebuild_from_descriptors`. Handles CUDA OOM with halve-and-retry; surfaces per-batch progress via DEBUG logs and a callback. Returns a `DescriptorBatchReport` with `descriptors_generated`, `tiles_consumed`, `oom_retries`, `elapsed_s`. Defines a thin C10-internal `BackboneEmbedder` Protocol with one method `embed_batch(tile_pixels: list[TilePixelHandle]) -> ndarray`; the concrete impl is supplied by E-C2 (AZ-255) later via a thin adapter, OR a direct call into the AZ-321-produced engine if E-C2 ships a public embed API by then.
**Complexity**: 3 points
**Dependencies**: AZ-263_initial_structure, AZ-269_config_loader, AZ-266_log_module, AZ-303_c6_storage_interfaces, AZ-306_c6_faiss_descriptor_index, AZ-321_c10_engine_compiler
**Component**: c10_provisioning (epic AZ-252 / E-C10)
**Tracker**: AZ-322
**Epic**: AZ-252 (E-C10)
### Document Dependencies
- `_docs/02_document/contracts/c6_tile_cache/tile_metadata_store.md`: `query_by_bbox` (read tile list) and `tile_store.read_tile_pixels` (read tile bytes via mmap handle).
- `_docs/02_document/contracts/c6_tile_cache/descriptor_index.md`: `rebuild_from_descriptors` (atomic write target).
- `_docs/02_document/components/11_c10_provisioning/description.md`: § 5 `DescriptorBatchError` handling; § 7 GPU-bound bottleneck.
## Problem
Without a real descriptor batcher:
- AC-NEW-1's takeoff verify has no FAISS index to verify; the airborne C2 VPR step returns empty top-k.
- AC-8.1 collapses partially — even with imagery in C6, the airborne system cannot localize without descriptors.
- The C10-PT-01 cold-build budget (≤ 12 min) is unobservable; the descriptor phase is the dominant cost on Jetson.
- D-C10-3's "every artifact in Manifest" requirement (AC-NEW-1) cannot list `.index` artifacts that don't exist.
- CUDA OOM during build is the most common failure mode operators hit per the description.md § 5; without a structured halve-and-retry, every OOM is a manual restart.
- Per-batch progress is invisible — operators staring at a `c10 build` command for 8+ minutes see nothing without DEBUG logs they don't enable.
This task delivers the embed-and-write phase. It does NOT compile engines (AZ-321) or write the Manifest (T3) or orchestrate idempotence (T5).
## Outcome
- A `DescriptorBatcher` class at `src/gps_denied_onboard/components/c10_provisioning/descriptor_batcher.py`:
  - Constructor: `__init__(self, *, backbone_embedder: BackboneEmbedder, tile_metadata_store: TileMetadataStore, tile_store: TileStore, descriptor_index: DescriptorIndex, logger: Logger, clock: Clock, config: C10BatcherConfig)`.
  - `C10BatcherConfig` (`@dataclass(frozen=True)`): `initial_batch_size: int = 64`, `max_oom_retries: int = 1`, `progress_callback: Callable[[ProgressEvent], None] | None = None`.
  - Public method: `populate_descriptors(corpus_filter: CorpusFilter) -> DescriptorBatchReport`.
  - `CorpusFilter` (`@dataclass(frozen=True)`): `bbox: Bbox`, `zoom_levels: tuple[int, ...]`, `sector_class: SectorClassification`.
  - `DescriptorBatchReport` (`@dataclass(frozen=True)`): `descriptors_generated: int`, `tiles_consumed: int`, `oom_retries: int`, `elapsed_s: float`, `outcome: enum {success, failure}`, `failure_reason: str | None`.
- A `BackboneEmbedder` Protocol at `src/gps_denied_onboard/components/c10_provisioning/interface.py`:
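```python
from __future__ import annotations  # TilePixelHandle (AZ-303) is resolved lazily
from typing import Protocol, runtime_checkable
import numpy as np

@runtime_checkable
class BackboneEmbedder(Protocol):
    def embed_batch(self, tiles: list[TilePixelHandle]) -> np.ndarray: ...
    def descriptor_dim(self) -> int: ...
```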
- Method flow:
  1. Call `tile_metadata_store.query_by_bbox(bbox=corpus_filter.bbox, zoom_levels=corpus_filter.zoom_levels, sector_class=corpus_filter.sector_class)` → list of `TileMetadata` rows. If empty → return `DescriptorBatchReport(outcome=failure, failure_reason="no tiles in C6 for the requested scope; run C11 TileDownloader first")` per description.md § 5.
  2. Open every tile via `tile_store.read_tile_pixels(tile_id)` lazily (context manager; release after each batch).
  3. Walk tiles in batches of `current_batch_size` (initially `config.initial_batch_size`):
     - Call `backbone_embedder.embed_batch(tile_pixel_handles)` → `np.ndarray` of shape `(batch_size, descriptor_dim)`.
     - On `DescriptorBatchError("CUDA OOM")`:
       - If `oom_retries < config.max_oom_retries` AND `current_batch_size > 1`: halve `current_batch_size`, increment `oom_retries`, re-run THIS batch with the smaller size.
       - Else: raise `DescriptorBatchError` with full context (batch index, tile ids, current batch size).
     - Append the descriptors to a running buffer; record `(tile_id, descriptor_row_index)` mapping.
     - Emit a `ProgressEvent(tiles_done, tiles_total, current_batch_size, elapsed_s)` via `config.progress_callback`, if set, each time cumulative progress crosses a 10% boundary (see AC-5).
     - Emit DEBUG log every 10% progress (`c10.descriptor.progress`).
  4. After all tiles consumed:
     - Construct the descriptor `np.ndarray` of shape `(tiles_consumed, descriptor_dim)`.
     - Construct the int64 id mapping per AZ-306's documented scheme (`int64(sha256(zoom|lat|lon|source).first8bytes)`).
     - Call `descriptor_index.rebuild_from_descriptors(descriptors, ids, hnsw_params)`; this writes the `.index` file atomically via AZ-280.
     - Return `DescriptorBatchReport(outcome=success, descriptors_generated=tiles_consumed, ...)`.
- Composition root constructs `DescriptorBatcher` with a `BackboneEmbedder` impl. Initially this is a thin `C7EngineBackboneEmbedder` that wraps `inference_runtime.run_engine(engine_path, batch)`; when E-C2 (AZ-255) ships, an adapter wires C2's public embed surface in (one-line factory swap).
- INFO log on session start (with batch counts); DEBUG on per-10% progress; WARN on every OOM retry; ERROR on terminal `DescriptorBatchError`.
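The halve-and-retry rule in the method flow can be sketched as below. This is a minimal, stdlib-only illustration under stated assumptions, not the shipped `DescriptorBatcher`: tiles are plain strings, descriptors are lists of floats instead of `np.ndarray`, and `embed_with_halve_retry` and `BatchRunResult` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class DescriptorBatchError(RuntimeError):
    """Stand-in for the spec's error type (the real class lives in c10_provisioning)."""


@dataclass(frozen=True)
class BatchRunResult:  # hypothetical helper, not the spec's DescriptorBatchReport
    rows: list
    oom_retries: int
    final_batch_size: int


def embed_with_halve_retry(
    tiles: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list],
    initial_batch_size: int = 64,
    max_oom_retries: int = 1,
) -> BatchRunResult:
    """Walk tiles in batches; on CUDA OOM halve the batch size and re-run the same batch."""
    rows: list = []
    batch_size = initial_batch_size
    oom_retries = 0
    start = 0
    while start < len(tiles):
        batch = tiles[start:start + batch_size]
        try:
            out = embed_batch(batch)
        except DescriptorBatchError as exc:
            if "CUDA OOM" not in str(exc):
                raise  # other failures (e.g. dim mismatch) are not retried
            if oom_retries < max_oom_retries and batch_size > 1:
                batch_size //= 2
                oom_retries += 1
                continue  # same start offset, smaller batch
            raise DescriptorBatchError(
                f"CUDA OOM not recoverable: start={start}, "
                f"batch_size={batch_size}, tile_ids={list(batch)}"
            ) from exc
        rows.extend(out)
        start += len(batch)
    return BatchRunResult(rows, oom_retries, batch_size)
```

Note that the retry budget is global to the run, not per batch, and the smaller batch size persists for all subsequent batches, which matches AC-2.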
## Scope
### Included
- `DescriptorBatcher` class with the single public method.
- `BackboneEmbedder` Protocol declaration + a default `C7EngineBackboneEmbedder` adapter that wraps the AZ-298 inference runtime + the AZ-321-produced engine path.
- `CorpusFilter`, `DescriptorBatchReport`, `ProgressEvent`, `C10BatcherConfig` DTOs.
- CUDA OOM halve-and-retry logic.
- Atomic FAISS index rebuild via AZ-303/306's Protocol.
- Progress callback + DEBUG log emission.
- Composition-root factory `build_descriptor_batcher`.
- Conformance test for `BackboneEmbedder` Protocol.
### Excluded
- The actual C2 VPR backbone — owned by E-C2 (AZ-255).
- TensorRT engine compile — owned by AZ-321 (the engine the embedder runs).
- Manifest writing — owned by T3.
- Tile download — owned by E-C11 (AZ-316).
- HNSW parameter selection — `hnsw_params` is config-driven; the orchestrator T5 supplies them. The batcher does NOT pick them.
- Multi-GPU / batched-across-GPUs — operator workstation is single-GPU per RESTRICT-OPS-2.
- Resumability mid-batch — if the process is killed at batch N of M, the next run starts from batch 0; descriptors are only written in one shot via `rebuild_from_descriptors` (atomic). Documented constraint.
## Acceptance Criteria
**AC-1: Happy path embeds all tiles and rebuilds index**
Given C6 contains 1000 tiles for the requested bbox + zoom_levels
When `populate_descriptors(filter)` is called
Then `embed_batch` is called `ceil(1000 / 64) = 16` times; the final descriptor array has shape `(1000, descriptor_dim)`; `descriptor_index.rebuild_from_descriptors` is called ONCE with this array; report shows `descriptors_generated=1000, tiles_consumed=1000, oom_retries=0, outcome=success`
**AC-2: CUDA OOM halves batch size and retries**
Given `embed_batch` raises `DescriptorBatchError("CUDA OOM")` on the first call with batch_size=64
When the batcher catches the OOM
Then `embed_batch` is called again with batch_size=32 (halved); `oom_retries` becomes 1; if 32 succeeds, the run continues with batch_size=32 for subsequent batches; ONE WARN log `c10.descriptor.oom.retry`
**AC-3: Persistent OOM after halve-retry exhausted raises**
Given `embed_batch` raises `DescriptorBatchError("CUDA OOM")` on every call regardless of batch size, and `max_oom_retries=1`
When the batcher exhausts retries
Then `DescriptorBatchError` is raised with the final batch_size (32 when starting from the default 64) + tile_ids context; ZERO `rebuild_from_descriptors` calls; ONE ERROR log
**AC-4: Empty corpus surfaces as failure with explicit hint**
Given C6 has zero tiles for the requested scope
When `populate_descriptors(filter)` is called
Then `outcome=failure`, `failure_reason="no tiles in C6 for the requested scope; run C11 TileDownloader first"`; ZERO `embed_batch` calls; ONE ERROR log directing the operator to run C11
**AC-5: Progress callback fires every 10%**
Given a 1000-tile corpus and a callback spy
When `populate_descriptors(filter)` is called
Then the callback fires at 10%, 20%, ..., 100% (10 times); each event carries `tiles_done`, `tiles_total=1000`, `current_batch_size`, `elapsed_s`
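One way to decide when a 10% event is due is to compare the decile reached before and after each batch. The sketch below is an illustration only; `crossed_deciles` is a hypothetical helper, and emitting one event per crossed mark (rather than one per batch) is an assumption the spec does not settle for corpora small enough that a single batch spans several marks.

```python
def crossed_deciles(done_before: int, done_after: int, total: int) -> list[int]:
    """Return the 10% marks (10, 20, ..., 100) crossed when progress moves
    from done_before to done_after tiles out of total."""
    if total <= 0:
        return []
    before = (done_before * 10) // total
    after = (done_after * 10) // total
    return [d * 10 for d in range(before + 1, after + 1)]


# Simulate the AC-5 scenario: 1000 tiles walked in batches of 64.
events: list[int] = []
done = 0
while done < 1000:
    step = min(64, 1000 - done)
    events.extend(crossed_deciles(done, done + step, 1000))
    done += step
```

Integer arithmetic avoids floating-point edge cases at the boundaries; for this scenario `events` holds the ten marks 10 through 100, one per callback.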
**AC-6: Descriptor id mapping matches AZ-306's scheme**
Given the same tile (zoom=18, lat=49.5, lon=37.0, source=googlemaps)
When the batcher computes the int64 id
Then the value equals `int.from_bytes(sha256(b"18|49.5|37.0|googlemaps").digest()[:8], "big", signed=True)`; the same call elsewhere produces the same id (deterministic across runs)
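The formula in AC-6 can be wrapped in a small function, sketched here. `tile_descriptor_id` is a hypothetical name, and rendering `lat`/`lon` with Python's default float formatting is an assumption: the canonical string form of the key is owned by AZ-306 and is not specified in this task.

```python
import hashlib


def tile_descriptor_id(zoom: int, lat: float, lon: float, source: str) -> int:
    """Signed int64 taken from the first 8 bytes of sha256("zoom|lat|lon|source")."""
    # Assumption: default float formatting, so 37.0 renders as "37.0".
    key = f"{zoom}|{lat}|{lon}|{source}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big", signed=True)


sample_id = tile_descriptor_id(18, 49.5, 37.0, "googlemaps")
```

Because the id depends only on the key bytes, it is stable across runs and processes. Passing `37` instead of `37.0` would change the key, which is why the formatting rule needs to come from AZ-306.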
**AC-7: Atomic FAISS rebuild — partial write impossible**
Given the FAISS index already exists from a prior run
When `populate_descriptors` is killed mid-`rebuild_from_descriptors`
Then either the previous-good index OR the new index is on disk; never a half-written `.index`. (AZ-303/306's contract guarantees atomicity; this AC just asserts the batcher does not bypass it.)
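For context, the "old or new, never half-written" property is commonly obtained by writing to a temporary file in the same directory and renaming it over the target. The sketch below shows that generic pattern only; it is not AZ-280's or AZ-306's implementation, and `atomic_write_bytes` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: Path, payload: bytes) -> None:
    """Write payload next to target, then swap it into place with os.replace."""
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, target)  # atomic when both paths share a filesystem
    except BaseException:
        # Leave the previous target untouched and clean up the temp file.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A process killed before `os.replace` leaves the previous file intact plus a stray `.tmp`; a process killed after it leaves the new file.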
**AC-8: BackboneEmbedder Protocol is conformance-checkable**
Given a concrete `C7EngineBackboneEmbedder` instance
When `isinstance(impl, BackboneEmbedder)` is checked under `runtime_checkable`
Then the result is `True`; a fake omitting `descriptor_dim` returns `False`
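The conformance check in AC-8 behaves as sketched below. The Protocol is re-declared with simplified annotations so the example runs on its own; `CompleteFake` and `PartialFake` are hypothetical test doubles, not classes from the codebase.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class BackboneEmbedder(Protocol):
    def embed_batch(self, tiles: list) -> object: ...
    def descriptor_dim(self) -> int: ...


class CompleteFake:
    def embed_batch(self, tiles: list) -> object:
        return [[0.0] * 512 for _ in tiles]

    def descriptor_dim(self) -> int:
        return 512


class PartialFake:  # omits descriptor_dim on purpose
    def embed_batch(self, tiles: list) -> object:
        return []


is_complete = isinstance(CompleteFake(), BackboneEmbedder)  # both methods present
is_partial = isinstance(PartialFake(), BackboneEmbedder)    # descriptor_dim missing
```

`runtime_checkable` only verifies that the named methods exist; it does not check signatures or return types, so AC-9's dimension check remains necessary.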
**AC-9: descriptor_dim matches across embed_batch and HNSW params**
Given `backbone_embedder.descriptor_dim() == 512`
When `embed_batch` returns an array
Then the array's last axis is 512; if a future drift produces 768, raise `DescriptorBatchError("descriptor_dim mismatch")` BEFORE writing to FAISS
**AC-10: Progress + DEBUG logs do not leak the private engine bytes**
Given a session with the C7-engine-backed embedder
When all DEBUG logs are captured
Then engine bytes do NOT appear in any log; only metadata (batch_size, tile_ids, elapsed_s) is logged
## Non-Functional Requirements
**Performance**
- Embed throughput is dominated by AZ-321's engine + the embedder; this task adds ≤ 5% overhead (lazy mmap handles + numpy concatenation).
- The 1000-tile corpus should complete in ≤ 5 min on a Tier-1 dev workstation. At the assumed 50 ms per batch of 64, the 16 batches account for roughly 0.8 s of embed time, so the 5 min figure is an outer envelope only.
**Compatibility**
- `numpy` per project pin; `pathlib` stdlib; AZ-303 + AZ-306 + AZ-321 dependencies pinned.
- No new third-party dependencies.
**Reliability**
- Halve-and-retry is bounded by `max_oom_retries`; default 1 (so 64→32, then either succeeds or raises); higher values trade latency for completion probability.
- The atomic FAISS rebuild relies on AZ-303/306's contract; this task does not fork its own write path.
- `descriptor_dim` mismatch is caught before FAISS write to prevent corrupting an existing valid index.
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | 1000-tile corpus + fake embedder | 16 batches; rebuild called once; outcome=success |
| AC-2 | Fake embedder raises OOM at batch_size=64; succeeds at 32 | retry happens; oom_retries=1 |
| AC-3 | Fake embedder always OOMs | DescriptorBatchError raised; no rebuild call |
| AC-4 | Empty corpus | outcome=failure; explicit hint; zero embeds |
| AC-5 | 1000 tiles + spy callback | 10 callback events |
| AC-6 | Compute id for sample tile | Matches sha256 first-8-bytes formula |
| AC-7 | Kill mid-rebuild + restart | No half-index (AZ-306's atomic write) |
| AC-8 | `isinstance` check on impl + partial fake | True / False |
| AC-9 | Embedder returns wrong dim | DescriptorBatchError before FAISS write |
| AC-10 | Capture all DEBUG logs | No engine bytes; only metadata |
| NFR-perf-overhead | 1000-tile bench with no-op embedder | ≤ 5% overhead vs raw embed sum |
| NFR-reliability-bounded-retry | Embedder OOM × 5 with max_oom_retries=1 | Raises after 1 retry, not 5 |
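
Several rows above rely on a fake embedder with scripted OOM behaviour. A minimal sketch of such a test double follows; `ScriptedFakeEmbedder` and its `max_ok_batch` parameter are hypothetical names, and descriptors are lists of floats rather than `np.ndarray` to keep the example dependency-free.

```python
class DescriptorBatchError(RuntimeError):
    """Stand-in for the spec's error type."""


class ScriptedFakeEmbedder:
    """Hypothetical test double: raises CUDA OOM whenever a batch exceeds max_ok_batch."""

    def __init__(self, dim: int = 512, max_ok_batch: int = 32) -> None:
        self._dim = dim
        self._max_ok_batch = max_ok_batch
        self.batch_sizes: list[int] = []  # every batch size seen, for assertions

    def embed_batch(self, tiles: list) -> list:
        self.batch_sizes.append(len(tiles))
        if len(tiles) > self._max_ok_batch:
            raise DescriptorBatchError("CUDA OOM")
        return [[0.0] * self._dim for _ in tiles]

    def descriptor_dim(self) -> int:
        return self._dim
```

With `max_ok_batch=32` the fake reproduces the AC-2 scenario (fail at 64, succeed at 32); with `max_ok_batch=0` every non-empty batch OOMs, which covers AC-3 and the bounded-retry NFR. The recorded `batch_sizes` list lets a test assert the exact retry sequence.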
## Constraints
- `BackboneEmbedder` Protocol surface is intentionally narrow (2 methods); future C2 wiring adapts via the composition root, not by modifying this task.
- `embed_batch` MUST be called with a list of mmap-backed `TilePixelHandle` (per AZ-303); raw bytes are NOT accepted (would defeat AZ-303's read-only invariant).
- The descriptor id formula is canonical via AZ-306; this task does NOT invent its own.
- `rebuild_from_descriptors` is the ONLY write path to the FAISS index in this task; consumers do NOT touch the `.index` file directly.
- Halve-and-retry is bounded; unlimited retries are NOT permitted (would mask GPU regressions).
- This task introduces no new third-party dependencies.
## Risks & Mitigation
**Risk 1: BackboneEmbedder Protocol drifts from E-C2's eventual surface**
- *Risk*: When E-C2 (AZ-255) ships, its natural public method might be `embed_query(image: np.ndarray)` not `embed_batch(list[TilePixelHandle])`.
- *Mitigation*: A thin adapter at the C10/C2 boundary translates; the Protocol's two-method surface is small enough that wrapping is trivial. AZ-321 already produces the engine; if E-C2 ships its own public embed API, the C7-backed adapter is replaced via composition root.
**Risk 2: Halve-and-retry hides a real GPU regression**
- *Risk*: Persistent OOM at batch_size=1 indicates a deeper issue (memory fragmentation, model leak); halving repeatedly down to 1 wastes time.
- *Mitigation*: `max_oom_retries=1` by default — at most one halve. If 32 still OOMs, the run fails fast with full context for operator triage.
**Risk 3: Descriptor array memory pressure**
- *Risk*: 100k tiles × 512-dim float32 ≈ 200 MB in one numpy array; this is fine on small operator workstations, but the size scales linearly with descriptor dimension (e.g., 1024-dim → ≈ 400 MB).
- *Mitigation*: AZ-306's `rebuild_from_descriptors` could accept a streamed iterator if one is added later; for now the in-memory approach is documented and bounded by the operator workstation's RAM (RESTRICT-OPS-1 sets a 16 GB floor).
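The figures in Risk 3 follow from rows × dim × 4 bytes for float32. A small sketch of that arithmetic, with `descriptor_buffer_mb` as a hypothetical helper name and decimal megabytes (10^6 bytes) assumed:

```python
def descriptor_buffer_mb(n_tiles: int, dim: int, bytes_per_value: int = 4) -> float:
    """Size in decimal megabytes of an (n_tiles, dim) array; float32 uses 4 bytes per value."""
    return n_tiles * dim * bytes_per_value / 1_000_000


mb_512 = descriptor_buffer_mb(100_000, 512)    # 100k tiles at 512-dim
mb_1024 = descriptor_buffer_mb(100_000, 1024)  # 100k tiles at 1024-dim
```

This counts only the final array. A running buffer that is concatenated at the end may transiently hold a second copy, so peak usage can be higher than these figures.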
**Risk 4: Empty corpus is a silent operator mistake**
- *Risk*: Operator forgets to run C11 first; the build silently produces an empty index.
- *Mitigation*: AC-4 + ERROR log + explicit `failure_reason` hint surface immediately; the orchestrator T5 fails the build without writing a Manifest.
**Risk 5: descriptor_dim mismatch is detected too late**
- *Risk*: All 1000 tiles embed successfully but at the wrong dim; FAISS index is rebuilt with the wrong shape; takeoff verify fails.
- *Mitigation*: AC-9 checks the array's last axis BEFORE the rebuild call; cheap dim check at every batch boundary.
## Runtime Completeness
- **Named capability**: descriptor batched generation through C2 backbone over the corpus, FAISS index rebuild, GPU-bound throughput envelope per C10-PT-01 (description.md § 5; epic § Acceptance C10-IT-01).
- **Production code that must exist**: real `DescriptorBatcher` orchestrating real `BackboneEmbedder` (initially `C7EngineBackboneEmbedder` wrapping AZ-298) + real AZ-303/306 `rebuild_from_descriptors`; real OOM halve-and-retry; real progress emission.
- **Allowed external stubs**: tests MAY use a fake `BackboneEmbedder` that returns scripted descriptor arrays + a fake `tile_metadata_store` (already provided by AZ-303 conformance fakes); production wiring uses the real AZ-298 runtime + real C6.
- **Unacceptable substitutes**: a "deterministic descriptor" fake in production (defeats the entire localization pipeline); skipping the OOM retry (every transient OOM becomes a manual restart); writing to FAISS via raw `numpy.tofile` (bypasses AZ-306's atomic write); fabricating descriptor ids that don't match AZ-306's int64 sha256 scheme (breaks AC-6 and the takeoff verify).