[AZ-341] Archive AZ-341 + batch 45 report

Batch 45 (AZ-341 C2 FAISS retrieve wiring) post-commit bookkeeping:
- Move AZ-341 task spec to done/ (implement skill step 13).
- Write batch_45_cycle1_report.md (test results, AC coverage,
  architectural decisions, findings carried into cumulative review).
- Bump state.last_completed_batch 44 → 45.

Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
Oleksandr Bezdieniezhnykh
2026-05-13 21:47:07 +03:00
parent 88f6ae6dce
commit 1682dc354b
3 changed files with 195 additions and 1 deletions
@@ -1,197 +0,0 @@
# C2 FAISS HNSW Retrieve Wiring
**Task**: AZ-341_c2_faiss_retrieve_wiring
**Name**: C2 FAISS HNSW Retrieve Wiring (TileStore Bridge + Stale-Handle Defence)
**Description**: Implement the bridge between every C2 `VprStrategy.retrieve_topk` and the C6 `TileStore.faiss_topk(query_embedding, k) -> (distances, tile_ids)` Public API. The bridge handles handle invalidation defence (C2-ST-01: out-of-band corpus replacement caught via mmap inode/sidecar check; raises `IndexUnavailableError` instead of returning stale candidates), per-frame DEBUG logging of top-K distances (gated by config), and the WARN-threshold check on top-1 distance. The wiring lives in a small `c2_vpr/_faiss_bridge.py` module that every concrete `VprStrategy` constructor accepts via injection — keeps each strategy's `retrieve_topk` body to one line (`return self._faiss_bridge.retrieve(query, k, backbone_label="<name>")`).
**Complexity**: 3 points
**Dependencies**: AZ-336_c2_vpr_strategy_protocol, AZ-263_initial_structure, AZ-269_config_loader, AZ-303_c6_storage_interfaces, AZ-305_c6_postgres_filesystem_store, AZ-306_c6_faiss_descriptor_index, AZ-266_log_module, AZ-272_fdr_record_schema
**Component**: c2_vpr (epic AZ-255 / E-C2)
**Tracker**: AZ-341
**Epic**: AZ-255 (E-C2)
### Document Dependencies
- `_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md``retrieve_topk` Protocol contract; INV-4 (exactly k candidates, sorted ascending), INV-5 (`backbone_label` non-empty), INV-6 (deterministic).
- `_docs/02_document/components/02_c2_vpr/description.md` — § 1 retrieval boundary; § 4 `FAISS HNSW top-K=10 search` query; § 7 race conditions; § 9 logging strategy (WARN top-1 distance threshold).
- `_docs/02_document/components/02_c2_vpr/tests.md` — C2-ST-01 (index handle invalidation safety; the strategy MUST raise `IndexUnavailableError` rather than return stale candidates after out-of-band corpus replacement).
- `_docs/02_document/contracts/c6_tile_cache/descriptor_index.md``TileStore.faiss_topk` query surface; sidecar / mmap-inode invariants used by the bridge's stale-handle defence.
- `_docs/02_document/module-layout.md``c2_vpr` Imports from `components.c6_tile_cache` (Public API only); Layering Layer 3 → Layer 2 (allowed).
## Problem
Without this task:
- Every concrete `VprStrategy` (UltraVPR, NetVLAD, MegaLoc, MixVPR, SelaVPR, EigenPlaces, SALAD) would re-implement the same `retrieve_topk` body — calling `tile_store.faiss_topk`, building `VprCandidate` list, applying WARN threshold, emitting DEBUG log, handling `IndexUnavailableError`. Seven copies = seven places to drift.
- The C2-ST-01 safety check (out-of-band corpus replacement → `IndexUnavailableError`) would be re-implemented per strategy → seven places to break the safety invariant. Defended-in-depth at the bridge layer is correct.
- WARN-threshold tuning (the `config.vpr.warn_top1_threshold` knob) would have to be threaded through every strategy's constructor; centralising in the bridge means one config touchpoint.
- DEBUG per-frame distance logging — useful for forensic investigation of suspicious flights — would either be off everywhere or duplicated everywhere.
The bridge also gives a single place to enforce INV-4 (exactly k candidates returned) defensively: if C6's `faiss_topk` ever returns fewer or unordered, the bridge catches it and raises `IndexUnavailableError` with diagnostic context.
## Outcome
- `src/gps_denied_onboard/components/c2_vpr/_faiss_bridge.py` defining `FaissBridge`:
- Constructor: `__init__(self, tile_store: TileStore, normaliser: DescriptorNormaliser, descriptor_dim: int, warn_top1_threshold: float, debug_log_per_frame_distances: bool, fdr_client: FdrClient)`.
- Method `retrieve(self, query: VprQuery, k: int, backbone_label: str) -> VprResult`:
1. `distances, tile_ids = self._tile_store.faiss_topk(query.embedding, k)` — propagates `IndexUnavailableError` from C6 unchanged.
2. **Defensive INV-4 check**: assert `len(distances) == k` AND `distances == sorted(distances)`; on violation raise `IndexUnavailableError(f"corpus returned {len(distances)} candidates (expected {k}) or unordered distances")`.
3. Build `[VprCandidate(tile_id=tid, descriptor_distance=d, descriptor_dim=self._descriptor_dim) for tid, d in zip(tile_ids, distances)]`.
4. Construct `result = VprResult(query.frame_id, candidates, retrieved_at=monotonic_ns(), backbone_label=backbone_label)`.
5. **WARN-threshold check**: if `distances[0] > self._warn_top1_threshold`, emit ONE WARN log `kind="c2.vpr.top1_distance_above_threshold"` with structured `{distance, threshold, backbone_label}`.
6. **DEBUG per-frame distances**: if `self._debug_log_per_frame_distances`, emit ONE DEBUG log `kind="c2.vpr.frame_distances"` with `{frame_id, top10_distances}`.
7. Return `result`.
- Each concrete `VprStrategy`'s `retrieve_topk` is one line: `return self._faiss_bridge.retrieve(query, k, backbone_label=self.BACKBONE_LABEL)`.
- The bridge is constructor-injected into every strategy; every strategy's `create(...)` factory builds the bridge with strategy-specific `descriptor_dim` and `BACKBONE_LABEL`.
- C2-ST-01 stale-handle defence: the bridge does NOT re-implement the inode / sidecar check itself — that lives in C6's `TileStore.faiss_topk` (per `_docs/02_document/contracts/c6_tile_cache/descriptor_index.md`). The bridge's role is to NOT swallow the `IndexUnavailableError`; propagating it unchanged is the contractual behaviour. Tests assert the propagation.
- Logging per description.md § 9 (WARN top-1, DEBUG per-frame distances).
- FDR record `kind="vpr.retrieve_topk"` emitted per `retrieve` call with `{frame_id, backbone_label, top10_distances, latency_us}` for post-flight forensics.
## Scope
### Included
- `FaissBridge` class with the `retrieve` method exactly as specified above.
- Defensive INV-4 check (exactly k candidates, sorted ascending) at the bridge layer; raises `IndexUnavailableError` on violation.
- `IndexUnavailableError` propagation (NOT wrapping) from C6 `TileStore.faiss_topk`.
- WARN-threshold check on top-1 distance (config-driven `config.vpr.warn_top1_threshold`).
- DEBUG per-frame distances log (config-driven `config.vpr.debug_per_frame_distances`).
- FDR record emission per `retrieve` call.
- Latency measurement (monotonic_ns delta) for the FDR record.
- Unit tests covering: happy path, INV-4 violation handling, IndexUnavailableError propagation, WARN-threshold trigger, DEBUG log gating, FDR record fields, latency capture.
- Strategy-side wire-in: each concrete strategy's `create(...)` updated to construct and inject the bridge; each `retrieve_topk` shrunk to a one-line delegation.
### Excluded
- The C6 `TileStore.faiss_topk` query implementation — owned by AZ-303 / AZ-305 / AZ-306. The bridge consumes the Public API.
- The C6 mmap inode + sidecar stale-handle check — owned by AZ-306 (FAISS descriptor index). The bridge consumes the resulting `IndexUnavailableError` and propagates it.
- The `VprStrategy` Protocol and DTOs — owned by AZ-336.
- Concrete strategy implementations — owned by AZ-337..AZ-340 (strategy-side wire-in code is part of those tasks; this task delivers the bridge module + the wire-in pattern they follow).
- C2-ST-01 acceptance test (suite-level) — deferred to Step 9 / E-BBT (and partially covered by the C6 stale-handle test owned by AZ-306).
## Acceptance Criteria
**AC-1: Happy-path retrieve**
Given a `FaissBridge` constructed with a fake `TileStore` returning `(distances=[0.05, 0.10, 0.15, ...], tile_ids=[t1, t2, t3, ...])` for a `query` AND `k=10`
When `bridge.retrieve(query, k=10, backbone_label="ultra_vpr")` is called
Then a `VprResult` is returned with `len(candidates) == 10`, sorted ascending, `backbone_label == "ultra_vpr"`, `descriptor_dim` matches the bridge's configured value; ONE FDR record `kind="vpr.retrieve_topk"` emitted with the correct fields
**AC-2: INV-4 violation — wrong count**
Given a fake `TileStore` returning `(distances=[0.05, 0.10], tile_ids=[t1, t2])` for `k=10` (only 2 candidates returned)
When `bridge.retrieve(query, k=10, backbone_label="ultra_vpr")` is called
Then `IndexUnavailableError` is raised with message containing `"corpus returned 2 candidates (expected 10)"`; NO FDR record is emitted (the failure is the corpus, not the retrieval); ERROR log `kind="c2.vpr.invariant_violation"` emitted
**AC-3: INV-4 violation — unordered distances**
Given a fake `TileStore` returning `(distances=[0.05, 0.20, 0.10, 0.15, ...], tile_ids=[...])` (out of ascending order)
When `bridge.retrieve(query, k=10, backbone_label="ultra_vpr")` is called
Then `IndexUnavailableError` is raised with message containing `"unordered distances"`; ERROR log emitted
**AC-4: `IndexUnavailableError` from C6 propagated unchanged**
Given a fake `TileStore` whose `faiss_topk` raises `IndexUnavailableError("stale handle")`
When `bridge.retrieve(query, k=10, backbone_label="ultra_vpr")` is called
Then `IndexUnavailableError("stale handle")` propagates unchanged (NOT wrapped, NOT re-raised with new message); the bridge does not catch and re-raise
**AC-5: WARN-threshold trigger**
Given `warn_top1_threshold = 0.30` AND a fake `TileStore` returning `distances[0] = 0.42`
When `bridge.retrieve(...)` is called
Then ONE WARN log `kind="c2.vpr.top1_distance_above_threshold"` with structured `{distance: 0.42, threshold: 0.30, backbone_label: "ultra_vpr"}` is emitted; the `VprResult` is still returned
**AC-6: WARN-threshold not triggered when top-1 is below threshold**
Given `warn_top1_threshold = 0.30` AND `distances[0] = 0.15`
When `bridge.retrieve(...)` is called
Then NO WARN log is emitted; the `VprResult` is returned
**AC-7: DEBUG per-frame distances ON**
Given `debug_log_per_frame_distances = True`
When `bridge.retrieve(...)` is called
Then ONE DEBUG log `kind="c2.vpr.frame_distances"` is emitted with `{frame_id, top10_distances: [0.05, 0.10, ...]}`
**AC-8: DEBUG per-frame distances OFF (default)**
Given `debug_log_per_frame_distances = False`
When `bridge.retrieve(...)` is called
Then NO DEBUG log is emitted
**AC-9: FDR record carries correct fields**
Given a happy-path retrieve
When `bridge.retrieve(...)` returns
Then ONE FDR record `kind="vpr.retrieve_topk"` is emitted with structured fields `{frame_id, backbone_label, top10_distances, latency_us}`; `latency_us > 0`; `top10_distances == [0.05, 0.10, ...]`
**AC-10: Strategy-side wire-in — `retrieve_topk` is one line**
Given any concrete `VprStrategy` post-wire-in
When the strategy's source code is inspected
Then the body of `retrieve_topk` is exactly one return statement delegating to `self._faiss_bridge.retrieve(query, k, backbone_label=...)`; no candidate-building or distance-checking logic remains in the strategy
**AC-11: Per-strategy `descriptor_dim` carried through to candidates**
Given two bridges — one for UltraVPR (descriptor_dim=512), one for NetVLAD (descriptor_dim=4096)
When `retrieve` is called on each
Then UltraVPR's candidates have `descriptor_dim == 512`; NetVLAD's candidates have `descriptor_dim == 4096`; the bridge does NOT mix them up
## Non-Functional Requirements
**Performance**
- `bridge.retrieve` overhead p95 ≤ 0.5 ms — bounded by the INV-4 sorted check (linear in k=10), the candidate-list construction (10 dataclass instantiations), and the FDR record emission (a single SPSC enqueue). The C6 `faiss_topk` time is excluded from this budget — that's C6's NFR.
- The DEBUG per-frame log adds ≤ 50 µs when ON; default-OFF eliminates the cost.
**Compatibility**
- The bridge is a thin Python module; no native code, no compile flag.
- The `TileStore.faiss_topk` API surface is the only C6 contract consumed; changes to that surface require a coordinated bridge update.
**Reliability**
- The defensive INV-4 check provides a backstop against silent C6 bugs (e.g., a future `faiss_topk` regression that returns fewer candidates).
- `IndexUnavailableError` propagation preserves C2-ST-01's "raise rather than return stale" invariant.
- The bridge is stateless across calls (no per-call state mutates the bridge instance); thread safety is per-strategy (single ingest thread per strategy → single bridge use).
## Unit Tests
| AC Ref | What to Test | Required Outcome |
|--------|-------------|-----------------|
| AC-1 | Happy-path retrieve | `VprResult` with 10 candidates, sorted, correct label, FDR record emitted |
| AC-2 | TileStore returns 2/10 candidates | `IndexUnavailableError("corpus returned 2 candidates (expected 10)")`; ERROR log; no FDR record |
| AC-3 | TileStore returns unordered distances | `IndexUnavailableError("unordered distances")`; ERROR log |
| AC-4 | TileStore raises `IndexUnavailableError("stale handle")` | propagated unchanged |
| AC-5 | top-1 = 0.42, threshold = 0.30 | WARN log with `{distance, threshold, backbone_label}` |
| AC-6 | top-1 = 0.15, threshold = 0.30 | NO WARN log |
| AC-7 | DEBUG ON | DEBUG log with `{frame_id, top10_distances}` |
| AC-8 | DEBUG OFF | NO DEBUG log |
| AC-9 | FDR record fields | `{frame_id, backbone_label, top10_distances, latency_us}` with `latency_us > 0` |
| AC-10 | Source-code lint of strategies | `retrieve_topk` body is exactly one return statement (verifiable via AST inspection) |
| AC-11 | Two bridges with different `descriptor_dim` | each carries through to its candidates' `descriptor_dim` |
| NFR-perf | Microbench `bridge.retrieve` × 1000 with mock TileStore | p95 ≤ 0.5 ms (excluding C6 time) |
## Constraints
- **The bridge does NOT implement stale-handle detection** — that's C6's responsibility (AZ-306). The bridge's role is propagation.
- **The defensive INV-4 check is mandatory** — even though C6 should already enforce this, a defended-in-depth check at the consumer side catches future C6 regressions before they propagate to downstream C2.5 ReRanker.
- **The bridge is constructor-injected** — strategies do NOT import `_faiss_bridge` directly; the strategy's `create(...)` factory constructs the bridge and injects it.
- **WARN-threshold and DEBUG flag are config-driven** — no hard-coded values inside the bridge.
- **Latency measurement uses monotonic_ns** — wall-clock would drift if the system clock adjusts mid-flight.
- **The bridge is bound to one strategy's `descriptor_dim`** — a single bridge cannot serve multiple strategies with different dims; this is enforced by the per-strategy `create(...)` constructing a bridge with that strategy's dim.
- **No GPU operations in the bridge** — pure CPU; the GPU work is the strategy's `embed_query`.
## Risks & Mitigation
**Risk 1: The bridge's defensive INV-4 check duplicates C6's invariant enforcement**
- *Risk*: Two enforcement points may diverge; one might be stricter than the other.
- *Mitigation*: The bridge's check is "trust but verify" (defended-in-depth). C6 is the primary enforcement; the bridge's check exists to catch C6 regressions early. If the contracts diverge, that's a coordinated update across both contracts.
**Risk 2: Per-frame DEBUG log volume at 3 Hz could flood journald**
- *Risk*: 3 Hz × 10 distances per record × 24 hours = ~2.6M log lines per flight day.
- *Mitigation*: DEBUG is OFF by default (AC-8); operators enable for forensic investigation only. Test asserts default-OFF.
**Risk 3: WARN-threshold default (0.30) is uncalibrated**
- *Risk*: 0.30 is a placeholder; production-tuned values come from FT-P-19 telemetry. Excessive WARNs early in deployment.
- *Mitigation*: `warn_top1_threshold` is config-driven; operators tune per deployment. Default 0.30 is conservative starting point.
**Risk 4: AST inspection in AC-10 is brittle**
- *Risk*: AC-10's "exactly one return statement" check via AST inspection may break if a strategy adds a docstring.
- *Mitigation*: AC-10's AST check ignores docstrings, comments, and whitespace; only the AST body of `retrieve_topk` is inspected — body must be a single `Return` node.
**Risk 5: FDR record volume at 3 Hz × 10 distances**
- *Risk*: Per-frame FDR emission at 3 Hz adds 3 records/sec; with `top10_distances` array (~80 bytes), that's ~1 MB / hour of flight.
- *Mitigation*: This is well within the C13 FDR writer's budget (≤ 64 GB cap per AZ-291..AZ-296); 1 MB/hour is negligible. The retrieval-level provenance is critical for post-flight forensics; the cost is justified.
## Runtime Completeness
- **Named capability**: cross-strategy bridge between every C2 `VprStrategy.retrieve_topk` and the C6 `TileStore.faiss_topk` query surface; centralises invariant enforcement, WARN-threshold, DEBUG logging, and FDR provenance (architecture / E-C2 / `solution.md` "FAISS HNSW lookup wiring" / C2-ST-01).
- **Production code that must exist**: real `FaissBridge` class with real `tile_store.faiss_topk` calls; real defensive INV-4 check; real `IndexUnavailableError` propagation; real WARN-threshold + DEBUG flag config plumbing; real FDR record emission with latency_us; real per-strategy wire-in shrinking each `retrieve_topk` to one delegation line.
- **Allowed external stubs**: tests MAY use `FakeTileStore` returning controllable `(distances, tile_ids)` tuples (AC-1..AC-9), `FakeFdrClient` (AC-9), `FakeLogger` capturing log records by `kind`; production wiring uses the real C6 `TileStore` + real C13 FDR.
- **Unacceptable substitutes**: per-strategy duplicated retrieve logic (would defeat the entire purpose of this task); skipping the defensive INV-4 check (would let future C6 regressions silently propagate); wrapping `IndexUnavailableError` in another exception (would defeat C2-ST-01); hard-coded WARN threshold (would break operator tuning); always-ON DEBUG logging (would flood journald); skipping FDR record emission (would lose post-flight retrieval provenance); a stateful bridge that caches between calls (the contract is stateless per call).