mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 11:51:14 +00:00
[AZ-343] C2.5 InlierCountReRanker + shared FeatureExtractor helper
Implements the production-default ReRankStrategy: K=10 → N=3 by single-pair LightGlue inlier count, with strict drop-and-continue (INV-8) on per-candidate TileFetch / backbone / zero-inlier failures and RerankAllCandidatesFailedError on zero survivors.

Composition root injects the shared LightGlueRuntime + Clock + the new FeatureExtractor helper (an L1 placeholder OpenCvOrbExtractor that unblocks AZ-343 and future C3 strategies — task scope expansion).

Architectural notes:
- Cross-component imports stay banned; tile_store types as `object` and the C6 TileCacheError family is duck-typed by class module prefix (same workaround AZ-348 adopted for c7_inference; proper fix is to relocate TileCacheError to _types/ in a follow-up).
- Clock injection follows the replay contract (AZ-398 Invariant 2); reranked_at is sourced from clock.monotonic_ns().
- AZ-342 factory grew `feature_extractor` + `clock` + `fdr_client` parameters; existing AZ-342 conformance tests updated.

Tests: 19 new AC-1..AC-12 + mixed-failure scenarios in test_inlier_count_reranker.py; existing AZ-342 suite (26) still green. Full repo sweep 1093 passed / 2 skipped (cmake/actionlint not on PATH).

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -1,227 +0,0 @@
# C2.5 InlierCountReRanker — single-pair LightGlue inlier count K=10 → N=3

**Task**: AZ-343_c2_5_inlier_count_reranker
**Name**: C2.5 InlierCountReRanker (drop-and-continue)
**Description**: Implement `InlierCountReRanker`, the production-default `ReRankStrategy`. For each candidate in C2's top-K=10 `VprResult`, fetch tile pixels from C6, run a single-pair LightGlue forward via the shared `LightGlueRuntime` helper (AZ-278), record the inlier count, then sort descending by inlier count and return the top-N=3 as a `RerankResult`. Implements the drop-and-continue contract (Invariant 8 from the Protocol contract): per-candidate `RerankBackboneError` (LightGlue forward failure) and `TileFetchError` (C6 read failure) are caught inside the loop, the candidate is dropped, an ERROR log + FDR record is emitted, and the success path continues. Zero survivors raise `RerankAllCandidatesFailedError`. Includes the concrete `InlierCountReRankerPreprocessor` if any pre-LightGlue cropping/resizing is needed (single-pair LightGlue input contract MUST be satisfied — owned by AZ-278 helper, but the per-frame side prep happens here). Composition-root wired via the AZ-342 factory (this task's `create` entry-point).
**Complexity**: 3 points
**Dependencies**: AZ-342_c2_5_rerank_strategy_protocol (Protocol + factory + DTOs + errors + composition wiring), AZ-263_initial_structure, AZ-269_config_loader, AZ-278_lightglue_runtime (shared LightGlue helper), AZ-303_c6_storage_interfaces (`TileStore.get_tile_pixels` Public API), AZ-266_log_module, AZ-272_fdr_record_schema
**Component**: c2_5_rerank (epic AZ-256 / E-C2.5)
**Tracker**: AZ-343
**Epic**: AZ-256 (E-C2.5)

### Document Dependencies

- `_docs/02_document/contracts/c2_5_rerank/rerank_strategy_protocol.md` — Protocol contract this task implements (every invariant MUST be satisfied; drop-and-continue is INV-8).
- `_docs/02_document/components/03_c2_5_rerank/description.md` — § 1 architectural pattern; § 2 `RerankResult` semantics (length = N=3 ranked descending by inlier_count); § 5 error handling (drop-and-continue + zero-survivors fallback); § 7 caveats (shared helper serial access, no concurrency); § 9 logging.
- `_docs/02_document/module-layout.md` — `c2_5_rerank` Per-Component Mapping (`inlier_based_reranker.py` Internal); `BUILD_RERANK_INLIER_COUNT` row in build-time exclusion map (ON for airborne / research / replay-cli; OFF for operator-tooling).
- `_docs/02_document/contracts/shared_helpers/lightglue_runtime.md` — `LightGlueRuntime.match_single_pair(query_image, support_image) -> InlierCount` (or equivalent helper API; this task's calls go through that interface).
- `_docs/02_document/contracts/c6_tile_cache/tile_store.md` — `TileStore.get_tile_pixels(tile_id) -> page-cache-backed handle` semantics (no copy).
- `_docs/02_document/contracts/c2_vpr/vpr_strategy_protocol.md` — `VprResult` and `VprCandidate` DTOs (consumed at the input boundary).
- `_docs/02_document/components/03_c2_5_rerank/tests.md` — C2.5-IT-01 (top-1 promotion rate ≥ 0.98); C2.5-IT-02 (drop-and-continue smoke); C2.5-IT-03 (helper serial-access invariant); C2.5-PT-01 (`rerank` p95 ≤ 80 ms for 10 single-pair LightGlue passes; GPU mem ≤ 300 MB shared engine).

## Problem

Without this task:

- The Protocol from AZ-342 has no concrete implementation; the airborne binary cannot start because `compose_root` cannot construct a `ReRankStrategy` for `config.rerank.strategy = "inlier_count"` (the only legal value today).
- C3 CrossDomainMatcher (AZ-257) has no input source; F3 / F6 cannot run.
- AC-2.5-IT-01 (top-1 promotion rate ≥ 0.98) — the primary C2.5 acceptance criterion — has no producer; the boundary between cheap retrieval (C2) and expensive matching (C3) is undefended; F3 sees all K=10 candidates instead of the top N=3 and overshoots its latency budget by roughly 3.3×.
- The drop-and-continue contract is the ONLY thing standing between a single LightGlue CUDA OOM and a full VIO-only fallback (AC-3.5). Without robust per-candidate error handling, the AC-NEW-7 cache-poisoning safety budget can be triggered by transient backbone errors that have nothing to do with the corpus.
- The shared `LightGlueRuntime` helper (AZ-278) is constructed but unused unless this task wires it into the per-candidate inlier-counting loop. R14 (apparent C2.5↔C3 cycle) was resolved by helper ownership; that resolution is moot if neither sibling consumer ships.

## Outcome

- `src/gps_denied_onboard/components/c2_5_rerank/inlier_based_reranker.py` defining:
  - `InlierCountReRanker` class implementing the `ReRankStrategy` Protocol (AZ-342).
  - Constructor signature: `__init__(self, tile_store: TileStore, lightglue_runtime: LightGlueRuntime, fdr_client: FdrClient, top_n: int = 3)`. The strategy holds references to the constructor-injected `LightGlueRuntime` (NOT a copy); the helper's lifecycle is owned by the runtime root (per the helper-ownership R14 fix).
  - `rerank(frame, vpr_result, n, calibration)`:
    1. `surviving: list[RerankCandidate] = []`
    2. `dropped = 0`
    3. For each `VprCandidate` in `vpr_result.candidates`:
       a. Try `tile_pixels_handle = self._tile_store.get_tile_pixels(candidate.tile_id)`.
          On `TileFetchError`: emit ERROR log `kind="c2_5.rerank.tile_fetch_error"` + FDR record `kind="rerank.tile_fetch_error"`; `dropped += 1`; continue.
       b. Try `inlier_count = self._lightglue_runtime.match_single_pair(query_image=frame.image_bytes_or_decoded, support_image=tile_pixels_handle, calibration=calibration).inlier_count`.
          On `LightGlueError` / underlying CUDA / RuntimeError: wrap as `RerankBackboneError`; emit ERROR log `kind="c2_5.rerank.backbone_error"` + FDR record `kind="rerank.backbone_error"` with `tile_id` field; `dropped += 1`; continue.
       c. If `inlier_count == 0`: emit DEBUG log `kind="c2_5.rerank.zero_inliers"` (NOT an error — just a no-match candidate); `dropped += 1`; continue.
       d. Else: append `RerankCandidate(tile_id=candidate.tile_id, inlier_count=inlier_count, descriptor_distance=candidate.descriptor_distance, descriptor_dim=candidate.descriptor_dim, tile_pixels_handle=tile_pixels_handle)` to `surviving`.
    4. If `len(surviving) == 0`: emit ERROR log `kind="c2_5.rerank.all_failed"` + FDR record `kind="rerank.all_failed"` with `frame_id` + `candidates_input` + `candidates_dropped`; raise `RerankAllCandidatesFailedError(...)`.
    5. Sort `surviving` descending by `inlier_count`; ties broken by `descriptor_distance` ascending (per Invariant 3 deterministic tie-break).
    6. Truncate to `surviving[:n]`.
    7. If `len(surviving[:n]) < n`: emit WARN log `kind="c2_5.rerank.fewer_than_n_survivors"` with `{requested: n, returned: len(surviving[:n]), dropped: dropped}` (matches description.md § 9 WARN row).
    8. Emit DEBUG log `kind="c2_5.rerank.frame_done"` (gated by `config.rerank.debug_per_frame_log`; default false to avoid 3 Hz log volume) with the inlier-count vector. Emit FDR record `kind="rerank.frame_done"` (always, NOT gated) with `{frame_id, candidates_input, candidates_dropped, top_inlier_count, top_tile_id}`.
    9. Return `RerankResult(frame_id=vpr_result.frame_id, candidates=surviving[:n], reranked_at=monotonic_ns(), rerank_label="inlier_count", candidates_input=len(vpr_result.candidates), candidates_dropped=dropped)`.
  - Module-level `create(config, tile_store, lightglue_runtime) -> ReRankStrategy`:
    1. Read `top_n = config.rerank.top_n` (default 3).
    2. Construct `InlierCountReRanker(tile_store=tile_store, lightglue_runtime=lightglue_runtime, fdr_client=<from runtime root>, top_n=top_n)`.
    3. Return the instance.
- Composition-root wiring: `runtime_root.compose_root` includes a path that, after constructing the shared `LightGlueRuntime`, invokes `build_rerank_strategy(...)` (the AZ-342 factory) which dispatches to this task's `create`.
- Logging per description.md § 9:
  - INFO `kind="c2_5.rerank.ready"` with `{strategy: "inlier_count", N: 3, K: 10}` after construction.
  - WARN `kind="c2_5.rerank.fewer_than_n_survivors"` per frame when survivors < N.
  - ERROR `kind="c2_5.rerank.all_failed"` on zero survivors.
  - ERROR `kind="c2_5.rerank.backbone_error"` per LightGlue failure.
  - ERROR `kind="c2_5.rerank.tile_fetch_error"` per C6 read failure.
  - DEBUG `kind="c2_5.rerank.zero_inliers"` per candidate with zero inliers (gated).
  - DEBUG `kind="c2_5.rerank.frame_done"` per frame with inlier vector (gated).
- FDR records emitted: `kind="rerank.frame_done"` (always, per frame), `kind="rerank.backbone_error"` (per error), `kind="rerank.tile_fetch_error"` (per error), `kind="rerank.all_failed"` (per zero-survivors event).

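
The `rerank` steps listed in this section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the shipped implementation: the dataclasses and exception classes are local stand-ins for the AZ-342 / C6 / AZ-278 types, `emit` is a hypothetical callback standing in for both the logger and the `FdrClient`, and `match_single_pair` is assumed to return a plain `int` rather than an object with an `inlier_count` attribute. Only the control flow (drop-and-continue, zero-survivors raise, deterministic sort) follows the spec.

```python
from dataclasses import dataclass
from typing import Any


class TileFetchError(Exception):
    """Stand-in for the C6 read failure named in the spec."""


class LightGlueError(Exception):
    """Stand-in for the helper's forward-pass failure."""


class RerankAllCandidatesFailedError(Exception):
    """Raised when no candidate survives the loop."""


@dataclass(frozen=True)
class VprCandidate:
    tile_id: Any
    descriptor_distance: float


@dataclass(frozen=True)
class RerankCandidate:
    tile_id: Any
    inlier_count: int
    descriptor_distance: float
    tile_pixels_handle: Any


def rerank_sketch(query_image, candidates, n, get_tile_pixels, match_single_pair, emit):
    """Return (top-n survivors, dropped count) using drop-and-continue."""
    surviving = []
    dropped = 0
    for cand in candidates:
        try:
            handle = get_tile_pixels(cand.tile_id)
        except TileFetchError:
            emit("ERROR", "c2_5.rerank.tile_fetch_error", cand.tile_id)
            dropped += 1
            continue
        try:
            inliers = match_single_pair(query_image, handle)
        except (LightGlueError, RuntimeError):
            emit("ERROR", "c2_5.rerank.backbone_error", cand.tile_id)
            dropped += 1
            continue
        if inliers == 0:
            # Not an error: the candidate simply did not match.
            emit("DEBUG", "c2_5.rerank.zero_inliers", cand.tile_id)
            dropped += 1
            continue
        # The handle is stored as-is (a reference, never a copy).
        surviving.append(
            RerankCandidate(cand.tile_id, inliers, cand.descriptor_distance, handle)
        )
    if not surviving:
        emit("ERROR", "c2_5.rerank.all_failed", None)
        raise RerankAllCandidatesFailedError(f"all {len(candidates)} candidates failed")
    # Descending inlier count; ties broken by ascending descriptor distance.
    surviving.sort(key=lambda c: (-c.inlier_count, c.descriptor_distance))
    top = surviving[:n]
    if len(top) < n:
        emit("WARN", "c2_5.rerank.fewer_than_n_survivors", None)
    return top, dropped
```

Passing the collaborators in as callables keeps the sketch free of any import from another component, which mirrors the constructor-injection constraint stated later in this document.
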
## Scope

### Included

- `InlierCountReRanker` class implementing the `ReRankStrategy` Protocol exactly per the AZ-342 contract (every invariant satisfied).
- Drop-and-continue per-candidate error handling for `RerankBackboneError` AND `TileFetchError`.
- Zero-survivors → `RerankAllCandidatesFailedError` path.
- Deterministic top-N sort: descending by `inlier_count`, ties broken ascending by `descriptor_distance`.
- `RerankCandidate` construction with `tile_pixels_handle` carried as a reference (no copy).
- Module-level `create(config, tile_store, lightglue_runtime)` factory entry-point.
- Composition-root wiring path for `config.rerank.strategy == "inlier_count"` (consumed by AZ-342's factory).
- Logging per description.md § 9 (INFO ready, WARN fewer-than-N, ERROR error paths, DEBUG per-frame inlier vector + zero-inliers).
- FDR record emission for frame-done, error paths, and all-failed.
- Unit tests covering Invariants 1–8, the drop-and-continue contract, the zero-survivors path, the tie-break determinism, the `tile_pixels_handle` reference semantics, and the composition-root wiring path.
- `BUILD_RERANK_INLIER_COUNT` CMake flag wiring (per ADR-002): the strategy module is excluded from the operator-tooling binary (operator tooling does not run the per-frame pipeline).

### Excluded

- The `ReRankStrategy` Protocol + DTOs + errors + factory — owned by AZ-342 (`AZ-342_c2_5_rerank_strategy_protocol`).
- The `LightGlueRuntime` helper itself — owned by AZ-278 (E-CC-HELPERS); this task consumes the constructor-injected handle and calls `match_single_pair`.
- The C6 `TileStore` interface — owned by AZ-303; this task consumes the Public API.
- The C2 `VprResult` / `VprCandidate` DTOs — owned by AZ-336 (`c2_vpr_strategy_protocol`); this task consumes them at the input boundary.
- LightGlue engine compile (`.onnx` → `.trt`) — owned by AZ-321 (`c10_engine_compiler`); the helper handle wraps the produced engine.
- C3 CrossDomainMatcher — separate epic / task (AZ-257's component decomposition).
- Component-internal acceptance tests beyond Protocol + invariants + drop-and-continue smoke: C2.5-IT-01 (top-1 promotion rate ≥ 0.98 against a fixture corpus) and C2.5-PT-01 (latency NFR `rerank` p95 ≤ 80 ms) are deferred to Step 9 / E-BBT.
- Any cross-component re-rank tuning (e.g., learned re-rankers) — future task in a follow-up cycle.

## Acceptance Criteria

**AC-1: Protocol conformance**
Given a constructed `InlierCountReRanker` instance
When `isinstance(strategy, ReRankStrategy)` is evaluated
Then the result is `True`; the instance exposes a `rerank` method

**AC-2: Top-N ordering — descending by inlier_count, ties broken ascending by descriptor_distance**
Given a `VprResult` with K=10 candidates whose inlier counts (after rerank) are [412, 198, 287, 153, 287, 0, 65, 412, 89, 234] and descriptor_distances [0.1, 0.4, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] AND `n=3`
When `rerank(frame, vpr_result, n=3, calibration)` is called
Then `RerankResult.candidates[0].inlier_count == 412 AND descriptor_distance == 0.1` (tie-break: lower distance ranked first); `candidates[1].inlier_count == 412 AND descriptor_distance == 0.8`; `candidates[2].inlier_count == 287 AND descriptor_distance == 0.2`; `len(candidates) == 3`; the candidate with `inlier_count == 0` is dropped (DEBUG log emitted)

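
The ordering that AC-2 expects can be checked by hand with a composite sort key. The snippet below is a small worked example using the AC-2 fixture values; representing each candidate as a bare `(inlier_count, descriptor_distance)` tuple is an assumption made for brevity.

```python
inlier_counts = [412, 198, 287, 153, 287, 0, 65, 412, 89, 234]
distances = [0.1, 0.4, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# Pair each count with its distance and drop the zero-inlier candidate.
survivors = [(c, d) for c, d in zip(inlier_counts, distances) if c > 0]

# Negating the count sorts it descending while the distance stays ascending.
survivors.sort(key=lambda cd: (-cd[0], cd[1]))

top_3 = survivors[:3]
print(top_3)  # [(412, 0.1), (412, 0.8), (287, 0.2)]
```

Both 412-inlier candidates outrank both 287-inlier candidates, and within each tie the lower descriptor distance comes first, which is why the 287 / 0.5 candidate misses the top three.
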
**AC-3: Drop-and-continue on `RerankBackboneError`**
Given a `VprResult` with K=10 candidates AND a `LightGlueRuntime` test double that raises `LightGlueError` on the 4th call (4th candidate) and succeeds on all others
When `rerank(...)` is called with `n=3`
Then the call returns successfully; `RerankResult.candidates` has 3 survivors selected from the 9 successful candidates; `candidates_dropped == 1` (or higher if zero-inlier candidates were also present); ONE ERROR log `kind="c2_5.rerank.backbone_error"` is emitted with `tile_id` of the 4th candidate; ONE FDR record `kind="rerank.backbone_error"` is emitted

**AC-4: Drop-and-continue on `TileFetchError`**
Given a `VprResult` with K=10 candidates AND a `TileStore` test double that raises `TileFetchError` on the 7th candidate's `get_tile_pixels` call
When `rerank(...)` is called with `n=3`
Then the call returns successfully; `RerankResult.candidates` has 3 survivors from the 9 fetched candidates; `candidates_dropped >= 1`; ONE ERROR log `kind="c2_5.rerank.tile_fetch_error"` is emitted; ONE FDR record `kind="rerank.tile_fetch_error"` is emitted

**AC-5: Zero survivors → `RerankAllCandidatesFailedError`**
Given a `VprResult` with K=10 candidates AND a `LightGlueRuntime` test double that raises `LightGlueError` on EVERY call
When `rerank(...)` is called with `n=3`
Then `RerankAllCandidatesFailedError` is raised with message containing the input candidate count; TEN ERROR logs `kind="c2_5.rerank.backbone_error"` are emitted (one per candidate); ONE final ERROR log `kind="c2_5.rerank.all_failed"` is emitted; ONE FDR record `kind="rerank.all_failed"` is emitted with `{candidates_input: 10, candidates_dropped: 10}`

**AC-6: Fewer than N survivors → WARN log + partial result**
Given a `VprResult` with K=10 candidates AND a configuration where 8 candidates fail (mix of `RerankBackboneError` + zero-inliers) and 2 succeed
When `rerank(...)` is called with `n=3`
Then `RerankResult.candidates` has 2 survivors (NOT padded; NOT raised); `candidates_dropped == 8`; ONE WARN log `kind="c2_5.rerank.fewer_than_n_survivors"` with `{requested: 3, returned: 2, dropped: 8}` is emitted

**AC-7: `tile_pixels_handle` is a reference, NOT a copy**
Given a `RerankResult` returned from `rerank(...)`
When the underlying tile pixel buffer (in the C6 page-cache-backed `tile_pixels_handle`) is mutated externally
Then a re-read via the same `tile_pixels_handle` reflects the mutation (proves identity, not a copy); `RerankResult.candidates[0].tile_pixels_handle is original_handle_returned_by_tile_store_get_tile_pixels`

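
The difference AC-7 draws between a reference and a copy can be illustrated with Python's built-in buffer types. This is only an analogy: the real C6 handle type is not specified here, so a `memoryview` over a `bytearray` is an assumed stand-in for the page-cache-backed handle.

```python
backing = bytearray(4)          # stands in for the page-cache buffer (all zeros)
handle = memoryview(backing)    # zero-copy view, as a tile store might hand out

held = handle                   # what a RerankCandidate would carry
copied = bytes(handle)          # what the spec forbids: a snapshot copy

backing[0] = 255                # external mutation of the underlying buffer

assert held is handle           # identity preserved
assert held[0] == 255           # mutation is visible through the reference
assert copied[0] == 0           # a copy would silently miss it
```

A test for AC-7 could follow the same shape: mutate the fake store's buffer after `rerank` returns, then assert both identity (`is`) and visibility of the change.
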
**AC-8: `descriptor_distance` carried forward unchanged**
Given a `VprResult` whose top candidate has `descriptor_distance == 0.123456789`
When `rerank(...)` is called and the candidate survives
Then `RerankResult.candidates[i].descriptor_distance == 0.123456789` (bit-exact for the FP type used in `VprCandidate`)

**AC-9: Deterministic — same inputs → identical `RerankResult.candidates`**
Given the same `(frame, vpr_result, n, calibration)` AND a `LightGlueRuntime` test double whose `match_single_pair` is deterministic
When `rerank(...)` is called 3 times
Then all three returns have identical `candidates` (same `tile_id`s, same `inlier_count`s, same order); `frame_id` matches `vpr_result.frame_id` in all three; `reranked_at` differs across calls (monotonic_ns) but `candidates` does not

**AC-10: Composition-root wiring — `config.rerank.strategy = "inlier_count"`**
Given `config.rerank.strategy = "inlier_count"` AND `config.rerank.top_n = 3` AND a constructed shared `LightGlueRuntime` AND a constructed `TileStore`
When `compose_root(config)` runs
Then an `InlierCountReRanker` instance is wired into the runtime root; ONE INFO log `kind="c2_5.rerank.ready"` with `{strategy: "inlier_count", N: 3, K: 10}` is emitted; the strategy's `_lightglue_runtime` is identity-equal to the runtime root's shared helper

**AC-11: FDR emission — frame_done record per frame**
Given a successful `rerank(...)` call returning 3 survivors with inlier counts [412, 287, 198] and `candidates_dropped == 2`
When the call completes
Then ONE FDR record `kind="rerank.frame_done"` is emitted with structured fields `{frame_id: <UUID>, candidates_input: 10, candidates_dropped: 2, top_inlier_count: 412, top_tile_id: <tuple>}`

**AC-12: Single-pair LightGlue invocation — frame ↔ tile only**
Given a `VprResult` with K=10 candidates
When `rerank(...)` is called with all candidates succeeding
Then `lightglue_runtime.match_single_pair` is called EXACTLY 10 times (once per candidate); each call's `query_image` is the same `frame.image_bytes_or_decoded` reference; each call's `support_image` differs (one per candidate's `tile_pixels_handle`)

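
One way to assert the AC-12 call pattern is with `unittest.mock`. In the sketch below the `for` loop is a placeholder for the real `rerank` call, and the bare `object()` instances are assumed stand-ins for the frame image and the tile handles; the assertions at the end are the part that carries over to a real test.

```python
from unittest.mock import Mock

query = object()                          # stands in for frame.image_bytes_or_decoded
handles = [object() for _ in range(10)]   # one fake tile handle per candidate

runtime = Mock()
runtime.match_single_pair.return_value = 42

# Placeholder for rerank(...): one single-pair call per candidate.
for handle in handles:
    runtime.match_single_pair(query_image=query, support_image=handle)

calls = runtime.match_single_pair.call_args_list
assert len(calls) == 10                                            # exactly K calls
assert all(c.kwargs["query_image"] is query for c in calls)       # same query object
assert len({id(c.kwargs["support_image"]) for c in calls}) == 10  # ten distinct tiles
```
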
## Non-Functional Requirements

**Performance** (deferred validation to C2.5-PT-01 / E-BBT; this task delivers the implementation):
- `rerank` p95 ≤ 80 ms for 10 single-pair LightGlue passes — bounded by 10 × LightGlue forward time (~6-7 ms each on TRT 10.3 FP16 per AZ-278's helper benchmarks) + Python overhead. The Python-side overhead per candidate (fetch handle + log emit + sort) MUST be ≤ 1 ms p95 to keep the LightGlue compute path on budget.
- GPU memory: ≤ 300 MB resident for the shared LightGlue engine — owned by AZ-278 helper; this task consumes one engine instance and does NOT reload.

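
The latency budget above can be worked through with the quoted figures. The arithmetic below simply adds the per-pass estimate and the per-candidate overhead ceiling; it assumes the two costs are additive and that nothing overlaps, which is a simplification.

```python
K = 10                                        # candidates per frame
forward_ms_low, forward_ms_high = 6.0, 7.0    # per-pass LightGlue estimate quoted above
python_overhead_ms = 1.0                      # per-candidate ceiling quoted above
budget_ms = 80.0                              # rerank p95 target

best_case_ms = K * (forward_ms_low + python_overhead_ms)
worst_case_ms = K * (forward_ms_high + python_overhead_ms)
headroom_ms = budget_ms - worst_case_ms

print(best_case_ms, worst_case_ms, headroom_ms)  # 70.0 80.0 0.0
```

At the upper end of the forward-pass estimate the budget is met with no headroom, which is why the 1 ms per-candidate overhead ceiling is stated as a MUST.
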
**Compatibility**
- The `LightGlueRuntime.match_single_pair` API is owned by AZ-278; this task consumes the published method signature. If AZ-278's API evolves (additional args, different return type), this task is the upstream caller that must update — surfaced by the standard tracker dependency mechanism.
- The `TileStore.get_tile_pixels` Public API is owned by AZ-303; same pattern.

**Reliability**
- Drop-and-continue is the primary reliability mechanism — a transient LightGlue CUDA OOM on one candidate must NOT propagate to the whole frame.
- The strategy is single-threaded by contract (INV-1, AZ-342); composition root binds it to the same ingest thread as C3 (because they share `LightGlueRuntime`).
- Zero-survivors raises `RerankAllCandidatesFailedError`; downstream C5 falls back to VIO-only with provenance `visual_propagated` (AC-3.5 / description.md § 5 hard-failure path).

## Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|--------------|------------------|
| AC-1 | `isinstance(InlierCountReRanker(...), ReRankStrategy)` | `True` |
| AC-2 | Top-N ordering with mixed inlier counts + ties + zeros | Sorted descending by inlier_count; tie-break ascending by descriptor_distance; zero-inliers dropped |
| AC-3 | LightGlue raises on 4th candidate; 9 succeed | 3 survivors from 9; `candidates_dropped >= 1`; ERROR log + FDR record emitted |
| AC-4 | TileStore raises on 7th candidate | 3 survivors from 9; ERROR log + FDR record emitted |
| AC-5 | Every candidate fails | `RerankAllCandidatesFailedError`; 10 ERROR logs + 1 final ERROR log + 1 FDR record `kind="rerank.all_failed"` |
| AC-6 | 8 fail, 2 succeed | 2 survivors (NOT padded); WARN log emitted |
| AC-7 | `tile_pixels_handle` reference semantics | Identity preserved; mutation visible across reads |
| AC-8 | `descriptor_distance` carried forward | Bit-exact match with input |
| AC-9 | Deterministic — 3 calls with same inputs | Identical `candidates`; differing `reranked_at` |
| AC-10 | `compose_root(config)` with `config.rerank.strategy = "inlier_count"` | Wired; INFO log emitted; helper identity-shared with C3 |
| AC-11 | FDR `kind="rerank.frame_done"` emission | Emitted once per successful call with correct fields |
| AC-12 | Single-pair LightGlue invocation count | Exactly K calls; query_image identity-shared across calls |
| Drop-and-continue mixed | Mixed `RerankBackboneError` + `TileFetchError` + zero-inliers + successes on K=10 | All non-success candidates dropped; logs and FDR records per failure type |

## Constraints

- **Single-pair LightGlue ONLY** — this strategy does NOT use multi-pair LightGlue or batched inference. Per description.md § 5, the inlier count is a SINGLE-PAIR forward pass. Batched LightGlue is a different optimisation path (deferred to a future cycle if K=10 proves too slow).
- **Drop-and-continue is mandatory** — Invariant 8 from the contract is non-negotiable; any per-candidate exception MUST be caught and converted to a drop event. Re-raising a per-candidate exception is forbidden; the only escape from `rerank` is `RerankAllCandidatesFailedError` (zero survivors) or success.
- **`tile_pixels_handle` reference, not copy** — Invariant 6; copying would defeat AC-4.1 latency budget. The C6 contract guarantees the page-cache-backed handle is valid for the duration of the `rerank` call (TTL covers a per-frame window).
- **Constructor injection only** — no `import gps_denied_onboard.config` inside the strategy module; config is consumed via the `create` factory.
- **`LightGlueRuntime` is constructor-injected, NOT instantiated here** — the runtime root constructs ONE shared instance and passes it to both this strategy AND the C3 matcher (per the helper-ownership R14 fix).
- **Logging respects DEBUG-gating** — per-frame DEBUG logs (zero-inliers, frame-done) are gated by `config.rerank.debug_per_frame_log` (default false); flooding journald at 3 Hz × K=10 = 30 events/sec by default would violate the spirit of description.md § 9.
- **FDR `kind="rerank.frame_done"` is NOT gated** — it is the primary forensic record for AC-NEW-7 cache-poisoning post-flight analysis; emission rate is 3 Hz which fits FDR's 200 Hz aggregate budget (AC-NEW-3 / E-C13 NFR).

## Risks & Mitigation

**Risk 1: `LightGlueRuntime.match_single_pair` API does not yet exist on AZ-278's helper**
- *Risk*: AZ-278 is in `_docs/02_tasks/todo/`; its API surface may not include a `match_single_pair` method explicitly — only a generic `match` or `match_batch`.
- *Mitigation*: This task documents the expected API surface (`match_single_pair(query_image, support_image, calibration) -> InlierCount`); if AZ-278 ships only a batched API, a thin per-call wrapper around the batched API can stay inside this strategy (one batch of size 1). Surface to AZ-278 implementer at decompose-step-4 cross-verification time as a coordination point.

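
The batch-of-one fallback described in the Risk 1 mitigation could look roughly like this. Everything about the batched side is hypothetical: `match_batch`, its parameter names, and its list-of-ints return value are assumptions for illustration, since AZ-278's actual surface is not defined in this document.

```python
from typing import Any, Protocol, Sequence


class BatchedMatcher(Protocol):
    """Hypothetical batched helper surface; NOT AZ-278's confirmed API."""

    def match_batch(
        self,
        query_images: Sequence[Any],
        support_images: Sequence[Any],
        calibration: Any,
    ) -> Sequence[int]:
        ...


class SinglePairAdapter:
    """Thin wrapper exposing a single-pair call over a batched matcher."""

    def __init__(self, batched: BatchedMatcher) -> None:
        self._batched = batched  # held by reference; lifecycle stays with the owner

    def match_single_pair(self, query_image: Any, support_image: Any, calibration: Any = None) -> int:
        # One batch of size 1, so the per-candidate inlier count is preserved.
        counts = self._batched.match_batch([query_image], [support_image], calibration)
        if len(counts) != 1:
            raise RuntimeError(f"expected 1 inlier count, got {len(counts)}")
        return counts[0]
```

Raising `RuntimeError` on a malformed batch result keeps the failure inside the exception family that the per-candidate loop already converts into a drop.
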
**Risk 2: 10 single-pair LightGlue calls saturate the GPU stream and serialise behind C3's per-pair work**
- *Risk*: The shared `LightGlueRuntime` requires serial access (Invariant 1 + helper contract); if C3's matcher also calls the helper concurrently from another thread (which it shouldn't), deadlock or cross-frame data corruption could result.
- *Mitigation*: Composition root binds C2.5 and C3 to the SAME single ingest thread (per AZ-342 AC-10); the helper's serial-access invariant is satisfied by single-thread binding. The helper itself MAY add an internal assertion that the calling thread matches the binding thread (owned by AZ-278). C2.5-IT-03 verifies the serial-access invariant.

**Risk 3: `tile_pixels_handle` lifetime exceeds the C6 page-cache TTL**
- *Risk*: The handle is a reference; if C6 evicts the page before C3 reads from it (in a future frame), C3 sees stale or zero pixels.
- *Mitigation*: C6's contract guarantees the handle is valid for the duration of the per-frame pipeline window (~333 ms at 3 Hz). C3 must consume the handle within the same frame; the per-frame pipeline orchestration (runtime root) enforces no cross-frame retention. `RerankResult` is consumed-once.

**Risk 4: `inlier_count == 0` is treated as a drop, but a legitimately-low-overlap match might still have value to C3**
- *Risk*: Dropping zero-inlier candidates may discard a candidate that C3 could rescue with its more powerful cross-domain matcher.
- *Mitigation*: Per description.md § 7 caveats, the re-rank correctness depends on inlier count being a meaningful proxy. Zero inliers means the LightGlue forward pass found NO geometric agreement at all — C3's cross-domain matcher would also fail because it operates on an inferior fixed-domain prior. Dropping zero-inliers is the right call. If a future cycle finds counter-examples, the threshold (today `inlier_count >= 1`, generalised to `inlier_count >= MIN_RERANK_INLIERS`) becomes a config knob.

**Risk 5: `monotonic_ns()` call on the hot path is non-trivial in CPython**
- *Risk*: 3 Hz × 1 `reranked_at` stamp per `rerank` call = 3 timestamp calls per second; `time.monotonic_ns()` is ~50 ns each; negligible.
- *Mitigation*: No mitigation needed; called out for completeness in case profiling later identifies it.

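
If profiling ever does flag the timestamp call, its cost is easy to measure directly. The snippet below is a rough micro-benchmark using the standard library; it assumes one `reranked_at` stamp per frame at 3 Hz, and the measured figure will vary by machine and interpreter build.

```python
import time
import timeit

calls = 100_000
# timeit returns the total wall time in seconds for `calls` invocations.
total_s = timeit.timeit(time.monotonic_ns, number=calls)
per_call_ns = total_s / calls * 1e9

frames_per_second = 3   # pipeline rate quoted in this document
per_second_ns = per_call_ns * frames_per_second

print(f"~{per_call_ns:.0f} ns per call, ~{per_second_ns:.0f} ns of timestamping per second")
```
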
## Runtime Completeness

- **Named capability**: `InlierCountReRanker` — production-default `ReRankStrategy` for K=10 → N=3 by single-pair LightGlue inlier count (architecture / E-C2.5 / `solution.md` "single-pair LightGlue inlier count" / AC-2.5-IT-01 + AC-4.1).
- **Production code that must exist**: real `InlierCountReRanker` calling real `LightGlueRuntime.match_single_pair` with the real shared TRT-compiled LightGlue engine; real `TileStore.get_tile_pixels` page-cache-backed handle fetch; real composition-root wiring through the AZ-342 factory.
- **Allowed external stubs**: tests MAY use `FakeLightGlueRuntime` returning pre-computed inlier counts (AC-2..AC-9), `FakeTileStore` returning a fake handle (AC-4 / AC-7 / AC-10), `FakeFdrClient` (verifying FDR record emission), a synthetic frame fixture; production wiring uses the real C6 + AZ-278 helper + LightGlue engine.
- **Unacceptable substitutes**:
  - a pure-Python NumPy implementation of LightGlue inlier counting (would not satisfy C2.5-PT-01 latency at 80 ms p95; would defeat the GPU-bound architectural choice);
  - skipping the drop-and-continue contract and propagating per-candidate exceptions (would break Invariant 8 from the contract);
  - copying tile pixels into the `RerankCandidate` instead of holding the C6 page-cache handle (would violate Invariant 6 and inflate per-frame allocations);
  - calling LightGlue in batched mode without a per-candidate inlier breakdown (would lose the inlier-per-candidate signal needed for ranking);
  - instantiating a SECOND `LightGlueRuntime` for C2.5 instead of consuming the runtime-root-shared one (would double GPU memory and break the helper-ownership R14 fix);
  - ignoring zero-survivors and returning an empty `RerankResult` instead of raising `RerankAllCandidatesFailedError` (would propagate empty input to C3 instead of triggering the C5 VIO-only fallback).