gps-denied-onboard/_docs/02_tasks/todo/AZ-345_c3_disk_lightglue.md
Oleksandr Bezdieniezhnykh 880eabcb3f Decompose Step 6 snapshot: 140 task specs + contract docs
Closes out greenfield Step 6 (Decompose) for all 14 components
(C1-C13 + cross-cutting helpers/replay). Covers tasks AZ-266..AZ-446
plus the _dependencies_table.md and component contract documents.

State file updated to greenfield Step 7 (Implement), not_started.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-11 00:39:48 +03:00


C3 DISK+LightGlue Primary Matcher

Task: AZ-345_c3_disk_lightglue
Name: C3 DISK+LightGlue Primary Matcher
Description: Implement DiskLightGlueMatcher, the production-default CrossDomainMatcher (per D-C3-1 = (a)). For each top-N=3 candidate in a RerankResult: extract DISK keypoints + descriptors from the nav-camera frame and the candidate tile via the C7 InferenceRuntime (TensorRT 10.3 FP16 primary, ONNX-Runtime fallback); match keypoints via the shared LightGlueRuntime helper (AZ-278); filter inliers + compute median reprojection residual via the shared RansacFilter helper (AZ-282); record the result in a CandidateMatchSet. Sort surviving candidates descending by inlier count (tie-break: lower median residual ranked higher); return the best as MatchResult.best_candidate_idx. Implements the drop-and-continue contract (Invariant 4) for per-candidate MatcherBackboneError. Updates the constructor-injected RollingHealthWindow after each frame. Composition-root wired via the AZ-344 factory.
Complexity: 5 points
Dependencies: AZ-344 (Protocol + factory + DTOs + errors + RollingHealthWindow), AZ-263_initial_structure, AZ-269_config_loader, AZ-278_lightglue_runtime (shared LightGlue helper), AZ-282_ransac_filter (shared RANSAC helper), AZ-298_c7_tensorrt_runtime (DISK forward via TRT), AZ-299_c7_onnxrt_fallback (DISK forward via ONNX-RT fallback), AZ-303_c6_storage_interfaces (tile_pixels_handle from RerankResult; tile pixel decode), AZ-281_engine_filename_schema (DISK engine self-describing filename), AZ-321_c10_engine_compiler (DISK + LightGlue engine compile path), AZ-266_log_module, AZ-272_fdr_record_schema
Component: c3_matcher (epic AZ-257 / E-C3)
Tracker: AZ-345
Epic: AZ-257 (E-C3)

Document Dependencies

  • _docs/02_document/contracts/c3_matcher/cross_domain_matcher_protocol.md — Protocol contract (every invariant satisfied; drop-and-continue is INV-4).
  • _docs/02_document/components/04_c3_matcher/description.md — § 1 D-C3-1 = (a) production-default; § 5 error handling; § 7 shared helper serial access; § 9 logging.
  • _docs/02_document/module-layout.md: c3_matcher Per-Component Mapping (disk_lightglue.py Internal); BUILD_MATCHER_DISK_LIGHTGLUE row (ON for airborne / research / replay-cli).
  • _docs/02_document/contracts/shared_helpers/lightglue_runtime.md — single-pair / multi-pair API.
  • _docs/02_document/contracts/shared_helpers/ransac_filter.md — RANSAC + median residual API.
  • _docs/02_document/contracts/c2_5_rerank/rerank_strategy_protocol.md: RerankResult consumed at input boundary.
  • _docs/02_document/contracts/c7_inference/inference_runtime_protocol.md — DISK forward via InferenceRuntime.
  • _docs/02_document/components/04_c3_matcher/tests.md — C3-IT-01 (best-candidate inlier count p5 ≥ 80); C3-IT-02 (deterministic best_candidate_idx); C3-IT-03 (cross-domain MRE p95 < 2.5 px); C3-IT-04 (tilt ±20° + 350m outliers); C3-IT-05 (InsufficientInliersError propagation); C3-PT-01 (latency p95 ≤ 180 ms; per-candidate ≤ 60 ms; GPU mem ≤ 800 MB).

Problem

Without this task: compose_root cannot wire when config.matcher.strategy = "disk_lightglue" (the default value); F3 / F6 cannot run; AC-1.1 (best-candidate inlier count p5 ≥ 80) has no producer; AC-2.2 (cross-domain MRE p95 < 2.5 px) is unmeasurable; AC-NEW-7 cache-poisoning safety budget loses its primary detection signal (low-inlier frames in MatcherHealth). The DISK+LightGlue choice is locked per Mode B Fact #110 / D-C3-1; without this task the locked decision is unrealised.

Outcome

  • src/gps_denied_onboard/components/c3_matcher/disk_lightglue.py defining:
    • DiskLightGlueMatcher class implementing the CrossDomainMatcher Protocol (AZ-344).
    • Constructor: __init__(self, runtime: InferenceRuntime, lightglue_runtime: LightGlueRuntime, ransac_filter: RansacFilter, fdr_client: FdrClient, health_window: RollingHealthWindow, config: MatcherConfig). The strategy holds the DISK engine ID (returned by runtime.load_engine) plus references to the constructor-injected LightGlueRuntime + RansacFilter.
    • match(frame, rerank_result, calibration):
      1. Decode + preprocess the nav-camera frame ONCE (resize / normalise per DISK's input contract).
      2. Run DISK forward on the query frame → (query_keypoints, query_descriptors).
      3. survivors: list[CandidateMatchSet] = [], dropped = 0.
      4. For each RerankCandidate in rerank_result.candidates:
         a. Decode + preprocess the candidate tile (from tile_pixels_handle).
         b. Try DISK forward on the tile → (tile_keypoints, tile_descriptors). On failure: wrap as MatcherBackboneError; emit ERROR log + FDR record kind="matcher.backbone_error" with tile_id + phase="disk_forward"; dropped += 1; continue.
         c. Try lightglue_runtime.match_pair(query_keypoints, query_descriptors, tile_keypoints, tile_descriptors) → correspondences (raw matches before RANSAC). On failure: wrap as MatcherBackboneError; phase="lightglue_match"; drop; continue.
         d. ransac_result = ransac_filter.filter(correspondences, threshold_px=config.ransac_threshold_px) → RansacResult(inlier_correspondences, ransac_outlier_count, per_candidate_residual_px). The helper handles RANSAC + median residual computation.
         e. If ransac_result.inlier_correspondences.shape[0] == 0: emit DEBUG log kind="c3.matcher.zero_inliers"; dropped += 1; continue.
         f. Append CandidateMatchSet(tile_id=candidate.tile_id, inlier_count=ransac_result.inlier_correspondences.shape[0], inlier_correspondences=ransac_result.inlier_correspondences, ransac_outlier_count=ransac_result.ransac_outlier_count, per_candidate_residual_px=ransac_result.per_candidate_residual_px) to survivors.
      5. Determine survivor_max_inliers = max(s.inlier_count for s in survivors) (or 0 if empty).
      6. If len(survivors) == 0 OR survivor_max_inliers < config.min_inliers_threshold: emit ERROR log kind="c3.matcher.insufficient_inliers" + FDR record kind="matcher.insufficient_inliers"; health_window.update(now, best_inlier_count=0, had_backbone_error=(dropped > 0)); raise InsufficientInliersError.
      7. Sort survivors descending by inlier_count; ties broken by per_candidate_residual_px ascending. The first survivor is the best.
      8. best = survivors[0]. If best.per_candidate_residual_px > config.residual_warn_threshold_px: emit WARN log kind="c3.matcher.residual_above_threshold" (will trigger AdHoP at C3.5).
      9. health_window.update(now, best_inlier_count=best.inlier_count, had_backbone_error=(dropped > 0)).
      10. Emit FDR record kind="matcher.frame_done" with {frame_id, candidates_input, candidates_dropped, best_inlier_count, best_residual_px, best_tile_id}.
      11. Return MatchResult(frame_id=rerank_result.frame_id, per_candidate=survivors, best_candidate_idx=0, reprojection_residual_px=best.per_candidate_residual_px, matched_at=monotonic_ns(), matcher_label="disk_lightglue", candidates_input=len(rerank_result.candidates), candidates_dropped=dropped).
    • health_snapshot(): returns self._health_window.snapshot().
    • Module-level create(config, lightglue_runtime, ransac_filter, inference_runtime, health_window) -> CrossDomainMatcher:
      1. disk_weights_path = config.matcher.disk_weights_path (TRT engine produced by AZ-321).
      2. Load DISK engine via inference_runtime.load_engine(disk_weights_path).
      3. Construct DiskLightGlueMatcher(...).
  • Composition-root wiring path for config.matcher.strategy == "disk_lightglue".
  • Logging per description.md § 9: INFO ready; WARN residual-above-threshold; ERROR insufficient-inliers + backbone-error; DEBUG per-frame inlier+residual list (gated).
  • FDR records: matcher.frame_done (always per frame), matcher.backbone_error (per error), matcher.insufficient_inliers (per all-failed event).
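
The control flow of match(...) steps 3 to 9 can be sketched as below. This is a minimal sketch under stated assumptions, not the AZ-345 implementation: Survivor is a hypothetical stand-in for CandidateMatchSet, match_one collapses steps 4a to 4d into a single callable, health_update stands in for RollingHealthWindow.update, and logging and FDR emission are omitted.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


class MatcherBackboneError(RuntimeError):
    """Per-candidate failure in DISK forward or LightGlue matching."""


class InsufficientInliersError(RuntimeError):
    """No candidate survived, or the best one is below the inlier threshold."""


@dataclass(frozen=True)
class Survivor:
    # Hypothetical stand-in for CandidateMatchSet (correspondences omitted).
    tile_id: str
    inlier_count: int
    residual_px: float


def match_candidates(
    tile_ids: Iterable[str],
    match_one: Callable[[str], Survivor],
    min_inliers: int,
    health_update: Callable[[int, bool], None],
) -> tuple[list[Survivor], int]:
    """Return (survivors sorted best-first, dropped count)."""
    survivors: list[Survivor] = []
    dropped = 0
    for tile_id in tile_ids:
        try:
            result = match_one(tile_id)  # steps 4a-4d collapsed into one call
        except MatcherBackboneError:
            dropped += 1  # steps 4b/4c: drop and continue, never propagate
            continue
        if result.inlier_count == 0:
            dropped += 1  # step 4e: zero inliers is also a drop
            continue
        survivors.append(result)  # step 4f

    best_inliers = max((s.inlier_count for s in survivors), default=0)  # step 5
    if not survivors or best_inliers < min_inliers:
        # Step 6: the health update still happens exactly once before raising.
        health_update(0, dropped > 0)
        raise InsufficientInliersError(
            f"best inlier count {best_inliers} < threshold {min_inliers}"
        )

    # Step 7: more inliers first; lower median residual breaks ties.
    survivors.sort(key=lambda s: (-s.inlier_count, s.residual_px))
    health_update(survivors[0].inlier_count, dropped > 0)  # step 9
    return survivors, dropped
```

Because the list is sorted before returning, the best candidate is always at index 0, which is why step 11 can hard-code best_candidate_idx=0. Each exit path calls health_update once, matching the "exactly once per match" constraint.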

Scope

Included

  • DiskLightGlueMatcher class implementing CrossDomainMatcher exactly per the AZ-344 contract.
  • DISK forward via C7 InferenceRuntime (TRT primary; ONNX-RT fallback chain owned by C7 — this task consumes the unified interface).
  • LightGlue matching via shared helper.
  • RANSAC + median residual via shared RansacFilter helper.
  • Drop-and-continue per-candidate error handling (Invariant 4).
  • Below-threshold all-failed → InsufficientInliersError.
  • Deterministic best-candidate selection (Invariant 3).
  • RollingHealthWindow.update after each frame.
  • Composition-root wiring path.
  • Logging + FDR record emission per description.md § 9.
  • Unit tests covering Invariants 1-9, drop-and-continue, below-threshold, deterministic ordering, tile_pixels_handle reference semantics, composition-root wiring path.
  • BUILD_MATCHER_DISK_LIGHTGLUE flag wiring (ON in airborne / research / replay-cli; OFF in operator-tooling).

Excluded

  • The Protocol + DTOs + errors + factory + RollingHealthWindow — owned by AZ-344.
  • The LightGlueRuntime helper — already AZ-278.
  • The RansacFilter helper — already AZ-282.
  • The C7 InferenceRuntime — owned by AZ-297..AZ-300.
  • DISK engine compile (.onnx → .trt) — owned by AZ-321; this task consumes the produced engine.
  • ALIKED+LightGlue (AZ-346) and XFeat (AZ-347).
  • Component-internal acceptance tests beyond Invariants 1-9 + drop-and-continue smoke: C3-IT-01 (recall floor), C3-IT-03 (cross-domain MRE), C3-IT-04 (tilt outliers), and C3-PT-01 (latency NFR) are deferred to Step 9 / E-BBT.

Acceptance Criteria

AC-1: Protocol conformance
isinstance(DiskLightGlueMatcher(...), CrossDomainMatcher) returns True.
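
An isinstance check against a typing.Protocol only works when the Protocol is decorated with @runtime_checkable; without the decorator, isinstance raises TypeError. The sketch below assumes AZ-344 declares CrossDomainMatcher that way (the source does not state this) and uses only the two method names that appear in this spec. The method bodies and NotAMatcher are placeholders for illustration.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class CrossDomainMatcher(Protocol):
    # Assumed shape of the AZ-344 Protocol; only these two names come from the spec.
    def match(self, frame, rerank_result, calibration): ...
    def health_snapshot(self): ...


class DiskLightGlueMatcher:
    # Structural conformance: no inheritance from the Protocol is needed.
    def match(self, frame, rerank_result, calibration):
        raise NotImplementedError

    def health_snapshot(self):
        return {}


class NotAMatcher:
    # Missing health_snapshot, so it does not satisfy the Protocol.
    def match(self, frame, rerank_result, calibration):
        raise NotImplementedError


conforms = isinstance(DiskLightGlueMatcher(), CrossDomainMatcher)
rejects = isinstance(NotAMatcher(), CrossDomainMatcher)
```

The runtime check only verifies that the members exist; it does not compare signatures or return types, so AC-1 is a necessary smoke test rather than proof of full contract conformance.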

AC-2: Best-candidate selection (argmax(inlier_count) + tie-break)
Given a RerankResult with N=3 candidates whose computed inlier counts are [120, 80, 120] and median residuals [1.4, 1.0, 1.1]
When match(...) is called
Then best_candidate_idx == 0, pointing at the candidate with inlier_count=120 and residual=1.1 (lower than the other 120-inlier candidate's 1.4); per_candidate[0].inlier_count == 120 AND per_candidate[0].per_candidate_residual_px == 1.1; per_candidate[1].inlier_count == 120 AND per_candidate[1].per_candidate_residual_px == 1.4; per_candidate[2].inlier_count == 80.

AC-3: Drop-and-continue on per-candidate MatcherBackboneError
Given an InferenceRuntime test double that raises RuntimeError on the 2nd candidate's DISK forward and succeeds on others
When match(...) is called
Then len(per_candidate) == 2; candidates_dropped == 1; ONE ERROR log kind="c3.matcher.backbone_error" is emitted with tile_id + phase="disk_forward"; ONE FDR record kind="matcher.backbone_error" is emitted; success path continues.
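
One way to build the AC-3 test double is unittest.mock.Mock with a side_effect list, where an exception instance in the list is raised on that call. The sketch below is illustrative only: the method name forward, the MatcherBackboneError constructor arguments, and the helper functions are assumptions, since the InferenceRuntime and error signatures are defined by other tasks.

```python
from unittest.mock import Mock


class MatcherBackboneError(RuntimeError):
    # Hypothetical constructor; the real error class is owned by AZ-344.
    def __init__(self, tile_id, phase, cause):
        super().__init__(f"{phase} failed for tile {tile_id}: {cause}")
        self.tile_id = tile_id
        self.phase = phase


def disk_forward_or_wrap(runtime, tile_id):
    """Wrap any backbone failure so callers handle a single error type."""
    try:
        return runtime.forward(tile_id)  # 'forward' is an assumed method name
    except Exception as exc:
        raise MatcherBackboneError(tile_id, "disk_forward", exc) from exc


def extract_all(runtime, tile_ids):
    """Drop-and-continue: collect features for good tiles, errors for bad ones."""
    features, errors = {}, []
    for tile_id in tile_ids:
        try:
            features[tile_id] = disk_forward_or_wrap(runtime, tile_id)
        except MatcherBackboneError as err:
            errors.append(err)  # a real matcher would also log + emit an FDR record
    return features, errors


# Test double: succeeds on calls 1 and 3, raises RuntimeError on call 2.
runtime = Mock()
runtime.forward.side_effect = [
    ("kp0", "desc0"),
    RuntimeError("TRT execution failed"),
    ("kp2", "desc2"),
]
features, errors = extract_all(runtime, ["t0", "t1", "t2"])
```

Chaining with `from exc` keeps the original RuntimeError reachable through `__cause__`, which helps when the ERROR log needs the underlying backbone message.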

AC-4: Drop-and-continue on per-candidate LightGlue failure
Given a LightGlueRuntime test double that raises on the 1st candidate's match call
When match(...) is called
Then the candidate is dropped with phase="lightglue_match"; ERROR log + FDR record emitted; remaining candidates processed.

AC-5: Below-threshold → InsufficientInliersError
Given config.matcher.min_inliers_threshold = 60 AND every candidate's RANSAC inlier count is < 60
When match(...) is called
Then InsufficientInliersError is raised; ONE ERROR log kind="c3.matcher.insufficient_inliers" + ONE FDR record kind="matcher.insufficient_inliers" are emitted; health_window.update(now, best_inlier_count=0, had_backbone_error=False) is invoked.

AC-6: All-failed → InsufficientInliersError
Given every candidate's DISK forward raises
When match(...) is called
Then InsufficientInliersError is raised; per-candidate ERROR logs + final ERROR log emitted; health_window.update(now, best_inlier_count=0, had_backbone_error=True) is invoked.

AC-7: WARN log on residual above threshold
Given the best candidate's per_candidate_residual_px = 4.2 AND config.matcher.residual_warn_threshold_px = 2.5
When match(...) returns
Then ONE WARN log kind="c3.matcher.residual_above_threshold" with {residual_px: 4.2, threshold_px: 2.5} is emitted.

AC-8: health_window.update invoked after every match (success or failure)
Given any match(...) call (success, partial drop, all-failed)
When the call completes (returns normally OR raises InsufficientInliersError)
Then health_window.update(...) is invoked exactly ONCE for that frame; best_inlier_count matches the actual best inlier count (0 on all-failed); had_backbone_error == True if any candidate dropped due to backbone failure.

AC-9: inlier_correspondences shape contract
Given a successful match(...)
When inspecting any CandidateMatchSet
Then inlier_correspondences.shape == (inlier_count, 4); dtype == float32.

AC-10: Deterministic: same inputs → bit-identical MatchResult
Given fixed inputs and deterministic test doubles
When match(...) is called 3 times
Then all three returns have identical per_candidate content (same inlier_counts, same residuals, same best_candidate_idx).

AC-11: Composition-root wiring
Given config.matcher.strategy = "disk_lightglue" AND a constructed shared LightGlueRuntime AND RansacFilter AND InferenceRuntime
When compose_root(config) runs
Then a DiskLightGlueMatcher instance is wired; ONE INFO log kind="c3.matcher.ready" with {strategy: "disk_lightglue", min_inliers_threshold, residual_warn_threshold_px} is emitted; the strategy's _lightglue_runtime is identity-equal to the runtime root's shared helper.

AC-12: FDR matcher.frame_done per frame
Given a successful match(...) returning best candidate with inlier_count=120 and residual=1.1, dropped=1
When the call completes
Then ONE FDR record kind="matcher.frame_done" is emitted with structured fields {frame_id, candidates_input: 3, candidates_dropped: 1, best_inlier_count: 120, best_residual_px: 1.1, best_tile_id: <tuple>}.

Non-Functional Requirements

Performance (deferred validation to C3-PT-01):

  • match p95 ≤ 180 ms (3 candidates × ~60 ms DISK forward + LightGlue match + RANSAC).
  • Per-candidate p95 ≤ 60 ms.
  • GPU memory ≤ 800 MB combined (DISK engine + LightGlue engine resident).

Compatibility

  • DISK engine file format owned by C10 + C7; this task consumes via config.matcher.disk_weights_path.
  • Upstream DISK research code drop pinned per Plan-phase; weight changes require C10 rebuild + C3-IT-03 re-run.

Reliability

  • Drop-and-continue per candidate (Invariant 4).
  • Single-thread by contract (INV-1).
  • InsufficientInliersError triggers C5 VIO-only fallback (AC-3.5); does NOT crash.

Unit Tests

| AC Ref | What to Test | Required Outcome |
|--------|--------------|------------------|
| AC-1 | Protocol conformance | isinstance returns True |
| AC-2 | Best-candidate + tie-break | Lower residual wins among tied inliers |
| AC-3 | DISK forward fails on 2nd | 2 survivors; ERROR log + FDR record |
| AC-4 | LightGlue fails on 1st | 2 survivors; phase="lightglue_match" |
| AC-5 | All below threshold | InsufficientInliersError; health update |
| AC-6 | All forwards fail | InsufficientInliersError; per-candidate logs |
| AC-7 | Residual > warn threshold | WARN log emitted |
| AC-8 | Health update invoked once per match | One update per call regardless of outcome |
| AC-9 | Correspondences shape | (I, 4) float32; I == inlier_count |
| AC-10 | Determinism | 3 calls return identical content |
| AC-11 | compose_root wiring | Wired; INFO log; helper identity-shared |
| AC-12 | FDR frame_done emission | Correct structured fields |

Constraints

  • Drop-and-continue is mandatory — Invariant 4; per-candidate exceptions never propagate.
  • Median residual, not mean — Invariant 8; computed inside RansacFilter.
  • Constructor injection only — no import gps_denied_onboard.config inside the strategy module.
  • LightGlueRuntime and RansacFilter are constructor-injected — never instantiated here.
  • DISK engine load at create time, NOT at first frame — engine-output assertion fires at startup.
  • Tile pixel decode is per-call — but the underlying tile_pixels_handle is page-cache-backed (not copied into the strategy).
  • RollingHealthWindow.update is called EXACTLY once per match — including the all-failed path.
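
The constructor-injection and load-at-create constraints can be illustrated as follows. This is a sketch under assumptions: FakeInferenceRuntime is a hypothetical test double, the engine ID is passed to the constructor as an extra argument (the spec does not say how the strategy receives it), and fdr_client and config are left out for brevity.

```python
class FakeInferenceRuntime:
    """Hypothetical test double; only load_engine is named in the spec."""

    def __init__(self, available_paths):
        self._available = set(available_paths)
        self.loaded = []

    def load_engine(self, path):
        if path not in self._available:
            raise FileNotFoundError(path)
        self.loaded.append(path)
        return len(self.loaded)  # opaque engine ID


class DiskLightGlueMatcher:
    def __init__(self, runtime, lightglue_runtime, ransac_filter,
                 health_window, disk_engine_id):
        # Every collaborator is injected; nothing is instantiated or imported here.
        self._runtime = runtime
        self._lightglue_runtime = lightglue_runtime
        self._ransac_filter = ransac_filter
        self._health_window = health_window
        self._disk_engine_id = disk_engine_id


def create(disk_weights_path, lightglue_runtime, ransac_filter,
           inference_runtime, health_window):
    # Load eagerly so a missing or broken engine fails at startup,
    # not on the first frame in flight.
    engine_id = inference_runtime.load_engine(disk_weights_path)
    return DiskLightGlueMatcher(
        inference_runtime, lightglue_runtime, ransac_filter,
        health_window, engine_id,
    )
```

With this shape, a test for AC-11 can pass one shared LightGlueRuntime object into create and assert that the matcher's _lightglue_runtime is that same object.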

Risks & Mitigation

Risk 1: DISK upstream code drop ships an unsupported ONNX op for TRT 10.3

  • Mitigation: engine compile is C10's responsibility (AZ-321). If C10 cannot build the engine, this task is blocked upstream — surface via tracker dependency mechanism.

Risk 2: LightGlueRuntime.match_pair API not yet defined

  • Mitigation: AZ-278 defines the helper API; this task consumes whatever AZ-278 ships. If only single-pair is provided, this task wraps single-pair calls in a per-candidate loop (already structured that way). Surface to AZ-278 implementer at decompose-step-4.

Risk 3: Tile pixel decode is non-trivial cost on hot path

  • Mitigation: tile pixels arrive as page-cache-backed handles from C6; decode (JPEG → ndarray) happens once per candidate. If profiling shows this is a bottleneck, a future optimization pre-decodes adjacent tiles in C6's mmap layer.

Risk 4: Deterministic best-candidate tie-break depends on stable sort

  • Mitigation: Python's sort is stable for both list.sort() and sorted(); the implementation uses sorted(survivors, key=lambda s: (-s.inlier_count, s.per_candidate_residual_px)), which is deterministic. Candidates that tie on both inlier count and residual keep their rerank input order. Test AC-2 asserts the exact ordering on a tie scenario.
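
A worked example of the sort key, using the AC-2 numbers plus one extra fully tied candidate to show the stable-sort behaviour. The namedtuple S and the tile IDs are placeholders for illustration, not the real CandidateMatchSet.

```python
from collections import namedtuple

# Placeholder record with the two fields the sort key reads.
S = namedtuple("S", "tile_id inlier_count per_candidate_residual_px")

survivors = [
    S("a", 120, 1.4),
    S("b", 80, 1.0),
    S("c", 120, 1.1),
    S("d", 120, 1.1),  # full tie with "c": same inliers, same residual
]

# Negated inlier count gives descending order; residual stays ascending.
ranked = sorted(
    survivors,
    key=lambda s: (-s.inlier_count, s.per_candidate_residual_px),
)
order = [s.tile_id for s in ranked]  # ['c', 'd', 'a', 'b']
```

"c" stays ahead of "d" only because it came first in the input, so determinism for full ties relies on the rerank order itself being deterministic.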

Risk 5: RollingHealthWindow drift between matcher implementations

  • Mitigation: ONE RollingHealthWindow class owned by AZ-344; constructor-injected into every concrete matcher. AZ-345/AZ-346/AZ-347 use the same instance type via the same constructor injection.

Runtime Completeness

  • Named capability: DiskLightGlueMatcher — production-default CrossDomainMatcher for cross-domain feature matching (architecture / E-C3 / solution.md / D-C3-1 / AC-1.1 + AC-2.2 + AC-3.1).
  • Production code that must exist: real DiskLightGlueMatcher calling real C7 InferenceRuntime with real TRT-compiled DISK engine; real shared LightGlueRuntime calls; real shared RansacFilter for inlier filtering + median residual; real RollingHealthWindow.update after each frame; real composition-root wiring.
  • Allowed external stubs: FakeInferenceRuntime, FakeLightGlueRuntime, FakeRansacFilter, FakeFdrClient, synthetic frame fixtures for unit tests.
  • Unacceptable substitutes: a Python+NumPy implementation of DISK forward (would not satisfy C3-PT-01 latency); a different RANSAC implementation per matcher (would defeat AZ-282 helper); skipping RollingHealthWindow.update on the all-failed path (would lose the health signal C5 needs); calling LightGlueRuntime in batch mode without per-candidate inlier breakdown; using the mean residual instead of the median (would violate INV-8).