Update autodev state, architecture documentation, and glossary terms

Transitioned the autodev state to phase 21, reflecting the completion of Step 5 and the drafting of Step 6 epics. Revised the architecture documentation to clarify the roles of the Tile Manager and its components, ensuring accurate representation of the system's operational flow. Updated glossary entries for Flight State and Operator to incorporate recent changes and enhance clarity on component interactions and responsibilities.
Oleksandr Bezdieniezhnykh
2026-05-10 00:21:34 +03:00
parent 723f574b14
commit 64542d32fc
52 changed files with 8789 additions and 88 deletions
@@ -0,0 +1,111 @@
# C2.5 — Inlier-based Re-rank
## 1. High-Level Overview
**Purpose**: re-rank C2's top-K=10 VPR candidates down to top-N=3 by single-pair LightGlue inlier count, producing a higher-precision input for the cross-domain matcher (C3). The re-rank step is the architectural boundary between cheap descriptor retrieval (C2) and expensive cross-domain matching (C3) — it pays a small extra cost so C3 only operates on the most promising candidates.
**Architectural Pattern**: Strategy (single concrete implementation today: `InlierCountReRanker`). Future re-rank algorithms can be added as additional `ReRankStrategy` implementations behind the same interface.
**Upstream dependencies**:
- C2 → `VprResult` (top-K=10 candidates).
- Shared `LightGlueRuntime` helper (used in single-pair mode for inlier counting; the same matcher object is shared with C3 — owned by the helper, not by C3, so neither component depends on the other at build time).
- C6 TileStore → fetch tile pixels for each candidate (cheap, in-memory page-cache hit during a flight).
- Camera calibration artifact — for nav-frame preprocessing.
**Downstream consumers**:
- C3 CrossDomainMatcher (consumes `RerankResult`).
## 2. Internal Interfaces
### Interface: `ReRankStrategy`
| Method | Input | Output | Async | Error Types |
|--------|-------|--------|-------|-------------|
| `rerank` | `NavCameraFrame, VprResult, n: int` | `RerankResult` | No | `RerankBackboneError`, `TileFetchError` |
**Input DTOs**:
```
NavCameraFrame: see C1
VprResult: see C2
```
**Output DTOs**:
```
RerankResult:
frame_id: uuid
candidates: list[RerankCandidate] (length ≤ n, n=3; shorter only after per-candidate drops; ranked by inlier_count descending)
reranked_at: monotonic_ns
RerankCandidate:
tile_id: composite (zoomLevel, lat, lon)
inlier_count: int — single-pair LightGlue inliers
descriptor_distance: float — carried forward from C2 for FDR provenance
tile_pixels_handle: Tile pixel reference (do not copy — page-cache hit)
```
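The DTO listing above could be expressed as typed Python roughly as follows. This is a minimal sketch, not the project's actual definitions: the tuple encoding of `tile_id`, the `Any` placeholders for `NavCameraFrame` / `VprResult` / the pixel handle, and the `build_result` helper are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Any, Protocol
from uuid import UUID, uuid4


@dataclass(frozen=True)
class RerankCandidate:
    tile_id: tuple[int, float, float]  # composite (zoomLevel, lat, lon)
    inlier_count: int                  # single-pair LightGlue inliers
    descriptor_distance: float         # carried forward from C2
    tile_pixels_handle: Any            # reference only, never a pixel copy


@dataclass(frozen=True)
class RerankResult:
    frame_id: UUID
    candidates: tuple[RerankCandidate, ...]  # inlier_count descending
    reranked_at: int                         # monotonic nanoseconds


class ReRankStrategy(Protocol):
    # Placeholder types: the real NavCameraFrame / VprResult live in C1 / C2.
    def rerank(self, frame: Any, vpr_result: Any, n: int) -> RerankResult: ...


def build_result(frame_id: UUID, scored: list[RerankCandidate], n: int = 3) -> RerankResult:
    """Hypothetical helper: order by inlier count and keep at most n."""
    ranked = sorted(scored, key=lambda c: c.inlier_count, reverse=True)[:n]
    return RerankResult(frame_id, tuple(ranked), time.monotonic_ns())


# Toy input: four scored candidates, of which the best three are kept.
demo = build_result(
    uuid4(),
    [RerankCandidate((18, 50.0, 36.0 + i), count, 0.1 * i, None)
     for i, count in enumerate([198, 412, 50, 287])],
)
```

Frozen dataclasses keep the result immutable once it is handed to C3, which fits the stateless per-frame design; whether the real DTOs are immutable is not stated in this document.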
## 3. External API Specification
Not applicable.
## 4. Data Access Patterns
| Query | Frequency | Hot Path | Index Needed |
|-------|-----------|----------|--------------|
| Tile pixel fetch from C6 (10 tiles per frame) | 3 Hz × 10 tiles = 30 fetches/s | Yes | tile filesystem already mmap-backed in C6 |
No caching layer beyond C6's mmap. The same tile may be fetched repeatedly across frames; OS page cache absorbs that cost.
## 5. Implementation Details
**Algorithmic Complexity**: `O(K)` LightGlue forward passes per frame (K=10), each `O(M_tile · M_query)` in feature counts. The whole step is GPU-bound on the same engine that C3 uses — hence the shared LightGlue runtime.
**State Management**: stateless per-frame. Holds a reference to the shared LightGlue object, which is owned by the `LightGlueRuntime` helper (not by C3).
**Key Dependencies**:
| Library | Version | Purpose |
|---------|---------|---------|
| LightGlue (Python) | upstream HEAD pinned per Plan-phase | Single-pair matching for inlier count |
| TensorRT | matches C7 | LightGlue inference engine reuse |
**Error Handling Strategy**:
- `RerankBackboneError`: LightGlue forward pass failed on one or more candidates. Each failing candidate is dropped from the rerank set; if fewer than N=3 candidates survive, C2.5 returns the survivors and C3 proceeds with reduced N.
- `TileFetchError`: C6 read failure for a candidate tile. Same drop-and-continue behaviour as above.
- Hard failure (zero candidates left after rerank): emit no `RerankResult`; C5 falls back to VIO-only with provenance label `visual_propagated`.
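The drop-and-continue behaviour described in the list above can be sketched as below. This is an illustration under stated assumptions, not the `InlierCountReRanker` implementation: `fetch_tile` and `count_inliers` are hypothetical callables standing in for C6 and the shared LightGlue runtime, and returning `None` is just one way to model "emit no `RerankResult`".

```python
class RerankBackboneError(Exception):
    """LightGlue forward pass failed for one candidate."""


class TileFetchError(Exception):
    """C6 could not return pixels for one candidate tile."""


def rerank_drop_and_continue(tile_ids, fetch_tile, count_inliers, n=3):
    """Return up to n (tile_id, inlier_count) pairs, or None if none survive."""
    scored = []
    for tile_id in tile_ids:
        try:
            pixels = fetch_tile(tile_id)
            scored.append((tile_id, count_inliers(pixels)))
        except (RerankBackboneError, TileFetchError):
            continue  # drop this candidate, keep processing the rest
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n] or None  # None models the hard-failure case


# Toy stand-ins: tile 4 fails to load, tile 7 fails in the matcher.
def fake_fetch(tile_id):
    if tile_id == 4:
        raise TileFetchError(tile_id)
    return tile_id


def fake_inliers(pixels):
    if pixels == 7:
        raise RerankBackboneError(pixels)
    return pixels * 10


survivors = rerank_drop_and_continue(range(10), fake_fetch, fake_inliers)
```

With two of ten candidates failing, eight survive and the top three are still returned; the result only shrinks below N when fewer than N candidates survive.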
## 6. Extensions and Helpers
| Helper | Purpose | Used By |
|--------|---------|---------|
| `LightGlueRuntime` | shared LightGlue inference handle (one engine, many call sites) | C2.5, C3 |
## 7. Caveats & Edge Cases
**Known limitations**:
- Re-rank correctness depends on the LightGlue inlier count being a meaningful proxy for cross-domain match quality at single-pair resolution. If a backbone in C2 returns visually similar but geographically wrong candidates, C2.5's inlier count can still rank them above the true match; the AC-NEW-7 cache-poisoning safety budget catches this downstream.
**Potential race conditions**:
- The shared LightGlue runtime is the same object that C3 uses. Access is serial, from a single ingest thread; concurrent calls are forbidden.
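One possible way to backstop the serial-access rule is a lock around every call into the shared runtime, so that an accidental second caller waits instead of interleaving. The sketch below assumes this approach; `SerializedRuntime`, `match_pair`, and `OverlapProbe` are hypothetical names, and the real `LightGlueRuntime` may enforce the invariant differently (or rely purely on the single ingest thread).

```python
import threading
import time


class SerializedRuntime:
    """Wraps a matcher callable so that only one call runs at a time."""

    def __init__(self, match_fn):
        self._match_fn = match_fn
        self._lock = threading.Lock()

    def match_pair(self, query, tile):
        with self._lock:
            return self._match_fn(query, tile)


class OverlapProbe:
    """Toy matcher that records how many calls were ever in flight at once."""

    def __init__(self):
        self.in_flight = 0
        self.max_in_flight = 0

    def __call__(self, query, tile):
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        time.sleep(0.001)  # widen the window in which an overlap could occur
        self.in_flight -= 1
        return (query, tile)


probe = OverlapProbe()
runtime = SerializedRuntime(probe)
threads = [
    threading.Thread(
        target=lambda name=name: [runtime.match_pair(name, i) for i in range(20)]
    )
    for name in ("c2_5", "c3")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every call goes through the lock, the probe should never observe more than one call in flight, even with two caller threads.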
**Performance bottlenecks**:
- 10 LightGlue passes per frame is non-trivial; budget allocation lives in `tests/performance-tests.md` NFT-PERF-01 partition.
## 8. Dependency Graph
**Must be implemented after**: C2 (input), shared `LightGlueRuntime` helper (which both C2.5 and C3 consume), C6 (tile pixels), C7 (inference runtime). C2.5 does **not** depend on C3 at build time — they are sibling consumers of the helper, and the data flow is C2.5 → C3 (not the other way).
**Can be implemented in parallel with**: C1 (independent path).
**Blocks**: C3 (without a `RerankResult`, C3 has no input), F3 / F6.
## 9. Logging Strategy
| Log Level | When | Example |
|-----------|------|---------|
| ERROR | Zero candidates surviving rerank | `Re-rank produced 0 candidates; frame=12345; falling back to visual_propagated` |
| WARN | <N=3 candidates surviving | `Re-rank produced 1 candidate of N=3; frame=12345` |
| INFO | Strategy ready | `Re-rank ready: strategy=inlier_count, N=3, K=10` |
| DEBUG | Per-frame inlier counts | `Re-rank frame=12345 inlier_counts=[412, 287, 198, ...]` |
**Log format**: structured JSON.
**Log storage**: stdout / journald / FDR via C13 (ERROR + WARN only).
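As an illustration of the structured JSON format, the sketch below uses only the Python standard library `logging` and `json` modules. The field names (`component`, `frame`, `survivors`, `n`), the logger name, and the `log_rerank_outcome` helper are assumptions; the document does not define the actual log schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,  # Python spells WARN as "WARNING"
            "component": "C2.5",
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))  # structured extras
        return json.dumps(payload, sort_keys=True)


def log_rerank_outcome(logger, frame, survivors, n=3):
    """Map the survivor count onto the ERROR / WARN rows of the table."""
    fields = {"fields": {"frame": frame, "survivors": survivors, "n": n}}
    if survivors == 0:
        logger.error("Re-rank produced 0 candidates", extra=fields)
    elif survivors < n:
        logger.warning("Re-rank produced fewer than N candidates", extra=fields)


class ListHandler(logging.Handler):
    """Collects formatted lines in memory so the sketch can be inspected."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


handler = ListHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("c2_5.rerank.sketch")
logger.addHandler(handler)
logger.propagate = False
log_rerank_outcome(logger, frame=12345, survivors=1)  # WARN row
log_rerank_outcome(logger, frame=12346, survivors=0)  # ERROR row
log_rerank_outcome(logger, frame=12347, survivors=3)  # nothing logged
```

In a deployment the in-memory handler would be replaced by whatever sink feeds stdout, journald, or C13.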
@@ -0,0 +1,107 @@
# Test Specification — C2.5 Re-rank
Component-scoped. Suite-level coverage in `_docs/02_document/tests/*.md`.
## Acceptance Criteria Traceability
| AC ID | Acceptance Criterion (one-line) | Test IDs | Coverage |
|-------|---------------------------------|----------|----------|
| AC-2.1b | Satellite-anchor registration | FT-P-05, **C2.5-IT-01** | Covered |
| AC-4.1 | E2E latency <400 ms p95 | NFT-PERF-01, **C2.5-PT-01** | Covered |
| AC-NEW-7 | Cache poisoning (rerank-side filter) | NFT-SEC-01, **C2.5-IT-02** | Covered (relaxed) |
---
## Component-Internal Tests
### C2.5-IT-01: Top-K=10 → Top-N=3 promotion stability
**Summary**: when C2's top-1 is the ground-truth tile, C2.5's top-1 stays the ground-truth tile.
**Traces to**: AC-2.1b
**Description**: for the Derkachi normal segment where C2 already picks the correct top-1 (per C2-IT-01), assert that C2.5's `RerankResult.candidates[0].tile_id` matches C2's top-1 in ≥98% of frames. The remaining ≤2% are accepted: rerank can legitimately demote C2's top-1 when a candidate ranked 2..10 by C2 has more inliers, which is the design intent.
**Input data**: shared with C2-IT-01.
**Expected result**: top-1 promotion rate ≥ 0.98 (i.e., rerank rarely overrides a correct C2 top-1).
**Max execution time**: 2 min (10 LightGlue passes per frame on Tier-1 with the LightGlue runtime).
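The ≥ 0.98 assertion in this test reduces to a per-frame comparison of two tile-id sequences. A minimal sketch, assuming the harness has already collected C2's top-1 and C2.5's top-1 per frame (the function name and the toy tile ids are hypothetical):

```python
def top1_retention_rate(c2_top1, c25_top1):
    """Fraction of frames where C2.5 keeps C2's top-1 tile at rank 1."""
    if not c2_top1 or len(c2_top1) != len(c25_top1):
        raise ValueError("need two non-empty, equally long tile-id sequences")
    kept = sum(a == b for a, b in zip(c2_top1, c25_top1))
    return kept / len(c2_top1)


# Toy data: 100 frames, rerank overrides C2's top-1 on two of them.
c2 = [("tile", i) for i in range(100)]
c25 = list(c2)
c25[10] = ("tile", -1)
c25[55] = ("tile", -2)
rate = top1_retention_rate(c2, c25)
```

With two overrides in 100 frames the rate is exactly at the 0.98 threshold, so the test would still pass; a third override would fail it.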
---
### C2.5-IT-02: drop-and-continue on per-candidate failure
**Summary**: if one of the K=10 candidates raises `RerankBackboneError`, C2.5 drops it and still returns a `RerankResult` with 2 or 3 candidates; it never crashes and never returns zero.
**Traces to**: AC-NEW-7 (defensive — keeps the pipeline alive on the resilience path)
**Description**: monkey-patch the LightGlue runtime to raise `RerankBackboneError` on the 5th candidate of every frame; run 100 frames; assert (a) C2.5 emits a `RerankResult` for every frame, (b) `len(candidates) ∈ {2, 3}`, (c) error logged at WARN level.
**Input data**: `synthetic_vpr/diverse_100f/`.
**Expected result**: 100/100 frames produce a result; counts match assertion.
**Max execution time**: 2 min.
---
### C2.5-IT-03: shared LightGlue runtime serial-access invariant
**Summary**: concurrent calls to the shared `LightGlueRuntime` from C2.5 and C3 must serialize without deadlock or corrupt output.
**Traces to**: helper invariant (no AC trace; backstops the helper-ownership decision per R14)
**Description**: spawn two threads, one running C2.5 rerank and the other running C3 match, sharing the same `LightGlueRuntime`. Run 50 iterations each; assert that no exceptions are raised, that all outputs are produced, and that results are deterministic (bit-identical to a single-threaded baseline).
**Input data**: synthetic batch.
**Expected result**: no deadlock; outputs bit-identical to single-threaded run.
**Max execution time**: 2 min.
---
## Performance Tests
### C2.5-PT-01: 10 LightGlue passes per frame budget on Tier-2
**Traces to**: AC-4.1
**Load scenario**: 3 Hz, K=10 single-pair LightGlue per frame, 10 min replay.
**Expected results**:
| Metric | Target | Failure Threshold |
|--------|--------|-------------------|
| `rerank` p95 | ≤ 80 ms (10 single-pair LightGlue) | 150 ms |
| Inference engine reuse | 1 engine across all 10 calls | regression on engine-reuse causes test failure |
**Resource limits**:
- GPU memory: ≤ 300 MB for the shared LightGlue engine (counted once, not 10×).
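At 3 Hz over a 10 min replay the test collects 3 × 600 = 1800 `rerank` latency samples. The sketch below shows one way to evaluate them against the table: it assumes a nearest-rank p95 and treats "failure threshold" as a hard fail when p95 exceeds 150 ms. Both choices, and the function names, are assumptions rather than the harness's documented behaviour.

```python
def p95_nearest_rank(samples_ms):
    """95th percentile by the nearest-rank method."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def check_rerank_budget(samples_ms, target_ms=80.0, failure_ms=150.0):
    """Compare the p95 against the target and the failure threshold."""
    p95 = p95_nearest_rank(samples_ms)
    return {
        "p95_ms": p95,
        "meets_target": p95 <= target_ms,
        "hard_fail": p95 > failure_ms,
    }


# Toy samples: 1 ms, 2 ms, ..., 100 ms.
report = check_rerank_budget([float(ms) for ms in range(1, 101)])
```

For the toy samples the p95 is 95 ms, which misses the 80 ms target but stays under the 150 ms failure threshold.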
---
## Security Tests
C2.5 has no externally-reachable surface; defensive coverage flows through the helper-ownership invariant (C2.5-IT-03).
---
## Acceptance Tests
C2.5 has no operator-facing behaviour; covered transitively via FT-P-05 / FT-P-06.
---
## Test Data Management
| Data Set | Source | Size |
|----------|--------|------|
| `synthetic_vpr/diverse_100f/` | shared with C2 | shared |
| `flight_derkachi/normal_segment_60_stills/` | shared with C1/C2 | shared |
| Derkachi corpus + LightGlue engine | C10/C7 build artifacts | shared |
**Setup**: same as C2.
**Teardown**: corpus + engines are read-only.
**Data isolation**: per-test temp dirs under `tests/tmp/c2_5/<test-id>/`.