Test Specification — C2 Visual Place Recognition
Component-scoped. Suite-level coverage is in _docs/02_document/tests/*.md; canonical traceability lives in tests/traceability-matrix.md.
Acceptance Criteria Traceability
| AC ID | Acceptance Criterion (one-line) | Test IDs | Coverage |
|---|---|---|---|
| AC-2.1b | Satellite-anchor registration meets AC-1.1/1.2/2.2/8.2/8.6 | FT-P-05, FT-P-19, C2-IT-01 | Covered |
| AC-2.2 (cross-domain portion) | MRE <2.5 px cross-domain | FT-P-06, C2-IT-02 (recall floor only — MRE owned by C3) | Covered |
| AC-4.1 | E2E latency <400 ms p95 | NFT-PERF-01, C2-PT-01 | Covered |
| AC-NEW-7 | Cache poisoning safety budget | NFT-SEC-01 (onboard-side), C2-IT-03 | Covered (relaxed) |
| AC-8.6 (scale-ratio portion) | Scale-ratio satellite relocalization | FT-P-19, C2-IT-04 | Covered |
Component-Internal Tests
C2-IT-01: recall@10 (top-K = 10) on Derkachi
Summary: the primary backbone (UltraVPR) achieves recall@10 ≥ 0.95 on the Derkachi normal segment against the pre-built corpus.
Traces to: AC-2.1b
Description: for each query frame in flight_derkachi/normal_segment_60_stills/, embed via UltraVPR and query the FAISS HNSW index; assert that the ground-truth tile (per recorded GPS + sector classification) is within the top-10 candidates for ≥ 95% of frames.
Input data: flight_derkachi/normal_segment_60_stills/ + the Derkachi corpus FAISS index built by C10 (consumed read-only here).
Expected result: recall@10 ≥ 0.95 for UltraVPR; recall@10 ≥ 0.85 for the mandatory simple-baseline NetVLAD (engine rule check).
Max execution time: 90 s.
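The C2-IT-01 pass criterion is a plain recall@k count. What follows is a minimal sketch. It assumes the harness has already collected two things for each query frame: the ranked tile IDs returned by the index, and the ground-truth tile ID. `recall_at_k` and `check_c2_it_01` are hypothetical helper names, not part of C2's interface.

```python
from typing import Hashable, Sequence

# Thresholds from the C2-IT-01 expected result.
THRESHOLDS = {"UltraVPR": 0.95, "NetVLAD": 0.85}


def recall_at_k(
    ranked_candidates: Sequence[Sequence[Hashable]],
    ground_truth: Sequence[Hashable],
    k: int = 10,
) -> float:
    """Fraction of query frames whose ground-truth tile is among the first k candidates."""
    if len(ranked_candidates) != len(ground_truth):
        raise ValueError("exactly one ground-truth tile is required per query frame")
    if not ground_truth:
        raise ValueError("at least one query frame is required")
    hits = sum(
        1 for cands, gt in zip(ranked_candidates, ground_truth) if gt in list(cands)[:k]
    )
    return hits / len(ground_truth)


def check_c2_it_01(backbone: str, recall: float) -> bool:
    """Hypothetical verdict: recall@10 must meet the backbone's floor."""
    return recall >= THRESHOLDS[backbone]
```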
C2-IT-02: VprResult schema invariants
Summary: every VprResult has len(candidates) == k, a monotonically non-decreasing descriptor_distance across its candidates, and a non-empty backbone_label.
Traces to: AC-2.2 (downstream-coverage prerequisite)
Description: run 100 frames through retrieve_topk(k=10) and assert (a) len(candidates) == 10, (b) descriptor_distance sorted ascending (ties allowed), and (c) a non-empty backbone_label.
Input data: tests/fixtures/synthetic_vpr/diverse_100f/.
Expected result: 100/100 results pass all invariants.
Max execution time: 10 s.
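Each of the three C2-IT-02 invariants can be checked on a per-result basis. This is a sketch only. The `Candidate` and `VprResult` dataclasses are hypothetical stand-ins that carry just the fields the spec names (`candidates`, `descriptor_distance`, `backbone_label`). C2's real types may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Candidate:
    # Hypothetical shape: only the fields the invariants need.
    tile_id: str
    descriptor_distance: float


@dataclass
class VprResult:
    # Hypothetical mirror of C2's result; field names follow the spec text.
    candidates: List[Candidate] = field(default_factory=list)
    backbone_label: str = ""


def invariant_violations(result: VprResult, k: int) -> List[str]:
    """Return the C2-IT-02 invariants this result breaks (an empty list means pass)."""
    violations = []
    if len(result.candidates) != k:
        violations.append(f"(a) expected {k} candidates, got {len(result.candidates)}")
    dists = [c.descriptor_distance for c in result.candidates]
    # Non-decreasing: equal neighbouring distances are allowed.
    if any(a > b for a, b in zip(dists, dists[1:])):
        violations.append("(b) descriptor_distance is not non-decreasing")
    if not result.backbone_label:
        violations.append("(c) backbone_label is empty")
    return violations
```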
C2-IT-03: cache-poisoning seed rejection at retrieval
Summary: when the corpus contains a poisoned tile injected during NFT-SEC-01 setup, the top-1 distance to that tile is bounded so downstream RANSAC + voting can reject it within the AC-NEW-7 relaxed budget.
Traces to: AC-NEW-7 (component-level partition)
Description: load the NFT-SEC-01 setup corpus (3-flight cumulative dataset with deflated covariance × 1.5–3). Query each Derkachi frame and record its top-1 candidate. The test passes if the fraction of frames whose top-1 candidate is the poisoned tile is below the relaxed-CI threshold (relaxation per the AC text of 2026-05-09).
Input data: NFT-SEC-01 setup corpus.
Expected result: poisoned-tile top-1 rate within AC-NEW-7 relaxed CI.
Max execution time: 5 min.
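The C2-IT-03 verdict comes down to a rate comparison. This is a minimal sketch that assumes the harness records the top-1 tile ID for each Derkachi query frame. The relaxed-CI threshold is passed in rather than hard-coded, because its value comes from the AC-NEW-7 relaxation, which is not restated here. The function names are hypothetical.

```python
from typing import Sequence


def poisoned_top1_rate(top1_tiles: Sequence[str], poisoned_tile: str) -> float:
    """Fraction of query frames whose top-1 candidate is the poisoned tile."""
    if not top1_tiles:
        raise ValueError("no query frames recorded")
    return sum(t == poisoned_tile for t in top1_tiles) / len(top1_tiles)


def c2_it_03_passes(
    top1_tiles: Sequence[str], poisoned_tile: str, relaxed_ci_threshold: float
) -> bool:
    # relaxed_ci_threshold is supplied by the caller from the AC-NEW-7 relaxation;
    # the spec phrasing ("fewer than") maps to a strict less-than.
    return poisoned_top1_rate(top1_tiles, poisoned_tile) < relaxed_ci_threshold
```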
C2-IT-04: scale-ratio invariance on satellite re-loc
Summary: when the nav-camera scale changes (e.g., altitude shift), VPR top-K still surfaces the correct tile.
Traces to: AC-8.6 (scale-ratio half)
Description: synthetically scale the Derkachi normal-segment frames by ±20% (mimicking altitude variation); assert recall@10 stays ≥ 0.85 across the scale sweep.
Input data: flight_derkachi/scaled_+20%/ and flight_derkachi/scaled_-20%/ (generated via deterministic resize).
Expected result: recall@10 ≥ 0.85 at both scale extremes for UltraVPR.
Max execution time: 90 s.
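The sweep consists of two deterministic resizes (factors 0.8 and 1.2) plus a floor check at each extreme. This is a minimal sketch that assumes the harness supplies one recall@10 value per scale factor. `scaled_size` and `c2_it_04_passes` are hypothetical helpers. The interpolation used for the actual pixel resize is not specified here.

```python
from typing import Dict, Tuple

# ±20% scale sweep from C2-IT-04.
SCALE_FACTORS = (0.8, 1.2)
RECALL_FLOOR = 0.85


def scaled_size(width: int, height: int, factor: float) -> Tuple[int, int]:
    """Deterministic target size for a resize by `factor` (Python's round(), half-to-even)."""
    return max(1, round(width * factor)), max(1, round(height * factor))


def c2_it_04_passes(recall_by_factor: Dict[float, float]) -> bool:
    """Pass only if both sweep extremes were measured and meet the recall@10 floor."""
    missing = [f for f in SCALE_FACTORS if f not in recall_by_factor]
    if missing:
        raise ValueError(f"no recall measured for scale factor(s) {missing}")
    return all(recall_by_factor[f] >= RECALL_FLOOR for f in SCALE_FACTORS)
```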
Performance Tests
C2-PT-01: backbone forward + HNSW lookup budget on Tier-2
Traces to: AC-4.1
Load scenario: 3 Hz frame rate, 10 min replay; corpus size 87 654 tiles (Derkachi area at 0.5 m/px).
Expected results:
| Metric | Target | Failure Threshold |
|---|---|---|
| embed_query p95 | ≤ 60 ms (UltraVPR / FP16) | 100 ms |
| retrieve_topk(k=10) p95 | ≤ 2 ms (HNSW) | 10 ms |
| Combined p95 | ≤ 65 ms | 110 ms |
Resource limits:
- GPU memory: ≤ 600 MB resident for backbone weights.
- System memory: ≤ 200 MB for the mmap'd FAISS index handle.
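The p95 columns in the table above can be checked from per-frame timings. At 3 Hz over a 10-minute replay, each metric collects 3 × 600 = 1800 samples. This is a minimal sketch under two assumptions. First, it uses the nearest-rank percentile, because the spec does not say which percentile definition the runner uses. Second, a p95 that falls between the target and the failure threshold is reported rather than failed. `p95_nearest_rank` and `classify` are hypothetical names.

```python
from typing import Dict, Sequence, Tuple

# 3 Hz over a 10-minute replay.
EXPECTED_SAMPLES = 3 * 10 * 60  # 1800 query frames

# (target_ms, failure_threshold_ms) per metric, from the C2-PT-01 table.
BUDGETS: Dict[str, Tuple[float, float]] = {
    "embed_query": (60.0, 100.0),
    "retrieve_topk": (2.0, 10.0),
    "combined": (65.0, 110.0),
}


def p95_nearest_rank(samples_ms: Sequence[float]) -> float:
    """95th percentile by nearest rank: the ceil(0.95 * n)-th smallest sample."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def classify(metric: str, samples_ms: Sequence[float]) -> str:
    target_ms, fail_ms = BUDGETS[metric]
    p95 = p95_nearest_rank(samples_ms)
    if p95 > fail_ms:
        return "FAIL"
    # Assumption: a p95 between target and failure threshold is flagged, not failed.
    return "PASS" if p95 <= target_ms else "OVER_TARGET"
```

Nearest-rank always returns an observed sample, which makes the verdict easy to audit against the raw timings.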
Security Tests
C2-ST-01: index-handle invalidation safety
Summary: after C10 rebuilds the FAISS index (FORBIDDEN post-takeoff, but exercised here as a unit-level safety check), the handle C2 previously opened must not silently return stale results.
Traces to: defensive; no direct AC trace. It backstops a code-injection / config-drift failure mode that AC-NEW-7 already covers at the suite level.
Test procedure:
- Build a tiny 100-tile FAISS index; mmap it through C2.
- Replace the underlying file (simulating an out-of-band rebuild, which is FORBIDDEN at flight time per D-C10-3 but defended-in-depth here).
- Call retrieve_topk and assert that C2 raises IndexUnavailableError rather than returning stale candidates.
Pass criteria: IndexUnavailableError raised; no candidates returned.
Fail criteria: any candidate returned.
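One way to meet the pass criterion is to pin the identity of the index file when C2 opens it, then re-check that identity before each lookup. This is a minimal sketch that assumes a stat-based check (device, inode, size, mtime) is an acceptable staleness signal. `GuardedIndexHandle` is hypothetical and the FAISS search itself is stubbed out. The check is also best-effort, because the file can still change between the stat and the search.

```python
import os
from dataclasses import dataclass


class IndexUnavailableError(RuntimeError):
    """Raised when the index file behind a handle has changed or disappeared."""


@dataclass(frozen=True)
class FileIdentity:
    dev: int
    ino: int
    size: int
    mtime_ns: int

    @classmethod
    def of(cls, path: str) -> "FileIdentity":
        st = os.stat(path)
        return cls(st.st_dev, st.st_ino, st.st_size, st.st_mtime_ns)


class GuardedIndexHandle:
    """Hypothetical C2-side wrapper that pins the index file's identity at open time."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._pinned = FileIdentity.of(path)

    def retrieve_topk(self, query, k: int = 10) -> list:
        try:
            current = FileIdentity.of(self._path)
        except FileNotFoundError as exc:
            raise IndexUnavailableError(f"index file missing: {self._path}") from exc
        if current != self._pinned:
            # Refuse to answer rather than serve candidates from a stale mapping.
            raise IndexUnavailableError(f"index file replaced since open: {self._path}")
        # Placeholder: a real handle would search the mmap'd FAISS index here.
        return []
```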
Acceptance Tests
C2 contributes to AC-2.1b / AC-8.6 via the suite-level FT scenarios. No additional C2-only acceptance tests.
Test Data Management
| Data Set | Description | Source | Size |
|---|---|---|---|
| synthetic_vpr/diverse_100f/ | 100 diverse synthetic frames for invariant checks | generated, deterministic | ~50 MB |
| flight_derkachi/normal_segment_60_stills/ | shared with C1 | curated | shared |
| flight_derkachi/scaled_±20%/ | scale-ratio sweep | generated from above | ~40 MB |
| Derkachi FAISS corpus | C10's output, consumed read-only | C10 build artifact | ~200 MB |
Setup procedure:
- C10 must have built the Derkachi FAISS index (a Step-5 test-side prerequisite); in the test runner, the corpus is staged from tests/fixtures/cache_artifacts/.
- Synthetic and scaled fixtures are generated by deterministic scripts.
Teardown: corpus is read-only; nothing to clean up.
Data isolation: each test gets its own tests/tmp/c2/<test-id>/.
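Per-test isolation can be as simple as giving each test ID a fresh directory. This is a minimal sketch. `isolated_workdir` is a hypothetical helper. Clearing leftovers at setup rather than at teardown is an assumption, chosen so that the read-only teardown described above remains a no-op.

```python
import shutil
from pathlib import Path


def isolated_workdir(test_id: str, root: Path = Path("tests/tmp/c2")) -> Path:
    """Return a fresh, empty <root>/<test-id>/ directory for one test."""
    if not test_id or "/" in test_id or "\\" in test_id or test_id in (".", ".."):
        raise ValueError(f"unsafe test id: {test_id!r}")
    workdir = root / test_id
    if workdir.exists():
        shutil.rmtree(workdir)  # drop leftovers from a previous run
    workdir.mkdir(parents=True)
    return workdir
```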