[AZ-423] [AZ-427] Add FT-P-19 + FT-N-05 blackbox tests

Implement the AC-8.6 (top-K=10 retrieval scale-ratio + scene-change
PARTIAL) and AC-8.2 / AC-NEW-6 (stale aged-tile rejection) blackbox
scenarios.

AZ-423 (FT-P-19, 3pt) helpers + scenario:
- retrieval_evaluator.py — top-K within-distance evaluator (60 stills
  vs 100 m budget), scene-change PARTIAL recorder (always emits
  PARTIAL on the 2 _gmaps.png pairs), FDR record projectors, CSV
  writers.
- tests/positive/test_ft_p_19_sat_reloc_scale.py (6 parametrised
  variants).

AZ-427 (FT-N-05, 2pt) helpers + scenario:
- aged_tile_rejection_evaluator.py — Signal A (stale rejection at
  load) + Signal B (per-frame downgrade) decision matrix, reuses
  ALLOWED_SOURCE_LABELS from estimate_schema.
- tests/negative/test_ft_n_05_stale_tile_rejection.py (12 parametrised
  variants: FC × VIO × {7mo/active-conflict, 13mo/rear}).

48 new unit tests cover every helper branch. Both scenarios skip
when sitl_replay_ready is false and fail loudly when fixture records
are missing.

Per-batch review: PASS_WITH_WARNINGS (2 Low — production-dependency
surface, FDR-kind constant duplication).
Cumulative review 82-84: PASS (2 Low carry-over / hygiene candidate).

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-17 15:43:06 +03:00
parent a22028087f
commit f25cae4a82
13 changed files with 2005 additions and 3 deletions
@@ -1,63 +0,0 @@
# FT-P-19 — Satellite relocalization scale-ratio + scene-change PARTIAL
**Task**: AZ-423_ft_p_19_sat_reloc_scale
**Name**: UAV-frame footprint scale-ratio retrievable from cache (FULL); scene-change subset (PARTIAL) (AC-8.6)
**Description**: Implement FT-P-19. For each `still-image-set-60` image, query the cache with top-K=10 retrieval and assert that the top-K result includes a tile whose centre is within 100 m of the image's true centre. For the 2 paired `_gmaps.png` images, run the cross-domain matcher and record scene-change behavior as PARTIAL (full coverage requires a labeled change-pair dataset, deferred under D-PROJ-3).
**Complexity**: 3 points
**Dependencies**: AZ-406, AZ-407
**Component**: Blackbox Tests / Positive (epic AZ-262)
**Tracker**: AZ-423
**Epic**: AZ-262 (E-BBT)
## Problem
AC-8.6 has two halves: scale-ratio retrievability (UAV-frame footprint at deployment altitude is in the cache regardless of internal tiling) and scene-change handling (matcher succeeds when the satellite reference shows seasonal/temporal differences). The first is fully measurable; the second is constrained by the lack of labeled change-pair data and is documented as PARTIAL in the traceability matrix.
## Outcome
- pytest scenario at `e2e/tests/positive/test_ft_p_19_sat_reloc_scale.py`.
- For each of the 60 still images: query the SUT's cache top-K=10 retrieval (via the per-frame retrieval log in FDR OR a public cache-query API if exposed); assert the top-K includes a tile whose centre is within 100 m of the image's true centre.
- For the 2 paired `_gmaps.png` images: run the cross-domain matcher; record per-image match success (boolean) into CSV; mark the scene-change subset as `PARTIAL` in `report.csv`.
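The distance assertion in the second bullet can be sketched as below. This is a minimal illustration, not the contents of `retrieval_evaluator.py`: the function names, the `(lat, lon)` tuple representation, and the use of the haversine great-circle distance are all assumptions made for the example.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; adequate for a 100 m budget


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def top_k_hit(true_centre, tile_centres, k=10, budget_m=100.0):
    """True if any of the first k retrieved tile centres lies within budget_m.

    true_centre and each entry of tile_centres are hypothetical (lat, lon)
    tuples; tile_centres is assumed to be ordered by retrieval rank.
    """
    return any(
        haversine_m(*true_centre, *centre) <= budget_m
        for centre in tile_centres[:k]
    )
```

Truncating to the first `k` entries before the distance check matters: a near tile ranked 11th must not count as a hit under the top-K=10 criterion.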
## Scope
### Included
- Per-image top-K=10 retrieval observation.
- Per-image distance check on top-K members.
- Scene-change subset (2 paired images) with PARTIAL annotation.
### Excluded
- Stale-tile rejection — owned by FT-N-05.
- Multi-flight scene-change statistics — deferred under D-PROJ-3 (out of scope of this task).
## Acceptance Criteria
**AC-1: scale-ratio retrievability (60 images)**
Given each still image
When the SUT performs top-K=10 retrieval
Then the top-K result includes a tile whose centre is within 100 m of the image's true centre, for all 60 images.
**AC-2: scene-change subset PARTIAL**
Given the 2 paired `_gmaps.png` images
When the cross-domain matcher runs against each
Then the per-image result is recorded; the scenario's overall result is `PARTIAL` in `report.csv` for this subset (regardless of pass/fail count, because N=2 is too small for statistical confidence).
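One way to make the PARTIAL status structural is to set it unconditionally when building the rows, so no match count can change it. The sketch below assumes a three-column layout (`image`, `matched`, `status`) and hypothetical function names; the actual `report.csv` schema is not given in this task.

```python
import csv

FIELDNAMES = ["image", "matched", "status"]  # assumed column layout


def scene_change_rows(match_results):
    """Build one row per image; status is PARTIAL whatever the match outcome.

    match_results is a hypothetical mapping of image name -> bool.
    """
    return [
        {"image": name, "matched": str(ok).lower(), "status": "PARTIAL"}
        for name, ok in sorted(match_results.items())
    ]


def write_rows(rows, fh):
    """Write the rows, with a header, to an open text file handle."""
    writer = csv.DictWriter(fh, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
```

The per-image boolean is still recorded, so the data remains usable once a labeled change-pair dataset exists.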
**AC-3: parameterization**
Given conftest parameterization
Then the scenario runs per `(fc_adapter, vio_strategy)`.
## System Under Test Boundary
End-to-end through public boundaries.
- **Allowed**: FDR per-frame retrieval log read OR a public cache-query API (whichever is documented).
- **Forbidden**: importing C2 / C6 internal index, monkeypatching FAISS query.
## Constraints
- The PARTIAL annotation is structural — the scenario emits it regardless of pass/fail count for the scene-change subset, because the AC text itself acknowledges insufficient data (per traceability matrix).
## Document Dependencies
- `_docs/02_document/tests/blackbox-tests.md` § FT-P-19
- `_docs/02_document/tests/traceability-matrix.md` § AC-8.6 (PARTIAL annotation)
@@ -1,63 +0,0 @@
# FT-N-05 — Stale-tile rejection on freshness violation
**Task**: AZ-427_ft_n_05_stale_tile_rejection
**Name**: Tiles violating AC-8.2 freshness window are rejected or downgraded (AC-8.2, AC-NEW-6)
**Description**: Implement FT-N-05. Replay the 60 still images against `synth-age-7mo` (SUT configured for the active-conflict sector) and `synth-age-13mo` (SUT configured for the rear sector). In each case the SUT must either reject the tiles at load, or load them but never emit `satellite_anchored` from them.
**Complexity**: 2 points
**Dependencies**: AZ-406, AZ-407 (synth-age-tile-set)
**Component**: Blackbox Tests / Negative / Cache freshness (epic AZ-262)
**Tracker**: AZ-427
**Epic**: AZ-262 (E-BBT)
## Problem
Stale-tile rejection is a security-relevant freshness gate (AC-8.2, AC-NEW-6): without it the SUT could promote outdated geographic information to `satellite_anchored`, enabling silent location drift. The gate must be exercised end to end in both sector configurations.
## Outcome
- pytest scenario at `e2e/tests/negative/test_ft_n_05_stale_tile_rejection.py`.
- Two sub-cases:
  - `synth-age-7mo` mounted; SUT configured for active-conflict sector; replay 60 stills.
  - `synth-age-13mo` mounted; SUT configured for rear sector; replay 60 stills.
- For each sub-case: assert 0 frames emit `source_label = satellite_anchored` from these tiles. Either rejection at load (FDR `tile-load-rejected: stale`) OR per-frame downgrade is acceptable.
## Scope
### Included
- Both sub-cases (7mo / 13mo).
- Sector-configuration switch (per AC-8.2 active-conflict vs rear).
### Excluded
- Mid-flight tile freshness positive case — owned by FT-N-06 (inside AZ-422).
## Acceptance Criteria
**AC-1: 7mo aged tiles in active-conflict sector**
Given the SUT mounts `synth-age-7mo` and is configured for active-conflict sector
When 60 still images are replayed
Then 0 outbound emissions carry `source_label = satellite_anchored`. Acceptable signals: (a) FDR `tile-load-rejected: stale` events at startup; (b) per-frame `source_label ∈ {visual_propagated, dead_reckoned}` throughout.
**AC-2: 13mo aged tiles in rear sector**
Given the SUT mounts `synth-age-13mo` and is configured for rear sector
When 60 still images are replayed
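Then the same outcome as AC-1 holds: 0 outbound emissions carry `source_label = satellite_anchored`, with either acceptable signal (a) or (b) observed.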
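AC-1 and AC-2 reduce to the same check, which can be sketched as a small decision function. This is one reading of the criteria, not the code in `aged_tile_rejection_evaluator.py`: it assumes FDR events arrive as plain strings and labels as a flat list, and it treats the case as passing only when no frame is `satellite_anchored` and at least one of the two signals is present.

```python
SATELLITE_ANCHORED = "satellite_anchored"
DOWNGRADED_LABELS = frozenset({"visual_propagated", "dead_reckoned"})
STALE_REJECTION_EVENT = "tile-load-rejected: stale"  # assumed string form of the FDR event


def evaluate_stale_case(fdr_events, source_labels):
    """Evaluate one sub-case from observed FDR events and outbound labels.

    Signal A: at least one stale tile-load rejection was logged.
    Signal B: frames were emitted and every one carries a downgraded label.
    """
    signal_a = any(event == STALE_REJECTION_EVENT for event in fdr_events)
    anchored = sum(1 for label in source_labels if label == SATELLITE_ANCHORED)
    signal_b = bool(source_labels) and all(
        label in DOWNGRADED_LABELS for label in source_labels
    )
    return {
        "signal_a": signal_a,
        "signal_b": signal_b,
        "anchored_frames": anchored,
        "passed": anchored == 0 and (signal_a or signal_b),
    }
```

Requiring a signal in addition to a zero count means an empty replay (no events, no frames) is reported as a failure rather than a vacuous pass.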
**AC-3: parameterization**
Given conftest parameterization
Then the scenario runs per `(fc_adapter, vio_strategy)`.
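The case matrix can be sketched as a Cartesian product of adapter, strategy, and sub-case. The two sub-cases come from the Outcome section; the adapter and strategy names below are placeholders, and the 2 × 3 split is an assumption chosen only so the product yields the 12 variants stated in the commit message.

```python
from itertools import product

FC_ADAPTERS = ("fc_a", "fc_b")                # hypothetical adapter names
VIO_STRATEGIES = ("vio_x", "vio_y", "vio_z")  # hypothetical strategy names
SUB_CASES = (
    ("synth-age-7mo", "active-conflict"),
    ("synth-age-13mo", "rear"),
)


def build_cases():
    """Return every (fc_adapter, vio_strategy, tile_set, sector) combination."""
    return [
        (fc, vio, tile_set, sector)
        for fc, vio, (tile_set, sector) in product(FC_ADAPTERS, VIO_STRATEGIES, SUB_CASES)
    ]


def case_id(case):
    """Readable test id for one combination."""
    fc, vio, tile_set, sector = case
    return f"{fc}-{vio}-{tile_set}-{sector}"
```

A list built this way could be handed to `pytest.mark.parametrize` together with `ids=[case_id(c) for c in build_cases()]`, keeping each tile set paired with its sector so that mismatched pairs such as 7mo with rear are never generated.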
## System Under Test Boundary
End-to-end through public boundaries.
- **Allowed**: FDR for `tile-load-rejected` events, outbound `source_label` stream.
- **Forbidden**: importing C6 freshness gate state, monkeypatching the date-comparison logic.
## Constraints
- "Active-conflict" / "rear" sector is per AC-8.2; the configuration mechanism is documented in the SUT's config schema (per E-CC-CONF / `module-layout.md`).
## Document Dependencies
- `_docs/02_document/tests/blackbox-tests.md` § FT-N-05
- `_docs/02_document/tests/test-data.md` § Seed Data Sets (synth-age-tile-set)