[AZ-422] Add FT-P-17 + FT-N-06 mid-flight tile blackbox tests

Implement the AC-8.4 and AC-NEW-6 blackbox scenarios for mid-flight
tile generation, dedup, landing-time upload, and freshness gating.

Helpers:
- runner/helpers/mid_flight_tile_evaluator.py — pure-logic evaluators
  for tile generation rate, Mode B Fact #105 schema check, footprint+
  GSD dedup (via geo.distance_m), upload-audit reconciliation, and
  the AC-5/AC-6 capture_utc + freshness-gate checks.
- runner/helpers/mock_suite_sat_audit.py — httpx wrapper for the
  mock-suite-sat-service /tiles/audit endpoint with strict response-
  shape validation.

Scenarios:
- tests/positive/test_ft_p_17_mid_flight_tiles.py
- tests/negative/test_ft_n_06_mid_flight_freshness.py

Both skip when sitl_replay_ready is false and fail loudly when fixture
records are missing (tests-as-gates discipline). 52 new unit tests
(41 evaluator + 11 audit client) cover every helper branch.

Review: PASS_WITH_WARNINGS (2 Low — duplicate haversine carry-over,
upstream production dependency surface).

Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
Oleksandr Bezdieniezhnykh
2026-05-17 15:28:39 +03:00
parent 1ee54b414b
commit 5def1a3eb3
11 changed files with 1782 additions and 2 deletions
+107
View File
@@ -0,0 +1,107 @@
# Batch 83 — AZ-422 (FT-P-17 + FT-N-06 mid-flight tile generation + freshness)
**Tracker**: AZ-422
**Tasks**: 1 task / 3 complexity points
**Date**: 2026-05-17
**Verdict**: PASS_WITH_WARNINGS
**Review**: `_docs/03_implementation/reviews/batch_83_review.md`
## Scope
- FT-P-17 (positive, AC-8.4): mid-flight orthorectified tile generation, per-tile quality metadata, dedup, landing-time upload to mock-suite-sat-service.
- FT-N-06 (negative, AC-NEW-6): per-tile `capture_utc` within ±60 s of generation wall-clock; freshness gate must not reject freshly generated tiles as stale.
Both scenarios parameterize across `(fc_adapter ∈ {ardupilot, inav}, vio_strategy ∈ {okvis2, klt_ransac, vins_mono})` → 12 collected test cases.
## Files
### Created
- `e2e/runner/helpers/mid_flight_tile_evaluator.py` — pure-logic evaluators for AC-1..AC-6:
* `evaluate_tile_generation_rate` (AC-1)
* `evaluate_tile_quality_metadata` (AC-2; Mode B Fact #105 schema mirror)
* `evaluate_dedup` (AC-3; Vincenty distance via `geo.distance_m` + GSD-fraction check)
* `evaluate_upload_acks` (AC-4)
* `evaluate_capture_date_freshness` (AC-5; ISO-8601 parse + monotonic-ms drift)
* `evaluate_freshness_gate` (AC-6)
- `e2e/runner/helpers/mock_suite_sat_audit.py` — thin `httpx` client for `GET /tiles/audit` with input validation, HTTP error, JSON shape errors all raised as `RuntimeError`.
- `e2e/tests/positive/test_ft_p_17_mid_flight_tiles.py` — FT-P-17 scenario covering AC-1..AC-4 + AC-7.
- `e2e/tests/negative/test_ft_n_06_mid_flight_freshness.py` — FT-N-06 scenario covering AC-5 + AC-6 + AC-7.
- `e2e/_unit_tests/helpers/test_mid_flight_tile_evaluator.py` — 41 unit tests covering happy paths + boundary + error cases for every evaluator.
- `e2e/_unit_tests/helpers/test_mock_suite_sat_audit.py` — 11 unit tests covering happy paths + every error branch with `httpx.MockTransport`.
### Modified
- `e2e/_unit_tests/test_directory_layout.py` — registered 4 new paths under the AZ-406 layout invariant.
## Test Results
```
$ pytest _unit_tests/helpers/test_mid_flight_tile_evaluator.py \
_unit_tests/helpers/test_mock_suite_sat_audit.py \
_unit_tests/test_directory_layout.py -x
============================= 157 passed in 1.07s ==============================
```
Scenario collection:
```
$ pytest tests/positive/test_ft_p_17_mid_flight_tiles.py \
tests/negative/test_ft_n_06_mid_flight_freshness.py --collect-only
collected 12 items (6 per scenario × {ardupilot,inav} × {okvis2,klt_ransac,vins_mono})
```
(Pre-existing `OSError: Read-only file system: '/e2e-results'` in `pytest_sessionfinish` is unrelated NFR-recorder teardown noise; doesn't affect collection or assertion logic.)
## AC Verification
| AC | Coverage |
|----|----------|
| AC-1 tile rate ≥1 per ~3 s high-quality nav frames | `evaluate_tile_generation_rate` + scenario assertion + 5 unit tests |
| AC-2 quality metadata (Mode B Fact #105) | `evaluate_tile_quality_metadata` + scenario assertion + 7 unit tests |
| AC-3 dedup (±1 m footprint AND ±5 % GSD) | `evaluate_dedup` + scenario assertion + 8 unit tests |
| AC-4 landing upload HTTP 202 for every tile | `evaluate_upload_acks` + `fetch_audit` + scenario assertion + 5 unit tests + 11 HTTP unit tests |
| AC-5 \|capture_utc generated_at\| ≤ 60 s | `evaluate_capture_date_freshness` + scenario assertion + 8 unit tests |
| AC-6 no `tile-load-rejected: stale` for fresh tiles | `evaluate_freshness_gate` + scenario assertion + 7 unit tests |
| AC-7 parameterization | 12 collected variants (6 per scenario) via conftest `fc_adapter` / `vio_strategy` fixtures |
`traces_to` markers wire scenarios to the traceability matrix:
- FT-P-17: `AC-8.4,AC-1,AC-2,AC-3,AC-4,AC-7`
- FT-N-06: `AC-NEW-6,AC-5,AC-6,AC-7`
## Code Review
**Verdict**: PASS_WITH_WARNINGS — 0 Critical, 0 High, 2 Low.
- **F1 (carry-over)**: `gcs_telemetry_evaluator.py`'s private haversine duplicates `geo.distance_m`. Already surfaced in the batches 7981 cumulative review; deferred to a dedicated refactor batch.
- **F2 (production-dependency surface)**: both scenarios depend on upstream features (see Production Dependencies below). Tests skip cleanly when fixtures missing and fail loudly when fixtures exist but records are missing — adhering to "tests as gates" principle.
Full review: `_docs/03_implementation/reviews/batch_83_review.md`.
## Production Dependencies
These features must exist for the scenarios to actually run (rather than skip):
1. **SUT-side** `mid-flight-tile-output` FDR record kind matching the Mode B Fact #105 schema (`TILE_REQUIRED_TOP_LEVEL_FIELDS` + `TILE_REQUIRED_QUALITY_FIELDS`).
2. **SUT-side** `tile-load-rejected` FDR record with `reason="stale"` emitted by the freshness gate.
3. **SUT-side** `simulate_landing()` MAVLink command (or equivalent public-input trigger) for landing-event tile upload.
4. **Fixture-builder-side** Derkachi 5-min replay scenario emitting both record kinds for the parameterized FC × VIO grid.
5. **Fixture-builder-side** `FT_P_17_HIGH_QUALITY_WINDOW_S` env var injection (total seconds of high-quality nav frames per AC-2.1a normal-segment criterion).
6. **Already exists**: `mock-suite-sat-service` `/tiles/audit` endpoint (`e2e/fixtures/mock-suite-sat/app.py`).
7. **Already exists**: `mock_suite_sat_url` and `sitl_replay_ready` pytest fixtures (used by sibling scenarios).
Dependencies 15 are tracked against epic E-OBC (Mode B work) and AZ-595 fixture builder — outside the blackbox-test workspace.
## Architecture Compliance
- All new files under `e2e/`, owned by the Blackbox Tests component per `_docs/02_document/module-layout.md`.
- No imports from `src/gps_denied_onboard` (explicit public-boundary discipline note in `mid_flight_tile_evaluator.py`).
- No new cyclic dependencies.
- `httpx` and `pyproj` (via `geo`) reuse — no new infrastructure libraries introduced.
## Sub-step Trace
Phases executed per `implement/SKILL.md`:
- phase 5 (load-spec) → AZ-422 spec read
- phase 6 (implement-tasks-sequentially) → helper + scenarios + unit tests
- phase 7 (verify-ac-coverage) → 7-AC trace above
- phase 8 (code-review) → batch_83_review.md (PASS_WITH_WARNINGS)
- phase 11 (commit-batch) → next.
@@ -0,0 +1,114 @@
# Code Review Report
**Batch**: 83 — AZ-422 (FT-P-17 + FT-N-06 mid-flight tile generation + freshness)
**Date**: 2026-05-17
**Verdict**: PASS_WITH_WARNINGS
## Files Reviewed
**Created**:
- `e2e/runner/helpers/mid_flight_tile_evaluator.py`
- `e2e/runner/helpers/mock_suite_sat_audit.py`
- `e2e/tests/positive/test_ft_p_17_mid_flight_tiles.py`
- `e2e/tests/negative/test_ft_n_06_mid_flight_freshness.py`
- `e2e/_unit_tests/helpers/test_mid_flight_tile_evaluator.py`
- `e2e/_unit_tests/helpers/test_mock_suite_sat_audit.py`
**Modified**:
- `e2e/_unit_tests/test_directory_layout.py` (registered 4 new paths)
## Findings
| # | Severity | Category | File:Line | Title |
|---|----------|----------|-----------|-------|
| 1 | Low | Maintainability | `e2e/runner/helpers/gcs_telemetry_evaluator.py` (carry-over) | Duplicate haversine helper not consolidated to `geo.distance_m` |
| 2 | Low | Spec-Gap | `e2e/tests/positive/test_ft_p_17_mid_flight_tiles.py`, `e2e/tests/negative/test_ft_n_06_mid_flight_freshness.py` | Tests depend on upstream production + fixture-builder features that don't exist yet |
### Finding Details
**F1: Duplicate haversine helper not consolidated to `geo.distance_m`** (Low / Maintainability — carry-over)
- Location: `e2e/runner/helpers/gcs_telemetry_evaluator.py`
- Description: Batch 81 introduced a private haversine function inside `gcs_telemetry_evaluator.py` for search-region-shift distance math. `runner.helpers.geo.distance_m` is the project-wide Vincenty helper (used by this batch's `mid_flight_tile_evaluator.py` for dedup). Two helpers, two algorithms, same intent.
- Suggestion: Migrate `gcs_telemetry_evaluator.py` to `geo.distance_m` in a dedicated refactor batch (≤1 point). Out of scope for AZ-422 — would expand the diff into a helper already shipped and reviewed.
- Task: Carry-over from batches 7981 cumulative review.
**F2: Tests depend on upstream production + fixture-builder features that don't exist yet** (Low / Spec-Gap)
- Location: `e2e/tests/positive/test_ft_p_17_mid_flight_tiles.py`, `e2e/tests/negative/test_ft_n_06_mid_flight_freshness.py`
- Description: Both scenarios require:
* SUT writing `mid-flight-tile-output` FDR records with the Mode B Fact #105 schema (production, owned by epic E-OBC / Mode B work — outside the test harness).
* SUT emitting `tile-load-rejected` FDR records with `reason="stale"` from the freshness gate (same).
* `simulate_landing()` MAVLink command or equivalent public-input mechanism that triggers landing-time tile upload (production, public-input).
* `mock-suite-sat-service` audit endpoint (already exists in `e2e/fixtures/mock-suite-sat/app.py`).
* Fixture builder support for the `FT_P_17_HIGH_QUALITY_WINDOW_S` env var, computed from segment-quality FDR records (AZ-595).
* Fixture builder support for AZ-422 5-min Derkachi replay scenario.
- The tests **skip cleanly** when `sitl_replay_ready` is false (consistent with FT-P-12/13/15/16/18) and **fail loudly** when the fixture exists but the required records are missing — adhering to the "tests as gates" principle.
- Suggestion: Surface as a single line in the AZ-422 batch report under Production Dependencies. No code change in this batch.
- Task: AZ-422.
## Phase 1: Context Loading
Read AZ-422 task spec (`_docs/02_tasks/todo/AZ-422_ft_p_17_ftn_06_mid_flight_tiles.md`). All seven ACs and the SUT boundary statement understood before review.
## Phase 2: Spec Compliance
| AC | Helper | Scenario assertion | Unit-test coverage |
|----|--------|--------------------|--------------------|
| AC-1 (≥1 tile / 3 s high-quality nav frames) | `evaluate_tile_generation_rate` | `assert rate_report.passes` | 5 cases (exact pass, under-min fail, zero window, invalid arg, empty list) |
| AC-2 (Mode B Fact #105 quality fields populated) | `evaluate_tile_quality_metadata` | `assert quality_report.passes` | 7 cases (full pass, missing quality field, missing top-level, non-dict quality, empty list, null value, partial drop) |
| AC-3 (dedup: ±1 m footprint AND ±5 % GSD) | `evaluate_dedup` | `assert dedup_report.passes` | 8 cases (dup same centre, far apart pass, different GSD pass, close-GSD dupe, missing GSD skip, empty pass, invalid args raise, 3-tile pair detection) |
| AC-4 (landing upload HTTP 202 every tile) | `evaluate_upload_acks` + `mock_suite_sat_audit.fetch_audit` | `assert upload_report.passes` | 5 cases (all acked pass, missing tile fail, extra audit pass, empty generated fail, malformed-entry skip, non-dict skip) + 11 HTTP-client cases |
| AC-5 (|capture_utc generated_at| ≤ 60 s) | `evaluate_capture_date_freshness` | `assert capture_report.passes` | 8 cases (zero drift pass, within tolerance pass, over tolerance fail, unparseable fail, fallback to quality dict, custom tolerance, invalid tolerance raise, empty list fail) |
| AC-6 (no `tile-load-rejected: stale` for fresh tiles) | `evaluate_freshness_gate` | `assert freshness_report.passes` | 7 cases (no rejections pass, unrelated rejection pass, fresh stale-rejected fail, non-stale reason ignored, tile_id key variant, non-dict skip, custom stale reason) |
| AC-7 (parameterized across `(fc_adapter, vio_strategy)`) | conftest fixtures | 6 collected variants per scenario = 12 total | — |
All ACs satisfied. The `@pytest.mark.traces_to("AC-8.4,AC-1,...")` markers wire scenarios to AC IDs for the traceability matrix.
## Phase 3: Code Quality
- **SOLID**: each evaluator is a pure function over a `TileSpec`/dict input returning a frozen-dataclass report. `mock_suite_sat_audit.fetch_audit` is a single-responsibility HTTP wrapper. Test files mirror helper shape (one test file per helper module).
- **Error handling**: all error paths raise `ValueError` (input validation) or `RuntimeError` (HTTP / response shape). No bare `except` or silent swallowing. The `except (TypeError, ValueError)` in `_parse_iso8601_utc_seconds` is typed and limited to parsing failure.
- **Naming**: `evaluate_*` matches sibling helpers (`gcs_telemetry_evaluator`, `tile_cache_inspector`); report dataclasses follow `<Concern>Report` / `<Concern>EntryReport` naming.
- **Complexity**: longest function `evaluate_tile_quality_metadata` at ~25 lines; `evaluate_dedup` at ~25 lines with an O(N²) loop the docstring explicitly documents. All under coderule's 50-line threshold.
- **DRY**: no in-batch duplication. Cross-batch duplication of haversine logic surfaced as F1.
- **Test quality**: every unit test uses Arrange/Act/Assert comments per coderule. Tests assert meaningful behavior (specific drift values, specific failing tile IDs, specific error substrings) rather than "no error thrown".
- **Dead code**: none. `_top_level_field_to_attr` is a documented forward-compatibility seam (1:1 today, allows future field-name drift handling).
## Phase 4: Security Quick-Scan
- No SQL, no string-interpolated queries (FDR is JSON file iteration).
- No `subprocess(... shell=True)`, no `exec`, no `eval`.
- No hardcoded secrets; the HTTP client takes a base URL from `mock_suite_sat_url` fixture.
- Input validation: `fetch_audit` validates `base_url` non-empty, `run_id` non-empty, HTTP 2xx, JSON object body, `entries` list shape.
- No sensitive data in error messages; HTTP error message truncates body to 200 chars.
- No insecure deserialization — JSON parsed via `httpx.Response.json()` (stdlib `json.loads` underneath); shape checked post-parse.
## Phase 5: Performance Scan
- `evaluate_dedup` is O(N²) — explicitly documented in the docstring as acceptable for <100 tiles per 5-min replay. For longer flights a spatial index (KD-tree) would be needed; out of scope for AZ-422.
- `fetch_audit` is one-shot HTTP GET with 10 s timeout and no retries — appropriate for a co-located mock in the compose harness.
- FDR iteration in scenarios uses `fdr_reader.iter_records` (generator).
- No N+1 patterns, no unbounded fetching beyond the SUT's tile count.
## Phase 6: Cross-Task Consistency
- `TileSpec` mirrors `e2e/fixtures/mock-suite-sat/app.py`'s `TilePublishRequest` + `TileQualityMetadata` (Mode B Fact #105). Constants `TILE_REQUIRED_TOP_LEVEL_FIELDS` and `TILE_REQUIRED_QUALITY_FIELDS` make the contract explicit.
- Scenario skip pattern (`if not sitl_replay_ready: pytest.skip(...)`) matches FT-P-12/13/15/16/18.
- `httpx` HTTP client matches the dependency already pinned for sibling helpers.
- `geo.distance_m` reuse rather than re-implementing Vincenty/haversine in this helper.
- Test fixture imports (`evidence_dir`, `run_id`, `nfr_recorder`, `fc_adapter`, `vio_strategy`, `mock_suite_sat_url`, `sitl_replay_ready`) match the conftest signature used by sibling scenarios.
## Phase 7: Architecture Compliance
- All new files live under `e2e/` — owned exclusively by the Blackbox Tests component per `_docs/02_document/module-layout.md`.
- **No imports from `src/gps_denied_onboard`** — explicit "public-boundary discipline" note at top of `mid_flight_tile_evaluator.py`; verified by reading every import.
- No layering violations.
- No new cyclic module dependencies (the helper imports from `.geo` only; tests import from `runner.helpers.*`).
- No duplicate symbols across components (the carry-over duplicate haversine is intra-component, tracked in F1).
- No cross-cutting concerns re-implemented locally; HTTP client, JSON parsing, and geo math all delegate to shared dependencies.
## Verdict
- 0 Critical, 0 High → no FAIL trigger.
- 2 Low (one carry-over, one production-dependency surface) → **PASS_WITH_WARNINGS**.
Batch 83 is ready to commit. The two Low findings are surfaced for batch report and feed forward into the next cumulative review (batches 8284) without blocking.