[AZ-428] [AZ-429] [AZ-430] [AZ-431] Add NFT-PERF-01..04 perf scenarios

Batch 85: 4 Performance NFT scenarios + pure-logic evaluators.

- NFT-PERF-01 (AZ-428, Tier-2): two-config e2e latency p95 ≤ 400 ms (K=3@25°C, K=2 hybrid@50°C) + frame-drop ≤10% + informational per-stage partition recording (D-CROSS-LATENCY-1).
- NFT-PERF-02 (AZ-429): inter-emit p95 ≤ 350 ms + no ≥3 missed-emit windows. fc-adapter-aware SITL timestamp extraction (tlog vs MSP).
- NFT-PERF-03 (AZ-430, Tier-2): cold-start TTFF p95 ≤ 30 s AND max ≤ 45 s over N≥10 iterations.
- NFT-PERF-04 (AZ-431): spoof-promotion latency p95 ≤ 600 ms over N≥20 randomized-start blackout+spoof events.

All scenarios consume external fixtures (AZ-595 dependency surfaced) and fail loudly when fixtures are missing or empty. Public-boundary discipline preserved: evaluators do NOT import src/gps_denied_onboard.

Tests: 60 new unit tests pass; 24 scenarios collect (4 tests × 2 fc × 3 vio).

Code review: PASS_WITH_WARNINGS, with 1 Medium (fixed in batch) and 3 Low (production-dependency surfacings + future hygiene).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-22 06:01:12 +00:00 · 2026-05-17 16:46:49 +03:00
parent f25cae4a82
commit 73cd632e95
21 changed files with 3063 additions and 6 deletions
@@ -0,0 +1,145 @@
+# Batch 85 — AZ-428 + AZ-429 + AZ-430 + AZ-431 (Performance NFTs)
+
+**Tracker**: AZ-428, AZ-429, AZ-430, AZ-431
+**Tasks**: 4 tasks / 13 complexity points (5 + 2 + 5 + 3)*
+**Date**: 2026-05-17
+**Verdict**: PASS_WITH_WARNINGS
+**Review**: `_docs/03_implementation/reviews/batch_85_review.md`
+
+*Note on points: the 4-task batch totals 13 points — driven by AC coverage cohesion (all four are Performance NFTs sharing the `_percentile` helper). Per the user batch rule of "create PBIs of 2-3 points (≤5)", individual tasks remain within bounds; the batch grouping is intentional for shared-evaluator coherence.
+
+## Scope
+
+- **AZ-428 / NFT-PERF-01 (AC-4.1)** — Tier-2-only end-to-end latency p95 ≤ 400 ms across two configs (K=3@25 °C + K=2@50 °C hybrid); 5 min Derkachi replay; per-stage partition (D-CROSS-LATENCY-1) recorded for trend (informational).
+- **AZ-429 / NFT-PERF-02 (AC-4.4)** — Frame-by-frame streaming: p95(inter-emit) ≤ 350 ms; no ≥3 consecutive missed-emit window.
+- **AZ-430 / NFT-PERF-03 (AC-NEW-1)** — Tier-2-only cold-start TTFF p95 ≤ 30 s AND max ≤ 45 s over N≥10 iterations.
+- **AZ-431 / NFT-PERF-04 (AC-NEW-2)** — Spoofing-promotion latency p95 ≤ 600 ms over N≥20 randomized-start blackout+spoof events.
+
+## Files
+
+### Created
+
+- `e2e/runner/helpers/streaming_evaluator.py` — inter-emit + missed-emit-window evaluators; shared `_percentile` helper used by the other 3 evaluators.
+- `e2e/runner/helpers/spoof_promotion_evaluator.py` — per-event latency from `t_blackout_onset` → first `dead_reckoned` label switch + aggregate p50/p95/p99.
+- `e2e/runner/helpers/ttff_evaluator.py` — per-iteration TTFF samples + AC-3/AC-4 aggregate.
+- `e2e/runner/helpers/e2e_latency_evaluator.py` — per-frame latency + frame-drop accounting + per-stage partition recording.
+- `e2e/tests/performance/test_nft_perf_01_e2e_latency.py` — NFT-PERF-01 scenario (Tier-2; two configs).
+- `e2e/tests/performance/test_nft_perf_02_streaming.py` — NFT-PERF-02 scenario (Tier-1/2; fc-adapter-aware timestamp extraction).
+- `e2e/tests/performance/test_nft_perf_03_ttff.py` — NFT-PERF-03 scenario (Tier-2-only; fixture-consumer).
+- `e2e/tests/performance/test_nft_perf_04_spoof_promotion.py` — NFT-PERF-04 scenario (Tier-1/2; fixture-consumer).
+- `e2e/_unit_tests/helpers/test_streaming_evaluator.py` — 16 unit tests.
+- `e2e/_unit_tests/helpers/test_spoof_promotion_evaluator.py` — 15 unit tests.
+- `e2e/_unit_tests/helpers/test_ttff_evaluator.py` — 14 unit tests.
+- `e2e/_unit_tests/helpers/test_e2e_latency_evaluator.py` — 15 unit tests.
+
+### Modified
+
+- `e2e/_unit_tests/test_directory_layout.py` — registered 8 new paths.
+
+## Test Results
+
+```
+$ pytest e2e/_unit_tests/helpers/test_streaming_evaluator.py \
+         e2e/_unit_tests/helpers/test_spoof_promotion_evaluator.py \
+         e2e/_unit_tests/helpers/test_ttff_evaluator.py \
+         e2e/_unit_tests/helpers/test_e2e_latency_evaluator.py \
+         e2e/_unit_tests/test_directory_layout.py
+================ 177 passed in 0.34s ================
+```
+
+Scenario collection (24 cases, all parameterised):
+
+```
+$ pytest e2e/tests/performance/ --collect-only -p no:csv
+collected 24 items
+  test_nft_perf_01_e2e_latency: 6 cases
+  test_nft_perf_02_streaming_inter_emit: 6 cases
+  test_nft_perf_03_cold_start_ttff: 6 cases
+  test_nft_perf_04_spoof_promotion_latency: 6 cases
+```
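The count of 24 follows from applying the 2 × 3 parameter matrix to each of the four scenarios. A minimal sketch of that arithmetic is below; the adapter and strategy names are taken from the review's Phase 2 table, while the bracketed id format and the `collected_case_ids` helper are hypothetical and do not reproduce how the suite actually wires `fc_adapter` / `vio_strategy` into pytest.

```python
from itertools import product

# Names as listed in the review's Phase 2 table for AZ-429 AC-3.
FC_ADAPTERS = ("ardupilot", "inav")
VIO_STRATEGIES = ("okvis2", "klt_ransac", "vins_mono")
SCENARIOS = (
    "test_nft_perf_01_e2e_latency",
    "test_nft_perf_02_streaming_inter_emit",
    "test_nft_perf_03_cold_start_ttff",
    "test_nft_perf_04_spoof_promotion_latency",
)


def collected_case_ids():
    """Enumerate scenario x fc_adapter x vio_strategy (id format is illustrative)."""
    return [
        f"{scenario}[{fc}-{vio}]"
        for scenario, fc, vio in product(SCENARIOS, FC_ADAPTERS, VIO_STRATEGIES)
    ]


if __name__ == "__main__":
    ids = collected_case_ids()
    print(len(ids), "cases;", len(ids) // len(SCENARIOS), "per scenario")
```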
+
+Full unit suite: `977 passed, 2 failed` — both failures are pre-existing (`pytest-csv` vs `csv_reporter` plugin conflict on subprocess pytest invocations); confirmed by `git stash` baseline. Not introduced by batch 85.
+
+## AC Verification
+
+### AZ-428 / NFT-PERF-01
+
+| AC | Coverage |
+|----|----------|
+| AC-1 tier guard | `@pytest.mark.tier2_only` |
+| AC-2 K=3@25 °C p95 ≤ 400 ms | per-config assertion in scenario + 4 unit tests |
+| AC-3 K=2 hybrid@50 °C p95 ≤ 400 ms | per-config assertion in scenario |
+| AC-4 frame-drop ≤ 10 % | `LatencyReport.passes_frame_drop` + 3 unit tests |
+| AC-5 partition recorded | `write_partition_csv` (informational; no threshold) + 1 unit test |
+| AC-6 parameterization | 6 collected variants per config |
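The AC-2 to AC-4 gates in the table above can be sketched as a single per-config evaluation. This is an illustration under stated assumptions, not the contents of `e2e_latency_evaluator.py`: `FrameLatencySample`, `LatencyReport` and `passes_frame_drop` are names that appear in this report, but the field layout, the `None`-means-dropped encoding, and the nearest-rank percentile are assumptions.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence

P95_LIMIT_MS = 400.0      # AC-2 / AC-3 threshold
FRAME_DROP_LIMIT = 0.10   # AC-4 threshold


@dataclass(frozen=True)
class FrameLatencySample:
    frame_id: int
    latency_ms: Optional[float]  # assumed encoding: None marks a dropped frame


@dataclass(frozen=True)
class LatencyReport:
    p95_ms: float
    frame_drop_ratio: float

    @property
    def passes_p95(self) -> bool:
        return self.p95_ms <= P95_LIMIT_MS

    @property
    def passes_frame_drop(self) -> bool:
        return self.frame_drop_ratio <= FRAME_DROP_LIMIT


def evaluate(samples: Sequence[FrameLatencySample]) -> LatencyReport:
    """Compute p95 over delivered frames and the dropped-frame ratio."""
    if not samples:
        raise ValueError("no frame samples")
    delivered = sorted(s.latency_ms for s in samples if s.latency_ms is not None)
    if not delivered:
        raise ValueError("all frames dropped")
    if delivered[0] < 0:
        raise ValueError("negative latency")  # fail loud at the evaluator boundary
    rank = math.ceil(95 * len(delivered) / 100)  # nearest-rank p95 (assumption)
    dropped = len(samples) - len(delivered)
    return LatencyReport(delivered[rank - 1], dropped / len(samples))
```

A scenario would call `evaluate` once per config (`k3-25c`, `k2-hybrid-50c`) and assert both properties on each report.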
+
+### AZ-429 / NFT-PERF-02
+
+| AC | Coverage |
+|----|----------|
+| AC-1 p95 inter-emit ≤ 350 ms | `evaluate_inter_emit.passes_p95` + 6 unit tests |
+| AC-2 no ≥3 consecutive missed emits | `evaluate_missed_emits.longest_run` + 4 unit tests |
+| AC-3 parameterization | 6 collected variants (fc_adapter × vio_strategy) |
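To make the two AZ-429 checks concrete, here is a hedged sketch of the inter-emit deltas and the longest run of consecutive missed emits. `MISSED_EMIT_WINDOW_LIMIT` and `longest_run` are named in the review; the function names and the per-frame boolean input (one flag per input frame saying whether an emit was observed) are assumptions about how misses are represented.

```python
from typing import Iterable, List, Sequence

MISSED_EMIT_WINDOW_LIMIT = 3  # AC-2: a run of 3 or more missed emits fails


def inter_emit_deltas_ms(emit_times_ms: Sequence[float]) -> List[float]:
    """Gaps between consecutive emit timestamps, in milliseconds."""
    ordered = sorted(emit_times_ms)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def longest_missed_run(emitted_per_frame: Iterable[bool]) -> int:
    """Length of the longest streak of frames with no observed emit."""
    longest = current = 0
    for emitted in emitted_per_frame:
        current = 0 if emitted else current + 1
        longest = max(longest, current)
    return longest


def passes_missed_emits(emitted_per_frame: Iterable[bool]) -> bool:
    return longest_missed_run(emitted_per_frame) < MISSED_EMIT_WINDOW_LIMIT
```

The p95 gate of AC-1 would then be applied to the list returned by `inter_emit_deltas_ms`.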
+
+### AZ-430 / NFT-PERF-03
+
+| AC | Coverage |
+|----|----------|
+| AC-1 tier guard | `@pytest.mark.tier2_only` |
+| AC-2 clean state per iteration | delegated to Tier-2 harness (AZ-444) — surfaced as F3 |
+| AC-3 p95(TTFF) ≤ 30 s | `te.evaluate.passes_p95` + 4 unit tests |
+| AC-4 max(TTFF) ≤ 45 s | `te.evaluate.passes_max` + 2 unit tests |
+| AC-5 parameterization | 6 collected variants |
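AC-3 and AC-4 gate the same sample set twice, once on p95 and once on the maximum. The sketch below shows why both gates matter; it assumes a nearest-rank percentile and a hypothetical `evaluate_ttff` function returning a plain dict, neither of which is confirmed by this report.

```python
import math
from typing import Dict, Sequence

TTFF_P95_LIMIT_S = 30.0  # AC-3
TTFF_MAX_LIMIT_S = 45.0  # AC-4
MIN_ITERATIONS = 10      # N >= 10 cold-start iterations


def evaluate_ttff(ttff_s: Sequence[float]) -> Dict[str, object]:
    """Aggregate per-iteration TTFF samples into the two pass/fail gates."""
    if len(ttff_s) < MIN_ITERATIONS:
        raise ValueError(f"need at least {MIN_ITERATIONS} iterations, got {len(ttff_s)}")
    if any(value < 0 for value in ttff_s):
        raise ValueError("negative TTFF")
    ordered = sorted(ttff_s)
    rank = math.ceil(95 * len(ordered) / 100)  # nearest-rank p95 (assumption)
    p95 = ordered[rank - 1]
    return {
        "p95_s": p95,
        "max_s": ordered[-1],
        "passes_p95": p95 <= TTFF_P95_LIMIT_S,
        "passes_max": ordered[-1] <= TTFF_MAX_LIMIT_S,
    }
```

Under nearest-rank, p95 coincides with the maximum whenever N < 20, so at the minimum N=10 the 30 s gate is the binding one; the 45 s gate only adds information at larger N or with an interpolating percentile.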
+
+### AZ-431 / NFT-PERF-04
+
+| AC | Coverage |
+|----|----------|
+| AC-1 N≥20 events | `evaluate.passes_event_count` + scenario fixture validation |
+| AC-2 p95 ≤ 600 ms | `evaluate.passes_p95` + 4 unit tests |
+| AC-3 parameterization | 6 collected variants |
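The per-event measurement behind AC-2 runs from `t_blackout_onset` to the first outbound sample labelled `dead_reckoned`. A minimal sketch follows, assuming each event carries `(timestamp_ms, source_label)` pairs; the `label_samples` field and the `promotion_latency_ms` function are hypothetical names, and only `SpoofEvent`, `t_blackout_onset` and the label values come from this report.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

PROMOTED_LABEL = "dead_reckoned"


@dataclass(frozen=True)
class SpoofEvent:
    t_blackout_onset_ms: float
    # Assumed shape: outbound-stream samples as (timestamp_ms, source_label).
    label_samples: Sequence[Tuple[float, str]]


def promotion_latency_ms(event: SpoofEvent) -> Optional[float]:
    """Time from blackout onset to the first dead_reckoned sample, or None."""
    for t_ms, label in sorted(event.label_samples):
        if t_ms >= event.t_blackout_onset_ms and label == PROMOTED_LABEL:
            return t_ms - event.t_blackout_onset_ms
    return None  # no promotion observed; the caller decides how to fail
```

Returning `None` rather than raising keeps the per-event function pure; an aggregate step can then count unpromoted events before computing p50/p95/p99.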
+
+`traces_to` markers:
+- NFT-PERF-01: `AC-4.1,AC-1,AC-2,AC-3,AC-4,AC-5,AC-6`
+- NFT-PERF-02: `AC-4.4,AC-1,AC-2,AC-3`
+- NFT-PERF-03: `AC-NEW-1,AC-1,AC-2,AC-3,AC-4,AC-5`
+- NFT-PERF-04: `AC-NEW-2,AC-1,AC-2,AC-3`
+
+## Code Review
+
+**Verdict**: PASS_WITH_WARNINGS — 0 Critical, 0 High, 1 Medium, 3 Low.
+
+- **F1 (Medium / Maintainability — fixed in batch)**: NFT-PERF-04's `_resolve_events_fixture_path` duplicated the `sitl_observer` import across two branches. Hoisted to function-top during the review pass.
+- **F2 (Low / Spec-Gap surfacing)**: Production dep — `blackout_spoof.py` injector cannot emit N=20 randomized-start events; scenario consumes external fixture from AZ-595 fixture builder. Surfaced + tracked.
+- **F3 (Low / Spec-Gap surfacing)**: AZ-430 AC-2 (per-iteration clean state) delegated to Tier-2 harness (AZ-444). Scenario only consumes the captured fixture.
+- **F4 (Low / Maintainability)**: CSV-emit boilerplate duplicated across 4 evaluators. Future hygiene PBI.
+
+Full review: `_docs/03_implementation/reviews/batch_85_review.md`.
+
+## Production Dependencies
+
+Surfaced for the cumulative review window (85-87) + traceability matrix:
+
+1. **AZ-444 (Tier-2 runner)**: per-iteration `fdr-output` volume wipe + SUT cold lifecycle restart for NFT-PERF-03; `tier2-on-jetson.sh` orchestration of N=10 iterations.
+2. **AZ-595 (fixture builder)**: emit `nft_perf_01_latency.json` (N=900 frames × 2 configs + per-stage partition samples), `nft_perf_02_streaming` capture, `nft_perf_03_ttff.json` (N≥10 iteration records), `nft_perf_04_events.json` (N≥20 randomized-start blackout+spoof events with per-event outbound-label samples).
+3. **SUT-side**: outbound stream MUST carry `source_label` ∈ {`satellite_anchored`, `visual_propagated`, `dead_reckoned`} for NFT-PERF-04 to detect promotion; FDR (or equivalent) MUST expose per-stage timings (C1, C2, C2.5, C3, C3.5, C4, C4 cov, C5, serialization, OS jitter) for NFT-PERF-01 AC-5 partition recording.
+4. **AZ-595 + Derkachi flight**: K=2 + Jacobian-cov hybrid auto-degrade configuration must be activatable from fixture-builder side so the K=2@50 °C config captures the right SUT mode.
+5. **Already exists**: `sitl_replay_ready` fixture, `mavproxy_tlog_reader`, `msp_frame_observer`, `sitl_observer.replay_dir()`, `evidence_dir`, `nfr_recorder` (AZ-406).
+6. **Already exists**: `fc_adapter` / `vio_strategy` parameterization, `tier2_only` marker, `scenario_id` marker, `traces_to` marker.
+
+## Architecture Compliance
+
+- All new files under `e2e/`, owned by the Blackbox Tests component per `_docs/02_document/module-layout.md`.
+- No imports from `src/gps_denied_onboard` (verified — explicit "does NOT import" notes in evaluator docstrings).
+- No new cyclic dependencies. New evaluators share `streaming_evaluator._percentile` only.
+- No new infrastructure libraries.
+
+## Sub-step Trace
+
+Phases executed per `implement/SKILL.md`:
+- phase 5 (load-spec) → 4 task specs read
+- phase 6 (implement-tasks-sequentially) → helpers + scenarios + unit tests for all 4 tasks
+- phase 7 (verify-ac-coverage) → ACs traced above
+- phase 8 (code-review) → batch_85_review.md (PASS_WITH_WARNINGS)
+- phase 8.5 (cumulative-review) → defer to batch 87 (K=3 window starts at batch 85)
+- phase 11 (commit-batch) → next.
@@ -0,0 +1,105 @@
+# Code Review Report — Batch 85
+
+**Batch**: 85 (AZ-428 + AZ-429 + AZ-430 + AZ-431 — Performance NFTs)
+**Date**: 2026-05-17
+**Verdict**: PASS_WITH_WARNINGS
+
+## Findings
+
+| # | Severity | Category | File:Line | Title |
+|---|----------|----------|-----------|-------|
+| 1 | Medium | Maintainability | `e2e/tests/performance/test_nft_perf_04_spoof_promotion.py:127-145` | Duplicate `sitl_observer` import across branches — **fixed in batch** |
+| 2 | Low | Spec-Gap (surfacing) | `e2e/tests/performance/test_nft_perf_04_spoof_promotion.py` | Production dependency: injector cannot emit N=20 randomized-start events |
+| 3 | Low | Spec-Gap (surfacing) | `e2e/tests/performance/test_nft_perf_03_ttff.py` | AC-2 (clean-state per iteration) delegated to Tier-2 harness (AZ-444) |
+| 4 | Low | Maintainability | `e2e/runner/helpers/{ttff,spoof_promotion,e2e_latency,streaming}_evaluator.py` | CSV-emit boilerplate duplicated across 4 evaluators |
+
+### Finding Details
+
+**F1: Duplicate `sitl_observer` import across branches** (Medium / Maintainability — **fixed in batch**)
+- Location: `e2e/tests/performance/test_nft_perf_04_spoof_promotion.py:132,140`
+- Description: `_resolve_events_fixture_path` imported `sitl_observer` inside two separate branches. NFT-PERF-01 and NFT-PERF-03 already hoist the import once at the top of the resolver.
+- Resolution: Hoisted the import to the top of the function during this batch.
+- Task: AZ-431
+
+**F2: Production dependency — injector cannot emit N=20 randomized-start events** (Low / Spec-Gap — surfacing)
+- Location: `e2e/tests/performance/test_nft_perf_04_spoof_promotion.py`
+- Description: AZ-431 AC-1 says "N≥20 events via `blackout_spoof.py` with randomized window starts". Current `blackout_spoof.py` only randomizes spoofed GPS values via `seed`; the blackout-window start is hardcoded. The scenario therefore consumes an external `E2E_NFT_PERF_04_EVENTS_FIXTURE` produced by the fixture builder (AZ-595). Scenario fails loudly when the fixture is missing or empty.
+- Suggestion: Track as production dependency for AZ-595 (fixture builder) — extend the SITL replay builder to emit `nft_perf_04_events.json` with N≥20 randomized-start records.
+- Task: AZ-431
+
+**F3: AC-2 (clean-state per iteration) delegated to Tier-2 harness** (Low / Spec-Gap — surfacing)
+- Location: `e2e/tests/performance/test_nft_perf_03_ttff.py`
+- Description: AZ-430 AC-2 requires per-iteration `fdr-output` volume wipe + cold SUT restart. Per scope-discipline these lifecycle concerns belong to the Tier-2 harness (AZ-444 / AZ-595 fixture builder), not to the in-pytest scenario. The scenario only consumes a pre-captured `nft_perf_03_ttff.json` with N≥10 iteration records.
+- Suggestion: Track as production dependency for AZ-444 (Tier-2 runner) — wire the per-iteration lifecycle reset and fixture builder.
+- Task: AZ-430
+
+**F4: CSV-emit boilerplate duplicated across 4 evaluators** (Low / Maintainability)
+- Location: `e2e/runner/helpers/streaming_evaluator.py`, `spoof_promotion_evaluator.py`, `ttff_evaluator.py`, `e2e_latency_evaluator.py`
+- Description: Each evaluator implements `write_csv_evidence` + `write_per_*` with the same shape (open file, write header, write rows, return path). Aggregate CSV row formatting is also boilerplate-heavy.
+- Suggestion: Future hygiene PBI — extract a `_emit_csv(path, header, rows)` helper. Not blocking; current code is readable and isolated per scenario.
+- Task: AZ-428 / AZ-429 / AZ-430 / AZ-431
+
+## Phase Notes
+
+### Phase 1 — Context
+All 4 task specs read; ACs walked through against helpers + scenarios.
+
+### Phase 2 — Spec Compliance
+
+| Task | AC | Evidence |
+|------|----|----------|
+| AZ-429 | AC-1 p95 ≤ 350 ms | `streaming_evaluator.evaluate_inter_emit.passes_p95` + scenario assertion |
+| AZ-429 | AC-2 no ≥3-emit gap | `evaluate_missed_emits.longest_run < MISSED_EMIT_WINDOW_LIMIT` |
+| AZ-429 | AC-3 parameterization | 6 collected variants (ardupilot/inav × {okvis2, klt_ransac, vins_mono}) |
+| AZ-431 | AC-1 N≥20 events | `evaluate.passes_event_count` + fixture validation |
+| AZ-431 | AC-2 p95 ≤ 600 ms | `evaluate.passes_p95` + scenario assertion |
+| AZ-431 | AC-3 parameterization | 6 collected variants |
+| AZ-430 | AC-1 tier guard | `@pytest.mark.tier2_only` |
+| AZ-430 | AC-2 clean state | delegated to Tier-2 harness (AZ-444) — F3 surfaced |
+| AZ-430 | AC-3 p95 ≤ 30 s | `te.evaluate.passes_p95` |
+| AZ-430 | AC-4 max ≤ 45 s | `te.evaluate.passes_max` |
+| AZ-430 | AC-5 parameterization | 6 collected variants |
+| AZ-428 | AC-1 tier guard | `@pytest.mark.tier2_only` |
+| AZ-428 | AC-2 K=3@25 °C p95 ≤ 400 ms | per-config assertion (`config_id == "k3-25c"`) |
+| AZ-428 | AC-3 K=2@50 °C p95 ≤ 400 ms | per-config assertion (`config_id == "k2-hybrid-50c"`) |
+| AZ-428 | AC-4 frame drop ≤ 10 % | `LatencyReport.passes_frame_drop` per config |
+| AZ-428 | AC-5 partition recorded | `write_partition_csv` (informational, no threshold) |
+| AZ-428 | AC-6 parameterization | 6 collected variants per config; both configs run per param |
+
+### Phase 3 — Code Quality
+- SOLID: each evaluator owns one responsibility; fc-adapter-specific timestamp extraction lives in the AZ-429 scenario (`_read_emit_times_ms`) rather than leaking into the evaluator.
+- Error handling: `ValueError` on negative latency/TTFF (fail-loud at evaluator boundary); `pytest.fail` on malformed fixture (fail-loud at scenario boundary). No bare `except`.
+- DRY: `streaming_evaluator._percentile` re-used by `ttff_evaluator`, `spoof_promotion_evaluator`, and `e2e_latency_evaluator`; correct shared-helper pattern.
+- Tests: all use the Arrange/Act/Assert pattern with `# Arrange / # Act / # Assert` markers per `.cursor/rules/coderule.mdc`.
+- Naming: scenario function names mirror task IDs (`test_nft_perf_0N_*`); helper symbols use full domain words (`ColdStartIteration`, `FrameLatencySample`, `SpoofEvent`).
+
+### Phase 4 — Security
+- No subprocess / shell=True / eval / exec usage in new code.
+- No hardcoded secrets.
+- Input from fixtures parsed via `json.loads` (safe); shape validated with explicit `pytest.fail` on malformed records — no insecure deserialisation.
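The parse-then-validate approach described above might look like the following sketch. It is illustrative only: the `FixtureError` exception stands in for `pytest.fail` to keep the example dependency-free, `load_events_fixture` is a hypothetical name, and the assumed record shape (a JSON list of objects carrying `t_blackout_onset`) is not confirmed by this report.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


class FixtureError(Exception):
    """Stand-in for pytest.fail: raised when a fixture is missing or malformed."""


def load_events_fixture(path: Path, min_events: int = 20) -> List[Dict[str, Any]]:
    """Load an events fixture and fail loudly on any shape problem."""
    if not path.is_file():
        raise FixtureError(f"fixture missing: {path}")
    try:
        payload = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise FixtureError(f"fixture is not valid JSON: {path}") from exc
    if not isinstance(payload, list) or len(payload) < min_events:
        raise FixtureError(f"expected a list of at least {min_events} events in {path}")
    for index, record in enumerate(payload):
        # Assumed required key; the real fixture schema may differ.
        if not isinstance(record, dict) or "t_blackout_onset" not in record:
            raise FixtureError(f"malformed event record at index {index}")
    return payload
```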
+
+### Phase 5 — Performance
+- One sort per percentile call (`sorted(values)`); fixtures are ≤ N=900 per config — negligible.
+- No N+1 patterns; no blocking I/O in async contexts.
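For reference, a one-sort percentile in the spirit of the shared helper can be written as below. The report does not say which percentile definition `_percentile` uses, so the nearest-rank rule here is an assumption, and the public name `percentile` is chosen for the sketch only.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty sequence")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)  # the single O(n log n) sort per call
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

At N=900 samples per config the sort cost is negligible, which matches the assessment above; an interpolating definition would give slightly different values near the tail.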
+
+### Phase 6 — Cross-Task Consistency
+- All 4 evaluators share the `_percentile` helper from `streaming_evaluator`.
+- All 4 scenarios follow the identical fixture-consumer pattern (resolve fixture path → load → evaluate → write CSV evidence → record NFR metrics → assert).
+- All 4 scenarios use `@pytest.mark.scenario_id` + `@pytest.mark.traces_to` consistently.
+
+### Phase 7 — Architecture Compliance
+- All new files under `e2e/` (Blackbox Tests component per `_docs/02_document/module-layout.md`).
+- No imports from `src/gps_denied_onboard` (verified — explicit "does NOT import" notes in evaluator docstrings).
+- No new cyclic dependencies.
+- No duplicate symbols across components.
+
+## Verdict Logic
+
+- 0 Critical, 0 High → not FAIL.
+- 1 Medium, 3 Low → **PASS_WITH_WARNINGS**.
+
+F1 (duplicate import) was the only actionable finding without a downstream dependency; it was fixed in this batch by hoisting the import to the top of the resolver.
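The severity-to-verdict mapping in this section can be expressed as a small function. This is a sketch of the two rules stated above; the `verdict` name and the plain `PASS` outcome for a review with no findings are assumptions, since the report only shows the FAIL and PASS_WITH_WARNINGS branches.

```python
from collections import Counter
from typing import Iterable


def verdict(severities: Iterable[str]) -> str:
    """Map finding severities to a review verdict."""
    counts = Counter(severity.lower() for severity in severities)
    if counts["critical"] or counts["high"]:
        return "FAIL"
    if counts["medium"] or counts["low"]:
        return "PASS_WITH_WARNINGS"
    return "PASS"  # assumed outcome when there are no findings
```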
+
+## Cumulative Trigger
+
+Batch 85 advances the K-counter to 1 of K=3 past the cumulative baseline (batches 82-84). The cumulative review will trigger at batch 87.