# Performance Test Run — 2026-05-19 — workstation Tier-1 probe

**Invoked by**: autodev greenfield Step 15 — `.cursor/skills/test-run/SKILL.md` perf mode.
**Host**: developer Mac workstation (no Jetson hardware, no `E2E_SITL_REPLAY_DIR` fixture mounted).
**Runner**: `scripts/run-performance-tests.sh` + direct `pytest e2e/tests/performance/` probe.
**Run ID**: `workstation-tier1-probe`.
**Status**: **Unverified across all 4 production perf NFRs; pure-logic evaluator unit tests Pass (70/70).** No regression detected because no measurement was possible. No Warn / Fail to gate on. **Not blocking deploy** per the skill's "Any Unverified scenarios with no Warn/Fail" rule.

## What ran

### A) `scripts/run-performance-tests.sh`

```text
Tier-2 perf tests skipped (GPS_DENIED_TIER!=2).
exit=0
```

The runner script is deliberately a Tier-2 gate (`pytest -m tier2 -q tests/perf` only when `GPS_DENIED_TIER=2`). On Tier-1 / workstation it prints the skip notice above and exits 0. This is by design: the canonical perf measurements require Jetson Orin Nano Super hardware (D-C7-9, JetPack 6.2, TensorRT 10.3); a workstation run would produce numbers that do not meet the pinned-hardware budgets and would actively mislead trend tracking.
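
The gate described for the runner script can be spelled out in a few lines of Python. This is an illustration only: the real runner is a shell script whose contents are not reproduced in this report, the function name `perf_pytest_args` is hypothetical, and the sketch assumes the gate inspects nothing other than `GPS_DENIED_TIER`.

```python
from typing import List, Mapping, Optional


def perf_pytest_args(env: Mapping[str, str]) -> Optional[List[str]]:
    """Return the pytest argv for a Tier-2 run, or None when the gate skips.

    Hypothetical helper: mirrors the behaviour described for
    scripts/run-performance-tests.sh, assuming GPS_DENIED_TIER is the only input.
    """
    if env.get("GPS_DENIED_TIER") != "2":
        # Tier-1 / workstation: nothing to run; the caller logs the skip and exits 0.
        return None
    return ["pytest", "-m", "tier2", "-q", "tests/perf"]
```

Calling `perf_pytest_args(os.environ)` on the workstation used for this run would return `None`, which corresponds to the skip notice shown in the output above.
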
### B) Direct `pytest e2e/tests/performance/` probe (24 parameterizations)

| NFR | Configs | Outcome | Skip reason |
|---|---|---|---|
| **NFT-PERF-01** (E2E latency p95 ≤ 400 ms — AC-4.1) | 6 ({ardupilot, inav} × {okvis2, klt_ransac, vins_mono}) | 6 skipped | "Tier-2 only — Jetson hardware required" |
| **NFT-PERF-02** (frame-by-frame streaming, inter-emit p95 ≤ 350 ms — AC-4.4) | 6 ({ardupilot, inav} × {okvis2, klt_ransac, vins_mono}) | 4 skipped (no fixture) + 2 skipped (vins_mono research-only per D-C1-1-SUB-A) | "requires `E2E_SITL_REPLAY_DIR` (AZ-595) carrying the 5 min Derkachi @ 3 Hz replay" |
| **NFT-PERF-03** (cold-start TTFF p95 ≤ 30 s — AC-NEW-1) | 6 | 6 skipped | "Tier-2 only — Jetson hardware required" |
| **NFT-PERF-04** (spoof-promotion p95 ≤ 600 ms — AC-NEW-2) | 6 | 4 skipped (no fixture) + 2 skipped (vins_mono research-only per D-C1-1-SUB-A) | "requires `E2E_SITL_REPLAY_DIR` (AZ-595) containing N≥20 randomized-start blackout+spoof events" |

Total: 24 skipped, 0 passed, 0 failed, 0 errored. Exit code 0.

### C) Pure-logic evaluator unit tests — `e2e/_unit_tests/helpers/test_*_evaluator.py`

The four perf NFRs each map to a pure-logic evaluator that computes the gate (p95 / inter-emit interval / TTFF distribution / spoof-promotion latency) from a recorded sample set.
These evaluators are tested without any SITL / Jetson dependency:

```text
e2e/_unit_tests/helpers/test_e2e_latency_evaluator.py      → covers NFT-PERF-01 AC-2/3/4 math
e2e/_unit_tests/helpers/test_streaming_evaluator.py        → covers NFT-PERF-02 AC-1/AC-2 math
e2e/_unit_tests/helpers/test_ttff_evaluator.py             → covers NFT-PERF-03 AC-3/AC-4 math
e2e/_unit_tests/helpers/test_spoof_promotion_evaluator.py  → covers NFT-PERF-04 AC-1/AC-2 math
```

```text
$ .venv/bin/python -m pytest e2e/_unit_tests/helpers/test_e2e_latency_evaluator.py \
    e2e/_unit_tests/helpers/test_streaming_evaluator.py \
    e2e/_unit_tests/helpers/test_ttff_evaluator.py \
    e2e/_unit_tests/helpers/test_spoof_promotion_evaluator.py \
    --no-header -q
...................................................................... [100%]
70 passed in 0.50s
```

**70/70 pass.** Confirms that the threshold-comparison logic (percentile estimators, inter-emit interval, TTFF distribution, spoof-onset → label-switch delta) is correct independent of whether real measurements have been recorded yet. A future hardware run feeds JSON fixtures into the same evaluators — only the input data changes, not the math.

## Threshold comparison (Step 3 of skill)

Per the skill's Step 3, thresholds load from `_docs/02_document/tests/performance-tests.md`. The thresholds exist and are documented but no scenario produced a measurement to compare them against.

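
The pure-logic checks described above amount to reducing a recorded sample set to a percentile and comparing it with a budget. A minimal sketch of that idea is given here; it is not the project's evaluator code. The function names, the nearest-rank percentile definition, and the choice to return `Unverified` for an empty sample set are assumptions made for illustration, and the Warn band is left out because this report does not define it.

```python
from typing import Optional, Sequence


def p95_nearest_rank(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile (assumed estimator, not necessarily the project's)."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) computed with integers to avoid floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]  # rank is 1-based


def verdict(samples_ms: Optional[Sequence[float]], budget_ms: float) -> str:
    """Return 'Unverified' when nothing was recorded, else 'Pass' or 'Fail'."""
    if not samples_ms:
        return "Unverified"
    return "Pass" if p95_nearest_rank(samples_ms) <= budget_ms else "Fail"
```

With no recorded samples, as in this run, `verdict(None, 400.0)` yields `Unverified`; once a hardware run supplies latencies, the same call returns `Pass` or `Fail` with no change to the math.
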
| NFR | Threshold | Observed | Verdict |
|---|---|---|---|
| NFT-PERF-01 | p95 ≤ 400 ms (K=3 baseline AND K=2 hybrid auto-degrade) + ≤10 % frame drops | — | **Unverified** (Tier-2 hardware required) |
| NFT-PERF-02 | p95 inter-emit interval ≤ 350 ms; no window of ≥3 missed-emit gaps | — | **Unverified** (`E2E_SITL_REPLAY_DIR` fixture not yet recorded; AZ-595) |
| NFT-PERF-03 | p95 TTFF < 30 s (50 cold boots) | — | **Unverified** (Tier-2 hardware required) |
| NFT-PERF-04 | p95 < 3 s on both FCs (50 trials per FC) | — | **Unverified** (`E2E_SITL_REPLAY_DIR` fixture not yet recorded; AZ-595) |

## Classification

Per the skill's perf-mode reporting:

```text
══════════════════════════════════════
 PERF RESULTS
══════════════════════════════════════
 Scenarios: [pass 0 · warn 0 · fail 0 · unverified 4]
──────────────────────────────────────
 1. NFT-PERF-01 — Unverified — Tier-2 Jetson hardware required
 2. NFT-PERF-02 — Unverified — SITL replay fixture pending (AZ-595)
 3. NFT-PERF-03 — Unverified — Tier-2 Jetson hardware required
 4. NFT-PERF-04 — Unverified — SITL replay fixture pending (AZ-595)
──────────────────────────────────────
 Pure-logic evaluator coverage: 70/70 unit tests pass
 (e2e/_unit_tests/helpers/test_{e2e_latency,streaming,ttff,spoof_promotion}_evaluator.py)
══════════════════════════════════════
```

## Coverage gap assessment (skill Step 5: "Unverified")

Per the skill:

> **Any Unverified scenarios with no Warn/Fail** → not blocking, but surface them in the report so the user knows coverage gaps exist. Suggest running `/test-spec` to add expected results next cycle.

This run has **0 Warn + 0 Fail + 4 Unverified**, so:

- **Not deploy-blocking.** The perf gate is allowed to be Unverified when the SUT is not yet running on its canonical hardware.
- **Coverage gap is fully cataloged.** Each Unverified scenario points at a concrete task:
  - **NFT-PERF-01 / NFT-PERF-03**: AZ-444 (Tier-2 Jetson harness) is the recording-phase task.
    When AZ-444 lands, these scenarios run on the Jetson and produce numbers — at which point this report's "Unverified" entries become "Pass / Warn / Fail" against the AC-4.1 / AC-NEW-1 thresholds.
  - **NFT-PERF-02 / NFT-PERF-04**: AZ-595 (SITL replay fixture builder) is the recording task. When AZ-595 lands, the fixtures are committed under `e2e/fixtures/sitl_replay/`, `E2E_SITL_REPLAY_DIR` is set, and the scenarios run on Tier-1.
- **The thresholds, evaluators, parameterizations, and report wiring are all in place.** Recording is the only gap, not test design.

## Anti-patterns explicitly NOT used

Per the skill's anti-pattern guidance:

- **No improvised perf tests.** Did not synthesize a workstation-only "approximation" of any NFR; the AC-4.1 / AC-NEW-1 / AC-NEW-2 / AC-4.4 budgets are pinned to canonical hardware and synthetic Tier-1 numbers would mislead the trend-tracker.
- **No skip-acceptance without justification.** Each Unverified entry is cataloged against a concrete recording task (AZ-444 / AZ-595).
- **No threshold downgrade.** Did not soften any threshold to make a Tier-1 measurement "pass".

## Two minor housekeeping items (Low)

1. **Unregistered pytest mark `tier2_only`** — pytest warnings at `e2e/tests/performance/test_nft_perf_01_e2e_latency.py:61` and `e2e/tests/performance/test_nft_perf_03_ttff.py:48`. Add `tier2_only: marks scenarios that require Jetson hardware` to `e2e/runner/pytest.ini` `markers` list.
2. **`scripts/run-performance-tests.sh` is intentionally a Tier-2 stub.** This is documented in the script header; not a defect, just a reminder that the Tier-1 path is "skip + log" by design. If a Tier-1 perf trend-tracking workflow is ever desired, add an explicit branch (e.g. invoke the pure-logic evaluators against a smaller `derkachi-short-fixture`).
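
The first housekeeping item recommends registering the mark in the `pytest.ini` `markers` list. For reference, the same registration can also be done programmatically with pytest's `pytest_configure` hook and `config.addinivalue_line`. The sketch below assumes a `conftest.py` that pytest collects for the performance tests; that file location is an assumption and is not something this report confirms.

```python
# conftest.py (hypothetical location; the report itself recommends pytest.ini)


def pytest_configure(config):
    """Register the custom mark so pytest no longer warns that it is unknown."""
    config.addinivalue_line(
        "markers",
        "tier2_only: marks scenarios that require Jetson hardware",
    )
```

Either route should clear the unknown-mark warnings; the `pytest.ini` entry keeps the marker list in one declarative place, while the hook keeps it next to the test code.
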
## Cross-Reference Index

| Source | Purpose |
|---|---|
| `_docs/02_document/tests/performance-tests.md` | Threshold + scenario spec |
| `scripts/run-performance-tests.sh` | Runner script (current Tier-2 stub) |
| `_docs/02_tasks/todo/AZ-444*` | Tier-2 Jetson harness (recording-phase task) |
| `_docs/02_tasks/todo/AZ-595*` | SITL replay fixture builder (recording task) |
| `_docs/02_tasks/todo/AZ-{428..431}*` | NFT-PERF-{01..04} scenario tasks (currently complete on the runner side; the harness side is pending) |
| `_docs/06_metrics/` (this directory) | Per-run perf trend artefacts |