gps-denied-onboard/_docs/06_metrics/perf_2026-05-26_cycle3-tier1-probe.md

# Performance Test Run — 2026-05-26 — Cycle 3 Tier-1 probe

**Invoked by**: autodev existing-code Step 15 (cycle 3) — `.cursor/skills/test-run/SKILL.md` perf mode.
**Host**: developer Mac workstation (Darwin arm64, no Jetson hardware, no `E2E_SITL_REPLAY_DIR` fixture mounted).
**Runner**: `scripts/run-performance-tests.sh` + direct `pytest e2e/tests/performance/` probe + pure-logic evaluator unit tests.
**Run ID**: `cycle3-tier1-probe`.
**Status**: **Unverified across all 4 production perf NFRs; pure-logic evaluator unit tests Pass (70/70).** No regression detected because no measurement was possible. No Warn / Fail to gate on. **Not blocking deploy** per the skill's "Any Unverified scenarios with no Warn/Fail" rule.

## Why this cycle re-ran the same probe

Cycle 3 work touched only pre-flight / offline code paths:

| Task | Layer | Runtime hot-path impact |
|---|---|---|
| AZ-836 `tlog_route_extractor` | Pre-flight (operator workstation) | None — extraction runs once per flight, before takeoff |
| AZ-838 `SatelliteProviderRouteClient` | Pre-flight (operator workstation) | None — HTTP client against satellite-provider's Route API |
| AZ-839 `operator_pre_flight_setup` real fixture | Test infrastructure | None — fixture composes existing pre-flight components |
| AZ-840 E2E orchestrator test | Test only | None |
| AZ-777 Derkachi C6 reference fixture + C11 inventory adapter | Pre-flight + C11 download path | C11 `TileDownloader` is invoked at pre-flight (operator workstation), not in-flight — airborne process has no egress (RESTRICT-OPS-1, NFT-SEC-02) |
| AZ-845 `RouteSpec` relocation | Refactor (type re-home) | None — public API unchanged |
| AZ-846 `module-layout.md` refresh | Docs | None |
| AZ-847 Lint widening | Test only | None |

None of these changes touches the airborne pipeline that NFT-PERF-01..04 measure (E2E latency, frame-by-frame streaming, cold-start TTFF, spoof-promotion). The 2026-05-19 baseline (`perf_2026-05-19_workstation-tier1-probe.md`) remains the most recent measurement of record. This run reproduces the same 4× Unverified outcome, so nothing observable at Tier-1 has changed; it does not measure the NFRs themselves.

## What ran

### A) `scripts/run-performance-tests.sh`

```text
Tier-2 perf tests skipped (GPS_DENIED_TIER!=2).
exit=0
```

The script is a Tier-2 gate: it runs `pytest -m tier2 -q tests/perf` only when `GPS_DENIED_TIER=2`. On Tier-1 it prints the skip notice above and exits 0 by design. Canonical perf measurements require Jetson Orin Nano Super hardware (D-C7-9, JetPack 6.2, TensorRT 10.3); a workstation run would produce numbers that do not correspond to the pinned-hardware budgets and would actively mislead trend tracking.

### B) Direct `pytest e2e/tests/performance/` probe (24 parameterizations)

| NFR | Configs | Outcome | Skip reason |
|---|---|---|---|
| **NFT-PERF-01** (E2E latency p95 ≤ 400 ms — AC-4.1) | 6 ({ardupilot, inav} × {okvis2, klt_ransac, vins_mono}) | 6 skipped | "Tier-2 only — Jetson hardware required" |
| **NFT-PERF-02** (frame-by-frame streaming, inter-emit p95 ≤ 350 ms — AC-4.4) | 6 ({ardupilot, inav} × {okvis2, klt_ransac, vins_mono}) | 4 skipped (no fixture) + 2 skipped (vins_mono research-only per D-C1-1-SUB-A) | "requires `E2E_SITL_REPLAY_DIR` (AZ-595) carrying the 5 min Derkachi @ 3 Hz replay" |
| **NFT-PERF-03** (cold-start TTFF p95 ≤ 30 s — AC-NEW-1) | 6 | 6 skipped | "Tier-2 only — Jetson hardware required" |
| **NFT-PERF-04** (spoof-promotion p95 ≤ 600 ms — AC-NEW-2) | 6 | 4 skipped (no fixture) + 2 skipped (vins_mono research-only per D-C1-1-SUB-A) | "requires `E2E_SITL_REPLAY_DIR` (AZ-595) containing N≥20 randomized-start blackout+spoof events" |

Total: 24 skipped, 0 passed, 0 failed, 0 errored. Exit code 0.
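
The 4 + 2 split in the NFT-PERF-02 and NFT-PERF-04 rows follows directly from the parameter grid. The sketch below reproduces that arithmetic only; `skip_reason`, `tally`, and the constant names are hypothetical and are not taken from the test suite, which may order its checks differently.

```python
from collections import Counter
from itertools import product

# Grid from the table above; the constant names are illustrative.
FLIGHT_CONTROLLERS = ("ardupilot", "inav")
BACKENDS = ("okvis2", "klt_ransac", "vins_mono")
RESEARCH_ONLY = {"vins_mono"}  # per D-C1-1-SUB-A


def skip_reason(backend, replay_dir):
    """Return why a replay-based config is skipped, or None if it can run."""
    if backend in RESEARCH_ONLY:
        return "research-only"
    if not replay_dir:
        return "no fixture"
    return None


def tally(replay_dir):
    """Count skip reasons over the full flight-controller x backend grid."""
    return Counter(
        skip_reason(backend, replay_dir)
        for _fc, backend in product(FLIGHT_CONTROLLERS, BACKENDS)
    )


print(dict(tally(replay_dir=None)))
```

With no replay directory set, the grid yields 4 `no fixture` skips and 2 `research-only` skips per NFR, matching the table.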

### C) Pure-logic evaluator unit tests — `e2e/_unit_tests/helpers/test_*_evaluator.py`

```text
$ .venv/bin/python -m pytest e2e/_unit_tests/helpers/test_e2e_latency_evaluator.py \
                              e2e/_unit_tests/helpers/test_streaming_evaluator.py \
                              e2e/_unit_tests/helpers/test_ttff_evaluator.py \
                              e2e/_unit_tests/helpers/test_spoof_promotion_evaluator.py \
                              -v --tb=short
======================= 70 passed in 0.25s ========================
```

**70/70 pass.** Identical to 2026-05-19 — confirms percentile estimators, inter-emit interval math, TTFF distribution math, and spoof-onset → label-switch delta math are still correct. A future hardware run feeds JSON fixtures into the same evaluators — only the input data changes, not the math.
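
For readers unfamiliar with the math these unit tests exercise, here is a minimal sketch of a p95 check over inter-emit intervals. It assumes a nearest-rank percentile; the repository's evaluators may use a different estimator (for example linear interpolation). The function names and the timestamps are hypothetical, not measured data.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile: the sample at 1-based rank ceil(pct * n / 100)."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def inter_emit_intervals_ms(emit_times_ms):
    """Differences between consecutive emit timestamps, in milliseconds."""
    return [b - a for a, b in zip(emit_times_ms, emit_times_ms[1:])]


# Hypothetical emit timestamps (ms), for illustration only.
emits = [0, 300, 610, 900, 1250, 1540]
gaps = inter_emit_intervals_ms(emits)
p95 = nearest_rank_percentile(gaps, 95)
print(gaps, p95, p95 <= 350)
```

A hardware run would replace `emits` with recorded timestamps; the comparison against the 350 ms budget stays the same.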

## Threshold comparison (Step 3 of skill)

Per the skill's Step 3, thresholds load from `_docs/02_document/tests/performance-tests.md`. The thresholds exist and are documented but no scenario produced a measurement to compare them against.

| NFR | Threshold | Observed | Verdict |
|---|---|---|---|
| NFT-PERF-01 | p95 ≤ 400 ms (K=3 baseline AND K=2 hybrid auto-degrade) + ≤10 % frame drops | — | **Unverified** (Tier-2 hardware required) |
| NFT-PERF-02 | p95 inter-emit interval ≤ 350 ms; no window of ≥3 missed-emit gaps | — | **Unverified** (`E2E_SITL_REPLAY_DIR` fixture not yet recorded; AZ-595) |
| NFT-PERF-03 | p95 TTFF < 30 s (50 cold boots) | — | **Unverified** (Tier-2 hardware required) |
| NFT-PERF-04 | p95 < 3 s on both FCs (50 trials per FC) | — | **Unverified** (`E2E_SITL_REPLAY_DIR` fixture not yet recorded; AZ-595) |

## Classification

Per the skill's perf-mode reporting:

```text
══════════════════════════════════════
 PERF RESULTS
══════════════════════════════════════
 Scenarios: [pass 0 · warn 0 · fail 0 · unverified 4]
──────────────────────────────────────
 1. NFT-PERF-01 — Unverified — Tier-2 Jetson hardware required
 2. NFT-PERF-02 — Unverified — SITL replay fixture pending (AZ-595)
 3. NFT-PERF-03 — Unverified — Tier-2 Jetson hardware required
 4. NFT-PERF-04 — Unverified — SITL replay fixture pending (AZ-595)
──────────────────────────────────────
 Pure-logic evaluator coverage: 70/70 unit tests pass
 (e2e/_unit_tests/helpers/test_{e2e_latency,streaming,ttff,spoof_promotion}_evaluator.py)
══════════════════════════════════════
```

## Coverage gap assessment (skill Step 5: "Unverified")

Per the skill:

> **Any Unverified scenarios with no Warn/Fail** → not blocking, but surface them in the report so the user knows coverage gaps exist. Suggest running `/test-spec` to add expected results next cycle.

This run has **0 Warn + 0 Fail + 4 Unverified**, so:

- **Not deploy-blocking.** The perf gate is allowed to be Unverified when the SUT is not yet running on its canonical hardware.
- **Coverage gap is unchanged from 2026-05-19** — same two recording-phase prerequisites:
  - **NFT-PERF-01 / NFT-PERF-03**: AZ-444 (Tier-2 Jetson harness). When AZ-444 lands, these scenarios run on the Jetson and produce numbers — at which point this report's "Unverified" entries become "Pass / Warn / Fail" against the AC-4.1 / AC-NEW-1 thresholds.
  - **NFT-PERF-02 / NFT-PERF-04**: AZ-595 (SITL replay fixture builder). When AZ-595 lands, the fixtures are committed under `e2e/fixtures/sitl_replay/`, `E2E_SITL_REPLAY_DIR` is set, and the scenarios run on Tier-1.
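
The rule quoted above can be expressed as a small tally plus one predicate. This is a sketch under stated assumptions: `summarize` and `unverified_only` are hypothetical names, and the handling of Warn or Fail results is deliberately left out because this report does not describe it.

```python
from collections import Counter

VERDICTS = ("pass", "warn", "fail", "unverified")


def summarize(results):
    """Tally verdicts; `results` maps an NFR id to a verdict string."""
    counts = Counter(results.values())
    unknown = set(counts) - set(VERDICTS)
    if unknown:
        raise ValueError(f"unknown verdicts: {sorted(unknown)}")
    return {v: counts.get(v, 0) for v in VERDICTS}


def unverified_only(counts):
    """True when the quoted rule applies: some Unverified, no Warn, no Fail."""
    return counts["unverified"] > 0 and counts["warn"] == 0 and counts["fail"] == 0


# The four scenarios from this run.
run = {f"NFT-PERF-0{i}": "unverified" for i in range(1, 5)}
counts = summarize(run)
print(" · ".join(f"{v} {counts[v]}" for v in VERDICTS))
print("coverage gap, not blocking:", unverified_only(counts))
```

For this run the tally is `pass 0 · warn 0 · fail 0 · unverified 4` and the predicate is true, which is the not-blocking case.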

## Findings worth tracking (Low)

### Carryforward from 2026-05-19

1. **Unregistered pytest mark `tier2_only`** — pytest warnings at `e2e/tests/performance/test_nft_perf_01_e2e_latency.py:61` and `e2e/tests/performance/test_nft_perf_03_ttff.py:48`. Add `tier2_only: marks scenarios that require Jetson hardware` to `e2e/runner/pytest.ini` `markers` list. **Status: still present in cycle 3.**
2. **`scripts/run-performance-tests.sh` is intentionally a Tier-2 stub.** Unchanged from 2026-05-19. **Status: still as designed.**
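
Finding 1 proposes registering the mark in `pytest.ini`. An equivalent route, if a `conftest.py` hook is preferred, is pytest's `config.addinivalue_line`. The sketch below is an alternative to the ini edit, not a replacement for it, and the choice of which `conftest.py` would host it is left open.

```python
# conftest.py sketch: register the custom mark at configure time.
MARKER_LINE = "tier2_only: marks scenarios that require Jetson hardware"


def pytest_configure(config):
    """Register `tier2_only` so pytest stops warning about an unknown mark."""
    config.addinivalue_line("markers", MARKER_LINE)
```

Either route should clear the unregistered-mark warnings at the two call sites listed in finding 1.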

### New (discovered while running this probe — pre-existing, not cycle-3 caused)

3. **`EVIDENCE_OUT` default is a hardcoded container path.** `e2e/runner/conftest.py:56` sets `default=os.environ.get("EVIDENCE_OUT", "/e2e-results/evidence")`. On a Tier-1 host run (no Docker, no Jetson), the `nfr_recorder.pytest_sessionfinish` hook tries to create `/e2e-results/evidence` and fails with `OSError: [Errno 30] Read-only file system: '/e2e-results'`. Workaround: `EVIDENCE_OUT=$(pwd)/e2e-results/<run-id>/evidence python -m pytest …`. Suggested fix: fall back to a workspace-relative path when neither `--evidence-out` nor the `EVIDENCE_OUT` env var is set. Logged to `_docs/_process_leftovers/2026-05-26_evidence_out_default_path.md` for later remediation. **Status: pre-existing host-pytest defect, not introduced by cycle 3; re-running the probe in cycle 3 is what surfaced it.**
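
One possible shape for the suggested fix, assuming the precedence is CLI flag, then environment variable, then a workspace-relative fallback. `resolve_evidence_out` and the fallback layout `e2e-results/evidence` are hypothetical; the actual remediation is tracked in the leftover file and may differ.

```python
import os
from pathlib import Path


def resolve_evidence_out(cli_value=None, environ=None, workspace=None):
    """Pick the evidence directory: CLI flag, then env var, then workspace default."""
    environ = os.environ if environ is None else environ
    if cli_value:
        return Path(cli_value)
    if environ.get("EVIDENCE_OUT"):
        return Path(environ["EVIDENCE_OUT"])
    # Assumed fallback: a directory under the current workspace, which is
    # writable on a host run, instead of the container path /e2e-results.
    base = Path.cwd() if workspace is None else Path(workspace)
    return base / "e2e-results" / "evidence"


print(resolve_evidence_out(environ={}, workspace="/work/repo"))
```

A container run that relies on `/e2e-results/evidence` would then need `EVIDENCE_OUT` or `--evidence-out` set explicitly, which is a trade-off to weigh before adopting this shape.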

## Anti-patterns explicitly NOT used

Per the skill's anti-pattern guidance:

- **No improvised perf tests.** Did not synthesize a workstation-only "approximation" of any NFR; the AC-4.1 / AC-NEW-1 / AC-NEW-2 / AC-4.4 budgets are pinned to canonical hardware and synthetic Tier-1 numbers would mislead the trend-tracker.
- **No skip-acceptance without justification.** Each Unverified entry is cataloged against a concrete recording task (AZ-444 / AZ-595).
- **No threshold downgrade.** Did not soften any threshold to make a Tier-1 measurement "pass".
- **No silent passthrough.** The four perf NFRs all measure real algorithm execution; no per-test bypass was inserted to make a Tier-1 result look like a Tier-2 result.

## Cross-Reference Index

| Source | Purpose |
|---|---|
| `_docs/02_document/tests/performance-tests.md` | Threshold + scenario spec |
| `scripts/run-performance-tests.sh` | Runner script (current Tier-2 stub) |
| `_docs/06_metrics/perf_2026-05-19_workstation-tier1-probe.md` | Prior Tier-1 probe (greenfield Step 15) |
| `_docs/02_tasks/todo/AZ-444*` | Tier-2 Jetson harness (recording-phase task) |
| `_docs/02_tasks/todo/AZ-595*` | SITL replay fixture builder (recording task) |
| `_docs/02_tasks/todo/AZ-{428..431}*` | NFT-PERF-{01..04} scenario tasks (runner side complete; harness pending) |
| `_docs/_process_leftovers/2026-05-26_evidence_out_default_path.md` | EVIDENCE_OUT defect leftover |
| `_docs/06_metrics/` (this directory) | Per-run perf trend artefacts |