---
name: test-run
description: |
  Run the project's test suite, report results, and handle failures.
  Detects test runners automatically (pytest, dotnet test, cargo test, npm test)
  or uses scripts/run-tests.sh if available.
  Trigger phrases:
  - "run tests", "test suite", "verify tests"
category: build
tags: [testing, verification, test-suite]
disable-model-invocation: true
---
# Test Run

Run the project's test suite and report results. This skill is invoked by the autodev at verification checkpoints — after implementing tests, after implementing features, before deploy — or at any point where a test suite must pass before proceeding.
## Modes

test-run has two modes. The caller passes the mode explicitly; if missing, default to `functional`.
| Mode | Scope | Typical caller | Input artifacts |
|------|-------|----------------|-----------------|
| `functional` (default) | Unit / integration / blackbox tests — correctness | autodev Steps that verify after Implement Tests or Implement | `scripts/run-tests.sh`, `_docs/02_document/tests/environment.md`, `_docs/02_document/tests/blackbox-tests.md` |
| `perf` | Performance / load / stress / soak tests — latency, throughput, error-rate thresholds | autodev greenfield Step 9, existing-code Step 15 (pre-deploy) | `scripts/run-performance-tests.sh`, `_docs/02_document/tests/performance-tests.md`, AC thresholds in `_docs/00_problem/acceptance_criteria.md` |

Direct user invocation (`/test-run`) defaults to `functional`. If the user says "perf tests", "load test", "performance", or passes a performance scenarios file, run `perf` mode.
After selecting a mode, read its corresponding workflow below; do not mix them.

---

## Functional Mode
### 1. Detect Test Runner

Check in order — first match wins:

1. `scripts/run-tests.sh` exists → use it (the script already encodes the correct execution strategy)
2. `docker-compose.test.yml` exists → run the Execution Environment Check (below). Docker is preferred; use it unless hardware constraints prevent it.
3. Auto-detect from project files:
   - `pytest.ini`, `pyproject.toml` with `[tool.pytest]`, or `conftest.py` → `pytest`
   - `*.csproj` or `*.sln` → `dotnet test`
   - `Cargo.toml` → `cargo test`
   - `package.json` with a test script → `npm test`
   - `Makefile` with a `test` target → `make test`

If no runner is detected → report failure and ask the user to specify one.
#### Execution Environment Check

1. Check `_docs/02_document/tests/environment.md` for a "Test Execution" section. If the test-spec skill already assessed hardware dependencies and recorded a decision (local / docker / both), **follow that decision**.
2. If the "Test Execution" section says **local** → run tests directly on the host (no Docker).
3. If the "Test Execution" section says **docker** → use Docker (docker-compose).
4. If the "Test Execution" section says **both** → run local first, then Docker (or vice versa), and merge results.
5. If no prior decision exists → fall back to the hardware-dependency detection logic from the test-spec skill's "Hardware-Dependency & Execution Environment Assessment" section. Ask the user if hardware indicators are found.
### 2. Run Tests

1. Execute the detected test runner
2. Capture output: passed, failed, skipped, errors
3. If a test environment was spun up, tear it down after tests complete
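
The execute-and-capture step can be sketched with `subprocess`; the command comes from the detection step, and teardown remains the caller's responsibility. The timeout value and dict field names are illustrative:

```python
import subprocess

def run_tests(command: list[str], timeout_s: int = 1800) -> dict:
    """Execute the test runner and capture everything the report step needs."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    return {
        "ok": proc.returncode == 0,   # most runners: 0 = all tests passed
        "returncode": proc.returncode,
        "stdout": proc.stdout,        # parsed later for pass/fail/skip counts
        "stderr": proc.stderr,
    }
```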
### 3. Report Results

Present a summary:

```
══════════════════════════════════════
TEST RESULTS: [N passed, M failed, K skipped, E errors]
══════════════════════════════════════
```

**Important**: Collection errors (import failures, missing dependencies, syntax errors) count as failures — they are not "skipped" or ignorable. If a collection error is caused by a missing dependency, install it (add it to the project's dependency file and install) before re-running. The test runner script (`run-tests.sh`) should install all dependencies automatically — if it doesn't, fix the script to do so.
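
For a pytest-style runner, the summary counts can be pulled from the tail of the output, with collection errors folded into the blocking count rather than dropped. The regex and field names here are a sketch, not pytest's official machine-readable format:

```python
import re

def parse_summary(tail: str) -> dict:
    """Extract counts from a pytest-style summary, e.g. '3 passed, 1 failed, 1 error in 2.1s'."""
    counts = {"passed": 0, "failed": 0, "skipped": 0, "errors": 0}
    for n, kind in re.findall(r"(\d+) (passed|failed|skipped|error)s?", tail):
        key = kind if kind in counts else kind + "s"
        counts[key] += int(n)
    # Collection errors are failures, never ignorable: gate on failed + errors.
    counts["blocking"] = counts["failed"] + counts["errors"]
    return counts
```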
### 4. Diagnose Failures and Skips

Before presenting choices, list every failing/erroring/skipped test with a one-line root cause:

```
Failures:
1. test_foo.py::test_bar — missing dependency 'netron' (not installed)
2. test_baz.py::test_qux — AssertionError: expected 5, got 3 (logic error)
3. test_old.py::test_legacy — ImportError: no module 'removed_module' (possibly obsolete)

Skips:
1. test_x.py::test_pre_init — runtime skip: engine already initialized (unreachable in current test order)
2. test_y.py::test_docker_only — explicit @skip: requires Docker (dead code in local runs)
```
Categorize failures as: **missing dependency**, **broken import**, **logic/assertion error**, **possibly obsolete**, or **environment-specific**.

Categorize skips as: **explicit skip (dead code)**, **runtime skip (unreachable)**, **environment mismatch**, or **missing fixture/data**.
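
A first pass over the failure categories can be keyword-driven; anything the heuristics miss still needs a human (or agent) reading the traceback. The keyword lists are assumptions, not an exhaustive taxonomy:

```python
def categorize_failure(message: str) -> str:
    """Map an error message to one of the failure categories above (keyword heuristics)."""
    m = message.lower()
    if "no module named" in m or "modulenotfounderror" in m:
        return "missing dependency"
    if "importerror" in m or "cannot import name" in m:
        return "broken import"
    if "assertionerror" in m:
        return "logic/assertion error"
    if "connection refused" in m or "timed out" in m:
        return "environment-specific"
    # Default: flag for manual review against the test's history.
    return "possibly obsolete"
```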
### 5. Handle Outcome

**All tests pass, zero skipped** → return success to the autodev for auto-chain.

**Any test fails or errors** → this is a **blocking gate**. Never silently ignore failures. **Always investigate the root cause before deciding on an action.** Read the failing test code, read the error output, check service logs if applicable, and determine whether the bug is in the test or in the production code.

After investigating, present:

```
══════════════════════════════════════
TEST RESULTS: [N passed, M failed, K skipped, E errors]
══════════════════════════════════════
Failures:
1. test_X — root cause: [detailed reason] → action: [fix test / fix code / remove + justification]
══════════════════════════════════════
A) Apply recommended fixes, then re-run
B) Abort — fix manually
══════════════════════════════════════
Recommendation: A — fix root causes before proceeding
══════════════════════════════════════
```
- If the user picks A → apply the fixes, then re-run (loop back to step 2)
- If the user picks B → return failure to the autodev

**Any skipped test** → classify it as legitimate or illegitimate before deciding whether to block.
#### Legitimate skips (accept and proceed)

The code path genuinely cannot execute on this runner. Acceptable reasons:

- Hardware not physically present (GPU, Apple Neural Engine, sensor, serial device)
- Operating system mismatch (Darwin-only test on Linux CI, Windows-only test on macOS)
- Feature-flag-gated test whose feature is intentionally disabled in this environment
- External service the project deliberately does not control (e.g., a third-party API with no sandbox, where the project has a documented contract test instead)

For legitimate skips: verify the skip condition is accurate (the test would run if the hardware/OS were present), verify it has a clear reason string, and proceed.
#### Illegitimate skips (BLOCKING — must resolve)

The skip is a workaround for something we can and should fix. NOT acceptable reasons:

- Required service not running (database, message broker, downstream API we control) → fix: bring the service up, add a docker-compose dependency, or add a mock
- Missing test fixture, seed data, or sample file → fix: provide the data, generate it, or ASK the user for it
- Missing environment variable or credential → fix: add it to `.env.example`, document it, ASK the user for the value
- Flaky-test quarantine with no tracking ticket → fix: create the ticket (or replay via leftovers if the tracker is down)
- Inherited skip from a prior refactor that was never cleaned up → fix: clean it up now
- Test ordering mutates shared state → fix: isolate the state

**Rule of thumb**: if the reason for skipping is "we didn't set something up," that's not a valid skip — set it up. If the reason is "this hardware/OS isn't here," that's valid.
#### Resolution steps for illegitimate skips

1. Classify the skip (read the skip reason and the test body)
2. If the fix is **mechanical** — start a container, install a dep, add a mock, reorder fixtures — attempt it automatically and re-run
3. If the fix requires **user input** — credentials, sample data, a business decision — BLOCK and ASK
4. Never silently mark a skip as "accepted" — every illegitimate skip must either be fixed or escalated
5. Removal is a last resort and requires explicit user approval with documented reasoning
#### Categorization cheatsheet

- **explicit skip (e.g. `@pytest.mark.skip`)**: check whether the reason in the decorator is still valid
- **conditional skip (e.g. `@pytest.mark.skipif`)**: check whether the condition is accurate and whether we can change the environment to make it false
- **runtime skip (e.g. `pytest.skip()` in the body)**: check why the condition fires — often an ordering or environment bug
- **missing fixture/data**: treat as illegitimate unless the user confirms the data is unavailable
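
The legitimacy rules above can be sketched as a triage helper over the skip reason string. The keyword lists are loose illustrative heuristics; anything they don't catch goes to manual investigation rather than a silent verdict:

```python
def classify_skip(reason: str) -> str:
    """First-pass legitimacy call on a skip reason string (keywords are illustrative)."""
    r = reason.lower()
    # Hardware/OS genuinely absent on this runner -> legitimate
    if any(k in r for k in ("gpu", "neural engine", "sensor", "darwin", "windows", "linux")):
        return "legitimate"
    # Things we can and should set up -> illegitimate (blocking)
    if any(k in r for k in ("not running", "missing fixture", "credential", "env var", "flaky")):
        return "illegitimate"
    return "investigate"  # read the test body before deciding
```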
After investigating, present the findings:

```
══════════════════════════════════════
SKIPPED TESTS: K tests skipped
══════════════════════════════════════
1. test_X — root cause: [detailed reason] → action: [fix / restructure / remove + justification]
2. test_Y — root cause: [detailed reason] → action: [fix / restructure / remove + justification]
══════════════════════════════════════
A) Apply recommended fixes, then re-run
B) Accept skips and proceed (requires user justification per skip)
══════════════════════════════════════
```

Only option B allows proceeding with skips, and it requires explicit user approval with documented justification for each skip.
---

## Perf Mode

Performance tests differ from functional tests in what they measure (latency / throughput / error-rate distributions, not the pass/fail of a single assertion) and when they run (once before deploy, not per batch). The mode reuses the same orchestration shape (detect → run → report → gate on outcome) but with perf-specific tool detection and threshold comparison.
### 1. Detect Perf Runner

Check in order — first match wins:

1. `scripts/run-performance-tests.sh` exists (generated by `test-spec` Phase 4) → use it; the script already encodes the correct load profile and tool invocation.
2. `_docs/02_document/tests/performance-tests.md` exists → read the scenarios, then auto-detect a load-testing tool:
   - `k6` binary available → prefer k6 (scriptable, good default reporting)
   - `locust` in project deps or installed → locust
   - `artillery` in `package.json` or installed globally → artillery
   - `wrk` binary available → wrk (simple HTTP only; use only if the scenarios are HTTP GET/POST)
   - Language-native benchmark harness (`cargo bench`, `go test -bench`, `pytest-benchmark`) → use when the scenarios are CPU-bound or in-process
3. No runner and no scenarios spec → STOP and ask the user to either run `/test-spec` first (to produce `performance-tests.md` plus the runner script) or supply a runner script manually. Do not improvise perf tests from scratch.
### 2. Run

Execute the detected runner against the target system. Capture per-scenario metrics:

- Latency percentiles: p50, p95, p99 (and p999 if load volume permits)
- Throughput: requests/sec or operations/sec
- Error rate: failed / total
- Duration: actual run time (for soak/ramp scenarios)
- Resource usage if the scenarios call for it (CPU%, RSS, GPU utilization)

Tear down any environment the runner spun up after the metrics are captured.
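
For reference, the percentile and rate metrics above can be computed from raw latency samples with a nearest-rank sketch like the following; real load tools (k6, locust) report these directly, so this is only a fallback for homegrown runners:

```python
import math

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: p=95 gives the p95 latency."""
    s = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(s))  # 1-based nearest rank
    return s[max(rank - 1, 0)]

def summarize(samples_ms: list[float], errors: int, duration_s: float) -> dict:
    """Per-scenario metrics in the shape the report expects (field names illustrative)."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
        "throughput_rps": len(samples_ms) / duration_s,
        "error_rate": errors / len(samples_ms),
    }
```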
### 3. Compare Against Thresholds

Load thresholds in this order:

1. Per-scenario expected results from `_docs/02_document/tests/performance-tests.md`
2. Project-wide thresholds from `_docs/00_problem/acceptance_criteria.md` (latency / throughput lines)
3. Fallback: no threshold → record the measurement but classify the scenario as **Unverified** (not pass/fail)

Classify each scenario:

- **Pass** — all thresholds met
- **Warn** — within 10% of any threshold (e.g., p95 = 460ms against a 500ms threshold)
- **Fail** — any threshold violated
- **Unverified** — no threshold to compare against
### 4. Report

```
══════════════════════════════════════
PERF RESULTS
══════════════════════════════════════
Scenarios: [pass N · warn M · fail K · unverified U]
──────────────────────────────────────
1. [scenario_name] — [Pass/Warn/Fail/Unverified]
   p50 = [x]ms · p95 = [y]ms · p99 = [z]ms
   throughput = [r] rps · errors = [e]%
   threshold: [criterion and verdict detail]
2. ...
══════════════════════════════════════
```

Persist the full report to `_docs/06_metrics/perf_<YYYY-MM-DD>_<run_label>.md` for trend tracking across cycles.
### 5. Handle Outcome

**All scenarios Pass (or Pass + Unverified only)** → return success to the caller.

**Any Warn or Fail** → this is a **blocking gate**. Investigate before deciding — read the runner output, check whether the warn band was historically stable, and rule out transient infrastructure noise (always worth one re-run before declaring a regression).

After investigating, present:

```
══════════════════════════════════════
PERF GATE: [summary]
══════════════════════════════════════
Failing / warning scenarios:
1. [scenario] — [metric] = [observed] vs threshold [threshold]
   likely cause: [1-line diagnosis]
══════════════════════════════════════
A) Fix and re-run (investigate and address the regression)
B) Proceed anyway (accept the regression — requires written justification
   recorded in the perf report)
C) Abort — investigate offline
══════════════════════════════════════
Recommendation: A — perf regressions caught pre-deploy
are orders of magnitude cheaper to fix than post-deploy
══════════════════════════════════════
```
- If the user picks A → apply the fixes, re-run (back to step 2).
- If the user picks B → append the justification to the perf report; return success to the caller.
- If the user picks C → return failure to the caller.

**Any Unverified scenarios with no Warn/Fail** → not blocking, but surface them in the report so the user knows coverage gaps exist. Suggest running `/test-spec` to add expected results next cycle.
## Trigger Conditions

This skill is invoked by the autodev at test verification checkpoints. It is not typically invoked directly by the user. When invoked directly, select the mode from the user's phrasing ("run tests" → functional; "load test" / "perf test" → perf).