Mirror of https://github.com/azaion/ai-training.git
synced 2026-04-23 04:26:35 +00:00
Add .cursor AI autodevelopment harness (agents, skills, rules)
Made-with: Cursor
# Expected Results Template

Save as `_docs/00_problem/input_data/expected_results/results_report.md`.

For complex expected outputs, place reference CSV files alongside it in `_docs/00_problem/input_data/expected_results/`.

Referenced by the test-spec skill (`.cursor/skills/test-spec/SKILL.md`).

---

````markdown
# Expected Results

Maps every input data item to its quantifiable expected result.
Tests use this mapping to compare actual system output against known-correct answers.

## Result Format Legend

| Result Type | When to Use | Example |
|-------------|-------------|---------|
| Exact value | Output must match precisely | `status_code: 200`, `detection_count: 3` |
| Tolerance range | Numeric output with acceptable variance | `confidence: 0.92 ± 0.05`, `bbox_x: 120 ± 10px` |
| Threshold | Output must exceed or stay below a limit | `latency < 500ms`, `confidence ≥ 0.85` |
| Pattern match | Output must match a string/regex pattern | `error_message contains "invalid format"` |
| File reference | Complex output compared against a reference file | `match expected_results/case_01.json` |
| Schema match | Output structure must conform to a schema | `response matches DetectionResultSchema` |
| Set/count | Output must contain specific items or counts | `classes ⊇ {"car", "person"}`, `detections.length == 5` |

## Comparison Methods

| Method | Description | Tolerance Syntax |
|--------|-------------|------------------|
| `exact` | Actual == Expected | N/A |
| `numeric_tolerance` | abs(actual - expected) ≤ tolerance | `± <value>` or `± <percent>%` |
| `range` | min ≤ actual ≤ max | `[min, max]` |
| `threshold_min` | actual ≥ threshold | `≥ <value>` |
| `threshold_max` | actual ≤ threshold | `≤ <value>` |
| `regex` | actual matches regex pattern | regex string |
| `substring` | actual contains substring | substring |
| `json_diff` | structural comparison against reference JSON | diff tolerance per field |
| `set_contains` | actual output set contains expected items | subset notation |
| `file_reference` | compare against reference file in expected_results/ | file path |

## Input → Expected Result Mapping

### [Scenario Group Name, e.g. "Single Image Detection"]

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 1 | `[file or parameters]` | [what this input represents] | [quantifiable expected output] | [method from table above] | [± value, range, or N/A] | [path in expected_results/ or N/A] |

#### Example — Object Detection

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 1 | `image_01.jpg` | Aerial photo, 3 vehicles visible | `detection_count: 3`, classes: `["ArmorVehicle", "ArmorVehicle", "Truck"]` | exact (count), set_contains (classes) | N/A | N/A |
| 2 | `image_01.jpg` | Same image, bbox positions | bboxes: `[(120,80,340,290), (400,150,580,310), (50,400,200,520)]` | numeric_tolerance | ± 15px per coordinate | `expected_results/image_01_detections.json` |
| 3 | `image_01.jpg` | Same image, confidence scores | confidences: `[0.94, 0.88, 0.91]` | threshold_min | each ≥ 0.85 | N/A |
| 4 | `empty_scene.jpg` | Aerial photo, no objects | `detection_count: 0`, empty detections array | exact | N/A | N/A |
| 5 | `corrupted.dat` | Invalid file format | HTTP 400, body contains `"error"` key | exact (status), substring (body) | N/A | N/A |

#### Example — Performance

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 1 | `standard_image.jpg` | 1920x1080 single image | Response time | threshold_max | ≤ 2000ms | N/A |
| 2 | `large_image.jpg` | 8000x6000 tiled image | Response time | threshold_max | ≤ 10000ms | N/A |

#### Example — Error Handling

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|----------------|
| 1 | `POST /detect` with no file | Missing required input | HTTP 422, message matches `"file.*required"` | exact (status), regex (message) | N/A | N/A |
| 2 | `POST /detect` with `probability_threshold: 5.0` | Out-of-range config | HTTP 422 or clamped to valid range | exact (status) or range [0.0, 1.0] | N/A | N/A |

## Expected Result Reference Files

When the expected output is too complex for an inline table cell (e.g., full JSON response with nested objects), place a reference file in `_docs/00_problem/input_data/expected_results/`.

### File Naming Convention

`<input_name>_expected.<format>`

Examples:
- `image_01_detections.json`
- `batch_A_results.csv`
- `video_01_annotations.json`

### Reference File Requirements

- Must be machine-readable (JSON, CSV, YAML — not prose)
- Must contain only the expected output structure and values
- Must include tolerance annotations where applicable (as metadata fields or comments)
- Must be valid and parseable by standard libraries

### Reference File Example (JSON)

File: `expected_results/image_01_detections.json`

```json
{
  "input": "image_01.jpg",
  "expected": {
    "detection_count": 3,
    "detections": [
      {
        "class": "ArmorVehicle",
        "confidence": { "min": 0.85 },
        "bbox": { "x1": 120, "y1": 80, "x2": 340, "y2": 290, "tolerance_px": 15 }
      },
      {
        "class": "ArmorVehicle",
        "confidence": { "min": 0.85 },
        "bbox": { "x1": 400, "y1": 150, "x2": 580, "y2": 310, "tolerance_px": 15 }
      },
      {
        "class": "Truck",
        "confidence": { "min": 0.85 },
        "bbox": { "x1": 50, "y1": 400, "x2": 200, "y2": 520, "tolerance_px": 15 }
      }
    ]
  }
}
```
````
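
When a mapping row points at a reference file like the one above, the comparison step can be scripted. The sketch below is illustrative, not part of the template: it writes a trimmed-down reference file inline so the demo is self-contained, uses `python3` for JSON parsing to avoid extra dependencies, and hard-codes `actual_count` where a real harness would read the system's response.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Self-contained demo: write a trimmed-down reference file inline.
expected_file="image_01_detections.json"
cat > "$expected_file" <<'EOF'
{ "input": "image_01.jpg", "expected": { "detection_count": 3 } }
EOF

# Read the expected count from the reference file.
expected_count=$(python3 -c 'import json, sys
print(json.load(open(sys.argv[1]))["expected"]["detection_count"])' "$expected_file")

# In a real test this would come from the system under test's response.
actual_count=3

if [ "$actual_count" -eq "$expected_count" ]; then
  echo "detection_count: pass"
else
  echo "detection_count: fail (expected $expected_count, got $actual_count)"
fi
```

A fuller check would also walk the `detections` array and apply `tolerance_px` to each bbox field.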

---

## Guidance Notes

- Every row in the mapping table must have at least one quantifiable comparison — no row should say only "should work" or "returns result".
- Use `exact` comparison for counts, status codes, and discrete values.
- Use `numeric_tolerance` for floating-point values and spatial coordinates where minor variance is expected.
- Use `threshold_min`/`threshold_max` for performance metrics and confidence scores.
- Use `file_reference` when the expected output has more than ~3 fields or nested structures.
- Reference files must be committed alongside input data — they are part of the test specification.
- When the system has non-deterministic behavior (e.g., model inference variance across hardware), document the expected tolerance explicitly and justify it.
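
As a rough sketch of how two of the comparison methods might look in a shell-based test harness (the helper names `within_tolerance` and `meets_min` are assumptions, not part of the spec):

```bash
#!/usr/bin/env bash
set -euo pipefail

# numeric_tolerance: abs(actual - expected) <= tolerance
within_tolerance() {  # within_tolerance <actual> <expected> <tolerance>
  awk -v a="$1" -v e="$2" -v t="$3" \
    'BEGIN { d = a - e; if (d < 0) d = -d; exit (d <= t) ? 0 : 1 }'
}

# threshold_min: actual >= threshold
meets_min() {  # meets_min <actual> <threshold>
  awk -v a="$1" -v t="$2" 'BEGIN { exit (a >= t) ? 0 : 1 }'
}

# Example checks drawn from the object-detection table values.
within_tolerance 0.94 0.92 0.05 && echo "confidence within ± 0.05"
meets_min 0.88 0.85 && echo "confidence meets ≥ 0.85"
```

`awk` is used here because POSIX shell arithmetic is integer-only; any tool that compares floats would do.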

---

# Test Runner Script Structure

Reference for generating `scripts/run-tests.sh` and `scripts/run-performance-tests.sh`.

## `scripts/run-tests.sh`

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
UNIT_ONLY=false
RESULTS_DIR="$PROJECT_ROOT/test-results"

for arg in "$@"; do
  case $arg in
    --unit-only) UNIT_ONLY=true ;;
  esac
done

cleanup() {
  # tear down docker-compose if it was started
  :  # no-op placeholder; a bash function body needs at least one command
}
trap cleanup EXIT

mkdir -p "$RESULTS_DIR"

# --- Unit Tests ---
# [detect runner: pytest / dotnet test / cargo test / npm test]
# [run and capture exit code]
# [save results to $RESULTS_DIR/unit-results.*]

# --- Blackbox Tests (skip if --unit-only) ---
# if ! $UNIT_ONLY; then
#   [docker compose -f <compose-file> up -d]
#   [wait for health checks]
#   [run blackbox test suite]
#   [save results to $RESULTS_DIR/blackbox-results.*]
# fi

# --- Summary ---
# [print passed / failed / skipped counts]
# [exit 0 if all passed, exit 1 otherwise]
```
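
The `[wait for health checks]` placeholder can be filled with a polling loop. This is a sketch assuming the service exposes an HTTP health endpoint; the URL and timeout below are illustrative, and `curl` is assumed to be available:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Poll an HTTP health endpoint until it answers or the timeout elapses.
wait_for_health() {  # wait_for_health <url> [timeout_seconds]
  local url="$1" timeout="${2:-60}" elapsed=0
  until curl -fsS --max-time 2 "$url" > /dev/null 2>&1; do
    sleep 2
    elapsed=$((elapsed + 2))
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "health check timed out after ${timeout}s for $url" >&2
      return 1
    fi
  done
}

# Usage (hypothetical endpoint):
# wait_for_health "http://localhost:8080/health" 60
```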

## `scripts/run-performance-tests.sh`

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
RESULTS_DIR="$PROJECT_ROOT/test-results"

cleanup() {
  # tear down test environment if started
  :  # no-op placeholder; a bash function body needs at least one command
}
trap cleanup EXIT

mkdir -p "$RESULTS_DIR"

# --- Start System Under Test ---
# [docker compose up -d or start local server]
# [wait for health checks]

# --- Run Performance Scenarios ---
# [detect tool: k6 / locust / artillery / wrk / built-in]
# [run each scenario from performance-tests.md]
# [capture metrics: latency P50/P95/P99, throughput, error rate]

# --- Compare Against Thresholds ---
# [read thresholds from test spec or CLI args]
# [print per-scenario pass/fail]

# --- Summary ---
# [exit 0 if all thresholds met, exit 1 otherwise]
```

## Key Requirements

- Both scripts must be idempotent (safe to run multiple times)
- Both scripts must work in CI (no interactive prompts, no GUI)
- Use `trap cleanup EXIT` to ensure teardown even on failure
- Exit codes: 0 = all pass, 1 = failures detected
- Write results to `test-results/` directory (add to `.gitignore` if not already present)
- The actual commands depend on the detected tech stack — fill them in during Phase 4 of the test-spec skill
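
The summary step in both scripts can follow the same pattern. In this sketch, `record` and the hard-coded suite results are illustrative stand-ins for parsing each runner's real exit code; the script prints the overall status instead of exiting so the example stays runnable:

```bash
#!/usr/bin/env bash
set -euo pipefail

PASSED=0
FAILED=0

# record <suite-name> <exit-code>: tally one suite's result.
record() {
  if [ "$2" -eq 0 ]; then
    PASSED=$((PASSED + 1))
  else
    FAILED=$((FAILED + 1))
    echo "suite failed: $1" >&2
  fi
}

# Illustrative results; real scripts would pass each runner's exit code.
record "unit" 0
record "blackbox" 1

echo "passed=$PASSED failed=$FAILED"
if [ "$FAILED" -eq 0 ]; then status=0; else status=1; fi
echo "overall exit code: $status"
```

A real script would end with `exit "$status"` to satisfy the exit-code requirement above.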