---
name: test-spec
description: |
  Test specification skill. Analyzes input data and expected results for completeness,
  then produces detailed test scenarios (blackbox, performance, resilience, security, resource limits)
  that treat the system as a black box. Every test pairs input data with quantifiable expected results
  so tests can verify correctness, not just execution.
  4-phase workflow: input data + expected results analysis, test scenario specification, data + results validation gate,
  test runner script generation. Produces 8 artifacts under tests/ and 2 shell scripts under scripts/.
  Trigger phrases:
  - "test spec", "test specification", "test scenarios"
  - "blackbox test spec", "black box tests", "blackbox tests"
  - "performance tests", "resilience tests", "security tests"
category: build
tags: [testing, black-box, blackbox-tests, test-specification, qa]
disable-model-invocation: true
---

# Test Scenario Specification

Analyze input data completeness and produce detailed black-box test specifications. Tests describe what the system should do given specific inputs — they never reference internals.

## Core Principles

- **Black-box only**: tests describe observable behavior through public interfaces; no internal implementation details
- **Traceability**: every test traces to at least one acceptance criterion or restriction
- **Save immediately**: write artifacts to disk after each phase; never accumulate unsaved work
- **Ask, don't assume**: when requirements are ambiguous, ask the user before proceeding
- **Spec, don't code**: this workflow produces test specifications, never test implementation code
- **No test without data**: every test scenario MUST have concrete test data; tests without data are removed
- **No test without expected result**: every test scenario MUST pair input data with a quantifiable expected result; a test that cannot compare actual output against a known-correct answer is not verifiable and must be removed

## Context Resolution

Fixed paths — no mode detection needed:

- PROBLEM_DIR: `_docs/00_problem/`
- SOLUTION_DIR: `_docs/01_solution/`
- DOCUMENT_DIR: `_docs/02_document/`
- TESTS_OUTPUT_DIR: `_docs/02_document/tests/`

Announce the resolved paths to the user before proceeding.

## Input Specification

### Required Files

| File | Purpose |
|------|---------|
| `_docs/00_problem/problem.md` | Problem description and context |
| `_docs/00_problem/acceptance_criteria.md` | Measurable acceptance criteria |
| `_docs/00_problem/restrictions.md` | Constraints and limitations |
| `_docs/00_problem/input_data/` | Reference data examples, expected results, and optional reference files |
| `_docs/01_solution/solution.md` | Finalized solution |

### Expected Results Specification

Every input data item MUST have a corresponding expected result that defines what the system should produce. Expected results MUST be **quantifiable** — the test must be able to programmatically compare actual system output against the expected result and produce a pass/fail verdict.

Expected results live inside `_docs/00_problem/input_data/` in one or both of:

1. **Mapping file** (`input_data/expected_results/results_report.md`): a table pairing each input with its quantifiable expected output, using the format defined in `.cursor/skills/test-spec/templates/expected-results.md`

2. **Reference files folder** (`input_data/expected_results/`): machine-readable files (JSON, CSV, etc.) containing full expected outputs for complex cases, referenced from the mapping file

```
input_data/
├── expected_results/             ← required: expected results folder
│   ├── results_report.md        ← required: input→expected result mapping
│   ├── image_01_expected.csv    ← per-file expected detections
│   └── video_01_expected.csv
├── image_01.jpg
├── empty_scene.jpg
└── data_parameters.md
```

**Quantifiability requirements** (see template for full format and examples):
- Numeric values: exact value or value ± tolerance (e.g., `confidence ≥ 0.85`, `position ± 10px`)
- Structured data: exact JSON/CSV values, or a reference file in `expected_results/`
- Counts: exact counts (e.g., "3 detections", "0 errors")
- Text/patterns: exact string or regex pattern to match
- Timing: threshold (e.g., "response ≤ 500ms")
- Error cases: expected error code, message pattern, or HTTP status
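
Each of these requirement types reduces to a mechanical check that a generated test runner can perform. As an illustrative sketch only (the helper names and sample values are hypothetical, not part of this skill):

```shell
# Hypothetical comparison helpers; names and sample values are illustrative only.
set -euo pipefail

# value ± tolerance, e.g. `position ± 10px`
within_tolerance() {
  awk -v a="$1" -v e="$2" -v t="$3" 'BEGIN { d = a - e; if (d < 0) d = -d; exit !(d <= t) }'
}

# minimum threshold, e.g. `confidence ≥ 0.85`
at_least() {
  awk -v a="$1" -v t="$2" 'BEGIN { exit !(a >= t) }'
}

# pattern match, e.g. an expected error-message fragment
matches() {
  printf '%s' "$1" | grep -Eq "$2"
}

within_tolerance 118 120 10 && echo "PASS: position within tolerance"
at_least 0.91 0.85 && echo "PASS: confidence above threshold"
matches "invalid format: bad header" "invalid format" && echo "PASS: error pattern found"
```

In a generated runner, each such PASS/FAIL line would be tied to a scenario ID from the traceability matrix.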

### Optional Files (used when available)

| File | Purpose |
|------|---------|
| `DOCUMENT_DIR/architecture.md` | System architecture for environment design |
| `DOCUMENT_DIR/system-flows.md` | System flows for test scenario coverage |
| `DOCUMENT_DIR/components/` | Component specs for interface identification |

### Prerequisite Checks (BLOCKING)

1. `acceptance_criteria.md` exists and is non-empty — **STOP if missing**
2. `restrictions.md` exists and is non-empty — **STOP if missing**
3. `input_data/` exists and contains at least one file — **STOP if missing**
4. `input_data/expected_results/results_report.md` exists and is non-empty — **STOP if missing**. Prompt the user: *"Expected results mapping is required. Please create `_docs/00_problem/input_data/expected_results/results_report.md` pairing each input with its quantifiable expected output. Use `.cursor/skills/test-spec/templates/expected-results.md` as the format reference."*
5. `problem.md` exists and is non-empty — **STOP if missing**
6. `solution.md` exists and is non-empty — **STOP if missing**
7. Create TESTS_OUTPUT_DIR if it does not exist
8. If TESTS_OUTPUT_DIR already contains files, ask user: **resume from last checkpoint or start fresh?**
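
The blocking checks above can be sketched as a small shell gate. This is an illustrative sketch, not the skill's implementation: it checks a subset of the required files, and it seeds a scratch directory with stub files so the happy path is demonstrable.

```shell
# Sketch of the blocking prerequisite gate (subset of the checks above).
check_prereqs() {
  local root="$1" ok=1
  for f in 00_problem/problem.md 00_problem/acceptance_criteria.md \
           00_problem/restrictions.md 01_solution/solution.md \
           00_problem/input_data/expected_results/results_report.md; do
    # -s: file exists AND is non-empty, matching the "non-empty" requirement
    [ -s "$root/$f" ] || { echo "STOP: missing or empty $f"; ok=0; }
  done
  [ "$ok" -eq 1 ]
}

# Seed a scratch copy of the fixed layout so the demo is self-contained.
root=$(mktemp -d)
mkdir -p "$root/00_problem/input_data/expected_results" "$root/01_solution"
for f in problem.md acceptance_criteria.md restrictions.md; do
  echo stub > "$root/00_problem/$f"
done
echo stub > "$root/01_solution/solution.md"
echo stub > "$root/00_problem/input_data/expected_results/results_report.md"

check_prereqs "$root" && echo "prerequisites satisfied"
```

The real gate would run against the repository root (paths prefixed with `_docs/`) and stop the workflow on any STOP line.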

## Artifact Management

### Directory Structure

```
TESTS_OUTPUT_DIR/
├── environment.md
├── test-data.md
├── blackbox-tests.md
├── performance-tests.md
├── resilience-tests.md
├── security-tests.md
├── resource-limit-tests.md
└── traceability-matrix.md
```

### Save Timing

| Phase | Save immediately after | Filename |
|-------|------------------------|----------|
| Phase 1 | Input data analysis (no file — findings feed Phase 2) | — |
| Phase 2 | Environment spec | `environment.md` |
| Phase 2 | Test data spec | `test-data.md` |
| Phase 2 | Blackbox tests | `blackbox-tests.md` |
| Phase 2 | Performance tests | `performance-tests.md` |
| Phase 2 | Resilience tests | `resilience-tests.md` |
| Phase 2 | Security tests | `security-tests.md` |
| Phase 2 | Resource limit tests | `resource-limit-tests.md` |
| Phase 2 | Traceability matrix | `traceability-matrix.md` |
| Phase 3 | Updated test data spec (if data added) | `test-data.md` |
| Phase 3 | Updated test files (if tests removed) | respective test file |
| Phase 3 | Updated traceability matrix (if tests removed) | `traceability-matrix.md` |
| Phase 4 | Test runner script | `scripts/run-tests.sh` |
| Phase 4 | Performance test runner script | `scripts/run-performance-tests.sh` |

### Resumability

If TESTS_OUTPUT_DIR already contains files:

1. List existing files and match them to the save timing table above
2. Identify which phase/artifacts are complete
3. Resume from the next incomplete artifact
4. Inform the user which artifacts are being skipped

## Progress Tracking

At the start of execution, create a TodoWrite with all four phases. Update status as each phase completes.

## Workflow

### Phase 1: Input Data Completeness Analysis

**Role**: Professional Quality Assurance Engineer
**Goal**: Assess whether the available input data is sufficient to build comprehensive test scenarios
**Constraints**: Analysis only — no test specs yet

1. Read `_docs/01_solution/solution.md`
2. Read `acceptance_criteria.md`, `restrictions.md`
3. Read testing strategy from solution.md (if present)
4. If `DOCUMENT_DIR/architecture.md` and `DOCUMENT_DIR/system-flows.md` exist, read them for additional context on system interfaces and flows
5. Read `input_data/expected_results/results_report.md` and any referenced files in `input_data/expected_results/`
6. Analyze `input_data/` contents against:
   - Coverage of acceptance criteria scenarios
   - Coverage of restriction edge cases
   - Coverage of testing strategy requirements
7. Analyze `input_data/expected_results/results_report.md` completeness:
   - Every input data item has a corresponding expected result row in the mapping
   - Expected results are quantifiable (contain numeric thresholds, exact values, patterns, or file references — not vague descriptions like "works correctly" or "returns result")
   - Expected results specify a comparison method (exact match, tolerance range, pattern match, threshold) per the template
   - Reference files in `input_data/expected_results/` that are cited in the mapping actually exist and are valid
8. Present input-to-expected-result pairing assessment:

| Input Data | Expected Result Provided? | Quantifiable? | Issue (if any) |
|------------|--------------------------|---------------|----------------|
| [file/data] | Yes/No | Yes/No | [missing, vague, no tolerance, etc.] |

9. Threshold: at least 70% coverage of scenarios AND every covered scenario has a quantifiable expected result (see `.cursor/rules/cursor-meta.mdc` Quality Thresholds table)
10. If coverage is low, search the internet for supplementary data, assess quality with user, and if user agrees, add to `input_data/` and update `input_data/expected_results/results_report.md`
11. If expected results are missing or not quantifiable, ask user to provide them before proceeding

**BLOCKING**: Do NOT proceed until user confirms both input data coverage AND expected results completeness are sufficient.

---

### Phase 2: Test Scenario Specification

**Role**: Professional Quality Assurance Engineer
**Goal**: Produce detailed black-box test specifications covering blackbox, performance, resilience, security, and resource limit scenarios
**Constraints**: Spec only — no test code. Tests describe what the system should do given specific inputs, not how the system is built.

Based on all acquired data, acceptance_criteria, and restrictions, form detailed test scenarios:

1. Define test environment using `.cursor/skills/plan/templates/test-environment.md` as structure
2. Define test data management using `.cursor/skills/plan/templates/test-data.md` as structure
3. Write blackbox test scenarios (positive + negative) using `.cursor/skills/plan/templates/blackbox-tests.md` as structure
4. Write performance test scenarios using `.cursor/skills/plan/templates/performance-tests.md` as structure
5. Write resilience test scenarios using `.cursor/skills/plan/templates/resilience-tests.md` as structure
6. Write security test scenarios using `.cursor/skills/plan/templates/security-tests.md` as structure
7. Write resource limit test scenarios using `.cursor/skills/plan/templates/resource-limit-tests.md` as structure
8. Build traceability matrix using `.cursor/skills/plan/templates/traceability-matrix.md` as structure

**Self-verification**:
- [ ] Every acceptance criterion is covered by at least one test scenario
- [ ] Every restriction is verified by at least one test scenario
- [ ] Every test scenario has a quantifiable expected result from `input_data/expected_results/results_report.md`
- [ ] Expected results use comparison methods from `.cursor/skills/test-spec/templates/expected-results.md`
- [ ] Positive and negative scenarios are balanced
- [ ] Consumer app has no direct access to system internals
- [ ] Docker environment is self-contained (`docker compose up` sufficient)
- [ ] External dependencies have mock/stub services defined
- [ ] Traceability matrix has no uncovered AC or restrictions

**Save action**: Write all files under TESTS_OUTPUT_DIR:
- `environment.md`
- `test-data.md`
- `blackbox-tests.md`
- `performance-tests.md`
- `resilience-tests.md`
- `security-tests.md`
- `resource-limit-tests.md`
- `traceability-matrix.md`

**BLOCKING**: Present test coverage summary (from traceability-matrix.md) to user. Do NOT proceed until confirmed.

Capture any new questions, findings, or insights that arise during test specification — these feed forward into downstream skills (plan, refactor, etc.).

---

### Phase 3: Test Data Validation Gate (HARD GATE)

**Role**: Professional Quality Assurance Engineer
**Goal**: Ensure every test scenario produced in Phase 2 has concrete, sufficient test data. Remove tests that lack data. Verify final coverage stays above 70%.
**Constraints**: This phase is MANDATORY and cannot be skipped.

#### Step 1 — Build the test-data and expected-result requirements checklist

Scan `blackbox-tests.md`, `performance-tests.md`, `resilience-tests.md`, `security-tests.md`, and `resource-limit-tests.md`. For every test scenario, extract:

| # | Test Scenario ID | Test Name | Required Input Data | Required Expected Result | Result Quantifiable? | Comparison Method | Input Provided? | Expected Result Provided? |
|---|-----------------|-----------|---------------------|-------------------------|---------------------|-------------------|----------------|--------------------------|
| 1 | [ID] | [name] | [data description] | [what system should output] | [Yes/No] | [exact/tolerance/pattern/threshold] | [Yes/No] | [Yes/No] |

Present this table to the user.

#### Step 2 — Ask user to provide missing test data AND expected results

For each row where **Input Provided?** is **No** OR **Expected Result Provided?** is **No**, ask the user:

> **Option A — Provide the missing items**: Supply what is missing:
> - **Missing input data**: Place test data files in `_docs/00_problem/input_data/` or indicate the location.
> - **Missing expected result**: Provide the quantifiable expected result for this input. Update `_docs/00_problem/input_data/expected_results/results_report.md` with a row mapping the input to its expected output. If the expected result is complex, provide a reference CSV file in `_docs/00_problem/input_data/expected_results/`. Use `.cursor/skills/test-spec/templates/expected-results.md` for format guidance.
>
> Expected results MUST be quantifiable — the test must be able to programmatically compare actual vs expected. Examples:
> - "3 detections with bounding boxes [(x1,y1,x2,y2), ...] ± 10px"
> - "HTTP 200 with JSON body matching `expected_response_01.json`"
> - "Processing time < 500ms"
> - "0 false positives in the output set"
>
> **Option B — Skip this test**: If you cannot provide the data or expected result, this test scenario will be **removed** from the specification.

**BLOCKING**: Wait for the user's response for every missing item.

#### Step 3 — Validate provided data and expected results

For each item where the user chose **Option A**:

**Input data validation**:
1. Verify the data file(s) exist at the indicated location
2. Verify **quality**: data matches the format, schema, and constraints described in the test scenario (e.g., correct image resolution, valid JSON structure, expected value ranges)
3. Verify **quantity**: enough data samples to cover the scenario (e.g., at least N images for a batch test, multiple edge-case variants)

**Expected result validation**:
4. Verify the expected result exists in `input_data/expected_results/results_report.md` or as a referenced file in `input_data/expected_results/`
5. Verify **quantifiability**: the expected result can be evaluated programmatically — it must contain at least one of:
   - Exact values (counts, strings, status codes)
   - Numeric values with tolerance (e.g., `± 10px`, `≥ 0.85`)
   - Pattern matches (regex, substring, JSON schema)
   - Thresholds (e.g., `< 500ms`, `≤ 5% error rate`)
   - Reference file for structural comparison (JSON diff, CSV diff)
6. Verify **completeness**: the expected result covers all outputs the test checks (not just one field when the test validates multiple)
7. Verify **consistency**: the expected result is consistent with the acceptance criteria it traces to

If any validation fails, report the specific issue and loop back to Step 2 for that item.

#### Step 4 — Remove tests without data or expected results

For each item where the user chose **Option B**:

1. Warn the user: `⚠️ Test scenario [ID] "[Name]" will be REMOVED from the specification due to missing test data or expected result.`
2. Remove the test scenario from the respective test file
3. Remove corresponding rows from `traceability-matrix.md`
4. Update `test-data.md` to reflect the removal

**Save action**: Write updated files under TESTS_OUTPUT_DIR:
- `test-data.md`
- Affected test files (if tests removed)
- `traceability-matrix.md` (if tests removed)

#### Step 5 — Final coverage check

After all removals, recalculate coverage:

1. Count remaining test scenarios that trace to acceptance criteria
2. Count total acceptance criteria + restrictions
3. Calculate coverage percentage: `covered_items / total_items * 100`
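
The calculation in step 3 is plain integer arithmetic. A sketch with made-up counts (real values come from `traceability-matrix.md`):

```shell
# Hypothetical counts; the real numbers come from the traceability matrix.
covered_items=12
total_items=15

coverage=$(( covered_items * 100 / total_items ))   # integer percent

echo "Coverage: ${coverage}%"
if [ "$coverage" -ge 70 ]; then
  echo "Phase 3 PASSED"
else
  echo "Phase 3 FAILED: loop back to Step 2"
fi
```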

| Metric | Value |
|--------|-------|
| Total AC + Restrictions | ? |
| Covered by remaining tests | ? |
| **Coverage %** | **?%** |

**Decision**:

- **Coverage ≥ 70%** → Phase 3 **PASSED**. Present final summary to user.
- **Coverage < 70%** → Phase 3 **FAILED**. Report:

> ❌ Test coverage dropped to **X%** (minimum 70% required). The removed test scenarios left gaps in the following acceptance criteria / restrictions:
>
> | Uncovered Item | Type (AC/Restriction) | Missing Test Data Needed |
> |---|---|---|
>
> **Action required**: Provide the missing test data for the items above, or add alternative test scenarios that cover these items with data you can supply.

**BLOCKING**: Loop back to Step 2 with the uncovered items. Do NOT finalize until coverage ≥ 70%.

#### Phase 3 Completion

When coverage ≥ 70% and all remaining tests have validated data AND quantifiable expected results:

1. Present the final coverage report
2. List all removed tests (if any) with reasons
3. Confirm every remaining test has: input data + quantifiable expected result + comparison method
4. Confirm all artifacts are saved and consistent

---

### Phase 4: Test Runner Script Generation

**Role**: DevOps engineer
**Goal**: Generate executable shell scripts that run the specified tests, so the autopilot and CI can invoke them consistently.
**Constraints**: Scripts must be idempotent, portable across dev/CI, and exit with non-zero on failure.

#### Step 1 — Detect test infrastructure

1. Identify the project's test runner from manifests and config files:
   - Python: `pytest` (pyproject.toml, setup.cfg, pytest.ini)
   - .NET: `dotnet test` (*.csproj, *.sln)
   - Rust: `cargo test` (Cargo.toml)
   - Node: `npm test` or `vitest` / `jest` (package.json)
2. Identify docker-compose files for integration/blackbox tests (`docker-compose.test.yml`, `e2e/docker-compose*.yml`)
3. Identify performance/load testing tools from dependencies (k6, locust, artillery, wrk, or built-in benchmarks)
4. Read `TESTS_OUTPUT_DIR/environment.md` for infrastructure requirements
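
A minimal sketch of the detection order in item 1, covering a subset of the table above. The function and the scratch-directory demo are illustrative, not part of this skill:

```shell
# Sketch: pick a test runner from the manifest files present (subset of table).
detect_runner() {
  if [ -f "$1/pyproject.toml" ] || [ -f "$1/pytest.ini" ] || [ -f "$1/setup.cfg" ]; then
    echo "pytest"
  elif [ -f "$1/Cargo.toml" ]; then
    echo "cargo test"
  elif [ -f "$1/package.json" ]; then
    echo "npm test"
  else
    echo "unknown"
  fi
}

# Demo against a scratch dir seeded with a Cargo.toml so the result is deterministic.
dir=$(mktemp -d)
touch "$dir/Cargo.toml"
detect_runner "$dir"   # prints: cargo test
```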

#### Step 2 — Generate `scripts/run-tests.sh`

Create `scripts/run-tests.sh` at the project root using `.cursor/skills/test-spec/templates/run-tests-script.md` as structural guidance. The script must:

1. Set `set -euo pipefail` and trap cleanup on EXIT
2. Optionally accept a `--unit-only` flag to skip blackbox tests
3. Run unit tests using the detected test runner
4. If blackbox tests exist: spin up docker-compose environment, wait for health checks, run blackbox test suite, tear down
5. Print a summary of passed/failed/skipped tests
6. Exit 0 on all pass, exit 1 on any failure
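
A dry-run sketch of the shape such a script might take. All commands are echoed rather than executed, and `pytest` plus `docker-compose.test.yml` stand in for whatever Step 1 detected; none of this is the skill's canonical template:

```shell
#!/usr/bin/env bash
# Dry-run sketch of scripts/run-tests.sh. Commands are printed via run()
# instead of executed; a generated script would replace run() with "$@".
set -euo pipefail

run() { echo "+ $*"; }

cleanup() { run docker compose -f docker-compose.test.yml down -v; }
trap cleanup EXIT

UNIT_ONLY=0
if [ "${1:-}" = "--unit-only" ]; then UNIT_ONLY=1; fi

run pytest tests/unit                # unit tests via the detected runner

if [ "$UNIT_ONLY" -eq 0 ]; then      # blackbox suite against the compose env
  run docker compose -f docker-compose.test.yml up -d --wait
  run pytest tests/blackbox
fi

echo "summary: suites dispatched"    # real script counts pass/fail and exits 1 on failure
```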

#### Step 3 — Generate `scripts/run-performance-tests.sh`

Create `scripts/run-performance-tests.sh` at the project root. The script must:

1. Set `set -euo pipefail` and trap cleanup on EXIT
2. Read thresholds from `_docs/02_document/tests/performance-tests.md` (or accept as CLI args)
3. Spin up the system under test (docker-compose or local)
4. Run load/performance scenarios using the detected tool
5. Compare results against threshold values from the test spec
6. Print a pass/fail summary per scenario
7. Exit 0 if all thresholds met, exit 1 otherwise
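
The per-scenario verdict in step 5 reduces to a threshold comparison. A sketch with hypothetical numbers standing in for the spec value and the load tool's output:

```shell
# Hypothetical values: the threshold comes from performance-tests.md,
# the measurement from the detected load tool's results.
threshold_ms=2000
measured_ms=1740

if [ "$measured_ms" -le "$threshold_ms" ]; then
  echo "PASS: standard_image.jpg ${measured_ms}ms <= ${threshold_ms}ms"
else
  echo "FAIL: standard_image.jpg ${measured_ms}ms > ${threshold_ms}ms"
  exit 1
fi
```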

#### Step 4 — Verify scripts

1. Verify both scripts are syntactically valid (`bash -n scripts/run-tests.sh`)
2. Mark both scripts as executable (`chmod +x`)
3. Present a summary of what each script does to the user

**Save action**: Write `scripts/run-tests.sh` and `scripts/run-performance-tests.sh` to the project root.

---

## Escalation Rules

| Situation | Action |
|-----------|--------|
| Missing acceptance_criteria.md, restrictions.md, or input_data/ | **STOP** — specification cannot proceed |
| Missing input_data/expected_results/results_report.md | **STOP** — ask user to provide expected results mapping using the template |
| Ambiguous requirements | ASK user |
| Input data coverage below 70% (Phase 1) | Search internet for supplementary data, ASK user to validate |
| Expected results missing or not quantifiable (Phase 1) | ASK user to provide quantifiable expected results before proceeding |
| Test scenario conflicts with restrictions | ASK user to clarify intent |
| System interfaces unclear (no architecture.md) | ASK user or derive from solution.md |
| Test data or expected result not provided for a test scenario (Phase 3) | WARN user and REMOVE the test |
| Final coverage below 70% after removals (Phase 3) | BLOCK — require user to supply data or accept reduced spec |

## Common Mistakes

- **Referencing internals**: tests must be black-box — no internal module names, no direct DB queries against the system under test
- **Vague expected outcomes**: "works correctly" is not a test outcome; use specific measurable values
- **Missing expected results**: input data without a paired expected result is useless — the test cannot determine pass/fail without knowing what "correct" looks like
- **Non-quantifiable expected results**: "should return good results" is not verifiable; expected results must have exact values, tolerances, thresholds, or pattern matches that code can evaluate
- **Missing negative scenarios**: every positive scenario category should have corresponding negative/edge-case tests
- **Untraceable tests**: every test should trace to at least one AC or restriction
- **Writing test code**: this skill produces specifications, never implementation code
- **Tests without data**: every test scenario MUST have concrete test data AND a quantifiable expected result; a test spec without either is not executable and must be removed

## Trigger Conditions

When the user wants to:
- Specify blackbox tests before implementation or refactoring
- Analyze input data completeness for test coverage
- Produce test scenarios from acceptance criteria

**Keywords**: "test spec", "test specification", "blackbox test spec", "black box tests", "blackbox tests", "test scenarios"

## Methodology Quick Reference

```
┌──────────────────────────────────────────────────────────────────────┐
│                Test Scenario Specification (4-Phase)                 │
├──────────────────────────────────────────────────────────────────────┤
│ PREREQ: Data Gate (BLOCKING)                                         │
│   → verify AC, restrictions, input_data (incl. expected_results.md)  │
│                                                                      │
│ Phase 1: Input Data & Expected Results Completeness Analysis         │
│   → assess input_data/ coverage vs AC scenarios (≥70%)               │
│   → verify every input has a quantifiable expected result            │
│   → present input→expected-result pairing assessment                 │
│   [BLOCKING: user confirms input data + expected results coverage]   │
│                                                                      │
│ Phase 2: Test Scenario Specification                                 │
│   → environment.md                                                   │
│   → test-data.md (with expected results mapping)                     │
│   → blackbox-tests.md (positive + negative)                          │
│   → performance-tests.md                                             │
│   → resilience-tests.md                                              │
│   → security-tests.md                                                │
│   → resource-limit-tests.md                                          │
│   → traceability-matrix.md                                           │
│   [BLOCKING: user confirms test coverage]                            │
│                                                                      │
│ Phase 3: Test Data & Expected Results Validation Gate (HARD GATE)    │
│   → build test-data + expected-result requirements checklist         │
│   → ask user: provide data+result (A) or remove test (B)             │
│   → validate input data (quality + quantity)                         │
│   → validate expected results (quantifiable + comparison method)     │
│   → remove tests without data or expected result, warn user          │
│   → final coverage check (≥70% or FAIL + loop back)                  │
│   [BLOCKING: coverage ≥ 70% required to pass]                        │
│                                                                      │
│ Phase 4: Test Runner Script Generation                               │
│   → detect test runner + docker-compose + load tool                  │
│   → scripts/run-tests.sh (unit + blackbox)                           │
│   → scripts/run-performance-tests.sh (load/perf scenarios)           │
│   → verify scripts are valid and executable                          │
├──────────────────────────────────────────────────────────────────────┤
│ Principles: Black-box only · Traceability · Save immediately         │
│             Ask don't assume · Spec don't code                       │
│        No test without data · No test without expected result        │
└──────────────────────────────────────────────────────────────────────┘
```
@@ -0,0 +1,135 @@
|
||||
# Expected Results Template
|
||||
|
||||
Save as `_docs/00_problem/input_data/expected_results/results_report.md`.
|
||||
For complex expected outputs, place reference CSV files alongside it in `_docs/00_problem/input_data/expected_results/`.
|
||||
Referenced by the test-spec skill (`.cursor/skills/test-spec/SKILL.md`).
|
||||
|
||||
---
|
||||
|
||||
```markdown
|
||||
# Expected Results
|
||||
|
||||
Maps every input data item to its quantifiable expected result.
|
||||
Tests use this mapping to compare actual system output against known-correct answers.
|
||||
|
||||
## Result Format Legend
|
||||
|
||||
| Result Type | When to Use | Example |
|
||||
|-------------|-------------|---------|
|
||||
| Exact value | Output must match precisely | `status_code: 200`, `detection_count: 3` |
|
||||
| Tolerance range | Numeric output with acceptable variance | `confidence: 0.92 ± 0.05`, `bbox_x: 120 ± 10px` |
|
||||
| Threshold | Output must exceed or stay below a limit | `latency < 500ms`, `confidence ≥ 0.85` |
|
||||
| Pattern match | Output must match a string/regex pattern | `error_message contains "invalid format"` |
|
||||
| File reference | Complex output compared against a reference file | `match expected_results/case_01.json` |
|
||||
| Schema match | Output structure must conform to a schema | `response matches DetectionResultSchema` |
|
||||
| Set/count | Output must contain specific items or counts | `classes ⊇ {"car", "person"}`, `detections.length == 5` |
|
||||
|
||||
## Comparison Methods
|
||||
|
||||
| Method | Description | Tolerance Syntax |
|
||||
|--------|-------------|-----------------|
|
||||
| `exact` | Actual == Expected | N/A |
|
||||
| `numeric_tolerance` | abs(actual - expected) ≤ tolerance | `± <value>` or `± <percent>%` |
|
||||
| `range` | min ≤ actual ≤ max | `[min, max]` |
|
||||
| `threshold_min` | actual ≥ threshold | `≥ <value>` |
|
||||
| `threshold_max` | actual ≤ threshold | `≤ <value>` |
|
||||
| `regex` | actual matches regex pattern | regex string |
|
||||
| `substring` | actual contains substring | substring |
|
||||
| `json_diff` | structural comparison against reference JSON | diff tolerance per field |
|
||||
| `set_contains` | actual output set contains expected items | subset notation |
|
||||
| `file_reference` | compare against reference file in expected_results/ | file path |
|
||||
|
||||
## Input → Expected Result Mapping
|
||||
|
||||
### [Scenario Group Name, e.g. "Single Image Detection"]
|
||||
|
||||
| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|
||||
|---|-------|-------------------|-----------------|------------|-----------|---------------|
|
||||
| 1 | `[file or parameters]` | [what this input represents] | [quantifiable expected output] | [method from table above] | [± value, range, or N/A] | [path in expected_results/ or N/A] |
|
||||
|
||||
#### Example — Object Detection

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|---------------|
| 1 | `image_01.jpg` | Aerial photo, 3 vehicles visible | `detection_count: 3`, classes: `["ArmorVehicle", "ArmorVehicle", "Truck"]` | exact (count), set_contains (classes) | N/A | N/A |
| 2 | `image_01.jpg` | Same image, bbox positions | bboxes: `[(120,80,340,290), (400,150,580,310), (50,400,200,520)]` | numeric_tolerance | ± 15px per coordinate | `expected_results/image_01_detections.json` |
| 3 | `image_01.jpg` | Same image, confidence scores | confidences: `[0.94, 0.88, 0.91]` | threshold_min | each ≥ 0.85 | N/A |
| 4 | `empty_scene.jpg` | Aerial photo, no objects | `detection_count: 0`, empty detections array | exact | N/A | N/A |
| 5 | `corrupted.dat` | Invalid file format | HTTP 400, body contains `"error"` key | exact (status), substring (body) | N/A | N/A |
#### Example — Performance

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|---------------|
| 1 | `standard_image.jpg` | 1920x1080 single image | Response time | threshold_max | ≤ 2000ms | N/A |
| 2 | `large_image.jpg` | 8000x6000 tiled image | Response time | threshold_max | ≤ 10000ms | N/A |
#### Example — Error Handling

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|---------------|
| 1 | `POST /detect` with no file | Missing required input | HTTP 422, message matches `"file.*required"` | exact (status), regex (message) | N/A | N/A |
| 2 | `POST /detect` with `probability_threshold: 5.0` | Out-of-range config | HTTP 422 or clamped to valid range | exact (status) or range [0.0, 1.0] | N/A | N/A |
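Rows that combine two comparison methods, like row 1 above (exact on the status code, regex on the message), simply apply both predicates to the same response. A minimal Python sketch; the function name and its arguments are hypothetical stand-ins for whatever HTTP client the runner uses.

```python
import re

def check_missing_file_response(status_code: int, body_message: str) -> bool:
    # Row 1: exact comparison on the status, regex on the error message.
    return status_code == 422 and re.search(r"file.*required", body_message) is not None
```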
## Expected Result Reference Files

When the expected output is too complex for an inline table cell (e.g., a full JSON response with nested objects), place a reference file in `_docs/00_problem/input_data/expected_results/`.

### File Naming Convention
`<input_name>_<result_type>.<format>`
Examples:

- `image_01_detections.json`
- `batch_A_results.csv`
- `video_01_annotations.json`
### Reference File Requirements

- Must be machine-readable (JSON, CSV, YAML — not prose)
- Must contain only the expected output structure and values
- Must include tolerance annotations where applicable (as metadata fields or comments)
- Must be valid and parseable by standard libraries
### Reference File Example (JSON)

File: `expected_results/image_01_detections.json`

```json
{
  "input": "image_01.jpg",
  "expected": {
    "detection_count": 3,
    "detections": [
      {
        "class": "ArmorVehicle",
        "confidence": { "min": 0.85 },
        "bbox": { "x1": 120, "y1": 80, "x2": 340, "y2": 290, "tolerance_px": 15 }
      },
      {
        "class": "ArmorVehicle",
        "confidence": { "min": 0.85 },
        "bbox": { "x1": 400, "y1": 150, "x2": 580, "y2": 310, "tolerance_px": 15 }
      },
      {
        "class": "Truck",
        "confidence": { "min": 0.85 },
        "bbox": { "x1": 50, "y1": 400, "x2": 200, "y2": 520, "tolerance_px": 15 }
      }
    ]
  }
}
```
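A runner consuming a reference file of this shape would walk the expected detections and apply the per-field tolerances. The sketch below assumes only the JSON structure shown above; `check_detections` and the shape of `actual` are hypothetical.

```python
import json

# Inline stand-in for json.load(open("expected_results/image_01_detections.json"))
REFERENCE = json.loads("""
{
  "input": "image_01.jpg",
  "expected": {
    "detection_count": 1,
    "detections": [
      { "class": "Truck",
        "confidence": { "min": 0.85 },
        "bbox": { "x1": 50, "y1": 400, "x2": 200, "y2": 520, "tolerance_px": 15 } }
    ]
  }
}
""")

def check_detections(actual, reference):
    """Compare actual detections against a reference file of the shape above."""
    expected = reference["expected"]
    if len(actual) != expected["detection_count"]:
        return False
    for det, exp in zip(actual, expected["detections"]):
        if det["class"] != exp["class"]:
            return False
        if det["confidence"] < exp["confidence"]["min"]:
            return False
        tol = exp["bbox"]["tolerance_px"]
        if any(abs(det["bbox"][k] - exp["bbox"][k]) > tol for k in ("x1", "y1", "x2", "y2")):
            return False
    return True
```

Note that this sketch assumes detections arrive in a stable order; a real runner may first need to match detections to expectations (e.g., by overlap) before comparing fields.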
---

## Guidance Notes

- Every row in the mapping table must have at least one quantifiable comparison — no row should say only "should work" or "returns result".
- Use `exact` comparison for counts, status codes, and discrete values.
- Use `numeric_tolerance` for floating-point values and spatial coordinates where minor variance is expected.
- Use `threshold_min`/`threshold_max` for performance metrics and confidence scores.
- Use `file_reference` when the expected output has more than ~3 fields or nested structures.
- Reference files must be committed alongside the input data — they are part of the test specification.
- When the system has non-deterministic behavior (e.g., model inference variance across hardware), document the expected tolerance explicitly and justify it.
# Test Runner Script Structure

Reference for generating `scripts/run-tests.sh` and `scripts/run-performance-tests.sh`.

## `scripts/run-tests.sh`

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
UNIT_ONLY=false
RESULTS_DIR="$PROJECT_ROOT/test-results"

for arg in "$@"; do
  case $arg in
    --unit-only) UNIT_ONLY=true ;;
  esac
done

cleanup() {
  :  # tear down docker-compose if it was started
}
trap cleanup EXIT

mkdir -p "$RESULTS_DIR"

# --- Unit Tests ---
# [detect runner: pytest / dotnet test / cargo test / npm test]
# [run and capture exit code]
# [save results to $RESULTS_DIR/unit-results.*]

# --- Blackbox Tests (skip if --unit-only) ---
# if ! $UNIT_ONLY; then
#   [docker compose -f <compose-file> up -d]
#   [wait for health checks]
#   [run blackbox test suite]
#   [save results to $RESULTS_DIR/blackbox-results.*]
# fi

# --- Summary ---
# [print passed / failed / skipped counts]
# [exit 0 if all passed, exit 1 otherwise]
```
## `scripts/run-performance-tests.sh`

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
RESULTS_DIR="$PROJECT_ROOT/test-results"

cleanup() {
  :  # tear down test environment if started
}
trap cleanup EXIT

mkdir -p "$RESULTS_DIR"

# --- Start System Under Test ---
# [docker compose up -d or start local server]
# [wait for health checks]

# --- Run Performance Scenarios ---
# [detect tool: k6 / locust / artillery / wrk / built-in]
# [run each scenario from performance-tests.md]
# [capture metrics: latency P50/P95/P99, throughput, error rate]

# --- Compare Against Thresholds ---
# [read thresholds from test spec or CLI args]
# [print per-scenario pass/fail]

# --- Summary ---
# [exit 0 if all thresholds met, exit 1 otherwise]
```
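The "Compare Against Thresholds" step can be as simple as a dictionary of limits checked against measured metrics. A Python sketch under assumed names: the threshold values mirror the performance example table earlier in this skill, while the metric keys and `evaluate` helper are hypothetical.

```python
# Thresholds mirroring the performance example table (values are illustrative).
THRESHOLDS = {
    "standard_image": {"p95_latency_ms": 2000},
    "large_image": {"p95_latency_ms": 10000},
}

def evaluate(scenario: str, measured: dict) -> tuple[bool, list[str]]:
    """Return (passed, failure messages) for one scenario's measured metrics."""
    failures = [
        f"{metric}: {measured[metric]} > {limit}"
        for metric, limit in THRESHOLDS[scenario].items()
        if measured[metric] > limit
    ]
    return (not failures, failures)
```

A per-scenario pass/fail printout then just iterates over scenarios and reports the failure messages.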
## Key Requirements

- Both scripts must be idempotent (safe to run multiple times)
- Both scripts must work in CI (no interactive prompts, no GUI)
- Use `trap cleanup EXIT` to ensure teardown even on failure
- Exit codes: 0 = all pass, 1 = failures detected
- Write results to the `test-results/` directory (add it to `.gitignore` if not already present)
- The actual commands depend on the detected tech stack — fill them in during Phase 4 of the test-spec skill