mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-04-23 03:46:37 +00:00
Update skills documentation to reflect changes in directory structure and terminology. Replace references to integration tests with blackbox tests across various SKILL.md files and templates. Revise paths in planning and deployment documentation to align with the updated _docs/02_document/ structure. Enhance clarity in task management processes and ensure consistency in terminology throughout the documentation.
Auto-chaining execution engine that drives the full BUILD → SHIP workflow. Detects project state from `_docs/`, resumes from where work stopped, and flows through skills automatically. The user invokes `/autopilot` once — the engine handles sequencing, transitions, and re-entry.

## File Index

| File | Purpose |
|------|---------|
| `flows/greenfield.md` | Detection rules, step table, and auto-chain rules for new projects |
| `flows/existing-code.md` | Detection rules, step table, and auto-chain rules for existing codebases |
| `state.md` | State file format, rules, re-entry protocol, session boundaries |
| `protocols.md` | User interaction, Jira MCP auth, choice format, error handling, status summary |

**On every invocation**: read all four files above before executing any logic.

## Core Principles

- **Auto-chain**: when a skill completes, immediately start the next one — no pause between skills
- **State from disk**: all progress is persisted to `_docs/_autopilot_state.md` and cross-checked against the `_docs/` folder structure
- **Rich re-entry**: on every invocation, read the state file for full context before continuing
- **Delegate, don't duplicate**: read and execute each sub-skill's SKILL.md; never inline their logic here
- **Sound on pause**: follow `.cursor/rules/human-attention-sound.mdc` — play a notification sound before every pause that requires human input
- **Minimize interruptions**: only ask the user when a decision genuinely cannot be resolved automatically
- **Single project per workspace**: all `_docs/` paths are relative to the workspace root; for monorepos, each service needs its own Cursor workspace

## Flow Resolution

Determine which flow to use:

1. If the workspace has source code files **and** `_docs/` does not exist → **existing-code flow** (Pre-Step detection)
2. If `_docs/_autopilot_state.md` exists and records Document in `Completed Steps` → **existing-code flow**
3. If `_docs/_autopilot_state.md` exists with `step: done` AND the workspace contains source code → **existing-code flow** (completed-project re-entry — loops to New Task)
4. Otherwise → **greenfield flow**

After selecting the flow, apply its detection rules (first match wins) to determine the current step.

## State File: `_docs/_autopilot_state.md`

The autopilot persists its state to `_docs/_autopilot_state.md`. This file is the primary source of truth for re-entry; folder scanning is the fallback when the state file doesn't exist.

### Format

```markdown
# Autopilot State

## Current Step
step: [0-5 or "done"]
name: [Problem / Research / Plan / Decompose / Implement / Deploy / Done]
status: [not_started / in_progress / completed]
sub_step: [optional — sub-skill phase if interrupted mid-step, e.g. "Plan Step 3: Component Decomposition"]

## Completed Steps
| Step | Name | Completed | Key Outcome |
|------|------|-----------|-------------|
| 0 | Problem | [date] | [one-line summary] |
| 1 | Research | [date] | [N drafts, final approach summary] |
| 2 | Plan | [date] | [N components, architecture summary] |
| 3 | Decompose | [date] | [N tasks, total complexity points] |
| 4 | Implement | [date] | [N batches, pass/fail summary] |
| 5 | Deploy | [date] | [artifacts produced] |

## Key Decisions
- [decision 1: e.g. "Tech stack: Python + Rust for perf-critical, Postgres DB"]
- [decision 2: e.g. "6 research rounds, final draft: solution_draft06.md"]
- [decision N]

## Last Session
date: [date]
ended_at: [step name and phase]
reason: [completed step / session boundary / user paused / context limit]
notes: [any context for next session, e.g. "User asked to revisit risk assessment"]

## Blockers
- [blocker 1, if any]
- [none]
```
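Since the `Current Step` block is plain `key: value` lines, re-entry tooling can recover it with a few regexes. A minimal Python sketch — the function name and the sample content are illustrative, not part of the skill contract:

```python
import re

def parse_current_step(state_md: str) -> dict:
    """Extract the Current Step fields from an autopilot state file."""
    fields = {}
    parts = state_md.split("## Current Step", 1)
    if len(parts) < 2:
        return fields
    # Stop at the next section heading so later sections are ignored.
    body = parts[1].split("\n## ", 1)[0]
    for key in ("step", "name", "status", "sub_step"):
        m = re.search(rf"^{key}:\s*(.+)$", body, re.MULTILINE)
        if m:
            fields[key] = m.group(1).strip()
    return fields

state = """# Autopilot State

## Current Step
step: 2
name: Plan
status: in_progress

## Completed Steps
"""
print(parse_current_step(state))
```

Splitting on the next `## ` heading keeps the parse robust if sections are reordered or extended.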
### State File Rules

1. **Create** the state file on the very first autopilot invocation (after state detection determines Step 0)
2. **Update** the state file after every step completion, every session boundary, and every BLOCKING gate confirmation
3. **Read** the state file as the first action on every invocation — before folder scanning
4. **Cross-check**: after reading the state file, verify it against the actual `_docs/` folder contents. If they disagree (e.g., the state file says Step 2 but `_docs/02_plans/architecture.md` already exists), trust the folder structure and update the state file to match
5. **Never delete** the state file. It accumulates history across the entire project lifecycle

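Rule 4 (cross-check) amounts to: let the newest on-disk artifact win. A sketch, assuming a small artifact-to-step map; the artifact names are examples drawn from the detection rules, not an exhaustive mapping:

```python
# Artifacts that prove a step has finished, in workflow order (illustrative).
STEP_ARTIFACTS = [
    (0, "00_problem/problem.md"),
    (2, "02_plans/architecture.md"),
    (3, "02_tasks/_dependencies_table.md"),
    (4, "03_implementation/FINAL_implementation_report.md"),
]

def cross_check(claimed_step: int, existing: set) -> int:
    """Rule 4: if an artifact from the claimed step (or later) already
    exists on disk, trust the folder structure and advance the step."""
    resolved = claimed_step
    for step, artifact in STEP_ARTIFACTS:
        if artifact in existing and step >= resolved:
            resolved = step + 1
    return resolved
```

In the real skill the `existing` set would come from scanning `_docs/`; the point is only that folder evidence overrides the claimed step, never the reverse.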
## Execution Entry Point

Every invocation of this skill follows the same sequence:

```
1. Read _docs/_autopilot_state.md (if it exists)
2. Read all File Index files above
3. Cross-check the state file against the _docs/ folder structure (rules in state.md)
4. Resolve the flow (see Flow Resolution above)
5. Resolve the current step (detection rules from the active flow file)
6. Present the Status Summary (template in the active flow file)
7. Enter the Execution Loop:
   a. Delegate to the current skill (see Skill Delegation below)
   b. If the skill returns FAILED → apply the Skill Failure Retry Protocol (see protocols.md):
      - Auto-retry the same skill (the failure may be caused by missing user input or an environment issue)
      - If 3 consecutive auto-retries fail → record the failure in the state file's Blockers, warn the user, and stop auto-retrying
   c. When the skill completes successfully → reset the retry counter and update the state file (rules in state.md)
   d. Re-detect the next step from the active flow's detection rules
   e. If the next skill is ready → auto-chain (go to 7a with the next skill)
   f. If a session boundary is reached → update the state file with session notes and suggest a new conversation (rules in state.md)
   g. If all steps are done → update the state file → report completion
```
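The loop in steps 7a–7g, including the 3-strike retry protocol, can be sketched as follows; `detect_next_step` and `execute_skill` are hypothetical stand-ins for the detection rules and skill delegation described elsewhere in this file:

```python
MAX_RETRIES = 3

def run_autopilot(detect_next_step, execute_skill, state):
    """Drive the auto-chain loop with the 3-strike retry protocol."""
    while True:
        step = detect_next_step(state)
        if step is None:                      # all steps done (7g)
            state["status"] = "done"
            return state
        if execute_skill(step):
            state["completed"].append(step)   # success: advance (7c-7e)
            state["retry_count"] = 0          # reset the retry counter
        else:
            state["retry_count"] += 1         # failure: retry protocol (7b)
            if state["retry_count"] >= MAX_RETRIES:
                state["blockers"].append(f"{step} failed {MAX_RETRIES}x")
                return state                  # stop auto-retry, escalate
```

Re-detecting the next step at the top of each iteration is what makes the loop resumable: nothing depends on in-memory progress that a fresh invocation could not rebuild.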
## State Detection

Read `_docs/_autopilot_state.md` first. If it exists and is consistent with the folder structure, use the `Current Step` from the state file. If the state file doesn't exist or is inconsistent, fall back to folder scanning.

### Folder Scan Rules (fallback)

Scan `_docs/` to determine the current workflow position. Check the rules in order — first match wins.

**Step 0 — Problem Gathering**
Condition: `_docs/00_problem/` does not exist, OR any of these are missing/empty:
- `problem.md`
- `restrictions.md`
- `acceptance_criteria.md`
- `input_data/` (must contain at least one file)

Action: Read and execute `.cursor/skills/problem/SKILL.md`

---

**Step 1 — Research (Initial)**
Condition: `_docs/00_problem/` is complete AND `_docs/01_solution/` has no `solution_draft*.md` files

Action: Read and execute `.cursor/skills/research/SKILL.md` (will auto-detect Mode A)

---

**Step 1b — Research Decision**
Condition: `_docs/01_solution/` contains `solution_draft*.md` files AND `_docs/01_solution/solution.md` does not exist AND `_docs/02_plans/architecture.md` does not exist

Action: Present the current research state to the user:
- how many solution drafts exist
- whether `tech_stack.md` and `security_analysis.md` exist
- a one-line summary from the latest draft

Then ask: **"Run another research round (Mode B assessment), or proceed to planning?"**
- If the user wants another round → read and execute `.cursor/skills/research/SKILL.md` (will auto-detect Mode B)
- If the user wants to proceed → auto-chain to Step 2 (Plan)

---

**Step 2 — Plan**
Condition: `_docs/01_solution/` has `solution_draft*.md` files AND `_docs/02_plans/architecture.md` does not exist

Action:
1. The plan skill's Prereq 2 will rename the latest draft to `solution.md` — this is handled by the plan skill itself
2. Read and execute `.cursor/skills/plan/SKILL.md`

If `_docs/02_plans/` exists but is incomplete (some artifacts but no `FINAL_report.md`), the plan skill's built-in resumability handles it.

---

**Step 3 — Decompose**
Condition: `_docs/02_plans/` contains `architecture.md` AND `_docs/02_plans/components/` has at least one component AND `_docs/02_tasks/` does not exist or has no task files (excluding `_dependencies_table.md`)

Action: Read and execute `.cursor/skills/decompose/SKILL.md`

If `_docs/02_tasks/` already has some task files, the decompose skill's resumability handles it.

---

**Step 4 — Implement**
Condition: `_docs/02_tasks/` contains task files AND `_dependencies_table.md` exists AND `_docs/03_implementation/FINAL_implementation_report.md` does not exist

Action: Read and execute `.cursor/skills/implement/SKILL.md`

If `_docs/03_implementation/` has batch reports, the implement skill detects completed tasks and continues.

---

**Step 5 — Deploy**
Condition: `_docs/03_implementation/FINAL_implementation_report.md` exists AND `_docs/04_deploy/` does not exist or is incomplete

Action: Read and execute `.cursor/skills/deploy/SKILL.md`

---

**Done**
Condition: `_docs/04_deploy/` contains all expected artifacts (`containerization.md`, `ci_cd_pipeline.md`, `environment_strategy.md`, `observability.md`, `deployment_procedures.md`)

Action: Report project completion with a summary.

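As an illustration, the first few folder-scan rules above can be expressed as an ordered chain of checks; the function name and the returned labels are illustrative only:

```python
from pathlib import Path

def detect_step(docs: Path) -> str:
    """Folder-scan fallback: check the rules in order, first match wins.
    Only the first few rules are sketched here."""
    problem = docs / "00_problem"
    required = ("problem.md", "restrictions.md", "acceptance_criteria.md")
    if (not problem.is_dir()
            or any(not (problem / f).is_file() for f in required)
            or not any((problem / "input_data").glob("*"))):
        return "Step 0 - Problem"
    if not list((docs / "01_solution").glob("solution_draft*.md")):
        return "Step 1 - Research"
    if not (docs / "02_plans" / "architecture.md").is_file():
        return "Step 2 - Plan"  # reached via the Step 1b research decision
    # Steps 3-5 follow the same pattern: test for each step's output artifact.
    return "Step 3+ - see remaining rules"
```

Because each rule only inspects artifacts that earlier steps produce, the chain is order-sensitive by design: first match wins.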
## Status Summary

On every invocation, before executing any skill, present a status summary built from the state file (with folder scan as fallback).

Format:

```
═══════════════════════════════════════════════════
AUTOPILOT STATUS
═══════════════════════════════════════════════════
Step 0  Problem     [DONE / IN PROGRESS / NOT STARTED]
Step 1  Research    [DONE (N drafts) / IN PROGRESS / NOT STARTED]
Step 2  Plan        [DONE / IN PROGRESS / NOT STARTED]
Step 3  Decompose   [DONE (N tasks) / IN PROGRESS / NOT STARTED]
Step 4  Implement   [DONE / IN PROGRESS (batch M of ~N) / NOT STARTED]
Step 5  Deploy      [DONE / IN PROGRESS / NOT STARTED]
═══════════════════════════════════════════════════
Current step: [Step N — Name]
Action: [what will happen next]
═══════════════════════════════════════════════════
```

For re-entry (state file exists), also include:
- key decisions from the state file's `Key Decisions` section
- last session context from the `Last Session` section
- any blockers from the `Blockers` section

## Auto-Chain Rules

After a skill completes, apply these rules:

| Completed Step | Next Action |
|---------------|-------------|
| Problem Gathering | Auto-chain → Research (Mode A) |
| Research (any round) | Auto-chain → Research Decision (ask user: another round or proceed?) |
| Research Decision → proceed | Auto-chain → Plan |
| Plan | Auto-chain → Decompose |
| Decompose | **Session boundary** — suggest new conversation before Implement |
| Implement | Auto-chain → Deploy |
| Deploy | Report completion |

### Session Boundary: Decompose → Implement

After decompose completes, **do not auto-chain to implement**. Instead:

1. Update the state file: mark Decompose as completed, set the current step to 4 (Implement) with status `not_started`
2. Write the `Last Session` section: `reason: session boundary`, `notes: Decompose complete, implementation ready`
3. Present a summary: number of tasks, estimated batches, total complexity points
4. Suggest: "Implementation is the longest phase and benefits from a fresh conversation context. Start a new conversation and type `/autopilot` to begin implementation."
5. If the user insists on continuing in the same conversation, proceed.

This is the only hard session boundary. All other transitions auto-chain.

## Skill Delegation

For each step, the delegation pattern is:

1. Update the state file: set `step` to the autopilot step number, `status` to `in_progress`, set `sub_step` to the sub-skill's current internal step/phase, and reset `retry_count: 0`
2. Announce: "Starting [Skill Name]..."
3. Read the skill file: `.cursor/skills/[name]/SKILL.md`
4. Execute the skill's workflow exactly as written, including all BLOCKING gates (present to the user, wait for confirmation), self-verification checklists, save actions, and escalation rules. Update `sub_step` in the state file each time the sub-skill advances.
5. If the skill **fails**: follow the Skill Failure Retry Protocol in `protocols.md` — increment `retry_count`, auto-retry up to 3 times, then escalate.
6. When the skill completes successfully: reset `retry_count: 0`, mark the step `completed`, record the date and a one-line key outcome, add any key decisions to the `Key Decisions` section, and return to the auto-chain rules (in the active flow file).

Do NOT modify, skip, or abbreviate any part of the sub-skill's workflow. The autopilot is a sequencer, not an optimizer.

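Delegation step 1 is a small state mutation; a sketch, where the field names follow the state file format and the function name is hypothetical:

```python
def start_step(state, step, name, sub_step=None):
    """Delegation step 1: mark the step in progress and reset the retry counter."""
    state.update({"step": step, "name": name, "status": "in_progress",
                  "sub_step": sub_step, "retry_count": 0})
    return state
```

Resetting `retry_count` here, rather than on completion alone, guarantees a fresh 3-strike budget every time a new step begins.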
## Re-Entry Protocol

When the user invokes `/autopilot` and work already exists:

1. Read `_docs/_autopilot_state.md`
2. Cross-check against the `_docs/` folder structure
3. Present the Status Summary with context from the state file (key decisions, last session, blockers)
4. If the detected step has a sub-skill with built-in resumability (plan, decompose, implement, and deploy all do), the sub-skill handles mid-step recovery
5. Continue execution from the detected state

## Error Handling

| Situation | Action |
|-----------|--------|
| State detection is ambiguous (artifacts suggest two different steps) | Present findings to the user, ask which step to execute |
| Sub-skill fails or hits an unrecoverable blocker | Report the error, suggest the user fix it manually, then re-invoke `/autopilot` |
| User wants to skip a step | Warn about downstream dependencies, proceed if the user confirms |
| User wants to go back to a previous step | Warn that re-running may overwrite artifacts, proceed if the user confirms |
| User asks "where am I?" without wanting to continue | Show the Status Summary only, do not start execution |

## Trigger Conditions

This skill activates when the user wants to:

**Differentiation**:
- User wants only research → use `/research` directly
- User wants only planning → use `/plan` directly
- User wants to document an existing codebase → use `/document` directly
- User wants the full guided workflow → use `/autopilot`

## Flow Reference

```
┌────────────────────────────────────────────────────────────────┐
│  Autopilot (Auto-Chain Orchestrator)                           │
├────────────────────────────────────────────────────────────────┤
│  EVERY INVOCATION:                                             │
│  1. State Detection (scan _docs/)                              │
│  2. Status Summary (show progress)                             │
│  3. Execute current skill                                      │
│  4. Auto-chain to next skill (loop)                            │
│                                                                │
│  WORKFLOW:                                                     │
│  Step 0  Problem    → .cursor/skills/problem/SKILL.md          │
│     ↓ auto-chain                                               │
│  Step 1  Research   → .cursor/skills/research/SKILL.md         │
│     ↓ auto-chain (ask: another round?)                         │
│  Step 2  Plan       → .cursor/skills/plan/SKILL.md             │
│     ↓ auto-chain                                               │
│  Step 3  Decompose  → .cursor/skills/decompose/SKILL.md        │
│     ↓ SESSION BOUNDARY (suggest new conversation)              │
│  Step 4  Implement  → .cursor/skills/implement/SKILL.md        │
│     ↓ auto-chain                                               │
│  Step 5  Deploy     → .cursor/skills/deploy/SKILL.md           │
│     ↓                                                          │
│  DONE                                                          │
│                                                                │
│  STATE FILE: _docs/_autopilot_state.md                         │
│  FALLBACK: _docs/ folder structure scan                        │
│  PAUSE POINTS: sub-skill BLOCKING gates only                   │
│  SESSION BREAK: after Decompose (before Implement)             │
├────────────────────────────────────────────────────────────────┤
│  Principles: Auto-chain · State to file · Rich re-entry        │
│  Delegate don't duplicate · Pause at decisions only            │
└────────────────────────────────────────────────────────────────┘
```

See `flows/greenfield.md` and `flows/existing-code.md` for step tables, detection rules, auto-chain rules, and status summary templates.

# Existing Code Workflow

Workflow for projects with an existing codebase. It starts with documentation, produces test specs, decomposes and implements the tests, verifies them, refactors with that safety net, then adds new functionality and deploys.

## Step Reference Table

| Step | Name | Sub-Skill | Internal Sub-Steps |
|------|------|-----------|--------------------|
| 1 | Document | document/SKILL.md | Steps 1–8 |
| 2 | Test Spec | test-spec/SKILL.md | Phases 1a–1b |
| 3 | Decompose Tests | decompose/SKILL.md (tests-only) | Step 1t + Step 3 + Step 4 |
| 4 | Implement Tests | implement/SKILL.md | (batch-driven, no fixed sub-steps) |
| 5 | Run Tests | test-run/SKILL.md | Steps 1–4 |
| 6 | Refactor | refactor/SKILL.md | Phases 0–5 (6-phase method) |
| 7 | New Task | new-task/SKILL.md | Steps 1–8 (loop) |
| 8 | Implement | implement/SKILL.md | (batch-driven, no fixed sub-steps) |
| 9 | Run Tests | test-run/SKILL.md | Steps 1–4 |
| 10 | Security Audit | security/SKILL.md | Phases 1–5 (optional) |
| 11 | Performance Test | (autopilot-managed) | Load/stress tests (optional) |
| 12 | Deploy | deploy/SKILL.md | Steps 1–7 |

After Step 12, the existing-code workflow is complete.

## Detection Rules

Check the rules in order — first match wins.

---

**Step 1 — Document**
Condition: `_docs/` does not exist AND the workspace contains source code files (e.g., `*.py`, `*.cs`, `*.rs`, `*.ts`, `src/`, `Cargo.toml`, `*.csproj`, `package.json`)

Action: An existing codebase without documentation was detected. Read and execute `.cursor/skills/document/SKILL.md`. After the document skill completes, re-detect state (the produced `_docs/` artifacts will place the project at Step 2 or later).

---

**Step 2 — Test Spec**
Condition: `_docs/02_document/FINAL_report.md` exists AND the workspace contains source code files (e.g., `*.py`, `*.cs`, `*.rs`, `*.ts`) AND `_docs/02_document/tests/traceability-matrix.md` does not exist AND the autopilot state shows Document was run (check `Completed Steps` for a "Document" entry)

Action: Read and execute `.cursor/skills/test-spec/SKILL.md`

This step applies when the codebase was documented via the `/document` skill. Test specifications must be produced before refactoring or further development.

---

**Step 3 — Decompose Tests**
Condition: `_docs/02_document/tests/traceability-matrix.md` exists AND the workspace contains source code files AND the autopilot state shows Document was run AND (`_docs/02_tasks/` does not exist or has no task files)

Action: Read and execute `.cursor/skills/decompose/SKILL.md` in **tests-only mode** (pass `_docs/02_document/tests/` as input). The decompose skill will:
1. Run Step 1t (test infrastructure bootstrap)
2. Run Step 3 (blackbox test task decomposition)
3. Run Step 4 (cross-verification against test coverage)

If `_docs/02_tasks/` already has some task files, the decompose skill's resumability handles it.

---

**Step 4 — Implement Tests**
Condition: `_docs/02_tasks/` contains task files AND `_dependencies_table.md` exists AND the autopilot state shows Step 3 (Decompose Tests) is completed AND `_docs/03_implementation/FINAL_implementation_report.md` does not exist

Action: Read and execute `.cursor/skills/implement/SKILL.md`

The implement skill reads the test tasks from `_docs/02_tasks/` and implements them.

If `_docs/03_implementation/` has batch reports, the implement skill detects completed tasks and continues.

---

**Step 5 — Run Tests**
Condition: `_docs/03_implementation/FINAL_implementation_report.md` exists AND the autopilot state shows Step 4 (Implement Tests) is completed AND the autopilot state does NOT show Step 5 (Run Tests) as completed

Action: Read and execute `.cursor/skills/test-run/SKILL.md`

Verifies that the implemented test suite passes before proceeding to refactoring. The tests form the safety net for all subsequent code changes.

---

**Step 6 — Refactor**
Condition: the autopilot state shows Step 5 (Run Tests) is completed AND `_docs/04_refactoring/FINAL_report.md` does not exist

Action: Read and execute `.cursor/skills/refactor/SKILL.md`

The refactor skill runs the full 6-phase method using the implemented tests as a safety net.

If `_docs/04_refactoring/` has phase reports, the refactor skill detects completed phases and continues.

---

**Step 7 — New Task**
Condition: the autopilot state shows Step 6 (Refactor) is completed AND the autopilot state does NOT show Step 7 (New Task) as completed

Action: Read and execute `.cursor/skills/new-task/SKILL.md`

The new-task skill interactively guides the user through defining new functionality. It loops until the user is done adding tasks. New task files are written to `_docs/02_tasks/`.

---

**Step 8 — Implement**
Condition: the autopilot state shows Step 7 (New Task) is completed AND `_docs/03_implementation/` does not contain a FINAL report covering the new tasks (check the state file to distinguish test implementation from feature implementation)

Action: Read and execute `.cursor/skills/implement/SKILL.md`

The implement skill reads the new tasks from `_docs/02_tasks/` and implements them. Tasks already implemented in Step 4 are skipped (the implement skill tracks completed tasks in batch reports).

If `_docs/03_implementation/` has batch reports from this phase, the implement skill detects completed tasks and continues.

---

**Step 9 — Run Tests**
Condition: the autopilot state shows Step 8 (Implement) is completed AND the autopilot state does NOT show Step 9 (Run Tests) as completed

Action: Read and execute `.cursor/skills/test-run/SKILL.md`

---

**Step 10 — Security Audit (optional)**
Condition: the autopilot state shows Step 9 (Run Tests) is completed AND the autopilot state does NOT show Step 10 (Security Audit) as completed or skipped AND (`_docs/04_deploy/` does not exist or is incomplete)

Action: Present using the Choose format:

```
══════════════════════════════════════
DECISION REQUIRED: Run security audit before deploy?
══════════════════════════════════════
A) Run security audit (recommended for production deployments)
B) Skip — proceed directly to deploy
══════════════════════════════════════
Recommendation: A — catches vulnerabilities before production
══════════════════════════════════════
```

- If the user picks A → read and execute `.cursor/skills/security/SKILL.md`. After completion, auto-chain to Step 11 (Performance Test).
- If the user picks B → mark Step 10 as `skipped` in the state file, auto-chain to Step 11 (Performance Test).

---

**Step 11 — Performance Test (optional)**
Condition: the autopilot state shows Step 10 (Security Audit) is completed or skipped AND the autopilot state does NOT show Step 11 (Performance Test) as completed or skipped AND (`_docs/04_deploy/` does not exist or is incomplete)

Action: Present using the Choose format:

```
══════════════════════════════════════
DECISION REQUIRED: Run performance/load tests before deploy?
══════════════════════════════════════
A) Run performance tests (recommended for latency-sensitive or high-load systems)
B) Skip — proceed directly to deploy
══════════════════════════════════════
Recommendation: [A or B — base on whether acceptance criteria
include latency, throughput, or load requirements]
══════════════════════════════════════
```

- If the user picks A → run the performance tests:
  1. If `scripts/run-performance-tests.sh` exists (generated by the test-spec skill Phase 4), execute it
  2. Otherwise, check whether `_docs/02_document/tests/performance-tests.md` exists for test scenarios, detect an appropriate load-testing tool (k6, locust, artillery, wrk, or built-in benchmarks), and execute the performance test scenarios against the running system
  3. Present results vs. the acceptance criteria thresholds
  4. If thresholds fail → present the Choose format: A) Fix and re-run, B) Proceed anyway, C) Abort
  5. After completion, auto-chain to Step 12 (Deploy)
- If the user picks B → mark Step 11 as `skipped` in the state file, auto-chain to Step 12 (Deploy).

---

**Step 12 — Deploy**
Condition: the autopilot state shows Step 9 (Run Tests) is completed AND (Step 10 is completed or skipped) AND (Step 11 is completed or skipped) AND (`_docs/04_deploy/` does not exist or is incomplete)

Action: Read and execute `.cursor/skills/deploy/SKILL.md`

After deployment completes, the existing-code workflow is done.

---

**Re-Entry After Completion**
Condition: the autopilot state shows `step: done` OR all steps through 12 (Deploy) are completed

Action: The project completed a full cycle. Present the status and loop back to New Task:

```
══════════════════════════════════════
PROJECT CYCLE COMPLETE
══════════════════════════════════════
The previous cycle finished successfully.
You can now add new functionality.
══════════════════════════════════════
A) Add new features (start New Task)
B) Done — no more changes needed
══════════════════════════════════════
```

- If the user picks A → set `step: 7`, `status: not_started` in the state file, then auto-chain to Step 7 (New Task). Previous cycle history stays in Completed Steps.
- If the user picks B → report the final project status and exit.

## Auto-Chain Rules

| Completed Step | Next Action |
|---------------|-------------|
| Document (1) | Auto-chain → Test Spec (2) |
| Test Spec (2) | Auto-chain → Decompose Tests (3) |
| Decompose Tests (3) | **Session boundary** — suggest new conversation before Implement Tests |
| Implement Tests (4) | Auto-chain → Run Tests (5) |
| Run Tests (5, all pass) | Auto-chain → Refactor (6) |
| Refactor (6) | Auto-chain → New Task (7) |
| New Task (7) | **Session boundary** — suggest new conversation before Implement |
| Implement (8) | Auto-chain → Run Tests (9) |
| Run Tests (9, all pass) | Auto-chain → Security Audit choice (10) |
| Security Audit (10, done or skipped) | Auto-chain → Performance Test choice (11) |
| Performance Test (11, done or skipped) | Auto-chain → Deploy (12) |
| Deploy (12) | **Workflow complete** — existing-code flow done |

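The table is equivalent to a successor map; a sketch, where the tuple's second element distinguishes plain auto-chaining from the two session boundaries:

```python
# Successor map for the existing-code flow (illustrative representation).
NEXT = {
    1: (2, "auto"),      2: (3, "auto"),      3: (4, "boundary"),
    4: (5, "auto"),      5: (6, "auto"),      6: (7, "auto"),
    7: (8, "boundary"),  8: (9, "auto"),      9: (10, "auto"),
    10: (11, "auto"),    11: (12, "auto"),    12: (None, "complete"),
}

def next_step(completed):
    """Given the step that just completed, return (next step, transition kind)."""
    return NEXT[completed]
```

Encoding the boundaries in the map itself keeps the loop logic uniform: the engine only has to branch on the transition kind, never on specific step numbers.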
## Status Summary Template

```
═══════════════════════════════════════════════════
AUTOPILOT STATUS (existing-code)
═══════════════════════════════════════════════════
Step 1   Document          [DONE / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 2   Test Spec         [DONE / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 3   Decompose Tests   [DONE (N tasks) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 4   Implement Tests   [DONE / IN PROGRESS (batch M) / NOT STARTED / FAILED (retry N/3)]
Step 5   Run Tests         [DONE (N passed, M failed) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 6   Refactor          [DONE / IN PROGRESS (phase N) / NOT STARTED / FAILED (retry N/3)]
Step 7   New Task          [DONE (N tasks) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 8   Implement         [DONE / IN PROGRESS (batch M of ~N) / NOT STARTED / FAILED (retry N/3)]
Step 9   Run Tests         [DONE (N passed, M failed) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 10  Security Audit    [DONE / SKIPPED / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 11  Performance Test  [DONE / SKIPPED / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 12  Deploy            [DONE / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
═══════════════════════════════════════════════════
Current: Step N — Name
SubStep: M — [sub-skill internal step name]
Retry:   [N/3 if retrying, omit if 0]
Action:  [what will happen next]
═══════════════════════════════════════════════════
```

@@ -0,0 +1,235 @@
# Greenfield Workflow

Workflow for new projects built from scratch. Flows linearly: Problem → Research → Plan → UI Design (if applicable) → Decompose → Implement → Run Tests → Security Audit (optional) → Performance Test (optional) → Deploy.

## Step Reference Table

| Step | Name | Sub-Skill | Internal SubSteps |
|------|------|-----------|-------------------|
| 1 | Problem | problem/SKILL.md | Phase 1–4 |
| 2 | Research | research/SKILL.md | Mode A: Phase 1–4 · Mode B: Step 0–8 |
| 3 | Plan | plan/SKILL.md | Step 1–6 + Final |
| 4 | UI Design | ui-design/SKILL.md | Phase 0–8 (conditional — UI projects only) |
| 5 | Decompose | decompose/SKILL.md | Step 1–4 |
| 6 | Implement | implement/SKILL.md | (batch-driven, no fixed sub-steps) |
| 7 | Run Tests | test-run/SKILL.md | Step 1–4 |
| 8 | Security Audit | security/SKILL.md | Phase 1–5 (optional) |
| 9 | Performance Test | (autopilot-managed) | Load/stress tests (optional) |
| 10 | Deploy | deploy/SKILL.md | Step 1–7 |

## Detection Rules

Check rules in order — first match wins.

---

**Step 1 — Problem Gathering**
Condition: `_docs/00_problem/` does not exist, OR any of these are missing/empty:
- `problem.md`
- `restrictions.md`
- `acceptance_criteria.md`
- `input_data/` (must contain at least one file)

Action: Read and execute `.cursor/skills/problem/SKILL.md`
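
As a rough illustration (not part of the skill itself — the helper name `problem_complete` is invented here), the Step 1 condition maps onto a shell check like the following sketch:

```shell
# Sketch: mirrors the Step 1 detection rule. Succeeds only when every
# required problem artifact exists and is non-empty.
problem_complete() {
  dir="${1:-_docs/00_problem}"
  [ -d "$dir" ] || return 1
  for f in problem.md restrictions.md acceptance_criteria.md; do
    [ -s "$dir/$f" ] || return 1            # missing or empty file → incomplete
  done
  # input_data/ must contain at least one file
  [ -n "$(find "$dir/input_data" -type f 2>/dev/null | head -n 1)" ]
}

problem_complete && echo "Step 1 satisfied" || echo "run problem/SKILL.md"
```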

---

**Step 2 — Research (Initial)**
Condition: `_docs/00_problem/` is complete AND `_docs/01_solution/` has no `solution_draft*.md` files

Action: Read and execute `.cursor/skills/research/SKILL.md` (will auto-detect Mode A)

---

**Research Decision** (inline gate between Step 2 and Step 3)
Condition: `_docs/01_solution/` contains `solution_draft*.md` files AND `_docs/01_solution/solution.md` does not exist AND `_docs/02_document/architecture.md` does not exist

Action: Present the current research state to the user:
- How many solution drafts exist
- Whether tech_stack.md and security_analysis.md exist
- One-line summary from the latest draft

Then present using the **Choose format**:

```
══════════════════════════════════════
DECISION REQUIRED: Research complete — next action?
══════════════════════════════════════
A) Run another research round (Mode B assessment)
B) Proceed to planning with current draft
══════════════════════════════════════
Recommendation: [A or B] — [reason based on draft quality]
══════════════════════════════════════
```

- If user picks A → Read and execute `.cursor/skills/research/SKILL.md` (will auto-detect Mode B)
- If user picks B → auto-chain to Step 3 (Plan)

---

**Step 3 — Plan**
Condition: `_docs/01_solution/` has `solution_draft*.md` files AND `_docs/02_document/architecture.md` does not exist

Action:
1. The plan skill's Prereq 2 will rename the latest draft to `solution.md` — this is handled by the plan skill itself
2. Read and execute `.cursor/skills/plan/SKILL.md`

If `_docs/02_document/` exists but is incomplete (has some artifacts but no `FINAL_report.md`), the plan skill's built-in resumability handles it.

---

**Step 4 — UI Design (conditional)**
Condition: `_docs/02_document/architecture.md` exists AND the autopilot state does NOT show Step 4 (UI Design) as completed or skipped AND the project is a UI project

**UI Project Detection** — the project is a UI project if ANY of the following are true:
- `package.json` exists in the workspace root or any subdirectory
- `*.html`, `*.jsx`, `*.tsx` files exist in the workspace
- `_docs/02_document/components/` contains a component whose `description.md` mentions UI, frontend, page, screen, dashboard, form, or view
- `_docs/02_document/architecture.md` mentions frontend, UI layer, SPA, or client-side rendering
- `_docs/01_solution/solution.md` mentions frontend, web interface, or user-facing UI

If the project is NOT a UI project → mark Step 4 as `skipped` in the state file and auto-chain to Step 5.
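
The signals above could be approximated with a shell sketch (the function name `is_ui_project` is illustrative, and the document-content signals are reduced to a single keyword grep over two of the planning files):

```shell
# Sketch: approximate UI-project detection. Checks the file-based signals
# first, then falls back to a keyword scan over the planning documents.
is_ui_project() {
  root="${1:-.}"
  # Signal: package.json anywhere in the workspace (ignoring node_modules)
  [ -n "$(find "$root" -name package.json -not -path '*/node_modules/*' 2>/dev/null | head -n 1)" ] && return 0
  # Signal: any *.html / *.jsx / *.tsx file
  [ -n "$(find "$root" \( -name '*.html' -o -name '*.jsx' -o -name '*.tsx' \) 2>/dev/null | head -n 1)" ] && return 0
  # Signal: keyword scan over architecture.md and solution.md
  grep -qiE 'frontend|UI layer|SPA|client-side rendering|web interface|user-facing UI' \
    "$root/_docs/02_document/architecture.md" \
    "$root/_docs/01_solution/solution.md" 2>/dev/null
}
```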

If the project IS a UI project → present using Choose format:

```
══════════════════════════════════════
DECISION REQUIRED: UI project detected — generate mockups?
══════════════════════════════════════
A) Generate UI mockups before decomposition (recommended)
B) Skip — proceed directly to decompose
══════════════════════════════════════
Recommendation: A — mockups before decomposition
produce better task specs for frontend components
══════════════════════════════════════
```

- If user picks A → Read and execute `.cursor/skills/ui-design/SKILL.md`. After completion, auto-chain to Step 5 (Decompose).
- If user picks B → Mark Step 4 as `skipped` in the state file, auto-chain to Step 5 (Decompose).

---

**Step 5 — Decompose**
Condition: `_docs/02_document/` contains `architecture.md` AND `_docs/02_document/components/` has at least one component AND `_docs/02_tasks/` does not exist or has no task files (excluding `_dependencies_table.md`)

Action: Read and execute `.cursor/skills/decompose/SKILL.md`

If `_docs/02_tasks/` already has some task files, the decompose skill's resumability handles it.

---

**Step 6 — Implement**
Condition: `_docs/02_tasks/` contains task files AND `_dependencies_table.md` exists AND `_docs/03_implementation/FINAL_implementation_report.md` does not exist

Action: Read and execute `.cursor/skills/implement/SKILL.md`

If `_docs/03_implementation/` has batch reports, the implement skill detects completed tasks and continues.

---

**Step 7 — Run Tests**
Condition: `_docs/03_implementation/FINAL_implementation_report.md` exists AND the autopilot state does NOT show Step 7 (Run Tests) as completed AND (`_docs/04_deploy/` does not exist or is incomplete)

Action: Read and execute `.cursor/skills/test-run/SKILL.md`

---

**Step 8 — Security Audit (optional)**
Condition: the autopilot state shows Step 7 (Run Tests) is completed AND the autopilot state does NOT show Step 8 (Security Audit) as completed or skipped AND (`_docs/04_deploy/` does not exist or is incomplete)

Action: Present using Choose format:

```
══════════════════════════════════════
DECISION REQUIRED: Run security audit before deploy?
══════════════════════════════════════
A) Run security audit (recommended for production deployments)
B) Skip — proceed directly to deploy
══════════════════════════════════════
Recommendation: A — catches vulnerabilities before production
══════════════════════════════════════
```

- If user picks A → Read and execute `.cursor/skills/security/SKILL.md`. After completion, auto-chain to Step 9 (Performance Test).
- If user picks B → Mark Step 8 as `skipped` in the state file, auto-chain to Step 9 (Performance Test).

---

**Step 9 — Performance Test (optional)**
Condition: the autopilot state shows Step 8 (Security Audit) is completed or skipped AND the autopilot state does NOT show Step 9 (Performance Test) as completed or skipped AND (`_docs/04_deploy/` does not exist or is incomplete)

Action: Present using Choose format:

```
══════════════════════════════════════
DECISION REQUIRED: Run performance/load tests before deploy?
══════════════════════════════════════
A) Run performance tests (recommended for latency-sensitive or high-load systems)
B) Skip — proceed directly to deploy
══════════════════════════════════════
Recommendation: [A or B — base on whether acceptance criteria
include latency, throughput, or load requirements]
══════════════════════════════════════
```

- If user picks A → Run performance tests:
  1. If `scripts/run-performance-tests.sh` exists (generated by the test-spec skill Phase 4), execute it
  2. Otherwise, check whether `_docs/02_document/tests/performance-tests.md` defines test scenarios, detect an appropriate load-testing tool (k6, locust, artillery, wrk, or built-in benchmarks), and execute the scenarios against the running system
  3. Present results vs acceptance criteria thresholds
  4. If thresholds fail → present Choose format: A) Fix and re-run, B) Proceed anyway, C) Abort
  5. After completion, auto-chain to Step 10 (Deploy)
- If user picks B → Mark Step 9 as `skipped` in the state file, auto-chain to Step 10 (Deploy).
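
The tool-detection part of action 2 could be sketched as follows (the helper name `detect_load_tool` is invented for illustration; the actual choice stays with the autopilot):

```shell
# Sketch: pick the first load-testing tool available on PATH, in the order
# the workflow lists them; print "none" when nothing is installed.
detect_load_tool() {
  for tool in k6 locust artillery wrk; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      return 0
    fi
  done
  echo "none"
  return 1
}

detect_load_tool
```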

---

**Step 10 — Deploy**
Condition: the autopilot state shows Step 7 (Run Tests) is completed AND (Step 8 is completed or skipped) AND (Step 9 is completed or skipped) AND (`_docs/04_deploy/` does not exist or is incomplete)

Action: Read and execute `.cursor/skills/deploy/SKILL.md`

---

**Done**
Condition: `_docs/04_deploy/` contains all expected artifacts (containerization.md, ci_cd_pipeline.md, environment_strategy.md, observability.md, deployment_procedures.md)

Action: Report project completion with summary. If the user runs autopilot again after greenfield completion, Flow Resolution rule 3 routes to the existing-code flow (re-entry after completion) so they can add new features.

## Auto-Chain Rules

| Completed Step | Next Action |
|---------------|-------------|
| Problem (1) | Auto-chain → Research (2) |
| Research (2) | Auto-chain → Research Decision (ask user: another round or proceed?) |
| Research Decision → proceed | Auto-chain → Plan (3) |
| Plan (3) | Auto-chain → UI Design detection (4) |
| UI Design (4, done or skipped) | Auto-chain → Decompose (5) |
| Decompose (5) | **Session boundary** — suggest new conversation before Implement |
| Implement (6) | Auto-chain → Run Tests (7) |
| Run Tests (7, all pass) | Auto-chain → Security Audit choice (8) |
| Security Audit (8, done or skipped) | Auto-chain → Performance Test choice (9) |
| Performance Test (9, done or skipped) | Auto-chain → Deploy (10) |
| Deploy (10) | Report completion |

## Status Summary Template

```
═══════════════════════════════════════════════════
AUTOPILOT STATUS (greenfield)
═══════════════════════════════════════════════════
Step 1  Problem           [DONE / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 2  Research          [DONE (N drafts) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 3  Plan              [DONE / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 4  UI Design         [DONE / SKIPPED / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 5  Decompose         [DONE (N tasks) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 6  Implement         [DONE / IN PROGRESS (batch M of ~N) / NOT STARTED / FAILED (retry N/3)]
Step 7  Run Tests         [DONE (N passed, M failed) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 8  Security Audit    [DONE / SKIPPED / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 9  Performance Test  [DONE / SKIPPED / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
Step 10 Deploy            [DONE / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
═══════════════════════════════════════════════════
Current: Step N — Name
SubStep: M — [sub-skill internal step name]
Retry:   [N/3 if retrying, omit if 0]
Action:  [what will happen next]
═══════════════════════════════════════════════════
```

@@ -0,0 +1,314 @@
# Autopilot Protocols

## User Interaction Protocol

Every time the autopilot or a sub-skill needs a user decision, use the **Choose A / B / C / D** format. This applies to:

- State transitions where multiple valid next actions exist
- Sub-skill BLOCKING gates that require user judgment
- Any fork where the autopilot cannot confidently pick the right path
- Trade-off decisions (tech choices, scope, risk acceptance)

### When to Ask (MUST ask)

- The next action is ambiguous (e.g., "another research round or proceed?")
- The decision has irreversible consequences (e.g., architecture choices, skipping a step)
- The user's intent or preference cannot be inferred from existing artifacts
- A sub-skill's BLOCKING gate explicitly requires user confirmation
- Multiple valid approaches exist with meaningfully different trade-offs

### When NOT to Ask (auto-transition)

- Only one logical next step exists (e.g., Problem complete → Research is the only option)
- The transition is deterministic from the state (e.g., Plan complete → Decompose)
- The decision is low-risk and reversible
- Existing artifacts or prior decisions already imply the answer

### Choice Format

Always present decisions in this format:

```
══════════════════════════════════════
DECISION REQUIRED: [brief context]
══════════════════════════════════════
A) [Option A — short description]
B) [Option B — short description]
C) [Option C — short description, if applicable]
D) [Option D — short description, if applicable]
══════════════════════════════════════
Recommendation: [A/B/C/D] — [one-line reason]
══════════════════════════════════════
```

Rules:
1. Always provide 2–4 concrete options (never open-ended questions)
2. Always include a recommendation with a brief justification
3. Keep option descriptions to one line each
4. If only 2 options make sense, use A/B only — do not pad with filler options
5. Play the notification sound (per `human-attention-sound.mdc`) before presenting the choice
6. Record every user decision in the state file's `Key Decisions` section
7. After the user picks, proceed immediately — no follow-up confirmation unless the choice was destructive

## Work Item Tracker Authentication

Several workflow steps create work items (epics, tasks, links). The system supports **Jira MCP** and **Azure DevOps MCP** as interchangeable backends. Detect which is configured by listing available MCP servers.

### Tracker Detection

1. Check for available MCP servers: Jira MCP (`user-Jira-MCP-Server`) or Azure DevOps MCP (`user-AzureDevops`)
2. If both are available, ask the user which to use (Choose format)
3. Record the choice in the state file: `tracker: jira` or `tracker: ado`
4. If neither is available, set `tracker: local` and proceed without external tracking

### Steps That Require Work Item Tracker

| Flow | Step | Sub-Step | Tracker Action |
|------|------|----------|----------------|
| greenfield | 3 (Plan) | Step 6 — Epics | Create epics for each component |
| greenfield | 5 (Decompose) | Step 1–3 — All tasks | Create ticket per task, link to epic |
| existing-code | 3 (Decompose Tests) | Step 1t + Step 3 — All test tasks | Create ticket per task, link to epic |
| existing-code | 7 (New Task) | Step 7 — Ticket | Create ticket per task, link to epic |

### Authentication Gate

Before entering a step that requires work item tracking (see table above) for the first time, the autopilot must:

1. Call `mcp_auth` on the detected tracker's MCP server
2. If authentication succeeds → proceed normally
3. If the user **skips** or authentication fails → present using Choose format:

```
══════════════════════════════════════
Tracker authentication failed
══════════════════════════════════════
A) Retry authentication (retry mcp_auth)
B) Continue without tracker (tasks saved locally only)
══════════════════════════════════════
Recommendation: A — Tracker IDs drive task referencing,
dependency tracking, and implementation batching.
Without tracker, task files use numeric prefixes instead.
══════════════════════════════════════
```

If user picks **B** (continue without tracker):
- Set a flag in the state file: `tracker: local`
- All skills that would create tickets instead save metadata locally in the task/epic files with `Tracker: pending` status
- Task files keep numeric prefixes (e.g., `01_initial_structure.md`) instead of tracker ID prefixes
- The workflow proceeds normally in all other respects

### Re-Authentication

If the tracker MCP was already authenticated in a previous invocation (verify by listing available tools beyond `mcp_auth`), skip the auth gate.

## Error Handling

All error situations that require user input MUST use the **Choose A / B / C / D** format.

| Situation | Action |
|-----------|--------|
| State detection is ambiguous (artifacts suggest two different steps) | Present findings and use Choose format with the candidate steps as options |
| Sub-skill fails or hits an unrecoverable blocker | Use Choose format: A) retry, B) skip with warning, C) abort and fix manually |
| User wants to skip a step | Use Choose format: A) skip (with dependency warning), B) execute the step |
| User wants to go back to a previous step | Use Choose format: A) re-run (with overwrite warning), B) stay on current step |
| User asks "where am I?" without wanting to continue | Show Status Summary only, do not start execution |

## Skill Failure Retry Protocol

Sub-skills can return a **failed** result. Failures are often caused by missing user input, environment issues, or transient errors that resolve on retry. The autopilot auto-retries before escalating.

### Retry Flow

```
Skill execution → FAILED
  │
  ├─ retry_count < 3 ?
  │    YES → increment retry_count in state file
  │        → log failure reason in state file (Retry Log section)
  │        → re-read the sub-skill's SKILL.md
  │        → re-execute from the current sub_step
  │        → (loop back to check result)
  │
  │    NO (retry_count = 3) →
  │        → set status: failed in Current Step
  │        → add entry to Blockers section:
  │          "[Skill Name] failed 3 consecutive times at sub_step [M].
  │           Last failure: [reason]. Auto-retry exhausted."
  │        → present warning to user (see Escalation below)
  │        → do NOT auto-retry again until user intervenes
```

### Retry Rules

1. **Auto-retry immediately**: when a skill fails, retry it without asking the user — the failure is often transient (missing user confirmation in a prior step, docker not running, file lock, etc.)
2. **Preserve sub_step**: retry from the last recorded `sub_step`, not from the beginning of the skill — unless the failure indicates corruption, in which case restart from sub_step 1
3. **Increment `retry_count`**: update `retry_count` in the state file's `Current Step` section on each retry attempt
4. **Log each failure**: append the failure reason and timestamp to the state file's `Retry Log` section
5. **Reset on success**: when the skill eventually succeeds, reset `retry_count: 0` and clear the `Retry Log` for that step

### Escalation (after 3 consecutive failures)

After 3 failed auto-retries of the same skill, the failure is likely not user-related. Stop retrying and escalate:

1. Update the state file:
   - Set `status: failed` in `Current Step`
   - Set `retry_count: 3`
   - Add a blocker entry describing the repeated failure
2. Play notification sound (per `human-attention-sound.mdc`)
3. Present using Choose format:

```
══════════════════════════════════════
SKILL FAILED: [Skill Name] — 3 consecutive failures
══════════════════════════════════════
Step: [N] — [Name]
SubStep: [M] — [sub-step name]
Last failure reason: [reason]
══════════════════════════════════════
A) Retry with fresh context (new conversation)
B) Skip this step with warning
C) Abort — investigate and fix manually
══════════════════════════════════════
Recommendation: A — fresh context often resolves
persistent failures
══════════════════════════════════════
```

### Re-Entry After Failure

On the next autopilot invocation (new conversation), if the state file shows `status: failed` and `retry_count: 3`:

- Present the blocker to the user before attempting execution
- If the user chooses to retry → reset `retry_count: 0`, set `status: in_progress`, and re-execute
- If the user chooses to skip → mark step as `skipped`, proceed to next step
- Do NOT silently auto-retry — the user must acknowledge the persistent failure first

## Error Recovery Protocol

### Stuck Detection

When executing a sub-skill, monitor for these signals:

- Same artifact overwritten 3+ times without meaningful change
- Sub-skill repeatedly asks the same question after receiving an answer
- No new artifacts saved for an extended period despite active execution

### Recovery Actions (ordered)

1. **Re-read state**: read `_docs/_autopilot_state.md` and cross-check against `_docs/` folders
2. **Retry current sub-step**: re-read the sub-skill's SKILL.md and restart from the current sub-step
3. **Escalate**: after 2 failed retries, present diagnostic summary to user using Choose format:

```
══════════════════════════════════════
RECOVERY: [skill name] stuck at [sub-step]
══════════════════════════════════════
A) Retry with fresh context (new conversation)
B) Skip this sub-step with warning
C) Abort and fix manually
══════════════════════════════════════
Recommendation: A — fresh context often resolves stuck loops
══════════════════════════════════════
```

### Circuit Breaker

If the same autopilot step fails 3 consecutive times across conversations:

- Record the failure pattern in the state file's `Blockers` section
- Do NOT auto-retry on next invocation
- Present the blocker and ask user for guidance before attempting again

## Context Management Protocol

### Principle

Disk is memory. Never rely on in-context accumulation — read from `_docs/` artifacts, not from conversation history.

### Minimal Re-Read Set Per Skill

When re-entering a skill (new conversation or context refresh):

- Always read: `_docs/_autopilot_state.md`
- Always read: the active skill's `SKILL.md`
- Conditionally read: only the `_docs/` artifacts the current sub-step requires (listed in each skill's Context Resolution section)
- Never bulk-read: do not load all `_docs/` files at once

### Mid-Skill Interruption

If context is filling up during a long skill (e.g., document, implement):

1. Save current sub-step progress to the skill's artifact directory
2. Update `_docs/_autopilot_state.md` with exact sub-step position
3. Suggest a new conversation: "Context is getting long — recommend continuing in a fresh conversation for better results"
4. On re-entry, the skill's resumability protocol picks up from the saved sub-step

### Large Artifact Handling

When a skill needs to read large files (e.g., full solution.md, architecture.md):

- Read only the sections relevant to the current sub-step
- Use search tools (Grep, SemanticSearch) to find specific sections rather than reading entire files
- Summarize key decisions from prior steps in the state file so they don't need to be re-read

### Context Budget Heuristic

Agents cannot programmatically query context window usage. Use these heuristics to avoid degradation:

| Zone | Indicators | Action |
|------|-----------|--------|
| **Safe** | State file + SKILL.md + 2–3 focused artifacts loaded | Continue normally |
| **Caution** | 5+ artifacts loaded, or 3+ large files (architecture, solution, discovery), or conversation has 20+ tool calls | Complete current sub-step, then suggest session break |
| **Danger** | Repeated truncation in tool output, tool calls failing unexpectedly, responses becoming shallow or repetitive | Save immediately, update state file, force session boundary |

**Skill-specific guidelines**:

| Skill | Recommended session breaks |
|-------|---------------------------|
| **document** | After every ~5 modules in Step 1; between Step 4 (Verification) and Step 5 (Solution Extraction) |
| **implement** | Each batch is a natural checkpoint; if more than 2 batches completed in one session, suggest break |
| **plan** | Between Step 5 (Test Specifications) and Step 6 (Epics) for projects with many components |
| **research** | Between Mode A rounds; between Mode A and Mode B |

**How to detect caution/danger zone without API**:

1. Count tool calls made so far — if approaching 20+, context is likely filling up
2. If reading a file returns truncated content, context is under pressure
3. If the agent starts producing shorter or less detailed responses than earlier in the conversation, context quality is degrading
4. When in doubt, save and suggest a new conversation — re-entry is cheap thanks to the state file

## Rollback Protocol

### Implementation Steps (git-based)

Handled by `/implement` skill — each batch commit is a rollback checkpoint via `git revert`.

### Planning/Documentation Steps (artifact-based)

For steps that produce `_docs/` artifacts (problem, research, plan, decompose, document):

1. **Before overwriting**: if re-running a step that already has artifacts, the sub-skill's prerequisite check asks the user (resume/overwrite/skip)
2. **Rollback to previous step**: use Choose format:

```
══════════════════════════════════════
ROLLBACK: Re-run [step name]?
══════════════════════════════════════
A) Re-run the step (overwrites current artifacts)
B) Stay on current step
══════════════════════════════════════
Warning: This will overwrite files in _docs/[folder]/
══════════════════════════════════════
```

3. **Git safety net**: artifacts are committed with each autopilot step completion. To roll back: `git log --oneline _docs/` to find the commit, then `git checkout <commit> -- _docs/<folder>/`
4. **State file rollback**: when rolling back artifacts, also update `_docs/_autopilot_state.md` to reflect the rolled-back step (set it to `in_progress`, clear completed date)
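
A self-contained demonstration of the rule 3 pattern in a throwaway repo (paths, commit messages, and identity settings are examples only, not part of the protocol):

```shell
# Demo: commit a planning artifact, overwrite it, then restore the folder
# from the known-good commit — the same sequence rule 3 describes.
repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir -p _docs/02_document
echo "v1 architecture" > _docs/02_document/architecture.md
git add -A
git -c user.email=ap@local -c user.name=autopilot commit -qm "plan: v1"
good=$(git rev-parse HEAD)

echo "v2 architecture (unwanted)" > _docs/02_document/architecture.md
git add -A
git -c user.email=ap@local -c user.name=autopilot commit -qm "plan: v2"

git log --oneline -- _docs/                 # locate the commit to restore
git checkout "$good" -- _docs/02_document/  # roll the folder back
cat _docs/02_document/architecture.md       # → v1 architecture
```

Per rule 4, a real rollback would also reset the corresponding step in `_docs/_autopilot_state.md`.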

## Status Summary

On every invocation, before executing any skill, present a status summary built from the state file (with folder scan fallback). Use the Status Summary Template from the active flow file (`flows/greenfield.md` or `flows/existing-code.md`).

For re-entry (state file exists), also include:
- Key decisions from the state file's `Key Decisions` section
- Last session context from the `Last Session` section
- Any blockers from the `Blockers` section

@@ -0,0 +1,122 @@
# Autopilot State Management

## State File: `_docs/_autopilot_state.md`

The autopilot persists its state to `_docs/_autopilot_state.md`. This file is the primary source of truth for re-entry. Folder scanning is the fallback when the state file doesn't exist.

### Format

```markdown
# Autopilot State

## Current Step
flow: [greenfield | existing-code]
step: [1-10 for greenfield, 1-12 for existing-code, or "done"]
name: [step name from the active flow's Step Reference Table]
status: [not_started / in_progress / completed / skipped / failed]
sub_step: [optional — sub-skill internal step number + name if interrupted mid-step]
retry_count: [0-3 — number of consecutive auto-retry attempts for current step, reset to 0 on success]

When updating `Current Step`, always write it as:
flow: existing-code   ← active flow
step: N               ← autopilot step (sequential integer)
sub_step: M           ← sub-skill's own internal step/phase number + name
retry_count: 0        ← reset on new step or success; increment on each failed retry
Example:
flow: greenfield
step: 3
name: Plan
status: in_progress
sub_step: 4 — Architecture Review & Risk Assessment
retry_count: 0
Example (failed after 3 retries):
flow: existing-code
step: 2
name: Test Spec
status: failed
sub_step: 1b — Test Case Generation
retry_count: 3

## Completed Steps

| Step | Name | Completed | Key Outcome |
|------|------|-----------|-------------|
| 1 | [name] | [date] | [one-line summary] |
| 2 | [name] | [date] | [one-line summary] |
| ... | ... | ... | ... |

## Key Decisions
- [decision 1: e.g. "Tech stack: Python + Rust for perf-critical, Postgres DB"]
- [decision N]

## Last Session
date: [date]
ended_at: Step [N] [Name] — SubStep [M] [sub-step name]
reason: [completed step / session boundary / user paused / context limit]
notes: [any context for next session]

## Retry Log
| Attempt | Step | Name | SubStep | Failure Reason | Timestamp |
|---------|------|------|---------|----------------|-----------|
| 1 | [step] | [name] | [sub_step] | [reason] | [date-time] |
| ... | ... | ... | ... | ... | ... |

(Clear this table when the step succeeds or user resets. Append a row on each failed auto-retry.)

## Blockers
- [blocker 1, if any]
- [none]
```

### State File Rules

1. **Create** the state file on the very first autopilot invocation (after state detection determines Step 1)
2. **Update** the state file after every step completion, every session boundary, every BLOCKING gate confirmation, and every failed retry attempt
3. **Read** the state file as the first action on every invocation — before folder scanning
4. **Cross-check**: after reading the state file, verify against actual `_docs/` folder contents. If they disagree (e.g., state file says Step 3 but `_docs/02_document/architecture.md` already exists), trust the folder structure and update the state file to match
5. **Never delete** the state file. It accumulates history across the entire project lifecycle
6. **Retry tracking**: increment `retry_count` on each failed auto-retry; reset to `0` when the step succeeds or the user manually resets. If `retry_count` reaches 3, set `status: failed` and add an entry to `Blockers`
7. **Failed state on re-entry**: if the state file shows `status: failed` with `retry_count: 3`, do NOT auto-retry — present the blocker to the user and wait for their decision before proceeding
|
||||
|
||||
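Rules 6 and 7 can be sketched as a small decision helper — a minimal illustration only; the function name and return strings here are assumptions, not part of the skill contract:

```python
# Sketch of the retry-tracking rules (6 and 7) above. Illustrative only.
MAX_RETRIES = 3

def next_action(status: str, retry_count: int) -> str:
    """Decide what autopilot should do for the current step on re-entry."""
    if status == "failed" and retry_count >= MAX_RETRIES:
        # Rule 7: never auto-retry an exhausted step — escalate to the user.
        return "present_blocker_and_wait"
    if status == "failed":
        # Under the retry cap: retry automatically and append to the Retry Log.
        return "auto_retry"
    return "continue"
```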
## State Detection

Read `_docs/_autopilot_state.md` first. If it exists and is consistent with the folder structure, use the `Current Step` from the state file. If the state file doesn't exist or is inconsistent, fall back to folder scanning.

### Folder Scan Rules (fallback)

Scan `_docs/` to determine the current workflow position. The detection rules are defined in each flow file (`flows/greenfield.md` and `flows/existing-code.md`). Check the existing-code flow first (Step 1 detection), then greenfield flow rules. First match wins.

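As a rough illustration, the fallback scan amounts to a first-match-wins probe of marker paths. The marker-to-step mapping below is hypothetical — the authoritative rules live in the flow files:

```python
from pathlib import Path

# Illustrative first-match-wins scan; the real rules live in flows/*.md.
# Marker paths and step numbers are assumptions for demonstration.
RULES = [
    ("_docs/02_document/architecture.md", 3),  # planning artifacts exist
    ("_docs/01_solution/solution.md", 2),      # solution exists
    ("_docs/00_problem/problem.md", 1),        # only problem statement exists
]

def detect_step(root: str = ".") -> int:
    """Return the detected workflow step; 0 means nothing exists yet."""
    for marker, step in RULES:
        if Path(root, marker).exists():
            return step
    return 0
```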
## Re-Entry Protocol

When the user invokes `/autopilot` and work already exists:

1. Read `_docs/_autopilot_state.md`
2. Cross-check against `_docs/` folder structure
3. Present Status Summary with context from state file (key decisions, last session, blockers)
4. If the detected step has a sub-skill with built-in resumability (plan, decompose, implement, deploy all do), the sub-skill handles mid-step recovery
5. Continue execution from detected state

## Session Boundaries

After any decompose/planning step completes, **do not auto-chain to implement**. Instead:

1. Update state file: mark the step as completed, set current step to the next implement step with status `not_started`
   - Existing-code flow: After Step 3 (Decompose Tests) → set current step to 4 (Implement Tests)
   - Existing-code flow: After Step 7 (New Task) → set current step to 8 (Implement)
   - Greenfield flow: After Step 5 (Decompose) → set current step to 6 (Implement)
2. Write `Last Session` section: `reason: session boundary`, `notes: Decompose complete, implementation ready`
3. Present a summary: number of tasks, estimated batches, total complexity points
4. Use Choose format:

```
══════════════════════════════════════
DECISION REQUIRED: Decompose complete — start implementation?
══════════════════════════════════════
A) Start a new conversation for implementation (recommended for context freshness)
B) Continue implementation in this conversation
══════════════════════════════════════
Recommendation: A — implementation is the longest phase, fresh context helps
══════════════════════════════════════
```

These are the only hard session boundaries. All other transitions auto-chain.

@@ -46,7 +46,7 @@ For each task, verify implementation satisfies every acceptance criterion:

- Walk through each AC (Given/When/Then) and trace it in the code
- Check that unit tests cover each AC
- Check that integration tests exist where specified in the task spec
- Check that blackbox tests exist where specified in the task spec
- Flag any AC that is not demonstrably satisfied as a **Spec-Gap** finding (severity: High)
- Flag any scope creep (implementation beyond what the spec asked for) as a **Scope** finding (severity: Low)

@@ -152,3 +152,42 @@ The `/implement` skill invokes this skill after each batch completes:
2. Passes task spec paths + changed files to this skill
3. If verdict is FAIL — presents findings to user (BLOCKING), user fixes or confirms
4. If verdict is PASS or PASS_WITH_WARNINGS — proceeds automatically (findings shown as info)

## Integration Contract

### Inputs (provided by the implement skill)

| Input | Type | Source | Required |
|-------|------|--------|----------|
| `task_specs` | list of file paths | Task `.md` files from `_docs/02_tasks/` for the current batch | Yes |
| `changed_files` | list of file paths | Files modified by implementer agents (from `git diff` or agent reports) | Yes |
| `batch_number` | integer | Current batch number (for report naming) | Yes |
| `project_restrictions` | file path | `_docs/00_problem/restrictions.md` | If exists |
| `solution_overview` | file path | `_docs/01_solution/solution.md` | If exists |

### Invocation Pattern

The implement skill invokes code-review by:

1. Reading `.cursor/skills/code-review/SKILL.md`
2. Providing the inputs above as context (read the files, pass content to the review phases)
3. Executing all 6 phases sequentially
4. Consuming the verdict from the output

### Outputs (returned to the implement skill)

| Output | Type | Description |
|--------|------|-------------|
| `verdict` | `PASS` / `PASS_WITH_WARNINGS` / `FAIL` | Drives the implement skill's auto-fix gate |
| `findings` | structured list | Each finding has: severity, category, file:line, title, description, suggestion, task reference |
| `critical_count` | integer | Number of Critical findings |
| `high_count` | integer | Number of High findings |
| `report_path` | file path | `_docs/03_implementation/reviews/batch_[NN]_review.md` |

### Report Persistence

Save the review report to `_docs/03_implementation/reviews/batch_[NN]_review.md` (create the `reviews/` directory if it does not exist). The report uses the Output Format defined above.

The implement skill uses `verdict` to decide:
- `PASS` / `PASS_WITH_WARNINGS` → proceed to commit
- `FAIL` → enter auto-fix loop (up to 2 attempts), then escalate to user

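This verdict gate can be sketched as a tiny decision function, assuming the two-attempt fix budget described above (the function and return-string names are illustrative, not part of the contract):

```python
# Sketch of how the implement skill could consume the review verdict.
MAX_FIX_ATTEMPTS = 2

def gate(verdict: str, fix_attempts: int = 0) -> str:
    """Map a review verdict to the implement skill's next move."""
    if verdict in ("PASS", "PASS_WITH_WARNINGS"):
        return "commit"            # findings surfaced as info only
    if verdict == "FAIL" and fix_attempts < MAX_FIX_ATTEMPTS:
        return "auto_fix"          # fix, then re-run the review
    return "escalate_to_user"      # fix budget exhausted -> BLOCKING
```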
@@ -2,12 +2,13 @@
name: decompose
description: |
  Decompose planned components into atomic implementable tasks with bootstrap structure plan.
  4-step workflow: bootstrap structure plan, component task decomposition, integration test task decomposition, and cross-task verification.
  Supports full decomposition (_docs/ structure) and single component mode.
  4-step workflow: bootstrap structure plan, component task decomposition, blackbox test task decomposition, and cross-task verification.
  Supports full decomposition (_docs/ structure), single component mode, and tests-only mode.
  Trigger phrases:
  - "decompose", "decompose features", "feature decomposition"
  - "task decomposition", "break down components"
  - "prepare for implementation"
  - "decompose tests", "test decomposition"
category: build
tags: [decomposition, tasks, dependencies, jira, implementation-prep]
disable-model-invocation: true
@@ -32,18 +33,26 @@ Decompose planned components into atomic, implementable task specs with a bootst
Determine the operating mode based on invocation before any other logic runs.

**Default** (no explicit input file provided):
- PLANS_DIR: `_docs/02_plans/`
- DOCUMENT_DIR: `_docs/02_document/`
- TASKS_DIR: `_docs/02_tasks/`
- Reads from: `_docs/00_problem/`, `_docs/01_solution/`, PLANS_DIR
- Runs Step 1 (bootstrap) + Step 2 (all components) + Step 3 (integration tests) + Step 4 (cross-verification)
- Reads from: `_docs/00_problem/`, `_docs/01_solution/`, DOCUMENT_DIR
- Runs Step 1 (bootstrap) + Step 2 (all components) + Step 3 (blackbox tests) + Step 4 (cross-verification)

**Single component mode** (provided file is within `_docs/02_plans/` and inside a `components/` subdirectory):
- PLANS_DIR: `_docs/02_plans/`
**Single component mode** (provided file is within `_docs/02_document/` and inside a `components/` subdirectory):
- DOCUMENT_DIR: `_docs/02_document/`
- TASKS_DIR: `_docs/02_tasks/`
- Derive component number and component name from the file path
- Ask user for the parent Epic ID
- Runs Step 2 (that component only, appending to existing task numbering)

**Tests-only mode** (provided file/directory is within `tests/`, or `DOCUMENT_DIR/tests/` exists and input explicitly requests test decomposition):
- DOCUMENT_DIR: `_docs/02_document/`
- TASKS_DIR: `_docs/02_tasks/`
- TESTS_DIR: `DOCUMENT_DIR/tests/`
- Reads from: `_docs/00_problem/`, `_docs/01_solution/`, TESTS_DIR
- Runs Step 1t (test infrastructure bootstrap) + Step 3 (blackbox test decomposition) + Step 4 (cross-verification against test coverage)
- Skips Step 1 (project bootstrap) and Step 2 (component decomposition) — the codebase already exists

Announce the detected mode and resolved paths to the user before proceeding.

## Input Specification

|
||||
| `_docs/00_problem/restrictions.md` | Constraints and limitations |
|
||||
| `_docs/00_problem/acceptance_criteria.md` | Measurable acceptance criteria |
|
||||
| `_docs/01_solution/solution.md` | Finalized solution |
|
||||
| `PLANS_DIR/architecture.md` | Architecture from plan skill |
|
||||
| `PLANS_DIR/system-flows.md` | System flows from plan skill |
|
||||
| `PLANS_DIR/components/[##]_[name]/description.md` | Component specs from plan skill |
|
||||
| `PLANS_DIR/integration_tests/` | Integration test specs from plan skill |
|
||||
| `DOCUMENT_DIR/architecture.md` | Architecture from plan skill |
|
||||
| `DOCUMENT_DIR/system-flows.md` | System flows from plan skill |
|
||||
| `DOCUMENT_DIR/components/[##]_[name]/description.md` | Component specs from plan skill |
|
||||
| `DOCUMENT_DIR/tests/` | Blackbox test specs from plan skill |
|
||||
|
||||
**Single component mode:**
|
||||
|
||||
@@ -70,16 +79,38 @@ Announce the detected mode and resolved paths to the user before proceeding.
|
||||
| The provided component `description.md` | Component spec to decompose |
|
||||
| Corresponding `tests.md` in the same directory (if available) | Test specs for context |
|
||||
|
||||
**Tests-only mode:**
|
||||
|
||||
| File | Purpose |
|
||||
|------|---------|
|
||||
| `TESTS_DIR/environment.md` | Test environment specification (Docker services, networks, volumes) |
|
||||
| `TESTS_DIR/test-data.md` | Test data management (seed data, mocks, isolation) |
|
||||
| `TESTS_DIR/blackbox-tests.md` | Blackbox functional scenarios (positive + negative) |
|
||||
| `TESTS_DIR/performance-tests.md` | Performance test scenarios |
|
||||
| `TESTS_DIR/resilience-tests.md` | Resilience test scenarios |
|
||||
| `TESTS_DIR/security-tests.md` | Security test scenarios |
|
||||
| `TESTS_DIR/resource-limit-tests.md` | Resource limit test scenarios |
|
||||
| `TESTS_DIR/traceability-matrix.md` | AC/restriction coverage mapping |
|
||||
| `_docs/00_problem/problem.md` | Problem context |
|
||||
| `_docs/00_problem/restrictions.md` | Constraints for test design |
|
||||
| `_docs/00_problem/acceptance_criteria.md` | Acceptance criteria being verified |
|
||||
|
||||
### Prerequisite Checks (BLOCKING)
|
||||
|
||||
**Default:**
|
||||
1. PLANS_DIR contains `architecture.md` and `components/` — **STOP if missing**
|
||||
1. DOCUMENT_DIR contains `architecture.md` and `components/` — **STOP if missing**
|
||||
2. Create TASKS_DIR if it does not exist
|
||||
3. If TASKS_DIR already contains task files, ask user: **resume from last checkpoint or start fresh?**
|
||||
|
||||
**Single component mode:**
|
||||
1. The provided component file exists and is non-empty — **STOP if missing**
|
||||
|
||||
**Tests-only mode:**
|
||||
1. `TESTS_DIR/blackbox-tests.md` exists and is non-empty — **STOP if missing**
|
||||
2. `TESTS_DIR/environment.md` exists — **STOP if missing**
|
||||
3. Create TASKS_DIR if it does not exist
|
||||
4. If TASKS_DIR already contains task files, ask user: **resume from last checkpoint or start fresh?**
|
||||
|
||||
## Artifact Management
|
||||
|
||||
### Directory Structure
|
||||
@@ -100,8 +131,9 @@ TASKS_DIR/
| Step | Save immediately after | Filename |
|------|------------------------|----------|
| Step 1 | Bootstrap structure plan complete + Jira ticket created + file renamed | `[JIRA-ID]_initial_structure.md` |
| Step 1t | Test infrastructure bootstrap complete + Jira ticket created + file renamed | `[JIRA-ID]_test_infrastructure.md` |
| Step 2 | Each component task decomposed + Jira ticket created + file renamed | `[JIRA-ID]_[short_name].md` |
| Step 3 | Each integration test task decomposed + Jira ticket created + file renamed | `[JIRA-ID]_[short_name].md` |
| Step 3 | Each blackbox test task decomposed + Jira ticket created + file renamed | `[JIRA-ID]_[short_name].md` |
| Step 4 | Cross-task verification complete | `_dependencies_table.md` |

### Resumability

@@ -118,13 +150,49 @@ At the start of execution, create a TodoWrite with all applicable steps. Update

## Workflow

### Step 1t: Test Infrastructure Bootstrap (tests-only mode only)

**Role**: Professional Quality Assurance Engineer
**Goal**: Produce `01_test_infrastructure.md` — the first task describing the test project scaffold
**Constraints**: This is a plan document, not code. The `/implement` skill executes it.

1. Read `TESTS_DIR/environment.md` and `TESTS_DIR/test-data.md`
2. Read problem.md, restrictions.md, acceptance_criteria.md for domain context
3. Document the test infrastructure plan using `templates/test-infrastructure-task.md`

The test infrastructure bootstrap must include:
- Test project folder layout (`e2e/` directory structure)
- Mock/stub service definitions for each external dependency
- `docker-compose.test.yml` structure from environment.md
- Test runner configuration (framework, plugins, fixtures)
- Test data fixture setup from test-data.md seed data sets
- Test reporting configuration (format, output path)
- Data isolation strategy

**Self-verification**:
- [ ] Every external dependency from environment.md has a mock service defined
- [ ] Docker Compose structure covers all services from environment.md
- [ ] Test data fixtures cover all seed data sets from test-data.md
- [ ] Test runner configuration matches the consumer app tech stack from environment.md
- [ ] Data isolation strategy is defined

**Save action**: Write `01_test_infrastructure.md` (temporary numeric name)

**Jira action**: Create a Jira ticket for this task under the "Blackbox Tests" epic. Write the Jira ticket ID and Epic ID back into the task header.

**Rename action**: Rename the file from `01_test_infrastructure.md` to `[JIRA-ID]_test_infrastructure.md`. Update the **Task** field inside the file to match the new filename.

**BLOCKING**: Present test infrastructure plan summary to user. Do NOT proceed until user confirms.

---

### Step 1: Bootstrap Structure Plan (default mode only)

**Role**: Professional software architect
**Goal**: Produce `01_initial_structure.md` — the first task describing the project skeleton
**Constraints**: This is a plan document, not code. The `/implement` skill executes it.

1. Read architecture.md, all component specs, system-flows.md, data_model.md, and `deployment/` from PLANS_DIR
1. Read architecture.md, all component specs, system-flows.md, data_model.md, and `deployment/` from DOCUMENT_DIR
2. Read problem, solution, and restrictions from `_docs/00_problem/` and `_docs/01_solution/`
3. Research best implementation patterns for the identified tech stack
4. Document the structure plan using `templates/initial-structure-task.md`
@@ -134,27 +202,27 @@ The bootstrap structure plan must include:
- Shared models, interfaces, and DTOs
- Dockerfile per component (multi-stage, non-root, health checks, pinned base images)
- `docker-compose.yml` for local development (all components + database + dependencies)
- `docker-compose.test.yml` for integration test environment (black-box test runner)
- `docker-compose.test.yml` for blackbox test environment (blackbox test runner)
- `.dockerignore`
- CI/CD pipeline file (`.github/workflows/ci.yml` or `azure-pipelines.yml`) with stages from `deployment/ci_cd_pipeline.md`
- Database migration setup and initial seed data scripts
- Observability configuration: structured logging setup, health check endpoints (`/health/live`, `/health/ready`), metrics endpoint (`/metrics`)
- Environment variable documentation (`.env.example`)
- Test structure with unit and integration test locations
- Test structure with unit and blackbox test locations

**Self-verification**:
- [ ] All components have corresponding folders in the layout
- [ ] All inter-component interfaces have DTOs defined
- [ ] Dockerfile defined for each component
- [ ] `docker-compose.yml` covers all components and dependencies
- [ ] `docker-compose.test.yml` enables black-box integration testing
- [ ] `docker-compose.test.yml` enables blackbox testing
- [ ] CI/CD pipeline file defined with lint, test, security, build, deploy stages
- [ ] Database migration setup included
- [ ] Health check endpoints specified for each service
- [ ] Structured logging configuration included
- [ ] `.env.example` with all required environment variables
- [ ] Environment strategy covers dev, staging, production
- [ ] Test structure includes unit and integration test locations
- [ ] Test structure includes unit and blackbox test locations

**Save action**: Write `01_initial_structure.md` (temporary numeric name)

@@ -166,7 +234,7 @@ The bootstrap structure plan must include:

---

### Step 2: Task Decomposition (all modes)
### Step 2: Task Decomposition (default and single component modes)

**Role**: Professional software architect
**Goal**: Decompose each component into atomic, implementable task specs — numbered sequentially starting from 02
@@ -200,52 +268,66 @@ For each component (or the single provided component):

---

### Step 3: Integration Test Task Decomposition (default mode only)
### Step 3: Blackbox Test Task Decomposition (default and tests-only modes)

**Role**: Professional Quality Assurance Engineer
**Goal**: Decompose integration test specs into atomic, implementable task specs
**Goal**: Decompose blackbox test specs into atomic, implementable task specs
**Constraints**: Behavioral specs only — describe what, not how. No test code.

**Numbering**: Continue sequential numbering from where Step 2 left off.
**Numbering**:
- In default mode: continue sequential numbering from where Step 2 left off.
- In tests-only mode: start from 02 (01 is the test infrastructure bootstrap from Step 1t).

1. Read all test specs from `PLANS_DIR/integration_tests/` (functional_tests.md, non_functional_tests.md)
1. Read all test specs from `DOCUMENT_DIR/tests/` (`blackbox-tests.md`, `performance-tests.md`, `resilience-tests.md`, `security-tests.md`, `resource-limit-tests.md`)
2. Group related test scenarios into atomic tasks (e.g., one task per test category or per component under test)
3. Each task should reference the specific test scenarios it implements and the environment/test_data specs
4. Dependencies: integration test tasks depend on the component implementation tasks they exercise
3. Each task should reference the specific test scenarios it implements and the environment/test-data specs
4. Dependencies:
   - In default mode: blackbox test tasks depend on the component implementation tasks they exercise
   - In tests-only mode: blackbox test tasks depend on the test infrastructure bootstrap task (Step 1t)
5. Write each task spec using `templates/task.md`
6. Estimate complexity per task (1, 2, 3, 5 points); no task should exceed 5 points — split if it does
7. Note task dependencies (referencing Jira IDs of already-created dependency tasks)
8. **Immediately after writing each task file**: create a Jira ticket under the "Integration Tests" epic, write the Jira ticket ID and Epic ID back into the task header, then rename the file from `[##]_[short_name].md` to `[JIRA-ID]_[short_name].md`.
8. **Immediately after writing each task file**: create a Jira ticket under the "Blackbox Tests" epic, write the Jira ticket ID and Epic ID back into the task header, then rename the file from `[##]_[short_name].md` to `[JIRA-ID]_[short_name].md`.

**Self-verification**:
- [ ] Every functional test scenario from `integration_tests/functional_tests.md` is covered by a task
- [ ] Every non-functional test scenario from `integration_tests/non_functional_tests.md` is covered by a task
- [ ] Every scenario from `tests/blackbox-tests.md` is covered by a task
- [ ] Every scenario from `tests/performance-tests.md`, `tests/resilience-tests.md`, `tests/security-tests.md`, and `tests/resource-limit-tests.md` is covered by a task
- [ ] No task exceeds 5 complexity points
- [ ] Dependencies correctly reference the component tasks being tested
- [ ] Every task has a Jira ticket linked to the "Integration Tests" epic
- [ ] Dependencies correctly reference the dependency tasks (component tasks in default mode, test infrastructure in tests-only mode)
- [ ] Every task has a Jira ticket linked to the "Blackbox Tests" epic

**Save action**: Write each `[##]_[short_name].md` (temporary numeric name), create Jira ticket inline, then rename to `[JIRA-ID]_[short_name].md`.

---

### Step 4: Cross-Task Verification (default mode only)
### Step 4: Cross-Task Verification (default and tests-only modes)

**Role**: Professional software architect and analyst
**Goal**: Verify task consistency and produce `_dependencies_table.md`
**Constraints**: Review step — fix gaps found, do not add new tasks

1. Verify task dependencies across all tasks are consistent
2. Check no gaps: every interface in architecture.md has tasks covering it
3. Check no overlaps: tasks don't duplicate work across components
2. Check no gaps:
   - In default mode: every interface in architecture.md has tasks covering it
   - In tests-only mode: every test scenario in `traceability-matrix.md` is covered by a task
3. Check no overlaps: tasks don't duplicate work
4. Check no circular dependencies in the task graph
5. Produce `_dependencies_table.md` using `templates/dependencies-table.md`
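The circular-dependency check in item 4 amounts to cycle detection over the task graph. A minimal sketch (illustrative only — task IDs and the dict shape are assumptions, not the skill's data model):

```python
# Minimal cycle check for the task dependency graph.
# `deps` maps each task ID to the IDs it depends on.
def has_cycle(deps: dict[str, list[str]]) -> bool:
    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited / on stack / done
    color = {t: WHITE for t in deps}

    def visit(t: str) -> bool:
        color[t] = GRAY
        for d in deps.get(t, []):
            if color.get(d, WHITE) == GRAY:
                return True            # back edge -> circular dependency
            if color.get(d, WHITE) == WHITE and d in deps and visit(d):
                return True
        color[t] = BLACK
        return False

    return any(visit(t) for t in deps if color[t] == WHITE)
```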

**Self-verification**:

Default mode:
- [ ] Every architecture interface is covered by at least one task
- [ ] No circular dependencies in the task graph
- [ ] Cross-component dependencies are explicitly noted in affected task specs
- [ ] `_dependencies_table.md` contains every task with correct dependencies

Tests-only mode:
- [ ] Every test scenario from traceability-matrix.md "Covered" entries has a corresponding task
- [ ] No circular dependencies in the task graph
- [ ] Test task dependencies reference the test infrastructure bootstrap
- [ ] `_dependencies_table.md` contains every task with correct dependencies

**Save action**: Write `_dependencies_table.md`

**BLOCKING**: Present dependency summary to user. Do NOT proceed until user confirms.
@@ -270,7 +352,7 @@ For each component (or the single provided component):
|-----------|--------|
| Ambiguous component boundaries | ASK user |
| Task complexity exceeds 5 points after splitting | ASK user |
| Missing component specs in PLANS_DIR | ASK user |
| Missing component specs in DOCUMENT_DIR | ASK user |
| Cross-component dependency conflict | ASK user |
| Jira epic not found for a component | ASK user for Epic ID |
| Task naming | PROCEED, confirm at next BLOCKING gate |
@@ -279,15 +361,27 @@ For each component (or the single provided component):

```
┌────────────────────────────────────────────────────────────────┐
│ Task Decomposition (4-Step Method)                             │
│ Task Decomposition (Multi-Mode)                                │
├────────────────────────────────────────────────────────────────┤
│ CONTEXT: Resolve mode (default / single component)             │
│ 1. Bootstrap Structure → [JIRA-ID]_initial_structure.md        │
│    [BLOCKING: user confirms structure]                         │
│ 2. Component Tasks → [JIRA-ID]_[short_name].md each            │
│ 3. Integration Tests → [JIRA-ID]_[short_name].md each          │
│ 4. Cross-Verification → _dependencies_table.md                 │
│    [BLOCKING: user confirms dependencies]                      │
│ CONTEXT: Resolve mode (default / single component / tests-only)│
│                                                                │
│ DEFAULT MODE:                                                  │
│ 1. Bootstrap Structure → [JIRA-ID]_initial_structure.md        │
│    [BLOCKING: user confirms structure]                         │
│ 2. Component Tasks → [JIRA-ID]_[short_name].md each            │
│ 3. Blackbox Tests → [JIRA-ID]_[short_name].md each             │
│ 4. Cross-Verification → _dependencies_table.md                 │
│    [BLOCKING: user confirms dependencies]                      │
│                                                                │
│ TESTS-ONLY MODE:                                               │
│ 1t. Test Infrastructure → [JIRA-ID]_test_infrastructure.md     │
│    [BLOCKING: user confirms test scaffold]                     │
│ 3. Blackbox Tests → [JIRA-ID]_[short_name].md each             │
│ 4. Cross-Verification → _dependencies_table.md                 │
│    [BLOCKING: user confirms dependencies]                      │
│                                                                │
│ SINGLE COMPONENT MODE:                                         │
│ 2. Component Tasks → [JIRA-ID]_[short_name].md each            │
├────────────────────────────────────────────────────────────────┤
│ Principles: Atomic tasks · Behavioral specs · Flat structure   │
│   Jira inline · Rename to Jira ID · Save now · Ask don't assume│
@@ -49,7 +49,7 @@ project-root/
| Build | Compile/bundle the application | Every push |
| Lint / Static Analysis | Code quality and style checks | Every push |
| Unit Tests | Run unit test suite | Every push |
| Integration Tests | Run integration test suite | Every push |
| Blackbox Tests | Run blackbox test suite | Every push |
| Security Scan | SAST / dependency check | Every push |
| Deploy to Staging | Deploy to staging environment | Merge to staging branch |

@@ -64,7 +64,7 @@ Then [expected result]
|--------|-------------|-----------------|
| AC-1 | [test subject] | [expected result] |

## Integration Tests
## Blackbox Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|-------------|-------------------|----------------|

@@ -0,0 +1,129 @@
|
||||
# Test Infrastructure Task Template
|
||||
|
||||
Use this template for the test infrastructure bootstrap (Step 1t in tests-only mode). Save as `TASKS_DIR/01_test_infrastructure.md` initially, then rename to `TASKS_DIR/[JIRA-ID]_test_infrastructure.md` after Jira ticket creation.
|
||||
|
||||
---
|
||||
|
||||
```markdown
|
||||
# Test Infrastructure
|
||||
|
||||
**Task**: [JIRA-ID]_test_infrastructure
|
||||
**Name**: Test Infrastructure
|
||||
**Description**: Scaffold the Blackbox test project — test runner, mock services, Docker test environment, test data fixtures, reporting
|
||||
**Complexity**: [3|5] points
|
||||
**Dependencies**: None
|
||||
**Component**: Blackbox Tests
|
||||
**Jira**: [TASK-ID]
|
||||
**Epic**: [EPIC-ID]
|
||||
|
||||
## Test Project Folder Layout
|
||||
|
||||
```
|
||||
e2e/
|
||||
├── conftest.py
|
||||
├── requirements.txt
|
||||
├── Dockerfile
|
||||
├── mocks/
|
||||
│ ├── [mock_service_1]/
|
||||
│ │ ├── Dockerfile
|
||||
│ │ └── [entrypoint file]
|
||||
│ └── [mock_service_2]/
|
||||
│ ├── Dockerfile
|
||||
│ └── [entrypoint file]
|
||||
├── fixtures/
|
||||
│ └── [test data files]
|
||||
├── tests/
|
||||
│ ├── test_[category_1].py
|
||||
│ ├── test_[category_2].py
|
||||
│ └── ...
|
||||
└── docker-compose.test.yml
|
||||
```
|
||||
|
||||
### Layout Rationale
|
||||
|
||||
[Brief explanation of directory structure choices — framework conventions, separation of mocks from tests, fixture management]
|
||||
|
||||
## Mock Services
|
||||
|
||||
| Mock Service | Replaces | Endpoints | Behavior |
|
||||
|-------------|----------|-----------|----------|
|
||||
| [name] | [external service] | [endpoints it serves] | [response behavior, configurable via control API] |
|
||||
|
||||
### Mock Control API
|
||||
|
||||
Each mock service exposes a `POST /mock/config` endpoint for test-time behavior control (e.g., simulate downtime, inject errors). A `GET /mock/[resource]` endpoint returns recorded interactions for assertion.
|
||||
|
||||
## Docker Test Environment
|
||||
|
||||
### docker-compose.test.yml Structure
|
||||
|
||||
| Service | Image / Build | Purpose | Depends On |
|---------|--------------|---------|------------|
| [system-under-test] | [build context] | Main system being tested | [mock services] |
| [mock-1] | [build context] | Mock for [external service] | — |
| [e2e-consumer] | [build from e2e/] | Test runner | [system-under-test] |

### Networks and Volumes

[Isolated test network, volume mounts for test data, model files, results output]

## Test Runner Configuration

**Framework**: [e.g., pytest]
**Plugins**: [e.g., pytest-csv, sseclient-py, requests]
**Entry point**: [e.g., pytest --csv=/results/report.csv]

### Fixture Strategy

| Fixture | Scope | Purpose |
|---------|-------|---------|
| [name] | [session/module/function] | [what it provides] |

## Test Data Fixtures

| Data Set | Source | Format | Used By |
|----------|--------|--------|---------|
| [name] | [volume mount / generated / API seed] | [format] | [test categories] |

### Data Isolation

[Strategy: fresh containers per run, volume cleanup, mock state reset]

## Test Reporting

**Format**: [e.g., CSV]
**Columns**: [e.g., Test ID, Test Name, Execution Time (ms), Result, Error Message]
**Output path**: [e.g., /results/report.csv → mounted to host]

## Acceptance Criteria

**AC-1: Test environment starts**
Given the docker-compose.test.yml
When `docker compose -f docker-compose.test.yml up` is executed
Then all services start and the system-under-test is reachable

**AC-2: Mock services respond**
Given the test environment is running
When the e2e-consumer sends requests to mock services
Then mock services respond with configured behavior

**AC-3: Test runner executes**
Given the test environment is running
When the e2e-consumer starts
Then the test runner discovers and executes test files

**AC-4: Test report generated**
Given tests have been executed
When the test run completes
Then a report file exists at the configured output path with correct columns
```

---

## Guidance Notes

- This is a PLAN document, not code. The `/implement` skill executes it.
- Focus on test infrastructure decisions, not individual test implementations.
- Reference environment.md and test-data.md from the test specs — don't repeat everything.
- Mock services must be deterministic: same input always produces same output.
- The Docker environment must be self-contained: `docker compose up` sufficient.
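
As a concrete illustration of the fixture strategy above, here is a minimal pytest sketch. The port, endpoint path, and fixture name are hypothetical placeholders, not part of the template:

```python
import time
import urllib.error
import urllib.request

import pytest


def wait_for_healthy(url: str, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass
        time.sleep(interval)
    return False


@pytest.fixture(scope="session")
def system_url():
    """Session-scoped base URL of the system under test (hypothetical port 8080)."""
    base = "http://localhost:8080"
    # Fails fast if docker-compose.test.yml did not bring the system up.
    assert wait_for_healthy(f"{base}/health/live"), "system-under-test never became healthy"
    return base
```

A session-scoped fixture like this keeps the startup wait to one per test run, which fits the self-contained `docker compose up` requirement.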

@@ -20,7 +20,7 @@ Plan and document the full deployment lifecycle: check deployment status and env

## Core Principles

- **Docker-first**: every component runs in a container; local dev, blackbox tests, and production all use Docker
- **Infrastructure as code**: all deployment configuration is version-controlled
- **Observability built-in**: logging, metrics, and tracing are part of the deployment plan, not afterthoughts
- **Environment parity**: dev, staging, and production environments mirror each other as closely as possible
@@ -32,12 +32,12 @@ Plan and document the full deployment lifecycle: check deployment status and env

Fixed paths:

- PLANS_DIR: `_docs/02_plans/`
- DOCUMENT_DIR: `_docs/02_document/`
- DEPLOY_DIR: `_docs/04_deploy/`
- REPORTS_DIR: `_docs/04_deploy/reports/`
- SCRIPTS_DIR: `scripts/`
- ARCHITECTURE: `_docs/02_document/architecture.md`
- COMPONENTS_DIR: `_docs/02_document/components/`

Announce the resolved paths to the user before proceeding.
@@ -45,18 +45,18 @@ Announce the resolved paths to the user before proceeding.

### Required Files

| File | Purpose | Required |
|------|---------|----------|
| `_docs/00_problem/problem.md` | Problem description and context | Greenfield only |
| `_docs/00_problem/restrictions.md` | Constraints and limitations | Greenfield only |
| `_docs/01_solution/solution.md` | Finalized solution | Greenfield only |
| `DOCUMENT_DIR/architecture.md` | Architecture (from plan or document skill) | Always |
| `DOCUMENT_DIR/components/` | Component specs | Always |

### Prerequisite Checks (BLOCKING)

1. `architecture.md` exists — **STOP if missing**, run `/plan` first
2. At least one component spec exists in `DOCUMENT_DIR/components/` — **STOP if missing**
3. Create DEPLOY_DIR, REPORTS_DIR, and SCRIPTS_DIR if they do not exist
4. If DEPLOY_DIR already contains artifacts, ask user: **resume from last checkpoint or start fresh?**
@@ -157,7 +157,7 @@ At the start of execution, create a TodoWrite with all steps (1 through 7). Upda

### Step 2: Containerization

**Role**: DevOps / Platform engineer
**Goal**: Define Docker configuration for every component, local development, and blackbox test environments
**Constraints**: Plan only — no Dockerfile creation. Describe what each Dockerfile should contain.

1. Read architecture.md and all component specs
@@ -176,7 +176,7 @@ At the start of execution, create a TodoWrite with all steps (1 through 7). Upda

   - Any message queues, caches, or external service mocks
   - Shared network
   - Environment variable files (`.env`)
6. Define `docker-compose.test.yml` for blackbox tests:
   - Application components under test
   - Test runner container (black-box, no internal imports)
   - Isolated database with seed data

@@ -189,7 +189,7 @@ At the start of execution, create a TodoWrite with all steps (1 through 7). Upda

- [ ] Non-root user for all containers
- [ ] Health checks defined for every service
- [ ] docker-compose.yml covers all components + dependencies
- [ ] docker-compose.test.yml enables black-box testing
- [ ] `.dockerignore` defined

**Save action**: Write `containerization.md` using `templates/containerization.md`
@@ -212,7 +212,7 @@ At the start of execution, create a TodoWrite with all steps (1 through 7). Upda

| Stage | Trigger | Steps | Quality Gate |
|-------|---------|-------|-------------|
| **Lint** | Every push | Run linters per language (black, rustfmt, prettier, dotnet format) | Zero errors |
| **Test** | Every push | Unit tests, blackbox tests, coverage report | 75%+ coverage (see `.cursor/rules/cursor-meta.mdc` Quality Thresholds) |
| **Security** | Every push | Dependency audit, SAST scan (Semgrep/SonarQube), image scan (Trivy) | Zero critical/high CVEs |
| **Build** | PR merge to dev | Build Docker images, tag with git SHA | Build succeeds |
| **Push** | After build | Push to container registry | Push succeeds |
@@ -458,7 +458,7 @@ At the start of execution, create a TodoWrite with all steps (1 through 7). Upda

- **Implementing during planning**: Steps 1–6 produce documents, not code (Step 7 is the exception — it creates scripts)
- **Hardcoding secrets**: never include real credentials in deployment documents or scripts
- **Ignoring blackbox test containerization**: the test environment must be containerized alongside the app
- **Skipping BLOCKING gates**: never proceed past a BLOCKING marker without user confirmation
- **Using `:latest` tags**: always pin base image versions
- **Forgetting observability**: logging, metrics, and tracing are deployment concerns, not post-deployment additions
@@ -28,7 +28,7 @@ Save as `_docs/04_deploy/ci_cd_pipeline.md`.

### Test
- Unit tests: [framework and command]
- Blackbox tests: [framework and command, uses docker-compose.test.yml]
- Coverage threshold: 75% overall, 90% critical paths
- Coverage report published as pipeline artifact
@@ -54,7 +54,7 @@ Save as `_docs/04_deploy/ci_cd_pipeline.md`.

- Automated rollback on health check failure

### Smoke Tests
- Subset of blackbox tests targeting staging environment
- Validates critical user flows
- Timeout: [maximum duration]
@@ -48,7 +48,7 @@ networks:

[shared network]
```

## Docker Compose — Blackbox Tests

```yaml
# docker-compose.test.yml structure
@@ -0,0 +1,114 @@
# Deployment Scripts Documentation Template

Save as `_docs/04_deploy/deploy_scripts.md`.

---

```markdown
# [System Name] — Deployment Scripts

## Overview

| Script | Purpose | Location |
|--------|---------|----------|
| `deploy.sh` | Main deployment orchestrator | `scripts/deploy.sh` |
| `pull-images.sh` | Pull Docker images from registry | `scripts/pull-images.sh` |
| `start-services.sh` | Start all services | `scripts/start-services.sh` |
| `stop-services.sh` | Graceful shutdown | `scripts/stop-services.sh` |
| `health-check.sh` | Verify deployment health | `scripts/health-check.sh` |

## Prerequisites

- Docker and Docker Compose installed on target machine
- SSH access to target machine (configured via `DEPLOY_HOST`)
- Container registry credentials configured
- `.env` file with required environment variables (see `.env.example`)

## Environment Variables

All scripts source `.env` from the project root or accept variables from the environment.

| Variable | Required By | Purpose |
|----------|------------|---------|
| `DEPLOY_HOST` | All (remote mode) | SSH target for remote deployment |
| `REGISTRY_URL` | `pull-images.sh` | Container registry URL |
| `REGISTRY_USER` | `pull-images.sh` | Registry authentication |
| `REGISTRY_PASS` | `pull-images.sh` | Registry authentication |
| `IMAGE_TAG` | `pull-images.sh`, `start-services.sh` | Image version to deploy (default: latest git SHA) |
| [add project-specific variables] | | |

## Script Details

### deploy.sh

Main orchestrator that runs the full deployment flow.

**Usage**:
- `./scripts/deploy.sh` — Deploy latest version
- `./scripts/deploy.sh --rollback` — Rollback to previous version
- `./scripts/deploy.sh --help` — Show usage

**Flow**:
1. Validate required environment variables
2. Call `pull-images.sh`
3. Call `stop-services.sh`
4. Call `start-services.sh`
5. Call `health-check.sh`
6. Report success or failure

**Rollback**: When `--rollback` is passed, reads the previous image tags saved by `stop-services.sh` and redeploys those versions.

### pull-images.sh

**Usage**: `./scripts/pull-images.sh [--help]`

**Steps**:
1. Authenticate with container registry (`REGISTRY_URL`)
2. Pull all required images with specified `IMAGE_TAG`
3. Verify image integrity via digest check
4. Report pull results per image

### start-services.sh

**Usage**: `./scripts/start-services.sh [--help]`

**Steps**:
1. Run `docker compose up -d` with the correct env file
2. Configure networks and volumes
3. Wait for all containers to report healthy state
4. Report startup status per service

### stop-services.sh

**Usage**: `./scripts/stop-services.sh [--help]`

**Steps**:
1. Save current image tags to `previous_tags.env` (for rollback)
2. Stop services with graceful shutdown period (30s)
3. Clean up orphaned containers and networks

### health-check.sh

**Usage**: `./scripts/health-check.sh [--help]`

**Checks**:

| Service | Endpoint | Expected |
|---------|----------|----------|
| [Component 1] | `http://localhost:[port]/health/live` | HTTP 200 |
| [Component 2] | `http://localhost:[port]/health/ready` | HTTP 200 |
| [add all services] | | |

**Exit codes**:
- `0` — All services healthy
- `1` — One or more services unhealthy

## Common Script Properties

All scripts:
- Use `#!/bin/bash` with `set -euo pipefail`
- Support `--help` flag for usage information
- Source `.env` from project root if present
- Are idempotent where possible
- Support remote execution via SSH when `DEPLOY_HOST` is set
```
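
The health-check contract in the template (per-service endpoint probes, exit code `0`/`1`) can be sketched in Python, even though the script itself is bash; the service map here is a hypothetical placeholder:

```python
import urllib.error
import urllib.request

# Hypothetical service map; real entries come from the Checks table in the template.
CHECKS = {
    "component-1": "http://localhost:8001/health/live",
    "component-2": "http://localhost:8002/health/ready",
}


def is_healthy(url: str) -> bool:
    """Return True when the endpoint answers HTTP 200 within 5 seconds."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def main(checks: dict) -> int:
    """Check every service; return 0 when all are healthy, 1 otherwise."""
    failures = [name for name, url in checks.items() if not is_healthy(url)]
    for name in failures:
        print(f"UNHEALTHY: {name}")
    return 1 if failures else 0
```

The return value of `main` maps directly onto the script's exit codes, so CI can gate on it with a plain `./scripts/health-check.sh && deploy-next-step`.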

@@ -0,0 +1,515 @@
---
name: document
description: |
  Bottom-up codebase documentation skill. Analyzes existing code from modules up through components
  to architecture, then retrospectively derives problem/restrictions/acceptance criteria.
  Produces the same _docs/ artifacts as the problem, research, and plan skills, but from code
  analysis instead of user interview.
  Trigger phrases:
  - "document", "document codebase", "document this project"
  - "documentation", "generate documentation", "create documentation"
  - "reverse-engineer docs", "code to docs"
  - "analyze and document"
category: build
tags: [documentation, code-analysis, reverse-engineering, architecture, bottom-up]
disable-model-invocation: true
---

# Bottom-Up Codebase Documentation

Analyze an existing codebase from the bottom up — individual modules first, then components, then system-level architecture — and produce the same `_docs/` artifacts that the `problem` and `plan` skills generate, without requiring user interview.

## Core Principles

- **Bottom-up always**: module docs -> component specs -> architecture/flows -> solution -> problem extraction. Every higher level is synthesized from the level below.
- **Dependencies first**: process modules in topological order (leaves first). When documenting module X, all of X's dependencies already have docs.
- **Incremental context**: each module's doc uses already-written dependency docs as context — no ever-growing chain.
- **Verify against code**: cross-reference every entity in generated docs against the actual codebase. Catch hallucinations.
- **Save immediately**: write each artifact as soon as its step completes. Enable resume from any checkpoint.
- **Ask, don't assume**: when code intent is ambiguous, ASK the user before proceeding.

## Context Resolution

Fixed paths:

- DOCUMENT_DIR: `_docs/02_document/`
- SOLUTION_DIR: `_docs/01_solution/`
- PROBLEM_DIR: `_docs/00_problem/`

Optional input:

- FOCUS_DIR: a specific directory subtree provided by the user (e.g., `/document @src/api/`). When set, only this subtree and its transitive dependencies are analyzed.

Announce resolved paths (and FOCUS_DIR if set) to the user before proceeding.

## Mode Detection

Determine the execution mode before any other logic:

| Mode | Trigger | Scope |
|------|---------|-------|
| **Full** | No input file, no existing state | Entire codebase |
| **Focus Area** | User provides a directory path (e.g., `@src/api/`) | Only the specified subtree + transitive dependencies |
| **Resume** | `state.json` exists in DOCUMENT_DIR | Continue from last checkpoint |

Focus Area mode produces module + component docs for the targeted area only. It can be run repeatedly for different areas — each run appends to the existing module and component docs without overwriting other areas.

## Prerequisite Checks

1. If `_docs/` already exists and contains files AND mode is **Full**, ASK user: **overwrite, merge, or write to `_docs_generated/` instead?**
2. Create DOCUMENT_DIR, SOLUTION_DIR, and PROBLEM_DIR if they don't exist
3. If DOCUMENT_DIR contains a `state.json`, offer to **resume from last checkpoint or start fresh**
4. If FOCUS_DIR is set, verify the directory exists and contains source files — **STOP if missing**

## Progress Tracking

Create a TodoWrite with all steps (0 through 7). Update status as each step completes.

## Workflow

### Step 0: Codebase Discovery

**Role**: Code analyst
**Goal**: Build a complete map of the codebase (or targeted subtree) before analyzing any code.

**Focus Area scoping**: if FOCUS_DIR is set, limit the scan to that directory subtree. Still identify transitive dependencies outside FOCUS_DIR (modules that FOCUS_DIR imports) and include them in the processing order, but skip modules that are neither inside FOCUS_DIR nor dependencies of it.

Scan and catalog:

1. Directory tree (ignore `node_modules`, `.git`, `__pycache__`, `bin/`, `obj/`, build artifacts)
2. Language detection from file extensions and config files
3. Package manifests: `package.json`, `requirements.txt`, `pyproject.toml`, `*.csproj`, `Cargo.toml`, `go.mod`
4. Config files: `Dockerfile`, `docker-compose.yml`, `.env.example`, CI/CD configs (`.github/workflows/`, `.gitlab-ci.yml`, `azure-pipelines.yml`)
5. Entry points: `main.*`, `app.*`, `index.*`, `Program.*`, startup scripts
6. Test structure: test directories, test frameworks, test runner configs
7. Existing documentation: README, `docs/`, wiki references, inline doc coverage
8. **Dependency graph**: build a module-level dependency graph by analyzing imports/references. Identify:
- Leaf modules (no internal dependencies)
- Entry points (no internal dependents)
- Cycles (mark for grouped analysis)
- Topological processing order
- If FOCUS_DIR: mark which modules are in-scope vs dependency-only
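
The graph classification above can be sketched with the standard library's `graphlib`; the module names are hypothetical:

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical import graph: module -> set of internal modules it depends on.
deps = {
    "core.models": set(),
    "core.auth": {"core.models"},
    "api.routes": {"core.models", "core.auth"},
    "main": {"api.routes"},
}

# Leaves have no internal dependencies; entry points have no internal dependents.
leaves = sorted(m for m, d in deps.items() if not d)
entry_points = sorted(m for m in deps if not any(m in d for d in deps.values()))

try:
    # static_order yields dependencies before dependents, i.e. leaves first.
    order = list(TopologicalSorter(deps).static_order())
except CycleError as err:
    # Cyclic modules are grouped and analyzed together instead.
    order, cycle = [], err.args[1]
```

`graphlib` raises `CycleError` on any cycle, which is exactly the signal needed to mark those modules for grouped analysis.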

**Save**: `DOCUMENT_DIR/00_discovery.md` containing:
- Directory tree (concise, relevant directories only)
- Tech stack summary table (language, framework, database, infra)
- Dependency graph (textual list + Mermaid diagram)
- Topological processing order
- Entry points and leaf modules

**Save**: `DOCUMENT_DIR/state.json` with initial state:
```json
{
  "current_step": "module-analysis",
  "completed_steps": ["discovery"],
  "focus_dir": null,
  "modules_total": 0,
  "modules_documented": [],
  "modules_remaining": [],
  "module_batch": 0,
  "components_written": [],
  "last_updated": ""
}
```

Set `focus_dir` to the FOCUS_DIR path if in Focus Area mode, or `null` for Full mode.

---

### Step 1: Module-Level Documentation

**Role**: Code analyst
**Goal**: Document every identified module individually, processing in topological order (leaves first).

**Batched processing**: process modules in batches of ~5 (sorted by topological order). After each batch: save all module docs, update `state.json`, present a progress summary. Between batches, evaluate whether to suggest a session break.

For each module in topological order:

1. **Read**: read the module's source code. Assess complexity and what context is needed.
2. **Gather context**: collect already-written docs of this module's dependencies (available because of bottom-up order). Note external library usage.
3. **Write module doc** with these sections:
   - **Purpose**: one-sentence responsibility
   - **Public interface**: exported functions/classes/methods with signatures, input/output types
   - **Internal logic**: key algorithms, patterns, non-obvious behavior
   - **Dependencies**: what it imports internally and why
   - **Consumers**: what uses this module (from the dependency graph)
   - **Data models**: entities/types defined in this module
   - **Configuration**: env vars, config keys consumed
   - **External integrations**: HTTP calls, DB queries, queue operations, file I/O
   - **Security**: auth checks, encryption, input validation, secrets access
   - **Tests**: what tests exist for this module, what they cover
4. **Verify**: cross-check that every entity referenced in the doc exists in the codebase. Flag uncertainties.

**Cycle handling**: modules in a dependency cycle are analyzed together as a group, producing a single combined doc.

**Large modules**: if a module exceeds comfortable analysis size, split into logical sub-sections and analyze each part, then combine.

**Save**: `DOCUMENT_DIR/modules/[module_name].md` for each module.
**State**: update `state.json` after each module completes (move from `modules_remaining` to `modules_documented`). Increment `module_batch` after each batch of ~5.
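
The per-module state update can be sketched as follows; a minimal sketch whose field names follow the `state.json` format above:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def mark_module_done(state_path: Path, module: str) -> dict:
    """Move one module from modules_remaining to modules_documented and persist."""
    state = json.loads(state_path.read_text())
    if module in state["modules_remaining"]:
        state["modules_remaining"].remove(module)
    if module not in state["modules_documented"]:
        state["modules_documented"].append(module)
    state["last_updated"] = datetime.now(timezone.utc).isoformat()
    state_path.write_text(json.dumps(state, indent=2))
    return state
```

Writing the file after every module, not only after each batch, is what makes re-entry lossless if a session ends mid-batch.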

**Session break heuristic**: after each batch, if more than 10 modules remain AND 2+ batches have already completed in this session, suggest a session break:

```
══════════════════════════════════════
SESSION BREAK SUGGESTED
══════════════════════════════════════
Modules documented: [X] of [Y]
Batches completed this session: [N]
══════════════════════════════════════
A) Continue in this conversation
B) Save and continue in a fresh conversation (recommended)
══════════════════════════════════════
Recommendation: B — fresh context improves
analysis quality for remaining modules
══════════════════════════════════════
```

Re-entry is seamless: `state.json` tracks exactly which modules are done.

---

### Step 2: Component Assembly

**Role**: Software architect
**Goal**: Group related modules into logical components and produce component specs.

1. Analyze module docs from Step 1 to identify natural groupings:
   - By directory structure (most common)
   - By shared data models or common purpose
   - By dependency clusters (tightly coupled modules)
2. For each identified component, synthesize its module docs into a single component specification using `templates/component-spec.md` as structure:
   - High-level overview: purpose, pattern, upstream/downstream
   - Internal interfaces: method signatures, DTOs (from actual module code)
   - External API specification (if the component exposes HTTP/gRPC endpoints)
   - Data access patterns: queries, caching, storage estimates
   - Implementation details: algorithmic complexity, state management, key libraries
   - Extensions and helpers: shared utilities needed
   - Caveats and edge cases: limitations, race conditions, bottlenecks
   - Dependency graph: implementation order relative to other components
   - Logging strategy
3. Identify common helpers shared across multiple components -> document in `common-helpers/`
4. Generate component relationship diagram (Mermaid)
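
Generating the Mermaid source from the component dependency map is mechanical; a sketch with hypothetical component names:

```python
def to_mermaid(edges: dict) -> str:
    """Render a component -> dependencies map as a Mermaid flowchart."""
    lines = ["flowchart TD"]
    for comp in sorted(edges):
        deps = sorted(edges[comp])
        if not deps:
            lines.append(f"    {comp}")  # leaf or isolated component
        lines.extend(f"    {comp} --> {dep}" for dep in deps)
    return "\n".join(lines)
```

Sorting both components and edges keeps the generated diagram stable across runs, so re-generation produces no spurious diffs in `diagrams/components.md`.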

**Self-verification**:
- [ ] Every module from Step 1 is covered by exactly one component
- [ ] No component has overlapping responsibility with another
- [ ] Inter-component interfaces are explicit (who calls whom, with what)
- [ ] Component dependency graph has no circular dependencies

**Save**:
- `DOCUMENT_DIR/components/[##]_[name]/description.md` per component
- `DOCUMENT_DIR/common-helpers/[##]_helper_[name].md` per shared helper
- `DOCUMENT_DIR/diagrams/components.md` (Mermaid component diagram)

**BLOCKING**: Present component list with one-line summaries to user. Do NOT proceed until user confirms the component breakdown is correct.

---

### Step 3: System-Level Synthesis

**Role**: Software architect
**Goal**: From component docs, synthesize system-level documents.

All documents here are derived from component docs (Step 2) + module docs (Step 1). No new code reading should be needed. If it is, that indicates a gap in Steps 1-2 — go back and fill it.

#### 3a. Architecture

Using `templates/architecture.md` as structure:

- System context and boundaries from entry points and external integrations
- Tech stack table from discovery (Step 0) + component specs
- Deployment model from Dockerfiles, CI configs, environment strategies
- Data model overview from per-component data access sections
- Integration points from inter-component interfaces
- NFRs from test thresholds, config limits, health checks
- Security architecture from per-module security observations
- Key ADRs inferred from technology choices and patterns

**Save**: `DOCUMENT_DIR/architecture.md`

#### 3b. System Flows

Using `templates/system-flows.md` as structure:

- Trace main flows through the component interaction graph
- Entry point -> component chain -> output for each major flow
- Mermaid sequence diagrams and flowcharts
- Error scenarios from exception handling patterns
- Data flow tables per flow

**Save**: `DOCUMENT_DIR/system-flows.md` and `DOCUMENT_DIR/diagrams/flows/flow_[name].md`

#### 3c. Data Model

- Consolidate all data models from module docs
- Entity-relationship diagram (Mermaid ERD)
- Migration strategy (if ORM/migration tooling detected)
- Seed data observations
- Backward compatibility approach (if versioning found)

**Save**: `DOCUMENT_DIR/data_model.md`

#### 3d. Deployment (if Dockerfile/CI configs exist)

- Containerization summary
- CI/CD pipeline structure
- Environment strategy (dev, staging, production)
- Observability (logging patterns, metrics, health checks found in code)

**Save**: `DOCUMENT_DIR/deployment/` (containerization.md, ci_cd_pipeline.md, environment_strategy.md, observability.md — only files for which sufficient code evidence exists)

---

### Step 4: Verification Pass

**Role**: Quality verifier
**Goal**: Compare every generated document against actual code. Fix hallucinations, fill gaps, correct inaccuracies.

For each document generated in Steps 1-3:

1. **Entity verification**: extract all code entities (class names, function names, module names, endpoints) mentioned in the doc. Cross-reference each against the actual codebase. Flag any that don't exist.
2. **Interface accuracy**: for every method signature, DTO, or API endpoint in component specs, verify it matches actual code.
3. **Flow correctness**: for each system flow diagram, trace the actual code path and verify the sequence matches.
4. **Completeness check**: are there modules or components discovered in Step 0 that aren't covered by any document? Flag gaps.
5. **Consistency check**: do component docs agree with architecture doc? Do flow diagrams match component interfaces?
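
Entity verification can be approximated mechanically before the manual pass. A crude sketch, assuming a Python codebase and docs that backtick their identifiers:

```python
import re
from pathlib import Path


def flag_unknown_entities(doc_text: str, source_root: Path) -> list:
    """Return backticked identifiers from a doc that appear nowhere under source_root."""
    entities = set(re.findall(r"`([A-Za-z_][A-Za-z0-9_.]*)`", doc_text))
    sources = [p.read_text(errors="ignore") for p in source_root.rglob("*.py")]
    # An entity survives if its last dotted segment occurs in any source file.
    return sorted(e for e in entities if not any(e.split(".")[-1] in s for s in sources))
```

Anything this returns is a hallucination candidate for manual review; the plain-text match is deliberately permissive, so it only flags names that are definitely absent.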
|
||||
|
||||
Apply corrections inline to the documents that need them.
|
||||
|
||||
**Save**: `DOCUMENT_DIR/04_verification_log.md` with:
|
||||
- Total entities verified vs flagged
|
||||
- Corrections applied (which document, what changed)
|
||||
- Remaining gaps or uncertainties
|
||||
- Completeness score (modules covered / total modules)
|
||||
|
||||
**BLOCKING**: Present verification summary to user. Do NOT proceed until user confirms corrections are acceptable or requests additional fixes.
|
||||
|
||||
**Session boundary**: After verification is confirmed, suggest a session break before proceeding to the synthesis steps (5–7). These steps produce different artifact types and benefit from fresh context:
|
||||
|
||||
```
|
||||
══════════════════════════════════════
|
||||
VERIFICATION COMPLETE — session break?
|
||||
══════════════════════════════════════
|
||||
Steps 0–4 (analysis + verification) are done.
|
||||
Steps 5–7 (solution + problem extraction + report)
|
||||
can run in a fresh conversation.
|
||||
══════════════════════════════════════
|
||||
A) Continue in this conversation
|
||||
B) Save and continue in a new conversation (recommended)
|
||||
══════════════════════════════════════
|
||||
```
|
||||
|
||||
If **Focus Area mode**: Steps 5–7 are skipped (they require full codebase coverage). Present a summary of modules and components documented for this area. The user can run `/document` again for another area, or run without FOCUS_DIR once all areas are covered to produce the full synthesis.
|
||||
|
||||
---
|
||||
|
||||
### Step 5: Solution Extraction (Retrospective)
|
||||
|
||||
**Role**: Software architect
|
||||
**Goal**: From all verified technical documentation, retrospectively create `solution.md` — the same artifact the research skill produces. This makes downstream skills (`plan`, `deploy`, `decompose`) compatible with the documented codebase.
|
||||
|
||||
Synthesize from architecture (Step 3) + component specs (Step 2) + system flows (Step 3) + verification findings (Step 4):
|
||||
|
||||
1. **Product Solution Description**: what the system is, brief component interaction diagram (Mermaid)
|
||||
2. **Architecture**: the architecture that is implemented, with per-component solution tables:
|
||||
|
||||
| Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
|
||||
|----------|-------|-----------|-------------|-------------|----------|------|-----|
|
||||
| [actual implementation] | [libs/platforms used] | [observed strengths] | [observed limitations] | [requirements met] | [security approach] | [cost indicators] | [fitness assessment] |
|
||||
|
||||
3. **Testing Strategy**: summarize integration/functional tests and non-functional tests found in the codebase
4. **References**: links to key config files, Dockerfiles, and CI configs that evidence the solution choices

**Save**: `SOLUTION_DIR/solution.md` (`_docs/01_solution/solution.md`)

---

### Step 6: Problem Extraction (Retrospective)

**Role**: Business analyst
**Goal**: From all verified technical docs, retrospectively derive the high-level problem definition — producing the same documents the `problem` skill creates through interview.

This is the inverse of the normal workflow: instead of problem → solution → code, we go code → technical docs → problem understanding.

#### 6a. `problem.md`

- Synthesize from architecture overview + component purposes + system flows
- What is this system? What problem does it solve? Who are the users? How does it work at a high level?
- Cross-reference with README if one exists
- Free-form text, concise, readable by someone unfamiliar with the project

#### 6b. `restrictions.md`

- Extract from: tech stack choices, Dockerfile specs (OS, base images), CI configs (platform constraints), dependency versions, environment configs
- Categorize with headers: Hardware, Software, Environment, Operational
- Each restriction should be specific and testable

#### 6c. `acceptance_criteria.md`

- Derive from: test assertions (expected values, thresholds), performance configs (timeouts, rate limits, batch sizes), health check endpoints, validation rules in code
- Categorize with headers by domain
- Every criterion must have a measurable value — if only implied, note the source

#### 6d. `input_data/`

- Document data schemas found (DB schemas, API request/response types, config file formats)
- Create `data_parameters.md` describing what data the system consumes, formats, volumes, update patterns

#### 6e. `security_approach.md` (only if security code found)

- Authentication mechanisms, authorization patterns, encryption, secrets handling, CORS, rate limiting, input sanitization — all from code observations
- If no security-relevant code found, skip this file

**Save**: all files to `PROBLEM_DIR/` (`_docs/00_problem/`)

**BLOCKING**: Present all problem documents to the user. These are the most abstracted and therefore most prone to interpretation error. Do NOT proceed until the user confirms or requests corrections.

---

### Step 7: Final Report

**Role**: Technical writer
**Goal**: Produce `FINAL_report.md` integrating all generated documentation.

Using `templates/final-report.md` as structure:

- Executive summary from architecture + problem docs
- Problem statement (transformed from problem.md, not copy-pasted)
- Architecture overview with tech stack one-liner
- Component summary table (number, name, purpose, dependencies)
- System flows summary table
- Risk observations from verification log (Step 4)
- Open questions (uncertainties flagged during analysis)
- Artifact index listing all generated documents with paths

**Save**: `DOCUMENT_DIR/FINAL_report.md`

**State**: update `state.json` with `current_step: "complete"`.

---

## Artifact Management

### Directory Structure

```
_docs/
├── 00_problem/                  # Step 6 (retrospective)
│   ├── problem.md
│   ├── restrictions.md
│   ├── acceptance_criteria.md
│   ├── input_data/
│   │   └── data_parameters.md
│   └── security_approach.md
├── 01_solution/                 # Step 5 (retrospective)
│   └── solution.md
└── 02_document/                 # DOCUMENT_DIR
    ├── 00_discovery.md          # Step 0
    ├── modules/                 # Step 1
    │   ├── [module_name].md
    │   └── ...
    ├── components/              # Step 2
    │   ├── 01_[name]/description.md
    │   ├── 02_[name]/description.md
    │   └── ...
    ├── common-helpers/          # Step 2
    ├── architecture.md          # Step 3
    ├── system-flows.md          # Step 3
    ├── data_model.md            # Step 3
    ├── deployment/              # Step 3
    ├── diagrams/                # Steps 2-3
    │   ├── components.md
    │   └── flows/
    ├── 04_verification_log.md   # Step 4
    ├── FINAL_report.md          # Step 7
    └── state.json               # Resumability
```

### Resumability

Maintain `DOCUMENT_DIR/state.json`:

```json
{
  "current_step": "module-analysis",
  "completed_steps": ["discovery"],
  "focus_dir": null,
  "modules_total": 12,
  "modules_documented": ["utils/helpers", "models/user"],
  "modules_remaining": ["services/auth", "api/endpoints"],
  "module_batch": 1,
  "components_written": [],
  "last_updated": "2026-03-21T14:00:00Z"
}
```

Update after each module/component completes. If interrupted, resume from the next undocumented module.

When resuming:
1. Read `state.json`
2. Cross-check against actual files in DOCUMENT_DIR (trust files over state if they disagree)
3. Continue from the next incomplete item
4. Inform the user which steps are being skipped
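
The resume protocol above can be sketched in a few lines. This is an illustrative sketch only, not part of the skill; in particular, the flattened `modules/[module_name].md` filename scheme (slashes replaced with underscores) is an assumption:

```python
import json
from pathlib import Path

def resume_point(document_dir: str) -> dict:
    """Reconstruct progress, trusting files on disk over state.json."""
    root = Path(document_dir)
    state_file = root / "state.json"
    state = json.loads(state_file.read_text()) if state_file.exists() else {}

    # Cross-check: a module counts as documented only if its doc file exists.
    # Assumed naming: modules/utils_helpers.md for module "utils/helpers".
    claimed = state.get("modules_documented", [])
    verified = [m for m in claimed
                if (root / "modules" / f"{m.replace('/', '_')}.md").exists()]

    remaining = [m for m in state.get("modules_remaining", []) if m not in verified]
    # Modules the state claims are done but whose files are missing go back to remaining.
    remaining += [m for m in claimed if m not in verified]

    return {"documented": verified, "remaining": remaining}
```

The key design point is rule 2 of the protocol: the state file is a hint, the files on disk are the truth.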

### Save Principles

1. **Save immediately**: write each module doc as soon as analysis completes
2. **Incremental context**: each subsequent module uses already-written docs as context
3. **Preserve intermediates**: keep all module docs even after synthesis into component docs
4. **Enable recovery**: state file tracks exact progress for resume

## Escalation Rules

| Situation | Action |
|-----------|--------|
| Minified/obfuscated code detected | WARN user, skip module, note in verification log |
| Module too large for context window | Split into sub-sections, analyze parts separately, combine |
| Cycle in dependency graph | Group cycled modules, analyze together as one doc |
| Generated code (protobuf, swagger-gen) | Note as generated, document the source spec instead |
| No tests found in codebase | Note gap in acceptance_criteria.md, derive AC from validation rules and config limits only |
| Contradictions between code and README | Flag in verification log, ASK user |
| Binary files or non-code assets | Skip, note in discovery |
| `_docs/` already exists | ASK user: overwrite, merge, or use `_docs_generated/` |
| Code intent is ambiguous | ASK user, do not guess |

## Common Mistakes

- **Top-down guessing**: never infer architecture before documenting modules. Build up, don't assume down.
- **Hallucinating entities**: always verify that referenced classes/functions/endpoints actually exist in code.
- **Skipping modules**: every source module must appear in exactly one module doc and one component.
- **Monolithic analysis**: don't try to analyze the entire codebase in one pass. Module by module, in order.
- **Inventing restrictions**: only document constraints actually evidenced in code, configs, or Dockerfiles.
- **Vague acceptance criteria**: "should be fast" is not a criterion. Extract actual numeric thresholds from code.
- **Writing code**: this skill produces documents, never implementation code.

## Methodology Quick Reference

```
┌──────────────────────────────────────────────────────────────────┐
│            Bottom-Up Codebase Documentation (8-Step)             │
├──────────────────────────────────────────────────────────────────┤
│ MODE: Full / Focus Area (@dir) / Resume (state.json)             │
│ PREREQ: Check _docs/ exists (overwrite/merge/new?)               │
│ PREREQ: Check state.json for resume                              │
│                                                                  │
│ 0. Discovery → dependency graph, tech stack, topo order          │
│    (Focus Area: scoped to FOCUS_DIR + transitive deps)           │
│ 1. Module Docs → per-module analysis (leaves first)              │
│    (batched ~5 modules; session break between batches)           │
│ 2. Component Assembly → group modules, write component specs     │
│    [BLOCKING: user confirms components]                          │
│ 3. System Synthesis → architecture, flows, data model, deploy    │
│ 4. Verification → compare all docs vs code, fix errors           │
│    [BLOCKING: user reviews corrections]                          │
│    [SESSION BREAK suggested before Steps 5–7]                    │
│                ── Focus Area mode stops here ──                  │
│ 5. Solution Extraction → retrospective solution.md               │
│ 6. Problem Extraction → retrospective problem, restrictions, AC  │
│    [BLOCKING: user confirms problem docs]                        │
│ 7. Final Report → FINAL_report.md                                │
├──────────────────────────────────────────────────────────────────┤
│ Principles: Bottom-up always · Dependencies first                │
│             Incremental context · Verify against code            │
│             Save immediately · Resume from checkpoint            │
│             Batch modules · Session breaks for large codebases   │
└──────────────────────────────────────────────────────────────────┘
```

@@ -73,9 +73,9 @@ For each task in the batch:
- Determine: files OWNED (exclusive write), files READ-ONLY (shared interfaces, types), files FORBIDDEN (other agents' owned files)
- If two tasks in the same batch would modify the same file, schedule them sequentially instead of in parallel
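
The file-ownership rule above amounts to a simple batch partitioner. A minimal Python sketch (illustrative only; task IDs and file names are hypothetical):

```python
def schedule(tasks: dict[str, set[str]]) -> list[list[str]]:
    """Partition tasks into waves so no two tasks in a wave share an owned file.

    tasks maps task id -> set of files the task will write.
    Greedy: place each task into the first wave with no file conflict;
    conflicting tasks fall into later waves and thus run sequentially.
    """
    waves: list[tuple[list[str], set[str]]] = []  # (task ids, union of owned files)
    for task, files in tasks.items():
        for ids, owned in waves:
            if not (files & owned):  # no overlap -> safe to run in parallel
                ids.append(task)
                owned |= files
                break
        else:
            waves.append(([task], set(files)))
    return [ids for ids, _ in waves]
```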

### 5. Update Jira Status → In Progress
### 5. Update Tracker Status → In Progress

For each task in the batch, transition its Jira ticket status to **In Progress** via Jira MCP before launching the implementer.
For each task in the batch, transition its ticket status to **In Progress** via the configured work item tracker (Jira MCP or Azure DevOps MCP — see `protocols.md` for detection) before launching the implementer. If `tracker: local`, skip this step.

### 6. Launch Implementer Subagents

@@ -93,15 +93,30 @@ Launch all subagents immediately — no user confirmation.
- Collect structured status reports from each implementer
- If any implementer reports "Blocked", log the blocker and continue with others

**Stuck detection** — while monitoring, watch for these signals per subagent:
- Same file modified 3+ times without test pass rate improving → flag as stuck, stop the subagent, report as Blocked
- Subagent has not produced new output for an extended period → flag as potentially hung
- If a subagent is flagged as stuck, do NOT let it continue looping — stop it and record the blocker in the batch report
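
The first stuck signal can be tracked with a small counter per (agent, file) pair. An illustrative sketch under one stated assumption: any improvement in an agent's test pass rate counts as progress and resets its counters:

```python
from collections import defaultdict

class StuckDetector:
    """Flag a subagent when the same file is edited 3+ times with no
    improvement in that agent's test pass rate."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.edits = defaultdict(int)            # (agent, file) -> edits since last progress
        self.best_pass_rate = defaultdict(float) # agent -> best pass rate seen

    def record(self, agent: str, file: str, pass_rate: float) -> bool:
        """Record one edit; return True if the agent should be stopped as stuck."""
        key = (agent, file)
        if pass_rate > self.best_pass_rate[agent]:
            self.best_pass_rate[agent] = pass_rate
            self.edits[key] = 0      # progress made — reset the counter
        else:
            self.edits[key] += 1
        return self.edits[key] >= self.threshold
```

The hung-output signal would need a wall-clock timer instead; it is omitted here.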

### 8. Code Review

- Run `/code-review` skill on the batch's changed files + corresponding task specs
- The code-review skill produces a verdict: PASS, PASS_WITH_WARNINGS, or FAIL

### 9. Gate
### 9. Auto-Fix Gate

- If verdict is **FAIL**: present findings to user (**BLOCKING**). User must confirm fixes or accept before proceeding.
- If verdict is **PASS** or **PASS_WITH_WARNINGS**: show findings as info, continue automatically.
Auto-fix loop with bounded retries (max 2 attempts) before escalating to user:

1. If verdict is **PASS** or **PASS_WITH_WARNINGS**: show findings as info, continue automatically to step 10
2. If verdict is **FAIL** (attempt 1 or 2):
   - Parse the code review findings (Critical and High severity items)
   - For each finding, attempt an automated fix using the finding's location, description, and suggestion
   - Re-run `/code-review` on the modified files
   - If now PASS or PASS_WITH_WARNINGS → continue to step 10
   - If still FAIL → increment retry counter, repeat from (2) up to max 2 attempts
3. If still **FAIL** after 2 auto-fix attempts: present all findings to user (**BLOCKING**). User must confirm fixes or accept before proceeding.

Track `auto_fix_attempts` count in the batch report for retrospective analysis.
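
The bounded retry loop above reduces to a few lines of control flow. A sketch in Python, where `review` and `fix` are stand-ins for the `/code-review` invocation and the automated fix pass (both hypothetical callables, not real APIs):

```python
def autofix_gate(review, fix, max_attempts: int = 2):
    """Run the bounded auto-fix loop of step 9.

    review() -> verdict string; fix() applies automated fixes.
    Returns (final_verdict, auto_fix_attempts); the caller escalates
    to the user (BLOCKING) if the final verdict is still "FAIL".
    """
    attempts = 0
    verdict = review()
    while verdict == "FAIL" and attempts < max_attempts:
        attempts += 1
        fix()                 # attempt fixes for Critical/High findings
        verdict = review()    # re-run code review on the modified files
    return verdict, attempts
```

Returning the attempt count directly supports the `auto_fix_attempts` field in the batch report.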

### 10. Test

@@ -112,12 +127,12 @@ Launch all subagents immediately — no user confirmation.

- After user confirms the batch (explicitly for FAIL, implicitly for PASS/PASS_WITH_WARNINGS):
  - `git add` all changed files from the batch
  - `git commit` with a message that includes ALL JIRA-IDs of tasks implemented in the batch, followed by a summary of what was implemented. Format: `[JIRA-ID-1] [JIRA-ID-2] ... Summary of changes`
  - `git commit` with a message that includes ALL task IDs (Jira IDs, ADO IDs, or numeric prefixes) of tasks implemented in the batch, followed by a summary of what was implemented. Format: `[TASK-ID-1] [TASK-ID-2] ... Summary of changes`
  - `git push` to the remote branch

### 12. Update Jira Status → In Testing
### 12. Update Tracker Status → In Testing

After the batch is committed and pushed, transition the Jira ticket status of each task in the batch to **In Testing** via Jira MCP.
After the batch is committed and pushed, transition the ticket status of each task in the batch to **In Testing** via the configured work item tracker. If `tracker: local`, skip this step.

### 13. Loop

@@ -146,6 +161,8 @@ After each batch, produce a structured report:
| [JIRA-ID]_[name] | Done | [count] files | [pass/fail] | [count or None] |

## Code Review Verdict: [PASS/FAIL/PASS_WITH_WARNINGS]
## Auto-Fix Attempts: [0/1/2]
## Stuck Agents: [count or None]

## Next Batch: [task list] or "All tasks complete"
```
@@ -173,5 +190,5 @@ Each batch commit serves as a rollback checkpoint. If recovery is needed:

- Never launch tasks whose dependencies are not yet completed
- Never allow two parallel agents to write to the same file
- If a subagent fails, do NOT retry automatically — report and let user decide
- If a subagent fails or is flagged as stuck, stop it and report — do not let it loop indefinitely
- Always run tests after each batch completes

@@ -0,0 +1,302 @@
---
name: new-task
description: |
  Interactive skill for adding new functionality to an existing codebase.
  Guides the user through describing the feature, assessing complexity,
  optionally running research, analyzing the codebase for insertion points,
  validating assumptions with the user, and producing a task spec with Jira ticket.
  Supports a loop — the user can add multiple tasks in one session.
  Trigger phrases:
  - "new task", "add feature", "new functionality"
  - "I want to add", "new component", "extend"
category: build
tags: [task, feature, interactive, planning, jira]
disable-model-invocation: true
---

# New Task (Interactive Feature Planning)

Guide the user through defining new functionality for an existing codebase. Produces one or more task specifications with Jira tickets, optionally running deep research for complex features.

## Core Principles

- **User-driven**: every task starts with the user's description; never invent requirements
- **Right-size research**: only invoke the research skill when the change is big enough to warrant it
- **Validate before committing**: surface all assumptions and uncertainties to the user before writing the task file
- **Save immediately**: write task files to disk as soon as they are ready; never accumulate unsaved work
- **Ask, don't assume**: when scope, insertion point, or approach is unclear, STOP and ask the user

## Context Resolution

Fixed paths:

- TASKS_DIR: `_docs/02_tasks/`
- PLANS_DIR: `_docs/02_task_plans/`
- DOCUMENT_DIR: `_docs/02_document/`
- DEPENDENCIES_TABLE: `_docs/02_tasks/_dependencies_table.md`

Create TASKS_DIR and PLANS_DIR if they don't exist.

If TASKS_DIR already contains task files, scan them to determine the next numeric prefix for temporary file naming.
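
The prefix scan can be done with a glob and a regex. An illustrative sketch (the `NN_name.md` naming convention is taken from this skill; non-matching files such as `_dependencies_table.md` are ignored):

```python
import re
from pathlib import Path

def next_prefix(tasks_dir: str) -> str:
    """Return the next zero-padded numeric prefix, e.g. '03' after 01_, 02_."""
    pattern = re.compile(r"^(\d+)_")
    numbers = [int(m.group(1))
               for f in Path(tasks_dir).glob("*.md")
               if (m := pattern.match(f.name))]
    return f"{max(numbers, default=0) + 1:02d}"
```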

## Workflow

The skill runs as a loop. Each iteration produces one task. After each task the user chooses to add another or finish.

---

### Step 1: Gather Feature Description

**Role**: Product analyst
**Goal**: Get a clear, detailed description of the new functionality from the user.

Ask the user:

```
══════════════════════════════════════
NEW TASK: Describe the functionality
══════════════════════════════════════
Please describe in detail the new functionality you want to add:
- What should it do?
- Who is it for?
- Any specific requirements or constraints?
══════════════════════════════════════
```

**BLOCKING**: Do NOT proceed until the user provides a description.

Record the description verbatim for use in subsequent steps.

---

### Step 2: Analyze Complexity

**Role**: Technical analyst
**Goal**: Determine whether deep research is needed.

Read the user's description and the existing codebase documentation from DOCUMENT_DIR (architecture.md, components/, system-flows.md).

Assess the change along these dimensions:
- **Scope**: how many components/files are affected?
- **Novelty**: does it involve libraries, protocols, or patterns not already in the codebase?
- **Risk**: could it break existing functionality or require architectural changes?

Classification:

| Category | Criteria | Action |
|----------|----------|--------|
| **Needs research** | New libraries/frameworks, unfamiliar protocols, significant architectural change, multiple unknowns | Proceed to Step 3 (Research) |
| **Skip research** | Extends existing functionality, uses patterns already in codebase, straightforward new component with known tech | Skip to Step 4 (Codebase Analysis) |

Present the assessment to the user:

```
══════════════════════════════════════
COMPLEXITY ASSESSMENT
══════════════════════════════════════
Scope: [low / medium / high]
Novelty: [low / medium / high]
Risk: [low / medium / high]
══════════════════════════════════════
Recommendation: [Research needed / Skip research]
Reason: [one-line justification]
══════════════════════════════════════
```

**BLOCKING**: Ask the user to confirm or override the recommendation before proceeding.

---

### Step 3: Research (conditional)

**Role**: Researcher
**Goal**: Investigate unknowns before task specification.

This step only runs if Step 2 determined research is needed.

1. Create a problem description file at `PLANS_DIR/<task_slug>/problem.md` summarizing the feature request and the specific unknowns to investigate
2. Invoke `.cursor/skills/research/SKILL.md` in standalone mode:
   - INPUT_FILE: `PLANS_DIR/<task_slug>/problem.md`
   - BASE_DIR: `PLANS_DIR/<task_slug>/`
3. After research completes, read the solution draft from `PLANS_DIR/<task_slug>/01_solution/solution_draft01.md`
4. Extract the key findings relevant to the task specification

The `<task_slug>` is a short kebab-case name derived from the feature description (e.g., `auth-provider-integration`, `real-time-notifications`).
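
Slug derivation can be sketched as a small normalizer. A hypothetical helper (the stop-word list is an assumption; the skill does not prescribe one):

```python
import re

def task_slug(description: str, max_words: int = 4) -> str:
    """Derive a short kebab-case slug from a feature description."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    # Drop filler words so only the feature-bearing terms remain (assumed list).
    stop = {"a", "an", "the", "to", "of", "for", "and", "add", "new", "i", "want"}
    keep = [w for w in words if w not in stop][:max_words]
    return "-".join(keep)
```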

---

### Step 4: Codebase Analysis

**Role**: Software architect
**Goal**: Determine where and how to insert the new functionality.

1. Read the codebase documentation from DOCUMENT_DIR:
   - `architecture.md` — overall structure
   - `components/` — component specs
   - `system-flows.md` — data flows (if exists)
   - `data_model.md` — data model (if exists)
2. If research was performed (Step 3), incorporate findings
3. Analyze and determine:
   - Which existing components are affected
   - Where new code should be inserted (which layers, modules, files)
   - What interfaces need to change
   - What new interfaces or models are needed
   - How data flows through the change
4. If the change is complex enough, read the actual source files (not just docs) to verify insertion points

Present the analysis:

```
══════════════════════════════════════
CODEBASE ANALYSIS
══════════════════════════════════════
Affected components: [list]
Insertion points: [list of modules/layers]
Interface changes: [list or "None"]
New interfaces: [list or "None"]
Data flow impact: [summary]
══════════════════════════════════════
```

---

### Step 5: Validate Assumptions

**Role**: Quality gate
**Goal**: Surface every uncertainty and get user confirmation.

Review all decisions and assumptions made in Steps 2–4. For each uncertainty:
1. State the assumption clearly
2. Propose a solution or approach
3. List alternatives if they exist

Present using the Choose format for each decision that has meaningful alternatives:

```
══════════════════════════════════════
ASSUMPTION VALIDATION
══════════════════════════════════════
1. [Assumption]: [proposed approach]
   Alternative: [other option, if any]
2. [Assumption]: [proposed approach]
   Alternative: [other option, if any]
...
══════════════════════════════════════
Please confirm or correct these assumptions.
══════════════════════════════════════
```

**BLOCKING**: Do NOT proceed until the user confirms or corrects all assumptions.

---

### Step 6: Create Task

**Role**: Technical writer
**Goal**: Produce the task specification file.

1. Determine the next numeric prefix by scanning TASKS_DIR for existing files
2. Write the task file using `.cursor/skills/decompose/templates/task.md`:
   - Fill all fields from the gathered information
   - Set **Complexity** based on the assessment from Step 2
   - Set **Dependencies** by cross-referencing existing tasks in TASKS_DIR
   - Set **Jira** and **Epic** to `pending` (filled in Step 7)
3. Save as `TASKS_DIR/[##]_[short_name].md`

**Self-verification**:
- [ ] Problem section clearly describes the user need
- [ ] Acceptance criteria are testable (Gherkin format)
- [ ] Scope boundaries are explicit
- [ ] Complexity points match the assessment
- [ ] Dependencies reference existing task Jira IDs where applicable
- [ ] No implementation details leaked into the spec

---

### Step 7: Work Item Ticket

**Role**: Project coordinator
**Goal**: Create a work item ticket and link it to the task file.

1. Create a ticket via the configured work item tracker (Jira MCP or Azure DevOps MCP — see `autopilot/protocols.md` for detection):
   - Summary: the task's **Name** field
   - Description: the task's **Problem** and **Acceptance Criteria** sections
   - Story points: the task's **Complexity** value
   - Link to the appropriate epic (ask user if unclear which epic)
2. Write the ticket ID and Epic ID back into the task file header:
   - Update **Task** field: `[TICKET-ID]_[short_name]`
   - Update **Jira** field: `[TICKET-ID]`
   - Update **Epic** field: `[EPIC-ID]`
3. Rename the file from `[##]_[short_name].md` to `[TICKET-ID]_[short_name].md`

If the work item tracker is not authenticated or unavailable (`tracker: local`):
- Keep the numeric prefix
- Set **Jira** to `pending`
- Set **Epic** to `pending`
- The task is still valid and can be implemented; tracker sync happens later

---

### Step 8: Loop Gate

Ask the user:

```
══════════════════════════════════════
Task created: [JIRA-ID or ##] — [task name]
══════════════════════════════════════
A) Add another task
B) Done — finish and update dependencies
══════════════════════════════════════
```

- If **A** → loop back to Step 1
- If **B** → proceed to Finalize

---

### Finalize

After the user chooses **Done**:

1. Update (or create) `TASKS_DIR/_dependencies_table.md` — add all newly created tasks to the dependencies table
2. Present a summary of all tasks created in this session:

```
══════════════════════════════════════
NEW TASK SUMMARY
══════════════════════════════════════
Tasks created: N
Total complexity: M points
─────────────────────────────────────
[JIRA-ID] [name] ([complexity] pts)
[JIRA-ID] [name] ([complexity] pts)
...
══════════════════════════════════════
```

## Escalation Rules

| Situation | Action |
|-----------|--------|
| User description is vague or incomplete | **ASK** for more detail — do not guess |
| Unclear which epic to link to | **ASK** user for the epic |
| Research skill hits a blocker | Follow research skill's own escalation rules |
| Codebase analysis reveals conflicting architectures | **ASK** user which pattern to follow |
| Complexity exceeds 5 points | **WARN** user and suggest splitting into multiple tasks |
| Jira MCP unavailable | **WARN**, continue with local-only task files |

## Trigger Conditions

When the user wants to:
- Add new functionality to an existing codebase
- Plan a new feature or component
- Create task specifications for upcoming work

**Keywords**: "new task", "add feature", "new functionality", "extend", "I want to add"

**Differentiation**:
- User wants to decompose an existing plan into tasks → use `/decompose`
- User wants to research a topic without creating tasks → use `/research`
- User wants to refactor existing code → use `/refactor`
- User wants to define and plan a new feature → use this skill

@@ -0,0 +1,2 @@
<!-- This skill uses the shared task template at .cursor/skills/decompose/templates/task.md -->
<!-- See that file for the full template structure. -->
@@ -3,7 +3,7 @@ name: plan
description: |
  Decompose a solution into architecture, data model, deployment plan, system flows, components, tests, and Jira epics.
  Systematic 6-step planning workflow with BLOCKING gates, self-verification, and structured artifact management.
  Uses _docs/ + _docs/02_plans/ structure.
  Uses _docs/ + _docs/02_document/ structure.
  Trigger phrases:
  - "plan", "decompose solution", "architecture planning"
  - "break down the solution", "create planning documents"
@@ -31,13 +31,11 @@ Fixed paths — no mode detection needed:

- PROBLEM_FILE: `_docs/00_problem/problem.md`
- SOLUTION_FILE: `_docs/01_solution/solution.md`
- PLANS_DIR: `_docs/02_plans/`
- DOCUMENT_DIR: `_docs/02_document/`

Announce the resolved paths to the user before proceeding.

## Input Specification

### Required Files
## Required Files

| File | Purpose |
|------|---------|
@@ -47,170 +45,23 @@ Announce the resolved paths to the user before proceeding.
| `_docs/00_problem/input_data/` | Reference data examples |
| `_docs/01_solution/solution.md` | Finalized solution to decompose |

### Prerequisite Checks (BLOCKING)
## Prerequisites

Run sequentially before any planning step:

**Prereq 1: Data Gate**

1. `_docs/00_problem/acceptance_criteria.md` exists and is non-empty — **STOP if missing**
2. `_docs/00_problem/restrictions.md` exists and is non-empty — **STOP if missing**
3. `_docs/00_problem/input_data/` exists and contains at least one data file — **STOP if missing**
4. `_docs/00_problem/problem.md` exists and is non-empty — **STOP if missing**

All four are mandatory. If any is missing or empty, STOP and ask the user to provide them. If the user cannot provide the required data, planning cannot proceed — just stop.

**Prereq 2: Finalize Solution Draft**

Only runs after the Data Gate passes:

1. Scan `_docs/01_solution/` for files matching `solution_draft*.md`
2. Identify the highest-numbered draft (e.g. `solution_draft06.md`)
3. **Rename** it to `_docs/01_solution/solution.md`
4. If `solution.md` already exists, ask the user whether to overwrite or keep existing
5. Verify `solution.md` is non-empty — **STOP if missing or empty**
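
Finding the highest-numbered draft from Prereq 2 can be sketched with a glob sorted on the embedded number (an illustrative helper; a numberless `solution_draft.md` is treated as draft 0 here, which is an assumption):

```python
import re
from pathlib import Path

def latest_draft(solution_dir: str):
    """Return the highest-numbered solution_draft*.md, or None if absent."""
    drafts = sorted(
        Path(solution_dir).glob("solution_draft*.md"),
        # Sort numerically, not lexically, so draft10 > draft2.
        key=lambda p: int(re.sub(r"\D", "", p.stem) or 0),
    )
    return drafts[-1] if drafts else None
```

The caller would then rename the result to `solution.md`, asking the user first if `solution.md` already exists.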

**Prereq 3: Workspace Setup**

1. Create PLANS_DIR if it does not exist
2. If PLANS_DIR already contains artifacts, ask user: **resume from last checkpoint or start fresh?**
Read and follow `steps/00_prerequisites.md`. All three prerequisite checks are **BLOCKING** — do not start the workflow until they pass.

## Artifact Management

### Directory Structure

All artifacts are written directly under PLANS_DIR:

```
PLANS_DIR/
├── integration_tests/
│   ├── environment.md
│   ├── test_data.md
│   ├── functional_tests.md
│   ├── non_functional_tests.md
│   └── traceability_matrix.md
├── architecture.md
├── system-flows.md
├── data_model.md
├── deployment/
│   ├── containerization.md
│   ├── ci_cd_pipeline.md
│   ├── environment_strategy.md
│   ├── observability.md
│   └── deployment_procedures.md
├── risk_mitigations.md
├── risk_mitigations_02.md (iterative, ## as sequence)
├── components/
│   ├── 01_[name]/
│   │   ├── description.md
│   │   └── tests.md
│   ├── 02_[name]/
│   │   ├── description.md
│   │   └── tests.md
│   └── ...
├── common-helpers/
│   ├── 01_helper_[name]/
│   ├── 02_helper_[name]/
│   └── ...
├── diagrams/
│   ├── components.drawio
│   └── flows/
│       ├── flow_[name].md (Mermaid)
│       └── ...
└── FINAL_report.md
```

### Save Timing

| Step | Save immediately after | Filename |
|------|------------------------|----------|
| Step 1 | Integration test environment spec | `integration_tests/environment.md` |
| Step 1 | Integration test data spec | `integration_tests/test_data.md` |
| Step 1 | Integration functional tests | `integration_tests/functional_tests.md` |
| Step 1 | Integration non-functional tests | `integration_tests/non_functional_tests.md` |
| Step 1 | Integration traceability matrix | `integration_tests/traceability_matrix.md` |
| Step 2 | Architecture analysis complete | `architecture.md` |
| Step 2 | System flows documented | `system-flows.md` |
| Step 2 | Data model documented | `data_model.md` |
| Step 2 | Deployment plan complete | `deployment/` (5 files) |
| Step 3 | Each component analyzed | `components/[##]_[name]/description.md` |
| Step 3 | Common helpers generated | `common-helpers/[##]_helper_[name].md` |
| Step 3 | Diagrams generated | `diagrams/` |
| Step 4 | Risk assessment complete | `risk_mitigations.md` |
| Step 5 | Tests written per component | `components/[##]_[name]/tests.md` |
| Step 6 | Epics created in Jira | Jira via MCP |
| Final | All steps complete | `FINAL_report.md` |

### Save Principles

1. **Save immediately**: write to disk as soon as a step completes; do not wait until the end
2. **Incremental updates**: the same file can be updated multiple times; append or replace
3. **Preserve process**: keep all intermediate files even after integration into the final report
4. **Enable recovery**: if interrupted, resume from the last saved artifact (see Resumability)

### Resumability

If PLANS_DIR already contains artifacts:

1. List existing files and match them to the save timing table above
2. Identify the last completed step based on which artifacts exist
3. Resume from the next incomplete step
4. Inform the user which steps are being skipped
Read `steps/01_artifact-management.md` for directory structure, save timing, save principles, and resumability rules. Refer to it throughout the workflow.

## Progress Tracking

At the start of execution, create a TodoWrite with all steps (1 through 6). Update status as each step completes.
At the start of execution, create a TodoWrite with all steps (1 through 6 plus Final). Update status as each step completes.

## Workflow

### Step 1: Integration Tests
### Step 1: Blackbox Tests

**Role**: Professional Quality Assurance Engineer
**Goal**: Analyze input data completeness and produce detailed black-box integration test specifications
**Constraints**: Spec only — no test code. Tests describe what the system should do given specific inputs, not how the system is built.

#### Phase 1a: Input Data Completeness Analysis

1. Read `_docs/01_solution/solution.md` (finalized in Prereq 2)
2. Read `acceptance_criteria.md`, `restrictions.md`
3. Read the testing strategy from solution.md
4. Analyze `input_data/` contents against:
   - Coverage of acceptance criteria scenarios
   - Coverage of restriction edge cases
   - Coverage of testing strategy requirements
5. Threshold: at least 70% coverage of the scenarios
6. If coverage is low, search the internet for supplementary data, assess its quality with the user, and if the user agrees, add it to `input_data/`
7. Present the coverage assessment to the user

**BLOCKING**: Do NOT proceed until user confirms the input data coverage is sufficient.
|
||||
|
||||
#### Phase 1b: Black-Box Test Scenario Specification
|
||||
|
||||
Based on all acquired data, acceptance_criteria, and restrictions, form detailed test scenarios:
|
||||
|
||||
1. Define test environment using `templates/integration-environment.md` as structure
|
||||
2. Define test data management using `templates/integration-test-data.md` as structure
|
||||
3. Write functional test scenarios (positive + negative) using `templates/integration-functional-tests.md` as structure
|
||||
4. Write non-functional test scenarios (performance, resilience, security, edge cases) using `templates/integration-non-functional-tests.md` as structure
|
||||
5. Build traceability matrix using `templates/integration-traceability-matrix.md` as structure
|
||||
|
||||
**Self-verification**:
|
||||
- [ ] Every acceptance criterion is covered by at least one test scenario
|
||||
- [ ] Every restriction is verified by at least one test scenario
|
||||
- [ ] Positive and negative scenarios are balanced
|
||||
- [ ] Consumer app has no direct access to system internals
|
||||
- [ ] Docker environment is self-contained (`docker compose up` sufficient)
|
||||
- [ ] External dependencies have mock/stub services defined
|
||||
- [ ] Traceability matrix has no uncovered AC or restrictions
|
||||
|
||||
**Save action**: Write all files under `integration_tests/`:
|
||||
- `environment.md`
|
||||
- `test_data.md`
|
||||
- `functional_tests.md`
|
||||
- `non_functional_tests.md`
|
||||
- `traceability_matrix.md`
|
||||
|
||||
**BLOCKING**: Present test coverage summary (from traceability_matrix.md) to user. Do NOT proceed until confirmed.
|
||||
Read and execute `.cursor/skills/test-spec/SKILL.md`.
|
||||
|
||||
Capture any new questions, findings, or insights that arise during test specification — these feed forward into Steps 2 and 3.
|
||||
|
||||
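The traceability self-check in Phase 1b (every acceptance criterion and restriction covered, against the 70% threshold from Phase 1a) can be sketched as follows. This is a minimal illustration that assumes the matrix has already been parsed into a dict; the IDs and data shapes are hypothetical, not prescribed by the templates.

```python
def coverage_report(acceptance_criteria, restrictions, matrix):
    """matrix maps a test scenario id to the AC/restriction ids it covers."""
    covered = set()
    for scenario, targets in matrix.items():
        covered.update(targets)
    missing_ac = [ac for ac in acceptance_criteria if ac not in covered]
    missing_r = [r for r in restrictions if r not in covered]
    total = len(acceptance_criteria) + len(restrictions)
    # fraction of AC + restrictions hit by at least one scenario
    ratio = (total - len(missing_ac) - len(missing_r)) / total if total else 1.0
    return ratio, missing_ac, missing_r

ratio, missing_ac, missing_r = coverage_report(
    ["AC-1", "AC-2", "AC-3"], ["R-1"],
    {"T-01": ["AC-1", "R-1"], "T-02": ["AC-2"]},
)
# 3 of 4 targets covered -> ratio 0.75, AC-3 uncovered
```

Anything reported in `missing_ac` or `missing_r` fails the "no uncovered AC or restrictions" checklist item regardless of the ratio.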
@@ -218,263 +69,37 @@ Capture any new questions, findings, or insights that arise during test specification

### Step 2: Solution Analysis

**Role**: Professional software architect
**Goal**: Produce `architecture.md`, `system-flows.md`, `data_model.md`, and `deployment/` from the solution draft
**Constraints**: No code, no component-level detail yet; focus on the system-level view

#### Phase 2a: Architecture & Flows

1. Read all input files thoroughly
2. Incorporate findings, questions, and insights discovered during Step 1 (integration tests)
3. Research unknown or questionable topics via internet; ask the user about ambiguities
4. Document the architecture using `templates/architecture.md` as structure
5. Document system flows using `templates/system-flows.md` as structure

**Self-verification**:
- [ ] Architecture covers all capabilities mentioned in solution.md
- [ ] System flows cover all main user/system interactions
- [ ] No contradictions with problem.md or restrictions.md
- [ ] Technology choices are justified
- [ ] Integration test findings are reflected in architecture decisions

**Save action**: Write `architecture.md` and `system-flows.md`

**BLOCKING**: Present architecture summary to user. Do NOT proceed until user confirms.

#### Phase 2b: Data Model

**Role**: Professional software architect
**Goal**: Produce a detailed data model document covering entities, relationships, and migration strategy

1. Extract core entities from architecture.md and solution.md
2. Define entity attributes, types, and constraints
3. Define relationships between entities (Mermaid ERD)
4. Define migration strategy: versioning tool (EF Core migrations / Alembic / sql-migrate), reversibility requirement, naming convention
5. Define seed data requirements per environment (dev, staging)
6. Define backward compatibility approach for schema changes (additive-only by default)

**Self-verification**:
- [ ] Every entity mentioned in architecture.md is defined
- [ ] Relationships are explicit with cardinality
- [ ] Migration strategy specifies reversibility requirement
- [ ] Seed data requirements defined
- [ ] Backward compatibility approach documented

**Save action**: Write `data_model.md`

#### Phase 2c: Deployment Planning

**Role**: DevOps / Platform engineer
**Goal**: Produce deployment plan covering containerization, CI/CD, environment strategy, observability, and deployment procedures

Use the `/deploy` skill's templates as structure for each artifact:

1. Read architecture.md and restrictions.md for infrastructure constraints
2. Research Docker best practices for the project's tech stack
3. Define containerization plan: Dockerfile per component, docker-compose for dev and tests
4. Define CI/CD pipeline: stages, quality gates, caching, parallelization
5. Define environment strategy: dev, staging, production with secrets management
6. Define observability: structured logging, metrics, tracing, alerting
7. Define deployment procedures: strategy, health checks, rollback, checklist

**Self-verification**:
- [ ] Every component has a Docker specification
- [ ] CI/CD pipeline covers lint, test, security, build, deploy
- [ ] Environment strategy covers dev, staging, production
- [ ] Observability covers logging, metrics, tracing, alerting
- [ ] Deployment procedures include rollback and health checks

**Save action**: Write all 5 files under `deployment/`:
- `containerization.md`
- `ci_cd_pipeline.md`
- `environment_strategy.md`
- `observability.md`
- `deployment_procedures.md`

Read and follow `steps/02_solution-analysis.md`.

---

### Step 3: Component Decomposition

**Role**: Professional software architect
**Goal**: Decompose the architecture into components with detailed specs
**Constraints**: No code; only names, interfaces, inputs/outputs. Follow SRP strictly.

1. Identify components from the architecture; think about separation, reusability, and communication patterns
2. Use integration test scenarios from Step 1 to validate component boundaries
3. If additional components are needed (data preparation, shared helpers), create them
4. For each component, write a spec using `templates/component-spec.md` as structure
5. Generate diagrams:
   - draw.io component diagram showing relations (minimize line intersections, group semantically coherent components, place external users near their components)
   - Mermaid flowchart per main control flow
6. Multiple components can share and reuse the same common logic. For such occurrences, document the shared logic in the `common-helpers/` folder.

**Self-verification**:
- [ ] Each component has a single, clear responsibility
- [ ] No functionality is spread across multiple components
- [ ] All inter-component interfaces are defined (who calls whom, with what)
- [ ] Component dependency graph has no circular dependencies
- [ ] All components from architecture.md are accounted for
- [ ] Every integration test scenario can be traced through component interactions

**Save action**: Write:
- each component spec to `components/[##]_[name]/description.md`
- each common helper to `common-helpers/[##]_helper_[name].md`
- diagrams to `diagrams/`

**BLOCKING**: Present component list with one-line summaries to user. Do NOT proceed until user confirms.

Read and follow `steps/03_component-decomposition.md`.

---

### Step 4: Architecture Review & Risk Assessment

**Role**: Professional software architect and analyst
**Goal**: Validate all artifacts for consistency, then identify and mitigate risks
**Constraints**: This is a review step — fix problems found, do not add new features

#### 4a. Evaluator Pass (re-read ALL artifacts)

Review checklist:
- [ ] All components follow Single Responsibility Principle
- [ ] All components follow dumb code / smart data principle
- [ ] Inter-component interfaces are consistent (caller's output matches callee's input)
- [ ] No circular dependencies in the dependency graph
- [ ] No missing interactions between components
- [ ] No over-engineering — is there a simpler decomposition?
- [ ] Security considerations addressed in component design
- [ ] Performance bottlenecks identified
- [ ] API contracts are consistent across components

Fix any issues found before proceeding to risk identification.

#### 4b. Risk Identification

1. Identify technical and project risks
2. Assess probability and impact using `templates/risk-register.md`
3. Define mitigation strategies
4. Apply mitigations to architecture, flows, and component documents where applicable

**Self-verification**:
- [ ] Every High/Critical risk has a concrete mitigation strategy
- [ ] Mitigations are reflected in the relevant component or architecture docs
- [ ] No new risks introduced by the mitigations themselves

**Save action**: Write `risk_mitigations.md`

**BLOCKING**: Present risk summary to user. Ask whether assessment is sufficient.

**Iterative**: If user requests another round, repeat Step 4 and write `risk_mitigations_##.md` (## as sequence number). Continue until user confirms.

Read and follow `steps/04_review-risk.md`.

---

### Step 5: Test Specifications

**Role**: Professional Quality Assurance Engineer

**Goal**: Write test specs for each component achieving minimum 75% acceptance criteria coverage

**Constraints**: Test specs only — no test code. Each test must trace to an acceptance criterion.

1. For each component, write tests using `templates/test-spec.md` as structure
2. Cover all 4 types: integration, performance, security, acceptance
3. Include test data management (setup, teardown, isolation)
4. Verify traceability: every acceptance criterion from `acceptance_criteria.md` must be covered by at least one test

**Self-verification**:
- [ ] Every acceptance criterion has at least one test covering it
- [ ] Test inputs are realistic and well-defined
- [ ] Expected results are specific and measurable
- [ ] No component is left without tests

**Save action**: Write each `components/[##]_[name]/tests.md`

Read and follow `steps/05_test-specifications.md`.

---

### Step 6: Jira Epics

**Role**: Professional product manager

**Goal**: Create Jira epics from components, ordered by dependency

**Constraints**: Be concise — fewer words with the same meaning is better

1. **Create "Bootstrap & Initial Structure" epic first** — this epic will parent the `01_initial_structure` task created by the decompose skill. It covers project scaffolding: folder structure, shared models, interfaces, stubs, CI/CD config, DB migrations setup, test structure.
2. Generate Jira Epics for each component using Jira MCP, structured per `templates/epic-spec.md`
3. Order epics by dependency (Bootstrap epic is always first, then components based on their dependency graph)
4. Include effort estimation per epic (T-shirt size or story points range)
5. Ensure each epic has clear acceptance criteria cross-referenced with component specs
6. Generate updated draw.io diagram showing component-to-epic mapping
7. **Create "Integration Tests" epic** — this epic will parent the integration test tasks created by the `/decompose` skill. It covers implementing the test scenarios defined in `integration_tests/`.

**Self-verification**:
- [ ] "Bootstrap & Initial Structure" epic exists and is first in order
- [ ] "Integration Tests" epic exists
- [ ] Every component maps to exactly one epic
- [ ] Dependency order is respected (no epic depends on a later one)
- [ ] Acceptance criteria are measurable
- [ ] Effort estimates are realistic

**Save action**: Epics created in Jira via MCP

Read and follow `steps/06_jira-epics.md`.

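The ordering rule in item 3 ("no epic depends on a later one") is an ordinary topological sort. A minimal sketch, with hypothetical component names and a hand-written dependency map — the real graph comes from the component specs:

```python
from collections import deque

def epic_order(deps):
    """deps: component -> set of components it depends on.

    Returns epics in creation order, Bootstrap always first.
    """
    order = ["Bootstrap & Initial Structure"]  # always first
    ready = deque(sorted(c for c, d in deps.items() if not d))
    while ready:
        c = ready.popleft()
        order.append(c)
        for other, d in deps.items():
            if c in d:
                d.discard(c)  # mutates the caller's sets; fine for a sketch
                if not d and other not in order:
                    ready.append(other)
    if len(order) != len(deps) + 1:
        raise ValueError("circular dependency between components")
    return order
```

For example, `epic_order({"api": {"core"}, "core": set(), "ui": {"api"}})` yields Bootstrap, then `core`, `api`, `ui` — each epic only depends on earlier ones.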
---

## Quality Checklist (before FINAL_report.md)
### Final: Quality Checklist

Before writing the final report, verify ALL of the following:

### Integration Tests
- [ ] Every acceptance criterion is covered in traceability_matrix.md
- [ ] Every restriction is verified by at least one test
- [ ] Positive and negative scenarios are balanced
- [ ] Docker environment is self-contained
- [ ] Consumer app treats main system as black box
- [ ] CI/CD integration and reporting defined

### Architecture
- [ ] Covers all capabilities from solution.md
- [ ] Technology choices are justified
- [ ] Deployment model is defined
- [ ] Integration test findings are reflected in architecture decisions

### Data Model
- [ ] Every entity from architecture.md is defined
- [ ] Relationships have explicit cardinality
- [ ] Migration strategy with reversibility requirement
- [ ] Seed data requirements defined
- [ ] Backward compatibility approach documented

### Deployment
- [ ] Containerization plan covers all components
- [ ] CI/CD pipeline includes lint, test, security, build, deploy stages
- [ ] Environment strategy covers dev, staging, production
- [ ] Observability covers logging, metrics, tracing, alerting
- [ ] Deployment procedures include rollback and health checks

### Components
- [ ] Every component follows SRP
- [ ] No circular dependencies
- [ ] All inter-component interfaces are defined and consistent
- [ ] No orphan components (unused by any flow)
- [ ] Every integration test scenario can be traced through component interactions

### Risks
- [ ] All High/Critical risks have mitigations
- [ ] Mitigations are reflected in component/architecture docs
- [ ] User has confirmed risk assessment is sufficient

### Tests
- [ ] Every acceptance criterion is covered by at least one test
- [ ] All 4 test types are represented per component (where applicable)
- [ ] Test data management is defined

### Epics
- [ ] "Bootstrap & Initial Structure" epic exists
- [ ] "Integration Tests" epic exists
- [ ] Every component maps to an epic
- [ ] Dependency order is correct
- [ ] Acceptance criteria are measurable

**Save action**: Write `FINAL_report.md` using `templates/final-report.md` as structure

Read and follow `steps/07_quality-checklist.md`.

## Common Mistakes

@@ -486,7 +111,7 @@ Before writing the final report, verify ALL of the following:

- **Copy-pasting problem.md**: the architecture doc should analyze and transform, not repeat the input
- **Vague interfaces**: "component A talks to component B" is not enough; define the method, input, output
- **Ignoring restrictions.md**: every constraint must be traceable in the architecture or risk register
- **Ignoring integration test findings**: insights from Step 1 must feed into architecture (Step 2) and component decomposition (Step 3)
- **Ignoring blackbox test findings**: insights from Step 1 must feed into architecture (Step 2) and component decomposition (Step 3)

## Escalation Rules

@@ -505,31 +130,26 @@ Before writing the final report, verify ALL of the following:

```
┌────────────────────────────────────────────────────────────────┐
│ Solution Planning (6-Step Method) │
│ Solution Planning (6-Step + Final) │
├────────────────────────────────────────────────────────────────┤
│ PREREQ 1: Data Gate (BLOCKING) │
│ → verify AC, restrictions, input_data exist — STOP if not │
│ PREREQ 2: Finalize solution draft │
│ → rename highest solution_draft##.md to solution.md │
│ PREREQ 3: Workspace setup │
│ → create PLANS_DIR/ if needed │
│ PREREQ: Data Gate (BLOCKING) │
│ → verify AC, restrictions, input_data, solution exist │
│ │
│ 1. Integration Tests → integration_tests/ (5 files) │
│ 1. Blackbox Tests → test-spec/SKILL.md │
│ [BLOCKING: user confirms test coverage] │
│ 2a. Architecture → architecture.md, system-flows.md │
│ 2. Solution Analysis → architecture, data model, deployment │
│ [BLOCKING: user confirms architecture] │
│ 2b. Data Model → data_model.md │
│ 2c. Deployment → deployment/ (5 files) │
│ 3. Component Decompose → components/[##]_[name]/description │
│ [BLOCKING: user confirms decomposition] │
│ 4. Review & Risk → risk_mitigations.md │
│ [BLOCKING: user confirms risks, iterative] │
│ 5. Test Specifications → components/[##]_[name]/tests.md │
│ 6. Jira Epics → Jira via MCP │
│ 3. Component Decomp → component specs + interfaces │
│ [BLOCKING: user confirms components] │
│ 4. Review & Risk → risk register, iterations │
│ [BLOCKING: user confirms mitigations] │
│ 5. Test Specifications → per-component test specs │
│ 6. Jira Epics → epic per component + bootstrap │
│ ───────────────────────────────────────────────── │
│ Quality Checklist → FINAL_report.md │
│ Final: Quality Checklist → FINAL_report.md │
├────────────────────────────────────────────────────────────────┤
│ Principles: SRP · Dumb code/smart data · Save immediately │
│ Ask don't assume · Plan don't code │
│ Principles: Single Responsibility · Dumb code, smart data │
│ Save immediately · Ask don't assume │
│ Plan don't code │
└────────────────────────────────────────────────────────────────┘
```

@@ -0,0 +1,27 @@
## Prerequisite Checks (BLOCKING)

Run sequentially before any planning step:

### Prereq 1: Data Gate

1. `_docs/00_problem/acceptance_criteria.md` exists and is non-empty — **STOP if missing**
2. `_docs/00_problem/restrictions.md` exists and is non-empty — **STOP if missing**
3. `_docs/00_problem/input_data/` exists and contains at least one data file — **STOP if missing**
4. `_docs/00_problem/problem.md` exists and is non-empty — **STOP if missing**

All four are mandatory. If any is missing or empty, STOP and ask the user to provide them. If the user cannot provide the required data, planning cannot proceed — just stop.

### Prereq 2: Finalize Solution Draft

Only runs after the Data Gate passes:

1. Scan `_docs/01_solution/` for files matching `solution_draft*.md`
2. Identify the highest-numbered draft (e.g. `solution_draft06.md`)
3. **Rename** it to `_docs/01_solution/solution.md`
4. If `solution.md` already exists, ask the user whether to overwrite or keep existing
5. Verify `solution.md` is non-empty — **STOP if missing or empty**

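The Prereq 2 steps above can be sketched as a small helper. The paths follow the documented layout, but the glob-and-sort logic (and treating an existing `solution.md` as an error to surface to the user) are assumptions of this sketch, not prescribed behavior:

```python
from pathlib import Path

def finalize_solution_draft(solution_dir: Path) -> Path:
    """Rename the highest-numbered solution_draft*.md to solution.md."""
    drafts = sorted(solution_dir.glob("solution_draft*.md"))
    if not drafts:
        raise FileNotFoundError("no solution_draft*.md found — STOP")
    latest = drafts[-1]  # zero-padded numbering sorts lexically
    target = solution_dir / "solution.md"
    if target.exists():
        # the skill asks the user here; a sketch can only raise
        raise FileExistsError("solution.md already exists — ask the user")
    latest.rename(target)
    if target.stat().st_size == 0:
        raise ValueError("solution.md is empty — STOP")
    return target
```

Note that lexical sorting only picks the right draft because the numbering is zero-padded (`solution_draft06.md` < `solution_draft10.md`).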
### Prereq 3: Workspace Setup

1. Create DOCUMENT_DIR if it does not exist
2. If DOCUMENT_DIR already contains artifacts, ask user: **resume from last checkpoint or start fresh?**

@@ -0,0 +1,87 @@
## Artifact Management

### Directory Structure

All artifacts are written directly under DOCUMENT_DIR:

```
DOCUMENT_DIR/
├── tests/
│   ├── environment.md
│   ├── test-data.md
│   ├── blackbox-tests.md
│   ├── performance-tests.md
│   ├── resilience-tests.md
│   ├── security-tests.md
│   ├── resource-limit-tests.md
│   └── traceability-matrix.md
├── architecture.md
├── system-flows.md
├── data_model.md
├── deployment/
│   ├── containerization.md
│   ├── ci_cd_pipeline.md
│   ├── environment_strategy.md
│   ├── observability.md
│   └── deployment_procedures.md
├── risk_mitigations.md
├── risk_mitigations_02.md (iterative, ## as sequence)
├── components/
│   ├── 01_[name]/
│   │   ├── description.md
│   │   └── tests.md
│   ├── 02_[name]/
│   │   ├── description.md
│   │   └── tests.md
│   └── ...
├── common-helpers/
│   ├── 01_helper_[name]/
│   ├── 02_helper_[name]/
│   └── ...
├── diagrams/
│   ├── components.drawio
│   └── flows/
│       ├── flow_[name].md (Mermaid)
│       └── ...
└── FINAL_report.md
```

### Save Timing

| Step | Save immediately after | Filename |
|------|------------------------|----------|
| Step 1 | Blackbox test environment spec | `tests/environment.md` |
| Step 1 | Blackbox test data spec | `tests/test-data.md` |
| Step 1 | Blackbox tests | `tests/blackbox-tests.md` |
| Step 1 | Blackbox performance tests | `tests/performance-tests.md` |
| Step 1 | Blackbox resilience tests | `tests/resilience-tests.md` |
| Step 1 | Blackbox security tests | `tests/security-tests.md` |
| Step 1 | Blackbox resource limit tests | `tests/resource-limit-tests.md` |
| Step 1 | Blackbox traceability matrix | `tests/traceability-matrix.md` |
| Step 2 | Architecture analysis complete | `architecture.md` |
| Step 2 | System flows documented | `system-flows.md` |
| Step 2 | Data model documented | `data_model.md` |
| Step 2 | Deployment plan complete | `deployment/` (5 files) |
| Step 3 | Each component analyzed | `components/[##]_[name]/description.md` |
| Step 3 | Common helpers generated | `common-helpers/[##]_helper_[name].md` |
| Step 3 | Diagrams generated | `diagrams/` |
| Step 4 | Risk assessment complete | `risk_mitigations.md` |
| Step 5 | Tests written per component | `components/[##]_[name]/tests.md` |
| Step 6 | Epics created in Jira | Jira via MCP |
| Final | All steps complete | `FINAL_report.md` |

### Save Principles

1. **Save immediately**: write to disk as soon as a step completes; do not wait until the end
2. **Incremental updates**: same file can be updated multiple times; append or replace
3. **Preserve process**: keep all intermediate files even after integration into final report
4. **Enable recovery**: if interrupted, resume from the last saved artifact (see Resumability)

### Resumability

If DOCUMENT_DIR already contains artifacts:

1. List existing files and match them to the save timing table above
2. Identify the last completed step based on which artifacts exist
3. Resume from the next incomplete step
4. Inform the user which steps are being skipped

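The resumability rule above can be sketched as a lookup against the save timing table. This is a minimal illustration: only a few representative artifacts per step are listed, and the step→artifact map is an assumption derived from the table, not a definitive manifest:

```python
from pathlib import Path

# ordered: step number -> artifacts (files/dirs under DOCUMENT_DIR)
# that mark the step as complete
STEP_ARTIFACTS = {
    1: ["tests/traceability-matrix.md"],
    2: ["architecture.md", "system-flows.md", "data_model.md", "deployment"],
    3: ["components", "diagrams"],
    4: ["risk_mitigations.md"],
}

def next_incomplete_step(document_dir: Path) -> int:
    """Return the first step whose artifacts are not all on disk."""
    for step, artifacts in STEP_ARTIFACTS.items():
        if not all((document_dir / a).exists() for a in artifacts):
            return step  # resume here; earlier steps are skipped
    return max(STEP_ARTIFACTS) + 1
```

A fresh directory resumes at step 1; a directory containing only the step 1 test artifacts resumes at step 2, and so on.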
@@ -0,0 +1,74 @@
## Step 2: Solution Analysis

**Role**: Professional software architect
**Goal**: Produce `architecture.md`, `system-flows.md`, `data_model.md`, and `deployment/` from the solution draft
**Constraints**: No code, no component-level detail yet; focus on the system-level view

### Phase 2a: Architecture & Flows

1. Read all input files thoroughly
2. Incorporate findings, questions, and insights discovered during Step 1 (blackbox tests)
3. Research unknown or questionable topics via internet; ask the user about ambiguities
4. Document the architecture using `templates/architecture.md` as structure
5. Document system flows using `templates/system-flows.md` as structure

**Self-verification**:
- [ ] Architecture covers all capabilities mentioned in solution.md
- [ ] System flows cover all main user/system interactions
- [ ] No contradictions with problem.md or restrictions.md
- [ ] Technology choices are justified
- [ ] Blackbox test findings are reflected in architecture decisions

**Save action**: Write `architecture.md` and `system-flows.md`

**BLOCKING**: Present architecture summary to user. Do NOT proceed until user confirms.

### Phase 2b: Data Model

**Role**: Professional software architect
**Goal**: Produce a detailed data model document covering entities, relationships, and migration strategy

1. Extract core entities from architecture.md and solution.md
2. Define entity attributes, types, and constraints
3. Define relationships between entities (Mermaid ERD)
4. Define migration strategy: versioning tool (EF Core migrations / Alembic / sql-migrate), reversibility requirement, naming convention
5. Define seed data requirements per environment (dev, staging)
6. Define backward compatibility approach for schema changes (additive-only by default)

**Self-verification**:
- [ ] Every entity mentioned in architecture.md is defined
- [ ] Relationships are explicit with cardinality
- [ ] Migration strategy specifies reversibility requirement
- [ ] Seed data requirements defined
- [ ] Backward compatibility approach documented

**Save action**: Write `data_model.md`

### Phase 2c: Deployment Planning

**Role**: DevOps / Platform engineer
**Goal**: Produce deployment plan covering containerization, CI/CD, environment strategy, observability, and deployment procedures

Use the `/deploy` skill's templates as structure for each artifact:

1. Read architecture.md and restrictions.md for infrastructure constraints
2. Research Docker best practices for the project's tech stack
3. Define containerization plan: Dockerfile per component, docker-compose for dev and tests
4. Define CI/CD pipeline: stages, quality gates, caching, parallelization
5. Define environment strategy: dev, staging, production with secrets management
6. Define observability: structured logging, metrics, tracing, alerting
7. Define deployment procedures: strategy, health checks, rollback, checklist

**Self-verification**:
- [ ] Every component has a Docker specification
- [ ] CI/CD pipeline covers lint, test, security, build, deploy
- [ ] Environment strategy covers dev, staging, production
- [ ] Observability covers logging, metrics, tracing, alerting
- [ ] Deployment procedures include rollback and health checks

**Save action**: Write all 5 files under `deployment/`:
- `containerization.md`
- `ci_cd_pipeline.md`
- `environment_strategy.md`
- `observability.md`
- `deployment_procedures.md`

@@ -0,0 +1,29 @@
## Step 3: Component Decomposition

**Role**: Professional software architect
**Goal**: Decompose the architecture into components with detailed specs
**Constraints**: No code; only names, interfaces, inputs/outputs. Follow SRP strictly.

1. Identify components from the architecture; think about separation, reusability, and communication patterns
2. Use blackbox test scenarios from Step 1 to validate component boundaries
3. If additional components are needed (data preparation, shared helpers), create them
4. For each component, write a spec using `templates/component-spec.md` as structure
5. Generate diagrams:
   - draw.io component diagram showing relations (minimize line intersections, group semantically coherent components, place external users near their components)
   - Mermaid flowchart per main control flow
6. Multiple components can share and reuse the same common logic. For such occurrences, document the shared logic in the `common-helpers/` folder.

**Self-verification**:
- [ ] Each component has a single, clear responsibility
- [ ] No functionality is spread across multiple components
- [ ] All inter-component interfaces are defined (who calls whom, with what)
- [ ] Component dependency graph has no circular dependencies
- [ ] All components from architecture.md are accounted for
- [ ] Every blackbox test scenario can be traced through component interactions

**Save action**: Write:
- each component spec to `components/[##]_[name]/description.md`
- each common helper to `common-helpers/[##]_helper_[name].md`
- diagrams to `diagrams/`

**BLOCKING**: Present component list with one-line summaries to user. Do NOT proceed until user confirms.

@@ -0,0 +1,38 @@
## Step 4: Architecture Review & Risk Assessment

**Role**: Professional software architect and analyst
**Goal**: Validate all artifacts for consistency, then identify and mitigate risks
**Constraints**: This is a review step — fix problems found, do not add new features

### 4a. Evaluator Pass (re-read ALL artifacts)

Review checklist:
- [ ] All components follow Single Responsibility Principle
- [ ] All components follow dumb code / smart data principle
- [ ] Inter-component interfaces are consistent (caller's output matches callee's input)
- [ ] No circular dependencies in the dependency graph
- [ ] No missing interactions between components
- [ ] No over-engineering — is there a simpler decomposition?
- [ ] Security considerations addressed in component design
- [ ] Performance bottlenecks identified
- [ ] API contracts are consistent across components

Fix any issues found before proceeding to risk identification.

### 4b. Risk Identification

1. Identify technical and project risks
2. Assess probability and impact using `templates/risk-register.md`
3. Define mitigation strategies
4. Apply mitigations to architecture, flows, and component documents where applicable

**Self-verification**:
- [ ] Every High/Critical risk has a concrete mitigation strategy
- [ ] Mitigations are reflected in the relevant component or architecture docs
- [ ] No new risks introduced by the mitigations themselves

**Save action**: Write `risk_mitigations.md`

**BLOCKING**: Present risk summary to user. Ask whether assessment is sufficient.

**Iterative**: If user requests another round, repeat Step 4 and write `risk_mitigations_##.md` (## as sequence number). Continue until user confirms.

@@ -0,0 +1,20 @@
## Step 5: Test Specifications

**Role**: Professional Quality Assurance Engineer

**Goal**: Write test specs for each component achieving minimum 75% acceptance criteria coverage

**Constraints**: Test specs only — no test code. Each test must trace to an acceptance criterion.

1. For each component, write tests using `templates/test-spec.md` as structure
2. Cover all 4 types: integration, performance, security, acceptance
3. Include test data management (setup, teardown, isolation)
4. Verify traceability: every acceptance criterion from `acceptance_criteria.md` must be covered by at least one test

**Self-verification**:
- [ ] Every acceptance criterion has at least one test covering it
- [ ] Test inputs are realistic and well-defined
- [ ] Expected results are specific and measurable
- [ ] No component is left without tests

**Save action**: Write each `components/[##]_[name]/tests.md`
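The traceability check in item 4 can be verified mechanically once criteria and test specs carry IDs. A sketch, assuming ACs are labelled `AC-…` and each test spec lists the IDs it traces to (all IDs below are made up):

```python
def uncovered_criteria(acceptance_ids, test_specs):
    """Return acceptance criteria IDs not traced by any test spec.

    test_specs maps test ID -> list of AC IDs that test traces to.
    """
    covered = {ac for traces in test_specs.values() for ac in traces}
    return sorted(set(acceptance_ids) - covered)

acs = ["AC-01", "AC-02", "AC-03"]
tests = {"IT-01": ["AC-01"], "PERF-01": ["AC-02", "AC-01"]}
assert uncovered_criteria(acs, tests) == ["AC-03"]
```

Any non-empty result fails the "No component is left without tests" style of self-verification and points directly at the gap.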
@@ -0,0 +1,48 @@
## Step 6: Work Item Epics

**Role**: Professional product manager

**Goal**: Create epics from components, ordered by dependency

**Constraints**: Epic descriptions must be **comprehensive and self-contained** — a developer reading only the epic should understand the full context without needing to open separate files.

1. **Create "Bootstrap & Initial Structure" epic first** — this epic will parent the `01_initial_structure` task created by the decompose skill. It covers project scaffolding: folder structure, shared models, interfaces, stubs, CI/CD config, DB migrations setup, test structure.
2. Generate epics for each component using the configured work item tracker (Jira MCP or Azure DevOps MCP — see `autopilot/protocols.md`), structured per `templates/epic-spec.md`
3. Order epics by dependency (Bootstrap epic is always first, then components based on their dependency graph)
4. Include effort estimation per epic (T-shirt size or story points range)
5. Ensure each epic has clear acceptance criteria cross-referenced with component specs
6. Generate Mermaid diagrams showing component-to-epic mapping and component relationships
7. **Create "Blackbox Tests" epic** — this epic will parent the blackbox test tasks created by the `/decompose` skill. It covers implementing the test scenarios defined in `tests/`.

**CRITICAL — Epic description richness requirements**:

Each epic description MUST include ALL of the following sections with substantial content:
- **System context**: where this component fits in the overall architecture (include Mermaid diagram showing this component's position and connections)
- **Problem / Context**: what problem this component solves, why it exists, current pain points
- **Scope**: detailed in-scope and out-of-scope lists
- **Architecture notes**: relevant ADRs, technology choices, patterns used, key design decisions
- **Interface specification**: full method signatures, input/output types, error types (from component description.md)
- **Data flow**: how data enters and exits this component (include Mermaid sequence or flowchart diagram)
- **Dependencies**: epic dependencies (with Jira IDs) and external dependencies (libraries, hardware, services)
- **Acceptance criteria**: measurable criteria with specific thresholds (from component tests.md)
- **Non-functional requirements**: latency, memory, throughput targets with failure thresholds
- **Risks & mitigations**: relevant risks from risk_mitigations.md with concrete mitigation strategies
- **Effort estimation**: T-shirt size and story points range
- **Child issues**: planned task breakdown with complexity points
- **Key constraints**: from restrictions.md that affect this component
- **Testing strategy**: summary of test types and coverage from tests.md

Do NOT create minimal epics with just a summary and short description. The epic is the primary reference document for the implementation team.

**Self-verification**:
- [ ] "Bootstrap & Initial Structure" epic exists and is first in order
- [ ] "Blackbox Tests" epic exists
- [ ] Every component maps to exactly one epic
- [ ] Dependency order is respected (no epic depends on a later one)
- [ ] Acceptance criteria are measurable
- [ ] Effort estimates are realistic
- [ ] Every epic description includes architecture diagram, interface spec, data flow, risks, and NFRs
- [ ] Epic descriptions are self-contained — readable without opening other files

**Save action**: Epics created via the configured tracker MCP. Also saved locally in `epics.md` with ticket IDs. If `tracker: local`, save locally only.
@@ -0,0 +1,57 @@
## Quality Checklist (before FINAL_report.md)

Before writing the final report, verify ALL of the following:

### Blackbox Tests
- [ ] Every acceptance criterion is covered in traceability-matrix.md
- [ ] Every restriction is verified by at least one test
- [ ] Positive and negative scenarios are balanced
- [ ] Docker environment is self-contained
- [ ] Consumer app treats main system as black box
- [ ] CI/CD integration and reporting defined

### Architecture
- [ ] Covers all capabilities from solution.md
- [ ] Technology choices are justified
- [ ] Deployment model is defined
- [ ] Blackbox test findings are reflected in architecture decisions

### Data Model
- [ ] Every entity from architecture.md is defined
- [ ] Relationships have explicit cardinality
- [ ] Migration strategy with reversibility requirement
- [ ] Seed data requirements defined
- [ ] Backward compatibility approach documented

### Deployment
- [ ] Containerization plan covers all components
- [ ] CI/CD pipeline includes lint, test, security, build, deploy stages
- [ ] Environment strategy covers dev, staging, production
- [ ] Observability covers logging, metrics, tracing, alerting
- [ ] Deployment procedures include rollback and health checks

### Components
- [ ] Every component follows SRP
- [ ] No circular dependencies
- [ ] All inter-component interfaces are defined and consistent
- [ ] No orphan components (unused by any flow)
- [ ] Every blackbox test scenario can be traced through component interactions

### Risks
- [ ] All High/Critical risks have mitigations
- [ ] Mitigations are reflected in component/architecture docs
- [ ] User has confirmed risk assessment is sufficient

### Tests
- [ ] Every acceptance criterion is covered by at least one test
- [ ] All 4 test types are represented per component (where applicable)
- [ ] Test data management is defined

### Epics
- [ ] "Bootstrap & Initial Structure" epic exists
- [ ] "Blackbox Tests" epic exists
- [ ] Every component maps to an epic
- [ ] Dependency order is correct
- [ ] Acceptance criteria are measurable

**Save action**: Write `FINAL_report.md` using `templates/final-report.md` as structure
@@ -1,6 +1,6 @@
# Architecture Document Template

Use this template for the architecture document. Save as `_docs/02_plans/architecture.md`.
Use this template for the architecture document. Save as `_docs/02_document/architecture.md`.

---

@@ -1,24 +1,24 @@
# E2E Functional Tests Template
# Blackbox Tests Template

Save as `PLANS_DIR/integration_tests/functional_tests.md`.
Save as `DOCUMENT_DIR/tests/blackbox-tests.md`.

---

```markdown
# E2E Functional Tests
# Blackbox Tests

## Positive Scenarios

### FT-P-01: [Scenario Name]

**Summary**: [One sentence: what end-to-end use case this validates]
**Summary**: [One sentence: what black-box use case this validates]
**Traces to**: AC-[ID], AC-[ID]
**Category**: [which AC category — e.g., Position Accuracy, Image Processing, etc.]

**Preconditions**:
- [System state required before test]

**Input data**: [reference to specific data set or file from test_data.md]
**Input data**: [reference to specific data set or file from test-data.md]

**Steps**:

@@ -71,8 +71,8 @@ Save as `PLANS_DIR/integration_tests/functional_tests.md`.

## Guidance Notes

- Functional tests should typically trace to at least one acceptance criterion or restriction. Tests without a trace are allowed but should have a clear justification.
- Blackbox tests should typically trace to at least one acceptance criterion or restriction. Tests without a trace are allowed but should have a clear justification.
- Positive scenarios validate the system does what it should.
- Negative scenarios validate the system rejects or handles gracefully what it shouldn't accept.
- Expected outcomes must be specific and measurable — not "works correctly" but "returns position within 50m of ground truth."
- Input data references should point to specific entries in test_data.md.
- Input data references should point to specific entries in test-data.md.
@@ -1,6 +1,6 @@
# Jira Epic Template
# Epic Template

Use this template for each Jira epic. Create epics via Jira MCP.
Use this template for each epic. Create epics via the configured work item tracker (Jira MCP or Azure DevOps MCP).

---

@@ -73,14 +73,14 @@ Link to architecture.md and relevant component spec.]

### Design & Architecture

- Architecture doc: `_docs/02_plans/architecture.md`
- Component spec: `_docs/02_plans/components/[##]_[name]/description.md`
- System flows: `_docs/02_plans/system-flows.md`
- Architecture doc: `_docs/02_document/architecture.md`
- Component spec: `_docs/02_document/components/[##]_[name]/description.md`
- System flows: `_docs/02_document/system-flows.md`

### Definition of Done

- [ ] All in-scope capabilities implemented
- [ ] Automated tests pass (unit + integration + e2e)
- [ ] Automated tests pass (unit + blackbox)
- [ ] Minimum coverage threshold met (75%)
- [ ] Runbooks written (if applicable)
- [ ] Documentation updated

@@ -1,6 +1,6 @@
# Final Planning Report Template

Use this template after completing all 5 steps and the quality checklist. Save as `_docs/02_plans/FINAL_report.md`.
Use this template after completing all 6 steps and the quality checklist. Save as `_docs/02_document/FINAL_report.md`.

---

@@ -1,97 +0,0 @@
# E2E Non-Functional Tests Template

Save as `PLANS_DIR/integration_tests/non_functional_tests.md`.

---

```markdown
# E2E Non-Functional Tests

## Performance Tests

### NFT-PERF-01: [Test Name]

**Summary**: [What performance characteristic this validates]
**Traces to**: AC-[ID]
**Metric**: [what is measured — latency, throughput, frame rate, etc.]

**Preconditions**:
- [System state, load profile, data volume]

**Steps**:

| Step | Consumer Action | Measurement |
|------|----------------|-------------|
| 1 | [action] | [what to measure and how] |

**Pass criteria**: [specific threshold — e.g., p95 latency < 400ms]
**Duration**: [how long the test runs]

---

## Resilience Tests

### NFT-RES-01: [Test Name]

**Summary**: [What failure/recovery scenario this validates]
**Traces to**: AC-[ID]

**Preconditions**:
- [System state before fault injection]

**Fault injection**:
- [What fault is introduced — process kill, network partition, invalid input sequence, etc.]

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | [inject fault] | [system behavior during fault] |
| 2 | [observe recovery] | [system behavior after recovery] |

**Pass criteria**: [recovery time, data integrity, continued operation]

---

## Security Tests

### NFT-SEC-01: [Test Name]

**Summary**: [What security property this validates]
**Traces to**: AC-[ID], RESTRICT-[ID]

**Steps**:

| Step | Consumer Action | Expected Response |
|------|----------------|------------------|
| 1 | [attempt unauthorized access / injection / etc.] | [rejection / no data leak / etc.] |

**Pass criteria**: [specific security outcome]

---

## Resource Limit Tests

### NFT-RES-LIM-01: [Test Name]

**Summary**: [What resource constraint this validates]
**Traces to**: AC-[ID], RESTRICT-[ID]

**Preconditions**:
- [System running under specified constraints]

**Monitoring**:
- [What resources to monitor — memory, CPU, GPU, disk, temperature]

**Duration**: [how long to run]
**Pass criteria**: [resource stays within limit — e.g., memory < 8GB throughout]
```

---

## Guidance Notes

- Performance tests should run long enough to capture steady-state behavior, not just cold-start.
- Resilience tests must define both the fault and the expected recovery — not just "system should recover."
- Security tests at E2E level focus on black-box attacks (unauthorized API calls, malformed input), not code-level vulnerabilities.
- Resource limit tests must specify monitoring duration — short bursts don't prove sustained compliance.
@@ -0,0 +1,35 @@
# Performance Tests Template

Save as `DOCUMENT_DIR/tests/performance-tests.md`.

---

```markdown
# Performance Tests

### NFT-PERF-01: [Test Name]

**Summary**: [What performance characteristic this validates]
**Traces to**: AC-[ID]
**Metric**: [what is measured — latency, throughput, frame rate, etc.]

**Preconditions**:
- [System state, load profile, data volume]

**Steps**:

| Step | Consumer Action | Measurement |
|------|----------------|-------------|
| 1 | [action] | [what to measure and how] |

**Pass criteria**: [specific threshold — e.g., p95 latency < 400ms]
**Duration**: [how long the test runs]
```

---

## Guidance Notes

- Performance tests should run long enough to capture steady-state behavior, not just cold-start.
- Define clear pass/fail thresholds with specific metrics (p50, p95, p99 latency, throughput, etc.).
- Include warm-up preconditions to separate initialization cost from steady-state performance.
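A criterion like "p95 latency < 400ms" is only reproducible if everyone computes the percentile the same way. One common convention (nearest-rank) as a sketch; the sample values are illustrative:

```python
import math

def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample such that pct% of samples are <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

latencies_ms = [120, 150, 180, 200, 210, 250, 300, 320, 380, 450]
p95 = percentile_nearest_rank(latencies_ms, 95)
assert p95 == 450  # with 10 samples, rank = ceil(9.5) = 10
assert percentile_nearest_rank(latencies_ms, 50) == 210
```

Whichever convention the team picks (nearest-rank, linear interpolation, etc.), it should be stated alongside the threshold in the pass criteria so tooling and reviewers agree on the number.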
@@ -0,0 +1,37 @@
# Resilience Tests Template

Save as `DOCUMENT_DIR/tests/resilience-tests.md`.

---

```markdown
# Resilience Tests

### NFT-RES-01: [Test Name]

**Summary**: [What failure/recovery scenario this validates]
**Traces to**: AC-[ID]

**Preconditions**:
- [System state before fault injection]

**Fault injection**:
- [What fault is introduced — process kill, network partition, invalid input sequence, etc.]

**Steps**:

| Step | Action | Expected Behavior |
|------|--------|------------------|
| 1 | [inject fault] | [system behavior during fault] |
| 2 | [observe recovery] | [system behavior after recovery] |

**Pass criteria**: [recovery time, data integrity, continued operation]
```

---

## Guidance Notes

- Resilience tests must define both the fault and the expected recovery — not just "system should recover."
- Include specific recovery time expectations and data integrity checks.
- Test both graceful degradation (partial failure) and full recovery scenarios.
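Recovery-time pass criteria are easiest to enforce with a polling helper. A sketch with an injected health probe and clock so it stays deterministic; the probe is a stand-in for whatever real health check the system exposes:

```python
def wait_until_healthy(probe, timeout_s, now, sleep, interval_s=1.0):
    """Poll `probe` until it returns True or `timeout_s` elapses.

    Returns the elapsed recovery time in seconds, or None on timeout.
    `now` and `sleep` are injected so tests can use a fake clock.
    """
    start = now()
    while True:
        if probe():
            return now() - start
        if now() - start >= timeout_s:
            return None
        sleep(interval_s)

# Fake clock: each sleep advances time; system becomes "healthy" after 3s
clock = {"t": 0.0}
recovery = wait_until_healthy(
    probe=lambda: clock["t"] >= 3.0,
    timeout_s=10.0,
    now=lambda: clock["t"],
    sleep=lambda s: clock.__setitem__("t", clock["t"] + s),
)
assert recovery == 3.0
```

The measured recovery time can then be asserted against the "recovery time" figure in the pass criteria, giving a number instead of a judgment call.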
@@ -0,0 +1,31 @@
# Resource Limit Tests Template

Save as `DOCUMENT_DIR/tests/resource-limit-tests.md`.

---

```markdown
# Resource Limit Tests

### NFT-RES-LIM-01: [Test Name]

**Summary**: [What resource constraint this validates]
**Traces to**: AC-[ID], RESTRICT-[ID]

**Preconditions**:
- [System running under specified constraints]

**Monitoring**:
- [What resources to monitor — memory, CPU, GPU, disk, temperature]

**Duration**: [how long to run]
**Pass criteria**: [resource stays within limit — e.g., memory < 8GB throughout]
```

---

## Guidance Notes

- Resource limit tests must specify monitoring duration — short bursts don't prove sustained compliance.
- Define specific numeric limits that can be programmatically checked.
- Include both the monitoring method and the threshold in the pass criteria.
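Sustained-compliance checks reduce to sampling a resource over the full duration and asserting every sample stays under the limit. A sketch with the samples passed in; a real harness might collect them via something like `psutil.Process().memory_info().rss`, and the values below are made up:

```python
def check_resource_limit(samples, limit):
    """Return (passed, peak) for a series of resource samples against a hard limit.

    Every sample must stay within the limit: a single spike fails the test.
    """
    peak = max(samples)
    return peak <= limit, peak

# Hypothetical memory samples in GB, one per minute over the test duration
memory_gb = [4.1, 4.3, 5.0, 7.9, 6.2]
passed, peak = check_resource_limit(memory_gb, limit=8.0)
assert passed and peak == 7.9

passed, peak = check_resource_limit(memory_gb + [8.4], limit=8.0)  # one spike
assert not passed
```

Reporting the peak alongside pass/fail makes the "programmatically checked" limit in the guidance notes auditable after the run.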
@@ -1,6 +1,6 @@
# Risk Register Template

Use this template for risk assessment. Save as `_docs/02_plans/risk_mitigations.md`.
Use this template for risk assessment. Save as `_docs/02_document/risk_mitigations.md`.
Subsequent iterations: `risk_mitigations_02.md`, `risk_mitigations_03.md`, etc.

---

@@ -0,0 +1,30 @@
# Security Tests Template

Save as `DOCUMENT_DIR/tests/security-tests.md`.

---

```markdown
# Security Tests

### NFT-SEC-01: [Test Name]

**Summary**: [What security property this validates]
**Traces to**: AC-[ID], RESTRICT-[ID]

**Steps**:

| Step | Consumer Action | Expected Response |
|------|----------------|------------------|
| 1 | [attempt unauthorized access / injection / etc.] | [rejection / no data leak / etc.] |

**Pass criteria**: [specific security outcome]
```

---

## Guidance Notes

- Security tests at the blackbox level focus on black-box attacks (unauthorized API calls, malformed input), not code-level vulnerabilities.
- Verify the system remains operational after security-related edge cases (no crash, no hang).
- Test authentication/authorization boundaries from the consumer's perspective.
@@ -1,7 +1,7 @@
# System Flows Template

Use this template for the system flows document. Save as `_docs/02_plans/system-flows.md`.
Individual flow diagrams go in `_docs/02_plans/diagrams/flows/flow_[name].md`.
Use this template for the system flows document. Save as `_docs/02_document/system-flows.md`.
Individual flow diagrams go in `_docs/02_document/diagrams/flows/flow_[name].md`.

---

@@ -1,11 +1,11 @@
# E2E Test Data Template
# Test Data Template

Save as `PLANS_DIR/integration_tests/test_data.md`.
Save as `DOCUMENT_DIR/tests/test-data.md`.

---

```markdown
# E2E Test Data Management
# Test Data Management

## Seed Data Sets

@@ -23,6 +23,12 @@ Save as `PLANS_DIR/integration_tests/test_data.md`.
|-----------------|----------------|-------------|-----------------|
| [filename] | `_docs/00_problem/input_data/[filename]` | [what it contains] | [test IDs that use this data] |

## Expected Results Mapping

| Test Scenario ID | Input Data | Expected Result | Comparison Method | Tolerance | Expected Result Source |
|-----------------|------------|-----------------|-------------------|-----------|----------------------|
| [test ID] | `input_data/[filename]` | [quantifiable expected output] | [exact / tolerance / pattern / threshold / file-diff] | [± value or N/A] | `input_data/expected_results/[filename]` or inline |

## External Dependency Mocks

| External Service | Mock/Stub | How Provided | Behavior |
@@ -42,5 +48,8 @@ Save as `PLANS_DIR/integration_tests/test_data.md`.

- Every seed data set should be traceable to specific test scenarios.
- Input data from `_docs/00_problem/input_data/` should be mapped to test scenarios that use it.
- Every input data item MUST have a corresponding expected result in the Expected Results Mapping table.
- Expected results MUST be quantifiable: exact values, numeric tolerances, pattern matches, thresholds, or reference files. "Works correctly" is never acceptable.
- For complex expected outputs, provide machine-readable reference files (JSON, CSV) in `_docs/00_problem/input_data/expected_results/` and reference them in the mapping.
- External mocks must be deterministic — same input always produces same output.
- Data isolation must guarantee no test can affect another test's outcome.
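The comparison methods in the Expected Results Mapping (exact / tolerance / pattern / threshold) can share one small dispatcher. A sketch: the method names mirror the table column, and all example values are illustrative:

```python
import re

def compare(actual, expected, method, tol=None):
    """Compare an actual result against an expected one per the mapping table."""
    if method == "exact":
        return actual == expected
    if method == "tolerance":  # numeric, within +/- tol
        return abs(actual - expected) <= tol
    if method == "pattern":    # regex full match against a string result
        return re.fullmatch(expected, actual) is not None
    if method == "threshold":  # actual must not exceed expected
        return actual <= expected
    raise ValueError(f"unknown comparison method: {method}")

assert compare(49.7, 50.0, "tolerance", tol=0.5)
assert not compare(50.6, 50.0, "tolerance", tol=0.5)
assert compare("run-0042-ok", r"run-\d{4}-ok", "pattern")
assert compare(380, 400, "threshold")
```

The `file-diff` method is omitted here since it depends on the reference-file format; for JSON or CSV references it would typically parse both files and recurse into the same four comparisons per field.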
@@ -1,16 +1,16 @@
# E2E Test Environment Template
# Test Environment Template

Save as `PLANS_DIR/integration_tests/environment.md`.
Save as `DOCUMENT_DIR/tests/environment.md`.

---

```markdown
# E2E Test Environment
# Test Environment

## Overview

**System under test**: [main system name and entry points — API URLs, message queues, serial ports, etc.]
**Consumer app purpose**: Standalone application that exercises the main system through its public interfaces, validating end-to-end use cases without access to internals.
**Consumer app purpose**: Standalone application that exercises the main system through its public interfaces, validating black-box use cases without access to internals.

## Docker Environment

@@ -17,7 +17,7 @@ Use this template for each component's test spec. Save as `components/[##]_[name

---

## Integration Tests
## Blackbox Tests

### IT-01: [Test Name]

@@ -169,4 +169,4 @@ Use this template for each component's test spec. Save as `components/[##]_[name
- If an acceptance criterion has no test covering it, mark it as NOT COVERED and explain why (e.g., "requires manual verification", "deferred to phase 2").
- Performance test targets should come from the NFR section in `architecture.md`.
- Security tests should cover at minimum: authentication bypass, authorization escalation, injection attacks relevant to this component.
- Not every component needs all 4 test types. A stateless utility component may only need integration tests.
- Not every component needs all 4 test types. A stateless utility component may only need blackbox tests.

@@ -1,11 +1,11 @@
# E2E Traceability Matrix Template
# Traceability Matrix Template

Save as `PLANS_DIR/integration_tests/traceability_matrix.md`.
Save as `DOCUMENT_DIR/tests/traceability-matrix.md`.

---

```markdown
# E2E Traceability Matrix
# Traceability Matrix

## Acceptance Criteria Coverage

@@ -34,7 +34,7 @@ Save as `PLANS_DIR/integration_tests/traceability_matrix.md`.

| Item | Reason Not Covered | Risk | Mitigation |
|------|-------------------|------|-----------|
| [AC/Restriction ID] | [why it cannot be tested at E2E level] | [what could go wrong] | [how risk is addressed — e.g., covered by component tests in Step 5] |
| [AC/Restriction ID] | [why it cannot be tested at blackbox level] | [what could go wrong] | [how risk is addressed — e.g., covered by component tests in Step 5] |
```

---

@@ -44,4 +44,4 @@ Save as `PLANS_DIR/integration_tests/traceability_matrix.md`.
- Every acceptance criterion must appear in the matrix — either covered or explicitly marked as not covered with a reason.
- Every restriction must appear in the matrix.
- NOT COVERED items must have a reason and a mitigation strategy (e.g., "covered at component test level" or "requires real hardware").
- Coverage percentage should be at least 75% for acceptance criteria at the E2E level.
- Coverage percentage should be at least 75% for acceptance criteria at the blackbox test level.
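The 75% floor in the last guidance note is trivial to compute from the matrix rows. A sketch, assuming each row records whether the item is covered (IDs are made up):

```python
def coverage_percent(rows):
    """rows: list of (item_id, covered: bool) taken from the traceability matrix."""
    if not rows:
        return 0.0
    covered = sum(1 for _, ok in rows if ok)
    return 100.0 * covered / len(rows)

matrix = [("AC-01", True), ("AC-02", True), ("AC-03", False), ("AC-04", True)]
pct = coverage_percent(matrix)
assert pct == 75.0
assert pct >= 75.0  # meets the blackbox-level floor
```

Computing the figure from the matrix itself, rather than tracking it separately, keeps the reported percentage and the NOT COVERED rows from drifting apart.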
@@ -46,7 +46,7 @@ The interview is complete when the AI can write ALL of these:
| `problem.md` | Clear problem statement: what is being built, why, for whom, what it does |
| `restrictions.md` | All constraints identified: hardware, software, environment, operational, regulatory, budget, timeline |
| `acceptance_criteria.md` | Measurable success criteria with specific numeric targets grouped by category |
| `input_data/` | At least one reference data file or detailed data description document |
| `input_data/` | At least one reference data file or detailed data description document. Must include `expected_results.md` with input→output pairs for downstream test specification |
| `security_approach.md` | (optional) Security requirements identified, or explicitly marked as not applicable |

## Interview Protocol
@@ -187,6 +187,7 @@ At least one file. Options:
- User provides actual data files (CSV, JSON, images, etc.) — save as-is
- User describes data parameters — save as `data_parameters.md`
- User provides URLs to data — save as `data_sources.md` with links and descriptions
- `expected_results.md` — expected outputs for given inputs (required by downstream test-spec skill). During the Acceptance Criteria dimension, probe for concrete input→output pairs and save them here. Format: use the template from `.cursor/skills/test-spec/templates/expected-results.md`.

### security_approach.md (optional)

@@ -34,8 +34,8 @@ Determine the operating mode based on invocation before any other logic runs.
**Project mode** (no explicit input file provided):
- PROBLEM_DIR: `_docs/00_problem/`
- SOLUTION_DIR: `_docs/01_solution/`
- COMPONENTS_DIR: `_docs/02_components/`
- TESTS_DIR: `_docs/02_tests/`
- COMPONENTS_DIR: `_docs/02_document/components/`
- DOCUMENT_DIR: `_docs/02_document/`
- REFACTOR_DIR: `_docs/04_refactoring/`
- All existing guardrails apply.

@@ -155,7 +155,7 @@ Store in PROBLEM_DIR.

| Metric Category | What to Capture |
|----------------|-----------------|
| **Coverage** | Overall, unit, integration, critical paths |
| **Coverage** | Overall, unit, blackbox, critical paths |
| **Complexity** | Cyclomatic complexity (avg + top 5 functions), LOC, tech debt ratio |
| **Code Smells** | Total, critical, major |
| **Performance** | Response times (P50/P95/P99), CPU/memory, throughput |
@@ -210,7 +210,7 @@ Write:

Also copy to project standard locations if in project mode:
- `SOLUTION_DIR/solution.md`
- `COMPONENTS_DIR/system_flows.md`
- `DOCUMENT_DIR/system_flows.md`

**Self-verification**:
- [ ] Every component in the codebase is documented
@@ -276,14 +276,14 @@ Write `REFACTOR_DIR/analysis/refactoring_roadmap.md`:

#### 3a. Design Test Specs

Coverage requirements (must meet before refactoring):
Coverage requirements (must meet before refactoring — see `.cursor/rules/cursor-meta.mdc` Quality Thresholds):
- Minimum overall coverage: 75%
- Critical path coverage: 90%
- All public APIs must have integration tests
- All public APIs must have blackbox tests
- All error handling paths must be tested

For each critical area, write test specs to `REFACTOR_DIR/test_specs/[##]_[test_name].md`:
- Integration tests: summary, current behavior, input data, expected result, max expected time
- Blackbox tests: summary, current behavior, input data, expected result, max expected time
- Acceptance tests: summary, preconditions, steps with expected results
- Coverage analysis: current %, target %, uncovered critical paths

@@ -297,7 +297,7 @@ For each critical area, write test specs to `REFACTOR_DIR/test_specs/[##]_[test_
**Self-verification**:
- [ ] Coverage requirements met (75% overall, 90% critical paths)
- [ ] All tests pass on current codebase
- [ ] All public APIs have integration tests
- [ ] All public APIs have blackbox tests
- [ ] Test data fixtures are configured

**Save action**: Write test specs; implemented tests go into the project's test folder
@@ -332,7 +332,7 @@ Write `REFACTOR_DIR/coupling_analysis.md`:
For each change in the decoupling strategy:

1. Implement the change
2. Run integration tests
2. Run blackbox tests
3. Fix any failures
4. Commit with descriptive message

@@ -1,5 +1,5 @@
---
name: deep-research
name: research
description: |
  Deep Research Methodology (8-Step Method) with two execution modes:
  - Mode A (Initial Research): Assess acceptance criteria, then research problem and produce solution draft
@@ -13,6 +13,7 @@ description: |
  - "comparative analysis", "concept comparison", "technical comparison"
category: build
tags: [research, analysis, solution-design, comparison, decision-support]
disable-model-invocation: true
---

# Deep Research (8-Step Method)
@@ -42,257 +43,51 @@ Determine the operating mode based on invocation before any other logic runs.

**Standalone mode** (explicit input file provided, e.g. `/research @some_doc.md`):
- INPUT_FILE: the provided file (treated as problem description)
-- OUTPUT_DIR: `_standalone/01_solution/`
-- RESEARCH_DIR: `_standalone/00_research/`
+- BASE_DIR: if specified by the caller, use it; otherwise default to `_standalone/`
+- OUTPUT_DIR: `BASE_DIR/01_solution/`
+- RESEARCH_DIR: `BASE_DIR/00_research/`
- Guardrails relaxed: only INPUT_FILE must exist and be non-empty
- `restrictions.md` and `acceptance_criteria.md` are optional — warn if absent, proceed if user confirms
- Mode detection uses OUTPUT_DIR for `solution_draft*.md` scanning
- Draft numbering works the same, scoped to OUTPUT_DIR
-- **Final step**: after all research is complete, move INPUT_FILE into `_standalone/`
+- **Final step**: after all research is complete, move INPUT_FILE into BASE_DIR

Announce the detected mode and resolved paths to the user before proceeding.

## Project Integration

### Prerequisite Guardrails (BLOCKING)

Before any research begins, verify the input context exists. **Do not proceed if guardrails fail.**

**Project mode:**
1. Check INPUT_DIR exists — **STOP if missing**, ask user to create it and provide problem files
2. Check `problem.md` in INPUT_DIR exists and is non-empty — **STOP if missing**
3. Check `restrictions.md` in INPUT_DIR exists and is non-empty — **STOP if missing**
4. Check `acceptance_criteria.md` in INPUT_DIR exists and is non-empty — **STOP if missing**
5. Check `input_data/` in INPUT_DIR exists and contains at least one file — **STOP if missing**
6. Read **all** files in INPUT_DIR to ground the investigation in the project context
7. Create OUTPUT_DIR and RESEARCH_DIR if they don't exist

**Standalone mode:**
1. Check INPUT_FILE exists and is non-empty — **STOP if missing**
2. Warn if no `restrictions.md` or `acceptance_criteria.md` were provided alongside INPUT_FILE — proceed if user confirms
3. Create OUTPUT_DIR and RESEARCH_DIR if they don't exist
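The project-mode file checks above can be sketched as a small helper. This is a hypothetical illustration; the function name and the list-of-failures return shape are not part of the skill.

```python
from pathlib import Path

def project_guardrail_failures(input_dir: str) -> list[str]:
    """Run the blocking project-mode checks; an empty list means proceed."""
    d = Path(input_dir)
    if not d.is_dir():
        return [f"INPUT_DIR missing: {d}"]
    failures = []
    # Checks 2-4: required files must exist and be non-empty
    for name in ("problem.md", "restrictions.md", "acceptance_criteria.md"):
        f = d / name
        if not f.is_file() or f.stat().st_size == 0:
            failures.append(f"{name} missing or empty")
    # Check 5: input_data/ must contain at least one file
    data = d / "input_data"
    if not data.is_dir() or not any(data.iterdir()):
        failures.append("input_data/ missing or contains no files")
    return failures
```

A non-empty result maps to the STOP behavior: report every failure at once and ask the user to fix them, rather than stopping at the first.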

### Mode Detection

After guardrails pass, determine the execution mode:

1. Scan OUTPUT_DIR for files matching `solution_draft*.md`
2. **No matches found** → **Mode A: Initial Research**
3. **Matches found** → **Mode B: Solution Assessment** (use the highest-numbered draft as input)
4. **User override**: if the user explicitly says "research from scratch" or "initial research", force Mode A regardless of existing drafts

Inform the user which mode was detected and confirm before proceeding.

### Solution Draft Numbering

All final output is saved as `OUTPUT_DIR/solution_draft##.md` with a 2-digit zero-padded number:

1. Scan existing files in OUTPUT_DIR matching `solution_draft*.md`
2. Extract the highest existing number
3. Increment by 1
4. Zero-pad to 2 digits (e.g., `01`, `02`, ..., `10`, `11`)

Example: if `solution_draft01.md` through `solution_draft10.md` exist, the next output is `solution_draft11.md`.
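The scan → extract → increment → pad sequence can be sketched as follows. This is a minimal illustration; the helper name is hypothetical.

```python
import re
from pathlib import Path

def next_draft_name(output_dir: str) -> str:
    """Return the next solution_draft##.md filename for OUTPUT_DIR.

    Scans existing drafts, takes the highest number, increments by 1,
    and zero-pads to 2 digits; an empty directory yields draft 01.
    """
    numbers = [
        int(m.group(1))
        for p in Path(output_dir).glob("solution_draft*.md")
        if (m := re.fullmatch(r"solution_draft(\d+)\.md", p.name))
    ]
    return f"solution_draft{max(numbers, default=0) + 1:02d}.md"
```

`max(numbers, default=0)` covers the Mode A case where no drafts exist yet, and the two-digit padding keeps filenames sorting lexicographically up to draft 99.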

### Working Directory & Intermediate Artifact Management

#### Directory Structure

At the start of research, you **must** create a working directory under RESEARCH_DIR:

```
RESEARCH_DIR/
├── 00_ac_assessment.md           # Mode A Phase 1 output: AC & restrictions assessment
├── 00_question_decomposition.md  # Step 0-1 output
├── 01_source_registry.md         # Step 2 output: all consulted source links
├── 02_fact_cards.md              # Step 3 output: extracted facts
├── 03_comparison_framework.md    # Step 4 output: selected framework and populated data
├── 04_reasoning_chain.md         # Step 6 output: fact → conclusion reasoning
├── 05_validation_log.md          # Step 7 output: use-case validation results
└── raw/                          # Raw source archive (optional)
    ├── source_1.md
    └── source_2.md
```

### Save Timing & Content

| Step | Save immediately after completion | Filename |
|------|-----------------------------------|----------|
| Mode A Phase 1 | AC & restrictions assessment tables | `00_ac_assessment.md` |
| Step 0-1 | Question type classification + sub-question list | `00_question_decomposition.md` |
| Step 2 | Each consulted source link, tier, summary | `01_source_registry.md` |
| Step 3 | Each fact card (statement + source + confidence) | `02_fact_cards.md` |
| Step 4 | Selected comparison framework + initial population | `03_comparison_framework.md` |
| Step 6 | Reasoning process for each dimension | `04_reasoning_chain.md` |
| Step 7 | Validation scenarios + results + review checklist | `05_validation_log.md` |
| Step 8 | Complete solution draft | `OUTPUT_DIR/solution_draft##.md` |

### Save Principles

1. **Save immediately**: Write to the corresponding file as soon as a step is completed; don't wait until the end
2. **Incremental updates**: The same file can be updated multiple times; append or replace content as needed
3. **Preserve process**: Keep intermediate files even after their content is integrated into the final report
4. **Enable recovery**: If research is interrupted, progress can be recovered from intermediate files

Read and follow `steps/00_project-integration.md` for prerequisite guardrails, mode detection, draft numbering, working directory setup, save timing, and output file inventory.

## Execution Flow

### Mode A: Initial Research

Triggered when no `solution_draft*.md` files exist in OUTPUT_DIR, or when the user explicitly requests initial research.
Read and follow `steps/01_mode-a-initial-research.md`.

#### Phase 1: AC & Restrictions Assessment (BLOCKING)

**Role**: Professional software architect

A focused preliminary research pass **before** the main solution research. The goal is to validate that the acceptance criteria and restrictions are realistic before designing a solution around them.

**Input**: All files from INPUT_DIR (or INPUT_FILE in standalone mode)

**Task**:
1. Read all problem context files thoroughly
2. **ASK the user about every unclear aspect** — do not assume:
   - Unclear problem boundaries → ask
   - Ambiguous acceptance criteria values → ask
   - Missing context (no `security_approach.md`, no `input_data/`) → ask what they have
   - Conflicting restrictions → ask which takes priority
3. Research on the internet **extensively** — use multiple search queries per question, rephrase, and search from different angles:
   - How realistic are the acceptance criteria for this specific domain? Search for industry benchmarks, standards, and typical values
   - How critical is each criterion? Search for case studies where criteria were relaxed or tightened
   - What domain-specific acceptance criteria are we missing? Search for industry standards, regulatory requirements, and best practices in the specific domain
   - Impact of each criterion value on overall system quality — search for research papers and engineering reports
   - Cost/budget implications of each criterion — search for pricing, total cost of ownership analyses, and comparable project budgets
   - Timeline implications — search for project timelines, development velocity reports, and comparable implementations
   - What do practitioners in this domain consider the most important criteria? Search forums, conference talks, and experience reports
4. Research restrictions from multiple perspectives:
   - Are the restrictions realistic? Search for comparable projects that operated under similar constraints
   - Should any be tightened or relaxed? Search for what constraints similar projects actually ended up with
   - Are there additional restrictions we should add? Search for regulatory, compliance, and safety requirements in this domain
   - What restrictions do practitioners wish they had defined earlier? Search for post-mortem reports and lessons learned
5. Verify findings with authoritative sources (official docs, papers, benchmarks) — each key finding must have at least 2 independent sources

**Uses Steps 0-3 of the 8-step engine** (question classification, decomposition, source tiering, fact extraction) scoped to AC and restrictions assessment.

**📁 Save action**: Write `RESEARCH_DIR/00_ac_assessment.md` with format:

```markdown
# Acceptance Criteria Assessment

## Acceptance Criteria

| Criterion | Our Values | Researched Values | Cost/Timeline Impact | Status |
|-----------|-----------|-------------------|---------------------|--------|
| [name] | [current] | [researched range] | [impact] | Added / Modified / Removed |

## Restrictions Assessment

| Restriction | Our Values | Researched Values | Cost/Timeline Impact | Status |
|-------------|-----------|-------------------|---------------------|--------|
| [name] | [current] | [researched range] | [impact] | Added / Modified / Removed |

## Key Findings
[Summary of critical findings]

## Sources
[Key references used]
```

**BLOCKING**: Present the AC assessment tables to the user. Wait for confirmation or adjustments before proceeding to Phase 2. The user may update `acceptance_criteria.md` or `restrictions.md` based on findings.

---

#### Phase 2: Problem Research & Solution Draft

**Role**: Professional researcher and software architect

Full 8-step research methodology. Produces the first solution draft.

**Input**: All files from INPUT_DIR (possibly updated after Phase 1) + Phase 1 artifacts

**Task** (drives the 8-step engine):
1. Research existing/competitor solutions for similar problems — search broadly across industries and adjacent domains, not just the obvious competitors
2. Research the problem thoroughly — all possible ways to solve it, split into components; search for how different fields approach analogous problems
3. For each component, research all possible solutions and find the most efficient state-of-the-art approaches — use multiple query variants and perspectives from Step 1
4. For each promising approach, search for real-world deployment experience: success stories, failure reports, lessons learned, and practitioner opinions
5. Search for contrarian viewpoints — who argues against the common approaches and why? What failure modes exist?
6. Verify that suggested tools/libraries actually exist and work as described — check official repos, latest releases, and community health (stars, recent commits, open issues)
7. Include security considerations in each component analysis
8. Provide rough cost estimates for proposed solutions

Formulate concisely: the fewer words the better, but do not omit any important details.

**📁 Save action**: Write `OUTPUT_DIR/solution_draft##.md` using template: `templates/solution_draft_mode_a.md`

---

#### Phase 3: Tech Stack Consolidation (OPTIONAL)

**Role**: Software architect evaluating technology choices

Focused synthesis step — no new 8-step cycle. Uses research already gathered in Phase 2 to make concrete technology decisions.

**Input**: Latest `solution_draft##.md` from OUTPUT_DIR + all files from INPUT_DIR

**Task**:
1. Extract technology options from the solution draft's component comparison tables
2. Score each option against: fitness for purpose, maturity, security track record, team expertise, cost, scalability
3. Produce a tech stack summary with selection rationale
4. Assess risks and learning requirements per technology choice

**📁 Save action**: Write `OUTPUT_DIR/tech_stack.md` with:
- Requirements analysis (functional, non-functional, constraints)
- Technology evaluation tables (language, framework, database, infrastructure, key libraries) with scores
- Tech stack summary block
- Risk assessment and learning requirements tables

---

#### Phase 4: Security Deep Dive (OPTIONAL)

**Role**: Security architect

Focused analysis step — deepens the security column from the solution draft into a proper threat model and controls specification.

**Input**: Latest `solution_draft##.md` from OUTPUT_DIR + `security_approach.md` from INPUT_DIR + problem context

**Task**:
1. Build threat model: asset inventory, threat actors, attack vectors
2. Define security requirements and proposed controls per component (with risk level)
3. Summarize authentication/authorization, data protection, secure communication, and logging/monitoring approach

**📁 Save action**: Write `OUTPUT_DIR/security_analysis.md` with:
- Threat model (assets, actors, vectors)
- Per-component security requirements and controls table
- Security controls summary

Phases: AC Assessment (BLOCKING) → Problem Research → Tech Stack (optional) → Security (optional).

---

### Mode B: Solution Assessment

Triggered when `solution_draft*.md` files exist in OUTPUT_DIR.
Read and follow `steps/02_mode-b-solution-assessment.md`.

**Role**: Professional software architect

---

Full 8-step research methodology applied to assessing and improving an existing solution draft.

## Research Engine (8-Step Method)

**Input**: All files from INPUT_DIR + the latest (highest-numbered) `solution_draft##.md` from OUTPUT_DIR

The 8-step method is the core research engine used by both modes. Steps 0-1 and Step 8 have mode-specific behavior; Steps 2-7 are identical regardless of mode.

**Task** (drives the 8-step engine):
1. Read the existing solution draft thoroughly
2. Research on the internet extensively — for each component/decision in the draft, search for:
   - Known problems and limitations of the chosen approach
   - What practitioners say about using it in production
   - Better alternatives that may have emerged recently
   - Common failure modes and edge cases
   - How competitors/similar projects solve the same problem differently
3. Search specifically for contrarian views: "why not [chosen approach]", "[chosen approach] criticism", "[chosen approach] failure"
4. Identify security weak points and vulnerabilities — search for CVEs, security advisories, and known attack vectors for each technology in the draft
5. Identify performance bottlenecks — search for benchmarks, load test results, and scalability reports
6. For each identified weak point, search for multiple solution approaches and compare them
7. Based on findings, form a new solution draft in the same format

**Investigation phase** (Steps 0–3.5): Read and follow `steps/03_engine-investigation.md`.
Covers: question classification, novelty sensitivity, question decomposition, perspective rotation, exhaustive web search, fact extraction, iterative deepening.

**📁 Save action**: Write `OUTPUT_DIR/solution_draft##.md` (incremented) using template: `templates/solution_draft_mode_b.md`

**Analysis phase** (Steps 4–8): Read and follow `steps/04_engine-analysis.md`.
Covers: comparison framework, baseline alignment, reasoning chain, use-case validation, deliverable formatting.

**Optional follow-up**: After Mode B completes, the user can request Phase 3 (Tech Stack Consolidation) or Phase 4 (Security Deep Dive) using the revised draft. These phases work identically to their Mode A descriptions above.

## Solution Draft Output Templates

- Mode A: `templates/solution_draft_mode_a.md`
- Mode B: `templates/solution_draft_mode_b.md`

## Escalation Rules

@@ -316,389 +111,12 @@ When the user wants to:
- Gather information and evidence for a decision
- Assess or improve an existing solution draft

**Keywords**:
- "deep research", "deep dive", "in-depth analysis"
- "research this", "investigate", "look into"
- "assess solution", "review draft", "improve solution"
- "comparative analysis", "concept comparison", "technical comparison"

**Differentiation from other Skills**:
- Needs a **visual knowledge graph** → use `research-to-diagram`
- Needs **written output** (articles/tutorials) → use `wsy-writer`
- Needs **material organization** → use `material-to-markdown`
- Needs **research + solution draft** → use this Skill

## Research Engine (8-Step Method)

The 8-step method is the core research engine used by both modes. Steps 0-1 and Step 8 have mode-specific behavior; Steps 2-7 are identical regardless of mode.

### Step 0: Question Type Classification

First, classify the research question type and select the corresponding strategy:

| Question Type | Core Task | Focus Dimensions |
|---------------|-----------|------------------|
| **Concept Comparison** | Build comparison framework | Mechanism differences, applicability boundaries |
| **Decision Support** | Weigh trade-offs | Cost, risk, benefit |
| **Trend Analysis** | Map evolution trajectory | History, driving factors, predictions |
| **Problem Diagnosis** | Root cause analysis | Symptoms, causes, evidence chain |
| **Knowledge Organization** | Systematic structuring | Definitions, classifications, relationships |

**Mode-specific classification**:

| Mode / Phase | Typical Question Type |
|--------------|----------------------|
| Mode A Phase 1 | Knowledge Organization + Decision Support |
| Mode A Phase 2 | Decision Support |
| Mode B | Problem Diagnosis + Decision Support |

### Step 0.5: Novelty Sensitivity Assessment (BLOCKING)

Before starting research, assess the novelty sensitivity of the question (Critical/High/Medium/Low). This determines source time windows and filtering strategy.

**For full classification table, critical-domain rules, trigger words, and assessment template**: Read `references/novelty-sensitivity.md`

Key principle: Critical-sensitivity topics (AI/LLMs, blockchain) require sources within 6 months, mandatory version annotations, cross-validation from 2+ sources, and direct verification of official download pages.

**📁 Save action**: Append timeliness assessment to the end of `00_question_decomposition.md`

---

### Step 1: Question Decomposition & Boundary Definition

**Mode-specific sub-questions**:

**Mode A Phase 2** (Initial Research — Problem & Solution):
- "What existing/competitor solutions address this problem?"
- "What are the component parts of this problem?"
- "For each component, what are the state-of-the-art solutions?"
- "What are the security considerations per component?"
- "What are the cost implications of each approach?"

**Mode B** (Solution Assessment):
- "What are the weak points and potential problems in the existing draft?"
- "What are the security vulnerabilities in the proposed architecture?"
- "Where are the performance bottlenecks?"
- "What solutions exist for each identified issue?"

**General sub-question patterns** (use when applicable):
- **Sub-question A**: "What is X and how does it work?" (Definition & mechanism)
- **Sub-question B**: "What are the dimensions of relationship/difference between X and Y?" (Comparative analysis)
- **Sub-question C**: "In what scenarios is X applicable/inapplicable?" (Boundary conditions)
- **Sub-question D**: "What are X's development trends/best practices?" (Extended analysis)

#### Perspective Rotation (MANDATORY)

For each research problem, examine it from **at least 3 different perspectives**. Each perspective generates its own sub-questions and search queries.

| Perspective | What it asks | Example queries |
|-------------|-------------|-----------------|
| **End-user / Consumer** | What problems do real users encounter? What do they wish were different? | "X problems", "X frustrations reddit", "X user complaints" |
| **Implementer / Engineer** | What are the technical challenges, gotchas, hidden complexities? | "X implementation challenges", "X pitfalls", "X lessons learned" |
| **Business / Decision-maker** | What are the costs, ROI, strategic implications? | "X total cost of ownership", "X ROI case study", "X vs Y business comparison" |
| **Contrarian / Devil's advocate** | What could go wrong? Why might this fail? What are critics saying? | "X criticism", "why not X", "X failures", "X disadvantages real world" |
| **Domain expert / Academic** | What does peer-reviewed research say? What are theoretical limits? | "X research paper", "X systematic review", "X benchmarks academic" |
| **Practitioner / Field** | What do people who actually use this daily say? What works in practice vs theory? | "X in production", "X experience report", "X after 1 year" |

Select at least 3 perspectives relevant to the problem. Document the chosen perspectives in `00_question_decomposition.md`.

#### Question Explosion (MANDATORY)

For **each sub-question**, generate **at least 3-5 search query variants** before searching. This ensures broad coverage and avoids missing relevant information due to terminology differences.

**Query variant strategies**:
- **Specificity ladder**: broad ("indoor navigation systems") → narrow ("UWB-based indoor drone navigation accuracy")
- **Negation/failure**: "X limitations", "X failure modes", "when X doesn't work"
- **Comparison framing**: "X vs Y for Z", "X alternative for Z", "X or Y which is better for Z"
- **Practitioner voice**: "X in production experience", "X real-world results", "X lessons learned"
- **Temporal**: "X 2025", "X latest developments", "X roadmap"
- **Geographic/domain**: "X in Europe", "X for defense applications", "X in agriculture"

Record all planned queries in `00_question_decomposition.md` alongside each sub-question.
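The variant strategies above can be mechanized for one sub-question roughly as follows. The query templates in this sketch are illustrative assumptions, not queries mandated by the skill.

```python
def query_variants(topic: str, domain: str) -> list[str]:
    """Expand one sub-question topic into search query variants,
    one per strategy from the list above."""
    return [
        f"{topic} {domain} accuracy",             # specificity ladder (narrow)
        f"{topic} limitations",                   # negation/failure
        f"{topic} vs alternatives for {domain}",  # comparison framing
        f"{topic} in production experience",      # practitioner voice
        f"{topic} latest developments 2025",      # temporal
        f"{topic} for defense applications",      # geographic/domain
    ]
```

Each generated string would become one `WebSearch` call in Step 2; the point is that every sub-question enters Step 2 with its variant list already recorded.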

**⚠️ Research Subject Boundary Definition (BLOCKING - must be explicit)**:

When decomposing questions, you must explicitly define the **boundaries of the research subject**:

| Dimension | Boundary to define | Example |
|-----------|--------------------|---------|
| **Population** | Which group is being studied? | University students vs K-12 vs vocational students vs all students |
| **Geography** | Which region is being studied? | Chinese universities vs US universities vs global |
| **Timeframe** | Which period is being studied? | Post-2020 vs full historical picture |
| **Level** | Which level is being studied? | Undergraduate vs graduate vs vocational |

**Common mistake**: User asks about "university classroom issues" but sources include policies targeting "K-12 students" — mismatched target populations will invalidate the entire research.

**📁 Save action**:
1. Read all files from INPUT_DIR to ground the research in the project context
2. Create working directory `RESEARCH_DIR/`
3. Write `00_question_decomposition.md`, including:
   - Original question
   - Active mode (A Phase 2 or B) and rationale
   - Summary of relevant problem context from INPUT_DIR
   - Classified question type and rationale
   - **Research subject boundary definition** (population, geography, timeframe, level)
   - List of decomposed sub-questions
   - **Chosen perspectives** (at least 3 from the Perspective Rotation table) with rationale
   - **Search query variants** for each sub-question (at least 3-5 per sub-question)
4. Use TodoWrite to track progress

### Step 2: Source Tiering & Exhaustive Web Investigation

Tier sources by authority and **prioritize primary sources** (L1 > L2 > L3 > L4). Conclusions must be traceable to L1/L2; L3/L4 serve as supplementary evidence and validation.

**For full tier definitions, search strategies, community mining steps, and source registry templates**: Read `references/source-tiering.md`

**Tool Usage**:
- Use `WebSearch` for broad searches; `WebFetch` to read specific pages
- Use the `context7` MCP server (`resolve-library-id` then `get-library-docs`) for up-to-date library/framework documentation
- Always cross-verify training data claims against live sources for facts that may have changed (versions, APIs, deprecations, security advisories)
- When citing web sources, include the URL and date accessed

#### Exhaustive Search Requirements (MANDATORY)

Do not stop at the first few results. The goal is to build a comprehensive evidence base.

**Minimum search effort per sub-question**:
- Execute **all** query variants generated in Step 1's Question Explosion (at least 3-5 per sub-question)
- Consult at least **2 different source tiers** per sub-question (e.g., L1 official docs + L4 community discussion)
- If initial searches yield fewer than 3 relevant sources for a sub-question, **broaden the search** with alternative terms, related domains, or analogous problems

**Search broadening strategies** (use when results are thin):
- Try adjacent fields: if researching "drone indoor navigation", also search "robot indoor navigation", "warehouse AGV navigation"
- Try different communities: academic papers, industry whitepapers, military/defense publications, hobbyist forums
- Try different geographies: search in English + search for European/Asian approaches if relevant
- Try historical evolution: "history of X", "evolution of X approaches", "X state of the art 2024 2025"
- Try failure analysis: "X project failure", "X post-mortem", "X recall", "X incident report"

**Search saturation rule**: Continue searching until new queries stop producing substantially new information. If the last 3 searches only repeat previously found facts, the sub-question is saturated.
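The saturation rule lends itself to a simple mechanical check. This is a sketch; representing each search's findings as a set of fact identifiers is an assumption of the illustration, not something the skill prescribes.

```python
def is_saturated(search_results: list[set[str]], window: int = 3) -> bool:
    """True when the last `window` searches produced no fact that was
    not already found by earlier searches."""
    if len(search_results) <= window:
        return False  # not enough history to judge
    seen_earlier = set().union(*search_results[:-window])
    return all(facts <= seen_earlier for facts in search_results[-window:])
```

The default `window=3` matches the "last 3 searches" wording; a sub-question with saturated results exits its search loop and the next sub-question begins.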

**📁 Save action**:
For each source consulted, **immediately** append to `01_source_registry.md` using the entry template from `references/source-tiering.md`.

### Step 3: Fact Extraction & Evidence Cards

Transform sources into **verifiable fact cards**:

```markdown
## Fact Cards

### Fact 1
- **Statement**: [specific fact description]
- **Source**: [link/document section]
- **Confidence**: High/Medium/Low

### Fact 2
...
```

**Key discipline**:
- Pin down facts first, then reason
- Distinguish "what officials said" from "what I infer"
- When conflicting information is found, annotate and preserve both sides
- Annotate confidence level:
  - ✅ High: Explicitly stated in official documentation
  - ⚠️ Medium: Mentioned in official blog but not formally documented
  - ❓ Low: Inference or from unofficial sources

**📁 Save action**:
For each extracted fact, **immediately** append to `02_fact_cards.md`:
```markdown
## Fact #[number]
- **Statement**: [specific fact description]
- **Source**: [Source #number] [link]
- **Phase**: [Phase 1 / Phase 2 / Assessment]
- **Target Audience**: [which group this fact applies to, inherited from source or further refined]
- **Confidence**: ✅/⚠️/❓
- **Related Dimension**: [corresponding comparison dimension]
```
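The save-immediately discipline for fact cards reduces to an append helper. The field names mirror the entry format above; the function itself is a hypothetical illustration, not part of the skill.

```python
def append_fact_card(path: str, number: int, statement: str, source: str,
                     phase: str, audience: str, confidence: str,
                     dimension: str) -> None:
    """Append one fact card to 02_fact_cards.md right after extraction."""
    entry = (
        f"\n## Fact #{number}\n"
        f"- **Statement**: {statement}\n"
        f"- **Source**: {source}\n"
        f"- **Phase**: {phase}\n"
        f"- **Target Audience**: {audience}\n"
        f"- **Confidence**: {confidence}\n"
        f"- **Related Dimension**: {dimension}\n"
    )
    # Append mode keeps earlier cards intact and creates the file if absent
    with open(path, "a", encoding="utf-8") as f:
        f.write(entry)
```

Because the file is only ever appended to, an interrupted session loses at most the fact currently being extracted, which is exactly the recovery property the Save Principles ask for.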

**⚠️ Target audience in fact statements**:
- If a fact comes from a "partially overlapping" or "reference only" source, the statement **must explicitly annotate the applicable scope**
- Wrong: "The Ministry of Education banned phones in classrooms" (doesn't specify who)
- Correct: "The Ministry of Education banned K-12 students from bringing phones into classrooms (does not apply to university students)"

### Step 3.5: Iterative Deepening — Follow-Up Investigation

After initial fact extraction, review what you have found and identify **knowledge gaps and new questions** that emerged from the initial research. This step ensures the research doesn't stop at surface-level findings.

**Process**:

1. **Gap analysis**: Review fact cards and identify:
   - Sub-questions with fewer than 3 high-confidence facts → need more searching
   - Contradictions between sources → need tie-breaking evidence
   - Perspectives (from Step 1) that have no or weak coverage → need targeted search
   - Claims that rely only on L3/L4 sources → need L1/L2 verification

2. **Follow-up question generation**: Based on initial findings, generate new questions:
   - "Source X claims [fact] — is this consistent with other evidence?"
   - "If [approach A] has [limitation], how do practitioners work around it?"
   - "What are the second-order effects of [finding]?"
   - "Who disagrees with [common finding] and why?"
   - "What happened when [solution] was deployed at scale?"

3. **Targeted deep-dive searches**: Execute follow-up searches focusing on:
   - Specific claims that need verification
   - Alternative viewpoints not yet represented
   - Real-world case studies and experience reports
   - Failure cases and edge conditions
   - Recent developments that may change the picture

4. **Update artifacts**: Append new sources to `01_source_registry.md`, new facts to `02_fact_cards.md`

**Exit criteria**: Proceed to Step 4 when:
- Every sub-question has at least 3 facts with at least one from L1/L2
- At least 3 perspectives from Step 1 have supporting evidence
- No unresolved contradictions remain (or they are explicitly documented as open questions)
- Follow-up searches are no longer producing new substantive information
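The first three exit criteria can be reduced to a mechanical gate. This is a sketch whose input shapes (source tiers per sub-question, a set of covered perspectives, a count of undocumented contradictions) are assumptions of the illustration.

```python
def ready_for_step4(tiers_per_subq: dict[str, list[str]],
                    covered_perspectives: set[str],
                    open_contradictions: int) -> bool:
    """Gate for proceeding to Step 4: every sub-question has >= 3 facts
    with at least one from an L1/L2 source, >= 3 perspectives have
    evidence, and no contradictions remain undocumented."""
    for tiers in tiers_per_subq.values():
        if len(tiers) < 3 or not any(t in ("L1", "L2") for t in tiers):
            return False
    return len(covered_perspectives) >= 3 and open_contradictions == 0
```

The fourth criterion (search saturation) is judged per sub-question during Step 2 rather than at this gate, so it is not an input here.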

### Step 4: Build Comparison/Analysis Framework

Based on the question type, select fixed analysis dimensions. **For dimension lists** (General, Concept Comparison, Decision Support): Read `references/comparison-frameworks.md`

**📁 Save action**:
Write to `03_comparison_framework.md`:
```markdown
# Comparison Framework

## Selected Framework Type
[Concept Comparison / Decision Support / ...]

## Selected Dimensions
1. [Dimension 1]
2. [Dimension 2]
...

## Initial Population
| Dimension | X | Y | Factual Basis |
|-----------|---|---|---------------|
| [Dimension 1] | [description] | [description] | Fact #1, #3 |
| ... | | | |
```

### Step 5: Reference Point Baseline Alignment

Ensure all compared parties have clear, consistent definitions:

**Checklist**:
- [ ] Is the reference point's definition stable/widely accepted?
- [ ] Does it need verification, or can domain common knowledge be used?
- [ ] Does the reader's understanding of the reference point match mine?
- [ ] Are there ambiguities that need to be clarified first?
|
||||
### Step 6: Fact-to-Conclusion Reasoning Chain

Explicitly write out the "fact → comparison → conclusion" reasoning process:

```markdown
## Reasoning Process

### Regarding [Dimension Name]

1. **Fact confirmation**: According to [source], X's mechanism is...
2. **Compare with reference**: While Y's mechanism is...
3. **Conclusion**: Therefore, the difference between X and Y on this dimension is...
```

**Key discipline**:
- Conclusions come from mechanism comparison, not "gut feelings"
- Every conclusion must be traceable to specific facts
- Uncertain conclusions must be annotated

**📁 Save action**:
Write to `04_reasoning_chain.md`:
```markdown
# Reasoning Chain

## Dimension 1: [Dimension Name]

### Fact Confirmation
According to [Fact #X], X's mechanism is...

### Reference Comparison
While Y's mechanism is... (Source: [Fact #Y])

### Conclusion
Therefore, the difference between X and Y on this dimension is...

### Confidence
✅/⚠️/❓ + rationale

---

## Dimension 2: [Dimension Name]
...
```
### Step 7: Use-Case Validation (Sanity Check)

Validate conclusions against a typical scenario:

**Validation questions**:
- Based on my conclusions, how should this scenario be handled?
- Is that actually the case?
- Are there counterexamples that need to be addressed?

**Review checklist**:
- [ ] Are draft conclusions consistent with Step 3 fact cards?
- [ ] Are there any important dimensions missed?
- [ ] Is there any over-extrapolation?
- [ ] Are conclusions actionable/verifiable?

**📁 Save action**:
Write to `05_validation_log.md`:
```markdown
# Validation Log

## Validation Scenario
[Scenario description]

## Expected Based on Conclusions
If using X: [expected behavior]
If using Y: [expected behavior]

## Actual Validation Results
[actual situation]

## Counterexamples
[yes/no, describe if yes]

## Review Checklist
- [x] Draft conclusions consistent with fact cards
- [x] No important dimensions missed
- [x] No over-extrapolation
- [ ] Issue found: [if any]

## Conclusions Requiring Revision
[if any]
```
### Step 8: Deliverable Formatting

Make the output **readable, traceable, and actionable**.

**📁 Save action**:
Integrate all intermediate artifacts. Write to `OUTPUT_DIR/solution_draft##.md` using the appropriate output template based on active mode:
- Mode A: `templates/solution_draft_mode_a.md`
- Mode B: `templates/solution_draft_mode_b.md`

Sources to integrate:
- Extract background from `00_question_decomposition.md`
- Reference key facts from `02_fact_cards.md`
- Organize conclusions from `04_reasoning_chain.md`
- Generate references from `01_source_registry.md`
- Supplement with use cases from `05_validation_log.md`
- For Mode A: include AC assessment from `00_ac_assessment.md`

## Solution Draft Output Templates

### Mode A: Initial Research Output

Use template: `templates/solution_draft_mode_a.md`

### Mode B: Solution Assessment Output

Use template: `templates/solution_draft_mode_b.md`

## Stakeholder Perspectives

Adjust content depth based on audience:
@@ -709,75 +127,6 @@ Adjust content depth based on audience:
| **Implementers** | Specific mechanisms, how-to | Detailed, emphasize how to do it |
| **Technical experts** | Details, boundary conditions, limitations | In-depth, emphasize accuracy |

## Output Files

Default intermediate artifacts location: `RESEARCH_DIR/`

**Required files** (automatically generated through the process):

| File | Content | When Generated |
|------|---------|----------------|
| `00_ac_assessment.md` | AC & restrictions assessment (Mode A only) | After Phase 1 completion |
| `00_question_decomposition.md` | Question type, sub-question list | After Step 0-1 completion |
| `01_source_registry.md` | All source links and summaries | Continuously updated during Step 2 |
| `02_fact_cards.md` | Extracted facts and sources | Continuously updated during Step 3 |
| `03_comparison_framework.md` | Selected framework and populated data | After Step 4 completion |
| `04_reasoning_chain.md` | Fact → conclusion reasoning | After Step 6 completion |
| `05_validation_log.md` | Use-case validation and review | After Step 7 completion |
| `OUTPUT_DIR/solution_draft##.md` | Complete solution draft | After Step 8 completion |
| `OUTPUT_DIR/tech_stack.md` | Tech stack evaluation and decisions | After Phase 3 (optional) |
| `OUTPUT_DIR/security_analysis.md` | Threat model and security controls | After Phase 4 (optional) |

**Optional files**:
- `raw/*.md` - Raw source archives (saved when content is lengthy)
## Methodology Quick Reference Card

```
┌──────────────────────────────────────────────────────────────────┐
│             Deep Research — Mode-Aware 8-Step Method             │
├──────────────────────────────────────────────────────────────────┤
│ CONTEXT: Resolve mode (project vs standalone) + set paths        │
│ GUARDRAILS: Check INPUT_DIR/INPUT_FILE exists + required files   │
│ MODE DETECT: solution_draft*.md in 01_solution? → A or B         │
│                                                                  │
│ MODE A: Initial Research                                         │
│   Phase 1: AC & Restrictions Assessment (BLOCKING)               │
│   Phase 2: Full 8-step → solution_draft##.md                     │
│   Phase 3: Tech Stack Consolidation (OPTIONAL) → tech_stack.md   │
│   Phase 4: Security Deep Dive (OPTIONAL) → security_analysis.md  │
│                                                                  │
│ MODE B: Solution Assessment                                      │
│   Read latest draft → Full 8-step → solution_draft##.md (N+1)    │
│   Optional: Phase 3 / Phase 4 on revised draft                   │
│                                                                  │
│ 8-STEP ENGINE:                                                   │
│ 0. Classify question type → Select framework template            │
│ 0.5 Novelty sensitivity → Time windows for sources               │
│ 1. Decompose question → sub-questions + perspectives + queries   │
│    → Perspective Rotation (3+ viewpoints, MANDATORY)             │
│    → Question Explosion (3-5 query variants per sub-Q)           │
│ 2. Exhaustive web search → L1 > L2 > L3 > L4, broad coverage     │
│    → Execute ALL query variants, search until saturation         │
│ 3. Extract facts → Each with source, confidence level            │
│ 3.5 Iterative deepening → gaps, contradictions, follow-ups       │
│    → Keep searching until exit criteria met                      │
│ 4. Build framework → Fixed dimensions, structured compare        │
│ 5. Align references → Ensure unified definitions                 │
│ 6. Reasoning chain → Fact→Compare→Conclude, explicit             │
│ 7. Use-case validation → Sanity check, prevent armchairing       │
│ 8. Deliverable → solution_draft##.md (mode-specific format)      │
├──────────────────────────────────────────────────────────────────┤
│ Key discipline: Ask don't assume · Facts before reasoning        │
│                 Conclusions from mechanism, not gut feelings     │
│ Search broadly, from multiple perspectives, until saturation     │
└──────────────────────────────────────────────────────────────────┘
```
## Usage Examples

For detailed execution flow examples (Mode A initial, Mode B assessment, standalone, force override): Read `references/usage-examples.md`

## Source Verifiability Requirements

Every cited piece of external information must be directly verifiable by the user. All links must be publicly accessible (annotate `[login required]` if not), citations must include exact section/page/timestamp, and unverifiable information must be annotated `[limited source]`. Full checklist in `references/quality-checklists.md`.
@@ -795,7 +144,7 @@ Before completing the solution draft, run through the checklists in `references/

When replying to the user after research is complete:

**Should include**:
- Active mode used (A or B) and which optional phases were executed
- One-sentence core conclusion
- Key findings summary (3-5 points)
@@ -803,7 +152,7 @@ When replying to the user after research is complete:
- Paths to optional artifacts if produced: `tech_stack.md`, `security_analysis.md`
- If there are significant uncertainties, annotate points requiring further verification

**Must not include**:
- Process file listings (e.g., `00_question_decomposition.md`, `01_source_registry.md`, etc.)
- Detailed research step descriptions
- Working directory structure display
@@ -0,0 +1,103 @@
## Project Integration

### Prerequisite Guardrails (BLOCKING)

Before any research begins, verify the input context exists. **Do not proceed if guardrails fail.**

**Project mode:**
1. Check INPUT_DIR exists — **STOP if missing**, ask user to create it and provide problem files
2. Check `problem.md` in INPUT_DIR exists and is non-empty — **STOP if missing**
3. Check `restrictions.md` in INPUT_DIR exists and is non-empty — **STOP if missing**
4. Check `acceptance_criteria.md` in INPUT_DIR exists and is non-empty — **STOP if missing**
5. Check `input_data/` in INPUT_DIR exists and contains at least one file — **STOP if missing**
6. Read **all** files in INPUT_DIR to ground the investigation in the project context
7. Create OUTPUT_DIR and RESEARCH_DIR if they don't exist

**Standalone mode:**
1. Check INPUT_FILE exists and is non-empty — **STOP if missing**
2. Resolve BASE_DIR: use the caller-specified directory if provided; otherwise default to `_standalone/`
3. Resolve OUTPUT_DIR (`BASE_DIR/01_solution/`) and RESEARCH_DIR (`BASE_DIR/00_research/`)
4. Warn if no `restrictions.md` or `acceptance_criteria.md` were provided alongside INPUT_FILE — proceed if user confirms
5. Create BASE_DIR, OUTPUT_DIR, and RESEARCH_DIR if they don't exist
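
The project-mode checks can be sketched as a small validation routine. This is an illustrative sketch only: the skill performs these checks itself, and the `RuntimeError` here simply models the **STOP** behavior.

```python
from pathlib import Path

# Sketch of the project-mode guardrails. The required-file list mirrors
# steps 2-4; a RuntimeError models the STOP behavior.
REQUIRED_FILES = ["problem.md", "restrictions.md", "acceptance_criteria.md"]

def check_project_guardrails(input_dir: str, output_dir: str, research_dir: str) -> None:
    root = Path(input_dir)
    if not root.is_dir():                                    # step 1
        raise RuntimeError(f"STOP: {input_dir} is missing")
    for name in REQUIRED_FILES:                              # steps 2-4
        f = root / name
        if not f.is_file() or f.stat().st_size == 0:
            raise RuntimeError(f"STOP: {name} is missing or empty")
    data = root / "input_data"                               # step 5
    if not data.is_dir() or not any(data.iterdir()):
        raise RuntimeError("STOP: input_data/ is missing or empty")
    # Step 6 (reading all files) is a research action, not a check.
    Path(output_dir).mkdir(parents=True, exist_ok=True)      # step 7
    Path(research_dir).mkdir(parents=True, exist_ok=True)
```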

### Mode Detection

After guardrails pass, determine the execution mode:

1. Scan OUTPUT_DIR for files matching `solution_draft*.md`
2. **No matches found** → **Mode A: Initial Research**
3. **Matches found** → **Mode B: Solution Assessment** (use the highest-numbered draft as input)
4. **User override**: if the user explicitly says "research from scratch" or "initial research", force Mode A regardless of existing drafts

Inform the user which mode was detected and confirm before proceeding.

### Solution Draft Numbering

All final output is saved as `OUTPUT_DIR/solution_draft##.md` with a 2-digit zero-padded number:

1. Scan existing files in OUTPUT_DIR matching `solution_draft*.md`
2. Extract the highest existing number
3. Increment by 1
4. Zero-pad to 2 digits (e.g., `01`, `02`, ..., `10`, `11`)

Example: if `solution_draft01.md` through `solution_draft10.md` exist, the next output is `solution_draft11.md`.
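
The scan, extract, increment, pad procedure can be sketched as follows; it also covers the Mode Detection scan, since both rely on the same glob. A sketch, not part of the skill itself:

```python
import re
from pathlib import Path

# Sketch of the numbering rule above. An empty scan doubles as the
# Mode A signal (no drafts yet); otherwise the highest number wins.
def next_draft_name(output_dir: str) -> str:
    nums = [
        int(m.group(1))
        for p in Path(output_dir).glob("solution_draft*.md")
        if (m := re.fullmatch(r"solution_draft(\d+)\.md", p.name))
    ]
    return f"solution_draft{max(nums, default=0) + 1:02d}.md"
```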

### Working Directory & Intermediate Artifact Management

#### Directory Structure

At the start of research, a working directory **must** be created under RESEARCH_DIR:

```
RESEARCH_DIR/
├── 00_ac_assessment.md            # Mode A Phase 1 output: AC & restrictions assessment
├── 00_question_decomposition.md   # Step 0-1 output
├── 01_source_registry.md          # Step 2 output: all consulted source links
├── 02_fact_cards.md               # Step 3 output: extracted facts
├── 03_comparison_framework.md     # Step 4 output: selected framework and populated data
├── 04_reasoning_chain.md          # Step 6 output: fact → conclusion reasoning
├── 05_validation_log.md           # Step 7 output: use-case validation results
└── raw/                           # Raw source archive (optional)
    ├── source_1.md
    └── source_2.md
```
### Save Timing & Content

| Step | Save immediately after completion | Filename |
|------|-----------------------------------|----------|
| Mode A Phase 1 | AC & restrictions assessment tables | `00_ac_assessment.md` |
| Step 0-1 | Question type classification + sub-question list | `00_question_decomposition.md` |
| Step 2 | Each consulted source link, tier, summary | `01_source_registry.md` |
| Step 3 | Each fact card (statement + source + confidence) | `02_fact_cards.md` |
| Step 4 | Selected comparison framework + initial population | `03_comparison_framework.md` |
| Step 6 | Reasoning process for each dimension | `04_reasoning_chain.md` |
| Step 7 | Validation scenarios + results + review checklist | `05_validation_log.md` |
| Step 8 | Complete solution draft | `OUTPUT_DIR/solution_draft##.md` |

### Save Principles

1. **Save immediately**: Write to the corresponding file as soon as a step is completed; don't wait until the end
2. **Incremental updates**: The same file can be updated multiple times; append or replace content as needed
3. **Preserve process**: Keep intermediate files even after their content is integrated into the final report
4. **Enable recovery**: If research is interrupted, progress can be recovered from intermediate files
### Output Files

**Required files** (automatically generated through the process):

| File | Content | When Generated |
|------|---------|----------------|
| `00_ac_assessment.md` | AC & restrictions assessment (Mode A only) | After Phase 1 completion |
| `00_question_decomposition.md` | Question type, sub-question list | After Step 0-1 completion |
| `01_source_registry.md` | All source links and summaries | Continuously updated during Step 2 |
| `02_fact_cards.md` | Extracted facts and sources | Continuously updated during Step 3 |
| `03_comparison_framework.md` | Selected framework and populated data | After Step 4 completion |
| `04_reasoning_chain.md` | Fact → conclusion reasoning | After Step 6 completion |
| `05_validation_log.md` | Use-case validation and review | After Step 7 completion |
| `OUTPUT_DIR/solution_draft##.md` | Complete solution draft | After Step 8 completion |
| `OUTPUT_DIR/tech_stack.md` | Tech stack evaluation and decisions | After Phase 3 (optional) |
| `OUTPUT_DIR/security_analysis.md` | Threat model and security controls | After Phase 4 (optional) |

**Optional files**:
- `raw/*.md` - Raw source archives (saved when content is lengthy)
@@ -0,0 +1,127 @@
## Mode A: Initial Research

Triggered when no `solution_draft*.md` files exist in OUTPUT_DIR, or when the user explicitly requests initial research.

### Phase 1: AC & Restrictions Assessment (BLOCKING)

**Role**: Professional software architect

A focused preliminary research pass **before** the main solution research. The goal is to validate that the acceptance criteria and restrictions are realistic before designing a solution around them.

**Input**: All files from INPUT_DIR (or INPUT_FILE in standalone mode)

**Task**:
1. Read all problem context files thoroughly
2. **ASK the user about every unclear aspect** — do not assume:
   - Unclear problem boundaries → ask
   - Ambiguous acceptance criteria values → ask
   - Missing context (no `security_approach.md`, no `input_data/`) → ask what they have
   - Conflicting restrictions → ask which takes priority
3. Research the internet **extensively** — use multiple search queries per question, rephrase, and search from different angles:
   - How realistic are the acceptance criteria for this specific domain? Search for industry benchmarks, standards, and typical values
   - How critical is each criterion? Search for case studies where criteria were relaxed or tightened
   - What domain-specific acceptance criteria are we missing? Search for industry standards, regulatory requirements, and best practices in the specific domain
   - Impact of each criterion value on the whole system quality — search for research papers and engineering reports
   - Cost/budget implications of each criterion — search for pricing, total cost of ownership analyses, and comparable project budgets
   - Timeline implications — search for project timelines, development velocity reports, and comparable implementations
   - What do practitioners in this domain consider the most important criteria? Search forums, conference talks, and experience reports
4. Research restrictions from multiple perspectives:
   - Are the restrictions realistic? Search for comparable projects that operated under similar constraints
   - Should any be tightened or relaxed? Search for what constraints similar projects actually ended up with
   - Are there additional restrictions we should add? Search for regulatory, compliance, and safety requirements in this domain
   - What restrictions do practitioners wish they had defined earlier? Search for post-mortem reports and lessons learned
5. Verify findings with authoritative sources (official docs, papers, benchmarks) — each key finding must have at least 2 independent sources

**Uses Steps 0-3 of the 8-step engine** (question classification, decomposition, source tiering, fact extraction) scoped to AC and restrictions assessment.

**Save action**: Write `RESEARCH_DIR/00_ac_assessment.md` with format:

```markdown
# Acceptance Criteria Assessment

## Acceptance Criteria

| Criterion | Our Values | Researched Values | Cost/Timeline Impact | Status |
|-----------|-----------|-------------------|---------------------|--------|
| [name] | [current] | [researched range] | [impact] | Added / Modified / Removed |

## Restrictions Assessment

| Restriction | Our Values | Researched Values | Cost/Timeline Impact | Status |
|-------------|-----------|-------------------|---------------------|--------|
| [name] | [current] | [researched range] | [impact] | Added / Modified / Removed |

## Key Findings
[Summary of critical findings]

## Sources
[Key references used]
```

**BLOCKING**: Present the AC assessment tables to the user. Wait for confirmation or adjustments before proceeding to Phase 2. The user may update `acceptance_criteria.md` or `restrictions.md` based on findings.

---

### Phase 2: Problem Research & Solution Draft

**Role**: Professional researcher and software architect

Full 8-step research methodology. Produces the first solution draft.

**Input**: All files from INPUT_DIR (possibly updated after Phase 1) + Phase 1 artifacts

**Task** (drives the 8-step engine):
1. Research existing/competitor solutions for similar problems — search broadly across industries and adjacent domains, not just the obvious competitors
2. Research the problem thoroughly — all possible ways to solve it, split into components; search for how different fields approach analogous problems
3. For each component, research all possible solutions and find the most efficient state-of-the-art approaches — use multiple query variants and perspectives from Step 1
4. For each promising approach, search for real-world deployment experience: success stories, failure reports, lessons learned, and practitioner opinions
5. Search for contrarian viewpoints — who argues against the common approaches and why? What failure modes exist?
6. Verify that suggested tools/libraries actually exist and work as described — check official repos, latest releases, and community health (stars, recent commits, open issues)
7. Include security considerations in each component analysis
8. Provide rough cost estimates for proposed solutions

Be concise in formulation: the fewer words the better, but do not omit any important details.

**Save action**: Write `OUTPUT_DIR/solution_draft##.md` using template: `templates/solution_draft_mode_a.md`

---
### Phase 3: Tech Stack Consolidation (OPTIONAL)

**Role**: Software architect evaluating technology choices

Focused synthesis step — no new 8-step cycle. Uses research already gathered in Phase 2 to make concrete technology decisions.

**Input**: Latest `solution_draft##.md` from OUTPUT_DIR + all files from INPUT_DIR

**Task**:
1. Extract technology options from the solution draft's component comparison tables
2. Score each option against: fitness for purpose, maturity, security track record, team expertise, cost, scalability
3. Produce a tech stack summary with selection rationale
4. Assess risks and learning requirements per technology choice

**Save action**: Write `OUTPUT_DIR/tech_stack.md` with:
- Requirements analysis (functional, non-functional, constraints)
- Technology evaluation tables (language, framework, database, infrastructure, key libraries) with scores
- Tech stack summary block
- Risk assessment and learning requirements tables

---

### Phase 4: Security Deep Dive (OPTIONAL)

**Role**: Security architect

Focused analysis step — deepens the security column from the solution draft into a proper threat model and controls specification.

**Input**: Latest `solution_draft##.md` from OUTPUT_DIR + `security_approach.md` from INPUT_DIR + problem context

**Task**:
1. Build threat model: asset inventory, threat actors, attack vectors
2. Define security requirements and proposed controls per component (with risk level)
3. Summarize authentication/authorization, data protection, secure communication, and logging/monitoring approach

**Save action**: Write `OUTPUT_DIR/security_analysis.md` with:
- Threat model (assets, actors, vectors)
- Per-component security requirements and controls table
- Security controls summary
@@ -0,0 +1,27 @@
## Mode B: Solution Assessment

Triggered when `solution_draft*.md` files exist in OUTPUT_DIR.

**Role**: Professional software architect

Full 8-step research methodology applied to assessing and improving an existing solution draft.

**Input**: All files from INPUT_DIR + the latest (highest-numbered) `solution_draft##.md` from OUTPUT_DIR

**Task** (drives the 8-step engine):
1. Read the existing solution draft thoroughly
2. Research the internet extensively — for each component/decision in the draft, search for:
   - Known problems and limitations of the chosen approach
   - What practitioners say about using it in production
   - Better alternatives that may have emerged recently
   - Common failure modes and edge cases
   - How competitors/similar projects solve the same problem differently
3. Search specifically for contrarian views: "why not [chosen approach]", "[chosen approach] criticism", "[chosen approach] failure"
4. Identify security weak points and vulnerabilities — search for CVEs, security advisories, and known attack vectors for each technology in the draft
5. Identify performance bottlenecks — search for benchmarks, load test results, and scalability reports
6. For each identified weak point, search for multiple solution approaches and compare them
7. Based on findings, form a new solution draft in the same format

**Save action**: Write `OUTPUT_DIR/solution_draft##.md` (incremented) using template: `templates/solution_draft_mode_b.md`

**Optional follow-up**: After Mode B completes, the user can request Phase 3 (Tech Stack Consolidation) or Phase 4 (Security Deep Dive) using the revised draft. These phases work identically to their Mode A descriptions in `steps/01_mode-a-initial-research.md`.
@@ -0,0 +1,227 @@
## Research Engine — Investigation Phase (Steps 0–3.5)

### Step 0: Question Type Classification

First, classify the research question type and select the corresponding strategy:

| Question Type | Core Task | Focus Dimensions |
|---------------|-----------|------------------|
| **Concept Comparison** | Build comparison framework | Mechanism differences, applicability boundaries |
| **Decision Support** | Weigh trade-offs | Cost, risk, benefit |
| **Trend Analysis** | Map evolution trajectory | History, driving factors, predictions |
| **Problem Diagnosis** | Root cause analysis | Symptoms, causes, evidence chain |
| **Knowledge Organization** | Systematic structuring | Definitions, classifications, relationships |

**Mode-specific classification**:

| Mode / Phase | Typical Question Type |
|--------------|----------------------|
| Mode A Phase 1 | Knowledge Organization + Decision Support |
| Mode A Phase 2 | Decision Support |
| Mode B | Problem Diagnosis + Decision Support |

### Step 0.5: Novelty Sensitivity Assessment (BLOCKING)

Before starting research, assess the novelty sensitivity of the question (Critical/High/Medium/Low). This determines source time windows and filtering strategy.

**For full classification table, critical-domain rules, trigger words, and assessment template**: Read `references/novelty-sensitivity.md`

Key principle: Critical-sensitivity topics (AI/LLMs, blockchain) require sources within 6 months, mandatory version annotations, cross-validation from 2+ sources, and direct verification of official download pages.

**Save action**: Append timeliness assessment to the end of `00_question_decomposition.md`

---
### Step 1: Question Decomposition & Boundary Definition

**Mode-specific sub-questions**:

**Mode A Phase 2** (Initial Research — Problem & Solution):
- "What existing/competitor solutions address this problem?"
- "What are the component parts of this problem?"
- "For each component, what are the state-of-the-art solutions?"
- "What are the security considerations per component?"
- "What are the cost implications of each approach?"

**Mode B** (Solution Assessment):
- "What are the weak points and potential problems in the existing draft?"
- "What are the security vulnerabilities in the proposed architecture?"
- "Where are the performance bottlenecks?"
- "What solutions exist for each identified issue?"

**General sub-question patterns** (use when applicable):
- **Sub-question A**: "What is X and how does it work?" (Definition & mechanism)
- **Sub-question B**: "What are the dimensions of relationship/difference between X and Y?" (Comparative analysis)
- **Sub-question C**: "In what scenarios is X applicable/inapplicable?" (Boundary conditions)
- **Sub-question D**: "What are X's development trends/best practices?" (Extended analysis)

#### Perspective Rotation (MANDATORY)

For each research problem, examine it from **at least 3 different perspectives**. Each perspective generates its own sub-questions and search queries.

| Perspective | What it asks | Example queries |
|-------------|-------------|-----------------|
| **End-user / Consumer** | What problems do real users encounter? What do they wish were different? | "X problems", "X frustrations reddit", "X user complaints" |
| **Implementer / Engineer** | What are the technical challenges, gotchas, hidden complexities? | "X implementation challenges", "X pitfalls", "X lessons learned" |
| **Business / Decision-maker** | What are the costs, ROI, strategic implications? | "X total cost of ownership", "X ROI case study", "X vs Y business comparison" |
| **Contrarian / Devil's advocate** | What could go wrong? Why might this fail? What are critics saying? | "X criticism", "why not X", "X failures", "X disadvantages real world" |
| **Domain expert / Academic** | What does peer-reviewed research say? What are theoretical limits? | "X research paper", "X systematic review", "X benchmarks academic" |
| **Practitioner / Field** | What do people who actually use this daily say? What works in practice vs theory? | "X in production", "X experience report", "X after 1 year" |

Select at least 3 perspectives relevant to the problem. Document the chosen perspectives in `00_question_decomposition.md`.
#### Question Explosion (MANDATORY)

For **each sub-question**, generate **at least 3-5 search query variants** before searching. This ensures broad coverage and avoids missing relevant information due to terminology differences.

**Query variant strategies**:
- **Specificity ladder**: broad ("indoor navigation systems") → narrow ("UWB-based indoor drone navigation accuracy")
- **Negation/failure**: "X limitations", "X failure modes", "when X doesn't work"
- **Comparison framing**: "X vs Y for Z", "X alternative for Z", "X or Y which is better for Z"
- **Practitioner voice**: "X in production experience", "X real-world results", "X lessons learned"
- **Temporal**: "X 2025", "X latest developments", "X roadmap"
- **Geographic/domain**: "X in Europe", "X for defense applications", "X in agriculture"

Record all planned queries in `00_question_decomposition.md` alongside each sub-question.
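
The variant strategies can be pictured as a small template expander. The template strings below are illustrative examples of the strategies, not a fixed list mandated by the skill:

```python
# Sketch: expanding one sub-question topic into query variants using the
# strategies above. The templates are illustrative, not a fixed list.
def query_variants(topic: str, scenario: str = "", year: int = 2025) -> list:
    variants = [
        topic,                                # broad end of the specificity ladder
        f"{topic} limitations",               # negation/failure framing
        f"{topic} in production experience",  # practitioner voice
        f"{topic} {year}",                    # temporal framing
    ]
    if scenario:                              # comparison framing needs a context
        variants.append(f"{topic} alternative for {scenario}")
    return variants
```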

**Research Subject Boundary Definition (BLOCKING - must be explicit)**:

When decomposing questions, you must explicitly define the **boundaries of the research subject**:

| Dimension | Boundary to define | Example |
|-----------|--------------------|---------|
| **Population** | Which group is being studied? | University students vs K-12 vs vocational students vs all students |
| **Geography** | Which region is being studied? | Chinese universities vs US universities vs global |
| **Timeframe** | Which period is being studied? | Post-2020 vs full historical picture |
| **Level** | Which level is being studied? | Undergraduate vs graduate vs vocational |

**Common mistake**: User asks about "university classroom issues" but sources include policies targeting "K-12 students" — mismatched target populations will invalidate the entire research.

**Save action**:
1. Read all files from INPUT_DIR to ground the research in the project context
2. Create working directory `RESEARCH_DIR/`
3. Write `00_question_decomposition.md`, including:
   - Original question
   - Active mode (A Phase 2 or B) and rationale
   - Summary of relevant problem context from INPUT_DIR
   - Classified question type and rationale
   - **Research subject boundary definition** (population, geography, timeframe, level)
   - List of decomposed sub-questions
   - **Chosen perspectives** (at least 3 from the Perspective Rotation table) with rationale
   - **Search query variants** for each sub-question (at least 3-5 per sub-question)
4. Use TodoWrite to track progress

---
### Step 2: Source Tiering & Exhaustive Web Investigation
|
||||
|
||||
Tier sources by authority, **prioritize primary sources** (L1 > L2 > L3 > L4). Conclusions must be traceable to L1/L2; L3/L4 serve as supplementary and validation.
|
||||
|
||||
**For full tier definitions, search strategies, community mining steps, and source registry templates**: Read `references/source-tiering.md`
|
||||
|
||||
**Tool Usage**:
|
||||
- Use `WebSearch` for broad searches; `WebFetch` to read specific pages
|
||||
- Use the `context7` MCP server (`resolve-library-id` then `get-library-docs`) for up-to-date library/framework documentation
|
||||
- Always cross-verify training data claims against live sources for facts that may have changed (versions, APIs, deprecations, security advisories)
|
||||
- When citing web sources, include the URL and date accessed
|
||||
|

#### Exhaustive Search Requirements (MANDATORY)

Do not stop at the first few results. The goal is to build a comprehensive evidence base.

**Minimum search effort per sub-question**:
- Execute **all** query variants generated in Step 1's Question Explosion (at least 3-5 per sub-question)
- Consult at least **2 different source tiers** per sub-question (e.g., L1 official docs + L4 community discussion)
- If initial searches yield fewer than 3 relevant sources for a sub-question, **broaden the search** with alternative terms, related domains, or analogous problems

**Search broadening strategies** (use when results are thin):
- Try adjacent fields: if researching "drone indoor navigation", also search "robot indoor navigation", "warehouse AGV navigation"
- Try different communities: academic papers, industry whitepapers, military/defense publications, hobbyist forums
- Try different geographies: search in English + search for European/Asian approaches if relevant
- Try historical evolution: "history of X", "evolution of X approaches", "X state of the art 2024 2025"
- Try failure analysis: "X project failure", "X post-mortem", "X recall", "X incident report"

**Search saturation rule**: Continue searching until new queries stop producing substantially new information. If the last 3 searches only repeat previously found facts, the sub-question is saturated.
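The saturation rule can be expressed as a small check. This is an illustrative sketch that assumes each search is recorded as the set of fact IDs it produced (a hypothetical representation, not a prescribed format):

```python
def is_saturated(search_results: list[set[str]], window: int = 3) -> bool:
    """Return True when the last `window` searches added no new fact IDs.

    `search_results` is the chronological list of fact-ID sets produced
    by each search.
    """
    if len(search_results) < window:
        return False
    # Everything known before the most recent `window` searches
    seen_before = set().union(*search_results[:-window])
    # Anything genuinely new contributed by the recent searches
    recent_new = set().union(*search_results[-window:]) - seen_before
    return not recent_new
```

If the recent window only repeats known facts, the sub-question is saturated and searching moves on.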

**Save action**:
For each source consulted, **immediately** append to `01_source_registry.md` using the entry template from `references/source-tiering.md`.

---

### Step 3: Fact Extraction & Evidence Cards

Transform sources into **verifiable fact cards**:

```markdown
## Fact Cards

### Fact 1
- **Statement**: [specific fact description]
- **Source**: [link/document section]
- **Confidence**: High/Medium/Low

### Fact 2
...
```

**Key discipline**:
- Pin down facts first, then reason
- Distinguish "what officials said" from "what I infer"
- When conflicting information is found, annotate and preserve both sides
- Annotate confidence level:
  - ✅ High: Explicitly stated in official documentation
  - ⚠️ Medium: Mentioned in official blog but not formally documented
  - ❓ Low: Inference or from unofficial sources

**Save action**:
For each extracted fact, **immediately** append to `02_fact_cards.md`:

```markdown
## Fact #[number]
- **Statement**: [specific fact description]
- **Source**: [Source #number] [link]
- **Phase**: [Phase 1 / Phase 2 / Assessment]
- **Target Audience**: [which group this fact applies to, inherited from source or further refined]
- **Confidence**: ✅/⚠️/❓
- **Related Dimension**: [corresponding comparison dimension]
```

**Target audience in fact statements**:
- If a fact comes from a "partially overlapping" or "reference only" source, the statement **must explicitly annotate the applicable scope**
- Wrong: "The Ministry of Education banned phones in classrooms" (doesn't specify who)
- Correct: "The Ministry of Education banned K-12 students from bringing phones into classrooms (does not apply to university students)"
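The card template maps naturally onto a small record type. This is an illustrative sketch (field names mirror the template above; nothing here is a prescribed API):

```python
from dataclasses import dataclass

CONFIDENCE = {"high": "✅", "medium": "⚠️", "low": "❓"}

@dataclass
class FactCard:
    number: int
    statement: str          # must state the applicable scope explicitly
    source: str             # e.g. "Source #3 <link>"
    phase: str              # "Phase 1" / "Phase 2" / "Assessment"
    target_audience: str    # inherited from source or further refined
    confidence: str         # "high" / "medium" / "low"
    related_dimension: str

    def to_markdown(self) -> str:
        """Render one entry in the 02_fact_cards.md format."""
        return "\n".join([
            f"## Fact #{self.number}",
            f"- **Statement**: {self.statement}",
            f"- **Source**: {self.source}",
            f"- **Phase**: {self.phase}",
            f"- **Target Audience**: {self.target_audience}",
            f"- **Confidence**: {CONFIDENCE[self.confidence]}",
            f"- **Related Dimension**: {self.related_dimension}",
        ])
```

Keeping the scope annotation inside `statement` is what makes the K-12 vs university mistake visible at a glance.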

---

### Step 3.5: Iterative Deepening — Follow-Up Investigation

After initial fact extraction, review what you have found and identify **knowledge gaps and new questions** that emerged from the initial research. This step ensures the research doesn't stop at surface-level findings.

**Process**:

1. **Gap analysis**: Review fact cards and identify:
   - Sub-questions with fewer than 3 high-confidence facts → need more searching
   - Contradictions between sources → need tie-breaking evidence
   - Perspectives (from Step 1) that have no or weak coverage → need targeted search
   - Claims that rely only on L3/L4 sources → need L1/L2 verification

2. **Follow-up question generation**: Based on initial findings, generate new questions:
   - "Source X claims [fact] — is this consistent with other evidence?"
   - "If [approach A] has [limitation], how do practitioners work around it?"
   - "What are the second-order effects of [finding]?"
   - "Who disagrees with [common finding] and why?"
   - "What happened when [solution] was deployed at scale?"

3. **Targeted deep-dive searches**: Execute follow-up searches focusing on:
   - Specific claims that need verification
   - Alternative viewpoints not yet represented
   - Real-world case studies and experience reports
   - Failure cases and edge conditions
   - Recent developments that may change the picture

4. **Update artifacts**: Append new sources to `01_source_registry.md`, new facts to `02_fact_cards.md`

**Exit criteria**: Proceed to Step 4 when:
- Every sub-question has at least 3 facts, with at least one from L1/L2
- At least 3 perspectives from Step 1 have supporting evidence
- No unresolved contradictions remain (or they are explicitly documented as open questions)
- Follow-up searches are no longer producing new substantive information
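The first three exit criteria are mechanical and can be sketched as a checklist function. The data shapes are hypothetical stand-ins for the fact cards and perspective list:

```python
def ready_for_step_4(facts_by_subq: dict[str, list[str]],
                     covered_perspectives: set[str],
                     open_contradictions: list[str]) -> bool:
    """Check the Step 3.5 exit criteria.

    `facts_by_subq` maps each sub-question to the source tiers
    ("L1".."L4") of its supporting facts.
    """
    every_subq_grounded = all(
        len(tiers) >= 3 and any(t in ("L1", "L2") for t in tiers)
        for tiers in facts_by_subq.values()
    )
    return (every_subq_grounded
            and len(covered_perspectives) >= 3
            and not open_contradictions)
```

The fourth criterion (follow-up saturation) is the same saturation rule used in Step 2 and is judged per search history rather than per fact card.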

@@ -0,0 +1,146 @@
## Research Engine — Analysis Phase (Steps 4–8)

### Step 4: Build Comparison/Analysis Framework

Based on the question type, select fixed analysis dimensions. **For dimension lists** (General, Concept Comparison, Decision Support): Read `references/comparison-frameworks.md`

**Save action**:
Write to `03_comparison_framework.md`:

```markdown
# Comparison Framework

## Selected Framework Type
[Concept Comparison / Decision Support / ...]

## Selected Dimensions
1. [Dimension 1]
2. [Dimension 2]
...

## Initial Population
| Dimension | X | Y | Factual Basis |
|-----------|---|---|---------------|
| [Dimension 1] | [description] | [description] | Fact #1, #3 |
| ... | | | |
```

---

### Step 5: Reference Point Baseline Alignment

Ensure all compared parties have clear, consistent definitions:

**Checklist**:
- [ ] Is the reference point's definition stable/widely accepted?
- [ ] Does it need verification, or can domain common knowledge be used?
- [ ] Does the reader's understanding of the reference point match mine?
- [ ] Are there ambiguities that need to be clarified first?

---

### Step 6: Fact-to-Conclusion Reasoning Chain

Explicitly write out the "fact → comparison → conclusion" reasoning process:

```markdown
## Reasoning Process

### Regarding [Dimension Name]

1. **Fact confirmation**: According to [source], X's mechanism is...
2. **Compare with reference**: While Y's mechanism is...
3. **Conclusion**: Therefore, the difference between X and Y on this dimension is...
```

**Key discipline**:
- Conclusions come from mechanism comparison, not "gut feelings"
- Every conclusion must be traceable to specific facts
- Uncertain conclusions must be annotated

**Save action**:
Write to `04_reasoning_chain.md`:

```markdown
# Reasoning Chain

## Dimension 1: [Dimension Name]

### Fact Confirmation
According to [Fact #X], X's mechanism is...

### Reference Comparison
While Y's mechanism is... (Source: [Fact #Y])

### Conclusion
Therefore, the difference between X and Y on this dimension is...

### Confidence
✅/⚠️/❓ + rationale

---

## Dimension 2: [Dimension Name]
...
```

---

### Step 7: Use-Case Validation (Sanity Check)

Validate conclusions against a typical scenario:

**Validation questions**:
- Based on my conclusions, how should this scenario be handled?
- Is that actually the case?
- Are there counterexamples that need to be addressed?

**Review checklist**:
- [ ] Are draft conclusions consistent with Step 3 fact cards?
- [ ] Are there any important dimensions missed?
- [ ] Is there any over-extrapolation?
- [ ] Are conclusions actionable/verifiable?

**Save action**:
Write to `05_validation_log.md`:

```markdown
# Validation Log

## Validation Scenario
[Scenario description]

## Expected Based on Conclusions
If using X: [expected behavior]
If using Y: [expected behavior]

## Actual Validation Results
[actual situation]

## Counterexamples
[yes/no, describe if yes]

## Review Checklist
- [x] Draft conclusions consistent with fact cards
- [x] No important dimensions missed
- [x] No over-extrapolation
- [ ] Issue found: [if any]

## Conclusions Requiring Revision
[if any]
```

---

### Step 8: Deliverable Formatting

Make the output **readable, traceable, and actionable**.

**Save action**:
Integrate all intermediate artifacts. Write to `OUTPUT_DIR/solution_draft##.md` using the appropriate output template based on the active mode:
- Mode A: `templates/solution_draft_mode_a.md`
- Mode B: `templates/solution_draft_mode_b.md`

Sources to integrate:
- Extract background from `00_question_decomposition.md`
- Reference key facts from `02_fact_cards.md`
- Organize conclusions from `04_reasoning_chain.md`
- Generate references from `01_source_registry.md`
- Supplement with use cases from `05_validation_log.md`
- For Mode A: include AC assessment from `00_ac_assessment.md`
@@ -4,7 +4,7 @@ description: |
  Collect metrics from implementation batch reports and code review findings, analyze trends across cycles,
  and produce improvement reports with actionable recommendations.
  3-step workflow: collect metrics, analyze trends, produce report.
  Outputs to _docs/05_metrics/.
  Outputs to _docs/06_metrics/.
  Trigger phrases:
  - "retrospective", "retro", "run retro"
  - "metrics review", "feedback loop"
@@ -31,7 +31,7 @@ Collect metrics from implementation artifacts, analyze trends across development
Fixed paths:

- IMPL_DIR: `_docs/03_implementation/`
- METRICS_DIR: `_docs/05_metrics/`
- METRICS_DIR: `_docs/06_metrics/`
- TASKS_DIR: `_docs/02_tasks/`

Announce the resolved paths to the user before proceeding.
@@ -166,7 +166,7 @@ Present the report summary to the user.
│                                                                │
│  1. Collect Metrics → parse batch reports, compute metrics     │
│  2. Analyze Trends  → patterns, comparison, improvement areas  │
│  3. Produce Report  → _docs/05_metrics/retro_[date].md         │
│  3. Produce Report  → _docs/06_metrics/retro_[date].md         │
├────────────────────────────────────────────────────────────────┤
│  Principles: Data-driven · Actionable · Cumulative             │
│              Non-judgmental · Save immediately                 │

@@ -1,130 +0,0 @@
---
name: rollback
description: |
  Revert implementation to a specific batch checkpoint using git revert, reset Jira ticket statuses,
  verify rollback integrity with tests, and produce a rollback report.
  Trigger phrases:
  - "rollback", "revert", "revert batch"
  - "undo implementation", "roll back to batch"
category: build
tags: [rollback, revert, recovery, implementation]
disable-model-invocation: true
---

# Implementation Rollback

Revert the codebase to a specific batch checkpoint, reset Jira statuses for reverted tasks, and verify integrity.

## Core Principles

- **Preserve history**: always use `git revert`, never force-push
- **Verify after revert**: run the full test suite after every rollback
- **Update tracking**: reset Jira ticket statuses for all reverted tasks
- **Atomic rollback**: if rollback fails midway, stop and report — do not leave the codebase in a partial state
- **Ask, don't assume**: if the target batch is ambiguous, present options and ask

## Context Resolution

- IMPL_DIR: `_docs/03_implementation/`
- Batch reports: `IMPL_DIR/batch_*_report.md`

## Prerequisite Checks (BLOCKING)

1. IMPL_DIR exists and contains at least one `batch_*_report.md` — **STOP if missing**
2. Git working tree is clean (no uncommitted changes) — **STOP if dirty**, ask user to commit or stash

## Input

- User specifies a target batch number or commit hash
- If not specified, present the list of available batch checkpoints and ask

## Workflow

### Step 1: Identify Checkpoints

1. Read all `batch_*_report.md` files from IMPL_DIR
2. Extract: batch number, date, tasks included, commit hash, code review verdict
3. Present the batch list to the user

**BLOCKING**: User must confirm which batch to roll back to.

### Step 2: Revert Commits

1. Determine which commits need to be reverted (all commits after the target batch)
2. For each commit in reverse chronological order:
   - Run `git revert <commit-hash> --no-edit`
   - If merge conflicts occur: present conflicts and ask user for resolution
3. If any revert fails and cannot be resolved, abort the rollback sequence with `git revert --abort` and report
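As a minimal sketch of the revert loop and its abort-on-failure rule (an illustration of the atomic-rollback principle, not the skill's mandated implementation; the caller is assumed to have already resolved the commit hashes newest-first):

```python
import subprocess

def revert_to_checkpoint(commits_after_target: list[str]) -> bool:
    """Revert each commit made after the target batch, newest first.

    Returns True on success. On any failure (e.g. a merge conflict),
    aborts the in-progress revert so the working tree is never left
    in a partial state.
    """
    for commit in commits_after_target:
        result = subprocess.run(
            ["git", "revert", commit, "--no-edit"],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            # Clean up the half-applied revert before reporting failure
            subprocess.run(["git", "revert", "--abort"])
            return False
    return True
```

Using `git revert` (rather than `reset`) is what preserves history: each rollback is itself a new commit.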

### Step 3: Verify Integrity

1. Run the full test suite
2. If tests fail: report failures to the user, ask how to proceed (fix or abort)
3. If tests pass: continue

### Step 4: Update Jira

1. Identify all tasks from reverted batches
2. Reset each task's Jira ticket status to "To Do" via Jira MCP

### Step 5: Finalize

1. Commit with message: `[ROLLBACK] Reverted to batch [N]: [task list]`
2. Write rollback report to `IMPL_DIR/rollback_report.md`

## Output

Write `_docs/03_implementation/rollback_report.md`:

```markdown
# Rollback Report

**Date**: [YYYY-MM-DD]
**Target**: Batch [N] (commit [hash])
**Reverted Batches**: [list]

## Reverted Tasks

| Task | Batch | Status Before | Status After |
|------|-------|---------------|--------------|
| [JIRA-ID] | [batch #] | In Testing | To Do |

## Test Results
- [pass/fail count]

## Jira Updates
- [list of ticket transitions]

## Notes
- [any conflicts, manual steps, or issues encountered]
```

## Escalation Rules

| Situation | Action |
|-----------|--------|
| No batch reports exist | **STOP** — nothing to roll back |
| Uncommitted changes in working tree | **STOP** — ask user to commit or stash |
| Merge conflicts during revert | **ASK user** for resolution |
| Tests fail after rollback | **ASK user** — fix or abort |
| Rollback fails midway | Abort with `git revert --abort`, report to user |

## Methodology Quick Reference

```
┌────────────────────────────────────────────────────────────────┐
│                    Rollback (5-Step Method)                    │
├────────────────────────────────────────────────────────────────┤
│  PREREQ: batch reports exist, clean working tree               │
│                                                                │
│  1. Identify Checkpoints → present batch list                  │
│     [BLOCKING: user confirms target batch]                     │
│  2. Revert Commits   → git revert per commit                   │
│  3. Verify Integrity → run full test suite                     │
│  4. Update Jira      → reset statuses to "To Do"               │
│  5. Finalize         → commit + rollback_report.md             │
├────────────────────────────────────────────────────────────────┤
│  Principles: Preserve history · Verify after revert            │
│              Atomic rollback · Ask don't assume                │
└────────────────────────────────────────────────────────────────┘
```
@@ -1,300 +1,347 @@
---
name: security-testing
description: "Test for security vulnerabilities using OWASP principles. Use when conducting security audits, testing auth, or implementing security practices."
category: specialized-testing
priority: critical
tokenEstimate: 1200
agents: [qe-security-scanner, qe-api-contract-validator, qe-quality-analyzer]
implementation_status: optimized
optimization_version: 1.0
last_optimized: 2025-12-02
dependencies: []
quick_reference_card: true
tags: [security, owasp, sast, dast, vulnerabilities, auth, injection]
trust_tier: 3
validation:
  schema_path: schemas/output.json
  validator_path: scripts/validate-config.json
  eval_path: evals/security-testing.yaml
name: security
description: |
  OWASP-based security audit skill. Analyzes codebase for vulnerabilities across dependency scanning,
  static analysis, OWASP Top 10 review, and secrets detection. Produces a structured security report
  with severity-ranked findings and remediation guidance.
  Can be invoked standalone or as part of the autopilot flow (optional step before deploy).
  Trigger phrases:
  - "security audit", "security scan", "OWASP review"
  - "vulnerability scan", "security check"
  - "check for vulnerabilities", "pentest"
category: review
tags: [security, owasp, sast, vulnerabilities, auth, injection, secrets]
disable-model-invocation: true
---

# Security Testing
# Security Audit

<default_to_action>
When testing security or conducting audits:
1. TEST OWASP Top 10 vulnerabilities systematically
2. VALIDATE authentication and authorization on every endpoint
3. SCAN dependencies for known vulnerabilities (npm audit)
4. CHECK for injection attacks (SQL, XSS, command)
5. VERIFY secrets aren't exposed in code/logs
Analyze the codebase for security vulnerabilities using OWASP principles. Produces a structured report with severity-ranked findings, remediation suggestions, and a security checklist verdict.

**Quick Security Checks:**
- Access control → Test horizontal/vertical privilege escalation
- Crypto → Verify password hashing, HTTPS, no sensitive data exposed
- Injection → Test SQL injection, XSS, command injection
- Auth → Test weak passwords, session fixation, MFA enforcement
- Config → Check error messages don't leak info
## Core Principles

**Critical Success Factors:**
- Think like an attacker, build like a defender
- Security is built in, not added at the end
- Test continuously in CI/CD, not just before release
</default_to_action>
- **OWASP-driven**: use the current OWASP Top 10 as the primary framework — verify the latest version at https://owasp.org/www-project-top-ten/ at audit start
- **Evidence-based**: every finding must reference a specific file, line, or configuration
- **Severity-ranked**: findings sorted Critical > High > Medium > Low
- **Actionable**: every finding includes a concrete remediation suggestion
- **Save immediately**: write artifacts to disk after each phase; never accumulate unsaved work
- **Complement, don't duplicate**: the `/code-review` skill does a lightweight security quick-scan; this skill goes deeper

## Quick Reference Card
## Context Resolution

### When to Use
- Security audits and penetration testing
- Testing authentication/authorization
- Validating input sanitization
- Reviewing security configuration
**Project mode** (default):
- PROBLEM_DIR: `_docs/00_problem/`
- SOLUTION_DIR: `_docs/01_solution/`
- DOCUMENT_DIR: `_docs/02_document/`
- SECURITY_DIR: `_docs/05_security/`

### OWASP Top 10
Use the most recent **stable** version of the OWASP Top 10. At the start of each security audit, research the current version at https://owasp.org/www-project-top-ten/ and test against all listed categories. Do not rely on a hardcoded list — the OWASP Top 10 is updated periodically and the current version must be verified.
**Standalone mode** (explicit target provided, e.g. `/security @src/api/`):
- TARGET: the provided path
- SECURITY_DIR: `_standalone/security/`

### Tools
| Type | Tool | Purpose |
|------|------|---------|
| SAST | SonarQube, Semgrep | Static code analysis |
| DAST | OWASP ZAP, Burp | Dynamic scanning |
| Deps | npm audit, Snyk | Dependency vulnerabilities |
| Secrets | git-secrets, TruffleHog | Secret scanning |
Announce the detected mode and resolved paths to the user before proceeding.

### Agent Coordination
- `qe-security-scanner`: Multi-layer SAST/DAST scanning
- `qe-api-contract-validator`: API security testing
- `qe-quality-analyzer`: Security code review
## Prerequisite Checks

1. Codebase must contain source code files — **STOP if empty**
2. Create SECURITY_DIR if it does not exist
3. If SECURITY_DIR already contains artifacts, ask user: **resume, overwrite, or skip?**
4. If `_docs/00_problem/security_approach.md` exists, read it for project-specific security requirements

## Progress Tracking

At the start of execution, create a TodoWrite with all phases (1 through 5). Update status as each phase completes.

## Workflow

### Phase 1: Dependency Scan

**Role**: Security analyst
**Goal**: Identify known vulnerabilities in project dependencies
**Constraints**: Scan only — no code changes

1. Detect the project's package manager(s): `requirements.txt`, `package.json`, `Cargo.toml`, `*.csproj`, `go.mod`
2. Run the appropriate audit tool:
   - Python: `pip-audit` or `safety check`
   - Node: `npm audit`
   - Rust: `cargo audit`
   - .NET: `dotnet list package --vulnerable`
   - Go: `govulncheck`
3. If no audit tool is available, manually inspect dependency files for known CVEs using WebSearch
4. Record findings with CVE IDs, affected packages, severity, and recommended upgrade versions
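The manifest-to-tool mapping in step 2 can be sketched as a small planner. This is an illustrative subset of the table above (e.g. `*.csproj` globbing is omitted), not a complete detector:

```python
from pathlib import Path

# Manifest filename → audit command, per the list above (subset)
AUDIT_TOOLS = {
    "requirements.txt": ["pip-audit"],
    "package.json": ["npm", "audit"],
    "Cargo.toml": ["cargo", "audit"],
    "go.mod": ["govulncheck", "./..."],
}

def plan_dependency_scans(project_root: str) -> list[list[str]]:
    """Return the audit commands to run for each manifest found at the root."""
    root = Path(project_root)
    return [cmd for name, cmd in AUDIT_TOOLS.items() if (root / name).exists()]
```

A polyglot repository simply yields several commands, one per detected ecosystem.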

**Self-verification**:
- [ ] All package manifests scanned
- [ ] Each finding has a CVE ID or advisory reference
- [ ] Upgrade paths identified for Critical/High findings

**Save action**: Write `SECURITY_DIR/dependency_scan.md`

---

## Key Vulnerability Tests
### Phase 2: Static Analysis (SAST)

### 1. Broken Access Control
```javascript
// Horizontal escalation - User A accessing User B's data
test('user cannot access another user\'s order', async () => {
  const userAToken = await login('userA');
  const userBOrder = await createOrder('userB');
**Role**: Security engineer
**Goal**: Identify code-level vulnerabilities through static analysis
**Constraints**: Analysis only — no code changes

  const response = await api.get(`/orders/${userBOrder.id}`, {
    headers: { Authorization: `Bearer ${userAToken}` }
  });
  expect(response.status).toBe(403);
});
Scan the codebase for these vulnerability patterns:

// Vertical escalation - Regular user accessing admin
test('regular user cannot access admin', async () => {
  const userToken = await login('regularUser');
  expect((await api.get('/admin/users', {
    headers: { Authorization: `Bearer ${userToken}` }
  })).status).toBe(403);
});
```
**Injection**:
- SQL injection via string interpolation or concatenation
- Command injection (subprocess with shell=True, exec, eval, os.system)
- XSS via unsanitized user input in HTML output
- Template injection

### 2. Injection Attacks
```javascript
// SQL Injection
test('prevents SQL injection', async () => {
  const malicious = "' OR '1'='1";
  const response = await api.get(`/products?search=${malicious}`);
  expect(response.body.length).toBeLessThan(100); // Not all products
});
**Authentication & Authorization**:
- Hardcoded credentials, API keys, passwords, tokens
- Missing authentication checks on endpoints
- Missing authorization checks (horizontal/vertical escalation paths)
- Weak password validation rules

// XSS
test('sanitizes HTML output', async () => {
  const xss = '<script>alert("XSS")</script>';
  await api.post('/comments', { text: xss });
**Cryptographic Failures**:
- Plaintext password storage (no hashing)
- Weak hashing algorithms (MD5, SHA1 for passwords)
- Hardcoded encryption keys or salts
- Missing TLS/HTTPS enforcement

  const html = (await api.get('/comments')).body;
  expect(html).toContain('<script>');
  expect(html).not.toContain('<script>');
});
```
**Data Exposure**:
- Sensitive data in logs or error messages (passwords, tokens, PII)
- Sensitive fields in API responses (password hashes, SSNs)
- Debug endpoints or verbose error messages in production configs
- Secrets in version control (.env files, config with credentials)

### 3. Cryptographic Failures
```javascript
test('passwords are hashed', async () => {
  await db.users.create({ email: 'test@example.com', password: 'MyPassword123' });
  const user = await db.users.findByEmail('test@example.com');
**Insecure Deserialization**:
- Pickle/marshal deserialization of untrusted data
- JSON/XML parsing without size limits

  expect(user.password).not.toBe('MyPassword123');
  expect(user.password).toMatch(/^\$2[aby]\$\d{2}\$/); // bcrypt
});
**Self-verification**:
- [ ] All source directories scanned
- [ ] Each finding has file path and line number
- [ ] No false positives from test files or comments

test('no sensitive data in API response', async () => {
  const response = await api.get('/users/me');
  expect(response.body).not.toHaveProperty('password');
  expect(response.body).not.toHaveProperty('ssn');
});
```

### 4. Security Misconfiguration
```javascript
test('errors don\'t leak sensitive info', async () => {
  const response = await api.post('/login', { email: 'nonexistent@test.com', password: 'wrong' });
  expect(response.body.error).toBe('Invalid credentials'); // Generic message
});

test('sensitive endpoints not exposed', async () => {
  const endpoints = ['/debug', '/.env', '/.git', '/admin'];
  for (let ep of endpoints) {
    expect((await fetch(`https://example.com${ep}`)).status).not.toBe(200);
  }
});
```

### 5. Rate Limiting
```javascript
test('rate limiting prevents brute force', async () => {
  const responses = [];
  for (let i = 0; i < 20; i++) {
    responses.push(await api.post('/login', { email: 'test@example.com', password: 'wrong' }));
  }
  expect(responses.filter(r => r.status === 429).length).toBeGreaterThan(0);
});
```
**Save action**: Write `SECURITY_DIR/static_analysis.md`
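A minimal sketch of the pattern scan above, using a few illustrative regexes (real static analysis should be broader and tool-assisted; these three patterns are examples, not an exhaustive rule set):

```python
import re
from pathlib import Path

# Illustrative patterns only; a real SAST pass needs far more coverage
SAST_PATTERNS = {
    "command-injection": re.compile(r"shell\s*=\s*True|os\.system\(|\beval\("),
    "hardcoded-secret": re.compile(
        r"(?i)(api_key|password|secret)\s*=\s*['\"][^'\"]+['\"]"),
    "weak-hash": re.compile(r"(?i)\b(md5|sha1)\b"),
}

def scan_file(path: Path) -> list[tuple[str, int, str]]:
    """Return (pattern name, line number, line) for each suspicious line."""
    findings = []
    for lineno, line in enumerate(
            path.read_text(errors="ignore").splitlines(), 1):
        for name, pattern in SAST_PATTERNS.items():
            if pattern.search(line):
                findings.append((name, lineno, line.strip()))
    return findings
```

Capturing the file path and line number per finding is what satisfies the evidence requirement in the self-verification checklist.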

---

## Security Checklist
### Phase 3: OWASP Top 10 Review

**Role**: Penetration tester
**Goal**: Systematically review the codebase against current OWASP Top 10 categories
**Constraints**: Review and document — no code changes

1. Research the current OWASP Top 10 version at https://owasp.org/www-project-top-ten/
2. For each OWASP category, assess the codebase:

| Check | What to Look For |
|-------|------------------|
| Broken Access Control | Missing auth middleware, IDOR vulnerabilities, CORS misconfiguration, directory traversal |
| Cryptographic Failures | Weak algorithms, plaintext transmission, missing encryption at rest |
| Injection | SQL, NoSQL, OS command, LDAP injection paths |
| Insecure Design | Missing rate limiting, no input validation strategy, trust boundary violations |
| Security Misconfiguration | Default credentials, unnecessary features enabled, missing security headers |
| Vulnerable Components | Outdated dependencies (from Phase 1), unpatched frameworks |
| Auth Failures | Brute force paths, weak session management, missing MFA |
| Data Integrity Failures | Missing signature verification, insecure CI/CD, auto-update without verification |
| Logging Failures | Missing audit logs, sensitive data in logs, no alerting for security events |
| SSRF | Unvalidated URL inputs, internal network access from user-controlled URLs |

3. Rate each category: PASS / FAIL / NOT_APPLICABLE
4. If `security_approach.md` exists, cross-reference its requirements against findings
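The rating rule in step 3 is simple enough to state as code; this sketch just makes the verdict logic explicit (illustrative only):

```python
from typing import Literal

Status = Literal["PASS", "FAIL", "NOT_APPLICABLE"]

def rate_category(findings: list[str], applicable: bool) -> Status:
    """Derive a category verdict from its findings, per the rule above."""
    if not applicable:
        return "NOT_APPLICABLE"  # must still be justified in the report
    return "FAIL" if findings else "PASS"
```

A FAIL with zero recorded findings is impossible by construction, which mirrors the evidence requirement in the checklist below.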
|
||||
|
||||
**Self-verification**:
|
||||
- [ ] All current OWASP Top 10 categories assessed
|
||||
- [ ] Each FAIL has at least one specific finding with evidence
|
||||
- [ ] NOT_APPLICABLE categories have justification
|
||||
|
||||
**Save action**: Write `SECURITY_DIR/owasp_review.md`
|
||||
|
||||
---
|
||||
|
||||
### Phase 4: Configuration & Infrastructure Review
|
||||
|
||||
**Role**: DevSecOps engineer
|
||||
**Goal**: Review deployment configuration for security issues
|
||||
**Constraints**: Review only — no changes
|
||||
|
||||
If Dockerfiles, CI/CD configs, or deployment configs exist:
|
||||
|
||||
1. **Container security**: non-root user, minimal base images, no secrets in build args, health checks
|
||||
2. **CI/CD security**: secrets management, no credentials in pipeline files, artifact signing
|
||||
3. **Environment configuration**: .env handling, secrets injection method, environment separation
|
||||
4. **Network security**: exposed ports, TLS configuration, CORS settings, security headers
|
||||
|
||||
If no deployment configs exist, skip this phase and note it in the report.
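
The container checks in step 1 can be partially automated with a small static pass over the Dockerfile text. A minimal sketch, assuming the Dockerfile is read as a string — the function name and rule list are illustrative, not part of this skill:

```typescript
// Static Dockerfile review: flags a missing non-root USER, secrets passed
// via build ARGs, and a missing HEALTHCHECK. A sketch, not a full linter.
function auditDockerfile(source: string): string[] {
  const findings: string[] = [];
  const lines = source.split("\n").map(l => l.trim());

  // Container should drop to a non-root user at some point.
  if (!lines.some(l => /^USER\s+(?!root\b)/i.test(l))) {
    findings.push("no non-root USER instruction");
  }
  // Build ARGs named like secrets persist in image history.
  if (lines.some(l => /^ARG\s+.*(SECRET|TOKEN|PASSWORD|KEY)/i.test(l))) {
    findings.push("possible secret passed via build ARG");
  }
  // Orchestrators rely on HEALTHCHECK to detect dead containers.
  if (!lines.some(l => /^HEALTHCHECK\b/i.test(l))) {
    findings.push("no HEALTHCHECK instruction");
  }
  return findings;
}
```

Findings from a pass like this feed directly into the Phase 4 section of the report; they are hints for manual review, not a verdict.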

**Self-verification**:
- [ ] All Dockerfiles reviewed
- [ ] All CI/CD configs reviewed
- [ ] All environment/config files reviewed

**Save action**: Write `SECURITY_DIR/infrastructure_review.md`

---

### Phase 5: Security Report

**Role**: Security analyst
**Goal**: Produce a consolidated security audit report
**Constraints**: Concise, actionable, severity-ranked

Consolidate findings from Phases 1-4 into a structured report:

```markdown
# Security Audit Report

**Date**: [YYYY-MM-DD]
**Scope**: [project name / target path]
**Verdict**: PASS | PASS_WITH_WARNINGS | FAIL

## Summary

| Severity | Count |
|----------|-------|
| Critical | [N] |
| High | [N] |
| Medium | [N] |
| Low | [N] |

## OWASP Top 10 Assessment

| Category | Status | Findings |
|----------|--------|----------|
| [category] | PASS / FAIL / N/A | [count or —] |

## Findings

| # | Severity | Category | Location | Title |
|---|----------|----------|----------|-------|
| 1 | Critical | Injection | src/api.py:42 | SQL injection via f-string |

### Finding Details

**F1: [title]** (Severity / Category)
- Location: `[file:line]`
- Description: [what is vulnerable]
- Impact: [what an attacker could do]
- Remediation: [specific fix]

## Dependency Vulnerabilities

| Package | CVE | Severity | Fix Version |
|---------|-----|----------|-------------|
| [name] | [CVE-ID] | [sev] | [version] |

## Recommendations

### Immediate (Critical/High)
- [action items]

### Short-term (Medium)
- [action items]

### Long-term (Low / Hardening)
- [action items]
```

**Self-verification**:
- [ ] All findings from Phases 1-4 included
- [ ] No duplicate findings
- [ ] Every finding has remediation guidance
- [ ] Verdict matches severity logic

**Save action**: Write `SECURITY_DIR/security_report.md`

**BLOCKING**: Present report summary to user.

## Verdict Logic

- **FAIL**: any Critical or High finding exists
- **PASS_WITH_WARNINGS**: only Medium or Low findings
- **PASS**: no findings
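
The verdict rule can be encoded directly over the findings list. A minimal sketch — the `Finding` shape is assumed for illustration, not prescribed by the report format:

```typescript
type Severity = "critical" | "high" | "medium" | "low";
interface Finding { severity: Severity; title: string; }

// Direct encoding of the verdict rules above:
// any Critical/High => FAIL; anything else => PASS_WITH_WARNINGS; none => PASS.
function computeVerdict(findings: Finding[]): string {
  if (findings.some(f => f.severity === "critical" || f.severity === "high")) {
    return "FAIL";
  }
  return findings.length > 0 ? "PASS_WITH_WARNINGS" : "PASS";
}
```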

## Security Checklist (Quick Reference)

### Authentication
- [ ] Strong password requirements (12+ chars)
- [ ] Password hashing (bcrypt, scrypt, Argon2)
- [ ] MFA for sensitive operations
- [ ] Account lockout after failed attempts
- [ ] Session ID changes after login
- [ ] Session timeout and rotation

### Authorization
- [ ] Check authorization on every request
- [ ] Least privilege principle
- [ ] No horizontal/vertical escalation paths

### Data Protection
- [ ] HTTPS everywhere
- [ ] Encrypted at rest
- [ ] Secrets not in code/logs/version control
- [ ] PII compliance (GDPR)

### Input Validation
- [ ] Server-side validation on all inputs
- [ ] Parameterized queries (no SQL injection)
- [ ] Output encoding (no XSS)
- [ ] Rate limiting on sensitive endpoints
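
The first three input-validation items can be illustrated in a few lines. A minimal sketch assuming a node-postgres-style `query(text, params)` client — the client shape and function names are illustrative:

```typescript
// Parameterized query: user input travels as a bound parameter,
// never as part of the SQL string itself.
async function getUser(
  db: { query(text: string, params: unknown[]): Promise<unknown> },
  id: string
) {
  return db.query("SELECT * FROM users WHERE id = $1", [id]);
}

// Output encoding: escape HTML metacharacters before rendering
// user-controlled text, so injected markup is displayed, not executed.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```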

### CI/CD Security
- [ ] Dependency audit in pipeline
- [ ] Secret scanning (git-secrets, TruffleHog)
- [ ] SAST in pipeline (Semgrep, SonarQube)
- [ ] No secrets in pipeline config files

---

## CI/CD Integration
```yaml
# GitHub Actions
security-checks:
  steps:
    - name: Dependency audit
      run: npm audit --audit-level=high

    - name: SAST scan
      run: npm run sast

    - name: Secret scan
      uses: trufflesecurity/trufflehog@main

    - name: DAST scan
      if: github.ref == 'refs/heads/main'
      run: docker run owasp/zap2docker-stable zap-baseline.py -t https://staging.example.com
```

**Pre-commit hooks:**
```bash
#!/bin/sh
git-secrets --scan
npm run lint:security
```

---

## Agent-Assisted Security Testing

```typescript
// Comprehensive multi-layer scan
await Task("Security Scan", {
  target: 'src/',
  layers: { sast: true, dast: true, dependencies: true, secrets: true },
  severity: ['critical', 'high', 'medium']
}, "qe-security-scanner");

// OWASP Top 10 testing
await Task("OWASP Scan", {
  categories: ['broken-access-control', 'injection', 'cryptographic-failures'],
  depth: 'comprehensive'
}, "qe-security-scanner");

// Validate fix
await Task("Validate Fix", {
  vulnerability: 'CVE-2024-12345',
  expectedResolution: 'upgrade package to v2.0.0',
  retestAfterFix: true
}, "qe-security-scanner");
```

---

## Agent Coordination Hints

### Memory Namespace
```
aqe/security/
├── scans/*           - Scan results
├── vulnerabilities/* - Found vulnerabilities
├── fixes/*           - Remediation tracking
└── compliance/*      - Compliance status
```

### Fleet Coordination
```typescript
const securityFleet = await FleetManager.coordinate({
  strategy: 'security-testing',
  agents: [
    'qe-security-scanner',
    'qe-api-contract-validator',
    'qe-quality-analyzer',
    'qe-deployment-readiness'
  ],
  topology: 'parallel'
});
```

---

## Escalation Rules

| Situation | Action |
|-----------|--------|
| Critical vulnerability found | **WARN user immediately** — do not defer to report |
| No audit tools available | Use manual code review + WebSearch for CVEs |
| Codebase too large for full scan | **ASK user** to prioritize areas (API endpoints, auth, data access) |
| Finding requires runtime testing (DAST) | Note as "requires DAST verification" — this skill does static analysis only |
| Conflicting security requirements | **ASK user** to prioritize |

## Common Mistakes

- **Security by obscurity**: hiding admin at secret URLs (e.g. `/super-secret-admin`) instead of proper auth
- **Client-side validation only**: JavaScript validation can be bypassed; always validate server-side
- **Trusting user input**: assume all input is malicious until proven otherwise — sanitize, validate, escape
- **Hardcoded secrets**: use environment variables and secret management, never code
- **Skipping dependency scan**: known CVEs in dependencies are the lowest-hanging fruit for attackers

## Trigger Conditions

When the user wants to:
- Conduct a security audit of the codebase
- Check for vulnerabilities before deployment
- Review security posture after implementation
- Validate security requirements from `security_approach.md`

**Keywords**: "security audit", "security scan", "OWASP", "vulnerability scan", "security check", "pentest"

**Differentiation**:
- Lightweight security checks during implementation → handled by `/code-review` Phase 4
- Full security audit → use this skill
- Security requirements gathering → handled by `/problem` (security dimension)

---

## Related Skills

- [agentic-quality-engineering](../agentic-quality-engineering/) - Security with agents
- [api-testing-patterns](../api-testing-patterns/) - API security testing
- [compliance-testing](../compliance-testing/) - GDPR, HIPAA, SOC2

## Methodology Quick Reference

```
┌────────────────────────────────────────────────────────────────┐
│              Security Audit (5-Phase Method)                   │
├────────────────────────────────────────────────────────────────┤
│ PREREQ: Source code exists, SECURITY_DIR created               │
│                                                                │
│ 1. Dependency Scan  → dependency_scan.md                       │
│ 2. Static Analysis  → static_analysis.md                       │
│ 3. OWASP Top 10     → owasp_review.md                          │
│ 4. Infrastructure   → infrastructure_review.md                 │
│ 5. Security Report  → security_report.md                       │
│    [BLOCKING: user reviews report]                             │
├────────────────────────────────────────────────────────────────┤
│ Verdict: PASS / PASS_WITH_WARNINGS / FAIL                      │
│ Principles: OWASP-driven · Evidence-based · Severity-ranked    │
│             Actionable · Save immediately                      │
└────────────────────────────────────────────────────────────────┘
```

---

## Remember

**Think like an attacker:** What would you try to break? Test that.
**Build like a defender:** Assume input is malicious until proven otherwise.
**Test continuously:** Security testing is ongoing, not one-time.

**With Agents:** Agents automate vulnerability scanning, track remediation, and validate fixes. Use agents to maintain security posture at scale.
# =============================================================================
# AQE Skill Evaluation Test Suite: Security Testing v1.0.0
# =============================================================================
#
# Comprehensive evaluation suite for the security-testing skill per ADR-056.
# Tests OWASP Top 10 2021 detection, severity classification, remediation
# quality, and cross-model consistency.
#
# Schema: .claude/skills/.validation/schemas/skill-eval.schema.json
# Validator: .claude/skills/security-testing/scripts/validate-config.json
#
# Coverage:
#   - OWASP A01:2021 - Broken Access Control
#   - OWASP A02:2021 - Cryptographic Failures
#   - OWASP A03:2021 - Injection (SQL, XSS, Command)
#   - OWASP A07:2021 - Identification and Authentication Failures
#   - Negative tests (no false positives on secure code)
#
# =============================================================================

skill: security-testing
version: 1.0.0
description: >
  Comprehensive evaluation suite for the security-testing skill.
  Tests OWASP Top 10 2021 detection capabilities, CWE classification accuracy,
  CVSS scoring, severity classification, and remediation quality.
  Supports multi-model testing and integrates with ReasoningBank for
  continuous improvement.

# =============================================================================
# Multi-Model Configuration
# =============================================================================

models_to_test:
  - claude-3.5-sonnet  # Primary model (high accuracy expected)
  - claude-3-haiku     # Fast model (minimum quality threshold)
  - gpt-4o             # Cross-vendor validation

# =============================================================================
# MCP Integration Configuration
# =============================================================================

mcp_integration:
  enabled: true
  namespace: skill-validation

  # Query existing security patterns before running evals
  query_patterns: true

  # Track each test outcome for learning feedback loop
  track_outcomes: true

  # Store successful patterns after evals complete
  store_patterns: true

  # Share learning with fleet coordinator agents
  share_learning: true

  # Update quality gate with validation metrics
  update_quality_gate: true

  # Target agents for learning distribution
  target_agents:
    - qe-learning-coordinator
    - qe-queen-coordinator
    - qe-security-scanner
    - qe-security-auditor

# =============================================================================
# ReasoningBank Learning Configuration
# =============================================================================

learning:
  store_success_patterns: true
  store_failure_patterns: true
  pattern_ttl_days: 90
  min_confidence_to_store: 0.7
  cross_model_comparison: true

# =============================================================================
# Result Format Configuration
# =============================================================================

result_format:
  json_output: true
  markdown_report: true
  include_raw_output: false
  include_timing: true
  include_token_usage: true

# =============================================================================
# Environment Setup
# =============================================================================

setup:
  required_tools:
    - jq   # JSON parsing (required)
    - npm  # Dependency audit (optional but recommended)

  environment_variables:
    SECURITY_SCAN_DEPTH: "deep"
    OWASP_ENABLED: "true"
    SEVERITY_THRESHOLD: "medium"

  fixtures:
    - name: vulnerable_express_app
      path: fixtures/vulnerable-express-app.js
      content: |
        const express = require('express');
        const app = express();

        // SQL Injection vulnerability
        app.get('/user/:id', (req, res) => {
          const query = `SELECT * FROM users WHERE id = ${req.params.id}`;
          db.query(query);
        });

        // XSS vulnerability
        app.get('/profile', (req, res) => {
          res.send(`<h1>Hello ${req.query.name}</h1>`);
        });

        // Path Traversal vulnerability
        app.get('/file', (req, res) => {
          const path = './uploads/' + req.query.filename;
          res.sendFile(path);
        });

# =============================================================================
# TEST CASES
# =============================================================================

test_cases:
  # ---------------------------------------------------------------------------
  # CATEGORY: SQL Injection (OWASP A03:2021)
  # ---------------------------------------------------------------------------

  - id: tc001_sql_injection_string_concat
    description: "Detect SQL injection via string concatenation in Node.js"
    category: injection
    priority: critical

    input:
      code: |
        const express = require('express');
        const mysql = require('mysql');
        const app = express();

        app.get('/api/users/:id', (req, res) => {
          const userId = req.params.id;
          const query = `SELECT * FROM users WHERE id = ${userId}`;
          db.query(query, (err, results) => {
            res.json(results);
          });
        });
      context:
        language: javascript
        framework: express
        environment: production

    expected_output:
      must_contain:
        - "SQL injection"
        - "parameterized"
      must_not_contain:
        - "no vulnerabilities"
        - "secure"
      must_match_regex:
        - "CWE-89|CWE-564"
        - "A03:2021"
      severity_classification: critical
      finding_count:
        min: 1
        max: 3
      recommendation_count:
        min: 1

    validation:
      schema_check: true
      keyword_match_threshold: 0.8
      reasoning_quality_min: 0.7
      grading_rubric:
        completeness: 0.3
        accuracy: 0.5
        actionability: 0.2

    timeout_ms: 30000
  - id: tc002_sql_injection_parameterized_safe
    description: "Verify parameterized queries are NOT flagged as vulnerable"
    category: injection
    priority: high

    input:
      code: |
        app.get('/api/users/:id', (req, res) => {
          const userId = parseInt(req.params.id, 10);
          db.query('SELECT * FROM users WHERE id = ?', [userId], (err, results) => {
            res.json(results);
          });
        });
      context:
        language: javascript
        framework: express

    expected_output:
      must_contain:
        - "parameterized"
        - "secure"
      must_not_contain:
        - "SQL injection"
        - "critical"
        - "vulnerable"
      severity_classification: info
      finding_count:
        max: 1

    validation:
      schema_check: true
      keyword_match_threshold: 0.7
      allow_partial: true
  # ---------------------------------------------------------------------------
  # CATEGORY: Cross-Site Scripting (OWASP A03:2021)
  # ---------------------------------------------------------------------------

  - id: tc003_xss_reflected_html_output
    description: "Detect reflected XSS in unescaped HTML output"
    category: injection
    priority: critical

    input:
      code: |
        app.get('/profile', (req, res) => {
          const name = req.query.name;
          res.send(`
            <html>
              <body>
                <h1>Welcome, ${name}!</h1>
                <p>Your profile has been loaded.</p>
              </body>
            </html>
          `);
        });
      context:
        language: javascript
        framework: express

    expected_output:
      must_contain:
        - "XSS"
        - "cross-site scripting"
        - "sanitize"
        - "escape"
      must_match_regex:
        - "CWE-79"
      severity_classification: high
      finding_count:
        min: 1

    validation:
      schema_check: true
      keyword_match_threshold: 0.8
      reasoning_quality_min: 0.75

  - id: tc004_xss_dom_based_innerhtml
    description: "Detect DOM-based XSS via innerHTML assignment"
    category: injection
    priority: high

    input:
      code: |
        // Client-side JavaScript
        const params = new URLSearchParams(window.location.search);
        const message = params.get('msg');
        document.getElementById('output').innerHTML = message;
      context:
        language: javascript
        framework: vanilla
        environment: production

    expected_output:
      must_contain:
        - "DOM"
        - "XSS"
        - "innerHTML"
        - "textContent"
      must_match_regex:
        - "CWE-79"
      severity_classification: high

    validation:
      schema_check: true
      keyword_match_threshold: 0.7
  # ---------------------------------------------------------------------------
  # CATEGORY: Authentication Failures (OWASP A07:2021)
  # ---------------------------------------------------------------------------

  - id: tc005_hardcoded_credentials
    description: "Detect hardcoded credentials and API keys"
    category: authentication
    priority: critical

    input:
      code: |
        const ADMIN_PASSWORD = 'admin123';
        const API_KEY = 'sk-1234567890abcdef';
        const DATABASE_URL = 'postgres://admin:password123@localhost/db';

        app.post('/login', (req, res) => {
          if (req.body.password === ADMIN_PASSWORD) {
            req.session.isAdmin = true;
            res.send('Login successful');
          }
        });
      context:
        language: javascript
        framework: express

    expected_output:
      must_contain:
        - "hardcoded"
        - "credentials"
        - "secret"
        - "environment variable"
      must_match_regex:
        - "CWE-798|CWE-259"
      severity_classification: critical
      finding_count:
        min: 2

    validation:
      schema_check: true
      keyword_match_threshold: 0.8
      reasoning_quality_min: 0.8

  - id: tc006_weak_password_hashing
    description: "Detect weak password hashing algorithms (MD5, SHA1)"
    category: authentication
    priority: high

    input:
      code: |
        const crypto = require('crypto');

        function hashPassword(password) {
          return crypto.createHash('md5').update(password).digest('hex');
        }

        function verifyPassword(password, hash) {
          return hashPassword(password) === hash;
        }
      context:
        language: javascript
        framework: nodejs

    expected_output:
      must_contain:
        - "MD5"
        - "weak"
        - "bcrypt"
        - "argon2"
      must_match_regex:
        - "CWE-327|CWE-328|CWE-916"
      severity_classification: high
      finding_count:
        min: 1

    validation:
      schema_check: true
      keyword_match_threshold: 0.8
  # ---------------------------------------------------------------------------
  # CATEGORY: Broken Access Control (OWASP A01:2021)
  # ---------------------------------------------------------------------------

  - id: tc007_idor_missing_authorization
    description: "Detect IDOR vulnerability with missing authorization check"
    category: authorization
    priority: critical

    input:
      code: |
        app.get('/api/users/:id/profile', (req, res) => {
          // No authorization check - any user can access any profile
          const userId = req.params.id;
          db.query('SELECT * FROM profiles WHERE user_id = ?', [userId])
            .then(profile => res.json(profile));
        });

        app.delete('/api/users/:id', (req, res) => {
          // No check if requesting user owns this account
          db.query('DELETE FROM users WHERE id = ?', [req.params.id]);
          res.send('User deleted');
        });
      context:
        language: javascript
        framework: express

    expected_output:
      must_contain:
        - "authorization"
        - "access control"
        - "IDOR"
        - "ownership"
      must_match_regex:
        - "CWE-639|CWE-284|CWE-862"
        - "A01:2021"
      severity_classification: critical

    validation:
      schema_check: true
      keyword_match_threshold: 0.7
  # ---------------------------------------------------------------------------
  # CATEGORY: Cryptographic Failures (OWASP A02:2021)
  # ---------------------------------------------------------------------------

  - id: tc008_weak_encryption_des
    description: "Detect use of weak encryption algorithms (DES, RC4)"
    category: cryptography
    priority: high

    input:
      code: |
        const crypto = require('crypto');

        function encryptData(data, key) {
          const cipher = crypto.createCipher('des', key);
          return cipher.update(data, 'utf8', 'hex') + cipher.final('hex');
        }

        function decryptData(data, key) {
          const decipher = crypto.createDecipher('des', key);
          return decipher.update(data, 'hex', 'utf8') + decipher.final('utf8');
        }
      context:
        language: javascript
        framework: nodejs

    expected_output:
      must_contain:
        - "DES"
        - "weak"
        - "deprecated"
        - "AES"
      must_match_regex:
        - "CWE-327|CWE-328"
        - "A02:2021"
      severity_classification: high

    validation:
      schema_check: true
      keyword_match_threshold: 0.7

  - id: tc009_plaintext_password_storage
    description: "Detect plaintext password storage"
    category: cryptography
    priority: critical

    input:
      code: |
        class User {
          constructor(email, password) {
            this.email = email;
            this.password = password; // Stored in plaintext!
          }

          save() {
            db.query('INSERT INTO users (email, password) VALUES (?, ?)',
              [this.email, this.password]);
          }
        }
      context:
        language: javascript
        framework: nodejs

    expected_output:
      must_contain:
        - "plaintext"
        - "password"
        - "hash"
        - "bcrypt"
      must_match_regex:
        - "CWE-256|CWE-312"
        - "A02:2021"
      severity_classification: critical

    validation:
      schema_check: true
      keyword_match_threshold: 0.8
  # ---------------------------------------------------------------------------
  # CATEGORY: Path Traversal (Related to A01:2021)
  # ---------------------------------------------------------------------------

  - id: tc010_path_traversal_file_access
    description: "Detect path traversal vulnerability in file access"
    category: injection
    priority: critical

    input:
      code: |
        const fs = require('fs');

        app.get('/download', (req, res) => {
          const filename = req.query.file;
          const filepath = './uploads/' + filename;
          res.sendFile(filepath);
        });

        app.get('/read/:name', (req, res) => {
          const content = fs.readFileSync('./data/' + req.params.name);
          res.send(content);
        });
      context:
        language: javascript
        framework: express

    expected_output:
      must_contain:
        - "path traversal"
        - "directory traversal"
        - "../"
        - "sanitize"
      must_match_regex:
        - "CWE-22|CWE-23"
      severity_classification: critical

    validation:
      schema_check: true
      keyword_match_threshold: 0.7
  # ---------------------------------------------------------------------------
  # CATEGORY: Negative Tests (No False Positives)
  # ---------------------------------------------------------------------------

  - id: tc011_secure_code_no_false_positives
    description: "Verify secure code is NOT flagged as vulnerable"
    category: negative
    priority: critical

    input:
      code: |
        const express = require('express');
        const helmet = require('helmet');
        const rateLimit = require('express-rate-limit');
        const bcrypt = require('bcrypt');
        const validator = require('validator');

        const app = express();
        app.use(helmet());
        app.use(rateLimit({ windowMs: 15 * 60 * 1000, max: 100 }));

        app.post('/api/users', async (req, res) => {
          const { email, password } = req.body;

          // Input validation
          if (!validator.isEmail(email)) {
            return res.status(400).json({ error: 'Invalid email' });
          }

          // Secure password hashing
          const hashedPassword = await bcrypt.hash(password, 12);

          // Parameterized query
          await db.query(
            'INSERT INTO users (email, password) VALUES ($1, $2)',
            [email, hashedPassword]
          );

          res.status(201).json({ message: 'User created' });
        });
      context:
        language: javascript
        framework: express
        environment: production

    expected_output:
      must_contain:
        - "secure"
        - "best practice"
      must_not_contain:
        - "SQL injection"
        - "XSS"
        - "critical vulnerability"
        - "high severity"
      finding_count:
        max: 2  # Allow informational findings only

    validation:
      schema_check: true
      keyword_match_threshold: 0.6
      allow_partial: true

  - id: tc012_secure_auth_implementation
    description: "Verify secure authentication is recognized as safe"
    category: negative
    priority: high

    input:
      code: |
        const bcrypt = require('bcrypt');
        const jwt = require('jsonwebtoken');

        async function login(email, password) {
          const user = await User.findByEmail(email);
          if (!user) {
            return { error: 'Invalid credentials' };
          }

          const match = await bcrypt.compare(password, user.passwordHash);
          if (!match) {
            return { error: 'Invalid credentials' };
          }

          const token = jwt.sign(
            { userId: user.id },
            process.env.JWT_SECRET,
            { expiresIn: '1h' }
          );

          return { token };
        }
      context:
        language: javascript
        framework: nodejs

    expected_output:
      must_contain:
        - "bcrypt"
        - "jwt"
        - "secure"
      must_not_contain:
        - "vulnerable"
        - "critical"
        - "hardcoded"
      severity_classification: info

    validation:
      schema_check: true
      allow_partial: true
  # ---------------------------------------------------------------------------
  # CATEGORY: Python Security (Multi-language Support)
  # ---------------------------------------------------------------------------

  - id: tc013_python_sql_injection
    description: "Detect SQL injection in Python Flask application"
    category: injection
    priority: critical

    input:
      code: |
        from flask import Flask, request
        import sqlite3

        app = Flask(__name__)

        @app.route('/user')
        def get_user():
            user_id = request.args.get('id')
            conn = sqlite3.connect('users.db')
            cursor = conn.cursor()
            cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")
            return str(cursor.fetchone())
      context:
        language: python
        framework: flask

    expected_output:
      must_contain:
        - "SQL injection"
        - "parameterized"
        - "f-string"
      must_match_regex:
        - "CWE-89"
      severity_classification: critical
      finding_count:
        min: 1

    validation:
      schema_check: true
      keyword_match_threshold: 0.7

  - id: tc014_python_ssti_jinja
    description: "Detect Server-Side Template Injection in Jinja2"
    category: injection
    priority: critical

    input:
      code: |
        from flask import Flask, request, render_template_string

        app = Flask(__name__)

        @app.route('/render')
        def render():
            template = request.args.get('template')
            return render_template_string(template)
      context:
        language: python
        framework: flask

    expected_output:
      must_contain:
        - "SSTI"
        - "template injection"
        - "render_template_string"
        - "Jinja2"
      must_match_regex:
        - "CWE-94|CWE-1336"
      severity_classification: critical

    validation:
      schema_check: true
      keyword_match_threshold: 0.7

  - id: tc015_python_pickle_deserialization
    description: "Detect insecure deserialization with pickle"
    category: injection
    priority: critical

    input:
      code: |
        import pickle
        from flask import Flask, request

        app = Flask(__name__)

        @app.route('/load')
        def load_data():
            data = request.get_data()
            obj = pickle.loads(data)
            return str(obj)
      context:
        language: python
        framework: flask

    expected_output:
      must_contain:
        - "pickle"
        - "deserialization"
        - "untrusted"
        - "RCE"
      must_match_regex:
        - "CWE-502"
        - "A08:2021"
      severity_classification: critical

    validation:
      schema_check: true
      keyword_match_threshold: 0.7
# =============================================================================
|
||||
# SUCCESS CRITERIA
|
||||
# =============================================================================
|
||||
|
||||
success_criteria:
|
||||
# Overall pass rate (90% of tests must pass)
|
||||
pass_rate: 0.9
|
||||
|
||||
# Critical tests must ALL pass (100%)
|
||||
critical_pass_rate: 1.0
|
||||
|
||||
# Average reasoning quality score
|
||||
avg_reasoning_quality: 0.75
|
||||
|
||||
# Maximum suite execution time (5 minutes)
|
||||
max_execution_time_ms: 300000
|
||||
|
||||
# Maximum variance between model results (15%)
|
||||
cross_model_variance: 0.15
|
||||
|
||||
# =============================================================================
|
||||
# METADATA
|
||||
# =============================================================================
|
||||
|
||||
metadata:
|
||||
author: "qe-security-auditor"
|
||||
created: "2026-02-02"
|
||||
last_updated: "2026-02-02"
|
||||
coverage_target: >
|
||||
OWASP Top 10 2021: A01 (Broken Access Control), A02 (Cryptographic Failures),
|
||||
A03 (Injection - SQL, XSS, SSTI, Command), A07 (Authentication Failures),
|
||||
A08 (Software Integrity - Deserialization). Covers JavaScript/Node.js
|
||||
Express apps and Python Flask apps. 15 test cases with 90% pass rate
|
||||
requirement and 100% critical pass rate.
|
||||
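The `must_contain` / `must_not_contain` lists and `keyword_match_threshold: 0.7` above can be read as a simple scoring rule. A minimal sketch of how a harness might apply them follows; the suite's actual eval runner is not shown in this excerpt, and the function names here are hypothetical:

```python
def keyword_pass_rate(output: str, must_contain: list[str]) -> float:
    """Fraction of required keywords found (case-insensitive substring match)."""
    text = output.lower()
    hits = sum(1 for kw in must_contain if kw.lower() in text)
    return hits / len(must_contain) if must_contain else 1.0


def passes(output: str, must_contain: list[str],
           must_not_contain=(), threshold: float = 0.7) -> bool:
    """A forbidden term fails the test outright; otherwise the keyword
    pass rate must meet the configured threshold."""
    if any(kw.lower() in output.lower() for kw in must_not_contain):
        return False
    return keyword_pass_rate(output, must_contain) >= threshold


example = "Critical SQL injection via f-string; use parameterized queries (CWE-89)."
print(passes(example, ["SQL injection", "parameterized", "f-string"]))  # → True
```

Regex checks (`must_match_regex`) and `finding_count` would be evaluated separately against the structured output, not the prose summary.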
@@ -1,879 +0,0 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://agentic-qe.dev/schemas/security-testing-output.json",
  "title": "AQE Security Testing Skill Output Schema",
  "description": "Schema for security-testing skill output validation. Extends the base skill-output template with OWASP Top 10 categories, CWE identifiers, and CVSS scoring.",
  "type": "object",
  "required": ["skillName", "version", "timestamp", "status", "trustTier", "output"],
  "properties": {
    "skillName": {
      "type": "string",
      "const": "security-testing",
      "description": "Must be 'security-testing'"
    },
    "version": {
      "type": "string",
      "pattern": "^\\d+\\.\\d+\\.\\d+(-[a-zA-Z0-9]+)?$",
      "description": "Semantic version of the skill"
    },
    "timestamp": {
      "type": "string",
      "format": "date-time",
      "description": "ISO 8601 timestamp of output generation"
    },
    "status": {
      "type": "string",
      "enum": ["success", "partial", "failed", "skipped"],
      "description": "Overall execution status"
    },
    "trustTier": {
      "type": "integer",
      "const": 3,
      "description": "Trust tier 3 indicates full validation with eval suite"
    },
    "output": {
      "type": "object",
      "required": ["summary", "findings", "owaspCategories"],
      "properties": {
        "summary": {
          "type": "string",
          "minLength": 50,
          "maxLength": 2000,
          "description": "Human-readable summary of security findings"
        },
        "score": {
          "$ref": "#/$defs/securityScore",
          "description": "Overall security score"
        },
        "findings": {
          "type": "array",
          "items": {
            "$ref": "#/$defs/securityFinding"
          },
          "maxItems": 500,
          "description": "List of security vulnerabilities discovered"
        },
        "recommendations": {
          "type": "array",
          "items": {
            "$ref": "#/$defs/securityRecommendation"
          },
          "maxItems": 100,
          "description": "Prioritized remediation recommendations with code examples"
        },
        "metrics": {
          "$ref": "#/$defs/securityMetrics",
          "description": "Security scan metrics and statistics"
        },
        "owaspCategories": {
          "$ref": "#/$defs/owaspCategoryBreakdown",
          "description": "OWASP Top 10 2021 category breakdown"
        },
        "artifacts": {
          "type": "array",
          "items": {
            "$ref": "#/$defs/artifact"
          },
          "maxItems": 50,
          "description": "Generated security reports and scan artifacts"
        },
        "timeline": {
          "type": "array",
          "items": {
            "$ref": "#/$defs/timelineEvent"
          },
          "description": "Scan execution timeline"
        },
        "scanConfiguration": {
          "$ref": "#/$defs/scanConfiguration",
          "description": "Configuration used for the security scan"
        }
      }
    },
    "metadata": {
      "$ref": "#/$defs/metadata"
    },
    "validation": {
      "$ref": "#/$defs/validationResult"
    },
    "learning": {
      "$ref": "#/$defs/learningData"
    }
  },
  "$defs": {
    "securityScore": {
      "type": "object",
      "required": ["value", "max"],
      "properties": {
        "value": {
          "type": "number",
          "minimum": 0,
          "maximum": 100,
          "description": "Security score (0=critical issues, 100=no issues)"
        },
        "max": {
          "type": "number",
          "const": 100,
          "description": "Maximum score is always 100"
        },
        "grade": {
          "type": "string",
          "pattern": "^[A-F][+-]?$",
          "description": "Letter grade: A (90-100), B (80-89), C (70-79), D (60-69), F (<60)"
        },
        "trend": {
          "type": "string",
          "enum": ["improving", "stable", "declining", "unknown"],
          "description": "Trend compared to previous scans"
        },
        "riskLevel": {
          "type": "string",
          "enum": ["critical", "high", "medium", "low", "minimal"],
          "description": "Overall risk level assessment"
        }
      }
    },
    "securityFinding": {
      "type": "object",
      "required": ["id", "title", "severity", "owasp"],
      "properties": {
        "id": {
          "type": "string",
          "pattern": "^SEC-\\d{3,6}$",
          "description": "Unique finding identifier (e.g., SEC-001)"
        },
        "title": {
          "type": "string",
          "minLength": 10,
          "maxLength": 200,
          "description": "Finding title describing the vulnerability"
        },
        "description": {
          "type": "string",
          "maxLength": 2000,
          "description": "Detailed description of the vulnerability"
        },
        "severity": {
          "type": "string",
          "enum": ["critical", "high", "medium", "low", "info"],
          "description": "Severity: critical (CVSS 9.0-10.0), high (7.0-8.9), medium (4.0-6.9), low (0.1-3.9), info (0)"
        },
        "owasp": {
          "type": "string",
          "pattern": "^A(0[1-9]|10):20(21|25)$",
          "description": "OWASP Top 10 category (e.g., A01:2021, A03:2025)"
        },
        "owaspCategory": {
          "type": "string",
          "enum": [
            "A01:2021-Broken-Access-Control",
            "A02:2021-Cryptographic-Failures",
            "A03:2021-Injection",
            "A04:2021-Insecure-Design",
            "A05:2021-Security-Misconfiguration",
            "A06:2021-Vulnerable-Components",
            "A07:2021-Identification-Authentication-Failures",
            "A08:2021-Software-Data-Integrity-Failures",
            "A09:2021-Security-Logging-Monitoring-Failures",
            "A10:2021-Server-Side-Request-Forgery"
          ],
          "description": "Full OWASP category name"
        },
        "cwe": {
          "type": "string",
          "pattern": "^CWE-\\d{1,4}$",
          "description": "CWE identifier (e.g., CWE-79 for XSS, CWE-89 for SQLi)"
        },
        "cvss": {
          "type": "object",
          "properties": {
            "score": {
              "type": "number",
              "minimum": 0,
              "maximum": 10,
              "description": "CVSS v3.1 base score"
            },
            "vector": {
              "type": "string",
              "pattern": "^CVSS:3\\.1/AV:[NALP]/AC:[LH]/PR:[NLH]/UI:[NR]/S:[UC]/C:[NLH]/I:[NLH]/A:[NLH]$",
              "description": "CVSS v3.1 vector string"
            },
            "severity": {
              "type": "string",
              "enum": ["None", "Low", "Medium", "High", "Critical"],
              "description": "CVSS severity rating"
            }
          }
        },
        "location": {
          "$ref": "#/$defs/location",
          "description": "Location of the vulnerability"
        },
        "evidence": {
          "type": "string",
          "maxLength": 5000,
          "description": "Evidence: code snippet, request/response, or PoC"
        },
        "remediation": {
          "type": "string",
          "maxLength": 2000,
          "description": "Specific fix instructions for this finding"
        },
        "references": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["title", "url"],
            "properties": {
              "title": { "type": "string" },
              "url": { "type": "string", "format": "uri" }
            }
          },
          "maxItems": 10,
          "description": "External references (OWASP, CWE, CVE, etc.)"
        },
        "falsePositive": {
          "type": "boolean",
          "default": false,
          "description": "Potential false positive flag"
        },
        "confidence": {
          "type": "number",
          "minimum": 0,
          "maximum": 1,
          "description": "Confidence in finding accuracy (0.0-1.0)"
        },
        "exploitability": {
          "type": "string",
          "enum": ["trivial", "easy", "moderate", "difficult", "theoretical"],
          "description": "How easy is it to exploit this vulnerability"
        },
        "affectedVersions": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Affected package/library versions for dependency vulnerabilities"
        },
        "cve": {
          "type": "string",
          "pattern": "^CVE-\\d{4}-\\d{4,}$",
          "description": "CVE identifier if applicable"
        }
      }
    },
    "securityRecommendation": {
      "type": "object",
      "required": ["id", "title", "priority", "owaspCategories"],
      "properties": {
        "id": {
          "type": "string",
          "pattern": "^REC-\\d{3,6}$",
          "description": "Unique recommendation identifier"
        },
        "title": {
          "type": "string",
          "minLength": 10,
          "maxLength": 200,
          "description": "Recommendation title"
        },
        "description": {
          "type": "string",
          "maxLength": 2000,
          "description": "Detailed recommendation description"
        },
        "priority": {
          "type": "string",
          "enum": ["critical", "high", "medium", "low"],
          "description": "Remediation priority"
        },
        "effort": {
          "type": "string",
          "enum": ["trivial", "low", "medium", "high", "major"],
          "description": "Estimated effort: trivial(<1hr), low(1-4hr), medium(1-3d), high(1-2wk), major(>2wk)"
        },
        "impact": {
          "type": "integer",
          "minimum": 1,
          "maximum": 10,
          "description": "Security impact if implemented (1-10)"
        },
        "relatedFindings": {
          "type": "array",
          "items": {
            "type": "string",
            "pattern": "^SEC-\\d{3,6}$"
          },
          "description": "IDs of findings this addresses"
        },
        "owaspCategories": {
          "type": "array",
          "items": {
            "type": "string",
            "pattern": "^A(0[1-9]|10):20(21|25)$"
          },
          "description": "OWASP categories this recommendation addresses"
        },
        "codeExample": {
          "type": "object",
          "properties": {
            "before": {
              "type": "string",
              "maxLength": 2000,
              "description": "Vulnerable code example"
            },
            "after": {
              "type": "string",
              "maxLength": 2000,
              "description": "Secure code example"
            },
            "language": {
              "type": "string",
              "description": "Programming language"
            }
          },
          "description": "Before/after code examples for remediation"
        },
        "resources": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["title", "url"],
            "properties": {
              "title": { "type": "string" },
              "url": { "type": "string", "format": "uri" }
            }
          },
          "maxItems": 10,
          "description": "External resources and documentation"
        },
        "automatable": {
          "type": "boolean",
          "description": "Can this fix be automated?"
        },
        "fixCommand": {
          "type": "string",
          "description": "CLI command to apply fix if automatable"
        }
      }
    },
    "owaspCategoryBreakdown": {
      "type": "object",
      "description": "OWASP Top 10 2021 category scores and findings",
      "properties": {
        "A01:2021": {
          "$ref": "#/$defs/owaspCategoryScore",
          "description": "A01:2021 - Broken Access Control"
        },
        "A02:2021": {
          "$ref": "#/$defs/owaspCategoryScore",
          "description": "A02:2021 - Cryptographic Failures"
        },
        "A03:2021": {
          "$ref": "#/$defs/owaspCategoryScore",
          "description": "A03:2021 - Injection"
        },
        "A04:2021": {
          "$ref": "#/$defs/owaspCategoryScore",
          "description": "A04:2021 - Insecure Design"
        },
        "A05:2021": {
          "$ref": "#/$defs/owaspCategoryScore",
          "description": "A05:2021 - Security Misconfiguration"
        },
        "A06:2021": {
          "$ref": "#/$defs/owaspCategoryScore",
          "description": "A06:2021 - Vulnerable and Outdated Components"
        },
        "A07:2021": {
          "$ref": "#/$defs/owaspCategoryScore",
          "description": "A07:2021 - Identification and Authentication Failures"
        },
        "A08:2021": {
          "$ref": "#/$defs/owaspCategoryScore",
          "description": "A08:2021 - Software and Data Integrity Failures"
        },
        "A09:2021": {
          "$ref": "#/$defs/owaspCategoryScore",
          "description": "A09:2021 - Security Logging and Monitoring Failures"
        },
        "A10:2021": {
          "$ref": "#/$defs/owaspCategoryScore",
          "description": "A10:2021 - Server-Side Request Forgery (SSRF)"
        }
      },
      "additionalProperties": false
    },
    "owaspCategoryScore": {
      "type": "object",
      "required": ["tested", "score"],
      "properties": {
        "tested": {
          "type": "boolean",
          "description": "Whether this category was tested"
        },
        "score": {
          "type": "number",
          "minimum": 0,
          "maximum": 100,
          "description": "Category score (100 = no issues, 0 = critical)"
        },
        "grade": {
          "type": "string",
          "pattern": "^[A-F][+-]?$",
          "description": "Letter grade for this category"
        },
        "findingCount": {
          "type": "integer",
          "minimum": 0,
          "description": "Number of findings in this category"
        },
        "criticalCount": {
          "type": "integer",
          "minimum": 0,
          "description": "Number of critical findings"
        },
        "highCount": {
          "type": "integer",
          "minimum": 0,
          "description": "Number of high severity findings"
        },
        "status": {
          "type": "string",
          "enum": ["pass", "fail", "warn", "skip"],
          "description": "Category status"
        },
        "description": {
          "type": "string",
          "description": "Category description and context"
        },
        "cwes": {
          "type": "array",
          "items": {
            "type": "string",
            "pattern": "^CWE-\\d{1,4}$"
          },
          "description": "CWEs found in this category"
        }
      }
    },
    "securityMetrics": {
      "type": "object",
      "properties": {
        "totalFindings": {
          "type": "integer",
          "minimum": 0,
          "description": "Total vulnerabilities found"
        },
        "criticalCount": {
          "type": "integer",
          "minimum": 0,
          "description": "Critical severity findings"
        },
        "highCount": {
          "type": "integer",
          "minimum": 0,
          "description": "High severity findings"
        },
        "mediumCount": {
          "type": "integer",
          "minimum": 0,
          "description": "Medium severity findings"
        },
        "lowCount": {
          "type": "integer",
          "minimum": 0,
          "description": "Low severity findings"
        },
        "infoCount": {
          "type": "integer",
          "minimum": 0,
          "description": "Informational findings"
        },
        "filesScanned": {
          "type": "integer",
          "minimum": 0,
          "description": "Number of files analyzed"
        },
        "linesOfCode": {
          "type": "integer",
          "minimum": 0,
          "description": "Lines of code scanned"
        },
        "dependenciesChecked": {
          "type": "integer",
          "minimum": 0,
          "description": "Number of dependencies checked"
        },
        "owaspCategoriesTested": {
          "type": "integer",
          "minimum": 0,
          "maximum": 10,
          "description": "OWASP Top 10 categories tested"
        },
        "owaspCategoriesPassed": {
          "type": "integer",
          "minimum": 0,
          "maximum": 10,
          "description": "OWASP Top 10 categories with no findings"
        },
        "uniqueCwes": {
          "type": "integer",
          "minimum": 0,
          "description": "Unique CWE identifiers found"
        },
        "falsePositiveRate": {
          "type": "number",
          "minimum": 0,
          "maximum": 1,
          "description": "Estimated false positive rate"
        },
        "scanDurationMs": {
          "type": "integer",
          "minimum": 0,
          "description": "Total scan duration in milliseconds"
        },
        "coverage": {
          "type": "object",
          "properties": {
            "sast": {
              "type": "boolean",
              "description": "Static analysis performed"
            },
            "dast": {
              "type": "boolean",
              "description": "Dynamic analysis performed"
            },
            "dependencies": {
              "type": "boolean",
              "description": "Dependency scan performed"
            },
            "secrets": {
              "type": "boolean",
              "description": "Secret scanning performed"
            },
            "configuration": {
              "type": "boolean",
              "description": "Configuration review performed"
            }
          },
          "description": "Scan coverage indicators"
        }
      }
    },
    "scanConfiguration": {
      "type": "object",
      "properties": {
        "target": {
          "type": "string",
          "description": "Scan target (file path, URL, or package)"
        },
        "targetType": {
          "type": "string",
          "enum": ["source", "url", "package", "container", "infrastructure"],
          "description": "Type of target being scanned"
        },
        "scanTypes": {
          "type": "array",
          "items": {
            "type": "string",
            "enum": ["sast", "dast", "dependency", "secret", "configuration", "container", "iac"]
          },
          "description": "Types of scans performed"
        },
        "severity": {
          "type": "array",
          "items": {
            "type": "string",
            "enum": ["critical", "high", "medium", "low", "info"]
          },
          "description": "Severity levels included in scan"
        },
        "owaspCategories": {
          "type": "array",
          "items": {
            "type": "string",
            "pattern": "^A(0[1-9]|10):20(21|25)$"
          },
          "description": "OWASP categories tested"
        },
        "tools": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Security tools used"
        },
        "excludePatterns": {
          "type": "array",
          "items": { "type": "string" },
          "description": "File patterns excluded from scan"
        },
        "rulesets": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Security rulesets applied"
        }
      }
    },
    "location": {
      "type": "object",
      "properties": {
        "file": {
          "type": "string",
          "maxLength": 500,
          "description": "File path relative to project root"
        },
        "line": {
          "type": "integer",
          "minimum": 1,
          "description": "Line number"
        },
        "column": {
          "type": "integer",
          "minimum": 1,
          "description": "Column number"
        },
        "endLine": {
          "type": "integer",
          "minimum": 1,
          "description": "End line for multi-line findings"
        },
        "endColumn": {
          "type": "integer",
          "minimum": 1,
          "description": "End column"
        },
        "url": {
          "type": "string",
          "format": "uri",
          "description": "URL for web-based findings"
        },
        "endpoint": {
          "type": "string",
          "description": "API endpoint path"
        },
        "method": {
          "type": "string",
          "enum": ["GET", "POST", "PUT", "DELETE", "PATCH", "HEAD", "OPTIONS"],
          "description": "HTTP method for API findings"
        },
        "parameter": {
          "type": "string",
          "description": "Vulnerable parameter name"
        },
        "component": {
          "type": "string",
          "description": "Affected component or module"
        }
      }
    },
    "artifact": {
      "type": "object",
      "required": ["type", "path"],
      "properties": {
        "type": {
          "type": "string",
          "enum": ["report", "sarif", "data", "log", "evidence"],
          "description": "Artifact type"
        },
        "path": {
          "type": "string",
          "maxLength": 500,
          "description": "Path to artifact"
        },
        "format": {
          "type": "string",
          "enum": ["json", "sarif", "html", "md", "txt", "xml", "csv"],
          "description": "Artifact format"
        },
        "description": {
          "type": "string",
          "maxLength": 500,
          "description": "Artifact description"
        },
        "sizeBytes": {
          "type": "integer",
          "minimum": 0,
          "description": "File size in bytes"
        },
        "checksum": {
          "type": "string",
          "pattern": "^sha256:[a-f0-9]{64}$",
          "description": "SHA-256 checksum"
        }
      }
    },
    "timelineEvent": {
      "type": "object",
      "required": ["timestamp", "event"],
      "properties": {
        "timestamp": {
          "type": "string",
          "format": "date-time",
          "description": "Event timestamp"
        },
        "event": {
          "type": "string",
          "maxLength": 200,
          "description": "Event description"
        },
        "type": {
          "type": "string",
          "enum": ["start", "checkpoint", "warning", "error", "complete"],
          "description": "Event type"
        },
        "durationMs": {
          "type": "integer",
          "minimum": 0,
          "description": "Duration since previous event"
        },
        "phase": {
          "type": "string",
          "enum": ["initialization", "sast", "dast", "dependency", "secret", "reporting"],
          "description": "Scan phase"
        }
      }
    },
    "metadata": {
      "type": "object",
      "properties": {
        "executionTimeMs": {
          "type": "integer",
          "minimum": 0,
          "maximum": 3600000,
          "description": "Execution time in milliseconds"
        },
        "toolsUsed": {
          "type": "array",
          "items": {
            "type": "string",
            "enum": ["semgrep", "npm-audit", "trivy", "owasp-zap", "bandit", "gosec", "eslint-security", "snyk", "gitleaks", "trufflehog", "bearer"]
          },
          "uniqueItems": true,
          "description": "Security tools used"
        },
        "agentId": {
          "type": "string",
          "pattern": "^qe-[a-z][a-z0-9-]*$",
          "description": "Agent ID (e.g., qe-security-scanner)"
        },
        "modelUsed": {
          "type": "string",
          "description": "LLM model used for analysis"
        },
        "inputHash": {
          "type": "string",
          "pattern": "^[a-f0-9]{64}$",
          "description": "SHA-256 hash of input"
        },
        "targetUrl": {
          "type": "string",
          "format": "uri",
          "description": "Target URL if applicable"
        },
        "targetPath": {
          "type": "string",
          "description": "Target path if applicable"
        },
        "environment": {
          "type": "string",
          "enum": ["development", "staging", "production", "ci"],
          "description": "Execution environment"
        },
        "retryCount": {
          "type": "integer",
          "minimum": 0,
          "maximum": 10,
          "description": "Number of retries"
        }
      }
    },
    "validationResult": {
      "type": "object",
      "properties": {
        "schemaValid": {
          "type": "boolean",
          "description": "Passes JSON schema validation"
        },
        "contentValid": {
          "type": "boolean",
          "description": "Passes content validation"
        },
        "confidence": {
          "type": "number",
          "minimum": 0,
          "maximum": 1,
          "description": "Confidence score"
        },
        "warnings": {
          "type": "array",
          "items": {
            "type": "string",
            "maxLength": 500
          },
          "maxItems": 20,
          "description": "Validation warnings"
        },
        "errors": {
          "type": "array",
          "items": {
            "type": "string",
            "maxLength": 500
          },
          "maxItems": 20,
          "description": "Validation errors"
        },
        "validatorVersion": {
          "type": "string",
          "pattern": "^\\d+\\.\\d+\\.\\d+$",
          "description": "Validator version"
        }
      }
    },
    "learningData": {
      "type": "object",
      "properties": {
        "patternsDetected": {
          "type": "array",
          "items": {
            "type": "string",
            "maxLength": 200
          },
          "maxItems": 20,
          "description": "Security patterns detected (e.g., sql-injection-string-concat)"
        },
        "reward": {
          "type": "number",
          "minimum": 0,
          "maximum": 1,
          "description": "Reward signal for learning (0.0-1.0)"
        },
        "feedbackLoop": {
          "type": "object",
          "properties": {
            "previousRunId": {
              "type": "string",
              "format": "uuid",
              "description": "Previous run ID for comparison"
            },
            "improvement": {
              "type": "number",
              "minimum": -1,
              "maximum": 1,
              "description": "Improvement over previous run"
            }
          }
        },
        "newVulnerabilityPatterns": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "pattern": { "type": "string" },
              "cwe": { "type": "string" },
              "confidence": { "type": "number" }
            }
          },
          "description": "New vulnerability patterns learned"
        }
      }
    }
  }
}
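The schema's top-level contract (the `required` list plus the `skillName`, `trustTier`, and `status` constraints) can be spot-checked without a full validator. A stdlib-only sketch follows; a real pipeline would run the complete schema through a JSON Schema validator such as `ajv` or the Python `jsonschema` package, so treat this as illustration only:

```python
import json

# Mirrors the schema's top-level "required" list; not a substitute for
# full JSON Schema validation.
REQUIRED = ["skillName", "version", "timestamp", "status", "trustTier", "output"]


def check_top_level(doc: dict) -> list[str]:
    """Return a list of violations of the top-level constraints (empty = ok)."""
    errors = [f"missing: {k}" for k in REQUIRED if k not in doc]
    if doc.get("skillName") != "security-testing":
        errors.append("skillName must be 'security-testing'")
    if doc.get("trustTier") != 3:
        errors.append("trustTier must be 3")
    if doc.get("status") not in ("success", "partial", "failed", "skipped"):
        errors.append("invalid status")
    return errors


doc = json.loads(
    '{"skillName": "security-testing", "version": "1.0.0", '
    '"timestamp": "2026-02-02T00:00:00Z", "status": "success", '
    '"trustTier": 3, "output": {}}'
)
print(check_top_level(doc))  # → []
```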
@@ -1,45 +0,0 @@
{
  "skillName": "security-testing",
  "skillVersion": "1.0.0",
  "requiredTools": [
    "jq"
  ],
  "optionalTools": [
    "npm",
    "semgrep",
    "trivy",
    "ajv",
    "jsonschema",
    "python3"
  ],
  "schemaPath": "schemas/output.json",
  "requiredFields": [
    "skillName",
    "status",
    "output",
    "output.summary",
    "output.findings",
    "output.owaspCategories"
  ],
  "requiredNonEmptyFields": [
    "output.summary"
  ],
  "mustContainTerms": [
    "OWASP",
    "security",
    "vulnerability"
  ],
  "mustNotContainTerms": [
    "TODO",
    "placeholder",
    "FIXME"
  ],
  "enumValidations": {
    ".status": [
      "success",
      "partial",
      "failed",
      "skipped"
    ]
  }
}
@@ -0,0 +1,75 @@
---
name: test-run
description: |
  Run the project's test suite, report results, and handle failures.
  Detects test runners automatically (pytest, dotnet test, cargo test, npm test)
  or uses scripts/run-tests.sh if available.
  Trigger phrases:
  - "run tests", "test suite", "verify tests"
category: build
tags: [testing, verification, test-suite]
disable-model-invocation: true
---

# Test Run

Run the project's test suite and report results. This skill is invoked by the autopilot at verification checkpoints — after implementing tests, after implementing features, or at any point where the test suite must pass before proceeding.

## Workflow

### 1. Detect Test Runner

Check in order — first match wins:

1. `scripts/run-tests.sh` exists → use it
2. `docker-compose.test.yml` or equivalent test environment exists → spin it up first, then detect the runner below
3. Auto-detect from project files:
   - `pytest.ini`, `pyproject.toml` with `[tool.pytest.ini_options]`, or `conftest.py` → `pytest`
   - `*.csproj` or `*.sln` → `dotnet test`
   - `Cargo.toml` → `cargo test`
   - `package.json` with a `test` script → `npm test`
   - `Makefile` with a `test` target → `make test`

If no runner is detected → report failure and ask the user to specify one.

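The detection order above can be sketched as a small shell function. This is illustrative only — the autopilot's actual logic may differ, and the sketch omits the `docker-compose.test.yml` and `pyproject.toml` checks for brevity:

```shell
#!/usr/bin/env bash
# Sketch of runner detection; first match wins, "none" means undetected.
detect_runner() {
  local dir="${1:-.}"
  if [ -x "$dir/scripts/run-tests.sh" ]; then echo "scripts/run-tests.sh"; return; fi
  if [ -f "$dir/pytest.ini" ] || [ -f "$dir/conftest.py" ]; then echo "pytest"; return; fi
  if ls "$dir"/*.csproj "$dir"/*.sln >/dev/null 2>&1; then echo "dotnet test"; return; fi
  if [ -f "$dir/Cargo.toml" ]; then echo "cargo test"; return; fi
  if [ -f "$dir/package.json" ] && grep -q '"test"' "$dir/package.json"; then echo "npm test"; return; fi
  if [ -f "$dir/Makefile" ] && grep -q '^test:' "$dir/Makefile"; then echo "make test"; return; fi
  echo "none"
}
```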
### 2. Run Tests

1. Execute the detected test runner
2. Capture output: passed, failed, skipped, errors
3. If a test environment was spun up, tear it down after tests complete

### 3. Report Results

Present a summary:

```
══════════════════════════════════════
TEST RESULTS: [N passed, M failed, K skipped]
══════════════════════════════════════
```

### 4. Handle Outcome

**All tests pass** → return success to the autopilot for auto-chain.

**Tests fail** → present using Choose format:

```
══════════════════════════════════════
TEST RESULTS: [N passed, M failed, K skipped]
══════════════════════════════════════
A) Fix failing tests and re-run
B) Proceed anyway (not recommended)
C) Abort — fix manually
══════════════════════════════════════
Recommendation: A — fix failures before proceeding
══════════════════════════════════════
```

- If user picks A → attempt to fix failures, then re-run (loop back to step 2)
- If user picks B → return success with a warning to the autopilot
- If user picks C → return failure to the autopilot

## Trigger Conditions

This skill is invoked by the autopilot at test verification checkpoints. It is not typically invoked directly by the user.
@@ -0,0 +1,469 @@
|
||||
---
name: test-spec
description: |
  Test specification skill. Analyzes input data and expected results completeness,
  then produces detailed test scenarios (blackbox, performance, resilience, security, resource limits)
  that treat the system as a black box. Every test pairs input data with quantifiable expected results
  so tests can verify correctness, not just execution.
  4-phase workflow: input data + expected results analysis, test scenario specification, data + results validation gate,
  test runner script generation. Produces 8 artifacts under tests/ and 2 shell scripts under scripts/.
  Trigger phrases:
  - "test spec", "test specification", "test scenarios"
  - "blackbox test spec", "black box tests", "blackbox tests"
  - "performance tests", "resilience tests", "security tests"
category: build
tags: [testing, black-box, blackbox-tests, test-specification, qa]
disable-model-invocation: true
---

# Test Scenario Specification

Analyze input data completeness and produce detailed black-box test specifications. Tests describe what the system should do given specific inputs — they never reference internals.

## Core Principles

- **Black-box only**: tests describe observable behavior through public interfaces; no internal implementation details
- **Traceability**: every test traces to at least one acceptance criterion or restriction
- **Save immediately**: write artifacts to disk after each phase; never accumulate unsaved work
- **Ask, don't assume**: when requirements are ambiguous, ask the user before proceeding
- **Spec, don't code**: this workflow produces test specifications, never test implementation code
- **No test without data**: every test scenario MUST have concrete test data; tests without data are removed
- **No test without expected result**: every test scenario MUST pair input data with a quantifiable expected result; a test that cannot compare actual output against a known-correct answer is not verifiable and must be removed

## Context Resolution

Fixed paths — no mode detection needed:

- PROBLEM_DIR: `_docs/00_problem/`
- SOLUTION_DIR: `_docs/01_solution/`
- DOCUMENT_DIR: `_docs/02_document/`
- TESTS_OUTPUT_DIR: `_docs/02_document/tests/`

Announce the resolved paths to the user before proceeding.

## Input Specification

### Required Files

| File | Purpose |
|------|---------|
| `_docs/00_problem/problem.md` | Problem description and context |
| `_docs/00_problem/acceptance_criteria.md` | Measurable acceptance criteria |
| `_docs/00_problem/restrictions.md` | Constraints and limitations |
| `_docs/00_problem/input_data/` | Reference data examples, expected results, and optional reference files |
| `_docs/01_solution/solution.md` | Finalized solution |

### Expected Results Specification

Every input data item MUST have a corresponding expected result that defines what the system should produce. Expected results MUST be **quantifiable** — the test must be able to programmatically compare actual system output against the expected result and produce a pass/fail verdict.

Expected results live inside `_docs/00_problem/input_data/` in one or both of:

1. **Mapping file** (`input_data/expected_results/results_report.md`): a table pairing each input with its quantifiable expected output, using the format defined in `.cursor/skills/test-spec/templates/expected-results.md`

2. **Reference files folder** (`input_data/expected_results/`): machine-readable files (JSON, CSV, etc.) containing full expected outputs for complex cases, referenced from the mapping file

```
input_data/
├── expected_results/              ← required: expected results folder
│   ├── results_report.md          ← required: input→expected result mapping
│   ├── image_01_expected.csv      ← per-file expected detections
│   └── video_01_expected.csv
├── image_01.jpg
├── empty_scene.jpg
└── data_parameters.md
```

**Quantifiability requirements** (see template for full format and examples):
- Numeric values: exact value or value ± tolerance (e.g., `confidence ≥ 0.85`, `position ± 10px`)
- Structured data: exact JSON/CSV values, or a reference file in `expected_results/`
- Counts: exact counts (e.g., "3 detections", "0 errors")
- Text/patterns: exact string or regex pattern to match
- Timing: threshold (e.g., "response ≤ 500ms")
- Error cases: expected error code, message pattern, or HTTP status

### Optional Files (used when available)

| File | Purpose |
|------|---------|
| `DOCUMENT_DIR/architecture.md` | System architecture for environment design |
| `DOCUMENT_DIR/system-flows.md` | System flows for test scenario coverage |
| `DOCUMENT_DIR/components/` | Component specs for interface identification |

### Prerequisite Checks (BLOCKING)

1. `acceptance_criteria.md` exists and is non-empty — **STOP if missing**
2. `restrictions.md` exists and is non-empty — **STOP if missing**
3. `input_data/` exists and contains at least one file — **STOP if missing**
4. `input_data/expected_results/results_report.md` exists and is non-empty — **STOP if missing**. Prompt the user: *"Expected results mapping is required. Please create `_docs/00_problem/input_data/expected_results/results_report.md` pairing each input with its quantifiable expected output. Use `.cursor/skills/test-spec/templates/expected-results.md` as the format reference."*
5. `problem.md` exists and is non-empty — **STOP if missing**
6. `solution.md` exists and is non-empty — **STOP if missing**
7. Create TESTS_OUTPUT_DIR if it does not exist
8. If TESTS_OUTPUT_DIR already contains files, ask user: **resume from last checkpoint or start fresh?**

## Artifact Management

### Directory Structure

```
TESTS_OUTPUT_DIR/
├── environment.md
├── test-data.md
├── blackbox-tests.md
├── performance-tests.md
├── resilience-tests.md
├── security-tests.md
├── resource-limit-tests.md
└── traceability-matrix.md
```

### Save Timing

| Phase | Save immediately after | Filename |
|-------|------------------------|----------|
| Phase 1 | Input data analysis (no file — findings feed Phase 2) | — |
| Phase 2 | Environment spec | `environment.md` |
| Phase 2 | Test data spec | `test-data.md` |
| Phase 2 | Blackbox tests | `blackbox-tests.md` |
| Phase 2 | Performance tests | `performance-tests.md` |
| Phase 2 | Resilience tests | `resilience-tests.md` |
| Phase 2 | Security tests | `security-tests.md` |
| Phase 2 | Resource limit tests | `resource-limit-tests.md` |
| Phase 2 | Traceability matrix | `traceability-matrix.md` |
| Phase 3 | Updated test data spec (if data added) | `test-data.md` |
| Phase 3 | Updated test files (if tests removed) | respective test file |
| Phase 3 | Updated traceability matrix (if tests removed) | `traceability-matrix.md` |
| Phase 4 | Test runner script | `scripts/run-tests.sh` |
| Phase 4 | Performance test runner script | `scripts/run-performance-tests.sh` |

### Resumability

If TESTS_OUTPUT_DIR already contains files:

1. List existing files and match them to the save timing table above
2. Identify which phase/artifacts are complete
3. Resume from the next incomplete artifact
4. Inform the user which artifacts are being skipped

## Progress Tracking

At the start of execution, create a TodoWrite with all four phases. Update status as each phase completes.

## Workflow

### Phase 1: Input Data & Expected Results Completeness Analysis

**Role**: Professional Quality Assurance Engineer
**Goal**: Assess whether the available input data is sufficient to build comprehensive test scenarios
**Constraints**: Analysis only — no test specs yet

1. Read `_docs/01_solution/solution.md`
2. Read `acceptance_criteria.md`, `restrictions.md`
3. Read testing strategy from solution.md (if present)
4. If `DOCUMENT_DIR/architecture.md` and `DOCUMENT_DIR/system-flows.md` exist, read them for additional context on system interfaces and flows
5. Read `input_data/expected_results/results_report.md` and any referenced files in `input_data/expected_results/`
6. Analyze `input_data/` contents against:
   - Coverage of acceptance criteria scenarios
   - Coverage of restriction edge cases
   - Coverage of testing strategy requirements
7. Analyze `input_data/expected_results/results_report.md` completeness:
   - Every input data item has a corresponding expected result row in the mapping
   - Expected results are quantifiable (contain numeric thresholds, exact values, patterns, or file references — not vague descriptions like "works correctly" or "returns result")
   - Expected results specify a comparison method (exact match, tolerance range, pattern match, threshold) per the template
   - Reference files in `input_data/expected_results/` that are cited in the mapping actually exist and are valid
8. Present input-to-expected-result pairing assessment:

   | Input Data | Expected Result Provided? | Quantifiable? | Issue (if any) |
   |------------|--------------------------|---------------|----------------|
   | [file/data] | Yes/No | Yes/No | [missing, vague, no tolerance, etc.] |

9. Threshold: at least 70% coverage of scenarios AND every covered scenario has a quantifiable expected result (see `.cursor/rules/cursor-meta.mdc` Quality Thresholds table)
10. If coverage is low, search the internet for supplementary data, assess quality with user, and if user agrees, add to `input_data/` and update `input_data/expected_results/results_report.md`
11. If expected results are missing or not quantifiable, ask user to provide them before proceeding

**BLOCKING**: Do NOT proceed until user confirms both input data coverage AND expected results completeness are sufficient.

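Step 7's first check (every input has a row in the mapping) can be sketched as a helper. This is hypothetical, and it assumes each input's file name appears verbatim in its `results_report.md` row:

```shell
#!/usr/bin/env bash
# Print inputs missing from the expected-results mapping; final stdout line is the count.
check_expected_results() {
  local input_dir="$1"
  local report="$input_dir/expected_results/results_report.md"
  local missing=0 f name
  for f in "$input_dir"/*; do
    [ -f "$f" ] || continue   # skips the expected_results/ subfolder itself
    name=$(basename "$f")
    if ! grep -q "$name" "$report"; then
      echo "MISSING expected result for: $name" >&2
      missing=$((missing + 1))
    fi
  done
  echo "$missing"
}
```

A non-zero count means the pairing assessment table in step 8 will have "No" rows that block Phase 1.
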
---

### Phase 2: Test Scenario Specification

**Role**: Professional Quality Assurance Engineer
**Goal**: Produce detailed black-box test specifications covering blackbox, performance, resilience, security, and resource limit scenarios
**Constraints**: Spec only — no test code. Tests describe what the system should do given specific inputs, not how the system is built.

Based on all acquired data, acceptance_criteria, and restrictions, form detailed test scenarios:

1. Define test environment using `.cursor/skills/plan/templates/test-environment.md` as structure
2. Define test data management using `.cursor/skills/plan/templates/test-data.md` as structure
3. Write blackbox test scenarios (positive + negative) using `.cursor/skills/plan/templates/blackbox-tests.md` as structure
4. Write performance test scenarios using `.cursor/skills/plan/templates/performance-tests.md` as structure
5. Write resilience test scenarios using `.cursor/skills/plan/templates/resilience-tests.md` as structure
6. Write security test scenarios using `.cursor/skills/plan/templates/security-tests.md` as structure
7. Write resource limit test scenarios using `.cursor/skills/plan/templates/resource-limit-tests.md` as structure
8. Build traceability matrix using `.cursor/skills/plan/templates/traceability-matrix.md` as structure

**Self-verification**:
- [ ] Every acceptance criterion is covered by at least one test scenario
- [ ] Every restriction is verified by at least one test scenario
- [ ] Every test scenario has a quantifiable expected result from `input_data/expected_results/results_report.md`
- [ ] Expected results use comparison methods from `.cursor/skills/test-spec/templates/expected-results.md`
- [ ] Positive and negative scenarios are balanced
- [ ] Consumer app has no direct access to system internals
- [ ] Docker environment is self-contained (`docker compose up` sufficient)
- [ ] External dependencies have mock/stub services defined
- [ ] Traceability matrix has no uncovered AC or restrictions

**Save action**: Write all files under TESTS_OUTPUT_DIR:
- `environment.md`
- `test-data.md`
- `blackbox-tests.md`
- `performance-tests.md`
- `resilience-tests.md`
- `security-tests.md`
- `resource-limit-tests.md`
- `traceability-matrix.md`

**BLOCKING**: Present test coverage summary (from traceability-matrix.md) to user. Do NOT proceed until confirmed.

Capture any new questions, findings, or insights that arise during test specification — these feed forward into downstream skills (plan, refactor, etc.).

---

### Phase 3: Test Data & Expected Results Validation Gate (HARD GATE)

**Role**: Professional Quality Assurance Engineer
**Goal**: Ensure every test scenario produced in Phase 2 has concrete, sufficient test data. Remove tests that lack data. Verify final coverage stays at or above 70%.
**Constraints**: This phase is MANDATORY and cannot be skipped.

#### Step 1 — Build the test-data and expected-result requirements checklist

Scan `blackbox-tests.md`, `performance-tests.md`, `resilience-tests.md`, `security-tests.md`, and `resource-limit-tests.md`. For every test scenario, extract:

| # | Test Scenario ID | Test Name | Required Input Data | Required Expected Result | Result Quantifiable? | Comparison Method | Input Provided? | Expected Result Provided? |
|---|-----------------|-----------|---------------------|-------------------------|---------------------|-------------------|----------------|--------------------------|
| 1 | [ID] | [name] | [data description] | [what system should output] | [Yes/No] | [exact/tolerance/pattern/threshold] | [Yes/No] | [Yes/No] |

Present this table to the user.

#### Step 2 — Ask user to provide missing test data AND expected results

For each row where **Input Provided?** is **No** OR **Expected Result Provided?** is **No**, ask the user:

> **Option A — Provide the missing items**: Supply what is missing:
> - **Missing input data**: Place test data files in `_docs/00_problem/input_data/` or indicate the location.
> - **Missing expected result**: Provide the quantifiable expected result for this input. Update `_docs/00_problem/input_data/expected_results/results_report.md` with a row mapping the input to its expected output. If the expected result is complex, provide a reference CSV file in `_docs/00_problem/input_data/expected_results/`. Use `.cursor/skills/test-spec/templates/expected-results.md` for format guidance.
>
> Expected results MUST be quantifiable — the test must be able to programmatically compare actual vs expected. Examples:
> - "3 detections with bounding boxes [(x1,y1,x2,y2), ...] ± 10px"
> - "HTTP 200 with JSON body matching `expected_response_01.json`"
> - "Processing time < 500ms"
> - "0 false positives in the output set"
>
> **Option B — Skip this test**: If you cannot provide the data or expected result, this test scenario will be **removed** from the specification.

**BLOCKING**: Wait for the user's response for every missing item.

#### Step 3 — Validate provided data and expected results

For each item where the user chose **Option A**:

**Input data validation**:
1. Verify the data file(s) exist at the indicated location
2. Verify **quality**: data matches the format, schema, and constraints described in the test scenario (e.g., correct image resolution, valid JSON structure, expected value ranges)
3. Verify **quantity**: enough data samples to cover the scenario (e.g., at least N images for a batch test, multiple edge-case variants)

**Expected result validation**:
4. Verify the expected result exists in `input_data/expected_results/results_report.md` or as a referenced file in `input_data/expected_results/`
5. Verify **quantifiability**: the expected result can be evaluated programmatically — it must contain at least one of:
   - Exact values (counts, strings, status codes)
   - Numeric values with tolerance (e.g., `± 10px`, `≥ 0.85`)
   - Pattern matches (regex, substring, JSON schema)
   - Thresholds (e.g., `< 500ms`, `≤ 5% error rate`)
   - Reference file for structural comparison (JSON diff, CSV diff)
6. Verify **completeness**: the expected result covers all outputs the test checks (not just one field when the test validates multiple)
7. Verify **consistency**: the expected result is consistent with the acceptance criteria it traces to

If any validation fails, report the specific issue and loop back to Step 2 for that item.

#### Step 4 — Remove tests without data or expected results

For each item where the user chose **Option B**:

1. Warn the user: `⚠️ Test scenario [ID] "[Name]" will be REMOVED from the specification due to missing test data or expected result.`
2. Remove the test scenario from the respective test file
3. Remove corresponding rows from `traceability-matrix.md`
4. Update `test-data.md` to reflect the removal

**Save action**: Write updated files under TESTS_OUTPUT_DIR:
- `test-data.md`
- Affected test files (if tests removed)
- `traceability-matrix.md` (if tests removed)

#### Step 5 — Final coverage check

After all removals, recalculate coverage:

1. Count remaining test scenarios that trace to acceptance criteria
2. Count total acceptance criteria + restrictions
3. Calculate coverage percentage: `covered_items / total_items * 100`

| Metric | Value |
|--------|-------|
| Total AC + Restrictions | ? |
| Covered by remaining tests | ? |
| **Coverage %** | **?%** |

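The percentage in step 3 can be computed with shell arithmetic (a trivial sketch; the two counts come from the traceability matrix):

```shell
# Integer coverage percentage from covered and total item counts.
coverage_pct() {
  local covered="$1" total="$2"
  [ "$total" -gt 0 ] || { echo 0; return; }   # guard against division by zero
  echo $(( covered * 100 / total ))
}
```

For example, 12 covered of 15 items reports 80, which clears the 70% gate; 7 of 10 sits exactly on the boundary.
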
**Decision**:

- **Coverage ≥ 70%** → Phase 3 **PASSED**. Present final summary to user.
- **Coverage < 70%** → Phase 3 **FAILED**. Report:
  > ❌ Test coverage dropped to **X%** (minimum 70% required). The removed test scenarios left gaps in the following acceptance criteria / restrictions:
  >
  > | Uncovered Item | Type (AC/Restriction) | Missing Test Data Needed |
  > |---|---|---|
  >
  > **Action required**: Provide the missing test data for the items above, or add alternative test scenarios that cover these items with data you can supply.

**BLOCKING**: Loop back to Step 2 with the uncovered items. Do NOT finalize until coverage ≥ 70%.

#### Phase 3 Completion

When coverage ≥ 70% and all remaining tests have validated data AND quantifiable expected results:

1. Present the final coverage report
2. List all removed tests (if any) with reasons
3. Confirm every remaining test has: input data + quantifiable expected result + comparison method
4. Confirm all artifacts are saved and consistent

---

### Phase 4: Test Runner Script Generation

**Role**: DevOps engineer
**Goal**: Generate executable shell scripts that run the specified tests, so the autopilot and CI can invoke them consistently.
**Constraints**: Scripts must be idempotent, portable across dev/CI, and exit with non-zero on failure.

#### Step 1 — Detect test infrastructure

1. Identify the project's test runner from manifests and config files:
   - Python: `pytest` (pyproject.toml, setup.cfg, pytest.ini)
   - .NET: `dotnet test` (*.csproj, *.sln)
   - Rust: `cargo test` (Cargo.toml)
   - Node: `npm test` or `vitest` / `jest` (package.json)
2. Identify docker-compose files for integration/blackbox tests (`docker-compose.test.yml`, `e2e/docker-compose*.yml`)
3. Identify performance/load testing tools from dependencies (k6, locust, artillery, wrk, or built-in benchmarks)
4. Read `TESTS_OUTPUT_DIR/environment.md` for infrastructure requirements

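Step 1's manifest-to-runner mapping can be sketched as follows (a sketch only: first match wins, and real detection would also honor `vitest`/`jest` configs):

```shell
#!/usr/bin/env bash
# Map well-known manifest files to a test runner command (first match wins).
detect_test_runner() {
  if [ -f pyproject.toml ] || [ -f setup.cfg ] || [ -f pytest.ini ]; then
    echo "pytest"
  elif compgen -G "*.sln" > /dev/null || compgen -G "*.csproj" > /dev/null; then
    echo "dotnet test"
  elif [ -f Cargo.toml ]; then
    echo "cargo test"
  elif [ -f package.json ]; then
    echo "npm test"
  else
    return 1   # unknown stack: ask the user
  fi
}
```
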
#### Step 2 — Generate `scripts/run-tests.sh`

Create `scripts/run-tests.sh` at the project root using `.cursor/skills/test-spec/templates/run-tests-script.md` as structural guidance. The script must:

1. Set `set -euo pipefail` and trap cleanup on EXIT
2. Optionally accept a `--unit-only` flag to skip blackbox tests
3. Run unit tests using the detected test runner
4. If blackbox tests exist: spin up docker-compose environment, wait for health checks, run blackbox test suite, tear down
5. Print a summary of passed/failed/skipped tests
6. Exit 0 on all pass, exit 1 on any failure

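A minimal skeleton satisfying requirements 1–6 might look like this. The compose file name and the `pytest` commands are placeholders (Step 1's detection supplies the real values), and the generated script would start with `set -euo pipefail` and end by invoking `main "$@"`:

```shell
#!/usr/bin/env bash
# Skeleton for scripts/run-tests.sh; the real script starts with: set -euo pipefail

COMPOSE_FILE="docker-compose.test.yml"   # placeholder: use the detected compose file

cleanup() {
  # Tear down the blackbox environment even when a test step fails.
  docker compose -f "$COMPOSE_FILE" down --volumes > /dev/null 2>&1 || true
}

main() {
  trap cleanup EXIT
  local unit_only=0
  [ "${1:-}" = "--unit-only" ] && unit_only=1

  echo "== Unit tests =="
  pytest tests/unit                      # placeholder: use the detected runner

  if [ "$unit_only" -eq 0 ] && [ -f "$COMPOSE_FILE" ]; then
    echo "== Blackbox tests =="
    docker compose -f "$COMPOSE_FILE" up -d --wait   # --wait blocks on health checks
    pytest tests/blackbox
  fi
  echo "All tests passed."   # set -e guarantees a non-zero exit on any failure above
}
```
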
#### Step 3 — Generate `scripts/run-performance-tests.sh`

Create `scripts/run-performance-tests.sh` at the project root. The script must:

1. Set `set -euo pipefail` and trap cleanup on EXIT
2. Read thresholds from `_docs/02_document/tests/performance-tests.md` (or accept as CLI args)
3. Spin up the system under test (docker-compose or local)
4. Run load/performance scenarios using the detected tool
5. Compare results against threshold values from the test spec
6. Print a pass/fail summary per scenario
7. Exit 0 if all thresholds met, exit 1 otherwise

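The comparison in step 5 can be sketched as a float-safe helper (hypothetical; measured values come from the load tool's output):

```shell
# Exit 0 when a measured value is within a "<=" threshold, 1 otherwise (float-safe via awk).
meets_threshold_max() {
  local measured="$1" threshold="$2"
  awk -v m="$measured" -v t="$threshold" 'BEGIN { exit !(m <= t) }'
}
```

Usage: `meets_threshold_max 430 500 && echo PASS` accepts a 430ms measurement against a ≤ 500ms threshold.
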
#### Step 4 — Verify scripts

1. Verify both scripts are syntactically valid (`bash -n scripts/run-tests.sh`)
2. Mark both scripts as executable (`chmod +x`)
3. Present a summary of what each script does to the user

**Save action**: Write `scripts/run-tests.sh` and `scripts/run-performance-tests.sh` under `scripts/` at the project root.

---

## Escalation Rules

| Situation | Action |
|-----------|--------|
| Missing acceptance_criteria.md, restrictions.md, or input_data/ | **STOP** — specification cannot proceed |
| Missing input_data/expected_results/results_report.md | **STOP** — ask user to provide expected results mapping using the template |
| Ambiguous requirements | ASK user |
| Input data coverage below 70% (Phase 1) | Search internet for supplementary data, ASK user to validate |
| Expected results missing or not quantifiable (Phase 1) | ASK user to provide quantifiable expected results before proceeding |
| Test scenario conflicts with restrictions | ASK user to clarify intent |
| System interfaces unclear (no architecture.md) | ASK user or derive from solution.md |
| Test data or expected result not provided for a test scenario (Phase 3) | WARN user and REMOVE the test |
| Final coverage below 70% after removals (Phase 3) | BLOCK — require user to supply data or accept reduced spec |

## Common Mistakes

- **Referencing internals**: tests must be black-box — no internal module names, no direct DB queries against the system under test
- **Vague expected outcomes**: "works correctly" is not a test outcome; use specific measurable values
- **Missing expected results**: input data without a paired expected result is useless — the test cannot determine pass/fail without knowing what "correct" looks like
- **Non-quantifiable expected results**: "should return good results" is not verifiable; expected results must have exact values, tolerances, thresholds, or pattern matches that code can evaluate
- **Missing negative scenarios**: every positive scenario category should have corresponding negative/edge-case tests
- **Untraceable tests**: every test should trace to at least one AC or restriction
- **Writing test code**: this skill produces specifications, never implementation code
- **Tests without data**: every test scenario MUST have concrete test data AND a quantifiable expected result; a test spec without either is not executable and must be removed

## Trigger Conditions

When the user wants to:
- Specify blackbox tests before implementation or refactoring
- Analyze input data completeness for test coverage
- Produce test scenarios from acceptance criteria

**Keywords**: "test spec", "test specification", "blackbox test spec", "black box tests", "blackbox tests", "test scenarios"

## Methodology Quick Reference

```
┌──────────────────────────────────────────────────────────────────────┐
│ Test Scenario Specification (4-Phase)                                │
├──────────────────────────────────────────────────────────────────────┤
│ PREREQ: Data Gate (BLOCKING)                                         │
│   → verify AC, restrictions, input_data (incl. expected_results.md)  │
│                                                                      │
│ Phase 1: Input Data & Expected Results Completeness Analysis         │
│   → assess input_data/ coverage vs AC scenarios (≥70%)               │
│   → verify every input has a quantifiable expected result            │
│   → present input→expected-result pairing assessment                 │
│   [BLOCKING: user confirms input data + expected results coverage]   │
│                                                                      │
│ Phase 2: Test Scenario Specification                                 │
│   → environment.md                                                   │
│   → test-data.md (with expected results mapping)                     │
│   → blackbox-tests.md (positive + negative)                          │
│   → performance-tests.md                                             │
│   → resilience-tests.md                                              │
│   → security-tests.md                                                │
│   → resource-limit-tests.md                                          │
│   → traceability-matrix.md                                           │
│   [BLOCKING: user confirms test coverage]                            │
│                                                                      │
│ Phase 3: Test Data & Expected Results Validation Gate (HARD GATE)    │
│   → build test-data + expected-result requirements checklist         │
│   → ask user: provide data+result (A) or remove test (B)             │
│   → validate input data (quality + quantity)                         │
│   → validate expected results (quantifiable + comparison method)     │
│   → remove tests without data or expected result, warn user          │
│   → final coverage check (≥70% or FAIL + loop back)                  │
│   [BLOCKING: coverage ≥ 70% required to pass]                        │
│                                                                      │
│ Phase 4: Test Runner Script Generation                               │
│   → detect test runner + docker-compose + load tool                  │
│   → scripts/run-tests.sh (unit + blackbox)                           │
│   → scripts/run-performance-tests.sh (load/perf scenarios)           │
│   → verify scripts are valid and executable                          │
├──────────────────────────────────────────────────────────────────────┤
│ Principles: Black-box only · Traceability · Save immediately         │
│             Ask don't assume · Spec don't code                       │
│             No test without data · No test without expected result   │
└──────────────────────────────────────────────────────────────────────┘
```

@@ -0,0 +1,135 @@
# Expected Results Template

Save as `_docs/00_problem/input_data/expected_results/results_report.md`.
For complex expected outputs, place reference CSV files alongside it in `_docs/00_problem/input_data/expected_results/`.
Referenced by the test-spec skill (`.cursor/skills/test-spec/SKILL.md`).

---

```markdown
# Expected Results

Maps every input data item to its quantifiable expected result.
Tests use this mapping to compare actual system output against known-correct answers.

## Result Format Legend

| Result Type | When to Use | Example |
|-------------|-------------|---------|
| Exact value | Output must match precisely | `status_code: 200`, `detection_count: 3` |
| Tolerance range | Numeric output with acceptable variance | `confidence: 0.92 ± 0.05`, `bbox_x: 120 ± 10px` |
| Threshold | Output must exceed or stay below a limit | `latency < 500ms`, `confidence ≥ 0.85` |
| Pattern match | Output must match a string/regex pattern | `error_message contains "invalid format"` |
| File reference | Complex output compared against a reference file | `match expected_results/case_01.json` |
| Schema match | Output structure must conform to a schema | `response matches DetectionResultSchema` |
| Set/count | Output must contain specific items or counts | `classes ⊇ {"car", "person"}`, `detections.length == 5` |

## Comparison Methods

| Method | Description | Tolerance Syntax |
|--------|-------------|-----------------|
| `exact` | Actual == Expected | N/A |
| `numeric_tolerance` | abs(actual - expected) ≤ tolerance | `± <value>` or `± <percent>%` |
| `range` | min ≤ actual ≤ max | `[min, max]` |
| `threshold_min` | actual ≥ threshold | `≥ <value>` |
| `threshold_max` | actual ≤ threshold | `≤ <value>` |
| `regex` | actual matches regex pattern | regex string |
| `substring` | actual contains substring | substring |
| `json_diff` | structural comparison against reference JSON | diff tolerance per field |
| `set_contains` | actual output set contains expected items | subset notation |
| `file_reference` | compare against reference file in expected_results/ | file path |

## Input → Expected Result Mapping

### [Scenario Group Name, e.g. "Single Image Detection"]

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|---------------|
| 1 | `[file or parameters]` | [what this input represents] | [quantifiable expected output] | [method from table above] | [± value, range, or N/A] | [path in expected_results/ or N/A] |

#### Example — Object Detection

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|---------------|
| 1 | `image_01.jpg` | Aerial photo, 3 vehicles visible | `detection_count: 3`, classes: `["ArmorVehicle", "ArmorVehicle", "Truck"]` | exact (count), set_contains (classes) | N/A | N/A |
| 2 | `image_01.jpg` | Same image, bbox positions | bboxes: `[(120,80,340,290), (400,150,580,310), (50,400,200,520)]` | numeric_tolerance | ± 15px per coordinate | `expected_results/image_01_detections.json` |
| 3 | `image_01.jpg` | Same image, confidence scores | confidences: `[0.94, 0.88, 0.91]` | threshold_min | each ≥ 0.85 | N/A |
| 4 | `empty_scene.jpg` | Aerial photo, no objects | `detection_count: 0`, empty detections array | exact | N/A | N/A |
| 5 | `corrupted.dat` | Invalid file format | HTTP 400, body contains `"error"` key | exact (status), substring (body) | N/A | N/A |

#### Example — Performance

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|---------------|
| 1 | `standard_image.jpg` | 1920x1080 single image | Response time | threshold_max | ≤ 2000ms | N/A |
| 2 | `large_image.jpg` | 8000x6000 tiled image | Response time | threshold_max | ≤ 10000ms | N/A |

#### Example — Error Handling

| # | Input | Input Description | Expected Result | Comparison | Tolerance | Reference File |
|---|-------|-------------------|-----------------|------------|-----------|---------------|
| 1 | `POST /detect` with no file | Missing required input | HTTP 422, message matches `"file.*required"` | exact (status), regex (message) | N/A | N/A |
| 2 | `POST /detect` with `probability_threshold: 5.0` | Out-of-range config | HTTP 422 or clamped to valid range | exact (status) or range [0.0, 1.0] | N/A | N/A |

## Expected Result Reference Files
|
||||
|
||||
When the expected output is too complex for an inline table cell (e.g., full JSON response with nested objects), place a reference file in `_docs/00_problem/input_data/expected_results/`.
|
||||
|
||||
### File Naming Convention
|
||||
|
||||
`<input_name>_expected.<format>`
|
||||
|
||||
Examples:
|
||||
- `image_01_detections.json`
|
||||
- `batch_A_results.csv`
|
||||
- `video_01_annotations.json`
|
||||
|
||||
### Reference File Requirements
|
||||
|
||||
- Must be machine-readable (JSON, CSV, YAML — not prose)
|
||||
- Must contain only the expected output structure and values
|
||||
- Must include tolerance annotations where applicable (as metadata fields or comments)
|
||||
- Must be valid and parseable by standard libraries
|
||||
|
||||
### Reference File Example (JSON)
|
||||
|
||||
File: `expected_results/image_01_detections.json`
|
||||
|
||||
```json
|
||||
{
|
||||
"input": "image_01.jpg",
|
||||
"expected": {
|
||||
"detection_count": 3,
|
||||
"detections": [
|
||||
{
|
||||
"class": "ArmorVehicle",
|
||||
"confidence": { "min": 0.85 },
|
||||
"bbox": { "x1": 120, "y1": 80, "x2": 340, "y2": 290, "tolerance_px": 15 }
|
||||
},
|
||||
{
|
||||
"class": "ArmorVehicle",
|
||||
"confidence": { "min": 0.85 },
|
||||
"bbox": { "x1": 400, "y1": 150, "x2": 580, "y2": 310, "tolerance_px": 15 }
|
||||
},
|
||||
{
|
||||
"class": "Truck",
|
||||
"confidence": { "min": 0.85 },
|
||||
"bbox": { "x1": 50, "y1": 400, "x2": 200, "y2": 520, "tolerance_px": 15 }
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
```
|
||||
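
A blackbox test can consume such a reference file mechanically. Below is a hedged Python sketch; the shape of the `actual` response payload and the helper names are illustrative assumptions, not part of this specification:

```python
import json

def bbox_within_tolerance(actual_bbox, expected_bbox):
    """numeric_tolerance: each coordinate within expected value ± tolerance_px."""
    tol = expected_bbox["tolerance_px"]
    return all(abs(actual_bbox[k] - expected_bbox[k]) <= tol for k in ("x1", "y1", "x2", "y2"))

def check_against_reference(actual, reference_path):
    """Compare an actual detection response against an expected_results/ reference file."""
    with open(reference_path) as f:
        expected = json.load(f)["expected"]
    # exact: detection count
    assert actual["detection_count"] == expected["detection_count"]
    for act, exp in zip(actual["detections"], expected["detections"]):
        assert act["class"] == exp["class"]                     # exact: class label
        assert act["confidence"] >= exp["confidence"]["min"]    # threshold_min
        assert bbox_within_tolerance(act["bbox"], exp["bbox"])  # numeric_tolerance
```

Because every tolerance lives in the reference file, the test code itself stays generic across inputs.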

---

## Guidance Notes

- Every row in the mapping table must have at least one quantifiable comparison — no row should say only "should work" or "returns result".
- Use `exact` comparison for counts, status codes, and discrete values.
- Use `numeric_tolerance` for floating-point values and spatial coordinates where minor variance is expected.
- Use `threshold_min`/`threshold_max` for performance metrics and confidence scores.
- Use `file_reference` when the expected output has more than ~3 fields or nested structures.
- Reference files must be committed alongside input data — they are part of the test specification.
- When the system has non-deterministic behavior (e.g., model inference variance across hardware), document the expected tolerance explicitly and justify it.

@@ -0,0 +1,88 @@
# Test Runner Script Structure

Reference for generating `scripts/run-tests.sh` and `scripts/run-performance-tests.sh`.

## `scripts/run-tests.sh`

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
UNIT_ONLY=false
RESULTS_DIR="$PROJECT_ROOT/test-results"

for arg in "$@"; do
  case $arg in
    --unit-only) UNIT_ONLY=true ;;
  esac
done

cleanup() {
  : # tear down docker-compose if it was started (a function body needs at least one command)
}
trap cleanup EXIT

mkdir -p "$RESULTS_DIR"

# --- Unit Tests ---
# [detect runner: pytest / dotnet test / cargo test / npm test]
# [run and capture exit code]
# [save results to $RESULTS_DIR/unit-results.*]

# --- Blackbox Tests (skip if --unit-only) ---
# if ! $UNIT_ONLY; then
#   [docker compose -f <compose-file> up -d]
#   [wait for health checks]
#   [run blackbox test suite]
#   [save results to $RESULTS_DIR/blackbox-results.*]
# fi

# --- Summary ---
# [print passed / failed / skipped counts]
# [exit 0 if all passed, exit 1 otherwise]
```

## `scripts/run-performance-tests.sh`

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
RESULTS_DIR="$PROJECT_ROOT/test-results"

cleanup() {
  : # tear down test environment if started (a function body needs at least one command)
}
trap cleanup EXIT

mkdir -p "$RESULTS_DIR"

# --- Start System Under Test ---
# [docker compose up -d or start local server]
# [wait for health checks]

# --- Run Performance Scenarios ---
# [detect tool: k6 / locust / artillery / wrk / built-in]
# [run each scenario from performance-tests.md]
# [capture metrics: latency P50/P95/P99, throughput, error rate]

# --- Compare Against Thresholds ---
# [read thresholds from test spec or CLI args]
# [print per-scenario pass/fail]

# --- Summary ---
# [exit 0 if all thresholds met, exit 1 otherwise]
```
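
The "Compare Against Thresholds" step reduces to a small piece of logic once metrics are collected. A sketch under the assumption that metrics and thresholds arrive as plain dictionaries keyed by scenario (the metric names here are illustrative):

```python
def check_thresholds(metrics, thresholds):
    """Return per-scenario pass/fail for measured metrics vs. spec thresholds."""
    results = {}
    for scenario, limits in thresholds.items():
        measured = metrics[scenario]
        # Every limit (e.g. p95_ms, error_rate) must be at or under its threshold.
        results[scenario] = all(measured[name] <= limit for name, limit in limits.items())
    return results
```

The overall exit code is then `0` only if every scenario passed.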

## Key Requirements

- Both scripts must be idempotent (safe to run multiple times)
- Both scripts must work in CI (no interactive prompts, no GUI)
- Use `trap cleanup EXIT` to ensure teardown even on failure
- Exit codes: 0 = all pass, 1 = failures detected
- Write results to `test-results/` directory (add to `.gitignore` if not already present)
- The actual commands depend on the detected tech stack — fill them in during Phase 4 of the test-spec skill

@@ -0,0 +1,254 @@
---
name: ui-design
description: |
  End-to-end UI design workflow: requirements gathering → design system synthesis → HTML+CSS mockup generation → visual verification → iterative refinement.
  Zero external dependencies. Optional MCP enhancements (RenderLens, AccessLint).
  Two modes:
  - Full workflow: phases 0-8 for complex design tasks
  - Quick mode: skip to code generation for simple requests
  Command entry points:
  - /design-audit — quality checks on existing mockup
  - /design-polish — final refinement pass
  - /design-critique — UX review with feedback
  - /design-regen — regenerate with different direction
  Trigger phrases:
  - "design a UI", "create a mockup", "build a page"
  - "make a landing page", "design a dashboard"
  - "mockup", "design system", "UI design"
category: create
tags: [ui-design, mockup, html, css, tailwind, design-system, accessibility]
disable-model-invocation: true
---

# UI Design Skill

End-to-end UI design workflow producing production-quality HTML+CSS mockups entirely within Cursor, with zero external tool dependencies.

## Core Principles

- **Design intent over defaults**: never settle for generic AI output; every visual choice must trace to user requirements
- **Verify visually**: AI must see what it generates whenever possible (browser screenshots)
- **Tokens over hardcoded values**: use CSS custom properties with semantic naming, not raw hex
- **Restraint over decoration**: less is more; every visual element must earn its place
- **Ask, don't assume**: when design direction is ambiguous, STOP and ask the user
- **One screen at a time**: generate individual screens, not entire applications at once

## Context Resolution

Determine the operating mode based on invocation before any other logic runs.

**Project mode** (default — `_docs/` structure exists):
- MOCKUPS_DIR: `_docs/02_document/ui_mockups/`

**Standalone mode** (explicit input file provided, e.g. `/ui-design @some_brief.md`):
- INPUT_FILE: the provided file (treated as design brief)
- MOCKUPS_DIR: `_standalone/ui_mockups/`

Create MOCKUPS_DIR if it does not exist. Announce the detected mode and resolved path to the user.

## Output Directory

All generated artifacts go to `MOCKUPS_DIR`:

```
MOCKUPS_DIR/
├── DESIGN.md            # Generated design system (three-layer tokens)
├── index.html           # Main mockup (or named per page)
└── [page-name].html     # Additional pages if multi-page
```

## Complexity Detection (Phase 0)

Before starting the workflow, classify the request:

**Quick mode** — skip to Phase 5 (Code Generation):
- Request is a single component or screen
- User provides enough style context in their message
- `MOCKUPS_DIR/DESIGN.md` already exists
- Signals: "just make a...", "quick mockup of...", single component name, fewer than 2 sentences

**Full mode** — run phases 1-8:
- Multi-page request
- Brand-specific requirements
- "design system for...", complex layouts, dashboard/admin panel
- No existing DESIGN.md

Announce the detected mode to the user.


## Phase 1: Context Check

1. Check for existing project documentation: PRD, design specs, README with design notes
2. Check for existing `MOCKUPS_DIR/DESIGN.md`
3. Check for existing mockups in `MOCKUPS_DIR/`
4. If DESIGN.md exists → announce "Using existing design system" → skip to Phase 5
5. If project docs with design info exist → extract requirements from them, skip to Phase 3

## Phase 2: Requirements Gathering

Use the AskQuestion tool for structured input. Adapt based on what Phase 1 found — only ask for what's missing.

**Round 1 — Structural:**

Ask using AskQuestion with these questions:
- **Page type**: landing, dashboard, form, settings, profile, admin panel, e-commerce, blog, documentation, other
- **Target audience**: developers, business users, consumers, internal team, general public
- **Platform**: web desktop-first, web mobile-first
- **Key sections**: header, hero, sidebar, main content, cards grid, data table, form, footer (allow multiple)

**Round 2 — Design Intent:**

Ask using AskQuestion with these questions:
- **Visual atmosphere**: Airy & spacious / Dense & data-rich / Warm & approachable / Sharp & technical / Luxurious & premium
- **Color mood**: Cool blues & grays / Warm earth tones / Bold & vibrant / Monochrome / Dark mode / Let AI choose based on atmosphere / Custom (specify brand colors)
- **Typography mood**: Geometric (modern, clean) / Humanist (friendly, readable) / Monospace (technical, code-like) / Serif (editorial, premium)

Then ask in free-form:
- "Name an app or website whose look you admire" (optional, helps anchor style)
- "Any specific content, copy, or data to include?"

## Phase 3: Direction Exploration

Generate 2-3 text-based direction summaries. Each direction is 3-5 sentences describing:
- Visual approach and mood
- Color palette direction (specific hues, not just "blue")
- Layout strategy (grid type, density, whitespace approach)
- Typography choice (specific font suggestions, not just "sans-serif")

Present to user: "Here are 2-3 possible directions. Which resonates? Or describe a blend."

Wait for the user to pick before proceeding.

## Phase 4: Design System Synthesis

Generate `MOCKUPS_DIR/DESIGN.md` using the template from `templates/design-system.md`.

The generated DESIGN.md must include all 6 sections:
1. Visual Atmosphere — descriptive mood (never "clean and modern")
2. Color System — three-layer CSS custom properties (primitives → semantic → component)
3. Typography — specific font family, weight hierarchy, size scale with rem values
4. Spacing & Layout — base unit, spacing scale, grid, breakpoints
5. Component Styling Defaults — buttons, cards, inputs, navigation with all states
6. Interaction States — loading, error, empty, hover, focus, disabled patterns

Read `references/design-vocabulary.md` for atmosphere descriptors and style vocabulary to use when writing the DESIGN.md.

## Phase 5: Code Generation

Construct the generation by combining context from multiple sources:

1. Read `MOCKUPS_DIR/DESIGN.md` for the design system
2. Read `references/components.md` for component best practices relevant to the page type
3. Read `references/anti-patterns.md` for explicit avoidance instructions

Generate `MOCKUPS_DIR/[page-name].html` as a single file with:
- `<script src="https://cdn.tailwindcss.com"></script>` for Tailwind
- `<style>` block with all CSS custom properties from DESIGN.md
- Tailwind config override in `<script>` to map tokens to the Tailwind theme
- Semantic HTML (nav, main, section, article, footer)
- Mobile-first responsive design
- All interactive elements with hover, focus, active states
- At least one loading skeleton example
- Proper heading hierarchy (single h1)

**Anti-AI-Slop guard clauses** (MANDATORY — read `references/anti-patterns.md` for the full list):
- Do NOT use Inter or Roboto unless the user explicitly requested them
- Do NOT default to a purple/indigo accent color
- Do NOT create "card soup" — vary layout patterns
- Do NOT make all buttons equal weight
- Do NOT over-decorate
- Use the actual tokens from DESIGN.md, not hardcoded values

For quick mode without DESIGN.md: use a sensible default design system matching the request context. Still follow all anti-slop rules.

## Phase 6: Visual Verification

Tiered verification — use the best available tool:

**Layer 1 — Structural Check** (always runs):
Read `references/quality-checklist.md` and verify against the structural checklist.

**Layer 2 — Visual Check** (when browser tool is available):
1. Open the generated HTML file using the browser tool
2. Take screenshots at desktop (1440px) width
3. Examine the screenshot for: spacing consistency, alignment, color rendering, typography hierarchy, overall visual balance
4. Compare against DESIGN.md's intended atmosphere
5. Flag issues: cramped areas, orphan text, broken layouts, invisible elements

**Layer 3 — Compliance Check** (when MCP tools are available):
- If AccessLint MCP is configured: audit HTML for WCAG violations, auto-fix flagged issues
- If RenderLens MCP is configured: render + audit (Lighthouse + WCAG scores) + diff

Auto-fix any issues found. Re-verify after fixes.
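
A minimal sketch of what the Layer 1 structural check could automate with the standard library; the specific checks are illustrative examples, not the full checklist from `references/quality-checklist.md`:

```python
from html.parser import HTMLParser

class StructuralCheck(HTMLParser):
    """Counts headings, inputs, labels, and images missing alt text."""
    def __init__(self):
        super().__init__()
        self.h1_count = 0
        self.inputs = 0
        self.labels = 0
        self.images_missing_alt = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self.h1_count += 1
        elif tag == "input" and attrs.get("type") != "hidden":
            self.inputs += 1
        elif tag == "label":
            self.labels += 1
        elif tag == "img" and not attrs.get("alt"):
            self.images_missing_alt += 1

def audit(html: str) -> list:
    """Return a list of structural issues found in the mockup HTML."""
    checker = StructuralCheck()
    checker.feed(html)
    issues = []
    if checker.h1_count != 1:
        issues.append(f"expected exactly one h1, found {checker.h1_count}")
    if checker.labels < checker.inputs:
        issues.append("some inputs appear to lack labels")
    if checker.images_missing_alt:
        issues.append(f"{checker.images_missing_alt} image(s) missing alt text")
    return issues
```

A real implementation would cover the whole checklist (focus states, heading order, touch targets); the point is that Layer 1 needs no browser at all.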

## Phase 7: User Review

1. Open the mockup in a browser for the user:
   - Primary: use the Cursor browser tool (AI can see and discuss the same view)
   - Fallback: use the OS-appropriate command (`open` on macOS, `xdg-open` on Linux, `start` on Windows)
2. Present an assessment summary: structural check results, visual observations, compliance scores if available
3. Ask: "How does this look? What would you like me to change?"

## Phase 8: Iteration

1. Parse user feedback into specific changes
2. Apply targeted edits via StrReplace (not full regeneration unless the user requests a fundamentally different direction)
3. Re-run visual verification (Phase 6)
4. Present changes to the user
5. Repeat until the user approves

## Command Entry Points

These commands bypass the full workflow for targeted operations on existing mockups:

### /design-audit

Run quality checks on an existing mockup in `MOCKUPS_DIR/`.

1. Read the HTML file
2. Run the structural checklist from `references/quality-checklist.md`
3. If the browser tool is available: take a screenshot and run a visual check
4. If the AccessLint MCP is available: run a WCAG audit
5. Report findings with severity levels

### /design-polish

Final refinement pass on an existing mockup.

1. Read the HTML file and DESIGN.md
2. Check token usage (no hardcoded values that should be tokens)
3. Verify all interaction states are present
4. Refine spacing consistency and typography hierarchy
5. Apply micro-improvements (subtle shadows, transitions, hover states)

### /design-critique

UX review with specific feedback.

1. Read the HTML file
2. Evaluate: information hierarchy, call-to-action clarity, cognitive load, navigation flow
3. Check against anti-patterns from `references/anti-patterns.md`
4. Provide a structured critique with specific improvement suggestions

### /design-regen

Regenerate the mockup with a different design direction.

1. Keep the existing page structure and content
2. Ask the user what direction to change (atmosphere, colors, layout, typography)
3. Update DESIGN.md tokens accordingly
4. Regenerate the HTML with the new design system

## Optional MCP Enhancements

When configured, these MCP servers enhance the workflow:

| MCP Server | Phase | What It Adds |
|------------|-------|--------------|
| RenderLens | 6 | HTML→screenshot, Lighthouse audit, pixel-level diff |
| AccessLint | 6 | WCAG violation detection + auto-fix (99.5% fix rate) |
| Playwright | 6 | Screenshot at multiple viewports, visual regression |

The skill works fully without any MCP servers. MCPs are enhancements, not requirements.

## Escalation Rules

| Situation | Action |
|-----------|--------|
| Unclear design direction | **ASK user** — present direction options |
| Conflicting requirements (e.g., "minimal but feature-rich") | **ASK user** which to prioritize |
| User asks for a framework-specific output (React, Vue) | **WARN**: this skill generates HTML+CSS mockups; suggest adapting after approval |
| Generated mockup looks wrong in visual verification | Auto-fix if possible; **ASK user** if the issue is subjective |
| User requests a multi-page site | Generate one page at a time; maintain DESIGN.md consistency across pages |
| Accessibility audit fails | Auto-fix violations; **WARN user** about remaining manual-check items |

@@ -0,0 +1,69 @@
# Anti-Patterns — AI Slop Prevention

Read this file before generating any HTML/CSS. These are explicit instructions for what NOT to do.

## Typography Anti-Patterns

- **Do NOT default to Inter or Roboto.** These are the #1 signal of AI-generated UI. Choose a font that matches the atmosphere from `design-vocabulary.md`. Only use Inter/Roboto if the user explicitly requests them.
- **Do NOT use the same font weight everywhere.** Establish a clear weight hierarchy: 600-700 for headings, 400 for body, 500 for UI elements.
- **Do NOT set body text smaller than 14px (0.875rem).** Prefer 16px (1rem) for body.
- **Do NOT skip heading levels.** Go h1 → h2 → h3, never h1 → h3.
- **Do NOT use placeholder-only form fields.** Labels above inputs are mandatory; placeholders are hints only.

## Color Anti-Patterns

- **Do NOT default to purple or indigo accent colors.** Purple/indigo is the second-biggest AI-slop signal. Use the accent color from DESIGN.md tokens.
- **Do NOT use more than 1 strong accent color** in the same view. Secondary accents should be muted or derived from the primary.
- **Do NOT use gray text on colored backgrounds** without checking contrast. WCAG AA requires 4.5:1 for normal text, 3:1 for large text.
- **Do NOT use rainbow color coding** for categories. Limit to 5-6 carefully chosen, distinguishable colors.
- **Do NOT apply background gradients to text** (gradient text is fragile and often unreadable).
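
The WCAG AA contrast requirement above is mechanically checkable. A small sketch of the standard relative-luminance formula (hex colors without alpha assumed):

```python
def _channel(c: int) -> float:
    """Linearize one sRGB channel (0-255) per the WCAG formula."""
    c = c / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(hex_color: str) -> float:
    r, g, b = (int(hex_color.lstrip("#")[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _channel(r) + 0.7152 * _channel(g) + 0.0722 * _channel(b)

def contrast_ratio(fg: str, bg: str) -> float:
    """WCAG contrast ratio, from 1:1 (identical) to 21:1 (black on white)."""
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)
```

For example, mid-gray `#888888` on white falls below the 4.5:1 AA threshold for normal text.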

## Layout Anti-Patterns

- **Do NOT create "card soup"** — rows of identical cards with no visual break. Vary layout patterns: full-width sections, split layouts, featured items, asymmetric grids.
- **Do NOT center everything.** Left-align body text. Center only headings, short captions, and CTAs.
- **Do NOT use fixed pixel widths** for layout. Use relative units (%, fr, auto, minmax).
- **Do NOT nest excessive containers.** Avoid "div soup" — use semantic elements (nav, main, section, article, aside, footer).
- **Do NOT ignore mobile.** Design mobile-first; every component must work at 375px width.

## Component Anti-Patterns

- **Do NOT make all buttons equal weight.** Establish clear hierarchy: one primary (filled), secondary (outline), ghost (text-only) per visible area.
- **Do NOT use spinners for content with known layout.** Use skeleton loaders that match the shape of the content.
- **Do NOT put a modal inside a modal.** If you need nested interaction, use a slide-over or expand the current modal.
- **Do NOT disable buttons without explanation.** Every disabled button needs a title attribute or adjacent text explaining why.
- **Do NOT use "Click here" as link text.** Links should describe the destination: "View documentation", "Download report".
- **Do NOT show hamburger menus on desktop.** Hamburgers are for mobile only; use full navigation on desktop.
- **Do NOT use equal-weight buttons in a pair.** One must be visually primary, the other secondary.

## Interaction Anti-Patterns

- **Do NOT skip hover states on interactive elements.** Every clickable element needs a visible hover change.
- **Do NOT skip focus states.** Keyboard users need visible focus indicators on every interactive element.
- **Do NOT omit loading states.** If data loads asynchronously, show a skeleton or progress indicator.
- **Do NOT omit empty states.** When a list or section has no data, show an illustration + explanation + action CTA.
- **Do NOT omit error states.** Form validation errors need inline messages below the field with an icon.
- **Do NOT use bare alert() for messages.** Use toast notifications or inline banners.

## Decoration Anti-Patterns

- **Do NOT over-decorate.** Restraint over decoration. Every visual element must earn its place.
- **Do NOT apply shadows AND borders AND background fills simultaneously** on the same element. Pick one or two.
- **Do NOT use generic stock-photo placeholder images.** Use SVG illustrations, solid color blocks with icons, or real content.
- **Do NOT use decorative backgrounds** that reduce text readability.
- **Do NOT animate everything.** Use motion sparingly and purposefully: transitions for state changes (200-300ms), not decorative animation.

## Spacing Anti-Patterns

- **Do NOT use inconsistent spacing.** Stick to the spacing scale from DESIGN.md (multiples of 4px or 8px base unit).
- **Do NOT use zero padding inside containers.** Minimum 12-16px padding for any content container.
- **Do NOT crowd elements.** When in doubt, add more whitespace, not less.
- **Do NOT use different spacing systems** in different parts of the same page. One scale for the whole page.

## Accessibility Anti-Patterns

- **Do NOT rely on color alone** to convey information. Add icons, text, or patterns.
- **Do NOT use thin font weights (100-300) for body text.** Minimum 400 for readability.
- **Do NOT create custom controls** without proper ARIA attributes. Prefer native HTML elements.
- **Do NOT trap keyboard focus** outside of modals. Only modals should have focus traps.
- **Do NOT auto-play media** without user consent and a visible stop/mute control.
@@ -0,0 +1,307 @@
# Component Reference

Use this reference when generating UI mockups. Each component includes best practices, required states, and accessibility requirements.

## Navigation

### Top Navigation Bar
- Fixed or sticky at top; z-index above content
- Logo/brand left, primary nav center or right, actions (search, profile, CTA) far right
- Active state: underline, background highlight, or bold — pick one, be consistent
- Mobile: collapse to hamburger menu at `md` breakpoint; never show hamburger on desktop
- Height: 56-72px; padding inline 16-24px
- Aliases: navbar, header nav, app bar, top bar

### Sidebar Navigation
- Width: 240-280px expanded, 64-72px collapsed
- Sections with labels; icons + text for each item
- Active item: background fill + accent color text/icon
- Collapse/expand toggle; responsive: overlay on mobile
- Scroll independently from main content if taller than viewport
- Aliases: side nav, drawer, rail

### Breadcrumbs
- Show hierarchy path; separator: `/` or `>`
- Current page is plain text (not a link); parent pages are links
- Truncate with ellipsis if more than 4-5 levels
- Aliases: path indicator, navigation trail

### Tabs
- Use for switching between related content views within the same context
- Active tab: border-bottom accent or filled background
- Never nest tabs inside tabs
- Scrollable when too many to fit; show scroll indicators
- Aliases: tab bar, segmented control, view switcher

### Pagination
- Show current page, first, last, and 2-3 surrounding pages
- Previous/Next buttons always visible; disabled at boundaries
- Show total count when available: "Showing 1-20 of 342"
- Aliases: pager, page navigation
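
The page-window rule above (current page, first, last, 2-3 neighbours, gaps collapsed) can be sketched as a small helper; the ellipsis marker and the `around` default are illustrative choices:

```python
def page_window(current: int, total: int, around: int = 2) -> list:
    """Pages to render: first, last, `around` neighbours of current, '…' for gaps."""
    wanted = {1, total} | {p for p in range(current - around, current + around + 1)
                           if 1 <= p <= total}
    pages, prev = [], 0
    for p in sorted(wanted):
        if prev and p - prev > 1:
            pages.append("…")  # collapse the gap between non-adjacent pages
        pages.append(p)
        prev = p
    return pages
```

For example, page 5 of 10 renders as `1 … 3 4 5 6 7 … 10`.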
||||
## Content Display
|
||||
|
||||
### Card
|
||||
- Border-radius: 8-12px; subtle shadow or border (not both unless intentional)
|
||||
- Padding: 16-24px; consistent within the same card grid
|
||||
- Content order: image/visual → title → description → metadata → actions
|
||||
- Hover: subtle shadow lift or border-color change (not both)
|
||||
- Never stack more than 3 cards vertically without visual break
|
||||
- Aliases: tile, panel, content block
|
||||
|
||||
### Data Table
|
||||
- Header row: sticky, slightly bolder background, sort indicators
|
||||
- Row hover: subtle background change
|
||||
- Striped rows optional; alternate between base and surface colors
|
||||
- Cell padding: 12-16px vertical, 16px horizontal
|
||||
- Truncate long text with ellipsis + tooltip on hover
|
||||
- Responsive: horizontal scroll with frozen first column, or stack to card layout on mobile
|
||||
- Include empty state when no data
|
||||
- Aliases: grid, spreadsheet, list view
|
||||
|
||||
### List
|
||||
- Consistent item height or padding
|
||||
- Dividers between items: subtle border or spacing (not both)
|
||||
- Interactive lists: hover state on entire row
|
||||
- Leading element (icon/avatar) + content (title + subtitle) + trailing element (action/badge)
|
||||
- Aliases: item list, feed, timeline
|
||||
|
||||
### Stat/Metric Card
|
||||
- Large number/value prominently displayed
|
||||
- Label above or below the value; comparison/trend indicator optional
|
||||
- Color-code trends: green up, red down, gray neutral
|
||||
- Aliases: KPI card, metric tile, stat block
|
||||
|
||||
### Avatar
|
||||
- Circular; sizes: 24/32/40/48/64px
|
||||
- Fallback: initials on colored background when no image
|
||||
- Status indicator: small circle at bottom-right (green=online, gray=offline)
|
||||
- Group: overlap with z-index stacking; show "+N" for overflow
|
||||
- Aliases: profile picture, user icon
|
||||
|
||||
### Badge/Tag
|
||||
- Small, pill-shaped or rounded-rectangle
|
||||
- Color indicates category or status; limit to 5-6 distinct colors
|
||||
- Text: short (1-3 words); truncate if longer
|
||||
- Removable variant: include x button
|
||||
- Aliases: chip, label, status indicator
|
||||
|
||||
### Hero Section
|
||||
- Full-width; height 400-600px or viewport-relative
|
||||
- Strong headline (h1) + supporting text + primary CTA
|
||||
- Background: gradient, image with overlay, or solid color — not all three
|
||||
- Text must have sufficient contrast over any background
|
||||
- Aliases: banner, jumbotron, splash
|
||||
|
||||
### Empty State
|
||||
- Illustration or icon (not a generic placeholder)
|
||||
- Explanatory text: what this area will contain
|
||||
- Primary action CTA: "Create your first...", "Add...", "Import..."
|
||||
- Never show just blank space
|
||||
- Aliases: zero state, no data, blank slate
|
||||
|
||||
### Skeleton Loader
|
||||
- Match the shape and layout of the content being loaded
|
||||
- Animate with subtle pulse or shimmer (left-to-right gradient)
|
||||
- Show for predictable content; use progress bar for uploads/processes
|
||||
- Never use spinning loaders for content that has a known layout
|
||||
- Aliases: placeholder, loading state, content loader
|
||||
|
||||
## Forms & Input
|
||||
|
||||
### Text Input
|
||||
- Height: 40-48px; padding inline 12-16px
|
||||
- Label above the input (not placeholder-only); placeholder as hint only
|
||||
- States: default, hover, focus (accent ring), error (red border + message), disabled (reduced opacity)
|
||||
- Error message below the field with icon; don't use red placeholder
|
||||
- Aliases: text field, input box, form field
|
||||
|
||||
### Textarea
|
||||
- Minimum height: 80-120px; resizable vertically
|
||||
- Character count when there's a limit
|
||||
- Same states as text input
|
||||
- Aliases: multiline input, text area, comment box
|
||||
|
||||
### Select/Dropdown
|
||||
- Match text input height and styling
|
||||
- Chevron indicator on the right
|
||||
- Options list: max height with scroll; selected item checkmark
|
||||
- Search/filter for lists longer than 10 items
|
||||
- Aliases: combo box, picker, dropdown menu
|
||||
|
||||
### Checkbox
|
||||
- Size: 16-20px; rounded corners (2-4px)
|
||||
- Label to the right; clickable area includes the label
|
||||
- States: unchecked, checked (accent fill + white check), indeterminate (dash), disabled
|
||||
- Group: vertical stack with 8-12px gap
|
||||
- Aliases: check box, toggle option, multi-select
|
||||
|
||||
### Radio Button
|
||||
- Size: 16-20px; circular
|
||||
- Same interaction patterns as checkbox but single-select
|
||||
- Group: vertical stack; minimum 2 options
|
||||
- Aliases: radio, option button, single-select
|
||||
|
||||
### Toggle/Switch
|
||||
- Width: 40-52px; height: 20-28px; thumb is circular
|
||||
- Off: gray track; On: accent color track
|
||||
- Label to the left or right; describe the "on" state
|
||||
- Never use for actions that require a submit; toggles are instant
|
||||
- Aliases: switch, on/off toggle
|
||||
|
||||
### File Upload
|
||||
- Drop zone with dashed border; icon + "Drag & drop or click to upload"
|
||||
- Show file type restrictions and size limit
|
||||
- Progress indicator during upload
|
||||
- File list after upload: name, size, remove button
|
||||
- Aliases: file picker, upload area, attachment
|
||||
|
||||
### Form Layout
- Single column for most forms; two columns only for related short fields (first/last name, city/state)
- Group related fields with section headings
- Required field indicator: asterisk after label
- Submit button: right-aligned or full-width; clearly primary
- Inline validation: show errors on blur, not on every keystroke

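The layout rules above, sketched as a single-column form with one paired row. The markup and class names are illustrative:

```html
<form class="form">
  <h3>Account</h3>
  <div class="row-2col"><!-- two columns only for related short fields -->
    <label>First name <input type="text" required></label>
    <label>Last name <input type="text" required></label>
  </div>
  <label>Email <span aria-hidden="true">*</span><!-- required indicator -->
    <input type="email" required>
  </label>
  <button type="submit" class="btn-primary">Create account</button>
</form>
<style>
  .form { display: grid; gap: 16px; max-width: 480px; }
  .row-2col { display: grid; grid-template-columns: 1fr 1fr; gap: 16px; }
</style>
```
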
## Actions

### Button
- Primary: filled accent color, white text; one per visible area
- Secondary: outline or subtle background; supports primary action
- Ghost/tertiary: text-only with hover background
- Sizes: sm (32px), md (40px), lg (48px); padding inline 16-24px
- States: default, hover (darken/lighten 10%), active (darken 15%), focus (ring), disabled (opacity 0.5 + not-allowed cursor)
- Disabled buttons must have a title attribute explaining why
- Icon-only buttons: need aria-label; minimum 40px touch target
- Aliases: action, CTA, submit

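One way to express the three-tier hierarchy above in token-based CSS. The selectors are illustrative; tokens follow the DESIGN.md template:

```css
.btn {
  height: 40px;                  /* md size */
  padding-inline: 20px;
  border-radius: 6px;
  font-weight: 500;
  transition: background-color 150ms ease;
}
.btn-primary {
  background: var(--button-primary-bg);
  color: var(--button-primary-text);
}
.btn-primary:hover { background: var(--button-primary-hover); }
.btn-secondary {
  background: transparent;
  border: 1px solid var(--button-secondary-border);
  color: var(--button-secondary-text);
}
.btn-ghost { background: transparent; }
.btn-ghost:hover { background: var(--color-bg-secondary); }
.btn:focus-visible { outline: 2px solid var(--color-accent); outline-offset: 2px; }
.btn:disabled { opacity: 0.5; cursor: not-allowed; }
```
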
### Icon Button
- Circular or rounded-square; minimum 40px for touch targets
- Tooltip on hover showing the action name
- Visually lighter than text buttons
- Aliases: toolbar button, action icon

### Dropdown Menu
- Trigger: button or icon button
- Menu: elevated surface (shadow), rounded corners
- Items: 36-44px height; icon + label; hover background
- Dividers between groups; section labels for grouped items
- Keyboard navigable: arrow keys, enter to select, escape to close
- Aliases: context menu, action menu, overflow menu

### Floating Action Button (FAB)
- Circular, 56px; elevated with shadow
- One per screen maximum; bottom-right placement
- Primary creation action only
- Extended variant: pill-shape with icon + label
- Aliases: FAB, add button, create button

## Feedback

### Toast/Notification
- Position: top-right or bottom-right; stack vertically
- Auto-dismiss: 4-6 seconds for info; persist for errors until dismissed
- Types: success (green), error (red), warning (amber), info (blue)
- Content: icon + message + optional action link + close button
- Maximum 3 visible at once; queue the rest
- Aliases: snackbar, alert toast, flash message

### Alert/Banner
- Full-width within its container; not floating
- Types: info, success, warning, error with corresponding colors
- Icon left, message center, dismiss button right
- Persistent until user dismisses or condition changes
- Aliases: notice, inline alert, status banner

### Modal/Dialog
- Centered; overlay dims background (opacity 0.5 black)
- Max width: 480-640px for standard, 800px for complex
- Header (title + close button) + body + footer (actions)
- Actions: right-aligned; primary right, secondary left
- Close on overlay click and Escape key
- Never put a modal inside a modal
- Focus trap: tab cycles within modal while open
- Aliases: popup, dialog box, lightbox

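The overlay and header/body/footer structure above can be sketched as follows. Markup and class names are illustrative, and the focus-trap and Escape-key wiring are omitted:

```html
<div class="overlay"><!-- dims background -->
  <div role="dialog" aria-modal="true" aria-labelledby="dlg-title" class="modal">
    <header>
      <h2 id="dlg-title">Confirm delete</h2>
      <button aria-label="Close">&times;</button>
    </header>
    <div class="modal-body">This action cannot be undone.</div>
    <footer class="modal-actions">
      <button class="btn-secondary">Cancel</button><!-- secondary left -->
      <button class="btn-primary">Delete</button><!-- primary right -->
    </footer>
  </div>
</div>
<style>
  .overlay { position: fixed; inset: 0; background: rgba(0, 0, 0, 0.5);
             display: grid; place-items: center; }
  .modal { max-width: 560px; width: 100%; background: var(--color-bg-surface);
           border-radius: 12px; }
  .modal-actions { display: flex; justify-content: flex-end; gap: 8px; }
</style>
```
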
### Tooltip
- Appears on hover after 300-500ms delay; disappears on mouse leave
- Position: above element by default; flip if near viewport edge
- Max width: 200-280px; short text only
- Arrow/caret pointing to trigger element
- Aliases: hint, info popup, hover text

### Progress Indicator
- Linear bar: for known duration/percentage; show percentage text
- Skeleton: for content loading with known layout
- Spinner: only for indeterminate short waits (< 3 seconds) where layout is unknown
- Step indicator: for multi-step flows; show completed/current/upcoming
- Aliases: loading bar, progress bar, stepper

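A skeleton pulse matching the guidance above. A minimal sketch; the class name is illustrative and tokens follow the DESIGN.md template:

```css
.skeleton {
  background: var(--color-bg-secondary);
  border-radius: 6px;
  animation: pulse 1.5s ease-in-out infinite;
}
@keyframes pulse {
  0%, 100% { opacity: 1; }
  50%      { opacity: 0.4; }
}
```
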
## Layout

### Page Shell
- Max content width: 1200-1440px; centered with auto margins
- Sidebar + main content pattern: sidebar fixed, main scrolls
- Header/footer outside max-width for full-bleed effect
- Consistent padding: 16px mobile, 24px tablet, 32px desktop

### Grid
- CSS Grid or Flexbox; 12-column system or auto-fit with minmax
- Gap: 16-24px between items
- Responsive: 1 column mobile, 2 columns tablet, 3-4 columns desktop
- Never rely on fixed pixel widths; use fr units or percentages

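The auto-fit variant above as CSS. The minimum column width is illustrative:

```css
.card-grid {
  display: grid;
  gap: 24px;
  /* as many 280px-minimum columns as fit; no fixed pixel widths */
  grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
}
```
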
### Section Divider
- Use spacing (48-96px margin) as primary divider; use lines sparingly
- If using lines: subtle (1px, border color); full-width or indented
- Alternate section backgrounds (base/surface) for clear separation without lines

### Responsive Breakpoints
- sm: 640px (large phone landscape)
- md: 768px (tablet)
- lg: 1024px (small laptop)
- xl: 1280px (desktop)
- Design mobile-first: base styles are mobile, layer up with breakpoints

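Mobile-first layering with the breakpoints above looks like this sketch (selector and values are illustrative):

```css
/* base styles: mobile, single column */
.features { display: grid; gap: 16px; padding: 16px; }

@media (min-width: 768px) {   /* md: tablet */
  .features { grid-template-columns: repeat(2, 1fr); padding: 24px; }
}
@media (min-width: 1024px) {  /* lg: small laptop */
  .features { grid-template-columns: repeat(3, 1fr); padding: 32px; }
}
```
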
## Specialized

### Pricing Table
- 2-4 tiers side by side; highlight recommended tier
- Feature comparison with checkmarks; group features by category
- CTA button per tier; recommended tier has primary button, others secondary
- Monthly/annual toggle if applicable
- Aliases: pricing cards, plan comparison

### Testimonial
- Quote text (large, italic or with quotation marks)
- Attribution: avatar + name + title/company
- Layout: single featured or carousel/grid of multiple
- Aliases: review, customer quote, social proof

### Footer
- Full-width; darker background than body
- Column layout: links grouped by category; 3-5 columns
- Bottom row: copyright, legal links, social icons
- Responsive: columns stack on mobile
- Aliases: site footer, bottom navigation

### Search
- Input with search icon; expand on focus or always visible
- Results: dropdown with highlighted matching text
- Recent searches and suggestions
- Keyboard shortcut hint (Cmd+K / Ctrl+K)
- Aliases: search bar, omnibar, search field

### Date Picker
- Input that opens a calendar dropdown
- Navigate months with arrows; today highlighted
- Range selection: two calendars side by side
- Presets: "Today", "Last 7 days", "This month"
- Aliases: calendar picker, date selector

### Chart/Graph Placeholder
- Container with appropriate aspect ratio (16:9 for line/bar, 1:1 for pie)
- Include chart title, legend, and axis labels in the mockup
- Use representative fake data; label as "Sample Data"
- Tooltip placeholder on hover
- Aliases: data visualization, graph, analytics chart

@@ -0,0 +1,139 @@
# Design Vocabulary

Use this reference when writing DESIGN.md files and constructing generation prompts. Replace vague descriptors with specific, actionable terms.

## Atmosphere Descriptors

Use these instead of "clean and modern":

| Atmosphere | Characteristics | Font Direction | Color Direction | Spacing |
|------------|----------------|---------------|-----------------|---------|
| **Airy & Spacious** | Generous whitespace, light backgrounds, floating elements, subtle shadows | Thin/light weights, generous letter-spacing | Soft pastels, whites, muted accents | Large margins, open padding |
| **Dense & Data-Rich** | Compact spacing, information-heavy, efficient use of space | Medium weights, tighter line-heights, smaller sizes | Neutral grays, high-contrast data colors | Tight but consistent padding |
| **Warm & Approachable** | Rounded corners, friendly illustrations, organic shapes | Rounded/humanist typefaces, comfortable sizes | Earth tones, warm neutrals, amber/coral accents | Medium spacing, generous touch targets |
| **Sharp & Technical** | Crisp edges, precise alignment, monospace elements, dark themes | Geometric or monospace, precise sizing | Cool grays, electric blues/greens, dark backgrounds | Grid-strict, mathematical spacing |
| **Luxurious & Premium** | Generous space, refined details, serif accents, subtle animations | Serif or elegant sans-serif, generous sizing | Deep darks, gold/champagne accents, rich jewel tones | Expansive whitespace, dramatic padding |
| **Playful & Creative** | Asymmetric layouts, bold colors, hand-drawn elements, motion | Display fonts, variable weights, expressive sizing | Bright saturated colors, unexpected combinations | Dynamic, deliberately uneven |
| **Corporate & Enterprise** | Structured grids, predictable patterns, dense but organized | System fonts or conservative sans-serif | Brand blues/grays, accent for status indicators | Systematic, spec-driven |
| **Editorial & Content** | Typography-forward, reading-focused, long-form layout | Serif for body text, sans for UI elements | Near-monochrome, sparse accent color | Generous line-height, wide columns |

## Style-Specific Vocabulary

### When user says... → Use these terms in DESIGN.md

| Vague Input | Professional Translation |
|-------------|------------------------|
| "clean" | Restrained palette, generous whitespace, consistent alignment grid |
| "modern" | Current design patterns (2024-2026), subtle depth, micro-interactions |
| "minimal" | Single accent color, maximum negative space, typography-driven hierarchy |
| "professional" | Structured grid, conservative palette, system fonts, clear navigation |
| "fun" | Saturated palette, rounded elements, playful illustrations, motion |
| "elegant" | Serif typography, muted palette, generous spacing, refined details |
| "techy" | Dark theme, monospace accents, neon highlights, sharp corners |
| "bold" | High contrast, large type, strong color blocks, dramatic layout |
| "friendly" | Rounded corners (12-16px), humanist fonts, warm colors, illustrations |
| "corporate" | Blue-gray palette, structured grid, conventional layout, data tables |

## Color Mood Palettes

### Cool Blues & Grays
- Background: #f8fafc → #f1f5f9
- Surface: #ffffff
- Text: #0f172a → #475569
- Accent: #2563eb (blue-600)
- Pairs well with: Airy, Sharp, Corporate atmospheres

### Warm Earth Tones
- Background: #faf8f5 → #f5f0eb
- Surface: #ffffff
- Text: #292524 → #78716c
- Accent: #c2410c (orange-700) or #b45309 (amber-700)
- Pairs well with: Warm, Editorial atmospheres

### Bold & Vibrant
- Background: #fafafa → #f5f5f5
- Surface: #ffffff
- Text: #171717 → #525252
- Accent: #dc2626 (red-600) or #7c3aed (violet-600) or #059669 (emerald-600)
- Pairs well with: Playful, Creative atmospheres

### Monochrome
- Background: #fafafa → #f5f5f5
- Surface: #ffffff
- Text: #171717 → #737373
- Accent: #171717 (black) with #e5e5e5 borders
- Pairs well with: Minimal, Luxurious, Editorial atmospheres

### Dark Mode
- Background: #09090b → #18181b
- Surface: #27272a → #3f3f46
- Text: #fafafa → #a1a1aa
- Accent: #3b82f6 (blue-500) or #22d3ee (cyan-400)
- Pairs well with: Sharp, Technical, Dense atmospheres

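Any of these palettes drops straight into semantic tokens. For example, Cool Blues & Grays as CSS custom properties (token names follow the DESIGN.md template):

```css
:root {
  --color-bg-primary: #f8fafc;
  --color-bg-secondary: #f1f5f9;
  --color-bg-surface: #ffffff;
  --color-text-primary: #0f172a;
  --color-text-secondary: #475569;
  --color-accent: #2563eb; /* blue-600 */
}
```
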
## Typography Mood Mapping

### Geometric (Modern, Clean)
Fonts: DM Sans, Plus Jakarta Sans, Outfit, General Sans, Satoshi
- Characteristics: even stroke weight, circular letter forms, precise geometry
- Best for: SaaS, tech products, dashboards, landing pages

### Humanist (Friendly, Readable)
Fonts: Source Sans 3, Nunito, Lato, Open Sans, Noto Sans
- Characteristics: organic curves, varying stroke, warm feel
- Best for: consumer apps, health/wellness, education, community platforms

### Monospace (Technical, Code-Like)
Fonts: JetBrains Mono, Fira Code, IBM Plex Mono, Space Mono
- Characteristics: fixed-width, technical aesthetic, raw precision
- Best for: developer tools, terminals, data displays, documentation

### Serif (Editorial, Premium)
Fonts: Playfair Display, Lora, Merriweather, Crimson Pro, Libre Baskerville
- Characteristics: traditional elegance, reading comfort, authority
- Best for: blogs, magazines, luxury brands, portfolio sites

### Display (Expressive, Bold)
Fonts: Cabinet Grotesk, Clash Display, Archivo Black, Space Grotesk
- Characteristics: high impact, personality-driven, attention-grabbing
- Best for: hero sections, headlines, creative portfolios, marketing pages
- Use for headings only; pair with a readable body font

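A display/body pairing from the lists above, loaded via Google Fonts. The specific pairing and weights are illustrative:

```html
<link rel="preconnect" href="https://fonts.googleapis.com">
<link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@500;700&family=Source+Sans+3:wght@400;600&display=swap"
      rel="stylesheet">
<style>
  h1, h2, h3 { font-family: "Space Grotesk", system-ui, sans-serif; } /* display: headings only */
  body       { font-family: "Source Sans 3", system-ui, sans-serif; } /* humanist body */
</style>
```
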
## Shape & Depth Vocabulary

### Border Radius Scale
| Term | Value | Use for |
|------|-------|---------|
| Sharp | 0-2px | Technical, enterprise, data-heavy |
| Subtle | 4-6px | Professional, balanced |
| Rounded | 8-12px | Friendly, modern SaaS |
| Pill | 16-24px or full | Playful, badges, tags |
| Circle | 50% | Avatars, icon buttons |

### Shadow Scale
| Term | Value | Use for |
|------|-------|---------|
| None | none | Flat design, minimal |
| Whisper | 0 1px 2px rgba(0,0,0,0.05) | Subtle elevation, cards |
| Soft | 0 4px 6px rgba(0,0,0,0.07) | Standard cards, dropdowns |
| Medium | 0 10px 15px rgba(0,0,0,0.1) | Elevated elements, modals |
| Strong | 0 20px 25px rgba(0,0,0,0.15) | Floating elements, popovers |

### Surface Hierarchy
1. **Background** — deepest layer, covers viewport
2. **Surface** — content containers (cards, panels) sitting on background
3. **Elevated** — elements above surface (modals, dropdowns, tooltips)
4. **Overlay** — dimming layer between surface and elevated elements

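The radius and shadow scales above, expressed as custom properties. The token names are illustrative:

```css
:root {
  --radius-subtle: 6px;
  --radius-rounded: 10px;
  --radius-pill: 9999px;

  --shadow-whisper: 0 1px 2px rgba(0, 0, 0, 0.05);
  --shadow-soft: 0 4px 6px rgba(0, 0, 0, 0.07);
  --shadow-medium: 0 10px 15px rgba(0, 0, 0, 0.1);
  --shadow-strong: 0 20px 25px rgba(0, 0, 0, 0.15);
}
```
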
## Layout Pattern Names

| Pattern | Description | Best for |
|---------|-------------|----------|
| **Holy grail** | Header + sidebar + main + footer | Admin dashboards, apps |
| **Magazine** | Multi-column with varied widths | Content sites, blogs |
| **Single column** | Centered narrow content | Landing pages, articles, forms |
| **Split screen** | Two equal or 60/40 halves | Comparison pages, sign-up flows |
| **Card grid** | Uniform grid of cards | Product listings, portfolios |
| **Asymmetric** | Deliberately unequal columns | Creative, editorial layouts |
| **Full bleed** | Edge-to-edge sections, no max-width | Marketing pages, portfolios |
| **Dashboard** | Stat cards + charts + tables in grid | Analytics, admin panels |

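The holy grail pattern maps cleanly to CSS Grid template areas. A sketch; the sidebar width is illustrative:

```css
.shell {
  display: grid;
  min-height: 100vh;
  grid-template-areas:
    "header header"
    "sidebar main"
    "footer footer";
  grid-template-columns: 240px 1fr;
  grid-template-rows: auto 1fr auto;
}
header  { grid-area: header; }
aside   { grid-area: sidebar; }
main    { grid-area: main; }
footer  { grid-area: footer; }
```
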
@@ -0,0 +1,109 @@
# Quality Checklist

Run through this checklist after generating or modifying a mockup. Three layers; run all that apply.

## Layer 1: Structural Check (Always Run)

### Semantic HTML
- [ ] Uses `nav`, `main`, `section`, `article`, `aside`, `footer` — not just `div`
- [ ] Single `h1` per page
- [ ] Heading hierarchy follows h1 → h2 → h3 without skipping levels
- [ ] Lists use `ul`/`ol`/`li`, not styled `div`s
- [ ] Interactive elements are `button` or `a`, not clickable `div`s

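A page skeleton that passes the checks above (content is placeholder):

```html
<body>
  <nav aria-label="Primary">…</nav>
  <main>
    <h1>Page title</h1><!-- single h1 -->
    <section>
      <h2>Section title</h2><!-- no skipped heading levels -->
      <ul><li>Item</li></ul><!-- real list markup -->
      <button type="button">Action</button><!-- not a clickable div -->
    </section>
  </main>
  <footer>…</footer>
</body>
```
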
### Design Tokens
- [ ] CSS custom properties defined in `<style>` block
- [ ] Colors in HTML reference tokens (e.g., `var(--color-accent)`) not raw hex
- [ ] Spacing follows the defined scale, not arbitrary pixel values
- [ ] Font family matches DESIGN.md, not browser default or Inter/Roboto

### Responsive Design
- [ ] Mobile-first: base styles work at 375px
- [ ] Content readable without horizontal scroll at all breakpoints
- [ ] Navigation adapts: full nav on desktop, collapsed on mobile
- [ ] Images/media have max-width: 100%
- [ ] Touch targets minimum 44px on mobile

### Interaction States
- [ ] All buttons have hover, focus, active states
- [ ] All links have hover and focus states
- [ ] At least one loading state example (skeleton loader preferred)
- [ ] At least one empty state with illustration + CTA
- [ ] Disabled elements have visual indicator + explanation (title attribute)
- [ ] Form inputs have focus ring using accent color

### Component Quality
- [ ] Button hierarchy: one primary per visible area, secondary and ghost variants present
- [ ] Forms: labels above inputs, not placeholder-only
- [ ] Error states: inline message below field with icon
- [ ] No hamburger menu on desktop
- [ ] No modal inside modal
- [ ] No "Click here" links

### Code Quality
- [ ] Valid HTML (no unclosed tags, no duplicate IDs)
- [ ] Tailwind classes are valid (no made-up utilities)
- [ ] No inline styles that duplicate token values
- [ ] File is self-contained (single HTML file, no external dependencies except Tailwind CDN)
- [ ] Total file size under 50KB

## Layer 2: Visual Check (When Browser Tool Available)

Take a screenshot and examine:

### Spacing & Alignment
- [ ] Consistent margins between sections
- [ ] Elements within the same row are vertically aligned
- [ ] Padding within cards/containers is consistent
- [ ] No orphan text (single word on its own line in headings)
- [ ] Grid alignment: elements on the same row have matching heights or intentional variation

### Typography
- [ ] Heading sizes create clear hierarchy (visible difference between h1, h2, h3)
- [ ] Body text is a comfortable reading size (not tiny)
- [ ] Font rendering looks correct (font loaded or appropriate fallback)
- [ ] Line length: body text 50-75 characters per line

### Color & Contrast
- [ ] Primary accent is visible but not overwhelming
- [ ] Text is readable over all backgrounds
- [ ] No elements blend into their backgrounds
- [ ] Status colors (success/error/warning) are distinguishable

### Overall Composition
- [ ] Visual weight is balanced (not all content on one side)
- [ ] Clear focal point on the page (hero, headline, or primary CTA)
- [ ] Appropriate whitespace: not cramped, not excessively empty
- [ ] Consistent visual language throughout the page

### Atmosphere Match
- [ ] Overall feel matches the DESIGN.md atmosphere description
- [ ] No generic "AI generated" look
- [ ] Color palette is cohesive (no unexpected color outliers)
- [ ] Typography choice matches the intended mood

## Layer 3: Compliance Check (When MCP Tools Available)

### AccessLint MCP
- [ ] Run `audit_html` on the generated file
- [ ] Fix all violations with fixability "fixable" or "potentially_fixable"
- [ ] Document any remaining violations that require manual judgment
- [ ] Re-run `diff_html` to confirm fixes resolved violations

### RenderLens MCP
- [ ] Render at 1440px and 375px widths
- [ ] Lighthouse accessibility score ≥ 80
- [ ] Lighthouse performance score ≥ 70
- [ ] Lighthouse best practices score ≥ 80
- [ ] If iterating: run diff between previous and current version

## Severity Classification

When reporting issues found during the checklist:

| Severity | Criteria | Action |
|----------|----------|--------|
| **Critical** | Broken layout, invisible content, no mobile support | Fix immediately before showing to user |
| **High** | Missing interaction states, accessibility violations, token misuse | Fix before showing to user |
| **Medium** | Minor spacing inconsistency, non-ideal font weight, slight alignment issue | Note in assessment, fix if easy |
| **Low** | Style preference, minor polish opportunity | Note in assessment, fix during /design-polish |

@@ -0,0 +1,199 @@
# Design System: [Project Name]

## 1. Visual Atmosphere

[Describe the mood, density, and aesthetic philosophy in 2-3 sentences. Be specific — never use "clean and modern". Reference the atmosphere type from design-vocabulary.md. Example: "A spacious, light-filled interface with generous whitespace that feels calm and unhurried. Elements float on a near-white canvas with subtle shadows providing depth. The overall impression is sophisticated simplicity — premium without being cold."]

## 2. Color System

### Primitives

```css
:root {
  --white: #ffffff;
  --black: #000000;

  --gray-50: #______;
  --gray-100: #______;
  --gray-200: #______;
  --gray-300: #______;
  --gray-400: #______;
  --gray-500: #______;
  --gray-600: #______;
  --gray-700: #______;
  --gray-800: #______;
  --gray-900: #______;
  --gray-950: #______;

  --accent-50: #______;
  --accent-100: #______;
  --accent-200: #______;
  --accent-300: #______;
  --accent-400: #______;
  --accent-500: #______;
  --accent-600: #______;
  --accent-700: #______;
  --accent-800: #______;
  --accent-900: #______;

  --red-500: #______;
  --red-600: #______;
  --green-500: #______;
  --green-600: #______;
  --amber-500: #______;
  --amber-600: #______;
}
```

### Semantic Tokens

```css
:root {
  --color-bg-primary: var(--gray-50);
  --color-bg-secondary: var(--gray-100);
  --color-bg-surface: var(--white);
  --color-bg-inverse: var(--gray-900);

  --color-text-primary: var(--gray-900);
  --color-text-secondary: var(--gray-500);
  --color-text-tertiary: var(--gray-400);
  --color-text-inverse: var(--white);
  --color-text-link: var(--accent-600);

  --color-accent: var(--accent-600);
  --color-accent-hover: var(--accent-700);
  --color-accent-light: var(--accent-50);

  --color-border: var(--gray-200);
  --color-border-strong: var(--gray-300);
  --color-divider: var(--gray-100);

  --color-error: var(--red-600);
  --color-error-light: var(--red-500);
  --color-success: var(--green-600);
  --color-success-light: var(--green-500);
  --color-warning: var(--amber-600);
  --color-warning-light: var(--amber-500);
}
```

### Component Tokens

```css
:root {
  --button-primary-bg: var(--color-accent);
  --button-primary-text: var(--color-text-inverse);
  --button-primary-hover: var(--color-accent-hover);
  --button-secondary-bg: transparent;
  --button-secondary-border: var(--color-border-strong);
  --button-secondary-text: var(--color-text-primary);

  --card-bg: var(--color-bg-surface);
  --card-border: var(--color-border);
  --card-shadow: 0 1px 3px rgba(0, 0, 0, 0.08);

  --input-bg: var(--color-bg-surface);
  --input-border: var(--color-border);
  --input-border-focus: var(--color-accent);
  --input-text: var(--color-text-primary);
  --input-placeholder: var(--color-text-tertiary);

  --nav-bg: var(--color-bg-surface);
  --nav-active-bg: var(--color-accent-light);
  --nav-active-text: var(--color-accent);
}
```

## 3. Typography

- **Font family**: [Specific font name], [fallback], system-ui, sans-serif
- **Font source**: Google Fonts link or system font

| Level | Element | Size | Weight | Line Height | Letter Spacing |
|-------|---------|------|--------|-------------|----------------|
| Display | Hero headlines | 3rem (48px) | 700 | 1.1 | -0.02em |
| H1 | Page title | 2.25rem (36px) | 700 | 1.2 | -0.01em |
| H2 | Section title | 1.5rem (24px) | 600 | 1.3 | 0 |
| H3 | Subsection | 1.25rem (20px) | 600 | 1.4 | 0 |
| H4 | Card/group title | 1.125rem (18px) | 600 | 1.4 | 0 |
| Body | Default text | 1rem (16px) | 400 | 1.5 | 0 |
| Small | Captions, meta | 0.875rem (14px) | 400 | 1.5 | 0.01em |
| XS | Labels, badges | 0.75rem (12px) | 500 | 1.4 | 0.02em |

## 4. Spacing & Layout

- **Base unit**: 4px (0.25rem)
- **Spacing scale**: 1 (4px), 2 (8px), 3 (12px), 4 (16px), 5 (20px), 6 (24px), 8 (32px), 10 (40px), 12 (48px), 16 (64px), 20 (80px), 24 (96px)
- **Content max-width**: [1200px / 1280px / 1440px]
- **Grid**: [12-column / auto-fit] with [16px / 24px] gap

| Breakpoint | Name | Min Width | Columns | Padding |
|------------|------|-----------|---------|---------|
| Mobile | sm | 0 | 1 | 16px |
| Tablet | md | 768px | 2 | 24px |
| Laptop | lg | 1024px | 3-4 | 32px |
| Desktop | xl | 1280px | 4+ | 32px |

## 5. Component Styling Defaults

### Buttons
- Border radius: [6px / 8px / full]
- Padding: 10px 20px (md), 8px 16px (sm), 12px 24px (lg)
- Font weight: 500
- Transition: background-color 150ms ease, box-shadow 150ms ease
- Focus: 2px ring with 2px offset using `--color-accent`
- Disabled: opacity 0.5, cursor not-allowed

### Cards
- Border radius: [8px / 12px]
- Border: 1px solid var(--card-border)
- Shadow: var(--card-shadow)
- Padding: 20-24px
- Hover (if interactive): shadow increase or border-color change

### Inputs
- Height: 40px (md), 36px (sm), 48px (lg)
- Border radius: 6px
- Border: 1px solid var(--input-border)
- Padding: 0 12px
- Focus: border-color var(--input-border-focus) + 2px ring
- Error: border-color var(--color-error) + error message below

### Navigation
- Item height: 40px
- Active: background var(--nav-active-bg), text var(--nav-active-text)
- Hover: background var(--color-bg-secondary)
- Transition: background-color 150ms ease

## 6. Interaction States (MANDATORY)

### Loading
- Use skeleton loaders matching content shape
- Pulse animation: opacity 0.4 → 1.0, duration 1.5s, ease-in-out
- Background: var(--color-bg-secondary)

### Error
- Inline message below the element
- Icon (circle-exclamation) + red text using var(--color-error)
- Border change on the input/container to var(--color-error)

### Empty
- Centered illustration or icon (64-96px)
- Heading: "No [items] yet" or similar
- Descriptive text: one sentence explaining what will appear
- Primary CTA button: "Create first...", "Add...", "Import..."

### Hover
- Interactive elements: subtle background shift or underline
- Cards: shadow increase or border-color change
- Transition: 150ms ease

### Focus
- Visible ring: 2px solid var(--color-accent), 2px offset
- Applied to all interactive elements (buttons, inputs, links, tabs)
- Never remove the outline without providing an alternative focus indicator

### Disabled
- Opacity: 0.5
- Cursor: not-allowed
- Title attribute explaining why the element is disabled