Refine coding standards and testing guidelines. Updated coderule.mdc to emphasize readability, meaningful comments, and scope discipline. Adjusted testing.mdc to set a 75% coverage threshold for business logic and clarified test data requirements. Enhanced tracker.mdc with a mechanism for handling Jira connection issues and added completeness audit steps in research skills.

Oleksandr Bezdieniezhnykh
2026-04-17 20:28:48 +03:00
parent 57ff6dcd22
commit 0b3bb2fc55
17 changed files with 275 additions and 90 deletions
+10
@@ -0,0 +1,10 @@
---
description: Rules for installation and provisioning scripts
globs: scripts/**/*.sh
alwaysApply: false
---
# Automation Scripts
- Automate everything that can be automated. If a dependency can be downloaded and installed, do it automatically — never require the user to manually download and set up prerequisites.
- Use sensible defaults for paths and configuration (e.g. `/opt/` for system-wide tools). Allow overrides via environment variables for users who need non-standard locations.
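A minimal sketch of the default-plus-override pattern (the tool name and path are hypothetical):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sensible system-wide default; overridable via environment variable
INSTALL_DIR="${INSTALL_DIR:-/opt/mytool}"

echo "Installing to ${INSTALL_DIR}"
```

Running as `INSTALL_DIR=/usr/local/mytool ./install.sh` redirects the install without editing the script.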
+20 -11
@@ -1,17 +1,17 @@
---
description: "Enforces readable, environment-aware coding standards with scope discipline, meaningful comments, and test verification"
alwaysApply: true
---
# Coding preferences
- Prefer the simplest solution that satisfies all requirements, including maintainability. When in doubt between two approaches, choose the one with fewer moving parts — but never sacrifice correctness, error handling, or readability for brevity.
- Follow the Single Responsibility Principle — a class or method should have one reason to change:
  - If a method is hard to name precisely from the caller's perspective, its responsibility is misplaced. Vague names like "candidate", "data", or "item" are a signal — fix the design, not just the name.
  - Logic specific to a platform, variant, or environment belongs in the class that owns that variant, not in the general coordinator. Passing a dependency through is preferable to leaking variant-specific concepts into shared code.
  - Only use static methods for pure, self-contained computations (constants, simple math, stateless lookups). If a static method involves resource access, side effects, OS interaction, or logic that varies across subclasses or environments — use an instance method or factory class instead. Before implementing a non-trivial static method, ask the user.
- Avoid boilerplate and unnecessary indirection, but never sacrifice readability for brevity.
- Never suppress errors silently — no `2>/dev/null`, empty `catch` blocks, bare `except: pass`, or discarded error returns. These hide the information you need most when something breaks. If an error is truly safe to ignore, log it or comment why.
- Do not add comments that merely narrate what the code does. Comments are appropriate for: non-obvious business rules, workarounds with references to issues/bugs, safety invariants, and public API contracts. Keep comments as short as possible. Exception: every test must use the Arrange / Act / Assert pattern with language-appropriate comment syntax (`# Arrange` for Python, `// Arrange` for C#/Rust/JS/TS). Omit any section that is not needed (e.g. if there is no setup, skip Arrange; if act and assert are the same line, keep only Assert).
- Do not add verbose debug/trace logs by default. Log exceptions, security events (auth failures, permission denials), and business-critical state transitions. Add debug-level logging only when asked.
- Do not add code annotations unless asked specifically.
- Write code that accounts for the different environments: development, production.
- Only make changes that are requested, or that you are confident are well understood and directly related to the requested change.
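The Arrange / Act / Assert convention above can be sketched in Python (the function under test is hypothetical):

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test."""
    return round(price * (1 - percent / 100), 2)


def test_apply_discount_regular_price_reduced():
    # Arrange
    price = 100.0

    # Act
    result = apply_discount(price, 15)

    # Assert
    assert result == 85.0
```

When the act and assert collapse to one line, only the `# Assert` section remains, per the rule.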
@@ -22,16 +22,25 @@ alwaysApply: true
- When a test fails due to a missing dependency, install it — do not fake or stub the module system. For normal packages, add them to the project's dependency file (requirements-test.txt, package.json devDependencies, test csproj, etc.) and install. Only consider stubbing if the dependency is heavy (e.g. hardware-specific SDK, large native toolchain) — and even then, ask the user first before choosing to stub.
- Do not solve environment or infrastructure problems (dependency resolution, import paths, service discovery, connection config) by hardcoding workarounds in source code. Fix them at the environment/configuration level.
- Before writing new infrastructure or workaround code, check how the existing codebase already handles the same concern. Follow established project patterns.
- If a file, class, or function has no remaining usages — delete it. Dead code rots: its dependencies drift, it misleads readers, and it breaks when the code it depends on evolves. However, before deletion verify that the symbol is not used via any of the following. If any applies, do NOT delete — leave it or ASK the user:
  - Public API surface exported from the package and potentially consumed outside the workspace (see `workspace-boundary.mdc`)
  - Reflection, dependency injection, or service registration (scan DI container registrations, `appsettings.json` / equivalent config, attribute-based discovery, plugin manifests)
  - Dynamic dispatch from config/data (YAML/JSON references, string-based class lookups, route tables, command dispatchers)
  - Test fixtures used only by currently-skipped tests — temporary skips may become active again
  - Cross-repo references — if this workspace is part of a multi-repo system, grep sibling repos for shared contracts before deleting
- **Scope discipline**: focus edits on the task scope. The "scope" is:
  - Files the task explicitly names
  - Files that define interfaces the task changes
  - Files that directly call, implement, or test the changed code
- **Adjacent hygiene is permitted** without asking: fixing imports you caused to break, updating obvious stale references within a file you already modify, deleting code that became dead because of your change.
- **Unrelated issues elsewhere**: do not silently fix them as part of this task. Either note them to the user at end of turn and ASK before expanding scope, or record them in `_docs/_process_leftovers/` for later handling.
- Always think about what other methods and areas of code might be affected by the code changes, and surface the list to the user before modifying.
- When you think you are done with changes, run the full test suite. Every failure in tests that cover code you modified, or that depend on code you modified, is a **blocking gate**. For pre-existing failures in unrelated areas, report them to the user but do not block on them. Never silently ignore or skip a failure without reporting it. On any blocking failure, stop and ask the user to choose one of:
  - **Investigate and fix** the failing test or source code
  - **Remove the test** if it is obsolete or no longer relevant
- Do not rename any databases, tables, or table columns without confirmation. Avoid such renaming if possible.
- Do not commit binaries: create and maintain `.gitignore`, and delete any binaries produced while working on the task.
- Never force-push to main or dev branches.
- For new projects, place source code under `src/` (this works for all stacks including .NET). For existing projects, follow the established directory structure. Keep project-level config, tests, and tooling at the repo root.
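Static "find usages" misses the dynamic cases listed above. One pre-deletion probe for config/data references can be sketched like this (the file extensions and helper name are illustrative, and an empty result does not prove safety — reflection and cross-repo uses remain):

```python
import re
from pathlib import Path

CONFIG_SUFFIXES = {".json", ".yaml", ".yml", ".toml", ".xml"}


def dynamic_references(symbol: str, root: str = ".") -> list[str]:
    """Return config/data files under `root` that mention `symbol` by name."""
    pattern = re.compile(rf"\b{re.escape(symbol)}\b")
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in CONFIG_SUFFIXES:
            # Tolerate binary junk in config dirs rather than crash the scan
            if pattern.search(path.read_text(errors="ignore")):
                hits.append(str(path))
    return sorted(hits)
```

Treat a non-empty result as a hard stop: the symbol is dispatched from data, so leave it or ask the user.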
+14
@@ -23,3 +23,17 @@ globs: [".cursor/**"]
## Security
- All `.cursor/` files must be scanned for hidden Unicode before committing (see cursor-security.mdc)
## Quality Thresholds (canonical reference)
All rules and skills must reference the single source of truth below. Do NOT restate different numeric thresholds in individual rule or skill files.
| Concern | Threshold | Enforcement |
|---------|-----------|-------------|
| Test coverage on business logic | 75% | Aim (warn below); 100% on critical paths |
| Test scenario coverage (vs AC + restrictions) | 75% | Blocking in test-spec Phase 1 and Phase 3 |
| CI coverage gate | 75% | Fail build below |
| Lint errors (Critical/High) | 0 | Blocking pre-commit |
| Code-review auto-fix | Low + Medium (Style/Maint/Perf) + High (Style/Scope) | Critical and Security always escalate |
When a skill or rule needs to cite a threshold, link to this table instead of hardcoding a different number.
+3 -2
@@ -5,6 +5,7 @@ alwaysApply: true
# Git Workflow
- Work on the `dev` branch
- Commit message subject line format: `[TRACKER-ID-1] [TRACKER-ID-2] Summary of changes`
- Subject line must not exceed 72 characters (standard Git convention for the first line). The 72-char limit applies to the subject ONLY, not the full commit message.
- A commit message body is optional. Add one when the subject alone cannot convey the why of the change. Wrap the body at 72 chars per line.
- Do NOT push or merge unless the user explicitly asks you to. Always ask first if there is a need.
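The subject-line rules above can be sketched as a check (the helper name is illustrative; the tag pattern and 72-char limit follow the stated convention):

```python
import re


def valid_subject(subject: str) -> bool:
    """True when the subject starts with one or more [TRACKER-ID] tags and fits in 72 chars."""
    # At least one "[AZ-123] "-style tag, then a non-space summary
    has_ids = re.match(r"^(\[[A-Z]+-\d+\] )+\S", subject) is not None
    return has_ids and len(subject) <= 72
```

A pre-commit hook could run this against the first line of the message and reject the commit on `False`.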
+33 -11
@@ -4,21 +4,43 @@ alwaysApply: true
---
# Sound Notification on Human Input
## Sound commands per OS
Detect the OS from user system info or `uname -s`:
- **macOS**: `afplay /System/Library/Sounds/Glass.aiff &`
- **Linux**: `paplay /usr/share/sounds/freedesktop/stereo/bell.oga 2>/dev/null || aplay /usr/share/sounds/freedesktop/stereo/bell.oga 2>/dev/null || echo -e '\a' &`
- **Windows (PowerShell)**: `[System.Media.SystemSounds]::Exclamation.Play()`
## When to play (play exactly once per trigger)
Play the sound when your turn will end in one of these states:
1. You are about to call the AskQuestion tool — sound BEFORE the AskQuestion call
2. Your text ends with a direct question to the user that cannot be answered without their input (e.g., "Which option do you prefer?", "What is the database name?", "Confirm before I push?")
3. You are reporting that you are BLOCKED and cannot continue without user input (missing credentials, conflicting requirements, external approval required)
4. You have just completed a destructive or irreversible action the user asked to review (commit, push, deploy, data migration, file deletion)
## When NOT to play
- You are mid-execution and returning a progress update (the conversation is not stalling)
- You are answering a purely informational or factual question and no follow-up is required
- You have already played the sound once this turn for the same pause point
- Your response only contains text describing what you did or found, with no question, no block, no irreversible action
## "Trivial" definition
A response is trivial (no sound) when ALL of the following are true:
- No explicit question to the user
- No "I am blocked" report
- No destructive/irreversible action that needs review
If any one of those is present, the response is non-trivial — play the sound.
## Ordering
The sound command is a normal Shell tool call. Place it:
- **Immediately before an AskQuestion tool call** in the same message, or
- **As the last Shell call of the turn** if ending with a text-based question, block report, or post-destructive-action review
Do not play the sound as part of routine command execution — only at the pause points listed under "When to play".
+22 -10
@@ -5,7 +5,7 @@ alwaysApply: true
# Agent Meta Rules
## Execution Safety
- Run the full test suite automatically when you believe code changes are complete (as required by coderule.mdc). For other long-running/resource-heavy/security-risky operations (builds, Docker commands, deployments, performance tests), ask the user first — unless explicitly stated in a skill or the user already asked to do so.
## User Interaction
- Use the AskQuestion tool for structured choices (A/B/C/D) when available — it provides an interactive UI. Fall back to plain-text questions if the tool is unavailable.
@@ -33,18 +33,30 @@ When the user reacts negatively to generated code ("WTF", "what the hell", "why
- "Before writing new infrastructure or workaround code, check how the existing codebase already handles the same concern. Follow established project patterns." - "Before writing new infrastructure or workaround code, check how the existing codebase already handles the same concern. Follow established project patterns."
## Debugging Over Contemplation ## Debugging Over Contemplation
When the root cause of a bug is not clear after ~5 minutes of reasoning, analysis, and assumption-making — **stop speculating and add debugging logs**. Observe actual runtime behavior before forming another theory. The pattern to follow:
Agents cannot measure wall-clock time between turns. Use observable counts from your own transcript instead.
**Trigger: stop speculating and instrument.** When you've formed **3 or more distinct hypotheses** about a bug without confirming any against runtime evidence (logs, stderr, debugger state, actual test failure messages) — stop and add debugging output. Re-reading the same code hoping to "spot it this time" counts as a new hypothesis that still has zero evidence.
Steps:
1. Identify the last known-good boundary (e.g., "request enters handler") and the known-bad result (e.g., "callback never fires"). 1. Identify the last known-good boundary (e.g., "request enters handler") and the known-bad result (e.g., "callback never fires").
2. Add targeted `print(..., flush=True)` or log statements at each intermediate step to narrow the gap. 2. Add targeted `print(..., flush=True)`, `console.error`, or logger statements at each intermediate step to narrow the gap.
3. Read the output. Let evidence drive the next step — not inference chains built on unverified assumptions. 3. Run the instrumented code. Read the output. Let evidence drive the next hypothesis — not inference chains.
Prolonged mental contemplation without evidence is a time sink. A 15-minute instrumented run beats 45 minutes of "could it be X? but then Y... unless Z..." reasoning. An instrumented run producing real output beats any amount of "could it be X? but then Y..." reasoning.
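The three steps above can be sketched like this (the handler and parser are hypothetical stand-ins for the code under investigation):

```python
def parse_fields(raw: str) -> list[str]:
    """Hypothetical parsing step between the known-good boundary and the known-bad result."""
    return [f for f in raw.strip().split(",") if f]


def handle(raw: str) -> int:
    print(f"[dbg] entered handler: {raw!r}", flush=True)   # last known-good boundary
    fields = parse_fields(raw)
    print(f"[dbg] parsed fields: {fields!r}", flush=True)  # intermediate checkpoint
    return len(fields)                                     # known-bad result observed here
```

Each print narrows the gap: whichever checkpoint last shows sane data bounds where the bug lives.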
## Long Investigation Retrospective
Trigger a post-mortem when ANY of the following is true (all are observable in your own transcript):
- **10+ tool calls** were used to diagnose a single issue
- **Same file modified 3+ times** without tests going green
- **3+ distinct approaches** attempted before arriving at the fix
- Any phrase like "let me try X instead" appeared **more than twice**
- A fix was eventually found by reading docs/source the agent had dismissed earlier
Post-mortem steps:
1. **Identify the bottleneck**: wrong assumption? missing runtime visibility? incorrect mental model of a framework/language boundary? ignored evidence?
2. **Extract the general lesson**: what category of mistake was this? (e.g., "Python cannot call Cython `cdef` methods", "engine errors silently swallowed", "wrong layer to fix the problem")
3. **Propose a preventive rule**: short, actionable. Present to user for approval.
4. **Write it down**: add approved rule to the appropriate `.mdc` so it applies to future sessions.
+1 -1
@@ -8,7 +8,7 @@ globs: ["**/*test*", "**/*spec*", "**/*Test*", "**/tests/**", "**/test/**"]
- One assertion per test when practical; name tests descriptively: `MethodName_Scenario_ExpectedResult`
- Test boundary conditions, error paths, and happy paths
- Use mocks only for external dependencies; prefer real implementations for internal code
- Aim for 75%+ coverage on business logic; 100% on critical paths (code paths where a bug would cause data loss, security breaches, financial errors, or system outages — identify from acceptance criteria marked as must-have or from security_approach.md). The 75% threshold is canonical — see `cursor-meta.mdc` Quality Thresholds.
- Integration tests use real database (Postgres testcontainers or dedicated test DB)
- Never use `Thread.Sleep` or fixed delays in tests; use polling or async waits
- Keep test data factories/builders for reusable test setup
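The polling rule can be sketched as a helper (the name and defaults are illustrative):

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False
```

`assert wait_until(lambda: queue.empty())` then replaces a fixed `time.sleep(2)`: it returns as soon as the condition holds and fails fast, with a clear signal, when it never does.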
+36
@@ -12,3 +12,39 @@ alwaysApply: true
- Project name: AZAION
- All task IDs follow the format `AZ-<number>`
- Issue types: Epic, Story, Task, Bug, Subtask
## Tracker Availability Gate
- If Jira MCP returns **Unauthorized**, **errored**, **connection refused**, or any non-success response: **STOP** tracker operations and notify the user.
- The user must fix the Jira MCP connection before any further ticket creation/transition/query is attempted.
- Do NOT silently create local-only tasks, skip Jira steps, or pretend the write succeeded. The tracker is the source of truth — if a status transition is lost, the team loses visibility.
## Leftovers Mechanism (non-user-input blockers only)
When a **non-user** blocker prevents a tracker write (MCP down, network error, transient failure, ticket linkage recoverable later), record the deferred write in `_docs/_process_leftovers/<YYYY-MM-DD>_<topic>.md` and continue non-tracker work. Each entry must include:
- Timestamp (ISO 8601)
- What was blocked (ticket creation, status transition, comment, link)
- Full payload that would have been written (summary, description, story points, epic, target status) — so the write can be replayed later
- Reason for the blockage (MCP unavailable, auth expired, unknown epic ID pending user clarification, etc.)
### Hard gates that CANNOT be deferred to leftovers
Anything requiring user input MUST still block:
- Clarifications about requirements, scope, or priority
- Approval for destructive actions or irreversible changes
- Choice between alternatives (A/B/C decisions)
- Confirmation of assumptions that change task outcome
If a blocker of this kind appears, STOP and ASK — do not write to leftovers.
### Replay obligation
At the start of every `/autopilot` invocation, and before any new tracker write in any skill, check `_docs/_process_leftovers/` for pending entries. For each entry:
1. Attempt to replay the deferred write against the tracker
2. If replay succeeds → delete the leftover entry
3. If replay still fails → update the entry's timestamp and reason, continue
4. If the blocker now requires user input (e.g., MCP still down after N retries) → surface to the user
Autopilot must not progress past its own step 0 until all leftovers that CAN be replayed have been replayed.
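The replay loop can be sketched like this (`replay` is a stand-in for the actual tracker write; updating the timestamp/reason on a failed entry is elided):

```python
from pathlib import Path
from typing import Callable


def replay_leftovers(leftovers_dir: str, replay: Callable[[str], bool]) -> list[str]:
    """Attempt each deferred tracker write; delete entries that replay successfully.

    Returns the entries still blocked, so they can be retried or surfaced to the user.
    """
    still_blocked = []
    for entry in sorted(Path(leftovers_dir).glob("*.md")):
        if replay(entry.read_text()):
            entry.unlink()  # replay succeeded, so the leftover is no longer needed
        else:
            still_blocked.append(str(entry))
    return still_blocked
```

Because each entry stores the full payload, `replay` needs nothing beyond the file's own contents.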
+7
@@ -0,0 +1,7 @@
# Workspace Boundary
- Only modify files within the current repository (workspace root).
- Never write, edit, or delete files in sibling repositories or parent directories outside the workspace.
- When a task requires changes in another repository (e.g., admin API, flights, UI), **document** the required changes in the task's implementation notes or a dedicated cross-repo doc — do not implement them.
- The mock API at `e2e/mocks/mock_api/` may be updated to reflect the expected contract of external services, but this is a test mock — not the real implementation.
- If a task is entirely scoped to another repository, mark it as out-of-scope for this workspace and note the target repository.
+5
@@ -55,6 +55,11 @@ After selecting the flow, apply its detection rules (first match wins) to determ
Every invocation follows this sequence:
```
0. Process leftovers (see `.cursor/rules/tracker.mdc` → Leftovers Mechanism):
   - Read _docs/_process_leftovers/ if it exists
   - For each entry, attempt replay against the tracker
   - Delete successful replays, update failed ones with new timestamp + reason
   - If any leftover still blocked AND requires user input → STOP and ASK
1. Read _docs/_autopilot_state.md (if exists)
2. Read all File Index files above
3. Cross-check state file against _docs/ folder structure (rules in state.md)
+26 -13
@@ -28,7 +28,7 @@ The `implementer` agent is the specialist that writes all the code — it receiv
- **Integrated review**: `/code-review` skill runs automatically after each batch
- **Auto-start**: batches launch immediately — no user confirmation before a batch
- **Gate on failure**: user confirmation is required only when code review returns FAIL
- **Commit per batch**: after each batch is confirmed, commit. Ask the user whether to push to remote unless the user previously opted into auto-push for this session.
## Context Resolution
@@ -134,25 +134,38 @@ Only proceed to Step 9 when every AC has a corresponding test.
### 10. Auto-Fix Gate
Bounded auto-fix loop — only applies to **mechanical** findings. Critical and Security findings are never auto-fixed.
**Auto-fix eligibility matrix:**
| Severity | Category | Auto-fix? |
|----------|----------|-----------|
| Low | any | yes |
| Medium | Style, Maintainability, Performance | yes |
| Medium | Bug, Spec-Gap, Security | escalate |
| High | Style, Scope | yes |
| High | Bug, Spec-Gap, Performance, Maintainability | escalate |
| Critical | any | escalate |
| any | Security | escalate |
Flow:
1. If verdict is **PASS** or **PASS_WITH_WARNINGS**: show findings as info, continue to step 11
2. If verdict is **FAIL**:
   - Partition findings into auto-fix-eligible and escalate (using the matrix above)
   - For eligible findings, attempt fixes using location/description/suggestion, then re-run `/code-review` on modified files (max 2 rounds)
   - If all remaining findings are auto-fix-eligible and re-review now passes → continue to step 11
   - If any non-eligible finding exists at any point → stop auto-fixing, present the full list to the user (**BLOCKING**)
3. User must explicitly approve each non-auto-fix finding (accept, request manual fix, mark as out-of-scope) before proceeding.
Track `auto_fix_attempts` and `escalated_findings` in the batch report for retrospective analysis.
### 11. Commit (and optionally Push)
- After user confirms the batch (explicitly for FAIL, implicitly for PASS/PASS_WITH_WARNINGS):
  - `git add` all changed files from the batch
  - `git commit` with a message that includes ALL task IDs (tracker IDs or numeric prefixes) of tasks implemented in the batch, followed by a summary of what was implemented. Format: `[TASK-ID-1] [TASK-ID-2] ... Summary of changes`
  - Ask the user whether to push to remote, unless the user previously opted into auto-push for this session
### 12. Update Tracker Status → In Testing
+1 -1
@@ -119,7 +119,7 @@ Read and follow `steps/07_quality-checklist.md`.
|-----------|--------|
| Missing acceptance_criteria.md, restrictions.md, or input_data/ | **STOP** — planning cannot proceed |
| Ambiguous requirements | ASK user |
| Input data coverage below 75% | Search internet for supplementary data, ASK user to validate |
| Technology choice with multiple valid options | ASK user |
| Component naming | PROCEED, confirm at next BLOCKING gate |
| File structure within templates | PROCEED |
@@ -32,3 +32,17 @@
6. Applicable scenarios
7. Team capability requirements
8. Migration difficulty
## Decomposition Completeness Probes (Completeness Audit Reference)
Used during Step 1's Decomposition Completeness Audit. After generating sub-questions, check each probe against the current decomposition. If a probe reveals an uncovered area, add a sub-question for it.
| Probe | What it catches |
|-------|-----------------|
| **What does this cost — in money, time, resources, or trade-offs?** | Budget, pricing, licensing, tax, opportunity cost, maintenance burden |
| **What are the hard constraints — physical, legal, regulatory, environmental?** | Regulations, certifications, spectrum/frequency rules, export controls, physics limits, IP restrictions |
| **What are the dependencies and assumptions that could break?** | Supply chain, vendor lock-in, API stability, single points of failure, standards evolution |
| **What does the operating environment actually look like?** | Terrain, weather, connectivity, infrastructure, power, latency, user skill level |
| **What failure modes exist and what happens when they trigger?** | Degraded operation, fallback, safety margins, blast radius, recovery time |
| **What do practitioners who solved similar problems say matters most?** | Field-tested priorities that don't appear in specs or papers |
| **What changes over time — and what looks stable now but isn't?** | Technology roadmaps, regulatory shifts, deprecation risk, scaling effects |
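The probe walk-through above can be mechanized along these lines; the probe keys and data shapes are illustrative assumptions, not a prescribed schema:

```python
# Short keys standing in for the probes in the table above (illustrative).
PROBES = [
    "cost",              # money, time, resources, trade-offs
    "hard-constraints",  # physical, legal, regulatory, environmental
    "dependencies",      # assumptions that could break
    "environment",       # what the operating environment looks like
    "failure-modes",     # what happens when things trigger
    "practitioner-view", # field-tested priorities
    "change-over-time",  # what looks stable but isn't
]

def audit(coverage):
    """coverage maps probe -> 'covered' | 'n/a: <why>'; absent keys are gaps."""
    return {
        "covered": [p for p in PROBES if coverage.get(p) == "covered"],
        "justified": [p for p in PROBES
                      if str(coverage.get(p, "")).startswith("n/a")],
        "gaps": [p for p in PROBES if p not in coverage],  # each needs a sub-question
    }

result = audit({"cost": "covered", "environment": "covered",
                "practitioner-view": "n/a: no practitioner community for this niche"})
```

Every entry in `gaps` then gets a new sub-question (with search query variants) before research proceeds.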
@@ -10,6 +10,12 @@
- [ ] Every citation can be directly verified by the user (source verifiability)
- [ ] Structure hierarchy is clear; executives can quickly locate information
## Decomposition Completeness
- [ ] Domain discovery search executed: searched "key factors when [problem domain]" before starting research
- [ ] Completeness probes applied: every probe from `references/comparison-frameworks.md` checked against sub-questions
- [ ] No uncovered areas remain: all gaps filled with sub-questions or justified as not applicable
## Internet Search Depth

- [ ] Every sub-question was searched with at least 3-5 different query variants
@@ -97,6 +97,16 @@ When decomposing questions, you must explicitly define the **boundaries of the r
**Common mistake**: User asks about "university classroom issues" but sources include policies targeting "K-12 students" — mismatched target populations will invalidate the entire research.
#### Decomposition Completeness Audit (MANDATORY)
After generating sub-questions, verify the decomposition covers all major dimensions of the problem — not just the ones that came to mind first.
1. **Domain discovery search**: Search the web for "key factors when [problem domain]" / "what to consider when [problem domain]" (e.g., "key factors GPS-denied navigation", "what to consider when choosing an edge deployment strategy"). Extract dimensions that practitioners and domain experts consider important but are absent from the current sub-questions.
2. **Run completeness probes**: Walk through each probe in `references/comparison-frameworks.md` → "Decomposition Completeness Probes" against the current sub-question list. For each probe, note whether it is covered, not applicable (state why), or missing.
3. **Fill gaps**: Add sub-questions (with search query variants) for any uncovered area. Do this before proceeding to Step 2.
Record the audit result in `00_question_decomposition.md` as a "Completeness Audit" section.
**Save action**:
1. Read all files from INPUT_DIR to ground the research in the project context
2. Create working directory `RESEARCH_DIR/`
@@ -109,6 +119,7 @@ When decomposing questions, you must explicitly define the **boundaries of the r
- List of decomposed sub-questions
- **Chosen perspectives** (at least 3 from the Perspective Rotation table) with rationale
- **Search query variants** for each sub-question (at least 3-5 per sub-question)
- **Completeness audit** (completeness probes cross-reference + domain discovery results)
4. Write TodoWrite to track progress

---
+35 -21
@@ -102,32 +102,46 @@ After investigating, present:
- If user picks A → apply fixes, then re-run (loop back to step 2)
- If user picks B → return failure to the autopilot

**Any skipped test** — classify as legitimate or illegitimate before deciding whether to block.

#### Legitimate skips (accept and proceed)

The code path genuinely cannot execute on this runner. Acceptable reasons:

- Hardware not physically present (GPU, Apple Neural Engine, sensor, serial device)
- Operating system mismatch (Darwin-only test on Linux CI, Windows-only test on macOS)
- Feature-flag-gated test whose feature is intentionally disabled in this environment
- External service the project deliberately does not control (e.g., a third-party API with no sandbox, and the project has a documented contract test instead)

For legitimate skips: verify the skip condition is accurate (the test would run if the hardware/OS were present), verify it has a clear reason string, and proceed.

#### Illegitimate skips (BLOCKING — must resolve)

NOT acceptable reasons:

- Required service not running (database, message broker, downstream API we control) → fix: bring the service up, add a docker-compose dependency, or add a mock
- Missing test fixture, seed data, or sample file → fix: provide the data, generate it, or ASK the user for it
- Missing environment variable or credential → fix: add to `.env.example`, document, ASK user for the value
- Flaky-test quarantine with no tracking ticket → fix: create the ticket (or replay via leftovers if tracker is down)
- Inherited skip from a prior refactor that was never cleaned up → fix: clean it up now
- Test ordering mutates shared state → fix: isolate the state

**Rule of thumb**: if the reason for skipping is "we didn't set something up," that's not a valid skip — set it up. If the reason is "this hardware/OS isn't here," that's valid.

#### Resolution steps for illegitimate skips

1. Classify the skip (read the skip reason and test body)
2. If the fix is **mechanical** — start a container, install a dep, add a mock, reorder fixtures — attempt it automatically and re-run
3. If the fix requires **user input** — credentials, sample data, a business decision — BLOCK and ASK
4. Never silently mark the skip as "accepted" — every illegitimate skip must either be fixed or escalated
5. Removal is a last resort and requires explicit user approval with documented reasoning

#### Categorization cheatsheet

- **explicit skip (e.g. `@pytest.mark.skip`)**: check whether the reason in the decorator is still valid
- **conditional skip (e.g. `@pytest.mark.skipif`)**: check whether the condition is accurate and whether we can change the environment to make it false
- **runtime skip (e.g. `pytest.skip()` in body)**: check why the condition fires — often an ordering or environment bug
- **missing fixture/data**: treated as illegitimate unless user confirms the data is unavailable
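The triage rule above can be sketched as a small classifier; the category names and `SkipFinding` shape are illustrative assumptions mirroring the two lists, not a real pytest API:

```python
from dataclasses import dataclass

# Categories mirror the legitimate/illegitimate lists above (names are ours).
LEGITIMATE = {
    "hardware-absent",       # GPU, sensor, serial device not on this runner
    "os-mismatch",           # Darwin-only test on Linux CI
    "feature-flag-off",      # feature intentionally disabled here
    "uncontrolled-service",  # third-party API with no sandbox, contract test exists
}
ILLEGITIMATE = {
    "service-not-running",   # bring it up or mock it
    "missing-fixture",       # provide or generate the data
    "missing-env-var",       # add to .env.example, ask the user
    "untracked-quarantine",  # create the ticket
    "stale-refactor-skip",   # clean it up
    "shared-state-ordering", # isolate the state
}

@dataclass
class SkipFinding:
    test_id: str
    category: str
    reason: str

def triage(finding: SkipFinding) -> str:
    """Return 'accept', 'fix', or 'unknown' per the skip policy above."""
    if finding.category in LEGITIMATE:
        # even legitimate skips require a clear reason string
        return "accept" if finding.reason.strip() else "fix"
    if finding.category in ILLEGITIMATE:
        return "fix"
    return "unknown"  # escalate: never silently accept
```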
After investigating, present findings:
+31 -20
@@ -27,8 +27,11 @@ Analyze input data completeness and produce detailed black-box test specificatio
- **Save immediately**: write artifacts to disk after each phase; never accumulate unsaved work
- **Ask, don't assume**: when requirements are ambiguous, ask the user before proceeding
- **Spec, don't code**: this workflow produces test specifications, never test implementation code
- **Every test must have a pass/fail criterion**. Two acceptable shapes:
  - **Input/output shape**: concrete input data paired with a quantifiable expected result (exact value, tolerance, threshold, pattern, reference file). Typical for functional blackbox tests, performance tests with load data, data-processing pipelines.
- **Behavioral shape**: a trigger condition + observable system behavior + quantifiable pass/fail criterion, with no input data required. Typical for startup/shutdown tests, retry/backoff policies, state transitions, logging/metrics emission, resilience scenarios. Example criteria: "startup logs `service ready` within 5s", "retry emits 3 attempts with exponential backoff (base 100ms ± 20ms)", "on SIGTERM, service drains in-flight requests within 30s grace period", "health endpoint returns 503 while migrations run".
- For behavioral tests the observable (log line, metric value, state transition, emitted event, elapsed time) must still be quantifiable — the test must programmatically decide pass/fail.
- A test that cannot produce a pass/fail verdict through either shape is not verifiable and must be removed.
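A behavioral-shape check can still be fully programmatic. The sketch below, under assumed names (`FlakyService`, `retry_with_backoff` are illustrative, not a real library), shows the "retry emits exponential backoff, base 100ms ± 20ms" example: the trigger is two transient failures, the observable is the recorded delay schedule, and the criterion is a numeric tolerance:

```python
import time

class FlakyService:
    """Hypothetical service that fails twice before succeeding."""
    def __init__(self):
        self.calls = 0
    def ping(self):
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("not ready")
        return "ok"

def retry_with_backoff(op, attempts=3, base=0.1, sleep=time.sleep):
    """Retry `op`, sleeping base * 2**i between attempts; returns (result, delays)."""
    delays = []
    for i in range(attempts):
        try:
            return op(), delays  # observable: the emitted delay schedule
        except ConnectionError:
            if i == attempts - 1:
                raise
            delay = base * 2 ** i
            delays.append(delay)
            sleep(delay)

# Trigger: two transient failures. Criterion: exponential backoff, base 100ms ± 20ms.
svc = FlakyService()
result, delays = retry_with_backoff(svc.ping, sleep=lambda _: None)  # no real sleep in the check
assert result == "ok"
assert len(delays) == 2
assert all(abs(d - 0.1 * 2 ** i) <= 0.02 * 2 ** i for i, d in enumerate(delays))
```

No input data is invented; the verdict comes entirely from the observable behavior.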
## Context Resolution
@@ -177,7 +180,7 @@ At the start of execution, create a TodoWrite with all four phases. Update statu
|------------|--------------------------|---------------|----------------|
| [file/data] | Yes/No | Yes/No | [missing, vague, no tolerance, etc.] |

9. Threshold: at least 75% coverage of scenarios AND every covered scenario has a quantifiable expected result (see `.cursor/rules/cursor-meta.mdc` Quality Thresholds table)
10. If coverage is low, search the internet for supplementary data, assess quality with user, and if user agrees, add to `input_data/` and update `input_data/expected_results/results_report.md`
11. If expected results are missing or not quantifiable, ask user to provide them before proceeding
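The 75% threshold above reduces to a simple check; the helper name and the scenario record shape are ours, not part of the rule:

```python
def coverage_gate(scenarios, threshold=0.75):
    """scenarios: dicts with 'covered' and 'quantifiable' booleans (illustrative shape).

    Passes only when coverage meets the threshold AND every covered
    scenario has a quantifiable expected result.
    """
    if not scenarios:
        return False
    covered = [s for s in scenarios if s["covered"]]
    coverage = len(covered) / len(scenarios)
    return coverage >= threshold and all(s["quantifiable"] for s in covered)

# 9 of 12 covered (exactly 75%), all quantifiable: passes
ok = coverage_gate([{"covered": True, "quantifiable": True}] * 9
                   + [{"covered": False, "quantifiable": False}] * 3)
# 8 of 12 covered (about 66.7%): fails
low = coverage_gate([{"covered": True, "quantifiable": True}] * 8
                    + [{"covered": False, "quantifiable": False}] * 4)
```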
@@ -232,18 +235,26 @@ Capture any new questions, findings, or insights that arise during test specific
### Phase 3: Test Data Validation Gate (HARD GATE)

**Role**: Professional Quality Assurance Engineer
**Goal**: Ensure every test scenario produced in Phase 2 has concrete, sufficient test data. Remove tests that lack data. Verify final coverage stays above 75%.
**Constraints**: This phase is MANDATORY and cannot be skipped.

#### Step 1 — Build the requirements checklist

Scan `blackbox-tests.md`, `performance-tests.md`, `resilience-tests.md`, `security-tests.md`, and `resource-limit-tests.md`. For every test scenario, classify its shape (input/output or behavioral) and extract:

**Input/output tests:**

| # | Test Scenario ID | Test Name | Required Input Data | Required Expected Result | Result Quantifiable? | Comparison Method | Input Provided? | Expected Result Provided? |
|---|-----------------|-----------|---------------------|-------------------------|---------------------|-------------------|----------------|--------------------------|
| 1 | [ID] | [name] | [data description] | [what system should output] | [Yes/No] | [exact/tolerance/pattern/threshold] | [Yes/No] | [Yes/No] |

**Behavioral tests:**

| # | Test Scenario ID | Test Name | Trigger Condition | Observable Behavior | Pass/Fail Criterion | Quantifiable? |
|---|-----------------|-----------|-------------------|--------------------|--------------------|---------------|
| 1 | [ID] | [name] | [e.g., service receives SIGTERM] | [e.g., drain logs emitted, port closed] | [e.g., drain completes ≤30s] | [Yes/No] |

Present both tables to the user.

#### Step 2 — Ask user to provide missing test data AND expected results
@@ -315,20 +326,20 @@ After all removals, recalculate coverage:
**Decision**:

- **Coverage ≥ 75%** → Phase 3 **PASSED**. Present final summary to user.
- **Coverage < 75%** → Phase 3 **FAILED**. Report:

> ❌ Test coverage dropped to **X%** (minimum 75% required). The removed test scenarios left gaps in the following acceptance criteria / restrictions:
>
> | Uncovered Item | Type (AC/Restriction) | Missing Test Data Needed |
> |---|---|---|
>
> **Action required**: Provide the missing test data for the items above, or add alternative test scenarios that cover these items with data you can supply.

**BLOCKING**: Loop back to Step 2 with the uncovered items. Do NOT finalize until coverage ≥ 75%.

#### Phase 3 Completion

When coverage ≥ 75% and all remaining tests have validated data AND quantifiable expected results:

1. Present the final coverage report
2. List all removed tests (if any) with reasons
@@ -479,23 +490,23 @@ Create `scripts/run-performance-tests.sh` at the project root. The script must:
| Missing acceptance_criteria.md, restrictions.md, or input_data/ | **STOP** — specification cannot proceed |
| Missing input_data/expected_results/results_report.md | **STOP** — ask user to provide expected results mapping using the template |
| Ambiguous requirements | ASK user |
| Input data coverage below 75% (Phase 1) | Search internet for supplementary data, ASK user to validate |
| Expected results missing or not quantifiable (Phase 1) | ASK user to provide quantifiable expected results before proceeding |
| Test scenario conflicts with restrictions | ASK user to clarify intent |
| System interfaces unclear (no architecture.md) | ASK user or derive from solution.md |
| Test data or expected result not provided for a test scenario (Phase 3) | WARN user and REMOVE the test |
| Final coverage below 75% after removals (Phase 3) | BLOCK — require user to supply data or accept reduced spec |
## Common Mistakes

- **Referencing internals**: tests must be black-box — no internal module names, no direct DB queries against the system under test
- **Vague expected outcomes**: "works correctly" is not a test outcome; use specific measurable values
- **Missing pass/fail criterion**: input/output tests without an expected result, OR behavioral tests without a measurable observable — both are unverifiable and must be removed
- **Non-quantifiable criteria**: "should return good results", "works correctly", "behaves properly" — not verifiable. Use exact values, tolerances, thresholds, pattern matches, or timing bounds that code can evaluate.
- **Forcing the wrong shape**: do not invent fake input data for a behavioral test (e.g., "input: SIGTERM signal") just to fit the input/output shape. Classify the test correctly and use the matching checklist.
- **Missing negative scenarios**: every positive scenario category should have corresponding negative/edge-case tests
- **Untraceable tests**: every test should trace to at least one AC or restriction
- **Writing test code**: this skill produces specifications, never implementation code
## Trigger Conditions
@@ -516,7 +527,7 @@ When the user wants to:
│ → verify AC, restrictions, input_data (incl. expected_results.md) │
│                                                                   │
│ Phase 1: Input Data & Expected Results Completeness Analysis      │
│ → assess input_data/ coverage vs AC scenarios (≥75%)              │
│ → verify every input has a quantifiable expected result           │
│ → present input→expected-result pairing assessment                │
│ [BLOCKING: user confirms input data + expected results coverage]  │
@@ -538,8 +549,8 @@ When the user wants to:
│ → validate input data (quality + quantity)                        │
│ → validate expected results (quantifiable + comparison method)    │
│ → remove tests without data or expected result, warn user         │
│ → final coverage check (≥75% or FAIL + loop back)                 │
│ [BLOCKING: coverage ≥ 75% required to pass]                       │
│                                                                   │
│ Phase 4: Test Runner Script Generation                            │
│ → detect test runner + docker-compose + load tool                 │