mirror of https://github.com/azaion/ai-training.git (synced 2026-04-22 11:26:36 +00:00)
Refine coding standards and testing guidelines
- Updated the coding rule descriptions to emphasize readability, meaningful comments, and test verification.
- Revised guidelines to clarify the importance of avoiding boilerplate while maintaining readability.
- Enhanced the testing rules to set a minimum coverage threshold of 75% for business logic and specified criteria for test scenarios.
- Introduced a mechanism for handling skipped tests, categorizing them as legitimate or illegitimate, and outlined resolution steps.

These changes aim to improve code quality, maintainability, and testing effectiveness.
@@ -55,6 +55,11 @@ After selecting the flow, apply its detection rules (first match wins) to determ
 Every invocation follows this sequence:
 
 ```
+0. Process leftovers (see `.cursor/rules/tracker.mdc` → Leftovers Mechanism):
+   - Read _docs/_process_leftovers/ if it exists
+   - For each entry, attempt replay against the tracker
+   - Delete successful replays, update failed ones with new timestamp + reason
+   - If any leftover still blocked AND requires user input → STOP and ASK
 1. Read _docs/_autopilot_state.md (if exists)
 2. Read all File Index files above
 3. Cross-check state file against _docs/ folder structure (rules in state.md)
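For illustration, a minimal sketch of the step-0 leftover replay added above — the `_docs/_process_leftovers/` path comes from the rules; the one-JSON-file-per-entry layout and the function names are assumptions:

```python
# Hypothetical sketch of step 0 (leftover replay); only the directory
# name comes from the docs — the entry format and helpers are invented.
import json
from datetime import datetime, timezone
from pathlib import Path

LEFTOVERS_DIR = Path("_docs/_process_leftovers")

def replay_against_tracker(entry: dict) -> bool:
    # Placeholder: re-attempt the recorded tracker operation here.
    return False

def process_leftovers() -> list[dict]:
    """Replay each leftover; delete successes, restamp failures.

    Returns entries that are still blocked so the caller can decide
    whether any of them require user input (→ STOP and ASK).
    """
    blocked: list[dict] = []
    if not LEFTOVERS_DIR.exists():
        return blocked
    for path in sorted(LEFTOVERS_DIR.glob("*.json")):
        entry = json.loads(path.read_text())
        if replay_against_tracker(entry):
            path.unlink()  # successful replay: drop the leftover
        else:
            entry["last_attempt"] = datetime.now(timezone.utc).isoformat()
            entry.setdefault("reason", "replay failed")
            path.write_text(json.dumps(entry, indent=2))
            blocked.append(entry)
    return blocked
```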
@@ -28,7 +28,7 @@ The `implementer` agent is the specialist that writes all the code — it receiv
 - **Integrated review**: `/code-review` skill runs automatically after each batch
 - **Auto-start**: batches launch immediately — no user confirmation before a batch
 - **Gate on failure**: user confirmation is required only when code review returns FAIL
-- **Commit and push per batch**: after each batch is confirmed, commit and push to remote
+- **Commit per batch**: after each batch is confirmed, commit. Ask the user whether to push to remote unless the user previously opted into auto-push for this session.
 
 ## Context Resolution
@@ -134,25 +134,38 @@ Only proceed to Step 9 when every AC has a corresponding test.
 
 ### 10. Auto-Fix Gate
 
-Auto-fix loop with bounded retries (max 2 attempts) before escalating to user:
-
-1. If verdict is **PASS** or **PASS_WITH_WARNINGS**: show findings as info, continue automatically to step 11
-2. If verdict is **FAIL** (attempt 1 or 2):
-   - Parse the code review findings (Critical and High severity items)
-   - For each finding, attempt an automated fix using the finding's location, description, and suggestion
-   - Re-run `/code-review` on the modified files
-   - If now PASS or PASS_WITH_WARNINGS → continue to step 11
-   - If still FAIL → increment retry counter, repeat from (2) up to max 2 attempts
-3. If still **FAIL** after 2 auto-fix attempts: present all findings to user (**BLOCKING**). User must confirm fixes or accept before proceeding.
-
-Track `auto_fix_attempts` count in the batch report for retrospective analysis.
+Bounded auto-fix loop — only applies to **mechanical** findings. Critical and Security findings are never auto-fixed.
+
+**Auto-fix eligibility matrix:**
+
+| Severity | Category | Auto-fix? |
+|----------|----------|-----------|
+| Low | any | yes |
+| Medium | Style, Maintainability, Performance | yes |
+| Medium | Bug, Spec-Gap, Security | escalate |
+| High | Style, Scope | yes |
+| High | Bug, Spec-Gap, Performance, Maintainability | escalate |
+| Critical | any | escalate |
+| any | Security | escalate |
+
+Flow:
+
+1. If verdict is **PASS** or **PASS_WITH_WARNINGS**: show findings as info, continue to step 11
+2. If verdict is **FAIL**:
+   - Partition findings into auto-fix-eligible and escalate (using the matrix above)
+   - For eligible findings, attempt fixes using location/description/suggestion, then re-run `/code-review` on modified files (max 2 rounds)
+   - If all remaining findings are auto-fix-eligible and re-review now passes → continue to step 11
+   - If any non-eligible finding exists at any point → stop auto-fixing, present the full list to the user (**BLOCKING**)
+3. User must explicitly approve each non-auto-fix finding (accept, request manual fix, mark as out-of-scope) before proceeding.
+
+Track `auto_fix_attempts` and `escalated_findings` in the batch report for retrospective analysis.
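A sketch of how the eligibility matrix above might be encoded, assuming findings are dicts with `severity` and `category` keys; the names and data shapes are illustrative, not the project's actual API:

```python
# Hypothetical encoding of the auto-fix eligibility matrix.
AUTO_FIX_ELIGIBLE = {
    ("medium", "style"), ("medium", "maintainability"), ("medium", "performance"),
    ("high", "style"), ("high", "scope"),
}

def is_auto_fixable(severity: str, category: str) -> bool:
    """True → attempt automated fix; False → escalate to the user."""
    severity, category = severity.lower(), category.lower()
    if category == "security" or severity == "critical":
        return False  # never auto-fixed, per the matrix
    if severity == "low":
        return True   # Low / any → yes
    return (severity, category) in AUTO_FIX_ELIGIBLE

def partition(findings: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split review findings into auto-fix-eligible and escalated."""
    eligible: list[dict] = []
    escalated: list[dict] = []
    for f in findings:
        (eligible if is_auto_fixable(f["severity"], f["category"]) else escalated).append(f)
    return eligible, escalated
```

Security-category and Critical-severity findings short-circuit to escalation, which is why the two catch-all `escalate` rows never need a table lookup.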
 
-### 11. Commit and Push
+### 11. Commit (and optionally Push)
 
 - After user confirms the batch (explicitly for FAIL, implicitly for PASS/PASS_WITH_WARNINGS):
   - `git add` all changed files from the batch
   - `git commit` with a message that includes ALL task IDs (tracker IDs or numeric prefixes) of tasks implemented in the batch, followed by a summary of what was implemented. Format: `[TASK-ID-1] [TASK-ID-2] ... Summary of changes`
-  - `git push` to the remote branch
+  - Ask the user whether to push to remote, unless the user previously opted into auto-push for this session
 
 ### 12. Update Tracker Status → In Testing
@@ -119,7 +119,7 @@ Read and follow `steps/07_quality-checklist.md`.
 |-----------|--------|
 | Missing acceptance_criteria.md, restrictions.md, or input_data/ | **STOP** — planning cannot proceed |
 | Ambiguous requirements | ASK user |
-| Input data coverage below 70% | Search internet for supplementary data, ASK user to validate |
+| Input data coverage below 75% | Search internet for supplementary data, ASK user to validate |
 | Technology choice with multiple valid options | ASK user |
 | Component naming | PROCEED, confirm at next BLOCKING gate |
 | File structure within templates | PROCEED |
@@ -32,3 +32,17 @@
 6. Applicable scenarios
 7. Team capability requirements
 8. Migration difficulty
+
+## Decomposition Completeness Probes (Completeness Audit Reference)
+
+Used during Step 1's Decomposition Completeness Audit. After generating sub-questions, ask each probe against the current decomposition. If a probe reveals an uncovered area, add a sub-question for it.
+
+| Probe | What it catches |
+|-------|-----------------|
+| **What does this cost — in money, time, resources, or trade-offs?** | Budget, pricing, licensing, tax, opportunity cost, maintenance burden |
+| **What are the hard constraints — physical, legal, regulatory, environmental?** | Regulations, certifications, spectrum/frequency rules, export controls, physics limits, IP restrictions |
+| **What are the dependencies and assumptions that could break?** | Supply chain, vendor lock-in, API stability, single points of failure, standards evolution |
+| **What does the operating environment actually look like?** | Terrain, weather, connectivity, infrastructure, power, latency, user skill level |
+| **What failure modes exist and what happens when they trigger?** | Degraded operation, fallback, safety margins, blast radius, recovery time |
+| **What do practitioners who solved similar problems say matters most?** | Field-tested priorities that don't appear in specs or papers |
+| **What changes over time — and what looks stable now but isn't?** | Technology roadmaps, regulatory shifts, deprecation risk, scaling effects |
@@ -10,6 +10,12 @@
 - [ ] Every citation can be directly verified by the user (source verifiability)
 - [ ] Structure hierarchy is clear; executives can quickly locate information
 
+## Decomposition Completeness
+
+- [ ] Domain discovery search executed: searched "key factors when [problem domain]" before starting research
+- [ ] Completeness probes applied: every probe from `references/comparison-frameworks.md` checked against sub-questions
+- [ ] No uncovered areas remain: all gaps filled with sub-questions or justified as not applicable
+
 ## Internet Search Depth
 
 - [ ] Every sub-question was searched with at least 3-5 different query variants
@@ -97,6 +97,16 @@ When decomposing questions, you must explicitly define the **boundaries of the r
 
 **Common mistake**: User asks about "university classroom issues" but sources include policies targeting "K-12 students" — mismatched target populations will invalidate the entire research.
 
+#### Decomposition Completeness Audit (MANDATORY)
+
+After generating sub-questions, verify the decomposition covers all major dimensions of the problem — not just the ones that came to mind first.
+
+1. **Domain discovery search**: Search the web for "key factors when [problem domain]" / "what to consider when [problem domain]" (e.g., "key factors GPS-denied navigation", "what to consider when choosing an edge deployment strategy"). Extract dimensions that practitioners and domain experts consider important but are absent from the current sub-questions.
+2. **Run completeness probes**: Walk through each probe in `references/comparison-frameworks.md` → "Decomposition Completeness Probes" against the current sub-question list. For each probe, note whether it is covered, not applicable (state why), or missing.
+3. **Fill gaps**: Add sub-questions (with search query variants) for any uncovered area. Do this before proceeding to Step 2.
+
+Record the audit result in `00_question_decomposition.md` as a "Completeness Audit" section.
+
 **Save action**:
 1. Read all files from INPUT_DIR to ground the research in the project context
 2. Create working directory `RESEARCH_DIR/`
@@ -109,6 +119,7 @@ When decomposing questions, you must explicitly define the **boundaries of the r
 - List of decomposed sub-questions
 - **Chosen perspectives** (at least 3 from the Perspective Rotation table) with rationale
 - **Search query variants** for each sub-question (at least 3-5 per sub-question)
+- **Completeness audit** (taxonomy cross-reference + domain discovery results)
 4. Write TodoWrite to track progress
 
 ---
@@ -102,32 +102,46 @@ After investigating, present:
 - If user picks A → apply fixes, then re-run (loop back to step 2)
 - If user picks B → return failure to the autopilot
 
-**Any test skipped** → this is also a **blocking gate**. Skipped tests mean something is wrong — either with the test, the environment, or the test design. **Never blindly remove a skipped test.** Always investigate the root cause first.
-
-#### Investigation Protocol for Skipped Tests
-
-For each skipped test:
-
-1. **Read the test code** — understand what the test is supposed to verify and why it skips.
-2. **Determine the root cause** — why did the skip condition fire?
-   - Is the test environment misconfigured? (e.g., wrong ports, missing env vars, service not started correctly)
-   - Is the test ordering wrong? (e.g., a fixture in an earlier test mutates shared state)
-   - Is a dependency missing? (e.g., package not installed, fixture file absent)
-   - Is the skip condition outdated? (e.g., code was refactored but the skip guard still checks the old behavior)
-   - Is the test fundamentally untestable in the current setup? (e.g., requires Docker restart, different OS, special hardware)
-3. **Try to fix the root cause first** — the goal is to make the test run, not to delete it:
-   - Fix the environment or configuration
-   - Reorder tests or isolate shared state
-   - Install the missing dependency
-   - Update the skip condition to match current behavior
-4. **Only remove as last resort** — if the test truly cannot run in any realistic test environment (e.g., requires hardware not available, duplicates another test with identical assertions), then removal is justified. Document the reasoning.
-
-#### Categorization
-
-- **explicit skip (dead code)**: Has `@pytest.mark.skip` — investigate whether the reason in the decorator is still valid. Often these are temporary skips that became permanent by accident.
-- **runtime skip (unreachable)**: `pytest.skip()` fires inside the test body — investigate why the condition always triggers. Often fixable by adjusting test order, environment, or the condition itself.
-- **environment mismatch**: Test assumes a different environment — investigate whether the test environment setup can be fixed.
-- **missing fixture/data**: Data or service not available — investigate whether it can be provided.
+**Any skipped test** → classify as legitimate or illegitimate before deciding whether to block.
+
+#### Legitimate skips (accept and proceed)
+
+The code path genuinely cannot execute on this runner. Acceptable reasons:
+
+- Hardware not physically present (GPU, Apple Neural Engine, sensor, serial device)
+- Operating system mismatch (Darwin-only test on Linux CI, Windows-only test on macOS)
+- Feature-flag-gated test whose feature is intentionally disabled in this environment
+- External service the project deliberately does not control (e.g., a third-party API with no sandbox, and the project has a documented contract test instead)
+
+For legitimate skips: verify the skip condition is accurate (the test would run if the hardware/OS were present), verify it has a clear reason string, and proceed.
+
+#### Illegitimate skips (BLOCKING — must resolve)
+
+The skip is a workaround for something we can and should fix. NOT acceptable reasons:
+
+- Required service not running (database, message broker, downstream API we control) → fix: bring the service up, add a docker-compose dependency, or add a mock
+- Missing test fixture, seed data, or sample file → fix: provide the data, generate it, or ASK the user for it
+- Missing environment variable or credential → fix: add to `.env.example`, document, ASK user for the value
+- Flaky-test quarantine with no tracking ticket → fix: create the ticket (or replay via leftovers if tracker is down)
+- Inherited skip from a prior refactor that was never cleaned up → fix: clean it up now
+- Test ordering mutates shared state → fix: isolate the state
+
+**Rule of thumb**: if the reason for skipping is "we didn't set something up," that's not a valid skip — set it up. If the reason is "this hardware/OS isn't here," that's valid.
+
+#### Resolution steps for illegitimate skips
+
+1. Classify the skip (read the skip reason and test body)
+2. If the fix is **mechanical** — start a container, install a dep, add a mock, reorder fixtures — attempt it automatically and re-run
+3. If the fix requires **user input** — credentials, sample data, a business decision — BLOCK and ASK
+4. Never silently mark the skip as "accepted" — every illegitimate skip must either be fixed or escalated
+5. Removal is a last resort and requires explicit user approval with documented reasoning
+
+#### Categorization cheatsheet
+
+- **explicit skip (e.g. `@pytest.mark.skip`)**: check whether the reason in the decorator is still valid
+- **conditional skip (e.g. `@pytest.mark.skipif`)**: check whether the condition is accurate and whether we can change the environment to make it false
+- **runtime skip (e.g. `pytest.skip()` in body)**: check why the condition fires — often an ordering or environment bug
+- **missing fixture/data**: treated as illegitimate unless user confirms the data is unavailable
 
 After investigating, present findings:
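Illustrative pytest shapes for the two categories above — `@pytest.mark.skipif` and `@pytest.mark.skip` are real pytest markers; the tests, conditions, and reasons are invented:

```python
import sys

import pytest

# Legitimate skip: the code path genuinely cannot execute on this runner.
@pytest.mark.skipif(sys.platform != "darwin", reason="exercises macOS keychain APIs")
def test_keychain_roundtrip():
    ...

# Illegitimate skip: "service not running" is a setup gap we control —
# bring the database up (docker-compose dependency) or mock it instead.
@pytest.mark.skip(reason="postgres not running on my machine")
def test_order_persistence():
    ...
```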
@@ -27,8 +27,11 @@ Analyze input data completeness and produce detailed black-box test specificatio
 - **Save immediately**: write artifacts to disk after each phase; never accumulate unsaved work
 - **Ask, don't assume**: when requirements are ambiguous, ask the user before proceeding
 - **Spec, don't code**: this workflow produces test specifications, never test implementation code
-- **No test without data**: every test scenario MUST have concrete test data; tests without data are removed
-- **No test without expected result**: every test scenario MUST pair input data with a quantifiable expected result; a test that cannot compare actual output against a known-correct answer is not verifiable and must be removed
+- **Every test must have a pass/fail criterion**. Two acceptable shapes:
+  - **Input/output shape**: concrete input data paired with a quantifiable expected result (exact value, tolerance, threshold, pattern, reference file). Typical for functional blackbox tests, performance tests with load data, data-processing pipelines.
+  - **Behavioral shape**: a trigger condition + observable system behavior + quantifiable pass/fail criterion, with no input data required. Typical for startup/shutdown tests, retry/backoff policies, state transitions, logging/metrics emission, resilience scenarios. Example criteria: "startup logs `service ready` within 5s", "retry emits 3 attempts with exponential backoff (base 100ms ± 20ms)", "on SIGTERM, service drains in-flight requests within 30s grace period", "health endpoint returns 503 while migrations run".
+  - For behavioral tests the observable (log line, metric value, state transition, emitted event, elapsed time) must still be quantifiable — the test must programmatically decide pass/fail.
+  - A test that cannot produce a pass/fail verdict through either shape is not verifiable and must be removed.
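A runnable sketch of a behavioral-shape test — trigger (SIGTERM) + observable (clean exit) + quantifiable criterion (drain ≤ 30s). The Python child process stands in for a real service; all names and bounds are illustrative:

```python
# Hypothetical behavioral-shape test (POSIX): the child installs a
# SIGTERM handler that "drains" for 1s, then exits cleanly.
import signal
import subprocess
import sys
import time

CHILD = (
    "import signal, sys, time\n"
    "signal.signal(signal.SIGTERM, lambda *_: (time.sleep(1), sys.exit(0)))\n"
    "time.sleep(60)\n"
)

def test_sigterm_drains_within_grace_period():
    proc = subprocess.Popen([sys.executable, "-c", CHILD])
    time.sleep(0.5)                      # let the handler get installed
    start = time.monotonic()
    proc.send_signal(signal.SIGTERM)     # trigger condition
    proc.wait(timeout=35)
    elapsed = time.monotonic() - start
    assert elapsed <= 30.0               # quantifiable pass/fail criterion
    assert proc.returncode == 0          # observable: clean (drained) exit
```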
 
 ## Context Resolution
@@ -177,7 +180,7 @@ At the start of execution, create a TodoWrite with all four phases. Update statu
 |------------|--------------------------|---------------|----------------|
 | [file/data] | Yes/No | Yes/No | [missing, vague, no tolerance, etc.] |
 
-9. Threshold: at least 70% coverage of scenarios AND every covered scenario has a quantifiable expected result (see `.cursor/rules/cursor-meta.mdc` Quality Thresholds table)
+9. Threshold: at least 75% coverage of scenarios AND every covered scenario has a quantifiable expected result (see `.cursor/rules/cursor-meta.mdc` Quality Thresholds table)
 10. If coverage is low, search the internet for supplementary data, assess quality with user, and if user agrees, add to `input_data/` and update `input_data/expected_results/results_report.md`
 11. If expected results are missing or not quantifiable, ask user to provide them before proceeding
@@ -232,18 +235,26 @@ Capture any new questions, findings, or insights that arise during test specific
 ### Phase 3: Test Data Validation Gate (HARD GATE)
 
 **Role**: Professional Quality Assurance Engineer
-**Goal**: Ensure every test scenario produced in Phase 2 has concrete, sufficient test data. Remove tests that lack data. Verify final coverage stays above 70%.
+**Goal**: Ensure every test scenario produced in Phase 2 has concrete, sufficient test data. Remove tests that lack data. Verify final coverage stays above 75%.
 **Constraints**: This phase is MANDATORY and cannot be skipped.
 
-#### Step 1 — Build the test-data and expected-result requirements checklist
+#### Step 1 — Build the requirements checklist
 
-Scan `blackbox-tests.md`, `performance-tests.md`, `resilience-tests.md`, `security-tests.md`, and `resource-limit-tests.md`. For every test scenario, extract:
+Scan `blackbox-tests.md`, `performance-tests.md`, `resilience-tests.md`, `security-tests.md`, and `resource-limit-tests.md`. For every test scenario, classify its shape (input/output or behavioral) and extract:
+
+**Input/output tests:**
 
 | # | Test Scenario ID | Test Name | Required Input Data | Required Expected Result | Result Quantifiable? | Comparison Method | Input Provided? | Expected Result Provided? |
 |---|-----------------|-----------|---------------------|-------------------------|---------------------|-------------------|----------------|--------------------------|
 | 1 | [ID] | [name] | [data description] | [what system should output] | [Yes/No] | [exact/tolerance/pattern/threshold] | [Yes/No] | [Yes/No] |
 
-Present this table to the user.
+**Behavioral tests:**
+
+| # | Test Scenario ID | Test Name | Trigger Condition | Observable Behavior | Pass/Fail Criterion | Quantifiable? |
+|---|-----------------|-----------|-------------------|--------------------|--------------------|---------------|
+| 1 | [ID] | [name] | [e.g., service receives SIGTERM] | [e.g., drain logs emitted, port closed] | [e.g., drain completes ≤30s] | [Yes/No] |
+
+Present both tables to the user.
 
 #### Step 2 — Ask user to provide missing test data AND expected results
@@ -315,20 +326,20 @@ After all removals, recalculate coverage:
 
 **Decision**:
 
-- **Coverage ≥ 70%** → Phase 3 **PASSED**. Present final summary to user.
-- **Coverage < 70%** → Phase 3 **FAILED**. Report:
-  > ❌ Test coverage dropped to **X%** (minimum 70% required). The removed test scenarios left gaps in the following acceptance criteria / restrictions:
+- **Coverage ≥ 75%** → Phase 3 **PASSED**. Present final summary to user.
+- **Coverage < 75%** → Phase 3 **FAILED**. Report:
+  > ❌ Test coverage dropped to **X%** (minimum 75% required). The removed test scenarios left gaps in the following acceptance criteria / restrictions:
   >
   > | Uncovered Item | Type (AC/Restriction) | Missing Test Data Needed |
   > |---|---|---|
   >
   > **Action required**: Provide the missing test data for the items above, or add alternative test scenarios that cover these items with data you can supply.
 
-**BLOCKING**: Loop back to Step 2 with the uncovered items. Do NOT finalize until coverage ≥ 70%.
+**BLOCKING**: Loop back to Step 2 with the uncovered items. Do NOT finalize until coverage ≥ 75%.
 
 #### Phase 3 Completion
 
-When coverage ≥ 70% and all remaining tests have validated data AND quantifiable expected results:
+When coverage ≥ 75% and all remaining tests have validated data AND quantifiable expected results:
 
 1. Present the final coverage report
 2. List all removed tests (if any) with reasons
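A minimal sketch of the recalculation behind this gate, assuming coverage is counted as covered ACs/restrictions over all ACs/restrictions — the 75% minimum comes from the spec; everything else is illustrative:

```python
# Hypothetical Phase 3 coverage gate.
def phase3_gate(still_covered: set[str], all_items: set[str],
                minimum_pct: float = 75.0) -> bool:
    """True → PASSED; False → FAILED (loop back to Step 2)."""
    coverage = 100.0 * len(still_covered & all_items) / len(all_items)
    print(f"Coverage after removals: {coverage:.1f}% (minimum {minimum_pct}%)")
    return coverage >= minimum_pct
```

For example, `phase3_gate({"AC-1", "AC-2", "R-1"}, {"AC-1", "AC-2", "AC-3", "R-1"})` reports 75.0% and passes; removing one more covered item would fail the gate and loop back to Step 2.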
@@ -479,23 +490,23 @@ Create `scripts/run-performance-tests.sh` at the project root. The script must:
 | Missing acceptance_criteria.md, restrictions.md, or input_data/ | **STOP** — specification cannot proceed |
 | Missing input_data/expected_results/results_report.md | **STOP** — ask user to provide expected results mapping using the template |
 | Ambiguous requirements | ASK user |
-| Input data coverage below 70% (Phase 1) | Search internet for supplementary data, ASK user to validate |
+| Input data coverage below 75% (Phase 1) | Search internet for supplementary data, ASK user to validate |
 | Expected results missing or not quantifiable (Phase 1) | ASK user to provide quantifiable expected results before proceeding |
 | Test scenario conflicts with restrictions | ASK user to clarify intent |
 | System interfaces unclear (no architecture.md) | ASK user or derive from solution.md |
 | Test data or expected result not provided for a test scenario (Phase 3) | WARN user and REMOVE the test |
-| Final coverage below 70% after removals (Phase 3) | BLOCK — require user to supply data or accept reduced spec |
+| Final coverage below 75% after removals (Phase 3) | BLOCK — require user to supply data or accept reduced spec |
 
 ## Common Mistakes
 
 - **Referencing internals**: tests must be black-box — no internal module names, no direct DB queries against the system under test
-- **Vague expected outcomes**: "works correctly" is not a test outcome; use specific measurable values
-- **Missing expected results**: input data without a paired expected result is useless — the test cannot determine pass/fail without knowing what "correct" looks like
-- **Non-quantifiable expected results**: "should return good results" is not verifiable; expected results must have exact values, tolerances, thresholds, or pattern matches that code can evaluate
+- **Missing pass/fail criterion**: input/output tests without an expected result, OR behavioral tests without a measurable observable — both are unverifiable and must be removed
+- **Non-quantifiable criteria**: "should return good results", "works correctly", "behaves properly" — not verifiable. Use exact values, tolerances, thresholds, pattern matches, or timing bounds that code can evaluate.
+- **Forcing the wrong shape**: do not invent fake input data for a behavioral test (e.g., "input: SIGTERM signal") just to fit the input/output shape. Classify the test correctly and use the matching checklist.
 - **Missing negative scenarios**: every positive scenario category should have corresponding negative/edge-case tests
 - **Untraceable tests**: every test should trace to at least one AC or restriction
 - **Writing test code**: this skill produces specifications, never implementation code
 - **Tests without data**: every test scenario MUST have concrete test data AND a quantifiable expected result; a test spec without either is not executable and must be removed
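To make "quantifiable" concrete, two stdlib-only illustrations of criteria that code can evaluate — a tolerance comparison and a pattern match; every value here is invented:

```python
import math
import re

def test_tolerance_comparison():
    actual = 1839.7          # stand-in for the system's measured output
    assert math.isclose(actual, 1832.4, rel_tol=0.01)  # expected 1832.4 ± 1%

def test_pattern_comparison():
    log_line = "2025-01-01T00:00:02Z INFO service ready"  # stand-in observable
    assert re.search(r"\bservice ready\b", log_line)      # verdict: match or no match
```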
 
 ## Trigger Conditions
@@ -516,7 +527,7 @@ When the user wants to:
 │ → verify AC, restrictions, input_data (incl. expected_results.md) │
 │                                                                   │
 │ Phase 1: Input Data & Expected Results Completeness Analysis      │
-│ → assess input_data/ coverage vs AC scenarios (≥70%)              │
+│ → assess input_data/ coverage vs AC scenarios (≥75%)              │
 │ → verify every input has a quantifiable expected result           │
 │ → present input→expected-result pairing assessment                │
 │ [BLOCKING: user confirms input data + expected results coverage]  │
@@ -538,8 +549,8 @@ When the user wants to:
 │ → validate input data (quality + quantity)                        │
 │ → validate expected results (quantifiable + comparison method)    │
 │ → remove tests without data or expected result, warn user         │
-│ → final coverage check (≥70% or FAIL + loop back)                 │
-│ [BLOCKING: coverage ≥ 70% required to pass]                       │
+│ → final coverage check (≥75% or FAIL + loop back)                 │
+│ [BLOCKING: coverage ≥ 75% required to pass]                       │
 │                                                                   │
 │ Phase 4: Test Runner Script Generation                            │
 │ → detect test runner + docker-compose + load tool                 │