Refine coding standards and testing guidelines

- Updated coding rules to emphasize readability, meaningful comments, and maintainability.
- Adjusted test coverage thresholds to 75% for business logic and clarified expectations for test scenarios.
- Enhanced guidelines for handling skipped tests, emphasizing the need for investigation and resolution.
- Introduced a completeness audit for research decomposition to ensure thoroughness in addressing problem dimensions.

Made-with: Cursor
Oleksandr Bezdieniezhnykh
2026-04-17 20:29:15 +03:00
parent 567092188d
commit f46531fc6d
17 changed files with 275 additions and 90 deletions
+5
@@ -55,6 +55,11 @@ After selecting the flow, apply its detection rules (first match wins) to determ
Every invocation follows this sequence:
```
0. Process leftovers (see `.cursor/rules/tracker.mdc` → Leftovers Mechanism):
- Read _docs/_process_leftovers/ if it exists
- For each entry, attempt replay against the tracker
- Delete successful replays, update failed ones with new timestamp + reason
- If any leftover still blocked AND requires user input → STOP and ASK
1. Read _docs/_autopilot_state.md (if exists)
2. Read all File Index files above
3. Cross-check state file against _docs/ folder structure (rules in state.md)
+26 -13
@@ -28,7 +28,7 @@ The `implementer` agent is the specialist that writes all the code — it receiv
- **Integrated review**: `/code-review` skill runs automatically after each batch
- **Auto-start**: batches launch immediately — no user confirmation before a batch
- **Gate on failure**: user confirmation is required only when code review returns FAIL
- **Commit and push per batch**: after each batch is confirmed, commit and push to remote
- **Commit per batch**: after each batch is confirmed, commit. Ask the user whether to push to remote unless the user previously opted into auto-push for this session.
## Context Resolution
@@ -134,25 +134,38 @@ Only proceed to Step 9 when every AC has a corresponding test.
### 10. Auto-Fix Gate
Auto-fix loop with bounded retries (max 2 attempts) before escalating to user:
Bounded auto-fix loop — only applies to **mechanical** findings. Critical and Security findings are never auto-fixed.
1. If verdict is **PASS** or **PASS_WITH_WARNINGS**: show findings as info, continue automatically to step 11
2. If verdict is **FAIL** (attempt 1 or 2):
- Parse the code review findings (Critical and High severity items)
- For each finding, attempt an automated fix using the finding's location, description, and suggestion
- Re-run `/code-review` on the modified files
- If now PASS or PASS_WITH_WARNINGS → continue to step 11
- If still FAIL → increment retry counter, repeat from (2) up to max 2 attempts
3. If still **FAIL** after 2 auto-fix attempts: present all findings to user (**BLOCKING**). User must confirm fixes or accept before proceeding.
**Auto-fix eligibility matrix:**
Track `auto_fix_attempts` count in the batch report for retrospective analysis.
| Severity | Category | Auto-fix? |
|----------|----------|-----------|
| Low | any | yes |
| Medium | Style, Maintainability, Performance | yes |
| Medium | Bug, Spec-Gap, Security | escalate |
| High | Style, Scope | yes |
| High | Bug, Spec-Gap, Performance, Maintainability | escalate |
| Critical | any | escalate |
| any | Security | escalate |
### 11. Commit and Push
Flow:
1. If verdict is **PASS** or **PASS_WITH_WARNINGS**: show findings as info, continue to step 11
2. If verdict is **FAIL**:
- Partition findings into auto-fix-eligible and escalate (using the matrix above)
- For eligible findings, attempt fixes using location/description/suggestion, then re-run `/code-review` on modified files (max 2 rounds)
- If all remaining findings are auto-fix-eligible and re-review now passes → continue to step 11
- If any non-eligible finding exists at any point → stop auto-fixing, present the full list to the user (**BLOCKING**)
3. User must explicitly approve each non-auto-fix finding (accept, request manual fix, mark as out-of-scope) before proceeding.
Track `auto_fix_attempts` and `escalated_findings` in the batch report for retrospective analysis.
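The eligibility matrix above can be sketched as a small lookup function. This is a minimal illustration only (the function name and string values are assumptions, not part of the skill's actual implementation):

```python
def auto_fix_eligible(severity: str, category: str) -> bool:
    """Mirror the eligibility matrix: True means auto-fix, False means escalate."""
    if category == "Security" or severity == "Critical":
        return False  # always escalated, regardless of the other axis
    if severity == "Low":
        return True   # Low | any -> yes
    if severity == "Medium":
        return category in {"Style", "Maintainability", "Performance"}
    if severity == "High":
        return category in {"Style", "Scope"}
    return False      # unknown severity: escalate conservatively
```

A finding then routes through the loop only while it is eligible; the first ineligible finding stops auto-fixing and escalates the full list to the user.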
### 11. Commit (and optionally Push)
- After user confirms the batch (explicitly for FAIL, implicitly for PASS/PASS_WITH_WARNINGS):
- `git add` all changed files from the batch
- `git commit` with a message that includes ALL task IDs (tracker IDs or numeric prefixes) of tasks implemented in the batch, followed by a summary of what was implemented. Format: `[TASK-ID-1] [TASK-ID-2] ... Summary of changes`
- `git push` to the remote branch
- Ask the user whether to push to remote, unless the user previously opted into auto-push for this session
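The commit-message format can be illustrated with a small helper. A sketch only; the task IDs shown are hypothetical:

```python
def batch_commit_message(task_ids, summary):
    """Build '[TASK-ID-1] [TASK-ID-2] ... Summary of changes' for a batch commit."""
    prefix = " ".join(f"[{tid}]" for tid in task_ids)
    return f"{prefix} {summary}"

# Hypothetical IDs for illustration only:
# batch_commit_message(["TB-101", "TB-102"], "Add retry policy and tests")
# → "[TB-101] [TB-102] Add retry policy and tests"
```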
### 12. Update Tracker Status → In Testing
+1 -1
@@ -119,7 +119,7 @@ Read and follow `steps/07_quality-checklist.md`.
|-----------|--------|
| Missing acceptance_criteria.md, restrictions.md, or input_data/ | **STOP** — planning cannot proceed |
| Ambiguous requirements | ASK user |
| Input data coverage below 70% | Search internet for supplementary data, ASK user to validate |
| Input data coverage below 75% | Search internet for supplementary data, ASK user to validate |
| Technology choice with multiple valid options | ASK user |
| Component naming | PROCEED, confirm at next BLOCKING gate |
| File structure within templates | PROCEED |
@@ -32,3 +32,17 @@
6. Applicable scenarios
7. Team capability requirements
8. Migration difficulty
## Decomposition Completeness Probes (Completeness Audit Reference)
Used during Step 1's Decomposition Completeness Audit. After generating sub-questions, apply each probe to the current decomposition. If a probe reveals an uncovered area, add a sub-question for it.
| Probe | What it catches |
|-------|-----------------|
| **What does this cost — in money, time, resources, or trade-offs?** | Budget, pricing, licensing, tax, opportunity cost, maintenance burden |
| **What are the hard constraints — physical, legal, regulatory, environmental?** | Regulations, certifications, spectrum/frequency rules, export controls, physics limits, IP restrictions |
| **What are the dependencies and assumptions that could break?** | Supply chain, vendor lock-in, API stability, single points of failure, standards evolution |
| **What does the operating environment actually look like?** | Terrain, weather, connectivity, infrastructure, power, latency, user skill level |
| **What failure modes exist and what happens when they trigger?** | Degraded operation, fallback, safety margins, blast radius, recovery time |
| **What do practitioners who solved similar problems say matters most?** | Field-tested priorities that don't appear in specs or papers |
| **What changes over time — and what looks stable now but isn't?** | Technology roadmaps, regulatory shifts, deprecation risk, scaling effects |
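As a rough illustration of the audit mechanics only: the real check is a judgment call made by the agent, not keyword matching, and the probe names and keywords below are invented stand-ins.

```python
# Crude keyword heuristic standing in for the agent's judgment call.
PROBE_KEYWORDS = {
    "cost": ["cost", "price", "licens", "budget", "maintenance"],
    "constraints": ["regulat", "legal", "certif", "physical", "export"],
    "dependencies": ["supply", "vendor", "lock-in", "api stability"],
}

def audit_decomposition(sub_questions):
    """Return probe -> covered?, so each uncovered probe gets a new sub-question."""
    text = " ".join(sub_questions).lower()
    return {probe: any(keyword in text for keyword in keywords)
            for probe, keywords in PROBE_KEYWORDS.items()}
```

An uncovered probe (value `False`) is exactly the signal to add a sub-question before proceeding to Step 2.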
@@ -10,6 +10,12 @@
- [ ] Every citation can be directly verified by the user (source verifiability)
- [ ] Structure hierarchy is clear; executives can quickly locate information
## Decomposition Completeness
- [ ] Domain discovery search executed: searched "key factors when [problem domain]" before starting research
- [ ] Completeness probes applied: every probe from `references/comparison-frameworks.md` checked against sub-questions
- [ ] No uncovered areas remain: all gaps filled with sub-questions or justified as not applicable
## Internet Search Depth
- [ ] Every sub-question was searched with at least 3-5 different query variants
@@ -97,6 +97,16 @@ When decomposing questions, you must explicitly define the **boundaries of the r
**Common mistake**: User asks about "university classroom issues" but sources include policies targeting "K-12 students" — mismatched target populations will invalidate the entire research.
#### Decomposition Completeness Audit (MANDATORY)
After generating sub-questions, verify the decomposition covers all major dimensions of the problem — not just the ones that came to mind first.
1. **Domain discovery search**: Search the web for "key factors when [problem domain]" / "what to consider when [problem domain]" (e.g., "key factors GPS-denied navigation", "what to consider when choosing an edge deployment strategy"). Extract dimensions that practitioners and domain experts consider important but are absent from the current sub-questions.
2. **Run completeness probes**: Walk through each probe in `references/comparison-frameworks.md` → "Decomposition Completeness Probes" against the current sub-question list. For each probe, note whether it is covered, not applicable (state why), or missing.
3. **Fill gaps**: Add sub-questions (with search query variants) for any uncovered area. Do this before proceeding to Step 2.
Record the audit result in `00_question_decomposition.md` as a "Completeness Audit" section.
**Save action**:
1. Read all files from INPUT_DIR to ground the research in the project context
2. Create working directory `RESEARCH_DIR/`
@@ -109,6 +119,7 @@ When decomposing questions, you must explicitly define the **boundaries of the r
- List of decomposed sub-questions
- **Chosen perspectives** (at least 3 from the Perspective Rotation table) with rationale
- **Search query variants** for each sub-question (at least 3-5 per sub-question)
- **Completeness audit** (taxonomy cross-reference + domain discovery results)
4. Write TodoWrite to track progress
---
+35 -21
@@ -102,32 +102,46 @@ After investigating, present:
- If user picks A → apply fixes, then re-run (loop back to step 2)
- If user picks B → return failure to the autopilot
**Any test skipped**: this is also a **blocking gate**. Skipped tests mean something is wrong — either with the test, the environment, or the test design. **Never blindly remove a skipped test.** Always investigate the root cause first.
**Any skipped test**: classify as legitimate or illegitimate before deciding whether to block.
#### Investigation Protocol for Skipped Tests
#### Legitimate skips (accept and proceed)
For each skipped test:
The code path genuinely cannot execute on this runner. Acceptable reasons:
1. **Read the test code** — understand what the test is supposed to verify and why it skips.
2. **Determine the root cause** — why did the skip condition fire?
- Is the test environment misconfigured? (e.g., wrong ports, missing env vars, service not started correctly)
- Is the test ordering wrong? (e.g., a fixture in an earlier test mutates shared state)
- Is a dependency missing? (e.g., package not installed, fixture file absent)
- Is the skip condition outdated? (e.g., code was refactored but the skip guard still checks the old behavior)
- Is the test fundamentally untestable in the current setup? (e.g., requires Docker restart, different OS, special hardware)
3. **Try to fix the root cause first** — the goal is to make the test run, not to delete it:
- Fix the environment or configuration
- Reorder tests or isolate shared state
- Install the missing dependency
- Update the skip condition to match current behavior
4. **Only remove as last resort** — if the test truly cannot run in any realistic test environment (e.g., requires hardware not available, duplicates another test with identical assertions), then removal is justified. Document the reasoning.
- Hardware not physically present (GPU, Apple Neural Engine, sensor, serial device)
- Operating system mismatch (Darwin-only test on Linux CI, Windows-only test on macOS)
- Feature-flag-gated test whose feature is intentionally disabled in this environment
- External service the project deliberately does not control (e.g., a third-party API with no sandbox, and the project has a documented contract test instead)
#### Categorization
For legitimate skips: verify the skip condition is accurate (the test would run if the hardware/OS were present), verify it has a clear reason string, and proceed.
- **explicit skip (dead code)**: Has `@pytest.mark.skip` — investigate whether the reason in the decorator is still valid. Often these are temporary skips that became permanent by accident.
- **runtime skip (unreachable)**: `pytest.skip()` fires inside the test body — investigate why the condition always triggers. Often fixable by adjusting test order, environment, or the condition itself.
- **environment mismatch**: Test assumes a different environment — investigate whether the test environment setup can be fixed.
- **missing fixture/data**: Data or service not available — investigate whether it can be provided.
#### Illegitimate skips (BLOCKING — must resolve)
The skip is a workaround for something we can and should fix. NOT acceptable reasons:
- Required service not running (database, message broker, downstream API we control) → fix: bring the service up, add a docker-compose dependency, or add a mock
- Missing test fixture, seed data, or sample file → fix: provide the data, generate it, or ASK the user for it
- Missing environment variable or credential → fix: add to `.env.example`, document, ASK user for the value
- Flaky-test quarantine with no tracking ticket → fix: create the ticket (or replay via leftovers if tracker is down)
- Inherited skip from a prior refactor that was never cleaned up → fix: clean it up now
- Test ordering mutates shared state → fix: isolate the state
**Rule of thumb**: if the reason for skipping is "we didn't set something up," that's not a valid skip — set it up. If the reason is "this hardware/OS isn't here," that's valid.
#### Resolution steps for illegitimate skips
1. Classify the skip (read the skip reason and test body)
2. If the fix is **mechanical** — start a container, install a dep, add a mock, reorder fixtures — attempt it automatically and re-run
3. If the fix requires **user input** — credentials, sample data, a business decision — BLOCK and ASK
4. Never silently mark the skip as "accepted" — every illegitimate skip must either be fixed or escalated
5. Removal is a last resort and requires explicit user approval with documented reasoning
#### Categorization cheatsheet
- **explicit skip (e.g. `@pytest.mark.skip`)**: check whether the reason in the decorator is still valid
- **conditional skip (e.g. `@pytest.mark.skipif`)**: check whether the condition is accurate and whether we can change the environment to make it false
- **runtime skip (e.g. `pytest.skip()` in body)**: check why the condition fires — often an ordering or environment bug
- **missing fixture/data**: treated as illegitimate unless user confirms the data is unavailable
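The cheatsheet categories map directly onto pytest constructs. A minimal sketch (the test bodies and the `gpu_available` helper are hypothetical):

```python
import sys
import pytest

def gpu_available():
    """Hypothetical probe; replace with real device detection."""
    return False

# Explicit skip: re-check whether the decorator's reason is still valid.
@pytest.mark.skip(reason="superseded by test_new_api; remove after migration")
def test_old_api():
    assert False  # never executes while the mark is present

# Conditional skip: legitimate only when the condition truthfully describes
# the environment (OS mismatch), not a missing service we control.
@pytest.mark.skipif(sys.platform != "darwin", reason="exercises macOS keychain")
def test_keychain_roundtrip():
    pass

# Runtime skip: fires inside the body; often an ordering or environment bug.
def test_gpu_inference():
    if not gpu_available():
        pytest.skip("no CUDA device present")
```

A skip with no `reason` string at all is a red flag in every category: the reason is what makes the legitimate/illegitimate call auditable.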
After investigating, present findings:
+31 -20
@@ -27,8 +27,11 @@ Analyze input data completeness and produce detailed black-box test specificatio
- **Save immediately**: write artifacts to disk after each phase; never accumulate unsaved work
- **Ask, don't assume**: when requirements are ambiguous, ask the user before proceeding
- **Spec, don't code**: this workflow produces test specifications, never test implementation code
- **No test without data**: every test scenario MUST have concrete test data; tests without data are removed
- **No test without expected result**: every test scenario MUST pair input data with a quantifiable expected result; a test that cannot compare actual output against a known-correct answer is not verifiable and must be removed
- **Every test must have a pass/fail criterion**. Two acceptable shapes:
- **Input/output shape**: concrete input data paired with a quantifiable expected result (exact value, tolerance, threshold, pattern, reference file). Typical for functional blackbox tests, performance tests with load data, data-processing pipelines.
- **Behavioral shape**: a trigger condition + observable system behavior + quantifiable pass/fail criterion, with no input data required. Typical for startup/shutdown tests, retry/backoff policies, state transitions, logging/metrics emission, resilience scenarios. Example criteria: "startup logs `service ready` within 5s", "retry emits 3 attempts with exponential backoff (base 100ms ± 20ms)", "on SIGTERM, service drains in-flight requests within 30s grace period", "health endpoint returns 503 while migrations run".
- For behavioral tests the observable (log line, metric value, state transition, emitted event, elapsed time) must still be quantifiable — the test must programmatically decide pass/fail.
- A test that cannot produce a pass/fail verdict through either shape is not verifiable and must be removed.
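The two shapes can be sketched side by side. The system under test here (`compute_vat`, `retry`) is invented purely for illustration:

```python
import time

# Hypothetical system under test.
def compute_vat(net, rate):
    return round(net * rate, 2)

def retry(op, attempts=3, base=0.05):
    for i in range(attempts):
        try:
            return op()
        except ConnectionError:
            if i == attempts - 1:
                raise
            time.sleep(base * 2 ** i)   # exponential backoff

# Input/output shape: concrete input, quantifiable expected result with tolerance.
def test_vat_within_tolerance():
    assert abs(compute_vat(100.00, 0.20) - 20.00) <= 0.01

# Behavioral shape: trigger (persistent failure) + observable (attempt log)
# + quantifiable criterion (exactly 3 attempts, growing gaps between them).
def test_retry_emits_three_attempts():
    calls = []
    def failing():
        calls.append(time.monotonic())
        raise ConnectionError
    try:
        retry(failing)
    except ConnectionError:
        pass
    assert len(calls) == 3
    gaps = [b - a for a, b in zip(calls, calls[1:])]
    assert gaps[1] > gaps[0]            # backoff grows between attempts
```

Note the behavioral test needs no input data: its verdict comes from observables (attempt count, inter-attempt timing) that code can measure.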
## Context Resolution
@@ -177,7 +180,7 @@ At the start of execution, create a TodoWrite with all four phases. Update statu
|------------|--------------------------|---------------|----------------|
| [file/data] | Yes/No | Yes/No | [missing, vague, no tolerance, etc.] |
9. Threshold: at least 70% coverage of scenarios AND every covered scenario has a quantifiable expected result (see `.cursor/rules/cursor-meta.mdc` Quality Thresholds table)
9. Threshold: at least 75% coverage of scenarios AND every covered scenario has a quantifiable expected result (see `.cursor/rules/cursor-meta.mdc` Quality Thresholds table)
10. If coverage is low, search the internet for supplementary data, assess quality with user, and if user agrees, add to `input_data/` and update `input_data/expected_results/results_report.md`
11. If expected results are missing or not quantifiable, ask user to provide them before proceeding
@@ -232,18 +235,26 @@ Capture any new questions, findings, or insights that arise during test specific
### Phase 3: Test Data Validation Gate (HARD GATE)
**Role**: Professional Quality Assurance Engineer
**Goal**: Ensure every test scenario produced in Phase 2 has concrete, sufficient test data. Remove tests that lack data. Verify final coverage stays above 70%.
**Goal**: Ensure every test scenario produced in Phase 2 has concrete, sufficient test data. Remove tests that lack data. Verify final coverage stays above 75%.
**Constraints**: This phase is MANDATORY and cannot be skipped.
#### Step 1 — Build the test-data and expected-result requirements checklist
#### Step 1 — Build the requirements checklist
Scan `blackbox-tests.md`, `performance-tests.md`, `resilience-tests.md`, `security-tests.md`, and `resource-limit-tests.md`. For every test scenario, extract:
Scan `blackbox-tests.md`, `performance-tests.md`, `resilience-tests.md`, `security-tests.md`, and `resource-limit-tests.md`. For every test scenario, classify its shape (input/output or behavioral) and extract:
**Input/output tests:**
| # | Test Scenario ID | Test Name | Required Input Data | Required Expected Result | Result Quantifiable? | Comparison Method | Input Provided? | Expected Result Provided? |
|---|-----------------|-----------|---------------------|-------------------------|---------------------|-------------------|----------------|--------------------------|
| 1 | [ID] | [name] | [data description] | [what system should output] | [Yes/No] | [exact/tolerance/pattern/threshold] | [Yes/No] | [Yes/No] |
Present this table to the user.
**Behavioral tests:**
| # | Test Scenario ID | Test Name | Trigger Condition | Observable Behavior | Pass/Fail Criterion | Quantifiable? |
|---|-----------------|-----------|-------------------|--------------------|--------------------|---------------|
| 1 | [ID] | [name] | [e.g., service receives SIGTERM] | [e.g., drain logs emitted, port closed] | [e.g., drain completes ≤30s] | [Yes/No] |
Present both tables to the user.
#### Step 2 — Ask user to provide missing test data AND expected results
@@ -315,20 +326,20 @@ After all removals, recalculate coverage:
**Decision**:
- **Coverage ≥ 70%** → Phase 3 **PASSED**. Present final summary to user.
- **Coverage < 70%** → Phase 3 **FAILED**. Report:
> ❌ Test coverage dropped to **X%** (minimum 70% required). The removed test scenarios left gaps in the following acceptance criteria / restrictions:
- **Coverage ≥ 75%** → Phase 3 **PASSED**. Present final summary to user.
- **Coverage < 75%** → Phase 3 **FAILED**. Report:
> ❌ Test coverage dropped to **X%** (minimum 75% required). The removed test scenarios left gaps in the following acceptance criteria / restrictions:
>
> | Uncovered Item | Type (AC/Restriction) | Missing Test Data Needed |
> |---|---|---|
>
> **Action required**: Provide the missing test data for the items above, or add alternative test scenarios that cover these items with data you can supply.
**BLOCKING**: Loop back to Step 2 with the uncovered items. Do NOT finalize until coverage ≥ 70%.
**BLOCKING**: Loop back to Step 2 with the uncovered items. Do NOT finalize until coverage ≥ 75%.
#### Phase 3 Completion
When coverage ≥ 70% and all remaining tests have validated data AND quantifiable expected results:
When coverage ≥ 75% and all remaining tests have validated data AND quantifiable expected results:
1. Present the final coverage report
2. List all removed tests (if any) with reasons
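The recalculation behind this gate reduces to a simple ratio check. A sketch under assumed names, with the 75% threshold from the decision rule above:

```python
COVERAGE_THRESHOLD = 0.75  # minimum scenario coverage for Phase 3 to pass

def coverage_gate(required_items, covered_items):
    """Return (passed, percent, uncovered) for the Phase 3 decision."""
    uncovered = [item for item in required_items if item not in covered_items]
    percent = 100.0 * (len(required_items) - len(uncovered)) / len(required_items)
    return percent >= 100.0 * COVERAGE_THRESHOLD, round(percent, 1), uncovered
```

Covering 3 of 4 acceptance items is exactly 75% and passes; removing one more test drops to 50%, fails the gate, and loops back to Step 2 with the uncovered items.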
@@ -479,23 +490,23 @@ Create `scripts/run-performance-tests.sh` at the project root. The script must:
| Missing acceptance_criteria.md, restrictions.md, or input_data/ | **STOP** — specification cannot proceed |
| Missing input_data/expected_results/results_report.md | **STOP** — ask user to provide expected results mapping using the template |
| Ambiguous requirements | ASK user |
| Input data coverage below 70% (Phase 1) | Search internet for supplementary data, ASK user to validate |
| Input data coverage below 75% (Phase 1) | Search internet for supplementary data, ASK user to validate |
| Expected results missing or not quantifiable (Phase 1) | ASK user to provide quantifiable expected results before proceeding |
| Test scenario conflicts with restrictions | ASK user to clarify intent |
| System interfaces unclear (no architecture.md) | ASK user or derive from solution.md |
| Test data or expected result not provided for a test scenario (Phase 3) | WARN user and REMOVE the test |
| Final coverage below 70% after removals (Phase 3) | BLOCK — require user to supply data or accept reduced spec |
| Final coverage below 75% after removals (Phase 3) | BLOCK — require user to supply data or accept reduced spec |
## Common Mistakes
- **Referencing internals**: tests must be black-box — no internal module names, no direct DB queries against the system under test
- **Vague expected outcomes**: "works correctly" is not a test outcome; use specific measurable values
- **Missing expected results**: input data without a paired expected result is useless — the test cannot determine pass/fail without knowing what "correct" looks like
- **Non-quantifiable expected results**: "should return good results" is not verifiable; expected results must have exact values, tolerances, thresholds, or pattern matches that code can evaluate
- **Missing pass/fail criterion**: input/output tests without an expected result, OR behavioral tests without a measurable observable — both are unverifiable and must be removed
- **Non-quantifiable criteria**: "should return good results", "works correctly", "behaves properly" — not verifiable. Use exact values, tolerances, thresholds, pattern matches, or timing bounds that code can evaluate.
- **Forcing the wrong shape**: do not invent fake input data for a behavioral test (e.g., "input: SIGTERM signal") just to fit the input/output shape. Classify the test correctly and use the matching checklist.
- **Missing negative scenarios**: every positive scenario category should have corresponding negative/edge-case tests
- **Untraceable tests**: every test should trace to at least one AC or restriction
- **Writing test code**: this skill produces specifications, never implementation code
- **Tests without data**: every test scenario MUST have concrete test data AND a quantifiable expected result; a test spec without either is not executable and must be removed
## Trigger Conditions
@@ -516,7 +527,7 @@ When the user wants to:
│ → verify AC, restrictions, input_data (incl. expected_results.md) │
│ │
│ Phase 1: Input Data & Expected Results Completeness Analysis │
│ → assess input_data/ coverage vs AC scenarios (≥70%) │
│ → assess input_data/ coverage vs AC scenarios (≥75%) │
│ → verify every input has a quantifiable expected result │
│ → present input→expected-result pairing assessment │
│ [BLOCKING: user confirms input data + expected results coverage] │
@@ -538,8 +549,8 @@ When the user wants to:
│ → validate input data (quality + quantity) │
│ → validate expected results (quantifiable + comparison method) │
│ → remove tests without data or expected result, warn user │
│ → final coverage check (≥70% or FAIL + loop back) │
│ [BLOCKING: coverage ≥ 70% required to pass] │
│ → final coverage check (≥75% or FAIL + loop back) │
│ [BLOCKING: coverage ≥ 75% required to pass] │
│ │
│ Phase 4: Test Runner Script Generation │
│ → detect test runner + docker-compose + load tool │