mirror of
https://github.com/azaion/ai-training.git
synced 2026-04-22 09:26:36 +00:00
Update test results directory structure and enhance Docker configurations
- Modified `.gitignore` to reflect the new path for test results. - Updated `docker-compose.test.yml` to mount the correct test results directory. - Adjusted `Dockerfile.test` to set the `PYTHONPATH` and ensure test results are saved in the updated location. - Added `boto3` and `netron` to `requirements-test.txt` to support new functionalities. - Updated `pytest.ini` to include the new `pythonpath` for test discovery. These changes streamline the testing process and ensure compatibility with the updated directory structure.
This commit is contained in:
@@ -5,18 +5,22 @@ alwaysApply: true
|
|||||||
# Coding preferences
|
# Coding preferences
|
||||||
- Always prefer simple solution
|
- Always prefer simple solution
|
||||||
- Generate concise code
|
- Generate concise code
|
||||||
- Do not put comments in the code
|
- Do not put comments in the code, except in tests: every test must use the Arrange / Act / Assert pattern with `# Arrange`, `# Act`, `# Assert` section comments. Omit any section that is not needed (e.g. if there is no setup, skip `# Arrange`; if act and assert are the same line, keep only `# Assert`)
|
||||||
- Do not put logs unless it is an exception, or was asked specifically
|
- Do not put logs unless it is an exception, or was asked specifically
|
||||||
- Do not put code annotations unless it was asked specifically
|
- Do not put code annotations unless it was asked specifically
|
||||||
- Write code that takes into account the different environments: development, production
|
- Write code that takes into account the different environments: development, production
|
||||||
- You are careful to make changes that are requested or you are confident the changes are well understood and related to the change being requested
|
- You are careful to make changes that are requested or you are confident the changes are well understood and related to the change being requested
|
||||||
- Mocking data is needed only for tests, never mock data for dev or prod env
|
- Mocking data is needed only for tests, never mock data for dev or prod env
|
||||||
- When you add new libraries or dependencies make sure you are using the same version of it as other parts of the code
|
- When you add new libraries or dependencies make sure you are using the same version of it as other parts of the code
|
||||||
|
- When a test fails due to a missing dependency, install it — do not fake or stub the module system. For normal packages, add them to the project's dependency file (requirements-test.txt, package.json devDependencies, test csproj, etc.) and install. Only consider stubbing if the dependency is heavy (e.g. hardware-specific SDK, large native toolchain) — and even then, ask the user first before choosing to stub.
|
||||||
|
|
||||||
- Focus on the areas of code relevant to the task
|
- Focus on the areas of code relevant to the task
|
||||||
- Do not touch code that is unrelated to the task
|
- Do not touch code that is unrelated to the task
|
||||||
- Always think about what other methods and areas of code might be affected by the code changes
|
- Always think about what other methods and areas of code might be affected by the code changes
|
||||||
- When you think you are done with changes, run tests and make sure they are not broken
|
- When you think you are done with changes, run the full test suite. Every failure — including pre-existing ones, collection errors, and import errors — is a **blocking gate**. Never silently ignore, skip, or proceed past a failing test. On any failure, stop and ask the user to choose one of:
|
||||||
|
- **Investigate and fix** the failing test or source code
|
||||||
|
- **Remove the test** if it is obsolete or no longer relevant
|
||||||
|
- **Leave as-is for now** (acknowledged tech debt — not recommended)
|
||||||
- Do not rename any databases or tables or table columns without confirmation. Avoid such renaming if possible.
|
- Do not rename any databases or tables or table columns without confirmation. Avoid such renaming if possible.
|
||||||
|
|
||||||
- Make sure we don't commit binaries, create and keep .gitignore up to date and delete binaries after you are done with the task
|
- Make sure we don't commit binaries, create and keep .gitignore up to date and delete binaries after you are done with the task
|
||||||
|
|||||||
@@ -17,5 +17,11 @@ globs: [".cursor/**"]
|
|||||||
## Agent Files (.cursor/agents/)
|
## Agent Files (.cursor/agents/)
|
||||||
- Must have `name` and `description` in frontmatter
|
- Must have `name` and `description` in frontmatter
|
||||||
|
|
||||||
|
## User Interaction
|
||||||
|
- Use the AskQuestion tool for structured choices (A/B/C/D) when available — it provides an interactive UI. Fall back to plain-text questions if the tool is unavailable.
|
||||||
|
|
||||||
|
## Execution Safety
|
||||||
|
- Never run test suites, builds, Docker commands, or other long-running/resource-heavy/security-risky operations without asking the user first — unless it is explicitly stated in a skill or agent, or the user already asked to do so.
|
||||||
|
|
||||||
## Security
|
## Security
|
||||||
- All `.cursor/` files must be scanned for hidden Unicode before committing (see cursor-security.mdc)
|
- All `.cursor/` files must be scanned for hidden Unicode before committing (see cursor-security.mdc)
|
||||||
|
|||||||
@@ -32,10 +32,10 @@ Auto-chaining execution engine that drives the full BUILD → SHIP workflow. Det
|
|||||||
|
|
||||||
- **Auto-chain**: when a skill completes, immediately start the next one — no pause between skills
|
- **Auto-chain**: when a skill completes, immediately start the next one — no pause between skills
|
||||||
- **Only pause at decision points**: BLOCKING gates inside sub-skills are the natural pause points; do not add artificial stops between steps
|
- **Only pause at decision points**: BLOCKING gates inside sub-skills are the natural pause points; do not add artificial stops between steps
|
||||||
- **State from disk**: all progress is persisted to `_docs/_autopilot_state.md` and cross-checked against `_docs/` folder structure
|
- **State from disk**: current step is persisted to `_docs/_autopilot_state.md` and cross-checked against `_docs/` folder structure
|
||||||
- **Rich re-entry**: on every invocation, read the state file for full context before continuing
|
- **Re-entry**: on every invocation, read the state file and cross-check against `_docs/` folders before continuing
|
||||||
- **Delegate, don't duplicate**: read and execute each sub-skill's SKILL.md; never inline their logic here
|
- **Delegate, don't duplicate**: read and execute each sub-skill's SKILL.md; never inline their logic here
|
||||||
- **Sound on pause**: follow `.cursor/rules/human-attention-sound.mdc` — play a notification sound before every pause that requires human input
|
- **Sound on pause**: follow `.cursor/rules/human-attention-sound.mdc` — play a notification sound before every pause that requires human input (AskQuestion tool preferred for structured choices; fall back to plain text if unavailable)
|
||||||
- **Minimize interruptions**: only ask the user when the decision genuinely cannot be resolved automatically
|
- **Minimize interruptions**: only ask the user when the decision genuinely cannot be resolved automatically
|
||||||
- **Single project per workspace**: all `_docs/` paths are relative to workspace root; for monorepos, each service needs its own Cursor workspace
|
- **Single project per workspace**: all `_docs/` paths are relative to workspace root; for monorepos, each service needs its own Cursor workspace
|
||||||
|
|
||||||
@@ -44,7 +44,7 @@ Auto-chaining execution engine that drives the full BUILD → SHIP workflow. Det
|
|||||||
Determine which flow to use:
|
Determine which flow to use:
|
||||||
|
|
||||||
1. If workspace has source code files **and** `_docs/` does not exist → **existing-code flow** (Pre-Step detection)
|
1. If workspace has source code files **and** `_docs/` does not exist → **existing-code flow** (Pre-Step detection)
|
||||||
2. If `_docs/_autopilot_state.md` exists and records Document in `Completed Steps` → **existing-code flow**
|
2. If `_docs/_autopilot_state.md` exists and `step >= 2` (i.e. Document already ran) → **existing-code flow**
|
||||||
3. If `_docs/_autopilot_state.md` exists and `step: done` AND workspace contains source code → **existing-code flow** (completed project re-entry — loops to New Task)
|
3. If `_docs/_autopilot_state.md` exists and `step: done` AND workspace contains source code → **existing-code flow** (completed project re-entry — loops to New Task)
|
||||||
4. Otherwise → **greenfield flow**
|
4. Otherwise → **greenfield flow**
|
||||||
|
|
||||||
@@ -65,7 +65,7 @@ Every invocation follows this sequence:
|
|||||||
a. Delegate to current skill (see Skill Delegation below)
|
a. Delegate to current skill (see Skill Delegation below)
|
||||||
b. If skill returns FAILED → apply Skill Failure Retry Protocol (see protocols.md):
|
b. If skill returns FAILED → apply Skill Failure Retry Protocol (see protocols.md):
|
||||||
- Auto-retry the same skill (failure may be caused by missing user input or environment issue)
|
- Auto-retry the same skill (failure may be caused by missing user input or environment issue)
|
||||||
- If 3 consecutive auto-retries fail → record in state file Blockers, warn user, stop auto-retry
|
- If 3 consecutive auto-retries fail → set status: failed, warn user, stop auto-retry
|
||||||
c. When skill completes successfully → reset retry counter, update state file (rules in state.md)
|
c. When skill completes successfully → reset retry counter, update state file (rules in state.md)
|
||||||
d. Re-detect next step from the active flow's detection rules
|
d. Re-detect next step from the active flow's detection rules
|
||||||
e. If next skill is ready → auto-chain (go to 7a with next skill)
|
e. If next skill is ready → auto-chain (go to 7a with next skill)
|
||||||
@@ -82,10 +82,26 @@ For each step, the delegation pattern is:
|
|||||||
3. Read the skill file: `.cursor/skills/[name]/SKILL.md`
|
3. Read the skill file: `.cursor/skills/[name]/SKILL.md`
|
||||||
4. Execute the skill's workflow exactly as written, including all BLOCKING gates, self-verification checklists, save actions, and escalation rules. Update `sub_step` in state each time the sub-skill advances.
|
4. Execute the skill's workflow exactly as written, including all BLOCKING gates, self-verification checklists, save actions, and escalation rules. Update `sub_step` in state each time the sub-skill advances.
|
||||||
5. If the skill **fails**: follow the Skill Failure Retry Protocol in `protocols.md` — increment `retry_count`, auto-retry up to 3 times, then escalate.
|
5. If the skill **fails**: follow the Skill Failure Retry Protocol in `protocols.md` — increment `retry_count`, auto-retry up to 3 times, then escalate.
|
||||||
6. When complete (success): reset `retry_count: 0`, mark step `completed`, record date + key outcome, add key decisions to state file, return to auto-chain rules (from active flow file)
|
6. When complete (success): reset `retry_count: 0`, update state file to the next step with `status: not_started`, return to auto-chain rules (from active flow file)
|
||||||
|
|
||||||
Do NOT modify, skip, or abbreviate any part of the sub-skill's workflow. The autopilot is a sequencer, not an optimizer.
|
Do NOT modify, skip, or abbreviate any part of the sub-skill's workflow. The autopilot is a sequencer, not an optimizer.
|
||||||
|
|
||||||
|
## State File Template
|
||||||
|
|
||||||
|
The state file (`_docs/_autopilot_state.md`) is a minimal pointer — only the current step. Full format rules are in `state.md`.
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
# Autopilot State
|
||||||
|
|
||||||
|
## Current Step
|
||||||
|
flow: [greenfield | existing-code]
|
||||||
|
step: [number or "done"]
|
||||||
|
name: [step name]
|
||||||
|
status: [not_started / in_progress / completed / skipped / failed]
|
||||||
|
sub_step: [0 or N — sub-skill phase name]
|
||||||
|
retry_count: [0-3]
|
||||||
|
```
|
||||||
|
|
||||||
## Trigger Conditions
|
## Trigger Conditions
|
||||||
|
|
||||||
This skill activates when the user wants to:
|
This skill activates when the user wants to:
|
||||||
|
|||||||
@@ -1,6 +1,6 @@
|
|||||||
# Existing Code Workflow
|
# Existing Code Workflow
|
||||||
|
|
||||||
Workflow for projects with an existing codebase. Starts with documentation, produces test specs, decomposes and implements tests, verifies them, refactors with that safety net, then adds new functionality and deploys.
|
Workflow for projects with an existing codebase. Starts with documentation, produces test specs, checks code testability (refactoring if needed), decomposes and implements tests, verifies them, refactors with that safety net, then adds new functionality and deploys.
|
||||||
|
|
||||||
## Step Reference Table
|
## Step Reference Table
|
||||||
|
|
||||||
@@ -8,18 +8,19 @@ Workflow for projects with an existing codebase. Starts with documentation, prod
|
|||||||
|------|------|-----------|-------------------|
|
|------|------|-----------|-------------------|
|
||||||
| 1 | Document | document/SKILL.md | Steps 1–8 |
|
| 1 | Document | document/SKILL.md | Steps 1–8 |
|
||||||
| 2 | Test Spec | test-spec/SKILL.md | Phase 1a–1b |
|
| 2 | Test Spec | test-spec/SKILL.md | Phase 1a–1b |
|
||||||
| 3 | Decompose Tests | decompose/SKILL.md (tests-only) | Step 1t + Step 3 + Step 4 |
|
| 3 | Code Testability Revision | refactor/SKILL.md (guided mode) | Phases 0–7 (conditional) |
|
||||||
| 4 | Implement Tests | implement/SKILL.md | (batch-driven, no fixed sub-steps) |
|
| 4 | Decompose Tests | decompose/SKILL.md (tests-only) | Step 1t + Step 3 + Step 4 |
|
||||||
| 5 | Run Tests | test-run/SKILL.md | Steps 1–4 |
|
| 5 | Implement Tests | implement/SKILL.md | (batch-driven, no fixed sub-steps) |
|
||||||
| 6 | Refactor | refactor/SKILL.md | Phases 0–6 (7-phase method) (optional) |
|
| 6 | Run Tests | test-run/SKILL.md | Steps 1–4 |
|
||||||
| 7 | New Task | new-task/SKILL.md | Steps 1–8 (loop) |
|
| 7 | Refactor | refactor/SKILL.md | Phases 0–7 (optional) |
|
||||||
| 8 | Implement | implement/SKILL.md | (batch-driven, no fixed sub-steps) |
|
| 8 | New Task | new-task/SKILL.md | Steps 1–8 (loop) |
|
||||||
| 9 | Run Tests | test-run/SKILL.md | Steps 1–4 |
|
| 9 | Implement | implement/SKILL.md | (batch-driven, no fixed sub-steps) |
|
||||||
| 10 | Security Audit | security/SKILL.md | Phase 1–5 (optional) |
|
| 10 | Run Tests | test-run/SKILL.md | Steps 1–4 |
|
||||||
| 11 | Performance Test | (autopilot-managed) | Load/stress tests (optional) |
|
| 11 | Security Audit | security/SKILL.md | Phase 1–5 (optional) |
|
||||||
| 12 | Deploy | deploy/SKILL.md | Step 1–7 |
|
| 12 | Performance Test | (autopilot-managed) | Load/stress tests (optional) |
|
||||||
|
| 13 | Deploy | deploy/SKILL.md | Step 1–7 |
|
||||||
|
|
||||||
After Step 12, the existing-code workflow is complete.
|
After Step 13, the existing-code workflow is complete.
|
||||||
|
|
||||||
## Detection Rules
|
## Detection Rules
|
||||||
|
|
||||||
@@ -35,7 +36,7 @@ Action: An existing codebase without documentation was detected. Read and execut
|
|||||||
---
|
---
|
||||||
|
|
||||||
**Step 2 — Test Spec**
|
**Step 2 — Test Spec**
|
||||||
Condition: `_docs/02_document/FINAL_report.md` exists AND workspace contains source code files (e.g., `*.py`, `*.cs`, `*.rs`, `*.ts`) AND `_docs/02_document/tests/traceability-matrix.md` does not exist AND the autopilot state shows Document was run (check `Completed Steps` for "Document" entry)
|
Condition: `_docs/02_document/FINAL_report.md` exists AND workspace contains source code files (e.g., `*.py`, `*.cs`, `*.rs`, `*.ts`) AND `_docs/02_document/tests/traceability-matrix.md` does not exist AND the autopilot state shows `step >= 2` (Document already ran)
|
||||||
|
|
||||||
Action: Read and execute `.cursor/skills/test-spec/SKILL.md`
|
Action: Read and execute `.cursor/skills/test-spec/SKILL.md`
|
||||||
|
|
||||||
@@ -43,20 +44,51 @@ This step applies when the codebase was documented via the `/document` skill. Te
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Step 3 — Decompose Tests**
|
**Step 3 — Code Testability Revision**
|
||||||
Condition: `_docs/02_document/tests/traceability-matrix.md` exists AND workspace contains source code files AND the autopilot state shows Document was run AND (`_docs/02_tasks/` does not exist or has no task files)
|
Condition: `_docs/02_document/tests/traceability-matrix.md` exists AND the autopilot state shows Test Spec (Step 2) is completed AND the autopilot state does NOT show Code Testability Revision (Step 3) as completed or skipped
|
||||||
|
|
||||||
|
Action: Analyze the codebase against the test specs to determine whether the code can be tested as-is.
|
||||||
|
|
||||||
|
1. Read `_docs/02_document/tests/traceability-matrix.md` and all test scenario files in `_docs/02_document/tests/`
|
||||||
|
2. For each test scenario, check whether the code under test can be exercised in isolation. Look for:
|
||||||
|
- Hardcoded file paths or directory references
|
||||||
|
- Hardcoded configuration values (URLs, credentials, magic numbers)
|
||||||
|
- Global mutable state that cannot be overridden
|
||||||
|
- Tight coupling to external services without abstraction
|
||||||
|
- Missing dependency injection or non-configurable parameters
|
||||||
|
- Direct file system operations without path configurability
|
||||||
|
- Inline construction of heavy dependencies (models, clients)
|
||||||
|
3. If ALL scenarios are testable as-is:
|
||||||
|
- Mark Step 3 as `completed` with outcome "Code is testable — no changes needed"
|
||||||
|
- Auto-chain to Step 4 (Decompose Tests)
|
||||||
|
4. If testability issues are found:
|
||||||
|
- Create `_docs/04_refactoring/01-testability-refactoring/`
|
||||||
|
- Write `list-of-changes.md` in that directory using the refactor skill template (`.cursor/skills/refactor/templates/list-of-changes.md`), with:
|
||||||
|
- **Mode**: `guided`
|
||||||
|
- **Source**: `autopilot-testability-analysis`
|
||||||
|
- One change entry per testability issue found (change ID, file paths, problem, proposed change, risk, dependencies)
|
||||||
|
- Invoke the refactor skill in **guided mode**: read and execute `.cursor/skills/refactor/SKILL.md` with the `list-of-changes.md` as input
|
||||||
|
- The refactor skill will create RUN_DIR (`01-testability-refactoring`), create tasks in `_docs/02_tasks/`, delegate to implement skill, and verify results
|
||||||
|
- Phase 3 (Safety Net) is automatically skipped by the refactor skill for testability runs
|
||||||
|
- After refactoring completes, mark Step 3 as `completed`
|
||||||
|
- Auto-chain to Step 4 (Decompose Tests)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Step 4 — Decompose Tests**
|
||||||
|
Condition: `_docs/02_document/tests/traceability-matrix.md` exists AND workspace contains source code files AND the autopilot state shows Step 3 (Code Testability Revision) is completed or skipped AND (`_docs/02_tasks/` does not exist or has no test task files)
|
||||||
|
|
||||||
Action: Read and execute `.cursor/skills/decompose/SKILL.md` in **tests-only mode** (pass `_docs/02_document/tests/` as input). The decompose skill will:
|
Action: Read and execute `.cursor/skills/decompose/SKILL.md` in **tests-only mode** (pass `_docs/02_document/tests/` as input). The decompose skill will:
|
||||||
1. Run Step 1t (test infrastructure bootstrap)
|
1. Run Step 1t (test infrastructure bootstrap)
|
||||||
2. Run Step 3 (blackbox test task decomposition)
|
2. Run Step 3 (blackbox test task decomposition)
|
||||||
3. Run Step 4 (cross-verification against test coverage)
|
3. Run Step 4 (cross-verification against test coverage)
|
||||||
|
|
||||||
If `_docs/02_tasks/` has some task files already, the decompose skill's resumability handles it.
|
If `_docs/02_tasks/` has some task files already (e.g., refactoring tasks from Step 3), the decompose skill's resumability handles it — it appends test tasks alongside existing refactoring tasks.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Step 4 — Implement Tests**
|
**Step 5 — Implement Tests**
|
||||||
Condition: `_docs/02_tasks/` contains task files AND `_dependencies_table.md` exists AND the autopilot state shows Step 3 (Decompose Tests) is completed AND `_docs/03_implementation/FINAL_implementation_report.md` does not exist
|
Condition: `_docs/02_tasks/` contains task files AND `_dependencies_table.md` exists AND the autopilot state shows Step 4 (Decompose Tests) is completed AND `_docs/03_implementation/FINAL_implementation_report.md` does not exist
|
||||||
|
|
||||||
Action: Read and execute `.cursor/skills/implement/SKILL.md`
|
Action: Read and execute `.cursor/skills/implement/SKILL.md`
|
||||||
|
|
||||||
@@ -66,8 +98,8 @@ If `_docs/03_implementation/` has batch reports, the implement skill detects com
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Step 5 — Run Tests**
|
**Step 6 — Run Tests**
|
||||||
Condition: `_docs/03_implementation/FINAL_implementation_report.md` exists AND the autopilot state shows Step 4 (Implement Tests) is completed AND the autopilot state does NOT show Step 5 (Run Tests) as completed
|
Condition: `_docs/03_implementation/FINAL_implementation_report.md` exists AND the autopilot state shows Step 5 (Implement Tests) is completed AND the autopilot state does NOT show Step 6 (Run Tests) as completed
|
||||||
|
|
||||||
Action: Read and execute `.cursor/skills/test-run/SKILL.md`
|
Action: Read and execute `.cursor/skills/test-run/SKILL.md`
|
||||||
|
|
||||||
@@ -75,8 +107,8 @@ Verifies the implemented test suite passes before proceeding to refactoring. The
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Step 6 — Refactor (optional)**
|
**Step 7 — Refactor (optional)**
|
||||||
Condition: the autopilot state shows Step 5 (Run Tests) is completed AND the autopilot state does NOT show Step 6 (Refactor) as completed or skipped AND `_docs/04_refactoring/FINAL_report.md` does not exist
|
Condition: the autopilot state shows Step 6 (Run Tests) is completed AND the autopilot state does NOT show Step 7 (Refactor) as completed or skipped AND no `_docs/04_refactoring/` run folder contains a `FINAL_report.md` for a non-testability run
|
||||||
|
|
||||||
Action: Present using Choose format:
|
Action: Present using Choose format:
|
||||||
|
|
||||||
@@ -93,13 +125,13 @@ Action: Present using Choose format:
|
|||||||
══════════════════════════════════════
|
══════════════════════════════════════
|
||||||
```
|
```
|
||||||
|
|
||||||
- If user picks A → Read and execute `.cursor/skills/refactor/SKILL.md`. The refactor skill runs the full method using the implemented tests as a safety net. If `_docs/04_refactoring/` has phase reports, the refactor skill detects completed phases and continues. After completion, auto-chain to Step 7 (New Task).
|
- If user picks A → Read and execute `.cursor/skills/refactor/SKILL.md` in automatic mode. The refactor skill creates a new run folder in `_docs/04_refactoring/` (e.g., `02-coupling-refactoring`) and runs the full method using the implemented tests as a safety net. After completion, auto-chain to Step 8 (New Task).
|
||||||
- If user picks B → Mark Step 6 as `skipped` in the state file, auto-chain to Step 7 (New Task).
|
- If user picks B → Mark Step 7 as `skipped` in the state file, auto-chain to Step 8 (New Task).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Step 7 — New Task**
|
**Step 8 — New Task**
|
||||||
Condition: the autopilot state shows Step 6 (Refactor) is completed or skipped AND the autopilot state does NOT show Step 7 (New Task) as completed
|
Condition: the autopilot state shows Step 7 (Refactor) is completed or skipped AND the autopilot state does NOT show Step 8 (New Task) as completed
|
||||||
|
|
||||||
Action: Read and execute `.cursor/skills/new-task/SKILL.md`
|
Action: Read and execute `.cursor/skills/new-task/SKILL.md`
|
||||||
|
|
||||||
@@ -107,26 +139,26 @@ The new-task skill interactively guides the user through defining new functional
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Step 8 — Implement**
|
**Step 9 — Implement**
|
||||||
Condition: the autopilot state shows Step 7 (New Task) is completed AND `_docs/03_implementation/` does not contain a FINAL report covering the new tasks (check state for distinction between test implementation and feature implementation)
|
Condition: the autopilot state shows Step 8 (New Task) is completed AND `_docs/03_implementation/` does not contain a FINAL report covering the new tasks (check state for distinction between test implementation and feature implementation)
|
||||||
|
|
||||||
Action: Read and execute `.cursor/skills/implement/SKILL.md`
|
Action: Read and execute `.cursor/skills/implement/SKILL.md`
|
||||||
|
|
||||||
The implement skill reads the new tasks from `_docs/02_tasks/` and implements them. Tasks already implemented in Step 4 are skipped (the implement skill tracks completed tasks in batch reports).
|
The implement skill reads the new tasks from `_docs/02_tasks/` and implements them. Tasks already implemented in Step 5 are skipped (the implement skill tracks completed tasks in batch reports).
|
||||||
|
|
||||||
If `_docs/03_implementation/` has batch reports from this phase, the implement skill detects completed tasks and continues.
|
If `_docs/03_implementation/` has batch reports from this phase, the implement skill detects completed tasks and continues.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Step 9 — Run Tests**
|
**Step 10 — Run Tests**
|
||||||
Condition: the autopilot state shows Step 8 (Implement) is completed AND the autopilot state does NOT show Step 9 (Run Tests) as completed
|
Condition: the autopilot state shows Step 9 (Implement) is completed AND the autopilot state does NOT show Step 10 (Run Tests) as completed
|
||||||
|
|
||||||
Action: Read and execute `.cursor/skills/test-run/SKILL.md`
|
Action: Read and execute `.cursor/skills/test-run/SKILL.md`
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Step 10 — Security Audit (optional)**
|
**Step 11 — Security Audit (optional)**
|
||||||
Condition: the autopilot state shows Step 9 (Run Tests) is completed AND the autopilot state does NOT show Step 10 (Security Audit) as completed or skipped AND (`_docs/04_deploy/` does not exist or is incomplete)
|
Condition: the autopilot state shows Step 10 (Run Tests) is completed AND the autopilot state does NOT show Step 11 (Security Audit) as completed or skipped AND (`_docs/04_deploy/` does not exist or is incomplete)
|
||||||
|
|
||||||
Action: Present using Choose format:
|
Action: Present using Choose format:
|
||||||
|
|
||||||
@@ -141,13 +173,13 @@ Action: Present using Choose format:
|
|||||||
══════════════════════════════════════
|
══════════════════════════════════════
|
||||||
```
|
```
|
||||||
|
|
||||||
- If user picks A → Read and execute `.cursor/skills/security/SKILL.md`. After completion, auto-chain to Step 11 (Performance Test).
|
- If user picks A → Read and execute `.cursor/skills/security/SKILL.md`. After completion, auto-chain to Step 12 (Performance Test).
|
||||||
- If user picks B → Mark Step 10 as `skipped` in the state file, auto-chain to Step 11 (Performance Test).
|
- If user picks B → Mark Step 11 as `skipped` in the state file, auto-chain to Step 12 (Performance Test).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Step 11 — Performance Test (optional)**
|
**Step 12 — Performance Test (optional)**
|
||||||
Condition: the autopilot state shows Step 10 (Security Audit) is completed or skipped AND the autopilot state does NOT show Step 11 (Performance Test) as completed or skipped AND (`_docs/04_deploy/` does not exist or is incomplete)
|
Condition: the autopilot state shows Step 11 (Security Audit) is completed or skipped AND the autopilot state does NOT show Step 12 (Performance Test) as completed or skipped AND (`_docs/04_deploy/` does not exist or is incomplete)
|
||||||
|
|
||||||
Action: Present using Choose format:
|
Action: Present using Choose format:
|
||||||
|
|
||||||
@@ -168,13 +200,13 @@ Action: Present using Choose format:
|
|||||||
2. Otherwise, check if `_docs/02_document/tests/performance-tests.md` exists for test scenarios, detect appropriate load testing tool (k6, locust, artillery, wrk, or built-in benchmarks), and execute performance test scenarios against the running system
|
2. Otherwise, check if `_docs/02_document/tests/performance-tests.md` exists for test scenarios, detect appropriate load testing tool (k6, locust, artillery, wrk, or built-in benchmarks), and execute performance test scenarios against the running system
|
||||||
3. Present results vs acceptance criteria thresholds
|
3. Present results vs acceptance criteria thresholds
|
||||||
4. If thresholds fail → present Choose format: A) Fix and re-run, B) Proceed anyway, C) Abort
|
4. If thresholds fail → present Choose format: A) Fix and re-run, B) Proceed anyway, C) Abort
|
||||||
5. After completion, auto-chain to Step 12 (Deploy)
|
5. After completion, auto-chain to Step 13 (Deploy)
|
||||||
- If user picks B → Mark Step 11 as `skipped` in the state file, auto-chain to Step 12 (Deploy).
|
- If user picks B → Mark Step 12 as `skipped` in the state file, auto-chain to Step 13 (Deploy).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Step 12 — Deploy**
|
**Step 13 — Deploy**
|
||||||
Condition: the autopilot state shows Step 9 (Run Tests) is completed AND (Step 10 is completed or skipped) AND (Step 11 is completed or skipped) AND (`_docs/04_deploy/` does not exist or is incomplete)
|
Condition: the autopilot state shows Step 10 (Run Tests) is completed AND (Step 11 is completed or skipped) AND (Step 12 is completed or skipped) AND (`_docs/04_deploy/` does not exist or is incomplete)
|
||||||
|
|
||||||
Action: Read and execute `.cursor/skills/deploy/SKILL.md`
|
Action: Read and execute `.cursor/skills/deploy/SKILL.md`
|
||||||
|
|
||||||
@@ -183,7 +215,7 @@ After deployment completes, the existing-code workflow is done.
|
|||||||
---
|
---
|
||||||
|
|
||||||
**Re-Entry After Completion**
|
**Re-Entry After Completion**
|
||||||
Condition: the autopilot state shows `step: done` OR all steps through 12 (Deploy) are completed
|
Condition: the autopilot state shows `step: done` OR all steps through 13 (Deploy) are completed
|
||||||
|
|
||||||
Action: The project completed a full cycle. Present status and loop back to New Task:
|
Action: The project completed a full cycle. Present status and loop back to New Task:
|
||||||
|
|
||||||
@@ -199,7 +231,7 @@ Action: The project completed a full cycle. Present status and loop back to New
|
|||||||
══════════════════════════════════════
|
══════════════════════════════════════
|
||||||
```
|
```
|
||||||
|
|
||||||
- If user picks A → set `step: 7`, `status: not_started` in the state file, then auto-chain to Step 7 (New Task). Previous cycle history stays in Completed Steps.
|
- If user picks A → set `step: 8`, `status: not_started` in the state file, then auto-chain to Step 8 (New Task).
|
||||||
- If user picks B → report final project status and exit.
|
- If user picks B → report final project status and exit.
|
||||||
|
|
||||||
## Auto-Chain Rules
|
## Auto-Chain Rules
|
||||||
@@ -207,17 +239,18 @@ Action: The project completed a full cycle. Present status and loop back to New
|
|||||||
| Completed Step | Next Action |
|
| Completed Step | Next Action |
|
||||||
|---------------|-------------|
|
|---------------|-------------|
|
||||||
| Document (1) | Auto-chain → Test Spec (2) |
|
| Document (1) | Auto-chain → Test Spec (2) |
|
||||||
| Test Spec (2) | Auto-chain → Decompose Tests (3) |
|
| Test Spec (2) | Auto-chain → Code Testability Revision (3) |
|
||||||
| Decompose Tests (3) | **Session boundary** — suggest new conversation before Implement Tests |
|
| Code Testability Revision (3) | Auto-chain → Decompose Tests (4) |
|
||||||
| Implement Tests (4) | Auto-chain → Run Tests (5) |
|
| Decompose Tests (4) | **Session boundary** — suggest new conversation before Implement Tests |
|
||||||
| Run Tests (5, all pass) | Auto-chain → Refactor choice (6) |
|
| Implement Tests (5) | Auto-chain → Run Tests (6) |
|
||||||
| Refactor (6, done or skipped) | Auto-chain → New Task (7) |
|
| Run Tests (6, all pass) | Auto-chain → Refactor choice (7) |
|
||||||
| New Task (7) | **Session boundary** — suggest new conversation before Implement |
|
| Refactor (7, done or skipped) | Auto-chain → New Task (8) |
|
||||||
| Implement (8) | Auto-chain → Run Tests (9) |
|
| New Task (8) | **Session boundary** — suggest new conversation before Implement |
|
||||||
| Run Tests (9, all pass) | Auto-chain → Security Audit choice (10) |
|
| Implement (9) | Auto-chain → Run Tests (10) |
|
||||||
| Security Audit (10, done or skipped) | Auto-chain → Performance Test choice (11) |
|
| Run Tests (10, all pass) | Auto-chain → Security Audit choice (11) |
|
||||||
| Performance Test (11, done or skipped) | Auto-chain → Deploy (12) |
|
| Security Audit (11, done or skipped) | Auto-chain → Performance Test choice (12) |
|
||||||
| Deploy (12) | **Workflow complete** — existing-code flow done |
|
| Performance Test (12, done or skipped) | Auto-chain → Deploy (13) |
|
||||||
|
| Deploy (13) | **Workflow complete** — existing-code flow done |
|
||||||
|
|
||||||
## Status Summary Template
|
## Status Summary Template
|
||||||
|
|
||||||
@@ -225,18 +258,19 @@ Action: The project completed a full cycle. Present status and loop back to New
|
|||||||
═══════════════════════════════════════════════════
|
═══════════════════════════════════════════════════
|
||||||
AUTOPILOT STATUS (existing-code)
|
AUTOPILOT STATUS (existing-code)
|
||||||
═══════════════════════════════════════════════════
|
═══════════════════════════════════════════════════
|
||||||
Step 1 Document [DONE / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
Step 1 Document [DONE / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
||||||
Step 2 Test Spec [DONE / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
Step 2 Test Spec [DONE / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
||||||
Step 3 Decompose Tests [DONE (N tasks) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
Step 3 Code Testability Rev. [DONE / SKIPPED / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
||||||
Step 4 Implement Tests [DONE / IN PROGRESS (batch M) / NOT STARTED / FAILED (retry N/3)]
|
Step 4 Decompose Tests [DONE (N tasks) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
||||||
Step 5 Run Tests [DONE (N passed, M failed) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
Step 5 Implement Tests [DONE / IN PROGRESS (batch M) / NOT STARTED / FAILED (retry N/3)]
|
||||||
Step 6 Refactor [DONE / SKIPPED / IN PROGRESS (phase N) / NOT STARTED / FAILED (retry N/3)]
|
Step 6 Run Tests [DONE (N passed, M failed) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
||||||
Step 7 New Task [DONE (N tasks) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
Step 7 Refactor [DONE / SKIPPED / IN PROGRESS (phase N) / NOT STARTED / FAILED (retry N/3)]
|
||||||
Step 8 Implement [DONE / IN PROGRESS (batch M of ~N) / NOT STARTED / FAILED (retry N/3)]
|
Step 8 New Task [DONE (N tasks) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
||||||
Step 9 Run Tests [DONE (N passed, M failed) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
Step 9 Implement [DONE / IN PROGRESS (batch M of ~N) / NOT STARTED / FAILED (retry N/3)]
|
||||||
Step 10 Security Audit [DONE / SKIPPED / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
Step 10 Run Tests [DONE (N passed, M failed) / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
||||||
Step 11 Performance Test [DONE / SKIPPED / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
Step 11 Security Audit [DONE / SKIPPED / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
||||||
Step 12 Deploy [DONE / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
Step 12 Performance Test [DONE / SKIPPED / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
||||||
|
Step 13 Deploy [DONE / IN PROGRESS / NOT STARTED / FAILED (retry N/3)]
|
||||||
═══════════════════════════════════════════════════
|
═══════════════════════════════════════════════════
|
||||||
Current: Step N — Name
|
Current: Step N — Name
|
||||||
SubStep: M — [sub-skill internal step name]
|
SubStep: M — [sub-skill internal step name]
|
||||||
|
|||||||
@@ -190,7 +190,7 @@ Action: Read and execute `.cursor/skills/deploy/SKILL.md`
|
|||||||
---
|
---
|
||||||
|
|
||||||
**Done**
|
**Done**
|
||||||
Condition: `_docs/04_deploy/` contains all expected artifacts (containerization.md, ci_cd_pipeline.md, environment_strategy.md, observability.md, deployment_procedures.md)
|
Condition: `_docs/04_deploy/` contains all expected artifacts (containerization.md, ci_cd_pipeline.md, environment_strategy.md, observability.md, deployment_procedures.md, deploy_scripts.md)
|
||||||
|
|
||||||
Action: Report project completion with summary. If the user runs autopilot again after greenfield completion, Flow Resolution rule 3 routes to the existing-code flow (re-entry after completion) so they can add new features.
|
Action: Report project completion with summary. If the user runs autopilot again after greenfield completion, Flow Resolution rule 3 routes to the existing-code flow (re-entry after completion) so they can add new features.
|
||||||
|
|
||||||
|
|||||||
@@ -46,9 +46,8 @@ Rules:
|
|||||||
2. Always include a recommendation with a brief justification
|
2. Always include a recommendation with a brief justification
|
||||||
3. Keep option descriptions to one line each
|
3. Keep option descriptions to one line each
|
||||||
4. If only 2 options make sense, use A/B only — do not pad with filler options
|
4. If only 2 options make sense, use A/B only — do not pad with filler options
|
||||||
5. Play the notification sound (per `human-attention-sound.mdc`) before presenting the choice
|
5. Play the notification sound (per `.cursor/rules/human-attention-sound.mdc`) before presenting the choice
|
||||||
6. Record every user decision in the state file's `Key Decisions` section
|
6. After the user picks, proceed immediately — no follow-up confirmation unless the choice was destructive
|
||||||
7. After the user picks, proceed immediately — no follow-up confirmation unless the choice was destructive
|
|
||||||
|
|
||||||
## Work Item Tracker Authentication
|
## Work Item Tracker Authentication
|
||||||
|
|
||||||
@@ -124,16 +123,12 @@ Skill execution → FAILED
|
|||||||
│
|
│
|
||||||
├─ retry_count < 3 ?
|
├─ retry_count < 3 ?
|
||||||
│ YES → increment retry_count in state file
|
│ YES → increment retry_count in state file
|
||||||
│ → log failure reason in state file (Retry Log section)
|
|
||||||
│ → re-read the sub-skill's SKILL.md
|
│ → re-read the sub-skill's SKILL.md
|
||||||
│ → re-execute from the current sub_step
|
│ → re-execute from the current sub_step
|
||||||
│ → (loop back to check result)
|
│ → (loop back to check result)
|
||||||
│
|
│
|
||||||
│ NO (retry_count = 3) →
|
│ NO (retry_count = 3) →
|
||||||
│ → set status: failed in Current Step
|
│ → set status: failed in Current Step
|
||||||
│ → add entry to Blockers section:
|
|
||||||
│ "[Skill Name] failed 3 consecutive times at sub_step [M].
|
|
||||||
│ Last failure: [reason]. Auto-retry exhausted."
|
|
||||||
│ → present warning to user (see Escalation below)
|
│ → present warning to user (see Escalation below)
|
||||||
│ → do NOT auto-retry again until user intervenes
|
│ → do NOT auto-retry again until user intervenes
|
||||||
```
|
```
|
||||||
@@ -143,18 +138,14 @@ Skill execution → FAILED
|
|||||||
1. **Auto-retry immediately**: when a skill fails, retry it without asking the user — the failure is often transient (missing user confirmation in a prior step, docker not running, file lock, etc.)
|
1. **Auto-retry immediately**: when a skill fails, retry it without asking the user — the failure is often transient (missing user confirmation in a prior step, docker not running, file lock, etc.)
|
||||||
2. **Preserve sub_step**: retry from the last recorded `sub_step`, not from the beginning of the skill — unless the failure indicates corruption, in which case restart from sub_step 1
|
2. **Preserve sub_step**: retry from the last recorded `sub_step`, not from the beginning of the skill — unless the failure indicates corruption, in which case restart from sub_step 1
|
||||||
3. **Increment `retry_count`**: update `retry_count` in the state file's `Current Step` section on each retry attempt
|
3. **Increment `retry_count`**: update `retry_count` in the state file's `Current Step` section on each retry attempt
|
||||||
4. **Log each failure**: append the failure reason and timestamp to the state file's `Retry Log` section
|
4. **Reset on success**: when the skill eventually succeeds, reset `retry_count: 0`
|
||||||
5. **Reset on success**: when the skill eventually succeeds, reset `retry_count: 0` and clear the `Retry Log` for that step
|
|
||||||
|
|
||||||
### Escalation (after 3 consecutive failures)
|
### Escalation (after 3 consecutive failures)
|
||||||
|
|
||||||
After 3 failed auto-retries of the same skill, the failure is likely not user-related. Stop retrying and escalate:
|
After 3 failed auto-retries of the same skill, the failure is likely not user-related. Stop retrying and escalate:
|
||||||
|
|
||||||
1. Update the state file:
|
1. Update the state file: set `status: failed` and `retry_count: 3` in `Current Step`
|
||||||
- Set `status: failed` in `Current Step`
|
2. Play notification sound (per `.cursor/rules/human-attention-sound.mdc`)
|
||||||
- Set `retry_count: 3`
|
|
||||||
- Add a blocker entry describing the repeated failure
|
|
||||||
2. Play notification sound (per `human-attention-sound.mdc`)
|
|
||||||
3. Present using Choose format:
|
3. Present using Choose format:
|
||||||
|
|
||||||
```
|
```
|
||||||
@@ -215,9 +206,8 @@ When executing a sub-skill, monitor for these signals:
|
|||||||
|
|
||||||
If the same autopilot step fails 3 consecutive times across conversations:
|
If the same autopilot step fails 3 consecutive times across conversations:
|
||||||
|
|
||||||
- Record the failure pattern in the state file's `Blockers` section
|
|
||||||
- Do NOT auto-retry on next invocation
|
- Do NOT auto-retry on next invocation
|
||||||
- Present the blocker and ask user for guidance before attempting again
|
- Present the failure pattern and ask user for guidance before attempting again
|
||||||
|
|
||||||
## Context Management Protocol
|
## Context Management Protocol
|
||||||
|
|
||||||
@@ -308,7 +298,4 @@ For steps that produce `_docs/` artifacts (problem, research, plan, decompose, d
|
|||||||
|
|
||||||
On every invocation, before executing any skill, present a status summary built from the state file (with folder scan fallback). Use the Status Summary Template from the active flow file (`flows/greenfield.md` or `flows/existing-code.md`).
|
On every invocation, before executing any skill, present a status summary built from the state file (with folder scan fallback). Use the Status Summary Template from the active flow file (`flows/greenfield.md` or `flows/existing-code.md`).
|
||||||
|
|
||||||
For re-entry (state file exists), also include:
|
For re-entry (state file exists), cross-check the current step against `_docs/` folder structure and present any `status: failed` state to the user before continuing.
|
||||||
- Key decisions from the state file's `Key Decisions` section
|
|
||||||
- Last session context from the `Last Session` section
|
|
||||||
- Any blockers from the `Blockers` section
|
|
||||||
|
|||||||
@@ -2,81 +2,51 @@
|
|||||||
|
|
||||||
## State File: `_docs/_autopilot_state.md`
|
## State File: `_docs/_autopilot_state.md`
|
||||||
|
|
||||||
The autopilot persists its state to `_docs/_autopilot_state.md`. This file is the primary source of truth for re-entry. Folder scanning is the fallback when the state file doesn't exist.
|
The autopilot persists its position to `_docs/_autopilot_state.md`. This is a lightweight pointer — only the current step. All history lives in `_docs/` artifacts and git log. Folder scanning is the fallback when the state file doesn't exist.
|
||||||
|
|
||||||
### Format
|
### Template
|
||||||
|
|
||||||
```markdown
|
```markdown
|
||||||
# Autopilot State
|
# Autopilot State
|
||||||
|
|
||||||
## Current Step
|
## Current Step
|
||||||
flow: [greenfield | existing-code]
|
flow: [greenfield | existing-code]
|
||||||
step: [1-10 for greenfield, 1-12 for existing-code, or "done"]
|
step: [1-10 for greenfield, 1-13 for existing-code, or "done"]
|
||||||
name: [step name from the active flow's Step Reference Table]
|
name: [step name from the active flow's Step Reference Table]
|
||||||
status: [not_started / in_progress / completed / skipped / failed]
|
status: [not_started / in_progress / completed / skipped / failed]
|
||||||
sub_step: [optional — sub-skill internal step number + name if interrupted mid-step]
|
sub_step: [0, or sub-skill internal step number + name if interrupted mid-step]
|
||||||
retry_count: [0-3 — number of consecutive auto-retry attempts for current step, reset to 0 on success]
|
retry_count: [0-3 — consecutive auto-retry attempts, reset to 0 on success]
|
||||||
|
```
|
||||||
|
|
||||||
When updating `Current Step`, always write it as:
|
### Examples
|
||||||
flow: existing-code ← active flow
|
|
||||||
step: N ← autopilot step (sequential integer)
|
|
||||||
sub_step: M ← sub-skill's own internal step/phase number + name
|
|
||||||
retry_count: 0 ← reset on new step or success; increment on each failed retry
|
|
||||||
Example:
|
|
||||||
flow: greenfield
|
|
||||||
step: 3
|
|
||||||
name: Plan
|
|
||||||
status: in_progress
|
|
||||||
sub_step: 4 — Architecture Review & Risk Assessment
|
|
||||||
retry_count: 0
|
|
||||||
Example (failed after 3 retries):
|
|
||||||
flow: existing-code
|
|
||||||
step: 2
|
|
||||||
name: Test Spec
|
|
||||||
status: failed
|
|
||||||
sub_step: 1b — Test Case Generation
|
|
||||||
retry_count: 3
|
|
||||||
|
|
||||||
## Completed Steps
|
```
|
||||||
|
flow: greenfield
|
||||||
|
step: 3
|
||||||
|
name: Plan
|
||||||
|
status: in_progress
|
||||||
|
sub_step: 4 — Architecture Review & Risk Assessment
|
||||||
|
retry_count: 0
|
||||||
|
```
|
||||||
|
|
||||||
| Step | Name | Completed | Key Outcome |
|
```
|
||||||
|------|------|-----------|-------------|
|
flow: existing-code
|
||||||
| 1 | [name] | [date] | [one-line summary] |
|
step: 2
|
||||||
| 2 | [name] | [date] | [one-line summary] |
|
name: Test Spec
|
||||||
| ... | ... | ... | ... |
|
status: failed
|
||||||
|
sub_step: 1b — Test Case Generation
|
||||||
## Key Decisions
|
retry_count: 3
|
||||||
- [decision 1: e.g. "Tech stack: Python + Rust for perf-critical, Postgres DB"]
|
|
||||||
- [decision N]
|
|
||||||
|
|
||||||
## Last Session
|
|
||||||
date: [date]
|
|
||||||
ended_at: Step [N] [Name] — SubStep [M] [sub-step name]
|
|
||||||
reason: [completed step / session boundary / user paused / context limit]
|
|
||||||
notes: [any context for next session]
|
|
||||||
|
|
||||||
## Retry Log
|
|
||||||
| Attempt | Step | Name | SubStep | Failure Reason | Timestamp |
|
|
||||||
|---------|------|------|---------|----------------|-----------|
|
|
||||||
| 1 | [step] | [name] | [sub_step] | [reason] | [date-time] |
|
|
||||||
| ... | ... | ... | ... | ... | ... |
|
|
||||||
|
|
||||||
(Clear this table when the step succeeds or user resets. Append a row on each failed auto-retry.)
|
|
||||||
|
|
||||||
## Blockers
|
|
||||||
- [blocker 1, if any]
|
|
||||||
- [none]
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### State File Rules
|
### State File Rules
|
||||||
|
|
||||||
1. **Create** the state file on the very first autopilot invocation (after state detection determines Step 1)
|
1. **Create** on the first autopilot invocation (after state detection determines Step 1)
|
||||||
2. **Update** the state file after every step completion, every session boundary, every BLOCKING gate confirmation, and every failed retry attempt
|
2. **Update** after every step completion, session boundary, or failed retry
|
||||||
3. **Read** the state file as the first action on every invocation — before folder scanning
|
3. **Read** as the first action on every invocation — before folder scanning
|
||||||
4. **Cross-check**: after reading the state file, verify against actual `_docs/` folder contents. If they disagree (e.g., state file says Step 3 but `_docs/02_document/architecture.md` already exists), trust the folder structure and update the state file to match
|
4. **Cross-check**: verify against actual `_docs/` folder contents. If they disagree, trust the folder structure and update the state file
|
||||||
5. **Never delete** the state file. It accumulates history across the entire project lifecycle
|
5. **Never delete** the state file
|
||||||
6. **Retry tracking**: increment `retry_count` on each failed auto-retry; reset to `0` when the step succeeds or the user manually resets. If `retry_count` reaches 3, set `status: failed` and add an entry to `Blockers`
|
6. **Retry tracking**: increment `retry_count` on each failed auto-retry; reset to `0` on success. If `retry_count` reaches 3, set `status: failed`
|
||||||
7. **Failed state on re-entry**: if the state file shows `status: failed` with `retry_count: 3`, do NOT auto-retry — present the blocker to the user and wait for their decision before proceeding
|
7. **Failed state on re-entry**: if `status: failed` with `retry_count: 3`, do NOT auto-retry — present the issue to the user first
|
||||||
|
|
||||||
## State Detection
|
## State Detection
|
||||||
|
|
||||||
@@ -92,8 +62,8 @@ When the user invokes `/autopilot` and work already exists:
|
|||||||
|
|
||||||
1. Read `_docs/_autopilot_state.md`
|
1. Read `_docs/_autopilot_state.md`
|
||||||
2. Cross-check against `_docs/` folder structure
|
2. Cross-check against `_docs/` folder structure
|
||||||
3. Present Status Summary with context from state file (key decisions, last session, blockers)
|
3. Present Status Summary (use the active flow's Status Summary Template)
|
||||||
4. If the detected step has a sub-skill with built-in resumability (plan, decompose, implement, deploy all do), the sub-skill handles mid-step recovery
|
4. If the detected step has a sub-skill with built-in resumability, the sub-skill handles mid-step recovery
|
||||||
5. Continue execution from detected state
|
5. Continue execution from detected state
|
||||||
|
|
||||||
## Session Boundaries
|
## Session Boundaries
|
||||||
@@ -101,12 +71,11 @@ When the user invokes `/autopilot` and work already exists:
|
|||||||
After any decompose/planning step completes, **do not auto-chain to implement**. Instead:
|
After any decompose/planning step completes, **do not auto-chain to implement**. Instead:
|
||||||
|
|
||||||
1. Update state file: mark the step as completed, set current step to the next implement step with status `not_started`
|
1. Update state file: mark the step as completed, set current step to the next implement step with status `not_started`
|
||||||
- Existing-code flow: After Step 3 (Decompose Tests) → set current step to 4 (Implement Tests)
|
- Existing-code flow: After Step 4 (Decompose Tests) → set current step to 5 (Implement Tests)
|
||||||
- Existing-code flow: After Step 7 (New Task) → set current step to 8 (Implement)
|
- Existing-code flow: After Step 8 (New Task) → set current step to 9 (Implement)
|
||||||
- Greenfield flow: After Step 5 (Decompose) → set current step to 6 (Implement)
|
- Greenfield flow: After Step 5 (Decompose) → set current step to 6 (Implement)
|
||||||
2. Write `Last Session` section: `reason: session boundary`, `notes: Decompose complete, implementation ready`
|
2. Present a summary: number of tasks, estimated batches, total complexity points
|
||||||
3. Present a summary: number of tasks, estimated batches, total complexity points
|
3. Use Choose format:
|
||||||
4. Use Choose format:
|
|
||||||
|
|
||||||
```
|
```
|
||||||
══════════════════════════════════════
|
══════════════════════════════════════
|
||||||
|
|||||||
@@ -177,7 +177,7 @@ Re-entry is seamless: `state.json` tracks exactly which modules are done.
|
|||||||
- By directory structure (most common)
|
- By directory structure (most common)
|
||||||
- By shared data models or common purpose
|
- By shared data models or common purpose
|
||||||
- By dependency clusters (tightly coupled modules)
|
- By dependency clusters (tightly coupled modules)
|
||||||
2. For each identified component, synthesize its module docs into a single component specification using `templates/component-spec.md` as structure:
|
2. For each identified component, synthesize its module docs into a single component specification using `.cursor/skills/plan/templates/component-spec.md` as structure:
|
||||||
- High-level overview: purpose, pattern, upstream/downstream
|
- High-level overview: purpose, pattern, upstream/downstream
|
||||||
- Internal interfaces: method signatures, DTOs (from actual module code)
|
- Internal interfaces: method signatures, DTOs (from actual module code)
|
||||||
- External API specification (if the component exposes HTTP/gRPC endpoints)
|
- External API specification (if the component exposes HTTP/gRPC endpoints)
|
||||||
@@ -214,7 +214,7 @@ All documents here are derived from component docs (Step 2) + module docs (Step
|
|||||||
|
|
||||||
#### 3a. Architecture
|
#### 3a. Architecture
|
||||||
|
|
||||||
Using `templates/architecture.md` as structure:
|
Using `.cursor/skills/plan/templates/architecture.md` as structure:
|
||||||
|
|
||||||
- System context and boundaries from entry points and external integrations
|
- System context and boundaries from entry points and external integrations
|
||||||
- Tech stack table from discovery (Step 0) + component specs
|
- Tech stack table from discovery (Step 0) + component specs
|
||||||
@@ -229,7 +229,7 @@ Using `templates/architecture.md` as structure:
|
|||||||
|
|
||||||
#### 3b. System Flows
|
#### 3b. System Flows
|
||||||
|
|
||||||
Using `templates/system-flows.md` as structure:
|
Using `.cursor/skills/plan/templates/system-flows.md` as structure:
|
||||||
|
|
||||||
- Trace main flows through the component interaction graph
|
- Trace main flows through the component interaction graph
|
||||||
- Entry point -> component chain -> output for each major flow
|
- Entry point -> component chain -> output for each major flow
|
||||||
@@ -370,7 +370,7 @@ This is the inverse of normal workflow: instead of problem -> solution -> code,
|
|||||||
**Role**: Technical writer
|
**Role**: Technical writer
|
||||||
**Goal**: Produce `FINAL_report.md` integrating all generated documentation.
|
**Goal**: Produce `FINAL_report.md` integrating all generated documentation.
|
||||||
|
|
||||||
Using `templates/final-report.md` as structure:
|
Using `.cursor/skills/plan/templates/final-report.md` as structure:
|
||||||
|
|
||||||
- Executive summary from architecture + problem docs
|
- Executive summary from architecture + problem docs
|
||||||
- Problem statement (transformed from problem.md, not copy-pasted)
|
- Problem statement (transformed from problem.md, not copy-pasted)
|
||||||
|
|||||||
@@ -120,8 +120,8 @@ Track `auto_fix_attempts` count in the batch report for retrospective analysis.
|
|||||||
|
|
||||||
### 10. Test
|
### 10. Test
|
||||||
|
|
||||||
- Run the full test suite
|
- Read and execute `.cursor/skills/test-run/SKILL.md` (detect runner, run suite, diagnose failures, present blocking choices)
|
||||||
- If failures: report to user with details
|
- Test failures are a **blocking gate** — do not proceed to commit until the test-run skill completes with a user decision
|
||||||
|
|
||||||
### 11. Commit and Push
|
### 11. Commit and Push
|
||||||
|
|
||||||
@@ -174,7 +174,7 @@ After each batch, produce a structured report:
|
|||||||
| Implementer fails same approach 3+ times | Stop it, escalate to user |
|
| Implementer fails same approach 3+ times | Stop it, escalate to user |
|
||||||
| Task blocked on external dependency (not in task list) | Report and skip |
|
| Task blocked on external dependency (not in task list) | Report and skip |
|
||||||
| File ownership conflict unresolvable | ASK user |
|
| File ownership conflict unresolvable | ASK user |
|
||||||
| Test failures exceed 50% of suite after a batch | Stop and escalate |
|
| Any test failure after a batch | Delegate to test-run skill — blocking gate |
|
||||||
| All tasks complete | Report final summary, suggest final commit |
|
| All tasks complete | Report final summary, suggest final commit |
|
||||||
| `_dependencies_table.md` missing | STOP — run `/decompose` first |
|
| `_dependencies_table.md` missing | STOP — run `/decompose` first |
|
||||||
|
|
||||||
|
|||||||
@@ -118,7 +118,7 @@ This step only runs if Step 2 determined research is needed.
|
|||||||
2. Invoke `.cursor/skills/research/SKILL.md` in standalone mode:
|
2. Invoke `.cursor/skills/research/SKILL.md` in standalone mode:
|
||||||
- INPUT_FILE: `PLANS_DIR/<task_slug>/problem.md`
|
- INPUT_FILE: `PLANS_DIR/<task_slug>/problem.md`
|
||||||
- BASE_DIR: `PLANS_DIR/<task_slug>/`
|
- BASE_DIR: `PLANS_DIR/<task_slug>/`
|
||||||
3. After research completes, read the solution draft from `PLANS_DIR/<task_slug>/01_solution/solution_draft01.md`
|
3. After research completes, read the latest solution draft from `PLANS_DIR/<task_slug>/01_solution/` (highest-numbered `solution_draft*.md`)
|
||||||
4. Extract the key findings relevant to the task specification
|
4. Extract the key findings relevant to the task specification
|
||||||
|
|
||||||
The `<task_slug>` is a short kebab-case name derived from the feature description (e.g., `auth-provider-integration`, `real-time-notifications`).
|
The `<task_slug>` is a short kebab-case name derived from the feature description (e.g., `auth-provider-integration`, `real-time-notifications`).
|
||||||
|
|||||||
@@ -2,7 +2,7 @@
|
|||||||
name: plan
|
name: plan
|
||||||
description: |
|
description: |
|
||||||
Decompose a solution into architecture, data model, deployment plan, system flows, components, tests, and Jira epics.
|
Decompose a solution into architecture, data model, deployment plan, system flows, components, tests, and Jira epics.
|
||||||
Systematic 6-step planning workflow with BLOCKING gates, self-verification, and structured artifact management.
|
Systematic planning workflow with BLOCKING gates, self-verification, and structured artifact management.
|
||||||
Uses _docs/ + _docs/02_document/ structure.
|
Uses _docs/ + _docs/02_document/ structure.
|
||||||
Trigger phrases:
|
Trigger phrases:
|
||||||
- "plan", "decompose solution", "architecture planning"
|
- "plan", "decompose solution", "architecture planning"
|
||||||
|
|||||||
@@ -1,11 +1,13 @@
|
|||||||
---
|
---
|
||||||
name: refactor
|
name: refactor
|
||||||
description: |
|
description: |
|
||||||
Structured 9-phase refactoring workflow with three execution modes:
|
Structured 8-phase refactoring workflow with two input modes:
|
||||||
Full (all phases), Targeted (skip discovery), Quick Assessment (phases 0-2 only).
|
Automatic (skill discovers issues) and Guided (input file with change list).
|
||||||
Supports project mode (_docs/) and standalone mode (@file.md).
|
Each run gets its own subfolder in _docs/04_refactoring/.
|
||||||
|
Delegates code execution to the implement skill via task files in _docs/02_tasks/.
|
||||||
|
Additional workflow modes: Targeted (skip discovery), Quick Assessment (phases 0-2 only).
|
||||||
category: evolve
|
category: evolve
|
||||||
tags: [refactoring, coupling, technical-debt, performance, hardening]
|
tags: [refactoring, coupling, technical-debt, performance, testability]
|
||||||
trigger_phrases: ["refactor", "refactoring", "improve code", "analyze coupling", "decoupling", "technical debt", "code quality"]
|
trigger_phrases: ["refactor", "refactoring", "improve code", "analyze coupling", "decoupling", "technical debt", "code quality"]
|
||||||
disable-model-invocation: true
|
disable-model-invocation: true
|
||||||
---
|
---
|
||||||
@@ -16,43 +18,61 @@ Phase details live in `phases/` — read the relevant file before executing each
|
|||||||
|
|
||||||
## Core Principles
|
## Core Principles
|
||||||
|
|
||||||
- **Preserve behavior first**: never refactor without a passing test suite
|
- **Preserve behavior first**: never refactor without a passing test suite (exception: testability runs, where the goal is making code testable)
|
||||||
- **Measure before and after**: every change must be justified by metrics
|
- **Measure before and after**: every change must be justified by metrics
|
||||||
- **Small incremental changes**: commit frequently, never break tests
|
- **Small incremental changes**: commit frequently, never break tests
|
||||||
- **Save immediately**: write artifacts to disk after each phase
|
- **Save immediately**: write artifacts to disk after each phase
|
||||||
|
- **Delegate execution**: all code changes go through the implement skill via task files
|
||||||
- **Ask, don't assume**: when scope or priorities are unclear, STOP and ask the user
|
- **Ask, don't assume**: when scope or priorities are unclear, STOP and ask the user
|
||||||
|
|
||||||
## Context Resolution
|
## Context Resolution
|
||||||
|
|
||||||
Determine operating mode before any other logic runs. Announce detected mode and paths to user.
|
Announce detected paths and input mode to user before proceeding.
|
||||||
|
|
||||||
| | Project mode (default) | Standalone mode (`/refactor @file.md`) |
|
**Fixed paths:**
|
||||||
|---|---|---|
|
|
||||||
| PROBLEM_DIR | `_docs/00_problem/` | N/A |
|
|
||||||
| SOLUTION_DIR | `_docs/01_solution/` | N/A |
|
|
||||||
| COMPONENTS_DIR | `_docs/02_document/components/` | N/A |
|
|
||||||
| DOCUMENT_DIR | `_docs/02_document/` | N/A |
|
|
||||||
| REFACTOR_DIR | `_docs/04_refactoring/` | `_standalone/refactoring/` |
|
|
||||||
| Prereqs | `problem.md` required, `acceptance_criteria.md` warn if absent | INPUT_FILE must exist and be non-empty |
|
|
||||||
|
|
||||||
Create REFACTOR_DIR if missing. If it already has artifacts, ask user: **resume or start fresh?**
|
| Path | Location |
|
||||||
|
|------|----------|
|
||||||
|
| PROBLEM_DIR | `_docs/00_problem/` |
|
||||||
|
| SOLUTION_DIR | `_docs/01_solution/` |
|
||||||
|
| COMPONENTS_DIR | `_docs/02_document/components/` |
|
||||||
|
| DOCUMENT_DIR | `_docs/02_document/` |
|
||||||
|
| TASKS_DIR | `_docs/02_tasks/` |
|
||||||
|
| REFACTOR_DIR | `_docs/04_refactoring/` |
|
||||||
|
| RUN_DIR | `REFACTOR_DIR/NN-[run-name]/` |
|
||||||
|
|
||||||
|
**Prereqs**: `problem.md` required, `acceptance_criteria.md` warn if absent.
|
||||||
|
|
||||||
|
**RUN_DIR resolution**: on start, scan REFACTOR_DIR for existing `NN-*` folders. Auto-increment the numeric prefix for the new run. The run name is derived from the invocation context (e.g., `01-testability-refactoring`, `02-coupling-refactoring`). If invoked with a guided input file, derive the name from the input file name or ask the user.
|
||||||
|
|
||||||
|
Create REFACTOR_DIR and RUN_DIR if missing. If a RUN_DIR with the same name already exists, ask user: **resume or start fresh?**
|
||||||
|
|
||||||
|
## Input Modes
|
||||||
|
|
||||||
|
| Mode | Trigger | Discovery source |
|
||||||
|
|------|---------|-----------------|
|
||||||
|
| Automatic | Default, no input file | Skill discovers issues from code analysis |
|
||||||
|
| Guided | Input file provided (e.g., `/refactor @list-of-changes.md`) | Reads input file + scans code to form validated change list |
|
||||||
|
|
||||||
|
Both modes produce `RUN_DIR/list-of-changes.md` (template: `templates/list-of-changes.md`). Both modes then convert that file into task files in TASKS_DIR during Phase 2.
|
||||||
|
|
||||||
|
**Guided mode cleanup**: after `RUN_DIR/list-of-changes.md` is created from the input file, delete the original input file to avoid duplication.
|
||||||
|
|
||||||
## Workflow
|
## Workflow
|
||||||
|
|
||||||
| Phase | File | Summary | Gate |
|
| Phase | File | Summary | Gate |
|
||||||
|-------|------|---------|------|
|
|-------|------|---------|------|
|
||||||
| 0 | `phases/00-baseline.md` | Collect goals, capture baseline metrics | BLOCKING: user confirms |
|
| 0 | `phases/00-baseline.md` | Collect goals, create RUN_DIR, capture baseline metrics | BLOCKING: user confirms |
|
||||||
| 1 | `phases/01-discovery.md` | Document components, synthesize solution | BLOCKING: user confirms |
|
| 1 | `phases/01-discovery.md` | Document components (scoped for guided mode), produce list-of-changes.md | BLOCKING: user confirms |
|
||||||
| 2 | `phases/02-analysis.md` | Research improvements, produce roadmap | BLOCKING: user confirms |
|
| 2 | `phases/02-analysis.md` | Research improvements, produce roadmap, create epic, decompose into tasks in TASKS_DIR | BLOCKING: user confirms |
|
||||||
| | | *Quick Assessment stops here* | |
|
| | | *Quick Assessment stops here* | |
|
||||||
| 3 | `phases/03-safety-net.md` | Design and implement pre-refactoring tests | GATE: all tests pass |
|
| 3 | `phases/03-safety-net.md` | Check existing tests or implement pre-refactoring tests (skip for testability runs) | GATE: all tests pass |
|
||||||
| 4 | `phases/04-execution.md` | Analyze coupling, execute decoupling | BLOCKING: user confirms |
|
| 4 | `phases/04-execution.md` | Delegate task execution to implement skill | GATE: implement completes |
|
||||||
| 5 | `phases/05-hardening.md` | Technical debt, performance, security | Optional: user picks tracks |
|
| 5 | `phases/05-test-sync.md` | Remove obsolete, update broken, add new tests | GATE: all tests pass |
|
||||||
| 6 | `phases/06-test-sync.md` | Remove obsolete, update broken, add new tests | GATE: all tests pass |
|
| 6 | `phases/06-verification.md` | Run full suite, compare metrics vs baseline | GATE: all pass, no regressions |
|
||||||
| 7 | `phases/07-verification.md` | Run full suite, compare metrics vs baseline | GATE: all pass, no regressions |
|
| 7 | `phases/07-documentation.md` | Update `_docs/` to reflect refactored state | Skip if `_docs/02_document/` absent |
|
||||||
| 8 | `phases/08-documentation.md` | Update `_docs/` to reflect refactored state | Skip in standalone mode |
|
|
||||||
|
|
||||||
**Mode detection:**
|
**Workflow mode detection:**
|
||||||
- "quick assessment" / "just assess" → phases 0–2
|
- "quick assessment" / "just assess" → phases 0–2
|
||||||
- "refactor [specific target]" → skip phase 1 if docs exist
|
- "refactor [specific target]" → skip phase 1 if docs exist
|
||||||
- Default → all phases
|
- Default → all phases
|
||||||
@@ -61,31 +81,36 @@ At the start of execution, create a TodoWrite with all applicable phases.
|
|||||||
|
|
||||||
## Artifact Structure
|
## Artifact Structure
|
||||||
|
|
||||||
All artifacts are written to REFACTOR_DIR:
|
All artifacts are written to RUN_DIR:
|
||||||
|
|
||||||
```
|
```
|
||||||
baseline_metrics.md Phase 0
|
baseline_metrics.md Phase 0
|
||||||
discovery/components/[##]_[name].md Phase 1
|
discovery/components/[##]_[name].md Phase 1
|
||||||
discovery/solution.md Phase 1
|
discovery/solution.md Phase 1
|
||||||
discovery/system_flows.md Phase 1
|
discovery/system_flows.md Phase 1
|
||||||
|
list-of-changes.md Phase 1
|
||||||
analysis/research_findings.md Phase 2
|
analysis/research_findings.md Phase 2
|
||||||
analysis/refactoring_roadmap.md Phase 2
|
analysis/refactoring_roadmap.md Phase 2
|
||||||
test_specs/[##]_[test_name].md Phase 3
|
test_specs/[##]_[test_name].md Phase 3
|
||||||
coupling_analysis.md Phase 4
|
|
||||||
execution_log.md Phase 4
|
execution_log.md Phase 4
|
||||||
hardening/{technical_debt,performance,security}.md Phase 5
|
test_sync/{obsolete_tests,updated_tests,new_tests}.md Phase 5
|
||||||
test_sync/{obsolete_tests,updated_tests,new_tests}.md Phase 6
|
verification_report.md Phase 6
|
||||||
verification_report.md Phase 7
|
doc_update_log.md Phase 7
|
||||||
doc_update_log.md Phase 8
|
|
||||||
FINAL_report.md after all phases
|
FINAL_report.md after all phases
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Task files produced during Phase 2 go to TASKS_DIR (not RUN_DIR):
|
||||||
|
```
|
||||||
|
TASKS_DIR/[JIRA-ID]_refactor_[short_name].md
|
||||||
|
TASKS_DIR/_dependencies_table.md (appended)
|
||||||
|
```
|
||||||
|
|
||||||
**Resumability**: match existing artifacts to phases above, resume from next incomplete phase.
|
**Resumability**: match existing artifacts to phases above, resume from next incomplete phase.
|
||||||
|
|
||||||
## Final Report
|
## Final Report
|
||||||
|
|
||||||
After all phases complete, write `REFACTOR_DIR/FINAL_report.md`:
|
After all phases complete, write `RUN_DIR/FINAL_report.md`:
|
||||||
mode used, phases executed, baseline vs final metrics, changes summary, remaining items, lessons learned.
|
mode used (automatic/guided), input mode, phases executed, baseline vs final metrics, changes summary, remaining items, lessons learned.
|
||||||
|
|
||||||
## Escalation Rules
|
## Escalation Rules
|
||||||
|
|
||||||
@@ -97,3 +122,4 @@ mode used, phases executed, baseline vs final metrics, changes summary, remainin
|
|||||||
| Performance vs readability trade-off | **ASK user** |
|
| Performance vs readability trade-off | **ASK user** |
|
||||||
| No test suite or CI exists | **WARN user**, suggest safety net first |
|
| No test suite or CI exists | **WARN user**, suggest safety net first |
|
||||||
| Security vulnerability found | **WARN user** immediately |
|
| Security vulnerability found | **WARN user** immediately |
|
||||||
|
| Implement skill reports failures | **ASK user** — review batch reports |
|
||||||
|
|||||||
@@ -1,7 +1,7 @@
|
|||||||
# Phase 0: Context & Baseline
|
# Phase 0: Context & Baseline
|
||||||
|
|
||||||
**Role**: Software engineer preparing for refactoring
|
**Role**: Software engineer preparing for refactoring
|
||||||
**Goal**: Collect refactoring goals and capture baseline metrics
|
**Goal**: Collect refactoring goals, create run directory, capture baseline metrics
|
||||||
**Constraints**: Measurement only — no code changes
|
**Constraints**: Measurement only — no code changes
|
||||||
|
|
||||||
## 0a. Collect Goals
|
## 0a. Collect Goals
|
||||||
@@ -14,7 +14,18 @@ If PROBLEM_DIR files do not yet exist, help the user create them:
|
|||||||
|
|
||||||
Store in PROBLEM_DIR.
|
Store in PROBLEM_DIR.
|
||||||
|
|
||||||
## 0b. Capture Baseline
|
## 0b. Create RUN_DIR
|
||||||
|
|
||||||
|
1. Scan REFACTOR_DIR for existing `NN-*` folders
|
||||||
|
2. Auto-increment the numeric prefix (e.g., if `01-testability-refactoring` exists, next is `02-...`)
|
||||||
|
3. Determine the run name:
|
||||||
|
- If guided mode with input file: derive from input file name or context (e.g., `01-testability-refactoring`)
|
||||||
|
- If automatic mode: ask user for a short run name, or derive from goals (e.g., `01-coupling-refactoring`)
|
||||||
|
4. Create `REFACTOR_DIR/NN-[run-name]/` — this is RUN_DIR for the rest of the workflow
|
||||||
|
|
||||||
|
Announce RUN_DIR path to user.
|
||||||
|
|
||||||
|
## 0c. Capture Baseline
|
||||||
|
|
||||||
1. Read problem description and acceptance criteria
|
1. Read problem description and acceptance criteria
|
||||||
2. Measure current system metrics using project-appropriate tools:
|
2. Measure current system metrics using project-appropriate tools:
|
||||||
@@ -31,10 +42,11 @@ Store in PROBLEM_DIR.
|
|||||||
3. Create functionality inventory: all features/endpoints with status and coverage
|
3. Create functionality inventory: all features/endpoints with status and coverage
|
||||||
|
|
||||||
**Self-verification**:
|
**Self-verification**:
|
||||||
|
- [ ] RUN_DIR created with correct auto-incremented prefix
|
||||||
- [ ] All metric categories measured (or noted as N/A with reason)
|
- [ ] All metric categories measured (or noted as N/A with reason)
|
||||||
- [ ] Functionality inventory is complete
|
- [ ] Functionality inventory is complete
|
||||||
- [ ] Measurements are reproducible
|
- [ ] Measurements are reproducible
|
||||||
|
|
||||||
**Save action**: Write `REFACTOR_DIR/baseline_metrics.md`
|
**Save action**: Write `RUN_DIR/baseline_metrics.md`
|
||||||
|
|
||||||
**BLOCKING**: Present baseline summary to user. Do NOT proceed until user confirms.
|
**BLOCKING**: Present baseline summary to user. Do NOT proceed until user confirms.
|
||||||
|
|||||||
@@ -1,12 +1,61 @@
|
|||||||
# Phase 1: Discovery
|
# Phase 1: Discovery
|
||||||
|
|
||||||
**Role**: Principal software architect
|
**Role**: Principal software architect
|
||||||
**Goal**: Generate documentation from existing code and form solution description
|
**Goal**: Analyze existing code and produce `RUN_DIR/list-of-changes.md`
|
||||||
**Constraints**: Document what exists, not what should be. No code changes.
|
**Constraints**: Document what exists, identify what needs to change. No code changes.
|
||||||
|
|
||||||
**Skip condition** (Targeted mode): If `COMPONENTS_DIR` and `SOLUTION_DIR` already contain documentation for the target area, skip to Phase 2. Ask user to confirm skip.
|
**Skip condition** (Targeted mode): If `COMPONENTS_DIR` and `SOLUTION_DIR` already contain documentation for the target area, skip component documentation, but still produce `RUN_DIR/list-of-changes.md` (Phase 2d and Phase 3 depend on it) before moving to Phase 2. Ask user to confirm skip.
|
||||||
|
|
||||||
## 1a. Document Components
|
## Mode Branch
|
||||||
|
|
||||||
|
Determine the input mode set during Context Resolution (see SKILL.md):
|
||||||
|
|
||||||
|
- **Guided mode**: input file provided → start with 1g below
|
||||||
|
- **Automatic mode**: no input file → start with 1a below
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Guided Mode
|
||||||
|
|
||||||
|
### 1g. Read and Validate Input File
|
||||||
|
|
||||||
|
1. Read the provided input file (e.g., `list-of-changes.md` from the autopilot testability revision step or user-provided file)
|
||||||
|
2. Extract file paths, problem descriptions, and proposed changes from each entry
|
||||||
|
3. For each entry, verify against actual codebase:
|
||||||
|
- Referenced files exist
|
||||||
|
- Described problems are accurate (read the code, confirm the issue)
|
||||||
|
- Proposed changes are feasible
|
||||||
|
4. Flag any entries that reference nonexistent files or describe inaccurate problems — ASK user
|
||||||
|
|
||||||
|
### 1h. Scoped Component Analysis
|
||||||
|
|
||||||
|
For each file/area referenced in the input file:
|
||||||
|
|
||||||
|
1. Analyze the specific modules and their immediate dependencies
|
||||||
|
2. Document component structure, interfaces, and coupling points relevant to the proposed changes
|
||||||
|
3. Identify additional issues not in the input file but discovered during analysis of the same areas
|
||||||
|
|
||||||
|
Write per-component to `RUN_DIR/discovery/components/[##]_[name].md` (same format as automatic mode, but scoped to affected areas only).
|
||||||
|
|
||||||
|
### 1i. Produce List of Changes
|
||||||
|
|
||||||
|
1. Start from the validated input file entries
|
||||||
|
2. Enrich each entry with:
|
||||||
|
- Exact file paths confirmed from code
|
||||||
|
- Risk assessment (low/medium/high)
|
||||||
|
- Dependencies between changes
|
||||||
|
3. Add any additional issues discovered during scoped analysis (1h)
|
||||||
|
4. Write `RUN_DIR/list-of-changes.md` using `templates/list-of-changes.md` format
|
||||||
|
- Set **Mode**: `guided`
|
||||||
|
- Set **Source**: path to the original input file
|
||||||
|
|
||||||
|
Skip to **Save action** below.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Automatic Mode
|
||||||
|
|
||||||
|
### 1a. Document Components
|
||||||
|
|
||||||
For each component in the codebase:
|
For each component in the codebase:
|
||||||
|
|
||||||
@@ -14,33 +63,57 @@ For each component in the codebase:
|
|||||||
2. Go file by file, analyze each method
|
2. Go file by file, analyze each method
|
||||||
3. Analyze connections between components
|
3. Analyze connections between components
|
||||||
|
|
||||||
Write per component to `REFACTOR_DIR/discovery/components/[##]_[name].md`:
|
Write per component to `RUN_DIR/discovery/components/[##]_[name].md`:
|
||||||
- Purpose and architectural patterns
|
- Purpose and architectural patterns
|
||||||
- Mermaid diagrams for logic flows
|
- Mermaid diagrams for logic flows
|
||||||
- API reference table (name, description, input, output)
|
- API reference table (name, description, input, output)
|
||||||
- Implementation details: algorithmic complexity, state management, dependencies
|
- Implementation details: algorithmic complexity, state management, dependencies
|
||||||
- Caveats, edge cases, known limitations
|
- Caveats, edge cases, known limitations
|
||||||
|
|
||||||
## 1b. Synthesize Solution & Flows
|
### 1b. Synthesize Solution & Flows
|
||||||
|
|
||||||
1. Review all generated component documentation
|
1. Review all generated component documentation
|
||||||
2. Synthesize into a cohesive solution description
|
2. Synthesize into a cohesive solution description
|
||||||
3. Create flow diagrams showing component interactions
|
3. Create flow diagrams showing component interactions
|
||||||
|
|
||||||
Write:
|
Write:
|
||||||
- `REFACTOR_DIR/discovery/solution.md` — product description, component overview, interaction diagram
|
- `RUN_DIR/discovery/solution.md` — product description, component overview, interaction diagram
|
||||||
- `REFACTOR_DIR/discovery/system_flows.md` — Mermaid flowcharts per major use case
|
- `RUN_DIR/discovery/system_flows.md` — Mermaid flowcharts per major use case
|
||||||
|
|
||||||
Also copy to project standard locations if in project mode:
|
Also copy to project standard locations:
|
||||||
- `SOLUTION_DIR/solution.md`
|
- `SOLUTION_DIR/solution.md`
|
||||||
- `DOCUMENT_DIR/system_flows.md`
|
- `DOCUMENT_DIR/system_flows.md`
|
||||||
|
|
||||||
|
### 1c. Produce List of Changes
|
||||||
|
|
||||||
|
From the component analysis and solution synthesis, identify all issues that need refactoring:
|
||||||
|
|
||||||
|
1. Hardcoded values (paths, config, magic numbers)
|
||||||
|
2. Tight coupling between components
|
||||||
|
3. Missing dependency injection / non-configurable parameters
|
||||||
|
4. Global mutable state
|
||||||
|
5. Code duplication
|
||||||
|
6. Missing error handling
|
||||||
|
7. Testability blockers (code that cannot be exercised in isolation)
|
||||||
|
8. Security concerns
|
||||||
|
9. Performance bottlenecks
|
||||||
|
|
||||||
|
Write `RUN_DIR/list-of-changes.md` using `templates/list-of-changes.md` format:
|
||||||
|
- Set **Mode**: `automatic`
|
||||||
|
- Set **Source**: `self-discovered`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Save action (both modes)
|
||||||
|
|
||||||
|
Write all discovery artifacts to RUN_DIR.
|
||||||
|
|
||||||
**Self-verification**:
|
**Self-verification**:
|
||||||
- [ ] Every component in the codebase is documented
|
- [ ] Every referenced file in list-of-changes.md exists in the codebase
|
||||||
- [ ] Solution description covers all components
|
- [ ] Each change entry has file paths, problem, change description, risk, and dependencies
|
||||||
- [ ] Flow diagrams cover all major use cases
|
- [ ] Component documentation covers all areas affected by the changes
|
||||||
|
- [ ] In guided mode: all input file entries are validated or flagged
|
||||||
|
- [ ] In automatic mode: solution description covers all components
|
||||||
- [ ] Mermaid diagrams are syntactically correct
|
- [ ] Mermaid diagrams are syntactically correct
|
||||||
|
|
||||||
**Save action**: Write discovery artifacts
|
**BLOCKING**: Present discovery summary and list-of-changes.md to user. Do NOT proceed until user confirms documentation accuracy and change list completeness.
|
||||||
|
|
||||||
**BLOCKING**: Present discovery summary to user. Do NOT proceed until user confirms documentation accuracy.
|
|
||||||
|
|||||||
@@ -1,8 +1,8 @@
|
|||||||
# Phase 2: Analysis
|
# Phase 2: Analysis & Task Decomposition
|
||||||
|
|
||||||
**Role**: Researcher and software architect
|
**Role**: Researcher, software architect, and task planner
|
||||||
**Goal**: Research improvements and produce a refactoring roadmap
|
**Goal**: Research improvements, produce a refactoring roadmap, and decompose into implementable tasks
|
||||||
**Constraints**: Analysis only — no code changes
|
**Constraints**: Analysis and planning only — no code changes
|
||||||
|
|
||||||
## 2a. Deep Research
|
## 2a. Deep Research
|
||||||
|
|
||||||
@@ -11,31 +11,84 @@
|
|||||||
3. Identify what could be done differently
|
3. Identify what could be done differently
|
||||||
4. Suggest improvements based on state-of-the-art practices
|
4. Suggest improvements based on state-of-the-art practices
|
||||||
|
|
||||||
Write `REFACTOR_DIR/analysis/research_findings.md`:
|
Write `RUN_DIR/analysis/research_findings.md`:
|
||||||
- Current state analysis: patterns used, strengths, weaknesses
|
- Current state analysis: patterns used, strengths, weaknesses
|
||||||
- Alternative approaches per component: current vs alternative, pros/cons, migration effort
|
- Alternative approaches per component: current vs alternative, pros/cons, migration effort
|
||||||
- Prioritized recommendations: quick wins + strategic improvements
|
- Prioritized recommendations: quick wins + strategic improvements
|
||||||
|
|
||||||
## 2b. Solution Assessment
|
## 2b. Solution Assessment & Hardening Tracks
|
||||||
|
|
||||||
1. Assess current implementation against acceptance criteria
|
1. Assess current implementation against acceptance criteria
|
||||||
2. Identify weak points in codebase, map to specific code areas
|
2. Identify weak points in codebase, map to specific code areas
|
||||||
3. Perform gap analysis: acceptance criteria vs current state
|
3. Perform gap analysis: acceptance criteria vs current state
|
||||||
4. Prioritize changes by impact and effort
|
4. Prioritize changes by impact and effort
|
||||||
|
|
||||||
Write `REFACTOR_DIR/analysis/refactoring_roadmap.md`:
|
Present optional hardening tracks for user to include in the roadmap:
|
||||||
|
|
||||||
|
```
|
||||||
|
══════════════════════════════════════
|
||||||
|
DECISION REQUIRED: Include hardening tracks?
|
||||||
|
══════════════════════════════════════
|
||||||
|
A) Technical Debt — identify and address design/code/test debt
|
||||||
|
B) Performance Optimization — profile, identify bottlenecks, optimize
|
||||||
|
C) Security Review — OWASP Top 10, auth, encryption, input validation
|
||||||
|
D) All of the above
|
||||||
|
E) None — proceed with structural refactoring only
|
||||||
|
══════════════════════════════════════
|
||||||
|
```
|
||||||
|
|
||||||
|
For each selected track, add entries to `RUN_DIR/list-of-changes.md` (append to the file produced in Phase 1):
|
||||||
|
- **Track A**: tech debt items with location, impact, effort
|
||||||
|
- **Track B**: performance bottlenecks with profiling data
|
||||||
|
- **Track C**: security findings with severity and fix description
|
||||||
|
|
||||||
|
Write `RUN_DIR/analysis/refactoring_roadmap.md`:
|
||||||
- Weak points assessment: location, description, impact, proposed solution
|
- Weak points assessment: location, description, impact, proposed solution
|
||||||
- Gap analysis: what's missing, what needs improvement
|
- Gap analysis: what's missing, what needs improvement
|
||||||
- Phased roadmap: Phase 1 (critical fixes), Phase 2 (major improvements), Phase 3 (enhancements)
|
- Phased roadmap: Phase 1 (critical fixes), Phase 2 (major improvements), Phase 3 (enhancements)
|
||||||
|
- Selected hardening tracks and their items
|
||||||
|
|
||||||
|
## 2c. Create Epic
|
||||||
|
|
||||||
|
Create a Jira/ADO epic for this refactoring run:
|
||||||
|
|
||||||
|
1. Epic name: the RUN_DIR name (e.g., `01-testability-refactoring`)
|
||||||
|
2. Create the epic via configured tracker MCP
|
||||||
|
3. Record the Epic ID — all tasks in 2d will be linked under this epic
|
||||||
|
4. If tracker unavailable, use `PENDING` placeholder and note for later
|
||||||
|
|
||||||
|
## 2d. Task Decomposition
|
||||||
|
|
||||||
|
Convert the finalized `RUN_DIR/list-of-changes.md` into implementable task files.
|
||||||
|
|
||||||
|
1. Read `RUN_DIR/list-of-changes.md`
|
||||||
|
2. For each change entry (or group of related entries), create an atomic task file in TASKS_DIR:
|
||||||
|
- Use the standard task template format (`.cursor/skills/decompose/templates/task.md`)
|
||||||
|
- File naming: `[##]_refactor_[short_name].md` (temporary numeric prefix)
|
||||||
|
- **Task**: `PENDING_refactor_[short_name]`
|
||||||
|
- **Description**: derived from the change entry's Problem + Change fields
|
||||||
|
- **Complexity**: estimate 1-5 points; split into multiple tasks if >5
|
||||||
|
- **Dependencies**: map change-level dependencies (C01, C02) to task-level Jira IDs
|
||||||
|
- **Component**: from the change entry's File(s) field
|
||||||
|
- **Epic**: the epic created in 2c
|
||||||
|
- **Acceptance Criteria**: derived from the change entry — verify the problem is resolved
|
||||||
|
3. Create Jira/ADO ticket for each task under the epic from 2c
|
||||||
|
4. Rename each file to `[JIRA-ID]_refactor_[short_name].md` after ticket creation
|
||||||
|
5. Update or append to `TASKS_DIR/_dependencies_table.md` with the refactoring tasks
|
||||||
|
|
||||||
**Self-verification**:
|
**Self-verification**:
|
||||||
- [ ] All acceptance criteria are addressed in gap analysis
|
- [ ] All acceptance criteria are addressed in gap analysis
|
||||||
- [ ] Recommendations are grounded in actual code, not abstract
|
- [ ] Recommendations are grounded in actual code, not abstract
|
||||||
- [ ] Roadmap phases are prioritized by impact
|
- [ ] Roadmap phases are prioritized by impact
|
||||||
- [ ] Quick wins are identified separately
|
- [ ] Epic created and all tasks linked to it
|
||||||
|
- [ ] Every entry in list-of-changes.md has a corresponding task file in TASKS_DIR
|
||||||
|
- [ ] No task exceeds 5 complexity points
|
||||||
|
- [ ] Task dependencies are consistent (no circular dependencies)
|
||||||
|
- [ ] `_dependencies_table.md` includes all refactoring tasks
|
||||||
|
- [ ] Every task has a Jira ticket (or PENDING placeholder)
|
||||||
|
|
||||||
**Save action**: Write analysis artifacts
|
**Save action**: Write analysis artifacts to RUN_DIR, task files to TASKS_DIR
|
||||||
|
|
||||||
**BLOCKING**: Present refactoring roadmap to user. Do NOT proceed until user confirms.
|
**BLOCKING**: Present refactoring roadmap and task list to user. Do NOT proceed until user confirms.
|
||||||
|
|
||||||
**Quick Assessment mode stops here.** Present final summary and write `FINAL_report.md` with phases 0-2 content.
|
**Quick Assessment mode stops here.** Present final summary and write `FINAL_report.md` with phases 0-2 content.
|
||||||
|
|||||||
@@ -1,23 +1,45 @@
|
|||||||
# Phase 3: Safety Net
|
# Phase 3: Safety Net
|
||||||
|
|
||||||
**Role**: QA engineer and developer
|
**Role**: QA engineer and developer
|
||||||
**Goal**: Design and implement tests that capture current behavior before refactoring
|
**Goal**: Ensure tests exist that capture current behavior before refactoring
|
||||||
**Constraints**: Tests must all pass on the current codebase before proceeding
|
**Constraints**: Tests must all pass on the current codebase before proceeding
|
||||||
|
|
||||||
## 3a. Design Test Specs
|
## Skip Condition: Testability Refactoring
|
||||||
|
|
||||||
Coverage requirements (must meet before refactoring — see `.cursor/rules/cursor-meta.mdc` Quality Thresholds):
|
If the current run name contains `testability` (e.g., `01-testability-refactoring`), **skip Phase 3 entirely**. The purpose of a testability run is to make the code testable so that tests can be written afterward. Announce the skip and proceed to Phase 4.
|
||||||
- Minimum overall coverage: 75%
|
|
||||||
- Critical path coverage: 90%
|
|
||||||
- All public APIs must have blackbox tests
|
|
||||||
- All error handling paths must be tested
|
|
||||||
|
|
||||||
For each critical area, write test specs to `REFACTOR_DIR/test_specs/[##]_[test_name].md`:
|
## 3a. Check Existing Tests
|
||||||
|
|
||||||
|
Before designing or implementing any new tests, check what already exists:
|
||||||
|
|
||||||
|
1. Scan the project for existing test files (unit tests, integration tests, blackbox tests)
|
||||||
|
2. Run the existing test suite — record pass/fail counts
|
||||||
|
3. Measure current coverage against the areas being refactored (from `RUN_DIR/list-of-changes.md` file paths)
|
||||||
|
4. Assess coverage against thresholds:
|
||||||
|
- Minimum overall coverage: 75%
|
||||||
|
- Critical path coverage: 90%
|
||||||
|
- All public APIs must have blackbox tests
|
||||||
|
- All error handling paths must be tested
|
||||||
|
|
||||||
|
If existing tests meet all thresholds for the refactoring areas:
|
||||||
|
- Document the existing coverage in `RUN_DIR/test_specs/existing_coverage.md`
|
||||||
|
- Skip to the GATE check below
|
||||||
|
|
||||||
|
If existing tests partially cover the refactoring areas:
|
||||||
|
- Document what is covered and what gaps remain
|
||||||
|
- Proceed to 3b only for the uncovered areas
|
||||||
|
|
||||||
|
If no relevant tests exist:
|
||||||
|
- Proceed to 3b for full test design
|
||||||
|
|
||||||
|
## 3b. Design Test Specs (for uncovered areas only)
|
||||||
|
|
||||||
|
For each uncovered critical area, write test specs to `RUN_DIR/test_specs/[##]_[test_name].md`:
|
||||||
- Blackbox tests: summary, current behavior, input data, expected result, max expected time
|
- Blackbox tests: summary, current behavior, input data, expected result, max expected time
|
||||||
- Acceptance tests: summary, preconditions, steps with expected results
|
- Acceptance tests: summary, preconditions, steps with expected results
|
||||||
- Coverage analysis: current %, target %, uncovered critical paths
|
- Coverage analysis: current %, target %, uncovered critical paths
|
||||||
|
|
||||||
## 3b. Implement Tests
|
## 3c. Implement Tests (for uncovered areas only)
|
||||||
|
|
||||||
1. Set up test environment and infrastructure if not exists
|
1. Set up test environment and infrastructure if not exists
|
||||||
2. Implement each test from specs
|
2. Implement each test from specs
|
||||||
@@ -25,11 +47,11 @@ For each critical area, write test specs to `REFACTOR_DIR/test_specs/[##]_[test_
|
|||||||
4. Document any discovered issues
|
4. Document any discovered issues
|
||||||
|
|
||||||
**Self-verification**:
|
**Self-verification**:
|
||||||
- [ ] Coverage requirements met (75% overall, 90% critical paths)
|
- [ ] Coverage requirements met (75% overall, 90% critical paths) across existing + new tests
|
||||||
- [ ] All tests pass on current codebase
|
- [ ] All tests pass on current codebase
|
||||||
- [ ] All public APIs have blackbox tests
|
- [ ] All public APIs in refactoring scope have blackbox tests
|
||||||
- [ ] Test data fixtures are configured
|
- [ ] Test data fixtures are configured
|
||||||
|
|
||||||
**Save action**: Write test specs; implemented tests go into the project's test folder
|
**Save action**: Write test specs to RUN_DIR; implemented tests go into the project's test folder
|
||||||
|
|
||||||
**GATE (BLOCKING)**: ALL tests must pass before proceeding to Phase 4. If tests fail, fix the tests (not the code) or ask user for guidance. Do NOT proceed to Phase 4 with failing tests.
|
**GATE (BLOCKING)**: ALL tests must pass before proceeding to Phase 4. If tests fail, fix the tests (not the code) or ask user for guidance. Do NOT proceed to Phase 4 with failing tests.
|
||||||
|
|||||||
@@ -1,45 +1,63 @@
|
|||||||
# Phase 4: Execution
|
# Phase 4: Execution
|
||||||
|
|
||||||
**Role**: Software architect and developer
|
**Role**: Orchestrator
|
||||||
**Goal**: Analyze coupling and execute decoupling changes
|
**Goal**: Execute all refactoring tasks by delegating to the implement skill
|
||||||
**Constraints**: Small incremental changes; tests must stay green after every change
|
**Constraints**: No inline code changes — all implementation goes through the implement skill's batching and review pipeline
|
||||||
|
|
||||||
## 4a. Analyze Coupling
|
## 4a. Pre-Flight Checks
|
||||||
|
|
||||||
1. Analyze coupling between components/modules
|
1. Verify refactoring task files exist in TASKS_DIR (created during Phase 2d):
|
||||||
2. Map dependencies (direct and transitive)
|
- All `[JIRA-ID]_refactor_*.md` files are present
|
||||||
3. Identify circular dependencies
|
- Each task file has valid header fields (Task, Name, Description, Complexity, Dependencies)
|
||||||
4. Form decoupling strategy
|
2. Verify `TASKS_DIR/_dependencies_table.md` includes the refactoring tasks
|
||||||
|
3. Verify all tests pass (safety net from Phase 3 is green)
|
||||||
|
4. If any check fails, go back to the relevant phase to fix
|
||||||
|
|
||||||
Write `REFACTOR_DIR/coupling_analysis.md`:
|
## 4b. Delegate to Implement Skill
|
||||||
- Dependency graph (Mermaid)
|
|
||||||
- Coupling metrics per component
|
|
||||||
- Problem areas: components involved, coupling type, severity, impact
|
|
||||||
- Decoupling strategy: priority order, proposed interfaces/abstractions, effort estimates
|
|
||||||
|
|
||||||
**BLOCKING**: Present coupling analysis to user. Do NOT proceed until user confirms strategy.
|
Read and execute `.cursor/skills/implement/SKILL.md`.
|
||||||
|
|
||||||
## 4b. Execute Decoupling
|
The implement skill will:
|
||||||
|
1. Parse task files and dependency graph from TASKS_DIR
|
||||||
|
2. Detect already-completed tasks (skip non-refactoring tasks from prior workflow steps)
|
||||||
|
3. Compute execution batches for the refactoring tasks
|
||||||
|
4. Launch implementer subagents (up to 4 in parallel)
|
||||||
|
5. Run code review after each batch
|
||||||
|
6. Commit and push per batch
|
||||||
|
7. Update Jira/ADO ticket status
|
||||||
|
|
||||||
For each change in the decoupling strategy:
|
Do NOT modify, skip, or abbreviate any part of the implement skill's workflow. The refactor skill is delegating execution, not optimizing it.
|
||||||
|
|
||||||
1. Implement the change
|
## 4c. Capture Results
|
||||||
2. Run blackbox tests
|
|
||||||
3. Fix any failures
|
|
||||||
4. Commit with descriptive message
|
|
||||||
|
|
||||||
Address code smells encountered: long methods, large classes, duplicate code, dead code, magic numbers.
|
After the implement skill completes:
|
||||||
|
|
||||||
Write `REFACTOR_DIR/execution_log.md`:
|
1. Read batch reports from `_docs/03_implementation/batch_*_report.md`
|
||||||
- Change description, files affected, test status per change
|
2. Read `_docs/03_implementation/FINAL_implementation_report.md`
|
||||||
- Before/after metrics comparison against baseline
|
3. Write `RUN_DIR/execution_log.md` summarizing:
|
||||||
|
- Total tasks executed
|
||||||
|
- Batches completed
|
||||||
|
- Code review verdicts per batch
|
||||||
|
- Files modified (aggregate list)
|
||||||
|
- Any blocked or failed tasks
|
||||||
|
- Links to batch reports
|
||||||
|
|
||||||
|
## 4d. Update Task Statuses
|
||||||
|
|
||||||
|
For each successfully completed refactoring task:
|
||||||
|
|
||||||
|
1. Transition the Jira/ADO ticket status to **Done** via the configured tracker MCP
|
||||||
|
2. If tracker unavailable, note the pending status transitions in `RUN_DIR/execution_log.md`
|
||||||
|
|
||||||
|
For any failed or blocked tasks, leave their status as-is (the implement skill already set them to In Testing or blocked).
|
||||||
|
|
||||||
**Self-verification**:
|
**Self-verification**:
|
||||||
|
- [ ] All refactoring tasks show as completed in batch reports
|
||||||
|
- [ ] All completed tasks have Jira/ADO status set to Done
|
||||||
- [ ] All tests still pass after execution
|
- [ ] All tests still pass after execution
|
||||||
- [ ] No circular dependencies remain (or reduced per plan)
|
- [ ] No tasks remain in blocked or failed state (or user has acknowledged them)
|
||||||
- [ ] Code smells addressed
|
- [ ] `RUN_DIR/execution_log.md` written with links to batch reports
|
||||||
- [ ] Metrics improved compared to baseline
|
|
||||||
|
|
||||||
**Save action**: Write execution artifacts
|
**Save action**: Write `RUN_DIR/execution_log.md`
|
||||||
|
|
||||||
**BLOCKING**: Present execution summary to user. Do NOT proceed until user confirms.
|
**GATE**: All refactoring tasks must be implemented. If any tasks failed, present the failures to the user and ask for guidance before proceeding to Phase 5.
|
||||||
|
|||||||
@@ -1,51 +0,0 @@
|
|||||||
# Phase 5: Hardening (Optional, Parallel Tracks)
|
|
||||||
|
|
||||||
**Role**: Varies per track
|
|
||||||
**Goal**: Address technical debt, performance, and security
|
|
||||||
**Constraints**: Each track is optional; user picks which to run
|
|
||||||
|
|
||||||
Present the three tracks and let user choose which to execute:
|
|
||||||
|
|
||||||
## Track A: Technical Debt
|
|
||||||
|
|
||||||
**Role**: Technical debt analyst
|
|
||||||
|
|
||||||
1. Identify and categorize debt items: design, code, test, documentation
|
|
||||||
2. Assess each: location, description, impact, effort, interest (cost of not fixing)
|
|
||||||
3. Prioritize: quick wins → strategic debt → tolerable debt
|
|
||||||
4. Create actionable plan with prevention measures
|
|
||||||
|
|
||||||
Write `REFACTOR_DIR/hardening/technical_debt.md`
|
|
||||||
|
|
||||||
## Track B: Performance Optimization
|
|
||||||
|
|
||||||
**Role**: Performance engineer
|
|
||||||
|
|
||||||
1. Profile current performance, identify bottlenecks
|
|
||||||
2. For each bottleneck: location, symptom, root cause, impact
|
|
||||||
3. Propose optimizations with expected improvement and risk
|
|
||||||
4. Implement one at a time, benchmark after each change
|
|
||||||
5. Verify tests still pass
|
|
||||||
|
|
||||||
Write `REFACTOR_DIR/hardening/performance.md` with before/after benchmarks
|
|
||||||
|
|
||||||
## Track C: Security Review
|
|
||||||
|
|
||||||
**Role**: Security engineer
|
|
||||||
|
|
||||||
1. Review code against OWASP Top 10
|
|
||||||
2. Verify security requirements from `security_approach.md` are met
|
|
||||||
3. Check: authentication, authorization, input validation, output encoding, encryption, logging
|
|
||||||
|
|
||||||
Write `REFACTOR_DIR/hardening/security.md`:
|
|
||||||
- Vulnerability assessment: location, type, severity, exploit scenario, fix
|
|
||||||
- Security controls review
|
|
||||||
- Compliance check against `security_approach.md`
|
|
||||||
- Recommendations: critical fixes, improvements, hardening
|
|
||||||
|
|
||||||
**Self-verification** (per track):
|
|
||||||
- [ ] All findings are grounded in actual code
|
|
||||||
- [ ] Recommendations are actionable with effort estimates
|
|
||||||
- [ ] All tests still pass after any changes
|
|
||||||
|
|
||||||
**Save action**: Write hardening artifacts
|
|
||||||
+12
-10
@@ -1,20 +1,22 @@
|
|||||||
# Phase 6: Test Synchronization
|
# Phase 5: Test Synchronization
|
||||||
|
|
||||||
**Role**: QA engineer and developer
|
**Role**: QA engineer and developer
|
||||||
**Goal**: Reconcile the test suite with the refactored codebase — remove obsolete tests, update broken tests, add tests for new code
|
**Goal**: Reconcile the test suite with the refactored codebase — remove obsolete tests, update broken tests, add tests for new code
|
||||||
**Constraints**: All tests must pass at the end of this phase. Do not change production code here — only tests.
|
**Constraints**: All tests must pass at the end of this phase. Do not change production code here — only tests.
|
||||||
|
|
||||||
## 6a. Identify Obsolete Tests
|
**Skip condition**: If the run name contains `testability`, skip Phase 5 entirely — no test suite exists yet to synchronize. Proceed directly to Phase 6.
|
||||||
|
|
||||||
|
## 5a. Identify Obsolete Tests
|
||||||
|
|
||||||
1. Compare the pre-refactoring codebase structure (from Phase 0 inventory) with the current state
|
1. Compare the pre-refactoring codebase structure (from Phase 0 inventory) with the current state
|
||||||
2. Find tests that reference removed functions, classes, modules, or endpoints
|
2. Find tests that reference removed functions, classes, modules, or endpoints
|
||||||
3. Find tests that duplicate coverage due to merged/consolidated code
|
3. Find tests that duplicate coverage due to merged/consolidated code
|
||||||
4. Decide per test: **delete** (functionality removed) or **merge** (duplicates)
|
4. Decide per test: **delete** (functionality removed) or **merge** (duplicates)
|
||||||
|
|
||||||
Write `REFACTOR_DIR/test_sync/obsolete_tests.md`:
|
Write `RUN_DIR/test_sync/obsolete_tests.md`:
|
||||||
- Test file, test name, reason (target removed / target merged / duplicate coverage), action taken (deleted / merged into)
|
- Test file, test name, reason (target removed / target merged / duplicate coverage), action taken (deleted / merged into)
|
||||||
|
|
||||||
## 6b. Update Existing Tests
|
## 5b. Update Existing Tests
|
||||||
|
|
||||||
1. Run the full test suite — collect failures and errors
|
1. Run the full test suite — collect failures and errors
|
||||||
2. For each failing test, determine the cause:
|
2. For each failing test, determine the cause:
|
||||||
@@ -24,28 +26,28 @@ Write `REFACTOR_DIR/test_sync/obsolete_tests.md`:
|
|||||||
- Changed data structures → update fixtures and assertions
|
- Changed data structures → update fixtures and assertions
|
||||||
3. Fix each test, re-run to confirm it passes
|
3. Fix each test, re-run to confirm it passes
|
||||||
|
|
||||||
Write `REFACTOR_DIR/test_sync/updated_tests.md`:
|
Write `RUN_DIR/test_sync/updated_tests.md`:
|
||||||
- Test file, test name, change type (import path / signature / assertion / fixture), description of update
|
- Test file, test name, change type (import path / signature / assertion / fixture), description of update
|
||||||
|
|
||||||
## 6c. Add New Tests
|
## 5c. Add New Tests
|
||||||
|
|
||||||
1. Identify new code introduced during Phases 4–5 that lacks test coverage:
|
1. Identify new code introduced during Phase 4 that lacks test coverage:
|
||||||
- New public functions, classes, or modules
|
- New public functions, classes, or modules
|
||||||
- New interfaces or abstractions introduced during decoupling
|
- New interfaces or abstractions introduced during decoupling
|
||||||
- New error handling paths
|
- New error handling paths
|
||||||
2. Write tests following the same patterns and conventions as the existing test suite
|
2. Write tests following the same patterns and conventions as the existing test suite
|
||||||
3. Ensure coverage targets from Phase 3 are maintained or improved
|
3. Ensure coverage targets from Phase 3 are maintained or improved
|
||||||
|
|
||||||
Write `REFACTOR_DIR/test_sync/new_tests.md`:
|
Write `RUN_DIR/test_sync/new_tests.md`:
|
||||||
- Test file, test name, target function/module, coverage type (unit / integration / blackbox)
|
- Test file, test name, target function/module, coverage type (unit / integration / blackbox)
|
||||||
|
|
||||||
**Self-verification**:
|
**Self-verification**:
|
||||||
- [ ] All obsolete tests removed or merged
|
- [ ] All obsolete tests removed or merged
|
||||||
- [ ] All pre-existing tests pass after updates
|
- [ ] All pre-existing tests pass after updates
|
||||||
- [ ] New code from Phases 4–5 has test coverage
|
- [ ] New code from Phase 4 has test coverage
|
||||||
- [ ] Overall coverage meets or exceeds Phase 3 baseline (75% overall, 90% critical paths)
|
- [ ] Overall coverage meets or exceeds Phase 3 baseline (75% overall, 90% critical paths)
|
||||||
- [ ] No tests reference removed or renamed code
|
- [ ] No tests reference removed or renamed code
|
||||||
|
|
||||||
**Save action**: Write test_sync artifacts; implemented tests go into the project's test folder
|
**Save action**: Write test_sync artifacts; implemented tests go into the project's test folder
|
||||||
|
|
||||||
**GATE (BLOCKING)**: ALL tests must pass before proceeding to Phase 7. If tests fail, fix the tests or ask user for guidance.
|
**GATE (BLOCKING)**: ALL tests must pass before proceeding to Phase 6. If tests fail, fix the tests or ask user for guidance.
|
||||||
+12
-10
@@ -1,20 +1,22 @@
|
|||||||
# Phase 7: Final Verification
|
# Phase 6: Final Verification
|
||||||
|
|
||||||
**Role**: QA engineer
|
**Role**: QA engineer
|
||||||
**Goal**: Run all tests end-to-end, compare final metrics against baseline, and confirm the refactoring succeeded
|
**Goal**: Run all tests end-to-end, compare final metrics against baseline, and confirm the refactoring succeeded
|
||||||
**Constraints**: No code changes. If failures are found, go back to the appropriate phase (4/5/6) to fix before retrying.
|
**Constraints**: No code changes. If failures are found, go back to the appropriate phase (4/5) to fix before retrying.
|
||||||
|
|
||||||
## 7a. Run Full Test Suite
|
**Skip condition**: If the run name contains `testability`, skip Phase 6 entirely — no test suite exists yet to verify against. Proceed directly to Phase 7.
|
||||||
|
|
||||||
|
## 6a. Run Full Test Suite
|
||||||
|
|
||||||
1. Run unit tests, integration tests, and blackbox tests
|
1. Run unit tests, integration tests, and blackbox tests
|
||||||
2. Run acceptance tests derived from `acceptance_criteria.md`
|
2. Run acceptance tests derived from `acceptance_criteria.md`
|
||||||
3. Record pass/fail counts and any failures
|
3. Record pass/fail counts and any failures
|
||||||
|
|
||||||
If any test fails:
|
If any test fails:
|
||||||
- Determine whether the failure is a test issue (→ return to Phase 6) or a code issue (→ return to Phase 4/5)
|
- Determine whether the failure is a test issue (→ return to Phase 5) or a code issue (→ return to Phase 4)
|
||||||
- Do NOT proceed until all tests pass
|
- Do NOT proceed until all tests pass
|
||||||
|
|
||||||
## 7b. Capture Final Metrics
|
## 6b. Capture Final Metrics
|
||||||
|
|
||||||
Re-measure all metrics from Phase 0 baseline using the same tools:
|
Re-measure all metrics from Phase 0 baseline using the same tools:
|
||||||
|
|
||||||
@@ -27,14 +29,14 @@ Re-measure all metrics from Phase 0 baseline using the same tools:
|
|||||||
| **Dependencies** | Total count, outdated, security vulnerabilities |
|
| **Dependencies** | Total count, outdated, security vulnerabilities |
|
||||||
| **Build** | Build time, test execution time, deployment time |
|
| **Build** | Build time, test execution time, deployment time |
|
||||||
|
|
||||||
## 7c. Compare Against Baseline
|
## 6c. Compare Against Baseline
|
||||||
|
|
||||||
1. Read `REFACTOR_DIR/baseline_metrics.md`
|
1. Read `RUN_DIR/baseline_metrics.md`
|
||||||
2. Produce a side-by-side comparison: baseline vs final for every metric
|
2. Produce a side-by-side comparison: baseline vs final for every metric
|
||||||
3. Flag any regressions (metrics that got worse)
|
3. Flag any regressions (metrics that got worse)
|
||||||
4. Verify acceptance criteria are met
|
4. Verify acceptance criteria are met
|
||||||
|
|
||||||
Write `REFACTOR_DIR/verification_report.md`:
|
Write `RUN_DIR/verification_report.md`:
|
||||||
- Test results summary: total, passed, failed, skipped
|
- Test results summary: total, passed, failed, skipped
|
||||||
- Metric comparison table: metric, baseline value, final value, delta, status (improved / unchanged / regressed)
|
- Metric comparison table: metric, baseline value, final value, delta, status (improved / unchanged / regressed)
|
||||||
- Acceptance criteria checklist: criterion, status (met / not met), evidence
|
- Acceptance criteria checklist: criterion, status (met / not met), evidence
|
||||||
@@ -46,6 +48,6 @@ Write `REFACTOR_DIR/verification_report.md`:
|
|||||||
- [ ] No critical metric regressions
|
- [ ] No critical metric regressions
|
||||||
- [ ] Metrics are captured with the same tools/methodology as Phase 0
|
- [ ] Metrics are captured with the same tools/methodology as Phase 0
|
||||||
|
|
||||||
**Save action**: Write `REFACTOR_DIR/verification_report.md`
|
**Save action**: Write `RUN_DIR/verification_report.md`
|
||||||
|
|
||||||
**GATE (BLOCKING)**: All tests must pass and no critical regressions. Present verification report to user. Do NOT proceed to Phase 8 until user confirms.
|
**GATE (BLOCKING)**: All tests must pass and no critical regressions. Present verification report to user. Do NOT proceed to Phase 7 until user confirms.
|
||||||
+12
-13
@@ -1,36 +1,35 @@
|
|||||||
# Phase 8: Documentation Update
|
# Phase 7: Documentation Update
|
||||||
|
|
||||||
**Role**: Technical writer
|
**Role**: Technical writer
|
||||||
**Goal**: Update existing `_docs/` artifacts to reflect all changes made during refactoring
|
**Goal**: Update existing `_docs/` artifacts to reflect all changes made during refactoring
|
||||||
**Constraints**: Documentation only — no code changes. Only update docs that are affected by refactoring changes.
|
**Constraints**: Documentation only — no code changes. Only update docs that are affected by refactoring changes.
|
||||||
|
|
||||||
**Skip condition**: If no `_docs/02_document/` directory exists (standalone mode), skip this phase entirely.
|
**Skip condition**: If no `_docs/02_document/` directory exists, skip this phase entirely.
|
||||||
|
|
||||||
## 8a. Identify Affected Documentation
|
## 7a. Identify Affected Documentation
|
||||||
|
|
||||||
1. Review `REFACTOR_DIR/execution_log.md` to list all files changed during Phase 4
|
1. Review `RUN_DIR/execution_log.md` to list all files changed during Phase 4
|
||||||
2. Review any hardening changes from Phase 5
|
2. Review test changes from Phase 5
|
||||||
3. Review test changes from Phase 6
|
3. Map changed files to their corresponding module docs in `_docs/02_document/modules/`
|
||||||
4. Map changed files to their corresponding module docs in `_docs/02_document/modules/`
|
4. Map changed modules to their parent component docs in `_docs/02_document/components/`
|
||||||
5. Map changed modules to their parent component docs in `_docs/02_document/components/`
|
5. Determine if system-level docs need updates (`architecture.md`, `system-flows.md`, `data_model.md`)
|
||||||
6. Determine if system-level docs need updates (`architecture.md`, `system-flows.md`, `data_model.md`)
|
6. Determine if test documentation needs updates (`_docs/02_document/tests/`)
|
||||||
7. Determine if test documentation needs updates (`_docs/02_document/tests/`)
|
|
||||||
|
|
||||||
## 8b. Update Module Documentation
|
## 7b. Update Module Documentation
|
||||||
|
|
||||||
For each module doc affected by refactoring changes:
|
For each module doc affected by refactoring changes:
|
||||||
1. Re-read the current source file
|
1. Re-read the current source file
|
||||||
2. Update the module doc to reflect new/changed interfaces, dependencies, internal logic
|
2. Update the module doc to reflect new/changed interfaces, dependencies, internal logic
|
||||||
3. Remove documentation for deleted code; add documentation for new code
|
3. Remove documentation for deleted code; add documentation for new code
|
||||||
|
|
||||||
## 8c. Update Component Documentation
|
## 7c. Update Component Documentation
|
||||||
|
|
||||||
For each component doc affected:
|
For each component doc affected:
|
||||||
1. Re-read the updated module docs within the component
|
1. Re-read the updated module docs within the component
|
||||||
2. Update inter-module interfaces, dependency graphs, caveats
|
2. Update inter-module interfaces, dependency graphs, caveats
|
||||||
3. Update the component relationship diagram if component boundaries changed
|
3. Update the component relationship diagram if component boundaries changed
|
||||||
|
|
||||||
## 8d. Update System-Level Documentation
|
## 7d. Update System-Level Documentation
|
||||||
|
|
||||||
If structural changes were made (new modules, removed modules, changed interfaces):
|
If structural changes were made (new modules, removed modules, changed interfaces):
|
||||||
1. Update `_docs/02_document/architecture.md` if architecture changed
|
1. Update `_docs/02_document/architecture.md` if architecture changed
|
||||||
@@ -0,0 +1,49 @@
|
|||||||
|
# List of Changes Template
|
||||||
|
|
||||||
|
Save as `RUN_DIR/list-of-changes.md`. Produced during Phase 1 (Discovery).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
# List of Changes
|
||||||
|
|
||||||
|
**Run**: [NN-run-name]
|
||||||
|
**Mode**: [automatic | guided]
|
||||||
|
**Source**: [self-discovered | path/to/input-file.md]
|
||||||
|
**Date**: [YYYY-MM-DD]
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
[1-2 sentence overview of what this refactoring run addresses]
|
||||||
|
|
||||||
|
## Changes
|
||||||
|
|
||||||
|
### C01: [Short Title]
|
||||||
|
- **File(s)**: [file paths, comma-separated]
|
||||||
|
- **Problem**: [what makes this problematic / untestable / coupled]
|
||||||
|
- **Change**: [what to do — behavioral description, not implementation steps]
|
||||||
|
- **Rationale**: [why this change is needed]
|
||||||
|
- **Risk**: [low | medium | high]
|
||||||
|
- **Dependencies**: [other change IDs this depends on, or "None"]
|
||||||
|
|
||||||
|
### C02: [Short Title]
|
||||||
|
- **File(s)**: [file paths]
|
||||||
|
- **Problem**: [description]
|
||||||
|
- **Change**: [description]
|
||||||
|
- **Rationale**: [description]
|
||||||
|
- **Risk**: [low | medium | high]
|
||||||
|
- **Dependencies**: [C01, or "None"]
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Guidelines
|
||||||
|
|
||||||
|
- **Change IDs** use format `C##` (C01, C02, ...) — sequential within the run
|
||||||
|
- Each change should map to one atomic task (1-5 complexity points); split if larger
|
||||||
|
- **File(s)** must reference actual files verified to exist in the codebase
|
||||||
|
- **Problem** describes the current state, not the desired state
|
||||||
|
- **Change** describes what the system should do differently — behavioral, not prescriptive
|
||||||
|
- **Dependencies** reference other change IDs within this list; cross-run dependencies use Jira IDs
|
||||||
|
- In guided mode, the input file entries are validated against actual code and enriched with file paths, risk, and dependencies before writing
|
||||||
|
- In automatic mode, entries are derived from Phase 1 component analysis and Phase 2 research findings
|
||||||
@@ -112,9 +112,6 @@ When the user wants to:
|
|||||||
- Assess or improve an existing solution draft
|
- Assess or improve an existing solution draft
|
||||||
|
|
||||||
**Differentiation from other Skills**:
|
**Differentiation from other Skills**:
|
||||||
- Needs a **visual knowledge graph** → use `research-to-diagram`
|
|
||||||
- Needs **written output** (articles/tutorials) → use `wsy-writer`
|
|
||||||
- Needs **material organization** → use `material-to-markdown`
|
|
||||||
- Needs **research + solution draft** → use this Skill
|
- Needs **research + solution draft** → use this Skill
|
||||||
|
|
||||||
## Stakeholder Perspectives
|
## Stakeholder Perspectives
|
||||||
|
|||||||
@@ -44,31 +44,48 @@ Present a summary:
|
|||||||
|
|
||||||
```
|
```
|
||||||
══════════════════════════════════════
|
══════════════════════════════════════
|
||||||
TEST RESULTS: [N passed, M failed, K skipped]
|
TEST RESULTS: [N passed, M failed, K skipped, E errors]
|
||||||
══════════════════════════════════════
|
══════════════════════════════════════
|
||||||
```
|
```
|
||||||
|
|
||||||
### 4. Handle Outcome
|
**Important**: Collection errors (import failures, missing dependencies, syntax errors) count as failures — they are not "skipped" or ignorable.
|
||||||
|
|
||||||
|
### 4. Diagnose Failures
|
||||||
|
|
||||||
|
Before presenting choices, list every failing/erroring test with a one-line root cause:
|
||||||
|
|
||||||
|
```
|
||||||
|
Failures:
|
||||||
|
1. test_foo.py::test_bar — missing dependency 'netron' (not installed)
|
||||||
|
2. test_baz.py::test_qux — AssertionError: expected 5, got 3 (logic error)
|
||||||
|
3. test_old.py::test_legacy — ImportError: no module 'removed_module' (possibly obsolete)
|
||||||
|
```
|
||||||
|
|
||||||
|
Categorize each as: **missing dependency**, **broken import**, **logic/assertion error**, **possibly obsolete**, or **environment-specific**.
|
||||||
|
|
||||||
|
### 5. Handle Outcome
|
||||||
|
|
||||||
**All tests pass** → return success to the autopilot for auto-chain.
|
**All tests pass** → return success to the autopilot for auto-chain.
|
||||||
|
|
||||||
**Tests fail** → present using Choose format:
|
**Any test fails or errors** → this is a **blocking gate**. Never silently ignore or skip failures. Present using Choose format:
|
||||||
|
|
||||||
```
|
```
|
||||||
══════════════════════════════════════
|
══════════════════════════════════════
|
||||||
TEST RESULTS: [N passed, M failed, K skipped]
|
TEST RESULTS: [N passed, M failed, K skipped, E errors]
|
||||||
══════════════════════════════════════
|
══════════════════════════════════════
|
||||||
A) Fix failing tests and re-run
|
A) Investigate and fix failing tests/code, then re-run
|
||||||
B) Proceed anyway (not recommended)
|
B) Remove obsolete tests (if diagnosis shows they are no longer relevant)
|
||||||
C) Abort — fix manually
|
C) Leave as-is — acknowledged tech debt (not recommended)
|
||||||
|
D) Abort — fix manually
|
||||||
══════════════════════════════════════
|
══════════════════════════════════════
|
||||||
Recommendation: A — fix failures before proceeding
|
Recommendation: A — fix failures before proceeding
|
||||||
══════════════════════════════════════
|
══════════════════════════════════════
|
||||||
```
|
```
|
||||||
|
|
||||||
- If user picks A → attempt to fix failures, then re-run (loop back to step 2)
|
- If user picks A → investigate root causes, attempt fixes, then re-run (loop back to step 2)
|
||||||
- If user picks B → return success with warning to the autopilot
|
- If user picks B → confirm which tests to remove, delete them, then re-run (loop back to step 2)
|
||||||
- If user picks C → return failure to the autopilot
|
- If user picks C → require explicit user confirmation; log as acknowledged tech debt in the report, then return success with warning to the autopilot
|
||||||
|
- If user picks D → return failure to the autopilot
|
||||||
|
|
||||||
## Trigger Conditions
|
## Trigger Conditions
|
||||||
|
|
||||||
|
|||||||
@@ -147,7 +147,7 @@ If TESTS_OUTPUT_DIR already contains files:
|
|||||||
|
|
||||||
## Progress Tracking
|
## Progress Tracking
|
||||||
|
|
||||||
At the start of execution, create a TodoWrite with all three phases. Update status as each phase completes.
|
At the start of execution, create a TodoWrite with all four phases. Update status as each phase completes.
|
||||||
|
|
||||||
## Workflow
|
## Workflow
|
||||||
|
|
||||||
|
|||||||
@@ -85,7 +85,7 @@ Announce the detected mode to the user.
|
|||||||
|
|
||||||
## Phase 2: Requirements Gathering
|
## Phase 2: Requirements Gathering
|
||||||
|
|
||||||
Use the AskQuestion tool for structured input. Adapt based on what Phase 1 found — only ask for what's missing.
|
Use the AskQuestion tool for structured input (fall back to plain-text questions if the tool is unavailable). Adapt based on what Phase 1 found — only ask for what's missing.
|
||||||
|
|
||||||
**Round 1 — Structural:**
|
**Round 1 — Structural:**
|
||||||
|
|
||||||
|
|||||||
+1
-1
@@ -24,4 +24,4 @@ venv
|
|||||||
*.png
|
*.png
|
||||||
|
|
||||||
# Test results
|
# Test results
|
||||||
test-results/
|
tests/test-results/
|
||||||
+3
-1
@@ -11,4 +11,6 @@ RUN pip install --no-cache-dir -r requirements-test.txt
|
|||||||
|
|
||||||
COPY . .
|
COPY . .
|
||||||
|
|
||||||
CMD ["python", "-m", "pytest", "tests/", "--tb=short", "--junitxml=/app/test-results/test-results.xml", "-q"]
|
ENV PYTHONPATH=/app/src
|
||||||
|
|
||||||
|
CMD ["python", "-m", "pytest", "tests/", "--tb=short", "--junitxml=/app/tests/test-results/test-results.xml", "-q"]
|
||||||
|
|||||||
@@ -3,69 +3,7 @@
|
|||||||
## Current Step
|
## Current Step
|
||||||
flow: existing-code
|
flow: existing-code
|
||||||
step: 6
|
step: 6
|
||||||
name: Refactor
|
name: Run Tests
|
||||||
status: in_progress
|
status: failed
|
||||||
sub_step: 0
|
sub_step: 0
|
||||||
retry_count: 0
|
retry_count: 0
|
||||||
|
|
||||||
## Completed Steps
|
|
||||||
|
|
||||||
| Step | Name | Completed | Key Outcome |
|
|
||||||
|------|------|-----------|-------------|
|
|
||||||
| 1 (sub 0) | Document — Discovery | 2026-03-26 | 21 modules, 8 components identified, dependency graph built |
|
|
||||||
| 1 (sub 1) | Document — Module Docs | 2026-03-26 | 21/21 module docs written in 7 batches |
|
|
||||||
| 1 (sub 2) | Document — Component Assembly | 2026-03-26 | 8 components: Core, Security, API&CDN, Data Models, Data Pipeline, Training, Inference, Annotation Queue |
|
|
||||||
| 1 (sub 3) | Document — System Synthesis | 2026-03-26 | architecture.md, system-flows.md (5 flows), data_model.md |
|
|
||||||
| 1 (sub 4) | Document — Verification | 2026-03-26 | 87 entities verified, 0 hallucinations, 5 code bugs found, 3 security issues |
|
|
||||||
| 1 (sub 5) | Document — Solution Extraction | 2026-03-26 | solution.md with component solution tables, testing strategy, deployment architecture |
|
|
||||||
| 1 (sub 6) | Document — Problem Extraction | 2026-03-26 | problem.md, restrictions.md, acceptance_criteria.md, data_parameters.md, security_approach.md |
|
|
||||||
| 1 (sub 7) | Document — Final Report | 2026-03-26 | FINAL_report.md with executive summary, risk observations, artifact index |
|
|
||||||
| 1 | Document | 2026-03-26 | Full 8-step documentation complete: 21 modules, 8 components, 45+ artifacts |
|
|
||||||
| 2 (sub 1) | Test Spec — Phase 1 | 2026-03-26 | Input data analysis: 100 images + ONNX model, 75% coverage (12/16 criteria), above 70% threshold |
|
|
||||||
| 2 (sub 2) | Test Spec — Phase 2 | 2026-03-26 | 55 test scenarios across 5 categories: 32 blackbox, 5 performance, 6 resilience, 7 security, 5 resource limit. 80.6% AC coverage |
|
|
||||||
| 2 (sub 3) | Test Spec — Phase 3 | 2026-03-26 | Test Data Validation Gate PASSED: all 55 tests have input data + quantifiable expected results. 0 removals. Coverage 80.6% |
|
|
||||||
| 2 (sub 4) | Test Spec — Phase 4 | 2026-03-26 | Generated: run-tests-local.sh, run-performance-tests.sh, Dockerfile.test, docker-compose.test.yml, requirements-test.txt |
|
|
||||||
| 2 | Test Spec | 2026-03-26 | Full 4-phase test spec complete: 55 scenarios, 37 expected result mappings, 80.6% coverage, runner scripts generated |
|
|
||||||
| 3 (sub 1t) | Decompose Tests — Infrastructure | 2026-03-26 | Test infrastructure bootstrap task: pytest config, fixtures, conftest, Docker env, constants patching |
|
|
||||||
| 3 (sub 3) | Decompose Tests — Test Tasks | 2026-03-26 | 11 test tasks decomposed from 55 scenarios, grouped by functional area |
|
|
||||||
| 3 (sub 4) | Decompose Tests — Verification | 2026-03-26 | All 29 covered AC verified, no circular deps, no overlaps, dependencies table produced |
|
|
||||||
| 3 | Decompose Tests | 2026-03-26 | 12 tasks total (1 infrastructure + 11 test tasks), 25 complexity points, 2 implementation batches |
|
|
||||||
| 4 | Implement Tests | 2026-03-26 | 12/12 tasks implemented, 76 tests passing, 4 commits across 4 sub-batches |
|
|
||||||
| 5 | Run Tests | 2026-03-26 | 76 passed, 0 failed, 0 skipped. JUnit XML in test-results/ |
|
|
||||||
|
|
||||||
## Key Decisions
|
|
||||||
- Component breakdown: 8 components confirmed by user
|
|
||||||
- Documentation structure: Keep both modules/ and components/ levels (user confirmed)
|
|
||||||
- Skill modifications: Refactor step made optional in existing-code flow; doc update phase added to refactoring skill
|
|
||||||
- Problem extraction documents approved by user without corrections
|
|
||||||
- Test scope: Cover all components testable without external services (option B). Inference test is smoke-only (detects something, no precision). User will provide expected detection results later.
|
|
||||||
- Fixture data: User provided 100 images + labels + ONNX model (81MB)
|
|
||||||
- Test execution: Two modes required — local (no Docker, primary for macOS dev) + Docker (CI/portable). Both run the same pytest suite.
|
|
||||||
- Tracker: jira (project AZ, cloud 1598226f-845f-4705-bcd1-5ed0c82d6119)
|
|
||||||
- Epic: AZ-151 (Blackbox Tests), 12 tasks: AZ-152 to AZ-163
|
|
||||||
- Task grouping: 55 test scenarios grouped into 11 atomic tasks by functional area, all ≤ 3 complexity points
|
|
||||||
- Refactor approach: Pydantic BaseModel config chosen over env vars / dataclass / plain dict. pydantic 2.12.5 already installed via ultralytics.
|
|
||||||
|
|
||||||
## Refactor Progress (Step 6)
|
|
||||||
Work done so far (across multiple sessions):
|
|
||||||
- Replaced module-level path variables + get_paths/reload_config in constants.py with Pydantic Config(BaseModel) — paths defined once as @property
|
|
||||||
- Migrated all 5 production callers (train.py, augmentation.py, exports.py, dataset-visualiser.py, manual_run.py) to constants.config.X
|
|
||||||
- Fixed device=0 bug in exports.py, fixed total_to_process bug in augmentation.py
|
|
||||||
- Simplified test infrastructure: conftest.py apply_constants_patch reduced to single config swap
|
|
||||||
- Updated 7 test files to use constants.config.X
|
|
||||||
- Rewrote E2E test to AAA pattern: Arrange (copy raw data), Act (production functions only: augment_annotations, train_dataset, export_onnx, export_coreml), Assert (7 test methods)
|
|
||||||
- All 83 tests passing (76 non-E2E + 7 E2E)
|
|
||||||
- Refactor test verification phase still pending
|
|
||||||
|
|
||||||
## Last Session
|
|
||||||
date: 2026-03-27
|
|
||||||
ended_at: Step 6 Refactor — implementation done, test verification pending
|
|
||||||
reason: user indicated test phase not yet completed
|
|
||||||
notes: Pydantic config refactor + E2E rewrite implemented. 83/83 tests pass. Formal test verification phase of refactoring still pending.
|
|
||||||
|
|
||||||
## Retry Log
|
|
||||||
| Attempt | Step | Name | SubStep | Failure Reason | Timestamp |
|
|
||||||
|---------|------|------|---------|----------------|-----------|
|
|
||||||
|
|
||||||
## Blockers
|
|
||||||
- none
|
|
||||||
|
|||||||
@@ -4,7 +4,7 @@ services:
|
|||||||
context: .
|
context: .
|
||||||
dockerfile: Dockerfile.test
|
dockerfile: Dockerfile.test
|
||||||
volumes:
|
volumes:
|
||||||
- ./test-results:/app/test-results
|
- ./tests/test-results:/app/tests/test-results
|
||||||
- ./_docs/00_problem/input_data:/app/_docs/00_problem/input_data:ro
|
- ./_docs/00_problem/input_data:/app/_docs/00_problem/input_data:ro
|
||||||
environment:
|
environment:
|
||||||
- PYTHONDONTWRITEBYTECODE=1
|
- PYTHONDONTWRITEBYTECODE=1
|
||||||
|
|||||||
@@ -1,4 +1,5 @@
|
|||||||
[pytest]
|
[pytest]
|
||||||
|
pythonpath = src
|
||||||
markers =
|
markers =
|
||||||
performance: Performance/throughput tests
|
performance: Performance/throughput tests
|
||||||
resilience: Resilience/error handling tests
|
resilience: Resilience/error handling tests
|
||||||
|
|||||||
@@ -8,3 +8,5 @@ msgpack
|
|||||||
PyYAML
|
PyYAML
|
||||||
ultralytics
|
ultralytics
|
||||||
coremltools
|
coremltools
|
||||||
|
boto3
|
||||||
|
netron
|
||||||
|
|||||||
@@ -3,7 +3,7 @@ set -euo pipefail
|
|||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
|
PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
|
||||||
RESULTS_DIR="$PROJECT_ROOT/test-results"
|
RESULTS_DIR="$PROJECT_ROOT/tests/test-results"
|
||||||
|
|
||||||
cleanup() {
|
cleanup() {
|
||||||
if [ -d "$RESULTS_DIR" ]; then
|
if [ -d "$RESULTS_DIR" ]; then
|
||||||
|
|||||||
@@ -3,7 +3,7 @@ set -euo pipefail
|
|||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
|
PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
|
||||||
RESULTS_DIR="$PROJECT_ROOT/test-results"
|
RESULTS_DIR="$PROJECT_ROOT/tests/test-results"
|
||||||
PERF_ONLY=false
|
PERF_ONLY=false
|
||||||
UNIT_ONLY=false
|
UNIT_ONLY=false
|
||||||
|
|
||||||
|
|||||||
@@ -1,6 +1,7 @@
|
|||||||
import json
|
import json
|
||||||
from datetime import datetime, timedelta
|
from datetime import datetime, timedelta
|
||||||
from enum import Enum
|
from enum import Enum
|
||||||
|
from os.path import dirname, join
|
||||||
import msgpack
|
import msgpack
|
||||||
|
|
||||||
|
|
||||||
@@ -20,7 +21,8 @@ class AnnotationClass:
|
|||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def read_json():
|
def read_json():
|
||||||
with open('classes.json', 'r', encoding='utf-8') as f:
|
classes_path = join(dirname(dirname(__file__)), 'classes.json')
|
||||||
|
with open(classes_path, 'r', encoding='utf-8') as f:
|
||||||
j = json.loads(f.read())
|
j = json.loads(f.read())
|
||||||
annotations_dict = {}
|
annotations_dict = {}
|
||||||
for mode in WeatherMode:
|
for mode in WeatherMode:
|
||||||
|
|||||||
+1
-1
@@ -7,7 +7,7 @@ _PROJECT_ROOT = Path(__file__).resolve().parent.parent
|
|||||||
_DATASET_IMAGES = _PROJECT_ROOT / "_docs/00_problem/input_data/dataset/images"
|
_DATASET_IMAGES = _PROJECT_ROOT / "_docs/00_problem/input_data/dataset/images"
|
||||||
_DATASET_LABELS = _PROJECT_ROOT / "_docs/00_problem/input_data/dataset/labels"
|
_DATASET_LABELS = _PROJECT_ROOT / "_docs/00_problem/input_data/dataset/labels"
|
||||||
_ONNX_MODEL = _PROJECT_ROOT / "_docs/00_problem/input_data/azaion.onnx"
|
_ONNX_MODEL = _PROJECT_ROOT / "_docs/00_problem/input_data/azaion.onnx"
|
||||||
_CLASSES_JSON = _PROJECT_ROOT / "classes.json"
|
_CLASSES_JSON = _PROJECT_ROOT / "src" / "classes.json"
|
||||||
_CONFIG_TEST = _PROJECT_ROOT / "config.test.yaml"
|
_CONFIG_TEST = _PROJECT_ROOT / "config.test.yaml"
|
||||||
|
|
||||||
collect_ignore = ["security_test.py", "imagelabel_visualize_test.py"]
|
collect_ignore = ["security_test.py", "imagelabel_visualize_test.py"]
|
||||||
|
|||||||
@@ -1,9 +1,7 @@
|
|||||||
import concurrent.futures
|
import concurrent.futures
|
||||||
import random
|
import random
|
||||||
import shutil
|
import shutil
|
||||||
import sys
|
|
||||||
import time
|
import time
|
||||||
import types
|
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
import numpy as np
|
import numpy as np
|
||||||
@@ -11,13 +9,6 @@ import pytest
|
|||||||
|
|
||||||
from tests.conftest import apply_constants_patch
|
from tests.conftest import apply_constants_patch
|
||||||
|
|
||||||
if "matplotlib" not in sys.modules:
|
|
||||||
_mpl = types.ModuleType("matplotlib")
|
|
||||||
_plt = types.ModuleType("matplotlib.pyplot")
|
|
||||||
_mpl.pyplot = _plt
|
|
||||||
sys.modules["matplotlib"] = _mpl
|
|
||||||
sys.modules["matplotlib.pyplot"] = _plt
|
|
||||||
|
|
||||||
|
|
||||||
def _patch_augmentation_paths(monkeypatch, base: Path):
|
def _patch_augmentation_paths(monkeypatch, base: Path):
|
||||||
apply_constants_patch(monkeypatch, base)
|
apply_constants_patch(monkeypatch, base)
|
||||||
@@ -44,6 +35,7 @@ def _seed():
|
|||||||
def test_pt_aug_01_throughput_ten_images_sixty_seconds(
|
def test_pt_aug_01_throughput_ten_images_sixty_seconds(
|
||||||
tmp_path, monkeypatch, sample_images_labels
|
tmp_path, monkeypatch, sample_images_labels
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
_patch_augmentation_paths(monkeypatch, tmp_path)
|
_patch_augmentation_paths(monkeypatch, tmp_path)
|
||||||
_augment_annotation_with_total(monkeypatch)
|
_augment_annotation_with_total(monkeypatch)
|
||||||
_seed()
|
_seed()
|
||||||
@@ -59,9 +51,11 @@ def test_pt_aug_01_throughput_ten_images_sixty_seconds(
|
|||||||
shutil.copy2(p, img_dir / p.name)
|
shutil.copy2(p, img_dir / p.name)
|
||||||
for p in src_lbl.glob("*.txt"):
|
for p in src_lbl.glob("*.txt"):
|
||||||
shutil.copy2(p, lbl_dir / p.name)
|
shutil.copy2(p, lbl_dir / p.name)
|
||||||
|
# Act
|
||||||
t0 = time.perf_counter()
|
t0 = time.perf_counter()
|
||||||
Augmentator().augment_annotations()
|
Augmentator().augment_annotations()
|
||||||
elapsed = time.perf_counter() - t0
|
elapsed = time.perf_counter() - t0
|
||||||
|
# Assert
|
||||||
assert elapsed <= 60.0
|
assert elapsed <= 60.0
|
||||||
|
|
||||||
|
|
||||||
@@ -69,6 +63,7 @@ def test_pt_aug_01_throughput_ten_images_sixty_seconds(
|
|||||||
def test_pt_aug_02_parallel_at_least_one_point_five_x_faster(
|
def test_pt_aug_02_parallel_at_least_one_point_five_x_faster(
|
||||||
tmp_path, monkeypatch, sample_images_labels
|
tmp_path, monkeypatch, sample_images_labels
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
_patch_augmentation_paths(monkeypatch, tmp_path)
|
_patch_augmentation_paths(monkeypatch, tmp_path)
|
||||||
_augment_annotation_with_total(monkeypatch)
|
_augment_annotation_with_total(monkeypatch)
|
||||||
_seed()
|
_seed()
|
||||||
@@ -97,6 +92,7 @@ def test_pt_aug_02_parallel_at_least_one_point_five_x_faster(
|
|||||||
|
|
||||||
entries = [_E(n) for n in names]
|
entries = [_E(n) for n in names]
|
||||||
|
|
||||||
|
# Act
|
||||||
aug_seq = Augmentator()
|
aug_seq = Augmentator()
|
||||||
aug_seq.total_images_to_process = len(entries)
|
aug_seq.total_images_to_process = len(entries)
|
||||||
t0 = time.perf_counter()
|
t0 = time.perf_counter()
|
||||||
@@ -115,4 +111,5 @@ def test_pt_aug_02_parallel_at_least_one_point_five_x_faster(
|
|||||||
list(ex.map(aug_par.augment_annotation, entries))
|
list(ex.map(aug_par.augment_annotation, entries))
|
||||||
par_elapsed = time.perf_counter() - t0
|
par_elapsed = time.perf_counter() - t0
|
||||||
|
|
||||||
|
# Assert
|
||||||
assert seq_elapsed >= par_elapsed * 1.5
|
assert seq_elapsed >= par_elapsed * 1.5
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
import shutil
|
import shutil
|
||||||
import sys
|
|
||||||
import time
|
import time
|
||||||
import types
|
|
||||||
from os import path as osp
|
from os import path as osp
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
@@ -10,40 +8,6 @@ import pytest
|
|||||||
import constants as c_mod
|
import constants as c_mod
|
||||||
|
|
||||||
|
|
||||||
def _stub_train_dependencies():
|
|
||||||
if getattr(_stub_train_dependencies, "_done", False):
|
|
||||||
return
|
|
||||||
|
|
||||||
def add_mod(name):
|
|
||||||
if name in sys.modules:
|
|
||||||
return sys.modules[name]
|
|
||||||
m = types.ModuleType(name)
|
|
||||||
sys.modules[name] = m
|
|
||||||
return m
|
|
||||||
|
|
||||||
ultra = add_mod("ultralytics")
|
|
||||||
|
|
||||||
class YOLO:
|
|
||||||
pass
|
|
||||||
|
|
||||||
ultra.YOLO = YOLO
|
|
||||||
|
|
||||||
def fake_client(*_a, **_k):
|
|
||||||
return types.SimpleNamespace(
|
|
||||||
upload_fileobj=lambda *_a, **_k: None,
|
|
||||||
download_file=lambda *_a, **_k: None,
|
|
||||||
)
|
|
||||||
|
|
||||||
boto = add_mod("boto3")
|
|
||||||
boto.client = fake_client
|
|
||||||
add_mod("netron")
|
|
||||||
add_mod("requests")
|
|
||||||
_stub_train_dependencies._done = True
|
|
||||||
|
|
||||||
|
|
||||||
_stub_train_dependencies()
|
|
||||||
|
|
||||||
|
|
||||||
def _prepare_form_dataset(
|
def _prepare_form_dataset(
|
||||||
monkeypatch,
|
monkeypatch,
|
||||||
tmp_path,
|
tmp_path,
|
||||||
@@ -82,6 +46,7 @@ def test_pt_dsf_01_dataset_formation_under_thirty_seconds(
|
|||||||
fixture_images_dir,
|
fixture_images_dir,
|
||||||
fixture_labels_dir,
|
fixture_labels_dir,
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
train, today_ds = _prepare_form_dataset(
|
train, today_ds = _prepare_form_dataset(
|
||||||
monkeypatch,
|
monkeypatch,
|
||||||
tmp_path,
|
tmp_path,
|
||||||
@@ -91,7 +56,9 @@ def test_pt_dsf_01_dataset_formation_under_thirty_seconds(
|
|||||||
100,
|
100,
|
||||||
set(),
|
set(),
|
||||||
)
|
)
|
||||||
|
# Act
|
||||||
t0 = time.perf_counter()
|
t0 = time.perf_counter()
|
||||||
train.form_dataset()
|
train.form_dataset()
|
||||||
elapsed = time.perf_counter() - t0
|
elapsed = time.perf_counter() - t0
|
||||||
|
# Assert
|
||||||
assert elapsed <= 30.0
|
assert elapsed <= 30.0
|
||||||
|
|||||||
@@ -1,23 +1,10 @@
|
|||||||
import re
|
import re
|
||||||
import sys
|
|
||||||
import types
|
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
from dto.annotationClass import AnnotationClass
|
from dto.annotationClass import AnnotationClass
|
||||||
|
|
||||||
|
|
||||||
def _stub_train_imports():
|
|
||||||
if getattr(_stub_train_imports, "_done", False):
|
|
||||||
return
|
|
||||||
for _name in ("ultralytics", "boto3", "netron", "requests"):
|
|
||||||
if _name not in sys.modules:
|
|
||||||
sys.modules[_name] = types.ModuleType(_name)
|
|
||||||
sys.modules["ultralytics"].YOLO = type("YOLO", (), {})
|
|
||||||
sys.modules["boto3"].client = lambda *a, **k: None
|
|
||||||
_stub_train_imports._done = True
|
|
||||||
|
|
||||||
|
|
||||||
def _name_lines_under_names(text):
|
def _name_lines_under_names(text):
|
||||||
lines = text.splitlines()
|
lines = text.splitlines()
|
||||||
out = []
|
out = []
|
||||||
@@ -39,7 +26,6 @@ _PLACEHOLDER_RE = re.compile(r"^-\s+Class-\d+\s*$")
|
|||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def data_yaml_text(monkeypatch, tmp_path, fixture_classes_json):
|
def data_yaml_text(monkeypatch, tmp_path, fixture_classes_json):
|
||||||
_stub_train_imports()
|
|
||||||
import train
|
import train
|
||||||
|
|
||||||
import constants as c
|
import constants as c
|
||||||
@@ -52,14 +38,18 @@ def data_yaml_text(monkeypatch, tmp_path, fixture_classes_json):
|
|||||||
|
|
||||||
|
|
||||||
def test_bt_cls_01_base_classes(fixture_classes_json):
|
def test_bt_cls_01_base_classes(fixture_classes_json):
|
||||||
|
# Act
|
||||||
d = AnnotationClass.read_json()
|
d = AnnotationClass.read_json()
|
||||||
norm = {k: d[k] for k in range(17)}
|
norm = {k: d[k] for k in range(17)}
|
||||||
|
# Assert
|
||||||
assert len(norm) == 17
|
assert len(norm) == 17
|
||||||
assert len({v.id for v in norm.values()}) == 17
|
assert len({v.id for v in norm.values()}) == 17
|
||||||
|
|
||||||
|
|
||||||
def test_bt_cls_02_weather_expansion(fixture_classes_json):
|
def test_bt_cls_02_weather_expansion(fixture_classes_json):
|
||||||
|
# Act
|
||||||
d = AnnotationClass.read_json()
|
d = AnnotationClass.read_json()
|
||||||
|
# Assert
|
||||||
assert d[0].name == "ArmorVehicle"
|
assert d[0].name == "ArmorVehicle"
|
||||||
assert d[20].name == "ArmorVehicle(Wint)"
|
assert d[20].name == "ArmorVehicle(Wint)"
|
||||||
assert d[40].name == "ArmorVehicle(Night)"
|
assert d[40].name == "ArmorVehicle(Night)"
|
||||||
@@ -67,11 +57,13 @@ def test_bt_cls_02_weather_expansion(fixture_classes_json):
|
|||||||
|
|
||||||
@pytest.mark.resource_limit
|
@pytest.mark.resource_limit
|
||||||
def test_bt_cls_03_yaml_generation(data_yaml_text):
|
def test_bt_cls_03_yaml_generation(data_yaml_text):
|
||||||
|
# Arrange
|
||||||
text = data_yaml_text
|
text = data_yaml_text
|
||||||
assert "nc: 80" in text
|
# Act
|
||||||
names = _name_lines_under_names(text)
|
names = _name_lines_under_names(text)
|
||||||
placeholders = [ln for ln in names if _PLACEHOLDER_RE.match(ln)]
|
placeholders = [ln for ln in names if _PLACEHOLDER_RE.match(ln)]
|
||||||
named = [ln for ln in names if not _PLACEHOLDER_RE.match(ln)]
|
named = [ln for ln in names if not _PLACEHOLDER_RE.match(ln)]
|
||||||
|
# Assert
|
||||||
assert len(names) == 80
|
assert len(names) == 80
|
||||||
assert len(placeholders) == 29
|
assert len(placeholders) == 29
|
||||||
assert len(named) == 51
|
assert len(named) == 51
|
||||||
@@ -79,5 +71,7 @@ def test_bt_cls_03_yaml_generation(data_yaml_text):
|
|||||||
|
|
||||||
@pytest.mark.resource_limit
|
@pytest.mark.resource_limit
|
||||||
def test_rl_cls_01_total_class_count(data_yaml_text):
|
def test_rl_cls_01_total_class_count(data_yaml_text):
|
||||||
|
# Act
|
||||||
names = _name_lines_under_names(data_yaml_text)
|
names = _name_lines_under_names(data_yaml_text)
|
||||||
|
# Assert
|
||||||
assert len(names) == 80
|
assert len(names) == 80
|
||||||
|
|||||||
@@ -7,7 +7,7 @@ from pathlib import Path
|
|||||||
import msgpack
|
import msgpack
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "annotation-queue"))
|
sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "src" / "annotation-queue"))
|
||||||
from annotation_queue_dto import AnnotationBulkMessage, AnnotationMessage, AnnotationStatus, RoleEnum
|
from annotation_queue_dto import AnnotationBulkMessage, AnnotationMessage, AnnotationStatus, RoleEnum
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -1,16 +1,7 @@
|
|||||||
import random
|
import random
|
||||||
import shutil
|
import shutil
|
||||||
import sys
|
|
||||||
import types
|
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
if "matplotlib" not in sys.modules:
|
|
||||||
_mpl = types.ModuleType("matplotlib")
|
|
||||||
_plt = types.ModuleType("matplotlib.pyplot")
|
|
||||||
_mpl.pyplot = _plt
|
|
||||||
sys.modules["matplotlib"] = _mpl
|
|
||||||
sys.modules["matplotlib.pyplot"] = _plt
|
|
||||||
|
|
||||||
import cv2
|
import cv2
|
||||||
import numpy as np
|
import numpy as np
|
||||||
|
|
||||||
@@ -41,6 +32,7 @@ def _augment_annotation_with_total(monkeypatch):
|
|||||||
def test_bt_aug_01_augment_inner_returns_eight_image_labels(
|
def test_bt_aug_01_augment_inner_returns_eight_image_labels(
|
||||||
tmp_path, monkeypatch, fixture_images_dir, fixture_labels_dir
|
tmp_path, monkeypatch, fixture_images_dir, fixture_labels_dir
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
_patch_augmentation_paths(monkeypatch, tmp_path)
|
_patch_augmentation_paths(monkeypatch, tmp_path)
|
||||||
_seed()
|
_seed()
|
||||||
from augmentation import Augmentator
|
from augmentation import Augmentator
|
||||||
@@ -63,11 +55,14 @@ def test_bt_aug_01_augment_inner_returns_eight_image_labels(
|
|||||||
labels_path=str(proc_lbl),
|
labels_path=str(proc_lbl),
|
||||||
labels=labels,
|
labels=labels,
|
||||||
)
|
)
|
||||||
|
# Act
|
||||||
out = aug.augment_inner(img_ann)
|
out = aug.augment_inner(img_ann)
|
||||||
|
# Assert
|
||||||
assert len(out) == 8
|
assert len(out) == 8
|
||||||
|
|
||||||
|
|
||||||
def test_bt_aug_02_naming_convention(tmp_path, monkeypatch, fixture_images_dir, fixture_labels_dir):
|
def test_bt_aug_02_naming_convention(tmp_path, monkeypatch, fixture_images_dir, fixture_labels_dir):
|
||||||
|
# Arrange
|
||||||
_patch_augmentation_paths(monkeypatch, tmp_path)
|
_patch_augmentation_paths(monkeypatch, tmp_path)
|
||||||
_seed()
|
_seed()
|
||||||
from augmentation import Augmentator
|
from augmentation import Augmentator
|
||||||
@@ -89,7 +84,9 @@ def test_bt_aug_02_naming_convention(tmp_path, monkeypatch, fixture_images_dir,
|
|||||||
labels_path=str(proc_lbl),
|
labels_path=str(proc_lbl),
|
||||||
labels=labels,
|
labels=labels,
|
||||||
)
|
)
|
||||||
|
# Act
|
||||||
out = aug.augment_inner(img_ann)
|
out = aug.augment_inner(img_ann)
|
||||||
|
# Assert
|
||||||
names = [Path(o.image_path).name for o in out]
|
names = [Path(o.image_path).name for o in out]
|
||||||
expected = [f"{stem}.jpg"] + [f"{stem}_{i}.jpg" for i in range(1, 8)]
|
expected = [f"{stem}.jpg"] + [f"{stem}_{i}.jpg" for i in range(1, 8)]
|
||||||
assert names == expected
|
assert names == expected
|
||||||
@@ -110,6 +107,7 @@ def _all_coords_in_unit(labels_list):
|
|||||||
def test_bt_aug_03_all_bbox_coords_in_zero_one(
|
def test_bt_aug_03_all_bbox_coords_in_zero_one(
|
||||||
tmp_path, monkeypatch, fixture_images_dir, fixture_labels_dir
|
tmp_path, monkeypatch, fixture_images_dir, fixture_labels_dir
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
_patch_augmentation_paths(monkeypatch, tmp_path)
|
_patch_augmentation_paths(monkeypatch, tmp_path)
|
||||||
_seed()
|
_seed()
|
||||||
from augmentation import Augmentator
|
from augmentation import Augmentator
|
||||||
@@ -131,7 +129,9 @@ def test_bt_aug_03_all_bbox_coords_in_zero_one(
|
|||||||
labels_path=str(proc_lbl),
|
labels_path=str(proc_lbl),
|
||||||
labels=labels,
|
labels=labels,
|
||||||
)
|
)
|
||||||
|
# Act
|
||||||
out = aug.augment_inner(img_ann)
|
out = aug.augment_inner(img_ann)
|
||||||
|
# Assert
|
||||||
for o in out:
|
for o in out:
|
||||||
for row in o.labels:
|
for row in o.labels:
|
||||||
assert len(row) >= 5
|
assert len(row) >= 5
|
||||||
@@ -139,13 +139,16 @@ def test_bt_aug_03_all_bbox_coords_in_zero_one(
|
|||||||
|
|
||||||
|
|
||||||
def test_bt_aug_04_correct_bboxes_clips_edge(tmp_path, monkeypatch):
|
def test_bt_aug_04_correct_bboxes_clips_edge(tmp_path, monkeypatch):
|
||||||
|
# Arrange
|
||||||
_patch_augmentation_paths(monkeypatch, tmp_path)
|
_patch_augmentation_paths(monkeypatch, tmp_path)
|
||||||
from augmentation import Augmentator
|
from augmentation import Augmentator
|
||||||
|
|
||||||
aug = Augmentator()
|
aug = Augmentator()
|
||||||
m = aug.correct_margin
|
m = aug.correct_margin
|
||||||
inp = [[0.99, 0.5, 0.2, 0.1, 0]]
|
inp = [[0.99, 0.5, 0.2, 0.1, 0]]
|
||||||
|
# Act
|
||||||
res = aug.correct_bboxes(inp)
|
res = aug.correct_bboxes(inp)
|
||||||
|
# Assert
|
||||||
assert len(res) == 1
|
assert len(res) == 1
|
||||||
x, y, w, h, _ = res[0]
|
x, y, w, h, _ = res[0]
|
||||||
hw, hh = 0.5 * w, 0.5 * h
|
hw, hh = 0.5 * w, 0.5 * h
|
||||||
@@ -156,18 +159,22 @@ def test_bt_aug_04_correct_bboxes_clips_edge(tmp_path, monkeypatch):
|
|||||||
|
|
||||||
|
|
||||||
def test_bt_aug_05_tiny_bbox_removed_after_clipping(tmp_path, monkeypatch):
|
def test_bt_aug_05_tiny_bbox_removed_after_clipping(tmp_path, monkeypatch):
|
||||||
|
# Arrange
|
||||||
_patch_augmentation_paths(monkeypatch, tmp_path)
|
_patch_augmentation_paths(monkeypatch, tmp_path)
|
||||||
from augmentation import Augmentator
|
from augmentation import Augmentator
|
||||||
|
|
||||||
aug = Augmentator()
|
aug = Augmentator()
|
||||||
inp = [[0.995, 0.5, 0.01, 0.5, 0]]
|
inp = [[0.995, 0.5, 0.01, 0.5, 0]]
|
||||||
|
# Act
|
||||||
res = aug.correct_bboxes(inp)
|
res = aug.correct_bboxes(inp)
|
||||||
|
# Assert
|
||||||
assert res == []
|
assert res == []
|
||||||
|
|
||||||
|
|
||||||
def test_bt_aug_06_empty_label_eight_outputs_empty_labels(
|
def test_bt_aug_06_empty_label_eight_outputs_empty_labels(
|
||||||
tmp_path, monkeypatch, fixture_images_dir
|
tmp_path, monkeypatch, fixture_images_dir
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
_patch_augmentation_paths(monkeypatch, tmp_path)
|
_patch_augmentation_paths(monkeypatch, tmp_path)
|
||||||
_seed()
|
_seed()
|
||||||
from augmentation import Augmentator
|
from augmentation import Augmentator
|
||||||
@@ -187,7 +194,9 @@ def test_bt_aug_06_empty_label_eight_outputs_empty_labels(
|
|||||||
labels_path=str(proc_lbl),
|
labels_path=str(proc_lbl),
|
||||||
labels=[],
|
labels=[],
|
||||||
)
|
)
|
||||||
|
# Act
|
||||||
out = aug.augment_inner(img_ann)
|
out = aug.augment_inner(img_ann)
|
||||||
|
# Assert
|
||||||
assert len(out) == 8
|
assert len(out) == 8
|
||||||
for o in out:
|
for o in out:
|
||||||
assert o.labels == []
|
assert o.labels == []
|
||||||
@@ -196,6 +205,7 @@ def test_bt_aug_06_empty_label_eight_outputs_empty_labels(
|
|||||||
def test_bt_aug_07_full_pipeline_five_images_forty_outputs(
|
def test_bt_aug_07_full_pipeline_five_images_forty_outputs(
|
||||||
tmp_path, monkeypatch, sample_images_labels
|
tmp_path, monkeypatch, sample_images_labels
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
_patch_augmentation_paths(monkeypatch, tmp_path)
|
_patch_augmentation_paths(monkeypatch, tmp_path)
|
||||||
_augment_annotation_with_total(monkeypatch)
|
_augment_annotation_with_total(monkeypatch)
|
||||||
_seed()
|
_seed()
|
||||||
@@ -211,7 +221,9 @@ def test_bt_aug_07_full_pipeline_five_images_forty_outputs(
|
|||||||
shutil.copy2(p, img_dir / p.name)
|
shutil.copy2(p, img_dir / p.name)
|
||||||
for p in src_lbl.glob("*.txt"):
|
for p in src_lbl.glob("*.txt"):
|
||||||
shutil.copy2(p, lbl_dir / p.name)
|
shutil.copy2(p, lbl_dir / p.name)
|
||||||
|
# Act
|
||||||
Augmentator().augment_annotations()
|
Augmentator().augment_annotations()
|
||||||
|
# Assert
|
||||||
proc_img = Path(c.config.processed_images_dir)
|
proc_img = Path(c.config.processed_images_dir)
|
||||||
proc_lbl = Path(c.config.processed_labels_dir)
|
proc_lbl = Path(c.config.processed_labels_dir)
|
||||||
assert len(list(proc_img.glob("*.jpg"))) == 40
|
assert len(list(proc_img.glob("*.jpg"))) == 40
|
||||||
@@ -219,6 +231,7 @@ def test_bt_aug_07_full_pipeline_five_images_forty_outputs(
|
|||||||
|
|
||||||
|
|
||||||
def test_bt_aug_08_skips_already_processed(tmp_path, monkeypatch, sample_images_labels):
|
def test_bt_aug_08_skips_already_processed(tmp_path, monkeypatch, sample_images_labels):
|
||||||
|
# Arrange
|
||||||
_patch_augmentation_paths(monkeypatch, tmp_path)
|
_patch_augmentation_paths(monkeypatch, tmp_path)
|
||||||
_augment_annotation_with_total(monkeypatch)
|
_augment_annotation_with_total(monkeypatch)
|
||||||
_seed()
|
_seed()
|
||||||
@@ -244,7 +257,9 @@ def test_bt_aug_08_skips_already_processed(tmp_path, monkeypatch, sample_images_
|
|||||||
dst = proc_img / p.name
|
dst = proc_img / p.name
|
||||||
shutil.copy2(p, dst)
|
shutil.copy2(p, dst)
|
||||||
markers.append(dst.read_bytes())
|
markers.append(dst.read_bytes())
|
||||||
|
# Act
|
||||||
Augmentator().augment_annotations()
|
Augmentator().augment_annotations()
|
||||||
|
# Assert
|
||||||
after_jpgs = list(proc_img.glob("*.jpg"))
|
after_jpgs = list(proc_img.glob("*.jpg"))
|
||||||
assert len(after_jpgs) == 19
|
assert len(after_jpgs) == 19
|
||||||
assert len(list(proc_lbl.glob("*.txt"))) == 16
|
assert len(list(proc_lbl.glob("*.txt"))) == 16
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
import random
|
import random
|
||||||
import shutil
|
import shutil
|
||||||
import sys
|
|
||||||
import types
|
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from types import SimpleNamespace
|
from types import SimpleNamespace
|
||||||
|
|
||||||
@@ -11,13 +9,6 @@ import pytest
|
|||||||
|
|
||||||
from tests.conftest import apply_constants_patch
|
from tests.conftest import apply_constants_patch
|
||||||
|
|
||||||
if "matplotlib" not in sys.modules:
|
|
||||||
_mpl = types.ModuleType("matplotlib")
|
|
||||||
_plt = types.ModuleType("matplotlib.pyplot")
|
|
||||||
_mpl.pyplot = _plt
|
|
||||||
sys.modules["matplotlib"] = _mpl
|
|
||||||
sys.modules["matplotlib.pyplot"] = _plt
|
|
||||||
|
|
||||||
|
|
||||||
def _patch_augmentation_paths(monkeypatch, base: Path):
|
def _patch_augmentation_paths(monkeypatch, base: Path):
|
||||||
apply_constants_patch(monkeypatch, base)
|
apply_constants_patch(monkeypatch, base)
|
||||||
@@ -44,6 +35,7 @@ def _seed():
|
|||||||
def test_rt_aug_01_corrupted_image_skipped(
|
def test_rt_aug_01_corrupted_image_skipped(
|
||||||
tmp_path, monkeypatch, fixture_images_dir, fixture_labels_dir
|
tmp_path, monkeypatch, fixture_images_dir, fixture_labels_dir
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
_patch_augmentation_paths(monkeypatch, tmp_path)
|
_patch_augmentation_paths(monkeypatch, tmp_path)
|
||||||
_augment_annotation_with_total(monkeypatch)
|
_augment_annotation_with_total(monkeypatch)
|
||||||
_seed()
|
_seed()
|
||||||
@@ -59,13 +51,16 @@ def test_rt_aug_01_corrupted_image_skipped(
|
|||||||
shutil.copy2(fixture_labels_dir / f"{stem}.txt", lbl_dir / f"{stem}.txt")
|
shutil.copy2(fixture_labels_dir / f"{stem}.txt", lbl_dir / f"{stem}.txt")
|
||||||
raw = (fixture_images_dir / f"{stem}.jpg").read_bytes()[:200]
|
raw = (fixture_images_dir / f"{stem}.jpg").read_bytes()[:200]
|
||||||
(img_dir / "corrupted_trunc.jpg").write_bytes(raw)
|
(img_dir / "corrupted_trunc.jpg").write_bytes(raw)
|
||||||
|
# Act
|
||||||
Augmentator().augment_annotations()
|
Augmentator().augment_annotations()
|
||||||
|
# Assert
|
||||||
proc_img = Path(c.config.processed_images_dir)
|
proc_img = Path(c.config.processed_images_dir)
|
||||||
assert len(list(proc_img.glob("*.jpg"))) == 8
|
assert len(list(proc_img.glob("*.jpg"))) == 8
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.resilience
|
@pytest.mark.resilience
|
||||||
def test_rt_aug_02_missing_label_no_crash(tmp_path, monkeypatch, fixture_images_dir):
|
def test_rt_aug_02_missing_label_no_crash(tmp_path, monkeypatch, fixture_images_dir):
|
||||||
|
# Arrange
|
||||||
_patch_augmentation_paths(monkeypatch, tmp_path)
|
_patch_augmentation_paths(monkeypatch, tmp_path)
|
||||||
_augment_annotation_with_total(monkeypatch)
|
_augment_annotation_with_total(monkeypatch)
|
||||||
import constants as c
|
import constants as c
|
||||||
@@ -79,7 +74,9 @@ def test_rt_aug_02_missing_label_no_crash(tmp_path, monkeypatch, fixture_images_
|
|||||||
shutil.copy2(sorted(fixture_images_dir.glob("*.jpg"))[0], img_dir / f"{stem}.jpg")
|
shutil.copy2(sorted(fixture_images_dir.glob("*.jpg"))[0], img_dir / f"{stem}.jpg")
|
||||||
aug = Augmentator()
|
aug = Augmentator()
|
||||||
aug.total_images_to_process = 1
|
aug.total_images_to_process = 1
|
||||||
|
# Act
|
||||||
aug.augment_annotation(SimpleNamespace(name=f"{stem}.jpg"))
|
aug.augment_annotation(SimpleNamespace(name=f"{stem}.jpg"))
|
||||||
|
# Assert
|
||||||
assert len(list(Path(c.config.processed_images_dir).glob("*.jpg"))) == 0
|
assert len(list(Path(c.config.processed_images_dir).glob("*.jpg"))) == 0
|
||||||
|
|
||||||
|
|
||||||
@@ -87,6 +84,7 @@ def test_rt_aug_02_missing_label_no_crash(tmp_path, monkeypatch, fixture_images_
|
|||||||
def test_rt_aug_03_narrow_bbox_fewer_or_eight_variants(
|
def test_rt_aug_03_narrow_bbox_fewer_or_eight_variants(
|
||||||
tmp_path, monkeypatch, fixture_images_dir
|
tmp_path, monkeypatch, fixture_images_dir
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
_patch_augmentation_paths(monkeypatch, tmp_path)
|
_patch_augmentation_paths(monkeypatch, tmp_path)
|
||||||
_seed()
|
_seed()
|
||||||
from augmentation import Augmentator
|
from augmentation import Augmentator
|
||||||
@@ -107,7 +105,9 @@ def test_rt_aug_03_narrow_bbox_fewer_or_eight_variants(
|
|||||||
labels_path=str(proc_lbl),
|
labels_path=str(proc_lbl),
|
||||||
labels=labels,
|
labels=labels,
|
||||||
)
|
)
|
||||||
|
# Act
|
||||||
out = aug.augment_inner(img_ann)
|
out = aug.augment_inner(img_ann)
|
||||||
|
# Assert
|
||||||
assert 1 <= len(out) <= 8
|
assert 1 <= len(out) <= 8
|
||||||
|
|
||||||
|
|
||||||
@@ -115,6 +115,7 @@ def test_rt_aug_03_narrow_bbox_fewer_or_eight_variants(
|
|||||||
def test_rl_aug_01_augment_inner_exactly_eight_outputs(
|
def test_rl_aug_01_augment_inner_exactly_eight_outputs(
|
||||||
tmp_path, monkeypatch, fixture_images_dir, fixture_labels_dir
|
tmp_path, monkeypatch, fixture_images_dir, fixture_labels_dir
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
_patch_augmentation_paths(monkeypatch, tmp_path)
|
_patch_augmentation_paths(monkeypatch, tmp_path)
|
||||||
_seed()
|
_seed()
|
||||||
from augmentation import Augmentator
|
from augmentation import Augmentator
|
||||||
@@ -136,5 +137,7 @@ def test_rl_aug_01_augment_inner_exactly_eight_outputs(
|
|||||||
labels_path=str(proc_lbl),
|
labels_path=str(proc_lbl),
|
||||||
labels=labels,
|
labels=labels,
|
||||||
)
|
)
|
||||||
|
# Act
|
||||||
out = aug.augment_inner(img_ann)
|
out = aug.augment_inner(img_ann)
|
||||||
|
# Assert
|
||||||
assert len(out) == 8
|
assert len(out) == 8
|
||||||
|
|||||||
@@ -1,6 +1,4 @@
|
|||||||
import shutil
|
import shutil
|
||||||
import sys
|
|
||||||
import types
|
|
||||||
from os import path as osp
|
from os import path as osp
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
@@ -9,40 +7,6 @@ import pytest
|
|||||||
import constants as c_mod
|
import constants as c_mod
|
||||||
|
|
||||||
|
|
||||||
def _stub_train_dependencies():
|
|
||||||
if getattr(_stub_train_dependencies, "_done", False):
|
|
||||||
return
|
|
||||||
|
|
||||||
def add_mod(name):
|
|
||||||
if name in sys.modules:
|
|
||||||
return sys.modules[name]
|
|
||||||
m = types.ModuleType(name)
|
|
||||||
sys.modules[name] = m
|
|
||||||
return m
|
|
||||||
|
|
||||||
ultra = add_mod("ultralytics")
|
|
||||||
|
|
||||||
class YOLO:
|
|
||||||
pass
|
|
||||||
|
|
||||||
ultra.YOLO = YOLO
|
|
||||||
|
|
||||||
def fake_client(*_a, **_k):
|
|
||||||
return types.SimpleNamespace(
|
|
||||||
upload_fileobj=lambda *_a, **_k: None,
|
|
||||||
download_file=lambda *_a, **_k: None,
|
|
||||||
)
|
|
||||||
|
|
||||||
boto = add_mod("boto3")
|
|
||||||
boto.client = fake_client
|
|
||||||
add_mod("netron")
|
|
||||||
add_mod("requests")
|
|
||||||
_stub_train_dependencies._done = True
|
|
||||||
|
|
||||||
|
|
||||||
_stub_train_dependencies()
|
|
||||||
|
|
||||||
|
|
||||||
def _prepare_form_dataset(
|
def _prepare_form_dataset(
|
||||||
monkeypatch,
|
monkeypatch,
|
||||||
tmp_path,
|
tmp_path,
|
||||||
@@ -84,6 +48,7 @@ def test_bt_dsf_01_split_ratio_70_20_10(
|
|||||||
fixture_images_dir,
|
fixture_images_dir,
|
||||||
fixture_labels_dir,
|
fixture_labels_dir,
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
train, today_ds = _prepare_form_dataset(
|
train, today_ds = _prepare_form_dataset(
|
||||||
monkeypatch,
|
monkeypatch,
|
||||||
tmp_path,
|
tmp_path,
|
||||||
@@ -93,7 +58,9 @@ def test_bt_dsf_01_split_ratio_70_20_10(
|
|||||||
100,
|
100,
|
||||||
set(),
|
set(),
|
||||||
)
|
)
|
||||||
|
# Act
|
||||||
train.form_dataset()
|
train.form_dataset()
|
||||||
|
# Assert
|
||||||
assert _count_jpg(Path(today_ds, "train", "images")) == 70
|
assert _count_jpg(Path(today_ds, "train", "images")) == 70
|
||||||
assert _count_jpg(Path(today_ds, "valid", "images")) == 20
|
assert _count_jpg(Path(today_ds, "valid", "images")) == 20
|
||||||
assert _count_jpg(Path(today_ds, "test", "images")) == 10
|
assert _count_jpg(Path(today_ds, "test", "images")) == 10
|
||||||
@@ -106,6 +73,7 @@ def test_bt_dsf_02_six_subdirectories(
|
|||||||
fixture_images_dir,
|
fixture_images_dir,
|
||||||
fixture_labels_dir,
|
fixture_labels_dir,
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
train, today_ds = _prepare_form_dataset(
|
train, today_ds = _prepare_form_dataset(
|
||||||
monkeypatch,
|
monkeypatch,
|
||||||
tmp_path,
|
tmp_path,
|
||||||
@@ -115,7 +83,9 @@ def test_bt_dsf_02_six_subdirectories(
|
|||||||
100,
|
100,
|
||||||
set(),
|
set(),
|
||||||
)
|
)
|
||||||
|
# Act
|
||||||
train.form_dataset()
|
train.form_dataset()
|
||||||
|
# Assert
|
||||||
base = Path(today_ds)
|
base = Path(today_ds)
|
||||||
assert (base / "train" / "images").is_dir()
|
assert (base / "train" / "images").is_dir()
|
||||||
assert (base / "train" / "labels").is_dir()
|
assert (base / "train" / "labels").is_dir()
|
||||||
@@ -132,6 +102,7 @@ def test_bt_dsf_03_total_files_one_hundred(
|
|||||||
fixture_images_dir,
|
fixture_images_dir,
|
||||||
fixture_labels_dir,
|
fixture_labels_dir,
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
train, today_ds = _prepare_form_dataset(
|
train, today_ds = _prepare_form_dataset(
|
||||||
monkeypatch,
|
monkeypatch,
|
||||||
tmp_path,
|
tmp_path,
|
||||||
@@ -141,7 +112,9 @@ def test_bt_dsf_03_total_files_one_hundred(
|
|||||||
100,
|
100,
|
||||||
set(),
|
set(),
|
||||||
)
|
)
|
||||||
|
# Act
|
||||||
train.form_dataset()
|
train.form_dataset()
|
||||||
|
# Assert
|
||||||
n = (
|
n = (
|
||||||
_count_jpg(Path(today_ds, "train", "images"))
|
_count_jpg(Path(today_ds, "train", "images"))
|
||||||
+ _count_jpg(Path(today_ds, "valid", "images"))
|
+ _count_jpg(Path(today_ds, "valid", "images"))
|
||||||
@@ -157,6 +130,7 @@ def test_bt_dsf_04_corrupted_labels_quarantined(
|
|||||||
fixture_images_dir,
|
fixture_images_dir,
|
||||||
fixture_labels_dir,
|
fixture_labels_dir,
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
stems = [p.stem for p in sorted(fixture_images_dir.glob("*.jpg"))[:100]]
|
stems = [p.stem for p in sorted(fixture_images_dir.glob("*.jpg"))[:100]]
|
||||||
corrupt = set(stems[:5])
|
corrupt = set(stems[:5])
|
||||||
train, today_ds = _prepare_form_dataset(
|
train, today_ds = _prepare_form_dataset(
|
||||||
@@ -168,7 +142,9 @@ def test_bt_dsf_04_corrupted_labels_quarantined(
|
|||||||
100,
|
100,
|
||||||
corrupt,
|
corrupt,
|
||||||
)
|
)
|
||||||
|
# Act
|
||||||
train.form_dataset()
|
train.form_dataset()
|
||||||
|
# Assert
|
||||||
split_total = (
|
split_total = (
|
||||||
_count_jpg(Path(today_ds, "train", "images"))
|
_count_jpg(Path(today_ds, "train", "images"))
|
||||||
+ _count_jpg(Path(today_ds, "valid", "images"))
|
+ _count_jpg(Path(today_ds, "valid", "images"))
|
||||||
@@ -187,6 +163,7 @@ def test_rt_dsf_01_empty_processed_no_crash(
|
|||||||
fixture_images_dir,
|
fixture_images_dir,
|
||||||
fixture_labels_dir,
|
fixture_labels_dir,
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
train, today_ds = _prepare_form_dataset(
|
train, today_ds = _prepare_form_dataset(
|
||||||
monkeypatch,
|
monkeypatch,
|
||||||
tmp_path,
|
tmp_path,
|
||||||
@@ -196,12 +173,15 @@ def test_rt_dsf_01_empty_processed_no_crash(
|
|||||||
0,
|
0,
|
||||||
set(),
|
set(),
|
||||||
)
|
)
|
||||||
|
# Act
|
||||||
train.form_dataset()
|
train.form_dataset()
|
||||||
|
# Assert
|
||||||
assert Path(today_ds).is_dir()
|
assert Path(today_ds).is_dir()
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.resource_limit
|
@pytest.mark.resource_limit
|
||||||
def test_rl_dsf_01_split_ratios_sum_hundred():
|
def test_rl_dsf_01_split_ratios_sum_hundred():
|
||||||
|
# Assert
|
||||||
import train
|
import train
|
||||||
|
|
||||||
assert train.train_set + train.valid_set + train.test_set == 100
|
assert train.train_set + train.valid_set + train.test_set == 100
|
||||||
@@ -215,6 +195,7 @@ def test_rl_dsf_02_no_filename_duplication_across_splits(
|
|||||||
fixture_images_dir,
|
fixture_images_dir,
|
||||||
fixture_labels_dir,
|
fixture_labels_dir,
|
||||||
):
|
):
|
||||||
|
# Arrange
|
||||||
train, today_ds = _prepare_form_dataset(
|
train, today_ds = _prepare_form_dataset(
|
||||||
monkeypatch,
|
monkeypatch,
|
||||||
tmp_path,
|
tmp_path,
|
||||||
@@ -224,7 +205,9 @@ def test_rl_dsf_02_no_filename_duplication_across_splits(
|
|||||||
100,
|
100,
|
||||||
set(),
|
set(),
|
||||||
)
|
)
|
||||||
|
# Act
|
||||||
train.form_dataset()
|
train.form_dataset()
|
||||||
|
# Assert
|
||||||
base = Path(today_ds)
|
base = Path(today_ds)
|
||||||
names = []
|
names = []
|
||||||
for split in ("train", "valid", "test"):
|
for split in ("train", "valid", "test"):
|
||||||
|
|||||||
@@ -1,39 +1,40 @@
|
|||||||
import sys
|
|
||||||
import types
|
|
||||||
|
|
||||||
for _name in ("ultralytics", "boto3", "netron", "requests"):
|
|
||||||
if _name not in sys.modules:
|
|
||||||
sys.modules[_name] = types.ModuleType(_name)
|
|
||||||
sys.modules["ultralytics"].YOLO = type("YOLO", (), {})
|
|
||||||
sys.modules["boto3"].client = lambda *a, **k: None
|
|
||||||
|
|
||||||
from train import check_label
|
from train import check_label
|
||||||
|
|
||||||
|
|
||||||
def test_bt_lbl_01_valid_label_returns_true(tmp_path):
|
def test_bt_lbl_01_valid_label_returns_true(tmp_path):
|
||||||
|
# Arrange
|
||||||
p = tmp_path / "a.txt"
|
p = tmp_path / "a.txt"
|
||||||
p.write_text("0 0.5 0.5 0.1 0.1", encoding="utf-8")
|
p.write_text("0 0.5 0.5 0.1 0.1", encoding="utf-8")
|
||||||
|
# Assert
|
||||||
assert check_label(str(p)) is True
|
assert check_label(str(p)) is True
|
||||||
|
|
||||||
|
|
||||||
def test_bt_lbl_02_x_gt_one_returns_false(tmp_path):
|
def test_bt_lbl_02_x_gt_one_returns_false(tmp_path):
|
||||||
|
# Arrange
|
||||||
p = tmp_path / "a.txt"
|
p = tmp_path / "a.txt"
|
||||||
p.write_text("0 1.5 0.5 0.1 0.1", encoding="utf-8")
|
p.write_text("0 1.5 0.5 0.1 0.1", encoding="utf-8")
|
||||||
|
# Assert
|
||||||
assert check_label(str(p)) is False
|
assert check_label(str(p)) is False
|
||||||
|
|
||||||
|
|
||||||
def test_bt_lbl_03_height_gt_one_returns_false(tmp_path):
|
def test_bt_lbl_03_height_gt_one_returns_false(tmp_path):
|
||||||
|
# Arrange
|
||||||
p = tmp_path / "a.txt"
|
p = tmp_path / "a.txt"
|
||||||
p.write_text("0 0.5 0.5 0.1 1.2", encoding="utf-8")
|
p.write_text("0 0.5 0.5 0.1 1.2", encoding="utf-8")
|
||||||
|
# Assert
|
||||||
assert check_label(str(p)) is False
|
assert check_label(str(p)) is False
|
||||||
|
|
||||||
|
|
||||||
def test_bt_lbl_04_missing_file_returns_false(tmp_path):
|
def test_bt_lbl_04_missing_file_returns_false(tmp_path):
|
||||||
|
# Arrange
|
||||||
p = tmp_path / "missing.txt"
|
p = tmp_path / "missing.txt"
|
||||||
|
# Assert
|
||||||
assert check_label(str(p)) is False
|
assert check_label(str(p)) is False
|
||||||
|
|
||||||
|
|
||||||
def test_bt_lbl_05_multiline_one_corrupted_returns_false(tmp_path):
|
def test_bt_lbl_05_multiline_one_corrupted_returns_false(tmp_path):
|
||||||
|
# Arrange
|
||||||
p = tmp_path / "a.txt"
|
p = tmp_path / "a.txt"
|
||||||
p.write_text("0 0.5 0.5 0.1 0.1\n3 0.5 0.5 0.1 1.5", encoding="utf-8")
|
p.write_text("0 0.5 0.5 0.1 0.1\n3 0.5 0.5 0.1 1.5", encoding="utf-8")
|
||||||
|
# Assert
|
||||||
assert check_label(str(p)) is False
|
assert check_label(str(p)) is False
|
||||||
|
|||||||
@@ -1,24 +1,10 @@
|
|||||||
import sys
|
|
||||||
import types
|
|
||||||
import importlib
|
|
||||||
import shutil
|
import shutil
|
||||||
from os import path as osp
|
from os import path
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
for _n in ("boto3", "netron", "requests"):
|
|
||||||
if _n not in sys.modules:
|
|
||||||
sys.modules[_n] = types.ModuleType(_n)
|
|
||||||
|
|
||||||
for _k in [k for k in sys.modules if k == "ultralytics" or k.startswith("ultralytics.")]:
|
|
||||||
del sys.modules[_k]
|
|
||||||
from ultralytics import YOLO
|
from ultralytics import YOLO
|
||||||
|
|
||||||
for _m in ("exports", "train"):
|
|
||||||
if _m in sys.modules:
|
|
||||||
importlib.reload(sys.modules[_m])
|
|
||||||
|
|
||||||
import constants as c
|
import constants as c
|
||||||
import train as train_mod
|
import train as train_mod
|
||||||
import exports as exports_mod
|
import exports as exports_mod
|
||||||
@@ -56,7 +42,7 @@ def e2e_result(tmp_path_factory):
|
|||||||
exports_mod.export_onnx(c.config.current_pt_model)
|
exports_mod.export_onnx(c.config.current_pt_model)
|
||||||
exports_mod.export_coreml(c.config.current_pt_model)
|
exports_mod.export_coreml(c.config.current_pt_model)
|
||||||
|
|
||||||
today_ds = osp.join(c.config.datasets_dir, train_mod.today_folder)
|
today_ds = path.join(c.config.datasets_dir, train_mod.today_folder)
|
||||||
|
|
||||||
yield {
|
yield {
|
||||||
"today_dataset": today_ds,
|
"today_dataset": today_ds,
|
||||||
|
|||||||
Reference in New Issue
Block a user