# Document Skill — Full / Focus Area / Resume Workflow

Covers three related modes that share the same 8-step pipeline:

- **Full**: entire codebase, no prior state
- **Focus Area**: scoped to a directory subtree + transitive dependencies
- **Resume**: continue from `state.json` checkpoint

## Prerequisite Checks

1. If `_docs/` already exists and contains files AND mode is **Full**, ASK user: **overwrite, merge, or write to `_docs_generated/` instead?**
2. Create DOCUMENT_DIR, SOLUTION_DIR, and PROBLEM_DIR if they don't exist
3. If DOCUMENT_DIR contains a `state.json`, offer to **resume from last checkpoint or start fresh**
4. If FOCUS_DIR is set, verify the directory exists and contains source files — **STOP if missing**

## Progress Tracking

Create a TodoWrite with all steps (0 through 7). Update status as each step completes.

## Steps

### Step 0: Codebase Discovery

**Role**: Code analyst
**Goal**: Build a complete map of the codebase (or targeted subtree) before analyzing any code.

**Focus Area scoping**: if FOCUS_DIR is set, limit the scan to that directory subtree. Still identify transitive dependencies outside FOCUS_DIR (modules that FOCUS_DIR imports) and include them in the processing order, but skip modules that are neither inside FOCUS_DIR nor dependencies of it.

Scan and catalog:

1. Directory tree (ignore `node_modules`, `.git`, `__pycache__`, `bin/`, `obj/`, build artifacts)
2. Language detection from file extensions and config files
3. Package manifests: `package.json`, `requirements.txt`, `pyproject.toml`, `*.csproj`, `Cargo.toml`, `go.mod`
4. Config files: `Dockerfile`, `docker-compose.yml`, `.env.example`, CI/CD configs (`.github/workflows/`, `.gitlab-ci.yml`, `azure-pipelines.yml`)
5. Entry points: `main.*`, `app.*`, `index.*`, `Program.*`, startup scripts
6. Test structure: test directories, test frameworks, test runner configs
7. Existing documentation: README, `docs/`, wiki references, inline doc coverage
8. **Dependency graph**: build a module-level dependency graph by analyzing imports/references. Identify:
   - Leaf modules (no internal dependencies)
   - Entry points (no internal dependents)
   - Cycles (mark for grouped analysis)
   - Topological processing order
   - If FOCUS_DIR: mark which modules are in-scope vs dependency-only

**Save**: `DOCUMENT_DIR/00_discovery.md` containing:

- Directory tree (concise, relevant directories only)
- Tech stack summary table (language, framework, database, infra)
- Dependency graph (textual list + Mermaid diagram)
- Topological processing order
- Entry points and leaf modules

**Save**: `DOCUMENT_DIR/state.json` with initial state (see `references/artifacts.md` for format).

---

### Step 1: Module-Level Documentation

**Role**: Code analyst
**Goal**: Document every identified module individually, processing in topological order (leaves first).

**Batched processing**: process modules in batches of ~5 (sorted by topological order). After each batch: save all module docs, update `state.json`, present a progress summary. Between batches, evaluate whether to suggest a session break.

For each module in topological order:

1. **Read**: read the module's source code. Assess complexity and what context is needed.
2. **Gather context**: collect already-written docs of this module's dependencies (available because of bottom-up order). Note external library usage.
3. **Write module doc** with these sections:
   - **Purpose**: one-sentence responsibility
   - **Public interface**: exported functions/classes/methods with signatures, input/output types
   - **Internal logic**: key algorithms, patterns, non-obvious behavior
   - **Dependencies**: what it imports internally and why
   - **Consumers**: what uses this module (from the dependency graph)
   - **Data models**: entities/types defined in this module
   - **Configuration**: env vars, config keys consumed
   - **External integrations**: HTTP calls, DB queries, queue operations, file I/O
   - **Security**: auth checks, encryption, input validation, secrets access
   - **Tests**: what tests exist for this module, what they cover
4. **Verify**: cross-check that every entity referenced in the doc exists in the codebase. Flag uncertainties.

**Cycle handling**: modules in a dependency cycle are analyzed together as a group, producing a single combined doc.

**Large modules**: if a module exceeds comfortable analysis size, split into logical sub-sections and analyze each part, then combine.

**Save**: `DOCUMENT_DIR/modules/[module_name].md` for each module.

**State**: update `state.json` after each module completes (move from `modules_remaining` to `modules_documented`). Increment `module_batch` after each batch of ~5.

**Session break heuristic**: after each batch, if more than 10 modules remain AND 2+ batches have already completed in this session, suggest a session break:

```
══════════════════════════════════════
 SESSION BREAK SUGGESTED
══════════════════════════════════════
 Modules documented: [X] of [Y]
 Batches completed this session: [N]
══════════════════════════════════════
 A) Continue in this conversation
 B) Save and continue in a fresh conversation (recommended)
══════════════════════════════════════
 Recommendation: B — fresh context improves analysis quality
 for remaining modules
══════════════════════════════════════
```

Re-entry is seamless: `state.json` tracks exactly which modules are done.
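The leaf-first ordering used in Steps 0–1 can be computed in several ways; below is a minimal Python sketch using Kahn's algorithm, extended to report modules stuck on import cycles so they can be grouped per the cycle-handling rule. The function name and example graphs are hypothetical, not part of the skill's contract.

```python
from collections import deque

def processing_order(deps: dict[str, set[str]]) -> tuple[list[str], list[str]]:
    """Compute a leaf-first (topological) documentation order.

    `deps[m]` is the set of internal modules that `m` imports, so a leaf
    module maps to an empty set; every module must appear as a key.
    Returns (order, leftover): anything in `leftover` sits on an import
    cycle (or depends on one) and should be analyzed as a group.
    """
    remaining = {m: set(d) for m, d in deps.items()}
    ready = deque(sorted(m for m, d in remaining.items() if not d))
    queued = set(ready)
    order: list[str] = []
    while ready:
        module = ready.popleft()
        order.append(module)
        remaining.pop(module)
        # This module is now documented; unblock anything that imports it.
        for other in sorted(remaining):
            remaining[other].discard(module)
            if not remaining[other] and other not in queued:
                ready.append(other)
                queued.add(other)
    return order, sorted(remaining)
```

For example, `processing_order({"utils": set(), "models": {"utils"}})` yields `(["utils", "models"], [])`: dependencies always precede their consumers, so each module's dependency docs exist before it is analyzed.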
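The per-batch `state.json` checkpointing described above might look like the following sketch. The field names (`modules_remaining`, `modules_documented`, `module_batch`) come from this step's text, but the authoritative schema lives in `references/artifacts.md`; the function names and batch driver are illustrative only.

```python
import json
from pathlib import Path

BATCH_SIZE = 5  # "~5 modules" per the batching rule above

def next_batch(state: dict) -> list[str]:
    # modules_remaining is already in topological (leaf-first) order
    return state["modules_remaining"][:BATCH_SIZE]

def checkpoint_batch(state_path: Path, batch: list[str]) -> dict:
    """Persist progress after every batch so a fresh conversation can
    resume exactly where the last one stopped.

    Field names are illustrative; see references/artifacts.md for the
    real state.json format.
    """
    state = json.loads(state_path.read_text())
    state["modules_documented"].extend(batch)
    state["modules_remaining"] = [
        m for m in state["modules_remaining"] if m not in batch
    ]
    state["module_batch"] += 1
    state_path.write_text(json.dumps(state, indent=2))
    return state
```

Because the file is rewritten after every batch, re-entry after a session break only needs to read `state.json` and call `next_batch` again.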
---

### Step 2: Component Assembly

**Role**: Software architect
**Goal**: Group related modules into logical components and produce component specs.

1. Analyze module docs from Step 1 to identify natural groupings:
   - By directory structure (most common)
   - By shared data models or common purpose
   - By dependency clusters (tightly coupled modules)
2. For each identified component, synthesize its module docs into a single component specification using `.cursor/skills/plan/templates/component-spec.md` as structure:
   - High-level overview: purpose, pattern, upstream/downstream
   - Internal interfaces: method signatures, DTOs (from actual module code)
   - External API specification (if the component exposes HTTP/gRPC endpoints)
   - Data access patterns: queries, caching, storage estimates
   - Implementation details: algorithmic complexity, state management, key libraries
   - Extensions and helpers: shared utilities needed
   - Caveats and edge cases: limitations, race conditions, bottlenecks
   - Dependency graph: implementation order relative to other components
   - Logging strategy
3. Identify common helpers shared across multiple components → document in `common-helpers/`
4. Generate component relationship diagram (Mermaid)

**Self-verification**:

- [ ] Every module from Step 1 is covered by exactly one component
- [ ] No component has overlapping responsibility with another
- [ ] Inter-component interfaces are explicit (who calls whom, with what)
- [ ] Component dependency graph has no circular dependencies

**Save**:

- `DOCUMENT_DIR/components/[##]_[name]/description.md` per component
- `DOCUMENT_DIR/common-helpers/[##]_helper_[name].md` per shared helper
- `DOCUMENT_DIR/diagrams/components.md` (Mermaid component diagram)

**BLOCKING**: Present component list with one-line summaries to user. Do NOT proceed until user confirms the component breakdown is correct.
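The "by directory structure" grouping heuristic from item 1 of this step can be sketched as a simple bucketing pass; the helper name and example paths are made up for illustration, and the other two heuristics (shared data models, dependency clusters) would refine these groups afterwards.

```python
from collections import defaultdict

def group_by_directory(module_paths: list[str]) -> dict[str, list[str]]:
    """Group modules into candidate components by parent directory."""
    components: dict[str, list[str]] = defaultdict(list)
    for path in module_paths:
        parts = path.split("/")
        # "src/billing/invoice.py" -> component "billing";
        # top-level files fall into a catch-all "root" component
        components[parts[-2] if len(parts) > 1 else "root"].append(path)
    # Self-verification: every module lands in exactly one component
    assert sum(len(v) for v in components.values()) == len(module_paths)
    return dict(components)
```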
---

### Step 3: System-Level Synthesis

**Role**: Software architect
**Goal**: From component docs, synthesize system-level documents.

All documents here are derived from component docs (Step 2) + module docs (Step 1). No new code reading should be needed. If it is, that indicates a gap in Steps 1-2 — go back and fill it.

#### 3a. Architecture

Using `.cursor/skills/plan/templates/architecture.md` as structure:

- System context and boundaries from entry points and external integrations
- Tech stack table from discovery (Step 0) + component specs
- Deployment model from Dockerfiles, CI configs, environment strategies
- Data model overview from per-component data access sections
- Integration points from inter-component interfaces
- NFRs from test thresholds, config limits, health checks
- Security architecture from per-module security observations
- Key ADRs inferred from technology choices and patterns

**Save**: `DOCUMENT_DIR/architecture.md`

#### 3b. System Flows

Using `.cursor/skills/plan/templates/system-flows.md` as structure:

- Trace main flows through the component interaction graph
- Entry point → component chain → output for each major flow
- Mermaid sequence diagrams and flowcharts
- Error scenarios from exception handling patterns
- Data flow tables per flow

**Save**: `DOCUMENT_DIR/system-flows.md` and `DOCUMENT_DIR/diagrams/flows/flow_[name].md`

#### 3c. Data Model

- Consolidate all data models from module docs
- Entity-relationship diagram (Mermaid ERD)
- Migration strategy (if ORM/migration tooling detected)
- Seed data observations
- Backward compatibility approach (if versioning found)

**Save**: `DOCUMENT_DIR/data_model.md`
#### 3d. Deployment (if Dockerfile/CI configs exist)

- Containerization summary
- CI/CD pipeline structure
- Environment strategy (dev, staging, production)
- Observability (logging patterns, metrics, health checks found in code)

**Save**: `DOCUMENT_DIR/deployment/` (containerization.md, ci_cd_pipeline.md, environment_strategy.md, observability.md — only files for which sufficient code evidence exists)

---

### Step 4: Verification Pass

**Role**: Quality verifier
**Goal**: Compare every generated document against actual code. Fix hallucinations, fill gaps, correct inaccuracies.

For each document generated in Steps 1-3:

1. **Entity verification**: extract all code entities (class names, function names, module names, endpoints) mentioned in the doc. Cross-reference each against the actual codebase. Flag any that don't exist.
2. **Interface accuracy**: for every method signature, DTO, or API endpoint in component specs, verify it matches actual code.
3. **Flow correctness**: for each system flow diagram, trace the actual code path and verify the sequence matches.
4. **Completeness check**: are there modules or components discovered in Step 0 that aren't covered by any document? Flag gaps.
5. **Consistency check**: do component docs agree with the architecture doc? Do flow diagrams match component interfaces?

Apply corrections inline to the documents that need them.

**Save**: `DOCUMENT_DIR/04_verification_log.md` with:

- Total entities verified vs flagged
- Corrections applied (which document, what changed)
- Remaining gaps or uncertainties
- Completeness score (modules covered / total modules)

**BLOCKING**: Present verification summary to user. Do NOT proceed until user confirms corrections are acceptable or requests additional fixes.

**Session boundary**: After verification is confirmed, suggest a session break before proceeding to the synthesis steps (5–7).
These steps produce different artifact types and benefit from fresh context:

```
══════════════════════════════════════
 VERIFICATION COMPLETE — session break?
══════════════════════════════════════
 Steps 0–4 (analysis + verification) are done.
 Steps 5–7 (solution + problem extraction + report)
 can run in a fresh conversation.
══════════════════════════════════════
 A) Continue in this conversation
 B) Save and continue in a new conversation (recommended)
══════════════════════════════════════
```

If **Focus Area mode**: Steps 5–7 are skipped (they require full codebase coverage). Present a summary of modules and components documented for this area. The user can run `/document` again for another area, or run without FOCUS_DIR once all areas are covered to produce the full synthesis.

---

### Step 5: Solution Extraction (Retrospective)

**Role**: Software architect
**Goal**: From all verified technical documentation, retrospectively create `solution.md` — the same artifact the research skill produces.

Synthesize from architecture (Step 3) + component specs (Step 2) + system flows (Step 3) + verification findings (Step 4):

1. **Product Solution Description**: what the system is, brief component interaction diagram (Mermaid)
2. **Architecture**: the architecture that is implemented, with per-component solution tables:

   | Solution | Tools | Advantages | Limitations | Requirements | Security | Cost | Fit |
   |----------|-------|------------|-------------|--------------|----------|------|-----|
   | [actual implementation] | [libs/platforms used] | [observed strengths] | [observed limitations] | [requirements met] | [security approach] | [cost indicators] | [fitness assessment] |

3. **Testing Strategy**: summarize integration/functional tests and non-functional tests found in the codebase
4. **References**: links to key config files, Dockerfiles, CI configs that evidence the solution choices

**Save**: `SOLUTION_DIR/solution.md` (`_docs/01_solution/solution.md`)

---

### Step 6: Problem Extraction (Retrospective)

**Role**: Business analyst
**Goal**: From all verified technical docs, retrospectively derive the high-level problem definition.

#### 6a. `problem.md`

- Synthesize from architecture overview + component purposes + system flows
- What is this system? What problem does it solve? Who are the users? How does it work at a high level?
- Cross-reference with README if one exists

#### 6b. `restrictions.md`

- Extract from: tech stack choices, Dockerfile specs, CI configs, dependency versions, environment configs
- Categorize: Hardware, Software, Environment, Operational

#### 6c. `acceptance_criteria.md`

- Derive from: test assertions, performance configs, health check endpoints, validation rules
- Every criterion must have a measurable value

#### 6d. `input_data/`

- Document data schemas (DB schemas, API request/response types, config file formats)
- Create `data_parameters.md` describing what data the system consumes

#### 6e. `security_approach.md` (only if security code found)

- Authentication, authorization, encryption, secrets handling, CORS, rate limiting, input sanitization

**Save**: all files to `PROBLEM_DIR/` (`_docs/00_problem/`)

**BLOCKING**: Present all problem documents to user. Do NOT proceed until user confirms or requests corrections.

---

### Step 7: Final Report

**Role**: Technical writer
**Goal**: Produce `FINAL_report.md` integrating all generated documentation.
Using `.cursor/skills/plan/templates/final-report.md` as structure:

- Executive summary from architecture + problem docs
- Problem statement (transformed from problem.md, not copy-pasted)
- Architecture overview with tech stack one-liner
- Component summary table (number, name, purpose, dependencies)
- System flows summary table
- Risk observations from verification log (Step 4)
- Open questions (uncertainties flagged during analysis)
- Artifact index listing all generated documents with paths

**Save**: `DOCUMENT_DIR/FINAL_report.md`

**State**: update `state.json` with `current_step: "complete"`.

---

## Escalation Rules

| Situation | Action |
|-----------|--------|
| Minified/obfuscated code detected | WARN user, skip module, note in verification log |
| Module too large for context window | Split into sub-sections, analyze parts separately, combine |
| Cycle in dependency graph | Group cycled modules, analyze together as one doc |
| Generated code (protobuf, swagger-gen) | Note as generated, document the source spec instead |
| No tests found in codebase | Note gap in acceptance_criteria.md, derive AC from validation rules and config limits only |
| Contradictions between code and README | Flag in verification log, ASK user |
| Binary files or non-code assets | Skip, note in discovery |
| `_docs/` already exists | ASK user: overwrite, merge, or use `_docs_generated/` |
| Code intent is ambiguous | ASK user, do not guess |

## Common Mistakes

- **Top-down guessing**: never infer architecture before documenting modules. Build up, don't assume down.
- **Hallucinating entities**: always verify that referenced classes/functions/endpoints actually exist in code.
- **Skipping modules**: every source module must appear in exactly one module doc and one component.
- **Monolithic analysis**: don't try to analyze the entire codebase in one pass. Module by module, in order.
- **Inventing restrictions**: only document constraints actually evidenced in code, configs, or Dockerfiles.
- **Vague acceptance criteria**: "should be fast" is not a criterion. Extract actual numeric thresholds from code.
- **Writing code**: this skill produces documents, never implementation code.

## Quick Reference

```
┌──────────────────────────────────────────────────────────────────┐
│ Bottom-Up Codebase Documentation (8-Step)                        │
├──────────────────────────────────────────────────────────────────┤
│ MODE:   Full / Focus Area (@dir) / Resume (state.json)           │
│ PREREQ: Check _docs/ exists (overwrite/merge/new?)               │
│ PREREQ: Check state.json for resume                              │
│                                                                  │
│ 0. Discovery → dependency graph, tech stack, topo order          │
│    (Focus Area: scoped to FOCUS_DIR + transitive deps)           │
│ 1. Module Docs → per-module analysis (leaves first)              │
│    (batched ~5 modules; session break between batches)           │
│ 2. Component Assembly → group modules, write component specs     │
│    [BLOCKING: user confirms components]                          │
│ 3. System Synthesis → architecture, flows, data model, deploy    │
│ 4. Verification → compare all docs vs code, fix errors           │
│    [BLOCKING: user reviews corrections]                          │
│    [SESSION BREAK suggested before Steps 5–7]                    │
│    ── Focus Area mode stops here ──                              │
│ 5. Solution Extraction → retrospective solution.md               │
│ 6. Problem Extraction → retrospective problem, restrictions, AC  │
│    [BLOCKING: user confirms problem docs]                        │
│ 7. Final Report → FINAL_report.md                                │
├──────────────────────────────────────────────────────────────────┤
│ Principles: Bottom-up always · Dependencies first                │
│             Incremental context · Verify against code            │
│             Save immediately · Resume from checkpoint            │
│             Batch modules · Session breaks for large codebases   │
└──────────────────────────────────────────────────────────────────┘
```