Refactor README and command documentation to streamline deployment and CI/CD processes. Consolidate deployment strategies and remove obsolete commands related to CI/CD and observability. Enhance task decomposition workflow by adding data model and deployment planning sections, and update directory structures for improved clarity.

2026-06-22 14:11:13 +00:00 · 2026-03-19 12:10:11 +02:00
parent 5b1739186e
commit cfd09c79e1
17 changed files with 1314 additions and 313 deletions
@@ -0,0 +1,363 @@
+---
+name: deploy
+description: |
+  Comprehensive deployment skill covering containerization, CI/CD pipeline, environment strategy, observability, and deployment procedures.
+  5-step workflow: Docker containerization, CI/CD pipeline definition, environment strategy, observability planning, deployment procedures.
+  Uses _docs/02_plans/deployment/ structure.
+  Trigger phrases:
+  - "deploy", "deployment", "deployment strategy"
+  - "CI/CD", "pipeline", "containerize"
+  - "observability", "monitoring", "logging"
+  - "dockerize", "docker compose"
+category: ship
+tags: [deployment, docker, ci-cd, observability, monitoring, containerization]
+disable-model-invocation: true
+---
+
+# Deployment Planning
+
+Plan and document the full deployment lifecycle: containerize the application, define CI/CD pipelines, configure environments, set up observability, and document deployment procedures.
+
+## Core Principles
+
+- **Docker-first**: every component runs in a container; local dev, integration tests, and production all use Docker
+- **Infrastructure as code**: all deployment configuration is version-controlled
+- **Observability built-in**: logging, metrics, and tracing are part of the deployment plan, not afterthoughts
+- **Environment parity**: dev, staging, and production environments mirror each other as closely as possible
+- **Save immediately**: write artifacts to disk after each step; never accumulate unsaved work
+- **Ask, don't assume**: when infrastructure constraints or preferences are unclear, ask the user
+- **Plan, don't code**: this workflow produces deployment documents and specifications, not implementation code
+
+## Context Resolution
+
+Fixed paths:
+
+- PLANS_DIR: `_docs/02_plans/`
+- DEPLOY_DIR: `_docs/02_plans/deployment/`
+- ARCHITECTURE: `_docs/02_plans/architecture.md`
+- COMPONENTS_DIR: `_docs/02_plans/components/`
+
+Announce the resolved paths to the user before proceeding.
+
+## Input Specification
+
+### Required Files
+
+| File | Purpose |
+|------|---------|
+| `_docs/00_problem/problem.md` | Problem description and context |
+| `_docs/00_problem/restrictions.md` | Constraints and limitations |
+| `_docs/01_solution/solution.md` | Finalized solution |
+| `PLANS_DIR/architecture.md` | Architecture from plan skill |
+| `PLANS_DIR/components/` | Component specs |
+
+### Prerequisite Checks (BLOCKING)
+
+1. `architecture.md` exists — **STOP if missing**, run `/plan` first
+2. At least one component spec exists in `PLANS_DIR/components/` — **STOP if missing**
+3. Create DEPLOY_DIR if it does not exist
+4. If DEPLOY_DIR already contains artifacts, ask user: **resume from last checkpoint or start fresh?**
+
+## Artifact Management
+
+### Directory Structure
+
+```
+DEPLOY_DIR/
+├── containerization.md
+├── ci_cd_pipeline.md
+├── environment_strategy.md
+├── observability.md
+└── deployment_procedures.md
+```
+
+### Save Timing
+
+| Step | Save immediately after | Filename |
+|------|------------------------|----------|
+| Step 1 | Containerization plan complete | `containerization.md` |
+| Step 2 | CI/CD pipeline defined | `ci_cd_pipeline.md` |
+| Step 3 | Environment strategy documented | `environment_strategy.md` |
+| Step 4 | Observability plan complete | `observability.md` |
+| Step 5 | Deployment procedures documented | `deployment_procedures.md` |
+
+### Resumability
+
+If DEPLOY_DIR already contains artifacts:
+
+1. List existing files and match to the save timing table
+2. Identify the last completed step
+3. Resume from the next incomplete step
+4. Inform the user which steps are being skipped
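
The resumability rules above can be sketched in a few lines. This is a minimal illustration, not part of the skill: the function name `next_step` is hypothetical, and it assumes a step counts as complete only when its artifact file from the save timing table is present, so the first missing artifact is where work resumes.

```python
from typing import Iterable, Optional

# Ordered (step number, artifact filename) pairs from the Save Timing table.
STEP_ARTIFACTS = [
    (1, "containerization.md"),
    (2, "ci_cd_pipeline.md"),
    (3, "environment_strategy.md"),
    (4, "observability.md"),
    (5, "deployment_procedures.md"),
]


def next_step(existing_files: Iterable[str]) -> Optional[int]:
    """Return the first step whose artifact is missing, or None if all exist.

    Assumption: a gap (e.g. step 3 present but step 2 missing) is treated
    as incomplete, so the workflow resumes at the earliest missing step.
    """
    present = set(existing_files)
    for step, filename in STEP_ARTIFACTS:
        if filename not in present:
            return step
    return None
```

For example, `next_step(["containerization.md", "ci_cd_pipeline.md"])` resumes at Step 3, and steps 1 and 2 are the ones to report as skipped.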
+
+## Progress Tracking
+
+At the start of execution, use TodoWrite to create a todo list with all steps (1 through 5). Update each item's status as its step completes.
+
+## Workflow
+
+### Step 1: Containerization
+
+**Role**: DevOps / Platform engineer
+**Goal**: Define Docker configuration for every component, local development, and integration test environments
+**Constraints**: Plan only — no Dockerfile creation. Describe what each Dockerfile should contain.
+
+1. Read architecture.md and all component specs
+2. Read restrictions.md for infrastructure constraints
+3. Research Docker best practices for the project's tech stack (multi-stage builds, base image selection, layer optimization)
+4. For each component, define:
+   - Base image (pinned version, prefer alpine/distroless for production)
+   - Build stages (dependency install, build, production)
+   - Non-root user configuration
+   - Health check endpoint and command
+   - Exposed ports
+   - `.dockerignore` contents
+5. Define `docker-compose.yml` for local development:
+   - All application components
+   - Database (Postgres) with named volume
+   - Any message queues, caches, or external service mocks
+   - Shared network
+   - Environment variable files (`.env.dev`)
+6. Define `docker-compose.test.yml` for integration tests:
+   - Application components under test
+   - Test runner container (black-box, no internal imports)
+   - Isolated database with seed data
+   - All tests runnable via `docker compose -f docker-compose.test.yml up --abort-on-container-exit`
+7. Define image tagging strategy: `<registry>/<project>/<component>:<git-sha>` for CI, `latest` for local dev only
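
As an illustration of the tagging strategy in item 7, the sketch below builds a CI image reference from its parts. The helper name `image_tag` and the example registry and project values are hypothetical; the only rules taken from the text are the `<registry>/<project>/<component>:<git-sha>` shape and that `latest` is reserved for local development.

```python
def image_tag(registry: str, project: str, component: str, git_sha: str) -> str:
    """Build <registry>/<project>/<component>:<git-sha> for CI images."""
    parts = {
        "registry": registry,
        "project": project,
        "component": component,
        "git_sha": git_sha,
    }
    for name, value in parts.items():
        # Reject empty or whitespace-padded parts rather than guessing a fix.
        if not value or value != value.strip():
            raise ValueError(f"{name} must be a non-empty, unpadded string")
    if git_sha == "latest":
        # `latest` is for local dev only; CI images are pinned to a commit.
        raise ValueError("CI images must be tagged with a git SHA, not 'latest'")
    return f"{registry}/{project}/{component}:{git_sha}"
```

Calling `image_tag("registry.example.com", "shop", "api", "a1b2c3d")` yields `registry.example.com/shop/api:a1b2c3d`.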
+
+**Self-verification**:
+- [ ] Every component has a Dockerfile specification
+- [ ] Multi-stage builds specified for all production images
+- [ ] Non-root user for all containers
+- [ ] Health checks defined for every service
+- [ ] docker-compose.yml covers all components + dependencies
+- [ ] docker-compose.test.yml enables black-box integration testing
+- [ ] `.dockerignore` defined
+
+**Save action**: Write `containerization.md` using `templates/containerization.md`
+
+**BLOCKING**: Present containerization plan to user. Do NOT proceed until confirmed.
+
+---
+
+### Step 2: CI/CD Pipeline
+
+**Role**: DevOps engineer
+**Goal**: Define the CI/CD pipeline with quality gates, security scanning, and multi-environment deployment
+**Constraints**: Pipeline definition only: produce a written specification of stages, triggers, and gates, not pipeline YAML
+
+1. Read architecture.md for tech stack and deployment targets
+2. Read restrictions.md for CI/CD constraints (cloud provider, registry, etc.)
+3. Research CI/CD best practices for the project's platform (GitHub Actions / Azure Pipelines)
+4. Define pipeline stages:
+
+| Stage | Trigger | Steps | Quality Gate |
+|-------|---------|-------|-------------|
+| **Lint** | Every push | Run linters per language (black, rustfmt, prettier, dotnet format) | Zero errors |
+| **Test** | Every push | Unit tests, integration tests, coverage report | 75%+ coverage |
+| **Security** | Every push | Dependency audit, SAST scan (Semgrep/SonarQube), image scan (Trivy) | Zero critical/high CVEs |
+| **Build** | PR merge to dev | Build Docker images, tag with git SHA | Build succeeds |
+| **Push** | After build | Push to container registry | Push succeeds |
+| **Deploy Staging** | After push | Deploy to staging environment | Health checks pass |
+| **Smoke Tests** | After staging deploy | Run critical path tests against staging | All pass |
+| **Deploy Production** | Manual approval | Deploy to production | Health checks pass |
+
+5. Define caching strategy: dependency caches, Docker layer caches, build artifact caches
+6. Define parallelization: which stages can run concurrently
+7. Define notifications: build failures, deployment status, security alerts
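
The quality gates in the stage table can be read as simple predicates over pipeline results. The sketch below is one hedged way to express them; `PipelineResults` and `failed_gates` are hypothetical names, and a real pipeline would take these numbers from its linter, coverage, and scanner reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineResults:
    lint_errors: int
    coverage_percent: float
    critical_cves: int
    high_cves: int


def failed_gates(results: PipelineResults, min_coverage: float = 75.0) -> List[str]:
    """Return the names of the stages whose quality gate is not met."""
    failures = []
    if results.lint_errors > 0:
        failures.append("lint")  # gate: zero errors
    if results.coverage_percent < min_coverage:
        failures.append("test")  # gate: 75%+ coverage
    if results.critical_cves + results.high_cves > 0:
        failures.append("security")  # gate: zero critical/high CVEs
    return failures
```

An empty list means Build may proceed; any entry names the stage that should block the pipeline.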
+
+**Self-verification**:
+- [ ] All pipeline stages defined with triggers and gates
+- [ ] Coverage threshold enforced (75%+)
+- [ ] Security scanning included (dependencies + images + SAST)
+- [ ] Caching configured for dependencies and Docker layers
+- [ ] Multi-environment deployment (staging → production)
+- [ ] Rollback procedure referenced
+- [ ] Notifications configured
+
+**Save action**: Write `ci_cd_pipeline.md` using `templates/ci_cd_pipeline.md`
+
+---
+
+### Step 3: Environment Strategy
+
+**Role**: Platform engineer
+**Goal**: Define environment configuration, secrets management, and environment parity
+**Constraints**: Strategy document — no secrets or credentials in output
+
+1. Define environments:
+
+| Environment | Purpose | Infrastructure | Data |
+|-------------|---------|---------------|------|
+| **Development** | Local developer workflow | docker-compose, local volumes | Seed data, mocks for external APIs |
+| **Staging** | Pre-production validation | Mirrors production topology | Anonymized production-like data |
+| **Production** | Live system | Full infrastructure | Real data |
+
+2. Define environment variable management:
+   - `.env.example` with all required variables (no real values)
+   - Per-environment variable sources (`.env` for dev, secret manager for staging/prod)
+   - Validation: fail fast on missing required variables at startup
+3. Define secrets management:
+   - Never commit secrets to version control
+   - Development: `.env` files (git-ignored)
+   - Staging/Production: secret manager (AWS Secrets Manager / Azure Key Vault / Vault)
+   - Rotation policy
+4. Define database management per environment:
+   - Development: Docker Postgres with named volume, seed data
+   - Staging: managed Postgres, migrations applied via CI/CD
+   - Production: managed Postgres, migrations require approval
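
The "fail fast on missing required variables" rule in item 2 could look like the following at service startup. This is a sketch under stated assumptions: `validate_env` and `MissingEnvError` are hypothetical names, and an empty value is treated the same as a missing one.

```python
import os
from typing import Iterable, Mapping


class MissingEnvError(RuntimeError):
    """Raised at startup when required configuration is absent."""


def validate_env(required: Iterable[str],
                 environ: Mapping[str, str] = os.environ) -> None:
    """Raise MissingEnvError listing every required variable that is unset."""
    # Report names only, never values, so secrets cannot leak into logs.
    missing = sorted(name for name in required if not environ.get(name))
    if missing:
        raise MissingEnvError(
            "missing required environment variables: " + ", ".join(missing)
        )
```

Calling `validate_env(["DATABASE_URL"])` before opening any connections surfaces all missing variables in one error instead of failing on the first use.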
+
+**Self-verification**:
+- [ ] All three environments defined with clear purpose
+- [ ] Environment variable documentation complete (`.env.example`)
+- [ ] No secrets in any output document
+- [ ] Secret manager specified for staging/production
+- [ ] Database strategy per environment
+
+**Save action**: Write `environment_strategy.md` using `templates/environment_strategy.md`
+
+---
+
+### Step 4: Observability
+
+**Role**: Site Reliability Engineer (SRE)
+**Goal**: Define logging, metrics, tracing, and alerting strategy
+**Constraints**: Strategy document — describe what to implement, not how to wire it
+
+1. Read architecture.md and component specs for service boundaries
+2. Research observability best practices for the tech stack
+
+**Logging**:
+- Structured JSON to stdout/stderr (no file logging in containers)
+- Fields: `timestamp` (ISO 8601), `level`, `service`, `correlation_id`, `message`, `context`
+- Levels: ERROR (exceptions), WARN (degraded), INFO (business events), DEBUG (diagnostics, dev only)
+- No PII in logs
+- Retention: dev = console, staging = 7 days, production = 30 days
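
To make the logging fields concrete, here is a minimal sketch using only Python's standard `logging` module. `JsonFormatter` and `build_logger` are hypothetical names; the sketch assumes callers pass `correlation_id` and `context` through the `extra` argument, and it maps Python's `WARNING` level name to the `WARN` label used above.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields listed above."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        level = "WARN" if record.levelname == "WARNING" else record.levelname
        entry = {
            # ISO 8601 timestamp in UTC.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": level,
            "service": self.service,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry, default=str)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes structured JSON to the given stream."""
    logger = logging.getLogger(service)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("orders").info("Order placed", extra={"correlation_id": "abc-123", "context": {"order_id": 42}})` writes a single JSON line to stdout, which a container runtime can collect without file logging.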
+
+**Metrics**:
+- Expose Prometheus-compatible `/metrics` endpoint per service
+- System metrics: CPU, memory, disk, network
+- Application metrics: `request_count`, `request_duration_seconds` (histogram), `error_count`, `active_connections`
+- Business metrics: derived from acceptance criteria
+- Collection interval: 15s
+
+**Distributed Tracing**:
+- OpenTelemetry SDK integration
+- Trace context propagation via HTTP headers and message queue metadata
+- Span naming: `<service>.<operation>`
+- Sampling: 100% in dev/staging, 10% in production (adjust based on volume)
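
The per-environment sampling rates can be illustrated with a deterministic ratio check keyed on the trace ID, so every service makes the same keep-or-drop decision for a given trace. This is a standalone sketch of the idea, not the OpenTelemetry SDK's sampler; `should_sample` and `SAMPLING_RATES` are hypothetical names.

```python
# Sampling rates from the bullet above, keyed by environment name.
SAMPLING_RATES = {"development": 1.0, "staging": 1.0, "production": 0.10}

_LOW_64_BITS = (1 << 64) - 1


def should_sample(trace_id: int, rate: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below rate * 2**64.

    Assumption: trace IDs are uniformly random, so roughly `rate` of all
    traces are kept and the decision is stable for any given ID.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    bound = int(rate * (1 << 64))
    return (trace_id & _LOW_64_BITS) < bound
```

Because the decision depends only on the trace ID, a downstream service that applies the same rate will not drop spans from a trace that an upstream service kept.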
+
+**Alerting**:
+
+| Severity | Response Time | Condition Examples |
+|----------|---------------|-------------------|
+| Critical | 5 min | Service down, data loss, health check failed |
+| High | 30 min | Error rate > 5%, P95 latency > 2x baseline |
+| Medium | 4 hours | Disk > 80%, elevated latency |
+| Low | Next business day | Non-critical warnings |
+
+**Dashboards**:
+- Operations: service health, request rate, error rate, response time percentiles, resource utilization
+- Business: key business metrics from acceptance criteria
+
+**Self-verification**:
+- [ ] Structured logging format defined with required fields
+- [ ] Metrics endpoint specified per service
+- [ ] OpenTelemetry tracing configured
+- [ ] Alert severities with response times defined
+- [ ] Dashboards cover operations and business metrics
+- [ ] PII exclusion from logs addressed
+
+**Save action**: Write `observability.md` using `templates/observability.md`
+
+---
+
+### Step 5: Deployment Procedures
+
+**Role**: DevOps / Platform engineer
+**Goal**: Define deployment strategy, rollback procedures, health checks, and deployment checklist
+**Constraints**: Procedures document — no implementation
+
+1. Define deployment strategy:
+   - Preferred pattern: blue-green / rolling / canary (choose based on architecture)
+   - Zero-downtime requirement for production
+   - Graceful shutdown: 30-second grace period for in-flight requests
+   - Database migration ordering: migrate before deploy, backward-compatible only
+
+2. Define health checks:
+
+| Check | Type | Endpoint | Interval | Threshold |
+|-------|------|----------|----------|-----------|
+| Liveness | HTTP GET | `/health/live` | 10s | 3 failures → restart |
+| Readiness | HTTP GET | `/health/ready` | 5s | 3 failures → remove from LB |
+| Startup | HTTP GET | `/health/ready` | 5s | 30 attempts max |
+
+3. Define rollback procedures:
+   - Trigger criteria: health check failures, error rate spike, critical alert
+   - Rollback steps: redeploy previous image tag, verify health, rollback database if needed
+   - Communication: notify stakeholders during rollback
+   - Post-mortem: required after every production rollback
+
+4. Define deployment checklist:
+   - [ ] All tests pass in CI
+   - [ ] Security scan clean (zero critical/high CVEs)
+   - [ ] Database migrations reviewed and tested
+   - [ ] Environment variables configured
+   - [ ] Health check endpoints responding
+   - [ ] Monitoring alerts configured
+   - [ ] Rollback plan documented and tested
+   - [ ] Stakeholders notified
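
The interval and threshold columns of the health check table imply a detection window for each probe. The sketch below works through that arithmetic and the threshold counting; the `Probe` class and the action strings are hypothetical, and it assumes failures are counted consecutively, with any success resetting the count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Probe:
    name: str
    interval_s: int
    failure_threshold: int
    action: str
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> Optional[str]:
        """Record one probe result; return the action once the threshold is hit."""
        if healthy:
            self.consecutive_failures = 0
            return None
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            return self.action
        return None

    def detection_window_s(self) -> int:
        """Seconds of probing needed to reach the threshold: interval x threshold."""
        return self.interval_s * self.failure_threshold


# Values taken from the health check table above.
liveness = Probe("liveness", interval_s=10, failure_threshold=3, action="restart")
readiness = Probe("readiness", interval_s=5, failure_threshold=3, action="remove from LB")
startup = Probe("startup", interval_s=5, failure_threshold=30, action="give up")
```

Under these assumptions a hung process is restarted after about 30 seconds, an unready instance leaves the load balancer after about 15 seconds, and a service has up to 150 seconds to start.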
+
+**Self-verification**:
+- [ ] Deployment strategy chosen and justified
+- [ ] Zero-downtime approach specified
+- [ ] Health checks defined (liveness, readiness, startup)
+- [ ] Rollback trigger criteria and steps documented
+- [ ] Deployment checklist complete
+
+**Save action**: Write `deployment_procedures.md` using `templates/deployment_procedures.md`
+
+**BLOCKING**: Present deployment procedures to user. Do NOT proceed until confirmed.
+
+---
+
+## Escalation Rules
+
+| Situation | Action |
+|-----------|--------|
+| Unknown cloud provider or hosting | **ASK user** |
+| Container registry not specified | **ASK user** |
+| CI/CD platform preference unclear | **ASK user** — default to GitHub Actions |
+| Secret manager not chosen | **ASK user** |
+| Deployment pattern trade-offs | **ASK user** with recommendation |
+| Missing architecture.md | **STOP** — run `/plan` first |
+
+## Common Mistakes
+
+- **Implementing during planning**: this workflow produces documents, not Dockerfiles or pipeline YAML
+- **Hardcoding secrets**: never include real credentials in deployment documents
+- **Ignoring integration test containerization**: the test environment must be containerized alongside the app
+- **Skipping BLOCKING gates**: never proceed past a BLOCKING marker without user confirmation
+- **Using `:latest` tags outside local dev**: always pin base image versions and tag CI images with the git SHA
+- **Forgetting observability**: logging, metrics, and tracing are deployment concerns, not post-deployment additions
+
+## Methodology Quick Reference
+
+```
+┌────────────────────────────────────────────────────────────────┐
+│            Deployment Planning (5-Step Method)                   │
+├────────────────────────────────────────────────────────────────┤
+│ PREREQ: architecture.md + component specs exist                 │
+│                                                                │
+│ 1. Containerization  → containerization.md                      │
+│    [BLOCKING: user confirms Docker plan]                        │
+│ 2. CI/CD Pipeline    → ci_cd_pipeline.md                        │
+│ 3. Environment       → environment_strategy.md                  │
+│ 4. Observability     → observability.md                         │
+│ 5. Procedures        → deployment_procedures.md                 │
+│    [BLOCKING: user confirms deployment plan]                    │
+├────────────────────────────────────────────────────────────────┤
+│ Principles: Docker-first · IaC · Observability built-in         │
+│             Environment parity · Save immediately               │
+└────────────────────────────────────────────────────────────────┘
+```
@@ -0,0 +1,87 @@
+# CI/CD Pipeline Template
+
+Save as `_docs/02_plans/deployment/ci_cd_pipeline.md`.
+
+---
+
+```markdown
+# [System Name] — CI/CD Pipeline
+
+## Pipeline Overview
+
+| Stage | Trigger | Quality Gate |
+|-------|---------|-------------|
+| Lint | Every push | Zero lint errors |
+| Test | Every push | 75%+ coverage, all tests pass |
+| Security | Every push | Zero critical/high CVEs |
+| Build | PR merge to dev | Docker build succeeds |
+| Push | After build | Images pushed to registry |
+| Deploy Staging | After push | Health checks pass |
+| Smoke Tests | After staging deploy | Critical paths pass |
+| Deploy Production | Manual approval | Health checks pass |
+
+## Stage Details
+
+### Lint
+- [Language-specific linters and formatters]
+- Runs in parallel per language
+
+### Test
+- Unit tests: [framework and command]
+- Integration tests: [framework and command, uses docker-compose.test.yml]
+- Coverage threshold: 75% overall, 90% critical paths
+- Coverage report published as pipeline artifact
+
+### Security
+- Dependency audit: [tool, e.g., npm audit / pip-audit / dotnet list package --vulnerable]
+- SAST scan: [tool, e.g., Semgrep / SonarQube]
+- Image scan: Trivy on built Docker images
+- Block on: critical or high severity findings
+
+### Build
+- Docker images built using multi-stage Dockerfiles
+- Tagged with git SHA: `<registry>/<project>/<component>:<git-sha>`
+- Build cache: Docker layer cache via CI cache action
+
+### Push
+- Registry: [container registry URL]
+- Authentication: [method]
+
+### Deploy Staging
+- Deployment method: [docker compose / Kubernetes / cloud service]
+- Pre-deploy: run database migrations
+- Post-deploy: verify health check endpoints
+- Automated rollback on health check failure
+
+### Smoke Tests
+- Subset of integration tests targeting staging environment
+- Validates critical user flows
+- Timeout: [maximum duration]
+
+### Deploy Production
+- Requires manual approval via [mechanism]
+- Deployment strategy: [blue-green / rolling / canary]
+- Pre-deploy: database migration review
+- Post-deploy: health checks + monitoring for 15 min
+
+## Caching Strategy
+
+| Cache | Key | Restore Keys |
+|-------|-----|-------------|
+| Dependencies | [lockfile hash] | [partial match] |
+| Docker layers | [Dockerfile hash] | [partial match] |
+| Build artifacts | [source hash] | [partial match] |
+
+## Parallelization
+
+[Diagram or description of which stages run concurrently]
+
+## Notifications
+
+| Event | Channel | Recipients |
+|-------|---------|-----------|
+| Build failure | [Slack/email] | [team] |
+| Security alert | [Slack/email] | [team + security] |
+| Deploy success | [Slack] | [team] |
+| Deploy failure | [Slack/email + PagerDuty] | [on-call] |
+```
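
The caching table in the template above keys each cache on a content hash with a partial-match fallback. A minimal sketch of that idea follows; `cache_key` and `restore_keys` are hypothetical helpers, and the 16-character digest prefix is an arbitrary choice for readability rather than a requirement of any CI system.

```python
import hashlib
from typing import List


def cache_key(prefix: str, lockfile_bytes: bytes) -> str:
    """Derive an exact-match cache key from the lockfile contents."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{prefix}-{digest[:16]}"


def restore_keys(prefix: str) -> List[str]:
    """Fallback prefixes: reuse the newest cache when no exact key matches."""
    return [f"{prefix}-"]
```

When the lockfile changes, the exact key misses, the prefix fallback restores the most recent dependency cache, and only the changed packages need to be downloaded.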
@@ -0,0 +1,94 @@
+# Containerization Plan Template
+
+Save as `_docs/02_plans/deployment/containerization.md`.
+
+---
+
+```markdown
+# [System Name] — Containerization
+
+## Component Dockerfiles
+
+### [Component Name]
+
+| Property | Value |
+|----------|-------|
+| Base image | [e.g., mcr.microsoft.com/dotnet/aspnet:8.0-alpine] |
+| Build image | [e.g., mcr.microsoft.com/dotnet/sdk:8.0-alpine] |
+| Stages | [dependency install → build → production] |
+| User | [non-root user name] |
+| Health check | [endpoint and command] |
+| Exposed ports | [port list] |
+| Key build args | [if any] |
+
+### [Repeat for each component]
+
+## Docker Compose — Local Development
+
+```yaml
+# docker-compose.yml structure
+services:
+  [component]:
+    build: ./[path]
+    ports: ["host:container"]
+    environment: [reference .env.dev]
+    depends_on: [dependencies with health condition]
+    healthcheck: [command, interval, timeout, retries]
+
+  db:
+    image: [postgres:version-alpine]
+    volumes: [named volume]
+    environment: [credentials from .env.dev]
+    healthcheck: [pg_isready]
+
+volumes:
+  [named volumes]
+
+networks:
+  [shared network]
+```
+
+## Docker Compose — Integration Tests
+
+```yaml
+# docker-compose.test.yml structure
+services:
+  [app components under test]
+
+  test-runner:
+    build: ./tests/integration
+    depends_on: [app components with health condition]
+    environment: [test configuration]
+    # Exit code determines test pass/fail
+
+  db:
+    image: [postgres:version-alpine]
+    volumes: [seed data mount]
+```
+
+Run: `docker compose -f docker-compose.test.yml up --abort-on-container-exit`
+
+## Image Tagging Strategy
+
+| Context | Tag Format | Example |
+|---------|-----------|---------|
+| CI build | `<registry>/<project>/<component>:<git-sha>` | `ghcr.io/org/api:a1b2c3d` |
+| Release | `<registry>/<project>/<component>:<semver>` | `ghcr.io/org/api:1.2.0` |
+| Local dev | `<component>:latest` | `api:latest` |
+
+## .dockerignore
+
+```
+.git
+.cursor
+_docs
+_standalone
+node_modules
+**/bin
+**/obj
+**/__pycache__
+*.md
+.env*
+docker-compose*.yml
+```
+```
@@ -0,0 +1,103 @@
+# Deployment Procedures Template
+
+Save as `_docs/02_plans/deployment/deployment_procedures.md`.
+
+---
+
+```markdown
+# [System Name] — Deployment Procedures
+
+## Deployment Strategy
+
+**Pattern**: [blue-green / rolling / canary]
+**Rationale**: [why this pattern fits the architecture]
+**Zero-downtime**: required for production deployments
+
+### Graceful Shutdown
+
+- Grace period: 30 seconds for in-flight requests
+- Sequence: stop accepting new requests → drain connections → shutdown
+- Orchestrator kill timeout (Kubernetes example): `terminationGracePeriodSeconds: 40`, which leaves a 10-second buffer beyond the 30-second grace period
+
+### Database Migration Ordering
+
+- Migrations run **before** new code deploys
+- All migrations must be backward-compatible (old code works with new schema)
+- Irreversible migrations require explicit approval
+
+## Health Checks
+
+| Check | Type | Endpoint | Interval | Failure Threshold | Action |
+|-------|------|----------|----------|-------------------|--------|
+| Liveness | HTTP GET | `/health/live` | 10s | 3 failures | Restart container |
+| Readiness | HTTP GET | `/health/ready` | 5s | 3 failures | Remove from load balancer |
+| Startup | HTTP GET | `/health/ready` | 5s | 30 attempts | Kill and recreate |
+
+### Health Check Responses
+
+- `/health/live`: returns 200 if process is running (no dependency checks)
+- `/health/ready`: returns 200 if all dependencies (DB, cache, queues) are reachable
+
+## Staging Deployment
+
+1. CI/CD builds and pushes Docker images tagged with git SHA
+2. Run database migrations against staging
+3. Deploy new images to staging environment
+4. Wait for health checks to pass (readiness probe)
+5. Run smoke tests against staging
+6. If smoke tests fail: automatic rollback to previous image
+
+## Production Deployment
+
+1. **Approval**: manual approval required via [mechanism]
+2. **Pre-deploy checks**:
+   - [ ] Staging smoke tests passed
+   - [ ] Security scan clean
+   - [ ] Database migration reviewed
+   - [ ] Monitoring alerts configured
+   - [ ] Rollback plan confirmed
+3. **Deploy**: apply deployment strategy (blue-green / rolling / canary)
+4. **Verify**: health checks pass, error rate stable, latency within baseline
+5. **Monitor**: observe dashboards for 15 minutes post-deploy
+6. **Finalize**: mark deployment as successful or trigger rollback
+
+## Rollback Procedures
+
+### Trigger Criteria
+
+- Health check failures persist after deploy
+- Error rate exceeds 5% for more than 5 minutes
+- Critical alert fires within 15 minutes of deploy
+- Manual decision by on-call engineer
+
+### Rollback Steps
+
+1. Redeploy previous Docker image tag (from CI/CD artifact)
+2. Verify health checks pass
+3. If database migration was applied:
+   - Run DOWN migration if reversible
+   - If irreversible: assess data impact, escalate if needed
+4. Notify stakeholders
+5. Schedule post-mortem within 24 hours
+
+### Post-Mortem
+
+Required after every production rollback:
+- Timeline of events
+- Root cause
+- What went wrong
+- Prevention measures
+
+## Deployment Checklist
+
+- [ ] All tests pass in CI
+- [ ] Security scan clean (zero critical/high CVEs)
+- [ ] Docker images built and pushed
+- [ ] Database migrations reviewed and tested
+- [ ] Environment variables configured for target environment
+- [ ] Health check endpoints verified
+- [ ] Monitoring alerts configured
+- [ ] Rollback plan documented and tested
+- [ ] Stakeholders notified of deployment window
+- [ ] On-call engineer available during deployment
+```
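
The rollback trigger criteria in the template above can be expressed as a small decision function. This is a hedged sketch: `DeployStatus` and `rollback_reasons` are hypothetical names, and the inputs are assumed to come from whatever monitoring system tracks health checks, error rate duration, and alerts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeployStatus:
    health_checks_failing: bool
    minutes_error_rate_above_5pct: float
    critical_alert_firing: bool
    minutes_since_deploy: float
    manual_rollback_requested: bool = False


def rollback_reasons(status: DeployStatus) -> List[str]:
    """Return every trigger criterion that currently applies."""
    reasons = []
    if status.health_checks_failing:
        reasons.append("health check failures persist after deploy")
    if status.minutes_error_rate_above_5pct > 5:
        reasons.append("error rate above 5% for more than 5 minutes")
    if status.critical_alert_firing and status.minutes_since_deploy <= 15:
        reasons.append("critical alert within 15 minutes of deploy")
    if status.manual_rollback_requested:
        reasons.append("manual decision by on-call engineer")
    return reasons
```

A non-empty result means the rollback steps should start; the returned reasons can be carried into the stakeholder notification and the post-mortem timeline.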
@@ -0,0 +1,61 @@
+# Environment Strategy Template
+
+Save as `_docs/02_plans/deployment/environment_strategy.md`.
+
+---
+
+```markdown
+# [System Name] — Environment Strategy
+
+## Environments
+
+| Environment | Purpose | Infrastructure | Data Source |
+|-------------|---------|---------------|-------------|
+| Development | Local developer workflow | docker-compose | Seed data, mocked externals |
+| Staging | Pre-production validation | [mirrors production] | Anonymized production-like data |
+| Production | Live system | [full infrastructure] | Real data |
+
+## Environment Variables
+
+### Required Variables
+
+| Variable | Purpose | Dev Default | Staging/Prod Source |
+|----------|---------|-------------|-------------------|
+| `DATABASE_URL` | Postgres connection | `postgres://dev:dev@db:5432/app` | Secret manager |
+| [add all required variables] | | | |
+
+### `.env.example`
+
+```env
+# Copy to .env and fill in values
+DATABASE_URL=postgres://user:pass@host:5432/dbname
+# [all required variables with placeholder values]
+```
+
+### Variable Validation
+
+All services validate required environment variables at startup and fail fast with a clear error message if any are missing.
+
+## Secrets Management
+
+| Environment | Method | Tool |
+|-------------|--------|------|
+| Development | `.env` file (git-ignored) | dotenv |
+| Staging | Secret manager | [AWS Secrets Manager / Azure Key Vault / Vault] |
+| Production | Secret manager | [AWS Secrets Manager / Azure Key Vault / Vault] |
+
+Rotation policy: [frequency and procedure]
+
+## Database Management
+
+| Environment | Type | Migrations | Data |
+|-------------|------|-----------|------|
+| Development | Docker Postgres, named volume | Applied on container start | Seed data via init script |
+| Staging | Managed Postgres | Applied via CI/CD pipeline | Anonymized production snapshot |
+| Production | Managed Postgres | Applied via CI/CD with approval | Live data |
+
+Migration rules:
+- All migrations must be backward-compatible (support old and new code simultaneously)
+- Reversible migrations required (DOWN/rollback script)
+- Production migrations require review before apply
+```
@@ -0,0 +1,132 @@
+# Observability Template
+
+Save as `_docs/02_plans/deployment/observability.md`.
+
+---
+
+```markdown
+# [System Name] — Observability
+
+## Logging
+
+### Format
+
+Structured JSON to stdout/stderr. No file-based logging in containers.
+
+```json
+{
+  "timestamp": "ISO8601",
+  "level": "INFO",
+  "service": "service-name",
+  "correlation_id": "uuid",
+  "message": "Event description",
+  "context": {}
+}
+```
+
+### Log Levels
+
+| Level | Usage | Example |
+|-------|-------|---------|
+| ERROR | Exceptions, failures requiring attention | Database connection failed |
+| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
+| INFO | Significant business events | User registered, Order placed |
+| DEBUG | Detailed diagnostics (dev only) | Request payload, Query params |
+
+### Retention
+
+| Environment | Destination | Retention |
+|-------------|-------------|-----------|
+| Development | Console | Session |
+| Staging | [log aggregator] | 7 days |
+| Production | [log aggregator] | 30 days |
+
+### PII Rules
+
+- Never log passwords, tokens, or session IDs
+- Mask email addresses and personal identifiers
+- Log user IDs (opaque) instead of usernames
+
+## Metrics
+
+### Endpoints
+
+Every service exposes Prometheus-compatible metrics at `/metrics`.
+
+### Application Metrics
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `request_count` | Counter | Total HTTP requests by method, path, status |
+| `request_duration_seconds` | Histogram | Response time by method, path |
+| `error_count` | Counter | Failed requests by type |
+| `active_connections` | Gauge | Current open connections |
+
+### System Metrics
+
+- CPU usage, Memory usage, Disk I/O, Network I/O
+
+### Business Metrics
+
+| Metric | Type | Description | Source |
+|--------|------|-------------|--------|
+| [from acceptance criteria] | | | |
+
+Collection interval: 15 seconds
+
+## Distributed Tracing
+
+### Configuration
+
+- SDK: OpenTelemetry
+- Propagation: W3C Trace Context via HTTP headers and message queue metadata
+- Span naming: `<service>.<operation>`
+
+### Sampling
+
+| Environment | Rate | Rationale |
+|-------------|------|-----------|
+| Development | 100% | Full visibility |
+| Staging | 100% | Full visibility |
+| Production | 10% | Balance cost vs observability |
+
+### Integration Points
+
+- HTTP requests: automatic instrumentation
+- Database queries: automatic instrumentation
+- Message queues: manual span creation on publish/consume
+
+## Alerting
+
+| Severity | Response Time | Conditions |
+|----------|---------------|-----------|
+| Critical | 5 min | Service unreachable, health check failed for 1 min, data loss detected |
+| High | 30 min | Error rate > 5% for 5 min, P95 latency > 2x baseline for 10 min |
+| Medium | 4 hours | Disk usage > 80%, elevated latency, connection pool exhaustion |
+| Low | Next business day | Non-critical warnings, deprecated API usage |
+
+### Notification Channels
+
+| Severity | Channel |
+|----------|---------|
+| Critical | [PagerDuty / phone] |
+| High | [Slack + email] |
+| Medium | [Slack] |
+| Low | [Dashboard only] |
+
+## Dashboards
+
+### Operations Dashboard
+
+- Service health status (up/down per component)
+- Request rate and error rate
+- Response time percentiles (P50, P95, P99)
+- Resource utilization (CPU, memory per container)
+- Active alerts
+
+### Business Dashboard
+
+- [Key business metrics from acceptance criteria]
+- [User activity indicators]
+- [Transaction volumes]
+```
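
The PII rules in the template above can be enforced before a log entry is written. The sketch below is illustrative only: `scrub`, `mask_email`, and the `SENSITIVE_KEYS` set are hypothetical, and the email pattern is a deliberately simple approximation rather than a full address parser.

```python
import re
from typing import Any, Dict

# Simple approximation of an email address; not a full RFC 5322 parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Keys whose values must never reach the logs (assumed key names).
SENSITIVE_KEYS = {"password", "token", "session_id"}


def mask_email(text: str) -> str:
    """Replace anything that looks like an email address with a marker."""
    return EMAIL_RE.sub("[email redacted]", text)


def scrub(context: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a log context that is safe to emit."""
    clean: Dict[str, Any] = {}
    for key, value in context.items():
        if key.lower() in SENSITIVE_KEYS:
            continue  # drop passwords, tokens, and session IDs entirely
        if isinstance(value, str):
            clean[key] = mask_email(value)
        elif isinstance(value, dict):
            clean[key] = scrub(value)  # handle nested context objects
        else:
            clean[key] = value
    return clean
```

Applying `scrub` to the `context` field just before serialization keeps opaque user IDs intact while removing credentials and masking addresses.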