mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-23 02:11:13 +00:00
[AZ-900] Remove local .cursor/ copy — skills now live at ~/.cline/
@@ -1,224 +0,0 @@
# CI/CD Pipeline Template

Save as `_docs/04_deploy/ci_cd_pipeline.md`.

---

```markdown
# [System Name] — CI/CD Pipeline

## Pipeline Overview

| Stage | Trigger | Quality Gate |
|-------|---------|-------------|
| Lint | Every push | Zero lint errors |
| Test | Every push | 75%+ coverage, all tests pass |
| Security | Every push | Zero critical/high CVEs |
| Build | PR merge to dev | Docker build succeeds |
| Push | After build | Images pushed to registry |
| Deploy Staging | After push | Health checks pass |
| Smoke Tests | After staging deploy | Critical paths pass |
| Deploy Production | Manual approval | Health checks pass |

## Stage Details

### Lint
- [Language-specific linters and formatters]
- Runs in parallel per language

### Test
- Unit tests: [framework and command]
- Blackbox tests: [framework and command, uses docker-compose.test.yml]
- Coverage threshold: 75% overall, 90% critical-path floor (100% aim) — per `.cursor/rules/cursor-meta.mdc` Quality Thresholds
- Coverage report published as pipeline artifact

### Security
- Dependency audit: [tool, e.g., npm audit / pip-audit / dotnet list package --vulnerable]
- SAST scan: [tool, e.g., Semgrep / SonarQube]
- Image scan: Trivy on built Docker images
- Block on: critical or high severity findings

### Build
- Docker images built using multi-stage Dockerfiles
- Tagged with git SHA: `<registry>/<component>:<sha>`
- Build cache: Docker layer cache via CI cache action

### Push
- Registry: [container registry URL]
- Authentication: [method]

### Deploy Staging
- Deployment method: [docker compose / Kubernetes / cloud service]
- Pre-deploy: run database migrations
- Post-deploy: verify health check endpoints
- Automated rollback on health check failure

### Smoke Tests
- Subset of blackbox tests targeting staging environment
- Validates critical user flows
- Timeout: [maximum duration]

### Deploy Production
- Requires manual approval via [mechanism]
- Deployment strategy: [blue-green / rolling / canary]
- Pre-deploy: database migration review
- Post-deploy: health checks + monitoring for 15 min

## Caching Strategy

| Cache | Key | Restore Keys |
|-------|-----|-------------|
| Dependencies | [lockfile hash] | [partial match] |
| Docker layers | [Dockerfile hash] | [partial match] |
| Build artifacts | [source hash] | [partial match] |

## Parallelization

[Diagram or description of which stages run concurrently]

## Notifications

| Event | Channel | Recipients |
|-------|---------|-----------|
| Build failure | [Slack/email] | [team] |
| Security alert | [Slack/email] | [team + security] |
| Deploy success | [Slack] | [team] |
| Deploy failure | [Slack/email + PagerDuty] | [on-call] |
```
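
The `[lockfile hash]` key in the Caching Strategy table can be derived by hashing the lockfile contents, with the bare prefix serving as the partial-match restore key. The following is a minimal Python sketch under stated assumptions: `cache_key`, `restore_keys`, the `deps` prefix, and the 16-character truncation are hypothetical illustrations, not part of the template or of any CI platform's API.

```python
import hashlib


def cache_key(prefix: str, lockfile_bytes: bytes, length: int = 16) -> str:
    """Build a cache key such as 'deps-<hash>' from lockfile contents.

    Hypothetical helper: the prefix and the truncated SHA-256 digest are
    illustrative choices, not something the template prescribes.
    """
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{prefix}-{digest[:length]}"


def restore_keys(prefix: str) -> list[str]:
    """Partial-match fallback: any cache entry that shares the prefix."""
    return [f"{prefix}-"]


if __name__ == "__main__":
    # Identical lockfile contents give the same key; any change gives a new one.
    print(cache_key("deps", b"requests==2.32.0\n"))
    print(restore_keys("deps"))
```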

---

## Reference Implementation: Woodpecker CI two-workflow contract

Use this when the project's CI is **Woodpecker** and the test layout follows the autodev e2e contract from [`../../decompose/templates/test-infrastructure-task.md`](../../decompose/templates/test-infrastructure-task.md) (an `e2e/` folder containing `Dockerfile`, `docker-compose.test.yml`, `conftest.py`, `requirements.txt`, `mocks/`, `fixtures/`, `tests/`).

The contract is **two workflows in `.woodpecker/`**, scheduled on the same agent label, with the build workflow gated on a successful test run:

- `.woodpecker/01-test.yml` — runs the e2e contract, publishes `results/report.csv` as an artifact, fails the pipeline on any test failure.
- `.woodpecker/02-build-push.yml` — `depends_on: [01-test]`. Builds the image, tags it `${CI_COMMIT_BRANCH}-${TAG_SUFFIX}`, pushes it to the registry. Skipped automatically if the test workflow failed.

The agent label is parameterized via `matrix:` so a single workflow file fans out across architectures: `labels: platform: ${PLATFORM}` routes each matrix entry to the matching agent. Both workflows for a repo must use the same matrix so test and build run on the same machine and share Docker layer cache. To support a new architecture, add a matrix entry; never add a new workflow file.

### Multi-arch matrix conventions

| Variable | Meaning | Typical values |
|----------|---------|----------------|
| `PLATFORM` | Woodpecker agent label — selects which physical machine runs the entry. | `arm64`, `amd64` |
| `TAG_SUFFIX` | Image tag suffix appended after the branch name. | `arm`, `amd` |
| `DOCKERFILE` *(only when arches need different Dockerfiles)* | Path to the Dockerfile for this entry. | `Dockerfile`, `Dockerfile.jetson` |

Most repos use the same `Dockerfile` for both arches (multi-arch base images handle the rest), so `DOCKERFILE` can be omitted from the matrix and hardcoded in the build command. Repos with split per-arch Dockerfiles (e.g., `detections` uses `Dockerfile.jetson` on Jetson with TensorRT/CUDA-on-L4T) declare `DOCKERFILE` as a matrix var.

When only one architecture is currently in use, keep the matrix block with a single entry and the second entry commented out — adding a new arch is then a one-line uncomment, not a structural change.
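
To make the matrix conventions concrete, the sketch below models how each matrix entry maps to an agent label, an image tag, and a Dockerfile. It is an illustration only: `MatrixEntry` and `plan` are hypothetical names, and Woodpecker performs this expansion itself rather than through any such Python code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatrixEntry:
    platform: str                   # value of PLATFORM, used as the agent label
    tag_suffix: str                 # value of TAG_SUFFIX
    dockerfile: str = "Dockerfile"  # optional DOCKERFILE override


def plan(branch: str, entries: list[MatrixEntry]) -> list[dict[str, str]]:
    """Return one job description per matrix entry (illustrative model only)."""
    return [
        {
            "agent_label": f"platform={e.platform}",
            "image_tag": f"{branch}-{e.tag_suffix}",
            "dockerfile": e.dockerfile,
        }
        for e in entries
    ]
```

For example, `plan("dev", [MatrixEntry("arm64", "arm")])` yields a single job that runs on the `platform=arm64` agent and produces the tag `dev-arm`; enabling a second architecture only adds another `MatrixEntry` to the list.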

### `.woodpecker/01-test.yml`

```yaml
when:
  event: [push, pull_request, manual]
  branch: [dev, stage, main]

matrix:
  include:
    - PLATFORM: arm64
      TAG_SUFFIX: arm
    # - PLATFORM: amd64
    #   TAG_SUFFIX: amd

labels:
  platform: ${PLATFORM}

steps:
  - name: e2e
    image: docker
    commands:
      - cd e2e
      - docker compose -f docker-compose.test.yml up --abort-on-container-exit --exit-code-from e2e-runner --build
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  - name: report
    image: docker
    when:
      status: [success, failure]
    commands:
      - test -f e2e/results/report.csv && cat e2e/results/report.csv || echo "no report"
      # Clean up here: this step also runs after a failed e2e step.
      - cd e2e && docker compose -f docker-compose.test.yml down -v
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```

Notes:
- `--abort-on-container-exit` shuts the whole Compose project down as soon as any service exits, so a crashed dependency surfaces immediately instead of hanging the runner.
- `--exit-code-from e2e-runner` ensures the pipeline's exit code reflects the test runner's, not the SUT's.
- The `report` step runs on `[success, failure]` so the report is always published; without this the CSV is lost on red builds.
- `down -v` runs in the `report` step rather than after `up` in the `e2e` step: a step stops at its first failing command, so cleanup placed after `up` would be skipped on red runs. Running it on `[success, failure]` drops mock state and DB volumes, so every test run starts clean.
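
Since the `report` step only prints the CSV, a small summary can make red builds easier to read. The following Python sketch counts rows per status; it assumes the report has a `status` column with values such as `passed`, `failed`, or `error`, which is an assumption about the report layout and not something the contract above specifies. `summarize_report` and `has_failures` are hypothetical helper names.

```python
import csv
import io


def summarize_report(csv_text: str, status_column: str = "status") -> dict[str, int]:
    """Count rows per status value in a CSV test report.

    Assumes a header row containing `status_column`; rows without a value
    are counted as 'unknown'.
    """
    counts: dict[str, int] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        status = (row.get(status_column) or "unknown").strip().lower()
        counts[status] = counts.get(status, 0) + 1
    return counts


def has_failures(counts: dict[str, int]) -> bool:
    """True when any row is marked 'failed' or 'error'."""
    return counts.get("failed", 0) + counts.get("error", 0) > 0
```

The pipeline result should still come from `--exit-code-from e2e-runner`; a summary like this is only a reading aid.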

### `.woodpecker/02-build-push.yml`

```yaml
when:
  event: [push, manual]
  branch: [dev, stage, main]

depends_on:
  - 01-test

matrix:
  include:
    - PLATFORM: arm64
      TAG_SUFFIX: arm
    # - PLATFORM: amd64
    #   TAG_SUFFIX: amd

labels:
  platform: ${PLATFORM}

steps:
  - name: build-push
    image: docker
    environment:
      REGISTRY_HOST:
        from_secret: registry_host
      REGISTRY_USER:
        from_secret: registry_user
      REGISTRY_TOKEN:
        from_secret: registry_token
    commands:
      - echo "$REGISTRY_TOKEN" | docker login "$REGISTRY_HOST" -u "$REGISTRY_USER" --password-stdin
      - export TAG=${CI_COMMIT_BRANCH}-${TAG_SUFFIX}
      - export BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
      - |
        docker build -f Dockerfile \
          --build-arg CI_COMMIT_SHA=$CI_COMMIT_SHA \
          --label org.opencontainers.image.revision=$CI_COMMIT_SHA \
          --label org.opencontainers.image.created=$BUILD_DATE \
          --label org.opencontainers.image.source=$CI_REPO_URL \
          -t $REGISTRY_HOST/azaion/<service>:$TAG .
      - docker push $REGISTRY_HOST/azaion/<service>:$TAG
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```

Notes:
- `depends_on: [01-test]` is enforced by Woodpecker — a failed `01-test` (any matrix entry) skips this workflow.
- The build workflow does NOT trigger on `pull_request` events: PRs get test signal only; pushes to `dev`/`stage`/`main` produce images. Avoids polluting the registry with PR images.
- Replace `<service>` with the actual service name (matches the registry namespace pattern `azaion/<service>`).
- For repos with split per-arch Dockerfiles, add `DOCKERFILE: Dockerfile.jetson` (or similar) to the matrix entry and substitute `${DOCKERFILE}` for `Dockerfile` in the `docker build -f` line.
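
The tag and label wiring in the build step can be checked outside CI by assembling the same arguments as a list. This is a sketch under stated assumptions: `build_command` is a hypothetical helper, the argument values stand in for the Woodpecker variables named in the workflow, and only `docker build` flags that already appear in the workflow above are used.

```python
def build_command(
    registry_host: str,
    service: str,
    branch: str,
    tag_suffix: str,
    commit_sha: str,
    repo_url: str,
    build_date: str,
    dockerfile: str = "Dockerfile",
) -> list[str]:
    """Assemble the `docker build` argv used by the build-push step."""
    tag = f"{branch}-{tag_suffix}"  # mirrors ${CI_COMMIT_BRANCH}-${TAG_SUFFIX}
    image = f"{registry_host}/azaion/{service}:{tag}"
    return [
        "docker", "build", "-f", dockerfile,
        "--build-arg", f"CI_COMMIT_SHA={commit_sha}",
        "--label", f"org.opencontainers.image.revision={commit_sha}",
        "--label", f"org.opencontainers.image.created={build_date}",
        "--label", f"org.opencontainers.image.source={repo_url}",
        "-t", image, ".",
    ]
```

Passing `dockerfile="Dockerfile.jetson"` corresponds to the `${DOCKERFILE}` substitution described in the last note. Building an argument list instead of a shell string also sidesteps quoting problems if a value ever contains spaces.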

### Variations by stack

The contract is language-agnostic because the runner is `docker compose`. The Dockerfile inside `e2e/` selects the test framework:

| Stack | `e2e/Dockerfile` runs |
|-------|----------------------|
| Python | `pytest --csv=/results/report.csv -v` (requires the `pytest-csv` plugin) |
| .NET | `dotnet test --logger:"trx;LogFileName=/results/report.trx"` (convert to CSV in a final step if needed) |
| Node/UI | `JEST_JUNIT_OUTPUT_DIR=/results npm test -- --reporters=default --reporters=jest-junit` |
| Rust | `cargo test --no-fail-fast -- -Z unstable-options --format json > /results/report.json` (nightly toolchain; the libtest JSON format is unstable) |

When the repo has **only unit tests** (no `e2e/docker-compose.test.yml`), drop the compose orchestration and run the native test command directly inside a stack-appropriate image. Keep the same two-workflow split — `01-test.yml` runs unit tests, `02-build-push.yml` is unchanged.

### Manual-trigger override (test infrastructure not yet validated)

If a repo ships a complete `e2e/` layout but the test fixtures are not yet validated end-to-end (e.g., expected-results data is still being authored), gate `01-test.yml` on `event: [manual]` only and add a TODO comment pointing to the unblocking task. The `02-build-push.yml` workflow drops its `depends_on` clause for the manual-only window — an explicit and reversible exception, not a permanent split.

@@ -1,94 +0,0 @@
# Containerization Plan Template

Save as `_docs/04_deploy/containerization.md`.

---

````markdown
# [System Name] — Containerization

## Component Dockerfiles

### [Component Name]

| Property | Value |
|----------|-------|
| Base image | [e.g., mcr.microsoft.com/dotnet/aspnet:8.0-alpine] |
| Build image | [e.g., mcr.microsoft.com/dotnet/sdk:8.0-alpine] |
| Stages | [dependency install → build → production] |
| User | [non-root user name] |
| Health check | [endpoint and command] |
| Exposed ports | [port list] |
| Key build args | [if any] |

### [Repeat for each component]

## Docker Compose — Local Development

```yaml
# docker-compose.yml structure
services:
  [component]:
    build: ./[path]
    ports: ["host:container"]
    environment: [reference .env.dev]
    depends_on: [dependencies with health condition]
    healthcheck: [command, interval, timeout, retries]

  db:
    image: [postgres:version-alpine]
    volumes: [named volume]
    environment: [credentials from .env.dev]
    healthcheck: [pg_isready]

volumes:
  [named volumes]

networks:
  [shared network]
```

## Docker Compose — Blackbox Tests

```yaml
# docker-compose.test.yml structure
services:
  [app components under test]

  test-runner:
    build: ./tests/integration
    depends_on: [app components with health condition]
    environment: [test configuration]
    # Exit code determines test pass/fail

  db:
    image: [postgres:version-alpine]
    volumes: [seed data mount]
```

Run: `docker compose -f docker-compose.test.yml up --abort-on-container-exit --exit-code-from test-runner`

## Image Tagging Strategy

| Context | Tag Format | Example |
|---------|-----------|---------|
| CI build | `<registry>/<project>/<component>:<git-sha>` | `ghcr.io/org/api:a1b2c3d` |
| Release | `<registry>/<project>/<component>:<semver>` | `ghcr.io/org/api:1.2.0` |
| Local dev | `<component>:latest` | `api:latest` |

## .dockerignore

```
.git
.cursor
_docs
_standalone
node_modules
**/bin
**/obj
**/__pycache__
*.md
.env*
docker-compose*.yml
```
````
@@ -1,114 +0,0 @@
# Deployment Scripts Documentation Template

Save as `_docs/04_deploy/deploy_scripts.md`.

---

```markdown
# [System Name] — Deployment Scripts

## Overview

| Script | Purpose | Location |
|--------|---------|----------|
| `deploy.sh` | Main deployment orchestrator | `scripts/deploy.sh` |
| `pull-images.sh` | Pull Docker images from registry | `scripts/pull-images.sh` |
| `start-services.sh` | Start all services | `scripts/start-services.sh` |
| `stop-services.sh` | Graceful shutdown | `scripts/stop-services.sh` |
| `health-check.sh` | Verify deployment health | `scripts/health-check.sh` |

## Prerequisites

- Docker and Docker Compose installed on target machine
- SSH access to target machine (configured via `DEPLOY_HOST`)
- Container registry credentials configured
- `.env` file with required environment variables (see `.env.example`)

## Environment Variables

All scripts source `.env` from the project root or accept variables from the environment.

| Variable | Required By | Purpose |
|----------|------------|---------|
| `DEPLOY_HOST` | All (remote mode) | SSH target for remote deployment |
| `REGISTRY_URL` | `pull-images.sh` | Container registry URL |
| `REGISTRY_USER` | `pull-images.sh` | Registry authentication |
| `REGISTRY_PASS` | `pull-images.sh` | Registry authentication |
| `IMAGE_TAG` | `pull-images.sh`, `start-services.sh` | Image version to deploy (default: latest git SHA) |
| [add project-specific variables] | | |

## Script Details

### deploy.sh

Main orchestrator that runs the full deployment flow.

**Usage**:
- `./scripts/deploy.sh` — Deploy latest version
- `./scripts/deploy.sh --rollback` — Rollback to previous version
- `./scripts/deploy.sh --help` — Show usage

**Flow**:
1. Validate required environment variables
2. Call `pull-images.sh`
3. Call `stop-services.sh`
4. Call `start-services.sh`
5. Call `health-check.sh`
6. Report success or failure

**Rollback**: When `--rollback` is passed, reads the previous image tags saved by `stop-services.sh` and redeploys those versions.

### pull-images.sh

**Usage**: `./scripts/pull-images.sh [--help]`

**Steps**:
1. Authenticate with container registry (`REGISTRY_URL`)
2. Pull all required images with specified `IMAGE_TAG`
3. Verify image integrity via digest check
4. Report pull results per image

### start-services.sh

**Usage**: `./scripts/start-services.sh [--help]`

**Steps**:
1. Run `docker compose up -d` with the correct env file
2. Configure networks and volumes
3. Wait for all containers to report healthy state
4. Report startup status per service

### stop-services.sh

**Usage**: `./scripts/stop-services.sh [--help]`

**Steps**:
1. Save current image tags to `previous_tags.env` (for rollback)
2. Stop services with graceful shutdown period (30s)
3. Clean up orphaned containers and networks

### health-check.sh

**Usage**: `./scripts/health-check.sh [--help]`

**Checks**:

| Service | Endpoint | Expected |
|---------|----------|----------|
| [Component 1] | `http://localhost:[port]/health/live` | HTTP 200 |
| [Component 2] | `http://localhost:[port]/health/ready` | HTTP 200 |
| [add all services] | | |

**Exit codes**:
- `0` — All services healthy
- `1` — One or more services unhealthy

## Common Script Properties

All scripts:
- Use `#!/bin/bash` with `set -euo pipefail`
- Support `--help` flag for usage information
- Source `.env` from project root if present
- Are idempotent where possible
- Support remote execution via SSH when `DEPLOY_HOST` is set
```
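
The exit-code contract of `health-check.sh` (0 when every endpoint returns HTTP 200, 1 otherwise) can be prototyped before the shell script exists. The sketch below is written in Python rather than the Bash the template specifies, purely as an illustration; `http_status` and `check_all` are hypothetical names, and the endpoint URL and port in the usage block are placeholders.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def check_all(endpoints: dict[str, str],
              fetch: Callable[[str], int] = http_status) -> int:
    """Return 0 if every endpoint answers HTTP 200, else 1."""
    unhealthy = [name for name, url in endpoints.items() if fetch(url) != 200]
    for name in unhealthy:
        print(f"UNHEALTHY: {name}")
    return 1 if unhealthy else 0


if __name__ == "__main__":
    # Placeholder endpoint; replace with the services from the Checks table.
    raise SystemExit(check_all({"api": "http://localhost:8080/health/ready"}))
```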
@@ -1,73 +0,0 @@
# Deployment Status Report Template

Save as `_docs/04_deploy/reports/deploy_status_report.md`.

---

```markdown
# [System Name] — Deployment Status Report

## Deployment Readiness Summary

| Aspect | Status | Notes |
|--------|--------|-------|
| Architecture defined | ✅ / ❌ | |
| Component specs complete | ✅ / ❌ | |
| Infrastructure prerequisites met | ✅ / ❌ | |
| External dependencies identified | ✅ / ❌ | |
| Blockers | [count] | [summary] |

## Component Status

| Component | State | Docker-ready | Notes |
|-----------|-------|-------------|-------|
| [Component 1] | planned / implemented / tested | yes / no | |
| [Component 2] | planned / implemented / tested | yes / no | |

## External Dependencies

| Dependency | Type | Required For | Status |
|------------|------|-------------|--------|
| [e.g., PostgreSQL] | Database | Data persistence | [available / needs setup] |
| [e.g., Redis] | Cache | Session management | [available / needs setup] |
| [e.g., External API] | API | [purpose] | [available / needs setup] |

## Infrastructure Prerequisites

| Prerequisite | Status | Action Needed |
|-------------|--------|--------------|
| Container registry | [ready / not set up] | [action] |
| Cloud account | [ready / not set up] | [action] |
| DNS configuration | [ready / not set up] | [action] |
| SSL certificates | [ready / not set up] | [action] |
| CI/CD platform | [ready / not set up] | [action] |
| Secret manager | [ready / not set up] | [action] |

## Deployment Blockers

| Blocker | Severity | Resolution |
|---------|----------|-----------|
| [blocker description] | critical / high / medium | [resolution steps] |

## Required Environment Variables

| Variable | Purpose | Required In | Default (Dev) | Source (Staging/Prod) |
|----------|---------|------------|---------------|----------------------|
| `DATABASE_URL` | Postgres connection string | All components | `postgres://dev:dev@db:5432/app` | Secret manager |
| `DEPLOY_HOST` | Remote target machine | Deployment scripts | `localhost` | Environment |
| `REGISTRY_URL` | Container registry URL | CI/CD, deploy scripts | `localhost:5000` | Environment |
| `REGISTRY_USER` | Registry username | CI/CD, deploy scripts | — | Secret manager |
| `REGISTRY_PASS` | Registry password | CI/CD, deploy scripts | — | Secret manager |
| [add all required variables] | | | | |

## .env Files Created

- `.env.example` — committed to VCS, contains all variable names with placeholder values
- `.env` — git-ignored, contains development defaults

## Next Steps

1. [Resolve any blockers listed above]
2. [Set up missing infrastructure prerequisites]
3. [Proceed to containerization planning]
```
@@ -1,103 +0,0 @@
# Deployment Procedures Template

Save as `_docs/04_deploy/deployment_procedures.md`.

---

```markdown
# [System Name] — Deployment Procedures

## Deployment Strategy

**Pattern**: [blue-green / rolling / canary]
**Rationale**: [why this pattern fits the architecture]
**Zero-downtime**: required for production deployments

### Graceful Shutdown

- Grace period: 30 seconds for in-flight requests
- Sequence: stop accepting new requests → drain connections → shutdown
- Container orchestrator: `terminationGracePeriodSeconds: 40`

### Database Migration Ordering

- Migrations run **before** new code deploys
- All migrations must be backward-compatible (old code works with new schema)
- Irreversible migrations require explicit approval

## Health Checks

| Check | Type | Endpoint | Interval | Failure Threshold | Action |
|-------|------|----------|----------|-------------------|--------|
| Liveness | HTTP GET | `/health/live` | 10s | 3 failures | Restart container |
| Readiness | HTTP GET | `/health/ready` | 5s | 3 failures | Remove from load balancer |
| Startup | HTTP GET | `/health/ready` | 5s | 30 attempts | Kill and recreate |

### Health Check Responses

- `/health/live`: returns 200 if process is running (no dependency checks)
- `/health/ready`: returns 200 if all dependencies (DB, cache, queues) are reachable

## Staging Deployment

1. CI/CD builds and pushes Docker images tagged with git SHA
2. Run database migrations against staging
3. Deploy new images to staging environment
4. Wait for health checks to pass (readiness probe)
5. Run smoke tests against staging
6. If smoke tests fail: automatic rollback to previous image

## Production Deployment

1. **Approval**: manual approval required via [mechanism]
2. **Pre-deploy checks**:
   - [ ] Staging smoke tests passed
   - [ ] Security scan clean
   - [ ] Database migration reviewed
   - [ ] Monitoring alerts configured
   - [ ] Rollback plan confirmed
3. **Deploy**: apply deployment strategy (blue-green / rolling / canary)
4. **Verify**: health checks pass, error rate stable, latency within baseline
5. **Monitor**: observe dashboards for 15 minutes post-deploy
6. **Finalize**: mark deployment as successful or trigger rollback

## Rollback Procedures

### Trigger Criteria

- Health check failures persist after deploy
- Error rate exceeds 5% for more than 5 minutes
- Critical alert fires within 15 minutes of deploy
- Manual decision by on-call engineer

### Rollback Steps

1. Redeploy previous Docker image tag (from CI/CD artifact)
2. Verify health checks pass
3. If database migration was applied:
   - Run DOWN migration if reversible
   - If irreversible: assess data impact, escalate if needed
4. Notify stakeholders
5. Schedule post-mortem within 24 hours

### Post-Mortem

Required after every production rollback:
- Timeline of events
- Root cause
- What went wrong
- Prevention measures

## Deployment Checklist

- [ ] All tests pass in CI
- [ ] Security scan clean (zero critical/high CVEs)
- [ ] Docker images built and pushed
- [ ] Database migrations reviewed and tested
- [ ] Environment variables configured for target environment
- [ ] Health check endpoints verified
- [ ] Monitoring alerts configured
- [ ] Rollback plan documented and tested
- [ ] Stakeholders notified of deployment window
- [ ] On-call engineer available during deployment
```
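
The rollback trigger "error rate exceeds 5% for more than 5 minutes" is easy to misread as "any sample above 5%". The sketch below shows one possible interpretation: the rate must stay above the threshold continuously, and a single sample at or below it resets the clock. `Sample` and `should_roll_back` are hypothetical names, and the sampling source is left open because the template does not define one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    t_seconds: float   # seconds since the deploy finished
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def should_roll_back(samples: list[Sample],
                     threshold: float = 0.05,
                     window_seconds: float = 300.0) -> bool:
    """True once the error rate has stayed above `threshold` for longer
    than `window_seconds` without dropping back in between."""
    breach_start = None
    for s in sorted(samples, key=lambda s: s.t_seconds):
        if s.error_rate > threshold:
            if breach_start is None:
                breach_start = s.t_seconds
            if s.t_seconds - breach_start > window_seconds:
                return True
        else:
            breach_start = None  # recovered: restart the clock
    return False


if __name__ == "__main__":
    # One sample per minute at 6% errors for six minutes after deploy.
    steady = [Sample(t, 0.06) for t in range(0, 361, 60)]
    print(should_roll_back(steady))
```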
@@ -1,61 +0,0 @@
# Environment Strategy Template

Save as `_docs/04_deploy/environment_strategy.md`.

---

````markdown
# [System Name] — Environment Strategy

## Environments

| Environment | Purpose | Infrastructure | Data Source |
|-------------|---------|---------------|-------------|
| Development | Local developer workflow | docker-compose | Seed data, mocked externals |
| Staging | Pre-production validation | [mirrors production] | Anonymized production-like data |
| Production | Live system | [full infrastructure] | Real data |

## Environment Variables

### Required Variables

| Variable | Purpose | Dev Default | Staging/Prod Source |
|----------|---------|-------------|-------------------|
| `DATABASE_URL` | Postgres connection | `postgres://dev:dev@db:5432/app` | Secret manager |
| [add all required variables] | | | |

### `.env.example`

```env
# Copy to .env and fill in values
DATABASE_URL=postgres://user:pass@host:5432/dbname
# [all required variables with placeholder values]
```

### Variable Validation

All services validate required environment variables at startup and fail fast with a clear error message if any are missing.

## Secrets Management

| Environment | Method | Tool |
|-------------|--------|------|
| Development | `.env` file (git-ignored) | dotenv |
| Staging | Secret manager | [AWS Secrets Manager / Azure Key Vault / Vault] |
| Production | Secret manager | [AWS Secrets Manager / Azure Key Vault / Vault] |

Rotation policy: [frequency and procedure]

## Database Management

| Environment | Type | Migrations | Data |
|-------------|------|-----------|------|
| Development | Docker Postgres, named volume | Applied on container start | Seed data via init script |
| Staging | Managed Postgres | Applied via CI/CD pipeline | Anonymized production snapshot |
| Production | Managed Postgres | Applied via CI/CD with approval | Live data |

Migration rules:
- All migrations must be backward-compatible (support old and new code simultaneously)
- Reversible migrations required (DOWN/rollback script)
- Production migrations require review before apply
````
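
The Variable Validation rule (fail fast at startup with a clear error message) could look like the following in a Python service. This is a sketch, not a prescribed implementation: `validate_env` and `MissingEnvError` are hypothetical names, and treating empty strings as missing is an assumption the template does not state.

```python
import os
from collections.abc import Mapping


class MissingEnvError(RuntimeError):
    """Raised at startup when required environment variables are absent."""


def validate_env(required: list[str],
                 environ: Mapping[str, str] = os.environ) -> None:
    """Raise MissingEnvError naming every missing or empty variable."""
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise MissingEnvError(
            "Missing required environment variables: " + ", ".join(missing)
        )


if __name__ == "__main__":
    # Call this before opening any connections so the process exits early.
    validate_env(["DATABASE_URL"])
```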
@@ -1,132 +0,0 @@
# Observability Template

Save as `_docs/04_deploy/observability.md`.

---

````markdown
# [System Name] — Observability

## Logging

### Format

Structured JSON to stdout/stderr. No file-based logging in containers.

```json
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
```

### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostics (dev/staging only) | Request payload, Query params |

### Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console | Session |
| Staging | [log aggregator] | 7 days |
| Production | [log aggregator] | 30 days |

### PII Rules

- Never log passwords, tokens, or session IDs
- Mask email addresses and personal identifiers
- Log user IDs (opaque) instead of usernames

## Metrics

### Endpoints

Every service exposes Prometheus-compatible metrics at `/metrics`.

### Application Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `request_count` | Counter | Total HTTP requests by method, path, status |
| `request_duration_seconds` | Histogram | Response time by method, path |
| `error_count` | Counter | Failed requests by type |
| `active_connections` | Gauge | Current open connections |

### System Metrics

- CPU usage, Memory usage, Disk I/O, Network I/O

### Business Metrics

| Metric | Type | Description | Source |
|--------|------|-------------|--------|
| [from acceptance criteria] | | | |

Collection interval: 15 seconds

## Distributed Tracing

### Configuration

- SDK: OpenTelemetry
- Propagation: W3C Trace Context via HTTP headers
- Span naming: `<service>.<operation>`

### Sampling

| Environment | Rate | Rationale |
|-------------|------|-----------|
| Development | 100% | Full visibility |
| Staging | 100% | Full visibility |
| Production | 10% | Balance cost vs observability |

### Integration Points

- HTTP requests: automatic instrumentation
- Database queries: automatic instrumentation
- Message queues: manual span creation on publish/consume

## Alerting

| Severity | Response Time | Conditions |
|----------|---------------|-----------|
| Critical | 5 min | Service unreachable, health check failed for 1 min, data loss detected |
| High | 30 min | Error rate > 5% for 5 min, P95 latency > 2x baseline for 10 min |
| Medium | 4 hours | Disk usage > 80%, elevated latency, connection pool exhaustion |
| Low | Next business day | Non-critical warnings, deprecated API usage |

### Notification Channels

| Severity | Channel |
|----------|---------|
| Critical | [PagerDuty / phone] |
| High | [Slack + email] |
| Medium | [Slack] |
| Low | [Dashboard only] |

## Dashboards

### Operations Dashboard

- Service health status (up/down per component)
- Request rate and error rate
- Response time percentiles (P50, P95, P99)
- Resource utilization (CPU, memory per container)
- Active alerts

### Business Dashboard

- [Key business metrics from acceptance criteria]
- [User activity indicators]
- [Transaction volumes]
````
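
The structured log format from the Logging section can be produced with the Python standard library alone. The sketch below is one possible shape, under stated assumptions: `JsonFormatter` is a hypothetical class name, and reading `correlation_id` and `context` from attributes set through `extra=` is an illustrative convention, not something the template mandates.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the template's fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # Both attributes are optional and arrive via `extra=`.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler(sys.stdout)  # stdout only, no log files
    handler.setFormatter(JsonFormatter("service-name"))
    log = logging.getLogger("demo")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("Order placed",
             extra={"correlation_id": "req-1", "context": {"order_id": 42}})
```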