Mirror of https://github.com/azaion/ui.git, synced 2026-04-22 22:46:34 +00:00
Update Dockerfile to use Bun for package management, remove package-lock.json, and adjust .gitignore to include it.
@@ -0,0 +1,87 @@
# CI/CD Pipeline Template

Save as `_docs/04_deploy/ci_cd_pipeline.md`.

---

```markdown
# [System Name] — CI/CD Pipeline

## Pipeline Overview

| Stage | Trigger | Quality Gate |
|-------|---------|-------------|
| Lint | Every push | Zero lint errors |
| Test | Every push | 75%+ coverage, all tests pass |
| Security | Every push | Zero critical/high CVEs |
| Build | PR merge to dev | Docker build succeeds |
| Push | After build | Images pushed to registry |
| Deploy Staging | After push | Health checks pass |
| Smoke Tests | After staging deploy | Critical paths pass |
| Deploy Production | Manual approval | Health checks pass |

## Stage Details

### Lint
- [Language-specific linters and formatters]
- Runs in parallel per language

### Test
- Unit tests: [framework and command]
- Blackbox tests: [framework and command, uses docker-compose.test.yml]
- Coverage threshold: 75% overall, 90% critical paths
- Coverage report published as pipeline artifact

### Security
- Dependency audit: [tool, e.g., npm audit / pip-audit / dotnet list package --vulnerable]
- SAST scan: [tool, e.g., Semgrep / SonarQube]
- Image scan: Trivy on built Docker images
- Block on: critical or high severity findings

### Build
- Docker images built using multi-stage Dockerfiles
- Tagged with git SHA: `<registry>/<component>:<sha>`
- Build cache: Docker layer cache via CI cache action

### Push
- Registry: [container registry URL]
- Authentication: [method]

### Deploy Staging
- Deployment method: [docker compose / Kubernetes / cloud service]
- Pre-deploy: run database migrations
- Post-deploy: verify health check endpoints
- Automated rollback on health check failure

### Smoke Tests
- Subset of blackbox tests targeting staging environment
- Validates critical user flows
- Timeout: [maximum duration]

### Deploy Production
- Requires manual approval via [mechanism]
- Deployment strategy: [blue-green / rolling / canary]
- Pre-deploy: database migration review
- Post-deploy: health checks + monitoring for 15 min

## Caching Strategy

| Cache | Key | Restore Keys |
|-------|-----|-------------|
| Dependencies | [lockfile hash] | [partial match] |
| Docker layers | [Dockerfile hash] | [partial match] |
| Build artifacts | [source hash] | [partial match] |

## Parallelization

[Diagram or description of which stages run concurrently]

## Notifications

| Event | Channel | Recipients |
|-------|---------|-----------|
| Build failure | [Slack/email] | [team] |
| Security alert | [Slack/email] | [team + security] |
| Deploy success | [Slack] | [team] |
| Deploy failure | [Slack/email + PagerDuty] | [on-call] |
```
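
The quality gates in the Pipeline Overview table can be evaluated by a small check that runs before the Build stage. The following is a minimal sketch in Python and is not part of the template: `PipelineResults` and `failed_gates` are hypothetical names, and the thresholds (75% overall coverage, 90% on critical paths, zero critical/high CVEs) are taken from the table and the Test and Security stage details above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineResults:
    """Hypothetical summary of what the Lint, Test, and Security stages report."""
    lint_errors: int
    tests_failed: int
    coverage_overall: float       # percent, 0-100
    coverage_critical: float      # percent, 0-100, critical paths only
    critical_or_high_cves: int


def failed_gates(results: PipelineResults) -> List[str]:
    """Return the names of the failing gates; an empty list means Build may start."""
    failures = []
    if results.lint_errors > 0:
        failures.append("lint")
    if results.tests_failed > 0:
        failures.append("tests")
    # Thresholds from the template: 75% overall, 90% on critical paths.
    if results.coverage_overall < 75.0 or results.coverage_critical < 90.0:
        failures.append("coverage")
    if results.critical_or_high_cves > 0:
        failures.append("security")
    return failures
```

A CI job could call `failed_gates` with the parsed stage reports and exit non-zero when the returned list is not empty. How the reports are parsed depends on the tools chosen for each stage, so that part is left out here.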
@@ -0,0 +1,94 @@
# Containerization Plan Template

Save as `_docs/04_deploy/containerization.md`.

---

```markdown
# [System Name] — Containerization

## Component Dockerfiles

### [Component Name]

| Property | Value |
|----------|-------|
| Base image | [e.g., mcr.microsoft.com/dotnet/aspnet:8.0-alpine] |
| Build image | [e.g., mcr.microsoft.com/dotnet/sdk:8.0-alpine] |
| Stages | [dependency install → build → production] |
| User | [non-root user name] |
| Health check | [endpoint and command] |
| Exposed ports | [port list] |
| Key build args | [if any] |

### [Repeat for each component]

## Docker Compose — Local Development

```yaml
# docker-compose.yml structure
services:
  [component]:
    build: ./[path]
    ports: ["host:container"]
    environment: [reference .env.dev]
    depends_on: [dependencies with health condition]
    healthcheck: [command, interval, timeout, retries]

  db:
    image: [postgres:version-alpine]
    volumes: [named volume]
    environment: [credentials from .env.dev]
    healthcheck: [pg_isready]

volumes:
  [named volumes]

networks:
  [shared network]
```

## Docker Compose — Blackbox Tests

```yaml
# docker-compose.test.yml structure
services:
  [app components under test]

  test-runner:
    build: ./tests/integration
    depends_on: [app components with health condition]
    environment: [test configuration]
    # Exit code determines test pass/fail

  db:
    image: [postgres:version-alpine]
    volumes: [seed data mount]
```

Run: `docker compose -f docker-compose.test.yml up --abort-on-container-exit --exit-code-from test-runner`

## Image Tagging Strategy

| Context | Tag Format | Example |
|---------|-----------|---------|
| CI build | `<registry>/<project>/<component>:<git-sha>` | `ghcr.io/org/api:a1b2c3d` |
| Release | `<registry>/<project>/<component>:<semver>` | `ghcr.io/org/api:1.2.0` |
| Local dev | `<component>:latest` | `api:latest` |

## .dockerignore

```
.git
.cursor
_docs
_standalone
node_modules
**/bin
**/obj
**/__pycache__
*.md
.env*
docker-compose*.yml
```
```
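
The Image Tagging Strategy table can be illustrated with a small helper that builds an image reference for each context. This is a sketch under assumptions, not part of the template: `image_ref` is a hypothetical name, and the two regular expressions are a simplified reading of "git SHA" (7 to 40 lowercase hex characters) and "semver" (`MAJOR.MINOR.PATCH` with no pre-release suffix).

```python
import re
from typing import Optional

GIT_SHA = re.compile(r"[0-9a-f]{7,40}")   # assumption: short or full lowercase SHA
SEMVER = re.compile(r"\d+\.\d+\.\d+")     # assumption: plain MAJOR.MINOR.PATCH only


def image_ref(component: str, version: str = "latest",
              registry: Optional[str] = None,
              project: Optional[str] = None) -> str:
    """Build an image reference following the tagging table."""
    if version != "latest" and not (GIT_SHA.fullmatch(version) or SEMVER.fullmatch(version)):
        raise ValueError(f"unsupported tag: {version!r}")
    # Local dev images carry no registry or project prefix.
    prefix = "/".join(part for part in (registry, project) if part)
    name = f"{prefix}/{component}" if prefix else component
    return f"{name}:{version}"


print(image_ref("api", "a1b2c3d", registry="ghcr.io", project="org"))
print(image_ref("api"))
```

With the values from the table's Example column, the first call produces `ghcr.io/org/api:a1b2c3d` and the second produces `api:latest`.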
@@ -0,0 +1,114 @@
# Deployment Scripts Documentation Template

Save as `_docs/04_deploy/deploy_scripts.md`.

---

```markdown
# [System Name] — Deployment Scripts

## Overview

| Script | Purpose | Location |
|--------|---------|----------|
| `deploy.sh` | Main deployment orchestrator | `scripts/deploy.sh` |
| `pull-images.sh` | Pull Docker images from registry | `scripts/pull-images.sh` |
| `start-services.sh` | Start all services | `scripts/start-services.sh` |
| `stop-services.sh` | Graceful shutdown | `scripts/stop-services.sh` |
| `health-check.sh` | Verify deployment health | `scripts/health-check.sh` |

## Prerequisites

- Docker and Docker Compose installed on target machine
- SSH access to target machine (configured via `DEPLOY_HOST`)
- Container registry credentials configured
- `.env` file with required environment variables (see `.env.example`)

## Environment Variables

All scripts source `.env` from the project root or accept variables from the environment.

| Variable | Required By | Purpose |
|----------|------------|---------|
| `DEPLOY_HOST` | All (remote mode) | SSH target for remote deployment |
| `REGISTRY_URL` | `pull-images.sh` | Container registry URL |
| `REGISTRY_USER` | `pull-images.sh` | Registry authentication |
| `REGISTRY_PASS` | `pull-images.sh` | Registry authentication |
| `IMAGE_TAG` | `pull-images.sh`, `start-services.sh` | Image version to deploy (default: latest git SHA) |
| [add project-specific variables] | | |

## Script Details

### deploy.sh

Main orchestrator that runs the full deployment flow.

**Usage**:
- `./scripts/deploy.sh` — Deploy latest version
- `./scripts/deploy.sh --rollback` — Rollback to previous version
- `./scripts/deploy.sh --help` — Show usage

**Flow**:
1. Validate required environment variables
2. Call `pull-images.sh`
3. Call `stop-services.sh`
4. Call `start-services.sh`
5. Call `health-check.sh`
6. Report success or failure

**Rollback**: When `--rollback` is passed, reads the previous image tags saved by `stop-services.sh` and redeploys those versions.

### pull-images.sh

**Usage**: `./scripts/pull-images.sh [--help]`

**Steps**:
1. Authenticate with container registry (`REGISTRY_URL`)
2. Pull all required images with specified `IMAGE_TAG`
3. Verify image integrity via digest check
4. Report pull results per image

### start-services.sh

**Usage**: `./scripts/start-services.sh [--help]`

**Steps**:
1. Run `docker compose up -d` with the correct env file
2. Configure networks and volumes
3. Wait for all containers to report healthy state
4. Report startup status per service

### stop-services.sh

**Usage**: `./scripts/stop-services.sh [--help]`

**Steps**:
1. Save current image tags to `previous_tags.env` (for rollback)
2. Stop services with graceful shutdown period (30s)
3. Clean up orphaned containers and networks

### health-check.sh

**Usage**: `./scripts/health-check.sh [--help]`

**Checks**:

| Service | Endpoint | Expected |
|---------|----------|----------|
| [Component 1] | `http://localhost:[port]/health/live` | HTTP 200 |
| [Component 2] | `http://localhost:[port]/health/ready` | HTTP 200 |
| [add all services] | | |

**Exit codes**:
- `0` — All services healthy
- `1` — One or more services unhealthy

## Common Script Properties

All scripts:
- Use `#!/bin/bash` with `set -euo pipefail`
- Support `--help` flag for usage information
- Source `.env` from project root if present
- Are idempotent where possible
- Support remote execution via SSH when `DEPLOY_HOST` is set
```
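
The health-check contract described above (every endpoint must answer HTTP 200; exit code `0` when all services are healthy, `1` otherwise) can be sketched as follows. The template specifies Bash scripts, so this Python version only illustrates the logic. `http_status` and `health_exit_code` are hypothetical names, and the ports in the usage note are placeholders.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code          # the server answered, but not with a success status
    except OSError:
        return 0                 # connection refused, DNS failure, timeout, ...


def health_exit_code(endpoints: Dict[str, str],
                     probe: Callable[[str], int] = http_status) -> int:
    """Return 0 if every endpoint answers HTTP 200, otherwise 1."""
    unhealthy = [name for name, url in endpoints.items() if probe(url) != 200]
    for name in unhealthy:
        print(f"unhealthy: {name}")
    return 1 if unhealthy else 0
```

A caller would pass a mapping such as `{"api": "http://localhost:8080/health/live"}` and hand the result to `sys.exit`. The `probe` parameter exists so the logic can be exercised without a running service.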
@@ -0,0 +1,73 @@
# Deployment Status Report Template

Save as `_docs/04_deploy/reports/deploy_status_report.md`.

---

```markdown
# [System Name] — Deployment Status Report

## Deployment Readiness Summary

| Aspect | Status | Notes |
|--------|--------|-------|
| Architecture defined | ✅ / ❌ | |
| Component specs complete | ✅ / ❌ | |
| Infrastructure prerequisites met | ✅ / ❌ | |
| External dependencies identified | ✅ / ❌ | |
| Blockers | [count] | [summary] |

## Component Status

| Component | State | Docker-ready | Notes |
|-----------|-------|-------------|-------|
| [Component 1] | planned / implemented / tested | yes / no | |
| [Component 2] | planned / implemented / tested | yes / no | |

## External Dependencies

| Dependency | Type | Required For | Status |
|------------|------|-------------|--------|
| [e.g., PostgreSQL] | Database | Data persistence | [available / needs setup] |
| [e.g., Redis] | Cache | Session management | [available / needs setup] |
| [e.g., External API] | API | [purpose] | [available / needs setup] |

## Infrastructure Prerequisites

| Prerequisite | Status | Action Needed |
|-------------|--------|--------------|
| Container registry | [ready / not set up] | [action] |
| Cloud account | [ready / not set up] | [action] |
| DNS configuration | [ready / not set up] | [action] |
| SSL certificates | [ready / not set up] | [action] |
| CI/CD platform | [ready / not set up] | [action] |
| Secret manager | [ready / not set up] | [action] |

## Deployment Blockers

| Blocker | Severity | Resolution |
|---------|----------|-----------|
| [blocker description] | critical / high / medium | [resolution steps] |

## Required Environment Variables

| Variable | Purpose | Required In | Default (Dev) | Source (Staging/Prod) |
|----------|---------|------------|---------------|----------------------|
| `DATABASE_URL` | Postgres connection string | All components | `postgres://dev:dev@db:5432/app` | Secret manager |
| `DEPLOY_HOST` | Remote target machine | Deployment scripts | `localhost` | Environment |
| `REGISTRY_URL` | Container registry URL | CI/CD, deploy scripts | `localhost:5000` | Environment |
| `REGISTRY_USER` | Registry username | CI/CD, deploy scripts | — | Secret manager |
| `REGISTRY_PASS` | Registry password | CI/CD, deploy scripts | — | Secret manager |
| [add all required variables] | | | | |

## .env Files Created

- `.env.example` — committed to VCS, contains all variable names with placeholder values
- `.env` — git-ignored, contains development defaults

## Next Steps

1. [Resolve any blockers listed above]
2. [Set up missing infrastructure prerequisites]
3. [Proceed to containerization planning]
```
@@ -0,0 +1,103 @@
# Deployment Procedures Template

Save as `_docs/04_deploy/deployment_procedures.md`.

---

```markdown
# [System Name] — Deployment Procedures

## Deployment Strategy

**Pattern**: [blue-green / rolling / canary]
**Rationale**: [why this pattern fits the architecture]
**Zero-downtime**: required for production deployments

### Graceful Shutdown

- Grace period: 30 seconds for in-flight requests
- Sequence: stop accepting new requests → drain connections → shutdown
- Container orchestrator: `terminationGracePeriodSeconds: 40`

### Database Migration Ordering

- Migrations run **before** new code deploys
- All migrations must be backward-compatible (old code works with new schema)
- Irreversible migrations require explicit approval

## Health Checks

| Check | Type | Endpoint | Interval | Failure Threshold | Action |
|-------|------|----------|----------|-------------------|--------|
| Liveness | HTTP GET | `/health/live` | 10s | 3 failures | Restart container |
| Readiness | HTTP GET | `/health/ready` | 5s | 3 failures | Remove from load balancer |
| Startup | HTTP GET | `/health/ready` | 5s | 30 attempts | Kill and recreate |

### Health Check Responses

- `/health/live`: returns 200 if process is running (no dependency checks)
- `/health/ready`: returns 200 if all dependencies (DB, cache, queues) are reachable

## Staging Deployment

1. CI/CD builds and pushes Docker images tagged with git SHA
2. Run database migrations against staging
3. Deploy new images to staging environment
4. Wait for health checks to pass (readiness probe)
5. Run smoke tests against staging
6. If smoke tests fail: automatic rollback to previous image

## Production Deployment

1. **Approval**: manual approval required via [mechanism]
2. **Pre-deploy checks**:
   - [ ] Staging smoke tests passed
   - [ ] Security scan clean
   - [ ] Database migration reviewed
   - [ ] Monitoring alerts configured
   - [ ] Rollback plan confirmed
3. **Deploy**: apply deployment strategy (blue-green / rolling / canary)
4. **Verify**: health checks pass, error rate stable, latency within baseline
5. **Monitor**: observe dashboards for 15 minutes post-deploy
6. **Finalize**: mark deployment as successful or trigger rollback

## Rollback Procedures

### Trigger Criteria

- Health check failures persist after deploy
- Error rate exceeds 5% for more than 5 minutes
- Critical alert fires within 15 minutes of deploy
- Manual decision by on-call engineer

### Rollback Steps

1. Redeploy previous Docker image tag (from CI/CD artifact)
2. Verify health checks pass
3. If database migration was applied:
   - Run DOWN migration if reversible
   - If irreversible: assess data impact, escalate if needed
4. Notify stakeholders
5. Schedule post-mortem within 24 hours

### Post-Mortem

Required after every production rollback:
- Timeline of events
- Root cause
- What went wrong
- Prevention measures

## Deployment Checklist

- [ ] All tests pass in CI
- [ ] Security scan clean (zero critical/high CVEs)
- [ ] Docker images built and pushed
- [ ] Database migrations reviewed and tested
- [ ] Environment variables configured for target environment
- [ ] Health check endpoints verified
- [ ] Monitoring alerts configured
- [ ] Rollback plan documented and tested
- [ ] Stakeholders notified of deployment window
- [ ] On-call engineer available during deployment
```
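
The rollback trigger criteria listed in the template lend themselves to an automated check. The sketch below is one possible reading, not a prescribed implementation: `error_rate_breached` and `should_rollback` are hypothetical names, error-rate samples are assumed to arrive as `(seconds_since_deploy, error_rate)` pairs in chronological order, and deciding whether health check failures "persist" is left to the caller.

```python
from typing import List, Optional, Tuple

ERROR_RATE_THRESHOLD = 0.05   # "exceeds 5%"
ERROR_RATE_WINDOW_S = 300.0   # "for more than 5 minutes"
ALERT_WINDOW_S = 900.0        # "within 15 minutes of deploy"


def error_rate_breached(samples: List[Tuple[float, float]]) -> bool:
    """True if the rate stays above the threshold for a continuous stretch longer than the window."""
    breach_start: Optional[float] = None
    for seconds, rate in samples:
        if rate > ERROR_RATE_THRESHOLD:
            if breach_start is None:
                breach_start = seconds
            if seconds - breach_start > ERROR_RATE_WINDOW_S:
                return True
        else:
            breach_start = None   # the rate recovered, so the stretch starts over
    return False


def should_rollback(samples: List[Tuple[float, float]],
                    health_checks_failing: bool,
                    critical_alert_at_s: Optional[float] = None,
                    manual_decision: bool = False) -> bool:
    """Combine the four trigger criteria; any single one is sufficient."""
    if manual_decision or health_checks_failing:
        return True
    if critical_alert_at_s is not None and critical_alert_at_s <= ALERT_WINDOW_S:
        return True
    return error_rate_breached(samples)
```

Resetting the window whenever the rate drops back under the threshold is a design choice. A stricter policy could count cumulative time above the threshold instead.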
@@ -0,0 +1,61 @@
# Environment Strategy Template

Save as `_docs/04_deploy/environment_strategy.md`.

---

```markdown
# [System Name] — Environment Strategy

## Environments

| Environment | Purpose | Infrastructure | Data Source |
|-------------|---------|---------------|-------------|
| Development | Local developer workflow | docker-compose | Seed data, mocked externals |
| Staging | Pre-production validation | [mirrors production] | Anonymized production-like data |
| Production | Live system | [full infrastructure] | Real data |

## Environment Variables

### Required Variables

| Variable | Purpose | Dev Default | Staging/Prod Source |
|----------|---------|-------------|-------------------|
| `DATABASE_URL` | Postgres connection | `postgres://dev:dev@db:5432/app` | Secret manager |
| [add all required variables] | | | |

### `.env.example`

```env
# Copy to .env and fill in values
DATABASE_URL=postgres://user:pass@host:5432/dbname
# [all required variables with placeholder values]
```

### Variable Validation

All services validate required environment variables at startup and fail fast with a clear error message if any are missing.

## Secrets Management

| Environment | Method | Tool |
|-------------|--------|------|
| Development | `.env` file (git-ignored) | dotenv |
| Staging | Secret manager | [AWS Secrets Manager / Azure Key Vault / Vault] |
| Production | Secret manager | [AWS Secrets Manager / Azure Key Vault / Vault] |

Rotation policy: [frequency and procedure]

## Database Management

| Environment | Type | Migrations | Data |
|-------------|------|-----------|------|
| Development | Docker Postgres, named volume | Applied on container start | Seed data via init script |
| Staging | Managed Postgres | Applied via CI/CD pipeline | Anonymized production snapshot |
| Production | Managed Postgres | Applied via CI/CD with approval | Live data |

Migration rules:
- All migrations must be backward-compatible (support old and new code simultaneously)
- Reversible migrations required (DOWN/rollback script)
- Production migrations require review before apply
```
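
The Variable Validation rule in the template (validate at startup, fail fast with a clear message) could look like the sketch below. It is an illustration only: `MissingEnvError`, `REQUIRED_VARS`, and `validate_env` are hypothetical names, and `DATABASE_URL` is the only variable the template actually lists.

```python
import os
from typing import Iterable, Mapping, Optional

# Assumption: extend this tuple with the project's own required variables.
REQUIRED_VARS = ("DATABASE_URL",)


class MissingEnvError(RuntimeError):
    """Raised at startup when required environment variables are absent or empty."""


def validate_env(required: Iterable[str] = REQUIRED_VARS,
                 env: Optional[Mapping[str, str]] = None) -> None:
    """Raise MissingEnvError naming every missing variable; return None when all are set."""
    source = os.environ if env is None else env
    missing = [name for name in required if not source.get(name)]
    if missing:
        raise MissingEnvError(
            "Missing required environment variables: " + ", ".join(missing)
        )
```

Calling `validate_env()` as the first statement of a service's entry point lets the process stop before it opens any connections. Reporting all missing names at once avoids fixing them one restart at a time.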
@@ -0,0 +1,132 @@
# Observability Template

Save as `_docs/04_deploy/observability.md`.

---

```markdown
# [System Name] — Observability

## Logging

### Format

Structured JSON to stdout/stderr. No file-based logging in containers.

```json
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
```

### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostics (dev/staging only) | Request payload, Query params |

### Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console | Session |
| Staging | [log aggregator] | 7 days |
| Production | [log aggregator] | 30 days |

### PII Rules

- Never log passwords, tokens, or session IDs
- Mask email addresses and personal identifiers
- Log user IDs (opaque) instead of usernames

## Metrics

### Endpoints

Every service exposes Prometheus-compatible metrics at `/metrics`.

### Application Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `request_count` | Counter | Total HTTP requests by method, path, status |
| `request_duration_seconds` | Histogram | Response time by method, path |
| `error_count` | Counter | Failed requests by type |
| `active_connections` | Gauge | Current open connections |

### System Metrics

- CPU usage, Memory usage, Disk I/O, Network I/O

### Business Metrics

| Metric | Type | Description | Source |
|--------|------|-------------|--------|
| [from acceptance criteria] | | | |

Collection interval: 15 seconds

## Distributed Tracing

### Configuration

- SDK: OpenTelemetry
- Propagation: W3C Trace Context via HTTP headers
- Span naming: `<service>.<operation>`

### Sampling

| Environment | Rate | Rationale |
|-------------|------|-----------|
| Development | 100% | Full visibility |
| Staging | 100% | Full visibility |
| Production | 10% | Balance cost vs observability |

### Integration Points

- HTTP requests: automatic instrumentation
- Database queries: automatic instrumentation
- Message queues: manual span creation on publish/consume

## Alerting

| Severity | Response Time | Conditions |
|----------|---------------|-----------|
| Critical | 5 min | Service unreachable, health check failed for 1 min, data loss detected |
| High | 30 min | Error rate > 5% for 5 min, P95 latency > 2x baseline for 10 min |
| Medium | 4 hours | Disk usage > 80%, elevated latency, connection pool exhaustion |
| Low | Next business day | Non-critical warnings, deprecated API usage |

### Notification Channels

| Severity | Channel |
|----------|---------|
| Critical | [PagerDuty / phone] |
| High | [Slack + email] |
| Medium | [Slack] |
| Low | [Dashboard only] |

## Dashboards

### Operations Dashboard

- Service health status (up/down per component)
- Request rate and error rate
- Response time percentiles (P50, P95, P99)
- Resource utilization (CPU, memory per container)
- Active alerts

### Business Dashboard

- [Key business metrics from acceptance criteria]
- [User activity indicators]
- [Transaction volumes]
```
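
The structured log format in the template can be produced with Python's standard `logging` module. The following is a sketch under assumptions: `JsonFormatter` is a hypothetical class name, `correlation_id` and `context` are assumed to be passed through the `extra` argument, and the standard level name `WARNING` is mapped to the template's `WARN`.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields from the template."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        level = "WARN" if record.levelname == "WARNING" else record.levelname
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": level,
            "service": self.service,
            # Both fields are optional extras; they default to None and {}.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)   # stdout only, no file-based logging
handler.setFormatter(JsonFormatter("service-name"))
logger = logging.getLogger("observability-sketch")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("User registered", extra={"correlation_id": "example-id", "context": {"user_id": "u-123"}})
```

The example logs an opaque user ID rather than a username, in line with the PII rules. Masking of email addresses and other identifiers is not handled by this sketch.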