loader/.cursor/skills/deploy/templates/observability.md

# Observability Template

Save as `_docs/04_deploy/observability.md`.

---

```markdown
# [System Name] — Observability

## Logging

### Format

Structured JSON to stdout/stderr. No file-based logging in containers.

```json
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
```

### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostics (dev/staging only) | Request payload, Query params |

### Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console | Session |
| Staging | [log aggregator] | 7 days |
| Production | [log aggregator] | 30 days |

### PII Rules

- Never log passwords, tokens, or session IDs
- Mask email addresses and personal identifiers
- Log user IDs (opaque) instead of usernames

## Metrics

### Endpoints

Every service exposes Prometheus-compatible metrics at `/metrics`.

### Application Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `request_count` | Counter | Total HTTP requests by method, path, status |
| `request_duration_seconds` | Histogram | Response time by method, path |
| `error_count` | Counter | Failed requests by type |
| `active_connections` | Gauge | Current open connections |

### System Metrics

- CPU usage, Memory usage, Disk I/O, Network I/O

### Business Metrics

| Metric | Type | Description | Source |
|--------|------|-------------|--------|
| [from acceptance criteria] | | | |

Collection interval: 15 seconds

## Distributed Tracing

### Configuration

- SDK: OpenTelemetry
- Propagation: W3C Trace Context via HTTP headers
- Span naming: `<service>.<operation>`

### Sampling

| Environment | Rate | Rationale |
|-------------|------|-----------|
| Development | 100% | Full visibility |
| Staging | 100% | Full visibility |
| Production | 10% | Balance cost vs observability |

### Integration Points

- HTTP requests: automatic instrumentation
- Database queries: automatic instrumentation
- Message queues: manual span creation on publish/consume

## Alerting

| Severity | Response Time | Conditions |
|----------|---------------|-----------|
| Critical | 5 min | Service unreachable, health check failed for 1 min, data loss detected |
| High | 30 min | Error rate > 5% for 5 min, P95 latency > 2x baseline for 10 min |
| Medium | 4 hours | Disk usage > 80%, elevated latency, connection pool exhaustion |
| Low | Next business day | Non-critical warnings, deprecated API usage |

### Notification Channels

| Severity | Channel |
|----------|---------|
| Critical | [PagerDuty / phone] |
| High | [Slack + email] |
| Medium | [Slack] |
| Low | [Dashboard only] |

## Dashboards

### Operations Dashboard

- Service health status (up/down per component)
- Request rate and error rate
- Response time percentiles (P50, P95, P99)
- Resource utilization (CPU, memory per container)
- Active alerts

### Business Dashboard

- [Key business metrics from acceptance criteria]
- [User activity indicators]
- [Transaction volumes]
```