loader/.cursor/skills/deploy/steps/05_observability.md

# Step 5: Observability

**Role**: Site Reliability Engineer (SRE)
**Goal**: Define logging, metrics, tracing, and alerting strategy.
**Constraints**: Strategy document — describe what to implement, not how to wire it.

## Steps

1. Read `architecture.md` and component specs for service boundaries
2. Research observability best practices for the tech stack

## Logging

- Structured JSON to stdout/stderr (no file logging in containers)
- Fields: `timestamp` (ISO 8601), `level`, `service`, `correlation_id`, `message`, `context`
- Levels: ERROR (exceptions), WARN (degraded), INFO (business events), DEBUG (diagnostics, dev only)
- No PII in logs
- Retention: dev = console, staging = 7 days, production = 30 days

## Metrics

- Expose Prometheus-compatible `/metrics` endpoint per service
- System metrics: CPU, memory, disk, network
- Application metrics: `request_count`, `request_duration` (histogram), `error_count`, `active_connections`
- Business metrics: derived from acceptance criteria
- Collection interval: 15s

## Distributed Tracing

- OpenTelemetry SDK integration
- Trace context propagation via HTTP headers and message queue metadata
- Span naming: `<service>.<operation>`
- Sampling: 100% in dev/staging, 10% in production (adjust based on volume)

## Alerting

| Severity | Response Time | Condition Examples |
|----------|---------------|-------------------|
| Critical | 5 min | Service down, data loss, health check failed |
| High | 30 min | Error rate > 5%, P95 latency > 2x baseline |
| Medium | 4 hours | Disk > 80%, elevated latency |
| Low | Next business day | Non-critical warnings |

## Dashboards

- Operations: service health, request rate, error rate, response time percentiles, resource utilization
- Business: key business metrics from acceptance criteria

## Self-verification

- [ ] Structured logging format defined with required fields
- [ ] Metrics endpoint specified per service
- [ ] OpenTelemetry tracing configured
- [ ] Alert severities with response times defined
- [ ] Dashboards cover operations and business metrics
- [ ] PII exclusion from logs addressed

## Save action

Write `observability.md` using `templates/observability.md`.