# Step 5: Observability

**Role**: Site Reliability Engineer (SRE)

**Goal**: Define logging, metrics, tracing, and alerting strategy.

**Constraints**: Strategy document: describe what to implement, not how to wire it.

## Steps

1. Read `architecture.md` and component specs for service boundaries
2. Research observability best practices for the tech stack

## Logging

- Structured JSON to stdout/stderr (no file logging in containers)
- Fields: `timestamp` (ISO 8601), `level`, `service`, `correlation_id`, `message`, `context`
- Levels: ERROR (exceptions), WARN (degraded), INFO (business events), DEBUG (diagnostics, dev only)
- No PII in logs
- Retention: dev = console, staging = 7 days, production = 30 days

## Metrics

- Expose Prometheus-compatible `/metrics` endpoint per service
- System metrics: CPU, memory, disk, network
- Application metrics: `request_count`, `request_duration` (histogram), `error_count`, `active_connections`
- Business metrics: derived from acceptance criteria
- Collection interval: 15s

## Distributed Tracing

- OpenTelemetry SDK integration
- Trace context propagation via HTTP headers and message queue metadata
- Span naming: `.`
- Sampling: 100% in dev/staging, 10% in production (adjust based on volume)

## Alerting

| Severity | Response Time | Condition Examples |
|----------|---------------|--------------------|
| Critical | 5 min | Service down, data loss, health check failed |
| High | 30 min | Error rate > 5%, P95 latency > 2x baseline |
| Medium | 4 hours | Disk > 80%, elevated latency |
| Low | Next business day | Non-critical warnings |

## Dashboards

- Operations: service health, request rate, error rate, response time percentiles, resource utilization
- Business: key business metrics from acceptance criteria

## Self-verification

- [ ] Structured logging format defined with required fields
- [ ] Metrics endpoint specified per service
- [ ] OpenTelemetry tracing configured
- [ ] Alert severities with response times defined
- [ ] Dashboards cover operations and business metrics
- [ ] PII exclusion from logs addressed

## Save action

Write `observability.md` using `templates/observability.md`.
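
## Example: structured log record

For illustration only, the field list in the Logging section could map onto a log line as sketched below. This assumes a Python service using only the standard library; the `JsonFormatter` class, the `build_logger` helper, and the `orders` service name are hypothetical and not part of this strategy. It shows one possible shape of the record, not a prescribed wiring.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per record, using the Logging fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        # Python names the level WARNING; the strategy uses WARN.
        level = "WARN" if record.levelname == "WARNING" else record.levelname
        payload = {
            # ISO 8601 timestamp in UTC
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": level,
            "service": self.service,
            # correlation_id and context arrive through logging's `extra` argument
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Hypothetical helper: send JSON records to stdout, never to a file."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)  # DEBUG would be enabled in dev only
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("orders")
    log.info(
        "order placed",
        extra={"correlation_id": "abc-123", "context": {"order_id": 17}},
    )
```

Keeping `context` as a nested object leaves the top-level field set fixed, which makes it easier to audit the record for PII before it leaves the service.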
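
## Example: alert severity mapping

As a rough sketch, the numeric conditions in the Alerting table could be expressed as a single classification function. The `ServiceSnapshot` type, its field names, and the `classify` function are assumptions made for this example; real alert rules would normally live in the alerting system rather than in application code. Only the thresholds stated in the table are encoded, so "elevated latency" and "non-critical warnings" are left out.

```python
from dataclasses import dataclass
from typing import Optional

# Response times copied from the Alerting table.
RESPONSE_TIME = {
    "critical": "5 min",
    "high": "30 min",
    "medium": "4 hours",
    "low": "next business day",
}


@dataclass
class ServiceSnapshot:
    """Hypothetical point-in-time view of one service."""
    healthy: bool            # result of the health check
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float    # current P95 latency
    baseline_p95_ms: float   # agreed baseline P95 latency
    disk_used: float         # fraction of disk in use, 0.0 to 1.0


def classify(s: ServiceSnapshot) -> Optional[str]:
    """Return the highest matching severity, or None when nothing fires."""
    if not s.healthy:
        return "critical"
    if s.error_rate > 0.05 or s.p95_latency_ms > 2 * s.baseline_p95_ms:
        return "high"
    if s.disk_used > 0.80:
        return "medium"
    return None


if __name__ == "__main__":
    snapshot = ServiceSnapshot(True, 0.07, 120.0, 100.0, 0.40)
    severity = classify(snapshot)
    print(severity, RESPONSE_TIME.get(severity))
```

Checking severities from most to least urgent means a service that breaches several conditions at once is reported at the highest applicable level.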