# Step 5: Observability
**Role**: Site Reliability Engineer (SRE)
**Goal**: Define logging, metrics, tracing, and alerting strategy.
**Constraints**: Strategy document — describe what to implement, not how to wire it.
## Steps
1. Read `architecture.md` and component specs for service boundaries
2. Research observability best practices for the tech stack
## Logging
- Structured JSON to stdout/stderr (no file logging in containers)
- Fields: `timestamp` (ISO 8601), `level`, `service`, `correlation_id`, `message`, `context`
- Levels: ERROR (exceptions), WARN (degraded operation), INFO (business events), DEBUG (diagnostics, dev only)
- No PII in logs
- Retention: dev = console only (not retained), staging = 7 days, production = 30 days
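
To make the log format concrete, here is a minimal sketch of a formatter that emits the fields listed above, using only the Python standard library. The names `JsonFormatter` and `build_logger`, the `orders` service, and the sample values are illustrative assumptions, not part of the strategy; the tech stack of the target project may call for a different library entirely.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with the required fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # ISO 8601 timestamp in UTC
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            # Python names this level WARNING; the strategy uses WARN
            "level": "WARN" if record.levelname == "WARNING" else record.levelname,
            "service": self.service,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(service: str, stream) -> logging.Logger:
    """Attach a single JSON handler writing to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)  # DEBUG would be enabled in dev only
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stands in for sys.stdout so the output can be inspected
log = build_logger("orders", buffer)
log.info(
    "order created",
    extra={"correlation_id": "req-123", "context": {"order_id": 42}},
)
line = json.loads(buffer.getvalue())
```

In a container the stream would be `sys.stdout`, and the `context` payload is where the "No PII" rule has to be enforced by the caller.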
## Metrics
- Expose Prometheus-compatible `/metrics` endpoint per service
- System metrics: CPU, memory, disk, network
- Application metrics: `request_count`, `request_duration` (histogram), `error_count`, `active_connections`
- Business metrics: derived from acceptance criteria
- Collection interval: 15s
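
As an illustration of what a scrape of `/metrics` might return, the sketch below hand-renders the application metrics above in the Prometheus text exposition format. It is a simplified stand-in written for this example: a real service would normally rely on a Prometheus client library rather than building the text itself. The `ServiceMetrics` class and the bucket boundaries are assumptions.

```python
# Histogram bucket upper bounds in seconds (assumed values, tune per service).
BUCKETS = (0.1, 0.5, 1.0, 2.5)


class ServiceMetrics:
    """In-memory counters for the application metrics named in the strategy."""

    def __init__(self):
        self.request_count = 0
        self.error_count = 0
        self.active_connections = 0
        self.duration_sum = 0.0
        self.bucket_hits = [0] * len(BUCKETS)  # hits per bucket, not cumulative

    def observe_request(self, duration_s: float, error: bool = False) -> None:
        self.request_count += 1
        self.error_count += int(error)
        self.duration_sum += duration_s
        # Record the observation in the smallest bucket that contains it.
        for i, bound in enumerate(BUCKETS):
            if duration_s <= bound:
                self.bucket_hits[i] += 1
                break

    def render(self) -> str:
        """Return the body a scraper would receive from /metrics."""
        lines = [
            "# TYPE request_count counter",
            f"request_count {self.request_count}",
            "# TYPE error_count counter",
            f"error_count {self.error_count}",
            "# TYPE active_connections gauge",
            f"active_connections {self.active_connections}",
            "# TYPE request_duration histogram",
        ]
        cumulative = 0
        for bound, hits in zip(BUCKETS, self.bucket_hits):
            cumulative += hits  # Prometheus histogram buckets are cumulative
            lines.append(f'request_duration_bucket{{le="{bound}"}} {cumulative}')
        lines.append(f'request_duration_bucket{{le="+Inf"}} {self.request_count}')
        lines.append(f"request_duration_sum {self.duration_sum}")
        lines.append(f"request_duration_count {self.request_count}")
        return "\n".join(lines) + "\n"


metrics = ServiceMetrics()
metrics.observe_request(0.05)
metrics.observe_request(0.7, error=True)
metrics.observe_request(3.0)
text = metrics.render()
```

The metric names are kept exactly as the strategy lists them. Prometheus naming conventions usually add a unit suffix to durations, so the final names may differ once the stack is chosen.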
## Distributed Tracing
- OpenTelemetry SDK integration
- Trace context propagation via HTTP headers and message queue metadata
- Span naming: `<service>.<operation>`
- Sampling: 100% in dev/staging, 10% in production (adjust based on volume)
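
The propagation, naming, and sampling rules above can be sketched without any SDK. The example below builds a W3C `traceparent` header value by hand and makes a deterministic sampling decision from the trace ID. This only illustrates the concepts: the OpenTelemetry SDK provides its own propagators and samplers, and the helper names and the `SAMPLE_RATIO` table here are hypothetical.

```python
import secrets

# Sampling ratios from the strategy; production is a starting point to tune.
SAMPLE_RATIO = {"dev": 1.0, "staging": 1.0, "production": 0.10}


def span_name(service: str, operation: str) -> str:
    """Apply the <service>.<operation> naming rule."""
    return f"{service}.{operation}"


def should_sample(trace_id: str, environment: str) -> bool:
    """Decide from the trace ID alone, so every service agrees for one trace."""
    threshold = int(SAMPLE_RATIO[environment] * (1 << 64))
    # Compare the low 64 bits of the trace ID against the ratio threshold.
    return int(trace_id[-16:], 16) < threshold


def new_traceparent(environment: str) -> str:
    """Start a trace: version-traceid-spanid-flags, all lowercase hex."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if should_sample(trace_id, environment) else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent: str) -> str:
    """Continue a trace downstream: keep trace ID and flags, mint a new span ID."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent("dev")
outgoing = child_traceparent(incoming)
name = span_name("orders", "create")
```

The same `traceparent` value would be sent as an HTTP header on outgoing requests and attached as metadata on queued messages, so consumers can continue the trace.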
## Alerting
| Severity | Response Time | Condition Examples |
|----------|---------------|-------------------|
| Critical | 5 min | Service down, data loss, health check failed |
| High | 30 min | Error rate > 5%, P95 latency > 2x baseline |
| Medium | 4 hours | Disk usage > 80%, latency elevated but below the High threshold |
| Low | Next business day | Non-critical warnings |
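
To show how the example conditions in the table map to severities, here is a small sketch that classifies a point-in-time health snapshot. The `Snapshot` fields and the `severity` function are assumptions made for illustration; in practice these rules would live in the alerting system's own rule language.

```python
from dataclasses import dataclass
from typing import Optional

# Response times copied from the table above.
RESPONSE_TIME = {
    "Critical": "5 min",
    "High": "30 min",
    "Medium": "4 hours",
    "Low": "Next business day",
}


@dataclass
class Snapshot:
    healthy: bool  # result of the service health check
    error_rate: float  # fraction of failed requests, 0.0 to 1.0
    p95_latency_s: float  # current P95 latency in seconds
    baseline_p95_s: float  # normal P95 latency in seconds
    disk_used: float  # fraction of disk in use, 0.0 to 1.0


def severity(s: Snapshot) -> Optional[str]:
    """Return the highest matching severity, or None if nothing fires."""
    if not s.healthy:
        return "Critical"
    if s.error_rate > 0.05 or s.p95_latency_s > 2 * s.baseline_p95_s:
        return "High"
    if s.disk_used > 0.80:
        return "Medium"
    return None


slow = Snapshot(healthy=True, error_rate=0.01, p95_latency_s=0.9,
                baseline_p95_s=0.4, disk_used=0.50)
level = severity(slow)
respond_within = RESPONSE_TIME[level]
```

Checking from most to least severe means a snapshot that matches several conditions is reported once, at its highest severity. Low-severity warnings are not modeled here because the table gives no measurable condition for them.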
## Dashboards
- Operations: service health, request rate, error rate, response time percentiles, resource utilization
- Business: key business metrics from acceptance criteria
## Self-verification
- [ ] Structured logging format defined with required fields
- [ ] Metrics endpoint specified per service
- [ ] OpenTelemetry tracing strategy defined (context propagation, span naming, sampling)
- [ ] Alert severities with response times defined
- [ ] Dashboards cover operations and business metrics
- [ ] PII exclusion from logs addressed
## Save action
Write `observability.md` using `templates/observability.md`.