mirror of
https://github.com/azaion/flights.git
synced 2026-04-22 22:46:31 +00:00
# Step 5: Observability

**Role**: Site Reliability Engineer (SRE)
**Goal**: Define logging, metrics, tracing, and alerting strategy.
**Constraints**: Strategy document: describe what to implement, not how to wire it.

## Steps

1. Read `architecture.md` and component specs for service boundaries
2. Research observability best practices for the tech stack

## Logging

- Structured JSON to stdout/stderr (no file logging in containers)
- Fields: `timestamp` (ISO 8601), `level`, `service`, `correlation_id`, `message`, `context`
- Levels: ERROR (exceptions), WARN (degraded), INFO (business events), DEBUG (diagnostics, dev only)
- No PII in logs
- Retention: dev = console only (not retained), staging = 7 days, production = 30 days

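As an illustration only, the field list above could map onto a formatter like the following. This is a minimal sketch using Python's standard `logging` module; the names `JsonFormatter` and `build_logger` and the service name `example-service` are hypothetical, and the target stack may call for a different library.

```python
import io
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields listed above."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # hypothetical: supplied by the caller

    def format(self, record: logging.LogRecord) -> str:
        # Python calls the level WARNING; the strategy names it WARN.
        level = {"WARNING": "WARN"}.get(record.levelname, record.levelname)
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": level,
            "service": self.service,
            # correlation_id and context arrive via `extra=`; defaults apply when absent
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes one JSON line per event to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # DEBUG would be enabled in dev only
    logger.propagate = False
    return logger


# Usage: an in-memory buffer stands in for stdout here.
buffer = io.StringIO()
log = build_logger("example-service", buffer)
log.info("order created", extra={"correlation_id": "req-123", "context": {"order_id": 42}})
entry = json.loads(buffer.getvalue())
```

Passing `correlation_id` and `context` through `extra=` keeps call sites short. PII filtering is not shown and would have to happen before values reach `context`.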
## Metrics

- Expose Prometheus-compatible `/metrics` endpoint per service
- System metrics: CPU, memory, disk, network
- Application metrics: `request_count`, `request_duration` (histogram), `error_count`, `active_connections`
- Business metrics: derived from acceptance criteria
- Collection interval: 15s

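To show what a scrape of `/metrics` might return for the application metrics above, here is a stdlib-only sketch that renders counters and a histogram in the Prometheus text exposition format. A real service would more likely use an official Prometheus client library. The class name `RequestMetrics` and the bucket bounds are assumptions for illustration.

```python
from bisect import bisect_left

BUCKETS = (0.1, 0.5, 1.0, 2.5)  # hypothetical upper bounds, in seconds


class RequestMetrics:
    """In-memory counters plus one histogram, rendered as Prometheus text."""

    def __init__(self) -> None:
        self.request_count = 0
        self.error_count = 0
        self.duration_sum = 0.0
        self.bucket_counts = [0] * (len(BUCKETS) + 1)  # last slot is +Inf

    def observe(self, duration_s: float, error: bool = False) -> None:
        self.request_count += 1
        self.error_count += int(error)
        self.duration_sum += duration_s
        # Index of the first bucket whose upper bound is >= the observation.
        self.bucket_counts[bisect_left(BUCKETS, duration_s)] += 1

    def render(self) -> str:
        lines = [
            "# TYPE request_count counter",
            f"request_count {self.request_count}",
            "# TYPE error_count counter",
            f"error_count {self.error_count}",
            "# TYPE request_duration histogram",
        ]
        cumulative = 0  # histogram buckets are cumulative in this format
        for bound, count in zip(BUCKETS, self.bucket_counts):
            cumulative += count
            lines.append(f'request_duration_bucket{{le="{bound}"}} {cumulative}')
        lines.append(f'request_duration_bucket{{le="+Inf"}} {self.request_count}')
        lines.append(f"request_duration_sum {self.duration_sum}")
        lines.append(f"request_duration_count {self.request_count}")
        return "\n".join(lines) + "\n"


# Usage: record three requests, then render the scrape body.
metrics = RequestMetrics()
metrics.observe(0.05)
metrics.observe(0.3)
metrics.observe(3.0, error=True)
scrape_body = metrics.render()
```

The metric names are kept exactly as listed above. Prometheus naming conventions usually add a `_total` suffix to counters and a unit suffix such as `_seconds` to durations, which may be worth deciding in the strategy.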
## Distributed Tracing

- OpenTelemetry SDK integration
- Trace context propagation via HTTP headers and message queue metadata
- Span naming: `<service>.<operation>`
- Sampling: 100% in dev/staging, 10% in production (adjust based on volume)

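The propagation, naming, and sampling rules above can be sketched without the SDK. The example below assumes the W3C Trace Context `traceparent` header as the HTTP carrier (the text above does not name a header format) and a ratio-based decision on the trace ID. All function names are hypothetical; in a real service the OpenTelemetry SDK's propagators and samplers would do this work.

```python
# Sampling ratios per environment, taken from the list above.
SAMPLE_RATIO = {"dev": 1.0, "staging": 1.0, "production": 0.10}


def span_name(service: str, operation: str) -> str:
    """Build a span name in the `<service>.<operation>` form."""
    return f"{service}.{operation}"


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer,
    so every service in a request path agrees without coordination."""
    bound = round(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


def make_traceparent(trace_id: int, span_id: int, sampled: bool) -> str:
    """Encode as version-traceid-parentid-flags (assumed W3C format)."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id:032x}-{span_id:016x}-{flags}"


def parse_traceparent(header: str) -> tuple[int, int, bool]:
    """Decode a traceparent value into (trace_id, span_id, sampled)."""
    version, trace_hex, span_hex, flags = header.split("-")
    if version != "00" or len(trace_hex) != 32 or len(span_hex) != 16:
        raise ValueError(f"unsupported traceparent: {header!r}")
    return int(trace_hex, 16), int(span_hex, 16), bool(int(flags, 16) & 0x01)


# Usage: the caller decides once, then passes the decision downstream.
trace_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736
span_id = 0x00F067AA0BA902B7
sampled = should_sample(trace_id, SAMPLE_RATIO["staging"])
header = make_traceparent(trace_id, span_id, sampled)
```

The same header value can be copied into message queue metadata so consumers continue the trace started by the producer.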
## Alerting

| Severity | Response Time | Condition Examples |
|----------|---------------|-------------------|
| Critical | 5 min | Service down, data loss, health check failed |
| High | 30 min | Error rate > 5%, P95 latency > 2x baseline |
| Medium | 4 hours | Disk > 80%, elevated latency |
| Low | Next business day | Non-critical warnings |

## Dashboards

- Operations: service health, request rate, error rate, response time percentiles, resource utilization
- Business: key business metrics from acceptance criteria

## Self-verification

- [ ] Structured logging format defined with required fields
- [ ] Metrics endpoint specified per service
- [ ] OpenTelemetry tracing configured
- [ ] Alert severities with response times defined
- [ ] Dashboards cover operations and business metrics
- [ ] PII exclusion from logs addressed

## Save action

Write `observability.md` using `templates/observability.md`.