2026-04-18 22:03:43 +03:00

Step 5: Observability

Role: Site Reliability Engineer (SRE)
Goal: Define logging, metrics, tracing, and alerting strategy.
Constraints: Strategy document. Describe what to implement, not how to wire it.

Steps

  1. Read architecture.md and component specs for service boundaries
  2. Research observability best practices for the tech stack

Logging

  • Structured JSON to stdout/stderr (no file logging in containers)
  • Fields: timestamp (ISO 8601), level, service, correlation_id, message, context
  • Levels: ERROR (exceptions), WARN (degraded), INFO (business events), DEBUG (diagnostics, dev only)
  • No PII in logs
  • Retention: dev = console, staging = 7 days, production = 30 days
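
For illustration only (the strategy document itself should stay free of wiring details), the log format above could look roughly like this with Python's standard `logging` module. The names `JsonFormatter` and `build_logger` are hypothetical, and passing `correlation_id` and `context` through `extra` is an assumption, not a requirement of this step.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the required fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        # Python calls the level WARNING; the spec above calls it WARN.
        level = "WARN" if record.levelname == "WARNING" else record.levelname
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": level,
            "service": self.service,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry, default=str)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes structured JSON to the given stream (stdout by default)."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # DEBUG would be enabled in dev only
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("orders")
    log.info("order created", extra={"correlation_id": "abc-123", "context": {"order_id": 42}})
```

Keeping `context` as a nested object makes it easier to review for PII than free-form message strings.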

Metrics

  • Expose Prometheus-compatible /metrics endpoint per service
  • System metrics: CPU, memory, disk, network
  • Application metrics: request_count, request_duration (histogram), error_count, active_connections
  • Business metrics: derived from acceptance criteria
  • Collection interval: 15s
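
As a rough sketch of what a `/metrics` response body contains, the following renders the application metrics listed above in the Prometheus text exposition format using only the standard library. The class name `ServiceMetrics` and the bucket boundaries are assumptions; a real service would more likely use a Prometheus client library. The metric names are copied from the list above, although Prometheus naming conventions usually suffix counters with `_total`.

```python
from bisect import bisect_left

BUCKETS = (0.1, 0.5, 1.0, 2.5)  # assumed histogram upper bounds, in seconds


class ServiceMetrics:
    """In-memory counters, one gauge, and one histogram for a single service."""

    def __init__(self) -> None:
        self.request_count = 0
        self.error_count = 0
        self.active_connections = 0
        self.duration_sum = 0.0
        # One slot per finite bucket plus a final slot for observations above the last bound.
        self.bucket_counts = [0] * (len(BUCKETS) + 1)

    def observe_request(self, seconds: float, failed: bool = False) -> None:
        self.request_count += 1
        self.error_count += int(failed)
        self.duration_sum += seconds
        # bisect_left picks the first bucket whose upper bound is >= seconds.
        self.bucket_counts[bisect_left(BUCKETS, seconds)] += 1

    def render(self) -> str:
        lines = [
            "# TYPE request_count counter",
            f"request_count {self.request_count}",
            "# TYPE error_count counter",
            f"error_count {self.error_count}",
            "# TYPE active_connections gauge",
            f"active_connections {self.active_connections}",
            "# TYPE request_duration histogram",
        ]
        cumulative = 0  # histogram buckets are cumulative in the exposition format
        for bound, count in zip(BUCKETS, self.bucket_counts):
            cumulative += count
            lines.append(f'request_duration_bucket{{le="{bound}"}} {cumulative}')
        lines.append(f'request_duration_bucket{{le="+Inf"}} {self.request_count}')
        lines.append(f"request_duration_sum {self.duration_sum}")
        lines.append(f"request_duration_count {self.request_count}")
        return "\n".join(lines) + "\n"
```

Exposing `request_duration` as a histogram rather than a pre-computed average is what allows the P95 latency alert to be evaluated on the server side.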

Distributed Tracing

  • OpenTelemetry SDK integration
  • Trace context propagation via HTTP headers and message queue metadata
  • Span naming: <service>.<operation>
  • Sampling: 100% in dev/staging, 10% in production (adjust based on volume)
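
The naming, propagation, and sampling rules above can be sketched without the OpenTelemetry SDK, purely to show the intended behaviour. This sketch assumes the W3C Trace Context `traceparent` header (`version-traceid-spanid-flags`) for HTTP propagation and a deterministic trace-ID ratio decision for sampling; the function names and the `SAMPLE_RATIOS` table are hypothetical. In practice the SDK's own propagators and samplers would do this work.

```python
import secrets

# Sampling ratios per environment, taken from the list above.
SAMPLE_RATIOS = {"dev": 1.0, "staging": 1.0, "production": 0.10}


def span_name(service: str, operation: str) -> str:
    """Build a span name following the <service>.<operation> convention."""
    return f"{service}.{operation}"


def should_sample(trace_id: str, environment: str) -> bool:
    """Deterministic decision: the same trace ID gets the same answer in every service."""
    ratio = SAMPLE_RATIOS[environment]
    # Compare the low 64 bits of the trace ID with the ratio's share of the 64-bit range.
    return int(trace_id[-16:], 16) < int(ratio * (1 << 64))


def new_traceparent(environment: str) -> str:
    """Start a trace at the edge and encode the sampling decision in the flags field."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if should_sample(trace_id, environment) else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Propagate to a downstream call: keep trace ID and flags, mint a new span ID."""
    version, trace_id, _parent_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

The same header value can be carried in message queue metadata, so that a consumer continues the producer's trace instead of starting a new one.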

Alerting

| Severity | Response Time | Condition Examples |
|----------|---------------|--------------------|
| Critical | 5 min | Service down, data loss, health check failed |
| High | 30 min | Error rate > 5%, P95 latency > 2x baseline |
| Medium | 4 hours | Disk > 80%, elevated latency |
| Low | Next business day | Non-critical warnings |
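
To make the quantified thresholds in the table concrete, here is a minimal sketch of a severity classifier. The `Snapshot` type, its field names, and the `classify` function are hypothetical. Conditions without a stated threshold ("elevated latency", "non-critical warnings", "data loss") are deliberately left out rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

# Response times per severity, copied from the table above.
RESPONSE_TIME = {
    "critical": "5 min",
    "high": "30 min",
    "medium": "4 hours",
    "low": "next business day",
}


@dataclass
class Snapshot:
    """Point-in-time readings for one service (hypothetical shape)."""
    healthy: bool            # result of the health check
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float
    baseline_p95_ms: float
    disk_used: float         # fraction of disk in use, 0.0 to 1.0


def classify(s: Snapshot) -> Optional[str]:
    """Return the highest matching severity, or None when no threshold is crossed."""
    if not s.healthy:
        return "critical"
    if s.error_rate > 0.05 or s.p95_latency_ms > 2 * s.baseline_p95_ms:
        return "high"
    if s.disk_used > 0.80:
        return "medium"
    return None
```

For example, a healthy service with a 6% error rate classifies as `"high"`, which maps to a 30 min response time via `RESPONSE_TIME`.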

Dashboards

  • Operations: service health, request rate, error rate, response time percentiles, resource utilization
  • Business: key business metrics from acceptance criteria

Self-verification

  • Structured logging format defined with required fields
  • Metrics endpoint specified per service
  • OpenTelemetry tracing configured
  • Alert severities with response times defined
  • Dashboards cover operations and business metrics
  • PII exclusion from logs addressed

Save action

Write observability.md using templates/observability.md.