# Observability Planning

## Initial data:
- Problem description: `@_docs/00_problem/problem_description.md`
- Full Solution Description: `@_docs/01_solution/solution.md`
- Components: `@_docs/02_components`
- Deployment Strategy: `@_docs/02_components/deployment_strategy.md`

## Role
You are a Site Reliability Engineer (SRE)

## Task
- Define logging strategy across all components
- Plan metrics collection and dashboards
- Design distributed tracing (if applicable)
- Establish alerting rules
- Document incident response procedures

## Output

### Logging Strategy

#### Log Levels
| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostic information | Request payload, Query params |

#### Log Format
```json
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
```

#### Log Storage
- Development: Console/file
- Staging: Centralized (ELK, CloudWatch, etc.)
- Production: Centralized with retention policy

### Metrics

#### System Metrics
- CPU usage
- Memory usage
- Disk I/O
- Network I/O

#### Application Metrics
| Metric | Type | Description |
|--------|------|-------------|
| request_count | Counter | Total requests |
| request_duration | Histogram | Response time |
| error_count | Counter | Failed requests |
| active_connections | Gauge | Current connections |

#### Business Metrics
- [Define based on acceptance criteria]

### Distributed Tracing

#### Trace Context
- Correlation ID propagation
- Span naming conventions
- Sampling strategy

#### Integration Points
- HTTP headers
- Message queue metadata
- Database query tagging

### Alerting

#### Alert Categories
| Severity | Response Time | Examples |
|----------|---------------|----------|
| Critical | 5 min | Service down, Data loss |
| High | 30 min | High error rate, Performance degradation |
| Medium | 4 hours | Elevated latency, Disk usage high |
| Low | Next business day | Non-critical warnings |

#### Alert Rules
```yaml
alerts:
  - name: high_error_rate
    condition: error_rate > 5%
    duration: 5m
    severity: high
  - name: service_down
    condition: health_check_failed
    duration: 1m
    severity: critical
```

### Dashboards

#### Operations Dashboard
- Service health status
- Request rate and error rate
- Response time percentiles
- Resource utilization

#### Business Dashboard
- Key business metrics
- User activity
- Transaction volumes

Store output to `_docs/02_components/observability_plan.md`

## Notes
- Follow the principle: "If it's not monitored, it's not in production"
- Balance verbosity with cost
- Ensure PII is not logged
- Plan for log rotation and retention
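
As an illustration of the Log Format and the PII note, the following is a minimal Python sketch using only the standard library. It is one possible shape, not a prescribed implementation: the names `JsonFormatter`, `build_logger`, and `SENSITIVE_KEYS`, the service name `order-service`, and the list of redacted keys are all hypothetical. It assumes `correlation_id` and `context` are passed through the `extra` argument of the standard `logging` calls.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

# Hypothetical list of context keys treated as PII; adjust per project.
SENSITIVE_KEYS = {"email", "password", "token"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the Log Format fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        # Python names this level WARNING; the Log Levels table uses WARN.
        level = "WARN" if record.levelname == "WARNING" else record.levelname
        # correlation_id and context are optional attributes set via `extra`.
        context = getattr(record, "context", {})
        redacted = {
            key: "[REDACTED]" if key in SENSITIVE_KEYS else value
            for key, value in context.items()
        }
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": level,
            "service": self.service,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": redacted,
        }
        # default=str keeps non-JSON values (dates, UUIDs) from raising.
        return json.dumps(entry, default=str)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to the console."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("order-service")
    log.info(
        "Order placed",
        extra={
            "correlation_id": str(uuid.uuid4()),
            "context": {"order_id": 42, "email": "user@example.com"},
        },
    )
```

Redacting by key name is only a first line of defence; it does not catch PII embedded in the message text, so the plan should still state which fields may be logged at all.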