# Observability Template

Save as `_docs/04_deploy/observability.md`.

---

````markdown
# [System Name] — Observability

## Logging

### Format

Structured JSON to stdout/stderr. No file-based logging in containers.

```json
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
```

### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostics (dev/staging only) | Request payload, Query params |

### Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console | Session |
| Staging | [log aggregator] | 7 days |
| Production | [log aggregator] | 30 days |

### PII Rules

- Never log passwords, tokens, or session IDs
- Mask email addresses and personal identifiers
- Log user IDs (opaque) instead of usernames

## Metrics

### Endpoints

Every service exposes Prometheus-compatible metrics at `/metrics`.

### Application Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `request_count` | Counter | Total HTTP requests by method, path, status |
| `request_duration_seconds` | Histogram | Response time by method, path |
| `error_count` | Counter | Failed requests by type |
| `active_connections` | Gauge | Current open connections |

### System Metrics

- CPU usage, Memory usage, Disk I/O, Network I/O

### Business Metrics

| Metric | Type | Description | Source |
|--------|------|-------------|--------|
| [from acceptance criteria] | | | |

Collection interval: 15 seconds

## Distributed Tracing

### Configuration

- SDK: OpenTelemetry
- Propagation: W3C Trace Context via HTTP headers
- Span naming: `.`

### Sampling

| Environment | Rate | Rationale |
|-------------|------|-----------|
| Development | 100% | Full visibility |
| Staging | 100% | Full visibility |
| Production | 10% | Balance cost vs observability |

### Integration Points

- HTTP requests: automatic instrumentation
- Database queries: automatic instrumentation
- Message queues: manual span creation on publish/consume

## Alerting

| Severity | Response Time | Conditions |
|----------|---------------|------------|
| Critical | 5 min | Service unreachable, health check failed for 1 min, data loss detected |
| High | 30 min | Error rate > 5% for 5 min, P95 latency > 2x baseline for 10 min |
| Medium | 4 hours | Disk usage > 80%, elevated latency, connection pool exhaustion |
| Low | Next business day | Non-critical warnings, deprecated API usage |

### Notification Channels

| Severity | Channel |
|----------|---------|
| Critical | [PagerDuty / phone] |
| High | [Slack + email] |
| Medium | [Slack] |
| Low | [Dashboard only] |

## Dashboards

### Operations Dashboard

- Service health status (up/down per component)
- Request rate and error rate
- Response time percentiles (P50, P95, P99)
- Resource utilization (CPU, memory per container)
- Active alerts

### Business Dashboard

- [Key business metrics from acceptance criteria]
- [User activity indicators]
- [Transaction volumes]
````
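
The Logging section of the template (JSON format, stdout only, email masking) could be implemented roughly as follows. This is a minimal Python sketch using only the standard library, not a prescribed implementation. The names `JsonFormatter`, `build_logger`, and `mask_emails`, the masking rule (keep the first character of the local part), and the choice to pass `correlation_id` and `context` through `extra` are all assumptions made for illustration.

```python
import json
import logging
import re
import sys
from datetime import datetime, timezone

# Assumed masking rule: keep the first character of the local part, hide the rest.
EMAIL_RE = re.compile(
    r"([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)

# Python names this level WARNING; the template's level table uses WARN.
LEVEL_NAMES = {"WARNING": "WARN"}


def mask_emails(text: str) -> str:
    """Mask the local part of every email address found in text."""
    return EMAIL_RE.sub(r"\1***@\2", text)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the template's six fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": self.service,
            # Both fields are optional and arrive via logging's `extra` argument.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": mask_emails(record.getMessage()),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry, default=str)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream (stdout by default)."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("service-name")
    log.info(
        "User alice@example.com registered",
        extra={"correlation_id": "hypothetical-id", "context": {"user_id": "u-123"}},
    )
```

Masking here applies only to the `message` field. Values placed in `context` are written as given, so callers would still need to keep passwords, tokens, and session IDs out of it.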