Observability Template
Save as _docs/04_deploy/observability.md.
Log Levels
| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostics (dev/staging only) | Request payload, Query params |
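As a rough illustration, the levels above could map onto Python's standard `logging` module as sketched below. The logger name `orders`, the helper `build_logger`, and the message texts are placeholders, not part of this template. Note that Python's built-in name for the WARN level is `WARNING`.

```python
import io
import logging


def build_logger(name: str, level: int, stream) -> logging.Logger:
    """Return a logger that writes '<LEVEL> <message>' lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False      # keep output out of the root logger
    logger.handlers.clear()       # avoid duplicate handlers on re-configuration
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# Production-style setup (assumption): INFO and above, so DEBUG is dropped.
buffer = io.StringIO()
log = build_logger("orders", logging.INFO, buffer)

log.error("Database connection failed")   # ERROR: failure requiring attention
log.warning("Retry attempt 2/3")          # WARN: potential issue
log.info("Order placed")                  # INFO: significant business event
log.debug("Query params: page=2")         # DEBUG: filtered out at INFO level

lines = buffer.getvalue().splitlines()
```

In development or staging, passing `logging.DEBUG` instead would let the diagnostic line through.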
Retention
| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console | Session |
| Staging | [log aggregator] | 7 days |
| Production | [log aggregator] | 30 days |
PII Rules
- Never log passwords, tokens, or session IDs
- Mask email addresses and personal identifiers
- Log user IDs (opaque) instead of usernames
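One possible way to enforce these rules is to scrub structured log fields before they reach the logger. The sketch below is a minimal example under assumptions: the field names in `SENSITIVE_KEYS`, the helpers `mask_emails` and `scrub`, and the masking style (keep the first character and the domain) are illustrative choices, and the email pattern is deliberately simple rather than RFC-complete.

```python
import re

# Simplified email pattern (assumption): first local-part char, rest, then domain.
EMAIL_RE = re.compile(
    r"([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)

# Hypothetical field names that must never be logged.
SENSITIVE_KEYS = {"password", "token", "session_id"}


def mask_emails(text: str) -> str:
    """Replace each email's local part with its first character plus '***'."""
    return EMAIL_RE.sub(r"\1***@\2", text)


def scrub(fields: dict) -> dict:
    """Drop sensitive keys and mask emails in the remaining string values."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            continue  # never log passwords, tokens, or session IDs
        clean[key] = mask_emails(value) if isinstance(value, str) else value
    return clean


event = scrub({
    "user_id": "u_123",            # opaque ID instead of a username
    "email": "bob@example.org",
    "password": "hunter2",
})
```

Here `event` keeps `user_id`, carries a masked `email`, and has no `password` key.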
Metrics
Endpoints
Every service exposes Prometheus-compatible metrics at /metrics.
Application Metrics
| Metric | Type | Description |
|--------|------|-------------|
| request_count | Counter | Total HTTP requests by method, path, status |
| request_duration_seconds | Histogram | Response time by method, path |
| error_count | Counter | Failed requests by type |
| active_connections | Gauge | Current open connections |
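To show what a scrape of /metrics might return for a labeled counter such as request_count, here is a standard-library sketch that renders the Prometheus text exposition format. The `Counter` class is a hypothetical stand-in written for illustration; a real service would more likely use an official Prometheus client library, which also handles label escaping, histograms, and gauges.

```python
from collections import defaultdict


class Counter:
    """Minimal labeled counter (illustrative, not a real client library)."""

    def __init__(self, name: str, help_text: str, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels) -> None:
        # One time series per distinct combination of label values.
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += amount

    def render(self) -> str:
        """Render HELP/TYPE lines followed by one sample line per series."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key in sorted(self._values):
            label_str = ",".join(
                f'{n}="{v}"' for n, v in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{label_str}}} {self._values[key]:g}")
        return "\n".join(lines) + "\n"


request_count = Counter(
    "request_count", "Total HTTP requests", ["method", "path", "status"]
)
request_count.inc(method="GET", path="/orders", status="200")
request_count.inc(method="GET", path="/orders", status="200")
request_count.inc(method="POST", path="/orders", status="500")

output = request_count.render()
```

The path `/orders` and the status values are made-up sample data. Label values are not escaped in this sketch, so it is only safe for values without quotes, backslashes, or newlines.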
System Metrics
- CPU usage, Memory usage, Disk I/O, Network I/O
Business Metrics
| Metric | Type | Description | Source |
|--------|------|-------------|--------|
| [from acceptance criteria] | | | |
Collection interval: 15 seconds
Distributed Tracing
Configuration
- SDK: OpenTelemetry
- Propagation: W3C Trace Context via HTTP headers
- Span naming: <service>.<operation>
Sampling
| Environment | Rate | Rationale |
|-------------|------|-----------|
| Development | 100% | Full visibility |
| Staging | 100% | Full visibility |
| Production | 10% | Balance cost vs observability |
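The rates above could be applied with a deterministic, trace-ID-based decision, so that every service seeing the same trace reaches the same verdict. The sketch below is one way to do that and is not claimed to match the exact algorithm of any OpenTelemetry SDK; `SAMPLE_RATES` and `should_sample` are hypothetical names.

```python
# Sampling rates from the table above, as fractions.
SAMPLE_RATES = {"development": 1.0, "staging": 1.0, "production": 0.10}


def should_sample(trace_id_hex: str, rate: float) -> bool:
    """Decide from the trace ID alone whether a trace is kept.

    Interprets the low 64 bits of the 128-bit trace ID as an integer and
    keeps the trace when it falls below rate * 2**64. Assumes trace IDs are
    uniformly random, which makes roughly `rate` of all traces pass.
    """
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    low64 = int(trace_id_hex[-16:], 16)
    return low64 < int(rate * (1 << 64))


# Example trace ID (32 hex characters); the value is illustrative.
keep = should_sample("4bf92f3577b34da6a3ce929d0e0e4736", SAMPLE_RATES["staging"])
```

Because the decision depends only on the trace ID, downstream services do not need to coordinate beyond propagating the ID itself.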
Integration Points
- HTTP requests: automatic instrumentation
- Database queries: automatic instrumentation
- Message queues: manual span creation on publish/consume
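For message queues, where propagation is manual, the publisher can attach a W3C `traceparent` value to the message headers and the consumer can read it back before starting its own span. The sketch below uses only the standard library to show the header format (`version-traceid-parentid-flags`); in practice the OpenTelemetry SDK's propagators would do this. The helpers `inject`, `extract`, `span_name`, and `new_span_id`, and the dict-based message headers, are assumptions for illustration.

```python
import re
import secrets
from typing import Optional

# Version 00 layout: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_span_id() -> str:
    """Random 8-byte span ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def span_name(service: str, operation: str) -> str:
    """Apply the <service>.<operation> naming convention."""
    return f"{service}.{operation}"


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool) -> dict:
    """Publisher side: return a copy of the headers with traceparent added."""
    flags = "01" if sampled else "00"
    out = dict(headers)
    out["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return out


def extract(headers: dict) -> Optional[dict]:
    """Consumer side: parse traceparent, or return None if absent or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", "").strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Publish: attach the current span's context to the outgoing message.
message_headers = inject(
    {"content-type": "application/json"},
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
    sampled=True,
)

# Consume: recover the parent context and name the new child span.
parent = extract(message_headers)
child_name = span_name("orders", "consume")
```

This sketch only accepts the version 00 layout; the specification allows future versions to append fields, which a production parser would need to tolerate.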
Alerting
| Severity | Response Time | Conditions |
|----------|---------------|------------|
| Critical | 5 min | Service unreachable, health check failed for 1 min, data loss detected |
| High | 30 min | Error rate > 5% for 5 min, P95 latency > 2x baseline for 10 min |
| Medium | 4 hours | Disk usage > 80%, elevated latency, connection pool exhaustion |
| Low | Next business day | Non-critical warnings, deprecated API usage |
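The numeric conditions above can be expressed as a small classification function. This is a sketch under assumptions: `WindowStats` is a hypothetical rollup produced by the monitoring system, and "for N min" is modeled by taking the minimum of a value over that window (if the minimum exceeds the threshold, the condition held for the whole window). Only the thresholds stated in the table are covered; the Low tier and the non-numeric conditions are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WindowStats:
    """Hypothetical per-service rollup of recent measurements."""
    health_check_failed_min: float  # minutes the health check has been failing
    min_error_rate_5m: float        # lowest error rate (fraction) over last 5 min
    min_p95_ratio_10m: float        # lowest P95/baseline ratio over last 10 min
    disk_usage: float               # fraction of disk in use


# Response-time targets from the table above.
RESPONSE_TIME = {
    "critical": "5 min",
    "high": "30 min",
    "medium": "4 hours",
    "low": "Next business day",
}


def classify(stats: WindowStats) -> Optional[str]:
    """Return the highest matching severity, or None if nothing fires."""
    if stats.health_check_failed_min >= 1:
        return "critical"
    if stats.min_error_rate_5m > 0.05 or stats.min_p95_ratio_10m > 2.0:
        return "high"
    if stats.disk_usage > 0.80:
        return "medium"
    return None


severity = classify(WindowStats(0, 0.07, 1.2, 0.40))  # sustained 7% error rate
```

Checking severities from most to least urgent means a service that is both down and low on disk pages as Critical rather than Medium.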
Notification Channels
| Severity | Channel |
|----------|---------|
| Critical | [PagerDuty / phone] |
| High | [Slack + email] |
| Medium | [Slack] |
| Low | [Dashboard only] |
Dashboards
Operations Dashboard
- Service health status (up/down per component)
- Request rate and error rate
- Response time percentiles (P50, P95, P99)
- Resource utilization (CPU, memory per container)
- Active alerts
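For reference, the response time percentiles on this dashboard can be computed from raw samples with the nearest-rank method, sketched below. The function name `percentile` and the sample latencies are illustrative. Dashboards backed by Prometheus typically estimate percentiles from histogram buckets instead, which gives approximate rather than exact values.

```python
import math


def percentile(samples, p: float):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("percentile() requires at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

P95 and P99 expose tail latency that the median hides, which is why the High severity alert keys on P95 rather than on an average.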
Business Dashboard
- [Key business metrics from acceptance criteria]
- [User activity indicators]
- [Transaction volumes]