# Observability Template

Save as `_docs/04_deploy/observability.md`.


# [System Name] — Observability

## Logging

### Format

Structured JSON to stdout/stderr. No file-based logging in containers.

```json
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
```

### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostics (dev/staging only) | Request payload, Query params |
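
The format and levels above could be produced with a small formatter on top of the standard `logging` module. This is a minimal sketch, not a prescribed implementation: the names `JsonFormatter` and `build_logger` are illustrative assumptions, and the `correlation_id` and `context` values are assumed to arrive through the `extra` argument of each log call.

```python
import json
import logging
import sys
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (hypothetical helper)."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            # The stdlib names this level WARNING; the template uses WARN.
            "level": "WARN" if record.levelname == "WARNING" else record.levelname,
            "service": self.service,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(service: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes structured JSON to stdout only."""
    logger = logging.getLogger(service)
    logger.setLevel(level)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


logger = build_logger("service-name")
logger.warning(
    "Retry attempt 2/3",
    extra={"correlation_id": str(uuid.uuid4()), "context": {"attempt": 2}},
)
```

Passing `logging.DEBUG` as `level` in development and staging, and `logging.INFO` in production, would match the DEBUG row of the table.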

### Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console | Session |
| Staging | [log aggregator] | 7 days |
| Production | [log aggregator] | 30 days |

### PII Rules

- Never log passwords, tokens, or session IDs
- Mask email addresses and personal identifiers
- Log user IDs (opaque) instead of usernames
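
One way to apply these rules is to scrub the `context` object before it is logged. The sketch below is an assumption-laden illustration: the key names in `SENSITIVE_KEYS`, the masking style (first character plus domain), and the helper names `mask_email` and `scrub` are all hypothetical, and only top-level string values are inspected.

```python
import re

# Simplified email pattern; real addresses can be more varied than this.
EMAIL_RE = re.compile(
    r"([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)

# Hypothetical key names; adjust to the fields the system actually uses.
SENSITIVE_KEYS = {"password", "token", "session_id"}


def mask_email(text: str) -> str:
    """Keep the first character of the local part and the domain."""
    return EMAIL_RE.sub(r"\1***@\2", text)


def scrub(context: dict) -> dict:
    """Drop sensitive keys and mask emails in top-level string values."""
    clean = {}
    for key, value in context.items():
        if key.lower() in SENSITIVE_KEYS:
            continue  # never log passwords, tokens, or session IDs
        clean[key] = mask_email(value) if isinstance(value, str) else value
    return clean


print(scrub({"user_id": 42, "email": "alice@example.com", "password": "hunter2"}))
```

Whether the domain may stay visible is a policy decision; a stricter variant would replace the whole address.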

## Metrics

### Endpoints

Every service exposes Prometheus-compatible metrics at `/metrics`.

### Application Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `request_count` | Counter | Total HTTP requests by method, path, status |
| `request_duration_seconds` | Histogram | Response time by method, path |
| `error_count` | Counter | Failed requests by type |
| `active_connections` | Gauge | Current open connections |

### System Metrics

- CPU usage, Memory usage, Disk I/O, Network I/O

### Business Metrics

| Metric | Type | Description | Source |
|--------|------|-------------|--------|
| [from acceptance criteria] | | | |

Collection interval: 15 seconds

## Distributed Tracing

### Configuration

- SDK: OpenTelemetry
- Propagation: W3C Trace Context via HTTP headers
- Span naming: `<service>.<operation>`
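
For orientation, W3C Trace Context carries the trace in a `traceparent` HTTP header of the form `version-traceid-spanid-flags`. The sketch below builds and parses that header with the standard library only. It is not the OpenTelemetry SDK API (the SDK's propagators normally do this for you), it handles only the exact four-field layout, and the names `SpanContext`, `new_context`, `to_traceparent`, `parse_traceparent`, and `span_name` are illustrative assumptions.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def new_context(sampled: bool = True) -> SpanContext:
    """Start a new trace with random 16-byte trace and 8-byte span IDs."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def to_traceparent(ctx: SpanContext) -> str:
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the context, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def span_name(service: str, operation: str) -> str:
    """Apply the <service>.<operation> naming convention."""
    return f"{service}.{operation}"


ctx = new_context()
print(span_name("orders", "create"), to_traceparent(ctx))
```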

### Sampling

| Environment | Rate | Rationale |
|-------------|------|-----------|
| Development | 100% | Full visibility |
| Staging | 100% | Full visibility |
| Production | 10% | Balance cost vs observability |
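
A rate such as 10% is usually applied deterministically per trace, so every service reaches the same keep-or-drop decision for a given trace ID. The sketch below illustrates the idea by comparing the low 64 bits of the trace ID against a bound. It is similar in spirit to ratio-based samplers in tracing SDKs but is an assumption for illustration, not a description of OpenTelemetry's exact algorithm; `SAMPLING_RATES` and `should_sample` are hypothetical names.

```python
# Rates taken from the table above, keyed by a hypothetical environment name.
SAMPLING_RATES = {"development": 1.0, "staging": 1.0, "production": 0.10}


def should_sample(trace_id: str, rate: float) -> bool:
    """Decide from the low 64 bits of a 32-hex-character trace ID."""
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    low_bits = int(trace_id[-16:], 16)
    # Keep the trace when its low bits fall in the first `rate` share of the range.
    return low_bits < int(rate * (1 << 64))


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
for env, rate in SAMPLING_RATES.items():
    print(env, should_sample(trace_id, rate))
```

Because the decision depends only on the trace ID and the rate, repeated calls for the same trace always agree.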

### Integration Points

- HTTP requests: automatic instrumentation
- Database queries: automatic instrumentation
- Message queues: manual span creation on publish/consume

## Alerting

| Severity | Response Time | Conditions |
|----------|---------------|------------|
| Critical | 5 min | Service unreachable, health check failed for 1 min, data loss detected |
| High | 30 min | Error rate > 5% for 5 min, P95 latency > 2x baseline for 10 min |
| Medium | 4 hours | Disk usage > 80%, elevated latency, connection pool exhaustion |
| Low | Next business day | Non-critical warnings, deprecated API usage |
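
Conditions of the form "value > threshold for N min" fire only when the threshold is exceeded continuously over the whole window. The sketch below evaluates such a rule against timestamped samples. The `Rule` type and `is_firing` function are hypothetical; in practice the alerting system's own rule language would express this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    severity: str
    threshold: float   # fire when the sampled value exceeds this
    for_seconds: int   # and keeps exceeding it for this long


def is_firing(rule: Rule, samples: list, now: float) -> bool:
    """samples: (unix_time, value) pairs, oldest first."""
    window_start = now - rule.for_seconds
    window = [value for ts, value in samples if window_start <= ts <= now]
    # Require data reaching back to the start of the window, so a service
    # that only just started reporting does not fire early.
    has_full_window = any(ts <= window_start for ts, _ in samples)
    return has_full_window and bool(window) and all(v > rule.threshold for v in window)


# "Error rate > 5% for 5 min", sampled every 15 seconds.
high_error_rate = Rule(severity="High", threshold=0.05, for_seconds=300)
samples = [(float(t), 0.08) for t in range(0, 301, 15)]
print(is_firing(high_error_rate, samples, now=300.0))
```

A single sample at or below the threshold inside the window resets the condition.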

### Notification Channels

| Severity | Channel |
|----------|---------|
| Critical | [PagerDuty / phone] |
| High | [Slack + email] |
| Medium | [Slack] |
| Low | [Dashboard only] |

## Dashboards

### Operations Dashboard

- Service health status (up/down per component)
- Request rate and error rate
- Response time percentiles (P50, P95, P99)
- Resource utilization (CPU, memory per container)
- Active alerts
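
As a reminder of what the percentile panels mean, P95 is the response time that 95% of requests stay at or below. The sketch below computes nearest-rank percentiles from raw samples. It is illustrative only, and `percentile` is a hypothetical helper: dashboards typically estimate percentiles from histogram buckets (for example with PromQL's `histogram_quantile`) rather than from raw samples.

```python
def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


durations = [0.12, 0.08, 0.35, 0.09, 1.40, 0.11, 0.10, 0.22, 0.13, 0.09]
for pct in (50, 95, 99):
    print(f"P{pct}: {percentile(durations, pct)} s")
```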

### Business Dashboard

- [Key business metrics from acceptance criteria]
- [User activity indicators]
- [Transaction volumes]