Files
detections-semantic/.cursor/skills/deploy/templates/observability.md
T
Oleksandr Bezdieniezhnykh 8e2ecf50fd Initial commit
Made-with: Cursor
2026-03-26 00:20:30 +02:00

133 lines
3.3 KiB
Markdown

# Observability Template
Save as `_docs/04_deploy/observability.md`.
---
```markdown
# [System Name] — Observability
## Logging
### Format
Structured JSON to stdout/stderr. No file-based logging in containers.
```json
{
"timestamp": "ISO8601",
"level": "INFO",
"service": "service-name",
"correlation_id": "uuid",
"message": "Event description",
"context": {}
}
```
### Log Levels
| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostics (dev/staging only) | Request payload, Query params |
### Retention
| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console | Session |
| Staging | [log aggregator] | 7 days |
| Production | [log aggregator] | 30 days |
### PII Rules
- Never log passwords, tokens, or session IDs
- Mask email addresses and personal identifiers
- Log user IDs (opaque) instead of usernames
## Metrics
### Endpoints
Every service exposes Prometheus-compatible metrics at `/metrics`.
### Application Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `request_count` | Counter | Total HTTP requests by method, path, status |
| `request_duration_seconds` | Histogram | Response time by method, path |
| `error_count` | Counter | Failed requests by type |
| `active_connections` | Gauge | Current open connections |
### System Metrics
- CPU usage, Memory usage, Disk I/O, Network I/O
### Business Metrics
| Metric | Type | Description | Source |
|--------|------|-------------|--------|
| [from acceptance criteria] | | | |
Collection interval: 15 seconds
## Distributed Tracing
### Configuration
- SDK: OpenTelemetry
- Propagation: W3C Trace Context via HTTP headers
- Span naming: `<service>.<operation>`
### Sampling
| Environment | Rate | Rationale |
|-------------|------|-----------|
| Development | 100% | Full visibility |
| Staging | 100% | Full visibility |
| Production | 10% | Balance cost vs observability |
### Integration Points
- HTTP requests: automatic instrumentation
- Database queries: automatic instrumentation
- Message queues: manual span creation on publish/consume
## Alerting
| Severity | Response Time | Conditions |
|----------|---------------|-----------|
| Critical | 5 min | Service unreachable, health check failed for 1 min, data loss detected |
| High | 30 min | Error rate > 5% for 5 min, P95 latency > 2x baseline for 10 min |
| Medium | 4 hours | Disk usage > 80%, elevated latency, connection pool exhaustion |
| Low | Next business day | Non-critical warnings, deprecated API usage |
### Notification Channels
| Severity | Channel |
|----------|---------|
| Critical | [PagerDuty / phone] |
| High | [Slack + email] |
| Medium | [Slack] |
| Low | [Dashboard only] |
## Dashboards
### Operations Dashboard
- Service health status (up/down per component)
- Request rate and error rate
- Response time percentiles (P50, P95, P99)
- Resource utilization (CPU, memory per container)
- Active alerts
### Business Dashboard
- [Key business metrics from acceptance criteria]
- [User activity indicators]
- [Transaction volumes]
```