# Observability Template

Save as `_docs/04_deploy/observability.md`.

---

```markdown
# [System Name] — Observability

## Logging

### Format

Structured JSON to stdout/stderr. No file-based logging in containers.

```json
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
```

### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostics (dev/staging only) | Request payload, Query params |

### Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console | Session |
| Staging | [log aggregator] | 7 days |
| Production | [log aggregator] | 30 days |

### PII Rules

- Never log passwords, tokens, or session IDs
- Mask email addresses and personal identifiers
- Log user IDs (opaque) instead of usernames

## Metrics

### Endpoints

Every service exposes Prometheus-compatible metrics at `/metrics`.

### Application Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `request_count` | Counter | Total HTTP requests by method, path, status |
| `request_duration_seconds` | Histogram | Response time by method, path |
| `error_count` | Counter | Failed requests by type |
| `active_connections` | Gauge | Current open connections |

### System Metrics

- CPU usage, Memory usage, Disk I/O, Network I/O

### Business Metrics

| Metric | Type | Description | Source |
|--------|------|-------------|--------|
| [from acceptance criteria] | | | |

Collection interval: 15 seconds

## Distributed Tracing

### Configuration

- SDK: OpenTelemetry
- Propagation: W3C Trace Context via HTTP headers
- Span naming: `<service>.<operation>`

### Sampling

| Environment | Rate | Rationale |
|-------------|------|-----------|
| Development | 100% | Full visibility |
| Staging | 100% | Full visibility |
| Production | 10% | Balance cost vs observability |

### Integration Points

- HTTP requests: automatic instrumentation
- Database queries: automatic instrumentation
- Message queues: manual span creation on publish/consume

## Alerting

| Severity | Response Time | Conditions |
|----------|---------------|------------|
| Critical | 5 min | Service unreachable, health check failed for 1 min, data loss detected |
| High | 30 min | Error rate > 5% for 5 min, P95 latency > 2x baseline for 10 min |
| Medium | 4 hours | Disk usage > 80%, elevated latency, connection pool exhaustion |
| Low | Next business day | Non-critical warnings, deprecated API usage |

### Notification Channels

| Severity | Channel |
|----------|---------|
| Critical | [PagerDuty / phone] |
| High | [Slack + email] |
| Medium | [Slack] |
| Low | [Dashboard only] |

## Dashboards

### Operations Dashboard

- Service health status (up/down per component)
- Request rate and error rate
- Response time percentiles (P50, P95, P99)
- Resource utilization (CPU, memory per container)
- Active alerts

### Business Dashboard

- [Key business metrics from acceptance criteria]
- [User activity indicators]
- [Transaction volumes]
```
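
The log format and PII rules in the template can be illustrated with a minimal Python sketch built on the standard `logging` module. It is one possible shape, not a prescribed implementation: the names `JsonFormatter`, `mask_emails`, and `build_logger` are hypothetical, the masking rule (keep the first character and the domain) is an assumption, and the formatter maps Python's `WARNING` level name to the template's `WARN`.

```python
import json
import logging
import re
import sys
from datetime import datetime, timezone

# Assumed masking rule: keep the first character of the local part and the domain.
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+)")


def mask_emails(text: str) -> str:
    """Replace the local part of any email address with a masked form."""
    return EMAIL_RE.sub(r"\1***@\2", text)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the template's fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        # Python names this level WARNING; the template uses WARN.
        level = "WARN" if record.levelname == "WARNING" else record.levelname
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": level,
            "service": self.service,
            # Both fields are supplied by the caller through `extra=...`.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": mask_emails(record.getMessage()),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry, default=str)


def build_logger(service: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout only (no files)."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(level)
    logger.propagate = False
    return logger
```

A caller would pass the correlation ID and context per event, for example `build_logger("orders").info("Order placed", extra={"correlation_id": "…", "context": {"user_id": "u_123"}})`, where the user ID is the opaque identifier rather than a username.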
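
The per-environment sampling rates in the template can be read as a deterministic decision on the trace ID, so that every service seeing the same trace reaches the same answer. The sketch below illustrates that idea in plain Python. It does not use the OpenTelemetry SDK and is not its implementation; `SAMPLING_RATES` and `should_sample` are hypothetical names, and treating the lower 64 bits of the trace ID as a uniform value is an assumption.

```python
# Rates copied from the template's Sampling table.
SAMPLING_RATES = {
    "development": 1.0,
    "staging": 1.0,
    "production": 0.10,
}

_MASK_64 = 2**64 - 1


def should_sample(trace_id: int, environment: str) -> bool:
    """Decide whether to keep a trace, based only on its ID and the environment.

    Assumes trace IDs are random 128-bit integers, so their lower 64 bits
    are uniformly distributed.
    """
    rate = SAMPLING_RATES[environment]
    bucket = trace_id & _MASK_64
    # Keep the trace when its bucket falls in the lowest `rate` share of the range.
    return bucket < int(rate * 2**64)
```

Because the decision depends only on the trace ID, a trace kept by the first service is also kept by every downstream service that applies the same rate.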