mirror of
https://github.com/azaion/admin.git
synced 2026-04-22 22:36:33 +00:00
Update .gitignore to include .env and .DS_Store files
Add .cursor autodevelopment system
@@ -0,0 +1,132 @@
# Observability Template

Save as `_docs/04_deploy/observability.md`.

---

```markdown
# [System Name] - Observability

## Logging

### Format

Structured JSON to stdout/stderr. No file-based logging in containers.

```json
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
```

### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostics (dev/staging only) | Request payload, Query params |

### Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console | Session |
| Staging | [log aggregator] | 7 days |
| Production | [log aggregator] | 30 days |

### PII Rules

- Never log passwords, tokens, or session IDs
- Mask email addresses and personal identifiers
- Log user IDs (opaque) instead of usernames

## Metrics

### Endpoints

Every service exposes Prometheus-compatible metrics at `/metrics`.

### Application Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `request_count` | Counter | Total HTTP requests by method, path, status |
| `request_duration_seconds` | Histogram | Response time by method, path |
| `error_count` | Counter | Failed requests by type |
| `active_connections` | Gauge | Current open connections |

### System Metrics

- CPU usage, Memory usage, Disk I/O, Network I/O

### Business Metrics

| Metric | Type | Description | Source |
|--------|------|-------------|--------|
| [from acceptance criteria] | | | |

Collection interval: 15 seconds

## Distributed Tracing

### Configuration

- SDK: OpenTelemetry
- Propagation: W3C Trace Context via HTTP headers
- Span naming: `<service>.<operation>`

### Sampling

| Environment | Rate | Rationale |
|-------------|------|-----------|
| Development | 100% | Full visibility |
| Staging | 100% | Full visibility |
| Production | 10% | Balance cost vs observability |

### Integration Points

- HTTP requests: automatic instrumentation
- Database queries: automatic instrumentation
- Message queues: manual span creation on publish/consume

## Alerting

| Severity | Response Time | Conditions |
|----------|---------------|------------|
| Critical | 5 min | Service unreachable, health check failed for 1 min, data loss detected |
| High | 30 min | Error rate > 5% for 5 min, P95 latency > 2x baseline for 10 min |
| Medium | 4 hours | Disk usage > 80%, elevated latency, connection pool exhaustion |
| Low | Next business day | Non-critical warnings, deprecated API usage |

### Notification Channels

| Severity | Channel |
|----------|---------|
| Critical | [PagerDuty / phone] |
| High | [Slack + email] |
| Medium | [Slack] |
| Low | [Dashboard only] |

## Dashboards

### Operations Dashboard

- Service health status (up/down per component)
- Request rate and error rate
- Response time percentiles (P50, P95, P99)
- Resource utilization (CPU, memory per container)
- Active alerts

### Business Dashboard

- [Key business metrics from acceptance criteria]
- [User activity indicators]
- [Transaction volumes]
```
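
The logging format and PII rules in the template can be sketched with Python's standard `logging` module. This is a minimal illustration, not part of the template: the `JsonFormatter` class, the `scrub` helper, the `SENSITIVE_KEYS` set, and the `[email]` mask are all assumed names, and the email pattern is deliberately simple.

```python
import json
import logging
import re
import sys
from datetime import datetime, timezone

# Assumed key names; adjust to whatever the service actually passes in context.
SENSITIVE_KEYS = {"password", "token", "session_id"}
# Deliberately loose email pattern, good enough for masking.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def scrub(value):
    """Recursively drop sensitive keys and mask email addresses."""
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items()
                if k.lower() not in SENSITIVE_KEYS}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[email]", value)
    return value


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the template's fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        # Python names this level WARNING; the template's table uses WARN.
        level = {"WARNING": "WARN"}.get(record.levelname, record.levelname)
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, timezone.utc).isoformat(),
            "level": level,
            "service": self.service,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": scrub(record.getMessage()),
            "context": scrub(getattr(record, "context", {})),
        }
        return json.dumps(entry)


def build_logger(service):
    """Logger that writes JSON lines to stdout, as the template requires."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger("orders").info("Order placed", extra={"correlation_id": cid, "context": {"user_id": uid}})` would emit one line in the documented shape, with `cid` and `uid` supplied by the caller.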
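
To make the `/metrics` requirement concrete, the sketch below renders the template's `request_count` counter in the Prometheus text exposition format using only the standard library. The `Counter` class here is a hypothetical stand-in written for illustration; a real service would more likely use an established Prometheus client library, and this version does not escape special characters in label values.

```python
from collections import defaultdict


class Counter:
    """Tiny illustrative counter; not a real Prometheus client class."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        # Maps a sorted tuple of (label, value) pairs to the running count.
        self.values = defaultdict(float)

    def inc(self, **labels):
        """Increment the series identified by the given labels."""
        self.values[tuple(sorted(labels.items()))] += 1

    def render(self):
        """Return the counter in Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for labels, value in sorted(self.values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{self.name}{{{label_str}}} {value:g}")
        return "\n".join(lines) + "\n"


request_count = Counter("request_count", "Total HTTP requests")
request_count.inc(method="GET", path="/health", status="200")
request_count.inc(method="GET", path="/health", status="200")
print(request_count.render(), end="")
```

The body returned by `render()` is what a handler mounted at `/metrics` would serve as plain text on each scrape.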
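
The tracing section combines W3C Trace Context propagation with per-environment sampling rates. The sketch below parses a version `00` `traceparent` header and makes a deterministic sampling decision from the trace ID, so every service reaches the same answer for a given trace. The function names and the `SAMPLE_RATES` mapping are assumptions for illustration, and the ratio check is a simplified stand-in rather than the exact algorithm an OpenTelemetry SDK uses.

```python
import re

# version-traceid-parentid-flags, all lowercase hex (version 00 layout).
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

# Rates copied from the template's sampling table.
SAMPLE_RATES = {"development": 1.0, "staging": 1.0, "production": 0.10}


def parse_traceparent(header):
    """Return trace fields from a traceparent header, or None if invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def should_sample(trace_id, environment):
    """Deterministic ratio sampling based on the low 64 bits of the trace ID."""
    rate = SAMPLE_RATES[environment]
    return int(trace_id[-16:], 16) < rate * 2 ** 64


ctx = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"], should_sample(ctx["trace_id"], "production"))
```

Deriving the decision from the trace ID avoids a coin flip per service, which would otherwise leave production traces with missing spans.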
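
The alerting table can be read as an ordered set of rules, checked from most to least severe. The following sketch encodes the quantitative conditions only; the `Snapshot` fields and the `classify` function are hypothetical, the "for N min" windows are assumed to be applied upstream when the snapshot is computed, and the channel names are the template's own placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder channels taken from the template's notification table.
CHANNELS = {
    "Critical": "PagerDuty / phone",
    "High": "Slack + email",
    "Medium": "Slack",
    "Low": "Dashboard only",
}


@dataclass
class Snapshot:
    """Pre-aggregated health figures (assumed shape, for illustration)."""
    reachable: bool          # health check passing over the last minute
    error_rate: float        # fraction of failed requests over 5 min
    p95_latency_s: float     # P95 latency over the last 10 min
    baseline_p95_s: float    # agreed baseline P95 latency
    disk_usage: float        # fraction of disk in use


def classify(snapshot: Snapshot) -> Optional[str]:
    """Return the highest matching severity, or None when healthy."""
    if not snapshot.reachable:
        return "Critical"
    if (snapshot.error_rate > 0.05
            or snapshot.p95_latency_s > 2 * snapshot.baseline_p95_s):
        return "High"
    if snapshot.disk_usage > 0.80:
        return "Medium"
    return None


state = Snapshot(reachable=True, error_rate=0.07, p95_latency_s=0.4,
                 baseline_p95_s=0.3, disk_usage=0.5)
severity = classify(state)
print(severity, "->", CHANNELS.get(severity))
```

Checking rules in severity order means a single incident pages once at the highest applicable level instead of fanning out to every channel.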