mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-04-22 09:16:38 +00:00
Sync .cursor from suite (autodev orchestrator + monorepo skills)
This commit is contained in:
@@ -0,0 +1,60 @@
|
||||
# Step 5: Observability
|
||||
|
||||
**Role**: Site Reliability Engineer (SRE)
|
||||
**Goal**: Define logging, metrics, tracing, and alerting strategy.
|
||||
**Constraints**: Strategy document — describe what to implement, not how to wire it.
|
||||
|
||||
## Steps
|
||||
|
||||
1. Read `architecture.md` and component specs for service boundaries
|
||||
2. Research observability best practices for the tech stack
|
||||
|
||||
## Logging
|
||||
|
||||
- Structured JSON to stdout/stderr (no file logging in containers)
|
||||
- Fields: `timestamp` (ISO 8601), `level`, `service`, `correlation_id`, `message`, `context`
|
||||
- Levels: ERROR (exceptions), WARN (degraded), INFO (business events), DEBUG (diagnostics, dev only)
|
||||
- No PII in logs
|
||||
- Retention: dev = console, staging = 7 days, production = 30 days
|
||||
|
||||
## Metrics
|
||||
|
||||
- Expose Prometheus-compatible `/metrics` endpoint per service
|
||||
- System metrics: CPU, memory, disk, network
|
||||
- Application metrics: `request_count`, `request_duration` (histogram), `error_count`, `active_connections`
|
||||
- Business metrics: derived from acceptance criteria
|
||||
- Collection interval: 15s
|
||||
|
||||
## Distributed Tracing
|
||||
|
||||
- OpenTelemetry SDK integration
|
||||
- Trace context propagation via HTTP headers and message queue metadata
|
||||
- Span naming: `<service>.<operation>`
|
||||
- Sampling: 100% in dev/staging, 10% in production (adjust based on volume)
|
||||
|
||||
## Alerting
|
||||
|
||||
| Severity | Response Time | Condition Examples |
|
||||
|----------|---------------|-------------------|
|
||||
| Critical | 5 min | Service down, data loss, health check failed |
|
||||
| High | 30 min | Error rate > 5%, P95 latency > 2x baseline |
|
||||
| Medium | 4 hours | Disk > 80%, elevated latency |
|
||||
| Low | Next business day | Non-critical warnings |
|
||||
|
||||
## Dashboards
|
||||
|
||||
- Operations: service health, request rate, error rate, response time percentiles, resource utilization
|
||||
- Business: key business metrics from acceptance criteria
|
||||
|
||||
## Self-verification
|
||||
|
||||
- [ ] Structured logging format defined with required fields
|
||||
- [ ] Metrics endpoint specified per service
|
||||
- [ ] OpenTelemetry tracing configured
|
||||
- [ ] Alert severities with response times defined
|
||||
- [ ] Dashboards cover operations and business metrics
|
||||
- [ ] PII exclusion from logs addressed
|
||||
|
||||
## Save action
|
||||
|
||||
Write `observability.md` using `templates/observability.md`.
|
||||
Reference in New Issue
Block a user