mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-04-23 03:06:37 +00:00
remove the current solution, add skills
@@ -0,0 +1,122 @@
# Observability Planning

## Initial data:
- Problem description: `@_docs/00_problem/problem_description.md`
- Full Solution Description: `@_docs/01_solution/solution.md`
- Components: `@_docs/02_components`
- Deployment Strategy: `@_docs/02_components/deployment_strategy.md`

## Role
You are a Site Reliability Engineer (SRE)

## Task
- Define logging strategy across all components
- Plan metrics collection and dashboards
- Design distributed tracing (if applicable)
- Establish alerting rules
- Document incident response procedures

## Output

### Logging Strategy

#### Log Levels
| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostic information | Request payload, Query params |

#### Log Format
```json
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
```
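
One way to emit records in this shape is a custom formatter on top of Python's standard `logging` module. The sketch below is illustrative only: the class name `JsonFormatter`, the helper `build_logger`, and the choice to pass `correlation_id` and `context` through `extra=` are assumptions, not part of the plan itself.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields of the log format."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # hypothetical service name supplied by the caller

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # ISO 8601 timestamp in UTC, taken from the record's creation time
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # Values passed via `extra=` become attributes on the record
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to the console."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger
```

A call such as `build_logger("service-name").info("Event description", extra={"correlation_id": "some-uuid", "context": {}})` would then produce a single JSON line per event, which most centralized log stores can ingest without extra parsing rules.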

#### Log Storage
- Development: Console/file
- Staging: Centralized (ELK, CloudWatch, etc.)
- Production: Centralized with retention policy

### Metrics

#### System Metrics
- CPU usage
- Memory usage
- Disk I/O
- Network I/O

#### Application Metrics
| Metric | Type | Description |
|--------|------|-------------|
| request_count | Counter | Total requests |
| request_duration | Histogram | Response time |
| error_count | Counter | Failed requests |
| active_connections | Gauge | Current connections |
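
To make the three metric types in the table concrete, here is a minimal in-process sketch using only the standard library. It is not tied to any monitoring product; the class names and the bucket layout are assumptions chosen for illustration, and a real deployment would more likely use an existing metrics client.

```python
import bisect
import threading
from typing import Iterable, List


class Counter:
    """Value that only increases, e.g. request_count or error_count."""

    def __init__(self) -> None:
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        with self._lock:
            self._value += amount

    @property
    def value(self) -> float:
        return self._value


class Gauge:
    """Value that can go up and down, e.g. active_connections."""

    def __init__(self) -> None:
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount: float = 1.0) -> None:
        with self._lock:
            self._value += amount

    def dec(self, amount: float = 1.0) -> None:
        with self._lock:
            self._value -= amount

    @property
    def value(self) -> float:
        return self._value


class Histogram:
    """Sorts observations into buckets, e.g. request_duration in seconds."""

    def __init__(self, upper_bounds: Iterable[float]) -> None:
        self.upper_bounds: List[float] = sorted(upper_bounds)
        # One count per bound plus a final overflow bucket for larger values
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.count = 0
        self.total = 0.0
        self._lock = threading.Lock()

    def observe(self, value: float) -> None:
        # Index of the first bucket whose upper bound is >= value
        index = bisect.bisect_left(self.upper_bounds, value)
        with self._lock:
            self.bucket_counts[index] += 1
            self.count += 1
            self.total += value
```

The distinction matters for dashboards: counters are usually graphed as rates, gauges as current values, and histograms as percentiles derived from the bucket counts.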

#### Business Metrics
- [Define based on acceptance criteria]

### Distributed Tracing

#### Trace Context
- Correlation ID propagation
- Span naming conventions
- Sampling strategy
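
Correlation ID propagation inside a single Python process could be sketched with `contextvars` and a logging filter, as below. The function `start_request`, the filter class, and the `"-"` placeholder for an unset ID are hypothetical names and choices made for this example.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the correlation ID for the current request or task; "-" means unset.
_correlation_id = contextvars.ContextVar("correlation_id", default="-")


def start_request(incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's ID when one arrived, otherwise mint a new UUID."""
    cid = incoming_id or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid


def current_correlation_id() -> str:
    """Return the ID bound to the current execution context."""
    return _correlation_id.get()


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = current_correlation_id()
        return True
```

Because the value lives in a context variable, it is isolated between contexts, so concurrent asyncio tasks each see their own ID as long as `start_request` is called at the entry point of each one.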

#### Integration Points
- HTTP headers
- Message queue metadata
- Database query tagging
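
For the HTTP header integration point, one option is the W3C Trace Context `traceparent` header. The plan does not prescribe a header format, so treating `traceparent` as the carrier is an assumption here, and `inject` and `extract` are hypothetical helper names. The sketch handles version `00` only.

```python
import re
import secrets
from typing import Dict, Mapping, Optional, Tuple

# version "00", 32 hex trace-id, 16 hex parent-id, 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(
    headers: Mapping[str, str],
    trace_id: Optional[str] = None,
    sampled: bool = True,
) -> Dict[str, str]:
    """Return a copy of the headers with a traceparent entry for a new span."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)                # 8 bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    out = dict(headers)
    out["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return out


def extract(headers: Mapping[str, str]) -> Optional[Tuple[str, str, bool]]:
    """Parse traceparent into (trace_id, parent_id, sampled), or None if invalid."""
    value = next(
        (v for k, v in headers.items() if k.lower() == "traceparent"), None
    )
    if value is None:
        return None
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are defined as invalid by the specification
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)
```

The same trace ID could be carried in message queue metadata or in a SQL comment for database query tagging; only the carrier changes, not the identifier.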

### Alerting

#### Alert Categories
| Severity | Response Time | Examples |
|----------|---------------|----------|
| Critical | 5 min | Service down, Data loss |
| High | 30 min | High error rate, Performance degradation |
| Medium | 4 hours | Elevated latency, Disk usage high |
| Low | Next business day | Non-critical warnings |

#### Alert Rules
```yaml
alerts:
  - name: high_error_rate
    condition: error_rate > 5%
    duration: 5m
    severity: high

  - name: service_down
    condition: health_check_failed
    duration: 1m
    severity: critical
```
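
The `duration` field implies that a rule should fire only after its condition has held continuously for that long. A minimal evaluator for that behaviour might look like the sketch below; `AlertRule`, `AlertEvaluator`, and the sample dictionary shape are assumptions for illustration and do not correspond to any particular alerting tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class AlertRule:
    name: str
    condition: Callable[[Dict[str, float]], bool]  # True while the rule is breached
    duration_s: float                              # how long the breach must last
    severity: str


class AlertEvaluator:
    """Fire a rule only after its condition has held continuously for duration_s."""

    def __init__(self, rule: AlertRule) -> None:
        self.rule = rule
        self._breach_started: Optional[float] = None

    def evaluate(self, sample: Dict[str, float], now: float) -> bool:
        if not self.rule.condition(sample):
            self._breach_started = None  # condition cleared, restart the clock
            return False
        if self._breach_started is None:
            self._breach_started = now   # first sample of a new breach
        return now - self._breach_started >= self.rule.duration_s


# Mirrors the high_error_rate rule: error rate above 5% for 5 minutes
high_error_rate = AlertRule(
    name="high_error_rate",
    condition=lambda s: s["error_rate"] > 0.05,
    duration_s=300,
    severity="high",
)
```

Requiring the breach to persist trades detection latency for fewer false pages: a single bad sample resets nothing and fires nothing, while a recovery in the middle of the window restarts the timer.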

### Dashboards

#### Operations Dashboard
- Service health status
- Request rate and error rate
- Response time percentiles
- Resource utilization
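
Response time percentiles can be computed in several ways. The sketch below uses the nearest-rank definition on raw samples, which is an assumption made for simplicity; dashboards backed by histograms typically estimate percentiles from bucket counts instead. The function name `percentile` is hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Example response times in milliseconds
latencies_ms = [120, 80, 95, 300, 110]
p50 = percentile(latencies_ms, 50)  # 110
p95 = percentile(latencies_ms, 95)  # 300
```

With only five samples the 95th percentile is simply the maximum, which is a reminder that high percentiles need a reasonable sample count per window before they say much.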

#### Business Dashboard
- Key business metrics
- User activity
- Transaction volumes

Store output to `_docs/02_components/observability_plan.md`

## Notes
- Follow the principle: "If it's not monitored, it's not in production"
- Balance verbosity with cost
- Ensure PII is not logged
- Plan for log rotation and retention
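
As one possible way to enforce the PII note at the logging layer, a filter can scrub records before any handler sees them. The sketch below is deliberately narrow: it masks email addresses in the message and a small deny-list of context keys. The pattern, the key list, and the names `PiiFilter` and `redact_context` are all assumptions, and a real system would need a broader definition of PII.

```python
import logging
import re
from typing import Any, Dict

# Simple email pattern; intentionally loose, meant only as an illustration
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical deny-list of context keys that must never reach the log store
SENSITIVE_KEYS = {"email", "password", "token"}


def redact_context(context: Dict[str, Any]) -> Dict[str, Any]:
    """Replace the values of sensitive keys, leaving other entries untouched."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in context.items()
    }


class PiiFilter(logging.Filter):
    """Scrub email addresses and sensitive context keys from each record."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so that %-style arguments are scrubbed as well
        record.msg = EMAIL.sub("[REDACTED]", record.getMessage())
        record.args = None
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            record.context = redact_context(context)
        return True
```

Attaching the filter to the handler rather than to individual loggers (`handler.addFilter(PiiFilter())`) means every record that reaches that output is scrubbed, regardless of which component emitted it.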