remove the current solution, add skills

2026-04-23 03:06:37 +00:00 · 2026-03-14 18:37:48 +02:00
parent fd75243a84
commit 767874cb90
363 changed files with 6057 additions and 36380 deletions
@@ -0,0 +1,122 @@
+# Observability Planning
+
+## Initial data
+ - Problem description: `@_docs/00_problem/problem_description.md`
+ - Full Solution Description: `@_docs/01_solution/solution.md`
+ - Components: `@_docs/02_components`
+ - Deployment Strategy: `@_docs/02_components/deployment_strategy.md`
+
+## Role
+  You are a Site Reliability Engineer (SRE).
+
+## Task
+ - Define logging strategy across all components
+ - Plan metrics collection and dashboards
+ - Design distributed tracing (if applicable)
+ - Establish alerting rules
+ - Document incident response procedures
+
+## Output
+
+### Logging Strategy
+
+#### Log Levels
+| Level | Usage | Example |
+|-------|-------|---------|
+| ERROR | Exceptions, failures requiring attention | Database connection failed |
+| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
+| INFO | Significant business events | User registered, Order placed |
+| DEBUG | Detailed diagnostic information | Request payload, Query params |
+
+#### Log Format
+```json
+{
+  "timestamp": "ISO8601",
+  "level": "INFO",
+  "service": "service-name",
+  "correlation_id": "uuid",
+  "message": "Event description",
+  "context": {}
+}
+```
+
+#### Log Storage
+- Development: Console/file
+- Staging: Centralized (ELK, CloudWatch, etc.)
+- Production: Centralized with retention policy
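+
+A minimal sketch of how the storage tiers above could be recorded, assuming a hypothetical config schema. The key names and retention values are illustrative assumptions, not requirements from the source:
+
+```yaml
+# Hypothetical schema; key names and retention values are assumptions
+log_storage:
+  development:
+    sink: console        # or a local file
+  staging:
+    sink: centralized    # e.g. ELK or CloudWatch
+  production:
+    sink: centralized
+    retention:
+      hot_days: 30       # assumed value; set per compliance and cost needs
+      archive_days: 365  # assumed value
+```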
+
+### Metrics
+
+#### System Metrics
+- CPU usage
+- Memory usage
+- Disk I/O
+- Network I/O
+
+#### Application Metrics
+| Metric | Type | Description |
+|--------|------|-------------|
+| request_count | Counter | Total requests |
+| request_duration | Histogram | Response time |
+| error_count | Counter | Failed requests |
+| active_connections | Gauge | Current connections |
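+
+The table above might be expanded into metric definitions along these lines. This is only a sketch under a hypothetical schema; the units, labels, and histogram buckets are assumptions to be replaced with values that fit the actual components:
+
+```yaml
+# Hypothetical schema; units, labels, and buckets are assumptions
+metrics:
+  - name: request_count
+    type: counter
+    labels: [service, endpoint, status]
+  - name: request_duration
+    type: histogram
+    unit: seconds
+    buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
+  - name: error_count
+    type: counter
+    labels: [service, endpoint, error_type]
+  - name: active_connections
+    type: gauge
+    labels: [service]
+```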
+
+#### Business Metrics
+- [Define based on acceptance criteria]
+
+### Distributed Tracing
+
+#### Trace Context
+- Correlation ID propagation
+- Span naming conventions
+- Sampling strategy
+
+#### Integration Points
+- HTTP headers
+- Message queue metadata
+- Database query tagging
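+
+One way the trace context and integration points above could be written down, assuming a hypothetical schema. The header name `X-Correlation-ID`, the key names, and the sample rate are illustrative assumptions; `correlation_id` matches the field in the log format:
+
+```yaml
+# Hypothetical schema; header name, key names, and sample rate are assumptions
+tracing:
+  correlation_id:
+    http_header: X-Correlation-ID
+    message_queue_field: correlation_id
+    sql_comment: true     # tag queries with the correlation ID as a comment
+  span_naming: "<service>.<operation>"
+  sampling:
+    strategy: probabilistic
+    rate: 0.1             # assumed: sample 10% of traces
+```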
+
+### Alerting
+
+#### Alert Categories
+| Severity | Response Time | Examples |
+|----------|---------------|----------|
+| Critical | 5 min | Service down, Data loss |
+| High | 30 min | High error rate, Performance degradation |
+| Medium | 4 hours | Elevated latency, High disk usage |
+| Low | Next business day | Non-critical warnings |
+
+#### Alert Rules
+```yaml
+alerts:
+  - name: high_error_rate
+    condition: error_rate > 5%
+    duration: 5m
+    severity: high
+
+  - name: service_down
+    condition: health_check_failed
+    duration: 1m
+    severity: critical
+```
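+
+The Medium-severity examples from the Alert Categories table could be expressed in the same illustrative schema as the rules above. The metric names, thresholds, and durations here are assumptions, not values from the source:
+
+```yaml
+# Same illustrative schema as the rules above; names and thresholds are assumptions
+alerts:
+  - name: elevated_latency
+    condition: p95_request_duration > 1s
+    duration: 10m
+    severity: medium
+
+  - name: high_disk_usage
+    condition: disk_usage > 80%
+    duration: 15m
+    severity: medium
+```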
+
+### Dashboards
+
+#### Operations Dashboard
+- Service health status
+- Request rate and error rate
+- Response time percentiles
+- Resource utilization
+
+#### Business Dashboard
+- Key business metrics
+- User activity
+- Transaction volumes
+
+Store the output in `_docs/02_components/observability_plan.md`
+
+## Notes
+ - Follow the principle: "If it's not monitored, it's not in production"
+ - Balance log and metric verbosity against storage and ingestion cost
+ - Ensure personally identifiable information (PII) is never logged
+ - Plan for log rotation and retention