# Observability Planning

## Initial Data

- Problem description: `@_docs/00_problem/problem_description.md`
- Full Solution Description: `@_docs/01_solution/solution.md`
- Components: `@_docs/02_components`
- Deployment Strategy: `@_docs/02_components/deployment_strategy.md`

## Role

You are a Site Reliability Engineer (SRE).

## Task

- Define logging strategy across all components
- Plan metrics collection and dashboards
- Design distributed tracing (if applicable)
- Establish alerting rules
- Document incident response procedures

## Output

### Logging Strategy

#### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostic information | Request payload, Query params |

#### Log Format

```json
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
```
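
As an illustration of the format above, here is a minimal sketch built on Python's standard `logging` module. The names `JsonFormatter`, `build_logger`, and `example-service` are hypothetical, and the sketch assumes `correlation_id` and `context` are supplied through the `extra=` argument. Python reports `WARNING` where the level table says `WARN`, so the sketch maps one onto the other.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

# Python's level name differs from the WARN label used in the level table.
LEVEL_NAMES = {"WARNING": "WARN"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the six fields of the log format."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # hypothetical service name, one per component

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": self.service,
            # Assumption: callers pass these two via `extra=`; defaults apply otherwise.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Create a logger that writes JSON lines to the console (development setup)."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = build_logger("example-service")
    log.info(
        "Order placed",
        extra={"correlation_id": str(uuid.uuid4()), "context": {"items": 3}},
    )
```

Emitting one JSON object per line keeps the output easy to ingest for whichever centralized store is chosen later.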

#### Log Storage

- Development: Console/file
- Staging: Centralized (ELK, CloudWatch, etc.)
- Production: Centralized with retention policy

### Metrics

#### System Metrics

- CPU usage
- Memory usage
- Disk I/O
- Network I/O

#### Application Metrics

| Metric | Type | Description |
|--------|------|-------------|
| request_count | Counter | Total requests |
| request_duration | Histogram | Response time |
| error_count | Counter | Failed requests |
| active_connections | Gauge | Current connections |
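
To make the three metric types concrete, the sketch below implements them with the Python standard library only and records all four metrics around a request handler. The class names, the `handle` wrapper, and the bucket edges are assumptions for illustration; a real service would more likely rely on an existing metrics library and its exporter.

```python
import bisect
import time
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonically increasing total (request_count, error_count)."""
    value: int = 0

    def inc(self, amount: int = 1) -> None:
        self.value += amount


@dataclass
class Gauge:
    """Value that moves up and down (active_connections)."""
    value: int = 0

    def inc(self) -> None:
        self.value += 1

    def dec(self) -> None:
        self.value -= 1


@dataclass
class Histogram:
    """Per-bucket observation counts, in seconds (request_duration)."""
    bounds: tuple = (0.05, 0.1, 0.5, 1.0, 5.0)  # assumed upper bucket edges
    counts: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # One slot per bound plus a final overflow slot for larger values.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, seconds: float) -> None:
        # Bucket i holds values in (bounds[i-1], bounds[i]]; counts are not cumulative.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1


request_count = Counter()
error_count = Counter()
active_connections = Gauge()
request_duration = Histogram()


def handle(request_fn):
    """Run one request handler and record the four metrics around it."""
    request_count.inc()
    active_connections.inc()
    start = time.perf_counter()
    try:
        return request_fn()
    except Exception:
        error_count.inc()
        raise
    finally:
        request_duration.observe(time.perf_counter() - start)
        active_connections.dec()
```

The error rate used for alerting can then be derived as `error_count.value / request_count.value` over a time window.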

#### Business Metrics

- [Define based on acceptance criteria]

### Distributed Tracing

#### Trace Context

- Correlation ID propagation
- Span naming conventions
- Sampling strategy

#### Integration Points

- HTTP headers
- Message queue metadata
- Database query tagging
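
Correlation ID propagation across these integration points could look roughly like the sketch below. The header name `X-Correlation-ID` and the function names are assumptions (the W3C `traceparent` header is a common alternative), and header lookup is case-sensitive here for brevity. Incoming IDs are accepted only if they parse as UUIDs, so an untrusted value can never reach the SQL comment.

```python
import contextvars
import uuid
from typing import Optional

# Assumed header name; not prescribed by this plan.
CORRELATION_HEADER = "X-Correlation-ID"

_correlation_id: contextvars.ContextVar = contextvars.ContextVar(
    "correlation_id", default=None
)


def accept_incoming(headers: dict) -> str:
    """Reuse the caller's correlation ID if it is a valid UUID, else mint a new one."""
    raw = headers.get(CORRELATION_HEADER, "")
    try:
        cid = str(uuid.UUID(raw))
    except ValueError:
        cid = str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid


def outgoing_headers(extra: Optional[dict] = None) -> dict:
    """Headers (or message metadata) for a downstream call, carrying the current ID."""
    headers = dict(extra or {})
    cid = _correlation_id.get()
    if cid is not None:
        headers[CORRELATION_HEADER] = cid
    return headers


def tag_query(sql: str) -> str:
    """Prefix a SQL statement with a comment so the ID appears in database logs."""
    cid = _correlation_id.get()
    return f"/* correlation_id={cid} */ {sql}" if cid else sql
```

The same dictionary returned by `outgoing_headers` can be attached as message queue metadata, which keeps one ID visible across HTTP, queue, and database logs.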

### Alerting

#### Alert Categories

| Severity | Response Time | Examples |
|----------|---------------|----------|
| Critical | 5 min | Service down, Data loss |
| High | 30 min | High error rate, Performance degradation |
| Medium | 4 hours | Elevated latency, Disk usage high |
| Low | Next business day | Non-critical warnings |

#### Alert Rules

```yaml
alerts:
  - name: high_error_rate
    condition: error_rate > 5%
    duration: 5m
    severity: high

  - name: service_down
    condition: health_check_failed
    duration: 1m
    severity: critical
```
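
The rule format above is not tied to a specific alerting tool, so the following sketch only illustrates one plausible reading of it: an alert fires once its condition has held continuously for `duration`, and the timer resets when the condition clears. The `AlertRule` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    name: str
    threshold: float   # e.g. 0.05 for "error_rate > 5%"
    duration_s: float  # seconds the condition must hold before firing
    severity: str
    breached_since: Optional[float] = None

    def evaluate(self, value: float, now: float) -> bool:
        """Return True once `value` has stayed above the threshold for `duration_s`."""
        if value <= self.threshold:
            self.breached_since = None  # condition cleared, reset the timer
            return False
        if self.breached_since is None:
            self.breached_since = now  # first sample above the threshold
        return now - self.breached_since >= self.duration_s


if __name__ == "__main__":
    rule = AlertRule("high_error_rate", threshold=0.05, duration_s=300, severity="high")
    for t, rate in [(0, 0.08), (150, 0.09), (300, 0.07)]:
        print(t, rule.evaluate(rate, now=t))
```

Requiring the condition to hold for a duration avoids paging on a single noisy sample, at the cost of a delay equal to that duration.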

### Dashboards

#### Operations Dashboard

- Service health status
- Request rate and error rate
- Response time percentiles
- Resource utilization

#### Business Dashboard

- Key business metrics
- User activity
- Transaction volumes

Store the output in `_docs/02_components/observability_plan.md`.

## Notes

- Follow the principle: "If it's not monitored, it's not in production"
- Balance verbosity with cost
- Ensure PII is not logged
- Plan for log rotation and retention
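
One way to approach the PII note is a logging filter that masks sensitive values before any handler sees them. The sketch below is a minimal example that treats only email addresses as PII, which is an assumption; the `RedactPII` class name and the regular expression are illustrative, and a real plan would list the fields that count as PII for the system.

```python
import logging
import re

# Assumed pattern: only email addresses are masked in this sketch.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class RedactPII(logging.Filter):
    """Mask email addresses in the final, interpolated log message."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL.sub("[REDACTED]", record.getMessage())
        record.args = ()  # the message is already interpolated
        return True  # keep the record, only its text is changed


if __name__ == "__main__":
    logger = logging.getLogger("example-service")  # hypothetical service name
    logger.addHandler(logging.StreamHandler())
    logger.addFilter(RedactPII())
    logger.warning("Password reset requested by %s", "jane.doe@example.com")
```

Pattern-based masking is a safety net rather than a guarantee, so it works best alongside a rule that PII fields are never passed to the logger in the first place.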