mirror of
https://github.com/azaion/annotations.git
synced 2026-04-22 22:06:30 +00:00
9e7dc290db
2.9 KiB
Observability Planning
Initial data:
- Problem description: @_docs/00_problem/problem_description.md
- Full Solution Description: @_docs/01_solution/solution.md
- Components: @_docs/02_components
- Deployment Strategy: @_docs/02_components/deployment_strategy.md
Role
You are a Site Reliability Engineer (SRE)
Task
- Define logging strategy across all components
- Plan metrics collection and dashboards
- Design distributed tracing (if applicable)
- Establish alerting rules
- Document incident response procedures
Output
Logging Strategy
Log Levels
| Level | Usage | Example |
|---|---|---|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostic information | Request payload, Query params |
Log Format
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
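As an illustration of the format above, here is a minimal sketch of a formatter that emits one JSON object per log record, using only the Python standard library. The class name `JsonFormatter`, the `correlation_id` and `context` record attributes (supplied through `extra=`), and the mapping of Python's `WARNING` to `WARN` are assumptions made for the example, not part of the plan.

```python
import json
import logging
from datetime import datetime, timezone

# Python names this level "WARNING"; the table above uses "WARN".
LEVEL_ALIASES = {"WARNING": "WARN"}


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with the fields listed above."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # ISO 8601 timestamp in UTC, taken from the record creation time.
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": LEVEL_ALIASES.get(record.levelname, record.levelname),
            "service": self.service,
            # Assumed to be attached by the caller via `extra=`; defaults otherwise.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        # default=str keeps serialization from failing on non-JSON context values.
        return json.dumps(entry, default=str)
```

A handler would typically be configured with `handler.setFormatter(JsonFormatter("service-name"))`, and callers would pass `extra={"correlation_id": ..., "context": {...}}` when logging.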
Log Storage
- Development: Console/file
- Staging: Centralized (ELK, CloudWatch, etc.)
- Production: Centralized with retention policy
Metrics
System Metrics
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
Application Metrics
| Metric | Type | Description |
|---|---|---|
| request_count | Counter | Total requests |
| request_duration | Histogram | Response time |
| error_count | Counter | Failed requests |
| active_connections | Gauge | Current connections |
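To make the three metric types in the table concrete, the following sketch implements them in memory with the standard library only. It is an illustration of the semantics, not a recommendation to hand-roll metrics; a real deployment would more likely use an established metrics client. The class names and the bucket boundaries are assumptions.

```python
import bisect
import threading


class Counter:
    """Monotonically increasing total, e.g. request_count or error_count."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        with self._lock:
            self.value += amount


class Gauge:
    """Value that can go up and down, e.g. active_connections."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def inc(self, amount: int = 1) -> None:
        with self._lock:
            self.value += amount

    def dec(self, amount: int = 1) -> None:
        with self._lock:
            self.value -= amount


class Histogram:
    """Distribution of observations, e.g. request_duration in seconds."""

    def __init__(self, bounds: list) -> None:
        self.bounds = sorted(bounds)
        # One count per upper bound, plus a final overflow bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.count = 0
        self.total = 0.0
        self._lock = threading.Lock()

    def observe(self, value: float) -> None:
        # bisect_left puts a value equal to a bound into that bound's bucket.
        index = bisect.bisect_left(self.bounds, value)
        with self._lock:
            self.counts[index] += 1
            self.count += 1
            self.total += value
```

For example, `Histogram([0.1, 0.5, 1.0])` counts durations up to 0.1 s, up to 0.5 s, up to 1.0 s, and everything slower in the last bucket.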
Business Metrics
- [Define based on acceptance criteria]
Distributed Tracing
Trace Context
- Correlation ID propagation
- Span naming conventions
- Sampling strategy
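One possible sampling strategy is deterministic, ID-based sampling: every service hashes the trace or correlation ID the same way, so a given trace is either kept everywhere or dropped everywhere. The sketch below assumes a hypothetical `should_sample` helper and a sampling rate expressed as a fraction between 0 and 1; neither is prescribed by the plan.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Return True if the trace identified by trace_id should be recorded.

    The decision depends only on trace_id and rate, so independent services
    reach the same verdict without coordinating.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < rate
```

A rate of `1.0` keeps every trace and `0.0` keeps none; values in between keep roughly that fraction of traces.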
Integration Points
- HTTP headers
- Message queue metadata
- Database query tagging
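Correlation ID propagation over HTTP headers could look roughly like the sketch below: an incoming request either carries an ID that is reused or gets a fresh one, and every outgoing call copies it forward. The header name `X-Correlation-ID` and the helper names are assumptions for illustration; a team adopting a tracing standard would use that standard's headers instead.

```python
import uuid
from contextvars import ContextVar
from typing import Mapping, Optional

# Hypothetical header name; the plan does not prescribe one.
CORRELATION_HEADER = "X-Correlation-ID"

# Holds the ID for the request currently being handled.
_correlation_id: ContextVar[Optional[str]] = ContextVar("correlation_id", default=None)


def accept_incoming(headers: Mapping[str, str]) -> str:
    """Reuse the caller's correlation ID if present, otherwise create one."""
    incoming = next(
        (value for key, value in headers.items()
         if key.lower() == CORRELATION_HEADER.lower()),
        None,
    )
    correlation_id = incoming or str(uuid.uuid4())
    _correlation_id.set(correlation_id)
    return correlation_id


def outgoing_headers(headers: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of headers with the current correlation ID attached."""
    result = dict(headers or {})
    correlation_id = _correlation_id.get()
    if correlation_id is not None:
        result[CORRELATION_HEADER] = correlation_id
    return result
```

The same ID can be written into message queue metadata or a SQL comment so that the other integration points share one identifier.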
Alerting
Alert Categories
| Severity | Response Time | Examples |
|---|---|---|
| Critical | 5 min | Service down, Data loss |
| High | 30 min | High error rate, Performance degradation |
| Medium | 4 hours | Elevated latency, Disk usage high |
| Low | Next business day | Non-critical warnings |
Alert Rules
alerts:
  - name: high_error_rate
    condition: error_rate > 5%
    duration: 5m
    severity: high
  - name: service_down
    condition: health_check_failed
    duration: 1m
    severity: critical
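The sketch below shows one way such a rule might be evaluated, assuming `duration` means the condition must hold continuously for that long before the alert fires. That reading, the `AlertRule` class, and the representation of the condition as a Python callable over a metrics dictionary are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AlertRule:
    name: str
    condition: Callable[[dict], bool]  # True while the metrics breach the rule
    duration_s: float                  # how long the breach must persist
    severity: str
    _breach_started: Optional[float] = field(default=None, init=False, repr=False)

    def evaluate(self, metrics: dict, now: float) -> bool:
        """Return True once the condition has held for duration_s seconds."""
        if not self.condition(metrics):
            # Any healthy sample resets the timer.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.duration_s


# The high_error_rate rule from above: error rate over 5% for 5 minutes.
high_error_rate = AlertRule(
    name="high_error_rate",
    condition=lambda m: m["error_rate"] > 0.05,
    duration_s=300,
    severity="high",
)
```

Requiring the breach to persist keeps a single bad sample from paging anyone, at the cost of delaying the alert by the configured duration.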
Dashboards
Operations Dashboard
- Service health status
- Request rate and error rate
- Response time percentiles
- Resource utilization
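The error rate and response time percentiles on this dashboard can be derived from raw samples as in the sketch below. It uses the nearest-rank percentile definition, which is one of several common definitions; the function names are hypothetical, and a real dashboard would usually compute these in the metrics backend rather than in application code.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def error_rate(error_count: int, request_count: int) -> float:
    """Fraction of failed requests; defined as 0.0 when there was no traffic."""
    if request_count == 0:
        return 0.0
    return error_count / request_count
```

With durations of 80, 95, 120, 150, and 200 ms, the 50th percentile is 120 ms and the 95th is 200 ms under this definition.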
Business Dashboard
- Key business metrics
- User activity
- Transaction volumes
Store the output in _docs/02_components/observability_plan.md
Notes
- Follow the principle: "If it's not monitored, it's not in production"
- Balance verbosity with cost
- Ensure PII is not logged
- Plan for log rotation and retention
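As one way to enforce the PII note, a logging filter can scrub records before any handler sees them. The sketch below masks email addresses in the message and blanks a small set of sensitive context keys. The regular expression, the key list, and the `context` record attribute are illustrative assumptions and would need to be extended for real data.

```python
import logging
import re

# Deliberately simple email pattern; not a full address validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical list of context keys that must never reach the logs.
SENSITIVE_KEYS = {"password", "token", "email", "phone"}


def redact_text(text: str) -> str:
    """Replace anything that looks like an email address."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def redact_context(context: dict) -> dict:
    """Blank the values of sensitive keys, leaving other entries untouched."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in context.items()
    }


class PiiRedactionFilter(logging.Filter):
    """Scrub the message and the optional `context` attribute of each record."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so PII passed as arguments is also caught.
        record.msg = redact_text(record.getMessage())
        record.args = ()
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            record.context = redact_context(context)
        return True  # never drop the record, only clean it
```

Attaching the filter to a handler with `handler.addFilter(PiiRedactionFilter())` applies it to everything that handler writes.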