annotations/.cursor/commands/observability.md
Oleksandr Bezdieniezhnykh 9e7dc290db Refactor annotation tool from WPF desktop app to .NET API
Replace the WPF desktop application (Azaion.Suite, Azaion.Annotator,
Azaion.Common, Azaion.Inference, Azaion.Loader, Azaion.LoaderUI,
Azaion.Dataset, Azaion.Test) with a standalone .NET Web API in src/.

Made-with: Cursor
2026-03-25 04:40:03 +02:00

Observability Planning

Initial data:

  • Problem description: @_docs/00_problem/problem_description.md
  • Full Solution Description: @_docs/01_solution/solution.md
  • Components: @_docs/02_components
  • Deployment Strategy: @_docs/02_components/deployment_strategy.md

Role

You are a Site Reliability Engineer (SRE).

Task

  • Define logging strategy across all components
  • Plan metrics collection and dashboards
  • Design distributed tracing (if applicable)
  • Establish alerting rules
  • Document incident response procedures

Output

Logging Strategy

Log Levels

Level | Usage                                    | Example
------|------------------------------------------|------------------------------
ERROR | Exceptions, failures requiring attention | Database connection failed
WARN  | Potential issues, degraded performance   | Retry attempt 2/3
INFO  | Significant business events              | User registered, Order placed
DEBUG | Detailed diagnostic information          | Request payload, Query params

Log Format

{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
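
A minimal sketch of how a service might emit log lines in this format, using only Python's standard `logging` module. The service name `annotations-api`, the `build_logger` helper, and the convention of passing `correlation_id` and `context` through `extra=` are assumptions made for illustration; they are not part of the template.

```python
import io
import json
import logging
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields of the log format."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            # Python names this level WARNING; map it to the WARN label used in the plan.
            "level": {"WARNING": "WARN"}.get(record.levelname, record.levelname),
            "service": self.service,
            # Both fields arrive via `extra=`; the fallbacks are assumptions.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(service: str, stream) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(f"{service}.json")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: log one event into an in-memory buffer and parse it back.
buffer = io.StringIO()
logger = build_logger("annotations-api", buffer)  # hypothetical service name
logger.warning(
    "Retry attempt 2/3",
    extra={"correlation_id": str(uuid.uuid4()), "context": {"attempt": 2}},
)
entry = json.loads(buffer.getvalue())
```

Writing one JSON object per line keeps the output easy to ship to a centralized store without a custom parser.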

Log Storage

  • Development: Console/file
  • Staging: Centralized (ELK, CloudWatch, etc.)
  • Production: Centralized with retention policy

Metrics

System Metrics

  • CPU usage
  • Memory usage
  • Disk I/O
  • Network I/O

Application Metrics

Metric             | Type      | Description
-------------------|-----------|--------------------
request_count      | Counter   | Total requests
request_duration   | Histogram | Response time
error_count        | Counter   | Failed requests
active_connections | Gauge     | Current connections
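
To make the three metric types concrete, here is a small in-memory sketch. It is a stand-in for a real metrics client, not a recommendation of one: the `Metrics` class, the `track_request` helper, and the bucket boundaries are all hypothetical, and the histogram uses non-cumulative buckets for brevity.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """In-memory stand-in for a metrics client; names follow the application metrics."""

    # Histogram bucket upper bounds in seconds (illustrative values, an assumption).
    BUCKETS = (0.1, 0.5, 1.0, float("inf"))

    def __init__(self) -> None:
        self.request_count = 0        # Counter: only ever increases
        self.error_count = 0          # Counter: only ever increases
        self.active_connections = 0   # Gauge: moves up and down
        self.request_duration = defaultdict(int)  # Histogram: bucket bound -> observations

    def observe_duration(self, seconds: float) -> None:
        # Count the observation in the first bucket whose bound covers it.
        for bound in self.BUCKETS:
            if seconds <= bound:
                self.request_duration[bound] += 1
                break

    @contextmanager
    def track_request(self):
        """Wrap one request: update counters, the gauge, and the histogram."""
        self.request_count += 1
        self.active_connections += 1
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.error_count += 1
            raise
        finally:
            self.active_connections -= 1
            self.observe_duration(time.perf_counter() - start)


# Usage: one successful request and one simulated failure.
metrics = Metrics()
with metrics.track_request():
    pass
try:
    with metrics.track_request():
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
```

The distinction that matters for dashboards is that counters are read as rates, gauges as current values, and histograms as percentiles.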

Business Metrics

  • [Define based on acceptance criteria]

Distributed Tracing

Trace Context

  • Correlation ID propagation
  • Span naming conventions
  • Sampling strategy

Integration Points

  • HTTP headers
  • Message queue metadata
  • Database query tagging
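
Correlation ID propagation over HTTP headers could be sketched as follows. The header name `X-Correlation-ID` and both helper functions are assumptions for illustration; a plan that adopts the W3C Trace Context standard would propagate the `traceparent` header instead.

```python
import contextvars
import uuid
from typing import Dict, Optional

# Assumed header name; not specified anywhere in the plan.
CORRELATION_HEADER = "X-Correlation-ID"

# Holds the ID for the request currently being handled.
_correlation_id: contextvars.ContextVar = contextvars.ContextVar(
    "correlation_id", default=None
)


def accept_incoming(headers: Dict[str, str]) -> str:
    """Reuse the caller's correlation ID, or start a new one at the edge.

    Note: this lookup is case-sensitive, while real HTTP headers are not;
    a web framework's header mapping normally handles that.
    """
    cid = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid


def outgoing_headers(headers: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the current correlation ID onto an outbound request or message."""
    result = dict(headers or {})
    cid = _correlation_id.get()
    if cid is not None:
        result[CORRELATION_HEADER] = cid
    return result


# Usage: an inbound request carries an ID, and the downstream call inherits it.
accept_incoming({"X-Correlation-ID": "abc-123"})
downstream = outgoing_headers({"Accept": "application/json"})

# A request without the header gets a freshly generated ID.
generated = accept_incoming({})
```

The same value can be attached to message queue metadata or a database query comment, so one ID links every integration point listed above.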

Alerting

Alert Categories

Severity | Response Time     | Examples
---------|-------------------|-----------------------------------------
Critical | 5 min             | Service down, Data loss
High     | 30 min            | High error rate, Performance degradation
Medium   | 4 hours           | Elevated latency, Disk usage high
Low      | Next business day | Non-critical warnings

Alert Rules

alerts:
  - name: high_error_rate
    condition: error_rate > 5%
    duration: 5m
    severity: high
    
  - name: service_down
    condition: health_check_failed
    duration: 1m
    severity: critical
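
The `duration` field implies that a condition must hold continuously before the alert fires. One way to read that, as a sketch under assumptions: the `AlertRule` class and its `evaluate` method are hypothetical, and real alerting systems may define the semantics differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    name: str
    threshold: float       # e.g. 0.05 for "error_rate > 5%"
    duration_s: float      # how long the condition must hold, in seconds
    severity: str
    breached_since: Optional[float] = None  # time at which the current breach began

    def evaluate(self, value: float, now: float) -> bool:
        """Return True once `value` has stayed above the threshold for `duration_s`."""
        if value <= self.threshold:
            self.breached_since = None  # condition cleared, so reset the timer
            return False
        if self.breached_since is None:
            self.breached_since = now
        return now - self.breached_since >= self.duration_s


# Usage: (seconds, error_rate) samples; the dip at t=240 resets the 5-minute window.
rule = AlertRule("high_error_rate", threshold=0.05, duration_s=300, severity="high")
samples = [(0, 0.08), (120, 0.09), (240, 0.02), (300, 0.07), (600, 0.07)]
fired = [rule.evaluate(value, now) for now, value in samples]
```

Requiring a sustained breach trades a few minutes of detection delay for fewer alerts on short spikes.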

Dashboards

Operations Dashboard

  • Service health status
  • Request rate and error rate
  • Response time percentiles
  • Resource utilization

Business Dashboard

  • Key business metrics
  • User activity
  • Transaction volumes

Store the output in _docs/02_components/observability_plan.md

Notes

  • Follow the principle: "If it's not monitored, it's not in production"
  • Balance verbosity with cost
  • Ensure PII is not logged
  • Plan for log rotation and retention
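
As one illustration of the PII note, a logging filter can redact sensitive values before a record reaches any handler. This sketch covers only simple email addresses; the pattern, the `[REDACTED]` placeholder, and the `RedactingFilter` class are assumptions, and a real deployment would need a broader definition of PII.

```python
import logging
import re

# Illustrative pattern only: matches simple email addresses, not every form of PII.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Rewrite each record's message so later handlers never see the raw value."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # the message is already interpolated
        return True


# Usage: redact a plain string, then a log record with interpolated arguments.
redacted = redact("Password reset requested by jane.doe@example.com")

record = logging.LogRecord(
    "demo", logging.INFO, "demo.py", 1, "User %s logged in", ("a@b.io",), None
)
RedactingFilter().filter(record)
filtered_message = record.getMessage()
```

Attaching the filter with `logger.addFilter(RedactingFilter())` applies it to every record that logger emits.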