gps-denied-onboard/.cursor/skills/deploy/templates/observability.md
Oleksandr Bezdieniezhnykh 1f634c2604
Update demo replay validation and testing documentation
- Modified the autodev state to reflect the current testing phase and details of the new `jetson-e2e` tests.
- Enhanced the "How to Test" documentation to provide clearer instructions on the demo replay validation process, including video and tlog alignment steps.
- Updated architectural documentation to include the new demo replay operator flow and its dependencies.
- Documented the removal of deprecated auto-sync features and clarified the operator-facing UI for replay validation.
- Added new entries in the dependencies table for upcoming tasks related to the demo replay flow.

These changes improve clarity and usability for operators and developers working with the demo replay system.
2026-06-20 11:24:43 +03:00

# Observability Template

Save as `_docs/04_deploy/observability.md`.


# [System Name] — Observability

## Logging

### Format

Structured JSON to stdout/stderr. No file-based logging in containers.

```json
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
```

### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostics (dev/staging only) | Request payload, Query params |
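The log format and levels above can be sketched with the Python standard library. This is only an illustration, assuming a Python service: `JsonFormatter` and `build_logger` are hypothetical names, and passing `correlation_id` and `context` through `extra=` is one possible convention, not something the template prescribes.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # ISO 8601 timestamp in UTC, taken from the record creation time.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            # The stdlib calls this level WARNING; the template uses WARN.
            "level": "WARN" if record.levelname == "WARNING" else record.levelname,
            "service": self.service,
            # Assumption: callers supply these two fields via `extra=`.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(service: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout only."""
    logger = logging.getLogger(service)
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate output via the root logger
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger
```

A call such as `build_logger("service-name").warning("Retry attempt 2/3", extra={"correlation_id": "uuid", "context": {"attempt": 2}})` would then emit one JSON line with `"level": "WARN"`.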

### Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console | Session |
| Staging | [log aggregator] | 7 days |
| Production | [log aggregator] | 30 days |

### PII Rules

- Never log passwords, tokens, or session IDs
- Mask email addresses and personal identifiers
- Log user IDs (opaque) instead of usernames
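One way to apply the PII rules is to scrub the `context` object before it is logged. The sketch below is an assumption-laden example: the key names in `SENSITIVE_KEYS`, the masking format (`a***@example.com`), and the helper names `mask_email` and `scrub` are all hypothetical and would need to match the real payloads.

```python
import re

# Keep the first character of the local part and the full domain.
EMAIL_RE = re.compile(
    r"([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)

# Assumption: these are the key names under which secrets may appear.
SENSITIVE_KEYS = {"password", "token", "session_id"}


def mask_email(text: str) -> str:
    """Replace every email address in `text` with a masked form."""
    return EMAIL_RE.sub(r"\1***@\2", text)


def scrub(context: dict) -> dict:
    """Return a copy of `context` that is safer to log."""
    clean = {}
    for key, value in context.items():
        if key.lower() in SENSITIVE_KEYS:
            continue  # drop secrets entirely instead of masking them
        if isinstance(value, dict):
            clean[key] = scrub(value)  # recurse into nested objects
        elif isinstance(value, str):
            clean[key] = mask_email(value)
        else:
            clean[key] = value
    return clean
```

Dropping secret fields outright, rather than masking them, keeps even their length out of the logs.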

## Metrics

### Endpoints

Every service exposes Prometheus-compatible metrics at `/metrics`.

### Application Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `request_count` | Counter | Total HTTP requests by method, path, status |
| `request_duration_seconds` | Histogram | Response time by method, path |
| `error_count` | Counter | Failed requests by type |
| `active_connections` | Gauge | Current open connections |
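To show what a scrape of `/metrics` might return for `request_count`, here is a minimal standard-library sketch of a labeled counter rendered in the Prometheus text exposition format. `LabeledCounter` is a hypothetical class for illustration only. A real service would more likely use an existing client library such as `prometheus_client`, and this sketch does not escape special characters in label values.

```python
from collections import defaultdict


class LabeledCounter:
    """A monotonically increasing counter keyed by label values."""

    def __init__(self, name: str, help_text: str, label_names: tuple):
        self.name = name
        self.help_text = help_text
        self.label_names = label_names
        self._values = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        # Order label values by the declared label names.
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += amount

    def render(self) -> str:
        """Return the counter in Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(
                f'{name}="{val}"' for name, val in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines) + "\n"


requests = LabeledCounter(
    "request_count", "Total HTTP requests", ("method", "path", "status")
)
requests.inc(method="GET", path="/health", status="200")
requests.inc(method="GET", path="/health", status="200")
requests.inc(method="POST", path="/orders", status="500")
```

Calling `requests.render()` yields one sample line per label combination, which is the body a `/metrics` handler would serve.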

### System Metrics

- CPU usage, Memory usage, Disk I/O, Network I/O

### Business Metrics

| Metric | Type | Description | Source |
|--------|------|-------------|--------|
| [from acceptance criteria] | | | |

Collection interval: 15 seconds

## Distributed Tracing

### Configuration

- SDK: OpenTelemetry
- Propagation: W3C Trace Context via HTTP headers
- Span naming: `<service>.<operation>`
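For orientation, W3C Trace Context propagation carries a `traceparent` HTTP header of the form `version-traceid-parentid-flags`. The sketch below builds and parses that header by hand to show what travels between services. The function names are hypothetical and only version `00` is handled. In practice the OpenTelemetry SDK's propagators would do this for you.

```python
import re
import secrets

# version "00", 16-byte trace ID, 8-byte parent (span) ID, 1-byte flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace and return its traceparent header value."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str):
    """Return the header fields as a dict, or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header: str) -> str:
    """Continue an incoming trace with a fresh span ID for the outgoing call."""
    parent = parse_traceparent(header)
    if parent is None:
        return new_traceparent()  # no valid context: start a new trace
    span_id = secrets.token_hex(8)
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{span_id}-{flags}"
```

An outgoing request would carry the result of `child_traceparent(incoming_header)` so that the downstream span joins the same trace.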

### Sampling

| Environment | Rate | Rationale |
|-------------|------|-----------|
| Development | 100% | Full visibility |
| Staging | 100% | Full visibility |
| Production | 10% | Balance cost vs observability |
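The sampling rates above could be applied with a deterministic, trace-ID-based decision, so that every service reaches the same verdict for a given trace. This is a sketch under assumptions: `SAMPLING_RATES` and `should_sample` are hypothetical names, and OpenTelemetry SDKs ship their own ratio-based samplers that would normally be configured instead.

```python
# Rates from the table above, expressed as fractions.
SAMPLING_RATES = {
    "development": 1.0,
    "staging": 1.0,
    "production": 0.10,
}


def should_sample(trace_id: str, rate: float) -> bool:
    """Decide from the trace ID alone whether a trace is kept.

    `trace_id` is the 32-character hex trace ID. Its low 64 bits are treated
    as a uniformly distributed number and compared against `rate`.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    bound = int(rate * (1 << 64))
    return int(trace_id[-16:], 16) < bound
```

Because the decision depends only on the trace ID, a trace is either kept in full or dropped in full, which avoids partially recorded traces.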

### Integration Points

- HTTP requests: automatic instrumentation
- Database queries: automatic instrumentation
- Message queues: manual span creation on publish/consume

## Alerting

| Severity | Response Time | Conditions |
|----------|---------------|------------|
| Critical | 5 min | Service unreachable, health check failed for 1 min, data loss detected |
| High | 30 min | Error rate > 5% for 5 min, P95 latency > 2x baseline for 10 min |
| Medium | 4 hours | Disk usage > 80%, elevated latency, connection pool exhaustion |
| Low | Next business day | Non-critical warnings, deprecated API usage |
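Conditions such as "Error rate > 5% for 5 min" combine a threshold with a duration: the alert should fire only when the threshold is exceeded for the whole window. A minimal sketch of that logic follows, assuming one sample per minute. `Sample` and `sustained_breach` are hypothetical names, and in a real deployment this rule would usually live in the alerting system rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Request and error totals for one minute (assumed granularity)."""
    requests: int
    errors: int


def error_rate(sample: Sample) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return sample.errors / sample.requests if sample.requests else 0.0


def sustained_breach(samples: list, threshold: float = 0.05, window: int = 5) -> bool:
    """True only if every one of the last `window` samples exceeds `threshold`."""
    if len(samples) < window:
        return False  # not enough history to judge the full window
    return all(error_rate(s) > threshold for s in samples[-window:])
```

Requiring every sample in the window to breach the threshold keeps a single bad minute from paging anyone.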

### Notification Channels

| Severity | Channel |
|----------|---------|
| Critical | [PagerDuty / phone] |
| High | [Slack + email] |
| Medium | [Slack] |
| Low | [Dashboard only] |

## Dashboards

### Operations Dashboard

- Service health status (up/down per component)
- Request rate and error rate
- Response time percentiles (P50, P95, P99)
- Resource utilization (CPU, memory per container)
- Active alerts

### Business Dashboard

- [Key business metrics from acceptance criteria]
- [User activity indicators]
- [Transaction volumes]