gps-denied-onboard/.cursor/skills/deploy/steps/05_observability.md
Oleksandr Bezdieniezhnykh 1f634c2604
2026-06-20 11:24:43 +03:00

Step 5: Observability

Role: Site Reliability Engineer (SRE)
Goal: Define logging, metrics, tracing, and alerting strategy.
Constraints: Strategy document; describe what to implement, not how to wire it.

Steps

  1. Read architecture.md and component specs for service boundaries
  2. Research observability best practices for the tech stack

Logging

  • Structured JSON to stdout/stderr (no file logging in containers)
  • Fields: timestamp (ISO 8601), level, service, correlation_id, message, context
  • Levels: ERROR (exceptions), WARN (degraded), INFO (business events), DEBUG (diagnostics, dev only)
  • No PII in logs
  • Retention: dev = console only (nothing retained), staging = 7 days, production = 30 days
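
To make the log format above concrete, here is a minimal sketch using only the Python standard library. It assumes a Python service; `JsonFormatter` and `build_logger` are hypothetical names, and mapping Python's `WARNING` level name to `WARN` is an assumption made to match the levels listed above.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Python names this level WARNING; the strategy above calls it WARN (assumed mapping).
LEVEL_NAMES = {"WARNING": "WARN"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the required fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.isoformat(),  # ISO 8601, UTC
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": self.service,
            # correlation_id and context arrive through `extra=`; both are optional.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        # default=str keeps logging from failing on non-JSON values in context.
        return json.dumps(payload, default=str)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes one JSON line per event to the given stream."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # DEBUG would be enabled in dev only
    logger.propagate = False
    return logger
```

A call such as `build_logger("my-service").info("frame processed", extra={"correlation_id": "abc", "context": {"frame": 42}})` would emit a single JSON line on stdout. Keeping PII out of `message` and `context` remains the caller's responsibility; the formatter does not filter it.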

Metrics

  • Expose Prometheus-compatible /metrics endpoint per service
  • System metrics: CPU, memory, disk, network
  • Application metrics: request_count, request_duration (histogram), error_count, active_connections
  • Business metrics: derived from acceptance criteria
  • Collection interval: 15s
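
As an illustration of what a `/metrics` response could contain, the sketch below renders the application metrics listed above in the Prometheus text exposition format, using only the Python standard library. `ServiceMetrics` and the bucket bounds are hypothetical; a real service would more likely use a Prometheus client library, and Prometheus naming conventions would suggest names such as `request_duration_seconds` rather than the bare names used here.

```python
from bisect import bisect_left
from dataclasses import dataclass, field

# Hypothetical histogram bounds in seconds; tune to the service's latency profile.
BUCKETS = (0.05, 0.1, 0.5, 1.0, 5.0)


@dataclass
class ServiceMetrics:
    request_count: int = 0
    error_count: int = 0
    active_connections: int = 0
    duration_sum: float = 0.0
    # One slot per bound plus a final slot for observations above the last bound.
    bucket_counts: list = field(default_factory=lambda: [0] * (len(BUCKETS) + 1))

    def observe_request(self, seconds: float, error: bool = False) -> None:
        self.request_count += 1
        self.error_count += int(error)
        self.duration_sum += seconds
        # Index of the first bound >= seconds (bounds are inclusive upper limits).
        self.bucket_counts[bisect_left(BUCKETS, seconds)] += 1

    def render(self) -> str:
        """Return the metrics as Prometheus text exposition format."""
        lines = [
            "# TYPE request_count counter",
            f"request_count {self.request_count}",
            "# TYPE error_count counter",
            f"error_count {self.error_count}",
            "# TYPE active_connections gauge",
            f"active_connections {self.active_connections}",
            "# TYPE request_duration histogram",
        ]
        cumulative = 0  # histogram buckets are cumulative in this format
        for bound, count in zip(BUCKETS, self.bucket_counts):
            cumulative += count
            lines.append(f'request_duration_bucket{{le="{bound}"}} {cumulative}')
        lines.append(f'request_duration_bucket{{le="+Inf"}} {self.request_count}')
        lines.append(f"request_duration_sum {self.duration_sum}")
        lines.append(f"request_duration_count {self.request_count}")
        return "\n".join(lines) + "\n"
```

The histogram is what makes percentile panels and latency alerts possible later, since P95 can be estimated from the cumulative buckets without storing individual request durations.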

Distributed Tracing

  • OpenTelemetry SDK integration
  • Trace context propagation via HTTP headers and message queue metadata
  • Span naming: <service>.<operation>
  • Sampling: 100% in dev/staging, 10% in production (adjust based on volume)
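
The propagation, naming, and sampling rules above can be sketched without any dependency, purely to show the mechanics. The header layout follows the W3C Trace Context `traceparent` format; the function names and the `SAMPLE_RATIO` table are hypothetical, and in practice the OpenTelemetry SDK's own propagators and samplers would do this work.

```python
import secrets

# Sampling ratios from the strategy above; production is the one to adjust.
SAMPLE_RATIO = {"dev": 1.0, "staging": 1.0, "production": 0.10}


def span_name(service: str, operation: str) -> str:
    """Span naming convention: <service>.<operation>."""
    return f"{service}.{operation}"


def new_ids() -> tuple[int, int]:
    """Random 128-bit trace ID and 64-bit span ID (all-zero IDs are invalid)."""
    return secrets.randbits(128) or 1, secrets.randbits(64) or 1


def should_sample(trace_id: int, ratio: float) -> bool:
    """Decide from the trace ID so every service agrees for a given trace."""
    low_64_bits = trace_id & ((1 << 64) - 1)
    return low_64_bits < int(ratio * (1 << 64))


def make_traceparent(trace_id: int, span_id: int, sampled: bool) -> str:
    """Build the header value: version-traceid-spanid-flags, all lowercase hex."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id:032x}-{span_id:016x}-{flags}"


def parse_traceparent(header: str) -> tuple[int, int, bool]:
    """Extract (trace_id, parent_span_id, sampled) from an incoming header."""
    version, trace_hex, span_hex, flags = header.split("-")
    if version != "00" or len(trace_hex) != 32 or len(span_hex) != 16:
        raise ValueError(f"unsupported traceparent: {header!r}")
    return int(trace_hex, 16), int(span_hex, 16), bool(int(flags, 16) & 1)
```

The same string can be carried as an HTTP header or as a message-queue metadata field, which is why the sketch treats it as a plain value rather than tying it to one transport.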

Alerting

| Severity | Response Time | Condition Examples |
|----------|---------------|--------------------|
| Critical | 5 min | Service down, data loss, health check failed |
| High | 30 min | Error rate > 5%, P95 latency > 2x baseline |
| Medium | 4 hours | Disk > 80%, elevated latency |
| Low | Next business day | Non-critical warnings |
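
One way to read the severity levels above is as a classification function over a snapshot of service health. The sketch below encodes only the conditions that carry explicit thresholds; `Snapshot`, `classify`, and the field names are hypothetical, and conditions without a stated threshold ("elevated latency", "data loss", "non-critical warnings") are deliberately left out rather than guessed.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

# Response times per severity; None stands for "next business day".
RESPONSE_TIME = {
    "critical": timedelta(minutes=5),
    "high": timedelta(minutes=30),
    "medium": timedelta(hours=4),
    "low": None,
}


@dataclass
class Snapshot:
    healthy: bool            # result of the service health check
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p95_latency_s: float     # current P95 latency in seconds
    baseline_p95_s: float    # agreed baseline P95 latency in seconds
    disk_used: float         # fraction of disk in use, 0.0 to 1.0


def classify(s: Snapshot) -> Optional[str]:
    """Return the highest matching severity, or None if nothing fires."""
    if not s.healthy:
        return "critical"
    if s.error_rate > 0.05 or s.p95_latency_s > 2 * s.baseline_p95_s:
        return "high"
    if s.disk_used > 0.80:
        return "medium"
    return None
```

Checking severities from most to least urgent means a snapshot that matches several conditions is reported once, at the level with the shortest response time.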

Dashboards

  • Operations: service health, request rate, error rate, response time percentiles, resource utilization
  • Business: key business metrics from acceptance criteria

Self-verification

  • Structured logging format defined with required fields
  • Metrics endpoint specified per service
  • OpenTelemetry tracing configured
  • Alert severities with response times defined
  • Dashboards cover operations and business metrics
  • PII exclusion from logs addressed

Save action

Write observability.md using templates/observability.md.