mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-21 08:51:12 +00:00
Step 5: Observability
Role: Site Reliability Engineer (SRE)
Goal: Define logging, metrics, tracing, and alerting strategy.
Constraints: Strategy document; describe what to implement, not how to wire it.
Steps
- Read `architecture.md` and component specs for service boundaries
- Research observability best practices for the tech stack
Logging
- Structured JSON to stdout/stderr (no file logging in containers)
- Fields: `timestamp` (ISO 8601), `level`, `service`, `correlation_id`, `message`, `context`
- Levels: ERROR (exceptions), WARN (degraded), INFO (business events), DEBUG (diagnostics, dev only)
- No PII in logs
- Retention: dev = console, staging = 7 days, production = 30 days
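As an illustration of the log format above, here is a minimal sketch using only the Python standard library. The names `JsonFormatter`, `build_logger`, and the service name `replay-api` are hypothetical, and mapping Python's `CRITICAL` level onto `ERROR` is an assumption made to stay within the four levels listed.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Python calls the level WARNING; the strategy above names it WARN.
# Folding CRITICAL into ERROR is an assumption of this sketch.
LEVEL_NAMES = {"WARNING": "WARN", "CRITICAL": "ERROR"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.isoformat(),  # ISO 8601, UTC
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": self.service,
            # Both values arrive through the `extra=` argument of the log call.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the stream, never to a file."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)  # DEBUG would be enabled in dev only
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("replay-api")  # hypothetical service name
    log.info(
        "replay started",
        extra={"correlation_id": "req-123", "context": {"frames": 42}},
    )
```

Passing `correlation_id` and `context` through `extra=` keeps call sites short. The caller remains responsible for keeping PII out of `message` and `context`; the formatter does not filter anything.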
Metrics
- Expose Prometheus-compatible `/metrics` endpoint per service
- System metrics: CPU, memory, disk, network
- Application metrics: `request_count`, `request_duration` (histogram), `error_count`, `active_connections`
- Business metrics: derived from acceptance criteria
- Collection interval: 15s
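To make "Prometheus-compatible" concrete, the sketch below renders a counter and a histogram in the Prometheus text exposition format using only the standard library. It is illustrative only: a real service would more likely use a Prometheus client library and serve the result at `/metrics`. The `Counter`, `Histogram`, and `render_metrics` names, the bucket bounds, and the choice of seconds as the duration unit are all assumptions.

```python
from bisect import bisect_left


class Counter:
    """A monotonically increasing value, e.g. request_count."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def render(self) -> list:
        return [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
            f"{self.name} {self.value}",
        ]


class Histogram:
    """Observations counted into upper-bounded buckets, e.g. request_duration."""

    def __init__(self, name: str, help_text: str, buckets: list) -> None:
        self.name = name
        self.help_text = help_text
        self.bounds = sorted(buckets)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound >= value, matching "le" semantics.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value

    def render(self) -> list:
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} histogram",
        ]
        cumulative = 0  # exposition buckets are cumulative
        for bound, count in zip(self.bounds, self.counts):
            cumulative += count
            lines.append(f'{self.name}_bucket{{le="{bound}"}} {cumulative}')
        cumulative += self.counts[-1]
        lines.append(f'{self.name}_bucket{{le="+Inf"}} {cumulative}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {cumulative}")
        return lines


def render_metrics(metrics: list) -> str:
    """Join all metric families into one response body for /metrics."""
    return "\n".join(line for m in metrics for line in m.render()) + "\n"


if __name__ == "__main__":
    requests = Counter("request_count", "Total requests handled")
    duration = Histogram("request_duration", "Request duration in seconds", [0.1, 0.5])
    requests.inc()
    duration.observe(0.3)
    print(render_metrics([requests, duration]), end="")
```

Histogram buckets are cumulative in this format, which is what lets the backend estimate the percentiles used in alerts and dashboards.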
Distributed Tracing
- OpenTelemetry SDK integration
- Trace context propagation via HTTP headers and message queue metadata
- Span naming: `<service>.<operation>`
- Sampling: 100% in dev/staging, 10% in production (adjust based on volume)
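The propagation and sampling rules above can be sketched without any SDK. The example below hand-rolls a W3C Trace Context `traceparent` header and a deterministic ratio-based sampling decision, purely to show the mechanics. In practice the OpenTelemetry SDK would handle both, and its exact sampling algorithm may differ from this one. All function names and the `replay` / `align_video` identifiers are hypothetical.

```python
import re
import secrets

# Sampling ratios per environment, as stated in the strategy above.
SAMPLING = {"dev": 1.0, "staging": 1.0, "production": 0.10}

# traceparent: version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def span_name(service: str, operation: str) -> str:
    """Build a span name following the <service>.<operation> convention."""
    return f"{service}.{operation}"


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide from the trace id alone, so every service reaches the same verdict."""
    bound = int(ratio * (1 << 64))
    low_64_bits = int(trace_id_hex[16:], 16)
    return low_64_bits < bound


def new_traceparent(ratio: float) -> str:
    """Start a new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if should_sample(trace_id, ratio) else "00"  # 01 = sampled
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Continue an incoming trace: keep trace id and flags, mint a new span id."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None or match.group(1) == "0" * 32:
        raise ValueError(f"invalid traceparent: {incoming!r}")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    header = new_traceparent(SAMPLING["production"])
    print(span_name("replay", "align_video"), header)
    # The same value would be forwarded as an HTTP header or queue message attribute.
    print(child_traceparent(header))
```

Deriving the decision from the trace id means a trace is either recorded by every service it touches or by none, which avoids partial traces at 10% sampling.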
Alerting
| Severity | Response Time | Condition Examples |
|---|---|---|
| Critical | 5 min | Service down, data loss, health check failed |
| High | 30 min | Error rate > 5%, P95 latency > 2x baseline |
| Medium | 4 hours | Disk > 80%, elevated latency |
| Low | Next business day | Non-critical warnings |
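The severity table can be read as a small decision procedure. The sketch below encodes only the conditions that have explicit thresholds in the table (service down, error rate > 5%, P95 latency > 2x baseline, disk > 80%). "Elevated latency" is left out because no threshold is given. `ServiceSnapshot`, `classify`, and the field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


# Response times copied from the table above.
RESPONSE_TIME = {
    Severity.CRITICAL: "5 min",
    Severity.HIGH: "30 min",
    Severity.MEDIUM: "4 hours",
    Severity.LOW: "Next business day",
}


@dataclass
class ServiceSnapshot:
    """One evaluation window for a single service (hypothetical shape)."""

    healthy: bool            # result of the health check
    error_rate: float        # fraction of failed requests, 0.05 == 5%
    p95_latency_ms: float
    baseline_p95_ms: float
    disk_used: float         # fraction of disk in use, 0.80 == 80%
    warnings: int = 0        # count of non-critical warnings


def classify(s: ServiceSnapshot) -> Optional[Severity]:
    """Return the highest severity that applies, or None if nothing fires."""
    if not s.healthy:
        return Severity.CRITICAL
    if s.error_rate > 0.05 or s.p95_latency_ms > 2 * s.baseline_p95_ms:
        return Severity.HIGH
    if s.disk_used > 0.80:
        return Severity.MEDIUM
    if s.warnings > 0:
        return Severity.LOW
    return None


if __name__ == "__main__":
    snapshot = ServiceSnapshot(
        healthy=True,
        error_rate=0.08,
        p95_latency_ms=120.0,
        baseline_p95_ms=100.0,
        disk_used=0.40,
    )
    severity = classify(snapshot)
    if severity is not None:
        print(severity.value, "respond within", RESPONSE_TIME[severity])
```

Checking the conditions from most to least severe ensures that a service with several problems is reported once, at the level that sets the shortest response time.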
Dashboards
- Operations: service health, request rate, error rate, response time percentiles, resource utilization
- Business: key business metrics from acceptance criteria
Self-verification
- Structured logging format defined with required fields
- Metrics endpoint specified per service
- OpenTelemetry tracing configured
- Alert severities with response times defined
- Dashboards cover operations and business metrics
- PII exclusion from logs addressed
Save action
Write observability.md using templates/observability.md.