azaion/gps-denied-onboard

mirror of https://github.com/azaion/gps-denied-onboard.git synced 2026-06-21 08:51:12 +00:00

Oleksandr Bezdieniezhnykh 1f634c2604

Update demo replay validation and testing documentation

- Modified the autodev state to reflect the current testing phase and details of the new `jetson-e2e` tests.
- Enhanced the "How to Test" documentation to provide clearer instructions on the demo replay validation process, including video and tlog alignment steps.
- Updated architectural documentation to include the new demo replay operator flow and its dependencies.
- Documented the removal of deprecated auto-sync features and clarified the operator-facing UI for replay validation.
- Added new entries in the dependencies table for upcoming tasks related to the demo replay flow.

These changes improve clarity and usability for operators and developers working with the demo replay system.

2026-06-20 11:24:43 +03:00

Step 5: Observability

Role: Site Reliability Engineer (SRE)
Goal: Define logging, metrics, tracing, and alerting strategy.
Constraints: This is a strategy document. Describe what to implement, not how to wire it.

Steps

Read architecture.md and component specs for service boundaries
Research observability best practices for the tech stack

Logging

Structured JSON to stdout/stderr (no file logging in containers)
Fields: timestamp (ISO 8601), level, service, correlation_id, message, context
Levels: ERROR (exceptions), WARN (degraded), INFO (business events), DEBUG (diagnostics, dev only)
No PII in logs
Retention: dev = console only (not retained), staging = 7 days, production = 30 days
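
To make the field list above concrete, here is a minimal sketch of a structured JSON logger built on Python's standard `logging` module. The names `JsonFormatter` and `build_logger` and the service name in the usage note are hypothetical, chosen only for illustration. The project's actual tech stack may call for a different library.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the required fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # hypothetical service name, fixed per process

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # ISO 8601 timestamp in UTC
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            # Python calls this level "WARNING"; the strategy names it WARN.
            "level": "WARN" if record.levelname == "WARNING" else record.levelname,
            "service": self.service,
            # Both values arrive through logging's `extra=` mechanism.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream (stdout by default)."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]      # no file handlers inside containers
    logger.setLevel(logging.INFO)    # DEBUG would be enabled in dev only
    logger.propagate = False
    return logger
```

A call such as `build_logger("example-service").info("request handled", extra={"correlation_id": "abc-123", "context": {"route": "/health"}})` would emit one JSON line to stdout. Keeping PII out of `message` and `context` remains the caller's responsibility in this sketch.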

Metrics

Expose Prometheus-compatible /metrics endpoint per service
System metrics: CPU, memory, disk, network
Application metrics: request_count, request_duration (histogram), error_count, active_connections
Business metrics: derived from acceptance criteria
Collection interval: 15s
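
As an illustration of what a scrape of `/metrics` might return for the application metrics listed above, the sketch below renders them in the Prometheus text exposition format using only the standard library. The `Metrics` class, the `route` label, and the bucket bounds are assumptions made for this example. A real service would more likely use a Prometheus client library for its language, and serving the text over HTTP is left out.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket upper bounds in seconds; tune to the service's latency profile.
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5)


class Metrics:
    def __init__(self) -> None:
        self.request_count = defaultdict(int)  # keyed by route
        self.error_count = defaultdict(int)    # keyed by route
        self.active_connections = 0
        self.bucket_hits = [0] * (len(BUCKETS) + 1)  # last slot is the +Inf bucket
        self.duration_sum = 0.0

    def observe_request(self, route: str, seconds: float, failed: bool = False) -> None:
        self.request_count[route] += 1
        if failed:
            self.error_count[route] += 1
        # bisect_left returns the first bucket whose upper bound is >= seconds.
        self.bucket_hits[bisect_left(BUCKETS, seconds)] += 1
        self.duration_sum += seconds

    def render(self) -> str:
        """Return the current values in Prometheus text exposition format."""
        lines = ["# TYPE request_count counter"]
        for route, n in sorted(self.request_count.items()):
            lines.append(f'request_count{{route="{route}"}} {n}')
        lines.append("# TYPE error_count counter")
        for route, n in sorted(self.error_count.items()):
            lines.append(f'error_count{{route="{route}"}} {n}')
        lines.append("# TYPE active_connections gauge")
        lines.append(f"active_connections {self.active_connections}")
        lines.append("# TYPE request_duration histogram")
        cumulative = 0  # histogram buckets are cumulative in this format
        for bound, hits in zip(BUCKETS, self.bucket_hits):
            cumulative += hits
            lines.append(f'request_duration_bucket{{le="{bound}"}} {cumulative}')
        cumulative += self.bucket_hits[-1]
        lines.append(f'request_duration_bucket{{le="+Inf"}} {cumulative}')
        lines.append(f"request_duration_sum {self.duration_sum}")
        lines.append(f"request_duration_count {cumulative}")
        return "\n".join(lines) + "\n"
```

Percentiles such as P95 are not stored by the service in this design. They would be computed at query time from the cumulative buckets, which is why the bucket bounds should bracket the latencies the alerts care about.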

Distributed Tracing

OpenTelemetry SDK integration
Trace context propagation via HTTP headers and message queue metadata
Span naming: <service>.<operation>
Sampling: 100% in dev/staging, 10% in production (adjust based on volume)
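
The propagation and sampling items above can be sketched without the SDK to show what actually travels between services. This example assumes the W3C Trace Context `traceparent` header (version `00`), which is the format OpenTelemetry propagates by default. The function names are hypothetical and this is not the OpenTelemetry API, only an illustration of the idea.

```python
import re
import secrets

# traceparent layout: version - trace-id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def should_sample(trace_id: str, ratio: float) -> bool:
    """Decide from the trace ID itself, so every service agrees for a given trace."""
    low_64_bits = int(trace_id[16:], 16)
    return low_64_bits < int(ratio * (1 << 64))


def new_traceparent(ratio: float) -> str:
    """Start a new trace at the edge of the system and record the sampling decision."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if should_sample(trace_id, ratio) else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Continue an incoming trace: keep trace ID and flags, mint a new span ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        raise ValueError("malformed traceparent header")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def span_name(service: str, operation: str) -> str:
    """Apply the <service>.<operation> naming convention."""
    return f"{service}.{operation}"
```

With this shape, `new_traceparent(1.0)` corresponds to the dev/staging setting and `new_traceparent(0.1)` to production. The same header value can be copied into message queue metadata so that consumers continue the trace instead of starting a new one.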

Alerting

Severity	Response Time	Condition Examples
Critical	5 min	Service down, data loss, health check failed
High	30 min	Error rate > 5%, P95 latency > 2x baseline
Medium	4 hours	Disk > 80%, elevated latency
Low	Next business day	Non-critical warnings
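
One way to read the table is as an ordered set of rules where the most severe matching condition wins. The sketch below encodes the example conditions under that assumption. `ServiceSnapshot` and its fields are hypothetical, and in practice these rules would live in the alerting system's own rule format, not in application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceSnapshot:
    """Hypothetical point-in-time readings for one service."""
    health_check_ok: bool
    error_rate: float              # fraction of failed requests, 0.0 to 1.0
    p95_latency_s: float           # current P95 latency in seconds
    baseline_p95_latency_s: float  # agreed baseline P95 in seconds
    disk_used_fraction: float      # 0.0 to 1.0


# Response-time targets from the table, in minutes (None = next business day).
RESPONSE_MINUTES = {"Critical": 5, "High": 30, "Medium": 240, "Low": None}


def classify(s: ServiceSnapshot) -> Optional[str]:
    """Return the highest severity whose example condition matches, or None."""
    if not s.health_check_ok:
        return "Critical"
    if s.error_rate > 0.05 or s.p95_latency_s > 2 * s.baseline_p95_latency_s:
        return "High"
    if s.disk_used_fraction > 0.80:
        return "Medium"
    return None
```

Checking the rules from most to least severe means a service that is both down and low on disk pages as Critical only, which avoids duplicate notifications for one incident.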

Dashboards

Operations: service health, request rate, error rate, response time percentiles, resource utilization
Business: key business metrics from acceptance criteria

Self-verification

Structured logging format defined with required fields
Metrics endpoint specified per service
OpenTelemetry tracing configured
Alert severities with response times defined
Dashboards cover operations and business metrics
PII exclusion from logs addressed

Save action

Write observability.md using templates/observability.md.
