mirror of
https://github.com/azaion/gps-denied-onboard.git
synced 2026-06-22 20:21:13 +00:00
1f634c2604
ci/woodpecker/push/02-build-push Pipeline failed
- Modified the autodev state to reflect the current testing phase and details of the new `jetson-e2e` tests. - Enhanced the "How to Test" documentation to provide clearer instructions on the demo replay validation process, including video and tlog alignment steps. - Updated architectural documentation to include the new demo replay operator flow and its dependencies. - Documented the removal of deprecated auto-sync features and clarified the operator-facing UI for replay validation. - Added new entries in the dependencies table for upcoming tasks related to the demo replay flow. These changes improve clarity and usability for operators and developers working with the demo replay system.
3.3 KiB
3.3 KiB
Observability Template
Save as _docs/04_deploy/observability.md.
# [System Name] — Observability
## Logging
### Format
Structured JSON to stdout/stderr. No file-based logging in containers.
```json
{
"timestamp": "ISO8601",
"level": "INFO",
"service": "service-name",
"correlation_id": "uuid",
"message": "Event description",
"context": {}
}
Log Levels
| Level | Usage | Example |
|---|---|---|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostics (dev/staging only) | Request payload, Query params |
Retention
| Environment | Destination | Retention |
|---|---|---|
| Development | Console | Session |
| Staging | [log aggregator] | 7 days |
| Production | [log aggregator] | 30 days |
PII Rules
- Never log passwords, tokens, or session IDs
- Mask email addresses and personal identifiers
- Log user IDs (opaque) instead of usernames
Metrics
Endpoints
Every service exposes Prometheus-compatible metrics at /metrics.
Application Metrics
| Metric | Type | Description |
|---|---|---|
request_count |
Counter | Total HTTP requests by method, path, status |
request_duration_seconds |
Histogram | Response time by method, path |
error_count |
Counter | Failed requests by type |
active_connections |
Gauge | Current open connections |
System Metrics
- CPU usage, Memory usage, Disk I/O, Network I/O
Business Metrics
| Metric | Type | Description | Source |
|---|---|---|---|
| [from acceptance criteria] |
Collection interval: 15 seconds
Distributed Tracing
Configuration
- SDK: OpenTelemetry
- Propagation: W3C Trace Context via HTTP headers
- Span naming:
<service>.<operation>
Sampling
| Environment | Rate | Rationale |
|---|---|---|
| Development | 100% | Full visibility |
| Staging | 100% | Full visibility |
| Production | 10% | Balance cost vs observability |
Integration Points
- HTTP requests: automatic instrumentation
- Database queries: automatic instrumentation
- Message queues: manual span creation on publish/consume
Alerting
| Severity | Response Time | Conditions |
|---|---|---|
| Critical | 5 min | Service unreachable, health check failed for 1 min, data loss detected |
| High | 30 min | Error rate > 5% for 5 min, P95 latency > 2x baseline for 10 min |
| Medium | 4 hours | Disk usage > 80%, elevated latency, connection pool exhaustion |
| Low | Next business day | Non-critical warnings, deprecated API usage |
Notification Channels
| Severity | Channel |
|---|---|
| Critical | [PagerDuty / phone] |
| High | [Slack + email] |
| Medium | [Slack] |
| Low | [Dashboard only] |
Dashboards
Operations Dashboard
- Service health status (up/down per component)
- Request rate and error rate
- Response time percentiles (P50, P95, P99)
- Resource utilization (CPU, memory per container)
- Active alerts
Business Dashboard
- [Key business metrics from acceptance criteria]
- [User activity indicators]
- [Transaction volumes]