# Observability Template

Save as `_docs/04_deploy/observability.md`.

---

````markdown
# [System Name] — Observability

## Logging

### Format

Structured JSON to stdout/stderr. No file-based logging in containers.

```json
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
```

### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostics (dev/staging only) | Request payload, Query params |

### Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console | Session |
| Staging | [log aggregator] | 7 days |
| Production | [log aggregator] | 30 days |

### PII Rules

- Never log passwords, tokens, or session IDs
- Mask email addresses and personal identifiers
- Log user IDs (opaque) instead of usernames

## Metrics

### Endpoints

Every service exposes Prometheus-compatible metrics at `/metrics`.

### Application Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `request_count` | Counter | Total HTTP requests by method, path, status |
| `request_duration_seconds` | Histogram | Response time by method, path |
| `error_count` | Counter | Failed requests by type |
| `active_connections` | Gauge | Current open connections |

### System Metrics

- CPU usage, Memory usage, Disk I/O, Network I/O

### Business Metrics

| Metric | Type | Description | Source |
|--------|------|-------------|--------|
| [from acceptance criteria] | | | |

Collection interval: 15 seconds

## Distributed Tracing

### Configuration

- SDK: OpenTelemetry
- Propagation: W3C Trace Context via HTTP headers
- Span naming: `<service>.<operation>`

### Sampling

| Environment | Rate | Rationale |
|-------------|------|-----------|
| Development | 100% | Full visibility |
| Staging | 100% | Full visibility |
| Production | 10% | Balance cost vs observability |

### Integration Points

- HTTP requests: automatic instrumentation
- Database queries: automatic instrumentation
- Message queues: manual span creation on publish/consume

## Alerting

| Severity | Response Time | Conditions |
|----------|---------------|-----------|
| Critical | 5 min | Service unreachable, health check failed for 1 min, data loss detected |
| High | 30 min | Error rate > 5% for 5 min, P95 latency > 2x baseline for 10 min |
| Medium | 4 hours | Disk usage > 80%, elevated latency, connection pool exhaustion |
| Low | Next business day | Non-critical warnings, deprecated API usage |

### Notification Channels

| Severity | Channel |
|----------|---------|
| Critical | [PagerDuty / phone] |
| High | [Slack + email] |
| Medium | [Slack] |
| Low | [Dashboard only] |

## Dashboards

### Operations Dashboard

- Service health status (up/down per component)
- Request rate and error rate
- Response time percentiles (P50, P95, P99)
- Resource utilization (CPU, memory per container)
- Active alerts

### Business Dashboard

- [Key business metrics from acceptance criteria]
- [User activity indicators]
- [Transaction volumes]
````
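
The log format and PII rules in the template can be illustrated with a small sketch. The following is one possible way to emit the JSON fields from the Format section using only the Python standard library; it is not part of the template itself. The names `JsonFormatter`, `build_logger`, and `mask_emails` are hypothetical, the email regex is a deliberately simple assumption, and mapping Python's `WARNING` level to `WARN` is a choice made here to match the Log Levels table.

```python
import json
import logging
import re
import uuid
from datetime import datetime, timezone

# Assumption: a simple pattern that catches common email shapes, not every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[email redacted]", text)


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter producing the fields listed in the Format section."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        # Python names the level WARNING; the template's table uses WARN.
        level = "WARN" if record.levelname == "WARNING" else record.levelname
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": level,
            "service": self.service,
            # correlation_id and context are passed by the caller via `extra=`.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": mask_emails(record.getMessage()),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry, default=str)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Return a logger writing one JSON object per line (stderr when stream is None)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("service-name")
    log.info(
        "User registered: alice@example.com",
        extra={"correlation_id": str(uuid.uuid4()), "context": {"user_id": "u_123"}},
    )
```

Only the message text is masked in this sketch; values placed in `context` are emitted as given, so callers would still need to pass opaque user IDs rather than usernames or emails there.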