azaion/gps-denied-onboard

mirror of https://github.com/azaion/gps-denied-onboard.git synced 2026-06-22 20:21:13 +00:00

Oleksandr Bezdieniezhnykh 1f634c2604

Update demo replay validation and testing documentation

- Modified the autodev state to reflect the current testing phase and details of the new `jetson-e2e` tests.
- Enhanced the "How to Test" documentation to provide clearer instructions on the demo replay validation process, including video and tlog alignment steps.
- Updated architectural documentation to include the new demo replay operator flow and its dependencies.
- Documented the removal of deprecated auto-sync features and clarified the operator-facing UI for replay validation.
- Added new entries in the dependencies table for upcoming tasks related to the demo replay flow.

These changes improve clarity and usability for operators and developers working with the demo replay system.

2026-06-20 11:24:43 +03:00

3.3 KiB

Observability Template

Save as `_docs/04_deploy/observability.md`.

# [System Name] — Observability

## Logging

### Format

Structured JSON to stdout/stderr. No file-based logging in containers.

```json
{
  "timestamp": "ISO8601",
  "level": "INFO",
  "service": "service-name",
  "correlation_id": "uuid",
  "message": "Event description",
  "context": {}
}
```

### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, failures requiring attention | Database connection failed |
| WARN | Potential issues, degraded performance | Retry attempt 2/3 |
| INFO | Significant business events | User registered, Order placed |
| DEBUG | Detailed diagnostics (dev/staging only) | Request payload, Query params |
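As a minimal sketch of the format and levels above, the following uses only Python's standard `logging` module to write one JSON object per line to stdout. The `JsonFormatter` class and `build_logger` helper are hypothetical names chosen for illustration, and passing `correlation_id` and `context` through `extra` is one possible convention, not something this template prescribes.

```python
import json
import logging
import sys
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the template's six fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # ISO 8601 timestamp in UTC
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            # Python calls this level WARNING; the template calls it WARN
            "level": "WARN" if record.levelname == "WARNING" else record.levelname,
            "service": self.service,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)


def build_logger(service: str, level: int = logging.INFO) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout only (no file handlers)."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger


logger = build_logger("service-name")
logger.warning(
    "Retry attempt 2/3",
    extra={"correlation_id": str(uuid.uuid4()), "context": {"attempt": 2}},
)
```

Passing `logging.DEBUG` as the level only in dev and staging would match the DEBUG row of the table.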

### Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console | Session |
| Staging | [log aggregator] | 7 days |
| Production | [log aggregator] | 30 days |

### PII Rules

- Never log passwords, tokens, or session IDs
- Mask email addresses and personal identifiers
- Log user IDs (opaque) instead of usernames
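The rules above could be enforced with a scrubbing step applied to the log `context` before it is emitted. This is a rough sketch: the key names in `SENSITIVE_KEYS`, the `[REDACTED]` marker, and the masking style (first character plus domain) are all assumptions to adapt, and the email pattern is deliberately simple rather than a complete address grammar.

```python
import re

# Hypothetical key names; extend to match the fields your services actually use.
SENSITIVE_KEYS = {"password", "token", "session_id", "authorization"}

# Captures the first character of the local part and the whole domain.
EMAIL_RE = re.compile(
    r"([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)


def mask_emails(text: str) -> str:
    """Keep the first character and the domain, hide the rest of the local part."""
    return EMAIL_RE.sub(r"\1***@\2", text)


def scrub(value):
    """Return a copy of a log context with secrets redacted and emails masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else scrub(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [scrub(item) for item in value]
    if isinstance(value, str):
        return mask_emails(value)
    return value


safe = scrub({"user_id": "u-123", "password": "hunter2", "note": "contact alice@example.com"})
print(safe)
```

Opaque user IDs such as `u-123` pass through unchanged, which matches the third rule.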

## Metrics

### Endpoints

Every service exposes Prometheus-compatible metrics at `/metrics`.

### Application Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `request_count` | Counter | Total HTTP requests by method, path, status |
| `request_duration_seconds` | Histogram | Response time by method, path |
| `error_count` | Counter | Failed requests by type |
| `active_connections` | Gauge | Current open connections |
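To illustrate what a `/metrics` response for the `request_count` row might look like, here is a standard-library sketch that renders a labeled counter in the Prometheus text exposition format. `LabeledCounter` is a hypothetical class written for this example; a real service would more likely use an existing Prometheus client library, and this sketch does not escape special characters in label values.

```python
from collections import Counter


class LabeledCounter:
    """A tiny in-memory counter keyed by label values (illustration only)."""

    def __init__(self, name: str, help_text: str, label_names: tuple) -> None:
        self.name = name
        self.help_text = help_text
        self.label_names = label_names
        self._values = Counter()

    def inc(self, *label_values: str) -> None:
        """Increment the series identified by one value per label name."""
        if len(label_values) != len(self.label_names):
            raise ValueError("expected one value per label name")
        self._values[label_values] += 1

    def render(self) -> str:
        """Return the HELP/TYPE header followed by one line per label set."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for label_values, count in sorted(self._values.items()):
            labels = ",".join(
                f'{name}="{value}"'
                for name, value in zip(self.label_names, label_values)
            )
            lines.append(f"{self.name}{{{labels}}} {count}")
        return "\n".join(lines) + "\n"


requests = LabeledCounter("request_count", "Total HTTP requests", ("method", "path", "status"))
requests.inc("GET", "/health", "200")
requests.inc("POST", "/orders", "500")
print(requests.render(), end="")
```

Keeping `path` to a route pattern rather than the raw URL is usually advisable, since every distinct label value creates a separate time series.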

### System Metrics

CPU usage, memory usage, disk I/O, network I/O

### Business Metrics

| Metric | Type | Description | Source |
|--------|------|-------------|--------|
| [from acceptance criteria] | | | |

Collection interval: 15 seconds

## Distributed Tracing

### Configuration

- SDK: OpenTelemetry
- Propagation: W3C Trace Context via HTTP headers
- Span naming: `<service>.<operation>`
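In practice the OpenTelemetry SDK handles propagation, but it can help to see what travels in the header. The sketch below parses and emits a version `00` W3C `traceparent` value (`00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`) and builds span names in the `<service>.<operation>` form. `SpanContext`, `parse_traceparent`, `child_of`, and `span_name` are hypothetical helpers for illustration, not OpenTelemetry APIs.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# Only version 00 of the header is handled in this sketch.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 lowercase hex characters
    span_id: str   # 16 lowercase hex characters
    sampled: bool

    def to_traceparent(self) -> str:
        flags = "01" if self.sampled else "00"
        return f"00-{self.trace_id}-{self.span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the incoming context, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_of(parent: SpanContext) -> SpanContext:
    """Same trace and sampling decision, fresh span ID for the outgoing call."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def span_name(service: str, operation: str) -> str:
    return f"{service}.{operation}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
if incoming is not None:
    outgoing_headers = {"traceparent": child_of(incoming).to_traceparent()}
    print(span_name("orders", "create"), outgoing_headers)
```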

### Sampling

| Environment | Rate | Rationale |
|-------------|------|-----------|
| Development | 100% | Full visibility |
| Staging | 100% | Full visibility |
| Production | 10% | Balance cost vs observability |
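One way to apply these rates is a deterministic decision derived from the trace ID, so that every service reaches the same verdict for a given trace without coordination. The sketch below is an assumption-laden illustration of that idea: `SAMPLE_RATES` and `should_sample` are hypothetical names, and OpenTelemetry SDKs ship their own ratio-based samplers that should be preferred in real deployments.

```python
# Rates taken from the sampling table; the environment keys are assumed names.
SAMPLE_RATES = {"development": 1.0, "staging": 1.0, "production": 0.10}


def should_sample(trace_id: str, environment: str) -> bool:
    """Sample a trace when the low 64 bits of its ID fall below the rate bound."""
    rate = SAMPLE_RATES[environment]
    bucket = int(trace_id[-16:], 16)   # lower 64 bits of the 128-bit trace ID
    return bucket < int(rate * 2**64)  # rate 1.0 admits every possible bucket


print(should_sample("4bf92f3577b34da6a3ce929d0e0e4736", "staging"))
```

Because the decision depends only on the trace ID, it stays consistent across hops as long as the ID itself is propagated.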

### Integration Points

- HTTP requests: automatic instrumentation
- Database queries: automatic instrumentation
- Message queues: manual span creation on publish/consume

## Alerting

| Severity | Response Time | Conditions |
|----------|---------------|------------|
| Critical | 5 min | Service unreachable, health check failed for 1 min, data loss detected |
| High | 30 min | Error rate > 5% for 5 min, P95 latency > 2x baseline for 10 min |
| Medium | 4 hours | Disk usage > 80%, elevated latency, connection pool exhaustion |
| Low | Next business day | Non-critical warnings, deprecated API usage |
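Conditions of the form "X > threshold for N min" mean the breach has to be sustained, not momentary. As a hedged sketch of that logic, the function below fires only when samples cover the whole window and every one of them exceeds the threshold. `Sample` and `breached_for` are hypothetical names; the 15-second spacing comes from the collection interval stated in this template, and a real setup would normally express this in the alerting system's own rule language.

```python
from dataclasses import dataclass

SCRAPE_INTERVAL_S = 15  # collection interval stated in this template


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds since epoch
    value: float


def breached_for(samples, threshold: float, duration_s: float, now: float) -> bool:
    """True when the metric stayed above threshold for the full duration."""
    window = [s for s in samples if now - duration_s <= s.timestamp <= now]
    if not window:
        return False
    # Require data near the start of the window so a fresh spike cannot fire.
    if min(s.timestamp for s in window) > now - duration_s + SCRAPE_INTERVAL_S:
        return False
    return all(s.value > threshold for s in window)


# Error rate of 6% at every scrape over five minutes.
error_rate = [Sample(t, 0.06) for t in range(0, 301, SCRAPE_INTERVAL_S)]
print(breached_for(error_rate, threshold=0.05, duration_s=300, now=300))
```

The P95 condition in the High row could reuse the same function with `threshold=2 * baseline` and `duration_s=600`, where `baseline` is whatever reference latency the team has measured.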

### Notification Channels

| Severity | Channel |
|----------|---------|
| Critical | [PagerDuty / phone] |
| High | [Slack + email] |
| Medium | [Slack] |
| Low | [Dashboard only] |

## Dashboards

### Operations Dashboard

- Service health status (up/down per component)
- Request rate and error rate
- Response time percentiles (P50, P95, P99)
- Resource utilization (CPU, memory per container)
- Active alerts

### Business Dashboard

- [Key business metrics from acceptance criteria]
- [User activity indicators]
- [Transaction volumes]
