Azaion Admin API — Observability

Date: 2026-05-13 · Cycle: 1 · Status: planning artifact (no code changes; concrete wiring lands in Step 7).

1. Current State (audit)

| Pillar | Today | Gap |
|---|---|---|
| Logging | Serilog 4.1.0 → Console + rolling file logs/log.txt (daily); MinimumLevel Information; FromLogContext enrichment | No structured fields beyond defaults; one unstructured LogInformation($"…") in ResourcesService.SaveResource (security audit F-12); SQL trace bypasses Serilog (Console.WriteLine); no correlation IDs |
| Metrics | none | No /metrics endpoint; no system, app, or business metrics |
| Tracing | none | No OpenTelemetry, no W3C trace context |
| Health checks | none in code; docker-compose.test.yml uses raw TCP probe | No /health endpoint (Drift H from Step 2 + skill self-verification) |
| Alerting | none | No alerts wired to any channel |

This step closes the planning gap; implementation lands incrementally — /health and structured logging in cycle 1 (Step 7), metrics + tracing in a later cycle (Drift K).

2. Logging

2.1 Format

In containers, logs are emitted as structured JSON to stdout/stderr only. The current rolling-file sink is dropped from the production runtime (and the /app/logs bind mount becomes optional) because:

  • Container logs should be collected by the platform, not the app.
  • A bind-mounted file silently fills the host disk when log rotation lags.
  • We currently have no log shipper, so logs already live only in docker logs for ops triage.

The existing console sink stays. The file sink is kept ONLY in Development (gated by ASPNETCORE_ENVIRONMENT).

{
  "timestamp": "2026-05-13T06:48:01.123Z",
  "level": "Information",
  "service": "azaion.admin-api",
  "revision": "a1b2c3d4e5f6",
  "correlation_id": "0HMU7…",
  "user_id": null,
  "message": "User registered",
  "context": {
    "endpoint": "POST /users",
    "duration_ms": 47
  }
}

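This is achieved by adding Serilog.Formatting.Compact.RenderedCompactJsonFormatter to the console sink. That formatter writes compact field names (@t, @l, @m) rather than timestamp, level, and message, so producing the exact field names in the sample above requires a custom formatter or a rename at the collector. Three enrichers supply the remaining fields: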

| Enricher | Source | Purpose |
|---|---|---|
| FromLogContext | already present | scoped properties |
| Serilog.Enrichers.Environment (new) | ENV vars | service, revision (AZAION_REVISION) |
| Serilog.AspNetCore.RequestLoggingOptions (new) | ASP.NET pipeline | request correlation_id from Activity.Current.TraceId (or generated UUID v7 if no Activity) |
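
The sink and enricher setup described above could be wired roughly as follows. This is a minimal sketch, not the project's actual Program.cs: it assumes the Serilog.AspNetCore, Serilog.Formatting.Compact, and Serilog.Sinks.File packages and the web SDK's implicit usings, and it reads AZAION_REVISION directly with Enrich.WithProperty as a stand-in for the Serilog.Enrichers.Environment package. The "/" endpoint is hypothetical and exists only to make the sketch start.

```csharp
// Program.cs (sketch). Assumed packages: Serilog.AspNetCore,
// Serilog.Formatting.Compact, Serilog.Sinks.File.
using Serilog;
using Serilog.Formatting.Compact;

var builder = WebApplication.CreateBuilder(args);

var loggerConfig = new LoggerConfiguration()
    .MinimumLevel.Information()
    .Enrich.FromLogContext()
    // Static fields attached to every event. Reading the variable directly is
    // an assumption standing in for Serilog.Enrichers.Environment.
    .Enrich.WithProperty("service", "azaion.admin-api")
    .Enrich.WithProperty("revision",
        Environment.GetEnvironmentVariable("AZAION_REVISION") ?? "unknown")
    // JSON to stdout in every environment.
    .WriteTo.Console(new RenderedCompactJsonFormatter());

// The rolling file sink is kept only in Development.
if (builder.Environment.IsDevelopment())
{
    loggerConfig = loggerConfig.WriteTo.File(
        "logs/log.txt",
        rollingInterval: RollingInterval.Day,
        retainedFileCountLimit: 7);
}

Log.Logger = loggerConfig.CreateLogger();
builder.Host.UseSerilog();

var app = builder.Build();
app.MapGet("/", () => "ok"); // hypothetical endpoint
app.Run();
```

Gating the file sink on builder.Environment keeps the decision tied to ASPNETCORE_ENVIRONMENT, so the production image needs no separate configuration file to drop it.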

2.2 Log Levels

| Level | Usage | Examples in this codebase |
|---|---|---|
| Error | Unhandled exceptions, infra failures | DB connection failure, sops decrypt failure on host |
| Warning | Business exception caught | Existing BusinessExceptionHandler already does this; keep as-is |
| Information | Significant business events | Login, RegisterUser, RegisterDevice, role change, resource upload, detection-class CRUD |
| Debug | Diagnostic detail | Request/response payloads (dev only, never in production); query parameters |

2.3 Retention

| Environment | Destination | Retention |
|---|---|---|
| Development | console + logs/log.txt (rolling daily) | 7 daily files (set retainedFileCountLimit: 7 explicitly; the Serilog.Sinks.File default is 31) |
| Test (CI) | console (captured by Woodpecker UI) | 14 days (Woodpecker artifact retention) |
| Staging | container stdout → journald on the host | 7 days; journalctl --vacuum-time=7d cron |
| Production | container stdout → journald on the host | 30 days; journalctl --vacuum-time=30d cron |

A central log aggregator (Loki / OpenSearch) is out of scope for cycle 1 — host journald is the entire pipeline. Recorded as Drift L.

2.4 PII Rules

| Rule | Implementation |
|---|---|
| Never log passwords | Covers LoginRequest.Password, RegisterUserRequest.Password, and the response body of POST /devices (plaintext one-shot password). Add a [Serilog.Sensitive]-style helper or a `Destructure.ByTransforming<T>(t => …)` per DTO. (GetResourceRequest.Password was previously listed; the DTO was deleted in cycle 2 with the encrypted-download endpoint.) |
| Never log JWT tokens | Today the only code path that logs anything about /login is BusinessExceptionHandler on failure, and it does not include the response body. Verify in Step 7 that no request-logger middleware logs response bodies. |
| Mask emails | Use last-4 + @domain form for INFO-level logs (***123@example.com); full email allowed at DEBUG only. The BusinessExceptionHandler log line "Caught BusinessException: {Message}" may include emails embedded in messages; this is tightened in Step 7. |
| User IDs | User.Id is an opaque GUID and safe to log; use it instead of email for correlation. |
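
The email-masking rule can be illustrated with a small helper. This is a sketch under stated assumptions: EmailMasker and its keep parameter are hypothetical names, "last-4" is read as the final four characters of the local part, and local parts of four characters or fewer are hidden entirely so that short addresses are not logged whole.

```csharp
using System;

// Hypothetical helper; not part of the existing codebase.
public static class EmailMasker
{
    // Returns "***" + last `keep` characters of the local part + "@domain".
    public static string Mask(string email, int keep = 4)
    {
        if (string.IsNullOrEmpty(email)) return "***";

        int at = email.LastIndexOf('@');
        // No '@', or nothing before it: not an email shape, reveal nothing.
        if (at <= 0) return "***";

        string local = email.Substring(0, at);
        string domain = email.Substring(at); // includes the '@'

        // Assumption: a local part no longer than `keep` is masked completely.
        string tail = local.Length > keep
            ? local.Substring(local.Length - keep)
            : string.Empty;

        return "***" + tail + domain;
    }
}
```

A call site would pass the masked value as the structured property, for example `logger.LogInformation("User registered: {Email}", EmailMasker.Mask(email))`, where logger and email are whatever the calling service already has in scope.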

3. Metrics

3.1 Endpoint

GET /metrics exposes the Prometheus exposition format. Add it via prometheus-net.AspNetCore 8.x (latest stable for the .NET 10 baseline; verify the version against the released NuGet package before wiring).

Exposure boundary: /metrics MUST NOT be reachable from the public CORS allow-list. The Nginx reverse proxy on admin.azaion.com will expose only /login, /users*, /devices, /resources*, /classes*, /health. /metrics and /swagger stay on the internal interface (separate Nginx server block bound to the management VLAN, OR localhost-only listener).

3.2 Metrics

| Metric | Type | Source | Labels |
|---|---|---|---|
| http_requests_total | Counter | ASP.NET request pipeline | method, endpoint, status_code |
| http_request_duration_seconds | Histogram | ASP.NET request pipeline | method, endpoint |
| http_requests_in_progress | Gauge | ASP.NET request pipeline | method |
| db_command_duration_seconds | Histogram | linq2db trace hook | operation (select/insert/update/delete) |
| db_command_failures_total | Counter | linq2db trace hook | operation, sqlstate |
| auth_login_failures_total | Counter | AuthService.ValidateUser exception path | reason (unknown_user, bad_password, disabled) |
| business_exceptions_total | Counter | BusinessExceptionHandler | error_code (the existing ExceptionEnum) |
| resource_upload_bytes_total | Counter | ResourcesService.SaveResource | data_folder |
| resource_upload_failures_total | Counter | same | reason |
| detection_classes_total | Gauge | refresh on CRUD | none |
| users_active_total | Gauge | refresh on CRUD + on a 5-min timer | role |
| Process / runtime | (auto) | prometheus-net.DotNetRuntime | gen0/1/2 GC, JIT, threadpool, etc. |
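
A minimal sketch of how the endpoint and one custom counter from the inventory might be registered with prometheus-net. The AppMetrics holder class and the /login-demo route are hypothetical and exist only to show a labelled increment; in the real service the increment would sit in the AuthService.ValidateUser exception path. The built-in HTTP metrics emitted by UseHttpMetrics use the library's own default names, which may not match the names in the table exactly (the request counter in particular), so treat the table as the target naming rather than what this snippet produces.

```csharp
// Program.cs (sketch). Assumed package: prometheus-net.AspNetCore.
using Prometheus;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.UseRouting();
// Built-in HTTP request metrics (count, duration, in-progress).
// Placed after UseRouting so route information is available for labels.
app.UseHttpMetrics();

// Hypothetical endpoint used only to demonstrate a labelled increment.
app.MapPost("/login-demo", () =>
{
    AppMetrics.LoginFailures.WithLabels("bad_password").Inc();
    return Results.Unauthorized();
});

// Prometheus exposition endpoint at /metrics.
app.MapMetrics();
app.Run();

// Hypothetical holder for application-level metrics.
public static class AppMetrics
{
    public static readonly Counter LoginFailures = Metrics.CreateCounter(
        "auth_login_failures_total",
        "Failed login attempts by reason.",
        new CounterConfiguration { LabelNames = new[] { "reason" } });
}
```

Keeping the label values to the small fixed set listed in the table (unknown_user, bad_password, disabled) bounds the number of time series the counter can create.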

3.3 System Metrics

CPU, RSS, file descriptors, network I/O — collected by node-exporter running on the host as a sibling container. The Admin API itself does NOT export host-level metrics.

3.4 Business Metrics

Mapped to the verified ACs in _docs/02_document/tests/blackbox-tests.md. Cycle-1 cut: users_active_total (AC-01..AC-12 user lifecycle) and detection_classes_total (AZ-513). The previously planned resource_download_bytes_total was dropped in cycle 2 along with ResourcesService.GetEncryptedResource itself; only the upload-side counters remain.

3.5 Collection

| Setting | Value |
|---|---|
| Scrape interval | 15s (set explicitly in the scrape config; the Prometheus built-in default is 1m) |
| Scrape source | node-exporter for host; the Admin API container for app metrics |
| Storage | local Prometheus on the host, retention 14 days (cycle 1 budget) |
| Visualization | local Grafana, single dashboard (§6) |

4. Distributed Tracing

Cycle 1: scaffold only — produce a trace ID per request, propagate via traceparent (W3C), and emit it as the correlation_id field in JSON logs. Do NOT yet ship spans to a collector — there is no Jaeger / Tempo running, and the Admin API has no downstream services to trace into. Tracing pays back its cost when there's a chain to follow; cycle 1 has none.

| Setting | Value |
|---|---|
| SDK | OpenTelemetry.Extensions.Hosting + OpenTelemetry.Instrumentation.AspNetCore |
| Propagation | W3C Trace Context (traceparent); automatic when OpenTelemetry.Instrumentation.AspNetCore is registered |
| Sampling | 100% in dev/staging, 10% in production (deferred; no exporter yet) |
| Span naming | `<service>.<operation>`, with service azaion.admin-api and operation `<HTTP method> <route template>` |
| Exporter | none in cycle 1 (logs only) |

Recorded as Drift M — wire a Tempo / Jaeger exporter once a downstream service exists.
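
The cycle-1 scaffold (a trace ID per request, surfaced as correlation_id in the logs) could look roughly like the middleware below. It is a sketch that assumes the Serilog package for LogContext, a logger configured with Enrich.FromLogContext(), and .NET 9 or later for Guid.CreateVersion7. The "/" endpoint is hypothetical.

```csharp
// Program.cs (sketch). Assumed package: Serilog (for LogContext).
using System.Diagnostics;
using Serilog.Context;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.Use(async (context, next) =>
{
    // ASP.NET Core usually starts an Activity per request and honours an
    // incoming traceparent header; fall back to a UUID v7 when none exists.
    string correlationId = Activity.Current?.TraceId.ToString()
        ?? Guid.CreateVersion7().ToString("N");

    // Every log event written while the request is in flight carries the
    // property, provided the logger uses Enrich.FromLogContext().
    using (LogContext.PushProperty("correlation_id", correlationId))
    {
        await next(context);
    }
});

app.MapGet("/", () => "ok"); // hypothetical endpoint
app.Run();
```

Registering this middleware early in the pipeline means later components, including exception handlers, log under the same correlation_id.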

5. Alerting

| Severity | Response time | Conditions for this service | Channel |
|---|---|---|---|
| Critical | 5 min | up{job="admin-api"} == 0 for 1 min · /health fails for 2 min · business_exceptions_total{error_code="DbFailure"} rate > 1/s for 1 min | Slack #azaion-ops + on-call email (cycle 1; PagerDuty deferred until on-call rotation exists) |
| High | 30 min | Error rate > 5% for 5 min (http_requests_total{status_code=~"5.."}/total) · P95 latency > 2× baseline for 10 min · auth_login_failures_total rate > 10/s for 1 min (possible brute force) | Slack #azaion-ops + email |
| Medium | 4 h | Host disk > 80% · db_command_failures_total rate > 0.1/s for 10 min · process RSS > 80% of container limit | Slack #azaion-ops |
| Low | Next business day | Deprecated package usage from dotnet list package --deprecated | Slack #azaion-eng |

Baseline values (P95) come from the cycle-1 perf report:

  • /login p95 ≈ 33 ms → high-latency alert at p95 > 66 ms for 10 min
  • /users (500 users) p95 ≈ 152 ms → high-latency alert at p95 > 305 ms for 10 min

Alert routing in cycle 1 is inform-only — no PagerDuty escalation, no auto-rollback. The deploy procedure (Step 6) documents the manual rollback path.

6. Dashboards

Operations dashboard (Grafana, single panel set; cycle 1):

  • Service up (admin-api, postgres, nginx) — stat panel
  • HTTP request rate (req/s) by endpoint — time series
  • HTTP error rate (% of 5xx) — time series with the High threshold band overlaid
  • Latency P50 / P95 / P99 by endpoint — time series, P95 baseline reference line
  • DB command rate + failure rate — time series
  • Container CPU / RSS / FDs — time series (from node-exporter)
  • Active alerts — table panel

Business dashboard (cycle 1):

  • users_active_total by role — stat panel + sparkline
  • detection_classes_total — stat panel
  • resource_upload_bytes_total rate (1h window) — time series
  • Login success/failure ratio (24h) — donut

Dashboards stored as code in monitoring/grafana/admin-api.json (introduced in Step 7).

7. Health Checks

Add two Minimal API endpoints under /health:

| Probe | Endpoint | What it checks | Surface |
|---|---|---|---|
| Liveness | GET /health/live | Process is responsive (always 200 unless the process is wedged) | Used by Docker HEALTHCHECK |
| Readiness | GET /health/ready | DB reader connection + DB admin connection (one-shot SELECT 1 each, 2s timeout) | Used by Nginx upstream check + the deploy script (Step 6) post-deploy gate |

Endpoints are anonymous (no JWT) but bound only to the management VLAN (or localhost listener) — same exposure rule as /metrics.

Failure mode: if the DB is unreachable for 30 s, /health/ready returns 503; Nginx pulls the upstream, returning 503 to clients (no silent traffic loss). The container itself stays running so a transient DB blip does not trigger Docker restart.
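
The two probes can be sketched with the health-check support built into ASP.NET Core. The check named "db" below is a placeholder: it only simulates a round trip, where the real check would run SELECT 1 against the reader and admin connections. The "ready" tag is likewise an assumed convention, not an existing one.

```csharp
// Program.cs (sketch). Uses only APIs shipped with ASP.NET Core.
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHealthChecks()
    // Placeholder readiness probe with the 2 s timeout from the table.
    .AddAsyncCheck("db", async ct =>
    {
        await Task.Delay(10, ct); // stands in for the SELECT 1 round trip
        return HealthCheckResult.Healthy();
    }, tags: new[] { "ready" }, timeout: TimeSpan.FromSeconds(2));

var app = builder.Build();

// Liveness: run no checks, so a responsive process always answers 200.
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false
});

// Readiness: run only checks tagged "ready"; an Unhealthy result maps to
// HTTP 503 by default.
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});

app.Run();
```

Separating the predicates is what keeps a database outage from failing the liveness probe, so Docker does not restart the container while readiness reports 503.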

8. Drifts Logged Here

| ID | Severity | Description | Resolved In |
|---|---|---|---|
| H | Low | docker-compose.test.yml health check is TCP-only; upgrade to /health/live once available | Step 7 |
| K | Medium | (NEW) Metrics + tracing not implemented in cycle 1; only the plan + /health ship | Future cycle |
| L | Low | (NEW) No central log aggregator; journald only | Future cycle |
| M | Low | (NEW) Tracing has no exporter (cycle 1 = trace IDs in logs only) | Future cycle when downstream services exist |

9. Self-verification

  • Structured JSON logging format defined with timestamp, level, service, correlation_id, message, context.
  • Metrics endpoint specified (/metrics, internal-only) with full app/system/business metric inventory.
  • OpenTelemetry tracing configured at the SDK level (cycle 1) with future exporter wiring (Drift M).
  • Alert severities with response times and channels defined; baselines tied to perf report numbers.
  • Dashboards defined for operations and business metrics.
  • PII exclusion rules cover passwords, JWTs, and email masking; refers to specific DTO field names.