# Azaion Admin API: Observability

**Date**: 2026-05-13 · **Cycle**: 1 · **Status**: planning artifact (no code changes; concrete wiring lands in Step 7).

## 1. Current State (audit)

| Pillar | Today | Gap |
|--------|-------|-----|
| Logging | Serilog 4.1.0 → Console + rolling file `logs/log.txt` (daily); MinimumLevel `Information`; FromLogContext enrichment | No structured fields beyond defaults; one unstructured `LogInformation($"…")` in `ResourcesService.SaveResource` (security audit F-12); SQL trace bypasses Serilog (`Console.WriteLine`); no correlation IDs |
| Metrics | none | No `/metrics` endpoint; no system, app, or business metrics |
| Tracing | none | No OpenTelemetry, no W3C trace context |
| Health checks | none in code; `docker-compose.test.yml` uses raw TCP probe | No `/health` endpoint (Drift H from Step 2 + skill self-verification) |
| Alerting | none | No alerts wired to any channel |

This step closes the planning gap; implementation lands incrementally: `/health` and structured logging in cycle 1 (Step 7), metrics + tracing in a later cycle (Drift K).

## 2. Logging

### 2.1 Format

Structured JSON to **stdout/stderr only** in containers. The current rolling-file sink is **dropped from the production runtime** (and the `/app/logs` bind mount becomes optional) because:

- Container logs should be collected by the platform, not the app.
- A bind-mounted file silently fills the host disk when log rotation lags.
- We currently have no log shipper, so logs already live only in `docker logs` for ops triage.

The existing console sink stays. The file sink is kept ONLY in `Development` (gated by `ASPNETCORE_ENVIRONMENT`).
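The environment gate described above could be wired roughly as follows. This is a minimal sketch, not the Step 7 implementation: it assumes a top-level-statement `Program.cs` with the `Serilog.AspNetCore` package referenced, and the retention value of 7 is taken from §2.3.

```csharp
using Serilog;
using Serilog.Formatting.Compact;

var builder = WebApplication.CreateBuilder(args);

builder.Host.UseSerilog((context, cfg) =>
{
    cfg
        .MinimumLevel.Information()
        .Enrich.FromLogContext()
        // JSON to stdout in every environment; the platform collects it.
        .WriteTo.Console(new RenderedCompactJsonFormatter());

    // Rolling file sink only when ASPNETCORE_ENVIRONMENT=Development.
    if (context.HostingEnvironment.IsDevelopment())
    {
        cfg.WriteTo.File(
            path: "logs/log.txt",
            rollingInterval: RollingInterval.Day,
            retainedFileCountLimit: 7);
    }
});

var app = builder.Build();
app.Run();
```

Keeping the gate in code rather than in a per-environment `appsettings` file means a production container cannot re-enable the file sink by accident through configuration alone.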
Target shape of a log event:

```json
{
  "timestamp": "2026-05-13T06:48:01.123Z",
  "level": "Information",
  "service": "azaion.admin-api",
  "revision": "a1b2c3d4e5f6",
  "correlation_id": "0HMU7…",
  "user_id": null,
  "message": "User registered",
  "context": { "endpoint": "POST /users", "duration_ms": 47 }
}
```

Achieved by adding `Serilog.Formatting.Compact.RenderedCompactJsonFormatter` to the console sink and three enrichers. Note that this formatter emits CLEF keys (`@t`, `@m`, `@l`), so producing the exact field names above requires a custom formatter or a rename at the collector.

| Enricher | Source | Purpose |
|----------|--------|---------|
| `FromLogContext` | already present | scoped properties |
| `Serilog.Enrichers.Environment` (new) | `ENV` vars | `service`, `revision` (`AZAION_REVISION`) |
| `Serilog.AspNetCore.RequestLoggingOptions` (new) | ASP.NET pipeline | request `correlation_id` from `Activity.Current.TraceId` (or generated UUID v7 if no Activity) |

### 2.2 Log Levels

| Level | Usage | Examples in this codebase |
|-------|-------|---------------------------|
| `Error` | Unhandled exceptions, infra failures | DB connection failure, sops decrypt failure on host |
| `Warning` | Business exception caught | Existing `BusinessExceptionHandler` already does this; keep as-is |
| `Information` | Significant business events | Login, RegisterUser, RegisterDevice, role change, resource upload, detection-class CRUD |
| `Debug` | Diagnostic detail | Request/response payloads (dev only, never in production); query parameters |

### 2.3 Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | console + `logs/log.txt` (rolling daily) | 7 daily files (set `retainedFileCountLimit: 7` explicitly; the `Serilog.Sinks.File` default is 31) |
| Test (CI) | console (captured by Woodpecker UI) | 14 days (Woodpecker artifact retention) |
| Staging | container stdout → `journald` on the host | 7 days; `journalctl --vacuum-time=7d` cron |
| Production | container stdout → `journald` on the host | 30 days; `journalctl --vacuum-time=30d` cron |

> A central log aggregator (Loki / OpenSearch) is **out of scope for cycle 1**; host `journald` is the entire pipeline. Recorded as **Drift L**.

### 2.4 PII Rules

| Rule | Implementation |
|------|----------------|
| Never log passwords | Applies to `LoginRequest.Password`, `RegisterUserRequest.Password`, `GetResourceRequest.Password`, and the response body of `POST /devices` (plaintext one-shot password). Add a `[Serilog.Sensitive]`-style helper or a `Destructure.ByTransforming(t => …)` per DTO. |
| Never log JWT tokens | The `/login` response is logged today only by `BusinessExceptionHandler` on failure, and that log line does not include the body. Verify in Step 7 that no request-logger middleware logs response bodies. |
| Mask emails | Use last-4 + `@domain` form for INFO-level logs (`***1234@example.com`); full email allowed at DEBUG only. The `BusinessExceptionHandler` log line `"Caught BusinessException: {Message}"` may include emails embedded in messages; this is tightened in Step 7. |
| User IDs | `User.Id` is an opaque GUID and safe to log; use it instead of email in correlation. |

## 3. Metrics

### 3.1 Endpoint

`GET /metrics` exposing Prometheus exposition format. Add via `prometheus-net.AspNetCore` 8.x (latest stable for .NET 10 baseline; verify the version against the released NuGet package before wiring).

> Exposure boundary: `/metrics` MUST NOT be reachable from the public CORS allow-list. The Nginx reverse proxy on `admin.azaion.com` will expose only `/login`, `/users*`, `/devices`, `/resources*`, `/classes*`, `/health`. `/metrics` and `/swagger` stay on the internal interface (separate Nginx server block bound to the management VLAN, OR `localhost`-only listener).
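A minimal sketch of the endpoint wiring with `prometheus-net.AspNetCore`, under stated assumptions: port `9091` is a hypothetical management port (Kestrel would have to listen on it, for example via `ASPNETCORE_URLS`), and the library's default HTTP metric names may differ from the inventory in §3.2 (for example `http_requests_received_total`), so renaming or relabelling would still be needed.

```csharp
using Prometheus;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.UseRouting();

// Collects request count, duration and in-progress metrics from the pipeline.
app.UseHttpMetrics();

// Serve /metrics only for requests addressed to the (hypothetical) management port.
// Host matching relies on the Host header, so it is defence in depth only;
// the Nginx / VLAN rule above remains the actual exposure boundary.
app.MapMetrics().RequireHost("*:9091");

app.Run();
```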
### 3.2 Metrics

| Metric | Type | Source | Labels |
|--------|------|--------|--------|
| `http_requests_total` | Counter | ASP.NET request pipeline | `method`, `endpoint`, `status_code` |
| `http_request_duration_seconds` | Histogram | ASP.NET request pipeline | `method`, `endpoint` |
| `http_requests_in_progress` | Gauge | ASP.NET request pipeline | `method` |
| `db_command_duration_seconds` | Histogram | linq2db trace hook | `operation` (`select`/`insert`/`update`/`delete`) |
| `db_command_failures_total` | Counter | linq2db trace hook | `operation`, `sqlstate` |
| `auth_login_failures_total` | Counter | `AuthService.ValidateUser` exception path | `reason` (`unknown_user`, `bad_password`, `disabled`) |
| `business_exceptions_total` | Counter | `BusinessExceptionHandler` | `error_code` (the existing `ExceptionEnum`) |
| `resource_upload_bytes_total` | Counter | `ResourcesService.SaveResource` | `data_folder` |
| `resource_upload_failures_total` | Counter | same | `reason` |
| `resource_download_bytes_total` | Counter | `ResourcesService.GetEncryptedResource` | `data_folder` |
| `detection_classes_total` | Gauge | refresh on CRUD | none |
| `users_active_total` | Gauge | refresh on CRUD + on a 5-min timer | `role` |
| Process / runtime | (auto) | `prometheus-net.DotNetRuntime` | gen0/1/2 GC, JIT, threadpool, etc. |

### 3.3 System Metrics

CPU, RSS, file descriptors, network I/O: collected by **node-exporter** running on the host as a sibling container. The Admin API itself does NOT export host-level metrics.

### 3.4 Business Metrics

Mapped to the verified ACs in `_docs/02_document/tests/blackbox-tests.md`. Cycle-1 cut: `users_active_total` (AC-01..AC-12 user lifecycle) and `detection_classes_total` (AZ-513). Resource-related business metrics are deferred until the resource flow is exercised by real users post-AZ-197.
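The two cycle-1 business gauges could be declared with prometheus-net roughly as below. This is a sketch only: the `BusinessMetrics` class and its method names are hypothetical, and the callers (the CRUD paths and the 5-minute refresh timer) are not shown.

```csharp
using System.Collections.Generic;
using Prometheus;

// Hypothetical holder for the cycle-1 business gauges.
public static class BusinessMetrics
{
    private static readonly Gauge UsersActive = Metrics.CreateGauge(
        "users_active_total",
        "Number of active users, by role.",
        new GaugeConfiguration { LabelNames = new[] { "role" } });

    private static readonly Gauge DetectionClasses = Metrics.CreateGauge(
        "detection_classes_total",
        "Number of detection classes.");

    // Called after user CRUD and from the periodic refresh.
    public static void SetUsersActive(IReadOnlyDictionary<string, int> countsByRole)
    {
        foreach (var (role, count) in countsByRole)
        {
            UsersActive.WithLabels(role).Set(count);
        }
    }

    // Called after detection-class CRUD.
    public static void SetDetectionClasses(int count) => DetectionClasses.Set(count);
}
```

Setting absolute values from a fresh count, rather than incrementing and decrementing, keeps the gauges correct after a restart or a missed event.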
### 3.5 Collection

| Setting | Value |
|---------|-------|
| Scrape interval | 15s (set explicitly in `scrape_interval`; Prometheus's built-in default is 1m) |
| Scrape source | `node-exporter` for host; the Admin API container for app metrics |
| Storage | local Prometheus on the host, retention 14 days (cycle 1 budget) |
| Visualization | local Grafana, single dashboard (§6) |

## 4. Distributed Tracing

**Cycle 1**: scaffold only. Produce a trace ID per request, propagate it via `traceparent` (W3C), and emit it as the `correlation_id` field in JSON logs. **Do NOT yet** ship spans to a collector: there is no Jaeger / Tempo running, and the Admin API has no downstream services to trace into. Tracing pays back its cost when there's a chain to follow; cycle 1 has none.

| Setting | Value |
|---------|-------|
| SDK | `OpenTelemetry.Extensions.Hosting` + `OpenTelemetry.Instrumentation.AspNetCore` |
| Propagation | W3C Trace Context (`traceparent`); automatic when `OpenTelemetry.Instrumentation.AspNetCore` is registered |
| Sampling | 100% in dev/staging, 10% in production (deferred: no exporter yet) |
| Span naming | `<service>.<operation>`: service `azaion.admin-api`, operation ` ` |
| Exporter | none in cycle 1 (logs only) |

> Recorded as **Drift M**: wire a Tempo / Jaeger exporter once a downstream service exists.

## 5. Alerting

| Severity | Response time | Conditions for this service | Channel |
|----------|---------------|------------------------------|---------|
| Critical | 5 min | `up{job="admin-api"} == 0` for 1 min · `/health` fails for 2 min · `business_exceptions_total{error_code="DbFailure"}` rate > 1/s for 1 min | Slack `#azaion-ops` + on-call email (cycle 1; PagerDuty deferred until on-call rotation exists) |
| High | 30 min | Error rate > 5% for 5 min (`http_requests_total{status_code=~"5.."}/total`) · P95 latency > 2× baseline for 10 min · `auth_login_failures_total` rate > 10/s for 1 min (possible brute force) | Slack `#azaion-ops` + email |
| Medium | 4 h | Host disk > 80% · `db_command_failures_total` rate > 0.1/s for 10 min · process RSS > 80% of container limit | Slack `#azaion-ops` |
| Low | Next business day | Deprecated package usage from `dotnet list package --deprecated` | Slack `#azaion-eng` |

Baseline values (P95) come from the cycle-1 perf report:

- `/login` p95 ≈ 33 ms → high-latency alert at p95 > 66 ms for 10 min
- `/users` (500 users) p95 ≈ 152 ms → high-latency alert at p95 > 305 ms for 10 min

Alert routing in cycle 1 is **inform-only**: no PagerDuty escalation, no auto-rollback. The deploy procedure (Step 6) documents the manual rollback path.

## 6. Dashboards

**Operations dashboard** (Grafana, single panel set; cycle 1):

- Service `up` (admin-api, postgres, nginx): stat panel
- HTTP request rate (req/s) by endpoint: time series
- HTTP error rate (% of 5xx): time series with the High threshold band overlaid
- Latency P50 / P95 / P99 by endpoint: time series, P95 baseline reference line
- DB command rate + failure rate: time series
- Container CPU / RSS / FDs: time series (from node-exporter)
- Active alerts: table panel

**Business dashboard** (cycle 1):

- `users_active_total` by role: stat panel + sparkline
- `detection_classes_total`: stat panel
- `resource_upload_bytes_total` rate (1h window): time series
- Login success/failure ratio (24h): donut

Dashboards are stored as code in `monitoring/grafana/admin-api.json` (introduced in Step 7).

## 7. Health Checks

Add two Minimal API endpoints under `/health`:

| Probe | Endpoint | What it checks | Surface |
|-------|----------|----------------|---------|
| Liveness | `GET /health/live` | Process is responsive (always 200 unless the process is wedged) | Used by Docker `HEALTHCHECK` |
| Readiness | `GET /health/ready` | DB reader connection + DB admin connection (one-shot `SELECT 1` each, 2s timeout) | Used by Nginx upstream check + the deploy script (Step 6) post-deploy gate |

Endpoints are anonymous (no JWT) but bound only to the management VLAN (or `localhost` listener), the same exposure rule as `/metrics`.

> Failure mode: if the DB is unreachable for 30 s, `/health/ready` returns 503; Nginx pulls the upstream, returning 503 to clients (no silent traffic loss). The container itself stays running so a transient DB blip does not trigger a Docker restart.

## 8. Drifts Logged Here

| ID | Severity | Description | Resolved In |
|----|----------|-------------|-------------|
| H | Low | `docker-compose.test.yml` health check is TCP-only; upgrade to `/health/live` once available | Step 7 |
| K | Medium (NEW) | Metrics + tracing not implemented in cycle 1; only the plan + `/health` ship | Future cycle |
| L | Low (NEW) | No central log aggregator; `journald` only | Future cycle |
| M | Low (NEW) | Tracing has no exporter (cycle 1 = trace IDs in logs only) | Future cycle when downstream services exist |

## 9. Self-verification

- [x] Structured JSON logging format defined with `timestamp`, `level`, `service`, `correlation_id`, `message`, `context`.
- [x] Metrics endpoint specified (`/metrics`, internal-only) with full app/system/business metric inventory.
- [x] OpenTelemetry tracing configured at the SDK level (cycle 1) with future exporter wiring (Drift M).
- [x] Alert severities with response times and channels defined; baselines tied to perf report numbers.
- [x] Dashboards defined for operations and business metrics.
- [x] PII exclusion rules cover passwords, JWTs, and email masking; refers to specific DTO field names.
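For reference, the liveness and readiness probes from §7 could be sketched as Minimal API endpoints like this. It is an illustration under assumptions, not the Step 7 code: `PingReaderAsync` and `PingAdminAsync` are hypothetical placeholders for the one-shot `SELECT 1` on each connection, and the management-interface binding is left out.

```csharp
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Liveness: answers 200 as long as the process can serve a request.
app.MapGet("/health/live", () => Results.Ok());

// Readiness: both DB connections must answer within 2 s each.
app.MapGet("/health/ready", async (CancellationToken requestAborted) =>
{
    var ready = await ProbeAsync(PingReaderAsync, requestAborted)
             && await ProbeAsync(PingAdminAsync, requestAborted);

    return ready
        ? Results.Ok()
        : Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
});

app.Run();

// Runs one probe with a 2-second timeout; any exception counts as "not ready".
static async Task<bool> ProbeAsync(Func<CancellationToken, Task> ping, CancellationToken outer)
{
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(outer);
    cts.CancelAfter(TimeSpan.FromSeconds(2));
    try
    {
        await ping(cts.Token);
        return true;
    }
    catch (Exception)
    {
        return false;
    }
}

// Hypothetical placeholders: replace with a real `SELECT 1` per connection.
static Task PingReaderAsync(CancellationToken ct) => Task.CompletedTask;
static Task PingAdminAsync(CancellationToken ct) => Task.CompletedTask;
```

Because the liveness endpoint never touches the database, a DB outage flips only readiness to 503 and leaves the container running, which matches the failure mode described in §7.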