# Azaion Admin API — Observability

**Date**: 2026-05-13 · **Cycle**: 1 · **Status**: planning artifact (no code changes; concrete wiring lands in Step 7).

## 1. Current State (audit)

| Pillar | Today | Gap |
|--------|-------|-----|
| Logging | Serilog 4.1.0 → Console + rolling file `logs/log.txt` (daily); MinimumLevel `Information`; FromLogContext enrichment | No structured fields beyond defaults; one unstructured `LogInformation($"…")` in `ResourcesService.SaveResource` (security audit F-12); SQL trace bypasses Serilog (`Console.WriteLine`); no correlation IDs |
| Metrics | none | No `/metrics` endpoint; no system, app, or business metrics |
| Tracing | none | No OpenTelemetry, no W3C trace context |
| Health checks | none in code; `docker-compose.test.yml` uses raw TCP probe | No `/health` endpoint (Drift H from Step 2 + skill self-verification) |
| Alerting | none | No alerts wired to any channel |

This step closes the planning gap; implementation lands incrementally — `/health` and structured logging in cycle 1 (Step 7), metrics + tracing in a later cycle (Drift K).

## 2. Logging

### 2.1 Format

Structured JSON to **stdout/stderr only** in containers. The current rolling-file sink is **dropped from the production runtime** (and the `/app/logs` bind mount becomes optional) because:

- Container logs should be collected by the platform, not the app.
- A bind-mounted file silently fills the host disk when log rotation lags.
- We currently have no log shipper, so logs already live only in `docker logs` for ops triage.

The existing console sink stays. The file sink is kept ONLY in `Development` (gated by `ASPNETCORE_ENVIRONMENT`).
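
The gating rule can be sketched as follows. This is a Python illustration of the decision only, not the Serilog wiring; the function name `sinks_for` is hypothetical. The fallback to `Production` when the variable is unset and the case-insensitive comparison mirror ASP.NET Core's environment handling.

```python
import os
from typing import List, Mapping


def sinks_for(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the sink names to enable for the current environment."""
    # ASP.NET Core treats an unset ASPNETCORE_ENVIRONMENT as "Production".
    environment = env.get("ASPNETCORE_ENVIRONMENT", "Production")
    sinks = ["console"]  # the console sink stays in every environment
    if environment.lower() == "development":
        sinks.append("file")  # the rolling file sink is Development-only
    return sinks
```

Passing the environment as a mapping keeps the rule testable without mutating the process environment.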

```json
{
  "timestamp": "2026-05-13T06:48:01.123Z",
  "level": "Information",
  "service": "azaion.admin-api",
  "revision": "a1b2c3d4e5f6",
  "correlation_id": "0HMU7…",
  "user_id": null,
  "message": "User registered",
  "context": {
    "endpoint": "POST /users",
    "duration_ms": 47
  }
}
```

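The target shape comes from switching the console sink to a JSON formatter and adding three enrichment sources. `Serilog.Formatting.Compact.RenderedCompactJsonFormatter` emits compact keys (`@t`, `@l`, `@m`) rather than `timestamp`, `level`, and `message`, so the field names shown above require either a custom `ITextFormatter` or a rename at the consumer; pick one in Step 7.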

| Enricher | Source | Purpose |
|----------|--------|---------|
| `FromLogContext` | already present | scoped properties |
| `Serilog.Enrichers.Environment` (new) | `ENV` vars | `service`, `revision` (`AZAION_REVISION`) |
| `UseSerilogRequestLogging` with `RequestLoggingOptions.EnrichDiagnosticContext` (`Serilog.AspNetCore`, new) | ASP.NET pipeline | request `correlation_id` from `Activity.Current.TraceId` (or a generated UUID v7 if there is no Activity) |
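
To make the target record shape concrete, here is a minimal sketch using Python's standard `logging` module. It is an illustration of the field layout, not the Serilog configuration; the class name `JsonLineFormatter` and the logger name are hypothetical, and Python level names (`INFO`) differ from Serilog's (`Information`).

```python
import json
import logging
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object with the fields shown above."""

    def __init__(self, service: str, revision: str) -> None:
        super().__init__()
        self.service = service
        self.revision = revision

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "revision": self.revision,
            # Per-request values arrive via `extra=`; absent ones become null / {}.
            "correlation_id": getattr(record, "correlation_id", None),
            "user_id": getattr(record, "user_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)


def build_logger(revision: str) -> logging.Logger:
    """Attach the formatter to a stream handler (stderr by default)."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonLineFormatter("azaion.admin-api", revision))
    logger = logging.getLogger("admin-api-sketch")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("a1b2c3d4e5f6").info("User registered", extra={"correlation_id": "abc", "context": {"endpoint": "POST /users", "duration_ms": 47}})` then writes a single JSON line with the static fields filled in once and the per-request fields supplied at the call site.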

### 2.2 Log Levels

| Level | Usage | Examples in this codebase |
|-------|-------|---------------------------|
| `Error` | Unhandled exceptions, infra failures | DB connection failure, sops decrypt failure on host |
| `Warning` | Business exception caught | Existing `BusinessExceptionHandler` already does this — keep as-is |
| `Information` | Significant business events | Login, RegisterUser, RegisterDevice, role change, resource upload, detection-class CRUD |
| `Debug` | Diagnostic detail | Request/response payloads (dev only — never in production); query parameters |

### 2.3 Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | console + `logs/log.txt` (rolling daily) | 7 daily files (set `retainedFileCountLimit: 7` explicitly; the Serilog file sink default is 31) |
| Test (CI) | console (captured by Woodpecker UI) | 14 days (Woodpecker artifact retention) |
| Staging | container stdout → `journald` on the host | 7 days; `journalctl --vacuum-time=7d` cron |
| Production | container stdout → `journald` on the host | 30 days; `journalctl --vacuum-time=30d` cron |

> A central log aggregator (Loki / OpenSearch) is **out of scope for cycle 1** — host `journald` is the entire pipeline. Recorded as **Drift L**.

### 2.4 PII Rules

| Rule | Implementation |
|------|----------------|
| Never log passwords | Covers `LoginRequest.Password`, `RegisterUserRequest.Password`, `GetResourceRequest.Password`, and the response body of `POST /devices` (plaintext one-shot password). Add a `[Serilog.Sensitive]`-style helper or a `Destructure.ByTransforming<T>(t => …)` per DTO. |
| Never log JWT tokens | Today the only logging on the `/login` path is `BusinessExceptionHandler` on failure, and it does not include the response body. Verify in Step 7 that no request-logger middleware logs response bodies. |
| Mask emails | Use the last-4 + `@domain` form for INFO-level logs (`***0123@example.com`); the full email is allowed at DEBUG only. The `BusinessExceptionHandler` log line `"Caught BusinessException: {Message}"` may include emails embedded in messages; this is tightened in Step 7. |
| User IDs | `User.Id` is an opaque GUID — safe to log; use it instead of email in correlation. |
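
The email-masking rule can be sketched as a small pure function. This is a Python illustration under stated assumptions: the name `mask_email` is hypothetical, and the handling of short local parts (reveal nothing when the local part has four characters or fewer) is an assumption the table does not specify.

```python
def mask_email(email: str, keep: int = 4) -> str:
    """Mask an email to the last `keep` characters of the local part plus the domain."""
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        return "***"  # not a usable address; reveal nothing
    if len(local) <= keep:
        # Assumption: a short local part would be fully exposed, so hide it entirely.
        return "***@" + domain
    return "***" + local[-keep:] + "@" + domain
```

For example, `mask_email("jane.doe0123@example.com")` yields `***0123@example.com`, while `mask_email("bob@example.com")` yields `***@example.com`.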

## 3. Metrics

### 3.1 Endpoint

`GET /metrics` exposes the Prometheus text exposition format. Add via `prometheus-net.AspNetCore` 8.x (latest stable for the .NET 10 baseline; verify the version against the released NuGet package before wiring).

> Exposure boundary: `/metrics` MUST NOT be reachable from the public CORS allow-list. The Nginx reverse proxy on `admin.azaion.com` will expose only `/login`, `/users*`, `/devices`, `/resources*`, `/classes*`, `/health`. `/metrics` and `/swagger` stay on the internal interface (separate Nginx server block bound to the management VLAN, OR `localhost`-only listener).
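
The exposure boundary above can be expressed as a path allow-list check. The following Python sketch takes the patterns literally from the note and applies glob semantics, which is an assumption: real Nginx `location` matching follows its own rules. The names `PUBLIC_PATTERNS` and `is_public` are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the exposure note; "*" is treated as a glob wildcard.
PUBLIC_PATTERNS = ("/login", "/users*", "/devices", "/resources*", "/classes*", "/health")


def is_public(path: str) -> bool:
    """True when the path matches one of the publicly proxied patterns."""
    return any(fnmatchcase(path, pattern) for pattern in PUBLIC_PATTERNS)
```

Under this literal reading, `/users/42` is public while `/metrics` and `/swagger/index.html` are not. A check like this is useful as a smoke test against the proxy configuration after each deploy.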

### 3.2 Metrics

| Metric | Type | Source | Labels |
|--------|------|--------|--------|
| `http_requests_total` | Counter | ASP.NET request pipeline | `method`, `endpoint`, `status_code` |
| `http_request_duration_seconds` | Histogram | ASP.NET request pipeline | `method`, `endpoint` |
| `http_requests_in_progress` | Gauge | ASP.NET request pipeline | `method` |
| `db_command_duration_seconds` | Histogram | linq2db trace hook | `operation` (`select`/`insert`/`update`/`delete`) |
| `db_command_failures_total` | Counter | linq2db trace hook | `operation`, `sqlstate` |
| `auth_login_failures_total` | Counter | `AuthService.ValidateUser` exception path | `reason` (`unknown_user`, `bad_password`, `disabled`) |
| `business_exceptions_total` | Counter | `BusinessExceptionHandler` | `error_code` (the existing `ExceptionEnum`) |
| `resource_upload_bytes_total` | Counter | `ResourcesService.SaveResource` | `data_folder` |
| `resource_upload_failures_total` | Counter | same | `reason` |
| `resource_download_bytes_total` | Counter | `ResourcesService.GetEncryptedResource` | `data_folder` |
| `detection_classes_total` | Gauge | refresh on CRUD | none |
| `users_active_total` | Gauge | refresh on CRUD + on a 5-min timer | `role` |
| Process / runtime | (auto) | `prometheus-net.DotNetRuntime` | gen0/1/2 GC, JIT, threadpool, etc. |

### 3.3 System Metrics

CPU, RSS, file descriptors, network I/O — collected by **node-exporter** running on the host as a sibling container. The Admin API itself does NOT export host-level metrics.

### 3.4 Business Metrics

Mapped to the verified ACs in `_docs/02_document/tests/blackbox-tests.md`. Cycle-1 cut: `users_active_total` (AC-01..AC-12 user lifecycle) and `detection_classes_total` (AZ-513). Resource-related business metrics deferred until the resource flow is exercised by real users post-AZ-197.

### 3.5 Collection

| Setting | Value |
|---------|-------|
| Scrape interval | 15s (set explicitly via `scrape_interval`; the Prometheus default is 1m) |
| Scrape source | `node-exporter` for host; the Admin API container for app metrics |
| Storage | local Prometheus on the host, retention 14 days (cycle 1 budget) |
| Visualization | local Grafana, single dashboard (§6) |

## 4. Distributed Tracing

**Cycle 1**: scaffold only — produce a trace ID per request, propagate via `traceparent` (W3C), and emit it as the `correlation_id` field in JSON logs. **Do NOT yet** ship spans to a collector — there is no Jaeger / Tempo running, and the Admin API has no downstream services to trace into. Tracing pays back its cost when there's a chain to follow; cycle 1 has none.

| Setting | Value |
|---------|-------|
| SDK | `OpenTelemetry.Extensions.Hosting` + `OpenTelemetry.Instrumentation.AspNetCore` |
| Propagation | W3C Trace Context (`traceparent`) — auto when `OpenTelemetry.Instrumentation.AspNetCore` is registered |
| Sampling | 100% in dev/staging, 10% in production (deferred — no exporter yet) |
| Span naming | `<service>.<operation>` — service `azaion.admin-api`, operation `<HTTP method> <route template>` |
| Exporter | none in cycle 1 (logs only) |

> Recorded as **Drift M** — wire a Tempo / Jaeger exporter once a downstream service exists.
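
The cycle-1 behaviour (take the trace ID from `traceparent`, otherwise generate a UUID v7) can be sketched as below. This is a Python illustration that parses the header directly, whereas the service would read `Activity.Current.TraceId`; the function names are hypothetical, and the parser handles only the four-field layout defined for version `00` of W3C Trace Context.

```python
import os
import re
import time
import uuid
from typing import Optional

# version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex), all lowercase
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def uuid7() -> uuid.UUID:
    """Build a UUID v7: 48-bit Unix-millisecond timestamp, version, variant, random bits."""
    ts_ms = (time.time_ns() // 1_000_000) & ((1 << 48) - 1)
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF           # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 random bits
    value = (ts_ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    return uuid.UUID(int=value)


def correlation_id(traceparent: Optional[str]) -> str:
    """Return the incoming trace ID when the header is valid, else a fresh UUID v7."""
    match = _TRACEPARENT.match(traceparent.strip()) if traceparent else None
    if match:
        version, trace_id, parent_id, _flags = match.groups()
        # Version "ff" and all-zero IDs are invalid per the W3C specification.
        if version != "ff" and trace_id != "0" * 32 and parent_id != "0" * 16:
            return trace_id
    return str(uuid7())
```

Falling back to a time-ordered ID keeps log lines sortable by creation time even when no upstream caller supplies a trace context.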

## 5. Alerting

| Severity | Response time | Conditions for this service | Channel |
|----------|---------------|------------------------------|---------|
| Critical | 5 min | `up{job="admin-api"} == 0` for 1 min · `/health` fails for 2 min · `business_exceptions_total{error_code="DbFailure"}` rate > 1/s for 1 min | Slack `#azaion-ops` + on-call email (cycle 1 — PagerDuty deferred until on-call rotation exists) |
| High | 30 min | Error rate > 5% for 5 min (5-minute rate of `http_requests_total{status_code=~"5.."}` divided by the 5-minute rate of all `http_requests_total`) · P95 latency > 2× baseline for 10 min · `auth_login_failures_total` rate > 10/s for 1 min (possible brute force) | Slack `#azaion-ops` + email |
| Medium | 4 h | Host disk > 80% · `db_command_failures_total` rate > 0.1/s for 10 min · process RSS > 80% of container limit | Slack `#azaion-ops` |
| Low | Next business day | Deprecated package usage from `dotnet list package --deprecated` | Slack `#azaion-eng` |
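
The "condition for N minutes" wording in the table means the condition must hold at every evaluation across the window, not just once. A minimal Python sketch of that semantics for the error-rate alert follows; it assumes one evaluation per 15 s scrape, and the names `error_ratio` and `fires` are hypothetical.

```python
from typing import Sequence

SCRAPE_INTERVAL_S = 15  # assumption: one evaluation per scrape


def error_ratio(errors_delta: float, total_delta: float) -> float:
    """Share of 5xx responses in a window; 0.0 when there was no traffic."""
    return errors_delta / total_delta if total_delta > 0 else 0.0


def fires(ratios: Sequence[float], threshold: float = 0.05, for_seconds: int = 300) -> bool:
    """True when every evaluation in the trailing `for_seconds` window exceeds the threshold."""
    needed = for_seconds // SCRAPE_INTERVAL_S  # 300 s / 15 s = 20 evaluations
    recent = ratios[-needed:]
    return len(recent) == needed and all(r > threshold for r in recent)
```

A single evaluation at or below the threshold resets the window, which is what keeps short error bursts from reaching the channel.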

Baseline values (P95) come from the cycle-1 perf report:
- `/login` p95 ≈ 33 ms → high-latency alert at p95 > 66 ms for 10 min
- `/users` (500 users) p95 ≈ 152 ms → high-latency alert at p95 > 305 ms for 10 min
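
The thresholds follow from the 2× baseline rule. As a worked check in Python (the names are hypothetical, and the inputs are the rounded baselines quoted above):

```python
from typing import Dict

BASELINE_P95_MS = {"/login": 33, "/users": 152}  # rounded figures quoted above


def latency_alert_thresholds(baselines: Dict[str, float], factor: float = 2.0) -> Dict[str, float]:
    """Multiply each endpoint's baseline p95 by the alert factor."""
    return {endpoint: p95 * factor for endpoint, p95 in baselines.items()}
```

With the rounded inputs this gives 66 ms for `/login` and 304 ms for `/users`; a quoted threshold one millisecond higher would be consistent with an unrounded baseline slightly above 152 ms.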

Alert routing in cycle 1 is **inform-only** — no PagerDuty escalation, no auto-rollback. The deploy procedure (Step 6) documents the manual rollback path.

## 6. Dashboards

**Operations dashboard** (Grafana, single panel set; cycle 1):

- Service `up` (admin-api, postgres, nginx) — stat panel
- HTTP request rate (req/s) by endpoint — time series
- HTTP error rate (% of 5xx) — time series with the High threshold band overlaid
- Latency P50 / P95 / P99 by endpoint — time series, P95 baseline reference line
- DB command rate + failure rate — time series
- Container CPU / RSS / FDs — time series (from node-exporter)
- Active alerts — table panel

**Business dashboard** (cycle 1):

- `users_active_total` by role — stat panel + sparkline
- `detection_classes_total` — stat panel
- `resource_upload_bytes_total` rate (1h window) — time series
- Login success/failure ratio (24h) — donut

Dashboards stored as code in `monitoring/grafana/admin-api.json` (introduced in Step 7).

## 7. Health Checks

Add two Minimal API endpoints under `/health`:

| Probe | Endpoint | What it checks | Surface |
|-------|----------|----------------|---------|
| Liveness | `GET /health/live` | Process is responsive (always 200 unless the process is wedged) | Used by Docker `HEALTHCHECK` |
| Readiness | `GET /health/ready` | DB reader connection + DB admin connection (one-shot `SELECT 1` each, 2s timeout) | Used by Nginx upstream check + the deploy script (Step 6) post-deploy gate |

Endpoints are anonymous (no JWT) but bound only to the management VLAN (or `localhost` listener) — same exposure rule as `/metrics`.

> Failure mode: if the DB is unreachable for 30 s, `/health/ready` returns 503; Nginx pulls the upstream, returning 503 to clients (no silent traffic loss). The container itself stays running so a transient DB blip does not trigger Docker restart.
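
The probe logic can be sketched as follows. This is a Python illustration of the decision (run each dependency check with a timeout, return 200 only if all pass), not the ASP.NET health-check wiring; the function names and the check callables are hypothetical stand-ins for the `SELECT 1` probes.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, Tuple


def liveness() -> int:
    """The process answered, so it is alive."""
    return 200


def readiness(checks: Dict[str, Callable[[], None]], timeout_s: float = 2.0) -> Tuple[int, Dict[str, str]]:
    """Run all checks concurrently; 200 only when every check finishes without error."""
    pool = ThreadPoolExecutor(max_workers=max(len(checks), 1))
    futures = {name: pool.submit(check) for name, check in checks.items()}
    results: Dict[str, str] = {}
    for name, future in futures.items():
        try:
            # Waits are sequential, so the worst case is timeout_s per check.
            future.result(timeout=timeout_s)
            results[name] = "ok"
        except FutureTimeout:
            results[name] = "timeout"
        except Exception as exc:
            results[name] = f"failed: {type(exc).__name__}"
    pool.shutdown(wait=False)  # do not block the response on a hung check
    status = 200 if all(state == "ok" for state in results.values()) else 503
    return status, results
```

Keeping liveness independent of the database is what lets a failed readiness probe remove the upstream without causing a container restart.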

## 8. Drifts Logged Here

| ID | Severity | Description | Resolved In |
|----|----------|-------------|-------------|
| H  | Low | `docker-compose.test.yml` health check is TCP-only; upgrade to `/health/live` once available | Step 7 |
| K  | Medium (NEW) | Metrics + tracing not implemented in cycle 1; only the plan + `/health` ship | Future cycle |
| L  | Low (NEW) | No central log aggregator; `journald` only | Future cycle |
| M  | Low (NEW) | Tracing has no exporter (cycle 1 = trace IDs in logs only) | Future cycle when downstream services exist |

## 9. Self-verification

- [x] Structured JSON logging format defined with `timestamp`, `level`, `service`, `correlation_id`, `message`, `context`.
- [x] Metrics endpoint specified (`/metrics`, internal-only) with full app/system/business metric inventory.
- [x] OpenTelemetry tracing specified at the SDK level (cycle 1), with exporter wiring deferred (Drift M).
- [x] Alert severities with response times and channels defined; baselines tied to perf report numbers.
- [x] Dashboards defined for operations and business metrics.
- [x] PII exclusion rules cover passwords, JWTs, and email masking; refers to specific DTO field names.