# Azaion Admin API — Observability
**Date**: 2026-05-13 · **Cycle**: 1 · **Status**: planning artifact (no code changes; concrete wiring lands in Step 7).
## 1. Current State (audit)
| Pillar | Today | Gap |
|--------|-------|-----|
| Logging | Serilog 4.1.0 → Console + rolling file `logs/log.txt` (daily); MinimumLevel `Information`; FromLogContext enrichment | No structured fields beyond defaults; one unstructured `LogInformation($"…")` in `ResourcesService.SaveResource` (security audit F-12); SQL trace bypasses Serilog (`Console.WriteLine`); no correlation IDs |
| Metrics | none | No `/metrics` endpoint; no system, app, or business metrics |
| Tracing | none | No OpenTelemetry, no W3C trace context |
| Health checks | none in code; `docker-compose.test.yml` uses raw TCP probe | No `/health` endpoint (Drift H from Step 2 + skill self-verification) |
| Alerting | none | No alerts wired to any channel |
This step closes the planning gap; implementation lands incrementally — `/health` and structured logging in cycle 1 (Step 7), metrics + tracing in a later cycle (Drift K).
## 2. Logging
### 2.1 Format
Structured JSON to **stdout/stderr only** in containers. The current rolling-file sink is **dropped from the production runtime** (and the `/app/logs` bind mount becomes optional) because:
- Container logs should be collected by the platform, not the app.
- A bind-mounted file silently fills the host disk when log rotation lags.
- We currently have no log shipper, so logs already live only in `docker logs` for ops triage.
The existing console sink stays. The file sink is kept ONLY in `Development` (gated by `ASPNETCORE_ENVIRONMENT`).
```json
{
  "timestamp": "2026-05-13T06:48:01.123Z",
  "level": "Information",
  "service": "azaion.admin-api",
  "revision": "a1b2c3d4e5f6",
  "correlation_id": "0HMU7…",
  "user_id": null,
  "message": "User registered",
  "context": {
    "endpoint": "POST /users",
    "duration_ms": 47
  }
}
```
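Achieved by adding `Serilog.Formatting.Compact.RenderedCompactJsonFormatter` to the console sink plus the three enrichment sources listed below. Note that this formatter emits compact field names (`@t`, `@m`, `@l`) rather than the `timestamp` / `message` / `level` keys shown above, so the exact layout in the example requires either a custom `ITextFormatter` or a field mapping downstream.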
| Enricher | Source | Purpose |
|----------|--------|---------|
| `FromLogContext` | already present | scoped properties |
| `Serilog.Enrichers.Environment` (new) | environment variables | `service`, `revision` (`AZAION_REVISION`) |
| `RequestLoggingOptions.EnrichDiagnosticContext` in `Serilog.AspNetCore` (new) | ASP.NET pipeline | request `correlation_id` from `Activity.Current.TraceId` (or generated UUID v7 if no Activity) |
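
The field layout and the correlation-ID fallback can be sketched in a language-neutral way. The Python below is an illustration only, not the Serilog wiring: `build_log_event` and its parameters are hypothetical names, and `uuid4` stands in for the UUID v7 generator mentioned above.

```python
import json
import uuid
from datetime import datetime, timezone

SERVICE_NAME = "azaion.admin-api"  # value taken from the JSON example above


def build_log_event(level, message, revision, trace_id=None,
                    user_id=None, context=None, now=None):
    """Assemble one JSON log line with the field layout from section 2.1."""
    now = now or datetime.now(timezone.utc)
    # Millisecond-precision UTC timestamp with a trailing "Z".
    timestamp = (
        now.astimezone(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    event = {
        "timestamp": timestamp,
        "level": level,
        "service": SERVICE_NAME,
        "revision": revision,
        # Prefer the request's trace ID; otherwise generate an ID.
        # Assumption: uuid4 is a stand-in for UUID v7 here.
        "correlation_id": trace_id or uuid.uuid4().hex,
        "user_id": user_id,
        "message": message,
        "context": context or {},
    }
    return json.dumps(event, ensure_ascii=False)
```

Passing `now` explicitly keeps the function deterministic in tests; in the real service the timestamp and correlation ID would come from the logging pipeline rather than from the caller.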
### 2.2 Log Levels
| Level | Usage | Examples in this codebase |
|-------|-------|---------------------------|
| `Error` | Unhandled exceptions, infra failures | DB connection failure, sops decrypt failure on host |
| `Warning` | Business exception caught | Existing `BusinessExceptionHandler` already does this — keep as-is |
| `Information` | Significant business events | Login, RegisterUser, RegisterDevice, role change, resource upload, detection-class CRUD |
| `Debug` | Diagnostic detail | Request/response payloads (dev only — never in production); query parameters |
### 2.3 Retention
| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | console + `logs/log.txt` (rolling daily) | 7 daily files (set `retainedFileCountLimit: 7` explicitly; the `Serilog.Sinks.File` default is 31) |
| Test (CI) | console (captured by Woodpecker UI) | 14 days (Woodpecker artifact retention) |
| Staging | container stdout → `journald` on the host | 7 days; `journalctl --vacuum-time=7d` cron |
| Production | container stdout → `journald` on the host | 30 days; `journalctl --vacuum-time=30d` cron |
> A central log aggregator (Loki / OpenSearch) is **out of scope for cycle 1** — host `journald` is the entire pipeline. Recorded as **Drift L**.
### 2.4 PII Rules
| Rule | Implementation |
|------|----------------|
| Never log passwords | Covers `LoginRequest.Password`, `RegisterUserRequest.Password`, and the response body of `POST /devices` (plaintext one-shot password). Add a `[Serilog.Sensitive]`-style helper or a `Destructure.ByTransforming<T>(t => …)` per DTO. (`GetResourceRequest.Password` was previously listed; the DTO was deleted in cycle 2 with the encrypted-download endpoint.) |
| Never log JWT tokens | The `/login` response body is logged today only by `BusinessExceptionHandler` on failure, which doesn't include the body. Verify in Step 7 that no request-logger middleware logs response bodies. |
| Mask emails | Use last-4 + `@domain` form for INFO-level logs (`***0123@example.com`); full email allowed at DEBUG only. The `BusinessExceptionHandler` log line `"Caught BusinessException: {Message}"` may include emails embedded in messages; this is to be tightened in Step 7. |
| User IDs | `User.Id` is an opaque GUID — safe to log; use it instead of email in correlation. |
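
The email-masking rule can be illustrated with a small language-neutral sketch in Python. `mask_email` is a hypothetical helper, not existing code, and its handling of short local parts is an assumption: the rule above does not say what to do when the local part has four characters or fewer.

```python
def mask_email(email: str, keep: int = 4) -> str:
    """Mask the local part of an address, keeping only its last `keep` characters."""
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        # Not a recognisable address: reveal nothing.
        return "***"
    # Assumption: a local part no longer than `keep` is hidden entirely,
    # because showing it would unmask the whole address.
    visible = local[-keep:] if len(local) > keep else ""
    return f"***{visible}@{domain}"
```

With this sketch, `mask_email("user0123@example.com")` yields `***0123@example.com` and `mask_email("bob@example.com")` yields `***@example.com`.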
## 3. Metrics
### 3.1 Endpoint
`GET /metrics` exposing Prometheus exposition format. Add via `prometheus-net.AspNetCore` 8.x (latest stable for .NET 10 baseline; verify the version against the released NuGet package before wiring).
> Exposure boundary: `/metrics` MUST NOT be reachable from the public CORS allow-list. The Nginx reverse proxy on `admin.azaion.com` will expose only `/login`, `/users*`, `/devices`, `/resources*`, `/classes*`, `/health`. `/metrics` and `/swagger` stay on the internal interface (separate Nginx server block bound to the management VLAN, OR `localhost`-only listener).
### 3.2 Metrics
| Metric | Type | Source | Labels |
|--------|------|--------|--------|
| `http_requests_total` | Counter | ASP.NET request pipeline | `method`, `endpoint`, `status_code` |
| `http_request_duration_seconds` | Histogram | ASP.NET request pipeline | `method`, `endpoint` |
| `http_requests_in_progress` | Gauge | ASP.NET request pipeline | `method` |
| `db_command_duration_seconds` | Histogram | linq2db trace hook | `operation` (`select`/`insert`/`update`/`delete`) |
| `db_command_failures_total` | Counter | linq2db trace hook | `operation`, `sqlstate` |
| `auth_login_failures_total` | Counter | `AuthService.ValidateUser` exception path | `reason` (`unknown_user`, `bad_password`, `disabled`) |
| `business_exceptions_total` | Counter | `BusinessExceptionHandler` | `error_code` (the existing `ExceptionEnum`) |
| `resource_upload_bytes_total` | Counter | `ResourcesService.SaveResource` | `data_folder` |
| `resource_upload_failures_total` | Counter | same | `reason` |
| `detection_classes_total` | Gauge | refresh on CRUD | none |
| `users_active_total` | Gauge | refresh on CRUD + on a 5-min timer | `role` |
| Process / runtime | (auto) | `prometheus-net.DotNetRuntime` | gen0/1/2 GC, JIT, threadpool, etc. |
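
To show what a labeled counter from the table looks like on the wire, here is a minimal sketch in Python of a counter rendered in the Prometheus text exposition format. `LabeledCounter` is a hypothetical stand-in for illustration; the service itself would rely on `prometheus-net` rather than hand-rolled rendering.

```python
from collections import defaultdict


def _escape(value: str) -> str:
    """Escape a label value for the text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class LabeledCounter:
    """A monotonically increasing counter keyed by a fixed set of label names."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(str(labels[n]) for n in self.label_names)
        self._values[key] += amount

    def render(self) -> str:
        """Return the counter as exposition-format text, one line per label set."""
        lines = [f"# TYPE {self.name} counter"]
        for key in sorted(self._values):
            pairs = ",".join(
                f'{n}="{_escape(v)}"' for n, v in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{pairs}}} {self._values[key]!r}")
        return "\n".join(lines) + "\n"
```

Keeping label values to a small closed set (such as the three `reason` values for `auth_login_failures_total`) bounds the number of time series; free-form values such as user IDs or emails should not be used as labels.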
### 3.3 System Metrics
CPU, RSS, file descriptors, network I/O — collected by **node-exporter** running on the host as a sibling container. The Admin API itself does NOT export host-level metrics.
### 3.4 Business Metrics
Mapped to the verified ACs in `_docs/02_document/tests/blackbox-tests.md`. Cycle-1 cut: `users_active_total` (AC-01..AC-12 user lifecycle) and `detection_classes_total` (AZ-513). The previously planned `resource_download_bytes_total` was dropped in cycle 2 along with `ResourcesService.GetEncryptedResource` itself; only the upload-side counters remain.
### 3.5 Collection
| Setting | Value |
|---------|-------|
| Scrape interval | 15s (set explicitly in `scrape_interval`; the Prometheus built-in default is 1m) |
| Scrape source | `node-exporter` for host; the Admin API container for app metrics |
| Storage | local Prometheus on the host, retention 14 days (cycle 1 budget) |
| Visualization | local Grafana, single dashboard (§6) |
## 4. Distributed Tracing
**Cycle 1**: scaffold only — produce a trace ID per request, propagate via `traceparent` (W3C), and emit it as the `correlation_id` field in JSON logs. **Do NOT yet** ship spans to a collector — there is no Jaeger / Tempo running, and the Admin API has no downstream services to trace into. Tracing pays back its cost when there's a chain to follow; cycle 1 has none.
| Setting | Value |
|---------|-------|
| SDK | `OpenTelemetry.Extensions.Hosting` + `OpenTelemetry.Instrumentation.AspNetCore` |
| Propagation | W3C Trace Context (`traceparent`) — auto when `OpenTelemetry.Instrumentation.AspNetCore` is registered |
| Sampling | 100% in dev/staging, 10% in production (deferred — no exporter yet) |
| Span naming | `<service>.<operation>` — service `azaion.admin-api`, operation `<HTTP method> <route template>` |
| Exporter | none in cycle 1 (logs only) |
> Recorded as **Drift M** — wire a Tempo / Jaeger exporter once a downstream service exists.
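
Because cycle 1 only needs the trace ID for log correlation, the relevant part of W3C Trace Context is small. The Python sketch below shows how a version-00 `traceparent` header maps to the `correlation_id` value; `trace_id_from_traceparent` is a hypothetical name, and in the service this parsing is handled by the ASP.NET Core / OpenTelemetry stack rather than by hand.

```python
import re

# Version "00": 32-hex trace-id, 16-hex parent-id, 2-hex flags, all lowercase.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def trace_id_from_traceparent(header):
    """Return the trace ID from a version-00 traceparent header, or None if absent or invalid."""
    if not header:
        return None
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, _flags = match.groups()
    # All-zero trace or parent IDs are invalid per the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id
```

A `None` result corresponds to the "no Activity" case in §2.1, where a fresh ID is generated instead. The sketch deliberately accepts only version `00`; the specification asks for more lenient handling of future versions.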
## 5. Alerting
| Severity | Response time | Conditions for this service | Channel |
|----------|---------------|------------------------------|---------|
| Critical | 5 min | `up{job="admin-api"} == 0` for 1 min · `/health` fails for 2 min · `business_exceptions_total{error_code="DbFailure"}` rate > 1/s for 1 min | Slack `#azaion-ops` + on-call email (cycle 1 — PagerDuty deferred until on-call rotation exists) |
| High | 30 min | Error rate > 5% for 5 min (`sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05`) · P95 latency > 2× baseline for 10 min · `auth_login_failures_total` rate > 10/s for 1 min (possible brute force) | Slack `#azaion-ops` + email |
| Medium | 4 h | Host disk > 80% · `db_command_failures_total` rate > 0.1/s for 10 min · process RSS > 80% of container limit | Slack `#azaion-ops` |
| Low | Next business day | Deprecated package usage from `dotnet list package --deprecated` | Slack `#azaion-eng` |
Baseline values (P95) come from the cycle-1 perf report:
- `/login` p95 ≈ 33 ms → high-latency alert at p95 > 66 ms for 10 min
- `/users` (500 users) p95 ≈ 152 ms → high-latency alert at p95 > 305 ms for 10 min
Alert routing in cycle 1 is **inform-only** — no PagerDuty escalation, no auto-rollback. The deploy procedure (Step 6) documents the manual rollback path.
## 6. Dashboards
**Operations dashboard** (Grafana, single panel set; cycle 1):
- Service `up` (admin-api, postgres, nginx) — stat panel
- HTTP request rate (req/s) by endpoint — time series
- HTTP error rate (% of 5xx) — time series with the High threshold band overlaid
- Latency P50 / P95 / P99 by endpoint — time series, P95 baseline reference line
- DB command rate + failure rate — time series
- Container CPU / RSS / FDs — time series (from node-exporter)
- Active alerts — table panel
**Business dashboard** (cycle 1):
- `users_active_total` by role — stat panel + sparkline
- `detection_classes_total` — stat panel
- `resource_upload_bytes_total` rate (1h window) — time series
- Login success/failure ratio (24h) — donut
Dashboards stored as code in `monitoring/grafana/admin-api.json` (introduced in Step 7).
## 7. Health Checks
Add two Minimal API endpoints under `/health`:
| Probe | Endpoint | What it checks | Surface |
|-------|----------|----------------|---------|
| Liveness | `GET /health/live` | Process is responsive (always 200 unless the process is wedged) | Used by Docker `HEALTHCHECK` |
| Readiness | `GET /health/ready` | DB reader connection + DB admin connection (one-shot `SELECT 1` each, 2s timeout) | Used by Nginx upstream check + the deploy script (Step 6) post-deploy gate |
Endpoints are anonymous (no JWT) but bound only to the management VLAN (or `localhost` listener) — same exposure rule as `/metrics`.
> Failure mode: if the DB is unreachable for 30 s, `/health/ready` returns 503; Nginx pulls the upstream, returning 503 to clients (no silent traffic loss). The container itself stays running so a transient DB blip does not trigger Docker restart.
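
The readiness logic (run each DB probe with a timeout, report 503 if any fails) can be sketched as follows. This is a language-neutral Python illustration under stated assumptions: `readiness_status` and the probe names are hypothetical, and each probe is a zero-argument callable that would perform the one-shot `SELECT 1`.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def readiness_status(probes, timeout_s=2.0):
    """Run each probe and map the combined result to an HTTP status code.

    `probes` maps a name (e.g. "reader", "admin") to a zero-argument callable
    that raises on failure. Returns (status_code, per-probe results).
    """
    results = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(probes)))
    try:
        # Submit all probes first so they run concurrently.
        futures = {name: pool.submit(probe) for name, probe in probes.items()}
        for name, future in futures.items():
            try:
                future.result(timeout=timeout_s)
                results[name] = "ok"
            except FutureTimeout:
                results[name] = "timeout"
            except Exception:
                results[name] = "failed"
    finally:
        # Do not block the health response on a hung probe.
        pool.shutdown(wait=False)
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

The liveness endpoint intentionally runs no probes at all, which is what keeps a transient DB outage from triggering a container restart.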
## 8. Drifts Logged Here
| ID | Severity | Description | Resolved In |
|----|----------|-------------|-------------|
| H | Low | `docker-compose.test.yml` health check is TCP-only; upgrade to `/health/live` once available | Step 7 |
| K | Medium (NEW) | Metrics + tracing not implemented in cycle 1; only the plan, `/health`, and structured logging ship | Future cycle |
| L | Low (NEW) | No central log aggregator; `journald` only | Future cycle |
| M | Low (NEW) | Tracing has no exporter (cycle 1 = trace IDs in logs only) | Future cycle when downstream services exist |
## 9. Self-verification
- [x] Structured JSON logging format defined with `timestamp`, `level`, `service`, `correlation_id`, `message`, `context`.
- [x] Metrics endpoint specified (`/metrics`, internal-only) with full app/system/business metric inventory.
- [x] OpenTelemetry tracing configured at the SDK level (cycle 1) with future exporter wiring (Drift M).
- [x] Alert severities with response times and channels defined; baselines tied to perf report numbers.
- [x] Dashboards defined for operations and business metrics.
- [x] PII exclusion rules cover passwords, JWTs, and email masking; refers to specific DTO field names.