mirror of
https://github.com/azaion/admin.git
synced 2026-06-21 21:11:08 +00:00
3a925b9b0f
- Deleted the `POST /resources/get/{dataFolder?}` and `GET /resources/get-installer` endpoints as part of the architectural shift towards simplified resource management.
- Removed associated methods and configurations, including `ResourcesService.GetEncryptedResource`, `ResourcesService.GetInstaller`, and related properties in `ResourcesConfig`.
- Cleaned up environment variables and configuration files to reflect the removal of installer-related settings.
- Eliminated the `GetResourceRequest` DTO and its validator, along with the `WrongResourceName` error code.
- Updated documentation to clarify the changes in resource handling and the retirement of per-user file encryption.
Co-authored-by: Cursor <cursoragent@cursor.com>
204 lines
12 KiB
Markdown
# Azaion Admin API — Observability

**Date**: 2026-05-13 · **Cycle**: 1 · **Status**: planning artifact (no code changes; concrete wiring lands in Step 7).

## 1. Current State (audit)

| Pillar | Today | Gap |
|--------|-------|-----|
| Logging | Serilog 4.1.0 → Console + rolling file `logs/log.txt` (daily); MinimumLevel `Information`; FromLogContext enrichment | No structured fields beyond defaults; one unstructured `LogInformation($"…")` in `ResourcesService.SaveResource` (security audit F-12); SQL trace bypasses Serilog (`Console.WriteLine`); no correlation IDs |
| Metrics | none | No `/metrics` endpoint; no system, app, or business metrics |
| Tracing | none | No OpenTelemetry, no W3C trace context |
| Health checks | none in code; `docker-compose.test.yml` uses raw TCP probe | No `/health` endpoint (Drift H from Step 2 + skill self-verification) |
| Alerting | none | No alerts wired to any channel |

This step closes the planning gap; implementation lands incrementally: `/health` and structured logging in cycle 1 (Step 7), metrics + tracing in a later cycle (Drift K).

## 2. Logging

### 2.1 Format

Structured JSON to **stdout/stderr only** in containers. The current rolling-file sink is **dropped from the production runtime** (and the `/app/logs` bind mount becomes optional) because:

- Container logs should be collected by the platform, not the app.
- A bind-mounted file silently fills the host disk when log rotation lags.
- We currently have no log shipper, so logs already live only in `docker logs` for ops triage.

The existing console sink stays. The file sink is kept ONLY in `Development` (gated by `ASPNETCORE_ENVIRONMENT`).

```json
{
  "timestamp": "2026-05-13T06:48:01.123Z",
  "level": "Information",
  "service": "azaion.admin-api",
  "revision": "a1b2c3d4e5f6",
  "correlation_id": "0HMU7…",
  "user_id": null,
  "message": "User registered",
  "context": {
    "endpoint": "POST /users",
    "duration_ms": 47
  }
}
```

This is achieved by adding `Serilog.Formatting.Compact.RenderedCompactJsonFormatter` to the console sink. Note that this formatter writes Serilog's compact keys (`@t`, `@l`, `@m`) and places properties at the top level, so producing the exact field names shown above requires a custom `ITextFormatter` or a rename step downstream. Three enrichers supply the extra fields:

| Enricher | Source | Purpose |
|----------|--------|---------|
| `FromLogContext` | already present | scoped properties |
| `Serilog.Enrichers.Environment` (new) | `ENV` vars | `service`, `revision` (`AZAION_REVISION`) |
| `Serilog.AspNetCore.RequestLoggingOptions` (new) | ASP.NET pipeline | request `correlation_id` from `Activity.Current.TraceId` (or generated UUID v7 if no Activity) |

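
The sink and enricher choices above could be wired roughly as follows. This is a minimal `Program.cs` sketch under several assumptions: the `Serilog.AspNetCore` package is referenced (to my knowledge it brings in the console sink, file sink, and compact formatter), the runtime is .NET 9 or later (for `Guid.CreateVersion7`), and `Enrich.WithProperty` stands in for `Serilog.Enrichers.Environment` to keep the example small. The `"unknown"` fallback is illustrative and not taken from the project.

```csharp
using System.Diagnostics;
using Serilog;
using Serilog.Formatting.Compact;

var builder = WebApplication.CreateBuilder(args);

builder.Host.UseSerilog((context, logger) =>
{
    logger
        .MinimumLevel.Information()
        .Enrich.FromLogContext()
        .Enrich.WithProperty("service", "azaion.admin-api")
        // AZAION_REVISION is read from the environment; "unknown" is an assumed fallback.
        .Enrich.WithProperty("revision",
            Environment.GetEnvironmentVariable("AZAION_REVISION") ?? "unknown")
        // JSON to stdout in every environment.
        .WriteTo.Console(new RenderedCompactJsonFormatter());

    // The rolling file sink is added only when ASPNETCORE_ENVIRONMENT is Development.
    if (context.HostingEnvironment.IsDevelopment())
    {
        logger.WriteTo.File(
            "logs/log.txt",
            rollingInterval: RollingInterval.Day,
            retainedFileCountLimit: 7);
    }
});

var app = builder.Build();

app.UseSerilogRequestLogging(options =>
{
    // Attach the trace ID (or a generated UUID v7) to the request-completion event.
    options.EnrichDiagnosticContext = (diagnosticContext, httpContext) =>
    {
        var correlationId = Activity.Current?.TraceId.ToString()
            ?? Guid.CreateVersion7().ToString("N");
        diagnosticContext.Set("correlation_id", correlationId);
    };
});

app.Run();
```

A property set through the diagnostic context appears only on the single summary event that `UseSerilogRequestLogging` writes per request. Stamping every event emitted during a request would need a small middleware that pushes the same value with `LogContext.PushProperty`.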
### 2.2 Log Levels

| Level | Usage | Examples in this codebase |
|-------|-------|---------------------------|
| `Error` | Unhandled exceptions, infra failures | DB connection failure, sops decrypt failure on host |
| `Warning` | Business exception caught | Existing `BusinessExceptionHandler` already does this; keep as-is |
| `Information` | Significant business events | Login, RegisterUser, RegisterDevice, role change, resource upload, detection-class CRUD |
| `Debug` | Diagnostic detail | Request/response payloads (dev only, never in production); query parameters |

### 2.3 Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | console + `logs/log.txt` (rolling daily) | 7 daily files (set `retainedFileCountLimit: 7` explicitly; the `Serilog.Sinks.File` default is 31) |
| Test (CI) | console (captured by Woodpecker UI) | 14 days (Woodpecker artifact retention) |
| Staging | container stdout → `journald` on the host | 7 days; `journalctl --vacuum-time=7d` cron |
| Production | container stdout → `journald` on the host | 30 days; `journalctl --vacuum-time=30d` cron |

> A central log aggregator (Loki / OpenSearch) is **out of scope for cycle 1**; host `journald` is the entire pipeline. Recorded as **Drift L**.

### 2.4 PII Rules

| Rule | Implementation |
|------|----------------|
| Never log passwords | Exclude `LoginRequest.Password`, `RegisterUserRequest.Password`, and the response body of `POST /devices` (plaintext one-shot password). Add a `[Serilog.Sensitive]`-style helper or a `Destructure.ByTransforming<T>(t => …)` per DTO. (`GetResourceRequest.Password` was previously listed; the DTO was deleted in cycle 2 with the encrypted-download endpoint.) |
| Never log JWT tokens | Today the only logging on the `/login` path is `BusinessExceptionHandler` on failure, and that log line does not include the response body. Verify in Step 7 that no request-logger middleware logs response bodies. |
| Mask emails | Use last-4 + `@domain` form for INFO-level logs (`***1234@example.com`); full email allowed at DEBUG only. The `BusinessExceptionHandler` log line `"Caught BusinessException: {Message}"` may include emails embedded in messages; this is tightened in Step 7. |
| User IDs | `User.Id` is an opaque GUID and safe to log; use it instead of email for correlation. |

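
The password and email rules above might look like this in code. It is a sketch only: the `LoginRequest` record shape (in particular its `Email` property) and the `EmailMask` helper are hypothetical, and dropping the tail entirely for local parts of four characters or fewer is an assumed design choice so that short addresses are not revealed in full.

```csharp
using Serilog;

// Destructuring policy: when a LoginRequest is logged with {@Request}, Serilog
// serializes the projected object instead, so the password never reaches a sink.
Log.Logger = new LoggerConfiguration()
    .Destructure.ByTransforming<LoginRequest>(r => new
    {
        Email = EmailMask.Mask(r.Email),
        Password = "***"
    })
    .WriteTo.Console()
    .CreateLogger();

Log.Information("Login attempt {@Request}",
    new LoginRequest("operator1234@example.com", "s3cret"));
Log.CloseAndFlush();

// Hypothetical DTO shape; the real LoginRequest may differ.
public record LoginRequest(string Email, string Password);

public static class EmailMask
{
    // Keeps the last four characters of the local part plus the domain.
    public static string Mask(string email)
    {
        var at = email.LastIndexOf('@');
        if (at <= 0) return "***"; // not a recognizable address: hide everything

        var local = email[..at];
        var tail = local.Length > 4 ? local[^4..] : string.Empty;
        return $"***{tail}{email[at..]}";
    }
}
```

The transform only applies when the object is destructured with the `@` operator; a DTO formatted through `ToString()` or string interpolation bypasses it, which is one reason to prefer logging `User.Id` over request objects.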
## 3. Metrics

### 3.1 Endpoint

`GET /metrics` exposes the Prometheus exposition format. Add it via `prometheus-net.AspNetCore` 8.x (latest stable for the .NET 10 baseline; verify the version against the released NuGet package before wiring).

> Exposure boundary: `/metrics` MUST NOT be reachable from the public CORS allow-list. The Nginx reverse proxy on `admin.azaion.com` will expose only `/login`, `/users*`, `/devices`, `/resources*`, `/classes*`, `/health`. `/metrics` and `/swagger` stay on the internal interface (separate Nginx server block bound to the management VLAN, OR `localhost`-only listener).

### 3.2 Metrics

| Metric | Type | Source | Labels |
|--------|------|--------|--------|
| `http_requests_total` | Counter | ASP.NET request pipeline | `method`, `endpoint`, `status_code` |
| `http_request_duration_seconds` | Histogram | ASP.NET request pipeline | `method`, `endpoint` |
| `http_requests_in_progress` | Gauge | ASP.NET request pipeline | `method` |
| `db_command_duration_seconds` | Histogram | linq2db trace hook | `operation` (`select`/`insert`/`update`/`delete`) |
| `db_command_failures_total` | Counter | linq2db trace hook | `operation`, `sqlstate` |
| `auth_login_failures_total` | Counter | `AuthService.ValidateUser` exception path | `reason` (`unknown_user`, `bad_password`, `disabled`) |
| `business_exceptions_total` | Counter | `BusinessExceptionHandler` | `error_code` (the existing `ExceptionEnum`) |
| `resource_upload_bytes_total` | Counter | `ResourcesService.SaveResource` | `data_folder` |
| `resource_upload_failures_total` | Counter | same | `reason` |
| `detection_classes_total` | Gauge | refresh on CRUD | none |
| `users_active_total` | Gauge | refresh on CRUD + on a 5-min timer | `role` |
| Process / runtime | (auto) | `prometheus-net.DotNetRuntime` | gen0/1/2 GC, JIT, threadpool, etc. |

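
Two of the application metrics from the inventory above could be registered with prometheus-net along these lines. This is a sketch, assuming the `prometheus-net.AspNetCore` package; the `AppMetrics` class, its method names, and the bucket layout are hypothetical, and the internal-only binding of `/metrics` is left out.

```csharp
using Prometheus;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.UseRouting();
app.UseHttpMetrics(); // built-in HTTP request metrics from prometheus-net
app.MapMetrics();     // serves GET /metrics in the Prometheus exposition format

app.Run();

// Hypothetical holder for application-level metrics.
public static class AppMetrics
{
    private static readonly Counter LoginFailures = Metrics.CreateCounter(
        "auth_login_failures_total",
        "Failed login attempts by reason.",
        new CounterConfiguration { LabelNames = new[] { "reason" } });

    private static readonly Histogram DbCommandDuration = Metrics.CreateHistogram(
        "db_command_duration_seconds",
        "Duration of database commands in seconds.",
        new HistogramConfiguration
        {
            LabelNames = new[] { "operation" },
            // Assumed layout: 12 buckets doubling from 1 ms.
            Buckets = Histogram.ExponentialBuckets(0.001, 2, 12)
        });

    // reason is one of: unknown_user, bad_password, disabled.
    public static void RecordLoginFailure(string reason) =>
        LoginFailures.WithLabels(reason).Inc();

    // operation is one of: select, insert, update, delete.
    public static void RecordDbCommand(string operation, TimeSpan elapsed) =>
        DbCommandDuration.WithLabels(operation).Observe(elapsed.TotalSeconds);
}
```

As far as I recall, the built-in middleware names its request counter `http_requests_received_total` and labels the status as `code`, so matching the names in the table may need custom registration or relabeling at scrape time; this is worth checking against the package version actually installed.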
### 3.3 System Metrics

CPU, RSS, file descriptors, and network I/O are collected by **node-exporter** running on the host as a sibling container. The Admin API itself does NOT export host-level metrics.

### 3.4 Business Metrics

Mapped to the verified ACs in `_docs/02_document/tests/blackbox-tests.md`. Cycle-1 cut: `users_active_total` (AC-01..AC-12 user lifecycle) and `detection_classes_total` (AZ-513). The previously planned `resource_download_bytes_total` was dropped in cycle 2 along with `ResourcesService.GetEncryptedResource` itself; only the upload-side counters remain.

### 3.5 Collection

| Setting | Value |
|---------|-------|
| Scrape interval | 15s (set explicitly in `scrape_interval`; the Prometheus built-in default is 1m) |
| Scrape source | `node-exporter` for host; the Admin API container for app metrics |
| Storage | local Prometheus on the host, retention 14 days (cycle 1 budget) |
| Visualization | local Grafana, single dashboard (§6) |

## 4. Distributed Tracing

**Cycle 1**: scaffold only. Produce a trace ID per request, propagate via `traceparent` (W3C), and emit it as the `correlation_id` field in JSON logs. **Do NOT yet** ship spans to a collector: there is no Jaeger / Tempo running, and the Admin API has no downstream services to trace into. Tracing pays back its cost when there's a chain to follow; cycle 1 has none.

| Setting | Value |
|---------|-------|
| SDK | `OpenTelemetry.Extensions.Hosting` + `OpenTelemetry.Instrumentation.AspNetCore` |
| Propagation | W3C Trace Context (`traceparent`); automatic when `OpenTelemetry.Instrumentation.AspNetCore` is registered |
| Sampling | 100% in dev/staging, 10% in production (deferred; no exporter yet) |
| Span naming | `<service>.<operation>`: service `azaion.admin-api`, operation `<HTTP method> <route template>` |
| Exporter | none in cycle 1 (logs only) |

> Recorded as **Drift M**: wire a Tempo / Jaeger exporter once a downstream service exists.

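
The SDK-level scaffold described in this section could be registered as sketched below, assuming the two OpenTelemetry packages named in the table. Wrapping the ratio sampler in a `ParentBasedSampler` is an added assumption (it makes the service honor an upstream sampling decision carried in `traceparent`); no exporter is registered, matching the cycle-1 scope.

```csharp
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// 10% of new traces in production, everything elsewhere.
var sampleRatio = builder.Environment.IsProduction() ? 0.10 : 1.0;

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource.AddService("azaion.admin-api"))
    .WithTracing(tracing =>
    {
        // Creates one server span per request and reads incoming traceparent headers.
        tracing.AddAspNetCoreInstrumentation();

        // Follow the caller's sampling decision when present, else sample by ratio.
        tracing.SetSampler(
            new ParentBasedSampler(new TraceIdRatioBasedSampler(sampleRatio)));

        // Intentionally no exporter: in cycle 1 the trace ID is only surfaced in logs.
    });

var app = builder.Build();
app.Run();
```

With this in place, `Activity.Current?.TraceId` should be available inside request handlers for use as the log `correlation_id`. Adding an exporter later would be a one-line change inside the same `WithTracing` callback.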
## 5. Alerting

| Severity | Response time | Conditions for this service | Channel |
|----------|---------------|------------------------------|---------|
| Critical | 5 min | `up{job="admin-api"} == 0` for 1 min · `/health` fails for 2 min · `business_exceptions_total{error_code="DbFailure"}` rate > 1/s for 1 min | Slack `#azaion-ops` + on-call email (cycle 1; PagerDuty deferred until on-call rotation exists) |
| High | 30 min | Error rate > 5% for 5 min (`http_requests_total{status_code=~"5.."}/total`) · P95 latency > 2× baseline for 10 min · `auth_login_failures_total` rate > 10/s for 1 min (possible brute force) | Slack `#azaion-ops` + email |
| Medium | 4 h | Host disk > 80% · `db_command_failures_total` rate > 0.1/s for 10 min · process RSS > 80% of container limit | Slack `#azaion-ops` |
| Low | Next business day | Deprecated package usage from `dotnet list package --deprecated` | Slack `#azaion-eng` |

Baseline values (P95) come from the cycle-1 perf report:

- `/login` p95 ≈ 33 ms → high-latency alert at p95 > 66 ms for 10 min
- `/users` (500 users) p95 ≈ 152 ms → high-latency alert at p95 > 305 ms for 10 min

Alert routing in cycle 1 is **inform-only**: no PagerDuty escalation, no auto-rollback. The deploy procedure (Step 6) documents the manual rollback path.

## 6. Dashboards

**Operations dashboard** (Grafana, single panel set; cycle 1):

- Service `up` (admin-api, postgres, nginx): stat panel
- HTTP request rate (req/s) by endpoint: time series
- HTTP error rate (% of 5xx): time series with the High threshold band overlaid
- Latency P50 / P95 / P99 by endpoint: time series, P95 baseline reference line
- DB command rate + failure rate: time series
- Container CPU / RSS / FDs: time series (from node-exporter)
- Active alerts: table panel

**Business dashboard** (cycle 1):

- `users_active_total` by role: stat panel + sparkline
- `detection_classes_total`: stat panel
- `resource_upload_bytes_total` rate (1h window): time series
- Login success/failure ratio (24h): donut

Dashboards stored as code in `monitoring/grafana/admin-api.json` (introduced in Step 7).

## 7. Health Checks

Add two Minimal API endpoints under `/health`:

| Probe | Endpoint | What it checks | Surface |
|-------|----------|----------------|---------|
| Liveness | `GET /health/live` | Process is responsive (always 200 unless the process is wedged) | Used by Docker `HEALTHCHECK` |
| Readiness | `GET /health/ready` | DB reader connection + DB admin connection (one-shot `SELECT 1` each, 2s timeout) | Used by Nginx upstream check + the deploy script (Step 6) post-deploy gate |

Endpoints are anonymous (no JWT) but bound only to the management VLAN (or `localhost` listener), the same exposure rule as `/metrics`.

> Failure mode: if the DB is unreachable for 30 s, `/health/ready` returns 503; Nginx pulls the upstream, returning 503 to clients (no silent traffic loss). The container itself stays running so a transient DB blip does not trigger Docker restart.

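
The two probes could be sketched as Minimal API endpoints like this. Several things here are assumptions rather than project facts: Npgsql as the PostgreSQL driver, the connection-string names `Reader` and `Admin`, and the response bodies. The management-VLAN or `localhost` binding is not shown.

```csharp
using Npgsql;

var builder = WebApplication.CreateBuilder(args);

// Hypothetical connection-string names; the real configuration keys may differ.
var readerCs = builder.Configuration.GetConnectionString("Reader")
    ?? throw new InvalidOperationException("Missing connection string 'Reader'.");
var adminCs = builder.Configuration.GetConnectionString("Admin")
    ?? throw new InvalidOperationException("Missing connection string 'Admin'.");

var app = builder.Build();

// Liveness: no dependencies, answers as long as the process can serve requests.
app.MapGet("/health/live", () => Results.Ok(new { status = "live" }));

// Readiness: both DB connections must answer SELECT 1 within 2 seconds each.
app.MapGet("/health/ready", async (CancellationToken ct) =>
{
    var ready = await CanQueryAsync(readerCs, ct) && await CanQueryAsync(adminCs, ct);
    return ready
        ? Results.Ok(new { status = "ready" })
        : Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
});

app.Run();

// Opens a connection and runs SELECT 1; any failure or timeout counts as not ready.
static async Task<bool> CanQueryAsync(string connectionString, CancellationToken ct)
{
    using var timeout = CancellationTokenSource.CreateLinkedTokenSource(ct);
    timeout.CancelAfter(TimeSpan.FromSeconds(2));
    try
    {
        await using var connection = new NpgsqlConnection(connectionString);
        await connection.OpenAsync(timeout.Token);
        await using var command = new NpgsqlCommand("SELECT 1", connection);
        await command.ExecuteScalarAsync(timeout.Token);
        return true;
    }
    catch (Exception)
    {
        return false;
    }
}
```

Keeping liveness free of database calls is what lets the container stay up during a DB outage while readiness alone reports 503. ASP.NET Core's built-in `AddHealthChecks` / `MapHealthChecks` would be an alternative way to get the same split.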
## 8. Drifts Logged Here

| ID | Severity | Description | Resolved In |
|----|----------|-------------|-------------|
| H | Low | `docker-compose.test.yml` health check is TCP-only; upgrade to `/health/live` once available | Step 7 |
| K | Medium (NEW) | Metrics + tracing not implemented in cycle 1; only the plan + `/health` ship | Future cycle |
| L | Low (NEW) | No central log aggregator; `journald` only | Future cycle |
| M | Low (NEW) | Tracing has no exporter (cycle 1 = trace IDs in logs only) | Future cycle when downstream services exist |

## 9. Self-verification

- [x] Structured JSON logging format defined with `timestamp`, `level`, `service`, `correlation_id`, `message`, `context`.
- [x] Metrics endpoint specified (`/metrics`, internal-only) with full app/system/business metric inventory.
- [x] OpenTelemetry tracing configured at the SDK level (cycle 1) with future exporter wiring (Drift M).
- [x] Alert severities with response times and channels defined; baselines tied to perf report numbers.
- [x] Dashboards defined for operations and business metrics.
- [x] PII exclusion rules cover passwords, JWTs, and email masking; refers to specific DTO field names.