refactor: remove deploy.cmd and update Dockerfile for health checks

- Deleted the deploy.cmd script as it was no longer needed.
- Updated Dockerfile to include curl for health checks and added a non-root user for improved security.
- Modified health check command to use curl for better reliability.
- Adjusted docker-compose.test.yml to reflect changes in health check configuration.
- Cleaned up appsettings.json and removed unused configuration properties.
- Removed Resource entity and related requests from the codebase as part of the architectural shift.
- Updated documentation to reflect the removal of hardware binding and related endpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-21 10:51:10 +00:00 · 2026-05-13 08:47:21 +03:00
parent 43fe38e67d
commit c7b297de83
76 changed files with 4034 additions and 832 deletions
@@ -0,0 +1,204 @@
+# Azaion Admin API — Observability
+
+**Date**: 2026-05-13 · **Cycle**: 1 · **Status**: planning artifact (no code changes; concrete wiring lands in Step 7).
+
+## 1. Current State (audit)
+
+| Pillar | Today | Gap |
+|--------|-------|-----|
+| Logging | Serilog 4.1.0 → Console + rolling file `logs/log.txt` (daily); MinimumLevel `Information`; FromLogContext enrichment | No structured fields beyond defaults; one unstructured `LogInformation($"…")` in `ResourcesService.SaveResource` (security audit F-12); SQL trace bypasses Serilog (`Console.WriteLine`); no correlation IDs |
+| Metrics | none | No `/metrics` endpoint; no system, app, or business metrics |
+| Tracing | none | No OpenTelemetry, no W3C trace context |
+| Health checks | none in code; `docker-compose.test.yml` uses raw TCP probe | No `/health` endpoint (Drift H from Step 2 + skill self-verification) |
+| Alerting | none | No alerts wired to any channel |
+
+This step closes the planning gap; implementation lands incrementally — `/health` and structured logging in cycle 1 (Step 7), metrics + tracing in a later cycle (Drift K).
+
+## 2. Logging
+
+### 2.1 Format
+
+Structured JSON to **stdout/stderr only** in containers. The current rolling-file sink is **dropped from the production runtime** (and the `/app/logs` bind mount becomes optional) because:
+
+- Container logs should be collected by the platform, not the app.
+- A bind-mounted file silently fills the host disk when log rotation lags.
+- We currently have no log shipper, so logs already live only in `docker logs` for ops triage.
+
+The existing console sink stays. The file sink is kept ONLY in `Development` (gated by `ASPNETCORE_ENVIRONMENT`).
+
+```json
+{
+  "timestamp": "2026-05-13T06:48:01.123Z",
+  "level": "Information",
+  "service": "azaion.admin-api",
+  "revision": "a1b2c3d4e5f6",
+  "correlation_id": "0HMU7…",
+  "user_id": null,
+  "message": "User registered",
+  "context": {
+    "endpoint": "POST /users",
+    "duration_ms": 47
+  }
+}
+```
+
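+Achieved by adding `Serilog.Formatting.Compact.RenderedCompactJsonFormatter` to the console sink, plus the three enrichment sources below (one already present). The compact formatter emits `@t`, `@l`, and `@m` property names (and omits `@l` at `Information`), so producing the `timestamp` / `level` / `message` names shown above needs either a custom `ITextFormatter` or a rename at collection time; to be decided in Step 7.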
+
+| Enricher | Source | Purpose |
+|----------|--------|---------|
+| `FromLogContext` | already present | scoped properties |
+| `Serilog.Enrichers.Environment` (new) | `ENV` vars | `service`, `revision` (`AZAION_REVISION`) |
+| `Serilog.AspNetCore.RequestLoggingOptions` (new) | ASP.NET pipeline | request `correlation_id` from `Activity.Current.TraceId` (or generated UUID v7 if no Activity) |
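
The record shape from §2.1 could be guarded by a small contract check, for example as a CI smoke test over captured container stdout. The following is a minimal sketch in Python, used only for illustration since the service itself is .NET. `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the required key set is an assumption read off the sample record rather than an agreed schema.

```python
import json

# Assumption: these top-level keys mirror the sample record in section 2.1.
REQUIRED_FIELDS = (
    "timestamp", "level", "service", "revision",
    "correlation_id", "user_id", "message", "context",
)


def missing_fields(line: str) -> list[str]:
    """Return the required top-level keys absent from one JSON log line."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("log line is not a JSON object")
    # A key that is present with a null value (e.g. user_id) still counts as present.
    return [name for name in REQUIRED_FIELDS if name not in record]


# The sample record from section 2.1, serialized as one log line.
sample = json.dumps({
    "timestamp": "2026-05-13T06:48:01.123Z",
    "level": "Information",
    "service": "azaion.admin-api",
    "revision": "a1b2c3d4e5f6",
    "correlation_id": "0HMU7…",
    "user_id": None,
    "message": "User registered",
    "context": {"endpoint": "POST /users", "duration_ms": 47},
})
```

Calling `missing_fields(sample)` yields an empty list; a line without `correlation_id` would be reported by name, which makes a regression in the enricher wiring visible early.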
+
+### 2.2 Log Levels
+
+| Level | Usage | Examples in this codebase |
+|-------|-------|---------------------------|
+| `Error` | Unhandled exceptions, infra failures | DB connection failure, sops decrypt failure on host |
+| `Warning` | Business exception caught | Existing `BusinessExceptionHandler` already does this — keep as-is |
+| `Information` | Significant business events | Login, RegisterUser, RegisterDevice, role change, resource upload, detection-class CRUD |
+| `Debug` | Diagnostic detail | Request/response payloads (dev only — never in production); query parameters |
+
+### 2.3 Retention
+
+| Environment | Destination | Retention |
+|-------------|-------------|-----------|
+| Development | console + `logs/log.txt` (rolling daily) | 7 daily files (set `retainedFileCountLimit: 7` explicitly; the `Serilog.Sinks.File` default is 31) |
+| Test (CI) | console (captured by Woodpecker UI) | 14 days (Woodpecker artifact retention) |
+| Staging | container stdout → `journald` on the host | 7 days; `journalctl --vacuum-time=7d` cron |
+| Production | container stdout → `journald` on the host | 30 days; `journalctl --vacuum-time=30d` cron |
+
+> A central log aggregator (Loki / OpenSearch) is **out of scope for cycle 1** — host `journald` is the entire pipeline. Recorded as **Drift L**.
+
+### 2.4 PII Rules
+
+| Rule | Implementation |
+|------|----------------|
+| Never log passwords | `LoginRequest.Password`, `RegisterUserRequest.Password`, `GetResourceRequest.Password`, the response body of `POST /devices` (plaintext one-shot password). Add a `[Serilog.Sensitive]`-style helper or a `Destructure.ByTransforming<T>(t => …)` per DTO. |
+| Never log JWT tokens | The `/login` response body is logged today only by `BusinessExceptionHandler` on failure, which doesn't include the body. Verify in Step 7 that no request-logger middleware logs response bodies. |
+| Mask emails | Use last-4 + `@domain` form for INFO-level logs (`***1234@example.com`); full email allowed at DEBUG only. The `BusinessExceptionHandler` log line `"Caught BusinessException: {Message}"` may include emails embedded in messages; this is to be tightened in Step 7. |
+| User IDs | `User.Id` is an opaque GUID — safe to log; use it instead of email in correlation. |
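
The email masking rule above can be sketched as follows. This is an illustration in Python, not the service's .NET implementation. `mask_email` is a hypothetical helper, and the handling of short local parts (four characters or fewer are hidden entirely) is an assumption the rule table does not state.

```python
def mask_email(email: str, keep: int = 4) -> str:
    """Mask an address for INFO-level logs: '***' + last `keep` chars of the local part + '@domain'."""
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        # Not a recognisable address: reveal nothing.
        return "***"
    # Assumption: a local part of `keep` characters or fewer is hidden entirely,
    # otherwise the masked form would print the whole address.
    tail = local[-keep:] if len(local) > keep else ""
    return f"***{tail}@{domain}"
```

For example, `mask_email("operator1234@example.com")` gives `***1234@example.com`, while `mask_email("bob@example.com")` gives `***@example.com`.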
+
+## 3. Metrics
+
+### 3.1 Endpoint
+
+`GET /metrics` exposes metrics in the Prometheus text exposition format. Add via `prometheus-net.AspNetCore` 8.x (latest stable for the .NET 10 baseline; verify the version against the released NuGet package before wiring).
+
+> Exposure boundary: `/metrics` MUST NOT be reachable from the public CORS allow-list. The Nginx reverse proxy on `admin.azaion.com` will expose only `/login`, `/users*`, `/devices`, `/resources*`, `/classes*`, `/health`. `/metrics` and `/swagger` stay on the internal interface (separate Nginx server block bound to the management VLAN, OR `localhost`-only listener).
+
+### 3.2 Metrics
+
+| Metric | Type | Source | Labels |
+|--------|------|--------|--------|
+| `http_requests_total` | Counter | ASP.NET request pipeline | `method`, `endpoint`, `status_code` |
+| `http_request_duration_seconds` | Histogram | ASP.NET request pipeline | `method`, `endpoint` |
+| `http_requests_in_progress` | Gauge | ASP.NET request pipeline | `method` |
+| `db_command_duration_seconds` | Histogram | linq2db trace hook | `operation` (`select`/`insert`/`update`/`delete`) |
+| `db_command_failures_total` | Counter | linq2db trace hook | `operation`, `sqlstate` |
+| `auth_login_failures_total` | Counter | `AuthService.ValidateUser` exception path | `reason` (`unknown_user`, `bad_password`, `disabled`) |
+| `business_exceptions_total` | Counter | `BusinessExceptionHandler` | `error_code` (the existing `ExceptionEnum`) |
+| `resource_upload_bytes_total` | Counter | `ResourcesService.SaveResource` | `data_folder` |
+| `resource_upload_failures_total` | Counter | same | `reason` |
+| `resource_download_bytes_total` | Counter | `ResourcesService.GetEncryptedResource` | `data_folder` |
+| `detection_classes_total` | Gauge | refresh on CRUD | none |
+| `users_active_total` | Gauge | refresh on CRUD + on a 5-min timer | `role` |
+| Process / runtime | (auto) | `prometheus-net.DotNetRuntime` | gen0/1/2 GC, JIT, threadpool, etc. |
+
+### 3.3 System Metrics
+
+CPU, RSS, file descriptors, network I/O — collected by **node-exporter** running on the host as a sibling container. The Admin API itself does NOT export host-level metrics.
+
+### 3.4 Business Metrics
+
+Mapped to the verified ACs in `_docs/02_document/tests/blackbox-tests.md`. Cycle-1 cut: `users_active_total` (AC-01..AC-12 user lifecycle) and `detection_classes_total` (AZ-513). Resource-related business metrics deferred until the resource flow is exercised by real users post-AZ-197.
+
+### 3.5 Collection
+
+| Setting | Value |
+|---------|-------|
+| Scrape interval | 15s (set explicitly; the Prometheus default `scrape_interval` is 1m) |
+| Scrape source | `node-exporter` for host; the Admin API container for app metrics |
+| Storage | local Prometheus on the host, retention 14 days (cycle 1 budget) |
+| Visualization | local Grafana, dashboards defined in §6 |
+
+## 4. Distributed Tracing
+
+**Cycle 1**: scaffold only — produce a trace ID per request, propagate via `traceparent` (W3C), and emit it as the `correlation_id` field in JSON logs. **Do NOT yet** ship spans to a collector — there is no Jaeger / Tempo running, and the Admin API has no downstream services to trace into. Tracing pays back its cost when there's a chain to follow; cycle 1 has none.
+
+| Setting | Value |
+|---------|-------|
+| SDK | `OpenTelemetry.Extensions.Hosting` + `OpenTelemetry.Instrumentation.AspNetCore` |
+| Propagation | W3C Trace Context (`traceparent`) — auto when `OpenTelemetry.Instrumentation.AspNetCore` is registered |
+| Sampling | 100% in dev/staging, 10% in production (deferred — no exporter yet) |
+| Span naming | `<service>.<operation>` — service `azaion.admin-api`, operation `<HTTP method> <route template>` |
+| Exporter | none in cycle 1 (logs only) |
+
+> Recorded as **Drift M** — wire a Tempo / Jaeger exporter once a downstream service exists.
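
The cycle-1 scaffold (take the trace ID from an incoming `traceparent`, otherwise mint one, and use it as `correlation_id`) can be sketched as below. This is a Python illustration of the header handling only. In the service this would come from `Activity.Current.TraceId`. `correlation_id_from` is a hypothetical name, only version `00` headers are recognised, and the fallback is a random 128-bit hex string standing in for the UUID v7 mentioned in §2.1.

```python
import re
import secrets
from typing import Optional

# W3C Trace Context, version 00:
#   00-<32 hex trace-id>-<16 hex parent-id>-<2 hex trace-flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def correlation_id_from(traceparent: Optional[str]) -> str:
    """Return the trace-id of a valid traceparent header, else a fresh random ID."""
    if traceparent:
        match = _TRACEPARENT.match(traceparent.strip())
        # The spec treats an all-zero trace-id or parent-id as invalid.
        if match and set(match.group(1)) != {"0"} and set(match.group(2)) != {"0"}:
            return match.group(1)
    # Assumption: random 128-bit hex replaces the UUID v7 fallback for brevity.
    return secrets.token_hex(16)
```

A request carrying `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01` would log `4bf92f3577b34da6a3ce929d0e0e4736` as its `correlation_id`; a request without the header gets a new 32-character ID.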
+
+## 5. Alerting
+
+| Severity | Response time | Conditions for this service | Channel |
+|----------|---------------|------------------------------|---------|
+| Critical | 5 min | `up{job="admin-api"} == 0` for 1 min · `/health` fails for 2 min · `business_exceptions_total{error_code="DbFailure"}` rate > 1/s for 1 min | Slack `#azaion-ops` + on-call email (cycle 1 — PagerDuty deferred until on-call rotation exists) |
+| High | 30 min | Error rate > 5% for 5 min (`http_requests_total{status_code=~"5.."}/total`) · P95 latency > 2× baseline for 10 min · `auth_login_failures_total` rate > 10/s for 1 min (possible brute force) | Slack `#azaion-ops` + email |
+| Medium | 4 h | Host disk > 80% · `db_command_failures_total` rate > 0.1/s for 10 min · process RSS > 80% of container limit | Slack `#azaion-ops` |
+| Low | Next business day | Deprecated package usage from `dotnet list package --deprecated` | Slack `#azaion-eng` |
+
+Baseline values (P95) come from the cycle-1 perf report:
+- `/login` p95 ≈ 33 ms → high-latency alert at p95 > 66 ms for 10 min
+- `/users` (500 users) p95 ≈ 152 ms → high-latency alert at p95 > 304 ms for 10 min
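
The High-severity arithmetic can be written out as a short sketch: the latency threshold is twice the quoted p95 baseline, and the error-rate condition is the 5xx share of requests in the window. This is illustrative Python, not alert-rule syntax. The names are hypothetical, and the "for 5 min" / "for 10 min" hold durations are not modelled.

```python
# Cycle-1 p95 baselines in milliseconds, as quoted from the perf report.
P95_BASELINE_MS = {"/login": 33, "/users": 152}

LATENCY_FACTOR = 2        # High severity: p95 > 2x baseline
ERROR_RATE_LIMIT = 0.05   # High severity: more than 5% of requests are 5xx


def latency_threshold_ms(endpoint: str) -> int:
    """p95 value (ms) above which the High latency alert should fire."""
    return LATENCY_FACTOR * P95_BASELINE_MS[endpoint]


def error_rate_breached(errors_5xx: int, total: int) -> bool:
    """True when the 5xx share of requests in the window exceeds the limit."""
    if total == 0:
        return False  # no traffic in the window, nothing to alert on
    return errors_5xx / total > ERROR_RATE_LIMIT
```

With these inputs `latency_threshold_ms("/login")` is 66 and `latency_threshold_ms("/users")` is 304. Six 5xx responses out of 100 breach the error-rate condition, while exactly five do not, because the comparison is strict.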
+
+Alert routing in cycle 1 is **inform-only** — no PagerDuty escalation, no auto-rollback. The deploy procedure (Step 6) documents the manual rollback path.
+
+## 6. Dashboards
+
+**Operations dashboard** (Grafana, single panel set; cycle 1):
+
+- Service `up` (admin-api, postgres, nginx) — stat panel
+- HTTP request rate (req/s) by endpoint — time series
+- HTTP error rate (% of 5xx) — time series with the High threshold band overlaid
+- Latency P50 / P95 / P99 by endpoint — time series, P95 baseline reference line
+- DB command rate + failure rate — time series
+- Container CPU / RSS / FDs — time series (from node-exporter)
+- Active alerts — table panel
+
+**Business dashboard** (cycle 1):
+
+- `users_active_total` by role — stat panel + sparkline
+- `detection_classes_total` — stat panel
+- `resource_upload_bytes_total` rate (1h window) — time series
+- Login success/failure ratio (24h) — donut
+
+Dashboards stored as code in `monitoring/grafana/admin-api.json` (introduced in Step 7).
+
+## 7. Health Checks
+
+Add two Minimal API endpoints under `/health`:
+
+| Probe | Endpoint | What it checks | Surface |
+|-------|----------|----------------|---------|
+| Liveness | `GET /health/live` | Process is responsive (always 200 unless the process is wedged) | Used by Docker `HEALTHCHECK` |
+| Readiness | `GET /health/ready` | DB reader connection + DB admin connection (one-shot `SELECT 1` each, 2s timeout) | Used by Nginx upstream check + the deploy script (Step 6) post-deploy gate |
+
+Endpoints are anonymous (no JWT) but bound only to the management VLAN (or `localhost` listener) — same exposure rule as `/metrics`.
+
+> Failure mode: if the DB is unreachable for 30 s, `/health/ready` returns 503; Nginx pulls the upstream, returning 503 to clients (no silent traffic loss). The container itself stays running so a transient DB blip does not trigger Docker restart.
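
The probe semantics above (liveness does no dependency work; readiness runs each DB check once under a 2 s timeout and maps any failure to 503) can be sketched as follows. This is a Python illustration of the control flow only. The `Check` callables are hypothetical stand-ins for the reader and admin `SELECT 1` probes, and running them concurrently is an assumption.

```python
import asyncio
from typing import Awaitable, Callable

# A check raises on failure; in the service this would wrap a one-shot `SELECT 1`.
Check = Callable[[], Awaitable[None]]


async def readiness_status(checks: dict[str, Check], timeout_s: float = 2.0) -> int:
    """Return 200 if every check completes within its timeout, else 503."""

    async def passed(check: Check) -> bool:
        try:
            await asyncio.wait_for(check(), timeout=timeout_s)
            return True
        except Exception:  # connection errors and timeouts alike
            return False

    # Assumption: checks run concurrently, so the probe takes at most ~timeout_s.
    results = await asyncio.gather(*(passed(c) for c in checks.values()))
    return 200 if all(results) else 503


def liveness_status() -> int:
    """Liveness does no dependency checks: answering at all is the signal."""
    return 200
```

Keeping the DB checks out of the liveness path is what lets a transient DB outage fail readiness (so Nginx pulls the upstream) without tripping the Docker `HEALTHCHECK` and restarting the container.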
+
+## 8. Drifts Logged Here
+
+| ID | Severity | Description | Resolved In |
+|----|----------|-------------|-------------|
+| H  | Low | `docker-compose.test.yml` health check is TCP-only; upgrade to `/health/live` once available | Step 7 |
+| K  | Medium (NEW) | Metrics + tracing not implemented in cycle 1; only the plan, `/health`, and structured logging ship | Future cycle |
+| L  | Low (NEW) | No central log aggregator; `journald` only | Future cycle |
+| M  | Low (NEW) | Tracing has no exporter (cycle 1 = trace IDs in logs only) | Future cycle when downstream services exist |
+
+## 9. Self-verification
+
+- [x] Structured JSON logging format defined with `timestamp`, `level`, `service`, `correlation_id`, `message`, `context`.
+- [x] Metrics endpoint specified (`/metrics`, internal-only) with full app/system/business metric inventory.
+- [x] OpenTelemetry tracing configured at the SDK level (cycle 1) with future exporter wiring (Drift M).
+- [x] Alert severities with response times and channels defined; baselines tied to perf report numbers.
+- [x] Dashboards defined for operations and business metrics.
+- [x] PII exclusion rules cover passwords, JWTs, and email masking; refers to specific DTO field names.