Azaion Admin API — Observability
Date: 2026-05-13 · Cycle: 1 · Status: planning artifact (no code changes; concrete wiring lands in Step 7).
1. Current State (audit)
| Pillar | Today | Gap |
|---|---|---|
| Logging | Serilog 4.1.0 → Console + rolling file logs/log.txt (daily); MinimumLevel Information; FromLogContext enrichment | No structured fields beyond defaults; one unstructured LogInformation($"…") in ResourcesService.SaveResource (security audit F-12); SQL trace bypasses Serilog (Console.WriteLine); no correlation IDs |
| Metrics | none | No /metrics endpoint; no system, app, or business metrics |
| Tracing | none | No OpenTelemetry, no W3C trace context |
| Health checks | none in code; docker-compose.test.yml uses raw TCP probe | No /health endpoint (Drift H from Step 2 + skill self-verification) |
| Alerting | none | No alerts wired to any channel |
This step closes the planning gap; implementation lands incrementally — /health and structured logging in cycle 1 (Step 7), metrics + tracing in a later cycle (Drift K).
2. Logging
2.1 Format
Structured JSON to stdout/stderr only in containers. The current rolling-file sink is dropped from the production runtime (and the /app/logs bind mount becomes optional) because:
- Container logs should be collected by the platform, not the app.
- A bind-mounted file silently fills the host disk when log rotation lags.
- We currently have no log shipper, so logs already live only in docker logs for ops triage.
The existing console sink stays. The file sink is kept ONLY in Development (gated by ASPNETCORE_ENVIRONMENT).
{
  "timestamp": "2026-05-13T06:48:01.123Z",
  "level": "Information",
  "service": "azaion.admin-api",
  "revision": "a1b2c3d4e5f6",
  "correlation_id": "0HMU7…",
  "user_id": null,
  "message": "User registered",
  "context": {
    "endpoint": "POST /users",
    "duration_ms": 47
  }
}
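Achieved by adding Serilog.Formatting.Compact.RenderedCompactJsonFormatter to the console sink, together with the three enrichers listed below. That formatter emits compact keys (@t, @l, @m) rather than timestamp, level, and message, so the exact field names in the sample above need a custom ITextFormatter or a rename step.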
| Enricher | Source | Purpose |
|---|---|---|
| FromLogContext | already present | scoped properties |
| Serilog.Enrichers.Environment (new) | ENV vars | service, revision (AZAION_REVISION) |
| Serilog.AspNetCore.RequestLoggingOptions (new) | ASP.NET pipeline | request correlation_id from Activity.Current.TraceId (or generated UUID v7 if no Activity) |
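The correlation_id rule in the last row can be sketched independently of the ASP.NET pipeline. The Python snippet below is illustrative only (the service itself is .NET). The helper name correlation_id_from is hypothetical, and uuid4 is an assumed stand-in for the UUID v7 fallback, since UUID v7 is not available in every Python standard library version. It accepts only version 00 of the W3C traceparent header.

```python
import re
import uuid

# W3C Trace Context header, version 00:
#   00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}")


def correlation_id_from(traceparent=None):
    """Return the trace-id of a valid traceparent header, else a fresh random ID."""
    if traceparent:
        match = _TRACEPARENT.fullmatch(traceparent.strip())
        if match:
            trace_id, parent_id = match.groups()
            # The spec treats an all-zero trace-id or parent-id as invalid.
            if trace_id.strip("0") and parent_id.strip("0"):
                return trace_id
    # Assumption: uuid4 stands in for the UUID v7 fallback named in the table.
    return uuid.uuid4().hex
```

Using the incoming trace-id where one exists keeps log lines joinable with any upstream caller that already propagates traceparent. The fallback only guarantees uniqueness per request.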
2.2 Log Levels
| Level | Usage | Examples in this codebase |
|---|---|---|
| Error | Unhandled exceptions, infra failures | DB connection failure, sops decrypt failure on host |
| Warning | Business exception caught | Existing BusinessExceptionHandler already does this; keep as-is |
| Information | Significant business events | Login, RegisterUser, RegisterDevice, role change, resource upload, detection-class CRUD |
| Debug | Diagnostic detail | Request/response payloads (dev only, never in production); query parameters |
2.3 Retention
| Environment | Destination | Retention |
|---|---|---|
| Development | console + logs/log.txt (rolling daily) | 7 daily files (set retainedFileCountLimit to 7 explicitly; the Serilog.Sinks.File default is 31) |
| Test (CI) | console (captured by Woodpecker UI) | 14 days (Woodpecker artifact retention) |
| Staging | container stdout → journald on the host | 7 days; journalctl --vacuum-time=7d cron |
| Production | container stdout → journald on the host | 30 days; journalctl --vacuum-time=30d cron |
A central log aggregator (Loki / OpenSearch) is out of scope for cycle 1; host journald is the entire pipeline. Recorded as Drift L.
2.4 PII Rules
| Rule | Implementation |
|---|---|
| Never log passwords | LoginRequest.Password, RegisterUserRequest.Password, GetResourceRequest.Password, the response body of POST /devices (plaintext one-shot password). Add a [Serilog.Sensitive]-style helper or a Destructure.ByTransforming<T>(t => …) per DTO. |
| Never log JWT tokens | The /login response body is logged today only by BusinessExceptionHandler on failure, which doesn't include the body. Verify in Step 7 that no request-logger middleware logs response bodies. |
| Mask emails | Use last-4 + @domain form for INFO-level logs (***123@example.com); full email allowed at DEBUG only. The BusinessExceptionHandler log line "Caught BusinessException: {Message}" may include emails embedded in messages — tightened in Step 7. |
| User IDs | User.Id is an opaque GUID — safe to log; use it instead of email in correlation. |
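The email-masking rule can be illustrated with a small Python sketch (illustrative only; the real implementation would live in the .NET logging layer). The function name mask_email and the keep parameter are hypothetical. The rule above says last-4 while its example shows three trailing characters, so the count is left configurable here rather than fixed.

```python
def mask_email(email, keep=4):
    """Mask the local part of an address, keeping at most `keep` trailing characters."""
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        # Not a recognizable address: reveal nothing.
        return "***"
    # Never reveal the whole local part of a short address.
    visible = local[-keep:] if 0 < keep < len(local) else ""
    return "***" + visible + "@" + domain
```

With keep=3, the address user123@example.com is rendered as ***123@example.com, matching the example in the table. A local part no longer than keep is masked completely, so short addresses are not leaked in full.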
3. Metrics
3.1 Endpoint
GET /metrics exposes metrics in the Prometheus exposition format. Add it via prometheus-net.AspNetCore 8.x (latest stable for the .NET 10 baseline; verify the version against the released NuGet package before wiring).
Exposure boundary: /metrics MUST NOT be reachable from the public CORS allow-list. The Nginx reverse proxy on admin.azaion.com will expose only /login, /users*, /devices, /resources*, /classes*, /health. /metrics and /swagger stay on the internal interface (separate Nginx server block bound to the management VLAN, or a localhost-only listener).
3.2 Metrics
| Metric | Type | Source | Labels |
|---|---|---|---|
| http_requests_total | Counter | ASP.NET request pipeline | method, endpoint, status_code |
| http_request_duration_seconds | Histogram | ASP.NET request pipeline | method, endpoint |
| http_requests_in_progress | Gauge | ASP.NET request pipeline | method |
| db_command_duration_seconds | Histogram | linq2db trace hook | operation (select/insert/update/delete) |
| db_command_failures_total | Counter | linq2db trace hook | operation, sqlstate |
| auth_login_failures_total | Counter | AuthService.ValidateUser exception path | reason (unknown_user, bad_password, disabled) |
| business_exceptions_total | Counter | BusinessExceptionHandler | error_code (the existing ExceptionEnum) |
| resource_upload_bytes_total | Counter | ResourcesService.SaveResource | data_folder |
| resource_upload_failures_total | Counter | same | reason |
| resource_download_bytes_total | Counter | ResourcesService.GetEncryptedResource | data_folder |
| detection_classes_total | Gauge | refresh on CRUD | none |
| users_active_total | Gauge | refresh on CRUD + on a 5-min timer | role |
| Process / runtime | (auto) | prometheus-net.DotNetRuntime | gen0/1/2 GC, JIT, threadpool, etc. |
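To make the label columns concrete, the following Python sketch shows roughly what a scrape of one labeled counter from the table looks like in the Prometheus text format. It is a toy in-memory model, not the prometheus-net API; the class name LabeledCounter and the HELP text are hypothetical.

```python
from collections import defaultdict


def _escape(value):
    # Label values escape backslash, double quote and newline.
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class LabeledCounter:
    """Minimal in-memory counter rendered in the Prometheus text format."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters only go up")
        key = tuple(str(labels[n]) for n in self.label_names)
        self._values[key] += amount

    def render(self):
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(
                f'{n}="{_escape(v)}"' for n, v in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines) + "\n"
```

Every distinct label combination becomes its own time series. That is why a label such as endpoint is usually populated from the route template rather than the raw request path, which would create one series per user ID.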
3.3 System Metrics
CPU, RSS, file descriptors, network I/O — collected by node-exporter running on the host as a sibling container. The Admin API itself does NOT export host-level metrics.
3.4 Business Metrics
Mapped to the verified ACs in _docs/02_document/tests/blackbox-tests.md. Cycle-1 cut: users_active_total (AC-01..AC-12 user lifecycle) and detection_classes_total (AZ-513). Resource-related business metrics deferred until the resource flow is exercised by real users post-AZ-197.
3.5 Collection
| Setting | Value |
|---|---|
| Scrape interval | 15s (set explicitly in the scrape config; Prometheus's built-in default is 1m) |
| Scrape source | node-exporter for host; the Admin API container for app metrics |
| Storage | local Prometheus on the host, retention 14 days (cycle 1 budget) |
| Visualization | local Grafana, single dashboard (§6) |
4. Distributed Tracing
Cycle 1: scaffold only — produce a trace ID per request, propagate via traceparent (W3C), and emit it as the correlation_id field in JSON logs. Do NOT yet ship spans to a collector — there is no Jaeger / Tempo running, and the Admin API has no downstream services to trace into. Tracing pays back its cost when there's a chain to follow; cycle 1 has none.
| Setting | Value |
|---|---|
| SDK | OpenTelemetry.Extensions.Hosting + OpenTelemetry.Instrumentation.AspNetCore |
| Propagation | W3C Trace Context (traceparent) — auto when OpenTelemetry.Instrumentation.AspNetCore is registered |
| Sampling | 100% in dev/staging, 10% in production (deferred — no exporter yet) |
| Span naming | <service>.<operation> — service azaion.admin-api, operation <HTTP method> <route template> |
| Exporter | none in cycle 1 (logs only) |
Recorded as Drift M — wire a Tempo / Jaeger exporter once a downstream service exists.
5. Alerting
| Severity | Response time | Conditions for this service | Channel |
|---|---|---|---|
| Critical | 5 min | up{job="admin-api"} == 0 for 1 min · /health fails for 2 min · business_exceptions_total{error_code="DbFailure"} rate > 1/s for 1 min | Slack #azaion-ops + on-call email (cycle 1; PagerDuty deferred until on-call rotation exists) |
| High | 30 min | Error rate > 5% for 5 min (http_requests_total{status_code=~"5.."}/total) · P95 latency > 2× baseline for 10 min · auth_login_failures_total rate > 10/s for 1 min (possible brute force) | Slack #azaion-ops + email |
| Medium | 4 h | Host disk > 80% · db_command_failures_total rate > 0.1/s for 10 min · process RSS > 80% of container limit | Slack #azaion-ops |
| Low | Next business day | Deprecated package usage from dotnet list package --deprecated | Slack #azaion-eng |
Baseline values (P95) come from the cycle-1 perf report:
- /login p95 ≈ 33 ms → high-latency alert at p95 > 66 ms for 10 min
- /users (500 users) p95 ≈ 152 ms → high-latency alert at p95 > 305 ms for 10 min
Alert routing in cycle 1 is inform-only — no PagerDuty escalation, no auto-rollback. The deploy procedure (Step 6) documents the manual rollback path.
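The "for N min" qualifier on the alert conditions means the threshold must stay breached for the whole hold period, not merely be crossed once. The Python sketch below is a simplified model of that behaviour under stated assumptions: the function name alert_fires is hypothetical, samples arrive in time order as (seconds, value) pairs, and real Prometheus evaluates rules on its own evaluation interval.

```python
def alert_fires(samples, threshold, hold_seconds):
    """Return True if the latest unbroken breach has lasted at least hold_seconds.

    samples: time-ordered list of (timestamp_seconds, value) pairs.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            # Any sample back under the threshold resets the hold timer.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds
```

For the High error-rate rule this would be called with threshold=0.05 and hold_seconds=300. A single sample back under 5% resets the timer, which keeps short spikes from paging anyone.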
6. Dashboards
Operations dashboard (Grafana, single panel set; cycle 1):
- Service up (admin-api, postgres, nginx): stat panel
- HTTP request rate (req/s) by endpoint: time series
- HTTP error rate (% of 5xx): time series with the High threshold band overlaid
- Latency P50 / P95 / P99 by endpoint: time series, P95 baseline reference line
- DB command rate + failure rate: time series
- Container CPU / RSS / FDs: time series (from node-exporter)
- Active alerts: table panel
Business dashboard (cycle 1):
- users_active_total by role: stat panel + sparkline
- detection_classes_total: stat panel
- resource_upload_bytes_total rate (1h window): time series
- Login success/failure ratio (24h): donut
Dashboards stored as code in monitoring/grafana/admin-api.json (introduced in Step 7).
7. Health Checks
Add a /health Minimal API endpoint:
| Probe | Endpoint | What it checks | Surface |
|---|---|---|---|
| Liveness | GET /health/live | Process is responsive (always 200 unless the process is wedged) | Used by Docker HEALTHCHECK |
| Readiness | GET /health/ready | DB reader connection + DB admin connection (one-shot SELECT 1 each, 2s timeout) | Used by Nginx upstream check + the deploy script (Step 6) post-deploy gate |
Endpoints are anonymous (no JWT) but bound only to the management VLAN (or localhost listener) — same exposure rule as /metrics.
Failure mode: if the DB is unreachable for 30 s, /health/ready returns 503; Nginx pulls the upstream, returning 503 to clients (no silent traffic loss). The container itself stays running so a transient DB blip does not trigger a Docker restart.
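The readiness contract (every dependency probe must pass, otherwise 503) can be sketched as follows. This is an illustrative Python model, not the ASP.NET health-check wiring; the function name readiness and the probe names are hypothetical. Each probe is assumed to enforce the 2 s timeout itself and to raise on failure.

```python
def readiness(probes, timeout_seconds=2.0):
    """Run every probe; return (HTTP status, {probe name: error type}).

    probes: mapping of name -> callable(timeout_seconds) that raises on failure.
    """
    failures = {}
    for name, probe in probes.items():
        try:
            # Assumption: the probe applies the timeout to its own SELECT 1.
            probe(timeout_seconds)
        except Exception as exc:
            failures[name] = type(exc).__name__
    return (200 if not failures else 503), failures
```

Returning the failing probe names alongside the status lets the deploy gate log which connection (reader or admin) blocked the rollout. Liveness deliberately runs none of these probes, which is what keeps a DB outage from restarting the container.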
8. Drifts Logged Here
| ID | Severity | Description | Resolved In |
|---|---|---|---|
| H | Low | docker-compose.test.yml health check is TCP-only; upgrade to /health/live once available | Step 7 |
| K | Medium (NEW) | Metrics + tracing not implemented in cycle 1; only the plan + /health ship | Future cycle |
| L | Low (NEW) | No central log aggregator; journald only | Future cycle |
| M | Low (NEW) | Tracing has no exporter (cycle 1 = trace IDs in logs only) | Future cycle when downstream services exist |
9. Self-verification
- Structured JSON logging format defined with timestamp, level, service, correlation_id, message, context.
- Metrics endpoint specified (/metrics, internal-only) with full app/system/business metric inventory.
- OpenTelemetry tracing configured at the SDK level (cycle 1) with future exporter wiring (Drift M).
- Alert severities with response times and channels defined; baselines tied to perf report numbers.
- Dashboards defined for operations and business metrics.
- PII exclusion rules cover passwords, JWTs, and email masking; refers to specific DTO field names.