mirror of https://github.com/azaion/admin.git synced 2026-06-21 16:31:10 +00:00

Oleksandr Bezdieniezhnykh 3a925b9b0f

refactor: remove obsolete resource download and installer endpoints

- Deleted the `POST /resources/get/{dataFolder?}` and `GET /resources/get-installer` endpoints as part of the architectural shift towards simplified resource management.
- Removed associated methods and configurations, including `ResourcesService.GetEncryptedResource`, `ResourcesService.GetInstaller`, and related properties in `ResourcesConfig`.
- Cleaned up environment variables and configuration files to reflect the removal of installer-related settings.
- Eliminated the `GetResourceRequest` DTO and its validator, along with the `WrongResourceName` error code.
- Updated documentation to clarify the changes in resource handling and the retirement of per-user file encryption.

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-14 04:17:55 +03:00

Azaion Admin API — Observability

Date: 2026-05-13 · Cycle: 1 · Status: planning artifact (no code changes; concrete wiring lands in Step 7).

1. Current State (audit)

Pillar	Today	Gap
Logging	Serilog 4.1.0 → Console + rolling file `logs/log.txt` (daily); MinimumLevel `Information`; FromLogContext enrichment	No structured fields beyond defaults; one unstructured `LogInformation($"…")` in `ResourcesService.SaveResource` (security audit F-12); SQL trace bypasses Serilog (`Console.WriteLine`); no correlation IDs
Metrics	none	No `/metrics` endpoint; no system, app, or business metrics
Tracing	none	No OpenTelemetry, no W3C trace context
Health checks	none in code; `docker-compose.test.yml` uses raw TCP probe	No `/health` endpoint (Drift H from Step 2 + skill self-verification)
Alerting	none	No alerts wired to any channel

This step closes the planning gap; implementation lands incrementally — /health and structured logging in cycle 1 (Step 7), metrics + tracing in a later cycle (Drift K).

2. Logging

2.1 Format

Structured JSON to stdout/stderr only in containers. The current rolling-file sink is dropped from the production runtime (and the /app/logs bind mount becomes optional) because:

Container logs should be collected by the platform, not the app.
A bind-mounted file silently fills the host disk when log rotation lags.
We currently have no log shipper, so logs already live only in docker logs for ops triage.

The existing console sink stays. The file sink is kept ONLY in Development (gated by ASPNETCORE_ENVIRONMENT).

{
  "timestamp": "2026-05-13T06:48:01.123Z",
  "level": "Information",
  "service": "azaion.admin-api",
  "revision": "a1b2c3d4e5f6",
  "correlation_id": "0HMU7…",
  "user_id": null,
  "message": "User registered",
  "context": {
    "endpoint": "POST /users",
    "duration_ms": 47
  }
}

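Achieved by adding Serilog.Formatting.Compact.RenderedCompactJsonFormatter to the console sink. Note that the compact formatter writes its own reserved keys (@t, @l, @m) rather than timestamp, level, and message, so the sample above shows the logical schema; emitting those exact field names would need a custom formatter or a mapping at collection time. The additional fields come from three enrichment sources: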

Enricher	Source	Purpose
`FromLogContext`	already present	scoped properties
`Serilog.Enrichers.Environment` (new)	environment variables	`service`, `revision` (`AZAION_REVISION`)
`Serilog.AspNetCore` request logging, via `RequestLoggingOptions.EnrichDiagnosticContext` (new)	ASP.NET pipeline	request `correlation_id` from `Activity.Current.TraceId` (or generated UUID v7 if no Activity)
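
The target schema can be illustrated with a small, language-neutral sketch. It is written in Python purely for brevity and is not the Serilog wiring; the function name `build_log_record`, the `"unknown"` revision fallback, and the UUID4 fallback for the correlation ID are assumptions made for the example.

```python
import json
import os
import uuid
from datetime import datetime, timezone


def build_log_record(level, message, *, correlation_id=None, user_id=None,
                     context=None, now=None):
    """Return one JSON log line shaped like the sample in section 2.1."""
    ts = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {
        # Millisecond precision with a trailing "Z", as in the sample.
        "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%S.")
        + f"{ts.microsecond // 1000:03d}Z",
        "level": level,
        "service": "azaion.admin-api",
        # AZAION_REVISION is the variable named in the enricher table;
        # the "unknown" fallback is an assumption of this sketch.
        "revision": os.environ.get("AZAION_REVISION", "unknown"),
        # Fallback when no trace is active (UUID4 here, not UUID v7).
        "correlation_id": correlation_id or uuid.uuid4().hex,
        "user_id": user_id,
        "message": message,
        "context": context or {},
    }
    return json.dumps(record, separators=(",", ":"))
```

Calling `build_log_record("Information", "User registered", context={"endpoint": "POST /users", "duration_ms": 47})` yields a single-line JSON object carrying the same field names as the sample.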

2.2 Log Levels

Level	Usage	Examples in this codebase
`Error`	Unhandled exceptions, infra failures	DB connection failure, sops decrypt failure on host
`Warning`	Business exception caught	Existing `BusinessExceptionHandler` already does this — keep as-is
`Information`	Significant business events	Login, RegisterUser, RegisterDevice, role change, resource upload, detection-class CRUD
`Debug`	Diagnostic detail	Request/response payloads (dev only — never in production); query parameters

2.3 Retention

Environment	Destination	Retention
Development	console + `logs/log.txt` (rolling daily)	7 daily files (set `retainedFileCountLimit: 7` explicitly; the Serilog file sink default is 31)
Test (CI)	console (captured by Woodpecker UI)	14 days (Woodpecker artifact retention)
Staging	container stdout → `journald` on the host	7 days; `journalctl --vacuum-time=7d` cron
Production	container stdout → `journald` on the host	30 days; `journalctl --vacuum-time=30d` cron

A central log aggregator (Loki / OpenSearch) is out of scope for cycle 1 — host journald is the entire pipeline. Recorded as Drift L.

2.4 PII Rules

Rule	Implementation
Never log passwords	`LoginRequest.Password`, `RegisterUserRequest.Password`, the response body of `POST /devices` (plaintext one-shot password). Add a `[Serilog.Sensitive]`-style helper or a `Destructure.ByTransforming<T>(t => …)` per DTO. (`GetResourceRequest.Password` was previously listed; the DTO was deleted in cycle 2 with the encrypted-download endpoint.)
Never log JWT tokens	Today the only logging on the `/login` path is `BusinessExceptionHandler` on failure, and it does not include the response body. Verify in Step 7 that no request-logger middleware logs response bodies.
Mask emails	Use the last 4 characters of the local part plus `@domain` for INFO-level logs (`***0123@example.com`); the full email is allowed at DEBUG only. The `BusinessExceptionHandler` log line `"Caught BusinessException: {Message}"` may include emails embedded in messages; to be tightened in Step 7.
User IDs	`User.Id` is an opaque GUID — safe to log; use it instead of email in correlation.
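
The masking and redaction rules above can be sketched as follows. This is a Python illustration of the rules only, not the planned Serilog destructuring policy; `mask_email`, `redact`, the `SENSITIVE_KEYS` set, and the `[REDACTED]` placeholder are hypothetical names chosen for the example.

```python
# Assumption: key names treated as secrets in this sketch.
SENSITIVE_KEYS = {"password", "token"}


def mask_email(email: str, keep: int = 4) -> str:
    """Keep only the last `keep` characters of the local part."""
    local, sep, domain = email.rpartition("@")
    if not sep or not local:
        return "***"  # not a usable address; reveal nothing
    # Local parts no longer than `keep` would be fully exposed, so hide them.
    tail = local[-keep:] if len(local) > keep else ""
    return "***" + tail + "@" + domain


def redact(payload: dict) -> dict:
    """Return a copy of `payload` that is safe to log at INFO level."""
    safe = {}
    for key, value in payload.items():
        lowered = key.lower()
        if lowered in SENSITIVE_KEYS:
            safe[key] = "[REDACTED]"
        elif isinstance(value, dict):
            safe[key] = redact(value)
        elif lowered == "email" and isinstance(value, str):
            safe[key] = mask_email(value)
        else:
            safe[key] = value
    return safe
```

Hiding short local parts entirely is a design choice of the sketch: an address such as `ab@example.com` would otherwise be logged in full despite the mask.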

3. Metrics

3.1 Endpoint

GET /metrics exposes metrics in the Prometheus text exposition format. Add via prometheus-net.AspNetCore 8.x (latest stable for .NET 10 baseline; verify the version against the released NuGet package before wiring).

Exposure boundary: /metrics MUST NOT be reachable from the public interface. The Nginx reverse proxy on admin.azaion.com will expose only /login, /users*, /devices, /resources*, /classes*, /health. /metrics and /swagger stay on the internal interface (separate Nginx server block bound to the management VLAN, OR localhost-only listener).

3.2 Metrics

Metric	Type	Source	Labels
`http_requests_total`	Counter	ASP.NET request pipeline	`method`, `endpoint`, `status_code`
`http_request_duration_seconds`	Histogram	ASP.NET request pipeline	`method`, `endpoint`
`http_requests_in_progress`	Gauge	ASP.NET request pipeline	`method`
`db_command_duration_seconds`	Histogram	linq2db trace hook	`operation` (`select`/`insert`/`update`/`delete`)
`db_command_failures_total`	Counter	linq2db trace hook	`operation`, `sqlstate`
`auth_login_failures_total`	Counter	`AuthService.ValidateUser` exception path	`reason` (`unknown_user`, `bad_password`, `disabled`)
`business_exceptions_total`	Counter	`BusinessExceptionHandler`	`error_code` (the existing `ExceptionEnum`)
`resource_upload_bytes_total`	Counter	`ResourcesService.SaveResource`	`data_folder`
`resource_upload_failures_total`	Counter	same	`reason`
`detection_classes_total`	Gauge	refresh on CRUD	none
`users_active_total`	Gauge	refresh on CRUD + on a 5-min timer	`role`
Process / runtime	(auto)	`prometheus-net.DotNetRuntime`	gen0/1/2 GC, JIT, threadpool, etc.
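
To make the inventory concrete, here is a minimal sketch of how a labeled counter from the table renders in the Prometheus text exposition format. It is a hand-rolled Python illustration, not prometheus-net; the `Counter` class and its methods are hypothetical and exist only for this example.

```python
from collections import defaultdict


def _escape(value: str) -> str:
    """Escape a label value per the text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class Counter:
    """Monotonic counter with a fixed set of label names."""

    def __init__(self, name, help_text, label_names=()):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(str(labels[name]) for name in self.label_names)
        self._values[key] += amount

    def render(self) -> str:
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(
                f'{name}="{_escape(val)}"'
                for name, val in zip(self.label_names, key)
            )
            labels = "{" + pairs + "}" if pairs else ""
            lines.append(f"{self.name}{labels} {value:g}")
        return "\n".join(lines) + "\n"
```

Keeping `endpoint` to the route template rather than the raw path keeps label cardinality bounded, which is why the table labels by endpoint and not by URL.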

3.3 System Metrics

CPU, RSS, file descriptors, network I/O — collected by node-exporter running on the host as a sibling container. The Admin API itself does NOT export host-level metrics.

3.4 Business Metrics

Mapped to the verified ACs in _docs/02_document/tests/blackbox-tests.md. Cycle-1 cut: users_active_total (AC-01..AC-12 user lifecycle) and detection_classes_total (AZ-513). The previously planned resource_download_bytes_total was dropped in cycle 2 along with ResourcesService.GetEncryptedResource itself; only the upload-side counters remain.

3.5 Collection

Setting	Value
Scrape interval	15s (set explicitly via `scrape_interval`; the Prometheus default is 1m)
Scrape source	`node-exporter` for host; the Admin API container for app metrics
Storage	local Prometheus on the host, retention 14 days (cycle 1 budget)
Visualization	local Grafana, single dashboard (§6)

4. Distributed Tracing

Cycle 1: scaffold only — produce a trace ID per request, propagate via traceparent (W3C), and emit it as the correlation_id field in JSON logs. Do NOT yet ship spans to a collector — there is no Jaeger / Tempo running, and the Admin API has no downstream services to trace into. Tracing pays back its cost when there's a chain to follow; cycle 1 has none.

Setting	Value
SDK	`OpenTelemetry.Extensions.Hosting` + `OpenTelemetry.Instrumentation.AspNetCore`
Propagation	W3C Trace Context (`traceparent`) — auto when `OpenTelemetry.Instrumentation.AspNetCore` is registered
Sampling	100% in dev/staging, 10% in production (deferred — no exporter yet)
Span naming	`<service>.<operation>` — service `azaion.admin-api`, operation `<HTTP method> <route template>`
Exporter	none in cycle 1 (logs only)

Recorded as Drift M — wire a Tempo / Jaeger exporter once a downstream service exists.
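
The cycle-1 behaviour (reuse the incoming trace ID, otherwise mint one, and log it as correlation_id) can be sketched as below. This is a Python illustration of the W3C `traceparent` header layout, not the OpenTelemetry SDK; the function names are hypothetical, only version `00` is accepted, and header keys are assumed to be lowercase.

```python
import re
import secrets

# version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip()) if header else None
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the W3C specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)
    return trace_id, parent_id, sampled


def correlation_id_for(headers):
    """Pick the correlation ID for a request from its (lowercased) headers."""
    parsed = parse_traceparent(headers.get("traceparent"))
    if parsed is not None:
        return parsed[0]
    # No valid upstream context: start a new 128-bit trace ID.
    return secrets.token_hex(16)
```

With the header `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`, the correlation ID is `4bf92f3577b34da6a3ce929d0e0e4736` and the sampled flag is set.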

5. Alerting

Severity	Response time	Conditions for this service	Channel
Critical	5 min	`up{job="admin-api"} == 0` for 1 min · `/health` fails for 2 min · `business_exceptions_total{error_code="DbFailure"}` rate > 1/s for 1 min	Slack `#azaion-ops` + on-call email (cycle 1 — PagerDuty deferred until on-call rotation exists)
High	30 min	Error rate > 5% for 5 min (`sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`) · P95 latency > 2× baseline for 10 min · `auth_login_failures_total` rate > 10/s for 1 min (possible brute force)	Slack `#azaion-ops` + email
Medium	4 h	Host disk > 80% · `db_command_failures_total` rate > 0.1/s for 10 min · process RSS > 80% of container limit	Slack `#azaion-ops`
Low	Next business day	Deprecated package usage from `dotnet list package --deprecated`	Slack `#azaion-eng`

Baseline values (P95) come from the cycle-1 perf report:

/login p95 ≈ 33 ms → high-latency alert at p95 > 66 ms for 10 min
/users (500 users) p95 ≈ 152 ms → high-latency alert at p95 > 305 ms for 10 min

Alert routing in cycle 1 is inform-only — no PagerDuty escalation, no auto-rollback. The deploy procedure (Step 6) documents the manual rollback path.
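
The arithmetic behind the High-severity conditions is small enough to sketch. The Python below only illustrates the comparisons; in practice they would live in Prometheus alert rules, and the function names are hypothetical.

```python
def error_rate(total_delta: float, errors_delta: float) -> float:
    """Share of 5xx responses among all requests seen in the window."""
    if total_delta <= 0:
        return 0.0  # no traffic in the window: nothing to alert on
    return errors_delta / total_delta


def high_error_rate(total_delta: float, errors_delta: float,
                    threshold: float = 0.05) -> bool:
    """True when the 5xx share is strictly above the 5% threshold."""
    return error_rate(total_delta, errors_delta) > threshold


def latency_threshold_ms(baseline_p95_ms: float, factor: float = 2.0) -> float:
    """High-latency threshold derived from a measured P95 baseline."""
    return baseline_p95_ms * factor
```

For the /login baseline of 33 ms, `latency_threshold_ms(33)` gives the 66 ms threshold quoted above; both conditions additionally have to hold for their full window (5 min and 10 min respectively) before the alert fires.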

6. Dashboards

Operations dashboard (Grafana, single panel set; cycle 1):

Service up (admin-api, postgres, nginx) — stat panel
HTTP request rate (req/s) by endpoint — time series
HTTP error rate (% of 5xx) — time series with the High threshold band overlaid
Latency P50 / P95 / P99 by endpoint — time series, P95 baseline reference line
DB command rate + failure rate — time series
Container CPU / RSS / FDs — time series (from node-exporter)
Active alerts — table panel

Business dashboard (cycle 1):

users_active_total by role — stat panel + sparkline
detection_classes_total — stat panel
resource_upload_bytes_total rate (1h window) — time series
Login success/failure ratio (24h) — donut

Dashboards stored as code in monitoring/grafana/admin-api.json (introduced in Step 7).

7. Health Checks

Add two Minimal API endpoints under /health:

Probe	Endpoint	What it checks	Surface
Liveness	`GET /health/live`	Process is responsive (always 200 unless the process is wedged)	Used by Docker `HEALTHCHECK`
Readiness	`GET /health/ready`	DB reader connection + DB admin connection (one-shot `SELECT 1` each, 2s timeout)	Used by Nginx upstream check + the deploy script (Step 6) post-deploy gate

Endpoints are anonymous (no JWT) but bound only to the management VLAN (or localhost listener) — same exposure rule as /metrics.

Failure mode: if the DB is unreachable for 30 s, /health/ready returns 503; Nginx pulls the upstream, returning 503 to clients (no silent traffic loss). The container itself stays running so a transient DB blip does not trigger Docker restart.
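
A single readiness evaluation can be sketched as follows. This is a Python illustration of the probe logic (each dependency checked with a bounded wait, 200 only if all pass), not the ASP.NET implementation; `readiness` and the check names are hypothetical, and each check is assumed to be a callable that raises on failure.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def readiness(checks, timeout_s=2.0):
    """Run every check concurrently; return (http_status, per-check detail).

    `checks` maps a name to a zero-argument callable that raises on failure,
    for example a one-shot `SELECT 1` against one connection.
    """
    results = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    try:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, future in futures.items():
            try:
                future.result(timeout=timeout_s)
                results[name] = "ok"
            except FutureTimeout:
                results[name] = "timeout"
            except Exception as exc:  # any failure marks the probe unready
                results[name] = f"error: {exc}"
    finally:
        # Do not block the response on a check that is still hanging.
        pool.shutdown(wait=False)
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

A call such as `readiness({"db_reader": ping_reader, "db_admin": ping_admin})`, with both callables supplied by the caller, returns 503 as soon as either connection fails or exceeds the timeout. Liveness deliberately runs none of these checks, so a database outage does not restart the container.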

8. Drifts Logged Here

ID	Severity	Description	Resolved In
H	Low	`docker-compose.test.yml` health check is TCP-only; upgrade to `/health/live` once available	Step 7
K	Medium (NEW)	Metrics + tracing not implemented in cycle 1; only the plan, `/health`, and structured logging ship	Future cycle
L	Low (NEW)	No central log aggregator; `journald` only	Future cycle
M	Low (NEW)	Tracing has no exporter (cycle 1 = trace IDs in logs only)	Future cycle when downstream services exist

9. Self-verification

Structured JSON logging format defined with timestamp, level, service, correlation_id, message, context.
Metrics endpoint specified (/metrics, internal-only) with full app/system/business metric inventory.
OpenTelemetry tracing configured at the SDK level (cycle 1) with future exporter wiring (Drift M).
Alert severities with response times and channels defined; baselines tied to perf report numbers.
Dashboards defined for operations and business metrics.
PII exclusion rules cover passwords, JWTs, and email masking; refers to specific DTO field names.
