admin/_docs/02_document/architecture.md
Oleksandr Bezdieniezhnykh a77b3f8a59 [AZ-529] [AZ-530] Cycle-2 documentation refresh
Refreshes _docs/02_document/ to reflect the cycle-2 auth-modernization
+ CMMC hardening landings (AZ-531..AZ-538). Authoritative source for
the ripple set is ripple_log_cycle2.md.

Covered:
- architecture.md (section 1 rewritten, ADRs 6-9 added)
- data_model.md (sessions, audit_events, user columns, migrations)
- system-flows.md (F1 rewritten; F11-F17 added; F2/F7/F9 minor)
- module-layout.md (cycle-2 sub-component table)
- diagrams/flows/flow_login.md (dual-token + MFA)
- components/{01_data_layer,03_auth_and_security,05_admin_api}
- modules/ (12 new, 8 modified — full Argon2id/ES256/MFA/refresh
  /mission/session/audit/jwks rollup)
- tests/{blackbox,security,traceability-matrix}

Step 13 (Update Docs) output for cycle 2.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-14 09:22:53 +03:00


Azaion Admin API — Architecture

1. System Context

Problem being solved: Azaion Suite requires a centralized admin API to manage users + roles, authenticate humans (with optional second factor), authenticate UAVs for offline missions, and broker token revocation across a fleet of verifier services.

System boundaries:

  • Inside: user management, password hashing (Argon2id), authentication (ES256 JWT + opaque refresh tokens with rotation + reuse detection), TOTP MFA, mission-token issuance, session revocation + verifier-poll snapshot, account lockout + per-IP and per-account rate limiting, JWKS publication, role-based authorization, file-based resource storage (upload / list / clear), HSTS + HTTPS redirect.
  • Outside: admin web panel (admin.azaion.com), fTPM-secured Jetson edge devices (CompanionPC), verifier fleet (satellite-provider, gps-denied, ui — service-role identities), PostgreSQL, server filesystem.

Note (AZ-197, cycle 1): hardware-fingerprint binding removed.

Note (cycle 2 early): encrypted resource download + installer endpoints removed; ADR-003 retired.

Note (cycle 2 — Auth Modernization, 2026-05-14, AZ-531..AZ-538): the entire authentication layer was rebuilt:

  • AZ-536 — Argon2id password hashing replaced SHA-384; lazy migration on login.
  • AZ-531 — opaque refresh tokens with server-side rotation, family-based reuse detection, sliding + absolute lifetimes (SessionConfig).
  • AZ-532 — symmetric HS256 → asymmetric ES256 with file-system key store + JWKS endpoint.
  • AZ-534 — TOTP MFA (enroll/confirm/disable, recovery codes, two-step login, IDataProtector-encrypted secret, amr claim).
  • AZ-535 — logout (single + all) + admin revoke + verifier-poll snapshot of revoked sessions; new Service role for verifier identities.
  • AZ-533 — long-lived no-refresh mission tokens for UAV ops, with auto-revoke on aircraft reconnect.
  • AZ-537: DB-backed account lockout + per-account sliding-window rate limit + per-IP sliding-window limit via ASP.NET RateLimiter; audit_events table.
  • AZ-538 — CORS narrowed to single HTTPS origin, HSTS enabled (non-Development), HTTPS redirection (non-Development).
  • New ADRs ADR-006 through ADR-009 below capture the per-decision context.

External systems:

| System | Integration Type | Direction | Purpose |
|---|---|---|---|
| PostgreSQL | Database (linq2db) | Both | User + session + audit_events persistence |
| Server filesystem | File I/O | Both | Resource files; ES256 PEM key store; DataProtection key store (when DataProtection:KeysFolder is set) |
| Admin web panel (admin.azaion.com) | REST API | Inbound | User management, login, MFA, refresh, resource upload |
| Verifier fleet (Service role) | REST API | Inbound | Polls /sessions/revoked, fetches /.well-known/jwks.json |
| CompanionPC (Jetson) edge devices | REST API | Inbound | Login + refresh; mission-token consumer |

2. Technology Stack

| Layer | Technology | Version | Rationale |
|---|---|---|---|
| Language | C# | .NET 10.0 | Modern, cross-platform, strong typing |
| Framework | ASP.NET Core Minimal API | 10.0 | Lightweight, minimal boilerplate |
| Database | PostgreSQL | (server-side) | Open-source, robust relational DB |
| ORM | linq2db | 5.4.1 | Lightweight, LINQ-native, no migrations overhead |
| Cache | LazyCache (in-memory) | 2.4.0 | Simple async caching for user lookups |
| Auth | JWT Bearer (ES256) | 10.0.3 | Stateless token auth; cycle 2 switched from HS256 to ES256 with JWKS (AZ-532) |
| Password hashing | Konscious.Security.Cryptography (Argon2id) | (cycle 2 add) | Replaces SHA-384 (AZ-536) |
| MFA | OtpNet (TOTP) + QRCoder (PNG) | (cycle 2 add) | TOTP + recovery codes (AZ-534) |
| Rate limiting | Microsoft.AspNetCore.RateLimiting | 10.0 | Per-IP sliding window (AZ-537) |
| Data protection | Microsoft.AspNetCore.DataProtection | 10.0 | Encrypt MFA secret at rest (AZ-534) |
| Validation | FluentValidation | 11.3.0 / 11.10.0 | Declarative request validation |
| Logging | Serilog | 4.1.0 | Structured logging (console + file) |
| API Docs | Swashbuckle (Swagger) | 10.1.4 | OpenAPI specification |
| Serialization | Newtonsoft.Json | 13.0.4 | JSON for DB field mapping and responses (bumped from 13.0.1 by audit D-1) |
| Container | Docker | .NET 10.0 images | Multi-stage build, ARM64 support |
| CI/CD | Woodpecker CI | | Branch-based ARM64 builds |
| Registry | docker.azaion.com | | Private container registry |

3. Deployment Model

Environments: Development (local), Production (Linux server)

Infrastructure:

  • Self-hosted Linux server (evidenced by env/ provisioning scripts for Debian/Ubuntu)
  • Docker containerization with private registry (docker.azaion.com, localhost:5000)
  • No orchestration (single container deployment via deploy.cmd)

Environment-specific configuration:

| Config | Development | Production |
|---|---|---|
| Database | Local PostgreSQL (port 4312) | Remote PostgreSQL (same custom port) |
| Secrets | Environment variables (ASPNETCORE_*) | Environment variables |
| Logging | Console + file | Console + rolling file (logs/log.txt) |
| Swagger | Enabled | Disabled |
| CORS | (same policy registered, allows https://admin.azaion.com) | https://admin.azaion.com only |
| HSTS | Disabled (Development bypass) | Enabled (1 y, includeSubDomains, preload) |
| HTTPS redirect | Disabled (Development bypass) | Enabled |
| ES256 keys | JwtConfig.KeysFolder: at least one PEM, ActiveKid selects | Same; persistent volume mandatory |
| DataProtection keys | Ephemeral OK (single-instance dev) | DataProtection:KeysFolder MUST be a persistent volume; otherwise MFA secrets are unrecoverable after restart |

4. Data Model Overview

Core entities:

| Entity | Description | Owned By Component |
|---|---|---|
| User | System user. Cycle 2 added failed_login_count, lockout_until (AZ-537) and mfa_* columns (AZ-534). password_hash is now Argon2id PHC; legacy SHA-384 base64 lazily upgraded on next login (AZ-536). | 01 Data Layer |
| Session (AZ-531+535+533+534) | One row per refresh token (interactive) or per mission token. Carries family_id (rotation chain), revoked_at/revoked_reason/revoked_by_user_id, class ∈ {interactive, mission}, aircraft_id, mfa_authenticated. | 01 Data Layer |
| AuditEvent (AZ-537+534) | Append-only audit_events row: login_failed/success/lockout, mfa_enroll/confirm/disable/login_success/login_failed/recovery_used. | 01 Data Layer |
| UserConfig | JSON-serialized per-user configuration (queue offsets). | 01 Data Layer |
| RoleEnum | Authorization role hierarchy. Cycle 2 added Service = 60 for verifier identities (AZ-535). | 01 Data Layer |
| DetectionClass | Operator-managed catalogue. Unchanged in cycle 2. | 01 Data Layer |
| ExceptionEnum | Business error code catalog. Cycle 2 added codes 5061 for the auth/MFA/refresh/mission/lockout paths. | Common Helpers |

Key relationships (cycle 2 additions):

  • User 1 — N Session (sessions.user_id FK, ON DELETE CASCADE)
  • User 1 — N Session (sessions.aircraft_id FK for mission rows, ON DELETE SET NULL)
  • User 1 — N Session (sessions.revoked_by_user_id FK, ON DELETE SET NULL)
  • Session 1 — N Session (parent_session_id rotation chain)

Data flow summary:

  • Client → API → UserService → PostgreSQL: user CRUD + Argon2id verify/hash + lazy migration
  • Client → API → RefreshTokenService / SessionService / MfaService / MissionTokenService → PostgreSQL sessions + users + audit_events
  • Verifier → API → SessionService → PostgreSQL sessions (revoked-since snapshot) + JwtSigningKeyProvider (JWKS)
  • Client → API → ResourcesService → Filesystem: resource upload / list / clear

5. Integration Points

Internal Communication

| From | To | Protocol | Pattern | Notes |
|---|---|---|---|---|
| Admin API | User Management | Direct DI call | Request-Response | Scoped |
| Admin API | AuthService | Direct DI call | Request-Response | Scoped; also reads IJwtSigningKeyProvider (singleton) |
| Admin API | RefreshTokenService / SessionService / MfaService / MissionTokenService / AuditLog | Direct DI call | Request-Response | Scoped |
| Admin API | Resource Management | Direct DI call | Request-Response | Scoped |
| User Management | AuditLog | Direct DI call | Request-Response | Failed/success/lockout audit + sliding-window count |
| MfaService | IDataProtector | Direct DI call | Request-Response | Encrypt/decrypt mfa_secret |
| All services | Data Layer | Direct DI call | Request-Response | Singleton DbFactory |

External Integrations

| External System | Protocol | Auth | Rate Limits | Failure Mode |
|---|---|---|---|---|
| PostgreSQL | TCP (Npgsql) | Username/password | None configured | Exception propagation |
| Filesystem | OS I/O | OS-level permissions | None | Exception propagation |

6. Non-Functional Requirements

| Requirement | Target | Measurement | Priority |
|---|---|---|---|
| Max upload size | 200 MB | Kestrel MaxRequestBodySize | High |
| Password hashing | Argon2id (parameters from AuthConfig.PasswordHashing) | Per-user, constant-time verify | High |
| Access token lifetime | JwtConfig.AccessTokenLifetimeMinutes (15 default) | Per token | High |
| Refresh token sliding lifetime | SessionConfig.RefreshSlidingHours | Per session row | High |
| Refresh token absolute lifetime | SessionConfig.RefreshAbsoluteHours | Per family | High |
| Mission token lifetime | MissionSessionRequest.PlannedDurationH (validation-bounded) | Per mission session | High |
| Per-IP login rate | AuthConfig.RateLimit.PerIpPermitLimit per PerIpWindowSeconds | Sliding window | High |
| Per-account login rate | AuthConfig.RateLimit.PerAccountFailedThreshold per PerAccountWindowSeconds | DB sliding window via audit_events | High |
| Account lockout | AuthConfig.Lockout.ConsecutiveFailureThreshold failures → LockoutSeconds lockout | DB-backed | High |
| HSTS | 1 y, includeSubDomains, preload (non-Development) | HTTP header | High |
| HTTPS redirect | Enabled (non-Development) | Middleware | High |
| Cache TTL | 4 hours | User entity cache | Low |

No explicit availability, latency, throughput, or recovery targets found in the codebase.
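
How the sliding and absolute refresh lifetimes above might combine is sketched below in Python for illustration. It assumes the sliding window restarts at each rotation and the absolute cap is measured from the start of the token family; the function name and the hour values in the usage note are hypothetical, not the configured defaults.

```python
from datetime import datetime, timedelta, timezone


def refresh_expiry(family_started_at: datetime,
                   last_rotated_at: datetime,
                   sliding_hours: float,
                   absolute_hours: float) -> datetime:
    """Effective expiry: whichever of the two deadlines comes first."""
    sliding_deadline = last_rotated_at + timedelta(hours=sliding_hours)
    absolute_deadline = family_started_at + timedelta(hours=absolute_hours)
    return min(sliding_deadline, absolute_deadline)


# Illustrative values only.
start = datetime(2026, 5, 14, 9, 0, tzinfo=timezone.utc)
early = refresh_expiry(start, start + timedelta(hours=1), sliding_hours=24, absolute_hours=168)
late = refresh_expiry(start, start + timedelta(hours=160), sliding_hours=24, absolute_hours=168)
```

Under these assumptions, `early` is bounded by the sliding window (25 hours after `start`), while `late` is clipped by the absolute cap (168 hours after `start`), so an active session cannot be extended indefinitely by rotating.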

7. Security Architecture

Authentication:

  • ES256 (ECDSA P-256) JWT bearer tokens (AZ-532). ValidAlgorithms pinned to ES256 to prevent the HS256-with-public-key forgery class.
  • Opaque refresh tokens with server-side rotation + reuse detection (AZ-531). Stored as SHA-256 hashes; never re-presented.
  • TOTP MFA + recovery codes (AZ-534). Step-1 token is itself an ES256 JWT with a separate audience.
  • Mission tokens (AZ-533) — long-lived, no refresh, bound to aircraft_id, auto-revoked on aircraft reconnect.
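
The algorithm pinning in the first bullet can be illustrated with a standalone header check. This is a Python sketch of the idea only; in the service the pinning is done by the JwtBearer ValidAlgorithms setting, and the helper names here are hypothetical. The check rejects a token by its declared algorithm before any signature verification is attempted.

```python
import base64
import json

ALLOWED_ALGS = frozenset({"ES256"})


def b64url_decode(segment: str) -> bytes:
    # JWT segments are Base64Url without padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def header_alg_is_allowed(token: str) -> bool:
    """True only if the token is three dot-separated segments and declares an allowed alg."""
    try:
        header_segment, _payload, _signature = token.split(".")
        header = json.loads(b64url_decode(header_segment))
    except (ValueError, TypeError):
        return False
    return isinstance(header, dict) and header.get("alg") in ALLOWED_ALGS


def fake_token(alg: str) -> str:
    # Test helper: builds an unsigned token carrying only the header we care about.
    header = base64.urlsafe_b64encode(json.dumps({"alg": alg, "typ": "JWT"}).encode())
    return header.rstrip(b"=").decode() + ".e30.sig"
```

A token declaring `HS256` or `none` fails this check regardless of its signature, which is what closes the forgery path where the public key is reused as an HMAC secret.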

Authorization: Role-based (RBAC) via ASP.NET Core authorization policies:

  • apiAdminPolicy — requires ApiAdmin
  • revocationReaderPolicy — requires Service OR ApiAdmin (verifier fleet)
  • General [Authorize] — any authenticated user

Data protection:

  • At rest: mfa_secret is encrypted via IDataProtector (purpose Azaion.Mfa.Secret). MFA recovery codes are individually Argon2id-hashed and single-use. Passwords are Argon2id PHC strings. ES256 PEM keys live in JwtConfig.KeysFolder — protect via filesystem permissions.
  • In transit: HSTS + HTTPS redirection in non-Development environments (AZ-538). CORS narrowed to https://admin.azaion.com only.
  • Token revocation propagation: GET /sessions/revoked provides a verifier-poll snapshot; verifiers are responsible for honoring it within their poll cadence (currently ~30s recommended).
  • Secrets management: Environment variables (ASPNETCORE_* prefix).
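
A verifier-side view of the revocation snapshot could look like the sketch below (Python, for illustration). The response shape is an assumption: each entry is taken to carry a `sid` and a comparable `revoked_at`, and `fetch_revoked_since` is a hypothetical callable wrapping GET /sessions/revoked. The document does not specify the wire format.

```python
import time
from typing import Any, Callable, Dict, List, Optional, Set


class RevocationCache:
    """Polls a revoked-sessions feed at most once per poll interval."""

    def __init__(self,
                 fetch_revoked_since: Callable[[Optional[Any]], List[Dict[str, Any]]],
                 poll_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_revoked_since
        self._poll_seconds = poll_seconds
        self._clock = clock
        self._revoked: Set[str] = set()
        self._cursor: Optional[Any] = None       # newest revoked_at seen so far
        self._last_poll: Optional[float] = None

    def _refresh_if_due(self) -> None:
        now = self._clock()
        if self._last_poll is not None and now - self._last_poll < self._poll_seconds:
            return
        for entry in self._fetch(self._cursor):
            self._revoked.add(entry["sid"])
            if self._cursor is None or entry["revoked_at"] > self._cursor:
                self._cursor = entry["revoked_at"]
        self._last_poll = now

    def is_revoked(self, sid: str) -> bool:
        # Called with the sid claim of an otherwise valid access token.
        self._refresh_if_due()
        return sid in self._revoked
```

With this shape, a revoked session can stay usable at a verifier for up to one poll interval, which is the propagation delay the poll cadence trades against load on the Admin API.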

Audit logging: audit_events table records login_success/failed/lockout and mfa_enroll/confirm/disable/login_success/login_failed/recovery_used events with normalised email + caller IP. Drives the per-account rate limit and provides forensic evidence. Serilog continues to log business exceptions (WARN) and general events (INFO).

8. Key Architectural Decisions

ADR-001: Minimal API over Controllers

Context: API has ~17 endpoints with simple request/response patterns.

Decision: Use ASP.NET Core Minimal API with top-level statements instead of MVC controllers.

Consequences: All endpoints in a single Program.cs. Simple for small APIs but could become unwieldy as endpoints grow.

ADR-002: Read/Write Database Connection Separation

Context: Needed different privilege levels for read vs. write operations.

Decision: DbFactory maintains two connection strings — a read-only one (AzaionDb) and an admin one (AzaionDbAdmin) — with separate Run and RunAdmin methods.

Consequences: Write operations are explicitly gated through RunAdmin. Prevents accidental writes through the reader connection. Requires maintaining two DB users with different privileges.

ADR-003: Per-User Resource Encryption — RETIRED (cycle 2, 2026-05-14)

Original context: Resources (DLLs, AI models) had to be delivered only to authorized users via a per-download AES-256-CBC stream keyed off the user's email + password.

Retirement decision: With the OTA delivery flow (AZ-183) and the hardware-binding flow (AZ-197) both gone, the only remaining consumers of the encrypted-download path were a now-vestigial POST /resources/get/{dataFolder?} endpoint and the two installer endpoints. None of them is part of the target architecture (browser SaaS + fTPM Jetsons), so the entire encrypt-on-download stack was removed: POST /resources/get, GET /resources/get-installer, GET /resources/get-installer/stage, ResourcesService.GetEncryptedResource, ResourcesService.GetInstaller, Security.GetApiEncryptionKey, Security.EncryptTo, Security.DecryptTo, GetResourceRequest, WrongResourceName (50), and ResourcesConfig.SuiteInstallerFolder / SuiteStageInstallerFolder. Security.ToHash was retained at the time because it still backed SHA-384 password hashing in UserService; since AZ-536, SHA-384 applies only to legacy hashes awaiting lazy migration to Argon2id.

Consequences: resource files now live on disk as plain bytes; any future at-rest encryption must come from filesystem or storage-layer features (LUKS, object-store SSE), not from application code.

ADR-004: Hardware Fingerprint Binding — RETIRED (AZ-197)

Original context: Resources should only be usable on a specific physical machine.

Original decision: On first resource access, the user's hardware fingerprint string was stored. Subsequent accesses compared the hash of the provided hardware against the stored value.

Retirement decision (2026-05-13, AZ-197): The threat model that motivated this binding (credential reuse across machines via desktop installers) no longer applies:

  • Edge devices ship as fTPM-secured Jetsons (secure boot, fTPM-protected key storage, no user filesystem access, no installer redistribution). Hardware identity is anchored in the fTPM, not in a SHA-384 of CPU/GPU/Memory/DriveSerial strings.
  • Server / desktop access is SaaS-only (browser → admin API). There is no installer to copy and no hardware fingerprint to take.

The binding's only remaining effect was a real production failure mode (HardwareIdMismatch, error code 40) on legitimate hardware events. AZ-197 removed CheckHardwareHash, UpdateHardware, Security.GetHWHash, the PUT /users/hardware/set and POST /resources/check endpoints, and the Hardware field from GetResourceRequest. The User.Hardware DB column is a nullable tombstone (no migration in AZ-197; separate ticket if/when the column is dropped).

ADR-005: linq2db over Entity Framework

Context: Needed a lightweight ORM for PostgreSQL.

Decision: Use linq2db instead of Entity Framework Core.

Consequences: No migration framework — schema managed via SQL scripts (env/db/). Lighter runtime footprint. Manual mapping configuration in AzaionDbSchemaHolder.

ADR-006: Asymmetric ES256 JWT signing with file-system key store + JWKS (cycle 2 — AZ-532)

Context: Cycle-1 JWT signing was symmetric HS256 with the secret in environment configuration. The verifier fleet (satellite-provider, gps-denied, ui) needed to validate tokens without sharing the signing secret with every service. Sharing the HS256 secret would have made any verifier compromise also a token-forgery primitive.

Decision: Switch to ES256 (ECDSA P-256). The Admin API holds the private key; verifiers fetch the public key set from GET /.well-known/jwks.json. Keys live as one PEM per kid in JwtConfig.KeysFolder. JwtConfig.ActiveKid selects the signer; ALL discovered keys are exposed in JWKS so existing tokens stay verifiable across rotations.

Alternatives rejected:

  • Continue HS256 + share secret: rejected — secret-distribution + verifier-compromise blast radius.
  • RS256: equivalent security, larger keys, no operational benefit at our scale.
  • External KMS / HSM: deferred — adds operational complexity (KMS auth, latency on every signing op) without near-term benefit. The PEM-on-disk approach is reversible to KMS later.

Consequences:

  • JwtBearer ValidAlgorithms = [ES256] is mandatory — without it, a token forged with alg=HS256 using the public key as the HMAC secret would validate.
  • The PEM directory MUST be a persistent volume.
  • Key rotation is "drop a new PEM, set ActiveKid, restart" — the old kid keeps verifying tokens until physically removed.
  • Verifiers MUST cache the JWKS for at most 1 hour to pick up new kids quickly.
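
The verifier-side caching rule in the last consequence can be sketched as below (Python, for illustration). `fetch_jwks` is a hypothetical callable that returns the parsed document from GET /.well-known/jwks.json in the standard `{"keys": [...]}` shape. The early refetch on an unknown kid, throttled by `min_refetch_seconds`, is an assumption added for the sketch and is not stated in this document.

```python
import time
from typing import Any, Callable, Dict, Optional


class JwksCache:
    """Caches a JWKS for at most max_age_seconds, keyed by kid."""

    def __init__(self,
                 fetch_jwks: Callable[[], Dict[str, Any]],
                 max_age_seconds: float = 3600.0,
                 min_refetch_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_jwks
        self._max_age = max_age_seconds
        self._min_refetch = min_refetch_seconds
        self._clock = clock
        self._keys: Dict[str, Dict[str, Any]] = {}
        self._fetched_at: Optional[float] = None

    def _reload(self) -> None:
        document = self._fetch()
        self._keys = {key["kid"]: key for key in document.get("keys", [])}
        self._fetched_at = self._clock()

    def key_for(self, kid: str) -> Optional[Dict[str, Any]]:
        age = None if self._fetched_at is None else self._clock() - self._fetched_at
        expired = age is None or age >= self._max_age
        # Unknown kid may mean a rotation happened; refetch, but not on every request.
        unknown_and_refetchable = kid not in self._keys and age is not None and age >= self._min_refetch
        if expired or unknown_and_refetchable:
            self._reload()
        return self._keys.get(kid)
```

The throttle matters because the kid comes from an unverified token header: without it, a caller sending random kids could force a JWKS fetch on every request.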

ADR-007: Refresh tokens as opaque rotating server-side rows (not JWT) (cycle 2 — AZ-531)

Context: The dual-token model needs a refresh token. The two viable shapes are (a) signed self-describing JWT or (b) opaque server-stored value. Refresh tokens are long-lived; their threat model centres on theft + replay.

Decision: Opaque random Base64Url(32 bytes) stored on the server as a SHA-256 hash. Each rotation marks the previous row as revoked_reason='rotated' and inserts a new row in the same family_id. Presenting an already-rotated token revokes the entire family with reason='reuse_detected'.

Alternatives rejected:

  • JWT refresh token: server cannot revoke without a denylist (which negates the "stateless" advantage). No reuse-detection without ALSO server state.
  • Sliding session ID alone (no rotation): theft is permanent until manual revocation.

Consequences:

  • Every refresh hits Postgres (one indexed lookup + one update + one insert in a transaction). Acceptable at current load; if it becomes a bottleneck, the sessions_refresh_hash_idx UNIQUE INDEX is the obvious caching boundary.
  • Refresh-token theft is detectable on the next legitimate refresh.
  • The session row is also the sid claim in the access token — the same row drives logout (F12), JWKS-independent revocation snapshots (F15), and AMR persistence across rotations (mfa_authenticated).
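
The rotation and reuse-detection rule from the decision above can be sketched with an in-memory stand-in for the sessions table (Python, for illustration). The class and method names are hypothetical, and the real implementation performs the lookup, update, and insert inside one Postgres transaction, which this sketch does not model.

```python
import hashlib
import secrets
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SessionRow:
    id: str
    family_id: str
    refresh_hash: str
    revoked_reason: Optional[str] = None


class RefreshStore:
    def __init__(self) -> None:
        self._by_hash: Dict[str, SessionRow] = {}

    @staticmethod
    def _hash(token: str) -> str:
        # Only the SHA-256 of the token is stored, never the token itself.
        return hashlib.sha256(token.encode()).hexdigest()

    def _issue(self, family_id: str) -> str:
        token = secrets.token_urlsafe(32)  # Base64Url of 32 random bytes
        row = SessionRow(str(uuid.uuid4()), family_id, self._hash(token))
        self._by_hash[row.refresh_hash] = row
        return token

    def login(self) -> str:
        # A fresh login starts a new family.
        return self._issue(str(uuid.uuid4()))

    def refresh(self, presented: str) -> str:
        row = self._by_hash.get(self._hash(presented))
        if row is None:
            raise PermissionError("unknown refresh token")
        if row.revoked_reason == "rotated":
            # An already-rotated token came back: assume theft, revoke live rows in the family.
            for other in self._by_hash.values():
                if other.family_id == row.family_id and other.revoked_reason is None:
                    other.revoked_reason = "reuse_detected"
            raise PermissionError("refresh token reuse detected")
        if row.revoked_reason is not None:
            raise PermissionError("refresh token revoked")
        row.revoked_reason = "rotated"
        return self._issue(row.family_id)
```

Replaying the first token after it has been rotated fails and also invalidates the token issued in its place, so both the thief and the legitimate client are forced back through login.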

ADR-008: TOTP MFA secrets encrypted via IDataProtector (cycle 2 — AZ-534)

Context: MFA secrets are TOTP shared secrets — possession of the database alone (DBA access, backup leak) must NOT yield the ability to mint TOTP codes for users.

Decision: Encrypt mfa_secret with ASP.NET IDataProtector (purpose string Azaion.Mfa.Secret) before persisting. The DataProtection key store is configured via DataProtection:KeysFolder and MUST be a persistent volume in production. Recovery codes are individually Argon2id-hashed and stored as a jsonb array; single-use is enforced by setting used_at transactionally with the rest of the login.

Alternatives rejected:

  • Plaintext: explicit DB-leak escalation path.
  • Application-managed AES via env-var key: re-introduces the very key-distribution problem ADR-006 solved for JWT signing.
  • External KMS for MFA secrets: deferred for the same reason as ADR-006.

Consequences:

  • Loss of the DataProtection key folder = users must re-enroll MFA (no recovery path). This MUST be backed up alongside DB backups.
  • DBA-only access does not yield MFA bypass.
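
Why a leaked plaintext secret would amount to an MFA bypass is visible from the TOTP computation itself, sketched here in Python following RFC 6238. The parameters (HMAC-SHA-1, 30-second step, 6 digits) are the RFC defaults and are assumed, since this document does not state the values the service configures in OtpNet.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP: HOTP (RFC 4226) over the time-step counter."""
    counter = unix_time // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low nibble of the last byte
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# RFC 6238 Appendix B test vector (SHA-1, 8 digits).
RFC_SECRET = b"12345678901234567890"
assert totp(RFC_SECRET, 59, digits=8) == "94287082"
```

The code is a pure function of the secret and the clock, so anyone who can read the secret can mint valid codes; encrypting mfa_secret with a key held outside the database is what keeps a database-only leak from reaching that point.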

ADR-009: Per-account lockout + DB-backed sliding-window rate limit alongside per-IP sliding-window limit (cycle 2, AZ-537)

Context: ASP.NET RateLimiter is per-process and per-IP. CMMC AC.L2-3.1.8 requires per-account lockout that survives process restarts. Per-IP alone is insufficient (NAT'd attacker farm; bot rotates IPs). Per-account-only is insufficient (single IP can DoS many accounts at "just below threshold").

Decision: Two rate-limit layers plus a lockout; a login attempt must pass all of them:

  1. Per-IP — ASP.NET RateLimiter middleware with SlidingWindowRateLimiter on /login and /login/mfa. In-memory; resets on restart but recovers within seconds.
  2. Per-account — DB-backed sliding window via audit_events (count login_failed rows for the email within PerAccountWindowSeconds).
  3. Lockout — users.failed_login_count + users.lockout_until. After ConsecutiveFailureThreshold failures, lockout_until = now + LockoutSeconds. Subsequent logins throw AccountLocked with RetryAfterSeconds until the window passes.

Alternatives rejected:

  • Redis token bucket per account: avoids DB load but adds a new infra dependency for a low-write workload. The DB sliding window has acceptable cost (audit_events_event_type_email_idx).
  • Single combined rule: harder to tune.

Consequences:

  • audit_events will grow large (~14 GB/yr at projected fleet scale); operational follow-up to time-partition.
  • The Retry-After header is set both by the per-IP middleware (lease metadata) and by the BusinessExceptionHandler (from BusinessException.RetryAfterSeconds), so clients see consistent backoff hints regardless of which layer rejected.
  • All gating events go through audit_events, providing a single auditable history.
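
The per-account gates (items 2 and 3 of the decision) can be sketched as a single function over an in-memory account record (Python, for illustration). The `failed_at` list stands in for login_failed rows in audit_events, the `RateLimited` reason name and all default thresholds are hypothetical, and the sketch assumes the failure counter resets only on a successful login; the per-IP middleware layer is not modelled.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Account:
    failed_login_count: int = 0
    lockout_until: Optional[float] = None
    failed_at: List[float] = field(default_factory=list)  # stand-in for audit_events rows


class Rejected(Exception):
    def __init__(self, reason: str, retry_after_seconds: int) -> None:
        super().__init__(reason)
        self.reason = reason
        self.retry_after_seconds = retry_after_seconds


def gate_login(account: Account, password_ok: bool, now: float, *,
               failure_threshold: int = 5, lockout_seconds: float = 900,
               window_seconds: float = 300, window_threshold: int = 10) -> bool:
    # Gate 1: an active lockout rejects even a correct password.
    if account.lockout_until is not None and now < account.lockout_until:
        raise Rejected("AccountLocked", math.ceil(account.lockout_until - now))
    # Gate 2: sliding window over recent failures for this account.
    recent = [t for t in account.failed_at if t > now - window_seconds]
    if len(recent) >= window_threshold:
        raise Rejected("RateLimited", math.ceil(recent[0] + window_seconds - now))
    if password_ok:
        account.failed_login_count = 0
        account.lockout_until = None
        return True
    # Record the failure and lock the account once the threshold is reached.
    account.failed_at.append(now)
    account.failed_login_count += 1
    if account.failed_login_count >= failure_threshold:
        account.lockout_until = now + lockout_seconds
    return False
```

Both rejection paths carry a retry hint, mirroring how RetryAfterSeconds feeds the Retry-After header so that clients back off the same way whichever gate fired.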