# Azaion Admin API — Architecture

## 1. System Context

**Problem being solved**: Azaion Suite requires a centralized admin API to manage users + roles, authenticate humans (with optional second factor), authenticate UAVs for offline missions, and broker token revocation across a fleet of verifier services.

**System boundaries**:
- **Inside**: user management, password hashing (Argon2id), authentication (ES256 JWT + opaque refresh tokens with rotation + reuse detection), TOTP MFA, mission-token issuance, session revocation + verifier-poll snapshot, account lockout + per-IP and per-account rate limiting, JWKS publication, role-based authorization, file-based resource storage (upload / list / clear), HSTS + HTTPS redirect.
- **Outside**: admin web panel (`admin.azaion.com`), fTPM-secured Jetson edge devices (CompanionPC), verifier fleet (satellite-provider, gps-denied, ui — service-role identities), PostgreSQL, server filesystem.

> **Note (AZ-197, cycle 1)**: hardware-fingerprint binding removed.
>
> **Note (cycle 2 early)**: encrypted resource download + installer endpoints removed; ADR-003 retired.
>
> **Note (cycle 2 — Auth Modernization, 2026-05-14, AZ-531..AZ-538)**: the entire authentication layer was rebuilt:
> - **AZ-536** — Argon2id password hashing replaced SHA-384; lazy migration on login.
> - **AZ-531** — opaque refresh tokens with server-side rotation, family-based reuse detection, sliding + absolute lifetimes (`SessionConfig`).
> - **AZ-532** — symmetric HS256 → asymmetric ES256 with file-system key store + JWKS endpoint.
> - **AZ-534** — TOTP MFA (enroll/confirm/disable, recovery codes, two-step login, `IDataProtector`-encrypted secret, `amr` claim).
> - **AZ-535** — logout (single + all) + admin revoke + verifier-poll snapshot of revoked sessions; new `Service` role for verifier identities.
> - **AZ-533** — long-lived no-refresh mission tokens for UAV ops, with auto-revoke on aircraft reconnect.
> - **AZ-537** — DB-backed account lockout + per-account sliding-window rate limit + per-IP token-bucket via ASP.NET `RateLimiter`; `audit_events` table.
> - **AZ-538** — CORS narrowed to single HTTPS origin, HSTS enabled (non-Development), HTTPS redirection (non-Development).
> - New ADRs **ADR-006** through **ADR-009** below capture the per-decision context.

**External systems**:

| System | Integration Type | Direction | Purpose |
|--------|-----------------|-----------|---------|
| PostgreSQL | Database (linq2db) | Both | User + session + audit_events persistence |
| Server filesystem | File I/O | Both | Resource files; ES256 PEM key store; DataProtection key store (when `DataProtection:KeysFolder` is set) |
| Admin web panel (admin.azaion.com) | REST API | Inbound | User management, login, MFA, refresh, resource upload |
| Verifier fleet (Service role) | REST API | Inbound | Polls `/sessions/revoked`, fetches `/.well-known/jwks.json` |
| CompanionPC (Jetson) edge devices | REST API | Inbound | Login + refresh; mission-token consumer |

## 2. Technology Stack

| Layer | Technology | Version | Rationale |
|-------|-----------|---------|-----------|
| Language | C# | .NET 10.0 | Modern, cross-platform, strong typing |
| Framework | ASP.NET Core Minimal API | 10.0 | Lightweight, minimal boilerplate |
| Database | PostgreSQL | (server-side) | Open-source, robust relational DB |
| ORM | linq2db | 5.4.1 | Lightweight, LINQ-native, no migrations overhead |
| Cache | LazyCache (in-memory) | 2.4.0 | Simple async caching for user lookups |
| Auth | JWT Bearer (ES256) | 10.0.3 | Stateless token auth; cycle 2 — switched from HS256 to ES256 with JWKS (AZ-532) |
| Password hashing | Konscious.Security.Cryptography (Argon2id) | (cycle 2 add) | Replaces SHA-384 (AZ-536) |
| MFA | OtpNet (TOTP) + QRCoder (PNG) | (cycle 2 add) | TOTP + recovery codes (AZ-534) |
| Rate limiting | Microsoft.AspNetCore.RateLimiting | 10.0 | Per-IP sliding window (AZ-537) |
| Data protection | Microsoft.AspNetCore.DataProtection | 10.0 | Encrypt MFA secret at rest (AZ-534) |
| Validation | FluentValidation | 11.3.0 / 11.10.0 | Declarative request validation |
| Logging | Serilog | 4.1.0 | Structured logging (console + file) |
| API Docs | Swashbuckle (Swagger) | 10.1.4 | OpenAPI specification |
| Serialization | Newtonsoft.Json | 13.0.4 | JSON for DB field mapping and responses (bumped from 13.0.1 by audit D-1) |
| Container | Docker | .NET 10.0 images | Multi-stage build, ARM64 support |
| CI/CD | Woodpecker CI | — | Branch-based ARM64 builds |
| Registry | docker.azaion.com | — | Private container registry |

## 3. Deployment Model

**Environments**: Development (local), Production (Linux server)

**Infrastructure**:
- Self-hosted Linux server (evidenced by `env/` provisioning scripts for Debian/Ubuntu)
- Docker containerization with private registry (`docker.azaion.com`, `localhost:5000`)
- No orchestration (single container deployment via `deploy.cmd`)

**Environment-specific configuration**:

| Config | Development | Production |
|--------|-------------|------------|
| Database | Local PostgreSQL (port 4312) | Remote PostgreSQL (same custom port) |
| Secrets | Environment variables (`ASPNETCORE_*`) | Environment variables |
| Logging | Console + file | Console + rolling file (`logs/log.txt`) |
| Swagger | Enabled | Disabled |
| CORS | (same policy registered, allows `https://admin.azaion.com`) | `https://admin.azaion.com` only |
| HSTS | **Disabled** (Development bypass) | **Enabled** (1 y, includeSubDomains, preload) |
| HTTPS redirect | **Disabled** (Development bypass) | **Enabled** |
| ES256 keys | `JwtConfig.KeysFolder` — at least one PEM, `ActiveKid` selects | Same; persistent volume mandatory |
| DataProtection keys | Ephemeral OK (single-instance dev) | `DataProtection:KeysFolder` MUST be a persistent volume — otherwise MFA secrets are unrecoverable after restart |

## 4. Data Model Overview

**Core entities**:

| Entity | Description | Owned By Component |
|--------|-------------|--------------------|
| User | System user. Cycle 2 added `failed_login_count`, `lockout_until` (AZ-537) and `mfa_*` columns (AZ-534). `password_hash` is now Argon2id PHC; legacy SHA-384 base64 lazily upgraded on next login (AZ-536). | 01 Data Layer |
| Session *(AZ-531+535+533+534)* | One row per refresh token (interactive) or per mission token. Carries `family_id` (rotation chain), `revoked_at`/`revoked_reason`/`revoked_by_user_id`, `class` ∈ {`interactive`, `mission`}, `aircraft_id`, `mfa_authenticated`. | 01 Data Layer |
| AuditEvent *(AZ-537+534)* | Append-only `audit_events` row: login_failed/success/lockout, mfa_enroll/confirm/disable/login_success/login_failed/recovery_used. | 01 Data Layer |
| UserConfig | JSON-serialized per-user configuration (queue offsets). | 01 Data Layer |
| RoleEnum | Authorization role hierarchy. Cycle 2 added `Service = 60` for verifier identities (AZ-535). | 01 Data Layer |
| DetectionClass | Operator-managed catalogue. Unchanged in cycle 2. | 01 Data Layer |
| ExceptionEnum | Business error code catalog. Cycle 2 added codes 50–61 for the auth/MFA/refresh/mission/lockout paths. | Common Helpers |

**Key relationships** (cycle 2 additions):
- User 1 — N Session (`sessions.user_id` FK, ON DELETE CASCADE)
- User 1 — N Session (`sessions.aircraft_id` FK for mission rows, ON DELETE SET NULL)
- User 1 — N Session (`sessions.revoked_by_user_id` FK, ON DELETE SET NULL)
- Session 1 — N Session (`parent_session_id` rotation chain)

**Data flow summary**:
- Client → API → UserService → PostgreSQL: user CRUD + Argon2id verify/hash + lazy migration
- Client → API → RefreshTokenService / SessionService / MfaService / MissionTokenService → PostgreSQL `sessions` + `users` + `audit_events`
- Verifier → API → SessionService → PostgreSQL `sessions` (revoked-since snapshot) + JwtSigningKeyProvider (JWKS)
- Client → API → ResourcesService → Filesystem: resource upload / list / clear

## 5. Integration Points

### Internal Communication

| From | To | Protocol | Pattern | Notes |
|------|----|----------|---------|-------|
| Admin API | User Management | Direct DI call | Request-Response | Scoped |
| Admin API | AuthService | Direct DI call | Request-Response | Scoped — also reads `IJwtSigningKeyProvider` (singleton) |
| Admin API | RefreshTokenService / SessionService / MfaService / MissionTokenService / AuditLog | Direct DI call | Request-Response | Scoped |
| Admin API | Resource Management | Direct DI call | Request-Response | Scoped |
| User Management | AuditLog | Direct DI call | Request-Response | Failed/success/lockout audit + sliding-window count |
| MfaService | IDataProtector | Direct DI call | Request-Response | Encrypt/decrypt mfa_secret |
| All services | Data Layer | Direct DI call | Request-Response | Singleton DbFactory |

### External Integrations

| External System | Protocol | Auth | Rate Limits | Failure Mode |
|----------------|----------|------|-------------|--------------|
| PostgreSQL | TCP (Npgsql) | Username/password | None configured | Exception propagation |
| Filesystem | OS I/O | OS-level permissions | None | Exception propagation |

## 6. Non-Functional Requirements

| Requirement | Target | Measurement | Priority |
|------------|--------|-------------|----------|
| Max upload size | 200 MB | Kestrel MaxRequestBodySize | High |
| Password hashing | Argon2id (parameters from `AuthConfig.PasswordHashing`) | Per-user, constant-time verify | High |
| Access token lifetime | `JwtConfig.AccessTokenLifetimeMinutes` (15 default) | Per token | High |
| Refresh token sliding lifetime | `SessionConfig.RefreshSlidingHours` | Per session row | High |
| Refresh token absolute lifetime | `SessionConfig.RefreshAbsoluteHours` | Per family | High |
| Mission token lifetime | `MissionSessionRequest.PlannedDurationH` (validation-bounded) | Per mission session | High |
| Per-IP login rate | `AuthConfig.RateLimit.PerIpPermitLimit` per `PerIpWindowSeconds` | Sliding window | High |
| Per-account login rate | `AuthConfig.RateLimit.PerAccountFailedThreshold` per `PerAccountWindowSeconds` | DB sliding window via `audit_events` | High |
| Account lockout | `AuthConfig.Lockout.ConsecutiveFailureThreshold` failures → `LockoutSeconds` lockout | DB-backed | High |
| HSTS | 1 y, includeSubDomains, preload (non-Development) | HTTP header | High |
| HTTPS redirect | Enabled (non-Development) | Middleware | High |
| Cache TTL | 4 hours | User entity cache | Low |

No explicit availability, latency, throughput, or recovery targets found in the codebase.

## 7. Security Architecture

**Authentication**:
- ES256 (ECDSA P-256) JWT bearer tokens (AZ-532). `ValidAlgorithms` pinned to `ES256` to prevent the HS256-with-public-key forgery class.
- Opaque refresh tokens with server-side rotation + reuse detection (AZ-531). Stored as SHA-256 hashes; never re-presented.
- TOTP MFA + recovery codes (AZ-534). Step-1 token is itself an ES256 JWT with a separate audience.
- Mission tokens (AZ-533) — long-lived, no refresh, bound to `aircraft_id`, auto-revoked on aircraft reconnect.

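
The refresh-token behaviour listed above (opaque value, SHA-256 hash at rest, rotation, family-wide revocation on reuse) can be sketched as follows. This is an illustrative Python model, not the service's C# code: `RefreshStore`, `Session`, and the in-memory dictionary standing in for the `sessions` table are all hypothetical names, and lifetimes are omitted.

```python
import base64
import hashlib
import secrets
import uuid
from dataclasses import dataclass
from typing import Optional


def _hash(token: str) -> str:
    """SHA-256 hex digest; only this value is kept server-side."""
    return hashlib.sha256(token.encode("ascii")).hexdigest()


def _new_token() -> str:
    """Opaque Base64Url(32 random bytes) with padding stripped."""
    raw = secrets.token_bytes(32)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


@dataclass
class Session:
    family_id: str
    revoked_reason: Optional[str] = None


class RefreshStore:
    """Hypothetical in-memory stand-in for the `sessions` table."""

    def __init__(self) -> None:
        self._rows: dict[str, Session] = {}  # token hash -> session row

    def issue(self, family_id: Optional[str] = None) -> str:
        """Insert a new row; a missing family_id starts a new family."""
        token = _new_token()
        self._rows[_hash(token)] = Session(family_id or str(uuid.uuid4()))
        return token

    def rotate(self, presented: str) -> str:
        """Exchange a live token for a new one in the same family."""
        row = self._rows.get(_hash(presented))
        if row is None:
            raise PermissionError("unknown refresh token")
        if row.revoked_reason == "rotated":
            # An already-rotated token came back: revoke every live row
            # in the family, then reject the request.
            for other in self._rows.values():
                if other.family_id == row.family_id and other.revoked_reason is None:
                    other.revoked_reason = "reuse_detected"
            raise PermissionError("refresh token reuse detected")
        if row.revoked_reason is not None:
            raise PermissionError("refresh token revoked")
        row.revoked_reason = "rotated"
        return self.issue(row.family_id)
```

In this model, replaying a rotated token also invalidates the newest token of the same family, which is why theft surfaces on the next legitimate refresh.
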
**Authorization**: Role-based (RBAC) via ASP.NET Core authorization policies:
- `apiAdminPolicy` — requires `ApiAdmin`
- `revocationReaderPolicy` — requires `Service` OR `ApiAdmin` (verifier fleet)
- General `[Authorize]` — any authenticated user

**Data protection**:
- **At rest**: `mfa_secret` is encrypted via `IDataProtector` (purpose `Azaion.Mfa.Secret`). MFA recovery codes are individually Argon2id-hashed and single-use. Passwords are Argon2id PHC strings. ES256 PEM keys live in `JwtConfig.KeysFolder` — protect via filesystem permissions.
- **In transit**: HSTS + HTTPS redirection in non-Development environments (AZ-538). CORS narrowed to `https://admin.azaion.com` only.
- **Token revocation propagation**: `GET /sessions/revoked` provides a verifier-poll snapshot; verifiers are responsible for honoring it within their poll cadence (currently ~30s recommended).
- **Secrets management**: Environment variables (`ASPNETCORE_*` prefix).

**Audit logging**: `audit_events` table records login_success/failed/lockout and mfa_enroll/confirm/disable/login_success/login_failed/recovery_used events with normalised email + caller IP. Drives the per-account rate limit and provides forensic evidence. Serilog continues to log business exceptions (WARN) and general events (INFO).

## 8. Key Architectural Decisions

### ADR-001: Minimal API over Controllers

**Context**: API has ~17 endpoints with simple request/response patterns.

**Decision**: Use ASP.NET Core Minimal API with top-level statements instead of MVC controllers.

**Consequences**: All endpoints in a single `Program.cs`. Simple for small APIs but could become unwieldy as endpoints grow.

### ADR-002: Read/Write Database Connection Separation

**Context**: Needed different privilege levels for read vs. write operations.

**Decision**: `DbFactory` maintains two connection strings — a read-only one (`AzaionDb`) and an admin one (`AzaionDbAdmin`) — with separate `Run` and `RunAdmin` methods.

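
As a rough illustration of this decision, the sketch below models a factory with one read-only and one privileged connection string. It is an assumption-laden stand-in: it uses Python and SQLite's `mode=ro` URI flag instead of C#, linq2db, and two PostgreSQL users, and the method names `run` / `run_admin` only mirror `Run` / `RunAdmin` by analogy.

```python
import os
import sqlite3
import tempfile


class DbFactory:
    """Holds two connection strings: read-only and admin (read-write)."""

    def __init__(self, path: str) -> None:
        self._reader_uri = f"file:{path}?mode=ro"   # analogue of AzaionDb
        self._admin_uri = f"file:{path}?mode=rwc"   # analogue of AzaionDbAdmin

    def run(self, fn):
        """Run fn against the read-only connection."""
        return self._with(self._reader_uri, fn)

    def run_admin(self, fn):
        """Run fn against the privileged connection and commit."""
        return self._with(self._admin_uri, fn)

    @staticmethod
    def _with(uri, fn):
        conn = sqlite3.connect(uri, uri=True)
        try:
            result = fn(conn)
            conn.commit()
            return result
        finally:
            conn.close()


if __name__ == "__main__":
    db = DbFactory(os.path.join(tempfile.mkdtemp(), "demo.db"))
    db.run_admin(lambda c: c.execute("CREATE TABLE users (email TEXT)"))
    db.run_admin(lambda c: c.execute("INSERT INTO users VALUES ('a@example.com')"))
    print(db.run(lambda c: c.execute("SELECT COUNT(*) FROM users").fetchone()[0]))
    try:
        db.run(lambda c: c.execute("DELETE FROM users"))
    except sqlite3.OperationalError as exc:
        print("write through reader rejected:", exc)
```

The point of the pattern is that a write issued through the reader path fails at the database, not merely by convention in application code.
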
**Consequences**: Write operations are explicitly gated through `RunAdmin`. Prevents accidental writes through the reader connection. Requires maintaining two DB users with different privileges.

### ADR-003: Per-User Resource Encryption — RETIRED (cycle 2, 2026-05-14)

**Original context**: Resources (DLLs, AI models) had to be delivered only to authorized users via a per-download AES-256-CBC stream keyed off the user's email + password.

**Retirement decision**: With the OTA delivery flow (AZ-183) and the hardware-binding flow (AZ-197) both gone, the only remaining consumers of the encrypted-download path were a now-vestigial `POST /resources/get/{dataFolder?}` endpoint and the two installer endpoints. None of them is part of the target architecture (browser SaaS + fTPM Jetsons), so the entire encrypt-on-download stack — `POST /resources/get`, `GET /resources/get-installer`, `GET /resources/get-installer/stage`, `ResourcesService.GetEncryptedResource`, `ResourcesService.GetInstaller`, `Security.GetApiEncryptionKey`, `Security.EncryptTo`, `Security.DecryptTo`, `GetResourceRequest`, `WrongResourceName` (50), `ResourcesConfig.SuiteInstallerFolder` / `SuiteStageInstallerFolder` — was removed. `Security.ToHash` is retained because `UserService` still needs SHA-384 to verify legacy password hashes during the lazy Argon2id migration (AZ-536).

**Consequences**: resource files now live on disk as plain bytes; any future at-rest encryption must come from filesystem or storage-layer features (LUKS, object-store SSE), not from application code.

### ADR-004: Hardware Fingerprint Binding — RETIRED (AZ-197)

**Original context**: Resources should only be usable on a specific physical machine.

**Original decision**: On first resource access, the user's hardware fingerprint string was stored. Subsequent accesses compared the hash of the provided hardware against the stored value.

**Retirement decision (2026-05-13, AZ-197)**: The threat model that motivated this binding (credential reuse across machines via desktop installers) no longer applies:
- **Edge devices** ship as **fTPM-secured Jetsons** (secure boot, fTPM-protected key storage, no user filesystem access, no installer redistribution). Hardware identity is anchored in the fTPM, not in a SHA-384 of CPU/GPU/Memory/DriveSerial strings.
- **Server / desktop access** is **SaaS-only** (browser → admin API). There is no installer to copy and no hardware fingerprint to take.

The binding's only remaining effect was a real production failure mode (`HardwareIdMismatch`, error code 40) on legitimate hardware events. AZ-197 removed `CheckHardwareHash`, `UpdateHardware`, `Security.GetHWHash`, the `PUT /users/hardware/set` and `POST /resources/check` endpoints, and the `Hardware` field from `GetResourceRequest`. The `User.Hardware` DB column is a nullable tombstone (no migration in AZ-197; separate ticket if/when the column is dropped).

### ADR-005: linq2db over Entity Framework

**Context**: Needed a lightweight ORM for PostgreSQL.

**Decision**: Use linq2db instead of Entity Framework Core.

**Consequences**: No migration framework — schema managed via SQL scripts (`env/db/`). Lighter runtime footprint. Manual mapping configuration in `AzaionDbSchemaHolder`.

### ADR-006: Asymmetric ES256 JWT signing with file-system key store + JWKS *(cycle 2 — AZ-532)*

**Context**: Cycle-1 JWT signing was symmetric HS256 with the secret in environment configuration. The verifier fleet (satellite-provider, gps-denied, ui) needed to validate tokens without sharing the signing secret with every service. Sharing the HS256 secret would have made any verifier compromise also a token-forgery primitive.

**Decision**: Switch to ES256 (ECDSA P-256). The Admin API holds the private key; verifiers fetch the public key set from `GET /.well-known/jwks.json`. Keys live as one PEM per kid in `JwtConfig.KeysFolder`.
`JwtConfig.ActiveKid` selects the signer; ALL discovered keys are exposed in JWKS so existing tokens stay verifiable across rotations.

**Alternatives rejected**:
- **Continue HS256 + share secret**: rejected — secret-distribution + verifier-compromise blast radius.
- **RS256**: equivalent security, larger keys, no operational benefit at our scale.
- **External KMS / HSM**: deferred — adds operational complexity (KMS auth, latency on every signing op) without near-term benefit. The PEM-on-disk approach is reversible to KMS later.

**Consequences**:
- JwtBearer `ValidAlgorithms = [ES256]` is mandatory — without it, a token forged with `alg=HS256` using the public key as the HMAC secret would validate.
- The PEM directory MUST be a persistent volume.
- Key rotation is "drop a new PEM, set `ActiveKid`, restart" — the old kid keeps verifying tokens until physically removed.
- Verifiers MUST cache the JWKS for at most 1 hour to pick up new kids quickly.

### ADR-007: Refresh tokens as opaque rotating server-side rows (not JWT) *(cycle 2 — AZ-531)*

**Context**: The dual-token model needs a refresh token. The two viable shapes are (a) signed self-describing JWT or (b) opaque server-stored value. Refresh tokens are long-lived; their threat model centres on theft + replay.

**Decision**: Opaque random `Base64Url(32 bytes)` stored on the server as a SHA-256 hash. Each rotation marks the previous row as `revoked_reason='rotated'` and inserts a new row in the same `family_id`. Presenting an already-rotated token revokes the entire family with `reason='reuse_detected'`.

**Alternatives rejected**:
- **JWT refresh token**: server cannot revoke without a denylist (which negates the "stateless" advantage). No reuse-detection without ALSO server state.
- **Sliding session ID alone (no rotation)**: theft is permanent until manual revocation.

**Consequences**:
- Every refresh hits Postgres (one indexed lookup + one update + one insert in a transaction).
  Acceptable at current load; if it becomes a bottleneck, the `sessions_refresh_hash_idx` UNIQUE INDEX is the obvious caching boundary.
- Refresh-token theft is detectable on the next legitimate refresh.
- The session row is also the `sid` claim in the access token — the same row drives logout (F12), JWKS-independent revocation snapshots (F15), and AMR persistence across rotations (`mfa_authenticated`).

### ADR-008: TOTP MFA secrets encrypted via `IDataProtector` *(cycle 2 — AZ-534)*

**Context**: MFA secrets are TOTP shared secrets — possession of the database alone (DBA access, backup leak) must NOT yield the ability to mint TOTP codes for users.

**Decision**: Encrypt `mfa_secret` with ASP.NET `IDataProtector` (purpose string `Azaion.Mfa.Secret`) before persisting. The DataProtection key store is configured via `DataProtection:KeysFolder` and MUST be a persistent volume in production. Recovery codes are individually Argon2id-hashed and stored as a `jsonb` array; single-use is enforced by setting `used_at` transactionally with the rest of the login.

**Alternatives rejected**:
- **Plaintext**: explicit DB-leak escalation path.
- **Application-managed AES via env-var key**: re-introduces the very key-distribution problem ADR-006 solved for JWT signing.
- **External KMS for MFA secrets**: deferred for the same reason as ADR-006.

**Consequences**:
- Loss of the DataProtection key folder = users must re-enroll MFA (no recovery path). This MUST be backed up alongside DB backups.
- DBA-only access does not yield MFA bypass.

### ADR-009: Per-account lockout + DB-backed sliding-window rate limit alongside per-IP token bucket *(cycle 2 — AZ-537)*

**Context**: ASP.NET `RateLimiter` is per-process and per-IP. CMMC AC.L2-3.1.8 requires per-account lockout that survives process restarts. Per-IP alone is insufficient (NAT'd attacker farm; bot rotates IPs). Per-account-only is insufficient (single IP can DoS many accounts at "just below threshold").

**Decision**: Both layers (per-IP and per-account) are applied, and a login must pass both. The per-account layer combines a rate limit (item 2) with lockout (item 3):
1. Per-IP — ASP.NET `RateLimiter` middleware with `SlidingWindowRateLimiter` on `/login` and `/login/mfa`. In-memory; resets on restart but recovers within seconds.
2. Per-account — DB-backed sliding window via `audit_events` (count `login_failed` rows for the email within `PerAccountWindowSeconds`).
3. Lockout — `users.failed_login_count` + `users.lockout_until`. After `ConsecutiveFailureThreshold` failures, `lockout_until = now + LockoutSeconds`. Subsequent logins throw `AccountLocked` with `RetryAfterSeconds` until the window passes.

**Alternatives rejected**:
- **Redis token bucket per account**: avoids DB load but adds a new infra dependency for a low-write workload. The DB sliding window has acceptable cost (`audit_events_event_type_email_idx`).
- **Single combined rule**: harder to tune.

**Consequences**:
- `audit_events` will grow large (~14 GB/yr at projected fleet scale); operational follow-up to time-partition.
- The `Retry-After` header is set both by the per-IP middleware (lease metadata) and by the `BusinessExceptionHandler` (from `BusinessException.RetryAfterSeconds`), so clients see consistent backoff hints regardless of which layer rejected.
- All gating events go through `audit_events`, providing a single auditable history.
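
The per-account half of ADR-009 (items 2 and 3 of the decision) can be modelled roughly as below. This is a hedged Python sketch, not the service's C# implementation: `LoginGate` and its fields are hypothetical, the dictionaries stand in for `audit_events` and the `users` columns, the default thresholds are placeholders, and resetting the counter on success is an assumption. The per-IP middleware layer is not modelled.

```python
import math
from dataclasses import dataclass, field


@dataclass
class LoginGate:
    failed_threshold: int = 5        # stand-in for PerAccountFailedThreshold
    window_seconds: int = 300        # stand-in for PerAccountWindowSeconds
    consecutive_threshold: int = 3   # stand-in for ConsecutiveFailureThreshold
    lockout_seconds: int = 60        # stand-in for LockoutSeconds
    failures: dict = field(default_factory=dict)       # email -> failure times
    failed_count: dict = field(default_factory=dict)   # users.failed_login_count
    lockout_until: dict = field(default_factory=dict)  # users.lockout_until

    def check(self, email: str, now: float) -> int:
        """Return 0 if the attempt may proceed, else seconds to wait."""
        until = self.lockout_until.get(email, 0.0)
        if now < until:
            return math.ceil(until - now)
        recent = [t for t in self.failures.get(email, [])
                  if now - t < self.window_seconds]
        if len(recent) >= self.failed_threshold:
            # Wait until enough old failures have left the window.
            pivot = recent[len(recent) - self.failed_threshold]
            return math.ceil(pivot + self.window_seconds - now)
        return 0

    def record_failure(self, email: str, now: float) -> None:
        """Log a failed login and lock the account at the threshold."""
        self.failures.setdefault(email, []).append(now)
        count = self.failed_count.get(email, 0) + 1
        self.failed_count[email] = count
        if count >= self.consecutive_threshold:
            self.lockout_until[email] = now + self.lockout_seconds

    def record_success(self, email: str) -> None:
        """Assumed behaviour: a successful login clears the counter."""
        self.failed_count[email] = 0
```

A caller would invoke `check` before verifying credentials and turn a non-zero result into a rejection carrying that value as the retry hint, which corresponds to the `RetryAfterSeconds` role described above.
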