Refreshes _docs/02_document/ to reflect the cycle-2 auth-modernization
+ CMMC hardening landings (AZ-531..AZ-538). Authoritative source for
the ripple set is ripple_log_cycle2.md.
Covered:
- architecture.md (section 1 rewritten, ADRs 6-9 added)
- data_model.md (sessions, audit_events, user columns, migrations)
- system-flows.md (F1 rewritten; F11-F17 added; F2/F7/F9 minor)
- module-layout.md (cycle-2 sub-component table)
- diagrams/flows/flow_login.md (dual-token + MFA)
- components/{01_data_layer,03_auth_and_security,05_admin_api}
- modules/ (12 new, 8 modified — full Argon2id/ES256/MFA/refresh
/mission/session/audit/jwks rollup)
- tests/{blackbox,security,traceability-matrix}
Step 13 (Update Docs) output for cycle 2.
Co-authored-by: Cursor <cursoragent@cursor.com>
Azaion Admin API — Architecture
1. System Context
Problem being solved: Azaion Suite requires a centralized admin API to manage users + roles, authenticate humans (with optional second factor), authenticate UAVs for offline missions, and broker token revocation across a fleet of verifier services.
System boundaries:
- Inside: user management, password hashing (Argon2id), authentication (ES256 JWT + opaque refresh tokens with rotation + reuse detection), TOTP MFA, mission-token issuance, session revocation + verifier-poll snapshot, account lockout + per-IP and per-account rate limiting, JWKS publication, role-based authorization, file-based resource storage (upload / list / clear), HSTS + HTTPS redirect.
- Outside: admin web panel (admin.azaion.com), fTPM-secured Jetson edge devices (CompanionPC), verifier fleet (satellite-provider, gps-denied, ui; service-role identities), PostgreSQL, server filesystem.
Note (AZ-197, cycle 1): hardware-fingerprint binding removed.
Note (cycle 2 early): encrypted resource download + installer endpoints removed; ADR-003 retired.
Note (cycle 2 — Auth Modernization, 2026-05-14, AZ-531..AZ-538): the entire authentication layer was rebuilt:
- AZ-536: Argon2id password hashing replaced SHA-384; lazy migration on login.
- AZ-531: opaque refresh tokens with server-side rotation, family-based reuse detection, sliding + absolute lifetimes (SessionConfig).
- AZ-532: symmetric HS256 → asymmetric ES256 with file-system key store + JWKS endpoint.
- AZ-534: TOTP MFA (enroll/confirm/disable, recovery codes, two-step login, IDataProtector-encrypted secret, amr claim).
- AZ-535: logout (single + all) + admin revoke + verifier-poll snapshot of revoked sessions; new Service role for verifier identities.
- AZ-533: long-lived no-refresh mission tokens for UAV ops, with auto-revoke on aircraft reconnect.
- AZ-537: DB-backed account lockout + per-account sliding-window rate limit + per-IP sliding-window limit via ASP.NET RateLimiter; audit_events table.
- AZ-538: CORS narrowed to single HTTPS origin, HSTS enabled (non-Development), HTTPS redirection (non-Development).
- New ADRs ADR-006 through ADR-009 below capture the per-decision context.
External systems:
| System | Integration Type | Direction | Purpose |
|---|---|---|---|
| PostgreSQL | Database (linq2db) | Both | User + session + audit_events persistence |
| Server filesystem | File I/O | Both | Resource files; ES256 PEM key store; DataProtection key store (when DataProtection:KeysFolder is set) |
| Admin web panel (admin.azaion.com) | REST API | Inbound | User management, login, MFA, refresh, resource upload |
| Verifier fleet (Service role) | REST API | Inbound | Polls /sessions/revoked, fetches /.well-known/jwks.json |
| CompanionPC (Jetson) edge devices | REST API | Inbound | Login + refresh; mission-token consumer |
2. Technology Stack
| Layer | Technology | Version | Rationale |
|---|---|---|---|
| Language | C# | .NET 10.0 | Modern, cross-platform, strong typing |
| Framework | ASP.NET Core Minimal API | 10.0 | Lightweight, minimal boilerplate |
| Database | PostgreSQL | (server-side) | Open-source, robust relational DB |
| ORM | linq2db | 5.4.1 | Lightweight, LINQ-native, no migrations overhead |
| Cache | LazyCache (in-memory) | 2.4.0 | Simple async caching for user lookups |
| Auth | JWT Bearer (ES256) | 10.0.3 | Stateless token auth; cycle 2 — switched from HS256 to ES256 with JWKS (AZ-532) |
| Password hashing | Konscious.Security.Cryptography (Argon2id) | (cycle 2 add) | Replaces SHA-384 (AZ-536) |
| MFA | OtpNet (TOTP) + QRCoder (PNG) | (cycle 2 add) | TOTP + recovery codes (AZ-534) |
| Rate limiting | Microsoft.AspNetCore.RateLimiting | 10.0 | Per-IP sliding window (AZ-537) |
| Data protection | Microsoft.AspNetCore.DataProtection | 10.0 | Encrypt MFA secret at rest (AZ-534) |
| Validation | FluentValidation | 11.3.0 / 11.10.0 | Declarative request validation |
| Logging | Serilog | 4.1.0 | Structured logging (console + file) |
| API Docs | Swashbuckle (Swagger) | 10.1.4 | OpenAPI specification |
| Serialization | Newtonsoft.Json | 13.0.4 | JSON for DB field mapping and responses (bumped from 13.0.1 by audit D-1) |
| Container | Docker | .NET 10.0 images | Multi-stage build, ARM64 support |
| CI/CD | Woodpecker CI | — | Branch-based ARM64 builds |
| Registry | docker.azaion.com | — | Private container registry |
3. Deployment Model
Environments: Development (local), Production (Linux server)
Infrastructure:
- Self-hosted Linux server (evidenced by env/ provisioning scripts for Debian/Ubuntu)
- Docker containerization with private registry (docker.azaion.com, localhost:5000)
- No orchestration (single container deployment via deploy.cmd)
Environment-specific configuration:
| Config | Development | Production |
|---|---|---|
| Database | Local PostgreSQL (port 4312) | Remote PostgreSQL (same custom port) |
| Secrets | Environment variables (ASPNETCORE_*) | Environment variables |
| Logging | Console + file | Console + rolling file (logs/log.txt) |
| Swagger | Enabled | Disabled |
| CORS | (same policy registered, allows https://admin.azaion.com) | https://admin.azaion.com only |
| HSTS | Disabled (Development bypass) | Enabled (1 y, includeSubDomains, preload) |
| HTTPS redirect | Disabled (Development bypass) | Enabled |
| ES256 keys | JwtConfig.KeysFolder: at least one PEM; ActiveKid selects the signer | Same; persistent volume mandatory |
| DataProtection keys | Ephemeral OK (single-instance dev) | DataProtection:KeysFolder MUST be a persistent volume — otherwise MFA secrets are unrecoverable after restart |
4. Data Model Overview
Core entities:
| Entity | Description | Owned By Component |
|---|---|---|
| User | System user. Cycle 2 added failed_login_count, lockout_until (AZ-537) and mfa_* columns (AZ-534). password_hash is now Argon2id PHC; legacy SHA-384 base64 lazily upgraded on next login (AZ-536). | 01 Data Layer |
| Session (AZ-531+535+533+534) | One row per refresh token (interactive) or per mission token. Carries family_id (rotation chain), revoked_at/revoked_reason/revoked_by_user_id, class ∈ {interactive, mission}, aircraft_id, mfa_authenticated. | 01 Data Layer |
| AuditEvent (AZ-537+534) | Append-only audit_events row: login_failed/success/lockout, mfa_enroll/confirm/disable/login_success/login_failed/recovery_used. | 01 Data Layer |
| UserConfig | JSON-serialized per-user configuration (queue offsets). | 01 Data Layer |
| RoleEnum | Authorization role hierarchy. Cycle 2 added Service = 60 for verifier identities (AZ-535). | 01 Data Layer |
| DetectionClass | Operator-managed catalogue. Unchanged in cycle 2. | 01 Data Layer |
| ExceptionEnum | Business error code catalog. Cycle 2 added codes 50–61 for the auth/MFA/refresh/mission/lockout paths. | Common Helpers |
Key relationships (cycle 2 additions):
- User 1:N Session (sessions.user_id FK, ON DELETE CASCADE)
- User 1:N Session (sessions.aircraft_id FK for mission rows, ON DELETE SET NULL)
- User 1:N Session (sessions.revoked_by_user_id FK, ON DELETE SET NULL)
- Session 1:N Session (parent_session_id rotation chain)
Data flow summary:
- Client → API → UserService → PostgreSQL: user CRUD + Argon2id verify/hash + lazy migration
- Client → API → RefreshTokenService / SessionService / MfaService / MissionTokenService → PostgreSQL sessions + users + audit_events
- Verifier → API → SessionService → PostgreSQL sessions (revoked-since snapshot) + JwtSigningKeyProvider (JWKS)
- Client → API → ResourcesService → Filesystem: resource upload / list / clear
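The lazy-migration step in the first flow (verify against the legacy hash, then re-hash on a successful login) can be sketched as follows. This is a minimal illustration and not the project's code. Python's standard library has no Argon2id, so PBKDF2-HMAC-SHA256 stands in for it here. The `$pbkdf2-sha256$` storage format, the iteration count, the function names, and the assumption that the legacy value is an unsalted base64 SHA-384 digest are all hypothetical.

```python
import base64
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Stand-in for Argon2id (assumption): PBKDF2 from the standard library.
PBKDF2_ITERATIONS = 100_000
MODERN_PREFIX = "$pbkdf2-sha256$"  # hypothetical storage format


def legacy_sha384(password: str) -> str:
    """Assumed legacy format: base64 of an unsalted SHA-384 digest."""
    return base64.b64encode(hashlib.sha384(password.encode()).digest()).decode()


def modern_hash(password: str, salt: Optional[bytes] = None) -> str:
    """Salted, slow hash encoded as $pbkdf2-sha256$<salt>$<digest>."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, PBKDF2_ITERATIONS)
    return MODERN_PREFIX + base64.b64encode(salt).decode() + "$" + base64.b64encode(digest).decode()


def verify_and_upgrade(password: str, stored: str) -> Tuple[bool, str]:
    """Return (ok, hash_to_store). A matching legacy hash is replaced."""
    if stored.startswith(MODERN_PREFIX):
        _, _, salt_b64, _ = stored.split("$")
        candidate = modern_hash(password, base64.b64decode(salt_b64))
        return hmac.compare_digest(candidate, stored), stored
    # Legacy path: only a correct password triggers the upgrade.
    if hmac.compare_digest(legacy_sha384(password), stored):
        return True, modern_hash(password)
    return False, stored
```

The caller would persist the second tuple element whenever it differs from the stored value, so each legacy row is upgraded the first time its owner logs in successfully.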
5. Integration Points
Internal Communication
| From | To | Protocol | Pattern | Notes |
|---|---|---|---|---|
| Admin API | User Management | Direct DI call | Request-Response | Scoped |
| Admin API | AuthService | Direct DI call | Request-Response | Scoped — also reads IJwtSigningKeyProvider (singleton) |
| Admin API | RefreshTokenService / SessionService / MfaService / MissionTokenService / AuditLog | Direct DI call | Request-Response | Scoped |
| Admin API | Resource Management | Direct DI call | Request-Response | Scoped |
| User Management | AuditLog | Direct DI call | Request-Response | Failed/success/lockout audit + sliding-window count |
| MfaService | IDataProtector | Direct DI call | Request-Response | Encrypt/decrypt mfa_secret |
| All services | Data Layer | Direct DI call | Request-Response | Singleton DbFactory |
External Integrations
| External System | Protocol | Auth | Rate Limits | Failure Mode |
|---|---|---|---|---|
| PostgreSQL | TCP (Npgsql) | Username/password | None configured | Exception propagation |
| Filesystem | OS I/O | OS-level permissions | None | Exception propagation |
6. Non-Functional Requirements
| Requirement | Target | Measurement | Priority |
|---|---|---|---|
| Max upload size | 200 MB | Kestrel MaxRequestBodySize | High |
| Password hashing | Argon2id (parameters from AuthConfig.PasswordHashing) | Per-user, constant-time verify | High |
| Access token lifetime | JwtConfig.AccessTokenLifetimeMinutes (15 default) | Per token | High |
| Refresh token sliding lifetime | SessionConfig.RefreshSlidingHours | Per session row | High |
| Refresh token absolute lifetime | SessionConfig.RefreshAbsoluteHours | Per family | High |
| Mission token lifetime | MissionSessionRequest.PlannedDurationH (validation-bounded) | Per mission session | High |
| Per-IP login rate | AuthConfig.RateLimit.PerIpPermitLimit per PerIpWindowSeconds | Sliding window | High |
| Per-account login rate | AuthConfig.RateLimit.PerAccountFailedThreshold per PerAccountWindowSeconds | DB sliding window via audit_events | High |
| Account lockout | AuthConfig.Lockout.ConsecutiveFailureThreshold failures → LockoutSeconds lockout | DB-backed | High |
| HSTS | 1 y, includeSubDomains, preload (non-Development) | HTTP header | High |
| HTTPS redirect | Enabled (non-Development) | Middleware | High |
| Cache TTL | 4 hours | User entity cache | Low |
No explicit availability, latency, throughput, or recovery targets found in the codebase.
7. Security Architecture
Authentication:
- ES256 (ECDSA P-256) JWT bearer tokens (AZ-532). ValidAlgorithms pinned to ES256 to prevent the HS256-with-public-key forgery class.
- Opaque refresh tokens with server-side rotation + reuse detection (AZ-531). Stored as SHA-256 hashes; never re-presented.
- TOTP MFA + recovery codes (AZ-534). Step-1 token is itself an ES256 JWT with a separate audience.
- Mission tokens (AZ-533): long-lived, no refresh, bound to aircraft_id, auto-revoked on aircraft reconnect.
Authorization: Role-based (RBAC) via ASP.NET Core authorization policies:
- apiAdminPolicy: requires ApiAdmin
- revocationReaderPolicy: requires Service OR ApiAdmin (verifier fleet)
- General [Authorize]: any authenticated user
Data protection:
- At rest: mfa_secret is encrypted via IDataProtector (purpose Azaion.Mfa.Secret). MFA recovery codes are individually Argon2id-hashed and single-use. Passwords are Argon2id PHC strings. ES256 PEM keys live in JwtConfig.KeysFolder; protect them via filesystem permissions.
- In transit: HSTS + HTTPS redirection in non-Development environments (AZ-538). CORS narrowed to https://admin.azaion.com only.
- Token revocation propagation: GET /sessions/revoked provides a verifier-poll snapshot; verifiers are responsible for honoring it within their poll cadence (currently ~30s recommended).
- Secrets management: Environment variables (ASPNETCORE_* prefix).
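A verifier-side view of the revocation poll might look like the sketch below. It is illustrative only: the `RevocationCache` class, the injected `fetch` callable (standing in for a call to GET /sessions/revoked), and the assumption that the snapshot is a list of session ids are hypothetical, since the source does not give the response shape. The 30-second default mirrors the recommended cadence mentioned above.

```python
import time
from typing import Callable, Iterable, Optional, Set


class RevocationCache:
    """Keeps a local set of revoked session ids, refreshed on a poll cadence."""

    def __init__(self, fetch: Callable[[], Iterable[str]],
                 poll_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch              # hypothetical wrapper around the HTTP call
        self._poll_seconds = poll_seconds
        self._clock = clock
        self._revoked: Set[str] = set()
        self._last_poll: Optional[float] = None

    def _refresh_if_due(self) -> None:
        now = self._clock()
        if self._last_poll is None or now - self._last_poll >= self._poll_seconds:
            # Revocations are never undone, so the local set only grows.
            self._revoked |= set(self._fetch())
            self._last_poll = now

    def is_revoked(self, sid: str) -> bool:
        """Check the sid claim of an otherwise valid access token."""
        self._refresh_if_due()
        return sid in self._revoked
```

The trade-off is visible in the code: a session revoked on the server stays usable at a given verifier until that verifier's next poll, so the poll interval bounds the revocation delay.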
Audit logging: audit_events table records login_success/failed/lockout and mfa_enroll/confirm/disable/login_success/login_failed/recovery_used events with normalised email + caller IP. Drives the per-account rate limit and provides forensic evidence. Serilog continues to log business exceptions (WARN) and general events (INFO).
8. Key Architectural Decisions
ADR-001: Minimal API over Controllers
Context: API has ~17 endpoints with simple request/response patterns.
Decision: Use ASP.NET Core Minimal API with top-level statements instead of MVC controllers.
Consequences: All endpoints in a single Program.cs. Simple for small APIs but could become unwieldy as endpoints grow.
ADR-002: Read/Write Database Connection Separation
Context: Needed different privilege levels for read vs. write operations.
Decision: DbFactory maintains two connection strings — a read-only one (AzaionDb) and an admin one (AzaionDbAdmin) — with separate Run and RunAdmin methods.
Consequences: Write operations are explicitly gated through RunAdmin. Prevents accidental writes through the reader connection. Requires maintaining two DB users with different privileges.
ADR-003: Per-User Resource Encryption — RETIRED (cycle 2, 2026-05-14)
Original context: Resources (DLLs, AI models) had to be delivered only to authorized users via a per-download AES-256-CBC stream keyed off the user's email + password.
Retirement decision: With the OTA delivery flow (AZ-183) and the hardware-binding flow (AZ-197) both gone, the only remaining consumers of the encrypted-download path were a now-vestigial POST /resources/get/{dataFolder?} endpoint and the two installer endpoints. None of them are part of the target architecture (browser SaaS + fTPM Jetsons), so the entire encrypt-on-download stack was removed: POST /resources/get, GET /resources/get-installer, GET /resources/get-installer/stage, ResourcesService.GetEncryptedResource, ResourcesService.GetInstaller, Security.GetApiEncryptionKey, Security.EncryptTo, Security.DecryptTo, GetResourceRequest, WrongResourceName (50), and ResourcesConfig.SuiteInstallerFolder / SuiteStageInstallerFolder. Security.ToHash was retained at the time because it backed SHA-384 password hashing in UserService; since AZ-536 moved new hashes to Argon2id, SHA-384 remains relevant only for legacy hashes awaiting lazy migration.
Consequences: resource files now live on disk as plain bytes; any future at-rest encryption must come from filesystem or storage-layer features (LUKS, object-store SSE), not from application code.
ADR-004: Hardware Fingerprint Binding — RETIRED (AZ-197)
Original context: Resources should only be usable on a specific physical machine.
Original decision: On first resource access, the user's hardware fingerprint string was stored. Subsequent accesses compared the hash of the provided hardware against the stored value.
Retirement decision (2026-05-13, AZ-197): The threat model that motivated this binding (credential reuse across machines via desktop installers) no longer applies:
- Edge devices ship as fTPM-secured Jetsons (secure boot, fTPM-protected key storage, no user filesystem access, no installer redistribution). Hardware identity is anchored in the fTPM, not in a SHA-384 of CPU/GPU/Memory/DriveSerial strings.
- Server / desktop access is SaaS-only (browser → admin API). There is no installer to copy and no hardware fingerprint to take.
The binding's only remaining effect was a real production failure mode (HardwareIdMismatch, error code 40) on legitimate hardware events. AZ-197 removed CheckHardwareHash, UpdateHardware, Security.GetHWHash, the PUT /users/hardware/set and POST /resources/check endpoints, and the Hardware field from GetResourceRequest. The User.Hardware DB column is a nullable tombstone (no migration in AZ-197; separate ticket if/when the column is dropped).
ADR-005: linq2db over Entity Framework
Context: Needed a lightweight ORM for PostgreSQL.
Decision: Use linq2db instead of Entity Framework Core.
Consequences: No migration framework — schema managed via SQL scripts (env/db/). Lighter runtime footprint. Manual mapping configuration in AzaionDbSchemaHolder.
ADR-006: Asymmetric ES256 JWT signing with file-system key store + JWKS (cycle 2 — AZ-532)
Context: Cycle-1 JWT signing was symmetric HS256 with the secret in environment configuration. The verifier fleet (satellite-provider, gps-denied, ui) needed to validate tokens without sharing the signing secret with every service. Sharing the HS256 secret would have made any verifier compromise also a token-forgery primitive.
Decision: Switch to ES256 (ECDSA P-256). The Admin API holds the private key; verifiers fetch the public key set from GET /.well-known/jwks.json. Keys live as one PEM per kid in JwtConfig.KeysFolder. JwtConfig.ActiveKid selects the signer; ALL discovered keys are exposed in JWKS so existing tokens stay verifiable across rotations.
Alternatives rejected:
- Continue HS256 + share secret: rejected — secret-distribution + verifier-compromise blast radius.
- RS256: equivalent security, larger keys, no operational benefit at our scale.
- External KMS / HSM: deferred — adds operational complexity (KMS auth, latency on every signing op) without near-term benefit. The PEM-on-disk approach is reversible to KMS later.
Consequences:
- JwtBearer ValidAlgorithms = [ES256] is mandatory: without it, a token forged with alg=HS256 using the public key as the HMAC secret would validate.
- The PEM directory MUST be a persistent volume.
- Key rotation is "drop a new PEM, set ActiveKid, restart"; the old kid keeps verifying tokens until physically removed.
- Verifiers MUST cache the JWKS for at most 1 hour to pick up new kids quickly.
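The forgery class that the ValidAlgorithms pin guards against can be reproduced with the standard library alone. The sketch below is an illustration under stated assumptions. `naive_verify` is a deliberately flawed, hypothetical verifier that trusts the `alg` header, `pinned_precheck` is a hypothetical stand-in for the allow-list check, and the PEM bytes are a placeholder and not a real key.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def forge_hs256(claims: dict, public_key_pem: bytes) -> str:
    """Attacker side: HMAC-sign a token using the *public* key as the secret."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    sig = hmac.new(public_key_pem, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"


def naive_verify(token: str, verification_key: bytes) -> bool:
    """Flawed verifier: lets the token header choose the algorithm."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    alg = json.loads(b64url_decode(header_b64))["alg"]
    if alg == "HS256":
        expected = hmac.new(verification_key, f"{header_b64}.{payload_b64}".encode(),
                            hashlib.sha256).digest()
        return hmac.compare_digest(expected, b64url_decode(sig_b64))
    return False  # the ES256 path is omitted from this sketch


def pinned_precheck(token: str, valid_algorithms=("ES256",)) -> bool:
    """Reject any token whose header alg is outside the allow-list."""
    alg = json.loads(b64url_decode(token.split(".")[0])).get("alg")
    return alg in valid_algorithms
```

With a placeholder key, `naive_verify(forge_hs256(claims, pem), pem)` accepts the forged token while `pinned_precheck` rejects it before any signature work happens, which is the behaviour the pin is meant to guarantee.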
ADR-007: Refresh tokens as opaque rotating server-side rows (not JWT) (cycle 2 — AZ-531)
Context: The dual-token model needs a refresh token. The two viable shapes are (a) signed self-describing JWT or (b) opaque server-stored value. Refresh tokens are long-lived; their threat model centres on theft + replay.
Decision: Opaque random Base64Url(32 bytes) stored on the server as a SHA-256 hash. Each rotation marks the previous row as revoked_reason='rotated' and inserts a new row in the same family_id. Presenting an already-rotated token revokes the entire family with reason='reuse_detected'.
Alternatives rejected:
- JWT refresh token: server cannot revoke without a denylist (which negates the "stateless" advantage). No reuse-detection without ALSO server state.
- Sliding session ID alone (no rotation): theft is permanent until manual revocation.
Consequences:
- Every refresh hits Postgres (one indexed lookup + one update + one insert in a transaction). Acceptable at current load; if it becomes a bottleneck, the sessions_refresh_hash_idx UNIQUE INDEX is the obvious caching boundary.
- Refresh-token theft is detectable on the next legitimate refresh.
- The session row is also the sid claim in the access token: the same row drives logout (F12), JWKS-independent revocation snapshots (F15), and AMR persistence across rotations (mfa_authenticated).
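The rotation and reuse-detection rule in this ADR can be sketched with an in-memory store. This is a simplified illustration, not the service's implementation: the `RefreshStore` and `SessionRow` names are hypothetical, a dict stands in for the sessions table, and sliding and absolute lifetimes are left out.

```python
import base64
import hashlib
import secrets
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SessionRow:
    refresh_hash: str
    family_id: str
    revoked_reason: Optional[str] = None  # None means the row is still active


class RefreshStore:
    def __init__(self) -> None:
        self.rows: Dict[str, SessionRow] = {}  # keyed by SHA-256 of the token

    @staticmethod
    def _hash(token: str) -> str:
        return hashlib.sha256(token.encode()).hexdigest()

    def _issue(self, family_id: str) -> str:
        # Opaque Base64Url(32 random bytes); only the hash is stored.
        token = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
        digest = self._hash(token)
        self.rows[digest] = SessionRow(digest, family_id)
        return token

    def login(self) -> str:
        """Start a new rotation family."""
        return self._issue(str(uuid.uuid4()))

    def rotate(self, presented: str) -> Optional[str]:
        """Return a fresh token, or None if the presented one is rejected."""
        row = self.rows.get(self._hash(presented))
        if row is None:
            return None
        if row.revoked_reason == "rotated":
            # Replay of an already-rotated token: revoke the whole family.
            for other in self.rows.values():
                if other.family_id == row.family_id and other.revoked_reason is None:
                    other.revoked_reason = "reuse_detected"
            return None
        if row.revoked_reason is not None:
            return None
        row.revoked_reason = "rotated"
        return self._issue(row.family_id)
```

In this sketch, replaying the first token after it has been rotated causes the second token to stop working as well, because both rows share a family_id.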
ADR-008: TOTP MFA secrets encrypted via IDataProtector (cycle 2 — AZ-534)
Context: MFA secrets are TOTP shared secrets — possession of the database alone (DBA access, backup leak) must NOT yield the ability to mint TOTP codes for users.
Decision: Encrypt mfa_secret with ASP.NET IDataProtector (purpose string Azaion.Mfa.Secret) before persisting. The DataProtection key store is configured via DataProtection:KeysFolder and MUST be a persistent volume in production. Recovery codes are individually Argon2id-hashed and stored as a jsonb array; single-use is enforced by setting used_at transactionally with the rest of the login.
Alternatives rejected:
- Plaintext: explicit DB-leak escalation path.
- Application-managed AES via env-var key: re-introduces the very key-distribution problem ADR-006 solved for JWT signing.
- External KMS for MFA secrets: deferred for the same reason as ADR-006.
Consequences:
- Loss of the DataProtection key folder = users must re-enroll MFA (no recovery path). This MUST be backed up alongside DB backups.
- DBA-only access does not yield MFA bypass.
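To see why the plaintext secret is the sensitive value, consider how a TOTP code is derived. The sketch below follows RFC 6238 with SHA-1, a 30-second step, and 6 digits. Those parameters are common defaults and an assumption here, because the source does not state the settings used with OtpNet.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code (HMAC-SHA1 variant)."""
    counter = unix_time // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation, RFC 4226 section 5.3
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# RFC 6238 Appendix B test vector (SHA-1, 8 digits, T = 59 s).
print(totp(b"12345678901234567890", 59, digits=8))  # 94287082
```

Nothing but the shared secret and the current time goes into the computation, so anyone who can read a plaintext mfa_secret can produce valid codes. Encrypting the column moves that capability from database access to possession of the DataProtection keys.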
ADR-009: Per-account lockout + DB-backed sliding-window rate limit alongside per-IP sliding-window limiter (cycle 2, AZ-537)
Context: ASP.NET RateLimiter is per-process and per-IP. CMMC AC.L2-3.1.8 requires per-account lockout that survives process restarts. Per-IP alone is insufficient (NAT'd attacker farm; bot rotates IPs). Per-account-only is insufficient (single IP can DoS many accounts at "just below threshold").
Decision: Both layers, both required to pass:
- Per-IP: ASP.NET RateLimiter middleware with SlidingWindowRateLimiter on /login and /login/mfa. In-memory; resets on restart but recovers within seconds.
- Per-account: DB-backed sliding window via audit_events (count login_failed rows for the email within PerAccountWindowSeconds).
- Lockout: users.failed_login_count + users.lockout_until. After ConsecutiveFailureThreshold failures, lockout_until = now + LockoutSeconds. Subsequent logins throw AccountLocked with RetryAfterSeconds until the window passes.
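The per-account and lockout layers can be sketched as one gate function. This is an illustration under assumptions, not the service's logic. The `LoginGate` class, the default policy values, the order of the checks, and resetting the failure counter when a lockout starts are all hypothetical. A list stands in for the audit_events table, and the per-IP layer is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Policy:
    # Field names mirror the config keys in the text; values are illustrative.
    per_account_failed_threshold: int = 5
    per_account_window_seconds: int = 300
    consecutive_failure_threshold: int = 3
    lockout_seconds: int = 60


@dataclass
class Account:
    failed_login_count: int = 0
    lockout_until: Optional[float] = None


class LoginGate:
    def __init__(self, policy: Policy) -> None:
        self.policy = policy
        self.accounts: Dict[str, Account] = {}
        self.audit_events: List[Tuple[float, str, str]] = []  # (ts, type, email)

    def attempt(self, email: str, password_ok: bool, now: float) -> str:
        email = email.strip().lower()  # normalised email, as in the audit log
        acct = self.accounts.setdefault(email, Account())
        if acct.lockout_until is not None and now < acct.lockout_until:
            return "locked"
        window_start = now - self.policy.per_account_window_seconds
        recent_failures = sum(
            1 for ts, kind, who in self.audit_events
            if kind == "login_failed" and who == email and ts > window_start
        )
        if recent_failures >= self.policy.per_account_failed_threshold:
            return "rate_limited"
        if password_ok:
            acct.failed_login_count = 0
            acct.lockout_until = None
            self.audit_events.append((now, "login_success", email))
            return "ok"
        acct.failed_login_count += 1
        self.audit_events.append((now, "login_failed", email))
        if acct.failed_login_count >= self.policy.consecutive_failure_threshold:
            acct.lockout_until = now + self.policy.lockout_seconds
            acct.failed_login_count = 0  # assumption: counter restarts after lockout
            self.audit_events.append((now, "login_lockout", email))
        return "failed"
```

The two rules answer different questions: the lockout counts consecutive failures on the user row, while the sliding window counts failures in a time span regardless of successes in between. That difference is one reason the ADR keeps them as separately tunable rules.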
Alternatives rejected:
- Redis token bucket per account: avoids DB load but adds a new infra dependency for a low-write workload. The DB sliding window has acceptable cost (audit_events_event_type_email_idx).
- Single combined rule: harder to tune.
Consequences:
- audit_events will grow large (~14 GB/yr at projected fleet scale); time-partitioning is an operational follow-up.
- The Retry-After header is set both by the per-IP middleware (lease metadata) and by the BusinessExceptionHandler (from BusinessException.RetryAfterSeconds), so clients see consistent backoff hints regardless of which layer rejected.
- All gating events go through audit_events, providing a single auditable history.