Files
admin/_docs/02_document/architecture.md
T
Oleksandr Bezdieniezhnykh a77b3f8a59 [AZ-529] [AZ-530] Cycle-2 documentation refresh
Refreshes _docs/02_document/ to reflect the cycle-2 auth-modernization
+ CMMC hardening landings (AZ-531..AZ-538). Authoritative source for
the ripple set is ripple_log_cycle2.md.

Covered:
- architecture.md (section 1 rewritten, ADRs 6-9 added)
- data_model.md (sessions, audit_events, user columns, migrations)
- system-flows.md (F1 rewritten; F11-F17 added; F2/F7/F9 minor)
- module-layout.md (cycle-2 sub-component table)
- diagrams/flows/flow_login.md (dual-token + MFA)
- components/{01_data_layer,03_auth_and_security,05_admin_api}
- modules/ (12 new, 8 modified — full Argon2id/ES256/MFA/refresh
  /mission/session/audit/jwks rollup)
- tests/{blackbox,security,traceability-matrix}

Step 13 (Update Docs) output for cycle 2.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-14 09:22:53 +03:00

279 lines
22 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Azaion Admin API — Architecture
## 1. System Context
**Problem being solved**: Azaion Suite requires a centralized admin API to manage users + roles, authenticate humans (with optional second factor), authenticate UAVs for offline missions, and broker token revocation across a fleet of verifier services.
**System boundaries**:
- **Inside**: user management, password hashing (Argon2id), authentication (ES256 JWT + opaque refresh tokens with rotation + reuse detection), TOTP MFA, mission-token issuance, session revocation + verifier-poll snapshot, account lockout + per-IP and per-account rate limiting, JWKS publication, role-based authorization, file-based resource storage (upload / list / clear), HSTS + HTTPS redirect.
- **Outside**: admin web panel (`admin.azaion.com`), fTPM-secured Jetson edge devices (CompanionPC), verifier fleet (satellite-provider, gps-denied, ui — service-role identities), PostgreSQL, server filesystem.
> **Note (AZ-197, cycle 1)**: hardware-fingerprint binding removed.
>
> **Note (cycle 2 early)**: encrypted resource download + installer endpoints removed; ADR-003 retired.
>
> **Note (cycle 2 — Auth Modernization, 2026-05-14, AZ-531..AZ-538)**: the entire authentication layer was rebuilt:
> - **AZ-536** — Argon2id password hashing replaced SHA-384; lazy migration on login.
> - **AZ-531** — opaque refresh tokens with server-side rotation, family-based reuse detection, sliding + absolute lifetimes (`SessionConfig`).
> - **AZ-532** — symmetric HS256 → asymmetric ES256 with file-system key store + JWKS endpoint.
> - **AZ-534** — TOTP MFA (enroll/confirm/disable, recovery codes, two-step login, `IDataProtector`-encrypted secret, `amr` claim).
> - **AZ-535** — logout (single + all) + admin revoke + verifier-poll snapshot of revoked sessions; new `Service` role for verifier identities.
> - **AZ-533** — long-lived no-refresh mission tokens for UAV ops, with auto-revoke on aircraft reconnect.
> - **AZ-537** — DB-backed account lockout + per-account sliding-window rate limit + per-IP token-bucket via ASP.NET `RateLimiter`; `audit_events` table.
> - **AZ-538** — CORS narrowed to single HTTPS origin, HSTS enabled (non-Development), HTTPS redirection (non-Development).
> - New ADRs **ADR-006** through **ADR-009** below capture the per-decision context.
**External systems**:
| System | Integration Type | Direction | Purpose |
|--------|-----------------|-----------|---------|
| PostgreSQL | Database (linq2db) | Both | User + session + audit_events persistence |
| Server filesystem | File I/O | Both | Resource files; ES256 PEM key store; DataProtection key store (when `DataProtection:KeysFolder` is set) |
| Admin web panel (admin.azaion.com) | REST API | Inbound | User management, login, MFA, refresh, resource upload |
| Verifier fleet (Service role) | REST API | Inbound | Polls `/sessions/revoked`, fetches `/.well-known/jwks.json` |
| CompanionPC (Jetson) edge devices | REST API | Inbound | Login + refresh; mission-token consumer |
## 2. Technology Stack
| Layer | Technology | Version | Rationale |
|-------|-----------|---------|-----------|
| Language | C# | .NET 10.0 | Modern, cross-platform, strong typing |
| Framework | ASP.NET Core Minimal API | 10.0 | Lightweight, minimal boilerplate |
| Database | PostgreSQL | (server-side) | Open-source, robust relational DB |
| ORM | linq2db | 5.4.1 | Lightweight, LINQ-native, no migrations overhead |
| Cache | LazyCache (in-memory) | 2.4.0 | Simple async caching for user lookups |
| Auth | JWT Bearer (ES256) | 10.0.3 | Stateless token auth; cycle 2 — switched from HS256 to ES256 with JWKS (AZ-532) |
| Password hashing | Konscious.Security.Cryptography (Argon2id) | (cycle 2 add) | Replaces SHA-384 (AZ-536) |
| MFA | OtpNet (TOTP) + QRCoder (PNG) | (cycle 2 add) | TOTP + recovery codes (AZ-534) |
| Rate limiting | Microsoft.AspNetCore.RateLimiting | 10.0 | Per-IP sliding window (AZ-537) |
| Data protection | Microsoft.AspNetCore.DataProtection | 10.0 | Encrypt MFA secret at rest (AZ-534) |
| Validation | FluentValidation | 11.3.0 / 11.10.0 | Declarative request validation |
| Logging | Serilog | 4.1.0 | Structured logging (console + file) |
| API Docs | Swashbuckle (Swagger) | 10.1.4 | OpenAPI specification |
| Serialization | Newtonsoft.Json | 13.0.4 | JSON for DB field mapping and responses (bumped from 13.0.1 by audit D-1) |
| Container | Docker | .NET 10.0 images | Multi-stage build, ARM64 support |
| CI/CD | Woodpecker CI | — | Branch-based ARM64 builds |
| Registry | docker.azaion.com | — | Private container registry |
## 3. Deployment Model
**Environments**: Development (local), Production (Linux server)
**Infrastructure**:
- Self-hosted Linux server (evidenced by `env/` provisioning scripts for Debian/Ubuntu)
- Docker containerization with private registry (`docker.azaion.com`, `localhost:5000`)
- No orchestration (single container deployment via `deploy.cmd`)
**Environment-specific configuration**:
| Config | Development | Production |
|--------|-------------|------------|
| Database | Local PostgreSQL (port 4312) | Remote PostgreSQL (same custom port) |
| Secrets | Environment variables (`ASPNETCORE_*`) | Environment variables |
| Logging | Console + file | Console + rolling file (`logs/log.txt`) |
| Swagger | Enabled | Disabled |
| CORS | (same policy registered, allows `https://admin.azaion.com`) | `https://admin.azaion.com` only |
| HSTS | **Disabled** (Development bypass) | **Enabled** (1 y, includeSubDomains, preload) |
| HTTPS redirect | **Disabled** (Development bypass) | **Enabled** |
| ES256 keys | `JwtConfig.KeysFolder` — at least one PEM, `ActiveKid` selects | Same; persistent volume mandatory |
| DataProtection keys | Ephemeral OK (single-instance dev) | `DataProtection:KeysFolder` MUST be a persistent volume — otherwise MFA secrets are unrecoverable after restart |
## 4. Data Model Overview
**Core entities**:
| Entity | Description | Owned By Component |
|--------|-------------|--------------------|
| User | System user. Cycle 2 added `failed_login_count`, `lockout_until` (AZ-537) and `mfa_*` columns (AZ-534). `password_hash` is now Argon2id PHC; legacy SHA-384 base64 lazily upgraded on next login (AZ-536). | 01 Data Layer |
| Session *(AZ-531+535+533+534)* | One row per refresh token (interactive) or per mission token. Carries `family_id` (rotation chain), `revoked_at`/`revoked_reason`/`revoked_by_user_id`, `class` ∈ {`interactive`, `mission`}, `aircraft_id`, `mfa_authenticated`. | 01 Data Layer |
| AuditEvent *(AZ-537+534)* | Append-only `audit_events` row: login_failed/success/lockout, mfa_enroll/confirm/disable/login_success/login_failed/recovery_used. | 01 Data Layer |
| UserConfig | JSON-serialized per-user configuration (queue offsets). | 01 Data Layer |
| RoleEnum | Authorization role hierarchy. Cycle 2 added `Service = 60` for verifier identities (AZ-535). | 01 Data Layer |
| DetectionClass | Operator-managed catalogue. Unchanged in cycle 2. | 01 Data Layer |
| ExceptionEnum | Business error code catalog. Cycle 2 added codes 5061 for the auth/MFA/refresh/mission/lockout paths. | Common Helpers |
**Key relationships** (cycle 2 additions):
- User 1 — N Session (`sessions.user_id` FK, ON DELETE CASCADE)
- User 1 — N Session (`sessions.aircraft_id` FK for mission rows, ON DELETE SET NULL)
- User 1 — N Session (`sessions.revoked_by_user_id` FK, ON DELETE SET NULL)
- Session 1 — N Session (`parent_session_id` rotation chain)
**Data flow summary**:
- Client → API → UserService → PostgreSQL: user CRUD + Argon2id verify/hash + lazy migration
- Client → API → RefreshTokenService / SessionService / MfaService / MissionTokenService → PostgreSQL `sessions` + `users` + `audit_events`
- Verifier → API → SessionService → PostgreSQL `sessions` (revoked-since snapshot) + JwtSigningKeyProvider (JWKS)
- Client → API → ResourcesService → Filesystem: resource upload / list / clear
## 5. Integration Points
### Internal Communication
| From | To | Protocol | Pattern | Notes |
|------|----|----------|---------|-------|
| Admin API | User Management | Direct DI call | Request-Response | Scoped |
| Admin API | AuthService | Direct DI call | Request-Response | Scoped — also reads `IJwtSigningKeyProvider` (singleton) |
| Admin API | RefreshTokenService / SessionService / MfaService / MissionTokenService / AuditLog | Direct DI call | Request-Response | Scoped |
| Admin API | Resource Management | Direct DI call | Request-Response | Scoped |
| User Management | AuditLog | Direct DI call | Request-Response | Failed/success/lockout audit + sliding-window count |
| MfaService | IDataProtector | Direct DI call | Request-Response | Encrypt/decrypt mfa_secret |
| All services | Data Layer | Direct DI call | Request-Response | Singleton DbFactory |
### External Integrations
| External System | Protocol | Auth | Rate Limits | Failure Mode |
|----------------|----------|------|-------------|--------------|
| PostgreSQL | TCP (Npgsql) | Username/password | None configured | Exception propagation |
| Filesystem | OS I/O | OS-level permissions | None | Exception propagation |
## 6. Non-Functional Requirements
| Requirement | Target | Measurement | Priority |
|------------|--------|-------------|----------|
| Max upload size | 200 MB | Kestrel MaxRequestBodySize | High |
| Password hashing | Argon2id (parameters from `AuthConfig.PasswordHashing`) | Per-user, constant-time verify | High |
| Access token lifetime | `JwtConfig.AccessTokenLifetimeMinutes` (15 default) | Per token | High |
| Refresh token sliding lifetime | `SessionConfig.RefreshSlidingHours` | Per session row | High |
| Refresh token absolute lifetime | `SessionConfig.RefreshAbsoluteHours` | Per family | High |
| Mission token lifetime | `MissionSessionRequest.PlannedDurationH` (validation-bounded) | Per mission session | High |
| Per-IP login rate | `AuthConfig.RateLimit.PerIpPermitLimit` per `PerIpWindowSeconds` | Sliding window | High |
| Per-account login rate | `AuthConfig.RateLimit.PerAccountFailedThreshold` per `PerAccountWindowSeconds` | DB sliding window via `audit_events` | High |
| Account lockout | `AuthConfig.Lockout.ConsecutiveFailureThreshold` failures → `LockoutSeconds` lockout | DB-backed | High |
| HSTS | 1 y, includeSubDomains, preload (non-Development) | HTTP header | High |
| HTTPS redirect | Enabled (non-Development) | Middleware | High |
| Cache TTL | 4 hours | User entity cache | Low |
No explicit availability, latency, throughput, or recovery targets found in the codebase.
## 7. Security Architecture
**Authentication**:
- ES256 (ECDSA P-256) JWT bearer tokens (AZ-532). `ValidAlgorithms` pinned to `ES256` to prevent the HS256-with-public-key forgery class.
- Opaque refresh tokens with server-side rotation + reuse detection (AZ-531). Stored as SHA-256 hashes; never re-presented.
- TOTP MFA + recovery codes (AZ-534). Step-1 token is itself an ES256 JWT with a separate audience.
- Mission tokens (AZ-533) — long-lived, no refresh, bound to `aircraft_id`, auto-revoked on aircraft reconnect.
**Authorization**: Role-based (RBAC) via ASP.NET Core authorization policies:
- `apiAdminPolicy` — requires `ApiAdmin`
- `revocationReaderPolicy` — requires `Service` OR `ApiAdmin` (verifier fleet)
- General `[Authorize]` — any authenticated user
**Data protection**:
- **At rest**: `mfa_secret` is encrypted via `IDataProtector` (purpose `Azaion.Mfa.Secret`). MFA recovery codes are individually Argon2id-hashed and single-use. Passwords are Argon2id PHC strings. ES256 PEM keys live in `JwtConfig.KeysFolder` — protect via filesystem permissions.
- **In transit**: HSTS + HTTPS redirection in non-Development environments (AZ-538). CORS narrowed to `https://admin.azaion.com` only.
- **Token revocation propagation**: `GET /sessions/revoked` provides a verifier-poll snapshot; verifiers are responsible for honoring it within their poll cadence (currently ~30s recommended).
- **Secrets management**: Environment variables (`ASPNETCORE_*` prefix).
**Audit logging**: `audit_events` table records login_success/failed/lockout and mfa_enroll/confirm/disable/login_success/login_failed/recovery_used events with normalised email + caller IP. Drives the per-account rate limit and provides forensic evidence. Serilog continues to log business exceptions (WARN) and general events (INFO).
## 8. Key Architectural Decisions
### ADR-001: Minimal API over Controllers
**Context**: API has ~17 endpoints with simple request/response patterns.
**Decision**: Use ASP.NET Core Minimal API with top-level statements instead of MVC controllers.
**Consequences**: All endpoints in a single `Program.cs`. Simple for small APIs but could become unwieldy as endpoints grow.
### ADR-002: Read/Write Database Connection Separation
**Context**: Needed different privilege levels for read vs. write operations.
**Decision**: `DbFactory` maintains two connection strings — a read-only one (`AzaionDb`) and an admin one (`AzaionDbAdmin`) — with separate `Run` and `RunAdmin` methods.
**Consequences**: Write operations are explicitly gated through `RunAdmin`. Prevents accidental writes through the reader connection. Requires maintaining two DB users with different privileges.
### ADR-003: Per-User Resource Encryption — RETIRED (cycle 2, 2026-05-14)
**Original context**: Resources (DLLs, AI models) had to be delivered only to authorized users via a per-download AES-256-CBC stream keyed off the user's email + password.
**Retirement decision**: With the OTA delivery flow (AZ-183) and the hardware-binding flow (AZ-197) both gone, the only remaining consumer of the encrypted-download path was a now-vestigial `POST /resources/get/{dataFolder?}` endpoint and the two installer endpoints. None of them are part of the target architecture (browser SaaS + fTPM Jetsons), so the entire encrypt-on-download stack — `POST /resources/get`, `GET /resources/get-installer`, `GET /resources/get-installer/stage`, `ResourcesService.GetEncryptedResource`, `ResourcesService.GetInstaller`, `Security.GetApiEncryptionKey`, `Security.EncryptTo`, `Security.DecryptTo`, `GetResourceRequest`, `WrongResourceName` (50), `ResourcesConfig.SuiteInstallerFolder` / `SuiteStageInstallerFolder` — was removed. `Security.ToHash` is retained because it still backs SHA-384 password hashing in `UserService`.
**Consequences**: resource files now live on disk as plain bytes; any future at-rest encryption must come from filesystem or storage-layer features (LUKS, object-store SSE), not from application code.
### ADR-004: Hardware Fingerprint Binding — RETIRED (AZ-197)
**Original context**: Resources should only be usable on a specific physical machine.
**Original decision**: On first resource access, the user's hardware fingerprint string was stored. Subsequent accesses compared the hash of the provided hardware against the stored value.
**Retirement decision (2026-05-13, AZ-197)**: The threat model that motivated this binding (credential reuse across machines via desktop installers) no longer applies:
- **Edge devices** ship as **fTPM-secured Jetsons** (secure boot, fTPM-protected key storage, no user filesystem access, no installer redistribution). Hardware identity is anchored in the fTPM, not in a SHA-384 of CPU/GPU/Memory/DriveSerial strings.
- **Server / desktop access** is **SaaS-only** (browser → admin API). There is no installer to copy and no hardware fingerprint to take.
The binding's only remaining effect was a real production failure mode (`HardwareIdMismatch`, error code 40) on legitimate hardware events. AZ-197 removed `CheckHardwareHash`, `UpdateHardware`, `Security.GetHWHash`, the `PUT /users/hardware/set` and `POST /resources/check` endpoints, and the `Hardware` field from `GetResourceRequest`. The `User.Hardware` DB column is a nullable tombstone (no migration in AZ-197; separate ticket if/when the column is dropped).
### ADR-005: linq2db over Entity Framework
**Context**: Needed a lightweight ORM for PostgreSQL.
**Decision**: Use linq2db instead of Entity Framework Core.
**Consequences**: No migration framework — schema managed via SQL scripts (`env/db/`). Lighter runtime footprint. Manual mapping configuration in `AzaionDbSchemaHolder`.
### ADR-006: Asymmetric ES256 JWT signing with file-system key store + JWKS *(cycle 2 — AZ-532)*
**Context**: Cycle-1 JWT signing was symmetric HS256 with the secret in environment configuration. The verifier fleet (satellite-provider, gps-denied, ui) needed to validate tokens without sharing the signing secret with every service. Sharing the HS256 secret would have made any verifier compromise also a token-forgery primitive.
**Decision**: Switch to ES256 (ECDSA P-256). The Admin API holds the private key; verifiers fetch the public key set from `GET /.well-known/jwks.json`. Keys live as one PEM per kid in `JwtConfig.KeysFolder`. `JwtConfig.ActiveKid` selects the signer; ALL discovered keys are exposed in JWKS so existing tokens stay verifiable across rotations.
**Alternatives rejected**:
- **Continue HS256 + share secret**: rejected — secret-distribution + verifier-compromise blast radius.
- **RS256**: equivalent security, larger keys, no operational benefit at our scale.
- **External KMS / HSM**: deferred — adds operational complexity (KMS auth, latency on every signing op) without near-term benefit. The PEM-on-disk approach is reversible to KMS later.
**Consequences**:
- JwtBearer `ValidAlgorithms = [ES256]` is mandatory — without it, a token forged with `alg=HS256` using the public key as the HMAC secret would validate.
- The PEM directory MUST be a persistent volume.
- Key rotation is "drop a new PEM, set `ActiveKid`, restart" — the old kid keeps verifying tokens until physically removed.
- Verifiers MUST cache the JWKS for at most 1 hour to pick up new kids quickly.
### ADR-007: Refresh tokens as opaque rotating server-side rows (not JWT) *(cycle 2 — AZ-531)*
**Context**: The dual-token model needs a refresh token. The two viable shapes are (a) signed self-describing JWT or (b) opaque server-stored value. Refresh tokens are long-lived; their threat model centres on theft + replay.
**Decision**: Opaque random `Base64Url(32 bytes)` stored on the server as a SHA-256 hash. Each rotation marks the previous row as `revoked_reason='rotated'` and inserts a new row in the same `family_id`. Presenting an already-rotated token revokes the entire family with `reason='reuse_detected'`.
**Alternatives rejected**:
- **JWT refresh token**: server cannot revoke without a denylist (which negates the "stateless" advantage). No reuse-detection without ALSO server state.
- **Sliding session ID alone (no rotation)**: theft is permanent until manual revocation.
**Consequences**:
- Every refresh hits Postgres (one indexed lookup + one update + one insert in a transaction). Acceptable at current load; if it becomes a bottleneck, the `sessions_refresh_hash_idx` UNIQUE INDEX is the obvious caching boundary.
- Refresh-token theft is detectable on the next legitimate refresh.
- The session row is also the `sid` claim in the access token — the same row drives logout (F12), JWKS-independent revocation snapshots (F15), and AMR persistence across rotations (`mfa_authenticated`).
### ADR-008: TOTP MFA secrets encrypted via `IDataProtector` *(cycle 2 — AZ-534)*
**Context**: MFA secrets are TOTP shared secrets — possession of the database alone (DBA access, backup leak) must NOT yield the ability to mint TOTP codes for users.
**Decision**: Encrypt `mfa_secret` with ASP.NET `IDataProtector` (purpose string `Azaion.Mfa.Secret`) before persisting. The DataProtection key store is configured via `DataProtection:KeysFolder` and MUST be a persistent volume in production. Recovery codes are individually Argon2id-hashed and stored as a `jsonb` array; single-use is enforced by setting `used_at` transactionally with the rest of the login.
**Alternatives rejected**:
- **Plaintext**: explicit DB-leak escalation path.
- **Application-managed AES via env-var key**: re-introduces the very key-distribution problem ADR-006 solved for JWT signing.
- **External KMS for MFA secrets**: deferred for the same reason as ADR-006.
**Consequences**:
- Loss of the DataProtection key folder = users must re-enroll MFA (no recovery path). This MUST be backed up alongside DB backups.
- DBA-only access does not yield MFA bypass.
### ADR-009: Per-account lockout + DB-backed sliding-window rate limit alongside per-IP token bucket *(cycle 2 — AZ-537)*
**Context**: ASP.NET `RateLimiter` is per-process and per-IP. CMMC AC.L2-3.1.8 requires per-account lockout that survives process restarts. Per-IP alone is insufficient (NAT'd attacker farm; bot rotates IPs). Per-account-only is insufficient (single IP can DoS many accounts at "just below threshold").
**Decision**: Both layers, both required to pass:
1. Per-IP — ASP.NET `RateLimiter` middleware with `SlidingWindowRateLimiter` on `/login` and `/login/mfa`. In-memory; resets on restart but recovers within seconds.
2. Per-account — DB-backed sliding window via `audit_events` (count `login_failed` rows for the email within `PerAccountWindowSeconds`).
3. Lockout — `users.failed_login_count` + `users.lockout_until`. After `ConsecutiveFailureThreshold` failures, `lockout_until = now + LockoutSeconds`. Subsequent logins throw `AccountLocked` with `RetryAfterSeconds` until the window passes.
**Alternatives rejected**:
- **Redis token bucket per account**: avoids DB load but adds a new infra dependency for a low-write workload. The DB sliding window has acceptable cost (`audit_events_event_type_email_idx`).
- **Single combined rule**: harder to tune.
**Consequences**:
- `audit_events` will grow large (~14 GB/yr at projected fleet scale); operational follow-up to time-partition.
- The `Retry-After` header is set both by the per-IP middleware (lease metadata) and by the `BusinessExceptionHandler` (from `BusinessException.RetryAfterSeconds`), so clients see consistent backoff hints regardless of which layer rejected.
- All gating events go through `audit_events`, providing a single auditable history.