# Persist DataProtection Keys Folder + Fail-Fast In Production

**Task**: AZ-554_persist_dataprotection_keys
**Name**: Persist DataProtection keys folder + fail-fast in Production
**Description**: DataProtection (which encrypts MFA secrets, recovery codes, and any other protected payload) currently writes its master keys to an ephemeral container path. Every container restart rotates the master key, which permanently locks every MFA-enrolled user out of their account. Persist the key folder onto the host, document the env var, and fail-fast in Production if the folder is unconfigured.
**Complexity**: 2 points
**Dependencies**: AZ-553 (host-side volume pattern + runbook section established)
**Component**: Admin API + Deploy / scripts
**Tracker**: AZ-554
**Epic**: AZ-530
**CMMC ref**: SC.L2-3.13.10 (key management), IA.L2-3.5.7 (passwords, secrets storage)
**Source**: `_docs/05_security/security_report_cycle2.md` F-INFRA-3 (High); `_docs/05_security/infrastructure_review_cycle2.md` §F-2026Q2-INFRA-3

## Problem

`Program.cs` configures `services.AddDataProtection()` without specifying a persistent key folder. ASP.NET Core defaults the key ring to an OS-specific path that, inside a container, lives on the writable layer and vanishes on every restart. AZ-534 uses DataProtection to encrypt the per-user TOTP `MfaSecret` at rest; AZ-534 also encrypts recovery codes.

When the master key rotates on restart:

- Existing `MfaSecret` ciphertexts can no longer be decrypted → no user can verify TOTP at login.
- Existing recovery-code hashes (if also DataProtection-wrapped) become unusable.

The net effect on the next `docker restart` is a hard lockout of every MFA-enrolled user. No data is corrupted on disk — but recovery requires either operator intervention or a re-enrolment campaign.

## Outcome

- DataProtection master keys persist across container restarts in Production.
- In Production, the app refuses to start if `DataProtection.KeysFolder` is unset (no silent fallback to the ephemeral path).
- Development environment continues to work with the ephemeral default (no behavioural change for local devs).
- `.env.example` and the deploy runbook document the new host-side env var.

## Scope

### Included

- `Program.cs`: bind `DataProtection.KeysFolder` from configuration, call `PersistKeysToFileSystem(...)` when set, and add a Production-only fail-fast in the `AppEnv.IsProduction()` branch if the folder is unset, missing, or not writable.
- `appsettings.json`: add a `DataProtection` section with documented keys (`KeysFolder`).
- `scripts/start-services.sh`: bind-mount `$DEPLOY_HOST_DP_KEYS_DIR` onto the container at `/var/lib/azaion/dp-keys` (read-write — DataProtection must rotate keys on its own schedule).
- `secrets/.public.env`: set `ASPNETCORE_DataProtection__KeysFolder=/var/lib/azaion/dp-keys` in production/staging templates.
- `.env.example`: document `DEPLOY_HOST_DP_KEYS_DIR`.
- Extend the deploy runbook section authored by AZ-553 to cover the DataProtection mount alongside the JWT mount (same host-side layout, same ownership/perms guidance).

### Excluded

- Encrypting the DataProtection keys at rest with a hardware secret (HSM / KMS-wrapped). Larger scope; would belong to a separate hardening epic.
- Cross-instance key sharing for a horizontally-scaled admin deployment. Currently single-instance per environment.
- Reading the AZ-534 / AZ-NEW-12 user-cache invalidation concern — out of scope for this ticket.
- `secrets/README.md` rewrite — AZ-555.

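
The bind-mount item for `scripts/start-services.sh` could look roughly like the sketch below. It is not taken from the real script, whose layout this ticket does not show: `dp_keys_mount_spec` is a hypothetical helper name, and the only values carried over from the ticket are `DEPLOY_HOST_DP_KEYS_DIR` and the container path `/var/lib/azaion/dp-keys`.

```shell
#!/bin/sh
# Sketch only: dp_keys_mount_spec is a hypothetical helper, not taken from
# the real scripts/start-services.sh.

# Container-side mount target named in this ticket.
DP_KEYS_CONTAINER_DIR="/var/lib/azaion/dp-keys"

# Print the docker bind-mount spec for the DataProtection key folder.
# Returns non-zero if DEPLOY_HOST_DP_KEYS_DIR is unset or not absolute.
dp_keys_mount_spec() {
  host_dir="${DEPLOY_HOST_DP_KEYS_DIR:-}"
  if [ -z "$host_dir" ]; then
    echo "DEPLOY_HOST_DP_KEYS_DIR is not set" >&2
    return 1
  fi
  case "$host_dir" in
    /*) ;;
    *)
      echo "DEPLOY_HOST_DP_KEYS_DIR must be an absolute path: $host_dir" >&2
      return 1
      ;;
  esac
  # :rw is explicit because DataProtection writes new keys when it rotates.
  printf '%s:%s:rw\n' "$host_dir" "$DP_KEYS_CONTAINER_DIR"
}
```

A caller would capture the result first, for example `spec="$(dp_keys_mount_spec)" || exit 1`, and then pass `--volume "$spec"` to `docker run`. Capturing it separately keeps a missing variable from being silently swallowed inside the `docker run` argument list.
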
## Acceptance Criteria

**AC-1: MFA survives container restart in Production**
Given a Production deploy with `DEPLOY_HOST_DP_KEYS_DIR` mounted
And a user has enrolled in TOTP MFA before the restart
When the admin container is stopped and started again
Then the user can complete a fresh `/login` + `/login/mfa` cycle using their existing TOTP authenticator (no recovery code, no re-enrolment).

**AC-2: Production fails fast when `KeysFolder` is unset**
Given `ASPNETCORE_ENVIRONMENT=Production` and `ASPNETCORE_DataProtection__KeysFolder` is unset
When the admin process starts
Then the process exits non-zero with a startup-log entry that names `DataProtection.KeysFolder` as the missing/invalid configuration.

**AC-3: Production fails fast when `KeysFolder` is not writable**
Given `ASPNETCORE_ENVIRONMENT=Production` and `KeysFolder` points at a path that is not writable by the container user
When the admin process starts
Then the process exits non-zero with a startup-log entry naming the path and the missing permission.

**AC-4: Development unchanged**
Given `ASPNETCORE_ENVIRONMENT=Development` and `KeysFolder` is unset
When the admin process starts
Then the process starts normally (uses the ephemeral default) and no fail-fast is triggered.

**AC-5: Mount is read-write**
Given the admin container is running with the new bind-mount
When the DataProtection key ring rotates (test by writing a probe file `/var/lib/azaion/dp-keys/.probe`)
Then the write succeeds.

## Non-Functional Requirements

**Reliability**

- Container restart MUST NOT invalidate already-issued MFA secrets or DataProtection-wrapped ciphertexts.

**Security**

- Mount must be writable by the container user but not world-readable on the host (`chmod 0700` host-side, container user owns).

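
The probe write described in AC-5, and the missing/not-writable distinction in AC-3, can be exercised from a shell with something like the following. This is a sketch under assumptions: `probe_keys_dir` is a hypothetical helper, and it only mirrors the check from the outside; the fail-fast itself is specified to live in `Program.cs`.

```shell
#!/bin/sh
# Sketch only: probe_keys_dir is a hypothetical helper for manual or CI
# checks. It does not replace the startup check in Program.cs.

# Try to create and remove a .probe file in the given directory.
# Returns 1 if the directory is missing, 2 if it is not writable.
probe_keys_dir() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "keys folder missing: $dir" >&2
    return 1
  fi
  probe="$dir/.probe"
  # The subshell contains the redirection failure so the caller keeps running.
  if ! ( : > "$probe" ) 2>/dev/null; then
    echo "keys folder not writable by $(id -un): $dir" >&2
    return 2
  fi
  rm -f "$probe"
}
```

To match AC-5, the function would need to run inside the admin container as the container user, with `/var/lib/azaion/dp-keys` as its argument. Running it on the host only proves that the host user can write.
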
## Blackbox Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|------------------------|-------------|-------------------|----------------|
| AC-1 | Prod env, mount configured, user MFA-enrolled, restart container | Login + MFA verify after restart | Same TOTP secret still works | Reliability |
| AC-2 | Prod env, `KeysFolder` unset | Start admin process | Exit non-zero, log names `DataProtection.KeysFolder` | — |
| AC-3 | Prod env, `KeysFolder` read-only path | Start admin process | Exit non-zero, log names path + permission | — |
| AC-4 | Dev env, `KeysFolder` unset | Start admin process | Process starts, ephemeral default used | — |
| AC-5 | Container running, mount RW | Probe write inside mount | Write succeeds | Security |

## Constraints

- Persist via `PersistKeysToFileSystem` on the configured folder; do not introduce a database-backed or third-party key store in this ticket.
- Fail-fast must be Production-only — Development workflows depend on the ephemeral default.

## Risks & Mitigation

**Risk 1: Existing prod users locked out at first restart after deploy**

- *Risk*: The first container restart AFTER this fix ships is fine going forward, but any MFA enrolments done on the cycle-2 build BEFORE this fix are encrypted with an already-lost master key. Those users are still locked out.
- *Mitigation*: Cycle 2 has not been deployed to Production yet (the security audit FAILed before deploy). No real users are affected. Document this lifecycle clearly in the runbook so future hotfix sequencing avoids the same trap.

**Risk 2: Host-side directory permissions wrong**

- *Risk*: If the operator creates `$DEPLOY_HOST_DP_KEYS_DIR` as `root:root 700`, the container user cannot write.
- *Mitigation*: AC-3 fail-fast catches this immediately on startup. Runbook includes the explicit ownership/perms command.

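
The ownership/perms step that Risk 2 assigns to the runbook might be sketched as below. `prepare_dp_keys_dir` is a hypothetical helper, and the owner argument is an assumption: it has to match the UID:GID the admin container runs as, which this ticket does not state.

```shell
#!/bin/sh
# Sketch only: prepare_dp_keys_dir is a hypothetical helper. The owner
# argument must match the UID:GID the admin container runs as, which this
# ticket does not state.

# Create the host-side key directory with mode 0700 and, optionally,
# hand ownership to the given UID:GID.
prepare_dp_keys_dir() {
  dir="$1"
  owner="${2:-}"
  if [ -z "$dir" ]; then
    echo "usage: prepare_dp_keys_dir DIR [UID:GID]" >&2
    return 1
  fi
  mkdir -p "$dir" || return 1
  # 0700: only the owning user may read, write, or list the key files.
  chmod 0700 "$dir" || return 1
  if [ -n "$owner" ]; then
    # Changing ownership normally requires root on the host.
    chown "$owner" "$dir" || return 1
  fi
}
```

An operator would call it as `prepare_dp_keys_dir "$DEPLOY_HOST_DP_KEYS_DIR" UID:GID`, substituting the container user's numeric IDs. Omitting the owner leaves the directory owned by whoever ran the command, which is exactly the `root:root 700` failure mode if that user is root.
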
**Risk 3: Drift between `appsettings.json` default and the runtime mount target**

- *Risk*: The default in `appsettings.json` names one path, the deploy script mounts another, and the container fails fast on startup.
- *Mitigation*: AC-5 indirectly covers this via the probe-write step; the runbook section explicitly states that the mount target must equal the configured value.