Batch 5 (cycle 2 hotfix sprint, batch 1 of 2). 6 story points under epic AZ-530. Addresses 2 Critical + 2 High deploy-blocking findings from security_report_cycle2.md (F-INFRA-1..F-INFRA-4).

AZ-552: drop_jwt_secret_deploy_preflight (1 pt, F-INFRA-1 Critical)
scripts/start-services.sh swaps obsolete JwtConfig__Secret preflight for the cycle-2 trio (KeysFolder + ActiveKid + DataProtection.KeysFolder). .env.example, env/api/env.ps1, _docs/04_deploy/* updated to match. Repo scan in scripts/ and .env.example returns 0 offenders.

AZ-553: bind_mount_es256_keys (2 pts, F-INFRA-2 Critical)
start-services.sh bind-mounts DEPLOY_HOST_JWT_KEYS_DIR read-only at /etc/azaion/jwt-keys; preflight fails fast on a missing or empty host directory with operator-actionable error messages.

AZ-554: persist_dataprotection_keys (2 pts, F-INFRA-3 High)
Program.cs DataProtection wiring now fails fast in Production when KeysFolder is unset OR not probe-writable. start-services.sh bind-mounts DEPLOY_HOST_DP_KEYS_DIR read-write at /var/lib/azaion/dp-keys. Development behaviour unchanged (ephemeral default).

AZ-555: secrets_readme_es256_rewrite (1 pt, F-INFRA-4 High)
secrets/README.md schema fully rewritten; new "Host-side directories" subsection with bind-mount table + ownership/permission guidance. Cycle-1 JwtConfig__Secret removed from live schema (one prose deprecation paragraph retained).

Adjacent hygiene
module-layout.md "Owns" extended to include scripts/, secrets/, env/, .env.example (gap from Step 9 new-task layout-delta).

Tests
e2e/Azaion.E2E/Tests/Cycle2HotfixDeployTests.cs: 19 facts (8 exec, 11 Skip with rationale per AZ-537/AZ-538 precedent). Skipped tests cover preflight/restart/Production-only paths verified at deploy gate.

Build: 0W 0E across Azaion.AdminApi + Azaion.E2E. Test run deferred to autodev Step 11 (Run Tests). Tracker transition deferred to next batch (MCP availability unverified in this session; Leftovers pattern).

Co-authored-by: Cursor <cursoragent@cursor.com>
Persist DataProtection Keys Folder + Fail-Fast In Production
Task: AZ-554_persist_dataprotection_keys
Name: Persist DataProtection keys folder + fail-fast in Production
Description: DataProtection (which encrypts MFA secrets, recovery codes, and any other protected payload) currently writes its master keys to an ephemeral container path. Every container restart rotates the master key, which permanently locks every MFA-enrolled user out of their account. Persist the key folder onto the host, document the env var, and fail-fast in Production if the folder is unconfigured.
Complexity: 2 points
Dependencies: AZ-553 (host-side volume pattern + runbook section established)
Component: Admin API + Deploy / scripts
Tracker: AZ-554
Epic: AZ-530
CMMC ref: SC.L2-3.13.10 (key management), IA.L2-3.5.7 (passwords, secrets storage)
Source: _docs/05_security/security_report_cycle2.md F-INFRA-3 (High); _docs/05_security/infrastructure_review_cycle2.md §F-2026Q2-INFRA-3
Problem
Program.cs configures services.AddDataProtection() without specifying a persistent key folder. ASP.NET Core defaults the key ring to an OS-specific path that, inside a container, lives on the writable layer and vanishes on every restart. AZ-534 uses DataProtection to encrypt the per-user TOTP MfaSecret at rest; AZ-534 also encrypts recovery codes. When the master key rotates on restart:
- Existing MfaSecret ciphertexts can no longer be decrypted → no user can verify TOTP at login.
- Existing recovery-code hashes (if also DataProtection-wrapped) become unusable.
The net effect on the next docker restart is a hard lockout of every MFA-enrolled user. No data is corrupted on disk — but recovery requires either operator intervention or a re-enrolment campaign.
Outcome
- DataProtection master keys persist across container restarts in Production.
- In Production, the app refuses to start if DataProtection.KeysFolder is unset (no silent fallback to the ephemeral path).
- Development environment continues to work with the ephemeral default (no behavioural change for local devs).
- .env.example and the deploy runbook document the new host-side env var.
Scope
Included
- Program.cs: bind DataProtection.KeysFolder from configuration, call PersistKeysToFileSystem(...) when set, and add a Production-only fail-fast in the AppEnv.IsProduction() branch if the folder is unset, missing, or not writable.
- appsettings.json: add a DataProtection section with documented keys (KeysFolder).
- scripts/start-services.sh: bind-mount $DEPLOY_HOST_DP_KEYS_DIR onto the container at /var/lib/azaion/dp-keys (read-write, because DataProtection must rotate keys on its own schedule).
- secrets/<env>.public.env: set ASPNETCORE_DataProtection__KeysFolder=/var/lib/azaion/dp-keys in production/staging templates.
- .env.example: document DEPLOY_HOST_DP_KEYS_DIR.
- Extend the deploy runbook section authored by AZ-553 to cover the DataProtection mount alongside the JWT mount (same host-side layout, same ownership/perms guidance).
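The start-services.sh item above could be sketched roughly as follows. This is a minimal illustration, not the shipped script: the function name dp_keys_mount_arg and the error wording are hypothetical, and only DEPLOY_HOST_DP_KEYS_DIR and the /var/lib/azaion/dp-keys target come from this task. Writability is deliberately not tested on the host, since the deploy user and the container user may differ; that check belongs to the app's own startup probe.

```shell
# Hypothetical helper for scripts/start-services.sh (name and messages are illustrative).
# $1 = value of DEPLOY_HOST_DP_KEYS_DIR (may be empty).
# Prints the docker mount argument on stdout, or an operator-facing error on stderr.
dp_keys_mount_arg() {
    host_dir="${1:-}"
    target="/var/lib/azaion/dp-keys"   # mount target named in this task

    if [ -z "$host_dir" ]; then
        echo "ERROR: DEPLOY_HOST_DP_KEYS_DIR is not set; point it at the host directory that holds the DataProtection key ring." >&2
        return 1
    fi
    if [ ! -d "$host_dir" ]; then
        echo "ERROR: DEPLOY_HOST_DP_KEYS_DIR=$host_dir does not exist or is not a directory." >&2
        return 1
    fi

    # No ':ro' suffix: the mount stays read-write so DataProtection can add new keys.
    printf '%s\n' "--volume=$host_dir:$target"
}

# Usage (image name and remaining flags omitted here):
#   mount_arg="$(dp_keys_mount_arg "${DEPLOY_HOST_DP_KEYS_DIR:-}")" || exit 1
#   docker run "$mount_arg" ...
```

Keeping the argument in one quoted variable avoids word-splitting problems if the host path contains spaces.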
Excluded
- Encrypting the DataProtection keys at rest with a hardware secret (HSM / KMS-wrapped). Larger scope; would belong to a separate hardening epic.
- Cross-instance key sharing for a horizontally-scaled admin deployment. Currently single-instance per environment.
- Reading the AZ-534 / AZ-NEW-12 user-cache invalidation concern — out of scope for this ticket.
- secrets/README.md rewrite: covered by AZ-555.
Acceptance Criteria
AC-1: MFA survives container restart in Production
Given a Production deploy with DEPLOY_HOST_DP_KEYS_DIR mounted
And a user has enrolled in TOTP MFA before the restart
When the admin container is stopped and started again
Then the user can complete a fresh /login + /login/mfa cycle using their existing TOTP authenticator (no recovery code, no re-enrolment).
AC-2: Production fails fast when KeysFolder is unset
Given ASPNETCORE_ENVIRONMENT=Production and ASPNETCORE_DataProtection__KeysFolder is unset
When the admin process starts
Then the process exits non-zero with a startup-log entry that names DataProtection.KeysFolder as the missing/invalid configuration.
AC-3: Production fails fast when KeysFolder is not writable
Given ASPNETCORE_ENVIRONMENT=Production and KeysFolder points at a path that is not writable by the container user
When the admin process starts
Then the process exits non-zero with a startup-log entry naming the path and the missing permission.
AC-4: Development unchanged
Given ASPNETCORE_ENVIRONMENT=Development and KeysFolder is unset
When the admin process starts
Then the process starts normally (uses the ephemeral default) and no fail-fast is triggered.
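The decision that AC-2 through AC-4 describe can be restated as a small shell function. The real check lives in Program.cs (C#); this is only a shell rendering of the same logic, which might be handy as a container smoke check. The function name check_keys_folder and the log wording are hypothetical.

```shell
# Shell rendering of the startup decision; NOT the Program.cs implementation.
# $1 = environment name, $2 = KeysFolder value (may be empty).
# Returns 0 when startup may proceed, 1 when it must abort.
check_keys_folder() {
    env_name="$1"
    folder="${2:-}"

    # AC-4: outside Production there is no fail-fast at all.
    if [ "$env_name" != "Production" ]; then
        return 0
    fi

    # AC-2: the folder must be configured.
    if [ -z "$folder" ]; then
        echo "FATAL: DataProtection.KeysFolder is not configured" >&2
        return 1
    fi

    # Scope item: the folder must exist.
    if [ ! -d "$folder" ]; then
        echo "FATAL: DataProtection.KeysFolder '$folder' does not exist" >&2
        return 1
    fi

    # AC-3: probe write. The subshell keeps a failed redirection from
    # terminating the calling shell.
    probe="$folder/.probe"
    if ! ( : > "$probe" ) 2>/dev/null; then
        echo "FATAL: DataProtection.KeysFolder '$folder' is not writable (write permission missing)" >&2
        return 1
    fi
    rm -f "$probe"
    return 0
}
```

For example, `check_keys_folder Production ""` returns 1 and names the missing setting, while `check_keys_folder Development ""` returns 0 without touching the filesystem.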
AC-5: Mount is read-write
Given the admin container is running with the new bind-mount
When the DataProtection key ring rotates (test by writing a probe file /var/lib/azaion/dp-keys/.probe)
Then the write succeeds.
Non-Functional Requirements
Reliability
- Container restart MUST NOT invalidate already-issued MFA secrets or DataProtection-wrapped ciphertexts.
Security
- Mount must be writable by the container user but not world-readable on the host (chmod 0700 host-side, container user owns).
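One way an operator might provision the host directory to meet this requirement is sketched below. The function name provision_dp_keys_dir is hypothetical, and the container user's uid:gid is not stated in this task, so it is passed in as an optional argument rather than assumed.

```shell
# Hypothetical provisioning helper; run once on the deploy host.
# $1 = host directory (e.g. the value of DEPLOY_HOST_DP_KEYS_DIR)
# $2 = optional owner as uid:gid of the container user (not specified in this task)
provision_dp_keys_dir() {
    dir="$1"
    owner="${2:-}"

    mkdir -p "$dir" || return 1

    # 0700: only the owner may read, write, or traverse the directory.
    chmod 0700 "$dir" || return 1

    if [ -n "$owner" ]; then
        # Changing ownership to another user normally requires root.
        chown "$owner" "$dir" || return 1
    fi
}

# Usage (1000:1000 is a placeholder, substitute the real container uid:gid):
#   provision_dp_keys_dir "$DEPLOY_HOST_DP_KEYS_DIR" 1000:1000
```

Setting the mode before the ownership change means the directory is never left world-readable, even if the chown step fails.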
Blackbox Tests
| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|---|---|---|---|---|
| AC-1 | Prod env, mount configured, user MFA-enrolled, restart container | Login + MFA verify after restart | Same TOTP secret still works | Reliability |
| AC-2 | Prod env, KeysFolder unset | Start admin process | Exit non-zero, log names DataProtection.KeysFolder | - |
| AC-3 | Prod env, KeysFolder read-only path | Start admin process | Exit non-zero, log names path + permission | - |
| AC-4 | Dev env, KeysFolder unset | Start admin process | Process starts, ephemeral default used | - |
| AC-5 | Container running, mount RW | Probe write inside mount | Write succeeds | Security |
Constraints
- Persist via PersistKeysToFileSystem on the configured folder; do not introduce a database-backed or third-party key store in this ticket.
- Fail-fast must be Production-only, because Development workflows depend on the ephemeral default.
Risks & Mitigation
Risk 1: Existing prod users locked out at first restart after deploy
- Risk: Restarts after this fix ships preserve the key ring, but any MFA enrolments made on the cycle-2 build BEFORE the fix are encrypted with an already-lost master key. Those users remain locked out.
- Mitigation: Cycle 2 has not been deployed to Production yet (the security audit FAILed before deploy). No real users are affected. Document this lifecycle clearly in the runbook so future hotfix sequencing avoids the same trap.
Risk 2: Host-side directory permissions wrong
- Risk: If the operator creates $DEPLOY_HOST_DP_KEYS_DIR as root:root 700, the container user cannot write.
- Mitigation: AC-3 fail-fast catches this immediately on startup. Runbook includes the explicit ownership/perms command.
Risk 3: Drift between appsettings.json default and the runtime mount target
- Risk: Default in appsettings.json says one path; deploy script mounts another; container fails fast.
- Mitigation: AC-5 indirectly covers this via the probe-write step; runbook section explicitly states the mount target == config value.
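A deploy-time guard against this drift could look like the sketch below. It assumes the effective value is the ASPNETCORE_DataProtection__KeysFolder line in a plain KEY=VALUE env file such as secrets/<env>.public.env; the function name is hypothetical, and quoted values or an appsettings.json default are not handled.

```shell
# Hypothetical drift check; function name and file handling are assumptions.
# $1 = env file with KEY=VALUE lines, $2 = mount target used by the deploy script.
check_keys_folder_matches_mount() {
    env_file="$1"
    mount_target="$2"
    key="ASPNETCORE_DataProtection__KeysFolder"

    # Take the last assignment of the key and drop the "KEY=" prefix.
    # If the key is absent, 'configured' stays empty and the comparison fails.
    configured="$(grep "^${key}=" "$env_file" | tail -n 1 | cut -d= -f2-)"

    if [ "$configured" != "$mount_target" ]; then
        echo "ERROR: $key='$configured' but the deploy script mounts '$mount_target'" >&2
        return 1
    fi
}

# Usage (the env file path is whichever template the deploy actually reads):
#   check_keys_folder_matches_mount "$ENV_FILE" /var/lib/azaion/dp-keys || exit 1
```

Running this before the container starts reports the mismatch by name instead of leaving the operator to infer it from a startup failure.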