admin/_docs/02_tasks/done/AZ-554_persist_dataprotection_keys.md
Oleksandr Bezdieniezhnykh f369153149 [AZ-552] [AZ-553] [AZ-554] [AZ-555] Cycle-2 hotfix: deploy/infra chain
Batch 5 (cycle 2 hotfix sprint, batch 1 of 2). 6 story points under epic
AZ-530. Addresses 2 Critical + 2 High deploy-blocking findings from
security_report_cycle2.md (F-INFRA-1..F-INFRA-4).

AZ-552 — drop_jwt_secret_deploy_preflight (1 pt, F-INFRA-1 Critical)
  scripts/start-services.sh swaps obsolete JwtConfig__Secret preflight
  for the cycle-2 trio (KeysFolder + ActiveKid + DataProtection.KeysFolder).
  .env.example, env/api/env.ps1, _docs/04_deploy/* updated to match. Repo
  scan in scripts/ and .env.example returns 0 offenders.

AZ-553 — bind_mount_es256_keys (2 pts, F-INFRA-2 Critical)
  start-services.sh bind-mounts DEPLOY_HOST_JWT_KEYS_DIR read-only at
  /etc/azaion/jwt-keys; preflight fails fast on a missing or empty host
  directory with operator-actionable error messages.

AZ-554 — persist_dataprotection_keys (2 pts, F-INFRA-3 High)
  Program.cs DataProtection wiring now fails fast in Production when
  KeysFolder is unset OR not probe-writable. start-services.sh bind-mounts
  DEPLOY_HOST_DP_KEYS_DIR read-write at /var/lib/azaion/dp-keys.
  Development behaviour unchanged (ephemeral default).

AZ-555 — secrets_readme_es256_rewrite (1 pt, F-INFRA-4 High)
  secrets/README.md schema fully rewritten; new "Host-side directories"
  subsection with bind-mount table + ownership/permission guidance.
  Cycle-1 JwtConfig__Secret removed from live schema (one prose
  deprecation paragraph retained).

Adjacent hygiene
  module-layout.md "Owns" extended to include scripts/, secrets/, env/,
  .env.example (gap from Step 9 new-task layout-delta).

Tests
  e2e/Azaion.E2E/Tests/Cycle2HotfixDeployTests.cs — 19 facts (8 exec,
  11 Skip with rationale per AZ-537/AZ-538 precedent). Skipped tests
  cover preflight/restart/Production-only paths verified at deploy gate.

Build: 0W 0E across Azaion.AdminApi + Azaion.E2E.
Test run deferred to autodev Step 11 (Run Tests).
Tracker transition deferred to next batch (MCP availability unverified
in this session — Leftovers pattern).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-14 09:35:57 +03:00


Persist DataProtection Keys Folder + Fail-Fast In Production

Task: AZ-554_persist_dataprotection_keys
Name: Persist DataProtection keys folder + fail-fast in Production
Description: DataProtection (which encrypts MFA secrets, recovery codes, and any other protected payload) currently writes its master keys to an ephemeral container path. Every container restart rotates the master key, which permanently locks every MFA-enrolled user out of their account. Persist the key folder onto the host, document the env var, and fail-fast in Production if the folder is unconfigured.
Complexity: 2 points
Dependencies: AZ-553 (host-side volume pattern + runbook section established)
Component: Admin API + Deploy / scripts
Tracker: AZ-554
Epic: AZ-530
CMMC ref: SC.L2-3.13.10 (key management), IA.L2-3.5.7 (passwords, secrets storage)
Source: _docs/05_security/security_report_cycle2.md F-INFRA-3 (High); _docs/05_security/infrastructure_review_cycle2.md §F-2026Q2-INFRA-3

Problem

Program.cs configures services.AddDataProtection() without specifying a persistent key folder. ASP.NET Core defaults the key ring to an OS-specific path that, inside a container, lives on the writable layer and vanishes on every restart. AZ-534 uses DataProtection to encrypt the per-user TOTP MfaSecret at rest, and it also encrypts recovery codes. When the key ring is lost and a new master key is generated on restart:

  • Existing MfaSecret ciphertexts can no longer be decrypted → no user can verify TOTP at login.
  • Existing recovery-code hashes (if also DataProtection-wrapped) become unusable.

The net effect on the next docker restart is a hard lockout of every MFA-enrolled user. No data is corrupted on disk — but recovery requires either operator intervention or a re-enrolment campaign.

Outcome

  • DataProtection master keys persist across container restarts in Production.
  • In Production, the app refuses to start if DataProtection.KeysFolder is unset (no silent fallback to the ephemeral path).
  • Development environment continues to work with the ephemeral default (no behavioural change for local devs).
  • .env.example and the deploy runbook document the new host-side env var.

Scope

Included

  • Program.cs: bind DataProtection.KeysFolder from configuration, call PersistKeysToFileSystem(...) when set, and add a Production-only fail-fast in the AppEnv.IsProduction() branch if the folder is unset, missing, or not writable.
  • appsettings.json: add a DataProtection section with documented keys (KeysFolder).
  • scripts/start-services.sh: bind-mount $DEPLOY_HOST_DP_KEYS_DIR onto the container at /var/lib/azaion/dp-keys (read-write — DataProtection must rotate keys on its own schedule).
  • secrets/<env>.public.env: set ASPNETCORE_DataProtection__KeysFolder=/var/lib/azaion/dp-keys in production/staging templates.
  • .env.example: document DEPLOY_HOST_DP_KEYS_DIR.
  • Extend the deploy runbook section authored by AZ-553 to cover the DataProtection mount alongside the JWT mount (same host-side layout, same ownership/perms guidance).
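
The bind-mount step in scripts/start-services.sh could look roughly like the sketch below. It is an illustration under stated assumptions, not the shipped script: the helper name dp_keys_mount_arg and the exact error wording are hypothetical, and the sketch only validates the host directory rather than creating it. Unlike the JWT keys directory from AZ-553, an empty directory is accepted here, because DataProtection generates its first key itself.

```shell
# Container-side path from this task; it must equal DataProtection.KeysFolder.
DP_KEYS_CONTAINER_DIR="/var/lib/azaion/dp-keys"

# dp_keys_mount_arg (hypothetical helper): prints the value for `docker run -v`
# or fails with an operator-actionable message on stderr.
dp_keys_mount_arg() {
    host_dir="${DEPLOY_HOST_DP_KEYS_DIR:-}"
    if [ -z "$host_dir" ]; then
        echo "ERROR: DEPLOY_HOST_DP_KEYS_DIR is not set; see .env.example" >&2
        return 1
    fi
    if [ ! -d "$host_dir" ]; then
        echo "ERROR: DEPLOY_HOST_DP_KEYS_DIR=$host_dir is not a directory on this host" >&2
        return 1
    fi
    # Read-write on purpose: the key ring writes new keys when it rotates.
    printf '%s:%s:rw\n' "$host_dir" "$DP_KEYS_CONTAINER_DIR"
}
```

A caller would capture the value first, for example `mount="$(dp_keys_mount_arg)" || exit 1`, and only then pass `-v "$mount"` to docker, so that a preflight failure stops the deploy instead of producing an empty mount argument.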

Excluded

  • Encrypting the DataProtection keys at rest with a hardware secret (HSM / KMS-wrapped). Larger scope; would belong to a separate hardening epic.
  • Cross-instance key sharing for a horizontally-scaled admin deployment. Currently single-instance per environment.
  • Addressing the AZ-534 / AZ-NEW-12 user-cache invalidation concern. Out of scope for this ticket.
  • secrets/README.md rewrite — AZ-555.

Acceptance Criteria

AC-1: MFA survives container restart in Production
Given a Production deploy with DEPLOY_HOST_DP_KEYS_DIR mounted
And a user has enrolled in TOTP MFA before the restart
When the admin container is stopped and started again
Then the user can complete a fresh /login + /login/mfa cycle using their existing TOTP authenticator (no recovery code, no re-enrolment).

AC-2: Production fails fast when KeysFolder is unset
Given ASPNETCORE_ENVIRONMENT=Production and ASPNETCORE_DataProtection__KeysFolder is unset
When the admin process starts
Then the process exits non-zero with a startup-log entry that names DataProtection.KeysFolder as the missing/invalid configuration.

AC-3: Production fails fast when KeysFolder is not writable
Given ASPNETCORE_ENVIRONMENT=Production and KeysFolder points at a path that is not writable by the container user
When the admin process starts
Then the process exits non-zero with a startup-log entry naming the path and the missing permission.

AC-4: Development unchanged
Given ASPNETCORE_ENVIRONMENT=Development and KeysFolder is unset
When the admin process starts
Then the process starts normally (uses the ephemeral default) and no fail-fast is triggered.

AC-5: Mount is read-write
Given the admin container is running with the new bind-mount
When the DataProtection key ring rotates (test by writing a probe file /var/lib/azaion/dp-keys/.probe)
Then the write succeeds.
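
The decision that AC-2 through AC-5 describe can be sketched as a small shell function. This is a shell analogue for illustration only, not the Program.cs implementation: the helper name dp_keys_preflight, its arguments, and the message wording are all hypothetical. It assumes "probe-writable" means that a file can be created and then removed inside the folder.

```shell
# dp_keys_preflight ENVIRONMENT KEYS_FOLDER (hypothetical helper)
# Returns 0 when startup may proceed, 1 when it must fail fast.
dp_keys_preflight() {
    env_name="$1"
    keys_folder="$2"

    # AC-4: only Production is strict; other environments keep the ephemeral default.
    if [ "$env_name" != "Production" ]; then
        return 0
    fi

    # AC-2: an unset folder in Production is fatal.
    if [ -z "$keys_folder" ]; then
        echo "FATAL: DataProtection.KeysFolder is not configured" >&2
        return 1
    fi

    # AC-3 / AC-5: prove writability by creating and removing a probe file.
    # The subshell keeps a failed redirection from aborting the calling shell.
    probe="$keys_folder/.probe"
    if ! ( : > "$probe" ) 2>/dev/null; then
        echo "FATAL: DataProtection.KeysFolder=$keys_folder is not writable by uid $(id -u)" >&2
        return 1
    fi
    rm -f "$probe"
    return 0
}
```

Run inside the container, `dp_keys_preflight Production /var/lib/azaion/dp-keys` would exercise the same probe path that AC-5 names. A missing folder and a read-only folder both land in the "not writable" branch, which matches the scope item that treats unset, missing, and not writable as fail-fast conditions.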

Non-Functional Requirements

Reliability

  • Container restart MUST NOT invalidate already-issued MFA secrets or DataProtection-wrapped ciphertexts.

Security

  • Mount must be writable by the container user but not world-readable on the host (chmod 0700 host-side, container user owns).
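
Host-side preparation of the directory might look like the following sketch. The helper name prepare_dp_keys_dir is hypothetical, and the owner argument is an assumption: this task does not state the container user's uid and gid, so the caller has to supply them.

```shell
# prepare_dp_keys_dir DIR [OWNER] (hypothetical helper)
# OWNER is a uid:gid pair matching the container user; it is an assumption
# here because this task does not state the container's uid.
prepare_dp_keys_dir() {
    dir="$1"
    owner="${2:-}"

    mkdir -p "$dir" || return 1
    # 0700: owner-only access, so the keys are not world-readable on the host.
    chmod 0700 "$dir" || return 1
    if [ -n "$owner" ]; then
        # Changing ownership normally requires root on the host.
        chown "$owner" "$dir" || return 1
    fi
}
```

An operator would call it once per host, for example `prepare_dp_keys_dir "$DEPLOY_HOST_DP_KEYS_DIR" 1000:1000`, where 1000:1000 is a placeholder for the real container uid and gid.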

Blackbox Tests

| AC Ref | Initial Data/Conditions | What to Test | Expected Behavior | NFR References |
|--------|-------------------------|--------------|-------------------|----------------|
| AC-1 | Prod env, mount configured, user MFA-enrolled, restart container | Login + MFA verify after restart | Same TOTP secret still works | Reliability |
| AC-2 | Prod env, KeysFolder unset | Start admin process | Exit non-zero, log names DataProtection.KeysFolder | |
| AC-3 | Prod env, KeysFolder read-only path | Start admin process | Exit non-zero, log names path + permission | |
| AC-4 | Dev env, KeysFolder unset | Start admin process | Process starts, ephemeral default used | |
| AC-5 | Container running, mount RW | Probe write inside mount | Write succeeds | Security |

Constraints

  • Persist via PersistKeysToFileSystem on the configured folder; do not introduce a database-backed or third-party key store in this ticket.
  • Fail-fast must be Production-only — Development workflows depend on the ephemeral default.

Risks & Mitigation

Risk 1: Existing prod users locked out at first restart after deploy

  • Risk: Restarts after this fix ships preserve the key ring, but any MFA enrolments made on the cycle-2 build BEFORE this fix were encrypted with a master key that has already been lost. Those users remain locked out.
  • Mitigation: Cycle 2 has not been deployed to Production yet (the security audit FAILed before deploy). No real users are affected. Document this lifecycle clearly in the runbook so future hotfix sequencing avoids the same trap.

Risk 2: Host-side directory permissions wrong

  • Risk: If the operator creates $DEPLOY_HOST_DP_KEYS_DIR as root:root 700, the container user cannot write.
  • Mitigation: AC-3 fail-fast catches this immediately on startup. Runbook includes the explicit ownership/perms command.

Risk 3: Drift between appsettings.json default and the runtime mount target

  • Risk: The default in appsettings.json names one path, the deploy script mounts another, and the container fails fast on startup.
  • Mitigation: AC-5 indirectly covers this via the probe-write step; runbook section explicitly states the mount target == config value.
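
The "mount target == config value" rule from Risk 3 could also be checked mechanically before deploy. The sketch below is one possible approach, not part of the ticket's scope: the helper name check_keys_folder_matches is hypothetical, and it assumes a plain KEY=value env file with no quoting, where the last assignment wins.

```shell
# check_keys_folder_matches ENV_FILE MOUNT_TARGET (hypothetical helper)
# Succeeds only if ENV_FILE sets ASPNETCORE_DataProtection__KeysFolder to MOUNT_TARGET.
check_keys_folder_matches() {
    env_file="$1"
    mount_target="$2"
    # Assumption: unquoted KEY=value lines; if the key repeats, the last one is used.
    configured="$(sed -n 's/^ASPNETCORE_DataProtection__KeysFolder=//p' "$env_file" | tail -n 1)"
    if [ "$configured" != "$mount_target" ]; then
        echo "ERROR: $env_file sets KeysFolder='$configured' but the mount target is '$mount_target'" >&2
        return 1
    fi
}
```

Called as `check_keys_folder_matches secrets/<env>.public.env /var/lib/azaion/dp-keys`, it would flag drift at deploy time rather than at container startup. Quoted values or trailing whitespace in the env file would need extra handling.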