# Infrastructure & Configuration Review — Cycle 2 (Delta) **Date**: 2026-05-14 **Scope**: cycle-2 surface only — `Program.cs` middleware/CORS/HSTS, DataProtection wiring, ES256 key store, `secrets/`, `scripts/deploy.sh` + `start-services.sh` + `_lib.sh` + `health-check.sh`, `docker-compose.test.yml`, `.env.example`, `.woodpecker/`. Cycle-1 categories that did not change since `infrastructure_review.md` are out of scope here. **Read order**: this file is a delta on `infrastructure_review.md`. Categories not listed here keep their cycle-1 status. ## Cycle-2 Findings ### F-2026Q2-INFRA-1: Deploy script hard-blocks on obsolete `JwtConfig__Secret` (CRITICAL — deploy-blocking) - **Location**: `scripts/start-services.sh:32`. - **Description**: `start-services.sh` does `require_env ... ASPNETCORE_JwtConfig__Secret`, which kills the deploy if the variable is not set. AZ-532 removed `JwtConfig.Secret` entirely — `Program.cs:60-83` configures `JwtBearer` against `IssuerSigningKeyResolver` backed by `JwtSigningKeyProvider` (ES256 PEMs). A correctly-configured cycle-2 deploy that follows `.env.example` (which does NOT include `JwtConfig__Secret`) will fail at the deploy-script preflight. - **Impact**: cycle-2 cannot deploy with the current scripts. Either the deploy fails on preflight, or the operator sets a dummy `JwtConfig__Secret=dummy` to get past the check — and then we hit F-INFRA-2 below. - **Remediation**: replace the line with: ``` require_env ... ASPNETCORE_JwtConfig__KeysFolder ASPNETCORE_JwtConfig__ActiveKid ``` Drop `ASPNETCORE_JwtConfig__Secret` from the schema in `secrets/README.md` and from any documented env templates. ### F-2026Q2-INFRA-2: ES256 keys folder not bind-mounted into container (CRITICAL — deploy-blocking) - **Location**: `scripts/start-services.sh:48-56` (the `docker run` line). - **Description**: `JwtConfig.KeysFolder` defaults to `secrets/jwt-keys` (relative path, per `appsettings.json:15`). Inside the container, this resolves under `/app/`. The Dockerfile **does not** COPY `secrets/`, and `start-services.sh` **does not** add a `--volume` mapping for the host `secrets/jwt-keys` directory. Result: `JwtSigningKeyProvider.Load` (cycle-2 ctor) fails-fast at startup with "no PEM files found". - **Impact**: container restart loop on every cycle-2 deploy. The only way to bring it up today is to manually `docker cp` the PEMs into the container — defeats reproducibility, no rotation story. - **Remediation**: add a host-side directory (e.g. `/etc/azaion/jwt-keys` owned by the runtime user, mode 0700, PEMs mode 0400) and a corresponding `--volume "$DEPLOY_HOST_JWT_KEYS_DIR:/etc/azaion/jwt-keys:ro"` line in `start-services.sh`. Set `ASPNETCORE_JwtConfig__KeysFolder=/etc/azaion/jwt-keys` in the public env overlay. Document the host-side procedure in `secrets/README.md` and `_docs/04_deploy/`. - **Cross-ref**: `e2e/test-keys` is correctly mounted in `docker-compose.test.yml:42` — the test stack works; only the prod deploy script is broken. ### F-2026Q2-INFRA-3: DataProtection key store ephemeral by default — MFA secrets unrecoverable across restarts (HIGH) - **Location**: `Program.cs:147-160`, `scripts/start-services.sh` (no DataProtection bind-mount), `secrets/README.md` (no entry). - **Description**: AZ-534 wraps `MfaSecret` ciphertext with `IDataProtector` (`Azaion.Mfa.Secret.v1` purpose). When `DataProtection:KeysFolder` is unset, ASP.NET Core writes its master keys to a per-machine path inside the container (`%LOCALAPPDATA%/ASP.NET/DataProtection-Keys` or `~/.aspnet/DataProtection-Keys` depending on platform), which is **lost on every container restart**. After the first restart, every existing `MfaSecret` becomes undecryptable; users with MFA enabled can no longer log in (their `/login/mfa` fails because the server can't unwrap the secret to verify TOTP), and they can't even self-disable MFA via `/users/me/mfa/disable` because that path also re-validates the existing TOTP. Recovery codes still work (SHA-256 hashed, no DataProtection involvement) — so the only escape is recovery-code-based login, then disable-and-re-enroll. - **Impact**: catastrophic data loss for the auth surface. Every `docker stop && docker run` cycle locks every MFA user out. - **Mitigating control (current)**: the cycle-2 test deploy is single-instance, so within one process lifetime the keys are stable. The risk crystallizes on first restart. - **Remediation**: 1. Add a host-side persistent directory (e.g. `/var/lib/azaion/dp-keys`) owned by the runtime user, mode 0700. 2. Add `--volume "$DEPLOY_HOST_DP_KEYS_DIR:/var/lib/azaion/dp-keys"` to `start-services.sh`. 3. Set `ASPNETCORE_DataProtection__KeysFolder=/var/lib/azaion/dp-keys` in `secrets/.public.env`. 4. Add a fail-fast in `Program.cs:151-160`: if `app.Environment.IsProduction()` and `DataProtection:KeysFolder` is unset, throw at startup. This makes the misconfiguration loud instead of silent. 5. Document key persistence and rotation in `secrets/README.md` and `_docs/04_deploy/`. ### F-2026Q2-INFRA-4: `secrets/README.md` schema still lists HS256-era `JwtConfig__Secret` (HIGH — doc drift, deploy-blocking by proxy) - **Location**: `secrets/README.md:50-55` ("Schema (variables that MUST be in the encrypted file)"). - **Description**: the documented schema still requires `ASPNETCORE_JwtConfig__Secret=<32 random bytes>`. This is the same root cause as F-INFRA-1 but on the documentation side — operators following the README will set a useless variable, miss `JwtConfig__KeysFolder` / `JwtConfig__ActiveKid`, and miss `DataProtection__KeysFolder`. - **Impact**: misleads any operator onboarding to the project; reinforces the broken deploy script. - **Remediation**: rewrite the "What goes where" + "Schema" sections to: - Drop `ASPNETCORE_JwtConfig__Secret`. - Add `ASPNETCORE_JwtConfig__KeysFolder=/etc/azaion/jwt-keys` (path; not a secret — belongs in `.public.env`). - Add `ASPNETCORE_JwtConfig__ActiveKid=` (path; not a secret — belongs in `.public.env`). - Add `ASPNETCORE_DataProtection__KeysFolder=/var/lib/azaion/dp-keys` (path; not a secret). - Note: the PEM private keys themselves are NOT in sops; they live on the host filesystem at `KeysFolder`, owned by the runtime user, mode 0400. Rotation procedure is in `scripts/generate-jwt-key.sh`. ### F-2026Q2-INFRA-5: HSTS preload+includeSubDomains may break legacy subdomains (MEDIUM) - **Location**: `Program.cs:217-225`. - **Description**: HSTS is configured with `MaxAge = 365 days`, `IncludeSubDomains = true`, `Preload = true` in non-Development environments. If any current or future `*.azaion.com` subdomain serves over plain HTTP (legacy admin tools, internal monitoring, dev/staging mirrors of unrelated systems), browsers that have seen the header will refuse to connect to those subdomains. Worse, **HSTS preload registration is essentially permanent** — the Chrome HSTS preload list takes weeks/months to be removed from once submitted, even after the header is disabled. - **Impact**: operational blast radius if a non-HTTPS subdomain exists or is added later. Preload makes the mistake hard to reverse. - **Remediation**: 1. Audit all `*.azaion.com` subdomains; confirm 100% HTTPS-only (including any internal-only ones — DNS hijacking can expose them to user browsers). 2. Document the subdomain inventory in `_docs/04_deploy/`. 3. Consider gating `Preload = true` behind an env var so staging and dev hosts don't trigger preload-list submission attempts. 4. Do NOT submit to the public preload list (https://hstspreload.org) until the audit is complete and signed off. ### F-2026Q2-INFRA-6: `audit_events` table has no retention policy (LOW — operational hygiene) - **Location**: `env/db/07_auth_lockout_and_audit.sql`. - **Description**: `audit_events` is append-only with no documented retention or partitioning. Every login attempt writes a row. At 10K users × 5 attempts/day = 50K rows/day = ~18M rows/year. Postgres handles this fine, and the composite index `(event_type, email, occurred_at DESC)` keeps `CountRecentFailedLogins` sub-millisecond, but unbounded growth has compliance implications (GDPR / data minimization), backup/restore time, and storage cost. - **Impact**: not a security risk per se — audit completeness is the goal — but the regulatory storage horizon needs a stated answer. - **Remediation**: agree retention (CMMC says ≥1 year for audit logs), add a nightly `DELETE FROM audit_events WHERE occurred_at < now() - interval '13 months'` job (cron + small script), document in `_docs/04_deploy/`. Optional: switch to monthly partitions so the cleanup is `DROP PARTITION` instead of a row-by-row delete. ### F-2026Q2-INFRA-7: `JwtSigningKeyProvider` silent fallback to first PEM (MEDIUM) - **Location**: `Azaion.Services/JwtSigningKeyProvider.cs:73-86`. - **Description**: when `JwtConfig.ActiveKid` is unset, the provider picks the alphabetically-first PEM and only logs at `LogInformation`. Adding a new PEM with a name that sorts earlier silently changes the signing key on next restart. The deploy script in `scripts/start-services.sh` does not require `ActiveKid`, and `.env.example:25-26` calls it "optional". - **Impact**: operator drops `kid-2026-04-aaa.pem` thinking it's a side-by-side rotation key, restart, and now all newly-minted tokens are signed under a kid the verifiers may not yet have in their JWKS cache (1-h max-age — fleet sees signature failures for up to 1 h). - **Remediation**: 1. Make `ActiveKid` required when more than one PEM is present (fail-fast at startup if ambiguous). 2. If exactly one PEM exists, accept it but log at `Warning` (not `Information`). 3. Update `.env.example` to mark `ActiveKid` as **required for prod** rather than "optional". - **Cross-ref**: same finding documented in `static_analysis_cycle2.md` as F-2026Q2-AUTH-7. Listed here too because the operational-hardening fix (require it in `start-services.sh`) is in this scope. ## Re-verified categories (no cycle-2 regression) | Area | Cycle-1 status | Cycle-2 verdict | |------|----------------|-----------------| | Container non-root user (F-6) | FAIL | **PASS** — Dockerfile now sets `USER app` (line 40) and chowns `/app/Content` + `/app/logs`. Closes F-6. | | Production HTTPS enforcement (F-13) | FAIL | **PASS** — `app.UseHttpsRedirection()` + `app.UseHsts()` enabled in non-Development. Closes F-13 in code (still reverse-proxy fronting in deploy). | | CORS | tight | **TIGHTER** — cycle-2 dropped the `http://admin.azaion.com` origin. Only `https://admin.azaion.com` remains. | | Image pinned by digest | WARN | unchanged. Deferred. | | Secrets via env vars | PASS | unchanged. | | Test sidecar / E2E images | acceptable | unchanged. | | Test compose `ASPNETCORE_ENVIRONMENT=Development` | acceptable for tests | unchanged. Flag operator risk: a misconfigured prod that inherits this value silently loses HTTPS enforcement and HSTS. | | `.gitignore` excludes secrets | PASS | **PASS** — `secrets/jwt-keys/*` is gitignored; only `.gitkeep` tracked. Verified no PEMs in repo. | ## Self-verification - [x] All cycle-2-touched infra files reviewed (Dockerfile, docker-compose.test.yml, scripts/*, secrets/*, .env.example, appsettings*.json, .woodpecker/*) - [x] Each finding has file path + line number + remediation - [x] Cycle-1 closures verified by re-reading the code (F-6 USER directive, F-13 HSTS+HttpsRedirection) - [x] No false positives from test-only files (test fixtures are flagged as acceptable, not as findings)