Azaion Admin API — Environment Strategy
Date: 2026-05-13 · Cycle: 1 · Status: planning artifact (no scripts; concrete wiring lands in Step 7).
1. Environments
| Environment | Purpose | Infrastructure | Data Source |
|---|---|---|---|
| Development | Local developer workflow on macOS / Linux. | Either bare `dotnet run` against host Postgres (port 4312) or the new `docker-compose.yml` planned in Step 2 (API + Postgres on a private Docker network). | Empty database; SQL files under `env/db/` create roles + schema; no fixtures. |
| Test (CI) | Black-box tests in CI and locally via `scripts/run-tests.sh`. | `docker-compose.test.yml`: API + Postgres + e2e-runner on a Docker network. | Functional fixtures from `e2e/db-init/00_run_all.sh` + `99_test_seed.sql`. |
| Staging | Pre-production validation. | Self-hosted Linux server, single Docker host, behind Nginx reverse proxy on `stage.admin.azaion.com`. Mirrors prod topology and Postgres major version. | Anonymized snapshot of production (PII scrubbed by an offline script before import). |
| Production | Live system. | Self-hosted Linux server, single Docker host, behind Nginx reverse proxy on `admin.azaion.com`. | Live data; daily off-host backups. |
Test is added as a first-class environment because cycle 1 already exercises it (`docker-compose.test.yml`). The deploy template lists three; we list four to match reality.
2. Environment Variables
Source of Truth
The complete variable inventory lives in .env.example at the repo root (Step 1, 24 entries). This document does NOT duplicate that table — it only specifies, per environment, where each variable is sourced.
Per-environment sourcing
| Variable group | Development | Test (CI) | Staging | Production |
|---|---|---|---|---|
| `ASPNETCORE_ENVIRONMENT` | `.env` (`Development`) | docker-compose `environment:` (`Development`) | docker-compose / `--env-file` (`Staging`) | docker-compose / `--env-file` (`Production`) |
| `ASPNETCORE_URLS` | `.env` | compose | host `.env` (rendered from sops) | host `.env` (rendered from sops) |
| `ConnectionStrings__*` | `.env` (real local creds) | compose (literal; accepted F-10) | sops-encrypted file in git → decrypted on host at deploy time | same as staging |
| `JwtConfig__KeysFolder`, `__ActiveKid` (AZ-552/AZ-553) | `.env` (dev-only path) | compose (volume mount) | public env + bind-mount via `DEPLOY_HOST_JWT_KEYS_DIR` | same |
| `DataProtection__KeysFolder` (AZ-554) | unset (ephemeral dev default) | unset | public env + bind-mount via `DEPLOY_HOST_DP_KEYS_DIR` | same; Production fail-fast if unset |
| `JwtConfig__{Issuer,Audience,AccessTokenLifetimeMinutes}` | appsettings defaults | appsettings defaults | host `.env` if non-default | host `.env` if non-default |
| `ResourcesConfig__*` | appsettings defaults | compose | host `.env` if non-default | host `.env` if non-default |
| `DEPLOY_*`, `REGISTRY_TAG` | `.env` (developer machine) | n/a | passed to `scripts/deploy.sh` from operator's shell or CI manual trigger | same |
| `REGISTRY_USER`, `REGISTRY_TOKEN` | empty in dev `.env` | Woodpecker secrets `registry_user` / `registry_token` | Woodpecker secrets (CI deploy) or operator's shell (manual deploy) | same |
| `CI_COMMIT_SHA` | unset → image label `unknown` | Woodpecker built-in | Woodpecker built-in | Woodpecker built-in |
Variable Validation (fail-fast)
The Admin API already does this for the most security-critical variable:

    // Fail fast at startup: abort if the JwtConfig section or its Secret is missing.
    var jwtConfig = builder.Configuration.GetSection(nameof(JwtConfig)).Get<JwtConfig>();
    if (jwtConfig == null || string.IsNullOrEmpty(jwtConfig.Secret))
        throw new Exception("Missing configuration section: JwtConfig");

The deploy plan adds the same fail-fast check for connection strings during Step 7 wiring (a one-time _ = configuration.GetConnectionString("AzaionDb") ?? throw … plus the same for AzaionDbAdmin, executed during WebApplication build). Without the check, a missing variable currently surfaces only on the first DB call, which is too late.
Static / lookup-style variables (`ResourcesConfig__*`, `JwtConfig__{Issuer,Audience,Lifetime}`) keep their `appsettings.json` defaults in every environment unless an override is required. We do NOT add fail-fast checks for them.
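The same fail-fast idea can also be applied one layer earlier, in a deploy script, before the container starts. The sketch below is illustrative only: the `require_env` helper name is hypothetical, and which variables to pass is an assumption based on the connection strings named above. It reports every missing variable in one pass rather than stopping at the first.

```shell
# Hypothetical preflight helper (not part of the repo): returns non-zero if any
# named variable is unset or empty, and prints one line per offender to stderr.
require_env() {
  _missing=0
  for _name in "$@"; do
    # Indirect lookup: expand the variable whose name is stored in $_name.
    eval "_value=\${$_name:-}"
    if [ -z "$_value" ]; then
      echo "preflight: required variable $_name is unset or empty" >&2
      _missing=1
    fi
  done
  return "$_missing"
}
```

A caller would typically write `require_env ConnectionStrings__AzaionDb ConnectionStrings__AzaionDbAdmin || exit 1` near the top of the script, so the operator sees the full list of gaps before anything is started.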
3. Secrets Management
Decision
| Environment | Method | Tool |
|---|---|---|
| Development | `.env` file | committed `.env.example` + per-developer `.env` (git-ignored) |
| Test (CI) | docker-compose `environment:` literals | accepted as test-only (security audit F-10) |
| Staging | git-tracked encrypted file | sops + age |
| Production | git-tracked encrypted file | sops + age |
Why sops + age (not Vault, not Woodpecker secrets, not hand-edited .env)
Constraints: self-hosted, no cloud account, single ops engineer, currently hand-editing .env on the host.
| Option | Pros | Cons | Verdict |
|---|---|---|---|
| sops + age (chosen) | Secrets versioned in git, encrypted at rest, decrypted on the host with a single age key. No new infra. Works offline. | Requires per-environment age keypair stored on the host outside git. Manual key rotation. | ✅ pragmatic for this team size and topology |
| HashiCorp Vault (self-hosted) | Dynamic DB creds, audit log, fine-grained ACL, KV v2. | Adds a service to operate, monitor, back up. Single-engineer ops budget cannot absorb it now. | ⏳ revisit in a future cycle when ops capacity grows |
| Woodpecker secrets exported into runtime container | Reuses existing secret store. | Couples runtime config to CI; secrets are not visible/auditable outside Woodpecker UI; cannot run the container outside CI without manually exporting them. | ❌ leaks the CI/runtime boundary |
| Hand-edited host `.env` (status quo) | Zero new tooling. | No version history, no encryption, no review trail. Single point of failure if the file is lost; security audit can't track changes. | ❌ status quo we are leaving behind (Drift B) |
sops + age conventions for this repo
secrets/
├── .sops.yaml # routes secrets/staging.env / production.env to the right age recipients
├── staging.env # SOPS-encrypted; safe to commit
└── production.env # SOPS-encrypted; safe to commit
- `.sops.yaml` declares two age recipients: `recipient_staging` and `recipient_production` (public keys).
- The matching age private keys live on each host at `/etc/azaion/age.key`, mode `0400`, owned by root. They are NEVER committed.
- `scripts/deploy.sh` (Step 7) runs `SOPS_AGE_KEY_FILE=/etc/azaion/age.key sops -d secrets/${env}.env > /tmp/azaion.env` and feeds it to `docker run --env-file`.
- All staging/production env values that are NOT secret (e.g. `DEPLOY_HOST_PORT`, `REGISTRY_TAG`) live in plain-text `secrets/staging.public.env` / `secrets/production.public.env` next to the encrypted file, also git-tracked. Loaded before the decrypted overlay.
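To make the "public file first, decrypted overlay second" ordering concrete, here is a minimal sketch of how two env files could be layered into one. The `merge_env_files` helper is hypothetical, and the assumption that the overlay wins on duplicate keys is an interpretation of the load order described above, not something the deploy script is confirmed to do.

```shell
# Hypothetical helper: merge KEY=VALUE files so that later files override
# earlier ones. Comments, blank lines, and lines without "=" are skipped.
# Keys are printed in the order they were first seen.
merge_env_files() {
  awk -F= '
    /^[ \t]*(#|$)/ { next }            # skip comments and blank lines
    index($0, "=") == 0 { next }       # skip lines that are not KEY=VALUE
    {
      key = $1
      if (!(key in val)) order[++n] = key
      # Keep everything after the first "=", so values may contain "=".
      val[key] = substr($0, length(key) + 2)
    }
    END { for (i = 1; i <= n; i++) print order[i] "=" val[order[i]] }
  ' "$@"
}
```

Under these assumptions, `merge_env_files secrets/production.public.env /tmp/azaion.env` would print a single combined file in which any key present in both inputs takes its value from the decrypted overlay.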
Rotation policy
| Secret | Rotation cadence | Procedure |
|---|---|---|
| Postgres `azaion_admin` / `azaion_reader` passwords | every 90 days, on operator schedule | `ALTER ROLE … WITH PASSWORD …` → re-encrypt `production.env` → `scripts/deploy.sh` |
| JWT signing PEMs in `DEPLOY_HOST_JWT_KEYS_DIR` (AZ-532/AZ-552/AZ-553) | every 180 days, AND on any suspected leak | follow `scripts/generate-jwt-key.sh` header (steps 1-6: drop a new PEM next to the active one → restart → wait verifier-cache TTL → switch `ActiveKid` → wait access-token TTL → delete old PEM). Rotation is non-breaking because both kids are exposed via `/.well-known/jwks.json` during the overlap window. |
| `azaion_superadmin` password | every 365 days, AND on owner change | manual; not used by the running app, only by DB migrations |
| Registry `REGISTRY_TOKEN` | every 90 days OR on CI compromise | rotate registry credential → update Woodpecker secret `registry_token` → re-encrypt `production.env` if also referenced there |
| age private key (`/etc/azaion/age.key`) | every 365 days OR on host compromise | generate new key → add public recipient to `.sops.yaml` → `sops updatekeys secrets/*.env` → distribute new private key out-of-band → remove old recipient |
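The cadences in the table reduce to a simple age check that an operator could script. The helper below is a hypothetical sketch, not an existing repo script: it takes timestamps as epoch seconds so that it does not depend on any particular `date` implementation, and where the last-rotation timestamp is recorded is left open.

```shell
# Hypothetical helper: succeeds (exit 0) when a secret is due for rotation.
# Arguments: last rotation time and current time as epoch seconds, then the
# cadence in days (90, 180, or 365 in the table above).
rotation_overdue() {
  _last_epoch=$1
  _now_epoch=$2
  _cadence_days=$3
  # Whole days elapsed since the last rotation (86400 seconds per day).
  _age_days=$(( (_now_epoch - _last_epoch) / 86400 ))
  [ "$_age_days" -ge "$_cadence_days" ]
}
```

For example, `rotation_overdue "$last" "$(date +%s)" 90 && echo "rotate REGISTRY_TOKEN"` would flag the registry token once 90 full days have passed, assuming `$last` holds the recorded rotation time.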
4. Database Management
| Environment | Type | Migrations | Data | Backup |
|---|---|---|---|---|
| Development | Local Postgres on host (port 4312) or dockerized Postgres from `docker-compose.yml` | `env/db/*.sql` applied manually by developer the first time, then `*_users_email_unique.sql`-style additive scripts run with `psql` on demand | empty | none |
| Test (CI) | Postgres 16-alpine from `docker-compose.test.yml` | `env/db/*.sql` mounted into `/docker-entrypoint-initdb.d/sql/`, ordered by `00_run_all.sh` | `99_test_seed.sql` (functional) + 500 perf users injected by `scripts/run-performance-tests.sh` when needed | none; `down -v` between runs |
| Staging | Same Postgres major (16) on the staging server, port 4312, `azaion` database | `env/db/*.sql` applied manually under change control via `psql -U azaion_superadmin`. New migrations land in the same numeric-prefix sequence (`07_*.sql`, `08_*.sql`, …) | anonymized prod snapshot, refreshed on demand | nightly `pg_dump` snapshot retained 14 days |
| Production | Same Postgres 16 on prod server | Same as staging; migration must be applied to staging first, observed for ≥ 24 h, then promoted to prod with operator approval | live | nightly pg_dump retained 30 days; weekly snapshot retained 12 weeks; off-host copy via rsync |
Migration rules (cycle 1)
The project does NOT use an ORM migration framework (linq2db; restrictions.md). The conventions below replace it:
- Numeric-prefix ordering: every new migration is added as `env/db/NN_<description>.sql` where `NN` continues the existing sequence. The current sequence is `01..06`; the next is `07_*.sql`.
- Forward-only by default. Reversibility is provided by the off-host backup, NOT by hand-written DOWN scripts. The existing files (`02_structure.sql`, `03_add_timestamp_columns.sql`, `04_detection_classes.sql`, `06_users_email_unique.sql`) follow this pattern; we keep it.
- Backward-compatible deploys: every schema change must be safe to apply BEFORE the matching code is deployed (additive change → deploy code → cleanup change in a later release). The cycle 1 example: `06_users_email_unique.sql` was applied first; the `RegisterUser` change to translate `23505` came after. AZ-197's `User.Hardware` column was kept as a tombstone instead of dropped, for the same reason.
- Production migrations need approval: operator manually runs the SQL on prod after staging soak. No automatic CI execution against prod in cycle 1 (Drift J: automation is a future cycle's work).
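The numeric-prefix rule is mechanical enough to check with a few lines of shell. The sketch below assumes two-digit prefixes and a flat directory such as `env/db/`; the `next_migration_prefix` helper is hypothetical and only illustrates how the next free prefix could be derived from the files already present.

```shell
# Hypothetical helper: print the next two-digit migration prefix for a
# directory of NN_<description>.sql files (prints "01" for an empty directory).
next_migration_prefix() {
  _dir=$1
  _max=0
  for _f in "$_dir"/[0-9][0-9]_*.sql; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$_f" ] || continue
    _base=${_f##*/}      # file name without the directory
    _n=${_base%%_*}      # leading NN
    _n=${_n#0}           # drop one leading zero so 08/09 are not read as octal
    if [ "$_n" -gt "$_max" ]; then
      _max=$_n
    fi
  done
  printf '%02d\n' $((_max + 1))
}
```

Gaps in the sequence are ignored on purpose: the helper continues from the highest existing prefix, which matches the forward-only convention of never inserting a migration between two that have already been applied.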
Drifts logged here
| ID | Severity | Description | Resolved In |
|---|---|---|---|
| B | Medium | No secret manager (status quo: hand-edited host `.env`) | Resolved in spec: sops + age (§3); concrete files + script in Step 7 |
| J | Low (NEW) | DB migrations applied manually on staging/prod; no automation | Carried forward to a future cycle |
5. Self-verification
- Four environments (Dev, Test/CI, Staging, Production) defined with purpose, infrastructure, and data source.
- Environment variable sourcing matrix references `.env.example` (Step 1) without duplicating it.
- No literal secrets in this document; only variable names and tool names.
- Secret manager chosen for staging/production (sops + age) with rotation policy.
- Database strategy per environment, including the explicit no-ORM-migrations convention.