refactor: remove deploy.cmd and update Dockerfile for health checks

- Deleted the deploy.cmd script as it was no longer needed.
- Updated Dockerfile to include curl for health checks and added a non-root user for improved security.
- Modified health check command to use curl for better reliability.
- Adjusted docker-compose.test.yml to reflect changes in health check configuration.
- Cleaned up appsettings.json and removed unused configuration properties.
- Removed Resource entity and related requests from the codebase as part of the architectural shift.
- Updated documentation to reflect the removal of hardware binding and related endpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>
Oleksandr Bezdieniezhnykh
2026-05-13 08:47:21 +03:00
parent 43fe38e67d
commit c7b297de83
76 changed files with 4034 additions and 832 deletions
@@ -0,0 +1,158 @@
# Azaion Admin API — CI/CD Pipeline
**Date**: 2026-05-13 · **Cycle**: 1 · **Status**: planning artifact (current Woodpecker files audited; proposed changes land as concrete YAML in Step 7).
## 1. Platform & Constraints
| Constraint | Value | Source |
|------------|-------|--------|
| CI platform | **Woodpecker CI** | restrictions.md §Operational |
| Default agent label | `arm64` | `.woodpecker/01-test.yml`, `.woodpecker/02-build-push.yml` |
| Future agent label | `amd64` (matrix entry, currently commented out) | `.woodpecker/02-build-push.yml` |
| Two-workflow contract | `01-test.yml` → tests; `02-build-push.yml` (`depends_on: 01-test`) → image | Already in repo |
| Registry | `$REGISTRY_HOST/azaion/admin` | Woodpecker secret `registry_host` |
| Branches with full pipeline | `dev`, `stage`, `main` | both files' `when.branch` |
The reference contract from `.cursor/skills/deploy/templates/ci_cd_pipeline.md` is already partially adopted. This step closes the remaining gaps.
## 2. Current Pipeline (audited)
### `.woodpecker/01-test.yml` — what it does today
| Step | Image | Action | Quality gate |
|------|-------|--------|--------------|
| `unit-tests` | `mcr.microsoft.com/dotnet/sdk:10.0` | `dotnet restore` + `dotnet test Azaion.AdminApi.sln` (release, TRX logger) | All unit tests pass |
| `e2e-tests` | `mcr.microsoft.com/dotnet/sdk:10.0` | `dotnet restore` + `dotnet test e2e/Azaion.E2E/Azaion.E2E.csproj` | All E2E tests pass |
**Audit findings**:
1. ✅ Tests are gated before build (matches contract).
2. ❌ E2E test step runs `dotnet test` directly — but the project uses **Docker-orchestrated black-box tests** via `docker-compose.test.yml`. The pure `dotnet test` invocation cannot start `system-under-test` + `test-db` containers, so `e2e-tests` as written either skips integration scenarios or relies on undocumented agent state. The reference contract uses `docker compose … --abort-on-container-exit --exit-code-from e2e-runner` instead.
3. ❌ No coverage report.
4. ❌ No SAST / dependency scan / image scan stage. Security audit recommendation 13 explicitly asked for `dotnet list package --vulnerable` in CI (Drift F).
5. ❌ No artifact upload of TRX results — failures are visible only in console logs.
### `.woodpecker/02-build-push.yml` — what it does today
| Step | Image | Action | Quality gate |
|------|-------|--------|--------------|
| `build-push` | `docker` | `docker login` → `docker build` (with three OCI labels + `CI_COMMIT_SHA` build-arg) → `docker push $REGISTRY_HOST/azaion/admin:${CI_COMMIT_BRANCH}-${TAG_SUFFIX}` | Push succeeds |
**Audit findings**:
1. ✅ Multi-arch matrix scaffolding present (`PLATFORM` / `TAG_SUFFIX`) with amd64 commented for future use.
2. ✅ `depends_on: [01-test]` is set, so gating is correct.
3. ✅ OCI labels (`revision`, `created`, `source`) injected as build-time labels.
4. ❌ Only branch-based mutable tag pushed. No immutable `<sha12>-<arch>` tag → host scripts cannot pin (Drift A).
5. ❌ No image scan (Trivy) before push.
6. ❌ Old documentation referenced `.woodpecker/build-arm.yml` which no longer exists (Drift D — fix in this doc, see §10).
## 3. Proposed Stage Map (target state for cycle 1)
| Stage | Trigger | Workflow file | Quality gate |
|-------|---------|---------------|--------------|
| Lint / format | every push & PR | `01-test.yml` (new step) | `dotnet format --verify-no-changes` returns 0 |
| Unit tests | every push & PR | `01-test.yml` | All `Azaion.*Tests` pass; TRX uploaded |
| Black-box E2E (Docker compose) | every push & PR | `01-test.yml` | `docker compose -f docker-compose.test.yml up --abort-on-container-exit --exit-code-from e2e-consumer` returns 0; results uploaded |
| Security: dependency audit | every push & PR | `01-test.yml` (new step) | `dotnet list package --vulnerable --include-transitive` reports zero High/Critical CVEs |
| Security: image scan | post-build, pre-push | `02-build-push.yml` (new step) | `trivy image --severity HIGH,CRITICAL --exit-code 1` returns 0 |
| Build | push to `dev` / `stage` / `main` | `02-build-push.yml` | `docker build` succeeds |
| Push (branch tag + SHA tag) | push to `dev` / `stage` / `main` | `02-build-push.yml` | both `docker push` calls succeed |
| Performance smoke (optional) | manual on `stage` / `main` | `03-perf.yml` (new) | k6 thresholds in `scripts/perf-scenarios.js` all `ok: true` |
| Deploy staging | tag push or `stage` branch | `04-deploy.yml` (new) | health check returns 200 within timeout |
| Deploy production | manual approval | `04-deploy.yml` (new) | health check returns 200 within timeout |
> Note on coverage: the test infrastructure (cycle 1) does not yet collect or report coverage. The skill's 75% gate cannot be enforced this cycle. Recorded as **Drift I** (carried forward to a future cycle); does NOT block this deploy.
## 4. Caching Strategy
| Cache | Key | Notes |
|-------|-----|-------|
| `nuget` packages | hash of `**/*.csproj` | Mounted on `/root/.nuget/packages`; restored before `dotnet restore`. Cache invalidates on any csproj change. |
| Docker layer cache | hash of `Dockerfile` + `**/*.csproj` | Pass `--cache-from` to `docker build`, pointing at the previous push of the same branch (e.g. `--cache-from $REGISTRY_HOST/azaion/admin:dev-arm`). Cheapest cache available without buildx. |
| E2E DB init scripts | none — re-init each run | Schema differences would mask test failures. `down -v` between runs is intentional (mirrors `scripts/run-tests.sh`). |
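The NuGet cache key described in the table can be sketched as a content hash over all project files. This is a minimal illustration, not the pipeline's actual implementation: the function name `nuget_cache_key` is hypothetical, and it assumes GNU-style `find`, `sort -z`, `xargs -0`, and `sha256sum` are available on the agent.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: derive a cache key from the path and contents of every
# *.csproj below a directory. Editing or renaming any csproj changes the key.
nuget_cache_key() {
  local root="${1:-.}"
  (
    cd "$root"
    # Relative paths keep the key independent of where the repo is checked out;
    # sorting keeps it independent of filesystem traversal order.
    find . -name '*.csproj' -type f -print0 \
      | LC_ALL=C sort -z \
      | xargs -0 sha256sum \
      | sha256sum \
      | cut -d' ' -f1
  )
}
```

A step would call `nuget_cache_key .` before `dotnet restore` and use the result to name the cache entry mounted on `/root/.nuget/packages`.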
## 5. Parallelization
```
01-test.yml (matrix: arm64 [+ amd64 future])
├── lint-format ─┐
├── unit-tests ─┼── all run in parallel on the same agent;
├── e2e-tests ─┤ the slowest (e2e) gates the workflow
└── deps-audit ─┘
02-build-push.yml (matrix: arm64 [+ amd64 future])
├── build ─→ image-scan ─→ push (branch tag) ─→ push (sha tag)
└─→ artifact: image digest stored as Woodpecker artifact
03-perf.yml (manual; arm64 only)
└── k6-perf (uses the docker-compose.test.yml SUT)
04-deploy.yml (manual; per-environment)
└── pull → stop → start → health-check → smoke
```
Cross-workflow gates: `02 depends_on 01`; `04 depends_on 02` for the same SHA.
## 6. Quality Gates (summary)
| Gate | Threshold | Action on breach |
|------|-----------|------------------|
| Lint | 0 violations | fail workflow |
| Unit tests | 100% pass | fail workflow |
| E2E tests | 100% pass | fail workflow |
| Dependency audit (High / Critical) | 0 CVEs | fail workflow (Drift F) |
| Image scan (High / Critical) | 0 CVEs | fail workflow |
| Coverage | not enforced this cycle (Drift I) | inform-only |
| Performance (k6) | thresholds in `perf-scenarios.js` | fail workflow when run |
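The dependency-audit gate could be enforced by scanning the audit report for blocking severities. The sketch below is hedged: `deps_audit_gate` and `has_blocking_vulns` are hypothetical names, and the row format (a line starting with `>` that carries the severity as its own column) is an assumption about the `dotnet list package --vulnerable` output that should be checked against the SDK version in use.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Assumption: each finding is a row that starts with ">" and has the severity
# (Low / Moderate / High / Critical) as a whitespace-delimited column.
has_blocking_vulns() {
  grep -Eq '^[[:space:]]*>.*[[:space:]](High|Critical)[[:space:]]'
}

# Hypothetical gate: reads the audit report on stdin, returns non-zero when
# any High or Critical finding is present.
deps_audit_gate() {
  local report
  report="$(cat)"
  if has_blocking_vulns <<<"$report"; then
    echo "deps-audit: High/Critical vulnerabilities found" >&2
    return 1
  fi
  echo "deps-audit: no High/Critical vulnerabilities" >&2
}
```

In a CI step this might be wired as `dotnet list package --vulnerable --include-transitive | deps_audit_gate`, with the raw report also saved as an artifact.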
## 7. Notifications
| Event | Channel | Recipients |
|-------|---------|------------|
| `01-test` failure | Woodpecker UI + Slack `#azaion-ci` | Backend team |
| `02-build-push` failure | Woodpecker UI + Slack `#azaion-ci` | Backend team |
| Image-scan High/Critical finding | Slack `#azaion-security` | Security + on-call |
| `04-deploy` failure | Slack `#azaion-ops` + email on-call | Ops on-call |
| Manual production deploy approval requested | Slack `#azaion-ops` | Approvers |
> Slack channel names are placeholders — swap to actual channel IDs in Step 7 when wiring `from_secret: slack_webhook_*`. Email/Pager wiring is deferred until those secrets exist.
## 8. Image Tags
Resolves Drift A:
| Push order | Tag | Stability | Used by |
|-----------|-----|-----------|---------|
| 1 | `${CI_COMMIT_BRANCH}-${TAG_SUFFIX}` | mutable (overwritten each push to the branch) | quick dev pulls (`docker pull …:dev-arm`) |
| 2 | `${CI_COMMIT_SHA:0:12}-${TAG_SUFFIX}` | immutable | host deploy scripts; rollback target |
Production deploys MUST reference the SHA tag, never the branch tag (Step 6 procedures will enforce this).
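As an illustration of the two-tag scheme, the sketch below derives both tags from the values the CI variables would carry. The function `image_tags` and the example repository name are hypothetical; only the tag formats come from the table above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the two tags for one build: the mutable branch tag first,
# then the immutable tag built from the first 12 characters of the SHA.
image_tags() {
  local repo="$1" branch="$2" sha="$3" suffix="$4"
  printf '%s:%s-%s\n' "$repo" "$branch" "$suffix"
  printf '%s:%s-%s\n' "$repo" "${sha:0:12}" "$suffix"
}
```

A build step could loop over the output and run `docker tag` plus `docker push` once per line, so both tags always point at the same image.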
## 9. Reproducibility & Audit
- Every pushed image carries `org.opencontainers.image.revision` = full `CI_COMMIT_SHA`. The 12-char prefix in the tag is for human reading; the label is the source of truth.
- `org.opencontainers.image.created` = ISO-8601 build start time (UTC).
- `org.opencontainers.image.source` = `$CI_REPO_URL`.
- Both image scan and dependency audit reports are uploaded as Woodpecker artifacts on every run (success and failure).
## 10. Drifts Resolved Here / Carried Forward
| ID | Severity | Description | Status |
|----|----------|-------------|--------|
| A | Medium | Branch-tag-only push; host pulls `:latest` that CI never produces | **Resolved in spec** — add SHA-tag push (§8); script change in Step 7 |
| D | Low | Old docs referenced `.woodpecker/build-arm.yml` | **Resolved here** — corrected to `01-test.yml` + `02-build-push.yml` everywhere |
| E | Low | `scripts/run-performance-tests.sh` is run-on-demand only | **Spec**: `03-perf.yml` planned; manual trigger in cycle 1, automatic gate in a future cycle when threshold fluctuation is understood |
| F | Low | No vulnerable-dep gate in CI | **Resolved in spec**: `deps-audit` step in `01-test.yml`; concrete YAML in Step 7 |
| I | Low (NEW) | No coverage threshold enforced (no coverage collection wired) | **Carried forward** to a future cycle; recorded in the deploy plan, not blocking |
## 11. Self-verification
- [x] All pipeline stages defined with triggers and gates.
- [ ] Coverage threshold enforced — **deferred (Drift I)** with explicit justification.
- [x] Security scanning included (deps + image; SAST deferred to a future cycle when a SAST tool is selected).
- [x] Caching configured (NuGet + Docker layer).
- [x] Multi-environment deployment scaffold (staging → production manual).
- [x] Rollback referenced (SHA-tagged images make `docker run …:<previous-sha>-arm` a one-line rollback; details in Step 6).
- [x] Notification matrix defined.
@@ -0,0 +1,228 @@
# Azaion Admin API — Containerization
**Date**: 2026-05-13 · **Cycle**: 1 · **Status**: planning artifact (no code changes; Dockerfile updates land in Step 7).
## 1. Container Inventory
The system ships only **one runtime container** (`admin-api`); the other two containers listed below exist only for tests. The four library components are linked into the API at build time, not shipped separately.
| # | Container | Built from | Purpose | Lifetime |
|---|-----------|------------|---------|----------|
| 1 | `admin-api` | `Dockerfile` (root) | Single ASP.NET Core 10 service exposing all 17 endpoints | Long-running |
| 2 | `e2e-runner` | `e2e/Dockerfile` | Black-box test consumer used by CI and local `docker-compose.test.yml` | One-shot (run-and-exit) |
| 3 | `test-db` | `postgres:16-alpine` (no custom Dockerfile) | Isolated Postgres for tests | One-shot (per CI run) |
> `docker.test/Dockerfile` is a leftover placeholder (`FROM alpine:latest; CMD echo hello`) and is unused. **Drift G** — recommend deletion in Step 7 (scripts) cleanup.
## 2. Component → Container Mapping
| Component | Ships in container? | Notes |
|-----------|--------------------:|-------|
| 01 Data Layer | no | Class library `Azaion.Common`, linked into `admin-api` |
| 02 User Management | no | Class library `Azaion.Services` |
| 03 Auth & Security | no | Class library `Azaion.Services` |
| 04 Resource Management | no | Class library `Azaion.Services` |
| 05 Admin API | **yes** | Hosts the Minimal API process (`Azaion.AdminApi`) |
## 3. `admin-api` — Dockerfile Specification
| Property | Current value | Planned value (Step 7) | Rationale |
|----------|---------------|------------------------|-----------|
| Build base image | `mcr.microsoft.com/dotnet/sdk:10.0` (`--platform=$BUILDPLATFORM`) | unchanged | Matches restriction (.NET 10.0); cross-platform build supported |
| Runtime base image | `mcr.microsoft.com/dotnet/aspnet:10.0` | **pin by digest** in production (`@sha256:…`) | Restrictions forbid moving off `aspnet:10.0`; digest pin protects against silent base-image churn |
| Stages | `base` → `build` → `publish` → `final` | unchanged structure; non-root user added in `final` | Existing layout already follows multi-stage best practice |
| Working dir | `/app` | unchanged | Matches `start-container.sh` mounts |
| Exposed port | `8080` | unchanged | Bound by Kestrel via `ASPNETCORE_URLS=http://+:8080` |
| Container user | **root** (current) | `USER app` (UID 1654, GID 1654) | Closes security audit F-6 / AZ-518 (Drift C). Uses the non-root `app` user (UID 1654) that the `mcr.microsoft.com/dotnet/aspnet` images provide from 8.0 onward |
| Mount points needing write | `/app/Content`, `/app/logs` | `chown app:app` both directories in the `final` stage | The new non-root user must own the dirs that are bind-mounted from the host |
| Build arg | `CI_COMMIT_SHA=unknown` | unchanged; populated by Woodpecker | Already wired; surfaces as `AZAION_REVISION` env var inside the container |
| OCI labels | none on the Dockerfile (CI adds three: `revision`, `created`, `source`) | move the three labels into the Dockerfile so local builds also carry them | Single source of truth; consistent labeling regardless of build origin |
| Health check | none | `HEALTHCHECK CMD curl -fsS http://localhost:8080/health \|\| exit 1` | Wires into the `/health` endpoint planned in Step 5 (Observability). `curl` is not part of the `aspnet` base image, so the Dockerfile must install it. Until the endpoint exists, fall back to the TCP probe already used in `docker-compose.test.yml`. |
| Entrypoint | `["dotnet", "Azaion.AdminApi.dll"]` | unchanged | Smallest-possible entrypoint; PID 1 is the .NET process |
### Sketch (planning artifact — actual edits land in Step 7)
```
FROM mcr.microsoft.com/dotnet/aspnet:10.0@sha256:<pinned> AS base
WORKDIR /app
EXPOSE 8080
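# The aspnet base image already provides the non-root `app` user (UID/GID 1654), so none is created here.
# curl is not part of the base image; the HEALTHCHECK in the final stage needs it.
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/* \
    && mkdir -p /app/Content /app/logs && chown -R app:app /app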
FROM --platform=$BUILDPLATFORM mcr.microsoft.com/dotnet/sdk:10.0 AS build
ARG TARGETARCH
WORKDIR /app
COPY . .
RUN dotnet restore
WORKDIR /app/Azaion.AdminApi
RUN dotnet build "Azaion.AdminApi.csproj" -c Release -o /app/build
FROM build AS publish
RUN arch=$([ "$TARGETARCH" = "amd64" ] && echo "x64" || echo "$TARGETARCH") && \
dotnet publish "Azaion.AdminApi.csproj" -c Release -o /app/publish /p:UseAppHost=false --os linux --arch $arch
FROM base AS final
ARG CI_COMMIT_SHA=unknown
ARG BUILD_DATE=unknown
ENV AZAION_REVISION=$CI_COMMIT_SHA
LABEL org.opencontainers.image.revision="$CI_COMMIT_SHA"
LABEL org.opencontainers.image.created="$BUILD_DATE"
LABEL org.opencontainers.image.source="https://git.azaion.com/azaion/admin"
COPY --from=publish --chown=app:app /app/publish /app/
USER app
HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
CMD curl -fsS http://localhost:8080/health || exit 1
ENTRYPOINT ["dotnet", "Azaion.AdminApi.dll"]
```
## 4. `e2e-runner` — Dockerfile Specification
Existing `e2e/Dockerfile` is sufficient for cycle 1; no changes proposed.
| Property | Value | Notes |
|----------|-------|-------|
| Base image | `mcr.microsoft.com/dotnet/sdk:10.0` (build + run) | SDK is required because the runner invokes `dotnet test` |
| Stages | `build` → run | Multi-stage to discard sources from the final image |
| Working dir | `/test` | Matches `docker-compose.test.yml` |
| Output dir | `/test-results` | Bind-mounted to `./e2e/test-results` on the host |
| User | root (acceptable — short-lived, no network exposure, no persistence beyond `/test-results`) | Non-root not required for one-shot CI containers |
| Loggers | `console`, `trx`, `xunit` | Last one feeds Woodpecker's parser |
| Entrypoint | `dotnet test Azaion.E2E.dll …` | Already present |
## 5. Local Development — `docker-compose.yml`
> Currently the project does **not** ship a local-dev compose file. Local devs run the API via `dotnet run` against a host Postgres on port 4312. We add `docker-compose.yml` in Step 7 (scripts) so newcomers get a one-command bring-up.
```yaml
# docker-compose.yml — planning artifact for Step 7
services:
api:
build:
context: .
dockerfile: Dockerfile
args:
CI_COMMIT_SHA: dev
image: azaion/admin:dev-local
env_file: .env
depends_on:
db:
condition: service_healthy
ports:
- "8080:8080"
volumes:
- ./.dev/content:/app/Content
- ./.dev/logs:/app/logs
healthcheck:
test: ["CMD", "curl", "-fsS", "http://localhost:8080/health"]
interval: 15s
timeout: 5s
retries: 5
start_period: 30s
networks: [azaion-net]
db:
image: postgres:16-alpine
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: postgres
volumes:
- ./e2e/db-init/00_run_all.sh:/docker-entrypoint-initdb.d/00_run_all.sh:ro
- ./env/db:/docker-entrypoint-initdb.d/sql:ro
- dev-db:/var/lib/postgresql/data
ports:
- "4312:5432" # match local-dev convention; non-standard host port
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres -d postgres"]
interval: 5s
timeout: 5s
retries: 10
start_period: 10s
networks: [azaion-net]
volumes:
dev-db:
networks:
azaion-net:
driver: bridge
```
Notes:
- The DB schema and roles are bootstrapped from the same SQL files that the test-compose uses (`env/db/*.sql`), so `docker-compose.yml` and `docker-compose.test.yml` produce databases with an identical schema.
- `.dev/` is added to `.gitignore` and `.dockerignore` in Step 7.
- `db.ports` exposes `4312:5432` so a developer running the API outside Docker can still hit the same connection string defined in `.env`.
## 6. Blackbox Test — `docker-compose.test.yml` (existing)
The current file is already aligned with the Step 2 contract (`docker compose -f docker-compose.test.yml up --abort-on-container-exit --exit-code-from e2e-consumer`). Only one drift to log:
| Drift | Description | Resolved In |
|-------|-------------|-------------|
| Drift H | `system-under-test.healthcheck` uses a raw bash TCP probe (`exec 3<>/dev/tcp/127.0.0.1/8080`). Once `/health` exists (Step 5), switch to the curl-based probe to actually test the application layer. | Step 5 + Step 7 |
No structural change in cycle 1 — the file already brings up Postgres + SUT + e2e-runner on a private network and tears down on test exit.
## 7. Image Tagging Strategy
| Context | Tag format | Example | Notes |
|---------|------------|---------|-------|
| CI build (per push) | `$REGISTRY_HOST/$REGISTRY_IMAGE:${CI_COMMIT_BRANCH}-${TAG_SUFFIX}` | `docker.azaion.com/azaion/admin:dev-arm` | Existing convention from `.woodpecker/02-build-push.yml` |
| CI build (per push) — additional immutable tag | `$REGISTRY_HOST/$REGISTRY_IMAGE:${CI_COMMIT_SHA:0:12}-${TAG_SUFFIX}` | `docker.azaion.com/azaion/admin:a1b2c3d4e5f6-arm` | **NEW (Drift A resolution)** — gives every CI build an immutable tag the host scripts can pin |
| Production deploy | the SHA tag from above; never `latest` | `docker.azaion.com/azaion/admin:a1b2c3d4e5f6-arm` | Eliminates the host-pulls-`:latest` / CI-never-pushes-`:latest` mismatch |
| Local dev | `azaion/admin:dev-local` | — | Built by `docker-compose.yml`; never pushed |
| Multi-arch (future) | `<image>:<branch>-amd` and `<image>:<branch>-arm` (already matrix-prepared) | — | The Woodpecker matrix is wired; uncomment the `amd64` row when an amd agent is online |
> Drift A resolution depends on a CI change (Step 3) and a script change (Step 7). The tag format itself is decided here.
## 8. `.dockerignore`
Existing `.dockerignore` is sufficient; no changes proposed in cycle 1. It already excludes `bin/`, `obj/`, `.env`, `.git`, IDE folders, `Dockerfile*`, and compose files. The only addition required by the new local-dev compose is `.dev/` — added in Step 7.
```
.dev
**/.dockerignore
**/.env
**/.git
**/.gitignore
**/.project
**/.settings
**/.toolstarget
**/.vs
**/.vscode
**/.idea
**/*.*proj.user
**/*.dbmdl
**/*.jfm
**/azds.yaml
**/bin
**/charts
**/docker-compose*
**/Dockerfile*
**/node_modules
**/npm-debug.log
**/obj
**/secrets.dev.yaml
**/values.dev.yaml
LICENSE
README.md
```
## 9. Self-verification
- [x] Every component has a Dockerfile specification (only Admin API ships; libraries explicitly excluded with rationale).
- [x] Multi-stage builds specified for every production image.
- [x] Non-root user planned for `admin-api` (Drift C closed in spec; code change in Step 7).
- [x] Health check defined for every long-running service (real `/health` planned in Step 5; TCP fallback documented for the interim).
- [x] `docker-compose.yml` covers all components + Postgres dependency.
- [x] `docker-compose.test.yml` already enables black-box testing; one observation logged (Drift H).
- [x] `.dockerignore` defined and reviewed (one addition planned: `.dev/`).
## 10. Drifts Logged Here (carried forward)
| ID | Severity | Description | Resolved In |
|----|----------|-------------|-------------|
| C | Medium | `Dockerfile` final stage runs as root → add `USER app` (UID 1654) | Step 7 |
| G | Low | Unused `docker.test/Dockerfile` placeholder | Step 7 (delete) |
| H | Low | `docker-compose.test.yml` health check is TCP-only; upgrade to `/health` once available | Step 5 + Step 7 |
@@ -0,0 +1,196 @@
# Azaion Admin API — Deployment Scripts
**Date**: 2026-05-13 · **Cycle**: 1 · **Status**: shipped (this is the only doc that matches concrete files in `scripts/` and `secrets/`).
## 1. Overview
| Script | Purpose | Location |
|--------|---------|----------|
| `deploy.sh` | Main orchestrator (pull → stop → start → health) | `scripts/deploy.sh` |
| `pull-images.sh` | `docker login` + `docker pull` the target image | `scripts/pull-images.sh` |
| `stop-services.sh` | Graceful stop + record rollback target | `scripts/stop-services.sh` |
| `start-services.sh` | `docker run` with the materialized env file and bind mounts | `scripts/start-services.sh` |
| `health-check.sh` | Poll `/health/ready` until 200 or timeout | `scripts/health-check.sh` |
| `smoke.sh` | 6 critical-path checks against the **public** URL | `scripts/smoke.sh` |
| `_lib.sh` | Shared logging + env-overlay helpers | `scripts/_lib.sh` (sourced, not executed) |
| `run-tests.sh` | Existing — runs the docker-compose test suite locally | `scripts/run-tests.sh` |
| `run-performance-tests.sh` | Existing — runs k6 against the test compose stack | `scripts/run-performance-tests.sh` |
## 2. Prerequisites
On the **deploy host**:
| Requirement | Why |
|-------------|-----|
| Docker 24+ | `docker pull`, `docker run`, `--restart unless-stopped` |
| `sops` (≥ 3.8) | Decrypt `secrets/<env>.env` |
| `age` (≥ 1.1) | Backing crypto for sops |
| `curl` | Used by `health-check.sh` and `smoke.sh` |
| `jq` | Used by `smoke.sh` for JSON parsing |
| `/etc/azaion/age.key` (mode 0400) | Per-host age private key (see `secrets/README.md`) |
On the **operator's machine** (only for `smoke.sh`):
| Requirement | Why |
|-------------|-----|
| `curl`, `jq` | Same as host |
| Network access to the public URL | `BASE_URL` is the production / staging hostname |
## 3. Environment Variables
The `load_env_overlay <env>` function in `scripts/_lib.sh` resolves variables in this order (later sources override earlier ones):
1. `<repo>/.env` (if present — local-dev convenience; harmless on a prod host that has no `.env`)
2. `secrets/<env>.public.env` (committed plain text; loaded with `set -a`)
3. `secrets/<env>.env` (sops-decrypted to a tempfile, sourced, tempfile deleted on exit)
4. The shell environment that invoked `deploy.sh` (operator overrides)
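One possible way to get this precedence is sketched below. It covers only the layering, not the sops decryption, and it is not taken from `_lib.sh`: the function name `load_overlay_files` and the snapshot-and-reapply approach are assumptions made for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sketch: source each existing file in order (later files win),
# then re-apply the variables the caller had already exported, so operator
# overrides keep the highest precedence.
load_overlay_files() {
  local -a _names=() _values=()
  local _name _i _file

  # Snapshot every variable exported by the invoking shell.
  while IFS= read -r _name; do
    _names+=("$_name")
    _values+=("${!_name-}")
  done < <(compgen -e)

  set -a                      # export whatever the files assign
  for _file in "$@"; do
    if [ -f "$_file" ]; then
      . "$_file"
    fi
  done
  set +a

  # The invoking shell wins over every file.
  for _i in "${!_names[@]}"; do
    export "${_names[$_i]}=${_values[$_i]}"
  done
}
```

For example, `load_overlay_files .env secrets/staging.public.env "$decrypted_tmp"` would apply the first three layers, where `$decrypted_tmp` stands for the tempfile holding the sops output.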
The complete variable inventory is `.env.example` at the repo root. Variables specifically consumed by these scripts:
| Variable | Required by | Source | Notes |
|----------|-------------|--------|-------|
| `ENV` | `deploy.sh` | operator shell | `staging` or `production` |
| `REGISTRY_HOST`, `REGISTRY_IMAGE`, `REGISTRY_TAG` | pull / start | public env / operator | tag is the `<sha12>-<arch>` immutable tag from `.woodpecker/02-build-push.yml` |
| `REGISTRY_USER`, `REGISTRY_TOKEN` | pull | encrypted env | optional; if both missing, assumes `docker login` was done out-of-band |
| `DEPLOY_CONTAINER_NAME`, `DEPLOY_HOST_PORT`, `DEPLOY_HOST_CONTENT_DIR`, `DEPLOY_HOST_LOGS_DIR` | stop / start | public env | identical for staging and prod by default |
| `ASPNETCORE_ConnectionStrings__AzaionDb`, `__AzaionDbAdmin`, `JwtConfig__Secret` | start | encrypted env | the API checks these on boot and fails fast |
| `ASPNETCORE_ResourcesConfig__*`, `JwtConfig__{Issuer,Audience,Lifetime}` | start | public env (defaults from `appsettings.json`) | only override if the env value differs from the appsettings default |
| `SOPS_AGE_KEY_FILE` | `_lib.sh` | host | defaults to `/etc/azaion/age.key` if unset |
| `SMOKE_ADMIN_EMAIL`, `SMOKE_ADMIN_PASSWORD` | `smoke.sh` | operator shell | dedicated smoke-test admin user; rotate as a regular admin password |
## 4. Script details
### `deploy.sh`
**Usage**:
```bash
ENV=staging ./scripts/deploy.sh <sha-tag>
ENV=production ./scripts/deploy.sh <sha-tag>
ENV=staging ./scripts/deploy.sh --rollback # uses scripts/.previous_tags.env
./scripts/deploy.sh --help
```
**Flow** (matches `_docs/04_deploy/deployment_procedures.md` §3 / §4):
1. Validate `ENV` and required commands.
2. Load env overlay (public + sops-decrypted).
3. If `--rollback`: read `scripts/.previous_tags.env` → set `SHA_TAG` to `PREVIOUS_SHA_TAG`.
4. `pull-images.sh` (login + pull).
5. `stop-services.sh` (records the SHA of whatever was running; graceful stop with `docker stop -t 40`; remove).
6. `start-services.sh` (`docker run --restart unless-stopped --env-file <materialized> --publish $DEPLOY_HOST_PORT:8080`).
7. `health-check.sh` (poll `/health/ready` with timeout).
8. Print success line with the running revision.
**Failure handling**: any non-zero exit from a sub-script aborts `deploy.sh` (because `set -euo pipefail` propagates). The previously-recorded SHA in `.previous_tags.env` is unchanged, so `--rollback` after a failed deploy targets the version that was running BEFORE the failed attempt.
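The tag-selection part of the flow (steps 1 and 3) might look like the sketch below. It is an assumption-laden illustration, not the shipped `deploy.sh`: the function names are hypothetical, the arch suffixes `arm` and `amd` are taken from the tag examples, and whether the real script validates the tag format is not stated here.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical check: accept only immutable tags of the form <sha12>-<arch>.
is_sha_tag() {
  [[ "$1" =~ ^[0-9a-f]{12}-(arm|amd)$ ]]
}

# Resolve the tag to deploy: either the tag given on the command line, or the
# recorded PREVIOUS_SHA_TAG when the first argument is --rollback.
resolve_target_tag() {
  local arg="${1:-}" previous_file="${2:-scripts/.previous_tags.env}"
  local tag
  if [ "$arg" = "--rollback" ]; then
    [ -f "$previous_file" ] || { echo "no rollback target recorded" >&2; return 1; }
    # Parse the file instead of sourcing it, so it cannot run code.
    tag="$(sed -n 's/^PREVIOUS_SHA_TAG=//p' "$previous_file")"
  else
    tag="$arg"
  fi
  is_sha_tag "$tag" || { echo "not an immutable <sha12>-<arch> tag: '$tag'" >&2; return 1; }
  printf '%s\n' "$tag"
}
```

Rejecting branch tags such as `dev-arm` at this point is one way to keep deploys pinned to immutable images.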
### `pull-images.sh`
- `docker login` only when both `REGISTRY_USER` and `REGISTRY_TOKEN` are set; otherwise warns and continues (assumes pre-auth).
- `docker pull $REGISTRY_HOST/$REGISTRY_IMAGE:$REGISTRY_TAG`.
- Logs the resolved `RepoDigests[0]` to give the operator an immutable identifier in the deploy log.
### `stop-services.sh`
- Reads `org.opencontainers.image.revision` from the running container (label set by the Dockerfile).
- Writes `scripts/.previous_tags.env`:
```
PREVIOUS_SHA_TAG=<sha12>-<arch>
PREVIOUS_REVISION=<full sha>
RECORDED_AT=<ISO 8601>
```
- `docker stop -t 40` then `docker rm -f`.
- If the container does not exist, logs and exits 0 (idempotent — first deploy on a new host should succeed).
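Recording the rollback target could be sketched as follows. The function name `record_previous_tag` and the write-to-temp-then-rename step are assumptions for illustration; only the three keys and the file name come from the description above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: write the rollback record described above.
# Arguments: target file, <sha12>-<arch> tag, full commit SHA.
record_previous_tag() {
  local file="$1" sha_tag="$2" revision="$3"
  local tmp
  tmp="$(mktemp "${file}.XXXXXX")"
  {
    printf 'PREVIOUS_SHA_TAG=%s\n' "$sha_tag"
    printf 'PREVIOUS_REVISION=%s\n' "$revision"
    printf 'RECORDED_AT=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  } > "$tmp"
  # Rename within the same directory so a crash never leaves a half-written record.
  mv "$tmp" "$file"
}
```

The revision argument would come from the `org.opencontainers.image.revision` label of the running container, read before `docker stop`.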
### `start-services.sh`
- Materializes a runtime env file by filtering the current shell environment with `grep -E '^(ASPNETCORE_|AZAION_)'`. Registry credentials and deploy-host plumbing variables stay on the host and never enter the container.
- `mkdir -p` for the bind-mounted `Content/` and `logs/` dirs (idempotent).
- Runs `docker run` with `--detach`, `--name`, `--restart unless-stopped`, `--env-file`, `--publish`, and `--volume`.
- Logs the container ID and the running revision.
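The env-file materialization can be sketched as below. This is a minimal illustration under stated assumptions: `materialize_env_file` is a hypothetical name, and the sketch assumes no variable value contains a newline, since `docker run --env-file` reads one `NAME=value` pair per line.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: write only the container-facing variables to a file
# that is readable by the current user alone.
materialize_env_file() {
  local out="$1"
  (
    umask 077
    # grep exits 1 when nothing matches; an empty env file is acceptable here.
    env | grep -E '^(ASPNETCORE_|AZAION_)' > "$out" || true
  )
}
```

Everything else in the shell, including `REGISTRY_TOKEN` and the `DEPLOY_*` variables, is left out of the file and therefore out of the container.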
### `health-check.sh`
- One-shot check on `/health/live` first (3 s timeout). If this fails the container is wedged — fail fast.
- Polls `/health/ready` every `HEALTH_INTERVAL` (default 2 s) until 200 or `HEALTH_TIMEOUT` (default 60 s).
- Returns 0 on first 200; non-zero on timeout.
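The polling behavior can be sketched with the probe passed in as a command, which keeps the loop testable without a running container. The names `wait_until_ready` and `probe_http` are hypothetical and the real `health-check.sh` may be structured differently; timeout and interval are whole seconds in this sketch.

```bash
#!/usr/bin/env bash
set -euo pipefail

# wait_until_ready <timeout_s> <interval_s> <probe command...>
# Runs the probe until it succeeds or the timeout elapses.
wait_until_ready() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + timeout ))
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$SECONDS" -ge "$deadline" ]; then
      echo "not ready after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# HTTP probe: curl -f exits non-zero on connection errors and on HTTP
# status codes of 400 and above.
probe_http() {
  curl -fsS --max-time 3 -o /dev/null "$1"
}
```

With the documented defaults this would be called as `wait_until_ready 60 2 probe_http "http://localhost:$DEPLOY_HOST_PORT/health/ready"`, assuming the script probes the published port on localhost.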
### `smoke.sh`
Six checks, each ≤ 10 s, against the public `BASE_URL`:
1. `GET /health/live` (200)
2. `GET /health/ready` (200, best-effort — public URL may legitimately not expose this)
3. `POST /login` — extract JWT
4. `GET /users/current` (Bearer auth)
5. `GET /users` — count rows
6. `GET /resources/list` — sanity that filesystem-backed paths are reachable
Smoke is intentionally lightweight; it does NOT exercise CRUD or detection-class endpoints (those are covered by E2E in CI).
### `_lib.sh`
Shared sourced library. Sourced via `. "$SCRIPT_DIR/_lib.sh"` from every script. NOT executable (lives at `scripts/_lib.sh` mode 0644). Contains:
- `log_info` / `log_warn` / `log_error` / `die`
- `require_env <var…>` / `require_cmd <cmd…>`
- `load_env_overlay <env>` (the sops + age decryption pipeline)
- `container_exists`, `container_running`, `current_image_revision`
## 5. Examples
### First-ever staging deploy
```bash
# On the staging host, as deploy operator:
cd /opt/azaion/admin # or wherever the repo is checked out
ENV=staging ./scripts/deploy.sh a1b2c3d4e5f6-arm
```
### Rolling back production after a bad deploy
```bash
# Same host, immediately after the failed deploy:
ENV=production ./scripts/deploy.sh --rollback
```
### Running smoke from the operator workstation
```bash
export BASE_URL=https://stage.admin.azaion.com
export SMOKE_ADMIN_EMAIL=ops-smoke@azaion.com
export SMOKE_ADMIN_PASSWORD=... # from the operator's password manager
./scripts/smoke.sh
```
### Local development against the dockerized stack
The dev-time compose was deferred (Drift K-adjacent). Until it lands, run the API directly:
```bash
# Postgres on host port 4312 (per env/db/00_install.sh)
dotnet run --project Azaion.AdminApi
```
## 6. Common script properties
All scripts:
- Use `#!/usr/bin/env bash` with `set -euo pipefail`.
- Support `--help` / `-h` for usage.
- Source `_lib.sh` for logging and env-overlay helpers.
- Are idempotent where possible (running `deploy.sh` twice with the same SHA tag is a no-op for `pull-images.sh`, recreates the container in `stop`/`start`, and re-checks health).
- Echo to stderr for log lines (so stdout from a sub-process can still be piped).
## 7. What is NOT shipped in cycle 1
- Remote SSH wrapper. The deploy procedure assumes the operator runs the script on the target host. A `--remote $DEPLOY_HOST` mode is recorded as **Drift O** (carried forward).
- Slack notifications from inside the scripts. Notifications happen out-of-band per `_docs/04_deploy/observability.md` §5.
- Database migration step. Migrations are applied manually with `psql` per `_docs/04_deploy/environment_strategy.md` §4 (Drift J).
## 8. Related artifacts
- Postmortem template: `_docs/06_metrics/postmortem_template.md`
- Procedures: `_docs/04_deploy/deployment_procedures.md`
- Environment strategy: `_docs/04_deploy/environment_strategy.md`
- secrets/ folder onboarding: `secrets/README.md`
@@ -0,0 +1,195 @@
# Azaion Admin API — Deployment Procedures
**Date**: 2026-05-13 · **Cycle**: 1 · **Status**: planning artifact (the executable scripts referenced here land in Step 7).
## 1. Deployment Strategy
**Pattern**: **stop-and-start with pre-pulled image** (single-container, single-host).
**Rationale**:
- Topology is one Docker host per environment running one `azaion.api` container behind Nginx. There is no orchestrator, no replica set, no load balancer beyond Nginx itself.
- Blue-green requires either two listening ports + Nginx switch, or two hosts. Cycle-1 budget does not include either. Recorded as **Drift N** for a future cycle.
- Rolling/canary is meaningless with one replica.
- The realistic SLO for cycle 1 is **brief (< 30 s) downtime per deploy**, mitigated by deploying in low-traffic windows. The procedure pre-pulls the image so the actual stop-start gap is the time it takes for the new container to clear `/health/ready`, not image-download time.
**Zero-downtime in production**: not achieved in cycle 1. Documented and acknowledged.
### Graceful Shutdown
| Signal | Behavior |
|--------|----------|
| `SIGTERM` (`docker stop`) | ASP.NET Core stops accepting new requests, waits up to `HostOptions.ShutdownTimeout` for in-flight requests, then exits. |
| `ShutdownTimeout` | Set to **30 seconds** in `Program.cs` (`services.Configure<HostOptions>(o => o.ShutdownTimeout = TimeSpan.FromSeconds(30))`). |
| `docker stop` grace | Use `docker stop -t 40` so Docker waits 40 s before sending `SIGKILL`, leaving 10 s of headroom over the app's 30 s. |
This wiring lands in Step 7 (Dockerfile + small `Program.cs` change).
### Database Migration Ordering
Conventions inherited from the Environment Strategy (§4 of `environment_strategy.md`):
1. Apply the new `env/db/NN_*.sql` file **before** deploying the matching code. Because every migration is backward-compatible, the old container keeps working against the new schema.
2. After the deploy is healthy, optionally apply a follow-up `NN+1_*.sql` for cleanup (e.g., dropping a tombstone column once no code reads it).
3. Production migrations run on staging first and soak ≥ 24 h before promotion.
4. Migration is performed by the operator with `psql -h <host> -p 4312 -U azaion_superadmin -d azaion -f env/db/NN_xxx.sql`. Logged in the deploy ticket.
## 2. Health Checks
These endpoints are introduced in Step 7 (anonymous, internal-only — see Observability §3.1 / §7).
| Check | Type | Endpoint | Interval | Failure threshold | Action |
|-------|------|----------|----------|-------------------|--------|
| Docker liveness | HTTP GET (in-container, via `Dockerfile` `HEALTHCHECK`) | `/health/live` | 30 s | 3 consecutive | Docker marks container `unhealthy`; **does NOT auto-restart** in cycle 1 (`start-container.sh` sets no `--restart=on-failure` policy, and standalone Docker restart policies react to container exit rather than to health status anyway) |
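| Nginx readiness | HTTP GET (upstream `health_check`; this active-check directive ships only with NGINX Plus, so open-source Nginx would need passive `max_fails` / `fail_timeout` checks instead) | `/health/ready` | 5 s | 3 consecutive | Nginx pulls upstream → 503 to clients (no silent traffic loss) |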
| Deploy-script startup | HTTP GET (polling) | `/health/ready` | 2 s | up to 30 attempts (~60 s) | `scripts/deploy.sh` aborts and triggers rollback |
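The "Deploy-script startup" row can be sketched as a small polling loop. This is a minimal sketch, not the contents of `scripts/deploy.sh` (which lands in Step 7): the function names `probe_ready` and `wait_ready`, the injectable probe argument, and the 2 s per-request `curl` timeout are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the probe-injection argument are hypothetical.

# probe_ready URL -> exit 0 when the endpoint answers with a 2xx status.
probe_ready() {
  curl --silent --fail --max-time 2 --output /dev/null "$1"
}

# wait_ready URL [ATTEMPTS] [INTERVAL_SECONDS] [PROBE_COMMAND]
# Defaults follow the table: 30 attempts, 2 s apart (~60 s in total).
# Returns 0 on the first successful probe, 1 once all attempts are used up.
wait_ready() {
  local url="$1" attempts="${2:-30}" interval="${3:-2}" probe="${4:-probe_ready}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$probe" "$url"; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
  echo "not ready after ${attempts} attempts" >&2
  return 1
}
```

A caller would treat a non-zero return as the signal to abort and roll back, for example `wait_ready "http://localhost:8080/health/ready" || exit 1` (the port here is a placeholder).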
### Health Check Response Contract
| Endpoint | 200 condition | 5xx condition | Headers |
|----------|---------------|---------------|---------|
| `/health/live` | Process is responsive (always — short-circuits before any dependency call) | Never returns 5xx unless the process is wedged | `Cache-Control: no-store` |
| `/health/ready` | `SELECT 1` succeeds against both `AzaionDb` (reader) and `AzaionDbAdmin` (writer) within a 2 s timeout | Either DB query fails or times out → 503 | `Cache-Control: no-store` |
`/health/ready` does NOT exercise the filesystem (`Content/`, `logs/`) — a transient `EACCES` there should not yank the upstream. It surfaces in metrics (`resource_upload_failures_total`) and alerts (Observability §5) instead.
## 3. Staging Deployment
Triggered manually by the operator from the staging host or from a Woodpecker manual workflow.
```
1. Pre-flight — operator on local machine
a. Confirm CI green for the target SHA on the `stage` branch.
b. Run `dotnet list package --vulnerable` against the target commit (CI does this too — local is a sanity check).
c. Confirm any DB migration in env/db/ for this SHA has been reviewed.
2. DB migration (if any) — operator SSH to staging host
psql -h localhost -p 4312 -U azaion_superadmin -d azaion -f env/db/NN_<desc>.sql
3. Deploy — operator runs scripts/deploy.sh on staging host
ENV=staging ./scripts/deploy.sh <sha-tag>
# script: docker pull → stop -t 40 → rm → run --env-file .env → poll /health/ready
4. Verify — automatic in scripts/deploy.sh
- /health/ready returns 200 within 60 s
- Container `docker inspect` healthcheck status is `healthy`
- `docker logs --tail=80` contains no `Error` lines from the last 60 s
5. Smoke tests — operator runs from local machine
BASE_URL=https://stage.admin.azaion.com ./scripts/smoke.sh
# 6 critical-path checks: /login (admin), GET /users (paginates), GET /classes,
# GET /resources/list, /health/ready, JWT lifecycle.
6. Soak — observe dashboard for ≥ 24 h before promoting
```
If any step fails → §5 Rollback.
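The stop-and-start sequence from step 3 (pull, stop with a 40 s grace period, remove, run) might look like the sketch below. It is an illustration under stated assumptions, not the Step 7 script: the `DOCKER` and `CONTAINER` variables and the `deploy_container` function are hypothetical, and port publishing, volumes, and the readiness poll are left out for brevity.

```shell
#!/usr/bin/env bash
# Sketch only: DOCKER, CONTAINER, and deploy_container are hypothetical names.
DOCKER="${DOCKER:-docker}"          # override with DOCKER=echo for a dry run
CONTAINER="${CONTAINER:-azaion.api}"

# deploy_container IMAGE ENV_FILE
deploy_container() {
  local image="$1" env_file="$2"
  # Pre-pull first so the stop/start gap excludes image-download time.
  "$DOCKER" pull "$image" || return 1
  # 40 s grace: the app's 30 s ShutdownTimeout plus 10 s headroom before SIGKILL.
  # Both commands may fail on a first deploy when no container exists yet.
  "$DOCKER" stop -t 40 "$CONTAINER" || true
  "$DOCKER" rm "$CONTAINER" || true
  # Port publishing and volume mounts are omitted from this sketch.
  "$DOCKER" run -d --name "$CONTAINER" --env-file "$env_file" "$image"
}
```

Setting `DOCKER=echo` prints the four commands instead of executing them, which is a cheap way to review the sequence before touching a host.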
## 4. Production Deployment
```
1. Approval — required: ops lead OR backend lead
- Reference the staging soak completion timestamp.
- Reference the cycle's deploy ticket (AZ-NNN) and CI run URL.
2. Pre-deploy checks (operator on local machine)
[ ] Staging smoke tests passed (§3 step 5).
[ ] Staging soaked ≥ 24 h with no Critical/High alerts.
[ ] CI green for the same SHA on the `main` branch.
[ ] Image-scan report for the SHA shows zero High/Critical (Woodpecker artifact).
[ ] DB migration plan recorded in the deploy ticket.
[ ] Rollback target SHA is recorded (the SHA currently running in prod — `docker inspect azaion.api | jq -r '.[0].Config.Labels."org.opencontainers.image.revision"'`).
[ ] On-call engineer is reachable for the next 30 min.
3. DB migration (if any) — operator SSH to prod host
psql -h localhost -p 4312 -U azaion_superadmin -d azaion -f env/db/NN_<desc>.sql
4. Deploy — operator runs scripts/deploy.sh on prod host
ENV=production ./scripts/deploy.sh <sha-tag>
5. Verify — automatic + operator
- /health/ready returns 200 within 60 s.
- Container `docker inspect` healthcheck status `healthy`.
- Operator hits `/login` with admin creds and a known user list query.
6. Monitor — operator observes dashboards for ≥ 15 minutes
- Error rate (5xx) stays < 1%.
- P95 latency stays within 2× cycle-1 baseline (i.e. ≤ 66 ms for /login, ≤ 305 ms for /users).
- No Critical or High alerts fire.
7. Finalize
- Update deploy ticket with start/stop timestamps and image SHA.
- Post `:white_check_mark: prod deploy: <sha-tag>` to Slack #azaion-ops.
```
## 5. Rollback Procedures
### Trigger Criteria (any one)
- `/health/ready` fails for ≥ 60 s after deploy.
- Error rate (5xx) > 5 % for 5 minutes within the 15-minute observation window.
- Any Critical alert fires within 15 minutes of deploy.
- Operator's manual call (e.g. business-impacting bug surfaced by smoke tests).
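The error-rate criterion is a simple ratio check. As a rough sketch, assuming the operator already has the 5xx count and the total request count for the 5-minute window (for example from the dashboard), the comparison can be done with integer arithmetic; the function name `error_rate_exceeds` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: error_rate_exceeds is a hypothetical helper.

# error_rate_exceeds ERRORS TOTAL PERCENT
# Exit 0 when ERRORS / TOTAL is strictly greater than PERCENT %.
error_rate_exceeds() {
  local errors="$1" total="$2" pct="$3"
  # No traffic means no measurable error rate.
  [ "$total" -gt 0 ] || return 1
  # Cross-multiply to avoid floating point: errors/total > pct/100.
  [ $((errors * 100)) -gt $((total * pct)) ]
}
```

For instance, `error_rate_exceeds 6 100 5` succeeds (6 % is above the 5 % trigger), while `error_rate_exceeds 5 100 5` does not.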
### Rollback Steps (≤ 5 minutes)
```
1. Capture state — operator on the affected host
docker logs azaion.api --tail=500 > /var/log/azaion/rollback-$(date -u +%Y%m%dT%H%M%SZ).log
docker inspect azaion.api > /var/log/azaion/rollback-$(date -u +%Y%m%dT%H%M%SZ).inspect.json
2. Re-deploy previous SHA — operator
ENV=production ./scripts/deploy.sh <previous-sha-tag>
# The SHA tag was recorded in step 2 of the deploy procedure.
3. DB rollback (if a migration was applied this deploy)
- If reversible (drop column, drop index): run the agreed reverse SQL recorded in the deploy ticket.
- If irreversible (added column, table): leave the schema as-is — the previous code is backward-compatible (§1 Database Migration Ordering, rule 1) so the extra schema is inert.
- If data was migrated destructively: STOP, escalate to backend lead. Restore from backup if necessary.
4. Verify — same checks as §4 step 5 (Verify)
5. Notify — operator posts ":rotating_light: prod rollback: <previous-sha>" to Slack #azaion-ops with the deploy ticket link.
6. Post-mortem — schedule within 24 hours; required artifact: timeline + root cause + prevention.
```
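Step 1 calls `date` twice, so the two capture files can end up with timestamps one second apart. A possible refinement, sketched under assumptions, computes the stamp once and reuses it. The `capture_state` function and the `DOCKER` variable are hypothetical names, and unlike the commands above this sketch also captures the container's stderr stream.

```shell
#!/usr/bin/env bash
# Sketch only: capture_state and DOCKER are hypothetical names.
DOCKER="${DOCKER:-docker}"

# capture_state CONTAINER OUT_DIR -> writes both files, prints the shared stamp.
capture_state() {
  local container="$1" out_dir="$2" stamp
  # One timestamp for both artifacts so they are easy to pair up later.
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  # docker logs replays container stderr on stderr, so merge both streams.
  "$DOCKER" logs --tail=500 "$container" > "$out_dir/rollback-$stamp.log" 2>&1
  "$DOCKER" inspect "$container" > "$out_dir/rollback-$stamp.inspect.json"
  echo "$stamp"
}
```

An operator could then run `capture_state azaion.api /var/log/azaion` and paste the printed stamp into the deploy ticket.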
### Post-Mortem (required)
Template lives in `_docs/06_metrics/postmortem_template.md` (added in Step 7). Required sections:
- Timeline (UTC), with deploy SHA and rollback SHA.
- Root cause (one sentence + evidence link).
- Detection — how was it caught? Which alert? Which probe? Which user report?
- Repair — what fixed it?
- Prevention — concrete change (test, alert, procedure step) with an owner and a target date.
## 6. Deployment Checklist (per release)
Copy this into the deploy ticket; tick before flipping `prod`:
```
[ ] CI green on target SHA (01-test + 02-build-push, all matrix entries)
[ ] Image scan report: zero High/Critical CVEs (Woodpecker artifact)
[ ] Dependency audit (`dotnet list package --vulnerable`): zero High/Critical
[ ] Image SHA tag exists in registry: docker manifest inspect $REGISTRY_HOST/azaion/admin:<sha-tag>-arm
[ ] DB migration (if any) reviewed by backend lead; rollback SQL recorded if reversible
[ ] secrets/staging.env / secrets/production.env decrypts cleanly on the target host
[ ] Health endpoints respond 200 in current production (sanity baseline)
[ ] Monitoring alerts armed (no silenced alerts that would mask the deploy)
[ ] Rollback target SHA recorded
[ ] Stakeholders notified (Slack #azaion-ops, expected window)
[ ] On-call engineer reachable for the next 30 min
```
## 7. Drifts Logged Here
| ID | Severity | Description | Carried Forward |
|----|----------|-------------|-----------------|
| N (NEW) | Medium | No zero-downtime deploy strategy — single-container topology produces a brief (< 30 s) gap per deploy (§1) | Future cycle: blue-green via dual ports + Nginx upstream switch |
## 8. Self-verification
- [x] Deployment strategy chosen (stop-and-start) with explicit rationale and acknowledgement that zero-downtime is deferred (Drift N).
- [x] Graceful-shutdown contract specified (`HostOptions.ShutdownTimeout` 30 s, `docker stop -t 40`).
- [x] Health checks defined (liveness, readiness, startup) with exact response contract and Cache-Control header.
- [x] Rollback trigger criteria + 6-step rollback procedure + post-mortem template requirement.
- [x] Deployment checklist complete (11 items) and explicitly references the SHA tag (Drift A resolution from Step 3).
+127
View File
@@ -0,0 +1,127 @@
# Azaion Admin API — Environment Strategy
**Date**: 2026-05-13 · **Cycle**: 1 · **Status**: planning artifact (no scripts; concrete wiring lands in Step 7).
## 1. Environments
| Environment | Purpose | Infrastructure | Data Source |
|-------------|---------|----------------|-------------|
| **Development** | Local developer workflow on macOS / Linux. | Either bare `dotnet run` against host Postgres (port 4312) **or** the new `docker-compose.yml` planned in Step 2 (API + Postgres on a private Docker network). | Empty database; SQL files under `env/db/` create roles + schema; no fixtures. |
| **Test (CI)** | Black-box tests in CI and locally via `scripts/run-tests.sh`. | `docker-compose.test.yml` — API + Postgres + e2e-runner on a Docker network. | Functional fixtures from `e2e/db-init/00_run_all.sh` + `99_test_seed.sql`. |
| **Staging** | Pre-production validation. | Self-hosted Linux server, single Docker host, behind Nginx reverse proxy on `stage.admin.azaion.com`. Mirrors prod topology and Postgres major version. | Anonymized snapshot of production (PII scrubbed by an offline script before import). |
| **Production** | Live system. | Self-hosted Linux server, single Docker host, behind Nginx reverse proxy on `admin.azaion.com`. | Live data; daily off-host backups. |
> Test is added as a first-class environment because cycle 1 already exercises it (`docker-compose.test.yml`). The deploy template lists three; we list four to match reality.
## 2. Environment Variables
### Source of Truth
The complete variable inventory lives in `.env.example` at the repo root (Step 1, 24 entries). This document does NOT duplicate that table — it only specifies, per environment, **where each variable is sourced**.
### Per-environment sourcing
| Variable group | Development | Test (CI) | Staging | Production |
|----------------|-------------|-----------|---------|------------|
| `ASPNETCORE_ENVIRONMENT` | `.env` (`Development`) | docker-compose `environment:` (`Development`) | docker-compose / `--env-file` (`Staging`) | docker-compose / `--env-file` (`Production`) |
| `ASPNETCORE_URLS` | `.env` | compose | host `.env` (rendered from sops) | host `.env` (rendered from sops) |
| `ConnectionStrings__*` | `.env` (real local creds) | compose (literal — accepted F-10) | **sops-encrypted file in git** → decrypted on host at deploy time | same as staging |
| `JwtConfig__Secret` | `.env` (dev-only literal) | compose (literal — accepted F-10) | **sops-encrypted** | **sops-encrypted** |
| `JwtConfig__{Issuer,Audience,Lifetime}` | appsettings defaults | appsettings defaults | host `.env` if non-default | host `.env` if non-default |
| `ResourcesConfig__*` | appsettings defaults | compose | host `.env` if non-default | host `.env` if non-default |
| `DEPLOY_*`, `REGISTRY_TAG` | `.env` (developer machine) | n/a | passed to `scripts/deploy.sh` from operator's shell or CI manual trigger | same |
| `REGISTRY_USER`, `REGISTRY_TOKEN` | empty in dev `.env` | Woodpecker secrets `registry_user` / `registry_token` | Woodpecker secrets (CI deploy) or operator's shell (manual deploy) | same |
| `CI_COMMIT_SHA` | unset → image label `unknown` | Woodpecker built-in | Woodpecker built-in | Woodpecker built-in |
### Variable Validation (fail-fast)
The Admin API already does this for the most security-critical variable:
```csharp
var jwtConfig = builder.Configuration.GetSection(nameof(JwtConfig)).Get<JwtConfig>();
if (jwtConfig == null || string.IsNullOrEmpty(jwtConfig.Secret))
throw new Exception("Missing configuration section: JwtConfig");
```
The deploy plan **adds** the same fail-fast check for connection strings during Step 7 wiring (a one-time `_ = configuration.GetConnectionString("AzaionDb") ?? throw …` plus the same for `AzaionDbAdmin`, executed during `WebApplication` build). Without the check, a missing variable currently surfaces only on the first DB call, which is too late.
> Static / lookup-style variables (`ResourcesConfig__*`, `JwtConfig__{Issuer,Audience,Lifetime}`) keep their `appsettings.json` defaults in every environment unless an override is required. We do NOT add fail-fast checks for them.
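The in-app check can be complemented by a deploy-side pre-flight on the rendered env file, so a missing value is caught before the old container is stopped. The sketch below is an assumption, not part of the plan: `require_env_keys` is a hypothetical helper, and the key names are derived from the `ConnectionStrings__*` and `JwtConfig__Secret` variables named in this document.

```shell
#!/usr/bin/env bash
# Sketch only: require_env_keys is a hypothetical helper.

# require_env_keys FILE KEY...
# Exit 0 only when every KEY appears in FILE with a non-empty value.
require_env_keys() {
  local file="$1" key missing=0
  shift
  for key in "$@"; do
    # "KEY=" followed by at least one character counts as present.
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing or empty: ${key}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A deploy script might call it as `require_env_keys "$env_file" ConnectionStrings__AzaionDb ConnectionStrings__AzaionDbAdmin JwtConfig__Secret || exit 1`.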
## 3. Secrets Management
### Decision
| Environment | Method | Tool |
|-------------|--------|------|
| Development | `.env` file | committed `.env.example` + per-developer `.env` (git-ignored) |
| Test (CI) | docker-compose `environment:` literals | accepted as test-only (security audit F-10) |
| Staging | git-tracked encrypted file | **sops + age** |
| Production | git-tracked encrypted file | **sops + age** |
### Why sops + age (not Vault, not Woodpecker secrets, not hand-edited `.env`)
Constraints: self-hosted, no cloud account, single ops engineer, currently hand-editing `.env` on the host.
| Option | Pros | Cons | Verdict |
|--------|------|------|---------|
| sops + age (chosen) | Secrets versioned in git, encrypted at rest, decrypted on the host with a single age key. No new infra. Works offline. | Requires per-environment age keypair stored on the host outside git. Manual key rotation. | ✅ pragmatic for this team size and topology |
| HashiCorp Vault (self-hosted) | Dynamic DB creds, audit log, fine-grained ACL, KV v2. | Adds a service to operate, monitor, back up. Single-engineer ops budget cannot absorb it now. | ⏳ revisit in a future cycle when ops capacity grows |
| Woodpecker secrets exported into runtime container | Reuses existing secret store. | Couples runtime config to CI; secrets are not visible/auditable outside Woodpecker UI; cannot run the container outside CI without manually exporting them. | ❌ leaks the CI/runtime boundary |
| Hand-edited host `.env` (status quo) | Zero new tooling. | No version history, no encryption, no review trail. Single point of failure if the file is lost; security audit can't track changes. | ❌ status quo we are leaving behind (Drift B) |
### sops + age conventions for this repo
```
secrets/
├── .sops.yaml # routes secrets/staging.env / production.env to the right age recipients
├── staging.env # SOPS-encrypted; safe to commit
└── production.env # SOPS-encrypted; safe to commit
```
- `.sops.yaml` declares two age recipients: `recipient_staging` and `recipient_production` (public keys).
- The matching age **private** keys live on each host at `/etc/azaion/age.key`, mode `0400`, owned by root. They are NEVER committed.
- `scripts/deploy.sh` (Step 7) runs `SOPS_AGE_KEY_FILE=/etc/azaion/age.key sops -d secrets/${env}.env > /tmp/azaion.env` and feeds it to `docker run --env-file`.
- All staging/production env values that are NOT secret (e.g. `DEPLOY_HOST_PORT`, `REGISTRY_TAG`) live in plain-text `secrets/staging.public.env` / `secrets/production.public.env` next to the encrypted file, also git-tracked. Loaded before the decrypted overlay.
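The "public file first, decrypted overlay second" layering can be sketched as below. This is an illustration under assumptions: `merge_env_files` and `render_env` are hypothetical helpers, and the sketch writes to a `mktemp` file (created with owner-only permissions) rather than the fixed `/tmp/azaion.env` path, which is a suggested variation and not what the plan specifies.

```shell
#!/usr/bin/env bash
# Sketch only: merge_env_files and render_env are hypothetical helpers.

# merge_env_files BASE OVERLAY...
# Prints KEY=VALUE lines; the last definition of a key wins, first-seen key
# order is preserved, blank lines and '#' comments are dropped.
merge_env_files() {
  awk '
    /^[ \t]*(#|$)/ { next }
    {
      eq = index($0, "=")
      if (eq == 0) next
      key = substr($0, 1, eq - 1)
      if (!(key in val)) order[++n] = key
      val[key] = substr($0, eq + 1)
    }
    END { for (i = 1; i <= n; i++) print order[i] "=" val[order[i]] }
  ' "$@"
}

# render_env ENV_NAME -> prints the path of the merged env file.
# The caller is responsible for deleting that file after `docker run`.
render_env() {
  local env_name="$1" plain merged
  plain="$(mktemp)" || return 1
  merged="$(mktemp)" || { rm -f "$plain"; return 1; }
  if ! SOPS_AGE_KEY_FILE=/etc/azaion/age.key \
      sops -d "secrets/${env_name}.env" > "$plain"; then
    rm -f "$plain" "$merged"
    return 1
  fi
  merge_env_files "secrets/${env_name}.public.env" "$plain" > "$merged"
  rm -f "$plain"
  echo "$merged"
}
```

Merging into a single file keeps the override order explicit instead of relying on how the container runtime resolves duplicate keys.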
### Rotation policy
| Secret | Rotation cadence | Procedure |
|--------|------------------|-----------|
| Postgres `azaion_admin` / `azaion_reader` passwords | every 90 days, on operator schedule | `ALTER ROLE … WITH PASSWORD …` → re-encrypt `production.env` → `scripts/deploy.sh` |
| JWT `JwtConfig__Secret` | every 180 days, AND on any suspected leak | re-encrypt → deploy. **All issued tokens become invalid** — communicate maintenance window. |
| `azaion_superadmin` password | every 365 days, AND on owner change | manual; not used by the running app, only by DB migrations |
| Registry `REGISTRY_TOKEN` | every 90 days OR on CI compromise | rotate registry credential → update Woodpecker secret `registry_token` → re-encrypt `production.env` if also referenced there |
| age private key (`/etc/azaion/age.key`) | every 365 days OR on host compromise | generate new key → add public recipient to `.sops.yaml` → `sops updatekeys secrets/*.env` → distribute new private key out-of-band → remove old recipient |
## 4. Database Management
| Environment | Type | Migrations | Data | Backup |
|-------------|------|------------|------|--------|
| Development | Local Postgres on host (port 4312) **or** dockerized Postgres from `docker-compose.yml` | `env/db/*.sql` applied manually by developer the first time, then `*_users_email_unique.sql`-style additive scripts run with `psql` on demand | empty | none |
| Test (CI) | Postgres 16-alpine from `docker-compose.test.yml` | `env/db/*.sql` mounted into `/docker-entrypoint-initdb.d/sql/`, ordered by `00_run_all.sh` | `99_test_seed.sql` (functional) + 500 perf users injected by `scripts/run-performance-tests.sh` when needed | none — `down -v` between runs |
| Staging | Same Postgres major (16) on the staging server, port 4312, `azaion` database | `env/db/*.sql` applied **manually under change control** via `psql -U azaion_superadmin`. New migrations land in the same numeric-prefix sequence (`07_*.sql`, `08_*.sql`, …) | anonymized prod snapshot, refreshed on demand | nightly `pg_dump` snapshot retained 14 days |
| Production | Same Postgres 16 on prod server | Same as staging; **migration must be applied to staging first**, observed for ≥ 24 h, then promoted to prod with operator approval | live | nightly `pg_dump` retained 30 days; weekly snapshot retained 12 weeks; off-host copy via `rsync` |
### Migration rules (cycle 1)
The project does NOT use an ORM migration framework (linq2db; restrictions.md). The conventions below replace it:
1. **Numeric-prefix ordering** — every new migration is added as `env/db/NN_<description>.sql` where `NN` continues the existing sequence. The current sequence is `01..06`; the next is `07_*.sql`.
2. **Forward-only by default**. Reversibility is provided by the off-host backup, NOT by hand-written DOWN scripts. The existing files (`02_structure.sql`, `03_add_timestamp_columns.sql`, `04_detection_classes.sql`, `06_users_email_unique.sql`) follow this pattern; we keep it.
3. **Backward-compatible deploys** — every schema change must be safe to apply BEFORE the matching code is deployed (additive change → deploy code → cleanup change in a later release). The cycle 1 example: `06_users_email_unique.sql` was applied first; the `RegisterUser` change to translate `23505` came after. AZ-197's `User.Hardware` column was kept as a tombstone instead of dropped, for the same reason.
4. **Production migrations need approval** — operator manually runs the SQL on prod after staging soak. No automatic CI execution against prod in cycle 1 (Drift J — automation is a future cycle's work).
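Rule 1 (numeric-prefix ordering) is easy to get wrong by hand once the sequence passes `07`. A minimal sketch of a helper that derives the next prefix from the files already present follows; `next_migration_prefix` is a hypothetical name, and the sketch assumes every migration file matches `NN_*.sql` with a two-digit prefix.

```shell
#!/usr/bin/env bash
# Sketch only: next_migration_prefix is a hypothetical helper.

# next_migration_prefix DIR -> prints the next two-digit prefix, e.g. "07".
next_migration_prefix() {
  local dir="$1" max=0 f base nn
  for f in "$dir"/[0-9][0-9]_*.sql; do
    # The unexpanded glob is returned when nothing matches; skip it.
    [ -e "$f" ] || continue
    base="${f##*/}"
    nn="${base%%_*}"
    # Drop one leading zero so "08" and "09" are never parsed as octal.
    nn="${nn#0}"
    if [ "$nn" -gt "$max" ]; then max="$nn"; fi
  done
  printf '%02d\n' "$((max + 1))"
}
```

With the current `01..06` sequence in `env/db/`, `next_migration_prefix env/db` would print `07`.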
### Drifts logged here
| ID | Severity | Description | Resolved In |
|----|----------|-------------|-------------|
| B | Medium | No secret manager (status quo: hand-edited host `.env`) | **Resolved in spec** — sops + age (§3); concrete files + script in Step 7 |
| J | Low (NEW) | DB migrations applied manually on staging/prod; no automation | **Carried forward** to a future cycle |
## 5. Self-verification
- [x] Four environments (Dev, Test/CI, Staging, Production) defined with purpose, infrastructure, and data source.
- [x] Environment variable sourcing matrix references `.env.example` (Step 1) without duplicating it.
- [x] No literal secrets in this document — only variable names and tool names.
- [x] Secret manager chosen for staging/production (sops + age) with rotation policy.
- [x] Database strategy per environment, including the explicit no-ORM-migrations convention.
+204
View File
@@ -0,0 +1,204 @@
# Azaion Admin API — Observability
**Date**: 2026-05-13 · **Cycle**: 1 · **Status**: planning artifact (no code changes; concrete wiring lands in Step 7).
## 1. Current State (audit)
| Pillar | Today | Gap |
|--------|-------|-----|
| Logging | Serilog 4.1.0 → Console + rolling file `logs/log.txt` (daily); MinimumLevel `Information`; FromLogContext enrichment | No structured fields beyond defaults; one unstructured `LogInformation($"…")` in `ResourcesService.SaveResource` (security audit F-12); SQL trace bypasses Serilog (`Console.WriteLine`); no correlation IDs |
| Metrics | none | No `/metrics` endpoint; no system, app, or business metrics |
| Tracing | none | No OpenTelemetry, no W3C trace context |
| Health checks | none in code; `docker-compose.test.yml` uses raw TCP probe | No `/health` endpoint (Drift H from Step 2 + skill self-verification) |
| Alerting | none | No alerts wired to any channel |
This step closes the planning gap; implementation lands incrementally — `/health` and structured logging in cycle 1 (Step 7), metrics + tracing in a later cycle (Drift K).
## 2. Logging
### 2.1 Format
Structured JSON to **stdout/stderr only** in containers. The current rolling-file sink is **dropped from the production runtime** (and the `/app/logs` bind mount becomes optional) because:
- Container logs should be collected by the platform, not the app.
- A bind-mounted file silently fills the host disk when log rotation lags.
- We currently have no log shipper, so logs already live only in `docker logs` for ops triage.
The existing console sink stays. The file sink is kept ONLY in `Development` (gated by `ASPNETCORE_ENVIRONMENT`).
```json
{
"timestamp": "2026-05-13T06:48:01.123Z",
"level": "Information",
"service": "azaion.admin-api",
"revision": "a1b2c3d4e5f6",
"correlation_id": "0HMU7…",
"user_id": null,
"message": "User registered",
"context": {
"endpoint": "POST /users",
"duration_ms": 47
}
}
```
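Achieved by adding `Serilog.Formatting.Compact.RenderedCompactJsonFormatter` to the console sink (note that this formatter emits CLEF property names such as `@t`, `@l`, and `@m`, so the `timestamp` / `level` / `message` names in the sample above would need a custom formatter or a rename step) and three enrichers: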
| Enricher | Source | Purpose |
|----------|--------|---------|
| `FromLogContext` | already present | scoped properties |
| `Serilog.Enrichers.Environment` (new) | `ENV` vars | `service`, `revision` (`AZAION_REVISION`) |
| `Serilog.AspNetCore.RequestLoggingOptions` (new) | ASP.NET pipeline | request `correlation_id` from `Activity.Current.TraceId` (or generated UUID v7 if no Activity) |
### 2.2 Log Levels
| Level | Usage | Examples in this codebase |
|-------|-------|---------------------------|
| `Error` | Unhandled exceptions, infra failures | DB connection failure, sops decrypt failure on host |
| `Warning` | Business exception caught | Existing `BusinessExceptionHandler` already does this — keep as-is |
| `Information` | Significant business events | Login, RegisterUser, RegisterDevice, role change, resource upload, detection-class CRUD |
| `Debug` | Diagnostic detail | Request/response payloads (dev only — never in production); query parameters |
### 2.3 Retention
| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | console + `logs/log.txt` (rolling daily) | 7 daily files (set `retainedFileCountLimit: 7` explicitly; the Serilog file sink keeps 31 files by default) |
| Test (CI) | console (captured by Woodpecker UI) | 14 days (Woodpecker artifact retention) |
| Staging | container stdout → `journald` on the host | 7 days; `journalctl --vacuum-time=7d` cron |
| Production | container stdout → `journald` on the host | 30 days; `journalctl --vacuum-time=30d` cron |
> A central log aggregator (Loki / OpenSearch) is **out of scope for cycle 1** — host `journald` is the entire pipeline. Recorded as **Drift L**.
### 2.4 PII Rules
| Rule | Implementation |
|------|----------------|
| Never log passwords | `LoginRequest.Password`, `RegisterUserRequest.Password`, `GetResourceRequest.Password`, the response body of `POST /devices` (plaintext one-shot password). Add a `[Serilog.Sensitive]`-style helper or a `Destructure.ByTransforming<T>(t => …)` per DTO. |
| Never log JWT tokens | The `/login` response body is logged today only by `BusinessExceptionHandler` on failure, which doesn't include the body. Verify in Step 7 that no request-logger middleware logs response bodies. |
| Mask emails | Use last-4 + `@domain` form for INFO-level logs (`***0123@example.com`); full email allowed at DEBUG only. The `BusinessExceptionHandler` log line `"Caught BusinessException: {Message}"` may include emails embedded in messages — tightened in Step 7. |
| User IDs | `User.Id` is an opaque GUID — safe to log; use it instead of email in correlation. |
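The email-masking rule (last four characters of the local part plus the domain) can be illustrated with a short sketch. The real implementation would live in the C# logging path; this shell version only demonstrates the transformation, for example when scrubbing captured logs before sharing them. The function name `mask_email`, the handling of local parts of four characters or fewer (fully masked), and the ASCII-only assumption are choices made for this illustration.

```shell
#!/usr/bin/env bash
# Sketch only: mask_email is a hypothetical helper; assumes ASCII local parts.

# mask_email ADDRESS -> prints "***<last 4 of local part>@<domain>".
mask_email() {
  local email="$1" local_part domain suffix
  case "$email" in
    *@*) ;;
    # Not an email address: reveal nothing.
    *) printf '%s\n' '***'; return 0 ;;
  esac
  local_part="${email%@*}"
  domain="${email##*@}"
  if [ "${#local_part}" -gt 4 ]; then
    suffix="$(printf '%s' "$local_part" | tail -c 4)"
  else
    # Short local parts would be revealed entirely, so mask them fully.
    suffix=""
  fi
  printf '***%s@%s\n' "$suffix" "$domain"
}
```

For example, `mask_email user0123@example.com` prints `***0123@example.com`.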
## 3. Metrics
### 3.1 Endpoint
`GET /metrics` exposing Prometheus exposition format. Add via `prometheus-net.AspNetCore` 8.x (latest stable for .NET 10 baseline; verify version against the released NuGet package before wiring).
> Exposure boundary: `/metrics` MUST NOT be reachable from the public CORS allow-list. The Nginx reverse proxy on `admin.azaion.com` will expose only `/login`, `/users*`, `/devices`, `/resources*`, `/classes*`. `/metrics`, `/swagger`, and the `/health/*` probes (§7) stay on the internal interface (separate Nginx server block bound to the management VLAN, OR `localhost`-only listener).
### 3.2 Metrics
| Metric | Type | Source | Labels |
|--------|------|--------|--------|
| `http_requests_total` | Counter | ASP.NET request pipeline | `method`, `endpoint`, `status_code` |
| `http_request_duration_seconds` | Histogram | ASP.NET request pipeline | `method`, `endpoint` |
| `http_requests_in_progress` | Gauge | ASP.NET request pipeline | `method` |
| `db_command_duration_seconds` | Histogram | linq2db trace hook | `operation` (`select`/`insert`/`update`/`delete`) |
| `db_command_failures_total` | Counter | linq2db trace hook | `operation`, `sqlstate` |
| `auth_login_failures_total` | Counter | `AuthService.ValidateUser` exception path | `reason` (`unknown_user`, `bad_password`, `disabled`) |
| `business_exceptions_total` | Counter | `BusinessExceptionHandler` | `error_code` (the existing `ExceptionEnum`) |
| `resource_upload_bytes_total` | Counter | `ResourcesService.SaveResource` | `data_folder` |
| `resource_upload_failures_total` | Counter | same | `reason` |
| `resource_download_bytes_total` | Counter | `ResourcesService.GetEncryptedResource` | `data_folder` |
| `detection_classes_total` | Gauge | refresh on CRUD | none |
| `users_active_total` | Gauge | refresh on CRUD + on a 5-min timer | `role` |
| Process / runtime | (auto) | `prometheus-net.DotNetRuntime` | gen0/1/2 GC, JIT, threadpool, etc. |
### 3.3 System Metrics
CPU, RSS, file descriptors, network I/O — collected by **node-exporter** running on the host as a sibling container. The Admin API itself does NOT export host-level metrics.
### 3.4 Business Metrics
Mapped to the verified ACs in `_docs/02_document/tests/blackbox-tests.md`. Cycle-1 cut: `users_active_total` (AC-01..AC-12 user lifecycle) and `detection_classes_total` (AZ-513). Resource-related business metrics deferred until the resource flow is exercised by real users post-AZ-197.
### 3.5 Collection
| Setting | Value |
|---------|-------|
| Scrape interval | 15 s (set explicitly via `scrape_interval`; Prometheus's built-in default is 1m) |
| Scrape source | `node-exporter` for host; the Admin API container for app metrics |
| Storage | local Prometheus on the host, retention 14 days (cycle 1 budget) |
| Visualization | local Grafana, single dashboard (§6) |
## 4. Distributed Tracing
**Cycle 1**: scaffold only — produce a trace ID per request, propagate via `traceparent` (W3C), and emit it as the `correlation_id` field in JSON logs. **Do NOT yet** ship spans to a collector — there is no Jaeger / Tempo running, and the Admin API has no downstream services to trace into. Tracing pays back its cost when there's a chain to follow; cycle 1 has none.
| Setting | Value |
|---------|-------|
| SDK | `OpenTelemetry.Extensions.Hosting` + `OpenTelemetry.Instrumentation.AspNetCore` |
| Propagation | W3C Trace Context (`traceparent`) — auto when `OpenTelemetry.Instrumentation.AspNetCore` is registered |
| Sampling | 100% in dev/staging, 10% in production (deferred — no exporter yet) |
| Span naming | `<service>.<operation>` — service `azaion.admin-api`, operation `<HTTP method> <route template>` |
| Exporter | none in cycle 1 (logs only) |
> Recorded as **Drift M** — wire a Tempo / Jaeger exporter once a downstream service exists.
## 5. Alerting
| Severity | Response time | Conditions for this service | Channel |
|----------|---------------|------------------------------|---------|
| Critical | 5 min | `up{job="admin-api"} == 0` for 1 min · `/health` fails for 2 min · `business_exceptions_total{error_code="DbFailure"}` rate > 1/s for 1 min | Slack `#azaion-ops` + on-call email (cycle 1 — PagerDuty deferred until on-call rotation exists) |
| High | 30 min | Error rate > 5% for 5 min (`http_requests_total{status_code=~"5.."}/total`) · P95 latency > 2× baseline for 10 min · `auth_login_failures_total` rate > 10/s for 1 min (possible brute force) | Slack `#azaion-ops` + email |
| Medium | 4 h | Host disk > 80% · `db_command_failures_total` rate > 0.1/s for 10 min · process RSS > 80% of container limit | Slack `#azaion-ops` |
| Low | Next business day | Deprecated package usage from `dotnet list package --deprecated` | Slack `#azaion-eng` |
Baseline values (P95) come from the cycle-1 perf report:
- `/login` p95 ≈ 33 ms → high-latency alert at p95 > 66 ms for 10 min
- `/users` (500 users) p95 ≈ 152 ms → high-latency alert at p95 > 305 ms for 10 min
Alert routing in cycle 1 is **inform-only** — no PagerDuty escalation, no auto-rollback. The deploy procedure (Step 6) documents the manual rollback path.
## 6. Dashboards
**Operations dashboard** (Grafana, single panel set; cycle 1):
- Service `up` (admin-api, postgres, nginx) — stat panel
- HTTP request rate (req/s) by endpoint — time series
- HTTP error rate (% of 5xx) — time series with the High threshold band overlaid
- Latency P50 / P95 / P99 by endpoint — time series, P95 baseline reference line
- DB command rate + failure rate — time series
- Container CPU / RSS / FDs — time series (from node-exporter)
- Active alerts — table panel
**Business dashboard** (cycle 1):
- `users_active_total` by role — stat panel + sparkline
- `detection_classes_total` — stat panel
- `resource_upload_bytes_total` rate (1h window) — time series
- Login success/failure ratio (24h) — donut
Dashboards stored as code in `monitoring/grafana/admin-api.json` (introduced in Step 7).
## 7. Health Checks
Add a `/health` Minimal API endpoint:
| Probe | Endpoint | What it checks | Surface |
|-------|----------|----------------|---------|
| Liveness | `GET /health/live` | Process is responsive (always 200 unless the process is wedged) | Used by Docker `HEALTHCHECK` |
| Readiness | `GET /health/ready` | DB reader connection + DB admin connection (one-shot `SELECT 1` each, 2s timeout) | Used by Nginx upstream check + the deploy script (Step 6) post-deploy gate |
Endpoints are anonymous (no JWT) but bound only to the management VLAN (or `localhost` listener) — same exposure rule as `/metrics`.
> Failure mode: if the DB is unreachable for 30 s, `/health/ready` returns 503; Nginx pulls the upstream, returning 503 to clients (no silent traffic loss). The container itself stays running so a transient DB blip does not trigger Docker restart.
## 8. Drifts Logged Here
| ID | Severity | Description | Resolved In |
|----|----------|-------------|-------------|
| H | Low | `docker-compose.test.yml` health check is TCP-only; upgrade to `/health/live` once available | Step 7 |
| K | Medium (NEW) | Metrics + tracing not implemented in cycle 1; only the plan + `/health` ship | Future cycle |
| L | Low (NEW) | No central log aggregator; `journald` only | Future cycle |
| M | Low (NEW) | Tracing has no exporter (cycle 1 = trace IDs in logs only) | Future cycle when downstream services exist |
## 9. Self-verification
- [x] Structured JSON logging format defined with `timestamp`, `level`, `service`, `correlation_id`, `message`, `context`.
- [x] Metrics endpoint specified (`/metrics`, internal-only) with full app/system/business metric inventory.
- [x] OpenTelemetry tracing configured at the SDK level (cycle 1) with future exporter wiring (Drift M).
- [x] Alert severities with response times and channels defined; baselines tied to perf report numbers.
- [x] Dashboards defined for operations and business metrics.
- [x] PII exclusion rules cover passwords, JWTs, and email masking; refers to specific DTO field names.
@@ -0,0 +1,122 @@
# Azaion Admin API — Deployment Status Report
**Date**: 2026-05-13
**Cycle**: 1
**Step**: Deploy / 1 — Status & Environment Setup
**Verdict**: **READY for planning** — no critical blockers; three medium drift items must be resolved before Steps 2–7 produce final artifacts.
## Deployment Readiness Summary
| Aspect | Status | Notes |
|--------|--------|-------|
| Architecture defined | ✅ | `_docs/02_document/architecture.md` (§3 Deployment Model) |
| Component specs complete | ✅ | 5 components in `_docs/02_document/components/` |
| Infrastructure prerequisites met | ⚠️ Partial | Self-hosted Linux + private registry assumed; SSL/DNS not codified |
| External dependencies identified | ✅ | PostgreSQL (port 4312) + filesystem; no message bus, no CDN consumed by API |
| Cycle-1 changes integrated | ✅ | AZ-513 (`/classes`), AZ-196 (`/devices`), AZ-197 (HW removed); AZ-183 (OTA) reverted |
| Security audit signed off | ⚠️ PASS_WITH_WARNINGS | F-2 deferred to AZ-516; F-6 (root container) carried into Step 2 (containerization) |
| Performance test signed off | ✅ | NFT-PERF-01/04 PASS; NFT-PERF-02/03 obsolete (OTA reverted) |
| Blockers | 0 | 3 medium-priority drift items, listed below |
## Component Status
| Component | State | Docker-ready | Notes |
|-----------|-------|--------------|-------|
| 01 Data Layer | implemented + tested | n/a (library) | linq2db 5.4.1; entities `User`, `UserConfig`, `RoleEnum`, `DetectionClass`, `ExceptionEnum` |
| 02 User Management | implemented + tested | n/a (library) | `UserService`, `RegisterUser`/`RegisterDevice` consolidated post-F-3 |
| 03 Auth & Security | implemented + tested | n/a (library) | JWT bearer + per-user resource encryption (HW component removed AZ-197) |
| 04 Resource Management | implemented + tested | n/a (library) | Filesystem-backed; OTA paths deleted post-revert |
| 05 Admin API | implemented + tested | yes | Single deployable container (`Azaion.AdminApi`); composes the four libraries |
> Only the Admin API ships as a runtime container. Libraries are linked into the API at build time.
## External Dependencies
| Dependency | Type | Required For | Status |
|------------|------|--------------|--------|
| PostgreSQL 14+ (custom port 4312) | Database | All persistence | needs setup per env (`env/db/`) |
| Server filesystem (`Content/`, `logs/`) | Local I/O | Resource storage + Serilog rolling files | provisioned by host (bind mounts) |
| Docker Engine | Runtime | Container execution | required on `DEPLOY_HOST` |
| Nginx (reverse proxy) | TLS / routing | HTTPS termination, Host header | provisioned by `env/api/02-nginx-docker-registry.sh` |
API has no outbound calls to external SaaS APIs (no SSRF surface).
## Infrastructure Prerequisites
| Prerequisite | Status | Action Needed |
|--------------|--------|---------------|
| Container registry | ⚠️ Two registries in flight | Drift A (below): consolidate `docker.azaion.com/api` → `$REGISTRY_HOST/azaion/admin:branch-arm` |
| Cloud account | n/a | Self-hosted Linux server; no cloud account required |
| DNS configuration | ✅ | `admin.azaion.com` already in CORS allow-list |
| SSL certificates | ⚠️ Assumed at proxy | HTTPS not enforced in code (security audit F-13); document upstream chain in Step 6 |
| CI/CD platform | ✅ | Woodpecker CI on ARM64 (`.woodpecker/01-test.yml`, `02-build-push.yml`) |
| Secret manager | ❌ Not chosen | Drift B (below): Woodpecker secrets are used in CI; no manager for runtime container `.env` |
| Container user | ⚠️ root | Drift C: security audit F-6 — add `USER app` in Step 2 (containerization) |
| Health check endpoint | ❌ Missing | Step 5 (observability) — required for orchestration / load-balancer probes |
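
Once the `USER` directive lands, the "Container user" row can be verified mechanically. A minimal sketch, assuming the configured user is read with `docker inspect --format '{{.Config.User}}' <image>`; the function name `is_non_root_user` is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: succeeds only when the image's configured user is not root.
# Input is the value of .Config.User, e.g. "", "root", "0:0", "app", "1654".
is_non_root_user() {
  user="${1%%:*}"   # drop an optional ":group" suffix
  case "$user" in
    ""|root|0) return 1 ;;  # an empty value means Docker defaults to root
    *) return 0 ;;
  esac
}
```

Usage in a CI step might look like `is_non_root_user "$(docker inspect --format '{{.Config.User}}' "$image")" || exit 1`, where `$image` is whatever reference the pipeline just built.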
## Deployment Drift / Blockers (planning inputs)
| ID | Severity | Description | Resolved In |
|----|----------|-------------|-------------|
| Drift A | Medium | Image path in `deploy.cmd` / `env/api/start-container.sh` (`docker.azaion.com/api`) ≠ image path in `.woodpecker/02-build-push.yml` (`$REGISTRY_HOST/azaion/admin:branch-arm`). The host pulls `:latest`, but CI never pushes `:latest`. | Step 3 (CI/CD) + Step 7 (scripts) |
| Drift B | Medium | Production `.env` is hand-edited on the server. No secret manager, no rotation policy. | Step 4 (env strategy) — propose Vault / sops / SSM |
| Drift C | Medium | `Dockerfile` final stage runs as root (security audit F-6, AZ-518). | Step 2 (containerization) — add non-root `USER app` (UID 1654) |
| Drift D | Low | `.woodpecker/build-arm.yml` referenced in old docs but the actual files are `01-test.yml` + `02-build-push.yml`. | Step 3 (CI/CD) — refresh the doc |
| Drift E | Low | Performance script is run-on-demand (`scripts/run-performance-tests.sh`), not gated in CI. | Step 3 (CI/CD) — optional perf gate |
| Drift F | Low | No vulnerable-dep gate in CI (security audit recommendation 13). | Step 3 (CI/CD) — `dotnet list package --vulnerable` |
> No **critical** blockers. The drifts are planning inputs for Steps 2–7; they do NOT block Step 1 from completing.
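
Drift A comes down to two places spelling the image reference differently. One way to avoid that, sketched here as an assumption rather than the agreed fix, is to build the reference from the `REGISTRY_HOST`, `REGISTRY_IMAGE`, and `REGISTRY_TAG` variables in both CI and the deploy scripts. The function name `image_ref` is hypothetical; the fallback values mirror the dev defaults listed in this report.

```shell
#!/bin/sh
# Hypothetical helper: compose the image reference from the REGISTRY_* variables.
# The fallback values are the dev defaults listed in this report.
image_ref() {
  printf '%s/%s:%s\n' \
    "${REGISTRY_HOST:-docker.azaion.com}" \
    "${REGISTRY_IMAGE:-azaion/admin}" \
    "${REGISTRY_TAG:-dev-arm}"
}
```

With a single helper like this, `docker pull "$(image_ref)"` on the host and the push step in CI cannot disagree on the path or the tag.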
## Required Environment Variables
| Variable | Purpose | Required In | Default (Dev) | Source (Staging/Prod) |
|----------|---------|-------------|---------------|----------------------|
| `ASPNETCORE_ENVIRONMENT` | Selects appsettings overlay + Swagger gate | All | `Development` | Environment (`Production`) |
| `ASPNETCORE_URLS` | Kestrel bind address | Container | `http://+:8080` | Environment |
| `ASPNETCORE_ConnectionStrings__AzaionDb` | Reader DB connection (read-only role) | All | `Host=localhost;Port=4312;…;Username=azaion_reader` | Secret manager |
| `ASPNETCORE_ConnectionStrings__AzaionDbAdmin` | Admin DB connection (read/write role) | All | `Host=localhost;Port=4312;…;Username=azaion_admin` | Secret manager |
| `ASPNETCORE_JwtConfig__Secret` | HMAC-SHA256 signing key (≥ 32 bytes) | All | dev-only literal in `.env` | Secret manager |
| `ASPNETCORE_JwtConfig__Issuer` | JWT `iss` claim | All | `AzaionApi` (appsettings) | appsettings or env override |
| `ASPNETCORE_JwtConfig__Audience` | JWT `aud` claim | All | `Annotators/OrangePi/Admins` (appsettings) | appsettings or env override |
| `ASPNETCORE_JwtConfig__TokenLifetimeHours` | Token TTL | All | `4` (appsettings) | Environment |
| `ASPNETCORE_ResourcesConfig__ResourcesFolder` | File storage root | All | `Content` | Environment |
| `ASPNETCORE_ResourcesConfig__SuiteInstallerFolder` | Prod installer dir | All | `suite` | Environment |
| `ASPNETCORE_ResourcesConfig__SuiteStageInstallerFolder` | Stage installer dir | All | `suite-stage` | Environment |
| `CI_COMMIT_SHA` | Build-time label → `AZAION_REVISION` env in container | Build only | (unset → `unknown`) | Woodpecker `$CI_COMMIT_SHA` |
| `DEPLOY_HOST` | Remote target machine for `scripts/deploy.sh` | Deploy scripts | `admin.azaion.com` | Environment |
| `DEPLOY_SSH_USER` | SSH user on `DEPLOY_HOST` | Deploy scripts | `root` | Environment |
| `DEPLOY_CONTAINER_NAME` | Docker container name on host | Deploy scripts | `azaion.api` | Environment |
| `DEPLOY_HOST_PORT` | Published host port (mapped to container 8080) | Deploy scripts | `4000` | Environment |
| `DEPLOY_HOST_CONTENT_DIR` | Host bind mount for `Content/` | Deploy scripts | `/root/api/content` | Environment |
| `DEPLOY_HOST_LOGS_DIR` | Host bind mount for `logs/` | Deploy scripts | `/root/api/logs` | Environment |
| `REGISTRY_HOST` | Container registry hostname | CI + deploy scripts | `docker.azaion.com` | Environment / Woodpecker secret |
| `REGISTRY_IMAGE` | Image path inside registry | CI + deploy scripts | `azaion/admin` | Environment |
| `REGISTRY_TAG` | Image tag | Deploy scripts | `dev-arm` | Environment |
| `REGISTRY_USER` | Registry login user | CI + deploy scripts | (empty) | Woodpecker secret `registry_user` / Secret manager |
| `REGISTRY_TOKEN` | Registry login token/password | CI + deploy scripts | (empty) | Woodpecker secret `registry_token` / Secret manager |
> All `ASPNETCORE_…` variables map to ASP.NET Core's `IConfiguration` via the standard `__` separator (e.g., `JwtConfig:Secret` ← `ASPNETCORE_JwtConfig__Secret`). The `ASPNETCORE_` prefix is *required* — `ConfigurationBuilder` only picks up env vars under that prefix unless additional prefixes are wired explicitly (which this app does not do).
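
The naming rule is mechanical enough to script, which helps when generating `.env.example` entries from configuration keys. A minimal sketch; the function name `config_key_to_env` is hypothetical and it only applies the `:` to `__` substitution and the prefix described above.

```shell
#!/bin/sh
# Hypothetical helper: turn a configuration key such as "JwtConfig:Secret"
# into the environment variable name used in this report.
config_key_to_env() {
  printf 'ASPNETCORE_%s\n' "$(printf '%s' "$1" | sed 's/:/__/g')"
}
```

For example, `config_key_to_env 'ConnectionStrings:AzaionDb'` yields the name used in the table for the reader connection string.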
## .env Files Created
- `.env.example` — committed to VCS, contains all variable names with placeholder values and inline comments.
- `.env` — git-ignored (via existing `.gitignore` line `.env`), contains development defaults pointing at the local Postgres on port 4312 and a clearly-marked dev-only JWT secret.
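
Because `.env` is git-ignored, it can silently fall behind `.env.example`. The following sketch lists variable names present in one file but missing from the other; the function names `env_keys` and `missing_env_keys` are hypothetical, and the parser assumes plain `NAME=value` lines.

```shell
#!/bin/sh
# Hypothetical helpers: compare the variable names defined in two env files.

# Print the variable names defined in a file; comments and blank lines are skipped.
env_keys() {
  sed -n 's/^[[:space:]]*\([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' "$1" | sort -u
}

# Print names defined in the first file but absent from the second.
missing_env_keys() {
  env_keys "$1" | while IFS= read -r key; do
    if ! env_keys "$2" | grep -qx "$key"; then
      echo "$key"
    fi
  done
}
```

Running `missing_env_keys .env.example .env` before starting the container prints nothing when the local file covers every documented variable.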
## Acceptance Checklist (Step 1 self-verification)
- [x] All five components assessed for deployment readiness.
- [x] External dependencies catalogued (Postgres + filesystem only).
- [x] Infrastructure prerequisites identified, including 6 named drifts (A–F).
- [x] All required environment variables discovered (23 entries).
- [x] `.env.example` created with placeholders + comments.
- [x] `.env` created with safe local defaults (no real secrets).
- [x] `.gitignore` already excludes `.env` (line 10).
- [x] Status report written to `_docs/04_deploy/reports/deploy_status_report.md`.
## Next Steps
1. **User confirms** this report (BLOCKING gate at end of Step 1).
2. Step 2 (Containerization): consume Drift C (non-root `USER`) and the existing multi-stage Dockerfile as the baseline.
3. Step 3 (CI/CD): consume Drifts A, D, E, F and refresh the documented pipeline against the actual `01-test.yml` / `02-build-push.yml` files.
4. Step 4 (Environment Strategy): consume Drift B by proposing a secret manager option (e.g., HashiCorp Vault, sops-encrypted files in git, or Woodpecker secrets exported into the runtime container).
5. Steps 5–7 then layer observability, procedures, and scripts on top.
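
For the vulnerable-dependency gate (Drift F), a CI step may need to inspect the text of the `dotnet list package --vulnerable` report rather than rely on its exit code. The sketch below assumes the command can exit 0 even when it reports findings, and that findings are announced with the phrase matched in the code; both assumptions should be checked against the SDK version in use. The function name `fail_on_vulnerable` is hypothetical.

```shell
#!/bin/sh
# Hypothetical CI helper: read `dotnet list package --vulnerable` output on stdin,
# echo it through, and fail when the report mentions vulnerable packages.
# Assumption: findings are announced with the phrase matched below.
fail_on_vulnerable() {
  report="$(cat)"
  printf '%s\n' "$report"
  if printf '%s\n' "$report" | grep -q 'has the following vulnerable packages'; then
    return 1
  fi
  return 0
}
```

A pipeline step could then run `dotnet list package --vulnerable --include-transitive 2>&1 | fail_on_vulnerable`, so the full report stays in the CI log while the step fails on findings.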