Azaion Admin API — Deployment Procedures

Date: 2026-05-13 · Cycle: 1 · Status: planning artifact (the executable scripts referenced here land in Step 7).

1. Deployment Strategy

Pattern: stop-and-start with pre-pulled image (single-container, single-host).

Rationale:

  • Topology is one Docker host per environment running one azaion.api container behind Nginx. There is no orchestrator, no replica set, no load balancer beyond Nginx itself.
  • Blue-green requires either two listening ports + Nginx switch, or two hosts. Cycle-1 budget does not include either. Recorded as Drift N for a future cycle.
  • Rolling/canary is meaningless with one replica.
  • The realistic SLO for cycle 1 is brief (< 30 s) downtime per deploy, mitigated by deploying in low-traffic windows. The procedure pre-pulls the image, so the stop-start gap covers only the old container's graceful shutdown and the time the new container takes to pass /health/ready, not image-download time.

Zero-downtime in production: not achieved in cycle 1. Documented and acknowledged.

Graceful Shutdown

| Signal | Behavior |
|---|---|
| SIGTERM (docker stop) | ASP.NET Core stops accepting new requests, waits up to HostOptions.ShutdownTimeout for in-flight requests, then exits. |
| ShutdownTimeout | Set to 30 seconds in Program.cs (services.Configure<HostOptions>(o => o.ShutdownTimeout = TimeSpan.FromSeconds(30))). |
| docker stop grace | Use docker stop -t 40 so Docker waits 40 s before sending SIGKILL, leaving 10 s of headroom over the app's 30 s. |

This wiring lands in Step 7 (Dockerfile + small Program.cs change).
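
A minimal sketch of the stop side of this contract, assuming the container is named azaion.api as in the docker inspect commands later in this document. The function name stop_gracefully and the variables APP_SHUTDOWN_TIMEOUT and STOP_GRACE are hypothetical; the sketch simply refuses to run when the Docker grace period leaves no headroom over the app's shutdown timeout.

```shell
# Hypothetical helper: stop the container with a grace period longer than
# the app's HostOptions.ShutdownTimeout, so SIGKILL never cuts a drain short.

APP_SHUTDOWN_TIMEOUT=${APP_SHUTDOWN_TIMEOUT:-30}   # seconds, mirrors Program.cs
STOP_GRACE=${STOP_GRACE:-40}                       # seconds, passed to docker stop -t

stop_gracefully() {
    container=${1:-azaion.api}
    # Guard: the Docker grace period must exceed the app's shutdown timeout.
    if [ "$STOP_GRACE" -le "$APP_SHUTDOWN_TIMEOUT" ]; then
        echo "STOP_GRACE ($STOP_GRACE s) must exceed APP_SHUTDOWN_TIMEOUT ($APP_SHUTDOWN_TIMEOUT s)" >&2
        return 1
    fi
    # docker stop sends SIGTERM, waits up to -t seconds, then sends SIGKILL.
    docker stop -t "$STOP_GRACE" "$container"
}
```

Keeping both numbers in one place makes it harder for a later change to Program.cs to silently erase the 10 s of headroom.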

Database Migration Ordering

Conventions inherited from the Environment Strategy (§4 of environment_strategy.md):

  1. Apply the new env/db/NN_*.sql file before deploying the matching code. Because every migration is backward-compatible, the old container keeps working against the new schema.
  2. After the deploy is healthy, optionally apply a follow-up NN+1_*.sql for cleanup (e.g., dropping a tombstone column once no code reads it).
  3. Production migrations run on staging first and soak ≥ 24 h before promotion.
  4. The operator applies the migration with psql -h <host> -p 4312 -U azaion_superadmin -d azaion -f env/db/NN_xxx.sql and logs the run in the deploy ticket.
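
The operator command in item 4 could be wrapped as sketched below. The wrapper name apply_migration and the DB_HOST variable are hypothetical, and the two extra psql options (-v ON_ERROR_STOP=1 and --single-transaction) are not part of the procedure above; they are an assumption about the desired failure behaviour, namely that a failing statement aborts the run and leaves the schema untouched.

```shell
# Hypothetical wrapper around the operator's psql command.
DB_HOST=${DB_HOST:-localhost}

apply_migration() {
    file=$1
    if [ ! -f "$file" ]; then
        echo "migration file not found: $file" >&2
        return 1
    fi
    # ON_ERROR_STOP=1: psql exits non-zero at the first failing statement.
    # --single-transaction: wraps the file in BEGIN/COMMIT, so a failure
    # rolls back everything (assumption: the migration has no statement that
    # must run outside a transaction, e.g. CREATE INDEX CONCURRENTLY).
    psql -h "$DB_HOST" -p 4312 -U azaion_superadmin -d azaion \
         -v ON_ERROR_STOP=1 --single-transaction -f "$file"
}
```

A non-zero exit status from apply_migration would then be a clear signal to stop before deploying the matching code.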

2. Health Checks

These endpoints are introduced in Step 7 (anonymous, internal-only — see Observability §3.1 / §7).

| Check | Type | Endpoint | Interval | Failure threshold | Action |
|---|---|---|---|---|---|
| Docker liveness | HTTP GET (in-container, via Dockerfile HEALTHCHECK) | /health/live | 30 s | 3 consecutive | Docker marks container unhealthy; does NOT auto-restart in cycle 1 (start-container.sh sets no restart policy, and plain Docker restart policies react to container exit, not to health status) |
| Nginx readiness | HTTP GET (upstream health_check) | /health/ready | 5 s | 3 consecutive | Nginx pulls upstream → 503 to clients (no silent traffic loss) |
| Deploy-script startup | HTTP GET (polling) | /health/ready | 2 s | up to 30 attempts (~60 s) | scripts/deploy.sh aborts and triggers rollback |
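
The deploy-script startup check (2 s interval, up to 30 attempts) might look like the sketch below. The function name wait_ready, the variable names, and the URL with port 8080 are all hypothetical; the real scripts/deploy.sh lands in Step 7 and may differ.

```shell
# Hypothetical startup probe: poll the readiness endpoint until it returns 200.
READY_URL=${READY_URL:-http://localhost:8080/health/ready}   # port is an assumption
MAX_ATTEMPTS=${MAX_ATTEMPTS:-30}
INTERVAL=${INTERVAL:-2}

wait_ready() {
    attempt=1
    while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
        # -s silent, -o discards the body, -w prints only the HTTP status,
        # --max-time bounds a hung connection so the loop keeps moving.
        status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$READY_URL")
        if [ "$status" = "200" ]; then
            echo "ready after $attempt attempt(s)"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$INTERVAL"
    done
    echo "not ready after $MAX_ATTEMPTS attempts" >&2
    return 1   # caller aborts the deploy and starts the rollback
}
```

Note that a hanging endpoint can stretch the total wait beyond ~60 s, because each attempt may itself take up to the --max-time value before the sleep begins.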

Health Check Response Contract

| Endpoint | 200 condition | 5xx condition | Headers |
|---|---|---|---|
| /health/live | Process is responsive (always; short-circuits before any dependency call) | Never; a wedged process fails the probe by timing out, not by returning 5xx | Cache-Control: no-store |
| /health/ready | SELECT 1 succeeds against both AzaionDb (reader) and AzaionDbAdmin (writer) within a 2 s timeout | Either DB query fails or times out → 503 | Cache-Control: no-store |

/health/ready does NOT exercise the filesystem (Content/, logs/) — a transient EACCES there should not yank the upstream. It surfaces in metrics (resource_upload_failures_total) and alerts (Observability §5) instead.
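
One way to verify the response contract from a shell is sketched below. The helper name check_contract is hypothetical; it checks only the two properties stated in the contract, the status code and the Cache-Control: no-store header.

```shell
# Hypothetical contract check for one health endpoint.
# Usage: check_contract <url> <expected-status>
check_contract() {
    url=$1
    expected=$2
    # -D - dumps response headers to stdout; -o /dev/null drops the body.
    headers=$(curl -s -D - -o /dev/null --max-time 3 "$url" | tr -d '\r')
    # The status code is the second field of the status line.
    status=$(printf '%s\n' "$headers" | awk 'NR == 1 { print $2 }')
    if [ "$status" != "$expected" ]; then
        echo "FAIL $url: status ${status:-none}, expected $expected"
        return 1
    fi
    # Header names are case-insensitive, so match with grep -i.
    if ! printf '%s\n' "$headers" | grep -qi '^cache-control:.*no-store'; then
        echo "FAIL $url: Cache-Control: no-store missing"
        return 1
    fi
    echo "OK $url"
}
```

Calling it with an expected status of 503 while a database is deliberately unreachable would exercise the failure side of the /health/ready contract as well.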

3. Staging Deployment

Triggered manually by the operator from the staging host or from a Woodpecker manual workflow.

1. Pre-flight                        — operator on local machine
   a. Confirm CI green for the target SHA on the `stage` branch.
   b. Run `dotnet list package --vulnerable` against the target commit (CI does this too — local is a sanity check).
   c. Confirm any DB migration in env/db/ for this SHA has been reviewed.

2. DB migration (if any)            — operator SSH to staging host
   psql -h localhost -p 4312 -U azaion_superadmin -d azaion -f env/db/NN_<desc>.sql

3. Deploy                           — operator runs scripts/deploy.sh on staging host
   ENV=staging ./scripts/deploy.sh <sha-tag>
   # script: docker pull → stop -t 40 → rm → run --env-file .env → poll /health/ready

4. Verify                           — automatic in scripts/deploy.sh
   - /health/ready returns 200 within 60 s
   - Container `docker inspect` healthcheck status is `healthy`
   - `docker logs --since 60s --tail 80 azaion.api` contains no `Error` lines

5. Smoke tests                      — operator runs from local machine
   BASE_URL=https://stage.admin.azaion.com ./scripts/smoke.sh
   # 6 critical-path checks: /login (admin), GET /users (paginates), GET /classes,
   # GET /resources/list, /health/ready, JWT lifecycle.

6. Soak                             — observe dashboard for ≥ 24 h before promoting

If any step fails → §5 Rollback.
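
The sequence that step 3 attributes to scripts/deploy.sh (pull, stop, rm, run) can be outlined as below. This is a sketch under assumptions, not the Step 7 script: the function name deploy is hypothetical, the image path is borrowed from the registry check in the §6 checklist, and port, volume and network flags are left out.

```shell
# Hypothetical outline of scripts/deploy.sh: pull, stop, remove, run.
# The readiness poll that follows the run is omitted here.
deploy() {
    tag=$1
    image="$REGISTRY_HOST/azaion/admin:$tag"   # repository path as in the §6 checklist
    name=azaion.api

    # Pull first: if the image is missing, abort while the old container still serves.
    docker pull "$image" || return 1

    # Downtime starts here. Tolerate a missing container on a first deploy.
    docker stop -t 40 "$name" || true
    docker rm "$name" || true

    # Port mapping, volumes and network flags are deployment-specific and omitted.
    docker run -d --name "$name" --env-file .env "$image"
}
```

Ordering the pull before the stop is what keeps image-download time out of the downtime window.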

4. Production Deployment

1. Approval                         — required: ops lead OR backend lead
   - Reference the staging soak completion timestamp.
   - Reference the cycle's deploy ticket (AZ-NNN) and CI run URL.

2. Pre-deploy checks (operator on local machine)
   [ ] Staging smoke tests passed (§3 step 5).
   [ ] Staging soaked ≥ 24 h with no Critical/High alerts.
   [ ] CI green for the same SHA on the `main` branch.
   [ ] Image-scan report for the SHA shows zero High/Critical (Woodpecker artifact).
   [ ] DB migration plan recorded in the deploy ticket.
   [ ] Rollback target SHA is recorded (the SHA currently running in prod — `docker inspect azaion.api | jq -r '.[0].Config.Labels."org.opencontainers.image.revision"'`).
   [ ] On-call engineer is reachable for the next 30 min.

3. DB migration (if any)            — operator SSH to prod host
   psql -h localhost -p 4312 -U azaion_superadmin -d azaion -f env/db/NN_<desc>.sql

4. Deploy                           — operator runs scripts/deploy.sh on prod host
   ENV=production ./scripts/deploy.sh <sha-tag>

5. Verify                           — automatic + operator
   - /health/ready returns 200 within 60 s.
   - Container `docker inspect` healthcheck status `healthy`.
   - Operator hits `/login` with admin creds and a known user list query.

6. Monitor                          — operator observes dashboards for ≥ 15 minutes
   - Error rate (5xx) stays < 1%.
   - P95 latency stays within 2× cycle-1 baseline (66 ms /login, 305 ms /users).
   - No Critical or High alerts fire.

7. Finalize
   - Update deploy ticket with start/stop timestamps and image SHA.
   - Post `:white_check_mark: prod deploy: <sha-tag>` to Slack #azaion-ops.
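
The numeric gates in step 6 can be expressed as a small check, sketched below with integer arithmetic only. The function name monitor_ok is hypothetical, and so is the reading that the figures quoted in step 6 (66 ms, 305 ms) are the baselines themselves rather than the doubled limits; how the inputs are obtained from the dashboards is out of scope.

```shell
# Hypothetical gate for the monitoring step. All inputs are integers:
# latencies in milliseconds, error rate in hundredths of a percent (100 = 1 %).
monitor_ok() {
    error_rate_bp=$1    # 5xx rate in hundredths of a percent
    p95_ms=$2           # observed P95 for one endpoint
    baseline_ms=$3      # cycle-1 baseline P95 for the same endpoint
    if [ "$error_rate_bp" -ge 100 ]; then
        echo "error rate at or above 1 %"
        return 1
    fi
    if [ "$p95_ms" -gt $((2 * baseline_ms)) ]; then
        echo "P95 ${p95_ms} ms exceeds 2x baseline ($((2 * baseline_ms)) ms)"
        return 1
    fi
    echo "ok"
}
```

For example, monitor_ok 20 120 66 passes (0.2 % errors, 120 ms against a 132 ms limit), while monitor_ok 20 140 66 fails on latency.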

5. Rollback Procedures

Trigger Criteria (any one)

  • /health/ready fails for ≥ 60 s after deploy.
  • Error rate (5xx) > 5 % for 5 minutes within the 15-minute observation window.
  • Any Critical alert fires within 15 minutes of deploy.
  • Operator's manual call (e.g. business-impacting bug surfaced by smoke tests).
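
The any-one-of logic of the trigger criteria is sketched below. The function name should_rollback and its four integer arguments are hypothetical; the sketch only encodes the thresholds listed above and says nothing about how the measurements are collected.

```shell
# Hypothetical decision helper: returns 0 (and prints the reason) if any
# trigger criterion is met, 1 otherwise.
should_rollback() {
    ready_fail_s=$1        # seconds /health/ready has been failing since deploy
    err_over_5pct_min=$2   # consecutive minutes with 5xx rate above 5 %
    critical_alerts=$3     # Critical alerts fired within 15 min of deploy
    manual=$4              # 1 if the operator calls it, else 0
    if [ "$ready_fail_s" -ge 60 ]; then echo "readiness failing"; return 0; fi
    if [ "$err_over_5pct_min" -ge 5 ]; then echo "error rate"; return 0; fi
    if [ "$critical_alerts" -gt 0 ]; then echo "critical alert"; return 0; fi
    if [ "$manual" -eq 1 ]; then echo "operator call"; return 0; fi
    return 1
}
```

The checks run in the order of the list, so the printed reason is the first criterion that matched.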

Rollback Steps (≤ 5 minutes)

1. Capture state                    — operator on the affected host
   ts=$(date -u +%Y%m%dT%H%M%SZ)    # one timestamp, so both files share a name
   docker logs azaion.api --tail=500 > /var/log/azaion/rollback-${ts}.log 2>&1    # 2>&1: the container's stderr stream is captured too
   docker inspect azaion.api > /var/log/azaion/rollback-${ts}.inspect.json

2. Re-deploy previous SHA           — operator
   ENV=production ./scripts/deploy.sh <previous-sha-tag>
   # The SHA tag was recorded in step 2 of the deploy procedure.

3. DB rollback (if a migration was applied this deploy)
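   - If reverse SQL was recorded in the deploy ticket (e.g. a DROP COLUMN or DROP INDEX that undoes an addition): run it.
   - If no reverse SQL was recorded (purely additive change such as an added column or table): leave the schema as-is. Every migration is backward-compatible (§1, Database Migration Ordering, rule 1), so the previous code runs unaffected and the extra schema is inert.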
   - If data was migrated destructively: STOP, escalate to backend lead. Restore from backup if necessary.

4. Verify                           — same checks as §4 step 5
5. Notify                           — operator posts ":rotating_light: prod rollback: <previous-sha>" to Slack #azaion-ops with the deploy ticket link.
6. Post-mortem                      — schedule within 24 hours; required artifact: timeline + root cause + prevention.

Post-Mortem (required)

Template lives in _docs/06_metrics/postmortem_template.md (added in Step 7). Required sections:

  • Timeline (UTC), with deploy SHA and rollback SHA.
  • Root cause (one sentence + evidence link).
  • Detection — how was it caught? Which alert? Which probe? Which user report?
  • Repair — what fixed it?
  • Prevention — concrete change (test, alert, procedure step) with an owner and a target date.

6. Deployment Checklist (per release)

Copy this into the deploy ticket; tick before flipping prod:

[ ] CI green on target SHA (01-test + 02-build-push, all matrix entries)
[ ] Image scan report: zero High/Critical CVEs (Woodpecker artifact)
[ ] Dependency audit (`dotnet list package --vulnerable`): zero High/Critical
[ ] Image SHA tag exists in registry: docker manifest inspect $REGISTRY_HOST/azaion/admin:<sha-tag>-arm
[ ] DB migration (if any) reviewed by backend lead; rollback SQL recorded if reversible
[ ] secrets/staging.env or secrets/production.env (whichever matches the target) decrypts cleanly on the target host
[ ] Health endpoints respond 200 in current production (sanity baseline)
[ ] Monitoring alerts armed (no silenced alerts that would mask the deploy)
[ ] Rollback target SHA recorded
[ ] Stakeholders notified (Slack #azaion-ops, expected window)
[ ] On-call engineer reachable for the next 30 min
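
The registry item of the checklist can be scripted so that it yields a clear pass or fail, as in the sketch below. The function name image_exists is hypothetical; REGISTRY_HOST and the azaion/admin repository path come from the checklist line itself.

```shell
# Hypothetical pre-flight check: confirm the image tag exists in the registry.
# Usage: image_exists <sha-tag>-arm
image_exists() {
    tag=$1
    ref="$REGISTRY_HOST/azaion/admin:$tag"
    # docker manifest inspect exits non-zero when the manifest cannot be fetched.
    if docker manifest inspect "$ref" >/dev/null 2>&1; then
        echo "found $ref"
    else
        echo "missing $ref"
        return 1
    fi
}
```

A failure here can also mean the host is not logged in to the registry, so a "missing" result is worth confirming before assuming the build did not push.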

7. Drifts Logged Here

| ID | Severity | Description | Carried Forward |
|---|---|---|---|
| N (NEW) | Medium | No zero-downtime deploy strategy: single-container topology produces ~30 s gap per deploy | Future cycle: blue-green via dual ports + Nginx upstream switch |

8. Self-verification

  • Deployment strategy chosen (stop-and-start) with explicit rationale and acknowledgement that zero-downtime is deferred (Drift N).
  • Graceful-shutdown contract specified (HostOptions.ShutdownTimeout 30 s, docker stop -t 40).
  • Health checks defined (liveness, readiness, startup) with exact response contract and Cache-Control header.
  • Rollback trigger criteria + 6-step rollback procedure + post-mortem template requirement.
  • Deployment checklist complete (11 items) and explicitly references the SHA tag (Drift A resolution from Step 3).