Files
admin/_docs/04_deploy/deployment_procedures.md
T
Oleksandr Bezdieniezhnykh c7b297de83
ci/woodpecker/push/01-test Pipeline failed
ci/woodpecker/push/02-build-push unknown status
refactor: remove deploy.cmd and update Dockerfile for health checks
- Deleted the deploy.cmd script as it was no longer needed.
- Updated Dockerfile to include curl for health checks and added a non-root user for improved security.
- Modified health check command to use curl for better reliability.
- Adjusted docker-compose.test.yml to reflect changes in health check configuration.
- Cleaned up appsettings.json and removed unused configuration properties.
- Removed Resource entity and related requests from the codebase as part of the architectural shift.
- Updated documentation to reflect the removal of hardware binding and related endpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-13 08:47:21 +03:00

196 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Azaion Admin API — Deployment Procedures
**Date**: 2026-05-13 · **Cycle**: 1 · **Status**: planning artifact (the executable scripts referenced here land in Step 7).
## 1. Deployment Strategy
**Pattern**: **stop-and-start with pre-pulled image** (single-container, single-host).
**Rationale**:
- Topology is one Docker host per environment running one `azaion.api` container behind Nginx. There is no orchestrator, no replica set, no load balancer beyond Nginx itself.
- Blue-green requires either two listening ports + Nginx switch, or two hosts. Cycle-1 budget does not include either. Recorded as **Drift N** for a future cycle.
- Rolling/canary is meaningless with one replica.
- The realistic SLO for cycle 1 is **brief (< 30 s) downtime per deploy**, mitigated by deploying in low-traffic windows. The procedure pre-pulls the image so the actual stop-start gap is the time it takes for the new container to clear `/health/ready`, not image-download time.
**Zero-downtime in production**: not achieved in cycle 1. Documented and acknowledged.
### Graceful Shutdown
| Signal | Behavior |
|--------|----------|
| `SIGTERM` (`docker stop`) | ASP.NET Core stops accepting new requests, waits up to `HostOptions.ShutdownTimeout` for in-flight requests, then exits. |
| `ShutdownTimeout` | Set to **30 seconds** in `Program.cs` (`services.Configure<HostOptions>(o => o.ShutdownTimeout = TimeSpan.FromSeconds(30))`). |
| `docker stop` grace | Use `docker stop -t 40` so Docker waits 40 s before sending `SIGKILL`, leaving 10 s of headroom over the app's 30 s. |
This wiring lands in Step 7 (Dockerfile + small `Program.cs` change).
### Database Migration Ordering
Conventions inherited from the Environment Strategy (§4 of `environment_strategy.md`):
1. Apply the new `env/db/NN_*.sql` file **before** deploying the matching code. Because every migration is backward-compatible, the old container keeps working against the new schema.
2. After the deploy is healthy, optionally apply a follow-up `NN+1_*.sql` for cleanup (e.g., dropping a tombstone column once no code reads it).
3. Production migrations run on staging first and soak ≥ 24 h before promotion.
4. Migration is performed by the operator with `psql -h <host> -p 4312 -U azaion_superadmin -d azaion -f env/db/NN_xxx.sql`. Logged in the deploy ticket.
## 2. Health Checks
These endpoints are introduced in Step 7 (anonymous, internal-only — see Observability §3.1 / §7).
| Check | Type | Endpoint | Interval | Failure threshold | Action |
|-------|------|----------|----------|-------------------|--------|
| Docker liveness | HTTP GET (in-container, via `Dockerfile` `HEALTHCHECK`) | `/health/live` | 30 s | 3 consecutive | Docker marks container `unhealthy`; **does NOT auto-restart** in cycle 1 (no `--restart=on-failure` policy in `start-container.sh`) |
| Nginx readiness | HTTP GET (upstream `health_check`) | `/health/ready` | 5 s | 3 consecutive | Nginx pulls upstream → 503 to clients (no silent traffic loss) |
| Deploy-script startup | HTTP GET (polling) | `/health/ready` | 2 s | up to 30 attempts (~60 s) | `scripts/deploy.sh` aborts and triggers rollback |
### Health Check Response Contract
| Endpoint | 200 condition | 5xx condition | Headers |
|----------|---------------|---------------|---------|
| `/health/live` | Process is responsive (always — short-circuits before any dependency call) | Never returns 5xx unless the process is wedged | `Cache-Control: no-store` |
| `/health/ready` | `SELECT 1` succeeds against both `AzaionDb` (reader) and `AzaionDbAdmin` (writer) within a 2 s timeout | Either DB query fails or times out → 503 | `Cache-Control: no-store` |
`/health/ready` does NOT exercise the filesystem (`Content/`, `logs/`) — a transient `EACCES` there should not yank the upstream. It surfaces in metrics (`resource_upload_failures_total`) and alerts (Observability §5) instead.
## 3. Staging Deployment
Triggered manually by the operator from the staging host or from a Woodpecker manual workflow.
```
1. Pre-flight — operator on local machine
a. Confirm CI green for the target SHA on the `stage` branch.
b. Run `dotnet list package --vulnerable` against the target commit (CI does this too — local is a sanity check).
c. Confirm any DB migration in env/db/ for this SHA has been reviewed.
2. DB migration (if any) — operator SSH to staging host
psql -h localhost -p 4312 -U azaion_superadmin -d azaion -f env/db/NN_<desc>.sql
3. Deploy — operator runs scripts/deploy.sh on staging host
ENV=staging ./scripts/deploy.sh <sha-tag>
# script: docker pull → stop -t 40 → rm → run --env-file .env → poll /health/ready
4. Verify — automatic in scripts/deploy.sh
- /health/ready returns 200 within 60 s
- Container `docker inspect` healthcheck status is `healthy`
- `docker logs --tail=80` contains no `Error` lines from the last 60 s
5. Smoke tests — operator runs from local machine
BASE_URL=https://stage.admin.azaion.com ./scripts/smoke.sh
# 6 critical-path checks: /login (admin), GET /users (paginates), GET /classes,
# GET /resources/list, /health/ready, JWT lifecycle.
6. Soak — observe dashboard for ≥ 24 h before promoting
```
If any step fails → §5 Rollback.
## 4. Production Deployment
```
1. Approval — required: ops lead OR backend lead
- Reference the staging soak completion timestamp.
- Reference the cycle's deploy ticket (AZ-NNN) and CI run URL.
2. Pre-deploy checks (operator on local machine)
[ ] Staging smoke tests passed (§3 step 5).
[ ] Staging soaked ≥ 24 h with no Critical/High alerts.
[ ] CI green for the same SHA on the `main` branch.
[ ] Image-scan report for the SHA shows zero High/Critical (Woodpecker artifact).
[ ] DB migration plan recorded in the deploy ticket.
[ ] Rollback target SHA is recorded (the SHA currently running in prod — `docker inspect azaion.api | jq -r '.[0].Config.Labels."org.opencontainers.image.revision"'`).
[ ] On-call engineer is reachable for the next 30 min.
3. DB migration (if any) — operator SSH to prod host
psql -h localhost -p 4312 -U azaion_superadmin -d azaion -f env/db/NN_<desc>.sql
4. Deploy — operator runs scripts/deploy.sh on prod host
ENV=production ./scripts/deploy.sh <sha-tag>
5. Verify — automatic + operator
- /health/ready returns 200 within 60 s.
- Container `docker inspect` healthcheck status `healthy`.
- Operator hits `/login` with admin creds and a known user list query.
6. Monitor — operator observes dashboards for ≥ 15 minutes
- Error rate (5xx) stays < 1%.
- P95 latency stays within 2× cycle-1 baseline (66 ms /login, 305 ms /users).
- No Critical or High alerts fire.
7. Finalize
- Update deploy ticket with start/stop timestamps and image SHA.
- Post `:white_check_mark: prod deploy: <sha-tag>` to Slack #azaion-ops.
```
## 5. Rollback Procedures
### Trigger Criteria (any one)
- `/health/ready` fails for ≥ 60 s after deploy.
- Error rate (5xx) > 5 % for 5 minutes within the 15-minute observation window.
- Any Critical alert fires within 15 minutes of deploy.
- Operator's manual call (e.g. business-impacting bug surfaced by smoke tests).
### Rollback Steps (≤ 5 minutes)
```
1. Capture state — operator on the affected host
docker logs azaion.api --tail=500 > /var/log/azaion/rollback-$(date -u +%Y%m%dT%H%M%SZ).log
docker inspect azaion.api > /var/log/azaion/rollback-$(date -u +%Y%m%dT%H%M%SZ).inspect.json
2. Re-deploy previous SHA — operator
ENV=production ./scripts/deploy.sh <previous-sha-tag>
# The SHA tag was recorded in step 2 of the deploy procedure.
3. DB rollback (if a migration was applied this deploy)
- If reversible (drop column, drop index): run the agreed reverse SQL recorded in the deploy ticket.
- If irreversible (added column, table): leave the schema as-is — the previous code is backward-compatible (rule §1.3) so the extra schema is inert.
- If data was migrated destructively: STOP, escalate to backend lead. Restore from backup if necessary.
4. Verify — same checks as deploy §5
5. Notify — operator posts ":rotating_light: prod rollback: <previous-sha>" to Slack #azaion-ops with the deploy ticket link.
6. Post-mortem — schedule within 24 hours; required artifact: timeline + root cause + prevention.
```
### Post-Mortem (required)
Template lives in `_docs/06_metrics/postmortem_template.md` (added in Step 7). Required sections:
- Timeline (UTC), with deploy SHA and rollback SHA.
- Root cause (one sentence + evidence link).
- Detection — how was it caught? Which alert? Which probe? Which user report?
- Repair — what fixed it?
- Prevention — concrete change (test, alert, procedure step) with an owner and a target date.
## 6. Deployment Checklist (per release)
Copy this into the deploy ticket; tick before flipping `prod`:
```
[ ] CI green on target SHA (01-test + 02-build-push, all matrix entries)
[ ] Image scan report: zero High/Critical CVEs (Woodpecker artifact)
[ ] Dependency audit (`dotnet list package --vulnerable`): zero High/Critical
[ ] Image SHA tag exists in registry: docker manifest inspect $REGISTRY_HOST/azaion/admin:<sha-tag>-arm
[ ] DB migration (if any) reviewed by backend lead; rollback SQL recorded if reversible
[ ] secrets/staging.env / secrets/production.env decrypts cleanly on the target host
[ ] Health endpoints respond 200 in current production (sanity baseline)
[ ] Monitoring alerts armed (no silenced alerts that would mask the deploy)
[ ] Rollback target SHA recorded
[ ] Stakeholders notified (Slack #azaion-ops, expected window)
[ ] On-call engineer reachable for the next 30 min
```
## 7. Drifts Logged Here
| ID | Severity | Description | Carried Forward |
|----|----------|-------------|-----------------|
| N (NEW) | Medium | No zero-downtime deploy strategy — single-container topology produces ~30 s gap per deploy | Future cycle: blue-green via dual ports + Nginx upstream switch |
## 8. Self-verification
- [x] Deployment strategy chosen (stop-and-start) with explicit rationale and acknowledgement that zero-downtime is deferred (Drift N).
- [x] Graceful-shutdown contract specified (`HostOptions.ShutdownTimeout` 30 s, `docker stop -t 40`).
- [x] Health checks defined (liveness, readiness, startup) with exact response contract and Cache-Control header.
- [x] Rollback trigger criteria + 6-step rollback procedure + post-mortem template requirement.
- [x] Deployment checklist complete (10 items) and explicitly references the SHA tag (Drift A resolution from Step 3).