# Azaion Admin API: Deployment Procedures

**Date**: 2026-05-13 · **Cycle**: 1 · **Status**: planning artifact (the executable scripts referenced here land in Step 7).

## 1. Deployment Strategy

**Pattern**: **stop-and-start with pre-pulled image** (single-container, single-host).

**Rationale**:

- Topology is one Docker host per environment running one `azaion.api` container behind Nginx. There is no orchestrator, no replica set, no load balancer beyond Nginx itself.
- Blue-green requires either two listening ports + Nginx switch, or two hosts. Cycle-1 budget does not include either. Recorded as **Drift N** for a future cycle.
- Rolling/canary is meaningless with one replica.
- The realistic SLO for cycle 1 is **brief (< 30 s) downtime per deploy**, mitigated by deploying in low-traffic windows. The procedure pre-pulls the image so the actual stop-start gap is the time it takes for the new container to clear `/health/ready`, not image-download time.

**Zero-downtime in production**: not achieved in cycle 1. Documented and acknowledged.

### Graceful Shutdown

| Signal | Behavior |
|--------|----------|
| `SIGTERM` (`docker stop`) | ASP.NET Core stops accepting new requests, waits up to `HostOptions.ShutdownTimeout` for in-flight requests, then exits. |
| `ShutdownTimeout` | Set to **30 seconds** in `Program.cs` (`services.Configure<HostOptions>(o => o.ShutdownTimeout = TimeSpan.FromSeconds(30))`). |
| `docker stop` grace | Use `docker stop -t 40` so Docker waits 40 s before sending `SIGKILL`, leaving 10 s of headroom over the app's 30 s. |

This wiring lands in Step 7 (Dockerfile + small `Program.cs` change).

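
The relationship between the two timeouts can be kept in one place so they cannot drift apart. The following is a minimal sketch, not the project's script: the helper names and the `DOCKER` override (used for dry runs) are assumptions made for illustration.

```shell
# Sketch only: DOCKER and both helper names are assumptions of this example.

# Derive the `docker stop -t` value from the app timeout plus headroom.
stop_grace_seconds() {
  app_timeout=$1   # HostOptions.ShutdownTimeout, in seconds
  headroom=$2      # extra seconds before Docker escalates to SIGKILL
  echo $((app_timeout + headroom))
}

# Stop the container with the derived grace period.
# Set DOCKER=echo to print the command instead of running it.
graceful_stop() {
  container=$1
  grace=$(stop_grace_seconds 30 10)
  ${DOCKER:-docker} stop -t "$grace" "$container"
}
```

With `DOCKER=echo`, calling `graceful_stop azaion.api` prints `stop -t 40 azaion.api`, which matches the 30 s + 10 s contract in the table above.
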
### Database Migration Ordering

Conventions inherited from the Environment Strategy (§4 of `environment_strategy.md`):

1. Apply the new `env/db/NN_*.sql` file **before** deploying the matching code. Because every migration is backward-compatible, the old container keeps working against the new schema.
2. After the deploy is healthy, optionally apply a follow-up `NN+1_*.sql` for cleanup (e.g., dropping a tombstone column once no code reads it).
3. Production migrations run on staging first and soak ≥ 24 h before promotion.
4. The operator runs each migration with `psql -h <host> -p 4312 -U azaion_superadmin -d azaion -f env/db/NN_xxx.sql` and logs the run in the deploy ticket.

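
A thin wrapper around the `psql` call from rule 4 could make a failed migration harder to miss. This is a sketch under assumptions: the function name and the `PSQL` override are hypothetical, and the two extra flags (`-v ON_ERROR_STOP=1`, `--single-transaction`) are standard `psql` options that the original command does not use. `--single-transaction` does not suit migrations containing statements that cannot run inside a transaction.

```shell
# Sketch only: apply_migration and the PSQL override are assumptions of this example.
apply_migration() {
  host=$1
  file=$2
  if [ ! -f "$file" ]; then
    echo "migration file not found: $file" >&2
    return 1
  fi
  # ON_ERROR_STOP=1 makes psql exit non-zero on the first failing statement;
  # --single-transaction rolls the whole file back if any statement fails.
  ${PSQL:-psql} -v ON_ERROR_STOP=1 --single-transaction \
    -h "$host" -p 4312 -U azaion_superadmin -d azaion -f "$file"
}
```

Setting `PSQL=echo` prints the argument list instead of connecting, which is a convenient way to review the command before pasting it into the deploy ticket.
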
## 2. Health Checks

These endpoints are introduced in Step 7 (anonymous, internal-only; see Observability §3.1 / §7).

| Check | Type | Endpoint | Interval | Failure threshold | Action |
|-------|------|----------|----------|-------------------|--------|
| Docker liveness | HTTP GET (in-container, via `Dockerfile` `HEALTHCHECK`) | `/health/live` | 30 s | 3 consecutive | Docker marks container `unhealthy`; **does NOT auto-restart** in cycle 1 (no `--restart=on-failure` policy in `start-container.sh`) |
| Nginx readiness | HTTP GET (upstream `health_check`) | `/health/ready` | 5 s | 3 consecutive | Nginx pulls upstream → 503 to clients (no silent traffic loss) |
| Deploy-script startup | HTTP GET (polling) | `/health/ready` | 2 s | up to 30 attempts (~60 s) | `scripts/deploy.sh` aborts and triggers rollback |

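
The deploy-script startup row (2 s interval, up to 30 attempts) can be sketched as a small polling loop. This is an illustration, not the contents of `scripts/deploy.sh`: the function name and the `PROBE` override are assumptions, and the URL passed in is whatever address the container is reachable on.

```shell
# Sketch only: wait_ready and the PROBE override are assumptions of this example.
wait_ready() {
  url=$1
  attempts=${2:-30}   # 30 attempts x 2 s is roughly the 60 s budget
  interval=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f turns HTTP >= 400 into a non-zero exit; --max-time bounds each probe.
    if ${PROBE:-curl -fsS -o /dev/null --max-time 2} "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$interval"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}
```

A non-zero return is the signal on which the calling script would abort and start the rollback.
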
### Health Check Response Contract

| Endpoint | 200 condition | 5xx condition | Headers |
|----------|---------------|---------------|---------|
| `/health/live` | Process is responsive (always; short-circuits before any dependency call) | Never returns 5xx unless the process is wedged | `Cache-Control: no-store` |
| `/health/ready` | `SELECT 1` succeeds against both `AzaionDb` (reader) and `AzaionDbAdmin` (writer) within a 2 s timeout | Either DB query fails or times out → 503 | `Cache-Control: no-store` |

`/health/ready` does NOT exercise the filesystem (`Content/`, `logs/`), because a transient `EACCES` there should not pull the container out of the Nginx upstream. Filesystem failures surface in metrics (`resource_upload_failures_total`) and alerts (Observability §5) instead.

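
The response contract can be checked from the command line by inspecting the response headers. The sketch below assumes the headers arrive on stdin in the form `curl -sS -D - -o /dev/null <url>` prints them; the function name is hypothetical.

```shell
# Sketch only: check_contract is an assumption of this example.
# Reads raw response headers on stdin and verifies the documented contract:
# status 200 and a Cache-Control header containing no-store.
check_contract() {
  headers=$(tr -d '\r')
  # The status code is the second field of the status line (HTTP/1.1 200 OK).
  status=$(printf '%s\n' "$headers" | awk 'NR == 1 { print $2 }')
  if [ "$status" != "200" ]; then
    echo "unexpected status: ${status:-none}" >&2
    return 1
  fi
  if ! printf '%s\n' "$headers" | grep -qi '^cache-control:.*no-store'; then
    echo "missing Cache-Control: no-store" >&2
    return 1
  fi
}
```

Piping the header dump of either health endpoint into `check_contract` then gives a pass/fail exit status that a smoke test can assert on.
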
## 3. Staging Deployment

Triggered manually by the operator from the staging host or from a Woodpecker manual workflow.

```
1. Pre-flight: operator on local machine
   a. Confirm CI green for the target SHA on the `stage` branch.
   b. Run `dotnet list package --vulnerable` against the target commit (CI does this too; the local run is a sanity check).
   c. Confirm any DB migration in env/db/ for this SHA has been reviewed.

2. DB migration (if any): operator SSH to staging host
   psql -h localhost -p 4312 -U azaion_superadmin -d azaion -f env/db/NN_<desc>.sql

3. Deploy: operator runs scripts/deploy.sh on staging host
   ENV=staging ./scripts/deploy.sh <sha-tag>
   # script: docker pull → stop -t 40 → rm → run --env-file .env → poll /health/ready

4. Verify: automatic in scripts/deploy.sh
   - /health/ready returns 200 within 60 s
   - Container `docker inspect` healthcheck status is `healthy`
   - `docker logs --tail=80` contains no `Error` lines from the last 60 s

5. Smoke tests: operator runs from local machine
   BASE_URL=https://stage.admin.azaion.com ./scripts/smoke.sh
   # 6 critical-path checks: /login (admin), GET /users (paginates), GET /classes,
   # GET /resources/list, /health/ready, JWT lifecycle.

6. Soak: observe dashboard for ≥ 24 h before promoting
```

If any step fails → §5 Rollback.

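
The command sequence summarized in the step 3 comment (pull, stop, rm, run) might look roughly like the sketch below. It is not the real `scripts/deploy.sh`: the function names and the `DOCKER` override are assumptions, the readiness poll is left out, and any port or volume flags the real `docker run` needs are omitted because this document does not state them.

```shell
# Sketch only: deploy_sequence, logs_clean and DOCKER are assumptions of this example.
deploy_sequence() {
  image=$1            # full image reference including the SHA tag
  name=azaion.api
  d=${DOCKER:-docker}
  $d pull "$image" || return 1   # pre-pull while the old container still serves
  $d stop -t 40 "$name" || echo "no running container to stop" >&2
  $d rm "$name" || echo "no container to remove" >&2
  $d run -d --name "$name" --env-file .env "$image" || return 1
}

# Succeeds when the log text on stdin has no line containing "Error".
logs_clean() {
  ! grep -q 'Error'
}
```

For the log check in step 4, something like `docker logs --since 60s --tail=80 azaion.api 2>&1 | logs_clean` would feed the recent log lines into the filter.
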
## 4. Production Deployment

```
1. Approval (required): ops lead OR backend lead
   - Reference the staging soak completion timestamp.
   - Reference the cycle's deploy ticket (AZ-NNN) and CI run URL.

2. Pre-deploy checks (operator on local machine)
   [ ] Staging smoke tests passed (§3 step 5).
   [ ] Staging soaked ≥ 24 h with no Critical/High alerts.
   [ ] CI green for the same SHA on the `main` branch.
   [ ] Image-scan report for the SHA shows zero High/Critical (Woodpecker artifact).
   [ ] DB migration plan recorded in the deploy ticket.
   [ ] Rollback target SHA is recorded (the SHA currently running in prod: `docker inspect azaion.api | jq -r '.[0].Config.Labels."org.opencontainers.image.revision"'`).
   [ ] On-call engineer is reachable for the next 30 min.

3. DB migration (if any): operator SSH to prod host
   psql -h localhost -p 4312 -U azaion_superadmin -d azaion -f env/db/NN_<desc>.sql

4. Deploy: operator runs scripts/deploy.sh on prod host
   ENV=production ./scripts/deploy.sh <sha-tag>

5. Verify: automatic + operator
   - /health/ready returns 200 within 60 s.
   - Container `docker inspect` healthcheck status is `healthy`.
   - Operator logs in via `/login` with admin credentials and runs a known user-list query.

6. Monitor: operator observes dashboards for ≥ 15 minutes
   - Error rate (5xx) stays < 1%.
   - P95 latency stays within 2× cycle-1 baseline (66 ms /login, 305 ms /users).
   - No Critical or High alerts fire.

7. Finalize
   - Update deploy ticket with start/stop timestamps and image SHA.
   - Post `:white_check_mark: prod deploy: <sha-tag>` to Slack #azaion-ops.
```

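
The monitoring thresholds in step 6 reduce to simple arithmetic: 2 × 66 ms = 132 ms for `/login`, 2 × 305 ms = 610 ms for `/users`, and a 5xx share strictly below 1%. A sketch of those checks follows; the function names are hypothetical, and the observed values are assumed to be read off the dashboard by the operator.

```shell
# Sketch only: both helper names are assumptions of this example.

# Succeeds when the observed P95 (ms) is within 2x the baseline (ms).
within_latency_budget() {
  observed_ms=$1
  baseline_ms=$2
  [ "$observed_ms" -le $((2 * baseline_ms)) ]
}

# Succeeds when the 5xx share is strictly below 1%.
# errors / total < 1/100 is rewritten as errors * 100 < total to stay in integers.
error_rate_ok() {
  errors=$1
  total=$2
  [ "$total" -gt 0 ] && [ $((errors * 100)) -lt "$total" ]
}
```

For example, `within_latency_budget 132 66` succeeds while `within_latency_budget 133 66` fails, and `error_rate_ok 10 1000` fails because exactly 1% is not below the limit.
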
## 5. Rollback Procedures

### Trigger Criteria (any one)

- `/health/ready` fails for ≥ 60 s after deploy.
- Error rate (5xx) > 5% for 5 minutes within the 15-minute observation window.
- Any Critical alert fires within 15 minutes of deploy.
- Operator's manual call (e.g. business-impacting bug surfaced by smoke tests).

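
Because any single criterion is enough, the decision is a plain logical OR over the four inputs. The sketch below shows that shape; the function name and its argument order are assumptions, and how each input is measured is left to the monitoring stack.

```shell
# Sketch only: should_rollback and its argument order are assumptions of this example.
should_rollback() {
  ready_fail_s=$1      # seconds /health/ready has been failing since the deploy
  err_pct_5m=$2        # 5xx rate (%) sustained over a 5-minute window
  critical_alert=$3    # 1 if a Critical alert fired within 15 minutes of deploy
  manual=$4            # 1 if the operator calls for a rollback
  [ "$ready_fail_s" -ge 60 ] && return 0
  # awk handles fractional percentages such as 5.5
  awk -v r="$err_pct_5m" 'BEGIN { exit !(r > 5) }' && return 0
  [ "$critical_alert" -eq 1 ] && return 0
  [ "$manual" -eq 1 ] && return 0
  return 1
}
```

Note that the error-rate comparison is strict, so a sustained rate of exactly 5% does not trigger on its own.
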
### Rollback Steps (≤ 5 minutes)

```
1. Capture state: operator on the affected host
   docker logs azaion.api --tail=500 > /var/log/azaion/rollback-$(date -u +%Y%m%dT%H%M%SZ).log
   docker inspect azaion.api > /var/log/azaion/rollback-$(date -u +%Y%m%dT%H%M%SZ).inspect.json

2. Re-deploy previous SHA: operator
   ENV=production ./scripts/deploy.sh <previous-sha-tag>
   # The SHA tag was recorded during the pre-deploy checks (§4 step 2).

3. DB rollback (if a migration was applied this deploy)
   - If reversible (drop column, drop index): run the agreed reverse SQL recorded in the deploy ticket.
   - If irreversible (added column, table): leave the schema as-is. The previous code is backward-compatible (rule §1.3), so the extra schema is inert.
   - If data was migrated destructively: STOP, escalate to backend lead. Restore from backup if necessary.

4. Verify: same checks as the deploy Verify step (§4 step 5)

5. Notify: operator posts ":rotating_light: prod rollback: <previous-sha>" to Slack #azaion-ops with the deploy ticket link.

6. Post-mortem: schedule within 24 hours; required artifact: timeline + root cause + prevention.
```

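
Step 1 calls `date` twice, so the two capture files can end up with different timestamps if the clock ticks between the commands. A sketch that computes the timestamp once is shown below; the function name and the `DOCKER` override are assumptions. It also redirects stderr for `docker logs`, since the container's stderr stream is otherwise not written to the file.

```shell
# Sketch only: capture_state and the DOCKER override are assumptions of this example.
capture_state() {
  container=$1
  dir=${2:-/var/log/azaion}
  # One timestamp shared by both files so they can be paired later.
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  mkdir -p "$dir" || return 1
  ${DOCKER:-docker} logs --tail=500 "$container" > "$dir/rollback-$ts.log" 2>&1
  ${DOCKER:-docker} inspect "$container" > "$dir/rollback-$ts.inspect.json"
  # Print the timestamp so the operator can paste it into the deploy ticket.
  echo "$ts"
}
```

Calling `capture_state azaion.api` writes both files under `/var/log/azaion` and prints the shared timestamp.
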
### Post-Mortem (required)

Template lives in `_docs/06_metrics/postmortem_template.md` (added in Step 7). Required sections:

- Timeline (UTC), with deploy SHA and rollback SHA.
- Root cause (one sentence + evidence link).
- Detection: how was it caught? Which alert? Which probe? Which user report?
- Repair: what fixed it?
- Prevention: concrete change (test, alert, procedure step) with an owner and a target date.

## 6. Deployment Checklist (per release)

Copy this into the deploy ticket; tick before flipping `prod`:

```
[ ] CI green on target SHA (01-test + 02-build-push, all matrix entries)
[ ] Image scan report: zero High/Critical CVEs (Woodpecker artifact)
[ ] Dependency audit (`dotnet list package --vulnerable`): zero High/Critical
[ ] Image SHA tag exists in registry: docker manifest inspect $REGISTRY_HOST/azaion/admin:<sha-tag>-arm
[ ] DB migration (if any) reviewed by backend lead; rollback SQL recorded if reversible
[ ] secrets/staging.env / secrets/production.env decrypts cleanly on the target host
[ ] Health endpoints respond 200 in current production (sanity baseline)
[ ] Monitoring alerts armed (no silenced alerts that would mask the deploy)
[ ] Rollback target SHA recorded
[ ] Stakeholders notified (Slack #azaion-ops, expected window)
[ ] On-call engineer reachable for the next 30 min
```

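
The registry check in the fourth item lends itself to a small helper that builds the image reference once and fails fast when `REGISTRY_HOST` is unset. This is a sketch; the function names and the `DOCKER` override are assumptions, while the reference format is taken from the checklist line itself.

```shell
# Sketch only: image_ref, image_exists and DOCKER are assumptions of this example.

# Build the image reference in the format used by the checklist.
image_ref() {
  tag=$1
  echo "${REGISTRY_HOST:?REGISTRY_HOST must be set}/azaion/admin:${tag}-arm"
}

# Succeeds when the registry can resolve a manifest for the tag.
image_exists() {
  ref=$(image_ref "$1") || return 1
  ${DOCKER:-docker} manifest inspect "$ref" > /dev/null
}
```

Running `image_exists <sha-tag>` before the deploy window confirms that the tag recorded in the ticket is actually pullable.
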
## 7. Drifts Logged Here

| ID | Severity | Description | Carried Forward |
|----|----------|-------------|-----------------|
| N (NEW) | Medium | No zero-downtime deploy strategy: single-container topology produces ~30 s gap per deploy | Future cycle: blue-green via dual ports + Nginx upstream switch |

## 8. Self-verification

- [x] Deployment strategy chosen (stop-and-start) with explicit rationale and acknowledgement that zero-downtime is deferred (Drift N).
- [x] Graceful-shutdown contract specified (`HostOptions.ShutdownTimeout` 30 s, `docker stop -t 40`).
- [x] Health checks defined (liveness, readiness, startup) with exact response contract and Cache-Control header.
- [x] Rollback trigger criteria + 6-step rollback procedure + post-mortem template requirement.
- [x] Deployment checklist complete (11 items) and explicitly references the SHA tag (Drift A resolution from Step 3).