# Azaion AI Training — Deployment Procedures

## Deployment Strategy

**Pattern**: Stop-and-replace on a single GPU server

**Rationale**: The system runs on one dedicated GPU server. Training takes days — there is no "zero-downtime" concern for the training process. The annotation queue can tolerate brief restarts (the queue offset is persisted, and messages are replayed from the last offset).

### Component Behavior During Deploy

| Component | Deploy Impact | Recovery |
|-----------|--------------|----------|
| Training Pipeline | Must finish the current run or be stopped manually. Never interrupted mid-training — checkpoints are saved every epoch. | Resume from last checkpoint (`resume_training`) |
| Annotation Queue | Brief restart (< 30 seconds). Messages accumulate in RabbitMQ during downtime. | Resumes from persisted offset in `offset.yaml` |

### Graceful Shutdown

- **Training**: not stopped by deploy scripts — training runs for days and is managed independently. A deploy only updates images/code for the *next* training run.
- **Annotation Queue**: `docker stop` with a 30-second grace period → SIGTERM → process exits → container replaced with the new image.

## Health Checks

No HTTP endpoints — these are batch processes and queue consumers.
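As an illustration of what such host-side checks can look like, here is a sketch of two of them. The helper names are hypothetical and the real `scripts/health-check.sh` may differ; the disk thresholds match the ones used elsewhere in this document (warn above 80%, critical above 95%).

```shell
check_disk() {
  # Classify a disk-usage percentage (a bare number, e.g. "87").
  local pcent=$1
  if [ "$pcent" -gt 95 ]; then echo "CRITICAL"
  elif [ "$pcent" -gt 80 ]; then echo "WARN"
  else echo "OK"
  fi
}

check_container() {
  # True only if the named container reports State.Running=true.
  docker inspect --format='{{.State.Running}}' "$1" 2>/dev/null | grep -q true
}
```

Both checks are cheap enough to run from a cron entry every few minutes, matching the intervals in the table below.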
Health is verified by:

| Check | Method | Target | Interval | Failure Action |
|-------|--------|--------|----------|----------------|
| Annotation Queue alive | `docker inspect --format='{{.State.Running}}'` | annotation-queue container | 5 min (cron) | Restart container |
| RabbitMQ reachable | TCP connect to `$RABBITMQ_HOST:$RABBITMQ_PORT` | RabbitMQ server | 5 min (cron) | Alert, check network |
| GPU available | `nvidia-smi` exit code | NVIDIA driver | 5 min (cron) | Alert, check driver |
| Disk space | `df /azaion/ --output=pcent` | Filesystem | 5 min (cron) | Alert if > 80%, critical if > 95% |
| Queue offset advancing | Compare `offset.yaml` value to previous check | Annotation queue progress | 30 min | Alert if stale and queue has messages |

All checks are performed by `scripts/health-check.sh`.

## Production Deployment

### Pre-Deploy Checklist

- [ ] All CI tests pass on `dev` branch
- [ ] Security scan clean (zero critical/high CVEs)
- [ ] Docker images built and pushed to registry
- [ ] `.env` on target server is up to date with any new variables
- [ ] `/azaion/` directory tree exists with correct permissions
- [ ] No training run is currently active (or training will not be restarted this deploy)
- [ ] NVIDIA driver and Docker with GPU support are installed on the target

### Deploy Steps

1. SSH to the GPU server
2. Pull new Docker images: `scripts/pull-images.sh`
3. Stop the annotation queue: `scripts/stop-services.sh`
4. Generate `config.yaml` from the `.env` template
5. Start services: `scripts/start-services.sh`
6. Verify health: `scripts/health-check.sh`
7. Confirm the annotation queue is consuming messages (check that the offset advances)

All steps are orchestrated by `scripts/deploy.sh`.
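The deploy steps above can be sketched as a single script. This is an assumed structure, not the actual contents of `scripts/deploy.sh`: only the script names come from this document, while `render-config.sh` and the `DRY_RUN` guard are illustrative.

```shell
#!/usr/bin/env bash
# Sketch of the flow inside scripts/deploy.sh (assumed internals).
set -euo pipefail
DRY_RUN=${DRY_RUN:-1}   # default to dry-run for illustration; set to 0 to execute

run() {
  # Print each step, then execute it unless dry-running.
  echo ">> $*"
  [ "$DRY_RUN" = 1 ] || "$@"
}

# Save current image tags so `deploy.sh --rollback` can restore them later.
run sh -c "docker ps --format '{{.Image}}' > scripts/.previous-tags"

run scripts/pull-images.sh     # step 2: pull new images
run scripts/stop-services.sh   # step 3: stop annotation queue (30 s grace)
run scripts/render-config.sh   # step 4: config.yaml from .env (hypothetical name)
run scripts/start-services.sh  # step 5: start services on new images
run scripts/health-check.sh    # step 6: verify; a non-zero exit aborts (set -e)
```

With `set -e`, any failing step stops the deploy before the next one runs, which is why the health check sits last rather than in a separate manual step.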
### Post-Deploy Verification

- Check `docker ps` — the annotation-queue container is running
- Check `docker logs annotation-queue --tail 20` — no errors
- Check `offset.yaml` — the offset is advancing (queue is consuming)
- Check disk space — adequate for continued operation

## Rollback Procedures

### Trigger Criteria

- Annotation queue crashes repeatedly after deploy (restart loop)
- Queue messages are being dropped or corrupted
- `config.yaml` generation failed (missing env vars)
- New code has a bug affecting annotation processing

### Rollback Steps

1. Run `scripts/deploy.sh --rollback`
   - Reads the previous image tags from `scripts/.previous-tags` (saved during deploy)
   - Stops the current containers
   - Starts containers with the previous image tags
2. Verify health: `scripts/health-check.sh`
3. Check that the annotation queue is consuming correctly
4. Investigate the root cause of the failed deploy

### Training Rollback

Training is not managed by the deploy scripts. If a new training run produces bad results:

1. The previous `best.pt` model is still available in `/azaion/models/` (dated directories)
2. Roll back by pointing `config.yaml` to the previous model
3. No container restart is needed — training is a batch job started manually

## Deployment Checklist (Quick Reference)

```
Pre-deploy:
  □ CI green on dev branch
  □ Images built and pushed
  □ .env updated on server (if new vars added)
  □ No active training run (if training container is being updated)

Deploy:
  □ SSH to server
  □ Run scripts/deploy.sh
  □ Verify health-check.sh passes

Post-deploy:
  □ docker ps shows containers running
  □ docker logs show no errors
  □ Queue offset advancing
  □ Disk space adequate

If problems:
  □ Run scripts/deploy.sh --rollback
  □ Verify health
  □ Investigate logs
```
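The rollback path described above can be sketched similarly. Only `scripts/.previous-tags` and the script names come from this document; the rest is assumed, and the real `docker run` invocations would need the container names, volumes, env, and GPU flags of the actual service definitions, which are omitted here.

```shell
#!/usr/bin/env bash
# Sketch of scripts/deploy.sh --rollback (assumed internals).
set -euo pipefail

rollback() {
  local tags_file=${1:-scripts/.previous-tags}
  if [ ! -f "$tags_file" ]; then
    echo "no saved tags at $tags_file; cannot roll back" >&2
    return 1
  fi
  scripts/stop-services.sh                # stop current containers
  while read -r image; do
    # Real run flags (name, volumes, env, GPU) omitted in this sketch.
    docker run -d --restart unless-stopped "$image"
  done < "$tags_file"
  scripts/health-check.sh                 # verify before declaring success
}
```

Refusing to proceed when `.previous-tags` is missing keeps a failed deploy from being "rolled back" onto nothing — if the file is absent, the previous images must be identified and restarted by hand.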