mirror of
https://github.com/azaion/ai-training.git
synced 2026-04-22 21:46:35 +00:00
116 lines
4.7 KiB
Markdown
# Azaion AI Training — Deployment Procedures
## Deployment Strategy

**Pattern**: Stop-and-replace on a single GPU server

**Rationale**: The system runs on one dedicated GPU server. Training takes days — there is no "zero-downtime" concern for the training process. The annotation queue can tolerate brief restarts (the queue offset is persisted, and messages are replayed from the last offset).

### Component Behavior During Deploy
| Component | Deploy Impact | Recovery |
|-----------|--------------|----------|
| Training Pipeline | Must finish current run or be stopped manually. Never interrupted mid-training — checkpoints save every epoch. | Resume from last checkpoint (`resume_training`) |
| Annotation Queue | Brief restart (< 30 seconds). Messages accumulate in RabbitMQ during downtime. | Resumes from persisted offset in `offset.yaml` |

### Graceful Shutdown

- **Training**: not stopped by deploy scripts — training runs for days and is managed independently. Deploy only updates images/code for the *next* training run.
- **Annotation Queue**: `docker stop` with 30-second grace period → SIGTERM → process exits → container replaced with new image.
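
The stop-and-replace sequence for the queue can be sketched with plain Docker commands. The container name `annotation-queue` matches the one used in the verification section below; the image tag is illustrative, and setting `RUN=echo` turns this into a dry run:

```shell
# Stop-and-replace for the annotation queue (image tag illustrative).
# Set RUN=echo to print the commands instead of executing them.
restart_annotation_queue() {
  local image="$1" run="${RUN:-}"
  # docker stop sends SIGTERM, waits up to --time seconds, then SIGKILL.
  $run docker stop --time 30 annotation-queue
  $run docker rm annotation-queue
  $run docker run -d --name annotation-queue "$image"
}
```

With `RUN=echo restart_annotation_queue myregistry/annotation-queue:v2`, the three commands are printed in order without touching Docker.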

## Health Checks

No HTTP endpoints — these are batch processes and queue consumers. Health is verified by:
| Check | Method | Target | Interval | Failure Action |
|-------|--------|--------|----------|----------------|
| Annotation Queue alive | `docker inspect --format='{{.State.Running}}'` | annotation-queue container | 5 min (cron) | Restart container |
| RabbitMQ reachable | TCP connect to `$RABBITMQ_HOST:$RABBITMQ_PORT` | RabbitMQ server | 5 min (cron) | Alert, check network |
| GPU available | `nvidia-smi` exit code | NVIDIA driver | 5 min (cron) | Alert, check driver |
| Disk space | `df /azaion/ --output=pcent` | Filesystem | 5 min (cron) | Alert if > 80%, critical if > 95% |
| Queue offset advancing | Compare `offset.yaml` value to previous check | Annotation queue progress | 30 min | Alert if stale and queue has messages |
All checks are performed by `scripts/health-check.sh`.
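
Two of the checks above can be sketched as shell functions. This is a sketch of what such a script might contain, not the actual contents of `scripts/health-check.sh`:

```shell
# Disk-space rule from the table: alert above 80% used, critical above 95%.
disk_status() {
  local pct="$1"  # integer percent used, e.g. parsed from: df /azaion/ --output=pcent
  if   [ "$pct" -gt 95 ]; then echo CRITICAL
  elif [ "$pct" -gt 80 ]; then echo ALERT
  else echo OK
  fi
}

# Container-liveness check from the table (requires Docker on the host).
queue_alive() {
  [ "$(docker inspect --format='{{.State.Running}}' annotation-queue 2>/dev/null)" = "true" ]
}
```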

## Production Deployment

### Pre-Deploy Checklist
- [ ] All CI tests pass on `dev` branch
- [ ] Security scan clean (zero critical/high CVEs)
- [ ] Docker images built and pushed to registry
- [ ] `.env` on target server is up to date with any new variables
- [ ] `/azaion/` directory tree exists with correct permissions
- [ ] No training run is currently active (or training will not be restarted this deploy)
- [ ] NVIDIA driver and Docker with GPU support are installed on target
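
Parts of this checklist can be automated. A hypothetical preflight helper (not one of the documented `scripts/`) might check the filesystem and driver prerequisites:

```shell
# Hypothetical pre-deploy preflight covering the last three items above.
# Prints the first missing prerequisite and returns non-zero.
preflight() {
  local root="${1:-/azaion}"
  [ -d "$root" ] || { echo "missing directory: $root"; return 1; }
  [ -f .env ] || { echo "missing .env in $(pwd)"; return 1; }
  command -v nvidia-smi >/dev/null || { echo "nvidia-smi not found"; return 1; }
  echo "preflight OK"
}
```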

### Deploy Steps
1. SSH to GPU server
2. Pull new Docker images: `scripts/pull-images.sh`
3. Stop annotation queue: `scripts/stop-services.sh`
4. Generate `config.yaml` from `.env` template
5. Start services: `scripts/start-services.sh`
6. Verify health: `scripts/health-check.sh`
7. Confirm annotation queue is consuming messages (check offset advancing)
All steps are orchestrated by `scripts/deploy.sh`.
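
The sequence might be orchestrated roughly like this. This is a sketch, not the actual `scripts/deploy.sh`; the config-generation helper name in step 4 is assumed, and `RUN=echo` dry-runs the steps:

```shell
# Sketch of the deploy order above. Set RUN=echo for a dry run.
deploy() {
  local run="${RUN:-}"
  $run scripts/pull-images.sh       # step 2: pull new Docker images
  $run scripts/stop-services.sh     # step 3: stop annotation queue
  $run scripts/render-config.sh     # step 4: config.yaml from .env (helper name assumed)
  $run scripts/start-services.sh    # step 5: start services
  $run scripts/health-check.sh      # step 6: verify health
}
```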

### Post-Deploy Verification
- Check `docker ps` — annotation-queue container is running
- Check `docker logs annotation-queue --tail 20` — no errors
- Check `offset.yaml` — offset is advancing (queue is consuming)
- Check disk space — adequate for continued operation
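
The offset check can be scripted. The layout of `offset.yaml` is not documented here, so this sketch assumes a single top-level `offset:` key:

```shell
# Read the consumer offset (assumes a line like "offset: 12345").
read_offset() { sed -n 's/^offset:[[:space:]]*//p' "$1"; }

# True if the offset grew over an interval (usage: offset_advancing FILE SECONDS).
offset_advancing() {
  local before after
  before="$(read_offset "$1")"
  sleep "$2"
  after="$(read_offset "$1")"
  [ "$after" -gt "$before" ]
}
```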

## Rollback Procedures

### Trigger Criteria
- Annotation queue crashes repeatedly after deploy (restart loop)
- Queue messages are being dropped or corrupted
- `config.yaml` generation failed (missing env vars)
- New code has a bug affecting annotation processing
### Rollback Steps
1. Run `scripts/deploy.sh --rollback`
   - This reads the previous image tags from `scripts/.previous-tags` (saved during deploy)
   - Stops current containers
   - Starts containers with previous image tags
2. Verify health: `scripts/health-check.sh`
3. Check annotation queue is consuming correctly
4. Investigate root cause of the failed deploy
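
The exact format of the tags file is not documented. Assuming one `image:tag` per line, the image-restore step might look like this (`RUN=echo` dry-runs it):

```shell
# Re-pull the previously deployed images listed in a tags file
# (format assumed: one image:tag per line). Set RUN=echo for a dry run.
rollback_pull() {
  local run="${RUN:-}" image
  while IFS= read -r image; do
    [ -n "$image" ] || continue
    $run docker pull "$image"
  done < "$1"
}
```

Invoked as `RUN=echo rollback_pull scripts/.previous-tags`, it prints one `docker pull` per saved tag.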

### Training Rollback
Training is not managed by deploy scripts. If a new training run produces bad results:
1. The previous `best.pt` model is still available in `/azaion/models/` (dated directories)
2. Roll back by pointing `config.yaml` to the previous model
3. No container restart needed — training is a batch job started manually
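
The `config.yaml` keys are not shown in this document, so the fragment below is purely illustrative of step 2 (key names and the dated directory are hypothetical):

```yaml
# Illustrative only; actual config.yaml keys may differ.
model:
  path: /azaion/models/2026-03-15/best.pt   # previous dated directory (date hypothetical)
```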

## Deployment Checklist (Quick Reference)
```
Pre-deploy:
□ CI green on dev branch
□ Images built and pushed
□ .env updated on server (if new vars added)
□ No active training run (if training container is being updated)

Deploy:
□ SSH to server
□ Run scripts/deploy.sh
□ Verify health-check.sh passes

Post-deploy:
□ docker ps shows containers running
□ docker logs show no errors
□ Queue offset advancing
□ Disk space adequate

If problems:
□ Run scripts/deploy.sh --rollback
□ Verify health
□ Investigate logs
```