# Azaion AI Training — Deployment Procedures
## Deployment Strategy
**Pattern**: Stop-and-replace on a single GPU server
**Rationale**: The system runs on one dedicated GPU server. Training takes days — there is no "zero-downtime" concern for the training process. The annotation queue can tolerate brief restarts (queue offset is persisted, messages are replayed from last offset).
### Component Behavior During Deploy
| Component | Deploy Impact | Recovery |
|-----------|--------------|----------|
| Training Pipeline | Must finish current run or be stopped manually. Never interrupted mid-training — checkpoints save every epoch. | Resume from last checkpoint (`resume_training`) |
| Annotation Queue | Brief restart (< 30 seconds). Messages accumulate in RabbitMQ during downtime. | Resumes from persisted offset in `offset.yaml` |
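For orientation, the persisted offset file might look like the following. This is a hypothetical sketch of `offset.yaml`; the real schema may differ.

```yaml
# Hypothetical shape of offset.yaml (assumed field names):
queue: annotations
offset: 184203            # last committed message offset
updated_at: "2026-03-28T21:40:11Z"
```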
### Graceful Shutdown
- **Training**: not stopped by deploy scripts — training runs for days and is managed independently. Deploy only updates images/code for the *next* training run.
- **Annotation Queue**: `docker stop` with 30-second grace period → SIGTERM → process exits → container replaced with new image.
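The annotation-queue stop sequence above can be sketched as a shell function. This is a minimal illustration of what `scripts/stop-services.sh` might contain; the function name is an assumption, not the script's actual contents.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the annotation-queue stop step in stop-services.sh.
stop_queue() {
  # docker stop sends SIGTERM, waits up to 30 s, then escalates to SIGKILL.
  docker stop --time 30 annotation-queue
  # Remove the old container so start-services.sh can create one from the new image.
  docker rm annotation-queue
}
```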
## Health Checks
No HTTP endpoints — these are batch processes and queue consumers. Health is verified by:
| Check | Method | Target | Interval | Failure Action |
|-------|--------|--------|----------|----------------|
| Annotation Queue alive | `docker inspect --format='{{.State.Running}}'` | annotation-queue container | 5 min (cron) | Restart container |
| RabbitMQ reachable | TCP connect to `$RABBITMQ_HOST:$RABBITMQ_PORT` | RabbitMQ server | 5 min (cron) | Alert, check network |
| GPU available | `nvidia-smi` exit code | NVIDIA driver | 5 min (cron) | Alert, check driver |
| Disk space | `df /azaion/ --output=pcent` | Filesystem | 5 min (cron) | Alert if > 80%, critical if > 95% |
| Queue offset advancing | Compare `offset.yaml` value to previous check | Annotation queue progress | 30 min | Alert if stale and queue has messages |
All checks are performed by `scripts/health-check.sh`.
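Two of the checks in the table can be sketched as shell functions. Function names and thresholds mirror the table but are assumptions about how `scripts/health-check.sh` is structured, not its actual contents.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of two checks from scripts/health-check.sh.

# Classify disk usage per the table: alert above 80%, critical above 95%.
disk_status() {
  local pcent=$1            # integer percentage, e.g. 83
  if   [ "$pcent" -gt 95 ]; then echo CRIT
  elif [ "$pcent" -gt 80 ]; then echo WARN
  else echo OK
  fi
}

# True when the annotation-queue container reports State.Running=true.
queue_alive() {
  [ "$(docker inspect --format='{{.State.Running}}' annotation-queue)" = "true" ]
}
```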
## Production Deployment
### Pre-Deploy Checklist
- [ ] All CI tests pass on `dev` branch
- [ ] Security scan clean (zero critical/high CVEs)
- [ ] Docker images built and pushed to registry
- [ ] `.env` on target server is up to date with any new variables
- [ ] `/azaion/` directory tree exists with correct permissions
- [ ] No training run is currently active (or training will not be restarted by this deploy)
- [ ] NVIDIA driver and Docker with GPU support are installed on target
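The last checklist item can be verified in one step before deploying. The helper below is a hedged sketch: `gpu_ready` is an assumed name, and the CUDA base image is just one commonly used choice.

```shell
#!/usr/bin/env bash
# Hypothetical pre-deploy check: succeeds only if both the NVIDIA driver
# and Docker's GPU runtime respond.
gpu_ready() {
  nvidia-smi > /dev/null &&
  docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi > /dev/null
}
```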
### Deploy Steps
1. SSH to GPU server
2. Pull new Docker images: `scripts/pull-images.sh`
3. Stop annotation queue: `scripts/stop-services.sh`
4. Generate `config.yaml` from `.env` template
5. Start services: `scripts/start-services.sh`
6. Verify health: `scripts/health-check.sh`
7. Confirm annotation queue is consuming messages (check offset advancing)
All steps are orchestrated by `scripts/deploy.sh`.
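The orchestration can be sketched as a loop over the step scripts. This is an assumed shape for `scripts/deploy.sh`, not its actual contents; in particular, `generate-config.sh` is a hypothetical name for step 4.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the deploy.sh orchestration (steps 2-6 above).
set -euo pipefail

deploy() {
  local step
  for step in pull-images stop-services generate-config start-services health-check; do
    "scripts/${step}.sh" || { echo "deploy failed at: ${step}" >&2; return 1; }
  done
  echo "deploy complete; verify the queue offset is advancing"
}
```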
### Post-Deploy Verification
- Check `docker ps` — annotation-queue container is running
- Check `docker logs annotation-queue --tail 20` — no errors
- Check `offset.yaml` — offset is advancing (queue is consuming)
- Check disk space — adequate for continued operation
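The offset check above can be automated with two small helpers. The `offset:` field name is an assumption about the `offset.yaml` schema; a YAML parser would be safer than `grep` in the real script.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the "offset is advancing" verification step.

# Extract the integer after "offset:" (naive grep; assumed field name).
current_offset() {
  grep -E '^offset:' "$1" | awk '{print $2}'
}

# True when the current offset is strictly greater than the previous reading.
offset_advanced() {
  local prev=$1 curr=$2
  [ "$curr" -gt "$prev" ]
}
```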
## Rollback Procedures
### Trigger Criteria
- Annotation queue crashes repeatedly after deploy (restart loop)
- Queue messages are being dropped or corrupted
- `config.yaml` generation failed (missing env vars)
- New code has a bug affecting annotation processing
### Rollback Steps
1. Run `scripts/deploy.sh --rollback`
- This reads the previous image tags from `scripts/.previous-tags` (saved during deploy)
- Stops current containers
- Starts containers with previous image tags
2. Verify health: `scripts/health-check.sh`
3. Check annotation queue is consuming correctly
4. Investigate root cause of the failed deploy
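The rollback path can be sketched as follows. This assumes `scripts/.previous-tags` holds one `container=tag` line per service; the real file format and the actual contents of `deploy.sh --rollback` may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the --rollback path in deploy.sh.
rollback() {
  local tagfile=scripts/.previous-tags
  [ -f "$tagfile" ] || { echo "no previous tags recorded; cannot roll back" >&2; return 1; }
  local name tag
  while IFS='=' read -r name tag; do
    docker rm -f "$name"                             # stop and remove the current container
    docker run -d --name "$name" "${name}:${tag}"    # relaunch on the previous image tag
  done < "$tagfile"
}
```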
### Training Rollback
Training is not managed by deploy scripts. If a new training run produces bad results:
1. The previous `best.pt` model is still available in `/azaion/models/` (dated directories)
2. Roll back by pointing `config.yaml` to the previous model
3. No container restart needed — training is a batch job started manually
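Step 2 might look like the following `config.yaml` fragment. The key names and the dated path are illustrative assumptions; only the `/azaion/models/` layout and `best.pt` filename come from this document.

```yaml
model:
  # Point back at the previous known-good checkpoint (dated directory).
  weights: /azaion/models/2026-03-15/best.pt
```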
## Deployment Checklist (Quick Reference)
```
Pre-deploy:
□ CI green on dev branch
□ Images built and pushed
□ .env updated on server (if new vars added)
□ No active training run (if training container is being updated)
Deploy:
□ SSH to server
□ Run scripts/deploy.sh
□ Verify health-check.sh passes
Post-deploy:
□ docker ps shows containers running
□ docker logs show no errors
□ Queue offset advancing
□ Disk space adequate
If problems:
□ Run scripts/deploy.sh --rollback
□ Verify health
□ Investigate logs
```