Mirror of https://github.com/azaion/ai-training.git (synced 2026-04-22)
# Azaion AI Training — Deployment Procedures

## Deployment Strategy

Pattern: Stop-and-replace on a single GPU server

Rationale: The system runs on one dedicated GPU server. Training takes days — there is no "zero-downtime" concern for the training process. The annotation queue can tolerate brief restarts (the queue offset is persisted, and messages are replayed from the last offset).
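The persisted-offset mechanic can be sketched in a few lines of shell. The single-key `offset:` format and the default file location below are assumptions about what `offset.yaml` holds, not confirmed details of the real consumer:

```shell
#!/usr/bin/env bash
# Sketch: persist and restore the annotation-queue offset across restarts.
# Assumes offset.yaml holds a single "offset: <n>" line; the real consumer
# may use a richer schema.
set -euo pipefail

OFFSET_FILE="${OFFSET_FILE:-$(mktemp -d)/offset.yaml}"

read_offset() {
  # First run: no file yet, start from offset 0.
  [ -f "$OFFSET_FILE" ] || { echo 0; return; }
  awk '/^offset:/ { print $2 }' "$OFFSET_FILE"
}

write_offset() {
  # Write to a temp file and rename, so a crash mid-write cannot
  # leave a corrupt offset behind.
  printf 'offset: %s\n' "$1" > "${OFFSET_FILE}.tmp"
  mv "${OFFSET_FILE}.tmp" "$OFFSET_FILE"
}

current=$(read_offset)          # resume point after a restart
write_offset $((current + 1))   # advance after processing one message
echo "offset now $(read_offset)"
```

The atomic rename is what makes a mid-write crash safe: the consumer either sees the old offset or the new one, never a truncated file.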
## Component Behavior During Deploy

| Component | Deploy Impact | Recovery |
|---|---|---|
| Training Pipeline | Must finish current run or be stopped manually. Never interrupted mid-training — checkpoints save every epoch. | Resume from last checkpoint (`resume_training`) |
| Annotation Queue | Brief restart (< 30 seconds). Messages accumulate in RabbitMQ during downtime. | Resumes from persisted offset in `offset.yaml` |
## Graceful Shutdown

- Training: not stopped by deploy scripts — training runs for days and is managed independently. A deploy only updates images/code for the next training run.
- Annotation Queue: `docker stop` with a 30-second grace period → SIGTERM → process exits → container replaced with the new image.
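A minimal sketch of that stop-and-replace sequence, assuming the container is named `annotation-queue` (the registry path is illustrative). `DRY_RUN` defaults to printing the commands rather than executing them:

```shell
#!/usr/bin/env bash
# Sketch of the annotation-queue stop-and-replace sequence.
# Container name and image path are illustrative assumptions.
set -euo pipefail

CONTAINER="${CONTAINER:-annotation-queue}"
NEW_IMAGE="${NEW_IMAGE:-registry.example.com/azaion/annotation-queue:latest}"
DRY_RUN="${DRY_RUN:-1}"   # default: print the plan instead of executing

run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# docker stop sends SIGTERM, waits the grace period, then SIGKILLs.
run docker stop --time 30 "$CONTAINER"
run docker rm "$CONTAINER"
# The replacement container resumes from the persisted offset.
run docker run -d --name "$CONTAINER" --restart unless-stopped "$NEW_IMAGE"
```

The 30-second `--time` matches the grace period above; after it expires, Docker falls back to SIGKILL, which is why the consumer persists its offset on every message rather than only at shutdown.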
## Health Checks

No HTTP endpoints — these are batch processes and queue consumers. Health is verified by:

| Check | Method | Target | Interval | Failure Action |
|---|---|---|---|---|
| Annotation Queue alive | `docker inspect --format='{{.State.Running}}'` | annotation-queue container | 5 min (cron) | Restart container |
| RabbitMQ reachable | TCP connect to `$RABBITMQ_HOST:$RABBITMQ_PORT` | RabbitMQ server | 5 min (cron) | Alert, check network |
| GPU available | `nvidia-smi` exit code | NVIDIA driver | 5 min (cron) | Alert, check driver |
| Disk space | `df /azaion/ --output=pcent` | Filesystem | 5 min (cron) | Alert if > 80%, critical if > 95% |
| Queue offset advancing | Compare `offset.yaml` value to previous check | Annotation queue progress | 30 min | Alert if stale and queue has messages |

All checks are performed by `scripts/health-check.sh`.
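A sketch of what such a health-check script could look like, implementing the table above; the `check`/`disk_ok` helper names and the OK/FAIL output format are assumptions, not the contents of the actual `scripts/health-check.sh`:

```shell
#!/usr/bin/env bash
# Sketch of the cron-driven health checks from the table above.
# Helper names and output format are assumptions, not the real script.
set -u

fail=0

check() {
  # check <description> <command...>: run the probe, report OK/FAIL.
  local desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $desc"
  else
    echo "FAIL $desc"
    fail=1
  fi
}

disk_ok() {
  # Usage percentage of the /azaion/ filesystem; alert above 80%.
  local pcent
  pcent=$(df --output=pcent /azaion/ 2>/dev/null | tail -n 1 | tr -dc '0-9')
  [ -n "$pcent" ] && [ "$pcent" -le 80 ]
}

check "annotation-queue container running" \
  docker inspect --format='{{.State.Running}}' annotation-queue
check "RabbitMQ reachable" \
  timeout 3 bash -c "exec 3<>/dev/tcp/${RABBITMQ_HOST:-localhost}/${RABBITMQ_PORT:-5672}"
check "GPU available" nvidia-smi
check "disk space under 80%" disk_ok
# A real script would exit with $fail so cron can alert on nonzero status.
```

Redirecting each probe's output inside `check` keeps cron mail quiet; only the one-line OK/FAIL summary is emitted.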
## Production Deployment

### Pre-Deploy Checklist
- All CI tests pass on `dev` branch
- Security scan clean (zero critical/high CVEs)
- Docker images built and pushed to registry
- `.env` on target server is up to date with any new variables
- `/azaion/` directory tree exists with correct permissions
- No training run is currently active (or training will not be restarted this deploy)
- NVIDIA driver and Docker with GPU support are installed on target
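Some of these items can be validated mechanically before the deploy starts. A hedged sketch — the variable names checked and the `require_var` helper are illustrative, not a defined part of the pipeline:

```shell
#!/usr/bin/env bash
# Sketch: mechanical validation of a few pre-deploy checklist items.
# The required variable names are illustrative assumptions.
set -u

ENV_FILE="${ENV_FILE:-.env}"
missing=0

require_var() {
  # Confirm the .env file defines the given KEY=... line.
  if ! grep -q "^$1=" "$ENV_FILE" 2>/dev/null; then
    echo "missing $1 in $ENV_FILE"
    missing=1
  fi
}

require_var RABBITMQ_HOST
require_var RABBITMQ_PORT

[ -d /azaion/ ] || { echo "/azaion/ directory tree missing"; missing=1; }
[ "$missing" = 0 ] && echo "pre-deploy validation: OK" || echo "pre-deploy validation: FAILED"
```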
### Deploy Steps
1. SSH to GPU server
2. Pull new Docker images: `scripts/pull-images.sh`
3. Stop annotation queue: `scripts/stop-services.sh`
4. Generate `config.yaml` from `.env` template
5. Start services: `scripts/start-services.sh`
6. Verify health: `scripts/health-check.sh`
7. Confirm annotation queue is consuming messages (check offset advancing)

All steps are orchestrated by `scripts/deploy.sh`.
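The orchestration could look roughly like this; the `generate-config.sh` name and the `.previous-tags` bookkeeping shown are assumptions inferred from the steps above, not the actual contents of `scripts/deploy.sh`:

```shell
#!/usr/bin/env bash
# Sketch of how a deploy orchestrator could chain the steps above.
# generate-config.sh and the .previous-tags format are assumptions.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # default: print the plan instead of executing

run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Record the currently running image so --rollback can restore it.
save_previous_tags() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ save current image tag to scripts/.previous-tags"
  else
    docker inspect --format='{{.Config.Image}}' annotation-queue \
      > scripts/.previous-tags
  fi
}

save_previous_tags
run scripts/pull-images.sh      # pull new Docker images
run scripts/stop-services.sh    # stop annotation queue (30 s grace)
run scripts/generate-config.sh  # render config.yaml from .env (assumed name)
run scripts/start-services.sh   # start on the new images
run scripts/health-check.sh     # verify health before declaring success
```

Saving the previous tag before anything stops is the important ordering detail: rollback data must exist before the deploy can fail.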
### Post-Deploy Verification
- Check `docker ps` — annotation-queue container is running
- Check `docker logs annotation-queue --tail 20` — no errors
- Check `offset.yaml` — offset is advancing (queue is consuming)
- Check disk space — adequate for continued operation
## Rollback Procedures

### Trigger Criteria
- Annotation queue crashes repeatedly after deploy (restart loop)
- Queue messages are being dropped or corrupted
- `config.yaml` generation failed (missing env vars)
- New code has a bug affecting annotation processing
### Rollback Steps
1. Run `scripts/deploy.sh --rollback`
   - Reads the previous image tags from `scripts/.previous-tags` (saved during deploy)
   - Stops current containers
   - Starts containers with previous image tags
2. Verify health: `scripts/health-check.sh`
3. Check annotation queue is consuming correctly
4. Investigate root cause of the failed deploy
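The rollback path could be as simple as replaying `.previous-tags`. The one-`container=image`-pair-per-line format below is an assumption, and the demo writes a throwaway tags file rather than touching a real deployment:

```shell
#!/usr/bin/env bash
# Sketch of the rollback path: restart containers on their previous tags.
# The "container=image" per-line format of .previous-tags is an assumption.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Demo input: a throwaway tags file instead of scripts/.previous-tags.
TAGS_FILE="${TAGS_FILE:-$(mktemp)}"
printf 'annotation-queue=registry.example.com/azaion/annotation-queue:prev\n' \
  > "$TAGS_FILE"

while IFS='=' read -r container image; do
  [ -n "$container" ] || continue
  run docker stop --time 30 "$container"   # stop the failed deploy
  run docker rm "$container"
  run docker run -d --name "$container" --restart unless-stopped "$image"
done < "$TAGS_FILE"
```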
### Training Rollback
Training is not managed by deploy scripts. If a new training run produces bad results:
- The previous `best.pt` model is still available in `/azaion/models/` (dated directories)
- Roll back by pointing `config.yaml` to the previous model
- No container restart needed — training is a batch job started manually
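A sketch of that repointing step, assuming a `model_path:` key in `config.yaml` and date-named model directories (both assumptions); the demo builds a throwaway layout under `mktemp` instead of touching `/azaion/models/`:

```shell
#!/usr/bin/env bash
# Sketch: repoint config.yaml at the previous dated model directory.
# The model_path: key and directory naming are assumptions; the demo uses
# a throwaway layout instead of the real /azaion/models/ tree.
set -euo pipefail

demo=$(mktemp -d)
mkdir -p "$demo/models/2026-04-20" "$demo/models/2026-04-21"
printf 'model_path: %s/models/2026-04-21/best.pt\n' "$demo" > "$demo/config.yaml"

# The second-newest dated directory holds the previous run's best.pt.
previous=$(ls -1d "$demo"/models/*/ | sort | tail -n 2 | head -n 1)

# Rewrite the model_path line; training is a batch job, so nothing restarts.
sed -i "s|^model_path:.*|model_path: ${previous}best.pt|" "$demo/config.yaml"
cat "$demo/config.yaml"
```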
## Deployment Checklist (Quick Reference)
Pre-deploy:
□ CI green on dev branch
□ Images built and pushed
□ .env updated on server (if new vars added)
□ No active training run (if training container is being updated)
Deploy:
□ SSH to server
□ Run scripts/deploy.sh
□ Verify health-check.sh passes
Post-deploy:
□ docker ps shows containers running
□ docker logs show no errors
□ Queue offset advancing
□ Disk space adequate
If problems:
□ Run scripts/deploy.sh --rollback
□ Verify health
□ Investigate logs