
Azaion AI Training — Deployment Procedures

Deployment Strategy

Pattern: Stop-and-replace on a single GPU server.

Rationale: The system runs on one dedicated GPU server. Training takes days, so there is no "zero-downtime" concern for the training process. The annotation queue can tolerate brief restarts: the queue offset is persisted, and messages are replayed from the last offset.

Component Behavior During Deploy

| Component | Deploy Impact | Recovery |
| --- | --- | --- |
| Training Pipeline | Must finish the current run or be stopped manually. Never interrupted mid-training; checkpoints save every epoch. | Resume from last checkpoint (resume_training) |
| Annotation Queue | Brief restart (< 30 seconds). Messages accumulate in RabbitMQ during downtime. | Resumes from persisted offset in offset.yaml |

Graceful Shutdown

  • Training: not stopped by deploy scripts — training runs for days and is managed independently. Deploy only updates images/code for the next training run.
  • Annotation Queue: docker stop sends SIGTERM and allows a 30-second grace period for the process to exit before SIGKILL; the container is then replaced with the new image.
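The queue shutdown above can be sketched as a small helper. The container name annotation-queue and the 30-second grace period come from this document; the graceful_stop wrapper and its DRY_RUN switch are illustrative assumptions, not part of the real scripts:

```shell
#!/bin/sh
# Sketch of the graceful-stop sequence (assumption: a helper like this
# could live in scripts/stop-services.sh; the real script may differ).

graceful_stop() {
  name=$1      # container to stop
  grace=$2     # seconds docker waits after SIGTERM before sending SIGKILL
  if [ "${DRY_RUN:-0}" = 1 ]; then
    # Dry-run mode prints the command instead of executing it.
    echo "docker stop --time $grace $name"
  else
    docker stop --time "$grace" "$name"
  fi
}
```

For example, DRY_RUN=1 graceful_stop annotation-queue 30 prints the exact docker command without touching the container.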

Health Checks

No HTTP endpoints — these are batch processes and queue consumers. Health is verified by:

| Check | Method | Target | Interval | Failure Action |
| --- | --- | --- | --- | --- |
| Annotation Queue alive | docker inspect --format='{{.State.Running}}' annotation-queue | container | 5 min (cron) | Restart container |
| RabbitMQ reachable | TCP connect to $RABBITMQ_HOST:$RABBITMQ_PORT | RabbitMQ server | 5 min (cron) | Alert, check network |
| GPU available | nvidia-smi exit code | NVIDIA driver | 5 min (cron) | Alert, check driver |
| Disk space | df /azaion/ --output=pcent | Filesystem | 5 min (cron) | Alert if > 80%, critical if > 95% |
| Queue offset advancing | Compare offset.yaml value to previous check | Annotation queue progress | 30 min | Alert if stale and queue has messages |

All checks are performed by scripts/health-check.sh.
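The decision logic behind two of the checks can be sketched as small shell functions. The thresholds (alert above 80%, critical at 95%) and the "stale offset with queued messages" rule come from the table above; the function names are illustrative assumptions, not the actual contents of scripts/health-check.sh:

```shell
#!/bin/sh
# Sketch of the pure decision logic behind scripts/health-check.sh
# (function names are assumptions; thresholds match the table above).

# Classify disk usage: ok up to 80%, warn above 80%, critical at 95%+.
disk_status() {
  pct=$1   # integer percent used, e.g. parsed from: df /azaion/ --output=pcent
  if [ "$pct" -ge 95 ]; then echo critical
  elif [ "$pct" -gt 80 ]; then echo warn
  else echo ok
  fi
}

# Offset check: alert only if the offset is stale AND messages are waiting.
offset_status() {
  prev=$1; cur=$2; queued=$3   # previous offset, current offset, queue depth
  if [ "$cur" -eq "$prev" ] && [ "$queued" -gt 0 ]; then echo alert
  else echo ok
  fi
}
```

A stale offset with an empty queue is not an alert: the consumer may simply have nothing to do.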

Production Deployment

Pre-Deploy Checklist

  • All CI tests pass on dev branch
  • Security scan clean (zero critical/high CVEs)
  • Docker images built and pushed to registry
  • .env on target server is up to date with any new variables
  • /azaion/ directory tree exists with correct permissions
  • No training run is currently active (or training will not be restarted this deploy)
  • NVIDIA driver and Docker with GPU support are installed on target
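A minimal preflight sketch for the .env checklist item: require_env is a hypothetical helper (not one of the repo's scripts) that fails fast if a required variable is unset. RABBITMQ_HOST and RABBITMQ_PORT are taken from the health-check table; any other variable names would be project-specific:

```shell
#!/bin/sh
# Hypothetical preflight helper: verify required variables are present
# before deploy. The real checklist may be done manually or in deploy.sh.

require_env() {
  for v in "$@"; do
    eval "val=\${$v:-}"          # indirect lookup of the variable named in $v
    if [ -z "$val" ]; then
      echo "missing required variable: $v"
      return 1
    fi
  done
  echo "env ok"
}
```

Typical use would be sourcing .env on the target server and then running require_env RABBITMQ_HOST RABBITMQ_PORT before any containers are touched.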

Deploy Steps

  1. SSH to GPU server
  2. Pull new Docker images: scripts/pull-images.sh
  3. Stop annotation queue: scripts/stop-services.sh
  4. Generate config.yaml from .env template
  5. Start services: scripts/start-services.sh
  6. Verify health: scripts/health-check.sh
  7. Confirm annotation queue is consuming messages (check offset advancing)

All steps are orchestrated by scripts/deploy.sh.
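The orchestration can be sketched as an ordered pipeline. The step order and script names come from the list above; the run wrapper, its DRY_RUN mode, and the generate-config placeholder (the source names no script for config.yaml generation) are illustrative assumptions:

```shell
#!/bin/sh
# Sketch of the orchestration in scripts/deploy.sh (order from the list above).
set -u

run() {
  # Echo instead of executing when DRY_RUN=1, so the sequence is inspectable.
  if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@" || exit 1; fi
}

deploy() {
  run scripts/pull-images.sh        # step 2: pull new images
  run scripts/stop-services.sh      # step 3: stop annotation queue
  run generate-config               # step 4: placeholder for config.yaml generation
  run scripts/start-services.sh     # step 5: start services
  run scripts/health-check.sh       # step 6: verify health
}
```

Because run aborts on the first failing step, a failed stop or health check leaves the remaining steps unexecuted, which is the behavior you want before reaching for the rollback path.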

Post-Deploy Verification

  • Check docker ps — annotation-queue container is running
  • Check docker logs annotation-queue --tail 20 — no errors
  • Check offset.yaml — offset is advancing (queue is consuming)
  • Check disk space — adequate for continued operation

Rollback Procedures

Trigger Criteria

  • Annotation queue crashes repeatedly after deploy (restart loop)
  • Queue messages are being dropped or corrupted
  • config.yaml generation failed (missing env vars)
  • New code has a bug affecting annotation processing

Rollback Steps

  1. Run scripts/deploy.sh --rollback
    • This reads the previous image tags from scripts/.previous-tags (saved during deploy)
    • Stops current containers
    • Starts containers with previous image tags
  2. Verify health: scripts/health-check.sh
  3. Check annotation queue is consuming correctly
  4. Investigate root cause of the failed deploy
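One way the rollback could consume scripts/.previous-tags is sketched below. The file's layout is not specified by this document, so the one "service tag" pair per line format and the previous_tag helper are assumptions for illustration only:

```shell
#!/bin/sh
# Sketch of reading scripts/.previous-tags during rollback.
# Assumed format (not confirmed by the source): one "service tag" pair per line.

previous_tag() {
  service=$1; tagsfile=$2
  # Print the saved tag for the given service; prints nothing if not found.
  awk -v s="$service" '$1 == s { print $2 }' "$tagsfile"
}
```

A rollback script could then restart the container from the saved tag, e.g. with an image reference like annotation-queue:$(previous_tag annotation-queue scripts/.previous-tags), adjusted to the registry's actual naming.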

Training Rollback

Training is not managed by deploy scripts. If a new training run produces bad results:

  1. The previous best.pt model is still available in /azaion/models/ (dated directories)
  2. Roll back by pointing config.yaml to the previous model
  3. No container restart needed — training is a batch job started manually
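Assuming the dated directories under /azaion/models/ sort lexicographically (e.g. YYYY-MM-DD names, which this sketch takes as an assumption), the previous model can be selected with a one-line filter; previous_model_dir is an illustrative helper, not a repo script:

```shell
#!/bin/sh
# Sketch: pick the second-newest dated model directory from a listing.
# Assumes directory names sort chronologically (e.g. 2026-03-28).

previous_model_dir() {
  # Reads one directory name per line on stdin,
  # e.g. from: ls -1d /azaion/models/*/
  sort | tail -n 2 | head -n 1
}
```

The resulting directory's best.pt is what config.yaml would be pointed at in step 2 above.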

Deployment Checklist (Quick Reference)

Pre-deploy:
  □ CI green on dev branch
  □ Images built and pushed
  □ .env updated on server (if new vars added)
  □ No active training run (if training container is being updated)

Deploy:
  □ SSH to server
  □ Run scripts/deploy.sh
  □ Verify health-check.sh passes

Post-deploy:
  □ docker ps shows containers running
  □ docker logs show no errors
  □ Queue offset advancing
  □ Disk space adequate

If problems:
  □ Run scripts/deploy.sh --rollback
  □ Verify health
  □ Investigate logs