Mirror of https://github.com/azaion/ai-training.git (synced 2026-04-22)
# Azaion AI Training — Deployment Procedures

## Deployment Strategy

Pattern: Stop-and-replace on a single GPU server

Rationale: The system runs on one dedicated GPU server. Training takes days — there is no "zero-downtime" concern for the training process. The annotation queue can tolerate brief restarts (the queue offset is persisted, and messages are replayed from the last offset).
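The persisted-offset mechanic can be sketched in a few lines of shell. The single-key `offset:` format and the default file location below are assumptions about what `offset.yaml` holds, not confirmed details of the real consumer:

```shell
#!/usr/bin/env bash
# Sketch: persist and restore the annotation-queue offset across restarts.
# Assumes offset.yaml holds a single "offset: <n>" line; the real consumer
# may use a richer schema.
set -euo pipefail

OFFSET_FILE="${OFFSET_FILE:-$(mktemp -d)/offset.yaml}"

read_offset() {
  # First run: no file yet, start from offset 0.
  [ -f "$OFFSET_FILE" ] || { echo 0; return; }
  awk '/^offset:/ { print $2 }' "$OFFSET_FILE"
}

write_offset() {
  # Write to a temp file and rename, so a crash mid-write cannot
  # leave a corrupt offset behind.
  printf 'offset: %s\n' "$1" > "${OFFSET_FILE}.tmp"
  mv "${OFFSET_FILE}.tmp" "$OFFSET_FILE"
}

current=$(read_offset)          # resume point after a restart
write_offset $((current + 1))   # advance after processing one message
echo "offset now $(read_offset)"
```

The atomic rename is what makes a mid-write crash safe: the consumer either sees the old offset or the new one, never a truncated file.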
## Component Behavior During Deploy

| Component | Deploy Impact | Recovery |
|---|---|---|
| Training Pipeline | Must finish current run or be stopped manually. Never interrupted mid-training — checkpoints save every epoch. | Resume from last checkpoint (`resume_training`) |
| Annotation Queue | Brief restart (< 30 seconds). Messages accumulate in RabbitMQ during downtime. | Resumes from persisted offset in `offset.yaml` |
## Graceful Shutdown

- Training: not stopped by deploy scripts — training runs for days and is managed independently. A deploy only updates images/code for the next training run.
- Annotation Queue: `docker stop` with a 30-second grace period → SIGTERM → process exits → container replaced with the new image.
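A minimal sketch of that stop-and-replace sequence, assuming the container is named `annotation-queue` (the registry path is illustrative). `DRY_RUN` defaults to printing the commands rather than executing them:

```shell
#!/usr/bin/env bash
# Sketch of the annotation-queue stop-and-replace sequence.
# Container name and image path are illustrative assumptions.
set -euo pipefail

CONTAINER="${CONTAINER:-annotation-queue}"
NEW_IMAGE="${NEW_IMAGE:-registry.example.com/azaion/annotation-queue:latest}"
DRY_RUN="${DRY_RUN:-1}"   # default: print the plan instead of executing

run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# docker stop sends SIGTERM, waits the grace period, then SIGKILLs.
run docker stop --time 30 "$CONTAINER"
run docker rm "$CONTAINER"
# The replacement container resumes from the persisted offset.
run docker run -d --name "$CONTAINER" --restart unless-stopped "$NEW_IMAGE"
```

The 30-second `--time` matches the grace period above; after it expires, Docker falls back to SIGKILL, which is why the consumer persists its offset on every message rather than only at shutdown.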
## Health Checks

No HTTP endpoints — these are batch processes and queue consumers. Health is verified by:

| Check | Method | Target | Interval | Failure Action |
|---|---|---|---|---|
| Annotation Queue alive | `docker inspect --format='{{.State.Running}}'` | annotation-queue container | 5 min (cron) | Restart container |
| RabbitMQ reachable | TCP connect to `$RABBITMQ_HOST:$RABBITMQ_PORT` | RabbitMQ server | 5 min (cron) | Alert, check network |
| GPU available | `nvidia-smi` exit code | NVIDIA driver | 5 min (cron) | Alert, check driver |
| Disk space | `df /azaion/ --output=pcent` | Filesystem | 5 min (cron) | Alert if > 80%, critical if > 95% |
| Queue offset advancing | Compare `offset.yaml` value to previous check | Annotation queue progress | 30 min | Alert if stale and queue has messages |

All checks are performed by `scripts/health-check.sh`.
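A sketch of what such a health-check script could look like, implementing the table above; the `check`/`disk_ok` helper names and the OK/FAIL output format are assumptions, not the contents of the actual `scripts/health-check.sh`:

```shell
#!/usr/bin/env bash
# Sketch of the cron-driven health checks from the table above.
# Helper names and output format are assumptions, not the real script.
set -u

fail=0

check() {
  # check <description> <command...>: run the probe, report OK/FAIL.
  local desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $desc"
  else
    echo "FAIL $desc"
    fail=1
  fi
}

disk_ok() {
  # Usage percentage of the /azaion/ filesystem; alert above 80%.
  local pcent
  pcent=$(df --output=pcent /azaion/ 2>/dev/null | tail -n 1 | tr -dc '0-9')
  [ -n "$pcent" ] && [ "$pcent" -le 80 ]
}

check "annotation-queue container running" \
  docker inspect --format='{{.State.Running}}' annotation-queue
check "RabbitMQ reachable" \
  timeout 3 bash -c "exec 3<>/dev/tcp/${RABBITMQ_HOST:-localhost}/${RABBITMQ_PORT:-5672}"
check "GPU available" nvidia-smi
check "disk space under 80%" disk_ok
# A real script would exit with $fail so cron can alert on nonzero status.
```

Redirecting each probe's output inside `check` keeps cron mail quiet; only the one-line OK/FAIL summary is emitted.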
## Production Deployment

### Pre-Deploy Checklist
- All CI tests pass on `dev` branch
- Security scan clean (zero critical/high CVEs)
- Docker images built and pushed to registry
- `.env` on target server is up to date with any new variables
- `/azaion/` directory tree exists with correct permissions
- No training run is currently active (or training will not be restarted this deploy)
- NVIDIA driver and Docker with GPU support are installed on target
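Some of these items can be validated mechanically before the deploy starts. A hedged sketch — the variable names checked and the `require_var` helper are illustrative, not a defined part of the pipeline:

```shell
#!/usr/bin/env bash
# Sketch: mechanical validation of a few pre-deploy checklist items.
# The required variable names are illustrative assumptions.
set -u

ENV_FILE="${ENV_FILE:-.env}"
missing=0

require_var() {
  # Confirm the .env file defines the given KEY=... line.
  if ! grep -q "^$1=" "$ENV_FILE" 2>/dev/null; then
    echo "missing $1 in $ENV_FILE"
    missing=1
  fi
}

require_var RABBITMQ_HOST
require_var RABBITMQ_PORT

[ -d /azaion/ ] || { echo "/azaion/ directory tree missing"; missing=1; }
[ "$missing" = 0 ] && echo "pre-deploy validation: OK" || echo "pre-deploy validation: FAILED"
```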
### Deploy Steps
1. SSH to GPU server
2. Pull new Docker images: `scripts/pull-images.sh`
3. Stop annotation queue: `scripts/stop-services.sh`
4. Generate `config.yaml` from `.env` template
5. Start services: `scripts/start-services.sh`
6. Verify health: `scripts/health-check.sh`
7. Confirm annotation queue is consuming messages (check offset advancing)

All steps are orchestrated by `scripts/deploy.sh`.
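The orchestration could look roughly like this; the `generate-config.sh` name and the `.previous-tags` bookkeeping shown are assumptions inferred from the steps above, not the actual contents of `scripts/deploy.sh`:

```shell
#!/usr/bin/env bash
# Sketch of how a deploy orchestrator could chain the steps above.
# generate-config.sh and the .previous-tags format are assumptions.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # default: print the plan instead of executing

run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Record the currently running image so --rollback can restore it.
save_previous_tags() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ save current image tag to scripts/.previous-tags"
  else
    docker inspect --format='{{.Config.Image}}' annotation-queue \
      > scripts/.previous-tags
  fi
}

save_previous_tags
run scripts/pull-images.sh      # pull new Docker images
run scripts/stop-services.sh    # stop annotation queue (30 s grace)
run scripts/generate-config.sh  # render config.yaml from .env (assumed name)
run scripts/start-services.sh   # start on the new images
run scripts/health-check.sh     # verify health before declaring success
```

Saving the previous tag before anything stops is the important ordering detail: rollback data must exist before the deploy can fail.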
### Post-Deploy Verification
- Check `docker ps` — annotation-queue container is running
- Check `docker logs annotation-queue --tail 20` — no errors
- Check `offset.yaml` — offset is advancing (queue is consuming)
- Check disk space — adequate for continued operation
## Rollback Procedures

### Trigger Criteria
- Annotation queue crashes repeatedly after deploy (restart loop)
- Queue messages are being dropped or corrupted
- `config.yaml` generation failed (missing env vars)
- New code has a bug affecting annotation processing
### Rollback Steps
1. Run `scripts/deploy.sh --rollback`
   - Reads the previous image tags from `scripts/.previous-tags` (saved during deploy)
   - Stops current containers
   - Starts containers with previous image tags
2. Verify health: `scripts/health-check.sh`
3. Check annotation queue is consuming correctly
4. Investigate root cause of the failed deploy
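The rollback path could be as simple as replaying `.previous-tags`. The one-`container=image`-pair-per-line format below is an assumption, and the demo writes a throwaway tags file rather than touching a real deployment:

```shell
#!/usr/bin/env bash
# Sketch of the rollback path: restart containers on their previous tags.
# The "container=image" per-line format of .previous-tags is an assumption.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Demo input: a throwaway tags file instead of scripts/.previous-tags.
TAGS_FILE="${TAGS_FILE:-$(mktemp)}"
printf 'annotation-queue=registry.example.com/azaion/annotation-queue:prev\n' \
  > "$TAGS_FILE"

while IFS='=' read -r container image; do
  [ -n "$container" ] || continue
  run docker stop --time 30 "$container"   # stop the failed deploy
  run docker rm "$container"
  run docker run -d --name "$container" --restart unless-stopped "$image"
done < "$TAGS_FILE"
```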
### Training Rollback
Training is not managed by deploy scripts. If a new training run produces bad results:
- The previous `best.pt` model is still available in `/azaion/models/` (dated directories)
- Roll back by pointing `config.yaml` to the previous model
- No container restart needed — training is a batch job started manually
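A sketch of that repointing step, assuming a `model_path:` key in `config.yaml` and date-named model directories (both assumptions); the demo builds a throwaway layout under `mktemp` instead of touching `/azaion/models/`:

```shell
#!/usr/bin/env bash
# Sketch: repoint config.yaml at the previous dated model directory.
# The model_path: key and directory naming are assumptions; the demo uses
# a throwaway layout instead of the real /azaion/models/ tree.
set -euo pipefail

demo=$(mktemp -d)
mkdir -p "$demo/models/2026-04-20" "$demo/models/2026-04-21"
printf 'model_path: %s/models/2026-04-21/best.pt\n' "$demo" > "$demo/config.yaml"

# The second-newest dated directory holds the previous run's best.pt.
previous=$(ls -1d "$demo"/models/*/ | sort | tail -n 2 | head -n 1)

# Rewrite the model_path line; training is a batch job, so nothing restarts.
sed -i "s|^model_path:.*|model_path: ${previous}best.pt|" "$demo/config.yaml"
cat "$demo/config.yaml"
```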
## Deployment Checklist (Quick Reference)
Pre-deploy:
□ CI green on dev branch
□ Images built and pushed
□ .env updated on server (if new vars added)
□ No active training run (if training container is being updated)
Deploy:
□ SSH to server
□ Run scripts/deploy.sh
□ Verify health-check.sh passes
Post-deploy:
□ docker ps shows containers running
□ docker logs show no errors
□ Queue offset advancing
□ Disk space adequate
If problems:
□ Run scripts/deploy.sh --rollback
□ Verify health
□ Investigate logs