mirror of
https://github.com/azaion/ai-training.git
synced 2026-04-22 21:46:35 +00:00
116 lines
4.7 KiB
Markdown
# Azaion AI Training — Deployment Procedures
## Deployment Strategy

**Pattern**: Stop-and-replace on a single GPU server

**Rationale**: The system runs on one dedicated GPU server. Training takes days — there is no "zero-downtime" concern for the training process. The annotation queue can tolerate brief restarts (the queue offset is persisted, and messages are replayed from the last offset).

### Component Behavior During Deploy
| Component | Deploy Impact | Recovery |
|-----------|--------------|----------|
| Training Pipeline | Must finish current run or be stopped manually. Never interrupted mid-training — checkpoints save every epoch. | Resume from last checkpoint (`resume_training`) |
| Annotation Queue | Brief restart (< 30 seconds). Messages accumulate in RabbitMQ during downtime. | Resumes from persisted offset in `offset.yaml` |

### Graceful Shutdown

- **Training**: not stopped by deploy scripts — training runs for days and is managed independently. Deploy only updates images/code for the *next* training run.
- **Annotation Queue**: `docker stop` with 30-second grace period → SIGTERM → process exits → container replaced with new image.
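
The stop-and-replace sequence for the queue can be sketched with plain Docker commands. The container name `annotation-queue` matches the one used in the verification section below; the image tag is illustrative, and setting `RUN=echo` turns this into a dry run:

```shell
# Stop-and-replace for the annotation queue (image tag illustrative).
# Set RUN=echo to print the commands instead of executing them.
restart_annotation_queue() {
  local image="$1" run="${RUN:-}"
  # docker stop sends SIGTERM, waits up to --time seconds, then SIGKILL.
  $run docker stop --time 30 annotation-queue
  $run docker rm annotation-queue
  $run docker run -d --name annotation-queue "$image"
}
```

With `RUN=echo restart_annotation_queue myregistry/annotation-queue:v2`, the three commands are printed in order without touching Docker.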

## Health Checks

No HTTP endpoints — these are batch processes and queue consumers. Health is verified by:
| Check | Method | Target | Interval | Failure Action |
|-------|--------|--------|----------|----------------|
| Annotation Queue alive | `docker inspect --format='{{.State.Running}}'` | annotation-queue container | 5 min (cron) | Restart container |
| RabbitMQ reachable | TCP connect to `$RABBITMQ_HOST:$RABBITMQ_PORT` | RabbitMQ server | 5 min (cron) | Alert, check network |
| GPU available | `nvidia-smi` exit code | NVIDIA driver | 5 min (cron) | Alert, check driver |
| Disk space | `df /azaion/ --output=pcent` | Filesystem | 5 min (cron) | Alert if > 80%, critical if > 95% |
| Queue offset advancing | Compare `offset.yaml` value to previous check | Annotation queue progress | 30 min | Alert if stale and queue has messages |
All checks are performed by `scripts/health-check.sh`.
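
Two of the checks above can be sketched as shell functions. This is a sketch of what such a script might contain, not the actual contents of `scripts/health-check.sh`:

```shell
# Disk-space rule from the table: alert above 80% used, critical above 95%.
disk_status() {
  local pct="$1"  # integer percent used, e.g. parsed from: df /azaion/ --output=pcent
  if   [ "$pct" -gt 95 ]; then echo CRITICAL
  elif [ "$pct" -gt 80 ]; then echo ALERT
  else echo OK
  fi
}

# Container-liveness check from the table (requires Docker on the host).
queue_alive() {
  [ "$(docker inspect --format='{{.State.Running}}' annotation-queue 2>/dev/null)" = "true" ]
}
```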

## Production Deployment

### Pre-Deploy Checklist
- [ ] All CI tests pass on `dev` branch
- [ ] Security scan clean (zero critical/high CVEs)
- [ ] Docker images built and pushed to registry
- [ ] `.env` on target server is up to date with any new variables
- [ ] `/azaion/` directory tree exists with correct permissions
- [ ] No training run is currently active (or training will not be restarted this deploy)
- [ ] NVIDIA driver and Docker with GPU support are installed on target
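
Parts of this checklist can be automated. A hypothetical preflight helper (not one of the documented `scripts/`) might check the filesystem and driver prerequisites:

```shell
# Hypothetical pre-deploy preflight covering the last three items above.
# Prints the first missing prerequisite and returns non-zero.
preflight() {
  local root="${1:-/azaion}"
  [ -d "$root" ] || { echo "missing directory: $root"; return 1; }
  [ -f .env ] || { echo "missing .env in $(pwd)"; return 1; }
  command -v nvidia-smi >/dev/null || { echo "nvidia-smi not found"; return 1; }
  echo "preflight OK"
}
```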

### Deploy Steps
1. SSH to GPU server
2. Pull new Docker images: `scripts/pull-images.sh`
3. Stop annotation queue: `scripts/stop-services.sh`
4. Generate `config.yaml` from `.env` template
5. Start services: `scripts/start-services.sh`
6. Verify health: `scripts/health-check.sh`
7. Confirm annotation queue is consuming messages (check offset advancing)
All steps are orchestrated by `scripts/deploy.sh`.
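
The sequence might be orchestrated roughly like this. This is a sketch, not the actual `scripts/deploy.sh`; the config-generation helper name in step 4 is assumed, and `RUN=echo` dry-runs the steps:

```shell
# Sketch of the deploy order above. Set RUN=echo for a dry run.
deploy() {
  local run="${RUN:-}"
  $run scripts/pull-images.sh       # step 2: pull new Docker images
  $run scripts/stop-services.sh     # step 3: stop annotation queue
  $run scripts/render-config.sh     # step 4: config.yaml from .env (helper name assumed)
  $run scripts/start-services.sh    # step 5: start services
  $run scripts/health-check.sh      # step 6: verify health
}
```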

### Post-Deploy Verification
- Check `docker ps` — annotation-queue container is running
- Check `docker logs annotation-queue --tail 20` — no errors
- Check `offset.yaml` — offset is advancing (queue is consuming)
- Check disk space — adequate for continued operation
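
The offset check can be scripted. The layout of `offset.yaml` is not documented here, so this sketch assumes a single top-level `offset:` key:

```shell
# Read the consumer offset (assumes a line like "offset: 12345").
read_offset() { sed -n 's/^offset:[[:space:]]*//p' "$1"; }

# True if the offset grew over an interval (usage: offset_advancing FILE SECONDS).
offset_advancing() {
  local before after
  before="$(read_offset "$1")"
  sleep "$2"
  after="$(read_offset "$1")"
  [ "$after" -gt "$before" ]
}
```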

## Rollback Procedures

### Trigger Criteria
- Annotation queue crashes repeatedly after deploy (restart loop)
- Queue messages are being dropped or corrupted
- `config.yaml` generation failed (missing env vars)
- New code has a bug affecting annotation processing
### Rollback Steps
1. Run `scripts/deploy.sh --rollback`
   - This reads the previous image tags from `scripts/.previous-tags` (saved during deploy)
   - Stops current containers
   - Starts containers with previous image tags
2. Verify health: `scripts/health-check.sh`
3. Check annotation queue is consuming correctly
4. Investigate root cause of the failed deploy
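
The exact format of the tags file is not documented. Assuming one `image:tag` per line, the image-restore step might look like this (`RUN=echo` dry-runs it):

```shell
# Re-pull the previously deployed images listed in a tags file
# (format assumed: one image:tag per line). Set RUN=echo for a dry run.
rollback_pull() {
  local run="${RUN:-}" image
  while IFS= read -r image; do
    [ -n "$image" ] || continue
    $run docker pull "$image"
  done < "$1"
}
```

Invoked as `RUN=echo rollback_pull scripts/.previous-tags`, it prints one `docker pull` per saved tag.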

### Training Rollback
Training is not managed by deploy scripts. If a new training run produces bad results:
1. The previous `best.pt` model is still available in `/azaion/models/` (dated directories)
2. Roll back by pointing `config.yaml` to the previous model
3. No container restart needed — training is a batch job started manually
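
The `config.yaml` keys are not shown in this document, so the fragment below is purely illustrative of step 2 (key names and the dated directory are hypothetical):

```yaml
# Illustrative only; actual config.yaml keys may differ.
model:
  path: /azaion/models/2026-03-15/best.pt   # previous dated directory (date hypothetical)
```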

## Deployment Checklist (Quick Reference)
```
Pre-deploy:
□ CI green on dev branch
□ Images built and pushed
□ .env updated on server (if new vars added)
□ No active training run (if training container is being updated)

Deploy:
□ SSH to server
□ Run scripts/deploy.sh
□ Verify health-check.sh passes

Post-deploy:
□ docker ps shows containers running
□ docker logs show no errors
□ Queue offset advancing
□ Disk space adequate

If problems:
□ Run scripts/deploy.sh --rollback
□ Verify health
□ Investigate logs
```