# Azaion AI Training: CI/CD Pipeline

## Pipeline Overview

| Stage | Trigger | Quality Gate |
|-------|---------|--------------|
| Lint | Every push | Zero lint errors |
| Test | Every push | All tests pass |
| Security | Every push | Zero critical/high CVEs |
| Build | PR merge to dev | Docker build succeeds |
| Push | After build | Images pushed to registry |
| Deploy | Manual trigger | Health checks pass on target server |

There is no staging environment: the system runs on a dedicated GPU server. "Staging" is replaced by the test suite running in CI on CPU-only runners (annotation queue tests, unit tests) and manual GPU verification on the target machine.

## Stage Details

### Lint

- `black --check src/`: Python formatting
- `ruff check src/`: Python linting
- Runs on standard CI runner (no GPU)

### Test

- Framework: `pytest`
- Command: `pytest tests/ -v --tb=short`
- Test compose for annotation queue integration tests: `docker compose -f docker-compose.test.yml up --abort-on-container-exit`
- GPU-dependent tests (training, export) are excluded from CI; they require a physical GPU and run during manual verification on the target server
- Coverage report published as pipeline artifact

### Security

- Dependency audit: `pip-audit -r requirements.txt`
- Dependency audit: `pip-audit -r src/annotation-queue/requirements.txt`
- SAST scan: Semgrep with `p/python` ruleset
- Image scan: Trivy on built Docker images
- Block on: critical or high severity findings

### Build

- Docker images built for both components:
  - `docker/training.Dockerfile` → `azaion/training:`
  - `docker/annotation-queue.Dockerfile` → `azaion/annotation-queue:`
- Build cache: Docker layer cache via GitHub Actions cache
- Build runs on standard runner; no GPU is needed for `docker build`

### Push

- Registry: configurable via `DOCKER_REGISTRY` secret (e.g., GitHub Container Registry `ghcr.io`, or private registry)
- Authentication: registry login via CI secrets (`DOCKER_REGISTRY_USER`, `DOCKER_REGISTRY_TOKEN`)
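
Once images are built, the Security stage's Trivy scan has to block on critical or high severity findings. The following is a minimal sketch of such a gate, assuming the scan writes Trivy's JSON report to a file (for example with `trivy image --format json --output report.json`). The function names and the report path are hypothetical and not part of this repository.

```python
import json

# Severities that block the pipeline, per the Security stage rule.
BLOCKING = {"CRITICAL", "HIGH"}


def blocking_findings(report: dict) -> list[tuple[str, str, str]]:
    """Return (target, vulnerability ID, severity) for each blocking finding."""
    found = []
    # "Results" and "Vulnerabilities" may be missing or null when nothing was found.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            severity = vuln.get("Severity")
            if severity in BLOCKING:
                found.append(
                    (result.get("Target", "?"), vuln.get("VulnerabilityID", "?"), severity)
                )
    return found


def gate(report_path: str) -> int:
    """Print blocking findings and return a process exit code (hypothetical helper)."""
    with open(report_path, encoding="utf-8") as fh:
        findings = blocking_findings(json.load(fh))
    for target, vuln_id, severity in findings:
        print(f"{severity}: {vuln_id} in {target}")
    return 1 if findings else 0
```

A CI step could call `sys.exit(gate("report.json"))` so that any critical or high finding fails the job, while medium and low findings are reported by Trivy but do not block.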

### Deploy

- **Manual trigger only** (`workflow_dispatch`): training runs last for days, so unattended deploys are risky
- Deployment method: SSH to target GPU server, run deploy scripts (`scripts/deploy.sh`)
- Pre-deploy: pull new images, stop services gracefully
- Post-deploy: start services, run health check script
- Rollback: `scripts/deploy.sh --rollback` redeploys previous image tags

## Pipeline Configuration (GitHub Actions)

```yaml
name: CI/CD

on:
  push:
    branches: [dev, main]
  pull_request:
    branches: [dev]
  workflow_dispatch:
    inputs:
      deploy_target:
        description: "Deploy to target server"
        required: true
        type: boolean
        default: false

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install black ruff
      - run: black --check src/
      - run: ruff check src/

  test:
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements-test.txt
      # GPU-dependent tests are excluded; they run manually on the target server
      - run: pytest tests/ -v --tb=short --ignore=tests/gpu

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install pip-audit
      # A failing audit fails the job, which blocks build-and-push
      - run: pip-audit -r requirements.txt
      - run: pip-audit -r src/annotation-queue/requirements.txt
      - uses: returntocorp/semgrep-action@v1
        with:
          config: p/python

  build-and-push:
    runs-on: ubuntu-latest
    needs: [test, security]
    if: github.ref == 'refs/heads/dev' && github.event_name == 'push'
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ${{ secrets.DOCKER_REGISTRY }}
          username: ${{ secrets.DOCKER_REGISTRY_USER }}
          password: ${{ secrets.DOCKER_REGISTRY_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          file: docker/training.Dockerfile
          push: true
          tags: ${{ secrets.DOCKER_REGISTRY }}/azaion/training:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
      - uses: docker/build-push-action@v6
        with:
          context: .
          file: docker/annotation-queue.Dockerfile
          push: true
          tags: ${{ secrets.DOCKER_REGISTRY }}/azaion/annotation-queue:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    runs-on: ubuntu-latest
    needs: build-and-push
    # build-and-push is skipped on workflow_dispatch runs, and a skipped dependency
    # skips this job by default. !cancelled() lets the manual run proceed and deploy
    # the image already pushed for this commit.
    if: ${{ !cancelled() && github.event.inputs.deploy_target == 'true' }}
    environment: production
    steps:
      - uses: actions/checkout@v4
      # Assumes SSH credentials for DEPLOY_HOST are already available on the runner
      - run: |
          ssh ${{ secrets.DEPLOY_USER }}@${{ secrets.DEPLOY_HOST }} \
            "cd /opt/azaion-training && \
             DOCKER_IMAGE_TAG=${{ github.sha }} \
             bash scripts/deploy.sh"
```

## Caching Strategy

| Cache | Key | Restore Keys |
|-------|-----|--------------|
| pip dependencies | `requirements.txt` hash | `pip-` prefix |
| Docker layers | GitHub Actions cache (BuildKit) | `gha-` prefix |

## Parallelization

```
push event
├── lint ──► test ──┐
│                   ├──► build-and-push ──► deploy (manual)
└── security ───────┘
```

Lint and security run in parallel. Test depends on lint. Build depends on both test and security passing.

## Notifications

| Event | Channel | Recipients |
|-------|---------|------------|
| Build failure | GitHub PR check | PR author |
| Security alert | GitHub security tab | Repository maintainers |
| Deploy success | GitHub Actions log | Deployment team |
| Deploy failure | GitHub Actions log + email | Deployment team |
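
The job ordering described under Parallelization can be checked mechanically. The sketch below copies the `needs:` relationships from the workflow into a plain dictionary and groups the jobs into waves with Python's standard `graphlib` module. The `NEEDS` mapping and the `execution_waves` helper are illustrative names, not part of the pipeline.

```python
from graphlib import TopologicalSorter

# Job -> jobs it needs, copied by hand from the `needs:` keys in the workflow.
NEEDS = {
    "lint": [],
    "security": [],
    "test": ["lint"],
    "build-and-push": ["test", "security"],
    "deploy": ["build-and-push"],
}


def execution_waves(needs: dict[str, list[str]]) -> list[list[str]]:
    """Group jobs into waves; every job in a wave has all its dependencies met."""
    sorter = TopologicalSorter(needs)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, jobs in enumerate(execution_waves(NEEDS), start=1):
        print(f"wave {number}: {', '.join(jobs)}")
```

The waves are a simplification: GitHub Actions starts `test` as soon as `lint` finishes, without waiting for `security`. The grouping still confirms that `lint` and `security` have no mutual dependency and that `build-and-push` cannot start before both `test` and `security` complete.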