# Azaion AI Training — CI/CD Pipeline

## Pipeline Overview

| Stage | Trigger | Quality Gate |
|-------|---------|-------------|
| Lint | Push to `dev`/`main`, PR to `dev` | Zero lint errors |
| Test | Push to `dev`/`main`, PR to `dev` | All tests pass |
| Security | Push to `dev`/`main`, PR to `dev` | Zero critical/high CVEs |
| Build | Push to `dev` (PR merge) | Docker build succeeds |
| Push | After build | Images pushed to registry |
| Deploy | Manual trigger | Health checks pass on target server |

There is no staging environment because the system runs on a dedicated GPU server. "Staging" is replaced by two checks: the test suite running in CI on CPU-only runners (unit tests and annotation queue tests) and manual GPU verification on the target machine.

## Stage Details

### Lint

- `black --check src/` — Python formatting
- `ruff check src/` — Python linting
- Runs on standard CI runner (no GPU)

### Test

- Framework: `pytest`
- Command: `pytest tests/ -v --tb=short` (CI adds `--ignore=tests/gpu`)
- Test compose for annotation queue integration tests: `docker compose -f docker-compose.test.yml up --abort-on-container-exit`
- GPU-dependent tests (training, export) are excluded from CI: they require a physical GPU and run during manual verification on the target server
- Coverage report published as pipeline artifact
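
The workflow in the configuration section runs only the plain `pytest` command. As a rough sketch, the compose-based integration tests and the coverage artifact could be added to the `test` job along these lines. This assumes `pytest-cov` is listed in `requirements-test.txt` and that `src` is the package to measure; the artifact name `coverage-report` is made up for illustration.

```yaml
# Sketch only: extended `test` job. Assumes pytest-cov is installed via
# requirements-test.txt; `coverage-report` is an illustrative artifact name.
test:
  runs-on: ubuntu-latest
  needs: lint
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
      with:
        python-version: "3.10"
    - run: pip install -r requirements-test.txt
    # Unit tests with coverage; GPU tests stay excluded on CPU-only runners.
    - run: pytest tests/ -v --tb=short --ignore=tests/gpu --cov=src --cov-report=xml
    # Annotation queue integration tests from the test compose file.
    - run: docker compose -f docker-compose.test.yml up --abort-on-container-exit
    # Publish coverage.xml even when an earlier step failed.
    - uses: actions/upload-artifact@v4
      if: always()
      with:
        name: coverage-report
        path: coverage.xml
```

If the compose step should fail the job whenever the test container fails, `docker compose up` also accepts `--exit-code-from <service>`. The service name depends on `docker-compose.test.yml`, so it is left open here.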

### Security

- Dependency audit: `pip-audit -r requirements.txt`
- Dependency audit: `pip-audit -r src/annotation-queue/requirements.txt`
- SAST scan: Semgrep with `p/python` ruleset
- Image scan: Trivy on built Docker images (runs after the Build stage, since it needs the images)
- Block on: critical or high severity findings
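
The image scan could look roughly like the step below, placed in the build job after the images exist. This is a sketch under assumptions: the action ref and the choice to scan the pushed `training` image are illustrative, and a second step would be needed for the `annotation-queue` image.

```yaml
# Sketch only: Trivy scan step for the build job. The `@master` ref is a
# placeholder; pin a released version in a real workflow.
- uses: aquasecurity/trivy-action@master
  with:
    # Assumes the image was already built and pushed under this tag.
    image-ref: ${{ secrets.DOCKER_REGISTRY }}/azaion/training:${{ github.sha }}
    # Report only the severities that the quality gate blocks on.
    severity: CRITICAL,HIGH
    # Non-zero exit code fails the job when a matching finding exists.
    exit-code: "1"
```

Scanning after the push means a vulnerable image can reach the registry before the job fails. Building with `load: true` and scanning the local image first is one way around that, at the cost of a second build-push step.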

### Build

- Docker images built for both components:
  - `docker/training.Dockerfile` → `azaion/training:<git-sha>`
  - `docker/annotation-queue.Dockerfile` → `azaion/annotation-queue:<git-sha>`
- Build cache: Docker layer cache via GitHub Actions cache
- Build runs on standard runner — no GPU needed for `docker build`

### Push

- Registry: configurable via `DOCKER_REGISTRY` secret (e.g., GitHub Container Registry `ghcr.io`, or a private registry)
- Authentication: registry login via CI secrets (`DOCKER_REGISTRY_USER`, `DOCKER_REGISTRY_TOKEN`)

### Deploy

- **Manual trigger only** (`workflow_dispatch`): training runs last for days, so unattended deploys are risky
- Deployment method: SSH to the target GPU server and run `scripts/deploy.sh`
- Pre-deploy: pull new images, stop services gracefully
- Post-deploy: start services, run health check script
- Rollback: `scripts/deploy.sh --rollback` redeploys previous image tags
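
GitHub-hosted runners start without any SSH credentials, so the SSH step needs a key and a trusted host entry before it can connect. One possible shape for those steps is sketched below. The secrets `DEPLOY_SSH_KEY` and `DEPLOY_KNOWN_HOSTS` are hypothetical names that do not appear elsewhere in this document, and the key is assumed to be an Ed25519 key.

```yaml
# Sketch only: deploy steps with SSH setup. DEPLOY_SSH_KEY and
# DEPLOY_KNOWN_HOSTS are hypothetical secrets introduced for illustration.
steps:
  - name: Configure SSH credentials
    env:
      SSH_KEY: ${{ secrets.DEPLOY_SSH_KEY }}
      KNOWN_HOSTS: ${{ secrets.DEPLOY_KNOWN_HOSTS }}
    run: |
      install -m 700 -d ~/.ssh
      # Assumes an Ed25519 private key; ssh picks this path up by default.
      printf '%s\n' "$SSH_KEY" > ~/.ssh/id_ed25519
      chmod 600 ~/.ssh/id_ed25519
      # Pinned host key, so the connection does not trust on first use.
      printf '%s\n' "$KNOWN_HOSTS" > ~/.ssh/known_hosts
  - name: Run deploy script on the GPU server
    env:
      DEPLOY_USER: ${{ secrets.DEPLOY_USER }}
      DEPLOY_HOST: ${{ secrets.DEPLOY_HOST }}
      IMAGE_TAG: ${{ github.sha }}
    run: |
      ssh "$DEPLOY_USER@$DEPLOY_HOST" \
        "cd /opt/azaion-training && DOCKER_IMAGE_TAG=$IMAGE_TAG bash scripts/deploy.sh"
```

Passing secrets through `env` rather than interpolating `${{ secrets.* }}` directly into the script keeps their values out of the generated shell source.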

## Pipeline Configuration (GitHub Actions)

```yaml
name: CI/CD

on:
  push:
    branches: [dev, main]
  pull_request:
    branches: [dev]
  workflow_dispatch:
    inputs:
      deploy_target:
        description: "Deploy to target server"
        required: true
        type: boolean
        default: false

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install black ruff
      - run: black --check src/
      - run: ruff check src/

  test:
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements-test.txt
      - run: pytest tests/ -v --tb=short --ignore=tests/gpu

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install pip-audit
      - run: pip-audit -r requirements.txt
      - run: pip-audit -r src/annotation-queue/requirements.txt
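      # returntocorp/semgrep-action is deprecated; run the Semgrep CLI directly.
      - run: pip install semgrep
      # --error makes the step exit non-zero when findings are reported.
      - run: semgrep scan --config p/python --error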

  build-and-push:
    runs-on: ubuntu-latest
    needs: [test, security]
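    # Also runs on manual dispatch; otherwise this job is skipped and the
    # dependent deploy job can never run.
    if: github.ref == 'refs/heads/dev' && (github.event_name == 'push' || github.event_name == 'workflow_dispatch')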
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ${{ secrets.DOCKER_REGISTRY }}
          username: ${{ secrets.DOCKER_REGISTRY_USER }}
          password: ${{ secrets.DOCKER_REGISTRY_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          file: docker/training.Dockerfile
          push: true
          tags: ${{ secrets.DOCKER_REGISTRY }}/azaion/training:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
      - uses: docker/build-push-action@v6
        with:
          context: .
          file: docker/annotation-queue.Dockerfile
          push: true
          tags: ${{ secrets.DOCKER_REGISTRY }}/azaion/annotation-queue:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    runs-on: ubuntu-latest
    needs: build-and-push
    if: github.event.inputs.deploy_target == 'true'
    environment: production
    steps:
      - uses: actions/checkout@v4
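      # Assumes an SSH key and known_hosts entry for DEPLOY_HOST are already
      # available on the runner; this workflow does not set them up.
      - run: |
          ssh ${{ secrets.DEPLOY_USER }}@${{ secrets.DEPLOY_HOST }} \
            "cd /opt/azaion-training && \
             DOCKER_IMAGE_TAG=${{ github.sha }} \
             bash scripts/deploy.sh"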
```

## Caching Strategy

| Cache | Key | Restore Keys |
|-------|-----|-------------|
| pip dependencies | `requirements.txt` hash | `pip-` prefix |
| Docker layers | GitHub Actions cache (BuildKit) | `gha-` prefix |
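
The Docker layer cache is already configured through `cache-from` and `cache-to` in the build steps, but the pip cache is not. A minimal sketch of a pip cache step matching the table is shown below. It assumes pip's default cache directory on Linux runners and would sit before the `pip install` step of a job.

```yaml
# Sketch only: pip cache step, placed before `pip install`.
# ~/.cache/pip is pip's default cache location on Linux runners.
- uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    # Exact match on the requirements.txt hash.
    key: pip-${{ hashFiles('requirements.txt') }}
    # Fall back to the most recent pip cache when the hash changed.
    restore-keys: |
      pip-
```

Jobs that install from a different file, such as `requirements-test.txt`, would hash that file instead. `actions/setup-python` also offers a built-in `cache: pip` option as an alternative to a separate cache step.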

## Parallelization

```
push event
  ├── lint ──► test ──┐
  │                   ├──► build-and-push ──► deploy (manual)
  └── security ───────┘
```

Lint and security run in parallel. Test depends on lint. Build depends on both test and security passing.

## Notifications

| Event | Channel | Recipients |
|-------|---------|-----------|
| Build failure | GitHub PR check | PR author |
| Security alert | GitHub security tab | Repository maintainers |
| Deploy success | GitHub Actions log | Deployment team |
| Deploy failure | GitHub Actions log + email | Deployment team |