# Azaion AI Training — Containerization
## Component Dockerfiles
### Training Pipeline
| Property | Value |
|----------|-------|
| Base image | `nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04` |
| Build image | Same (devel required for TensorRT engine build + pycuda) |
| Build steps | 1) system deps + Python 3.10 → 2) pip install requirements → 3) copy source |
| User | `azaion` (non-root, UID 1000) |
| Health check | Not applicable — batch job, exits on completion |
| Exposed ports | None |
| Key build args | `CUDA_VERSION=12.1.1` |
Single-stage build: the devel image is required at runtime for TensorRT engine compilation and pycuda. The image is large, but training runs for days on a dedicated GPU server, so image size is not a deployment bottleneck.
Installs from `requirements.txt` with `--extra-index-url https://download.pytorch.org/whl/cu121` for PyTorch CUDA 12.1 wheels.
Volume mount: `/azaion/` host directory for datasets, models, and annotation data.
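The layering above can be sketched as a Dockerfile. This is a minimal illustration of the three build steps from the table, not the project's actual file — the source layout, apt package list, and entrypoint are assumptions:

```dockerfile
# Sketch of docker/training.Dockerfile — paths and package names are assumptions
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

# 1) System deps + Python 3.10 (the default python3 on Ubuntu 22.04)
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.10 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Non-root user, UID 1000
RUN useradd --uid 1000 --create-home azaion

# 2) Install Python deps first so this layer is cached while requirements.txt
#    is unchanged; CUDA 12.1 PyTorch wheels come from the extra index
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt \
        --extra-index-url https://download.pytorch.org/whl/cu121

# 3) Copy source last so code edits don't invalidate the dependency layer
COPY . .
USER azaion
```

Ordering dependencies before source is what makes iterating on training code cheap despite the multi-gigabyte base image.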
### Annotation Queue
| Property | Value |
|----------|-------|
| Base image | `python:3.10-slim` |
| Build image | `python:3.10-slim` (no compilation needed) |
| Stages | 1) pip install from `src/annotation-queue/requirements.txt` → 2) copy source |
| User | `azaion` (non-root, UID 1000) |
| Health check | `CMD python -c "import rstream" \|\| exit 1` (process liveness; no HTTP endpoint) |
| Exposed ports | None |
| Key build args | None |
Lightweight container — only needs `pyyaml`, `msgpack`, `rstream`. No GPU, no heavy ML libraries. Runs as a persistent async process consuming from RabbitMQ Streams.
Volume mount: `/azaion/` host directory for writing annotation images and labels.
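A sketch of the corresponding Dockerfile, again matching the table rather than the real file — the source path layout and the `main.py` entrypoint are assumptions:

```dockerfile
# Sketch of docker/annotation-queue.Dockerfile — paths and entrypoint are assumptions
FROM python:3.10-slim

# Non-root user, UID 1000
RUN useradd --uid 1000 --create-home azaion

WORKDIR /app

# 1) Dependencies first for layer caching
COPY src/annotation-queue/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# 2) Copy source
COPY src/annotation-queue/ .
USER azaion

# Process-liveness check from the table: no HTTP endpoint, so just verify
# the interpreter and the rstream client library are importable
HEALTHCHECK CMD python -c "import rstream" || exit 1

CMD ["python", "main.py"]
```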
### Not Containerized
The following are developer/verification tools, not production services:
- **Inference Engine** (`start_inference.py`) — used for testing and model verification, runs ad-hoc on a GPU machine
- **Data Tools** (`convert-annotations.py`, `dataset-visualiser.py`) — interactive developer utilities requiring GUI environment
## Docker Compose — Local Development
```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: ${RABBITMQ_USER}
      RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD}
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 10s
      timeout: 5s
      retries: 5

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    env_file: .env
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    restart: unless-stopped

  training:
    build:
      context: .
      dockerfile: docker/training.Dockerfile
    env_file: .env
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ipc: host
    shm_size: "16g"

volumes:
  rabbitmq_data:

networks:
  default:
    name: azaion-training
```
Notes:
- `ipc: host` and `shm_size: "16g"` for PyTorch multi-worker data loading
- `annotation-queue` runs continuously, restarts on failure
- RabbitMQ Streams plugin must be enabled (port 5552); the management UI is on port 15672
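The `management-alpine` image does not enable the stream plugin by default. One common way to enable it (an assumption about this project's setup, not a documented choice here) is to mount an `enabled_plugins` file into the broker at `/etc/rabbitmq/enabled_plugins`:

```
% Contents of the mounted enabled_plugins file.
% Erlang term syntax: the trailing dot is required.
[rabbitmq_management,rabbitmq_stream].
```

Alternatively, `rabbitmq-plugins enable rabbitmq_stream` can be run inside the container, but the file-mount approach survives container recreation.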
## Docker Compose — Blackbox Tests
```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
    environment:
      RABBITMQ_DEFAULT_USER: test_user
      RABBITMQ_DEFAULT_PASS: test_pass
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 5s
      timeout: 3s
      retries: 10

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      RABBITMQ_QUEUE_NAME: azaion-annotations
      AZAION_ROOT_DIR: /azaion
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - test_data:/azaion

  test-runner:
    build:
      context: .
      dockerfile: docker/test-runner.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      AZAION_ROOT_DIR: /azaion
      TEST_SCOPE: blackbox
    depends_on:
      rabbitmq:
        condition: service_healthy
      annotation-queue:
        condition: service_started
    volumes:
      - test_data:/azaion

volumes:
  test_data:
```
Run: `docker compose -f docker-compose.test.yml up --abort-on-container-exit`
Note: GPU-dependent tests (training) require `--gpus all` and are excluded from the default blackbox test suite. They run separately via `docker compose -f docker-compose.test.yml --profile gpu-tests up --abort-on-container-exit`.
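How the `gpu-tests` profile might be wired into the same compose file — a sketch only; the service name and `TEST_SCOPE` value are assumptions:

```yaml
# Sketch: a GPU test service gated behind the "gpu-tests" profile, so it is
# skipped by a plain `docker compose up` and only started with --profile gpu-tests
services:
  gpu-test-runner:
    build:
      context: .
      dockerfile: docker/test-runner.Dockerfile
    profiles: ["gpu-tests"]
    environment:
      TEST_SCOPE: gpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```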
## Image Tagging Strategy
| Context | Tag Format | Example |
|---------|-----------|---------|
| CI build | `<registry>/azaion/<component>:<git-sha>` | `registry.example.com/azaion/training:a1b2c3d` |
| Release | `<registry>/azaion/<component>:<semver>` | `registry.example.com/azaion/training:1.0.0` |
| Local dev | `azaion-<component>:latest` | `azaion-training:latest` |
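A CI job would assemble the tag from its three parts. A minimal sketch using the values from the table's example row (the build/push commands are illustrative, not a documented pipeline):

```shell
#!/bin/sh
# Assemble the CI image tag from registry, component, and commit SHA
REGISTRY="registry.example.com"
COMPONENT="training"
GIT_SHA="a1b2c3d"   # in CI this would come from: git rev-parse --short HEAD

CI_TAG="${REGISTRY}/azaion/${COMPONENT}:${GIT_SHA}"
echo "$CI_TAG"      # prints: registry.example.com/azaion/training:a1b2c3d

# The build/push step would then be:
#   docker build -f "docker/${COMPONENT}.Dockerfile" -t "$CI_TAG" .
#   docker push "$CI_TAG"
```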
## .dockerignore
```
.git
.cursor
_docs
_standalone
tests
**/__pycache__
**/*.pyc
*.md
.env
.env.example
docker-compose*.yml
.gitignore
.editorconfig
requirements-test.txt
```