# Azaion AI Training — Containerization

## Component Dockerfiles

### Training Pipeline

| Property | Value |
|----------|-------|
| Base image | `nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04` |
| Build image | Same (devel required for TensorRT engine build + pycuda) |
| Stages | 1) system deps + Python 3.10 → 2) pip install requirements → 3) copy source |
| User | `azaion` (non-root, UID 1000) |
| Health check | Not applicable — batch job, exits on completion |
| Exposed ports | None |
| Key build args | `CUDA_VERSION=12.1.1` |

Single-stage build: the devel image is required at runtime for TensorRT engine compilation and pycuda. The image is large, but training runs for days on a dedicated GPU server, so image size is not a deployment bottleneck. Installs from `requirements.txt` with `--extra-index-url https://download.pytorch.org/whl/cu121` for PyTorch CUDA 12.1 wheels.

Volume mount: `/azaion/` host directory for datasets, models, and annotation data.

### Annotation Queue

| Property | Value |
|----------|-------|
| Base image | `python:3.10-slim` |
| Build image | `python:3.10-slim` (no compilation needed) |
| Stages | 1) pip install from `src/annotation-queue/requirements.txt` → 2) copy source |
| User | `azaion` (non-root, UID 1000) |
| Health check | `CMD python -c "import rstream" \|\| exit 1` (process liveness; no HTTP endpoint) |
| Exposed ports | None |
| Key build args | None |

Lightweight container — only needs `pyyaml`, `msgpack`, and `rstream`. No GPU, no heavy ML libraries. Runs as a persistent async process consuming from RabbitMQ Streams.

Volume mount: `/azaion/` host directory for writing annotation images and labels.
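The Annotation Queue table above implies a straightforward two-step Dockerfile. A minimal sketch, assuming the stated base image, user, and requirements path; the `/app` workdir and `main.py` entrypoint are hypothetical names, not confirmed by this document:

```dockerfile
FROM python:3.10-slim

# Non-root user with UID 1000, per the table
RUN useradd --create-home --uid 1000 azaion

WORKDIR /app

# Step 1: install dependencies first so source edits don't invalidate the pip layer
COPY src/annotation-queue/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Step 2: copy source
COPY --chown=azaion:azaion src/annotation-queue/ .

USER azaion

# Process liveness only — there is no HTTP endpoint to probe
HEALTHCHECK CMD python -c "import rstream" || exit 1

CMD ["python", "main.py"]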
### Not Containerized

The following are developer/verification tools, not production services:

- **Inference Engine** (`start_inference.py`) — used for testing and model verification, runs ad-hoc on a GPU machine
- **Data Tools** (`convert-annotations.py`, `dataset-visualiser.py`) — interactive developer utilities requiring a GUI environment

## Docker Compose — Local Development

```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: ${RABBITMQ_USER}
      RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD}
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 10s
      timeout: 5s
      retries: 5

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    env_file: .env
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    restart: unless-stopped

  training:
    build:
      context: .
      dockerfile: docker/training.Dockerfile
    env_file: .env
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ipc: host
    shm_size: "16g"

volumes:
  rabbitmq_data:

networks:
  default:
    name: azaion-training
```

Notes:

- `ipc: host` and `shm_size: "16g"` support PyTorch multi-worker data loading
- `annotation-queue` runs continuously and restarts on failure
- The RabbitMQ Streams plugin must be enabled (port 5552); the management UI is on port 15672

## Docker Compose — Blackbox Tests

```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
    environment:
      RABBITMQ_DEFAULT_USER: test_user
      RABBITMQ_DEFAULT_PASS: test_pass
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 5s
      timeout: 3s
      retries: 10

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      RABBITMQ_QUEUE_NAME: azaion-annotations
      AZAION_ROOT_DIR: /azaion
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - test_data:/azaion

  test-runner:
    build:
      context: .
      dockerfile: docker/test-runner.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      AZAION_ROOT_DIR: /azaion
      TEST_SCOPE: blackbox
    depends_on:
      rabbitmq:
        condition: service_healthy
      annotation-queue:
        condition: service_started
    volumes:
      - test_data:/azaion

volumes:
  test_data:
```

Run: `docker compose -f docker-compose.test.yml up --abort-on-container-exit`

Note: GPU-dependent tests (training) require `--gpus all` and are excluded from the default blackbox test suite. They run separately via `docker compose -f docker-compose.test.yml --profile gpu-tests up --abort-on-container-exit`.

## Image Tagging Strategy

| Context | Tag Format | Example |
|---------|-----------|---------|
| CI build | `<registry>/azaion/<component>:<git-sha>` | `registry.example.com/azaion/training:a1b2c3d` |
| Release | `<registry>/azaion/<component>:<version>` | `registry.example.com/azaion/training:1.0.0` |
| Local dev | `azaion-<component>:latest` | `azaion-training:latest` |

## .dockerignore

```
.git
.cursor
_docs
_standalone
tests
**/__pycache__
**/*.pyc
*.md
.env
.env.example
docker-compose*.yml
.gitignore
.editorconfig
requirements-test.txt
```
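Both compose files configure the annotation queue entirely through environment variables (`RABBITMQ_HOST`, `RABBITMQ_PORT`, `RABBITMQ_QUEUE_NAME`, `AZAION_ROOT_DIR`, and the credentials). A minimal sketch of gathering them into one typed config at startup; `QueueConfig` and `load_config` are hypothetical names, and the defaults mirror the compose values above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueueConfig:
    host: str
    port: int
    user: str
    password: str
    stream: str
    root_dir: str

def load_config(env: dict) -> QueueConfig:
    """Build config from an environment mapping (e.g. os.environ).

    Defaults mirror the compose files: RabbitMQ Streams on port 5552,
    data under /azaion unless AZAION_ROOT_DIR overrides it. Credentials
    have no default — failing fast beats connecting with a guess.
    """
    return QueueConfig(
        host=env.get("RABBITMQ_HOST", "localhost"),
        port=int(env.get("RABBITMQ_PORT", "5552")),
        user=env["RABBITMQ_USER"],
        password=env["RABBITMQ_PASSWORD"],
        stream=env.get("RABBITMQ_QUEUE_NAME", "azaion-annotations"),
        root_dir=env.get("AZAION_ROOT_DIR", "/azaion"),
    )

cfg = load_config({
    "RABBITMQ_HOST": "rabbitmq",
    "RABBITMQ_USER": "test_user",
    "RABBITMQ_PASSWORD": "test_pass",
})
print(cfg.port, cfg.stream)  # 5552 azaion-annotations
```

Keeping the parsing in one place means the local-dev and blackbox-test environments differ only in the mapping passed in, which also makes the loader trivially unit-testable.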