# Azaion AI Training — Containerization

## Component Dockerfiles

### Training Pipeline

| Property | Value |
|---|---|
| Base image | nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04 |
| Build image | Same (devel required for TensorRT engine build + pycuda) |
| Build steps | 1) system deps + Python 3.10 → 2) pip install requirements → 3) copy source |
| User | azaion (non-root, UID 1000) |
| Health check | Not applicable — batch job, exits on completion |
| Exposed ports | None |
| Key build args | CUDA_VERSION=12.1.1 |
Single-stage build (the devel image is required at runtime for TensorRT engine compilation and pycuda). The image is large, but training runs for days on a dedicated GPU server, so image size is not a deployment bottleneck.

Installs from `requirements.txt` with `--extra-index-url https://download.pytorch.org/whl/cu121` for PyTorch CUDA 12.1 wheels.

Volume mount: the `/azaion/` host directory for datasets, models, and annotation data.
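As a rough illustration, the table maps onto a Dockerfile along these lines. This is a sketch, not the project's actual `docker/training.Dockerfile`: the `train.py` entrypoint, the `/app` working directory, and the system package list are assumptions.

```dockerfile
# Sketch only — not the project's actual docker/training.Dockerfile.
ARG CUDA_VERSION=12.1.1
FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu22.04

# 1) System deps + Python 3.10 (the default python3 on Ubuntu 22.04)
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip python3-dev git \
    && rm -rf /var/lib/apt/lists/*

# Non-root user, UID 1000
RUN useradd --create-home --uid 1000 azaion

# 2) Dependencies before source, so code changes don't invalidate the pip layer
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install --no-cache-dir -r /tmp/requirements.txt \
        --extra-index-url https://download.pytorch.org/whl/cu121

# 3) Copy source
COPY --chown=azaion:azaion . /app
WORKDIR /app
USER azaion

# Batch job: no ports, no health check; exits when training completes.
# The entrypoint script name is a placeholder.
CMD ["python3", "train.py"]
```

Copying `requirements.txt` before the source keeps the large pip layer cached across code-only rebuilds.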
### Annotation Queue

| Property | Value |
|---|---|
| Base image | python:3.10-slim |
| Build image | python:3.10-slim (no compilation needed) |
| Build steps | 1) pip install from `src/annotation-queue/requirements.txt` → 2) copy source |
| User | azaion (non-root, UID 1000) |
| Health check | `CMD python -c "import rstream" \|\| exit 1` (process liveness; no HTTP endpoint) |
| Exposed ports | None |
| Key build args | None |
Lightweight container — only needs `pyyaml`, `msgpack`, and `rstream`. No GPU, no heavy ML libraries. Runs as a persistent async process consuming from RabbitMQ Streams.

Volume mount: the `/azaion/` host directory for writing annotation images and labels.
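A sketch of what `docker/annotation-queue.Dockerfile` could look like given the table; the `main.py` entry module and `/app` layout are assumptions.

```dockerfile
# Sketch only — not the project's actual docker/annotation-queue.Dockerfile.
FROM python:3.10-slim

# Non-root user, UID 1000
RUN useradd --create-home --uid 1000 azaion

# 1) Install the small dependency set
COPY src/annotation-queue/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# 2) Copy source
COPY --chown=azaion:azaion src/annotation-queue /app
WORKDIR /app
USER azaion

# Import check as a liveness proxy — there is no HTTP endpoint to probe
HEALTHCHECK --interval=30s --timeout=5s \
    CMD python -c "import rstream" || exit 1

# Entry module name is a placeholder
CMD ["python", "main.py"]
```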
## Not Containerized

The following are developer/verification tools, not production services:

- Inference Engine (`start_inference.py`) — used for testing and model verification; runs ad-hoc on a GPU machine
- Data Tools (`convert-annotations.py`, `dataset-visualiser.py`) — interactive developer utilities requiring a GUI environment

## Docker Compose — Local Development

```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: ${RABBITMQ_USER}
      RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD}
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 10s
      timeout: 5s
      retries: 5

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    env_file: .env
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    restart: unless-stopped

  training:
    build:
      context: .
      dockerfile: docker/training.Dockerfile
    env_file: .env
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ipc: host
    shm_size: "16g"

volumes:
  rabbitmq_data:

networks:
  default:
    name: azaion-training
```
Notes:

- `ipc: host` and `shm_size: "16g"` for PyTorch multi-worker data loading
- `annotation-queue` runs continuously and restarts on failure
- The RabbitMQ Streams plugin must be enabled (port 5552); one way to do this is sketched below. The management UI is on port 15672.
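One common way to satisfy the Streams requirement (a sketch, not necessarily how this project does it) is to mount an `enabled_plugins` file into the broker container:

```yaml
# enabled_plugins (an Erlang term list — note the trailing dot):
#   [rabbitmq_management,rabbitmq_stream].
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    volumes:
      - ./enabled_plugins:/etc/rabbitmq/enabled_plugins
      - rabbitmq_data:/var/lib/rabbitmq
```

Alternatively, `docker compose exec rabbitmq rabbitmq-plugins enable rabbitmq_stream` enables it on a running broker, though that does not survive container recreation.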
## Docker Compose — Blackbox Tests

```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
    environment:
      RABBITMQ_DEFAULT_USER: test_user
      RABBITMQ_DEFAULT_PASS: test_pass
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 5s
      timeout: 3s
      retries: 10

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      RABBITMQ_QUEUE_NAME: azaion-annotations
      AZAION_ROOT_DIR: /azaion
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - test_data:/azaion

  test-runner:
    build:
      context: .
      dockerfile: docker/test-runner.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      AZAION_ROOT_DIR: /azaion
      TEST_SCOPE: blackbox
    depends_on:
      rabbitmq:
        condition: service_healthy
      annotation-queue:
        condition: service_started
    volumes:
      - test_data:/azaion

volumes:
  test_data:
```
Run: `docker compose -f docker-compose.test.yml up --abort-on-container-exit`

Note: GPU-dependent tests (training) require `--gpus all` and are excluded from the default blackbox test suite. They run separately via `docker compose -f docker-compose.test.yml --profile gpu-tests up --abort-on-container-exit`.
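For reference, a profile-gated GPU test service might be declared as follows. The `training-tests` service name and its environment are hypothetical, and under Compose the device reservation plays the role that `--gpus all` plays for plain `docker run`:

```yaml
  training-tests:
    build:
      context: .
      dockerfile: docker/training.Dockerfile
    profiles: ["gpu-tests"]   # skipped unless --profile gpu-tests is passed
    environment:
      TEST_SCOPE: gpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```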
## Image Tagging Strategy

| Context | Tag Format | Example |
|---|---|---|
| CI build | `<registry>/azaion/<component>:<git-sha>` | `registry.example.com/azaion/training:a1b2c3d` |
| Release | `<registry>/azaion/<component>:<semver>` | `registry.example.com/azaion/training:1.0.0` |
| Local dev | `azaion-<component>:latest` | `azaion-training:latest` |
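A CI step implementing this scheme could look like the following shell sketch. The registry host matches the examples above; the build/retag flow is an assumption about how CI is wired up:

```sh
#!/bin/sh
set -eu

REGISTRY=registry.example.com
COMPONENT=training
GIT_SHA=$(git rev-parse --short HEAD)

# CI build: tag with the short commit SHA and push
docker build -f docker/training.Dockerfile \
  -t "${REGISTRY}/azaion/${COMPONENT}:${GIT_SHA}" .
docker push "${REGISTRY}/azaion/${COMPONENT}:${GIT_SHA}"

# Release: retag the already-built CI image with the semver and push
docker tag "${REGISTRY}/azaion/${COMPONENT}:${GIT_SHA}" \
  "${REGISTRY}/azaion/${COMPONENT}:1.0.0"
docker push "${REGISTRY}/azaion/${COMPONENT}:1.0.0"
```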
## .dockerignore

```
.git
.cursor
_docs
_standalone
tests
**/__pycache__
**/*.pyc
*.md
.env
.env.example
docker-compose*.yml
.gitignore
.editorconfig
requirements-test.txt
```