# Azaion AI Training — Containerization

## Component Dockerfiles

### Training Pipeline

| Property | Value |
| --- | --- |
| Base image | `nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04` |
| Build image | Same (`devel` required for TensorRT engine build + pycuda) |
| Stages | 1) system deps + Python 3.10 → 2) pip install requirements → 3) copy source |
| User | `azaion` (non-root, UID 1000) |
| Health check | Not applicable — batch job, exits on completion |
| Exposed ports | None |
| Key build args | `CUDA_VERSION=12.1.1` |

Single-stage build: the `devel` image is required at runtime for TensorRT engine compilation and pycuda, so there is no slim runtime stage. The image is large, but training runs for days on a dedicated GPU server — image size is not a deployment bottleneck.

Installs from `requirements.txt` with `--extra-index-url https://download.pytorch.org/whl/cu121` for PyTorch CUDA 12.1 wheels.

Volume mount: `/azaion/` host directory for datasets, models, and annotation data.
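A minimal sketch of the training Dockerfile described above — the `requirements.txt` location and the `train.py` entrypoint are assumptions, not taken from the repo:

```dockerfile
# Sketch only — paths and entrypoint are assumptions.
ARG CUDA_VERSION=12.1.1
FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu22.04

# 1) System deps + Python 3.10
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.10 python3-pip python3.10-dev git \
    && rm -rf /var/lib/apt/lists/*

RUN useradd --uid 1000 --create-home azaion
WORKDIR /app

# 2) Python deps — CUDA 12.1 PyTorch wheels via the extra index
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt \
        --extra-index-url https://download.pytorch.org/whl/cu121

# 3) Source last, so code changes don't invalidate the dependency layer
COPY --chown=azaion:azaion . .
USER azaion
CMD ["python3", "train.py"]
```

Ordering dependencies before source keeps the expensive pip layer cached across code-only rebuilds.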

### Annotation Queue

| Property | Value |
| --- | --- |
| Base image | `python:3.10-slim` |
| Build image | `python:3.10-slim` (no compilation needed) |
| Stages | 1) pip install from `src/annotation-queue/requirements.txt` → 2) copy source |
| User | `azaion` (non-root, UID 1000) |
| Health check | `CMD python -c "import rstream" \|\| exit 1` (process liveness; no HTTP endpoint) |
| Exposed ports | None |
| Key build args | None |

Lightweight container — only needs `pyyaml`, `msgpack`, `rstream`. No GPU, no heavy ML libraries. Runs as a persistent async process consuming from RabbitMQ Streams.

Volume mount: `/azaion/` host directory for writing annotation images and labels.
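A sketch of the corresponding Dockerfile — the source path follows the table above; the `main.py` entrypoint is an assumption:

```dockerfile
# Sketch only — entrypoint name is an assumption.
FROM python:3.10-slim

RUN useradd --uid 1000 --create-home azaion
WORKDIR /app

# 1) Dependencies (pyyaml, msgpack, rstream)
COPY src/annotation-queue/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# 2) Source
COPY --chown=azaion:azaion src/annotation-queue/ .
USER azaion

# Process-liveness probe — there is no HTTP endpoint to hit
HEALTHCHECK --interval=30s --timeout=5s \
    CMD python -c "import rstream" || exit 1

CMD ["python", "main.py"]
```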

## Not Containerized

The following are developer/verification tools, not production services:

- Inference Engine (`start_inference.py`) — used for testing and model verification, runs ad-hoc on a GPU machine
- Data Tools (`convert-annotations.py`, `dataset-visualiser.py`) — interactive developer utilities requiring a GUI environment

## Docker Compose — Local Development

```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: ${RABBITMQ_USER}
      RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD}
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 10s
      timeout: 5s
      retries: 5

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    env_file: .env
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    restart: unless-stopped

  training:
    build:
      context: .
      dockerfile: docker/training.Dockerfile
    env_file: .env
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ipc: host
    shm_size: "16g"

volumes:
  rabbitmq_data:

networks:
  default:
    name: azaion-training
```

Notes:

- `ipc: host` and `shm_size: "16g"` for PyTorch multi-worker data loading
- `annotation-queue` runs continuously, restarts on failure
- RabbitMQ Streams plugin must be enabled (port 5552); the management UI is on port 15672
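The stock `rabbitmq:3.13-management-alpine` image does not enable the Streams plugin out of the box. One way to enable it declaratively — a sketch, assuming an `enabled_plugins` file checked in next to the compose file — is a compose override mounting the standard plugins file:

```yaml
# Sketch: enable the Streams plugin declaratively.
# ./enabled_plugins contains exactly:  [rabbitmq_stream,rabbitmq_management].
services:
  rabbitmq:
    volumes:
      - ./enabled_plugins:/etc/rabbitmq/enabled_plugins:ro
```

Alternatively, `rabbitmq-plugins enable rabbitmq_stream` can be run once inside the running container, but the file-based approach survives container recreation.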

## Docker Compose — Blackbox Tests

```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
    environment:
      RABBITMQ_DEFAULT_USER: test_user
      RABBITMQ_DEFAULT_PASS: test_pass
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 5s
      timeout: 3s
      retries: 10

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      RABBITMQ_QUEUE_NAME: azaion-annotations
      AZAION_ROOT_DIR: /azaion
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - test_data:/azaion

  test-runner:
    build:
      context: .
      dockerfile: docker/test-runner.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      AZAION_ROOT_DIR: /azaion
      TEST_SCOPE: blackbox
    depends_on:
      rabbitmq:
        condition: service_healthy
      annotation-queue:
        condition: service_started
    volumes:
      - test_data:/azaion

volumes:
  test_data:
```

Run: `docker compose -f docker-compose.test.yml up --abort-on-container-exit`

Note: GPU-dependent tests (training) require `--gpus all` and are excluded from the default blackbox test suite. They run separately via `docker compose -f docker-compose.test.yml --profile gpu-tests up --abort-on-container-exit`.
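For the `gpu-tests` profile to do anything, the test compose file needs a service carrying that profile; a minimal sketch (the service name and `TEST_SCOPE: gpu` value are assumptions) looks like:

```yaml
# Sketch: a GPU test service that only starts with --profile gpu-tests
services:
  gpu-test-runner:
    build:
      context: .
      dockerfile: docker/test-runner.Dockerfile
    profiles: [gpu-tests]
    environment:
      TEST_SCOPE: gpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

Services without a `profiles` key start as usual, so the default blackbox run is unaffected.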

## Image Tagging Strategy

| Context | Tag Format | Example |
| --- | --- | --- |
| CI build | `<registry>/azaion/<component>:<git-sha>` | `registry.example.com/azaion/training:a1b2c3d` |
| Release | `<registry>/azaion/<component>:<semver>` | `registry.example.com/azaion/training:1.0.0` |
| Local dev | `azaion-<component>:latest` | `azaion-training:latest` |
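The CI tag format above can be assembled in a build script like this — a sketch, where `REGISTRY`, `COMPONENT`, and the hard-coded `GIT_SHA` are placeholders (a real pipeline would use `git rev-parse --short HEAD`):

```shell
# Sketch: derive the CI image tag per the table above.
REGISTRY="registry.example.com"
COMPONENT="training"
GIT_SHA="a1b2c3d"   # placeholder; normally $(git rev-parse --short HEAD)

CI_TAG="${REGISTRY}/azaion/${COMPONENT}:${GIT_SHA}"
echo "$CI_TAG"

# In CI this tag feeds the build/push step, e.g.:
# docker build -t "$CI_TAG" -f docker/training.Dockerfile .
```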

## .dockerignore

```
.git
.cursor
_docs
_standalone
tests
**/__pycache__
**/*.pyc
*.md
.env
.env.example
docker-compose*.yml
.gitignore
.editorconfig
requirements-test.txt
```