# Azaion AI Training — Containerization

## Component Dockerfiles

### Training Pipeline

| Property | Value |
|----------|-------|
| Base image | `nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04` |
| Build image | Same (devel required for TensorRT engine build + pycuda) |
| Stages | 1) system deps + Python 3.10 → 2) pip install requirements → 3) copy source |
| User | `azaion` (non-root, UID 1000) |
| Health check | Not applicable (batch job, exits on completion) |
| Exposed ports | None |
| Key build args | `CUDA_VERSION=12.1.1` |

Single-stage build: the stages listed above are consecutive steps within one image, because the devel image is required at runtime for TensorRT engine compilation and pycuda. The image is large, but training runs for days on a dedicated GPU server, so image size is not a deployment bottleneck.

Dependencies are installed from `requirements.txt` with `--extra-index-url https://download.pytorch.org/whl/cu121`, which provides the PyTorch wheels built for CUDA 12.1.

Volume mount: `/azaion/` host directory for datasets, models, and annotation data.

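Since the container runs as the non-root `azaion` user (UID 1000), the mounted `/azaion/` directory has to be writable by that UID. A minimal preflight sketch that an entrypoint could run before training starts, so a missing or read-only mount is reported immediately; the helper name `check_data_root` is hypothetical and not part of the project:

```python
import tempfile
from pathlib import Path


def check_data_root(root: str = "/azaion") -> Path:
    """Return the data root as a Path, or raise if it is missing or not writable.

    Hypothetical helper: the name and behaviour are illustrative only.
    """
    path = Path(root)
    if not path.is_dir():
        raise FileNotFoundError(f"data root {path} is not a directory; is the volume mounted?")
    try:
        # Creating and removing a temporary file proves the current user
        # can actually write to the mount.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        raise PermissionError(f"data root {path} is not writable: {exc}") from exc
    return path
```

Calling `check_data_root()` with no arguments checks the default `/azaion` mount point used in this document.
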
### Annotation Queue

| Property | Value |
|----------|-------|
| Base image | `python:3.10-slim` |
| Build image | `python:3.10-slim` (no compilation needed) |
| Stages | 1) pip install from `src/annotation-queue/requirements.txt` → 2) copy source |
| User | `azaion` (non-root, UID 1000) |
| Health check | `CMD python -c "import rstream" \|\| exit 1` (confirms that `rstream` is importable; it does not probe the running consumer, and there is no HTTP endpoint) |
| Exposed ports | None |
| Key build args | None |

Lightweight container: it needs only `pyyaml`, `msgpack`, and `rstream`, with no GPU and no heavy ML libraries. It runs as a persistent async process consuming from RabbitMQ Streams.

Volume mount: `/azaion/` host directory for writing annotation images and labels.

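The `import rstream` check starts a fresh interpreter, so it cannot observe whether the consumer loop is still making progress. One common pattern for services without an HTTP endpoint is a heartbeat file: the consumer touches a file periodically and the health check verifies that the file is recent. The sketch below assumes that pattern; the path, the 60-second threshold, and all function names are hypothetical and not taken from the project.

```python
import time
from pathlib import Path

HEARTBEAT = Path("/tmp/annotation-queue.heartbeat")  # hypothetical location
MAX_AGE_SECONDS = 60.0  # hypothetical threshold


def beat(path: Path = HEARTBEAT) -> None:
    """Called by the consumer loop after each processed message or idle tick."""
    path.touch()


def is_alive(path: Path = HEARTBEAT, max_age: float = MAX_AGE_SECONDS) -> bool:
    """Return True if the heartbeat file exists and was touched recently."""
    try:
        age = time.time() - path.stat().st_mtime
    except FileNotFoundError:
        return False
    return age <= max_age


def main() -> int:
    """Exit code for a health check: 0 means healthy, 1 means unhealthy."""
    return 0 if is_alive() else 1
```

A health check command would then exit with the value returned by `main()`. The threshold should be longer than the longest expected pause between heartbeats.
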
### Not Containerized

The following are developer/verification tools, not production services:

- **Inference Engine** (`start_inference.py`): used for testing and model verification; runs ad hoc on a GPU machine
- **Data Tools** (`convert-annotations.py`, `dataset-visualiser.py`): interactive developer utilities that require a GUI environment

## Docker Compose — Local Development

```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: ${RABBITMQ_USER}
      RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD}
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 10s
      timeout: 5s
      retries: 5

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    env_file: .env
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    restart: unless-stopped

  training:
    build:
      context: .
      dockerfile: docker/training.Dockerfile
    env_file: .env
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ipc: host
    shm_size: "16g"

volumes:
  rabbitmq_data:

networks:
  default:
    name: azaion-training
```

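Notes:
- `ipc: host` and `shm_size: "16g"` support PyTorch multi-worker data loading, which passes batches between worker processes through shared memory. With `ipc: host` the container shares the host's `/dev/shm`, so `shm_size` is expected to have no effect while `ipc: host` is set.
- `annotation-queue` runs continuously; `restart: unless-stopped` restarts it after any exit unless it was stopped manually
- The RabbitMQ Streams plugin (`rabbitmq_stream`) must be enabled for port 5552 to accept connections, and the Compose file above does not enable it; the management UI is on port 15672
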
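The `service_healthy` condition relies on `rabbitmq-diagnostics check_running`, which checks that the RabbitMQ application is running rather than that the stream listener on port 5552 is accepting connections. A consumer can cover that gap by waiting for the port before it connects. A minimal sketch using only the standard library; `wait_for_port` and its default timings are hypothetical:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only proves something is listening;
            # it does not validate credentials or the stream protocol.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Inside the Compose network this would be called as `wait_for_port("rabbitmq", 5552)` before the stream client is created.
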
## Docker Compose — Blackbox Tests

```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
    environment:
      RABBITMQ_DEFAULT_USER: test_user
      RABBITMQ_DEFAULT_PASS: test_pass
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 5s
      timeout: 3s
      retries: 10

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      RABBITMQ_QUEUE_NAME: azaion-annotations
      AZAION_ROOT_DIR: /azaion
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - test_data:/azaion

  test-runner:
    build:
      context: .
      dockerfile: docker/test-runner.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      AZAION_ROOT_DIR: /azaion
      TEST_SCOPE: blackbox
    depends_on:
      rabbitmq:
        condition: service_healthy
      annotation-queue:
        condition: service_started
    volumes:
      - test_data:/azaion

volumes:
  test_data:
```

Run: `docker compose -f docker-compose.test.yml up --abort-on-container-exit`
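
Note: GPU-dependent tests (training) need GPU access (`--gpus all` with `docker run`, or a GPU device reservation in Compose) and are excluded from the default blackbox test suite. They run separately under the `gpu-tests` profile, whose services are not shown in the file above: `docker compose -f docker-compose.test.yml --profile gpu-tests up --abort-on-container-exit`.
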
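The test Compose file configures `annotation-queue` entirely through environment variables. The sketch below shows one way a service might read and validate them at startup, so that a missing value is reported at once rather than at the first connection attempt. Only the variable names come from the Compose file; the `QueueSettings` class and `load_settings` function are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = (
    "RABBITMQ_HOST",
    "RABBITMQ_PORT",
    "RABBITMQ_USER",
    "RABBITMQ_PASSWORD",
    "RABBITMQ_QUEUE_NAME",
    "AZAION_ROOT_DIR",
)


@dataclass(frozen=True)
class QueueSettings:
    host: str
    port: int
    user: str
    password: str
    queue_name: str
    root_dir: str


def load_settings(env: Mapping[str, str] = os.environ) -> QueueSettings:
    """Build settings from the environment, listing every missing variable at once."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return QueueSettings(
        host=env["RABBITMQ_HOST"],
        port=int(env["RABBITMQ_PORT"]),  # Compose passes "5552" as a string
        user=env["RABBITMQ_USER"],
        password=env["RABBITMQ_PASSWORD"],
        queue_name=env["RABBITMQ_QUEUE_NAME"],
        root_dir=env["AZAION_ROOT_DIR"],
    )
```

Passing the mapping as a parameter keeps the function easy to exercise in tests without touching the real process environment.
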
## Image Tagging Strategy

| Context | Tag Format | Example |
|---------|------------|---------|
| CI build | `<registry>/azaion/<component>:<git-sha>` | `registry.example.com/azaion/training:a1b2c3d` |
| Release | `<registry>/azaion/<component>:<semver>` | `registry.example.com/azaion/training:1.0.0` |
| Local dev | `azaion-<component>:latest` | `azaion-training:latest` |

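The three formats in the table can be produced by one small helper in a CI script. A sketch, assuming a 7-character short SHA as in the table's example row; `image_tag` and its parameters are hypothetical:

```python
from typing import Optional


def image_tag(component: str, context: str, registry: Optional[str] = None,
              version: Optional[str] = None) -> str:
    """Return an image reference following the tagging table.

    context: "ci" (version is a git SHA), "release" (version is a semver), or "local".
    """
    if context == "local":
        return f"azaion-{component}:latest"
    if context not in ("ci", "release"):
        raise ValueError(f"unknown context: {context!r}")
    if not registry or not version:
        raise ValueError("ci and release tags need both a registry and a version")
    if context == "ci":
        version = version[:7]  # assumption: short SHA, matching the table's example
    return f"{registry}/azaion/{component}:{version}"
```

For example, `image_tag("training", "release", "registry.example.com", "1.0.0")` returns `registry.example.com/azaion/training:1.0.0`, and `image_tag("training", "local")` returns `azaion-training:latest`.
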
## .dockerignore

```
.git
.cursor
_docs
_standalone
tests
**/__pycache__
**/*.pyc
*.md
.env
.env.example
docker-compose*.yml
.gitignore
.editorconfig
requirements-test.txt
```