# Azaion AI Training — Containerization
## Component Dockerfiles
### Training Pipeline
| Property | Value |
|----------|-------|
| Base image | `nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04` |
| Build image | Same (devel required for TensorRT engine build + pycuda) |
| Build steps | 1) system deps + Python 3.10 → 2) pip install requirements → 3) copy source |
| User | `azaion` (non-root, UID 1000) |
| Health check | Not applicable — batch job, exits on completion |
| Exposed ports | None |
| Key build args | `CUDA_VERSION=12.1.1` |
Single-stage build: the devel image is required at runtime for TensorRT engine compilation and pycuda. The image is large, but training runs for days on a dedicated GPU server, so image size is not a deployment bottleneck.
Installs from `requirements.txt` with `--extra-index-url https://download.pytorch.org/whl/cu121` for PyTorch CUDA 12.1 wheels.
Volume mount: `/azaion/` host directory for datasets, models, and annotation data.
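The layering above can be sketched as a Dockerfile. This is a minimal illustration of the three build steps from the table, not the project's actual file — the source layout, apt package list, and entrypoint are assumptions:

```dockerfile
# Sketch of docker/training.Dockerfile — paths and package names are assumptions
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

# 1) System deps + Python 3.10 (the default python3 on Ubuntu 22.04)
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.10 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Non-root user, UID 1000
RUN useradd --uid 1000 --create-home azaion

# 2) Install Python deps first so this layer is cached while requirements.txt
#    is unchanged; CUDA 12.1 PyTorch wheels come from the extra index
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt \
        --extra-index-url https://download.pytorch.org/whl/cu121

# 3) Copy source last so code edits don't invalidate the dependency layer
COPY . .
USER azaion
```

Ordering dependencies before source is what makes iterating on training code cheap despite the multi-gigabyte base image.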
### Annotation Queue
| Property | Value |
|----------|-------|
| Base image | `python:3.10-slim` |
| Build image | `python:3.10-slim` (no compilation needed) |
| Stages | 1) pip install from `src/annotation-queue/requirements.txt` → 2) copy source |
| User | `azaion` (non-root, UID 1000) |
| Health check | `CMD python -c "import rstream" \|\| exit 1` (process liveness; no HTTP endpoint) |
| Exposed ports | None |
| Key build args | None |
Lightweight container — only needs `pyyaml`, `msgpack`, `rstream`. No GPU, no heavy ML libraries. Runs as a persistent async process consuming from RabbitMQ Streams.
Volume mount: `/azaion/` host directory for writing annotation images and labels.
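A sketch of the corresponding Dockerfile, again matching the table rather than the real file — the source path layout and the `main.py` entrypoint are assumptions:

```dockerfile
# Sketch of docker/annotation-queue.Dockerfile — paths and entrypoint are assumptions
FROM python:3.10-slim

# Non-root user, UID 1000
RUN useradd --uid 1000 --create-home azaion

WORKDIR /app

# 1) Dependencies first for layer caching
COPY src/annotation-queue/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# 2) Copy source
COPY src/annotation-queue/ .
USER azaion

# Process-liveness check from the table: no HTTP endpoint, so just verify
# the interpreter and the rstream client library are importable
HEALTHCHECK CMD python -c "import rstream" || exit 1

CMD ["python", "main.py"]
```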
### Not Containerized
The following are developer/verification tools, not production services:
- **Inference Engine** (`start_inference.py`) — used for testing and model verification, runs ad-hoc on a GPU machine
- **Data Tools** (`convert-annotations.py`, `dataset-visualiser.py`) — interactive developer utilities requiring GUI environment
## Docker Compose — Local Development
```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: ${RABBITMQ_USER}
      RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD}
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 10s
      timeout: 5s
      retries: 5

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    env_file: .env
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    restart: unless-stopped

  training:
    build:
      context: .
      dockerfile: docker/training.Dockerfile
    env_file: .env
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ipc: host
    shm_size: "16g"

volumes:
  rabbitmq_data:

networks:
  default:
    name: azaion-training
```
Notes:
- `ipc: host` and `shm_size: "16g"` for PyTorch multi-worker data loading
- `annotation-queue` runs continuously, restarts on failure
- RabbitMQ Streams plugin must be enabled (port 5552); the management UI is on port 15672
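The `management-alpine` image does not enable the stream plugin by default. One common way to enable it (an assumption about this project's setup, not a documented choice here) is to mount an `enabled_plugins` file into the broker at `/etc/rabbitmq/enabled_plugins`:

```
% Contents of the mounted enabled_plugins file.
% Erlang term syntax: the trailing dot is required.
[rabbitmq_management,rabbitmq_stream].
```

Alternatively, `rabbitmq-plugins enable rabbitmq_stream` can be run inside the container, but the file-mount approach survives container recreation.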
## Docker Compose — Blackbox Tests
```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
    environment:
      RABBITMQ_DEFAULT_USER: test_user
      RABBITMQ_DEFAULT_PASS: test_pass
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 5s
      timeout: 3s
      retries: 10

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      RABBITMQ_QUEUE_NAME: azaion-annotations
      AZAION_ROOT_DIR: /azaion
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - test_data:/azaion

  test-runner:
    build:
      context: .
      dockerfile: docker/test-runner.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      AZAION_ROOT_DIR: /azaion
      TEST_SCOPE: blackbox
    depends_on:
      rabbitmq:
        condition: service_healthy
      annotation-queue:
        condition: service_started
    volumes:
      - test_data:/azaion

volumes:
  test_data:
```
Run: `docker compose -f docker-compose.test.yml up --abort-on-container-exit`
Note: GPU-dependent tests (training) require `--gpus all` and are excluded from the default blackbox test suite. They run separately via `docker compose -f docker-compose.test.yml --profile gpu-tests up --abort-on-container-exit`.
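How the `gpu-tests` profile might be wired into the same compose file — a sketch only; the service name and `TEST_SCOPE` value are assumptions:

```yaml
# Sketch: a GPU test service gated behind the "gpu-tests" profile, so it is
# skipped by a plain `docker compose up` and only started with --profile gpu-tests
services:
  gpu-test-runner:
    build:
      context: .
      dockerfile: docker/test-runner.Dockerfile
    profiles: ["gpu-tests"]
    environment:
      TEST_SCOPE: gpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```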
## Image Tagging Strategy
| Context | Tag Format | Example |
|---------|-----------|---------|
| CI build | `<registry>/azaion/<component>:<git-sha>` | `registry.example.com/azaion/training:a1b2c3d` |
| Release | `<registry>/azaion/<component>:<semver>` | `registry.example.com/azaion/training:1.0.0` |
| Local dev | `azaion-<component>:latest` | `azaion-training:latest` |
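A CI job would assemble the tag from its three parts. A minimal sketch using the values from the table's example row (the build/push commands are illustrative, not a documented pipeline):

```shell
#!/bin/sh
# Assemble the CI image tag from registry, component, and commit SHA
REGISTRY="registry.example.com"
COMPONENT="training"
GIT_SHA="a1b2c3d"   # in CI this would come from: git rev-parse --short HEAD

CI_TAG="${REGISTRY}/azaion/${COMPONENT}:${GIT_SHA}"
echo "$CI_TAG"      # prints: registry.example.com/azaion/training:a1b2c3d

# The build/push step would then be:
#   docker build -f "docker/${COMPONENT}.Dockerfile" -t "$CI_TAG" .
#   docker push "$CI_TAG"
```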
## .dockerignore
```
.git
.cursor
_docs
_standalone
tests
**/__pycache__
**/*.pyc
*.md
.env
.env.example
docker-compose*.yml
.gitignore
.editorconfig
requirements-test.txt
```