# Azaion AI Training — Containerization

## Component Dockerfiles

### Training Pipeline

| Property | Value |
|----------|-------|
| Base image | `nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04` |
| Build image | Same (devel required for TensorRT engine build + pycuda) |
| Stages | 1) system deps + Python 3.10 → 2) pip install requirements → 3) copy source |
| User | `azaion` (non-root, UID 1000) |
| Health check | Not applicable — batch job, exits on completion |
| Exposed ports | None |
| Key build args | `CUDA_VERSION=12.1.1` |

Single-stage build (devel image required at runtime for TensorRT engine compilation and pycuda). The image is large, but training runs for days on a dedicated GPU server — image size is not a deployment bottleneck.

Installs from `requirements.txt` with `--extra-index-url https://download.pytorch.org/whl/cu121` for PyTorch CUDA 12.1 wheels.

Volume mount: `/azaion/` host directory for datasets, models, and annotation data.

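The layering described in the table can be sketched as follows. This is a hypothetical reconstruction, not the repository's actual `docker/training.Dockerfile`: the base image, build arg, layer order, and user come from the table above, while the package list and the `src/train.py` entrypoint are assumptions.

```dockerfile
# Hypothetical sketch of docker/training.Dockerfile — entrypoint and package
# names are assumptions; base image, build arg, and user come from the table.
ARG CUDA_VERSION=12.1.1
FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu22.04

# 1) System deps + Python 3.10
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.10 python3-pip git \
    && rm -rf /var/lib/apt/lists/*

# 2) Python deps in their own layer, so they are cached unless
#    requirements.txt itself changes
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt \
        --extra-index-url https://download.pytorch.org/whl/cu121

# 3) Source last — code edits do not invalidate the dependency layer
COPY src/ ./src/

# Non-root user, UID 1000
RUN useradd --uid 1000 --create-home azaion
USER azaion

# Batch job: runs to completion, no EXPOSE, no HEALTHCHECK
CMD ["python3", "src/train.py"]
```

Copying the source after the dependency install is what keeps day-to-day rebuilds fast despite the large devel base image.
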
### Annotation Queue

| Property | Value |
|----------|-------|
| Base image | `python:3.10-slim` |
| Build image | `python:3.10-slim` (no compilation needed) |
| Stages | 1) pip install from `src/annotation-queue/requirements.txt` → 2) copy source |
| User | `azaion` (non-root, UID 1000) |
| Health check | `CMD python -c "import rstream" \|\| exit 1` (import check — verifies the Python environment, not consumer liveness; no HTTP endpoint) |
| Exposed ports | None |
| Key build args | None |

Lightweight container — only needs `pyyaml`, `msgpack`, and `rstream`. No GPU, no heavy ML libraries. Runs as a persistent async process consuming from RabbitMQ Streams.

Volume mount: `/azaion/` host directory for writing annotation images and labels.

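The table above maps to a short Dockerfile. A minimal sketch, assuming the consumer entrypoint is `consumer.py` (that filename is hypothetical; the base image, user, and health check come from the table):

```dockerfile
# Hypothetical sketch of docker/annotation-queue.Dockerfile — the entrypoint
# filename is an assumption; everything else comes from the table above.
FROM python:3.10-slim

WORKDIR /app
COPY src/annotation-queue/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/annotation-queue/ ./

RUN useradd --uid 1000 --create-home azaion
USER azaion

# Import check only — confirms the environment is intact, not that the
# consumer loop is alive
HEALTHCHECK --interval=30s --timeout=5s \
    CMD python -c "import rstream" || exit 1

CMD ["python", "consumer.py"]
```
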
### Not Containerized

The following are developer/verification tools, not production services:

- **Inference Engine** (`start_inference.py`) — used for testing and model verification, runs ad-hoc on a GPU machine
- **Data Tools** (`convert-annotations.py`, `dataset-visualiser.py`) — interactive developer utilities requiring a GUI environment

## Docker Compose — Local Development

```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: ${RABBITMQ_USER}
      RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD}
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 10s
      timeout: 5s
      retries: 5

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    env_file: .env
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    restart: unless-stopped

  training:
    build:
      context: .
      dockerfile: docker/training.Dockerfile
    env_file: .env
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ipc: host
    shm_size: "16g"

volumes:
  rabbitmq_data:

networks:
  default:
    name: azaion-training
```

Notes:

- `ipc: host` and `shm_size: "16g"` for PyTorch multi-worker data loading
- `annotation-queue` runs continuously, restarts on failure
- RabbitMQ Streams plugin must be enabled (port 5552); the management UI is on port 15672

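The stock `rabbitmq:3.13-management-alpine` image enables the management plugin but not Streams, so the Streams plugin has to be switched on explicitly. One way is to mount an `enabled_plugins` file into the broker container; the `docker/enabled_plugins` path below is an assumption about where such a file would live in this repo:

```yaml
# Hypothetical addition to the rabbitmq service: bake the plugin list in via
# RabbitMQ's enabled_plugins file. The host path docker/enabled_plugins is an
# assumption; its contents are the single Erlang term:
#   [rabbitmq_management,rabbitmq_stream].
services:
  rabbitmq:
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
      - ./docker/enabled_plugins:/etc/rabbitmq/enabled_plugins:ro
```

Alternatively, the plugin can be enabled once in a running broker with `docker compose exec rabbitmq rabbitmq-plugins enable rabbitmq_stream`, though the file-based approach survives container re-creation.
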
## Docker Compose — Blackbox Tests

```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
    environment:
      RABBITMQ_DEFAULT_USER: test_user
      RABBITMQ_DEFAULT_PASS: test_pass
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 5s
      timeout: 3s
      retries: 10

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      RABBITMQ_QUEUE_NAME: azaion-annotations
      AZAION_ROOT_DIR: /azaion
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - test_data:/azaion

  test-runner:
    build:
      context: .
      dockerfile: docker/test-runner.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      AZAION_ROOT_DIR: /azaion
      TEST_SCOPE: blackbox
    depends_on:
      rabbitmq:
        condition: service_healthy
      annotation-queue:
        condition: service_started
    volumes:
      - test_data:/azaion

volumes:
  test_data:
```

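The compose file references `docker/test-runner.Dockerfile`, which is not shown here. A minimal sketch, assuming a pytest-based suite laid out under `tests/` and selected by the `TEST_SCOPE` variable (both assumptions); note that `tests` is listed in the project's `.dockerignore`, so this image would need either a per-Dockerfile ignore file or a bind mount for the test sources:

```dockerfile
# Hypothetical sketch of docker/test-runner.Dockerfile — pytest usage, test
# layout, and TEST_SCOPE handling are assumptions.
FROM python:3.10-slim

WORKDIR /app
COPY requirements-test.txt .
RUN pip install --no-cache-dir -r requirements-test.txt

COPY src/ ./src/
COPY tests/ ./tests/

RUN useradd --uid 1000 --create-home azaion
USER azaion

# TEST_SCOPE selects the suite directory (blackbox by default, per the
# compose file above)
CMD ["sh", "-c", "pytest tests/${TEST_SCOPE:-blackbox} -v"]
```
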
Run: `docker compose -f docker-compose.test.yml up --abort-on-container-exit`

Note: GPU-dependent tests (training) require `--gpus all` and are excluded from the default blackbox test suite. They run separately via `docker compose -f docker-compose.test.yml --profile gpu-tests up --abort-on-container-exit`.

## Image Tagging Strategy

| Context | Tag Format | Example |
|---------|-----------|---------|
| CI build | `<registry>/azaion/<component>:<git-sha>` | `registry.example.com/azaion/training:a1b2c3d` |
| Release | `<registry>/azaion/<component>:<semver>` | `registry.example.com/azaion/training:1.0.0` |
| Local dev | `azaion-<component>:latest` | `azaion-training:latest` |

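The CI and release formats above can be assembled with a small helper; `make_tag` is a hypothetical function, and the registry value is the example from the table:

```shell
#!/bin/sh
# make_tag: compose an image reference from registry, component, and version
# (a git SHA for CI builds, a semver for releases). Hypothetical helper.
make_tag() {
    printf '%s/azaion/%s:%s\n' "$1" "$2" "$3"
}

# CI build tag (example SHA from the table)
make_tag registry.example.com training a1b2c3d
# → registry.example.com/azaion/training:a1b2c3d

# Release tag
make_tag registry.example.com training 1.0.0
# → registry.example.com/azaion/training:1.0.0
```

In CI the version argument would typically be `$(git rev-parse --short HEAD)`, followed by `docker build -t` and `docker push` with the resulting reference.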
## .dockerignore

```
.git
.cursor
_docs
_standalone
tests
**/__pycache__
**/*.pyc
*.md
.env
.env.example
docker-compose*.yml
.gitignore
.editorconfig
requirements-test.txt
```