# Azaion AI Training — Containerization

## Component Dockerfiles

### Training Pipeline

| Property | Value |
|----------|-------|
| Base image | `nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04` |
| Build image | Same (devel required for TensorRT engine build + pycuda) |
| Stages | 1) system deps + Python 3.10 → 2) pip install requirements → 3) copy source |
| User | `azaion` (non-root, UID 1000) |
| Health check | Not applicable — batch job, exits on completion |
| Exposed ports | None |
| Key build args | `CUDA_VERSION=12.1.1` |

Single-stage build (devel image required at runtime for TensorRT engine compilation and pycuda). The image is large, but training runs for days on a dedicated GPU server — image size is not a deployment bottleneck.

Installs from `requirements.txt` with `--extra-index-url https://download.pytorch.org/whl/cu121` for PyTorch CUDA 12.1 wheels.

Volume mount: `/azaion/` host directory for datasets, models, and annotation data.

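The layering described in the table can be sketched as follows. This is a hypothetical reconstruction, not the repository's actual `docker/training.Dockerfile`: the base image, build arg, layer order, and user come from the table above, while the package list and the `src/train.py` entrypoint are assumptions.

```dockerfile
# Hypothetical sketch of docker/training.Dockerfile — entrypoint and package
# names are assumptions; base image, build arg, and user come from the table.
ARG CUDA_VERSION=12.1.1
FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu22.04

# 1) System deps + Python 3.10
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.10 python3-pip git \
    && rm -rf /var/lib/apt/lists/*

# 2) Python deps in their own layer, so they are cached unless
#    requirements.txt itself changes
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt \
        --extra-index-url https://download.pytorch.org/whl/cu121

# 3) Source last — code edits do not invalidate the dependency layer
COPY src/ ./src/

# Non-root user, UID 1000
RUN useradd --uid 1000 --create-home azaion
USER azaion

# Batch job: runs to completion, no EXPOSE, no HEALTHCHECK
CMD ["python3", "src/train.py"]
```

Copying the source after the dependency install is what keeps day-to-day rebuilds fast despite the large devel base image.
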
### Annotation Queue

| Property | Value |
|----------|-------|
| Base image | `python:3.10-slim` |
| Build image | `python:3.10-slim` (no compilation needed) |
| Stages | 1) pip install from `src/annotation-queue/requirements.txt` → 2) copy source |
| User | `azaion` (non-root, UID 1000) |
| Health check | `CMD python -c "import rstream" \|\| exit 1` (import check — verifies the Python environment, not consumer liveness; no HTTP endpoint) |
| Exposed ports | None |
| Key build args | None |

Lightweight container — only needs `pyyaml`, `msgpack`, and `rstream`. No GPU, no heavy ML libraries. Runs as a persistent async process consuming from RabbitMQ Streams.

Volume mount: `/azaion/` host directory for writing annotation images and labels.

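The table above maps to a short Dockerfile. A minimal sketch, assuming the consumer entrypoint is `consumer.py` (that filename is hypothetical; the base image, user, and health check come from the table):

```dockerfile
# Hypothetical sketch of docker/annotation-queue.Dockerfile — the entrypoint
# filename is an assumption; everything else comes from the table above.
FROM python:3.10-slim

WORKDIR /app
COPY src/annotation-queue/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/annotation-queue/ ./

RUN useradd --uid 1000 --create-home azaion
USER azaion

# Import check only — confirms the environment is intact, not that the
# consumer loop is alive
HEALTHCHECK --interval=30s --timeout=5s \
    CMD python -c "import rstream" || exit 1

CMD ["python", "consumer.py"]
```
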
### Not Containerized

The following are developer/verification tools, not production services:

- **Inference Engine** (`start_inference.py`) — used for testing and model verification, runs ad-hoc on a GPU machine
- **Data Tools** (`convert-annotations.py`, `dataset-visualiser.py`) — interactive developer utilities requiring a GUI environment

## Docker Compose — Local Development

```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: ${RABBITMQ_USER}
      RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD}
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 10s
      timeout: 5s
      retries: 5

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    env_file: .env
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    restart: unless-stopped

  training:
    build:
      context: .
      dockerfile: docker/training.Dockerfile
    env_file: .env
    volumes:
      - ${AZAION_ROOT_DIR:-/azaion}:/azaion
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ipc: host
    shm_size: "16g"

volumes:
  rabbitmq_data:

networks:
  default:
    name: azaion-training
```

Notes:

- `ipc: host` and `shm_size: "16g"` for PyTorch multi-worker data loading
- `annotation-queue` runs continuously, restarts on failure
- RabbitMQ Streams plugin must be enabled (port 5552); the management UI is on port 15672

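The stock `rabbitmq:3.13-management-alpine` image enables the management plugin but not Streams, so the Streams plugin has to be switched on explicitly. One way is to mount an `enabled_plugins` file into the broker container; the `docker/enabled_plugins` path below is an assumption about where such a file would live in this repo:

```yaml
# Hypothetical addition to the rabbitmq service: bake the plugin list in via
# RabbitMQ's enabled_plugins file. The host path docker/enabled_plugins is an
# assumption; its contents are the single Erlang term:
#   [rabbitmq_management,rabbitmq_stream].
services:
  rabbitmq:
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
      - ./docker/enabled_plugins:/etc/rabbitmq/enabled_plugins:ro
```

Alternatively, the plugin can be enabled once in a running broker with `docker compose exec rabbitmq rabbitmq-plugins enable rabbitmq_stream`, though the file-based approach survives container re-creation.
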
## Docker Compose — Blackbox Tests

```yaml
services:
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "5552:5552"
      - "5672:5672"
    environment:
      RABBITMQ_DEFAULT_USER: test_user
      RABBITMQ_DEFAULT_PASS: test_pass
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 5s
      timeout: 3s
      retries: 10

  annotation-queue:
    build:
      context: .
      dockerfile: docker/annotation-queue.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      RABBITMQ_QUEUE_NAME: azaion-annotations
      AZAION_ROOT_DIR: /azaion
    depends_on:
      rabbitmq:
        condition: service_healthy
    volumes:
      - test_data:/azaion

  test-runner:
    build:
      context: .
      dockerfile: docker/test-runner.Dockerfile
    environment:
      RABBITMQ_HOST: rabbitmq
      RABBITMQ_PORT: "5552"
      RABBITMQ_USER: test_user
      RABBITMQ_PASSWORD: test_pass
      AZAION_ROOT_DIR: /azaion
      TEST_SCOPE: blackbox
    depends_on:
      rabbitmq:
        condition: service_healthy
      annotation-queue:
        condition: service_started
    volumes:
      - test_data:/azaion

volumes:
  test_data:
```

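The compose file references `docker/test-runner.Dockerfile`, which is not shown here. A minimal sketch, assuming a pytest-based suite laid out under `tests/` and selected by the `TEST_SCOPE` variable (both assumptions); note that `tests` is listed in the project's `.dockerignore`, so this image would need either a per-Dockerfile ignore file or a bind mount for the test sources:

```dockerfile
# Hypothetical sketch of docker/test-runner.Dockerfile — pytest usage, test
# layout, and TEST_SCOPE handling are assumptions.
FROM python:3.10-slim

WORKDIR /app
COPY requirements-test.txt .
RUN pip install --no-cache-dir -r requirements-test.txt

COPY src/ ./src/
COPY tests/ ./tests/

RUN useradd --uid 1000 --create-home azaion
USER azaion

# TEST_SCOPE selects the suite directory (blackbox by default, per the
# compose file above)
CMD ["sh", "-c", "pytest tests/${TEST_SCOPE:-blackbox} -v"]
```
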
Run: `docker compose -f docker-compose.test.yml up --abort-on-container-exit`

Note: GPU-dependent tests (training) require `--gpus all` and are excluded from the default blackbox test suite. They run separately via `docker compose -f docker-compose.test.yml --profile gpu-tests up --abort-on-container-exit`.

## Image Tagging Strategy

| Context | Tag Format | Example |
|---------|-----------|---------|
| CI build | `<registry>/azaion/<component>:<git-sha>` | `registry.example.com/azaion/training:a1b2c3d` |
| Release | `<registry>/azaion/<component>:<semver>` | `registry.example.com/azaion/training:1.0.0` |
| Local dev | `azaion-<component>:latest` | `azaion-training:latest` |

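The CI and release formats above can be assembled with a small helper; `make_tag` is a hypothetical function, and the registry value is the example from the table:

```shell
#!/bin/sh
# make_tag: compose an image reference from registry, component, and version
# (a git SHA for CI builds, a semver for releases). Hypothetical helper.
make_tag() {
    printf '%s/azaion/%s:%s\n' "$1" "$2" "$3"
}

# CI build tag (example SHA from the table)
make_tag registry.example.com training a1b2c3d
# → registry.example.com/azaion/training:a1b2c3d

# Release tag
make_tag registry.example.com training 1.0.0
# → registry.example.com/azaion/training:1.0.0
```

In CI the version argument would typically be `$(git rev-parse --short HEAD)`, followed by `docker build -t` and `docker push` with the resulting reference.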
## .dockerignore

```
.git
.cursor
_docs
_standalone
tests
**/__pycache__
**/*.pyc
*.md
.env
.env.example
docker-compose*.yml
.gitignore
.editorconfig
requirements-test.txt
```