ai-training/_docs/04_deploy/environment_strategy.md
Oleksandr Bezdieniezhnykh aeb7f8ca8c Update autopilot workflow and documentation for project cycle completion
- Modified the existing-code workflow to automatically loop back to New Task after project completion without user confirmation.
- Updated the autopilot state to reflect the current step as `done` and status as `completed`.
- Clarified the deployment status report by specifying non-deployed services and their purposes.

These changes enhance the automation of task management and improve documentation clarity.
2026-03-29 05:02:22 +03:00

Azaion AI Training — Environment Strategy

Environments

| Environment | Purpose | Infrastructure | Data Source |
|---|---|---|---|
| Development | Local developer workflow | docker-compose, local RabbitMQ | Test annotations, small sample dataset |
| Production | Live training on GPU server | Direct host processes or Docker, real RabbitMQ | Real annotations from Azaion platform |

There is no staging environment: the system is an ML training pipeline on a dedicated GPU server, not a multi-tier web service. Validation relies on the CI test suite (CPU tests) and on manual verification on the GPU server before committing to a long training run.

Environment Variables

Required Variables

| Variable | Purpose | Dev Default | Prod Source |
|---|---|---|---|
| AZAION_API_URL | Azaion REST API base URL | https://api.azaion.com | .env on server |
| AZAION_API_EMAIL | API login email | dev account | .env on server |
| AZAION_API_PASSWORD | API login password | dev password | .env on server |
| RABBITMQ_HOST | RabbitMQ host | 127.0.0.1 (local container) | .env on server |
| RABBITMQ_PORT | RabbitMQ Streams port | 5552 | .env on server |
| RABBITMQ_USER | RabbitMQ username | azaion_receiver | .env on server |
| RABBITMQ_PASSWORD | RabbitMQ password | changeme | .env on server |
| RABBITMQ_QUEUE_NAME | Queue name | azaion-annotations | .env on server |
| AZAION_ROOT_DIR | Root data directory | /azaion | .env on server |
| AZAION_DATA_DIR | Validated annotations dir name | data | .env on server |
| AZAION_DATA_SEED_DIR | Unvalidated annotations dir name | data-seed | .env on server |
| AZAION_DATA_DELETED_DIR | Deleted annotations dir name | data_deleted | .env on server |
| TRAINING_MODEL | Base model filename | yolo26m.pt | .env on server |
| TRAINING_EPOCHS | Training epochs | 120 | .env on server |
| TRAINING_BATCH_SIZE | Training batch size | 11 | .env on server |
| TRAINING_IMGSZ | Training image size | 1280 | .env on server |
| TRAINING_SAVE_PERIOD | Checkpoint save interval | 1 | .env on server |
| TRAINING_WORKERS | Dataloader workers | 24 | .env on server |
| EXPORT_ONNX_IMGSZ | ONNX export image size | 1280 | .env on server |

.env.example

.env.example is committed to version control with placeholder values only. See .env.example in the project root (created in Step 1).

Variable Validation

The config.yaml generation script (part of deploy scripts) validates that all required environment variables are set before writing the config file. Missing variables cause an immediate failure with a clear error listing which variables are absent.
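
The fail-fast check described above could look roughly like the following. This is a minimal sketch, not the project's actual deploy script: the function names `find_missing` and `validate_env` are hypothetical, and treating an empty value as missing is an assumption.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Required variables, taken from the table in this document.
REQUIRED_VARS = (
    "AZAION_API_URL",
    "AZAION_API_EMAIL",
    "AZAION_API_PASSWORD",
    "RABBITMQ_HOST",
    "RABBITMQ_PORT",
    "RABBITMQ_USER",
    "RABBITMQ_PASSWORD",
    "RABBITMQ_QUEUE_NAME",
    "AZAION_ROOT_DIR",
    "AZAION_DATA_DIR",
    "AZAION_DATA_SEED_DIR",
    "AZAION_DATA_DELETED_DIR",
    "TRAINING_MODEL",
    "TRAINING_EPOCHS",
    "TRAINING_BATCH_SIZE",
    "TRAINING_IMGSZ",
    "TRAINING_SAVE_PERIOD",
    "TRAINING_WORKERS",
    "EXPORT_ONNX_IMGSZ",
)


def find_missing(required: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the required names that are unset or empty, in table order."""
    return [name for name in required if not env.get(name)]


def validate_env(env: Optional[Mapping[str, str]] = None) -> None:
    """Exit immediately, listing every absent variable (hypothetical helper)."""
    env = os.environ if env is None else env
    missing = find_missing(REQUIRED_VARS, env)
    if missing:
        raise SystemExit(
            "Missing required environment variables: " + ", ".join(missing)
        )
```

Collecting all missing names before failing, rather than stopping at the first one, matches the stated goal of a single clear error listing which variables are absent.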

Config Generation

The codebase reads configuration from config.yaml, not directly from environment variables. The deployment flow generates config.yaml from environment variables at deploy time:

  1. .env contains all variable values (never committed)
  2. Deploy script sources .env and renders config.yaml from a template
  3. config.yaml is placed at the expected location for the application

This preserves the existing code's config reading pattern while externalizing secrets to environment variables.
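
The three steps above can be sketched as follows. This is an illustration under stated assumptions: the helper names (`parse_env_text`, `render_config`, `generate_config`), the use of `${VAR}` placeholders via Python's `string.Template`, and the template layout in the usage note are all hypothetical; the actual deploy script and template format are not described in this document.

```python
from pathlib import Path
from string import Template
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def render_config(template_text: str, values: Dict[str, str]) -> str:
    """Fill ${VAR} placeholders; raises KeyError if a placeholder has no value."""
    return Template(template_text).substitute(values)


def generate_config(env_path: str, template_path: str, output_path: str) -> None:
    """Read .env, render the template, and write config.yaml (hypothetical flow)."""
    values = parse_env_text(Path(env_path).read_text(encoding="utf-8"))
    template_text = Path(template_path).read_text(encoding="utf-8")
    Path(output_path).write_text(render_config(template_text, values), encoding="utf-8")
```

For example, a hypothetical template line such as `host: ${RABBITMQ_HOST}` would render to `host: 127.0.0.1` with the development default. Values containing YAML-special characters (for instance passwords with `:` or `#`) would need quoting in the template.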

Secrets Management

| Environment | Method | Location |
|---|---|---|
| Development | .env file (git-ignored) | Project root |
| Production | .env file (restricted permissions) | GPU server /opt/azaion-training/.env |

Production .env file:

  • Ownership: root:deploy (deploy user's group)
  • Permissions: 640 (owner read/write, group read, others none)
  • Located outside the Docker build context
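
The permission requirement above can be verified before a deploy. The following is a small sketch using only the standard library; the function name `env_file_mode_ok` is hypothetical, and it checks the mode bits only (the root:deploy ownership would need a separate check).

```python
import os
import stat


def env_file_mode_ok(path: str, expected: int = 0o640) -> bool:
    """Return True if the file's permission bits equal the expected mode."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == expected
```

A deploy script could call `env_file_mode_ok("/opt/azaion-training/.env")` and abort when it returns False.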

Secrets in this project:

  • AZAION_API_PASSWORD — API authentication
  • RABBITMQ_PASSWORD — message queue access
  • CDN credentials — auto-provisioned via API at runtime (encrypted cdn.yaml), not in .env
  • Model encryption key — hardcoded in security.py (existing pattern, flagged as security concern)

Rotation policy: rotate API and RabbitMQ passwords quarterly. Update .env on the server, restart affected services.

Filesystem Management

| Environment | /azaion/ Location | Contents |
|---|---|---|
| Development | Docker volume or local dir | Test images, small sample labels |
| Production | Host directory /azaion/ | Full annotation dataset, trained models, export artifacts |

The /azaion/ directory tree must exist before services start:

/azaion/
├── data/           (validated annotations: images/ + labels/)
├── data-seed/      (unvalidated annotations: images/ + labels/)
├── data_deleted/   (soft-deleted annotations: images/ + labels/)
├── datasets/       (formed training datasets: azaion-YYYY-MM-DD/)
├── models/         (trained models: azaion-YYYY-MM-DD/, azaion.pt)
└── classes.json    (annotation class definitions)

Production data is persistent and never deleted by deployment. Docker containers mount this directory as a bind mount.
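
A pre-start check for the directory tree above might look like this. It is a sketch, not part of the existing codebase: the function name `missing_paths` is hypothetical, and requiring the images/ and labels/ subdirectories to exist up front is an assumption drawn from the tree listing.

```python
from pathlib import Path
from typing import List

# Top-level directories and files from the tree in this document.
REQUIRED_DIRS = ("data", "data-seed", "data_deleted", "datasets", "models")
# Assumption: each annotation directory must already contain images/ and labels/.
ANNOTATION_DIRS = ("data", "data-seed", "data_deleted")
REQUIRED_FILES = ("classes.json",)


def missing_paths(root: str) -> List[str]:
    """Return the expected directories and files that do not exist under root."""
    base = Path(root)
    expected_dirs = [base / name for name in REQUIRED_DIRS]
    expected_dirs += [
        base / name / sub for name in ANNOTATION_DIRS for sub in ("images", "labels")
    ]
    missing = [str(p) for p in expected_dirs if not p.is_dir()]
    missing += [str(base / f) for f in REQUIRED_FILES if not (base / f).is_file()]
    return missing
```

The check only reports what is absent and never creates or removes anything, which keeps it consistent with the rule that deployment must not delete production data.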

External Service Configuration

| Service | Dev | Prod |
|---|---|---|
| Azaion REST API | Real API (dev credentials) | Real API (prod credentials) |
| S3-compatible CDN | Auto-provisioned via API | Auto-provisioned via API |
| RabbitMQ | Local container (docker-compose) | Managed instance on network |
| NVIDIA GPU | Host GPU via --gpus all | Host GPU via --gpus all |

CDN credentials are not stored in .env. They are fetched from the API at runtime as an encrypted cdn.yaml file and decrypted using the hardware-bound key. This is the existing pattern and requires no environment variable configuration.