mirror of https://github.com/azaion/ai-training.git synced 2026-04-22 21:46:35 +00:00


Oleksandr Bezdieniezhnykh aeb7f8ca8c Update autopilot workflow and documentation for project cycle completion


2026-03-29 05:02:22 +03:00


Azaion AI Training — Environment Strategy

Environments

Environment	Purpose	Infrastructure	Data Source
Development	Local developer workflow	docker-compose, local RabbitMQ	Test annotations, small sample dataset
Production	Live training on GPU server	Direct host processes or Docker, real RabbitMQ	Real annotations from Azaion platform

There is no staging environment: the system is an ML training pipeline on a dedicated GPU server, not a multi-tier web service. Validation happens in two places: the CI test suite (tests that run on CPU) and manual verification on the GPU server before committing to a long training run.

Environment Variables

Required Variables

Variable	Purpose	Dev Default	Prod Source
`AZAION_API_URL`	Azaion REST API base URL	`https://api.azaion.com`	`.env` on server
`AZAION_API_EMAIL`	API login email	dev account	`.env` on server
`AZAION_API_PASSWORD`	API login password	dev password	`.env` on server
`RABBITMQ_HOST`	RabbitMQ host	`127.0.0.1` (local container)	`.env` on server
`RABBITMQ_PORT`	RabbitMQ Streams port	`5552`	`.env` on server
`RABBITMQ_USER`	RabbitMQ username	`azaion_receiver`	`.env` on server
`RABBITMQ_PASSWORD`	RabbitMQ password	`changeme`	`.env` on server
`RABBITMQ_QUEUE_NAME`	Queue name	`azaion-annotations`	`.env` on server
`AZAION_ROOT_DIR`	Root data directory	`/azaion`	`.env` on server
`AZAION_DATA_DIR`	Validated annotations dir name	`data`	`.env` on server
`AZAION_DATA_SEED_DIR`	Unvalidated annotations dir name	`data-seed`	`.env` on server
`AZAION_DATA_DELETED_DIR`	Deleted annotations dir name	`data_deleted`	`.env` on server
`TRAINING_MODEL`	Base model filename	`yolo26m.pt`	`.env` on server
`TRAINING_EPOCHS`	Training epochs	`120`	`.env` on server
`TRAINING_BATCH_SIZE`	Training batch size	`11`	`.env` on server
`TRAINING_IMGSZ`	Training image size	`1280`	`.env` on server
`TRAINING_SAVE_PERIOD`	Checkpoint save interval	`1`	`.env` on server
`TRAINING_WORKERS`	Dataloader workers	`24`	`.env` on server
`EXPORT_ONNX_IMGSZ`	ONNX export image size	`1280`	`.env` on server

`.env.example`

Committed to version control with placeholder values. See .env.example in project root (created in Step 1).

Variable Validation

The config.yaml generation script (part of deploy scripts) validates that all required environment variables are set before writing the config file. Missing variables cause an immediate failure with a clear error listing which variables are absent.

Config Generation

The codebase reads configuration from config.yaml, not directly from environment variables. The deployment flow generates config.yaml from environment variables at deploy time:

1. .env contains all variable values (never committed)
2. Deploy script sources .env and renders config.yaml from a template
3. config.yaml is placed at the expected location for the application

This preserves the existing code's config reading pattern while externalizing secrets to environment variables.

Secrets Management

Environment	Method	Location
Development	`.env` file (git-ignored)	Project root
Production	`.env` file (restricted permissions)	GPU server `/opt/azaion-training/.env`

Production .env file:

- Ownership: root:deploy (deploy user's group)
- Permissions: 640 (owner read/write, group read, others none)
- Located outside the Docker build context
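The permission requirement can be checked before services start. The following is a minimal sketch that assumes a POSIX host; `permission_problems` is a hypothetical helper, and it covers only the mode bits, not the root:deploy ownership.

```python
import os
import stat

EXPECTED_MODE = 0o640  # owner read/write, group read, others none


def permission_problems(path, expected: int = EXPECTED_MODE) -> list[str]:
    """Return human-readable problems with the file's mode bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    problems = []
    if mode & stat.S_IRWXO:
        # Any read/write/execute bit for "others" exposes the secrets.
        problems.append("others have access")
    if mode != expected:
        problems.append(f"mode is {mode:03o}, expected {expected:03o}")
    return problems
```

An empty list means the mode matches. A deploy script could refuse to continue when the list is non-empty, for example after calling it on `/opt/azaion-training/.env`.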

Secrets in this project:

- AZAION_API_PASSWORD: API authentication
- RABBITMQ_PASSWORD: message queue access
- CDN credentials: auto-provisioned via API at runtime (encrypted cdn.yaml), not in .env
- Model encryption key: hardcoded in security.py (existing pattern, flagged as security concern)

Rotation policy: rotate the API and RabbitMQ passwords quarterly. After each rotation, update .env on the server and restart the affected services.

Filesystem Management

Environment	`/azaion/` Location	Contents
Development	Docker volume or local dir	Test images, small sample labels
Production	Host directory `/azaion/`	Full annotation dataset, trained models, export artifacts

The /azaion/ directory tree must exist before services start:

/azaion/
├── data/           (validated annotations: images/ + labels/)
├── data-seed/      (unvalidated annotations: images/ + labels/)
├── data_deleted/   (soft-deleted annotations: images/ + labels/)
├── datasets/       (formed training datasets: azaion-YYYY-MM-DD/)
├── models/         (trained models: azaion-YYYY-MM-DD/, azaion.pt)
└── classes.json    (annotation class definitions)

Production data is persistent and never deleted by deployment. Docker containers mount this directory as a bind mount.
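One way to satisfy the "must exist before services start" requirement without touching existing data is a create-if-missing pass. This is only a sketch: `ensure_tree` and `missing_files` are hypothetical names, and it assumes that images/ and labels/ are subdirectories of each annotation directory, which is how the tree's annotations read.

```python
from pathlib import Path

# Directories from the tree above; the images/ and labels/ nesting is assumed.
SUBDIRS = (
    "data/images", "data/labels",
    "data-seed/images", "data-seed/labels",
    "data_deleted/images", "data_deleted/labels",
    "datasets", "models",
)


def ensure_tree(root) -> list[Path]:
    """Create any missing directories and return the ones that were created.

    Existing directories and their contents are never modified or removed.
    """
    root = Path(root)
    created = []
    for rel in SUBDIRS:
        directory = root / rel
        if not directory.is_dir():
            directory.mkdir(parents=True, exist_ok=True)
            created.append(directory)
    return created


def missing_files(root) -> list[str]:
    """Report required files that a script cannot create by itself."""
    return [] if (Path(root) / "classes.json").is_file() else ["classes.json"]
```

Running `ensure_tree("/azaion")` twice is harmless: the second call finds every directory in place and returns an empty list. classes.json is only reported, not created, because its contents come from the project.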

External Service Configuration

Service	Dev	Prod
Azaion REST API	Real API (dev credentials)	Real API (prod credentials)
S3-compatible CDN	Auto-provisioned via API	Auto-provisioned via API
RabbitMQ	Local container (docker-compose)	Managed instance on network
NVIDIA GPU	Host GPU via `--gpus all`	Host GPU via `--gpus all`

CDN credentials are not in .env — they are fetched from the API at runtime as an encrypted cdn.yaml file, decrypted using the hardware-bound key. This is the existing pattern and does not need environment variable configuration.
