# Azaion AI Training — Environment Strategy

## Environments

| Environment | Purpose | Infrastructure | Data Source |
|-------------|---------|----------------|-------------|
| Development | Local developer workflow | docker-compose, local RabbitMQ | Test annotations, small sample dataset |
| Production | Live training on GPU server | Direct host processes or Docker, real RabbitMQ | Real annotations from Azaion platform |

There is no staging environment: the system is an ML training pipeline on a dedicated GPU server, not a multi-tier web service. Validation happens through the CI test suite (CPU tests) and manual verification on the GPU server before committing to a long training run.

## Environment Variables

### Required Variables

| Variable | Purpose | Dev Default | Prod Source |
|----------|---------|-------------|-------------|
| `AZAION_API_URL` | Azaion REST API base URL | `https://api.azaion.com` | `.env` on server |
| `AZAION_API_EMAIL` | API login email | dev account | `.env` on server |
| `AZAION_API_PASSWORD` | API login password | dev password | `.env` on server |
| `RABBITMQ_HOST` | RabbitMQ host | `127.0.0.1` (local container) | `.env` on server |
| `RABBITMQ_PORT` | RabbitMQ Streams port | `5552` | `.env` on server |
| `RABBITMQ_USER` | RabbitMQ username | `azaion_receiver` | `.env` on server |
| `RABBITMQ_PASSWORD` | RabbitMQ password | `changeme` | `.env` on server |
| `RABBITMQ_QUEUE_NAME` | Queue name | `azaion-annotations` | `.env` on server |
| `AZAION_ROOT_DIR` | Root data directory | `/azaion` | `.env` on server |
| `AZAION_DATA_DIR` | Validated annotations dir name | `data` | `.env` on server |
| `AZAION_DATA_SEED_DIR` | Unvalidated annotations dir name | `data-seed` | `.env` on server |
| `AZAION_DATA_DELETED_DIR` | Deleted annotations dir name | `data_deleted` | `.env` on server |
| `TRAINING_MODEL` | Base model filename | `yolo26m.pt` | `.env` on server |
| `TRAINING_EPOCHS` | Training epochs | `120` | `.env` on server |
| `TRAINING_BATCH_SIZE` | Training batch size | `11` | `.env` on server |
| `TRAINING_IMGSZ` | Training image size | `1280` | `.env` on server |
| `TRAINING_SAVE_PERIOD` | Checkpoint save interval | `1` | `.env` on server |
| `TRAINING_WORKERS` | Dataloader workers | `24` | `.env` on server |
| `EXPORT_ONNX_IMGSZ` | ONNX export image size | `1280` | `.env` on server |

### `.env.example`

`.env.example` is committed to version control with placeholder values. See the file in the project root (created in Step 1).

### Variable Validation

The `config.yaml` generation script (part of deploy scripts) validates that all required environment variables are set before writing the config file. Missing variables cause an immediate failure with a clear error listing which variables are absent.

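
As an illustration of the validation described in this section, here is a minimal Python sketch. It is not the project's actual deploy script: the function names `find_missing` and `require_env` are hypothetical, and treating whitespace-only values as missing is an assumption. The variable names are copied from the Required Variables table.

```python
import os

# Required variable names, copied from the Required Variables table.
REQUIRED_VARS = [
    "AZAION_API_URL", "AZAION_API_EMAIL", "AZAION_API_PASSWORD",
    "RABBITMQ_HOST", "RABBITMQ_PORT", "RABBITMQ_USER",
    "RABBITMQ_PASSWORD", "RABBITMQ_QUEUE_NAME",
    "AZAION_ROOT_DIR", "AZAION_DATA_DIR", "AZAION_DATA_SEED_DIR",
    "AZAION_DATA_DELETED_DIR",
    "TRAINING_MODEL", "TRAINING_EPOCHS", "TRAINING_BATCH_SIZE",
    "TRAINING_IMGSZ", "TRAINING_SAVE_PERIOD", "TRAINING_WORKERS",
    "EXPORT_ONNX_IMGSZ",
]


def find_missing(env, required=REQUIRED_VARS):
    """Return the required names that are unset or blank in `env` (hypothetical helper)."""
    return [name for name in required if not env.get(name, "").strip()]


def require_env(env=None):
    """Fail immediately, naming every absent variable rather than only the first."""
    env = os.environ if env is None else env
    missing = find_missing(env)
    if missing:
        raise SystemExit(
            "Missing required environment variables: " + ", ".join(missing)
        )
```

Calling `require_env()` before rendering `config.yaml` would stop the deploy with one message that names all absent variables.
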
## Config Generation

The codebase reads configuration from `config.yaml`, not directly from environment variables. The deployment flow generates `config.yaml` from environment variables at deploy time:

1. `.env` contains all variable values (never committed)
2. Deploy script sources `.env` and renders `config.yaml` from a template
3. `config.yaml` is placed at the expected location for the application

This preserves the existing code's config reading pattern while externalizing secrets to environment variables.

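
The render step in this flow could look roughly like the Python sketch below, which uses the standard library's `string.Template`. The YAML keys in `CONFIG_TEMPLATE` are hypothetical, because the real `config.yaml` layout is defined by the codebase and is not shown here. `parse_env_text` is a simplified stand-in for sourcing `.env` in a shell, and `render_config` is likewise an illustrative name.

```python
from string import Template

# Hypothetical template fragment; the actual config.yaml keys are an assumption.
CONFIG_TEMPLATE = Template("""\
api:
  url: "${AZAION_API_URL}"
  email: "${AZAION_API_EMAIL}"
queue:
  host: "${RABBITMQ_HOST}"
  port: ${RABBITMQ_PORT}
""")


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes; no shell-style expansion is done.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def render_config(values, template=CONFIG_TEMPLATE):
    """Fill the template; Template.substitute raises KeyError on a missing name."""
    return template.substitute(values)
```

Because `substitute` raises on any placeholder without a value, a forgotten variable fails at render time instead of producing a half-filled config. Values containing quotes or other YAML-significant characters would need proper escaping, which this sketch does not attempt.
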
## Secrets Management

| Environment | Method | Location |
|-------------|--------|----------|
| Development | `.env` file (git-ignored) | Project root |
| Production | `.env` file (restricted permissions) | GPU server `/opt/azaion-training/.env` |

Production `.env` file:

- Ownership: `root:deploy` (deploy user's group)
- Permissions: `640` (owner read/write, group read, others none)
- Located outside the Docker build context

Secrets in this project:

- `AZAION_API_PASSWORD`: API authentication
- `RABBITMQ_PASSWORD`: message queue access
- CDN credentials: auto-provisioned via API at runtime (encrypted `cdn.yaml`), not in `.env`
- Model encryption key: hardcoded in `security.py` (existing pattern, flagged as security concern)

Rotation policy: rotate API and RabbitMQ passwords quarterly. Update `.env` on the server, then restart affected services.

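
The `640` permission requirement for the production `.env` file can be checked before deploy. The sketch below is one possible pre-flight check, not part of the existing scripts; the function name `env_file_problems` is hypothetical. It inspects only the permission bits and does not verify the `root:deploy` ownership.

```python
import os
import stat


def env_file_problems(path, expected_mode=0o640):
    """Return human-readable problems with the file's permission bits."""
    # S_IMODE keeps only the permission bits (for example 0o640).
    mode = stat.S_IMODE(os.stat(path).st_mode)
    problems = []
    if mode != expected_mode:
        problems.append(f"mode is {mode:03o}, expected {expected_mode:03o}")
    if mode & stat.S_IRWXO:
        # Any read/write/execute bit for "others" exposes the secrets.
        problems.append("file is accessible to other users")
    return problems
```

A deploy script could call `env_file_problems("/opt/azaion-training/.env")` and abort if the returned list is non-empty.
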
## Filesystem Management

| Environment | `/azaion/` Location | Contents |
|-------------|---------------------|----------|
| Development | Docker volume or local dir | Test images, small sample labels |
| Production | Host directory `/azaion/` | Full annotation dataset, trained models, export artifacts |

The `/azaion/` directory tree must exist before services start:

```
/azaion/
├── data/          (validated annotations: images/ + labels/)
├── data-seed/     (unvalidated annotations: images/ + labels/)
├── data_deleted/  (soft-deleted annotations: images/ + labels/)
├── datasets/      (formed training datasets: azaion-YYYY-MM-DD/)
├── models/        (trained models: azaion-YYYY-MM-DD/, azaion.pt)
└── classes.json   (annotation class definitions)
```

Production data is persistent and never deleted by deployment. Docker containers mount this directory as a bind mount.

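
Since the tree must exist before services start, a small bootstrap step can create the missing directories and report anything still absent. The Python sketch below assumes the layout shown above; `expected_dirs`, `ensure_tree`, and `missing_entries` are illustrative names, not existing project functions. It never deletes or overwrites anything, and it does not create `classes.json`, whose contents are not specified here.

```python
from pathlib import Path

# Directory names taken from the tree above; the annotation dirs hold images/ and labels/.
ANNOTATION_DIRS = ("data", "data-seed", "data_deleted")
OTHER_DIRS = ("datasets", "models")


def expected_dirs(root):
    """List every directory the tree is expected to contain under `root`."""
    root = Path(root)
    dirs = [root / name / sub for name in ANNOTATION_DIRS for sub in ("images", "labels")]
    dirs += [root / name for name in OTHER_DIRS]
    return dirs


def ensure_tree(root):
    """Create missing directories; existing directories and files are left untouched."""
    for path in expected_dirs(root):
        path.mkdir(parents=True, exist_ok=True)


def missing_entries(root):
    """Return expected directories, plus classes.json, that do not exist yet."""
    missing = [p for p in expected_dirs(root) if not p.is_dir()]
    classes = Path(root) / "classes.json"
    if not classes.is_file():
        missing.append(classes)
    return missing
```

Running `ensure_tree("/azaion")` followed by `missing_entries("/azaion")` would leave only `classes.json` to be supplied by hand if it is not already present.
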
## External Service Configuration

| Service | Dev | Prod |
|---------|-----|------|
| Azaion REST API | Real API (dev credentials) | Real API (prod credentials) |
| S3-compatible CDN | Auto-provisioned via API | Auto-provisioned via API |
| RabbitMQ | Local container (docker-compose) | Managed instance on network |
| NVIDIA GPU | Host GPU via `--gpus all` | Host GPU via `--gpus all` |

CDN credentials are not in `.env`. They are fetched from the API at runtime as an encrypted `cdn.yaml` file and decrypted using the hardware-bound key. This is the existing pattern and does not need environment variable configuration.