# Azaion AI Training — Environment Strategy

## Environments

| Environment | Purpose | Infrastructure | Data Source |
|-------------|---------|---------------|-------------|
| Development | Local developer workflow | docker-compose, local RabbitMQ | Test annotations, small sample dataset |
| Production | Live training on GPU server | Direct host processes or Docker, real RabbitMQ | Real annotations from Azaion platform |

There is no staging environment: the system is an ML training pipeline on a dedicated GPU server, not a multi-tier web service. Validation instead happens through the CI test suite (CPU tests) and manual verification on the GPU server before committing to a long training run.

## Environment Variables

### Required Variables

| Variable | Purpose | Dev Default | Prod Source |
|----------|---------|-------------|-------------|
| `AZAION_API_URL` | Azaion REST API base URL | `https://api.azaion.com` | `.env` on server |
| `AZAION_API_EMAIL` | API login email | dev account | `.env` on server |
| `AZAION_API_PASSWORD` | API login password | dev password | `.env` on server |
| `RABBITMQ_HOST` | RabbitMQ host | `127.0.0.1` (local container) | `.env` on server |
| `RABBITMQ_PORT` | RabbitMQ Streams port | `5552` | `.env` on server |
| `RABBITMQ_USER` | RabbitMQ username | `azaion_receiver` | `.env` on server |
| `RABBITMQ_PASSWORD` | RabbitMQ password | `changeme` | `.env` on server |
| `RABBITMQ_QUEUE_NAME` | Queue name | `azaion-annotations` | `.env` on server |
| `AZAION_ROOT_DIR` | Root data directory | `/azaion` | `.env` on server |
| `AZAION_DATA_DIR` | Validated annotations dir name | `data` | `.env` on server |
| `AZAION_DATA_SEED_DIR` | Unvalidated annotations dir name | `data-seed` | `.env` on server |
| `AZAION_DATA_DELETED_DIR` | Deleted annotations dir name | `data_deleted` | `.env` on server |
| `TRAINING_MODEL` | Base model filename | `yolo26m.pt` | `.env` on server |
| `TRAINING_EPOCHS` | Training epochs | `120` | `.env` on server |
| `TRAINING_BATCH_SIZE` | Training batch size | `11` | `.env` on server |
| `TRAINING_IMGSZ` | Training image size | `1280` | `.env` on server |
| `TRAINING_SAVE_PERIOD` | Checkpoint save interval | `1` | `.env` on server |
| `TRAINING_WORKERS` | Dataloader workers | `24` | `.env` on server |
| `EXPORT_ONNX_IMGSZ` | ONNX export image size | `1280` | `.env` on server |

### `.env.example`

`.env.example` is committed to version control with placeholder values only. It lives in the project root (created in Step 1).

### Variable Validation

The `config.yaml` generation script (part of the deploy scripts) checks that every required environment variable is set before it writes the config file. If any are missing, it fails immediately with an error that lists the absent variables by name.
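
The check described above could look roughly like the following. This is a minimal sketch, not the actual deploy script: the function names `find_missing` and `validate_env` are hypothetical, and treating an empty value as "missing" is an assumption. The variable names come from the Required Variables table.

```python
import os

# Names taken from the "Required Variables" table.
REQUIRED_VARS = [
    "AZAION_API_URL", "AZAION_API_EMAIL", "AZAION_API_PASSWORD",
    "RABBITMQ_HOST", "RABBITMQ_PORT", "RABBITMQ_USER",
    "RABBITMQ_PASSWORD", "RABBITMQ_QUEUE_NAME",
    "AZAION_ROOT_DIR", "AZAION_DATA_DIR", "AZAION_DATA_SEED_DIR",
    "AZAION_DATA_DELETED_DIR",
    "TRAINING_MODEL", "TRAINING_EPOCHS", "TRAINING_BATCH_SIZE",
    "TRAINING_IMGSZ", "TRAINING_SAVE_PERIOD", "TRAINING_WORKERS",
    "EXPORT_ONNX_IMGSZ",
]


def find_missing(env):
    """Return required names that are unset or empty in `env`, in table order."""
    # Assumption: an empty string counts as missing.
    return [name for name in REQUIRED_VARS if not env.get(name)]


def validate_env(env=None):
    """Exit with a message listing every missing variable; return None if all are set."""
    env = os.environ if env is None else env
    missing = find_missing(env)
    if missing:
        raise SystemExit(
            "Missing required environment variables: " + ", ".join(missing)
        )
```

Calling `validate_env()` before any file is written keeps a half-rendered `config.yaml` from ever reaching the application.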

## Config Generation

The codebase reads configuration from `config.yaml`, not directly from environment variables. The deployment flow generates `config.yaml` from environment variables at deploy time:

1. `.env` contains all variable values (never committed)
2. Deploy script sources `.env` and renders `config.yaml` from a template
3. `config.yaml` is placed at the expected location for the application

This preserves the existing code's config reading pattern while externalizing secrets to environment variables.
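
A sketch of the render step, under stated assumptions: the real template and the key layout of `config.yaml` are not shown in this document, so `CONFIG_TEMPLATE` below is a hypothetical fragment, and `parse_dotenv` is a deliberately simple stand-in for sourcing `.env` in a shell. It uses only `string.Template` from the standard library.

```python
from string import Template

# Hypothetical template fragment; the actual config.yaml keys may differ.
CONFIG_TEMPLATE = Template(
    "api:\n"
    "  url: ${AZAION_API_URL}\n"
    "queue:\n"
    "  host: ${RABBITMQ_HOST}\n"
    "  port: ${RABBITMQ_PORT}\n"
    "training:\n"
    "  epochs: ${TRAINING_EPOCHS}\n"
)


def parse_dotenv(text):
    """Parse simple KEY=VALUE lines; skips blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes, if any.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def render_config(env):
    """Fill the template; substitute() raises KeyError if a placeholder has no value."""
    return CONFIG_TEMPLATE.substitute(env)
```

Values are inserted verbatim, so a secret containing YAML-significant characters (such as `:` or `#`) would need quoting in the template. A real script would also write the result with restrictive permissions, since the rendered file contains the same secrets as `.env`.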

## Secrets Management

| Environment | Method | Location |
|-------------|--------|----------|
| Development | `.env` file (git-ignored) | Project root |
| Production | `.env` file (restricted permissions) | GPU server `/opt/azaion-training/.env` |

Production `.env` file:
- Ownership: `root:deploy` (deploy user's group)
- Permissions: `640` (owner read/write, group read, others none)
- Located outside the Docker build context
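
The permission requirement above can be verified before a deploy proceeds. The following is a minimal sketch; the helper name `env_file_mode_ok` is hypothetical, and it checks only the mode bits, not the `root:deploy` ownership.

```python
import os
import stat

# 640: owner read/write, group read, others none.
EXPECTED_MODE = 0o640


def env_file_mode_ok(path, expected=EXPECTED_MODE):
    """Return True if the file's permission bits equal `expected` exactly."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == expected
```

For example, `env_file_mode_ok("/opt/azaion-training/.env")` would return `False` if the file had been left world-readable (`644`).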

Secrets in this project:
- `AZAION_API_PASSWORD` — API authentication
- `RABBITMQ_PASSWORD` — message queue access
- CDN credentials — auto-provisioned via API at runtime (encrypted `cdn.yaml`), not in `.env`
- Model encryption key — hardcoded in `security.py` (existing pattern, flagged as security concern)

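Rotation policy: rotate the API and RabbitMQ passwords quarterly. After each rotation, update `.env` on the server, regenerate `config.yaml` (the application reads `config.yaml`, not the environment), and restart the affected services.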

## Filesystem Management

| Environment | `/azaion/` Location | Contents |
|-------------|-------------------|----------|
| Development | Docker volume or local dir | Test images, small sample labels |
| Production | Host directory `/azaion/` | Full annotation dataset, trained models, export artifacts |

The `/azaion/` directory tree must exist before services start:

```
/azaion/
├── data/           (validated annotations: images/ + labels/)
├── data-seed/      (unvalidated annotations: images/ + labels/)
├── data_deleted/   (soft-deleted annotations: images/ + labels/)
├── datasets/       (formed training datasets: azaion-YYYY-MM-DD/)
├── models/         (trained models: azaion-YYYY-MM-DD/, azaion.pt)
└── classes.json    (annotation class definitions)
```

Production data is persistent, and deployments never delete it. Docker containers access this directory through a bind mount.
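
A pre-start check for the tree above might look like this. It is a sketch under assumptions: the helper names `ensure_tree` and `missing_entries` are hypothetical, the directory names are the defaults from the tree rather than values read from `config.yaml`, and `classes.json` is only reported, never created, because its contents are not defined here.

```python
from pathlib import Path

# Directories from the tree above, using the default names.
REQUIRED_DIRS = [
    "data/images", "data/labels",
    "data-seed/images", "data-seed/labels",
    "data_deleted/images", "data_deleted/labels",
    "datasets", "models",
]
REQUIRED_FILES = ["classes.json"]


def ensure_tree(root):
    """Create any missing directories; existing content is left untouched."""
    root = Path(root)
    for sub in REQUIRED_DIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)


def missing_entries(root):
    """Return the required directories and files that do not exist under `root`."""
    root = Path(root)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root / f).is_file()]
    return missing
```

Because `mkdir` is called with `exist_ok=True`, running `ensure_tree("/azaion")` on a populated production host is a no-op for directories that already exist.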

## External Service Configuration

| Service | Dev | Prod |
|---------|-----|------|
| Azaion REST API | Real API (dev credentials) | Real API (prod credentials) |
| S3-compatible CDN | Auto-provisioned via API | Auto-provisioned via API |
| RabbitMQ | Local container (docker-compose) | Managed instance on network |
| NVIDIA GPU | Host GPU via `--gpus all` | Host GPU via `--gpus all` |

CDN credentials are not stored in `.env`. They are fetched from the API at runtime as an encrypted `cdn.yaml` file and decrypted using the hardware-bound key. This is the existing pattern and requires no environment variable configuration.