Update autopilot workflow and documentation for project cycle completion

- Modified the existing-code workflow to automatically loop back to New Task after project completion without user confirmation.
- Updated the autopilot state to reflect the current step as `done` and status as `completed`.
- Clarified the deployment status report by specifying non-deployed services and their purposes.

These changes enhance the automation of task management and improve documentation clarity.
Oleksandr Bezdieniezhnykh
2026-03-29 05:02:22 +03:00
parent 0bf3894e03
commit aeb7f8ca8c
20 changed files with 1360 additions and 12 deletions
@@ -0,0 +1,106 @@
# Azaion AI Training — Environment Strategy
## Environments
| Environment | Purpose | Infrastructure | Data Source |
|-------------|---------|---------------|-------------|
| Development | Local developer workflow | docker-compose, local RabbitMQ | Test annotations, small sample dataset |
| Production | Live training on GPU server | Direct host processes or Docker, real RabbitMQ | Real annotations from Azaion platform |
No staging environment — the system is an ML training pipeline on a dedicated GPU server, not a multi-tier web service. Validation happens through the CI test suite (CPU tests) and manual verification on the GPU server before committing to a long training run.
## Environment Variables
### Required Variables
| Variable | Purpose | Dev Default | Prod Source |
|----------|---------|-------------|-------------|
| `AZAION_API_URL` | Azaion REST API base URL | `https://api.azaion.com` | `.env` on server |
| `AZAION_API_EMAIL` | API login email | dev account | `.env` on server |
| `AZAION_API_PASSWORD` | API login password | dev password | `.env` on server |
| `RABBITMQ_HOST` | RabbitMQ host | `127.0.0.1` (local container) | `.env` on server |
| `RABBITMQ_PORT` | RabbitMQ Streams port | `5552` | `.env` on server |
| `RABBITMQ_USER` | RabbitMQ username | `azaion_receiver` | `.env` on server |
| `RABBITMQ_PASSWORD` | RabbitMQ password | `changeme` | `.env` on server |
| `RABBITMQ_QUEUE_NAME` | Queue name | `azaion-annotations` | `.env` on server |
| `AZAION_ROOT_DIR` | Root data directory | `/azaion` | `.env` on server |
| `AZAION_DATA_DIR` | Validated annotations dir name | `data` | `.env` on server |
| `AZAION_DATA_SEED_DIR` | Unvalidated annotations dir name | `data-seed` | `.env` on server |
| `AZAION_DATA_DELETED_DIR` | Deleted annotations dir name | `data_deleted` | `.env` on server |
| `TRAINING_MODEL` | Base model filename | `yolo26m.pt` | `.env` on server |
| `TRAINING_EPOCHS` | Training epochs | `120` | `.env` on server |
| `TRAINING_BATCH_SIZE` | Training batch size | `11` | `.env` on server |
| `TRAINING_IMGSZ` | Training image size | `1280` | `.env` on server |
| `TRAINING_SAVE_PERIOD` | Checkpoint save interval | `1` | `.env` on server |
| `TRAINING_WORKERS` | Dataloader workers | `24` | `.env` on server |
| `EXPORT_ONNX_IMGSZ` | ONNX export image size | `1280` | `.env` on server |
### `.env.example`
Committed to version control with placeholder values. See `.env.example` in project root (created in Step 1).
### Variable Validation
The `config.yaml` generation script (part of deploy scripts) validates that all required environment variables are set before writing the config file. Missing variables cause an immediate failure with a clear error listing which variables are absent.
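The validation step described above could look roughly like the following. This is a minimal sketch, not the project's actual deploy script: the function names `find_missing` and `validate_env` are hypothetical, and treating an empty value as missing is an assumption.

```python
import os

# Names copied from the "Required Variables" table.
REQUIRED_VARS = [
    "AZAION_API_URL", "AZAION_API_EMAIL", "AZAION_API_PASSWORD",
    "RABBITMQ_HOST", "RABBITMQ_PORT", "RABBITMQ_USER",
    "RABBITMQ_PASSWORD", "RABBITMQ_QUEUE_NAME",
    "AZAION_ROOT_DIR", "AZAION_DATA_DIR", "AZAION_DATA_SEED_DIR",
    "AZAION_DATA_DELETED_DIR",
    "TRAINING_MODEL", "TRAINING_EPOCHS", "TRAINING_BATCH_SIZE",
    "TRAINING_IMGSZ", "TRAINING_SAVE_PERIOD", "TRAINING_WORKERS",
    "EXPORT_ONNX_IMGSZ",
]


def find_missing(env, required=REQUIRED_VARS):
    """Return the required names that are unset or empty, in table order."""
    # Assumption: an empty string counts as "not set".
    return [name for name in required if not env.get(name)]


def validate_env(env=None):
    """Exit immediately with a list of absent variables, if any."""
    env = os.environ if env is None else env
    missing = find_missing(env)
    if missing:
        raise SystemExit(
            "Missing required environment variables: " + ", ".join(missing)
        )
```

Calling `validate_env()` before rendering `config.yaml` means a partial config file is never written.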
## Config Generation
The codebase reads configuration from `config.yaml`, not directly from environment variables. The deployment flow generates `config.yaml` from environment variables at deploy time:
1. `.env` contains all variable values (never committed)
2. Deploy script sources `.env` and renders `config.yaml` from a template
3. `config.yaml` is placed at the expected location for the application
This preserves the existing code's config reading pattern while externalizing secrets to environment variables.
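The three steps above can be sketched as follows. This is an illustration under stated assumptions: the YAML keys in `CONFIG_TEMPLATE` are hypothetical (the real `config.yaml` schema is not shown here), the helper names are invented, and `.env` is parsed as plain `KEY=VALUE` lines rather than sourced by a shell.

```python
from pathlib import Path
from string import Template

# Hypothetical template: the key names are placeholders, not the real schema.
CONFIG_TEMPLATE = Template("""\
api:
  url: "$AZAION_API_URL"
  email: "$AZAION_API_EMAIL"
  password: "$AZAION_API_PASSWORD"
queue:
  host: "$RABBITMQ_HOST"
  port: $RABBITMQ_PORT
""")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def render_config(env_text):
    """Render config text; raises KeyError if a template variable is absent."""
    return CONFIG_TEMPLATE.substitute(parse_env(env_text))


def write_config(env_path, out_path):
    """Read .env, render the template, and write config.yaml."""
    Path(out_path).write_text(render_config(Path(env_path).read_text()))
```

`Template.substitute` fails on any variable the template names but `.env` lacks, which complements the validation step. Values containing double quotes would need YAML escaping that this sketch does not attempt.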
## Secrets Management
| Environment | Method | Location |
|-------------|--------|----------|
| Development | `.env` file (git-ignored) | Project root |
| Production | `.env` file (restricted permissions) | GPU server `/opt/azaion-training/.env` |
Production `.env` file:
- Ownership: `root:deploy` (deploy user's group)
- Permissions: `640` (owner read/write, group read, others none)
- Located outside the Docker build context
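A deploy script could verify the permission bits before starting services. The sketch below checks only the mode, not the `root:deploy` ownership, and the function name `env_file_mode_ok` is hypothetical.

```python
import os
import stat


def env_file_mode_ok(path, expected=0o640):
    """Return True if the file's permission bits equal `expected`."""
    # S_IMODE strips the file-type bits, leaving only the permission bits.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == expected
```

For example, `env_file_mode_ok("/opt/azaion-training/.env")` returns `False` if the file is world-readable (`644`).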
Secrets in this project:
- `AZAION_API_PASSWORD` — API authentication
- `RABBITMQ_PASSWORD` — message queue access
- CDN credentials — auto-provisioned via API at runtime (encrypted `cdn.yaml`), not in `.env`
- Model encryption key — hardcoded in `security.py` (existing pattern, flagged as security concern)
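Rotation policy: rotate API and RabbitMQ passwords quarterly. After rotating, update `.env` on the server, regenerate `config.yaml` (the application reads its configuration from that file, not from the environment), and restart affected services.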
## Filesystem Management
| Environment | `/azaion/` Location | Contents |
|-------------|-------------------|----------|
| Development | Docker volume or local dir | Test images, small sample labels |
| Production | Host directory `/azaion/` | Full annotation dataset, trained models, export artifacts |
The `/azaion/` directory tree must exist before services start:
```
/azaion/
├── data/ (validated annotations: images/ + labels/)
├── data-seed/ (unvalidated annotations: images/ + labels/)
├── data_deleted/ (soft-deleted annotations: images/ + labels/)
├── datasets/ (formed training datasets: azaion-YYYY-MM-DD/)
├── models/ (trained models: azaion-YYYY-MM-DD/, azaion.pt)
└── classes.json (annotation class definitions)
```
Production data is persistent and never deleted by deployment. Docker containers mount this directory as a bind mount.
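A pre-start check for the directory tree could look like this. It is a sketch under assumptions: it uses the default directory names from the variables table (a real script would read `AZAION_DATA_DIR` and the related variables), it takes the `images/` and `labels/` subdirectories from the annotations in the tree above, and the name `missing_entries` is hypothetical.

```python
from pathlib import Path

# Default layout; directory names are configurable through environment variables.
REQUIRED_DIRS = [
    "data/images", "data/labels",
    "data-seed/images", "data-seed/labels",
    "data_deleted/images", "data_deleted/labels",
    "datasets", "models",
]
REQUIRED_FILES = ["classes.json"]


def missing_entries(root):
    """Return relative paths under `root` that are absent or of the wrong type."""
    root = Path(root)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root / f).is_file()]
    return missing
```

The function only reports problems and creates nothing, in keeping with the rule that deployment never deletes or rewrites production data.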
## External Service Configuration
| Service | Dev | Prod |
|---------|-----|------|
| Azaion REST API | Real API (dev credentials) | Real API (prod credentials) |
| S3-compatible CDN | Auto-provisioned via API | Auto-provisioned via API |
| RabbitMQ | Local container (docker-compose) | Managed instance on network |
| NVIDIA GPU | Host GPU via `--gpus all` | Host GPU via `--gpus all` |
CDN credentials are not in `.env` — they are fetched from the API at runtime as an encrypted `cdn.yaml` file, decrypted using the hardware-bound key. This is the existing pattern and does not need environment variable configuration.