Update autopilot workflow and documentation for project cycle completion

- Modified the existing-code workflow to automatically loop back to New Task after project completion without user confirmation.
- Updated the autopilot state to reflect the current step as `done` and status as `completed`.
- Clarified the deployment status report by specifying non-deployed services and their purposes.

These changes enhance the automation of task management and improve documentation clarity.
2026-04-22 09:06:35 +00:00 · 2026-03-29 05:02:22 +03:00
parent 0bf3894e03
commit aeb7f8ca8c
20 changed files with 1360 additions and 12 deletions
@@ -0,0 +1,106 @@
+# Azaion AI Training — Environment Strategy
+
+## Environments
+
+| Environment | Purpose | Infrastructure | Data Source |
+|-------------|---------|---------------|-------------|
+| Development | Local developer workflow | docker-compose, local RabbitMQ | Test annotations, small sample dataset |
+| Production | Live training on GPU server | Direct host processes or Docker, real RabbitMQ | Real annotations from Azaion platform |
+
+There is no staging environment: the system is an ML training pipeline on a dedicated GPU server, not a multi-tier web service. Validation instead happens through the CI test suite (CPU tests) and manual verification on the GPU server before committing to a long training run.
+
+## Environment Variables
+
+### Required Variables
+
+| Variable | Purpose | Dev Default | Prod Source |
+|----------|---------|-------------|-------------|
+| `AZAION_API_URL` | Azaion REST API base URL | `https://api.azaion.com` | `.env` on server |
+| `AZAION_API_EMAIL` | API login email | dev account | `.env` on server |
+| `AZAION_API_PASSWORD` | API login password | dev password | `.env` on server |
+| `RABBITMQ_HOST` | RabbitMQ host | `127.0.0.1` (local container) | `.env` on server |
+| `RABBITMQ_PORT` | RabbitMQ Streams port | `5552` | `.env` on server |
+| `RABBITMQ_USER` | RabbitMQ username | `azaion_receiver` | `.env` on server |
+| `RABBITMQ_PASSWORD` | RabbitMQ password | `changeme` | `.env` on server |
+| `RABBITMQ_QUEUE_NAME` | Queue name | `azaion-annotations` | `.env` on server |
+| `AZAION_ROOT_DIR` | Root data directory | `/azaion` | `.env` on server |
+| `AZAION_DATA_DIR` | Validated annotations dir name | `data` | `.env` on server |
+| `AZAION_DATA_SEED_DIR` | Unvalidated annotations dir name | `data-seed` | `.env` on server |
+| `AZAION_DATA_DELETED_DIR` | Deleted annotations dir name | `data_deleted` | `.env` on server |
+| `TRAINING_MODEL` | Base model filename | `yolo26m.pt` | `.env` on server |
+| `TRAINING_EPOCHS` | Training epochs | `120` | `.env` on server |
+| `TRAINING_BATCH_SIZE` | Training batch size | `11` | `.env` on server |
+| `TRAINING_IMGSZ` | Training image size (pixels) | `1280` | `.env` on server |
+| `TRAINING_SAVE_PERIOD` | Checkpoint save interval (epochs) | `1` | `.env` on server |
+| `TRAINING_WORKERS` | Dataloader workers | `24` | `.env` on server |
+| `EXPORT_ONNX_IMGSZ` | ONNX export image size (pixels) | `1280` | `.env` on server |
+
+### `.env.example`
+
+`.env.example` is committed to version control with placeholder values only. It lives in the project root (created in Step 1).
+
+### Variable Validation
+
+The `config.yaml` generation script (part of deploy scripts) validates that all required environment variables are set before writing the config file. Missing variables cause an immediate failure with a clear error listing which variables are absent.
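
A minimal sketch of the fail-fast check described above, for illustration only and not taken from the repository's deploy scripts. The function names are hypothetical, only a subset of the table is listed, and treating an empty value as missing is an assumption.

```python
from typing import Mapping, Sequence

# Names copied from the "Required Variables" table (subset, for brevity).
REQUIRED_VARIABLES = (
    "AZAION_API_URL",
    "AZAION_API_EMAIL",
    "AZAION_API_PASSWORD",
    "RABBITMQ_HOST",
    "RABBITMQ_PORT",
    "RABBITMQ_USER",
    "RABBITMQ_PASSWORD",
    "RABBITMQ_QUEUE_NAME",
)


def find_missing(
    env: Mapping[str, str], required: Sequence[str] = REQUIRED_VARIABLES
) -> list[str]:
    """Return required names that are unset or blank, in table order."""
    # Assumption: a blank value counts as missing.
    return [name for name in required if not env.get(name, "").strip()]


def validate_env(
    env: Mapping[str, str], required: Sequence[str] = REQUIRED_VARIABLES
) -> None:
    """Abort with a single error that names every absent variable."""
    missing = find_missing(env, required)
    if missing:
        raise SystemExit(
            "Missing required environment variables: " + ", ".join(missing)
        )
```

A deploy script could call `validate_env(os.environ)` before writing any file, so that a partial `config.yaml` is never produced.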
+
+## Config Generation
+
+The codebase reads configuration from `config.yaml`, not directly from environment variables. The deployment flow generates `config.yaml` from environment variables at deploy time:
+
+1. `.env` contains all variable values (never committed)
+2. Deploy script sources `.env` and renders `config.yaml` from a template
+3. `config.yaml` is placed at the expected location for the application
+
+This preserves the existing code's config reading pattern while externalizing secrets to environment variables.
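
The three steps above could be sketched as follows. This is an illustration under stated assumptions, not the project's deploy script: the template keys (`api`, `queue`) are hypothetical because the real `config.yaml` layout is not shown here, and the `.env` parser handles only plain `KEY=VALUE` lines.

```python
from string import Template
from typing import Mapping

# Hypothetical template: the real config.yaml keys are not shown in this document.
CONFIG_TEMPLATE = Template(
    "api:\n"
    "  url: ${AZAION_API_URL}\n"
    "  email: ${AZAION_API_EMAIL}\n"
    "queue:\n"
    "  host: ${RABBITMQ_HOST}\n"
    "  port: ${RABBITMQ_PORT}\n"
)


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def render_config(env: Mapping[str, str]) -> str:
    """Fill the template; substitute() raises KeyError if a variable is absent."""
    return CONFIG_TEMPLATE.substitute(env)
```

Values containing YAML-significant characters (for example a password with `:` or `#`) would need quoting in the template; the sketch leaves that out.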
+
+## Secrets Management
+
+| Environment | Method | Location |
+|-------------|--------|----------|
+| Development | `.env` file (git-ignored) | Project root |
+| Production | `.env` file (restricted permissions) | GPU server `/opt/azaion-training/.env` |
+
+Production `.env` file:
+- Ownership: `root:deploy` (deploy user's group)
+- Permissions: `640` (owner read/write, group read, others none)
+- Located outside the Docker build context
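
As an illustration, a pre-deploy check could confirm the permission bits listed above. This is a sketch only: the function name is hypothetical, it assumes a POSIX filesystem, and it does not verify the `root:deploy` ownership.

```python
import os
import stat

EXPECTED_MODE = 0o640  # owner read/write, group read, others none


def env_file_mode_ok(path: str, expected: int = EXPECTED_MODE) -> bool:
    """Return True when the file's permission bits equal the expected mode."""
    # S_IMODE strips the file-type bits and keeps only the permission bits.
    return stat.S_IMODE(os.stat(path).st_mode) == expected
```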
+
+Secrets in this project:
+- `AZAION_API_PASSWORD` — API authentication
+- `RABBITMQ_PASSWORD` — message queue access
+- CDN credentials — auto-provisioned via API at runtime (encrypted `cdn.yaml`), not in `.env`
+- Model encryption key — hardcoded in `security.py` (existing pattern, flagged as security concern)
+
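+Rotation policy: rotate the API and RabbitMQ passwords quarterly. After updating `.env` on the server, regenerate `config.yaml` (the application reads it rather than the environment) and restart the affected services.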
+
+## Filesystem Management
+
+| Environment | `/azaion/` Location | Contents |
+|-------------|-------------------|----------|
+| Development | Docker volume or local dir | Test images, small sample labels |
+| Production | Host directory `/azaion/` | Full annotation dataset, trained models, export artifacts |
+
+The `/azaion/` directory tree must exist before services start:
+
+```
+/azaion/
+├── data/           (validated annotations: images/ + labels/)
+├── data-seed/      (unvalidated annotations: images/ + labels/)
+├── data_deleted/   (soft-deleted annotations: images/ + labels/)
+├── datasets/       (formed training datasets: azaion-YYYY-MM-DD/)
+├── models/         (trained models: azaion-YYYY-MM-DD/, azaion.pt)
+└── classes.json    (annotation class definitions)
+```
+
+Production data is persistent and is never deleted by a deployment. In production, Docker containers mount the host directory as a bind mount.
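
One way to satisfy the "must exist before services start" requirement is a small preflight step like the sketch below. It is illustrative and not part of the repository: the function names are hypothetical, the directory names are the defaults from the variable table (they are configurable through `AZAION_DATA_DIR` and related variables), and `classes.json` is deliberately not created because its contents are defined elsewhere.

```python
from pathlib import Path

# Default directory names from the tree above.
ANNOTATION_DIRS = ("data", "data-seed", "data_deleted")  # each holds images/ and labels/
OTHER_DIRS = ("datasets", "models")


def expected_dirs(root: Path) -> list[Path]:
    """Every directory the tree above implies under the given root."""
    dirs = [
        root / name / sub
        for name in ANNOTATION_DIRS
        for sub in ("images", "labels")
    ]
    dirs += [root / name for name in OTHER_DIRS]
    return dirs


def missing_dirs(root: Path) -> list[Path]:
    """List expected directories that do not exist yet."""
    return [d for d in expected_dirs(root) if not d.is_dir()]


def ensure_tree(root: Path) -> None:
    """Create any missing directories; existing content is left untouched."""
    for directory in missing_dirs(root):
        directory.mkdir(parents=True, exist_ok=True)
```

Because only missing directories are created and nothing is ever removed, running the step on every deployment would be consistent with the rule that production data is never deleted.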
+
+## External Service Configuration
+
+| Service | Dev | Prod |
+|---------|-----|------|
+| Azaion REST API | Real API (dev credentials) | Real API (prod credentials) |
+| S3-compatible CDN | Auto-provisioned via API | Auto-provisioned via API |
+| RabbitMQ | Local container (docker-compose) | Managed instance on network |
+| NVIDIA GPU | Host GPU via `--gpus all` | Host GPU via `--gpus all` |
+
+CDN credentials are not in `.env` — they are fetched from the API at runtime as an encrypted `cdn.yaml` file, decrypted using the hardware-bound key. This is the existing pattern and does not need environment variable configuration.