# Azaion AI Training — Observability

This system is an ML training pipeline, not a web service. Observability focuses on training progress, GPU health, queue throughput, and disk usage rather than HTTP request metrics.
## Logging

### Format

Structured JSON to stdout/stderr. Containers should not write log files — use Docker's log driver for collection.
```json
{
  "timestamp": "2026-03-28T14:30:00Z",
  "level": "INFO",
  "service": "training",
  "message": "Epoch 45/120 completed",
  "context": {"epoch": 45, "loss": 0.0234, "mAP50": 0.891}
}
```
### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, unrecoverable failures | GPU out of memory, API auth failed, corrupt label file |
| WARN | Recoverable issues | Queue reconnection attempt, skipped corrupt image |
| INFO | Progress and business events | Epoch completed, dataset formed, model exported, annotation saved |
| DEBUG | Diagnostics (dev only) | Individual file processing, queue message contents |
### Current State

| Component | Current Logging | Target |
|-----------|-----------------|--------|
| Training Pipeline | `print()` statements | Python `logging` with JSON formatter to stdout |
| Annotation Queue | `logging` with `TimedRotatingFileHandler` | Keep existing + add JSON stdout for Docker |
| Inference Engine | `print()` statements | Not in deployment scope |
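
The training-pipeline target above (Python `logging` with a JSON formatter to stdout) can be sketched as follows. This is an illustrative sketch, not existing code: the `JsonFormatter` and `make_logger` names are assumptions, and the `context` extra field mirrors the log example earlier in this section.

```python
import json
import logging
import sys
import time


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line in the format shown above."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service
        self.converter = time.gmtime  # UTC timestamps to match the trailing "Z"

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        })


def make_logger(service: str) -> logging.Logger:
    """Logger writing structured JSON to stdout only (no files inside containers)."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


log = make_logger("training")
log.info("Epoch 45/120 completed",
         extra={"context": {"epoch": 45, "loss": 0.0234, "mAP50": 0.891}})
```

Structured context travels via the standard `extra` mechanism, so call sites stay plain `logger.info(...)` calls.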
### Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console (`docker logs`) | Session |
| Production | Docker JSON log driver → host filesystem | 30 days (log rotation via Docker daemon config) |
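
One way to implement the rotation mentioned above is the `json-file` driver's options in `/etc/docker/daemon.json`. Note that these options rotate by size and file count, not by age, so the values below are a sketch that only approximates a 30-day window and should be tuned to the actual log volume:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "30"
  }
}
```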
### PII Rules

- Never log API passwords or tokens
- Never log CDN credentials
- Never log model encryption keys
- Queue message image data (base64 bytes) must not be logged at INFO level
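
The last rule can be enforced mechanically with a `logging.Filter` that scrubs bulky payloads before any handler sees them. This is a sketch: the `image_data` key inside the `context` extra is a hypothetical field name, not confirmed by the queue code.

```python
import logging


class RedactImagePayload(logging.Filter):
    """Replace base64 image bytes in the structured 'context' extra with a marker."""

    REDACTED_KEYS = {"image_data"}  # hypothetical key name for queue image payloads

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = getattr(record, "context", None)
        if isinstance(ctx, dict):
            record.context = {
                k: ("<redacted>" if k in self.REDACTED_KEYS else v)
                for k, v in ctx.items()
            }
        return True  # never drop the record, only scrub it
```

Attach it once per handler with `handler.addFilter(RedactImagePayload())` so redaction happens regardless of which logger emitted the record.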
## Metrics

### Collection Method

No HTTP `/metrics` endpoint — these are batch processes, not services. Metrics are collected via:

1. **Docker stats** — CPU and memory; GPU via `nvidia-smi`
2. **Training logs** — parsed from structured log output (epoch, loss, mAP)
3. **Filesystem monitoring** — disk usage of the `/azaion/` directory tree
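
Collection method 2 can be sketched as a small parser over the JSON log stream; `latest_training_metrics` is an illustrative name, not an existing script:

```python
import json


def latest_training_metrics(log_lines):
    """Return the most recent epoch/loss/mAP gauges seen in a structured log stream."""
    metrics = {}
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON lines (library banners, tracebacks)
        context = entry.get("context", {})
        for key in ("epoch", "loss", "mAP50", "mAP50_95"):
            if key in context:
                metrics[key] = context[key]
    return metrics
```

Because later lines overwrite earlier ones, feeding it a whole log file (e.g. `latest_training_metrics(open("training.log"))`) yields the current values of each gauge.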
### Key Metrics

| Metric | Type | Source | Description |
|--------|------|--------|-------------|
| `training_epoch` | Gauge | Training logs | Current epoch number |
| `training_loss` | Gauge | Training logs | Current training loss |
| `training_mAP50` | Gauge | Training logs | Mean average precision at IoU 0.50 |
| `training_mAP50_95` | Gauge | Training logs | mAP at IoU 0.50:0.95 |
| `gpu_utilization_pct` | Gauge | `nvidia-smi` | GPU compute utilization |
| `gpu_memory_used_mb` | Gauge | `nvidia-smi` | GPU memory usage |
| `gpu_temperature_c` | Gauge | `nvidia-smi` | GPU temperature |
| `disk_usage_azaion_gb` | Gauge | `df` / `du` | Total disk usage of `/azaion/` |
| `disk_usage_datasets_gb` | Gauge | `du` | Disk usage of `/azaion/datasets/` |
| `disk_usage_models_gb` | Gauge | `du` | Disk usage of `/azaion/models/` |
| `queue_messages_processed` | Counter | Queue logs | Total annotations processed |
| `queue_messages_failed` | Counter | Queue logs | Failed message processing |
| `queue_offset` | Gauge | `offset.yaml` | Last processed queue offset |
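
The GPU gauges above map directly onto `nvidia-smi`'s CSV query mode. A minimal collector might look like this (the function names are illustrative, and a single GPU is assumed):

```python
import subprocess


def parse_gpu_csv(line: str) -> dict:
    """Turn one CSV line from nvidia-smi into the gauge names used above."""
    util, mem, temp = (float(v) for v in line.strip().split(", "))
    return {
        "gpu_utilization_pct": util,
        "gpu_memory_used_mb": mem,
        "gpu_temperature_c": temp,
    }


def collect_gpu_metrics() -> dict:
    """Query the first GPU; raises CalledProcessError if nvidia-smi fails."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out.splitlines()[0])
```

Keeping the parsing separate from the subprocess call makes the CSV handling testable on machines without a GPU.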
### Monitoring Script

A `scripts/health-check.sh` script (created in Step 7) collects these metrics on demand:

- Checks Docker container status
- Reads `nvidia-smi` for GPU metrics
- Checks disk usage
- Reads the annotation queue offset
- Reports overall system health

Collection interval: on demand via the health check script, or via a cron job (every 5 minutes) for continuous monitoring.
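
The disk-usage check can also be done without shelling out to `du`; a sketch, where `dir_usage_gb` is an illustrative helper rather than part of `health-check.sh`:

```python
import os


def dir_usage_gb(path: str) -> float:
    """Recursive size of a directory tree in GB, roughly equivalent to `du -s`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):  # skip symlinks, as du does by default
                total += os.path.getsize(fp)
    return total / 1e9
```

Running it over `/azaion/datasets/` and `/azaion/models/` yields the `disk_usage_*_gb` gauges from the table above.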
## Distributed Tracing

Not applicable. The system consists of independent batch processes (training, annotation queue) that do not form request chains. No distributed tracing is needed.
## Alerting

| Severity | Condition | Response Time | Action |
|----------|-----------|---------------|--------|
| Critical | GPU temperature > 90°C | Immediate | Pause training, investigate cooling |
| Critical | Annotation queue process crashed | 5 min | Restart container, check logs |
| Critical | Disk usage > 95% | 5 min | Free space (old datasets/models), expand storage |
| High | Training loss NaN or diverging | 30 min | Check dataset, review hyperparameters |
| High | GPU memory OOM | 30 min | Reduce batch size, restart training |
| Medium | Disk usage > 80% | 4 hours | Plan cleanup of old datasets |
| Medium | Queue offset stale for > 1 hour | 4 hours | Check RabbitMQ connectivity |
| Low | Training checkpoint save failed | Next business day | Check disk space, retry |
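
The thresholds in the table can be evaluated mechanically over the collected gauges. A sketch, in which the severity strings, messages, and the `disk_usage_pct` input (assumed precomputed from `df` output) are all illustrative:

```python
import math


def classify_alerts(m: dict) -> list:
    """Map collected gauges to (severity, message) pairs per the thresholds above."""
    alerts = []
    if m.get("gpu_temperature_c", 0) > 90:
        alerts.append(("critical", "GPU temperature > 90C: pause training"))
    pct = m.get("disk_usage_pct", 0)
    if pct > 95:
        alerts.append(("critical", "disk usage > 95%: free space now"))
    elif pct > 80:
        alerts.append(("medium", "disk usage > 80%: plan cleanup"))
    loss = m.get("training_loss")
    if loss is not None and math.isnan(loss):
        alerts.append(("high", "training loss is NaN: check dataset/hyperparameters"))
    return alerts
```

A cron-driven wrapper would call this after each collection pass and write any non-empty result to the status file described below.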
### Notification Method

For a single-GPU-server deployment, practical alerting works via:

- **Cron-based health check** running `scripts/health-check.sh` every 5 minutes
- Critical/High alerts: write to a status file; optionally send an email or webhook notification
- Dashboard: a simple status page generated from the last health check output
## Dashboards

### Operations View

For a single-server deployment, a lightweight monitoring approach:

1. **GPU dashboard**: `nvidia-smi dmon` or `nvitop` running in a tmux session
2. **Training progress**: tail structured logs for epoch/loss/mAP progression
3. **Disk usage**: periodic `du -sh /azaion/*/` output
4. **Container status**: `docker ps` + `docker stats`
### Training Progress View

Key information to track during a training run:

- Current epoch / total epochs
- Training loss trend (decreasing = good)
- Validation mAP50 and mAP50-95 (increasing = good)
- GPU utilization and temperature
- Estimated time remaining
- Last checkpoint saved

YOLO's built-in TensorBoard integration provides this out of the box. Access via `tensorboard --logdir /azaion/models/azaion-YYYY-MM-DD/` on the training server.