# Azaion AI Training — Observability
This system is an ML training pipeline, not a web service. Observability focuses on training progress, GPU health, queue throughput, and disk usage rather than HTTP request metrics.
## Logging

### Format

Structured JSON to stdout/stderr. Containers should not write log files — use Docker's log driver for collection.
```json
{
  "timestamp": "2026-03-28T14:30:00Z",
  "level": "INFO",
  "service": "training",
  "message": "Epoch 45/120 completed",
  "context": {"epoch": 45, "loss": 0.0234, "mAP50": 0.891}
}
```
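A minimal sketch of a formatter producing this shape with Python's standard `logging` module. The `context` object rides along via `logging`'s `extra` mechanism; the `service` name is hard-coded here for brevity:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON in the format shown above."""
    converter = time.gmtime  # UTC timestamps, so the "Z" suffix is honest

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "training",
            "message": record.getMessage(),
        }
        # Structured fields arrive via logger.info(..., extra={"context": {...}})
        context = getattr(record, "context", None)
        if context:
            entry["context"] = context
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("training")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Epoch 45/120 completed",
            extra={"context": {"epoch": 45, "loss": 0.0234, "mAP50": 0.891}})
```

Because output goes to stdout, Docker's log driver picks it up without any in-container file handling.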
### Log Levels
| Level | Usage | Example |
|---|---|---|
| ERROR | Exceptions, unrecoverable failures | GPU out of memory, API auth failed, corrupt label file |
| WARN | Recoverable issues | Queue reconnection attempt, skipped corrupt image |
| INFO | Progress and business events | Epoch completed, dataset formed, model exported, annotation saved |
| DEBUG | Diagnostics (dev only) | Individual file processing, queue message contents |
### Current State

| Component | Current Logging | Target |
|---|---|---|
| Training Pipeline | `print()` statements | Python `logging` with JSON formatter to stdout |
| Annotation Queue | `logging` with `TimedRotatingFileHandler` | Keep existing + add JSON stdout for Docker |
| Inference Engine | `print()` statements | Not in deployment scope |
### Retention
| Environment | Destination | Retention |
|---|---|---|
| Development | Console (docker logs) | Session |
| Production | Docker JSON log driver → host filesystem | 30 days (log rotation via Docker daemon config) |
### PII Rules
- Never log API passwords or tokens
- Never log CDN credentials
- Never log model encryption keys
- Queue message image data (base64 bytes) must not be logged at INFO level
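One way to enforce the last rule mechanically is a `logging.Filter` that redacts bulky payloads before they reach any handler. A sketch, assuming the queue consumer attaches its payload under a hypothetical `context["image_data"]` key (use whatever key the consumer actually sets):

```python
import logging

class RedactImageDataFilter(logging.Filter):
    """Redact base64 image payloads from log context at INFO and above.

    The "image_data" key name is illustrative, not a fixed contract.
    """
    def filter(self, record):
        context = getattr(record, "context", None)
        if isinstance(context, dict) and record.levelno >= logging.INFO:
            if "image_data" in context:
                context = dict(context)  # avoid mutating the caller's dict
                context["image_data"] = "<redacted>"
                record.context = context
        return True  # keep the record; only the payload is stripped
```

Attach it with `handler.addFilter(RedactImageDataFilter())` so DEBUG-only runs can still inspect payloads while INFO logs stay clean.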
## Metrics

### Collection Method
No HTTP `/metrics` endpoint — these are batch processes, not services. Metrics are collected via:

- Docker stats — CPU and memory; GPU via `nvidia-smi`
- Training logs — parsed from structured log output (epoch, loss, mAP)
- Filesystem monitoring — disk usage of the `/azaion/` directory tree
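GPU collection is scriptable through `nvidia-smi`'s query interface. A sketch using its CSV output mode; parsing is separated from the subprocess call so it can be tested without a GPU present:

```python
import subprocess

def parse_gpu_line(line):
    """Parse one CSV line of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    util, mem, temp = (part.strip() for part in line.split(","))
    return {
        "gpu_utilization_pct": float(util),
        "gpu_memory_used_mb": float(mem),
        "gpu_temperature_c": float(temp),
    }

def collect_gpu_metrics():
    """Return one metrics dict per GPU on the host."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

The metric names match the Key Metrics table below; `--format=csv,noheader,nounits` keeps the output trivially parseable.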
### Key Metrics

| Metric | Type | Source | Description |
|---|---|---|---|
| `training_epoch` | Gauge | Training logs | Current epoch number |
| `training_loss` | Gauge | Training logs | Current training loss |
| `training_mAP50` | Gauge | Training logs | Mean average precision at IoU 0.50 |
| `training_mAP50_95` | Gauge | Training logs | mAP at IoU 0.50:0.95 |
| `gpu_utilization_pct` | Gauge | `nvidia-smi` | GPU compute utilization |
| `gpu_memory_used_mb` | Gauge | `nvidia-smi` | GPU memory usage |
| `gpu_temperature_c` | Gauge | `nvidia-smi` | GPU temperature |
| `disk_usage_azaion_gb` | Gauge | `df` / `du` | Total disk usage of `/azaion/` |
| `disk_usage_datasets_gb` | Gauge | `du` | Disk usage of `/azaion/datasets/` |
| `disk_usage_models_gb` | Gauge | `du` | Disk usage of `/azaion/models/` |
| `queue_messages_processed` | Counter | Queue logs | Total annotations processed |
| `queue_messages_failed` | Counter | Queue logs | Failed message processing |
| `queue_offset` | Gauge | `offset.yaml` | Last processed queue offset |
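Since the training metrics are parsed from structured logs rather than scraped, a small extractor can map log context keys to the gauge names above. A sketch, assuming the JSON log format from the Logging section; unknown or non-JSON lines are skipped:

```python
import json

# Maps context keys in the structured logs to the gauge names above.
GAUGES = {
    "epoch": "training_epoch",
    "loss": "training_loss",
    "mAP50": "training_mAP50",
    "mAP50_95": "training_mAP50_95",
}

def metrics_from_log_line(line):
    """Extract gauge values from one structured training log line."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return {}  # not a structured line (e.g. raw framework output)
    context = entry.get("context") or {}
    return {GAUGES[k]: v for k, v in context.items() if k in GAUGES}
```

Feeding each stdout line through this function yields the latest gauge values without touching the training process itself.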
### Monitoring Script

A `scripts/health-check.sh` script (created in Step 7) collects these metrics on demand:

- Checks Docker container status
- Reads `nvidia-smi` for GPU metrics
- Checks disk usage
- Reads the annotation queue offset
- Reports overall system health
Collection interval: on-demand via health check script, or via cron job (every 5 minutes) for continuous monitoring.
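The disk-usage portion of the health check can come straight from the standard library. A sketch using `shutil.disk_usage`, which reports filesystem-level totals; the per-directory gauges in the table (`disk_usage_datasets_gb`, `disk_usage_models_gb`) would still need `du` or a directory walk:

```python
import shutil

def disk_usage_gb(path="/azaion"):
    """Return (used_gb, pct_used) for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    used_gb = usage.used / 1024**3
    pct_used = 100 * usage.used / usage.total
    return round(used_gb, 1), round(pct_used, 1)
```

The percentage value is what the disk-usage alert thresholds below operate on.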
## Distributed Tracing
Not applicable. The system consists of independent batch processes (training, annotation queue) that do not form request chains. No distributed tracing is needed.
## Alerting
| Severity | Condition | Response Time | Action |
|---|---|---|---|
| Critical | GPU temperature > 90°C | Immediate | Pause training, investigate cooling |
| Critical | Annotation queue process crashed | 5 min | Restart container, check logs |
| Critical | Disk usage > 95% | 5 min | Free space (old datasets/models), expand storage |
| High | Training loss NaN or diverging | 30 min | Check dataset, review hyperparameters |
| High | GPU memory OOM | 30 min | Reduce batch size, restart training |
| Medium | Disk usage > 80% | 4 hours | Plan cleanup of old datasets |
| Medium | Queue offset stale for > 1 hour | 4 hours | Check RabbitMQ connectivity |
| Low | Training checkpoint save failed | Next business day | Check disk space, retry |
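The threshold rules above can be encoded as a small evaluation step inside the health check. A sketch over a flat metrics dict; `disk_usage_pct` is a derived value (used/total from the filesystem check), not one of the named gauges, and only a few of the table's rules are shown:

```python
def evaluate_alerts(metrics):
    """Return (severity, message) pairs per the alerting table.

    `metrics` holds gauge values by name; missing keys are skipped.
    """
    alerts = []
    temp = metrics.get("gpu_temperature_c")
    if temp is not None and temp > 90:
        alerts.append(("critical", f"GPU temperature {temp}C exceeds 90C"))
    pct = metrics.get("disk_usage_pct")
    if pct is not None:
        if pct > 95:
            alerts.append(("critical", f"disk usage {pct}% exceeds 95%"))
        elif pct > 80:
            alerts.append(("medium", f"disk usage {pct}% exceeds 80%"))
    loss = metrics.get("training_loss")
    if loss is not None and loss != loss:  # NaN is the only value != itself
        alerts.append(("high", "training loss is NaN"))
    return alerts
```

Running this after each collection pass gives the cron job a concrete list to write into the status file.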
### Notification Method

For a single GPU server deployment, practical alerting options are:

- Cron-based health check running `scripts/health-check.sh` every 5 minutes
- Critical/High alerts: write to a status file, optionally send an email or webhook notification
- Dashboard: a simple status page generated from the last health check output
## Dashboards

### Operations View

For a single-server deployment, a lightweight monitoring approach:

- GPU dashboard: `nvidia-smi dmon` or `nvitop` running in a tmux session
- Training progress: tail structured logs for epoch/loss/mAP progression
- Disk usage: periodic `du -sh /azaion/*` output
- Container status: `docker ps` + `docker stats`
### Training Progress View
Key information to track during a training run:
- Current epoch / total epochs
- Training loss trend (decreasing = good)
- Validation mAP50 and mAP50-95 (increasing = good)
- GPU utilization and temperature
- Estimated time remaining
- Last checkpoint saved
YOLO's built-in TensorBoard integration provides this out of the box. Access it via `tensorboard --logdir /azaion/models/azaion-YYYY-MM-DD/` on the training server.