
Azaion AI Training — Observability

This system is an ML training pipeline, not a web service. Observability focuses on training progress, GPU health, queue throughput, and disk usage rather than HTTP request metrics.

Logging

Format

Structured JSON to stdout/stderr. Containers should not write log files — use Docker's log driver for collection.

```json
{
  "timestamp": "2026-03-28T14:30:00Z",
  "level": "INFO",
  "service": "training",
  "message": "Epoch 45/120 completed",
  "context": {"epoch": 45, "loss": 0.0234, "mAP50": 0.891}
}
```
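A minimal Python formatter producing this shape might look like the sketch below. The `context` attribute convention (attached via `logging`'s `extra` mechanism) and the hard-coded service name are illustrative assumptions, not the pipeline's actual implementation.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line matching the schema above."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "training",  # assumed; one name per container
            "message": record.getMessage(),
            # "context" is set only when the caller passes extra={"context": ...}
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)  # stdout, per the rule above
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("training")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured fields travel through the `extra` mechanism:
logger.info("Epoch 45/120 completed",
            extra={"context": {"epoch": 45, "loss": 0.0234, "mAP50": 0.891}})
```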

Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, unrecoverable failures | GPU out of memory, API auth failed, corrupt label file |
| WARN | Recoverable issues | Queue reconnection attempt, skipped corrupt image |
| INFO | Progress and business events | Epoch completed, dataset formed, model exported, annotation saved |
| DEBUG | Diagnostics (dev only) | Individual file processing, queue message contents |

Current State

| Component | Current Logging | Target |
|-----------|-----------------|--------|
| Training Pipeline | print() statements | Python logging with JSON formatter to stdout |
| Annotation Queue | logging with TimedRotatingFileHandler | Keep existing + add JSON stdout for Docker |
| Inference Engine | print() statements | Not in deployment scope |

Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console (docker logs) | Session |
| Production | Docker JSON log driver → host filesystem | 30 days (log rotation via Docker daemon config) |
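Production retention is enforced through the Docker daemon's json-file driver options. Note that the driver rotates by size rather than by age, so the limits below (illustrative values, not the deployed configuration) should be sized to cover roughly 30 days of output; they go in /etc/docker/daemon.json:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "30"
  }
}
```

Daemon-level defaults apply only to containers created after the daemon restart; existing containers keep the options they were created with.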

PII Rules

  • Never log API passwords or tokens
  • Never log CDN credentials
  • Never log model encryption keys
  • Queue message image data (base64 bytes) must not be logged at INFO level

Metrics

Collection Method

No HTTP /metrics endpoint — these are batch processes, not services. Metrics are collected via:

  1. Docker stats — CPU, memory, GPU via nvidia-smi
  2. Training logs — parsed from structured log output (epoch, loss, mAP)
  3. Filesystem monitoring — disk usage of /azaion/ directory tree
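Items 1 and 3 above can be collected with a few subprocess calls. A sketch under stated assumptions: the function names and GiB units are my own, `nvidia-smi` must be on PATH, and `du -sb` assumes GNU coreutils.

```python
import subprocess

def parse_gpu_csv(csv_line: str) -> dict:
    """Parse one line of `nvidia-smi --format=csv,noheader,nounits` output."""
    util, mem, temp = (int(v.strip()) for v in csv_line.strip().split(","))
    return {"gpu_utilization_pct": util,
            "gpu_memory_used_mb": mem,
            "gpu_temperature_c": temp}

def gpu_metrics() -> dict:
    """Query the first GPU for utilization, memory, and temperature."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,temperature.gpu",
         "--format=csv,noheader,nounits"],
        text=True)
    return parse_gpu_csv(out.splitlines()[0])

def dir_usage_gb(path: str) -> float:
    """Directory-tree usage in GiB, equivalent to `du -sb <path>`."""
    out = subprocess.check_output(["du", "-sb", path], text=True)
    return int(out.split()[0]) / 2**30
```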

Key Metrics

| Metric | Type | Source | Description |
|--------|------|--------|-------------|
| training_epoch | Gauge | Training logs | Current epoch number |
| training_loss | Gauge | Training logs | Current training loss |
| training_mAP50 | Gauge | Training logs | Mean average precision at IoU 0.50 |
| training_mAP50_95 | Gauge | Training logs | mAP at IoU 0.50:0.95 |
| gpu_utilization_pct | Gauge | nvidia-smi | GPU compute utilization |
| gpu_memory_used_mb | Gauge | nvidia-smi | GPU memory usage |
| gpu_temperature_c | Gauge | nvidia-smi | GPU temperature |
| disk_usage_azaion_gb | Gauge | df / du | Total disk usage of /azaion/ |
| disk_usage_datasets_gb | Gauge | du | Disk usage of /azaion/datasets/ |
| disk_usage_models_gb | Gauge | du | Disk usage of /azaion/models/ |
| queue_messages_processed | Counter | Queue logs | Total annotations processed |
| queue_messages_failed | Counter | Queue logs | Failed message processing |
| queue_offset | Gauge | offset.yaml | Last processed queue offset |
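The log-sourced gauges above can be recovered by scanning the structured log stream. An illustrative sketch (the `latest_training_metrics` helper is not part of the codebase), assuming one JSON object per line with the schema from the Logging section:

```python
import json

def latest_training_metrics(log_lines) -> dict:
    """Return the most recent value seen for each training gauge."""
    metrics = {}
    for line in log_lines:
        try:
            ctx = json.loads(line).get("context", {})
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. third-party library noise)
        for key, gauge in (("epoch", "training_epoch"),
                           ("loss", "training_loss"),
                           ("mAP50", "training_mAP50"),
                           ("mAP50_95", "training_mAP50_95")):
            if key in ctx:
                metrics[gauge] = ctx[key]  # later lines overwrite earlier ones
    return metrics
```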

Monitoring Script

A scripts/health-check.sh script (created in Step 7) collects these metrics on demand:

  • Checks Docker container status
  • Reads nvidia-smi for GPU metrics
  • Checks disk usage
  • Reads annotation queue offset
  • Reports overall system health

Collection interval: on-demand via health check script, or via cron job (every 5 minutes) for continuous monitoring.

Distributed Tracing

Not applicable. The system consists of independent batch processes (training, annotation queue) that do not form request chains. No distributed tracing is needed.

Alerting

| Severity | Condition | Response Time | Action |
|----------|-----------|---------------|--------|
| Critical | GPU temperature > 90°C | Immediate | Pause training, investigate cooling |
| Critical | Annotation queue process crashed | 5 min | Restart container, check logs |
| Critical | Disk usage > 95% | 5 min | Free space (old datasets/models), expand storage |
| High | Training loss NaN or diverging | 30 min | Check dataset, review hyperparameters |
| High | GPU memory OOM | 30 min | Reduce batch size, restart training |
| Medium | Disk usage > 80% | 4 hours | Plan cleanup of old datasets |
| Medium | Queue offset stale for > 1 hour | 4 hours | Check RabbitMQ connectivity |
| Low | Training checkpoint save failed | Next business day | Check disk space, retry |
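Several of the conditions above translate directly into threshold checks over a metrics snapshot. A hedged sketch: the metric keys (`gpu_temperature_c`, `disk_usage_pct`, `queue_offset_age_s`, `training_loss`) are assumed names, not the health-check script's actual output format.

```python
import math

def evaluate_alerts(m: dict) -> list:
    """Map a metrics snapshot to (severity, message) pairs per the table above."""
    alerts = []
    if m.get("gpu_temperature_c", 0) > 90:
        alerts.append(("critical", "GPU temperature above 90C: pause training"))
    if m.get("disk_usage_pct", 0) > 95:
        alerts.append(("critical", "Disk usage above 95%: free space now"))
    elif m.get("disk_usage_pct", 0) > 80:
        alerts.append(("medium", "Disk usage above 80%: plan cleanup"))
    loss = m.get("training_loss")
    if loss is not None and math.isnan(loss):
        alerts.append(("high", "Training loss is NaN: check dataset and hyperparameters"))
    if m.get("queue_offset_age_s", 0) > 3600:
        alerts.append(("medium", "Queue offset stale over 1h: check RabbitMQ connectivity"))
    return alerts
```

A wrapper run from cron could write these pairs to the status file and fire the email or webhook notification for critical/high entries.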

Notification Method

For a single-server deployment, practical alerting options are:

  • Cron-based health check running scripts/health-check.sh every 5 minutes
  • Critical/High alerts: write to a status file, optionally send email or webhook notification
  • Dashboard: a simple status page generated from the last health check output

Dashboards

Operations View

For a single-server deployment, a lightweight monitoring approach:

  1. GPU dashboard: nvidia-smi dmon or nvitop running in a tmux session
  2. Training progress: tail structured logs for epoch/loss/mAP progression
  3. Disk usage: periodic du -sh /azaion/*/ output
  4. Container status: docker ps + docker stats

Training Progress View

Key information to track during a training run:

  • Current epoch / total epochs
  • Training loss trend (decreasing = good)
  • Validation mAP50 and mAP50-95 (increasing = good)
  • GPU utilization and temperature
  • Estimated time remaining
  • Last checkpoint saved

YOLO's built-in TensorBoard integration provides this out of the box. Access it via `tensorboard --logdir /azaion/models/azaion-YYYY-MM-DD/` on the training server.