# Azaion AI Training — Observability

This system is an ML training pipeline, not a web service. Observability focuses on training progress, GPU health, queue throughput, and disk usage rather than HTTP request metrics.

## Logging

### Format

Structured JSON to stdout/stderr. Containers should not write log files — use Docker's log driver for collection.

```json
{
  "timestamp": "2026-03-28T14:30:00Z",
  "level": "INFO",
  "service": "training",
  "message": "Epoch 45/120 completed",
  "context": {"epoch": 45, "loss": 0.0234, "mAP50": 0.891}
}
```

### Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, unrecoverable failures | GPU out of memory, API auth failed, corrupt label file |
| WARN | Recoverable issues | Queue reconnection attempt, skipped corrupt image |
| INFO | Progress and business events | Epoch completed, dataset formed, model exported, annotation saved |
| DEBUG | Diagnostics (dev only) | Individual file processing, queue message contents |

### Current State

| Component | Current Logging | Target |
|-----------|----------------|--------|
| Training Pipeline | `print()` statements | Python `logging` with JSON formatter to stdout |
| Annotation Queue | `logging` with TimedRotatingFileHandler | Keep existing + add JSON stdout for Docker |
| Inference Engine | `print()` statements | Not in deployment scope |

### Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console (`docker logs`) | Session |
| Production | Docker JSON log driver → host filesystem | 30 days (log rotation via Docker daemon config) |

### PII Rules

- Never log API passwords or tokens
- Never log CDN credentials
- Never log model encryption keys
- Queue message image data (base64 bytes) must not be logged at INFO level

## Metrics

### Collection Method

No HTTP `/metrics` endpoint — these are batch processes, not services. Metrics are collected via:

1. **Docker stats** — CPU, memory; GPU via `nvidia-smi`
2. **Training logs** — parsed from structured log output (epoch, loss, mAP)
3. **Filesystem monitoring** — disk usage of the `/azaion/` directory tree

### Key Metrics

| Metric | Type | Source | Description |
|--------|------|--------|-------------|
| `training_epoch` | Gauge | Training logs | Current epoch number |
| `training_loss` | Gauge | Training logs | Current training loss |
| `training_mAP50` | Gauge | Training logs | Mean average precision at IoU 0.50 |
| `training_mAP50_95` | Gauge | Training logs | mAP at IoU 0.50:0.95 |
| `gpu_utilization_pct` | Gauge | `nvidia-smi` | GPU compute utilization |
| `gpu_memory_used_mb` | Gauge | `nvidia-smi` | GPU memory usage |
| `gpu_temperature_c` | Gauge | `nvidia-smi` | GPU temperature |
| `disk_usage_azaion_gb` | Gauge | `df` / `du` | Total disk usage of `/azaion/` |
| `disk_usage_datasets_gb` | Gauge | `du` | Disk usage of `/azaion/datasets/` |
| `disk_usage_models_gb` | Gauge | `du` | Disk usage of `/azaion/models/` |
| `queue_messages_processed` | Counter | Queue logs | Total annotations processed |
| `queue_messages_failed` | Counter | Queue logs | Failed message processing |
| `queue_offset` | Gauge | `offset.yaml` | Last processed queue offset |

### Monitoring Script

A `scripts/health-check.sh` script (created in Step 7) collects these metrics on demand:

- Checks Docker container status
- Reads `nvidia-smi` for GPU metrics
- Checks disk usage
- Reads the annotation queue offset
- Reports overall system health

Collection interval: on demand via the health check script, or via a cron job (every 5 minutes) for continuous monitoring.

## Distributed Tracing

Not applicable. The system consists of independent batch processes (training, annotation queue) that do not form request chains, so no distributed tracing is needed.
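The log-derived gauges above come from parsing structured log lines. A minimal sketch of such a parser, assuming log lines follow the JSON format shown in the Logging section; the function name `parse_training_metrics` and the field-to-gauge mapping are illustrative, not part of the actual pipeline:

```python
import json

# Map context fields in a training log line to the gauge names
# from the Key Metrics table (illustrative subset).
TRACKED = {"epoch": "training_epoch", "loss": "training_loss", "mAP50": "training_mAP50"}

def parse_training_metrics(line: str) -> dict:
    """Extract gauge values from one structured JSON log line.

    Returns an empty dict for lines that are not valid JSON or
    that carry no tracked context fields.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return {}
    context = record.get("context", {})
    return {gauge: context[key] for key, gauge in TRACKED.items() if key in context}

# Example: the sample log line from the Logging section yields three gauges.
line = ('{"timestamp": "2026-03-28T14:30:00Z", "level": "INFO", '
        '"service": "training", "message": "Epoch 45/120 completed", '
        '"context": {"epoch": 45, "loss": 0.0234, "mAP50": 0.891}}')
print(parse_training_metrics(line))
# → {'training_epoch': 45, 'training_loss': 0.0234, 'training_mAP50': 0.891}
```

Tolerating non-JSON lines matters here because stdout may interleave library output (e.g. YOLO progress bars) with the structured records.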
## Alerting

| Severity | Condition | Response Time | Action |
|----------|-----------|---------------|--------|
| Critical | GPU temperature > 90°C | Immediate | Pause training, investigate cooling |
| Critical | Annotation queue process crashed | 5 min | Restart container, check logs |
| Critical | Disk usage > 95% | 5 min | Free space (old datasets/models), expand storage |
| High | Training loss NaN or diverging | 30 min | Check dataset, review hyperparameters |
| High | GPU memory OOM | 30 min | Reduce batch size, restart training |
| Medium | Disk usage > 80% | 4 hours | Plan cleanup of old datasets |
| Medium | Queue offset stale for > 1 hour | 4 hours | Check RabbitMQ connectivity |
| Low | Training checkpoint save failed | Next business day | Check disk space, retry |

### Notification Method

For a single-GPU-server deployment, practical alerting options are:

- **Cron-based health check** running `scripts/health-check.sh` every 5 minutes
- Critical/High alerts: write to a status file, optionally send an email or webhook notification
- Dashboard: a simple status page generated from the last health check output

## Dashboards

### Operations View

For a single-server deployment, a lightweight monitoring approach:

1. **GPU dashboard**: `nvidia-smi dmon` or `nvitop` running in a tmux session
2. **Training progress**: tail structured logs for epoch/loss/mAP progression
3. **Disk usage**: periodic `du -sh /azaion/*/` output
4. **Container status**: `docker ps` + `docker stats`

### Training Progress View

Key information to track during a training run:

- Current epoch / total epochs
- Training loss trend (decreasing = good)
- Validation mAP50 and mAP50-95 (increasing = good)
- GPU utilization and temperature
- Estimated time remaining
- Last checkpoint saved

YOLO's built-in TensorBoard integration provides this out of the box. Access via `tensorboard --logdir /azaion/models/azaion-YYYY-MM-DD/` on the training server.
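The alert thresholds in the table above can be checked mechanically against collected metrics. A minimal sketch under stated assumptions: the function name `evaluate_alerts`, the metrics dict shape, and the staleness field are all illustrative, not part of `scripts/health-check.sh`:

```python
def evaluate_alerts(metrics: dict) -> list:
    """Return (severity, condition) pairs for each breached threshold.

    Thresholds mirror the Alerting table; only the numerically
    checkable conditions are sketched here.
    """
    alerts = []
    if metrics.get("gpu_temperature_c", 0) > 90:
        alerts.append(("Critical", "GPU temperature > 90°C"))
    # Disk thresholds are tiered: the Critical check suppresses Medium.
    if metrics.get("disk_usage_pct", 0) > 95:
        alerts.append(("Critical", "Disk usage > 95%"))
    elif metrics.get("disk_usage_pct", 0) > 80:
        alerts.append(("Medium", "Disk usage > 80%"))
    if metrics.get("queue_offset_stale_minutes", 0) > 60:
        alerts.append(("Medium", "Queue offset stale for > 1 hour"))
    return alerts

# Example: an overheating GPU plus a filling disk raises two alerts.
print(evaluate_alerts({"gpu_temperature_c": 92, "disk_usage_pct": 84}))
# → [('Critical', 'GPU temperature > 90°C'), ('Medium', 'Disk usage > 80%')]
```

Writing the returned pairs to a status file, and emailing only when the list contains a Critical or High entry, matches the notification method described above.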