
Azaion AI Training — Observability

This system is an ML training pipeline, not a web service. Observability focuses on training progress, GPU health, queue throughput, and disk usage rather than HTTP request metrics.

Logging

Format

Structured JSON to stdout/stderr. Containers should not write log files — use Docker's log driver for collection.

```json
{
  "timestamp": "2026-03-28T14:30:00Z",
  "level": "INFO",
  "service": "training",
  "message": "Epoch 45/120 completed",
  "context": {"epoch": 45, "loss": 0.0234, "mAP50": 0.891}
}
```
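A minimal Python formatter producing this shape might look like the sketch below. The `context` attribute convention (attached via `logging`'s `extra` mechanism) and the hard-coded service name are illustrative assumptions, not the pipeline's actual implementation.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line matching the schema above."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "training",  # assumed; one name per container
            "message": record.getMessage(),
            # "context" is set only when the caller passes extra={"context": ...}
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)  # stdout, per the rule above
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("training")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured fields travel through the `extra` mechanism:
logger.info("Epoch 45/120 completed",
            extra={"context": {"epoch": 45, "loss": 0.0234, "mAP50": 0.891}})
```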

Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| ERROR | Exceptions, unrecoverable failures | GPU out of memory, API auth failed, corrupt label file |
| WARN | Recoverable issues | Queue reconnection attempt, skipped corrupt image |
| INFO | Progress and business events | Epoch completed, dataset formed, model exported, annotation saved |
| DEBUG | Diagnostics (dev only) | Individual file processing, queue message contents |

Current State

| Component | Current Logging | Target |
|-----------|-----------------|--------|
| Training Pipeline | print() statements | Python logging with JSON formatter to stdout |
| Annotation Queue | logging with TimedRotatingFileHandler | Keep existing + add JSON stdout for Docker |
| Inference Engine | print() statements | Not in deployment scope |

Retention

| Environment | Destination | Retention |
|-------------|-------------|-----------|
| Development | Console (docker logs) | Session |
| Production | Docker JSON log driver → host filesystem | 30 days (log rotation via Docker daemon config) |
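Production retention is enforced through the Docker daemon's json-file driver options. Note that the driver rotates by size rather than by age, so the limits below (illustrative values, not the deployed configuration) should be sized to cover roughly 30 days of output; they go in /etc/docker/daemon.json:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "30"
  }
}
```

Daemon-level defaults apply only to containers created after the daemon restart; existing containers keep the options they were created with.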

PII Rules

  • Never log API passwords or tokens
  • Never log CDN credentials
  • Never log model encryption keys
  • Queue message image data (base64 bytes) must not be logged at INFO level

Metrics

Collection Method

No HTTP /metrics endpoint — these are batch processes, not services. Metrics are collected via:

  1. Docker stats — CPU, memory, GPU via nvidia-smi
  2. Training logs — parsed from structured log output (epoch, loss, mAP)
  3. Filesystem monitoring — disk usage of /azaion/ directory tree
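Items 1 and 3 above can be collected with a few subprocess calls. A sketch under stated assumptions: the function names and GiB units are my own, `nvidia-smi` must be on PATH, and `du -sb` assumes GNU coreutils.

```python
import subprocess

def parse_gpu_csv(csv_line: str) -> dict:
    """Parse one line of `nvidia-smi --format=csv,noheader,nounits` output."""
    util, mem, temp = (int(v.strip()) for v in csv_line.strip().split(","))
    return {"gpu_utilization_pct": util,
            "gpu_memory_used_mb": mem,
            "gpu_temperature_c": temp}

def gpu_metrics() -> dict:
    """Query the first GPU for utilization, memory, and temperature."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,temperature.gpu",
         "--format=csv,noheader,nounits"],
        text=True)
    return parse_gpu_csv(out.splitlines()[0])

def dir_usage_gb(path: str) -> float:
    """Directory-tree usage in GiB, equivalent to `du -sb <path>`."""
    out = subprocess.check_output(["du", "-sb", path], text=True)
    return int(out.split()[0]) / 2**30
```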

Key Metrics

| Metric | Type | Source | Description |
|--------|------|--------|-------------|
| training_epoch | Gauge | Training logs | Current epoch number |
| training_loss | Gauge | Training logs | Current training loss |
| training_mAP50 | Gauge | Training logs | Mean average precision at IoU 0.50 |
| training_mAP50_95 | Gauge | Training logs | mAP at IoU 0.50:0.95 |
| gpu_utilization_pct | Gauge | nvidia-smi | GPU compute utilization |
| gpu_memory_used_mb | Gauge | nvidia-smi | GPU memory usage |
| gpu_temperature_c | Gauge | nvidia-smi | GPU temperature |
| disk_usage_azaion_gb | Gauge | df / du | Total disk usage of /azaion/ |
| disk_usage_datasets_gb | Gauge | du | Disk usage of /azaion/datasets/ |
| disk_usage_models_gb | Gauge | du | Disk usage of /azaion/models/ |
| queue_messages_processed | Counter | Queue logs | Total annotations processed |
| queue_messages_failed | Counter | Queue logs | Failed message processing |
| queue_offset | Gauge | offset.yaml | Last processed queue offset |
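The log-sourced gauges above can be recovered by scanning the structured log stream. An illustrative sketch (the `latest_training_metrics` helper is not part of the codebase), assuming one JSON object per line with the schema from the Logging section:

```python
import json

def latest_training_metrics(log_lines) -> dict:
    """Return the most recent value seen for each training gauge."""
    metrics = {}
    for line in log_lines:
        try:
            ctx = json.loads(line).get("context", {})
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. third-party library noise)
        for key, gauge in (("epoch", "training_epoch"),
                           ("loss", "training_loss"),
                           ("mAP50", "training_mAP50"),
                           ("mAP50_95", "training_mAP50_95")):
            if key in ctx:
                metrics[gauge] = ctx[key]  # later lines overwrite earlier ones
    return metrics
```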

Monitoring Script

A scripts/health-check.sh script (created in Step 7) collects these metrics on demand:

  • Checks Docker container status
  • Reads nvidia-smi for GPU metrics
  • Checks disk usage
  • Reads annotation queue offset
  • Reports overall system health

Collection interval: on-demand via health check script, or via cron job (every 5 minutes) for continuous monitoring.

Distributed Tracing

Not applicable. The system consists of independent batch processes (training, annotation queue) that do not form request chains. No distributed tracing is needed.

Alerting

| Severity | Condition | Response Time | Action |
|----------|-----------|---------------|--------|
| Critical | GPU temperature > 90°C | Immediate | Pause training, investigate cooling |
| Critical | Annotation queue process crashed | 5 min | Restart container, check logs |
| Critical | Disk usage > 95% | 5 min | Free space (old datasets/models), expand storage |
| High | Training loss NaN or diverging | 30 min | Check dataset, review hyperparameters |
| High | GPU memory OOM | 30 min | Reduce batch size, restart training |
| Medium | Disk usage > 80% | 4 hours | Plan cleanup of old datasets |
| Medium | Queue offset stale for > 1 hour | 4 hours | Check RabbitMQ connectivity |
| Low | Training checkpoint save failed | Next business day | Check disk space, retry |
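Several of the conditions above translate directly into threshold checks over a metrics snapshot. A hedged sketch: the metric keys (`gpu_temperature_c`, `disk_usage_pct`, `queue_offset_age_s`, `training_loss`) are assumed names, not the health-check script's actual output format.

```python
import math

def evaluate_alerts(m: dict) -> list:
    """Map a metrics snapshot to (severity, message) pairs per the table above."""
    alerts = []
    if m.get("gpu_temperature_c", 0) > 90:
        alerts.append(("critical", "GPU temperature above 90C: pause training"))
    if m.get("disk_usage_pct", 0) > 95:
        alerts.append(("critical", "Disk usage above 95%: free space now"))
    elif m.get("disk_usage_pct", 0) > 80:
        alerts.append(("medium", "Disk usage above 80%: plan cleanup"))
    loss = m.get("training_loss")
    if loss is not None and math.isnan(loss):
        alerts.append(("high", "Training loss is NaN: check dataset and hyperparameters"))
    if m.get("queue_offset_age_s", 0) > 3600:
        alerts.append(("medium", "Queue offset stale over 1h: check RabbitMQ connectivity"))
    return alerts
```

A wrapper run from cron could write these pairs to the status file and fire the email or webhook notification for critical/high entries.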

Notification Method

For a single-server deployment, practical alerting options are:

  • Cron-based health check running scripts/health-check.sh every 5 minutes
  • Critical/High alerts: write to a status file, optionally send email or webhook notification
  • Dashboard: a simple status page generated from the last health check output

Dashboards

Operations View

For a single-server deployment, a lightweight monitoring approach:

  1. GPU dashboard: nvidia-smi dmon or nvitop running in a tmux session
  2. Training progress: tail structured logs for epoch/loss/mAP progression
  3. Disk usage: periodic du -sh /azaion/*/ output
  4. Container status: docker ps + docker stats

Training Progress View

Key information to track during a training run:

  • Current epoch / total epochs
  • Training loss trend (decreasing = good)
  • Validation mAP50 and mAP50-95 (increasing = good)
  • GPU utilization and temperature
  • Estimated time remaining
  • Last checkpoint saved

YOLO's built-in TensorBoard integration provides this out of the box. Access it via `tensorboard --logdir /azaion/models/azaion-YYYY-MM-DD/` on the training server.