An admin interface to list current unfinished jobs and finished jobs of a certain age.
- You can stop unfinished jobs
- You can show details of old jobs
- You can lock the server to stop it from spawning new jobs (e.g. for maintenance)
- Galaxy logs (journalctl -f -u galaxy)
- Web (uWSGI)
- nginx logs (
Can we make better walltime decisions?
scripts/runtime_stats.py: Database-driven job runtime statistics
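One way to turn runtime statistics into a walltime decision is to take a high percentile of past runtimes and pad it. The sketch below is only an illustration: it does not call or reproduce scripts/runtime_stats.py, and the function name `suggest_walltime`, its parameters, and the sample runtimes are all hypothetical.

```python
import math


def suggest_walltime(runtimes_s, percentile=95.0, safety_factor=1.5):
    """Suggest a walltime in seconds from past job runtimes (hypothetical helper).

    Uses the nearest-rank percentile of the samples, then multiplies by a
    safety factor so that slightly slower runs are not killed.
    """
    if not runtimes_s:
        raise ValueError("need at least one runtime sample")
    ordered = sorted(runtimes_s)
    # Nearest-rank percentile: the smallest sample such that at least
    # `percentile` percent of all samples are less than or equal to it.
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    return math.ceil(ordered[rank - 1] * safety_factor)


# Made-up runtimes (seconds) for a single tool, including one outlier.
samples = [120, 135, 150, 160, 180, 200, 240, 300, 420, 3600]
print(suggest_walltime(samples))
```

With few samples a single outlier dominates the high percentiles, so a real decision would probably need more data per tool and possibly per input size.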
Galaxy ships with its own app that reports usage (numbers of users, jobs, data, etc.)
Nagios is a general-purpose tool for monitoring systems and services.
Galaxy-specific check in contrib/nagios/: Runs Galaxy jobs
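Nagios checks report their result through the plugin exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. The sketch below shows how the outcome of a test job could be mapped onto those codes. It is not the code in contrib/nagios/; the function name, its parameters, and the thresholds are assumptions made for illustration.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_job_check(job_succeeded, elapsed_s, warn_s=60.0, crit_s=300.0):
    """Map a test job outcome to (exit_code, status_line). Hypothetical helper.

    job_succeeded: True, False, or None when the job could not be submitted.
    elapsed_s: seconds the test job took from submission to completion.
    """
    if job_succeeded is None:
        return UNKNOWN, "UNKNOWN - could not submit test job"
    if not job_succeeded:
        return CRITICAL, "CRITICAL - test job failed"
    if elapsed_s >= crit_s:
        return CRITICAL, f"CRITICAL - test job took {elapsed_s:.0f}s"
    if elapsed_s >= warn_s:
        return WARNING, f"WARNING - test job took {elapsed_s:.0f}s"
    return OK, f"OK - test job finished in {elapsed_s:.0f}s"


code, message = classify_job_check(True, 42.0)
print(code, message)
# A real check would print the message and then call sys.exit(code).
```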
- Motto: “Stop hoping your users will report errors”
- Error tracking and analysing tool.
- Galaxy has Sentry middleware that you can enable in configuration.
Galaxy can collect metrics on each job through configurable plugins in
core: Captures Galaxy slots, start and end of job, runtime
cpuinfo: processor count for each job
env: dump environment for each job
collectl: monitor a wide array of system performance data
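To give a feel for the kind of data the core, cpuinfo, and env plugins record, here is a minimal sketch that gathers similar values around a function call. It is not Galaxy's plugin API: `collect_job_metrics` and the dictionary layout are hypothetical, and reading `GALAXY_SLOTS` from the environment with a default of "1" is an assumption for the example.

```python
import os
import time


def collect_job_metrics(run_job):
    """Run `run_job()` and return metrics grouped like the plugins above.

    Hypothetical helper for illustration only.
    """
    start_epoch = time.time()
    t0 = time.monotonic()  # monotonic clock for a runtime that cannot go negative
    run_job()
    runtime_s = time.monotonic() - t0
    end_epoch = time.time()
    return {
        "core": {
            "galaxy_slots": os.environ.get("GALAXY_SLOTS", "1"),  # assumed default
            "start_epoch": start_epoch,
            "end_epoch": end_epoch,
            "runtime_seconds": runtime_s,
        },
        "cpuinfo": {"processor_count": os.cpu_count()},
        "env": dict(os.environ),  # full environment dump
    }


metrics = collect_job_metrics(lambda: sum(range(100_000)))
print(sorted(metrics))
```

Dumping the whole environment can expose secrets such as tokens or passwords, which is worth considering before enabling that kind of collection everywhere.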
Telegraf, InfluxDB, and Grafana
General purpose tools for monitoring systems and services.
|Telegraf|plugin-driven server agent for collecting and reporting metrics|
|InfluxDB|purpose-built time series database|
|Grafana|dashboard for beautiful analytics and monitoring|
- Galaxy produces data
- Telegraf consumes and buffers it, before sending it to
- InfluxDB which stores the data
- And Grafana is used to visualise it
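InfluxDB accepts data points in its line protocol: a measurement name, optional tags, one or more fields, and a timestamp. The sketch below formats one point that way, to show roughly what travels through the pipeline above. It is a simplified illustration, not how Galaxy or Telegraf build their output; the measurement, tag, and field names are made up.

```python
def _escape_key(value):
    """Escape commas, equals signs, and spaces in tag keys, tag values, and field keys."""
    return str(value).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _format_field(value):
    """Render a field value: booleans, integers (i suffix), floats, quoted strings."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    tag_part = "".join(
        f",{_escape_key(k)}={_escape_key(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape_key(k)}={_format_field(v)}" for k, v in fields.items()
    )
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


line = to_line_protocol(
    "galaxy_job_runtime",  # hypothetical measurement name
    {"tool": "bwa mem", "destination": "slurm"},
    {"seconds": 182.5, "slots": 4},
    1700000000000000000,
)
print(line)
```

Tags are indexed and suit values you filter or group by in Grafana (tool, destination), while fields hold the measured numbers.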
Infrastructure for Grafana
- Everything captured in Galaxy Ansible infrastructure-playbook repository.
- Ansible playbook to install Telegraf.
- Ansible tasks for installing InfluxDB and Grafana.
If you see a dashboard you like, you can export its configuration and put it on your own Grafana with your data. Copy away!