name: inverse
layout: true
class: center, middle, inverse
# Galaxy Monitoring with Telegraf and Grafana
.footnote[Tip: press `P` to view the presenter notes]

???

Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.

Press `P` again to switch presenter notes off.

Press `C` to create a new window where the same presentation will be displayed. This window is linked to the main window. Changing slides on one will cause the slide to change on the other.

Useful when presenting.

---

### <i class="far fa-question-circle" aria-hidden="true"></i><span class="visually-hidden">question</span> Questions

- How do I monitor Galaxy with Telegraf?
- How do I set up InfluxDB?
- How can I make graphs in Grafana?
- How can I best alert on important metrics?

---

### <i class="fas fa-bullseye" aria-hidden="true"></i><span class="visually-hidden">objectives</span> Objectives

- Set up InfluxDB
- Set up Telegraf
- Set up Grafana
- Create several charts

---

# Telegraf, InfluxDB, and Grafana

General purpose tools for monitoring systems and services.

Tool | Use
--- | ---
[Telegraf](https://github.com/influxdata/telegraf) | plugin-driven server agent for collecting & reporting metrics
[InfluxDB](https://github.com/influxdata/influxdb/) | purpose-built time series database
[Grafana](https://grafana.com/) | dashboard for beautiful analytics and monitoring

Dataflow:

- Galaxy produces data
- Telegraf consumes and buffers it, before sending it to
- InfluxDB, which stores the data
- And Grafana is used to visualise it

???

- Monitoring in Galaxy is easy to set up.
- Galaxy produces data, which is consumed by Telegraf.
- Telegraf sends the data to InfluxDB.
- This data is visualized in Grafana.

---

# Grafana showcase

* usegalaxy.eu [public server](https://stats.galaxyproject.eu)
* usegalaxy.org.au [public server](https://stats.genome.edu.au)
* usegalaxy.org private server

If you see a dashboard you like, you can export its configuration and load it into your own Grafana with your own data. Copy away!

???

- We have several public Grafana servers.
- If you like any of our graphs, you can copy them.

---

![galaxy dashboard showing route timings, user counts, job counts, etc.](../../images/grafana/galaxy.png)

???

- We have built numerous dashboards for monitoring Galaxy.
- These include scripts, playbooks, and configuration for everything.
- Here is EU's Galaxy dashboard showing active users, running and unscheduled jobs, etc.

---

![node detail dashboard with filesystem usage, process states, cpu, memory, load, network, etc.](../../images/grafana/node.png)

???

- However, sometimes we notice something going wrong with our infrastructure.
- We use the node detail dashboard to begin our investigation.
- It gives us a very fast overview of the server.
- This can help pinpoint issues efficiently.

---

![DB dashboard showing transactions, tuples fetched/modified, and index sizes for each database](../../images/grafana/db.png)

???

- We also monitor the database heavily.
- All of this monitoring is built into Telegraf.
- We need to be able to correlate latency with autovacuums or contention.
- We monitor table size changes to check for anomalies.

---

![user statistics page for EU with 23k users, 30k workflows, 400k histories, 13M jobs, and 30M datasets. Additional breakdowns provided for years of compute time on various clusters, including 1k years on de.NBI cloud.](../../images/grafana/stats.png)

???

- Our staff often need to report numbers for their grants.
- We produced this user statistics dashboard to help them.
- Now they can answer their own questions, and make their own graphs, without admin help.

---

![cvmfs dashboard showing which repos each server supports in green, and missing ones in white. ~90% of repos are supported](../../images/grafana/cvmfs.png)

???

- We don't just monitor Galaxy though.
- We also monitor CVMFS, and the availability of repositories on each server.
- This can give a good view of which repositories are replicated.
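
---

# Sketch: InfluxDB line protocol

Telegraf can run commands whose output is already formatted for InfluxDB. That format is the InfluxDB line protocol: a measurement name, optional tags, one or more fields, and an optional nanosecond timestamp. The Python sketch below is only an illustration of how such a line is put together. The function names and the `galaxy_queue`, `state`, and `count` names are hypothetical and are not taken from `gxadmin` or Galaxy.

```python
def _escape(value: str, chars: str) -> str:
    """Backslash-escape the characters that line protocol treats as separators."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Format one data point as a single line of InfluxDB line protocol."""
    # Measurement names escape commas and spaces.
    head = _escape(measurement, ", ")
    # Tags are key=value pairs appended after commas; sorted for stable output.
    for key in sorted(tags):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")

    parts = []
    for key, value in fields.items():
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            rendered = "true" if value else "false"
        elif isinstance(value, int):
            rendered = f"{value}i"  # integers carry an "i" suffix
        elif isinstance(value, float):
            rendered = repr(value)
        else:
            # Strings are double-quoted, with backslashes and quotes escaped.
            text = str(value).replace("\\", "\\\\").replace('"', '\\"')
            rendered = '"' + text + '"'
        parts.append(_escape(key, ",= ") + "=" + rendered)

    line = head + " " + ",".join(parts)
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


# Hypothetical measurement, tag, and field names, for illustration only.
print(to_line_protocol("galaxy_queue", {"state": "running"}, {"count": 12}))
# prints: galaxy_queue,state=running count=12i
```

A script that prints lines like this can be collected by Telegraf and forwarded to InfluxDB, where the tags become the dimensions you filter and group by in Grafana.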
---

### <i class="fas fa-key" aria-hidden="true"></i><span class="visually-hidden">keypoints</span> Key points

- Telegraf provides an easy solution to monitor servers
- Galaxy can send metrics to Telegraf
- Telegraf can run arbitrary commands like `gxadmin`, which provides Influx-formatted output
- InfluxDB can collect metrics from Telegraf
- Use Grafana to visualise these metrics, and monitor their values

---

## Thank you!

This material is the result of collaborative work. Thanks to the [Galaxy Training Network](https://wiki.galaxyproject.org/Teach/GTN) and all the contributors!

This material is licensed under the Creative Commons Attribution 4.0 International License.