Galaxy Database schema

Author(s)	Hans-Rudolf Hotz Björn Grüning
Reviewers

Overview
Questions:

Running a production Galaxy server, you some times end up in with a situation, where you manually need to interact with the Galaxy database: how do you do that

How to extract usage information, which can not be gathered using the given report tools

How to move from MySQL to PostgreSQL

Is there ever a need to manually change the contents of a table

Objectives:

Learn some of the design concepts of the Galaxy database

Extract information from the Galaxy database

Get to know SchemaSpy

Time estimation: 2 hours

Supporting Materials:

FAQs

Published: Jul 20, 2016

Last modification: Apr 8, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00006

version Revision: 23

For the hands-on examples you need access to a Galaxy server and access to its PostgreSQL database. You can set-up this yourself, or use the Galaxy Docker Image provided by Björn Grüning (https://github.com/bgruening/docker-galaxy-stable). During this tutorial, we will work with the Galaxy Docker Image.

Setting up Docker and using the Galaxy Docker Image: (please do this before the tutorial, preferably when you are still at home using a fast internet connection)

Follow the instruction to install the Docker engine: https://docs.docker.com/engine/installation/
Execute: docker run -d -p 8080:80 bgruening/galaxy-stable (this will download the Galaxy Docker Image, when executed the first time, and start it) Test the Docker image in your web browser: http://localhost:8080 (see first paragraph of Björn’s introduction, for special cases when using the Docker Toolbox on Mac and Windows). Quit by: docker kill NAME (you get the name with: docker ps )

Introduction

Database versus Object Model

The session description is database centric and we’ll be focusing on the relational database that backs Galaxy servers. But that’s only half the picture of the this data. The other is the object model which is the object-oriented view of this same data. The object model is used by the code to manipulate and access the database. The translation between the two worlds is handled by an object-relational mapping implemented with SQLAlchemy (https://www.sqlalchemy.org).

Today we are covering the database and how to access it with SQL. We aren’t going to cover the corresponding object model or object relational mapping.

Database Platform

The default out-of-the-box Galaxy installation uses SQLite (https://www.sqlite.org). SQLite is a lightweight database management system (DBMS) that can be packaged inside Galaxy and does not require any additional steps at initial setup time.

However, SQLite is not the recommended DBMS for running a Galaxy server. The recommended production DMBS for Galaxy is PostgreSQL (https://www.postgresql.org). PostgreSQL offers a full set of DBMS features and robust support for multiple simultaneous users.

This workshop will be entirely based in PostgreSQL (also referred to as Postgres).

What is in (and not in) the Galaxy database?

The Galaxy database contains management information about your server. The database tracks users, groups, jobs, histories, datasets, workflows and so on.

What’s not in the database is the data. Datasets are stored outside the database. The database does keep metadata – information about the datasets such as data type. The tools themselves are not stored in the database either.

Understanding the Database Schema

ER diagrams and SchemaSpy

Entity-relationship diagrams are a way to understand tables and the relationships between them inside a relational database. SchemaSpy (http://schemaspy.org) is a free (and remarkable tool) for generating ER diagrams. Here’s a diagram of the Galaxy database (generated by SchemaSpy on June 6 2020; 6Mb).

The “Tables” tab is a good place to start learning the structure of the database. Each table represents a different type of thing, and often that thing is itself a relationship. For example, each record in the dataset table has information about a specific dataset, while records in the history_dataset_association table have information about what histories that dataset is in.

Each SchemaSpy table’s page shows the attributes in that table, as well as any constraints on those attributes, and the relationships between that table and other tables.

Also see the “Run SchemaSpy in this container” section below for how to install and then run SchemaSpy yourself.

Database conventions

The Galaxy database uses a number of naming and design conventions. Understanding these can make navigating the database much easier.

id attributes

Every table has an id column that uniquely identifies each row. (The id column is the primary key in database terminology.) Beyond uniquely identifying a row in the table, ID values have no meaning. ID values are unique within a table, but not across the database.

Relationships between tables, and `_id` columns

Relationships between tables are implemented by exporting id columns from one table into another. Imported ids are called foreign keys in database nomenclature, and are uniformly named table_the_id_came_from_id

There are a few notable exceptions to this rule. If the ID is from a table that is prefixed with galaxy_, for example, galaxy_user or galaxy_session, the galaxy_ will be dropped from the column name. For example, galaxy_user.id becomes user_id in the over 50 tables it is imported into

Relationship tables

As mentioned previously, some tables, such as history_dataset_association represent relationships between things, rather than things themselves. In this case history_dataset_association describes relationships between datasets and histories.

Relationship table names typically contain the names of tables they are relating, suffixed with _association.

Why are nulls allowed in almost every column? We have no idea. In practice, they aren’t nulls in most of those columns.

Why aren’t there comments, on anything? PostgreSQL supports comments to table definitions, but there are none shown in the SchemaSpy report. Why? The table definitions are actually generated by SQLAlchemy, the object-relational mapping software used by Galaxy, and SQLAlchemy does not support it.

There is nothing in the database that results from direct manipulation of the table definitions through DDL. Everything comes in through SQLAlchemy.

Start Docker and Galaxy

Hands On

Start the Galaxy Docker Image - this time as an interactive session
      docker run -i -t -p 8080:80 bgruening/galaxy-stable /bin/bash
Start Galaxy and its PostgreSQL server
      startup > log 2>&1 &
Follow the startup process
      tail -f log

Important tables

Hands On

Connect to the PostgreSQL database (change to user galaxy first)
       su galaxy
       psql -d galaxy -U galaxy
List all tables
      \dt

Enter q to exit the view results page, and space to see the next results page.

Table “galaxy_user”

Hands On

       select * from galaxy_user;

As described in Björn’s introduction, an Admin user is already pre-set (email: ‘admin@galaxy.org’, password: ‘admin’). Now let’s add (i.e. register) a new user via the Galaxy website. And check the database:

Hands On

       select * from galaxy_user;

Table “job”

Hands On

       select * from job;

Run a few jobs on the galaxy website (e.g upload file a simple table and add column with ‘Iterate’ no and yes) and check the database again:

Hands On

       select * from job \x\g\x

Table “job_parameter”

Hands On

      select * from job_parameter;

Table “history”

Hands On

      select * from history;

Give your current history a name and check the database again.

Table “dataset”

Hands On

      select * from dataset;

Table “history_dataset_association”

Hands On

      select * from history_dataset_association;

More (hands-on) Examples, not covered by the reports app

Have a look at the reports up (which is also provided in the Docker Image):

http://admin:admin@localhost:8080/reports/

Depending on your local needs, some queries are missing, like:

Jobs per tool per year / jobs per tool since 2015

You can add the numbers per month from the reports, or:

Hands On

      select j.id, j.create_time from job j limit 5;

       select j.id, j.create_time from job j
           where j.create_time >= '2015-12-31'
           and j.create_time < '2016-12-31';

       select j.id,j.create_time from job j
          where EXTRACT(year FROM j.create_time) = 2016
          and j.tool_id='upload1';`

…and now include the user

       select count(j.id) from job j, galaxy_user u
          where j.user_id = u.id
          and u.email = 'hansrudolf.hotz@fmi.ch'
          and EXTRACT(year FROM j.create_time) = 2016
          and j.tool_id='upload1';

       select u.email, count(*) from job j, galaxy_user u
          where j.user_id = u.id
          and EXTRACT(year FROM j.create_time) = 2016
          and j.tool_id='upload1'
          GROUP BY u.email;

Jobs per tool of a certain version

Imagine the current version of a tool is working fine, however a previous version had a bug: now you wanna warn all the users who have used the broken version, without alerting users who never used the broken one.

The following example is from the development server at the FMI

Hands On

      select distinct(j.tool_version) from job j
          where j.tool_id = 'qAlign';

       select j.user_id from job j
          where j.tool_id = 'qAlign'
          and j.tool_version = '1.0.4quasr';

      select u.email, j.create_time from job j, galaxy_user u
          where j.user_id = u.id
          and j.tool_id = 'qAlign'
          and j.tool_version = '1.0.4quasr';

All users running a job using a certain parameter

Hands On

      select jp.name, jp.value  from job_parameter jp
          where name = 'iterate'`

      select u.email, jp.name, jp.value
          from job_parameter jp, job j, galaxy_user u
          where jp.name = 'iterate'
          and j.tool_id = 'addValue'
          and jp.job_id = j.id
          and j.user_id = u.id;

Close PostgreSQL client and quit docker

Hands On

Close the PostgreSQL client
      \q
Quit the interactive docker (change back to root first)
      exit
      exit

Conclusion

There is a lot of information stored in the Galaxy database. Use this information for trouble shooting when necessary and use it as a source for extended user statistics.

You've Finished the Tutorial

Key points

Be careful, when you interact with the Galaxy database. And make sure you always have a backup!

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Hans-Rudolf Hotz, Björn Grüning, Galaxy Database schema (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/admin/tutorials/database-schema/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{admin-database-schema,
author = "Hans-Rudolf Hotz and Björn Grüning",
	title = "Galaxy Database schema (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/admin/tutorials/database-schema/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Funding

These individuals or organisations provided funding support for the development of this resource

EuroScienceGateway was funded by the European Union programme Horizon Europe (HORIZON-INFRA-2021-EOSC-01-04) under grant agreement number 101057388 and by UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee grant number 10038963.

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.
shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/admin/tutorials/database-schema/tutorial.json | jq .admin_install_yaml -r)
Alternatively you can copy and paste the following YAML
---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools: []

No feedback has been recieved yet for this training. Be the first one by filling in the feedback form.

Galaxy Database schema

Introduction

Database versus Object Model

Database Platform

What is in (and not in) the Galaxy database?

Understanding the Database Schema

ER diagrams and SchemaSpy

Database conventions

id attributes

Relationships between tables, and _id columns

Relationship tables

Start Docker and Galaxy

Important tables

Table “galaxy_user”

Table “job”

Table “job_parameter”

Table “history”

Table “dataset”

Table “history_dataset_association”

More (hands-on) Examples, not covered by the reports app

Jobs per tool per year / jobs per tool since 2015

Jobs per tool of a certain version

All users running a job using a certain parameter

Close PostgreSQL client and quit docker

Other Topics

How to move from MySQL to PostgreSQL

Is there ever a need to manually change the contents of a table

Run SchemaSpy in this container

Conclusion

You've Finished the Tutorial

Frequently Asked Questions

Feedback

Citing this Tutorial

Funding

Relationships between tables, and `_id` columns