# Reference Data with CVMFS

### Overview

question Questions
objectives Objectives
• Have an understanding of what CVMFS is and how it works

• Install and configure the CVMFS client on a linux machine and mount the Galaxy reference data repository

• Configure your Galaxy to use these reference genomes and indices

• Use an ansible playbook for all of the above.

requirements Requirements

time Time estimation: 1 hour

Supporting Materials

last_modification Last modification: Jan 20, 2020

# Overview

The CernVM-FS is a distributed filesystem perfectly designed for sharing readonly data across the globe. We use it in the Galaxy Project for sharing things that a lot of Galaxy servers need. Namely:

• Reference Data
• Genome sequences for hundreds of useful species.
• Indices for the genome sequences
• Various bioinformatic tool indices for the available genomes
• Tool containers
• Singularity containers of everything stored in Biocontainers (A bioinformatic tool container repository.) You get these for free every time you build a Bioconda recipe/package for a tool.
• Others too..

From the Cern website:

The CernVM File System provides a scalable, reliable and low-maintenance software distribution service. It was developed to assist High Energy Physics (HEP) collaborations to deploy software on the worldwide-distributed computing infrastructure used to run data processing applications. CernVM-FS is implemented as a POSIX read-only file system in user space (a FUSE module). Files and directories are hosted on standard web servers and mounted in the universal namespace /cvmfs.”

A slideshow presentation on this subject can be found here. More details on the usegalaxy.org (Galaxy Main’s) reference data setup and CVMFS system can be found here

There are two sections to this exercise. The first shows you how to use ansible to setup and configure CVMFS for Galaxy. The second shows you how to do everything manually. It is recommended that you use the ansible method. The manual method is included here mainly for a more in depth understanding of what is happening.

If you really want to perform all these tasks manually, go here, otherwise just follow along.

# Ansible-CVMFS and Galaxy

The Galaxy project supports a few CVMFS repositories.

Repository Repository Address Contents
Reference Data and Indices data.galaxyproject.org Genome sequences and their tool indices, Galaxy .loc files for them as well
Singularity Containers singularity.galaxyproject.org Singularity containers for everything in Biocontainers for use in Galaxy systems
Galaxy Main Configuration main.galaxyproject.org The configuration files etc for Galaxy Main (usegalaxy.org)

You can browse the contents of data.galaxyproject.org at the datacache.

## Installing and configuring Galaxy’s CVMFS reference data with ansible

Luckily for us, the Galaxy Project has a lot of experience with using and configuring CVMFS and we are going to leverage off that. To get CVMFS working on our Galaxy server, we will use the ansible role for CVMFS written by the Galaxy Project. Firstly, we need to install the role and then write a playbook for using it.

If the terms “ansible”, “role” and “playbook” mean nothing to you, please checkout the ansible introduction slides and the ansible introduction tutorial

### tip Tip: Running Ansible on your remote machine

It is possible to have ansible installed on the remote machine and run it there, not just from your local machine connecting to the remote machine.

Your hosts file will need to use localhost, and whenever you run playbooks with ansible-playbook -i hosts playbook.yml, you will need to add -c local to your command.

Be certain that the playbook that you’re writing on the remote machine is stored somewhere safe, like your user home directory, or backed up on your local machine. The cloud can be unreliable and things can disappear at any time.

### hands_on Hands-on: Installing CVMFS with Ansible

1. In your working directory, add the CVMFS role to your requirements.yml

---
- galaxyproject.cvmfs

2. Install the requirements with ansible-galaxy:

ansible-galaxy install -p roles -r requirements.yml

3. Create and edit the group variables file, group_vars/galaxyservers.yml file.

The variables available in this role are:

Variable Type Description
cvmfs_role string Type of CVMFS host: client, stratum0, stratum1, or localproxy. Controls what packages are installed and what configuration is performed.
cvmfs_keys list of dicts Keys to install on hosts of all types.
cvmfs_server_urls list of dicts CVMFS server URLs, the value of CVMFS_SERVER_URL in /etc/cvmfs/domain.d/<domain>.conf.
cvmfs_repositories list of dicts CVMFS repository configurations, CVMFS_REPOSITORIES in /etc/cvmfs/default.local plus additional settings in /etc/cvmfs/repositories.d/<repository>/{client,server}.conf.
cvmfs_quota_limit integer in MB Size of CVMFS client cache. Default is 4000.

But, luckily for us, the Galaxy Project CVMFS role has a lot of defaults for these variables which we can use by just setting galaxy_cvmfs_repos_enabled to true. We will also set the cvmfs_quota_limit to something sensible as we have relatively small disks on our instances (2000MB).

Add the following lines to your group_vars/galaxyservers.yml file:

---
# CVMFS vars
cvmfs_role: client
galaxy_cvmfs_repos_enabled: true
cvmfs_quota_limit: 2000

4. Create and edit a new playbook which uses the CVMFS role, name it cvmfs_playbook.yml

- hosts: galaxyservers
become: true
roles:
- galaxyproject.cvmfs

5. Run the playbook

ansible-playbook -i hosts cvmfs_playbook.yml


Congratulations, you’ve set up CVMFS.

## Exploring the CVMFS Installation

### hands_on Hands-on: Exploring CVMFS

1. SSH into your machine

2. Change directory into /cvmfs/ and list the files in that folder

### question Question

What do you see?

### solution Solution

You should see nothing, as CVMFS uses autofs in order to mount paths only upon request. Once you cd into the directory, autofs will automatically mount the repository and files will be listed.

3. Change directory into /cvmfs/data.galaxyproject.org/. Have a browse through the contents. You’ll see .loc files, genomes and indices.

And just like that we all have access to all the reference genomes and associated tool indices thanks to the Galaxy Project’s and mostly Nate’s hard work!

## Configuring Galaxy to use the CVMFS references.

Now that we have mounted the cvmfs repository we need to tell Galaxy how to find it and use it.

There are two primary directories in the reference data repository:

Directory Contents
/managed Data generated with Galaxy Data Managers, organized by data table (index format), then by genome build.
/byhand Data generated prior to the existence/use of Data Managers, manually curated. (For legacy reasons, this directory is shared as /indexes on the HTTP and rsync servers.)

These directories have somewhat different structures:

• /managed is organized by index type, then by genome build (Galaxy dbkey)
• /byhand is organzied by genome build, then by index type

Both directories contain a location subdirectory, and each of these contain a tool_data_table_conf.xml file:

• /managed/location/tool_data_table_conf.xml
• /byhand/location/tool_data_table_conf.xml

Galaxy consumes these tool_data_table_conf.xml files and the .loc “location” files they reference. The paths contained in these files are valid if the data is mounted via CVMFS.

Examples of data include:

• twoBit (.2bit) and FASTA (.fa) sequence files
• Bowtie 2 and BWA indexes
• Mutation Annotation Format (.maf) files
• SAMTools FASTA indexes (.fai)

Now all we need to do is tell Galaxy how to find it! This tutorial assumes that you have run the tutorial in the requirements, “Galaxy Installation with Ansible”. The hands-on below will use the Galaxy Project Ansible role to configure everything.

### hands_on Hands-on: Configuring Galaxy to use CVMFS

1. Edit the group_vars/galaxyservers.yml file and add a tool_data_table_config_path entry under the galaxy_config: section in the groupvars/galaxyservers.yml file.

galaxy_config:
galaxy:
...
tool_data_table_config_path: /cvmfs/data.galaxyproject.org/byhand/location/tool_data_table_conf.xml,/cvmfs/data.galaxyproject.org/managed/location/tool_data_table_conf.xml


### question Question

How does your final configuration look?

### solution Solution

2. Because we want to alter some Galaxy settings now, we need to add the Galaxy role to our playbook.

Add galaxyproject.galaxy to the roles in our cvmfs_playbook.yml file.

- hosts: galaxyservers
become: true
roles:
- galaxyproject.cvmfs
- galaxyproject.galaxy


### comment Ansible Heads-up

If you’ve set up your Galaxy server using the Galaxy Installation with Ansible tutorial, you will have created a handler for restarting Galaxy (its name is set in the galaxy_restart_handler_name option in your group vars). You will need to define that handler in the CVMFS playbook the same way as you defined it in your original playbook. This also means that Ansible will perform the restart step below for you!

3. Re-run the CVMFS playbook (ansible-playbook -i hosts cvmfs_playbook.yml)

4. Restart Galaxy

5. In your Galaxy interface, open the bwa, bwa-mem or Bowtie2 tool interface (whichever you may have installed). Now check that there are a lot more Genomes available for use!

6. Login to Galaxy as the admin user, and go to Admin → Local Data → View Tool Data Table Entries → bwa_mem indexes

You’ve now finished the tutorial, and you can jump to the end or read on to learn about configuring CVMFS without Ansible.

# CVMFS and Galaxy without Ansible

### comment Manual version of Ansible Commands

If you wish to perform the same thing that we’ve just done, but by building the ansible script manually, follow these instructions. Otherwise, you have already done everything below and do not need to re-do it.

We are going to setup a CVMFS mount to the Galaxy reference data repository on our machines. To do this we have to install and configure the CVMFS client and then mount the appropriate CVMFS repository using the publicly available keys.

### hands_on Hands-on: Installing the CVMFS Client

1. On your remote machine, we need to first install the Cern software apt repo and then the cvmfs client and config utility:

sudo apt install lsb-release
wget https://ecsft.cern.ch/dist/cvmfs/cvmfs-release/cvmfs-release-latest_all.deb
sudo dpkg -i cvmfs-release-latest_all.deb
rm -f cvmfs-release-latest_all.deb
sudo apt-get update

sudo apt install cvmfs cvmfs-config

2. Now we need to run the cvmfs setup script.

sudo cvmfs_config setup


## Configuring CVMFS

The configuration is not complex for CVMFS:

### hands_on Hands-on: Configuring CVMFS

1. Create a /etc/cvmfs/default.local file with the following contents:

CVMFS_REPOSITORIES="data.galaxyproject.org"
CVMFS_HTTP_PROXY="DIRECT"
CVMFS_QUOTA_LIMIT="4000"
CVMFS_CACHE_BASE="/srv/cvmfs/cache"
CVMFS_USE_GEOAPI=yes


This tells cvmfs to mount the Galaxy reference data repository and use a specific location for the cache which is limited to 4GB in size and to use the instance’s geo-location to choose the best CVMFS repo server to connect to.

2. Create a /etc/cvmfs/domain.d/galaxyproject.org.conf file with the following contents:

CVMFS_SERVER_URL="http://cvmfs1-tacc0.galaxyproject.org/cvmfs/@fqrn@;http://cvmfs1-iu0.galaxyproject.org/cvmfs/@fqrn@;http://cvmfs1-psu0.galaxyproject.org/cvmfs/@fqrn@;http://galaxy.jrc.ec.europa.eu:8008/cvmfs/@fqrn@;http://cvmfs1-mel0.gvl.org.au/cvmfs/@fqrn@;http://cvmfs1-ufr0.galaxyproject.eu/cvmfs/@fqrn@"


This is a list of the available tier1 servers that have this repo. Note there is one in Penn State. We will most likely be connecting to this one.

3. Create a /etc/cvmfs/keys/data.galaxyproject.org.pub file with the following contents:

-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA5LHQuKWzcX5iBbCGsXGt
6CRi9+a9cKZG4UlX/lJukEJ+3dSxVDWJs88PSdLk+E25494oU56hB8YeVq+W8AQE
3LWx2K2ruRjEAI2o8sRgs/IbafjZ7cBuERzqj3Tn5qUIBFoKUMWMSIiWTQe2Sfnj
GzfDoswr5TTk7aH/FIXUjLnLGGCOzPtUC244IhHARzu86bWYxQJUw0/kZl5wVGcH
maSgr39h1xPst0Vx1keJ95AH0wqxPbCcyBGtF1L6HQlLidmoIDqcCQpLsGJJEoOs
NVNhhcb66OJHah5ppI1N3cZehdaKyr1XcF9eedwLFTvuiwTn6qMmttT/tHX7rcxT
owIDAQAB
-----END PUBLIC KEY-----

4. Make a directory for the cache files

sudo mkdir /srv/cvmfs


## Testing it out

Probe the connection.

### hands_on Hands-on: Testing it out

1. Run sudo cvmfs_config probe data.galaxyproject.org

### question Question

What does it output?

### solution Solution

OK


If this doesn’t return OK then you may need to restart autofs: sudo systemctl restart autofs

2. Change directory into /cvmfs/ and list the files in that folder

### question Question

What do you see?

### solution Solution

You should see nothing, as CVMFS uses autofs in order to mount paths only upon request. Once you cd into the directory, autofs will automatically mount the repository and files will be listed.

3. Change directory into /cvmfs/data.galaxyproject.org/. Have a browse through the contents. You’ll see .loc files, genomes and indices.

And just like that we all have access to all the reference genomes and associated tool indices thanks to the Galaxy Project’s and mostly Nate’s hard work!

#### Step 4: Look at the repository

Now to configure Galaxy to use the CVMFS references we have just installed, see here

# Citing this Tutorial

1. Simon Gladman, Helena Rasche, 2020 Reference Data with CVMFS (Galaxy Training Materials). /archive/2020-02-01/topics/admin/tutorials/cvmfs/tutorial.html Online; accessed TODAY
2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

### details BibTeX

@misc{admin-cvmfs,
author = "Simon Gladman and Helena Rasche",
title = "Reference Data with CVMFS (Galaxy Training Materials)",
year = "2020",
month = "01",
day = "20"
url = "\url{/archive/2020-02-01/topics/admin/tutorials/cvmfs/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
doi = {10.1016/j.cels.2018.05.012},
url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
year = 2018,
month = {jun},
publisher = {Elsevier {BV}},
volume = {6},
number = {6},
pages = {752--758.e1},
author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
title = {Community-Driven Data Analysis Training for Biology},
journal = {Cell Systems}
}
`

### congratulations Congratulations on successfully completing this tutorial!

Did you use this material as an instructor? Feel free to give us feedback on how it went.