Reference Data with CVMFS

Overview

Objectives
  • Understand what CVMFS is and how it works
  • Install and configure the CVMFS client on a Linux machine and mount the Galaxy reference data repository
  • Configure your Galaxy to use these reference genomes and indices
  • Use an Ansible playbook for all of the above

Requirements
  • Galaxy Installation with Ansible

Time estimation: 1 hour

Last modification: Mar 11, 2021

Overview

CernVM-FS (CVMFS) is a distributed filesystem designed for sharing read-only data across the globe. We use it in the Galaxy Project to share data that many Galaxy servers need, namely:

  • Reference Data
    • Genome sequences for hundreds of useful species.
    • Indices for the genome sequences
    • Various bioinformatic tool indices for the available genomes
  • Tool containers
    • Singularity containers of everything stored in Biocontainers (a bioinformatics tool container repository). You get these for free every time you build a Bioconda recipe/package for a tool.
  • Others too.

From the CERN website:

“The CernVM File System provides a scalable, reliable and low-maintenance software distribution service. It was developed to assist High Energy Physics (HEP) collaborations to deploy software on the worldwide-distributed computing infrastructure used to run data processing applications. CernVM-FS is implemented as a POSIX read-only file system in user space (a FUSE module). Files and directories are hosted on standard web servers and mounted in the universal namespace /cvmfs.”

https://cernvm.cern.ch/portal/filesystem

A slideshow presentation on this subject can be found here. More details on the reference data setup and CVMFS system of usegalaxy.org (Galaxy Main) can be found here.

Agenda

  1. Ansible-CVMFS and Galaxy
    1. Installing and configuring Galaxy’s CVMFS reference data with Ansible
    2. Exploring the CVMFS Installation
    3. Configuring Galaxy to use the CVMFS references.
  2. Other Aspects
    1. Development
    2. Automation
    3. Access Control
    4. Proxying Recap
    5. Plant Data

Ansible-CVMFS and Galaxy

The Galaxy project supports a few CVMFS repositories.

| Repository | Repository Address | Contents |
|---|---|---|
| Reference Data and Indices | data.galaxyproject.org | Genome sequences and their tool indices, along with the Galaxy .loc files for them |
| Singularity Containers | singularity.galaxyproject.org | Singularity containers for everything in Biocontainers, for use in Galaxy systems |
| Galaxy Main Configuration | main.galaxyproject.org | The configuration files etc. for Galaxy Main (usegalaxy.org) |

You can browse the contents of data.galaxyproject.org at the datacache.

Installing and configuring Galaxy’s CVMFS reference data with Ansible

Luckily for us, the Galaxy Project has a lot of experience with using and configuring CVMFS, and we are going to build on that. To get CVMFS working on our Galaxy server, we will use the Ansible role for CVMFS written by the Galaxy Project. First, we need to install the role, and then write a playbook that uses it.

If the terms “Ansible”, “role” and “playbook” mean nothing to you, please check out the Ansible introduction slides and the Ansible introduction tutorial.

Tip: Running Ansible on your remote machine

It is possible to have Ansible installed on the remote machine and run it there, not just from your local machine connecting to the remote machine. Your hosts file will need to use localhost, and whenever you run playbooks with ansible-playbook -i hosts playbook.yml, you will need to add -c local to your command. Be certain that the playbook that you’re writing on the remote machine is stored somewhere safe, like your user home directory, or backed up on your local machine. The cloud can be unreliable and things can disappear at any time.
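
As a rough sketch of the tip above, the inventory and command for a local run might look like the following. The group name galaxyservers matches the playbook used later in this tutorial. The function names and default file names are placeholders chosen for illustration, not part of Ansible.

```shell
# Write a minimal inventory that targets the machine Ansible runs on.
# Assumption: the play targets a group named "galaxyservers".
# Refuses to overwrite an existing inventory file.
write_local_inventory() {
    inventory="${1:-hosts}"
    if [ -e "$inventory" ]; then
        echo "refusing to overwrite $inventory" >&2
        return 1
    fi
    printf '[galaxyservers]\nlocalhost\n' > "$inventory"
}

# Run a playbook against that inventory over the local connection
# (-c local) instead of SSH.
run_local_playbook() {
    inventory="${1:-hosts}"
    playbook="${2:-playbook.yml}"
    ansible-playbook -i "$inventory" "$playbook" -c local
}
```

For example, write_local_inventory hosts followed by run_local_playbook hosts galaxy.yml would run the playbook on the machine itself.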

Hands-on: Installing CVMFS with Ansible

  1. In your working directory, add the CVMFS role to your requirements.yml

    - src: galaxyproject.cvmfs
      version: 0.2.13
    
  2. Install the role with:

    ansible-galaxy role install -p roles -r requirements.yml
    
  3. The variables available in this role are:

    | Variable | Type | Description |
    |---|---|---|
    | cvmfs_role | string | Type of CVMFS host: client, stratum0, stratum1, or localproxy. Controls what packages are installed and what configuration is performed. |
    | cvmfs_keys | list of dicts | Keys to install on hosts of all types. |
    | cvmfs_server_urls | list of dicts | CVMFS server URLs, the value of CVMFS_SERVER_URL in /etc/cvmfs/domain.d/<domain>.conf. |
    | cvmfs_repositories | list of dicts | CVMFS repository configurations, CVMFS_REPOSITORIES in /etc/cvmfs/default.local plus additional settings in /etc/cvmfs/repositories.d/<repository>/{client,server}.conf. |
    | cvmfs_quota_limit | integer in MB | Size of CVMFS client cache. Default is 4000. |

    Luckily, the Galaxy Project CVMFS role provides defaults for most of these variables, which we can use by setting galaxy_cvmfs_repos_enabled to config-repo. We will also set cvmfs_quota_limit to 500 MB, well below the 4000 MB default, because the disks on our instances are relatively small. In a production setup, you should size the cache appropriately for the client.

    Add the following lines to your group_vars/all.yml file, creating it if it doesn’t exist:

    # CVMFS vars
    cvmfs_role: client
    galaxy_cvmfs_repos_enabled: config-repo
    cvmfs_quota_limit: 500
    

    Tip: Why all.yml?

    The CVMFS and Pulsar tutorials are integrated so that CVMFS is used for Pulsar as well, which means this configuration is needed on all of our machines. This mirrors real life, where you want CVMFS on every node that does computation.

  4. Add the new role to the list of roles under the roles key in your playbook, galaxy.yml:

    - hosts: galaxyservers
      become: true
      roles:
        # ... existing roles ...
        - galaxyproject.cvmfs
    
  5. Run the playbook:

    ansible-playbook galaxy.yml
    

Congratulations, you’ve set up CVMFS.
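
One way to confirm that the client works is to probe a repository with the cvmfs_config helper that ships with the CVMFS client. The following is a minimal sketch, assuming cvmfs_config is on the PATH of the machine you just configured. The wrapper function name is made up for illustration.

```shell
# Probe a CVMFS repository to confirm the client can mount it.
# Prints a hint and returns 127 if the cvmfs_config helper is missing.
check_cvmfs_client() {
    repo="${1:-data.galaxyproject.org}"
    if ! command -v cvmfs_config >/dev/null 2>&1; then
        echo "cvmfs_config not found: is the CVMFS client installed?" >&2
        return 127
    fi
    cvmfs_config probe "$repo"
}
```

Calling check_cvmfs_client with no argument probes data.galaxyproject.org. Pass another repository name to probe that one instead.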

Exploring the CVMFS Installation

Hands-on: Exploring CVMFS

  1. SSH into your machine

  2. Change directory into /cvmfs/ and list the files in that folder

    Question

    What do you see?

    Solution

    You should see nothing, because CVMFS uses autofs to mount repositories only on request. Once you cd into a repository directory, such as /cvmfs/data.galaxyproject.org/, autofs mounts that repository and its files are listed.

  3. Change directory into /cvmfs/data.galaxyproject.org/. Have a browse through the contents. You’ll see .loc files, genomes and indices.

    And just like that we all have access to all the reference genomes and associated tool indices thanks to the Galaxy Project’s and mostly Nate’s hard work!
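
The on-demand mounting described in the steps above can be observed with a small helper like the one below. This is only a sketch: the function name is hypothetical, and the CVMFS_ROOT override is an assumption that exists so it can be tried against an ordinary directory as well.

```shell
# List the CVMFS root, access one repository, then list the root again.
# On a freshly configured client the first listing is typically empty
# and the second shows the repository, because autofs mounts it on
# first access.
show_autofs_mount() {
    root="${CVMFS_ROOT:-/cvmfs}"
    repo="${1:-data.galaxyproject.org}"
    echo "before: $(ls -A "$root" | tr '\n' ' ')"
    # Listing the repository path is what triggers the mount.
    ls "$root/$repo" >/dev/null || return 1
    echo "after: $(ls -A "$root" | tr '\n' ' ')"
}
```

Run show_autofs_mount on the Galaxy server and compare the two lines it prints.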

Configuring Galaxy to use the CVMFS references.

Now that we have mounted the CVMFS repository we need to tell Galaxy how to find it and use it.

There are two primary directories in the reference data repository:

| Directory | Contents |
|---|---|
| /managed | Data generated with Galaxy Data Managers, organized by data table (index format), then by genome build. |
| /byhand | Data generated prior to the existence/use of Data Managers, manually curated. (For legacy reasons, this directory is shared as /indexes on the HTTP and rsync servers.) |

These directories have somewhat different structures:

  • /managed is organized by index type, then by genome build (Galaxy dbkey)
  • /byhand is organized by genome build, then by index type

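To make the difference in ordering concrete, here is a simplified sketch that builds a directory path for each layout. The function names are hypothetical, and the index type and build names in the usage note are placeholders; the real directories may use different names or contain further levels.

```shell
# Base of the reference data repository as mounted by the CVMFS client.
REPO_ROOT="${REPO_ROOT:-/cvmfs/data.galaxyproject.org}"

# /managed: index type first, then genome build (dbkey).
# usage: managed_dir <index_type> <dbkey>
managed_dir() {
    printf '%s/managed/%s/%s\n' "$REPO_ROOT" "$1" "$2"
}

# /byhand: genome build (dbkey) first, then index type.
# usage: byhand_dir <dbkey> <index_type>
byhand_dir() {
    printf '%s/byhand/%s/%s\n' "$REPO_ROOT" "$1" "$2"
}
```

With the default REPO_ROOT, managed_dir bwa_mem_index hg38 prints /cvmfs/data.galaxyproject.org/managed/bwa_mem_index/hg38, while byhand_dir hg38 bwa_mem_index prints /cvmfs/data.galaxyproject.org/byhand/hg38/bwa_mem_index.
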
Both directories contain a location subdirectory, and each of these contains a tool_data_table_conf.xml file:

  • /managed/location/tool_data_table_conf.xml
  • /byhand/location/tool_data_table_conf.xml

Galaxy consumes these tool_data_table_conf.xml files and the .loc “location” files they reference. The paths contained in these files are valid if the data is mounted via CVMFS.
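
To see which .loc files one of these configuration files points at, something like the following sketch could be used. It assumes that each table names its location file in a file element with a path attribute on a single line. The function name is made up for illustration.

```shell
# Print the path attribute of every <file .../> element found in a
# tool_data_table_conf.xml, one path per line.
# Returns 1 if the configuration file cannot be read.
list_loc_files() {
    conf="$1"
    if [ ! -r "$conf" ]; then
        echo "cannot read $conf" >&2
        return 1
    fi
    sed -n 's/.*<file[^>]*path="\([^"]*\)".*/\1/p' "$conf"
}
```

For example: list_loc_files /cvmfs/data.galaxyproject.org/managed/location/tool_data_table_conf.xml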

Examples of data include:

  • twoBit (.2bit) and FASTA (.fa) sequence files
  • Bowtie 2 and BWA indexes
  • Mutation Annotation Format (.maf) files
  • SAMTools FASTA indexes (.fai)

Now all we need to do is tell Galaxy how to find it! This tutorial assumes that you have run the tutorial in the requirements, Galaxy Installation with Ansible. The hands-on below will use the Galaxy Project Ansible role to configure everything.

Hands-on: Configuring Galaxy to use CVMFS

  1. Edit the group_vars/galaxyservers.yml file and add a tool_data_table_config_path entry under the galaxy key of the galaxy_config section. The new entry should be a comma-separated list of the paths to both tool_data_table_conf.xml files referenced above.

    Question

    How does your final configuration look?

    Solution

    galaxy_config:
      galaxy:
        # ... existing configuration options in the `galaxy` section ...
        tool_data_table_config_path: /cvmfs/data.galaxyproject.org/byhand/location/tool_data_table_conf.xml,/cvmfs/data.galaxyproject.org/managed/location/tool_data_table_conf.xml
    
  2. Re-run the playbook (ansible-playbook galaxy.yml)

  3. Install the BWA-MEM tool, if needed.

    Tip: Install tools via the Admin UI

    1. Open Galaxy in your browser and type bwa in the tool search box on the left. If “Map with BWA-MEM” is among the search results, you can skip the following steps.
    2. Access the Admin menu from the top bar (you need to be logged in with an email specified in the admin_users setting)
    3. Click “Install and Uninstall”, which can be found on the left, under “Tool Management”
    4. Enter bwa in the search interface
    5. Click on the first hit whose owner is devteam
    6. Click the “Install” button for the latest revision
    7. Enter “Mapping” as the target section and click “OK”.
  4. In your Galaxy server, open the Map with BWA-MEM tool. Now check that there are a lot more reference genomes available for use!

    [Image: available_genomes.png]

  5. Log in to Galaxy as the admin user, and go to Admin → Data Tables → bwa_mem indexes

    [Image: bwa_mem_indices.png]
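
If the extra genomes do not appear, one thing worth ruling out is that the files named in tool_data_table_config_path are not readable on the Galaxy server. The sketch below splits a comma-separated value like the one configured above and checks each entry. The function name is hypothetical.

```shell
# Split a comma-separated tool_data_table_config_path value and report
# which of the listed files are readable.
# Returns 1 if at least one file is missing or unreadable.
check_table_config_paths() {
    failed=0
    old_ifs="$IFS"
    IFS=','
    for path in $1; do
        if [ -r "$path" ]; then
            echo "ok      $path"
        else
            echo "missing $path"
            failed=1
        fi
    done
    IFS="$old_ifs"
    return "$failed"
}
```

Pass it the same string that was set for tool_data_table_config_path in group_vars/galaxyservers.yml.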

Other Aspects

Development

If you are developing a new tool and want to add a reference genome, we recommend you talk to us on Gitter. You can also look at one of the tools that uses reference data and copy its approach. If you are creating entirely new location files, you will need to write the data manager for them.

Automation

You can automate the process of installing and setting up data managers and data with ephemeris. We’re working in the IDC to democratise this CVMFS repository, and make this a community-controlled resource. You can also look here for ideas on automating your data management.

Access Control

It is not easily possible to filter access to reference data depending on the user’s role or group.

You could set up a tool per user or group, secure access to running this tool, and then allow this private tool to access a private tool data table. However, you will not get tool updates: you will have to copy and edit the tool every time it is updated. Alternatively, you could write more advanced job control rules that reject specific jobs which use specific datasets.

Proxying Recap

The client talks directly to the stratum 1 (or to a proxy), manages the data, and exposes it to the user. The proxy stores an opaque cache that cannot be used directly; it only serves as a proxy to a stratum 1.

Plant Data

If you are working with plants, you can find separate reference data here: frederikcoppens/galaxy_data_management

Frequently Asked Questions

Have questions about this tutorial? Check out the FAQ page for the Galaxy Server administration topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum.

Citing this Tutorial

  1. Simon Gladman, Helena Rasche, 2021 Reference Data with CVMFS (Galaxy Training Materials). /archive/2021-05-01/topics/admin/tutorials/cvmfs/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

BibTeX

@misc{admin-cvmfs,
    author = "Simon Gladman and Helena Rasche",
    title = "Reference Data with CVMFS (Galaxy Training Materials)",
    year = "2021",
    month = "03",
    day = "11",
    url = "\url{/archive/2021-05-01/topics/admin/tutorials/cvmfs/tutorial.html}",
    note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
        doi = {10.1016/j.cels.2018.05.012},
        url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
        year = 2018,
        month = {jun},
        publisher = {Elsevier {BV}},
        volume = {6},
        number = {6},
        pages = {752--758.e1},
        author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
        title = {Community-Driven Data Analysis Training for Biology},
        journal = {Cell Systems}
}
                    

Congratulations on successfully completing this tutorial!