Distributed Object Storage

Overview

Questions
  • How does Galaxy locate data?

  • How can I have Galaxy use multiple storage locations?

Objectives
  • Set up Galaxy with both the Hierarchical and Distributed Object Stores

Time estimation: 30 minutes
Last modification: May 23, 2021

Expanding Storage

You may find that your Galaxy files directory has run out of space, but you don’t want to move all of the files from one filesystem to another. One solution to this problem is to use Galaxy’s hierarchical object store to add an additional file space for Galaxy.

Alternatively, you may wish to write new datasets to more than one filesystem. For this, you can use Galaxy’s distributed object store.

This tutorial assumes you have done the “Ansible for installing Galaxy” tutorial; it references the base configuration set up in that tutorial in numerous places.

Agenda

  1. Expanding Storage
  2. Hierarchical Object Store
  3. Distributed Object Store
  4. S3 Object Store
  5. Dropbox

Hierarchical Object Store

First, note that your Galaxy datasets have so far been created in the directory /data, because of the file_path setting under galaxy_config: galaxy:. Sometimes a particular storage location runs out of space. Galaxy allows us to add additional storage locations where it will create new datasets, while still looking in the old locations for old datasets. You will not have to migrate any of your datasets, and can just “plug and play” with new storage pools.
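
The behaviour described above can be sketched in a few lines of Python. This is a simplified illustration, not Galaxy's actual implementation: the DiskBackend and HierarchicalStore classes, the flat file layout, and the demo() helper are all hypothetical names introduced here. It assumes that new datasets are written to the backend with the lowest order, and that lookups try each backend in order until the file is found.

```python
import os
import tempfile


class DiskBackend:
    """Hypothetical stand-in for one disk backend entry."""

    def __init__(self, backend_id, files_dir, order):
        self.id = backend_id
        self.files_dir = files_dir
        self.order = order

    def path_for(self, dataset_id):
        # Flat layout for brevity; a real store nests files in subdirectories.
        return os.path.join(self.files_dir, f"dataset_{dataset_id}.dat")


class HierarchicalStore:
    """Assumed behaviour: write to the first backend, read from whichever has the file."""

    def __init__(self, backends):
        # The backend with the lowest order is consulted first.
        self.backends = sorted(backends, key=lambda b: b.order)

    def create(self, dataset_id):
        primary = self.backends[0]
        with open(primary.path_for(dataset_id), "w"):
            pass  # create an empty placeholder file
        return primary.id

    def locate(self, dataset_id):
        # Walk the hierarchy until some backend has the file.
        for backend in self.backends:
            if os.path.exists(backend.path_for(dataset_id)):
                return backend.id
        return None


def demo():
    with tempfile.TemporaryDirectory() as root:
        old_dir = os.path.join(root, "data")
        new_dir = os.path.join(root, "data2")
        os.makedirs(old_dir)
        os.makedirs(new_dir)
        old = DiskBackend("olddata", old_dir, order=1)
        new = DiskBackend("newdata", new_dir, order=0)
        # Dataset 1 predates the new storage pool.
        with open(old.path_for(1), "w"):
            pass
        store = HierarchicalStore([old, new])
        created_in = store.create(2)
        return created_in, store.locate(1), store.locate(2), store.locate(3)


if __name__ == "__main__":
    print(demo())
```

In this sketch the old dataset is still found in the old location even though every new dataset lands in the new one, which is why no migration is needed.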

Hands-on: Adding Hierarchical Storage

  1. Open your group variables file and set the object_store_config_file variable:

    galaxy_config:
      galaxy:
        object_store_config_file: "{{ galaxy_config_dir }}/object_store_conf.xml"
    
  2. In your group variables file, add it to the galaxy_config_templates section:

    galaxy_config_templates:
      - src: templates/galaxy/config/object_store_conf.xml
        dest: "{{ galaxy_config.galaxy.object_store_config_file }}"
    
  3. Create and edit templates/galaxy/config/object_store_conf.xml with the following contents:

    <?xml version="1.0"?>
    <object_store type="hierarchical">
        <backends>
            <backend id="newdata" type="disk" order="0">
                <files_dir path="/data2" />
                <extra_dir type="job_work" path="/data2/job_work_dir" />
            </backend>
            <backend id="olddata" type="disk" order="1">
                <files_dir path="/data" />
                <extra_dir type="job_work" path="/data/job_work_dir" />
            </backend>
        </backends>
    </object_store>
    
  4. Add a pre_task in your playbook galaxy.yml file to create the /data2 folder using the file module.

        - name: Create the second storage directory
          file:
            owner: galaxy
            group: galaxy
            path: /data2
            state: directory
            mode: '0755'
    

    We’ve hardcoded the user/group because creating a storage directory is unusual. In normal practice someone provides you with an NFS mount and you will simply point your Galaxy there.

  5. Run the playbook and restart Galaxy.

  6. Run a couple of jobs after Galaxy has restarted.

    Question

    Where is the data now stored?

    Solution

    You should see /data2 in the dataset’s Full Path. If not, something went wrong; check that the “order” attributes in your object store configuration are correct.
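
If the “order” values are in doubt, one way to double-check them is to parse the configuration and list the backends in the order they would be consulted. The following is a minimal sketch using only the Python standard library; the backends_in_order helper is a hypothetical name, and the XML is embedded as a string so that the example is self-contained (in practice you would read your deployed object_store_conf.xml).

```python
import xml.etree.ElementTree as ET

# The hierarchical configuration from step 3, embedded for a self-contained example.
CONFIG = """<?xml version="1.0"?>
<object_store type="hierarchical">
    <backends>
        <backend id="newdata" type="disk" order="0">
            <files_dir path="/data2" />
            <extra_dir type="job_work" path="/data2/job_work_dir" />
        </backend>
        <backend id="olddata" type="disk" order="1">
            <files_dir path="/data" />
            <extra_dir type="job_work" path="/data/job_work_dir" />
        </backend>
    </backends>
</object_store>
"""


def backends_in_order(xml_text):
    """Return (order, id, files_dir) tuples sorted by the order attribute."""
    root = ET.fromstring(xml_text)
    result = []
    for backend in root.find("backends"):
        files_dir = backend.find("files_dir")
        result.append((
            int(backend.get("order")),
            backend.get("id"),
            files_dir.get("path") if files_dir is not None else None,
        ))
    return sorted(result)


if __name__ == "__main__":
    for order, backend_id, path in backends_in_order(CONFIG):
        print(order, backend_id, path)
```

The first entry listed should be the backend that you expect to receive new datasets, here newdata at /data2.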

Distributed Object Store

Rather than searching a hierarchy of object stores until the dataset is found, Galaxy can store the ID (in the database) of the object store in which a dataset is located when the dataset is created. This allows Galaxy to write to more than one object store for new datasets.
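
As a rough illustration of the idea, the sketch below picks a backend at random in proportion to its weight when a dataset is created, and records the chosen ID so that later lookups need no searching. It is a simplified model under stated assumptions, not Galaxy's code: the DistributedStore class is hypothetical, and a plain dictionary stands in for the object_store_id column in the database.

```python
import random


class DistributedStore:
    """Hypothetical model of weighted backend selection with a stored backend ID."""

    def __init__(self, weights, seed=None):
        # weights maps a backend id to its weight, e.g. {"newdata": 1, "olddata": 1}.
        self.ids = list(weights)
        self.weights = [weights[backend_id] for backend_id in self.ids]
        self.rng = random.Random(seed)
        # Stand-in for the database: dataset id -> object_store_id.
        self.database = {}

    def create(self, dataset_id):
        # Choose one backend at random, in proportion to the weights.
        backend_id = self.rng.choices(self.ids, weights=self.weights, k=1)[0]
        self.database[dataset_id] = backend_id
        return backend_id

    def locate(self, dataset_id):
        # No searching: the backend was recorded when the dataset was created.
        return self.database.get(dataset_id)


if __name__ == "__main__":
    store = DistributedStore({"newdata": 1, "olddata": 1})
    for dataset_id in range(1, 5):
        print(dataset_id, store.create(dataset_id))
```

With equal weights each backend is equally likely to be chosen for a new dataset; giving one backend a larger weight would send it a proportionally larger share.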

Hands-on: Distributed Object Store

  1. Edit your templates/galaxy/config/object_store_conf.xml file and replace the contents with:

    <?xml version="1.0"?>
    <object_store type="distributed">
        <backends>
            <backend id="newdata" type="disk" weight="1">
                <files_dir path="/data2"/>
                <extra_dir type="job_work" path="/data2/job_work_dir"/>
            </backend>
            <backend id="olddata" type="disk" weight="1">
                <files_dir path="/data"/>
                <extra_dir type="job_work" path="/data/job_work_dir"/>
            </backend>
        </backends>
    </object_store>
    
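  2. Run the playbook and restart Galaxy.

  3. Run four or so jobs and check where the outputs appear. You should see that they are split relatively evenly between the two data directories. Because the backend is chosen at random according to the weights, a handful of jobs may not split exactly evenly.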
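
To check the split, you can count the dataset files under each directory. The sketch below is one possible way to do this with the Python standard library; the count_datasets helper is a hypothetical name, and it assumes that dataset files are named dataset_<id>.dat and that the job_work_dir subdirectories should be skipped. Run it as a user that can read both directories.

```python
import os


def count_datasets(directories, skip_dirs=("job_work_dir",)):
    """Count files named dataset_*.dat below each directory."""
    counts = {}
    for directory in directories:
        total = 0
        for _root, dirs, files in os.walk(directory):
            # Prune job working directories so only stored datasets are counted.
            dirs[:] = [d for d in dirs if d not in skip_dirs]
            total += sum(
                1 for name in files
                if name.startswith("dataset_") and name.endswith(".dat")
            )
        counts[directory] = total
    return counts


if __name__ == "__main__":
    for directory, total in count_datasets(["/data", "/data2"]).items():
        print(f"{directory}: {total} dataset files")
```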

Sites like UseGalaxy.eu use the distributed object store in order to balance dataset storage across 10 different storage pools.

More documentation

More information can be found in the sample file.

Tip: Can I distribute objects based on the user?

Yes! You must write your own dynamic job handler code to handle this. See PR#6552 and PR#10233.

If you implement something like this, please let the GTN know with some example code, and we can include this as a training module for everyone.

Warning: switching object store types will cause issues

We have switched between two different object stores here, but this is not supported. If you need to do this, you will need to update datasets in Galaxy’s database. Any datasets that were created as hierarchical will lack the object_store_id, and you will need to supply the correct one. Do not blindly copy these instructions: understand what they do before running them, and talk to us on Gitter for more help.

  1. Move the datasets to their new location: sudo -u galaxy rsync -avr /hierarchical/000/ /distributed/000/

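  2. Update the database, setting the ID only on datasets that lack one (replace 'data' with the id of the backend that now holds the files): sudo -Hu galaxy psql galaxy -c "UPDATE dataset SET object_store_id='data' WHERE object_store_id IS NULL;"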

  3. Restart your Galaxy

S3 Object Store

Many sites have access to an S3 service (either public AWS, or something private like Swift or Ceph), and you can take advantage of this for data storage.

We will set up a local S3-compatible object store, and then talk to the API of this service.

Hands-on: Setting up an S3-compatible Object Store

  1. Edit your requirements.yml file and add:

    - src: atosatto.minio
      version: v1.1.0
    
  2. Install the role:

    ansible-galaxy install -p roles -r requirements.yml

  3. Edit your group variables to configure the object store:

    minio_server_datadirs: ["/minio-test"]
    minio_access_key: "my-access-key"
    minio_secret_key: "my-super-extra-top-secret-key"
    
  4. Edit your playbook and add the minio role before galaxyproject.galaxy:

        - atosatto.minio
    

    Galaxy will need to use the bucket, and will want it to be there when it boots, so we need to set up the object store first.

  5. Edit the templates/galaxy/config/object_store_conf.xml, and configure the object store as one of the hierarchical backends. The object store does not play nicely with the distributed backend during training preparation. Additionally, reset the orders of the disk backends to be higher than the order of the swift backend.

    @@ -1,13 +1,21 @@
     <?xml version="1.0"?>
    - <object_store type="distributed">
    + <object_store type="hierarchical">
         <backends>
    -        <backend id="newdata" type="disk" weight="1">
    +        <backend id="newdata" type="disk" order="1">
                 <files_dir path="/data2"/>
                 <extra_dir type="job_work" path="/data2/job_work_dir"/>
             </backend>
    -        <backend id="olddata" type="disk" weight="1">
    +        <backend id="olddata" type="disk" order="2">
                 <files_dir path="/data"/>
                 <extra_dir type="job_work" path="/data/job_work_dir"/>
             </backend>
    +        <object_store id="swifty" type="swift" order="0">
    +            <auth access_key="{{ minio_access_key }}" secret_key="{{ minio_secret_key }}" />
    +            <bucket name="galaxy" use_reduced_redundancy="False" max_chunk_size="250"/>
    +            <connection host="127.0.0.1" port="9091" is_secure="False" conn_path="" multipart="True"/>
    +            <cache path="{{ galaxy_mutable_data_dir }}/database/object_store_cache" size="1000" />
    +            <extra_dir type="job_work" path="{{ galaxy_mutable_data_dir }}/database/job_working_directory_swift"/>
    +            <extra_dir type="temp" path="{{ galaxy_mutable_data_dir }}/database/tmp_swift"/>
    +        </object_store>
         </backends>
     </object_store>
    
  6. Run the playbook.

  7. Galaxy should now be configured to use the object store!

  8. When the playbook is done, upload a dataset to Galaxy, and check if it shows up in the bucket:

    $ sudo ls /minio-test/galaxy/000/
    dataset_24.dat
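
The 000 directory in the listing above comes from the way dataset IDs are mapped to subdirectories. The sketch below shows one such mapping in Python. Only the case shown above (dataset 24 stored under 000/) is confirmed by this tutorial; the grouping of larger IDs into blocks of 1000 is an assumption about the scheme, and the directory_hash and relative_path names are hypothetical.

```python
def directory_hash(dataset_id):
    """Map a numeric dataset id to a list of three-digit directory names."""
    digits = str(dataset_id)
    # Ids below 1000 all share the 000 directory.
    if len(digits) < 4:
        return ["000"]
    # Assumption: left-pad with zeros to a multiple of three digits,
    # drop the last three digits (1000 files per directory),
    # then split what remains into chunks of three.
    padded = "0" * (3 - len(digits) % 3) + digits
    padded = padded[:-3]
    return [padded[i:i + 3] for i in range(0, len(padded), 3)]


def relative_path(dataset_id):
    """Build the path of a dataset file relative to the store's root."""
    return "/".join(directory_hash(dataset_id) + [f"dataset_{dataset_id}.dat"])


if __name__ == "__main__":
    print(relative_path(24))  # 000/dataset_24.dat, as in the listing above
```

The same relative layout also appears in the placeholder /hierarchical/000/ and /distributed/000/ paths used in the warning about switching object store types.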
    

Dropbox

Dropbox is a well-known cloud storage service where you can store and share files with anyone. As of 20.09, Galaxy has support for a couple of different file storage backends, including NextCloud (via webdavfs) and Dropbox.

This tutorial will help you set up the connection between Galaxy and Dropbox, allowing your users to add their account details and then access their Dropbox data within Galaxy.

Hands-on: Configure Galaxy to access the Dropbox service

  1. If the folder does not exist, create files/galaxy/config next to your galaxy.yml playbook.

    Input: Bash

    mkdir -p files/galaxy/config
    
  2. Create files/galaxy/config/file_sources_conf.yml with the following contents:

    - type: dropbox
      id: dropbox
      label: Your Dropbox Files
      doc: Your Dropbox files - configure an access token via the user preferences
      accessToken: ${user.preferences.get('dropbox|access_token', '') if $user.preferences else ''}
    
  3. Create files/galaxy/config/user_preferences_extra_conf.yml with the following contents:

    preferences:
        dropbox:
            description: Your Dropbox account
            inputs:
                - name: access_token
                  label: Dropbox access token
                  type: password
                  required: False
    
  4. Inform the galaxyproject.galaxy role of where you would like the file_sources_conf.yml and user_preferences_extra_conf.yml to reside, by setting it in your group_vars/galaxyservers.yml:

    --- a/group_vars/galaxyservers.yml
    +++ b/group_vars/galaxyservers.yml
    @@ -35,6 +35,8 @@ galaxy_config:
         check_migrate_tools: false
         tool_data_path: "{{ galaxy_mutable_data_dir }}/tool-data"
         job_config_file: "{{ galaxy_config_dir }}/job_conf.xml"
    +    file_sources_config_file: "{{ galaxy_config_dir }}/file_sources_conf.yml"
    +    user_preferences_extra_conf_path: "{{ galaxy_config_dir }}/user_preferences_extra_conf.yml"
       uwsgi:
         socket: 127.0.0.1:8080
         buffer-size: 16384
    
  5. Deploy the new config files using the galaxy_config_files var (also from the galaxyproject.galaxy role) in your group vars:

    --- a/group_vars/galaxyservers.yml
    +++ b/group_vars/galaxyservers.yml
    @@ -65,6 +67,12 @@ galaxy_config_templates:
       - src: templates/galaxy/config/job_conf.xml.j2
         dest: "{{ galaxy_config.galaxy.job_config_file }}"
    
    +galaxy_config_files:
    +  - src: files/galaxy/config/user_preferences_extra_conf.yml
    +    dest: "{{ galaxy_config.galaxy.user_preferences_extra_conf_path }}"
    +  - src: files/galaxy/config/file_sources_conf.yml
    +    dest: "{{ galaxy_config.galaxy.file_sources_config_file }}"
    +
     # systemd
     galaxy_systemd_mode: mule
     galaxy_zergpool_listen_addr: 127.0.0.1:8080
    
  6. Run the playbook. At the very end, you should see output like the following indicating that Galaxy has been restarted:

    Input: Bash

    ansible-playbook galaxy.yml
    

    Output

    ...
    RUNNING HANDLER [restart galaxy] ****************************************
    changed: [gat-88.training.galaxyproject.eu]
    

Now we are ready to configure a Galaxy user account to upload datasets from Dropbox to the Galaxy server.

Hands-on: Configure Galaxy to access the Dropbox service

  1. Generate a Dropbox access token following the Dropbox OAuth guide.

  2. Add the Dropbox access token to your Galaxy user preferences:
    • Go to https://<server>/user/information
    • Here you will find the form in which each user enters their own Dropbox access token.
  3. Click the upload icon toward the top left corner. You will have a new “Choose remote files” button that opens the remote files window, with a link to your Dropbox files.
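
The accessToken line in file_sources_conf.yml is a template expression that reads the token each user saved in the form above. In plain Python its logic is roughly the following. This is a sketch, assuming the preferences behave like a dictionary whose keys join the section name from user_preferences_extra_conf.yml (dropbox) and the input name (access_token) with a | character; the dropbox_access_token function is a hypothetical name.

```python
def dropbox_access_token(preferences):
    """Return the saved Dropbox token, or an empty string if there is none."""
    # Mirrors: user.preferences.get('dropbox|access_token', '') if user.preferences else ''
    if not preferences:
        return ""
    return preferences.get("dropbox|access_token", "")


if __name__ == "__main__":
    # A user who filled in the form, and one who has no preferences yet.
    print(repr(dropbox_access_token({"dropbox|access_token": "example-token"})))
    print(repr(dropbox_access_token(None)))
```

A user who has not saved a token therefore gets an empty string rather than an error, and should not expect their Dropbox files to be reachable until they add one.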

Key points

  • The distributed object store configuration allows you to easily expand the storage that is attached to your Galaxy.

  • You can move data around without affecting users.

Frequently Asked Questions

Have questions about this tutorial? Check out the FAQ page for the Galaxy Server administration topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum.


Citing this Tutorial

  1. Nate Coraor, Helena Rasche, Gianmauro Cuccuru, 2021 Distributed Object Storage (Galaxy Training Materials). /archive/2021-06-01/topics/admin/tutorials/object-store/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

BibTeX

@misc{admin-object-store,
    author = "Nate Coraor and Helena Rasche and Gianmauro Cuccuru",
    title = "Distributed Object Storage (Galaxy Training Materials)",
    year = "2021",
    month = "05",
    day = "23",
    url = "\url{/archive/2021-06-01/topics/admin/tutorials/object-store/tutorial.html}",
    note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
        doi = {10.1016/j.cels.2018.05.012},
        url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
        year = 2018,
        month = {jun},
        publisher = {Elsevier {BV}},
        volume = {6},
        number = {6},
        pages = {752--758.e1},
        author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
        title = {Community-Driven Data Analysis Training for Biology},
        journal = {Cell Systems}
}
                    

Congratulations on successfully completing this tutorial!