Rule Based Uploader: Advanced

Overview

question Questions
  • How to use the rule based uploader to create complex collections

objectives Objectives
  • Learn even more about the Rule Based Uploader

requirements Requirements

time Time estimation: 20 minutes

level Level: Advanced level level level

last_modification Last modification: Jun 15, 2020

Introduction

This builds on the previous Rule Based Uploader tutorial to cover even more advanced topics.

comment Audience

This tutorial assumes a basic knowledge of using dataset collections in Galaxy but doesn’t assume any particular knowledge of biology or bioinformatics. If you have not used collections with Galaxy previously, please check out the using dataset collections tutorial.

Agenda

In this tutorial, we will:

  1. Building URLs from Accession Information
  2. Building Matched Collections
  3. Building Nested Lists
  4. Apply Rules to Existing Collections

Building URLs from Accession Information

In the previous examples the metadata we started with already contained URLs. In some cases such URLs will not be present in the available metadata and may need to be constructed dynamically from identifiers.

For this multiomics example, we will start with a uniprot query and build URLs from accession numbers contained within the supplied tabular data. Consider the uniprot query https://uniprot.org/uniprot/?query=proteome:UP000052092+AND+proteomecomponent:”Genome”, pictured below.

UniProt table screenshot

In order to make this data useful for the Rule Builder, we need to turn these accessoin IDs into URLs.

hands_on Hands-on: Turn Uniprot Accession IDs into URLs

  1. Open Uniprot
  2. Click on one of the entries in the table (e.g. E7C0H6)
  3. Select Format from the top menu
  4. Click FASTA (canonical)

    Link to access FASTA file

  5. Your browser will redirect to the fasta file

    Viewing the FASTA file

  6. We can deduce that the FASTA files for the other accession IDs will be available from URLs of the form https://www.uniprot.org/uniprot/{identifier}.fasta

We will use this information to build a collection of FASTA files from our query.

hands_on Hands-on: Build Collection from Uniprot Accession IDs

  1. Open the Rule Builder
    • “Upload data as”: Collection(s)
    • “Load tabular data from”: Pasted Table
    • Paste the following table:

      Entry	Entry name	Status	Protein names	Gene names	Organism	Length
      E7C0H6	E7C0H6_9PAPI	unreviewed	Major capsid protein L1	L1	Equus caballus papillomavirus 3	498
      E7C0H0	E7C0H0_9PAPI	unreviewed	Protein E6	E6	Equus caballus papillomavirus 3	150
      E7C0H5	E7C0H5_9PAPI	unreviewed	Minor capsid protein L2	L2	Equus caballus papillomavirus 3	498
      E7C0H1	E7C0H1_9PAPI	unreviewed	Protein E7		Equus caballus papillomavirus 3	93
      E7C0H3	E7C0H3_9PAPI	unreviewed	Regulatory protein E2	E2	Equus caballus papillomavirus 3	421
      E7C0H4	E7C0H4_9PAPI	unreviewed	Putative E4 early protein (Fragment)		Equus caballus papillomavirus 3	175
      E7C0H2	E7C0H2_9PAPI	unreviewed	Replication protein E1 (EC 3.6.4.12) (ATP-dependent helicase E1)	E1	Equus caballus papillomavirus 3	621
      
    • Click Build screenshot
  2. Let’s apply some initial filtering to our data:
    • From Filter menu, select First or Last N Rows
      • Filter the first row
    • From Rules menu, select Remove Columns
      • Remove columns B, C, E, F, and G.
    • From Rules menu, select Add / Modify Column Definitions
      • Click Add Definition, Info, Column B

        This is a block of text that will appear in the history panel when the dataset is expanded. screenshot

    These datasets appear in a seemingly random order, it will be easier to manage things in the history panel if we sort this data first. screenshot

  3. From Rules menu, select Sort
    • “From Column”: A

    Next is the key step, we will build a URL from the pattern we described above using the accession ID in column A.

  4. From Column menu, select Using a Regular Expression
    • “From Column”: A
    • Select Create column from expression replacement
    • “Regular Expression: .*
    • “Replacement Expression: https://www.uniprot.org/uniprot/\0.fasta
    • Click Apply

      screenshot

    comment Regular expression explained

    In this regular expression, .* will capture the entire accession ID.

    In the replacement expression, https://www.uniprot.org/uniprot/\0.fasta, the \0 will be replaced by the captured regular expression (the accession ID) in the resulting column values.

    The new column C with the URL we built should appear as shown below. screenshot

    Finally, let us set our column definitions:

  5. From Rules menu, select Add / Modify Column Definitions
    • Add Definition, List Identifier(s), Select Column A
    • Add Definition, URL, Column C
  6. Finalize the collection:
    • “Name”: for example UP000052092
    • “Type”: fasta
    • Click Upload screenshot

    comment JSON Rule Definitions

    This example is ready to go, but before clicking “Build” it may be interesting to check out the rules Galaxy is following to clean and import your data. Click the little Wrench icon at the top of the list of rules. The result is a bunch of JavaScript object notation (JSON) text that you should never need to worry about but that you can build or modify by hand if you find it useful. We will use it the next example to quickly restore the list builder back to this state.

    screenshot

    This could additionally be copied and pasted if you need to do the same set of operations on multiple metadata inputs that are similarly formatted.

  7. Click Build and wait for your list of FASTA files to appear.

Building Matched Collections

This example will demonstrate creating multiple collections at the same time. We will use the same metadata generated from UniProt as the last example but we will build two collections with matching list identifiers - one collection of FASTA files and one collection of GFF files. This will also demonstrate reading the collection name and the target datatype from the metadata itself - important techniques if generating multiple collections with different names and datatypes.

For this example we will re-use the metadata from the previous example.

hands_on Hands-on: Build a matched collection

  1. Open the Rule Builder
    • “Upload data as”: Collection(s)
    • “Load tabular data from”: Pasted Table
    • Paste the table from the previous exercise:
      Entry	Entry name	Status	Protein names	Gene names	Organism	Length
      E7C0H6	E7C0H6_9PAPI	unreviewed	Major capsid protein L1	L1	Equus caballus papillomavirus 3	498
      E7C0H0	E7C0H0_9PAPI	unreviewed	Protein E6	E6	Equus caballus papillomavirus 3	150
      E7C0H5	E7C0H5_9PAPI	unreviewed	Minor capsid protein L2	L2	Equus caballus papillomavirus 3	498
      E7C0H1	E7C0H1_9PAPI	unreviewed	Protein E7		Equus caballus papillomavirus 3	93
      E7C0H3	E7C0H3_9PAPI	unreviewed	Regulatory protein E2	E2	Equus caballus papillomavirus 3	421
      E7C0H4	E7C0H4_9PAPI	unreviewed	Putative E4 early protein (Fragment)		Equus caballus papillomavirus 3	175
      E7C0H2	E7C0H2_9PAPI	unreviewed	Replication protein E1 (EC 3.6.4.12) (ATP-dependent helicase E1)	E1	Equus caballus papillomavirus 3	621
      
    • Click Build and proceed to the rule builder. screenshot
  2. Instead of manually creating the rules this time, we will import an existing set of rules from some JSON. Copy the following text into your clipboard. These are the same rules as used in the Building URLs from Accession Information tutorial.

    {"rules":[{"type":"add_filter_count","count":"1","which":"first","invert":false},{"type":"remove_columns","target_columns":[1,2,4,5,6]},{"type":"sort","target_column":0,"numeric":false},{"type":"add_column_regex","target_column":0,"expression":".*","replacement":"http://www.uniprot.org/uniprot/\\0.fasta"}],"mapping":[{"type":"info","columns":[1]},{"type":"list_identifiers","columns":[0],"editing":false},{"type":"url","columns":[2]}],"extension":"csfasta"}
    
  3. In the ruler builder interface, select the wrench icon next to the word Rules
    • Paste the rules into the textbox
    • Click Apply

    screenshot

    You should now see the rules you created in the last example. screenshot

    This next part may seem a bit silly at first but we are going to add some columns with fixed values into the builder. When we split up the columns at a later step this will make sense.

  4. From Column menu, select Fixed Value
    • “Value”: fasta
    • Click Apply This value will eventually be used for the datatype of the file.
  5. Repeat this process with:
    • “Value”:UP000052092 FASTA

    Next we will repeat the process of adding URL, name, and datatype columns but for GFF files.

  6. From Column, select Using a Regular Expression
    • “From Column”: A
    • Select Create column from expression replacement
    • “Regular Expression: .*
    • “Replacement Expression: https://www.uniprot.org/uniprot/\0.gff. screenshot

    Next we will add two more columns

  7. From Column menu, select Fixed Value
    • “Value”: gff3
    • Click Apply

    This value will eventually be used for the datatype of the file.

  8. Repeat this process with:
    • “Value”: UP000052092 GFF3

    Your very large list of rules should now look like the following screenshot. screenshot

    Notice we have two URLs, two collection names, and two datatype extensions for each accession ID we started with. Like in the example where we split the columns, here we will split these up to describe multiple collections.

  9. From Rules menu, select Split Columns
    • “Odd Number Column(s)”: C, D, and E (the fasta columns)
    • “Even Number Column(s)”: F, G, and H (the gff3 columns)

    This will take the row consisting of the columns ABCDEFGH and build two rows, one with ABCDEF and the other with ABFGH screenshot

  10. Click Apply and you should be returned the list of rules. screenshot

    Finally, we need to add some more column definitions for these new columns we just created:

  11. From Rules menu, select Add / Modify Column Definitions
    • Add Definition, List Identifier, Column A
    • Add Definition, Info, Column B
    • Add Definition, URL, Column C
    • Add Definition, Type, Column D
    • Add Definition, Collection Name, Column E

    Notice when these values are being generated from the metadata the option to specify them manually from the type and collection name boxes from the bottom of the form disappear. screenshot

  12. Click Upload

    Galaxy should make two collections - one containing FASTA files and one containing GFF3 files.

Building Nested Lists

In this example, we will be building a nested list using data from SRA. This is a more sophisticated structure for organizing datasets in Galaxy. In the above examples we organized datasets into simple lists with a single “list identifier” describing the files in the collection. Galaxy allows lists to be organized into nested lists - where each level of the list has an identifier.

If two such identifiers are present, this is a list of lists (called list:list in the workflow editor). In such a structure the outer identifiers (or first level of identifiers) may describe sample names and the inner identifiers (or second level) may describe replicates. Alternative the outer identifiers may describe conditions and the inner identifiers may describe individual samples. The structure of such collections should ideally be dictated by the study design.

If certain parts of your analysis require benefit from datasets being nested this way while other parts require feeding you data to a Galaxy tool all together without such a structure, then it is probably best to go ahead and build nested lists at the start of your analysis and then use the “Flatten Collection” tool on the resulting collection or a derivative collection to get the flat structure needed by certain tools in certain parts of your analysis.

For this example, we will describe analyzing the metadata of the SRA project PRJNA355367. Unlike the other examples, these SRA files are relatively large and not ideal for training purposes. So we’ve pre-downloaded the project metadata and replaced all the links with simple text files that should download really quickly - the result is here.

So use either the SRA exporter tool or download the CSV file with fake URLs. If you download the data from the SRA exporter tool, select only the first 12 columns from the data (up to the column labeled “Library Name”) and copy the resulting data to your clipboard.

hands_on Hands-on: Building Nested Lists

  1. Open the Rule Builder
    • “Upload data as”: Collection(s)
    • “Load tabular data from”: Pasted Table
    • Paste the table

    solution Show Table

    Below is the table with pre-downloaded example data mentioned above:

    screenshot

    The resulting table should look something like the following: screenshot

    Navigate to the end of the table and notice that column J is the URL target we are hoping to download for each file.

  2. As in the previous exercises, please:
    • Set this Column definition for the URL target
    • Strip the header row

      solution Solution

      From Filter menu, select First or Last N Rows

      • Filter first Row

      From Rules menu, select Add / Modify Column Definitions

      • Add Definition, URL, Column J. screenshot

    For the analysis we wish to do, we want to group these files based on the type indicated in column L (LibraryName) shown below. The source data though adds numbers to the library type to to generate the LibraryName, we need to strip those out to use the type as an identifier for grouping the datasets. To do this, use the regex column adder rule again.

  3. From Column menu, select Using a Regular Expression
    • “From Column”: L
    • Select Create column from regular expression groups
    • “Regular Expression: ([^\d]+)\d+.
    • “Number of Groups”: 1

    The result looks like: screenshot

    comment Regular expression explained

    In this regular expression, \d means any digit, so [^...] means match anything that is not inside the brackets. So together [^\d]+ means match one or more, non digits at the start of the column and the () around that means capture them into a group. We’ve add \d+ at the end of the expression but it isn’t grouped so we are effectively ignoring the digits at the end as we had hoped.

    Now we have two columns we need to assign list identifiers for, the new column M for the first, outer identifier and the first column A for the inner, second identifier.

  4. From Rules menu, select Add / Modify Column Definitions
    • Add Definition, List Identifier(s), Column M
    • Click on ... Assign Another Column
    • Select column A

      comment Re-ordering columns

      If you make a mistake in the order you select columns in you can simple use the up and down arrows to re-arrange the list

    The result should look something like this: screenshot

  5. Click Apply to return to the rule preview screen and notice there are two column listed for the list identifier definition screenshot

  6. Finalize your collection:
    • “Type”: txt
    • Name your Collection
    • Click Upload

    Note: set “Type” to sra if you used real data from SRA instead of the table from exercise. screenshot

Apply Rules to Existing Collections

To start this example, we will first upload a simple, flat collection of data. The data files we will use will be the same as those used by the DESeq2 Vignette from the Pasilla Bioconductor Package.

hands_on Hands-on: Applying Rules to Existing Collections

  1. Open the Rule Builder
    • “Upload data as”: Collection(s)
    • “Load tabular data from”: Pasted Table
    • Paste the following table:
    https://raw.githubusercontent.com/jmchilton/galaxy/apply_rules_tutorials/test-data/rules/treated1fb.txt treated_single_1
    https://raw.githubusercontent.com/jmchilton/galaxy/apply_rules_tutorials/test-data/rules/treated2fb.txt treated_paired_2
    https://raw.githubusercontent.com/jmchilton/galaxy/apply_rules_tutorials/test-data/rules/treated3fb.txt treated_paired_3
    https://raw.githubusercontent.com/jmchilton/galaxy/apply_rules_tutorials/test-data/rules/untreated1fb.txt untreated_single_4
    https://raw.githubusercontent.com/jmchilton/galaxy/apply_rules_tutorials/test-data/rules/untreated2fb.txt untreated_single_5
    https://raw.githubusercontent.com/jmchilton/galaxy/apply_rules_tutorials/test-data/rules/untreated3fb.txt untreated_paired_6
    https://raw.githubusercontent.com/jmchilton/galaxy/apply_rules_tutorials/test-data/rules/untreated4fb.txt untreated_paired_7
    

    Next, we set up the column types:

  2. From Rules menu, select Add / Modify Column Definitions
    • Add Definition, URL, column A
    • Add Definition, List Identifier(s), column B screenshot
  3. Finalize the collection:
    • “Type”: txt
    • Give the collection a name.
    • Click Build screenshot

    The first thing we will do to this new collection is add some levels or depth to its structure. Lets assume we want to group it into “treated” and “untreated” lists and “paired” and “single” sublists below that. We can do this with the Apply Rules collection operation tool, which will likely be under the Collection Operations tool menu in your Galaxy interface.

  4. Open the Apply Rule to Collection tool

    The very simple interface should look something like this: screenshot

    This interface simply lets one pick a collection to operate on and then launch the rule builder window to work to describe and preview manipulating the metadata of that collection.

  5. Apply Rule to Collection tool with the following parameters:
    • “Input Collection”: the collection we just uploaded
    • Click the Edit button screenshot

    When a flat collection is used with this tool, the rule builder will initialize a default rule to pull the list identifier out for each item of the collection as shown above. Next we will use regular expressions to build two new columns, these columns will group the datasets into “treated” and “untreated” sublists and then “single” and “paired” sublists of that. This rule is found under the Column menu.

    We’ll transform column A into some more useful columns

  6. From Column, select Using a Regular Expression
    • “Column”: A
    • Select Create columns matching expression groups
    • “Regular Expression: (.*)_(.*)_.*
    • “Number of Groups”: 2

    comment Regular expression explained

    Here .* means match any number of any character - so basically match anything. The parentheses around .* means form a “group” from whatever is matched. The _ describes the literal _ values in the identifier we are matching. The result is that everything before the first _ will be matched as the first group and everything between the _ characters will be matched as the second group. Click to apply this rule and two new columns should be created.

    We have defined two matching groups (the (.*) in the regex), so the number of groups is set to 2

  7. From Rules, select Add / Modify Column Definitions
    • Add Definition, List Identifier(s), Column B
    • Assign another column, column C
    • Assign another column, column A
    • Click Save

    screenshot

    The resulting collection should have two new levels of depth for “untreated” vs “treated” and “paired” vs “single” as shown below: screenshot

    Placing the “single” and “paired” sublists inside the “treated” and “untreated” lists was a bit arbitrary and depending on the workflow these may need to be inverted at different parts of an analysis. The Apply Rules tool can be used to invert these levels if that is needed.

  8. Apply Rule to Collection tool with the following parameters:
    • “Input Collection”: the new nested list we created in the previous step
    • Click “Edit”.

    screenshot

    Notice when loading a nested collection into the rule builder, there is a column for each layer of the list for each element. One can pretty easily invert the outer two layers of the list by simply assigning the list identifiers in a new order. So select to assign “List Identifiers” and this time assign them to columns “B”, “A”, and “C” - in that order.

  9. From Rules, select Add / Modify Column Definitions
    • Add Definition, List Identifier(s), Column B
    • Assign another column, Column A
    • Assign another column, Column C
    • Click Save

    screenshot

    The resulting collection should be inverted.

    screenshot

    In addition to structural re-organizations of a collection, the Apply Rules tool can be used to filter elements out of the collection.

  10. Apply Rule to Collection tool with the following parameters:
    • “Input Collection”: the original flat list created
    • Click Edit
    • From Filter menu, select From a regular expression
      • “From Column”: A
      • “Regular Expression: .*_single_.*
    • From Rules menu, select Add / Modify Column Definitions
      • Add Definition, List Identifier(s), Column A

    screenshot

    The resulting collection should be a filtered version of the original list.

    screenshot

    Structural re-organizations of collections can also be combined with filtering. To demonstrate this, reopen the original flat list created for this example again in the rule builder of the Apply Rules tool. Use the same regular expression as last time to filter the result but also add a column for “treated” and “untreated” list identifiers.

  11. Apply Rule to Collection tool with the following parameters:
    • From Filter menu, select Using a Regular Expression
      • “From Column”: A
      • “Regular Expression: .*_single_.*
    • From Column menu, select Using a Regular Expression
      • “From Column”: A
      • Select Create column from regular expression groups
      • “Regular Expression: (.*)_single_.*
      • “Number of Groups”: 1
    • From Rules menu, select Add / Modify Column Definitions
      • Add Definition, List Identifier(s). Column B
      • Assign another column, Column A

    screenshot

    The resulting collection should be a filtered to only include the “single” data and broken into “treated” and “untreated” sublists.

    screenshot

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.

Click here to load Google feedback frame

Citing this Tutorial

  1. John Chilton, Helena Rasche, 2020 Rule Based Uploader: Advanced (Galaxy Training Materials). /archive/2021-02-01/topics/galaxy-interface/tutorials/upload-rules-advanced/tutorial.html Online; accessed Wed Mar 26 2025
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

details BibTeX

congratulations Congratulations on successfully completing this tutorial!