Generating a single cell matrix using Alevin

Author(s) orcid logoWendi Bacon avatar Wendi BaconJonathan Manning avatar Jonathan Manning
Editor(s) orcid logoHelena Rasche avatar Helena Rasche
Tester(s) orcid logoJulia Jakiela avatar Julia Jakiela
Reviewers Pavankumar Videm avatarBjörn Grüning avatarMarisa Loach avatarWendi Bacon avatarTeresa Müller avatarHelena Rasche avatarSaskia Hiltemann avatarMehmet Tekman avatarJonathan Manning avatarBeatriz Serrano-Solano avatarDavid López avatar
Overview
Creative Commons License: CC-BY Questions:
  • I have some single cell FASTQ files I want to analyse. Where do I start?

Objectives:
  • Generate a cellxgene matrix for droplet-based single cell sequencing data

  • Interpret quality control (QC) plots to make informed decisions on cell thresholds

  • Find relevant information in GTF files for the particulars of their study, and include this in data matrix metadata

Requirements:
Time estimation: 2 hours
Supporting Materials:
Published: Mar 3, 2021
Last modification: Oct 28, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
purl PURL: https://gxy.io/GTN:T00245
rating Rating: 4.0 (1 recent ratings, 5 all time)
version Revision: 19

This tutorial will take you from raw FASTQ files to a cell x gene data matrix in AnnData format. What’s a data matrix, and what’s AnnData format? Well you’ll find out! Importantly, this is the first step in processing single cell data in order to start analysing it. Currently you have a bunch of strings of ATGGGCTT etc. in your sequencing files, and what you need to know is how many cells you have and what genes appear in those cells. These steps are the most computationally heavy in the single cell world, as you’re starting with 100s of millions of reads, each with 4 lines of text. Later on in analysis, this data becomes simple gene counts such as ‘Cell A has 4 GAPDHs’, which is a lot easier to store! Because of this data overload, we have downsampled the FASTQ files to speed up the analysis a bit. Saying that, you’re still having to map loads of reads to the massive murine genome, so get yourself a cup of coffee and prepare to analyse!

Agenda

In this tutorial, we will cover:

  1. Generating a matrix
    1. Get Data
  2. Important tips for easier analysis
    1. Generate a transcript to gene map
    2. Generate a transcriptome index & quantify!
  3. Basic QC
    1. Generate an unprocessed matrix in a usable format
    2. Adding in Gene metadata
  4. emptyDrops
  5. Analysing multiple FASTQ files
  6. Mitochondrial flagging
  7. Conclusion

Generating a matrix

In this section, we will show you the principles of the initial phase of single-cell RNA-seq analysis: generating expression measures in a matrix. We’ll concentrate on droplet-based (rather than plate-based) methodology, since this is the process with most differences with respect to conventional approaches developed for bulk RNA-seq.

Droplet-based data consists of three components: cell barcodes, unique molecular identifiers (UMIs) and cDNA reads. To generate cell-wise quantifications we need to:

  • Process cell barcodes, working out which ones correspond to ‘real’ cells, which to sequencing artefacts, and possibly correct any barcodes likely to be the product of sequencing errors by comparison to more frequent sequences.
  • Map biological sequences to the reference genome or transcriptome.
  • ‘De-duplicate’ using the UMIs.

This used to be a complex process involving multiple algorithms, or was performed with technology-specific methods (such as 10X’s ‘Cellranger’ tool) but is now much simpler thanks to the advent of a few new methods. When selecting methodology for your own work you should consider:

  • STARsolo - a droplet-based scRNA-seq-specific variant of the popular genome alignment method STAR. Produces results very close to those of Cellranger (which itself uses STAR under the hood).
  • Kallisto/ bustools - developed by the originators of the transcriptome quantification method, Kallisto.
  • Alevin - another transcriptome analysis method developed by the authors of the Salmon tool.

We’re going to use Alevin Srivastava et al. 2019 for demonstration purposes, but we do not endorse one method over another.

Get Data

We’ve provided you with some example data to play with, a small subset of the reads in a mouse dataset of fetal growth restriction Bacon et al. 2018 (see the study in Single Cell Expression Atlas and the project submission). This is a study using the Drop-seq chemistry, however this tutorial is almost identical to a 10x chemistry. We will point out the one tool parameter change you will need to run 10x samples. This data is not carefully curated, standard tutorial data - it’s real, it’s messy, it desperately needs filtering, it has background RNA running around, and most of all it will give you a chance to practice your analysis as if this data were yours.

Down-sampled reads and some associated annotation can be imported below. How did I downsample these FASTQ files? Check out this history to find out!

Additionally, to map your reads, you will need a transcriptome to align against (a FASTA) as well as the gene information for each transcript (a gtf) file. You can download these for your species of interest from Ensembl. These files are included in the data import step below. Keep in mind, these are big files, so the fastest way to get these into your Galaxy account is through importing them by history.

Hands-on: Option 1: Data upload - Import history
  1. Import history from: example input history

    1. Open the link to the shared history
    2. Click on the Import this history button on the top left
    3. Enter a title for the new history
    4. Click on Copy History

  2. Rename galaxy-pencil the the history to your name of choice.

Hands-on: Option 2: Data upload - Add to history
  1. Create a new history for this tutorial
  2. Import the Experimental Design table, sequencing reads 1 & 2, the GTF and fasta files from Zenodo

    https://zenodo.org/record/4574153/files/Experimental_Design.tabular
    https://zenodo.org/record/4574153/files/Mus_musculus.GRCm38.100.gtf.gff
    https://zenodo.org/record/4574153/files/Mus_musculus.GRCm38.cdna.all.fa.fasta
    https://zenodo.org/record/4574153/files/SLX-7632.TAAGGCGA.N701.s_1.r_1.fq-400k.fastq
    https://zenodo.org/record/4574153/files/SLX-7632.TAAGGCGA.N701.s_1.r_2.fq-400k.fastq
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

  3. Rename galaxy-pencil the datasets: Change SLX-7632.TAAGGCGA.N701.s_1.r_1.fq-400k to N701-Read1 and SLX-7632.TAAGGCGA.N701.s_1.r_2.fq-400k.fastq to N701-Read2.
Question

Have a look at the files you now have in your history.

  1. Which of the FASTQ files do you think contains the barcode sequences?
  2. Given the chemistry this study should have, are the barcode/UMI reads the correct length?
  3. What is the ‘N701’ referring to?
  1. Read 1 (SLX-7632.TAAGGCGA.N701.s_1.r_1.fq-400k) contains the cell barcode and UMI because it is significantly shorter (indeed, 20 bp!) compared to the longer, r_2 transcript read. For ease, it’s better to rename these files N701-Read1 and N701-Read2.
  2. You can see Read 1 is only 20 bp long, which for original Drop-Seq is 12 bp for cell barcode and 8 bp for UMI. This is correct! Be warned - 10x Chromium (and many technologies) change their chemistry over time, so particularly when you are accessing public data, you want to check and make sure you have your numbers correct!
  3. ‘N701’ is referring to an index read. This sample was run alongside 6 other samples, each denoted by an Illumina Nextera Index (N70X). Later, this will tell you batch information. If you look at the ‘Experimental Design’ file, you’ll see that the N701 sample was from a male wildtype neonatal thymus.

Important tips for easier analysis

Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.

  • Open your Galaxy server
  • Click on the curriculum icon on the top menu, this will open the GTN inside Galaxy.
  • Navigate to your tutorial
  • Tool names in tutorials will be blue buttons that open the correct tool for you
  • Note: this does not work for all tutorials (yet) gif showing how GTN-in-Galaxy works
  • You can click anywhere in the grey-ed out area outside of the tutorial box to return back to the Galaxy analytical interface
Warning: Not all browsers work!
  • We’ve had some issues with Tutorial mode on Safari for Mac users.
  • Try a different browser if you aren’t seeing the button.

Did you know we have a unique Single Cell Omics Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.

The Single Cell Omics Lab currently uses the main European Galaxy infrastructure and power, it’s just organised better for users of particular analyses…like single cell!

Try it out!

When something goes wrong in Galaxy, there are a number of things you can do to find out what it was. Error messages can help you figure out whether it was a problem with one of the settings of the tool, or with the input data, or maybe there is a bug in the tool itself and the problem should be reported. Below are the steps you can follow to troubleshoot your Galaxy errors.

  1. Expand the red history dataset by clicking on it.
    • Sometimes you can already see an error message here
  2. View the error message by clicking on the bug icon galaxy-bug

  3. Check the logs. Output (stdout) and error logs (stderr) of the tool are available:
    • Expand the history item
    • Click on the details icon
    • Scroll down to the Job Information section to view the 2 logs:
      • Tool Standard Output
      • Tool Standard Error
    • For more information about specific tool errors, please see the Troubleshooting section
  4. Submit a bug report! If you are still unsure what the problem is.
    • Click on the bug icon galaxy-bug
    • Write down any information you think might help solve the problem
      • See this FAQ on how to write good bug reports
    • Click galaxy-bug Report button
  5. Ask for help!

Generate a transcript to gene map

Gene-level, rather than transcript-level, quantification is standard in scRNA-seq, which means that the expression level of alternatively spliced RNA molecules are combined to create gene-level values. Droplet-based scRNA-seq techniques only sample one end each transcript, so lack the full-molecule coverage that would be required to accurately quantify different transcript isoforms.

To generate gene-level quantifications based on transcriptome quantification, Alevin and similar tools require a conversion between transcript and gene identifiers. We can derive a transcript-gene conversion from the gene annotations available in genome resources such as Ensembl. The transcripts in such a list need to match the ones we will use later to build a binary transcriptome index. If you were using spike-ins, you’d need to add these to the transcriptome and the transcript-gene mapping.

In your example data you will see the murine reference annotation as retrieved from Ensembl in GTF format. This annotation contains gene, exon, transcript and all sorts of other information on the sequences. We will use these to generate the transcript/ gene mapping by passing that information to a tool that extracts just the transcript identifiers we need.

Question

Which of the ‘attributes’ in the last column of the GTF files contains the transcript and gene identifiers?

The file is organised such that the last column (headed ‘Group’) contains a wealth of information in the format: attribute1 “information associated with attribute 1”;attribute2 “information associated with attribute 2” etc.

gene_id and transcript_id are each followed by “ensembl gene_id” and “ensembl transcript_id”

It’s now time to parse the GTF file using the rtracklayer package in R. This parsing will give us a conversion table with a list of transcript identifiers and their corresponding gene identifiers for counting. Additionally, because we will be generating our own binary index (more later!), we also need to input our FASTA so that it can be filtered to only contain transcriptome information found in the GTF.

Hands-on: Generate a filtered FASTA and transcript-gene map
  1. GTF2GeneList ( Galaxy version 1.52.0+galaxy0) with the following parameters:
    • param-file “Ensembl GTF file”: GTF file in the history galaxy-history
    • “Feature type for which to derive annotation”: transcript (Your sequences are transcript sequencing, so this is your starting point)
    • “Field to place first in output table”: transcript_id (This is accessing the column you identified above!)
    • “Suppress header line in output?”: Yes (The next tool (Alevin) does not expect a header)
    • “Comma-separated list of field names to extract from the GTF (default: use all fields)”: transcript_id,gene_id (This calls the first column to be the transcript_id, and the second the gene_id. Thus, your key can turn transcripts into genes)
    • “Append version to transcript identifiers?”: Yes (The Ensembl FASTA files usually have these, and since we need the FASTA transcriptome and the GTF gene information to work together, we need to append these!)
    • “Flag mitochondrial features?”: No
    • “Provide a cDNA file for extracting annotations and/ or possible filtering?”: Yes
    • param-file “FASTA-format cDNA/transcript file”: FASTA file in your history galaxy-history
    • “Annotation field to match with sequences”: transcript_id
    • “Filter the cDNA file to match the annotations?”: Yes
  2. Rename galaxy-pencil the annotation table to Map

  3. Rename galaxy-pencil the uncompressed filtered FASTA file to Filtered FASTA

Generate a transcriptome index & quantify!

Alevin collapses the steps involved in dealing with dscRNA-seq into a single process. Such tools need to compare the sequences in your sample to a reference containing all the likely transcript sequences (a ‘transcriptome’). This will contain the biological transcript sequences known for a given species, and perhaps also technical sequences such as ‘spike ins’ if you have those.

To be able to search a transcriptome quickly, Alevin needs to convert the text (FASTA) format sequences into something it can search quickly, called an ‘index’. The index is in a binary rather than human-readable format, but allows fast lookup by Alevin. Because the types of biological and technical sequences we need to include in the index can vary between experiments, and because we often want to use the most up-to-date reference sequences from Ensembl or NCBI, we can end up re-making the indices quite often. Making these indices is time-consuming! Have a look at the uncompressed FASTA to see what it starts with.

We now have:

  • Barcode/ UMI reads
  • cDNA reads
  • transcript/ gene mapping
  • filtered FASTA

We can now run Alevin. In some public instances, Alevin won’t show up if you search for it. Instead, you may have to click the Single Cell tab at the left and scroll down to the Alevin tool. Alternatively, use Tutorial Mode as described above and you’ll easily navigate to all the tools, and their versions will all be the tried and tested ones of this tutorial. It’s often a good idea to check your tool versions. To identify which version of a tool you are using, select tool-versions ‘Versions’ and choose the appropriate version. In this case the tutorial was built with Alevin Galaxy Version 1.9.0+galaxy2, but will also work with the versions named in the Hands-on sections.

Hands-on: Running Alevin
  1. Alevin ( Galaxy version 1.10.1+galaxy0)

    Question

    Try to fill in the parameters of Alevin using what you know!

    The Salmon documentation on ‘Fragment Library Types’ and running the Alevin command is here: salmon.readthedocs.io/en/latest/library_type.html and salmon.readthedocs.io/en/latest/alevin.html. These links will help here, although keep in mind the image there is drawn with the RNA 5’ on top, whereas in this scRNA-seq protocol, the polyA is captured by its 3’ tail and thus effectively the bottom or reverse strand…)

    • “Select a reference transcriptome from your history or use a built-in index?”: Use one from the history
      • You are going to generate the binary index using your filtered FASTA!
    • param-file “Transcripts FASTA file”: Filtered FASTA
    • “Single or paired-end reads?”: Paired-end
    • param-file “Mate pair 1”: N701-Read1
    • param-file “Mate pair 2”: N701-Read2
    • “Specify the strandedness of the reads”: Infer automatically (A)
    • “Type of single-cell protocol”: DropSeq Single Cell protocol
    • param-file “Transcript to gene map file”: Map
    • In “Extra output files”:
      • param-check Salmon Quant log file
      • param-check Features used by the CB classification and their counts at each cell level (--dumpFeatures)

      • Of course you are welcome to select more options and explore the output files (warning warning: “Per cell level parsimonious Umi graph (–dumpUmiGraph)” will generate over 2 thousand single files), but for this tutorial you will only need to select those specified.
    • In “Advanced options”:
      • “Dump cell v transcripts count matrix in MTX format”: galaxy-toggle Yes
Comment: What if I'm running a 10x sample?

The main parameter that needs changing for a 10X Chromium sample is the ‘Protocol’ parameter of Alevin. Just select the correct 10x Chemistry there instead.

Comment: Alevin file names

You will notice that the names of the output files of Alevin are written in a certain convention, mentioning which tool was used and on which files, for example: “Alevin on data X, data Y, and others: whitelist”. Remember that you can always rename the files if you wish! For simplicity, when we refer to those files in the tutorial, we skip the information about tool and only use the second part of the name - in this case it would be simply “whitelist”.

This tool will take a while to run. Alevin produces many file outputs, not all of which we’ll use. You can refer to the Alevin documentation if you’re curious what they all are, but we’re most interested in is:

  • the matrix itself (per-cell gene-count matrix (MTX) - the count by gene and cell)
  • the row (cell/ barcode) identifiers (row index (CB-ids)) and
  • the column (gene) labels (column headers (gene-ids)).
Question

After you’ve run Alevin, galaxy-eye look through all the different files. Can you find:

  1. The Mapping Rate?
  2. How many cells are present in the matrix output?
  1. Inspect galaxy-eye the file param-file Salmon log file. You can see the mapping rate is a paltry 25.45%. This is a terrible mapping rate. Why might this be? Remember this was downsampled, and specifically by taking only the last 400,000 reads of the FASTQ file. The overall mapping rate of the file is more like 50%, which is still quite poor, but for early Drop-Seq samples and single-cell data in general, you might expect a slightly poorer mapping rate. 10x samples are much better these days! This is real data, not test data, after all!
  2. Inspect galaxy-eye the file param-file row index (CB-ids), and you can see it has 2163 lines. The rows refer to the cells in the cell x gene matrix. According to this (rough) estimate, your sample has 2163 cells in it!
Warning: Choose the appropriate input going forward!

Make certain to use per-cell gene-count matrix (MTX) file going forward.

congratulations Congratulations - you’ve made an expression matrix! We could almost stop here. But it’s sensible to do some basic QC, and one of the things we can do is look at a barcode rank plot.

Basic QC

The question we’re looking to answer here, is: “do we mostly have a single cell per droplet”? That’s what experimenters are normally aiming for, but it’s not entirely straightforward to get exactly one cell per droplet. Sometimes almost no cells make it into droplets, other times we have too many cells in each droplet. At a minimum, we should easily be able to distinguish droplets with cells from those without.

Hands-on: Generate a raw barcode QC plot
  1. Droplet barcode rank plot ( Galaxy version 1.6.1+galaxy2) with the following parameters:
    • “Input MTX-format matrix?”: No
    • param-file “A two-column tab-delimited file, with barcodes in the first column and frequencies in the second”: Output of Alevin raw CB classification frequencies
    • “Label to place in plot title”: Barcode rank plot (raw barcode frequencies)
  2. Rename galaxy-pencil the image output Barcode Plot - raw barcode frequencies
raw droplet barcode plots-400k. Open image in new tab

Figure 1: 400k subsample raw

Now, the image generated here (400k) isn’t the most informative - but you are dealing with a fraction of the reads! If you run the total sample (so identical steps above, but with significantly more time!) you’d get the image below.

raw droplet barcode plots-total. Open image in new tab

Figure 2: Total sample - 32,579,453 reads - raw

This is our own formulation of the barcode plot based on a discussion we had with community members. The left hand plots with the smooth lines are the main plots, showing the UMI counts for individual cell barcodes ranked from high to low. We expect a sharp drop-off between cell-containing droplets and ones that are empty or contain only cell debris. Now, this data is not an ideal dataset, so for perspective, in an ideal world with a very clean 10x run, data will look a bit more like the following taken from the lung atlas (see the study in Single Cell Expression Atlas and the project submission).

raw droplet barcode plots - lung atlas. Open image in new tab

Figure 3: Pretty data - raw

In that plot, you can see the clearer ‘knee’ bend, showing the cut-off between empty droplets and cell-containing droplets.

The right hand plots are density plots from the first one, and the thresholds are generated either using dropletUtils or by the method described in the discussion mentioned above. We could use any of these thresholds to select cells, assuming that anything with fewer counts is not a valid cell. By default, Alevin does something similar, and we can learn something about that by plotting just the barcodes Alevin retains.

Hands-on: Generate Alevin's barcode QC plot
  1. Droplet barcode rank plot ( Galaxy version 1.6.1+galaxy2) with the following parameters:
    • “Input MTX-format matrix?”: Yes
    • “Matrix-market format matrix file, with cells by column (overrides –barcode-frequencies if supplied)”: per-cell gene-count matrix (MTX)
    • “For use with –mtx-matrix: force interpretation of matrix to assume cells are by row, rather than by column (default)”: Yes
    • “Label to place in plot title”: Barcode rank plot (Alevin-processed)
  2. Rename galaxy-pencil the image output Barcode Plot - Alevin processed barcodes
raw droplet barcode plots - 400k. Open image in new tab

Figure 4: 400k subsample - Alevin processed

And the full sample looks like:

raw droplet barcode plots - total. Open image in new tab

Figure 5: Total sample - 32,579,453 reads - Alevin processed

And to round this off, here’s the lung atlas plot.

raw droplet barcode plots - total. Open image in new tab

Figure 6: Pretty data - Alevin processed

You should see a completely vertical drop-off where Alevin has trunctated the distribution (after excluding any cell barcode that had <10 UMI, Alevin then chose a threshold based off the curve and removed all barcodes with fewer UMIs).

In experiments with relatively simple characteristics, this ‘knee detection’ method works relatively well. But some populations (such as our sample!) present difficulties. For instance, sub-populations of small cells may not be distinguished from empty droplets based purely on counts by barcode. Some libraries produce multiple ‘knees’ for multiple sub-populations. The emptyDrops method has become a popular way of dealing with this. emptyDrops still retains barcodes with very high counts, but also adds in barcodes that can be statistically distinguished from the ambient profiles, even if total counts are similar. In order to ultimately run emptyDrops (or indeed, whatever tool you like that accomplishes biologically relevant thresholding), we first need to re-run Alevin, but prevent it from applying its own less than ideal thresholds.

To use emptyDrops effectively, we need to go back and re-run Alevin, stopping it from applying it’s own thresholds. Click the re-run icon galaxy-refresh on any Alevin output in your history, because almost every parameter is the same as before, except you need to change the following:

Generate an unprocessed matrix in a usable format

Hands-on: Stopping Alevin from thresholding
  1. Alevin ( Galaxy version 1.10.1+galaxy0) (Click re-run on the last Alevin output)
    • “Advanced options”
    • “Fraction of cellular barcodes to keep”: ‘1’ - i.e. keep them all!
    • “Minimum frequency for a barcode to be considered”: ‘3’ - This will only remove cell barcodes with a frequency of less than 3, a low bar to pass but useful way of avoiding processing a bunch of almost certainly empty barcodes
  1. Expand one of the output datasets of the tool (by clicking on it)
  2. Click re-run galaxy-refresh the tool

This is useful if you want to run the tool again but with slightly different paramters, or if you just want to check which parameter setting you used.

Question

How many cells are in the output now?

  1. 22952 cells are in the quants_mat_rows now! Far more than the Alevin-filtered 2163. This needs some serious filtering with EmptyDrops!

Alevin outputs MTX format, which we can pass to the dropletUtils package and run emptyDrops. Unfortunately the matrix is in the wrong orientation for tools expecting files like those produced by 10X software (which dropletUtils does). We need to ‘transform’ the matrix such that cells are in columns and genes are in rows.

Warning: Be careful!

Don’t mix up files from the different Alevin runs! Use the later run, which has higher numbers in the history!

Hands-on: Transform matrix
  1. salmonKallistoMtxTo10x ( Galaxy version 0.0.1+galaxy6) with the following parameters:
    • param-file “.mtx-format matrix”: per-cell gene-count matrix (MTX) (output of Alevin tool)
    • param-file “Tab-delimited genes file”: column headers (gene-ids) (output of Alevin tool)
    • param-file “Tab-delimited barcodes file”: row index (CB-ids) (output of Alevin tool)
  2. Rename galaxy-pencil ‘salmonKallistoMtxTo10x….:genes’ to Gene table
  3. Rename galaxy-pencil ‘salmonKallistoMtxTo10x….:barcodes’ to Barcode table
  4. Rename galaxy-pencil ‘salmonKallistoMtxTo10x….:matrix’ to Matrix table

The output is a matrix in the correct orientation for the rest of our tools. However, our matrix is looking a bit sparse - for instance, click on Gene table. I don’t know about you, but I’d struggle to have a good biological discussion using only Ensembl gene_ids! What I’d really like is the more understandable ‘GAPDH’ or other gene acronym, as well as information on mitochondrial genes so that I can assess if my cells were stressed out or not. In order to prepare our data for emptyDrops, we’re going to combine this information into an object, and it’s easiest to add in that information now.

Adding in Gene metadata

Question

Where can we find this gene information?

Our old friend the GTF file!

Question

Which of the ‘attributes’ in the last column of that file contains the gene acronym?

gene_name

We’re now going to re-run galaxy-refresh the tool that extracts information from our GTF file.

Hands-on: Generate gene information
  1. GTF2GeneList ( Galaxy version 1.52.0+galaxy0) with the following parameters:
    • “Feature type for which to derive annotation”: gene
    • “Field to place first in output table”: gene_id
    • “Suppress header line in output?”: Yes
    • “Comma-separated list of field names to extract from the GTF (default: use all fields)”: gene_id,gene_name,mito
    • “Append version to transcript identifiers?”: Yes
    • “Flag mitochondrial features?”: Yes - note, this will auto-fill a bunch of acronyms for searching in the GTF for mitochondrial associated genes. This is good!
    • “Filter the cDNA file to match the annotations?”: No - we don’t need to, we’re done with the FASTA!
  2. Check that the output file type is tabular. If not, change the file type by clicking the ‘Edit attributes’galaxy-pencil on the dataset in the history (as if you were renaming the file.) Then click Datatypes and type in tabular. Click Change datatype.)
  3. Rename galaxy-pencil the annotation table to Gene Information

Inspect galaxy-eye the Gene Information object in the history. Now you have made a new key for gene_id, with gene name and a column of mitochondrial information (false = not mitochondrial, true = mitochondrial). We need to add this information into the salmonKallistoMtxTo10x output ‘Gene table’. But we need to keep ‘Gene table’ in the same order, since it is referenced in the ‘Matrix table’ by row.

Hands-on: Combine MTX Gene Table with Gene Information
  1. Join two Datasets with the following parameters:
    • “Join”: Gene Table
    • “Using column”: Column: 1
    • “with”: Gene Information
    • “and column”: Column: 1
    • “Keep lines of first input that do not join with second input”: Yes
    • “Keep lines of first input that are incomplete”: Yes
    • “Fill empty columns”: No
    • “Keep the header lines”: No

    If you inspect galaxy-eye the object, you’ll see we have joined these tables and now have quite a few gene_id repeats. Let’s take those out, while keeping the order of the original ‘Gene Table’.

  2. Cut columns from a table with the following parameters:
    • “Cut columns”: c1,c4,c5
    • “Delimited by”: Tab
    • “From”: output of Join two Datasets tool
  3. Rename output Annotated Gene Table

Inspect galaxy-eye your Annotated Gene Table. That’s more like it! You now have gene_id, gene_name, and mito. Now let’s get back to your journey to emptyDrops and sophisticated thresholding of empty droplets!

emptyDrops

emptyDrops Lun et al. 2019 works with a specific form of R object called a SingleCellExperiment. We need to convert our transformed MTX files into that form, using the DropletUtils Read10x tool:

Hands-on: Converting to SingleCellExperiment format
  1. DropletUtils Read10x ( Galaxy version 1.0.4+galaxy0) with the following parameters:
    • param-file “Expression matrix in sparse matrix format (.mtx)”: Matrix table
    • param-file “Gene Table”: Annotated Gene Table
    • param-file “Barcode/cell table”: Barcode table
    • “Should metadata file be added?”: No
  2. Rename galaxy-pencil output: SCE Object

Fantastic! Now that our matrix is combined into an object, specifically the SingleCellExperiment format, we can now run emptyDrops! Let’s get rid of those background droplets containing no cells!

Hands-on: Emptydrops
  1. DropletUtils emptyDrops ( Galaxy version 1.0.4+galaxy0) with the following parameters:
    • param-file “SingleCellExperiment rdata object”: SCE Object
    • “Should barcodes estimated to have no cells be removed from the output object?”: Yes
  2. Rename galaxy-pencil serialised SingleCellExperiment output as Emptied-Object

  3. Rename galaxy-pencil tabular output as Emptied-Tabular Output
Question

How many cell barcodes remain after the emptyDrops treatment? Why might that be?

If you click on the Emptied-Object in the galaxy-history history, the text in that window says 38 barcodes or something similar to that - there is an element of random in the algorithm, so yours might differ slightly. Why is this so low?? And why might the number be different? Consider…is this a complete set of data?

Remember this is a subsampled dataset. If you look carefully at the parameters of emptyDrops, you’ll see it set a minimum threshold at 100 UMI. If you look at the barcode plots above for the 400k read sample, you’ll see this is far too stringent for this subsampled data! To satisfy your curiosity, this minimum threshold would yield 4332 barcodes for the total sample. Also, the number may vary slightly as the output depends on a large number of random iterations.

We will nevertheless proceed with your majestic annotated expression matrix of 38 cells, ready to go for further processing and analysis! However, the next tutorials we will link to use a tool suite called Scanpy Wolf et al. 2018. You need to convert this SingleCellExperiment object into a format called annData, which is a variant of a file format called hdf5.

Hands-on: Converting to AnnData format
  1. SCEasy Converter ( Galaxy version 0.0.7+galaxy2) with the following parameters:
    • “Convert From / To”: SingleCellExperiment to AnnData
    • param-file “Input object in sce,rds,rdata.sce format”: Emptied-Object

If the dataset does not show up in the corresponding input field or displays as ‘unavailable’, don’t worry - try dragging the dataset from the history panel and dropping it into the input field. If this still doesn’t work, then you can change the datatype to rdata.sce.

  • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
  • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
  • In the galaxy-chart-select-data Assign Datatype, select your desired datatype from “New type” dropdown
    • Tip: you can start typing the datatype into the field to filter the dropdown menu
  • Click the Save button

  1. Rename galaxy-pencil output N701-400k-AnnData

congratulations Congrats! Your object is ready to for the scanpy pipeline! You can can check your work against the example history. You can also compare how the subsampled datasets you’ve generated compare with the total sample.

However, it may be that you want to combine this object with others like it, for instance, maybe you ran 5 samples, and you are starting with 10 FASTQ files…

Analysing multiple FASTQ files

This sample was originally one of seven. So to run the other 12 downsampled FASTQ files, you can use a workflow and run them all at once! All these samples are going to take a while, so go and have several cups of tea… Or, better yet, I have run them myself. To combine the resultant files into a single matrix, you can look at the next tutorial in this case study: Combining datasets after pre-processing

Mitochondrial flagging

We have assumed you will be combining multiple files - but if that’s not the case, you’ll need to perform this step to turn your column of true and false labelling the mitochondrial genes into some metrics telling you the % of mitochondrial genes in each cell. You can follow that step here: Mitochondrial calculations.

Conclusion

Workflow Part 1. Open image in new tab

Figure 7: Workflow - Steps 1-3

We have:

  • Examined raw read data, annotations and necessary input files for quantification.
  • Run Alevin in two different parameterisations, both allowing Alevin to make its own calls on what constitutes empty droplets, and applying emptyDrops instead.
  • Deployed barcode rank plots as a way of quickly assessing the signal present in droplet datasets.
  • Applied the necessary conversion to pass these data to downstream processes.

feedback To discuss with like-minded scientists, join our Galaxy Training Network chatspace in Slack and discuss with fellow users of Galaxy single cell analysis tools on #single-cell-users

We also post new tutorials / workflows there from time to time, as well as any other news.

point-right If you’d like to contribute ideas, requests or feedback as part of the wider community building single-cell and spatial resources within Galaxy, you can also join our Single cell & sPatial Omics Community of Practice.

tool You can request tools here on our Single Cell and Spatial Omics Community Tool Request Spreadsheet