Generating a single cell matrix using Alevin

Author(s)	Wendi Bacon Jonathan Manning
Editor(s)	Helena Rasche
Tester(s)	Julia Jakiela

Overview
Questions:

I have some single cell FASTQ files I want to analyse. Where do I start?

Objectives:

Generate a cellxgene matrix for droplet-based single cell sequencing data

Interpret quality control (QC) plots to make informed decisions on cell thresholds

Find relevant information in GTF files for the particulars of their study, and include this in data matrix metadata

Requirements:

Introduction to Galaxy Analyses

Single Cell

An introduction to scRNA-seq data analysis: slides slides

Hands-on: Hands-on: Understanding Barcodes: tutorial hands-on

Time estimation: 2 hours

Supporting Materials:

Datasets

Workflows

FAQs

video Recordings

video Tutorial (February 2021)

instances Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

UseGalaxy.org (Main) ✅ ⭐️

Published: Mar 3, 2021

Last modification: Mar 8, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00245

rating Rating: 3.8 (0 recent ratings, 4 all time)

version Revision: 23

This tutorial will take you from raw FASTQ files to a cell x gene data matrix in AnnData format. What’s a data matrix, and what’s AnnData format? Well you’ll find out! Importantly, this is the first step in processing single cell data in order to start analysing it. Currently you have a bunch of strings of ATGGGCTT etc. in your sequencing files, and what you need to know is how many cells you have and what genes appear in those cells. These steps are the most computationally heavy in the single cell world, as you’re starting with 100s of millions of reads, each with 4 lines of text. Later on in analysis, this data becomes simple gene counts such as ‘Cell A has 4 GAPDHs’, which is a lot easier to store! Because of this data overload, we have downsampled the FASTQ files to speed up the analysis a bit. Saying that, you’re still having to map loads of reads to the massive murine genome, so get yourself a cup of coffee and prepare to analyse!

Agenda

In this tutorial, we will cover:

Generating a matrix

Get Data

Important tips for easier analysis

Generate a transcript to gene map

Generate a transcriptome index & quantify!

Basic QC

Generate an unprocessed matrix in a usable format

Adding in Gene metadata

emptyDrops

Analysing multiple FASTQ files

Mitochondrial flagging

Conclusion

Generating a matrix

In this section, we will show you the principles of the initial phase of single-cell RNA-seq analysis: generating expression measures in a matrix. We’ll concentrate on droplet-based (rather than plate-based) methodology, since this is the process with most differences with respect to conventional approaches developed for bulk RNA-seq.

Droplet-based data consists of three components: cell barcodes, unique molecular identifiers (UMIs) and cDNA reads. To generate cell-wise quantifications we need to:

Process cell barcodes, working out which ones correspond to ‘real’ cells, which to sequencing artefacts, and possibly correct any barcodes likely to be the product of sequencing errors by comparison to more frequent sequences.
Map biological sequences to the reference genome or transcriptome.
‘De-duplicate’ using the UMIs.

This used to be a complex process involving multiple algorithms, or was performed with technology-specific methods (such as 10X’s ‘Cellranger’ tool) but is now much simpler thanks to the advent of a few new methods. When selecting methodology for your own work you should consider:

STARsolo - a droplet-based scRNA-seq-specific variant of the popular genome alignment method STAR. Produces results very close to those of Cellranger (which itself uses STAR under the hood).
Kallisto/ bustools - developed by the originators of the transcriptome quantification method, Kallisto.
Alevin - another transcriptome analysis method developed by the authors of the Salmon tool.

We’re going to use Alevin Srivastava et al. 2019 for demonstration purposes, but we do not endorse one method over another.

Get Data

We’ve provided you with some example data to play with, a small subset of the reads in a mouse dataset of fetal growth restriction Bacon et al. 2018 (see the study in Single Cell Expression Atlas and the project submission). This is a study using the Drop-seq chemistry, however this tutorial is almost identical to a 10x chemistry. We will point out the one tool parameter change you will need to run 10x samples. This data is not carefully curated, standard tutorial data - it’s real, it’s messy, it desperately needs filtering, it has background RNA running around, and most of all it will give you a chance to practice your analysis as if this data were yours.

Down-sampled reads and some associated annotation can be imported below. How did I downsample these FASTQ files? Check out this history to find out!

Additionally, to map your reads, you will need a transcriptome to align against (a FASTA) as well as the gene information for each transcript (a gtf) file. You can download these for your species of interest from Ensembl. These files are included in the data import step below. Keep in mind, these are big files, so the fastest way to get these into your Galaxy account is through importing them by history.

Hands-on: Option 1: Data upload - Import history

Import history from: example input history

Open the link to the shared history

Click on the new-history Import history button on the top right

Enter a title for the new history

Click on Import

Rename galaxy-pencil the the history to your name of choice.

Hands-on: Option 2: Data upload - Add to history
Create a new history for this tutorial
Import the Experimental Design table, sequencing reads 1 & 2, the GTF and fasta files from Zenodo
https://zenodo.org/record/4574153/files/Experimental_Design.tabular
https://zenodo.org/record/4574153/files/Mus_musculus.GRCm38.100.gtf.gff
https://zenodo.org/record/4574153/files/Mus_musculus.GRCm38.cdna.all.fa.fasta
https://zenodo.org/record/4574153/files/SLX-7632.TAAGGCGA.N701.s_1.r_1.fq-400k.fastq
https://zenodo.org/record/4574153/files/SLX-7632.TAAGGCGA.N701.s_1.r_2.fq-400k.fastq
Copy the link location

Click galaxy-upload Upload Data at the top of the tool panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window
Rename galaxy-pencil the datasets

Question

Have a look at the files you now have in your history.

Which of the FASTQ files do you think contains the barcode sequences?

Given the chemistry this study should have, are the barcode/UMI reads the correct length?

What is the ‘N701’ referring to?

Read 1 (SLX-7632.TAAGGCGA.N701.s_1.r_1.fq-400k) contains the cell barcode and UMI because it is significantly shorter (indeed, 20 bp!) compared to the longer, r_2 transcript read. For ease, rename these files N701-Read1 and N701-Read2.

You can see Read 1 is only 20 bp long, which for original Drop-Seq is 12 bp for cell barcode and 8 bp for UMI. This is correct! Be warned - 10x Chromium (and many technologies) change their chemistry over time, so particularly when you are accessing public data, you want to check and make sure you have your numbers correct!

‘N701’ is referring to an index read. This sample was run alongside 6 other samples, each denoted by an Illumina Nextera Index (N70X). Later, this will tell you batch information. If you look at the ‘Experimental Design’ file, you’ll see that the N701 sample was from a male wildtype neonatal thymus.

Important tips for easier analysis

Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.

Open your Galaxy server

Click on the curriculum icon on the top menu, this will open the GTN inside Galaxy.

Navigate to your tutorial

Tool names in tutorials will be blue buttons that open the correct tool for you

Note: this does not work for all tutorials (yet)

You can click anywhere in the grey-ed out area outside of the tutorial box to return back to the Galaxy analytical interface

Warning: Not all browsers work!

We’ve had some issues with Tutorial mode on Safari for Mac users.

Try a different browser if you aren’t seeing the button.

Did you know we have a unique Single Cell Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.

The Single Cell Lab currently uses the main European Galaxy infrastructure and power, it’s just organised better for users of particular analyses…like single cell!

Try it out! All your histories/workflows/logins from the general European Galaxy server will be there!

subdomain Single Cell Lab

When something goes wrong in Galaxy, there are a number of things you can do to find out what it was. Error messages can help you figure out whether it was a problem with one of the settings of the tool, or with the input data, or maybe there is a bug in the tool itself and the problem should be reported. Below are the steps you can follow to troubleshoot your Galaxy errors.

Expand the red history dataset by clicking on it.

Sometimes you can already see an error message here

View the error message by clicking on the bug icon galaxy-bug

Check the logs. Output (stdout) and error logs (stderr) of the tool are available:

Expand the history item

Click on the details icon

Scroll down to the Job Information section to view the 2 logs:

Tool Standard Output

Tool Standard Error

For more information about specific tool errors, please see the Troubleshooting section

Submit a bug report! If you are still unsure what the problem is.

Click on the bug icon galaxy-bug

Write down any information you think might help solve the problem

See this FAQ on how to write good bug reports

Click galaxy-bug Report button

Ask for help!

Where?

In the Single cell user community

In the GTN Matrix Channel

In the Galaxy Matrix Channel

Browse the Galaxy Help Forum to see if others have encountered the same problem before (or post your question).

When asking for help, it is useful to share a link to your history

Generate a transcript to gene map

Gene-level, rather than transcript-level, quantification is standard in scRNA-seq, which means that the expression level of alternatively spliced RNA molecules are combined to create gene-level values. Droplet-based scRNA-seq techniques only sample one end each transcript, so lack the full-molecule coverage that would be required to accurately quantify different transcript isoforms.

To generate gene-level quantifications based on transcriptome quantification, Alevin and similar tools require a conversion between transcript and gene identifiers. We can derive a transcript-gene conversion from the gene annotations available in genome resources such as Ensembl. The transcripts in such a list need to match the ones we will use later to build a binary transcriptome index. If you were using spike-ins, you’d need to add these to the transcriptome and the transcript-gene mapping.

In your example data you will see the murine reference annotation as retrieved from Ensembl in GTF format. This annotation contains gene, exon, transcript and all sorts of other information on the sequences. We will use these to generate the transcript/ gene mapping by passing that information to a tool that extracts just the transcript identifiers we need.

Question

Which of the ‘attributes’ in the last column of the GTF files contains the transcript and gene identifiers?

The file is organised such that the last column (headed ‘Group’) contains a wealth of information in the format: attribute1 “information associated with attribute 1”;attribute2 “information associated with attribute 2” etc.

gene_id and transcript_id are each followed by “ensembl gene_id” and “ensembl transcript_id”

It’s now time to parse the GTF file using the rtracklayer package in R. This parsing will give us a conversion table with a list of transcript identifiers and their corresponding gene identifiers for counting. Additionally, because we will be generating our own binary index (more later!), we also need to input our FASTA so that it can be filtered to only contain transcriptome information found in the GTF.

Hands-on: Generate a filtered FASTA and transcript-gene map

GTF2GeneList ( Galaxy version 1.52.0+galaxy0) with the following parameters:

param-file “Ensembl GTF file”: GTF file in the history galaxy-history

“Feature type for which to derive annotation”: transcript (Your sequences are transcript sequencing, so this is your starting point)

“Field to place first in output table”: transcript_id (This is accessing the column you identified above!)

“Suppress header line in output?”: Yes (The next tool (Alevin) does not expect a header)

“Comma-separated list of field names to extract from the GTF (default: use all fields)”: transcript_id,gene_id (This calls the first column to be the transcript_id, and the second the gene_id. Thus, your key can turn transcripts into genes)

“Append version to transcript identifiers?”: Yes (The Ensembl FASTA files usually have these, and since we need the FASTA transcriptome and the GTF gene information to work together, we need to append these!)

“Flag mitochondrial features?”: No

“Provide a cDNA file for extracting annotations and/ or possible filtering?”: Yes

param-file “FASTA-format cDNA/transcript file”: FASTA file in your history galaxy-history

“Annotation field to match with sequences”: transcript_id

“Filter the cDNA file to match the annotations?”: Yes

Rename galaxy-pencil the annotation table to Map

Rename galaxy-pencil the uncompressed filtered FASTA file to Filtered FASTA

Generate a transcriptome index & quantify!

Alevin collapses the steps involved in dealing with dscRNA-seq into a single process. Such tools need to compare the sequences in your sample to a reference containing all the likely transcript sequences (a ‘transcriptome’). This will contain the biological transcript sequences known for a given species, and perhaps also technical sequences such as ‘spike ins’ if you have those.

To be able to search a transcriptome quickly, Alevin needs to convert the text (FASTA) format sequences into something it can search quickly, called an ‘index’. The index is in a binary rather than human-readable format, but allows fast lookup by Alevin. Because the types of biological and technical sequences we need to include in the index can vary between experiments, and because we often want to use the most up-to-date reference sequences from Ensembl or NCBI, we can end up re-making the indices quite often. Making these indices is time-consuming! Have a look at the uncompressed FASTA to see what it starts with.

We now have:

Barcode/ UMI reads
cDNA reads
transcript/ gene mapping
filtered FASTA

We can now run Alevin. In some public instances, Alevin won’t show up if you search for it. Instead, you may have to click the Single Cell tab at the left and scroll down to the Alevin tool. Alternatively, use Tutorial Mode as described above and you’ll easily navigate to all the tools, and their versions will all be the tried and tested ones of this tutorial. It’s often a good idea to check your tool versions. To identify which version of a tool you are using, select tool-versions ‘Versions’ and choose the appropriate version. In this case the tutorial was built with Alevin Galaxy Version 1.9.0+galaxy2.

Hands-on: Running Alevin

Alevin ( Galaxy version 1.9.0+galaxy2)

Question

Try to fill in the parameters of Alevin using what you know!

The Salmon documentation on ‘Fragment Library Types’ and running the Alevin command is here: salmon.readthedocs.io/en/latest/library_type.html and salmon.readthedocs.io/en/latest/alevin.html. These links will help here, although keep in mind the image there is drawn with the RNA 5’ on top, whereas in this scRNA-seq protocol, the polyA is captured by its 3’ tail and thus effectively the bottom or reverse strand…)

“Select a reference transcriptome from your history or use a built-in index?”: Use one from the history

You are going to generate the binary index using your filtered FASTA!

param-file “Transcripts FASTA file”: Filtered FASTA

“Single or paired-end reads?”: Paired-end

param-file “Mate pair 1”: N701-Read1

param-file “Mate pair 2”: N701-Read2

“Specify the strandedness of the reads”: Infer automatically (A)

“Type of single-cell protocol”: DropSeq Single Cell protocol

param-file “Transcript to gene map file”: Map

In “Extra output files”:

param-check Salmon Quant log file

param-check Features used by the CB classification and their counts at each cell level (--dumpFeatures)

Of course you are welcome to select more options and explore the output files (warning warning: “Per cell level parsimonious Umi graph (–dumpUmiGraph)” will generate over 2 thousand single files), but for this tutorial you will only need to select those specified.

In “Advanced options”:

“Dump cell v transcripts count matrix in MTX format”: galaxy-toggle Yes

Comment: What if I'm running a 10x sample?

The main parameter that needs changing for a 10X Chromium sample is the ‘Protocol’ parameter of Alevin. Just select the correct 10x Chemistry there instead.

Comment: Alevin file names

You will notice that the names of the output files of Alevin are written in a certain convention, mentioning which tool was used and on which files, for example: “Alevin on data X, data Y, and others: whitelist”. Remember that you can always rename the files if you wish! For simplicity, when we refer to those files in the tutorial, we skip the information about tool and only use the second part of the name - in this case it would be simply “whitelist”.

This tool will take a while to run. Alevin produces many file outputs, not all of which we’ll use. You can refer to the Alevin documentation if you’re curious what they all are, but we’re most interested in is:

the matrix itself (per-cell gene-count matrix (MTX) - the count by gene and cell)
the row (cell/ barcode) identifiers (row index (CB-ids)) and
the column (gene) labels (column headers (gene-ids)).

Question

After you’ve run Alevin, galaxy-eye look through all the different files. Can you find:

The Mapping Rate?

How many cells are present in the matrix output?

Inspect galaxy-eye the file param-file Salmon log file. You can see the mapping rate is a paltry 25.45%. This is a terrible mapping rate. Why might this be? Remember this was downsampled, and specifically by taking only the last 400,000 reads of the FASTQ file. The overall mapping rate of the file is more like 50%, which is still quite poor, but for early Drop-Seq samples and single-cell data in general, you might expect a slightly poorer mapping rate. 10x samples are much better these days! This is real data, not test data, after all!

Inspect galaxy-eye the file param-file row index (CB-ids), and you can see it has 2163 lines. The rows refer to the cells in the cell x gene matrix. According to this (rough) estimate, your sample has 2163 cells in it!

Warning: Choose the appropriate input going forward!

Make certain to use per-cell gene-count matrix (MTX) file going forward.

congratulations Congratulations - you’ve made an expression matrix! We could almost stop here. But it’s sensible to do some basic QC, and one of the things we can do is look at a barcode rank plot.

Basic QC

The question we’re looking to answer here, is: “do we mostly have a single cell per droplet”? That’s what experimenters are normally aiming for, but it’s not entirely straightforward to get exactly one cell per droplet. Sometimes almost no cells make it into droplets, other times we have too many cells in each droplet. At a minimum, we should easily be able to distinguish droplets with cells from those without.

Hands-on: Generate a raw barcode QC plot

Droplet barcode rank plot ( Galaxy version 1.6.1+galaxy2) with the following parameters:

“Input MTX-format matrix?”: No

param-file “A two-column tab-delimited file, with barcodes in the first column and frequencies in the second”: Output of Alevin raw CB classification frequencies

“Label to place in plot title”: Barcode rank plot (raw barcode frequencies)

Rename galaxy-pencil the image output Barcode Plot - raw barcode frequencies

raw droplet barcode plots-400k. — **Figure 1**: 400k subsample raw

Now, the image generated here (400k) isn’t the most informative - but you are dealing with a fraction of the reads! If you run the total sample (so identical steps above, but with significantly more time!) you’d get the image below.

raw droplet barcode plots-total. — **Figure 2**: Total sample - 32,579,453 reads - raw

This is our own formulation of the barcode plot based on a discussion we had with community members. The left hand plots with the smooth lines are the main plots, showing the UMI counts for individual cell barcodes ranked from high to low. We expect a sharp drop-off between cell-containing droplets and ones that are empty or contain only cell debris. Now, this data is not an ideal dataset, so for perspective, in an ideal world with a very clean 10x run, data will look a bit more like the following taken from the lung atlas (see the study in Single Cell Expression Atlas and the project submission).

raw droplet barcode plots - lung atlas. — **Figure 3**: Pretty data - raw

In that plot, you can see the clearer ‘knee’ bend, showing the cut-off between empty droplets and cell-containing droplets.

The right hand plots are density plots from the first one, and the thresholds are generated either using dropletUtils or by the method described in the discussion mentioned above. We could use any of these thresholds to select cells, assuming that anything with fewer counts is not a valid cell. By default, Alevin does something similar, and we can learn something about that by plotting just the barcodes Alevin retains.

Hands-on: Generate Alevin's barcode QC plot

Droplet barcode rank plot ( Galaxy version 1.6.1+galaxy2) with the following parameters:

“Input MTX-format matrix?”: Yes

“Matrix-market format matrix file, with cells by column (overrides –barcode-frequencies if supplied)”: per-cell gene-count matrix (MTX)

“For use with –mtx-matrix: force interpretation of matrix to assume cells are by row, rather than by column (default)”: Yes

“Label to place in plot title”: Barcode rank plot (Alevin-processed)

Rename galaxy-pencil the image output Barcode Plot - Alevin processed barcodes

And the full sample looks like:

And to round this off, here’s the lung atlas plot.

You should see a completely vertical drop-off where Alevin has trunctated the distribution (after excluding any cell barcode that had <10 UMI, Alevin then chose a threshold based off the curve and removed all barcodes with fewer UMIs).

In experiments with relatively simple characteristics, this ‘knee detection’ method works relatively well. But some populations (such as our sample!) present difficulties. For instance, sub-populations of small cells may not be distinguished from empty droplets based purely on counts by barcode. Some libraries produce multiple ‘knees’ for multiple sub-populations. The emptyDrops method has become a popular way of dealing with this. emptyDrops still retains barcodes with very high counts, but also adds in barcodes that can be statistically distinguished from the ambient profiles, even if total counts are similar. In order to ultimately run emptyDrops (or indeed, whatever tool you like that accomplishes biologically relevant thresholding), we first need to re-run Alevin, but prevent it from applying its own less than ideal thresholds.

To use emptyDrops effectively, we need to go back and re-run Alevin, stopping it from applying it’s own thresholds. Click the re-run icon galaxy-refresh on any Alevin output in your history, because almost every parameter is the same as before, except you need to change the following:

Generate an unprocessed matrix in a usable format

Hands-on: Stopping Alevin from thresholding

Alevin ( Galaxy version 1.9.0+galaxy2) (Click re-run on the last Alevin output)

“Advanced options”

“Fraction of cellular barcodes to keep”: ‘1’ - i.e. keep them all!

“Minimum frequency for a barcode to be considered”: ‘3’ - This will only remove cell barcodes with a frequency of less than 3, a low bar to pass but useful way of avoiding processing a bunch of almost certainly empty barcodes

Expand one of the output datasets of the tool (by clicking on it)

Click re-run galaxy-refresh the tool

This is useful if you want to run the tool again but with slightly different paramters, or if you just want to check which parameter setting you used.

Question

How many cells are in the output now?

22952 cells are in the quants_mat_rows now! Far more than the Alevin-filtered 2163. This needs some serious filtering with EmptyDrops!

Alevin outputs MTX format, which we can pass to the dropletUtils package and run emptyDrops. Unfortunately the matrix is in the wrong orientation for tools expecting files like those produced by 10X software (which dropletUtils does). We need to ‘transform’ the matrix such that cells are in columns and genes are in rows.

Warning: Be careful!

Don’t mix up files from the different Alevin runs! Use the later run, which has higher numbers in the history!

Hands-on: Transform matrix

salmonKallistoMtxTo10x ( Galaxy version 0.0.1+galaxy6) with the following parameters:

param-file “.mtx-format matrix”: per-cell gene-count matrix (MTX) (output of Alevin tool)

param-file “Tab-delimited genes file”: column headers (gene-ids) (output of Alevin tool)

param-file “Tab-delimited barcodes file”: row index (CB-ids) (output of Alevin tool)

Rename galaxy-pencil ‘salmonKallistoMtxTo10x….:genes’ to Gene table

Rename galaxy-pencil ‘salmonKallistoMtxTo10x….:barcodes’ to Barcode table

Rename galaxy-pencil ‘salmonKallistoMtxTo10x….:matrix’ to Matrix table

The output is a matrix in the correct orientation for the rest of our tools. However, our matrix is looking a bit sparse - for instance, click on Gene table. I don’t know about you, but I’d struggle to have a good biological discussion using only Ensembl gene_ids! What I’d really like is the more understandable ‘GAPDH’ or other gene acronym, as well as information on mitochondrial genes so that I can assess if my cells were stressed out or not. In order to prepare our data for emptyDrops, we’re going to combine this information into an object, and it’s easiest to add in that information now.

Adding in Gene metadata

Question

Where can we find this gene information?

Our old friend the GTF file!

Question

Which of the ‘attributes’ in the last column of that file contains the gene acronym?

gene_name

We’re now going to re-run galaxy-refresh the tool that extracts information from our GTF file.

Hands-on: Generate gene information

GTF2GeneList ( Galaxy version 1.52.0+galaxy0) with the following parameters:

“Feature type for which to derive annotation”: gene

“Field to place first in output table”: gene_id

“Suppress header line in output?”: Yes

“Comma-separated list of field names to extract from the GTF (default: use all fields)”: gene_id,gene_name,mito

“Append version to transcript identifiers?”: Yes

“Flag mitochondrial features?”: Yes - note, this will auto-fill a bunch of acronyms for searching in the GTF for mitochondrial associated genes. This is good!

“Filter the cDNA file to match the annotations?”: No - we don’t need to, we’re done with the FASTA!

Check that the output file type is tabular. If not, change the file type by clicking the ‘Edit attributes’galaxy-pencil on the dataset in the history (as if you were renaming the file.) Then click Datatypes and type in tabular. Click Change datatype.)

Rename galaxy-pencil the annotation table to Gene Information

Inspect galaxy-eye the Gene Information object in the history. Now you have made a new key for gene_id, with gene name and a column of mitochondrial information (false = not mitochondrial, true = mitochondrial). We need to add this information into the salmonKallistoMtxTo10x output ‘Gene table’. But we need to keep ‘Gene table’ in the same order, since it is referenced in the ‘Matrix table’ by row.

Hands-on: Combine MTX Gene Table with Gene Information

Join two Datasets with the following parameters:

“Join”: Gene Table

“Using column”: Column: 1

“with”: Gene Information

“and column”: Column: 1

“Keep lines of first input that do not join with second input”: Yes

“Keep lines of first input that are incomplete”: Yes

“Fill empty columns”: No

“Keep the header lines”: No

If you inspect galaxy-eye the object, you’ll see we have joined these tables and now have quite a few gene_id repeats. Let’s take those out, while keeping the order of the original ‘Gene Table’.

Cut columns from a table with the following parameters:

“Cut columns”: c1,c4,c5

“Delimited by”: Tab

“From”: output of Join two Datasets tool

Rename output Annotated Gene Table

Inspect galaxy-eye your Annotated Gene Table. That’s more like it! You now have gene_id, gene_name, and mito. Now let’s get back to your journey to emptyDrops and sophisticated thresholding of empty droplets!

emptyDrops

emptyDrops Lun et al. 2019 works with a specific form of R object called a SingleCellExperiment. We need to convert our transformed MTX files into that form, using the DropletUtils Read10x tool:

Hands-on: Converting to SingleCellExperiment format

DropletUtils Read10x ( Galaxy version 1.0.4+galaxy0) with the following parameters:

param-file “Expression matrix in sparse matrix format (.mtx)”: Matrix table

param-file “Gene Table”: Annotated Gene Table

param-file “Barcode/cell table”: Barcode table

“Should metadata file be added?”: No

Rename galaxy-pencil output: SCE Object

Fantastic! Now that our matrix is combined into an object, specifically the SingleCellExperiment format, we can now run emptyDrops! Let’s get rid of those background droplets containing no cells!

Hands-on: Emptydrops

DropletUtils emptyDrops ( Galaxy version 1.0.4+galaxy0) with the following parameters:

param-file “SingleCellExperiment rdata object”: SCE Object

“Should barcodes estimated to have no cells be removed from the output object?”: Yes

Rename galaxy-pencil serialised SingleCellExperiment output as Emptied-Object

Rename galaxy-pencil tabular output as Emptied-Tabular Output

Question

How many cell barcodes remain after the emptyDrops treatment? Why might that be?

If you click on the Emptied-Object in the galaxy-history history, the text in that window says 37 barcodes or something similar to that - there is an element of random in the algorithm, so yours might differ slightly. Why is this so low?? And why might the number be different? Consider…is this a complete set of data?

Remember this is a subsampled dataset. If you look carefully at the parameters of emptyDrops, you’ll see it set a minimum threshold at 100 UMI. If you look at the barcode plots above for the 400k read sample, you’ll see this is far too stringent for this subsampled data! To satisfy your curiosity, this minimum threshold would yield 4332 barcodes for the total sample. Also, the number may vary slightly as the output depends on a large number of random iterations.

We will nevertheless proceed with your majestic annotated expression matrix of 37 cells, ready to go for further processing and analysis! However, the next tutorials we will link to use a tool suite called Scanpy Wolf et al. 2018. You need to convert this SingleCellExperiment object into a format called annData, which is a variant of a file format called hdf5.

Hands-on: Converting to AnnData format

SCEasy convert ( Galaxy version 0.0.5+galaxy1) with the following parameters:

“Direction of conversion”: SingleCellExperiment to AnnData

param-file “Input object in SingleCellExperiment RDS format”: Emptied-Object

“Name of the assay to be transferred as main layer”: counts

Rename galaxy-pencil output N701-400k-AnnData

congratulations Congrats! Your object is ready to for the scanpy pipeline! You can can check your work against the example history (if you used Alevin Galaxy Version 1.5.1+galaxy0, you can check the example history run with that version of the tool). You can also compare how the subsampled datasets you’ve generated compare with the total sample (using Alevin Galaxy Version 1.5.1+galaxy0).

However, it may be that you want to combine this object with others like it, for instance, maybe you ran 5 samples, and you are starting with 10 FASTQ files…

Analysing multiple FASTQ files

This sample was originally one of seven. So to run the other 12 downsampled FASTQ files, you can use a workflow and run them all at once! All these samples are going to take a while, so go and have several cups of tea… Or, better yet, I have run them myself. To combine the resultant files into a single matrix, you can look at the next tutorial in this case study: Combining datasets after pre-processing

Mitochondrial flagging

We have assumed you will be combining multiple files - but if that’s not the case, you’ll need to perform this step to turn your column of true and false labelling the mitochondrial genes into some metrics telling you the % of mitochondrial genes in each cell. You can follow that step here: Mitochondrial calculations.

Conclusion

Workflow Part 1. — **Figure 7**: Workflow - Steps 1-3

We have:

Examined raw read data, annotations and necessary input files for quantification.
Run Alevin in two different parameterisations, both allowing Alevin to make its own calls on what constitutes empty droplets, and applying emptyDrops instead.
Deployed barcode rank plots as a way of quickly assessing the signal present in droplet datasets.
Applied the necessary conversion to pass these data to downstream processes.

To discuss with like-minded scientists, join our Matrix/Element chatroom to discuss with fellow users of Galaxy single cell analysis tools!

We also post new tutorials / workflows there from time to time, as well as any other news.

Key points

Create a scanpy-accessible AnnData object from FASTQ files, including relevant gene metadata

Frequently Asked Questions

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Single Cell topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

Bacon, W. A., R. S. Hamilton, Z. Yu, J. Kieckbusch, D. Hawkes et al., 2018 Single-Cell Analysis Identifies Thymic Maturation Delay in Growth-Restricted Neonatal Mice. Frontiers in Immunology 9: 10.3389/fimmu.2018.02523
Wolf, F. A., P. Angerer, and F. J. Theis, 2018 SCANPY: large-scale single-cell gene expression data analysis. Genome Biology 19: 10.1186/s13059-017-1382-0
Lun, A., S. Riesenfeld, T. Andrews, T. Dao, T. Gomes et al., 2019 EmptyDrops: Distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biology 20: 63. 10.1186/s13059-019-1662-y
Srivastava, A., L. Malik, T. Smith, I. Sudbery, and R. Patro, 2019 Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biology 20: 10.1186/s13059-019-1670-y

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Wendi Bacon, Jonathan Manning, Generating a single cell matrix using Alevin (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-case_alevin/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{single-cell-scrna-case_alevin,
author = "Wendi Bacon and Jonathan Manning",
	title = "Generating a single cell matrix using Alevin (Galaxy Training Materials)",
	year = "",
	month = "",
	day = ""
	url = "\url{https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-case_alevin/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol} Computational Biology}
}

                   

Congratulations on successfully completing this tutorial!

Go Further
Do you want to extend your knowledge? Follow one of our recommended follow-up trainings:

Single Cell

Hands-on: Hands-on: Combining single cell datasets after pre-processing: tutorial hands-on

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/single-cell/tutorials/scrna-case_alevin/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: alevin
  owner: bgruening
  revisions: e4c01dcece8b
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: alevin
  owner: bgruening
  revisions: f2000bc73102
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: droplet_barcode_plot
  owner: ebi-gxa
  revisions: 8f605643ccfd
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: dropletutils_empty_drops
  owner: ebi-gxa
  revisions: fa662ca7ae7f
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: dropletutils_read_10x
  owner: ebi-gxa
  revisions: ca62f52e0d7f
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: gtf2gene_list
  owner: ebi-gxa
  revisions: 14b3f2a4523b
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: salmon_kallisto_mtx_to_10x
  owner: ebi-gxa
  revisions: 88f5a0a6a818
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: sceasy_convert
  owner: ebi-gxa
  revisions: f62bc418173f
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

5 stars 2

3 stars 1

2 stars 1

March 2022

5 stars: Liked: Everything because all were new for me Disliked: To be given the ability to conduct all parts of the tutorial, because some tools (in my condition) were not worked and I had to create a new account on another server in order to fulfill the tutorial.

January 2022

2 stars: Disliked: The examples are still very buggy. Many of them do not work with the instructions as supplied. For example, the SalmonKallistoMtxTo10x instructions fails, most likely because the program is looking for a gz file when the tutorial example makes an unzipped file. The Droplet barcode rank plot exercise also crashes.

November 2021

5 stars: Liked: The instructions are clear, I really enjoyed it thank you! Disliked: Notes for the versions of the tools so you can select them and go a little bit faster with the course.

July 2021

3 stars: Disliked: Speak a bit slower and explain each file before moving on. Relate the theoretical apresentation with this part.