{ "metadata": { "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
\n\n# Generating a single cell matrix using Alevin and combining datasets (bash + R)\n\nby [Julia Jakiela](https://training.galaxyproject.org/hall-of-fame/wee-snufkin/), [Wendi Bacon](https://training.galaxyproject.org/hall-of-fame/nomadscientist/)\n\nCC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)\n\n**Objectives**\n\n- I have some single cell FASTQ files I want to analyse. Where do I start?\n- How to generate a single cell matrix using command line?\n\n**Objectives**\n\n- Generate a cellxgene matrix for droplet-based single cell sequencing data\n- Interpret quality control (QC) plots to make informed decisions on cell thresholds\n- Find relevant information in GTF files for the particulars of their study, and include this in data matrix metadata\n\n**Time Estimation: 2H**\n
\n", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-0", "source": "

Setting up the environment

\n

Alevin is a tool integrated with the Salmon software, so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the releases page. Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below .

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-1", "source": [ "wget -nv https://github.com/COMBINE-lab/salmon/releases/download/v1.10.0/salmon-1.10.0_linux_x86_64.tar.gz" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-2", "source": "

Once you’ve downloaded a specific binary (here we’re using version 1.10.0), just extract it like so:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-3", "source": [ "tar -xvzf salmon-1.10.0_linux_x86_64.tar.gz" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-4", "source": "
\n
\n

As mentioned, installing salmon using conda is also an option, and you can do it using the following command in the terminal:

\n
conda install -c bioconda salmon\n
\n

However, for this tutorial, it would be easier and quicker to use the downloaded pre-compiled binaries, as shown above.

\n
\n

We’re going to use Alevin for demonstration purposes, but we do not endorse one method over another.

\n

Get Data

\n

We continue working on the same example data - a very small subset of the reads in a mouse dataset of fetal growth restriction {% cite Bacon2018 %} (see the study in Single Cell Expression Atlas and the project submission). For the purposes of this tutorial, the datasets have been subsampled to only 50k reads (around 1% of the original files). Those are two fastq files - one with transcripts and the another one with cell barcodes. You can download the files by running the code below:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-5", "source": [ "wget -nv https://zenodo.org/records/10116786/files/transcript_701.fastq\n", "wget -nv https://zenodo.org/records/10116786/files/barcodes_701.fastq" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-6", "source": "
\n
Question
\n

How to differentiate between the two files if they are just called ‘Read 1’ and ‘Read 2’?

\n
👁 View solution\n
\n

The file which contains the cell barcodes and UMI is significantly shorter (indeed, 20 bp!) compared to the other file containing longer, transcript read. For ease, we will use explicit file names.

\n
\n
\n

Additionally, to map your reads, you will need a transcriptome to align against (a FASTA) as well as the gene information for each transcript (a gtf) file. These files are included in the data import step below. You can also download these for your species of interest from Ensembl.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-7", "source": [ "wget -c https://zenodo.org/record/4574153/files/Mus_musculus.GRCm38.100.gtf.gff -O GRCm38_gtf.gff\n", "wget -c https://zenodo.org/record/4574153/files/Mus_musculus.GRCm38.cdna.all.fa.fasta -O GRCm38_cdna.fasta" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-8", "source": "

Why do we need FASTA and GTF files?\nTo generate gene-level quantifications based on transcriptome quantification, Alevin and similar tools require a conversion between transcript and gene identifiers. We can derive a transcript-gene conversion from the gene annotations available in genome resources such as Ensembl. The transcripts in such a list need to match the ones we will use later to build a binary transcriptome index. If you were using spike-ins, you’d need to add these to the transcriptome and the transcript-gene mapping.

\n

We will use the murine reference annotation as retrieved from Ensembl (GRCm38 or mm10) in GTF format. This annotation contains gene, exon, transcript and all sorts of other information on the sequences. We will use these to generate the transcript-gene mapping by passing that information to a tool that extracts just the transcript identifiers we need.

\n

Generate a transcript to gene map and filtered FASTA

\n

You can have a look at the Terminal tab again. Has the package atlas-gene-annotation-manipulation been installed yet? If yes, you can execute the code cell below and while it’s running, I’ll explain all the parameters we set here.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-9", "source": [ "gtf2featureAnnotation.R -g GRCm38_gtf.gff -c GRCm38_cdna.fasta -d \"transcript_id\" -t \"transcript\" -f \"transcript_id\" -o map -l \"transcript_id,gene_id\" -r -e filtered_fasta" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-10", "source": "

In essence, gtf2featureAnnotation.R script takes a GTF annotation file and creates a table of annotation by feature, optionally filtering a cDNA file supplied at the same time. Therefore the first parameter -g stands for “gtf-file” and requires a path to a valid GTF file. Then -c takes a cDNA file for extracting meta info and/or filtering - that’s our FASTA! Where –parse-cdnas (that’s our -c) is specified, we need to specify, using -d, which field should be used to compare to identfiers from the FASTA. We set that to “transcript_id” - feel free to inspect the GTF file to explore other attributes. We pass the same value in -f, meaning first-field, ie. the name of the field to place first in output table. To specify which other fields to retain in the output table, we provide comma-separated list of those fields, and since we’re only interested in transcript to gene map, we put those two names (“transcript_id,gene_id”) into -l. -t stands for the feature type to use, and in our case we’re using “transcript”. Guess what -o is! Indeed, that’s the output annotation table - here we specify the file path of our transcript to gene map. We will also have another output denoted by -e and that’s the path to a filtered FASTA. Finally, we also put -r which is there only to suppress header on output. Summarising, output will be a an annotation table, and a FASTA-format cDNAs file with unannotated transcripts removed.

\n

Why filtered FASTA?\nSometimes it’s important that there are no transcripts in a FASTA-format transcriptome that cannot be matched to a transcript/gene mapping. Salmon, for example, used to produce errors when this mismatch was present. We can synchronise the cDNA file by removing mismatches as we have done above.

\n

Generate a transcriptome index

\n

We will use Salmon in mapping-based mode, so first we have to build a Salmon index for our transcriptome. We will run the Salmon indexer as so:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-11", "source": [ "salmon-latest_linux_x86_64/bin/salmon index -t filtered_fasta -i salmon_index -k 31" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-12", "source": "

Where -t stands for our filtered FASTA file, and -i is the output of the mapping-based index. To build it, the function is using an auxiliary k-mer hash over k-mers of length 31. While the mapping algorithms will make use of arbitrarily long matches between the query and reference, the k size selected here will act as the minimum acceptable length for a valid match. Thus, a smaller value of k may slightly improve sensitivity. We find that a k of 31 seems to work well for reads of 75bp or longer, but you might consider a smaller k if you plan to deal with shorter reads. Also, a shorter value of k may improve sensitivity even more when using selective alignment (enabled via the –validateMappings flag). So, if you are seeing a smaller mapping rate than you might expect, consider building the index with a slightly smaller k.

\n
\n
\n

To be able to search a transcriptome quickly, Salmon needs to convert the text (FASTA) format sequences into something it can search quickly, called an ‘index’. The index is in a binary rather than human-readable format, but allows fast lookup by Alevin. Because the types of biological and technical sequences we need to include in the index can vary between experiments, and because we often want to use the most up-to-date reference sequences from Ensembl or NCBI, we can end up re-making the indices quite often.

\n
\n

Use Alevin

\n

Time to use Alevin now! Alevin works under the same indexing scheme (as Salmon) for the reference, and consumes the set of FASTA/Q files(s) containing the Cellular Barcode(CB) + Unique Molecule identifier (UMI) in one read file and the read sequence in the other. Given just the transcriptome and the raw read files, Alevin generates a cell-by-gene count matrix (in a fraction of the time compared to other tools).

\n
\n
\n

Alevin works in two phases. In the first phase it quickly parses the read file containing the CB and UMI information to generate the frequency distribution of all the observed CBs, and creates a lightweight data-structure for fast-look up and correction of the CB. In the second round, Alevin utilizes the read-sequences contained in the files to map the reads to the transcriptome, identify potential PCR/sequencing errors in the UMIs, and performs hybrid de-duplication while accounting for UMI collisions. Finally, a post-abundance estimation CB whitelisting procedure is done and a cell-by-gene count matrix is generated.

\n
\n

Alevin can be run using the following command:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-13", "source": [ "salmon-latest_linux_x86_64/bin/salmon alevin -l ISR -1 barcodes_701.fastq -2 transcript_701.fastq --dropseq -i salmon_index -p 10 -o alevin_output --tgMap map --freqThreshold 3 --keepCBFraction 1 --dumpFeatures" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-14", "source": "

All the required input parameters are described in the documentation, but for the ease of use, they are presented below as well:

\n
\n
\n\n
\n

We have also added some additional parameters (--freqThreshold, --keepCBFraction) and their values are derived from the Alevin Galaxy tutorial after QC to stop Alevin from applying its own thresholds. However, if you’re not sure what value to pick, you can simply allow Alevin to make its own calls on what constitutes empty droplets.

\n

This tool will take a while to run. Alevin produces many file outputs, not all of which we’ll use. You can refer to the Alevin documentation if you’re curious what they all are, you can look through all the different files to find information such as the mapping rate, but we’ll just pass the whole output folder directory for downstream analysis.

\n
\n
Question
\n
    \n
  1. Can you find the information what was the mapping rate?
  2. \n
  3. How many transcripts did Alevin find?
  4. \n
\n
👁 View solution\n
\n
    \n
  1. As mentioned above, in alevin_output folder there will be many different files, including the log files. To check the mapping rate, go to alevin_output -> logs and open salmon_quant file. There you will find not only information about mapping rate, but also many more, calculated at salmon indexing step.
  2. \n
  3. Alevin log can be found in alevin_output -> alevin and the file name is also alevin. You can find many details about the alevin process there, including the number of transcripts found.
  4. \n
\n
\n
\n
\n
Warning: Process stopping
\n

The command above will display the log of the process and will say “Analyzed X cells (Y% of all)”. For some reason, running Alevin may sometimes cause problems in Jupyter Notebook and this process will stop and not go to completion. This is the reason why we use hugely subsampled dataset here - bigger ones couldn’t be fully analysed (they worked fine locally though). The dataset used in this tutorial shouldn’t make any issues when you’re using Jupyter notebook through galaxy.eu, however might not work properly on galaxy.org. If you’re accessing Jupyter notebook via galaxy.eu and alevin process stopped, just restart the kernel and that should help.

\n
\n\n

Alevin output to SummarizedExperiment

\n

Let’s change gear a little bit. We’ve done the work in bash, and now we’re switching to R to complete the processing. To do so, you have to change Kernel to R (either click on Kernel -> Change Kernel... in the upper left corner of your JupyterLab or click on the displayed current kernel in the upper right corner and change it).

\n
\"FigureOpen image in new tab

Figure 1: Two ways of switching kernel.
\n

Now load the library that we have previously installed in terminal:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-15", "source": [ "library(tximeta)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-16", "source": "

The tximeta package created by {% cite Love2020 %} is used for import of transcript-level quantification data into R/Bioconductor and requires that the entire output of alevin is present and unmodified.

\n

In the alevin_output -> alevin folder you can find the following files:

\n\n

We will only focus on quants_mat.gz though. First, let’s specify the path to that file:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-17", "source": [ "path <- 'alevin_output/alevin/quants_mat.gz'" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-18", "source": "

We will specify the following arguments when running tximeta:

\n\n

With that we can create a dataframe and pass it to tximeta to create SummarizedExperiment object.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-19", "source": [ "coldata <- data.frame(files = path, names=\"sample701\")\n", "alevin_se <- tximeta(coldata, type = \"alevin\")" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-20", "source": "

Inspect the created object:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-21", "source": [ "alevin_se" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-22", "source": "

As you can see, rowData names and colData names are still empty. Before we add some metadata, we will first identify barcodes that correspond to non-empty droplets.

\n

Identify barcodes that correspond to non-empty droplets

\n

Some sub-populations of small cells may not be distinguished from empty droplets based purely on counts by barcode. Some libraries produce multiple ‘knees’ (see the Alevin Galaxy tutorial for multiple sub-populations. The emptyDrops method ({% cite Lun2019 %}) has become a popular way of dealing with this. emptyDrops still retains barcodes with very high counts, but also adds in barcodes that can be statistically distinguished from the ambient profiles, even if total counts are similar.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-23", "source": [ "library(DropletUtils) # load the library and required packages" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-24", "source": "

emptyDrops takes multiple arguments that you can read about in the documentation. However, in this case, we will only specify the following arguments:

\n
\n
\n\n
\n

Let’s then extract the matrix from our alevin_se object. It’s stored in assays -> counts.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-25", "source": [ "matrix_alevin <- assays(alevin_se)$counts" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-26", "source": "

And now run emptyDrops:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-27", "source": [ "# Identify likely cell-containing droplets\n", "out <- emptyDrops(matrix_alevin, lower = 100, niters = 1000, retain = 20)\n", "out" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-28", "source": "

We also correct for multiple testing by controlling the false discovery rate (FDR) using the Benjamini-Hochberg (BH) method ({% cite Benjamini1995 %}). Putative cells are defined as those barcodes that have significantly poor fits to the ambient model at a specified FDR threshold. Here, we will use an FDR threshold of 0.01. This means that the expected proportion of empty droplets in the set of retained barcodes is no greater than 1%, which we consider to be acceptably low for downstream analyses.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-29", "source": [ "is.cell <- out$FDR <= 0.01\n", "sum(is.cell, na.rm=TRUE) # check how many cells left after filtering" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-30", "source": "

We got rid of the background droplets containing no cells, so now we will filter the matrix that we passed on to emptyDrops, so that it corresponds to the remaining cells.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-31", "source": [ "emptied_matrix <- matrix_alevin[,which(is.cell),drop=FALSE] # filter the matrix" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-32", "source": "

From here, we can move on to adding metadata and we’ll return to emptied_matrix soon.

\n

Adding cell metadata

\n

The cells barcodes are stored in colnames. Let’s exctract them into a separate object:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-33", "source": [ "barcode <- colnames(alevin_se)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-34", "source": "

Now, we can simply add those barcodes into colData names where we will keep the cell metadata. To do this, we will create a column called barcode in colData and pass the stored values into there.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-35", "source": [ "colData(alevin_se)$barcode <- barcode\n", "colData(alevin_se)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-36", "source": "

That’s only cell barcodes for now! However, after running emptyDrops, we generated lots of cell information that is currently stored in out object (Total, LogProb, PValue, Limited, FDR). Let’s add those values to cell metadata! Since we already have barcodes in there, we will simply bind the emptyDrops output to the existing dataframe:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-37", "source": [ "colData(alevin_se) <- cbind(colData(alevin_se),out)\n", "colData(alevin_se)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-38", "source": "

As you can see, the new columns were appended successfully and now the dataframe has 6 columns.

\n

If you have a look at the Experimental Design from that study, you might notice that there is actually more information about the cells. The most important for us would be batch, genotype and sex, summarised in the small table below.

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
IndexBatchGenotypeSex
N7010wildtypemale
N7021knockoutmale
N7032knockoutfemale
N7043wildtypemale
N7054wildtypemale
N7065wildtypemale
N7076knockoutmale
\n

We are currently analysing sample N701, so let’s add its information from the table.

\n

Batch

\n

We will label batch as an integer - “0” for sample N701, “1” for N702 etc. The way to do it is creating a list with zeros of the length corresponding to the number of cells that we have in our SummarizedExperiment object, like so:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-39", "source": [ "batch <- rep(\"0\", length(colnames(alevin_se)))" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-40", "source": "

And now create a batch slot in the colData names and append the batch list in the same way as we did with barcodes:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-41", "source": [ "colData(alevin_se)$batch <- batch\n", "colData(alevin_se)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-42", "source": "

A new column appeared, full of zeros - as expected!

\n

Genotype

\n

It’s all the same for genotype, but instead creating a list with zeros, we’ll create a list with string “wildtype” and append it into genotype slot:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-43", "source": [ "genotype <- rep(\"wildtype\", length(colnames(alevin_se)))\n", "colData(alevin_se)$genotype <- genotype" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-44", "source": "

Sex

\n

You already know what to do, right? A list with string “male” and then adding it into a new slot!

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-45", "source": [ "sex <- rep(\"male\", length(colnames(alevin_se)))\n", "colData(alevin_se)$sex <- sex" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-46", "source": "

Check if all looks fine:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-47", "source": [ "colData(alevin_se)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-48", "source": "

3 new columns appeared with the information that we’ve just added - perfect! You can add any information you need in this way, as long as it’s the same for all the cells from one sample.

\n

Adding gene metadata

\n

The genes IDs are stored in rownames. Let’s exctract them into a separate object:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-49", "source": [ "gene_ID <- rownames(alevin_se)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-50", "source": "

Analogically, we will add those genes IDs into rowData names which stores gene metadata. To do this, we will create a column called gene_ID in rowData and pass the stored values into there.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-51", "source": [ "rowData(alevin_se)$gene_ID <- gene_ID" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-52", "source": "

Adding genes symbols based on their IDs

\n

Since gene symbols are much more informative than only gene IDs, we will add them to our metadata. We will base this annotation on Ensembl - the genome database – with the use of the library BioMart. We will use the archive Genome assembly GRCm38 to get the gene names. Please note that the updated version (GRCm39) is available, but some of the gene IDs are not in that EnsEMBL database. The code below is written in a way that it will work for the updated dataset too, but will produce ‘NA’ where the corresponding gene name couldn’t be found.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-53", "source": [ "# get relevant gene names\n", "library(\"biomaRt\") # load the BioMart library\n", "ensembl.ids <- gene_ID\n", "mart <- useEnsembl(biomart = \"ENSEMBL_MART_ENSEMBL\") # connect to a specified BioMart database and dataset hosted by Ensembl\n", "ensembl_m = useMart(\"ensembl\", dataset=\"mmusculus_gene_ensembl\", version=100)\n", "\n", "# The line above connects to a specified BioMart database and dataset within this database.\n", "# In our case we choose the mus musculus database and to get the desired Genome assembly GRCm38,\n", "# we specify the host with this archive. If you want to use the most recent version of the dataset, just run:\n", "# ensembl_m = useMart(\"ensembl\", dataset=\"mmusculus_gene_ensembl\")" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-54", "source": "
\n
Warning: Ensembl connection problems
\n

Sometimes you may encounter some connection issues with Ensembl. To improve performance Ensembl provides several mirrors of their site distributed around the globe. When you use the default settings for useEnsembl() your queries will be directed to your closest mirror geographically. In theory this should give you the best performance, however this is not always the case in practice. For example, if the nearest mirror is experiencing many queries from other users it may perform poorly for you. In such cases, the other mirrors should be chosen automatically.

\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-55", "source": [ "genes <- getBM(attributes=c('ensembl_gene_id','external_gene_name'),\n", " filters = 'ensembl_gene_id',\n", " values = ensembl.ids,\n", " mart = ensembl_m)\n", "\n", "# The line above retrieves the specified attributes from the connected BioMart database;\n", "# 'ensembl_gene_id' are genes IDs,\n", "# 'external_gene_name' are the genes symbols that we want to get for our values stored in ‘ensembl.ids’." ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-56", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-57", "source": [ "# see the resulting data\n", "head(genes)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-58", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-59", "source": [ "# replace IDs for gene names\n", "gene_names <- ensembl.ids\n", "count = 1\n", "for (geneID in gene_names)\n", "{\n", " index <- which(genes==geneID) # finds an index of geneID in the genes object created by getBM()\n", " if (length(index)==0) # condition in case if there is no corresponding gene name in the chosen dataset\n", " {\n", " gene_names[count] <- 'NA'\n", " }\n", " else\n", " {\n", " gene_names[count] <- genes$external_gene_name[index] \t# replaces gene ID by the corresponding gene name based on the found geneID’s index\n", " }\n", " count = count + 1 # increased count so that every element in gene_names is replaced\n", "}" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-60", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-61", "source": [ "# add the gene names into rowData in a new column gene_name\n", "rowData(alevin_se)$gene_name <- gene_names" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-62", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-63", "source": [ "# see the changes\n", "rowData(alevin_se)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-64", "source": "

If you are working on your own data and it’s not mouse data, you can check available datasets for other species and just use relevant dataset in useMart() function.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-65", "source": [ "listDatasets(mart) # available datasets" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-66", "source": "

Flag mitochondrial genes

\n

We can also flag mitochondrial genes. Usually those are the genes whose name starts with ‘mt-‘ or ‘MT-‘. Therefore, we will store those characters in mito_genes_names and then use grepl() to find those genes stored in gene_name slot.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-67", "source": [ "mito_genes_names <- '^mt-|^MT-' # how mitochondrial gene names can start\n", "mito <- grepl(mito_genes_names, rowData(alevin_se)$gene_name) # find mito genes\n", "mito # see the resulting boolean list" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-68", "source": "

Now we can add another slot in rowData and call it mito that will keep boolean values (true/false) to indicate which genes are mitochondrial.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-69", "source": [ "rowData(alevin_se)$mito <- mito\n", "rowData(alevin_se)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-70", "source": "

Subsetting the object

\n

Let’s go back to the emptied_matrix object. Do you remember how many cells were left after filtering? We can check that by looking at the matrix’ dimensions:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-71", "source": [ "dim(matrix_alevin) # check the dimension of the unfiltered matrix\n", "dim(emptied_matrix) # check the dimension of the filtered matrix" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-72", "source": "

We’ve gone from 3608 to 35 cells. We’ve filtered the matrix, but not our SummarizedExperiment. We can subset alevin_se based on the cells that were left after filtering. We will store them in a separate list, as we did with the barcodes:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-73", "source": [ "retained_cells <- colnames(emptied_matrix)\n", "retained_cells" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-74", "source": "

And now we can subset our SummarizedExperiment based on the barcodes that are in the retained_cells list:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-75", "source": [ "alevin_subset <- alevin_se[, colData(alevin_se)$barcode %in% retained_cells]\n", "alevin_701 <- alevin_subset\n", "alevin_701" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-76", "source": "

And that’s our subset, ready for downstream analysis!

\n

More datasets

\n

We’ve done the analysis for one sample. But there are 7 samples in this experiment and it would be very handy to have all the information in one place. Therefore, you would need to repeat all the steps for the subsequent samples (that’s when you’ll appreciate wrapped tools and automation in Galaxy workflows!). To make your life easier, we will show you how to combine the datasets on smaller scale. Also, to save you some time, we’ve already run Alevin on sample 702 (also subsampled to 50k reads). Let’s quickly repeat the steps we performed in R to complete the analysis of sample 702 in the same way as we did with 701.

\n

But first, we have to save the results of our hard work on sample 701!

\n

Saving sample 701 data

\n

Saving files is quite straightforward. Just specify which object you want to save and how you want the file to be named. Don’t forget the extension!

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-77", "source": [ "save(alevin_701, file = \"alevin_701.rdata\")" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-78", "source": "

You will see the new file in the panel on the left.

\n

Analysis of sample 702

\n

Normally, at this point you would switch kernel to bash to run alevin, and then back to R to complete the analysis of another sample. Here, we are providing you with the alevin output for the next sample, but to give you some practise in switching kernels and saving data, we will use bash to unzip the folder with that output data.

\n
\n
Warning: Switching kernels & losing variables
\n

Be aware that every time when you switch kernel, you will lose variables you store in the objects that you’ve created, unless you save them. Therefore, if you want to switch from R to bash, make sure you save your R objects! You can then load them anytime.

\n
\n

Let’s switch the kernel back to bash and run the following code to unzip the alevin output for sample 702:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-79", "source": [ "# we're in bash again!\n", "wget https://zenodo.org/records/10116786/files/alevin_output_702.zip" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-80", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-81", "source": [ "unzip alevin_output_702.zip" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-82", "source": "

The files are there! Now back to R - switch kernel again.

\n

Above we described all the steps done in R and explained what each bit of code does. Below all those steps are in one block of code, so read carefully and make sure you understand everything!

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-83", "source": [ "# we're in R now!\n", "\n", "## load libraries again ##\n", "library(tximeta)\n", "library(DropletUtils)\n", "library(biomaRt)\n", "\n", "## running tximeta ##\n", "path2 <- 'alevin_output_702/alevin/quants_mat.gz' # path to alevin output for N702\n", "alevin2 <- tximeta(coldata = data.frame(files = path2, names = \"sample702\"), type = \"alevin\") # create SummarizedExperiment from Alevin output\n", "\n", "## running emptyDrops ##\n", "matrix_alevin2 <- assays(alevin2)$counts # extract matrix from SummarizedExperiment\n", "out2 <- emptyDrops(matrix_alevin2, lower = 100, niters = 1000, retain = 20) # apply emptyDrops\n", "is.cell2 <- out2$FDR <= 0.01 # apply FDR threshold\n", "emptied_matrix2 <- matrix_alevin2[,which(is.cell2),drop=FALSE] # subset the matrix to the cell-containing droplets\n", "\n", "## adding cell metadata ##\n", "barcode2 <- colnames(alevin2)\n", "colData(alevin2)$barcode <- barcode2 # add barcodes\n", "\n", "colData(alevin2) <- cbind(colData(alevin2),out2) # add emptyDrops info\n", "\n", "batch2 <- rep(\"1\", length(colnames(alevin2)))\n", "colData(alevin2)$batch <- batch2 # add batch info\n", "\n", "genotype2 <- rep(\"wildtype\", length(colnames(alevin2)))\n", "colData(alevin2)$genotype <- genotype2 # add genotype info\n", "\n", "sex2 <- rep(\"male\", length(colnames(alevin2)))\n", "colData(alevin2)$sex <- sex2 # add sex info\n", "\n", "## adding gene metadata ##\n", "gene_ID2 <- rownames(alevin2)\n", "rowData(alevin2)$gene_ID <- gene_ID2\n", "\n", "# get relevant gene names\n", "ensembl.ids2 <- gene_ID2\n", "mart <- useEnsembl(biomart = \"ENSEMBL_MART_ENSEMBL\") # connect to a specified BioMart database and dataset hosted by Ensembl\n", "ensembl_m2 = useMart(\"ensembl\", dataset=\"mmusculus_gene_ensembl\", version=100)\n", "\n", "genes2 <- getBM(attributes=c('ensembl_gene_id','external_gene_name'),\n", " filters = 'ensembl_gene_id',\n", " values = ensembl.ids2,\n", " mart = ensembl_m2)\n", "\n", "# replace IDs for gene names\n", "gene_names2 <- ensembl.ids2\n", "count = 1\n", "for (geneID in gene_names2)\n", "{\n", " index <- which(genes2==geneID) # finds an index of geneID in the genes object created by getBM()\n", " if (length(index)==0) # condition in case if there is no corresponding gene name in the chosen dataset\n", " {\n", " gene_names2[count] <- 'NA'\n", " }\n", " else\n", " {\n", " gene_names2[count] <- genes2$external_gene_name[index] \t# replaces gene ID by the corresponding gene name based on the found geneID’s index\n", " }\n", " count = count + 1 # increased count so that every element in gene_names is replaced\n", "}\n", "\n", "rowData(alevin2)$gene_name <- gene_names2 # add gene names to gene metadata\n", "\n", "mito_genes_names <- '^mt-|^MT-' # how mitochondrial gene names can start\n", "mito2 <- grepl(mito_genes_names, rowData(alevin2)$gene_name) # find mito genes\n", "rowData(alevin2)$mito <- mito2 # add mitochondrial information to gene metadata\n", "\n", "## create a subset of filtered object ##\n", "retained_cells2 <- colnames(emptied_matrix2)\n", "alevin_subset2 <- alevin2[, colData(alevin2)$barcode %in% retained_cells2]\n", "\n", "alevin_702 <- alevin_subset2\n", "alevin_702" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-84", "source": "

Alright, another sample pre-processed!

\n

Combining datasets

\n

Pre-processed sample 702 is there, but we still need to load sample 701 that we saved before switching kernels. It’s equally easy as saving the object:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-85", "source": [ "load(\"alevin_701.rdata\")" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-86", "source": "

Check if it was loaded ok:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-87", "source": [ "alevin_701" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-88", "source": "

Now we can combine those two objects into one using one simple command:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-89", "source": [ "alevin_combined <- cbind(alevin_701, alevin_702)\n", "alevin_combined" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-90", "source": "

If you have more samples, just append them in the same way. We won’t process another sample here, but pretending that we have third sample, we would combine it like this:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-91", "source": [ "alevin_subset3 <- alevin_702 # copy dataset for demonstration purposes\n", "alevin_combined_demo <- cbind(alevin_combined, alevin_subset3)\n", "alevin_combined_demo" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-92", "source": "

You get the point, right? It’s important though that the rowData names and colData names are the same in each sample.

\n

Saving and exporting the files

\n

It is generally more common to use SingleCellExperiment format rather than SummarizedExperiment. The conversion is quick and easy, and goes like this:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-93", "source": [ "library(SingleCellExperiment) # might need to load this library\n", "alevin_sce <- as(alevin_combined, \"SingleCellExperiment\")\n", "alevin_sce" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-94", "source": "

As you can see, all the embeddings have been successfully transfered during this conversion and believe me, sce object will be more useful for you!

\n

You’ve already learned how to save and load objects in Jupyter notebook, let’s then save the SCE file:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-95", "source": [ "save(alevin_sce, file = \"alevin_sce.rdata\")" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-96", "source": "

The last thing that might be useful is exporting the files into your Galaxy history. To do it… guess what! Yes - switching kernels again! But this time we choose Python 3 kernel and run the following command:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-97", "source": [ "# that's Python now!\n", "put(\"alevin_sce.rdata\")" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "Alevin is a tool integrated with the [Salmon software](https://salmon.readthedocs.io/en/latest/salmon.html), so first we need to get Salmon. You can install Salmon using conda, but in this tutorial we will show an alternative method - downloading the pre-compiled binaries from the [releases page](https://github.com/COMBINE-lab/salmon/releases). Note that binaries are usually compiled for specific CPU architectures, such as the 64-bit (x86_64) machine release referenced below ." ], "id": "" } } }, { "id": "cell-98", "source": "

Conclusion

\n

Well done! In this tutorial we have:

\n\n

As you might now appreciate, some tasks are much quicker and easier when run in the code, but the reproducibility and automation of Galaxy workflows is a huge advantage that helps in dealing with many samples more quickly and efficiently.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "- Create a SCE object from FASTQ files, including relevant gene and cell metadata, and do it all in Jupyter Notebook!\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/alevin-commandline/tutorial.html#feedback) and check there for further resources!\n" ] } ] }