Measuring gene expression on a genome-wide scale has become common practice over the last two decades or so, with microarrays predominantly used pre-2008. With the advent of next generation sequencing technology in 2008, an increasing number of scientists use this technology to measure and understand changes in gene expression in often complex systems. As sequencing costs have decreased, using RNA-Seq to simultaneously measure the expression of tens of thousands of genes for multiple samples has never been easier. The cost of these experiments has now moved from generating the data to storing and analysing it.
There are many steps involved in analysing an RNA-Seq experiment. The analysis begins with sequencing reads (FASTQ files). These are usually aligned to a reference genome, if available. Then the number of reads mapped to each gene can be counted. This results in a table of counts, which is what we perform statistical analyses on to determine differentially expressed genes and pathways. The purpose of this tutorial is to demonstrate how to do read alignment and counting, prior to performing differential expression. Differential expression analysis with limma-voom is covered in an accompanying tutorial RNA-seq counts to genes. The tutorial here shows how to start from FASTQ data and perform the mapping and counting steps, along with associated Quality Control.
This study examined the expression profiles of basal and luminal cells in the mammary gland of virgin, pregnant and lactating mice. Six groups are present, with one for each combination of cell type and mouse status. Note that two biological replicates are used here, two independent sorts of cells from the mammary glands of virgin, pregnant or lactating mice, however three replicates is usually recommended as a minimum requirement for RNA-seq.
Your results may be slightly different from the ones presented in this tutorial
due to differing versions of tools, reference data, external databases, or
because of stochastic processes in the algorithms.
Preparing the reads
Import data from URLs
Read sequences are usually stored in compressed (gzipped) FASTQ files. Before the differential expression analysis can proceed, these reads must be aligned to the reference genome and counted into annotated genes. Mappingreads to the genome is a very important task, and many different aligners are available, such as HISAT2 (Kim, Langmead, and Salzberg, 2015), STAR (Dobin et al. 2013) and Subread (Liao, Smyth, and Shi 2013]). Most mapping tasks require larger computers than an average laptop, so usually read mapping is done on a server in a linux-like environment, requiring some programming knowledge. However, Galaxy enables you to do this mapping without needing to know programming and if you don’t have access to a server you can try to use one of the publically available Galaxies e.g. usegalaxy.org, usegalaxy.eu, usegalaxy.org.au.
If you are sequencing your own data, the sequencing facility will almost always provide compressed FASTQ files which you can upload into Galaxy. If your FASTQs are provided through Galaxy’s Shared Data, you can easily import them into a history. For publicly available sequence data, such as from GEO/SRA/ENA, Galaxy’s Rule-based Uploader can be used to import the files from URLs, saving on the need to download to your computer and upload into Galaxy.
The raw reads used in this tutorial were obtained from SRA from the link given in GEO for the the mouse mammary gland dataset (Fu et al. 2015) (e.g ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP045%2FSRP045534). For the purpose of this tutorial we are going to be working with a small part of the FASTQ files. We are only going to be mapping 1000 reads from each sample to enable running through all the steps quickly. If working with your own data you would use the full data and some results for the full mouse dataset will be shown for comparison. The small FASTQ files are available in Figshare and the links to the FASTQ files are provided below. We are going to import the files into a Collection. Using Galaxy Collections helps keep the datasets organised and tidy in the history. Collections also make it easier to maintain the sample names through tools and workflows. If you are not familiar with collections, see the Galaxy Collections tutorial.
The sample information (sample ID, Group) and link to the FASTQ file (URL) are in the grey box below. To generate a file like this to import data from SRA/ENA see the Galaxy Rule-based Uploader tutorial.
Click on the name of one of the collection elements (e.g MCL1-DL). Check that the data type is fastqsanger.gz and the database mm10.
Click on the galaxy-eye (eye) icon of one of the FASTQ files to have a look at what it contains
QC: raw reads
During sequencing, errors are introduced, such as incorrect nucleotides being called. These are due to the technical limitations of each sequencing platform. Sequencing errors might bias the analysis and can lead to a misinterpretation of the data. Every base sequence gets a quality score from the sequencer and this information is present in the FASTQ file. A quality score of 30 corresponds to a 1 in 1000 chance of an incorrect base call (a quality score of 10 is a 1 in 10 chance of an incorrect base call). To look at the overall distribution of quality scores across the reads, we can use FastQC.
Sequence quality control is therefore an essential first step in your analysis. We will use similar tools as described in the “Quality control” tutorial: FastQC and Cutadapt.
hands_on Hands-on: Check raw reads with FastQC
FastQCtool with the following parameters:
param-collection“Short read data from your current history”: fastqs (Input dataset collection)
Inspect the Webpage output of FastQCtool for the MCL1-DL sample by clicking on the galaxy-eye (eye) icon
question Questions
What is the read length?
What base quality score encoding is used?
solution Solution
The read length is 100 bp.
Sanger quality score encoding is used.
This information can be seen at the top of the FastQC Webpage as below.
Figure 4: FastQC Webpage
The FastQC report contains a lot of information and we can look at the report for each sample. However, that is quite a few reports, 12 for this dataset. If you had more samples it could be a lot more. Luckily, there is a very useful tool called MultiQC that can summarise QC information for multiple samples into a single report.
hands_on Hands-on: Aggregate FastQC reports with MultiQC
MultiQCtool with the following parameters to aggregate the FastQC reports
In “Results”
param-select“Which tool was used generate logs?”: FastQC
In “FastQC output”
param-select“Type of FastQC output?”: Raw data
param-collection“FastQC output”: RawData files (output of FastQCtool on trimmed reads)
Inspect the Webpage output from MultiQC
Note that these are the results for just 1000 reads. The FastQC results for the full dataset are shown below. The 1000 reads are the first reads from the FASTQ files, and the first reads usually originate from the flowcell edges, so we can expect that they may have lower quality and the patterns may be a bit different from the distribution in the full dataset.
You should see that most of the plots in the small FASTQs look similar to the full dataset. However, in the small FASTQs, there is less duplication, some Ns in the reads and some overrepresented sequences.
Figure 5: General StatisticsFigure 6: Sequence CountsFigure 7: Sequence QualityFigure 8: Per Sequence Quality ScoresFigure 9: Per Sequence GC ContentFigure 10: Per base N contentFigure 11: Sequence Duplication LevelsFigure 12: Adapter Content
What do you think of the overall quality of the sequences?
solution Solution
Overall, the samples look pretty good. The main things to note here are:
The base quality is high in all samples.
Some Illumina adapter has been detected.
Some duplication in RNA-seq can be normal due to the presence of highly expressed genes. However, for some reason MCL1-LE and MCL1-LF have higher numbers of duplicates detected than the other samples.
We will use Cutadapt to trim the reads to remove the Illumina adapter and any low quality bases at the ends (quality score < 20). We will discard any sequences that are too short (< 20bp) after trimming. We will also output the Cutadapt report for summarising with MultiQC.
The Cutadapt tool Help section provides the sequence we can use to trim this standard Illumina adapter AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC, as given on the Cutadapt website. For trimming paired-end data see the Cutadapt Help section. Other Illumina adapter sequences (e.g. Nextera) can be found at the Illumina website. Note that Cutadapt requires at least three bases to match between adapter and read to reduce the number of falsely trimmed bases, which can be changed in the Cutadapt options if desired.
Trim reads
hands_on Hands-on: Trim reads with Cutadapt
Cutadapttool with the following parameters:
param-select“Single-end or Paired-end reads?”: Single-end
The reads are no longer all the same length, we now have sequences of different lengths detected.
Mapping
Now that we have prepared our reads, we can align the reads for our 12 samples. There is an existing reference genome for mouse and we will map the reads to that. The current version of the mouse reference genome is mm10/GRCm38. Here we will use HISAT2 to align the reads. HISAT2 is the descendent of TopHat, one of the first widely-used aligners, but alternative mappers could be used, such as STAR. See the RNA-seq ref-based tutorial for more information on RNA-seq mappers. There are often numerous mapping parameters that we can specify, but usually the default mapping parameters are fine. However, library type (paired-end vs single-end) and library strandness (stranded vs unstranded) require some different settings when mapping and counting, so they are two important pieces of information to know about samples. The mouse data comprises unstranded, single-end reads so we will specify that where necessary. HISAT2 can output a mapping summary file that tells what proportion of reads mapped to the reference genome. Summary files for multiple samples can be summarised with MultiQC. As we’re only using a subset of 1000 reads per sample, aligning should just take a minute or so. To run the full samples from this dataset would take longer.
Map reads to reference genome
hands_on Hands-on: Map reads to reference with HISAT2
HISAT2tool with the following parameters:
param-select“Source for the reference genome”: Use a built-in genome
param-select“Select a reference genome”: mm10
param-select“Is this a single or paired library?”: Single-end
param-collection“FASTA/Q file”: Read 1 Output (output of Cutadapttool)
In “Summary Options”:
param-check“Output alignment summary in a more machine-friendly style.”: Yes
param-check“Print alignment summary to a file.”: Yes
comment Settings for Paired-end or Stranded reads
If you have paired-endreads
Select “Is this a single or paired library”Paired-end or Paired-end Dataset Collection or Paired-end data from single interleaved dataset
If you have strandedreads
Select “Specify strand information”: Forward (FR) or Reverse (RF)
HISAT2 generates a BAM file with mapped reads.
hands_on Hands-on: Take a look at BAM format
Click on the collection name (HISAT2 on collection N: aligned reads (BAM))
Click on the galaxy-eye (eye) icon of one of the BAM files to have a look at what it contains.
An important metric to check is the percentage of reads mapped to the reference genome. A low percentage can indicate issues with the data or analysis. We can use MultiQC again to summarise the QC information.
hands_on Hands-on: Aggregate the HISAT2 summary files with MultiQC
MultiQCtool with the following parameters to aggregate the HISAT2 summary files
In “Results”
param-select“Which tool was used generate logs?”: HISAT2
param-collection“Output of HISAT2”: Mapping summary (output of HISAT2tool)
Inspect the Webpage output from MultiQC
The MultiQC plot below shows the result from the full dataset for comparison.
Figure 15: HISAT2 mapping
Over 90% of reads have mapped in all samples, which is a good mapping rate, and the vast majority of reads have mapped uniquely, they haven’t mapped to multiple locations in the reference genome.
It is also good practice to visualise the read alignments in the BAM file, for example using IGV, see the RNA-seq ref-based tutorial.
tip Tip: Downloading a collection
To download a collection of datasets (e.g. the collection of BAM files) click on the floppy disk icon within the collection. This will download a tar file containing all the datasets in the collection. Note that for BAM files the .bai indexes (required for IGV) will be included automatically in the download.
QC: strandness
As far as we know this data is unstranded, but as a sanity check you can check the strandness. You can use RSeQC Infer Experiment tool to “guess” the strandness, as explained in the RNA-seq ref-based tutorial. This is done through comparing the “strandness of reads” with the “strandness of transcripts”. For this tool, and many of the other RSeQC tools, a reference bed file of genes (reference genes) is required. RSeQC provides some reference BED files for model organisms. You can import the RSeQC mm10 RefSeq BED file from the link https://sourceforge.net/projects/rseqc/files/BED/Mouse_Mus_musculus/mm10_RefSeq.bed.gz/download (and rename to reference genes) or import a file from Shared data if provided. Alternatively, you can provide your own BED file of reference genes, for example from UCSC (see the Peaks to Genes tutorial. Or the Convert GTF to BED12 tool can be used to convert a GTF into a BED file.
tip Tip: Importing data via links
Copy the link location
Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)
Select Paste/Fetch Data
Paste the link into the text field
Press Start
Close the window
By default, Galaxy uses the URL as the name, so rename the files with a more useful name.
hands_on Hands-on: Check strandness with Infer Experiment
Infer Experimenttool with the following parameters:
param-collection“Input .bam file”: aligned reads (BAM) (output of HISAT2tool)
param-file“Reference gene model”: reference genes (Reference BED file)
MultiQCtool with the following parameters:
In “1: Results”:
param-select“Which tool was used generate logs?”: RSeQC
param-select“Type of RSeQC output?”: infer_experiment
param-collection“RSeQC infer_experiment output”: Infer Experiment output (output of Infer Experimenttool)
Inspect the Webpage output from MultiQC
The MultiQC plot below shows the result from the full dataset for comparison.
Figure 16: Infer Experiment
question Questions
Do you think the data is stranded or unstranded?
solution Solution
It is unstranded as approximately equal numbers of reads have aligned to the sense and antisense strands.
QC: duplicate reads
Duplicate reads are usually kept in RNA-seq differential expression analysis as they can come from highly-expressed genes but it is still a good metric to check. A high percentage of duplicates can indicate a problem with the sample, for example, PCR amplification of a low complexity library (not many transcripts) due to not enough RNA used as input. FastQC gives us an idea of duplicates in the reads before mapping (note that it just takes a sample of the data). We can assess the numbers of duplicates in all mapped reads using the Picard MarkDuplicates tool. Picard considers duplicates to be reads that map to the same location, based on the start position of where the read maps. In general, we consider normal to obtain up to 50% of duplication.
hands_on Hands-on: Check duplicate reads with MarkDuplicates
MarkDuplicatestool with the following parameters:
param-collection“Select SAM/BAM dataset or dataset collection”: aligned reads (BAM) (output of HISAT2tool)
MultiQCtool with the following parameters:
In “1: Results”:
param-select“Which tool was used generate logs?”: Picard
param-select“Type of Picard output?”: Markdups
param-collection“Picard output”: MarkDuplicate metrics (output of MarkDuplicatestool)
Inspect the Webpage output from MultiQC
The MultiQC plot below shows the result from the full dataset for comparison.
Figure 17: MarkDups metrics
question Questions
Which two samples have the most duplicates detected?
solution Solution
MCL1-LE and MCL1-LF have the highest number of duplicates in mapped reads compared to the other samples, similar to what we saw in the raw reads with FastQC.
QC: number of reads mapped to each chromosome
You can check the numbers of reads mapped to each chromosome with the Samtools IdxStats tool. This can help assess the sample quality, for example, if there is an excess of mitochondrial contamination. It could also help to check the sex of the sample through the numbers of readsmapping to X/Y or to see if any chromosomes have highly expressed genes.
hands_on Hands-on: Count readsmapping to each chromosome with IdxStats
IdxStatstool with the following parameters:
param-collection“BAM file”: aligned reads (BAM) (output of HISAT2tool)
MultiQCtool with the following parameters:
In “1: Results”:
param-select“Which tool was used generate logs?”: Samtools
param-select“Type of Samtools output?”: idxstats
param-collection“Samtools idxstats output”: IdxStats output (output of IdxStatstool)
Inspect the Webpage output from MultiQC
The MultiQC plot below shows the result from the full dataset for comparison.
Are the samples male or female? (If a sample is not in the XY plot it means no reads mapped to Y)
solution Solution
Some of the samples have very high mapping on chromosome 5. What is going on there?
The samples appear to be all female as there are few readsmapping to the Y chromosome. As this is a experiment studying virgin, pregnant and lactating mice if we saw large numbers of readsmapping to the Y chromosome in a sample it would be unexpected and a probable cause for concern.
QC: gene body coverage (5’-3’)
The coverage of reads along gene bodies can be assessed to check if there is any bias in coverage. For example, a bias towards the 3’ end of genes could indicate degradation of the RNA. Alternatively, a 3’ bias could indicate that the data is from a 3’ assay (e.g. oligodT-primed, 3’RNA-seq). You can use the RSeQC Gene Body Coverage (BAM) tool to assess gene body coverage in the BAM files.
hands_on Hands-on: Check coverage of genes with Gene Body Coverage (BAM)
Gene Body Coverage (BAM)tool with the following parameters:
“Run each sample separately, or combine mutiple samples into one plot”: Run each sample separately
param-collection“Input .bam file”: aligned reads (BAM) (output of HISAT2tool)
param-select“Which tool was used generate logs?”: RSeQC
param-select“Type of RSeQC output?”: gene_body_coverage
param-collection“RSeQC gene_body_coverage output”: Gene Body Coverage (BAM) (text) (output of Gene Body Coverage (
BAM)tool)
Inspect the Webpage output from MultiQC
The MultiQC plot below shows the result from the full dataset for comparison.
Figure 20: Gene Body Coverage
The plot below from the RSeQC website shows what samples with 3’biased coverage would look like.
Figure 21: Gene Body Coverage comparison
question Questions
What do you think of the coverage across gene bodies in these samples?
solution Solution
It looks good. This plot looks a bit noisy in the small FASTQs but it still shows there’s pretty even coverage from 5’ to 3’ ends with no obvious bias in all the samples.
QC: read distribution across features (exons, introns, intergenic..)
We can also check the distribution of reads across known gene features, such as exons (CDS, 5’UTR, 3’UTR), introns and intergenic regions. In RNA-seq we expect most reads to map to exons rather than introns or intergenic regions. It is also the reads mapped to exons that will be counted so it is good to check what proportions of reads have mapped to those. High numbers of readsmapping to intergenic regions could indicate the presence of DNA contamination.
hands_on Hands-on: Check distribution of reads with Read Distribution
Read Distributiontool with the following parameters:
param-collection“Input .bam/.sam file”: aligned reads (BAM) (output of HISAT2tool)
param-select“Which tool was used generate logs?”: RSeQC
param-select“Type of RSeQC output?”: read_distribution
param-collection“RSeQC read_distribution output”: Read Distribution output (output of Read Distributiontool)
Inspect the Webpage output from MultiQC
The MultiQC plot below shows the result from the full dataset for comparison.
Figure 22: Read Distribution
question Questions
What do you think of the read distribution?
solution Solution
It looks good, most of the reads have mapped to exons and not many to introns or intergenic regions. The samples have pretty consistent read distribution, albeit with slightly higher numbers of readsmapping to CDS exons for MCL1-LC and MCL1-LD, and MCL1-LE and MCL1-LF have more readsmapping to CDS exons than the other samples.
Now that we have checked the data and are happy with how it looks, we can proceed to counting.
Counting
The alignment produces a set of BAM files, where each file contains the read alignments for each sample. In the BAM file, there is a chromosomal location for every read that mapped. Now that we have figured out where each read comes from in the genome, we need to summarise the information across genes or exons. The mapped reads can be counted across mouse genes by using a tool called featureCounts. featureCounts requires gene annotation specifying the genomic start and end position of each exon of each gene. For convenience, featureCounts contains built-in annotation for mouse (mm10, mm9) and human (hg38, hg19) genome assemblies, where exon intervals are defined from the NCBI RefSeq annotation of the reference genome. Reads that map to exons of genes are added together to obtain the count for each gene, with some care taken with reads that span exon-exon boundaries. The output is a count for each Entrez Gene ID, which are numbers such as 100008567. For other species, users will need to read in a data frame in GTF format to define the genes and exons. Users can also specify a custom annotation file in SAF format. See the tool help in Galaxy, which has an example of what an SAF file should like like, or the Rsubread users guide for more information.
comment Comment
In this example we have kept many of the default settings, which are typically optimised to work well under a variety of situations. For example, the default setting for featureCounts is that it only keeps reads that uniquely map to the reference genome. For testing differential expression of genes, this is preferred, as the reads are unambigously assigned to one place in the genome, allowing for easier interpretation of the results. Understanding all the different parameters you can change involves doing a lot of reading about the tool that you are using, and can take a lot of time to understand! We won’t be going into the details of the parameters you can change here, but you can get more information from looking at the tool help.
Count reads mapped to genes
hands_on Hands-on: Count reads mapped to genes with featureCounts
featureCountstool with the following parameters:
param-collection“Alignment file”: aligned reads (BAM) (output of HISAT2tool)
param-select“Count fragments instead of reads”: Enabled; fragments (or templates) will be counted instead of reads
If you have strandedreads
param-select Select “Specify strand information”: Stranded (Forward) or Stranded (Reverse)
QC: assignment of reads to genes
featureCounts reports the numbers of unassigned reads and the reasons why they are not assigned (eg. ambiguity, multi-mapping, secondary alignment, mapping quality, fragment length, chimera, read duplicate, non-junction and so on), in addition to the number of successfully assigned reads for each library. The statistics of the read mapping are output as a file from featureCounts. MultiQC can be used to assess the numbers of reads assigned to features, genes in this case.
hands_on Hands-on: Assess reads mapped to genes
MultiQCtool with the following parameters:
param-select“Which tool was used generate logs?”: featureCounts
param-collection“Output of FeatureCounts”: featureCounts summary (output of featureCountstool)
Inspect the Webpage output from MultiQC
The MultiQC plot below shows the result from the full dataset for comparison.
Figure 23: featureCounts assignments
question Questions
What % reads are assigned to exons?
solution Solution
~60-70% of reads are assigned to exons. This is a fairly typical number for RNA-seq.
The counts for the samples are output as tabular files. Take a look at one. The numbers in the first column of the counts file represent the Entrez gene identifiers for each gene, while the second column contains the counts for each gene for the sample.
Create count matrix
The counts files are currently in the format of one file per sample. However, it is often convenient to have a count matrix. A count matrix is a single table containing the counts for all samples, with the genes in rows and the samples in columns. The counts files are all within a collection so we can use the GalaxyColumn Join on Collection tool to easily create a count matrix from the single counts files.
hands_on Hands-on: Create count matrix with Column Join on Collection
Column Join on Collectiontool with the following parameters:
param-collection“Tabular files”: featureCounts output (output of featureCountstool)
param-text“Identifier column”: 1
param-text“Number of header lines in each input file”: 1
param-check“Add column name to header”: No
Take a look at the output (the output for the full dataset is shown below).
Figure 24: Count matrix
Now it is easier to see the counts for a gene across all samples. The accompanying tutorial, RNA-seq counts to genes, shows how gene information (symbols etc) can be added to a count matrix.
Generating an interactive QC summary report
So far we have run MultiQC on one step at a time which generates multiple reports for the analysis. However, it would be more convenient to have a single report containing all the QC information for the analysis. This can be easily achieved in Galaxy with the MultiQC tool.
question Questions
Can you create a MultiQC report that summarises all the QC information into a single report?
Summarise the QC info for Cutadapt, FastQC, HISAT2, Infer Experiment, MarkDups, IdxStats, Gene Body Coverage, Read Distribution, featureCounts.
Use just one of the two FastQC outputs e.g. the trimmed FastQC RawData and not the untrimmed aswell, as you currently can’t have more than one run of the same tool in the same MultiQC report.
solution Solution
MultiQC summary of all programs used (screenshot from the full dataset).
Figure 25: MultiQC Summary
MultiQCtool with the following parameters:
In “Results”:
In “1: Results”:
param-select“Which tool was used generate logs?”: Cutadapt/Trim Galore!
param-collection“Output of Cutadapt”: Cutadapt Report (output of Cutadapttool)
param-repeat Click on “Insert Results”:
In “2: Results”:
param-select“Which tool was used generate logs?”: FastQC
param-collection“FastQC output”: FastQC RawData (output of FastQCtool)
param-repeat Click on “Insert Results”:
In “3: Results”:
param-select“Which tool was used generate logs?”: HISAT2
param-collection“Output of HISAT2”: Mapping summary (output of HISAT2tool)
param-repeat Click on “Insert Results”:
In “4: Results”:
param-select“Which tool was used generate logs?”: RSeQC
param-select“Type of RSeQC output?”: infer_experiment
param-collection“RSeQC infer_experiment output”: Infer Experiment output (output of Infer Experimenttool)
param-repeat Click on “Insert Results”:
In “5: Results”:
param-select“Which tool was used generate logs?”: Picard
param-select“Type of Picard output?”: Markdups
param-collection“Picard output”: MarkDuplicate metrics (output of MarkDuplicatestool)
param-repeat Click on “Insert Results”:
In “6: Results”:
param-select“Which tool was used generate logs?”: Samtools
param-select“Type of Samtools output?”: idxstats
param-collection“Samtools idxstats output”: IdxStats output (output of IdxStatstool)
param-repeat Click on “Insert Results”:
In “7: Results”:
param-select“Which tool was used generate logs?”: RSeQC
param-select“Type of RSeQC output?”: gene_body_coverage
param-collection“RSeQC gene_body_coverage output”: Gene Body Coverage (BAM) (text) (output of Gene Body Coverage (
BAM)tool)
param-repeat Click on “Insert Results”:
In “8: Results”:
param-select“Which tool was used generate logs?”: RSeQC
param-select“Type of RSeQC output?”: read_distribution
param-collection“RSeQC read_distribution output”: Read Distribution output (output of Read Distributiontool)
param-repeat Click on “Insert Results”:
In “9: Results”:
param-select“Which tool was used generate logs?”: featureCounts
param-collection“Output of FeatureCounts”: featureCounts summary (output of featureCountstool)
The MultiQC report can be downloaded by clicking on the floppy disk icon on the dataset in the history.
question Questions
Can you think of any other QCs that could be performed on RNA-seq reads?
solution Solution
The reads could be checked for:
Ribosomal contamination
Contamination with other species e.g. bacteria
GC bias of the mapped reads
This is single-end data but paired-end mapped reads could be checked for fragment size (distance between the read pairs).
Creating a workflow of the analysis
You can now extract a workflow from the steps that have been performed, as shown in the Extracting Workflows from Histories tutorial. This is a way to help keep a record of the steps performed. It also enables efficient rerunning of a multi-step analysis, such as RNA-seq.
Conclusion
In this tutorial we have seen how reads (FASTQ files) can be converted into counts. We have also seen QC steps that can be performed to help assess the quality of the data. A follow-on tutorial, RNA-seq counts to genes, shows how to perform differential expression and QC on the counts for this dataset.
keypoints Key points
In RNA-seq, reads (FASTQs) are mapped to a reference genome with a spliced aligner (e.g HISAT2, STAR)
The aligned reads (BAMs) can then be converted to counts (e.g featureCounts, HTSeq)
Many QC steps can be performed to help check the quality of the data
MultiQC can be used to create a nice summary report of QC information
The Galaxy Rule-based Uploader, Collections and Workflows can help make analysis more efficient and easier
Useful literature
Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.
congratulations Congratulations on successfully completing this tutorial!
Did you use this material as an instructor? Feel free to give us feedback on how it went.
feedback Give us even more feedback on this content!
To give us more detailed feedback about these materials, please take a moment to fill in the extended Feedback Form.