# Quality Control

### Overview

Questions:
• How to perform quality control of NGS raw data (FASTQ)?

• What are the quality parameters to check for a dataset?

• How to improve the quality of a dataset?

Objectives:
• Assess FASTQ quality using FASTQE 🧬😎 and FastQC

• Perform quality correction with Cutadapt

• Summarise quality metrics with MultiQC

• Process single-end and paired-end data

Requirements:
Time estimation: 1 hour 30 minutes
Level: Introductory
Supporting Materials:
Last modification: Sep 1, 2021

# Introduction

During sequencing, the nucleotide bases in a DNA or RNA sample (library) are determined by the sequencer. For each fragment in the library, a short sequence is generated, also called a read, which is simply a succession of nucleotides.

Modern sequencing technologies can generate a massive number of sequence reads in a single experiment. However, no sequencing technology is perfect, and each instrument will generate different types and amounts of errors, such as incorrect nucleotides being called. These wrongly called bases are due to the technical limitations of each sequencing platform.

It is therefore necessary to understand, identify and exclude error types that may impact the interpretation of downstream analyses. Sequence quality control is an essential first step in your analysis: catching errors early saves time later on.

# Inspect a raw sequence file

1. Create a new history for this tutorial and give it a proper name

### Tip: Creating a new history

Click the new-history icon at the top of the history panel.

If the new-history icon is missing:

1. Click on the galaxy-gear icon (History options) on the top of the history panel
2. Select the option Create New from the menu

### Tip: Renaming a history

1. Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
2. Type the new name
3. Press Enter
2. Import the file female_oral2.fastq-4143.gz from Zenodo or from the data library (ask your instructor). This is a microbiome sample from a snake (Jacques et al. 2021).

https://zenodo.org/record/3977236/files/female_oral2.fastq-4143.gz

• Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

• Select Paste/Fetch Data
• Paste the link into the text field

• Press Start

• Close the window

• By default, Galaxy uses the URL as the name, so rename the file to something more useful.

### Tip: Importing data from a data library

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

• Go into Shared data (top panel) then Data libraries
• Navigate to the correct folder as indicated by your instructor
• Select the desired files
• Click on the To History button near the top and select as Datasets from the dropdown menu
• In the pop-up window, select the history you want to import the files to (or create a new one)
• Click on Import

We just imported a file into Galaxy. This file is similar to the data we could get directly from a sequencing facility: a FASTQ file.

### hands_on Hands-on: Inspect the FASTQ file

1. Inspect the file by clicking on the galaxy-eye (eye) icon

Although it looks complicated (and maybe it is), the FASTQ format is easy to understand with a little decoding.

Each read, representing a fragment of the library, is encoded by 4 lines:

| Line | Description |
|------|-------------|
| 1 | Always begins with @, followed by information about the read |
| 2 | The actual nucleic sequence |
| 3 | Always begins with a + and sometimes contains the same information as line 1 |
| 4 | A string of characters representing the quality scores associated with each base of the nucleic sequence; must have the same number of characters as line 2 |

So for example, the first sequence in our file is:

@M00970:337:000000000-BR5KF:1:1102:17745:1557 1:N:0:CGCAGAAC+ACAGAGTT
GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGGFGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-.(4(,,))).,(())))))).)))))))-))-(


It means that the fragment named @M00970 corresponds to the DNA sequence GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA and this sequence has been sequenced with a quality GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGGFGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-.(4(,,))).,(())))))).)))))))-))-(.
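The 4-line layout described above is simple enough to parse by hand. The following is a minimal Python sketch, not how Galaxy reads the file: the names `FastqRecord` and `read_fastq` are made up for illustration, the example read is a shortened toy record, and the parser assumes each sequence sits on a single line.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # header line without the leading '@'
    sequence: str   # nucleotide sequence (line 2)
    quality: str    # ASCII-encoded quality string (line 4)


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of 4 lines from a FASTQ text stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Toy record (hypothetical, much shorter than the real reads in this tutorial)
example = io.StringIO("@read1 some description\nGTGCCAGC\n+\nGGGGGFGG\n")
records = list(read_fastq(example))
print(records[0].name, len(records[0].sequence))
```

For a compressed file such as the one used here, the same function could be fed a handle opened with `gzip.open(path, "rt")` from the Python standard library.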

But what does this quality score mean?

The quality score for each sequence is a string of characters, one for each base of the nucleic sequence, used to characterize the probability of mis-identification of each base. The score is encoded using the ASCII character table (with some historical differences):

So there is an ASCII character associated with each nucleotide, representing its Phred quality score, the probability of an incorrect base call:

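| Phred Quality Score | Probability of incorrect base call | Base call accuracy |
|---------------------|------------------------------------|--------------------|
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10,000 | 99.99% |
| 50 | 1 in 100,000 | 99.999% |
| 60 | 1 in 1,000,000 | 99.9999% |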
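The table follows the relation P = 10^(-Q/10) between a Phred score Q and the probability P of an incorrect base call. As a small sketch, assuming the Phred+33 (Sanger / Illumina 1.8+) encoding and using illustrative function names, a quality character can be decoded like this:

```python
def phred_from_char(char: str, offset: int = 33) -> int:
    """Decode one quality character into its Phred score (Phred+33 assumed)."""
    return ord(char) - offset


def error_probability(phred: int) -> float:
    """Probability that the base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# A few arbitrary characters, chosen only to illustrate the decoding
for char in "+5?I":
    q = phred_from_char(char)
    p = error_probability(q)
    print(f"{char!r}: Phred {q}, error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

Older encodings use a different offset, which is why the `offset` argument is exposed here.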

### question Questions

1. Which ASCII character corresponds to the worst Phred score for Illumina 1.8+?
2. What is the Phred quality score of the 3rd nucleotide of the 1st sequence?
3. What is the accuracy of this 3rd nucleotide?

### solution Solution

1. The worst Phred score is the smallest one, so 0. For Illumina 1.8+, it corresponds to the ! character.
2. The 3rd nucleotide of the 1st sequence has an ASCII character G, which corresponds to a score of 38.
3. The corresponding nucleotide G has an accuracy of almost 99.99%.

### comment Comment

The current Illumina (1.8+) uses Sanger format (Phred+33). If you are working with older datasets, you may encounter older scoring schemes. FastQC, a tool we will use later in this tutorial, can be used to try to determine which type of quality encoding is used (by assessing the range of Phred values seen in the FASTQ).

When looking at the file in Galaxy, it looks like most of the nucleotides have a high score (G corresponding to a score of 38). Is it true for all sequences? And along the full sequence length?

# Assess quality with FASTQE 🧬😎

To take a look at sequence quality along all sequences, we can use FASTQE. It is an open-source tool that provides a simple and fun way to quality control raw sequence data and print them as emoji. You can use it to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

### hands_on Hands-on: Quality check

1. FASTQE Tool: toolshed.g2.bx.psu.edu/repos/iuc/fastqe/fastqe/0.2.6+galaxy2 with the following parameters
• param-files “FastQ data”: female_oral2.fastq-4143.gz
• param-select “Score types to show”: Mean
2. Inspect the generated HTML file

Rather than looking at quality scores for each individual read, FASTQE looks at quality collectively across all reads within a sample and can calculate the mean for each nucleotide position along the length of the reads. Below shows the mean values for this dataset.

You can see the score for each emoji here. The emojis below, with Phred scores less than 20, are the ones we hope we don’t see much.

| Phred Quality Score | ASCII code | Emoji |
|---------------------|------------|-------|
| 0 | ! | 🚫 |
| 1 | | |
| 2 | # | 👺 |
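To make the "mean per position" idea concrete, here is a rough Python sketch of how a mean Phred score could be computed for each position across all reads. It is not FASTQE's own code: the function name is hypothetical, Phred+33 is assumed, and the three quality strings are toy data.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each read position, across reads of any length."""
    totals = []  # sum of Phred scores seen at each position
    counts = []  # number of reads that are long enough to cover each position
    for quality in quality_strings:
        for i, char in enumerate(quality):
            if i == len(totals):  # first read reaching this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# Toy quality strings: 'I' = 40, '5' = 20, '+' = 10 under Phred+33
means = mean_quality_per_position(["II5", "I5", "5++"])
print([round(m, 1) for m in means])
```

FASTQE then maps each mean value to an emoji; that mapping is not reproduced here.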

## Sequence Duplication Levels

The graph shows in blue the percentage of reads of a given sequence in the file which are present a given number of times in the file:

In a diverse library most sequences will occur only once in the final set. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias.

Two sources of duplicate reads can be found:

• PCR duplication in which library fragments have been over-represented due to biased PCR enrichment

It is a concern because PCR duplicates misrepresent the true proportion of sequences in the input.

• Truly over-represented sequences such as very abundant transcripts in an RNA-Seq library or in amplicon data (like this sample)

It is an expected case and not of concern because it does faithfully represent the input.

### details More details about duplication

FastQC counts the degree of duplication for every sequence in a library and creates a plot showing the relative number of sequences with different degrees of duplication. There are two lines on the plot:

• Blue line: distribution of the duplication levels for the full sequence set
• Red line: distribution for the de-duplicated sequences with the proportions of the deduplicated set which come from different duplication levels in the original data.
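The two lines can be illustrated with a small Python sketch. This is a simplification under stated assumptions: it counts exact duplicates over all reads, whereas FastQC works on a sample of the file and groups the duplication levels into bins. The function name and the toy reads are made up for illustration.

```python
from collections import Counter


def duplication_levels(sequences):
    """Return two dicts keyed by duplication level.

    pct_total: % of all reads found at that level (the "blue line")
    pct_dedup: % of distinct sequences found at that level (the "red line")
    """
    seq_counts = Counter(sequences)               # sequence -> times seen
    level_counts = Counter(seq_counts.values())   # level -> number of distinct sequences
    total_reads = sum(seq_counts.values())
    total_distinct = len(seq_counts)
    pct_total = {level: 100 * level * n / total_reads for level, n in level_counts.items()}
    pct_dedup = {level: 100 * n / total_distinct for level, n in level_counts.items()}
    return pct_total, pct_dedup


# Toy data: one sequence seen 3 times, three sequences seen once
reads = ["AAA", "AAA", "AAA", "CCC", "GGG", "TTT"]
pct_total, pct_dedup = duplication_levels(reads)
print(pct_total)  # share of reads per duplication level
print(pct_dedup)  # share of distinct sequences per duplication level
```

In this toy set, the triplicated sequence accounts for half of the reads but only a quarter of the distinct sequences, which is the kind of gap between the blue and red traces discussed in this section.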

For whole genome shotgun data it is expected that nearly 100% of your reads will be unique (appearing only 1 time in the sequence data). Most sequences should fall into the far left of the plot in both the red and blue lines. This indicates a highly diverse library that was not over sequenced. If the sequencing depth is extremely high (e.g. > 100x the size of the genome) some inevitable sequence duplication can appear: there are in theory only a finite number of completely unique sequence reads which can be obtained from any given input DNA sample.

More specific enrichments of subsets, or the presence of low complexity contaminants will tend to produce spikes towards the right of the plot. These high duplication peaks will most often appear in the blue trace as they make up a high proportion of the original library, but usually disappear in the red trace as they make up an insignificant proportion of the deduplicated set. If peaks persist in the red trace then this suggests that there are a large number of different highly duplicated sequences which might indicate either a contaminant set or a very severe technical duplication.

It is usually the case for RNA sequencing, where there are some very highly abundant transcripts and some lowly abundant ones. It is expected that duplicate reads will be observed for high abundance transcripts:

## Over-represented sequences

A normal high-throughput library will contain a diverse set of sequences, with no individual sequence making up more than a tiny fraction of the whole. Finding that a single sequence is very over-represented in the set either means that it is highly biologically significant, or indicates that the library is contaminated, or not as diverse as expected.

FastQC lists all of the sequences which make up more than 0.1% of the total. For each over-represented sequence FastQC will look for matches in a database of common contaminants and will report the best hit it finds. Hits must be at least 20bp in length and have no more than 1 mismatch. Finding a hit doesn’t necessarily mean that this is the source of the contamination, but may point you in the right direction. It’s also worth pointing out that many adapter sequences are very similar to each other so you may get a hit reported which isn’t technically correct, but which has a very similar sequence to the actual match.
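The 0.1% rule can be sketched in a few lines of Python. This only illustrates the threshold: the function name and the toy data are hypothetical, and the sketch counts over every read, without the contaminant lookup or any of FastQC's internal sampling.

```python
from collections import Counter
from itertools import product


def overrepresented(sequences, min_fraction=0.001):
    """List (sequence, count, percentage) for sequences above min_fraction of all reads."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return [
        (seq, n, 100 * n / total)
        for seq, n in counts.most_common()
        if n / total > min_fraction
    ]


# Toy library: 4096 distinct 6-mers seen once each, plus one sequence seen 10 times
background = ["".join(bases) for bases in product("ACGT", repeat=6)]
reads = ["GTGTCAGC"] * 10 + background
hits = overrepresented(reads)
for seq, n, pct in hits:
    print(f"{seq}\t{n}\t{pct:.2f}%")
```

Only the repeated sequence passes the threshold here, since each background sequence makes up far less than 0.1% of the toy library.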

RNA sequencing data may have some transcripts that are so abundant that they register as over-represented sequences. With DNA sequencing data no single sequence should be present at a high enough frequency to be listed, but we can sometimes see a small percentage of adapter reads.

### question Questions

How could we find out what the overrepresented sequences are?

### solution Solution

We can BLAST overrepresented sequences to see what they are. In this case, if we take the top overrepresented sequence

>overrep_seq1
GTGTCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCC


and use blastn against the default Nucleotide (nr/nt) database we don’t get any hits. But if we use VecScreen we see it is the Nextera adapter.

### details More details about other FastQC plots

#### Per base N content

If a sequencer is unable to make a base call with sufficient confidence, it will write an “N” instead of a conventional base call. This plot displays the percentage of base calls at each position or bin for which an N was called.
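As an illustration of what this plot measures, the sketch below computes the percentage of N calls at each position. It is a minimal version under simple assumptions: it reports every position individually (no binning), and the function name and reads are toy examples.

```python
def n_content_per_position(sequences):
    """Percentage of 'N' base calls at each read position."""
    n_counts = []  # number of N calls seen at each position
    totals = []    # number of reads covering each position
    for sequence in sequences:
        for i, base in enumerate(sequence.upper()):
            if i == len(totals):  # first read reaching this position
                n_counts.append(0)
                totals.append(0)
            totals[i] += 1
            if base == "N":
                n_counts[i] += 1
    return [100 * n / total for n, total in zip(n_counts, totals)]


# Toy reads of unequal length with two uncalled bases
n_percent = n_content_per_position(["ACGT", "ANGT", "ACGN", "AC"])
print([round(p, 1) for p in n_percent])
```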

It’s not unusual to see a very low proportion of Ns appearing in a sequence, especially near the end of a sequence. But this curve should never rise noticeably above zero. If it does, this indicates that a problem occurred during the sequencing run. In the example below, an error caused the instrument to be unable to call a base for approximately 20% of the reads at position 29:

#### Kmer Content

This plot is not output by default. As stated in the tool form, if you want this module it needs to be enabled using a custom Submodule and limits file. With this module, FastQC does a generic analysis of all of the short nucleotide sequences of length k (kmer, with k = 7 by default) starting at each position along the read in the library to find those which do not have an even coverage through the length of your reads. Any given kmer should be evenly represented across the length of the read.

FastQC will report the list of kmers which appear at specific positions with a greater frequency than expected. This can be due to different sources of bias in the library, including the presence of read-through adapter sequences building up on the end of the sequences. The presence of any overrepresented sequences in the library (such as adapter dimers) causes the kmer plot to be dominated by the kmer from these sequences. Any biased kmer due to other interesting biases may be then diluted and not easy to see.

The following example is from a high-quality DNA-Seq library. The biased kmers near the start of the read are likely due to slight sequence-dependent efficiency of DNA shearing or a result of random priming:

This module can be very difficult to interpret. The adapter content plot and overrepresented sequences table are easier to interpret and may give you enough information without needing this plot. RNA-seq libraries may have highly represented kmers that are derived from highly expressed sequences. To learn more about this plot, please check the FastQC Kmer Content documentation.

We tried to explain here the different FastQC reports and some use cases. More about this and also some common next-generation sequencing problems can be found on QCFAIL.com

### details Specific problem for alternate library types

#### Small/micro RNA

In small RNA libraries, we typically have a relatively small set of unique, short sequences. Small RNA libraries are not randomly sheared before adding sequencing adapters to their ends: all the reads for specific classes of microRNAs will be identical. It will result in:

• Extremely biased per base sequence content
• Extremely narrow distribution of GC content
• Very high sequence duplication levels
• Abundance of overrepresented sequences

#### Amplicon

Amplicon libraries are prepared by PCR amplification of a specific target, for example the V4 hypervariable region of the bacterial 16S rRNA gene. All reads from this type of library are expected to be nearly identical. It will result in:

• Extremely biased per base sequence content
• Extremely narrow distribution of GC content
• Very high sequence duplication levels
• Abundance of overrepresented sequences

#### Bisulfite or Methylation sequencing

With Bisulfite or methylation sequencing, the majority of the cytosine (C) bases are converted to thymine (T). It will result in:

• Biased per base sequence content
• Biased per sequence GC content

Any library type may contain a very small percentage of adapter dimer (i.e. no insert) fragments. They are more likely to be found in amplicon libraries constructed entirely by PCR (by formation of PCR primer-dimers) than in DNA-Seq or RNA-Seq libraries constructed by adapter ligation. If a sufficient fraction of the library is adapter dimer it will become noticeable in the FastQC report:

• Drop in per base sequence quality after base 60
• Possible bi-modal distribution of per sequence quality scores
• Distinct pattern observed in per base sequence content up to base 60
• Spike in per sequence GC content
• Adapter content > 0% starting at base 1

If the quality of the reads is not good, we should always first check what is wrong and think about it: it may come from the type of sequencing or what we sequenced (high quantity of overrepresented sequences in transcriptomics data, biased percentage of bases in HiC data).

You can also ask the sequencing facility about it, especially if the quality is really bad: the quality treatments cannot solve everything. If too many bad quality bases are cut away, the corresponding reads will then be filtered out and you lose them.

# Trim and filter

The quality drops in the middle of these sequences. These potentially incorrectly called nucleotides could bias downstream analyses, so the sequences must be treated to reduce that bias. Trimming can also help to increase the number of reads the aligner or assembler is able to successfully use, reducing the number of reads that are unmapped or unassembled. In general, quality treatments include:

1. Trimming of sequences
• from low quality score regions
• beginning/end of sequence
2. Filtering of sequences
• with low mean quality score
• too short
• with too many ambiguous (N) bases

To accomplish this task we will use Cutadapt (Marcel 2011), a tool that enhances sequence quality by automating adapter trimming as well as quality control. We will:

• Trim low-quality bases from the ends. Quality trimming is done before any adapter trimming. We will set the quality threshold as 20, a commonly used threshold, see more here.
• Trim the adapter with Cutadapt. For that we need to supply the sequence of the adapter. In this sample, Nextera is the adapter that was detected. We can find the sequence of the Nextera adapter on the Illumina website here: CTGTCTCTTATACACATCT. We will trim that sequence from the 3’ end of the reads.
• Filter out sequences with length < 20 after trimming
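To give a feel for the adapter-trimming and length-filtering steps, here is a deliberately simplified Python sketch. It is not Cutadapt's algorithm: it only finds exact matches of the adapter (or of an adapter prefix at the 3’ end of the read), whereas Cutadapt uses an error-tolerant alignment, and it leaves out quality trimming. The function names, the `min_overlap` value and the toy reads are assumptions made for illustration; only the adapter sequence and the minimum length of 20 come from this tutorial.

```python
ADAPTER = "CTGTCTCTTATACACATCT"  # Nextera adapter sequence used in this tutorial


def adapter_cut_position(sequence, adapter, min_overlap=3):
    """Index at which to cut the read; len(sequence) if no adapter is found."""
    # Case 1: the complete adapter occurs somewhere in the read
    position = sequence.find(adapter)
    if position != -1:
        return position
    # Case 2: the read ends with the beginning of the adapter (partial adapter)
    longest = min(len(adapter) - 1, len(sequence))
    for length in range(longest, min_overlap - 1, -1):
        if sequence.endswith(adapter[:length]):
            return len(sequence) - length
    return len(sequence)


def trim_and_filter(records, adapter, min_length=20):
    """Trim the 3' adapter from (name, sequence, quality) tuples, drop short reads."""
    for name, sequence, quality in records:
        cut = adapter_cut_position(sequence, adapter)
        sequence, quality = sequence[:cut], quality[:cut]  # keep both in sync
        if len(sequence) >= min_length:
            yield name, sequence, quality


def make_record(name, sequence):
    """Toy helper: attach a constant quality string of matching length."""
    return name, sequence, "I" * len(sequence)


records = [
    make_record("r1", "ACGT" * 6 + ADAPTER + "AAAA"),  # full adapter inside the read
    make_record("r2", "ACGT" * 2 + ADAPTER),           # too short once trimmed
    make_record("r3", "ACGT" * 5 + "AC" + "CTGTC"),    # adapter prefix at the 3' end
]
kept = list(trim_and_filter(records, ADAPTER))
print([(name, len(sequence)) for name, sequence, _ in kept])
```

The sequence and quality strings are always cut at the same position, because line 4 of a FASTQ record must stay the same length as line 2.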

### hands_on Hands-on: Improvement of sequence quality

1. Cutadapt with the following parameters
• “Single-end or Paired-end reads?”: Single-end
• param-file “Reads in FASTQ format”: female_oral2.fastq-4143.gz (Input dataset)

### tip Tip: Files not selectable?

If your FASTQ file cannot be selected, you might check whether the format is FASTQ with Sanger-scaled quality values (fastqsanger.gz). You can edit the data type by clicking on the pencil symbol.

• “Source”: Enter custom sequence
• “Enter custom 3’ adapter sequence”: CTGTCTCTTATACACATCT
• In “Filter Options”
• “Minimum length”: 20
• “Quality cutoff”: 20
• param-select “Outputs selector”: Report
2. Inspect the generated txt file (Report)

### question Questions

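1. What % of reads contain the adapter?
2. What % of reads have been trimmed because of bad quality?
3. What % of reads have been removed because they were too short?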

### solution Solution

1. 58.6% of reads contain the adapter (Reads with adapters:)
2. 35.1% of reads have been trimmed because of bad quality (Quality-trimmed:)
3. 0% of reads were removed because they were too short

One of the biggest advantages of Cutadapt compared to other trimming tools (e.g. TrimGalore!) is that it has good documentation explaining how the tool works in detail.

Cutadapt's quality trimming algorithm consists of three simple steps:

1. Subtract the chosen threshold value from the quality value of each position
2. Compute a partial sum of these differences from the end of the sequence to each position (as long as the partial sum is negative)
3. Cut at the minimum value of the partial sum

In the following example, we assume that the 3’ end is to be quality-trimmed with a threshold of 10 and we have the following quality values

42 40 26 27 8 7 11 4 2 3

1. Subtract the threshold

 32 30 16 17 -2 -3 1 -6 -8 -7

2. Add up the numbers, starting from the 3’ end (partial sums) and stop early if the sum is greater than zero

 (70) (38) 8 -8 -25 -23 -20 -21 -15 -7


The numbers in parentheses are not computed (because 8 is greater than zero), but shown here for completeness.

3. Choose the position of the minimum (-25) as the trimming position

Therefore, the read is trimmed to the first four bases, which have quality values

42 40 26 27


Note that positions with a quality value larger than the chosen threshold are therefore also removed if they are embedded in regions of lower quality (the partial sum keeps decreasing as long as the quality values are smaller than the threshold). The advantage of this procedure is that it is robust against a small number of positions with a quality higher than the threshold.
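The three steps can be written as a short Python function. This is a sketch that follows the description above for the 3’ end only, with a hypothetical function name; it is not a copy of Cutadapt's source code.

```python
def quality_trim_index(qualities, threshold):
    """Number of bases to keep from the 5' end after 3' quality trimming."""
    partial_sum = 0
    min_sum = 0
    cut = len(qualities)  # default: keep the whole read
    # Walk from the 3' end towards the 5' end
    for i in range(len(qualities) - 1, -1, -1):
        partial_sum += qualities[i] - threshold  # step 1 and step 2 combined
        if partial_sum > 0:                      # stop early once the sum is positive
            break
        if partial_sum < min_sum:                # step 3: remember the minimum
            min_sum = partial_sum
            cut = i
    return cut


qualities = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
keep = quality_trim_index(qualities, threshold=10)
print(qualities[:keep])  # [42, 40, 26, 27]
```

Running it on the example reproduces the result above: the minimum partial sum (-25) is reached at the fifth base, so the first four bases are kept, and the base with quality 11 is removed along with its low-quality neighbours.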

Alternatives to this procedure would be:

• Cut after the first position with a quality smaller than the threshold
• Sliding window approach

The sliding window approach checks that the average quality of each sequence window of specified length is larger than the threshold. Note that in contrast to Cutadapt’s approach, this approach has one more parameter, and its robustness depends on the length of the window (in combination with the quality threshold). Both approaches are implemented in Trimmomatic.

We can examine our trimmed data with FASTQE and/or FastQC.

### hands_on Hands-on: Checking quality after trimming

1. FASTQE Tool: toolshed.g2.bx.psu.edu/repos/iuc/fastqe/fastqe/0.2.6+galaxy2 : Re-run FASTQE with the following parameters
• param-files “FastQ data”: Cutadapt Read 1 Output
• param-select “Score types to show”: Mean
2. Inspect the new FASTQE report

### question Questions

Compare the FASTQE output to the previous one before trimming above. Has sequence quality been improved?

### Tip: Using the Scratchbook to view multiple datasets

If you would like to view two or more datasets at once, you can use the Scratchbook feature in Galaxy:

1. Click on the Scratchbook icon galaxy-scratchbook on the top menu bar.
• You should see a little checkmark on the icon now
2. View a dataset by clicking on the eye icon (galaxy-eye)
• You should see the output in a window overlaid on Galaxy
• You can resize this window by dragging the bottom-right corner
3. Click outside the file to exit the Scratchbook
4. View (galaxy-eye) a second dataset from your history
• You should now see a second window with the new dataset
• This makes it easier to compare the two outputs
5. Repeat this for as many files as you would like to compare
6. You can turn off the Scratchbook galaxy-scratchbook by clicking on the icon again

### solution Solution

Yes, the quality score emojis look better (happier) now.

With FASTQE we can see we improved the quality of the bases in the dataset.

We can also, or instead, check the quality-controlled data with FastQC.

### hands_on Hands-on: Checking quality after trimming

1. FASTQC Tool: toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.72+galaxy1 with the following parameters
• param-files “Short read data from your current history”: Cutadapt Read 1 Output
2. Inspect the generated HTML file

### question Questions

1. Does the per base sequence quality look better?
2. Is the adapter gone?

### solution Solution

1. Yes. The vast majority of the bases have a quality score above 20 now.
2. Yes. No adapter is detected now.

With FastQC we can see we improved the quality of the bases in the dataset and removed the adapter.

### details Other FastQC plots after trimming

We have some red stripes as we’ve trimmed those regions from the reads.

We now have one peak of high quality instead of one high and one lower quality that we had previously.

We don’t have equal representation of the bases as before as this is amplicon data.

We now have a single main GC peak due to removing the adapter.

This is the same as before as we don’t have any Ns in these reads.

We now have multiple peaks and a range of lengths, instead of the single peak we had before trimming when all sequences were the same length.

### question Questions

What does the top overrepresented sequence GTGTCAGCCGCCGCGGTAGTCCGACGTGG correspond to?

### solution Solution

If we take the top overrepresented sequence

>overrep_seq1_after
GTGTCAGCCGCCGCGGTAGTCCGACGTGG


and use blastn against the default Nucleotide (nr/nt) database we see the top hits are to 16S rRNA genes. This makes sense as this is 16S amplicon data, where the 16S gene is PCR amplified.

# Processing multiple datasets

## Process paired-end data

With paired-end sequencing, the fragments are sequenced from both ends. This approach results in two reads per fragment, with the first read in forward orientation and the second read in reverse-complement orientation. With this technique, we get more information about each DNA fragment than with single-end sequencing:

    ------>                       [single-end]

----------------------------- [fragment]

------>               <------ [paired-end]


The distance between both reads is known and therefore is additional information that can improve read mapping.

Paired-end sequencing generates 2 FASTQ files:

• One file with the sequences corresponding to forward orientation of all the fragments
• One file with the sequences corresponding to reverse orientation of all the fragments

Usually we recognize these two files which belong to one sample by the name which has the same identifier for the reads but a different extension, e.g. sampleA_R1.fastq for the forward reads and sampleA_R2.fastq for the reverse reads. It can also be _f or _1 for the forward reads and _r or _2 for the reverse reads.

The data we analyzed in the previous step was single-end data, so we will import a paired-end RNA-seq dataset to use. We will run FastQC and aggregate the two reports with MultiQC (Ewels et al. 2016).

### hands_on Hands-on: Assessing the quality of paired-end reads

1. Import the paired-end reads GSM461178_untreat_paired_subset_1.fastq and GSM461178_untreat_paired_subset_2.fastq from Zenodo or from the data library (ask your instructor)

https://zenodo.org/record/61771/files/GSM461178_untreat_paired_subset_1.fastq
https://zenodo.org/record/61771/files/GSM461178_untreat_paired_subset_2.fastq

2. FASTQC Tool: toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.72+galaxy1 with both datasets

### Tip: Select multiple datasets

1. Click on param-files Multiple datasets
2. Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest
3. MultiQC Tool: toolshed.g2.bx.psu.edu/repos/iuc/multiqc/multiqc/1.9+galaxy1 with the following parameters to aggregate the FastQC reports of both forward and reverse reads
• In “Results”
• “Which tool was used generate logs?”: FastQC
• In “FastQC output”
• “Type of FastQC output?”: Raw data
• param-files “FastQC output”: Raw data files (outputs of both FastQC runs)
4. Inspect the webpage output from MultiQC

### question Questions

1. What do you think about the quality of the sequences?
2. What should we do?

### solution Solution

1. The quality of the sequences seems worse for the reverse reads than for the forward reads:
• Per Sequence Quality Scores: distribution more on the left, i.e. a lower mean quality of the sequences
• Per base sequence quality: less smooth curve and stronger decrease at the end with a mean value below 28
• Per Base Sequence Content: stronger bias at the beginning and no clear distinction between C-G and A-T groups

The other indicators (adapters, duplication levels, etc) are similar.

2. We should trim the end of the sequences and filter them with Cutadapt

With paired-end reads the average quality scores for forward reads will almost always be higher than for reverse reads.

After trimming, some reverse reads will be shorter because of their lower quality and may then be eliminated during the filtering step. If a reverse read is removed, its corresponding forward read should be removed too. Otherwise we will get a different number of reads in the two files, and in a different order, and order is important for the next steps. Therefore it is important to treat the forward and reverse reads together for trimming and filtering.
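The idea of treating both reads of a pair together can be sketched as follows. This is a toy Python illustration with hypothetical names and data, assuming the two lists are already trimmed and in the same order; it shows only the pairing logic, not Cutadapt's implementation.

```python
def filter_pairs(forward_reads, reverse_reads, min_length=20):
    """Keep a pair only if BOTH reads are at least min_length long.

    Each read is a (name, sequence) tuple; the two lists must be in the same order.
    """
    if len(forward_reads) != len(reverse_reads):
        raise ValueError("Forward and reverse files must contain the same number of reads")
    kept_forward, kept_reverse = [], []
    for forward, reverse in zip(forward_reads, reverse_reads):
        if len(forward[1]) >= min_length and len(reverse[1]) >= min_length:
            kept_forward.append(forward)
            kept_reverse.append(reverse)
    return kept_forward, kept_reverse


# Toy reads after trimming: pair2 has a short reverse read, pair3 a short forward read
forward = [("pair1", "A" * 30), ("pair2", "C" * 30), ("pair3", "G" * 12)]
reverse = [("pair1", "T" * 25), ("pair2", "G" * 10), ("pair3", "C" * 28)]
kept_forward, kept_reverse = filter_pairs(forward, reverse)
print([name for name, _ in kept_forward], [name for name, _ in kept_reverse])
```

Because a pair is dropped as soon as either read is too short, the two outputs always contain the same reads in the same order.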

### hands_on Hands-on: Improving the quality of paired-end data

1. Cutadapt with the following parameters
• “Single-end or Paired-end reads?”: Paired-end
• param-file “FASTQ/A file #1”: GSM461178_untreat_paired_subset_1.fastq (Input dataset)
• param-file “FASTQ/A file #2”: GSM461178_untreat_paired_subset_2.fastq (Input dataset)

The order is important here!

No adapters were found in these datasets. When you process your own data and you know which adapter sequences were used during library preparation, you should provide their sequences here.

• In “Filter Options”
• “Minimum length”: 20
• “Quality cutoff”: 20
• In “Output Options”
• “Report”: Yes
2. Inspect the generated txt file (Report)

### question Questions

1. How many base pairs have been removed from the reads because of bad quality?
2. How many sequence pairs have been removed because they were too short?

### solution Solution

1. 44,164 bp (Quality-trimmed:) for the forward reads and 138,638 bp for the reverse reads.
2. 1,376 sequences have been removed because at least one read was shorter than the length cutoff (322 when only the forward reads were analyzed).

These datasets can be used for the downstream analysis, e.g. mapping.

### question Questions

1. What kind of alignment is used for finding adapters in reads?
2. What is the criterion to choose the best adapter alignment?

### solution Solution

1. Semi-global alignment, i.e., only the overlapping part of the read and the adapter sequence is used for scoring.
2. An alignment with maximum overlap is computed that has the smallest number of mismatches and indels.

# Conclusion

In this tutorial we checked the quality of FASTQ files to ensure that their data looks good before inferring any further information. This step is the usual first step for analyses such as RNA-Seq, ChIP-Seq, or any other OMIC analysis relying on NGS data. Quality control steps are similar for any type of sequencing data:

• Quality assessment with tools like FASTQE and FastQC
• Trimming and filtering with a tool like Cutadapt

### Key points

• Perform quality control on every dataset before running any other bioinformatics analysis

• Assess the quality metrics and improve quality if necessary

• Check the impact of the quality control

• Different tools are available to provide additional quality metrics

Have questions about this tutorial? Check out the FAQ page for the Sequence analysis topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

# Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

# References

2. Ewels, P., M. Magnusson, S. Lundin, and M. Käller, 2016 MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32: 3047–3048. https://academic.oup.com/bioinformatics/article/32/19/3047/2196507
3. Jacques, R. M. S., W. M. Maza, S. D. Robertson, A. Lonsdale, C. S. Murray et al., 2021 A Fun Introductory Command Line Lesson: Next Generation Sequencing Quality Analysis with Emoji! CourseSource 8: 10.24918/cs.2021.17

# Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.

# Citing this Tutorial

1. Bérénice Batut, Maria Doyle, 2021 Quality Control (Galaxy Training Materials). https://training.galaxyproject.org/archive/2021-10-01/topics/sequence-analysis/tutorials/quality-control/tutorial.html Online; accessed TODAY
2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

### details BibTeX

@misc{sequence-analysis-quality-control,
author = "Bérénice Batut and Maria Doyle",
title = "Quality Control (Galaxy Training Materials)",
year = "2021",
month = "09",
day = "01",
url = "\url{https://training.galaxyproject.org/archive/2021-10-01/topics/sequence-analysis/tutorials/quality-control/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
doi = {10.1016/j.cels.2018.05.012},
url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
year = 2018,
month = {jun},
publisher = {Elsevier {BV}},
volume = {6},
number = {6},
pages = {752--758.e1},
author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
title = {Community-Driven Data Analysis Training for Biology},
journal = {Cell Systems}
}