### Overview

Questions:
• How to mask repeats in a genome?

• What is the difference between hard and soft masking?

Time estimation: 1 hour
Level: Introductory
Last modification: Feb 25, 2022

# Introduction

When you assemble a new genome, you get its full sequence in FASTA format, in the form of contigs, scaffolds, or even whole chromosomes if you are lucky. However, genomes, in particular those of eukaryotic organisms, contain a varying but significant proportion of repeated elements throughout their sequence. These elements belong to different classes, including:

• Tandem repeats: small sequences (<60 base pairs) repeated next to each other, found in many places in the genome, in particular centromeres and telomeres
• Interspersed repeats: sequences repeated in distant positions, including transposons, Short Interspersed Nuclear Elements (SINEs) or Long Interspersed Nuclear Elements (LINEs)
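To make the idea of a tandem repeat concrete, here is a toy sketch that finds short units repeated back to back in a DNA string. The unit length (2 to 6 bases) and the minimum of three copies are arbitrary choices for this illustration, and the function name is hypothetical. Real repeat finders use far more sensitive methods and tolerate mismatches.

```python
import re

# Toy pattern: a unit of 2 to 6 bases followed by at least two more exact copies.
# Both thresholds are assumptions made for this example only.
TANDEM = re.compile(r"([ACGT]{2,6}?)\1{2,}")


def find_tandem_repeats(seq):
    """Return (start, end, unit, copies) for each exact tandem repeat in seq."""
    hits = []
    for match in TANDEM.finditer(seq.upper()):
        unit = match.group(1)
        copies = (match.end() - match.start()) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits


example = "GATTACA" + "CAG" * 5 + "TTGACC"
print(find_tandem_repeats(example))  # [(7, 22, 'CAG', 5)]
```

Coordinates are 0-based and half-open, so the repeat above covers positions 7 to 21 inclusive.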

These repeats are interesting in their own right: they can originate from transposons or viral insertions, and they can have direct effects on the expression of genes. But they are also the source of a lot of trouble when you work with genomics data. First, when sequencing a genome, assembly tools often have problems reconstructing the genome sequence in regions containing repeats (in particular when repeats are longer than the read size). Then, once you have a good assembly, you want to annotate it to find the location of genes. Unfortunately, annotation tools have trouble identifying gene locations in repeat-rich regions.

The aim of repeat masking is to identify the location of all repeated elements along a genome sequence. Other tools (like annotation pipelines) can then take this information into account when producing their results.

The output of repeat masking tools most often consists of a FASTA file (sometimes accompanied by a GFF file containing the position of each repeat). There are two types of masking, which produce slightly different FASTA output:

• Soft masking: repeat elements are written in lower case
• Hard masking: repeat elements are replaced by stretches of the letter N

Normal (non-repeated) sequences are always kept in uppercase. Hard masking is destructive because large parts of the sequence are lost, replaced by stretches of N. If you want to perform an annotation, it is best to choose soft masking.
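The difference between the two masking types can be illustrated with a minimal sketch. It assumes the repeat positions are already known and given as 0-based, half-open intervals. The function name `apply_mask` and the example sequence are hypothetical and not part of any masking tool.

```python
def apply_mask(seq, repeats, mode="soft"):
    """Mask the given (start, end) intervals of seq.

    mode="soft" lowercases the repeats, mode="hard" replaces them with N.
    """
    if mode not in ("soft", "hard"):
        raise ValueError("mode must be 'soft' or 'hard'")
    out = list(seq.upper())  # non-repeated bases stay uppercase
    for start, end in repeats:
        for i in range(start, min(end, len(out))):
            out[i] = out[i].lower() if mode == "soft" else "N"
    return "".join(out)


seq = "ATGCGTACACACACACGGCTA"
repeats = [(6, 16)]  # assumed position of an AC repeat, for illustration

print(apply_mask(seq, repeats, "soft"))  # ATGCGTacacacacacGGCTA
print(apply_mask(seq, repeats, "hard"))  # ATGCGTNNNNNNNNNNGGCTA
```

The soft-masked string can always be turned into the hard-masked one, but not the other way around, which is why hard masking is described as destructive.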

We call this operation “masking” because, by making repeats lowercase or replacing them with Ns, you make them “invisible” to annotation tools (which are written to mostly ignore regions marked in this way).

Multiple tools exist to perform masking, including RepeatMasker, RepeatModeler, and REPET. Each has its own specificities: some can be trained on specific genomes, while others rely on existing databases of repeated element signatures (Dfam, RepBase).

In this tutorial you will learn how to soft mask the genome sequence of a small eukaryote: Mucor mucedo (a fungal plant pathogen). You can learn how this genome sequence was assembled by following the Flye assembly tutorial. We will use RepeatMasker, which is probably the simplest solution giving an acceptable result before annotating the genome in the Funannotate annotation tutorial.

# Get data

1. Create a new history for this tutorial
2. Import the files from Zenodo or from the shared data library (GTN - Material -> genome-annotation -> Masking repeats with RepeatMasker):

https://zenodo.org/api/files/71333591-99bd-4d99-bbdf-664cc18fd422/genome_raw.fasta

• Open the Galaxy Upload Manager (the upload icon at the top-right of the tool panel)

• Select Paste/Fetch Data
• Paste the link into the text field

• Press Start

• Close the window

### Tip: Importing data from a data library

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

• Go into Shared data (top panel) then Data libraries
• Navigate to the correct folder as indicated by your instructor
• Select the desired files
• Click on the To History button near the top and select as Datasets from the dropdown menu
• In the pop-up window, select the history you want to import the files to (or create a new one)
• Click on Import

Let’s run RepeatMasker, selecting the input assembly in FASTA format. We select the soft masking option and choose to use the Dfam database.

### Comment: Choosing the right species

We select the Human (Homo sapiens) species here, even though we are masking a fungal genome. This means RepeatMasker will identify very common repeats found in many organisms. For more precise results, you can consider selecting a species closer to the one you are analysing in the drop-down list, or using more advanced tools like RepeatModeler.

### Hands-on

Run RepeatMasker with the following parameters:

• “Genomic DNA”: genome_raw.fasta (Input dataset)
• “Repeat library source”: DFam (curated only, bundled with RepeatMasker)
• “Select species name from a list?”: Yes
• “Species”: Human (Homo sapiens)
• “Perform softmasking instead of hardmasking”: Yes
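For readers working outside Galaxy, the parameters above correspond roughly to a RepeatMasker command line. The sketch below only builds and prints that command. It assumes a local RepeatMasker installation, and that the Galaxy species choice maps to `-species human`. The Galaxy tool may invoke RepeatMasker differently. The helper name `repeatmasker_command` is hypothetical.

```python
import shlex


def repeatmasker_command(fasta, species="human", soft=True, threads=1, gff=False):
    """Build a RepeatMasker command line as a list of arguments."""
    cmd = ["RepeatMasker", "-species", species, "-pa", str(threads)]
    if soft:
        cmd.append("-xsmall")  # write repeats in lowercase instead of N
    if gff:
        cmd.append("-gff")  # also write repeat positions as GFF
    cmd.append(fasta)
    return cmd


cmd = repeatmasker_command("genome_raw.fasta", species="human", soft=True)
print(shlex.join(cmd))  # RepeatMasker -species human -pa 1 -xsmall genome_raw.fasta

# To actually run it (requires RepeatMasker on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Leaving out `-xsmall` makes RepeatMasker hard mask the sequence with N, which matches the "No" setting of the softmasking option.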

RepeatMasker produces four output files:

• masked sequence: this is the FASTA file that you will use for future analysis. If you display it, you will notice that some portions of the sequence are in lowercase: these are the regions that were identified as repeats.
• repeat statistics: this one contains some statistics on the number of repeats found in each category, and the total number of base pairs masked.
• output log: this is a tabular file listing all repeats.
• repeat catalogue: this one contains the list of all repeat sequences that were identified, with their position, and their similarity with known repeats from the Dfam database.
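Because soft masking encodes repeats as lowercase, the masked regions can be recovered from the masked sequence alone. The following sketch, with hypothetical helper names and a made-up demo input, reads FASTA lines and prints each lowercase run as a 0-based, half-open interval (BED-like). It does not reproduce the repeat names or classes that RepeatMasker reports in its other outputs.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # keep only the first word of the header as the sequence name
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def masked_intervals(seq):
    """Return (start, end) for every run of lowercase letters in seq."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]


demo = [">contig_1 demo", "ACGTacgtacGT", "TTTTccAA", ">contig_2", "GGGG"]
for name, seq in read_fasta(demo):
    for start, end in masked_intervals(seq):
        print(name, start, end, sep="\t")
```

To use it on a real file, pass an open file handle to `read_fasta` instead of the `demo` list.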

### Question

What proportion of the whole genome sequence is masked?

### Solution

You should find it in the repeat statistics output. It should be ~2.41%.
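As a rough cross-check, the masked proportion can also be estimated from the soft-masked FASTA itself by counting lowercase bases. This is only a sketch: the function name and demo input are hypothetical, and the result may differ slightly from the repeat statistics output, for example if the input already contained lowercase letters.

```python
def masked_fraction(fasta_lines):
    """Return the fraction of bases written in lowercase in a FASTA file."""
    masked = total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # skip header lines
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


demo = [">contig_1", "ACGTacgtAC", "GGGGGGGGGG"]
print(f"{masked_fraction(demo):.2%}")  # 20.00%
```

On a downloaded copy of the masked sequence, the same function can be called with an open file handle, for example `with open(path) as handle: masked_fraction(handle)`, where `path` is wherever you saved the file.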

As we used a generic species (Human), we only identified the most common repeats, not those specific to the species being analysed. Other tools might mask a greater proportion of the genome, at the cost of a more complex workflow with training steps. But this result is sufficient to perform an annotation by following the Funannotate annotation tutorial.

# Conclusion

By following this tutorial you have learned how to mask a eukaryotic genome using RepeatMasker, after assembling it (Flye assembly tutorial) and before annotating it (Funannotate annotation tutorial).

Annotation tools often prefer soft-masked genomes, as they primarily search for genes in non-repeated regions but tolerate genes that partially overlap repeated regions.

### Key points

• Repeat masking is an essential first step before running structural annotation pipelines

# Glossary

LINEs
Long Interspersed Nuclear Elements
SINEs
Short Interspersed Nuclear Elements

# Citing this Tutorial

1. Anthony Bretaudeau, Alexandre Cormier, Laura Leroi, Erwan Corre, Stéphanie Robin, Erasmus+ Programme, 2022 Masking repeats with RepeatMasker (Galaxy Training Materials). https://training.galaxyproject.org/archive/2022-04-01/topics/genome-annotation/tutorials/repeatmasker/tutorial.html Online; accessed TODAY
2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

### BibTeX

@misc{genome-annotation-repeatmasker,
author = "Anthony Bretaudeau and Alexandre Cormier and Laura Leroi and Erwan Corre and Stéphanie Robin and Erasmus+ Programme",
year = "2022",
month = "02",
day = "25",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
doi = {10.1016/j.cels.2018.05.012},
url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
year = 2018,
month = {jun},
publisher = {Elsevier {BV}},
volume = {6},
number = {6},
pages = {752--758.e1},
author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
title = {Community-Driven Data Analysis Training for Biology},
journal = {Cell Systems}
}