Masking repeats with RepeatMasker

Overview
Questions:

How to mask repeats in a genome?

What is the difference between hard and soft-masking?

Objectives:

Use Red and RepeatMasker to soft-mask a newly assembled genome

Requirements:

Introduction to Galaxy Analyses

Time estimation: 1 hour

Level: Introductory

Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

UseGalaxy.org (Main) ✅ ⭐️

UseGalaxy.org.au ✅ ⭐️

UseGalaxy.cz ✅

UseGalaxy.fr ✅

Published: Nov 29, 2021

Last modification: Jan 8, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00178

Revision: 39

When you assemble a new genome, you get its full sequence in FASTA format, in the form of contigs, scaffolds, or even whole chromosomes if you are lucky. However, genomes, in particular those of eukaryotic organisms, contain a varying but significant proportion of repeated elements all along the sequence. These elements belong to different classes, including:

Tandem repeats: small sequences (<60 base pairs) repeated next to each other, found in many places in the genome, in particular centromeres and telomeres
Interspersed repeats: sequences repeated in distant positions, including transposons, Short Interspersed Nuclear Elements (SINEs) or Long Interspersed Nuclear Elements (LINEs)

These repeats are interesting on their own: they can originate from transposons or viral insertions, and they can have direct effects on the expression of genes. But they are also the source of a lot of trouble when you work with genomic data. First, when sequencing a genome, assembly tools often have problems reconstructing the sequence in regions containing repeats (in particular when repeats are longer than the reads). Then, once you have a good assembly, you want to annotate it to find the location of genes. Unfortunately, many annotation tools have trouble identifying gene locations in repeat-rich regions.

The aim of repeat masking is to identify the location of all repeated elements along a genome sequence. Other tools (like annotation pipelines) can then take this information into account when producing their results.

The output of repeat masking tools is most often a FASTA file (sometimes accompanied by a BED or GFF file containing the position of each repeat). There are two types of masking, producing slightly different FASTA output:

Soft-masking: repeat elements are written in lower case
Hard-masking: repeat elements are replaced by stretches of the letter N

Normal (non-repeated) sequences are always kept in uppercase. Doing hard-masking is destructive because you lose large parts of the sequence which are replaced by stretches of N. If you want to perform an annotation, it is best to choose soft-masking.

We call this operation “masking” because, by making repeats lowercase or replacing them with Ns, you make them “invisible” to annotation tools (which are written to mostly ignore regions marked this way).
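To make the difference between the two masking styles concrete, here is a minimal Python sketch. It is not part of the Galaxy workflow: the function names and the toy sequence are made up for illustration. It shows that a soft-masked sequence can always be converted to a hard-masked one (or back to plain uppercase), while the reverse is impossible once the bases have been replaced by N.

```python
def hard_mask(soft_masked: str) -> str:
    """Replace every lowercase (soft-masked) base with N."""
    return "".join("N" if base.islower() else base for base in soft_masked)


def unmask(soft_masked: str) -> str:
    """Recover the original sequence: soft-masking only changes the case."""
    return soft_masked.upper()


# Toy sequence (hypothetical): the lowercase stretch stands for a repeat.
toy = "ACGTacgtacgtTTGACC"

print(hard_mask(toy))  # ACGTNNNNNNNNTTGACC
print(unmask(toy))     # ACGTACGTACGTTTGACC
```

No function could turn the hard-masked string back into the original sequence, which is why soft-masking is the safer choice before annotation.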

Multiple tools exist to perform the masking: RepeatMasker, RepeatModeler, REPET, … Each one has its specificities: some can be trained on specific genomes, while others rely on existing databases of repeated element signatures (Dfam, RepBase).

In this tutorial you will learn how to soft-mask the genome sequence of a small eukaryote: Mucor mucedo (a fungal plant pathogen). You can learn how this genome sequence was assembled by following the Flye assembly tutorial. We will use two different tools, Red and RepeatMasker, which are probably two of the simplest solutions giving an acceptable result before annotating the genome in the Funannotate annotation tutorial.

Agenda

In this tutorial, we will cover:

Get data

Soft-masking using Red

Soft-masking using RepeatMasker

Conclusion

Get data

Hands-on: Data upload
Create a new history for this tutorial
Import the files from Zenodo or from the shared data library (GTN - Material -> genome-annotation -> Masking repeats with RepeatMasker):
https://zenodo.org/record/7085837/files/genome_raw.fasta
https://zenodo.org/record/7085837/files/Muco_library_RM2.fasta
https://zenodo.org/record/7085837/files/Muco_library_EDTA.fasta
Copy the link location

Click Upload Data at the top of the tool panel

Select Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Shared data (top panel) then Data libraries

Navigate to the correct folder as indicated by your instructor.

On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import

Soft-masking using Red

First let’s try Red, a tool that can mask repeats de novo. It only needs the input assembly in FASTA format.

Comment: *Ab initio* tool

Red is an ab initio tool: it tries to predict repeat elements using only the genomic sequence. This makes it a good fit when you know nothing about the organism you are working on.

Hands-on

Red (Galaxy version 2018.09.10+galaxy1) with the following parameters:

“Genome sequence to mask”: genome_raw.fasta (Input dataset)

Red produces 2 output files:

A FASTA file: this is the soft-masked genome that you can use for future analysis. If you display it, you will notice that some portions of the sequence are in lowercase: these are the regions that were identified as repeats
A BED file: this one contains the genomic coordinates of each repeated locus

Question

What proportion of the whole genome sequence is masked?

You need to click on the dataset information icon of one of the outputs. You should find the value at the end of the expanded Tool Standard Output in Job Information. It should be ~30.62%.
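As a cross-check outside Galaxy, the masked proportion can also be estimated directly from a soft-masked FASTA file by counting lowercase bases. The sketch below is only an illustration, run here on a made-up two-contig example. The function name is hypothetical, and the way Red itself computes its reported percentage may differ in the details (for example, how N bases are counted).

```python
import io


def masked_fraction(fasta_lines) -> float:
    """Return the fraction of bases written in lowercase in FASTA lines."""
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        # Skip empty lines and header lines (which start with '>').
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Toy FASTA content (hypothetical); a real file opened with open() works the same way.
toy_fasta = io.StringIO(">contig_1\nACGTacgtAC\n>contig_2\nttttGGGGCC\n")
print(f"{100 * masked_fraction(toy_fasta):.2f}% masked")  # 40.00% masked
```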

Question

How to hard-mask a genome with Red?

As you can see, Red has no option to hard-mask your genome. However, one of the outputs is a BED file, so you can use bedtools MaskFastaBed to replace repeated regions with stretches of N:

Hands-on

bedtools MaskFastaBed (Galaxy version 2.30.0) with the following parameters:

“BED/bedGraph/GFF/VCF/EncodePeak file”: Red on data (BED file produced by Red)

“FASTA file”: genome_raw.fasta
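Conceptually, hard-masking from a BED file amounts to overwriting each listed interval with N. The following Python sketch illustrates the idea on a toy sequence. It is not how bedtools is implemented, and the function name and data are invented for the example. It relies on the BED convention that coordinates are 0-based with an exclusive end.

```python
def mask_with_bed(sequences: dict, bed_lines) -> dict:
    """Replace BED intervals (0-based start, exclusive end) with N."""
    masked = {name: list(seq) for name, seq in sequences.items()}
    for line in bed_lines:
        # Skip empty lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        bases = masked[chrom]  # raises KeyError if the contig is unknown
        for i in range(int(start), min(int(end), len(bases))):
            bases[i] = "N"
    return {name: "".join(bases) for name, bases in masked.items()}


# Toy data (hypothetical): one contig and one repeat interval.
genome = {"contig_1": "ACGTACGTAC"}
bed = ["contig_1\t2\t6"]
print(mask_with_bed(genome, bed)["contig_1"])  # ACNNNNGTAC
```

Because the end coordinate is exclusive, the interval 2 to 6 masks exactly four bases (positions 2, 3, 4 and 5).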

Red uses only the sequence of the genome to detect repeated regions, and does not provide a detailed classification of the detected repeats. Let’s use another tool that works differently: RepeatMasker.

Soft-masking using RepeatMasker

Let’s run RepeatMasker by selecting the input assembly in FASTA format. We select the soft-masking option, and we choose to use the Dfam database.

Comment: Choosing the right species

We select the Human (Homo sapiens) species here, even though we are masking a fungal genome. This means RepeatMasker will only identify very common repeats found in many organisms. For more precise results, consider selecting a species closer to the one you are analysing in the drop-down list, or using more advanced tools like RepeatModeler.

Hands-on

RepeatMasker (Galaxy version 4.1.5+galaxy0) with the following parameters:

“Genomic DNA”: genome_raw.fasta (Input dataset)

“Repeat library source”: DFam (curated only, bundled with RepeatMasker)

“Select species name from a list?”: Yes

“Species”: Human (Homo sapiens)

“Output annotation of repeats in GFF format”: Yes

“Perform soft-masking instead of hard-masking”: Yes

RepeatMasker produces 5 output files:

masked sequence: this is the fasta file that you will use for future analysis. If you display it, you will notice that some portions of the sequence are in lowercase: these are the regions that were identified as repeats.
repeat statistics: this one contains some statistics on the number of repeats found in each category, and the total number of base pairs masked.
output log: this is a tabular file listing all repeats.
repeat catalogue: this one contains the list of all repeat sequences that were identified, with their position, and their similarity with known repeats from the Dfam database.
repeat annotation: this one contains the coordinates of each repeat element in GFF2 format.
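If you want to reuse the GFF annotation with BED-based tools, keep in mind that GFF coordinates are 1-based with an inclusive end, whereas BED coordinates are 0-based with an exclusive end. The sketch below shows the conversion and sums the covered bases after merging overlapping repeats. It is only an illustration: the function names are made up, and the two example lines (in particular their attribute column) are hypothetical, not copied from a real RepeatMasker output.

```python
def gff_to_intervals(gff_lines):
    """Convert GFF lines to (seqid, start, end) in 0-based, end-exclusive form."""
    intervals = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # GFF columns 4 and 5 hold the 1-based, inclusive start and end.
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        intervals.append((seqid, start - 1, end))
    return intervals


def covered_bases(intervals) -> int:
    """Count bases covered by at least one interval, merging overlaps."""
    by_seq = {}
    for seqid, start, end in intervals:
        by_seq.setdefault(seqid, []).append((start, end))
    total = 0
    for spans in by_seq.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:  # overlapping or adjacent: extend the span
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


# Hypothetical example lines with two overlapping repeats on one contig.
gff = [
    "##gff-version 2",
    'contig_1\tRepeatMasker\tsimilarity\t11\t20\t12.5\t+\t.\tTarget "Motif:(CA)n" 1 10',
    'contig_1\tRepeatMasker\tsimilarity\t16\t30\t8.0\t-\t.\tTarget "Motif:L1" 5 19',
]
intervals = gff_to_intervals(gff)
print(intervals)                 # [('contig_1', 10, 20), ('contig_1', 15, 30)]
print(covered_bases(intervals))  # 20
```

Merging before summing avoids counting the same base twice when two repeat annotations overlap.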

Question

What proportion of the whole genome sequence is masked?

You should find it in the repeat statistics output. It should be ~2.41%.

As we have used a generic species (Human), we only identified the most common and simple repeats, which are not very specific to this species. Compared with the Red results, you are missing repeats covering roughly 28% of the genome (30.62% versus 2.41%). However, RepeatMasker provides information about repeat classification, which could be useful for future analysis.

To boost RepeatMasker performance, we need a repeat library tailored to Mucor mucedo. Building one can take from a few hours to a few days, and many different tools can be used. We pre-computed two libraries:

Muco_library_RM2.fasta using RepeatModeler (Flynn et al. 2020)
Muco_library_EDTA.fasta using EDTA (Ou et al. 2019)

You can prepare your own repeat library, specific to the genome you’re using, by running RepeatModeler in Galaxy (it will take several hours to run):

Hands-on: Building a repeat library with RepeatModeler

RepeatModeler (Galaxy version 2.0.4+galaxy1) with the following parameters:

“Input genome fasta”: genome_raw.fasta (Input dataset)

The library is in the “consensus sequences” output dataset.

Hands-on

RepeatMasker (Galaxy version 4.1.5+galaxy0) with the following parameters:

“Genomic DNA”: genome_raw.fasta (Input dataset)

“Repeat library source”: Custom library of repeats

“Custom library of repeats”

“One of the two pre-computed libraries”: Muco_library_RM2.fasta or Muco_library_EDTA.fasta

“Output annotation of repeats in GFF format”: Yes

“Perform soft-masking instead of hard-masking”: Yes

Question

Compare the different repeat statistics files produced. Which library gives the highest proportion of masked sequence with RepeatMasker?

The RepeatModeler library seems to give the highest percentage of repeats found, with ~34.89%. This could be explained by the fact that RepeatModeler is specifically designed to work with RepeatMasker.

Other tools might mask a greater proportion of the genome, at the cost of a more complex workflow with training steps. But masking more isn’t always a positive result! In fact, large gene families can be mistaken for repeats by some tools or libraries. Only manual curation can correct those mistakes, but this result is sufficient to perform an annotation by following the Funannotate annotation tutorial.

Conclusion

By following this tutorial you have learned how to mask a eukaryotic genome using Red and RepeatMasker, after assembling it (Flye assembly tutorial) and before annotating it (Funannotate annotation tutorial).

Annotation tools often prefer soft-masked genomes, as they primarily search for genes in non-repeated regions but tolerate genes that partially overlap with these regions.

Key points

RepeatMasker can be used to soft-mask a genome

It is an essential first step before running structural annotation pipelines

Frequently Asked Questions

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Genome Annotation topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

References

Ou, S., W. Su, Y. Liao, K. Chougule, J. R. A. Agda et al., 2019 Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol 20: 10.1186/s13059-019-1905-y
Flynn, J. M., R. Hubley, C. Goubert, J. Rosen, A. G. Clark et al., 2020 RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. U.S.A. 117: 9451–9457. 10.1073/pnas.1921046117

Glossary

LINEs: Long Interspersed Nuclear Elements
SINEs: Short Interspersed Nuclear Elements

Citing this Tutorial

Anthony Bretaudeau, Alexandre Cormier, Laura Leroi, Erwan Corre, Stéphanie Robin, Jonathan Kreplak, Masking repeats with RepeatMasker (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/repeatmasker/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{genome-annotation-repeatmasker,
	author = "Anthony Bretaudeau and Alexandre Cormier and Laura Leroi and Erwan Corre and Stéphanie Robin and Jonathan Kreplak",
	title = "Masking repeats with RepeatMasker (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/repeatmasker/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {{PLoS} Computational Biology}
}

                   

Congratulations on successfully completing this tutorial!

Go Further
Do you want to extend your knowledge? Follow one of our recommended follow-up trainings:

Genome Annotation

Hands-on: Genome annotation with Funannotate

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/genome-annotation/tutorials/repeatmasker/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: repeat_masker
  owner: bgruening
  revisions: ba6d2c32f797
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: repeatmodeler
  owner: csbl
  revisions: 8661b2607b7e
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: bedtools
  owner: iuc
  revisions: a1a923cd89e8
  tool_panel_section_label: BED
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: red
  owner: iuc
  revisions: db57bc3b57af
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
