Long non-coding RNAs (lncRNAs) annotation with FEELnc

Author(s): Stéphanie Robin
Editor(s): Anthony Bretaudeau
Overview
Questions:
  • How to annotate lncRNAs with FEELnc?

  • How to classify lncRNAs according to their localisation and direction of transcription of proximal RNA transcripts?

  • How to update genome annotation with these annotated lncRNAs?

Objectives:
  • Load data (genome assembly, annotation and mapped RNASeq) into Galaxy

  • Perform a transcriptome assembly with StringTie

  • Annotate lncRNAs with FEELnc

  • Classify lncRNAs according to their location

  • Update genome annotation with lncRNAs

Requirements:
Time estimation: 2 hours
Level: Intermediate
Supporting Materials:
Last modification: Oct 18, 2022
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Introduction

Messenger RNAs (mRNAs) are not the only type of RNA present in organisms such as mammals, insects or plants, and they represent only a small fraction of the transcripts. A vast repertoire of small (miRNAs, snRNAs) and long non-coding RNAs (lncRNAs) is also present. lncRNAs are generally defined as transcripts longer than 200 nucleotides that are not translated into functional proteins. They are important because of their major roles in the cellular machinery and their large numbers: they are notably involved in gene expression regulation, control of translation and imprinting. Statistics from the GENCODE project show that the human genome contains more than 19,095 lncRNA genes, almost as many as the 19,370 protein-coding genes.

Using RNASeq data, we can reconstruct assembled transcripts (with or without a reference genome), which can then be annotated and identified individually as mRNAs or lncRNAs.

In this tutorial, we will use a software tool called StringTie (“StringTie enables improved reconstruction of a transcriptome from RNA-seq reads” 2015) to assemble the transcripts and then FEELnc (“FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome” 2017) to annotate the assembled transcripts of a small eukaryote: Mucor mucedo (a fungal plant pathogen).

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts.

FEELnc (FlExible Extraction of Long non-coding RNA) is a pipeline to annotate lncRNAs from RNASeq assembled transcripts. It is composed of 3 modules:

  • FEELnc_filter: Extract and filter candidate transcripts.
  • FEELnc_codpot: Compute the coding potential of candidate transcripts.
  • FEELnc_classifier: Classify lncRNAs based on their genomic localization with respect to other transcripts.
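
As an illustration of the kind of criterion a filtering step can apply, the sketch below keeps only transcripts whose summed exon length exceeds the 200-nucleotide threshold used in the lncRNA definition above. It is a minimal sketch and not FEELnc's own code: the function names, the example records and the simplified GTF attribute parsing are all assumptions made for this example.

```python
from collections import defaultdict

MIN_LENGTH = 200  # nucleotides; taken from the lncRNA definition in the introduction


def transcript_lengths(gtf_lines):
    """Sum exon lengths per transcript_id (hypothetical helper, simplified GTF parsing)."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        start, end = int(fields[3]), int(fields[4])
        for attr in fields[8].split(";"):
            attr = attr.strip()
            if attr.startswith("transcript_id "):
                tid = attr.split(" ", 1)[1].strip('"')
                lengths[tid] += end - start + 1  # GTF coordinates are 1-based and inclusive
                break
    return dict(lengths)


def longer_than(gtf_lines, min_length=MIN_LENGTH):
    """Return the sorted ids of transcripts strictly longer than min_length."""
    return sorted(t for t, n in transcript_lengths(gtf_lines).items() if n > min_length)


# Made-up records: t1 has two exons (150 + 100 nt), t2 has one exon (120 nt).
EXAMPLE_GTF = [
    "\t".join(["chr1", "StringTie", "exon", "1", "150", ".", "+", ".", 'gene_id "g1"; transcript_id "t1";']),
    "\t".join(["chr1", "StringTie", "exon", "301", "400", ".", "+", ".", 'gene_id "g1"; transcript_id "t1";']),
    "\t".join(["chr1", "StringTie", "exon", "1", "120", ".", "-", ".", 'gene_id "g2"; transcript_id "t2";']),
]

print(longer_than(EXAMPLE_GTF))
```

In this made-up example only t1 (250 nt) passes the threshold; the real filter module applies further criteria, such as overlap with reference exons.
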
Agenda

In this tutorial, we will cover:

  1. Introduction
  2. Data upload
  3. Transcripts assembly with StringTie
  4. lncRNAs annotation with FEELnc
  5. Conclusion

Data upload

To assemble the transcriptome with StringTie and annotate lncRNAs with FEELnc, we will use the following files:

  • The genome sequence in FASTA format. For this tutorial, we will use the genome assembled in the Flye assembly tutorial.
  • The genome annotation in GFF3 format. We will use the genome annotation obtained in the Funannotate tutorial.
  • Some aligned RNASeq data in BAM format. Here, we will use some mapped RNASeq data where mapping was done using STAR.
Hands-on: Data upload
  1. Create a new history for this tutorial

    Click the new-history icon at the top of the history panel.

    If the new-history icon is missing:

    1. Click on the galaxy-gear icon (History options) on the top of the history panel
    2. Select the option Create New from the menu
  2. Import the files from Zenodo or from the shared data library (GTN - Material -> genome-annotation -> Long non-coding RNAs (lncRNAs) annotation with FEELnc):

    https://zenodo.org/api/files/0f8d27c5-8c8d-4379-90c4-c3cd950de391/genome_assembly.fasta
    https://zenodo.org/api/files/0f8d27c5-8c8d-4379-90c4-c3cd950de391/genome_annotation.gff3
    https://zenodo.org/api/files/0f8d27c5-8c8d-4379-90c4-c3cd950de391/all_RNA_mapped.bam
    
    • Copy the link location
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    • Go into Shared data (top panel) then Data libraries
    • Navigate to the correct folder as indicated by your instructor
    • Select the desired files
    • Click on the To History button near the top and select as Datasets from the dropdown menu
    • In the pop-up window, select the history you want to import the files to (or create a new one)
    • Click on Import
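
For readers who also want local copies of the three input files outside Galaxy, the following sketch fetches them with Python's standard library. It assumes the Zenodo links listed above are still reachable; the `data` destination folder and the function names are hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file names copied from the list of Zenodo links above.
ZENODO_BASE = "https://zenodo.org/api/files/0f8d27c5-8c8d-4379-90c4-c3cd950de391"
FILES = ["genome_assembly.fasta", "genome_annotation.gff3", "all_RNA_mapped.bam"]


def file_urls(base=ZENODO_BASE, names=FILES):
    """Build the full download URL for each tutorial file."""
    return [f"{base}/{name}" for name in names]


def download_all(dest_dir="data"):
    """Download every file into dest_dir, skipping files that already exist."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in file_urls():
        target = dest / url.rsplit("/", 1)[1]
        if not target.exists():
            urlretrieve(url, str(target))


# Usage (needs network access): download_all("data")
```
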

Transcripts assembly with StringTie

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. StringTie takes as input a SAM, BAM or CRAM file sorted by coordinate (genomic location). This file should contain spliced RNA-seq read alignments such as the ones produced by TopHat, HISAT2 or STAR. The TopHat output is already sorted, but the SAM output from other aligners should be sorted using the samtools program.

A reference annotation file in GTF or GFF3 format can be provided to StringTie. Its transcripts are then used as ‘guides’ for the assembly process, which helps improve the recovery of transcript structure for those transcripts.

Hands-on: Transcripts assembly

StringTie Tool: toolshed.g2.bx.psu.edu/repos/iuc/stringtie/stringtie/2.1.7+galaxy1 with the following parameters:

  • “Input options”: Short reads
  • param-file “Input short mapped reads”: all_RNA_mapped.bam
  • “Specify strand information”: Unstranded
  • “Use a reference file to guide assembly?”: Use reference GTF/GFF3
  • “Reference file”: Use a file from history
    • param-file “GTF/GFF3 dataset to guide assembly”: genome_annotation.gff3
  • “Use Reference transcripts only?”: No
  • “Output files for differential expression?”: No additional output
  • “Output coverage file”: No
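
Outside Galaxy, a roughly equivalent StringTie call can be assembled as below. This is a sketch under assumptions: the exact command line built by the Galaxy wrapper is not shown in this tutorial, and the output file name is hypothetical. Only the `-G` (guide annotation) and `-o` (output GTF) options are used; leaving out `--rf`/`--fr` corresponds to unstranded data, and leaving out `-e` means assembly is not restricted to reference transcripts.

```python
import shlex


def stringtie_command(bam, guide_annotation, out_gtf):
    """Build a StringTie command guided by a reference annotation."""
    return ["stringtie", bam, "-G", guide_annotation, "-o", out_gtf]


cmd = stringtie_command(
    "all_RNA_mapped.bam",
    "genome_annotation.gff3",
    "assembled_transcripts.gtf",  # hypothetical output name
)
print(shlex.join(cmd))
# To actually run it (requires StringTie to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```
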

We obtain an annotation file (GTF format) which contains all the assembled transcripts present in the RNASeq data.

After this step, the transcriptome is assembled and ready for lncRNAs annotation.

Question

How many transcripts are assembled?

Specific features can be extracted from the GTF file using, for example, Extract features from GFF data Tool: Extract_features1. By selecting transcript under “From column 3 / Feature”, we keep only the transcript elements present in this annotation file. The assembly contains 14,877 transcripts (the number of lines in the filtered GTF file).
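
The same count can be reproduced with a few lines of Python: keep the records whose third column is `transcript` and count them. This is a minimal sketch; the function name and the example records are made up for illustration.

```python
def count_features(gtf_lines, feature="transcript"):
    """Count GTF records whose third column equals the requested feature type."""
    count = 0
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and empty lines
        fields = line.split("\t")
        if len(fields) >= 3 and fields[2] == feature:
            count += 1
    return count


# Made-up records: two transcripts and three exons.
EXAMPLE_RECORDS = [
    "# StringTie version (comment line)",
    "\t".join(["chr1", "StringTie", "transcript", "1", "400", "1000", "+", ".", 'gene_id "g1"; transcript_id "t1";']),
    "\t".join(["chr1", "StringTie", "exon", "1", "150", "1000", "+", ".", 'gene_id "g1"; transcript_id "t1";']),
    "\t".join(["chr1", "StringTie", "exon", "301", "400", "1000", "+", ".", 'gene_id "g1"; transcript_id "t1";']),
    "\t".join(["chr1", "StringTie", "transcript", "600", "900", "1000", "-", ".", 'gene_id "g2"; transcript_id "t2";']),
    "\t".join(["chr1", "StringTie", "exon", "600", "900", "1000", "-", ".", 'gene_id "g2"; transcript_id "t2";']),
]

print(count_features(EXAMPLE_RECORDS))

# With a real file: count_features(open("assembled_transcripts.gtf"))
```
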

lncRNAs annotation with FEELnc

FEELnc is a pipeline composed of 3 steps, which are run automatically when running FEELnc within Galaxy. The first step (FEELnc_filter) consists of filtering out unwanted/spurious transcripts and/or transcripts overlapping (in sense) exons of the reference annotation, especially protein-coding exons, as they more probably correspond to new mRNA isoforms.

To use FEELnc, we need a reference annotation file in GTF format that contains the protein-coding gene annotation. So far, we have only imported the reference annotation in GFF3 format (genome_annotation.gff3). To convert it from GFF3 to GTF format, we will use gffread.
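
The most visible difference between the two formats is the ninth (attribute) column: GFF3 uses `key=value` pairs, while GTF uses `key "value";` pairs. The sketch below parses both styles into a dictionary to show the contrast. It is a simplified illustration, not a replacement for gffread: it does not handle semicolons inside quoted values, and the function name and example strings are assumptions.

```python
def parse_attributes(column, fmt):
    """Parse a GFF3 or GTF attribute column into a dict (simplified)."""
    attrs = {}
    for item in column.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        if fmt == "gff3":
            key, _, value = item.partition("=")      # e.g. ID=mRNA1
        elif fmt == "gtf":
            key, _, value = item.partition(" ")      # e.g. transcript_id "mRNA1"
            value = value.strip().strip('"')
        else:
            raise ValueError(f"unknown format: {fmt}")
        attrs[key] = value
    return attrs


gff3_attrs = parse_attributes("ID=mRNA1;Parent=gene1", "gff3")
gtf_attrs = parse_attributes('gene_id "gene1"; transcript_id "mRNA1";', "gtf")
print(gff3_attrs)
print(gtf_attrs)
```
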

Hands-on: FEELnc
  1. gffread Tool: toolshed.g2.bx.psu.edu/repos/devteam/gffread/gffread/2.2.1.3+galaxy0 with the following parameters:
    • param-file “Input BED, GFF3 or GTF feature file”: genome_annotation.gff3
    • “Feature File Output”: GTF
  2. FEELnc Tool: toolshed.g2.bx.psu.edu/repos/iuc/feelnc/feelnc/0.2 with the following parameters:
    • param-file “Transcripts assembly”: Assembled transcript (output of StringTie tool)
    • param-file “Reference annotation”: genome_annotation.gtf (Output of gffread tool)
    • param-file “Genome sequence”: genome_assembly.fasta

FEELnc provides 3 output files:

  • lncRNA annotation file: annotation file in GTF format which contains the final set of lncRNAs
  • mRNA annotation file: annotation file in GTF format which contains the final set of mRNAs
  • Classifier output file: table containing the classification of lncRNAs based on their genomic localisation with respect to other transcripts (direction: sense or antisense; type: genic if the lncRNA gene overlaps an RNA gene from the reference annotation file, or intergenic (lincRNA) if not).
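
The genic/intergenic and sense/antisense distinctions can be sketched as follows. This is a simplified illustration of the idea, not FEELnc's actual algorithm: all features are assumed to lie on the same sequence, the `window` used to pair intergenic lncRNAs with nearby genes is an illustrative value, and every name in the code is hypothetical.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def gap(a, b):
    """Distance between two intervals; zero or negative means they overlap."""
    return max(a.start, b.start) - min(a.end, b.end)


def classify(lncrna, genes, window=10_000):
    """Return (gene name, type, direction) for each gene interacting with the lncRNA."""
    overlapping = [g for g in genes if gap(lncrna, g) <= 0]
    if overlapping:
        kind, partners = "genic", overlapping
    else:
        kind = "intergenic"
        partners = [g for g in genes if gap(lncrna, g) <= window]
    return [
        (g.name, kind, "sense" if g.strand == lncrna.strand else "antisense")
        for g in partners
    ]


GENES = [Feature("geneA", 1000, 3000, "+"), Feature("geneB", 8000, 9000, "-")]

print(classify(Feature("lnc1", 1500, 2000, "+"), GENES))
print(classify(Feature("lnc2", 5000, 6000, "+"), GENES))
```

In this made-up example, lnc1 lies inside geneA and yields a single genic interaction, while lnc2 lies between the two genes and yields one intergenic interaction with each of them.
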

FEELnc also writes a summary to stdout.

Question

How many RNAs does this annotation contain? How many interactions between lncRNAs and mRNAs have been identified? Can you describe the different types of lncRNAs?

The summary file indicates that 104 lncRNAs and 0 new mRNAs were annotated by FEELnc. The initial annotation contains 13,795 mRNAs. Therefore, a total of 13,899 RNAs (13,795 + 104) are currently annotated.

The summary file indicates 652 interactions between lncRNAs and mRNAs. These interactions are described in the Classifier output file.

The different types of lncRNAs (intergenic (sense and antisense) and genic (sense)) are described in the Classifier output file. The majority of the lncRNAs are intergenic, and each of them can have interactions with several mRNAs. Only 7 lncRNAs are genic; each of these has a single interaction, with the mRNA that contains it.

For future analyses, it would be useful to have an updated annotation containing both mRNA and lncRNA annotations. We will therefore merge the reference annotation with the lncRNA annotation obtained with FEELnc.

Hands-on: Merge the annotations

concatenate Tool: https://toolshed.g2.bx.psu.edu/view/bgruening/text_processing/f46f0e4f75c4 with the following parameters:

  • param-file “Datasets to concatenate”: genome_annotation.gtf
  • Insert Dataset
  • param-file “Dataset”: lncRNA annotation with FEELnc
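
Since both files are GTF, merging them amounts to appending one file to the other. A minimal sketch of the same operation in Python is shown below; the function name and the file names in the usage line are hypothetical, and files are read fully into memory, which is acceptable for annotations of this size.

```python
from pathlib import Path


def concatenate(inputs, output):
    """Write the content of each input file, in order, to the output file."""
    with open(output, "w") as out:
        for path in inputs:
            text = Path(path).read_text()
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")  # keep the next file's first record on its own line


# Usage (hypothetical file names):
# concatenate(["genome_annotation.gtf", "lncRNA_annotation.gtf"], "merged_annotation.gtf")
```
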

Conclusion

Congratulations on reaching the end of this tutorial! You now know how to annotate lncRNAs using RNASeq data.

Key points
  • StringTie performs a transcriptome assembly from mapped RNASeq data and provides an annotation file describing the assembled transcripts.

  • The FEELnc pipeline annotates long non-coding RNAs (lncRNAs).

  • Annotation is based on transcripts reconstructed from RNA-seq data (either with or without a reference genome).

  • Annotation can be performed without any training set of non-coding RNAs.

  • FEELnc provides the localisation and the direction of transcription of RNA transcripts proximal to the lncRNAs.

Frequently Asked Questions

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Genome Annotation topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum.

References

  1. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads, 2015 Nature Biotechnology 33: 290–295. 10.1038/nbt.3122
  2. FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome, 2017 Nucleic Acids Research 45: e57. 10.1093/nar/gkw1306

Glossary

lncRNAs
Long non-coding RNAs
mRNAs
Messenger RNAs


Citing this Tutorial

  1. Stéphanie Robin, Long non-coding RNAs (lncRNAs) annotation with FEELnc (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/lncrna/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012



@misc{genome-annotation-lncrna,
author = "Stéphanie Robin",
title = "Long non-coding RNAs (lncRNAs) annotation with FEELnc (Galaxy Training Materials)",
year = "",
month = "",
day = "",
url = "\url{https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/lncrna/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Computational Biology}
}
