Assembly of the mitochondrial genome from PacBio HiFi reads
Author(s): Delphine Lariviere
Overview
Questions:
- How can the mitochondrial genome be assembled from PacBio HiFi reads?
Objectives:
- Generate a mitochondrial assembly
- Understand the outputs of MitoHiFi
Time estimation: 1 hour
Published: Sep 3, 2024
Last modification: Sep 26, 2024
License: Tutorial content is licensed under the Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.
PURL: https://gxy.io/GTN:T00453
Revision: 2
Introduction
This tutorial will show you how to assemble a mitochondrial genome from PacBio HiFi data using MitoHiFi (Uliano-Silva et al. 2023). Combined with the tutorials “Using the VGP workflows to assemble a vertebrate genome with HiFi and Hi-C data” and “Decontamination of a genome assembly”, this allows you to produce a reference assembly for both the nuclear and the mitochondrial DNA of a vertebrate species.
This tutorial uses data from the Zebra Finch (Taeniopygia guttata) generated by the Vertebrate Genomes Project. We downsampled the reads that did not align to the mitochondrial genome so that the tutorial runs faster.
Comment: Run this analysis on "real" data
If you want to run this analysis on a real sequencing library generated by the Vertebrate Genomes Project, you can find the PacBio HiFi data on Genome Ark, which is available as a remote repository on the three main public Galaxy instances (.org, .eu, .org.au), and import it into Galaxy.
As an alternative to uploading the data from a URL or your computer, the files can be imported through Choose remote files:
- Click on Upload Data at the top of the left panel
- Click on Choose remote files and scroll down to find your data folder, or type the folder name in the search box at the top
- Look for your data under:
Genome Ark -> species -> Taeniopygia_guttata -> bTaeGut2 -> genomic_data -> pacbio_hifi -> *_reads.fasta.gz
- Click on OK
- Click on Start
- Click on Close
- The dataset will begin loading in your history
The assembly uses the MitoHiFi workflow, which is wrapped as a Galaxy tool. MitoHiFi:
- Extracts mitochondrial reads (based on a BLAST against an existing reference mitogenome) and uses Hifiasm (Cheng et al. 2021) to assemble them.
- Removes nuclear mitochondrial DNA sequences (NUMTs) from the potential mitogenome contigs
- Generates a circularized and annotated genome for all potential mitogenome contigs
- Selects a representative for the final mitochondrial assembly
Get data
Hands-on: Data Upload
- Create a new history for this tutorial
- Import the files from Zenodo:
https://zenodo.org/records/13345315/files/PacBio_reads.fastqsanger.gz
  - Copy the link location
  - Click Upload Data at the top of the tool panel
  - Select Paste/Fetch Data
  - Paste the link(s) into the text field
  - Press Start
  - Close the window
- Check that the datatype is fastqsanger.gz
  - Click on the pencil icon for the dataset to edit its attributes
  - In the central panel, click the Datatypes tab at the top
  - In Assign Datatype, select fastqsanger.gz from the “New type” dropdown
  - Tip: you can start typing the datatype into the field to filter the dropdown menu
  - Click the Save button
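If you keep a local copy of the reads, a quick sanity check before uploading can catch a truncated download. The following is a minimal sketch, not part of Galaxy or MitoHiFi: `count_fastq_records` is a hypothetical helper that assumes four-line FASTQ records. It only verifies gzip compression and record headers, and does not validate the Sanger quality encoding.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip file


def count_fastq_records(path):
    """Count records in a gzip-compressed FASTQ file.

    Assumes the common four-line layout (header, sequence, '+', qualities).
    """
    # Check the gzip magic number before trying to decompress.
    with open(path, "rb") as raw:
        if raw.read(2) != GZIP_MAGIC:
            raise ValueError(f"{path} does not look gzip-compressed")

    total_lines = 0
    with gzip.open(path, "rt") as handle:
        for total_lines, line in enumerate(handle, start=1):
            # Every first line of a four-line record must start with '@'.
            if total_lines % 4 == 1 and not line.startswith("@"):
                raise ValueError(f"line {total_lines} is not a FASTQ header")

    if total_lines % 4 != 0:
        raise ValueError("file ends in the middle of a record")
    return total_lines // 4
```

For the tutorial dataset, this could be called as `count_fastq_records("PacBio_reads.fastqsanger.gz")` once the file has been downloaded from Zenodo.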
Download the mitogenome for a related species
To assemble the mitogenome from our PacBio data, MitoHiFi needs a reference mitogenome to fish out mitochondrial reads (the reads are blasted against the related reference). To download this reference, we use the MitoHiFi tool with the operation Find a close-related mitochondrial reference genome.
- MitoHiFi (Galaxy version 3+galaxy0) with the following parameters:
  - “Operation type selector”: Find a close-related mitochondrial reference genome
  - “Species name”: Taeniopygia guttata
    Enter the Latin name of the species you are assembling.
  - “Email”: your.email@service.com
    Enter your email.
  - “Minimal appropriate length”: 15000
    Vertebrate mitochondrial genomes are typically at least 14 kbp long, so we use a value in this range to retrieve a complete mitogenome as our reference.
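For readers who want to see how these parameters map onto a command line, here is a hedged sketch. It assumes the `findMitoReference.py` script and the `--species`, `--email`, `--outfolder`, and `--min_length` options described in the MitoHiFi README; the Galaxy wrapper may invoke the tool differently. The helper name `build_find_reference_command` and the output folder `reference` are hypothetical. The code only builds and prints the command and does not run it.

```python
import shlex


def build_find_reference_command(species, email, outfolder, min_length=15000):
    """Build the argument list for a reference search (assumed README flags)."""
    return [
        "findMitoReference.py",
        "--species", species,        # Latin name of the species being assembled
        "--email", email,            # contact email used for the database query
        "--outfolder", outfolder,    # hypothetical output directory
        "--min_length", str(min_length),
    ]


command = build_find_reference_command(
    species="Taeniopygia guttata",
    email="your.email@service.com",
    outfolder="reference",
)
# shlex.join quotes arguments that contain spaces, such as the species name.
print(shlex.join(command))
```

Keeping the command as a list of arguments avoids quoting mistakes with the two-word species name if it is later passed to `subprocess.run`.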
Assemble the mitochondrial genome with MitoHiFi
Hands-on: Assemble the mitochondrial genome
- Create a collection with your PacBio HiFi reads
  - Click on Select Items at the top of the history panel
  - Check the fastq.gz dataset containing the HiFi reads
  - Click 1 of N selected and choose Build Dataset List
  - Name your collection PacBio Reads
  - Click Create collection to build your collection
  - Click on the checkmark icon at the top of your history again
- MitoHiFi (Galaxy version 3+galaxy0) with the following parameters:
  - “Operation type selector”: Run MitoHiFi
  - “Input mode”: Pacbio Hifi Reads
  - “Pacbio Hifi reads”: PacBio Reads (Input dataset collection)
  - “Close-related mitogenome in fasta format”: MitoHiFi on : reference genome (FASTA) (output of MitoHiFi tool)
  - “Close-related mitogenome in genbank format”: MitoHiFi on : reference genome (genbank) (output of MitoHiFi tool)
  - “Genetic code”: Vertebrate mitochondrial code
  - In “Advanced options”:
    - “Blast percentage identity”: 70
      This setting filters the candidate mitochondrial contigs: a value of 70 retains contigs with at least 70% of their length in the BLAST match. Lower it if you expect more sequence divergence among the mitogenomes of your taxa, and raise it if you expect less.
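The effect of this threshold can be illustrated with a small sketch. This is not MitoHiFi's own code: the function names, contig names, and lengths below are hypothetical, and the sketch simply assumes that the percentage is the aligned length divided by the contig length.

```python
def percent_in_match(contig_length, aligned_length):
    """Percentage of a contig's length covered by its BLAST match."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return 100.0 * aligned_length / contig_length


def keep_contig(contig_length, aligned_length, min_percent=70.0):
    """Return True when the contig passes the percentage threshold."""
    return percent_in_match(contig_length, aligned_length) >= min_percent


# Hypothetical candidates: (contig length, length aligned to the reference).
candidates = {
    "contig_A": (16800, 16000),  # mostly mitochondrial sequence
    "contig_B": (40000, 12000),  # short mitochondrial match in a long contig
}

for name, (length, aligned) in candidates.items():
    status = "kept" if keep_contig(length, aligned) else "discarded"
    print(f"{name}: {percent_in_match(length, aligned):.1f}% in match, {status}")
```

In this illustration, a long contig with only a short matching stretch falls below the threshold, which is the kind of candidate a stricter setting is meant to exclude.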
Outputs of MitoHiFi:
- Final mitogenome (FASTA). The mitochondrial genome circularized and rotated to start at tRNA-Phe.
- Final mitogenome (genbank). The final mitogenome annotated in GenBank format.
- Final mitogenome annotation (png). The predicted genes in the final mitogenome.
- Final mitogenome coverage (png). The sequencing coverage along the final mitogenome.
- Contigs stats (TSV). Contains the statistics of your assembled mitogenomes such as the number of genes, size, whether it was circularized or not, if the sequence has frameshifts, and other metrics.
- Reads mapped to close-related mtDNA (FASTA). Provided both as the full set and filtered by size.
- Hifiasm contigs (fasta). The results of running Hifiasm on the mitochondrial reads.
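The final mitogenome is described above as rotated to start at tRNA-Phe. Because the molecule is circular, rotation changes only the position where the linear FASTA record begins, not the sequence content. A minimal sketch of the idea follows; `rotate_circular` is a hypothetical helper, the start index would in practice come from the annotation, and strand orientation is ignored here.

```python
def rotate_circular(sequence, start):
    """Rotate a circular sequence so that it begins at 0-based index `start`."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    start %= len(sequence)  # wrap around, since the molecule is circular
    return sequence[start:] + sequence[:start]


# Toy example: pretend the gene of interest starts at index 4.
toy_genome = "GGTTAACC"
rotated = rotate_circular(toy_genome, 4)
print(rotated)  # AACCGGTT
```

The rotated record has the same length and base composition as the input, which is a useful check when comparing assemblies that were linearized at different positions.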
Conclusion
In this tutorial, we learned how to assemble the mitochondrial genome using PacBio HiFi reads and MitoHiFi. You can try this tutorial on your own data using the full HiFi read set you’d use for the nuclear genome assembly, since the filtering for mitochondrial reads happens within MitoHiFi.