Indexing and profiling microbes with MetaSBT

Author(s): Fabio Cumbo
Overview
Questions:
  • What are Sequence Bloom Trees and how does MetaSBT use them for genomics?

  • How do you build a custom viral database from a set of reference genomes?

  • How can an existing MetaSBT database be updated with new viruses?

  • How do you profile a query viral genome against a database to determine its identity and novelty?

  • How can we interpret the viral genome profiling results?

Objectives:
  • Understand the fundamental concepts of MetaSBT for efficient genomic indexing.

  • Learn to use the metasbt_index tool to create a custom viral database.

  • Learn to use the metasbt_index tool to update an existing database.

  • Learn to use the metasbt_profile tool to identify and characterize a query viral genome.

  • Be able to interpret the similarity reports generated by the profiling tool.

Requirements:
Time estimation: 30 minutes
Supporting Materials:
Published: Aug 25, 2025
Last modification: Aug 25, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
Revision: 7

Introduction

The ever-increasing volume of sequenced viral genomes, from global surveillance efforts and metagenomic studies, presents a significant computational challenge. How can we efficiently catalog all known viral genomes and then quickly determine the identity of a newly sequenced one?

MetaSBT is a computational method designed to address this challenge. It uses a data structure called a Sequence Bloom Tree (SBT), a tree-like index in which every node holds a Bloom filter: each leaf represents a single genome, and each internal node summarizes the genomes beneath it. A Bloom filter is a space-efficient probabilistic data structure that allows rapid checking of whether an element (in this case, a k-mer from a genome) is part of a set; it can return false positives but never false negatives. By organizing these filters into a tree, MetaSBT can quickly search across thousands of genomes at once.
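To make the idea concrete, here is a minimal Python sketch of a Bloom filter that stores k-mers. It is a toy illustration and not MetaSBT's implementation: the class and function names, the SHA-256-based hashing, and the choice of two hash functions are all assumptions made for this example. The `union` method shows how the filter of an internal tree node can be obtained by OR-ing the filters of its children, which is how Sequence Bloom Trees are conventionally described.

```python
import hashlib


def kmers(sequence, k):
    """Yield every overlapping k-mer of a nucleotide sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


class BloomFilter:
    """Toy Bloom filter: a bit array plus a few hash functions (illustrative only)."""

    def __init__(self, size, num_hashes=2):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

    def union(self, other):
        """Bitwise OR of two filters: what an internal tree node would hold."""
        merged = BloomFilter(self.size, self.num_hashes)
        merged.bits = [a or b for a, b in zip(self.bits, other.bits)]
        return merged


def fraction_present(bloom, sequence, k):
    """Fraction of the query k-mers that the filter reports as possibly present."""
    query = list(kmers(sequence, k))
    return sum(kmer in bloom for kmer in query) / len(query)


# Index a short made-up sequence, then query a fragment of it.
genome = BloomFilter(size=10_000)
for kmer in kmers("ACGTACGTTAGC", 4):
    genome.add(kmer)
print(fraction_present(genome, "ACGTACGT", 4))  # 1.0: inserted k-mers are never missed
```

A query whose k-mers are mostly absent from every filter would yield a low fraction, which is the intuition behind using k-mer membership to flag novel sequences.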

This approach gives MetaSBT several key advantages:

  • Speed: It can query massive databases much faster than traditional alignment-based methods.
  • Low Memory Footprint: It uses significantly less RAM than other methods, making it accessible on standard hardware.
  • Novelty Detection: It can distinguish k-mers from a query genome that are present in the database from those that are new or unknown.

In this tutorial, we will learn how to use the MetaSBT suite in Galaxy to perform three key tasks with viral genomes:

  1. Create a new MetaSBT database from a small set of reference viruses.
  2. Update this database with new viruses.
  3. Profile a query virus against our database to determine its identity and assess its novelty.

Let’s get started!

Hands On: Prepare the History
  1. Create a new history for this tutorial and give it a name like “MetaSBT Viral Profiling”.
  2. Import the 29 viral reference genomes for this tutorial, which belong to 5 different viral species. Open the Galaxy Upload Manager and choose Paste/Fetch data. Paste the following URLs into the text box:

    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/882/815/GCA_000882815.1_ViralProj36615/GCA_000882815.1_ViralProj36615_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/366/285/GCA_002366285.1_ViralProj411812/GCA_002366285.1_ViralProj411812_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/787/195/GCA_004787195.1_ASM478719v1/GCA_004787195.1_ASM478719v1_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/787/735/GCA_004787735.1_ASM478773v1/GCA_004787735.1_ASM478773v1_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/788/155/GCA_004788155.1_ASM478815v1/GCA_004788155.1_ASM478815v1_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/864/885/GCA_000864885.1_ViralProj15500/GCA_000864885.1_ViralProj15500_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/937/885/GCA_009937885.1_ASM993788v1/GCA_009937885.1_ASM993788v1_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/937/895/GCA_009937895.1_ASM993789v1/GCA_009937895.1_ASM993789v1_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/937/905/GCA_009937905.1_ASM993790v1/GCA_009937905.1_ASM993790v1_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/857/045/GCA_000857045.1_ViralProj15142/GCA_000857045.1_ViralProj15142_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/458/325/GCA_006458325.1_ASM645832v1/GCA_006458325.1_ASM645832v1_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/458/385/GCA_006458385.1_ASM645838v1/GCA_006458385.1_ASM645838v1_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/458/425/GCA_006458425.1_ASM645842v1/GCA_006458425.1_ASM645842v1_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/458/665/GCA_006458665.1_ASM645866v1/GCA_006458665.1_ASM645866v1_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/851/145/GCA_000851145.1_ViralMultiSegProj14892/GCA_000851145.1_ViralMultiSegProj14892_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/864/105/GCA_000864105.1_ViralMultiSegProj15617/GCA_000864105.1_ViralMultiSegProj15617_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/865/085/GCA_000865085.1_ViralMultiSegProj15622/GCA_000865085.1_ViralMultiSegProj15622_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/865/725/GCA_000865725.1_ViralMultiSegProj15521/GCA_000865725.1_ViralMultiSegProj15521_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/866/645/GCA_000866645.1_ViralMultiSegProj15620/GCA_000866645.1_ViralMultiSegProj15620_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/862/125/GCA_000862125.1_ViralProj15306/GCA_000862125.1_ViralProj15306_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/865/065/GCA_000865065.1_ViralProj15599/GCA_000865065.1_ViralProj15599_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/866/625/GCA_000866625.1_ViralProj15598/GCA_000866625.1_ViralProj15598_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/871/845/GCA_000871845.1_ViralProj20183/GCA_000871845.1_ViralProj20183_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/786/575/GCA_004786575.1_ASM478657v1/GCA_004786575.1_ASM478657v1_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/745/045/GCA_029745045.1_ASM2974504v1/GCA_029745045.1_ASM2974504v1_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/745/055/GCA_029745055.1_ASM2974505v1/GCA_029745055.1_ASM2974505v1_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/745/065/GCA_029745065.1_ASM2974506v1/GCA_029745065.1_ASM2974506v1_genomic.fna.gz
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/744/035/GCA_029744035.1_ASM2974403v1/GCA_029744035.1_ASM2974403v1_genomic.fna.gz
    
  3. Also import the table from the Zenodo record:

    https://zenodo.org/record/15882806/files/taxonomies.tsv
    
  4. Click Start and then Close the upload manager.
  5. Once uploaded, you will have 30 items in your history. The table retrieved in the previous step maps each genome name to its complete taxonomic label. Note that four genomes are missing from this table. Don’t worry: we will use them later to update our database and to demonstrate how to profile new genomes. The table lists the following 25 genomes:
    • GCA_000882815.1_ViralProj36615_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Zika_virus
    • GCA_002366285.1_ViralProj411812_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Zika_virus
    • GCA_004787195.1_ASM478719v1_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Zika_virus
    • GCA_004787735.1_ASM478773v1_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Zika_virus
    • GCA_004788155.1_ASM478815v1_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Zika_virus
    • GCA_000864885.1_ViralProj15500_genomic -> k__Viruses|p__Pisuviricota|c__Pisoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_related_coronavirus
    • GCA_009858895.3_ASM985889v3_genomic -> k__Viruses|p__Pisuviricota|c__Pisoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_related_coronavirus
    • GCA_009937885.1_ASM993788v1_genomic -> k__Viruses|p__Pisuviricota|c__Pisoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_related_coronavirus
    • GCA_009937895.1_ASM993789v1_genomic -> k__Viruses|p__Pisuviricota|c__Pisoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_related_coronavirus
    • GCA_009937905.1_ASM993790v1_genomic -> k__Viruses|p__Pisuviricota|c__Pisoniviricetes|o__Nidovirales|f__Coronaviridae|g__Betacoronavirus|s__Severe_acute_respiratory_syndrome_related_coronavirus
    • GCA_000857045.1_ViralProj15142_genomic -> k__Viruses|p__Nucleocytoviricota|c__Pokkesviricetes|o__Chitovirales|f__Poxviridae|g__Orthopoxvirus|s__Monkeypox_virus
    • GCA_006458325.1_ASM645832v1_genomic -> k__Viruses|p__Nucleocytoviricota|c__Pokkesviricetes|o__Chitovirales|f__Poxviridae|g__Orthopoxvirus|s__Monkeypox_virus
    • GCA_006458385.1_ASM645838v1_genomic -> k__Viruses|p__Nucleocytoviricota|c__Pokkesviricetes|o__Chitovirales|f__Poxviridae|g__Orthopoxvirus|s__Monkeypox_virus
    • GCA_006458425.1_ASM645842v1_genomic -> k__Viruses|p__Nucleocytoviricota|c__Pokkesviricetes|o__Chitovirales|f__Poxviridae|g__Orthopoxvirus|s__Monkeypox_virus
    • GCA_006458665.1_ASM645866v1_genomic -> k__Viruses|p__Nucleocytoviricota|c__Pokkesviricetes|o__Chitovirales|f__Poxviridae|g__Orthopoxvirus|s__Monkeypox_virus
    • GCA_000851145.1_ViralMultiSegProj14892_genomic -> k__Viruses|p__Negarnaviricota|c__Insthoviricetes|o__Articulavirales|f__Orthomyxoviridae|g__Alphainfluenzavirus|s__Influenza_A_virus
    • GCA_000864105.1_ViralMultiSegProj15617_genomic -> k__Viruses|p__Negarnaviricota|c__Insthoviricetes|o__Articulavirales|f__Orthomyxoviridae|g__Alphainfluenzavirus|s__Influenza_A_virus
    • GCA_000865085.1_ViralMultiSegProj15622_genomic -> k__Viruses|p__Negarnaviricota|c__Insthoviricetes|o__Articulavirales|f__Orthomyxoviridae|g__Alphainfluenzavirus|s__Influenza_A_virus
    • GCA_000865725.1_ViralMultiSegProj15521_genomic -> k__Viruses|p__Negarnaviricota|c__Insthoviricetes|o__Articulavirales|f__Orthomyxoviridae|g__Alphainfluenzavirus|s__Influenza_A_virus
    • GCA_000866645.1_ViralMultiSegProj15620_genomic -> k__Viruses|p__Negarnaviricota|c__Insthoviricetes|o__Articulavirales|f__Orthomyxoviridae|g__Alphainfluenzavirus|s__Influenza_A_virus
    • GCA_000862125.1_ViralProj15306_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Dengue_virus
    • GCA_000865065.1_ViralProj15599_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Dengue_virus
    • GCA_000866625.1_ViralProj15598_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Dengue_virus
    • GCA_000871845.1_ViralProj20183_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Dengue_virus
    • GCA_004786575.1_ASM478657v1_genomic -> k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|f__Flaviviridae|g__Flavivirus|s__Dengue_virus
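The lineage strings above follow a simple pattern: ranks are separated by `|`, and each rank is written as a one-letter prefix, two underscores, and a name. The Python sketch below splits such a string into named ranks and checks which uploaded files have no entry in the table. The helper names are hypothetical, and the assumption that the genome name is the file name without its `.fna.gz` extension is inferred from the list above rather than from the tool's documentation.

```python
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}


def parse_lineage(lineage):
    """Split 'k__Viruses|p__...|s__Zika_virus' into a rank -> name mapping."""
    parsed = {}
    for field in lineage.split("|"):
        prefix, _, name = field.partition("__")
        parsed[RANKS[prefix]] = name
    return parsed


def genome_name(filename):
    """Strip the .gz and .fna extensions (assumed naming convention)."""
    for suffix in (".gz", ".fna"):
        if filename.endswith(suffix):
            filename = filename[:-len(suffix)]
    return filename


def unlabeled(filenames, taxonomy):
    """Return the files whose genome name has no entry in the taxonomy mapping."""
    return [f for f in filenames if genome_name(f) not in taxonomy]


# One labeled genome from the table and one of the four unlabeled genomes.
taxonomy = {
    "GCA_000882815.1_ViralProj36615_genomic":
        "k__Viruses|p__Kitrinoviricota|c__Flasuviricetes|o__Amarillovirales|"
        "f__Flaviviridae|g__Flavivirus|s__Zika_virus",
}
files = [
    "GCA_000882815.1_ViralProj36615_genomic.fna.gz",
    "GCA_029744035.1_ASM2974403v1_genomic.fna.gz",
]
print(parse_lineage(taxonomy["GCA_000882815.1_ViralProj36615_genomic"])["species"])  # Zika_virus
print(unlabeled(files, taxonomy))  # ['GCA_029744035.1_ASM2974403v1_genomic.fna.gz']
```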

Building a comprehensive database from all NCBI viral reference genomes can be time-consuming. For many applications, you can directly use a pre-built database. Many Galaxy servers provide access to these via a CVMFS (CernVM File System) data directory.

For example, you might find pre-built MetaSBT databases at a path like: /cvmfs/data.galaxyproject.org/byhand/MetaSBT/

You can find more information about how Galaxy uses CVMFS for reference data.


Part 1: Creating a MetaSBT Database from Scratch

Our first step is to build a small, custom database using our reference viruses.

Hands On: Build the Initial Index
  1. Run the metasbt_index tool with the following parameters:
    • “Input genomes”: Select all the reference genomes (*.fna.gz) previously imported into the history.
    • “Input table with taxonomic labels”: Select the table that maps genome names to their complete taxonomic labels.
    • Under the “Advanced options” section, “MetaSBT database” is automatically set to “Build your own MetaSBT database from scratch”. This exposes advanced options for estimating the best k-mer length and a suitable Bloom filter size for our specific set of input genomes, as well as a quality control filter that discards genomes failing completeness and contamination thresholds. For this tutorial, set the following options:
      • “K-mer length” > “Set a k-mer length”: 9
      • “Bloom filter size” > “Set a bloom filter size”: 10000
  2. Click Execute. The tool will produce three output files:
    • A clusters table listing the clusters defined by MetaSBT according to the taxonomic organization of the input genomes, together with the number of genomes in each cluster and the density of its Bloom filter.
    • A genomes table with the list of genomes in the MetaSBT clusters and their assigned taxonomic labels.
    • A database compressed tarball representing the actual MetaSBT database.
Question
  1. What does the k-mer size parameter represent?
  2. Why is the output database provided as a compressed tarball?
  3. Four of the uploaded genomes are missing from the database. Why?
  1. The k-mer size is the length of the short DNA/RNA sequences (k-mers) that are extracted from the genomes and stored in the Bloom filters.
  2. The MetaSBT database is actually a directory containing multiple files. Compressing it into a single .tar.gz file makes it much easier to manage and pass to other tools or share it with other users within Galaxy.
  3. Although we selected the whole set of genomes in our history (*.fna.gz), the input table that maps genome names to taxonomic labels has no entry for four of them, so they are automatically excluded. This was done on purpose for this tutorial: we will now use three of these missing genomes to update the database, and keep the fourth as a query genome for profiling.
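The k-mer length and the Bloom filter size interact: the more distinct k-mers are inserted into a filter of a given size, the denser it becomes and the more false positives it returns. The sketch below uses the standard Bloom filter approximation, in which the expected fraction of set bits after inserting n distinct items into m bits with h hash functions is 1 - exp(-h·n/m). The number of hash functions MetaSBT uses is not stated in this tutorial, so `num_hashes=1` is an assumption, and the k-mer count in the example is illustrative rather than measured.

```python
import math


def expected_density(num_kmers, filter_size, num_hashes=1):
    """Expected fraction of bits set after inserting num_kmers distinct k-mers."""
    return 1.0 - math.exp(-num_hashes * num_kmers / filter_size)


def false_positive_rate(num_kmers, filter_size, num_hashes=1):
    """Approximate probability that an absent k-mer is reported as present."""
    return expected_density(num_kmers, filter_size, num_hashes) ** num_hashes


# With k = 9 there are at most 4**9 distinct k-mers over the alphabet A, C, G, T.
print(4 ** 9)  # 262144

# Illustrative numbers: 10,000 distinct k-mers inserted into a 10,000-bit filter.
print(round(expected_density(10_000, 10_000), 3))  # 0.632
```

The density column of the clusters table can be read in the same way: values close to 1 indicate a saturated filter that can no longer discriminate well between genomes.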

We now have a database containing the genomic information for a specific selection of viral species.


Part 2: Updating an Existing MetaSBT Database

Viral databases are rarely static. New strains and species are constantly being discovered. Instead of rebuilding our index from scratch every time, we can simply update it.

We will now add three more Monkeypox virus genomes to the database we just created.

Hands On: Update the Database
  1. Run the metasbt_index tool again with these parameters:
    • “Input genomes”: Select GCA_029745045.1_ASM2974504v1_genomic.fna.gz, GCA_029745055.1_ASM2974505v1_genomic.fna.gz, and GCA_029745065.1_ASM2974506v1_genomic.fna.gz.
    • Under the “Advanced options” section, “MetaSBT database”: “Update a MetaSBT database”.
    • “Will you use a MetaSBT database from your history or a public database?”: “Use a database from the history”.
    • “Select a MetaSBT database”: database.
  2. Click Execute. This produces a new MetaSBT database, so three new files (clusters, genomes, and database) will appear in your history. The three new genomes have been profiled against our database and assigned to the closest species cluster, which is Monkeypox virus in this case.
Question
  1. Why is updating faster?
  2. Why didn’t we specify the k-mer length or the Bloom filter size?
  1. When updating, MetaSBT only needs to process the new genomes and insert them into the existing tree structure. It doesn’t need to re-process the genomes that are already in the index. For very large databases, this can save hours or even days of computation time.
  2. In case of an update, the way we build the Bloom filter representation of the new genomes must be consistent with how we previously built our database. Thus, this information is implicitly inherited from the selected MetaSBT database.
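The update logic described above can be sketched in a few lines of Python. This is a deliberately simplified model and not MetaSBT's algorithm: plain Python sets stand in for Bloom filters, the shared k-mer fraction stands in for the tool's similarity measure, all names are hypothetical, and the case where no existing cluster is close enough is ignored. It only illustrates that an update touches the cluster receiving the new genome and leaves the others as they are.

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def closest_cluster(clusters, query_kmers):
    """Pick the cluster sharing the largest fraction of the query k-mers."""
    def shared_fraction(name):
        return len(query_kmers & clusters[name]) / len(query_kmers)
    return max(clusters, key=shared_fraction)


def update(clusters, sequence, k):
    """Assign a new genome to its closest cluster and merge its k-mers into it."""
    query = kmer_set(sequence, k)
    best = closest_cluster(clusters, query)
    clusters[best] |= query  # only the receiving cluster is modified
    return best


# Two made-up clusters built with k = 3; the same k is reused for the update.
clusters = {
    "cluster_A": kmer_set("AAAAACCCCC", 3),
    "cluster_B": kmer_set("GGGGGTTTTT", 3),
}
print(update(clusters, "AAACCCG", 3))  # cluster_A
```

Note that `k` must be the same at build time and at update time, otherwise the k-mers would not be comparable; this mirrors the reason the tool inherits these settings from the selected database.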

Part 3: Profiling a Query Virus Against the Database

Now we have a database containing five known viral species, and we have a new “query” genome whose identity we want to determine. Let’s use MetaSBT to find out what our query virus is and how it relates to the viruses in our database.

Our query genome is GCA_029744035.1_ASM2974403v1_genomic.fna.gz. We will profile it against our database.

Hands On: Profile the Query Virus
  1. Run the metasbt_profile tool with the following parameters:
    • “Input genomes”: Select GCA_029744035.1_ASM2974403v1_genomic.fna.gz.
    • “Database source”: “Use a database from the history”.
    • “Select a MetaSBT database tarball”: Select the second database file (the one from the update step).
  2. Click Execute. The tool will generate a collection containing a profiling report for each of the input query genomes in tabular format.

  3. Click the eye icon on the generated report to view its contents.
Question

The output report has several rows. Why are there multiple matches at the same taxonomic level?

MetaSBT may report multiple taxonomic units at the same taxonomic level if their distance from the input query genome falls below a threshold, which is derived from the distance to the closest taxonomic unit minus 20% of that distance (by default). This percentage is called the uncertainty percentage and can be changed under the “Advanced options” section.

Analyzing the Results

When you view one of the profiles in the output collection, you should see a report where each row represents a taxonomic level and its corresponding Average Nucleotide Identity (ANI) value. ANI is a measure of genomic similarity between two genomes. Here, the ANI is expressed as a distance measure, so the lower it is, the closer a specific taxonomic unit is to the input genome.

The report is structured into three columns:

  • level: This column indicates the taxonomic rank (i.e., kingdom, phylum, class, order, family, genus, and species).
  • closest: This column displays the lineage of the closest match found in the database at that specific taxonomic level.
  • ani: This column shows the ANI distance for the best match found. A lower ANI distance suggests a closer relationship with the input query genome.

This output helps you understand the genomic relatedness of your query to the genomes present in the database at various taxonomic resolutions. You can observe how the ANI changes as you move down the taxonomic hierarchy, providing insights into the closest classification of your genome.
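If you download a report, a few lines of Python are enough to extract the best match at each level. The sketch below assumes the report is tab-separated with a header row naming the three columns described above (level, closest, ani); check your own file, as this layout is an assumption. The rows in the example are invented for illustration, including the shortened lineages and the hypothetical second species, and do not come from a real run.

```python
import csv
import io


def best_per_level(report_text):
    """Return {level: (closest, ani_distance)} keeping the lowest distance per level."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    best = {}
    for row in reader:
        level = row["level"]
        distance = float(row["ani"])
        # A lower ANI distance means a closer match, so keep the minimum.
        if level not in best or distance < best[level][1]:
            best[level] = (row["closest"], distance)
    return best


# Made-up report content; all values are illustrative only.
example = (
    "level\tclosest\tani\n"
    "genus\tg__Orthopoxvirus\t0.02\n"
    "species\ts__Monkeypox_virus\t0.01\n"
    "species\ts__Hypothetical_virus\t0.012\n"
)
for level, (closest, distance) in best_per_level(example).items():
    print(level, closest, distance)
```

When several rows share a level, as in the species rows here, the function keeps only the closest one; the other rows are the near-ties reported because of the uncertainty percentage.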


Conclusion

In this tutorial, you have learned the complete workflow for using MetaSBT in Galaxy. You have successfully:

  1. Built a custom database from a selection of viral reference genomes;
  2. Efficiently updated that database with new viruses;
  3. Profiled an unknown virus against the database to determine its identity (or lack thereof);
  4. Interpreted the similarity reports to identify a query virus.

You are now equipped to use MetaSBT for your own research. You can create custom databases for curated sets of genomes, or leverage large, pre-built databases to rapidly identify and characterize newly sequenced genomes from clinical or environmental samples.