Identification of AMR genes in an assembled bacterial genome

Overview
Creative Commons License: CC-BY Questions:
  • Which resistance genes are on a bacterial genome?

  • Where are the genes located on the genome?

Objectives:
  • Run a series of tool to assess the presence of antimicrobial resistance genes (ARG)

  • Get information about ARGs

  • Visualize the ARGs and plasmid genes in their genomic context

Requirements:
Time estimation: 2 hours
Level: Introductory Introductory
Supporting Materials:
Published: Jan 23, 2024
Last modification: Mar 13, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
purl PURL: https://gxy.io/GTN:T00401
version Revision: 9

Antimicrobial resistance (AMR) is a global phenomenon with no geographical or species boundaries, which poses an important threat to human, animal and environmental health. It is a complex and growing problem that compromises our ability to treat bacterial infections.

AMR gene content can be assessed from whole genome sequencing to detect known resistance mechanisms and potentially identify novel mechanisms.

To illustrate the process to identify AMR gene in a bacterial genome, we take an assembly of a bacterial genome (KUN1163 sample) generated by following a bacterial genome assembly tutorial from data produced in “Complete Genome Sequences of Eight Methicillin-Resistant Staphylococcus aureus Strains Isolated from Patients in Japan” (Hikichi et al. 2019).

Methicillin-resistant Staphylococcus aureus (MRSA) is a major pathogen causing nosocomial infections, and the clinical manifestations of MRSA range from asymptomatic colonization of the nasal mucosa to soft tissue infection to fulminant invasive disease. Here, we report the complete genome sequences of eight MRSA strains isolated from patients in Japan.

Agenda

In this tutorial, we will cover:

  1. Galaxy and data preparation
  2. Identification of AMR genes
  3. Getting more information about ARG via the CARD database
  4. Visualization of the ARGs and plasmid genes in their genomic context
    1. Extraction of the ARG and plasmid genes location
    2. Annotation of the contigs
    3. Mapping the raw reads on the contigs
    4. Visualisation of the ARGs
  5. Conclusion

Galaxy and data preparation

Any analysis should get its own Galaxy history. So let’s start by creating a new one and get the data (contig file from the assembly) into it.

Hands-on: Prepare Galaxy and data
  1. Create a new history for this analysis

    Click the new-history icon at the top of the history panel:

    UI for creating new history

  2. Rename the history

    1. Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)
    2. Type the new name
    3. Click on Save

    If you do not have the galaxy-pencil (Edit) next to the history name:

    1. Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
    2. Type the new name
    3. Press Enter

  3. Import the contig file from Zenodo or from Galaxy shared data libraries:

    https://zenodo.org/record/10572227/files/DRR187559_contigs.fasta
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Shared data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

Identification of AMR genes

To identify AMR genes in contigs, tools like ABRicate or staramr (Bharat et al. 2022) can be used. In this tutorial, we use staramr. Staramr scans bacterial genome contigs against the ResFinder (Zankari et al. 2012), PointFinder (Zankari et al. 2017), and PlasmidFinder (Carattoli et al. 2014) databases (used by the ResFinder webservice) and compiles a summary report of detected antimicrobial resistance genes.

Comment

If you are interested in learning more about using ABRicate for AMR gene detection, you can check the sub-section on AMR in “Gene based pathogenic identification” of “Pathogen detection from (direct Nanopore) sequencing data using Galaxy - Foodborne Edition” tutorial

Hands-on: Run staramr
  1. staramr ( Galaxy version 0.10.0+galaxy1) with the following parameters:
    • param-file “genomes”: Contig file

staramr generates 8 outputs:

  1. summary.tsv: A summary of all detected AMR genes/mutations/plasmids/sequence type in each genome, one genome per line and the following columns:

    • Isolate ID: The id of the isolate/genome file(s) passed to staramr.
    • Quality Module: The isolate/genome file(s) pass/fail result(s) for the quality metrics
    • Genotype: The AMR genotype of the isolate.
    • Predicted Phenotype: The predicted AMR phenotype (drug resistances) for the isolate.
    • CGE Predicted Phenotype: The CGE-predicted AMR phenotype (drug resistances) for the isolate (CGE = Center for Genomic Epidemiology)

    • Plasmid: Plasmid types that were found for the isolate.
    • Scheme: The MLST scheme used

      MLST stands for MultiLocus Sequence Typing. It is a technique for the typing of multiple loci, using DNA sequences of internal fragments of multiple housekeeping genes to characterize isolates of microbial species.

      Here, starmr uses mlst to scan the contig files against traditional PubMLST typing schemes. The correspondance between the scheme and the bacteria genus and species is accessible in the map

    • Sequence Type: The sequence type that’s assigned when combining all allele types
    • Genome Length: The isolate/genome file(s) genome length(s)
    • N50 value: The isolate/genome file(s) N50 value(s)
    • Number of Contigs Greater Than Or Equal To 300 bp: The number of contigs greater or equal to 300 base pair in the isolate/genome file(s)
    • Quality Module Feedback: The isolate/genome file(s) detailed feedback for the quality metrics
    Hands-on: Inspect the staramr output
    1. View galaxy-eye the summary.tsv file
    Question: Inspect the staramr output
    1. How many genomes are there?
    2. Has the genome failed or passed the quality? Why?
    3. What are the predicted AMR drug resistances? Are they similar to the one found for KUN1163 in Table 1 in Hikichi et al. 2019?
    4. What is the MLST scheme? Does that correspond to the expected species and genus?
    1. There is one genome (1 line)
    2. The genome has failed the quality (column 2), because the genome length is not within the acceptable length range (last column).
    3. We can summarize starmr output and Table 1 in Hikichi et al. 2019:

      Antibiotic name Abbreviation staramr Hikichi et al. 2019
      Amikacin   Yes  
      Azithromycin   Yes  
      Cefazolin CEZ   Yes
      clindamycin CLDM   Yes
      Erythromycin EM Yes Yes
      Gentamicin GM Yes Yes
      Imipenem IPM   Yes
      Levofloxacin LVFX   Yes
      Oxacillin MPIPC   Yes
      Penicillin   Yes  
      Spectinomycin   Yes  
      Tetracycline   Yes  
      Tobramycin   Yes  
    4. The scheme is saureus, so Staphylococcus aureus (given the scheme genus map), which is coherent with MRSA
  2. detailed_summary.tsv: A detailed summary of all detected AMR genes/mutations/plasmids/sequence type in each genome, one gene per line and the following columns:

    • Isolate ID: The id of the isolate/genome file(s) passed to staramr.
    • Data: The particular gene detected from ResFinder, PlasmidFinder, PointFinder, or the sequence type.
    • Data Type: The type of gene (Resistance or Plasmid), or MLST.
    • Predicted Phenotype: The predicted AMR phenotype (drug resistances) found in ResFinder/PointFinder. Plasmids will be left blank by default.
    • CGE Predicted Phenotype: The CGE-predicted AMR phenotype (drug resistances) found in ResFinder/PointFinder. Plasmids will be left blank by default.
    • %Identity: The % identity of the top BLAST HSP to the gene.
    • %Overlap: THe % overlap of the top BLAST HSP to the gene (calculated as hsp length/total length * 100).
    • HSP Length/Total Length The top BLAST HSP length over the gene total length (nucleotides).
    • Contig: The contig id containing this gene.
    • Start: The start of the gene (will be greater than End if on minus strand).
    • End: The end of the gene.
    • Accession: The accession of the gene from either ResFinder or PlasmidFinder database.
    Hands-on: Inspect the staramr output
    1. View galaxy-eye the detailed_summary.tsv file
    Question: Inspect the staramr output
    1. What are the different sequence types that have been identified?
    2. Where are located the plasmid genes?
    3. Where are located the resistances genes?
    4. By looking for the accession number (M13771) on NCBI (https://www.ncbi.nlm.nih.gov/nuccore/), where does the first resistance come from?
    1. There are:
      • 1 MSLT
      • 5 plasmid rep genes
      • 7 resistance genes
    2. The plasmid genes are on contig00019 (3 genes - coherent with Lozano et al. 2012), contig00024, and contig00002.
    3. The resistance genes are:
      • 4/7 on contigs with plasmid genes (contig00002, contig00019, contig00024)
      • 3/7 on contigs without plasmid genes (contig00022, contig00021)
    4. M13771 is a 6’-aminoglycoside acetyltransferase phosphotransferase (AAC(6’)-APH(2’)) bifunctional resistance protein from Streptococcus faecalis. It might come from horizontal transfer from Streptococcus faecalis on a plasmid of Staphylococcus aureus.
  3. resfinder.tsv: A tabular file of each AMR gene and additional BLAST information from the ResFinder database, one gene per line and the following columns:

    • Isolate ID: The id of the isolate/genome file(s) passed to staramr.
    • Gene: The particular AMR gene detected.
    • Predicted Phenotype: The predicted AMR phenotype (drug resistances) for this gene.
    • CGE Predicted Phenotype: The CGE-predicted AMR phenotype (drug resistances) for this gene.
    • %Identity: The % identity of the top BLAST HSP to the AMR gene.
    • %Overlap: THe % overlap of the top BLAST HSP to the AMR gene (calculated as hsp length/total length * 100).
    • HSP Length/Total Length The top BLAST HSP length over the AMR gene total length (nucleotides).
    • Contig: The contig id containing this AMR gene.
    • Start: The start of the AMR gene (will be greater than End if on minus strand).
    • End: The end of the AMR gene.
    • Accession: The accession of the AMR gene in the ResFinder database.
    • Sequence: The AMR Gene sequence
    • CGE Notes: Any CGE notes associated with the prediction.
  4. plasmidfinder.tsv: A tabular file of each AMR plasmid type and additional BLAST information from the PlasmidFinder database, one plasmid type per line with the following columns:

    • Isolate ID: The id of the isolate/genome file(s) passed to staramr.
    • Plasmid: The particular plasmid type detected.
    • %Identity: The % identity of the top BLAST HSP to the plasmid type.
    • %Overlap: The % overlap of the top BLAST HSP to the plasmid type (calculated as hsp length/total length * 100).
    • HSP Length/Total Length The top BLAST HSP length over the plasmid type total length (nucleotides).
    • Contig: The contig id containing this plasmid type.
    • Start: The start of the plasmid type (will be greater than End if on minus strand).
    • End: The end of the plasmid type.
    • Accession: The accession of the plasmid type in the PlasmidFinder database.
  5. mlst.tsv: A tabular file of each multi-locus sequence type (MLST) and it’s corresponding locus/alleles, one genome per line with the following columns:

    • Isolate ID: The id of the isolate/genome file(s) passed to staramr.
    • Scheme: The scheme that MLST has identified.
    • Sequence Type: The sequence type that’s assigned when combining all allele types
    • Locus #: A particular locus in the specified MLST scheme.
  6. settings.txt: The command-line, database versions, and other settings used to run staramr.
  7. results.xlsx: An Excel spreadsheet containing the previous files as separate worksheets.
  8. hits: A collection with fasta files of the BLAST HSP nucleotides for the entries listed in the resfinder.tsv and pointfinder.tsv files. There are up to two files per input genome, one for ResFinder and one for PointFinder.

Getting more information about ARG via the CARD database

To get more information about the antibiotic resistant genes (ARG), we can check the CARD database (Comprehensive Antibiotic Resistance Database) (Jia et al. 2016)

CARD can be very helpful to check all the resistance genes and check if it is logical to find the resistance gene in a specific bacteria.

Question
  1. To what family does mecA belong?
  2. Do you expect to find this gene in this MRSA strain and why?
  3. Is the accession number of the entry related to the accession reported by staramr?
  1. Methicillin resistant PBP2
  2. The strain we use is a Methicillin(multi) resistant Staphylococcus aureus. As mecA has a perfect resistome mach with S. aureus, and the AMR Gene Family is methicillin resistant PBP2, we expect to see mecA in MRSA.
  3. No, these are completely unrelated. Unfortunately this is a very common issue in bioinformatics. Everyone builds their own numbering system for entries in their database (usually calling them ‘accessions’), and then someone else needs to build a service to link these databases.

Visualization of the ARGs and plasmid genes in their genomic context

We would like to look at the ARGs and plasmid genes in their genomic context. To do that, we will usie JBrowse (Diesh et al. 2023) with several information:

  1. Assembly as the reference
  2. ARGs location
  3. Contigs annotation (genes, etc)
  4. Coverage of the contigs from the raw reads

Extraction of the ARG and plasmid genes location

The first step is to extract the location of the ARGs and plasmid genes on the contigs. This information is available on the detailed_summary.tsv output of starmar. The genes and their location are on the lines with a decimal value on column 6 or 7. So to only get this information, we need to select lines with a decimal value (###.##) followed by a tab character, the column separator in Galaxy. As a result, any lines without an identity or overlap value will be filtered out.

Hands-on: Extract ARGs and plasmid genes
  1. Select lines that match an expression with the following parameters:
    • param-file “Select lines from”: detailed_summary.tsv staramr output
    • “that”: Matching
    • “the pattern”: [0-9]+\.[0-9]+\t
Question

How many genes have been kept?

12

This table can not be used directly in JBrowse. It first needs to be transformed in a standard format: GFF3, a file format used for describing genes and other features of DNA, RNA and protein sequences.

Comment: GFF3 file format

A GFF is a tab delimited file with 9 fields per line:

  1. seqid: The name of the sequence where the feature is located.
  2. source: The algorithm or procedure that generated the feature. This is typically the name of a software or database.
  3. type: The feature type name, like “gene” or “exon”. In a well structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent “transcript” feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project.
  4. start: Genomic start of the feature, with a 1-base offset. This is in contrast with other 0-offset half-open sequence formats, like BED.
  5. end: Genomic end of the feature, with a 1-base offset. This is the same end coordinate as it is in 0-offset half-open sequence formats, like BED.
  6. score: Numeric value that generally indicates the confidence of the source in the annotated feature. A value of “.” (a dot) is used to define a null value.
  7. strand: Single character that indicates the strand of the feature. This can be “+” (positive, or 5’->3’), “-“, (negative, or 3’->5’), “.” (undetermined), or “?” for features with relevant but unknown strands.
  8. phase: phase of CDS features; it can be either one of 0, 1, 2 (for CDS features) or “.” (for everything else). See the section below for a detailed explanation.
  9. attributes: A list of tag-value pairs separated by a semicolon with additional information about the feature.
Hands-on: Create a GFF file
  1. Table to GFF3 ( Galaxy version 1.2)
    • param-file “Table”: output of the above Select lines tool step
    • “Record ID column or value”: 9
    • “Start column or value”: 10
    • “End column or value”: 11
    • “Type column or value”: 3
    • “Score column or value”: 6
    • “Source column or value”: 3
    • param-repeat “Insert Qualifiers”
      • “Name”: name
      • “Qualifier value column or raw text”: 2
    • param-repeat “Insert Qualifiers”
      • “Name”: phenotype
      • “Qualifier value column or raw text”: 4
    • param-repeat “Insert Qualifiers”
      • “Name”: accession
      • “Qualifier value column or raw text”: 12
Question

How many genes have been found per contigs?

  • Contig 19: 3 plasmid genes, 1 ARG
  • Contig 24: 1 plasmid gene, 2 ARGs
  • Contig 2: 1 plasmid gene, 1 ARG
  • Contig 22: 2 ARGs
  • Contig 21: 1 ARG

Annotation of the contigs

In addition to the ARGs and plasmid genes, it would be good to have extra information about other genes on the contigs. Several tools exists to do that: Prokka (Seemann 2014), Bakta (Schwengers et al. 2021), etc. Here, we use Bakta as recommended by Torsten Seeman as the successor of Prokka.

Hands-on: Choose Your Own Tutorial

This is a "Choose Your Own Tutorial" section, where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial

Have you already run Bakta on this dataset before?

Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs.

Hands-on: Annotate the contigs
  1. Bakta ( Galaxy version 1.8.2+galaxy0) with the following parameters:
    • In “Input/Output options”:
      • “The bakta database”: latest one
      • “The amrfinderplus database”: latest one
      • param-file “Select genome in fasta format”: Contig file
    • In “Optional annotation”:
      • “Keep original contig header (–keep-contig-headers)”: Yes
Question

How many features have been identified?

51k+ (number of lines in the GFF)

If you already ran Bakta on your contigs, you can just upload here the GFF generated by Bakta.

Hands-on: Import annotation GFF
  1. Import the GFF file generated by Bakta

Bakta - circular plot of the annotation, with the contigs on the outside ordered clockwise by length size.

Mapping the raw reads on the contigs

To get information about the coverage of the contigs and genes, we map the reads used to build the contigs to the contigs using Bowtie2 (Langmead et al. 2009).

Hands-on: Annotating the Genome
  1. Import the reads (after quality control) from another history, from Zenodo or from Galaxy shared data libraries:

    https://zenodo.org/record/10572227/files/DRR187559_after_fastp_1.fastq.gz
    https://zenodo.org/record/10572227/files/DRR187559_after_fastp_2.fastq.gz
    
  2. Bowtie2 ( Galaxy version 2.5.0+galaxy0) with the following parameters:

    • “Is this single or paired library”: Paired-end
      • param-file “FASTA/Q file #1”: DRR187559_after_fastp_1.fastq.gz
      • param-file “FASTA/Q file #2”: DRR187559_after_fastp_2.fastq.gz
    • “Will you select a reference genome from your history or use a built-in index?”: Use a genome from the history and build index
      • param-file “Select reference genome”: Contig file
    • “Save the bowtie2 mapping statistics to the history”: Yes
Question

What is the alignment rate? What does that mean for the assembly?

The alignment rate is 99.77%. Most of the reads have been mapped on the contigs. They have been then used to build the contigs.

Visualisation of the ARGs

We can now visualize the contigs, the mapping coverage, and the genes, using JBrowse and different information track.

Hands-on: Visualize the Genome
  1. JBrowse ( Galaxy version 1.16.11+galaxy1) with the following parameters:
    • “Reference genome to display”: Use a genome from history
      • param-file “Select the reference genome”: Contig file
    • “Genetic Code”: 11. The Bacterial, Archaeal and Plant Plastid Code
    • In “Track Group”:
      • param-repeat “Insert Track Group”
        • “Track Category”: Bakta
        • In “Annotation Track”:
          • param-repeat “Insert Annotation Track”
            • “Track Type”: GFF/GFF3/BED Features
              • param-file “GFF/GFF3/BED Track Data”: Annotation_and_sequences (output of Bakta tool)
              • “JBrowse Track Type [Advanced]”: Neat Canvas Features
              • “Track Visibility”: On for new users
      • param-repeat “Insert Track Group”
        • “Track Category”: ARGs and plasmid genes
        • In “Annotation Track”:
          • param-repeat “Insert Annotation Track”
            • “Track Type”: GFF/GFF3/BED Features
              • param-file “GFF/GFF3/BED Track Data”: Output of Table to GFF3
              • “JBrowse Track Type [Advanced]”: Neat Canvas Features
              • “Track Visibility”: On for new users
      • param-repeat “Insert Track Group”
        • “Track Category”: Coverage
        • In “Annotation Track”:
          • param-repeat “Insert Annotation Track”
            • “Track Type”: BAM Pileups
              • param-file “BAM Track Data”: Bowtie2’s output
              • “Autogenerate SNP Track”: Yes
  2. View the output of JBrowse

In the output of the JBrowse you can view the mapped reads and the found genes against the reference genome. With the search tools you can easily find genes of interest. JBrowse can handle many inputs and can be very useful. Using the Bowtie2 mapping output, low coverage regions can be detected. This SNP detection can also give a clear view of where the data was less reliable or where variations were located.

If it takes too long to build the JBrowse instance, you can view an embedded one here. (Warning: feature name search will not work.)

Open JBrowse in a new tab

Question
  1. What is the name for the gene found by Bakta corresponding to the rep16 gene found by staramr?
  2. What are the NCBI Protein id and UniRef id for aac(6’)-aph(2’’)?
  1. By zooming at the beginning of the contig, the name for rep16 is DUF536 domain-containing protein
  2. aac(6’)-aph(2’’) is on contig00019:20311..21749. The corresponding gene found by Bakta is bifunctional aminoglycoside N-acetyltransferase AAC(6’)-Ie/aminoglycoside O-phosphotransferase APH(2’’)-Ia. When clicking on it, we can find the dbxref attributes:
    • NCBI Protein: WP_001028144.1
    • UniRef: UniRef50_P0A0C2

Conclusion

In this tutorial, contigs were scaned for AMR and plasmid genes. The genes were then visualized in their genomic context after contig annotation.