Deciphering Virus Populations - Single Nucleotide Variants (SNVs) and Specificities in Baculovirus Isolates

Author(s)	Jörg Wennmann
Reviewers

Overview
Questions:

How can the genetic variability of a baculovirus isolate be analysed?

How can single nucleotide variants (SNVs) be identified across multiple isolates?

How can SNVs be used as markers to analyse the composition of baculovirus isolates?

Objectives:

Understand the significance of SNVs.

Determine variable SNV positions for multiple isolates using a common reference genome.

Transform the output (VCF file) to a readable table format.

Interpret the SNV data to analyse the intra-isolate variability of baculoviruses.

Requirements:

Introduction to Galaxy Analyses

slides Slides: Quality Control

tutorial Hands-on: Quality Control

slides Slides: Mapping

tutorial Hands-on: Mapping

Time estimation: 3 hours

Level: Intermediate Intermediate

Supporting Materials:

Workflows

FAQs

instances Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

UseGalaxy.fr ✅ ⭐️

UseGalaxy.org (Main) ✅ ⭐️

UseGalaxy.org.au ✅ ⭐️

Possibly Working

UseGalaxy.cz

Published: Feb 24, 2025

Last modification: Feb 24, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00528

rating Rating: 5.0 (0 recent ratings, 2 all time)

version Revision: 1

Baculoviruses of the family Baculoviridae (Harrison et al. 2018) are among the most intensively studied viruses, not only because of their widespread application in biotechnology as protein expression systems, medicine, and as biological insecticides (Rodríguez-Hernández et al. 2023, Cox et al. 2021, Grzywacz and Moore 2017, Eberle et al. 2012, van Oers 2011). Like all members of the class Naldaviricetes, baculoviruses have large circular dsDNA genomes that, in the family Baculoviridae, range from 90 to 180 kbp and can encode up to 180 open reading frames (ORFs). A detailed description of the biology of the Baculoviridae family can be found in the official report of the International Committee on Taxonomy of Viruses (ICTV) (Harrison et al. 2018). A detailed summary can also be found on the website of the ICTV.

Comment: A Baculovirus Genome

The genome of the isolate Autographa californica multiple nucleopolyhedrovirus isolate C6 (AcMNPV-C6) (family Baculoviridae, genus Alphabaculovirus, species Alphabaculovirus aucalifornicae) is one of the best-studied baculovirus genomes and is 133,894 bp long (Ayres et al. 1994). It is the first fully sequenced genome of baculoviruses and today hundreds of genomes are fully sequenced and publically available at NCBI Genbank. The genome of AcMNPV-C6 encodes for 154 open reading frames (ORF) in both forward and reverse orientation (see figure). So far, 38 ORFs were identified, which are encoded by all baculoviruses sequenced to date and are called baculovirus core genes. In addition to the core genes, genes were found that are encoded by baculoviruses from the genera Alphabaculovirus and Betabaculovirus.

Open image in new tab

Figure 1: Figure: Schematic representation of the AcMNPV-C6 genome. ORFs and their orientation are indicated by arrows. (Figure from on Taxonomy of Viruses (ICTV))

The genome size makes it difficult to study genetic variation within a baculovirus isolate, especially since the most commonly used sequencing technique (Illumina sequencing) generates only short reads and requires genome assembly. Since an isolate represents a population of viruses and their genomes that differ genetically from one another, the complexity of the analysis is further increased. Typically, genome assemblies are performed that lead to the generation of a consensus sequence, which usually reflects the majority of the sequenced virus population and intends to represent the entire isolate. The problem is that the consensus sequence can mask genetic variability. Tools for haplotype-sensitive assembly are available but so far have been established for viruses with a relatively short genome but not for baculoviruses or other large dsDNA viruses. Using Nanopore sequencing, it is possible to sequence significiant fragments of baculovirus genomes to determine major haplotypes (Wennmann et al. 2024).

SNVs have been proven as a powerful tool for analyzing the genetic variability of sequenced baculovirus isolates (Fan et al. 2020). SNVs are particularly useful for the analysis of intra-isolate specific variation, as many bioinformatic workflows and tools are established for SNV determination and processing. For the identification of SNVs, Illumina sequencing data is mostly used, as this sequencing technique provides highly accurate reads with a very low probability of sequencing error. The data is usually provided and stored in FASTQ format.

When a baculovirus isolate is sequenced, a dataset of sequence reads is obtained that consists of a multitude of fragmented genomes and the exact reconstruction of individual whole genomes is no longer possible. However, variable SNV positions can be used as markers to count the frequency of the nucleotides that occur in the SNV positions. The determination of the nucleotides is only possible because not one but many genomes were sequenced. Therefore, the SNVs reflect the genetic variability of the isolate itself. Although deletions and insertions can also occur, these are not covered in this tutorial.

When multiple baculovirus isolates have been sequenced, SNV positions can be determined that are specific to only certain isolates. These specific SNV markers can then be used to identify an isolate based on sequencing data only, even when the isolate is present in a mixture with other isolates. To enable this identification of baculovirus isolates, SNV positions must be determined across multiple isolates using a common reference sequence. This reference serves as an anchor for the detected SNV positions and must be chosen with care. Based on the reference and the SNVs, SNV specificities can be determined for individual isolates.

The tutorial presented here aims to explain…

the criteria for a suitable reference genome,
a workflow that can be used to call SNVs across multiple sequenced baculovirus isolates,
what is meant by SNV specificities and
how to determine the composition of an isolate based on other previously sequenced baculovirus isolates.

The workflow is based on sequencing data from the Cydia pomonella granulovirus (CpGV) (family Baculoviridae, genus Betabaculovirus). CpGV is used commercially in agriculture for the biological control of larvae of the codling moth (Cydia pomonella), which cause significant losses to fruit crops, such as apples and pears.

Agenda

In this tutorial, we will cover:

Dataset Preparation

Reference Genome from NCBI Genbank

Paired-end Sequencing Data from NCBI SRA

Read Preprocessing and Quality Control

Mapping of Reads to the Reference Genome

SNP Calling and Variant Detection

VCF to Table Transformation

Replace SRA Names with Virus Abbreviations

Reduce Complexity of SNV Table to First Alternative

Visualizing SNV Variability Across Isolates

SNV Specificity Determination

Visualization of Variable SNV Positions by Specificity

Determining the Potential Components of a Mixed Isolate

Conclusion

Dataset Preparation

Let’s start at the very beginning with a new history and a detailed description of the sequencing datasets as well as the reference sequence. We will also have a closer look at the Illumina sequencing datasets, what we know so far about them and what we can expect from the analysis. By doing this, we will work our way towards our goal step by step.

Hands On: Prepare the Galaxy history

Create a new history for this analysis

To create a new history simply click the new-history icon at the top of the history panel:

Rename the history

Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)

Type the new name

Click on Save

To cancel renaming, click the galaxy-undo “Cancel” button

If you do not have the galaxy-pencil (Edit) next to the history name (which can be the case if you are using an older version of Galaxy) do the following:

Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel

Type the new name

Press Enter

Reference Genome from NCBI Genbank

Selecting the right or most suitable reference sequence is important if not critical for the entire analysis. The reference genome used in this tutorial is the Cydia pomonella granulovirus (Mexican isolate, CpGV-M). The function of the reference is to act as an anchor point to link detected SNV positions in all sequenced and analyzed isolates. Selecting a correct reference genome is critical because it must be as closely related as possible to the sequenced isolates. Therefore, it is best to use a reference from the same species that has been well analyzed by previous studies or the scientific community. The following step will download the NCBI Accession Number KM217575, which corresponds to the Cydia pomonella granulovirus isolate CpGV-M, complete genome. It will later serve as the reference genome for the detection of SNV positions.

Hands On: Reference genome download

NCBI Accession Download ( Galaxy version 0.2.8+galaxy0) with the following parameters:

“Select source for IDs”: Direct Entry

“ID List”: KM217575

“Molecule Type”: Nucleotide

“File Format”: FASTA

Click Run Tool

Warning: Use a single reference genome only!

The NCBI Accession Download tool accepts multiple NCBI accession numbers as input, which should be avoided, otherwise the workflow will fail. If you still want to use multiple reference genomes, then I recommend to run the entire workflow for each reference genome separately.

Collapse Collection ( Galaxy version 5.1.0) with the following parameters:

param-file “Collection of files to collapse into single dataset”: NCBI Accession Download on: Downloaded files (output of NCBI Accession Download tool)

Click Run Tool

Comment: Collapse Genbank files

NCBI Accession Download accepts multiple accession numbers and creates a list as output. Therefore, its output can contain multiple records. For this analysis just one reference genome is required and the list needs to be converted into a single file.

Replace Text in entire line ( Galaxy version 9.3+galaxy1) with the following parameters:

param-file “File to process”: the collapsed collection containing the single reference genome; output of Collapse Collection tool

In “Replacement”:

“Find pattern”: ^>.*$

“Replace with:”: >CpGV-M

Click Run Tool

Comment: What the tool does...

The find pattern ^>.*$ searches for the > and all characters that follow. Then everything (the entire line) is replaced by >CpGV-M.

Comment: Why we change the name...

The reference genome is stored in a single file in FASTA format. After downloading from NCBI, the name of the reference is >KM217575.1 Cydia pomonella granulovirus isolate CpGV-M, complete genome. The name is very long and can cause issues during later analysis (in R or Python). Therefore, I recommend to replace the name with the Replace text in entire line tool.

Paired-end Sequencing Data from NCBI SRA

The tutorial is based on Illumina data sets from several isolates of the Cydia pomonella granulovirus (CpGV). The CpGV is one of the most well-studied baculoviruses because many isolates have been sequenced and sequence data sets are available at NCBI Genbank and NCBI SRA. We will use four CpGV isolates to decipher their intra-isolate specific variation and try to draw conclusions about the isolate’s homogeneity or heterogeneity. We will also encounter a mixed isolate (a more or less clean mixture of other previously sequenced CpGV isolates) and learn how to recognize it. In addition, we will learn how to determine the composition of this mixed isolate based on other sequenced CpGV isolates.

Comment: Detailed CpGV Isolate Information

For those who want to delve deeper into the data sets, I have provided a table with links to NCBI SRA and related publications.

Isolate NCBI Genbank NCBI SRA Reference

CpGV-M KM217575 SRR31589148 Wennmann et al. 2020

CpGV-S KM217573 SRR31589147 Wennmann et al. 2020

CpGV-E2 KM217577 SRR31589146 Gueli Alletti et al. 2017

CpGV-V15 No assembly available SRR31679023 Fan et al. 2020

Isolate	NCBI Genbank	NCBI SRA	Reference
CpGV-M	KM217575	SRR31589148	Wennmann et al. 2020
CpGV-S	KM217573	SRR31589147	Wennmann et al. 2020
CpGV-E2	KM217577	SRR31589146	Gueli Alletti et al. 2017
CpGV-V15	No assembly available	SRR31679023	Fan et al. 2020

Follow the steps below to download the four Illumina datasets published at NCBI SRA.

Hands On: NCBI SRA Data Download

Run Download and Extract Reads in FASTQ format from NCBI SRA ( Galaxy version 3.1.1+galaxy1) with the following parameters:

“Select input type”: SRR accession

Accession: SRR31589146, SRR31589147, SRR31589148, SRR31679023

Select output format: gzip compressed fastq

Click Run Tool

Comment: Take a cup of coffee

Downloading the data from NCBI SRA may take a while. It is a good opportunity to have a cup of coffee or tea.

Read Preprocessing and Quality Control

After the reference genome and the paired-end Illumina reads are loaded in the history, it is time to start the analysis. First, the reads need to be adapter trimmed and quality filtered.

Before and after trimming, it is usually a good idea to check the quality of the reads using FASTQC. However, we will omit this step here. If you are interested in learning more about read quality, I recommend the check the Galaxy Training Quality Control tutorial.

Hands On: Task description

Trim Galore! ( Galaxy version 0.6.7+galaxy0) with the following parameters:

“Is this library paired- or single-end?”: Paired Collection

param-collection “Select a paired collection”: Paired-end data (fastq-dump) (output of Download and Extract Reads in FASTQ format from NCBI SRA tool)

“Advanced settings”: Full parameter list

“Trim low-quality ends from reads in addition to adapter removal (Enter phred quality score threshold)”: 30

“Discard reads that became shorter than length N”: 50

“specify if you would like to retain unpaired reads”: Do not output unpaired reads

“RRBS specific settings”: Use defaults (no RRBS)

“Trimming settings”: Use defaults

Click Run Tool

Mapping of Reads to the Reference Genome

Now the filtered reads of the individual isolates are mapped independently against the CpGV-M reference. The key is that each isolate is mapped separately against the identical reference genome, so that in the end we get a Binary Alignment Mapping (BAM) file for each isolate. After that, everything we do with the sequence data is linked to the common reference.

Hands On: Read mapping using bcftools

Map with BWA-MEM ( Galaxy version 0.7.18) with the following parameters:

“Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index

param-file “Use the following dataset as the reference sequence”: the renamed reference genome in FASTA format, output of Replace Text in entire line tool

“Single or Paired-end reads”: Paired Collection

param-file “Select a paired collection”: the trimmed paired read collection, generated as output of Trim Galore! tool

“Set read groups information?”: Set read groups (Picard style)

“Auto-assign”: Yes

“Auto-assign”: Yes

“Auto-assign”: Yes

“Select analysis mode”: 1.Simple Illumina mode

Click Run Tool

SNP Calling and Variant Detection

The key part of the analysis is the determination of variable SNV positions, which is carried out in the following step. Insertions/Deletions (indels) are deliberately omitted because they are not relevant for this analysis.

Hands On: Variable SNV positions

bcftools mpileup ( Galaxy version 1.15.1+galaxy4) with the following parameters:

“Alignment Inputs”: Multiple BAM/CRAMs

param-file “Input BAM/CRAMs”: the BAM file collection; output of Map with BWA-MEM tool

“Choose the source for the reference genome”: History

param-file “Genome Reference”: the renamed reference genome in FASTA format, output of Replace Text in entire line tool

In “Indel Calling”:

“Perform INDEL calling”: Do not perform INDEL calling

In “Input Filtering Options”:

“Max reads per BAM”: 1024

“Set filter by flags”: Do not filter

“Quality Options”: defaults

“Select read groups to include or exclude”: use defaults

In “Restrict to”:

“Regions”: Do not restrict to Regions

“Targets”: Do not restrict to Targets

In “Output options”:

“Optional tags to output”: DP (Number of high-quality bases)

“Optional tags to output”: DPR (Number of high-quality bases for each observed allele)

“Output type”: uncompressed VCF

Click Run Tool

Comment: What bcftools mpileup does...

bcftools mpileup calculates the nucleotide coverage for each position in the reference genome based on the input BAM files. It enables preparation for variant analysis.

bcftools call ( Galaxy version 1.15.1+galaxy5) with the following parameters:

param-file “VCF/BCF Data”: the uncompressed VCF file; output of bcftools mpileup tool

In “Restrict to”:

“Regions”: Do not restrict to Regions

In “Consensus/variant calling Options”:

“Calling method”: Multiallelic and rare-variant caller

“Constrain”: Do not constrain

“Targets”: Do not restrict to Targets

In “File format Options”:

“Regions”: Do not restrict to Regions

In “Input/output Options”:

“Keep alts”: Yes

“Output variant sites only”: Yes

“Output type”: uncompressed VCF

Click Run Tool

Comment: What bcftools call does...

bcftools call performs the actual variant detection. It removes non variant and keeps variant sites only.

What you get is a file in Variant Call Format (VCF), which can be difficult to understand at first glance. It is important to note that the file begins with a large header, lines that begin with two hashtags (##), which contain details of the analysis and the data set. After the header, there is a table with a line that contains the column names. This line begins with just one hashtag (#). After that, the results are displayed in a tab-separated format. In our VCF example, some columns have the following abbreviations and meanings (I just selected a few that are important to us in this tutorial):

Fixed fields	Description
CHROM	Chromosome (name of the reference, in our case CpGV-M)
POS	Position (position in the reference genome)
REF	Nucleotide of the Referance at the corresponding position (POS)
ALT	Alternative nucleotides detected at this position in the sequencing data.
FORMAT	Describes the content of the sample column reported in the VCF file. In our case GT:PL:DP:DPR.
SRR31589146	1st isolate: The results for the sequence data SRR31589146 in the format specified in FORMAT.
SRR31589147	2nd isolate: The results for the sequence data SRR31589147 in the format specified in FORMAT.
SRR31589148	3rd isolate: The results for the sequence data SRR31589148 in the format specified in FORMAT.
SRR31679023	4th isolate: The results for the sequence data SRR31679023 in the format specified in FORMAT.

To understand how the data is stored, we have to look at FORMAT in detail. This is where two values are of great importance: DP and DPR.

Genotype field	Description
DP	Read depth (number of nucleotides) at this position for this isolate.
DPR	The read depth for each allele. Here, the first value corresponds to the reference nucleotide (REF), the second to the first possible allele (ALT1), the second to the second possible allele (ALT2) and the third value to the last possible allele (ALT3).

Comment: DPR functional, but deprecated

If you start bcftools mpileup with the Optional tags to output DPR (Number of high-quality bases for each observed allele) option, you will see a warning in the information panel of the VCF file: [warning] tag DPR functional but deprecated. Please switch to AD in future. At the time of writing this tutorial, I was using DPR. It works the same way with Optional tags to output option AD (instead of DPR), but you will need to adjust something later in the workflow. The tutorial will be switched to AD in the near future. For now, we stick with DPR.

Also take a look at other tutorials that deal with non-diploid (but also diploid) data sets to perform a variant analysis. Broadening your horizons is always important. I recommend the check the tutorials on …

Calling variants in diploid systems and.

Calling variants in non-diploid systems.

Maybe later? Then let’s continue.

VCF to Table Transformation

Now we come to an exciting part, because we have all the information we need to analyse the genetic variation within the sequenced virus isolates. The data is only hidden in the VCF file and is difficult for the beginner in bioinformatics to see. We have the positions (POS), which were detected as variable in the virus isolates. In addition, we know the number of all reads (and thus also nucleotides) in these positions (represented by DP). By using DPR, we obtain information on how often the alleles (the four possible nucleotides) occur at a particular position. To analyse DP and DPR, we first have to access it because the information is hidden in each sample column in the FORMAT genotype data. If we look at the first position POS = 246 in sample column SRR31589146, the following data is visible:
1/1:255,134,0,255,255,255,255,255,255,255:886:107,773,6,0.
The information provided by the FORMAT field explains the division of the data by colons: GT:PL:DP:DPR.

We can break it down like this…

FORMAT genotype field	Value
GT	1/1
PL	255,134,0,255,255,255,255,255,255,255
DP	886
DPR	107,773,6,0

GT = Genotype information, which cannot be used with virus isolates!
PL = Phred-scaled genotype likelihood (also not useable for us, because we do not have a diploid organism!)

DPR can be broken down even further, since the individual values, which are separated by commas this time, can be assigned to the reference or the alternative nucleotides. In a virus isolate, four nucleotides (A, T, G and C) can theoretically occur at each position. One nucleotide defines the reference, leaving three alternatives: first (ALT1), second (ALT2) and third (ALT3) alternative.

CHROM	POS	DP	DPR	ALT1	ALT2	ALT3
CpGV-M	246	886	107,773,6,0	773	6	0

If we now divide the absolute frequencies of ALT1 = 773, ALT2 = 6 and ALT3 = 0 by DP = 886, we get the relative frequencies (REL.ALT):

ALLELE	REL.ALT
ALT1	0.87246
ALT2	0.00677201
ALT3	0

Hands On: Transfrom VCF to tab-delimited table
VCFtoTab-delimited: ( Galaxy version 1.0.0_rc1+galaxy0) with the following parameters:

param-file “Select VCF dataset to convert”: the filtered VCF file; output of bcftools call tool

“Fill empty fields with”: NULL

Click Run Tool

Comment: An easier-to-read table

This tool creates a table from the VCF file. The columns are tab-delimited. Take a look at the table and see if it is now easier to read. It can also be imported to R/RStudio/Excel more easily.
Text reformatting with awk ( Galaxy version 9.3+galaxy1) with the following parameters:
param-file “File to process”: the tab-delimited VCF table; output of VCFtoTab-delimited tool
“AWK Program”: Paste the code from the code box below.
Code In: awk
BEGIN { FS="\t"; OFS="\t" }
NR == 1 {

    # Rename "SAMPLE" to "ISOLATE" in the header
    for (i = 1; i <= NF; i++) {
        if ($i == "SAMPLE") $i = "ISOLATE";
    }
    # Identify column indices based on header names
    for (i = 1; i <= NF; i++) {
        if ($i == "POS") pos_col = i;
        if ($i == "ISOLATE") isolate_col = i;
        if ($i == "DP") dp_col = i;
        if ($i == "DPR") dpr_col = i;
    }
    # Check if all required columns are found
    if (!(pos_col && isolate_col && dp_col && dpr_col)) {
        print "Error: One or more required columns (POS, ISOLATE, DP, DPR) are missing." > "/dev/stderr";
        exit 1;
    }
    # Print the new header
    print $0, "ALLELE", "DPR.ALLELE", "REL.ALT", "REL.ALT.0.05";
    next;
}
{
    # Create a unique key for position and isolate
    pos_isolate_key = $(pos_col) "_" $(isolate_col);

    if (last_pos != $(pos_col)) {  # When a new position starts
        delete isolate_count;      # Reset the isolate counter
        last_pos = $(pos_col);     # Update the current position
    }

    # Split the DPR column into values
    split($(dpr_col), dpr_values, ",");

    # Count occurrences of each isolate per position
    isolate_count[$(isolate_col)]++;
    dpr_index = isolate_count[$(isolate_col)] + 1;  # Index for the alternate allele (starting at 2)

    # Check if the index is valid
    if (dpr_index <= length(dpr_values)) {
        allele = "ALT" (dpr_index - 1);            # Determine the allele
        dpr_value = dpr_values[dpr_index];         # Get the corresponding DPR value
        rel_alt = (dpr_value / $(dp_col));         # Calculate REL.ALT (DPR.ALLELE / DP)
        rel_alt_filtered = (rel_alt >= 0.05) ? rel_alt : 0;  # Filter REL.ALT values < 0.05
        print $0, allele, dpr_value, rel_alt, rel_alt_filtered;  # Output with new columns
    }
}
Click Run Tool

Comment: Create ALLELE and REL.ALT columns

The Text reformatting with awk tool allows the use of AWK code. AWK is for text processing and to performing complex tasks on data sets (such as tab-delimited files). We will come across AWK scripts again later. Here is a brief summary of what it does:

Splitting DPR data into separate allele counts.

Calculating the relative allele frequencies (REL.ALT) for each allele.

Filtering out relative frequencies below a threshold of 0.05 and storing the result in REL.ALT.0.05.

The output table is complex and shows the relative frequency (REL.ALT) for each of the three alternative alleles/nucleotides (ALT1, ALT2 and ALT3) in each position (POS) and sequenced CpGV isolate (ISOLATE). In column REL.ALT.0.05, values of the REL.ALT < 0.05 were set to 0 to set a threshold. We will see later why this is sometimes important.

Below is the table with selected relevant columns only. REL and ALT show the reference and alternative nucleotide, respectively.

CHROM	POS	REF	ALT	ISOLATE	DP	DPR	ALLELE	DPR.ALLELE	REL.ALT	REL.ALT.0.05
CpGV-M	246	C	T	SRR31589146	886	107,773,6,0	ALT1	773	0.87246	0.873446
CpGV-M	246	C	G	SRR31589146	886	107,773,6,0	ALT2	6	0.00677201	0
CpGV-M	246	C	A	SRR31589146	886	107,773,6,0	ALT3	0	0	0
CpGV-M	246	C	T	SRR31589147	881	5,875,1,0	ALT1	875	0.99319	0.994305
CpGV-M	246	C	G	SRR31589147	881	5,875,1,0	ALT2	1	0.00113507	0
CpGV-M	246	C	A	SRR31589147	881	5,875,1,0	ALT3	0	0	0
CpGV-M	246	C	T	SRR31589148	935	800,133,1,1	ALT1	133	0.142246	0.142398
CpGV-M	246	C	G	SRR31589148	935	800,133,1,1	ALT2	1	0.00106952	0
CpGV-M	246	C	A	SRR31589148	935	800,133,1,1	ALT3	1	0.00106952	0
CpGV-M	246	C	T	SRR31679023	839	42,797,0,0	ALT1	797	0.94994	0.950296
CpGV-M	246	C	G	SRR31679023	839	42,797,0,0	ALT2	0	0	0
CpGV-M	246	C	A	SRR31679023	839	42,797,0,0	ALT3	0	0	0

Replace SRA Names with Virus Abbreviations

One thing that stands out are the NCBI SRA numbers in the ISOLATE column, which have been automatically extracted from the NCBI SRA datasets. Since it is difficult to remember which virus isolate is behind which number, we can replace these accession numbers with proper names, e.g. virus isolate abbreviations. This makes the table even easier to read and later we can use the information directly to display the SNV positions.

Hands On: Replace SRA accession numbers by virus isolate abbreviations

Replace Text in a specific column ( Galaxy version 9.3+galaxy1) with the following parameters:

param-file “File to process”: the reformatted VCF table; output of Text reformatting with awk: tool

In “Replacement”:

param-repeat “Insert Replacement”

“in column”: c25

“Find pattern”: SRR31589148

“Replace with”: CpGV-M

param-repeat “Insert Replacement”

“in column”: c25

“Find pattern”: SRR31589147

“Replace with”: CpGV-S

param-repeat “Insert Replacement”

“in column”: c25

“Find pattern”: SRR31589146

“Replace with”: CpGV-E2

param-repeat “Insert Replacement”

“in column”: c25

“Find pattern”: SRR31679023

“Replace with”: CpGV-V15

Click Run Tool

Question: What is replaced by what?

Can you say which SRA number was replaced by which isolate abbreviation?

SRR31589148 was replaced by CpGV-M

SRR31589147 was replaced by CpGV-S

SRR31589146 was replaced by CpGV-E2

SRR31679023 was replaced by CpGV-V15

Reduce Complexity of SNV Table to First Alternative

Based on the SNV table, we can see that three possible nucleotides (alleles) occur in most positions. After filtering with a minimum threshold of REL.ALT > 0.05, many relative frequencies of ALT2 and ALT3 are set to zero. At least it is visible that in the vast majority of positions only the first alternative has a value of > 0.05. Thus, two nucleotides usually occur in the detected SNV positions: the reference nucleotide (REF) and the nucleotide of the first alternative (ALT1). We want to reduce the complexity of the data set and only consider ALT1 in the further course of this tutorial.

Hands On: Keep only ALT1 and remove ALT2 and ALT3

Filter data on any column using simple expressions with the following parameters:

param-file “Filter”: the VCF table with the replaced isolate abbreviations; output of Replace Text in a specific column tool

“With following condition”: c30=='ALT1'

“Numbers of header ines to skip”: 1

Click Run Tool

Question: How can you check whether only two nucleotides occur (mostly) per SNV position?

Think about an option how the occurance of one, two, three or four nucleotides per SNV position can be analyzed.

When looking only at a table with several columns and numbers, it is difficult to understand that mainly a reference and one alternative nucleotide were detected in the SNV positions. One way to examine the frequency of a second or third alternative nucleotide is to filter the table with c30=='ALT2' or c30=='ALT3' instead of c30=='ALT1'. The resulting table can be used to create a distribution over the REL.ALT values in column 32 (c32).

You can perform this analysis by using the following tools: Filter data on any column using simple expressions tool, Cut columns from a table tool to extract column 32 (c32), which corresponds to REL.ALT and Histogram with ggplot2 tool.

This task is left to you as a small challenge, if you like. It is up to you to familiarise yourself with the individual tools, but it could be worth it, as it could help you understand why we do not consider ALT2 and ALT3 further in the next steps.

Warning: Be careful when removing data!

The data sets used here have already been analysed extensively and for this tutorial only the first alternative nucleotide (allele) is sufficient to explain the further workflow. Always be sure what you are doing when you remove data! Always ask yourself if you are allowed to do it and what the consequences might be.

Visualizing SNV Variability Across Isolates

The final table should be much easier to read and contain all the information we need to perform an analysis of the intra-isolate specific variability. Here is a short overview of the final table (selected columns) and its first three SNV positions:

CHROM	POS	REF	ALT	ISOLATE	DP	DPR	ALLELE	DPR.ALLELE	REL.ALT	REL.ALT.0.05
CpGV-M	246	C	T	CpGV-E2	886	107,773,6,0	ALT1	773	0.87246	0.87246
CpGV-M	246	C	T	CpGV-S	881	5,875,1,0	ALT1	875	0.99319	0.99319
CpGV-M	246	C	T	CpGV-M	935	800,133,1,1	ALT1	133	0.142246	0.142246
CpGV-M	246	C	T	CpGV-V15	839	42,797,0,0	ALT1	797	0.94994	0.94994
CpGV-M	249	A	C	CpGV-E2	888	106,780,1	ALT1	780	0.878378	0.878378
CpGV-M	249	A	C	CpGV-S	881	2,879,0	ALT1	879	0.99773	0.99773
CpGV-M	249	A	C	CpGV-M	938	799,139,0	ALT1	139	0.148188	0.148188
CpGV-M	249	A	C	CpGV-V15	843	48,793,2	ALT1	793	0.940688	0.940688
CpGV-M	564	T	C	CpGV-E2	964	98,856,9,1	ALT1	856	0.887967	0.887967
CpGV-M	564	T	C	CpGV-S	947	1,940,6,0	ALT1	940	0.992608	0.992608
CpGV-M	564	T	C	CpGV-M	976	821,154,1,0	ALT1	154	0.157787	0.157787
CpGV-M	564	T	C	CpGV-V15	880	33,846,1,0	ALT1	846	0.961364	0.961364

We can now start the first visualisation and create a plot for each CpGV isolate (ISOLATE), plotting the position of the SNV (POS) against the relative frequency of the alternative nucleotide (REL.ALT). Note that we are not using the REL.ALT threshold values because we want to look at the unfiltered data.

Hands On: SNV plot

Scatterplot with ggplot2 ( Galaxy version 3.4.0+galaxy1) with the following parameters:

param-file “Input in tabular format”: the ALT1-filtered VCF table; output of Filter data on any column using simple expressions tool

“Column to plot on x-axis”: 2

“Column to plot on y-axis”: 32

“Plot title”: SNV plot

“Label for x axis”: Reference Genome Position of CpGV-M

“Label for y axis”: Relative Nucleotide Frequency

In “Advanced options”:

“Type of plot”: Points only (default)

“Data point options”: Default

“Plotting multiple groups”: Plot multiple groups of data on individual plots

“column differentiating the different groups”: 25

“Axis title options”: Default

“Axis text options”: Default

“Plot title options”: Default

“Grid lines”: Hide major and minor grid lines

“Axis scaling”: Automatic axis scaling

In “Output Options”:

“Unit of output dimensions”: Centimeters (cm)

“width of output”: 20.0

“height of output”: 12.0

“dpi of output”: 200.0

Click Run Tool

Comment: Column names and numbers

To provide ggplot2 with the data for the X- and Y-axis, column numbers must be passed to the function. You have to check which column numbers the columns POS and REL.ALT have. In addition, ggplot2 offers the option of splitting the data according to ISOLATE, to obtain a separate plot for each value in ISOLATE.

POS = column 2

REL.ALT = column 32

ISOLATE = column 25

Open image in new tab

Figure 2: Section of the VCF table with column numbers and names.

As a result, we get a SNV plot that shows the relative frequency of the first alternative nucleotide in all variable SNV positions. Before we have a closer look at the data, we have to remember that all the sequencing data (= the reads) were mapped during this workflow against the reference genome of CpGV-M. For that reason, all isolates are compared to the genome of CpGV-M. We see that CpGV-E2 has a heterogeneous SNV pattern. CpGV-M appears homogeneous but with relatively few genetic variation – but keep in mind that the sequence data of CpGV-M was mapped against its own reference, which is why one might assume less variability. For CpGV-S, the variability is also homogeneous, but many SNV positions have a nucleotide frequency close to one. This could indicate that the isolate is homogeneous but different from the reference CpGV-M. For isolate CpGV-V15, a SNV cloud occurs at approximately 0.5 (50%), which could indicate a mixed isolate. By “mixed”, it is meant that two or more (homogenous) isolates occur in a certain ratio in this isolate.

SNV Specificity Determination

Now we come to the last but most complex section of this tutorial - determining SNV specificities. We have seen that CpGV-V15 is a mixed isolate. Now we want to find out which isolates were mixed and how this SNV pattern can be explained. In the next step, we will consider SNV positions as markers and look for SNV positions that are only variable for one or more sequenced isolates of CpGV. If a SNV position is only variable for CpGV-S (i.e. the relative frequency is greater than 0; REL.ALT > 0) but is equal to zero for the other isolates, then this position is a marker for CpGV-S, or specific for CpGV-S. Since we cannot perform this determination by hand for all positions, I have written a small program implemented in the Text reformatting with awk tool that does the work for us.

Before we get started, I would like to explain specificity in more detail using an example. First, we decide that we want to determine the specificities for the SNV positions for the following isolates:

CpGV-M
CpGV-S
CpGV-E2.

Isolate CpGV-V15 is not included in the analysis of SNV specificity determination because we want to explain its composition later with the other data sets.

The principle behind SNV specificities and how they are determined can be explained by the table below. In the first position 249, all isolates (CpGV-E2, CpGV-S and CpGV-M) have a relative nucleotide frequency > 0. This means that all three sequenced isolates have an alternative nucleotide at this position, with frequencies of 14.7%, 88% and >99% for CpGV-M, CpGV-E2 and CpGV-S, respectively. The remaining percentages are the reference nucleotide. Note that I am using the REL.ALT.0.05 values, which are based on the threshold of 0.05 (values < 0.05 were set to zero). By this, the tutorial is simplified. Position 249 is thus variable for all three sequenced isolates and therefore specific for CpGV-E2 + CpGV-S + CpGV-M. In the second position 603, only CpGV-E2 has an alternative nucleotide, its REL.ALT.0.05 > 0. The isolates of CpGV-S and CpGV-M show no variability in position 603 and are therefore identical to the reference. Thus, position 603 is specific for CpGV-E2. If we apply the same logic to positions 1278 and 6393, we can see that position 1278 is specific for CpGV-S and position 6393 is specific for CpGV-E2 and CpGV-S.
If we now wanted to detect specifically isolate CpGV-S and CpGV-E2 in a mixture, positions 603 and 1278 could serve as markers.

POS	REL.ALT0.05 (CpGV-E2)	REL.ALT0.05 (CpGV-S)	REL.ALT0.05 (CpGV-M)	SNV Specificity
249	0.878378	0.99773	0.148188	CpGV-E2 + CpGV-S + CpGV-M
603	1	0	0	CpGV-E2
1278	0	0.998963	0	CpGV-S
6393	0.98847	0.997783	0	CpGV-E2 + CpGV-S

Let us run the tool below to determine the SNV specificities of our dataset.

Hands On: SNV specificity determination

Text reformatting with awk ( Galaxy version 9.3+galaxy1) with the following parameters:

param-file “File to process”: the ALT1-filtered VCF table; output of Filter data on any column using simple expressions tool

“AWK Program”: Paste the code from the code box below.

Code In: awk

BEGIN { FS="\t"; OFS="\t" }
NR == 1 {
    # Dynamically find column indices based on header names
    for (i = 1; i <= NF; i++) {
        if ($i == "POS") pos_col = i;
        if ($i == "ALLELE") allele_col = i;
        if ($i == "ISOLATE") isolate_col = i;
        if ($i == "REL.ALT.0.05") rel_alt_col = i;
    }

    # Check if all required columns were found
    if (!(pos_col && allele_col && isolate_col && rel_alt_col)) {
        print "Error: Required columns (POS, ALLELE, ISOLATE, REL.ALT.0.05) are missing." > "/dev/stderr";
        exit 1;
    }

    # Print the header and add the new SPEC column
    print $0, "SPEC";
    next;
}

{
    # When reaching a new position, prepare the specificity
    if ($(pos_col) != current_pos) {
        # Assign the calculated specificity to all rows of the current position
        for (i in pos_lines) {
            # Add "SNV specificity: " before the specificity value
            final_spec = (specificity == "" ? "0" : "SNV specificity: " specificity);
            print pos_lines[i], final_spec;
        }
        # Reset variables for the new position
        delete pos_lines;
        specificity = "";
        current_pos = $(pos_col);
    }

    # Save the current row for later
    pos_lines[NR] = $0;

    # Conditions for calculating specificity
    if ($(allele_col) == "ALT1" && 
        ($(isolate_col) == "CpGV-E2" || $(isolate_col) == "CpGV-S" || $(isolate_col) == "CpGV-M") && 
        $(rel_alt_col) > 0) {
        # Concatenate isolate names with " + " if REL.ALT.0.05 > 0
        specificity = (specificity == "" ? $(isolate_col) : specificity " + " $(isolate_col));
    }
}

END {
    # Assign the specificity to the last position
    for (i in pos_lines) {
        # Add "SNV specificity: " before the specificity value
        final_spec = (specificity == "" ? "0" : "SNV specificity: " specificity);
        print pos_lines[i], final_spec;
    }
}

Click Run Tool

Visualization of Variable SNV Positions by Specificity

Now, the VCF table has an additional column called SPEC, which indicates the SNV specificity for each position. This allows us to create a special SNV plot for each isolate. The special thing about this is that a separate plot is created for each isolate and each specificity. Since we can only look at each isolate individually, we first have to filter out the data for one isolate, we start with CpGV-S. After that, a plot is created for each SNV specificity for the CpGV-S isolate.

Hands On: Extract data for one isolate only

Filter data on any column using simple expressions with the following parameters:

param-file “Filter”: the VCF table with determined SNV specificities; output of Text reformatting with awk tool

“With following condition”: c25=='CpGV-S'

“Number of header lines to skip”: 1

Click Run Tool

Comment: Filtering the CpGV-S data

The VCF SNV table is reduced to the data of CpGV-S. This is done by filtering on column 25 (ISOLATE).

Scatterplot with ggplot2 ( Galaxy version 3.4.0+galaxy1) with the following parameters:

param-file “Input in tabular format”: the VCF table for CpGV-S; output of Filter data on any column using simple expressions tool

“Column to plot on x-axis”: 2

“Column to plot on y-axis”: 32

“Plot title”: Isolate CpGV-S

“Label for x axis”: Reference Genome Position of CpGV-M

“Label for y axis”: Relative Nucleotide Frequency

In “Advanced options”:

“Type of plot”: Points only (default)

“Data point options”: Default

“Plotting multiple groups”: Plot multiple groups of data on individual plots

“column differentiating the different groups”: 34

“Axis title options”: Default

“Axis text options”: Default

“Plot title options”: Default

“Grid lines”: Hide major and minor grid lines

“Axis scaling”: Automatic axis scaling

In “Output Options”:

“Unit of output dimensions”: Centimeters (cm)

“width of output”: 30.0

“height of output”: 12.0

“dpi of output”: 200.0

Click Run Tool

Let us now look at the result for isolate CpGV-S. We can see that all CpGV-E2 specific SNV positions have a frequency equal or close to zero. This makes sense, since we are looking at the isolate CpGV-S and the CpGV-E2 specific SNV markers should not indicate the presence of CpGV-E2. The situation is different for the SNV positions that are solely or in combination with another isolate specific for CpGV-S: here the frequency is close to 1. This also makes sense, since these markers indicate the presence of CpGV-S, which is the case. The few SNV positions that are specific for CpGV-M also do not indicate the presence of CpGV-M.

SNV specificity plot for CpGV-S. — **Figure 4**: SNV specificity plot for the sequenced CpGV-S isolate.

Question: Can you run the same analysis for CpGV-E2?

Extract the data subset for CpGv-E2 and create the SNV specificity plot. Can you interpret and understand the result?

Hands On: Extract data for CpGV-E2

Filter data on any column using simple expressions with the following parameters:

param-file “Filter”: the VCF table with determined SNV specificities; output of Text reformatting with awk tool

“With following condition”: c25=='CpGV-E2'

“Number of header lines to skip”: 1

Click Run Tool

Scatterplot with ggplot2 ( Galaxy version 3.4.0+galaxy1) with the following parameters:

param-file “Input in tabular format”: the VCF table for CpGV-E2; output of Filter data on any column using simple expressions tool

“Column to plot on x-axis”: 2

“Column to plot on y-axis”: 32

“Plot title”: Isolate CpGV-E2

“Label for x axis”: Reference Genome Position of CpGV-M

“Label for y axis”: Relative Nucleotide Frequency

In “Advanced options”:

“Type of plot”: Points only (default)

“Data point options”: Default

“Plotting multiple groups”: Plot multiple groups of data on individual plots

“column differentiating the different groups”: 34

“Axis title options”: Default

“Axis text options”: Default

“Plot title options”: Default

“Grid lines”: Hide major and minor grid lines

“Axis scaling”: Automatic axis scaling

In “Output Options”:

“Unit of output dimensions”: Centimeters (cm)

“width of output”: 30.0

“height of output”: 12.0

“dpi of output”: 200.0

Click Run Tool

Determining the Potential Components of a Mixed Isolate

To create an SNV specificity plot for the isolate CpGV-V15, we proceed in exactly the same way as described above. Unfold the tip box if you are unsure how to create the SNV plot for CpGV-V15.

Filter data on any column using simple expressions with the following parameters:

param-file “Filter”: the VCF table with determined SNV specificities; output of Text reformatting with awk tool

“With following condition”: c25=='CpGV-V15'

“Number of header lines to skip”: 1

Click Run Tool

Scatterplot with ggplot2 ( Galaxy version 3.4.0+galaxy1) with the following parameters:

param-file “Input in tabular format”: the VCF table for CpGV-V15; output of Filter data on any column using simple expressions tool

“Column to plot on x-axis”: 2

“Column to plot on y-axis”: 32

“Plot title”: Isolate CpGV-V15

“Label for x axis”: Reference Genome Position of CpGV-M

“Label for y axis”: Relative Nucleotide Frequency

In “Advanced options”:

“Type of plot”: Points only (default)

“Data point options”: Default

“Plotting multiple groups”: Plot multiple groups of data on individual plots

“column differentiating the different groups”: 34

“Axis title options”: Default

“Axis text options”: Default

“Plot title options”: Default

“Grid lines”: Hide major and minor grid lines

“Axis scaling”: Automatic axis scaling

In “Output Options”:

“Unit of output dimensions”: Centimeters (cm)

“width of output”: 30.0

“height of output”: 12.0

“dpi of output”: 200.0

Click Run Tool

Although we have determined the SNV specificities using CpGV-M, CpGV-S and CpGV-E2, we can transfer this information to CpGV-V15 because all isolates were mapped against the common reference of CpGV-M. Because of this, we can look at the relative abundance of the CpGV-E2 and CpGV-S markers (= SNV positions) in the CpGV-V15 isolate. Let’s look at the result for CpGV-V15 (Fig. 5).

SNV specificity plot for CpGV-V15. — **Figure 5**: Deciphering the composition of the isolate CpGV-V15 by using SNV specificities.

Most of the SNV positions specific for CpGV-E2 have a frequency of around 0.5 (50%) (Fig. 5, SNV specificity: CpGV-E2). Looking at the positions specific to CpGV-S, we also see a SNV cloud at around 0.5 (50 %) (Fig. 5, SNV specificity: CpGV-S). From these two observations, we can conclude that CpGV-E2 and CpGV-S occur at a ratio of 50% each. As a control, we look at the markers (SNV positions) that are specific for both CpGV-E2 and CpGV-S (Fig. 5, SNV specificity: CpGV-E2 + CpGV-S). Here we see that most SNV positions have a relative frequency of ~1 (100%). This also makes sense, because if CpGV-V15 is a mixture of CpGV-E2 and CpGV-S, then the markers that indicate both isolates should be at 100%, which is the case. Based on the SNV analysis, it can be concluded that CpGV-V15 is mainly a mixture of CpGV-S and CpGV-E2.

The same analysis was carried out in more detail with significantly more isolates and it was concluded that CpGV-V15 is a mixture of CpGV-S (49%), CpGV-E2 (42%) and CpGV-M (7%) (Fan et al. 2020).

The analysis presented in this tutorial is a simplification of the experiment from the publication Fan et al. 2020 and Wennmann et al. 2020 and is intended to help explain the procedure.

Conclusion

Four Illumina sequencing data sets from isolates of a large dsDNA virus (Cydia pomonella granulovirus, family Baculoviridae) were used to demonstrate how insight can be gained into intra-isolate specific genetic variability. SNVs were identified as markers to assess the homogeneity and heterogeneity of these sequenced CpGV isolates. The relative frequency with which alternative nucleotides occur at each variable SNV position was also used to assign SNV specificities, i.e. to identify them as markers for one or a combination of isolates. The markers were used to decipher the composition of a mixed isolate and to estimate its composition.
This tutorial provides a basis for understanding the complexity behind isolates, which represent virus populations, and how consensus sequences, often generated by sequencing and genome assembly, can mask genetic variability. The tutorial also includes a complete Galaxy workflow that includes all the steps described above. It is intended to help you create your own workflow for your own data.

Please leave me a comment if you liked the workflow or have any questions. I would love to hear your feedback.

Have fun analysing your own data and good luck with your research.

You've Finished the Tutorial

Key points

Baculovirus isolates are heterogeneous and show genetic variability.

Some SNV positions occur only in certain isolates and are therefore specific to that isolate.

SNV specificities can be used as markers to identify isolates and determine mixtures.

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

Ayres, M. D., S. C. Howard, J. Kuzio, M. Lopez-Ferber, and R. D. Possee, 1994 The Complete DNA Sequence of Autographa californica Nuclear Polyhedrosis Virus. Virology 202: 586–605. 10.1006/viro.1994.1380
Oers, M. M. van, 2011 Opportunities and challenges for the baculovirus expression system. Journal of Invertebrate Pathology 107: S3–S15. 10.1016/j.jip.2011.05.001
Eberle, K. E., J. A. Jehle, and J. Huber, 2012 Microbial control of crop pests using insect viruses., pp. 281–298 in Integrated pest management: principles and practice, CABI. 10.1079/9781845938086.0281 ISBN: 9781780640020
Grzywacz, D., and S. Moore, 2017 Production, Formulation, and Bioassay of Baculoviruses for Pest Control, pp. 109–124 in Microbial Control of Insect and Mite Pests, Elsevier. 10.1016/b978-0-12-803527-6.00007-x ISBN: 9780128035276
Gueli Alletti, G., A. Sauer, B. Weihrauch, E. Fritsch, K. Undorf-Spahn et al., 2017 Using Next Generation Sequencing to Identify and Quantify the Genetic Composition of Resistance-Breaking Commercial Isolates of Cydia pomonella Granulovirus. Viruses 9: 250. 10.3390/v9090250
Harrison, R. L., E. A. Herniou, J. A. Jehle, D. A. Theilmann, J. P. Burand et al., 2018 ICTV Virus Taxonomy Profile: Baculoviridae. Journal of General Virology 99: 1185–1186. 10.1099/jgv.0.001107
Fan, J., J. A. Jehle, and J. T. Wennmann, 2020 Population structure of Cydia pomonella granulovirus isolates revealed by quantitative analysis of genetic variation. Virus Evolution 7: 10.1093/ve/veaa073
Wennmann, J. T., J. Fan, and J. A. Jehle, 2020 Bacsnp: Using Single Nucleotide Polymorphism (SNP) Specificities and Frequencies to Identify Genotype Composition in Baculoviruses. Viruses 12: 625. 10.3390/v12060625
Cox, S. J. D., A. N. Gonzalez-Beltran, B. Magagna, and M.-C. Marinescu, 2021 Ten simple rules for making a vocabulary FAIR (S. Markel, Ed.). PLOS Computational Biology 17: e1009041. 10.1371/journal.pcbi.1009041
Rodríguez-Hernández, A. P., D. Martínez-Flores, A. Cruz-Reséndiz, T. Padilla-Flores, R. González-Flores et al., 2023 Baculovirus Display of Peptides and Proteins for Medical Applications. Viruses 15: 411. 10.3390/v15020411
Wennmann, J. T., F.-S. Lim, S. Senger, M. Gani, J. A. Jehle et al., 2024 Haplotype determination of the Bombyx mori nucleopolyhedrovirus by Nanopore sequencing and linkage of single nucleotide variants. Journal of General Virology 105: 10.1099/jgv.0.001983
Viruses (ICTV), I. C. on Taxonomy of Genome mag of Autographa californica multiple nucleopolyhedrovirus isolate C6 (Figure 2). https://ictv.global/sites/default/files/inline-images/7651.ODD.Baculo.Fig2.v2.png

Glossary

CpGV: Cydia pomonella granulovirus
NCBI: National Center for Biotechnology Information
ORF: open reading frame
SNV: Single nucleotide variant
SRA: Sequence Read Archive

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Jörg Wennmann, Deciphering Virus Populations - Single Nucleotide Variants (SNVs) and Specificities in Baculovirus Isolates (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/variant-analysis/tutorials/baculovirus-isolate-variation/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{variant-analysis-baculovirus-isolate-variation,
author = "Jörg Wennmann",
	title = "Deciphering Virus Populations - Single Nucleotide Variants (SNVs) and Specificities in Baculovirus Isolates (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/variant-analysis/tutorials/baculovirus-isolate-variation/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/variant-analysis/tutorials/baculovirus-isolate-variation/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: text_processing
  owner: bgruening
  revisions: 86755160afbf
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: text_processing
  owner: bgruening
  revisions: 86755160afbf
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: text_processing
  owner: bgruening
  revisions: 86755160afbf
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: trim_galore
  owner: bgruening
  revisions: cd7e644cae1d
  tool_panel_section_label: FASTA/FASTQ
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: bwa
  owner: devteam
  revisions: 4a196b9c72c2
  tool_panel_section_label: Mapping
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: vcf2tsv
  owner: devteam
  revisions: e92b3c0f9224
  tool_panel_section_label: Variant Calling
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: bcftools_call
  owner: iuc
  revisions: '07396b423218'
  tool_panel_section_label: Variant Calling
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: bcftools_mpileup
  owner: iuc
  revisions: 0a5dcc9a8ec2
  tool_panel_section_label: Variant Calling
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: ggplot2_point
  owner: iuc
  revisions: 5298e06f14f9
  tool_panel_section_label: Graph/Display Data
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: ncbi_acc_download
  owner: iuc
  revisions: e063168e0a81
  tool_panel_section_label: Get Data
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: sra_tools
  owner: iuc
  revisions: 8848455c0270
  tool_panel_section_label: Get Data
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: collapse_collections
  owner: nml
  revisions: 90981f86000f
  tool_panel_section_label: Collection Operations
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

5 stars 2

No feedback has been recieved yet for this training. Be the first one by filling in the feedback form.