Avian influenza viral strain analysis from gene segment sequencing data

Author(s)	Wolfgang Maier

Overview
Questions:

With reassortment of gene segments being a common event in avian influenza virus (AIV) evolution, does it make sense to use a reference-based mapping approach for constructing consensus genome sequences for AIV samples?

Is it possible to reuse existing tools and workflows developed for the analysis of sequencing data from other viruses?

How can we obtain meaningful phylogenetic insight from AIV consensus sequences?

Objectives:

Determine how reassortment impacts reference-based mapping approaches

Use a collection of per-segment reference sequences to construct a hybrid reference genome that is sufficiently close to a sequenced sample to be useful as a reference for mapping

Construct a sample consensus genome from mapped reads

Generate per-segment phylogenetic trees of AIV consensus sequences

Requirements:

Introduction to Galaxy Analyses

Slides: Quality Control

Hands-on: Quality Control

Slides: Mapping

Hands-on: Mapping

Time estimation: 4 hours

Level: Intermediate


Published: Nov 21, 2022

Last modification: Oct 4, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00308


Revision: 10

Of the four species of influenza viruses (Influenza A-D), Influenza A is the most virulent in human hosts and subtypes of it have been responsible for all historic flu pandemics.

The different influenza species infect distinct animal hosts (though all of them can infect humans and pigs). The most important natural reservoir of Influenza A is birds (in particular, wild aquatic birds), in which it causes avian influenza.

Importantly, all flu pandemics of the last century have been triggered by reassortment events between avian and human Influenza A strains (although at least the 2009 swine flu pandemic involved an additional mixing event with an Influenza A strain from pigs).

Reassortment events are a key trait of Influenza A made possible by its wide range of natural hosts combined with two molecular characteristics of the species:

the linear negative-sense single-stranded RNA genomes of influenza viruses (and of all viruses of the Orthomyxoviridae family) are segmented, i.e. consist of several (eight in the case of Influenza A and B) distinct pieces of RNA, which typically encode just one, sometimes two viral proteins.
the mutation rate in the genome is high in influenza viruses, and particularly high in Influenza A, since their RNA polymerase lacks exonuclease activity necessary for proof-reading.

Together these characteristics enable Influenza A to evolve relatively rapidly in non-human hosts, then re-adapt to humans through exchange of segments (reassortment) in a host co-infected with a human and, e.g., an avian strain. If such a reassortment introduces new variants of the two most antigenic proteins of influenza, HA (hemagglutinin) and NA (neuraminidase), this antigenic shift provides the reassorted strain with immune escape potential, which, if it happens in a genetic background compatible with efficient transmission in humans, can trigger an unusually strong wave of influenza or even a new pandemic.

To estimate the likelihood of an epidemic or pandemic influenza event in the near future, it is therefore important to closely monitor the genome composition of avian influenza strains circulating in wild and domestic birds. The huge technological advances of the last decade make it possible to use next-generation sequencing for this purpose.

At the same time, the segmented nature of their genomes combined with high genetic variability, especially in the HA and NA genes, means that data from influenza viruses requires rather specialized bioinformatics workflows compared to data from other viruses with similar genome size (e.g. SARS-CoV-2):

The viral surface proteins HA and (to a lesser extent) NA are the main targets of the host antibody response and are, thus, under constant selection pressure to mutate into forms capable of evading an existing host immune response. As a consequence, these segments have evolved into a much richer panel of sequence variants than the other segments, to the point that sequences of the HA segment can, at present, be classified into 18 distinct subfamilies, H1-H18, while there are 11 recognized subfamilies of NA, and Influenza A strains are subtyped (as, for example, H5N1, H3N2, etc.) according to the combination of HA- and NA-encoding segments they are carrying.

Importantly, the sequence diversity of HA and (again to a lesser extent) of NA segments is large enough to rule out the naive approach of mapping sequenced reads to one specific agreed-upon Influenza A reference sequence. While this would work for the other six segments, mapping software would regularly fail to find enough plausible mappings for sequenced reads of HA and NA origin to support further analysis. This is why, in this tutorial, we are going to explore an alternative approach, which is also mapping-based but chooses a suitable reference for each segment dynamically based on the input sequencing data.

Agenda

In this tutorial, we will cover:

Prepare analysis history and data

Get a history with reference data

Get the sequencing data

Inspect the reference data

Quality control

Per-segment subtyping and hybrid reference construction

Mapping to a hybrid reference

Consensus sequence construction

Placing segments on a phylogenetic tree

Conclusion

Prepare analysis history and data

In this tutorial you are going to work on a single avian influenza sample sequenced in paired-end mode on the Illumina platform, i.e. we are going to download two datasets of sequenced reads for that sample.

In addition, we are going to base the analysis on a small collection of multiple reference sequences for each influenza gene segment. We prepared this collection for you from public INSAFlu data.

If you have your own curated collection of reference sequences, you should be able to use it to follow this tutorial without any problem. Note, however, that the analysis results referred to in many of the questions will be different if you swap in a different reference collection. If you want to learn how you can create a Galaxy collection from your own references that is structured like the default we are suggesting here, you may want to follow the dedicated tutorial on Using dataset collections.

Get a history with reference data

Any analysis should get its own Galaxy history, but in this tutorial we won’t start with creating one.

Instead, to save some time, you can copy an existing shared history on Galaxy Europe.

Hands On: Prepare the Galaxy history

On Galaxy Europe, open the shared history with pre-prepared INSAFlu reference data.

On the Galaxy US server, use this shared history instead.

Rename the history

Click on the pencil icon (Edit) next to the history name (which by default is “Unnamed history”)

Type the new name: Avian influenza tutorial

Click on Save

To cancel renaming, click the “Cancel” button

If you do not have the pencil icon (Edit) next to the history name (which can be the case if you are using an older version of Galaxy) do the following:

Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel

Type the new name: Avian influenza tutorial

Press Enter

Get the sequencing data

Hands On: Get the data
Import the forward and reverse reads of the sequenced sample into your history and organize them into a Paired Collection.
https://usegalaxy.eu/api/datasets/4838ba20a6d86765171e4f201205515c/display?to_ext=fastqsanger.gz
https://usegalaxy.eu/api/datasets/4838ba20a6d867657d09280baae4e1ec/display?to_ext=fastqsanger.gz
Copy the links above

Open the Upload Manager

In the top row of tabs select Collection

Configure the drop-down select boxes on that tab like this:

“Collection Type”: Pair

“File Type”: fastqsanger.gz

Click on Paste/Fetch data and paste the links you copied into the empty text box

Press Start

Wait for the Build button to become enabled, click it, and, in the next dialogue, give a suitable Name, like Sequencing data, to the new collection.

Click on Create collection

Inspect the reference data

Hands On: What is in the reference collection?

Step inside the imported reference collection by clicking on it.

Expand individual elements of the collection by clicking on them to reveal more details about them.

View the content of any element by clicking its eye icon.

You can then scroll through the data and use your browser’s search function to look for particular sequences.

Question: Questions

How many elements are in the collection? What do they represent?

How many sequences are stored in each element?

Are all subtypes of HA and NA represented in the sequences?

Are there sequences of non-A Influenza species?

The collection consists of eight elements and each element provides reference sequences for one genome segment.

Each element has data for 56 sequences (Galaxy knows how to interpret the FASTA format of the reference data and counts sequences for you; you can see the count for any of the elements by expanding it).

The collection has reference sequence data for H1-H18 (with the exception of H11) and for N1-N11. For many of those subtypes, however, there is only one reference, i.e., within-subtype variation is not captured well by the collection.

There are 6 Influenza B references in the collection. Since Influenza B is not normally seen in birds and we are going to analyze an avian influenza sample here, these can be considered controls: if we get an Influenza B assignment for the sample at any step in our analysis, we know something is very suspicious. This also means that we only have 50 informative references in the collection.

There are lots of text processing and filtering tools for Galaxy, of which this tutorial is going to introduce just a small subset. You could use them to extract statistics about how often each subtype occurs in the collection of references instead of relying on scrolling and searching. If you are interested and have the time, you may want to try the Data Manipulation Olympics to learn more about the available tools and how to combine them.
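Outside of Galaxy, the same kind of subtype statistics could be sketched in a few lines of Python. The example below is only an illustration: the header lines are hypothetical (the real titles in the INSAFlu collection may be formatted differently), and it simply assumes that a subtype label of the form HxNy appears somewhere in each Influenza A title.

```python
import re
from collections import Counter

# Hypothetical FASTA titles; the real INSAFlu titles may look different.
headers = [
    ">A_H4N6_example_1",
    ">A_H5N1_example_2",
    ">A_H5N1_example_3",
    ">B_Victoria_example_4",
]

def count_subtypes(fasta_lines):
    """Count HxNy subtype labels found in FASTA title lines."""
    subtype = re.compile(r"H\d{1,2}N\d{1,2}")
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # ignore sequence lines
        match = subtype.search(line)
        counts[match.group(0) if match else "no HxNy label"] += 1
    return counts

subtype_counts = count_subtypes(headers)
for name, n in subtype_counts.most_common():
    print(name, n)
```

Titles without an HxNy label (such as the Influenza B controls) end up in their own bucket, which makes them easy to spot.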

Quality control

As a very first step, we would like to make sure that we base our analysis only on the high-quality parts of the sequenced reads.

NGS reads often have lower base calling quality near their ends, so here we are going to trim low-quality stretches of bases from both ends. In addition, we are going to discard reads shorter than 30 bases after trimming. This is reasonable since our next step will involve extracting all possible 21-mers from the reads and there are very few possibilities for a read shorter than 30 bases.
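To make the argument about short reads concrete: a read of length L contains L - k + 1 overlapping k-mers. The small sketch below just evaluates this formula for k = 21; the read lengths used are arbitrary illustration values, not properties of the tutorial data.

```python
def kmer_count(read_length, k=21):
    """Number of overlapping k-mers contained in a read of the given length."""
    return max(read_length - k + 1, 0)

# Arbitrary example read lengths
for length in (15, 21, 30, 150):
    print(length, kmer_count(length))
```

A read at the 30-base threshold contributes only ten 21-mers, and anything shorter than 21 bases contributes none at all.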

The tool fastp lets us perform these tasks and obtain a nice quality report for our reads before and after processing in one go, but many other options exist to perform sequenced reads quality control and trimming/filtering in Galaxy and the dedicated tutorial on quality control introduces more of them.

Hands On: QC and read trimming/filtering with fastp

fastp ( Galaxy version 0.23.4+galaxy1)

“Single-end or paired reads”: Paired Collection

“Select paired collection(s)”: the uploaded paired collection of sequenced reads

In “Filter Options” under “Length filtering options”:

“Length required”: 30

In “Read Modification Options” under “Per read cutting by quality options”:

“Cut by quality in front (5’)”: Yes

“Cut by quality in tail (3’)”: Yes

“Cutting mean quality”: 30

fastp comes with basic default settings for some of its many parameters. In particular, with no explicit parameter specifications, the tool will still remove reads that

have base qualities < 15 for more than 40% of their bases

have more than 5 Ns in their sequence

have a length < 15 bases (after removing poly-G tails from Illumina reads)

Because our input data has far longer MiSeq-generated reads, we can increase this length threshold to 30 bases as done above.

These are very relaxed criteria that will only remove the least usable reads, but still assure minimal quality standards that are good enough for many (including training) purposes.
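The read-level filters described above can be written down as a simple rule. The following is a simplified sketch of that rule, not fastp's actual implementation: it ignores trimming, poly-G removal and paired-read handling, and the function and parameter names are made up for this illustration.

```python
def passes_read_filters(seq, quals, min_len=30, min_base_qual=15,
                        max_low_qual_frac=0.40, max_n=5):
    """Return True if a read would survive the filters described in the text.

    seq   -- read sequence as a string
    quals -- list of per-base Phred quality scores, same length as seq
    """
    if len(seq) < min_len or not quals:
        return False  # too short after trimming
    if seq.upper().count("N") > max_n:
        return False  # more than max_n ambiguous bases
    low_quality = sum(1 for q in quals if q < min_base_qual)
    # discard only if MORE than 40% of the bases are below the quality cutoff
    return low_quality / len(quals) <= max_low_qual_frac

good = passes_read_filters("ACGT" * 10, [35] * 40)
too_short = passes_read_filters("ACGT" * 5, [35] * 20)
too_many_n = passes_read_filters("N" * 6 + "A" * 34, [35] * 40)
low_qual = passes_read_filters("ACGT" * 10, [10] * 20 + [35] * 20)
print(good, too_short, too_many_n, low_qual)
```

Here min_len is set to the 30 bases chosen in this tutorial rather than to the tool default of 15.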

Running the tool will result in two outputs. One is a new paired collection with the processed reads, the other one is a report of initial quality, the processing actions performed and their effect on key quality metrics.

Inspect the report and try to answer the following questions:

Question: Questions

Which set of reads (forward or reverse reads) did profit most quality-wise from our low-quality base trimming?

What percentage of reads got discarded completely with our settings, and what percentage of bases?

Why, according to the Filtering result, have some reads been discarded because of too many Ns?

The forward reads were of worse quality in particular near their 3’-ends than the reverse reads and, consequently, were affected more strongly by trimming (see the different “quality” plots in the report).

Less than 2% of reads got discarded completely (the “Filtering result” section of the report says 98.03% of all reads were retained). In contrast, around 10% of all bases got discarded (compare total bases before and after filtering) so most reads got trimmed but not discarded.

As explained in the “fastp default settings” box above, fastp is a tool with many hidden default actions in its various sections. Inside the Filter Options inspect Quality filtering options and the various defaults mentioned in the parameters help therein. One of these defaults is to discard reads with more than 5 Ns. These defaults are often reasonable, but it’s good to be aware of them.

Per-segment subtyping and hybrid reference construction

Our goal at this step is to find best matches between our sequencing data and the reference sequences of each genome segment. This will:

give us preliminary subtyping results with regard to the HA and NA segments for the sequenced sample
suggest the best reference to map the sequencing data to for each segment.

We are then going to combine the best reference of each segment into a hybrid reference genome to use for mapping our sequenced reads against.

To identify the best matching reference segments, we are going to run the tool VAPOR asking it to report the stats for any hits it identifies.

Hands On: Exploring best-matching reference segment scores

VAPOR ( Galaxy version 1.0.2+galaxy3)

param-collection “Reference sequences”: References per segment (INSAFlu)

“Type of sequencing data”: Paired-end as collection

“Paired collection of sequenced reads”: quality-trimmed reads; output of fastp

“Desired output”: Return scores of best matches

“Limit number of reported matches to”: 0

In “Optional arguments”:

“Read kmer filtering threshold”: 0.1

“Minimum k-mer proportion”: 0.0

Explore the output collection produced by the tool

Question: Questions

Is there a difference between the results for segment 4 (encoding HA), segment 6 (encoding NA) and those for the remaining segments?

According to the tool, what is the likely subtype with regard to HA and NA of the sample?

The values for % of query bases in reads and for Total score are drastically lower for the HA and somewhat lower for the NA segment than for any of the other segments. This could be due to the structure of the collection of references we have used, but also fits very well with the higher selection pressure on NA and, in particular on HA, as the main antigenic proteins of the virus, which leads to higher variability in these segments than in the others.

The best match (assigned an almost 2x higher score than the second-best match) found for the HA segment is from an H4N6 strain. The top two matches with regard to the NA segment (again assigned ~ 2x higher scores than the next best matches) are from strains with the N6 subtype of NA.

The most likely subtype of the sample, thus, appears to be H4N6.

Now that we have established that things may make sense, we can use the output of VAPOR to extract the actual sequence of the top hit for each reference segment. We then concatenate these best matches into a hybrid reference genome for mapping.

Hands On: Obtaining sequences of top hits identified by VAPOR

Replace parts of text ( Galaxy version 9.3+galaxy1) to extract just the name of the sequence from each line of VAPOR’s output

param-collection “File to process”: collection of score outputs of VAPOR

In param-repeat “1. Find and Replace”:

“Find pattern”: ^.+\t>(.+)$

“Replace with”: $1

“Find-Pattern is a regular expression”: Yes

Select first lines from a dataset to get the first line, i.e. the best match from VAPOR’s output

“Select first”: 1 lines

param-collection “from”: output of Replace

seqtk_subseq ( Galaxy version 1.4+galaxy0) to extract the reference sequences based on their names reported by VAPOR

param-collection “Input FASTA/Q file”: References per segment (INSAFlu)

“Select source of sequence choices”: FASTA/Q ID list

param-collection “Input ID list”: the collection output of Select first

Collapse Collection ( Galaxy version 5.1.0) to combine the best-matching sequence of each segment into one dataset

param-collection “Collection of files to collapse into single dataset”: the selected sequences collection produced by seqtk_subseq

At this point, you may want to inspect the output of the last step to see if it is an eight-segment reference genome as expected, and if the segments correspond to the top hits found by VAPOR.
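For readers who like to see the logic of the steps above in code, here is a minimal Python sketch of what happens for a single segment: extract the sequence name from the best VAPOR score line, then pull the matching record out of the reference FASTA. The score lines, names and sequences are hypothetical, and the exact column layout of VAPOR's output is an assumption (all the regular expression relies on is that the title, starting with ">", comes last after a tab). Note that Python writes the backreference as \1 where the Galaxy tool uses $1, and that matching on the full title is a simplification of what seqtk_subseq does.

```python
import re

# Hypothetical VAPOR score lines for one segment, best match first.
# Columns are tab-separated; the last one is the FASTA title with its ">".
vapor_lines = [
    "0.95\t180000.0\t1700\t105.9\t5000\t>HA_refA|H4N6",
    "0.60\t90000.0\t1700\t52.9\t5000\t>HA_refB|H4N2",
]

# Hypothetical reference FASTA for the same segment.
reference_fasta = """>HA_refA|H4N6
ACGTACGT
ACGT
>HA_refB|H4N2
TTTTCCCC
"""

def top_hit_name(score_lines):
    """Strip everything up to the '>' from each line, then keep the first name."""
    names = [re.sub(r"^.+\t>(.+)$", r"\1", line) for line in score_lines]
    return names[0]

def parse_fasta(text):
    """Return a dict mapping each FASTA title (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None and line.strip():
            records[name].append(line.strip())
    return {title: "".join(parts) for title, parts in records.items()}

best = top_hit_name(vapor_lines)
records = parse_fasta(reference_fasta)
hybrid_entry = f">{best}\n{records[best]}"
print(hybrid_entry)
```

Repeating this for all eight segments and concatenating the resulting entries corresponds to the Collapse Collection step.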

Mapping to a hybrid reference

If things went well, the hybrid reference we just obtained should be close enough to our sample across all segments to allow successful mapping of reads. Before we start the mapping, though, we may want to truncate the segment names in our hybrid reference genome: currently these names still reflect the full origin of the segment sequences, but from now on we only need the segment ID. We can reuse the Replace tool from before to truncate the names.

Hands On: Shortening sequence titles

Replace parts of text ( Galaxy version 9.3+galaxy1)

param-file “File to process”: the hybrid reference genome; output of Collapse Collection

In param-repeat “1. Find and Replace”:

“Find pattern”: ^>([^|]+).+$

“Replace with”: >$1

“Find-Pattern is a regular expression”: Yes

Add a tag to the output dataset

This is the hybrid reference genome that we are going to use for mapping now. It is an important intermediate result, to which we will need to return later, and to make that easier we are going to add a tag to it now for extra visibility in the history.

Click on the dataset to expand it

Click on Add Tags

Enter a tag like HybridRef (tags have to consist of a single word)

Confirm your choice by pressing Enter on the keyboard or clicking on the new name.
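The effect of the find pattern used for shortening the titles can be tried out in Python. This is only a sketch: the title below is hypothetical and merely assumes that the segment ID is the first of several parts separated by "|". Python writes the backreference as \1 instead of $1.

```python
import re

def shorten_title(line):
    """Keep only the part of a FASTA title before the first '|'."""
    return re.sub(r"^>([^|]+).+$", r">\1", line)

# Hypothetical title line and a sequence line
short = shorten_title(">HA|hypothetical_strain|H4N6")
unchanged = shorten_title("ACGTACGT")
print(short)
print(unchanged)
```

Sequence lines are left untouched because they do not start with ">". One caveat of the pattern, at least in this Python sketch: because the trailing .+ has to match at least one character, a title without any "|" would lose its last character, so the pattern should only be used on titles that are known to contain the separator.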

Having polished the titles of the segments in our hybrid reference genome, we are finally ready for mapping. We will carry it out with BWA-MEM, clean up the result a bit with Samtools view, and produce a quality report for it with QualiMap BamQC.

Hands On: Read mapping and quality control

Map with BWA-MEM ( Galaxy version 0.7.18)

“Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index

param-file “Use the following dataset as the reference sequence”: hybrid reference genome with shortened names; output of Replace

“Single or Paired-end reads”: Paired Collection

“Select a paired collection”: quality-trimmed reads; output of fastp in QC and Trimming

Samtools view ( Galaxy version 1.20+galaxy3)

param-file “SAM/BAM/CRAM data set”: mapped reads BAM dataset; output of BWA-MEM

“What would you like to look at?”: A filtered/subsampled selection of reads

In “Configure filters”:

“Filter by quality”: 20

“Require that these flags are set”: Read is paired and Read is mapped in a proper pair

QualiMap BamQC ( Galaxy version 2.2.2d+galaxy3)

param-file “Mapped reads input dataset”: filtered mapped reads BAM dataset; output of Samtools view

“Skip duplicate reads”: Unselect all

In “Settings affecting specific plots”:

“Number of bins to use in across-reference plots”: 40

Study the report generated with QualiMap

Question: Questions

What is the coverage of each segment by the sequenced reads, and is it uniform?

Look for a plot showing read mapping quality across the reference. What can you conclude?

Coverage is rather different for the different segments. Looking at “Mean coverage” and its “Standard deviation” in the “Chromosome stats” table and at the “Coverage across reference” plot, coverage of the HA segment seems to be most critical since it approaches zero in some regions.

Mapping quality is almost constant at or very near 60 (which happens to be the maximum mapping quality value emitted by BWA-MEM) across all segments with the exception of HA.

These two observations combined with the VAPOR statistics for HA show again that even the best matching reference for this segment is not exactly close to the sequence of our sample. The read mapper had difficulties placing the sequenced reads on that sub-optimal reference and it looks as if we might have very little sequence information for some HA regions.
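As a side note on the Samtools view step used in this section: the two selected flags are bits 0x1 (read is paired) and 0x2 (read is mapped in a proper pair) of the SAM FLAG field, and the quality filter keeps alignments with a mapping quality of at least 20. On the command line this should correspond to something like samtools view -q 20 -f 3. The sketch below spells out the same decision in Python for a few made-up FLAG and MAPQ values.

```python
PAIRED = 0x1       # SAM flag bit: read is paired
PROPER_PAIR = 0x2  # SAM flag bit: read is mapped in a proper pair

def keep_alignment(flag, mapq, min_mapq=20, required=PAIRED | PROPER_PAIR):
    """Keep an alignment only if all required flag bits are set and MAPQ is high enough."""
    return mapq >= min_mapq and (flag & required) == required

# Made-up example alignments as (FLAG, MAPQ)
examples = [(99, 60), (97, 60), (99, 10), (0, 60)]
decisions = [keep_alignment(flag, mapq) for flag, mapq in examples]
print(decisions)
```

FLAG 99 has both required bits set, FLAG 97 describes a paired read that is not in a proper pair, and FLAG 0 is an unpaired read.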

Consensus sequence construction

From the polished mapping of reads to our custom reference we can now construct the consensus sequence of our sample.

Unfortunately, the tool we are going to use for this, ivar consensus, is not capable of working with more than one reference name at a time, but because this is influenza data, our data contains mappings to 8 different segments. So we need to take a little detour: first split the mapped reads data into a collection of datasets, each containing the mappings for just one segment, then perform the consensus construction for all of them in parallel.

Hands On: Splitting mapped reads by genome segment

Split BAM by Reference ( Galaxy version 2.5.2+galaxy2)

param-file “BAM dataset to split by reference”: filtered mapped reads; output of Samtools view

The output from this step has the desired collection structure, but the names of the collection elements are not the nicest. Ideally, we would just reuse the segment names, which are already provided in our mapping reference genome. So let's extract these names again and use them as new element labels.

Hands On: Relabeling collection elements

Select lines that match an expression

param-file “Select lines from”: the mapping reference; output of Replace in Mapping to a hybrid reference

Tip: This is the dataset you should have tagged before. Let's hope you did!

“that”: Matching

“the pattern”: ^>.+

Replace parts of text ( Galaxy version 9.3+galaxy1)

param-file “File to process”: output of Select

In param-repeat “1. Find and Replace”:

“Find pattern”: ^>(.+)$

“Replace with”: $1

“Find-Pattern is a regular expression”: Yes

Relabel identifiers

param-file “Input collection”: the split mappings; output of Split BAM by Reference

“How should the new labels be specified?”: Using lines in a simple text file.

param-file “New Identifiers”: output of Replace

“Ensure strict mapping”: Yes

And with that we are ready for consensus sequence generation!

To accept any base suggested by the mapped sequenced reads as the consensus base for the corresponding genome position, we ask for the following requirements to be fulfilled:

at least ten sequenced reads have to provide information about the base in question
at a minimum, 70% of these reads have to agree on the base at this position.

To avoid getting misled too much by sequencing errors, we are also going to ignore bases with a base calling quality less than 20 in the above counts (i.e., we are going to base our decisions only on bases in sequenced reads that the basecaller of the sequencer was reasonably sure about).

Now what if we cannot obtain a consensus base for a position with the above criteria? In such cases of uncertainty we want to insert an N (i.e. an unknown base) to express that we either did not have enough information about the position or that this information was ambiguous.

All of the above limits for consensus base calling are arbitrary to some degree, and depend somewhat on the quality of the sequencing data. With very high overall coverage, for example, it is possible to increase the coverage threshold, but if you increase that threshold too much, you may end up with a consensus sequence consisting mostly of Ns.
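The decision rule just described can be summarized in a few lines of Python. This is a sketch of the rule as stated in the text, applied to a single genome position; it is not ivar's implementation, which may differ in detail (for example in how it reports ambiguous positions and indels). The function name and the input format are made up for this illustration.

```python
from collections import Counter

def call_consensus_base(observations, min_qual=20, min_depth=10, min_freq=0.7):
    """Call the consensus base at one position.

    observations -- list of (base, base_quality) tuples from the mapped reads
    Returns the majority base, or "N" if depth or agreement is insufficient.
    """
    # only count bases the basecaller was reasonably sure about
    bases = [base for base, qual in observations if qual >= min_qual]
    if len(bases) < min_depth:
        return "N"  # not enough information
    base, count = Counter(bases).most_common(1)[0]
    return base if count / len(bases) >= min_freq else "N"  # ambiguous otherwise

clear = call_consensus_base([("A", 30)] * 12)
ambiguous = call_consensus_base([("A", 30)] * 8 + [("G", 30)] * 4)
low_depth = call_consensus_base([("A", 30)] * 9 + [("A", 10)] * 5)
print(clear, ambiguous, low_depth)
```

In the third example fourteen reads cover the position, but only nine of the base calls reach the quality cutoff, so the position is reported as N.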

Hands On: Per-segment consensus construction

ivar consensus ( Galaxy version 1.4.3+galaxy0)

param-collection “Bam file”: the relabeled collection of mapped reads; output of Relabel identifiers

“Minimum quality score threshold to count base”: 20

“Minimum frequency threshold”: 0.7

“Minimum indel frequency threshold”: 0.8

“Minimum depth to call consensus”: 10

“How to represent positions with coverage less than the minimum depth threshold”: Represent as N (-n N)

The output is a consensus sequence in FASTA format, one per segment, with the names just providing a bit too much detail for our purpose.

Replace parts of text ( Galaxy version 9.3+galaxy1)

param-collection “File to process”: consensus sequences; output of ivar consensus

In param-repeat “1: Find and Replace”:

“Find pattern”: ^>Consensus_(.*)_threshold_.*

“Replace with”: >$1

“Find-Pattern is a regular expression”: Yes

Inspect each consensus sequence generated for the different segments

Question

Does everything look ok?

As expected from the findings so far, the consensus sequence for the HA segment has stretches of Ns in it, which likely reflect the mapping issues and associated loss of coverage caused by our insufficiently sized collection of references.

Placing segments on a phylogenetic tree

The next logical step after obtaining the consensus sequences of segments of our sample is to explore how those sequences are related to the sequences in our reference collection. To do so, we are going to combine the reference sequences of all segments with their corresponding consensus sequence into one multiple sequence alignment (MSA) per segment, and use these to generate phylogenetic trees, again one per segment. We are going to use two rather standard tools, MAFFT and IQ-TREE, for generating MSAs and trees, respectively.

Hands On: Exploring phylogeny

MAFFT ( Galaxy version 7.526+galaxy0)

“For multiple inputs generate”: one or several MSAs depending on input structure

In “Input batch”:

param-repeat “1: Input batch”

param-collection “Sequences to align”: collection of References per segment (INSAFlu)

param-repeat “2: Input batch”

param-collection “Sequences to align”: collection of renamed consensus sequences; output of Replace on consensus sequences

“Type of sequences”: Nucleic acids

Because both input batches are collections of eight elements each, the result is also a collection of eight MSAs, each aligning all reference sequences of one genome segment plus the consensus sequence we have obtained for that segment against each other.

IQ-TREE ( Galaxy version 2.3.6+galaxy0)

param-collection “Specify input alignment file in PHYLIP, FASTA, NEXUS, CLUSTAL or MSF format.”: output of MAFFT

“Specify sequence type …“: DNA

Explore each of the final trees produced by IQ-TREE for the different segments

There are two ways to explore the trees in Galaxy:

One option is to open the Report and Final Tree datasets produced by IQ-TREE. When you scroll down through their content, you will find that these feature simple text representations of the trees.

Another option is to expand the MaxLikelihood Tree datasets generated by IQ-TREE, then click on Visualize. This will let you launch the Phylogenetic Tree Visualization in the middle panel, which lets you explore the tree interactively and in different representations.
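How the two input batches of the MAFFT step end up as one alignment input per segment can be pictured with a small Python sketch. It is purely conceptual: the segment and sequence names are hypothetical, only two of the eight segments are shown, and how Galaxy matches collection elements internally is not covered here (the sketch simply assumes that both collections list the same segments in the same order).

```python
# Hypothetical collections: segment name -> sequence names in that element
references = {"HA": ["HA_ref1", "HA_ref2"], "NA": ["NA_ref1"]}
consensus = {"HA": ["sample_HA"], "NA": ["sample_NA"]}

def build_alignment_inputs(*batches):
    """Combine matching elements of several collections into one list per segment."""
    segments = list(batches[0])
    for batch in batches[1:]:
        if list(batch) != segments:
            raise ValueError("collections must have the same elements in the same order")
    return {seg: [name for batch in batches for name in batch[seg]]
            for seg in segments}

alignment_inputs = build_alignment_inputs(references, consensus)
for segment, names in alignment_inputs.items():
    print(segment, names)
```

Each resulting list corresponds to the sequences that go into one MSA, and later into one tree.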

Question

What are your conclusions about the sample in general and its HA and NA segments in particular?

For most of its segments the sample resembles relatively recent (from the last decade) Eurasian reference sequences.

For HA and NA the sample clusters with the few available samples of the corresponding subtype.

None of the references closest to the sample with respect to HA and NA are close to the recent Eurasian reference cluster for their remaining segments.

A plausible explanation is that the H4 and N6 segments of the sample have been brought into the recent Eurasian background through a reassortment event. Caveat: interpretations like this can be heavily influenced by the size of the reference collection!

Conclusion

Analysis workflows for influenza whole-genome sequencing data need to take into account the specific characteristics of the viral genome. Due to their higher natural variability this is especially true for avian influenza samples and for the HA- and NA-encoding segments of the genome.

Nevertheless, it looks possible, with carefully chosen reference segment sequences and bioinformatic tools, to avoid a computationally expensive de-novo assembly approach and to use mapping against a dynamically compiled reference genome instead.

The rather small reference segment collection suggested for this tutorial consists of 56 different samples, of which only a single one has the H4N6 subtype of the sample analyzed here. Still it allowed us to perform subtyping of the sample, to construct complete consensus sequences for 7 segments including NA, and to draw valuable conclusions about the origin of the sample. It is conceivable that a larger collection of references chosen to capture several strains from each HA subtype could solve the remaining issue of the incomplete HA consensus sequence.

You've Finished the Tutorial

Key points

Reassortment of gene segments makes reference-based mapping of influenza sequencing data challenging

An alternative to de-novo assembly can be mapping to a dynamically chosen reference genome

Variant calling and consensus genome construction can follow workflows used also for other viral sequence data

Standard phylogenetic tools can be used to find relationships between influenza samples but should be used on a per-segment basis


Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.


Citing this Tutorial

Wolfgang Maier, Avian influenza viral strain analysis from gene segment sequencing data (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/variant-analysis/tutorials/aiv-analysis/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{variant-analysis-aiv-analysis,
author = "Wolfgang Maier",
	title = "Avian influenza viral strain analysis from gene segment sequencing data (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/variant-analysis/tutorials/aiv-analysis/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/variant-analysis/tutorials/aiv-analysis/tutorial.json | jq .admin_install_yaml -r)
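
The `curl ... | jq .admin_install_yaml -r` part of the command above only pulls one field out of the tutorial's JSON metadata. If you would rather inspect the tool list before installing anything, the same step can be sketched in Python using only the standard library. This is a minimal sketch, not part of Ephemeris: the function names and the output file name `aiv_tools.yml` are hypothetical, and it assumes the JSON document exposes the tool list as a string under the `admin_install_yaml` key, as the `jq` filter implies.

```python
import json
import urllib.request

# URL copied from the shed-tools command above.
TUTORIAL_JSON_URL = (
    "https://training.galaxyproject.org/training-material/api/topics/"
    "variant-analysis/tutorials/aiv-analysis/tutorial.json"
)


def extract_install_yaml(tutorial_json: str) -> str:
    """Return the tool list stored under 'admin_install_yaml'.

    Mirrors `jq .admin_install_yaml -r` (assumed field layout).
    """
    document = json.loads(tutorial_json)
    if "admin_install_yaml" not in document:
        raise KeyError("no 'admin_install_yaml' field in the tutorial JSON")
    return document["admin_install_yaml"]


def fetch_install_yaml(url: str = TUTORIAL_JSON_URL) -> str:
    """Download the tutorial metadata and return the embedded tool list."""
    with urllib.request.urlopen(url) as response:
        return extract_install_yaml(response.read().decode("utf-8"))


def save_install_yaml(path: str = "aiv_tools.yml") -> None:
    """Write the tool list to a local file (hypothetical file name)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(fetch_install_yaml())
```

Calling `save_install_yaml()` writes the tool list to a local file, which can then be reviewed and passed to `shed-tools install` through the `-t` option together with the `-g` and `-a` options shown above.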

Alternatively, you can copy and paste the following YAML:

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: text_processing
  owner: bgruening
  revisions: 86755160afbf
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: bwa
  owner: devteam
  revisions: 4a196b9c72c2
  tool_panel_section_label: Mapping
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: bamtools_split_ref
  owner: iuc
  revisions: 9b8feb118f9e
  tool_panel_section_label: SAM/BAM
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: fastp
  owner: iuc
  revisions: d60c3f704da0
  tool_panel_section_label: FASTA/FASTQ
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: iqtree
  owner: iuc
  revisions: f87ac61981f0
  tool_panel_section_label: Evolution
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: ivar_consensus
  owner: iuc
  revisions: 196d126cd016
  tool_panel_section_label: Virology
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: qualimap_bamqc
  owner: iuc
  revisions: 19ece8afbaab
  tool_panel_section_label: SAM/BAM
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: samtools_view
  owner: iuc
  revisions: 32dc5f781059
  tool_panel_section_label: SAM/BAM
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: seqtk
  owner: iuc
  revisions: a019807f4e67
  tool_panel_section_label: FASTA/FASTQ
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: vapor
  owner: iuc
  revisions: 244812f5bd1f
  tool_panel_section_label: Metagenomic Analysis
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: collapse_collections
  owner: nml
  revisions: 90981f86000f
  tool_panel_section_label: Collection Operations
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mafft
  owner: rnateam
  revisions: 8e649f27aa0d
  tool_panel_section_label: Multiple Alignments
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
