View markdown source on GitHub

# High Performance Computing for Pairwise Genome Comparison

## Questions

• What is sequence comparison and why does it matter?

• What are the challenges and how can we use High Performance Computing to overcome them?

• How can we use Galaxy for large scale sequence comparison?

## Objectives

• Learn the basics of genome comparison

• Run sequence comparisons in Galaxy

• Using GECKO for sequence comparison

• Using CHROMEISTER for comparing large sequences such as plants and chromosomes

last_modification Last modification: Jan 28, 2023

### What is sequence comparison?

• Finding string matches between text paragraphs
• Comparing biological sequences can be seen as an instance of common string matching problem but with some particularities

• Restricted alphabets: e.g. DNA = {A, C, G, T}
• instead of looking for only one pattern, we have to look for all possible patterns!

### Why is it importantg?

• Sequence comparison is a core problem of bioinformatics that allows us to study:
• Evolution (Where do we come from? And other species?)
• Structure (What are we made of? How does our body work?)
• Functionality (How do changes in our DNA affect us?)

• Widely used in several fields!
• Whole genome alignment, gene prediction, phylogenomics, pharmacogenetics, metagenomics, etc.

### What are the challenges?

• A lot of progress has taken place since the beginning (around 1960s)
• So what is the problem?

• New technology produces huge amounts of data
• Optimal sequence comparison has complexity O(n2), which is a lot when sequences are long
• Several ways to compare sequences, each having its advantages: DP-based, gapped and ungapped, alignment-free
• DNA contains lots of repeats and low complexity regions, which make it difficult to find the conserved signal
• And also makes it very hard for computational algorithms

### Why is sequence comparison hard?

• The complexity of the problem grows as the length of the sequences becomes larger
• That means that for small sequences it is OK; but for large sequences the problem can take a lot of time
• Computing power doubles every two years according to Moore’s law but sequence databases grow purely exponential every year
Sequence type Approx. length
Gene ~1,000 to 10,000 bp
Bacteria genome ~106 bp
Mammalian chr. ~108 bp
Mammalian genome ~2 * 109 bp
Plant genome ~3 * 109 bp
Large plant genomes ~2 * 1010 bp
Amoeba genome ~6 * 1011 bp

### Exponential growth of the Sequence Read Archive

• Other databases have experienced similar growth too

### Methods for sequence comparison

• Dot-matrix comparison (no alignments)
• Comparing all letters with each other (extremely slow/noisy)

• Alignment-free methods
• Similar to the previous but typically based on k-mers (words)

• Aligning methods
• Exhaustive dynamic programming algorithms (Slow but optimal)
• Local alignment
• Global alignment
• Seed-based heuristics
• Words of fixed size are used to start alignments (Fast but less sensitivity). Can also include gaps.

### Dot-matrix methods

• Takes each letter in both sequences and compare it to all the rest
• Takes quadratic time and space!
• Does not produce alignments, but rather just represents a comparison
• The x and y axis represent the sequences being compared

### How useful are raw dotplots?

.pull-left[

• Dot matrix methods do not take into account the global information or context

• Noise

• Where is the signal? ]

.pull-right[.image-70[]]

### How do we include alignments?

.pull-left[

• Alignments have to take into account their surroundings!

• Use dynamic programming to compute an accumulated score

• Match = 4
• Miss = -4

]

.pull-right[ .image-70[ ] ]

### How do we include alignments? Cont. I

.pull-left[

• Alignments have to take into account their surroundings!

• Use dynamic programming to compute an accumulated score

• The score of each cell comes from either the diagonal (no penalty but match/mismatch score) or from a row or column, which is a gap in either the query or reference sequence (“jumping” gap penalty is included then)

• Match = 4
• Miss = -4
• Gap = -iG - eG*L

]

.pull-right[ ]

### How do we include alignments? Cont. III

• But what if the two sequences are not toy examples?

• Typically, a mammalian chromosome is around the 100 Mbp mark

• Worst case scenario this algorithm takes quadratic time and space

• Imagine comparing two amoeba genomes (Not your usual case, though)
• Total number of cells: 1022 ~the number of sand grains in the earth

### How to speed up finding alignments?

• DP algorithms are not feasible for long sequences

• Use seeds of fixed size (k-mers) to find anchor points and then perform the alignment

• Can be done in linear time

• But careful, the number of common seeds can be huge and optimization is still needed

### How to speed up finding alignments? Cont. II

• Extend the common seeds identified previously to form larger fragments

• These are called High-scoring Segment Pairs and do not typically include gaps

• Some programs (such as BLAST) include small gaps

• This approach also enables to compare distant sequences with many evolutionary rearrangments since fragments do not require to be close to each other
• Similar result to the local dynamic programming algorithm (as opposed to the global approach where everything is connected)

### Recap on Sequence comparison

• Sequence comparison is inherently hard because it requires quadratic time and space to produce optimal alignments

• Especially when there are more (and longer) genome sequences available everyday (Ensembl blog)

• Heuristic approaches based on seeds are required to cope with the complexity of dynamic programming methods

• Tools are needed to quickly compare sequences while providing insightful information

### Methods

• Now that we have a general background on sequence alignment, lets jump to the software that we will use in this tutorial

• GECKO: fine-grained sequence comparison that uses secondary memory to enable virtually comparisons of any size
• GECKO Multi-Genome-Viewer: A tool for the interactive inspection of sequence comparisons
• CHROMEISTER: Ultra-fast approach for the previsualization of sequence comparisons while dealing with noise and repeats, including full mammalian genomes and plants

### GECKO

• Pairwise genome comparison software which reduces execution time in pairwise and multiple genome comparison studies

• It uses an ‘out of core’ strategy, i.e. it runs on secondary memory (disk) as opposed to RAM
• Can run anywhere, about ~4 GB of RAM and ~1 TB of disk required for the longest sequences

• Faster than state-of-the-art software especially for longer sequences such as chromosomes

### The GECKO algorithm (Cont. I)

• A dictionary is computed for each sequence containing the positional information of each word (possible seed)

• Once each dictionary is sorted, perfect matches between words produce a set of seeds (alignment candidates)

• Seeds are then sorted (by their diagonal position xstart - ystart) and filtered

• Finally, the seeds are extended (up and downstream) to generate a set of High-scoring Segment Pairs (HSPs)

### Parallelising GECKO

.pull-left[

• In a static distribution cores can become idle if they finish before other cores → resources are wasted

• GECKO uses MPI to distribute the workload dynamically based on the length

• When a core (worker) has finished executing a task (comparison) the master node broadcasts more work

• This reduces overall makespan time because of workload balance

]

.pull-right[ ] —

### Parallelising GECKO (Cont. I)

.pull-left[

• GECKO also employs a second level of parallelism besides using MPI to distribute data dynamically

• It consists on sorting the initial seeds found (which is the most time-consuming step) using shared memory threads (pthreads and openMP)

• Each core uses up to t threads to sort the dictionaries containing the seeds ]

.pull-right[ ] —

### GECKO-MGV: interactive sequence comparison

• GeckoMGV is a Web-based application aimed to visualize results from multiple genome comparison allowing deep data analysis. It features:

• User friendly interface

• Multilayer for displaying and overlaying comparisons

• Interactive zooming and filtering

• External and proprietary post-processing services

• Adaptable from equivalent software (e.g. MAUVE)

• Dendrograms and Multiple Sequence Alignment visualization

### GECKO-MGV: interactive sequence comparison (Cont. II)

• A typical exercise would include running a sequence comparison with GECKO

• Visualizing the alignments with GECKO-MGV

• Interactively selecting regions of interest and performing post-processing

• Such as selecting repeats, extracting them and performing multiple sequence alignment to find out single point mutations

### CHROMEISTER: Motivation

.pull-left[

• GECKO is suitable for chromosome comparisons

• But all vs all chromosomes between any two species generate usually over 400 comparisons

• This results in extremely large computation times
• Do all of these even have any similarity signals?
• Instead of runnning everything from scratch with GECKO, first use CHROMEISTER to detect significant comparisons ]

.pull-right[ ]

### CHROMEISTER: Motivation (Cont. I)

.pull-left[

• Comparisons tend to be full of repetitions in large mammalians, let alone plants!

• State-of-the-art methods get stuck handling large number of repetitions or resulting dotplots are too noisy

• Can we extract any information from raw seeds (hits) without having to generate fragments (slower)? ]

.pull-right[ ]

### CHROMEISTER: Methods

• CHROMEISTER (CHROmosome MEISTER) is an ultra fast heuristic approach to computing extremely large pairwise genome comparison in desktop PCs

• Hybrid indexing approach

• Heuristically select long, unique and inexact hits!

• Probabilistic filtering

• Higher concentration of unique hits within HSPs

• Apply a kernel to improve visualization

### CHROMEISTER: Methods (Cont. I)

• Heuristically select unique and inexact seeds:

• Hits with frequency above one are repetitions by definition (in a computational sense) so we remove them
• Short k-mers produce repetitions and noise, so we have to use long k-mers
• But long k-mers produce less seeds due to biological evolution, therefore a certain inexactitude is allowed

• Are these two words equal? Under what considerations?

### CHROMEISTER: Methods (Cont. II)

• Probabilistic filtering
• The probability of a sequence region containing more hits will be higher if there was an existing HSP!
• The discrete dotplot is downsampled to a smaller representation to group hits into blocks (similar to HSPs)

• Kernel and score calculation
• A kernel is convoluted with the dotplot to improve visualization and remove single repetitions
• The scoring distance is calculated between the peak hit points using taxicab distance:
$d_{raw} = \sum_{i}^{l-1}taxicab(max(H_m(i), H_m(i+1)))$ $d = \frac{d_{raw}}{l^2}$

### CHROMEISTER: Conclusions

.pull-left[

• Produces a dotplot visualization in minutes using 1 core and 1 GB of RAM
• It is used in the plants community
• This is due mostly to its ability to filter repeats and to handle large sizes
• Also used for analysis of contig/scaffold reordering

• (Right) Pairwise genome comparison between Aegilops tauschii (3.2 Gbps) and Triticum aestivum (4.2 Gbps) in under 16 minutes using 1 core and 1 GB of RAM ]

.pull-right[ ]

## Key Points

• Sequence comparison is difficult due to its computational complexity and the growth of databases
• There are several approaches to sequencen comparison, and we need to know which one to use (DP-based, alignment-free, seeds, etc.)
• We have learnt the internals regarding GECKO, GECKO-MGV and CHROMEISTER to work with single chromosome comparisons and/or exhaustive searchs, the post-processing and for larger/noisy experiments, respectively