Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.
Press P
again to switch presenter notes off
Press C
to create a new window where the same presentation will be displayed.
This window is linked to the main window. Changing slides on one will cause the
slide to change on the other.
Useful when presenting.
Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.
Press P
again to switch presenter notes off
Press C
to create a new window where the same presentation will be displayed.
This window is linked to the main window. Changing slides on one will cause the
slide to change on the other.
Useful when presenting.
Before diving into this slide deck, we recommend you to have a look at:
The likelihood of a model is a quantity that is proportional to the probability that the model gave you the data.
It is easy(ish) to calculate the probability that a given model yields a particular outcome -- the data set.
Maximum Likelihood (ML) seeks to find the model for which this probability is the highest.
Imagine I have a set of very strange 10-sided dice, where the sides of my blue die are labelled (1,1,2,3,4,5,5,6,6,6), and the sides of the red die are labelled (1,1,2,2,3,3,4,4,5,6).
Both dice have the same range of possible outcomes, 1 through 6, but the relative probabilities are different.
score | Pred | Pblue |
---|---|---|
1 | 0.2 | 0.2 |
2 | 0.1 | 0.2 |
3 | 0.1 | 0.2 |
4 | 0.1 | 0.2 |
5 | 0.2 | 0.1 |
6 | 0.3 | 0.1 |
Total | 1.0 | 1.0 |
Suppose I choose a die and roll it three times, getting values 3, 6, and 2.
Which die is most likely to have been used?
If it were the blue die then the probability of this outcome is 0.1×0.3×0.1=0.003.
If it were the red die then this probability is 0.2×0.1×0.2=0.004.
Neither of these probabilities is high! But the larger of the two is for the red die: thus, given the data, we would say the maximum likelihood estimate of the "model" (here, which die was used) is the red die.
We could write this example more formally as
Data D is the outcome (3,6,2);
P(B) and P(R) are the probabilities of choosing the blue and red dice (say 0.5 each);
P(D|B) and P(D|R) are the probabilities of the data given the model "blue die" and "red die";
P(B|D) is the probability of the model given the data, and it can be written
P(B|D)=P(D|B)P(B)P(D)=P(D|B)P(B)P(D|B)P(B)+P(D|R)P(R)
-- this is Bayes' formula.
P(R|D)=P(D|R)P(R)P(D)=P(D|R)P(R)P(D|B)P(B)+P(D|R)P(R)
The dice example was very simple but it showed us important concepts:
Now suppose we have a probability matrix whose entries are the probabilities of nucleotide substitution in some fixed time, say 1 million years:
We could then calculate the probability of going from one nucleotide to another, say A to G, in 1M years by reading it from the graph:
Pr(A→G)=PA,G=0.01.
To calculate the probability of multiple nucleotides changing we multiply the individual probabilities together -- which is making direct use of the assumption that each nucleotide site evolves independently of the others.
We would then calculate the probability of going from sequence AAGT to AAGA as
prob=P(A→A)×P(A→G)×P(G→G)×P(T→A)=0.97×0.01×0.97×0.01≈9.4×10−6
To calculate the probability of sequences changing in our toy example for say two million years, we would square the probability matrix P:
This doesn't generalise to arbitrary time periods, so instead we use a rate matrix, like this:
Converting a rate matrix to a probability requires a matrix exponentiation, giving formulae like this:
P(t)=eQt.
We won't worry here about technical details, but note that some classes of rate matrix are easier to work with than others, the easiest being the JC69 (one parameter) matrix.
Many models of sequence evolution available, including different base (nucleotide) frequencies, and different sites being allowed different rates.
Likelihoods will be expressed as a (negative) log-likelihood, and these are used to compare models.
Typically these will be quite large (e.g., -37000, -80000), but it is the relative likelihoods that matter, so the difference between log likelihoods.
A model is selected to give the best chance of getting the phylogeny right.
More parameter-rich models fit better; we need to avoid over-fitting.
The (log)-likelihood value can be adjusted to penalise the model, based on the number of parameters.
Common penalties are the Akaike Information Criterion (AIC).
There is also a "corrected" AIC (AICc), and a "Bayesian Information Criterion" penalty (BIC).
For a model with k free parameters and a log likelihood L, here are the penalty functions:
AIC is the original information-theoretic correction:
AIC=2k−2ln(L)
AICc is a correction for small sample sizes and is equal to
AICc=AIC+2k2+2kn−k−1
BIC is defined by
BIC=kln(n)−2ln(L)
(where n is a measure of sample size).
All of these are in common use but the commonest is AIC, and that is the method we use in the tutorial.
The likelihood calculations in this tutorial are based on reversible models: that means that there is no difference between the likelihood of going from sequence A to sequence B of from B to A.
This means that the output trees are unrooted.
(There are some asymmetric models included in IQTree but they are well beyond the scope of this tutorial and are not in common use.)
Now we have the huge task:
ML Search: for every tree $T$ in tree space { for each parameter in the substitution rate matrix { for each branch length { Calculate the (log-)likelihood of the alignment } } }
... and find the best combination!
All this means that Maximum Likelihood is slow.
where the asterisk is a short-hand to make the row-sums equal 0.
Some sites are non-coding: they evolve faster than others.
Third codon position sites are largely redundant: for most amino acids, the third nucleotide is free to vary.
Some are under strong selection and evolve slowly.
Some are so important that they must stay in the same state else a protein will not fold properly: these are fixed.
This means that even if the same rate matrix might be applicable at different sites, the overall substitution rate might be hundreds of times faster at one site than another.
Models that accommodate this are called "Rates Across Sites" or RAS models.
By default, IQTree includes RAS models.
IQTree is at the forefront of maximum likelihood phylogenetic tools. It is by far the most advanced such tool and is under active development, adding new models and features.
It is available from IQtree.org.
To cite IQTree2, please use
B.Q. Minh, H.A. Schmidt, O. Chernomor, D. Schrempf, M.D. Woodhams, A. von Haeseler, R. Lanfear (2020) IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol., 37:1530-1534. https://doi.org/10.1093/molbev/msaa015
A network:
Nearest Neighbour Interchange is a way to transform one tree into another -- so we can search for the best tree.
An NNI move consists of effectively collapsing an internal branch of the tree and re-expanding it out in either of two possible ways.
For example, given the original tree
there are two internal edges: x−y and y−z. We can do an NNI move on each of them.
Tree Search algorithms work essentially by
Starting with a reasonably good estimate, e.g., using Neighbour-Joining;
Checking all the neighbouring trees under whichever perturbation is being used for an acceptable new tree;
If an acceptable new tree is available, move to it! If not, then stop.
This figure from Cao et al. (2019) illustrates moving around in tree space.
Source: https://doi.org/10.1101/746362
Anolis is a highly speciose genus of lizards -- more than 425 species (unless as some would have it, many should be moved to another genus) -- native to the Americas.
Graphic: "Picture of a lizard (Anolis carolinensis) I (DanielCD) took in Atascocita, Texas (USA) on 4/22/05. Atascocita is in southeast Texas just north of Houston."
This file is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.
iqtree
outputmac@mac-pc-443 ~/N/p/t/2/AusBioCommons-Phylogenetics> iqtree2 -s anolis-aligned.fastaIQ-TREE multicore version 2.1.3 COVID-edition for Linux 64-bit built Apr 21 2021Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.Host: mac-pc-443 (AVX2, FMA3, 31 GB RAM)Command: iqtree2 -s anolis-trimmed-aligned-clustal.fasta -redoSeed: 225340 (Using SPRNG - Scalable Parallel Random Number Generator)Time: Thu Dec 21 13:26:51 2023Kernel: AVX+FMA - 1 threads (8 CPU cores detected)HINT: Use -nt option to specify number of threads because your CPU has 8 cores!HINT: -nt AUTO will automatically determine the best number of threads to use.Reading alignment file anolis-trimmed-aligned-clustal.fasta ... Fasta format detectedAlignment most likely contains DNA/RNA sequencesAlignment has 55 sequences with 1462 columns, 1138 distinct patterns913 parsimony-informative, 172 singleton sites, 377 constant sites Gap/Ambiguity Composition p-value 1 Anolis.acutus 3.56% passed 50.37% 2 A.aeneus 5.54% failed 2.25% 3 A.agassizi 3.01% failed 1.82% 4 A.ahli 3.15% passed 81.11% 5 A.aliniger 2.74% passed 43.47% 6 A.alutaceous 2.94% passed 9.34% 7 A.angusticeps 2.80% passed 57.88% 8 A.bahorucoensis 2.53% passed 21.76% 9 A.barahonae 3.90% failed 1.72% 10 A.bartschi 3.76% passed 30.22% 11 A.bimaculatus 4.65% passed 32.77%
iqtree
outputCreate initial parsimony tree by phylogenetic likelihood library (PLL)... 0.010 secondsPerform fast likelihood tree search using GTR+I+G model...Estimate model parameters (epsilon = 5.000)Perform nearest neighbor interchange...Estimate model parameters (epsilon = 1.000)1. Initial log-likelihood: -37100.3822. Current log-likelihood: -37098.639Optimal log-likelihood: -37097.515Rate parameters: A-C: 1.41395 A-G: 4.11730 A-T: 1.55773 C-G: 0.69506 C-T: 7.28507 G-T: 1.00000Base frequencies: A: 0.339 C: 0.261 G: 0.118 T: 0.282Proportion of invariable sites: 0.186Gamma shape alpha: 0.832Parameters optimization took 2 rounds (0.144 sec)Time for fast ML tree search: 0.993 seconds
iqtree
outputModelFinder will test up to 286 DNA models (sample size: 1462) ... No. Model -LnL df AIC AICc BIC 1 GTR+F 41407.688 115 83045.375 83065.197 83653.445 2 GTR+F+I 39252.694 116 78737.387 78757.569 79350.744 3 GTR+F+G4 37164.452 116 74560.904 74581.085 75174.261 4 GTR+F+I+G4 37096.959 117 74427.917 74448.462 75046.562 5 GTR+F+R2 37597.901 117 75429.802 75450.347 76048.447 6 GTR+F+R3 37179.249 119 74596.499 74617.780 75225.718 7 GTR+F+R4 37074.898 121 74391.795 74413.828 75031.590 8 GTR+F+R5 37037.210 123 74320.421 74343.219 74970.791 9 GTR+F+R6 37031.233 125 74312.465 74336.043 74973.410 21 SYM+R5 37262.481 120 74764.963 74786.618 75399.470 22 SYM+R6 37261.627 122 74767.254 74789.668 75412.336 34 TVM+F+R5 37097.645 122 74439.290 74461.704 75084.373 35 TVM+F+R6 37095.642 124 74439.285 74462.471 75094.942 47 TVMe+R5 37745.606 119 75729.213 75750.494 76358.432 48 TVMe+R6 37742.081 121 75726.162 75748.195 76365.957 60 TIM3+F+R5 37076.842 121 74395.685 74417.718 75035.480 61 TIM3+F+R6 37070.459 123 74386.918 74409.717 75037.288 73 TIM3e+R5 37747.554 118 75731.107 75752.019 76355.039 74 TIM3e+R6 37740.393 120 75720.786 75742.442 76355.293 86 TIM2+F+R5 37042.461 121 74326.921 74348.954 74966.716 87 TIM2+F+R6 37036.463 123 74318.927 74341.725 74969.297 99 TIM2e+R5 37274.306 118 74784.612 74805.524 75408.544100 TIM2e+R6 37273.302 120 74786.604 74808.259 75421.111112 TIM+F+R5 37081.068 121 74404.136 74426.169 75043.931113 TIM+F+R6 37074.732 123 74395.464 74418.262 75045.834125 TIMe+R5 37757.456 118 75750.913 75771.824 76374.845126 TIMe+R6 37750.820 120 75741.641 75763.296 76376.148138 TPM3u+F+R5 37129.051 120 74498.103 74519.758 75132.610
iqtree
outputEstimate model parameters (epsilon = 0.100)1. Initial log-likelihood: -37042.461Optimal log-likelihood: -37042.422Rate parameters: A-C: 1.77793 A-G: 5.08796 A-T: 1.77793 C-G: 1.00000 C-T: 8.61510 G-T: 1.00000Base frequencies: A: 0.339 C: 0.261 G: 0.118 T: 0.282Site proportion and rates: (0.321,0.030) (0.225,0.316) (0.232,1.114) (0.196,2.482) (0.026,6.784)Parameters optimization took 1 rounds (0.163 sec)Computing ML distances based on estimated model parameters...Computing ML distances took 0.054833 sec (of wall-clock time) 0.054830 sec(of CPU time)Computing RapidNJ tree took 0.000283 sec (of wall-clock time) 0.000282 sec (of CPU time)Log-likelihood of RapidNJ tree: -37150.315
iqtree
output-------------------------------------------------------------------| INITIALIZING CANDIDATE TREE SET |--------------------------------------------------------------------Generating 98 parsimony trees... 0.879 secondComputing log-likelihood of 98 initial trees ... 1.283 secondsCurrent best score: -37042.422Do NNI search on 20 best initial treesEstimate model parameters (epsilon = 0.100)BETTER TREE FOUND at iteration 1: -37042.003Estimate model parameters (epsilon = 0.100)BETTER TREE FOUND at iteration 2: -37036.294Iteration 10 / LogL: -37042.212 / Time: 0h:0m:5sIteration 20 / LogL: -37058.987 / Time: 0h:0m:6sFinish initializing candidate tree set (10)Current best tree score: -37036.294 / CPU time: 6.458Number of iterations: 20
iqtree
output--------------------------------------------------------------------| OPTIMIZING CANDIDATE TREE SET |--------------------------------------------------------------------Estimate model parameters (epsilon = 0.100)BETTER TREE FOUND at iteration 27: -37036.066Iteration 30 / LogL: -37044.404 / Time: 0h:0m:9s (0h:0m:31s left)Iteration 40 / LogL: -37036.365 / Time: 0h:0m:11s (0h:0m:26s left)Iteration 50 / LogL: -37036.142 / Time: 0h:0m:14s (0h:0m:22s left)Iteration 60 / LogL: -37038.691 / Time: 0h:0m:16s (0h:0m:19s left)Iteration 70 / LogL: -37039.589 / Time: 0h:0m:19s (0h:0m:15s left)Iteration 80 / LogL: -37049.077 / Time: 0h:0m:21s (0h:0m:12s left)Iteration 90 / LogL: -37036.399 / Time: 0h:0m:23s (0h:0m:9s left)Iteration 100 / LogL: -37036.108 / Time: 0h:0m:26s (0h:0m:7s left)Iteration 110 / LogL: -37036.508 / Time: 0h:0m:28s (0h:0m:4s left)Iteration 120 / LogL: -37036.242 / Time: 0h:0m:31s (0h:0m:1s left)TREE SEARCH COMPLETED AFTER 128 ITERATIONS / Time: 0h:0m:33s
iqtree
output--------------------------------------------------------------------| FINALIZING TREE SEARCH |--------------------------------------------------------------------Performs final model parameters optimizationEstimate model parameters (epsilon = 0.010)1. Initial log-likelihood: -37036.066Optimal log-likelihood: -37036.065Rate parameters: A-C: 1.78140 A-G: 5.12638 A-T: 1.78140 C-G: 1.00000 C-T: 8.65177 G-T: 1.00000Base frequencies: A: 0.339 C: 0.261 G: 0.118 T: 0.282Site proportion and rates: (0.320,0.029) (0.222,0.311) (0.230,1.085) (0.200,2.454) (0.027,6.674)Parameters optimization took 1 rounds (0.074 sec)BEST SCORE FOUND : -37036.065Total tree length: 10.598Total number of iterations: 128CPU time used for tree search: 32.804 sec (0h:0m:32s)Wall-clock time used for tree search: 33.018 sec (0h:0m:33s)Total CPU time used: 33.400 sec (0h:0m:33s)Total wall-clock time used: 33.500 sec (0h:0m:33s)Analysis results written to: IQ-TREE report: anolis-aligned.fasta.iqtree Maximum-likelihood tree: anolis-aligned.fasta.treefile Likelihood distances: anolis-aligned.fasta.mldist Screen log file: anolis-aligned.fasta.logDate and Time: Thu Dec 21 13:28:52 2023
iqtree
output
Next - phylogenetic networks!
This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!
Author(s) |
|
Reviewers |
|
Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.
Before diving into this slide deck, we recommend you to have a look at:
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |