Slide1
Part two
An Introduction to Phylogenetic Methods
Dr Laura Emery
Laura.Emery@ebi.ac.uk
www.ebi.ac.uk
Slide2
Objectives
After this tutorial you should be able to…
  Discuss a range of methods for phylogenetic inference, their advantages, assumptions and limitations
  Implement some phylogenetic methods using publicly available software
  Appreciate some approaches for assessing branch support and selecting an appropriate substitution model
  Know where to look for further information
Slide3
Outline
  Alignment for phylogenetics
  Phylogenetics: The general approach
  Phylogenetic Methods (1 – simple methods)
  Assessing Branch Support
  BREAK
  Substitution Models
  Phylogenetic Methods (2 – statistical inference)
  Deciding which model to use (hypothesis testing)
  Software
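Slide4
The problem of multiple substitutions
Multiple substitutions at the same site are more likely to have occurred between distantly related species
> We need an explicit model of evolution to account for these
[Figure: nucleotide states at a single site (A, A, A, T, G); asterisks mark hidden mutations]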
Slide5
Methodological approaches
Distance matrix methods (pre-computed distances)
  UPGMA: assumes perfect molecular clock. Sokal & Michener (1958)
  Minimum evolution (e.g. Neighbor-joining, NJ). Saitou & Nei (1987)
Maximum parsimony. Fitch (1971)
  Minimises number of mutational steps
Maximum likelihood, ML
  Evaluates statistical likelihood of alternative trees, based on an explicit model of substitution
Bayesian methods
  Like ML but can incorporate prior knowledge
What is a substitution model?
Slide6
Statistical phylogenetic inference
Figure: Brian Moore
Slide7
Models of sequence evolution
We use models of substitution to ‘roughly’ describe the way that we believe the sequences have evolved
They are necessarily highly simplified descriptions of more complex biological processes
Parameters can be added to build more sophisticated models if we believe this is relevant for our data
Slide8
Substitution Models
Common nested models
  Jukes and Cantor (JC) 1969
  Kimura 2 Parameter (K2P) 1980
  Felsenstein 1981 (F81)
  Hasegawa, Kishino and Yano 1985 (HKY85)
  Generalised time-reversible (GTR or REV) Tavaré 1986
Accounting for rate heterogeneity
Other substitution models
Slide9
The Jukes and Cantor (JC) 1969 model
1 parameter
  μ = mutation rate
Assumptions
  Equal base frequencies
  All mutations equally likely
  All sites evolve at the same rate
  All sites evolve independently
  Time reversibility
[Figure: substitution diagram linking A, C, G and T; every arrow is labelled μ]
d = estimated nucleotide distance
p = observed distance in sequence data
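The step from the observed distance p to the estimated distance d can be sketched in a few lines. This is a minimal sketch that assumes the standard Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) p); the function names and the example sequences are illustrative and do not come from the slides.

```python
import math

def observed_distance(seq1, seq2):
    """Proportion of aligned sites that differ (the observed distance p)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)

def jc_distance(p):
    """Jukes-Cantor estimate of substitutions per site (d) from p.

    Assumes the standard JC69 correction. It is undefined once p
    reaches 0.75, the value expected for unrelated sequences.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in the range [0, 0.75)")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)

# Hypothetical aligned sequences, for illustration only.
p = observed_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jc_distance(p)
print(f"p = {p:.3f}, d = {d:.3f}")
```

The estimated distance is always at least as large as the observed one, because the correction adds back the hidden mutations that the raw count misses.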
Slide10
But not all substitutions are equally likely…
Transitions are more likely to occur than transversions
Figures: Andrew Rambaut
Slide11
Kimura 2 Parameter (K2P) 1980
2 parameters
  μ = mutation rate
  κ = transition/transversion ratio
Assumptions retained
  Equal base frequencies
  All sites evolve at the same rate
  All sites evolve independently
  Time reversibility
Assumption relaxed
  All mutations equally likely
[Figure: substitution diagram linking A, C, G and T; transition arrows are labelled κ and transversion arrows μ]
d = estimated nucleotide distance
p = observed distance in sequence data
q = proportion of sites with transversional differences
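A distance under K2P can be sketched by counting transitions and transversions separately. This is a minimal sketch assuming Kimura's standard formula, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where Q is the proportion of transversional differences (q above) and P is the proportion of transitional differences (that is, p - q in the notation above). The function name and sequences are illustrative only.

```python
import math

PURINES = {"A", "G"}

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Assumes the sequences contain only A, C, G and T; gaps or
    ambiguity codes would be misclassified by this sketch.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = 0
    transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b:
            continue
        # A<->G and C<->T stay within purines or within pyrimidines.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    P = transitions / n    # proportion of transitional differences
    Q = transversions / n  # proportion of transversional differences
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

# Hypothetical 20-site alignment: two transitions and one transversion.
seq_a = "AAAAACCCCCGGGGGTTTTT"
seq_b = "GAAAACCCCTGGGGGTTTTA"
print(f"K2P distance = {k2p_distance(seq_a, seq_b):.4f}")
```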
Slide12
But base frequencies are often not equal...
Base frequencies vary among and within genomes
[Figure: base frequencies of A, C, T and G]
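Slide13
Felsenstein 1981 (F81)
4 symbols (3 parameters)
  πA, πC, πG, πT = base frequencies
  πA + πC + πG + πT = 1 (so 3 free parameters)
Assumptions retained
  All mutations equally likely (apart from the effect of base frequencies)
  All sites evolve at the same rate
  All sites evolve independently
  Time reversibility
Assumption relaxed
  Equal base frequencies
[Figure: substitution diagram linking A, C, G and T; arrows are labelled with the base frequencies πA, πC, πG and πT]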
Slide14
Hasegawa, Kishino and Yano 1985 (HKY85)
6 symbols (5 parameters)
  μ = mutation rate
  κ = transition/transversion ratio
  πA, πC, πG, πT = base frequencies
Assumptions retained
  All sites evolve at the same rate
  All sites evolve independently
  Time reversibility
Assumptions relaxed
  Equal base frequencies
  All mutations equally likely
[Figure: substitution diagram linking A, C, G and T; arrows are labelled with the base frequencies πA, πC, πG and πT, with κ on transitions and μ on transversions]
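How μ, κ and the base frequencies combine can be sketched as a rate matrix. This is a minimal sketch assuming one common parameterisation, in which the rate from base i to base j is μ·κ·πj for a transition and μ·πj for a transversion; the function name and the example values are illustrative, and other scalings are in use.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def hky85_rate_matrix(mu, kappa, freqs):
    """Build the 4x4 instantaneous rate matrix Q as nested dicts.

    Off-diagonal entries: Q[i][j] = mu * kappa * freqs[j] for a
    transition and mu * freqs[j] for a transversion. Each diagonal
    entry is set so that its row sums to zero.
    """
    if abs(sum(freqs[b] for b in BASES) - 1.0) > 1e-9:
        raise ValueError("base frequencies must sum to 1")
    q = {i: {} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i == j:
                continue
            is_transition = (i in PURINES) == (j in PURINES)
            q[i][j] = mu * freqs[j] * (kappa if is_transition else 1.0)
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    return q

# Hypothetical parameter values, for illustration only.
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
Q = hky85_rate_matrix(mu=1.0, kappa=2.0, freqs=freqs)
for i in BASES:
    print(i, [round(Q[i][j], 3) for j in BASES])
```

Under this parameterisation the nesting of the models is visible directly: equal base frequencies give K2P, κ = 1 gives F81, and both together give JC.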
Slide15
But there are also differences among the other nucleotide substitution rates...
[Figure: the six nucleotide pairs A-C, A-G, A-T, C-G, C-T and G-T]
Figures: Andrew Rambaut
Slide16
Generalised time-reversible (GTR) Tavaré 1986
10 symbols (9 parameters)
  rAC, rAG, rAT, rCG, rCT, rGT = mutation rates
  πA, πC, πG, πT = base frequencies
Assumptions retained
  All sites evolve at the same rate
  All sites evolve independently
  Time reversibility
Assumptions relaxed
  Equal base frequencies
  All mutations equally likely
[Figure: substitution diagram linking A, C, G and T; each pair of arrows is labelled with its rate (rAC, rAG, rAT, rCG, rCT, rGT) and the arrows with the base frequencies πA, πC, πG and πT]
Widely used
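The way the six rates and four base frequencies fit together can be written as a rate matrix. The block below is a sketch of one common parameterisation (rows and columns ordered A, C, G, T), not necessarily the scaling used in the diagram; each diagonal entry, shown as a dot, is whatever makes its row sum to zero.

```latex
Q =
\begin{pmatrix}
\cdot        & r_{AC}\pi_C & r_{AG}\pi_G & r_{AT}\pi_T \\
r_{AC}\pi_A  & \cdot       & r_{CG}\pi_G & r_{CT}\pi_T \\
r_{AG}\pi_A  & r_{CG}\pi_C & \cdot       & r_{GT}\pi_T \\
r_{AT}\pi_A  & r_{CT}\pi_C & r_{GT}\pi_G & \cdot
\end{pmatrix},
\qquad
\pi_i \, Q_{ij} = \pi_j \, Q_{ji} \quad \text{(time reversibility)}
```

Because each rate r is shared by both directions of a pair, the time-reversibility condition holds for any choice of rates and frequencies.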
Slide17
But some sites evolve faster overall than others...
973 mtDNA CR; parsimony analysis (with known pedigree)
Heyer et al. (2001)
Slide18
Gamma distributed rates
Rate variation among sites is often shown to be well-approximated by a gamma distribution
To use: add alpha (α) parameter to existing model e.g. GTR+G
Assumptions retained
  All sites evolve independently
  Time reversibility
Assumptions relaxed
  Equal base frequencies
  All mutations equally likely
  All sites evolve at the same rate
[Figure: gamma distributions for α = 200, α = 50, α = 2 and α = 0.5; x-axis: substitution rate, y-axis: frequency]
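The effect of α on the shape of the distribution can be sketched numerically. This is a minimal sketch that assumes the usual convention of fixing the mean relative rate at 1 (shape α, rate parameter also α), so that α alone controls the spread; the function names are illustrative.

```python
import math

def gamma_rate_density(rate, alpha):
    """Density of a gamma distribution with shape alpha and mean 1.

    Setting the rate parameter equal to alpha fixes the mean relative
    substitution rate at 1. Computed on the log scale so that large
    alpha values do not overflow.
    """
    if rate <= 0 or alpha <= 0:
        raise ValueError("rate and alpha must be positive")
    log_density = (alpha * math.log(alpha) - math.lgamma(alpha)
                   + (alpha - 1) * math.log(rate) - alpha * rate)
    return math.exp(log_density)

def rate_variance(alpha):
    """Variance of the relative rates: shape / rate^2 = 1 / alpha."""
    return 1.0 / alpha

for alpha in (200, 50, 2, 0.5):
    print(f"alpha = {alpha}: variance = {rate_variance(alpha):.3f}, "
          f"density at rate 1 = {gamma_rate_density(1.0, alpha):.3f}")
```

Large α concentrates almost all sites near the mean rate (little rate variation), while small α gives many slow sites and a few very fast ones.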
Slide19
Other substitution models
Amino acid substitution models
  Dayhoff 1972
  Whelan and Goldman 2001 (WAG)
  Le & Gascuel 2008 (LG)
Codon models e.g. Yang 2000
Relaxed molecular clock e.g. Drummond et al. 2006
Mixture models
And many more!
Slide20
Methodological approaches
Distance matrix methods (pre-computed distances)
  UPGMA: assumes perfect molecular clock. Sokal & Michener (1958)
  Minimum evolution (e.g. Neighbor-joining, NJ). Saitou & Nei (1987)
Maximum parsimony. Fitch (1971)
  Minimises number of mutational steps
Maximum likelihood, ML
  Evaluates statistical likelihood of alternative trees, based on an explicit model of substitution
Bayesian methods
  Like ML but can incorporate prior knowledge
What is a substitution model?
Slide21
Statistical phylogenetic inference
recommended methods
Figure: Brian Moore
Slide22
Maximum Likelihood
Calculate the probability of the observed sequence data under a given model (including tree structure, branch lengths, and transition parameters). [The likelihood is proportional to this probability.]
Search for the tree(s) which maximize(s) the likelihood.
Likelihood(topology, branch lengths, model parameters) = constant × probability(data (alignment) | topology, branch lengths, model parameters)
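The idea of searching for the parameter value that maximises the likelihood can be sketched in the simplest possible case: two aligned sequences, the Jukes-Cantor model, and a single unknown branch length. This is a minimal sketch under those assumptions; there is no tree search here, and the function names and the site counts are illustrative.

```python
import math

def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by distance d
    (expected substitutions per site) under the Jukes-Cantor model."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay  # probability a site shows the same base
    p_diff = 0.25 - 0.25 * decay  # probability of one specific different base
    # Each site contributes its base frequency (1/4) times the
    # probability of the observed pattern; sites are independent.
    return (n_sites * math.log(0.25)
            + (n_sites - n_diff) * math.log(p_same)
            + n_diff * math.log(p_diff))

def ml_distance(n_sites, n_diff, lo=1e-9, hi=5.0, iterations=200):
    """Ternary search for the d that maximises the log-likelihood,
    which has a single peak in d for this two-sequence case."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if jc_log_likelihood(m1, n_sites, n_diff) < jc_log_likelihood(m2, n_sites, n_diff):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Hypothetical data: 100 aligned sites, 10 of which differ.
d_hat = ml_distance(100, 10)
print(f"ML distance = {d_hat:.4f}")
print(f"log-likelihood = {jc_log_likelihood(d_hat, 100, 10):.2f}")
```

In this two-sequence case the maximum coincides with the Jukes-Cantor distance correction applied to the observed proportion of differing sites; real programs must also search over topologies and every branch length, which is what makes ML slow.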
Slide23
Maximum Likelihood
Advantages:
  statistically consistent
  requires the use of an explicit model of evolution
Disadvantages:
  slow (especially if all possible trees are evaluated)
  produces a single ML tree
Usage:
  Widely used and recommended method
Slide24
Bayesian Inference
Calculate the probability of the model specified given the sequence data observed (using equation derived from Bayes' Theorem)
Search the tree-space using MCMC (or equivalent) to approximate the joint-posterior probability density
posterior probability = (likelihood function × prior probability) / marginal likelihood
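How MCMC approximates a posterior distribution can be sketched on a toy problem: a single Jukes-Cantor distance between two sequences, sampled with a Metropolis algorithm. This is a minimal sketch, not how MrBayes or BEAST work internally; the exponential prior, step size, seed and data are arbitrary assumptions, and no tree topology is sampled.

```python
import math
import random

def log_posterior(d, n_sites, n_diff, prior_rate=1.0):
    """Unnormalised log posterior for a Jukes-Cantor distance d.

    Log-likelihood (two sequences, JC model) plus log prior (an
    exponential prior, an arbitrary choice for this sketch). The
    marginal likelihood is a constant, so it cancels in the
    acceptance ratio and is never computed.
    """
    decay = math.exp(-4.0 * d / 3.0)
    log_like = ((n_sites - n_diff) * math.log(0.25 + 0.75 * decay)
                + n_diff * math.log(0.25 - 0.25 * decay))
    log_prior = math.log(prior_rate) - prior_rate * d
    return log_like + log_prior

def metropolis(n_sites, n_diff, steps=20000, burn_in=2000,
               step_size=0.05, seed=1):
    """Sample d from its posterior with a random-walk Metropolis sampler."""
    rng = random.Random(seed)
    d = 0.1
    current = log_posterior(d, n_sites, n_diff)
    samples = []
    for step in range(steps):
        # Symmetric proposal, reflected at zero to keep d positive
        # (the tiny floor only guards against an exact zero).
        proposal = max(abs(d + rng.uniform(-step_size, step_size)), 1e-12)
        candidate = log_posterior(proposal, n_sites, n_diff)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            d, current = proposal, candidate
        if step >= burn_in:
            samples.append(d)
    return samples

# Hypothetical data: 100 aligned sites, 10 of which differ.
samples = metropolis(100, 10)
posterior_mean = sum(samples) / len(samples)
print(f"posterior mean distance = {posterior_mean:.3f}")
```

The output is a set of samples rather than a single best value, which is why Bayesian methods report a probability distribution over parameters (and, in real analyses, over trees).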
Slide25
Bayesian Inference
Advantages:
  the option to incorporate prior knowledge
  produces probability distribution of possible trees
  unlike ML, treats model parameters as random variables
Disadvantages:
  very slow
  heuristic methods of tree searching do not guarantee you find the best tree
Usage:
  Widely used and recommended method
Slide26
Heuristic searches do not guarantee you find the best tree
Figure: Andrew Rambaut
Slide27
Methodological approaches
Distance matrix methods (pre-computed distances)
  UPGMA: assumes perfect molecular clock. Sokal & Michener (1958)
  Minimum evolution (e.g. Neighbor-joining, NJ). Saitou & Nei (1987)
Maximum parsimony. Fitch (1971)
  Minimises number of mutational steps
Maximum likelihood, ML
  Evaluates statistical likelihood of alternative trees, based on an explicit model of substitution
Bayesian methods
  Like ML but can incorporate prior knowledge
What is a substitution model?
How do I choose a substitution model?
Slide28
How do I choose a substitution model?
biological intuition
  Identify most appropriate assumptions and thus model for your data
develop hypothesis
  Will a complex model with fewer assumptions better explain your data than a simple model?
test hypothesis
  Likelihood ratio test
  Bayes factor test
Not sure where to start? Empirical data show GTR+G (nucleotide) or LG (protein) to be a good bet for large, standard datasets
Slide29
Choosing a more complex model with more parameters will always fit the data at least as well
> We want to know if the fit is significantly better
[Figure: example fits with R² = 0.86, R² = 0.78 and R² = 1]
Slide30
Likelihood ratio test
Requires models to be nested
Uses likelihood ratio to evaluate if our hypothesis (H1) is significantly better than our null hypothesis (H0):
  Likelihood ratio = L(H1)/L(H0)
Twice the logarithm of this ratio (2Δ) approximates a chi-squared distribution under the null hypothesis H0:
  2Δ = 2[ln(L(H1)) – ln(L(H0))]
with d degrees of freedom corresponding to the difference in the number of free parameters between models
  L(H1) = likelihood of hypothesis
  L(H0) = likelihood of null hypothesis
  ln(L(H1)) = log likelihood of hypothesis
  ln(L(H0)) = log likelihood of null hypothesis
  2Δ = twice the difference in log likelihood
Slide31
Likelihood ratio test example
Question: Do the rates of transitions and transversions in my sequence data differ significantly?
H1: K2P better explains my data (2 rate parameters, transitions different to transversions)
H0: JC is adequate (1 rate parameter for all substitutions)
Draw trees, find out ln(L(K2P)) = -23345; ln(L(JC)) = -23368
Calculate: 2Δ = 2[ln(L(H1)) – ln(L(H0))]
  2Δ = 2[-23345 – (-23368)] = 2 × 23 = 46
  d = difference in number of free parameters = 2 – 1 = 1
Next we look this up on a χ² distribution…
Slide32
Likelihood ratio test example
Is our 2Δ (twice the log of the likelihood ratio) greater than we would expect by chance (p = 0.05)?
2Δ = 46 (d = 1)
YES – 46 is much larger than 3.84, the χ² critical value for d = 1 at p = 0.05
> We can reject H0 (JC) and accept H1 (K2P)
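The worked example can be reproduced in a few lines. This is a minimal sketch using the log-likelihoods quoted in the example; the function names are illustrative, 3.841 is the standard tabulated χ² critical value for d = 1 at p = 0.05, and the closed-form p-value below only applies when d = 1 (other values of d need a statistics library).

```python
import math

def likelihood_ratio_statistic(lnl_h1, lnl_h0):
    """Twice the difference in log-likelihood (2 delta)."""
    return 2.0 * (lnl_h1 - lnl_h0)

def chi2_p_value_df1(statistic):
    """Upper-tail probability of a chi-squared distribution with
    1 degree of freedom, using the closed form erfc(sqrt(x / 2))."""
    if statistic < 0:
        raise ValueError("statistic must be non-negative")
    return math.erfc(math.sqrt(statistic / 2.0))

CRITICAL_VALUE_DF1 = 3.841  # chi-squared, d = 1, p = 0.05

# Values from the example: ln(L(K2P)) = -23345, ln(L(JC)) = -23368.
two_delta = likelihood_ratio_statistic(-23345, -23368)
p_value = chi2_p_value_df1(two_delta)
reject_h0 = two_delta > CRITICAL_VALUE_DF1

print(f"2 delta = {two_delta}")
print(f"p-value = {p_value:.3g}")
print(f"reject H0 (JC) in favour of H1 (K2P): {reject_h0}")
```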
Slide33
Software
And lots, lots more, see: http://evolution.genetics.washington.edu/phylip/software.html
Sequence searching
  BLAST, FASTA, PSI-Search
  http://www.ebi.ac.uk/services
Multiple sequence alignment
  Clustal Omega, MUSCLE, Prank (phylogenetically aware)
  http://www.ebi.ac.uk/services
Distance-based phylogenetic methods
  ClustalW2, PAUP
  http://www.ebi.ac.uk/Tools/phylogeny/
Maximum likelihood phylogenetics
  RAxML (coming soon to EBI tools), PhyML, SeaView, PAUP, PAML
Bayesian phylogenetics
  MrBayes, BEAST
Model testing
  ModelTest, PAML
Slide34
Outline
  Alignment for phylogenetics
  Phylogenetics: The general approach
  Phylogenetic Methods (1 – simple methods)
  Assessing Branch Support
  BREAK
  Substitution Models
  Phylogenetic Methods (2 – statistical inference)
  Deciding which model to use (hypothesis testing)
  Software
Slide35
Now it is your turn…
Open your course manuals and begin Tutorial 2 (page 13)
Also available to download from: http://www.ebi.ac.uk/training/course/scuola-di-bioinformatica-2013
You will require the alignment file Rodents.txt
You will require the software SeaView 4.4.2: http://pbil.univ-lyon1.fr/software/seaview.html
There are answers available online but it is much better to ask for help!
Slide36
Thank you!
www.ebi.ac.uk
Twitter: @emblebi
Facebook: EMBLEBI