Part two: An Introduction to Phylogenetic Methods

Uploaded On 2021-01-27

Presentation Transcript

Slide1

Part two

An Introduction to Phylogenetic Methods

Dr Laura Emery

Laura.Emery@ebi.ac.uk

www.ebi.ac.uk

Slide2

Objectives

After this tutorial you should be able to:

Discuss a range of methods for phylogenetic inference, their advantages, assumptions and limitations

Implement some phylogenetic methods using publicly available software

Appreciate some approaches for assessing branch support and selecting an appropriate substitution model

Know where to look for further information

Slide3

Outline

Alignment for phylogenetics
Phylogenetics: The general approach
Phylogenetic Methods (1 - simple methods)
Assessing Branch Support
BREAK
Substitution Models
Phylogenetic Methods (2 - statistical inference)
Deciding which model to use (hypothesis testing)
Software

Slide4

The problem of multiple substitutions

More likely to have occurred between distantly related species

> We need an explicit model of evolution to account for these

[Figure: a series of bases (A, A, A, T, G) with asterisks marking hidden mutations]
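The effect of hidden mutations can be illustrated with a small simulation. This is only a sketch under simple assumptions that are not from the slides: substitutions land on uniformly random sites and every change to a different base is equally likely. The function names, sequence length and seed are hypothetical choices for illustration.

```python
import random

BASES = "ACGT"

def evolve(sequence, n_substitutions, rng):
    """Apply n_substitutions substitutions, each at a uniformly random site."""
    seq = list(sequence)
    for _ in range(n_substitutions):
        site = rng.randrange(len(seq))
        # Replace the current base with one of the three other bases.
        seq[site] = rng.choice([b for b in BASES if b != seq[site]])
    return "".join(seq)

def observed_differences(a, b):
    """Count the sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(1)  # fixed seed, chosen arbitrarily
ancestor = "".join(rng.choice(BASES) for _ in range(1000))

for n_subs in (50, 500, 2000):
    descendant = evolve(ancestor, n_subs, rng)
    seen = observed_differences(ancestor, descendant)
    print(f"{n_subs} substitutions applied, {seen} differences observed")
```

As more substitutions accumulate, a growing share of them hit sites that have already changed, so the observed count falls further and further behind the true number of events.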

Slide5

Methodological approaches

Distance matrix methods (pre-computed distances)
  UPGMA: assumes perfect molecular clock. Sokal & Michener (1958)
  Minimum evolution (e.g. Neighbor-joining, NJ). Saitou & Nei (1987)
Maximum parsimony. Fitch (1971)
  Minimises number of mutational steps
Maximum likelihood, ML
  Evaluates statistical likelihood of alternative trees, based on an explicit model of substitution
Bayesian methods
  Like ML but can incorporate prior knowledge

What is a substitution model?

Slide6

Statistical phylogenetic inference

[Figure: Brian Moore]

Slide7

Models of sequence evolution

We use models of substitution to 'roughly' describe the way that we believe the sequences have evolved

They are necessarily highly simplified descriptions of more complex biological processes

Parameters can be added to build more sophisticated models if we believe this is relevant for our data

Slide8

Substitution Models

Common nested models
  Jukes and Cantor (JC) 1969
  Kimura 2 Parameter (K2P) 1980
  Felsenstein 1981 (F81)
  Hasegawa, Kishino and Yano 1985 (HKY85)
  Generalised time-reversible (GTR or REV) Tavaré 1986
Accounting for rate heterogeneity
Other substitution models

Slide9

The Jukes and Cantor (JC) 1969 model

1 parameter
  μ = mutation rate

Assumptions
  Equal base frequencies
  All mutations equally likely
  All sites evolve at the same rate
  All sites evolve independently
  Time reversibility

[Diagram: the four bases A, C, G, T, with every substitution between them labelled μ]

d = estimated nucleotide distance
p = observed distance in sequence data
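The equation relating d and p did not survive in this transcript. The standard Jukes-Cantor correction is d = -(3/4) ln(1 - (4/3)p), and the sketch below assumes that this is the formula the slide showed. The function names are hypothetical.

```python
import math

def p_distance(a, b):
    """Observed distance p: the proportion of aligned sites that differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jc69_distance(p):
    """Estimated distance d under JC from the observed distance p."""
    if not 0 <= p < 0.75:
        # At p >= 0.75 the sequences look random and the log is undefined.
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)

p = p_distance("ACGTACGTAC", "ACGTTCGTAA")  # 2 differences in 10 sites
print(p, jc69_distance(p))
```

The corrected distance is always at least as large as p, and the gap widens as p grows, which is the model's way of accounting for substitutions that can no longer be seen.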

Slide10

But not all substitutions are equally likely…

Transitions are more likely to occur than transversions

[Figures: Andrew Rambaut]

Slide11

Kimura 2 Parameter (K2P) 1980

2 parameters
  μ = mutation rate
  κ = transition/transversion ratio

Assumptions
  Equal base frequencies
  All mutations equally likely (relaxed in K2P: transitions and transversions have different rates)
  All sites evolve at the same rate
  All sites evolve independently
  Time reversibility

[Diagram: the four bases A, C, G, T, with transitions labelled κ and transversions labelled μ]

d = estimated nucleotide distance
p = observed distance in sequence data
q = proportion of sites with transversional differences
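The K2P distance formula itself is missing from the transcript. The sketch below uses Kimura's standard correction, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of sites with transitional and transversional differences. Following the slide's symbols, it assumes that p is the total observed distance, so P = p - q. The function names are hypothetical.

```python
import math

PURINES = {"A", "G"}

def count_p_q(a, b):
    """Return (p, q): proportion of differing sites, and of transversional differences."""
    n_diff = 0
    n_transversions = 0
    for x, y in zip(a, b):
        if x != y:
            n_diff += 1
            # A transversion swaps a purine (A, G) for a pyrimidine (C, T) or vice versa.
            if (x in PURINES) != (y in PURINES):
                n_transversions += 1
    return n_diff / len(a), n_transversions / len(a)

def k2p_distance(p, q):
    """Estimated distance d under K2P from total (p) and transversional (q) differences."""
    transitions = p - q
    return (-0.5 * math.log(1 - 2 * transitions - q)
            - 0.25 * math.log(1 - 2 * q))

p, q = count_p_q("AAAACCCC", "AGATCCCT")  # 2 transitions, 1 transversion
print(p, q, k2p_distance(p, q))
```

When transversions make up two thirds of the differences, as JC expects, this expression reduces to the JC correction, which is one way to see that JC is nested within K2P.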

Slide12

But base frequencies are often not equal...

Base frequencies vary among and within genomes

[Figure: base frequencies of A, C, T and G]

Slide13

Felsenstein 1981 (F81)

4 symbols (3 parameters)
  πA, πC, πG, πT = base frequencies
  πA + πC + πG + πT = 1 (so 3 parameters)

Assumptions
  Equal base frequencies (relaxed in F81: the base frequencies are parameters)
  All mutations equally likely
  All sites evolve at the same rate
  All sites evolve independently
  Time reversibility

[Diagram: the four bases A, C, G, T, labelled with the base frequencies πA, πC, πG, πT]

Slide14

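Hasegawa, Kishino and Yano 1985 (HKY85)

6 symbols (5 parameters)
  μ = mutation rate
  κ = transition/transversion ratio
  πA, πC, πG, πT = base frequencies

Assumptions
  Equal base frequencies (relaxed in HKY85: the base frequencies are parameters)
  All mutations equally likely (relaxed in HKY85: transitions and transversions have different rates)
  All sites evolve at the same rate
  All sites evolve independently
  Time reversibility

[Diagram: the four bases A, C, G, T, labelled with the base frequencies πA, πC, πG, πT, with transitions labelled κ and transversions labelled μ]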

Slide15

But there are also differences among the other nucleotide transition rates...

[Figure: the six nucleotide pairs A-C, A-G, A-T, C-G, C-T and G-T. Figures: Andrew Rambaut]

Slide16

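Generalised time-reversible (GTR) Tavaré 1986

10 symbols (9 parameters)
  rAC, rAG, rAT, rCG, rCT, rGT = mutation rates
  πA, πC, πG, πT = base frequencies

Assumptions
  Equal base frequencies (relaxed in GTR: the base frequencies are parameters)
  All mutations equally likely (relaxed in GTR: each pair of bases has its own rate)
  All sites evolve at the same rate
  All sites evolve independently
  Time reversibility

[Diagram: the four bases A, C, G, T, labelled with the base frequencies πA, πC, πG, πT; each pair of bases is joined by its rate (rAC, rAG, rAT, rCG, rCT, rGT), shown once for each direction]

Widely used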
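The nesting of these models can be sketched by building a GTR rate matrix and choosing its parameters to recover the simpler models. The parametrisation used here (each off-diagonal entry is the pair's rate multiplied by the frequency of the target base) is the usual one, but it is an assumption and is not spelled out on the slides. The function name and the numeric values are hypothetical, and the matrix is not rescaled.

```python
BASES = "ACGT"

def gtr_rate_matrix(rates, freqs):
    """Build a GTR rate matrix Q as a dict of dicts.

    rates: rate per unordered pair of bases, e.g. {"AC": 1.0, "AG": 4.0, ...}
    freqs: base frequencies keyed by base, summing to 1
    Off-diagonal: q[i][j] = r_ij * pi_j. Diagonal: chosen so each row sums to zero.
    """
    q = {i: {} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                pair = "".join(sorted(i + j))  # "CA" and "AC" share one rate
                q[i][j] = rates[pair] * freqs[j]
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    return q

# JC as a special case: all rates equal, all frequencies equal.
jc = gtr_rate_matrix({k: 1.0 for k in ("AC", "AG", "AT", "CG", "CT", "GT")},
                     {b: 0.25 for b in BASES})

# HKY85 as a special case: transitions (A-G, C-T) scaled by kappa, unequal frequencies.
kappa = 4.0
hky = gtr_rate_matrix({"AC": 1.0, "AG": kappa, "AT": 1.0,
                       "CG": 1.0, "CT": kappa, "GT": 1.0},
                      {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4})

print(jc["A"])
print(hky["A"])
```

Because the same rate is used in both directions for each pair, the matrix satisfies πi q[i][j] = πj q[j][i], which is the time-reversibility assumption listed above.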

Slide17

But some sites evolve faster overall than others...

973 mtDNA CR; parsimony analysis (with known pedigree)

Heyer et al. (2001)

Slide18

Gamma distributed rates

Rate variation among sites is often shown to be well-approximated by a gamma distribution

To use: add alpha (α) parameter to existing model e.g. GTR+G

Assumptions
  Equal base frequencies
  All mutations equally likely
  All sites evolve at the same rate (relaxed by adding +G)
  All sites evolve independently
  Time reversibility

[Figure: gamma distributions of substitution rate (x-axis, 0 to 2) against frequency (y-axis), for α = 200, α = 50, α = 2 and α = 0.5]
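The role of α can be sketched by drawing per-site rates from a gamma distribution. The sketch assumes the common convention of fixing the mean rate at 1 (shape α, scale 1/α), so that α controls only the spread, with variance 1/α. The function name, sample size and seed are hypothetical.

```python
import random
import statistics

def sample_site_rates(alpha, n_sites, rng):
    """Draw relative rates for n_sites sites from a gamma distribution with mean 1."""
    # random.gammavariate(shape, scale): mean = shape * scale = 1 here.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

rng = random.Random(42)  # fixed seed, chosen arbitrarily
for alpha in (0.5, 2, 50, 200):
    rates = sample_site_rates(alpha, 10000, rng)
    print(f"alpha = {alpha}: mean = {statistics.mean(rates):.3f}, "
          f"variance = {statistics.variance(rates):.3f}")
```

Small values of α give strong rate heterogeneity, with many nearly invariant sites and a few fast ones, while large values approach the equal-rates assumption.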

Slide19

Other substitution models

Amino acid substitution models
  Dayhoff 1972
  Whelan and Goldman 2001 (WAG)
  Le & Gascuel 2008 (LG)
Codon models e.g. Yang 2000
Relaxed molecular clock e.g. Drummond et al. 2006
Mixture models
And many more!

Slide20

Methodological approaches

Distance matrix methods (pre-computed distances)
  UPGMA: assumes perfect molecular clock. Sokal & Michener (1958)
  Minimum evolution (e.g. Neighbor-joining, NJ). Saitou & Nei (1987)
Maximum parsimony. Fitch (1971)
  Minimises number of mutational steps
Maximum likelihood, ML
  Evaluates statistical likelihood of alternative trees, based on an explicit model of substitution
Bayesian methods
  Like ML but can incorporate prior knowledge

What is a substitution model?

Slide21

Statistical phylogenetic inference

[Figure: Brian Moore; annotation: recommended methods]

Slide22

Maximum Likelihood

Calculate the probability of the observed sequence data under a given model (including tree structure, branch lengths, and transition parameters). [The likelihood is proportional to this probability.]

Search for the tree(s) which maximize(s) the likelihood.

Likelihood(topology, branch lengths, model parameters) = constant × Probability(data (alignment) | topology, branch lengths, model parameters)
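For the smallest possible case, two sequences joined by a single branch, the likelihood calculation can be written out directly. The sketch below assumes the JC model and searches a grid of branch lengths for the maximum. Real programs also search over tree topologies and use far better optimisers, so this is only an illustration, with hypothetical function names and grid settings.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by branch length d under JC."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    n_same = n_sites - n_diff
    # Each site contributes (base frequency 1/4) x (substitution probability).
    return (n_sites * math.log(0.25)
            + n_same * math.log(p_same)
            + n_diff * math.log(p_diff))

def ml_branch_length(n_sites, n_diff, step=0.0001, max_d=3.0):
    """Grid search for the branch length that maximises the likelihood."""
    candidates = [i * step for i in range(1, int(max_d / step) + 1)]
    return max(candidates,
               key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

best = ml_branch_length(n_sites=1000, n_diff=300)
print(best, jc69_log_likelihood(best, 1000, 300))
```

With 300 differences in 1000 sites the maximum falls at the same value as the analytic JC correction, -(3/4) ln(1 - (4/3) × 0.3), because that correction is the maximum likelihood estimate for this two-sequence case.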

Slide23

Maximum Likelihood

Advantages:
  statistically consistent
  requires the use of an explicit model of evolution

Disadvantages:
  slow (especially if all possible trees are evaluated)
  produces a single ML tree

Usage: Widely-used and recommended method

Slide24

Bayesian Inference

Calculate the probability of the model specified given the sequence data observed (using equation derived from Bayes Theorem)

Search the tree-space using MCMC (or equivalent) to approximate the joint-posterior probability density

posterior probability = (likelihood function × prior probability) / marginal likelihood
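A very small MCMC example may help make this concrete. The sketch below runs a Metropolis sampler over a single parameter, the branch length between two sequences under JC, rather than over tree-space. The exponential prior, proposal width, chain length and seed are all arbitrary illustrative assumptions, and the function names are hypothetical.

```python
import math
import random

def log_likelihood(d, n_sites, n_diff):
    """JC log-likelihood of two aligned sequences separated by branch length d."""
    e = math.exp(-4.0 * d / 3.0)
    return (n_sites * math.log(0.25)
            + (n_sites - n_diff) * math.log(0.25 + 0.75 * e)
            + n_diff * math.log(0.25 - 0.25 * e))

def log_prior(d, rate=1.0):
    """Exponential prior on the branch length (an arbitrary illustrative choice)."""
    return math.log(rate) - rate * d

def metropolis(n_sites, n_diff, n_steps=20000, burn_in=2000, seed=0):
    """Sample branch lengths from the posterior with a Metropolis sampler."""
    rng = random.Random(seed)
    d = 0.1
    log_post = log_likelihood(d, n_sites, n_diff) + log_prior(d)
    samples = []
    for step in range(n_steps):
        # Symmetric proposal, reflected at zero so d stays positive.
        proposal = abs(d + rng.gauss(0.0, 0.05))
        if proposal > 0.0:
            new_log_post = (log_likelihood(proposal, n_sites, n_diff)
                            + log_prior(proposal))
            # Accept with probability min(1, posterior ratio).
            if rng.random() < math.exp(min(0.0, new_log_post - log_post)):
                d, log_post = proposal, new_log_post
        if step >= burn_in:
            samples.append(d)
    return samples

samples = metropolis(n_sites=1000, n_diff=300)
print(sum(samples) / len(samples), min(samples), max(samples))
```

The acceptance step uses only a ratio of posterior values, so the marginal likelihood cancels and never has to be computed. The result is a distribution of plausible branch lengths, not a single best estimate.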

Slide25

Bayesian Inference

Advantages:
  the option to incorporate prior knowledge
  produces probability distribution of possible trees
  unlike ML, treats model parameters as random variables

Disadvantages:
  very slow
  heuristic methods of tree searching do not guarantee you find the best tree

Usage: Widely-used and recommended method

Slide26

Heuristic searches do not guarantee you find the best tree

[Figure: Andrew Rambaut]

Slide27

Methodological approaches

Distance matrix methods (pre-computed distances)
  UPGMA: assumes perfect molecular clock. Sokal & Michener (1958)
  Minimum evolution (e.g. Neighbor-joining, NJ). Saitou & Nei (1987)
Maximum parsimony. Fitch (1971)
  Minimises number of mutational steps
Maximum likelihood, ML
  Evaluates statistical likelihood of alternative trees, based on an explicit model of substitution
Bayesian methods
  Like ML but can incorporate prior knowledge

What is a substitution model?

How do I choose a substitution model?

Slide28

How do I choose a substitution model?

biological intuition
develop hypothesis
test hypothesis

Identify most appropriate assumptions and thus model for your data

Will a complex model with fewer assumptions better explain your data than a simple model?
  Likelihood ratio test
  Bayes factor test

Not sure where to start? Empirical data show GTR+G (nucleotide) or LG (protein) to be a good bet for large standard datasets

Slide29

Choosing a more complex model with more parameters will always fit the data at least as well

> We want to know if the fit is significantly better

[Figure: example fits with R² = 0.86, R² = 0.78 and R² = 1]

Slide30

Likelihood ratio test

Requires models to be nested

Uses likelihood ratio to evaluate if our hypothesis (H1) is significantly better than our null hypothesis (H0):

  Likelihood ratio = L(H1)/L(H0)

Twice the logarithm of this ratio (2Δ) approximates a chi-squared distribution under the null hypothesis H0:

  2Δ = 2[ln(L(H1)) - ln(L(H0))]

with d degrees of freedom corresponding to the difference in the number of free parameters between models

Here L(H1) is the likelihood of the hypothesis, L(H0) is the likelihood of the null hypothesis, ln(L(H1)) and ln(L(H0)) are the corresponding log likelihoods, and 2Δ is twice the difference in log likelihood.

Slide31

Likelihood ratio test example

Question: Do the rates of transitions and transversions in my sequence data significantly vary?

H1: K2P better explains my data (2 rate parameters, transitions different to transversions)
H0: JC is adequate (1 rate parameter for all substitutions)

Draw trees, find out ln(L(K2P)) = -23345; ln(L(JC)) = -23368

Calculate:
  2Δ = 2[ln(L(H1)) - ln(L(H0))]
  2Δ = 2[-23345 - (-23368)] = 2 × 23 = 46
  d = difference in number of free parameters = 2 - 1 = 1

Next we look this up on a χ² distribution…

Slide32

Likelihood ratio test example

Is our 2Δ (twice log of the likelihood ratio) greater than we would expect by chance (p = 0.05)?

2Δ = 46 (d = 1)

YES: 46 is much larger than 3.84, the critical value of the χ² distribution with 1 degree of freedom at p = 0.05

> We can reject H0 (JC) and accept H1 (K2P)
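The same test can be done numerically without a table. The sketch below assumes one degree of freedom, as in this example, where the upper tail probability of the chi-squared distribution has the closed form erfc(sqrt(x/2)). For other degrees of freedom a statistics library would be needed. The function names are hypothetical.

```python
import math

def likelihood_ratio_statistic(lnl_h1, lnl_h0):
    """Twice the difference in log likelihood between H1 and H0."""
    return 2.0 * (lnl_h1 - lnl_h0)

def chi2_upper_tail_1df(x):
    """P(X > x) for a chi-squared variable with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

statistic = likelihood_ratio_statistic(lnl_h1=-23345, lnl_h0=-23368)
p_value = chi2_upper_tail_1df(statistic)

print(f"2 x delta = {statistic}, p-value = {p_value:.3g}")
print("reject H0 (JC)" if p_value < 0.05 else "cannot reject H0 (JC)")
```

A statistic of 46 lies far beyond the 5% critical value of about 3.84 for one degree of freedom, so the p-value is tiny and JC is rejected in favour of K2P.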

Slide33

Software

Sequence searching
  BLAST, FASTA, PSI-Search
  http://www.ebi.ac.uk/services

Multiple sequence alignment
  Clustal Omega, MUSCLE, Prank (phylogenetically aware)
  http://www.ebi.ac.uk/services

Distance-based phylogenetic methods
  ClustalW2, PAUP
  http://www.ebi.ac.uk/Tools/phylogeny/

Maximum likelihood phylogenetics
  RAxML (coming soon to EBI tools), PhyML, SeaView, PAUP, PAML

Bayesian Phylogenetics
  MrBayes, BEAST

Model Testing
  ModelTest, PAML

And lots, lots more, see:
  http://evolution.genetics.washington.edu/phylip/software.html

Slide34

Outline

Alignment for phylogenetics
Phylogenetics: The general approach
Phylogenetic Methods (1 - simple methods)
Assessing Branch Support
BREAK
Substitution Models
Phylogenetic Methods (2 - statistical inference)
Deciding which model to use (hypothesis testing)
Software

Slide35

Now it is your turn…

Open your course manuals and begin Tutorial 2 (page 13)

Also available to download from: http://www.ebi.ac.uk/training/course/scuola-di-bioinformatica-2013

You will require the alignment file Rodents.txt

You will require the software SeaView 4.4.2: http://pbil.univ-lyon1.fr/software/seaview.html

There are answers available online but it is much better to ask for help!

Slide36

Thank you!

www.ebi.ac.uk

Twitter: @emblebi

Facebook: EMBLEBI