Phylogenetic analysis

Slide1

Phylogenetic analysis

Slide2

Phylogenetics

Phylogenetics is the study of the evolutionary history of living organisms using treelike diagrams to represent the pedigrees of these organisms. The branching pattern of the tree, which represents the evolutionary divergence, is referred to as a phylogeny.

Slide3

Studying phylogenetics

Fossil records – morphological information; available only for certain species; data can be fragmentary; morphological traits are ambiguous; the fossil record is nonexistent for microorganisms

Molecular data (molecular fossils) – more numerous than fossils and easier to obtain; the favored data for reconstructing evolutionary history

http://www.agiweb.org/news/evolution/fossilrecord.html

Slide4

Tree of life

http://tikalon.com/blog/blog.php?article=2011/domains

Slide5

DNA sequence evolution

[Figure: DNA sequences diverging along a tree over a time axis of -3, -2 and -1 million years to today; the sequences shown are AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, AGCGCTT, AGCACAA, TAGACTT and TAGCCCA.]

www.cs.utexas.edu/users/tandy/CSBtutorial.ppt

Slide6

Tree terminology

Ancestral node or root of the tree

Internal nodes or divergence points (represent hypothetical ancestors of the taxa)

Branches

Terminal nodes – taxa (singular: taxon)

[Figure: a rooted tree with terminal nodes A, B, C, D and E.]

Based on lectures by Tal Pupko

Slide7

Monophyletic (clade) – a taxon that is derived from a single ancestral species and includes all of its descendants.

Polyphyletic – a taxon whose members were derived from two or more ancestors not common to all members.

Paraphyletic – a taxon that excludes some members that share a common ancestor with members included in the taxon.

Slide8

dichotomy – all branches bifurcate

polytomy – the result of a taxon giving rise to more than two descendants, or of an unresolved phylogeny (the exact order of bifurcations cannot be determined)

Slide9

unrooted – no knowledge of a common ancestor; shows the relative relationships of taxa, with no direction of an evolutionary path

rooted – obviously, more informative

Slide10

Finding a true tree is difficult

Correct reconstruction of the evolutionary history = find the correct tree topology with correct branch lengths.

The number of potential tree topologies can be enormously large even with a moderate number of taxa (NR = number of rooted trees, NU = number of unrooted trees).

6 taxa … NR = 945, NU = 105

10 taxa … NR = 34 459 425, NU = 2 027 025
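The counts above can be reproduced with the standard double-factorial formulas for strictly bifurcating trees: (2n − 5)!! unrooted and (2n − 3)!! rooted topologies for n taxa. The following is a minimal sketch; the function names are illustrative and not part of the slides.

```python
def odd_double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * ... * 3 * 1 for odd k (1 when k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n_taxa: int) -> int:
    """Number of bifurcating unrooted topologies for n_taxa >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return odd_double_factorial(2 * n_taxa - 5)


def num_rooted_trees(n_taxa: int) -> int:
    """Number of bifurcating rooted topologies for n_taxa >= 2."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return odd_double_factorial(2 * n_taxa - 3)


if __name__ == "__main__":
    for n in (4, 6, 10):
        print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

For 4 taxa this gives 3 unrooted trees, which is the number of trees compared later in the parsimony example.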

 

Slide11

Slide12

Rooting the tree

To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root:

[Figure: an unrooted tree of taxa A, B, C and D with a marked root position, next to the corresponding rooted tree.]

Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

Slide13

Now, try it again with the root at another position:

[Figure: the same unrooted tree of taxa A, B, C and D with the root at another position, next to the corresponding rooted tree.]

Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

Slide14

An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees.

[Figure: the unrooted tree 1 of taxa A, B, C and D, with root positions 1 to 5 and the corresponding rooted trees 1a, 1b, 1c, 1d and 1e.]

These trees show five different evolutionary relationships among the taxa!

Slide15

Rooting the tree

outgroup – taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa. The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).

[Figure: a tree with the outgroup branch marked.]

Slide16

Rooting the tree

midpoint rooting approach - roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner.

[Figure: an unrooted tree of taxa A, B, C and D with branch lengths 10, 2, 3, 5 and 2.]

d(A,D) = 10 + 3 + 5 = 18

Midpoint = 18 / 2 = 9
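Midpoint rooting can be sketched in a few lines: find the two most distant taxa, then walk half of that distance along the path between them. The assignment of branch lengths to branches below is an assumption chosen to be consistent with d(A,D) = 10 + 3 + 5; the internal node names `x` and `y` and all function names are hypothetical.

```python
from itertools import combinations

# Assumed layout: A and B attach to internal node x, C and D to internal node y.
EDGES = {("A", "x"): 10, ("B", "x"): 2, ("x", "y"): 3, ("C", "y"): 2, ("D", "y"): 5}


def build_adjacency(edges):
    """Turn an edge-length dict into a symmetric adjacency map."""
    adj = {}
    for (u, v), length in edges.items():
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    return adj


def find_path(adj, start, goal, seen=None):
    """Return the unique path between two nodes of a tree as a list of nodes."""
    seen = seen or {start}
    if start == goal:
        return [start]
    for nxt in adj[start]:
        if nxt not in seen:
            rest = find_path(adj, nxt, goal, seen | {nxt})
            if rest:
                return [start] + rest
    return None


def midpoint_root(edges):
    """Return (taxon1, taxon2, distance, (node_from, node_to, offset)).

    The root lies `offset` units from node_from on the branch node_from-node_to.
    """
    adj = build_adjacency(edges)
    leaves = [node for node in adj if len(adj[node]) == 1]

    def path_length(path):
        return sum(adj[a][b] for a, b in zip(path, path[1:]))

    longest = max((find_path(adj, a, b) for a, b in combinations(leaves, 2)),
                  key=path_length)
    total = path_length(longest)
    walked = 0
    for a, b in zip(longest, longest[1:]):
        if walked + adj[a][b] >= total / 2:
            return longest[0], longest[-1], total, (a, b, total / 2 - walked)
        walked += adj[a][b]


if __name__ == "__main__":
    print(midpoint_root(EDGES))
```

Under this assumed layout the root falls on the branch leading to A, 9 units away from A.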

Slide17

Molecular clock

This concept was proposed by Emil Zuckerkandl and Linus Pauling (1962) as well as Emanuel Margoliash (1963).

This hypothesis states that for every given gene (or protein), the rate of molecular evolution is approximately constant.

Pioneering study by Zuckerkandl and Pauling

They observed the number of amino acid differences between human globins – β and δ (~ 6 differences), β and γ (~ 36 differences), α and β (~ 78 differences), and α and γ (~ 83 differences).

They could also compare human to gorilla (both β and α globins), observing either 2 or 1 differences respectively.

They knew from fossil evidence that humans and gorillas diverged from a common ancestor about 11 MYA.

Using this divergence time as a calibration point, they estimated that gene duplications of the common ancestor to β and δ occurred 44 MYA; β and γ derived from a common ancestor 260 MYA; α and β 565 MYA; and α and γ 600 MYA.
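The idea of calibrating a clock can be illustrated with a naive linear sketch: if a known divergence time corresponds to a known number of differences, other divergence times scale proportionally. The calibration value of 1.5 differences (the mean of the two human-gorilla counts) is an assumption made here for illustration, not something stated in the slides, and this is not claimed to be the original authors' procedure.

```python
CALIBRATION_TIME_MYA = 11.0            # human-gorilla divergence from the fossil record
CALIBRATION_DIFFERENCES = (2 + 1) / 2  # assumption: mean of the beta and alpha globin counts


def divergence_time_mya(differences: float) -> float:
    """Scale a number of amino acid differences to a time, assuming a strictly linear clock."""
    return differences * CALIBRATION_TIME_MYA / CALIBRATION_DIFFERENCES


if __name__ == "__main__":
    for pair, diffs in [("beta/delta", 6), ("beta/gamma", 36),
                        ("alpha/beta", 78), ("alpha/gamma", 83)]:
        print(pair, round(divergence_time_mya(diffs)))
```

With these inputs the sketch reproduces the 44 MYA estimate for β and δ exactly. For the more divergent pairs it gives somewhat larger values than the 260, 565 and 600 MYA quoted above, so the original estimates presumably involved more than straight proportionality.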

Slide18

Gene phylogeny vs. species phylogeny

Main objective of building phylogenetic trees based on molecular sequences: reconstruct the evolutionary history of the species involved.

A gene phylogeny only describes the evolution of that particular gene or encoded protein.

This sequence may evolve more or less rapidly than other genes in the genome.

The evolution of a particular sequence does not necessarily correlate with the evolutionary path of the species.

Branching point in a species tree – the speciation event

Branching point in a gene tree – which event?

The two events may or may not coincide.

To obtain a species phylogeny, phylogenetic trees from a variety of gene families need to be constructed to give an overall assessment of the species evolution.

Slide19

Closest living relatives of humans?


Slide20

Closest living relatives of humans?

[Figure: two trees of humans, bonobos, chimpanzees, gorillas and orangutans, one drawn on a time scale from 0 to 15-30 MYA and one on a time scale from 0 to 14 MYA.]

Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either are to gorillas.

The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.

Slide21

[Figure: tree of Orangutan, Gorilla, Chimpanzee and Human. From the Tree of Life Website, University of Arizona.]

Slide22

Forms of tree representation

phylogram – branch lengths represent the amount of evolutionary divergence

cladogram – external taxa line up neatly, only the topology matters

Slide23

[Figure: a tree with Taxon A, Taxon B, Taxon C, Taxon E and Taxon D listed from top to bottom.]

((A,(B,C)),(D,E)) = the above phylogeny as nested parentheses

No meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.

The other dimension either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).

These say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.

Slide24

Newick format
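The Newick format writes a tree as nested parentheses terminated by a semicolon, e.g. ((A,(B,C)),(D,E));. As a rough illustration, the sketch below parses a topology-only Newick string (no branch lengths or internal labels, which the full format also allows) into nested tuples and lists the clades it implies. All function names are illustrative.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested tuples of taxon names."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]  # a leaf name

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return tree


def leaves(node):
    """Return the taxon names below a node, in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def clades(node):
    """Return the leaf set of every internal node, the root included."""
    if isinstance(node, str):
        return []
    found = [frozenset(leaves(node))]
    for child in node:
        found.extend(clades(child))
    return found


if __name__ == "__main__":
    tree = parse_newick("((A,(B,C)),(D,E));")
    print(tree)
    print([sorted(c) for c in clades(tree)])
```

The clade list makes the grouping explicit: {B, C} and {D, E} are clades of this tree, while {A, B} is not.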

Slide25

A consensus tree

Combining the nodes of several trees:

strict consensus – all conflicting nodes are collapsed into polytomies

majority rule – among the conflicting nodes, those found in more than 50% of the trees are retained, whereas the remaining nodes are collapsed into polytomies
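Both consensus rules can be sketched by treating each tree as a set of clades (each clade being the set of taxa below an internal node). The three example trees and the function names below are hypothetical; the root clade containing all taxa is left out for brevity.

```python
from collections import Counter


def strict_consensus(trees):
    """Keep only the clades present in every tree."""
    return set.intersection(*(set(tree) for tree in trees))


def majority_rule_consensus(trees):
    """Keep the clades present in more than 50% of the trees."""
    counts = Counter(clade for tree in trees for clade in set(tree))
    return {clade for clade, n in counts.items() if n > len(trees) / 2}


# Hypothetical input: two trees ((A,(B,C)),(D,E)) and one tree ((B,(A,C)),(D,E)).
TREE_1 = {frozenset("BC"), frozenset("ABC"), frozenset("DE")}
TREE_2 = {frozenset("AC"), frozenset("ABC"), frozenset("DE")}
TREE_3 = {frozenset("BC"), frozenset("ABC"), frozenset("DE")}

if __name__ == "__main__":
    trees = [TREE_1, TREE_2, TREE_3]
    print(sorted(sorted(c) for c in strict_consensus(trees)))
    print(sorted(sorted(c) for c in majority_rule_consensus(trees)))
```

Here the strict consensus drops the clade {B, C} because one tree disagrees, leaving A, B and C as a polytomy, whereas the majority-rule consensus keeps it because two of the three trees contain it.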

Slide26

Procedure

Choice of molecular markers

Multiple sequence alignment

Choice of a model of evolution

Determine a tree building method

Assess tree reliability

Slide27

Choice of molecular markers

Nucleotide or protein sequence data?

Nucleic acid (NA) sequences evolve more rapidly.

They can be used for studying very closely related organisms.

E.g., for evolutionary analysis of different individuals within a population, noncoding regions of mtDNA are often used.

Evolution of more divergent organisms – either slowly evolving NA (e.g., rRNA) or protein sequences.

Deepest level (e.g., relationships between bacteria and eukaryotes) – conserved protein sequences

NA sequences: good if sequences are closely related, reveal synonymous/nonsynonymous substitutions

Slide28

Positive and negative selection

synonymous substitution – nucleotide changes in a sequence not resulting in amino acid sequence changes (genetic code degeneracy, 3rd codon position)

nonsynonymous substitution – nucleotide changes that do change the amino acid sequence

positive selection – nonsynonymous substitution rate > synonymous; certain parts of the protein are undergoing active mutations that may contribute to the evolution of new function

negative selection – synonymous > nonsynonymous; changes are neutral at the AA level, because the protein sequence is critical enough that its changes are not tolerated
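The distinction between the two kinds of substitution can be made concrete by translating aligned codons with the standard genetic code. The sketch below only counts differing codons of each kind; it does not compute the substitution rates themselves, which additionally require normalizing by the number of synonymous and nonsynonymous sites. The example sequences and function names are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_codon_changes(seq_a: str, seq_b: str):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Both sequences must be aligned, gap-free, in frame and of equal length.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must have equal length divisible by 3")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


if __name__ == "__main__":
    # GCT -> GCC keeps alanine (synonymous); AAA -> AGA turns lysine into arginine.
    print(classify_codon_changes("ATGGCTAAA", "ATGGCCAGA"))
```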

 

Slide29

MSA

Critical step

Multiple state-of-the-art alignment programs (e.g., T-Coffee, Praline, Poa, …) should be used.

The alignment results from multiple sources should be inspected and compared carefully to identify the most reasonable one.

Automatic sequence alignments almost always contain errors and should be further edited or refined if necessary – manual editing!

Rascal and NorMD can help to improve alignment by correcting alignment errors and removing potentially unrelated or highly divergent sequences.

Slide30

Model of evolution

A simple measure of the divergence of two sequences – the number of substitutions in the alignment; a distance between two sequences – the proportion of substitutions

If A was replaced by C: A → C or A → T → G → C?

Back mutation: G → C → G.

Parallel mutations – both sequences mutate into e.g., T at the same time.

All of this obscures the estimation of the true evolutionary distances between sequences.

This effect is known as homoplasy and must be corrected.

Statistical models infer the true evolutionary distances between sequences.

Slide31

Model of evolution

Homoplasy is corrected by substitution (evolutionary) models. Many such models exist.

Jukes-Cantor model

$d_{AB} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p_{AB}\right)$

d_AB … distance, p_AB … proportion of substitutions

example: alignment of A and B is 20 nucleotides long, 6 pairs are different, p_AB = 0.3, d_AB = 0.38

Kimura model

$d_{AB} = -\frac{1}{2}\ln\left(1 - 2p_{ti} - p_{tv}\right) - \frac{1}{4}\ln\left(1 - 2p_{tv}\right)$

p_ti … frequency of transition, p_tv … frequency of transversion
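The two corrections can be tried out numerically. The sketch below uses the standard forms of the Jukes-Cantor distance, d = −(3/4) ln(1 − (4/3) p), and the Kimura two-parameter distance, d = −(1/2) ln(1 − 2 p_ti − p_tv) − (1/4) ln(1 − 2 p_tv); the function names and the helper that counts transitions and transversions are illustrative.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor distance from the proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_distance(p_ti: float, p_tv: float) -> float:
    """Kimura two-parameter distance from transition and transversion proportions."""
    a = 1 - 2 * p_ti - p_tv
    b = 1 - 2 * p_tv
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the Kimura correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


def substitution_proportions(seq_a: str, seq_b: str):
    """Return (p_ti, p_tv) for two aligned, gap-free DNA sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    purines = set("AG")
    transitions = transversions = 0
    for x, y in zip(seq_a, seq_b):
        if x == y:
            continue
        # Purine <-> purine or pyrimidine <-> pyrimidine is a transition.
        if (x in purines) == (y in purines):
            transitions += 1
        else:
            transversions += 1
    return transitions / len(seq_a), transversions / len(seq_a)


if __name__ == "__main__":
    print(round(jukes_cantor_distance(6 / 20), 2))  # the 20-nucleotide example
    print(kimura_distance(*substitution_proportions("AGCT", "GGTA")))
```

The first call reproduces the worked example: 6 differences in 20 positions give p = 0.3 and a corrected distance of about 0.38, larger than the raw proportion because hidden multiple substitutions are accounted for.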

 

Slide32

Models of amino acids substitutions

use an amino acid substitution matrix

PAM

JTT – 1990s, the same methodology as PAM, but with a larger protein database

protein equivalents of the Jukes–Cantor and Kimura models, e.g., …

 

Slide33

Among site variations

Up to now we have assumed that different positions in a sequence evolve at the same rate.

However, in reality this may not be true.

In DNA, the rates of substitution differ for different codon positions. The 3rd codon position mutates much faster.

In proteins, some AA change much more rarely than others owing to functional constraints.

It has been shown that there is always a proportion of positions in a sequence dataset that have invariant rates and a proportion that have more variable rates.

The distribution of variant sites follows a Gamma distribution pattern.

Slide34

Gamma distribution

Slide35

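To account for site-dependent rate variation, a Gamma correction factor can be used.

For the Jukes–Cantor model, the evolutionary distance can be adjusted with the following formula:

$d_{AB} = \frac{3}{4}\alpha\left[\left(1 - \frac{4}{3}p_{AB}\right)^{-1/\alpha} - 1\right]$

For the Kimura model, the evolutionary distance becomes

$d_{AB} = \frac{\alpha}{2}\left[\left(1 - 2p_{ti} - p_{tv}\right)^{-1/\alpha} + \frac{1}{2}\left(1 - 2p_{tv}\right)^{-1/\alpha} - \frac{3}{2}\right]$

where α is the shape parameter of the Gamma distribution.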
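A quick numerical sketch of the Gamma correction, assuming the commonly used forms d = (3/4) α [(1 − (4/3) p)^(−1/α) − 1] for Jukes-Cantor and d = (α/2) [(1 − 2 p_ti − p_tv)^(−1/α) + (1/2)(1 − 2 p_tv)^(−1/α) − 3/2] for Kimura, with α the Gamma shape parameter. These forms are an assumption here; they reduce to the uncorrected distances as α grows large. Function names are illustrative.

```python
def jukes_cantor_gamma_distance(p: float, alpha: float) -> float:
    """Gamma-corrected Jukes-Cantor distance (alpha = Gamma shape parameter)."""
    if not 0 <= p < 0.75 or alpha <= 0:
        raise ValueError("need 0 <= p < 0.75 and alpha > 0")
    return 0.75 * alpha * ((1 - 4 * p / 3) ** (-1 / alpha) - 1)


def kimura_gamma_distance(p_ti: float, p_tv: float, alpha: float) -> float:
    """Gamma-corrected Kimura two-parameter distance."""
    a = 1 - 2 * p_ti - p_tv
    b = 1 - 2 * p_tv
    if a <= 0 or b <= 0 or alpha <= 0:
        raise ValueError("sequences too divergent or alpha not positive")
    return (alpha / 2) * (a ** (-1 / alpha) + 0.5 * b ** (-1 / alpha) - 1.5)


if __name__ == "__main__":
    # Smaller alpha means stronger rate variation and a larger corrected distance.
    for alpha in (0.5, 1.0, 5.0, 1e6):
        print(alpha, jukes_cantor_gamma_distance(0.3, alpha))
```

For p = 0.3 the corrected distance is 0.5 at α = 1 and approaches the uncorrected Jukes-Cantor value of about 0.38 as α becomes very large, i.e. as the rates across sites become uniform.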

 

Slide36

Tree building methods

Two major categories.

Distance based methods. Based on the amount of dissimilarity between pairs of sequences, computed on the basis of sequence alignment.

Character based methods. Based on discrete characters, which are molecular sequences from individual taxa.

Slide37

Tree building methods

Methods classified by data type and computational method:

Characters, optimality criterion: maximum parsimony (MP), maximum likelihood (ML)

Distances, optimality criterion: Fitch-Margoliash (FM)

Distances, clustering algorithm: UPGMA, neighbor-joining (NJ)

Slide38

Distance based methods

Calculate evolutionary distances d_AB between sequences using one of the evolutionary models.

Construct a distance matrix – distances between all pairs of taxa.

Based on the distance scores, construct a phylogenetic tree.

clustering algorithms – UPGMA, neighbor joining (NJ)

optimality based – Fitch-Margoliash (FM), minimum evolution (ME)

Slide39

Clustering methods

UPGMA (Unweighted Pair Group Method with Arithmetic Mean)

Hierarchical clustering, agglomerative; you know it as average linkage

Produces a rooted tree (most phylogenetic methods produce an unrooted tree).

Basic assumption of the UPGMA method: all taxa evolve at a constant rate and are equally distant from the root, implying that a molecular clock is in effect.

However, real data rarely meet this assumption.

Thus, UPGMA often produces erroneous tree topologies.

Slide40

Neighbor joining

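[Figure: trees of taxa A, B, C, D and E illustrating how A and B are joined into a new node A,B.]

Distance matrix:

      A     B     C     D     E
A     0     2     3     4     4
B           0     3     4     5
C                 0     3     4
D                       0     5
E                             0

Distance matrix after joining A and B:

      A,B   C     D     E
A,B   0     2.5   4.5   3.5
C           0     3     4
D                 0     5
E                       0

The Minimum Evolution (ME) criterion: in each iteration we separate the two sequences that result in the minimal sum of branch lengths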
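The selection step of neighbor joining can be sketched with the commonly used Q criterion, Q(i,j) = (n − 2) d(i,j) − r_i − r_j, where r_i is the sum of distances from taxon i to all others; the pair with the smallest Q is joined. The Q formula is the standard NJ formulation rather than something spelled out on the slide. The distances are those of the five-taxon matrix above; the names are illustrative.

```python
from itertools import combinations

TAXA = ["A", "B", "C", "D", "E"]
DISTANCES = {
    ("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 4, ("A", "E"): 4,
    ("B", "C"): 3, ("B", "D"): 4, ("B", "E"): 5,
    ("C", "D"): 3, ("C", "E"): 4,
    ("D", "E"): 5,
}


def distance(i: str, j: str) -> float:
    """Symmetric lookup in the upper-triangular distance table."""
    if i == j:
        return 0
    return DISTANCES[(i, j)] if (i, j) in DISTANCES else DISTANCES[(j, i)]


def nj_pair_to_join(taxa):
    """Return the pair minimizing the NJ Q criterion, together with all Q values."""
    n = len(taxa)
    row_sum = {i: sum(distance(i, k) for k in taxa) for i in taxa}
    q = {(i, j): (n - 2) * distance(i, j) - row_sum[i] - row_sum[j]
         for i, j in combinations(taxa, 2)}
    return min(q, key=q.get), q


if __name__ == "__main__":
    best, q_values = nj_pair_to_join(TAXA)
    print(best, q_values[best])
```

For this matrix the criterion selects A and B, in agreement with the first join shown in the figure.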

Slide41

Optimality based methods

Clustering methods produce a single tree.

There is no criterion for judging how this tree compares to alternative trees.

Optimality based methods have a well-defined algorithm to compare all possible tree topologies and select a tree that best fits the actual evolutionary distance matrix.

Slide42

Distance based – pros and cons

clustering

Fast, can handle large datasets

Not guaranteed to find the best tree

The actual sequence information is lost when all the sequence variation is reduced to a single value. Hence, ancestral sequences at internal nodes cannot be inferred.

UPGMA – assumes a constant rate of evolution of the sequences in all branches of the tree (molecular clock assumption)

NJ – does not assume that the rate of evolution is the same in all branches of the tree

NJ is slower but better than UPGMA

exhaustive tree searching (FM)

better accuracy, prohibitive for more than 12 taxa

Slide43

Character based methods

Also called discrete methods

Based directly on the sequence characters

They count mutational events accumulated on the sequences and may therefore avoid the loss of information when characters are converted to distances.

Evolutionary dynamics of each character can be studied

Ancestral sequences can also be inferred.

The two most popular character-based approaches: maximum parsimony (MP) and maximum likelihood (ML) methods.

Slide44

Maximum parsimony

Based on Occam’s razor.

William of Occam, 14th century.

The simplest explanation is probably the correct one.

This is because the simplest explanation requires the fewest assumptions and the fewest leaps of logic.

A tree with the least number of substitutions is probably the best to explain the differences among the taxa under study.

Slide45

A worked example

      1 2 3 4 5 6 7 8 9
1     A A G A G T G C A
2     A G C C G T G C G
3     A G A T A T C C A
4     A G A G A T C C G

To save computing time, only a small number of sites that have the richest phylogenetic information are used in tree determination.

informative site – a site that has at least two different kinds of characters, each occurring at least twice
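The definition of an informative site translates directly into a column-by-column check of the alignment. The sketch below applies it to the four sequences of the worked example; the function name is illustrative.

```python
from collections import Counter

ALIGNMENT = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]


def informative_sites(alignment):
    """Return 1-based positions with at least two characters that each occur at least twice."""
    positions = []
    for position, column in enumerate(zip(*alignment), start=1):
        counts = Counter(column)
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            positions.append(position)
    return positions


if __name__ == "__main__":
    sites = informative_sites(ALIGNMENT)
    print(sites)
    # Reduced alignment containing only the informative columns.
    print(["".join(seq[p - 1] for p in sites) for seq in ALIGNMENT])
```

Only positions 5, 7 and 9 qualify, which reduces the four sequences to GGA, GGG, ACA and ACG.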

Slide46


Slide47

How many possible unrooted trees?

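Informative sites only:

      1 2 3
1     G G A
2     G G G
3     A C A
4     A C G

[Figure: the three possible unrooted trees. Tree I groups 1 with 2 and 3 with 4; Tree II groups 1 with 3 and 2 with 4; Tree III groups 1 with 4 and 3 with 2.]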

Slide48

[Figure: informative site 1 (GGAA) mapped onto Tree I, Tree II and Tree III, with the inferred states at the internal nodes.]

Slide49

[Figure: informative site 2 (GGCC) mapped onto Tree I, Tree II and Tree III, with the inferred states at the internal nodes.]

Slide50

[Figure: informative site 3 (AGAG) mapped onto Tree I, Tree II and Tree III, with the inferred states at the internal nodes.]

Slide51

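Number of substitutions required by each tree:

Site          Tree I   Tree II   Tree III
GGAA          1        2         2
GGCC          1        2         2
AGAG          2        1         2
Tree length   4        5         6

[Figure: Tree I with the sequences GGA, GGG, ACA and ACG at the tips, GGA and ACA at the internal nodes, and branch changes 2, 1 and 1.]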
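The tree lengths in this example can be recomputed with Fitch's small-parsimony counting: at each internal node take the intersection of the children's state sets, and when it is empty take the union and count one substitution. The sketch below hard-codes the four-taxon case; the tree encoding (pairs of 0-based sequence indices) and the names are illustrative.

```python
SITES = ["GGAA", "GGCC", "AGAG"]  # informative sites; one character per sequence 1..4

# Each unrooted four-taxon tree is written as two pairs of 0-based sequence indices.
TREES = {
    "I": ((0, 1), (2, 3)),
    "II": ((0, 2), (1, 3)),
    "III": ((0, 3), (2, 1)),
}


def fitch_quartet_length(site: str, tree) -> int:
    """Minimum number of substitutions one site requires on a four-taxon tree."""
    changes = 0

    def combine(states_x, states_y):
        nonlocal changes
        common = states_x & states_y
        if common:
            return common
        changes += 1  # no shared state: a substitution is needed on this part of the tree
        return states_x | states_y

    (a, b), (c, d) = tree
    left = combine({site[a]}, {site[b]})
    right = combine({site[c]}, {site[d]})
    combine(left, right)
    return changes


def tree_lengths():
    """Total length of each tree, summed over all informative sites."""
    return {name: sum(fitch_quartet_length(site, tree) for site in SITES)
            for name, tree in TREES.items()}


if __name__ == "__main__":
    print(tree_lengths())
```

The totals agree with the table: Tree I needs 4 substitutions, Tree II needs 5 and Tree III needs 6, so Tree I is the most parsimonious.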

Slide52

Weighted parsimony

The parsimony method discussed so far is unweighted because it treats all mutations as equivalent.

This may be an oversimplification; mutations of some sites are known to occur less frequently than others, for example, transversions versus transitions, functionally important sites versus neutral sites.

A weighting scheme takes into account the different kinds of mutations.

Slide53

Slide54

Branch-and-bound

The parsimony method examines all possible tree topologies to find the maximally parsimonious tree.

This is an exhaustive search method, expensive.

N = 10 … 2 027 025

N = 20 … 2.22 × 10^20

Branch-and-bound

Rationale: a maximally parsimonious tree must be equal to or shorter than the distance-based tree.

First build a distance tree using NJ or UPGMA.

Compute the minimum number of substitutions for this tree.

The resulting number defines the upper bound to which any other trees are compared.

I.e., when you build a parsimonious tree, you stop growing it when its length exceeds the upper bound.

Slide55

Heuristic methods

When the number of taxa exceeds 20, even branch-and-bound becomes computationally infeasible.

Then, heuristic search can be applied.

Both exhaustive search and branch-and-bound methods lead to the optimum tree.

Heuristic search is not guaranteed to find the optimum and may return a suboptimal tree (compare to BLAST, which is also heuristic).

Slide56

MP – pros and cons

Intuitive - its assumptions are easily understood

The character-based method is able to provide evolutionary information about the sequence characters, such as information regarding homoplasy and ancestral states.

It tends to produce more accurate trees than the distance-based methods when sequence divergence is low, because this is the circumstance when the parsimony assumption of rarity in evolutionary changes holds true.

When sequence divergence is high, tree estimation by MP can be less effective, because the original parsimony assumption no longer holds.

Estimation of branch lengths may also be erroneous because MP does not employ substitution models to correct for multiple substitutions.

Slide57

Maximum likelihood – ML

Uses probabilistic models to choose a best tree that has the highest probability (likelihood) of reproducing the observed data.

ML is an exhaustive method that searches every possible tree topology and considers every position in an alignment, not just informative sites.

By employing a particular substitution model that has probability values of residue substitutions, ML calculates the total likelihood of ancestral sequences evolving to internal nodes and eventually to existing sequences.

It sometimes also incorporates parameters that account for rate variations across sites.

Slide58

ML – pros and cons

Based on well-founded statistics instead of a medieval philosophy.

More robust, uses the full sequence information, not just informative sites.

Employs a substitution model – a strength, but also a weakness (choosing the wrong model leads to an incorrect tree).

Accurately reconstructs the relationships between sequences that have been separated for a long time.

Very time consuming, considerably more than MP, which is itself more time consuming than clustering methods.

Slide59

Phylogeny packages

PHYLIP (Phylogenetic inference package), Felsenstein, free: evolution.genetics.washington.edu/phylip.html

PAUP (phylogenetic analysis using parsimony), Swofford: paup.csit.fsu.edu