Constructing the Tree of Life: Divide-and-Conquer!

Uploaded On 2017-05-07



Presentation Transcript

Slide1

Constructing the Tree of Life: Divide-and-Conquer!

Tandy Warnow

University of Illinois at Urbana-Champaign

Slide2

Orangutan, Gorilla, Chimpanzee, Human

From the Tree of Life Website, University of Arizona

Phylogeny (evolutionary tree)

Slide3

Phylogenies and Applications

Basic Biology:

How did life evolve?

Applications of phylogenies to:

protein structure and function

population genetics

human migrations

metagenomics

Figure from https://en.wikipedia.org/wiki/Common_descent

Slide4

Slide5

DNA Sequence Evolution

[Figure: DNA sequences evolving down a tree along a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today), starting from AAGACTT, passing through TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, and AGCACTT, and ending in AGCGCTT, AGCACAA, TAGACTT, TAGCCCA, and AGGGCAT]

Slide6

Phylogenetic Tree Estimation

[Figure: the sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, and AGGGCAT, with trees on the leaves U, V, W, X, Y]

Slide7

[Figure: sequences of different lengths, AGAT, TAGACTT, TGCACAA, TGCGCTT, and AGGGCATGA, with trees on the leaves U, V, W, X, Y]

However…

Slide8

…ACGGTGCAGTTACCA…

[Arrow labels: Mutation, Deletion]

…ACCAGTCACCA…

Indels (insertions and deletions)

Slide9

…ACGGTGCAGTTACC-A…

…AC----CAGTCACCTA…

The true multiple alignment

Reflects historical substitution, insertion, and deletion events

Defined using transitive closure of pairwise alignments computed on edges of the true tree
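The relationship between the historical events and the true alignment can be replayed on the slide's own pair of sequences. The following is a minimal Python sketch, not code from the talk: the function names are hypothetical, and the event positions are inferred from the sequences shown rather than stated on the slide. It applies the deletion, substitution, and insertion to the ancestral sequence, then checks that the two alignment rows strip back to the ancestor and descendant and that the columns record exactly those events.

```python
ANCESTOR = "ACGGTGCAGTTACCA"

# The two rows of the true pairwise alignment, as shown on the slide.
TRUE_ALIGNMENT = ("ACGGTGCAGTTACC-A", "AC----CAGTCACCTA")


def apply_events(ancestor):
    """Replay the three events (positions inferred from the slide's sequences)."""
    seq = ancestor
    # Deletion: remove the four-base run GGTG (0-based positions 2..5).
    seq = seq[:2] + seq[6:]
    # Substitution: the T at position 6 of the shortened sequence becomes C.
    seq = seq[:6] + "C" + seq[7:]
    # Insertion: a T is inserted just before the final base.
    seq = seq[:-1] + "T" + seq[-1:]
    return seq


def strip_gaps(row):
    """Remove gap characters to recover the unaligned sequence."""
    return row.replace("-", "")


def classify_columns(ancestor_row, descendant_row):
    """Count what each alignment column says about the history."""
    counts = {"match": 0, "substitution": 0, "deletion": 0, "insertion": 0}
    for a, d in zip(ancestor_row, descendant_row):
        if a == "-":
            counts["insertion"] += 1      # base present only in the descendant
        elif d == "-":
            counts["deletion"] += 1       # base present only in the ancestor
        elif a == d:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts


DESCENDANT = apply_events(ANCESTOR)
COLUMN_COUNTS = classify_columns(*TRUE_ALIGNMENT)
```

Here the four deletion columns, one substitution column, and one insertion column correspond one-to-one with the events, which is what "reflects historical substitution, insertion, and deletion events" means for a single edge of the tree.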

ACGGTGCAGTTACCA

[Arrow labels: Substitution, Deletion, Insertion]

ACCAGTCACCTA

Slide10

Phylogenetic Tree Estimation

S1 = AGGCTATCACCTGACCTCCA

S2 = TAGCTATCACGACCGC

S3 = TAGCTGACCGC

S4 = TCACGACCGACA

Slide11

Input: unaligned sequences

S1 = AGGCTATCACCTGACCTCCA

S2 = TAGCTATCACGACCGC

S3 = TAGCTGACCGC

S4 = TCACGACCGACA

Slide12

Phase 1: Alignment

S1 = -AGGCTATCACCTGACCTCCA

S2 = TAG-CTATCAC--GACCGC--

S3 = TAG-CT-------GACCGC--

S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA

S2 = TAGCTATCACGACCGC

S3 = TAGCTGACCGC

S4 = TCACGACCGACA

Slide13

Phase 2: Construct tree

S1 = -AGGCTATCACCTGACCTCCA

S2 = TAG-CTATCAC--GACCGC--

S3 = TAG-CT-------GACCGC--

S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA

S2 = TAGCTATCACGACCGC

S3 = TAGCTGACCGC

S4 = TCACGACCGACA
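To make the hand-off between the two phases concrete, here is a minimal Python sketch using the four sequences above. It is an illustration under stated assumptions, not the method used in the talk: the function names are hypothetical, and the uncorrected p-distance (fraction of differing sites among columns where neither row has a gap) is just one simple input that distance-based Phase 2 methods can start from. The sketch first checks that the Phase 1 alignment is consistent with the input, then builds the pairwise distance table.

```python
from itertools import combinations

# Aligned rows and unaligned inputs, copied from the slide.
ALIGNED = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}
UNALIGNED = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "S4": "TCACGACCGACA",
}


def is_consistent(aligned, unaligned):
    """True if all rows have equal length and removing gaps restores each input."""
    same_length = len({len(row) for row in aligned.values()}) == 1
    restores_input = all(
        aligned[name].replace("-", "") == seq for name, seq in unaligned.items()
    )
    return same_length and restores_input


def p_distance(row_a, row_b):
    """Fraction of differing sites over columns where neither row has a gap."""
    compared = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not compared:
        return None  # the two rows share no ungapped column
    return sum(a != b for a, b in compared) / len(compared)


def distance_matrix(aligned):
    """Pairwise p-distances, keyed by (name, name) in sorted order."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(sorted(aligned), 2)
    }


DISTANCES = distance_matrix(ALIGNED)
```

Because every distance is read off alignment columns, any error made in Phase 1 flows directly into the tree built in Phase 2.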

[Figure: estimated tree with leaves S1, S4, S2, S3]

Slide14

Two-phase estimation

Alignment methods

Clustal

POY (and POY*)

Probcons (and Probtree)

Probalign

MAFFT

Muscle

Di-align

T-Coffee

Prank (PNAS 2005, Science 2008)

Opal (ISMB and Bioinf. 2007)

FSA (PLoS Comp. Bio. 2009)

Infernal (Bioinf. 2009)

Etc.

Phylogeny methods

Bayesian MCMC

Maximum parsimony

Maximum likelihood

Neighbor joining

FastME

UPGMA

Quartet puzzling

Etc.

Slide15

Two-phase estimation

Alignment methods

Clustal

POY (and POY*)

Probcons (and Probtree)

Probalign

MAFFT

Muscle

Di-align

T-Coffee

Prank (PNAS 2005, Science 2008)

Opal (ISMB and Bioinf. 2007)

FSA (PLoS Comp. Bio. 2009)

Infernal (Bioinf. 2009)

Etc.

Phylogeny methods

Bayesian MCMC

Maximum parsimony

Maximum likelihood

Neighbor joining

FastME

UPGMA

Quartet puzzling

Etc.

RAxML: heuristic for large-scale ML optimization

Slide16

Quantifying Error

FN: false negative (missing edge)

FP: false positive (incorrect edge)
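These two error types can be computed mechanically once each tree is summarized by its edges. The sketch below is a minimal Python illustration, assuming each internal edge is represented by the bipartition (split) of the leaf set it induces. The function names and the two five-leaf trees are hypothetical examples chosen for illustration, not the trees drawn on the slide.

```python
def split(side, taxa):
    """Canonical form of the bipartition side | (taxa - side)."""
    side = frozenset(side)
    return frozenset({side, frozenset(taxa) - side})


def fn_fp_rates(true_splits, estimated_splits):
    """FN rate: share of true edges missing from the estimate.
    FP rate: share of estimated edges absent from the true tree."""
    fn = len(true_splits - estimated_splits)
    fp = len(estimated_splits - true_splits)
    fn_rate = fn / len(true_splits) if true_splits else 0.0
    fp_rate = fp / len(estimated_splits) if estimated_splits else 0.0
    return fn_rate, fp_rate


TAXA = {"U", "V", "W", "X", "Y"}

# Hypothetical true tree ((U,V),W,(X,Y)): two internal edges.
TRUE_TREE = {split({"U", "V"}, TAXA), split({"X", "Y"}, TAXA)}

# Hypothetical estimate ((U,W),V,(X,Y)): one edge right, one edge wrong.
ESTIMATED_TREE = {split({"U", "W"}, TAXA), split({"X", "Y"}, TAXA)}

RATES = fn_fp_rates(TRUE_TREE, ESTIMATED_TREE)
```

In this example one of the two true edges is missing and one of the two estimated edges is incorrect, so both rates are 50%.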

[Figure: true and estimated trees with one FN edge and one FP edge marked; 50% error rate]

Slide17

1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)

Slide18

Multiple Sequence Alignment (MSA): a scientific grand challenge¹

S1 = -AGGCTATCACCTGACCTCCA

S2 = TAG-CTATCAC--GACCGC--

S3 = TAG-CT-------GACCGC--

…

Sn = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA

S2 = TAGCTATCACGACCGC

S3 = TAGCTGACCGC

…

Sn = TCACGACCGACA

Novel techniques needed for scalability and accuracy

NP-hard problems and large datasets

Current methods do not provide good accuracy

Few methods can analyze even moderately large datasets

Many important applications besides phylogenetic estimation

¹ Frontiers in Massive Data Analysis, National Academies Press, 2013

Slide19

1KP: Thousand Transcriptome Project

First publication: Wickett, Mirarab, et al., PNAS, 2014

Used SATé (Liu et al., Science 2009 and Syst Biol 2012) to compute multiple sequence alignments and trees

Used ASTRAL (Mirarab et al., Bioinf 2014 and 2015) to compute the species tree

G. Ka-Shu Wong, U Alberta

N. Wickett, Northwestern

J. Leebens-Mack, U Georgia

N. Matasci, iPlant

T. Warnow, S. Mirarab, N. Nguyen, UT-Austin

Upcoming Challenge: Multiple sequence alignment and gene tree estimation on 100,000 sequences

Slide20

Computational Phylogenetics (2005)

Courtesy of the Tree of Life web project, tolweb.org

Current methods can use months to estimate trees on 1000 DNA sequences

Our objective: More accurate trees and alignments on 500,000 sequences in under a week

Slide21

Computational Phylogenetics (2015)

Courtesy of the Tree of Life web project, tolweb.org

1997-2001: Distance-based phylogenetic tree estimation from polynomial length sequences

2012: Computing accurate trees (almost) without multiple sequence alignments

2009-2015: Co-estimation of multiple sequence alignments and gene trees, now on 1,000,000 sequences in under two weeks

2014-2015: Species tree estimation from whole genomes in the presence of massive gene tree heterogeneity

Slide22

Computational Phylogenetics (2015)

Courtesy of the Tree of Life web project, tolweb.org

1997-2001: Distance-based phylogenetic tree estimation from polynomial length sequences

2012: Computing accurate trees (almost) without multiple sequence alignments

2009-2015: Co-estimation of multiple sequence alignments and gene trees, now on 1,000,000 sequences in under two weeks

2014-2015: Species tree estimation from whole genomes in the presence of massive gene tree heterogeneity

Slide23

Key technique: Divide-and-conquer!

In general, small datasets with not too much “heterogeneity” are easy to analyze with good accuracy.

Slide24

Divide-and-Conquer

Divide-and-conquer is a basic algorithmic trick for solving problems!

Three steps:

divide a dataset into two or more sets,

solve the problem on each set, and

combine solutions.

Slide25

Sorting

10, 3, 54, 23, 75, 5, 1, 25

Objective: sort this list of integers from smallest to largest.

10, 3, 54, 23, 75, 5, 1, 25 should become

1, 3, 5, 10, 23, 25, 54, 75

Slide26

MergeSort

10, 3, 54, 23, 75, 5, 1, 25

Step 1: Divide into two sublists

Step 2: Recursively sort each sublist

Step 3: Merge the two sorted sublists

Slide27
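The three MergeSort steps (divide, recursively sort, merge) can be sketched in a few lines of Python. This is a minimal illustration rather than code from the talk; the function names `merge` and `merge_sort` are chosen here for clarity.

```python
def merge(x, y):
    """Merge two already-sorted lists by repeatedly taking the smaller front item."""
    result = []
    i = j = 0
    while i < len(x) and j < len(y):
        if x[i] <= y[j]:
            result.append(x[i])
            i += 1
        else:
            result.append(y[j])
            j += 1
    # One list is exhausted; the remainder of the other is already sorted.
    result.extend(x[i:])
    result.extend(y[j:])
    return result


def merge_sort(values):
    """Sort a list by divide-and-conquer."""
    if len(values) <= 1:
        return list(values)              # base case: nothing to divide
    mid = len(values) // 2
    left = merge_sort(values[:mid])      # Steps 1 and 2 on the first half
    right = merge_sort(values[mid:])     # Steps 1 and 2 on the second half
    return merge(left, right)            # Step 3


SORTED = merge_sort([10, 3, 54, 23, 75, 5, 1, 25])
```

Calling `merge([3, 10, 23, 54], [1, 5, 25, 75])` performs the same sequence of comparisons as the step-by-step merge of X and Y in this example.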

Step 1: break into two lists

X: 10, 3, 54, 23

Y: 75, 5, 1, 25

Slide28

Step 2: sort the two lists

X: 3, 10, 23, 54

Y: 1, 5, 25, 75

Slide29

Step 3: merge the sorted lists

X: 3, 10, 23, 54

Y: 1, 5, 25, 75

Result: (empty)

Slide30

Merging (cont.)

X: 3, 10, 23, 54

Y: 5, 25, 75

Result: 1

Slide31

Merging (cont.)

X: 10, 23, 54

Y: 5, 25, 75

Result: 1, 3

Slide32

Merging (cont.)

X: 10, 23, 54

Y: 25, 75

Result: 1, 3, 5

Slide33

Merging (cont.)

X: 23, 54

Y: 25, 75

Result: 1, 3, 5, 10

Slide34

Merging (cont.)

X: 54

Y: 25, 75

Result: 1, 3, 5, 10, 23

Slide35

Merging (cont.)

X: 54

Y: 75

Result: 1, 3, 5, 10, 23, 25

Slide36

Merging (cont.)

X: (empty)

Y: 75

Result: 1, 3, 5, 10, 23, 25, 54

Slide37

Merging (cont.)

X: (empty)

Y: (empty)

Result: 1, 3, 5, 10, 23, 25, 54, 75

Slide38

Multiple Sequence Alignment (MSA): a scientific grand challenge¹

S1 = -AGGCTATCACCTGACCTCCA

S2 = TAG-CTATCAC--GACCGC--

S3 = TAG-CT-------GACCGC--

…

Sn = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA

S2 = TAGCTATCACGACCGC

S3 = TAGCTGACCGC

…

Sn = TCACGACCGACA

Novel techniques needed for scalability and accuracy

NP-hard problems and large datasets

Current methods do not provide good accuracy

Few methods can analyze even moderately large datasets

Many important applications besides phylogenetic estimation

¹ Frontiers in Massive Data Analysis, National Academies Press, 2013

Slide39

SATé and PASTA

Input: set of unaligned sequences

Output: multiple sequence alignment and phylogenetic tree

SATé: Liu et al., Science 2009 (up to 10,000 sequences) and Systematic Biology 2012 (up to 50,000 sequences)

PASTA: Mirarab et al., J. Comp Biol 2015 (up to 1,000,000 sequences)

Slide40

1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)

Slide41

Re-aligning on a tree

[Figure: cycle of four steps on subsets A, B, C, D of the tree's leaves: Decompose dataset; Align subproblems; Merge sub-alignments (into ABCD); Estimate ML tree on merged alignment]

Slide42

SATé and PASTA Algorithms

Tree

Obtain initial alignment and estimated ML tree

Use tree to compute new alignment

Slide43

SATé and PASTA Algorithms

Tree

Obtain initial alignment and estimated ML tree

Use tree to compute new alignment

Alignment

Slide44

SATé and PASTA Algorithms

Estimate ML tree on new alignment

Tree

Obtain initial alignment and estimated ML tree

Use tree to compute new alignment

Alignment

Slide45

SATé and PASTA Algorithms

Estimate ML tree on new alignment

Tree

Obtain initial alignment and estimated ML tree

Use tree to compute new alignment

Alignment

Repeat until termination condition, and return the alignment/tree pair with the best ML score

Slide46
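The SATé/PASTA iteration (obtain an initial alignment and ML tree, use the tree to compute a new alignment, re-estimate the tree, repeat, and return the best-scoring pair) has a simple control structure, sketched below in Python. This is an outline under stated assumptions, not the SATé or PASTA implementation: the four callables are hypothetical placeholders supplied by the caller, `realign_on_tree` stands in for the whole decompose/align/merge step, and a fixed iteration count stands in for the unspecified termination condition.

```python
def iterate_alignment_and_tree(
    sequences,
    initial_align,      # hypothetical: sequences -> alignment
    estimate_tree,      # hypothetical: alignment -> ML tree estimate
    realign_on_tree,    # hypothetical: (sequences, tree) -> new alignment
    ml_score,           # hypothetical: (alignment, tree) -> score, higher is better
    max_iterations=3,   # assumed termination condition
):
    """Alternate between re-aligning on the current tree and re-estimating the tree."""
    alignment = initial_align(sequences)
    tree = estimate_tree(alignment)
    best_score = ml_score(alignment, tree)
    best_pair = (alignment, tree)

    for _ in range(max_iterations):
        # Use the current tree to compute a new alignment.
        alignment = realign_on_tree(sequences, tree)
        # Estimate an ML tree on the new alignment.
        tree = estimate_tree(alignment)
        score = ml_score(alignment, tree)
        # Keep the alignment/tree pair with the best ML score seen so far.
        if score > best_score:
            best_score = score
            best_pair = (alignment, tree)

    return best_pair
```

Passing the components in as functions keeps the loop independent of any particular aligner or tree estimator, which mirrors the idea of wrapping existing base methods rather than replacing them.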

1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)

24-hour SATé analysis, on desktop machines

(Similar improvements for biological datasets)

SATé: 24-hour co-estimation of highly accurate alignments and trees on 1000 sequences

Slide47

(Liu et al., Syst Biol 61(1):90-106, 2012)

SATé-2: even more accurate!

Slide48

Simulated RNASim datasets from 10K to 200K taxa

Limited to 24 hours using 12 CPUs

Not all methods could run (missing bars could not finish)

PASTA, Mirarab et al., J Comp Biol 22(5): 377-386 (2015)

PASTA: even more accurate, and can scale to 1,000,000 sequences

Slide49

Avian Phylogenomics Project

E Jarvis, HHMI

G Zhang, BGI

MTP Gilbert, Copenhagen

S. Mirarab, Md. S. Bayzid, and T. Warnow, UT-Austin

Plus many many other people…

First analysis (Jarvis, Mirarab, et al., Science 2014):

Approx. 50 species, 14,000 loci

Used SATé for gene sequence alignment and tree estimation

Next analysis will have more species, and will use PASTA

Slide50

1KP: Thousand Transcriptome Project

First analysis (Wickett, Mirarab, et al., PNAS, 2014)

About 100 species and 800 loci

Used SATé

G. Ka-Shu Wong, U Alberta

N. Wickett, Northwestern

J. Leebens-Mack, U Georgia

N. Matasci, iPlant

T. Warnow, S. Mirarab, N. Nguyen, UT-Austin

Next analysis will be much larger and more difficult:

Multiple sequence alignment and gene tree estimation on 100,000 sequences, many datasets highly fragmentary

Will use PASTA and UPP (Nguyen et al., Genome Biology 2015)

Slide51

Computational Phylogenetics (2015)

Courtesy of the Tree of Life web project, tolweb.org

1997-2001: Distance-based phylogenetic tree estimation from polynomial length sequences

2012: Computing accurate trees (almost) without multiple sequence alignments

2009-2015: Co-estimation of multiple sequence alignments and gene trees, now on 1,000,000 sequences in under two weeks

2014-2015: Species tree estimation from whole genomes in the presence of massive gene tree heterogeneity

Slide52

“Boosters”, or “Meta-Methods”

Meta-methods use divide-and-conquer and iteration (or other techniques) to “boost” the performance of base methods (phylogeny reconstruction, alignment estimation, etc.)

[Figure: a Meta-method takes a Base method M and produces M*]

Slide53

Main Points

Innovative algorithm design can improve accuracy as well as reduce running time.

Divide-and-conquer is a key algorithmic technique that has dramatically changed the toolkit for biologists!

Slide54

Acknowledgments

Funding: Guggenheim Foundation, Packard Foundation, NSF, Microsoft Research New England, David Bruton Jr. Centennial Professorship, Grainger Foundation, and TACC (Texas Advanced Computing Center)

Slide55

Avian Phylogenomics Project

E Jarvis, HHMI

G Zhang, BGI

MTP Gilbert, Copenhagen

S. Mirarab, Md. S. Bayzid, and T. Warnow, UT-Austin

Plus many many other people…

Jarvis, Mirarab, et al., Science 2014

Major challenge:

Massive gene tree heterogeneity consistent with incomplete lineage sorting

Very poor resolution in the 14,000 gene trees

Standard coalescent-based species tree estimation methods had poor accuracy

Solution: New technique to improve coalescent-based species tree estimation (statistical binning, Mirarab et al., Science 2014)