Slide1
Constructing the Tree of Life: Divide-and-Conquer!
Tandy Warnow
University of Illinois at Urbana-Champaign
Slide2
Phylogeny (evolutionary tree)
[Figure: evolutionary tree with leaves Orangutan, Gorilla, Chimpanzee, Human. From the Tree of Life Website, University of Arizona]
Slide3
Phylogenies and Applications
Basic Biology:
How did life evolve?
Applications of phylogenies to:
protein structure and function
population genetics
human migrations
metagenomics
Figure from https://en.wikipedia.org/wiki/Common_descent
Slide4
Slide5
DNA Sequence Evolution
[Figure: a DNA sequence evolving down a tree, with a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs, today. The ancestral sequence AAGACTT changes by substitutions through the intermediate sequences AAGGCCT, TGGACTT, AGGGCAT, TAGCCCT, and AGCACTT into the present-day sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, and AGCGCTT.]
Slide6
Phylogenetic Tree Estimation
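[Figure: five observed sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) labeled U, V, W, X, Y, and the tree on leaves U, V, W, X, Y to be estimated from them]
Slide7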
[Figure: the same estimation problem on leaves U, V, W, X, Y, now with sequences of different lengths: AGAT, TAGACTT, TGCACAA, TGCGCTT, AGGGCATGA]
However…
Slide8
Indels (insertions and deletions)
…ACGGTGCAGTTACCA…
Mutation, Deletion
…ACCAGTCACCA…
Slide9
…ACGGTGCAGTTACCA…
Substitution, Deletion, Insertion
…ACCAGTCACCTA…
The true multiple alignment
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
Reflects historical substitution, insertion, and deletion events
Defined using transitive closure of pairwise alignments computed on edges of the true tree
Slide10
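The relationship between the historical events and the true alignment can be made concrete with a small sketch. The function name `evolve_and_align` and the event encoding are assumptions introduced here for illustration only; they do not come from the slides. Applying one deletion, one substitution, and one insertion to the ancestral fragment from the slide yields the two alignment rows shown there.

```python
def evolve_and_align(ancestor, events):
    """Apply edit events to `ancestor` and return the two rows of the
    pairwise alignment implied by those events.

    Hypothetical event encoding (positions are ancestor coordinates):
      ("sub", i, base)    replace ancestor[i] with base
      ("del", i, length)  delete ancestor[i:i + length]
      ("ins", i, bases)   insert bases immediately before ancestor[i]
    """
    substitutions, deleted, insertions = {}, set(), {}
    for kind, pos, payload in events:
        if kind == "sub":
            substitutions[pos] = payload
        elif kind == "del":
            deleted.update(range(pos, pos + payload))
        elif kind == "ins":
            insertions[pos] = payload
        else:
            raise ValueError(f"unknown event kind: {kind!r}")

    top, bottom = [], []
    for i in range(len(ancestor) + 1):
        if i in insertions:
            # Inserted bases face gaps in the ancestor row.
            top.append("-" * len(insertions[i]))
            bottom.append(insertions[i])
        if i == len(ancestor):
            break
        top.append(ancestor[i])
        # Deleted bases face gaps in the descendant row.
        bottom.append("-" if i in deleted else substitutions.get(i, ancestor[i]))
    return "".join(top), "".join(bottom)


ancestor = "ACGGTGCAGTTACCA"
events = [("del", 2, 4), ("sub", 10, "C"), ("ins", 14, "T")]
row_ancestor, row_descendant = evolve_and_align(ancestor, events)
print(row_ancestor)    # ACGGTGCAGTTACC-A
print(row_descendant)  # AC----CAGTCACCTA
```

Removing the gap characters from the second row gives the descendant sequence ACCAGTCACCTA.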
Phylogenetic Tree Estimation
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Slide11
Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Slide12
Phase 1: Alignment
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Slide13
Phase 2: Construct tree
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
[Figure: estimated tree on leaves S1, S4, S2, S3]
Slide14
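As a rough illustration of how Phase 2 can consume the Phase 1 alignment, the sketch below computes pairwise distances from the four aligned rows on the slide. Distance-based methods such as neighbor joining or UPGMA start from a matrix like this. The p-distance used here (fraction of mismatches over columns where neither row has a gap) is a simplification chosen for this example, and the names `p_distance` and `distance_matrix` are hypothetical.

```python
from itertools import combinations

# Aligned rows copied from the Phase 1 slide.
ALIGNMENT = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}


def p_distance(row_a, row_b):
    """Fraction of mismatches over columns where neither row has a gap."""
    shared = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not shared:
        return float("nan")  # no comparable columns
    return sum(a != b for a, b in shared) / len(shared)


def distance_matrix(alignment):
    """Pairwise p-distances, keyed by (name, name) with names in sorted order."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


for pair, dist in distance_matrix(ALIGNMENT).items():
    print(pair, round(dist, 3))
```

A real analysis would normally apply a model-based distance correction, or skip distances entirely in favor of maximum likelihood; this sketch only shows the data flow from alignment to tree-building input.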
Two-phase estimation
Alignment methods
Clustal
POY (and POY*)
Probcons (and Probtree)
Probalign
MAFFT
Muscle
Di-align
T-Coffee
Prank (PNAS 2005, Science 2008)
Opal (ISMB and Bioinf. 2007)
FSA (PLoS Comp. Bio. 2009)
Infernal (Bioinf. 2009)
Etc.
Phylogeny methods
Bayesian MCMC
Maximum parsimony
Maximum likelihood
Neighbor joining
FastME
UPGMA
Quartet puzzling
Etc.
Slide15
Two-phase estimation
RAxML: heuristic for large-scale ML optimization
Slide16
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: a true tree and an estimated tree, with edges marked FN and FP]
50% error rate
Slide17
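The FN and FP rates can be computed by comparing the bipartitions (internal edges) of the true and estimated trees. The sketch below assumes trees are written as nested Python tuples of leaf names, and the two example trees are hypothetical, chosen so that one of the two internal edges differs. The function names are likewise illustrative.

```python
def bipartitions(tree):
    """Set of nontrivial bipartitions of a nested-tuple tree.

    Each bipartition is stored as the side that does not contain a fixed
    reference leaf, so the same edge always gets the same representation.
    """
    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_set(child) for child in node))

    taxa = leaf_set(tree)
    reference = min(taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = taxa - below if reference in below else below
        # Keep only internal edges: both sides need at least two leaves.
        if 2 <= len(side) <= len(taxa) - 2:
            splits.add(side)
        return below

    visit(tree)
    return splits


def error_rates(true_tree, estimated_tree):
    """Return (FN rate, FP rate) for two trees on the same leaf set."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(estimated_tree)
    fn = len(true_splits - est_splits) / len(true_splits) if true_splits else 0.0
    fp = len(est_splits - true_splits) / len(est_splits) if est_splits else 0.0
    return fn, fp


# Hypothetical five-leaf trees that share the edge separating {X, Y}.
true_tree = (("U", "V"), ("W", ("X", "Y")))
estimated_tree = (("U", "W"), ("V", ("X", "Y")))
print(error_rates(true_tree, estimated_tree))  # (0.5, 0.5)
```

Here one of the two true internal edges is missing from the estimate and one of the two estimated edges is incorrect, so both rates are 50%.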
1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)
Slide18
Multiple Sequence Alignment (MSA): a scientific grand challenge¹
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA
Novel techniques needed for scalability and accuracy
NP-hard problems and large datasets
Current methods do not provide good accuracy
Few methods can analyze even moderately large datasets
Many important applications besides phylogenetic estimation
¹ Frontiers in Massive Data Analysis, National Academies Press, 2013
Slide19
1KP: Thousand Transcriptome Project
G. Ka-Shu Wong (U Alberta), N. Wickett (Northwestern), J. Leebens-Mack (U Georgia), N. Matasci (iPlant), T. Warnow, S. Mirarab, N. Nguyen (UT-Austin)
First publication: Wickett, Mirarab, et al., PNAS, 2014
Used SATé (Liu et al., Science 2009 and Syst Biol 2012) to compute multiple sequence alignments and trees
Used ASTRAL (Mirarab et al., Bioinf 2014 and 2015) to compute the species tree
Upcoming Challenge:
Multiple sequence alignment and gene tree estimation on 100,000 sequences
Slide20
Computational Phylogenetics (2005)
Courtesy of the Tree of Life web project, tolweb.org
Current methods can use months to estimate trees on 1000 DNA sequences
Our objective: More accurate trees and alignments on 500,000 sequences in under a week
Slide21
Computational Phylogenetics (2015)
Courtesy of the Tree of Life web project, tolweb.org
1997-2001: Distance-based phylogenetic tree estimation from polynomial length sequences
2012: Computing accurate trees (almost) without multiple sequence alignments
2009-2015: Co-estimation of multiple sequence alignments and gene trees, now on 1,000,000 sequences in under two weeks
2014-2015: Species tree estimation from whole genomes in the presence of massive gene tree heterogeneity
Slide22
Slide23
Key technique: Divide-and-conquer!
In general, small datasets with not too much “heterogeneity” are easy to analyze with good accuracy.
Slide24
Divide-and-Conquer
Divide-and-conquer is a basic algorithmic trick for solving problems!
Three steps:
divide a dataset into two or more sets,
solve the problem on each set, and
combine solutions.
Slide25
Sorting
Objective: sort this list of integers from smallest to largest.
10, 3, 54, 23, 75, 5, 1, 25 should become
1, 3, 5, 10, 23, 25, 54, 75
Slide26
MergeSort
10, 3, 54, 23, 75, 5, 1, 25
Step 1: Divide into two sublists
Step 2: Recursively sort each sublist
Step 3: Merge the two sorted sublists
Slide27
Step 1: break into two lists
X: 10, 3, 54, 23
Y: 75, 5, 1, 25
Slide28
Step 2: sort the two lists
X: 3, 10, 23, 54
Y: 1, 5, 25, 75
Slide29
Step 3: merge the sorted lists
X: 3, 10, 23, 54
Y: 1, 5, 25, 75
Result:
Slide30
Merging (cont.)
X: 3, 10, 23, 54
Y: 5, 25, 75
Result: 1
Slide31
Merging (cont.)
X: 10, 23, 54
Y: 5, 25, 75
Result: 1, 3
Slide32
Merging (cont.)
X: 10, 23, 54
Y: 25, 75
Result: 1, 3, 5
Slide33
Merging (cont.)
X: 23, 54
Y: 25, 75
Result: 1, 3, 5, 10
Slide34
Merging (cont.)
X: 54
Y: 25, 75
Result: 1, 3, 5, 10, 23
Slide35
Merging (cont.)
X: 54
Y: 75
Result: 1, 3, 5, 10, 23, 25
Slide36
Merging (cont.)
X:
Y: 75
Result: 1, 3, 5, 10, 23, 25, 54
Slide37
Merging (cont.)
X:
Y:
Result: 1, 3, 5, 10, 23, 25, 54, 75
Slide38
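The MergeSort walk-through from the preceding slides can be written as a short program. This is a minimal sketch of the standard algorithm. The names `merge` and `merge_sort` are chosen here, and `x`, `y`, and `result` mirror the X, Y, and Result labels on the slides.

```python
def merge(x, y):
    """Merge two sorted lists by repeatedly taking the smaller front element."""
    result = []
    i = j = 0
    while i < len(x) and j < len(y):
        if x[i] <= y[j]:
            result.append(x[i])
            i += 1
        else:
            result.append(y[j])
            j += 1
    # One list is exhausted; the remainder of the other is already sorted.
    result.extend(x[i:])
    result.extend(y[j:])
    return result


def merge_sort(values):
    """Divide into two sublists, sort each recursively, then merge."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    return merge(merge_sort(values[:mid]), merge_sort(values[mid:]))


print(merge_sort([10, 3, 54, 23, 75, 5, 1, 25]))
# [1, 3, 5, 10, 23, 25, 54, 75]
```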
Multiple Sequence Alignment (MSA): a scientific grand challenge¹
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA
Novel techniques needed for scalability and accuracy
NP-hard problems and large datasets
Current methods do not provide good accuracy
Few methods can analyze even moderately large datasets
Many important applications besides phylogenetic estimation
¹ Frontiers in Massive Data Analysis, National Academies Press, 2013
Slide39
SATé and PASTA
Input: set of unaligned sequences
Output: multiple sequence alignment and phylogenetic tree
SATé: Liu et al., Science 2009 (up to 10,000 sequences) and Systematic Biology 2012 (up to 50,000 sequences)
PASTA: Mirarab et al., J. Comp Biol 2015 (up to 1,000,000 sequences)
Slide40
1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)
Slide41
Re-aligning on a tree
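[Figure: a tree whose leaves fall into four groups A, B, C, D, with a cycle of four steps]
Decompose dataset (into subsets A, B, C, D)
Align subproblems (A, B, C, D)
Merge sub-alignments (into ABCD)
Estimate ML tree on merged alignment
Slide42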
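One way to picture the "decompose dataset" step of the re-alignment slide is to cut the current tree into pieces until every piece is small enough to align. The sketch below repeatedly removes the edge that splits the leaves most evenly. That rule, the nested-tuple tree encoding, and all function names are assumptions made for this example; the slides do not state which decomposition rule SATé or PASTA uses. The input tree is assumed to have no nodes with a single child.

```python
def leaves(tree):
    """Leaf labels of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def restrict(tree, keep):
    """Subtree induced on the leaves in `keep`, or None if none survive."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [kid for kid in (restrict(child, keep) for child in tree) if kid is not None]
    if not kids:
        return None
    # Suppress nodes left with a single child.
    return kids[0] if len(kids) == 1 else tuple(kids)


def edge_sides(tree):
    """For every edge, the set of leaves on the side away from the root."""
    sides = []

    def visit(node, is_root):
        if isinstance(node, str):
            below = frozenset([node])
        else:
            below = frozenset().union(*(visit(child, False) for child in node))
        if not is_root:
            sides.append(below)
        return below

    visit(tree, True)
    return sides


def decompose(tree, max_size):
    """Split the leaf set into subsets of at most `max_size` leaves."""
    taxa = leaves(tree)
    if len(taxa) <= max_size or len(taxa) < 2:
        return [sorted(taxa)]
    # Choose the edge whose removal gives the most balanced split.
    side = min(edge_sides(tree), key=lambda s: abs(len(s) - len(taxa) / 2))
    rest = set(taxa) - side
    return decompose(restrict(tree, side), max_size) + decompose(restrict(tree, rest), max_size)


tree = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))
print(decompose(tree, max_size=2))
# [['A', 'B'], ['C', 'D'], ['E', 'F'], ['G', 'H']]
```

Because the subsets follow the tree, sequences placed in the same subset tend to be closely related, which is the situation in which small alignment problems are easiest.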
SATé and PASTA Algorithms
Obtain initial alignment and estimated ML tree
Tree
Use tree to compute new alignment
Slide43
SATé and PASTA Algorithms
Obtain initial alignment and estimated ML tree
Tree
Use tree to compute new alignment
Alignment
Slide44
SATé and PASTA Algorithms
Obtain initial alignment and estimated ML tree
Tree
Use tree to compute new alignment
Alignment
Estimate ML tree on new alignment
Slide45
SATé and PASTA Algorithms
Obtain initial alignment and estimated ML tree
Tree
Use tree to compute new alignment
Alignment
Estimate ML tree on new alignment
Repeat until termination condition, and return the alignment/tree pair with the best ML score
Slide46
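The iteration described on the SATé and PASTA algorithm slides can be outlined as a generic loop. This is only a structural sketch, not the SATé or PASTA implementation: the four callables (`initial_align`, `estimate_tree`, `realign`, `score`) are hypothetical stand-ins for real alignment, tree-estimation, and likelihood-scoring tools, and a fixed iteration count stands in for the unspecified termination condition.

```python
def co_estimate(sequences, initial_align, estimate_tree, realign, score, max_iters=5):
    """Alternate between re-aligning on the current tree and re-estimating
    the tree, and return the best-scoring (alignment, tree) pair seen.

    All four callables are placeholders supplied by the caller:
      initial_align(sequences)      -> alignment
      estimate_tree(alignment)      -> tree
      realign(sequences, tree)      -> alignment
      score(alignment, tree)        -> number (higher is better)
    """
    alignment = initial_align(sequences)
    tree = estimate_tree(alignment)
    best_score, best_pair = score(alignment, tree), (alignment, tree)

    for _ in range(max_iters):  # assumed termination condition
        alignment = realign(sequences, tree)   # use tree to compute new alignment
        tree = estimate_tree(alignment)        # estimate tree on new alignment
        current = score(alignment, tree)
        if current > best_score:
            best_score, best_pair = current, (alignment, tree)

    return best_pair
```

Keeping the best pair, instead of the last one, matters because a later iteration is not guaranteed to score higher than an earlier one.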
1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)
24-hour SATé analysis, on desktop machines
(Similar improvements for biological datasets)
SATé: 24-hour co-estimation of highly accurate alignments and trees on 1000 sequences
Slide47
(Liu et al., Syst Biol 61(1):90-106, 2012)
SATé-2: even more accurate!
Slide48
Simulated RNASim datasets from 10K to 200K taxa
Limited to 24 hours using 12 CPUs
Not all methods could run (missing bars could not finish)
PASTA, Mirarab et al., J Comp Biol 22(5): 377-386 (2015)
PASTA: even more accurate, and can scale to 1,000,000 sequences
Slide49
Avian Phylogenomics Project
E Jarvis (HHMI), G Zhang (BGI), MTP Gilbert (Copenhagen), S. Mirarab (UT-Austin), Md. S. Bayzid (UT-Austin), T. Warnow (UT-Austin)
Plus many many other people…
First analysis (Jarvis, Mirarab, et al., Science 2014):
Approx. 50 species, 14,000 loci
Used SATé for gene sequence alignment and tree estimation
Next analysis will have more species, and will use PASTA
Slide50
1KP: Thousand Transcriptome Project
G. Ka-Shu Wong (U Alberta), N. Wickett (Northwestern), J. Leebens-Mack (U Georgia), N. Matasci (iPlant), T. Warnow, S. Mirarab, N. Nguyen (UT-Austin)
First analysis (Wickett, Mirarab, et al., PNAS, 2014)
About 100 species and 800 loci
Used SATé
Next analysis will be much larger and more difficult:
Multiple sequence alignment and gene tree estimation on 100,000 sequences, many datasets highly fragmentary
Will use PASTA and UPP (Nguyen et al., Genome Biology 2015)
Slide51
Computational Phylogenetics (2015)
Courtesy of the Tree of Life web project, tolweb.org
1997-2001: Distance-based phylogenetic tree estimation from polynomial length sequences
2012: Computing accurate trees (almost) without multiple sequence alignments
2009-2015: Co-estimation of multiple sequence alignments and gene trees, now on 1,000,000 sequences in under two weeks
2014-2015: Species tree estimation from whole genomes in the presence of massive gene tree heterogeneity
Slide52
“Boosters”, or “Meta-Methods”
Meta-methods use divide-and-conquer and iteration (or other techniques) to “boost” the performance of base methods (phylogeny reconstruction, alignment estimation, etc.)
[Diagram: Base method M goes into the Meta-method, which produces M*]
Slide53
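The idea of a meta-method that turns a base method M into a boosted method M* can be illustrated with a toy wrapper. The example below uses sorting in place of phylogeny or alignment estimation, purely as an analogy: `boost` and `chunk_size` are hypothetical names, the base method is Python's built-in `sorted`, and the combine step uses `heapq.merge` from the standard library.

```python
import heapq


def boost(base_method, chunk_size):
    """Wrap `base_method` so it is only ever run on small pieces.

    The returned function divides the input into chunks of at most
    `chunk_size` items, applies the base method to each chunk, and
    combines the per-chunk answers with a k-way merge.
    """
    def boosted(values):
        chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
        solved = [base_method(chunk) for chunk in chunks]
        return list(heapq.merge(*solved))

    return boosted


sort_star = boost(sorted, chunk_size=3)  # M = sorted, M* = sort_star
print(sort_star([10, 3, 54, 23, 75, 5, 1, 25]))
# [1, 3, 5, 10, 23, 25, 54, 75]
```

The wrapper never needs to know how the base method works internally, which is what lets the same meta-method be paired with different base methods.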
Main Points
Innovative algorithm design can improve accuracy as well as reduce running time.
Divide-and-conquer is a key algorithmic technique that has dramatically changed the toolkit for biologists!
Slide54
Acknowledgments
Funding: Guggenheim Foundation, Packard Foundation, NSF, Microsoft Research New England, David Bruton Jr. Centennial Professorship, Grainger Foundation, and TACC (Texas Advanced Computing Center)
Slide55
Avian Phylogenomics Project
E Jarvis (HHMI), G Zhang (BGI), MTP Gilbert (Copenhagen), S. Mirarab (UT-Austin), Md. S. Bayzid (UT-Austin), T. Warnow (UT-Austin)
Plus many many other people…
Jarvis, Mirarab, et al., Science 2014
Major challenge:
Massive gene tree heterogeneity consistent with incomplete lineage sorting
Very poor resolution in the 14,000 gene trees
Standard coalescent-based species tree estimation methods had poor accuracy
Solution: New technique to improve coalescent-based species tree estimation (statistical binning, Mirarab et al., Science 2014)