Slide1
CS/BIOE 598: Algorithmic Computational Genomics
Tandy Warnow
Departments of Bioengineering and Computer Science
http://tandy.cs.illinois.edu
Slide2
Course Details
Office hours: To be determined
Course webpage: http://tandy.cs.illinois.edu/598-2016.html
Slide3
Today
• Describe some important problems in computational biology and computational historical linguistics, for which students in this course could develop improved methods.
• Explain how the course will be run.
• Answer questions.
Slide4
Basics
• Prerequisites: Computer Science (algorithm design, analysis, and programming), mathematical maturity (ability to understand proofs). No background in biology or linguistics is needed!
• Note: if you are not a CS, ECE, or Math major, you may still be able to do the course; but please see me.
Slide5
Species Tree
[Figure: species tree on Orangutan, Gorilla, Chimpanzee, Human. From the Tree of Life Website, University of Arizona]
Slide6
Evolution informs about everything in biology
• Big genome sequencing projects just produce data -- so what?
• Evolutionary history relates all organisms and genes, and helps us understand and predict
– interactions between genes (genetic networks)
– drug design
– predicting functions of genes
– influenza vaccine development
– origins and spread of disease
– origins and migrations of humans
Slide7
Phylogenomic pipeline
Select taxon set and markers
Gather and screen sequence data, possibly identify orthologs
Compute multiple sequence alignments for each locus
Compute species tree or network:
– Compute gene trees on the alignments and combine the estimated gene trees, OR
– Estimate a tree from a concatenation of the multiple sequence alignments
Get statistical support on each branch (e.g., bootstrapping)
Estimate dates on the nodes of the phylogeny
Use species tree with branch support and dates to understand biology
Slide9
Constructing the Tree of Life: Hard Computational Problems
NP-hard problems
Large datasets
– 100,000+ sequences
– thousands of genes
“Big data” complexity:
– model misspecification
– fragmentary sequences
– errors in input data
– streaming data
Slide10
Avian Phylogenomics Project
Approx. 50 species, whole genomes
14,000 loci
G Zhang, BGI
MTP Gilbert, Copenhagen
S. Mirarab, UT-Austin
Md. S. Bayzid, UT-Austin
T. Warnow, UT-Austin
Erich Jarvis, HHMI
Plus many many other people…
Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)
Slide11
1kp: Thousand Transcriptome Project
Plant Tree of Life based on transcriptomes of ~1200 species
More than 13,000 gene families (most not single copy)
First paper: PNAS 2014 (~100 species and ~800 loci)
Gene Tree Incongruence
G. Ka-Shu Wong, U Alberta
N. Wickett, Northwestern
J. Leebens-Mack, U Georgia
N. Matasci, iPlant
T. Warnow, UIUC
S. Mirarab, UT-Austin
N. Nguyen, UT-Austin
Plus many many other people…
Upcoming Challenges (~1200 species, ~400 loci)
Slide12
DNA Sequence Evolution
[Figure: a DNA sequence evolving down a tree by substitutions. Time axis: -3 mil yrs, -2 mil yrs, -1 mil yrs, today. Sequences shown: AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, TAGCCCA, TAGACTT, AGCGCTT, AGCACAA]
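Note: the process pictured on this slide can be imitated with a few lines of code. The sketch below is only an illustration under stated assumptions: AAGACTT is taken as the ancestral sequence, each site changes to a different base with the same probability p on every branch (a Jukes-Cantor-style simplification), and the tree shape, the value of p, and the leaf names are all hypothetical.

```python
import random

BASES = "ACGT"

def mutate(seq, p, rng):
    """Copy seq, changing each site to a different base with probability p."""
    out = []
    for base in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def evolve(tree, seq, p, rng):
    """Evolve seq down a nested-tuple tree; return {leaf name: sequence}."""
    if isinstance(tree, str):  # a leaf is just its name
        return {tree: seq}
    leaves = {}
    for child in tree:
        # Each child inherits an independently mutated copy of the parent.
        leaves.update(evolve(child, mutate(seq, p, rng), p, rng))
    return leaves

rng = random.Random(0)                   # fixed seed, for repeatability
tree = (("U", "V"), ("W", ("X", "Y")))   # hypothetical tree shape
leaves = evolve(tree, "AAGACTT", 0.2, rng)
for name, seq in sorted(leaves.items()):
    print(name, seq)
```

Simulation studies use the same idea with richer models: because the tree that generated the sequences is known, the output of an estimation method can be scored against it.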
Slide13
Phylogeny Problem
[Figure: five sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) labeled U, V, W, X, Y, and a tree with leaves U, V, W, X, Y]
Slide14
Performance criteria
Running time
Space
Statistical performance issues (e.g., statistical consistency) with respect to a Markov model of evolution
“Topological accuracy” with respect to the underlying true tree or true alignment, typically studied in simulation
Accuracy with respect to a particular criterion (e.g. maximum likelihood score), on real data
Slide15
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: a true tree and an estimated tree, with one FN edge and one FP edge marked; 50% error rate]
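Note: FN and FP rates can be computed by comparing the bipartitions (internal edges) of the two trees. The sketch below assumes trees are written as nested tuples of leaf names; the two example trees and the function names are hypothetical, chosen so that each tree has two internal edges and the trees share one of them, giving the 50% error rate named on the slide.

```python
def leaf_set(tree):
    """All leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))

def bipartitions(tree, taxa):
    """Non-trivial bipartitions, each stored as the side without a fixed reference taxon."""
    ref = min(taxa)
    splits = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(c) for c in node))
        # Skip trivial splits (a single leaf on one side) and the root itself.
        if 1 < len(below) < len(taxa) - 1:
            splits.add(below if ref not in below else taxa - below)
        return below

    leaves_below(tree)
    return splits

def error_rates(true_tree, est_tree):
    """Return (FN rate, FP rate) of est_tree with respect to true_tree."""
    taxa = frozenset(leaf_set(true_tree))
    true_splits = bipartitions(true_tree, taxa)
    est_splits = bipartitions(est_tree, taxa)
    fn = len(true_splits - est_splits) / len(true_splits) if true_splits else 0.0
    fp = len(est_splits - true_splits) / len(est_splits) if est_splits else 0.0
    return fn, fp

true_tree = (("U", "V"), "W", ("X", "Y"))   # hypothetical true tree
est_tree = (("U", "V"), "X", ("W", "Y"))    # hypothetical estimate
print(error_rates(true_tree, est_tree))
```

When both trees are fully resolved they have the same number of internal edges, so the FN and FP rates coincide; they differ when one of the trees contains polytomies.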
Slide16
Statistical consistency, exponential convergence, and absolute fast convergence (afc)
Slide17
Phylogenetic reconstruction methods
1. Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood)
[Figure: cost plotted over phylogenetic trees, showing a local optimum and the global optimum]
2. Polynomial time distance-based methods: Neighbor Joining, FastME, etc.
3. Bayesian methods
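Note: a hill-climbing heuristic for Maximum Parsimony repeatedly scores candidate trees. Scoring one fixed tree is easy; the sketch below uses Fitch's algorithm for that step only, not the search over trees. The assignment of the Slide13 sequences to the labels U to Y and the tree topology are both hypothetical.

```python
def fitch_site(tree, states):
    """Fitch's algorithm for one site on a rooted binary nested-tuple tree.

    Returns (possible states at this node, number of changes below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is needed on an edge below this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, seqs):
    """Sum the per-site Fitch costs over all alignment columns."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(length)
    )

# Hypothetical labeling of the five sequences shown on Slide13.
seqs = {"U": "TAGCCCA", "V": "TAGACTT", "W": "TGCACAA",
        "X": "TGCGCTT", "Y": "AGGGCAT"}
tree = ((("U", "V"), ("W", "X")), "Y")   # hypothetical topology
print(parsimony_score(tree, seqs))
```

A search heuristic would call parsimony_score on neighboring topologies and keep the best one, which is where the local optima in the figure come from.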
Slide18
Solving maximum likelihood (and other hard optimization problems) is… unlikely
# of Taxa    # of Unrooted Trees
4            3
5            15
6            105
7            945
8            10395
9            135135
10           2027025
20           2.2 x 10^20
100          4.5 x 10^190
1000         2.7 x 10^2900
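Note: the number of unrooted binary trees on n labeled leaves is the double factorial (2n-5)!! = 3 x 5 x ... x (2n-5). The sketch below (the function name is my own) reproduces the table rows for 4 to 10 taxa exactly and the order of magnitude of the row for 20 taxa.

```python
def num_unrooted_binary_trees(n):
    """Count unrooted binary trees on n >= 3 labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd numbers 3, 5, ..., 2n-5
        count *= k
    return count

for n in (4, 5, 6, 7, 8, 9, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

Python integers have arbitrary precision, so the same function can be called for much larger n without overflow.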
Slide19
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: a true tree and an estimated tree, with one FN edge and one FP edge marked; 50% error rate]
Slide20
Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]
Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining!
[Figure: error rate of NJ (0 to 0.8) plotted against the number of taxa (0 to 1600)]
Slide21
Major Challenges
• Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)
Slide22
Phylogeny Problem
[Figure: five sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) labeled U, V, W, X, Y, and a tree with leaves U, V, W, X, Y]
Slide23
The Real Problem!
[Figure: five unaligned sequences of different lengths (AGAT, TAGACTT, TGCACAA, TGCGCTT, AGGGCATGA) labeled U, V, W, X, Y, and a tree with leaves X, U, Y, V, W]
Slide24
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
The true multiple alignment
– Reflects historical substitution, insertion, and deletion events
– Defined using transitive closure of pairwise alignments computed on edges of the true tree
[Figure: …ACGGTGCAGTTACCA… evolving into …ACCAGTCACCTA… through a deletion, a substitution, and an insertion]
Slide25
Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Slide26
Phase 1: Alignment
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
Slide27
Phase 2: Construct tree
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: tree with leaves S1, S4, S2, S3]
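Note: the two phases can be connected in code. The sketch below first checks that the alignment on the slide is a valid alignment of the input (removing gaps from each row must give back the unaligned sequence) and then computes pairwise p-distances, the kind of input a distance-based tree method would use. The function names and the choice to ignore gapped columns are assumptions made for illustration.

```python
from itertools import combinations

unaligned = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "S4": "TCACGACCGACA",
}
alignment = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}

def degap(row):
    """Remove gap characters from an alignment row."""
    return row.replace("-", "")

def p_distance(a, b):
    """Fraction of mismatches over columns where neither row has a gap.

    Returns None when the two rows share no ungapped column.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return None
    return sum(x != y for x, y in pairs) / len(pairs)

# Phase 1 sanity check: every row must spell its original sequence.
for name, row in alignment.items():
    assert degap(row) == unaligned[name]

# Phase 2 input: a pairwise distance for every pair of sequences.
for a, b in combinations(sorted(alignment), 2):
    print(a, b, p_distance(alignment[a], alignment[b]))
```

Real pipelines usually apply a model-based correction to these raw distances before building a tree; that step is omitted here.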
Slide28
Two-phase estimation
Alignment methods
• Clustal
• POY (and POY*)
• Probcons (and Probtree)
• Probalign
• MAFFT
• Muscle
• Di-align
• T-Coffee
• Prank (PNAS 2005, Science 2008)
• Opal (ISMB and Bioinf. 2007)
• FSA (PLoS Comp. Bio. 2009)
• Infernal (Bioinf. 2009)
• Etc.
Phylogeny methods
• Bayesian MCMC
• Maximum parsimony
• Maximum likelihood
• Neighbor joining
• FastME
• UPGMA
• Quartet puzzling
• Etc.
Slide29
Simulation Studies
True tree and alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: true tree with leaves S1, S2, S4, S3]
Unaligned Sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Estimated tree and alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA
[Figure: estimated tree with leaves S1, S4, S2, S3]
Compare
Slide30
1000 taxon models, ordered by difficulty (Liu et al., 2009)
Slide31
Multiple Sequence Alignment (MSA): another grand challenge [1]
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA
Novel techniques needed for scalability and accuracy
NP-hard problems and large datasets
Current methods do not provide good accuracy
Few methods can analyze even moderately large datasets
Many important applications besides phylogenetic estimation
[1] Frontiers in Massive Data Analysis, National Academies Press, 2013
Slide32
Major Challenges
• Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)
• Multiple sequence alignment: key step for many biological questions (protein structure and function, phylogenetic estimation), but few methods can run on large datasets. Alignment accuracy is generally poor for large datasets with high rates of evolution.
Slide33
Phylogenomics
(Phylogenetic estimation from whole genomes)
Slide34
Species Tree Estimation requires multiple genes!
[Figure: species tree on Orangutan, Gorilla, Chimpanzee, Human. From the Tree of Life Website, University of Arizona]
Slide35
Two basic approaches for species tree estimation
• Concatenate (“combine”) sequence alignments for different genes, and run phylogeny estimation methods
• Compute trees on individual genes and combine gene trees
Slide36
Using multiple genes
gene 1
S1  TCTAATGGAA
S2  GCTAAGGGAA
S3  TCTAAGGGAA
S4  TCTAACGGAA
S7  TCTAATGGAC
S8  TATAACGGAA
gene 2
S4  GGTAACCCTC
S5  GCTAAACCTC
S6  GGTGACCATC
S7  GCTAAACCTC
gene 3
S1  TATTGATACA
S3  TCTTGATACC
S4  TAGTGATGCA
S7  CATTCATACC
S8  TAGTGATGCA
Slide37
Concatenation
    gene 1      gene 2      gene 3
S1  TCTAATGGAA  ??????????  TATTGATACA
S2  GCTAAGGGAA  ??????????  ??????????
S3  TCTAAGGGAA  ??????????  TCTTGATACC
S4  TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
S5  ??????????  GCTAAACCTC  ??????????
S6  ??????????  GGTGACCATC  ??????????
S7  TCTAATGGAC  GCTAAACCTC  CATTCATACC
S8  TATAACGGAA  ??????????  TAGTGATGCA
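Note: building the concatenated matrix is mechanical once each gene alignment is stored per taxon. The sketch below pairs the sequences with the taxa in the order they are listed for each gene on the previous slide (an assumption about the figure) and fills in "?" wherever a taxon has no sequence for a gene. The function name is my own.

```python
genes = [
    {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
     "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"},   # gene 1
    {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC", "S6": "GGTGACCATC",
     "S7": "GCTAAACCTC"},                                           # gene 2
    {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
     "S7": "CATTCATACC", "S8": "TAGTGATGCA"},                       # gene 3
]

def concatenate(gene_alignments, missing="?"):
    """Join per-gene alignments into one row per taxon, padding absent genes."""
    taxa = sorted(set().union(*gene_alignments))
    rows = {taxon: "" for taxon in taxa}
    for aln in gene_alignments:
        length = len(next(iter(aln.values())))   # all rows of a gene share a length
        for taxon in taxa:
            rows[taxon] += aln.get(taxon, missing * length)
    return rows

supermatrix = concatenate(genes)
for taxon, row in supermatrix.items():
    print(taxon, row)
```

With these inputs, nine gene-sized blocks of the matrix are missing data, which matches the nine rows of question marks in the figure.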
Slide38
Red gene tree ≠ species tree (green gene tree okay)
Slide39
Gene Tree Incongruence
Gene trees can differ from the species tree due to:
Duplication and loss
Horizontal gene transfer
Incomplete lineage sorting (ILS)
Slide40
Incomplete Lineage Sorting (ILS)
Confounds phylogenetic analysis for many groups:
Hominids
Birds
Yeast
Animals
Toads
Fish
Fungi
There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS.
Slide41
Lineage Sorting
Population-level process, also called the “Multi-species coalescent” (Kingman, 1982)
Gene trees can differ from species trees due to short times between speciation events or large population size; this is called “Incomplete Lineage Sorting” or “Deep Coalescence”.
Slide42
The Coalescent
[Figure: coalescent genealogy, with time running from Past to Present. Courtesy James Degnan]
Slide43
Gene tree in a species tree
Courtesy James Degnan
Slide44
Key observation:
Under the multi-species coalescent model, the species tree defines a probability distribution on the gene trees, and is identifiable from the distribution on gene trees
Courtesy James Degnan
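Note: the smallest case makes the key observation concrete. For a rooted species tree ((A,B),C) whose internal branch has length t in coalescent units, the multi-species coalescent gives the matching gene tree probability 1 - (2/3)e^(-t) and each of the two other topologies probability (1/3)e^(-t). The sketch below just evaluates this standard three-taxon formula; the function name is my own.

```python
import math

def rooted_triple_probs(t):
    """Gene tree topology probabilities for species tree ((A,B),C).

    t is the internal branch length in coalescent units (t >= 0).
    """
    p_discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * p_discordant,   # matches the species tree
        "((A,C),B)": p_discordant,
        "((B,C),A)": p_discordant,
    }

for t in (0.1, 0.5, 1.0, 3.0):
    probs = rooted_triple_probs(t)
    print(t, {topology: round(p, 3) for topology, p in probs.items()})
```

For every t > 0 the most probable gene tree is the one that matches the species tree, so the species tree topology can be read off the distribution; short branches push all three probabilities toward 1/3, which is why incomplete lineage sorting is hardest when speciation events are close together.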
Slide45. . .
Analyze
separately
Summary Method
Two competing approaches
gene 1
gene 2
. . .
gene
k
. . .
Concatenation
Species
Slide46
Species tree estimation: difficult, even for small datasets!
[Figure: species tree on Orangutan, Gorilla, Chimpanzee, Human. From the Tree of Life Website, University of Arizona]
Slide47
Major Challenges: large datasets, fragmentary sequences
• Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.
• Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).
• Species Tree Estimation: gene tree incongruence makes accurate estimation of the species tree challenging.
• Phylogenetic Network Estimation: Horizontal gene transfer and hybridization require non-tree models of evolution.
Both phylogenetic estimation and multiple sequence alignment are also impacted by fragmentary data.
Slide49
Avian Phylogenomics Project
Approx. 50 species, whole genomes
14,000 loci
G Zhang, BGI
MTP Gilbert, Copenhagen
S. Mirarab, UT-Austin
Md. S. Bayzid, UT-Austin
T. Warnow, UT-Austin
Erich Jarvis, HHMI
Plus many many other people…
Challenges:
Species tree estimation under the multi-species coalescent model, from 14,000 poorly estimated gene trees, all with different topologies (we used “statistical binning”)
Maximum likelihood estimation on a million-site genome-scale alignment – 250 CPU years
Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)
Slide50
1kp: Thousand Transcriptome Project
Plant Tree of Life based on transcriptomes of ~1200 species
More than 13,000 gene families (most not single copy)
First paper: PNAS 2014 (~100 species and ~800 loci)
Gene Tree Incongruence
G. Ka-Shu Wong, U Alberta
N. Wickett, Northwestern
J. Leebens-Mack, U Georgia
N. Matasci, iPlant
T. Warnow, UIUC
S. Mirarab, UT-Austin
N. Nguyen, UT-Austin
Plus many many other people…
Upcoming Challenges (~1200 species, ~400 loci):
Species tree estimation under the multi-species coalescent from hundreds of conflicting gene trees on >1000 species (we will use ASTRAL – Mirarab et al. 2014, Mirarab & Warnow 2015)
Multiple sequence alignment of >100,000 sequences (with lots of fragments!) – we will use UPP (Nguyen et al., 2015)
Slide51
Metagenomics:
Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes
Slide52
Metagenomic data analysis
NGS data produce fragmentary sequence data
Metagenomic analyses include unknown species
Taxon identification: given short sequences, identify the species for each fragment
Applications: Human Microbiome
Issues: accuracy and speed
Slide53
Metagenomic taxon identification
Objective: classify short reads in a metagenomic sample
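Note: to make the objective concrete, here is a deliberately naive classifier that assigns a read to the reference sequence sharing the most k-mers with it. This is not one of the methods covered in the course, only a toy baseline; the reference sequences, taxon names, and the choice k = 4 are hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify(read, references, k=4):
    """Return the reference name sharing the most k-mers with read, or None."""
    read_kmers = kmers(read, k)
    best_name, best_score = None, 0
    for name, ref in references.items():
        score = len(read_kmers & kmers(ref, k))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical reference sequences for two taxa.
references = {
    "taxonA": "AGGCTATCACCTGACCTCCA",
    "taxonB": "TTTTGGGGTTTTGGGGTTTT",
}
for read in ("CTATCACC", "GGGGTTTT", "CCCCCCCC"):
    print(read, classify(read, references))
```

A read with no k-mer in common with any reference is left unclassified, which is one simple way to represent reads from unknown species. The accuracy and speed issues named on the previous slide are exactly where such a baseline falls short on real data.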
Slide54
Possible Indo-European tree (Ringe, Warnow and Taylor 2000)
[Figure: tree with leaves Anatolian, Albanian, Tocharian, Greek, Germanic, Armenian, Italic, Celtic, Baltic, Slavic, Vedic, Iranian]
Slide55“
Per
f
ect
P
h
y
l
o
genetic Netwo
rk” for IE
Nakhleh et al.,
Language 2005
Anatolian
Albanian
Tocharian
Greek
Germanic Armenian
Baltic
Slavic
Vedic
Iranian
Italic
Celtic
Slide56
Course Material
Phylogenetics: 5 weeks
Multiple Sequence Alignment: 2 weeks
Genome-scale phylogeny: 1 week
Advanced topics: 2.5 weeks
– Historical linguistics
– Metagenomics
– Genome Assembly
– Protein structure and function prediction
Scientific literature (student presentations): 2.5 weeks
Slide57
Grading
Homework: 25% (one hw dropped)
Midterm: 40% (March 29)
Final Project: 25% (due May 6)
Class Presentation: 5%
Course Participation: 5%
No final exam.
Slide58
Final Project and Class Presentation
Either research project (can be with another student) or survey paper (done by yourself).
Many interesting and publishable problems to address: see http://tandy.cs.illinois.edu/topics.html
Your class presentation should be related to your final project.
Slide59
Examples of published course projects
Md S. Bayzid, T. Hunt, and T. Warnow. "Disk Covering Methods Improve Phylogenomic Analyses". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6): S7.
T. Zimmermann, S. Mirarab and T. Warnow. "BBCA: Improving the scalability of *BEAST using random binning". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6): S11.
J. Chou, A. Gupta, S. Yaduvanshi, R. Davidson, M. Nute, S. Mirarab and T. Warnow. "A comparative study of SVDquartets and other coalescent-based species tree estimation methods". RECOMB-Comparative Genomics and BMC Genomics, 2015, 16(Suppl 10): S2.
Slide60
Research opportunities
Evaluating methods on simulated and real (biological or linguistic) datasets
Designing a new method, and establishing its performance (using theory and data)
Analyzing a biological dataset using several different methods, to address biology
Slide61
Algorithmic Strategies
• Divide-and-conquer
• “Bin-and-conquer”
• Iteration
• Bayesian statistics
• Hidden Markov Models
• Graph theory