Slide1
CS/BIOE 598: Algorithmic Computational Genomics
Tandy Warnow
Departments of Bioengineering and Computer Science
http://tandy.cs.illinois.edu
Slide2
Course Details
Office hours: To be determined
Course webpage: http://tandy.cs.illinois.edu/598-2016.html
Slide3
Today
• Describe some important problems in computational biology and computational historical linguistics, for which students in this course could develop improved methods.
• Explain how the course will be run.
• Answer questions.
Slide4
Basics
• Prerequisites: Computer Science (algorithm design, analysis, and programming), mathematical maturity (ability to understand proofs). No background in biology or linguistics is needed!
• Note: if you are not a CS, ECE, or Math major, you may still be able to do the course; but please see me.
Slide5
Orangutan
Gorilla
Chimpanzee
Human
From the Tree of the Life Website, University of Arizona
Species Tree
Slide6
Evolution informs about everything in biology
• Big genome sequencing projects just produce data -- so what?
• Evolutionary history relates all organisms and genes, and helps us understand and predict
– interactions between genes (genetic networks)
– drug design
– predicting functions of genes
– influenza vaccine development
– origins and spread of disease
– origins and migrations of humans
Slide7
Phylogenomic pipeline
Select taxon set and markers
Gather and screen sequence data, possibly identify orthologs
Compute multiple sequence alignments for each locus
Compute species tree or network:
– Compute gene trees on the alignments and combine the estimated gene trees, OR
– Estimate a tree from a concatenation of the multiple sequence alignments
Get statistical support on each branch (e.g., bootstrapping)
Estimate dates on the nodes of the phylogeny
Use species tree with branch support and dates to understand biology
Slide8
Slide9
Constructing the Tree of Life:
Hard Computational Problems
NP-hard problems
Large datasets
100,000+ sequences
thousands of genes
“Big data” complexity:
model misspecification
fragmentary sequences
errors in input data
streaming data
Slide10
Avian Phylogenomics Project
Erich Jarvis, HHMI
G Zhang, BGI
MTP Gilbert, Copenhagen
S. Mirarab, UT-Austin
Md. S. Bayzid, UT-Austin
T. Warnow, UT-Austin
Plus many many other people…
Approx. 50 species, whole genomes
14,000 loci
Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)
Slide11
1kp: Thousand Transcriptome Project
G. Ka-Shu Wong, U Alberta
N. Wickett, Northwestern
J. Leebens-Mack, U Georgia
N. Matasci, iPlant
T. Warnow, UIUC
S. Mirarab, UT-Austin
N. Nguyen, UT-Austin
Plus many many other people…
Plant Tree of Life based on transcriptomes of ~1200 species
More than 13,000 gene families (most not single copy)
First paper: PNAS 2014 (~100 species and ~800 loci)
Gene Tree Incongruence
Upcoming Challenges (~1200 species, ~400 loci)
Slide12
DNA Sequence Evolution
[Figure: the ancestral sequence AAGACTT evolves down a tree over time (-3 mil yrs, -2 mil yrs, -1 mil yrs, today). Internal-node sequences: AAGGCCT, TGGACTT, AGGGCAT, TAGCCCT, AGCACTT. Sequences today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT.]
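The process on this slide can be imitated with a toy simulation. This is a minimal sketch under loud assumptions: the tree shape, the per-branch substitution probability `p_sub`, and the function names are all illustrative, and only substitutions are modelled (no insertions or deletions, no branch lengths).

```python
import random

BASES = "ACGT"

def mutate(seq, p_sub, rng):
    """Copy seq, replacing each site by a different base with probability p_sub."""
    out = []
    for base in seq:
        if rng.random() < p_sub:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def evolve(tree, seq, p_sub, rng):
    """Evolve seq down a tree written as nested tuples whose leaves are names.

    Returns a dict mapping each leaf name to its simulated sequence.
    """
    if isinstance(tree, str):
        return {tree: seq}
    leaves = {}
    for child in tree:
        # every branch applies its own independent round of substitutions
        leaves.update(evolve(child, mutate(seq, p_sub, rng), p_sub, rng))
    return leaves

rng = random.Random(0)                    # fixed seed so runs are repeatable
tree = (("U", "V"), ("W", ("X", "Y")))    # hypothetical topology
leaves = evolve(tree, "AAGACTT", 0.2, rng)
for name in sorted(leaves):
    print(name, leaves[name])
```

Simulations of this kind are what make it possible to measure accuracy against a known true tree.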
Slide13
Phylogeny Problem
Input: sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT for five taxa U, V, W, X, Y
Output: a tree on leaves U, V, W, X, Y
Slide14
Performance criteria
Running time
Space
Statistical performance issues (e.g., statistical consistency) with respect to a Markov model of evolution
“Topological accuracy” with respect to the underlying true tree or true alignment, typically studied in simulation
Accuracy with respect to a particular criterion (e.g. maximum likelihood score), on real data
Slide15
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
50% error rate
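The FN and FP rates can be computed by comparing the internal edges (bipartitions of the leaf set) of the true and estimated trees. The sketch below assumes trees are written as nested tuples of leaf names and that each tree has at least one internal edge; the two example trees are hypothetical and were chosen so that both rates come out at 50%.

```python
def leaf_set(tree):
    """All leaf names of a tree written as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def bipartitions(tree):
    """Non-trivial bipartitions (internal edges), each stored as one side."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)   # always store the side that lacks this leaf
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        if 1 < len(side) < len(all_leaves) - 1:   # both sides have >= 2 leaves
            splits.add(side)
        return below

    visit(tree)
    return splits

def fn_fp_rates(true_tree, est_tree):
    """FN rate: true edges missing from the estimate. FP rate: estimated edges not in the truth."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(est_tree)
    fn = len(true_splits - est_splits) / len(true_splits)
    fp = len(est_splits - true_splits) / len(est_splits)
    return fn, fp

true_tree = ((("A", "B"), "C"), ("D", "E"))
est_tree = ((("A", "C"), "B"), ("D", "E"))
print(fn_fp_rates(true_tree, est_tree))
```

For two fully resolved trees on the same leaves the two rates are equal, because both trees have the same number of internal edges.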
Slide16
Statistical consistency, exponential convergence, and absolute fast convergence (afc)
Slide17
Phylogenetic reconstruction methods
1. Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood)
[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]
2. Polynomial time distance-based methods: Neighbor Joining, FastME, etc.
3. Bayesian methods
Slide18
Solving maximum likelihood (and other hard optimization problems) is… unlikely
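Exhaustive search is hopeless because of how fast the number of candidate trees grows: there are (2n-5)!! = 3 x 5 x ... x (2n-5) unrooted binary trees on n labelled leaves. A minimal sketch of that count follows (the function name is illustrative). It reproduces the entries of this slide's table for 4 to 10 taxa exactly and gives about 2.2 x 10^20 for 20 taxa. Evaluated for 100 taxa it gives roughly 1.7 x 10^182, so the rounded entries for very large n are best read as order-of-magnitude illustrations.

```python
def num_unrooted_trees(n):
    """Number of unrooted binary trees on n >= 3 labelled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

for n in (4, 5, 6, 7, 8, 9, 10, 20):
    print(n, num_unrooted_trees(n))
```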
# of Taxa    # of Unrooted Trees
4            3
5            15
6            105
7            945
8            10395
9            135135
10           2027025
20           2.2 x 10^20
100          4.5 x 10^190
1000         2.7 x 10^2900
Slide19
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
50% error rate
Slide20
Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]
Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining!
[Figure: error rate of NJ (axis 0 to 0.8) plotted against No. Taxa (axis 0 to 1600)]
Slide21
Major Challenges
• Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)
Slide22
Phylogeny Problem
Input: sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT for five taxa U, V, W, X, Y
Output: a tree on leaves U, V, W, X, Y
Slide23
The Real Problem!
Input: unaligned sequences AGAT, TAGACTT, TGCACAA, TGCGCTT, AGGGCATGA for five taxa U, V, W, X, Y
Output: a tree on leaves X, U, Y, V, W
Slide24
[Figure: …ACGGTGCAGTTACCA… evolves into …ACCAGTCACCTA… through a deletion, a substitution, and an insertion]
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
The true multiple alignment
– Reflects historical substitution, insertion, and deletion events
– Defined using transitive closure of pairwise alignments computed on edges of the true tree
Slide25
Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Slide26
Phase 1: Alignment
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
Slide27
Phase 2: Construct tree
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: tree on leaves S1, S4, S2, S3]
Slide28
Two-phase estimation
Alignment methods
• Clustal
• POY (and POY*)
• Probcons (and Probtree)
• Probalign
• MAFFT
• Muscle
• Di-align
• T-Coffee
• Prank (PNAS 2005, Science 2008)
• Opal (ISMB and Bioinf. 2007)
• FSA (PLoS Comp. Bio. 2009)
• Infernal (Bioinf. 2009)
• Etc.
Phylogeny methods
• Bayesian MCMC
• Maximum parsimony
• Maximum likelihood
• Neighbor joining
• FastME
• UPGMA
• Quartet puzzling
• Etc.
Slide29
Simulation Studies
Unaligned Sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
True tree and alignment
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: true tree on leaves S1, S2, S4, S3]
Estimated tree and alignment
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA
[Figure: estimated tree on leaves S1, S4, S2, S3]
Compare
Slide30
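The "Compare" step of a simulation study needs a numeric measure of alignment accuracy. One common choice, sketched here as an illustration rather than as the measure used in the cited studies, is the fraction of truly homologous residue pairs that the estimated alignment also places in the same column. The function names are assumptions; the two alignments are the true and estimated alignments from the Simulation Studies slide.

```python
from itertools import combinations

def homologous_pairs(alignment):
    """Set of residue pairs that share a column.

    alignment maps sequence name -> gapped row; all rows have equal length.
    A residue is identified by (sequence name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    next_index = {name: 0 for name in names}
    pairs = set()
    for col in range(len(alignment[names[0]])):
        present = []
        for name in names:
            if alignment[name][col] != "-":
                present.append((name, next_index[name]))
                next_index[name] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sp_score(true_aln, est_aln):
    """Fraction of true homologous pairs recovered by the estimated alignment."""
    true_pairs = homologous_pairs(true_aln)
    return len(true_pairs & homologous_pairs(est_aln)) / len(true_pairs)

true_aln = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}
est_aln = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-C--T-----GACCGC--",
    "S4": "T---C-A-CGACCGA----CA",
}
print(round(sp_score(true_aln, est_aln), 3))
```

A score of 1.0 means every true homology was recovered; the estimated alignment here scores lower because rows S3 and S4 are aligned differently.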
1000 taxon models, ordered by difficulty (Liu et al., 2009)
Slide31
Multiple Sequence Alignment (MSA): another grand challenge [1]
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA
Novel techniques needed for scalability and accuracy
NP-hard problems and large datasets
Current methods do not provide good accuracy
Few methods can analyze even moderately large datasets
Many important applications besides phylogenetic estimation
[1] Frontiers in Massive Data Analysis, National Academies Press, 2013
Slide32
Major Challenges
• Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)
• Multiple sequence alignment: key step for many biological questions (protein structure and function, phylogenetic estimation), but few methods can run on large datasets. Alignment accuracy is generally poor for large datasets with high rates of evolution.
Slide33
Phylogenomics
(Phylogenetic estimation from whole genomes)
Slide34
Orangutan
Gorilla
Chimpanzee
Human
From the Tree of the Life Website, University of Arizona
Species Tree Estimation requires multiple genes!
Slide35
Two basic approaches for species tree estimation
• Concatenate (“combine”) sequence alignments for different genes, and run phylogeny estimation methods
• Compute trees on individual genes and combine gene trees
Slide36
Using multiple genes
gene 1
S1  TCTAATGGAA
S2  GCTAAGGGAA
S3  TCTAAGGGAA
S4  TCTAACGGAA
S7  TCTAATGGAC
S8  TATAACGGAA
gene 2
S4  GGTAACCCTC
S5  GCTAAACCTC
S6  GGTGACCATC
S7  GCTAAACCTC
gene 3
S1  TATTGATACA
S3  TCTTGATACC
S4  TAGTGATGCA
S7  CATTCATACC
S8  TAGTGATGCA
Slide37
Concatenation
      gene 1      gene 2      gene 3
S1    TCTAATGGAA  ??????????  TATTGATACA
S2    GCTAAGGGAA  ??????????  ??????????
S3    TCTAAGGGAA  ??????????  TCTTGATACC
S4    TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
S5    ??????????  GCTAAACCTC  ??????????
S6    ??????????  GGTGACCATC  ??????????
S7    TCTAATGGAC  GCTAAACCTC  CATTCATACC
S8    TATAACGGAA  ??????????  TAGTGATGCA
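Concatenation places the gene alignments side by side and pads every taxon that is absent from a gene with missing-data characters. A minimal sketch, assuming each gene alignment is a dict from taxon name to aligned sequence; the function name is illustrative, and the pairing of taxon labels with sequences follows the order in which they are listed on the slides, which is itself an assumption.

```python
def concatenate(gene_alignments, taxa, missing="?"):
    """Build a supermatrix with one row per taxon and genes side by side.

    gene_alignments: list of dicts taxon -> aligned sequence
    (all sequences within one gene have the same length).
    """
    rows = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        width = len(next(iter(aln.values())))
        for taxon in taxa:
            # taxa absent from this gene get a block of missing-data symbols
            rows[taxon].append(aln.get(taxon, missing * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}

gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
         "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"}
gene2 = {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC",
         "S6": "GGTGACCATC", "S7": "GCTAAACCTC"}
gene3 = {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
         "S7": "CATTCATACC", "S8": "TAGTGATGCA"}

taxa = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]
supermatrix = concatenate([gene1, gene2, gene3], taxa)
for taxon in taxa:
    print(taxon, supermatrix[taxon])
```

The resulting matrix is then analyzed as if it were a single alignment, which is what distinguishes concatenation from methods that estimate one tree per gene.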
Slide38
Red gene tree ≠ species tree (green gene tree okay)
Slide39
Gene Tree Incongruence
Gene trees can differ from the species tree due to:
Duplication and loss
Horizontal gene transfer
Incomplete lineage sorting (ILS)
Slide40
Incomplete Lineage Sorting (ILS)
Confounds phylogenetic analysis for many groups:
Hominids
Birds
Yeast
Animals
Toads
Fish
Fungi
There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS.
Slide41
Lineage Sorting
Population-level process, modeled by the “Multi-species coalescent”, which builds on the coalescent (Kingman, 1982)
Gene trees can differ from species trees due to short times between speciation events or large population size; this is called “Incomplete Lineage Sorting” or “Deep Coalescence”.
Slide42
The Coalescent
[Figure: the coalescent, with a time axis running from Past to Present]
Courtesy James Degnan
Slide43
Gene tree in a species tree
Courtesy James Degnan
Slide44
Key observation:
Under the multi-species coalescent model, the species tree defines a probability distribution on the gene trees, and is identifiable from the distribution on gene trees.
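The smallest case makes this concrete. For a rooted three-species tree ((A,B),C) whose internal branch has length t in coalescent units, the standard result is that a gene tree matches the species tree with probability 1 - (2/3)e^(-t), and each of the two discordant topologies has probability (1/3)e^(-t). The sketch below (function names are illustrative) computes this distribution and inverts it, which shows why the matching topology and t can be read off from gene tree frequencies.

```python
import math

def gene_tree_probs(t):
    """Gene tree topology probabilities for species tree ((A,B),C).

    t is the internal branch length in coalescent units (t >= 0).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    p_discordant = math.exp(-t) / 3.0   # each of the two non-matching topologies
    return {
        "((A,B),C)": 1.0 - 2.0 * p_discordant,
        "((A,C),B)": p_discordant,
        "((B,C),A)": p_discordant,
    }

def branch_length_from_frequency(f_match):
    """Invert the formula: recover t from the frequency of the matching topology."""
    if not 1.0 / 3.0 <= f_match < 1.0:
        raise ValueError("matching frequency must lie in [1/3, 1)")
    return -math.log(1.5 * (1.0 - f_match))

for t in (0.1, 1.0, 3.0):
    print(t, gene_tree_probs(t))
```

The matching topology is always the most probable one for three species, and short branches (small t) push all three topologies towards probability 1/3, which is the ILS regime.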
Courtesy James Degnan
Slide45
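Two competing approaches
[Figure: alignments for gene 1, gene 2, . . ., gene k across the species. Either combine them into one alignment (Concatenation), or analyze each gene separately and combine the gene trees with a Summary Method.]
Slide46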
Orangutan
Gorilla
Chimpanzee
Human
From the Tree of the Life Website, University of Arizona
Species tree estimation: difficult, even for small datasets!
Slide47
Major Challenges: large datasets, fragmentary sequences
• Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.
• Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).
• Species Tree Estimation: gene tree incongruence makes accurate estimation of the species tree challenging.
• Phylogenetic Network Estimation: Horizontal gene transfer and hybridization require non-tree models of evolution.
Both phylogenetic estimation and multiple sequence alignment are also impacted by fragmentary data.
Slide48
Slide49
Avian Phylogenomics Project
Erich Jarvis, HHMI
G Zhang, BGI
MTP Gilbert, Copenhagen
S. Mirarab, UT-Austin
Md. S. Bayzid, UT-Austin
T. Warnow, UT-Austin
Plus many many other people…
Approx. 50 species, whole genomes
14,000 loci
Challenges:
Species tree estimation under the multi-species coalescent model, from 14,000 poorly estimated gene trees, all with different topologies (we used “statistical binning”)
Maximum likelihood estimation on a million-site genome-scale alignment – 250 CPU years
Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)
Slide50
1kp: Thousand Transcriptome Project
G. Ka-Shu Wong, U Alberta
N. Wickett, Northwestern
J. Leebens-Mack, U Georgia
N. Matasci, iPlant
T. Warnow, UIUC
S. Mirarab, UT-Austin
N. Nguyen, UT-Austin
Plus many many other people…
Plant Tree of Life based on transcriptomes of ~1200 species
More than 13,000 gene families (most not single copy)
First paper: PNAS 2014 (~100 species and ~800 loci)
Gene Tree Incongruence
Upcoming Challenges (~1200 species, ~400 loci):
Species tree estimation under the multi-species coalescent from hundreds of conflicting gene trees on >1000 species (we will use ASTRAL – Mirarab et al. 2014, Mirarab & Warnow 2015)
Multiple sequence alignment of >100,000 sequences (with lots of fragments!) – we will use UPP (Nguyen et al., 2015)
Slide51
Metagenomics:
Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes
Slide52
Metagenomic data analysis
NGS data produce fragmentary sequence data
Metagenomic analyses include unknown species
Taxon identification: given short sequences, identify the species for each fragment
Applications: Human Microbiome
Issues: accuracy and speed
Slide53
Metagenomic taxon identification
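To make the task concrete, here is a deliberately naive baseline, not a method from this course: assign each read to the reference sequence that contains the largest fraction of the read's k-mers, and report no taxon when nothing is shared (a crude stand-in for reads from unknown species). The reference sequences, taxon names, and function names are all hypothetical toy values.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_read(read, references, k=4):
    """Return (best_taxon, score): the reference sharing most of the read's k-mers.

    score is the fraction of the read's k-mers found in that reference;
    (None, 0.0) means no reference shares any k-mer with the read.
    """
    read_kmers = kmers(read, k)
    best, best_score = None, 0.0
    for taxon, ref in references.items():
        if not read_kmers:
            break   # read shorter than k: nothing to compare
        score = len(read_kmers & kmers(ref, k)) / len(read_kmers)
        if score > best_score:
            best, best_score = taxon, score
    return best, best_score

# Toy reference "genomes" (made up for illustration).
references = {
    "taxon_A": "ACGTACGTTGCATGCATTAGC",
    "taxon_B": "GGGCCCTTTAAAGGGCCCTTT",
}
print(classify_read("TGCATGCAT", references))   # read drawn from taxon_A
print(classify_read("AAAAAAAA", references))    # read matching neither reference
```

Both concerns listed for metagenomic analysis, accuracy and speed, show up even here: short reads share k-mers with many references by chance, and comparing every read against every reference does not scale.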
Objective: classify short reads in a metagenomic sample
Slide54
Possible Indo-European tree (Ringe, Warnow and Taylor 2000)
[Figure: tree with leaves Anatolian, Albanian, Tocharian, Greek, Germanic, Armenian, Italic, Celtic, Baltic, Slavic, Vedic, Iranian]
Slide55
“Perfect Phylogenetic Network” for IE
Nakhleh et al., Language 2005
[Figure: network with leaves Anatolian, Albanian, Tocharian, Greek, Germanic, Armenian, Baltic, Slavic, Vedic, Iranian, Italic, Celtic]
Slide56
Course Material
Phylogenetics: 5 weeks
Multiple Sequence Alignment: 2 weeks
Genome-scale phylogeny: 1 week
Advanced topics: 2.5 weeks
– Historical linguistics
– Metagenomics
– Genome Assembly
– Protein structure and function prediction
Scientific literature (student presentations): 2.5 weeks
Slide57
Grading
Homework: 25% (one hw dropped)
Midterm: 40% (March 29)
Final Project: 25% (due May 6)
Class Presentation: 5%
Course Participation: 5%
No final exam.
Slide58
Final Project and Class Presentation
Either research project (can be with another student) or survey paper (done by yourself).
Many interesting and publishable problems to address: see http://tandy.cs.illinois.edu/topics.html
Your class presentation should be related to your final project.
Slide59
Examples of published course projects
Md S. Bayzid, T. Hunt, and T. Warnow. "Disk Covering Methods Improve Phylogenomic Analyses". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6): S7.
T. Zimmermann, S. Mirarab and T. Warnow. "BBCA: Improving the scalability of *BEAST using random binning". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6): S11.
J. Chou, A. Gupta, S. Yaduvanshi, R. Davidson, M. Nute, S. Mirarab and T. Warnow. "A comparative study of SVDquartets and other coalescent-based species tree estimation methods". RECOMB-Comparative Genomics and BMC Genomics, 2015, 16(Suppl 10): S2.
Slide60
Research opportunities
Evaluating methods on simulated and real (biological or linguistic) datasets
Designing a new method, and establishing its performance (using theory and data)
Analyzing a biological dataset using several different methods, to address biology
Slide61
Algorithmic Strategies
• Divide-and-conquer
• “Bin-and-conquer”
• Iteration
• Bayesian statistics
• Hidden Markov Models
• Graph theory