
CS/BIOE 598: Algorithmic Computational Genomics - PowerPoint Presentation

Uploaded On 2020-07-01



Presentation Transcript

Slide1

CS/BIOE 598: Algorithmic Computational Genomics

Tandy Warnow

Departments of Bioengineering and Computer Science

http://tandy.cs.illinois.edu

Slide2

Course Details

Office hours:

To be determined

Course webpage:

http://tandy.cs.illinois.edu/598-2016.html

Slide3

Today

Describe some important problems in computational biology and computational historical linguistics, for which students in this course could develop improved methods.

Explain how the course will be run.

Answer questions.

Slide4

Basics

Prerequisites: Computer Science (algorithm design, analysis, and programming), mathematical maturity (ability to understand proofs).

No background in biology or linguistics is needed!

Note: if you are not a CS, ECE, or Math major, you may still be able to do the course; but please see me.

Slide5

Orangutan

Gorilla

Chimpanzee

Human

From the Tree of the Life Website, University of Arizona

Species Tree

Slide6

Evolution informs about everything in biology

•  Big genome sequencing projects just produce data -- so what?

•  Evolutionary history relates all organisms and genes, and helps us understand and predict

–  interactions between genes (genetic networks)

–  drug design

–  predicting functions of genes

–  influenza vaccine development

–  origins and spread of disease

–  origins and migrations of humans

Slide7

Phylogenomic pipeline

Select taxon set and markers

Gather and screen sequence data, possibly identify orthologs

Compute multiple sequence alignments for each locus

Compute species tree or network:

Compute gene trees on the alignments and combine the estimated gene trees, OR

Estimate a tree from a concatenation of the multiple sequence alignments

Get statistical support on each branch (e.g., bootstrapping)

Estimate dates on the nodes of the phylogeny

Use species tree with branch support and dates to understand biology

Slide9

Constructing the Tree of Life:

Hard Computational Problems

NP-hard problems

Large datasets

100,000+ sequences

thousands of genes

“Big data” complexity:

model misspecification

fragmentary sequences

errors in input data

streaming data

Slide10

Avian Phylogenomics Project

Erich Jarvis, HHMI

G Zhang, BGI

MTP Gilbert, Copenhagen

S. Mirarab, UT-Austin

Md. S. Bayzid, UT-Austin

T. Warnow, UT-Austin

Plus many many other people…

Approx. 50 species, whole genomes

14,000 loci

Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)

Slide11

1kp: Thousand Transcriptome Project

Plant Tree of Life based on transcriptomes of ~1200 species

More than 13,000 gene families (most not single copy)

First paper: PNAS 2014 (~100 species and ~800 loci)

Gene Tree Incongruence

G. Ka-Shu Wong, U Alberta

N. Wickett, Northwestern

J. Leebens-Mack, U Georgia

N. Matasci, iPlant

T. Warnow, UIUC

S. Mirarab, UT-Austin

N. Nguyen, UT-Austin

Plus many many other people…

Upcoming Challenges (~1200 species, ~400 loci)

Slide12

DNA Sequence Evolution

[Figure: a sequence evolving down a tree, one row of sequences per time point]

-3 mil yrs: AAGACTT

-2 mil yrs: AAGGCCT, TGGACTT

-1 mil yrs: AGGGCAT, TAGCCCT, AGCACTT

today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT

Slide13

Phylogeny Problem

TAGCCCA

TAGACTT

TGCACAA

TGCGCTT

AGGGCAT

[Figure: the sequences are labeled U, V, W, X, Y; the goal is a tree with leaves U, V, W, X, Y]
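As a rough illustration of how the input to this problem can be compared, the sketch below computes pairwise Hamming distances (the number of differing positions) between the five sequences on the slide. Hamming distance is an assumption chosen here for simplicity; the slide itself does not prescribe a distance, and the function names are hypothetical.

```python
from itertools import combinations

# The five equal-length sequences listed on the slide.
SEQUENCES = ["TAGCCCA", "TAGACTT", "TGCACAA", "TGCGCTT", "AGGGCAT"]


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_table(seqs):
    """Return a dict mapping each unordered pair of sequences to its distance."""
    return {(a, b): hamming(a, b) for a, b in combinations(seqs, 2)}


if __name__ == "__main__":
    for (a, b), d in distance_table(SEQUENCES).items():
        print(a, b, d)
```

A table like this is the kind of input that distance-based tree estimation methods start from, although in practice such methods usually apply a model-based correction to the raw counts.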

Slide14

Performance criteria

Running time

Space

Statistical performance issues (e.g., statistical consistency) with respect to a Markov model of evolution

Topological accuracy with respect to the underlying true tree or true alignment, typically studied in simulation

Accuracy with respect to a particular criterion (e.g. maximum likelihood score), on real data

Slide15

Quantifying Error

FN: false negative (missing edge)

FP: false positive (incorrect edge)

50% error rate

[Figure: a true tree and an estimated tree, with one FN edge and one FP edge marked]
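The FN and FP rates can be sketched in code if each tree is represented by the set of bipartitions (splits) that its internal edges induce on the leaves. This representation, the helper names, and the five-taxon example are assumptions made for illustration; they are not taken from the slides.

```python
def canonical(side, taxa):
    """Represent a bipartition by the side that excludes the smallest taxon name."""
    side = frozenset(side)
    taxa = frozenset(taxa)
    anchor = min(taxa)
    return taxa - side if anchor in side else side


def error_rates(true_splits, est_splits):
    """Return (FN rate, FP rate) for two sets of canonical bipartitions.

    FN: edges of the true tree missing from the estimated tree.
    FP: edges of the estimated tree that are not in the true tree.
    """
    fn = len(true_splits - est_splits)
    fp = len(est_splits - true_splits)
    fn_rate = fn / len(true_splits) if true_splits else 0.0
    fp_rate = fp / len(est_splits) if est_splits else 0.0
    return fn_rate, fp_rate


if __name__ == "__main__":
    taxa = {"U", "V", "W", "X", "Y"}
    # Hypothetical true tree with internal edges UV|WXY and XY|UVW.
    true_tree = {canonical({"U", "V"}, taxa), canonical({"X", "Y"}, taxa)}
    # Hypothetical estimated tree with internal edges UV|WXY and WY|UVX.
    est_tree = {canonical({"U", "V"}, taxa), canonical({"W", "Y"}, taxa)}
    print(error_rates(true_tree, est_tree))
```

In this made-up example one of the two true internal edges is missing and one of the two estimated internal edges is incorrect, so both rates are 50%.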

Slide16

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Slide17

Phylogenetic reconstruction methods

1. Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood)

2. Polynomial time distance-based methods: Neighbor Joining, FastME, etc.

3. Bayesian methods

[Figure: cost plotted over the space of phylogenetic trees, showing a local optimum and the global optimum]

Slide18

Solving maximum likelihood (and other hard optimization problems) is… unlikely

# of Taxa    # of Unrooted Trees
4            3
5            15
6            105
7            945
8            10395
9            135135
10           2027025
20           2.2 x 10^20
100          4.5 x 10^190
1000         2.7 x 10^2900
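The small entries in this table agree with the standard count of unrooted binary trees on n labeled leaves, (2n-5)!! = 3 x 5 x ... x (2n-5). The formula is not stated on the slide, so the sketch below should be read as an illustration under that assumption; the function name is hypothetical.

```python
def num_unrooted_binary_trees(n: int) -> int:
    """Number of unrooted binary trees on n labeled leaves, (2n-5)!! for n >= 3."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (an empty product when n == 3).
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 6, 7, 8, 9, 10, 20):
        print(n, num_unrooted_binary_trees(n))
```

Python integers have arbitrary precision, so the exact count can be computed for any n; the point of the slide is that the count grows far too fast for exhaustive search.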

Slide19

Quantifying Error

FN: false negative (missing edge)

FP: false positive (incorrect edge)

50% error rate

Slide20

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]

Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining!

[Figure: error rate of NJ (axis from 0 to 0.8) plotted against No. Taxa (axis from 0 to 1600)]

Slide21

Major Challenges

Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)

Slide22

Phylogeny Problem

TAGCCCA

TAGACTT

TGCACAA

TGCGCTT

AGGGCAT

[Figure: the sequences are labeled U, V, W, X, Y; the goal is a tree with leaves U, V, W, X, Y]

Slide23

The Real Problem!

AGAT

TAGACTT

TGCACAA

TGCGCTT

AGGGCATGA

[Figure: the sequences, now of different lengths, are labeled U, V, W, X, Y; the goal is a tree with leaves X, U, Y, V, W]

Slide24

…ACGGTGCAGTTACC-A…

…AC----CAGTCACCTA…

The true multiple alignment

–  Reflects historical substitution, insertion, and deletion events

–  Defined using transitive closure of pairwise alignments computed on edges of the true tree

[Figure: …ACGGTGCAGTTACCA… evolves into …ACCAGTCACCTA… through events labeled Insertion, Substitution, and Deletion]

Slide25

Input: unaligned sequences

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Slide26

Phase 1: Alignment

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Slide27

Phase 2: Construct tree

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

[Figure: tree with leaves S1, S4, S2, S3]

Slide28

Two-phase estimation

Alignment methods

•  Clustal

•  POY (and POY*)

•  Probcons (and Probtree)

•  Probalign

•  MAFFT

•  Muscle

•  Di-align

•  T-Coffee

•  Prank (PNAS 2005, Science 2008)

•  Opal (ISMB and Bioinf. 2007)

•  FSA (PLoS Comp. Bio. 2009)

•  Infernal (Bioinf. 2009)

•  Etc.

Phylogeny methods

•  Bayesian MCMC

•  Maximum parsimony

•  Maximum likelihood

•  Neighbor joining

•  FastME

•  UPGMA

•  Quartet puzzling

•  Etc.

Slide29

Simulation Studies

True tree and alignment

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

[Figure: true tree with leaves S1, S2, S4, S3]

Unaligned Sequences

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Estimated tree and alignment

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA

[Figure: estimated tree with leaves S1, S4, S2, S3]

Compare

Slide30

1000 taxon models, ordered by difficulty (Liu et al., 2009)

Slide31

Multiple Sequence Alignment (MSA): another grand challenge [1]

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
Sn = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
Sn = TCACGACCGACA

Novel techniques needed for scalability and accuracy

NP-hard problems and large datasets

Current methods do not provide good accuracy

Few methods can analyze even moderately large datasets

Many important applications besides phylogenetic estimation

[1] Frontiers in Massive Data Analysis, National Academies Press, 2013

Slide32

Major Challenges

•  Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)

•  Multiple sequence alignment: key step for many biological questions (protein structure and function, phylogenetic estimation), but few methods can run on large datasets. Alignment accuracy is generally poor for large datasets with high rates of evolution.

Slide33

Phylogenomics

(Phylogenetic estimation from whole genomes)

Slide34

Orangutan

Gorilla

Chimpanzee

Human

From the Tree of the Life Website, University of Arizona

Species Tree Estimation requires multiple genes!

Slide35

Two basic approaches for species tree estimation

•  Concatenate (“combine”) sequence alignments for different genes, and run phylogeny estimation methods

•  Compute trees on individual genes and combine gene trees

Slide36

Using multiple genes

gene 1

S1  TCTAATGGAA
S2  GCTAAGGGAA
S3  TCTAAGGGAA
S4  TCTAACGGAA
S7  TCTAATGGAC
S8  TATAACGGAA

gene 2

S4  GGTAACCCTC
S5  GCTAAACCTC
S6  GGTGACCATC
S7  GCTAAACCTC

gene 3

S1  TATTGATACA
S3  TCTTGATACC
S4  TAGTGATGCA
S7  CATTCATACC
S8  TAGTGATGCA

Slide37

Concatenation

    gene 1      gene 2      gene 3
S1  TCTAATGGAA  ??????????  TATTGATACA
S2  GCTAAGGGAA  ??????????  ??????????
S3  TCTAAGGGAA  ??????????  TCTTGATACC
S4  TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
S5  ??????????  GCTAAACCTC  ??????????
S6  ??????????  GGTGACCATC  ??????????
S7  TCTAATGGAC  GCTAAACCTC  CATTCATACC
S8  TATAACGGAA  ??????????  TAGTGATGCA
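The concatenation step can be sketched as follows: each gene alignment is a mapping from species to sequence, and a species missing from a gene is padded with "?" characters. The pairing of species labels with sequences below assumes the order in which they appear in the transcript of the "Using multiple genes" slide, and the function name is hypothetical.

```python
def concatenate(gene_alignments, missing="?"):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping species name -> aligned sequence.
    Species absent from a gene are filled with the missing-data character.
    """
    taxa = sorted({name for aln in gene_alignments for name in aln})
    matrix = {name: "" for name in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must have equal length")
        width = lengths.pop()
        for name in taxa:
            matrix[name] += aln.get(name, missing * width)
    return matrix


# Assumed label-to-sequence pairing, read in transcript order.
GENE1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
         "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"}
GENE2 = {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC",
         "S6": "GGTGACCATC", "S7": "GCTAAACCTC"}
GENE3 = {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
         "S7": "CATTCATACC", "S8": "TAGTGATGCA"}

if __name__ == "__main__":
    for name, row in concatenate([GENE1, GENE2, GENE3]).items():
        print(name, row)
```

The resulting matrix has one row per species and can be passed to any phylogeny estimation method that tolerates missing data.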

Slide38

Red gene tree ≠ species tree (green gene tree okay)

Slide39

Gene Tree Incongruence

Gene trees can differ from the species tree due to:

Duplication and loss

Horizontal gene transfer

Incomplete lineage sorting (ILS)

Slide40

Incomplete Lineage Sorting (ILS)

Confounds phylogenetic analysis for many groups:

Hominids

Birds

Yeast

Animals

Toads

Fish

Fungi

There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS.

Slide41

Lineage Sorting

Population-level process, also called the “Multi-species coalescent” (Kingman, 1982)

Gene trees can differ from species trees due to short times between speciation events or large population size; this is called “Incomplete Lineage Sorting” or “Deep Coalescence”.

Slide42

The Coalescent

Present

Past

Courtesy James Degnan

Slide43

Gene tree in a species tree

Courtesy James Degnan

Slide44

Key observation:

Under the multi-species coalescent model, the species tree defines a probability distribution on the gene trees, and is identifiable from the distribution on gene trees

Courtesy James Degnan
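For the smallest interesting case this distribution has a simple closed form. The sketch below uses the standard result for a rooted three-species tree ((A,B),C) whose internal branch has length t in coalescent units: the matching gene tree has probability 1 - (2/3)e^(-t), and each of the two other topologies has probability (1/3)e^(-t). This formula is not on the slides and is included only as an illustration; the function name and species labels are hypothetical.

```python
import math


def gene_tree_probabilities(t: float) -> dict:
    """Gene tree topology probabilities for the species tree ((A,B),C).

    t is the internal branch length in coalescent units (t >= 0).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


if __name__ == "__main__":
    for t in (0.1, 1.0, 5.0):
        print(t, gene_tree_probabilities(t))
```

Whenever t is positive, the most probable gene tree matches the species tree in this three-species case, and shorter internal branches push the three probabilities toward one third each.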

Slide45

Two competing approaches

[Figure: the alignments for gene 1, gene 2, . . . , gene k are either combined by Concatenation, or analyzed separately and combined by a Summary Method, to estimate the species tree]

Slide46

Orangutan

Gorilla

Chimpanzee

Human

From the Tree of the Life Website,

University of Arizona

Species tree estimation: difficult, even for small datasets!

Slide47

Major Challenges: large datasets, fragmentary sequences

•  Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.

•  Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).

•  Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging.

•  Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution

Both phylogenetic estimation and multiple sequence alignment are also impacted by fragmentary data.

Slide49

Avian Phylogenomics Project

Erich Jarvis, HHMI

G Zhang, BGI

MTP Gilbert, Copenhagen

S. Mirarab, UT-Austin

Md. S. Bayzid, UT-Austin

T. Warnow, UT-Austin

Plus many many other people…

Approx. 50 species, whole genomes

14,000 loci

Challenges:

•  Species tree estimation under the multi-species coalescent model, from 14,000 poorly estimated gene trees, all with different topologies (we used “statistical binning”)

•  Maximum likelihood estimation on a million-site genome-scale alignment – 250 CPU years

Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)

Slide50

1kp: Thousand Transcriptome Project

Plant Tree of Life based on transcriptomes of ~1200 species

More than 13,000 gene families (most not single copy)

First paper: PNAS 2014 (~100 species and ~800 loci)

Gene Tree Incongruence

G. Ka-Shu Wong, U Alberta

N. Wickett, Northwestern

J. Leebens-Mack, U Georgia

N. Matasci, iPlant

T. Warnow, UIUC

S. Mirarab, UT-Austin

N. Nguyen, UT-Austin

Plus many many other people…

Upcoming Challenges (~1200 species, ~400 loci):

•  Species tree estimation under the multi-species coalescent from hundreds of conflicting gene trees on >1000 species (we will use ASTRAL – Mirarab et al. 2014, Mirarab & Warnow 2015)

•  Multiple sequence alignment of >100,000 sequences (with lots of fragments!) – we will use UPP (Nguyen et al., 2015)

Slide51

Metagenomics:

Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

Slide52

Metagenomic data analysis

NGS data produce fragmentary sequence data

Metagenomic analyses include unknown species

Taxon identification: given short sequences, identify the species for each fragment

Applications: Human Microbiome

Issues: accuracy and speed

Slide53

Metagenomic taxon identification

Objective: classify short reads in a metagenomic sample

Slide54

Possible Indo-European tree (Ringe, Warnow and Taylor 2000)

Anatolian

Albanian

Tocharian

Greek

Germanic

Armenian

Italic

Celtic

Baltic

Slavic

Vedic

Iranian

Slide55

“Perfect Phylogenetic Network” for IE

Nakhleh et al., Language 2005

Anatolian

Albanian

Tocharian

Greek

Germanic

Armenian

Baltic

Slavic

Vedic

Iranian

Italic

Celtic

Slide56

Course Material

Phylogenetics: 5 weeks

Multiple Sequence Alignment: 2 weeks

Genome-scale phylogeny: 1 week

Advanced topics: 2.5 weeks

Historical linguistics

Metagenomics

Genome Assembly

Protein structure and function prediction

Scientific literature (student presentations): 2.5 weeks

Slide57

Grading

Homework: 25% (one hw dropped)

Midterm: 40% (March 29)

Final Project: 25% (due May 6)

Class Presentation: 5%

Course Participation: 5%

No final exam.

Slide58

Final Project and Class Presentation

Either research project (can be with another student) or survey paper (done by yourself).

Many interesting and publishable problems to address: see http://tandy.cs.illinois.edu/topics.html

Your class presentation should be related to your final project.

Slide59

Examples of published course projects

Md S. Bayzid, T. Hunt, and T. Warnow. "Disk Covering Methods Improve Phylogenomic Analyses". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6): S7.

T. Zimmermann, S. Mirarab and T. Warnow. "BBCA: Improving the scalability of *BEAST using random binning". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6): S11.

J. Chou, A. Gupta, S. Yaduvanshi, R. Davidson, M. Nute, S. Mirarab and T. Warnow. "A comparative study of SVDquartets and other coalescent-based species tree estimation methods". RECOMB-Comparative Genomics and BMC Genomics, 2015, 16(Suppl 10): S2.

Slide60

Research opportunities

Evaluating methods on simulated and real (biological or linguistic) datasets

Designing a new method, and establishing its performance (using theory and data)

Analyzing a biological dataset using several different methods, to address biology

Slide61

Algorithmic Strategies

•  Divide-and-conquer

•  “Bin-and-conquer”

•  Iteration

•  Bayesian statistics

•  Hidden Markov Models

•  Graph theory