Profile HMMs - PowerPoint Presentation

Uploaded On 2017-06-07



Presentation Transcript

Slide1

Profile HMMs

Tandy Warnow

BioE/CS 598AGB

Slide2

Profile Hidden Markov Models

Basic tool in sequence analysis

Look more complicated than they really are

Used to model a family of sequences

Can be built from a multiple sequence alignment

Algorithms using profile HMMs are based on dynamic programming (much like Needleman-Wunsch)

Slide3

Profile

Given a gap-less multiple sequence alignment, we can build a profile describing what we see

S1 = A C T A G

S2 = A C A A G

S3 = A T T T G

S4 = G T T C G

      1     2     3     4     5
A   0.75  0.00  0.25  0.50  0.00
C   0.00  0.50  0.00  0.25  0.00
T   0.00  0.50  0.75  0.25  0.00
G   0.25  0.00  0.00  0.00  1.00

Slide4

Using a profile

      1     2     3     4     5
A   0.75  0.00  0.25  0.50  0.00
C   0.00  0.50  0.00  0.25  0.00
T   0.00  0.50  0.75  0.25  0.00
G   0.25  0.00  0.00  0.00  1.00
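The table can be reproduced, and then used to score a sequence, with a few lines of Python. This is a minimal sketch that assumes the four aligned sequences S1 to S4 from the previous slide; the function names `build_profile` and `sequence_probability` are illustrative and do not come from any library.

```python
from collections import Counter

ALPHABET = "ACTG"

def build_profile(alignment):
    """Return one {letter: frequency} dict per column of a gapless alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("a gapless alignment needs sequences of equal length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)  # missing letters count as 0
        profile.append({a: counts[a] / len(column) for a in ALPHABET})
    return profile

def sequence_probability(profile, seq):
    """Probability that a match-state-only profile generates seq."""
    if len(seq) != len(profile):
        return 0.0  # this model only emits sequences of one length
    p = 1.0
    for column, letter in zip(profile, seq):
        p *= column.get(letter, 0.0)
    return p

# S1..S4 from the slide.
alignment = ["ACTAG", "ACAAG", "ATTTG", "GTTCG"]
profile = build_profile(alignment)
print(profile[0])                               # column 1 of the table
print(sequence_probability(profile, "ACTAG"))   # 0.75 * 0.5 * 0.75 * 0.5 * 1.0
```

Because every column sums to 1 and every transition has probability 1.0, the probabilities of all length-5 sequences sum to 1.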

[Figure: the profile drawn as a chain from Start through match states 1 to 5 to End. Each match state emits A, C, T, G with the probabilities in its column of the table, and every transition has probability 1.0.]

The profile yields a probability distribution over sequences; here, all of the same length.

Slide5

Adding in insertions

The profile shown in the previous slide only had *match* states (indicated by rectangles). It doesn’t allow any insertions or deletions.

To model indels, we just have to add additional states to the graphical model.

Insertion states: Diamonds (have non-zero emission probabilities)

Deletion states: Circles (nothing emitted)

Slide6

[Figure] From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html

Slide7

[Figure] From http://codecereal.blogspot.com/2011/07/protein-profile-with-hmm.html

Slide8

Building Profile HMMs

Profile HMMs can be built from a given multiple sequence alignment – this is not too difficult.

Profile HMMs can also be built from unaligned sequences. This is a bit complicated.

Slide9

Using Profile HMMs

Given a Profile HMM computed for a multiple sequence alignment, you can use it to

Recognize related sequences

Add related sequences into the multiple sequence alignment

See PFAM, http://pfam.xfam.org, for how profile HMMs are used to represent groups of functionally and structurally related proteins.

Slide10

Using profile HMMs to align sequences

Given an MSA for a set S of representative sequences for a gene and a set X of additional sequences (query sequences):

You build a profile HMM for the MSA on S.

For each sequence s in X, you find the path through the profile HMM that is most likely to generate s.

The path specifies how to add the sequence into the MSA (only the match states count).

Transitivity gives you the final MSA after you add in all the other sequences.

Slide11

[Figure] From http://codecereal.blogspot.com/2011/07/protein-profile-with-hmm.html

Slide12

Other uses of profile HMMs

Given two profile HMMs (H1 and H2), and a sequence s, you can determine which one is more likely to generate s (again, using dynamic programming).
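Comparing H1 and H2 on s comes down to computing, for each model, the total probability of s over all state paths. The sketch below shows the forward recursion for a plain HMM and checks it against brute-force path enumeration. It is a simplification: it assumes an ordinary HMM with no silent deletion states and no explicit end state (a real profile HMM needs extra handling for both), a non-empty sequence, and two toy models whose numbers are made up for illustration.

```python
from itertools import product

def forward_probability(hmm, seq):
    """Total probability that hmm emits seq, summed over all state paths."""
    states = hmm["states"]
    # f[k] = probability of emitting the prefix read so far and ending in state k
    f = {k: hmm["start"][k] * hmm["emit"][k][seq[0]] for k in states}
    for symbol in seq[1:]:
        f = {
            k: hmm["emit"][k][symbol] * sum(f[j] * hmm["trans"][j][k] for j in states)
            for k in states
        }
    return sum(f.values())

def brute_force_probability(hmm, seq):
    """Same quantity by enumerating every path (exponential; for checking only)."""
    total = 0.0
    for path in product(hmm["states"], repeat=len(seq)):
        p = hmm["start"][path[0]] * hmm["emit"][path[0]][seq[0]]
        for i in range(1, len(seq)):
            p *= hmm["trans"][path[i - 1]][path[i]] * hmm["emit"][path[i]][seq[i]]
        total += p
    return total

def make_hmm(emit_x):
    """Hypothetical two-state HMM; only the emissions of state x vary."""
    return {
        "states": ("x", "y"),
        "start": {"x": 0.5, "y": 0.5},
        "trans": {"x": {"x": 0.9, "y": 0.1}, "y": {"x": 0.2, "y": 0.8}},
        "emit": {"x": emit_x, "y": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}},
    }

H1 = make_hmm({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4})  # AT-rich toy model
H2 = make_hmm({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1})  # GC-rich toy model
s = "ATTA"
p1, p2 = forward_probability(H1, s), forward_probability(H2, s)
print("H1" if p1 > p2 else "H2")  # prints H1
```

The forward version does work proportional to (sequence length) x (number of states)^2, while the brute-force version visits (number of states)^(sequence length) paths.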

Note that computing the probability that a profile HMM generates a sequence requires calculating the probability for *every* path and adding up the probabilities. This can be calculated in polynomial time, using dynamic programming.

Slide13

Applications of profile HMMs

Recognizing membership in protein families (see PFAM, http://pfam.xfam.org)

Progressive multiple sequence alignment methods, such as SATCHMO (Edgar and Sjolander, Bioinformatics 2003)

Phylogenetic placement (e.g., SEPP, Mirarab et al., PSB 2012)

Taxonomic identification of metagenomic data (e.g., TIPP, Nguyen et al., Bioinformatics 2014)

Slide14

Another application: UPP

UPP (Nguyen et al., RECOMB 2015) is a method for ultra-large multiple sequence alignment:

Up to 1,000,000 sequences

Robust to fragmentary sequences

Nam-phuong Nguyen (IGB postdoctoral fellow) will present this method on Tuesday.

Slide15

HMMER

http://hmmer.janelia.org

One of the most popular collections of tools for performing analyses based on profile HMMs.

HMMER web server: interactive sequence similarity searching, NAR 2011, http://nar.oxfordjournals.org/content/39/suppl_2/W29

Slide16

Supertree Estimation

Purposes:

Divide-and-conquer tree estimation

Combining analyses performed by other research groups

Tree of Life

Slide17

Supertree Estimation

Challenges:

Tree compatibility is NP-complete (therefore, even if subtrees are correct, supertree estimation is hard)

Estimated subtrees have error

Advantages:

Estimating individual gene trees can be computationally feasible (compared to the combined analysis of many genes)

Can use different types of data for each gene

Slide18

Many Supertree Methods

MRP: Matrix Representation with Parsimony (most commonly used and most accurate)

weighted MRP

MRF

MRD

Robinson-Foulds Supertrees

Min-Cut

Modified Min-Cut

Semi-strict Supertree

QMC

Q-imputation

SDM

PhySIC

Majority-Rule Supertrees

Maximum Likelihood Supertrees

and many more ...

Slide19

MRP and MRL

MRP (Matrix Representation with Parsimony)

MRL (Matrix Representation with Likelihood)

The input set of source trees is represented by a binary matrix with ?s for missing data

Each edge in each source tree gives you one column in the matrix (based on the bipartition it defines on its leafset)
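The encoding can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of any MRP tool: source trees are written as nested tuples with string leaves and treated as unrooted, only internal edges (non-trivial bipartitions) contribute columns, and which side of a bipartition is written as 1 is a convention chosen here. The names `leaf_set`, `bipartitions` and `mrp_matrix` are hypothetical.

```python
def leaf_set(tree):
    """Leaves of a tree written as nested tuples with string leaves."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def bipartitions(tree):
    """Return (taxa, splits): each non-trivial bipartition is stored once,
    as the side that does not contain a fixed reference leaf."""
    taxa = leaf_set(tree)
    reference = min(taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        side = taxa - clade if reference in clade else clade
        if 2 <= len(side) <= len(taxa) - 2:  # skip trivial splits
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return taxa, sorted(splits, key=sorted)

def mrp_matrix(source_trees):
    """One row per taxon; one column per internal edge of each source tree."""
    all_taxa = sorted(set().union(*(leaf_set(t) for t in source_trees)))
    rows = {taxon: "" for taxon in all_taxa}
    for tree in source_trees:
        taxa, splits = bipartitions(tree)
        for side in splits:
            for taxon in all_taxa:
                if taxon not in taxa:
                    rows[taxon] += "?"   # taxon missing from this source tree
                elif taxon in side:
                    rows[taxon] += "1"
                else:
                    rows[taxon] += "0"
    return rows

# Two small, made-up source trees with overlapping leaf sets.
t1 = (("a", "b"), ("c", "d"))
t2 = (("c", "d"), ("e", ("f", "g")))
for taxon, row in mrp_matrix([t1, t2]).items():
    print(taxon, row)
```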

Then you analyze the matrix using parsimony or CFN maximum likelihood

Slide20

Single gene vs. multi-gene analyses

Most methods analyze single genes (or other genomic regions). These produce estimated “gene trees”.

But species trees are estimated using multiple genes.

Slide21

Not all genes present in all species

gene 1
S1  TCTAATGGAA
S2  GCTAAGGGAA
S3  TCTAAGGGAA
S4  TCTAACGGAA
S7  TCTAATGGAC
S8  TATAACGGAA

gene 3
S1  TATTGATACA
S3  TCTTGATACC
S4  TAGTGATGCA
S7  CATTCATACC
S8  TAGTGATGCA

gene 2
S4  GGTAACCCTC
S5  GCTAAACCTC
S6  GGTGACCATC
S7  GCTAAACCTC

Slide22

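Two competing approaches

[Figure: a species-by-gene data set (gene 1, gene 2, ..., gene k). One approach analyzes each gene separately and combines the resulting trees with a supertree method; the other runs a combined analysis of all genes.]

Slide23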

FN rate of MRP vs. combined analysis

[Figure: plot against scaffold density (%).]

Slide24

SuperFine-boosting: improves accuracy of MRP

[Figure: plot against scaffold density (%).]

(Swenson et al., Syst. Biol. 2012)

Slide25

SuperFine

First, construct a supertree with low false positives

The Strict Consensus

Then, refine the tree to reduce false negatives by resolving each polytomy using a “base” supertree method (e.g., MRP)

Quartet Max Cut

Slide26

Obtaining a supertree with low FP

The Strict Consensus Merger (SCM)

SCM of two trees

Computes the strict consensus on the common leaf set

Then superimposes the two trees, contracting more edges in the presence of collisions
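The first step, the strict consensus on the common leaf set, can be sketched in Python. This covers only that step, not the superimposition or the handling of collisions. It assumes trees written as nested tuples with string leaves and treated as unrooted, and it represents the consensus by its set of non-trivial bipartitions; the function names are illustrative.

```python
def leaf_set(tree):
    """Leaves of a tree written as nested tuples with string leaves."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def restricted_splits(tree, keep):
    """Non-trivial bipartitions that tree induces on the leaves in keep.
    Each one is stored as the side without a fixed reference leaf."""
    keep = frozenset(keep)
    if len(keep) < 4:
        return set()  # fewer than 4 leaves: no non-trivial bipartition exists
    reference = min(keep)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node) & keep
        side = keep - clade if reference in clade else clade
        if 2 <= len(side) <= len(keep) - 2:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits

def strict_consensus_splits(tree1, tree2):
    """Common leaf set, plus the bipartitions on it found in both trees."""
    common = leaf_set(tree1) & leaf_set(tree2)
    shared = restricted_splits(tree1, common) & restricted_splits(tree2, common)
    return common, shared

# Made-up trees: t1 and t2 agree on {a, b, c, d}; t1 and t3 conflict there.
t1 = ((("a", "b"), "c"), ("d", ("e", "f")))
t2 = ((("a", "b"), "d"), ("c", ("g", "h")))
t3 = ((("a", "c"), "b"), ("d", "g"))
print(strict_consensus_splits(t1, t2))  # one shared split: ab | cd
print(strict_consensus_splits(t1, t3))  # no shared split: a star tree on a, b, c, d
```

An empty set of shared splits corresponds to a fully unresolved (star) consensus, which is one source of the polytomies that SuperFine later refines.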

Slide27

Strict Consensus Merger (SCM)

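[Figure: SCM example with two source trees, one on leaves a, b, c, d, e, f, g and one on leaves a, b, c, d, h, i, j, which share the leaves a, b, c, d; the merged tree is on leaves a through j.]

Slide28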

Theoretical results for SCM

SCM can be computed in polynomial time

For certain types of inputs, the SCM method solves the NP-hard “Tree Compatibility” problem

All splits in the SCM “appear” in at least one source tree (and are not contradicted by any source tree)

Slide29

Performance of SCM

Low false positive (FP) rate

(Estimated supertree has few false edges)

High false negative (FN) rate

(Estimated supertree is missing many true edges)

Slide30

Part II of SuperFine

Refine the tree to reduce false negatives by resolving each polytomy using a base supertree method (e.g., MRP)

Slide31

Part 1 of SuperFine

[Figure: the SCM example again, with two source trees on leaves a, b, c, d, e, f, g and a, b, c, d, h, i, j, and the merged tree on leaves a through j.]

Slide32

Part 2 of SuperFine

[Figure: the SCM tree on leaves a through j and the two source trees, with the leaves relabelled using the labels 1 to 6.]

Slide33

Theorem

Given a set of source trees, SCM tree T, and a polytomy in T, after relabelling and reducing, each source tree has at most one leaf with each label.

Slide34

Step 2: Apply MRP to the collection of reduced source trees

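[Figure: two reduced source trees, one on leaves 1, 2, 3, 4 and one on leaves 1, 4, 5, 6, are combined by MRP into a tree on leaves 1 to 6.]

Slide35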

Replace polytomy using tree from MRP

[Figure: the MRP tree on labels 1 to 6 replaces the polytomy in the SCM tree, giving a more resolved tree on leaves a through j.]

Slide36

Resolving a single polytomy, v, using MRP

Step 1: Reduce each source tree to a tree on leafset {1,2,...,d}, where d=degree(v)

Step 2: Apply MRP to the collection of reduced source trees, to produce a tree t on {1,2,...,d}

Step 3: Replace the star tree at v by tree t

Slide37

SuperFine-boosting: improves accuracy of MRP

[Figure: plot against scaffold density (%).]

(Swenson et al., Syst. Biol. 2012)

Slide38

SuperFine is also much faster

MRP: 8-12 sec.

SuperFine: 2-3 sec.

[Figure: three panels, each plotted against scaffold density (%).]

Slide39

Uses of Supertree Methods

DACTAL: divide-and-conquer trees almost without alignments (Nelesen et al., 2012)

DCM1-boosting

In these methods, the dataset is divided into subsets (using chordal graph theory), and then trees on the subsets are combined (using supertree methods).

Proofs about the algorithm guarantees are established using chordal graph theory.

Slide40

Chordal graph algorithms yield phylogeny estimation from polynomial length sequences

Theorem (Warnow et al., SODA 2001): DCM1-NJ is correct with high probability given sequences of length O(ln n · e^{O(ln n)})
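Why this bound counts as polynomial length can be seen by unpacking the exponent. The constant c below is an illustrative name for the constant hidden in the inner O(ln n); it is not taken from the theorem statement.

```latex
e^{O(\ln n)} \le e^{c \ln n} = n^{c} \quad \text{for some constant } c > 0,
\qquad\text{so}\qquad
O\!\left(\ln n \cdot e^{O(\ln n)}\right) = O\!\left(n^{c} \ln n\right).
```

The required sequence length therefore grows no faster than a polynomial in the number of taxa n.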

Simulation study from Nakhleh et al. ISMB 2001

[Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM1-NJ.]

Slide41

DACTAL: divide-and-conquer trees without alignment (submitted)

[Figure: DACTAL flowchart. Data: Unaligned Sequences; Overlapping subsets; A tree for each subset; A tree for the entire dataset. Steps: BLAST-based; Existing Method: RAxML(MAFFT); New supertree method: SuperFine; pRecDCM3.]

Slide42

DACTAL more accurate than all standard methods, and much faster than SATé

Average results on 3 large RNA datasets (6K to 28K)

CRW: Comparative RNA database, structural alignments

3 datasets with 6,323 to 27,643 sequences

Reference trees: 75% RAxML bootstrap trees

DACTAL (shown in red) run for 5 iterations starting from FT(Part)

SATé-1 fails on the largest dataset

SATé-2 runs but is not more accurate than DACTAL, and takes longer

Slide43

More divide-and-conquer

Recall that SATé and PASTA use divide-and-conquer (and also iteration) to improve alignment estimation.

Alignments on different subsets are merged together using techniques like OPAL and Muscle.

However, alignments can also be merged together using HMM-HMM alignment.

You should think of your own algorithmic designs for improving scalability and accuracy, whether for MSA or for tree estimation!