
Spectrum-based - PowerPoint Presentation

calandra-battersby
Uploaded On 2015-09-27




Presentation Transcript

Slide1

Spectrum-based de novo repeat detection in genomic sequences

Do Huy Hoang

Slide2

Outline

Introduction

What is a repeat?

Why study repeats?

Related work

SAGRI

Algorithm

Analysis

Evaluation

Slide3

Introduction

Slide4

What is a repeat? (Definition)

[General]: Nucleotide sequences occurring multiple times within a genome

[CompBio]: Given a genome sequence S, find a string P which occurs at least twice in S (allowing some errors).

Slide5

What is a repeat? (Function)

Motifs

Very short repeats (10-20 bp)

Transcription factor binding sites

Long and short interspersed elements (SINE, LINE)

Jumping genes

Genes and pseudogenes

Tandem repeats

Simple short sequence repeats: (A)n, (CGG)n

Slide6

Why study repeats? (1)

Eukaryotic genomes contain a lot of repeats.

E.g. the human genome contains 50% repeats.

Repeats are believed to play an important role in evolution and disease.

E.g. Alu elements are particularly prone to recombination. Insertions of Alu repeats inactivate genes in patients with hemophilia and neurofibromatosis (Kazazian, 1998; Deininger and Batzer, 1999).

Repeats are important to chromatin structure.

Most TEs in mammals seem to be silenced by methylation.

Alu sequences are a major target for histone H3-Lys9 methylation in humans (Kondo and Issa, 2003).

It is known that heterochromatin has a lot of SINE and LINE repeats.

Slide7

Why study repeats? (2)

Repeats complicate sequence assembly and genome comparison.

Many people remove repeats before they analyze the genome.

Repeats set hurdles for microarray probe signal analysis.

The probe signal may be inaccurate if the probe sequence overlaps with repeat regions.

Repeats may contribute to human diversity more than genes.

Repeats can be used as DNA fingerprints.

Slide8

Steps in Repeat finding

Repeat library (RepeatMasker)

De novo repeat discovery (two steps):

Identification of repeats

Classification of repeats

Slide9

SAGRI algorithm

Slide10

Algorithm outline

Input: a text G

FindHit phase: finds all candidates for the second occurrence of repeat regions

A[CGACGCGAT]TAACCCT[CGACGTGAT]CCTC (highlighted segments shown in brackets)

Validation phase: uses hits from phase 1 to find all pairs of repeats

A[CGACGCGAT]TAACCCT[CGACGTGAT]CCTC (highlighted segments shown in brackets)

Slide11

Spectrum-based repeat finder

What is a spectrum?

Given a string G, its spectrum is the set of all k-mers occurring in G.

E.g. k=3, G= ACGACGCTCACCCT

The spectrum is

ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA

CTC is a k-mer occurring at position 7.

ACG is a k-mer occurring at positions 1, 4.

Slide12
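The spectrum example on the "Spectrum-based repeat finder" slide can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `kmer_positions` is hypothetical, and positions are reported 1-based to match the slide.

```python
from collections import defaultdict


def kmer_positions(text, k):
    """Map each k-mer of `text` to the list of its 1-based start positions."""
    positions = defaultdict(list)
    for start in range(len(text) - k + 1):
        positions[text[start:start + k]].append(start + 1)
    return dict(positions)


G = "ACGACGCTCACCCT"
index = kmer_positions(G, 3)

# The spectrum is simply the set of keys of the index.
print(", ".join(sorted(index)))
print("CTC at", index["CTC"], "and ACG at", index["ACG"])
```

Keeping the positions alongside each k-mer is a design choice of this sketch; the spectrum itself only needs the set of k-mers.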

Observation 1: How to find candidate regions containing repeats?

Two regions of repeats should share some k-mers.

E.g. the following repeats share CGA.

A[CGA]CGCGATTAACCCT[CGA]CGTGATCCTC (shared k-mer shown in brackets)

Slide13

Feasible extension (bud)

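G = ACGACGTGATTAACCCTCGACGTGATCCTC, where i marks the start of the second occurrence of CGA

Given the spectrum S for G[1..i-1], the candidate extensions of CGA are:

A: not feasible (X)

C: feasible

G: not feasible (X)

T: feasible

Note: T is called a fooling probe! GAT is in the spectrum, but the earlier occurrence of CGA is followed by C, not T.

Slide14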
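The feasible-extension check on the "Feasible extension (bud)" slide can be sketched as follows. The function names are hypothetical, and the sketch assumes that the spectrum for G[1..i-1] means the k-mers lying entirely inside that prefix.

```python
def spectrum(text, k):
    """Set of all k-mers occurring in `text`."""
    return {text[s:s + k] for s in range(len(text) - k + 1)}


def feasible_extensions(spec, kmer):
    """Bases b for which the next k-mer (kmer without its first base, plus b) is in the spectrum."""
    return [b for b in "ACGT" if kmer[1:] + b in spec]


G = "ACGACGTGATTAACCCTCGACGTGATCCTC"
i = 18                          # 1-based start of the second CGA
S = spectrum(G[:i - 1], 3)      # spectrum of G[1..i-1]
print(feasible_extensions(S, "CGA"))
```

Both C and T are reported. Only C continues the earlier copy of CGA; T is accepted because GAT happens to occur elsewhere in the prefix, which is what makes it a fooling probe.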

Observation 2

A path of feasible extensions may be a repeat.

Example:

G = ACGACGCTATCGATGCCCTC

The spectrum S for G[1..10] is

ACG, CGA, CGC, CTA, GAC, GCT, TAT

Starting from position 11, there exists a path of feasible extensions:

CGA-C-G-C

This path corresponds to the length-6 substring at position 2.

Also, this path has one mismatch compared with the length-6 substring at position 11 (CGATGC).

Slide15
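The example on the "Observation 2" slide can be checked mechanically. The sketch below uses hypothetical helper names and verifies that every k-mer along the path CGA-C-G-C is in the spectrum of G[1..10], that the path spells the substring at position 2, and that it differs from the substring at position 11 in one base.

```python
def spectrum(text, k):
    """Set of all k-mers occurring in `text`."""
    return {text[s:s + k] for s in range(len(text) - k + 1)}


def is_feasible_path(spec, path, k):
    """True if every k-mer along `path` belongs to the spectrum."""
    return all(path[s:s + k] in spec for s in range(len(path) - k + 1))


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


G = "ACGACGCTATCGATGCCCTC"
S = spectrum(G[:10], 3)
path = "CGACGC"                        # CGA-C-G-C spelled out

print(is_feasible_path(S, path, 3))    # every k-mer of the path is in S
print(G.find(path) + 1)                # 1-based position of the earlier copy
print(mismatches(path, G[10:16]))      # compared with CGATGC at position 11
```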

Phase 1: FindHit()

Algorithm:

Input: a text G

Initialize the empty spectrum S

For i = 1 to n

    /* we maintain the invariant that S is a spectrum for G[1..i-1] */

    Let x be the k-mer at position i

    If x exists in S, run DetectRepSeq(S, i);

    Insert x into S

Note: DetectRepSeq(S, i) looks for a repeat occurring at position i.

Slide16
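The FindHit loop from the "Phase 1: FindHit()" slide might look like this in Python. It is a sketch under assumptions: the name `find_hits` and the optional `detect` callback (standing in for the slide's DetectRepSeq) are hypothetical, and the function merely records the 1-based positions at which the callback would be triggered.

```python
def find_hits(text, k, detect=None):
    """Scan `text` left to right, growing the spectrum one k-mer at a time.

    A position is a hit when its k-mer already occurs in the spectrum
    built from the earlier positions.
    """
    spectrum = set()
    hits = []
    for start in range(len(text) - k + 1):
        kmer = text[start:start + k]
        if kmer in spectrum:
            hits.append(start + 1)            # report 1-based position
            if detect is not None:
                detect(spectrum, start + 1)   # stand-in for DetectRepSeq(S, i)
        spectrum.add(kmer)                    # keep the invariant for the next i
    return hits


G = "ACGAAGTGATTAACCCTCGACGCGATCC"
print(find_hits(G, 3))
```

On the genome used in the walkthrough slides, the first hit is at position 18, where CGA reappears.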

[Figure: DetectRepSeg(S(18), 18) walkthrough, frame 1. Genome: ACGAAGTGATTAACCCTCGACGCGATCC, with positions 18 to 28 labeled. Spectrum S(18): AAC, AAG, ACC, ACG, AGT, ATT, CCC, CCT, CGA, CTC, GAA, GAT, GTG, TAA, TCG, TGA, TTA. Row labels: Ref (positions 1 2 …) and Curr (positions 18 to 28: CGA C G C G A T C T). The search starts at CGA.]

Slide17

[Figure: DetectRepSeg(S(18), 18) walkthrough, frame 2. Same genome, spectrum S(18), and Ref/Curr rows as on Slide 16. Search tree so far: CGA-T1.]

Slide18

[Figure: DetectRepSeg(S(18), 18) walkthrough, frame 3. Same genome, spectrum S(18), and Ref/Curr rows as on Slide 16. Search tree: CGA-T1-T2-A3*; A1-G1-T2-G2-A2-T2-T3*; C2-C2-C3*; G3*.]

Slide19

[Figure: DetectRepSeg(S(18), 18) walkthrough, frame 4. The transcript text is identical to the frame on Slide 18.]

Slide20

[Figure: DetectRepSeg(S(18), 18) walkthrough, frame 5. The transcript text is identical to the frame on Slide 18.]

Slide21

[Figure: DetectRepSeg(S(18), 18) walkthrough, frame 6. The transcript text is identical to the frame on Slide 18.]

Slide22

[Figure: DetectRepSeg(S(18), 18) walkthrough, frame 7. The transcript text is identical to the frame on Slide 18.]

Slide23

[Figure: DetectRepSeg(S(18), 18) walkthrough, frame 8. The transcript text is identical to the frame on Slide 18.]

Slide24
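The DetectRepSeg(S(18), 18) walkthrough slides appear to show a depth-first search over feasible extensions, with each node labeled by a cumulative mismatch count against the current text and branches pruned (marked *) once the count reaches 3. The sketch below is one possible reading of that figure, not the authors' implementation. The function names, the `max_mismatches=2` default, and the interpretation of S(i) as the k-mers starting at positions 1..i-1 are all assumptions.

```python
def spectrum_before(text, i, k):
    """k-mers starting at 1-based positions 1..i-1 (assumed meaning of S(i))."""
    return {text[p:p + k] for p in range(i - 1) if p + k <= len(text)}


def detect_rep_seg(text, i, k, max_mismatches=2):
    """Longest path of feasible extensions from the k-mer at 1-based position i.

    The path is compared base by base with the text starting at position i,
    and a branch is abandoned once it exceeds `max_mismatches`.
    Returns (path, mismatches), or None if the seed k-mer is not in S(i).
    """
    spec = spectrum_before(text, i, k)
    start = i - 1                                  # 0-based start of the seed
    seed = text[start:start + k]
    if seed not in spec:
        return None
    best = (seed, 0)

    def dfs(path, mismatches):
        nonlocal best
        if len(path) > len(best[0]):
            best = (path, mismatches)
        nxt = start + len(path)                    # text index the next base faces
        if nxt >= len(text):
            return
        suffix = path[len(path) - k + 1:]          # last k-1 bases of the path
        for base in "ACGT":
            if suffix + base not in spec:
                continue                           # not a feasible extension
            cost = mismatches + (base != text[nxt])
            if cost <= max_mismatches:
                dfs(path + base, cost)

    dfs(seed, 0)
    return best


G = "ACGAAGTGATTAACCCTCGACGCGATCC"
print(detect_rep_seg(G, 18, 3))
```

On the genome from the slides, the search returns the path CGAAGTGAT with 2 mismatches, which is the copy at position 2 and matches the longest unpruned branch in the figure (A1-G1-T2-G2-A2-T2).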

Other details

Extend backward

Stop backtracking after h steps

Slide25

Validation phase

Decompose the hits into sets of k-mers and index all the locations of these k-mers.

For each pair of locations of a k-mer w in the hits, do a BLAST extension.

Use an auxiliary data structure to avoid double checking.

Report the pairs whose length exceeds our threshold.

Slide26

Analysis

Slide27

Analysis

How to find most repeats?

Avoid false negatives

How to get better speed?

Avoid false positives

Slide28

How do we choose k? (1)

If k is too big, a k-mer is too specific and we may miss some repeats.

If k is too small, a k-mer cannot help us differentiate repeats from non-repeats.

For repeats of length 50 and similarity > 0.9, we found that $k \ge \log_4 n + 2$ is good enough.

Slide29

How do we choose k? (2)

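A random k-mer matches one of n chosen k-mers.

Pr(a k-mer re-occurs by chance in a sequence of length n) (analogous to throwing n balls into $4^k$ bins)

$\approx 1 - (1 - 4^{-k})^{n} \approx 1 - \exp(-n/4^{k})$.

We require $1 - \exp(-n/4^{k}) \le \epsilon_1$. Since $1 - e^{-x} \approx x$ for small x, this gives $n/4^{k} \le \epsilon_1$; hence $k \ge \log_4 n + \log_4 (1/\epsilon_1)$.

If we set $\epsilon_1 = 1/16$, then $k \ge \log_4 n + 2$.

Slide30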
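The rule of thumb from the "How do we choose k? (2)" slide is easy to evaluate numerically. The sketch below uses hypothetical function names: it computes the smallest integer k satisfying k ≥ log_4 n + log_4(1/threshold) and the approximate chance re-occurrence probability 1 - exp(-n/4^k) for that k.

```python
import math


def min_k(n, threshold):
    """Smallest integer k with k >= log4(n) + log4(1/threshold)."""
    return math.ceil((math.log2(n) + math.log2(1 / threshold)) / 2)


def reoccurrence_probability(n, k):
    """Approximate probability that a k-mer re-occurs by chance among n k-mers."""
    return 1 - math.exp(-n / 4 ** k)


n = 4 ** 10                       # about one million bases
k = min_k(n, 1 / 16)
print(k, reoccurrence_probability(n, k))
```

Because 1 - exp(-x) never exceeds x, any k chosen this way keeps the approximate probability at or below the threshold.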

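The occurrence of false negatives (missed repeats) (1)

A pair of repeats of length L, with m mismatches

Probability of a preserved k-mer in the repeat is [formula missing from the transcript]

M is the number of nonnegative integer solutions to [equation missing from the transcript] subject to [constraint missing from the transcript]

[Figure: a repeat of length L with mismatches marked X, separating runs labeled x_1, x_2, …, x_{m+1}]

Slide31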

The occurrence of false negatives (missed repeats) (2)

It is easy to see that M is the coefficient of $x^{L-m}$ in [expression missing from the transcript]

Hence [result missing from the transcript]

Slide32
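The formulas on the two "occurrence of false negative" slides did not survive in the transcript, so the following is only one plausible reading, stated here as an assumption. If the m mismatches split a repeat of length L into m+1 runs of matches with lengths x_1, …, x_{m+1} summing to L-m, then no k-mer is preserved exactly when every run is shorter than k. The number of such placements is the coefficient of x^(L-m) in (1 + x + … + x^(k-1))^(m+1). Under uniformly random mismatch positions (also an assumption), the probability of a preserved k-mer is one minus that count divided by C(L, m). All names below are hypothetical.

```python
from itertools import combinations
from math import comb


def count_no_preserved_kmer(L, m, k):
    """Coefficient of x^(L-m) in (1 + x + ... + x^(k-1))^(m+1)."""
    target = L - m
    coeffs = [1] + [0] * target               # polynomial "1"
    for _ in range(m + 1):                    # multiply by one factor per run
        new = [0] * (target + 1)
        for degree, c in enumerate(coeffs):
            if c:
                for run in range(min(k - 1, target - degree) + 1):
                    new[degree + run] += c
        coeffs = new
    return coeffs[target]


def prob_preserved_kmer(L, m, k):
    """Probability that some run of k matches survives m random mismatches."""
    return 1 - count_no_preserved_kmer(L, m, k) / comb(L, m)


def brute_force_count(L, m, k):
    """Same count by enumerating every placement of the m mismatches."""
    count = 0
    for placed in combinations(range(L), m):
        bounds = (-1,) + placed + (L,)
        runs = [b - a - 1 for a, b in zip(bounds, bounds[1:])]
        if max(runs) < k:
            count += 1
    return count


print(prob_preserved_kmer(50, 5, 12))
```

The brute-force counter is included only as a cross-check of the polynomial computation on small inputs.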

Criterion for path termination (1)

Instead of fixing the number of mismatches, we may want to fix the percentage of mismatches, say, 10%.

Then the pruning strategy is length dependent.

If the strings under consideration have length r, we allow $\tau(r)$ mismatches.

Slide33

Criterion for path termination (2)

Let q be the mismatch probability and r be the length of the string.

Probability that a string has s mismatches: $P_q(s) = \binom{r}{s} q^{s} (1-q)^{r-s}$

For a threshold $\epsilon$ (say, 0.01), we set

$\tau(r) = \max\{\, 2 \le s \le r-2 \mid P_q(s) > \epsilon \,\} + 2$

Slide34
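The length-dependent mismatch allowance from the "Criterion for path termination (2)" slide can be tabulated with a short script. This is a sketch under assumptions: mismatches are taken to be independent with probability q, so the number of mismatches in a string of length r is binomial; the function names are hypothetical; and returning 2 when no s qualifies is a choice made here, not something stated on the slide.

```python
from math import comb


def mismatch_probability(r, s, q):
    """Binomial probability of exactly s mismatches in a string of length r."""
    return comb(r, s) * q ** s * (1 - q) ** (r - s)


def allowed_mismatches(r, q, threshold):
    """max{ 2 <= s <= r-2 : P_q(s) > threshold } + 2, as on the slide."""
    candidates = [s for s in range(2, r - 1)
                  if mismatch_probability(r, s, q) > threshold]
    return max(candidates, default=0) + 2     # default is an assumption


for r in (10, 20, 50):
    print(r, allowed_mismatches(r, q=0.1, threshold=0.01))
```

For r = 10, q = 0.1 and a threshold of 0.01, the largest qualifying s is 4, so the allowance is 6.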

Control of false positives (1)

Two typical cases

The probability of (case 1)/(case 2) is approximately 2*4^- [exponent not recoverable from the transcript]

P(case 1 or case 2) is small

For example: 4 errors, q = 0.1, k = 12, P(case 1) = $1.77 \times 10^{-8}$

Slide35

Evaluation

Comparison with other programs

Slide36

Programs

EulerAlign by Zhang and Waterman

PALS by Edgar and Myers

REPuter by Kurtz et al.

SAGRI

Slide37

Measurement

Count Ratio (CR): the ratio of the number of repeat pairs that share more than 50% with a reference pair to the number of reference pairs.

Shared Repeat Region (SRR): the ratio of the found region to the reference region.

Slide38

Simulated data

Conclusion from simulated data

The result is consistent with the analysis

Slide39

Genome data

M.gen (0.6 Mbp)

Organism with the smallest genome

Lives in the primate genital and respiratory tracts

C.tra (1 Mbp)

Lives inside the cells of humans

A.ful (2.1 Mbp)

Found in high-temperature oil fields

E.coli (4 Mbp)

An important bacterium living in the lower intestines of mammals

Human chr22 p20M to p21M (1 Mbp)

Slide40

Use CR and SRR ratio to measure

Cross validation

G/H=1, H/G<1 ⇒ G “outperforms” H

G/H<1, H/G=1 ⇒ H “outperforms” G

G/H<1, H/G<1 ⇒ G, H are complementary

G/H=1, H/G=1 ⇒ G, H are similar

Slide41
Slide42


Slide43

Questions and Answers

Slide44
Slide45

H. H. Do, K. P. Choi, F. P. Preparata, W. K. Sung, L. Zhang. Spectrum-based de novo repeat detection in genomic sequences. Journal of Computational Biology, 15(5):469-487, June 2008.