Spectrum-based de novo repeat detection in genomic sequences

Uploaded On 2020-06-16

Presentation Transcript

Slide1

Spectrum-based de novo repeat detection in genomic sequences

Do Huy Hoang

Slide2

Outline

Introduction

What is a repeat?

Why study repeats?

Related work

SAGRI

Algorithm

Analysis

Evaluation

Slide3

Introduction

Slide4

What is a repeat? (Definition)

[General]: Nucleotide sequences occurring multiple times within a genome

[CompBio]: Given a genome sequence S, find a string P which occurs at least twice in S (allowing some errors).

Slide5

What is a repeat? (Function)

Motifs

Very short repeats (10-20 bp)

Transcription factor binding sites

Long and short interspersed elements (LINE, SINE)

Jumping genes

Genes and pseudogenes

Tandem repeats

Simple short sequence repeats: (A)n, (CGG)n

Slide6

Why study repeats? (1)

Eukaryotic genomes contain a lot of repeats.

E.g. the human genome contains 50% repeats.

Repeats are believed to play an important role in evolution and disease.

E.g. Alu elements are particularly prone to recombination. Insertions of Alu repeats inactivate genes in patients with hemophilia and neurofibromatosis (Kazazian, 1998; Deininger and Batzer, 1999).

Repeats are important to chromatin structure.

Most TEs in mammals seem to be silenced by methylation.

Alu sequences are a major target for histone H3-Lys9 methylation in humans (Kondo and Issa, 2003).

It is known that heterochromatin contains many SINE and LINE repeats.

Slide7

Why study repeats? (2)

Repeats complicate sequence assembly and genome comparison.

Many people remove repeats before they analyze the genome.

Repeats set hurdles for microarray probe signal analysis.

The probe signal may be inaccurate if the probe sequence overlaps with repeat regions.

Repeats may contribute to human diversity more than genes.

Repeats can be used as a DNA fingerprint.

Slide8

Steps in Repeat finding

Repeat library (RepeatMasker)

De novo repeat discovery (two steps):

Identification of repeats

Classification of repeats

Slide9

SAGRI algorithm

Slide10

Algorithm outline

Input: a text G

FindHit phase: finds all candidates for the second occurrence of repeat regions

A CGACGCGAT TAACCCT CGACGTGAT CCTC

Validation phase: uses the hits from phase 1 to find all pairs of repeats

A CGACGCGAT TAACCCT CGACGTGAT CCTC

Slide11

Spectrum-based repeat finder

What is a spectrum?

Given a string G, its spectrum is the set of all k-mers.

E.g. k=3, G= ACGACGCTCACCCT

The spectrum is

ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA

CTC is a k-mer occurring at position 7.

ACG is a k-mer occurring at positions 1, 4.
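The spectrum example above can be reproduced with a few lines of Python. This is a minimal sketch, not code from the talk: the function name `spectrum_positions` is hypothetical, and positions are reported 1-based to match the slide.

```python
from collections import defaultdict


def spectrum_positions(text: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of `text` to the 1-based positions where it starts.

    The keys of the returned dict form the spectrum of `text`.
    (Hypothetical helper, named here for illustration only.)
    """
    positions = defaultdict(list)
    for start in range(len(text) - k + 1):
        positions[text[start:start + k]].append(start + 1)  # 1-based, as on the slide
    return dict(positions)


G = "ACGACGCTCACCCT"
spectrum = spectrum_positions(G, 3)
print(sorted(spectrum))   # the distinct 3-mers of G
print(spectrum["CTC"])    # start positions of CTC
print(spectrum["ACG"])    # start positions of ACG
```

A k-mer with more than one position in this map is exactly the kind of shared k-mer that signals a candidate repeat.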

Slide12

Observation 1: How to find candidate regions containing repeats?

Two regions of repeats should share some k-mers.

E.g. the following repeats share CGA.

A CGA CGCGAT TAACCCT CGA CGTGAT CCTC

Slide13

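Feasible extension (bud)

G = AC GAC GT GAT TAACCCT CGA CGTGATCCTC (i marks the start of the second CGA)

Given the spectrum S for G[1..i-1], the k-mer CGA at position i can be extended by one character as follows:

A: not feasible (X)

C: feasible

G: not feasible (X)

T: feasible

Feasible extensions!

Note: T is called a fooling probe!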
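The feasibility check on this slide can be sketched as follows. This is an illustrative reading, not the authors' code: `feasible_extensions` is a hypothetical name, and the spectrum is built from the k-mers lying entirely inside G[1..i-1], which is one possible interpretation of "the spectrum S for G[1..i-1]".

```python
def feasible_extensions(spectrum: set[str], kmer: str) -> list[str]:
    """Return the characters c such that the shifted k-mer kmer[1:] + c is in the spectrum."""
    return [c for c in "ACGT" if kmer[1:] + c in spectrum]


G = "ACGACGTGATTAACCCTCGACGTGATCCTC"
k = 3
i = 18                       # 1-based start of the second CGA
prefix = G[:i - 1]           # G[1..i-1]
S = {prefix[j:j + k] for j in range(len(prefix) - k + 1)}

current = G[i - 1:i - 1 + k]            # the k-mer at position i, here CGA
buds = feasible_extensions(S, current)
print(current, buds)
```

Both C and T pass the check because GAC and GAT occur in the prefix. Only C continues the true repeat CGACGTGAT, which is why the slide calls T a fooling probe.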

Slide14

Observation 2

A path of feasible extensions may be a repeat.

Example:

G = ACGACGCTATCGATGCCCTC

Spectrum S for G[1..10] is

ACG, CGA, CGC, CTA, GAC, GCT, TAT

Starting from position 11, there exists a path of feasible extensions:

CGA-C-G-C

This path corresponds to the length-6 substring at position 2 (CGACGC).

Also, this path has one mismatch compared with the length-6 substring at position 11 (CGATGC).
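A path of feasible extensions can be enumerated with a small depth-first search that prunes any branch whose mismatch count against the current text exceeds a budget. The sketch below is an assumption-laden illustration: `extension_paths`, its parameters, and the fixed mismatch budget are hypothetical and only mirror the example on this slide (k must be at least 2).

```python
def extension_paths(text, spectrum, start, k, length, max_mismatches):
    """List (path, mismatches) for all paths of feasible extensions of the given length.

    `start` is the 0-based position of the current k-mer in `text`. A path grows one
    character at a time; each new k-mer must be in `spectrum`, and the number of
    mismatches against text[start:start + length] must stay within the budget.
    """
    results = []

    def dfs(path, mismatches):
        if len(path) == length:
            results.append((path, mismatches))
            return
        target = text[start + len(path)]          # character actually present in the text
        for c in "ACGT":
            if path[-(k - 1):] + c not in spectrum:
                continue                          # not a feasible extension
            cost = mismatches + (c != target)
            if cost <= max_mismatches:
                dfs(path + c, cost)

    dfs(text[start:start + k], 0)
    return results


G = "ACGACGCTATCGATGCCCTC"
prefix = G[:10]                                   # G[1..10]
S = {prefix[j:j + 3] for j in range(len(prefix) - 2)}
paths = extension_paths(G, S, start=10, k=3, length=6, max_mismatches=1)
print(paths)
```

With a budget of one mismatch the only surviving path is CGACGC, which matches the substring at position 2 and differs from CGATGC in one place.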

Slide15

Phase 1: FindHit()

Algorithm:

Input: a text G

Initialize the empty spectrum S

For i = 1 to n

/* we maintain the invariant that S is a spectrum for G[1..i-1] */

Let x be the k-mer at position i

If x exists in S, run DetectRepSeq(S, i);

Insert x into S

Note: DetectRepSeq(S, i) looks for a repeat occurring at position i.
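The loop above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `find_hit_positions` is a hypothetical name, and instead of running DetectRepSeq it only records the 1-based positions at which that routine would be triggered.

```python
def find_hit_positions(text: str, k: int) -> list[int]:
    """Return the 1-based positions i whose k-mer already occurs earlier in the text.

    Before position i is handled, `spectrum` holds the k-mers that start at
    positions 1..i-1. The slides call DetectRepSeq(S, i) at each such position;
    this sketch just collects the positions.
    """
    spectrum = set()
    hits = []
    for start in range(len(text) - k + 1):
        x = text[start:start + k]
        if x in spectrum:
            hits.append(start + 1)    # placeholder for DetectRepSeq(S, i)
        spectrum.add(x)               # insert x so later positions can see it
    return hits


hits = find_hit_positions("ACGACGCTCACCCT", 3)
print(hits)
```

Using a hash set keeps each lookup and insertion at expected constant time, so the scan itself is linear in the text length; the cost of the real algorithm is dominated by the search started at each hit.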

Slide16

[Figure: DetectRepSeg(S(18), 18) on G = ACGAAGTGATTAACCCTCGACGCGATCC. Curr is the occurrence of CGA at position 18 (positions 18 to 28 are labelled); Ref is the earlier part of G (positions 1, 2, ...). The search starts from the k-mer CGA.]

S(18) = AAC, AAG, ACC, ACG, AGT, ATT, CCC, CCT, CGA, CTC, GAA, GAT, GTG, TAA, TCG, TGA, TTA

Slide17

[Figure: DetectRepSeg(S(18), 18), same sequence G and spectrum S(18) as on Slide16. The first extension of CGA is shown: CGA -T1.]

Slide18

[Figure: DetectRepSeg(S(18), 18), same sequence G and spectrum S(18) as on Slide16, now with the search tree grown from CGA. Node labels are reproduced as they appear on the slide: an extension character, a number, and * on the last node of some branches.]

CGA -T1 -T2 -A3*

A1 -G1 -T2 -G2 -A2 -T2 -T3*

C2 -C2 -C3*

G3*

Slide19

[Figure: further animation frame of the DetectRepSeg(S(18), 18) search; the sequence, spectrum and search tree are as on Slide18.]

Slide20

[Figure: further animation frame of the DetectRepSeg(S(18), 18) search; the sequence, spectrum and search tree are as on Slide18.]

Slide21

[Figure: further animation frame of the DetectRepSeg(S(18), 18) search; the sequence, spectrum and search tree are as on Slide18.]

Slide22

[Figure: further animation frame of the DetectRepSeg(S(18), 18) search; the sequence, spectrum and search tree are as on Slide18.]

Slide23

[Figure: final animation frame of the DetectRepSeg(S(18), 18) search; the sequence, spectrum and search tree are as on Slide18.]

Slide24

Other details

Extend backward

Stop backtracking after h steps

Slide25

Validation phase

Decompose the hits into a set of k-mers and index all the locations of these k-mers.

For each pair of locations of a k-mer w in the hits, do a BLAST extension.

Use an auxiliary data structure to avoid double checking.

Report the pairs whose length exceeds our threshold.
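The validation steps can be sketched as a seed-and-extend loop. Several simplifications here are assumptions, not the method of the talk: the function name `exact_repeat_pairs` is hypothetical, the index is built over the whole text rather than over the hits, and the extension is exact (it stops at the first mismatch), whereas a BLAST-style extension tolerates mismatches.

```python
from collections import defaultdict
from itertools import combinations


def exact_repeat_pairs(text: str, k: int, min_len: int) -> list[tuple[int, int, int]]:
    """Return (start_a, start_b, length) for maximal exact repeat pairs, 0-based.

    Seeds are pairs of positions sharing a k-mer. Each seed is extended left and
    right while the characters agree. The `covered` set plays the role of the
    auxiliary structure that avoids checking the same pair twice.
    """
    index = defaultdict(list)
    for start in range(len(text) - k + 1):
        index[text[start:start + k]].append(start)

    covered = set()
    pairs = []
    for positions in index.values():
        for a, b in combinations(positions, 2):      # a < b, positions are ascending
            if (a, b) in covered:
                continue
            left = 0
            while a - left - 1 >= 0 and text[a - left - 1] == text[b - left - 1]:
                left += 1
            right = k
            while b + right < len(text) and text[a + right] == text[b + right]:
                right += 1
            start_a, start_b, length = a - left, b - left, left + right
            for off in range(length - k + 1):        # mark every seed inside the region
                covered.add((start_a + off, start_b + off))
            if length >= min_len:
                pairs.append((start_a, start_b, length))
    return sorted(pairs)


pairs = exact_repeat_pairs("ACGACGCGATTAACCCTCGACGTGATCCTC", k=3, min_len=5)
print(pairs)
```

On the example text from the algorithm outline, the exact extension recovers the CGACG prefix shared by CGACGCGAT and CGACGTGAT; an extension that tolerates the single C/T mismatch would join it with the following GAT into the full 9-base repeat.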

Slide26

Analysis

Slide27

Analysis

How to find most repeats?

Avoid false negatives

How to get better speed?

Avoid false positives

Slide28

How do we choose k? (1)

If k is too big, the k-mer is too specific and we may miss some repeats.

If k is too small, the k-mer cannot help us differentiate repeats from non-repeats.

For repeats of length 50 and similarity > 0.9, we found that $k \ge \log_4 n + 2$ is good enough.

Slide29

How do we choose k? (2)

A random k-mer matches one of n chosen k-mers:

Pr(a k-mer re-occurs by chance in a sequence of length n)

(analogous to throwing n balls into $4^k$ bins)

$= 1 - (1 - 4^{-k})^n \approx 1 - \exp(-n/4^k)$.

We require $1 - \exp(-n/4^k) \le \epsilon_1$. Since $1 - \exp(-x) \le x$, it suffices that $n/4^k \le \epsilon_1$, hence $k \ge \log_4 n + \log_4 (1/\epsilon_1)$.

If we set $\epsilon_1 = 1/16$, then $k \ge \log_4 n + 2$.
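The bound on k can be checked numerically. The sketch below is illustrative only: `reoccurrence_probability` and `smallest_k` are hypothetical helper names, and the sequence length of one million bases is an arbitrary example, not a figure from the talk.

```python
import math


def reoccurrence_probability(n: int, k: int) -> float:
    """Approximate chance that a random k-mer re-occurs in a random sequence of length n."""
    return 1.0 - math.exp(-n / 4 ** k)


def smallest_k(n: int, eps: float = 1 / 16) -> int:
    """Smallest k with n / 4**k <= eps, i.e. k >= log_4 n + log_4 (1/eps).

    A loop over integer powers is used instead of math.log to avoid
    floating-point rounding at exact powers of 4.
    """
    k = 1
    while n / 4 ** k > eps:
        k += 1
    return k


n = 1_000_000
k = smallest_k(n)
print(k, reoccurrence_probability(n, k))
```

Because $1 - \exp(-x) \le x$, any k returned by `smallest_k` also keeps the re-occurrence probability itself below the chosen threshold.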

Slide30

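The occurrence of false negative (missed repeat) (1)

A pair of repeats of length L, with m mismatches

Probability of a preserved k-mer in the repeat is [formula not preserved in this transcript]

M is the number of nonnegative integer solutions to [equation not preserved] subject to [constraint not preserved]

[Figure: a repeat of length L in which the mismatches (X) separate runs labelled $x_1, x_2, \ldots, x_{m+1}$]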

Slide31

The occurrence of false negative (missed repeat) (2)

It is easy to see that M is the coefficient of $x^{L-m}$ in [polynomial not preserved in this transcript]

Hence [formula not preserved in this transcript]
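The formulas on these two slides did not survive extraction, so the following is one plausible reading, stated as an assumption rather than as the authors' derivation. Suppose the m mismatches are placed uniformly at random among the L positions, so they split the L - m matching positions into m + 1 runs $x_1, \ldots, x_{m+1}$. No k-mer is preserved exactly when every run is shorter than k. If M counts the solutions of $x_1 + \cdots + x_{m+1} = L - m$ with $0 \le x_i \le k - 1$, then M is the coefficient of $x^{L-m}$ in $(1 + x + \cdots + x^{k-1})^{m+1}$ and the probability of a preserved k-mer is $1 - M / \binom{L}{m}$. The sketch computes M by multiplying out that polynomial and cross-checks the result by brute force; all function names are hypothetical.

```python
from itertools import combinations
from math import comb


def count_bounded_solutions(total: int, parts: int, cap: int) -> int:
    """Count solutions of x_1 + ... + x_parts = total with 0 <= x_i <= cap."""
    ways = [1] + [0] * total                 # coefficients of the running polynomial product
    for _ in range(parts):
        new = [0] * (total + 1)
        for s, w in enumerate(ways):
            if w:
                for x in range(min(cap, total - s) + 1):
                    new[s + x] += w
        ways = new
    return ways[total]


def prob_preserved_kmer(L: int, m: int, k: int) -> float:
    """Probability that some run of at least k matches survives (assumed model)."""
    M = count_bounded_solutions(L - m, m + 1, k - 1)
    return 1.0 - M / comb(L, m)


def brute_force(L: int, m: int, k: int) -> float:
    """Same probability, by enumerating every placement of the m mismatches."""
    good = 0
    for mismatches in combinations(range(L), m):
        bad = set(mismatches)
        run = best = 0
        for pos in range(L):
            run = 0 if pos in bad else run + 1
            best = max(best, run)
        good += best >= k
    return good / comb(L, m)


print(prob_preserved_kmer(8, 2, 3), brute_force(8, 2, 3))
print(prob_preserved_kmer(50, 5, 12))        # length 50, 90% similarity, k = 12
```

The false negative rate under this model is one minus the printed probability, since a repeat pair with no preserved k-mer produces no hit.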

Slide32

Criterion for path termination (1)

Instead of fixing the number of mismatches, we may want to fix the percentage of mismatches, say, 10%.

Then, the pruning strategy is length dependent.

If the length of the strings is r, we allow $\delta(r)$ mismatches.

Slide33

Criterion for path termination (2)

Let q be the mismatch probability and r be the length of the string.

$P_q(s)$ = Prob that a string has s mismatches = [formula not preserved in this transcript]

For a threshold $\epsilon$ (say, 0.01), we set

$\delta(r) = \max \{ 2 \le s \le r-2 \mid P_q(s) > \epsilon \} + 2$
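The length-dependent mismatch budget can be sketched numerically. The formula for $P_q(s)$ is missing from the transcript, so this example assumes independent mismatches, which gives the binomial probability $P_q(s) = \binom{r}{s} q^s (1-q)^{r-s}$. That choice, the function names, and the fallback used when no s passes the threshold are all assumptions made for illustration.

```python
from math import comb


def mismatch_pmf(r: int, s: int, q: float) -> float:
    """Assumed P_q(s): probability of exactly s mismatches in a string of length r."""
    return comb(r, s) * q ** s * (1 - q) ** (r - s)


def allowed_mismatches(r: int, q: float, eps: float) -> int:
    """delta(r) = max{2 <= s <= r-2 : P_q(s) > eps} + 2.

    If no s qualifies, this sketch falls back to 0 + 2 (a hypothetical choice,
    since the slide does not say what happens in that case).
    """
    candidates = [s for s in range(2, r - 1) if mismatch_pmf(r, s, q) > eps]
    return (max(candidates) if candidates else 0) + 2


print(allowed_mismatches(50, 0.1, 0.01))
print(allowed_mismatches(100, 0.1, 0.01))
```

Under these assumptions the budget grows with the path length, which is the length-dependent pruning described on the previous slide.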

Slide34

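Control of false positives (1)

Two typical cases [figure not preserved in this transcript]

The probability of (case 1)/(case 2) is given in terms of $2 \cdot 4^{-(\ldots)}$ [relation symbol and exponent lost in extraction]

P(case 1 or case 2) is small

For example: 4 errors, q = 0.1, k = 12, P(case 1) = $1.77 \times 10^{-8}$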

Slide35

Evaluation

Compare with other programs

Slide36

Programs

EulerAlign by Zhang and Waterman

PALS by Edgar and Myers

REPuter by Kurtz et al.

SAGRI

Slide37

Measurement

Count Ratio (CR): the ratio of the number of repeat pairs that share more than 50% with a reference pair to the number of reference pairs.

Shared Repeat Region (SRR): the ratio of the found region to the reference region.

Slide38

Simulated data

Conclusion from simulated data

The result is consistent with the analysis

Slide39

Genome data

M.gen (0.6 Mbp)

Organism with the smallest genome

Lives in the primate genital and respiratory tracts

C.tra (1 Mbp)

Lives inside the cells of humans

A.ful (2.1 Mbp)

Found in high-temperature oil fields

E.coli (4 Mbp)

An important bacterium that lives in the lower intestines of mammals

Human chr22 p20M to p21M (1 Mbp)

Slide40

Use CR and SRR ratio to measure

Cross validation

G/H=1, H/G<1 → G “outperforms” H

G/H<1, H/G=1 → H “outperforms” G

G/H<1, H/G<1 → G, H are complementary

G/H=1, H/G=1 → G, H are similar
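The four cross-validation cases form a small decision table, sketched below. The function name `compare_programs` and the tolerance used to treat a ratio as equal to 1 are assumptions added for illustration; the ratios themselves would be CR or SRR values computed with one program's output as the reference for the other.

```python
def compare_programs(g_over_h: float, h_over_g: float, tol: float = 1e-9) -> str:
    """Classify two programs G and H from their cross-validation ratios.

    A ratio counts as 1 when it is within `tol` of 1 (hypothetical tolerance).
    """
    g_full = g_over_h >= 1 - tol
    h_full = h_over_g >= 1 - tol
    if g_full and h_full:
        return "G, H are similar"
    if g_full:
        return "G outperforms H"
    if h_full:
        return "H outperforms G"
    return "G, H are complementary"


print(compare_programs(1.0, 0.8))
print(compare_programs(0.7, 0.9))
```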

Slide41

Slide42


Slide43

Questions and Answers

Slide44

Slide45

H. H. Do, K. P. Choi, F. P. Preparata, W. K. Sung, L. Zhang. Spectrum-based de novo repeat detection in genomic sequences. Journal of Computational Biology, 15(5):469-487, June 2008.