de novo repeat detection in genomic sequences Do Huy Hoang Outline Introduction What is a repeat Why studying repeats Related work SAGRI Algorithm Analysis Evaluation Introduction What is a repeat Definition ID: 142116
Download Presentation The PPT/PDF document "Spectrum-based" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
Slide1
Spectrum-based de novo repeat detection in genomic sequences
Do Huy HoangSlide2
OutlineIntroduction
What is a repeat?
Why studying repeats?
Related
work
SAGRI
Algorithm
Analysis
EvaluationSlide3
IntroductionSlide4
What is a repeat? (Definition)
[General]: Nucleotide sequences occurring multiply within a genome
[
CompBio
]: Given a genome sequence S, find a string P which occurs at least twice in S (allowing some errors).Slide5
What is a repeat? (Function)
Motifs
Very short repeats (10-20bp)
Transcription
factor binding
sites
Long and Short interspersed elements (SINE, LINE)
Jumping genes
Genes and
Pseudogenes
Tandem repeats
Simple short sequence repeats A
n
,
CGG
nSlide6
Why studying repeats? (1)
Eukaryotic genomes contain a lot of repeats
E.g. Human genome contains 50% repeats.
Repeats are believed to play an important role in evolution and disease.
E.g.
Alu
elements are particularly prone to recombination. Insertion of
Alu
repeats inactivate genes in patient with hemophilia and neurofibromatosis (
Kazazian
, 1998;
Deininger
and
Batzer
, 1999)
Repeats are important to chromatin structure.
Most TEs in mammals seem to be silenced by
methylation
.
Alu
sequences are major target for
histone
H3-Lys9
methylation
in humans (Kondo and
Issa
, 2003
).
It is known that heterochromatin have a lot of SINE and LINE repeats.Slide7
Why studying repeats? (2)
Repeats complicated sequence assembly and genome comparison
Many people remove repeats before they analyze the genome.
Repeats set hurdles on microarray probe signal analysis
The probe signal may be inaccurate if the probe sequence overlap with repeat regions.
Repeats may contribute to human diversity more than genes.
Repeats can be used as DNA fingerprintSlide8
Steps in Repeat finding
Repeat library (
RepeatMasker
)
De-novo
repeat
discovery
(two steps):
Identification of repeats
Classification of repeatsSlide9
SAGRI algorithmSlide10
Algorithm outlineInput: a text G
FindHit
phase
: finds all candidate of second occurrence of repeat regions
A
CGACGCGAT
TAACCCT
CGACGTGAT
CCTC
Validation phase
: uses hits from phase 1 to find all pairs of repeats
A
CGACGCGAT
TAACCCT
CGACGTGAT
CCTCSlide11
Spectrum-based repeat finder
What is a spectrum?
Given a string G, its spectrum is the set of all k-
mers
.
E.g. k=3, G= ACGACGCTCACCCT
The spectrum is
ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA
CTC is a k-
mer
occurring at position 7.
ACG is a k-
mer
occurring at positions 1, 4.Slide12
Observation 1: How to find candidate regions containing repeats?
Two regions of repeats should share some k-mers.
E.g. the following repeats share CGA.
A
CGA
CGCGAT
TAACCCT
CGA
CGTGAT
CCTCSlide13
Feasible extension (bud)
i
S = AC
GAC
GT
GAT
TAACCCT
CGA
CGTGATCCTC
Given the spectrum S for G[1..i-1]:
A
X
C
G
X
T
CGA
Feasible extensions!
i
Note: T is called a fooling probe!Slide14
Observation 2
A path of feasible extensions may be a repeat.
Example:
S = ACGACGCTATCGATGCCCTC
Spectrum S for G[1..10] is
ACG
,
CGA
,
CGC
, CTA,
GAC
, GCT, TAT
Starting from position 11, there exists a path of feasible extensions:
CGA-C-G-C
This path corresponds to a length-6 substring in position 2.
Also, this path has one mismatch compare with the length-6 substring for position 11 (CGATGC).
11Slide15
Phase 1: FindHit()
Algorithm:
Input: a text G
Initialize the empty spectrum S
For
i
= 1 to n
/* we maintain the variant that S is a spectrum for G[1..i-1] */
Let x be the k-
mer
at position
i
If x exists in S, run
DetectRepSeq
(
S,i
);
Insert x into S
Note:
DetectRepSeq
(
S,i
) looks for repeat occurring at position
i
.Slide16
A
CGA
A
G
T
GAT
TAACCCT
CGACGCGAT
CC
18 19 20
21 22 23 24 25 26 27 28
… 18 19 20 21 22 23 24 25 26 27 28
CGA C
G
C
G
A
T
C
T
DetectRepSeg(S
(18), 18)
AAC
AAG
ACC
ACG
AGT
ATT
CCC
CCT
CGA
CTC
GAA
GAT
GTG
TAA
TCG
TGA
TTA
CGA
1 2 …
Ref
CurrSlide17
A
CGA
A
G
T
GAT
TAACCCT
CGACGCGAT
CC
18 19 20
21 22 23 24 25 26 27 28
… 18 19 20 21 22 23 24 25 26 27 28
CGA C
G
C
G
A
T
C
T
DetectRepSeg(S
(18), 18)
AAC
AAG
ACC
ACG
AGT
ATT
CCC
CCT
CGA
CTC
GAA
GAT
GTG
TAA
TCG
TGA
TTA
CGA
-T
1
1 2 …
Ref
CurrSlide18
A
CGA
A
G
T
GAT
TAACCCT
CGACGCGAT
CC
18 19 20
21 22 23 24 25 26 27 28
… 18 19 20 21 22 23 24 25 26 27 28
CGA C
G
C
G
A
T
C
T
DetectRepSeg(S
(18), 18)
AAC
AAG
ACC
ACG
AGT
ATT
CCC
CCT
CGA
CTC
GAA
GAT
GTG
TAA
TCG
TGA
TTA
CGA
-T
1
-T
2
-A
3
*
A
1
-
G
1
-T
2
-
G
2
-
A
2
-
T
2
-T
3*
C2-C2-C3* G3*
1 2 …
Ref
CurrSlide19
A
CGA
A
G
T
GAT
TAACCCT
CGACGCGAT
CC
18 19 20
21 22 23 24 25 26 27 28
… 18 19 20 21 22 23 24 25 26 27 28
CGA C
G
C
G
A
T
C
T
DetectRepSeg(S
(18), 18)
AAC
AAG
ACC
ACG
AGT
ATT
CCC
CCT
CGA
CTC
GAA
GAT
GTG
TAA
TCG
TGA
TTA
CGA
-T
1
-T
2
-A
3
*
A
1
-
G
1
-T
2
-
G
2
-
A
2
-
T
2
-T
3*
C2-C2-C3* G3*
1 2 …
Ref
CurrSlide20
A
CGA
A
G
T
GAT
TAACCCT
CGACGCGAT
CC
18 19 20
21 22 23 24 25 26 27 28
… 18 19 20 21 22 23 24 25 26 27 28
CGA C
G
C
G
A
T
C
T
DetectRepSeg(S
(18), 18)
AAC
AAG
ACC
ACG
AGT
ATT
CCC
CCT
CGA
CTC
GAA
GAT
GTG
TAA
TCG
TGA
TTA
CGA
-T
1
-T
2
-A
3
*
A
1
-
G
1
-T
2
-
G
2
-
A
2
-
T
2
-T
3*
C2-C2-C3* G3*
1 2 …
Ref
CurrSlide21
A
CGA
A
G
T
GAT
TAACCCT
CGACGCGAT
CC
18 19 20
21 22 23 24 25 26 27 28
… 18 19 20 21 22 23 24 25 26 27 28
CGA C
G
C
G
A
T
C
T
DetectRepSeg(S
(18), 18)
AAC
AAG
ACC
ACG
AGT
ATT
CCC
CCT
CGA
CTC
GAA
GAT
GTG
TAA
TCG
TGA
TTA
CGA
-T
1
-T
2
-A
3
*
A
1
-
G
1
-T
2
-
G
2
-
A
2
-
T
2
-T
3*
C2-C2-C3* G3*
1 2 …
Ref
CurrSlide22
A
CGA
A
G
T
GAT
TAACCCT
CGACGCGAT
CC
18 19 20
21 22 23 24 25 26 27 28
… 18 19 20 21 22 23 24 25 26 27 28
CGA C
G
C
G
A
T
C
T
DetectRepSeg(S
(18), 18)
AAC
AAG
ACC
ACG
AGT
ATT
CCC
CCT
CGA
CTC
GAA
GAT
GTG
TAA
TCG
TGA
TTA
CGA
-T
1
-T
2
-A
3
*
A
1
-
G
1
-T
2
-
G
2
-
A
2
-
T
2
-T
3
*
C2-C2-C3* G3*1 2 …
RefCurrSlide23
A
CGA
A
G
T
GAT
TAACCCT
CGACGCGAT
CC
18 19 20
21 22 23 24 25 26 27 28
… 18 19 20 21 22 23 24 25 26 27 28
CGA C
G
C
G
A
T
C
T
DetectRepSeg(S
(18), 18)
AAC
AAG
ACC
ACG
AGT
ATT
CCC
CCT
CGA
CTC
GAA
GAT
GTG
TAA
TCG
TGA
TTA
CGA
-T
1
-T
2
-A
3
*
A
1
-
G
1
-T
2
-
G
2
-
A
2
-
T
2
-T
3
*
C2-C2-C3* G3*1 2 …
Ref
CurrSlide24
Other detailsExtend backward
Stop backtracking after
h
stepsSlide25
Validation phaseDecompose hits into set of k-mer and index all the locations of these k-mers.
Scan for each pair of locations of a k-mer w in the hits, do BLAST extension
Use some auxiliary data structure to avoid double checking
Report the pairs whose length exceed our thresholdSlide26
AnalysisSlide27
AnalysisHow to find most repeats?
Avoid false negative
How to get better speed?
Avoid false positiveSlide28
How do we choose k? (1)If k is too big,
k-mer is too specific and we may miss some repeat
If k is too small,
k-mer cannot help us to differentiate repeat from non-repeat
For repeat of length 50 and similarity>0.9,
we found that k
log
4
n+2 is good enough.Slide29
How do we choose k? (2)
A random k-mer match with one of n chosen k-mer
Pr(a k-mer re-occurs by random in a sequence of length n)
(analog to throwing n balls into 4
k
bins)
1-(1 – 4
-k
)
m
1 – exp(-m/4
k
).
We requires 1-exp(-n/4
k
)
1
,
hence, k log
4
n + log
4
1
.
If we set
1
=1/16, k log
4
n + 2
0
mSlide30
The occurrence of false negative (missed repeat) (1)
A pair of repeats of length L, with m mismatches
Probability of a preserved k-
mer
in repeat is
M is the number of nonnegative integer solutions
to Subject to
L
X
x
1
x
2
X
m+1
XSlide31
The occurrence of false negative (missed repeat) (2)
It is easy to see that M is the coefficient of x
L−m
in
HenceSlide32
Criterion for path termination (1)Instead of fixing the number of mismatches, we may want to fixed the percentage of mismatches, says, 10%.
Then, the pruning strategy is length dependent.
If the length of strings in
is r, we allow (r) mismatches.Slide33
Criterion for path termination (2)
Let q be the mismatch probability and r be the length of the string.
Prob that a string has s mismatches =
For a threshold
(says, 0.01)
, we set
(r) = max {2 s r-2 | P
q
(s) > } + 2Slide34
Control of false positives (1)
Two typical cases
The
probability of (case 1)/ (case 2)
is
2*4
-
P(case1 or case2) is small
For example: 4 errors, q=0.1, k = 12,
P(case 1) = 1.77 * 10
-8Slide35
EvaluationCompare with other programsSlide36
ProgramsEulerAlign
by Zhang and Waterman
PALS by Edgar and Myers
REPuter
by Kurtz et al.
SARGRISlide37
MeasurementCount Ratio (CR): the ratio of number of pairs of repeat share more than 50% with a reference pair to the number of reference pairs.
Shared Repeat Region (SRR): the ratio of the found region to the reference region. Slide38
Simulated data
Conclusion from simulated data
The result is consistent with the analysisSlide39
Genome dataM.gen (0.6 Mbp)
Organism with the smallest genome
Lives in the primate genital and respiratory tracts
C.tra (1 Mbp)
Live inside the cells of humans
A.ful (2.1 Mbp)
Found in high-temperature oil fields
E.coli (4 Mbp)
An import bacteria live inside lower intestines of mammals
Human chr22 p20M to p21M (1Mbp)Slide40
Use CR and SRR ratio to measure
Cross validation
G/H=1, H/G<1
G “outperforms” H
G/H<1, H/G=1 H “outperforms” G
G/H<1, H/G<1 G, H are complementary
G/H=1, H/G=1 G, H are similarSlide41Slide42
=
Slide43
Questions and AnswersSlide44Slide45
H. H. Do, K. P. Choi
, F. P.
Preparata
, W. K. Sung, L. Zhang. Spectrum-based de novo repeat detection in genomic sequences. Journal of Computational Biology, 15(5):469-487, June 2008