Slide1
Haim Kaplan, Uri Zwick
Tel Aviv University
March 2016
Last updated: March 28, 2017
Algorithms in Action
Fast Fourier Transform
Slide2
String Matching
Pattern: abracadabra
Text: abraabracadabracadabraabara
Given a text of length $n$ and a pattern of length $m$, find all occurrences of the pattern in the text.
The naïve algorithm runs in $O(mn)$ time.
Several classical algorithms run in $O(m+n)$ time. [Knuth-Morris-Pratt (1977)] [Boyer-Moore (1977)]
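As a point of reference for the bounds above, here is a minimal sketch of the naïve algorithm. The function name `naive_find` is illustrative and does not come from the slides.

```python
def naive_find(text, pattern):
    """Return all start positions where pattern occurs in text (O(mn) time)."""
    n, m = len(text), len(pattern)
    matches = []
    for k in range(n - m + 1):           # try every alignment of the pattern
        if text[k:k + m] == pattern:     # up to m character comparisons
            matches.append(k)
    return matches

print(naive_find("abraabracadabracadabraabara", "abracadabra"))  # [4, 11]
```

On the slide's example the pattern occurs at positions 4 and 11 (0-based), and the two occurrences overlap.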
Slide3
More String Matching Problems
Pattern: abracadabra
Text: abraabracadabracadabraabara
Count the number of matches/mismatches in each alignment of the pattern with the text.
(Find all alignments with at most $k$ mismatches.)
Allow a wildcard (“don’t care”) symbol ($*$) that matches any (single) symbol, in the pattern and/or the text.
“Traditional” string matching techniques are not so efficient for these extensions.
Slide4
(Cross-)Correlation
Slide5
(Cross-)Correlation
A convolution without the initial reversal, with a shift of indices.
The correlation of two vectors of length $n$ can be computed in $O(n \log n)$ time.
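To make the link to convolution concrete, the following sketch computes a correlation by reversing one vector and convolving with a textbook recursive FFT. It assumes the convention $c_k = \sum_i a_{i+k} b_i$ (other sources shift or flip the indices) and integer inputs, so that rounding recovers exact values. The names `fft`, `convolve` and `correlate` are illustrative.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)   # n-th root of unity
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def convolve(a, b):
    """Linear convolution of two integer sequences, via FFT."""
    size = 1
    while size < len(a) + len(b) - 1:
        size *= 2
    fa = fft(list(a) + [0] * (size - len(a)))
    fb = fft(list(b) + [0] * (size - len(b)))
    product = fft([x * y for x, y in zip(fa, fb)], inverse=True)
    return [round((v / size).real) for v in product[:len(a) + len(b) - 1]]

def correlate(a, b):
    """Return c with c[k + len(b) - 1] = sum_i a[i + k] * b[i]."""
    # Reversing b cancels the reversal that convolution performs.
    return convolve(a, list(reversed(b)))

print(correlate([1, 2, 3], [1, 1]))  # [1, 3, 5, 3]
```

The output list is offset by `len(b) - 1` because shifts with partial overlap start at $k = -(\text{len}(b) - 1)$.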
Slide6
(Cross-)Correlation (unequal lengths)
Slide7 to Slide13
(Cross-)Correlation [figure sequence; only the slide title survives]
Slide14
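(Cross-)Correlation
If $a$ is of length $n$ and $b$ of length $m$, where $m \le n$, then $c_k = \sum_{i=0}^{m-1} a_{i+k}\, b_i$, with out-of-range entries of $a$ taken as 0.
Exercise: The correlation of two vectors of length $n$ and $m$, where $m \le n$, can be computed in $O(n \log m)$ time.
Sometimes, only the values $c_0, \ldots, c_{n-m}$, corresponding to a full overlap of $b$ with a shift of $a$, are of interest.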
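One possible way to approach the exercise, offered as a sketch rather than the intended solution and assuming the target bound is $O(n \log m)$: cut the longer vector into overlapping blocks of length at most $2m-1$ and correlate each block with the shorter vector. There are about $n/m$ blocks, and each would cost $O(m \log m)$ with an FFT. In the code below the per-block step is evaluated directly to keep the example short. The function names and the convention $c_k = \sum_i a_{i+k} b_i$ are assumptions.

```python
def full_overlap_direct(a, b):
    """c[k] = sum_i a[i + k] * b[i] for all shifts k with full overlap."""
    n, m = len(a), len(b)
    return [sum(a[i + k] * b[i] for i in range(m)) for k in range(n - m + 1)]

def full_overlap_blocked(a, b, correlate_block=full_overlap_direct):
    """Same values, computed block by block.

    correlate_block is a placeholder for an FFT-based routine working on
    inputs of length O(m); here it defaults to the direct method.
    """
    n, m = len(a), len(b)
    out = []
    for start in range(0, n - m + 1, m):
        # Shifts start .. start + m - 1 only touch a[start : start + 2m - 1].
        block = a[start:start + 2 * m - 1]
        out.extend(correlate_block(block, b))
    return out

print(full_overlap_blocked([1, 2, 3, 4], [1, 1]))  # [3, 5, 7]
```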
Slide15
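Counting mismatches [Fischer-Paterson (1974)]
Let $\Sigma$ be the alphabet of the pattern $p_0 \ldots p_{m-1}$ and text $t_0 \ldots t_{n-1}$.
We may assume that $|\Sigma| \le m+1$. (Why?)
For every $\sigma \in \Sigma$ create two Boolean strings:
$P_\sigma[i] = 1$ iff $p_i = \sigma$
$T_\sigma[j] = 1$ iff $t_j \ne \sigma$
Correlation of $T_\sigma$ and $P_\sigma$ counts mismatches involving $\sigma$.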
Slide16
Counting mismatches
Pattern: abracadabra
Text: abraabracadabracadabraabara
$P_a$: 10010101001
$T_a$: 011001101010110101011001010
Slide17
Counting mismatches
Pattern: abracadabra
Text: abraabracadabracadabraabara
$P_a$: 10010101001
$T_a$: 011001101010110101011001010
(Highlighted: positions 5 and 10, where a 1 of $P_a$ is aligned with a 1 of $T_a$.)
Slide18
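Counting mismatches
Let $\Sigma$ be the alphabet of the pattern $p_0 \ldots p_{m-1}$ and text $t_0 \ldots t_{n-1}$.
We may assume that $|\Sigma| \le m+1$. (Why?)
For every $\sigma \in \Sigma$ create two Boolean strings:
$P_\sigma[i] = 1$ iff $p_i = \sigma$
$T_\sigma[j] = 1$ iff $t_j \ne \sigma$
Correlation of $T_\sigma$ and $P_\sigma$ counts mismatches involving $\sigma$.
Summing over all $\sigma \in \Sigma$ we get the total no. of mismatches.
Complexity: $O(|\Sigma|\, n \log m)$ word operations.
(Each word assumed to hold $O(\log m)$ bits.)
Fast only if $|\Sigma|$ is small.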
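The per-symbol scheme can be sketched as follows. The Boolean strings are built the way the example bit strings on the slides suggest: a pattern bit is 1 where the pattern has the symbol, and a text bit is 1 where the text does not. The correlation is evaluated directly in $O(nm)$ time as a stand-in for an FFT-based one, and the function names are illustrative.

```python
def correlate_full_overlap(a, b):
    """c[k] = sum_i a[i + k] * b[i] for every shift k with full overlap.

    Direct evaluation, standing in for an FFT-based correlation.
    """
    n, m = len(a), len(b)
    return [sum(a[i + k] * b[i] for i in range(m)) for k in range(n - m + 1)]

def count_mismatches(text, pattern):
    """Number of mismatches at every alignment, one correlation per symbol."""
    total = [0] * (len(text) - len(pattern) + 1)
    for sigma in set(pattern):            # symbols absent from the pattern add 0
        p_sigma = [1 if c == sigma else 0 for c in pattern]   # pattern has sigma
        t_sigma = [1 if c != sigma else 0 for c in text]      # text lacks sigma
        for k, v in enumerate(correlate_full_overlap(t_sigma, p_sigma)):
            total[k] += v
    return total

counts = count_mismatches("abraabracadabracadabraabara", "abracadabra")
print([k for k, c in enumerate(counts) if c == 0])  # [4, 11]
```

Alignments with a count of 0 are exactly the occurrences of the pattern.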
Slide19
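Counting mismatches with wildcards [Fischer-Paterson (1974)]
For every $\sigma \in \Sigma$ create two Boolean strings:
$P_\sigma[i] = 1$ iff $p_i = \sigma$
$T_\sigma[j] = 1$ iff $t_j \ne \sigma$ and $t_j \ne *$
Complexity: $O(|\Sigma|\, n \log m)$ word operations.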
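A minimal sketch of the wildcard variant, under the assumption that `*` is the wildcard and that a wildcard on either side never counts as a mismatch. Wildcards in the pattern get a 0 in every pattern string, and wildcards in the text get a 0 in every text string. The correlation is again computed directly as a stand-in for the FFT-based version, and the names are illustrative.

```python
WILDCARD = "*"

def count_mismatches_wild(text, pattern):
    """Mismatches at every alignment, ignoring positions that hit a wildcard."""
    n, m = len(text), len(pattern)
    total = [0] * (n - m + 1)
    for sigma in set(pattern) - {WILDCARD}:
        p_sigma = [int(c == sigma) for c in pattern]
        t_sigma = [int(c != sigma and c != WILDCARD) for c in text]
        for k in range(n - m + 1):
            # Direct correlation term; an FFT would compute all k at once.
            total[k] += sum(t_sigma[i + k] * p_sigma[i] for i in range(m))
    return total

counts = count_mismatches_wild("abraabra*adabracadabraabara", "abracada*ra")
print([k for k, c in enumerate(counts) if c == 0])  # [4, 11]
```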
Slide20
Counting mismatches with wildcards
Pattern: abracada*ra
Text: abraabra*adabracadabraabara
$P_a$: 10010101001
$T_a$: 011001100010110101011001010
Pattern: abracada*ra
Text: abraabraca*abracadabraabara
$P_a$: 10010101001
$T_a$: 011001101000110101011001010
Slide21
Counting mismatches with wildcards
If we only want to find exact matches, replace each character by a specific bit string.
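The encoding step might look like the sketch below. The fixed code width, the assignment of codes in sorted order, and the choice to expand a wildcard into a run of wildcard bits are all assumptions made for illustration. Mismatches between the bit strings are counted directly here rather than with correlations.

```python
def encode(s, alphabet, wildcard="*"):
    """Replace each character by a fixed-width bit string (wildcard stays wild)."""
    width = max(1, (len(alphabet) - 1).bit_length())   # bits per character
    code = {c: format(i, "0{}b".format(width))
            for i, c in enumerate(sorted(alphabet))}
    bits = "".join(wildcard * width if c == wildcard else code[c] for c in s)
    return bits, width

def exact_matches(text, pattern, wildcard="*"):
    """Positions where the encoded pattern matches the encoded text."""
    alphabet = (set(text) | set(pattern)) - {wildcard}
    t_bits, width = encode(text, alphabet, wildcard)
    p_bits, _ = encode(pattern, alphabet, wildcard)
    hits = []
    # Only shifts that are aligned to character boundaries are meaningful.
    for k in range(0, len(t_bits) - len(p_bits) + 1, width):
        mismatches = sum(1 for x, y in zip(t_bits[k:], p_bits)
                         if x != y and wildcard not in (x, y))
        if mismatches == 0:
            hits.append(k // width)
    return hits

print(exact_matches("abraabra*adabracadabraabara", "abracada*ra"))  # [4, 11]
```

Since distinct characters receive distinct codes, two characters match exactly when their bit strings have zero mismatches.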
Slide22
Counting mismatches with wildcards
Count mismatches of the binary strings as before (2 convolutions).
A result of 0 corresponds to a match.
Complexity drops to $O(n \log m \log |\Sigma|)$.
Can we get rid of the dependence on $|\Sigma|$?
Slide23
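$L_2$-matching [Lipsky-Porat (2011)]
Suppose that each “character” is a real number.
We want to find approximate matches.
Standard string matching uses the Hamming distance.
Two characters either match or they do not.
A value is not closer to a nearby value than to a distant one.
(Squared) $L_2$-distance: for each $k$ we want to compute $\sum_{i=0}^{m-1} (t_{i+k} - p_i)^2$, where $p_0 \ldots p_{m-1}$ is the pattern and $t_0 \ldots t_{n-1}$ is the text.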
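$L_2$-matching [Lipsky-Porat (2011)]
$\sum_{i=0}^{m-1} (t_{i+k} - p_i)^2 = \sum_{i=0}^{m-1} p_i^2 - 2 \sum_{i=0}^{m-1} t_{i+k}\, p_i + \sum_{i=0}^{m-1} t_{i+k}^2$
$\sum_i p_i^2$: Constant. $O(m)$ time.
$\sum_i t_{i+k}\, p_i$: Correlation. $O(n \log m)$ time.
$\sum_i t_{i+k}^2$: Easy in $O(n)$ time.
$L_2$-matching can be computed in $O(n \log m)$ time.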
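The three-term split can be sketched as below, assuming the quantity of interest is the squared difference summed over each window. The pattern term is a constant, the window term comes from prefix sums, and the cross term is the correlation. It is computed directly here, where an FFT would be used to reach the stated bound. The function name is illustrative.

```python
def l2_distances(text, pattern):
    """Sum of squared differences between pattern and every text window."""
    n, m = len(text), len(pattern)
    pattern_sq = sum(p * p for p in pattern)      # constant term, O(m)
    prefix = [0]
    for t in text:                                # prefix sums of t^2, O(n)
        prefix.append(prefix[-1] + t * t)
    result = []
    for k in range(n - m + 1):
        window_sq = prefix[k + m] - prefix[k]     # sum of t[i+k]^2 over the window
        # Cross term: the correlation of text and pattern at shift k.
        cross = sum(text[i + k] * pattern[i] for i in range(m))
        result.append(window_sq - 2 * cross + pattern_sq)
    return result

print(l2_distances([1, 2, 3, 4], [2, 3]))  # [2, 0, 2]
```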
Slide25
Exact matches with wildcards [Clifford-Clifford (2007)]
Replace each character by a positive integer.
Replace the wildcard by 0.
For each $k$ compute $\sum_{i=0}^{m-1} p_i\, t_{i+k}\, (p_i - t_{i+k})^2$.
There is an exact match at position $k$ iff this sum is 0.
Slide26
Exact matches with wildcards [Clifford-Clifford (2007)]
Compute three correlations of appropriate sequences in $O(n \log m)$ time.
Running time is independent of $|\Sigma|$!
Assuming that each character fits in an $O(\log m)$-bit word and that operations on such words take constant time.
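A sketch of how the three correlations could be organised, assuming the sum being tested is $\sum_i p_i\, t_{i+k}\, (p_i - t_{i+k})^2$, which expands to $\sum_i p_i^3 t_{i+k} - 2\sum_i p_i^2 t_{i+k}^2 + \sum_i p_i t_{i+k}^3$. The integer codes are assigned in order of first appearance, which is an arbitrary choice. The helper `corr` evaluates each correlation directly and stands in for an FFT-based routine. All names are illustrative.

```python
def wildcard_matches(text, pattern, wildcard="*"):
    """Positions where pattern matches text, with wildcards on either side."""
    codes = {}

    def code(c):
        # Wildcard -> 0, every other character -> a distinct positive integer.
        return 0 if c == wildcard else codes.setdefault(c, len(codes) + 1)

    t = [code(c) for c in text]
    p = [code(c) for c in pattern]
    n, m = len(t), len(p)

    def corr(a, b):
        """Full-overlap correlation, computed directly (FFT stand-in)."""
        return [sum(a[i + k] * b[i] for i in range(m)) for k in range(n - m + 1)]

    c1 = corr(t, [x ** 3 for x in p])                     # sum p^3 * t
    c2 = corr([x ** 2 for x in t], [x ** 2 for x in p])   # sum p^2 * t^2
    c3 = corr([x ** 3 for x in t], p)                     # sum p * t^3
    return [k for k in range(n - m + 1) if c1[k] - 2 * c2[k] + c3[k] == 0]

print(wildcard_matches("abraabra*adabracadabraabara", "abracada*ra"))  # [4, 11]
```

Every term of the sum is non-negative, and a term is 0 exactly when the two symbols agree or one of them is a wildcard, so the sum is 0 only at exact matches.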