Slide 1

Sketching, Sampling and other Sublinear Algorithms:
Euclidean space: dimension reduction and NNS

Alex Andoni
(MSR SVC)
Slide 2

A Sketching Problem

Sketching: a map $F$ from objects to short bit-strings such that, given $F(x)$ and $F(y)$ alone, one can deduce whether $x$ and $y$ are "similar".
Why? To reduce the space and time needed to compute similarity.

Examples:
- 010110 vs. 010101: similar?
- "To be or not to be" vs. "To sketch or not to sketch": similar?
Slide 3

Sketch from LSH

LSH often has the property that $\Pr[h(x) = h(y)]$ is a fixed function of the distance between $x$ and $y$.
Sketching from LSH: set $F(x) = (h_1(x), \dots, h_k(x))$ for $k$ independently chosen LSH functions.
Estimate the distance by the fraction of collisions between $F(x)$ and $F(y)$; $k$ controls the variance of the estimate.
[Broder'97]: min-wise hashing for the Jaccard coefficient.
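A minimal sketch of this estimator in Python, using min-wise hashing for the Jaccard coefficient in the spirit of [Broder'97]; using Python's built-in hash as the hash family and choosing k = 128 are illustrative assumptions, not details from the slides:

```python
import random

def minhash_sketch(items, seeds):
    # One min-wise hash value per seed: h_i(x) = min over elements of a seeded hash.
    return [min(hash((seed, e)) for e in items) for seed in seeds]

def estimate_jaccard(sk_a, sk_b):
    # The fraction of collisions estimates Pr[h(A) = h(B)] = Jaccard(A, B).
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

k = 128  # number of hash functions; larger k lowers the variance
seeds = [random.randrange(2**32) for _ in range(k)]

A = set("to be or not to be".split())
B = set("to sketch or not to sketch".split())
print(estimate_jaccard(minhash_sketch(A, seeds), minhash_sketch(B, seeds)))
print(len(A & B) / len(A | B))  # exact Jaccard coefficient, for comparison
```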
Slide 4

General Theory: embeddings

The above map $F$ is an embedding.
General motivation: given a distance (metric) $d$, solve a computational problem under $d$.

Distances:
- Euclidean distance ($\ell_2$)
- Hamming distance
- Edit distance between two strings
- Earth-Mover (transportation) Distance

Problems:
- Compute the distance between two points
- Diameter / closest pair of a set $S$
- Clustering, MST, etc.
- Nearest Neighbor Search

Approach: reduce problem <P under hard metric> to <P under simpler metric> via a map $f$.
Slide 5

Embeddings: landscape

Definition: an embedding is a map $f : M \to M'$ of a metric $(M, d_M)$ into a host metric $(M', d_{M'})$ such that for any $x, y \in M$:
    $d_M(x, y) \le d_{M'}(f(x), f(y)) \le c \cdot d_M(x, y)$,
where $c \ge 1$ is the distortion (approximation) of the embedding.

Embeddings come in all shapes and colors:
- Source/host spaces
- Distortion $c$
- Can be randomized: the guarantee holds with probability $1 - \delta$
- Time to compute $f$

Types of embeddings:
- From a norm to the same norm but of lower dimension (dimension reduction)
- From non-norms (edit distance, Earth-Mover Distance) into a norm ($\ell_1$)
- From a given finite metric (shortest path on a planar graph) into a norm ($\ell_1$)
- Not a metric but a computational procedure: sketches
Slide 6

Dimension Reduction

Johnson-Lindenstrauss Lemma: there is a (random) linear map $A : \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\epsilon^{-2} \log 1/\delta)$ that preserves the distance between two fixed vectors up to distortion $1 + \epsilon$ with probability $1 - \delta$ ($\delta$ some small constant).
Preserves distances among $n$ points for $k = O(\epsilon^{-2} \log n)$ (take $\delta \ll 1/n^2$ and apply a union bound over all pairs).

Motivation, e.g.: diameter of a pointset $S$ of $n$ points in $d$-dimensional Euclidean space.
- Trivially: $O(n^2 d)$ time.
- Using the lemma: $O(n^2 k + n d k)$ time for a $1 + \epsilon$ approximation, with $k = O(\epsilon^{-2} \log n)$.

MANY applications: nearest neighbor search, streaming, pattern matching, approximation algorithms (clustering), ...
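A worked illustration of the diameter application in Python/NumPy, using the random-projection map that the following slides develop; the point set, the constant 4 in the choice of k, and all sizes are arbitrary assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 500, 5_000, 0.2
k = int(np.ceil(4 * np.log(n) / eps**2))  # k = O(eps^-2 log n); the constant 4 is illustrative

X = rng.standard_normal((n, d))               # n points in R^d
A = rng.standard_normal((k, d)) / np.sqrt(k)  # random projection, rescaled by 1/sqrt(k)
Y = X @ A.T                                   # projected points in R^k

def diameter(P):
    # Naive all-pairs diameter via the Gram matrix: O(n^2 * dim) time.
    sq = (P ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * P @ P.T
    return np.sqrt(d2.clip(min=0)).max()

print(diameter(Y) / diameter(X))  # should land within 1 +/- eps
```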
Slide 7

Main intuition

The map can be simply a projection onto a random subspace of dimension $k$.
Slide 8

1D embedding

How about one dimension ($k = 1$)?
Map $f(x) = g \cdot x = \sum_i g_i x_i$, where the $g_i$ are i.i.d. normal (Gaussian) random variables:
pdf$(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}$, $\mathbb{E}[g_i] = 0$, $\mathbb{E}[g_i^2] = 1$.

Why Gaussian?
Stability property: $\sum_i g_i x_i$ is distributed as $\|x\| \cdot g'$, where $g'$ is also Gaussian.
Equivalently: $g = (g_1, \dots, g_d)$ is spherically distributed, i.e., it has a random direction, and the projection onto a random direction depends only on the length of $x$.
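A quick empirical check of the stability property (the fixed vector, its dimension, and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
x = rng.standard_normal(d)             # an arbitrary fixed vector in R^d
G = rng.standard_normal((100_000, d))  # many independent draws of g
samples = G @ x                        # f(x) = g . x for each draw

# By stability, g . x is Gaussian with mean 0 and standard deviation ||x||.
print(samples.mean(), samples.std(), np.linalg.norm(x))
```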
Slide 9

1D embedding

Map $f(x) = g \cdot x$.
For any $x, y$, $f$ is linear: $f(x) - f(y) = f(x - y)$.
Want: $|f(x) - f(y)| \approx \|x - y\|$.

Claim: for any $x \in \mathbb{R}^d$, the estimator $f(x)^2$ has
- Expectation: $\mathbb{E}[f(x)^2] = \|x\|^2$
- Standard deviation: $\sqrt{2}\,\|x\|^2$

Proof: it suffices to prove the claim for $\|x\| = 1$, since $f$ is linear. The expectation is spelled out below, using $\mathbb{E}[g_i] = 0$ and $\mathbb{E}[g_i^2] = 1$.
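The expectation step in full (a standard calculation, reconstructed here rather than copied from the slide):

```latex
\mathbb{E}\!\left[f(x)^2\right]
  = \mathbb{E}\Big[\Big(\textstyle\sum_i g_i x_i\Big)^2\Big]
  = \sum_i x_i^2\,\mathbb{E}\!\left[g_i^2\right]
    + \sum_{i \neq j} x_i x_j\,\mathbb{E}[g_i]\,\mathbb{E}[g_j]
  = \sum_i x_i^2
  = \|x\|^2,
```

using the independence of the $g_i$, $\mathbb{E}[g_i] = 0$, and $\mathbb{E}[g_i^2] = 1$.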
Slide 10

Full Dimension Reduction

Just repeat the 1D embedding $k$ times!
$f(x) = \frac{1}{\sqrt{k}} A x$, where $A$ is a $k \times d$ matrix of i.i.d. Gaussian random variables.
Want to prove: $\|f(x) - f(y)\| = (1 \pm \epsilon) \cdot \|x - y\|$ with probability $1 - \delta$.
Since $f$ is linear, it is OK to prove this for a fixed $z = x - y$ with $\|z\| = 1$.
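A minimal sketch of this construction in NumPy (the dimensions d and k are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 10_000, 200
A = rng.standard_normal((k, d))  # k x d matrix of i.i.d. Gaussians

def f(x):
    # k independent copies of the 1D embedding, rescaled by 1/sqrt(k).
    return (A @ x) / np.sqrt(k)

x, y = rng.standard_normal(d), rng.standard_normal(d)
print(np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y))  # ~ 1 +/- O(1/sqrt(k))
```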
Slide 11

Concentration

$\|f(z)\|^2 = \frac{1}{k} \sum_{i=1}^{k} (g_i \cdot z)^2$, where each $g_i \cdot z$ is distributed as a standard Gaussian (for $\|z\| = 1$).
A sum of $k$ squared standard Gaussians is called the chi-squared distribution with $k$ degrees of freedom.
Fact: chi-squared is very well concentrated: it equals $(1 \pm \epsilon) \cdot k$ with probability $1 - e^{-\Omega(\epsilon^2 k)}$.
Akin to the central limit theorem.
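An empirical look at this concentration (the parameters k and the number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
k, trials = 500, 20_000
# Chi-squared with k degrees of freedom: a sum of k squared standard Gaussians.
chi2 = (rng.standard_normal((trials, k)) ** 2).sum(axis=1)
# Normalized by k it concentrates around 1, with standard deviation sqrt(2/k).
print((chi2 / k).mean(), (chi2 / k).std(), np.sqrt(2 / k))
```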
Slide 12

Dimension Reduction: wrap-up

$\|f(x) - f(y)\| = (1 \pm \epsilon) \cdot \|x - y\|$ with high probability.

Extras:
- Linear: can update $f(x)$ as $x$ changes.
- Can use $\pm 1$ entries instead of Gaussians [AMS'96, Ach'01, TZ'04, ...].
- Fast JL: can compute $f(x)$ faster than $O(dk)$ time [AC'06, AL'07,'09, DKS'10, KN'10,'12, ...].
Slide 13

NNS for Euclidean space

Can use dimensionality reduction to get LSH for $\ell_2$.
LSH function [Datar-Immorlica-Indyk-Mirrokni'04]:
- Pick a random line and quantize: project the point onto the line and round.
- $h(p) = \lfloor (g \cdot p + b) / w \rfloor$, where $g$ is a random Gaussian vector, $b$ is random in $[0, w)$, and $w$ is a parameter (e.g., 4).
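A minimal sketch of this hash family in NumPy, with the slide's example value w = 4; the helper name make_lsh and all test parameters are illustrative assumptions:

```python
import numpy as np

def make_lsh(d, w=4.0, rng=None):
    # h(p) = floor((g . p + b) / w): project onto a random Gaussian direction,
    # shift by a random offset b in [0, w), and quantize with cell width w.
    rng = rng or np.random.default_rng()
    g = rng.standard_normal(d)
    b = rng.uniform(0.0, w)
    return lambda p: int(np.floor((g @ p + b) / w))

rng = np.random.default_rng(4)
h = make_lsh(d=100, rng=rng)
p = rng.standard_normal(100)
q = p + 0.01 * rng.standard_normal(100)  # a point very close to p
print(h(p), h(q))  # nearby points collide with high probability
```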
Slide 14

Near-Optimal LSH [A-Indyk'06]

Regular grid → grid of balls: a point $p$ can hit empty space, so take more such grids until $p$ falls inside a ball.
Need (too) many grids of balls in the original dimension, so start by projecting into dimension $t$.
Choice of the reduced dimension $t$? There is a tradeoff between the number of hash tables, $n^{\rho}$, and the time to hash, $t^{O(t)}$.
Total query time: $d \cdot n^{1/c^2 + o(1)}$.
Slide 15

Open question

A more practical variant of the above hashing?
Design a space partitioning of $\mathbb{R}^t$ that is:
- efficient: point location in $\mathrm{poly}(t)$ time;
- qualitative: regions are "sphere-like", i.e.,
  $\Pr[\text{needle of length } 1 \text{ is not cut}] \ge \Pr[\text{needle of length } c \text{ is not cut}]^{1/c^2}$.
Slide 16

Time-Space Trade-offs

Space  | Query time | Comment                | Reference
-------+------------+------------------------+--------------------------
medium | medium     |                        | [IM'98], [DIIM'04, AI'06]
low    | high       |                        | [Ind'01, Pan'06]
high   | low        | one hash table lookup! | [KOR'98, IM'98, Pan'06]

Lower bounds:
- Space $n^{o(1/\epsilon^2)}$ requires $\omega(1)$ memory lookups [AIP'06].
- Space $n^{1+o(1/c^2)}$ requires $\omega(1)$ memory lookups [PTW'08, PTW'10].
Slide 17

NNS beyond LSH

Data-dependent partitions...

In practice, trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees, ... often with no guarantees.
In theory: can improve standard LSH by random data-dependent space partitions [A-Indyk-Nguyen-Razenshteyn'??]; tree-based approach to the max-norm ($\ell_\infty$).
Slide 18

Finale

Dimension Reduction in Euclidean space: the random projection $f(x) = \frac{1}{\sqrt{k}} A x$ preserves distances; only $O(\epsilon^{-2} \log n)$ dimensions suffice for distances among $n$ points!

NNS for Euclidean space:
- Random projections give LSH.
- Even better with ball partitioning.
- Or better with cool lattices?