Slide1
Tight Lower Bounds for Data-Dependent Locality-Sensitive Hashing
Alexandr Andoni (Columbia)
Ilya Razenshteyn (MIT CSAIL)
Slide2
Near Neighbor Search
Dataset: $n$ points in $\mathbb{R}^d$, distance threshold $r > 0$
Goal: a data point within $r$ from a query
Space, query time
Example: $d = 2$, Euclidean distance: $O(n)$ space, $O(\log n)$ time
Infeasible for large $d$ (“curse of dimensionality”):
Space exponential in the dimension
Lots of high-dimensional applications
Slide3
Approximate Near Neighbor Search (ANN)
Given: $n$ points in $\mathbb{R}^d$, distance threshold $r > 0$, approximation $c > 1$
Query: a point within $r$ from a data point
Want: a data point within $cr$ from the query
Slide4
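Locality-Sensitive Hashing (LSH)
Until recently: the only approach to ANN with … memory and … dependence (Indyk-Motwani 1998)
Main idea: random partitions of $\mathbb{R}^d$ s.t. closer pairs of points collide more often
A distribution over partitions $\mathcal{P}$ is $(r, cr, p_1, p_2)$-sensitive if for every $x, y \in \mathbb{R}^d$:
If $\|x - y\| \le r$, then $\Pr[\mathcal{P}(x) = \mathcal{P}(y)] \ge p_1$
If $\|x - y\| \ge cr$, then $\Pr[\mathcal{P}(x) = \mathcal{P}(y)] \le p_2$
$r$ and $cr$: from the definition of ANN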
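To make the sensitivity condition concrete, here is a minimal sketch that uses bit sampling on the Hamming cube, the classic family from (Indyk-Motwani 1998): a random partition groups points by the value of one random coordinate, so two points at Hamming distance $t$ in $\{0,1\}^d$ collide with probability $1 - t/d$. The function names and the parameters ($d = 64$, distances 8 and 32) are illustrative assumptions, not taken from the slides.

```python
import random

def hamming(x, y):
    """Number of coordinates where the bit vectors x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def sample_partition(d, rng):
    """Draw one random partition of {0,1}^d: a point's part is one random coordinate."""
    i = rng.randrange(d)
    return lambda x: x[i]

def collision_rate(x, y, trials, rng):
    """Estimate Pr[P(x) = P(y)] over freshly sampled partitions P."""
    hits = 0
    for _ in range(trials):
        part = sample_partition(len(x), rng)
        hits += part(x) == part(y)
    return hits / trials

rng = random.Random(0)
d = 64
x = [rng.randrange(2) for _ in range(d)]
near = [b ^ (i < 8) for i, b in enumerate(x)]    # Hamming distance 8 from x
far = [b ^ (i < 32) for i, b in enumerate(x)]    # Hamming distance 32 from x

near_rate = collision_rate(x, near, 20000, rng)  # exact value: 1 - 8/64 = 0.875
far_rate = collision_rate(x, far, 20000, rng)    # exact value: 1 - 32/64 = 0.5
print(near_rate, far_rate)
```

With $r = 8$ and $c = 4$, this family is $(8, 32, 0.875, 0.5)$-sensitive in the sense defined above; the estimates printed by the sketch should land close to those two probabilities.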
Slide5
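From LSH to ANN
Efficient $(r, cr, p_1, p_2)$-sensitive partitions imply $(c, r)$-ANN with space $O(n^{1+\rho} + dn)$ and query time $O(d \cdot n^{\rho})$, where $\rho = \frac{\log(1/p_1)}{\log(1/p_2)}$
Sample $k \approx \log_{1/p_2} n$ partitions; given a query, retrieve data points that collide w.r.t. all partitions
Repeat $L \approx n^{\rho}$ times to boost success probability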
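The reduction can be sketched as a small index: each of `L` tables hashes a point by `k` sampled partitions at once, and a query scans only its own bucket in each table. This is a minimal sketch under stated assumptions: it uses bit sampling on $\{0,1\}^d$ as the partition family, the class and parameter names are hypothetical, and the choices $k = \lceil \log_{1/p_2} n \rceil$ and $L = \lceil n^{\rho}/p_1 \rceil$ follow the standard analysis rather than anything stated on the slide.

```python
import math
import random
from collections import defaultdict

def hamming(x, y):
    """Number of coordinates where the bit vectors x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def rho(p1, p2):
    """The LSH exponent log(1/p1) / log(1/p2)."""
    return math.log(1 / p1) / math.log(1 / p2)

class BitSamplingIndex:
    """L hash tables; each table keys a point by k randomly sampled coordinates."""

    def __init__(self, points, k, L, seed=0):
        rng = random.Random(seed)
        d = len(points[0])
        self.points = points
        self.coords = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = []
        for coords in self.coords:
            table = defaultdict(list)
            for idx, p in enumerate(points):
                table[tuple(p[i] for i in coords)].append(idx)
            self.tables.append(table)

    def query(self, q, radius):
        """Return the index of some colliding point within `radius` of q, or None."""
        for coords, table in zip(self.coords, self.tables):
            key = tuple(q[i] for i in coords)
            for idx in table.get(key, ()):
                if hamming(self.points[idx], q) <= radius:
                    return idx
        return None

rng = random.Random(1)
n, d, r, c = 200, 64, 4, 4
points = [[rng.randrange(2) for _ in range(d)] for _ in range(n)]
p1, p2 = 1 - r / d, 1 - c * r / d              # bit-sampling collision probabilities
k = math.ceil(math.log(n) / math.log(1 / p2))  # partitions per table
L = math.ceil(n ** rho(p1, p2) / p1)           # number of tables
index = BitSamplingIndex(points, k, L)
hit = index.query(points[7], c * r)            # querying a data point itself always succeeds
print(k, L, hit)
```

A query that is itself a data point collides with that point in every table, so it is always answered. For a query at distance up to $r$ from a data point, each table succeeds only with probability about $p_1^k$, which is why the `L` repetitions are needed.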
Slide6
Bounds on LSH
Distance metric                  $\rho$                   $c = 2$   Reference
Euclidean ($\ell_2$)             $\le 1/c^2 + o(1)$       1/4       (Andoni-Indyk 2006)
                                 $\ge 1/c^2 - o(1)$                 (O’Donnell-Wu-Zhou 2011)
Manhattan, Hamming ($\ell_1$)    $\le 1/c$                1/2       (Indyk-Motwani 1998)
                                 $\ge 1/c - o(1)$                   (O’Donnell-Wu-Zhou 2011)
Space $O(n^{1+\rho} + dn)$, query time $O(d \cdot n^{\rho})$
Can one improve upon LSH?
Yes! (Andoni-Indyk-Nguyen-Razenshteyn 2014), (Andoni-Razenshteyn 2015)
Slide7
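How to do better than LSH?
Main idea: data-dependent space partitions
A distribution over partitions $\mathcal{P}$ is $(r, cr, p_1, p_2)$-sensitive if for every $x, y \in \mathbb{R}^d$:
If $\|x - y\| \le r$, then $\Pr[\mathcal{P}(x) = \mathcal{P}(y)] \ge p_1$
If $\|x - y\| \ge cr$, then $\Pr[\mathcal{P}(x) = \mathcal{P}(y)] \le p_2$
Too strong! Enough to satisfy these conditions when $x$ is a data point and $y$ is a query
Exploit the geometry of the dataset to design better partitions
Able to obtain an improvement for every dataset
(Andoni-Indyk-Nguyen-Razenshteyn 2014): slightly bypass the LSH lower bound for large $c$
(Andoni-Razenshteyn 2015): significant improvement upon LSH for all $c$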
Slide8
Bounds on data-dependent LSH
Distance metric                  $\rho$                      $c = 2$   Reference
Euclidean ($\ell_2$)             $\le 1/c^2 + o(1)$          1/4       (Andoni-Indyk 2006)
                                 $\ge 1/c^2 - o(1)$                    (O’Donnell-Wu-Zhou 2011)
                                 $\le 1/(2c^2-1) + o(1)$     1/7       (Andoni-Razenshteyn 2015)
Manhattan, Hamming ($\ell_1$)    $\le 1/c$                   1/2       (Indyk-Motwani 1998)
                                 $\ge 1/c - o(1)$                      (O’Donnell-Wu-Zhou 2011)
                                 $\le 1/(2c-1) + o(1)$       1/3       (Andoni-Razenshteyn 2015)
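As a quick check of the $c = 2$ column, the sketch below evaluates the four upper-bound exponents exactly, dropping the $o(1)$ terms. It assumes the standard formulas from the cited papers: $1/c^2$ and $1/c$ for data-independent LSH (Andoni-Indyk 2006, Indyk-Motwani 1998), and $1/(2c^2-1)$ and $1/(2c-1)$ for the data-dependent scheme of (Andoni-Razenshteyn 2015). The function name is hypothetical.

```python
from fractions import Fraction

def exponents(c):
    """Upper-bound exponents rho (o(1) terms dropped) for approximation factor c."""
    c = Fraction(c)
    return {
        ("Euclidean", "data-independent"): 1 / c**2,
        ("Euclidean", "data-dependent"): 1 / (2 * c**2 - 1),
        ("Hamming", "data-independent"): 1 / c,
        ("Hamming", "data-dependent"): 1 / (2 * c - 1),
    }

for (metric, kind), value in exponents(2).items():
    print(f"{metric:10s} {kind:17s} rho = {value}")
```

For $c = 2$ this yields 1/4 and 1/7 for the Euclidean distance, and 1/2 and 1/3 for the Hamming distance.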
Space $O(n^{1+\rho} + dn)$, query time $O(d \cdot n^{\rho})$
The two (Andoni-Razenshteyn 2015) bounds: optimal!
Slide9
The main result
The data-dependent space partitions for the Euclidean and Manhattan/Hamming distances from (Andoni-Razenshteyn 2015) are optimal*
* After proper formalization
Slide10
Hard instance
Hamming distance
Dataset: $n$ random vectors from $\{0, 1\}^d$, think $d = \omega(\log n)$
Queries: flip each bit of some data point with probability $\frac{1}{2c}$
Distances: $\approx \frac{d}{2c}$ to the original data point, $\approx \frac{d}{2}$ to other data points
Any $c$-approximate data structure must recover the original data point
The goal: even if we see the dataset in advance, $\rho \ge \frac{1}{2c-1} - o(1)$
(implies the lower bound for the Euclidean distance as well)
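The hard instance is easy to simulate. The sketch below assumes that each bit of the planted point is flipped independently with probability $1/(2c)$, which is the choice that places the query at distance about $d/(2c)$ from the planted point and about $d/2$ from every other random point. The function name and the sizes ($n = 200$, $d = 2000$, $c = 2$) are illustrative assumptions.

```python
import random

def hamming(x, y):
    """Number of coordinates where the bit vectors x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def hard_instance(n, d, c, rng):
    """Random dataset in {0,1}^d plus a query planted near one data point."""
    data = [[rng.randrange(2) for _ in range(d)] for _ in range(n)]
    planted = rng.randrange(n)
    flip_prob = 1 / (2 * c)  # assumed flip probability
    query = [b ^ (rng.random() < flip_prob) for b in data[planted]]
    return data, planted, query

rng = random.Random(0)
n, d, c = 200, 2000, 2
data, planted, query = hard_instance(n, d, c, rng)
dists = [hamming(query, p) for p in data]
nearest = min(range(n), key=dists.__getitem__)

print("planted distance:", dists[planted], "expected about", d / (2 * c))
print("typical other distance: expected about", d / 2)
print("nearest point is the planted one:", nearest == planted)
```

Because the two distance scales differ by a factor of about $c$, returning any point within roughly $c$ times the planted distance amounts to returning the planted point itself.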
Slide11
Fine print
Voronoi diagram: a perfect partition
Useless: hard to locate points
Need to define what is allowed properly to rule it out
Slide12
Formalizing the model
Restricted computational complexity: data structure lower bounds
Bounded number of parts: can tweak the Voronoi diagram example
Slide13
The main result
Any data structure for the hard instance with $d = \omega(\log n)$ based on data-dependent LSH:
Either has the exponent $\rho \ge \frac{1}{2c-1} - o(1)$
Or the space partitions occupy space $n^{1-o(1)}$ (captures Voronoi diagrams)
The point location time is proportional to the space a partition occupies (for all known constructions)
For $d = O(\log n)$, a better exponent is possible (Becker-Ducas-Gama-Laarhoven 2016)
Slide14
Proof outline
Improve (Motwani-Naor-Panigrahy 2007) to get a tight exponent for data-independent partitions of the hard instance
Hypercontractivity estimates
Data-dependent lower bound
Concentration and union bound (not too many distinct partitions)
Slide15
Conclusions
Tight data-dependent LSH lower bound
(Andoni-Laarhoven-Razenshteyn-Waingarten 2016): for Hamming space
[Table: space vs. query time]
Can prove matching data-independent lower bounds for the hard instance (in an appropriate model)
What about data-dependent?
Questions?