Nearest Neighbor Search (3)
Alex Andoni (Columbia University)
MADALGO Summer School on Streaming Algorithms 2015
Time-Space Trade-offs (Euclidean)

  Space   | Time                  | Comment | Reference
  low     | high                  |         | [Ind'01, Pan'06]
  medium  | medium                | LSH     | [IM'98, DIIM'04], [AI'06]
  high    | low (1 memory lookup) |         | [KOR'98, IM'98, Pan'06], [AI'06]

Lower bounds:
- omega(1) memory lookups are needed [AIP'06], [PTW'08, PTW'10]
- LSH exponent lower bounds [MNP'06, OWZ'11]
Beyond Locality Sensitive Hashing

(space n^(1+rho), query time n^rho)

  Space     | Time    | Exponent          | Reference
Hamming space:
  n^(1+rho) | n^rho   | rho = 1/c         | LSH [IM'98]
            |         | rho >= 1/c        | LSH lower bound [MNP'06, OWZ'11]
  n^(1+rho) | n^rho   | rho = 1/(2c-1)    | data-dependent [AINR'14, AR'15]
Euclidean space:
  n^(1+rho) | n^rho   | rho = 1/c^2       | LSH [AI'06]
            |         | rho >= 1/c^2      | LSH lower bound [MNP'06, OWZ'11]
  n^(1+rho) | n^rho   | rho = 1/(2c^2-1)  | data-dependent [AINR'14, AR'15]
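The exponents in the table can be compared numerically. A small helper (the function name is illustrative) evaluating the standard formulas:

```python
def nns_exponents(c):
    """Exponent rho (query time ~ n^rho, space ~ n^(1+rho)) as a function
    of the approximation factor c, for the bounds in the table above."""
    return {
        "hamming_lsh": 1.0 / c,                          # [IM'98]; optimal for LSH [MNP'06, OWZ'11]
        "hamming_data_dependent": 1.0 / (2 * c - 1),     # [AINR'14, AR'15]
        "euclidean_lsh": 1.0 / c ** 2,                   # [AI'06]; optimal for LSH [MNP'06, OWZ'11]
        "euclidean_data_dependent": 1.0 / (2 * c ** 2 - 1),  # [AINR'14, AR'15]
    }

# For c = 2: LSH gives rho = 0.5 (Hamming) and 0.25 (Euclidean),
# while data-dependent hashing improves these to 1/3 and 1/7.
```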
New approach?
Data-dependent hashing:
- a random hash function, chosen after seeing the given dataset
- efficiently computable
A look at LSH lower bounds
LSH lower bounds in Hamming space are Fourier-analytic [Motwani-Naor-Panigrahy'06]:
- fix a distribution over hash functions
- far pair: random points at distance cr
- close pair: random points at distance r
- get rho >= 1/c - o(1) [O'Donnell-Wu-Zhou'11]
Why not an NNS lower bound?
Suppose we try to apply [OWZ'11] to NNS: pick a random dataset. Then all the "far points" are concentrated in a small ball. This is easy to detect at preprocessing: the actual near neighbor is close to the center of the minimum enclosing ball.
Construction of the hash function
Two components:
- a nice geometric structure: has a better LSH
- a reduction to such structure: data-dependent
Like a "Regularity Lemma" for a set of points.
[A.-Indyk-Nguyen-Razenshteyn'14, A.-Razenshteyn'15]
Nice geometric structure
Like a random dataset on a sphere, s.t. random points are at distance about cr.
Lemma: exponent rho = 1/(2c^2 - 1), via Cap Carving (which exploits the sphere's curvature).
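One level of cap carving can be sketched as follows (a minimal illustration: the function name, the number of caps, and the fixed threshold are all assumptions; the actual construction chooses the threshold as a function of the dimension):

```python
import math
import random

def make_cap_hash(dim, num_caps, threshold, seed=0):
    """One level of spherical cap carving (a sketch): sample random unit
    directions g_1, ..., g_T; a point p on the unit sphere goes to the
    first cap {x : <x, g_i> >= threshold} containing it, or to a
    leftover bucket if no cap does."""
    rng = random.Random(seed)

    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    dirs = [unit([rng.gauss(0.0, 1.0) for _ in range(dim)])
            for _ in range(num_caps)]

    def h(p):
        p = unit(p)  # project onto the unit sphere
        for i, g in enumerate(dirs):
            if sum(a * b for a, b in zip(p, g)) >= threshold:
                return i
        return num_caps  # leftover bucket
    return h
```

Nearby points on the sphere tend to fall into the same cap, while far (near-orthogonal) points tend to separate; this is where the curvature helps.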
Reduction to the nice structure
Idea: iteratively decrease the radius of the minimum enclosing ball.
Algorithm:
- find dense clusters (smaller radius, containing a large fraction of the points); recurse on the dense clusters
- apply Cap Carving on the rest; recurse on each "cap" (e.g., dense clusters might reappear)
Why is this ok? In the clusters, the radius decreases. In the rest, there are no dense clusters, so it is like a "random dataset" with radius about cr; even better!
(*picture not to scale & dimension)
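The recursive decomposition can be sketched as follows (all names, the 0.7 shrink factor, and the density threshold are illustrative assumptions; a random-hyperplane split stands in for the actual cap carving):

```python
import math
import random

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def find_dense_cluster(points, radius, frac):
    """A dense cluster (sketch): a ball of noticeably smaller radius,
    centered at a data point, containing at least a frac fraction of
    the current points."""
    need = max(2, int(frac * len(points)))
    smaller = 0.7 * radius  # illustrative shrink factor
    for c in points:
        ball = [p for p in points if dist(p, c) <= smaller]
        if len(ball) >= need:
            return ball, smaller
    return None

def build_tree(points, radius, frac=0.3, depth=0, max_depth=4):
    """Sketch of the recursive decomposition: peel off dense clusters and
    recurse on them with smaller radius; partition the remainder (here by
    a random hyperplane, standing in for cap carving) and recurse."""
    if len(points) <= 2 or depth >= max_depth:
        return {"leaf": points}
    rest = list(points)
    clusters = []
    while True:
        found = find_dense_cluster(rest, radius, frac)
        if found is None:
            break
        ball, r = found
        in_ball = set(ball)
        rest = [p for p in rest if p not in in_ball]
        clusters.append(build_tree(ball, r, frac, depth + 1, max_depth))
    rng = random.Random(depth)
    g = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
    caps = {}
    for p in rest:
        side = sum(a * b for a, b in zip(p, g)) >= 0.0
        caps.setdefault(side, []).append(p)
    return {"clusters": clusters,
            "caps": [build_tree(part, radius, frac, depth + 1, max_depth)
                     for part in caps.values()]}
```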
Hash function
Described by a tree (like a hash table).
(*picture not to scale & dimension)
Dense clusters
Current dataset: some radius. A dense cluster:
- contains a noticeable fraction of the points
- has smaller radius
After we remove all clusters: for any point on the surface, only few points remain within distance slightly below the diameter; the other points are essentially orthogonal. When applying Cap Carving, the empirical number of far points colliding with the query is then small. As long as this "impurity" is small enough, it doesn't matter!
Tree recap
During a query:
- recurse in all clusters
- descend into just one bucket in Cap Carving nodes
So the query will look in more than one leaf! How much branching? Claim: it is bounded. Each time we branch there are few clusters, and a cluster reduces the radius by a constant factor, so the cluster-depth is bounded.
Progress happens in 2 ways:
- cluster nodes reduce the radius
- Cap Carving nodes reduce the (empirical) number of far points
A tree succeeds with probability about n^(-rho); repeat over several trees to boost it.
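The query descent can be sketched on a toy tree (the node encoding and the bucket function are hypothetical; the point is that cluster nodes branch while cap nodes do not):

```python
def query(node, cap_hash, q):
    """Query descent (sketch): at a cluster node recurse into ALL children;
    at a cap-carving node follow only the query's own bucket; a leaf
    yields candidate points."""
    kind = node[0]
    if kind == "leaf":
        return list(node[1])
    if kind == "cluster":
        cands = []
        for child in node[1]:
            cands += query(child, cap_hash, q)
        return cands
    # kind == "cap": exactly one bucket is explored
    child = node[1].get(cap_hash(q))
    return query(child, cap_hash, q) if child is not None else []

# Toy tree: one cluster node over a cap node and a leaf.
tree = ("cluster", [
    ("cap", {0: ("leaf", [(0.0, 0.0)]), 1: ("leaf", [(9.0, 9.0)])}),
    ("leaf", [(5.0, 5.0)]),
])
bucket = lambda p: 0 if p[0] < 5 else 1  # stand-in for a cap-carving hash
```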
Fast preprocessing
How do we find the dense clusters fast?
Step 1: reduce to (roughly) quadratic time. It is enough to consider centers that are data points.
Step 2: reduce to near-linear time. Down-sample! This is ok because we only want clusters of noticeable size: after down-sampling by a suitable factor, such a cluster is still somewhat heavy in the sample.
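The down-sampling step is simple to illustrate (the sampling rate and cluster sizes below are made-up numbers for the demonstration):

```python
import random

def downsample(points, p, seed=0):
    """Keep each point independently with probability p. A cluster with
    s points survives with about p*s points in expectation, so heavy
    clusters remain detectable in the (much smaller) sample."""
    rng = random.Random(seed)
    return [pt for pt in points if rng.random() < p]

# 300 "cluster" points among 1000; after p = 0.2 we expect ~60 survivors.
data = [("cluster", i) for i in range(300)] + [("other", i) for i in range(700)]
sample = downsample(data, 0.2, seed=42)
```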
Other details
- In the analysis, instead of working with the "probability of collision with a far point," work with an "empirical estimate" (the actual number of colliding far points).
- A little delicate: there is an interplay with the "probability of collision with the close point." The empirical estimate matters only for the bucket the query falls into, so we need to condition on collision with the close point in the above empirical estimate.
- In dense clusters, points may appear inside the balls, whereas Cap Carving works for points on the sphere; we need to partition the balls into thin shells (which introduces more branching).
Data-dependent hashing wrap-up
- Dynamicity? Dynamization techniques [Overmars-van Leeuwen'81].
- Better bounds? [AR]: the exponent is optimal even for data-dependent hashing! This holds in the right formalization (to rule out the Voronoi diagram), where the description complexity of the hash function is bounded.
- High dimension: NNS for l_infinity: [Indyk'98] gets O(log log d) approximation (polynomial space, sublinear query time). Cf. l_infinity has no non-trivial sketch! Some matching lower bounds in the relevant model [ACP'08, KP'12]. Can be thought of as data-dependent hashing.
- NNS for any norm?
Beyond the l_p's?
Sketching/NNS for other distances?
Earth Mover Distance: given two sets A, B of points in a metric space,
EMD(A, B) = min-cost bipartite matching between A and B.
Applications in image vision: e.g., given two images, is their distance small or large?
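The definition of EMD as a min-cost bipartite matching can be computed directly for tiny sets (brute force over permutations; real implementations use min-cost flow or the Hungarian algorithm):

```python
import itertools
import math

def emd(A, B):
    """Exact EMD for two equal-size point sets: minimum-cost perfect
    matching between A and B, by brute force over all permutations."""
    assert len(A) == len(B)
    return min(
        sum(math.dist(a, b) for a, b in zip(A, perm))
        for perm in itertools.permutations(B)
    )
```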
Algorithms via embedding
Embedding: map sets into vectors in l_1, preserving the distance (approximately); then use algorithms for l_1!
Say, EMD over the grid [D]^2. Theorem [Cha02, IT03]: there exists a map f sending every set A into l_1 with distortion O(log D), i.e., for any A, B we have
  EMD(A, B) <= ||f(A) - f(B)||_1 <= O(log D) * EMD(A, B).
Sketch: an O(log D)-approximation in small space, via l_1 sketching.
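A minimal sketch of this grid-based embedding (the function names are illustrative, and the random grid shifts of the actual construction are omitted): impose grids of side 1, 2, 4, ..., D and record, per (level, cell), the number of points in the cell scaled by the cell side; the l_1 distance between such vectors approximates EMD up to O(log D).

```python
from collections import Counter

def grid_embedding(points, delta):
    """Sketch of a [Cha02, IT03]-style embedding of EMD over [delta]^2
    into l_1: for each grid of side 1, 2, 4, ..., delta, count points per
    cell, scaled by the side length. (Random shifts omitted.)"""
    vec = Counter()
    side = 1
    while side <= delta:
        for (x, y) in points:
            vec[(side, x // side, y // side)] += side
        side *= 2
    return vec

def l1(u, v):
    # Counter returns 0 for missing keys, so sparse vectors compare cleanly.
    return sum(abs(u[k] - v[k]) for k in set(u) | set(v))
```

For example, with delta = 8, singleton sets {(0,0)} and {(0,1)} differ only at the finest level, while {(0,0)} and {(7,7)} differ at every level except the coarsest.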
Embeddability into l_1

  Metric                                                      | Upper bound
  Earth-mover distance (s-sized sets in the 2D plane)         | O(log s) [Cha02, IT03]
  Earth-mover distance (s-sized sets in {0,1}^d)              | O(log s * log d) [AIK08]
  Edit distance over {0,1}^d (# indels to transform x into y) | 2^(O(sqrt(log d log log d))) [OR05]
  Ulam (edit distance between permutations)                   | O(log d) [CK06]
  Block edit distance                                         | O~(log d) [MS00, CM07]

Examples: edit(1234567, 7123456) = 2; edit(banana, ananas) = 2.
Non-embeddability into l_1

  Metric                                                      | Upper bound                          | Lower bound
  Earth-mover distance (s-sized sets in the 2D plane)         | O(log s) [Cha02, IT03]               | Omega(sqrt(log s)) [NS07]
  Earth-mover distance (s-sized sets in {0,1}^d)              | O(log s * log d) [AIK08]             | [KN05]
  Edit distance over {0,1}^d (# indels to transform x into y) | 2^(O(sqrt(log d log log d))) [OR05]  | Omega(log d) [KN05, KR06]
  Ulam (edit distance between permutations)                   | O(log d) [CK06]                      | Omega~(log d) [AK07]
  Block edit distance                                         | O~(log d) [MS00, CM07]               | 4/3 [Cor03]

Other host spaces are possible, with worse sketching complexity. EMD: an O(1/eps)-approximation in D^eps space [ADIW'09]: embed into a more complex space, and use Precision Sampling.
Theory of sketching
When is sketching possible?
- [BO'10]: characterize when "generalized frequency moments" sum_i f(|x_i|) can be sketched, for increasing functions f.
- [LNW'14]: in general streams (insertions and deletions), for estimating any function, one might as well use a sketch which is linear.
- [AKR'15]: a norm has a very efficient sketch (constant size and constant approximation, like l_p for p <= 2) if and only if it embeds into l_(1-eps) with constant distortion! E.g., using [NS07], EMD does not admit very efficient sketches.
- Characterization in other cases?
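Linearity of sketches, as in [LNW'14], is easy to see on a classic example: the AMS sketch for the second frequency moment F2 = sum_i x_i^2 is a random +/-1 linear map, so the sketch of x + y is the coordinate-wise sum of the sketches of x and y. A minimal sketch (names are illustrative):

```python
import random

def make_ams(n, k, seed=0):
    """Classic AMS linear sketch for F2 = sum_i x_i^2: k rows of
    independent +/-1 signs; the sketch of x is Sx, and the estimate is
    the average of the squared sketch entries."""
    rng = random.Random(seed)
    S = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]

    def sketch(x):
        return [sum(s * xi for s, xi in zip(row, x)) for row in S]

    def estimate(sk):
        return sum(v * v for v in sk) / len(sk)

    return sketch, estimate
```

Because the map is linear, sketches of streams with insertions and deletions compose by simple addition.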