Presentation Transcript

Slide 1

Nearest Neighbor Search (3)

Alex Andoni (Columbia University)

MADALGO Summer School on Streaming Algorithms 2015

Slide 2

Time-Space Trade-offs (Euclidean)

  Space                 Time               Comment                      Reference
  high (n^{O(1/ε²)})    low (O(d log n))   c = 1+ε; 1 memory lookup     [KOR’98, IM’98, Pan’06]
  medium (n^{1+ρ})      medium (n^ρ)       ρ ≈ 1/c                      [IM’98, DIIM’04]
  medium (n^{1+ρ})      medium (n^ρ)       ρ = 1/c² + o(1)              [AI’06]
  low (near-linear)     high                                            [Ind’01, Pan’06]

Lower bounds:
- ω(1) memory lookups needed in the high-space regime [AIP’06]
- ω(1) memory lookups needed in the medium regime [PTW’08, PTW’10]
- LSH exponent lower bounds [MNP’06, OWZ’11]

Slide 3

Beyond Locality Sensitive Hashing

With space n^{1+ρ} and query time n^ρ, the known exponents are:

- Hamming space: LSH achieves ρ = 1/c [IM’98], which is optimal for LSH [MNP’06, OWZ’11]; data-dependent hashing achieves ρ = 1/(2c-1) [AINR’14, AR’15].
- Euclidean space: LSH achieves ρ = 1/c² [AI’06], again optimal for LSH [MNP’06, OWZ’11]; data-dependent hashing achieves ρ = 1/(2c²-1) [AINR’14, AR’15].
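To make the gap concrete, here is plain arithmetic on the exponents cited above (nothing beyond the stated bounds):

```python
# Exponents rho from the table above (space n^(1+rho), query time n^rho),
# evaluated at a few approximation factors c.
for c in (1.5, 2.0, 3.0):
    print(f"c = {c}:")
    print(f"  Hamming:   LSH rho = 1/c     = {1/c:.3f}   "
          f"data-dependent rho = 1/(2c-1)   = {1/(2*c-1):.3f}")
    print(f"  Euclidean: LSH rho = 1/c^2   = {1/c**2:.3f}   "
          f"data-dependent rho = 1/(2c^2-1) = {1/(2*c**2-1):.3f}")
```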

Slide 4

New approach?

Data-dependent hashing:
- a random hash function, chosen after seeing the given dataset
- efficiently computable

Slide 5

A look at LSH lower bounds

- LSH lower bounds in Hamming space: Fourier-analytic [Motwani-Naor-Panigrahy’06]
- Fix any distribution over hash functions, and consider:
  - a far pair: two random points at distance cr
  - a close pair: a random pair at distance r
- Get ρ ≥ 1/c [O’Donnell-Wu-Zhou’11]
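For intuition on why 1/c is the right Hamming exponent, consider the simplest LSH family, bit sampling (h(x) = x_i for a uniformly random coordinate i); a minimal numeric check, my own illustration rather than part of the slides:

```python
import math

# Bit-sampling LSH over {0,1}^d: Pr[h(x) = h(y)] = 1 - dist(x, y)/d.
# The LSH exponent rho = log(1/p_close) / log(1/p_far) tends to 1/c as
# r/d -> 0, matching the [MNP'06, OWZ'11] lower bound.
d, r, c = 10_000, 100, 2.0
p_close = 1 - r / d        # collision prob. for a pair at distance r
p_far = 1 - c * r / d      # collision prob. for a pair at distance c*r
rho = math.log(1 / p_close) / math.log(1 / p_far)
print(f"rho = {rho:.4f}, 1/c = {1/c:.4f}")   # ~0.497 vs 0.5
```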

Slide 6

Why not an NNS lower bound?

- Suppose we try to apply [OWZ’11] to NNS: pick a random dataset from the hard distribution.
- Then all the “far points” end up concentrated in a small ball.
- This is easy to detect at preprocessing time: the actual near neighbor is close to the center of the minimum enclosing ball.

Slide 7

Construction of the hash function

Two components [A.-Indyk-Nguyen-Razenshteyn’14, A.-Razenshteyn’15]:
- a nice geometric structure, which admits better LSH
- a reduction to such structure, which is data-dependent (like a “Regularity Lemma” for a set of points)

Slide 8

Nice geometric structure

- Like a random dataset on a sphere, i.e., such that random pairs of points are at distance ≈ cr.
- Lemma: exponent ρ = 1/(2c² - 1), via “Cap Carving”, which exploits the curvature of the sphere.
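A minimal illustrative sketch of cap carving, assuming a Gaussian-cap construction and an arbitrary threshold eta (both my choices for the sketch; [AINR’14, AR’15] tune the actual parameters):

```python
import numpy as np

# Hash each unit vector to the index of the first spherical cap
# {x : <g_t, x> >= eta} containing it, where g_1, g_2, ... are random
# Gaussian directions shared via the seed (so all points see the same caps).

def cap_hash(x, eta=1.8, seed=0):
    rng = np.random.default_rng(seed)
    x = x / np.linalg.norm(x)
    t = 0
    while True:
        g = rng.standard_normal(len(x))
        if g @ x >= eta:          # x lies in the t-th cap
            return t
        t += 1

def collision_rate(a, b, trials=200):
    return sum(cap_hash(a, seed=s) == cap_hash(b, seed=s)
               for s in range(trials)) / trials

rng = np.random.default_rng(1)
d = 64
x = rng.standard_normal(d); x /= np.linalg.norm(x)
noise = rng.standard_normal(d); noise /= np.linalg.norm(noise)
close = x + 0.2 * noise           # a nearby point, roughly on the sphere
far = rng.standard_normal(d)      # a random point: nearly orthogonal to x
print(collision_rate(x, close), collision_rate(x, far))  # close >> far
```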

Slide 9

Reduction to nice structure

Idea: iteratively decrease the radius of the minimum enclosing ball.

Algorithm (a toy implementation follows this slide):
- find dense clusters: smaller radius, each containing a large fraction of the points
- recurse on the dense clusters
- apply cap carving on the rest
- recurse on each “cap” (e.g., dense clusters might reappear there)

Why is the rest OK? It has no dense clusters, so it behaves like a “random dataset” of the same radius, or even better!

[Figure: a ball of radius R with dense clusters carved out; *picture not to scale & dimension.]

Slide 10

Hash function

Described by a tree (like a hash table).

[Figure: the recursion tree, with the radius shrinking down the tree; *picture not to scale & dimension.]

Slide 11

Dense clusters

- Current dataset: radius R. A dense cluster: contains many points within a noticeably smaller radius.
- After we remove all clusters: for any point on the surface, only few points lie within noticeably less than the current diameter; the other points are essentially orthogonal to it!
- When applying Cap Carving, the empirical number of far points colliding with the query is therefore small. As long as it stays small, this “impurity” doesn’t matter for the trade-off.

Slide 12

Tree recap

During a query (a toy descent follows this slide):
- recurse in all clusters
- follow just one bucket in each CapCarving node

So a query will look in more than one leaf! How much branching?
- Claim: bounded. Each time we branch there are few clusters, and each cluster reduces the radius, so the cluster-depth of the tree is bounded.
- Progress in 2 ways: clusters reduce the radius; CapCarving nodes reduce the (empirical) number of far points.
- A tree succeeds with large enough probability to give the stated time-space trade-off.
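A toy query descent over the ("node", clusters, ("rest", rest)) trees from the earlier sketch (same hypothetical structure, illustrative only):

```python
import numpy as np

# The query recurses into ALL cluster children (it may be near any of
# them) -- this is the only source of branching -- while a real CapCarving
# node would route it into just one bucket; the stubbed "rest" is scanned
# directly here.
def query(tree, q, r):
    if tree[0] == "leaf":
        return [p for p in tree[1] if np.linalg.norm(p - q) <= r]
    _, clusters, (_, rest) = tree
    out = []
    for child in clusters:
        out.extend(query(child, q, r))
    out.extend(p for p in rest if np.linalg.norm(p - q) <= r)
    return out

# e.g., with `tree` and `pts` from the previous sketch:
#   query(tree, pts[0], r=1.0)
```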

Slide 13

Fast preprocessing

How to find the dense clusters fast?
- Step 1: reduce to roughly quadratic time: it is enough to consider centers that are data points.
- Step 2: reduce to near-linear time: down-sample! This is OK because we only want clusters containing a large fraction of the points: after down-sampling, such a cluster is still somewhat heavy. (A sketch follows this slide.)
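A hedged sketch of Step 2; the constants (f, the 0.5 slack) are illustrative stand-ins:

```python
import numpy as np

# A cluster with >= tau*n points keeps about tau*n/f of them after keeping
# each point independently with probability 1/f (Chernoff), so it can be
# detected on the small sample and only then verified on the full dataset.
def dense_cluster_downsampled(pts, r_small, tau, f=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(pts)
    sample = pts[rng.random(n) < 1.0 / f]
    for center in sample:                        # centers: sampled data points
        near = np.linalg.norm(sample - center, axis=1) <= r_small
        if near.sum() >= 0.5 * tau * n / f:      # heavy in the sample ...
            idx = np.where(np.linalg.norm(pts - center, axis=1) <= r_small)[0]
            if len(idx) >= tau * n:              # ... and confirmed in full
                return idx
    return None
```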

Slide 14

Other details

- In the analysis, instead of working with the “probability of collision with a far point”, work with its “empirical estimate” (the actual number of colliding far points).
- This is a little delicate, because of the interplay with the “probability of collision with the close point”: the empirical count matters only for the bucket the query falls into, so one must condition on collision with the close point in the above empirical estimate.
- In dense clusters, points may appear anywhere inside the ball, whereas CapCarving works for points on the sphere: we need to partition balls into thin shells (which introduces more branching).

Data-dependent hashing wrap-up15

Dynamicity?Dynamization

techniques

[

Overmars-van Leeuwen’81]Better bounds?[AR]: optimal even for data-dependent hashing!In the right formalization (to rule out Voronoi diagram)Description complexity of the hash function is High dimensionNNS for [Indyk’98] gets approximation (poly space, sublinear qt)Cf., has no non-trivial sketch!Some matching lower bounds in the relevant model [ACP’08, KP’12] Can be thought of as data-dependent hashingNNS for any norm? Slide16

Slide 16

Beyond the ℓ_p’s?

Slide 17

Sketching/NNS for other distances?

Earth Mover Distance: given two sets A, B of points in a metric space,

  EMD(A, B) = min cost bipartite matching between A and B

Applications in computer vision.

[Illustration: binary strings (010110010101): small or large distance?]
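For small sets, EMD can be computed exactly from its definition via a min-cost matching; a minimal check using SciPy’s Hungarian-algorithm solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# EMD between two equal-size point sets A, B: the minimum-cost perfect
# matching, solved exactly in O(s^3) -- fine for small sets, which is
# precisely why sketching is the interesting question at scale.
def emd(A, B):
    cost = cdist(A, B)                     # pairwise distances
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()

A = np.array([[0., 0.], [1., 0.]])
B = np.array([[0., 1.], [1., 1.]])
print(emd(A, B))  # 2.0: match each point to the one directly above it
```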

Slide 18

Algorithms via embedding

Embedding: map the sets into vectors in ℓ₁, preserving the distance (approximately); then use the algorithms for ℓ₁!

Say, for EMD over the grid [Δ]²:

Theorem [Cha02, IT03]: there exists a map f into ℓ₁ with distortion O(log Δ), i.e., for any sets A, B we have

  EMD(A, B) ≤ ||f(A) - f(B)||₁ ≤ O(log Δ) · EMD(A, B)

Sketch: hence an O(log Δ)-approximation in small space. (A toy version of the map follows this slide.)
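A minimal sketch of the randomly-shifted-grid idea behind [Cha02, IT03] (my simplified rendering; the shifts, sparse representation, and constants are illustrative):

```python
import numpy as np
from collections import Counter

def embed(points, shifts):
    # One sparse l1 vector per point set: for each grid of side w = 2^level
    # (with a random shift shared by all sets), count the set's points per
    # cell and weight each count by w.
    vec = Counter()
    for level, shift in enumerate(shifts):
        w = 2 ** level
        for p in points:
            cell = tuple(int((c + s) // w) for c, s in zip(p, shift))
            vec[(level, cell)] += w
    return vec

def l1(u, v):
    return sum(abs(u[k] - v[k]) for k in set(u) | set(v))

Delta = 16                                   # points live in [0, Delta)^2
rng = np.random.default_rng(0)
shifts = [rng.uniform(0, 2 ** l, size=2) for l in range(Delta.bit_length())]
A = [(1, 1), (9, 4)]
B = [(2, 1), (9, 6)]
print(l1(embed(A, shifts), embed(B, shifts)))   # within O(log Delta) of EMD(A, B) = 3
```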

Slide 19

Embeddability into ℓ₁

Metric, upper bound on the ℓ₁ distortion, reference:
- Earth-mover distance (s-sized sets in the 2D plane): O(log s) [Cha02, IT03]
- Earth-mover distance (s-sized sets in {0,1}^d): O(log s · log d) [AIK08]
- Edit distance over {0,1}^d (# indels to transform x into y): 2^{Õ(√log d)} [OR05]
- Ulam (edit distance between permutations): O(log d) [CK06]
- Block edit distance: Õ(log d) [MS00, CM07]

Examples:
- edit(1234567, 7123456) = 2
- edit(banana, ananas) = 2
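Indel edit distance (insertions and deletions only, as in the [OR05] row) equals |x| + |y| - 2·LCS(x, y); a quick check of the two examples above:

```python
from functools import lru_cache

# Indel edit distance via the longest common subsequence.
def indel_edit(x, y):
    @lru_cache(maxsize=None)
    def lcs(i, j):
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            return lcs(i - 1, j - 1) + 1
        return max(lcs(i - 1, j), lcs(i, j - 1))
    return len(x) + len(y) - 2 * lcs(len(x), len(y))

print(indel_edit("1234567", "7123456"))  # 2
print(indel_edit("banana", "ananas"))    # 2
```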

Slide 20

Non-embeddability into ℓ₁

The same metrics, with lower bounds on the ℓ₁ distortion:
- Earth-mover distance (s-sized sets in the 2D plane): [NS07]
- Earth-mover distance (s-sized sets in {0,1}^d): [KN05]
- Edit distance over {0,1}^d: [KN05, KR06]
- Ulam (edit distance between permutations): [AK07]
- Block edit distance: 4/3 [Cor03]

Other host spaces are possible, at the cost of worse sketching complexity. EMD: O(1/ε)-approximation in n^ε space [ADIW’09]: embed into a more complex space, and use Precision Sampling.

Slide 21

Theory of sketching

When is sketching possible?
- [BO’10]: characterize when one can sketch the “generalized frequency moments” Σ_i g(x_i), for increasing functions g.
- [LNW’14]: in general streams (with insertions and deletions), for estimating any function of the frequency vector x, one might as well use a sketch that is linear in x.
- [AKR’15]: a norm X admits a very efficient sketch (constant size and constant approximation, as ℓ_p for p ≤ 2 does) if and only if X embeds into ℓ_{1-ε} for every ε > 0. E.g., by [NS07], EMD does not admit very efficient sketches.
- Characterization in other cases?
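To make “linear sketch” concrete, here is a classic example (my illustration, not from the slides): the AMS sketch for the second frequency moment, which is linear in x and therefore handles insertions and deletions transparently.

```python
import numpy as np

# The sketch of a frequency vector x is S @ x for a random +/-1 matrix S,
# so a turnstile update (item, delta) just adds delta * S[:, item].
# Averaging (S @ x)^2 over the k rows estimates F_2 = ||x||_2^2.
rng = np.random.default_rng(0)
n, k = 1000, 400
S = rng.choice([-1.0, 1.0], size=(k, n))
x = np.zeros(n)
for item, delta in [(3, +5), (17, +2), (3, -1)]:   # turnstile updates
    x[item] += delta                                # sketch side: S[:, item] * delta
print(np.mean((S @ x) ** 2), np.dot(x, x))          # estimate vs. truth (= 20)
```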