
Slide 1: Data-dependent Hashing for Similarity Search

Alex Andoni (Columbia University)

Slide 2: Nearest Neighbor Search (NNS)

Preprocess: a set P of n points in d-dimensional space.
Query: given a query point q, report a point in P with the smallest distance to q.
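For concreteness, a minimal brute-force reference implementation of this query (not from the talk; a hypothetical sketch):

```python
import numpy as np

def nearest_neighbor(P, q):
    """Brute-force NNS: return the point of P closest to q (Euclidean)."""
    dists = np.linalg.norm(P - q, axis=1)   # O(n*d) per query
    return P[np.argmin(dists)]

# Usage: P is an (n, d) array of points, q is a length-d query vector.
P = np.random.randn(1000, 64)
q = np.random.randn(64)
p_star = nearest_neighbor(P, q)
```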

Slide 3: Motivation

Generic setup:
- points model objects (e.g. images)
- distance models a (dis)similarity measure

Application areas: machine learning (the k-NN rule), speech/image/video/music recognition, vector quantization, bioinformatics, etc.
Distances: Hamming, Euclidean, edit distance, earth-mover distance, etc.
Core primitives: closest pair, clustering, etc.

[Figure: example points as binary strings in Hamming space, e.g. 000000, 011100, 010100, 000100, ...]

Slide 4: Curse of Dimensionality

Hard when the dimension d is large (say, d >> log n).

Algorithm                  | Query time   | Space
Full indexing              | O(d * log n) | n^{O(d)} (Voronoi diagram size)
No indexing (linear scan)  | O(n * d)     | O(n * d)

Beating the linear scan substantially in high dimensions would refute the Strong ETH [Wil'04].

Slide 5: Approximate NNS

c-approximate r-near neighbor: given a query point q, report a point p in P with dist(p, q) <= cr, as long as some point lies within distance r of q.

Practice: used as a building block for exact NNS.
Filtering: the approximate structure gives a set of candidates (hopefully small), which are then verified exactly.
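A small sketch of the filtering pattern mentioned above: an approximate index produces candidate ids, which are then verified exactly. The name `candidate_ids` is a placeholder for whatever the approximate structure returns.

```python
import numpy as np

def verify_candidates(P, q, candidate_ids, r, c):
    """Return a candidate within distance c*r of q, or None if none qualifies."""
    for i in candidate_ids:
        if np.linalg.norm(P[i] - q) <= c * r:
            return i
    return None
```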

Slide 6: Plan

- Space decompositions
- Structure in data: PCA-tree
- No structure in data: data-dependent hashing
- Towards practice

Slide 7: Plan (now: space decompositions)

- Space decompositions
- Structure in data: PCA-tree
- No structure in data: data-dependent hashing
- Towards practice

Slide 8: It's all about space decompositions

Approximate space decompositions (for the low-dimensional regime):
[Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], ...

Random decompositions == hashing (for the high-dimensional regime):
[Kushilevitz-Ostrovsky-Rabani'98], [Indyk-Motwani'98], [Indyk'01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [A.-Indyk'06], [Terasawa-Tanaka'07], [Kapralov'15], [Pagh'16], [Becker-Ducas-Gama-Laarhoven'16], [Christiani'17], ...

Slide 9: Low-dimensional

kd-trees, approximate Voronoi diagrams, ...
Usual factor: a 1+eps approximation, at a cost that grows exponentially with the dimension.

Slide 10: High-dimensional

Locality-Sensitive Hashing (LSH).
Space partitions: fully random.
Johnson-Lindenstrauss Lemma: project to a random subspace of dimension about eps^{-2} log n.
LSH runtime: roughly n^rho for approximation factor c, with an exponent rho = rho(c) < 1.

Slide 11: Locality-Sensitive Hashing  [Indyk-Motwani'98]

A random hash function h on R^d satisfying:
- for a close pair p, q (distance at most r): Pr[h(p) = h(q)] = P1 is "high"
- for a far pair p, q' (distance more than cr): Pr[h(p) = h(q')] = P2 is "small" (but not-so-small)

Use several hash tables: about n^rho of them, where rho = log(1/P1) / log(1/P2).

[Figure: collision probability as a function of distance: close to 1 below r, dropping to P2 beyond cr.]
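As an illustration (not code from the talk), a minimal sketch of bit-sampling LSH for Hamming space in the spirit of [Indyk-Motwani'98]: each table hashes by k randomly chosen coordinates, and L independent tables are queried. The parameters k and L are illustrative.

```python
import numpy as np
from collections import defaultdict

class BitSamplingLSH:
    """Bit-sampling LSH for binary vectors under Hamming distance."""

    def __init__(self, data, k=16, L=10, seed=0):
        rng = np.random.default_rng(seed)
        d = data.shape[1]
        # Each of the L tables hashes by k randomly chosen coordinates.
        self.coords = [rng.choice(d, size=k, replace=False) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for idx, x in enumerate(data):
            for table, c in zip(self.tables, self.coords):
                table[tuple(x[c])].append(idx)

    def query(self, q):
        """Candidate indices colliding with q in at least one table."""
        candidates = set()
        for table, c in zip(self.tables, self.coords):
            candidates.update(table.get(tuple(q[c]), []))
        return candidates
```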

Slide 12: LSH Algorithms

Hamming space (space n^{1+rho}, query time n^rho):
[IM'98]             rho = 1/c
[MNP'06, OWZ'11]    rho >= 1/c          (lower bound for LSH)

Euclidean space (space n^{1+rho}, query time n^rho):
[IM'98, DIIM'04]    rho ~ 1/c
[AI'06]             rho = 1/c^2 + o(1)
[MNP'06, OWZ'11]    rho >= 1/c^2        (lower bound for LSH)

Slide 13: Theory vs. Practice

Data-dependent partitions: optimize the partition to your dataset.
- PCA-tree [S'91, M'01, VKD'09]
- randomized kd-trees [SAH08, ML09]
- spectral/PCA/semantic/WTA hashing [WTF08, CKW09, SH09, YSRL11]
- ...

"My dataset is not worst-case."

Slide 14: Practice vs. Theory

Data-dependent partitions often outperform (vanilla) random partitions.
But there are no guarantees (of correctness or performance), and it is hard to measure progress or understand the trade-offs.
JL is generally optimal [Alon'03, JW'13], and even for some NNS regimes! [AIP'06]

Why do data-dependent partitions outperform random ones?

Slide 15: Plan (now: structure in data)

- Space decompositions
- Structure in data: PCA-tree
- No structure in data: data-dependent hashing
- Towards practice

Slide 16: How to study the phenomenon

Why do data-dependent partitions outperform random ones?
One explanation: further special structure in the dataset, e.g. "the intrinsic dimension is small"
[Clarkson'99, Karger-Ruhl'02, Krauthgamer-Lee'04, Beygelzimer-Kakade-Langford'06, Indyk-Naor'07, Dasgupta-Sinha'13, ...]
But the dataset is really high-dimensional...

Slide 17: Towards higher-dimensional datasets  [Abdullah-A.-Krauthgamer-Kannan'14]

Model: "low-dimensional signal + high-dimensional noise".
- Signal: the points lie in a subspace U of dimension k, with k much smaller than d.
- Data: each point is then perturbed by full-dimensional Gaussian noise.
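A hypothetical generator for this planted model; the subspace dimension k and the noise scale sigma are illustrative parameters, not values from the talk.

```python
import numpy as np

def planted_dataset(n, d, k, sigma, seed=0):
    """Signal in a random k-dimensional subspace of R^d, plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    # Orthonormal basis of a random k-dimensional subspace U.
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))
    signal = rng.standard_normal((n, k)) @ U.T                  # points inside U
    noise = sigma * rng.standard_normal((n, d)) / np.sqrt(d)    # full-dimensional noise
    return signal + noise
```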

Slide 18: Model properties

Data: the perturbed point set.
Query: a point q such that
- its distance to the designated "nearest neighbor" is small,
- its distance to everybody else is noticeably larger.
The noise can be fairly large (up to factors polynomial in the model parameters) while the nearest neighbor stays the same.
Noise is large:
- the displacement it causes is substantial: the top dimensions of the noise capture only a sub-constant fraction of its mass;
- JL would not work: after the noise, the gap is very close to 1.

Slide 19: NNS in this model

Goal: NNS performance as if we were in k dimensions?
Best we can hope for: the dataset may contain a "worst-case" k-dimensional instance, so the goal is a reduction from dimension d down to dimension k.
Spoiler: yes, we can do it!

Slide 20: Tool: PCA

Principal Component Analysis, as in the Singular Value Decomposition.
Top PCA direction: the direction maximizing the variance of the data.
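A minimal numpy sketch (not from the talk) of the top PCA direction and the top-k PCA subspace via the SVD:

```python
import numpy as np

def top_pca_direction(X):
    """Unit vector maximizing the variance of the (centered) data X (n x d)."""
    Xc = X - X.mean(axis=0)                 # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]                            # top right singular vector

def top_k_pca_subspace(X, k):
    """Orthonormal basis (k x d) of the top-k PCA subspace."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]
```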

Slide 21: Naïve: just use PCA?

Use PCA to find the "signal subspace" U?
- find the top-k PCA subspace
- project everything onto it and solve NNS there
Does NOT work:
- sparse signal directions can be overpowered by noise directions
- PCA is good only on "average", not in the "worst case"

Slide 22: Our algorithms: intuition

Extract "well-captured points":
- points whose signal is mostly inside the top PCA subspace
- this should work for a large fraction of the points
- iterate on the rest (see the sketch below)
Algorithm 1: iterative PCA.
Algorithm 2: a (modified) PCA-tree.
Analysis: via [Davis-Kahan'70, Wedin'72].
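A rough sketch of the iterative-PCA idea described above; the capture threshold tau, the subspace dimension k, and the stopping rule are illustrative, and this is not the paper's exact algorithm.

```python
import numpy as np

def iterative_pca(X, k, tau, max_rounds=10):
    """Peel off "well-captured" points round by round.

    A point is well-captured if it lies within distance tau of the current
    top-k PCA subspace; captured points get their own (mean, basis) pair,
    and the procedure iterates on the remainder."""
    groups = []
    remaining = X
    for _ in range(max_rounds):
        if len(remaining) == 0:
            break
        mean = remaining.mean(axis=0)
        _, _, Vt = np.linalg.svd(remaining - mean, full_matrices=False)
        V = Vt[:k]                                    # top-k PCA basis
        proj = (remaining - mean) @ V.T @ V + mean    # projection onto the subspace
        resid = np.linalg.norm(remaining - proj, axis=1)
        captured = resid <= tau
        groups.append((mean, V, remaining[captured]))
        remaining = remaining[~captured]
    return groups, remaining
```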

Slide 23: Plan (now: no structure in data)

- Space decompositions
- Structure in data: PCA-tree
- No structure in data: data-dependent hashing
- Towards practice

What if the data has no structure? (worst case)

Slide 24: Back to worst-case NNS

Hamming space (space n^{1+rho}, query time n^rho):
[IM'98]             rho = 1/c            (LSH)
[MNP'06, OWZ'11]    rho >= 1/c           (lower bound for LSH)
[AINR'14]           rho = complicated    (data-dependent)
[AR'15]             rho = 1/(2c - 1)     (data-dependent)

Euclidean space (space n^{1+rho}, query time n^rho):
[AI'06]             rho = 1/c^2          (LSH)
[MNP'06, OWZ'11]    rho >= 1/c^2         (lower bound for LSH)
[AINR'14]           rho = complicated    (data-dependent)
[AR'15]             rho = 1/(2c^2 - 1)   (data-dependent)

Random hashing is not optimal!

Slide 25: Beyond Locality-Sensitive Hashing

Data-dependent hashing: a random hash function, chosen after seeing the given dataset.
Still efficiently computable.
It helps even when the data has no structure.

Slide 26: Construction of the hash function  [A.-Indyk-Nguyen-Razenshteyn'14]

Two components:
- a nice geometric structure, which admits a better (data-dependent) LSH
- a reduction to such structure, like a (weak) "regularity lemma" for a set of points

Slide 27: Nice geometric structure

Pseudo-random configuration:
- every far pair is (near) orthogonal, as when the dataset is random on the sphere
- a close pair is at distance r
Lemma: such configurations admit a better exponent, via Cap Carving.

Slide 28: Reduction to a nice configuration  [A.-Razenshteyn'15]

Idea: iteratively make the point set more isotropic by narrowing a ball around it.
Algorithm (sketched in code below):
- find dense clusters (smaller radius, containing a large fraction of the points)
- recurse on the dense clusters
- apply Cap Carving hashing on the rest, and recurse on each "cap"
  (dense clusters might reappear inside a cap)
Why is this ok? With no dense clusters left, the remainder is near-orthogonal at the current radius, which is even better.

[Figure: a ball with dense sub-clusters of smaller radius; picture not to scale and not to dimension.]
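A highly simplified, runnable stand-in for this recursion (a sketch, not the algorithm of [A.-Razenshteyn'15]): dense balls of half the radius are peeled off and recursed on, and the remainder is split by a random projection as a crude proxy for carving spherical caps.

```python
import numpy as np

def build_tree(points, radius, min_size=10, dense_frac=0.1, rng=None):
    """Toy recursive decomposition: peel off dense balls, then split the rest."""
    rng = np.random.default_rng(0) if rng is None else rng
    node = {"size": len(points), "children": []}
    if len(points) <= min_size:
        return node
    rest = points
    # Dense cluster: a ball of half the radius, centered at a data point,
    # containing at least a dense_frac fraction of the current points.
    for center in points:
        if len(rest) == 0:
            break
        inside = np.linalg.norm(rest - center, axis=1) <= radius / 2
        if inside.sum() >= dense_frac * len(points) and inside.sum() < len(rest):
            node["children"].append(
                build_tree(rest[inside], radius / 2, min_size, dense_frac, rng))
            rest = rest[~inside]
    # "Cap" step on the remainder: split by the sign of a random projection
    # (a crude proxy for cap carving on the sphere).
    if len(rest) > min_size:
        g = rng.standard_normal(points.shape[1])
        side = (rest @ g) >= 0
        parts = [rest[side], rest[~side]]
        if all(0 < len(p) < len(rest) for p in parts):
            for part in parts:
                node["children"].append(
                    build_tree(part, radius, min_size, dense_frac, rng))
            return node
    if len(rest):
        node["children"].append({"size": len(rest), "children": []})
    return node
```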

Slide 29: Hash function

Described by a tree (like a hash table).

[Figure: the recursive tree of balls and caps at decreasing radii; picture not to scale and not to dimension.]

Slide 30: Dense clusters

Current dataset: lives in a ball of the current radius.
A dense cluster:
- contains a noticeable fraction of the points
- has smaller radius
After we remove all dense clusters: for any point on the surface, only few points remain within the smaller distance; the other points are essentially orthogonal!
When applying Cap Carving with its parameters: the empirical number of far points colliding with the query stays small, and as long as it does, the "impurity" does not matter.
(The choice of cluster parameters is a trade-off.)

Slide 31: Tree recap

During a query:
- recurse into all dense clusters
- but follow just one bucket at each Cap-Carving node
So the query will look into more than one leaf. How much branching?
Claim: it is bounded, since each time we branch there are only a bounded number of clusters, and each cluster reduces the radius, so the cluster-depth is bounded.
Progress happens in two ways:
- cluster nodes reduce the radius
- Cap-Carving nodes reduce the (empirical) number of far points
A single tree succeeds with a certain probability, so several trees are used (as with LSH tables).
(Again, the parameters form a trade-off.)

Slide 32: Plan (now: towards practice)

- Space decompositions
- Structure in data: PCA-tree
- No structure in data: data-dependent hashing
- Towards practice

Slide 33: Towards (provable) practicality 1  [A-Indyk-Laarhoven-Razenshteyn-Schmidt'15]

Two components:
- nice geometric structure: the spherical case (this part gets a practical variant)
- reduction to such structure

Slide 34: Classic LSH: Hyperplane  [Charikar'02]

Charikar's construction: sample a direction g uniformly at random and hash by the side of the hyperplane, h(x) = sign(<g, x>).
Performance: the collision probability is 1 - theta/pi for angle theta between the points; the query time is n^rho for the resulting exponent rho.
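A minimal sketch of hyperplane LSH in the spirit of [Charikar'02]; concatenating k hyperplanes per table is an assumption of this sketch, not something specified on the slide.

```python
import numpy as np

def hyperplane_hash(X, k=16, seed=0):
    """Hash each row of X to a k-bit bucket id: signs of k random projections."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((X.shape[1], k))    # k random hyperplane normals
    bits = (X @ G) >= 0
    return bits @ (1 << np.arange(k))           # pack the sign pattern into an int

# Usage: points with the same bucket id land in the same hash bucket.
X = np.random.randn(1000, 64)
buckets = hyperplane_hash(X)
```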

Slide 35: Our algorithm  [A-Indyk-Laarhoven-Razenshteyn-Schmidt'15]

Based on the cross-polytope, introduced in [Terasawa-Tanaka'07].
Construction of h:
- pick a random rotation R
- h(x) = the index of the maximal coordinate of Rx, together with its sign
Theorem: nearly the same exponent as the (optimal) CapCarving LSH for spherical caps!
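An illustrative sketch of one cross-polytope hash evaluation, using a dense random rotation (the fast Hadamard-based rotation appears two slides later):

```python
import numpy as np

def random_rotation(d, seed=0):
    """A random rotation matrix, via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q

def cross_polytope_hash(x, R):
    """Index of the largest-magnitude coordinate of Rx, together with its sign.

    This maps x to one of the 2d vertices of the cross-polytope."""
    y = R @ x
    i = int(np.argmax(np.abs(y)))
    return i, int(np.sign(y[i]))
```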

Slide 36: Comparison: Hyperplane vs. CrossPolytope

Fix the probability of collision for far points; equivalently, partition the space into the same number of pieces.
- Hyperplane LSH: k hyperplanes give 2^k parts, with a certain exponent.
- CrossPolytope LSH: a rotation in dimension d gives 2d parts, with a better exponent for the same number of parts.

Slide 37: Complexity of our scheme

Time: true random rotations are expensive, Theta(d^2) time each.
Idea: pseudo-random rotations. Compute Rx where R is a "fast rotation" based on the Hadamard transform: O(d log d) time!
Used in [Ailon-Chazelle'06, Dasgupta-Kumar-Sarlos'11, Le-Sarlos-Smola'13, ...].
Less space: can adapt multi-probe LSH [Lv-Josephson-Wang-Charikar-Li'07].
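A sketch of the pseudo-random rotation trick: a few rounds of random sign flips followed by a fast Walsh-Hadamard transform. The transform is written out directly here (assuming the dimension is a power of two); using three rounds is an assumption in the spirit of the cited constructions.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = x.copy()
    n, h = len(y), 1
    while h < n:
        y = y.reshape(-1, 2 * h)
        a, b = y[:, :h].copy(), y[:, h:].copy()
        y[:, :h], y[:, h:] = a + b, a - b
        y = y.reshape(n)
        h *= 2
    return y / np.sqrt(n)          # normalized, so the transform is orthogonal

def pseudo_random_rotation(x, sign_vectors):
    """Apply H D_3 H D_2 H D_1 x: alternate sign flips and Hadamard transforms."""
    y = x
    for D in sign_vectors:         # each D is a +/-1 vector of length len(x)
        y = fwht(D * y)
    return y

# Usage (d must be a power of two):
d = 128
rng = np.random.default_rng(0)
sign_vectors = [rng.choice([-1.0, 1.0], size=d) for _ in range(3)]
y = pseudo_random_rotation(rng.standard_normal(d), sign_vectors)
```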

Slide 38: Code & Experiments

FALCONN package: http://falconn-lib.org

ANN_SIFT1M data (1M points, dimension 128):
- linear scan: 38 ms
- vanilla hyperplane: 3.7 ms
- vanilla cross-polytope: 3.1 ms
- after some clustering and re-centering (data-dependent):
  - hyperplane: 2.75 ms
  - cross-polytope: 1.7 ms (1.6x speedup)

PubMed data (using feature hashing):
- linear scan: 3.6 s
- hyperplane: 857 ms
- cross-polytope: 213 ms (4x speedup)

Slide 39: Towards (provable) practicality 2  [A-Razenshteyn-Shekel-Nosatzki'17]

Two components:
- nice geometric structure: the spherical case
- reduction to such structure (here, a partial step)

Slide 40: LSH Forest++ (Hamming space)  [A.-Razenshteyn-Shekel-Nosatzki'17]

LSH Forest [Bawa-Condie-Ganesan'05], built on [Indyk-Motwani'98]:
- random trees of bounded height; in each node, split on a randomly sampled coordinate
- in a tree, split until the bucket size is small
The ++ part:
- also store the points closest to the mean of P
- the query is compared against those points as well
Theorem: for suitable parameters, LSH Forest++ matches the performance of [IM'98] on random data (a toy tree is sketched below).

[Figure: an example tree with nodes splitting on coordinates 3, 2, 5, 7.]
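A toy sketch of one random coordinate-splitting tree in the LSH Forest style; the ++ storage of points near the mean is omitted, and max_bucket is an illustrative parameter.

```python
import numpy as np

def build_random_tree(ids, data, rng, max_bucket=8):
    """Split binary vectors on random coordinates until buckets are small."""
    if len(ids) <= max_bucket:
        return {"leaf": ids}
    for _ in range(10):                          # try a few random coordinates
        coord = int(rng.integers(data.shape[1]))
        left = [i for i in ids if data[i, coord] == 0]
        right = [i for i in ids if data[i, coord] == 1]
        if left and right:                       # keep only informative splits
            return {"coord": coord,
                    "children": (build_random_tree(left, data, rng, max_bucket),
                                 build_random_tree(right, data, rng, max_bucket))}
    return {"leaf": ids}

def query_tree(node, q):
    """Follow q down one root-to-leaf path and return the bucket's candidate ids."""
    while "leaf" not in node:
        node = node["children"][int(q[node["coord"]])]
    return node["leaf"]

# Usage: data is an (n, d) 0/1 array; build several trees and union the buckets.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(1000, 64))
tree = build_random_tree(list(range(len(data))), data, rng)
candidates = query_tree(tree, data[0])
```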

Slide 41: Follow-ups

Dynamic data structure?
- dynamization techniques [Overmars-van Leeuwen'81]
Better bounds?
- For low enough dimension, one can get a better exponent! [Becker-Ducas-Gama-Laarhoven'16]
- For high dimension, our exponent is optimal even for data-dependent hashing! [A-Razenshteyn'16], in the right formalization (to rule out the Voronoi diagram).
Time-space trade-offs: n^{rho_q} query time with n^{1+rho_s} space [A-Laarhoven-Razenshteyn-Waingarten'17]:
- simpler algorithm/analysis
- optimal* bound! [ALRW'17, Christiani'17]

Slide 42: Finale: Data-dependent hashing

- A reduction from the "worst case" to the "random case".
- Instance-optimality (beyond worst case)?
- Other applications? E.g., the closest-pair problem can be solved much faster in the random case [Valiant'12, Karppa-Kaski-Kohonen'16].
- A tool to understand NNS for harder metrics?
  - A classic hard distance: l_infinity. There is no non-trivial sketch/random hash [BJKS'04, AN'12], but [Indyk'98] gets efficient NNS with O(log log d) approximation, which is tight in the relevant models [ACP'08, PTW'10, KP'12]. It can be seen as data-dependent hashing!
- Algorithms with worst-case guarantees, but great performance if the dataset is "nice"?
- NNS for any norm (e.g., matrix norms, EMD) or metric (edit distance)?