Slide 1: Data-dependent Hashing for Similarity Search
Alex Andoni (Columbia University)
Slide 2: Nearest Neighbor Search (NNS)
Preprocess: a set $P$ of $n$ points in $\mathbb{R}^d$.
Query: given a query point $q$, report a point $p^* \in P$ with the smallest distance to $q$.
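For concreteness, here is the exact problem as a brute-force baseline, the bound the rest of the talk tries to beat. A minimal sketch; the names and types are illustrative, not from the talk.

```python
# Exact NNS by linear scan: O(n * d) per query, no preprocessing.
import numpy as np

def nearest_neighbor(P: np.ndarray, q: np.ndarray) -> int:
    """Return the index of the row of P closest to q in Euclidean distance."""
    dists = np.linalg.norm(P - q, axis=1)  # distance from q to every point
    return int(np.argmin(dists))
```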
Slide 3: Motivation
Generic setup:
- points model objects (e.g. images)
- distance models a (dis)similarity measure
Application areas: machine learning (the k-NN rule), speech/image/video/music recognition, vector quantization, bioinformatics, etc.
Distances: Hamming, Euclidean, edit distance, earthmover distance, etc.
Core primitives: closest pair, clustering, etc.
[Figure: example objects encoded as bit strings, e.g. 000000, 011100, 010100, ..., 111111]
Slide 4: Curse of Dimensionality
Hard when $d = \Omega(\log n)$:

  Algorithm                 | Query time            | Space
  Full indexing             | $(d + \log n)^{O(1)}$ | $n^{O(d)}$ (Voronoi diagram size)
  No indexing (linear scan) | $O(n \cdot d)$        | $O(n \cdot d)$

Substantially beating the linear scan for the exact problem would refute the Strong ETH [Wil'04].
Slide 5: Approximate NNS
$c$-approximate $r$-near neighbor: given a query point $q$, report a point $p \in P$ with $\|p - q\| \le cr$, as long as there is some point within distance $r$ of $q$.
Practice: used as a subroutine for exact NNS.
Filtering: gives a set of candidates (hopefully small).
Slide 6: Plan
- Space decompositions
- Structure in data: PCA-tree
- No structure in data: data-dependent hashing
- Towards practice
Slide 7: Plan (current part: space decompositions)
- Space decompositions
- Structure in data: PCA-tree
- No structure in data: data-dependent hashing
- Towards practice
Slide 8: It's all about space decompositions
Approximate space decompositions (for low-dimensional):
[Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], ...
Random decompositions == hashing (for high-dimensional):
[Kushilevitz-Ostrovsky-Rabani'98], [Indyk-Motwani'98], [Indyk'01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [A.-Indyk'06], [Terasawa-Tanaka'07], [Kapralov'15], [Pagh'16], [Becker-Ducas-Gama-Laarhoven'16], [Christiani'17], ...
Slide 9: Low-dimensional
kd-trees, approximate Voronoi, ...
Usual factor in the bounds: $(1/\epsilon)^{O(d)}$ for approximation $1 + \epsilon$.
Slide 10: High-dimensional
Locality-Sensitive Hashing (LSH).
Space partitions: fully random.
Johnson-Lindenstrauss Lemma: project onto a random subspace of dimension $O(\epsilon^{-2} \log n)$, preserving pairwise distances up to $1 \pm \epsilon$.
LSH runtime: $n^{\rho}$ for approximation $c$, for an exponent $\rho = \rho(c) < 1$.
Slide 11: Locality-Sensitive Hashing [Indyk-Motwani'98]
A random hash function $h$ on $\mathbb{R}^d$ satisfying:
- for a close pair $p, q$ (when $\|p - q\| \le r$): $\Pr[h(p) = h(q)] = P_1$ is "not-so-small"
- for a far pair $p, q'$ (when $\|p - q'\| > cr$): $\Pr[h(p) = h(q')] = P_2$ is "small"
Use several hash tables: $L = n^{\rho}$ of them, where $\rho = \frac{\log 1/P_1}{\log 1/P_2}$.
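In outline, the $L$ tables are used as follows. A hedged sketch, where `hash_family` stands for any LSH family (e.g. one of the constructions later in the talk):

```python
# Build L hash tables, each keyed by an independent random LSH function;
# at query time, scan only the buckets that collide with the query.
import numpy as np
from collections import defaultdict

def build_tables(P, hash_family, L):
    hashes = [hash_family() for _ in range(L)]    # one random h per table
    tables = []
    for h in hashes:
        table = defaultdict(list)
        for i, p in enumerate(P):
            table[h(p)].append(i)                 # index points by hash value
        tables.append(table)
    return hashes, tables

def query(q, P, hashes, tables, r, c):
    for h, table in zip(hashes, tables):
        for i in table.get(h(q), []):             # candidates colliding with q
            if np.linalg.norm(P[i] - q) <= c * r:
                return i                          # a c-approximate r-near neighbor
    return None
```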
Slide 12: LSH Algorithms

  Hamming space:
    Space        | Time       | Exponent                  | Reference
    $n^{1+\rho}$ | $n^{\rho}$ | $\rho = 1/c$              | [IM'98]
    $n^{1+\rho}$ | $n^{\rho}$ | $\rho \ge 1/c$: optimal   | [MNP'06, OWZ'11]

  Euclidean space:
    Space        | Time       | Exponent                  | Reference
    $n^{1+\rho}$ | $n^{\rho}$ | $\rho \approx 1/c$        | [IM'98, DIIM'04]
    $n^{1+\rho}$ | $n^{\rho}$ | $\rho = 1/c^2$            | [AI'06]
    $n^{1+\rho}$ | $n^{\rho}$ | $\rho \ge 1/c^2$: optimal | [MNP'06, OWZ'11]
Slide 13: Theory vs Practice
Data-dependent partitions: optimize the partition to your dataset.
- PCA-tree [S'91, M'01, VKD'09]
- randomized kd-trees [SAH08, ML09]
- spectral/PCA/semantic/WTA hashing [WTF08, CKW09, SH09, YSRL11]
- ...
"My dataset is not worst-case."
Slide 14: Practice vs Theory
Data-dependent partitions often outperform (vanilla) random partitions.
But: no guarantees (of correctness or performance), and it is hard to measure progress or understand the trade-offs.
JL is generally optimal [Alon'03, JW'13], even for some NNS regimes! [AIP'06]
Why do data-dependent partitions outperform random ones?
Slide 15: Plan (current part: structure in data, PCA-tree)
- Space decompositions
- Structure in data: PCA-tree
- No structure in data: data-dependent hashing
- Towards practice
Slide 16: How to study the phenomenon
Why do data-dependent partitions outperform random ones?
One answer: because of further special structure in the dataset, namely "intrinsic dimension is small" [Clarkson'99, Karger-Ruhl'02, Krauthgamer-Lee'04, Beygelzimer-Kakade-Langford'06, Indyk-Naor'07, Dasgupta-Sinha'13, ...].
But the dataset is really high-dimensional...
Slide 17: Towards higher-dimensional datasets [Abdullah-A.-Krauthgamer-Kannan'14]
Model: "low-dimensional signal + high-dimensional noise".
Signal: the points live in a subspace $U$ of dimension $k \ll d$.
Data: each point is perturbed by full-dimensional Gaussian noise.
Slide 18: Model properties
Data: signal points plus noise; query $q$ s.t.:
- distance at most $1$ to the "nearest neighbor"
- distance at least $1 + \epsilon$ to everybody else
Noise magnitude: up to factors poly in $k$, $\log n$, $1/\epsilon$, roughly the largest for which the nearest neighbor is still the same.
The noise is large:
- the displacement it causes is much larger than $1$
- the top $k$ dimensions of the noise capture only sub-constant mass
- JL would not work: after adding the noise, the gap is very close to 1
Slide 19: NNS in this model
Goal: NNS performance as if we were in $k$ dimensions?
Best we can hope for: the dataset may contain a "worst-case" $k$-dimensional instance, so we aim for a reduction from dimension $d$ to dimension $k$.
Spoiler: yes, we can do it!
Slide 20: Tool: PCA
Principal Component Analysis, akin to the Singular Value Decomposition.
Top PCA direction: the direction maximizing the variance of the data.
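As a concrete reminder of how that top direction is computed, a minimal sketch via the SVD (the function name is illustrative):

```python
# The top PCA direction: the first right singular vector of the centered
# data matrix, i.e. the unit vector along which the data has maximal variance.
import numpy as np

def top_pca_direction(P: np.ndarray) -> np.ndarray:
    centered = P - P.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[0]
```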
Slide 21: Naïve: just use PCA?
Use PCA to find the "signal subspace"?
- find the top-$k$ PCA subspace
- project everything onto it and solve NNS there
This does NOT work:
- sparse signal directions can be overpowered by noise directions
- PCA is good only on "average", not in the "worst case"
Slide 22: Our algorithms: intuition
Extract "well-captured points": points whose signal lies mostly inside the top PCA space.
- this should work for a large fraction of the points
- iterate on the rest
Algorithm 1: iterative PCA.
Algorithm 2: a (modified) PCA-tree.
Analysis: via subspace-perturbation bounds [Davis-Kahan'70, Wedin'72].
A sketch of the iterative-PCA idea follows below.
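A minimal sketch of the iterative-PCA intuition, under illustrative assumptions: the capture threshold, the choice of $k$, and all names are mine, not the paper's exact algorithm.

```python
# Repeatedly: take the top-k PCA subspace, peel off the "well-captured"
# points (most of their mass inside the subspace), and recurse on the rest.
import numpy as np

def iterative_pca(P: np.ndarray, k: int, capture: float = 0.9):
    groups = []                      # list of (PCA basis, indices captured by it)
    idx = np.arange(len(P))
    while len(idx) > 0:
        X = P[idx] - P[idx].mean(axis=0)
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        V = Vt[:k].T                                   # top-k PCA basis, d x k
        norms = np.maximum(np.linalg.norm(X, axis=1), 1e-12)
        frac = np.linalg.norm(X @ V, axis=1) / norms   # mass inside the subspace
        captured = frac >= capture                     # the well-captured points
        if not captured.any():
            captured[:] = True                         # guard against a stall
        groups.append((V, idx[captured]))
        idx = idx[~captured]                           # iterate on the rest
    return groups
```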
Slide 23: Plan (current part: no structure in data)
- Space decompositions
- Structure in data: PCA-tree
- No structure in data: data-dependent hashing
- Towards practice
What if the data has no structure? (worst case)
Slide 24: Back to worst-case NNS

  Hamming space:
    Space        | Time       | Exponent                         | Reference
    $n^{1+\rho}$ | $n^{\rho}$ | $\rho = 1/c$ (LSH)               | [IM'98]
    $n^{1+\rho}$ | $n^{\rho}$ | $\rho \ge 1/c$ (optimal for LSH) | [MNP'06, OWZ'11]
    $n^{1+\rho}$ | $n^{\rho}$ | complicated                      | [AINR'14]
    $n^{1+\rho}$ | $n^{\rho}$ | $\rho = \frac{1}{2c - 1}$        | [AR'15]

  Euclidean space:
    Space        | Time       | Exponent                           | Reference
    $n^{1+\rho}$ | $n^{\rho}$ | $\rho = 1/c^2$ (LSH)               | [AI'06]
    $n^{1+\rho}$ | $n^{\rho}$ | $\rho \ge 1/c^2$ (optimal for LSH) | [MNP'06, OWZ'11]
    $n^{1+\rho}$ | $n^{\rho}$ | complicated                        | [AINR'14]
    $n^{1+\rho}$ | $n^{\rho}$ | $\rho = \frac{1}{2c^2 - 1}$        | [AR'15]

Random hashing is not optimal!
Slide 25: Beyond Locality-Sensitive Hashing
Data-dependent hashing: a random hash function, chosen after seeing the given dataset.
- efficiently computable
- helps even when the data has no structure
Slide 26: Construction of the hash function [A.-Indyk-Nguyen-Razenshteyn'14]
Two components:
- a nice geometric structure, which admits a better (data-dependent) LSH
- a reduction to such structure, like a (weak) "regularity lemma" for a set of points
Slide 27: Nice geometric structure
Pseudo-random configuration:
- every far pair is (near-)orthogonal (at distance $\approx \sqrt{2}$ on the unit sphere), as when the dataset is random on the sphere
- close pairs are at distance $\sqrt{2}/c$
Lemma: an improved exponent $\rho$, via Cap Carving.
Slide 28: Reduction to nice configuration [A.-Razenshteyn'15]
Idea: iteratively make the point set more isotropic by narrowing a ball around it.
Algorithm:
- find dense clusters (smaller radius, containing a large fraction of the points) and recurse on them
- apply Cap Carving on the rest, and recurse on each "cap" (where, e.g., dense clusters might reappear)
Why is this OK? If there are no dense clusters, the remainder is near-orthogonal (radius $\approx \sqrt{2}$), which is even better!
(*picture not to scale & dimension)
A schematic of the cluster-finding step follows below.
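A minimal sketch of the "find a dense cluster" step, with illustrative thresholds: the shrink factor and the density fraction are stand-ins, not the paper's constants.

```python
# Look for a ball of radius shrink * R, centered at a data point, that
# contains at least frac * |P| of the points; return its members or None.
import numpy as np

def find_dense_cluster(P: np.ndarray, R: float, shrink: float, frac: float):
    for center in P:                        # candidate centers: the points themselves
        inside = np.linalg.norm(P - center, axis=1) <= shrink * R
        if inside.sum() >= frac * len(P):
            return np.flatnonzero(inside)   # a dense cluster to recurse on
    return None                             # remainder is "pseudo-random": cap-carve it
```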
Slide 29: Hash function
The hash function is described by a tree (used like a hash table).
(*picture not to scale & dimension)
Slide 30: Dense clusters
Current dataset: radius $R$.
A dense cluster:
- contains many points (say $\ge n^{1-\delta}$)
- has a smaller radius
After we remove all such clusters: for any point on the surface, only few points remain within distance $\sqrt{2} - \epsilon$, so the other points are essentially orthogonal!
When applying Cap Carving with collision probabilities $(P_1, P_2)$:
- the empirical number of far points colliding with the query is about $n P_2$ plus the leftover near points
- as long as the $n P_2$ term dominates, the "impurity" doesn't matter!
($\epsilon$ vs $\delta$: a trade-off.)
Slide 31: Tree recap
During a query:
- recurse into all dense clusters
- follow just one bucket in Cap Carving
- so we will look into more than one leaf!
How much branching? Claim: it stays bounded. Each time we branch there are few clusters, and a cluster reduces the radius by a fixed factor, so the cluster-depth is bounded.
Progress happens in two ways:
- clusters reduce the radius
- Cap Carving nodes reduce the (empirical) number of far points
A tree succeeds with probability $\approx n^{-\rho}$ (again the $\epsilon$ vs $\delta$ trade-off).
Slide 32: Plan (current part: towards practice)
- Space decompositions
- Structure in data: PCA-tree
- No structure in data: data-dependent hashing
- Towards practice
Slide 33: Towards (provable) practicality 1 [A.-Indyk-Laarhoven-Razenshteyn-Schmidt'15]
Two components:
- nice geometric structure: the spherical case, with a practical variant
- reduction to such structure
Slide 34: Classic LSH: Hyperplane [Charikar'02]
Charikar's construction: $h(p) = \mathrm{sign}(\langle r, p \rangle)$ for a direction $r$ sampled uniformly at random.
Performance: $\Pr[h(p) = h(q)] = 1 - \frac{\theta(p, q)}{\pi}$.
Query time: $\approx n^{\rho}$ for the resulting exponent $\rho$.
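A minimal sketch of this construction; concatenating $k$ hyperplanes into one hash value is standard LSH practice, and the parameter names are illustrative:

```python
# Hyperplane LSH: k random hyperplanes give a k-bit hash; two points agree
# on one bit with probability 1 - theta/pi, where theta is their angle.
import numpy as np

def hyperplane_hash_family(k: int, d: int, rng=np.random.default_rng()):
    G = rng.normal(size=(k, d))               # k random normal directions
    return lambda p: tuple((G @ p) > 0)       # sign pattern = the hash value
```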
Slide 35: Our algorithm [A.-Indyk-Laarhoven-Razenshteyn-Schmidt'15]
Based on the cross-polytope, introduced in [Terasawa-Tanaka'07].
Construction of the hash $h$:
- pick a random rotation $R$
- $h(p)$ = the index of the maximal coordinate of $Rp$, together with its sign
Theorem: nearly the same exponent as the (optimal) Cap Carving LSH for caps!
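A sketch of the cross-polytope hash; using a QR decomposition for the random rotation is my implementation choice here, not prescribed by the talk:

```python
# Cross-polytope LSH: rotate randomly, then report which of the 2d signed
# coordinate directions (the cross-polytope vertices) is closest.
import numpy as np

def cross_polytope_hash(d: int, rng=np.random.default_rng()):
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # a random rotation R
    def h(p):
        x = Q @ p
        i = int(np.argmax(np.abs(x)))             # index of the maximal coordinate
        return (i, bool(x[i] > 0))                # ... together with its sign
    return h
```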
Slide 36: Comparison: Hyperplane vs CrossPolytope
Fix the probability of collision for far points; equivalently, partition the space into $T$ pieces.
Hyperplane LSH: $k$ hyperplanes give $T = 2^k$ parts, with a weaker exponent.
CrossPolytope LSH: one rotation in $\mathbb{R}^d$ gives $T = 2d$ parts, with a nearly optimal exponent.
Slide 37: Complexity of our scheme
Time: true random rotations are expensive ($O(d^2)$ time)!
Idea: pseudo-random rotations. Compute $H D_3 H D_2 H D_1 p$, where $H$ is a "fast rotation" based on the Hadamard transform and each $D_i$ is a random diagonal $\pm 1$ matrix. This takes $O(d \log d)$ time!
Used in [Ailon-Chazelle'06, Dasgupta-Kumar-Sarlos'11, Le-Sarlos-Smola'13, ...].
Less space: one can adapt Multi-probe LSH [Lv-Josephson-Wang-Charikar-Li'07].
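A sketch of the pseudo-random rotation; the normalization and the power-of-two restriction are details of this illustration, not of the talk:

```python
# H D_3 H D_2 H D_1 p in O(d log d): three rounds of random signs followed
# by a fast Walsh-Hadamard transform. Requires d to be a power of two.
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two vector."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))

def make_pseudo_rotation(d: int, rng=np.random.default_rng()):
    D = rng.choice([-1.0, 1.0], size=(3, d))  # the three random sign diagonals
    def rotate(p):
        x = p
        for i in range(3):                    # applies D_i then H, three times
            x = fwht(D[i] * x)
        return x
    return rotate
```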
Slide 38: Code & Experiments
FALCONN package: http://falcon-lib.org
ANN_SIFT1M data ($n = 10^6$ points, $d = 128$):
- linear scan: 38 ms
- vanilla hyperplane: 3.7 ms; vanilla cross-polytope: 3.1 ms
- after some clustering and re-centering (data-dependent): hyperplane 2.75 ms; cross-polytope 1.7 ms (a 1.6x speedup)
Pubmed data (using feature hashing):
- linear scan: 3.6 s
- hyperplane: 857 ms; cross-polytope: 213 ms (a 4x speedup)
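A hedged usage sketch of the FALCONN Python wrapper: the method names below follow its documented API but may differ across versions, and the random data is a stand-in for SIFT1M.

```python
# Build an LSH index with FALCONN's default (cross-polytope) parameters
# and query one point of the dataset against it.
import numpy as np
import falconn

data = np.random.randn(10000, 128).astype(np.float32)  # stand-in dataset
data /= np.linalg.norm(data, axis=1, keepdims=True)    # normalize to the sphere

params = falconn.get_default_parameters(data.shape[0], data.shape[1])
index = falconn.LSHIndex(params)
index.setup(data)
query_object = index.construct_query_object()
print(query_object.find_nearest_neighbor(data[0]))     # should print 0
```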
Slide 39: Towards (provable) practicality 2 [A.-Razenshteyn-Shekel-Nosatzki'17]
Two components:
- nice geometric structure: the spherical case
- reduction to such structure: a partial step
Slide 40: LSH Forest++ [A.-Razenshteyn-Shekel-Nosatzki'17]
Setting: Hamming space.
LSH Forest [Bawa-Condie-Ganesan'05], with the hashing of [Indyk-Motwani'98]: random trees of bounded height; in each node, sample a random coordinate; in a tree, split until the bucket size is small.
The ++ part: also store the points closest to the mean of $P$; the query is compared against those points too.
Theorem: for trees that are large enough, LSH Forest++ obtains the performance of [IM'98] on random data.
[Figure: one tree splitting on sampled coordinates, e.g. coor 3, coor 2, coor 5, coor 7.]
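A schematic of one such tree, with illustrative details: the bucket size, the handling of degenerate splits, and the ++ helper are my assumptions, not the paper's exact choices.

```python
# One LSH Forest tree over binary points: split on a random coordinate
# until buckets are small; plus the "++" set of points nearest the mean.
import numpy as np

def build_tree(P: np.ndarray, idx: np.ndarray, bucket_size: int, rng):
    if len(idx) <= bucket_size:
        return idx                                 # leaf: a small bucket
    coord = int(rng.integers(P.shape[1]))          # sample a random coordinate
    mask = P[idx, coord] == 1
    if mask.all() or not mask.any():
        return idx                                 # degenerate split: stop here
    return (coord,
            build_tree(P, idx[~mask], bucket_size, rng),
            build_tree(P, idx[mask], bucket_size, rng))

def plus_plus_points(P: np.ndarray, m: int) -> np.ndarray:
    """The ++ part: indices of the m points closest to the mean of P."""
    return np.argsort(np.linalg.norm(P - P.mean(axis=0), axis=1))[:m]
```

For example, `build_tree(P, np.arange(len(P)), 8, np.random.default_rng())` builds one tree of a forest; the forest repeats this with fresh randomness.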
Slide 41: Follow-ups
Dynamic data structures? Dynamization techniques apply [Overmars-van Leeuwen'81].
Better bounds?
- For small enough dimension, one can get a better $\rho$! [Becker-Ducas-Gama-Laarhoven'16]
- For larger dimension, our $\rho$ is optimal even for data-dependent hashing! [A.-Razenshteyn'16]: in the right formalization (to rule out the Voronoi diagram).
Time-space trade-offs: $n^{\rho_q}$ query time with $n^{1+\rho_s}$ space [A.-Laarhoven-Razenshteyn-Waingarten'17]:
- simpler algorithm and analysis
- optimal* bound! [ALRW'17, Christiani'17]
Slide 42: Finale: Data-dependent hashing
A reduction from the "worst case" to the "random case".
- Instance-optimality (beyond worst case)?
- Other applications? E.g., the closest-pair problem can be solved much faster in the random case [Valiant'12, Karppa-Kaski-Kohonen'16].
- A tool to understand NNS for harder metrics?
Classic hard distance: $\ell_\infty$.
- No non-trivial sketch/random hash exists [BJKS'04, AN'12].
- But [Indyk'98] gets efficient NNS with $O(\log \log d)$ approximation, tight in the relevant models [ACP'08, PTW'10, KP'12].
- It can be seen as data-dependent hashing!
Open: algorithms with worst-case guarantees, but great performance if the dataset is "nice"?
Open: NNS for any norm (e.g., matrix norms, EMD) or metric (edit distance)?