Sketching, Sampling and other Sublinear Algorithms: Euclidean space dimension reduction and NNS - PowerPoint Presentation
Presentation Transcript

Slide 1

Sketching, Sampling and other Sublinear Algorithms: Euclidean space: dimension reduction and NNS

Alex Andoni

(MSR SVC)

Slide 2

A Sketching Problem

Sketching: a map sk from objects to short bit-strings, such that given sk(x) and sk(y) we should be able to deduce whether x and y are "similar".

Why? To reduce the space and time needed to compute similarity.

Example: the documents "To be or not to be" and "To sketch or not to sketch" (which share the words "to", "be", "or", "not") map to the short sketches 010110 and 010101; comparing the sketches answers "similar?".

Slide 3

Sketch from LSH

LSH often has the property: Pr[h(x) = h(y)] is a function of the distance between x and y.

[Broder'97]: min-hash, for which Pr[h(x) = h(y)] equals the Jaccard coefficient of the sets x and y.

Sketching from LSH: set sk(x) = (h_1(x), …, h_k(x)) for independent copies h_1, …, h_k.
Estimate Pr[h(x) = h(y)] by the fraction of collisions between sk(x) and sk(y).
k controls the variance of the estimate.
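A minimal Python sketch of the [Broder'97] min-hash idea (my own illustration, not code from the talk; the affine hash below stands in for a truly random permutation of the universe):

```python
import random

def minhash_sketch(s, seeds, universe=2**32):
    # One min-hash per seed: hash every element, keep the minimum.
    sk = []
    for seed in seeds:
        rnd = random.Random(seed)
        a, b = rnd.randrange(1, universe), rnd.randrange(universe)
        sk.append(min((a * hash(x) + b) % universe for x in s))
    return sk

def estimate_jaccard(sk_a, sk_b):
    # Fraction of collisions estimates Pr[h(A)=h(B)] = |A ∩ B| / |A ∪ B|.
    return sum(u == v for u, v in zip(sk_a, sk_b)) / len(sk_a)

A = set("to be or not to be".split())
B = set("to sketch or not to sketch".split())
seeds = range(200)                    # k = 200 hashes controls the variance
est = estimate_jaccard(minhash_sketch(A, seeds), minhash_sketch(B, seeds))
true = len(A & B) / len(A | B)
print(f"estimate {est:.2f} vs true Jaccard {true:.2f}")
```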

Slide 4

General Theory: embeddings

The above map sk is an embedding.

General motivation: given a distance (metric) d, solve a computational problem under d.

Example metrics:
Euclidean distance (ℓ2)
Hamming distance
Edit distance between two strings
Earth-Mover (transportation) distance

Example problems:
Compute the distance between two points
Diameter / close pair of a set S
Clustering, MST, etc.
Nearest Neighbor Search

An embedding f reduces <problem P under a hard metric> to <P under a simpler metric>.

Slide 5

Embeddings: landscape

Definition: an embedding is a map f of a metric (X, d_X) into a host metric (Y, d_Y) such that for any x, y ∈ X:
d_X(x, y) ≤ d_Y(f(x), f(y)) ≤ c · d_X(x, y),
where c ≥ 1 is the distortion (approximation) of the embedding.

Embeddings come in all shapes and colors:
Source/host spaces
Distortion
Can be randomized: the guarantee holds with probability 1 − δ
Time to compute

Types of embeddings:
From a norm to the same norm but of lower dimension (dimension reduction)
From non-norms (edit distance, Earth-Mover Distance) into a norm (ℓ1)
From a given finite metric (shortest path on a planar graph) into a norm (ℓ1)
Not into a metric but into a computational procedure: sketches
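To make the definition concrete, here is a small illustrative Python check (my own, not from the talk; the helper name distortion is mine) that measures the empirical distortion c of a map on a finite point set, as the worst expansion over the best contraction:

```python
import itertools
import numpy as np

def distortion(points, f):
    # After rescaling so the embedding never contracts any pair,
    # c is the worst-case expansion: max ratio / min ratio.
    ratios = [np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y)
              for x, y in itertools.combinations(points, 2)]
    return max(ratios) / min(ratios)

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 100))            # 50 points in R^100
G = rng.normal(size=(100, 20)) / np.sqrt(20)   # random projection to R^20
print("distortion c ≈", distortion(points, lambda x: x @ G))
```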

Slide 6

Dimension Reduction

Johnson-Lindenstrauss Lemma: there is a linear map A: ℝ^d → ℝ^k, with k = O(ε⁻² · log(1/δ)), that preserves the distance between any two fixed vectors up to 1 ± ε distortion with probability 1 − δ (δ some small constant).
By a union bound, it preserves distances among n points for k = O(ε⁻² · log n).

Motivation. E.g.: the diameter of a pointset S of n points in d-dimensional Euclidean space.
Trivially: O(n² · d) time.
Using the lemma: O(n² · ε⁻² · log n) time (after projecting to k dimensions) for a 1 + ε approximation.

MANY applications: nearest neighbor search, streaming, pattern matching, approximation algorithms (clustering)…
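An illustrative Python sketch of the diameter speedup (my own; the constant 8 in the choice of k is ad hoc): project to k = O(ε⁻² · log n) dimensions first, then run the trivial quadratic algorithm in the smaller space:

```python
import numpy as np

def diameter(points):
    # Trivial O(n^2) all-pairs diameter, via the Gram matrix.
    sq = (points ** 2).sum(1)
    d2 = sq[:, None] + sq[None, :] - 2 * points @ points.T
    return float(np.sqrt(np.maximum(d2, 0)).max())

rng = np.random.default_rng(1)
n, d, eps = 300, 10_000, 0.5
X = rng.normal(size=(n, d))
k = int(8 * np.log(n) / eps ** 2)             # k = O(eps^-2 log n)
Y = X @ rng.normal(size=(d, k)) / np.sqrt(k)  # project, then run trivial algorithm
print("true diameter:", diameter(X))
print("after projection to", k, "dims:", diameter(Y))
```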

Slide 7

Main intuition

The map can simply be a projection onto a random subspace of dimension k.

Slide 8

1D embedding

How about one dimension (k = 1)?

Map f(x) = g · x = Σ_i g_i · x_i, where the g_i are iid normal (Gaussian) random variables: pdf μ(g) = e^{−g²/2} / √(2π), E[g] = 0, E[g²] = 1.

Why Gaussian? Stability property: g_1·x_1 + g_2·x_2 + … + g_d·x_d is distributed as ‖x‖ · g′, where g′ is also a standard Gaussian.
Equivalently: the vector g = (g_1, …, g_d) is spherically (centrally) symmetric, i.e., has a random direction, and the projection on a random direction depends only on the length of x.
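A quick empirical check of the stability property (illustrative Python, nothing beyond numpy assumed): for a fixed x, the dot product g · x should have mean 0 and standard deviation ‖x‖:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.array([3.0, 4.0])                     # ||x|| = 5
samples = rng.normal(size=(100_000, 2)) @ x  # g . x for many draws of g
print("mean ≈ 0     :", samples.mean())
print("std  ≈ ||x||=5:", samples.std())
```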

Slide 9

1D embedding (continued)

Map f(x) = g · x, for any x, y ∈ ℝ^d.
Linear: f(x) − f(y) = f(x − y).
Want: |f(x) − f(y)| ≈ ‖x − y‖.

Claim: for any x ∈ ℝ^d, we have:
Expectation: E[f(x)²] = ‖x‖²
Standard deviation: σ[f(x)²] = √2 · ‖x‖²

Proof: it suffices to prove it for ‖x‖ = 1, since f is linear.
Expectation: E[f(x)²] = E[(Σ_i g_i x_i)²] = E[Σ_i g_i² x_i²] + E[Σ_{i≠j} g_i g_j x_i x_j] = Σ_i x_i² = 1, using E[g_i²] = 1 and E[g_i g_j] = 0 for i ≠ j.
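The claim can also be checked numerically; a minimal sketch (my own illustration): for a unit vector x, f(x)² is a chi-squared variable with one degree of freedom, so its mean should be 1 and its standard deviation √2 ≈ 1.414:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=10)
x /= np.linalg.norm(x)          # normalize: f is linear, so ||x|| = 1 wlog
fx2 = (rng.normal(size=(200_000, 10)) @ x) ** 2
print("E[f(x)^2]   ≈ 1      :", fx2.mean())
print("std[f(x)^2] ≈ sqrt(2):", fx2.std())
```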

Slide 10

Full Dimension Reduction

Just repeat the 1D embedding k times!
A(x) = (1/√k) · Gx, where G is a k × d matrix of iid Gaussian random variables.

Want to prove: ‖A(x) − A(y)‖ = (1 ± ε) · ‖x − y‖ with probability 1 − δ.
By linearity, it is OK to prove this for a fixed unit vector z = x − y.
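The k-fold repetition as a minimal numpy sketch (illustrative; jl_map is my own name, not a library function): stack k independent 1D embeddings into a Gaussian matrix and rescale by 1/√k:

```python
import numpy as np

def jl_map(d, k, rng):
    # A(x) = Gx / sqrt(k), with G a k-by-d matrix of iid standard Gaussians.
    G = rng.normal(size=(k, d))
    return lambda x: (G @ x) / np.sqrt(k)

rng = np.random.default_rng(4)
A = jl_map(d=1000, k=200, rng=rng)
x, y = rng.normal(size=1000), rng.normal(size=1000)
print("||x - y||       =", np.linalg.norm(x - y))
print("||A(x) - A(y)|| =", np.linalg.norm(A(x) - A(y)))  # within 1 ± eps whp
```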

Slide 11

Concentration

‖A(z)‖² is distributed as (1/k) · (g_1² + … + g_k²), where each g_i is distributed as a standard Gaussian.
The sum g_1² + … + g_k² is called the chi-squared distribution with k degrees of freedom.
Fact: chi-squared is very well concentrated: (1/k) · (g_1² + … + g_k²) is equal to 1 ± ε with probability 1 − e^{−Ω(ε²·k)}.
Akin to the central limit theorem.
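A numerical look at the concentration fact (illustrative only): sampling the chi-squared distribution directly shows the deviation probability dropping rapidly as k grows:

```python
import numpy as np

rng = np.random.default_rng(5)
eps = 0.1
for k in (10, 100, 1000):
    # ||A(z)||^2 for unit z is (1/k) * chi-squared with k degrees of freedom.
    samples = rng.chisquare(k, size=100_000) / k
    dev = np.mean(np.abs(samples - 1) > eps)
    print(f"k={k:5d}  Pr[|chi2/k - 1| > {eps}] ≈ {dev:.4f}")
```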

Slide 12

Dimension Reduction: wrap-up

‖A(x) − A(y)‖ = (1 ± ε) · ‖x − y‖ with high probability, for k = O(ε⁻² · log n).

Extra:
Linear: can update A(x) as x changes
Can use ±1 entries instead of Gaussians [AMS'96, Ach'01, TZ'04…]
Fast JL: can compute A(x) faster than O(d·k) time [AC'06, AL'07'09, DKS'10, KN'10'12…]

Slide 13

NNS for Euclidean space

Can use dimensionality reduction to get LSH for ℓ2.

LSH function [Datar-Immorlica-Indyk-Mirrokni'04]: h(p) = ⌊(g · p + b) / w⌋, i.e., pick a random line, project the point p onto it, and quantize:
g is a random Gaussian vector
b is random in [0, w)
w is a parameter (e.g., 4)
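The [DIIM'04] hash function as a minimal Python sketch (illustrative; make_lsh is my own name): project onto a random Gaussian direction, shift by a random offset b ∈ [0, w), and quantize with cell width w:

```python
import numpy as np

def make_lsh(d, w, rng):
    # h(p) = floor((g . p + b) / w): random line, random shift, quantize.
    g = rng.normal(size=d)          # random Gaussian direction
    b = rng.uniform(0, w)           # random offset in [0, w)
    return lambda p: int(np.floor((g @ p + b) / w))

rng = np.random.default_rng(7)
h = make_lsh(d=100, w=4.0, rng=rng)
p = rng.normal(size=100)
q = p + 0.1 * rng.normal(size=100)  # a nearby point
print(h(p), h(q))                   # close points collide often
```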

Slide 14

Near-Optimal LSH [A-Indyk'06]

Regular grid → grid of balls: p can hit empty space, so take more such grids until p is in a ball.
Need (too) many grids of balls in the original dimension, so start by projecting to dimension t (from ℝ^d to ℝ^t).
The analysis gives the choice of the reduced dimension t as a tradeoff between the number of hash tables, n^ρ, and the time to hash, t^{O(t)}.
Total query time: d · n^{1/c² + o(1)}.

Slide 15

Open question: a more practical variant of the above hashing?

Design a space partitioning of ℝ^t that is:
efficient: point location in poly(t) time
qualitative: regions are "sphere-like", i.e., roughly
[Prob. needle of length 1 is not cut] ≥ [Prob. needle of length c is not cut]^{1/c²}

Slide 16

Time-Space Trade-offs

Space  | Query time | Comment                | Reference
high   | low        | one hash table lookup! | [KOR'98, IM'98, Pan'06]
medium | medium     |                        | [IM'98], [DIIM'04, AI'06]
low    | high       |                        | [Ind'01, Pan'06]

Lower bounds:
n^{o(1/ε²)} space   ⇒ ω(1) memory lookups   [AIP'06]
n^{1+o(1/c²)} space ⇒ ω(1) memory lookups   [PTW'08, PTW'10]

Slide 17

NNS beyond LSH

Data-dependent partitions…

Practice: trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees… often with no guarantees.
Theory: can improve standard LSH via random data-dependent space partitions [A-Indyk-Nguyen-Razenshteyn'??]; a tree-based approach to the max-norm (ℓ∞).

Slide 18

Finale

Dimension Reduction in Euclidean space: a random projection A(x) = (1/√k) · Gx preserves distances; only k = O(ε⁻² · log n) dimensions suffice for distances among n points!

NNS for Euclidean space:
Random projections give LSH
Even better with ball partitioning
Or better with cool lattices?
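Pulling the pieces together, a toy end-to-end NNS index in Python (my own illustration of the vanilla random-projection LSH scheme, not the near-optimal one from the talk): L tables, each keyed by m concatenated [DIIM'04] hashes; queries scan only the colliding buckets:

```python
import numpy as np
from collections import defaultdict

class ToyLSHIndex:
    """Toy ell_2 NNS index: L tables, each keyed by m concatenated hashes."""
    def __init__(self, dim, L=10, m=4, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.G = rng.normal(size=(L, m, dim))    # random directions
        self.b = rng.uniform(0, w, size=(L, m))  # random offsets
        self.w = w
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _keys(self, p):
        # One bucket key per table: m quantized projections, concatenated.
        return [tuple(np.floor((G @ p + b) / self.w).astype(int))
                for G, b in zip(self.G, self.b)]

    def add(self, p):
        i = len(self.points)
        self.points.append(p)
        for table, key in zip(self.tables, self._keys(p)):
            table[key].append(i)

    def query(self, q):
        # Scan only the buckets q falls into; return the closest candidate.
        cand = {i for table, key in zip(self.tables, self._keys(q))
                for i in table.get(key, [])}
        return min(cand, key=lambda i: np.linalg.norm(self.points[i] - q),
                   default=None)

rng = np.random.default_rng(8)
data = rng.normal(size=(2000, 50))
index = ToyLSHIndex(dim=50)
for p in data:
    index.add(p)
q = data[123] + 0.05 * rng.normal(size=50)
print("nearest candidate:", index.query(q))   # usually 123
```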