
Slide1

Random Sampling on Big Data: Techniques and Applications

Ke Yi

Hong Kong University of Science and Technology

yike@ust.hk

Slide2

Random Sampling on Big Data

2

Slide3

“Big Data” in one slide

The 3 V’s:

Volume: external memory algorithms; distributed data

Velocity: streaming data

Variety: integers, real numbers; points in a multi-dimensional space; records in a relational database; graph-structured data

Slide4

Dealing with Big Data

The first approach: scale up / out the computation

Many great technical innovations:

Distributed/parallel systems: MapReduce, Pregel, Dremel, Spark

New computational models: BSP, MPC, … (Dan Suciu’s tutorial tomorrow; my BeyondMR talk on Friday)

This talk is not about this approach!

Slide5

Downsizing Data

A second approach to computational scalability: scale down the data!

There is too much redundancy in big data anyway

100% accuracy is often not needed

What we finally want is small: human-readable analyses / decisions

Examples: samples, sketches, histograms, various transforms (see the tutorial by Graham Cormode for other data summaries)

Complementary to the first approach: we can scale out the computation and scale down the data at the same time

Algorithms need to work under new system architectures; the good old RAM model no longer applies

Slide6

Outline of the Talk

Stream sampling

Importance sampling

Merge-reduce sampling

Sampling for approximate query processing: sampling from one table; sampling from multiple tables (joins)

Slide7

Simple Random Sampling

Sampling without replacement: randomly draw an element, don’t put it back; repeat s times

Sampling with replacement: randomly draw an element, put it back; repeat s times

Trivial in the RAM model

The statistical difference between the two is very small for sample size s ≪ √n
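Both schemes are a few lines in Python; a minimal sketch using the standard library (the names `data` and `s` are placeholders):

```python
import random

def sample_without_replacement(data, s):
    # Each size-s subset of data is equally likely; no duplicates.
    return random.sample(data, s)

def sample_with_replacement(data, s):
    # s independent uniform draws; duplicates are possible.
    return [random.choice(data) for _ in range(s)]

data = list(range(1000))
print(sample_without_replacement(data, 5))
print(sample_with_replacement(data, 5))
```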

Slide8

Stream Sampling

(Figure: a stream of elements arriving at a processor P with bounded memory.)

Slide9


Slide10

Reservoir Sampling

Maintain a sample of size s drawn (without replacement) from all elements in the stream so far

Keep the first s elements in the stream; set n = s

Algorithm for a new element (the (n+1)-st):

With probability s/(n+1), use it to replace an item in the current sample chosen uniformly at random

With probability 1 − s/(n+1), throw it away

Space: O(s), time: O(n)

Perhaps the first “streaming algorithm”

[Waterman ??; Knuth’s book]
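The algorithm above fits in a few lines; this sketch follows the slide’s description, treating the stream as any one-pass Python iterable:

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform without-replacement sample of size s over a stream."""
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            sample.append(x)                 # keep the first s elements
        elif random.random() < s / n:        # element n is taken w.p. s/n
            sample[random.randrange(s)] = x  # and evicts a uniform victim
    return sample

print(reservoir_sample(iter(range(10**6)), 3))
```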

Slide11

Correctness Proof

By induction on n:

n = s: trivially correct

Assume each element so far is sampled with probability s/n

Consider element n+1:

The new element is sampled with probability s/(n+1)

Any of the first n elements is sampled with probability (s/n) · (1 − 1/(n+1)) = s/(n+1)

This is a wrong (incomplete) proof: each element being sampled with probability s/(n+1) is not a sufficient condition of random sampling

Counterexample: divide the elements into groups of s and pick one group randomly; every element is sampled with probability s/n, yet the result is not a uniform sample

Slide12


Slide13

Reservoir Sampling Correctness Proof

The correct proof relates reservoir sampling to the Fisher-Yates shuffle

The Fisher-Yates shuffle returns a uniformly random permutation of n elements

Reservoir sampling maintains the top s elements during the Fisher-Yates shuffle

(Figure, s = 2: successive swaps of the Fisher-Yates shuffle on a b c d, e.g. b a c d, then b c a d, then b d a c; the reservoir is always the first two positions.)

Slide14

External Memory Stream Sampling

Sample size s > M (M: main memory size); external memory size: unlimited

The stream goes to internal memory without cost; reading/writing data on external memory costs I/Os (block size: B)

Issue with the reservoir sampling algorithm: deleting an element in the existing sample costs 1 I/O

(Figure: stream arrives at internal memory; the sample resides in external memory, accessed in blocks of size B.)

Slide15

External Memory Stream Sampling

Idea: lazy deletion

Store the first s elements on disk

Algorithm for a new element (the n-th):

With probability s/n: add the new element to an in-memory buffer; if the buffer is full, write it to disk

With probability 1 − s/n, throw it away

When the number of elements stored in external memory reaches 2s, perform a clean-up step

Slide16

Clean-up Step

Idea: consider the elements in reverse order, using the principle of deferred decisions

The last element must stay; which element it kicks out will be decided later

An earlier element stays if it is not kicked out by a later element, which happens with a certain probability

At this point, just decide whether the later element kicks out this one or not

Slide17

Clean-up Step (continued)

Consider an element e, and suppose k elements after e have stayed; the remaining elements after e have been kicked out

The kick-out arrows after e haven’t been decided yet; we only know that they point to some elements at or before e (inclusive)

These arrows cannot point to the same element, and can only point to alive elements

Counting the alive elements gives the probability that e is pointed to by one of these arrows

So we just need to remember k

Slide18

External Memory Stream Sampling: Analysis

Each clean-up step can be done by one scan: O(s/B) I/Os

Each clean-up step removes a constant fraction of the stored elements

The expected number of elements ever kept bounds the total I/O cost

A matching lower bound is known; the technique extends to sliding windows

[Gemulla and Lehner 06] [Hu, Qiao, Tao 15]

Slide19

Sampling from Distributed Streams

One coordinator and k sites; each site can communicate with the coordinator

Goal: maintain a random sample of size s over the union of all streams with minimum communication

Difficulty: we don’t know n, so we can’t run the reservoir sampling algorithm

Key observation: we don’t have to know n in order to sample! Sampling is easier than counting

[Cormode, Muthukrishnan, Yi, Zhang 09] [Woodruff, Tirthapura 11]

Slide20

Reduction from Coin-Flip Sampling

Flip fair coins for each element until we get a “1”; an element is active at a level if its coin at that level is “0”

If a level has at least s active elements, we can draw a sample of size s from those active elements

Key: the coordinator does not want all the active elements of a low level, which are too many!

Choose a level appropriately

Slide21

The Algorithm

Initialize the round number i = 0

In round i:

Sites send in every item w.p. 2^−i (this is a coin-flip sample with prob. 2^−i)

The coordinator maintains a lower sample and a higher sample: each received item goes to either with equal prob. (the lower sample is thus a sample with prob. 2^−(i+1))

When the lower sample reaches size s, the coordinator broadcasts to advance to round i + 1:

Discard the higher sample

Split the lower sample into a new lower sample and a new higher sample
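A single-process sketch may help: one loop below plays both the sites and the coordinator, following the round structure described above (details such as the re-split loop are illustrative, not the paper’s exact protocol):

```python
import random

def distributed_sample(stream, s):
    """Sketch of the coordinator protocol: keep a coin-flip sample
    at geometrically decreasing rates 2^-i."""
    i = 0                       # current round
    lower, upper = [], []
    sent = 0                    # items "communicated" to the coordinator
    for x in stream:
        if random.random() < 2.0 ** -i:        # a site forwards x w.p. 2^-i
            sent += 1
            (lower if random.random() < 0.5 else upper).append(x)
        while len(lower) >= s:                 # lower sample full: next round
            i += 1                             # (coordinator broadcasts i+1)
            old = lower
            lower, upper = [], []              # discard the higher sample
            for y in old:                      # re-split the old lower sample
                (lower if random.random() < 0.5 else upper).append(y)
    return lower + upper, i, sent

sample, rounds, sent = distributed_sample(range(100000), 16)
print(len(sample), rounds, sent)
```

Re-splitting the lower sample is what makes the level sets nested: an item active at level i+1 is always one that was active at level i.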

Slide22

Communication Cost of the Algorithm

Communication cost of each round: expect to receive O(s) sampled items before the round ends, plus O(k) for the broadcast that ends the round

Number of rounds: O(log n), since in each round we need s items to be sampled at the current rate to end it, and each arriving item contributes with prob. 2^−i

Total communication: O((k + s) log n)

This can be improved further; a matching lower bound is known, and the technique extends to sliding windows

Slide23

Importance Sampling

The sampling probability depends on how important each data item is

Slide24

Frequency Estimation on Distributed Data

Given: a multiset S of n items drawn from a universe of size u (for example: IP addresses of network packets)

S is partitioned arbitrarily and stored on k nodes

Local count x_{i,j}: frequency of item i on node j

Global count: y_i = Σ_j x_{i,j}

Goal: estimate y_i with absolute error εn for all i

We can’t hope for small relative error for all i, but heavy hitters are estimated well

[Zhao, Ogihara, Wang, Xu 06] [Huang, Yi, Liu, Chen 11]

Slide25

Frequency Estimation: Standard Solutions

Local heavy hitters: let n_j be the data size at node j; node j sends in all items with local frequency at least εn_j

Total error is at most Σ_j εn_j = εn; communication cost: O(k/ε)

Simple random sampling: a simple random sample of size O(1/ε²) can be used to estimate the frequency of any item with error εn

Algorithm: the coordinator first gets n_j for all j, decides how many samples to get from each node, and then gets the samples from the nodes

Communication cost: O(k + 1/ε²)

Slide26

Importance Sampling

Sample each local count x_{i,j} with some probability q_{i,j}

Horvitz–Thompson estimator: report x_{i,j}/q_{i,j} if sampled, and 0 otherwise

Estimator for the global count y_i: the sum of the reports for item i; it is unbiased
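To see the unbiasedness concretely, here is a small simulation (the sampling probabilities `q` below are made up for illustration; any q bounded away from 0 works):

```python
import random

def ht_estimate(local_counts, q):
    """Horvitz-Thompson estimate of a global count: each local count x is
    reported with probability q(x); sampled counts are scaled by 1/q(x)."""
    est = 0.0
    for x in local_counts:
        p = q(x)
        if random.random() < p:
            est += x / p
    return est

# Unbiasedness check: average many estimates of y = sum(local_counts).
counts = [40, 3, 0, 17, 25]
q = lambda x: min(1.0, (x + 1) / 30)   # hypothetical sampling probabilities
avg = sum(ht_estimate(counts, q) for _ in range(200000)) / 200000
print(avg, sum(counts))  # the average is close to the true global count 85
```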

Slide27

Importance Sampling: What is a Good q?

Natural choice: sample with probability proportional to the local count x_{i,j}

Can show: the resulting error is at most εn for any i

Communication cost: O(√k/ε); this is (worst-case) optimal

Interesting discovery: another choice of q also has error εn for any i, and also has communication cost O(√k/ε) in the worst case (when a few local counts are large and the rest are zero), but can be much lower on some inputs

Slide28

This Choice of q is Instance-Optimal

(Figure: communication cost plotted over all possible inputs, comparing the two choices of q.)

Slide29

What Happened?

Making the sampling probabilities oblivious to the data removes all effects of the input

Slide30

Variance-Communication Duality

(Figure: variance plotted over all possible inputs, dual to the communication-cost plot.)

Slide31

Merge-Reduce Sampling

Better than simple random sampling

Slide32

ε-Approximation: A “Uniform” Sample

For every range, the fraction of sample points falling inside it deviates from the fraction of data points inside it by at most ε

A uniform sample (ε-approximation) needs roughly O(1/ε) sample points

A random sample needs O(1/ε²) sample points (w/ constant prob.)

(Figure: a point set with a range, and a random sample of the points.)

Slide33

Median and Quantiles (order statistics)

Exact quantiles:

for

,

CDF

Approximate version: tolerate answer between

An

-approximation produces

-approximate quantiles

 

Random Sampling on Big Data

33

Slide34

Merge-Reduce Sampling

Divide the data into chunks of size s and sort each chunk

Do binary merges until one chunk remains

Each merge takes the odd-positioned or the even-positioned elements of the merged sequence with equal probability

Example: merging 1 5 6 7 8 with 2 3 4 9 10 and keeping the odd positions yields 1 3 5 7 9

This needs time proportional to the whole data set; how is it useful?
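A single merge-reduce step is easy to write down; a sketch where `heapq.merge` performs the sorted merge:

```python
import random
from heapq import merge

def reduce_merge(chunk_a, chunk_b):
    """Merge two sorted size-s chunks, then keep the odd- or the
    even-positioned elements with equal probability (size s again)."""
    merged = list(merge(chunk_a, chunk_b))
    offset = random.randrange(2)   # 0 -> positions 1,3,5,...; 1 -> 2,4,6,...
    return merged[offset::2]

print(reduce_merge([1, 5, 6, 7, 8], [2, 3, 4, 9, 10]))
```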

Slide35

Application 1: Streaming Computation

Chunks can be merged up as items arrive in the stream

At any time, keep at most O(log(n/s)) chunks; space: O(s log(n/s))

The space can be improved by combining with random sampling [Agarwal, Cormode, Huang, Phillips, Wei, Yi 12], and further in [Felber, Ostrovsky 15] and [Karnin, Lang, Liberty 16]

For comparison: reservoir sampling needs O(1/ε²) space, and the best deterministic algorithm needs O((1/ε) log(εn)) space [Greenwald, Khanna 01]

Slide36

Error Analysis: Base Case

Consider any range R on a sample S (one chunk of size s) obtained by merging a data set D of size 2s

Approximation guarantee: the estimator 2|S ∩ R| is unbiased and has error at most 1

If |D ∩ R| is even: no error

If |D ∩ R| is odd: error ±1 with equal prob.

Slide37

Error Analysis: General Case

(Figure: the binary merge tree, levels 1 through 4.)

Consider the j-th merge at level i, merging two chunks into one

The error introduced is X_{i,j} = (new estimate) − (old estimate)

The absolute error of each X_{i,j} is bounded by the previous argument, scaled up by the level

Total error: the sum of the X_{i,j} over all levels

Slide38

Error Analysis: Azuma-Hoeffding

The errors X_{i,j} are not independent

Let Y_t be the prefix sums of the X_{i,j}’s; the Y_t’s form a martingale

Azuma-Hoeffding: if |Y_t − Y_{t−1}| ≤ c_t, then Pr[|Y_T − Y_0| ≥ λ] ≤ 2·exp(−λ² / (2 Σ_t c_t²)), the failure probability

Set λ = εn to get a small failure probability; it is enough to consider O(1/ε) ranges, so set the failure probability accordingly and apply a union bound

Slide39

Application 2: Distributed Data

Data is partitioned on k nodes

Each node reduces its data to one chunk using paired (odd/even) sampling, and sends it to the coordinator

The variance contributed by each node is bounded

The resulting communication cost is best possible (even under the blackboard model); the deterministic lower bound is higher

[Huang, Yi 14]

Slide40

Generalization to Multi-dimensions

For any range R in a range space (e.g., circles or rectangles), the sample fraction inside R approximates the data fraction inside R up to ε

Applications in data mining, machine learning, numerical integration, Monte Carlo simulations, …

Slide41

How to Reduce: Low-Discrepancy Coloring

Given a set P of n points, a coloring χ: P → {−1, +1} defines the discrepancy of a range R as |Σ_{p ∈ R} χ(p)|

Find the χ such that the maximum discrepancy over all ranges is minimized

Example: in 1D, just do an odd-even coloring

Reduce: pick one color class randomly

The sample size and the communication cost depend on the discrepancy

Slide42

Known Discrepancy Results

1D: discrepancy 1 (odd-even coloring)

For an arbitrary range space with m ranges: O(√(n log m)) by random coloring

For 2D axis-parallel rectangles: polylogarithmic

For 2D circles and halfplanes: polynomial in n

For a range space with VC-dimension d: O(n^{1/2 − 1/(2d)})

Slide43

Sampling for Approximate Query Processing

Slide44

Complex Analytical Queries (TPC-H)

SELECT SUM(l_price)
FROM customer, lineitem, orders, nation, region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_shipdate >= '2017-03-01'
  AND l_shipdate <= '2017-03-22'
  AND c_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'ASIA'

Things to consider:

What to return? A simple aggregation (COUNT, SUM), or a sample (for UDFs)

Is pre-computation allowed? Pre-computed samples, indexes

Slide45

Sampling from One Table

SELECT UDF(R.A) FROM R WHERE x < R.B < y

Have to allow pre-computation; otherwise, we can only scan the whole table, or sample & check

A simple aggregation is easily done in O(log n) time: associate partial aggregates with a binary tree (B-tree)

The goal is to return a random sample of size s of the query result

The query result size may be unknown in advance

Slide46

Binary Tree with Pre-computed Samples

(Figure: a binary tree over the sorted elements 1–16; each internal node stores a pre-computed random sample of its subtree, e.g. 1, 4, 5, 7, 9, 12, 14, 16 at the lowest internal level, then 3, 8, 12, 14, then 7, 14, and 5 at the root. The query range determines a set of active nodes.)

Report: 5

Slide47

Binary Tree with Pre-computed Samples

(Figure: the same tree; after the root’s sample element 5 is reported, the active nodes are updated.)

Report: 5

Slide48

Binary Tree with Pre-computed Samples

(Figure: the same tree; the next element is drawn by picking 7 or 14 with equal prob.)

Report: 5

Slide49

Binary Tree with Pre-computed Samples

(Figure: the same tree; the next element is drawn by picking 3, 8, or 14 with prob. 1:1:2.)

Report: 5 7

Slide50

Binary Tree with Pre-computed Samples

(Figure: the same tree; the active nodes are updated again.)

Report: 5 7

Slide51

Binary Tree with Pre-computed Samples

(Figure: the same tree; the next element is drawn by picking 3, 8, or 12 with equal prob.)

Report: 5 7

[Wang, Christensen, Li, Yi 16]

Slide52

Binary Tree with Pre-computed Samples

Query time: O(log n + s), independent of the full query result size

Extends to higher dimensions

Issue: the same query always returns the same sample

Independent range sampling [Hu, Qiao, Tao 14]; idea: replenish new samples after they are used
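The reporting loop can be sketched as follows. The data structure is deliberately simplified: each active node is reduced to its in-range subtree size and a pre-shuffled sample of the subtree (the real structure in [Wang, Christensen, Li, Yi 16] is more involved):

```python
import random

def draw_one(active_nodes):
    """Pick an active node with probability proportional to its size,
    then pop the next element of that node's pre-shuffled sample."""
    total = sum(size for size, _ in active_nodes)
    r = random.uniform(0, total)
    for size, sample in active_nodes:
        if r < size:
            return sample.pop()
        r -= size
    return active_nodes[-1][1].pop()   # guard against float rounding

# Hypothetical active nodes: (subtree size, pre-shuffled subtree sample).
nodes = [(4, [9, 12, 11, 10]), (2, [14, 13]), (1, [16])]
print(draw_one(nodes))
```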

Slide53

Two Tables: R1 ⋈ R2

Return a sample without pre-computation: hopeless, since a join of samples is not a sample of the join

Return a sample with pre-computation:

Build an index on R2 and obtain the value frequencies in R2

Sample a tuple in R1 with probability proportional to its join value’s frequency in R2

Then sample a joining tuple in R2 uniformly

Example: join size = 8

[Chaudhuri, Motwani, Narasayya 99]
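In Python, the two-table scheme looks like this (a sketch on made-up relations; the column names are illustrative):

```python
import random
from collections import Counter, defaultdict

def sample_join(r1, r2, key1, key2):
    """Draw one uniform sample from the equi-join r1 ⋈ r2, in the style of
    [Chaudhuri, Motwani, Narasayya 99]: pick a tuple of r1 with probability
    proportional to the frequency of its join value in r2, then pick a
    uniform joining tuple of r2."""
    freq = Counter(t[key2] for t in r2)
    by_val = defaultdict(list)
    for t in r2:
        by_val[t[key2]].append(t)
    weights = [freq[t[key1]] for t in r1]   # sum(weights) = join size
    t1 = random.choices(r1, weights=weights)[0]
    t2 = random.choice(by_val[t1[key1]])
    return t1, t2

r1 = [{"cid": 3}, {"cid": 5}, {"cid": 7}]
r2 = [{"buyer": 3}, {"buyer": 3}, {"buyer": 5}, {"buyer": 9}]
print(sample_join(r1, r2, "cid", "buyer"))
```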

Slide54

Sampling Joins: Open Problems

How to deal with selection predicates? We can’t afford to re-compute the value frequencies at query time

How to handle multi-way joins?

Pr[a tuple is sampled] should be proportional to the size of its residual query, i.e., the join restricted to results in which that tuple must appear

For acyclic queries, all the residual query sizes can be computed in linear time in preprocessing

For arbitrary queries, the problem is open

Slide55

Two Tables: COUNT, No Pre-computation

Ripple join [Haas, Hellerstein 99]:

Sample a tuple from each table

Join it with the previously sampled tuples from the other tables

The joined sampled tuples are not independent, but the estimate is unbiased

Works well for a full Cartesian product, but most joins are sparse …

Can be extended to multiple tables, but the efficiency is even lower

What can be done with pre-computation (indexes)?

Slide56

A Running Example

Customer table (Nation, CID):
US 1, US 2, China 3, UK 4, China 5, US 6, China 7, UK 8, Japan 9, UK 10

Order table (BuyerID, OrderID):
(4, 1), (3, 2), (1, 3), (5, 4), (5, 5), (5, 6), (3, 7), (5, 8), (3, 9), (7, 10)

Lineitem table (OrderID, ItemID, Price):
(4, 301, $2100), (2, 304, $100), (3, 201, $300), (4, 306, $500), (3, 401, $230), (1, 101, $800), (2, 201, $300), (5, 101, $200), (4, 301, $100), (2, 201, $600)

What’s the total revenue of all orders from customers in China?

Slide57

Join as a Graph

(Figure: the same three tables viewed as a tripartite graph; each customer connects to their orders, and each order connects to its line items.)

Slide58

Sampling by Random Walks

(Figure: the join graph; the random walk starts from a customer tuple.)

Slide59

Sampling by Random Walks

(Figure: the join graph; the walk moves to a random order of that customer.)

Slide60

Sampling by Random Walks

(Figure: the join graph; the walk moves to a random line item of that order, completing one join result.)

Slide61

Sampling by Random Walks

(Figure: the join graph with a full walk customer → order → line item sampled.)

Unbiased estimator: divide the sampled value by the probability of the walk that produced it (a Horvitz–Thompson estimate)

Can also deal with selection predicates

[Li, Wu, Yi, Zhao 16]
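On the running example, the random walk and its Horvitz–Thompson estimate can be simulated as follows (a sketch; the actual walk design in [Li, Wu, Yi, Zhao 16] is more refined, e.g. in how it handles dead ends):

```python
import random
from collections import defaultdict

# The three tables of the running example.
customers = {1: "US", 2: "US", 3: "China", 4: "UK", 5: "China",
             6: "US", 7: "China", 8: "UK", 9: "Japan", 10: "UK"}
orders = [(4, 1), (3, 2), (1, 3), (5, 4), (5, 5),
          (5, 6), (3, 7), (5, 8), (3, 9), (7, 10)]   # (buyer, order id)
lineitems = [(4, 2100), (2, 100), (3, 300), (4, 500), (3, 230),
             (1, 800), (2, 300), (5, 200), (4, 100), (2, 600)]  # (order, price)

orders_by_buyer = defaultdict(list)
for buyer, oid in orders:
    orders_by_buyer[buyer].append(oid)
items_by_order = defaultdict(list)
for oid, price in lineitems:
    items_by_order[oid].append(price)

def walk_estimate():
    """One walk customer -> order -> line item; divide the observed price
    by the walk's probability (a dead-end walk estimates 0)."""
    china = [c for c, nation in customers.items() if nation == "China"]
    c = random.choice(china)                    # prob 1/|china|
    if not orders_by_buyer[c]:
        return 0.0
    o = random.choice(orders_by_buyer[c])       # prob 1/deg(c)
    if not items_by_order[o]:
        return 0.0
    price = random.choice(items_by_order[o])    # prob 1/deg(o)
    p = (1 / len(china)) * (1 / len(orders_by_buyer[c])) \
        * (1 / len(items_by_order[o]))
    return price / p

est = sum(walk_estimate() for _ in range(100000)) / 100000
print(est)   # averages to the true China revenue, 3900 on this data
```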

Slide62

Open Problem

Theoretical analysis of this random walk algorithm? Focus on COUNT

Connection to approximate triangle counting:

A sublinear-time algorithm obtains a constant-factor approximation of the number of triangles [Eden, Levi, Ron, Seshadhri 15]

That algorithm is essentially the same as the random walk algorithm (with the right parameters) applied to the triangle join

Conjecture: a similar sublinear-time bound for estimating the join size (COUNT)

Currently, computing COUNT is no easier than computing the full join

Slide63

Thank you!