Slide 1
Random Sampling on Big Data: Techniques and Applications
Ke Yi
Hong Kong University of Science and Technology
yike@ust.hk
Slide 3
"Big Data" in one slide
The 3 V's:
- Volume: external-memory algorithms; distributed data
- Velocity: streaming data
- Variety: integers, real numbers; points in a multi-dimensional space; records in a relational database; graph-structured data
Slide 4
Dealing with Big Data
The first approach: scale up / out the computation.
Many great technical innovations:
- Distributed/parallel systems: MapReduce, Pregel, Dremel, Spark, ...
- New computational models: BSP, MPC, ... (Dan Suciu's tutorial tomorrow; my BeyondMR talk on Friday)
This talk is not about this approach!
Slide 5
Downsizing data
A second approach to computational scalability: scale down the data!
- There is too much redundancy in big data anyway
- 100% accuracy is often not needed
- What we finally want is small: human-readable analyses / decisions
Examples: samples, sketches, histograms, various transforms (see Graham Cormode's tutorial for other data summaries)
Complementary to the first approach:
- Can scale out the computation and scale down the data at the same time
- Algorithms need to work under new system architectures; the good old RAM model no longer applies
Slide 6
Outline of the talk
- Stream sampling
- Importance sampling
- Merge-reduce sampling
- Sampling for approximate query processing
  - Sampling from one table
  - Sampling from multiple tables (joins)
Slide 7
Simple Random Sampling
Sampling without replacement:
- Randomly draw an element
- Don't put it back
- Repeat s times
Sampling with replacement:
- Randomly draw an element
- Put it back
- Repeat s times
Both are trivial in the RAM model. The statistical difference between the two is very small for s much smaller than n.
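The two schemes can be sketched with Python's standard library (a minimal illustration, not from the slides):

```python
import random

def sample_without_replacement(data, s):
    # Draw an element, don't put it back, repeat s times.
    return random.sample(data, s)

def sample_with_replacement(data, s):
    # Draw an element, put it back, repeat s times.
    return [random.choice(data) for _ in range(s)]

population = list(range(1000))
a = sample_without_replacement(population, 10)  # no duplicates possible
b = sample_with_replacement(population, 10)     # duplicates possible
```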
Slide 8
Stream Sampling
(figure: a stream P flowing past a bounded memory)
Slide 10
Reservoir Sampling
Maintain a sample of size s drawn (without replacement) from all elements in the stream so far.
- Keep the first s elements in the stream as the initial sample
- For the i-th element (i > s):
  - With probability s/i, use it to replace an item in the current sample chosen uniformly at random
  - With probability 1 - s/i, throw it away
Space: O(s); time: O(1) per element.
Perhaps the first "streaming algorithm"
[Waterman ??; Knuth's book]
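The algorithm above can be sketched as follows (a minimal Python version of the classic scheme):

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform without-replacement sample of size s of the stream."""
    sample = []
    for i, x in enumerate(stream, start=1):
        if i <= s:
            sample.append(x)               # keep the first s elements
        elif rng.random() < s / i:         # with prob. s/i, the new element ...
            sample[rng.randrange(s)] = x   # ... evicts a uniformly random slot
        # with prob. 1 - s/i the new element is thrown away
    return sample

sample = reservoir_sample(range(100000), 10)
```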
Slide 11
Correctness Proof
By induction on i:
- i = s: trivially correct
- Assume each element so far is sampled with probability s/i
- Consider element i + 1:
  - The new element is sampled with probability s/(i+1)
  - Any of the first i elements is sampled with probability (s/i) · (1 − (s/(i+1)) · (1/s)) = s/(i+1). □
This is a wrong (incomplete) proof: each element being sampled with probability s/n is not a sufficient condition for random sampling.
Counterexample: divide the elements into groups of s and pick one group randomly; every element is sampled with probability s/n, but the result is not a uniform sample.
Slide 13
Reservoir Sampling Correctness Proof
The correct proof relates to the Fisher-Yates shuffle, the algorithm that returns a uniformly random permutation of n elements.
Reservoir sampling maintains the top s elements during the Fisher-Yates shuffle.
(figure: the Fisher-Yates shuffle run on a, b, c, d with s = 2; the first two positions form the current sample)
Slide 14
External Memory Stream Sampling
- Sample size s larger than M (M: main memory size); external memory size: unlimited
- The stream goes to internal memory without cost; reading/writing a block of B elements on external memory costs one I/O
- Issue with the reservoir sampling algorithm: deleting an element from the existing sample costs 1 I/O
(figure: stream P flows into internal memory, backed by external memory with block size B)
Slide 15
External Memory Stream Sampling
Idea: lazy deletion
- Store the first s elements on disk
- For the i-th element of the stream:
  - With probability s/i: add the new element to an in-memory buffer; if the buffer is full, write it to disk
  - With probability 1 − s/i, throw it away
- When the number of elements stored in external memory grows to a constant factor above s, perform a clean-up step
Slide 16
Clean-up Step
Idea: consider the elements in reverse order, using the principle of deferred decisions.
- A sampled element must stay; which earlier element it kicks out will be decided later
- An element stays if it is not kicked out by any later sampled element, which happens with a probability that can be computed at this point
- At this point, just decide whether each later element kicks it out or not
Slide 17
Clean-up Step (cont.)
- Consider an element e, and suppose k of the sampled elements after e have stayed
- The remaining arrows (kick-out decisions) after e haven't been decided; we only know that each points to some element before (and including) e
- These arrows cannot point to the same element, and can only point to alive elements
- If m elements are still alive, e is pointed to by one of the undecided arrows with probability (#undecided arrows)/m
- We just need to remember these counts
Slide 18
External Memory Stream Sampling
- Each clean-up step can be done in one scan: O(s/B) I/Os
- Each clean-up step removes Ω(s) elements
- The number of elements ever kept is O(s log(n/s)) in expectation
- Total I/O cost: O((s/B) log(n/s))
- There is a matching lower bound; the result extends to sliding windows
[Gemulla and Lehner 06] [Hu, Qiao, Tao 15]
Slide 19
Sampling from Distributed Streams
- One coordinator and k sites; each site can communicate with the coordinator
- Goal: maintain a random sample of size s over the union of all streams with minimum communication
- Difficulty: we don't know n, so we can't run the reservoir sampling algorithm
- Key observation: we don't have to know n in order to sample! Sampling is easier than counting.
[Cormode, Muthukrishnan, Yi, Zhang 09] [Woodruff, Tirthapura 11]
Slide 20
Reduction from Coin-Flip Sampling
- Flip a fair coin for each element until we get "1"; an element is active on a level if its coin flip at that level is "0"
- If a level has at least s active elements, we can draw a sample of size s from those active elements
- Key: the coordinator does not want all the active elements of a low level, which are too many! Choose a level appropriately.
Slide 21
The Algorithm
Initialize the round number j = 0. In round j:
- Sites send in every item w.p. 2^{-j} (this is a coin-flip sample with prob. 2^{-j})
- The coordinator maintains a lower sample and a higher sample: each received item goes to either with equal prob. (the lower sample is a coin-flip sample with prob. 2^{-(j+1)})
- When the lower sample reaches size s, the coordinator broadcasts to advance to round j + 1:
  - Discard the upper sample
  - Split the lower sample into a new lower sample and a higher sample, again by flipping a coin per item
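The protocol can be simulated in a few lines (a single-process sketch, assuming an arbitrary round-robin interleaving of the site streams; message counting is illustrative):

```python
import random
from itertools import zip_longest

def roundrobin(streams):
    # Interleave the k site streams in an arbitrary (round-robin) order.
    sentinel = object()
    for batch in zip_longest(*streams, fillvalue=sentinel):
        for x in batch:
            if x is not sentinel:
                yield x

def distributed_sample(streams, s, rng=random):
    """Coordinator protocol sketch: maintain a sample over the union of the
    streams while counting messages. Returns (lower_sample, messages)."""
    k = len(streams)
    j = 0                    # round number: sites send each item w.p. 2**-j
    lower, upper = [], []
    messages = 0
    for x in roundrobin(streams):
        if rng.random() < 2.0 ** (-j):       # site-side coin flips
            messages += 1                    # one site -> coordinator message
            (lower if rng.random() < 0.5 else upper).append(x)
            if len(lower) == s:              # end the round
                messages += k                # broadcast "advance to round j+1"
                j += 1
                old, lower, upper = lower, [], []
                for y in old:                # split the lower sample
                    (lower if rng.random() < 0.5 else upper).append(y)
    return lower, messages

streams = [range(i * 2500, (i + 1) * 2500) for i in range(4)]  # k = 4 sites
sample, msgs = distributed_sample(streams, 64)
```

Note that far fewer than the 10,000 stream items are ever communicated.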
Slide 22
Communication Cost of the Algorithm
- Communication cost of each round: O(s + k)
  - Expect to receive O(s) sampled items before the round ends
  - Broadcast to end the round: O(k)
- Number of rounds: O(log n)
  - In each round, Θ(s) items need to be sampled to end the round; each item has prob. 2^{-j} of contributing, so Θ(s · 2^j) new items are needed
- Total communication: O((s + k) log n); this can be improved, and there is a matching lower bound
- Extends to sliding windows
Slide 23
Importance Sampling
Sampling probability depends on how important each data item is.
Slide 24
Frequency Estimation on Distributed Data
- Given: a multiset S of n items drawn from a universe [u]
  - For example: IP addresses of network packets
- S is partitioned arbitrarily and stored on k nodes
- Local count x_{ij}: frequency of item i on node j; global count y_i = Σ_j x_{ij}
- Goal: estimate y_i with absolute error εn for all i
  - Can't hope for a small relative error for all i, but heavy hitters are estimated well
[Zhao, Ogihara, Wang, Xu 06] [Huang, Yi, Liu, Chen 11]
Slide 25
Frequency Estimation: Standard Solutions
Local heavy hitters:
- Let n_j be the data size at node j; node j sends in all items with frequency ≥ εn_j
- Total error is at most Σ_j εn_j = εn; communication cost: O(k/ε)
Simple random sampling:
- A simple random sample of size O(1/ε²) can be used to estimate the frequency of any item with error εn
- Algorithm: the coordinator first gets n_j for all j, decides how many samples to get from each node, then gets the samples
- Communication cost: O(k + 1/ε²)
Slide 26
Importance Sampling
Each node samples its local count x_{ij} with a probability q(x_{ij}) that grows with x_{ij}, and reports x_{ij} if sampled.
Horvitz-Thompson estimator: x̂_{ij} = x_{ij}/q(x_{ij}) if reported, and 0 otherwise; this is unbiased.
Estimator for the global count y_i: ŷ_i = Σ_j x̂_{ij}.
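The Horvitz-Thompson construction can be demonstrated numerically (a minimal sketch; the counts and the particular q below are made-up illustrations):

```python
import random

def ht_estimate(local_counts, q, rng=random):
    """Each node reports its local count x w.p. q(x); the coordinator sums the
    inverse-probability-weighted reports. Unbiased for y = sum(local_counts)."""
    est = 0.0
    for x in local_counts:
        p = q(x)
        if p > 0 and rng.random() < p:
            est += x / p     # Horvitz-Thompson weight
    return est

counts = [5, 40, 3, 0, 17, 60, 2]   # local counts of one item on k = 7 nodes
q = lambda x: min(1.0, x / 20.0)    # sample large counts with higher probability
trials = 20000
avg = sum(ht_estimate(counts, q) for _ in range(trials)) / trials
# averaged over many trials, avg approaches the true global count sum(counts) = 127
```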
Slide 27
Importance Sampling: What is a Good q?
- Natural choice: sampling probability proportional to the local count; more precisely, q(x) = min{1, x·√k/(εn)}
- Can show: Var[ŷ_i] = O((εn)²) for any i
- Communication cost: O(√k/ε); this is (worst-case) optimal
Interesting discovery: a different choice of q
- Also has Var[ŷ_i] = O((εn)²) for any i
- Also has communication cost O(√k/ε) in the worst case, when Θ(√k) local counts are large and the rest are zero
- But the cost can be much lower than O(√k/ε) on some inputs
Slide 28
This q is Instance-Optimal
(figure: communication cost plotted over all possible inputs; this q matches the best attainable cost on every input)
Slide 29
What Happened?
Fixing q to the worst-case choice removes all effects of the input: the cost is the same no matter what the data is.
Slide 30
Variance-Communication Duality
(figure: variance plotted over all possible inputs, mirroring the communication-cost figure)
Slide 31
Merge-Reduce Sampling
Better than simple random sampling.
Slide 32
ε-Approximation: A "Uniform" Sample
S is an ε-approximation of P if, for every range R, | |R ∩ P|/|P| − |R ∩ S|/|S| | ≤ ε.
- In 1D, a "uniform" sample (every (εn)-th element in sorted order) needs only 1/ε sample points
- A random sample needs Θ(1/ε²) sample points (w/ constant prob.)
Slide 33
Median and Quantiles (order statistics)
- Exact quantiles: for 0 < φ < 1, return the element of rank φn (the inverse of the CDF)
- Approximate version: tolerate any answer with rank between (φ − ε)n and (φ + ε)n
- An ε-approximation produces ε-approximate quantiles for all φ at once
Slide 34
Merge-Reduce Sampling
- Divide the data into chunks of size s and sort each chunk
- Do binary merges into one chunk
- Each merge sorts the two chunks together, then keeps the odd-positioned or the even-positioned elements with equal probability
- This needs O(n log n) time, so how is it useful?
Example: merging 1 5 6 7 8 with 2 3 4 9 10 and keeping the odd positions yields 1 3 5 7 9.
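The merge-reduce scheme can be sketched directly (a minimal version, assuming for simplicity that the data size is s times a power of two):

```python
import heapq
import random

def paired_merge(a, b, rng=random):
    """Merge two sorted chunks; keep the odd- or even-positioned elements of
    the merged order with equal probability."""
    merged = list(heapq.merge(a, b))
    return merged[rng.randrange(2)::2]

def merge_reduce(data, s, rng=random):
    """Reduce the data to one size-s chunk by a tree of paired merges."""
    chunks = [sorted(data[i:i + s]) for i in range(0, len(data), s)]
    while len(chunks) > 1:
        chunks = [paired_merge(chunks[i], chunks[i + 1], rng)
                  for i in range(0, len(chunks), 2)]
    return chunks[0]

# The slide's example: one paired merge of 1 5 6 7 8 and 2 3 4 9 10.
out = paired_merge([1, 5, 6, 7, 8], [2, 3, 4, 9, 10])
# out is [1, 3, 5, 7, 9] or [2, 4, 6, 8, 10], each with prob. 1/2
summary = merge_reduce(list(range(1024)), 64)
```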
Slide 35
Application 1: Streaming Computation
- Can merge chunks up as items arrive in the stream; at any time, keep at most O(log(n/s)) chunks
- Space: O((1/ε) log²(εn))
- Can be improved to O((1/ε) log^1.5(1/ε)) by combining with random sampling [Agarwal, Cormode, Huang, Phillips, Wei, Yi 12]
- Improved to O((1/ε) log(1/ε)) [Felber, Ostrovsky 15]
- Improved to O((1/ε) log log(1/δ)) [Karnin, Lang, Liberty 16]
- Reservoir sampling needs O(1/ε²) space
- The best deterministic algorithm needs O((1/ε) log(εn)) space [Greenwald, Khanna 01]
Slide 36
Error Analysis: Base Case
Consider any range R on a sample S obtained by one paired merge from a data set P of size 2s.
- Estimator: 2|R ∩ S|, which is unbiased and has error at most 1
  - If |R ∩ P| is even, the estimator has no error
  - If |R ∩ P| is odd, it has error ±1 with equal prob.
Slide 37
Error Analysis: General Case
Consider the j-th merge at level i, merging chunks P1 and P2 into S.
- The estimate at this level is 2^i |R ∩ S|
- The error introduced is X_{ij} = (new estimate) − (old estimate) = 2^i |R ∩ S| − 2^{i−1}(|R ∩ P1| + |R ∩ P2|)
- Absolute error |X_{ij}| ≤ 2^{i−1} by the previous argument
- Total error over all log(n/s) levels: Σ_{i,j} X_{ij}
(figure: the merge tree, levels 1-4)
Slide 38
Error Analysis: Azuma-Hoeffding
- The errors X_{ij} are not independent
- Let Y_0, Y_1, ... be the prefix sums of the X_{ij}'s; the Y's form a martingale
- Azuma-Hoeffding: if |Y_t − Y_{t−1}| ≤ c_t, then Pr[|Y_T| ≥ α] ≤ 2 exp(−α² / (2 Σ_t c_t²)), the failure probability
- Set s appropriately to get failure probability δ for one range
- It is enough to consider O(1/ε) ranges; set the per-range failure probability to δε and apply a union bound
Slide 39
Application 2: Distributed Data
- Data partitioned on k nodes
- Each node reduces its data using paired sampling, and sends the result to the coordinator
- Each node can be allowed variance (εn)²/k
- Communication cost: O(√k/ε)
- Best possible (even under the blackboard model); any deterministic algorithm needs Ω(k/ε)
[Huang, Yi 14]
Slide 40
Generalization to Multi-dimensions
For any range R in a fixed range space (e.g., circles or rectangles), | |R ∩ P|/|P| − |R ∩ S|/|S| | ≤ ε.
Applications in data mining, machine learning, numerical integration, Monte Carlo simulations, ...
Slide 41
How to Reduce: Low-Discrepancy Coloring
- P: a set of n points in R^d
- A coloring χ: P → {−1, +1}; for a range R, define χ(R) = Σ_{p ∈ R ∩ P} χ(p)
- Find a coloring χ such that max_R |χ(R)| (the discrepancy) is minimized
- Example: in 1D, just do odd-even coloring
- Reduce: pick one color randomly and keep only those points
- The sample size and communication cost depend on the discrepancy
Slide 42
Known Discrepancy Results
- 1D: discrepancy 1 (odd-even coloring)
- For an arbitrary range space: O(√(n log n)) by random coloring
- For 2D axis-parallel rectangles: polylog(n)
- For 2D circles, halfplanes: O(n^{1/4})
- For a range space with VC-dimension d: O(n^{1/2 − 1/(2d)})
Slide 43
Sampling for Approximate Query Processing
Slide 44
Complex Analytical Queries (TPC-H)

SELECT SUM(l_price)
FROM customer, lineitem, orders, nation, region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_shipdate >= '2017-03-01'
  AND l_shipdate <= '2017-03-22'
  AND c_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'ASIA'

Things to consider:
- What to return? Simple aggregation (COUNT, SUM), or a sample (for UDFs)
- Pre-computation allowed? Pre-computed samples, indexes
Slide 45
Sampling from One Table

SELECT UDF(R.A)
FROM R
WHERE x < R.B < y

- Have to allow pre-computation; otherwise one can only scan the whole table, or sample & check
- Simple aggregation is easily done in O(log n) time: associate partial aggregates with a binary tree (B-tree)
- Goal: return a random sample of size s of the query result; the full query size may be unknown
Slide 46
Binary Tree with Pre-computed Samples
(figure: a binary tree over the elements 1..16; each internal node stores a pre-computed random sample of the elements below it, e.g., the root stores 5, the next level stores 7 and 14, then 3, 8, 12, 14, and so on)
A range query is covered by a set of active (canonical) nodes; samples are drawn by picking among the active nodes with probability proportional to their subtree sizes and consuming their pre-computed samples.
Slides 47-51 (animation)
Successive steps of drawing samples from the active nodes:
- Report: 5
- Pick 7 or 14 with equal prob.; report: 5 7
- Pick 3, 8, or 14 with prob. 1:1:2
- Pick 3, 8, or 12 with equal prob.
[Wang, Christensen, Li, Yi 16]
Slide 52
Binary Tree with Pre-computed Samples
- Query time: O(log n + s), independent of the full query size
- Extends to higher dimensions
- Issue: the same query always returns the same sample
- Independent range sampling [Hu, Qiao, Tao 14]; idea: replenish new samples after they are used
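The canonical-node structure can be sketched as follows (a simplified illustration: instead of consuming stored pre-computed samples, it draws fresh random elements from each chosen canonical node, which yields the same output distribution):

```python
import random
from bisect import bisect_left, bisect_right

def canonical_nodes(lo, hi, node_lo, node_hi):
    """Yield the O(log n) maximal dyadic intervals that tile [lo, hi)."""
    if hi <= node_lo or node_hi <= lo:
        return
    if lo <= node_lo and node_hi <= hi:
        yield (node_lo, node_hi)
        return
    mid = (node_lo + node_hi) // 2
    yield from canonical_nodes(lo, hi, node_lo, mid)
    yield from canonical_nodes(lo, hi, mid, node_hi)

def range_sample(sorted_vals, x, y, s, rng=random):
    """Draw s independent uniform samples from the values in (x, y)."""
    lo, hi = bisect_right(sorted_vals, x), bisect_left(sorted_vals, y)
    if lo >= hi:
        return []
    n = 1
    while n < len(sorted_vals):          # pad the tree domain to a power of two
        n *= 2
    nodes = list(canonical_nodes(lo, hi, 0, n))
    weights = [b - a for a, b in nodes]  # subtree sizes
    out = []
    for _ in range(s):
        a, b = rng.choices(nodes, weights=weights)[0]  # active node, w.p. ∝ size
        out.append(sorted_vals[rng.randrange(a, b)])   # stand-in for a stored sample
    return out

vals = list(range(100))
samples = range_sample(vals, 9.5, 50, 20)
```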
Slide 53
Two Tables: R1 ⋈ R2
Return a sample without pre-computation:
- Hopeless: a uniform sample of the join cannot be derived from independent samples of the two tables
Return a sample with pre-computation:
- Build an index on R2 and obtain the value frequencies in R2
- Sample a tuple in R1 with probability proportional to its join value's frequency in R2
- Then sample a joining tuple in R2 uniformly
Example: join size = 8 (figure)
[Chaudhuri, Motwani, Narasayya 99]
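The frequency-weighted scheme can be sketched as follows (a minimal version in the spirit of the approach above; the toy relations are made up):

```python
import random
from collections import defaultdict

def sample_join(r1, r2, key1, key2, s, rng=random):
    """Uniformly sample s result tuples of r1 JOIN r2 (on key1 = key2),
    given a pre-built index and value frequencies on r2."""
    index = defaultdict(list)           # pre-computation: index on r2
    for t in r2:
        index[t[key2]].append(t)
    # weight each r1 tuple by its join value's frequency in r2
    weights = [len(index[t[key1]]) for t in r1]
    out = []
    for _ in range(s):
        t1 = rng.choices(r1, weights=weights)[0]  # r1 tuple, w.p. ∝ its fan-out
        t2 = rng.choice(index[t1[key1]])          # joining r2 tuple, uniform
        out.append((t1, t2))
    return out

r1 = [(1, "a"), (2, "b"), (3, "c")]
r2 = [(1, 10), (1, 11), (3, 12)]      # join size is 3; (2, "b") joins nothing
pairs = sample_join(r1, r2, 0, 0, 50)
```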
Slide 54
Sampling Joins: Open Problems
- How to deal with selection predicates? Can't afford to re-compute the value frequencies at query time
- How to handle multi-way joins?
  - Pr[t is sampled] should be proportional to the size of t's residual query: the query when t must appear in the join result
  - For acyclic queries, all the residual query sizes can be computed in linear time in preprocessing
  - For arbitrary (cyclic) queries, the problem is open
Slide 55
Two Tables: COUNT, No Pre-computation
Ripple join [Haas, Hellerstein 99]:
- Sample a tuple from each table
- Join it with the previously sampled tuples from the other table
- The joined sampled tuples are not independent, but the estimator is unbiased
- Works well for a full Cartesian product, but most joins are sparse ...
- Can be extended to multiple tables, but efficiency is even lower
What can be done with pre-computation (indexes)?
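A COUNT estimator in the ripple-join style can be sketched as follows (a simplified with-replacement variant; the toy tables and the generous error band are illustrations):

```python
import random

def ripple_join_count(r1, r2, pred, steps, rng=random):
    """Each step samples one new tuple from each table and joins it against
    all previously sampled tuples of the other table; the matching-pair count
    is scaled up to an unbiased estimate of the join size."""
    s1, s2, hits = [], [], 0
    for _ in range(steps):
        t1, t2 = rng.choice(r1), rng.choice(r2)
        hits += sum(pred(t1, u) for u in s2)   # new r1 tuple vs. old r2 tuples
        hits += sum(pred(u, t2) for u in s1)   # old r1 tuples vs. new r2 tuple
        hits += pred(t1, t2)                   # the new pair itself
        s1.append(t1)
        s2.append(t2)
    # steps**2 uniformly sampled pairs out of |r1| * |r2| possible pairs
    return hits * len(r1) * len(r2) / steps ** 2

r1 = list(range(100))
r2 = list(range(100))
est = ripple_join_count(r1, r2, lambda a, b: a == b, 400)  # true join size: 100
```

Note how sparse this join is: only 100 of the 10,000 possible pairs match, which is why the estimate converges slowly.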
Slide 56
A Running Example

Customer:
Nation  CID
US      1
US      2
China   3
UK      4
China   5
US      6
China   7
UK      8
Japan   9
UK      10

Orders:
BuyerID  OrderID
4        1
3        2
1        3
5        4
5        5
5        6
3        7
5        8
3        9
7        10

Lineitem:
OrderID  ItemID  Price
4        301     $2100
2        304     $100
3        201     $300
4        306     $500
3        401     $230
1        101     $800
2        201     $300
5        101     $200
4        301     $100
2        201     $600

What's the total revenue of all orders from customers in China?
Slide 57
Join as a Graph
(figure: the three tables above viewed as a tripartite graph; each customer tuple is linked to its joining order tuples, and each order tuple to its joining lineitem tuples)
Slides 58-61
Sampling by Random Walks
(animation over the same tables: pick a uniformly random customer tuple, then a uniformly random joining orders tuple, then a uniformly random joining lineitem tuple)
Unbiased estimator: weight each sampled result by the inverse of the probability of the walk that produced it (Horvitz-Thompson).
Can also deal with selection predicates.
[Li, Wu, Yi, Zhao 16]
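The random-walk estimator can be run on the running example (a minimal sketch; the table names Customer/Orders/Lineitem are my labels, not from the slides, and the exact answer on these tables is $3900):

```python
import random
from collections import defaultdict

customer = [("US", 1), ("US", 2), ("China", 3), ("UK", 4), ("China", 5),
            ("US", 6), ("China", 7), ("UK", 8), ("Japan", 9), ("UK", 10)]
orders = [(4, 1), (3, 2), (1, 3), (5, 4), (5, 5),           # (BuyerID, OrderID)
          (5, 6), (3, 7), (5, 8), (3, 9), (7, 10)]
lineitem = [(4, 301, 2100), (2, 304, 100), (3, 201, 300),   # (OrderID, ItemID, Price)
            (4, 306, 500), (3, 401, 230), (1, 101, 800),
            (2, 201, 300), (5, 101, 200), (4, 301, 100), (2, 201, 600)]

orders_by_buyer = defaultdict(list)
for o in orders:
    orders_by_buyer[o[0]].append(o)
items_by_order = defaultdict(list)
for l in lineitem:
    items_by_order[l[0]].append(l)

def random_walk_sum(trials, rng=random):
    """Estimate SUM(Price) over the 3-way join restricted to China, via
    random walks with Horvitz-Thompson (inverse-probability) weighting."""
    total = 0.0
    for _ in range(trials):
        c = rng.choice(customer)               # step 1: uniform customer tuple
        p = 1.0 / len(customer)
        os = orders_by_buyer[c[1]]
        if c[0] != "China" or not os:          # predicate fails or dead end: 0
            continue
        o = rng.choice(os)                     # step 2: uniform joining order
        p /= len(os)
        ls = items_by_order[o[1]]
        if not ls:                             # dead end: contributes 0
            continue
        l = rng.choice(ls)                     # step 3: uniform joining lineitem
        p /= len(ls)
        total += l[2] / p                      # inverse-probability weight
    return total / trials

est = random_walk_sum(50000)   # concentrates around the exact answer, 3900
```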
Slide 62
Open Problem
- Theoretical analysis of this random walk algorithm? Focus on COUNT
- Connection to approximate triangle counting:
  - An O(n/T^{1/3} + m^{3/2}/T)-time algorithm obtains a constant-factor approximation of the number of triangles T [Eden, Levi, Ron, Seshadhri 15]
  - That algorithm is essentially the same as the random walk algorithm (with the right parameters) applied to the triangle join
- Conjecture: a similar sublinear time suffices to estimate the join size
- Currently, computing COUNT is no easier than computing the full join
Slide 63
Thank you!