New Algorithms for Heavy Hitters in Data Streams
David Woodruff, IBM Almaden
Joint works with Arnab Bhattacharyya, Vladimir Braverman, Stephen R. Chestnut, Palash Dey, Nikita Ivkin, Jelani Nelson, and Zhengyu Wang
Streaming Model
Stream of elements a1, …, am in [n] = {1, …, n}. Assume m = poly(n)
Arbitrary order
One pass over the data
Minimize space complexity (in bits) for solving a task
Let fj be the number of occurrences of item j
Heavy Hitters Problem: find those j for which fj is large
Example stream: 2, 1, 1, 3, 7, 3, 4, …
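As a minimal baseline for this model (not from the slides), the sketch below makes one pass over a stream and keeps exact frequencies; the point of the algorithms that follow is to avoid the Θ(n log m) bits such a dictionary can require.

```python
from collections import defaultdict

def exact_frequencies(stream):
    # One pass, arbitrary order; exact counts, but the dictionary can need
    # Theta(n log m) bits of space, which the streaming algorithms below avoid.
    f = defaultdict(int)
    for a in stream:
        f[a] += 1
    return f

print(dict(exact_frequencies([2, 1, 1, 3, 7, 3, 4])))
```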
Guarantees
l1-guarantee:
output a set containing all items j for which fj ≥ φm
the set should not contain any j with fj ≤ (φ - ε)m
l2-guarantee:
output a set containing all items j for which fj² ≥ φ F2, where F2 = Σj fj²
the set should not contain any j with fj² ≤ (φ - ε) F2
The l2-guarantee can be much stronger than the l1-guarantee
Suppose the frequency vector is (n^{1/2}, 1, 1, 1, …, 1)
Item 1 is an l2-heavy hitter for constant φ, ε, but not an l1-heavy hitter
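A quick numeric check of the example above (a sketch; n = 10^6 and φ = 1/10 are chosen arbitrarily): item 1 passes the l2 threshold but not the l1 threshold.

```python
import math

n = 10**6
f = [math.isqrt(n)] + [1] * (n - 1)      # frequency vector (sqrt(n), 1, ..., 1)
m = sum(f)                               # l1 mass, about n
F2 = sum(x * x for x in f)               # l2 mass, about 2n

phi = 0.1
print(f[0] >= phi * m)        # False: f1 = 1000 << phi*m ~ 100100, so not an l1-heavy hitter
print(f[0] ** 2 >= phi * F2)  # True:  f1^2 = 10^6 >= phi*F2 ~ 2*10^5, so an l2-heavy hitter
```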
Outline
Optimal algorithm in all parameters φ, ε for the l1-guarantee
Optimal algorithm for the l2-guarantee for constant φ, ε
Misra-Gries
Maintain a list L of c = O(1/ε) pairs of the form (key, value)
Given an update to item i:
If i is in L, increment its value by 1
If i is not in L and there are fewer than c pairs in L, put (i, 1) in L
Otherwise, subtract 1 from all values in L and remove pairs with value 0
If an item i is never a key, each of its updates can be charged to c-1 distinct updates of other items, so fi ≤ m/(c-1) ≤ εm for suitable c = O(1/ε)
Charge each update not included in the value f'i of a key i to c-1 updates of other items: fi - f'i ≤ εm
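A minimal sketch of Misra-Gries as described above (dictionary-based for readability rather than bit-level space efficiency; c = ceil(1/ε) counters):

```python
import math

class MisraGries:
    def __init__(self, eps):
        self.c = math.ceil(1 / eps)   # number of (key, value) pairs kept
        self.counters = {}            # key -> value

    def update(self, i):
        if i in self.counters:
            self.counters[i] += 1
        elif len(self.counters) < self.c:
            self.counters[i] = 1
        else:
            # decrement every stored value; drop keys that hit zero
            for k in list(self.counters):
                self.counters[k] -= 1
                if self.counters[k] == 0:
                    del self.counters[k]

    def estimate(self, i):
        # the stored value f'_i satisfies f_i - eps*m <= f'_i <= f_i
        return self.counters.get(i, 0)

mg = MisraGries(eps=0.25)
for a in [2, 1, 1, 3, 7, 3, 4, 1, 1]:
    mg.update(a)
print(mg.counters)
```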
Space Complexity of Misra-Gries
Misra-Gries uses O((1/ε) log n) bits, assuming stream length m = poly(n)
Optimal if φ = Θ(ε), since the output size can then be Ω((1/ε) log n) bits
But what if, say, φ = ½ and ε = 1/log n?
Misra-Gries uses O(log² n) bits, but the lower bound is only Ω(log n) bits
Our Results
Obtain an optimal algorithm using O((1/ε) log(1/φ) + (1/φ) log n) bits
If φ = ½ and ε = 1/log n, we obtain the optimal O(log n) bits!
For general stream lengths m, there is an additive O(log log m) in the upper and lower bounds, so this is also optimal
O(1) update and reporting times, provided ε is not too small
A Simple Initial Improvement
First show an O((1/ε) log(1/ε) + (1/φ) log n) bit algorithm, then improve it to the optimal O((1/ε) log(1/φ) + (1/φ) log n) bits
Idea: use the same number c of (key, value) pairs, but compress each pair
Compress the values by sampling random stream positions: if we sample stream positions with probability p = Θ(1/(ε²m)), then for all i in [n] the sampled frequency concentrates around p·fi
There are at most O(1/ε²) distinct keys after sampling, so hash the identities to a universe of size poly(1/ε)
Why Does Sampling Work?
Compress the values by sampling random stream positions
If we sample stream positions with probability p = Θ(1/(ε²m)), then for all i in [n], the sampled frequency of i is p·fi ± O(1/ε) with good probability
Equivalently, after rescaling by 1/p, each fi is estimated up to an additive O(εm)
[figure: the example stream with the sampled positions highlighted]
Misra-Gries after Hashing
The stream length is O(1/ε²) after sampling
There are at most O(1/ε²) distinct keys after sampling, so hash the identities pairwise-independently to a universe of size poly(1/ε)
Misra-Gries on the (hashed key, value) pairs takes O((1/ε) log(1/ε)) bits of space
Heavy hitters in the sampled stream correspond to heavy hitters in the original stream, and frequencies are preserved up to an additive εm
Problem: we want the original (non-hashed) identities of the heavy hitters!
Maintaining Identities
For the O(1/φ) items with the largest counts, as reported by our data structure, maintain their actual log n bit identities
Always possible to maintain, since if we sample an insertion of an item i, we have its actual identity in hand
[figure: tables of (hashed key, value) and (actual key, value) pairs]
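A rough end-to-end sketch of this initial improvement (illustrative only: the constant 100 in the sampling rate is hypothetical, Python's built-in hash stands in for a pairwise-independent hash, and identities are kept for all sampled keys rather than only the largest ones):

```python
import random

def sampled_hashed_mg(stream, m, eps, phi, seed=0):
    # Sample positions at rate ~1/(eps^2 m), hash identities into a poly(1/eps)
    # universe, run Misra-Gries on the hashed keys, and remember a real identity
    # for each sampled hashed key.
    rng = random.Random(seed)
    p = min(1.0, 100.0 / (eps * eps * m))      # sampling rate; constant is arbitrary
    universe = int(1 / eps) ** 4               # poly(1/eps) hashed universe
    c = int(1 / eps) + 1                       # number of Misra-Gries pairs
    counters, identity = {}, {}

    for a in stream:
        if rng.random() >= p:
            continue                           # most updates are not sampled: do nothing
        key = hash((a, seed)) % universe       # stand-in for a pairwise-independent hash
        identity[key] = a                      # sampled insertion: actual identity in hand
        if key in counters:
            counters[key] += 1
        elif len(counters) < c:
            counters[key] = 1
        else:
            for k in list(counters):
                counters[k] -= 1
                if counters[k] == 0:
                    del counters[k]

    # report actual identities of the heaviest hashed keys, rescaled to the original stream
    top = sorted(counters, key=counters.get, reverse=True)[:int(1 / phi) + 1]
    return {identity[k]: counters[k] / p for k in top}
```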
Summary of Initial Improvement
An O((1/ε) log(1/ε) + (1/φ) log n) bit algorithm
Update and reporting time can be made O(1), provided ε is not too small
Most stream updates are not sampled, so for them we do nothing!
Spread out the computation of expensive operations over future updates for which you do nothing
An Optimal Algorithm
Have O((1/ε) log(1/ε) + (1/φ) log n) bits of space, but want O((1/ε) log(1/φ) + (1/φ) log n)
Too much space for the (key, value) pairs in Misra-Gries!
Instead, run Misra-Gries with O(1/φ) counters to find the items with frequency > φm/2, then use a separate data structure to estimate their frequencies up to an additive (ε/2)m
The Misra-Gries data structure now takes only O((1/φ) log(1/ε)) bits of space
The separate data structure will be O(log(1/φ)) independent repetitions of a data structure using O(1/ε) bits
What can you do with O(1/ε) bits?
An Optimal Algorithm
Want to use O(1/ε) bits so that, for any given item i, we can report an additive εm approximation to fi with probability > 2/3
The median of the estimates across O(log(1/φ)) repetitions is an additive εm approximation with probability 1 - φ/100; union bound over the 1/φ candidate items
Keep O(1/ε) counters as in Misra-Gries, but each on average uses O(1) bits!
Can't afford to keep item identifiers, even hashed ones...
Can't afford to keep exact counts, even on the sampled stream...
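A small sketch of the amplification step described above (the generic median trick; the inner estimators here are toy placeholders, not the paper's data structure):

```python
import random
import statistics

def amplified_estimate(estimators, item):
    # Each estimator is good with probability > 2/3; the median over
    # O(log(1/phi)) independent copies fails with probability poly(phi).
    return statistics.median(est(item) for est in estimators)

# Toy check: each estimator returns the true count with probability 0.7, else garbage.
true_f = 42
estimators = [(lambda i, r=random.Random(k): true_f if r.random() < 0.7 else true_f + 1000)
              for k in range(25)]
print(amplified_estimate(estimators, item=7))   # 42 with high probability
```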
Dealing with Item Identifiers
Choose a pairwise-independent hash function h: [n] -> {1, 2, …, 1/ε}
Don't keep item identifiers; just treat all items that go to the same hash bucket as one item
Expected "noise" in a bucket is ε · (sampled stream length) = 1/ε
Solves the problem with item identifiers, but what about counts?
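For concreteness, a standard pairwise-independent hash family of the kind assumed above, h(x) = ((a·x + b) mod P) mod B with P prime (a textbook construction, not code from the paper):

```python
import random

P = (1 << 61) - 1          # a Mersenne prime larger than the universe size n

def make_pairwise_hash(num_buckets, rng=random):
    # h: [n] -> {0, ..., num_buckets-1} drawn from h(x) = ((a*x + b) mod P) mod B
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % num_buckets

h = make_pairwise_hash(num_buckets=16)
print([h(x) for x in range(10)])
```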
Dealing with Item Counts
We have r = O(1/ε) counters c1, …, cr, with Σj cj = O(1/ε²), and want to store each cj up to additive error O(1/ε)
Round each cj to its nearest integer multiple of 1/ε: gives O(1/ε) bits of space
But how to maintain this as the stream progresses?
Classic "probabilistic counters" do not work
We design "accelerated counters", which are more accurate as the count increases
For more details, please see the paper!
Conclusions on l1-guarantee
O((1/ε) log(1/φ) + (1/φ) log n + log log m) bits of space
If ε is not too small, then the update and reporting times are O(1)
We show a matching lower bound
Is this also a significant practical improvement over Misra-Gries?
Outline
Optimal algorithm in all parameters φ, ε for the l1-guarantee
Optimal algorithm for the l2-guarantee for constant φ, ε
CountSketch achieves the l2-guarantee [CCFC]
Assign each coordinate i a random sign σ(i) ∈ {-1, 1}
Randomly partition the coordinates into B buckets, and maintain cj = Σ_{i: h(i)=j} σ(i)·fi in the j-th bucket
[figure: the frequency vector f1, …, f10 hashed into buckets; e.g., the second bucket holds Σ_{i: h(i)=2} σ(i)·fi]
Estimate fi as σ(i)·c_{h(i)}
E[σ(i)·c_{h(i)}] = E[σ(i) · Σ_{i': h(i')=h(i)} σ(i')·f_{i'}] = fi
Repeat this hashing scheme O(log n) times
Output the median of the estimates
Ensures every fj is approximated up to an additive O((F2/B)^{1/2})
Gives O(log² n) bits of space
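A compact sketch of CountSketch as described (illustrative only; Python's random module stands in for the pairwise-independent hash and sign functions the analysis assumes):

```python
import random
import statistics

class CountSketch:
    def __init__(self, buckets, reps, seed=0):
        self.B, self.R = buckets, reps
        rng = random.Random(seed)
        self.seeds = [rng.randrange(1 << 30) for _ in range(reps)]
        self.tables = [[0] * buckets for _ in range(reps)]   # R independent rows of B counters

    def _h_sigma(self, r, i):
        # bucket h(i) and sign sigma(i) for row r (stand-in for pairwise-independent hashing)
        rng = random.Random(hash((self.seeds[r], i)))
        return rng.randrange(self.B), rng.choice((-1, 1))

    def update(self, i, delta=1):
        for r in range(self.R):
            h, s = self._h_sigma(r, i)
            self.tables[r][h] += s * delta

    def estimate(self, i):
        # median over rows of sigma(i) * c_{h(i)}
        ests = []
        for r in range(self.R):
            h, s = self._h_sigma(r, i)
            ests.append(s * self.tables[r][h])
        return statistics.median(ests)

cs = CountSketch(buckets=32, reps=9)
for a in [5] * 1000 + list(range(2000)):
    cs.update(a)
print(cs.estimate(5))   # close to 1001 (guarantee: additive O((F2/B)^(1/2)))
```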
Known Space Bounds for l2-heavy hitters
CountSketch achieves O(log² n) bits of space
If the stream is allowed to have deletions, this is optimal [DPIW]
What about insertion-only streams?
This is the model originally introduced by Alon, Matias, and Szegedy
Models internet search logs, network traffic, databases, scientific data, etc.
The only known lower bound is Ω(log n) bits, just to report the identity of the heavy hitter
Our Results [BCIW]
We give an algorithm using O(log n log log n) bits of space!
The same techniques give a number of other results:
(F2 at all times) Estimate F2 at all times in a stream with O(log n log log n) bits of space
Improves the union bound, which would take O(log² n) bits of space
Improves an algorithm of [HTY] which requires m >> poly(n) to achieve savings
(L∞-Estimation) Compute max_i fi up to additive (ε F2)^{1/2} using O(log n log log n) bits of space (resolves IITK Open Question 3)
Simplifications
Output a set containing all items i for which fi² ≥ φ F2, for constant φ
There are at most O(1/φ) = O(1) such items i
Hash items into O(1) buckets
All items i for which fi² ≥ φ F2 will go to different buckets with good probability
Problem reduces to having a single i* in {1, 2, …, n} with fi* ≥ (φ F2)^{1/2}
Intuition
Suppose first that fi* ≥ C·n^{1/2}·log n and fi ∈ {0, 1} for all i in {1, 2, …, n} \ {i*}
For the moment, also assume that we have an infinitely long random tape
Assign each coordinate i a random sign σ(i) ∈ {-1, 1}
Randomly partition the items into 2 buckets
Maintain c1 = Σ_{i: h(i)=1} σ(i)·fi and c2 = Σ_{i: h(i)=2} σ(i)·fi
Suppose h(i*) = 1. What do the values c1 and c2 look like?
c1 = σ(i*)·fi* + Σ_{i ≠ i*: h(i)=1} σ(i)·fi and c2 = Σ_{i ≠ i*: h(i)=2} σ(i)·fi
c1 - σ(i*)·fi* and c2 evolve as random walks as the stream progresses
(Random Walks) There is a constant C > 0 so that with probability 9/10, at all times, |c1 - σ(i*)·fi*| < Cn^{1/2} and |c2| < Cn^{1/2}
Eventually fi* > 2Cn^{1/2}, so |c1| > Cn^{1/2} while |c2| < Cn^{1/2}, which reveals h(i*)
Only gives 1 bit of information. Can't repeat log n times in parallel, but can repeat log n times sequentially!
Repeating Sequentially
Wait until either |c1| or |c2| exceeds Cn^{1/2}
If |c1| > Cn^{1/2} then guess h(i*) = 1, otherwise guess h(i*) = 2
This gives 1 bit of information about i*
(Repeat) initialize 2 new counters to 0 and perform the procedure again!
Assuming fi* = Ω(n^{1/2} log n), we will have at least 10 log n repetitions, and we will be correct in a 2/3 fraction of them
(Chernoff) with high probability, only a single value of i* has hash values matching a 2/3 fraction of the repetitions
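A toy simulation of this sequential scheme (a sketch under the slide's simplifying assumptions: one heavy item, all other frequencies in {0,1}, fresh randomness per repetition; the constant C and the stream sizes are arbitrary):

```python
import math
import random

def learn_bits_sequentially(stream, n, C=3.0, seed=0):
    # Two signed counters; whenever one escapes the +-C*sqrt(n) band, record which
    # bucket escaped (one bit about i*) and restart with fresh hashes and signs.
    rng = random.Random(seed)
    thresh = C * math.sqrt(n)
    bits, new_round = [], True
    for a in stream:
        if new_round:
            h = {i: rng.randrange(2) for i in range(n)}      # fresh 2-bucket hash
            s = {i: rng.choice((-1, 1)) for i in range(n)}   # fresh random signs
            c = [0.0, 0.0]
            new_round = False
        c[h[a]] += s[a]
        if abs(c[0]) > thresh or abs(c[1]) > thresh:
            bits.append((h, 0 if abs(c[0]) > thresh else 1))
            new_round = True
    return bits

n, heavy = 4096, 7
stream = [heavy] * (40 * int(math.sqrt(n))) + [i for i in range(n) if i != heavy]
random.shuffle(stream)
bits = learn_bits_sequentially(stream, n)
agree = sum(1 for h, b in bits if h[heavy] == b)
print(len(bits), agree)   # most recorded bits should agree with h(i*)
```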
Gaussian Processes
We don't actually have fi* ≥ n^{1/2} log n and fi ∈ {0,1} for all i in {1, 2, …, n} \ {i*}
Fix both problems using Gaussian processes
(Gaussian Process) A collection {Xt}_{t in T} of random variables, for an index set T, for which every finite linear combination of the random variables is Gaussian
Assume E[Xt] = 0 for all t
The process is entirely determined by the covariances E[Xs·Xt]
The distance function d(s,t) = (E[|Xs - Xt|²])^{1/2} is a pseudo-metric on T
(Connection to Data Streams) Suppose we replace the signs σ(i) with normal random variables g(i), and consider a counter c at time t: c(t) = Σi g(i)·fi(t), where fi(t) is the frequency of item i after processing t stream insertions
c(t) is a Gaussian process!
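A quick empirical check of the covariance structure behind this connection (a sketch, not from the slides): with Gaussian coefficients g(i), the counters at two times satisfy E[c(s)·c(t)] = ⟨f(s), f(t)⟩, so d(s,t)² = E[|c(s) - c(t)|²] = ||f(s) - f(t)||².

```python
import random

def empirical_counter_covariance(stream, n, s, t, trials=20000, seed=0):
    # Estimate E[c(s)*c(t)] over fresh Gaussian coefficients and compare to <f(s), f(t)>.
    rng = random.Random(seed)
    fs, ft = [0] * n, [0] * n
    for j, a in enumerate(stream):
        if j < s: fs[a] += 1
        if j < t: ft[a] += 1
    acc = 0.0
    for _ in range(trials):
        g = [rng.gauss(0, 1) for _ in range(n)]
        cs = sum(g[i] * fs[i] for i in range(n))
        ct = sum(g[i] * ft[i] for i in range(n))
        acc += cs * ct
    return acc / trials, sum(x * y for x, y in zip(fs, ft))

stream = [2, 1, 1, 3, 7, 3, 4, 1, 1, 7]
print(empirical_counter_covariance(stream, n=8, s=4, t=10))   # the two numbers should be close
```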
Chaining Inequality [Fernique, Talagrand]
Let {Xt}_{t in T} be a Gaussian process and let T0 ⊆ T1 ⊆ T2 ⊆ … ⊆ T be such that |T0| = 1 and |Tk| ≤ 2^{2^k} for k ≥ 1. Then
E[ sup_t Xt ] ≤ O(1) · sup_t Σ_{k ≥ 0} 2^{k/2} · d(t, Tk)
How can we apply this to c(t) = Σi g(i)·fi(t)?
Let F2(t) be the value of F2 after t stream insertions
Let the Tk be a recursive partitioning of the stream where F2(t) changes by a factor of 2
[figure: the stream a1, a2, …, am, with a_{t_j} the first point in the stream for which F2(t_j) ≥ (j/2^{2^k}) · F2]
Let Tk be the set of 2^{2^k} times t_j in the stream such that t_j is the first point in the stream with F2(t_j) ≥ (j/2^{2^k}) · F2
Then d(t, Tk) ≤ (F2/2^{2^k})^{1/2} and |Tk| ≤ 2^{2^k} for all k
Apply the chaining inequality!
Applying the Chaining Inequality
Let {Xt}_{t in T} be a Gaussian process and let T0 ⊆ T1 ⊆ … ⊆ T be such that |T0| = 1 and |Tk| ≤ 2^{2^k} for k ≥ 1. Then E[ sup_t Xt ] ≤ O(1) · sup_t Σ_{k ≥ 0} 2^{k/2} · d(t, Tk)
d(t, Tk) = (min_{t_j in Tk} E[|c(t) - c(t_j)|²])^{1/2} ≤ (F2/2^{2^k})^{1/2}
Hence, E[ sup_t Xt ] ≤ O(1) · Σ_{k ≥ 0} 2^{k/2} · (F2/2^{2^k})^{1/2} = O(F2^{1/2})
Same behavior as for the random walks!
Removing Frequency Assumptions
We don't actually have fi* ≥ n^{1/2} log n and fj ∈ {0,1} for all j in {1, 2, …, n} \ {i*}
The Gaussian process removes the restriction that fj ∈ {0,1} for all j in {1, 2, …, n} \ {i*}
The random walk bound of Cn^{1/2} we needed on the counters holds without this restriction
But we still need fi* ≥ (F2(-i*))^{1/2} · log n to learn log n bits about the heavy hitter
How to replace this restriction with fi* ≥ (φ F2(-i*))^{1/2}?
Assume φ > 1/log log n by hashing into log log n buckets and incurring a log log n factor in space
Amplification
Create O(log log n) pairs of streams from the input stream:
(streamL1, streamR1), (streamL2, streamR2), …, (streamL_{log log n}, streamR_{log log n})
For each j in O(log log n), choose a hash function hj: {1, …, n} -> {0, 1}
streamLj is the original stream restricted to items i with hj(i) = 0; streamRj is the remaining part of the input stream
Maintain counters cL = Σ_{i: hj(i)=0} g(i)·fi and cR = Σ_{i: hj(i)=1} g(i)·fi
(Chaining Inequality + Chernoff) the larger counter usually corresponds to the substream containing i*
The larger counter stays larger forever if the Chaining Inequality bound holds
Run the algorithm on the items corresponding to the larger counters
The expected F2 value of those items, excluding i*, is F2/poly(log n), so i* is heavier
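A rough sketch of this amplification step (illustrative only: it keeps per-item Gaussians and hash values in dictionaries, which a real small-space algorithm cannot do, and uses Python's random module instead of the paper's limited-independence constructions):

```python
import math
import random

def amplify(stream, n, rounds=None, seed=0):
    # For each of O(log log n) rounds, split the current candidate items into two
    # substreams by a fresh random hash, keep a Gaussian-signed counter per side,
    # and retain only the items on the larger side.
    rng = random.Random(seed)
    rounds = rounds if rounds is not None else max(1, int(math.log2(max(2.0, math.log2(n)))))
    g = {i: rng.gauss(0, 1) for i in range(n)}   # toy: per-item Gaussians kept explicitly
    candidates = set(range(n))
    for _ in range(rounds):
        h = {i: rng.randrange(2) for i in candidates}
        c = [0.0, 0.0]
        for a in stream:
            if a in candidates:
                c[h[a]] += g[a]
        keep = 0 if abs(c[0]) >= abs(c[1]) else 1
        candidates = {i for i in candidates if h[i] == keep}
    return candidates

n, heavy = 1 << 12, 5
light = random.Random(1)
stream = [heavy] * 2000 + [light.randrange(n) for _ in range(2000)]
print(heavy in amplify(stream, n))   # the heavy item usually survives every round
```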
Derandomization
We don't have an infinitely long random tape
We need to (1) derandomize a Gaussian process and (2) derandomize the hash functions used to sequentially learn bits of i*
We achieve (1) by:
(Derandomized Johnson-Lindenstrauss) defining our counters by first applying a Johnson-Lindenstrauss (JL) transform [KMN] to the frequency vector, reducing n dimensions to log n, then taking the inner product with fully independent Gaussians
(Slepian's Lemma) the counters don't change much, because a Gaussian process is determined by its covariances and all covariances are roughly preserved by JL
For (2), derandomize an auxiliary algorithm via Nisan's PRG [I]
An Optimal Algorithm [BCINWW]
Want O(log n) bits instead of O(log n log log n) bits
Sources where the O(log log n) factor is coming from:
Amplification: use a tree-based scheme and the fact that the heavy hitter becomes heavier!
Derandomization: show that 6-wise independence suffices for derandomizing a Gaussian process!
Conclusions on l2-guarantee
Beat CountSketch for finding l2-heavy hitters in a data stream
Achieve O(log n) bits of space instead of O(log² n) bits
New results for estimating F2 at all points and for L∞-estimation
Questions:
Is this a significant practical improvement over CountSketch as well?
Can we use Gaussian processes for other insertion-only stream problems?
Accelerated Counters
What if we update a counter c for item i with probability p = ε? Then E[c/p] = fi
The sum of the counts is expected to be O(1/ε), since the sampled stream has length O(1/ε²)
We have O(1/ε) counters with sum O(1/ε), so they fit in O(1/ε) bits
Problem: very inaccurate if fi is large, e.g., fi = Θ(1/ε²)
Accelerated Counters
Instead, suppose you knew a value r with r = Θ(fi)
Update a counter c with probability p chosen according to r, and output c/p
Var[c/p] = fi·(1 - p)/p, which is small once p is matched to r
Problem: we don't know fi in advance!
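A small sketch of the sampled-counter idea on these two slides (a generic importance-sampled counter; the exact choice of p in the paper is different, and choosing it adaptively is what the "accelerated" refinement on the next slide is about):

```python
import random

def sampled_count(occurrences, p, seed=0):
    # Increment c with probability p per occurrence; c/p is an unbiased estimate of the
    # count, with Var[c/p] = f*(1-p)/p, so the right p depends on the (unknown) frequency f.
    rng = random.Random(seed)
    c = sum(1 for _ in range(occurrences) if rng.random() < p)
    return c / p

f = 5000
print(sampled_count(f, p=0.01))   # unbiased for 5000; std ~ sqrt(f*(1-p)/p) ~ 700
print(sampled_count(f, p=0.5))    # much more accurate: std ~ sqrt(f) ~ 70
```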
Accelerated Counters
Solution: increase the sampling probability as the counter increases!
Opposite of standard probabilistic counters
A frequency fi will in expectation have only a small count value (for more details, please see the paper!)
With the counters subject to their sum constraint, the total space is maximized at O(1/ε) bits