Slide1
Evaluating Window Joins over Unbounded Streams
Authors: Jaewoo Kang, Jeffrey F. Naughton, Stratis D. Viglas
University of Wisconsin-Madison CS Dept.
Presenter: Yang Ying-Chia 楊應甲 (R01922018)
CSIE, National Taiwan University
Slide2
Outline
Abstract
Background
Introduction
Related Work
Estimating the Cost of Moving Window Joins
On Maximizing the Efficiency of Processing Joins
Conclusion
Slide3
Abstract – Problem and Solution
Problem: Process joins over unbounded streams.
Solution: Moving Window Join
Queries have “window predicates”
Slide4
Abstract – Central Point of the Thesis
The paper proposes a unit-time-basis cost model for evaluating moving window joins.
Using this cost model, it proposes strategies for maximizing the efficiency of processing joins in different scenarios.
Slide5
Abstract
Background
Introduction
Related Work
Estimating the Cost of Moving Window Joins
On Maximizing the Efficiency of Processing Joins
Conclusion
Slide6
Background
Join
Nested Loops Join (NLJ)
Hash Join (HJ)
Moving Window Join
Slide7
Background – Join
Slide8
Background – Nested Loops Join (NLJ)
Slide9
Background – Hash Join (HJ)
Slide10
Background – Moving Window Join
Slide11
Background – Moving Window Join
Instead of saying we want to join all tuples of A and B, we say we want to join all tuples that have arrived on A in the last t1 seconds with all the tuples that have arrived on B in the last t2 seconds.
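The window semantics described here can be illustrated with a minimal sketch. It is not taken from the paper: the class name `WindowJoin`, the tuple layout `(timestamp, key, payload)`, the equality predicate on `key`, and the assumption that timestamps arrive in non-decreasing order are all illustrative choices. Probing is done with a plain scan (nested-loops style) in both directions.

```python
from collections import deque


class WindowJoin:
    """Symmetric nested-loops join over two time-based moving windows (illustrative)."""

    def __init__(self, t1, t2):
        # Window length per stream, in seconds (t1 for A, t2 for B).
        self.span = {"A": t1, "B": t2}
        # Each window holds (timestamp, key, payload) tuples, oldest first.
        self.window = {"A": deque(), "B": deque()}

    def _expire(self, stream, now):
        # Invalidate tuples that no longer fall within the last `span` seconds.
        w = self.window[stream]
        while w and w[0][0] <= now - self.span[stream]:
            w.popleft()

    def on_arrival(self, stream, ts, key, payload):
        """Process one arriving tuple and return the (A, B) result pairs it produces."""
        other = "B" if stream == "A" else "A"
        self._expire(other, ts)
        # Probe: scan the opposite window for tuples with a matching key.
        matches = [
            (payload, p) if stream == "A" else (p, payload)
            for (_, k, p) in self.window[other]
            if k == key
        ]
        # Insert the new tuple into its own window so later arrivals can probe it.
        self._expire(stream, ts)
        self.window[stream].append((ts, key, payload))
        return matches
```

For example, with `join = WindowJoin(t1=10, t2=5)`, a tuple arriving on B at time 4 joins with an A tuple that arrived at time 0, while a tuple arriving on A at time 12 no longer sees the B tuple from time 4, because that tuple has left B's 5-second window.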
Slide12
Abstract
Background
Introduction
Related Work
Estimating the Cost of Moving Window Joins
On Maximizing the Efficiency of Processing Joins
Conclusion
Slide13
Introduction – Questions
How can we measure the efficiency of a moving window join evaluation strategy, since the traditional metric of execution time to completion does not apply?
Can an algorithm for a moving window join take advantage of asymmetries in the rates of the input streams?
How can we deal with cases in which an input stream is so fast that the system cannot keep up?
If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?
Slide14
Introduction – The Three Scenarios
One stream is much faster than the other.
System resources are insufficient to keep up with the input streams.
Memory is limited.
Slide15
Abstract
Background
Introduction
Related Work
Estimating the Cost of Moving Window Joins
On Maximizing the Efficiency of Processing Joins
Conclusion
Slide16
Related Work
Predicate grouping and group optimization techniques
Adaptive query processing and query scrambling
Symmetric Hash Join and symmetric nested loops join
Diag-Join for data warehouse environment
Rate based streaming query optimization framework
Slide17
Abstract
Background
Introduction
Related Work
Estimating the Cost of Moving Window Joins
On Maximizing the Efficiency of Processing Joins
Conclusion
Slide18
Estimating the Cost of Moving Window Joins
Cost model
Cost of a single join operation
Slide19
Cost of Nested Loop Join A to B
Number of tuples accessed in a time unit
Cost of accessing a single tuple
Number of tuples accessed to search for matches in window B
Number of tuples accessed for insertion and invalidation
Slide20
Cost of Hash Join A to B
Cost of probe(b) and invalidate(b) is a function of the hash bucket size in window B
Cost of accessing a single tuple in a specific hash table implementation
Slide21
Cost of Full Join
Symmetric Join
HHJ, NNJ
Slide22
Cost of Full Join
Asymmetric Join
HNJ
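The way a full join cost is composed from two one-way costs can be sketched as follows. The exact formulas from the slides are not reproduced here, so every formula in this block is an assumption made for illustration: a nested-loops probe scans the whole opposite window, a hash probe scans one average bucket, nested-loops maintenance touches one tuple per insert and one per invalidation, and hash invalidation scans a bucket. The per-tuple weights `c_n` and `c_h`, the bucket count, and the choice of which direction uses hashing in the HNJ example are hypothetical as well.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    rate: float    # arrival rate (lambda), tuples per second
    window: float  # window length, seconds

    @property
    def size(self):
        # Steady-state number of tuples held in the window.
        return self.rate * self.window


def nlj_one_way(probing, probed, c_n=1.0):
    """Assumed unit-time cost of joining `probing` to `probed` with nested loops."""
    probe = probing.rate * probed.size * c_n  # each arrival scans the whole window
    maintain = probed.rate * 2 * c_n          # one insert plus one invalidation per tuple
    return probe + maintain


def hj_one_way(probing, probed, c_h=3.0, buckets=64):
    """Assumed unit-time cost of joining `probing` to `probed` with a hash table."""
    bucket = probed.size / buckets               # average bucket length
    probe = probing.rate * bucket * c_h          # each arrival scans one bucket
    maintain = probed.rate * (1 + bucket) * c_h  # insert touches one slot; invalidation scans a bucket
    return probe + maintain


def full_join_cost(a, b, a_to_b, b_to_a):
    """Full join cost: the A-to-B direction plus the B-to-A direction."""
    return a_to_b(a, b) + b_to_a(b, a)


a = Stream(rate=10, window=10)
b = Stream(rate=100, window=10)
costs = {
    "NNJ": full_join_cost(a, b, nlj_one_way, nlj_one_way),
    "HHJ": full_join_cost(a, b, hj_one_way, hj_one_way),
    "HNJ": full_join_cost(a, b, hj_one_way, nlj_one_way),  # direction choice is an assumption
}
```

Because the two directions are costed independently, any combination of algorithms can be evaluated by swapping the two function arguments.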
Slide23
Cost Curves for Full Joins
σa = 1/|A| = 1/Nkey(A)
σb = 1/|B| = 1/Nkey(B)
Slide24
Observation from the Previous Graphs
When the difference between the input streams’ speeds is minimal, HJ outperforms every other join combination.
As the speed gap increases, the cost of HJ increases considerably and exceeds that of HNJ at around 70 tuples/sec and 140 tuples/sec.
Here we have a performance crossover point.
Slide25
Estimating the Weight Factors
The crossover points can be calculated by equating the two cost formulas.
For two given streams, we can determine when NLJ will outperform HJ, depending on the ratio of the arrival rates of the input streams.
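The idea of equating two cost formulas can be made concrete with a small sketch. The cost expressions below are illustrative assumptions, not the formulas from the paper: a nested-loops probe scans the whole window, a hash probe scans one average bucket, and the weights `c_n`, `c_h`, and `buckets` are hypothetical. Under these assumptions both costs are linear in the probing stream's rate, so the crossover has a closed form.

```python
def nlj_cost(lam_probe, lam_window, t_window, c_n=1.0):
    """Assumed one-way nested-loops cost per unit time."""
    size = lam_window * t_window
    return lam_probe * size * c_n + lam_window * 2 * c_n


def hj_cost(lam_probe, lam_window, t_window, c_h=3.0, buckets=64):
    """Assumed one-way hash join cost per unit time."""
    bucket = lam_window * t_window / buckets
    return lam_probe * bucket * c_h + lam_window * (1 + bucket) * c_h


def crossover_probe_rate(lam_window, t_window, c_n=1.0, c_h=3.0, buckets=64):
    """Probing rate at which the two assumed costs are equal, or None if there is none."""
    bucket = lam_window * t_window / buckets
    # Setting nlj_cost == hj_cost and solving for lam_probe gives a ratio of
    # the maintenance-cost gap to the per-probe cost gap.
    numerator = lam_window * (c_h * (1 + bucket) - 2 * c_n)
    denominator = lam_window * t_window * (c_n - c_h / buckets)
    if numerator <= 0 or denominator <= 0:
        return None  # no positive crossover under these weights
    return numerator / denominator
```

With these hypothetical weights, a slow stream probing a fast stream's window favors nested loops, because the hash table's maintenance cost dominates; above the crossover rate, the cheaper hash probes win.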
…
Slide26
Abstract
Background
Introduction
Related Work
Estimating the Cost of Moving Window Joins
On Maximizing the Efficiency of Processing Joins
Conclusion
Slide27
Recall the three scenarios
One stream is much faster than the other.
System resources are insufficient to keep up with the input streams.
Memory is limited.
Slide28
Exploiting Asymmetry in Input Streams Speed
Assumptions:
The two time windows are fixed.
The aggregate speed of the two streams is less than the system’s service rate μ (i.e., λa + λb < μ).
The following inequality determines the likely winner between NLJ and HJ:
If the inequality holds, NLJ will outperform HJ; otherwise, HJ outperforms NLJ.
Slide29
Graphs to Prove the Previous Hypothesis
Slide30
Observation from the Previous Graphs
HHJ costs the least until the input rate reaches about 70 tuples/sec; then HNJ takes over. Hence, either HHJ or HNJ is the winner.
Both hash join output rates decrease drastically after passing their thrashing point.
Slide31
Maximizing the Number of Result Tuples with Limited Computing Resources
This scenario arises under the following conditions:
The system evaluates very expensive predicates.
The input streams’ combined speed is faster than the join operator’s service rate, i.e., λa + λb > μ.
Hence, not all answer tuples can be generated and input streams need to be “regulated”.
But, what policy?
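One way to explore this question is a back-of-the-envelope model, sketched below under stated assumptions rather than taken from the paper. It assumes steady state, that both input rates exceed the service rate μ (so any split of μ is feasible), and that each admitted tuple matches a fraction `sigma` of the opposite window. The function names and the grid search are illustrative.

```python
def output_rate_for_rates(r_a, r_b, t_a, t_b, sigma):
    """Expected result tuples per second when r_a and r_b tuples/sec are admitted."""
    # Each admitted A tuple meets r_b * t_b tuples in window B, and each
    # admitted B tuple meets r_a * t_a tuples in window A.
    return sigma * r_a * r_b * (t_a + t_b)


def split_capacity(mu, share_a):
    """Admit share_a of the service rate from stream A and the rest from B."""
    return mu * share_a, mu * (1 - share_a)


def best_share(mu, t_a, t_b, sigma, steps=100):
    """Grid-search the share of capacity given to A that maximizes the output rate."""
    shares = [i / steps for i in range(steps + 1)]
    return max(
        shares,
        key=lambda s: output_rate_for_rates(*split_capacity(mu, s), t_a, t_b, sigma),
    )
```

Under these assumptions the output rate is proportional to the product of the two admitted rates, so the window sizes and the selectivity only scale the result and do not move the best split.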
Slide32
Performance Comparison between Policies
The winner is the equal distribution strategy, regardless of time window sizes and window selectivity factors.
Slide33
Maximizing the Number of Result Tuples with Limited Memory
Assumptions:
The two time window sizes can be adjusted to fully utilize available memory.
The two arrival rates are constant.
Hence, memory allocation strategies are necessary. But what policy? Will equal distribution win again?
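A simple model helps frame the question. The sketch below is an illustration under assumptions, not the paper's analysis: memory is measured in tuple slots, arrival rates are constant, each arriving tuple matches a fraction `sigma` of the tuples buffered for the opposite stream, and a stream with no slots is probed against the other window without being stored. The function names are hypothetical.

```python
def output_rate_for_memory(n_a, n_b, lam_a, lam_b, sigma):
    """Expected result tuples per second with n_a and n_b tuples buffered."""
    # Each arriving A tuple probes n_b buffered B tuples, and each arriving
    # B tuple probes n_a buffered A tuples.
    return sigma * (lam_a * n_b + lam_b * n_a)


def best_allocation(memory, lam_a, lam_b, sigma):
    """Try every integer split of `memory` tuple slots between the two windows."""
    splits = ((n_a, memory - n_a) for n_a in range(memory + 1))
    return max(
        splits,
        key=lambda s: output_rate_for_memory(s[0], s[1], lam_a, lam_b, sigma),
    )
```

In this model the output rate is linear in the allocation, so the best split sits at an extreme: each slot is worth more when it buffers a tuple that the faster stream will probe.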
Slide34
Performance Comparison between Policies
The winner is the Max A strategy, which allocates all memory to the slower stream.
Keep the slower stream in memory and let the faster one probe against it and pass by.
Slide35
Maximizing the Number of Result Tuples with Limited Memory
Another assumption:
Variable time windows
Variable arrival rates
Slide36
Performance Comparison between Policies
The best policy is either to maximize stream A’s time window together with B’s arrival rate or, alternatively, to maximize B’s time window together with A’s arrival rate.
Slide37
Abstract
Background
Introduction
Related Work
Estimating the Cost of Moving Window Joins
On Maximizing the Efficiency of Processing Joins
Conclusion
Slide38
Conclusion
A unit-time basis model to analyze expected performance of moving window joins is introduced.
The proposed cost-model divides the join cost into two independent terms, each corresponding to one of the two join directions.
This work can be extended to have a cost model beyond single joins and for full query plans.
Other algorithms apart from NLJ and HJ can be modeled and evaluated.
Slide39
The End
Thanks for your attention