Slide1
Framework for real-time clustering over sliding windows
Sobhan Badiozamany
Kjell Orsborn
Tore Risch
Uppsala University, Sweden
Emails: firstname.lastname@it.uu.se
Slide2
Outline
Why clustering over sliding window is interesting
State of the art solutions
Our contributions:
SBM
Generic state maintenance using contexts
Contextualized indexing
Results
Related work
Slide3
Clustering over data streams
Online data stream analysis
Monitoring the distribution of moving objects, e.g. urban traffic monitoring
Spatio-temporal event monitoring, e.g. detecting major events using social media
Slide4
Sliding window characteristics
Sliding windows capture the evolving behavior of data streams.
W2,12 highly overlaps with W0,10 (gray portions)
Building W2,12 from W0,10:
W10,12 is new data
W0,2 is expired data
Slide5
GROUPBY queries over sliding windows
First phase: summarize the small blocks
Produces partial summaries

W0,10:
Road    #cars
E4      10
E20     20
E18     30
E10     15

Partial summary to add:
Road    #cars
E4      2
E18     10

Partial summary to subtract:
Road    #cars
E10     3
E18     5

W2,12:
Road    #cars
E4      12
E20     20
E18     35
E10     12

Second phase: reuses the summary in W0,10 to produce W2,12
Green is merged (incremental)
Red is excluded (decremental)
Group memberships are identified using distinct values of the Road attribute
Deterministic group membership
Only aggregates are updated
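
The second phase can be sketched in a few lines of Python, using the #cars values from the tables on this slide. The function name slide_groupby and the use of collections.Counter are illustrative assumptions, not part of the presented system.

```python
from collections import Counter

def slide_groupby(window, added, expired):
    """Derive the next window summary from the previous one.

    window, added, expired: mappings from group key (Road) to count (#cars).
    """
    result = Counter(window)   # copy, so the previous summary stays intact
    result.update(added)       # incremental step: merge the new partial summary
    result.subtract(expired)   # decremental step: exclude the expired partial summary
    return +result             # unary plus drops groups whose count fell to zero or below

w0_10 = {"E4": 10, "E20": 20, "E18": 30, "E10": 15}
w2_12 = slide_groupby(w0_10, added={"E4": 2, "E18": 10}, expired={"E10": 3, "E18": 5})
```

Because group membership is fixed by the Road value, only the per-group aggregates change.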
Add #cars
Subtract #cars
Slide6
Clustering queries over sliding windows
Clustering is dynamic, because grouping is based on similarity
Groups might merge
Groups might split
For many clustering algorithms the exclude function:
Does not exist, e.g. BIRCH
Exists, e.g. [Ester et al. 1998], but is shown to be very expensive [Yang et al. 2009]
We therefore have to rely only on the merge function to maintain clusters over sliding windows
Slide7
Window Partition Ratio (PR)
Partition Ratio PR = the number of partial summaries that comprise a window
Here PR=5
Higher PR -> finer grain slides -> real time change tracking
Scaling PR is desirable for many queries
Slide8
Repetitive Merge [Guha et al. 2000] [Babcock et al. 2003]
W0,10 = merge(W0,2, W2,4, W4,6, W6,8, W8,10)
W2,12 = merge(W2,4, W4,6, W6,8, W8,10, W10,12)
W4,14 = merge(W4,6, W6,8, W8,10, W10,12, W12,14)
W6,16 = merge(W6,8, W8,10, W10,12, W12,14, W14,16)
W8,18 = merge(W8,10, W10,12, W12,14, W14,16, W16,18)
[Figure: partial summaries W0,2, W2,4, W4,6, W6,8, W8,10, W10,12, W12,14, W14,16, W16,18 laid out along the time axis t = 0 to 18]
Only uses merging, no exclude needed.
Maintains PR windows in parallel.
Each arriving partial summary is merged into PR windows, e.g. W8,10
The number of merges per slide: PR
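
A minimal sketch of repetitive merge, assuming the summary type has an associative merge function and an empty value. The names repetitive_merge, merge and empty are hypothetical and chosen for illustration only.

```python
from collections import deque

def repetitive_merge(partials, pr, merge, empty):
    """Yield (window_summary, merges_in_this_slide) for every complete window.

    partials: iterable of partial summaries, one per slide.
    pr: partition ratio, i.e. partial summaries per window.
    """
    open_windows = deque()  # entries: [summary so far, number of partials merged]
    for partial in partials:
        open_windows.append([empty, 0])       # a new window starts at every slide
        merges = 0
        for window in open_windows:           # merge the partial into all open windows
            window[0] = merge(window[0], partial)
            window[1] += 1
            merges += 1
        if open_windows[0][1] == pr:          # the oldest window is now complete
            summary, _ = open_windows.popleft()
            yield summary, merges
```

With tuples as toy summaries, `list(repetitive_merge([(j,) for j in range(1, 13)], 5, lambda a, b: a + b, ()))` reports PR = 5 merges for every completed window, matching the count stated above.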
Slide9
Sliding Binary merge (SBM)
The number of merges per slide: log2 PR
The old nodes should be removed
Gray nodes already removed
Red nodes being removed right now (t=52)
[Figure: SBM lattice over the time axis t = 4, 8, ..., 52. Nodes of extent 4: W0,4 through W48,52. Nodes of extent 8: W0,8 through W44,52. Nodes of extent 16: W0,16 through W36,52. Nodes of extent 32: W0,32 through W20,52. Start times advance in steps of 4.]
Uses a lattice to represent temporal relationships between window instances in terms of their time intervals
Here window range: 32, window stride: 4 -> PR=8
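
The lattice can be sketched as follows. This is a reading of the figure, not the authors' implementation: each node of extent 2w is assumed to be the merge of the two adjacent nodes of extent w, the sketch assumes PR is a power of two, and the names sliding_binary_merge and merge are hypothetical.

```python
from math import log2

def sliding_binary_merge(partials, pr, merge):
    """Yield (slide_index, window_summary, merges_in_this_slide) per full window."""
    levels = int(log2(pr))
    assert 2 ** levels == pr, "this sketch assumes PR is a power of two"
    # nodes[k] maps an end index (in strides) to the summary of 2**k partials
    nodes = [dict() for _ in range(levels + 1)]
    for i, partial in enumerate(partials, start=1):
        nodes[0][i] = partial
        merges = 0
        for k in range(1, levels + 1):
            # left half ends where the right half (ending at i) begins
            left = nodes[k - 1].get(i - 2 ** (k - 1))
            if left is None:
                break                      # not enough history yet for this level
            nodes[k][i] = merge(left, nodes[k - 1][i])
            merges += 1
        for k in range(levels + 1):
            # a level-k node is a left operand only within the next 2**k slides;
            # at the top level only the newest window is kept
            horizon = i - 2 ** k if k < levels else i - 1
            for end in [e for e in nodes[k] if e <= horizon]:
                del nodes[k][end]
        if i in nodes[levels]:
            yield i, nodes[levels][i], merges
```

Feeding the 13 partial summaries W0,4 through W48,52 with pr=8 yields windows from W0,32 up to W20,52, each built with 3 = log2 8 merges.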
Slide10
Properties of SBM
Reduces the number of merges per slide from PR to log2 PR
Only slightly higher memory footprint compared to repetitive merge
Supports arbitrary window sizes
Proof: … Or maybe read the paper :-)
Slide11
How to use SBM in a framework?
The Generic 2-phase Continuous Summarization (G2CS) framework generalizes the GROUPBY frameworks to support clustering.
Each node in the lattice represents a window instance having a number of clusters.
In G2CS a window instance is represented by its time interval, also called its context.
Contexts are objects that are managed by G2CS.
Slide12
Contextualizing the window state
Each context contains a number of clusters, each having an arbitrarily complex structure.
G2CS uses a uniform schema that represents all clusters in all window instances.
Contextualized Clustering Table (CCT)
CCT(cid, cxtid, a1, …, an)
cid is the cluster identifier
cxtid is the context identifier
a1, …, an are algorithm specific
Example for the BIRCH clustering algorithm:
cid   cxtid    LS   SS   N    CM
1     {0,8}    …    …    …    …
2     {0,8}    …    …    …    …
3     {0,8}    …    …    …    …
4     {8,12}   …    …    …    …
5     {8,12}   …    …    …    …
…     …        …    …    …    …
LS: linear sum
SS: squared sum
N: number of points
CM: center of mass
A context identifies the partition in the CCT that contains its window instance data.
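
As an illustration of the schema, the sketch below models CCT rows for BIRCH and merges two of them by adding their cluster features, which is the standard additive property of BIRCH summaries. The class and function names (CctRow, merge_rows, partition) and the choice of a scalar SS are assumptions made for this example, not the G2CS API.

```python
from dataclasses import dataclass

@dataclass
class CctRow:
    cid: int        # cluster identifier
    cxtid: tuple    # context identifier: the window's time interval (start, end)
    n: int          # N: number of points
    ls: tuple       # LS: per-dimension linear sum
    ss: float       # SS: sum of squared norms of the points

    @property
    def cm(self):
        """CM: center of mass, derived as LS / N."""
        return tuple(x / self.n for x in self.ls)

def merge_rows(a, b, cid, cxtid):
    """Merge two micro-clusters into a new row stored under context cxtid."""
    ls = tuple(x + y for x, y in zip(a.ls, b.ls))
    return CctRow(cid, cxtid, a.n + b.n, ls, a.ss + b.ss)

def partition(cct, cxtid):
    """Return the rows of one context, i.e. one window instance."""
    return [row for row in cct if row.cxtid == cxtid]

cct = [
    CctRow(1, (0, 8), 2, (2.0, 4.0), 10.0),
    CctRow(4, (8, 12), 2, (6.0, 0.0), 20.0),
]
cct.append(merge_rows(cct[0], cct[1], cid=6, cxtid=(0, 12)))
```

The merged row belongs to the new context (0, 12), so the rows of the two input contexts remain available to other window instances.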
A node in the SBM lattice corresponds to a partition in the CCT.
Slide13
Generic 2-phase Continuous Summarization (G2CS) framework
[Figure: G2CS architecture diagram. Labels: Stream, Continuous Summarization Queries, Partial Summarizer, Final Summarizer, adder, merger, excluder, reporter, copier, Main Memory Data Manager, Context Manager, Contextualized index manager, Continuous Summary, G2CS]
Clustering algorithm plug-ins (red) operate on system-managed contexts; the merger is the most expensive.
Provides transparent indexing per context, i.e. per partition in the CCT
SBM is implemented in the final summarizer.
G2CS modularizes the solution
Slide14
Why is indexing needed?
The expensive merger plug-in receives two sets of clusters to merge, here black and green.
It often performs nearest neighbor search to form links between micro-clusters: for each green micro-cluster, we need to find the closest black micro-cluster.
Multi-dimensional indexing on the set of black micro-clusters helps.
Slide15
Contextualized indexing
The nearest neighbor search in the merger always has a bound context, e.g. for each green micro-cluster a search in the black context is done.
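The CCT:
cid   cxtid   …   ai    …
1     1       …   ai1   …
2     1       …   ai2   …
3     1       …   ai3   …
4     2       …   ai4   …
5     2       …   ai5   …
6     2       …   ai6   …
…     …       …   …     …
[Figure: a hash index on cxtid (entries 2, 1, …) points to one X-tree per context: an X-tree over entries 1, 2, 3 containing the black micro-clusters, and an X-tree over entries 4, 5, 6 containing ai4, ai5, and ai6]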
Two layered index:
Global hash index on context id, cxtid
Local spatial index on each context data
Many contexts -> many X-trees -> hard to find “the one”
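
The two-layer lookup can be sketched as below. This is only an outline under stated assumptions: a Python dict plays the role of the global hash index, and a linear scan stands in for the per-context X-tree, which a real implementation would replace with a spatial index. The class name ContextualizedIndex and its methods are hypothetical.

```python
import math
from collections import defaultdict

class ContextualizedIndex:
    def __init__(self):
        # global layer: hash index from cxtid to that context's local index
        self._by_context = defaultdict(list)

    def insert(self, cxtid, cid, point):
        """Register micro-cluster cid, located at point, under context cxtid."""
        self._by_context[cxtid].append((cid, point))

    def nearest(self, cxtid, query):
        """Return the cid closest to query inside one bound context, or None."""
        entries = self._by_context.get(cxtid, [])
        if not entries:
            return None
        # local layer: linear scan as a placeholder for an X-tree search
        return min(entries, key=lambda entry: math.dist(entry[1], query))[0]

    def drop_context(self, cxtid):
        """Discard the local index of an expired context in one step."""
        self._by_context.pop(cxtid, None)

index = ContextualizedIndex()
for cid, point in [(1, (0.0, 0.0)), (2, (5.0, 5.0)), (3, (9.0, 1.0))]:
    index.insert(1, cid, point)          # context 1: the black micro-clusters
index.insert(2, 4, (4.0, 4.5))           # context 2: a green micro-cluster
closest_black = index.nearest(1, (4.0, 4.5))
```

Binding the search to one context means the hash lookup selects a single local index, so micro-clusters from other window instances are never scanned.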
Slide16
Experimental results, GROUPBY
No contextualized indexing
Conventional GROUPBY, very efficient exclude method.
Synthetic data
Differential Maintenance (DM) takes constant time, SBM scales logarithmically and RM scales linearly
Slide17
Experimental results, Indexing
AC: the average number of clusters per window instance
SBM with contextualized indexing scales logarithmically with AC; without an index it scales quadratically.
Slide18
Experimental results, Real data
BIRCH Clustering on real data from a soccer game.
As PR is scaled AC is also scaled
SBM is significantly better than RM.
The gain by indexing is limited in RM (15%) due to intensive copying, compared to 60% gain for indexing SBM.
Slide19
Experimental results, Memory Utilization
Slide20
Experimental results, Work breakdown
Copier plug-in dominates in RM
In RM all window instances have the full extent of the window
-> more data to copy -> indexing does not help
Merger plug-in dominates in SBM
Copier is relatively cheap because most nodes in the lattice cover a short extent
-> less data to copy -> Indexing helps
Low G2CS overhead
Slide21
References
Repetitive Merge is used in the following papers:
[1] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams," in Proceedings of Foundations of Computer Science conference, Redondo Beach, CA, 2000, pp. 359-366.
[2] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, "Maintaining variance and k-medians over data stream windows," in SIGMOD conf., San Diego, 2003, pp. 234-243.
Decremental DBSCAN:
[3] M. Ester, H-P. Kriegel, J. Sander, M. Wimmer, and X. Xu, "Incremental clustering for mining in a data warehousing environment," in VLDB conf., New York, 1998, pp. 323-333.
Why decremental clustering algorithms are not suitable for streaming:
[4] D. Yang, E. A. Rundensteiner, and M. O. Ward, "Neighbor-based pattern detection for windows over streaming data," in EDBT conf., Saint Petersburg, 2009, pp. 529-540.
BIRCH:
[5] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: an efficient data clustering method for very large databases," in SIGMOD conf., Montreal, 1996, pp. 103-114.
Slide22
Framework for real-time clustering over sliding windows
Sobhan Badiozamany
Kjell Orsborn
Tore Risch
Uppsala University, Sweden
Emails: firstname.lastname@it.uu.se