Framework for real-time clustering over sliding windows

Uploaded On 2017-12-06


Presentation Transcript

Slide1

Framework for real-time clustering over sliding windows

Sobhan Badiozamany

Kjell Orsborn

Tore Risch

Uppsala University, Sweden

Emails: firstname.lastname@it.uu.se

Slide2

Outline

Why clustering over sliding windows is interesting

State-of-the-art solutions

Our contributions: Sliding Binary Merge (SBM), generic state maintenance using contexts, contextualized indexing

Results

Related work

Slide3

Clustering over data streams

Online data stream analysis

Monitoring the distribution of moving objects, e.g. urban traffic monitoring

Spatio-temporal event monitoring, e.g. detecting major events using social media

Slide4

Sliding window characteristics

Sliding windows capture the evolving behavior of data streams.

W2,12 highly overlaps with W0,10 (gray portions).

Building W2,12 from W0,10:

W10,12 is new data

W0,2 is expired data
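The relationship between two consecutive window instances can be sketched in a few lines. This is only an illustration: the function name `window_delta` and the representation of a window as a `(start, end)` tuple are assumptions, not something the slides define.

```python
def window_delta(old, new):
    """Return (expired, fresh) intervals when the window slides from `old` to `new`.

    Windows are (start, end) tuples; this sketch assumes the window moves
    forward and that the two instances overlap.
    """
    (old_start, old_end), (new_start, new_end) = old, new
    assert old_start <= new_start <= old_end <= new_end, "windows must overlap and move forward"
    expired = (old_start, new_start)   # data that leaves the window
    fresh = (old_end, new_end)         # data that enters the window
    return expired, fresh

print(window_delta((0, 10), (2, 12)))  # ((0, 2), (10, 12))
```

For the example on this slide, sliding from W0,10 to W2,12 yields W0,2 as expired data and W10,12 as new data.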

Slide5

GROUPBY queries over sliding windows

First phase: summarize the small blocks, producing partial summaries.

Second phase: reuse the summary of W0,10 to produce W2,12. Green is merged (incremental); red is excluded (decremental).

Summary of W0,10:

Road   #cars
E4     10
E20    20
E18    30
E10    15

Merged partial summary (W10,12):

Road   #cars
E4     2
E18    10

Excluded partial summary (W0,2):

Road   #cars
E10    3
E18    5

Summary of W2,12:

Road   #cars
E4     12
E20    20
E18    35
E10    12

Group memberships are identified by the distinct values of the Road attribute, so group membership is deterministic and only the aggregates are updated.
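The second phase can be sketched with plain counters. This is a minimal illustration, not the framework's code: the function name `slide_groupby` is hypothetical, and the assignment of the two small tables to the merged block (E4: 2, E18: 10) and the excluded block (E10: 3, E18: 5) is inferred from the arithmetic of the W0,10 and W2,12 tables.

```python
from collections import Counter

def slide_groupby(window, excluded, merged):
    """Derive the next window summary from the previous one."""
    result = Counter(window)
    result.subtract(excluded)   # decremental step: expired partial summary
    result.update(merged)       # incremental step: new partial summary
    # Groups whose count drops to zero disappear from the summary.
    return {road: n for road, n in result.items() if n > 0}

w_0_10 = {"E4": 10, "E20": 20, "E18": 30, "E10": 15}
w_0_2 = {"E10": 3, "E18": 5}      # excluded (expired) block
w_10_12 = {"E4": 2, "E18": 10}    # merged (new) block

print(slide_groupby(w_0_10, w_0_2, w_10_12))
# {'E4': 12, 'E20': 20, 'E18': 35, 'E10': 12}
```

Because each road is its own group, the update touches only the per-group aggregate, which is why the exclude step is cheap for GROUPBY.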

Add #cars for merged blocks; subtract #cars for excluded blocks.

Slide6

Clustering queries over sliding windows

Clustering is dynamic, because grouping is based on similarity:

Groups might merge

Groups might split

For many clustering algorithms, the exclude function:

Does not exist, e.g. BIRCH

Exists, e.g. [Ester et al. 1998], but is shown to be very expensive [Yang et al. 2009]

We therefore have to rely only on the merge function to maintain clusters over sliding windows.

Slide7

Window Partition Ratio (PR)

Partition Ratio PR = the number of partial summaries that comprise a window

Here PR=5

Higher PR -> finer-grained slides -> real-time change tracking

Scaling PR is desirable for many queries
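As a worked form of the definition, assuming each partial summary covers exactly one stride (which matches the SBM example in this deck, where range 32 and stride 4 give PR=8):

```latex
\mathrm{PR} = \frac{\text{window range}}{\text{window stride}},
\qquad \text{e.g. } \frac{10}{2} = 5, \quad \frac{32}{4} = 8 .
```

The first instance corresponds to a window such as W0,10 built from blocks of width 2; the second is the SBM example.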

Slide8

Repetitive Merge [Guha et al. 2000] [Babcock et al. 2003]

W0,10 = merge(W0,2, W2,4, W4,6, W6,8, W8,10)
W2,12 = merge(W2,4, W4,6, W6,8, W8,10, W10,12)
W4,14 = merge(W4,6, W6,8, W8,10, W10,12, W12,14)
W6,16 = merge(W6,8, W8,10, W10,12, W12,14, W14,16)
W8,18 = merge(W8,10, W10,12, W12,14, W14,16, W16,18)

[Figure: time axis t = 0 to 18 with the partial summaries W0,2, W2,4, W4,6, W6,8, W8,10, W10,12, W12,14, W14,16, W16,18]

Only uses merging, no exclude needed.

Maintains PR windows in parallel.

Each arriving partial summary, e.g. W8,10, is merged into PR windows.

The number of merges per slide: PR
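A minimal sketch of this scheme, under stated assumptions: the class name `RepetitiveMerge` and the toy `merge_intervals` function (which only joins adjacent time intervals instead of merging real cluster summaries) are illustrative, and merging into a freshly opened, still empty window is counted as a merge so that the count matches the PR figure on the slide.

```python
def merge_intervals(older, newer):
    """Toy merge: join two adjacent time intervals (start, end)."""
    assert older[1] == newer[0], "summaries must be adjacent in time"
    return (older[0], newer[1])

class RepetitiveMerge:
    def __init__(self, partition_ratio, merge):
        self.pr = partition_ratio
        self.merge = merge
        self.open_windows = []   # window instances under construction, oldest first

    def slide(self, partial):
        """Merge `partial` into every open window.

        Returns (completed_window_or_None, number_of_merges).
        """
        self.open_windows.append(None)   # open a new, still empty window instance
        self.open_windows = [
            partial if w is None else self.merge(w, partial)
            for w in self.open_windows
        ]
        merges = len(self.open_windows)  # the merge into the empty window is counted too
        done = self.open_windows.pop(0) if merges == self.pr else None
        return done, merges

rm = RepetitiveMerge(5, merge_intervals)
results = [rm.slide((t - 2, t)) for t in range(2, 20, 2)]
print(results[4])    # ((0, 10), 5)
print(results[-1])   # ((8, 18), 5)
```

Once the warm-up is over, every slide costs PR merges, which is the linear behavior that SBM is designed to avoid.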

Slide9

Sliding Binary Merge (SBM)

The number of merges per slide: log2(PR)

The old nodes should be removed:

Gray nodes have already been removed

Red nodes are being removed right now (t=52)

[Figure: SBM lattice over the time axis t = 4 to 52. Nodes by interval length:]

Length 4: W0,4, W4,8, W8,12, W12,16, W16,20, W20,24, W24,28, W28,32, W32,36, W36,40, W40,44, W44,48, W48,52
Length 8: W0,8, W4,12, W8,16, W12,20, W16,24, W20,28, W24,32, W28,36, W32,40, W36,44, W40,48, W44,52
Length 16: W0,16, W4,20, W8,24, W12,28, W16,32, W20,36, W24,40, W28,44, W32,48, W36,52
Length 32: W0,32, W4,36, W8,40, W12,44, W16,48, W20,52

Uses a lattice to represent temporal relationships between window instances in terms of their time intervals.

Here window range: 32, window stride: 4 -> PR=8
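The merge schedule can be sketched as it reads off the lattice on this slide: each node of length 2L ending at time t is the merge of the two nodes of length L ending at t-L and at t. This is a sketch under assumptions, not the authors' implementation: it only handles PR values that are powers of two (the deck states that SBM supports arbitrary window sizes, which is not attempted here), and the class name `SlidingBinaryMerge`, the toy `merge_intervals` function, and the use of bounded deques to drop old nodes are all illustrative choices.

```python
from collections import deque

def merge_intervals(older, newer):
    """Toy merge: join two adjacent time intervals (start, end)."""
    assert older[1] == newer[0], "nodes must be adjacent in time"
    return (older[0], newer[1])

class SlidingBinaryMerge:
    def __init__(self, partition_ratio, merge):
        height = partition_ratio.bit_length() - 1
        assert 2 ** height == partition_ratio, "sketch assumes PR is a power of two"
        self.merge = merge
        # Level k stores nodes spanning 2**k strides. It keeps 2**k + 1 nodes,
        # just enough to reach the node that ended 2**k slides ago.
        self.levels = [deque(maxlen=2 ** k + 1) for k in range(height)]
        self.levels.append(deque(maxlen=1))   # top level: full window instances

    def slide(self, partial):
        """Insert a new partial summary.

        Returns (full_window_or_None, number_of_merges).
        """
        merges = 0
        node = partial
        for k, history in enumerate(self.levels):
            history.append(node)              # the bounded deque drops expired nodes
            if k == len(self.levels) - 1:
                return node, merges           # reached the full window
            span = 2 ** k
            if len(history) <= span:
                return None, merges           # warm-up: older sibling not available yet
            node = self.merge(history[-1 - span], node)
            merges += 1

sbm = SlidingBinaryMerge(8, merge_intervals)
results = [sbm.slide((t - 4, t)) for t in range(4, 56, 4)]
print(results[7])    # ((0, 32), 3)
print(results[-1])   # ((20, 52), 3)
```

With range 32 and stride 4, each slide after warm-up performs 3 = log2(8) merges, e.g. at t=52: W44,52, then W36,52, then W20,52.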

Slide10

Properties of SBM

Reduces the number of merges per slide from PR to log2(PR)

Only slightly higher memory footprint compared to repetitive merge

Supports arbitrary window sizes

Proof: ... or maybe read the paper :-)

Slide11

How to use SBM in a framework?

The Generic 2-phase Continuous Summarization (G2CS) framework generalizes the GROUPBY frameworks to support clustering.

Each node in the lattice represents a window instance having a number of clusters.

In G2CS a window instance is represented by its time interval, also called its context.

Contexts are objects that are managed by G2CS.

Slide12

Contextualizing the window state

Each context contains a number of clusters, each having an arbitrarily complex structure.

G2CS uses a uniform schema that represents all clusters in all window instances.

Contextualized Clustering Table (CCT): CCT(cid, cxtid, a1, ..., an)

cid is the cluster identifier

cxtid is the context identifier

a1, ..., an are algorithm specific

Example for the BIRCH clustering algorithm:

cid   cxtid    LS   SS   N    CM
1     {0,8}    …    …    …    …
2     {0,8}    …    …    …    …
3     {0,8}    …    …    …    …
4     {8,12}   …    …    …    …
5     {8,12}   …    …    …    …
…     …        …    …    …    …

LS: linear sum; SS: squared sum; N: number of points; CM: center of mass

A context identifies a partition in the CCT that contains its window instance data.
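A small sketch of CCT rows for BIRCH, partitioned by context. The names `CCTRow`, `summarize`, and `merge_rows` are hypothetical, SS is taken as the scalar sum of squared norms (an assumption about the exact form), and CM is derived as LS / N. The point illustrated is that BIRCH cluster features are additive, so a merge function exists even though an exclude function does not.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CCTRow:
    cid: int        # cluster identifier
    cxtid: tuple    # context identifier: the window's time interval (start, end)
    LS: list        # linear sum of the points, per dimension
    SS: float       # sum of squared norms of the points (scalar form, an assumption)
    N: int          # number of points

    @property
    def CM(self):
        """Center of mass, derived as LS / N."""
        return [s / self.N for s in self.LS]

def summarize(cid, cxtid, points):
    """Build the row of a micro-cluster from its raw points."""
    dims = len(points[0])
    LS = [sum(p[d] for p in points) for d in range(dims)]
    SS = sum(x * x for p in points for x in p)
    return CCTRow(cid, cxtid, LS, SS, len(points))

def merge_rows(cid, cxtid, a, b):
    """Merge two micro-clusters by adding their features component-wise."""
    LS = [x + y for x, y in zip(a.LS, b.LS)]
    return CCTRow(cid, cxtid, LS, a.SS + b.SS, a.N + b.N)

cct = defaultdict(list)   # cxtid -> partition of the CCT
r1 = summarize(1, (0, 8), [(0.0, 0.0), (2.0, 0.0)])
r4 = summarize(4, (8, 12), [(4.0, 3.0)])
cct[r1.cxtid].append(r1)
cct[r4.cxtid].append(r4)

merged = merge_rows(6, (0, 12), r1, r4)
cct[merged.cxtid].append(merged)
print(merged.N, merged.CM, merged.SS)   # 3 [2.0, 1.0] 29.0
```

Whether two micro-clusters should be merged at all is decided by the clustering algorithm; the sketch only shows how the row of the merged cluster is computed and stored under its own context.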

A node in the SBM lattice corresponds to a partition in the CCT.

Slide13

Generic 2-phase Continuous Summarization (G2CS) framework

[Figure: G2CS architecture. Components: Partial Summarizer, Final Summarizer, Main Memory Data Manager, Context Manager, Contextualized index manager. Plug-ins: adder, merger, excluder, reporter, copier. Other labels: Continuous Summarization Queries, Stream, Continuous Summary.]

Clustering algorithm plug-ins (red) operate on system-managed contexts; the merger is the most expensive.

G2CS provides transparent indexing per context, i.e. per partition in the CCT.

SBM is implemented in the final summarizer.

G2CS modularizes the solution.

Slide14

Why is indexing needed?

The expensive merger plug-in receives two sets of clusters to merge, here black and green.

It often performs nearest-neighbor search to form links between micro-clusters: for each green micro-cluster, we need to find the closest black micro-cluster.

Multi-dimensional indexing on the set of black micro-clusters helps.
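To see why an index helps, here is the search without one. This is a brute-force sketch: the function name `link_nearest`, the use of cluster centers as plain coordinate tuples, and Euclidean distance are assumptions made for illustration.

```python
import math

def link_nearest(green, black):
    """For each green micro-cluster, find the closest black one by brute force.

    Both arguments map a cluster id to its center. The cost is
    len(green) * len(black) distance computations.
    """
    links = {}
    for g_cid, g_center in green.items():
        links[g_cid] = min(black, key=lambda b_cid: math.dist(g_center, black[b_cid]))
    return links

black = {1: (0.0, 0.0), 2: (5.0, 5.0), 3: (9.0, 1.0)}
green = {4: (1.0, 1.0), 5: (8.0, 2.0)}
print(link_nearest(green, black))   # {4: 1, 5: 3}
```

Every green micro-cluster scans all black ones, so the work grows with the product of the two set sizes; a spatial index over the black set replaces the inner scan with a tree search.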

Slide15

Contextualized indexing

The nearest-neighbor search in the merger always has a bound context, e.g. for each green micro-cluster a search in the black context is done.

The CCT:

cid   cxtid   ai    …
1     1       ai1   …
2     1       ai2   …
3     1       ai3   …
4     2       ai4   …
5     2       ai5   …
6     2       ai6   …
…     …       …     …

Index on cxtid:

cxtid   index
1       X-tree containing the black micro-clusters (cid 1, 2, 3)
2       X-tree containing ai4, ai5, and ai6 (cid 4, 5, 6)

Two-layered index:

Global hash index on the context id, cxtid

Local spatial index on the data of each context

Many contexts -> many X-trees -> hard to find “the one”
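The two layers can be sketched as a dictionary of per-context spatial indexes. Note the substitution: the slides use X-trees, while this sketch uses a small k-d tree as a stand-in because it fits in a few lines. The names `ContextualizedIndex`, `build_kdtree`, and `nearest` are hypothetical, and rows are simplified to `(cid, cxtid, center)`.

```python
import math

def build_kdtree(items, depth=0):
    """Build a k-d tree from a list of (cid, center); returns nested tuples or None."""
    if not items:
        return None
    axis = depth % len(items[0][1])
    items = sorted(items, key=lambda it: it[1][axis])
    mid = len(items) // 2
    return (items[mid], axis,
            build_kdtree(items[:mid], depth + 1),
            build_kdtree(items[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Return (distance, cid) of the stored center closest to `query`."""
    if node is None:
        return best
    (cid, center), axis, left, right = node
    d = math.dist(center, query)
    if best is None or d < best[0]:
        best = (d, cid)
    diff = query[axis] - center[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    if abs(diff) < best[0]:   # the far side could still hold a closer point
        best = nearest(far, query, best)
    return best

class ContextualizedIndex:
    def __init__(self, rows):
        by_context = {}
        for cid, cxtid, center in rows:
            by_context.setdefault(cxtid, []).append((cid, center))
        # Global layer: hash map on cxtid. Local layer: one spatial index per context.
        self.trees = {cxtid: build_kdtree(items) for cxtid, items in by_context.items()}

    def nearest(self, cxtid, query):
        """Nearest cluster id to `query`, searching only the bound context."""
        return nearest(self.trees[cxtid], query)[1]

rows = [(1, 1, (0.0, 0.0)), (2, 1, (5.0, 5.0)), (3, 1, (9.0, 1.0)),
        (4, 2, (1.0, 1.0)), (5, 2, (8.0, 2.0)), (6, 2, (4.0, 4.0))]
index = ContextualizedIndex(rows)
print(index.nearest(1, (8.0, 2.0)))   # 3
print(index.nearest(2, (0.0, 0.0)))   # 4
```

The hash lookup on `cxtid` picks "the one" tree in constant time, and the search then never touches clusters that belong to other contexts.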

Slide16

Experimental results, GROUPBY

No contextualized indexing

Conventional GROUPBY, very efficient exclude method.

Synthetic data

Differential Maintenance (DM) takes constant time, SBM scales logarithmically, and RM scales linearly.

Slide17

Experimental results, Indexing

AC: the average number of clusters per window instance

SBM with contextualized indexing scales logarithmically with AC; without an index it scales quadratically.

Slide18

Experimental results, Real data

BIRCH Clustering on real data from a soccer game.

As PR is scaled, AC is also scaled.

SBM is significantly better than RM.

The gain from indexing is limited in RM (15%) due to intensive copying, compared to a 60% gain from indexing in SBM.

Slide19

Experimental results, Memory Utilization

Slide20

Experimental results, Work breakdown

Copier plug-in dominates in RM: all window instances have the full extent of the window -> more data to copy -> indexing does not help

Merger plug-in dominates in SBM: the copier is relatively cheap because most nodes in the lattice cover a short extent -> less data to copy -> indexing helps

Low G2CS overhead

Slide21

References

Repetitive Merge is used in the following papers:

[1] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams," in Proceedings of Foundations of Computer Science conference, Redondo Beach, CA, 2000, pp. 359-366.

[2] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, "Maintaining variance and k-medians over data stream windows," in SIGMOD conf., San Diego, 2003, pp. 234-243.

Decremental DBSCAN:

[3] M. Ester, H-P. Kriegel, J. Sander, M. Wimmer, and X. Xu, "Incremental clustering for mining in a data warehousing environment," in VLDB conf., New York, 1998, pp. 323-333.

Why decremental clustering algorithms are not suitable for streaming:

[4] Di Yang, E. A. Rundensteiner, and M. O. Ward, "Neighbor-based pattern detection for windows over streaming data," in EDBT conf., Saint Petersburg, 2009, pp. 229-540.

BIRCH:

[5] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: an efficient data clustering method for very large databases," in SIGMOD conf., Montreal, 1996, pp. 103-114.

Slide22

Framework for real-time clustering over sliding windows

Sobhan Badiozamany

Kjell Orsborn

Tore Risch

Uppsala University, Sweden

Emails: firstname.lastname@it.uu.se