
Slide1

Design and Evaluation of a Parallel Execution Framework for the CLEVER Clustering Algorithm

Chung Sheng CHEN, Nauful SHAIKH, Panitee CHAROENRATTANARUK, Christoph F. EICK, Nouhad RIZK and Edgar GABRIEL
Department of Computer Science, University of Houston

Talk Organization
1. Randomized Hill Climbing
2. CLEVER: A Prototype-based Clustering Algorithm which Supports Fitness Functions
3. OpenMP and CUDA Versions of CLEVER
4. Experimental Results
5. Summary

Slide2

1. Randomized Hill Climbing

Randomized Hill Climbing: Sample p points randomly in the neighborhood of the current best solution; determine the best solution among the p sampled points. If it is better than the current solution, make it the new current solution and continue the search; otherwise, terminate and return the current solution.

Advantages: easy to apply, does not need many resources, usually fast.

Problems: How do I define my neighborhood? What value should I choose for the parameter p?

Eick et al., ParCo11, Ghent

Slide3

Example Randomized Hill Climbing

Maximize f(x,y,z) = |x-y-0.2| * |x*z-0.8| * |0.3-z*z*y| with x, y, z in [0,1].

Neighborhood Design: Create 50 solutions s, such that:
s = (min(1, max(0, x+r1)), min(1, max(0, y+r2)), min(1, max(0, z+r3)))
with r1, r2, r3 being random numbers in [-0.05, +0.05].
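The example above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the talk: the function names, the fixed random seed, and the starting point are assumptions made for the illustration, while the objective f, the sample size of 50, and the step range [-0.05, +0.05] come from the slide.

```python
import random

def f(x, y, z):
    # Objective from the slide, to be maximized over [0,1]^3.
    return abs(x - y - 0.2) * abs(x * z - 0.8) * abs(0.3 - z * z * y)

def clip(v):
    # Keep a coordinate inside [0, 1], as in the neighborhood definition.
    return min(1.0, max(0.0, v))

def sample_neighbors(s, p, rng, step=0.05):
    # Create p solutions by adding a random offset in [-step, +step] to each coordinate.
    return [tuple(clip(v + rng.uniform(-step, step)) for v in s) for _ in range(p)]

def randomized_hill_climbing(start, p=50, seed=0):
    # 'seed' is an assumption added so that runs are reproducible.
    rng = random.Random(seed)
    current = start
    while True:
        best = max(sample_neighbors(current, p, rng), key=lambda s: f(*s))
        if f(*best) <= f(*current):
            return current  # no sampled neighbor is better: terminate
        current = best      # otherwise continue from the better solution

if __name__ == "__main__":
    solution = randomized_hill_climbing((0.5, 0.5, 0.5))
    print(solution, f(*solution))
```

The search stops at the first iteration in which none of the p sampled neighbors improves on the current solution, so the result is a local optimum that depends on the starting point and on p.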

Slide4

2. CLEVER: Clustering with Plug-in Fitness Functions

In the last 5 years, the UH-DMML Research Group at the University of Houston developed families of clustering algorithms that find contiguous spatial clusters by maximizing a plug-in fitness function.

This work is motivated by a mismatch between the evaluation measures of traditional clustering algorithms (such as cluster compactness) and what domain experts are actually looking for.

Plug-in fitness functions allow domain experts to instruct clustering algorithms with respect to the desirable properties of “good” clusters that the clustering algorithm should seek.

Slide5

Region Discovery Framework

Slide6

Region Discovery Framework

The algorithms we currently investigate solve the following problem:

Given:
A dataset O with a schema R
A distance function d defined on instances of R
A fitness function q(X) that evaluates clusterings $X = \{c_1, \dots, c_k\}$ as follows:

$$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} i(c) \cdot \mathrm{size}(c)^{\beta} \quad \text{with } \beta \geq 1$$

Objective: Find $c_1, \dots, c_k \subseteq O$ such that:
1. $c_i \cap c_j = \emptyset$ if $i \neq j$
2. $X = \{c_1, \dots, c_k\}$ maximizes $q(X)$
3. All clusters $c_i \in X$ are contiguous (each pair of objects belonging to $c_i$ has to be Delaunay-connected with respect to $c_i$ and to d)
4. $c_1 \cup \dots \cup c_k \subseteq O$
5. $c_1, \dots, c_k$ are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
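The fitness function defined on this slide sums, over all clusters, an interestingness value multiplied by the cluster size raised to a power β. A minimal Python sketch of that computation follows. The function name `fitness`, the list-of-lists cluster representation, and the constant interestingness used in the demonstration are assumptions made for illustration, not part of the talk.

```python
def fitness(clustering, interestingness, beta):
    """Sum of i(c) * size(c)**beta over all clusters c.

    clustering      -- list of clusters, each a list of objects (assumed representation)
    interestingness -- plug-in function i(c) supplied by the domain expert
    beta            -- exponent applied to the cluster size
    """
    return sum(interestingness(c) * len(c) ** beta for c in clustering)

if __name__ == "__main__":
    constant = lambda c: 1.0  # hypothetical interestingness: every cluster is equally interesting
    one_big = [[1, 2, 3, 4]]
    two_small = [[1, 2], [3, 4]]
    print(fitness(one_big, constant, beta=2))    # 4**2 = 16.0
    print(fitness(two_small, constant, beta=2))  # 2**2 + 2**2 = 8.0
```

With equal interestingness, an exponent above 1 rewards one large cluster more than several small ones covering the same objects, which is how β influences the number of clusters that the search settles on.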

Slide7

Example 1: Finding Regional Co-location Patterns in Spatial Data

Objective: Find co-location regions using various clustering algorithms and novel fitness functions.

Applications:
1. Finding regions on planet Mars where shallow and deep ice are co-located, using point and raster datasets. In Figure 1, regions in red have very high co-location and regions in blue have anti-co-location.
2. Finding co-location patterns involving chemical concentrations with values on the wings of their statistical distribution in Texas’ ground water supply. Figure 2 indicates discovered regions and their associated chemical patterns.

Figure 1: Co-location regions involving deep and shallow ice on Mars

Figure 2: Chemical co-location patterns in Texas Water Supply

Slide8

Example 2: Regional Regression

Geo-regression approaches: Multiple regression functions are used that vary depending on location.

Regional Regression:
1. Discover regions with strong relationships between dependent and independent variables
2. Construct regional regression functions for each region
3. When predicting the dependent variable of an object, use the regression function associated with the location of the object
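Steps 2 and 3 of regional regression can be illustrated with a small Python sketch. It assumes that the regions have already been discovered and that each sample carries a region label; the single independent variable, the least-squares line, and all names (`fit_line`, `fit_regional`, `predict`) are simplifying assumptions for illustration and not the method used in the talk.

```python
def fit_line(xs, ys):
    # Ordinary least-squares fit of y = slope * x + intercept.
    # Assumes at least two distinct x values in the region.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_regional(samples):
    # samples: iterable of (region, x, y); one regression function per region.
    by_region = {}
    for region, x, y in samples:
        xs, ys = by_region.setdefault(region, ([], []))
        xs.append(x)
        ys.append(y)
    return {region: fit_line(xs, ys) for region, (xs, ys) in by_region.items()}

def predict(models, region, x):
    # Use the regression function associated with the object's region.
    slope, intercept = models[region]
    return slope * x + intercept

if __name__ == "__main__":
    samples = [("A", 0, 0), ("A", 1, 2), ("A", 2, 4),
               ("B", 0, 5), ("B", 1, 4), ("B", 2, 3)]
    models = fit_regional(samples)
    print(predict(models, "A", 3), predict(models, "B", 3))
```

The same x value leads to different predictions in regions A and B because each region has its own regression function.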

Slide9

Representative-based Clustering

[Figure: scatter plot with axes Attribute1 and Attribute2; points labeled 1 to 4]

Objective: Find a set of objects O_R such that the clustering X obtained by using the objects in O_R as representatives optimizes a fitness function q(X) (K-means and K-medoids minimize their objective; CLEVER maximizes its plug-in fitness function).

Characteristic: clusters are formed by assigning objects to the closest representative.

Popular Algorithms: K-means, K-medoids/PAM, CLEVER
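The characteristic step, assigning each object to its closest representative, can be sketched as follows. The Euclidean distance and the function name `assign_to_representatives` are assumptions for illustration; the talk allows any object distance function d or a distance matrix.

```python
import math

def assign_to_representatives(objects, representatives):
    # One cluster per representative; each object joins the cluster
    # of the representative that is closest to it.
    clusters = [[] for _ in representatives]
    for obj in objects:
        nearest = min(range(len(representatives)),
                      key=lambda j: math.dist(obj, representatives[j]))
        clusters[nearest].append(obj)
    return clusters

if __name__ == "__main__":
    objects = [(0, 0), (1, 0), (10, 10), (9, 10)]
    representatives = [(0, 0), (10, 10)]
    print(assign_to_representatives(objects, representatives))
```

Each object is compared with every representative, so the cost of this step grows with the number of objects times the number of representatives.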

Slide10

The CLEVER Algorithm

A prototype-based clustering algorithm which supports plug-in fitness functions
Uses a randomized hill climbing procedure to find a “good” set of prototype data objects that represent clusters
“good” means maximizing the plug-in fitness function
Searches for the “correct” number of clusters
CLEVER is powerful but usually slow

[Figure: components of CLEVER: hill climbing procedure, plug-in fitness function, neighboring solutions generator, assign cluster members]

Slide11

Pseudo Code of CLEVER

Inputs: Dataset O, k’, neighborhood-size, p, q, β, object-distance-function d or distance matrix D, i-max
Outputs: Clustering X, fitness q(X), rewards for clusters in X

Algorithm:
1. Create a current solution by randomly selecting k’ representatives from O.
2. If i-max iterations have been done, terminate with the current solution.
3. Create p neighbors of the current solution randomly using the given neighborhood definition.
4. If the best neighbor improves the fitness q, it becomes the current solution. Go back to step 2.
5. If the fitness does not improve, the solution neighborhood is re-sampled by generating p’ more neighbors (more precisely, first 2*p solutions and then (q-2)*p solutions are re-sampled). If re-sampling does not lead to a better solution, terminate and return the current solution (however, clusters that receive a reward of 0 are considered outliers, and such non-reward clusters are therefore not returned); otherwise, go back to step 2, replacing the current solution by the best solution found by re-sampling.
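The control flow of the pseudo code can be sketched in Python as shown below. This is a simplified illustration under several assumptions that are not in the talk: the data are one-dimensional numbers, a solution is a sorted list of indices of representatives, the neighborhood operator randomly inserts, deletes, or replaces one representative, and `toy_fitness` is a hypothetical plug-in fitness function. Outlier removal for zero-reward clusters is omitted.

```python
import random

def assign(points, reps):
    # Assign every point index to its closest representative (given by index).
    clusters = {r: [] for r in reps}
    for i, x in enumerate(points):
        clusters[min(reps, key=lambda r: abs(points[r] - x))].append(i)
    return clusters

def toy_fitness(points, clusters, beta=1.2):
    # Hypothetical plug-in fitness: tight clusters are interesting,
    # and the reward grows with size**beta.
    total = 0.0
    for members in clusters.values():
        if members:
            values = [points[i] for i in members]
            total += len(members) ** beta / (1.0 + max(values) - min(values))
    return total

def neighbor(reps, n, rng):
    # Assumed neighborhood definition: insert, delete, or replace one representative.
    reps = list(reps)
    outside = [i for i in range(n) if i not in reps]
    op = rng.choice(["insert", "delete", "replace"])
    if op == "delete" and len(reps) > 1:
        reps.remove(rng.choice(reps))
    elif op == "insert" and outside:
        reps.append(rng.choice(outside))
    elif outside:  # "replace", also used when a delete is not allowed
        reps[rng.randrange(len(reps))] = rng.choice(outside)
    return sorted(reps)

def clever(points, fitness, k_prime, p, q, i_max, seed=0):
    rng = random.Random(seed)
    n = len(points)
    current = sorted(rng.sample(range(n), k_prime))            # step 1
    current_fit = fitness(points, assign(points, current))
    for _ in range(i_max):                                      # step 2
        improved = False
        # steps 3 and 4 sample p neighbors; step 5 re-samples 2*p, then (q-2)*p
        for sample_size in (p, 2 * p, (q - 2) * p):
            if sample_size <= 0:
                continue
            candidates = [neighbor(current, n, rng) for _ in range(sample_size)]
            best_fit, best = max((fitness(points, assign(points, c)), c)
                                 for c in candidates)
            if best_fit > current_fit:
                current, current_fit, improved = best, best_fit, True
                break
        if not improved:
            break                                               # re-sampling failed: terminate
    return current, current_fit

if __name__ == "__main__":
    data = [0.0, 0.2, 0.4, 0.5, 9.8, 10.0, 10.1, 10.3, 20.0, 20.2]
    print(clever(data, toy_fitness, k_prime=2, p=5, q=4, i_max=20, seed=1))
```

Because the neighborhood operator can insert and delete representatives, the number of clusters in the returned solution may differ from k’, which is how the search looks for the number of clusters.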

Slide12

3. PAR-CLEVER: A Faster Clustering Algorithm

OpenMP
CUDA (GPU computing)
MPI
Map/Reduce

Slide13

Benchmarks: Data Sets Used

10Ovals
Size: 3,359
Fitness function: purity

Earthquake
Size: 330,561
Fitness function: find clusters with high variance with respect to earthquake depth

Yahoo Ads Clicks
Full size: 3,009,071,396; subset: 2,910,613
Fitness function: minimum intra-cluster distance

Slide14

Parallelization targets

Assign cluster members: O(n*k)
    Data parallelization
    Highly independent
    The first priority for parallelization
Fitness value calculation: ~O(n)
Neighboring solutions generation: ~O(p)

n := number of objects in the dataset
k := number of clusters in the current solution
p := sampling rate (how many neighbors of the current solution are sampled)

Slide15

Hardware Specification

crill-001 to crill-016 (OpenMP)
Processor: 4 x AMD Opteron(tm) Processor 6174
CPU cores: 48
Core speed: 2200 MHz
Memory: 64 GB

crill-101 and crill-102 (GPU Computing, NVIDIA CUDA)
Processor: 2 x AMD Opteron(tm) Processor 6174
CPU cores: 24
Core speed: 2200 MHz
Memory: 32 GB
GPU Device: 4 x Tesla M2050, Memory: 3 GB, CUDA cores: 448

Slide16

4. Experimental Results: 10Ovals (measured in seconds)

10Ovals Dataset (size = 3359)
p=100, q=27, k’=10, η=1.1, th=0.6, β=1.6, Interestingness Function = Purity

Threads                                  1        6       12       24       48

Loop-level
  Time (sec)                        248.49    50.52    30.09    20.58    16.39
  Speedup                             1.00     4.92     8.26    12.07    15.16
  Efficiency                          1.00     0.82     0.69     0.50     0.32

Loop-level + Incremental Updating
  Time (sec)                        229.88    49.43    29.99    20.28    15.61
  Speedup                             1.00     4.65     7.67    11.34    14.73
  Efficiency                          1.00     0.78     0.64     0.47     0.31

Task-level
  Time (sec)                        248.49    41.83    21.67    11.44     6.40
  Speedup                             1.00     5.94    11.47    21.72    38.84
  Efficiency                          1.00     0.99     0.96     0.90     0.81

Iterations = 14, Evaluated neighbor solutions = 15200, k = 5, Fitness = 77187.7
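The speedup and efficiency rows follow the usual definitions: speedup is the single-thread time divided by the parallel time, and efficiency is the speedup divided by the number of threads. The short Python sketch below recomputes them for the task-level times; the function and variable names are assumptions for illustration.

```python
def speedup(t_sequential, t_parallel):
    # How many times faster the parallel run is than the single-thread run.
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, threads):
    # Speedup per thread; 1.0 would be ideal linear scaling.
    return speedup(t_sequential, t_parallel) / threads

if __name__ == "__main__":
    # Task-level run times in seconds for the 10Ovals data set, taken from the table.
    task_level_seconds = {1: 248.49, 6: 41.83, 12: 21.67, 24: 11.44, 48: 6.40}
    t1 = task_level_seconds[1]
    for threads, t in task_level_seconds.items():
        print(threads, round(speedup(t1, t), 2), round(efficiency(t1, t, threads), 2))
```

Because the table reports times rounded to two decimals, values recomputed this way can differ from the table in the last digit.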

Slide17

Experimental Results continued: 10Ovals

Slide18

Experimental Results: Earthquake (measured in hours)

Earthquake Dataset (size = 330,561)
p=50, q=12, k’=100, η=2, th=1.2, β=1.4, Interestingness Function = Variance High

Threads                                  1        6       12       24       48

Loop-level
  Time (hours)                      185.39    35.27    23.17    12.38    10.20
  Speedup                             1        5.26     8.00    14.97    18.18
  Efficiency                          1        0.88     0.67     0.62     0.38

Loop-level + Incremental Updating
  Time (hours)                       30.24     9.18     6.89     6.06     6.84
  Speedup                             1        3.29     4.39     4.99     4.42
  Efficiency                          1        0.55     0.37     0.21     0.09

Task-level
  Time (hours)                      185.39    31.95    17.19     9.76     6.14
  Speedup                             1        5.80    10.79    19.00    30.18
  Efficiency                          1        0.97     0.90     0.79     0.63

Iterations = 216, Evaluated neighbor solutions = 21,950, k = 115

Slide19

Experimental Results continued: Earthquake

Slide20

Experimental Results: Yahoo (measured in hours)

Yahoo Reduced Dataset (size = 2910613)
p=48, q=7, k’=80, η=1.2, th=0, β=1.000001, Interestingness Function = Average Distance to Medoid

Threads                                  1        6       12       24       48

Loop-level
  Time (hours)                      154.62    29.25    16.74    12.12     9.94
  Speedup                             1        5.29     9.24    12.75    15.55
  Efficiency                          1        0.88     0.77     0.53     0.32

Loop-level + Incremental Updating
  Time (hours)                       28.30     8.15     6.71     5.55     5.68
  Speedup                             1        3.47     4.22     5.10     4.98
  Efficiency                          1        0.58     0.35     0.21     0.10

Task-level
  Time (hours)                      154.62    25.78    12.97     6.63     3.42
  Speedup                             1        6.00    11.92    23.33    45.21
  Efficiency                          1        1.00     0.99     0.97     0.94

Iterations = 10, Evaluated neighbor solutions = 480, k = 94

Slide21

Experimental Results continued: Yahoo

Slide22

CUDA Results: 10Ovals

10Ovals Dataset (size = 3359)
p=100, q=27, k’=10, η=1.1, th=0.6, β=1.6, Interestingness Function = Purity

CUDA run time (seconds): 1.33, 1.32, 1.34, 1.32, 1.33, 1.32; Avg: 1.327
Iterations = 12, Evaluated neighbor solutions = 5100, k = 5

vs.

OpenMP, task-level
Threads          Sequential        6       12       24       48
Time (sec)           248.49    41.83    21.67    11.44     6.40
Iterations = 14, Evaluated neighbor solutions = 15200, k = 5, Fitness = 77187.7

The CUDA version evaluates 5100 solutions in 1.327 seconds, which corresponds to 15200 solutions in 3.95 seconds.
Speedup = Time(CPU) / Time(GPU)
63x speedup compared to the sequential version
1.62x speedup compared to OpenMP with 48 threads

Slide23

CUDA Results: Earthquake (preliminary!)

Earthquake Dataset (size = 330561)
p=50, q=12, k’=100, η=2, th=1.2, β=1.4, Interestingness Function = Variance High

CUDA run time (seconds): 138.95, 146.56, 143.82, 139.10, 146.19, 147.03; Avg: 143.61
Iterations = 158, Evaluated neighbor solutions = 28,900, k = 92

vs.

OpenMP, task-level
Threads          Sequential        6       12       24       48
Time (hours)         185.39    31.95    17.19     9.76     6.14
Iterations = 216, Evaluated neighbor solutions = 21950, k = 115

The CUDA version evaluates 28,900 solutions in 143.61 seconds, which corresponds to 21950 solutions in 109.07 seconds.
Speedup = Time(CPU) / Time(GPU)
6119x speedup compared to the sequential version
202x speedup compared to OpenMP with 48 threads
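The CUDA and OpenMP runs evaluated different numbers of neighbor solutions, so the GPU time is first scaled to the number of solutions evaluated by the CPU run before the ratio is taken. The sketch below reproduces that arithmetic with the figures reported on the CUDA result slides (28,900 solutions evaluated by the CUDA Earthquake run); the function names are assumptions for illustration, and the linear scaling of run time with the number of evaluated solutions is the assumption implied by the slides.

```python
def normalized_gpu_time(gpu_seconds, gpu_solutions, cpu_solutions):
    # Scale the GPU run time to the number of solutions the CPU run evaluated,
    # assuming run time grows linearly with the number of evaluated solutions.
    return gpu_seconds * cpu_solutions / gpu_solutions

def gpu_speedup(cpu_seconds, gpu_seconds, gpu_solutions, cpu_solutions):
    # Speedup = Time(CPU) / Time(GPU), with the GPU time normalized as above.
    return cpu_seconds / normalized_gpu_time(gpu_seconds, gpu_solutions, cpu_solutions)

if __name__ == "__main__":
    # 10Ovals: 5100 solutions in 1.327 s on the GPU, 15200 solutions on the CPU.
    print(round(normalized_gpu_time(1.327, 5100, 15200), 2))
    print(round(gpu_speedup(248.49, 1.327, 5100, 15200)))
    # Earthquake: 28900 solutions in 143.61 s on the GPU, 21950 on the CPU;
    # the sequential CPU time of 185.39 hours is converted to seconds.
    print(round(normalized_gpu_time(143.61, 28900, 21950), 2))
    print(round(gpu_speedup(185.39 * 3600, 143.61, 28900, 21950)))
```

The comparison is only as good as the linear-scaling assumption, which is one reason the Earthquake figures are marked as preliminary.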

Slide24

CUDA implementation: Cache representatives in shared memory

The representatives are read frequently in the computation that assigns objects to clusters. The results presented earlier cached the representatives in shared memory for faster access.

The following table compares the performance of CLEVER with and without caching the representatives on the Earthquake data set. The size of the cached representatives is 2 MB.

The result shows that caching the representatives improves the runtime very little (0.09%).

Earthquake Dataset (size = 330561)
p=50, q=12, k’=100, η=2, th=1.2, β=1.4, Interestingness Function = Variance High

Run Time (seconds)
Cache       138.95   146.56   143.82   139.10   146.19   147.03   Avg: 143.61
No-cache    144.63   139.90   144.27   144.50   144.71   144.44   Avg: 143.74

Iterations = 158, Evaluated neighbor solutions = 28,900, k = 92

Slide25

The difference between the OpenMP and CUDA implementations: why?

The OpenMP version uses an object-oriented programming (OOP) design inherited from its original implementation, whereas the redesigned CUDA version is closer to a procedural implementation.
CUDA hardware has higher bandwidth, which contributed a little to the speedup.
Caching contributes little to the speedup (as already analyzed).

Slide26

5. Summary

CUDA and OpenMP results indicate good scalability of the parallel algorithm using multi-core processors: computations which take days can now be performed in minutes or hours.

OpenMP
    Easy to implement
    Good speedup
    Limited by the number of cores and the amount of RAM

CUDA GPU
    Extra attention is needed for CUDA programming
    Lower level of programming: registers, cache memory, …
    GPU memory hierarchy is different from the CPU's
    Support for only some data structures
    Synchronization between threads in different blocks is not possible
    Very large speedups, some of which are still under investigation

Slide27

Future Work

More work on the CUDA version.
Conduct more experiments which explain what works well, what doesn’t, and why.
Analyze in more depth the impact of the capability to search many more solutions on solution quality.
Implement a version of CLEVER which conducts multiple randomized hill climbing searches in parallel and which employs dynamic load balancing, so that more resources are allocated to the “more promising” searches.
Reuse code for speeding up other data mining algorithms which use randomized hill climbing.