Questions and Topics Review Dec. 1, 2011 - PowerPoint Presentation

Presentation Transcript

Slide1

Questions and Topics Review Dec. 1, 2011

Give an example of a problem that might benefit from feature creation

Compute the Silhouette of the following clustering that consists of 2 clusters: {(0,0), (0,1), (2,2)} and {(3,2), (3,3)}. Assume Manhattan distance is used.

Silhouette: For an individual point i:

Calculate a = average distance of i to the points in its cluster

Calculate b = min(average distance of i to the points in another cluster)

The silhouette coefficient for a point is then given by: s = (b - a) / max(a, b)
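As a quick sanity check (not part of the original slides), a minimal Python sketch that computes the silhouette coefficient of each point in the clustering above under Manhattan distance could look like this:

# Minimal sketch: silhouette coefficient per point, Manhattan (L1) distance.
# The cluster contents come from the question above; everything else is illustrative.
def manhattan(p, q):
    return sum(abs(x - y) for x, y in zip(p, q))

def silhouette(point, own_cluster, other_clusters):
    # a = average distance of the point to the other points in its own cluster
    a = sum(manhattan(point, q) for q in own_cluster if q != point) / (len(own_cluster) - 1)
    # b = minimum, over the other clusters, of the average distance to that cluster's points
    b = min(sum(manhattan(point, q) for q in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

clusters = [[(0, 0), (0, 1), (2, 2)], [(3, 2), (3, 3)]]
for i, cluster in enumerate(clusters):
    others = [c for j, c in enumerate(clusters) if j != i]
    for p in cluster:
        print(p, round(silhouette(p, cluster, others), 3))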

APRIORI has been generalized for mining sequential patterns. How is the APRIORI property defined and used in the context of sequence mining?

Assume the Apriori-style sequence mining algorithm described on pages 429-435 is used and the algorithm generated the 3-sequences listed below (see 2007 Final Exam!):

 

Frequent 3-sequences | Candidate Generation | Candidates that survived pruning

<(1) (2) (3)>

<(1 2 3)>

<(1) (2) (4)>

<(1) (3) (4)>

<(1 2) (3)>

<(2 3) (4)>

<(2) (3) (4)>

<(3) (4 5)>

Slide2

Questions and Topics Review Dec. 1, 2011

Give an example of a problem that might benefit from feature creation

Compute the Silhouette of the following clustering that consists of 2 clusters: {(0,0), (0,1), (2,2)} and {(3,2), (3,3)}.

Silhouette: For an individual point i:

Calculate a = average distance of i to the points in its cluster

Calculate b = min(average distance of i to the points in another cluster)

The silhouette coefficient for a point is then given by: s = (b - a) / max(a, b)

APRIORI has been generalized for mining sequential patterns. How is the APRIORI property defined and used in the context of sequence mining?

Property: see textbook [2]

Use: Combine sequences that are frequent and that agree in all elements except the first element of the first sequence and the last element of the second sequence.

Prune sequences if not all subsequences that can be obtained by removing a single element are frequent. [3]
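To make the two steps concrete, here is a small Python sketch (an illustration, not the textbook's pseudocode), assuming a sequence is represented as a tuple of elements and each element as a tuple of items:

# Sketch of Apriori-style (GSP) candidate generation and pruning for sequences.
# Representation is an assumption for illustration: <(1 2) (3)> -> ((1, 2), (3,)).

def drop_first_item(seq):
    # Subsequence obtained by removing the first item of the first element.
    head, rest = seq[0], seq[1:]
    return (head[1:],) + rest if len(head) > 1 else rest

def drop_last_item(seq):
    # Subsequence obtained by removing the last item of the last element.
    rest, tail = seq[:-1], seq[-1]
    return rest + (tail[:-1],) if len(tail) > 1 else rest

def merge(s1, s2):
    # Combine two frequent k-sequences into a (k+1)-sequence candidate if they
    # agree everywhere except the first item of s1 and the last item of s2.
    if drop_first_item(s1) != drop_last_item(s2):
        return None
    last_item = s2[-1][-1]
    if len(s2[-1]) == 1:                       # last item of s2 forms its own element
        return s1 + ((last_item,),)
    return s1[:-1] + (s1[-1] + (last_item,),)  # otherwise it extends s1's last element

def one_item_subsequences(seq):
    # All (k-1)-subsequences obtained by removing a single item.
    subs = []
    for i, elem in enumerate(seq):
        for j in range(len(elem)):
            shrunk = elem[:j] + elem[j + 1:]
            subs.append(seq[:i] + ((shrunk,) if shrunk else ()) + seq[i + 1:])
    return subs

def survives_pruning(candidate, frequent):
    # A candidate is pruned if any of its one-item-removed subsequences is infrequent.
    return all(sub in frequent for sub in one_item_subsequences(candidate))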

Assume the Apriori-style sequence mining algorithm described on pages 429-435 is used and the algorithm generated the 3-sequences listed below:

 

Frequent 3-sequences | Candidate Generation | Candidates that survived pruning

<(1) (2) (3)>

<(1 2 3)>

<(1) (2) (4)>

<(1) (3) (4)>

<(1 2) (3)>

<(2 3) (4)>

<(2) (3) (4)>

<(3) (4 5)>

Slide3

Questions and Topics Review Dec. 1, 2011


3) Association Rule and Sequence Mining [15]

a) Assume the Apriori-style sequence mining algorithm described on pages 429-435 is used and the algorithm generated the 3-sequences listed below:

Candidates that survived pruning: <(1) (2) (3) (4)>

Candidate Generation:

<(1) (2) (3) (4)>: survived
<(1 2 3) (4)>: pruned, <(1 3) (4)> is infrequent
<(1) (3) (4 5)>: pruned, <(1) (4 5)> is infrequent
<(1 2) (3) (4)>: pruned, <(1 2) (4)> is infrequent
<(2 3) (4 5)>: pruned, <(2) (4 5)> is infrequent
<(2) (3) (4 5)>: pruned, <(2) (4 5)> is infrequent
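Assuming the helper functions sketched earlier (merge and survives_pruning) and the tuple-of-tuples representation introduced there, a short driver like the following reproduces the listing above from the eight frequent 3-sequences:

# The eight frequent 3-sequences from the table, e.g. <(1 2) (3)> -> ((1, 2), (3,)).
frequent = {
    ((1,), (2,), (3,)), ((1, 2, 3),), ((1,), (2,), (4,)), ((1,), (3,), (4,)),
    ((1, 2), (3,)), ((2, 3), (4,)), ((2,), (3,), (4,)), ((3,), (4, 5)),
}

candidates = set()
for s1 in frequent:
    for s2 in frequent:
        c = merge(s1, s2)          # candidate generation (merge step)
        if c is not None:
            candidates.add(c)

for c in sorted(candidates):
    verdict = "survived" if survives_pruning(c, frequent) else "pruned"
    print(c, verdict)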

 

 

 

What if the answers are correct, but this part of the description isn’t given? Do I need to take any points off?

Give an extra point if the explanation is correct and present; otherwise subtract a point; more than 2 errors: 2 points or less!

Frequent 3-sequences | Candidate Generation | Candidates that survived pruning

<(1) (2) (3)>

<(1 2 3)>

<(1) (2) (4)>

<(1) (3) (4)>

<(1 2) (3)>

<(2 3) (4)>

<(2) (3) (4)>

<(3) (4 5)>

 

What candidate 4-sequences are generated from this 3-sequence set? Which of the generated 4-sequences survive the pruning step? Use the format of Figure 7.6 in the textbook on page 435 to describe your answer! [7]

 Slide4

Questions and Topics Review Dec. 1, 2011

5. The Top 10 Data Mining Algorithms article says about k-means: “The greedy-descent nature of k-means on a non-convex cost also implies that the convergence is only to a local optimum, and indeed the algorithm is typically quite sensitive to the initial centroid locations… The local minima problem can be countered to some extent by running the algorithm multiple times with different initial centroids.” Explain why the suggestion in boldface is a potential solution to the local minimum problem. Propose a modification of the k-means algorithm that uses the suggestion!

Slide5

5. The Top 10 Data Mining Algorithms article says about k-means: “The greedy-descent nature of k-means on a non-convex cost also implies that the convergence is only to a local optimum, and indeed the algorithm is typically quite sensitive to the initial centroid locations… The local minima problem can be countered to some extent by running the algorithm multiple times with different initial centroids.” Explain why the suggestion in boldface is a potential solution to the local minimum problem. Propose a modification of the k-means algorithm that uses the suggestion!

Using k-means with different seeds will find different local minima of k-means’ objective function; therefore, running k-means with different initial seeds that lie in the proximity of different local minima will produce alternative results. [2]

Run k-means with different seeds multiple times (e.g., 20 times), compute the SSE of each resulting clustering, and return the clustering with the lowest SSE value as the result. [3]
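A minimal sketch of that modification, assuming scikit-learn is available (the data matrix X is just a placeholder here):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0, 0], [0, 1], [2, 2], [3, 2], [3, 3]])   # placeholder data

# Run k-means 20 times, each with a single different random initialization,
# and keep the run with the lowest SSE (exposed by scikit-learn as inertia_).
runs = [KMeans(n_clusters=2, n_init=1, random_state=seed).fit(X) for seed in range(20)]
best = min(runs, key=lambda run: run.inertia_)
print(best.labels_, best.inertia_)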