# CS 490 Sample Project: Mining the Mushroom Data Set


1

CS 490 Sample Project Mining the Mushroom Data Set

Kirk Scott

Slide22

Slide3Yellow Morels

3

Slide4Black Morels

4

Slide5This set of overheads begins with the contents of the project check-off sheet.

After that, an example project is given.

5

Slide6CS 490 Data Mining Project Check-Off Sheet

Student's name: _______

1. Meets requirements for formatting. (No pts.) [ ]
2. Oral presentation given. (No pts.) [ ]
3. Attendance at Other Students' Presentations. Partial points for partial attendance. 20 pts. ____

6

Slide7I. Background Information on the Problem Domain and the Data Set

7

Slide8Name of Data Set: _______

I.A. Random Information Drawn from the Online Data Files Posted with the Data Set. 3 pts. ___
I.B. Contents of the Data File. 3 pts. ___
I.C. Summary of Background Information. 3 pts. ___
I.D. Screen Shot of Open File. 3 pts. ___

8

Slide9II. Applications of Data Mining Algorithms to the Data Set

9

Slide10II. Case 1. This Needs to Be a Classification Algorithm

Name of Algorithm: _______

i. Output Results. 3 pts. ___
ii. Explanation of Item. 2 pts. ___
iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

10

Slide11II. Case 2. This Needs to Be a Clustering Algorithm

Name of Algorithm: _______

i. Output Results. 3 pts. ___
ii. Explanation of Item. 2 pts. ___
iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

11

Slide12II. Case 3. This Needs to Be an Association Mining Algorithm

Name of Algorithm: _______

i. Output Results. 3 pts. ___
ii. Explanation of Item. 2 pts. ___
iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

12

Slide13II. Case 4. Any Kind of Algorithm

Name of Algorithm: _______

i. Output Results. 3 pts. ___
ii. Explanation of Item. 2 pts. ___
iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

13

Slide14II. Case 5. Any Kind of Algorithm

Name of Algorithm: _______

i. Output Results. 3 pts. ___
ii. Explanation of Item. 2 pts. ___
iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

14

Slide15II. Case 6. Any Kind of Algorithm

Name of Algorithm: _______

i. Output Results. 3 pts. ___
ii. Explanation of Item. 2 pts. ___
iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

15

Slide16II. Case 7. Any Kind of Algorithm

Name of Algorithm: _______

i. Output Results. 3 pts. ___
ii. Explanation of Item. 2 pts. ___
iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

16

Slide17II. Case 8. Any Kind of Algorithm

Name of Algorithm: _______

i. Output Results. 3 pts. ___
ii. Explanation of Item. 2 pts. ___
iii. Graphical or Other Special Purpose Additional Output. 2 pts. ___

17

Slide18III. Choosing the Best Algorithm Among the Results

18

Slide19III.A. Random Babbling. 6 pts.___

III.B. An Application of the Paired t-test. 6 pts.___

Total out of 100 points possible: _____

19

Slide20Example Project

The point of this sample project is to illustrate what you should produce for your project. In addition to the content of the project, information given in italics provides instructions, commentary, or background information.

20

Slide21Needless to say, your project should simply contain all of the necessary content.

You don't have to provide italicized commentary.

21

Slide22I. Background Information on the Problem Domain and the Data Set

If you are working with your own data set you will have to produce this documentation entirely yourself. If you are working with a downloaded data set, you can use whatever information comes with the data set.

You may paraphrase that information, rearrange it, do anything to it to help make your presentation clear.

22

Slide23You don't have to follow academic practice and try to document or footnote what you did when presenting the information.

The goal is simply adaptation for clear and complete presentation. What I'm trying to say is this: There will be no penalty for "plagiarism".

23

Slide24What I would like you to avoid is simply copying and pasting, leading to a mass of information that is not relevant or helpful to the reader (the teacher, who will be assigning the grades) in understanding what you were doing.

Reorganize and edit as necessary in order to make it clear.

24

Slide25Finally, include a screen shot of the explorer view of the data set after you've opened the file containing it.

Already here you have a choice of what exactly to show and you need to write some text explaining what the screen shot displays.

25

Slide26I.A. Random Information Drawn from the Online Data Files Posted with the Data Set

This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525).

Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended.

This latter class was combined with the poisonous one.

26

Slide27The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like ''leaflets three, let it be'' for Poisonous Oak and Ivy.

27

Slide28Number of Instances: 8124

Number of Attributes: 22 (all nominally valued)

Attribute Information: (classes: edible=e, poisonous=p)

28

Slide291. cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
2. cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
3. cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y

29

Slide304. bruises?: bruises=t, no=f
5. odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
6. gill-attachment: attached=a, descending=d, free=f, notched=n
7. gill-spacing: close=c, crowded=w, distant=d

30

Slide318. gill-size: broad=b, narrow=n
9. gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
10. stalk-shape: enlarging=e, tapering=t
11. stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?

31

Slide3212. stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
13. stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
14. stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y

32

Slide3315. stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
16. veil-type: partial=p, universal=u
17. veil-color: brown=n, orange=o, white=w, yellow=y
18. ring-number: none=n, one=o, two=t

33

Slide3419. ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
20. spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y

34

Slide3521. population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
22. habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

35

Slide36Missing Attribute Values: 2480 of them (denoted by "?"), all for attribute #11.

Class Distribution:
-- edible: 4208 (51.8%)
-- poisonous: 3916 (48.2%)
-- total: 8124 instances

36

Slide37Logical rules for the mushroom data sets.

This is information derived by researchers who have already worked with the data set. Logical rules given below seem to be the simplest possible for the mushroom dataset and therefore should be treated as benchmark results.

37

Slide38Disjunctive rules for poisonous mushrooms, from most general to most specific:

P_1) odor = NOT (almond.OR.anise.OR.none)
120 poisonous cases missed, 98.52% accuracy

P_2) spore-print-color = green
48 cases missed, 99.41% accuracy

38

Slide39P_3) odor = none.AND.stalk-surface-below-ring = scaly.AND.(stalk-color-above-ring = NOT.brown)
8 cases missed, 99.90% accuracy

P_4) habitat = leaves.AND.cap-color = white
100% accuracy

Rule P_4) may also be
P_4') population = clustered.AND.cap_color = white

39

Slide40These rules involve 6 attributes (out of 22). Rules for edible mushrooms are obtained as negation of the rules given above, for example the rule:

odor = (almond.OR.anise.OR.none).AND.spore-print-color = NOT.green

gives 48 errors, or 99.41% accuracy on the whole dataset.

40

Slide41Several slightly more complex variations on these rules exist, involving other attributes, such as gill_size, gill_spacing, and stalk_surface_above_ring, but the rules given above are the simplest we have found.

41

Slide42I.B. Contents of the Data File

Here is a snippet of five records from the data file:

p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u

e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g

e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m

p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u

e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g

42
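As a quick sanity check (my own sketch, not part of the required project content), the benchmark rule P_1 from the background material can be applied to these five records. The field positions are assumed from the attribute list given earlier: index 0 is the class (e/p) and index 5 is odor.

```python
# Applying benchmark rule P_1 to the five sample records shown above.
records = [
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u",
    "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g",
    "e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m",
    "p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u",
    "e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g",
]

SAFE_ODORS = {"a", "l", "n"}  # almond, anise, none

def p1_predict(record):
    """Rule P_1: odor NOT (almond OR anise OR none) => poisonous."""
    return "p" if record.split(",")[5] not in SAFE_ODORS else "e"

correct = sum(p1_predict(r) == r.split(",")[0] for r in records)
print(correct, "of", len(records), "records classified correctly")  # 5 of 5
```

All five records happen to be classified correctly by P_1, consistent with its claimed 98.52% accuracy on the full data set.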

Slide43Incidentally, the data file contents also exist in expanded form.

Here is a record from that file:

EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,WHITE,TAPERING,BULBOUS,SMOOTH,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS

43

Slide44Section I.C should be written by you. You should summarize the information given above, which is largely copy and paste, in a brief, well-organized paragraph that you write yourself and which conveys the basics in a concise way.

44

Slide45The idea is that a reader who really doesn't want or need to know the details could go to this paragraph and find out everything they needed to know in order to keep reading the rest of your write-up and have some idea of what is going on.

45

Slide46I.C. Summary of Background Information

The problem domain is the classification of mushrooms as either poisonous/inedible or non-poisonous/edible. There are 8124 instances in the data set, each consisting of 22 nominal attributes.

Roughly half of the instances are poisonous and half are non-poisonous.

46

Slide47There are 2480 cases of missing attribute values, all on the same attribute.

As is to be expected with non-original data sets, this set has already been extensively studied. Other researchers have provided sets of rules they have derived which would serve as benchmarks when considering the results of the application of further data mining algorithms to the data set.

47

Slide48I.D. Screen Shot of Open File

***What this shows: The cap-shape attribute is chosen out of the list on the left. Its different values are given in the table in the upper right.

In the lower right, the Edible attribute is selected from a (hidden) drop down list.

48

Slide49The graph shows the proportion of edible and inedible mushrooms among the instances containing different values of cap-shape.

49

Slide5050

Slide51II. Applications of Data Mining Algorithms to the Data Set

The overall requirement is that you use the Weka explorer and run up to 8 different data mining algorithms on your data set.

Here is a preview of what is involved:

51

Slide52i

. You will get full credit for all 8 cases if among the 8 there is at least one each of classification, clustering, and association rule mining. In order to make it clear that this has been done, the first case should be a classification, the second case should be a clustering, and the third case should be an application of association rule mining.

52

Slide53The grading check-off sheet will reflect this requirement.

All remaining cases can be of your choice, given in any order you want.

53

Slide54ii. You will have to either copy a screen shot or copy certain information out of the Weka explorer interface and paste it into your report. What you need to do this for in the different kinds of cases is simply illustrated.

I won't try to list it all out here.

54

Slide55At every point, ask yourself this question:

"Was it immediately apparent to me what I was looking at and what it meant?"

If the answer to that question was no, you should include explanatory remarks with whatever you chose to show from Weka.

55

Slide56For consistency's sake in these cases you can label your remarks "***What this shows:".

56

Slide57iii. The most obvious kind of result to reproduce would be the percent correct and percent incorrect classification for a classification scheme, for example.

In addition to this, the output would include things like a confusion matrix, the Kappa statistic, and so on.

57
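Since the Kappa statistic is mentioned here, a hand-rolled sketch may help show how it falls out of a confusion matrix. The matrix values below are invented for illustration, not taken from any Weka run.

```python
# Kappa from a made-up 2x2 confusion matrix (rows = actual, cols = predicted).
cm = [[50, 10],
      [5, 35]]

total = sum(sum(row) for row in cm)
observed = (cm[0][0] + cm[1][1]) / total  # observed agreement p_o

# expected chance agreement p_e from the row/column marginals
row_sums = [sum(row) for row in cm]
col_sums = [cm[0][j] + cm[1][j] for j in range(2)]
expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total**2

# Kappa: agreement beyond chance, scaled to the maximum possible
kappa = (observed - expected) / (1 - expected)
print(round(kappa, 3))  # prints 0.694
```

A Kappa of 1 means perfect agreement; 0 means no better than chance.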

Slide58For each case that you examine, you will be expected to highlight one aspect of the output and to provide your own brief, written explanation of it.

Note that this is an "educational" aspect of this project.

58

Slide59On the job, the expectation would be that you as a user knew what it all meant.

Here, as a student, the goal is to show that you know what it all meant.

59

Slide60iv. Finally, there is an additional aspect of Weka that you should use and illustrate. I will not try to describe it in detail here. You will see examples in the content below.

60

Slide61In short, for the different algorithms, if you right click on the results, you will be given options to create graphs, charts, and various other kinds of output.

For each case that you cover you should take one of these options.

Again, there is an educational, as opposed to practical, aspect to this.

61

Slide62For the purposes of this project, just cycle through the different options that are available to show that you are familiar with them.

For each one, provide a sentence or two making it clear that you know what this additional output means.

62

Slide63II. Case 1. This Needs to Be a Classification Algorithm

Name of Algorithm: J48

63

Slide64i. Output Results

***What this shows: This shows the classifier tree generated by the J48 algorithm.

64

Slide6565

Slide66***What this shows:

This gives the analysis of the output of the algorithm. The most notable thing that should jump out at you is that this is a "perfect" tree. The output shows 100% correct classification and no misclassification.

66

Slide6767

Slide68ii. Explanation of Item

There is no need to repeat the screen shot. For this item I have chosen the confusion matrix. It is very easy to understand.

It shows 0 false positives and 0 false negatives.

68
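The tallying behind a confusion matrix can be made concrete with a few lines of code. This is a minimal sketch of mine, with five made-up label pairs, and "p" (poisonous) assumed to be the positive class.

```python
# Tallying confusion-matrix cells from actual/predicted labels.
actual    = ["p", "e", "e", "p", "e"]
predicted = ["p", "e", "e", "p", "e"]  # a "perfect" classifier, as with J48

tp = sum(a == "p" and pr == "p" for a, pr in zip(actual, predicted))
tn = sum(a == "e" and pr == "e" for a, pr in zip(actual, predicted))
fp = sum(a == "e" and pr == "p" for a, pr in zip(actual, predicted))
fn = sum(a == "p" and pr == "e" for a, pr in zip(actual, predicted))
print(tp, tn, fp, fn)  # 0 false positives, 0 false negatives
```

Note that which count is "TP" depends entirely on which class you treat as positive, which is exactly the ambiguity discussed on the next slide.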

Slide69It is interesting to note that you need to know the values for the attributes in the data file to be sure which number represents TP and which represents TN.

Referring back to the earlier screen shot, the same is true for the bars. What do the blue and red parts of the bars represent, edible or inedible?

69

Slide70iii. Graphical or Other Special Purpose Additional Output

***What this shows: Going back to the previous screen shot, if you right-click on the item highlighted in blue (the results of running J48 on the data set), you get several options.

One of them is "Visualize tree".

This screen shot shows the result of taking that option.

70

Slide7171

Slide72II. Case 2. This Needs to Be a Clustering Algorithm

Name of Algorithm: SimpleKMeans

72

Slide73i. Output Results

***What this shows: This shows the results of the SimpleKMeans clustering algorithm with the edible/inedible attribute ignored.

The results compare the clusters/classifications with the ignored attribute.

The algorithm finds 2 clusters based on the remaining attributes.

73

Slide7474

Slide75ii. Explanation of Item

At the bottom of the screen shot there is an item, "Incorrectly clustered instances". 37.6% of the clustered instances don't fall into the desired edible/inedible category. The algorithm finds 2 clusters, but these 2 clusters don't agree with the 2 classifications of the attribute that was ignored.

75
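The classes-to-clusters evaluation described here can be sketched in miniature: cluster some points with the labels hidden, then count how many instances disagree with their cluster's majority label. The 1-D points and labels below are invented, and a plain Lloyd's-algorithm loop stands in for Weka's SimpleKMeans.

```python
from collections import Counter

# Toy classes-to-clusters evaluation with k=2 on invented 1-D data.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["e", "e", "e", "p", "p", "p"]  # hidden during clustering

centers = [0.0, 6.0]  # fixed starting centers for reproducibility
for _ in range(10):   # Lloyd's algorithm; converges immediately here
    groups = [[], []]
    for x in points:
        groups[0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1].append(x)
    centers = [sum(g) / len(g) for g in groups]

assignment = [0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
              for x in points]

# classes-to-clusters: map each cluster to its majority label;
# everything else counts as "incorrectly clustered"
mismatch = 0
for c in (0, 1):
    in_cluster = [l for l, a in zip(labels, assignment) if a == c]
    mismatch += len(in_cluster) - Counter(in_cluster).most_common(1)[0][1]
print(mismatch, "incorrectly clustered instances")
```

On this toy data the clusters line up perfectly with the hidden labels; on the mushroom data, as the slide notes, 37.6% of instances did not.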

Slide76iii. Graphical or Other Special Purpose Additional Output

***What this shows: Going back to the previous screen shot, if you right-click on the item highlighted in blue (the results of running SimpleKMeans on the data set), you get several options.

One of them is "Visualize cluster assignments".

76

Slide77This screen shot shows the result of taking that option.

Since it isn't possible to visualize the clusters in n-dimensional space, the screen provides the option of picking which individual attribute to visualize.

77

Slide78This screen shows the instances in order by number along the x-axis.

The y-axis shows the cluster placements for the different values for the cap-shape attribute. The drop down box allows you to change what the axes represent.

78

Slide7979

Slide80II. Case 3. This Needs to Be an Association Mining Algorithm

Name of Algorithm: Apriori

80

Slide81i. Output Results

***What this shows: This shows the results of the Apriori association rule mining algorithm.

81

Slide8282

Slide83ii. Explanation of Item

Various relevant parameters are shown on the screen shot. The system defaults to a minimum support level of .95 and a minimum confidence level of .9. The system lists the 10 best rules found.

The first 9 have confidence levels of 1.

On the one hand, this is good.

83

Slide84From a practical point of view, what this tends to suggest is that the data are effectively redundant.

Just to take the first rule for example, if you know the color of the veil, you know the type of the veil. The 10th rule provides an interesting reverse insight into this.

It tells you that if you know the type, you only know the color with .98 confidence.

84
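Support and confidence, the two parameters discussed here, are easy to compute directly. The sketch below uses invented transactions over the veil attributes; the resulting confidence values (1.0 one way, 0.75 the other) only mimic the pattern in the Weka output (1 vs. .98), they are not the actual numbers.

```python
# Hand-rolled support and confidence (not Weka's Apriori itself).
# Each transaction is the set of attribute=value pairs for one instance.
transactions = [
    {"veil-color=w", "veil-type=p"},
    {"veil-color=w", "veil-type=p"},
    {"veil-color=w", "veil-type=p"},
    {"veil-color=n", "veil-type=p"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of the rule lhs => rhs."""
    return support(lhs | rhs) / support(lhs)

print(confidence({"veil-color=w"}, {"veil-type=p"}))  # 1.0: color determines type
print(confidence({"veil-type=p"}, {"veil-color=w"}))  # 0.75: reverse is weaker
```

This is the same asymmetry the slide points out: a rule and its reverse generally have different confidence.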

Slide85iii. Graphical or Other Special Purpose Additional Output

There don't appear to be any other output options for association rules. There is no standard visualization for them so nothing is included for this point.

85

Slide86II. Case 4. Any Kind of Algorithm

Name of Algorithm: ADTree

86

Slide87i. Output Results

***What this shows: These are the results of running the ADTree classification algorithm.

I haven't bothered to scroll up and show the ASCII representation of the tree.

Instead, I've just shown the critical output at the bottom.

87

Slide8888

Slide89ii. Explanation of Item

There are two items I'd like to highlight:

a. Notice that this tree generation algorithm didn't get 100% classified correctly. If I'm reading the data correctly, there were 8 false positives on the attribute of interest, which is named Edible.

This is not good.

89

Slide90False negatives deprive you of a tasty gustatory and culinary experience.

False positives deprive you of your health or your life. I point this out in contrast to the J48 results given above.

90

Slide91b. Notice that the time taken to build the model was .73 seconds.

This is about 10 times slower than J48, but I'm mainly interested in comparing with the following algorithm.

91

Slide92iii. Graphical or Other Special Purpose Additional Output

***What this shows: This is the visualization of the tree. There are other graphical options, but they are difficult to interpret for the mushroom data set, so this is given for comparison with the J48 tree.

92

Slide9393

Slide94II. Case 5. Any Kind of Algorithm

Name of Algorithm: BFTree

94

Slide95i. Output Results

***What this shows: This shows the results of using the BFTree classification algorithm.

95

Slide9696

Slide97ii. Explanation of Item

This algorithm also doesn't give a tree that classifies with 100% accuracy. It gives the same kind of error as the ADTree, although there are 3 fewer.

97

Slide98The additional item I'd like to highlight is that the time taken to build the model was 12.42 seconds.

As a matter of fact, that information came out first and then additional, significant amounts of time were taken to run through each fold of the data. This was quite time consuming compared to the other trees produced so far.

98

Slide99iii. Graphical or Other Special Purpose Additional Output

***What this shows: This screen shot is the result of taking the "Visualize classifier errors" option on the results of the algorithm. I believe what this screen illustrates is a decision point in the tree on the cap-surface attribute.

99

Slide100In one of the cases, symbolized by the blue rectangle, an incorrect classification is made on this basis while 7 other instances classify correctly based on this attribute.

100

Slide101101

Slide102II. Case 6. Any Kind of Algorithm

Name of Algorithm: Naïve Bayes

102

Slide103i. Output Results

***What this shows: This screen shot shows the bottom of the output for the Naïve Bayes classification algorithm.

The upper part of the output shows conditional probability counts for all of the attributes in the data.

103

Slide104If the cost of an error weren't so high, this algorithm by itself would do OK.

Its time cost is only .03 seconds and it achieves 95.8% correct classification.

104

Slide105105

Slide106ii. Explanation of Item

I'm running out of items to highlight which are particularly meaningful for the example in question. Notice that the output includes the Mean absolute error, the Root mean squared error, the Relative absolute error, and the Root relative squared error.

106

Slide107These differ in magnitude because of the way they're calculated, but they are all indicators of the same general thing.

As pointed out in the book, when comparing two different data mining approaches, if you compare the same measure for both, you will tend to have a valid comparison regardless of which of the measures you use.

107
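For concreteness, here is how two of those measures are computed from predicted class probabilities. The actual/predicted values below are invented, and the 0/1 coding of the class is an assumption, not Weka's internal representation.

```python
import math

# Mean absolute error and root mean squared error from invented predictions.
actual = [1.0, 1.0, 0.0, 1.0]  # assumed coding: 1 = edible, 0 = poisonous
pred   = [0.9, 0.8, 0.3, 1.0]  # predicted probability of "edible"

errors = [abs(a - p) for a, p in zip(actual, pred)]
mae  = sum(errors) / len(errors)                            # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean squared error
print(round(mae, 3), round(rmse, 3))
```

Because squaring weights the large errors more heavily, RMSE is always at least as large as MAE, which is one source of the difference in magnitude the slide mentions.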

Slide108iii. Graphical or Other Special Purpose Additional Output

***What this shows: Two graphical output screen shots are given below. They show a cost-benefit analysis.

Such an analysis is more appropriate to something like direct mailing, but it is possible to illustrate something by changing one of the parameters in the display.

108

Slide109Both screen shots show a threshold curve and a cost-benefit curve where the button to minimize cost/benefit has been clicked.

In the first screen shot the costs of FP and FN are equal, at 1. In the second, the cost of a false positive has been raised to 1,000.

Notice how the shape of the curve changes.

109

Slide110Roughly speaking, I would interpret the second screenshot to mean that you have effectively no costs as long as you are correctly predicting TP, but your cost rises linearly with the increasing probability of FP predictions later in the data set.

110
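The cost idea behind these curves reduces to a one-line formula: total cost = (FP count × FP unit cost) + (FN count × FN unit cost). A sketch with hypothetical error counts, not taken from the Weka screen shots:

```python
# Total misclassification cost for given error counts and unit costs.
def total_cost(fp, fn, c_fp=1.0, c_fn=1.0):
    """fp/fn = error counts; c_fp/c_fn = cost per error of each kind."""
    return fp * c_fp + fn * c_fn

print(total_cost(8, 3))             # equal unit costs
print(total_cost(8, 3, c_fp=1000))  # false positives made 1,000x costlier
```

Raising the false-positive cost from 1 to 1,000 is exactly the parameter change made between the two screen shots, and it is why the curve's shape changes so drastically.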

Slide111111

Slide112112

Slide113II. Case 7. Any Kind of Algorithm

Name of Algorithm: BayesNet

113

Slide114i. Output Results

***What this shows: This screen shot shows the results of the BayesNet classification algorithm.

114

Slide115115

Slide116ii. Explanation of Item

This is not a new item to explain, but it is an observation related to the values and results previously obtained. The association rule mining algorithm seemed to suggest that there were heavy dependencies among some of the attributes in the data set.

BayesNet is supposed to take these into account, while Naïve Bayes does not.

116

Slide117However, when you compare the rate of correctly classified instances, here you get 96.2% vs. 95.8% for Naïve Bayes. It seems fair to ask what difference it really made to include the dependencies in the analysis.

117

Slide118iii. Graphical or Other Special Purpose Additional Output

***What this shows: This shows the result of taking the "Visualize cost curve" option on the results of the data mining. Honestly, I've about reached the limit of what I understand without further research.

I present this here without further explanation.

118

Slide119This is one of the reasons I advertise this sample project write-up as an example of a B, rather than an A effort.

Everything that has been asked for is included, but in this point, for example, the explanation isn't complete. It sure is pretty though…

119

Slide120120

Slide121II. Case 8. Any Kind of Algorithm

Name of Algorithm: RIDOR

121

Slide122i. Output Results

***What this shows: This screen shows the results of applying the RIDOR algorithm to the data set. RIDOR was the technique based on rules and exceptions.

Look at the top of the output.

Here you see clearly that the default classification is edible, with exceptions listed underneath.

122

Slide123Philosophically, this goes against my point of view on mushrooms.

The logical default should be inedible, but there are more edible mushrooms in the data set than inedible. So it goes.

123

Slide124124

Slide125ii. Explanation of Item

The last set of items that appears in these output screens is the Precision, Recall, F-Measure, and ROC values. This is probably not the best example for illustrating what they mean. It's apparent that things like recall would be better suited to document retrieval, for example.

125

Slide126Maybe the best illustration that they don't really apply is that they are all 1 or .999.

On the other hand, maybe that's realistic for a classification scheme that gives 99.95% correct results.

126
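For reference, here is how Precision, Recall, and F-Measure are derived from confusion-matrix counts. The counts below are invented for illustration; they are not the RIDOR output.

```python
# Precision, Recall, and F-Measure from invented confusion-matrix counts.
tp, fp, fn = 95, 1, 2

precision = tp / (tp + fp)                                   # of predicted positives, how many were right
recall    = tp / (tp + fn)                                   # of actual positives, how many were found
f_measure = 2 * precision * recall / (precision + recall)    # harmonic mean of the two
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```

On a classifier with almost no errors, all three collapse toward 1, which matches the slide's observation that the RIDOR values are all 1 or .999.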

Slide127iii. Graphical or Other Special Purpose Additional Output

Once again, the fact that this is a "B" example rather than an "A" example comes into play. I'm not showing a new bit of graphical output. I'm showing the cost curve, like for the previous data mining algorithm.

127

Slide128The main reason for choosing to show it again is that this picture looks so much like the simple picture in the text that they used to illustrate some of the cost concepts graphically.

128

Slide129129

Slide130III. Choosing the Best Algorithm Among the Results

Depending on the problem domain and your level of ambition, you might compare algorithms on the basis of lift charts, cost curves, and so on. For simple classification, the tools will give results showing the percent classified correctly and the percent classified incorrectly.

130

Slide131It would be natural to simply choose the one with the highest percent classified incorrectly.

However, this is not good enough for credit on this item. I have chosen to illustrate what you need to do with a simple basic example.

131

Slide132I consider the two classification algorithms that gave the highest percent classified correctly.

I then apply the paired t-test to see whether or not there is actually a statistically significant difference between them.

132

Slide133If there is, that's the correct basis for preferring one over the other.

For the purposes of illustration, I do this by hand and explain what I'm doing. You may find tools that allow you to make a valid comparison of results.

That's OK, as long as you explain.

133

Slide134The point simply is that it's not sufficient to just list a bunch of percents and pick the highest one.

Illustrate the use of some advanced technique, whether involving concepts like lift charts or cost curves or statistics.

You may also have noticed that Weka tells you the run time for doing an analysis.

134

Slide135When making a decision about which algorithm is the best, at a minimum take into account an advanced comparison of the two apparent best, and you may want to make an observation about the apparent complexity or time cost of the algorithms.

135

Slide136III.A. Random Babbling

The concept of "Cost of classification" seems relevant to this example. It takes a human expert to tell if a mushroom is poisonous. If you're not an expert, you can tell by eating a mushroom and seeing what happens.

136

Slide137The cost of finding out that the mushroom is poisonous is about as high as it gets.

I guess if you're truly dedicated, you'd be willing to die for science. Directly related to this is the cost of a misclassification.

It seems to be on the infinite side…

137

Slide138The J48 tree approach, given first, even though it's apparently been pruned, still classifies 100% correctly.

This seems to be at odds with claims made at various points that you don't want a perfect classifier because it will tend to be overtrained.

138

Slide139On the other hand, since the cost of a misclassification is so high, maybe it would be best to bias the training.

Lots of false "It's poisonous" results would be desirable. I remember learning this rule from my parents:

Don't eat any wild mushrooms.

139

Slide140It's also interesting to compare with the commentary provided at the beginning.

"Experts" who have examined the data wanted to get a minimal rule set. They apparently considered that a success. But they were willing to live with errors.

I'm not sure living with errors is consistent with this data set.

140

Slide141III.B. An Application of the Paired t-test

Pick any two of your results above, identify them and the success rate values they gave, and compare them using the paired t-test. Give a statistically valid statement that tells whether or not the two cases you're comparing are significantly different.

141

Slide142What is shown is my attempt to interpret and apply what the book says about the paired t-test.

I do not claim that I have necessarily done this correctly. Students who have recently taken statistics may reach different conclusions about how this is done.

142

Slide143However, I have gone through the motions.

To get credit for this section, you should do the same, whether following my example or following your own understanding.

143

Slide144I have chosen to compare the percent of correct classifications by Naïve Bayes (NB) and BayesNet (BN) given above.

144

Slide145Taken from Weka results:

NB sample mean = 95.8272%
NB root mean squared error = .1757

Squaring the value above:

NB mean squared error = .03087049

145

Slide146Taken from Weka results:

BN sample mean = 96.2211%
BN root mean squared error = .1639

Squaring the value above:

BN mean squared error = .02686321

146

Slide147This is my estimate of the standard deviation of the t statistic, where the divisor is 10 because I opted for the default 10-fold cross-validation in Weka:

Estimate of paired root mean squared error (EPRMSE)
= square root((NB mean squared error / 10) + (BN mean squared error / 10))
= .075982695

147

Slide148t statistic
= (NB sample mean – BN sample mean) / EPRMSE
= 5.184 (in absolute value)

148

Slide149The book says this is a two-tailed test.

For a 99% confidence interval I want to use a threshold of .5% in each tail. The book's table gives a value of 3.25.

149

Slide150The computed value, 5.184 is greater than the table value of 3.25.

This means you reject the null hypothesis that the means of the two distributions are the same.

150
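The arithmetic on the last few slides can be re-done in a few lines of code, using the same numbers taken from the Weka results:

```python
import math

# The slides' paired t-test arithmetic, verbatim.
nb_mean, nb_rmse = 95.8272, 0.1757
bn_mean, bn_rmse = 96.2211, 0.1639
k = 10  # 10-fold cross-validation

eprmse = math.sqrt(nb_rmse**2 / k + bn_rmse**2 / k)
t = abs(nb_mean - bn_mean) / eprmse
print(round(eprmse, 6), round(t, 3))

# two-tailed test at the 1% level, against the book's critical value
assert t > 3.25, "would fail to reject the null hypothesis"
```

Running this reproduces the EPRMSE of roughly .075983 and the t statistic of 5.184, confirming the rejection of the null hypothesis.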

Slide151In other words, you conclude that there is a statistically significant difference between the percent of correct classifications resulting from the Naïve Bayes and the Bayesian Network algorithms on the mushroom data.

151

Slide152The End

152
