
Slide1

SEEM4630 2015-2016 Tutorial 1
Classification: Decision Tree

Siyuan Zhang, syzhang@se.cuhk.edu.hk

Slide2

Classification: Definition

Given a collection of records (the training set), each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes, e.g.:
- Decision tree
- Naïve Bayes
- k-NN

Goal: previously unseen records should be assigned a class as accurately as possible.

Slide3

Decision Tree

Goal: construct a tree so that instances belonging to different classes are separated.

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down recursive manner.
- At the start, all the training examples are at the root.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
- Examples are partitioned recursively based on the selected attributes.
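As a quick sketch (not from the original slides), the greedy procedure above can be written as one recursive function. `best_attribute` is a hypothetical hook for whichever selection measure is plugged in; the three measures on the next slides all fit this shape.

```python
def build_tree(records, attributes, best_attribute):
    """Greedy top-down induction; records are dicts with a 'class' key."""
    labels = [r["class"] for r in records]
    if len(set(labels)) == 1:                  # pure node: make a leaf
        return labels[0]
    if not attributes:                         # no attribute left: majority vote
        return max(set(labels), key=labels.count)
    a = best_attribute(records, attributes)    # greedy choice, never revisited
    tree = {a: {}}
    for v in set(r[a] for r in records):       # one branch per observed value
        subset = [r for r in records if r[a] == v]
        remaining = [x for x in attributes if x != a]
        tree[a][v] = build_tree(subset, remaining, best_attribute)
    return tree
```

For example, with the information-gain helper sketched under the next slide, `best_attribute` could be `lambda recs, attrs: max(attrs, key=lambda a: info_gain(recs, a))`.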

Slide4

Attribute Selection Measure 1: Information Gain

Let $p_i$ be the probability that a tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.

Expected information (entropy) needed to classify a tuple in D:

$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

Information needed (after using A to split D into v partitions) to classify D:

$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$

Information gained by branching on attribute A:

$Gain(A) = Info(D) - Info_A(D)$
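A direct transcription of these formulas into Python might look as follows (a sketch; the helper names are mine, and records are assumed to be dicts carrying a 'class' key):

```python
from collections import Counter
from math import log2

def entropy(records):
    # Info(D) = -sum_i p_i * log2(p_i), with p_i estimated from class counts
    n = len(records)
    return -sum(c / n * log2(c / n)
                for c in Counter(r["class"] for r in records).values())

def info_gain(records, attr):
    # Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j)
    n = len(records)
    info_a = 0.0
    for v in set(r[attr] for r in records):    # the v partitions induced by A
        subset = [r for r in records if r[attr] == v]
        info_a += len(subset) / n * entropy(subset)
    return entropy(records) - info_a
```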

Slide5

Attribute Selection Measure 2: Gain Ratio

The information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain):

$GainRatio(A) = Gain(A) / SplitInfo_A(D)$

where

$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)$
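Continuing the same sketch, the gain ratio only needs SplitInfo as an extra ingredient (this reuses `info_gain` from the previous snippet; note the ratio is undefined for an attribute with a single value):

```python
from collections import Counter
from math import log2

def split_info(records, attr):
    # SplitInfo_A(D) = -sum_j |D_j|/|D| * log2(|D_j|/|D|)
    n = len(records)
    return -sum(c / n * log2(c / n)
                for c in Counter(r[attr] for r in records).values())

def gain_ratio(records, attr):
    # GainRatio(A) = Gain(A) / SplitInfo_A(D)
    return info_gain(records, attr) / split_info(records, attr)
```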

Slide6

Attribute Selection Measure 3: Gini Index

If a data set D contains examples from n classes, the gini index, gini(D), is defined as:

$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$

where $p_j$ is the relative frequency of class j in D.

If a data set D is split on A into two subsets $D_1$ and $D_2$, the gini index $gini_A(D)$ is defined as:

$gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$

Reduction in impurity:

$\Delta gini(A) = gini(D) - gini_A(D)$
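The Gini measures translate just as directly. In this sketch the two subsets are formed by an `attr == value` test, one common way to realize the slide's binary split:

```python
from collections import Counter

def gini(records):
    # gini(D) = 1 - sum_j p_j^2
    n = len(records)
    if n == 0:                                 # empty partition: no impurity
        return 0.0
    return 1.0 - sum((c / n) ** 2
                     for c in Counter(r["class"] for r in records).values())

def gini_split(records, attr, value):
    # gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)
    d1 = [r for r in records if r[attr] == value]
    d2 = [r for r in records if r[attr] != value]
    n = len(records)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def gini_reduction(records, attr, value):
    # Reduction in impurity: gini(D) - gini_A(D)
    return gini(records) - gini_split(records, attr, value)
```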

Slide7

Example

| Outlook  | Temperature | Humidity | Wind   | Play Tennis |
|----------|-------------|----------|--------|-------------|
| Sunny    | >25         | High     | Weak   | No          |
| Sunny    | >25         | High     | Strong | No          |
| Overcast | >25         | High     | Weak   | Yes         |
| Rain     | 15-25       | High     | Weak   | Yes         |
| Rain     | <15         | Normal   | Weak   | Yes         |
| Rain     | <15         | Normal   | Strong | No          |
| Overcast | <15         | Normal   | Strong | Yes         |
| Sunny    | 15-25       | High     | Weak   | No          |
| Sunny    | <15         | Normal   | Weak   | Yes         |
| Rain     | 15-25       | Normal   | Weak   | Yes         |
| Sunny    | 15-25       | Normal   | Strong | Yes         |
| Overcast | 15-25       | High     | Strong | Yes         |
| Overcast | >25         | Normal   | Weak   | Yes         |
| Rain     | 15-25       | High     | Strong | No          |
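To experiment with the measures above, the table can be typed in directly (my encoding, using the slide's column names; the `class` key holds the Play Tennis label):

```python
COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "class"]
ROWS = [
    ("Sunny",    ">25",   "High",   "Weak",   "No"),
    ("Sunny",    ">25",   "High",   "Strong", "No"),
    ("Overcast", ">25",   "High",   "Weak",   "Yes"),
    ("Rain",     "15-25", "High",   "Weak",   "Yes"),
    ("Rain",     "<15",   "Normal", "Weak",   "Yes"),
    ("Rain",     "<15",   "Normal", "Strong", "No"),
    ("Overcast", "<15",   "Normal", "Strong", "Yes"),
    ("Sunny",    "15-25", "High",   "Weak",   "No"),
    ("Sunny",    "<15",   "Normal", "Weak",   "Yes"),
    ("Rain",     "15-25", "Normal", "Weak",   "Yes"),
    ("Sunny",    "15-25", "Normal", "Strong", "Yes"),
    ("Overcast", "15-25", "High",   "Strong", "Yes"),
    ("Overcast", ">25",   "Normal", "Weak",   "Yes"),
    ("Rain",     "15-25", "High",   "Strong", "No"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]
```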

Slide8

Tree induction example

Entropy of data S:

Info(S) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94

Split data by attribute Outlook: S[9+, 5-] → Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]

Gain(Outlook) = 0.94 - 5/14[-2/5 log2(2/5) - 3/5 log2(3/5)]
                     - 4/14[-4/4 log2(4/4) - 0/4 log2(0/4)]
                     - 5/14[-3/5 log2(3/5) - 2/5 log2(2/5)]
              = 0.94 - 0.69 = 0.25
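Both numbers on this slide can be reproduced with the earlier sketches (this assumes `DATA` and `info_gain` from the snippets above):

```python
from math import log2

info_s = -(9 / 14) * log2(9 / 14) - (5 / 14) * log2(5 / 14)
print(round(info_s, 2))                       # 0.94
print(round(info_gain(DATA, "Outlook"), 2))   # 0.25
```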

Slide9

Tree induction example

Split data by attribute Temperature: S[9+, 5-] → <15 [3+, 1-], 15-25 [4+, 2-], >25 [2+, 2-]

Gain(Temperature) = 0.94 - 4/14[-3/4 log2(3/4) - 1/4 log2(1/4)]
                         - 6/14[-4/6 log2(4/6) - 2/6 log2(2/6)]
                         - 4/14[-2/4 log2(2/4) - 2/4 log2(2/4)]
                  = 0.94 - 0.91 = 0.03

Slide10

Tree induction example

Split data by attribute Humidity: S[9+, 5-] → High [3+, 4-], Normal [6+, 1-]

Gain(Humidity) = 0.94 - 7/14[-3/7 log2(3/7) - 4/7 log2(4/7)]
                      - 7/14[-6/7 log2(6/7) - 1/7 log2(1/7)]
               = 0.94 - 0.79 = 0.15

Split data by attribute Wind: S[9+, 5-] → Weak [6+, 2-], Strong [3+, 3-]

Gain(Wind) = 0.94 - 8/14[-6/8 log2(6/8) - 2/8 log2(2/8)]
                  - 6/14[-3/6 log2(3/6) - 3/6 log2(3/6)]
           = 0.94 - 0.89 = 0.05
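Putting slides 8-10 together, one loop compares all four candidate splits at the root (again assuming `DATA` and `info_gain` from the earlier sketches):

```python
for attr in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(attr, round(info_gain(DATA, attr), 2))
# Outlook 0.25, Temperature 0.03, Humidity 0.15, Wind 0.05
```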

Slide11

Tree induction example

Gain(Outlook) = 0.25
Gain(Temperature) = 0.03
Gain(Humidity) = 0.15
Gain(Wind) = 0.05

Outlook gives the largest information gain, so it becomes the root of the tree:

Outlook
  Sunny → ??
  Overcast → Yes
  Rain → ??

(The slide repeats the training data table from Slide 7 alongside the tree.)

Slide12

Entropy of branch Sunny:

Info(Sunny) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.97

Split Sunny branch by attribute Temperature: Sunny[2+, 3-] → <15 [1+, 0-], 15-25 [1+, 1-], >25 [0+, 2-]

Gain(Temperature) = 0.97 - 1/5[-1/1 log2(1/1) - 0/1 log2(0/1)]
                         - 2/5[-1/2 log2(1/2) - 1/2 log2(1/2)]
                         - 2/5[-0/2 log2(0/2) - 2/2 log2(2/2)]
                  = 0.97 - 0.4 = 0.57

Split Sunny branch by attribute Humidity: Sunny[2+, 3-] → High [0+, 3-], Normal [2+, 0-]

Gain(Humidity) = 0.97 - 3/5[-0/3 log2(0/3) - 3/3 log2(3/3)]
                      - 2/5[-2/2 log2(2/2) - 0/2 log2(0/2)]
               = 0.97 - 0 = 0.97

Split Sunny branch by attribute Wind: Sunny[2+, 3-] → Weak [1+, 2-], Strong [1+, 1-]

Gain(Wind) = 0.97 - 3/5[-1/3 log2(1/3) - 2/3 log2(2/3)]
                  - 2/5[-1/2 log2(1/2) - 1/2 log2(1/2)]
           = 0.97 - 0.95 = 0.02
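The same comparison restricted to the Sunny branch (assuming `DATA` and `info_gain` from the earlier sketches):

```python
sunny = [r for r in DATA if r["Outlook"] == "Sunny"]
for attr in ["Temperature", "Humidity", "Wind"]:
    print(attr, round(info_gain(sunny, attr), 2))
# Temperature 0.57, Humidity 0.97, Wind 0.02 -> split on Humidity
```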

Slide13

Tree induction example

On the Sunny branch, Humidity gives the largest gain, so that branch is split on Humidity:

Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → ??

Slide14

Entropy of branch Rain:

Info(Rain) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.97

Split Rain branch by attribute Temperature: Rain[3+, 2-] → <15 [1+, 1-], 15-25 [2+, 1-], >25 [0+, 0-]

Gain(Temperature) = 0.97 - 2/5[-1/2 log2(1/2) - 1/2 log2(1/2)]
                         - 3/5[-2/3 log2(2/3) - 1/3 log2(1/3)]
                  = 0.97 - 0.95 = 0.02
(the empty >25 partition contributes nothing)

Split Rain branch by attribute Humidity: Rain[3+, 2-] → High [1+, 1-], Normal [2+, 1-]

Gain(Humidity) = 0.97 - 2/5[-1/2 log2(1/2) - 1/2 log2(1/2)]
                      - 3/5[-2/3 log2(2/3) - 1/3 log2(1/3)]
               = 0.97 - 0.95 = 0.02

Split Rain branch by attribute Wind: Rain[3+, 2-] → Weak [3+, 0-], Strong [0+, 2-]

Gain(Wind) = 0.97 - 3/5[-3/3 log2(3/3) - 0/3 log2(0/3)]
                  - 2/5[-0/2 log2(0/2) - 2/2 log2(2/2)]
           = 0.97 - 0 = 0.97
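And restricted to the Rain branch (same assumptions as the previous snippets):

```python
rain = [r for r in DATA if r["Outlook"] == "Rain"]
for attr in ["Temperature", "Humidity", "Wind"]:
    print(attr, round(info_gain(rain, attr), 2))
# Temperature 0.02, Humidity 0.02, Wind 0.97 -> split on Wind
```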

Slide15

Tree induction example

On the Rain branch, Wind gives the largest gain, which completes the tree:

Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
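The finished tree can be written down as a nested dict and used to label new records (a sketch; `classify` simply walks down until it reaches a leaf):

```python
TREE = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind":     {"Strong": "No", "Weak": "Yes"}},
}}

def classify(tree, record):
    while isinstance(tree, dict):      # descend until we hit a leaf label
        attr = next(iter(tree))        # the attribute tested at this node
        tree = tree[attr][record[attr]]
    return tree

print(classify(TREE, {"Outlook": "Rain", "Wind": "Weak"}))  # Yes
```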

Slide16

Thank you & Questions?