Slide1
SEEM4630 2015-2016 Tutorial 1 Classification:Decision tree
Siyuan Zhang, syzhang@se.cuhk.edu.hk
Slide2
Classification: Definition
Given a collection of records (the training set), each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes. Examples of such models:
- Decision tree
- Naïve Bayes
- k-NN
Goal: previously unseen records should be assigned a class as accurately as possible.
Slide3
Decision Tree
Goal: construct a tree so that instances belonging to different classes are separated.
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down recursive manner.
- At the start, all the training examples are at the root.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
- Examples are partitioned recursively based on the selected attributes.
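Slide4
Attribute Selection Measure 1: Information Gain
Let p_i be the probability that a tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|.
Expected information (entropy) needed to classify a tuple in D, where m is the number of classes:
$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
Information needed (after using A to split D into v partitions) to classify D:
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)$$
Information gained by branching on attribute A:
$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$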
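The definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original tutorial: the function names `entropy` and `info_gain` and the small two-class example are assumptions, and 0·log2(0) is treated as 0.

```python
from math import log2

def entropy(counts):
    """Info(D) for a list of class counts, treating 0*log2(0) as 0."""
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, partitions):
    """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j).

    `partitions` holds one list of class counts per value of attribute A.
    """
    total = sum(parent_counts)
    info_a = sum(sum(p) / total * entropy(p) for p in partitions if sum(p) > 0)
    return entropy(parent_counts) - info_a

# Hypothetical example: 4 positives and 4 negatives, split by some
# attribute into a pure partition [3+, 0-] and a mixed one [1+, 4-].
print(round(entropy([4, 4]), 3))
print(round(info_gain([4, 4], [[3, 0], [1, 4]]), 3))
```

Working from class counts rather than raw records keeps the sketch close to the [p+, n-] notation used in the worked example later in the tutorial.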
Slide5
Attribute Selection Measure 2: Gain Ratio
The information gain measure is biased towards attributes with a large number of values.
C4.5 (a successor of ID3) uses gain ratio to overcome this problem (a normalization of information gain):
GainRatio(A) = Gain(A)/SplitInfo(A)
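A rough sketch of the gain ratio follows. The slide does not spell out SplitInfo; the code assumes the usual C4.5 definition, SplitInfo(A) = -Σ_j (|D_j|/|D|) log2(|D_j|/|D|), which is the entropy of the partition sizes. The function names are illustrative, and the counts are those of the Outlook split from the tree induction example.

```python
from math import log2

def entropy(counts):
    """Entropy of a list of counts, treating 0*log2(0) as 0."""
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, partitions):
    """GainRatio(A) = Gain(A) / SplitInfo(A).

    Assumption: SplitInfo(A) is the entropy of the partition sizes,
    as in the usual C4.5 formulation (not stated on the slide).
    """
    total = sum(parent_counts)
    sizes = [sum(p) for p in partitions]
    info_a = sum(n / total * entropy(p) for n, p in zip(sizes, partitions) if n > 0)
    gain = entropy(parent_counts) - info_a
    split_info = entropy(sizes)
    # Guard against a split that puts every record into one partition.
    return gain / split_info if split_info > 0 else 0.0

# Outlook split: Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]
print(round(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))
```

An attribute with many small partitions has a large SplitInfo, so its gain is scaled down, which is the normalization the slide refers to.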
Slide6
Attribute Selection Measure 3: Gini index
If a data set D contains examples from n classes, the gini index gini(D) is defined as:
$$\mathrm{gini}(D) = 1 - \sum_{j=1}^{n} p_j^2$$
where p_j is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the gini index of the split, gini_A(D), is defined as:
$$\mathrm{gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{gini}(D_2)$$
Reduction in impurity:
$$\Delta\mathrm{gini}(A) = \mathrm{gini}(D) - \mathrm{gini}_A(D)$$
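The gini index and the reduction in impurity for a binary split can be sketched as below. The function names are assumptions made for illustration, and the example counts are taken from the Humidity split in the tree induction example (High [3+,4-], Normal [6+,1-]); the tutorial itself does not compute gini values. The sketch assumes non-empty count lists.

```python
def gini(counts):
    """gini(D) = 1 - sum_j p_j^2 for a non-empty list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(d1, d2):
    """Weighted gini index of a binary split into D1 and D2."""
    n1, n2 = sum(d1), sum(d2)
    return (n1 * gini(d1) + n2 * gini(d2)) / (n1 + n2)

def gini_reduction(parent, d1, d2):
    """Reduction in impurity: gini(D) - gini_A(D)."""
    return gini(parent) - gini_split(d1, d2)

# S [9+,5-] split by Humidity into High [3+,4-] and Normal [6+,1-]
print(round(gini([9, 5]), 3))
print(round(gini_reduction([9, 5], [3, 4], [6, 1]), 3))
```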
Slide7
Example
Outlook   Temperature  Humidity  Wind    Play Tennis
Sunny     >25          High      Weak    No
Sunny     >25          High      Strong  No
Overcast  >25          High      Weak    Yes
Rain      15-25        High      Weak    Yes
Rain      <15          Normal    Weak    Yes
Rain      <15          Normal    Strong  No
Overcast  <15          Normal    Strong  Yes
Sunny     15-25        High      Weak    No
Sunny     <15          Normal    Weak    Yes
Rain      15-25        Normal    Weak    Yes
Sunny     15-25        Normal    Strong  Yes
Overcast  15-25        High      Strong  Yes
Overcast  >25          Normal    Weak    Yes
Rain      15-25        High      Strong  No
Slide8
Tree induction example
Entropy of data S:
Info(S) = -9/14(log2(9/14)) - 5/14(log2(5/14)) = 0.94
Split data by attribute Outlook:
S[9+, 5-] Outlook
  Sunny [2+,3-]
  Overcast [4+,0-]
  Rain [3+,2-]
Gain(Outlook) = 0.94 – 5/14[-2/5(log2(2/5)) - 3/5(log2(3/5))]
                     – 4/14[-4/4(log2(4/4)) - 0/4(log2(0/4))]
                     – 5/14[-3/5(log2(3/5)) - 2/5(log2(2/5))]
              = 0.94 – 0.69 = 0.25
(A term of the form 0·log2(0) is taken as 0.)
Slide9
Tree induction example
Split data by attribute Temperature:
S[9+, 5-] Temperature
  <15 [3+,1-]
  15-25 [4+,2-]
  >25 [2+,2-]
Gain(Temperature) = 0.94 – 4/14[-3/4(log2(3/4)) - 1/4(log2(1/4))]
                         – 6/14[-4/6(log2(4/6)) - 2/6(log2(2/6))]
                         – 4/14[-2/4(log2(2/4)) - 2/4(log2(2/4))]
                  = 0.94 – 0.91 = 0.03
Slide10
Tree induction example
Split data by attribute Humidity:
S[9+, 5-] Humidity
  High [3+,4-]
  Normal [6+,1-]
Gain(Humidity) = 0.94 – 7/14[-3/7(log2(3/7)) - 4/7(log2(4/7))]
                      – 7/14[-6/7(log2(6/7)) - 1/7(log2(1/7))]
               = 0.94 – 0.79 = 0.15
Split data by attribute Wind:
S[9+, 5-] Wind
  Weak [6+,2-]
  Strong [3+,3-]
Gain(Wind) = 0.94 – 8/14[-6/8(log2(6/8)) - 2/8(log2(2/8))]
                  – 6/14[-3/6(log2(3/6)) - 3/6(log2(3/6))]
           = 0.94 – 0.89 = 0.05
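The four gains computed by hand for the root split can be checked with a short script. This is a sketch under stated assumptions: the tuple encoding of the training table and the names `ROWS`, `ATTRS`, `entropy` and `gain` are illustrative and not part of the tutorial.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
# Training table from the tutorial; the last field is the class (Play Tennis).
ROWS = [
    ("Sunny", ">25", "High", "Weak", "No"),
    ("Sunny", ">25", "High", "Strong", "No"),
    ("Overcast", ">25", "High", "Weak", "Yes"),
    ("Rain", "15-25", "High", "Weak", "Yes"),
    ("Rain", "<15", "Normal", "Weak", "Yes"),
    ("Rain", "<15", "Normal", "Strong", "No"),
    ("Overcast", "<15", "Normal", "Strong", "Yes"),
    ("Sunny", "15-25", "High", "Weak", "No"),
    ("Sunny", "<15", "Normal", "Weak", "Yes"),
    ("Rain", "15-25", "Normal", "Weak", "Yes"),
    ("Sunny", "15-25", "Normal", "Strong", "Yes"),
    ("Overcast", "15-25", "High", "Strong", "Yes"),
    ("Overcast", ">25", "Normal", "Weak", "Yes"),
    ("Rain", "15-25", "High", "Strong", "No"),
]

def entropy(labels):
    """Info(D) computed from a list of class labels."""
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    """Information gain of splitting `rows` on the attribute in column `col`."""
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    info_a = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[-1] for r in rows]) - info_a

gains = {a: gain(ROWS, i) for i, a in enumerate(ATTRS)}
best = max(gains, key=gains.get)
for a, g in gains.items():
    print(f"Gain({a}) = {g:.2f}")
print("Attribute with the largest gain:", best)
```

Rounded to two decimals, the script should agree with the hand-computed gains of 0.25, 0.03, 0.15 and 0.05.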
Slide11
Tree induction example
Gain(Outlook) = 0.25
Gain(Temperature) = 0.03
Gain(Humidity) = 0.15
Gain(Wind) = 0.05
Outlook has the largest gain, so it is selected as the root:
Outlook
  Sunny -> ??
  Overcast -> Yes
  Rain -> ??
(The training table from Slide7 is shown again on this slide.)
Slide12
Entropy of branch Sunny:
Info(Sunny) = -2/5(log2(2/5)) - 3/5(log2(3/5)) = 0.97
Split Sunny branch by attribute Temperature:
Sunny[2+,3-] Temperature
  <15 [1+,0-]
  15-25 [1+,1-]
  >25 [0+,2-]
Gain(Temperature) = 0.97 – 1/5[-1/1(log2(1/1)) - 0/1(log2(0/1))]
                         – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
                         – 2/5[-0/2(log2(0/2)) - 2/2(log2(2/2))]
                  = 0.97 – 0.4 = 0.57
Split Sunny branch by attribute Humidity:
Sunny[2+,3-] Humidity
  High [0+,3-]
  Normal [2+,0-]
Gain(Humidity) = 0.97 – 3/5[-0/3(log2(0/3)) - 3/3(log2(3/3))]
                      – 2/5[-2/2(log2(2/2)) - 0/2(log2(0/2))]
               = 0.97 – 0 = 0.97
Split Sunny branch by attribute Wind:
Sunny[2+,3-] Wind
  Weak [1+,2-]
  Strong [1+,1-]
Gain(Wind) = 0.97 – 3/5[-1/3(log2(1/3)) - 2/3(log2(2/3))]
                  – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
           = 0.97 – 0.95 = 0.02
Slide13
Tree induction example
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> ??
Slide14
Entropy of branch Rain:
Info(Rain) = -3/5(log2(3/5)) - 2/5(log2(2/5)) = 0.97
Split Rain branch by attribute Temperature:
Rain[3+,2-] Temperature
  <15 [1+,1-]
  15-25 [2+,1-]
  >25 [0+,0-]
Gain(Temperature) = 0.97 – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
                         – 3/5[-2/3(log2(2/3)) - 1/3(log2(1/3))]
                         – 0/5 · 0 (the >25 partition is empty, so its term is taken as 0)
                  = 0.97 – 0.95 = 0.02
Split Rain branch by attribute Humidity:
Rain[3+,2-] Humidity
  High [1+,1-]
  Normal [2+,1-]
Gain(Humidity) = 0.97 – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
                      – 3/5[-2/3(log2(2/3)) - 1/3(log2(1/3))]
               = 0.97 – 0.95 = 0.02
Split Rain branch by attribute Wind:
Rain[3+,2-] Wind
  Weak [3+,0-]
  Strong [0+,2-]
Gain(Wind) = 0.97 – 3/5[-3/3(log2(3/3)) - 0/3(log2(0/3))]
                  – 2/5[-0/2(log2(0/2)) - 2/2(log2(2/2))]
           = 0.97 – 0 = 0.97
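The Rain branch has an empty partition (Temperature >25), for which the entropy formula would evaluate 0/0. The sketch below shows one way to handle this, by giving an empty partition an entropy of 0 so that it contributes nothing to the weighted sum. The helper names and the (positive, negative) tuple encoding are assumptions for illustration.

```python
from math import log2

def entropy(pos, neg):
    """Two-class entropy from counts; empty partitions and 0*log2(0) give 0."""
    total = pos + neg
    if total == 0:
        return 0.0  # empty partition: avoid evaluating 0/0
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c > 0)

def gain(parent, partitions):
    """Information gain, given (pos, neg) counts for the parent and each partition."""
    n = sum(parent)
    info_a = sum((p + q) / n * entropy(p, q) for p, q in partitions)
    return entropy(*parent) - info_a

rain = (3, 2)  # Rain[3+,2-]
splits = {
    "Temperature": [(1, 1), (2, 1), (0, 0)],  # <15, 15-25, >25 (empty)
    "Humidity": [(1, 1), (2, 1)],             # High, Normal
    "Wind": [(3, 0), (0, 2)],                 # Weak, Strong
}
gains = {a: gain(rain, parts) for a, parts in splits.items()}
best = max(gains, key=gains.get)
for a, g in gains.items():
    print(f"Gain({a}) = {g:.2f}")
print("Attribute chosen for the Rain branch:", best)
```

Since Wind separates the Rain records perfectly, its gain equals the full branch entropy.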
Slide15
Tree induction example
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
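The whole induction can be reproduced with a small recursive sketch in the spirit of the greedy top-down algorithm described earlier. It is not the tutorial's own code: the names `build` and `classify`, the (attribute, branches) tuple representation of a node, and the majority-vote fallback when attributes run out are all assumptions made for illustration.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
# Training table from the tutorial; the last field is the class (Play Tennis).
ROWS = [
    ("Sunny", ">25", "High", "Weak", "No"),
    ("Sunny", ">25", "High", "Strong", "No"),
    ("Overcast", ">25", "High", "Weak", "Yes"),
    ("Rain", "15-25", "High", "Weak", "Yes"),
    ("Rain", "<15", "Normal", "Weak", "Yes"),
    ("Rain", "<15", "Normal", "Strong", "No"),
    ("Overcast", "<15", "Normal", "Strong", "Yes"),
    ("Sunny", "15-25", "High", "Weak", "No"),
    ("Sunny", "<15", "Normal", "Weak", "Yes"),
    ("Rain", "15-25", "Normal", "Weak", "Yes"),
    ("Sunny", "15-25", "Normal", "Strong", "Yes"),
    ("Overcast", "15-25", "High", "Strong", "Yes"),
    ("Overcast", ">25", "Normal", "Weak", "Yes"),
    ("Rain", "15-25", "High", "Strong", "No"),
]

def entropy(labels):
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    info_a = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[-1] for r in rows]) - info_a

def build(rows, cols):
    """Return a class label (leaf) or an (attribute name, {value: subtree}) node."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:      # pure node: stop
        return labels[0]
    if not cols:                   # assumption: majority vote when no attributes remain
        return Counter(labels).most_common(1)[0][0]
    best = max(cols, key=lambda c: gain(rows, c))
    rest = [c for c in cols if c != best]
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = build(subset, rest)
    return (ATTRS[best], branches)

def classify(tree, record):
    """Follow branches until a leaf; `record` maps attribute name to value."""
    while not isinstance(tree, str):
        attr, branches = tree
        tree = branches[record[attr]]
    return tree

tree = build(ROWS, list(range(len(ATTRS))))
print(tree)
print(classify(tree, {"Outlook": "Sunny", "Temperature": "<15",
                      "Humidity": "High", "Wind": "Weak"}))
```

On this data the sketch should arrive at the same tree as the slide: Outlook at the root, Humidity under Sunny and Wind under Rain. Note that `classify` raises a `KeyError` for an attribute value that never occurred in the corresponding branch during training; a fuller implementation would need a fallback for that case.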
Slide16
Thank you & Questions?