Slide1
Announcements
Midterm
Grading over the next few days
Scores will be included in mid-semester grades
Assignments:
HW6
Out late tonight
Due date Tue, 3/24, 11:59 pm
Slide2
Plan
Last time
Nearest Neighbor Classification
kNN
Non-parametric vs parametric
Today
Decision Trees!
Slide3
Introduction to Machine Learning
Decision Trees
Instructor: Pat Virtue
Slide4
k-NN classifier (k = 5)
[Figure: a test document classified by its 5 nearest neighbors among the classes Whales, Seals, and Sharks]
Slide5
k-Nearest Neighbor Classification
Given a training dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$ and a test input $x'$, predict the class label $\hat{y}$:
1. Find the $k$ closest points in the training data to $x'$.
2. Return the most common class label among them: $\hat{y} = \arg\max_c N_c$, where $N_c$ is the number of the $k$ neighbors with class label $c$.
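A minimal sketch of this procedure in Python (the function and variable names are illustrative, not from the slides):

import numpy as np

def knn_predict(X_train, y_train, x_test, k=5):
    # Euclidean distance from the test input to every training point
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]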
Slide6
k-NN on Fisher Iris Data
Special Case: Nearest Neighbor (k = 1)
Slide7
k-NN on Fisher Iris Data
Slide8
k-NN on Fisher Iris Data
Special Case: Majority Vote (k = N)
Slide9
Decision Trees
First, a few tools.
Majority vote: predict the most common class label, $\hat{y} = \arg\max_c N_c$.
Classification error rate: what fraction did we predict incorrectly? $\mathrm{error}(h, \mathcal{D}) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[h(x^{(i)}) \neq y^{(i)}]$
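Both tools in a short Python sketch (names are illustrative):

import numpy as np

def majority_vote(y):
    # Return the most common label in y
    labels, counts = np.unique(y, return_counts=True)
    return labels[np.argmax(counts)]

def error_rate(y_pred, y_true):
    # Fraction of predictions that are incorrect
    return np.mean(np.asarray(y_pred) != np.asarray(y_true))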
Slide10
Decision trees
Popular representation for classifiers, even among humans!
I've just arrived at a restaurant: should I stay (and wait for a table) or go elsewhere?
Slide11
Decision trees
It's Friday night and you're hungry.
You arrive at your favorite cheap but really cool, happening burger place.
It's full up and you have no reservation, but there is a bar.
The host estimates a 45 minute wait.
There are alternatives nearby, but it's raining outside.
A decision tree partitions the input space and assigns a label to each partition.
Slide12
Expressiveness
Discrete decision trees can express any function of the input.
E.g., for Boolean functions, build a path from root to leaf for each row of the truth table.
True/false: there is a consistent decision tree that fits any training set exactly. (True, provided no two examples have identical inputs but different labels.)
But a tree that simply records the examples is essentially a lookup table.
To generalize to new examples, we need a compact tree.
Slide13
Tree to Predict C-Section Risk
Figure from Tom Mitchell
Slide14
Decision Stumps
Split data based on a single attribute
Dataset: Output Y, Attributes A, B, C
Y  A  B  C
-  1  0  0
-  1  0  1
-  1  0  0
+  0  0  1
+  1  1  0
+  1  1  1
+  1  1  0
+  1  1  1
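A decision stump can be scored by splitting on one attribute and taking the majority vote within each branch. A sketch on the dataset above (the helper names are mine; the printed error rates can be verified by hand from the table):

import numpy as np

# The dataset above: label Y and binary attributes A, B, C
Y = np.array(['-', '-', '-', '+', '+', '+', '+', '+'])
X = np.array([[1, 0, 0], [1, 0, 1], [1, 0, 0], [0, 0, 1],
              [1, 1, 0], [1, 1, 1], [1, 1, 0], [1, 1, 1]])

def stump_error(x_col, y):
    # Error rate of a stump predicting the majority label in each branch
    errors = 0
    for v in (0, 1):
        branch = y[x_col == v]
        if len(branch):
            _, counts = np.unique(branch, return_counts=True)
            errors += len(branch) - counts.max()   # misclassified in branch
    return errors / len(y)

for name, col in zip('ABC', X.T):
    print(name, stump_error(col, Y))   # A: 0.375, B: 0.125, C: 0.375

Splitting on B gives the lowest error rate on this data.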
Slide15
Building a decision tree
Function BuildTree(n, A)   // n: samples, A: set of attributes
    if empty(A) or all n(L) are the same
        status = leaf
        class = most common class in n(L)
    else
        status = internal
        a = bestAttribute(n, A)
        LeftNode  = BuildTree(n(a=1), A \ {a})
        RightNode = BuildTree(n(a=0), A \ {a})
    end
end
Slide16
Building a decision tree
(Same BuildTree pseudocode as the previous slide, annotated:)
n(L): the labels for the samples in this set.
Decision: which attribute?
Recursive calls create the left and right subtrees; n(a=1) is the set of samples in n for which attribute a is 1.
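A runnable Python sketch of this recursion (it follows the pseudocode's structure; bestAttribute is left pluggable, the names are illustrative, and an empty-branch fallback is added that the pseudocode leaves implicit):

import numpy as np

class Node:
    def __init__(self, label=None, attr=None, left=None, right=None):
        self.label = label            # predicted class (leaves only)
        self.attr = attr              # attribute index (internal nodes only)
        self.left, self.right = left, right

def majority(y):
    labels, counts = np.unique(y, return_counts=True)
    return labels[np.argmax(counts)]

def build_tree(X, y, attrs, best_attribute):
    # X: (n, d) 0/1 array; y: labels; attrs: usable attribute indices
    # Leaf if no attributes remain or all labels are the same
    if not attrs or len(np.unique(y)) == 1:
        return Node(label=majority(y))
    a = best_attribute(X, y, attrs)
    rest = [i for i in attrs if i != a]
    mask = X[:, a] == 1
    # If the chosen split leaves one side empty, fall back to a leaf
    if mask.all() or not mask.any():
        return Node(label=majority(y))
    return Node(attr=a,
                left=build_tree(X[mask], y[mask], rest, best_attribute),
                right=build_tree(X[~mask], y[~mask], rest, best_attribute))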
Slide17
Decision Trees as a Search Problem
Slide18
Background: Greedy Search
Goal: the search space consists of nodes and weighted edges, and the goal is to find the lowest total-weight path from the root to a leaf.
Greedy search: at each node, select the edge with the lowest immediate weight. This is a heuristic method of search (i.e., it does not necessarily find the best path).
[Figure: a search tree from a start state to end states, with weighted edges]
Slide19
Background: Greedy Search
(Same setup as the previous slide.)
[Figure: greedy search partway down the tree, taking the cheapest immediate edge at each step]
Slide20
Background: Greedy Search
(Same setup as the previous slide.)
[Figure: the completed greedy path from the start state to an end state; it is not necessarily the lowest total-weight path]
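A minimal sketch of greedy search over such a tree (the dictionary representation and the example weights are assumptions for illustration):

def greedy_path(children, start):
    # children: dict mapping a node to a list of (edge_weight, child) pairs
    # Follow the locally cheapest edge until reaching a leaf; returns the
    # path taken and its total weight (not necessarily optimal)
    path, total, node = [start], 0, start
    while children.get(node):
        weight, node = min(children[node])   # lowest immediate weight
        total += weight
        path.append(node)
    return path, total

tree = {'S': [(2, 'A'), (3, 'B')],
        'A': [(7, 'C'), (6, 'D')],
        'B': [(1, 'E')]}
print(greedy_path(tree, 'S'))   # (['S', 'A', 'D'], 8), but S->B->E costs only 4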
Slide21
Building a decision tree
(Repeats the BuildTree pseudocode and annotations from Slide 16. The remaining question is the choice of bestAttribute.)
Slide22
Identifying ‘bestAttribute’
There are many possible ways to select the best attribute for a given set.
We will discuss one possible way, which is based on information theory.
Slide23
Entropy
Quantifies the amount of uncertainty associated with a specific probability distribution.
The higher the entropy, the less confident we are in the outcome.
Definition: $H(X) = -\sum_{x} P(X = x) \log_2 P(X = x)$
Claude Shannon (1916–2001); most of this work was done at Bell Labs.
Slide24
Entropy
Definition: $H(X) = -\sum_{x} P(X = x) \log_2 P(X = x)$
So, if $P(X = 1) = 1$, then $H(X) = 0$.
If $P(X = 1) = 0.5$, then $H(X) = 1$ bit.
[Figure: $H(X)$ for a binary random variable as a function of $P(X = 1)$]
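The definition in a short Python sketch:

import numpy as np

def entropy(probs):
    # H(X) = -sum_x P(x) log2 P(x); zero-probability outcomes contribute 0
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([1.0]))        # 0.0: a certain outcome has no uncertainty
print(entropy([0.5, 0.5]))   # 1.0: a fair coin carries one bit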
Slide25
Mutual Information
For a decision tree, we can use the mutual information of the output class Y and some attribute X on which to split as a splitting criterion:
$I(Y; X) = H(Y) - H(Y \mid X)$
Given a dataset D of training examples, we can estimate the required probabilities as…
Slide26
Mutual Information
(Definition repeated from the previous slide.)
Informally, mutual information measures the following: if we know X, how much does this reduce our uncertainty about Y?
Entropy measures the expected number of bits needed to code one random draw from X. For a decision tree, we want to reduce the entropy of the random variable we are trying to predict!
Conditional entropy is the expected value of the specific conditional entropy: $H(Y \mid X) = \mathbb{E}_{P(X = x)}[H(Y \mid X = x)] = \sum_x P(X = x)\, H(Y \mid X = x)$
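Estimating I(Y; X) from a dataset, as a Python sketch (names are illustrative; x and y are numpy arrays of attribute values and labels):

import numpy as np

def empirical_entropy(y):
    # Entropy of the empirical label distribution
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(y, x):
    # I(Y; X) = H(Y) - sum_v P(X = v) H(Y | X = v)
    h_given_x = sum((x == v).mean() * empirical_entropy(y[x == v])
                    for v in np.unique(x))
    return empirical_entropy(y) - h_given_x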
Slide27
Decision Tree Learning Example
Which attribute would mutual information select for the next split?
A
B
A or B (tie)
Neither
Dataset: Output Y, Attributes A and B
Y  A  B
-  1  0
-  1  0
+  1  0
+  1  0
+  1  1
+  1  1
+  1  1
+  1  1
Slide28
Decision Tree Learning Example
(Same question and dataset as the previous slide.)
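One way to work out the answer from the table (the calculation is not spelled out on the slide): $H(Y) = -\frac{2}{8}\log_2\frac{2}{8} - \frac{6}{8}\log_2\frac{6}{8} \approx 0.811$. Since A = 1 for every example, knowing A changes nothing, so $I(Y; A) = 0$. For B: $H(Y \mid B = 0) = 1$ (two + and two -) and $H(Y \mid B = 1) = 0$ (all +), so $H(Y \mid B) = \frac{4}{8} \cdot 1 + \frac{4}{8} \cdot 0 = 0.5$ and $I(Y; B) \approx 0.811 - 0.5 = 0.311$. Mutual information therefore selects B.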