Chapter 5 – Evaluating Predictive Performance
Data Mining for Business Analytics
Shmueli, Patel & Bruce
Why Evaluate?
Multiple methods are available to classify or predict
For each method, multiple choices are available for settings
To choose the best model, we need to assess each model’s performance
Types of Outcome
Predicted numerical value: outcome variable is numerical and continuous
Predicted class membership: outcome variable is categorical
Propensity: probability of class membership when the outcome variable is categorical
Accuracy Measures (Continuous Outcome)
Evaluating Predictive Performance
Not the same as goodness-of-fit (R², S_Y/X)
Predictive performance is measured on the validation dataset
Benchmark is the average, ȳ, used as the prediction
Prediction Accuracy Measures
e_i = y_i – ŷ_i, where y_i = actual y value and ŷ_i = predicted y value
MAE = Mean Absolute Error = (1/n) Σ|e_i| (also known as MAD = Mean Absolute Deviation)
Average Error = (1/n) Σ e_i
MAPE = Mean Absolute Percent Error = 100% × (1/n) Σ|e_i / y_i|
RMSE = Root-Mean-Squared Error = √((1/n) Σ e_i²)
Total SSE = Total Sum of Squared Errors = Σ e_i²
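The measures above can be sketched in a few lines of Python. The data values here are illustrative only (not taken from the Boston Housing output shown later):

```python
import math

def accuracy_measures(y, yhat):
    """Compute the prediction accuracy measures defined above."""
    n = len(y)
    e = [yi - fi for yi, fi in zip(y, yhat)]          # errors e_i = y_i - yhat_i
    return {
        "MAE": sum(abs(ei) for ei in e) / n,          # mean absolute error (= MAD)
        "AvgError": sum(e) / n,                       # signed average error (bias)
        "MAPE": 100 * sum(abs(ei / yi) for ei, yi in zip(e, y)) / n,
        "RMSE": math.sqrt(sum(ei ** 2 for ei in e) / n),
        "SSE": sum(ei ** 2 for ei in e),              # total sum of squared errors
    }

# Illustrative actual and predicted values:
y = [20.0, 25.0, 30.0, 35.0]
yhat = [22.0, 24.0, 28.0, 36.0]
measures = accuracy_measures(y, yhat)
```

Note that Average Error can be near zero even when MAE is large, because positive and negative errors cancel; that is why both measures are reported.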
XLMiner
Example: Multiple Regression using the Boston Housing dataset
XLMiner
Residuals for Training dataset
Residuals for Validation dataset
SAS Enterprise Miner
Example: Multiple Regression using the Boston Housing dataset
SAS Enterprise Miner
Residuals for Validation dataset
SAS Enterprise Miner
Boxplot of residuals
Lift Chart
Order predicted y-values from highest to lowest
X-axis = cases from 1 to n
Y-axis = cumulative predicted value of y
Two lines are plotted:
one with ȳ (the average) as the predicted value
one with ŷ found using the prediction model
XLMiner: Lift Chart
Decile-wise Lift Chart
Order predicted y-values from highest to lowest
X-axis = % of cases from 10% to 100% (i.e., 1st decile to 10th decile)
Y-axis = cumulative predicted value of y
For each decile, the ratio of the sum of ŷ to the sum of ȳ is plotted
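A decile-wise lift calculation can be sketched as follows. This is a minimal illustration with made-up data (a perfect model scoring 20 records), not the chapter's Boston Housing example:

```python
def decile_lift(y, yhat):
    """Mean actual y in each decile of the sorted predictions, divided by
    the overall mean ybar.  Assumes len(y) is divisible by 10 for simplicity."""
    n = len(y)
    order = sorted(range(n), key=lambda i: yhat[i], reverse=True)  # sort by prediction
    ybar = sum(y) / n
    size = n // 10
    ratios = []
    for d in range(10):
        idx = order[d * size:(d + 1) * size]
        ratios.append((sum(y[i] for i in idx) / size) / ybar)      # lift of decile d+1
    return ratios

# Illustrative data: actual values 1..20 scored by a perfect model.
y = [float(v) for v in range(1, 21)]
lift = decile_lift(y, yhat=list(y))
```

A lift well above 1 in the first deciles means the model concentrates high-value cases at the top of the ranking.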
XLMiner: Decile-wise Lift Chart
Accuracy Measures (Classification)
Misclassification error
Error = classifying a record as belonging to one class when it belongs to another class.
Error rate = percent of misclassified records out of the total records in the validation data
Benchmark: Naïve Rule
Naïve rule: classify all records as belonging to the most prevalent class
Used only as a benchmark to evaluate more complex classifiers
Separation of Records
“High separation of records” means that using predictor variables attains low error
“Low separation of records” means that using predictor variables does not improve much on the naïve rule
High Separation of Records
Low Separation of Records
Classification Confusion Matrix
201 1’s correctly classified as “1” (true positives)
85 1’s incorrectly classified as “0” (false negatives)
25 0’s incorrectly classified as “1” (false positives)
2689 0’s correctly classified as “0” (true negatives)
Error Rate
Overall error rate = (25 + 85)/3000 = 3.67%
Accuracy = 1 – error rate = (201 + 2689)/3000 = 96.33%
With multiple classes, error rate = (sum of misclassified records)/(total records)
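The arithmetic above can be checked directly with the four counts from the confusion matrix:

```python
# Counts from the slide's confusion matrix.
tp, fn, fp, tn = 201, 85, 25, 2689
total = tp + fn + fp + tn              # 3000 validation records
error_rate = (fp + fn) / total         # misclassified records / total records
accuracy = (tp + tn) / total           # equivalently, 1 - error_rate
```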
Classification Matrix: Meaning of Each Cell

              Predicted C1    Predicted C2
Actual C1     n_1,1           n_1,2
Actual C2     n_2,1           n_2,2

n_1,1 = no. of C1 cases classified correctly as C1
n_1,2 = no. of C1 cases classified incorrectly as C2
n_2,1 = no. of C2 cases classified incorrectly as C1
n_2,2 = no. of C2 cases classified correctly as C2
Misclassification rate = err = (n_1,2 + n_2,1)/(n_1,1 + n_1,2 + n_2,1 + n_2,2)
Accuracy = 1 – err
Propensity
Propensities are estimated probabilities that a case belongs to each of the classes
They are used in two ways:
To generate predicted class membership
To rank-order cases by probability of belonging to a particular class of interest
Propensity and Cutoff for Classification
Most data mining algorithms classify via a 2-step process. For each case:
Compute the probability of belonging to the class of interest
Compare it to a cutoff value, and classify accordingly
The default cutoff value is 0.50:
If the propensity is >= 0.50, classify as “1”
If the propensity is < 0.50, classify as “0”
Different cutoff values can be used
Typically, the error rate is lowest for cutoff = 0.50
Cutoff Table
If cutoff is 0.50: thirteen records are classified as “1”
If cutoff is 0.80: seven records are classified as “1”
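Applying a cutoff is a one-line comparison. Using the 24 propensities from the example dataset that appears later in the chapter, the counts match the cutoff table:

```python
# Propensities from the chapter's 24-record example.
propensities = [
    0.9959767, 0.9875331, 0.9844564, 0.9804396, 0.9481164, 0.8892972,
    0.8476319, 0.7628063, 0.7069919, 0.6807541, 0.6563437, 0.6224195,
    0.5055069, 0.4713405, 0.3371174, 0.2179678, 0.1992404, 0.1494827,
    0.0479626, 0.0383414, 0.0248510, 0.0218060, 0.0161299, 0.0035600,
]

def classify(props, cutoff):
    """Classify as 1 when the propensity meets or exceeds the cutoff."""
    return [1 if p >= cutoff else 0 for p in props]

n_at_50 = sum(classify(propensities, 0.50))   # records classified as "1" at 0.50
n_at_80 = sum(classify(propensities, 0.80))   # records classified as "1" at 0.80
```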
Confusion Matrix for Different Cutoffs
When One Class Is More Important
In many cases it is more important to identify members of one class:
Tax fraud
Credit default
Response to a promotional offer
Predicting delayed flights
In such cases, overall accuracy is not a good measure for evaluating the classifier; use sensitivity and specificity instead
Sensitivity
Suppose that “C1” is the important class.
Sensitivity is the ability (probability) to correctly detect membership in class “C1”:
Sensitivity = n_1,1 / (n_1,1 + n_1,2)
Sensitivity is also known as the hit rate or true positive rate
Specificity
Suppose that “C1” is the important class.
Specificity is the ability (probability) to correctly rule out members of class “C2”:
Specificity = n_2,2 / (n_2,1 + n_2,2)
1 – Specificity = false positive rate
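Both measures can be computed from the chapter's earlier confusion matrix (201 true positives, 85 false negatives, 25 false positives, 2689 true negatives), taking "1" as the important class C1:

```python
tp, fn, fp, tn = 201, 85, 25, 2689
sensitivity = tp / (tp + fn)           # true positive rate (hit rate)
specificity = tn / (tn + fp)           # probability of ruling out C2 correctly
false_positive_rate = 1 - specificity
```

Here sensitivity is about 0.70 while accuracy is above 0.96, showing why overall accuracy alone can hide poor detection of the important class.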
ROC Curve (Receiver Operating Characteristic Curve)
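An ROC curve is traced by sweeping the cutoff across the observed propensities and recording (false positive rate, true positive rate) at each one. The sketch below uses the chapter's 24-record example (propensities and actual classes as in the lift-chart table):

```python
props = [0.9959767, 0.9875331, 0.9844564, 0.9804396, 0.9481164, 0.8892972,
         0.8476319, 0.7628063, 0.7069919, 0.6807541, 0.6563437, 0.6224195,
         0.5055069, 0.4713405, 0.3371174, 0.2179678, 0.1992404, 0.1494827,
         0.0479626, 0.0383414, 0.0248510, 0.0218060, 0.0161299, 0.0035600]
actual = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
          1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

def roc_points(props, actual):
    """(FPR, TPR) pairs with every observed propensity used as the cutoff."""
    pos = sum(actual)
    neg = len(actual) - pos
    pts = []
    for cutoff in sorted(set(props), reverse=True):
        tp = sum(1 for p, a in zip(props, actual) if p >= cutoff and a == 1)
        fp = sum(1 for p, a in zip(props, actual) if p >= cutoff and a == 0)
        pts.append((fp / neg, tp / pos))
    return pts

points = roc_points(props, actual)
```

Plotting these points (FPR on the x-axis, TPR on the y-axis) gives the ROC curve; a curve hugging the top-left corner indicates a good classifier.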
Asymmetric Misclassification Costs
Misclassification Costs May Differ
The cost of making a misclassification error may be higher for one class than the other(s)
Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)
Example – Response to a Promotional Offer
Suppose we send an offer to 1000 people, with a 1% average response rate (“1” = response, “0” = nonresponse)
The “naïve rule” (classify everyone as “0”) has an error rate of only 1% (seems good)

            Predict Class 0   Predict Class 1
Actual 0    990               0
Actual 1    10                0
The Confusion Matrix
Suppose that, using data mining, we can correctly classify eight 1’s as 1’s
This comes at the cost of misclassifying twenty 0’s as 1’s and two 1’s as 0’s

            Predict Class 0   Predict Class 1
Actual 0    970               20
Actual 1    2                 8

Error rate = (2 + 20)/1000 = 2.2% (higher than the naïve rate)
Introducing Costs & Benefits
Suppose:
Profit from a “1” is $10
Cost of sending offer is $1
Then:
Under naïve rule, all are classified as “0”, so no offers are sent: no cost, no profit
Under DM predictions, 28 offers are sent.
8 respond with profit of $10 each
20 fail to respond, cost $1 each
972 receive nothing (no cost, no profit)
Net profit = $80 – $20 = $60
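The net-profit arithmetic from the slide can be verified in a couple of lines, using its assumptions ($10 profit per responder, $1 cost per wasted offer):

```python
responders_mailed = 8       # 1's correctly predicted as 1
nonresponders_mailed = 20   # 0's incorrectly predicted as 1
net_profit = responders_mailed * 10 - nonresponders_mailed * 1
# Under the naive rule no offers are sent, so its net profit is 0:
# the DM classifier beats it despite having the higher error rate.
```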
Profit Matrix

            Predict Class 0   Predict Class 1
Actual 0    $0                –$20
Actual 1    $0                $80
Minimize Opportunity costs
As we see, best to convert everything to costs, as opposed to a mix of costs and benefits
E.g., instead of “benefit from sale” refer to “opportunity cost of lost sale”
Leads to the same decisions, but referring only to costs allows greater applicability
Cost Matrix with Opportunity Costs
Recall the original confusion matrix (profit from a “1” = $10, cost of sending an offer = $1):

Counts               Predict Class 0    Predict Class 1
Actual 0             970                20
Actual 1             2                  8

Opportunity costs    Predict Class 0    Predict Class 1
Actual 0             970 × $0 = $0      20 × $1 = $20
Actual 1             2 × $10 = $20      8 × $1 = $8

Total opportunity cost = $0 + $20 + $20 + $8 = $48
Average Misclassification Cost
q1 = cost of misclassifying an actual C1 case as C2
q2 = cost of misclassifying an actual C2 case as C1
Average misclassification cost = (q1 × n_1,2 + q2 × n_2,1) / n
Look for a classifier that minimizes this average cost
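The formula is a weighted per-record cost. As an illustration (the costs here are assumptions, not from the text), apply it to the promotional-offer matrix with a missed responder costing $10 and a wasted offer costing $1:

```python
def avg_misclassification_cost(q1, n12, q2, n21, n):
    """Average misclassification cost per record, in the slide's notation:
    n12 = actual C1 classified as C2, n21 = actual C2 classified as C1."""
    return (q1 * n12 + q2 * n21) / n

# Assumed costs: q1 = $10 per missed responder (n12 = 2),
# q2 = $1 per wasted offer (n21 = 20), over n = 1000 records.
cost = avg_misclassification_cost(q1=10, n12=2, q2=1, n21=20, n=1000)
```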
Generalize to Cost Ratio
Sometimes actual costs and benefits are hard to estimate
Need to express everything in terms of costs (i.e., cost of misclassification per record)
A good practical substitute for individual costs is the ratio of misclassification costs (e.g., “misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms”)
Multiple Classes
Theoretically, there are m(m – 1) misclassification costs, since any case from one of the m classes could be misclassified into any one of the m – 1 other classes
Practically, this is too many to work with
In a decision-making context, though, such complexity rarely arises – one class is usually of primary interest
For m classes, the confusion matrix has m rows and m columns
Judging Ranking Performance
Lift Chart for Binary Data
Input: a scored validation dataset (actual class and propensity, i.e., probability of belonging to the class of interest, C1)
Sort records in descending order of propensity to belong to the class of interest
Compute the cumulative number of C1 members for each row
The lift chart plots row number (no. of records) on the x-axis and cumulative number of C1 members on the y-axis
Lift Chart for Binary Data – Example

Case No.   Propensity   Actual class   Cumulative actual classes
1          0.9959767    1              1
2          0.9875331    1              2
3          0.9844564    1              3
4          0.9804396    1              4
5          0.9481164    1              5
6          0.8892972    1              6
7          0.8476319    1              7
8          0.7628063    0              7
9          0.7069919    1              8
10         0.6807541    1              9
11         0.6563437    1              10
12         0.6224195    0              10
13         0.5055069    1              11
14         0.4713405    0              11
15         0.3371174    0              11
16         0.2179678    1              12
17         0.1992404    0              12
18         0.1494827    0              12
19         0.0479626    0              12
20         0.0383414    0              12
21         0.0248510    0              12
22         0.0218060    0              12
23         0.0161299    0              12
24         0.0035600    0              12
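The last column of the table is just a running sum of the sorted actual classes, which can be sketched as:

```python
# Actual classes after sorting by propensity (from the table above).
actual_sorted = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
                 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

cumulative = []
running = 0
for a in actual_sorted:
    running += a                 # count another C1 member when actual class is 1
    cumulative.append(running)
# Lift chart: plot cumulative (y) against row number 1..24 (x); the
# reference line rises at the average rate of 12/24 C1 members per record.
```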
Lift Chart with Costs and Benefits
Sort records in descending order of probability of success (success = belonging to the class of interest)
For each record, compute the cost/benefit of its actual outcome
Compute a column of cumulative cost/benefit
Plot cumulative cost/benefit on the y-axis against row number (no. of records) on the x-axis
Lift Chart with Cost/Benefit – Example

Case No.   Propensity   Actual class   Cost/Benefit   Cumulative cost/benefit
1          0.9959767    1              10             10
2          0.9875331    1              10             20
3          0.9844564    1              10             30
4          0.9804396    1              10             40
5          0.9481164    1              10             50
6          0.8892972    1              10             60
7          0.8476319    1              10             70
8          0.7628063    0              -1             69
9          0.7069919    1              10             79
10         0.6807541    1              10             89
11         0.6563437    1              10             99
12         0.6224195    0              -1             98
13         0.5055069    1              10             108
14         0.4713405    0              -1             107
15         0.3371174    0              -1             106
16         0.2179678    1              10             116
17         0.1992404    0              -1             115
18         0.1494827    0              -1             114
19         0.0479626    0              -1             113
20         0.0383414    0              -1             112
21         0.0248510    0              -1             111
22         0.0218060    0              -1             110
23         0.0161299    0              -1             109
24         0.0035600    0              -1             108
Lift Curve May Go Negative
If the total net benefit from all cases is negative, the reference line will have a negative slope
Nonetheless, the goal is still to use the cutoff to select the point where net benefit is at a maximum
Negative slope to reference curve
Oversampling and Asymmetric Costs
Rare Cases
Examples of rare but important cases:
Responder to a mailing
Someone who commits fraud
Debt defaulter
Often we oversample rare cases to give the model more information to work with
Typically use 50% “1” and 50% “0” for training
Asymmetric costs/benefits typically go hand in hand with the presence of a rare but important class
Example
The following graphs show optimal classification under three scenarios:
assuming equal costs of misclassification
assuming that misclassifying “o” is five times the cost of misclassifying “x”
an oversampling scheme allowing DM methods to incorporate asymmetric costs
Classification: equal costs
Classification: Unequal costs
Suppose that failing to catch “o” is 5 times as costly as failing to catch “x”.
Oversampling for asymmetric costs
Oversample “o” to appropriately weight misclassification costs – with or without replacement
Equal number of responders
Sample an equal number of responders and non-responders
An Oversampling Procedure
Separate the responders (rare) from the non-responders
Randomly assign half the responders to the training sample, plus an equal number of non-responders
The remaining responders go to the validation sample
Add non-responders to the validation data to maintain the original ratio of responders to non-responders
Randomly take a test set (if needed) from the validation data
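The steps above can be sketched as follows. This is a minimal illustration assuming a list of (features, label) pairs with label 1 marking the rare responder class; the function name and data layout are hypothetical:

```python
import random

def oversample_split(data, seed=1):
    """Oversampling split per the procedure above; label 1 = rare responder."""
    rng = random.Random(seed)
    responders = [r for r in data if r[1] == 1]
    nonresponders = [r for r in data if r[1] == 0]
    rng.shuffle(responders)
    rng.shuffle(nonresponders)
    half = len(responders) // 2
    # Training: half the responders plus an equal number of non-responders.
    train = responders[:half] + nonresponders[:half]
    # Validation: remaining responders plus enough non-responders to restore
    # the original responder : non-responder ratio.
    val_resp = responders[half:]
    ratio = len(nonresponders) / len(responders)
    val = val_resp + nonresponders[half:half + round(len(val_resp) * ratio)]
    return train, val

# Hypothetical dataset: 20 responders among 1000 records (2% response rate).
data = [((i,), 1) for i in range(20)] + [((i,), 0) for i in range(980)]
train, val = oversample_split(data)
```

Here the training set is balanced 50/50 while the validation set keeps the original 2% responder rate, which is exactly what the reweighting slides that follow rely on.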
Assessing Model Performance
Two options:
1. Score the model on a validation dataset selected without oversampling
2. Score the model on an oversampled validation dataset and reweight the results to remove the effects of oversampling
Method 1 is straightforward and easier to implement.
Adjusting the Confusion Matrix for Oversampling

                 Whole data   Sample
Responders       2%           50%
Nonresponders    98%          50%

Example:
One responder in the whole data corresponds to 50/2 = 25 responders in the sample
One nonresponder in the whole data corresponds to 50/98 = 0.5102 nonresponders in the sample
Adjusting the Confusion Matrix for Oversampling
Suppose that the confusion matrix with the oversampled validation dataset is as follows:

Classification matrix, oversampled data (validation)
            Predicted 0   Predicted 1   Total
Actual 0    390           110           500
Actual 1    80            420           500
Total       470           530           1000

Misclassification rate = (80 + 110)/1000 = 0.19, or 19%
Percentage of records predicted as “1” = 530/1000 = 0.53, or 53%
Adjusting the Confusion Matrix for Oversampling
Weight for a responder (Actual 1) = 25
Weight for a nonresponder (Actual 0) = 0.5102

Classification matrix, reweighted
            Predicted 0           Predicted 1            Total
Actual 0    390/0.5102 = 764.4    110/0.5102 = 215.6     980
Actual 1    80/25 = 3.2           420/25 = 16.8          20
Total       767.6                 232.4                  1000

Misclassification rate = (3.2 + 215.6)/1000 = 0.219, or 21.9%
Percentage of records predicted as “1” = 232.4/1000 = 0.2324, or 23.24%
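The reweighting is a cell-by-cell division by the actual class's oversampling weight, which can be sketched as:

```python
# Slide's weights: 25 for responders (actual 1), 0.5102 for nonresponders.
weights = {"1": 25, "0": 0.5102}
counts = {("0", "0"): 390, ("0", "1"): 110,   # (actual, predicted): count
          ("1", "0"): 80,  ("1", "1"): 420}

# Divide each cell by the weight of its ACTUAL class.
reweighted = {k: v / weights[k[0]] for k, v in counts.items()}
total = sum(reweighted.values())
misclass_rate = (reweighted[("1", "0")] + reweighted[("0", "1")]) / total
pct_predicted_1 = (reweighted[("0", "1")] + reweighted[("1", "1")]) / total
```

Note how reweighting raises the misclassification rate from 19% to about 21.9%: the false positives on nonresponders count for much more once the class proportions are restored.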
Adjusting the Lift Curve for Oversampling
Sort records in descending order of probability of success (success = belonging to the class of interest)
For each record, compute the cost/benefit of its actual outcome
Divide that value by the oversampling rate of the actual class
Compute a cumulative column of weighted cost/benefit
Plot cumulative weighted cost/benefit on the y-axis against row number (no. of records) on the x-axis
Adjusting the Lift Curve for Oversampling – Example
Suppose that the cost/benefit values and oversampling weights are as follows:

            Cost/Benefit   Oversampling weight
Actual 0    -3             0.6
Actual 1    50             20

Case No.   Propensity   Actual class   Weighted cost/benefit   Cumulative weighted cost/benefit
1          0.9959767    1              2.50                    2.50
2          0.9875331    1              2.50                    5.00
3          0.9844564    1              2.50                    7.50
4          0.9804396    1              2.50                    10.00
5          0.9481164    1              2.50                    12.50
6          0.8892972    1              2.50                    15.00
7          0.8476319    1              2.50                    17.50
8          0.7628063    0              -5.00                   12.50
9          0.7069919    1              2.50                    15.00
10         0.6807541    1              2.50                    17.50
11         0.6563437    1              2.50                    20.00
12         0.6224195    0              -5.00                   15.00
13         0.5055069    1              2.50                    17.50
14         0.4713405    0              -5.00                   12.50
15         0.3371174    0              -5.00                   7.50
16         0.2179678    1              2.50                    10.00
17         0.1992404    0              -5.00                   5.00
18         0.1494827    0              -5.00                   0.00
19         0.0479626    0              -5.00                   -5.00
20         0.0383414    0              -5.00                   -10.00
21         0.0248510    0              -5.00                   -15.00
22         0.0218060    0              -5.00                   -20.00
23         0.0161299    0              -5.00                   -25.00
24         0.0035600    0              -5.00                   -30.00
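The weighted column is each record's cost/benefit divided by its actual class's oversampling weight: 50/20 = 2.50 per actual 1, and -3/0.6 = -5.00 per actual 0. The cumulative column can be reproduced as:

```python
# Actual classes after sorting by propensity (from the table above).
actual_sorted = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
                 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
benefit = {1: 50, 0: -3}     # cost/benefit per record by actual class
weight = {1: 20, 0: 0.6}     # oversampling weights

cumulative, running = [], 0.0
for a in actual_sorted:
    running += benefit[a] / weight[a]    # weighted cost/benefit of this record
    cumulative.append(round(running, 2))
```

The curve peaks at 20.00 after 11 records and then declines, so the profit-maximizing cutoff would stop the campaign near that point.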