Chapter 5 – Evaluating Predictive Performance

Presentation Transcript

Slide1

Chapter 5 – Evaluating Predictive Performance

Data Mining for Business Analytics

Shmueli, Patel & Bruce

Slide2

Why Evaluate?

Multiple methods are available to classify or predict

For each method, multiple choices are available for settings

To choose the best model, we need to assess each model's performance

Slide3

Types of outcome

Predicted numerical value

Outcome variable is numerical and continuous

Predicted class membership

Outcome variable is categorical

Propensity

Probability of class membership when the outcome variable is categorical

Slide4

Accuracy Measures (Continuous outcome)

Slide5

Evaluating Predictive Performance

Not the same as goodness-of-fit (R², S_Y/X)

Predictive performance is measured using the validation dataset

Benchmark: the average, ȳ, used as the prediction

Slide6

Prediction accuracy measures

e_i = y_i − ŷ_i, where y_i = actual y value and ŷ_i = predicted y value

MAE = Mean Absolute Error = (1/n) Σ |e_i|   (MAE is also known as MAD = Mean Absolute Deviation)

Average Error = (1/n) Σ e_i

MAPE = Mean Absolute Percent Error = 100% × (1/n) Σ |e_i / y_i|

RMSE = Root-Mean-Squared Error = √((1/n) Σ e_i²)

Total SSE = Total Sum of Squared Errors = Σ e_i²
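These measures are easy to compute directly. Below is a minimal sketch in Python/NumPy; the arrays y_val and y_hat are illustrative stand-ins for the actual and predicted values of the validation records:

```python
import numpy as np

def prediction_accuracy_measures(y_actual, y_pred):
    """Compute the error-based accuracy measures listed above."""
    e = y_actual - y_pred                                  # residuals e_i = y_i - ŷ_i
    return {
        "MAE (MAD)": np.mean(np.abs(e)),                   # mean absolute error / deviation
        "Average Error": np.mean(e),                       # signed average error
        "MAPE (%)": 100 * np.mean(np.abs(e / y_actual)),   # mean absolute percent error
        "RMSE": np.sqrt(np.mean(e ** 2)),                  # root-mean-squared error
        "Total SSE": np.sum(e ** 2),                       # total sum of squared errors
    }

# Illustrative validation values (made up for demonstration)
y_val = np.array([24.0, 21.6, 34.7, 33.4])
y_hat = np.array([27.2, 23.1, 31.9, 35.0])
print(prediction_accuracy_measures(y_val, y_hat))
```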

 Slide7

Excel Miner

Example: Multiple Regression using Boston Housing dataset

Slide8

Excel Miner

Residuals for Training dataset

Residuals for Validation dataset

Slide9

Excel Miner

Residuals for Training dataset

Residuals for Validation dataset

Slide10

SAS Enterprise Miner

Example: Multiple Regression using Boston Housing dataset

Slide11

SAS Enterprise Miner

Residuals for Validation dataset

Slide12

SAS Enterprise Miner

Boxplot of residuals

Slide13

Lift Chart

Order predicted y-value from highest to lowest

X-axis = Cases from 1 to n

Y-axis = Cumulative predicted value of Y

Two lines are plotted (see the sketch below):

one with ȳ (the average) as the predicted value

one with ŷ found using a prediction model
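A minimal sketch of this construction with NumPy and matplotlib; y_pred and y_actual are illustrative names for the predicted and actual values of the validation records:

```python
import numpy as np
import matplotlib.pyplot as plt

def lift_chart_continuous(y_pred, y_actual):
    """Lift chart as described above: cases ordered by predicted value (highest first);
    one curve accumulates the model's predictions ŷ, the other accumulates the
    benchmark prediction ȳ (the average)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_actual = np.asarray(y_actual, dtype=float)
    order = np.argsort(-y_pred)                            # highest ŷ first
    n = len(y_pred)
    cum_model = np.cumsum(y_pred[order])                   # cumulative ŷ
    cum_benchmark = np.arange(1, n + 1) * y_actual.mean()  # cumulative ȳ per case
    plt.plot(range(1, n + 1), cum_model, label="Prediction model (ŷ)")
    plt.plot(range(1, n + 1), cum_benchmark, "--", label="Benchmark (average ȳ)")
    plt.xlabel("Number of cases")
    plt.ylabel("Cumulative predicted value")
    plt.legend()
    plt.show()
```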

Slide14

Excel Miner: Lift Chart

Slide15

Decile-wise lift Chart

Order predicted y-values from highest to lowest

X-axis = % of cases from 10% to 100%, i.e. 1st decile to 10th decile

Y-axis = decile lift: for each decile, the ratio of the sum of ŷ to the sum of ȳ is plotted (see the sketch below)
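A compact sketch of that decile-wise calculation (illustrative NumPy code; y_pred and y_actual are assumed arrays of predicted and actual values for the validation records):

```python
import numpy as np

def decile_lift(y_pred, y_actual):
    """For each decile of cases (sorted by predicted value, highest first), return the
    ratio of the decile's sum of ŷ to the sum of ȳ over the same number of cases."""
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.argsort(-y_pred)                    # highest predictions first
    y_bar = np.asarray(y_actual, dtype=float).mean()
    deciles = np.array_split(y_pred[order], 10)    # ten roughly equal groups
    return [group.sum() / (len(group) * y_bar) for group in deciles]
```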

Slide16

XLMiner: Decile-wise lift Chart

Slide17

Accuracy Measures (Classification)

Slide18

Misclassification error

Error = classifying a record as belonging to one class when it belongs to another class.

Error rate = percent of misclassified records out of the total records in the validation data

Slide19

Benchmark: Naïve Rule

Naïve rule: classify all records as belonging to the most prevalent class

Used only as a benchmark to evaluate more complex classifiers

Slide20

Separation of Records

“High separation of records” means that using predictor variables attains low error

“Low separation of records” means that using predictor variables does not improve much on the naïve rule

Slide21

High Separation of Records

Slide22

Low Separation of Records

Slide23

Classification Confusion Matrix

201 1’s correctly classified as “1” (True positives)

85 1’s incorrectly classified as “0” (False negatives)

25 0’s incorrectly classified as “1” (False positives)

2689 0’s correctly classified as “0” (True negatives)

Slide24

Error Rate

Overall error rate

= (25+85)/3000 = 3.67%

Accuracy

= 1 – err = (201+2689)/3000 = 96.33%

If multiple classes, error rate is:

(sum of misclassified records)/(total records)
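As a small illustration, here is how the error rate and accuracy from the confusion matrix above could be computed (a Python sketch; the counts are the ones from the earlier slide):

```python
# Counts from the confusion matrix above (class of interest = "1")
tp, fn = 201, 85     # actual 1's: correctly / incorrectly classified
fp, tn = 25, 2689    # actual 0's: incorrectly / correctly classified

n = tp + fn + fp + tn                 # 3000 validation records
error_rate = (fp + fn) / n            # (25 + 85) / 3000 = 0.0367
accuracy = (tp + tn) / n              # (201 + 2689) / 3000 = 0.9633
print(f"error rate = {error_rate:.2%}, accuracy = {accuracy:.2%}")
```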

Slide25

Classification Matrix: Meaning of each cell

Actual C1, Predicted C1: n1,1 = no. of C1 cases classified correctly as C1
Actual C1, Predicted C2: n1,2 = no. of C1 cases classified incorrectly as C2
Actual C2, Predicted C1: n2,1 = no. of C2 cases classified incorrectly as C1
Actual C2, Predicted C2: n2,2 = no. of C2 cases classified correctly as C2

Misclassification rate = err = (n1,2 + n2,1) / n

Accuracy = 1 – err = (n1,1 + n2,2) / n

where n = n1,1 + n1,2 + n2,1 + n2,2 = total number of records

Slide26

Propensity

Propensities are estimated probabilities that a case belongs to each of the classes

They are used in two ways:

To generate predicted class membership

To rank-order cases by their probability of belonging to a particular class of interest

Slide27

Propensity and Cutoff for Classification

Most data mining algorithms classify via a 2-step process. For each case:

Compute the probability of belonging to the class of interest

Compare it to the cutoff value, and classify accordingly

The default cutoff value is 0.50

If >= 0.50, classify as “1”

If < 0.50, classify as “0”

Different cutoff values can be used

Typically, the error rate is lowest for cutoff = 0.50 (see the sketch below)
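A minimal sketch of cutoff-based classification from propensities (plain Python/NumPy; the array values are illustrative):

```python
import numpy as np

def classify_with_cutoff(propensities, cutoff=0.5):
    """Assign class "1" to cases whose propensity meets or exceeds the cutoff."""
    return (np.asarray(propensities) >= cutoff).astype(int)

propensities = np.array([0.996, 0.763, 0.506, 0.471, 0.218, 0.016])
print(classify_with_cutoff(propensities))               # default cutoff 0.50 -> [1 1 1 0 0 0]
print(classify_with_cutoff(propensities, cutoff=0.8))   # stricter cutoff     -> [1 0 0 0 0 0]
```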

Slide28

Cutoff Table

If cutoff is 0.50: thirteen records are classified as “1”

If cutoff is 0.80: seven records are classified as “1”

Slide29

Confusion Matrix for Different Cutoffs

Slide30

Confusion Matrix for Different Cutoffs

Slide31

When One Class is More Important

In many cases it is more important to identify members of one class

Tax fraud

Credit default

Response to promotional offer

Predicting delayed flights

In such cases, overall accuracy is not a good measure for evaluating the classifier. Instead, use:

Sensitivity

Specificity

Slide32

Sensitivity

Suppose that “C1” is the important class.

Sensitivity is the ability (probability) to detect membership in class “C1” correctly and is given by

Sensitivity = n1,1 / (n1,1 + n1,2) = Hit rate = True Positive rate

Slide33

Specificity

Suppose that “C1” is the important class.

Specificity is the ability (probability) to correctly rule out members of class “C2” and is given by

Specificity = n2,2 / (n2,1 + n2,2)

1 – Specificity = False Positive rate

 Slide34

ROC Curve (Receiver Operating Characteristic Curve)
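The ROC curve plots sensitivity against (1 – specificity) as the cutoff is varied. A small illustrative sketch tying the two definitions together (the arrays are made up for demonstration):

```python
import numpy as np

def sensitivity_specificity(actual, predicted):
    """Sensitivity = n11/(n11+n12), Specificity = n22/(n21+n22), with class 1 as C1."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    n11 = np.sum((actual == 1) & (predicted == 1))   # true positives
    n12 = np.sum((actual == 1) & (predicted == 0))   # false negatives
    n21 = np.sum((actual == 0) & (predicted == 1))   # false positives
    n22 = np.sum((actual == 0) & (predicted == 0))   # true negatives
    return n11 / (n11 + n12), n22 / (n21 + n22)

# ROC curve points: sweep the cutoff and record (1 - specificity, sensitivity)
actual = np.array([1, 1, 0, 1, 0, 0, 1, 0])
propensity = np.array([0.95, 0.80, 0.70, 0.60, 0.45, 0.30, 0.25, 0.10])
for cutoff in [0.2, 0.5, 0.8]:
    sens, spec = sensitivity_specificity(actual, (propensity >= cutoff).astype(int))
    print(f"cutoff={cutoff}: FP rate={1 - spec:.2f}, sensitivity={sens:.2f}")
```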

Slide35

Asymmetric Misclassification Costs

Slide36

Misclassification Costs May Differ

The cost of making a misclassification error may be higher for one class than the other(s)

Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)

Slide37

Example – Response to Promotional Offer

“Naïve rule” (classify everyone as “0”) has an error rate of 1% (seems good)

Suppose we send an offer to 1000 people, with a 1% average response rate (“1” = response, “0” = nonresponse)

             Predict Class 0   Predict Class 1
Actual 0          990                 0
Actual 1           10                 0

Slide38

The Confusion Matrix

Error rate = (2 + 20)/1000 = 2.2% (higher than the naïve rate)

             Predict Class 0   Predict Class 1
Actual 0          970                20
Actual 1            2                 8

Suppose that using DM we can correctly classify eight 1’s as 1’s. It comes at the cost of misclassifying twenty 0’s as 1’s and two 1’s as 0’s.

Slide39

Introducing Costs & Benefits

Suppose:

Profit from a “1” is $10

Cost of sending offer is $1

Then:

Under naïve rule, all are classified as “0”, so no offers are sent: no cost, no profit

Under DM predictions, 28 offers are sent.

8 respond with profit of $10 each

20 fail to respond, cost $1 each

972 receive nothing (no cost, no profit)

Net profit = $80 – $20 = $60
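A tiny sketch of this profit calculation from a confusion matrix, following the slide's accounting ($10 profit per responder reached, $1 cost per wasted offer; names and structure are illustrative):

```python
def net_profit(confusion, profit_per_response=10.0, cost_per_offer=1.0):
    """confusion = dict of counts; offers go to everyone predicted as "1"."""
    responders_reached = confusion["actual1_pred1"]   # respond: profit per response
    wasted_offers = confusion["actual0_pred1"]        # do not respond: cost of the offer
    return responders_reached * profit_per_response - wasted_offers * cost_per_offer

dm_confusion = {"actual0_pred0": 970, "actual0_pred1": 20,
                "actual1_pred0": 2,   "actual1_pred1": 8}
print(net_profit(dm_confusion))   # 8*$10 - 20*$1 = $60
```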

Slide40

Profit Matrix

             Predict Class 0   Predict Class 1
Actual 0           $0               –$20
Actual 1           $0                $80

Slide41

Minimize Opportunity costs

As we see, it is best to convert everything to costs, as opposed to a mix of costs and benefits

E.g., instead of “benefit from sale” refer to “opportunity cost of lost sale”

Leads to the same decisions, but referring only to costs allows greater applicability

Slide42

Cost Matrix with opportunity costs

Recall original confusion matrix (profit from a “1” = $10, cost of sending offer = $1):

 

Confusion matrix:

             Predict Class 0   Predict Class 1
Actual 0          970                20
Actual 1            2                 8

Costs:

             Predict Class 0      Predict Class 1
Actual 0     970 × $0 = $0        20 × $1 = $20
Actual 1     2 × $10 = $20         8 × $1 = $8

Total opportunity cost = 0 + 20 + 20 + 8 = $48

Slide43

Average misclassification cost

q1 = cost of misclassifying an actual C1 case as belonging to C2

q2 = cost of misclassifying an actual C2 case as belonging to C1

Average misclassification cost = (q1 × n1,2 + q2 × n2,1) / n

Look for a classifier that minimizes this average cost.
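A short worked check, using the promotional-offer numbers above as illustrative costs (missing a responder forgoes $10, a wasted offer costs $1):

```python
# Illustrative values from the promotional-offer example
q1, q2 = 10.0, 1.0         # cost of missing a responder; cost of a wasted offer
n12, n21, n = 2, 20, 1000  # misclassification counts and total records

average_cost = (q1 * n12 + q2 * n21) / n   # (10*2 + 1*20) / 1000 = 0.04
print(f"average misclassification cost = ${average_cost:.2f} per record")
```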

Slide44

Generalize to Cost Ratio

Sometimes actual costs and benefits are hard to estimate

Need to express everything in terms of costs (i.e., cost of misclassification per record)

A good practical substitute for individual costs is the ratio of misclassification costs (e.g., “misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms”)

Slide45

Multiple Classes

Theoretically, there are m(m – 1) misclassification costs, since any case from one of the m classes could be misclassified into any one of the m – 1 other classes

Practically, this is too many to work with

In a decision-making context, though, such complexity rarely arises – one class is usually of primary interest

For m classes, the confusion matrix has m rows and m columns

Slide46

Judging Ranking Performance

Slide47

Lift Chart for Binary Data

Input: scored validation dataset, i.e. the actual class and the propensity (probability) of belonging to the class of interest C1

Sort records in descending order of propensity to belong to the class of interest

Compute the cumulative number of C1 members for each row

The lift chart is the plot with row number (no. of records) as the x-axis and cumulative number of C1 members as the y-axis

Slide48

Lift Chart for Binary Data - Example

Case No.   Propensity   Actual class   Cumulative actual classes
(Propensity = predicted probability of belonging to class "1")

 1   0.9959767   1    1
 2   0.9875331   1    2
 3   0.9844564   1    3
 4   0.9804396   1    4
 5   0.9481164   1    5
 6   0.8892972   1    6
 7   0.8476319   1    7
 8   0.7628063   0    7
 9   0.7069919   1    8
10   0.6807541   1    9
11   0.6563437   1   10
12   0.6224195   0   10
13   0.5055069   1   11
14   0.4713405   0   11
15   0.3371174   0   11
16   0.2179678   1   12
17   0.1992404   0   12
18   0.1494827   0   12
19   0.0479626   0   12
20   0.0383414   0   12
21   0.0248510   0   12
22   0.0218060   0   12
23   0.0161299   0   12
24   0.0035600   0   12

Slide49

Lift Chart with Costs and Benefits

Sort records in descending order of probability of success (success = belonging to class of interest)

For each record compute cost/benefit with actual outcome

Compute a column of cumulative cost/benefit

Plot the cumulative cost/benefit on the y-axis and the row number (no. of records) on the x-axis (see the sketch below)
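A minimal sketch of this construction, using illustrative values consistent with the example that follows (benefit of $10 for an actual "1", cost of $1 for an actual "0"):

```python
import numpy as np

def cumulative_cost_benefit(propensity, actual, benefit_1=10.0, cost_0=-1.0):
    """Sort by propensity (descending), assign each record its cost/benefit based on
    the actual outcome, and return the cumulative cost/benefit curve."""
    order = np.argsort(-np.asarray(propensity))
    values = np.where(np.asarray(actual)[order] == 1, benefit_1, cost_0)
    return np.cumsum(values)

propensity = [0.99, 0.88, 0.76, 0.62, 0.21]
actual     = [1,    1,    0,    1,    0]
print(cumulative_cost_benefit(propensity, actual))   # [10. 20. 19. 29. 28.]
```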

Slide50

Lift Chart with cost/benefit - Example

Case No.   Propensity   Actual class   Cost/Benefit   Cumulative cost/benefit
(Propensity = predicted probability of belonging to class "1")

 1   0.9959767   1    10    10
 2   0.9875331   1    10    20
 3   0.9844564   1    10    30
 4   0.9804396   1    10    40
 5   0.9481164   1    10    50
 6   0.8892972   1    10    60
 7   0.8476319   1    10    70
 8   0.7628063   0    -1    69
 9   0.7069919   1    10    79
10   0.6807541   1    10    89
11   0.6563437   1    10    99
12   0.6224195   0    -1    98
13   0.5055069   1    10   108
14   0.4713405   0    -1   107
15   0.3371174   0    -1   106
16   0.2179678   1    10   116
17   0.1992404   0    -1   115
18   0.1494827   0    -1   114
19   0.0479626   0    -1   113
20   0.0383414   0    -1   112
21   0.0248510   0    -1   111
22   0.0218060   0    -1   110
23   0.0161299   0    -1   109
24   0.0035600   0    -1   108

Slide51

Lift Curve May Go Negative

If the total net benefit from all cases is negative, the reference line will have a negative slope

Nonetheless, the goal is still to use the cutoff to select the point where net benefit is at a maximum

Slide52

Negative slope to reference curve

Slide53

Oversampling and Asymmetric Costs

Slide54

Rare Cases

Responder to mailing

Someone who commits fraud

Debt defaulter

Often we oversample rare cases to give model more information to work with

Typically use 50% “1” and 50% “0” for training

Asymmetric costs/benefits typically go hand in hand with the presence of a rare but important class

Slide55

Example

Following graphs show optimal classification under three scenarios:

assuming equal costs of misclassification

assuming that misclassifying “o” is five times the cost of misclassifying “x”

an oversampling scheme allowing DM methods to incorporate asymmetric costs

Slide56

Classification: equal costs

Slide57

Classification: Unequal costs

Suppose that failing to catch “o” is 5 times as costly as failing to catch “x”.

Slide58

Oversampling for asymmetric costs

Oversample “o” to appropriately weight misclassification costs – without or with replacement

Slide59

Equal number of responders

Sample an equal number of responders and non-responders

Slide60

An Oversampling Procedure

Separate the responders (rare) from the non-responders

Randomly assign half the responders to the training sample, plus an equal number of non-responders

The remaining responders go to the validation sample

Add non-responders to the validation data, to maintain the original ratio of responders to non-responders

Randomly take the test set (if needed) from the validation data (see the sketch below)
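A rough sketch of this procedure with pandas, assuming a DataFrame df with a 0/1 outcome column and a rare responder class (all names are illustrative):

```python
import pandas as pd

def oversampled_split(df, outcome="response", seed=1):
    """Oversampling split: training gets half the responders plus an equal number of
    non-responders; validation gets the rest, preserving the original class ratio."""
    responders = df[df[outcome] == 1]
    nonresponders = df[df[outcome] == 0]

    train_resp = responders.sample(frac=0.5, random_state=seed)                 # half the responders
    train_nonresp = nonresponders.sample(n=len(train_resp), random_state=seed)  # equal number
    train = pd.concat([train_resp, train_nonresp])

    valid_resp = responders.drop(train_resp.index)            # remaining responders
    # add non-responders so validation keeps the original responder/non-responder ratio
    # (assumes the responder class is rare, so enough non-responders remain)
    ratio = len(nonresponders) / len(responders)
    valid_nonresp = (nonresponders.drop(train_nonresp.index)
                     .sample(n=int(round(ratio * len(valid_resp))), random_state=seed))
    valid = pd.concat([valid_resp, valid_nonresp])
    return train, valid
```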

Slide61

Assessing model performance

Method 1: score the model with a validation dataset selected without oversampling

Method 2: score the model with an oversampled validation dataset and reweight the results to remove the effects of oversampling

Method 1 is straightforward and easier to implement.

Slide62

Adjusting confusion matrix for oversampling

                  Whole data   Sample
Responders            2%         50%
Nonresponders        98%         50%

Example:

One responder in the whole data = 50/2 = 25 in the sample

One nonresponder in the whole data = 50/98 = 0.5102 in the sample

Slide63

Adjusting confusion matrix for oversampling

Suppose that the confusion matrix with the oversampled validation dataset is as follows:

Classification matrix, oversampled data (validation)

             Predicted 0   Predicted 1   Total
Actual 0         390           110         500
Actual 1          80           420         500
Total            470           530        1000

Misclassification rate = (80 + 110)/1000 = 0.19 or 19%

Percentage of records predicted as “1” = 530/1000 = 0.53 or 53%

Slide64

Adjusting confusion matrix for oversampling

Classification matrix, reweighted

Weight for a responder (Actual 1) = 25; weight for a nonresponder (Actual 0) = 0.5102

             Predicted 0            Predicted 1            Total
Actual 0     390/0.5102 = 764.4     110/0.5102 = 215.6       980
Actual 1     80/25 = 3.2            420/25 = 16.8             20
Total        767.6                  232.4                   1000

Misclassification rate = (3.2 + 215.6)/1000 = 0.219 or 21.9%

Percentage of records predicted as “1” = 232.4/1000 = 0.2324 or 23.24%
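A small sketch of this reweighting, using the numbers above (the weights are the sample-vs-whole-data factors from the previous slide):

```python
import numpy as np

def reweight_confusion(confusion, weight_actual0, weight_actual1):
    """Divide each row of the oversampled confusion matrix by its class's
    oversampling weight to estimate counts on the original (whole-data) scale."""
    reweighted = np.array(confusion, dtype=float)
    reweighted[0, :] /= weight_actual0    # Actual 0 row
    reweighted[1, :] /= weight_actual1    # Actual 1 row
    return reweighted

# rows = actual 0/1, columns = predicted 0/1 (oversampled validation data)
oversampled = [[390, 110],
               [ 80, 420]]
rw = reweight_confusion(oversampled, weight_actual0=0.5102, weight_actual1=25)
print(rw)                                      # approximately [[764.4, 215.6], [3.2, 16.8]]
misclass_rate = (rw[0, 1] + rw[1, 0]) / rw.sum()
print(round(misclass_rate, 3))                 # ≈ 0.219
```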

Slide65

Adjusting Lift curve for Oversampling

Sort records in descending order of probability of success (success = belonging to class of interest)

For each record compute cost/benefit with actual outcome

Divide the value by the oversampling rate of the actual class

Compute a cumulative column of weighted cost/benefit

Plot the cumulative weighted cost/benefit on the y-axis and the row number (no. of records) on the x-axis

Slide66

Adjusting Lift curve for Oversampling -- Example

Suppose that the cost/benefit values and oversampling weights are as follows:

                Cost/Benefit   Oversampling weight
Actual 0            -3               0.6
Actual 1            50                20

Case No.   Propensity   Actual class   Weighted cost/benefit   Cumulative weighted cost/benefit
(Propensity = predicted probability of belonging to class "1"; weighted cost/benefit = cost/benefit divided by the oversampling weight of the actual class)

 1   0.9959767   1    2.50     2.50
 2   0.9875331   1    2.50     5.00
 3   0.9844564   1    2.50     7.50
 4   0.9804396   1    2.50    10.00
 5   0.9481164   1    2.50    12.50
 6   0.8892972   1    2.50    15.00
 7   0.8476319   1    2.50    17.50
 8   0.7628063   0   -5.00    12.50
 9   0.7069919   1    2.50    15.00
10   0.6807541   1    2.50    17.50
11   0.6563437   1    2.50    20.00
12   0.6224195   0   -5.00    15.00
13   0.5055069   1    2.50    17.50
14   0.4713405   0   -5.00    12.50
15   0.3371174   0   -5.00     7.50
16   0.2179678   1    2.50    10.00
17   0.1992404   0   -5.00     5.00
18   0.1494827   0   -5.00     0.00
19   0.0479626   0   -5.00    -5.00
20   0.0383414   0   -5.00   -10.00
21   0.0248510   0   -5.00   -15.00
22   0.0218060   0   -5.00   -20.00
23   0.0161299   0   -5.00   -25.00
24   0.0035600   0   -5.00   -30.00

Slide67

Adjusting Lift curve for Oversampling -- Example