
Model Evaluation

CRISP-DM

CRISP-DM Phases

Business Understanding

Initial phase

Focuses on:

Understanding the project objectives and requirements from a business perspective

Converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives

Data Understanding

Starts with an initial data collection

Proceeds with activities aimed at getting familiar with the data, identifying data quality problems, discovering first insights into the data, and detecting interesting subsets to form hypotheses for hidden information

CRISP-DM Phases

Data Preparation

Covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data

Data preparation tasks are likely to be performed multiple times, and not in any prescribed order

Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools

Modeling

Various modeling techniques are selected and applied, and their parameters are calibrated to optimal values

Typically, there are several techniques for the same data mining problem type

Some techniques have specific requirements on the form of data; therefore, stepping back to the data preparation phase is often needed

CRISP-DM Phases

Evaluation

At this stage, a model (or models) that appears to have high quality, from a data analysis perspective, has been built

Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives

A key objective is to determine if there is some important business issue that has not been sufficiently considered

At the end of this phase, a decision on the use of the data mining results should be reached

CRISP-DM Phases

Deployment

Creation of the model is generally not the end of the project

Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it

Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process

In many cases it will be the customer, not the data analyst, who will carry out the deployment steps

However, even if the analyst will not carry out the deployment effort, it is important for the customer to understand up front what actions will need to be carried out in order to actually make use of the created models

Evaluating Classification Systems

Two issues

What evaluation measure should we use?

How do we ensure reliability of our model?

Evaluation

How do we ensure reliability of our model?

How do we ensure reliability?

Heavily dependent on the training data

Data Partitioning

Randomly partition the data into a training set and a test set

Training set: the data used to train/build the model. Estimate parameters (e.g., for a linear regression), build a decision tree, build an artificial neural network, etc.

Test set: a set of examples not used for model induction; the model's performance is evaluated on unseen data. Also known as out-of-sample data.

Generalization error: the model's error on the test data.
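As a concrete sketch of this partitioning (not from the slides; the data and model below are synthetic and purely illustrative), scikit-learn's train_test_split can hold out a test set and estimate the generalization error:

```python
# Minimal holdout sketch: train on one partition, estimate error on the other.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                          # synthetic features (illustrative)
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # synthetic labels (illustrative)

# Hold out 30% as unseen test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
test_error = 1 - accuracy_score(y_test, model.predict(X_test))
print(f"Estimated generalization error: {test_error:.3f}")
```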

Complexity and Generalization

[Figure: a score function (e.g., squared error) plotted against model complexity, where complexity is the number of degrees of freedom in the model (e.g., the number of variables). The training score S_train(θ) and the test score S_test(θ) are shown, with the optimal model complexity marked where the test score is lowest.]

Holding out data

The holdout method reserves a certain amount of the data for testing and uses the remainder for training

Usually: one third for testing, the rest for training

For "unbalanced" datasets, random samples might not be representative: few or no instances of some classes

Stratified sample: make sure that each class is represented with approximately equal proportions in both subsets

Repeated holdout method

Holdout estimate can be made more reliable by repeating the process with different subsamples

In each iteration, a certain proportion is randomly selected for training (possibly with stratification)

The error rates on the different iterations are averaged to yield an overall error rate

This is called the repeated holdout method
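A minimal sketch of repeated holdout, assuming scikit-learn; StratifiedShuffleSplit draws several stratified random train/test partitions, and the per-split error rates are averaged (synthetic data, illustrative only):

```python
# Repeated (stratified) holdout: average the error over several random splits.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))                          # synthetic features
y = (X[:, 0] + rng.normal(size=600) > 0).astype(int)   # synthetic labels

splitter = StratifiedShuffleSplit(n_splits=10, test_size=1/3, random_state=0)
errors = []
for train_idx, test_idx in splitter.split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))
print(f"Repeated-holdout error estimate: {np.mean(errors):.3f}")
```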

Cross-validation

Most popular and effective type of repeated holdout is cross-validation

Cross-validation avoids overlapping test sets

First step: the data is split into k subsets of equal size

Second step: each subset in turn is used for testing and the remainder for training

This is called k-fold cross-validation

Often the subsets are stratified before the cross-validation is performed

Cross-validation example:

[Figure: an illustration of the folds; each subset is held out in turn while the rest is used for training.]
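A small k-fold cross-validation sketch (synthetic data, with a logistic regression standing in for whatever model is being evaluated); each fold is used once for testing and the per-fold errors are averaged:

```python
# Stratified k-fold cross-validation: every fold serves once as the test set.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
errors = []
for train_idx, test_idx in kf.split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))
print(f"10-fold CV error estimate: {np.mean(errors):.3f}")
```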

More on cross-validation

Standard data-mining method for evaluation: stratified ten-fold cross-validation

Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate

Stratification reduces the estimate's variance

Even better: repeated stratified cross-validation

E.g., ten-fold cross-validation is repeated ten times and the results are averaged (reduces the sampling variance)

The error estimate is the mean across all repetitions
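Repeated stratified ten-fold cross-validation can be sketched with scikit-learn's RepeatedStratifiedKFold; the data here is again synthetic and only illustrative:

```python
# Ten repetitions of stratified ten-fold CV, i.e. 100 accuracy values averaged.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)   # 100 accuracy scores
print(f"Error estimate: {1 - scores.mean():.3f} (spread {scores.std():.3f})")
```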

Leave-One-Out cross-validation

Leave-One-Out: a particular form of cross-validation

Set the number of folds to the number of training instances

I.e., for n training instances, build the classifier n times

Makes the best use of the data

Involves no random subsampling

Computationally expensive, but good performance
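A leave-one-out sketch, assuming scikit-learn's LeaveOneOut splitter (synthetic data; note that n model fits are required):

```python
# Leave-one-out CV: n models, each tested on a single held-out instance.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print(f"LOO error estimate: {1 - scores.mean():.3f}")   # mean over 100 single-instance tests
```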

Leave-One-Out-CV and stratification

Disadvantage of Leave-One-Out-CV: stratification is not possible

It guarantees a non-stratified sample because there is only one instance in the test set!

Extreme example: a random dataset split equally into two classes

The best model predicts the majority class

That gives 50% accuracy on fresh data, but the Leave-One-Out-CV estimate is 100% error (leaving one instance out makes the opposite class the majority in the training data, so every held-out prediction is wrong)!

Three way data splits

One problem with CV is that, since the data is used jointly to fit the model and estimate its error, the error estimate can be biased downward.

If the goal is a real estimate of error (as opposed to which model is best), you may want a three way split:

Training set: examples used for learning

Validation set: used to tune parameters

Test set: never used in the model fitting process; used at the end for an unbiased estimate of the holdout error
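A rough sketch of such a three-way split using two calls to train_test_split; the 60/20/20 proportions are an assumption, not a prescription:

```python
# Three-way split: training / validation / test.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

# First carve off the final test set; never touch it while tuning.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Then split the remainder into training and validation sets for tuning.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # roughly 600 / 200 / 200
```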

The Bootstrap

The statistician Brad Efron proposed a very simple and clever idea for mechanically estimating confidence intervals:

The Bootstrap

The idea is to take multiple resamples of your original dataset

Compute the statistic of interest on each resample

You thereby estimate the distribution of this statistic!

Sampling with Replacement

Draw a data point at random from the data set.

Then throw it back in

Draw a second data point.

Then throw it back in…

Keep going until we've got 1000 data points

You might call this a "pseudo" data set

This is not merely re-sorting the data

Some of the original data points will appear more than once; others won't appear at all

Sampling with Replacement

In fact, there is a chance of (1 - 1/1000)^1000 ≈ 1/e ≈ 0.368 that any one of the original data points won't appear at all if we sample with replacement 1000 times

So any data point is included with probability ≈ 0.632

Intuitively, we treat the original sample as the "true population in the sky"

Each resample simulates the process of taking a sample from the "true" distribution
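A minimal bootstrap sketch for a simple statistic (the mean of a hypothetical sample): resample with replacement, recompute the statistic, and read a confidence interval off the resulting distribution:

```python
# Bootstrap estimate of the sampling distribution of the mean.
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)   # hypothetical original sample

boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()   # one resample with replacement
    for _ in range(5000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"Bootstrap 95% CI for the mean: ({lo:.3f}, {hi:.3f})")
```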

Bootstrapping & Validation

This is interesting in its own right.

But bootstrapping also relates back to model validation, along the lines of cross-validation

You can fit models on bootstrap resamples of your data

For each resample, test the model on the ≈ 36.8% of the data not in your resample

The estimate will be biased, but corrections are available

Get a spectrum of ROC curves
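A sketch of this bootstrap-validation idea (synthetic data; logistic regression and AUC are illustrative choices, and no bias correction is applied): fit on each resample and score the model on the instances the resample left out:

```python
# Bootstrap validation: train on each resample, test on the left-out instances.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

aucs = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap resample indices
    oob = np.setdiff1d(np.arange(len(X)), idx)       # instances not drawn at all (~36.8%)
    model = LogisticRegression().fit(X[idx], y[idx])
    aucs.append(roc_auc_score(y[oob], model.predict_proba(X[oob])[:, 1]))
print(f"Left-out AUC: mean {np.mean(aucs):.3f}, spread {np.std(aucs):.3f}")
```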

Closing Thoughts

The "cross-validation" approach has several nice features:

Relies on the data, not likelihood theory, etc.

Comports nicely with the lift curve concept

Allows model validation that has both business and statistical meaning

Is generic: can be used to compare models generated from competing techniques… or even pre-existing models

Can be performed on different sub-segments of the data

Is very intuitive, easily grasped

Closing Thoughts

Bootstrapping has a family resemblance to cross-validation: use the data to estimate features of a statistic or a model that we previously relied on statistical theory to give us

Classic examples of the "data mining" (in the non-pejorative sense of the term!) mindset: leverage modern computers to "do it yourself" rather than look up a formula in a book!

Generic tools that can be used creatively

Can be used to estimate model bias and variance

Can be used to estimate (simulate) distributional characteristics of very difficult statistics

Ideal for many actuarial applications

Metrics

What evaluation measure should we use?

Evaluation of Classification

Accuracy = (a+d) / (a+b+c+d)

Not always the best choice

Assume 1% fraud,

model predicts no fraud

What is the accuracy?

Confusion matrix (rows = predicted outcome, columns = actual outcome):

                        actual 1    actual 0
  predicted 1              a           b
  predicted 0              c           d

Fraud example (the model predicts "No Fraud" for everyone):

                        actual Fraud    actual No Fraud
  predicted Fraud            0                 0
  predicted No Fraud        10               990
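The fraud example can be checked directly; a model that predicts "no fraud" for everyone is 99% accurate yet catches nothing (the labels below reproduce the 1% fraud setup):

```python
# Accuracy can be misleading on imbalanced data.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 10 + [0] * 990)   # 1% fraud, as in the slide
y_pred = np.zeros_like(y_true)            # model predicts "no fraud" everywhere

print(accuracy_score(y_true, y_pred))     # 0.99 -- looks great
print(recall_score(y_true, y_pred))       # 0.0  -- not a single fraud case caught
```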

Evaluation of Classification

Other options:

Recall or sensitivity (how many of those that are really positive did you predict?): a/(a+c)

Precision (how many of those predicted positive really are?): a/(a+b)

Precision and recall are always in tension: increasing one tends to decrease the other

(Using the same confusion matrix as above: rows = predicted outcome, columns = actual outcome.)

Evaluation of Classification

Yet another option:

Recall or sensitivity (how many of the positives did you get right?): a/(a+c)

Specificity (how many of the negatives did you get right?): d/(b+d)

Sensitivity and specificity have the same tension

Different fields use different metrics

(Again with rows = predicted outcome and columns = actual outcome.)
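A small sketch that counts a, b, c, d from hypothetical label vectors (rows = predicted, columns = actual, as above) and computes these metrics:

```python
# Recall, precision and specificity from the four confusion-matrix cells.
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # hypothetical actual outcomes
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])   # hypothetical predictions

a = np.sum((y_pred == 1) & (y_true == 1))   # true positives
b = np.sum((y_pred == 1) & (y_true == 0))   # false positives
c = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
d = np.sum((y_pred == 0) & (y_true == 0))   # true negatives

recall      = a / (a + c)   # sensitivity: 3/4 = 0.75
precision   = a / (a + b)   # 3/4 = 0.75
specificity = d / (b + d)   # 5/6 ≈ 0.83
print(recall, precision, specificity)
```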

Evaluation for a Thresholded Response

Many classification models output probabilities

These probabilities get thresholded to make a prediction.

Classification accuracy depends on the threshold – good models give low probabilities to Y=0 and high probabilities to Y=1.

Test Data

The model's predicted probabilities are thresholded and compared with the actual outcomes. Suppose we use a cutoff of 0.5; the resulting confusion matrix (rows = predicted outcome, columns = actual outcome) is:

                  actual 1    actual 0
  predicted 1        8           3
  predicted 0        0           9

With a cutoff of 0.5:

sensitivity = 8/(8+0) = 100%

specificity = 9/(9+3) = 75%

We want both of these to be high.

Suppose we use a cutoff of 0.8 instead (rows = predicted outcome, columns = actual outcome):

                  actual 1    actual 0
  predicted 1        6           2
  predicted 0        2          10

sensitivity = 6/(6+2) = 75%

specificity = 10/(10+2) ≈ 83%
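The same cutoff sweep can be scripted; the predicted probabilities and outcomes below are hypothetical, not the slide's test data:

```python
# How the choice of cutoff moves sensitivity and specificity.
import numpy as np

p_hat  = np.array([0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.55, 0.45, 0.4, 0.3, 0.2, 0.1])
y_true = np.array([1,    1,   1,    0,   1,   0,   0,    1,    0,   0,   0,   0  ])

for cutoff in (0.5, 0.8):
    y_pred = (p_hat >= cutoff).astype(int)
    sens = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
    spec = np.sum((y_pred == 0) & (y_true == 0)) / np.sum(y_true == 0)
    print(f"cutoff {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```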

Note there are 20 possible thresholds

Plotting all values of sensitivity vs. specificity gives a sense of model performance by seeing the tradeoff with different thresholds

Note: if the threshold is at the minimum, c = d = 0, so sensitivity = 1 and specificity = 0

If the threshold is at the maximum, a = b = 0, so sensitivity = 0 and specificity = 1

If the model is perfect, sensitivity = 1 and specificity = 1

ROC curve plots sensitivity vs. (1-specificity) – also known as false positive rate

Always goes from (0,0) to (1,1)

The more area in the upper left, the better

Random model is on the diagonal

Area under the curve (AUC) is a common measure of predictive performance
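A sketch of computing and plotting an ROC curve with scikit-learn (synthetic data and model; matplotlib is used only for the plot):

```python
# ROC curve and AUC for a probabilistic classifier on held-out data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)
print(f"AUC: {roc_auc_score(y_te, scores):.3f}")

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="random")   # diagonal = random model
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```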

ROC Curves

Receiver Operating Characteristic curve

ROC curves were developed in the 1950s as a by-product of research into making sense of radio signals contaminated by noise. More recently it has become clear that they are remarkably useful in decision-making.

They are a performance graphing method: the true positive and false positive fractions are plotted as we move the dividing threshold.

ROC Space

ROC graphs are two-dimensional graphs in which TP rate is plotted on the Y axis and FP rate is plotted on the X axis.

An ROC graph depicts relative trade-offs between benefits (true positives) and costs (false positives).

The figure shows an ROC graph with five classifiers labeled A through E.

A discrete classifier is one that outputs only a class label.

Each discrete classifier produces an (FP rate, TP rate) pair corresponding to a single point in ROC space. The classifiers in the figure are all discrete classifiers.

Several Points in ROC Space

The lower left point (0, 0) represents the strategy of never issuing a positive classification; such a classifier commits no false positive errors but also gains no true positives.

The upper right corner (1, 1) represents the opposite strategy of unconditionally issuing positive classifications.

Point (0, 1) represents perfect classification. D's performance is perfect as shown.

Informally, one point in ROC space is better than another if it is to the northwest of the first: its TP rate is higher, its FP rate is lower, or both.

Specific Example

[Figure: distributions of a test result for patients with the disease and patients without the disease.]

[Figure: a threshold on the test result divides the patients; those below the threshold are called negative, those above it are called positive.]

Some definitions: the true positives are the patients with the disease whose test result falls above the threshold (i.e., who are called positive).

The false positives are the patients without the disease whose test result falls above the threshold (called positive).

The true negatives are the patients without the disease whose test result falls below the threshold (called negative).

The false negatives are the patients with the disease whose test result falls below the threshold (called negative).

Moving the threshold to the right: fewer patients are called "+", so there are fewer false positives but more false negatives.

Moving the threshold to the left: more patients are called "+", so there are more false positives but fewer false negatives.

ROC curve

[Figure: the ROC curve plots the true positive rate (sensitivity, 0% to 100%) against the false positive rate (1 - specificity, 0% to 100%) as the threshold moves.]

ROC curve comparison

[Figure: two ROC curves on true positive rate vs. false positive rate axes. A good test bows toward the upper left corner; a poor test stays close to the diagonal.]

ROC curve extremes

[Figure: best test vs. worst test. For the best test, the two distributions don't overlap at all and the ROC curve passes through the upper left corner; for the worst test, the distributions overlap completely and the ROC curve lies on the diagonal.]

How to Construct ROC Curve for one Classifier

Sort the instances according to their Ppos.

Move a threshold across the sorted instances.

For each threshold, define a classifier with its confusion matrix.

Plot the TP and FP rates of these classifiers.

  Ppos    True Class
  0.99    pos
  0.98    pos
  0.7     neg
  0.6     pos
  0.43    neg

For example, a threshold between 0.7 and 0.6 gives (rows = predicted, columns = true):

                  true pos    true neg
  predicted pos      2           1
  predicted neg      1           1
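A manual sketch of this construction using the five scores above: sort by Ppos, sweep the threshold past one instance at a time, and record the (FP rate, TP rate) pair at each step:

```python
# Build the ROC points by sweeping a threshold over scores sorted by Ppos.
import numpy as np

p_pos  = np.array([0.99, 0.98, 0.7, 0.6, 0.43])   # scores from the slide
y_true = np.array([1,    1,    0,   1,   0])       # pos = 1, neg = 0

order = np.argsort(-p_pos)            # sort descending by Ppos
y_sorted = y_true[order]
P, N = y_true.sum(), len(y_true) - y_true.sum()

points = [(0.0, 0.0)]
tp = fp = 0
for label in y_sorted:                # each step lowers the threshold past one instance
    tp += label
    fp += 1 - label
    points.append((fp / N, tp / P))   # (FP rate, TP rate) at this threshold
print(points)   # the threshold between 0.7 and 0.6 gives the 2/1/1/1 matrix above
```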

Creating an ROC Curve

A classifier produces a single ROC point.

If the classifier has a "sensitivity" parameter, varying it produces a series of ROC points (confusion matrices).

Alternatively, if the classifier is produced by a learning algorithm, a series of ROC points can be generated by varying the class ratio in the training set.

ROC for one Classifier

[Figures: example ROC curves for a single classifier at varying quality levels.]

Good separation between the classes: convex curve.

Reasonable separation between the classes: mostly convex.

Fairly poor separation between the classes: mostly convex.

Poor separation between the classes: large and small concavities.

Random performance.

The AUC Metric

The area under ROC curve (AUC) assesses the ranking in terms of separation of the classes.

AUC estimates the probability that a randomly chosen positive instance will be ranked before a randomly chosen negative instance.
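That ranking interpretation can be checked directly: the fraction of (positive, negative) pairs in which the positive instance receives the higher score matches roc_auc_score (the scores below are synthetic):

```python
# AUC as the probability that a positive outranks a negative.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
scores = y + rng.normal(scale=1.5, size=200)   # noisy scores, hypothetical

pos, neg = scores[y == 1], scores[y == 0]
pairwise = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()
print(pairwise, roc_auc_score(y, scores))      # the two estimates agree
```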

Comparing Models

Highest AUC wins

But pay attention to Occam's Razor: "the best theory is the smallest one that describes all the facts"

Also known as the "parsimony principle"

If two models are similar, pick the simpler one