Model Evaluation
CRISP-DM
CRISP-DM Phases
Business Understanding
Initial phase
Focuses on:
Understanding the project objectives and requirements from a business perspective
Converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives
Data Understanding
Starts with an initial data collection
Proceeds with activities aimed at:
Getting familiar with the data
Identifying data quality problems
Discovering first insights into the data
Detecting interesting subsets to form hypotheses for hidden information
CRISP-DM Phases
Data Preparation
Covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data
Data preparation tasks are likely to be performed multiple times, and not in any prescribed order
Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools
Modeling
Various modeling techniques are selected and applied, and their parameters are calibrated to optimal values
Typically, there are several techniques for the same data mining problem type
Some techniques have specific requirements on the form of data; therefore, stepping back to the data preparation phase is often needed
CRISP-DM Phases
Evaluation
At this stage, a model (or models) that appears to have high quality, from a data analysis perspective, has been built
Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives
A key objective is to determine if there is some important business issue that has not been sufficiently considered
At the end of this phase, a decision on the use of the data mining results should be reached
CRISP-DM Phases
Deployment
Creation of the model is generally not the end of the project
Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it
Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process
In many cases it will be the customer, not the data analyst, who will carry out the deployment steps
However, even if the analyst will not carry out the deployment effort, it is important for the customer to understand up front what actions will need to be carried out in order to actually make use of the created models
Evaluating Classification Systems
Two issues
What evaluation measure should we use?
How do we ensure reliability of our model?
Evaluation
How do we ensure reliability of our model?
How do we ensure reliability?
Heavily dependent on the training data
Data Partitioning
Randomly partition data into training and test set
Training set – data used to train/build the model. Estimate parameters (e.g., for a linear regression), build a decision tree, build an artificial neural network, etc.
Test set – a set of examples not used for model induction. The model's performance is evaluated on unseen data. Also known as out-of-sample data.
Generalization Error: Model error on the test data.
[Figure: the data set divided into a set of training examples and a set of test examples]
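A minimal sketch of this partition in Python (assuming scikit-learn is available; the synthetic data and logistic regression are stand-ins, not part of the slides):

```python
# Sketch of a random train/test partition (assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data

# Randomly hold out one third of the examples as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Train on the training set only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Generalization error: error rate on the unseen (out-of-sample) test data.
generalization_error = 1 - model.score(X_test, y_test)
print(f"generalization error ≈ {generalization_error:.3f}")
```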
Complexity and Generalization
[Figure: score function (e.g., squared error) S_train(θ) and S_test(θ) plotted against model complexity (degrees of freedom in the model, e.g., number of variables), with the optimal model complexity marked]
Holding out data
The holdout method reserves a certain amount of data for testing and uses the remainder for training
Usually: one third for testing, the rest for training
For “unbalanced” datasets, random samples might not be representative
Few or no instances of some classes
Stratified sample: make sure that each class is represented with approximately equal proportions in both subsets
Repeated holdout method
Holdout estimate can be made more reliable by repeating the process with different subsamples
In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
The error rates on the different iterations are averaged to yield an overall error rate
This is called the repeated holdout method
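A sketch of the repeated holdout method under the same assumptions (scikit-learn, synthetic data, an illustrative logistic regression):

```python
# Repeated (stratified) holdout: average the test error over several random splits.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

errors = []
for seed in range(10):                       # 10 different random subsamples
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y,     # stratification keeps class proportions
        random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    errors.append(1 - model.score(X_te, y_te))

print(f"repeated-holdout error estimate: {np.mean(errors):.3f} ± {np.std(errors):.3f}")
```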
Cross-validation
Most popular and effective type of repeated holdout is cross-validation
Cross-validation
avoids overlapping test sets
First step: data is split into k subsets of equal size
Second step: each subset in turn is used for testing and the remainder for training
This is called k-fold cross-validation
Often the subsets are stratified before the cross-validation is performed
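A possible k-fold cross-validation sketch, again assuming scikit-learn and placeholder data:

```python
# k-fold cross-validation: each of the k subsets is used once for testing.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

k = 10
cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)  # stratified folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(f"{k}-fold CV accuracy: {scores.mean():.3f} (error {1 - scores.mean():.3f})")
```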
Cross-validation example:
More on cross-validation
Standard data-mining method for evaluation: stratified ten-fold cross-validation
Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate
Stratification reduces the estimate's variance
Even better: repeated stratified cross-validation
E.g., ten-fold cross-validation is repeated ten times and results are averaged (reduces the sampling variance)
Error estimate is the mean across all repetitions
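Repeated stratified ten-fold cross-validation might look like this sketch (scikit-learn's RepeatedStratifiedKFold; data and model are placeholders):

```python
# Repeated stratified ten-fold cross-validation: 10 x 10 = 100 fits, results averaged.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# The error estimate is the mean across all repetitions; repetition reduces its variance.
print(f"error estimate: {1 - scores.mean():.3f}")
```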
Leave-One-Out cross-validation
Leave-One-Out: a particular form of cross-validation:
Set number of folds to number of training instances
I.e., for n training instances, build classifier n times
Makes best use of the data
Involves no random subsampling
Computationally expensive, but good performance
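A Leave-One-Out sketch under the same assumptions (note the small sample size, since n models must be fit):

```python
# Leave-One-Out CV: n folds for n instances, so the classifier is built n times.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)  # kept small: LOO is expensive

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(f"LOO error estimate: {1 - scores.mean():.3f}")  # mean over n single-instance test sets
```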
Leave-One-Out-CV and stratification
Disadvantage of Leave-One-Out-CV: stratification is not possible
It guarantees a non-stratified sample because there is only one instance in the test set!
Extreme example: random dataset split equally into two classes
Best model predicts the majority class
50% accuracy on fresh data
Leave-One-Out-CV estimate is 100% error! (With one instance held out, its class is the minority in the remaining training data, so the majority-class predictor always predicts the other class)
Three way data splits
One problem with CV: since the data is used jointly to fit the model and to estimate its error, the error estimate could be biased downward.
If the goal is a real estimate of error (as opposed to which model is best), you may want a three way split:
Training set: examples used for learning
Validation set: used to tune parameters
Test set: never used in the model fitting process; used at the end for an unbiased estimate of the holdout error
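The three-way split described above, as a rough sketch (the split proportions and the tuned parameter are illustrative choices, not prescriptions):

```python
# Three-way split: training (fit), validation (tune), test (final unbiased estimate).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 60% train, 20% validation, 20% test (proportions are just an example).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Tune a parameter (here, regularization strength C) on the validation set...
best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    acc = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc

# ...and touch the test set only once, at the very end.
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print(f"chosen C={best_C}, test error: {1 - final.score(X_test, y_test):.3f}")
```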
The Bootstrap
The statistician Brad Efron proposed a very simple and clever idea for mechanically estimating confidence intervals: the Bootstrap
The idea is to take multiple resamples of your original dataset
Compute the statistic of interest on each resample
You thereby estimate the distribution of this statistic!
Sampling with Replacement
Draw a data point at random from the data set.
Then throw it back in
Draw a second data point.
Then throw it back in…
Keep going until we've got 1000 data points.
You might call this a “pseudo” data set.
This is not merely re-sorting the data.
Some of the original data points will appear more than once; others won't appear at all.
Sampling with Replacement
In fact, there is a chance of (1 - 1/1000)^1000 ≈ 1/e ≈ .368 that any one of the original data points won't appear at all if we sample with replacement 1000 times.
Any data point is included with probability ≈ .632
Intuitively, we treat the original sample as the “true population in the sky”.
Each resample simulates the process of taking a sample from the “true” distribution.
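A small simulation of this calculation (plain NumPy; the 1000-point data set mirrors the slide's example):

```python
# Sampling with replacement: verify that each point is included with prob ≈ .632.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
print((1 - 1/n) ** n)          # ≈ 0.368: chance a given point is never drawn

# Simulate many bootstrap resamples and count how often point 0 appears at least once.
included = 0
n_resamples = 2000
for _ in range(n_resamples):
    resample = rng.integers(0, n, size=n)   # n draws with replacement ("pseudo" data set)
    if (resample == 0).any():
        included += 1
print(included / n_resamples)  # ≈ 0.632
```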
Bootstrapping & Validation
This is interesting in its own right.
But bootstrapping also relates back to model validation.
Along the lines of cross-validation.
You can fit models on bootstrap resamples of your data.
For each resample, test the model on the ≈ .368 of the data not in your resample.
Will be biased, but corrections are available.
Get a spectrum of ROC curves.
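A rough sketch of bootstrap validation along these lines (scikit-learn plus NumPy; the data, model, and number of resamples are illustrative assumptions):

```python
# Bootstrap validation sketch: fit on each resample, test on the ~36.8% left out.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
rng = np.random.default_rng(0)
n = len(y)

oob_errors = []
for _ in range(50):                                    # 50 bootstrap resamples
    idx = rng.integers(0, n, size=n)                   # sample rows with replacement
    oob = np.setdiff1d(np.arange(n), idx)              # points not in this resample
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    oob_errors.append(1 - model.score(X[oob], y[oob]))

print(f"bootstrap out-of-sample error: {np.mean(oob_errors):.3f}")  # pessimistically biased; corrections exist
```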
Closing Thoughts
The “cross-validation” approach has several nice features:
Relies on the data, not likelihood theory, etc.
Comports nicely with the lift curve concept.
Allows model validation that has both business & statistical meaning.
Is generic: can be used to compare models generated from competing techniques… or even pre-existing models
Can be performed on different sub-segments of the data
Is very intuitive, easily grasped.
Closing Thoughts
Bootstrapping has a family resemblance to cross-validation:
Use the data to estimate features of a statistic or a model that we previously relied on statistical theory to give us.
Classic examples of the “data mining” (in the non-pejorative sense of the term!) mindset:
Leverage modern computers to “do it yourself” rather than look up a formula in a book!
Generic tools that can be used creatively.
Can be used to estimate model bias & variance.
Can be used to estimate (simulate) distributional characteristics of very difficult statistics.
Ideal for many actuarial applications.
Metrics
What evaluation measure should we use?
Evaluation of Classification
Accuracy = (a+d) / (a+b+c+d)
Not always the best choice
Assume 1% fraud, model predicts no fraud
What is the accuracy?
                  actual outcome
                    1       0
predicted    1      a       b
outcome      0      c       d

                      Actual Class
                    Fraud   No Fraud
Predicted    Fraud      0          0
Class     No Fraud     10        990
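The fraud example worked through in code (a sketch using scikit-learn's accuracy_score; fraud is encoded as 1):

```python
# Accuracy for the fraud example: 10 frauds in 1000 cases, model predicts "no fraud" for all.
import numpy as np
from sklearn.metrics import accuracy_score

y_actual = np.array([1] * 10 + [0] * 990)   # 1 = fraud, 0 = no fraud
y_pred = np.zeros(1000, dtype=int)          # the model never predicts fraud

print(accuracy_score(y_actual, y_pred))     # 0.99: high accuracy, useless model
```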
Evaluation of Classification
Other options:
Recall or sensitivity (how many of those that are really positive did you predict?): a/(a+c)
Precision (how many of those predicted positive really are?): a/(a+b)
Precision and recall are always in tension
Increasing one tends to decrease the other
                  actual outcome
                    1       0
predicted    1      a       b
outcome      0      c       d
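A sketch of recall and precision computed both from the cells and with scikit-learn (note that scikit-learn's confusion_matrix puts actual classes in rows, the transpose of the table above; the tiny label vectors are made up):

```python
# Recall a/(a+c) and precision a/(a+b) from the confusion matrix cells.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_actual = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred   = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# sklearn's confusion_matrix returns rows = actual, columns = predicted:
# [[tn (d), fp (b)], [fn (c), tp (a)]]
(d, b), (c, a) = confusion_matrix(y_actual, y_pred)

print("recall   :", a / (a + c), recall_score(y_actual, y_pred))
print("precision:", a / (a + b), precision_score(y_actual, y_pred))
```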
Evaluation of Classification
Yet another option:
Recall or sensitivity (how many of the positives did you get right?): a/(a+c)
Specificity (how many of the negatives did you get right?): d/(b+d)
Sensitivity and specificity have the same tension
Different fields use different metrics
                  actual outcome
                    1       0
predicted    1      a       b
outcome      0      c       d
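Sensitivity and specificity from the same cells, as a sketch (made-up labels again):

```python
# Sensitivity a/(a+c) and specificity d/(b+d) from the confusion matrix cells.
import numpy as np
from sklearn.metrics import confusion_matrix

y_actual = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred   = np.array([1, 1, 0, 1, 0, 0, 0, 0])

(d, b), (c, a) = confusion_matrix(y_actual, y_pred)   # sklearn: rows = actual, cols = predicted

sensitivity = a / (a + c)   # fraction of actual positives predicted positive
specificity = d / (b + d)   # fraction of actual negatives predicted negative
print(sensitivity, specificity)
```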
Evaluation for a Thresholded Response
Many classification models output probabilities
These probabilities get thresholded to make a prediction.
Classification accuracy depends on the threshold – good models give low probabilities to Y=0 and high probabilities to Y=1.
Test Data
[Figure: predicted probabilities for the test examples, together with their actual outcomes]
Suppose we use a cutoff of 0.5…
Suppose we use a cutoff of 0.5…

                  actual outcome
                    1       0
predicted    1      8       3
outcome      0      0       9

sensitivity: 8/(8+0) = 100%
specificity: 9/(9+3) = 75%
We want both of these to be high
Suppose we use a cutoff of 0.8…

                  actual outcome
                    1       0
predicted    1      6       2
outcome      0      2      10

sensitivity: 6/(6+2) = 75%
specificity: 10/(10+2) = 83%
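A sketch of how the cutoff changes sensitivity and specificity (the probabilities below are made up for illustration; they are not the slide's test data):

```python
# Thresholding predicted probabilities: sensitivity/specificity change with the cutoff.
import numpy as np

p_hat    = np.array([0.95, 0.90, 0.85, 0.75, 0.65, 0.55, 0.45, 0.30, 0.20, 0.10])
y_actual = np.array([1,    1,    1,    0,    1,    0,    0,    0,    0,    0   ])

for cutoff in (0.5, 0.8):
    y_pred = (p_hat >= cutoff).astype(int)
    tp = np.sum((y_pred == 1) & (y_actual == 1))
    fp = np.sum((y_pred == 1) & (y_actual == 0))
    fn = np.sum((y_pred == 0) & (y_actual == 1))
    tn = np.sum((y_pred == 0) & (y_actual == 0))
    print(f"cutoff {cutoff}: sensitivity {tp/(tp+fn):.2f}, specificity {tn/(tn+fp):.2f}")
```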
Note there are 20 possible thresholds
Plotting all values of sensitivity vs. specificity gives a sense of model performance by showing the tradeoff across different thresholds
Note: if threshold = minimum, c = d = 0, so sens = 1; spec = 0
If threshold = maximum, a = b = 0, so sens = 0; spec = 1
If the model is perfect, sens = 1; spec = 1

                  actual outcome
                    1       0
predicted    1      a       b
outcome      0      c       d
ROC curve plots sensitivity vs. (1-specificity) – also known as false positive rate
Always goes from (0,0) to (1,1)
The more area in the upper left, the better
Random model is on the diagonal
“Area under the curve” (AUC) is a common measure of predictive performance
ROC Curves
Receiver Operating Characteristic curve
ROC curves were developed in the 1950s as a by-product of research into making sense of radio signals contaminated by noise. More recently it has become clear that they are remarkably useful in decision-making.
They are a performance graphing method.
True positive and false positive fractions are plotted as we move the dividing threshold. They look like:
ROC Space
ROC graphs are two-dimensional graphs in which TP rate is plotted on the Y axis and FP rate is plotted on the X axis.
An ROC graph depicts relative trade-offs between benefits (true positives) and costs (false positives).
The figure shows an ROC graph with five classifiers labeled A through E.
A discrete classifier is one that outputs only a class label.
Each discrete classifier produces an (fp rate, tp rate) pair corresponding to a single point in ROC space. The classifiers in the figure are all discrete classifiers.
Several Points in ROC Space
Lower left point (0, 0) represents the strategy of never issuing a positive classification; such a classifier commits no false positive errors but also gains no true positives.
Upper right corner (1, 1) represents the opposite strategy, of unconditionally issuing positive classifications.
Point (0, 1) represents perfect classification. D's performance is perfect as shown.
Informally, one point in ROC space is better than another if it is to the northwest of the first: tp rate is higher, fp rate is lower, or both.
Specific Example
[Figure: distributions of the test result for patients with the disease and patients without the disease]
[Figure: a threshold on the test result divides patients into those called “negative” and those called “positive”]
Some definitions...
[Figure: patients with the disease who are called “positive” are True Positives]
[Figure: patients without the disease who are called “positive” are False Positives]
[Figure: patients without the disease who are called “negative” are True Negatives]
[Figure: patients with the disease who are called “negative” are False Negatives]
Moving the Threshold: right
[Figure: the threshold moved to the right on the test-result distributions for patients with and without the disease]
Moving the Threshold: left
[Figure: the threshold moved to the left on the test-result distributions for patients with and without the disease]
ROC curve
[Figure: ROC curve, True Positive Rate (sensitivity) from 0% to 100% plotted against False Positive Rate (1-specificity) from 0% to 100%]
ROC curve comparison
[Figure: ROC curves for a good test and a poor test, each plotting True Positive Rate (0%–100%) against False Positive Rate (0%–100%)]
ROC curve extremes
[Figure: the best test, where the two distributions don't overlap at all, and the worst test, where the distributions overlap completely, each shown as an ROC curve of True Positive Rate vs. False Positive Rate]
How to Construct ROC Curve for one Classifier
Sort the instances according to their Ppos.
Move a threshold on the sorted instances.
For each threshold, define a classifier with its confusion matrix.
Plot the TPr and FPr rates of the classifiers.

Ppos    True Class
0.99    pos
0.98    pos
0.70    neg
0.60    pos
0.43    neg

E.g., with the threshold at 0.7:

                 True
               pos   neg
Predicted pos    2     1
          neg    1     1
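A sketch of this construction on the five instances in the table (the threshold sweep is written out directly rather than using a library ROC helper):

```python
# Build an ROC curve for one classifier by sweeping a threshold over sorted Ppos.
import numpy as np

p_pos = np.array([0.99, 0.98, 0.70, 0.60, 0.43])   # already sorted descending
y_true = np.array([1, 1, 0, 1, 0])                  # pos = 1, neg = 0 (the slide's table)

P, N = y_true.sum(), (1 - y_true).sum()
points = [(0.0, 0.0)]                               # threshold above every score
for thr in p_pos:                                   # lower the threshold one instance at a time
    y_pred = (p_pos >= thr).astype(int)
    tpr = np.sum((y_pred == 1) & (y_true == 1)) / P    # true positive rate
    fpr = np.sum((y_pred == 1) & (y_true == 0)) / N    # false positive rate
    points.append((fpr, tpr))

print(points)   # e.g., the threshold at 0.7 gives the confusion matrix shown above
```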
Creating an ROC Curve
A classifier produces a single ROC point.
If the classifier has a “sensitivity” parameter, varying it produces a series of ROC points (confusion matrices).
Alternatively, if the classifier is produced by a learning algorithm, a series of ROC points can be generated by varying the class ratio in the training set.
ROC for one Classifier
Good separation between the classes, convex curve.
ROC for one Classifier
Reasonable separation between the classes, mostly convex.
ROC for one Classifier
Fairly poor separation between the classes, mostly convex.
ROC for one Classifier
Poor separation between the classes, large and small concavities.
ROC for one Classifier
Random performance.
The AUC Metric
The area under ROC curve (AUC) assesses the ranking in terms of separation of the classes.
AUC estimates the probability that a randomly chosen positive instance will be ranked before a randomly chosen negative instance.
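A sketch checking this interpretation on the small example above: the direct pairwise estimate agrees with scikit-learn's roc_auc_score:

```python
# AUC as the probability that a random positive is ranked above a random negative.
import numpy as np
from sklearn.metrics import roc_auc_score

p_pos = np.array([0.99, 0.98, 0.70, 0.60, 0.43])
y_true = np.array([1, 1, 0, 1, 0])

# Direct pairwise estimate (ties counted as 1/2):
pos, neg = p_pos[y_true == 1], p_pos[y_true == 0]
pairs = [(1.0 if p > n else 0.5 if p == n else 0.0) for p in pos for n in neg]
print(np.mean(pairs), roc_auc_score(y_true, p_pos))   # the two agree
```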
Comparing Models
Highest AUC wins
But pay attention to ‘Occam’s Razor’: ‘the best theory is the smallest one that describes all the facts’
Also known as the ‘parsimony principle’
If two models are similar, pick the simpler one