Slide 1
Introduction to Data Mining and Classification
F. Michael Speed, Ph.D.
Analytical Consultant
SAS Global Academic Program

Slide 2
Objectives
State one of the major principles underlying data mining
Give a high-level overview of three classification procedures

Slide 3
A Basic Principle of Data Mining
Splitting the data:
Training data set – required
Validation data set – required
Testing data set – optional

Slide 4
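SAS Enterprise Miner's Data Partition node performs this split for you; as a language-neutral illustration, here is a minimal pure-Python sketch (the 60/30/10 fractions and the fixed seed are just an example):

```python
import random

def split_data(rows, train_frac=0.6, valid_frac=0.3, seed=42):
    """Partition rows into training, validation, and (optional) test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)          # shuffle before splitting
    n = len(rows)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]            # empty if the fractions sum to 1
    return train, valid, test

train, valid, test = split_data(range(100))
print(len(train), len(valid), len(test))       # 60 30 10
```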
Training Set
For a given procedure (logistic regression, neural net, or decision tree), we use the training set to generate a sequence of models.
For example, if we use logistic regression, we get:

Training Data → Logistic Regression → Model 1, Model 2, …, Model q

Slide 5
How Do We Decide Which of the q Models Is Best?
We want the model with the fewest terms (most parsimonious).
We want the model with the best value of our criterion index – largest or smallest, depending on the index (adjusted R-square, misclassification rate, AIC, BIC, SBC, etc.).
We use the validation set to compute the criterion (fit index) for each model and then choose the “best.”

Slide 6
Compute the Fit Index for Each Model
Then find the “best” using a fixed fit index:

Model 1 + Validation Set → Fit Index 1
Model 2 + Validation Set → Fit Index 2
…
Model q + Validation Set → Fit Index q

Slide 7
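The selection loop above can be sketched in a few lines of Python (the candidate “models” and the validation data here are invented; misclassification rate serves as the fit index, so the smallest value wins):

```python
def misclassification_rate(model, valid_X, valid_y):
    """Fraction of validation cases the model predicts incorrectly."""
    wrong = sum(1 for x, y in zip(valid_X, valid_y) if model(x) != y)
    return wrong / len(valid_y)

# Three toy candidate models: classify x as 1 when it exceeds a cutoff.
models = {f"Model {q}": (lambda x, c=c: int(x > c))
          for q, c in enumerate([0.2, 0.5, 0.8], start=1)}

valid_X = [0.1, 0.3, 0.6, 0.9]
valid_y = [0, 0, 1, 1]

# Score every candidate on the same validation set, then pick the best.
scores = {name: misclassification_rate(m, valid_X, valid_y)
          for name, m in models.items()}
best = min(scores, key=scores.get)     # lowest rate wins for this index
print(best, scores[best])              # Model 2 0.0
```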
Fit Indices (Statistics)
Default — the default selection uses different statistics based on the type of target variable and whether a profit/loss matrix has been defined:
If a profit/loss matrix is defined for a categorical target, the average profit or average loss is used.
If no profit/loss matrix is defined for a categorical target, the misclassification rate is used.
If the target variable is interval, the average squared error is used.
Akaike's Information Criterion — chooses the model with the smallest Akaike's Information Criterion value.
Average Squared Error — chooses the model with the smallest average squared error value.
Mean Squared Error — chooses the model with the smallest mean squared error value.
ROC — chooses the model with the greatest area under the ROC curve.
Captured Response — chooses the model with the greatest captured response values using the decile range that is specified in the Selection Depth property.

Slide 8
Continued
Gain — chooses the model with the greatest gain using the decile range that is specified in the Selection Depth property.
Gini Coefficient — chooses the model with the highest Gini coefficient value.
Kolmogorov-Smirnov Statistic — chooses the model with the highest Kolmogorov-Smirnov statistic value.
Lift — chooses the model with the greatest lift using the decile range that is specified in the Selection Depth property.
Misclassification Rate — chooses the model with the lowest misclassification rate.
Average Profit/Loss — chooses the model with the greatest average profit/loss.
Percent Response — chooses the model with the greatest % response.
Cumulative Captured Response — chooses the model with the greatest cumulative % captured response.
Cumulative Lift — chooses the model with the greatest cumulative lift.
Cumulative Percent Response — chooses the model with the greatest cumulative % response.

Slide 9
Misclassification Rate (MR)

             Prediction = 0    Prediction = 1
Actual = 0   True Negative     False Positive
Actual = 1   False Negative    True Positive

MR = (FN + FP) / (TN + FP + FN + TP)

Slide 10
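The formula translates directly into code (a small Python helper; the counts in the example call are invented, not taken from the equity data):

```python
def misclassification_rate(tn, fp, fn, tp):
    """MR = (FN + FP) / (TN + FP + FN + TP)."""
    return (fn + fp) / (tn + fp + fn + tp)

# Hypothetical confusion-matrix counts: 25 errors out of 200 cases.
print(round(misclassification_rate(tn=90, fp=10, fn=15, tp=85), 3))  # 0.125
```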
Equity Data Set
The variable BAD = 1 if the borrower is a bad credit risk and 0 if not.
We want to build a model to predict whether a person is a bad credit risk.
Other variables:
Job, YOJ, Loan, DebtInc
Mortdue – how much they still owe on their mortgage
Value – assessed valuation
Derog – number of derogatory reports
Delinq – number of delinquent trade lines
Clage – age of oldest trade line
Ninq – number of recent credit inquiries
Clno – number of trade lines

Slide 11
Three Procedures
Decision Tree
Regression (Logistic)
Neural Network

Slide 12
Decision Tree
Very Simple to Understand
Easy to use
Easy to explain to the boss/supervisor

Slide 13
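As a rough analogue of the tree the slides build in SAS Enterprise Miner, here is a minimal sketch using scikit-learn's DecisionTreeClassifier (the features, labels, and depth limit are all hypothetical, not the equity data):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [age, has_derogatory_report], label 1 = bad credit risk (illustrative).
X = [[20, 1], [25, 0], [47, 1], [52, 0], [46, 1], [56, 1], [55, 0], [60, 1]]
y = [1, 1, 1, 0, 1, 0, 0, 0]

# Limiting max_depth plays the role of pruning back toward the "optimal" tree.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "has_derog"]))
print(tree.predict([[30, 1]]))
```

The printed rules are exactly why trees are easy to explain: each path is a plain if/then statement about the inputs.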
Example

Slide 14
Maximal Tree – Ignoring Validation Data

Slide 15
Optimal Tree

Slide 16
Continued

Slide 17
Fit Statistics

             Prediction = 0    Prediction = 1
Actual = 0   2266              146
Actual = 1   225               370

MR = (225 + 146) / 2981 = .124455

Slide 18
Logistic Regression
Since we observe a 0 or a 1, ordinary least squares is not an option.
We need a different approach.
The probability of getting a 1 depends upon X.
We write that as p(X).
Log odds = log(p(X)/(1 − p(X))) = a + bX

Slide 19
Logistic Graph – Solve for p(X)
[Plot of p(X) versus X: the S-shaped logistic curve]
Solving the log-odds equation for p(X) gives p(X) = e^(a+bX) / (1 + e^(a+bX)).

Slide 20
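A quick numeric check of that inversion (the coefficients a and b are made up for illustration):

```python
import math

def p_of_x(x, a=-3.0, b=1.5):
    """Invert log(p/(1-p)) = a + b*x  =>  p = 1 / (1 + exp(-(a + b*x)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

print(round(p_of_x(2.0), 3))        # a + b*2 = 0, so p = 0.5
print(p_of_x(-10) < 0.01)           # near 0 in the left tail  -> True
print(p_of_x(10) > 0.99)            # near 1 in the right tail -> True
```

The two tail checks confirm the S-shape seen on the slide: probabilities are squeezed into (0, 1), which is exactly what ordinary least squares cannot guarantee.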
Fit Statistics

Slide 21
MCR

             Prediction = 0    Prediction = 1
Actual = 0   2306              80
Actual = 1   332               263

MR = (332 + 80) / 2981 = .138209

Slide 22
Neural Net
Very complex mathematical equations
The meanings of the individual input variables cannot be interpreted from the final model
Often a good predictor of the response

Slide 23
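A rough analogue of Enterprise Miner's Neural Network node, sketched with scikit-learn's MLPClassifier (the toy data, hidden-layer size, and iteration limit are all assumptions; the talk itself uses SAS):

```python
from sklearn.neural_network import MLPClassifier

# Toy one-feature data: class 1 when the feature exceeds 0.5 (illustrative only).
X = [[i / 20] for i in range(21)] * 5
y = [int(row[0] > 0.5) for row in X]

# One hidden layer of 8 units; the fitted weights are not interpretable,
# which is the trade-off the bullet points above describe.
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=5000,
                    random_state=1).fit(X, y)
print("training accuracy:", net.score(X, y))
```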
Neural Net Diagram

Slide 24
Fit Statistics

Slide 25
MCR

             Prediction = 0    Prediction = 1
Actual = 0   2291              95
Actual = 1   288               307

MR = (288 + 95) / 2981 = .128480

Slide 26
Comparison

Slide 27
Enterprise Miner Interface

Slide 28
Enterprise Guide Interface

Slide 29
RPM

Slide 30
Continued

Slide 31
Continued

Slide 32
Fit Statistics

Slide 33
Summary
Divide your data into training and validation sets
We looked at decision trees, logistic regression, and neural nets
We also looked at RPM (Rapid Predictive Modeler)

Slide 34
Q&A