Introduction to Data Mining and Classification - PowerPoint Presentation

Uploaded on 2018-11-06

Presentation Transcript

Slide1

Introduction to Data Mining and Classification

F. Michael Speed, Ph.D.
Analytical Consultant
SAS Global Academic Program

Slide2

Objectives

State one of the major principles underlying data mining

Give a high-level overview of three classification procedures

Slide3

A Basic Principle of Data Mining

Splitting the data:

Training Data Set – this is a must
Validation Data Set – this is a must
Testing Data Set – this is optional

Slide4
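As a concrete illustration (not part of the original slides), the split can be sketched in a few lines of Python, assuming a simple random partition; the 60/40 proportion is only an example:

```python
import random

def split_data(rows, train_frac=0.6, seed=42):
    """Randomly partition rows into a training set and a validation set.
    A testing set would be a third, optional partition."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]

rows = list(range(100))
train, valid = split_data(rows)
print(len(train), len(valid))  # 60 40
```

The key point from the slide is that training and validation must be disjoint samples of the same data: every row lands in exactly one partition.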

Training Set

For a given procedure (logistic regression, neural net, or decision tree), we use the training set to generate a sequence of models.

For example, if we use logistic regression, we get:

Training Data → Logistic Regression → Model 1, Model 2, ..., Model q

Slide5

How Do We Decide Which of the q Models Is Best?

We want the model with the fewest terms (most parsimonious).

We want the model with the largest (or smallest, depending on the statistic) value of our criterion index (adjusted R-square, misclassification rate, AIC, BIC, SBC, etc.).

We use the validation set to compute the criterion (Fit Index) for each model and then choose the "best."

Slide6

Compute the Fit Index for Each Model

Then find the "best" using a fixed Fit Index:

Model 1 → Validation Set → Fit Index 1
Model 2 → Validation Set → Fit Index 2
...
Model q → Validation Set → Fit Index q

Slide7
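The diagram above amounts to scoring every candidate model on the same validation set and keeping the winner. A minimal Python sketch, with hypothetical index values (not from the presentation), for an index such as misclassification rate where smaller is better:

```python
# Hypothetical validation fit indices for q = 3 candidate models.
# For an index like misclassification rate or AIC, smaller is better.
fit_index = {"Model 1": 0.151, "Model 2": 0.138, "Model 3": 0.142}

best_model = min(fit_index, key=fit_index.get)
print(best_model)  # Model 2
```

For an index where larger is better (for example, area under the ROC curve), max would replace min.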

Fit Indices (Statistics)

Default — the default selection uses different statistics based on the type of target variable and whether a profit/loss matrix has been defined:

If a profit/loss matrix is defined for a categorical target, the average profit or average loss is used.
If no profit/loss matrix is defined for a categorical target, the misclassification rate is used.
If the target variable is interval, the average squared error is used.

Akaike's Information Criterion — chooses the model with the smallest Akaike's Information Criterion value.
Average Squared Error — chooses the model with the smallest average squared error value.
Mean Squared Error — chooses the model with the smallest mean squared error value.
ROC — chooses the model with the greatest area under the ROC curve.
Captured Response — chooses the model with the greatest captured response values using the decile range that is specified in the Selection Depth property.

Slide8

Continued

Gain — chooses the model with the greatest gain using the decile range that is specified in the Selection Depth property.
Gini Coefficient — chooses the model with the highest Gini coefficient value.
Kolmogorov-Smirnov Statistic — chooses the model with the highest Kolmogorov-Smirnov statistic value.
Lift — chooses the model with the greatest lift using the decile range that is specified in the Selection Depth property.
Misclassification Rate — chooses the model with the lowest misclassification rate.
Average Profit/Loss — chooses the model with the greatest average profit or smallest average loss.
Percent Response — chooses the model with the greatest % response.
Cumulative Captured Response — chooses the model with the greatest cumulative % captured response.
Cumulative Lift — chooses the model with the greatest cumulative lift.
Cumulative Percent Response — chooses the model with the greatest cumulative % response.

Slide9

Misclassification Rate (MR)

              Prediction = 0    Prediction = 1
Actual = 0    True Negative     False Positive
Actual = 1    False Negative    True Positive

MR = (FN + FP)/(TN + FP + FN + TP)

Slide10
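The formula maps directly to code; here is a minimal Python version of the table above, with illustrative counts that are not from the presentation:

```python
def misclassification_rate(tn, fp, fn, tp):
    """MR = (FN + FP) / (TN + FP + FN + TP): the fraction of all cases
    that the model labels incorrectly."""
    return (fn + fp) / (tn + fp + fn + tp)

# Example 2x2 confusion matrix: rows are Actual, columns are Prediction.
print(misclassification_rate(tn=90, fp=10, fn=15, tp=85))  # 0.125
```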

Equity Data Set

The variable BAD = 1 if the borrower is a bad credit risk and 0 if not.

We want to build a model to predict whether a person is a bad credit risk.

Other variables: Job, YOJ, Loan, DebtInc
Mortdue – how much they still need to pay on their mortgage
Value – assessed valuation
Derog – number of derogatory reports
Delinq – number of delinquent trade lines
Clage – age of oldest trade line
Ninq – number of recent credit inquiries
Clno – number of trade lines

Slide11

Three Procedures

Decision Tree

Regression (Logistic)

Neural Network

Slide12

Decision Tree

Very simple to understand

Easy to use

Easy to explain to the boss/supervisor

Slide13

Example

Slide14

Maximal Tree – Ignoring Validation Data

Slide15

Optimal Tree

Slide16

Continued

Slide17

Fit Statistics

              Prediction = 0    Prediction = 1
Actual = 0    2266              146
Actual = 1    225               370

MR = (225 + 146)/2981 ≈ 0.124455

Slide18

Logistic Regression

Since we observe a 0 or a 1, ordinary least squares is not an option; we need a different approach.

The probability of getting a 1 depends upon X. We write that as p(X).

Log odds = log(p(X)/(1 - p(X))) = a + bX

Slide19

Logistic Graph – Solve for p(X)

[Figure: logistic curve of P(X) versus X]

Slide20
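Solving the log-odds equation for p(X) gives the logistic (sigmoid) curve: p(X) = 1 / (1 + e^-(a + bX)). A quick numerical check in Python, using arbitrary illustrative values of a and b:

```python
import math

def p_of_x(x, a=0.0, b=1.0):
    """Invert the log odds: p(X) = 1 / (1 + exp(-(a + b*x)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

# Round trip: taking the log odds of p(X) should recover a + b*x.
x = 2.0
log_odds = math.log(p_of_x(x) / (1.0 - p_of_x(x)))
print(round(log_odds, 6))  # 2.0
```

Note that p(X) always lies strictly between 0 and 1, which is why the logistic form is suitable for modeling a probability when the response is a 0/1 variable.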

Fit Statistics

Slide21

MCR

              Prediction = 0    Prediction = 1
Actual = 0    2306              80
Actual = 1    332               263

MR = (332 + 80)/2981 ≈ 0.138209

Slide22

Neural Net

Very complex mathematical equations

Interpretation of the meaning of the input variables is not possible with the final model

Often gives a good prediction of the response

Slide23

Neural Net Diagram

Slide24

Fit Statistics

Slide25

MCR

              Prediction = 0    Prediction = 1
Actual = 0    2291              95
Actual = 1    288               307

MR = (288 + 95)/2981 ≈ 0.128480

Slide26
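Using the confusion matrices reported on the three fit-statistics slides, the misclassification rates can be recomputed and compared in a few lines of Python (the divisor 2981 is the validation count used on the slides):

```python
# (TN, FP, FN, TP) for each model, read off the slides above.
models = {
    "Decision Tree":       (2266, 146, 225, 370),
    "Logistic Regression": (2306,  80, 332, 263),
    "Neural Net":          (2291,  95, 288, 307),
}

mr = {name: (fn + fp) / 2981 for name, (tn, fp, fn, tp) in models.items()}
best = min(mr, key=mr.get)
print(best)  # Decision Tree
```

On this validation set the decision tree has the lowest misclassification rate (about 0.124), followed by the neural net (about 0.128) and logistic regression (about 0.138).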

Comparison

Slide27

Enterprise Miner Interface

Slide28

Enterprise Guide Interface

Slide29

RPM

Slide30

Continued

Slide31

Continued

Slide32

Fit Statistics

Slide33

Summary

Divide your data into training and validation sets

We looked at trees, logistic regression, and neural nets

We also looked at RPM

Slide34

Q&A