Slide 1: Data Analytics, CMIS Short Course Part II
Day 1, Part 4: ROC Curves
Sam Buttrey
December 2015
Slide 2: Assessing a Classifier
In data sets with very few “bads,” the “naïve” model that says “everyone is good” is highly accurate: it never pays to predict “bad.”
How can we decide that a model is getting probabilities correct, or compare two models? How can we put our model to use?
What if we don’t like the 0.5 threshold for deciding which observations are predicted good and which are not?
One answer: ROC
Slide 3: ROC Plot
“Receiver Operating Characteristic” curve
Intended for binary response variables (“hit” and “miss,” or “positive” and “negative”)
We have an estimated probability p̂ for each observation, and a “cutoff” or threshold t
If p̂ ≥ t, predict “hit”; else predict “miss”
How well are we doing? What should t be?
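A minimal R sketch of this rule (an illustration only, not from the slides; p.hat stands in for a model's estimated probabilities):

    # The threshold rule: predict "hit" when the estimated
    # probability reaches the cutoff t.
    set.seed(1)
    p.hat <- runif(10)                  # stand-in estimated probabilities
    t <- 0.5                            # the cutoff / threshold
    ifelse(p.hat >= t, "hit", "miss")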
Slide 4: 2x2 Confusion Matrix

                       Predicted
                       Pos    Neg
    Observed    Pos     a      b
                Neg     c      d

Sensitivity: true positive rate, a/(a+b); the false negative rate is b/(a+b)
Specificity: true negative rate, d/(c+d); the false positive rate is c/(c+d)
If t is large, few are predicted positive: sensitivity is small, specificity high.
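A sketch of this bookkeeping in R on simulated data (the names observed, score, and t are invented here, and reused in the later sketches):

    # Simulated test data: 'observed' is the true class, 'score' the
    # classifier's numeric output; positives tend to score higher.
    set.seed(2)
    observed <- factor(rbinom(200, 1, 0.3), labels = c("Neg", "Pos"))
    score    <- runif(200) + 0.4 * (observed == "Pos")

    t <- 0.6
    predicted <- factor(ifelse(score >= t, "Pos", "Neg"),
                        levels = c("Neg", "Pos"))
    (tab <- table(Observed = observed, Predicted = predicted))
    a <- tab["Pos", "Pos"]; b <- tab["Pos", "Neg"]
    cc <- tab["Neg", "Pos"]; d <- tab["Neg", "Neg"]
    a / (a + b)     # sensitivity: true positive rate
    d / (cc + d)    # specificity: true negative rate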
Slide 5: Plotting the ROC
The ROC curve plots Sensitivity (true pos. rate) against 1 – Specificity (the false pos. rate) for different values of the threshold t
In a good test, there is a t for which both Sensitivity and Specificity are near 1. That curve would pass near (0, 1), the top left corner
In a bad test, the proportion classified as positive would be the same regardless of the truth. We would have Sensitivity = 1 – Specificity for all t: the 45° line
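Tracing the curve by hand is straightforward; a sketch continuing the simulated observed and score from the confusion-matrix example:

    # Sweep the threshold over every distinct score, computing the
    # true and false positive rates at each value.
    ts  <- sort(unique(score), decreasing = TRUE)
    tpr <- sapply(ts, function(t) mean(score[observed == "Pos"] >= t))
    fpr <- sapply(ts, function(t) mean(score[observed == "Neg"] >= t))
    plot(fpr, tpr, type = "l",
         xlab = "1 - Specificity (false pos. rate)",
         ylab = "Sensitivity (true pos. rate)")
    abline(0, 1, lty = 2)   # the 45-degree line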
Slide 6: The ROC
[Figure: an ROC curve, with Sensitivity (true pos. rate) on the vertical axis and 1 – Specificity (false pos. rate) on the horizontal axis, both running from 0 to 1. A low threshold t sits near the top right of the curve, a high threshold t near the bottom left; the marked point is at (30%, 73%).]
We give every observation a score. For this value of t, 73% of positive observations have score ≥ t, but only 30% of negatives do.
Slide 7: On the 45° Line
If your classifier’s ROC curve follows the 45° line, then for any t the probability of classifying a positive as positive (sensitivity) is the same as the probability of classifying a negative as positive (1 – specificity)
The area between your ROC curve and the 45° line is a measure of quality. So is the total area under the curve, the AUC.
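Given the (fpr, tpr) points from the earlier sweep, the AUC can be estimated with the trapezoid rule (a sketch; the endpoints (0, 0) and (1, 1) are added first):

    x <- c(0, fpr, 1); y <- c(0, tpr, 1)               # anchor both ends
    sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)     # trapezoid-rule AUC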
Slide 8: The ROC
[Figure: the same ROC axes, with the Area Under the Curve (AUC) shaded beneath the curve; the low threshold sits at the top right, the high threshold at the bottom left.]
Slide 9: Area Under the Curve
The area under the curve (AUC) is often measured or estimated
A “random” classifier has AUC = 0.5
Rule of thumb: AUC > 0.8 is good, but often only a few thresholds make sense
R draws ROC curves via pROC, ROCR, …
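For example, with pROC (a sketch reusing the simulated observed and score; the levels argument names the control class first, then the case class):

    library(pROC)
    r <- roc(response = observed, predictor = score,
             levels = c("Neg", "Pos"))
    plot(r)    # draws the ROC curve
    auc(r)     # the area under it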
Slide 10: Other Interpretations of AUC
Select two observations at random, one with a “hit” and one without
For a particular model, what is the probability that the predicted probability for the “hit” is the greater? Answer: it’s exactly the AUC
This number also relates to the Wilcoxon two-sample non-parametric test applied to the predicted scores
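Both facts are easy to check numerically; a sketch with the simulated data from before (wilcox.test()'s statistic W counts the pairs in which the “hit” outscores the “miss,” ties counting one half):

    pos <- score[observed == "Pos"]
    neg <- score[observed == "Neg"]
    # P(random hit outscores random miss), ties counted as 1/2:
    mean(outer(pos, neg, ">")) + 0.5 * mean(outer(pos, neg, "=="))
    # The rescaled Wilcoxon rank-sum statistic gives the same number:
    wilcox.test(pos, neg)$statistic / (length(pos) * length(neg))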
Slide 11: Using AUC
ROC is for binary classifiers that produce numeric scores
Produce a set of predicted scores on a test set and plot the ROCs
If one classifier’s curve is always above another’s, it dominates
Otherwise, compare by AUC, or by using some “real” threshold
Examples! Yay?
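One way to make such a comparison (a sketch: score2 is a hypothetical second model's scores on the same test set, and roc.test() is pROC's test for a difference between two AUCs):

    score2 <- runif(200)                    # a deliberately useless competitor
    r2 <- roc(observed, score2, levels = c("Neg", "Pos"))
    roc.test(r, r2, method = "delong")      # H0: the two AUCs are equal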