Performance Metrics for Graph Mining Tasks - PowerPoint Presentation
Uploaded by @tawny-fly on 2015-11-12
Presentation Transcript

Slide 1: Performance Metrics for Graph Mining Tasks

Slide 2: Outline
- Introduction to Performance Metrics
- Supervised Learning Performance Metrics
- Unsupervised Learning Performance Metrics
- Optimizing Metrics
- Statistical Significance Techniques
- Model Comparison

Slide 3: Outline (repeated; next section: Introduction to Performance Metrics)

Slide 4: Introduction to Performance Metrics
A performance metric measures how well your data mining algorithm is performing on a given dataset. For example, if we apply a classification algorithm to a dataset, we first check how many of the data points were classified correctly. This is a performance metric, and its formal name is "accuracy."

Performance metrics also help us decide whether one algorithm is better or worse than another. For example, suppose classification algorithm A classifies 80% of the data points correctly and classification algorithm B classifies 90% correctly. We immediately realize that algorithm B is doing better. There are some intricacies, however, that we will discuss in this chapter.

Slide 5: Outline (repeated; next section: Supervised Learning Performance Metrics)

Slide 6: Supervised Learning Performance Metrics
Metrics that are applied when the ground truth is known (e.g., classification tasks).
Outline:
- 2x2 confusion matrix
- Multi-level confusion matrix
- Visual metrics
- Cross-validation

Slide 7: 2x2 Confusion Matrix
A 2x2 matrix used to tabulate the results of a two-class supervised learning problem; entry (i, j) is the number of elements with actual class label i that were predicted to have class label j. Here + and - are the two class labels.

                Predicted +       Predicted -       Row sum
Actual +        f++               f+-               C = f++ + f+-
Actual -        f-+               f--               D = f-+ + f--
Column sum      A = f++ + f-+     B = f+- + f--     T = f++ + f+- + f-+ + f--

f++ counts true positives, f+- false negatives, f-+ false positives, and f-- true negatives.

Slide 8: 2x2 Confusion Matrix Example
Results from a classification algorithm:

Vertex ID    Actual Class    Predicted Class
1            +               +
2            +               +
3            +               +
4            +               +
5            +               -
6            -               +
7            -               +
8            -               -

Corresponding 2x2 matrix for the given table:

                Predicted +    Predicted -    Row sum
Actual +        4              1              C = 5
Actual -        2              1              D = 3
Column sum      A = 6          B = 2          T = 8

True positives = 4, false negatives = 1, false positives = 2, true negatives = 1.
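The deck's R examples (Slide 14) use the PerformanceMetrics package; as a quick side illustration not taken from the deck, base R's table() can rebuild the same 2x2 matrix from the vertex table above.

# Not from the deck: a minimal base-R sketch that reproduces the slide's 2x2
# confusion matrix from the actual and predicted labels of the eight vertices.
actual    <- c("+", "+", "+", "+", "+", "-", "-", "-")
predicted <- c("+", "+", "+", "+", "-", "+", "+", "-")

# Rows = actual class, columns = predicted class, matching entry (i, j) above.
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

tp <- cm["+", "+"]  # f++ = 4
fn <- cm["+", "-"]  # f+- = 1
fp <- cm["-", "+"]  # f-+ = 2
tn <- cm["-", "-"]  # f-- = 1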

Slide 9: 2x2 Confusion Matrix Performance Metrics
Walk through the different metrics using the example above. The formulas are spelled out below.
1. Accuracy is the proportion of correct predictions.
2. Error rate is the proportion of incorrect predictions.
3. Recall is the proportion of "+" data points that are predicted as "+".
4. Precision is the proportion of data points predicted as "+" that are truly "+".
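The slide's formulas did not survive as text; in the f notation of Slide 7 the standard definitions are:

\[
\text{Accuracy} = \frac{f_{++} + f_{--}}{T}, \qquad
\text{Error rate} = \frac{f_{+-} + f_{-+}}{T}, \qquad
\text{Recall} = \frac{f_{++}}{f_{++} + f_{+-}} = \frac{f_{++}}{C}, \qquad
\text{Precision} = \frac{f_{++}}{f_{++} + f_{-+}} = \frac{f_{++}}{A}
\]

For the Slide 8 example these give Accuracy = 5/8, Error rate = 3/8, Recall = 4/5, and Precision = 4/6.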

Slide 10: Multi-level Confusion Matrix
An n x n matrix, where n is the number of classes and entry (i, j) represents the number of elements with actual class label i that were predicted to have class label j.

Slide 11: Multi-level Confusion Matrix Example

                               Predicted Class                  Marginal Sum
                               Class 1    Class 2    Class 3    of Actual Values
Actual Class      Class 1      2          1          1          4
                  Class 2      1          2          1          4
                  Class 3      1          2          3          6
Marginal Sum of Predictions    4          5          5          T = 14

Slide 12: Multi-level Confusion Matrix Conversion to 2x2

Original 3x3 matrix (as on Slide 11):

                          Predicted Class
                          Class 1    Class 2    Class 3
Actual Class   Class 1    2          1          1
               Class 2    1          2          1
               Class 3    1          2          3

2x2 matrix specific to Class 1, treating Class 1 as "+" and all other classes as "-":

                          Class 1 (+)      Not Class 1 (-)    Row sum
Actual Class 1 (+)        f++ = 2          f+- = 2            C = 4
Actual Not Class 1 (-)    f-+ = 2          f-- = 8            D = 10
Column sum                A = 4            B = 10             T = 14

We can now apply all the 2x2 metrics:
Accuracy = 10/14, Error rate = 4/14, Recall = 2/4, Precision = 2/4.
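As a side note not taken from the deck, the conversion can be sketched in a few lines of base R; the helper name to_two_class is illustrative.

# Not from the deck: a base-R sketch of the multi-level-to-2x2 conversion for one class.
M <- matrix(c(2, 1, 1,
              1, 2, 1,
              1, 2, 3), nrow = 3, byrow = TRUE)   # rows = actual, cols = predicted

to_two_class <- function(M, k) {
  fpp <- M[k, k]          # f++: actual class k, predicted class k
  fpm <- sum(M[k, -k])    # f+-: actual class k, predicted some other class
  fmp <- sum(M[-k, k])    # f-+: actual other class, predicted class k
  fmm <- sum(M[-k, -k])   # f--: actual other class, predicted other class
  matrix(c(fpp, fpm, fmp, fmm), nrow = 2, byrow = TRUE,
         dimnames = list(Actual = c("+", "-"), Predicted = c("+", "-")))
}

C1 <- to_two_class(M, 1)
accuracy  <- sum(diag(C1)) / sum(C1)          # 10/14
recall    <- C1["+", "+"] / sum(C1["+", ])    # 2/4
precision <- C1["+", "+"] / sum(C1[, "+"])    # 2/4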

Slide 13: Multi-level Confusion Matrix Performance Metrics

                          Predicted Class
                          Class 1    Class 2    Class 3
Actual Class   Class 1    2          1          1
               Class 2    1          2          1
               Class 3    1          2          3

1. The Critical Success Index (or Threat Score) is, for each class L, the ratio of correct predictions for class L to the sum of vertices that belong to L and those predicted as L.
2. Bias is, for each class L, the ratio of the total points with class label L to the number of points predicted as L. Bias helps us understand whether a model is over- or under-predicting a class.
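The slide's formulas are not in the transcript. Written out in confusion-matrix notation, with the CSI given in its usual form, which subtracts the overlap so correct predictions are not counted twice (treat that as an assumption about the deck's exact definition):

\[
\mathrm{CSI}_L = \frac{f_{LL}}{\left(\sum_j f_{Lj}\right) + \left(\sum_i f_{iL}\right) - f_{LL}}, \qquad
\mathrm{Bias}_L = \frac{\sum_j f_{Lj}}{\sum_i f_{iL}}
\]

For Class 1 in the matrix above this gives CSI = 2/(4 + 4 - 2) = 1/3 and Bias = 4/4 = 1; a bias above 1 means the class is under-predicted, below 1 over-predicted.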

Slide 14: Confusion Metrics R-code

library(PerformanceMetrics)

# 2x2 example (the matrix from Slide 8)
data(M)
M
#      [,1] [,2]
# [1,]    4    1
# [2,]    2    1
twoCrossConfusionMatrixMetrics(M)

# Multi-level example (the matrix from Slide 11)
data(MultiLevelM)
MultiLevelM
#      [,1] [,2] [,3]
# [1,]    2    1    1
# [2,]    1    2    1
# [3,]    1    2    3
multilevelConfusionMatrixMetrics(MultiLevelM)

Slide 15: Visual Metrics
Metrics that are plotted on a graph to obtain a visual picture of the performance of two-class classifiers. Plot the performance of multiple models to decide which one performs best.

[ROC plot: x-axis is the False Positive Rate, y-axis is the True Positive Rate, both from 0 to 1.
- (0, 1) is the ideal point.
- (0, 0) corresponds to a model that predicts the negative class all the time.
- (1, 1) corresponds to a model that predicts the positive class all the time.
- Random guessing models lie on the diagonal, with AUC = 0.5.]

Slide 16: Understanding Model Performance Based on the ROC Plot
[ROC plot: x-axis False Positive Rate, y-axis True Positive Rate, with the diagonal marking random guessing models (AUC = 0.5).]
- Models in the upper left have good performance. This is where you aim to get your model.
- Models below the diagonal perform worse than random. Such models can be negated (their predictions flipped) to move them above the diagonal.
- Models in the lower left are conservative: they will not predict "+" unless the evidence is strong, giving low false positives but high false negatives.
- Models in the upper right are liberal: they will predict "+" with little evidence, giving high false positives.

Slide 17: ROC Plot Example
[ROC plot showing three models as points (False Positive Rate, True Positive Rate):]
- M1 at (0.1, 0.8)
- M2 at (0.5, 0.5)
- M3 at (0.3, 0.5)
M1's performance lies furthest in the upper-left direction (low false positive rate, high true positive rate) and hence it is considered the best model.
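Not taken from the deck: a base-R sketch that plots the three models in ROC space; the helper roc_point only illustrates how such a point would be derived from a 2x2 confusion matrix like the one on Slide 8.

roc_point <- function(cm) {
  # cm rows = actual (+, -), cols = predicted (+, -)
  tpr <- cm["+", "+"] / sum(cm["+", ])   # true positive rate (recall)
  fpr <- cm["-", "+"] / sum(cm["-", ])   # false positive rate
  c(FPR = fpr, TPR = tpr)
}

models <- rbind(M1 = c(0.1, 0.8), M2 = c(0.5, 0.5), M3 = c(0.3, 0.5))
plot(models[, 1], models[, 2], xlim = c(0, 1), ylim = c(0, 1),
     xlab = "False Positive Rate", ylab = "True Positive Rate", pch = 19)
text(models[, 1], models[, 2], labels = rownames(models), pos = 4)
abline(0, 1, lty = 2)   # random-guessing diagonal (AUC = 0.5)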

Slide 18: Cross-validation
Cross-validation, also called rotation estimation, is a way to analyze how a predictive data mining model will perform on an unknown dataset, i.e., how well the model generalizes.
Strategy:
- Divide the dataset into two non-overlapping subsets; one subset is called the "test" set and the other the "training" set.
- Build the model using the "training" dataset.
- Obtain predictions on the "test" set.
- Use the "test" set predictions to calculate all the performance metrics.
Typically cross-validation is performed for multiple iterations, selecting a different non-overlapping test and training set each time.

Slide 19: Types of Cross-validation
- Hold-out: a random 1/3 of the data is used as the test set and the remaining 2/3 as the training set.
- k-fold: divide the data into k partitions; use one partition as the test set and the remaining k-1 partitions for training.
- Leave-one-out: a special case of k-fold where k equals the number of data points, so each test set contains a single point.
Note: selection of data points is typically done in a stratified manner, i.e., the class distribution in the test set is similar to that in the training set. A sketch of such a split follows.
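Not from the deck: a minimal base-R sketch of a stratified k-fold split; the helper name stratified_folds is illustrative.

# `labels` is a vector of class labels, one per data point; fold ids are returned.
stratified_folds <- function(labels, k = 3) {
  folds <- integer(length(labels))
  for (cl in unique(labels)) {
    idx <- which(labels == cl)
    # spread the points of this class across folds 1..k in roughly equal numbers
    folds[idx] <- sample(rep_len(1:k, length(idx)))
  }
  folds
}

set.seed(1)
labels <- c("+", "+", "+", "+", "+", "-", "-", "-")   # the Slide 8 vertices
folds  <- stratified_folds(labels, k = 2)
# Each fold now contains a similar mix of "+" and "-" points; fold i serves as
# the test set while the remaining folds are used for training.
split(seq_along(labels), folds)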

Slide 20: Outline (repeated; next section: Unsupervised Learning Performance Metrics)

Slide 21: Unsupervised Learning Performance Metrics
Metrics that are applied when the ground truth is not always available (e.g., clustering tasks).
Outline:
- Evaluation using prior knowledge
- Evaluation using cluster properties

Slide 22: Evaluation Using Prior Knowledge
One way to test the effectiveness of an unsupervised learning method is to take a dataset D with known class labels, strip the labels, and provide the unlabeled set as input to the unsupervised learning algorithm U. The resulting clusters are then compared with the prior knowledge (the stripped labels) to judge the performance of U.
To evaluate performance:
- Contingency table
- Ideal and observed matrices

Slide 23: Contingency Table

                       Same Cluster    Different Cluster
Same Class             u11             u10
Different Class        u01             u00

(A) To fill the table, initialize u11, u01, u10, u00 to 0.
(B) Then, for each pair of points (v, w):
- if v and w belong to the same class and the same cluster, increment u11
- if v and w belong to the same class but different clusters, increment u10
- if v and w belong to different classes but the same cluster, increment u01
- if v and w belong to different classes and different clusters, increment u00
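Not from the deck: a base-R sketch that fills the contingency table from a vector of true class labels and a vector of cluster assignments; the helper name contingency_counts and the sample labels are illustrative.

contingency_counts <- function(classes, clusters) {
  u11 <- u10 <- u01 <- u00 <- 0
  n <- length(classes)
  for (v in 1:(n - 1)) {
    for (w in (v + 1):n) {          # each unordered pair is counted once
      same_class   <- classes[v]  == classes[w]
      same_cluster <- clusters[v] == clusters[w]
      if (same_class  &&  same_cluster) u11 <- u11 + 1
      if (same_class  && !same_cluster) u10 <- u10 + 1
      if (!same_class &&  same_cluster) u01 <- u01 + 1
      if (!same_class && !same_cluster) u00 <- u00 + 1
    }
  }
  c(u11 = u11, u10 = u10, u01 = u01, u00 = u00)
}

# Hypothetical labels, only to show the call.
contingency_counts(classes  = c(1, 1, 1, 2, 2),
                   clusters = c("a", "a", "b", "b", "b"))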

Slide 24: Contingency Table Performance Metrics
1. The Rand statistic, also called the simple matching coefficient, is a measure in which placing a pair of points with the same class label in the same cluster and placing a pair of points with different class labels in different clusters are given equal importance, i.e., it accounts for both the specificity and the sensitivity of the clustering.
2. The Jaccard coefficient can be utilized when placing a pair of points with the same class label in the same cluster is what primarily matters.

Example matrix:

                       Same Cluster    Different Cluster
Same Class             9               4
Different Class        3               12
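The formulas themselves are not in the transcript; the standard forms, in the u notation of Slide 23, are:

\[
\text{Rand} = \frac{u_{11} + u_{00}}{u_{11} + u_{10} + u_{01} + u_{00}}, \qquad
\text{Jaccard} = \frac{u_{11}}{u_{11} + u_{10} + u_{01}}
\]

For the example matrix above: Rand = (9 + 12)/28 = 0.75 and Jaccard = 9/(9 + 4 + 3) = 0.5625.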

Slide 25: Ideal and Observed Matrices
Given that the number of points is T, the ideal matrix is a T x T matrix where cell (i, j) is 1 if points i and j belong to the same class and 0 if they belong to different classes. The observed matrix is a T x T matrix where cell (i, j) is 1 if points i and j belong to the same cluster and 0 if they belong to different clusters.
The Mantel test is a statistical test of the correlation between two matrices of the same rank. The two matrices in this case are symmetric, so it is sufficient to analyze the lower or upper triangle of each matrix.
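Not from the deck: a base-R sketch that builds the ideal and observed matrices and correlates their lower triangles; the full Mantel test adds a permutation step on top of this correlation to obtain a p-value, and the labels below are hypothetical.

classes  <- c(1, 1, 2, 2, 2)            # hypothetical known class labels
clusters <- c("a", "a", "a", "b", "b")  # hypothetical clustering result

ideal    <- 1 * outer(classes,  classes,  "==")   # 1 if same class, else 0
observed <- 1 * outer(clusters, clusters, "==")   # 1 if same cluster, else 0

lower <- lower.tri(ideal)               # both matrices are symmetric T x T
cor(ideal[lower], observed[lower])      # correlation of the two lower triangles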

Slide 26: Evaluation Using Prior Knowledge R-code

library(PerformanceMetrics)

# Contingency table from Slide 24
data(ContingencyTable)
ContingencyTable
#      [,1] [,2]
# [1,]    9    4
# [2,]    3   12
contingencyTableMetrics(ContingencyTable)

Slide 27: Evaluation Using Cluster Properties
In the absence of prior knowledge we have to rely on information from the clusters themselves to evaluate performance.
- Cohesion measures how closely objects in the same cluster are related.
- Separation measures how distinct or separated a cluster is from all the other clusters.
Here g_i refers to cluster i, W is the total number of clusters, x and y are data points, and proximity can be any similarity measure (e.g., cosine similarity). We want the cohesion to be close to 1 and the separation to be close to 0.

Slide 28: Outline (repeated; next section: Optimizing Metrics)

Slide 29: Optimizing Metrics
Performance metrics that act as optimization functions for a data mining algorithm.
Outline:
- Sum of squared errors
- Preserved variability

Slide 30: Sum of Squared Errors
The sum of squared errors (SSE) is typically used in clustering algorithms to measure the quality of the clusters obtained. It takes into consideration the distance between each point in a cluster and its cluster center (the centroid or some other chosen representative).
For d_j, a point in cluster g_i, where m_i is the cluster center of g_i and W is the total number of clusters, SSE is defined as in the formula below. This value is small when points are close to their cluster centers, indicating a good clustering; similarly, a large SSE indicates a poor clustering. Thus, clustering algorithms aim to minimize SSE.
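The formula itself did not survive as text; the standard SSE expression implied by these definitions is:

\[
\mathrm{SSE} = \sum_{i=1}^{W} \sum_{d_j \in g_i} \operatorname{dist}(d_j, m_i)^2
\]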

Slide 31: Preserved Variability
Preserved variability is typically used in eigenvector-based dimension reduction techniques to quantify the variance preserved by the chosen dimensions. The objective of the dimension reduction technique is to maximize this parameter.
Given that each point is represented in r dimensions, that k dimensions are retained (k << r), and that the eigenvalues are ordered λ1 >= λ2 >= ... >= λ(r-1) >= λr, the preserved variability (PV) is calculated as in the formula below. The value of this parameter depends on the number of dimensions chosen: the more that are included, the higher the value. Choosing all the dimensions results in the perfect score of 1.
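The PV formula is missing from the transcript; the eigenvalue ratio implied by the description (equal to 1 when all r dimensions are kept) is:

\[
\mathrm{PV} = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{r} \lambda_i}
\]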

Slide 32: Outline (repeated; next section: Statistical Significance Techniques)

Slide 33: Statistical Significance Techniques
Methods used to assess a p-value for the different performance metrics.
Scenario: suppose we obtain cohesion = 0.99 for clustering algorithm A. At first glance, 0.99 seems like a very good score. However, it is possible that the underlying data is structured in such a way that you would get 0.99 no matter how you cluster it. In that case, 0.99 is not very significant. One way to decide is statistical significance estimation. We discuss the Monte Carlo procedure on the next slide.

Slide 34: Monte Carlo Procedure for Empirical p-value Estimation
The Monte Carlo procedure uses random sampling to assess whether a particular performance metric value we obtain could have been attained at random. For example, if a cluster of size 5 has a cohesion score of 0.99, we would be inclined to think it is very cohesive. However, this value could have resulted from the nature of the data rather than from the algorithm. To test the significance of the 0.99 value we:
1. Sample N (usually 1000) random sets of size 5 from the dataset.
2. Recalculate the cohesion for each of the N sets.
3. Count R, the number of random sets with a value >= 0.99 (the original score of the cluster).
4. The empirical p-value for the cluster of size 5 with score 0.99 is R/N.
5. Apply a cutoff, say 0.05, to decide whether 0.99 is significant.
Steps 1-4 constitute the Monte Carlo method for p-value estimation.
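Not from the deck: a base-R sketch of steps 1-4, with a placeholder cohesion function (average pairwise cosine similarity) standing in for whatever metric is being tested; the helper names and the random dataset are illustrative.

cohesion <- function(rows, data) {
  X <- data[rows, , drop = FALSE]
  X <- X / sqrt(rowSums(X^2))          # normalize rows so the cross product gives cosines
  S <- X %*% t(X)
  mean(S[lower.tri(S)])                # average pairwise similarity within the set
}

empirical_p <- function(data, cluster_rows, N = 1000) {
  observed <- cohesion(cluster_rows, data)
  random_scores <- replicate(N, {
    cohesion(sample(nrow(data), length(cluster_rows)), data)   # step 1 and 2
  })
  R <- sum(random_scores >= observed)  # step 3
  R / N                                # step 4: empirical p-value
}

set.seed(7)
data <- matrix(rexp(200), nrow = 50)   # hypothetical 50 x 4 dataset
empirical_p(data, cluster_rows = 1:5, N = 1000)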

Slide 35: Outline (repeated; next section: Model Comparison)

Slide 36: Model Comparison
Metrics that compare the performance of different algorithms.
Scenario: Model 1 provides an accuracy of 70% and Model 2 provides an accuracy of 75%. At first glance Model 2 seems better; however, it could be that Model 1 predicts Class1 better than Model 2 does, and Class1 is more important than Class2 for our problem. We can use model comparison methods to take this notion of "importance" into consideration when we pick one model over another.
Cost-based analysis is an important model comparison method, discussed in the next few slides.

Slide 37: Cost-based Analysis
In real-world applications, certain aspects of model performance are considered more important than others. For example, if a person with cancer is diagnosed as cancer-free, or vice versa, the prediction model should be especially penalized. This penalty can be introduced in the form of a cost matrix.

Cost matrix:

                Predicted +    Predicted -
Actual +        c11            c10
Actual -        c01            c00

Here c11 is associated with f11 (or u11), c10 with f10 (or u10), c01 with f01 (or u01), and c00 with f00 (or u00).

Slide 38: Cost-based Analysis: Cost of a Model
The cost and confusion matrices for a model M are given below. The cost of model M is obtained by weighting each confusion-matrix count by the corresponding cost, as in the formula below.

Cost matrix:
                Predicted +    Predicted -
Actual +        c11            c10
Actual -        c01            c00

Confusion matrix:
                Predicted +    Predicted -
Actual +        f11            f10
Actual -        f01            f00
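The cost formula is missing from the transcript; given the element-wise association of costs and counts stated on Slide 37, it is the weighted sum:

\[
\operatorname{Cost}(M) = c_{11} f_{11} + c_{10} f_{10} + c_{01} f_{01} + c_{00} f_{00}
\]

For example, with the Slide 39 cost matrix and the confusion matrix of Mx this gives 4(-20) + 1(100) + 2(45) + 1(-10) = 100, matching the deck's Cost(Mx).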

Slide 39: Cost-based Analysis: Comparing Two Models
This analysis is typically used to select one model when we have more than one choice, obtained through different algorithms or different parameters to the learning algorithms.

Cost matrix:
                Predicted +    Predicted -
Actual +        -20            100
Actual -        45             -10

Confusion matrix of Mx:
                Predicted +    Predicted -
Actual +        4              1
Actual -        2              1

Confusion matrix of My:
                Predicted +    Predicted -
Actual +        3              2
Actual -        2              1

Cost of Mx: 4(-20) + 1(100) + 2(45) + 1(-10) = 100
Cost of My: 3(-20) + 2(100) + 2(45) + 1(-10) = 220
Since Cost(Mx) < Cost(My), purely on the basis of the cost model, Mx is the better model.

Slide 40: Cost-based Analysis R-code

library(PerformanceMetrics)

# Confusion matrices of the two models and the cost matrix from Slide 39
data(Mx)
data(My)
data(CostMatrix)
Mx
#      [,1] [,2]
# [1,]    4    1
# [2,]    2    1
My
#      [,1] [,2]
# [1,]    3    2
# [2,]    2    1
costAnalysis(Mx, CostMatrix)
costAnalysis(My, CostMatrix)