Performance Metrics for Graph Mining Tasks
Outline
Introduction to Performance Metrics
Supervised Learning Performance Metrics
Unsupervised Learning Performance Metrics
Optimizing Metrics
Statistical Significance Techniques
Model Comparison
Introduction to Performance Metrics
A performance metric measures how well your data mining algorithm is performing on a given dataset. For example, if we apply a classification algorithm to a dataset, we first check how many of the data points were classified correctly. This is a performance metric, and its formal name is "accuracy."
Performance metrics also help us decide whether one algorithm is better or worse than another. For example, if classification algorithm A classifies 80% of data points correctly and classification algorithm B classifies 90% correctly, we immediately realize that algorithm B is doing better. There are some intricacies, which we will discuss in this chapter.
Supervised Learning Performance Metrics
Metrics that are applied when the ground truth is known (e.g., classification tasks)
Outline:
2x2 Confusion Matrix
Multi-level Confusion Matrix
Visual Metrics
Cross-validation
2x2 Confusion Matrix
A 2x2 matrix is used to tabulate the results of a two-class supervised learning problem; entry (i,j) represents the number of elements with actual class label i that were predicted to have class label j. Here "+" and "-" are the two class labels.

                        Predicted Class
                        +                -
Actual Class   +        f++              f+-              C = f++ + f+-
               -        f-+              f--              D = f-+ + f--
                        A = f++ + f-+    B = f+- + f--    T = f++ + f-+ + f+- + f--

f++ = True Positives, f+- = False Negatives, f-+ = False Positives, f-- = True Negatives
2x2 Confusion Matrix: Example
Results from a classification algorithm:

Vertex ID   Actual Class   Predicted Class
1           +              +
2           +              +
3           +              +
4           +              +
5           +              -
6           -              +
7           -              +
8           -              -

Corresponding 2x2 matrix for the given table:

                        Predicted Class
                        +        -
Actual Class   +        4        1        C = 5
               -        2        1        D = 3
                        A = 6    B = 2    T = 8

True Positives = 4, False Negatives = 1, False Positives = 2, True Negatives = 1
2x2 Confusion Matrix: Performance Metrics
We walk through the different metrics using the example above.
1. Accuracy is the proportion of correct predictions: (f++ + f--) / T = (4 + 1) / 8 = 0.625
2. Error rate is the proportion of incorrect predictions: (f+- + f-+) / T = (1 + 2) / 8 = 0.375
3. Recall is the proportion of "+" data points predicted as "+": f++ / (f++ + f+-) = 4 / 5 = 0.8
4. Precision is the proportion of data points predicted as "+" that are truly "+": f++ / (f++ + f-+) = 4 / 6 ≈ 0.67
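A minimal base-R sketch of these four formulas (not the PerformanceMetrics package API; the function name is our own), assuming the layout above with rows as actual class and columns as predicted class:

confusionMetrics2x2 <- function(M) {
  tot <- sum(M)                               # T = f++ + f+- + f-+ + f--
  c(accuracy  = (M[1, 1] + M[2, 2]) / tot,    # correct predictions / total
    error     = (M[1, 2] + M[2, 1]) / tot,    # incorrect predictions / total
    recall    = M[1, 1] / sum(M[1, ]),        # f++ / (f++ + f+-)
    precision = M[1, 1] / sum(M[, 1]))        # f++ / (f++ + f-+)
}

M <- matrix(c(4, 2, 1, 1), nrow = 2)   # example matrix: rows = actual +/-, cols = predicted +/-
confusionMetrics2x2(M)                 # accuracy 0.625, error 0.375, recall 0.8, precision ≈ 0.667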
Multi-level Confusion Matrix
An n x n matrix, where n is the number of classes and entry (i,j) represents the number of elements with actual class label i that were predicted to have class label j.
Multi-level Confusion Matrix: Example

                          Predicted Class                 Marginal Sum
                          Class 1   Class 2   Class 3     of Actual Values
Actual Class   Class 1    2         1         1           4
               Class 2    1         2         1           4
               Class 3    1         2         3           6
Marginal Sum
of Predictions            4         5         5           T = 14
Multi-level Confusion Matrix: Conversion to 2x2

                          Predicted Class
                          Class 1   Class 2   Class 3
Actual Class   Class 1    2         1         1
               Class 2    1         2         1
               Class 3    1         2         3

2x2 matrix specific to Class 1 (Class 1 = "+", all other classes = "-"):

                            Predicted Class
                            Class 1 (+)   Not Class 1 (-)
Actual   Class 1 (+)        f++ = 2       f+- = 2            C = 4
Class    Not Class 1 (-)    f-+ = 2       f-- = 8            D = 10
                            A = 4         B = 10             T = 14

We can now apply all the 2x2 metrics:
Accuracy = (2 + 8) / 14 = 10/14
Error = (2 + 2) / 14 = 4/14
Recall = 2/4
Precision = 2/4
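The conversion itself is mechanical, so here is a hedged base-R sketch (our own helper, not part of the PerformanceMetrics package) that collapses an n x n confusion matrix into the 2x2 matrix for one class of interest:

toBinary <- function(CM, pos = 1) {
  fpp <- CM[pos, pos]          # f++: actual pos, predicted pos
  fpm <- sum(CM[pos, -pos])    # f+-: actual pos, predicted another class
  fmp <- sum(CM[-pos, pos])    # f-+: actual another class, predicted pos
  fmm <- sum(CM[-pos, -pos])   # f--: actual and predicted another class
  matrix(c(fpp, fmp, fpm, fmm), nrow = 2,
         dimnames = list(actual = c("+", "-"), predicted = c("+", "-")))
}

CM <- matrix(c(2, 1, 1, 1, 2, 2, 1, 1, 3), nrow = 3)  # the 3-class example
toBinary(CM, pos = 1)                                 # rows: 2 2 / 2 8, as above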
Multi-level Confusion Matrix: Performance Metrics

                          Predicted Class
                          Class 1   Class 2   Class 3
Actual Class   Class 1    2         1         1
               Class 2    1         2         1
               Class 3    1         2         3

1. Critical Success Index (or Threat Score) for a class L is the ratio of correct predictions for class L to the total number of points that either belong to L or were predicted as L. For Class 1: 2 / (2 + 2 + 2) = 1/3.
2. Bias for a class L is the ratio of the total points with class label L to the number of points predicted as L. Bias helps us understand whether a model is over- or under-predicting a class. For Class 1: 4 / 4 = 1.
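A hedged base-R sketch of these two per-class metrics (our own helper, not the PerformanceMetrics API), applied to the 3-class example above:

multiLevelMetrics <- function(CM) {
  tp   <- diag(CM)                  # correct predictions per class
  act  <- rowSums(CM)               # points actually in each class
  pred <- colSums(CM)               # points predicted as each class
  csi  <- tp / (act + pred - tp)    # TP / (TP + FN + FP), the union of actual and predicted
  bias <- act / pred                # >1 suggests under-prediction, <1 over-prediction
  data.frame(class = seq_len(nrow(CM)), csi = csi, bias = bias)
}

CM <- matrix(c(2, 1, 1, 1, 2, 2, 1, 1, 3), nrow = 3)
multiLevelMetrics(CM)   # Class 1: csi = 1/3, bias = 1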
Confusion Metrics: R-code

library(PerformanceMetrics)
data(M)
M
     [,1] [,2]
[1,]    4    1
[2,]    2    1
twoCrossConfusionMatrixMetrics(M)

data(MultiLevelM)
MultiLevelM
     [,1] [,2] [,3]
[1,]    2    1    1
[2,]    1    2    1
[3,]    1    2    3
multilevelConfusionMatrixMetrics(MultiLevelM)
Visual Metrics
Metrics that are plotted on a graph to obtain a visual picture of the performance of two-class classifiers.
The ROC plot places the False Positive Rate on the x-axis and the True Positive Rate on the y-axis, each ranging from 0 to 1:
(0,1) is the ideal point
(0,0) is a model that predicts the -ve class all the time
(1,1) is a model that predicts the +ve class all the time
The diagonal corresponds to random guessing models (AUC = 0.5)
Plot the performance of multiple models on the same axes to decide which one performs best.
Understanding Model Performance Based on the ROC Plot
Random guessing models lie on the diagonal (AUC = 0.5).
Models below the diagonal perform worse than random. Note: such models can be negated (predictions flipped) to move them above the diagonal, toward the upper left.
Models in the upper left have good performance. Note: this is where you aim to get the model.
Models in the lower left are conservative: they will not predict "+" unless the evidence is strong, giving low false positives but high false negatives.
Models in the upper right are liberal: they will predict "+" with little evidence, giving high false positives.
ROC Plot: Example
Three models plotted as (False Positive Rate, True Positive Rate) points:
M1 = (0.1, 0.8)
M2 = (0.5, 0.5)
M3 = (0.3, 0.5)
M1's performance lies furthest in the upper-left direction and hence it is considered the best model.
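A small sketch of how such an ROC scatter can be drawn with base R graphics (model coordinates taken from this slide):

fpr <- c(M1 = 0.1, M2 = 0.5, M3 = 0.3)
tpr <- c(M1 = 0.8, M2 = 0.5, M3 = 0.5)
plot(fpr, tpr, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "False Positive Rate", ylab = "True Positive Rate", pch = 19)
text(fpr, tpr, labels = names(fpr), pos = 3)   # label each model point
abline(0, 1, lty = 2)                          # random-guessing diagonal (AUC = 0.5)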
Cross-validation
Cross-validation, also called rotation estimation, is a way to analyze how a predictive data mining model will perform on an unknown dataset, i.e., how well the model generalizes.
Strategy:
Divide the dataset into two non-overlapping subsets: one is called the "test" set and the other the "training" set
Build the model using the "training" set
Obtain predictions for the "test" set
Utilize the "test" set predictions to calculate all the performance metrics
Typically, cross-validation is performed for multiple iterations, selecting a different non-overlapping test and training set each time.
Types of Cross-validation
Hold-out: a random 1/3rd of the data is used as test and the remaining 2/3rd as training
k-fold: divide the data into k partitions; use one partition as test and the remaining k-1 partitions for training
Leave-one-out: special case of k-fold where k equals the number of data points, so each test set contains exactly one point
Note: Selection of data points is typically done in a stratified manner, i.e., the class distribution in the test set is kept similar to that of the training set. A k-fold splitting sketch follows.
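A minimal sketch of plain (non-stratified) k-fold splitting in base R; for the stratified variant in the note, the same assignment would be done within each class separately. The function name is our own:

kFoldIndices <- function(n, k) {
  fold <- sample(rep(seq_len(k), length.out = n))   # random fold id for each of n points
  lapply(seq_len(k), function(i) which(fold == i))  # test-set indices for each fold
}

set.seed(42)
folds  <- kFoldIndices(n = 14, k = 5)
test1  <- folds[[1]]                      # iteration 1: test set
train1 <- setdiff(seq_len(14), test1)     # iteration 1: training set (remaining k-1 folds)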
Unsupervised Learning Performance Metrics
Metrics that are applied when the ground truth is not always available (e.g., clustering tasks)
Outline:
Evaluation Using Prior Knowledge
Evaluation Using Cluster Properties
Evaluation Using Prior Knowledge
One way to test the effectiveness of an unsupervised learning method U is to take a dataset D with known class labels, strip the labels, and provide the unlabeled set as input to U. The resulting clusters are then compared against the knowledge priors to judge the performance of U.
To evaluate performance:
Contingency Table
Ideal and Observed Matrices
Contingency Table

                            Same Cluster   Different Cluster
Class   Same Class          u11            u10
        Different Class     u01            u00

(A) To fill the table, initialize u11, u01, u10, u00 to 0.
(B) Then, for each pair of points (v, w):
    if v and w belong to the same class and the same cluster, increment u11
    if v and w belong to the same class but different clusters, increment u10
    if v and w belong to different classes but the same cluster, increment u01
    if v and w belong to different classes and different clusters, increment u00
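A sketch of this pair-counting procedure in base R, vectorized over all pairs instead of an explicit double loop (the function name and example labels are our own):

pairContingency <- function(class, cluster) {
  p <- combn(length(class), 2)                    # all pairs of points (v, w)
  sameClass   <- class[p[1, ]]   == class[p[2, ]]
  sameCluster <- cluster[p[1, ]] == cluster[p[2, ]]
  c(u11 = sum(sameClass  & sameCluster),
    u10 = sum(sameClass  & !sameCluster),
    u01 = sum(!sameClass & sameCluster),
    u00 = sum(!sameClass & !sameCluster))
}

pairContingency(class   = c(1, 1, 1, 2, 2, 2),
                cluster = c(1, 1, 2, 2, 2, 1))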
Contingency Table: Performance Metrics
Rand Statistic, also called the simple matching coefficient, is a measure where placing a pair of points with the same class label in the same cluster and placing a pair of points with different class labels in different clusters are given equal importance, i.e., it accounts for both the specificity and sensitivity of the clustering:
Rand = (u11 + u00) / (u11 + u10 + u01 + u00)
Jaccard Coefficient can be utilized when placing a pair of points with the same class label in the same cluster is primarily important:
Jaccard = u11 / (u11 + u10 + u01)

Example matrix:

                            Same Cluster   Different Cluster
Class   Same Class          9              4
        Different Class     3              12

Here Rand = (9 + 12) / 28 = 0.75 and Jaccard = 9 / 16 ≈ 0.56.
Ideal and Observed Matrices
Given that the number of points is T, the ideal matrix is a T x T matrix where cell (i,j) is 1 if points i and j belong to the same class and 0 if they belong to different classes. The observed matrix is a T x T matrix where cell (i,j) is 1 if points i and j belong to the same cluster and 0 if they belong to different clusters.
The Mantel Test is a statistical test of the correlation between two matrices of the same dimensions. The two matrices in this case are symmetric, so it is sufficient to analyze the lower or upper triangle of each matrix.
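A sketch of the construction and the raw correlation of the lower triangles in base R; a full Mantel test would add a permutation-based p-value (e.g., mantel() in the vegan package). The label vectors are made up for illustration:

class   <- c(1, 1, 1, 2, 2, 2)
cluster <- c(1, 1, 2, 2, 2, 1)
ideal    <- outer(class, class, "==") * 1      # (i,j) = 1 iff same class
observed <- outer(cluster, cluster, "==") * 1  # (i,j) = 1 iff same cluster
cor(ideal[lower.tri(ideal)], observed[lower.tri(observed)])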
Evaluation Using Prior Knowledge: R-code

library(PerformanceMetrics)
data(ContingencyTable)
ContingencyTable
     [,1] [,2]
[1,]    9    4
[2,]    3   12
contingencyTableMetrics(ContingencyTable)
Evaluation Using Cluster Properties
In the absence of prior knowledge, we have to rely on information from the clusters themselves to evaluate performance.
Cohesion measures how closely objects in the same cluster are related.
Separation measures how distinct or well-separated a cluster is from all the other clusters.
Here, g_i refers to cluster i, W is the total number of clusters, x and y are data points, and proximity can be any similarity measure (e.g., cosine similarity).
We want cohesion to be close to 1 and separation to be close to 0.
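Since the slide's exact normalization was not preserved, here is a hedged sketch of one common formulation: cohesion as the average pairwise cosine similarity within a cluster, and separation as the average similarity between points in different clusters. Treat it as illustrative only:

cosineSim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

avgSim <- function(X, idx1, idx2) {             # mean proximity over pairs from idx1 x idx2
  pairs <- expand.grid(i = idx1, j = idx2)
  pairs <- pairs[pairs$i != pairs$j, ]          # drop self-pairs
  mean(apply(pairs, 1, function(p) cosineSim(X[p[1], ], X[p[2], ])))
}

set.seed(7)
X <- matrix(rnorm(20), nrow = 10)               # 10 points in 2 dimensions
g <- rep(1:2, each = 5)                         # cluster assignments
cohesion1  <- avgSim(X, which(g == 1), which(g == 1))   # within cluster g1
separation <- avgSim(X, which(g == 1), which(g == 2))   # between g1 and g2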
Optimizing Metrics
Performance metrics that act as optimization functions for a data mining algorithm
Outline:
Sum of Squared Errors
Preserved Variability
Sum of Squared Errors
The sum of squared errors (SSE) is typically used in clustering algorithms to measure the quality of the clusters obtained. It takes into consideration the distance between each point in a cluster and its cluster center (centroid or some other chosen representative).
For a point d_j in cluster g_i, where m_i is the cluster center of g_i and W is the total number of clusters, SSE is defined as follows:
SSE = sum over i = 1, ..., W of ( sum over d_j in g_i of dist(d_j, m_i)^2 )
This value is small when points are close to their cluster centers, indicating a good clustering. Similarly, a large SSE indicates a poor clustering. Thus, clustering algorithms aim to minimize SSE.
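As a sketch, SSE can be computed by hand and checked against R's kmeans(), whose tot.withinss field is exactly this quantity for Euclidean distance:

set.seed(1)
X  <- matrix(rnorm(60), ncol = 2)        # 30 points in 2 dimensions
km <- kmeans(X, centers = 3)             # W = 3 clusters
sse <- sum(sapply(1:3, function(i) {
  pts <- X[km$cluster == i, , drop = FALSE]
  sum(sweep(pts, 2, km$centers[i, ])^2)  # squared distances to centroid m_i
}))
all.equal(sse, km$tot.withinss)          # TRUE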
Preserved Variability
Preserved variability is typically used in eigenvector-based dimension reduction techniques to quantify the variance preserved by the chosen dimensions. The objective of the dimension reduction technique is to maximize this parameter.
Given that each point is originally represented in r dimensions and is reduced to k dimensions (k << r), with eigenvalues λ1 >= λ2 >= ... >= λ(r-1) >= λr, the preserved variability (PV) is calculated as follows:
PV = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λr)
The value of this parameter depends on the number of dimensions chosen: the more included, the higher the value. Choosing all the dimensions results in the perfect score of 1.
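A sketch using principal component analysis: prcomp() returns standard deviations whose squares are the eigenvalues of the covariance matrix, so PV is a ratio of partial to total eigenvalue sums:

set.seed(1)
X      <- matrix(rnorm(200), ncol = 4)    # r = 4 original dimensions
lambda <- prcomp(X)$sdev^2                # eigenvalues, largest first
k      <- 2                               # number of dimensions kept
pv     <- sum(lambda[1:k]) / sum(lambda)  # preserved variability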
Statistical Significance Techniques
Methods used to assess a p-value for the different performance metrics.
Scenario: Suppose we obtain cohesion = 0.99 for clustering algorithm A. At first glance, 0.99 looks like a very good score. However, it is possible that the underlying data is structured in such a way that you would get 0.99 no matter how you cluster the data. In that case, 0.99 is not very significant. One way to decide is by using statistical significance estimation.
We will discuss the Monte Carlo procedure on the next slide.
Monte Carlo Procedure: Empirical p-value Estimation
The Monte Carlo procedure uses random sampling to assess whether the value of a particular performance metric could have been attained at random.
For example, if we obtain a cohesion score of 0.99 for a cluster of size 5, we would be inclined to think it is a very cohesive cluster. However, this value could result from the nature of the data and not from the algorithm. To test the significance of this 0.99 value, we:
1. Sample N (usually 1000) random sets of size 5 from the dataset
2. Recalculate the cohesion for each of the N random sets
3. Count R, the number of random sets with a value >= 0.99 (the original score of the cluster)
4. Report the empirical p-value for the cluster of size 5 with score 0.99 as R/N
We then apply a cutoff, say 0.05, to decide whether 0.99 is significant. Steps 1-4 constitute the Monte Carlo method for p-value estimation, sketched below.
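A sketch of steps 1-4, assuming a cohesion() function such as the one in the earlier cluster-property sketch and a data matrix X whose rows are points (all names here are our own):

empiricalPValue <- function(X, clusterIdx, cohesion, N = 1000) {
  observed <- cohesion(X[clusterIdx, , drop = FALSE])      # e.g., 0.99
  random <- replicate(N, {
    idx <- sample(nrow(X), length(clusterIdx))             # random set of the same size
    cohesion(X[idx, , drop = FALSE])
  })
  sum(random >= observed) / N                              # empirical p-value = R / N
}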
Model Comparison
Metrics that compare the performance of different algorithms.
Scenario: Model 1 provides an accuracy of 70% and Model 2 provides an accuracy of 75%. At first glance Model 2 seems better; however, it could be that Model 1 predicts Class1 better than Model 2 does, and Class1 may be more important than Class2 for our problem.
We can use model comparison methods to take this notion of "importance" into consideration when we pick one model over another. Cost-based analysis is an important model comparison method, discussed in the next few slides.
Cost-based Analysis
In real-world applications, certain aspects of model performance are considered more important than others. For example, if a person with cancer is diagnosed as cancer-free, or vice-versa, the prediction model should be especially penalized. This penalty can be introduced in the form of a cost matrix.

Cost Matrix
                        Predicted Class
                        +      -
Actual Class   +        c11    c10
               -        c01    c00

Each cost is associated with the corresponding confusion matrix entry: c11 with f11 (or u11), c10 with f10 (or u10), c01 with f01 (or u01), and c00 with f00 (or u00).
Cost-based Analysis: Cost of a Model
The cost and confusion matrices for a model M are given below. The cost of model M is each confusion matrix count weighted by its corresponding cost, summed:
Cost(M) = c11*f11 + c10*f10 + c01*f01 + c00*f00

Cost Matrix
                        Predicted Class
                        +      -
Actual Class   +        c11    c10
               -        c01    c00

Confusion Matrix
                        Predicted Class
                        +      -
Actual Class   +        f11    f10
               -        f01    f00
Cost-based Analysis: Comparing Two Models
This analysis is typically used to select one model when we have more than one choice, obtained through different algorithms or different parameters to the learning algorithms.

Cost Matrix
                        Predicted Class
                        +      -
Actual Class   +        -20    100
               -        45     -10

Confusion Matrix of Mx
                        Predicted Class
                        +      -
Actual Class   +        4      1
               -        2      1

Confusion Matrix of My
                        Predicted Class
                        +      -
Actual Class   +        3      2
               -        2      1

Cost of Mx: (-20)(4) + (100)(1) + (45)(2) + (-10)(1) = 100
Cost of My: (-20)(3) + (100)(2) + (45)(2) + (-10)(1) = 220
Cost(Mx) < Cost(My)
Based purely on the cost model, Mx is the better model.
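Since the cost is the elementwise product of the cost and confusion matrices, summed, the comparison can be checked directly in base R (matrices taken from this slide):

CostMatrix <- matrix(c(-20, 45, 100, -10), nrow = 2)  # rows = actual +/-, cols = predicted +/-
Mx <- matrix(c(4, 2, 1, 1), nrow = 2)
My <- matrix(c(3, 2, 2, 1), nrow = 2)
sum(CostMatrix * Mx)   # 100
sum(CostMatrix * My)   # 220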
Cost-based Analysis: R-code

library(PerformanceMetrics)
data(Mx)
data(My)
data(CostMatrix)
Mx
     [,1] [,2]
[1,]    4    1
[2,]    2    1
My
     [,1] [,2]
[1,]    3    2
[2,]    2    1
costAnalysis(Mx, CostMatrix)
costAnalysis(My, CostMatrix)