Presentation Transcript

Slide 1: Chapter 9 - Classification and Clustering

Slide 2: Classification and Clustering

Classification and clustering are classical pattern recognition and machine learning problems.

- Classification, also referred to as categorization, asks "What class does this item belong to?"
  - A supervised learning task (automatically applies labels to data)
- Clustering asks "How can I group this set of items?"
  - An unsupervised learning task (groups related items together)
- Items can be documents, emails, queries, entities, and images
- Both are useful for a wide variety of search engine tasks

Slide 3: Classification

Classification is the task of automatically applying labels to items.

- Useful for many search-related tasks: spam detection, sentiment classification, online advertising
- Two common approaches: probabilistic and geometric

Slide 4: How to Classify?

How do humans classify items? For example, suppose you had to classify the healthiness of a food:

- Identify a set of features indicative of health: fat, cholesterol, sugar, sodium, etc.
- Extract features from foods: read nutritional facts, run a chemical analysis, etc.
- Combine evidence from the features into a hypothesis: add the health features together to get a "healthiness factor"
- Finally, classify the item based on the evidence: if the "healthiness factor" is above a certain value, deem the food healthy (a toy sketch of this thresholding idea follows)
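
The following is a minimal sketch of the feature-combination idea above; the feature names, weights, and threshold are hypothetical, chosen only to illustrate "sum the evidence, then threshold".

```python
# Toy sketch of the "healthiness factor" idea: combine weighted feature
# evidence into one score and compare it against a threshold.
# Feature names, weights, and the threshold are made up for illustration.

def healthiness_factor(food, weights):
    """Combine feature evidence into a single score."""
    return sum(weights[f] * value for f, value in food.items())

weights = {"fat": -1.0, "cholesterol": -1.0, "sugar": -0.5, "fiber": +1.0}
food = {"fat": 3.0, "cholesterol": 1.0, "sugar": 2.0, "fiber": 6.0}

THRESHOLD = 0.0
label = "healthy" if healthiness_factor(food, weights) >= THRESHOLD else "unhealthy"
print(label)   # "healthy" for this toy example
```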

Slide 5: Ontologies

An ontology is a labeling or categorization scheme.

Examples:
- Binary (spam, not spam)
- Multi-valued (red, green, blue)
- Hierarchical (news/local/sports)

Different classification tasks require different ontologies.

Slide 6: Naïve Bayes Classifier

A probabilistic classifier based on Bayes' rule:

  P(c | d) = P(d | c) P(c) / Σ_{c'∈C} P(d | c') P(c')

where C and D are random variables corresponding to the class and the input (document), respectively.

Applying the chain rule together with the term independence assumption, the Naïve Bayes rule yields:

  P(c | d) = [ Π_{i=1}^{n} P(w_i | c) ] P(c) / Σ_{c'∈C} [ Π_{i=1}^{n} P(w_i | c') ] P(c')

Slide 7: Naïve Bayes Classifier

Documents are classified according to

  Class(d) = argmax_{c∈C} P(c | d) = argmax_{c∈C} P(d | c) P(c)

- Must estimate P(d | c) and P(c)
- P(c) is the probability of observing class c
- P(d | c) is the probability that document d is observed given that the class is known to be c

Slide 8: Estimating P(c)

P(c) is the probability of observing class c. It is estimated as the proportion of training documents in class c:

  P(c) = N_c / N

where N_c is the number of training documents in class c and N is the total number of training documents. (A small sketch follows.)
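
A minimal sketch of the class-prior estimate, assuming a hypothetical list of training labels:

```python
# Sketch: estimate the class prior P(c) as the fraction of training
# documents labeled c. The label list is hypothetical.
from collections import Counter

train_labels = ["spam", "not spam", "spam", "spam", "not spam"]

N = len(train_labels)
prior = {c: n_c / N for c, n_c in Counter(train_labels).items()}
print(prior)   # {'spam': 0.6, 'not spam': 0.4}
```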

Slide 9: Estimating P(d | c)

P(d | c) is the probability that document d is observed given that the class is known to be c. The estimate depends on the event space used to represent the documents.

What is an event space?
- The set of all possible outcomes for a given random variable
  - e.g., for a coin-toss random variable the event space is S = {heads, tails}
- A probability is assigned to each event/outcome in S
- The sum of the probabilities over all events in S must equal one

Slide 10: Multiple Bernoulli Event Space

Documents are represented as binary vectors:
- One entry for every word in the vocabulary
- Entry i = 1 if word i occurs in the document; 0 otherwise

The multiple-Bernoulli distribution is a natural way to model distributions over binary vectors. This is the same event space as used in the classical probabilistic retrieval model.

Slide 11: Multiple Bernoulli Document Representation

Example (figure): documents represented as binary term-occurrence vectors, one row per document and one column per vocabulary word.

Slide 12: Multiple-Bernoulli: Estimating P(d | c)

In the multiple-Bernoulli model, P(d | c) is computed as

  P(d | c) = Π_{w∈V} P(w | c)^δ(w,d) (1 - P(w | c))^(1-δ(w,d))

where δ(w, d) = 1 iff term w occurs in d.

- If some w ∈ d never occurred in class c in the training set, then P(d | c) = 0; this is the "data sparseness" problem, which can be solved by smoothing
- Laplacian smoothed estimate: add pseudo-counts to df_{w,c}, the number of training documents in c that contain w, relative to N_c, the number of training documents in class c
- Collection smoothed estimate: P(w | c) = (df_{w,c} + μ N_w / N) / (N_c + μ), where μ is a tunable parameter and N_w is the number of documents containing w

(A sketch with one common smoothing choice follows.)
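
The sketch below computes the multiple-Bernoulli document likelihood with a Laplace-style smoothed estimate. The counts are hypothetical, and the exact pseudo-count form (df+1)/(N_c+2) is one common choice assumed here, since the slide's own formula is not reproduced in the transcript.

```python
# Sketch of the multiple-Bernoulli estimates. df[c][w] = number of
# training documents of class c containing term w; n_docs[c] = N_c.
# The Laplace form (df+1)/(N_c+2) is an assumed, common smoothing choice.
import math

vocab = ["cheap", "buy", "banking", "dinner", "the"]
df = {"spam":     {"cheap": 3, "buy": 3, "banking": 1, "dinner": 0, "the": 4},
      "not spam": {"cheap": 1, "buy": 1, "banking": 2, "dinner": 2, "the": 5}}
n_docs = {"spam": 4, "not spam": 6}          # hypothetical training counts

def p_w_given_c(w, c):
    return (df[c][w] + 1) / (n_docs[c] + 2)  # Laplace-smoothed P(w | c)

def log_p_d_given_c(doc_terms, c):
    """log P(d | c): product over the whole vocabulary, present and absent terms."""
    present = set(doc_terms)
    total = 0.0
    for w in vocab:
        p = p_w_given_c(w, c)
        total += math.log(p) if w in present else math.log(1.0 - p)
    return total

print(log_p_d_given_c(["cheap", "buy"], "spam"))
```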

Slide 13: Multinomial Event Space

Documents are represented as vectors of term frequencies:
- One entry for every word in the vocabulary
- Entry i = the number of times term i occurs in the document

The multinomial distribution is a natural way to model distributions over frequency vectors. This is the same event space as used in the language modeling retrieval model.

Slide 14: Multinomial Document Representation

Example (figure): documents represented as term-frequency vectors, one row per document and one column per vocabulary word.

Slide 15: Multinomial: Estimating P(d | c)

In the multinomial model, P(d | c) is computed as

  P(d | c) = P(|d|) × [ |d|! / (tf_{w1,d}! × … × tf_{wV,d}!) ] × Π_{w∈V} P(w | c)^tf_{w,d}

where P(|d|) is the (document-dependent) probability of generating a document of length |d| (the number of terms in d) and the bracketed factor is the multinomial coefficient.

- Laplacian smoothed estimate: P(w | c) = (tf_{w,c} + 1) / (|c| + |V|), where tf_{w,c} is the number of occurrences of term w in class c, |c| is the number of terms in the training documents of class c, and |V| is the number of distinct terms in the training documents
- Collection smoothed estimate: P(w | c) = (tf_{w,c} + μ cf_w / |C|) / (|c| + μ), where μ is a tunable parameter, cf_w is the number of occurrences of term w in the whole training set, and |C| is the number of terms in all training documents

(A sketch of a multinomial Naïve Bayes classifier follows.)
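
Below is a minimal multinomial Naïve Bayes sketch using the Laplacian smoothed estimate. The tiny training collection is hypothetical; the multinomial coefficient and document-length factor are dropped because they are identical across classes and do not affect the argmax.

```python
# Sketch of a multinomial Naive Bayes classifier with Laplace smoothing:
# P(w|c) = (tf_{w,c} + 1) / (|c| + |V|). Training data is hypothetical.
import math
from collections import Counter

train = [("spam",     "cheap cheap buy banking".split()),
         ("spam",     "cheap buy buy".split()),
         ("not spam", "dinner the banking the".split()),
         ("not spam", "the dinner dinner".split())]

classes = {c for c, _ in train}
vocab = {w for _, doc in train for w in doc}
tf = {c: Counter() for c in classes}            # term counts per class
n_docs = Counter(c for c, _ in train)
for c, doc in train:
    tf[c].update(doc)

def log_posterior(doc, c):
    # Multinomial coefficient and P(|d|) are class-independent, so omitted.
    logp = math.log(n_docs[c] / len(train))                      # log P(c)
    c_len = sum(tf[c].values())                                  # |c|
    for w in doc:
        logp += math.log((tf[c][w] + 1) / (c_len + len(vocab)))  # smoothed log P(w|c)
    return logp

doc = "cheap buy".split()
print(max(classes, key=lambda c: log_posterior(doc, c)))   # expected: 'spam'
```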

Slide 16: Multinomial Versus Multiple-Bernoulli Model

- The multinomial model consistently outperforms the multiple-Bernoulli model
- Implementing both models is relatively straightforward
- Both classifiers are
  - efficient, since their statistical data can be stored in memory
  - accurate in document classification
  - popular and attractive choices as general-purpose classifiers

Slide 17: Support Vector Machines (SVM)

A vector-space-based machine-learning method.

- Goal: find a decision boundary between two classes that is maximally far from any point in the training data
- For a linearly separable two-class data set there is an infinite number of hyperplanes that separate the two classes; some training points lie right up against the margin of the classifier

Slide 18: Support Vector Machines (SVM)

Based on geometric principles.

- Documents are represented as N-dimensional vectors with (non-)binary feature weights
- Given a set of inputs labeled '+' and '-', find the best hyperplane in an N-dimensional space that separates the '+'s and '-'s, i.e., a binary categorization method
- Questions:
  - How is "best" defined?
  - What if no hyperplane exists such that the '+'s and '-'s can be perfectly separated?

Slide 19: "Best" Hyperplane?

First, what is a hyperplane?
- A generalization of a line to higher dimensions
- Defined by a vector w that is learned from the training data

Avoiding overfitting (working well on the training data but failing to classify the test data):
- To avoid overfitting, SVM chooses the hyperplane with the maximum margin separating the '+'s and '-'s
- This is necessary because points near the decision surface represent very uncertain classification decisions, with almost a 50% chance of deciding either way
- The chance of generalizing correctly to test data is thereby increased

Slide 20: Support Vector Machines

Figure: '+' and '-' data points separated by the hyperplane w · x = 0 (defined by the vector w), with margin boundaries w · x = 1 and w · x = -1; points with w · x > 0 fall on the '+' side and points with w · x < 0 on the '-' side. The margin, which the hyperplane H maximizes to separate the '+' and '-' points, is the distance from x- to H plus the distance from x+ to H.

Slide 21: "Best" Hyperplane?

- w · x is a scalar, i.e., a single number: the projection of w onto x
- w · x = 0 specifies each point x that lies on the line perpendicular to w
- w · x = 1 is the line parallel to w · x = 0, shifted by 1 / ||w||
- 2 / ||w|| is thus the maximal margin, which is the objective function for the boundary

Figure: the parallel lines w · x = -1, w · x = 0, and w · x = 1, each separated from the next by a distance of 1 / ||w|| along w.

Slide 22: "Best" Hyperplane?

- If x+ and x- are the closest '+' and '-' inputs to the hyperplane, called support vectors, then the margin is the sum of the distances from x- to the hyperplane and from x+ to the hyperplane
- It is typically assumed that |w · x-| = |w · x+| = 1, which does not change the solution to the problem
- Thus, finding the hyperplane with the largest (maximal) margin requires maximizing

  Margin(w) = 2 / ||w||

  where ||w|| = (w · w)^(1/2) is the magnitude of the weight vector w

Slide 23: Separable vs. Non-Separable Data

Figure: a linearly separable data set and a non-separable data set, each containing '+' and '-' points.

- Linearly separable data sets are well handled
- For non-separable data, the original feature space must be mapped to some higher-dimensional feature space where the data set is separable

Slide 24: Linear Separable Case

In math: maximize Margin(w) = 2 / ||w|| subject to w · x_i ≥ 1 for the '+' inputs and w · x_i ≤ -1 for the '-' inputs.

In English: find the largest-margin hyperplane that separates the '+'s and '-'s.

- Can be solved using quadratic programming
- An unseen document d can then be classified using

  Class(d) = '+' if w · x_d > 0, '-' otherwise

(A sketch using an off-the-shelf linear SVM follows.)
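
The quadratic program itself is usually left to a library. The sketch below uses scikit-learn's LinearSVC as one readily available linear SVM solver (an external dependency not named on the slides), trains it on a made-up 2-D data set, and classifies an unseen point by the sign of the decision value, mirroring Class(d) above.

```python
# Sketch: learn a maximum-margin linear separator and classify an unseen
# point by the sign of the decision function. Uses scikit-learn's LinearSVC
# as one convenient solver; the tiny 2-D data set is made up for illustration.
from sklearn.svm import LinearSVC

X = [[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],      # '+' examples
     [0.0, 0.5], [0.5, 0.0], [1.0, 0.5]]      # '-' examples
y = [+1, +1, +1, -1, -1, -1]

clf = LinearSVC(C=1.0)
clf.fit(X, y)

d = [2.2, 2.8]                                 # unseen document vector x_d
print("+" if clf.decision_function([d])[0] > 0 else "-")   # expected: '+'
```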

Slide 25: Feature Selection for Text Classification

- Document classifiers can have a very large number of features, such as indexed terms
- Not all features are useful
- Excessive features can increase the computational cost of training and testing
- Feature selection methods reduce the number of features by choosing the most useful features, which can significantly improve efficiency (in terms of storage and processing time) while not hurting effectiveness much, in addition to eliminating noisy features

Slide 26: Information Gain

Information gain (IG) is a commonly used feature selection measure based on information theory.

- It tells how much "information" is gained (about the class labels) if we observe some feature
- Entropy characterizes the (im)purity of a collection of examples
- The information gain is the expected reduction in entropy caused by partitioning the examples according to an attribute (word)
- Rank features by information gain and then train the model using the top K attributes (words), where K is typically small
- The information gain for a multinomial Naïve Bayes classifier is computed as

  IG(w) = -Σ_{c∈C} P(c) log P(c) + Σ_{w∈{0,1}} P(w) Σ_{c∈C} P(c | w) log P(c | w)

  where the first term is the entropy of P(c) and the second term involves the conditional entropy of the class given w

(A sketch follows.)
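
The sketch below computes IG(w) for a binary term feature exactly as in the formula above, using base-2 logarithms and the 0 log 0 = 0 convention. The toy document collection is hypothetical.

```python
# Sketch: information gain of a binary term feature, following
# IG(w) = -sum_c P(c) log P(c) + sum_{w in {0,1}} P(w) sum_c P(c|w) log P(c|w).
import math
from collections import Counter

def plogp(p):
    return p * math.log2(p) if p > 0 else 0.0     # convention: 0 log 0 = 0

def information_gain(docs, labels, term):
    N = len(docs)
    ig = -sum(plogp(n / N) for n in Counter(labels).values())    # entropy of P(c)
    for present in (True, False):
        idx = [i for i, d in enumerate(docs) if (term in d) == present]
        if not idx:
            continue
        p_w = len(idx) / N
        ig += p_w * sum(plogp(n / len(idx))
                        for n in Counter(labels[i] for i in idx).values())
    return ig

docs = [{"cheap", "buy"}, {"cheap", "banking"}, {"dinner", "the"}, {"the", "buy"}]
labels = ["spam", "spam", "not spam", "not spam"]
print(information_gain(docs, labels, "cheap"))   # 1.0: 'cheap' perfectly predicts the class
```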

Slide 27: Feature Selection for Text Classification

Feature selection is based on entropy/information gain.

- The law of large numbers indicates that the symbol a_j will, on average, be selected n × p(a_j) times in a total of n selections
- Since each occurrence of a_j (j ≥ 1) carries -log2 p(a_j) bits of information, the average amount of information obtained from n source outputs is

  n × p(a_1) log2 p(a_1)^(-1) + … + n × p(a_j) log2 p(a_j)^(-1) bits

- Dividing by n gives the average amount of information per source output symbol, known as the uncertainty, or entropy:

  E = Σ_j p(a_j) log2 p(a_j)^(-1)

  where the sum ranges over all symbols of the source

Slide 28: Information Gain

Example. The information gain for the term "cheap", using

  IG(w) = -Σ_{c∈C} P(c) log P(c) + Σ_{w∈{0,1}} P(w) Σ_{c∈C} P(c | w) log P(c | w)

where P(¬cheap) denotes P(cheap = 0), P(¬spam) denotes P(not spam), and 0 log 0 = 0.

Resulting values: IG(buy) = 0.0008, IG(banking) = 0.04, IG(dinner) = 0.36, IG(the) = 0

Slide 29: Clustering

A set of unsupervised algorithms that attempt to find latent structure in a set of items.

- Goal: identify groups (clusters) of similar items, given a set of unlabeled instances
- Suppose I gave you the shape, color, vitamin C content, and price of various fruits and asked you to cluster them
  - What criteria would you use?
  - How would you define similarity?
- Clustering is very sensitive to (i) how items are represented and (ii) how similarity is defined

Slide 30: Clustering

General outline of clustering algorithms:
1. Decide how items will be represented (e.g., feature vectors)
2. Define a similarity measure between pairs or groups of items (e.g., cosine similarity)
3. Determine what makes a "good" clustering (e.g., using intra- and inter-cluster similarity measures)
4. Iteratively construct clusters that are increasingly "good"
5. Stop after a local/global optimum clustering is found

Steps 3 and 4 differ the most across algorithms.

Slide 31: Hierarchical Clustering

Constructs a hierarchy of clusters, starting from some initial clustering of the data and iteratively trying to improve the "quality" of the clusters.

- The top level of the hierarchy consists of a single cluster containing all items
- The bottom level of the hierarchy consists of N (the number of items) singleton clusters
- Different objectives lead to different types of clusters
- Two types of hierarchical clustering:
  - Divisive ("top down")
  - Agglomerative ("bottom up")
- The hierarchy can be visualized as a dendrogram

Slide 32: Example Dendrogram

Figure: a dendrogram over items A through M; the height at which clusters merge indicates the similarity of the clusters involved.

Slide 33: Divisive & Agglomerative Hierarchical Clustering

Divisive:
- Start with a single cluster consisting of all of the items
- Until only singleton clusters exist, divide an existing cluster into two (or more) new clusters

Agglomerative:
- Start with N (the number of items) singleton clusters
- Until a single cluster exists, combine two (or more) existing clusters into a new cluster

How do we know how to divide or combine clusters?
- Define a division or combination cost
- Perform the division or combination with the lowest cost

Slide 34: Divisive Hierarchical Clustering

Figure: items A through G repeatedly split, top down, from one all-inclusive cluster into smaller clusters.

Slide 35: Agglomerative Hierarchical Clustering

Figure: items A through G repeatedly merged, bottom up, from singleton clusters into larger clusters.

Slide 36: Clustering Costs

Cost: a measure of how expensive it is to merge two clusters (distances are Euclidean).

- Single linkage: the smallest distance between any pair of items, one from each cluster
- Complete linkage: the largest distance between any pair of items, one from each cluster
- Average linkage: the average distance over all pairs of items, one from each cluster
- Average group linkage: the distance between the cluster centroids, where μ_C = (Σ_{X∈C} X) / |C| is the centroid of cluster C

(A sketch of these costs follows.)
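
A minimal sketch of the four linkage costs, using Euclidean distance. The slide's equations are not reproduced in the transcript, so the code follows the usual definitions stated above, reading average group linkage as the distance between the cluster centroids. The points are hypothetical.

```python
# Sketch of single, complete, average, and average group linkage costs
# between two clusters A and B of points, with Euclidean distance.
import math

def single_linkage(A, B):
    return min(math.dist(a, b) for a in A for b in B)      # closest pair

def complete_linkage(A, B):
    return max(math.dist(a, b) for a in A for b in B)      # farthest pair

def average_linkage(A, B):
    return sum(math.dist(a, b) for a in A for b in B) / (len(A) * len(B))

def centroid(C):
    return [sum(x[i] for x in C) / len(C) for i in range(len(C[0]))]

def average_group_linkage(A, B):
    return math.dist(centroid(A), centroid(B))              # centroid distance

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (4.0, 1.0)]
print(single_linkage(A, B), complete_linkage(A, B),
      average_linkage(A, B), average_group_linkage(A, B))
```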

Slide 37: Clustering Strategies

Figure: the same set of points clustered under single linkage, complete linkage, average linkage, and average group linkage (the last shown with cluster centroids μ).

Generally, average-link clustering yields the best effectiveness.

Slide 38: Clustering Costs

The choice of the best clustering technique/strategy requires experiments and evaluation.

- Single linkage: could result in "very long" or "spread-out" clusters
- Complete linkage: clusters are more compact than with single linkage
- Average linkage: a compromise between single and complete linkage
- Average group linkage: closely related to the average linkage approach

Slide 39: K-Means Clustering

Whereas hierarchical clustering constructs a hierarchy of clusters, K-means always maintains exactly K clusters.

- Clusters are represented by their centroids ("centers of mass")
- Basic algorithm:
  - Step 0: Choose K cluster centroids
  - Step 1: Assign points to the closest centroid
  - Step 2: Re-compute the cluster centroids
  - Step 3: Go to Step 1
- Tends to converge quickly
- Can be sensitive to the choice of initial centroids
- Must choose K to begin with!

Slide 40: K-Means Clustering

Goal: find the cluster assignments (for the assignment vector A[1], …, A[N]) that minimize the cost function

  COST(A[1], …, A[N]) = Σ_{k=1}^{K} Σ_{i: A[i]=k} dist(X_i, C_k)

where dist(X_i, C_k) = ||X_i - C_k||^2 = (X_i - C_k) · (X_i - C_k), the squared Euclidean distance, and C_k is the centroid of cluster k.

Strategy:
- Randomly select K initial cluster centers (instances) as seeds
- Move the cluster centers around to minimize the cost function:
  - Re-assign instances to the cluster with the closest centroid
  - Re-compute the cost value of each centroid based on the current members of its cluster

Slide 41: K-Means Clustering

Example (figure): five snapshots (a)-(e) of the K-means iterations on a small data set.

Slide 42: K-Means Clustering

The K-means optimization problem:
- A naïve approach is to try every possible combination of cluster assignments, which is infeasible for large data sets
- The K-means algorithm instead finds an approximate, heuristic solution that iteratively tries to minimize the cost
- Anticipated results:
  - The solution is not guaranteed to be globally (nor even locally) optimal
  - Despite its heuristic nature, the K-means algorithm tends to work very well in practice
- In practice, K-means clustering tends to converge quickly
- Compared to hierarchical clustering H, K-means is more efficient and produces clusters of similar quality to H
- Implementing K-means requires O(KN) work, rather than O(N^2) for H

Slide 43: K-Means Clustering Algorithm

Algorithm notes: the K initial centroids are chosen either randomly or using some knowledge of the data; each instance is assigned to the closest cluster; the loop proceeds as long as the cluster of some instance changes. (A runnable sketch follows.)
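
A minimal pure-Python sketch of the loop described above, using squared Euclidean distance and random seeding; the sample points are hypothetical.

```python
# Sketch of the K-means loop: choose K seeds, assign each point to its
# closest centroid, recompute centroids, repeat until assignments stop changing.
import random

def kmeans(points, K, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, K)                     # step 0: random seeds
    assign = [None] * len(points)
    for _ in range(iters):
        # step 1: assign each point to the closest centroid (squared Euclidean)
        new_assign = [min(range(K),
                          key=lambda k: sum((p - c) ** 2
                                            for p, c in zip(x, centroids[k])))
                      for x in points]
        if new_assign == assign:                          # converged
            break
        assign = new_assign
        # step 2: recompute each centroid from its current members
        for k in range(K):
            members = [x for x, a in zip(points, assign) if a == k]
            if members:
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign, centroids

pts = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9), (9.0, 1.0)]
print(kmeans(pts, K=2))
```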

Slide 44: K Nearest Neighbor Clustering

- Hierarchical and K-means clustering partition items into clusters: every item is in exactly one cluster
- K nearest neighbor clustering instead forms one cluster per item: the cluster for item j consists of j and the K nearest neighbors of j
- Clusters now overlap (see the sketch below)
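
A minimal sketch of one-cluster-per-item K nearest neighbor clustering on hypothetical points:

```python
# Sketch of K nearest neighbor clustering: one (overlapping) cluster per item,
# containing the item plus its K nearest neighbors by Euclidean distance.
import math

def knn_clusters(points, K):
    clusters = {}
    for i, x in enumerate(points):
        neighbors = sorted((j for j in range(len(points)) if j != i),
                           key=lambda j: math.dist(x, points[j]))[:K]
        clusters[i] = [i] + neighbors
    return clusters

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(knn_clusters(pts, K=2))   # clusters overlap: nearby items appear in each other's clusters
```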

Slide 45: 5 Nearest Neighbor Clustering

Figure: points labeled A, B, C, and D clustered with 5 nearest neighbor clustering; each point's cluster contains the point and its five nearest neighbors, so clusters overlap.

Slide 46: K Nearest Neighbor Clustering

Drawbacks of the K nearest neighbor clustering method:
- Often fails to find meaningful clusters
  - In sparse areas of the input space, the instances assigned to a cluster are rather far apart (e.g., D in the 5-NN example)
  - In dense areas, some related instances may be missed if K is not large enough (e.g., B in the 5-NN example)
- Computationally expensive (compared with K-means), since it computes the distances between each pair of instances

Applications of the K nearest neighbor clustering method:
- Emphasize finding a small number (rather than all) of closely related instances, i.e., precision over recall

Slide 47: How to Choose K?

- K-means and K nearest neighbor clustering require us to choose K
- There is no theoretically appealing way of choosing K
- K depends on the application and data; it is often chosen experimentally by evaluating the quality of the resulting clusters for various values of K
- Can use hierarchical clustering and choose the best level
- Can use an adaptive K for K nearest neighbor clustering:
  - A larger (smaller) K for dense (sparse) areas
  - Challenge: choosing the boundary size
- A difficult problem with no clear solution

Slide 48: Adaptive Nearest Neighbor Clustering

Figure: points labeled A through D clustered with an adaptive nearest neighbor approach, using a larger K in dense areas and a smaller K in sparse areas.

Slide 49: Evaluation of Clustering

Typical objective functions (internal criteria) in clustering:
- Attaining high intra-cluster similarity (documents within a cluster are similar)
- Achieving low inter-cluster similarity (documents from different clusters are dissimilar)

An external criterion evaluates how well the clustering matches the gold standard classes:
- The gold standard is ideally produced by human judges with a good level of inter-judge agreement
- A set of classes is used in the evaluation benchmark
- Four external criteria of clustering quality: purity, normalized mutual information, Rand index, and F-measure

Slide 50: Evaluation of Clustering

Purity, a simple and transparent evaluation measure:
- Each cluster is assigned to the class that is most frequent in the cluster
- The accuracy of the assignment is measured by counting the number of correctly assigned documents and dividing by N, the total number of documents to be clustered:

  Purity(Ω, C) = (1/N) Σ_{i=1}^{K} max_j |ω_i ∩ C_j|

  where Ω = {ω_1, ω_2, …, ω_K} is the set of clusters, C = {C_1, C_2, …, C_J} is the set of classes, and each ω_n (1 ≤ n ≤ K) and each C_m (1 ≤ m ≤ J) is a set of documents
- Bad clusterings have purity values close to 0; a perfect clustering has a purity of 1

Slide 51: Evaluation of Clustering

Example. An external evaluation of cluster quality on three clusters of 17 documents drawn from the classes x, o, and ◇ (cluster 1: five x and one o; cluster 2: four o, one x, and one ◇; cluster 3: three ◇ and two x). The majority class and the number of members of the majority class for the three clusters are x (5), o (4), and ◇ (3), so

  Purity = (1/17) × (5 + 4 + 3) ≈ 0.71

  Measure              Purity   NMI    RI     F5
  Minimum              0.0      0.0    0.0    0.0
  Maximum              1.0      1.0    1.0    1.0
  Value for example    0.71     0.36   0.68   0.46

(A purity sketch follows.)
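
The sketch below computes purity for this example. The per-cluster class counts are reconstructed from the majority counts on this slide and the pair counts on the Rand index slide.

```python
# Sketch: purity of the three-cluster example (5 x / 1 o, 1 x / 4 o / 1 diamond,
# 2 x / 3 diamond), reconstructed from the counts given in the slides.
from collections import Counter

clusters = [["x"] * 5 + ["o"],
            ["x"] + ["o"] * 4 + ["diamond"],
            ["x"] * 2 + ["diamond"] * 3]

N = sum(len(c) for c in clusters)
purity = sum(max(Counter(c).values()) for c in clusters) / N
print(round(purity, 2))   # 0.71
```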

Slide 52: Evaluation of Clustering

Normalized Mutual Information (NMI):
- High purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each document is assigned its own cluster
- A tradeoff is the normalized mutual information:

  NMI(Ω, C) = I(Ω; C) / ([H(Ω) + H(C)] / 2)

  where I is the mutual information, the knowledge gained about the classes from the clustering:

  I(Ω; C) = Σ_{n=1}^{K} Σ_{m=1}^{J} P(ω_n ∩ C_m) log [ P(ω_n ∩ C_m) / (P(ω_n) P(C_m)) ]
          = Σ_{n=1}^{K} Σ_{m=1}^{J} (|ω_n ∩ C_m| / N) log [ N |ω_n ∩ C_m| / (|ω_n| |C_m|) ]

  (the joint probability P(ω_n ∩ C_m) and the marginals P(ω_n), P(C_m) are replaced by their maximum-likelihood estimates in the second line; I(Ω; C) is the KL-divergence between the joint distribution and the product of the marginals)

  and H is the entropy:

  H(Ω) = -Σ_{n=1}^{K} P(ω_n) log P(ω_n) = -Σ_{n=1}^{K} (|ω_n| / N) log (|ω_n| / N)

(An NMI sketch follows.)
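
A sketch of NMI on the same reconstructed example clusters; the computation follows the maximum-likelihood formulas above and should come out near the 0.36 reported in the example table.

```python
# Sketch: NMI(Omega, C) = I(Omega; C) / ((H(Omega) + H(C)) / 2),
# computed from cluster/class co-occurrence counts of the example.
import math
from collections import Counter

clusters = [["x"] * 5 + ["o"],
            ["x"] + ["o"] * 4 + ["diamond"],
            ["x"] * 2 + ["diamond"] * 3]
N = sum(len(c) for c in clusters)

def entropy(sizes):
    return -sum((s / N) * math.log(s / N) for s in sizes if s)

cluster_sizes = [len(c) for c in clusters]
class_sizes = Counter(lbl for c in clusters for lbl in c)

I = 0.0
for c in clusters:
    for lbl, n_wc in Counter(c).items():
        I += (n_wc / N) * math.log(N * n_wc / (len(c) * class_sizes[lbl]))

nmi = I / ((entropy(cluster_sizes) + entropy(class_sizes.values())) / 2)
print(round(nmi, 2))   # ~0.36
```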

Slide 53: Evaluation of Clustering

Normalized Mutual Information:
- I(Ω; C) is 0 if the clustering is random with respect to class membership
- Maximum mutual information (MI) is reached for a clustering Ω_exact that perfectly recreates the classes
- A clustering with K = N, i.e., one-document clusters, also has the maximum MI (the same problem as purity), which is fixed by the normalization [H(Ω) + H(C)] / 2
  - Entropy tends to increase with the number of clusters, i.e., H(Ω) reaches its maximum, log N, for K = N
  - [H(Ω) + H(C)] / 2 is a tight upper bound on I(Ω; C)

Slide 54: Evaluation of Clustering

Rand Index (RI) measures the fraction of pairwise decisions that are correct: assigning two similar documents to the same cluster and two dissimilar documents to different clusters.

  RI = (TP + TN) / (TP + FP + FN + TN)

where
- TP occurs if two similar docs are assigned to the same cluster
- TN occurs when two dissimilar docs are assigned to different clusters
- FP occurs if two dissimilar docs are assigned to the same cluster
- FN occurs when two similar docs are assigned to different clusters

Slide 55: Evaluation of Clustering

Example (RI). Given the clusters from the purity example (cluster sizes 6, 6, and 5):

  TP + FP = C(6,2) + C(6,2) + C(5,2) = 15 + 15 + 10 = 40
  TP = C(5,2) + C(4,2) + C(3,2) + C(2,2) = 10 + 6 + 3 + 1 = 20
  FN = 5 + 10 + 4 + 3 + 2 = 24
  TN = (25 + 2) between clusters 1 and 2 + (15 + 5) between clusters 1 and 3 + (20 + 3 + 2) between clusters 2 and 3 = 72

  RI = (20 + 72) / (20 + 20 + 24 + 72) ≈ 0.68

Pair contingency table:

                       Same Cluster   Different Clusters
  Same Class           TP = 20        FN = 24
  Different Classes    FP = 20        TN = 72

(A sketch that counts these pairs follows.)
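
The sketch below enumerates all document pairs of the example clustering and counts TP, FP, FN, and TN directly, reproducing the table above and RI ≈ 0.68.

```python
# Sketch: Rand index from pairwise decisions over the example clustering.
from itertools import combinations

clusters = [["x"] * 5 + ["o"],
            ["x"] + ["o"] * 4 + ["diamond"],
            ["x"] * 2 + ["diamond"] * 3]

docs = [(ci, lbl) for ci, c in enumerate(clusters) for lbl in c]

tp = fp = fn = tn = 0
for (c1, l1), (c2, l2) in combinations(docs, 2):
    same_cluster, same_class = c1 == c2, l1 == l2
    if same_cluster and same_class:
        tp += 1
    elif same_cluster:
        fp += 1
    elif same_class:
        fn += 1
    else:
        tn += 1

print(tp, fp, fn, tn)                               # 20 20 24 72
print(round((tp + tn) / (tp + fp + fn + tn), 2))    # 0.68
```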

Slide 56: Evaluation of Clustering

F-Measure:
- RI gives equal weight to FPs and FNs, but separating similar docs is sometimes worse than putting pairs of dissimilar docs in the same cluster
- The F-measure penalizes FNs more strongly than FPs by selecting a β > 1, thus giving more weight to recall:

  P = TP / (TP + FP),  R = TP / (TP + FN),  F_β = (β^2 + 1) P R / (β^2 P + R)

Example. Based on TP = 20, FP = 20, FN = 24, and TN = 72:

  P = 20/40 = 0.5, R = 20/44 ≈ 0.455, F_1 ≈ 0.48, F_5 ≈ 0.456

(A sketch follows.)
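
A short sketch computing precision, recall, and F_β from the pair counts of the Rand index example:

```python
# Sketch: precision, recall, and F_beta from the pair counts of the example;
# beta > 1 weights recall (i.e., false negatives) more heavily.
TP, FP, FN = 20, 20, 24

P = TP / (TP + FP)
R = TP / (TP + FN)

def f_beta(beta):
    return (beta ** 2 + 1) * P * R / (beta ** 2 * P + R)

print(round(P, 2), round(R, 3))                    # 0.5 0.455
print(round(f_beta(1), 2), round(f_beta(5), 3))    # 0.48 0.456
```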

Slide 57: Evaluating Clustering

Evaluating clustering is challenging, since it is an unsupervised learning task.

- If labels exist, standard IR metrics such as precision and recall can be used
- If not, measures such as "cluster precision" can be used, defined as

  ClusterPrecision = (1/N) Σ_{i=1}^{K} |MaxClass(C_i)|

  where K (= |C|) is the total number of resulting clusters, |MaxClass(C_i)| is the number of instances in cluster C_i carrying the most frequent (human-assigned) class label in C_i, and N is the total number of instances
- Another option is to evaluate clustering as part of an end-to-end system, e.g., using clusters to improve web ranking