Text Categorization: Assigning Documents to a Fixed Set of Categories


Text Categorization

Assigning documents to a fixed set of categories

with text features (words/terms) as the attributes representing each document

Applications:

Web pages

Recommending pages

Yahoo-like classification hierarchies

Categorizing bookmarks

Newsgroup Messages /News Feeds / Micro-blog Posts

Recommending messages, posts, tweets, etc.

Message filtering

News articles

Personalized news

Email messages

Routing

Spam filtering

Learning for Text Categorization

Text Categorization is an application of classification learning

Typical Learning Algorithms:

Bayesian (naïve)

Neural networks

Relevance Feedback (Rocchio)

Nearest Neighbor

Support Vector Machines (SVM)

Similarity/Distance Metrics

Nearest neighbor method depends on a similarity (or distance) metric

Simplest metric for a continuous m-dimensional instance space: Euclidean distance

Simplest metric for an m-dimensional binary instance space: Hamming distance (number of feature values that differ)

For text, cosine similarity of TF-IDF weighted vectors is typically most effective
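As a quick illustration of these three metrics, here is a minimal Python sketch (standard library only; the function names are just for this example):

```python
import math

def euclidean(x, y):
    # Euclidean distance between two m-dimensional numeric vectors
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def hamming(x, y):
    # Number of feature values that differ (for binary vectors)
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def cosine_sim(x, y):
    # Cosine similarity: dot product divided by the product of vector norms
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny) if nx and ny else 0.0
```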

Basic Automatic Text Processing

Parse documents to recognize structure and meta-data

e.g. title, date, other fields, HTML tags, etc.

Scan for word tokens: lexical analysis to recognize keywords, numbers, special characters, etc.

Stopword removal: common words such as "the", "and", "or" which are not semantically meaningful in a document

Stem words: morphological processing to group word variants (e.g., "compute", "computer", "computing", "computes", ... can be represented by a single stem "comput" in the index)

Assign weight to words using frequency in documents and across documents

Store index: stored in a Term-Document Matrix ("inverted index") which stores each document as a vector of keyword weights
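A toy sketch of this pipeline in plain Python. The stopword set and the suffix-stripping "stemmer" are deliberately minimal stand-ins (a real system would use a full stopword list and a proper stemmer such as Porter's); all names here are illustrative:

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "or", "a", "an", "of", "to", "in", "is"}  # tiny illustrative list

def naive_stem(word):
    # Crude suffix stripping, e.g. "computing"/"computes"/"computer" -> "comput"
    for suffix in ("ing", "es", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def process(text):
    # Lexical analysis: lowercase and extract word tokens
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stopword removal and stemming
    stems = [naive_stem(t) for t in tokens if t not in STOPWORDS]
    # Raw term frequencies for this document (one column of the term-document matrix)
    return Counter(stems)

print(process("The computer computes while computing."))
# Counter({'comput': 3, 'while': 1})
```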


Document Vectors and Indexes

Conceptually, the index can be viewed as a document-term matrix

Each document is represented as an n-dimensional vector (n = number of terms in the dictionary)

Term weights represent the scalar value of each dimension in a document

The inverted index structure is an "implementation model" used in practice to store the information captured in this conceptual representation

Example: document IDs A-I as rows, terms (nova, galaxy, heat, hollywood, film, role, diet, fur) as columns; each row is a document vector, and each entry is a term weight (in this case normalized):

A: 1.0 0.5 0.3
B: 0.5 1.0
C: 1.0 0.8 0.7
D: 0.9 1.0 0.5
E: 1.0 1.0
F: 0.9 1.0
G: 0.5 0.7 0.9
H: 0.6 1.0 0.3 0.2 0.8
I: 0.7 0.5 0.1 0.3

(each row lists only that document's non-zero term weights)


Example: Documents and Query in 3D Space

Documents in term space

Documents (and the query) are represented as vectors of terms/tokens

Query and document weights are based on the length and direction of their vectors

Why use this representation?

A vector distance measure between the query and documents can be used to rank retrieved documents

Inverted Indexes

The index data is split into a Dictionary (the term vocabulary, with document frequencies) and a Postings file (for each term, the list of documents in which it occurs)
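A minimal in-memory sketch of what the dictionary and postings might look like, assuming documents arrive as term-frequency dictionaries (names are illustrative, not any particular library's API):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: {term: raw frequency}}
    # Returns (dictionary, postings):
    #   dictionary: {term: document frequency}
    #   postings:   {term: [(doc_id, term frequency), ...]}
    postings = defaultdict(list)
    for doc_id, term_freqs in docs.items():
        for term, count in term_freqs.items():
            postings[term].append((doc_id, count))
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, postings

docs = {"D1": {"nova": 2, "galaxy": 1}, "D2": {"galaxy": 3, "film": 1}}
dictionary, postings = build_inverted_index(docs)
print(dictionary["galaxy"], postings["galaxy"])  # 2 [('D1', 1), ('D2', 3)]
```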

tf x idf Weights

tf x idf measure:

term frequency (tf)

inverse document frequency (idf): a way to deal with the problems of the Zipf distribution

Recall the Zipf distribution

Want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole

Goal: assign a tf x idf weight to each term in each document

tf x idf

The tf x idf weight of term t in document d combines both factors:

w(t, d) = tf(t, d) x log2(N / df(t))

where N is the total number of documents and df(t) is the number of documents containing t.

Inverse Document Frequency

IDF provides high values for rare words and low values for common words

Note: typically, we'll use log base 2 to compute IDF values
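A one-line sketch of this computation in Python, using the log-base-2 convention above; the df values are the ones from the example on the next slide:

```python
import math

def idf(df, n_docs):
    # Inverse document frequency with log base 2: idf = log2(N / df)
    return math.log2(n_docs / df)

# df values for terms T1..T8 from the 6-document example that follows
dfs = [3, 3, 2, 4, 2, 5, 4, 3]
print([round(idf(df, 6), 2) for df in dfs])
# [1.0, 1.0, 1.58, 0.58, 1.58, 0.26, 0.58, 1.0]
```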

tf x idf Example

The initial Term x Doc matrix (inverted index) of raw term frequencies, with document frequency (df) and idf = log2(N/df) for each term (N = 6 documents):

      Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6    df   idf = log2(N/df)
T1      0      2      4      0      1      0       3   1.00
T2      1      3      0      0      0      2       3   1.00
T3      0      1      0      2      0      0       2   1.58
T4      3      0      1      5      4      0       4   0.58
T5      0      4      0      0      0      1       2   1.58
T6      2      7      2      1      3      0       5   0.26
T7      1      0      0      5      5      1       4   0.58
T8      0      1      1      0      0      3       3   1.00

The tf x idf Term x Doc matrix (each tf multiplied by the term's idf):

      Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6
T1    0.00   2.00   4.00   0.00   1.00   0.00
T2    1.00   3.00   0.00   0.00   0.00   2.00
T3    0.00   1.58   0.00   3.17   0.00   0.00
T4    1.74   0.00   0.58   2.90   2.32   0.00
T5    0.00   6.34   0.00   0.00   0.00   1.58
T6    0.53   1.84   0.53   0.26   0.79   0.00
T7    0.58   0.00   0.00   2.92   2.92   0.58
T8    0.00   1.00   1.00   0.00   0.00   3.00

Documents are represented as vectors of words (the columns of the tf x idf matrix).
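A short sketch that recomputes the tf x idf matrix above from the raw counts; the results agree with the slide's table up to small rounding differences in a couple of cells:

```python
import math

# Raw term frequencies: rows T1..T8, columns Doc 1..Doc 6 (from the table above)
tf = [
    [0, 2, 4, 0, 1, 0],
    [1, 3, 0, 0, 0, 2],
    [0, 1, 0, 2, 0, 0],
    [3, 0, 1, 5, 4, 0],
    [0, 4, 0, 0, 0, 1],
    [2, 7, 2, 1, 3, 0],
    [1, 0, 0, 5, 5, 1],
    [0, 1, 1, 0, 0, 3],
]
n_docs = len(tf[0])

for term, row in zip(["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"], tf):
    df = sum(1 for c in row if c > 0)       # document frequency of this term
    idf = math.log2(n_docs / df)            # idf = log2(N/df)
    print(term, [round(c * idf, 2) for c in row])
```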

Nearest-Neighbor Learning Algorithm

Learning is just storing the representations of the training examples in the data set D

Testing instance x:

Compute similarity between x and all examples in D

Assign x the category of the most similar examples in D

Does not explicitly compute a generalization or category prototypes (i.e., no "modeling")

Also called: case-based, memory-based, lazy learning

K Nearest Neighbor for Text Categorization

Training:

For each training example <x, c(x)> ∈ D, compute the corresponding TF-IDF vector, d_x, for document x

Test instance y:

Compute TF-IDF vector d for document y

For each <x, c(x)> ∈ D, let s_x = cosSim(d, d_x)

Sort examples x in D by decreasing value of s_x

Let N be the first k examples in D (the most similar neighbors)

Return the majority class of examples in N
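A compact sketch of this procedure in Python, with simple majority voting over the k nearest neighbors. Document vectors are plain dicts of TF-IDF weights, and cosine_sim is included so the snippet is self-contained; names are illustrative:

```python
import math
from collections import Counter

def cosine_sim(a, b):
    # a, b: {term: TF-IDF weight}
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(test_vec, training, k=3):
    # training: list of (doc_vector, category) pairs -- "learning" is just storing these
    sims = sorted(((cosine_sim(test_vec, d), c) for d, c in training), reverse=True)
    neighbors = sims[:k]                       # the k most similar training documents
    votes = Counter(c for _, c in neighbors)   # simple (unweighted) voting
    return votes.most_common(1)[0][0], neighbors
```

Weighted voting, used in the worked example a couple of slides ahead, would sum the neighbors' similarity values per class instead of counting them.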

Recall: KNN for Document Categorization Example

        T1  T2  T3  T4  T5  T6  T7  T8  Cat
DOC1     2   0   4   3   0   1   0   2  Cat1
DOC2     0   2   4   0   2   3   0   0  Cat1
DOC3     4   0   1   3   0   1   0   1  Cat2
DOC4     0   1   0   2   0   0   1   0  Cat1
DOC5     0   0   2   0   0   4   0   0  Cat1
DOC6     1   1   0   2   0   1   1   3  Cat2
DOC7     2   1   3   4   0   2   0   2  Cat2
DOC8     3   1   0   4   1   0   2   1  ?

KNN for Document Categorization

        T1  T2  T3  T4  T5  T6  T7  T8   Norm  Sim(D8,Di)
DOC1     2   0   4   3   0   1   0   2   5.83  0.61
DOC2     0   2   4   0   2   3   0   0   5.74  0.12
DOC3     4   0   1   3   0   1   0   1   5.29  0.84
DOC4     0   1   0   2   0   0   1   0   2.45  0.79
DOC5     0   0   2   0   0   4   0   0   4.47  0.00
DOC6     1   1   0   2   0   1   1   3   4.12  0.73
DOC7     2   1   3   4   0   2   0   2   6.16  0.72
DOC8     3   1   0   4   1   0   2   1   5.66

Using Cosine Similarity to find K=3 neighbors:

E.g.: Sim(D8,D7) = (D8 . D7) / (Norm(D8) x Norm(D7))
= (3x2 + 1x1 + 0x3 + 4x4 + 1x0 + 0x2 + 2x0 + 1x2) / (5.66 x 6.16)
= 25 / 34.87 = 0.72
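A quick numeric check of the Norm and Sim(D8,Di) columns, using the raw counts from the table:

```python
import math

docs = {
    "DOC1": [2, 0, 4, 3, 0, 1, 0, 2],
    "DOC2": [0, 2, 4, 0, 2, 3, 0, 0],
    "DOC3": [4, 0, 1, 3, 0, 1, 0, 1],
    "DOC4": [0, 1, 0, 2, 0, 0, 1, 0],
    "DOC5": [0, 0, 2, 0, 0, 4, 0, 0],
    "DOC6": [1, 1, 0, 2, 0, 1, 1, 3],
    "DOC7": [2, 1, 3, 4, 0, 2, 0, 2],
}
d8 = [3, 1, 0, 4, 1, 0, 2, 1]
norm = lambda v: math.sqrt(sum(x * x for x in v))

for name, d in docs.items():
    dot = sum(a * b for a, b in zip(d8, d))
    print(name, round(norm(d), 2), round(dot / (norm(d8) * norm(d)), 2))
# DOC3 (0.84), DOC4 (0.79) and DOC6 (0.73) are the K=3 nearest neighbors of DOC8
```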

KNN for Document Categorization

Simple voting:

Cat for DOC 8 = Cat2 with confidence 2/3 = 0.67

Weighted voting:

Cat for DOC 8 = Cat2

Confidence: (0.84 + 0.73) / (0.84 + 0.79 + 0.73) = 0.66

(see the voting sketch after the table below)

        T1  T2  T3  T4  T5  T6  T7  T8  Cat   Sim(D8,Di)
DOC1     2   0   4   3   0   1   0   2  Cat1  0.61
DOC2     0   2   4   0   2   3   0   0  Cat1  0.12
DOC3     4   0   1   3   0   1   0   1  Cat2  0.84
DOC4     0   1   0   2   0   0   1   0  Cat1  0.79
DOC5     0   0   2   0   0   4   0   0  Cat1  0.00
DOC6     1   1   0   2   0   1   1   3  Cat2  0.73
DOC7     2   1   3   4   0   2   0   2  Cat2  0.72
DOC8     3   1   0   4   1   0   2   1  ?     (Norm = 5.66)
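A small sketch of both voting schemes over the three nearest neighbors found above (categories and similarities taken from the table):

```python
from collections import defaultdict

# (category, similarity to DOC8) for the K=3 nearest neighbors from the table
neighbors = [("Cat2", 0.84), ("Cat1", 0.79), ("Cat2", 0.73)]

# Simple voting: every neighbor counts once
counts = defaultdict(int)
for cat, _ in neighbors:
    counts[cat] += 1
best = max(counts, key=counts.get)
print(best, round(counts[best] / len(neighbors), 2))          # Cat2 0.67

# Weighted voting: every neighbor contributes its similarity
weights = defaultdict(float)
for cat, sim in neighbors:
    weights[cat] += sim
best = max(weights, key=weights.get)
print(best, round(weights[best] / sum(weights.values()), 2))  # Cat2 0.67
```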

Using Rocchio Method

Rocchio method is typically used for relevance feedback in information retrieval

It can be adapted for text categorization.

Use standard TF/IDF weighted vectors to represent text documents

For each category, compute a prototype vector by summing the vectors of the training documents in the category.

Assign test documents to the category with the closest prototype vector based on cosine similarity.

Rocchio Text Categorization Algorithm (Training)

Assume the set of categories is {c1, c2, ..., cn}

For i from 1 to n let pi = <0, 0, ..., 0> (init. prototype vectors)

For each training example <x, c(x)> ∈ D:
Let d be the TF/IDF term vector for doc x
Let i = j where cj = c(x)
Let pi = pi + d (sum all the document vectors in ci to get pi)
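A minimal sketch of this training step in Python, with document vectors as dicts of TF/IDF weights (names are illustrative):

```python
from collections import defaultdict

def rocchio_train(training):
    # training: list of (doc_vector, category); doc_vector is {term: TF/IDF weight}
    prototypes = defaultdict(lambda: defaultdict(float))   # one prototype vector per category
    for doc_vec, cat in training:
        for term, weight in doc_vec.items():
            prototypes[cat][term] += weight                 # p_i = p_i + d
    return prototypes
```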

Rocchio Text Categorization Algorithm (Test)

Given test document x

Let d be the TF/IDF term vector for x

Let m = -2 (init. maximum cosSim)

For i from 1 to n: (compute similarity to prototype vector)
Let s = cosSim(d, pi)
if s > m
let m = s
let r = ci (update most similar class prototype)

Return class r
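And a matching sketch of the test step, reusing the prototypes returned by rocchio_train above and a dict-based cosine similarity:

```python
import math

def cosine_sim(a, b):
    # a, b: {term: weight}
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rocchio_classify(doc_vec, prototypes):
    best_cat, best_sim = None, -2.0            # m = -2 (init. maximum cosSim)
    for cat, proto in prototypes.items():      # compute similarity to each prototype vector
        s = cosine_sim(doc_vec, proto)
        if s > best_sim:
            best_cat, best_sim = cat, s        # update most similar class prototype
    return best_cat, best_sim
```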

Rocchio-Based Categorization - Example

For simplicity, in this example we will use raw term frequencies (normally full TFxIDF weights should be used).

        t1  t2  t3  t4  t5  Spam
D1       2   1   0   1   0  no
D2       0   3   1   0   0  no
D3       1   0   2   0   2  yes
D4       1   0   1   2   0  yes
D5       0   1   0   1   0  yes
D6       1   2   0   0   2  no
D7       0   1   0   2   0  yes
D8       1   1   0   1   0  yes
D9       3   0   1   1   1  no
D10      1   0   1   0   1  yes

                  t1  t2  t3  t4  t5   Norm   Cos Sim with New Doc
Prototype "no"     6   6   2   2   3   9.434  0.673
Prototype "yes"    4   3   4   6   3   9.274  0.809
New Doc            1   0   0   1   1   1.732

So the new document/email should be classified as spam = "yes" because it is more similar to the prototype for the "yes" category.
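A short check of the prototype sums and cosine similarities in this example, assuming the raw counts above:

```python
import math

train = {
    "D1": ([2, 1, 0, 1, 0], "no"),  "D2": ([0, 3, 1, 0, 0], "no"),
    "D3": ([1, 0, 2, 0, 2], "yes"), "D4": ([1, 0, 1, 2, 0], "yes"),
    "D5": ([0, 1, 0, 1, 0], "yes"), "D6": ([1, 2, 0, 0, 2], "no"),
    "D7": ([0, 1, 0, 2, 0], "yes"), "D8": ([1, 1, 0, 1, 0], "yes"),
    "D9": ([3, 0, 1, 1, 1], "no"),  "D10": ([1, 0, 1, 0, 1], "yes"),
}
new_doc = [1, 0, 0, 1, 1]

# Prototype per class = element-wise sum of that class's training vectors
protos = {}
for vec, cat in train.values():
    protos[cat] = [a + b for a, b in zip(protos.get(cat, [0] * 5), vec)]

norm = lambda v: math.sqrt(sum(x * x for x in v))
for cat, p in protos.items():
    sim = sum(a * b for a, b in zip(new_doc, p)) / (norm(new_doc) * norm(p))
    print(cat, p, round(norm(p), 3), round(sim, 3))
# no  [6, 6, 2, 2, 3] 9.434 0.673
# yes [4, 3, 4, 6, 3] 9.274 0.809  -> classify the new document as spam = "yes"
```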

Rocchio Properties

Does not guarantee a consistent hypothesis.

Forms a simple generalization of the examples in each class (a prototype).

Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length.

Classification is based on similarity to class prototypes.