Text Categorization
Text Categorization
Assigning documents to a fixed set of categories
Text features (terms) serve as the attributes representing each document
Applications:
Web pages
Recommending pages
Yahoo-like classification hierarchies
Categorizing bookmarks
Newsgroup Messages / News Feeds / Micro-blog Posts
Recommending messages, posts, tweets, etc.
Message filtering
News articles
Personalized news
Email messages
Routing
Spam filtering
Learning for Text Categorization
Text Categorization is an application of classification learning
Typical Learning Algorithms:
Bayesian (naïve)
Neural networks
Relevance Feedback (Rocchio)
Nearest Neighbor
Support Vector Machines (SVM)
Similarity/Distance Metrics
Nearest neighbor method depends on a similarity (or distance) metric
Simplest for a continuous m-dimensional instance space is Euclidean distance
Simplest for an m-dimensional binary instance space is Hamming distance (number of feature values that differ)
For text, cosine similarity of TF-IDF weighted vectors is typically most effective
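For concreteness, a minimal sketch of the three metrics on numpy arrays (the function names are illustrative, not from the slides):

```python
import numpy as np

def euclidean_distance(x, y):
    # Distance in a continuous m-dimensional instance space
    return np.sqrt(np.sum((x - y) ** 2))

def hamming_distance(x, y):
    # Number of feature values that differ (binary instance space)
    return int(np.sum(x != y))

def cosine_similarity(x, y):
    # Cosine of the angle between two (e.g., TF-IDF weighted) vectors
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 0.5, 0.3, 0.0])
b = np.array([0.5, 1.0, 0.0, 0.0])
print(euclidean_distance(a, b))          # ~0.768
print(hamming_distance(a > 0, b > 0))    # 1 (binarized vectors differ in one feature)
print(cosine_similarity(a, b))           # ~0.773
```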
Basic Automatic Text Processing
Parse documents to recognize structure and meta-data
e.g., title, date, other fields, HTML tags, etc.
Scan for word tokens: lexical analysis to recognize keywords, numbers, special characters, etc.
Stopword removal: remove common words such as “the”, “and”, “or” which are not semantically meaningful in a document
Stem words: morphological processing to group word variants (e.g., “compute”, “computer”, “computing”, “computes”, … can be represented by a single stem “comput” in the index)
Assign weights to words using their frequency within documents and across documents
Store the index in a term-document matrix (“inverted index”), which stores each document as a vector of keyword weights
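A minimal sketch of this pipeline, assuming NLTK's English stopword list and Porter stemmer (the slides do not prescribe specific tools, and the stopword data must be downloaded once via nltk.download):

```python
import re
from collections import Counter

from nltk.corpus import stopwords      # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    # Lexical analysis: scan for word tokens
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stopword removal, then stemming to group word variants
    stems = [stemmer.stem(t) for t in tokens if t not in STOPWORDS]
    # Weight words by frequency within the document (raw tf; idf comes later)
    return Counter(stems)

print(preprocess("Computers and computing: the computer computes."))
# Counter({'comput': 4})
```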
Document Vectors and Indexes
Conceptually, the index can be viewed as a document-term matrix
Each document is represented as an n-dimensional vector (n = no. of terms in the dictionary)
Term weights represent the scalar value of each dimension in a document
The inverted index structure is an “implementation model” used in practice to store the information captured in this conceptual representation
[Figure: an example document-term matrix. The columns are the terms of the dictionary (nova, galaxy, heat, hollywood, film, role, diet, fur); each row is a document (IDs A through I) represented as a vector of term weights, in this case normalized, e.g., document A has nonzero weights 1.0, 0.5, and 0.3.]
Example: Documents and Query in 3D Space
Documents in term space
Documents (and the query) are represented as vectors of terms/tokens
Query and document weights are based on the length and direction of their vectors
Why use this representation?
A vector distance measure between the query and documents can be used to rank retrieved documents
Inverted Indexes
The index data is split into a Dictionary file and a Postings file
The dictionary maps each term to its postings list: the entries for the documents in which that term occurs
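A minimal sketch of the split, assuming plain Python dicts and (doc_id, weight) postings entries (a simplification of an on-disk postings file):

```python
from collections import defaultdict

# Postings: term -> list of (doc_id, term_weight) entries
postings = defaultdict(list)

docs = {
    "A": {"nova": 1.0, "galaxy": 0.5, "heat": 0.3},
    "B": {"nova": 0.5, "galaxy": 1.0},
}

for doc_id, vector in docs.items():
    for term, weight in vector.items():
        postings[term].append((doc_id, weight))

# Dictionary: term -> document frequency (df), i.e., postings-list length
dictionary = {term: len(plist) for term, plist in postings.items()}

print(dictionary["nova"])   # 2  (df of "nova")
print(postings["nova"])     # [('A', 1.0), ('B', 0.5)]
```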
tf x idf Weights
The tf x idf measure combines:
term frequency (tf)
inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
Recall the Zipf distribution
We want to weight terms highly if they are:
frequent in relevant documents … BUT
infrequent in the collection as a whole
Goal: assign a tf x idf weight to each term in each document
tf x idf
The tf x idf weight of term i in document j is the product of the two factors:
w(i,j) = tf(i,j) x idf(i) = tf(i,j) x log2(N / df(i))
where tf(i,j) is the frequency of term i in document j, df(i) is the number of documents containing term i, and N is the total number of documents in the collection.
Inverse Document Frequency
IDF provides high values for rare words and low values for common words
Note: typically, we’ll use log base 2 to compute IDF values
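A quick numeric illustration of that behavior (the collection size N and the document frequencies below are made-up values for this sketch):

```python
import math

N = 1000                                    # total documents in the collection
for word, df in [("rare", 10), ("common", 900)]:
    # IDF rewards terms that occur in few documents
    print(word, round(math.log2(N / df), 2))
# rare 6.64
# common 0.15
```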
tf x idf Example
The initial Term x Doc matrix (inverted index), holding raw term frequencies:

      Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6   df   idf = log2(N/df)
T1      0      2      4      0      1      0      3        1.00
T2      1      3      0      0      0      2      3        1.00
T3      0      1      0      2      0      0      2        1.58
T4      3      0      1      5      4      0      4        0.58
T5      0      4      0      0      0      1      2        1.58
T6      2      7      2      1      3      0      5        0.26
T7      1      0      0      5      5      1      4        0.58
T8      0      1      1      0      0      3      3        1.00

The tf x idf Term x Doc matrix (documents represented as vectors of words):

      Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6
T1    0.00   2.00   4.00   0.00   1.00   0.00
T2    1.00   3.00   0.00   0.00   0.00   2.00
T3    0.00   1.58   0.00   3.17   0.00   0.00
T4    1.74   0.00   0.58   2.90   2.32   0.00
T5    0.00   6.34   0.00   0.00   0.00   1.58
T6    0.53   1.84   0.53   0.26   0.79   0.00
T7    0.58   0.00   0.00   2.92   2.92   0.58
T8    0.00   1.00   1.00   0.00   0.00   3.00
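A minimal numpy sketch that recomputes the tf x idf matrix above from the raw counts (results match the slide's values up to rounding, since the slide multiplies by idf rounded to two decimals):

```python
import numpy as np

# Raw term frequencies: rows T1..T8, columns Doc1..Doc6
tf = np.array([
    [0, 2, 4, 0, 1, 0],
    [1, 3, 0, 0, 0, 2],
    [0, 1, 0, 2, 0, 0],
    [3, 0, 1, 5, 4, 0],
    [0, 4, 0, 0, 0, 1],
    [2, 7, 2, 1, 3, 0],
    [1, 0, 0, 5, 5, 1],
    [0, 1, 1, 0, 0, 3],
])

N = tf.shape[1]               # 6 documents
df = (tf > 0).sum(axis=1)     # docs containing each term: [3 3 2 4 2 5 4 3]
idf = np.log2(N / df)         # [1.00 1.00 1.58 0.58 1.58 0.26 0.58 1.00]
tfidf = tf * idf[:, None]     # scale each term's row by its idf

print(np.round(tfidf, 2))     # matches the tf x idf matrix above (up to rounding)
```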
Nearest-Neighbor Learning Algorithm
Learning is just storing the representations of the training examples in the data set D
Testing instance x:
Compute the similarity between x and all examples in D
Assign x the category of the most similar examples in D
Does not explicitly compute a generalization or category prototypes (i.e., no “modeling”)
Also called:
Case-based
Memory-based
Lazy learning
K Nearest Neighbor for Text Categorization
Training:
For each training example <x, c(x)> ∈ D
  Compute the corresponding TF-IDF vector, dx, for document x
Test instance y:
Compute the TF-IDF vector d for document y
For each <x, c(x)> ∈ D
  Let sx = cosSim(d, dx)
Sort the examples x in D by decreasing value of sx
Let N be the first k examples in D (the most similar neighbors)
Return the majority class of the examples in N
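A minimal sketch of this algorithm, assuming the documents have already been turned into TF-IDF numpy vectors (function and variable names are illustrative):

```python
from collections import Counter
import numpy as np

def cos_sim(d1, d2):
    # Cosine similarity between two TF-IDF vectors
    return np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

def knn_classify(y_vec, training, k=3):
    """training: list of (tfidf_vector, category) pairs; y_vec: test vector."""
    # Score every training document against the test instance
    sims = [(cos_sim(y_vec, d), c) for d, c in training]
    # Keep the k most similar neighbors
    neighbors = sorted(sims, key=lambda sc: sc[0], reverse=True)[:k]
    # Return the majority class among the neighbors
    return Counter(c for _, c in neighbors).most_common(1)[0][0]
```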
Recall: KNN for Document Categorization Example
      T1  T2  T3  T4  T5  T6  T7  T8   Cat
DOC1   2   0   4   3   0   1   0   2   Cat1
DOC2   0   2   4   0   2   3   0   0   Cat1
DOC3   4   0   1   3   0   1   0   1   Cat2
DOC4   0   1   0   2   0   0   1   0   Cat1
DOC5   0   0   2   0   0   4   0   0   Cat1
DOC6   1   1   0   2   0   1   1   3   Cat2
DOC7   2   1   3   4   0   2   0   2   Cat2
DOC8   3   1   0   4   1   0   2   1    ?
KNN for Document Categorization
      T1  T2  T3  T4  T5  T6  T7  T8   Norm   Sim(D8,Di)
DOC1   2   0   4   3   0   1   0   2   5.83     0.61
DOC2   0   2   4   0   2   3   0   0   5.74     0.12
DOC3   4   0   1   3   0   1   0   1   5.29     0.84
DOC4   0   1   0   2   0   0   1   0   2.45     0.79
DOC5   0   0   2   0   0   4   0   0   4.47     0.00
DOC6   1   1   0   2   0   1   1   3   4.12     0.73
DOC7   2   1   3   4   0   2   0   2   6.16     0.72
DOC8   3   1   0   4   1   0   2   1   5.66
Using Cosine Similarity to find the K=3 nearest neighbors:
E.g.: Sim(D8,D7) = (D8 · D7) / (Norm(D8) × Norm(D7))
= (3×2 + 1×1 + 0×3 + 4×4 + 1×0 + 0×2 + 2×0 + 1×2) / (5.66 × 6.16)
= 25 / 34.87 = 0.72
KNN for Document Categorization
      T1  T2  T3  T4  T5  T6  T7  T8   Cat    Sim(D8,Di)
DOC1   2   0   4   3   0   1   0   2   Cat1     0.61
DOC2   0   2   4   0   2   3   0   0   Cat1     0.12
DOC3   4   0   1   3   0   1   0   1   Cat2     0.84
DOC4   0   1   0   2   0   0   1   0   Cat1     0.79
DOC5   0   0   2   0   0   4   0   0   Cat1     0.00
DOC6   1   1   0   2   0   1   1   3   Cat2     0.73
DOC7   2   1   3   4   0   2   0   2   Cat2     0.72
DOC8   3   1   0   4   1   0   2   1    ?      (Norm = 5.66)

Using the K=3 nearest neighbors (DOC3, DOC4, DOC6):
Simple voting:
Cat for DOC8 = Cat2, with confidence 2/3 = 0.67
Weighted voting:
Cat for DOC8 = Cat2
Confidence: (0.84 + 0.73) / (0.84 + 0.79 + 0.73) = 0.66
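A minimal numpy sketch that recomputes the similarities and both votes for this example (the array layout is an assumption of this sketch):

```python
import numpy as np

docs = np.array([
    [2, 0, 4, 3, 0, 1, 0, 2],   # DOC1, Cat1
    [0, 2, 4, 0, 2, 3, 0, 0],   # DOC2, Cat1
    [4, 0, 1, 3, 0, 1, 0, 1],   # DOC3, Cat2
    [0, 1, 0, 2, 0, 0, 1, 0],   # DOC4, Cat1
    [0, 0, 2, 0, 0, 4, 0, 0],   # DOC5, Cat1
    [1, 1, 0, 2, 0, 1, 1, 3],   # DOC6, Cat2
    [2, 1, 3, 4, 0, 2, 0, 2],   # DOC7, Cat2
])
cats = ["Cat1", "Cat1", "Cat2", "Cat1", "Cat1", "Cat2", "Cat2"]
d8 = np.array([3, 1, 0, 4, 1, 0, 2, 1])

sims = docs @ d8 / (np.linalg.norm(docs, axis=1) * np.linalg.norm(d8))
print(np.round(sims, 2))   # 0.61 0.12 0.84 0.79 0.00 0.73 0.72 (matches table)

top3 = np.argsort(-sims)[:3]               # indices of DOC3, DOC4, DOC6
votes = [cats[i] for i in top3]            # ['Cat2', 'Cat1', 'Cat2']
weighted = sims[top3][np.array(votes) == "Cat2"].sum() / sims[top3].sum()
print(round(votes.count("Cat2") / 3, 2), round(weighted, 2))   # 0.67 0.66
```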
Using the Rocchio Method
The Rocchio method is typically used for relevance feedback in information retrieval
It can be adapted for text categorization:
Use standard TF/IDF weighted vectors to represent text documents
For each category, compute a prototype vector by summing the vectors of the training documents in the category
Assign test documents to the category with the closest prototype vector, based on cosine similarity
Rocchio Text Categorization Algorithm (Training)
Assume the set of categories is {c1, c2, …, cn}
For i from 1 to n: let pi = <0, 0, …, 0> (init. prototype vectors)
For each training example <x, c(x)> ∈ D:
  Let d be the TF/IDF term vector for doc x
  Let i = j where cj = c(x)
  Let pi = pi + d (sum all the document vectors in ci to get pi)
Rocchio Text Categorization Algorithm (Test)
Given test document x
Let d be the TF/IDF term vector for x
Let m = -2 (init. maximum cosSim)
For i from 1 to n: (compute similarity to each prototype vector)
  Let s = cosSim(d, pi)
  If s > m:
    Let m = s
    Let r = ci (update most similar class prototype)
Return class r
Rocchio-Based Categorization - Example
For simplicity, in this example we use raw term frequencies (normally full TFxIDF weights should be used).

      t1  t2  t3  t4  t5   Spam
D1     2   1   0   1   0   no
D2     0   3   1   0   0   no
D3     1   0   2   0   2   yes
D4     1   0   1   2   0   yes
D5     0   1   0   1   0   yes
D6     1   2   0   0   2   no
D7     0   1   0   2   0   yes
D8     1   1   0   1   0   yes
D9     3   0   1   1   1   no
D10    1   0   1   0   1   yes

                   t1  t2  t3  t4  t5   Norm    Cos Sim with New Doc
Prototype “no”      6   6   2   2   3   9.434        0.673
Prototype “yes”     4   3   4   6   3   9.274        0.809
New Doc             1   0   0   1   1   1.732

So the new document/email should be classified as spam = “yes”, because it is more similar to the prototype for the “yes” category.
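A minimal numpy check of the prototype sums and similarities above:

```python
import numpy as np

train = {
    "no":  [(2,1,0,1,0), (0,3,1,0,0), (1,2,0,0,2), (3,0,1,1,1)],   # D1, D2, D6, D9
    "yes": [(1,0,2,0,2), (1,0,1,2,0), (0,1,0,1,0), (0,1,0,2,0),
            (1,1,0,1,0), (1,0,1,0,1)],                             # D3, D4, D5, D7, D8, D10
}
new_doc = np.array([1, 0, 0, 1, 1])

for cat, vecs in train.items():
    p = np.sum(np.array(vecs), axis=0)   # prototype = sum of the category's vectors
    sim = p @ new_doc / (np.linalg.norm(p) * np.linalg.norm(new_doc))
    print(cat, p, round(np.linalg.norm(p), 3), round(sim, 3))
# no  [6 6 2 2 3] 9.434 0.673
# yes [4 3 4 6 3] 9.274 0.809
```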
Rocchio Properties
Does not guarantee a consistent hypothesis
Forms a simple generalization of the examples in each class (a prototype)
The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length
Classification is based on similarity to class prototypes