The Vector Space Model (VSM)

| Documents as Vectors
Terms are axes of the space
Documents are points or vectors in this space
So we have a |V|-dimensional vector space
| The Matrix
Doc 1 : makan makan
Doc 2 : makan nasi

| The Matrix : Binary
Doc 1 : makan makan
Doc 2 : makan nasi

Incidence Matrix (Binary TF)

Term  | Doc 1 | Doc 2
Makan |   1   |   1
Nasi  |   0   |   1
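The binary incidence matrix above can be sketched in a few lines of Python (the document strings are the toy examples from the slide; variable names are illustrative):

```python
# Sketch: build a binary incidence matrix for the two toy documents.
docs = {"Doc 1": "makan makan", "Doc 2": "makan nasi"}
vocab = sorted({t for text in docs.values() for t in text.split()})
# 1 if the term occurs in the document, 0 otherwise
incidence = {t: [1 if t in d.split() else 0 for d in docs.values()] for t in vocab}
print(incidence)  # {'makan': [1, 1], 'nasi': [0, 1]}
```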
| Documents as Vectors
Terms are axes of the space
Documents are points or vectors in this space
So we have a |V|-dimensional vector space
| The Matrix : Binary
Doc 1 : makan makan
Doc 2 : makan nasi

Incidence Matrix (Binary TF)

Term  | Doc 1 | Doc 2
Makan |   1   |   1
Nasi  |   0   |   1

[Plot: Doc 1 = (1, 0) and Doc 2 = (1, 1) as vectors on the Makan and Nasi axes]
| The Matrix : Binary -> Count
Doc 1 : makan makan
Doc 2 : makan nasi

Term  | Binary TF       | Raw TF
      | Doc 1 | Doc 2   | Doc 1 | Doc 2
Makan |   1   |   1     |   2   |   1
Nasi  |   0   |   1     |   0   |   1
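The raw-TF column can be computed with a term counter (a minimal sketch using the slide's toy documents):

```python
from collections import Counter

docs = {"Doc 1": "makan makan", "Doc 2": "makan nasi"}
# Counter gives the raw term frequency of each term per document
tf = {name: Counter(text.split()) for name, text in docs.items()}
print(tf["Doc 1"]["makan"], tf["Doc 2"]["nasi"])  # 2 1
```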
| The Matrix : Binary -> Count
Doc 1 : makan makan
Doc 2 : makan nasi

Inverted Index (Raw TF)

Term  | Doc 1 | Doc 2
Makan |   2   |   1
Nasi  |   0   |   1

[Plot: Doc 1 = (2, 0) and Doc 2 = (1, 1) as vectors on the Makan and Nasi axes]
| The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi

Term  | Binary TF       | Raw TF          | Logarithmic TF
      | Doc 1 | Doc 2   | Doc 1 | Doc 2   | Doc 1 | Doc 2
Makan |   1   |   1     |   2   |   1     |  1.3  |   1
Nasi  |   0   |   1     |   0   |   1     |   0   |   1
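The logarithmic TF weights above follow the standard sublinear scaling wt = 1 + log10(tf) for tf > 0 (a sketch; base-10 logs assumed, as the 1.3 entry for tf = 2 suggests):

```python
import math

def log_tf(tf):
    # Sublinear TF scaling: 1 + log10(tf) for tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0

print(round(log_tf(2), 1), log_tf(1), log_tf(0))  # 1.3 1.0 0
```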
| The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi

Inverted Index (Logarithmic TF)

Term  | Doc 1 | Doc 2
Makan |  1.3  |   1
Nasi  |   0   |   1

[Plot: Doc 1 = (1.3, 0) and Doc 2 = (1, 1) as vectors on the Makan and Nasi axes]
| The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi

Term  | Raw TF          | Logarithmic TF  | IDF | TF-IDF
      | Doc 1 | Doc 2   | Doc 1 | Doc 2   |     | Doc 1 | Doc 2
Makan |   2   |   1     |  1.3  |   1     |  0  |   0   |   0
Nasi  |   0   |   1     |   0   |   1     | 0.3 |   0   |  0.3
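The IDF and TF-IDF entries above follow idf = log10(N/df) with N = 2 documents (a sketch; base-10 logs assumed, matching the 0.3 entry for nasi):

```python
import math

N = 2  # number of documents in the toy collection

def log_tf(tf):
    # Sublinear TF scaling, as in the Logarithmic TF column
    return 1 + math.log10(tf) if tf > 0 else 0

def idf(df):
    # Inverse document frequency: log10(N / df)
    return math.log10(N / df)

# nasi: df = 1, tf = 1 in Doc 2  ->  idf 0.3, tf-idf 0.3
print(round(idf(1), 1), round(log_tf(1) * idf(1), 1))  # 0.3 0.3
# makan: df = 2, so its idf (and hence tf-idf) is 0 in both documents
print(idf(2))  # 0.0
```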
| The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan
Doc 2 : makan nasi

Inverted Index (TF-IDF)

Term  | Doc 1 | Doc 2
Makan |   0   |   0
Nasi  |   0   |  0.3

[Plot: Doc 1 = (0, 0) and Doc 2 = (0, 0.3) as vectors on the Makan and Nasi axes]
| The Matrix : Binary -> Count -> Weight
Doc 1 : makan makan jagung
Doc 2 : makan nasi

Term   | Raw TF          | Logarithmic TF  | IDF | TF-IDF
       | Doc 1 | Doc 2   | Doc 1 | Doc 2   |     | Doc 1 | Doc 2
Makan  |   2   |   1     |  1.3  |   1     |  0  |   0   |   0
Nasi   |   0   |   1     |   0   |   1     | 0.3 |   0   |  0.3
Jagung |   1   |   0     |   1   |   0     | 0.3 |  0.3  |   0
| Documents as Vectors
Terms are axes of the space
Documents are points or vectors in this space
So we have a |V|-dimensional vector space
The weight can be anything: binary, TF, TF-IDF, and so on.

| Documents as Vectors
Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
These are very sparse vectors - most entries are zero
How About The Query?

Query as Vector too...
| VECTOR SPACE MODEL
Key idea 1: Represent documents as vectors in the space
Key idea 2: Do the same for queries: represent them as vectors in the space
Key idea 3: Rank documents according to their proximity to the query in this space
PROXIMITY?

| Proximity
Proximity = similarity of vectors
Proximity ≈ inverse of distance
The document with the greatest proximity to the query receives the highest score, and therefore the highest rank.
How to Measure Vector Space Proximity?
| Proximity
First cut: distance between two points (= distance between the end points of the two vectors)
Euclidean distance? Euclidean distance is a bad idea...
...because Euclidean distance is large for vectors of different lengths.
| Distance Example
Doc 1 : gossip
Doc 2 : jealous
Doc 3 : gossip jealous gossip jealous gossip jealous ... (gossip : 90x, jealous : 70x)
| Distance Example
Query : gossip jealous

Term    | Logarithmic TF                  | IDF  | TF-IDF
        | Doc 1 | Doc 2 | Doc 3 | Query   |      | Doc 1 | Doc 2 | Doc 3 | Query
Gossip  |   1   |   0   |  2.95 |   1     | 0.17 |  0.17 |   0   |  0.50 |  0.17
Jealous |   0   |   1   |  2.84 |   1     | 0.17 |   0   |  0.17 |  0.48 |  0.17
| Distance Example
Query : gossip jealous

Inverted Index (TF-IDF)

Term    | Doc 1 | Doc 2 | Doc 3 | Query
Gossip  |  0.17 |   0   |  0.50 |  0.17
Jealous |   0   |  0.17 |  0.48 |  0.17

[Plot: Doc 1, Doc 2, Doc 3, and the Query as TF-IDF vectors on the Gossip and Jealous axes]
| Why Distance is a Bad Idea?
The Euclidean distance between the query and Doc 3 is large, even though the distribution of terms in the query and the distribution of terms in Doc 3 are very similar.
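This can be checked numerically with the TF-IDF vectors from the table above (a minimal sketch; the tuples hard-code the (gossip, jealous) weights):

```python
import math

def dist(u, v):
    # Euclidean distance between two vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cos(u, v):
    # Cosine similarity: dot product over product of lengths
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# TF-IDF vectors over the axes (gossip, jealous) from the table above
q, doc1, doc3 = (0.17, 0.17), (0.17, 0.0), (0.50, 0.48)
# Euclidean distance prefers the short Doc 1 ...
print(dist(q, doc1) < dist(q, doc3))  # True
# ... while cosine prefers Doc 3, whose term mix matches the query
print(cos(q, doc3) > cos(q, doc1))    # True
```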
| So, instead of Distance?
Thought experiment: take a document d and append it to itself. Call this document d'.
"Semantically" d and d' have the same content
The Euclidean distance between the two documents can be quite large
| So, instead of Distance?
The angle between the two documents is 0, corresponding to maximal similarity.

[Plot: d and d' as collinear vectors on the Gossip and Jealous axes]
| Use angle instead of distance
Key idea: Rank documents according to angle with query.
| From angles to cosines
The following two notions are equivalent:
Rank documents in increasing order of the angle between query and document
Rank documents in decreasing order of cosine(query, document)
Cosine is a monotonically decreasing function on the interval [0°, 180°]
| From angles to cosines

But how - and why - should we be computing cosines?
a · b = |a| × |b| × cos(θ)

Where:
|a| is the magnitude (length) of vector a
|b| is the magnitude (length) of vector b
θ is the angle between a and b

cos(θ) = (a · b) / (|a| × |b|)
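A quick numerical check of this identity (the vectors here are arbitrary illustrations; a and b are 45° apart):

```python
import math

a, b = (1.0, 0.0), (1.0, 1.0)  # two vectors at a 45-degree angle
dot = sum(x * y for x, y in zip(a, b))
norm = lambda v: math.sqrt(sum(x * x for x in v))
# cos(theta) = (a . b) / (|a| x |b|)
cos_theta = dot / (norm(a) * norm(b))
print(round(cos_theta, 4), round(math.cos(math.radians(45)), 4))  # 0.7071 0.7071
```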
cos(q, d) = (q · d) / (|q| |d|) = Σᵢ qᵢ dᵢ / (√(Σᵢ qᵢ²) × √(Σᵢ dᵢ²))

qᵢ is the tf-idf weight (or whatever) of term i in the query
dᵢ is the tf-idf weight (or whatever) of term i in the document
cos(q, d) is the cosine similarity of q and d ... or, equivalently, the cosine of the angle between q and d.
| Length normalization
A vector can be (length-) normalized by dividing each of its components by its length
Dividing a vector by its length makes it a unit (length) vector (on the surface of the unit hypersphere)
Unit vector = a vector whose length is exactly 1 (the unit length)
| Remember this Case

[Plot: d and d' as collinear vectors of different lengths on the Gossip and Jealous axes]
| Length normalization

Normalized vector: v / |v|, where |v| = √(Σᵢ vᵢ²)
| Remember this Case

[Plot: after normalization, d and d' coincide as the same unit vector on the Gossip and Jealous axes]
| Length normalization
Effect on the two documents d and d' (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
Long and short documents now have comparable weights
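A sketch of this effect (the vector d here is an arbitrary illustration; d' doubles every component, since appending a document to itself doubles every term count):

```python
import math

def normalize(v):
    # Divide each component by the vector's Euclidean length
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

d = (2.0, 1.0)        # hypothetical raw term-frequency vector
d_prime = (4.0, 2.0)  # d appended to itself: every count doubles
# After length-normalization the two vectors are identical
print(all(math.isclose(a, b) for a, b in zip(normalize(d), normalize(d_prime))))  # True
```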
After Normalization :
cos(q, d) = q · d = Σᵢ qᵢ dᵢ
for q, d length-normalized.
| Cosine similarity illustrated
For non-negative weights, the value of the cosine similarity lies in [0, 1]
Example?
| TF-IDF Example
Document: car insurance auto insurance
Query: best car insurance
N = 1,000,000

Query:

Term      | tf-raw | tf-wt |  df   | idf | tf.idf | n'lize
auto      |   0    |   0   |  5000 | 2.3 |   0    |   0
best      |   1    |   1   | 50000 | 1.3 |  1.3   |  0.34
car       |   1    |   1   | 10000 | 2.0 |  2.0   |  0.52
insurance |   1    |   1   |  1000 | 3.0 |  3.0   |  0.78

Query length = √(0² + 1.3² + 2.0² + 3.0²) ≈ 3.83
| TF-IDF Example
Document: car insurance auto insurance
Query: best car insurance

Document:

Term      | tf-raw | tf-wt | idf | tf.idf | n'lize
auto      |   1    |   1   | 2.3 |  2.3   |  0.46
best      |   0    |   0   | 1.3 |   0    |   0
car       |   1    |   1   | 2.0 |  2.0   |  0.40
insurance |   2    |  1.3  | 3.0 |  3.9   |  0.79

Doc length = √(2.3² + 0² + 2.0² + 3.9²) ≈ 4.95
After Normalization :
cos(q, d) = q · d = Σᵢ qᵢ dᵢ
for q, d length-normalized.
| TF-IDF Example
Document: car insurance auto insurance
Query: best car insurance

Term      | Query tf.idf | Query n'lize | Doc tf.idf | Doc n'lize | Dot Product
auto      |      0       |      0       |    2.3     |    0.46    |     0
best      |     1.3      |     0.34     |     0      |     0      |     0
car       |     2.0      |     0.52     |    2.0     |    0.40    |    0.21
insurance |     3.0      |     0.78     |    3.9     |    0.79    |    0.62

Score = 0 + 0 + 0.21 + 0.62 = 0.83

Sec. 6.4
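The score can be recomputed from the raw tf.idf weights in the tables above (a sanity-check sketch; rounding of the intermediate n'lize values may differ slightly from the tables):

```python
import math

terms = ["auto", "best", "car", "insurance"]
q = [0.0, 1.3, 2.0, 3.0]  # query tf.idf weights from the table
d = [2.3, 0.0, 2.0, 3.9]  # document tf.idf weights from the table

def normalize(v):
    # Length-normalize so the cosine is just a dot product
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

score = sum(a * b for a, b in zip(normalize(q), normalize(d)))
print(round(score, 2))  # 0.83
```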
| Summary - vector space ranking
Represent the query as a weighted tf-idf vector
Represent each document as a weighted tf-idf vector
Compute the cosine similarity score for the query vector and each document vector
Rank documents with respect to the query by score
Return the top K (e.g., K = 10) to the user
Cosine similarity amongst 3 documents

How similar are the novels SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH: Wuthering Heights?

Term frequencies (counts):

term      | SaS | PaP | WH
affection | 115 |  58 | 20
jealous   |  10 |   7 | 11
gossip    |   2 |   0 |  6
wuthering |   0 |   0 | 38

Note: To simplify this example, we don't do idf weighting.

Sec. 6.3
3 documents example contd.

Log frequency weighting:

term      | SaS  | PaP  | WH
affection | 3.06 | 2.76 | 2.30
jealous   | 2.00 | 1.85 | 2.04
gossip    | 1.30 |  0   | 1.78
wuthering |  0   |  0   | 2.58

After length normalization:

term      | SaS   | PaP   | WH
affection | 0.789 | 0.832 | 0.524
jealous   | 0.515 | 0.555 | 0.465
gossip    | 0.335 |  0    | 0.405
wuthering |  0    |  0    | 0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

Why do we have cos(SaS, PaP) > cos(SaS, WH)?

Sec. 6.3
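These cosines can be reproduced from the raw counts (a sketch; log_tf and normalize follow the weighting used in the slides):

```python
import math

# Raw term counts from the table (order: affection, jealous, gossip, wuthering)
counts = {"SaS": [115, 10, 2, 0], "PaP": [58, 7, 0, 0], "WH": [20, 11, 6, 38]}

def log_tf(tf):
    # Log frequency weighting: 1 + log10(tf) for tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

# Weight, then length-normalize each document vector
vecs = {k: normalize([log_tf(t) for t in v]) for k, v in counts.items()}

def cos(u, v):
    # For unit vectors the cosine is just the dot product
    return sum(a * b for a, b in zip(u, v))

print(round(cos(vecs["SaS"], vecs["PaP"]), 2))  # 0.94
print(round(cos(vecs["SaS"], vecs["WH"]), 2))   # 0.79
print(round(cos(vecs["PaP"], vecs["WH"]), 2))   # 0.69
```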