The Vector Space Models - PowerPoint Presentation

marina-yarberry
Uploaded On 2017-09-16



Presentation Transcript

Slide1

The Vector Space Models (VSM)

Slide2

| Documents as Vectors

Terms are the axes of the space.
Documents are points or vectors in this space.
So we have a |V|-dimensional vector space.

Slide3

| The Matrix

Doc 1 : makan makan
Doc 2 : makan nasi

Slide4

| The Matrix

Doc 1 : makan makan
Doc 2 : makan nasi

Incidence Matrix (Binary TF):

Term    Doc 1   Doc 2
Makan
Nasi

Slide5

| The Matrix : Binary

Doc 1 : makan makan
Doc 2 : makan nasi

Incidence Matrix (Binary TF):

Term    Doc 1   Doc 2
Makan   1       1
Nasi    0       1
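An incidence matrix like the one above can be built in a few lines. A minimal Python sketch using the slide's two-document corpus (the variable names are my own):

```python
# Build a binary term-document incidence matrix for the two-document corpus.
docs = {
    "Doc 1": "makan makan".split(),
    "Doc 2": "makan nasi".split(),
}

# The vocabulary V: one axis of the vector space per distinct term.
vocab = sorted({t for words in docs.values() for t in words})

# incidence[term][doc] = 1 if the term occurs in the doc, else 0.
incidence = {t: {d: int(t in words) for d, words in docs.items()} for t in vocab}

for t in vocab:
    print(t, incidence[t])
```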

Slide6

| Documents as Vectors

Terms are the axes of the space.
Documents are points or vectors in this space.
So we have a |V|-dimensional vector space.

Slide7

| The Matrix : Binary

Doc 1 : makan makan
Doc 2 : makan nasi

Incidence Matrix (Binary TF):

Term    Doc 1   Doc 2
Makan   1       1
Nasi    0       1

[Plot: Doc 1 and Doc 2 drawn as vectors in the plane with axes Makan and Nasi.]

Slide8

| The Matrix : Binary -> Count

Doc 1 : makan makan
Doc 2 : makan nasi

Incidence Matrix (Binary TF):

Term    Doc 1   Doc 2
Makan   1       1
Nasi    0       1

Inverted Index (Raw TF):

Term    Doc 1   Doc 2
Makan
Nasi

Slide9

| The Matrix : Binary -> Count

Doc 1 : makan makan
Doc 2 : makan nasi

Incidence Matrix (Binary TF):

Term    Doc 1   Doc 2
Makan   1       1
Nasi    0       1

Inverted Index (Raw TF):

Term    Doc 1   Doc 2
Makan   2       1
Nasi    0       1

Slide10

| The Matrix : Binary -> Count

Doc 1 : makan makan
Doc 2 : makan nasi

Inverted Index (Raw TF):

Term    Doc 1   Doc 2
Makan   2       1
Nasi    0       1

[Plot: Doc 1 and Doc 2 as raw-TF vectors in the Makan–Nasi plane.]

Slide11

| The Matrix : Binary -> Count -> Weight

Doc 1 : makan makan
Doc 2 : makan nasi

Incidence Matrix (Binary TF):

Term    Doc 1   Doc 2
Makan   1       1
Nasi    0       1

Inverted Index (Raw TF):

Term    Doc 1   Doc 2
Makan   2       1
Nasi    0       1

Inverted Index (Logarithmic TF):

Term    Doc 1   Doc 2
Makan
Nasi

Slide12

| The Matrix : Binary -> Count -> Weight

Doc 1 : makan makan
Doc 2 : makan nasi

Incidence Matrix (Binary TF):

Term    Doc 1   Doc 2
Makan   1       1
Nasi    0       1

Inverted Index (Raw TF):

Term    Doc 1   Doc 2
Makan   2       1
Nasi    0       1

Inverted Index (Logarithmic TF):

Term    Doc 1   Doc 2
Makan   1.3     1
Nasi    0       1
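The logarithmic TF weight used here is the standard 1 + log10(tf) for tf > 0 (and 0 otherwise), which maps the raw count 2 to 1 + log10(2) ≈ 1.3. A minimal sketch:

```python
import math

def log_tf(tf):
    # Logarithmic term-frequency weight: 1 + log10(tf) for tf > 0, else 0.
    return 1 + math.log10(tf) if tf > 0 else 0

print(round(log_tf(2), 1))  # 1.3  (makan in Doc 1)
print(log_tf(1))            # 1.0  (a single occurrence keeps weight 1)
print(log_tf(0))            # 0
```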

Slide13

| The Matrix : Binary -> Count -> Weight

Doc 1 : makan makan
Doc 2 : makan nasi

Inverted Index (Logarithmic TF):

Term    Doc 1   Doc 2
Makan   1.3     1
Nasi    0       1

[Plot: Doc 1 and Doc 2 as log-TF vectors in the Makan–Nasi plane.]

Slide14

| The Matrix : Binary -> Count -> Weight

Doc 1 : makan makan
Doc 2 : makan nasi

Incidence Matrix (Binary TF):

Term    Doc 1   Doc 2
Makan   1       1
Nasi    0       1

Inverted Index (Raw TF):

Term    Doc 1   Doc 2
Makan   2       1
Nasi    0       1

Inverted Index (Logarithmic TF):

Term    Doc 1   Doc 2
Makan   1.3     1
Nasi    0       1

Inverted Index (TF-IDF):

Term    Doc 1   Doc 2
Makan
Nasi

Slide15

| The Matrix : Binary -> Count -> Weight

Doc 1 : makan makan
Doc 2 : makan nasi

Inverted Index (Raw TF):

Term    Doc 1   Doc 2
Makan   2       1
Nasi    0       1

Inverted Index (Logarithmic TF):

Term    Doc 1   Doc 2
Makan   1.3     1
Nasi    0       1

Inverted Index (TF-IDF):

Term    IDF    Doc 1   Doc 2
Makan   0
Nasi    0.3

Slide16

| The Matrix : Binary -> Count -> Weight

Doc 1 : makan makan
Doc 2 : makan nasi

Inverted Index (Raw TF):

Term    Doc 1   Doc 2
Makan   2       1
Nasi    0       1

Inverted Index (Logarithmic TF):

Term    Doc 1   Doc 2
Makan   1.3     1
Nasi    0       1

Inverted Index (TF-IDF):

Term    IDF    Doc 1   Doc 2
Makan   0      0       0
Nasi    0.3    0       0.3
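The IDF column follows from idf = log10(N/df) with N = 2 documents: makan occurs in both documents (df = 2, idf = 0), while nasi occurs in only one (df = 1, idf = log10(2) ≈ 0.3). A minimal sketch of the weighting, assuming log-TF times IDF as on the slides:

```python
import math

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0

def idf(df, n_docs):
    # Inverse document frequency: log10(N / df).
    return math.log10(n_docs / df)

def tf_idf(tf, df, n_docs):
    return log_tf(tf) * idf(df, n_docs)

N = 2  # two documents in the toy collection
print(round(idf(2, N), 1))        # 0.0: makan occurs in every document
print(round(idf(1, N), 1))        # 0.3: nasi occurs in only one
print(round(tf_idf(1, 1, N), 1))  # 0.3: nasi in Doc 2
```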

Slide17

| The Matrix : Binary -> Count -> Weight

Doc 1 : makan makan
Doc 2 : makan nasi

Inverted Index (TF-IDF):

Term    Doc 1   Doc 2
Makan   0       0
Nasi    0       0.3

[Plot: Doc 1 and Doc 2 as TF-IDF vectors in the Makan–Nasi plane.]

Slide18

| The Matrix : Binary -> Count -> Weight

Doc 1 : makan makan jagung
Doc 2 : makan nasi

Inverted Index (Raw TF):

Term     Doc 1   Doc 2
Makan    2       1
Nasi     0       1
Jagung   1       0

Inverted Index (Logarithmic TF):

Term     Doc 1   Doc 2
Makan    1.3     1
Nasi     0       1
Jagung   1       0

Inverted Index (TF-IDF):

Term     IDF    Doc 1   Doc 2
Makan
Nasi
Jagung

Slide19

| The Matrix : Binary -> Count -> Weight

Doc 1 : makan makan jagung
Doc 2 : makan nasi

Inverted Index (Raw TF):

Term     Doc 1   Doc 2
Makan    2       1
Nasi     0       1
Jagung   1       0

Inverted Index (Logarithmic TF):

Term     Doc 1   Doc 2
Makan    1.3     1
Nasi     0       1
Jagung   1       0

Inverted Index (TF-IDF):

Term     IDF    Doc 1   Doc 2
Makan    0      0       0
Nasi     0.3    0       0.3
Jagung   0.3    0.3     0

Slide20

| Documents as Vectors

Terms are the axes of the space.
Documents are points or vectors in this space.
So we have a |V|-dimensional vector space.
The weight can be anything: binary, TF, TF-IDF, and so on.

Slide21

| Documents as Vectors

Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
These are very sparse vectors - most entries are zero.

Slide22

How About The Query?

Slide23

Query as a Vector, too...

Slide24

| VECTOR SPACE MODEL

Key idea 1: Represent documents as vectors in the space.
Key idea 2: Do the same for queries: represent them as vectors in the space.
Key idea 3: Rank documents according to their proximity to the query in this space.

Slide25

PROXIMITY?

Slide26

| Proximity

Proximity = similarity of vectors.
Proximity ≈ inverse of distance.
The document with the greatest proximity to the query gets the highest score, and therefore the highest rank.

Slide27

How to Measure Vector Space Proximity?

Slide28

Slide29

| Proximity

First cut: distance between two points (= distance between the end points of the two vectors).
Euclidean distance? Euclidean distance is a bad idea...
...because Euclidean distance is large for vectors of different lengths.

Slide30

| Distance Example

Doc 1 : gossip
Doc 2 : jealous
Doc 3 : gossip jealous gossip jealous gossip jealous ... (gossip: 90x, jealous: 70x)

Slide31

| Distance Example

Query : gossip jealous

Inverted Index (Logarithmic TF):

Term      Doc 1   Doc 2   Doc 3   Query
Gossip    1       0       2.95    1
Jealous   0       1       2.84    1

Inverted Index (TF-IDF):

Term      IDF     Doc 1   Doc 2   Doc 3   Query
Gossip    0.17    0.17    0       0.50    0.17
Jealous   0.17    0       0.17    0.48    0.17

Slide32

| Distance Example

Query : gossip jealous

Inverted Index (TF-IDF):

Term      Doc 1   Doc 2   Doc 3   Query
Gossip    0.17    0       0.50    0.17
Jealous   0       0.17    0.48    0.17

[Plot: Doc 1, Doc 2, Doc 3, and the Query as TF-IDF vectors in the Gossip–Jealous plane.]

Slide33

| Why Distance is a Bad Idea?

The Euclidean distance between the query and Doc 3 is large, even though the distribution of terms in the query q and the distribution of terms in the document Doc 3 are very similar.

Slide34

| So, instead of Distance?

Thought experiment: take a document d and append it to itself. Call this document d′.
"Semantically" d and d′ have the same content.
The Euclidean distance between the two documents can be quite large.

Slide35

| So, instead of Distance?

The angle between the two documents is 0, corresponding to maximal similarity.

[Plot: d and d′ in the Gossip–Jealous plane, lying on the same ray from the origin.]

Slide36

| Use angle instead of distance

Key idea: rank documents according to their angle with the query.

Slide37

| From angles to cosines

The following two notions are equivalent:
Rank documents in decreasing order of the angle between query and document.
Rank documents in increasing order of cosine(query, document).
Cosine is a monotonically decreasing function on the interval [0°, 180°].

Slide38

| From angles to cosines

Slide39

But how – and why – should we be computing cosines?

Slide40

The dot product of two vectors a and b is:

a · b = |a| × |b| × cos(θ)

Where:
|a| is the magnitude (length) of vector a
|b| is the magnitude (length) of vector b
θ is the angle between a and b

Rearranging gives the cosine:

cos(θ) = (a · b) / (|a| × |b|)

Slide41

qi is the tf-idf weight (or whatever) of term i in the query.
di is the tf-idf weight (or whatever) of term i in the document.
cos(q, d) is the cosine similarity of q and d ... or, equivalently, the cosine of the angle between q and d:

cos(q, d) = (q · d) / (|q| × |d|) = Σi qi di / (√(Σi qi²) × √(Σi di²))
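That formula translates directly into code. A minimal sketch over plain Python lists (the example vectors are my own):

```python
import math

def cosine_similarity(q, d):
    # cos(q, d) = (q . d) / (|q| * |d|) for two equal-length weight vectors.
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

# Appending d to itself doubles every weight, so the angle is unchanged:
d = [0.2, 0.3]
d_doubled = [0.4, 0.6]
print(round(cosine_similarity(d, d_doubled), 6))  # 1.0
```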

Slide42

| Length normalization

A vector can be (length-) normalized by dividing each of its components by its length.
Dividing a vector by its length makes it a unit (length) vector (on the surface of the unit hypersphere).
Unit vector = a vector whose length is exactly 1 (the unit length).

Slide43

| Remember this Case

[Plot: d and d′ as unnormalized vectors in the Gossip–Jealous plane.]

Slide44

| Length normalization

Slide45

| Remember this Case

[Plot: after length normalization, d and d′ map to the same unit vector in the Gossip–Jealous plane.]

Slide46

| Length normalization

Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
Long and short documents now have comparable weights.
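Length normalization itself is one division per component. A minimal sketch (the vector d is my own example; d′ is d appended to itself, so every weight doubles):

```python
import math

def normalize(v):
    # Divide every component by the vector's Euclidean length -> unit vector.
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

d = [0.4, 0.3]        # a document vector (example values)
d_prime = [0.8, 0.6]  # d appended to itself: every weight doubles

print([round(x, 6) for x in normalize(d)])        # [0.8, 0.6]
print([round(x, 6) for x in normalize(d_prime)])  # [0.8, 0.6] -- identical
```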

Slide47

After Normalization :

Slide48

After normalization, the cosine reduces to a dot product:

cos(q, d) = q · d = Σi qi di

for q, d length-normalized.

Slide49

| Cosine similarity illustrated

Slide50

| Cosine similarity illustrated

The value of the cosine similarity lies in [0, 1] (for non-negative term weights).

Slide51

Example?

Slide52

| TF-IDF Example

Document: car insurance auto insurance
Query: best car insurance
N = 1,000,000

Query:

Term       tf-raw  tf-wt  df     idf  tf.idf  n'lize
auto       0       0      5000   2.3  0
best       1       1      50000  1.3  1.3
car        1       1      10000  2.0  2.0
insurance  1       1      1000   3.0  3.0

Query length =

Slide53

| TF-IDF Example

Document: car insurance auto insurance
Query: best car insurance

Query:

Term       tf-raw  tf-wt  df     idf  tf.idf  n'lize
auto       0       0      5000   2.3  0       0
best       1       1      50000  1.3  1.3     0.34
car        1       1      10000  2.0  2.0     0.52
insurance  1       1      1000   3.0  3.0     0.78

Query length = √(0² + 1.3² + 2.0² + 3.0²) ≈ 3.83
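The query column can be reproduced from the df values and N = 1,000,000 given on the slide. A minimal sketch:

```python
import math

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0

N = 1_000_000
# (raw tf in the query, document frequency df) per term, from the slide.
query = {"auto": (0, 5000), "best": (1, 50000), "car": (1, 10000), "insurance": (1, 1000)}

# tf-idf weight per term: log-TF weight times idf = log10(N / df).
weights = {t: log_tf(tf) * math.log10(N / df) for t, (tf, df) in query.items()}

# Length-normalize: divide by the Euclidean length of the weight vector.
length = math.sqrt(sum(w * w for w in weights.values()))
normalized = {t: round(w / length, 2) for t, w in weights.items()}

print(round(length, 2))  # 3.83
print(normalized)        # {'auto': 0.0, 'best': 0.34, 'car': 0.52, 'insurance': 0.78}
```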

Slide54

| TF-IDF Example

Document: car insurance auto insurance
Query: best car insurance

Document:

Term       tf-raw  tf-wt  idf  tf.idf  n'lize
auto       1       1      2.3  2.3
best       0       0      1.3  0
car        1       1      2.0  2.0
insurance  2       1.3    3.0  3.9

Doc length =

Slide55

| TF-IDF Example

Document: car insurance auto insurance
Query: best car insurance

Document:

Term       tf-raw  tf-wt  idf  tf.idf  n'lize
auto       1       1      2.3  2.3     0.46
best       0       0      1.3  0       0
car        1       1      2.0  2.0     0.40
insurance  2       1.3    3.0  3.9     0.79

Doc length = √(2.3² + 0² + 2.0² + 3.9²) ≈ 4.95

Slide56

After Normalization :

cos(q, d) = q · d = Σi qi di

for q, d length-normalized.

Slide57

| TF-IDF Example

Document: car insurance auto insurance
Query: best car insurance

Term       Query tf.idf  Query n'lize  Doc tf.idf  Doc n'lize  Dot product
auto       0             0             2.3         0.46        0
best       1.3           0.34          0           0           0
car        2.0           0.52          2.0         0.40        0.21
insurance  3.0           0.78          3.9         0.79        0.62

Score = 0 + 0 + 0.21 + 0.62 = 0.83

Doc length = √(2.3² + 0² + 2.0² + 3.9²) ≈ 4.95

Sec. 6.4

Slide58

| Summary – vector space ranking

Represent the query as a weighted tf-idf vector.
Represent each document as a weighted tf-idf vector.
Compute the cosine similarity score for the query vector and each document vector.
Rank documents with respect to the query by score.
Return the top K (e.g., K = 10) to the user.
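The five steps above can be sketched end-to-end. This is a minimal, unoptimized sketch; the toy corpus and all names are my own, not part of the slides:

```python
import math
from collections import Counter

def tf_idf_vector(tokens, df, n_docs, vocab):
    # Weighted tf-idf vector over a fixed vocabulary: log-TF times log10(N/df).
    counts = Counter(tokens)
    return [
        (1 + math.log10(counts[t])) * math.log10(n_docs / df[t]) if counts[t] else 0.0
        for t in vocab
    ]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_tokens, docs, k=10):
    # Score every document against the query and return the top K by cosine.
    n = len(docs)
    vocab = sorted({t for d in docs.values() for t in d} | set(query_tokens))
    df = {t: sum(t in d for d in docs.values()) or 1 for t in vocab}  # avoid df = 0
    qv = tf_idf_vector(query_tokens, df, n, vocab)
    scores = {name: cosine(qv, tf_idf_vector(d, df, n, vocab)) for name, d in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

docs = {
    "d1": "gossip gossip jealous".split(),
    "d2": "nasi makan nasi".split(),
    "d3": "gossip makan".split(),
}
print(rank("gossip jealous".split(), docs))
```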

Slide59

Cosine similarity amongst 3 documents

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38

Note: To simplify this example, we don't do idf weighting.

Sec. 6.3

Slide60

3 documents example contd. Log frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

Why do we have cos(SaS, PaP) > cos(SaS, WH)?

Sec. 6.3
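The three cosines can be checked directly from the raw counts in the table above (a minimal sketch, using log-frequency weighting and length normalization as described, with no idf weighting):

```python
import math

# Raw counts per novel, in the order: affection, jealous, gossip, wuthering.
counts = {
    "SaS": [115, 10, 2, 0],
    "PaP": [58, 7, 0, 0],
    "WH":  [20, 11, 6, 38],
}

def log_weight(v):
    return [1 + math.log10(x) if x > 0 else 0.0 for x in v]

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

vecs = {name: normalize(log_weight(v)) for name, v in counts.items()}

def cos(a, b):
    # For unit vectors, cosine similarity is just the dot product.
    return sum(x * y for x, y in zip(vecs[a], vecs[b]))

print(round(cos("SaS", "PaP"), 2))  # 0.94
print(round(cos("SaS", "WH"), 2))   # 0.79
print(round(cos("PaP", "WH"), 2))   # 0.69
```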