Presentation Transcript

Slide1

Clustering More than Two Million Biomedical Publications

Comparing the Accuracies of Nine Text-Based Similarity Approaches

Boyack et al. (2011). PLoS ONE 6(3): e18029

Slide2

Motivation

Compare different similarity measurements

Make use of biomedical data set

Process large corpus

Slide3

Procedures

1. Define a corpus of documents
2. Extract and pre-process the relevant textual information from the corpus
3. Calculate pairwise document-document similarities using nine different similarity approaches
4. Create similarity matrices, keeping only the top-n similarities per document
5. Cluster the documents based on this similarity matrix
6. Assess each cluster solution using coherence and concentration metrics

Slide4

Data

To build a corpus with titles, abstracts, MeSH terms, and reference lists, data from the MEDLINE and Scopus (Elsevier) databases were matched and combined.

The resulting set was then limited to those documents published from 2004-2008 that contained abstracts, at least five MeSH terms, and at least five references in their bibliographies, resulting in a corpus comprised of 2,153,769 unique scientific documents.

Base matrix: word-document co-occurrence matrix

Slide5

Methods

Slide6

tf-idf

The tf-idf weight (term frequency-inverse document frequency) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
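As a minimal illustration of the weighting scheme (raw term frequency normalized by document length, and idf = log(N/df); the paper does not specify this exact variant, so treat it as a sketch rather than the authors' implementation):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    tf(t, d) = count of t in d / length of d
    idf(t)   = log(N / df(t)), df(t) = number of documents containing t
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # count each term once per doc
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return weights

docs = [["gene", "expression", "cancer"],
        ["gene", "therapy"],
        ["cancer", "therapy", "trial"]]
w = tf_idf(docs)
```

"expression" occurs in only one of the three toy documents, so it carries the highest idf and outweighs the corpus-wide term "gene" in the first document.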

Slide7

tf-idf

Slide8

LSA

Latent semantic analysis

Slide9

LSA

Slide10

BM25

Okapi BM25: a ranking function that is widely used by search engines to rank matching documents according to their relevance to a query.
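A compact sketch of the standard BM25 scoring formula (default parameters k1 = 1.2 and b = 0.75 are conventional choices, not values taken from the paper):

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of tokenized `doc` for `query`, over corpus `docs`."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n          # average document length
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        if term not in tf:
            continue
        df = sum(1 for d in docs if term in d)     # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        # term frequency saturates with k1; b controls length normalization
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["clustering", "biomedical", "publications"],
          ["graph", "layout", "algorithm"],
          ["clustering", "similarity", "clustering"]]
scores = [bm25_score(["clustering"], d, corpus) for d in corpus]
```

Documents without the query term score zero, and repeated occurrences raise the score with diminishing returns.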

Slide11

BM25

Slide12

SOM

Self-organizing map: a form of artificial neural network that generates a low-dimensional geometric model from high-dimensional data.

SOM may be considered a nonlinear generalization of Principal components analysis (PCA).

Slide13

SOM

1. Randomize the map's nodes' weight vectors
2. Grab an input vector
3. Traverse each node in the map:
   - Use the Euclidean distance formula to find the similarity between the input vector and the map's node's weight vector
   - Track the node that produces the smallest distance (this node is the best matching unit, BMU)
4. Update the nodes in the neighbourhood of the BMU by pulling them closer to the input vector: W_v(t + 1) = W_v(t) + Θ(t)α(t)(D(t) − W_v(t))
5. Increase t and repeat from step 2 while t < λ
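The training loop above can be sketched as follows. The grid size, the linear decay schedules, and the Gaussian form of the neighbourhood function Θ are illustrative choices for this sketch, not settings from the paper:

```python
import numpy as np

def train_som(data, grid=(6, 6), n_iter=300, alpha0=0.5, sigma0=2.0, seed=0):
    """Train a self-organizing map.

    Each iteration grabs an input vector D(t), finds the best matching
    unit (BMU) by Euclidean distance, and applies
        W_v(t+1) = W_v(t) + Theta(t) * alpha(t) * (D(t) - W_v(t))
    where Theta(t) is a Gaussian neighbourhood centred on the BMU and
    both alpha(t) and the neighbourhood radius shrink over time.
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))   # randomize weights
    # grid coordinates of every node, used for neighbourhood distances
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    for t in range(n_iter):
        frac = t / n_iter
        alpha = alpha0 * (1 - frac)               # decaying learning rate
        sigma = sigma0 * (1 - frac) + 1e-3        # decaying radius
        d = data[rng.integers(len(data))]         # grab an input vector
        # BMU = node whose weight vector is closest to the input
        dist = np.linalg.norm(weights - d, axis=-1)
        bmu = np.unravel_index(np.argmin(dist), dist.shape)
        # pull the BMU's neighbourhood toward the input vector
        grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
        theta = np.exp(-grid_d2 / (2 * sigma ** 2))
        weights += theta[..., None] * alpha * (d - weights)
    return weights

data = np.random.default_rng(1).random((50, 4))   # toy high-dimensional input
weights = train_som(data)
```

Because each update is a convex pull toward an input vector in [0, 1), the trained weight vectors stay inside the data's range.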

Slide14

Topic modeling

Three separate Gibbs-sampled topic models were learned at the following topic resolutions: T = 500, T = 1000 and T = 2000 topics. Dirichlet prior hyperparameter settings of b = 0.01 and a = 0.05N/(D·T) were used, where N is the total number of word tokens, D is the number of documents and T is the number of topics.
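A toy collapsed Gibbs sampler shows the mechanics at miniature scale. The symmetric alpha default and the tiny word-id corpus here are illustrative only; the paper's runs used a = 0.05N/(D·T) over the full two-million-document corpus:

```python
import numpy as np

def gibbs_lda(docs, n_words, T, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of integer word ids.
    Returns (doc-topic counts, topic-word counts) after n_iter sweeps.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, T))           # doc-topic counts
    nkw = np.zeros((T, n_words))     # topic-word counts
    nk = np.zeros(T)                 # tokens assigned to each topic
    z = []                           # topic assignment of every token
    for d, doc in enumerate(docs):
        zd = rng.integers(T, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the token's current assignment from the counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional p(z_i = k | all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_words * beta)
                p /= p.sum()
                k = rng.choice(T, p=p)
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw

docs = [[0, 1, 2, 1], [0, 3, 3], [2, 4, 4, 1]]   # documents as word-id lists
ndk, nkw = gibbs_lda(docs, n_words=5, T=2, n_iter=50)
```

The count matrices always partition the corpus: every token is assigned to exactly one topic at all times.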

Slide15

Topic modeling

Slide16

PMRA

The PMRA ranking measure is used to calculate 'Related Articles' in the PubMed interface

The de facto standard

Proxy

Slide17

Similarity filtering

Reduce matrix size

Generate a top-n similarity file from each of the larger similarity matrices. With n = 15, each document thus contributes between 5 and 15 edges to the similarity file.
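The filtering step can be sketched as below. The small dense matrix is illustrative; at corpus scale the paper's matrices would have to be processed row-by-row rather than held in memory:

```python
import numpy as np

def top_n_edges(sim, n=15):
    """Reduce a dense document-document similarity matrix to a top-n edge list.

    For each document i, keep only its n largest positive similarities
    (excluding the self-similarity), as (i, j, weight) edges.
    """
    edges = []
    for i in range(sim.shape[0]):
        row = sim[i].astype(float).copy()
        row[i] = -np.inf                      # never keep a self-edge
        for j in np.argsort(row)[::-1][:n]:   # indices of the n largest values
            if row[j] > 0:
                edges.append((i, int(j), float(row[j])))
    return edges

sim = np.array([[1.0, 0.9, 0.2, 0.0],
                [0.9, 1.0, 0.4, 0.1],
                [0.2, 0.4, 1.0, 0.0],
                [0.0, 0.1, 0.0, 1.0]])
edges = top_n_edges(sim, n=2)
```

Documents with fewer than n positive similarities contribute fewer edges, which is why each document ends up with a variable number of edges in the similarity file.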

Slide18

Clustering

DrL (now called OpenOrd): a graph layout algorithm that calculates an (x, y) position for each document in a collection using an input set of weighted edges.

http://gephi.org/

Slide19

Evaluation

Textual coherence (Jensen-Shannon divergence)
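The divergence itself can be computed as below (natural-log base; how the paper aggregates it into a per-cluster coherence score is not reproduced here):

```python
import numpy as np

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two word distributions.

    JSD(P, Q) = H(M) - (H(P) + H(Q)) / 2, with M = (P + Q) / 2.
    Symmetric, bounded above by log(2), and 0 iff P == Q.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()           # normalize to distributions
    m = (p + q) / 2

    def entropy(x):
        x = x[x > 0]                          # 0 * log(0) is taken as 0
        return -np.sum(x * np.log(x))

    return entropy(m) - (entropy(p) + entropy(q)) / 2

d_max = jensen_shannon([1, 0], [0, 1])            # disjoint distributions
d_min = jensen_shannon([0.4, 0.6], [0.4, 0.6])    # identical distributions
```

Disjoint distributions hit the log(2) upper bound, while identical distributions give exactly zero, which is what makes the measure usable as a coherence score.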

Slide20

Evaluation

Concentration: a metric based on grant acknowledgements from MEDLINE, using a grant-to-article linkage dataset from a previous study

Slide21

Results

Slide22

Slide23

Slide24

Slide25

Results (cont.)

Slide26