Slide 1
Clustering More than Two Million Biomedical Publications
Comparing the Accuracies of Nine Text-Based Similarity Approaches
Boyack et al. (2011). PLoS ONE 6(3): e18029
Slide 2: Motivation
- Compare different similarity measurements
- Make use of a biomedical data set
- Process a large corpus
Slide 3: Procedures
1. Define a corpus of documents
2. Extract and pre-process the relevant textual information from the corpus
3. Calculate pairwise document-document similarities using nine different similarity approaches
4. Create similarity matrices keeping only the top-n similarities per document
5. Cluster the documents based on this similarity matrix
6. Assess each cluster solution using coherence and concentration metrics
Slide 4: Data
- Goal: build a corpus with titles, abstracts, MeSH terms, and reference lists
- Matched and combined data from the MEDLINE and Scopus (Elsevier) databases
- The resulting set was limited to documents published 2004-2008 that contained abstracts, at least five MeSH terms, and at least five references in their bibliographies, yielding a corpus of 2,153,769 unique scientific documents
- Base matrix: word-document co-occurrence matrix
Slide 5: Methods
Slide 6: tf-idf
The tf–idf weight (term frequency–inverse document frequency) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.
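The weighting described above can be sketched in a few lines; this is a minimal illustration with a toy three-document corpus, not the paper's implementation (the paper works on the 2.15M-document word-document matrix):

```python
import math

# Toy corpus of tokenized documents (illustrative only).
docs = [
    ["gene", "expression", "cancer"],
    ["gene", "therapy"],
    ["cancer", "treatment", "therapy"],
]

def tf_idf(term, doc, corpus):
    """tf-idf: term frequency in the document, scaled by the
    inverse document frequency of the term across the corpus."""
    tf = doc.count(term) / len(doc)              # higher if the term repeats in this doc
    df = sum(1 for d in corpus if term in d)     # number of docs containing the term
    idf = math.log(len(corpus) / df)             # rarer terms get a higher weight
    return tf * idf

weight = tf_idf("gene", docs[0], docs)
```

A term that appears in every document gets idf = log(1) = 0, so ubiquitous words are weighted away exactly as the slide describes.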
Slide 7: tf-idf
Slide 8: LSA
Latent semantic analysis
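LSA reduces the term-document matrix to a low-rank latent space via a truncated SVD; documents are then compared in that space. A minimal sketch with a toy matrix and assumed rank k = 2 (the paper's matrix and rank are far larger):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# In practice this would be the (weighted) word-document co-occurrence matrix.
A = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])

k = 2  # number of latent dimensions to keep (assumption for this sketch)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_latent = (np.diag(s[:k]) @ Vt[:k]).T   # each row: one document in latent space

def cos(u, v):
    """Cosine similarity between two latent document vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

sim = cos(docs_latent[0], docs_latent[2])
```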
Slide 9: LSA
Slide 10: BM25
Okapi BM25 is a ranking function widely used by search engines to rank matching documents according to their relevance to a query.
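The standard Okapi BM25 scoring function can be sketched as follows; the k1 and b values are common defaults, not necessarily the settings used in the paper:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=2.0, b=0.75):
    """Okapi BM25 score of one document for a query.
    k1 controls term-frequency saturation; b controls length normalization."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N          # average document length
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)      # document frequency of the term
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)  # smoothed, always non-negative
        f = doc.count(term)                          # term frequency in this doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["gene", "therapy"], ["cancer", "therapy", "trial"], ["gene", "cancer"]]
s = bm25_score(["gene"], corpus[0], corpus)
```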
Slide 11: BM25
Slide 12: SOM
A self-organizing map is a form of artificial neural network that generates a low-dimensional geometric model from high-dimensional data. A SOM may be considered a nonlinear generalization of principal component analysis (PCA).
Slide 13: SOM
1. Randomize the map's nodes' weight vectors
2. Grab an input vector
3. Traverse each node in the map:
   - Use the Euclidean distance formula to find the similarity between the input vector and the node's weight vector
   - Track the node that produces the smallest distance (this node is the best matching unit, BMU)
4. Update the nodes in the neighbourhood of the BMU by pulling them closer to the input vector:
   Wv(t + 1) = Wv(t) + Θ(t) α(t) (D(t) − Wv(t))
5. Increase t and repeat from step 2 while t < λ
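The steps above can be sketched as a short training loop. The grid size, decay schedules for α(t) and the neighbourhood radius, and the Gaussian form of Θ(t) are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: randomize the map's nodes' weight vectors (5x5 grid, 3-D inputs).
grid_h, grid_w, dim = 5, 5, 3
weights = rng.random((grid_h, grid_w, dim))
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing="ij"), axis=-1)   # grid position of each node

data = rng.random((100, dim))   # toy input vectors
lam = 200                       # training length λ

for t in range(lam):
    # Step 2: grab an input vector.
    D = data[t % len(data)]
    # Step 3: find the best matching unit (smallest Euclidean distance).
    dists = np.linalg.norm(weights - D, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Step 4: Wv(t+1) = Wv(t) + Θ(t) α(t) (D(t) − Wv(t)).
    alpha = 0.5 * (1 - t / lam)                    # decaying learning rate α(t)
    sigma = 2.0 * (1 - t / lam) + 0.5              # shrinking neighbourhood radius
    grid_dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
    theta = np.exp(-grid_dist2 / (2 * sigma ** 2))  # neighbourhood function Θ(t)
    weights += theta[..., None] * alpha * (D - weights)
    # Step 5: increase t and repeat while t < λ (the loop bound).
```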
Slide 14: Topic modeling
Three separate Gibbs-sampled topic models were learned at the following topic resolutions: T = 500, T = 1000, and T = 2000 topics. Dirichlet prior hyperparameter settings of β = 0.01 and α = 0.05N/(D·T) were used, where N is the total number of word tokens, D is the number of documents, and T is the number of topics.
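As a worked example of the α formula, with D taken from the corpus size on Slide 4 and N an illustrative token count (the paper does not state N here):

```python
# Dirichlet hyperparameters: β fixed, α scaled by corpus size relative to D*T.
N = 300_000_000   # total word tokens (illustrative figure, not from the paper)
D = 2_153_769     # number of documents (corpus size from Slide 4)
T = 1000          # one of the three topic resolutions

beta = 0.01
alpha = 0.05 * N / (D * T)   # a small per-topic prior, shrinking as T grows
```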
Slide 15: Topic modeling
Slide 16: PMRA
- The PMRA ranking measure is used to calculate 'Related Articles' in the PubMed interface
- The de facto standard
- Proxy
Slide 17: Similarity filtering
- Reduce matrix size: generate a top-n similarity file from each of the larger similarity matrices
- With n = 15, each document thus contributes between 5 and 15 edges to the similarity file
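The top-n filtering step can be sketched as follows, using a toy 4x4 similarity matrix and n = 2 (the paper uses n = 15 on the full matrices):

```python
import numpy as np

# Toy symmetric document-document similarity matrix (illustrative values).
sim = np.array([
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.9],
    [0.4, 0.2, 0.9, 1.0],
])
n = 2  # keep only the top-n similarities per document

edges = []
for i, row in enumerate(sim):
    # Exclude self-similarity, then keep the n highest-scoring neighbours.
    order = np.argsort(row)[::-1]
    top = [j for j in order if j != i][:n]
    edges.extend((i, int(j), float(row[j])) for j in top)
```

Each document contributes exactly n edges here; in the paper the count varies between 5 and 15 because very sparse documents may have fewer than 15 nonzero similarities.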
Slide 18: Clustering
DrL (now called OpenOrd): a graph layout algorithm that calculates an (x, y) position for each document in a collection using an input set of weighted edges.
http://gephi.org/
Slide 19: Evaluation
Textual coherence (Jensen-Shannon divergence)
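Jensen-Shannon divergence compares two word distributions; a minimal sketch follows. The doc-versus-cluster comparison shown is an illustration of the coherence idea (lower divergence = more textually coherent cluster), not the paper's exact metric definition:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs, so the value lies in [0, 1])."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)                      # the mixture distribution

    def kl(a, b):
        mask = a > 0                       # 0 * log(0) terms contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Illustrative word distributions for one document and its cluster's aggregate.
doc = [0.5, 0.3, 0.2, 0.0]
cluster = [0.4, 0.3, 0.2, 0.1]
d = js_divergence(doc, cluster)
```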
Slide 20: Evaluation
Concentration: a metric based on grant acknowledgements from MEDLINE, using a grant-to-article linkage dataset from a previous study
Slide 21: Results
Slides 22-25: Results (cont.)