Slide 1
OAG: Toward Linking Large-scale Heterogeneous Entity Graphs
Fanjin Zhang, Xiao Liu, Jie Tang, Yuxiao Dong, Peiran Yao, Jie Zhang, Xiaotao Gu, Yan Wang, Bin Shao, Rui Li, and Kuansan Wang
Tsinghua University & Microsoft Research
Slide 2: OAG overview
Linking large-scale heterogeneous academic graphs
Open Academic Graph (OAG) is a large knowledge graph unifying two web-scale academic graphs: Microsoft Academic Graph (MAG) and AMiner.
Slide 3: OAG: Open Academic Graph
https://www.openacademic.ai/oag/
Slide 4: Problem & Challenges
Input: two heterogeneous entity graphs $G_1$ and $G_2$.
Output: entity linkings $(e_1, e_2)$, with $e_1 \in G_1$ and $e_2 \in G_2$, such that $e_1$ and $e_2$ represent exactly the same entity.
Challenges:
- Entity heterogeneity: different types of entities with heterogeneous attributes.
- Entity ambiguity: the long-standing name ambiguity problem.
- Large-scale entity linking: hundreds of millions of publications in each source.
Slide 6: Related work
- Rule-based method: DiscR [TKDE'15]
- Traditional ML methods: RiMOM [JWS'06], Rong et al. [ISWC'12], Wang et al. [WWW'12], COSNET [KDD'15]
- Embedding-based methods: IONE [IJCAI'16], REGAL [CIKM'18], MEgo2Vec [CIKM'18]
Slide 7: Framework: LinKG
- Venue linking module
- Author linking module
- Paper linking module
Slide 8: Framework: LinKG
- Venue linking — sequence-based entities: an LSTM-based method to capture the dependencies.
- Paper linking: locality-sensitive hashing and convolutional neural networks for scalable and precise linking.
- Author linking: heterogeneous graph attention networks to model different types of entities.
Slide 9: Linking venues — sequence-based entities
Input: venue names in each graph.
Output: linked venue pairs.
Idea:
- Direct name matching for the easy cases.
- An LSTM-based method for fuzzy-sequence linking.
Slide 10: Venue linking characteristics
- Word order matters, e.g., 'Diagnostic and interventional imaging' vs. 'Journal of Diagnostic Imaging and Interventional Radiology'.
- Fuzzy matching is needed for varied-length venue names.
- Extra or missing prefixes or suffixes, e.g., 'Proceedings of the Second international conference on Advances in social network mining and analysis'.
Slide 11: Venue linking model
[Figure: model architecture, bottom to top; see the sketch below]
- Input: raw word sequence, plus keywords extracted from the integral sequences.
- Two LSTM layers.
- Output: similarity score.
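A minimal PyTorch sketch of this idea, assuming a Siamese two-layer LSTM over token IDs; this is not the authors' exact architecture, and the vocabulary size, dimensions, and scoring head are assumptions:

```python
# Hypothetical sketch: a Siamese two-layer LSTM venue-name matcher.
import torch
import torch.nn as nn

class VenueMatcher(nn.Module):
    def __init__(self, vocab_size=50000, embed_dim=128, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Two stacked LSTM layers, as on the slide.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.score = nn.Linear(2 * hidden_dim, 1)

    def encode(self, token_ids):
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return h_n[-1]                        # final hidden state of the top layer

    def forward(self, name_a, name_b):        # (batch, seq_len) token IDs each
        h = torch.cat([self.encode(name_a), self.encode(name_b)], dim=-1)
        return torch.sigmoid(self.score(h))   # similarity score in [0, 1]
```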
Slide 12: Framework: LinKG
- Venue linking — sequence-based entities: an LSTM-based method to capture the dependencies.
- Paper linking: locality-sensitive hashing and convolutional neural networks for scalable and precise linking.
- Author linking: heterogeneous graph attention networks to model different types of entities.
Slide 13: Linking papers — large-scale entities
Problem setting: to link paper entities, we fully leverage the heterogeneous information, including a paper's title and authors.
- Leverage the hashing technique (LSH) for fast processing.
- Adopt Doc2Vec to transform titles into real-valued vectors.
- Use LSH to map real-valued paper features to binary codes (see the sketch below).
- Use a convolutional neural network for effective linking.
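A minimal sketch of the title-hashing step, assuming gensim's Doc2Vec and a random-hyperplane LSH; the toy corpus, vector size, and number of bits are illustrative assumptions:

```python
# Hypothetical sketch: Doc2Vec title vectors -> binary LSH signatures.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

titles = ["oag toward linking large scale heterogeneous entity graphs",
          "name disambiguation in anonymized graphs"]
docs = [TaggedDocument(t.split(), [i]) for i, t in enumerate(titles)]
d2v = Doc2Vec(docs, vector_size=100, min_count=1, epochs=20)

# Random-hyperplane LSH: the sign of each projection gives one bit.
planes = np.random.default_rng(0).standard_normal((32, 100))

def signature(title):
    vec = d2v.infer_vector(title.split())
    return (planes @ vec > 0).astype(np.uint8)   # 32-bit binary code

# Titles whose signatures agree on most bits become candidate pairs,
# avoiding all-pairs comparison over hundreds of millions of papers.
```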
Slide 14: Paper linking characteristics
- Large-scale entities: hundreds of millions of academic publications in each graph.
- Local and hierarchical matching patterns: paper titles are often truncated if they contain punctuation marks, such as ':' and '?'.
- Different author name formats: 'Jing Zhang', 'J. Zhang', and 'Zhang, J.'
Slide 15: Paper linking model — CNN
[Figure: model architecture, bottom to top; see the sketch below]
- Input: word-level similarity matrix.
- Convolution on the input similarity matrix.
- MLP layers.
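A minimal sketch of this pipeline, assuming a pretrained word-vector lookup `w2v` and a fixed 20x20 matrix; kernel and layer sizes are assumptions:

```python
# Hypothetical sketch: word-level similarity matrix + small CNN scorer.
import torch
import torch.nn as nn

def similarity_matrix(title_a, title_b, w2v, size=20):
    """Dot-product similarity between every word pair of two titles."""
    m = torch.zeros(size, size)
    for i, wa in enumerate(title_a[:size]):
        for j, wb in enumerate(title_b[:size]):
            m[i, j] = torch.dot(w2v[wa], w2v[wb])
    return m

cnn = nn.Sequential(                      # convolution over the matrix
    nn.Conv2d(1, 8, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 9 * 9, 32), nn.ReLU(),  # MLP layers
    nn.Linear(32, 1), nn.Sigmoid(),       # matching probability
)
# score = cnn(similarity_matrix(ta, tb, w2v).unsqueeze(0).unsqueeze(0))
```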
Slide 16: Framework: LinKG
- Venue linking — sequence-based entities: an LSTM-based method to capture the dependencies.
- Paper linking: locality-sensitive hashing and convolutional neural networks for scalable and precise linking.
- Author linking: heterogeneous graph attention networks to model different types of entities.
Slide 17: Linking authors — ambiguous entities
Problem setting: to link author entities, we generate a heterogeneous subgraph for each author, composed of his or her coauthors, papers, and publication venues.
- Also incorporate the venue and paper linking results.
- Present a heterogeneous graph attention network (HGAT) based technique for author linking.
Slide 18: Author linking characteristics
- Name ambiguity: 16,392 authors named 'Jing Zhang' in AMiner and 7,170 in MAG.
- Attribute sparsity: missing affiliations, homepages, ...
- Already-linked papers and venues can help!
- View author linking as a subgraph matching problem: aggregate the needed information from neighbors.
Slide 19: Graph neural networks
[Figure: an example graph with a center node v and neighbors a-e]
Neighborhood aggregation: aggregate neighbor information and pass it into a neural network.
It can be viewed as a center-surround filter in a CNN: graph convolutions!
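A minimal sketch of one round of mean-aggregation over neighbors, the pattern that GCN and GraphSAGE instantiate; the toy graph and dimensions are assumptions:

```python
# Hypothetical sketch: one neighborhood-aggregation step.
import torch

h = torch.randn(6, 8)                      # 6 nodes, 8-dim features
W = torch.randn(16, 16)                    # maps concat(self, mean) -> 16-dim
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

out = torch.zeros(6, 16)
for v, nbrs in neighbors.items():
    msg = h[nbrs].mean(dim=0)              # aggregate neighbor information
    out[v] = torch.relu(W @ torch.cat([h[v], msg]))  # pass through a layer
```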
Slide 20: GCN: graph convolutional networks
GCN is one way of doing neighbor aggregation. Other variants:
- GraphSAGE
- Graph attention networks
- ...
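For reference (not from the slides, but the standard formulation), the layer-wise propagation rule of a GCN is:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right), \qquad \tilde{A} = A + I,$$

where $A$ is the adjacency matrix, $\tilde{D}$ the degree matrix of $\tilde{A}$, $H^{(l)}$ the node representations at layer $l$, and $W^{(l)}$ a learnable weight matrix.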
Slide 21: LinKG step 1: paired subgraph construction
Subgraph nodes:
- direct (heterogeneous) neighbors, including coauthors, papers, and venues;
- coauthors' papers and venues (2-hop ego networks).
Merge pre-linked entities (papers or venues) and construct a fixed-size graph, as sketched below.
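A minimal sketch of step 1 under simplifying assumptions: string node IDs, dict-based adjacency, and a hypothetical truncation size; the helper names are made up for illustration.

```python
# Hypothetical sketch: 2-hop ego-network construction around an author.
def build_subgraph(author, coauthors, papers_of, venue_of, max_nodes=100):
    nodes = {author} | set(coauthors[author])    # author + coauthors
    for a in [author] + coauthors[author]:       # 1-hop and 2-hop expansion
        for p in papers_of[a]:
            nodes.add(p)                         # papers
            nodes.add(venue_of[p])               # their venues
    # Pre-linked papers/venues from the earlier modules would be merged here.
    return sorted(nodes)[:max_nodes]             # fixed-size truncation
```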
Slide 22: Step 2: linking based on Heterogeneous Graph Attention Networks (HGAT)
Input node features (in subgraphs):
- Semantic embedding: average word embedding of author attributes (see the sketch below).
- Structure embedding: a network embedding trained on a large heterogeneous graph (e.g., LINE).
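A minimal sketch of the semantic node feature, assuming a pretrained word-vector lookup `w2v` (e.g., word2vec); the tokenization is an assumption:

```python
# Hypothetical sketch: semantic embedding = average of attribute word vectors.
import numpy as np

def semantic_embedding(attributes, w2v, dim=100):
    """attributes: list of strings, e.g. name, affiliation, keywords."""
    tokens = [t for attr in attributes for t in attr.lower().split()]
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```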
Slide 23: Step 2: linking based on Heterogeneous Graph Attention Networks (HGAT)
Encoder layers:
- Attention coefficient $attn_{ij}$, learnt by a self-attention mechanism.
- Normalized attention coefficient $\alpha_{ij}$: differentiates different types of entities; it is the aggregation weight of source entity $j$'s embedding on target entity $i$.
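A minimal sketch of a GAT-style attention coefficient with per-type projections, as a simplification of HGAT; the dimensions and the entity-type set are assumptions:

```python
# Hypothetical sketch: typed self-attention coefficients (GAT-style).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 16
W = {t: nn.Linear(dim, dim, bias=False)              # per-type projections
     for t in ("author", "paper", "venue")}
a = torch.randn(2 * dim)                             # self-attention vector

def attn(h_i, type_i, h_j, type_j):
    """Unnormalized attention of source j on target i."""
    z = torch.cat([W[type_i](h_i), W[type_j](h_j)])
    return F.leaky_relu(a @ z)

def alpha(h_i, type_i, nbrs):
    """Softmax-normalized coefficients over i's typed neighbors."""
    logits = torch.stack([attn(h_i, type_i, h_j, t) for h_j, t in nbrs])
    return torch.softmax(logits, dim=0)              # aggregation weights
```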
Slide 24: Step 2: linking based on Heterogeneous Graph Attention Networks (HGAT)
Encoder layers (cont.):
- Multi-head attention.
- Two graph attention layers in the encoder.
Decoder layers:
- Fuse the embeddings of candidate pairs (concatenation and element-wise multiplication), and use fully-connected layers to produce the final matching score, as sketched below.
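A minimal sketch of the decoder, assuming the encoder yields one embedding per candidate author; the layer sizes are assumptions:

```python
# Hypothetical sketch: fuse candidate-pair embeddings, then an MLP scorer.
import torch
import torch.nn as nn

class PairDecoder(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.mlp = nn.Sequential(           # fully-connected layers
            nn.Linear(3 * dim, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, h_a, h_b):
        # concatenation + element-wise multiplication, as on the slide
        fused = torch.cat([h_a, h_b, h_a * h_b], dim=-1)
        return self.mlp(fused)              # final matching score
```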
Slide 25: Author linking model — heterogeneous graph attention
[Figure: heterogeneous subgraph for a candidate author pair]
- Attention coefficients, with different attention parameters for different entity types.
Slide 26: Experiment setup
Datasets:

          Venue    Paper   Author
  Train     841   26,936   15,000
  Test      361    9,234    5,000

Baselines:
- Rule-based method: Keyword
- Traditional ML methods: SVM & Dedupe
- SOTA author linking models:
  - COSNET: based on a factor graph model
  - MEgo2Vec: based on graph neural networks
Slide 27: Experimental results
[Figures: linking results of the CNN-based (paper) and LSTM-based (venue) methods]
Slide 28: Model variants of paper linking
Table 2: Paper linking performance.
Table 3: Running time of different methods for paper linking (in seconds).
100x prediction speed-up.
Slide 29: OAG: Open Academic Graph
https://www.openacademic.ai/oag/
Slide 30: Applications
- Data integration
- Graph mining: collaboration and citation
- Text mining: titles and abstracts
- Science of science
- ...
Citation Network Dataset: https://www.aminer.cn/citation
Slide 31: Thank You
Code: https://github.com/zfjsail/OAG
Data: https://www.openacademic.ai/oag/