Slide 1
Beyond Non-textual Linkage:
Designing GNN for Text-rich Graphs
Yanbang Wang, Jul 27, 2020 at UIUC DMG
Collaborative work with Carl Yang, Pan Li, and Prof. Jiawei Han
Slide 2: Text-rich Graphs
Usually come with two things:
- Node attributes: raw text (or bag-of-words / TF-IDF features); extra numerical features may or may not be present
- Pregiven linkage information among entities: citation, etc.
Examples:
- Paper-paper: Cora, Citeseer, Pubmed, arXiv, DBLP
- Webpage-webpage: WebKB
- Person-person: Facebook
[Figure: two paper nodes connected by a "cite" edge]
Slide 3: Task
Prediction over text-rich graphs:
- Node classification (theme categorization)
- Link prediction (citation prediction)
A prototype for many important applications.
Slide 4: Opportunities & Challenges from Text
Structural information beyond the citation network:
- The main advantage of GNNs (collective classification) is the utilization of linkage between target entities.
- Almost all previous work simply follows the pregiven linkage information, or at most makes modifications primarily based on it.
- However, in text-rich networks, the linkage information should not be confined to what is explicitly given. For example, the relationship between two papers is not just "citation": text attributes endorse much richer interrelationships beyond the pregiven linkage.
Slide 5: Two Types of Linkage
Non-textual linkage: usually pregiven and clean
- citation
- co-authorship
- same publication venue
Textual linkage: latent, complex, but very rich
- topic clusters (e.g., different research areas)
- (dis)similarities among the topic clusters (e.g., "Network analysis", "Graph mining", "Security")
- subtle semantic relationships (e.g., "machine learning", "deep learning", "CNN")
[Figure: two paper nodes connected by an unknown ("???") edge]
Slide 6: Previous Work
How to 1) model and 2) utilize these latent textual linkages is seriously underexplored:
- Previous work on text-rich graphs focuses on using the text independently: deep models like Bi-LSTMs vectorize each node's textual tags in isolation, or the text attribute is treated as a generic feature vector.
- Previous work on document classification cannot cooperatively use textual and non-textual linkage, and fails to capture the complexity of the textual linkages (e.g., simple keyword matching in Text-GNN [AAAI'19], or k-nearest neighbors in paper2vec).
Slide 7: Proposed Method, Involving Two Phases
Basic idea: make the best use of the latent relationships among topics and word semantics.
To model and utilize these latent textual linkages:
- Phase 1: graph construction (textual + non-textual -> heterogeneous graph)
- Phase 2: a specialized GNN to model the interaction
Whether or not the two phases are combined results in two variants.
Slide 8: Phased Version
Slide 9: Phase 1: Heterogeneous Graph Construction
[Figure: heterogeneous graph with three node types (Doc, Topic, Term) and three edge types: Doc <has mixture of> Topic, Topic <distributes over> Term, and Doc <cite> Doc (the non-textual citation linkage). Node attributes are derived from each node's text via a topic model and an embedding lookup. The training setup (input graph, supervision with node/edge labels, and the loss) is annotated in the figure.]
Slide 10: Phase 2: Neural Propagation on the Heterogeneous Graph
Step 1: Encode the different edge types and edge weights.
Step 2: Project all edges and node attributes into a unified feature space.
Step 3: Propagate the features (propagation formula shown in the original slide), where σ is the sigmoid, g is the softplus, and the edge encoding captures edge types and weights.
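The three steps above can be sketched as a single message-passing layer. The actual propagation formula is not recoverable from the slide, so the gating scheme below (a sigmoid gate over the edge-type encoding, softplus in the update) is only a guess at the general shape, with all names hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeteroPropLayer(nn.Module):
    """One propagation step over a heterogeneous graph (illustrative sketch).

    Step 1: an embedding table encodes each edge type (scaled by edge weight);
    Step 2: per-node-type linear maps project attributes to a shared space;
    Step 3: gated neighbor aggregation with sigmoid (sigma) and softplus (g).
    """
    def __init__(self, in_dims, hidden, n_edge_types):
        super().__init__()
        self.proj = nn.ModuleDict({t: nn.Linear(d, hidden)
                                   for t, d in in_dims.items()})
        self.edge_emb = nn.Embedding(n_edge_types, hidden)   # Step 1
        self.update = nn.Linear(hidden, hidden)

    def forward(self, feats, edges):
        # feats: {node_type: (num_nodes, dim)} tensors; edges: list of
        # (src_type, src_idx, dst_type, dst_idx, edge_type, weight) tuples.
        h = {t: self.proj[t](x) for t, x in feats.items()}   # Step 2
        agg = {t: torch.zeros_like(x) for t, x in h.items()}
        for st, si, dt, di, et, w in edges:                  # Step 3
            gate = torch.sigmoid(self.edge_emb(torch.tensor(et)))
            agg[dt][di] = agg[dt][di] + w * gate * h[st][si]
        return {t: F.softplus(self.update(agg[t]) + h[t]) for t in h}
```

In practice the per-edge Python loop would be replaced by batched scatter operations, but the loop keeps the correspondence to the three steps explicit.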
Slide 11: Joint Version
Slide 12: What's Captured and What's Missed
Pros:
- Much richer latent textual relationships
- Interaction between textual and non-textual linkage
- Phased framework: clean and transparent
Problem: the two phases are completely independent:
- The topic-clustering information is hard-coded by the pretrained topic model and remains frozen throughout the subsequent GNN training.
- Graph construction receives no benefit from the supervision signals in the second phase.
- The GNN training process has to accept whatever the topic model yields in the first phase.
[Figure: the Doc/Topic/Term heterogeneous graph from Phase 1, with its <has mixture of>, <distributes over>, and <cite> edges.]
Slide 13: Making Graph Construction Trainable
[Figure: the Doc/Topic/Term heterogeneous graph, with the topic model now part of the trainable pipeline. Input: node text, node attributes, and the pregiven non-textual linkage; Supervision: node/edge labels plus the node text itself (as a doc-term matrix); Loss equations annotated in the figure.]
Slide 14: Integrating Supervision from the Text
Input:
- Doc node features
- Topic node features
- Term node features
Supervision:
- Doc-term matrix, extracted from the raw text
Loss: (equations shown in the original slide), where a learnable diagonal matrix parametrizes the topic weights, and scalar coefficients balance the loss terms.
Techniques: 1) enforce sparsity; 2) Gumbel-softmax to learn the distributions.
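A minimal sketch of such a joint objective, assuming: cross-entropy for the classification part, reconstruction of the doc-term matrix through the doc-topic and topic-term distributions, a learnable diagonal matrix weighting the topics, and Gumbel-softmax relaxing the topic-term distribution. The slide's actual symbols and equations are not recoverable from this extraction, so every name below is hypothetical:

```python
import torch
import torch.nn.functional as F

def joint_loss(doc_h, topic_h, term_h, labels, doc_term, topic_scale,
               clf_head, alpha=1.0, beta=0.1, tau=0.5):
    """Illustrative joint objective: classification + doc-term reconstruction.

    topic_scale is a learnable 1-D tensor forming the diagonal topic-weight
    matrix; alpha and beta are the scalar loss coefficients; tau is the
    Gumbel-softmax temperature.
    """
    # Supervised classification loss on document nodes.
    clf = F.cross_entropy(clf_head(doc_h), labels)

    # Reconstruct the doc-term matrix: doc->topic, diag scale, topic->term.
    doc_topic = torch.softmax(doc_h @ topic_h.t(), dim=-1)
    topic_term = F.gumbel_softmax(topic_h @ term_h.t(), tau=tau)
    recon = doc_topic @ torch.diag(topic_scale) @ topic_term
    rec = F.mse_loss(recon, doc_term)

    return alpha * clf + beta * rec
```

Because the reconstruction term is differentiable through the topic nodes, the supervision signal can now reshape the graph construction, which is exactly what the phased version could not do.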
Slide 15: Experiments

Method                   Cora    Citeseer  Pubmed
GCN                      87.63   77.28     87.17
GAT                      87.71   76.21     86.92
GraphSAGE                86.82   75.19     84.74
Text-rich GNN (phased)   88.92   78.56     88.08
Text-rich GNN (joint)    (on-going)
Slide 16: Conclusion
We propose to model and leverage the latent textual relationships in text-rich graphs.
Slide 17: Project Update
- Worked on several technical details of the GNN architecture
- Experimented with the 20NewsGroup dataset
- Set up systematic experiments
Slide 18: Review
[Figure: the Doc/Topic/Term heterogeneous graph with the trainable topic model, as in Slide 13. Input: node text, node attributes, and the pregiven links; Supervision: node/edge labels plus the doc-term matrix from the node text; Loss equations annotated in the figure.]
Slide 19: 20NewsGroup Dataset
Document type: news reports, 20 news categories; no pregiven link information.
Documents: 18,846; vocab size: 42,757; average length: 221.3

Method                          Accuracy
LSTM                            65.71
Bi-LSTM                         73.18
PTE                             76.74
CNN                             82.15
Text-GCN (doc-word, word-word)  86.34
Our Method                      83~84
Slide 20: Experiment Setup
- Prediction tasks
- Ablation & comparison study
- Effect of the topic model's parameters (#Topics, #Terms)
- Analysis of the learned attention & topic models
Slide 21: Prediction Tasks
Text-rich graphs with pregiven but weak link data, plus document classification datasets without any links:

Name          Node Meaning                Pregiven Link  Classification Target
CORA_ML       ML papers                   Citation       7 ML areas
Hep-th        High-energy physics papers  Citation       4 high-energy physics areas
WebKB         Webpages of top univs.      Hyperlink      5 types of target readers (previous SOTA 0.6)
20NewsGroup   News reports                -              20 news categories
MovieReviews  Movie reviews               -              2 (positive/negative)
Reuters       News reports                -              8 news categories
Slide 22: Baselines
- GNN-based methods: GCN, GAT
- Random-walk based methods: paper2vec, TADW
- Text-network based methods: Text-GCN, PTE (Predictive Text Embedding)
Slide 23: Ablation & Comparison Study
Remove different types of links in our network:
- Pregiven
- Doc-Topic
- Topic-Term
Initialize doc node attributes with different feature extractors for text:
- TF-IDF (default)
- GloVe vectors (mean-pooled)
- Bi-LSTM
- Text CNN
- (BERT)
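The first two feature extractors in that comparison can be sketched as follows; `emb` stands in for a pretrained GloVe lookup table, and both helper names are hypothetical:

```python
# Sketch of two doc-node feature initializers from the comparison study:
# TF-IDF (the default) and mean-pooled pretrained word vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_features(docs, max_features=5000):
    """Default initializer: one TF-IDF row vector per document."""
    return TfidfVectorizer(max_features=max_features).fit_transform(docs).toarray()

def mean_pooled_features(docs, emb, dim):
    """emb: dict word -> np.ndarray(dim), e.g. loaded GloVe vectors.

    Documents with no in-vocabulary words keep a zero vector.
    """
    out = np.zeros((len(docs), dim))
    for i, doc in enumerate(docs):
        vecs = [emb[w] for w in doc.lower().split() if w in emb]
        if vecs:
            out[i] = np.mean(vecs, axis=0)
    return out
```

The Bi-LSTM / Text CNN / BERT variants would replace these with learned sequence encoders, at higher cost.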
Slide 24: Further Analysis
- Robustness to the topic model setting: what happens if we use different numbers of topic and term nodes?
- Analysis of the learned attention & topic models: can we visualize the learned attention and check how our GNN model ultimately learns the text relationships?
Slide 25: Experiment Update
Slide 26: Dataset Overview

Name         Node Meaning                Pregiven Link?  #Target  Raw Text?  #docs, #edges, #vocab
CORA_ML      ML papers                   Citation        7        ✓          2708, 5278, N/A
Citeseer     ML papers                   Citation        6        x          3327, 4552, N/A
Pubmed       Biomed papers               Citation        3        x          19717, 44324, N/A
Hep-th       High-energy physics papers  Citation        4        ✓          11752, 134956, 21614
Wikipedia    Wikipedia webpages          Hyperlink       19       x          2405, 17981, N/A
20NewsGroup  News reports                None            20       ✓          18846, N/A, 42757
Reuters      News reports                None            8        ✓          7647, N/A, 7688

Notes:
- When we remove the pregiven links from text-rich graphs, we get plain document collections (the last two datasets).
- Not all datasets we use come with raw text; some only have TF-IDF / word-frequency features.
- Our method works well without raw text and/or without pregiven links, while almost all baselines require at least one of them.
- We claim our major contribution on text-rich graphs (the first five datasets).
Slide 27: Main Performance Table: Text-rich Graph Datasets
Row-wise comparison:
- Our method uniformly and significantly outperforms the state-of-the-art baselines on all these popular text-rich graph datasets.
- LDA does a better job than MLP.
- GNN-based methods also generally show very competitive results, but there is limited difference within this line of work.
- Random-walk based methods rely on matrix factorization, which is generally a linear process for relational data and lacks expressive power (even CANE is limited in this way).
Slide 28: Main Performance Table: Text-rich Graph Datasets (cont.)
Row-wise comparison:
- Text-GCN is the strongest baseline; the idea of introducing additional text relations is rather game-changing.
- Our most significant gain is achieved on the most difficult dataset, Hep-th.
Slide 29: Ablation Study
Method: remove different components of our framework and check how the performance is affected.
Goal: validate the usefulness of each building component.
Slide 30: Analysis
- Ab. 1 vs. the rest: the importance of using various types of text relationships
- Ab. 0 vs. Ab. 2: pregiven links are usually helpful to some extent, though relatively limited on Hep-th
- Ab. 0 vs. Ab. 3: usefulness of doc_word links (direct channels between documents and words)
- Ab. 0 vs. Ab. 4: our model trains word embeddings that better suit the downstream classification task
- Ab. 0 vs. Ab. 5, and Ab. 6 vs. Ab. 7: the existence of topic nodes is highly important in most cases, no matter how many word nodes are used (one ablation setting becomes Text-GCN)
Note: our method includes {doc_doc, doc_topic, topic_word, doc_word} links.
Slide 31: Analysis (cont.)
- Ab. 0 vs. Ab. 6: when topic nodes are NOT present, using the full vocabulary without PMI links is a very bad choice.
- Ab. 5 vs. Ab. 7: when topic nodes are present, using the full vocabulary without PMI links does not have a consistent effect.
- Ab. 7 vs. Ab. 8: word_word PMI links are crucial to the success of Text-GCN; however, they also require the full vocabulary to be used as word nodes, which leads to an implicit tradeoff.
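For reference, the word-word PMI links that Text-GCN relies on come from sliding-window co-occurrence counts. A minimal sketch (the window size and positive-PMI cutoff follow Text-GCN's usual recipe, but the helper name is ours):

```python
# Sketch of Text-GCN-style word-word PMI edges: count word and word-pair
# occurrences over sliding windows, keep pairs with positive PMI.
import math
from collections import Counter
from itertools import combinations

def pmi_edges(docs, window=5):
    """Return {(w1, w2): pmi} for word pairs with positive PMI."""
    word_cnt, pair_cnt, n_windows = Counter(), Counter(), 0
    for doc in docs:
        toks = doc.lower().split()
        for i in range(max(1, len(toks) - window + 1)):
            win = set(toks[i:i + window])
            n_windows += 1
            word_cnt.update(win)
            pair_cnt.update(frozenset(p) for p in combinations(sorted(win), 2))
    edges = {}
    for pair, c in pair_cnt.items():
        w1, w2 = sorted(pair)
        # PMI = log( p(w1, w2) / (p(w1) * p(w2)) ) over windows.
        pmi = math.log(c * n_windows / (word_cnt[w1] * word_cnt[w2]))
        if pmi > 0:
            edges[(w1, w2)] = pmi
    return edges
```

Note the tradeoff called out above: these edges only pay off when the full vocabulary is kept as word nodes, which is exactly what our topic nodes let us avoid.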
Slide 32: Other Interesting Findings
- The optimal number of topic nodes is usually 1 to 1.5 times the number of classification categories.
- The optimal #words/topic usually ranges from 20 to 100, typically accounting for less than 5% of the entire vocabulary.
- Using the full vocabulary leads to significant overfitting.
Slide 33: On-going Experiments
- Robustness to hyperparameters
- Initialization with different feature extractors
- End-to-end training framework
- Case study of learned topic and word embeddings