Predictively Modeling Social Text


Presentation Transcript

Predictively Modeling Social Text
William W. Cohen
Machine Learning Dept. and Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Joint work with: Amr Ahmed, Andrew Arnold, Ramnath Balasubramanyan, Frank Lin, Matt Hurst (MSFT), Ramesh Nallapati, Noah Smith, Eric Xing, Tae Yano

Document modeling with Latent Dirichlet Allocation (LDA)
(Plate diagram: z, w, θ, φ over N positions and M documents, with hyperparameter α.)
For each document d = 1, …, M:
    Generate θ_d ~ Dir(·|α)
    For each position n = 1, …, N_d:
        Generate z_n ~ Mult(·|θ_d)
        Generate w_n ~ Mult(·|φ_{z_n})
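To make the notation concrete, here is a minimal simulation of this generative process in Python/numpy; the corpus sizes, document lengths, and hyperparameter values are illustrative assumptions, not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, M = 10, 5000, 100        # topics, vocabulary size, documents (assumed sizes)
    alpha, beta = 0.1, 0.01        # symmetric Dirichlet hyperparameters (assumed)

    phi = rng.dirichlet(np.full(V, beta), size=K)     # one word distribution per topic

    docs = []
    for d in range(M):
        theta_d = rng.dirichlet(np.full(K, alpha))    # theta_d ~ Dir(alpha)
        N_d = rng.poisson(200)                        # document length (not part of LDA proper)
        z = rng.choice(K, size=N_d, p=theta_d)        # z_n ~ Mult(theta_d)
        w = np.array([rng.choice(V, p=phi[k]) for k in z])   # w_n ~ Mult(phi_{z_n})
        docs.append(w)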

Hyperlink modeling using LinkLDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]
For each document d = 1, …, M:
    Generate θ_d ~ Dir(·|α)
    For each position n = 1, …, N_d:
        Generate z_n ~ Mult(·|θ_d)
        Generate w_n ~ Mult(·|φ_{z_n})
    For each citation j = 1, …, L_d:
        Generate z_j ~ Mult(·|θ_d)
        Generate c_j ~ Mult(·|γ_{z_j})
Learning using variational EM
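LinkLDA only adds a second observation type to LDA: citations drawn, per topic, from a distribution γ over citable documents. A standalone sketch of just the citation side, under the same illustrative assumptions as the LDA sketch above:

    import numpy as np

    rng = np.random.default_rng(0)
    K, C = 10, 2000                                   # topics; citable documents (assumed)
    gamma = rng.dirichlet(np.full(C, 0.01), size=K)   # per-topic distributions over cited docs
    theta_d = rng.dirichlet(np.full(K, 0.1))          # the document's topic mixture, as in LDA

    L_d = rng.poisson(10)                             # number of citations in document d
    z_link = rng.choice(K, size=L_d, p=theta_d)       # z_j ~ Mult(.|theta_d): same mixture as the words
    c = np.array([rng.choice(C, p=gamma[k]) for k in z_link])   # c_j ~ Mult(.|gamma_{z_j})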

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
For each author a = 1, …, A:
    Generate θ_a ~ Dir(·|α)
For each topic k = 1, …, K:
    Generate φ_k ~ Dir(·|β)
For each document d = 1, …, M:
    For each position n = 1, …, N_d:
        Generate author x ~ Unif(·|a_d)
        Generate z_n ~ Mult(·|θ_x)
        Generate w_n ~ Mult(·|φ_{z_n})

Labeled LDA [Ramage, Hall, Nallapati, Manning, EMNLP 2009]

Labeled LDA: Del.icio.us tags as labels for documents

Labeled LDA

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI '05]

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI '05]
"SNA" = Jensen-Shannon divergence for recipients of messages

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
Copycat model of citation influence: c is a cited document; s is a coin toss to mix γ and ψ ("plagiarism" vs. "innovation").


Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
Citation influence graph for the LDA paper

Modeling Citation Influences

Modeling Citation Influences
User study: self-reported citation influence on a Likert scale. LDA-post is Prob(cited doc | paper); LDA-js is the Jensen-Shannon distance in topic space.

Models of hypertext for blogs, scientific literature [ICWSM 2008, KDD 2008]
Joint work with Ramesh Nallapati, Amr Ahmed, and Eric Xing

Link-PLSA-LDA:
- LinkLDA model for citing documents
- Variant of PLSA model for cited documents
- Topics are shared between citing and cited documents
- Links depend on the topics in the two documents

Stochastic Block models: assume (1) nodes within a block z, and (2) edges between blocks z_p, z_q are exchangeable.
Gibbs sampling:
    Randomly initialize z_p for each node p.
    For t = 1, …:
        For each node p:
            Compute Pr(z_p | the other z's)
            Sample z_p
See: Snijders & Nowicki, 1997, "Estimation and Prediction for Stochastic Blockmodels for Graphs with Latent Block Structure".
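A schematic of that Gibbs loop in Python, for the single-membership case; to keep it short, the block-pair edge probabilities B and the block prior π are treated as known (a full sampler would also resample or collapse them), and the conditional is recomputed naively rather than via cached counts:

    import numpy as np

    def gibbs_sbm(A, B, pi, iters=50, rng=None):
        """Gibbs sampling of block labels z for a binary adjacency matrix A,
        with block-pair edge probabilities B (K x K) and block prior pi (K,)
        treated as known for brevity."""
        rng = rng or np.random.default_rng(0)
        N, K = A.shape[0], len(pi)
        z = rng.choice(K, size=N, p=pi)               # randomly initialize z_p
        for _ in range(iters):                        # for t = 1 ...
            for p in range(N):                        # for each node p
                mask = np.arange(N) != p
                logp = np.log(pi)
                for k in range(K):                    # compute Pr(z_p = k | other z's)
                    q = B[k, z[mask]]                 # edge prob from block k to each other node's block
                    logp = logp.copy()
                    logp[k] += np.sum(A[p, mask] * np.log(q)
                                      + (1 - A[p, mask]) * np.log(1 - q))
                w = np.exp(logp - logp.max())
                z[p] = rng.choice(K, p=w / w.sum())   # sample z_p
        return z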

Mixed Membership Stochastic Block models [Airoldi et al., JMLR 2008]
(Plate diagram: per-node mixed-membership vectors θ_p; per-edge block indicators z_{p→q}, z_{p←q}; edges a_pq over the N² node pairs.)

Pairwise Link-LDA
(Plate diagram: LDA word generation for each document, plus per-document-pair topic indicators z, z′ generating the link indicator c.)

Pairwise Link-LDA supports new inferences… …but doesn’t perform better on link prediction

Want to predict linkage based on similarity of topic distributions.
- Using z's rather than θ's: in Gibbs sampling, the z's are more accessible than the θ's.
- Only observed links are modeled, but higher link probabilities are penalized.
- The component-wise product of the expected topic vectors is used as the feature for a logistic regression function.
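A sketch of that predictor (function names are mine): expected topic vectors are estimated from sampled z's, and the component-wise product of a pair's vectors is the feature given to a logistic regression:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def topic_hist(z_samples, K):
        """Expected topic vector for a document: the empirical distribution of its sampled z's."""
        return np.bincount(z_samples, minlength=K) / len(z_samples)

    def pair_feature(zbar_d, zbar_e):
        """Component-wise product of two documents' expected topic vectors."""
        return zbar_d * zbar_e

    # Hypothetical usage: X has one row per document pair, y is the link indicator.
    # X = np.stack([pair_feature(zbar[d], zbar[e]) for d, e in candidate_pairs])
    # clf = LogisticRegression().fit(X, y)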

Experiments
- Three hypertext corpora: WebKB, PNAS, Cora
- Each about 50-100k words, 1-3k documents, 1.5-5k links
- Measure perplexity in predicting links from words, and words from links
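For reference, perplexity here is the exponential of the negative mean held-out log-likelihood; a generic helper, with the per-item log-probabilities supplied by whichever model is under test:

    import numpy as np

    def perplexity(log_probs):
        """Perplexity of held-out items: exp of the negative mean log-probability.
        log_probs holds one model log-probability per held-out word or link."""
        return float(np.exp(-np.mean(log_probs)))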

Link prediction

Word prediction


Predicting Response to Political Blog Posts with Topic Models [NAACL '09]
With Tae Yano and Noah Smith

Political blogs and comments
- Comment style is casual, creative, less carefully edited
- Posts are often coupled with comment sections

Political blogs and comments
- Most of the text associated with large "A-list" community blogs is comments: 5-20x as many words in comments as in post text for the 5 sites considered in Yano et al.
- A large part of socially-created commentary in the blogosphere is comments, not blog→blog hyperlinks.
- Comments do not just echo the post.

Modeling political blogs
Our political blog model: CommentLDA
D = # of documents; N = # of words in post; M = # of words in comments
z, z′ = topic; w = word (in post); w′ = word (in comments); u = user

Modeling political blogs
Our proposed political blog model: CommentLDA. The LHS of the plate diagram is vanilla LDA.
D = # of documents; N = # of words in post; M = # of words in comments

Modeling political blogs
Our proposed political blog model: CommentLDA. The RHS captures the generation of the reaction separately from the post body: two separate sets of word distributions, but the two "chambers" share the same topic mixture.
D = # of documents; N = # of words in post; M = # of words in comments

Modeling political blogs
Our proposed political blog model: CommentLDA. User IDs of the commenters are treated as part of the comment text, and topics generate the words in the comment section.
D = # of documents; N = # of words in post; M = # of words in comments
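A simulation sketch of this CommentLDA generative story (sizes and hyperparameters are illustrative): the post side and the comment side share one topic mixture θ_d but use separate word distributions, and each comment position also emits a user ID:

    import numpy as np

    rng = np.random.default_rng(1)
    K, V, U = 15, 5000, 300            # topics, vocab size, number of users (assumed)
    phi  = rng.dirichlet(np.full(V, 0.01), size=K)   # post-side word distributions
    phi2 = rng.dirichlet(np.full(V, 0.01), size=K)   # comment-side word distributions
    psi  = rng.dirichlet(np.full(U, 0.01), size=K)   # per-topic distributions over users

    theta_d = rng.dirichlet(np.full(K, 0.1))         # shared topic mixture for document d

    # post body: N positions
    z  = rng.choice(K, size=200, p=theta_d)
    w  = [rng.choice(V, p=phi[k]) for k in z]

    # comment section: M positions, each generating a user ID and a comment word
    z2 = rng.choice(K, size=400, p=theta_d)          # same theta_d: the two "chambers" share it
    u  = [rng.choice(U, p=psi[k]) for k in z2]
    w2 = [rng.choice(V, p=phi2[k]) for k in z2]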

Modeling political blogs
Another model we tried: take the words out of the comment section, giving a model agnostic to the comment words. This model is structurally equivalent to the LinkLDA of Erosheva et al. (2004).
D = # of documents; N = # of words in post; M = # of words in comments

Topic discovery on the Matthew Yglesias (MY) site (example topics shown over three slides; tables not reproduced)

Joint Modeling of Entity-Entity Links and Entity-Annotated Text
Ramnath Balasubramanyan, William W. Cohen [ICML WS 2010, SDM 2011]
Language Technologies Institute and Machine Learning Department, School of Computer Science, Carnegie Mellon University

Motivation: Toward Re-usable "Topic Models"
- LDA inspired many similar "topic models".
- "Topic models" = generative models of selected properties of data (e.g., LDA: word co-occurrence in a corpus; sLDA: word co-occurrence and document labels; …; RelLDA, Pairwise LinkLDA: words and links in hypertext; …).
- LDA-like models are surprisingly hard to build: conceptually modular, but nontrivial to implement. High-level toolkits like HBC, BLOG, … have had limited success.
- An alternative: general-purpose families of models that can be reconfigured and re-tasked for different purposes; somewhere between a modeling language (like HBC) and a task-specific LDA-like topic model.

Motivation: Toward Re-usable "Topic" Models
Examples of re-use of LDA-like topic models: the LinkLDA model, proposed to model text and citations in publications (Erosheva et al., 2004). (Plate diagram: topics z generate words and citations.)

Motivation: Toward Re-usable "Topic" Models
LinkLDA, re-used to model commenting behavior on blogs (Yano et al., NAACL 2009). (Plate diagram: topics z generate words and commenter user IDs.)

Motivation: Toward Re-usable "Topic" Models
LinkLDA, re-used to model selectional restrictions for information extraction (Ritter et al., ACL 2010). (Plate diagram: topics z generate subjects and objects.)

Motivation: Toward Re-usable "Topic" Models
LinkLDA, extended and re-used to model multiple types of annotations (e.g., authors, algorithms) and numeric annotations (e.g., timestamps, as in TOT). [Our current work]

Motivation: Toward Re-usable "Topic" Models
LinkLDA was proposed for text and citations, then re-used for blog comments and for selectional restrictions in IE. What kinds of models are easy to re-use?

Motivation: Toward Re-usable "Topic" Models
What kinds of models are easy to reuse? What makes re-use possible? What syntactic shape does information often take?
- (Annotated) text: i.e., collections of documents, each containing a bag of words and (one or more) bags of typed entities. Simplest case: one entity type → entity-annotated text. Complex case: many entity types, time-stamps, …
- Relations: i.e., k-tuples of typed entities. Simplest case: k=2 → entity-entity links. Complex case: a relational DB.
- Combinations of relations and annotated text are also common.
Research goal: jointly model the information in annotated text + a set of relations. This talk: one binary relation and one corpus of text annotated with one entity type, with a joint model of both.
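As a concrete (hypothetical) rendering of that shape, annotated text can be held as a bag of words plus per-type bags of entities, and a binary relation as a list of entity pairs:

    from dataclasses import dataclass, field
    from collections import Counter

    @dataclass
    class AnnotatedDoc:
        words: Counter                                # bag of words
        entities: dict[str, Counter] = field(default_factory=dict)   # entity type -> bag of entities

    # one corpus of annotated text ...
    doc = AnnotatedDoc(
        words=Counter("the vacuolar protein sorting pathway".split()),
        entities={"protein": Counter(["VPS45", "PEP12"])},
    )
    # ... plus one binary relation: entity-entity links (illustrative pairs)
    links = [("VPS45", "PEP12"), ("VAC1", "PEP12")]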

Test problem: Protein-protein interactions in yeast
Using known interactions between 844 proteins, curated by the Munich Info Center for Protein Sequences (MIPS). Studied by Airoldi et al. in their 2008 JMLR paper (on mixed membership stochastic block models).
(Figure: interaction matrix, index of protein 1 vs. index of protein 2; filled cells mean p1, p2 do interact; sorted after clustering.)

Test problem: Protein-protein interactions in yeast
Using known interactions between 844 proteins from MIPS, … and 16k paper abstracts from SGD, annotated with the proteins that the papers refer to (all papers about these 844 proteins).
English text: "Vac1p coordinates Rab and phosphatidylinositol 3-kinase signaling in Vps45p-dependent vesicle docking/fusion at the endosome. The vacuolar protein sorting (VPS) pathway of Saccharomyces cerevisiae mediates transport of vacuolar protein precursors from the late Golgi to the lysosome-like vacuole. Sorting of some vacuolar proteins occurs via a prevacuolar endosomal compartment, and mutations in a subset of VPS genes (the class D VPS genes) interfere with the Golgi-to-endosome transport step. Several of the encoded proteins, including Pep12p/Vps6p (an endosomal target (t) SNARE) and Vps45p (a Sec1p homologue), bind each other directly [1]. Another of these proteins, Vac1p/Pep7p/Vps19p, associates with Pep12p and binds phosphatidylinositol 3-phosphate (PI(3)P), the product of the Vps34 phosphatidylinositol 3-kinase (PI 3-kinase) …"
Protein annotations: EP7, VPS45, VPS34, PEP12, VPS21, …

Aside: Is there information about protein interactions in the text?
(Figure panels: MIPS interactions vs. thresholded text co-occurrence counts.)

Question: How to model this?
(Same annotated abstract as above: English text plus protein annotations.)
Answer: a generic, configurable version of LinkLDA.

Question: How to model this?
(Same annotated abstract as above.)
Instantiation: LinkLDA, with topics generating both words and protein annotations. (Plate diagram: z → word on one side, z → prot on the other.)

Question: How to model this? (Figure: interaction matrix, index of protein 1 vs. index of protein 2; filled cells mean p1, p2 do interact.)
MMSBM of Airoldi et al.:
    Draw K² Bernoulli distributions
    Draw a θ_i for each protein
    For each entry (i,j) in the matrix:
        Draw z_{i→} from θ_i
        Draw z_{←j} from θ_j
        Draw m_ij from the Bernoulli associated with the pair of z's

Question: How to model this?
Sparse block model of Parkkinen et al., 2009:
    Draw K² multinomial distributions β
    For each row in the link relation:
        Draw (z_L, z_R) from π  (these pairs define the "blocks" we prefer)
        Draw a protein i from the left multinomial associated with the pair
        Draw a protein j from the right multinomial associated with the pair
        Add (i,j) to the link relation
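A simulation sketch of this sparse block model (the number of links and the hyperparameters are illustrative; P = 844 follows the slides):

    import numpy as np

    rng = np.random.default_rng(2)
    K, P = 15, 844                    # blocks, proteins
    pi     = rng.dirichlet(np.full(K * K, 0.1))        # distribution over block pairs
    beta_L = rng.dirichlet(np.full(P, 0.01), size=K)   # left multinomials, one per block
    beta_R = rng.dirichlet(np.full(P, 0.01), size=K)   # right multinomials

    links = []
    for _ in range(2000):                              # rows of the link relation
        pair = rng.choice(K * K, p=pi)                 # (z_L, z_R) ~ pi
        zL, zR = divmod(pair, K)
        i = rng.choice(P, p=beta_L[zL])                # left protein
        j = rng.choice(P, p=beta_R[zR])                # right protein
        links.append((i, j))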

Gibbs sampler for sparse block model
Sampling the class pair for a link: proportional to (the probability of the class pair in the link corpus) × (the probability of the two entities in their respective classes).
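A hedged sketch of that update with collapsed multinomials; the count-array names and smoothing parameters are mine, and the link being resampled is assumed to have already been decremented from the counts:

    import numpy as np

    def sample_link_class_pair(i, j, pair_counts, left_counts, right_counts,
                               alpha, gamma, P, rng):
        """Collapsed Gibbs update for one link (i, j).
        pair_counts:  (K, K) counts of class pairs over the link corpus
        left_counts:  (K, P) counts of entities drawn from each left class
        right_counts: (K, P) counts of entities drawn from each right class
        """
        K = pair_counts.shape[0]
        # P(z = (p,q) | rest) ~ (n_pq + alpha) * P(i | left class p) * P(j | right class q)
        p_pair = pair_counts + alpha
        p_i = (left_counts[:, i] + gamma) / (left_counts.sum(axis=1) + P * gamma)
        p_j = (right_counts[:, j] + gamma) / (right_counts.sum(axis=1) + P * gamma)
        probs = p_pair * p_i[:, None] * p_j[None, :]
        probs /= probs.sum()
        flat = rng.choice(K * K, p=probs.ravel())
        return divmod(flat, K)        # the sampled (z_L, z_R)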

BlockLDA: jointly modeling blocks and text
Entity distributions are shared between "blocks" and "topics".

Recovering the interaction matrix
(Figure panels: MIPS interactions; Sparse Block model; Block-LDA.)

Varying the Amount of Training Data

(Figure panels: 1/3 of links + all text for training, 2/3 of links for testing; 1/3 of text + all links for training, 2/3 of docs for testing.)

Another Performance Test
- Goal: predict "functional categories" of proteins. 15 categories at the top level (e.g., metabolism, cellular communication, cell fate, …). Proteins have 2.1 categories on average.
- Method for predicting categories: run with 15 topics; using held-out labeled data, associate each topic with the closest category; if a category has n true members, pick the top n proteins by probability of membership in the associated topic.
- Metrics: F1, precision, recall.
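A sketch of that evaluation protocol (all names hypothetical): map each topic to its closest category using held-out labels, then for a category with n true members take the top-n proteins under the mapped topic and score precision, recall, and F1:

    import numpy as np

    def evaluate_categories(topic_protein, true_cats, topic_to_cat):
        """topic_protein: (15, P) matrix of Pr(protein | topic).
        true_cats: list of sets of protein indices, one set per category.
        topic_to_cat: dict topic index -> category index (learned on held-out labels)."""
        tp = fp = fn = 0
        for t, c in topic_to_cat.items():
            n = len(true_cats[c])
            pred = set(np.argsort(-topic_protein[t])[:n])   # top-n proteins for this topic
            tp += len(pred & true_cats[c])
            fp += len(pred - true_cats[c])
            fn += len(true_cats[c] - pred)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec  = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        return f1, prec, rec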

Performance

Enron Email Corpus
- 96,103 emails in "sent" folders
- Entities in the header are the "annotations"
- 200,404 links (sender-recipient)

Other Related Work
- Link PLSA LDA: Nallapati et al., 2008 - models linked documents
- Nubbi: Chang et al., 2009 - discovers relations between entities in text
- Topic Link LDA: Liu et al., 2009 - discovers communities of authors from text corpora

Other related work

Conclusions
- Hypothesis: relations + annotated text are a common syntactic representation of data, so joint models for this kind of data should be useful.
- BlockLDA is an effective model for this sort of data.
- Results on yeast protein-protein interaction data: improvements in block modeling when entity-annotated text about the entities involved is added, and improvements in entity perplexity given text when relational data about the entities involved is added.