1. Introduction

The vast use of social media to share facts and opinions about entities, such as companies, brands and public figures, has generated the opportunity (and the necessity) of managing the online reputation of those entities. Online Reputation Management consists of monitoring (and handling) the opinion of Internet users (also referred to as electronic word of mouth, eWOM) on people, companies and products, and is already a fundamental tool in corporate communication [76, 20, 64, 32].

Over the last years, a wide variety of tools have been developed that facilitate the monitoring task of online reputation managers (1). The process followed by these tools typically consists of three tasks:

1. Retrieval of potential mentions. The user gives as input a set of keywords (e.g. the company name, products, the CEO's name...), and the service uses these keywords to retrieve documents and content generated by users from different sources: broadcast news sources, social media, blogs, site reviews, etc.

2. Analysis of results. Retrieved documents are automatically processed in order to extract information relevant to the user: sentiment, authority and influence, background topics, etc.

3. Results visualization. Analyzed data is presented to the user in different ways: ranking documents, drawing graphics, generating tag clouds, etc.

A major problem concerning the first task (retrieval of potential mentions) is that brand names are often ambiguous. For instance, the query "Ford" retrieves information about the motor company, but might also retrieve results about Ford Models (the modeling agency), Tom Ford (the film director), etc. One might think that the query is too general and that the user should provide a more specific query, such as "ford motor" or "ford cars". In fact, some tools explicitly suggest that the user refine possibly ambiguous queries (2). This approach has two main disadvantages: (i) users have to make an additional effort when defining unambiguous queries, and (ii) query refinement harms recall over the mentions of the brand on the Web, which can be particularly misleading in an online reputation management scenario.

Note that filtering out the mentions that do not refer to the monitored entity is also crucial when estimating its visibility. Quantifying the number of mentions on the Web about an entity, and how this number changes over time, is essential to track marketing or Public Relations campaigns. When the entity name is ambiguous, indicators given by tools such as Google Trends (3) or Topsy (4) can be misleading. We think that a component capable of filtering out mentions that do not refer to the entity being monitored (specified by the user as a keyword plus a representative URL) [...]

(1) At the time of writing this paper, some popular reputation management tools are trackur (http://www.trackur.com/), BrandsEye (http://www.brandseye.com/), Alterian SM2 (http://socialmedia.alterian.com/) or SocialMention (http://socialmention.com), among others.
(2) Alterian SM2 service: http://www.sdlsm2.com/social-listening-for-business/industry/.
(3) http://www.google.com/trends
(4) http://www.topsy.com

[...] polarized words and then picks the class with the greatest probability in a winner-takes-all scenario. In this work we use a similar approach, in the sense that we test, among others, the winner-takes-all strategy in order to classify all the tweets for a company name either as related or unrelated.

Zhao et al. [90] propose an automatic keyphrase extraction model to summarize topics in Twitter. Firstly, they identify topics using Twitter-LDA [89], which assumes a single topic assignment for each tweet. Secondly, they use a variation of Topical PageRank [48] where edges between two words are weighted taking into account co-occurrences in tweets assigned to the same topic. Then, they generate phrase candidates by looking for combinations of the top topic keywords that co-occur as frequent phrases in the text collection. Finally, topical keyphrases are ranked using a probabilistic model that takes into account (i) the specificity of a phrase given a topic and (ii) the retweet ratio of tweets containing a keyphrase.
To the best of our knowledge, the Online Reputation Management Task at WePS-3, held at CLEF 2010, was the first campaign of NLP-centered tasks over Twitter. TREC 2011 and TREC 2012 held a track about Twitter: the TREC Microblog Track (8). Here, the problem addressed is a real-time search task (Realtime Adhoc Task): given a query at a specific time, systems should return a list of both recent and relevant tweets [67].

2.1.2. Twitter Datasets

There are several Twitter datasets that are suitable for a variety of research purposes. A corpus of 900,000 tweets has been provided by the Content Analysis in Web 2.0 Workshop (CAW 2.0) (9) to tackle the problem of text normalization on user generated contents. Yang & Leskovec [80] use a dataset of 580 million tweets to identify temporal patterns over the content published in the tweets, and Kwak et al. [46] use a representative sample of 467 million tweets from 20 million users covering a 7-month period from June 1, 2009 to December 31, 2009 to study the information diffusion and the topological characteristics of Twitter (10). Cha et al. [14] built a dataset comprising 54,981,152 users, connected to each other by 1,963,263,821 social links and including a total of 1,755,925,520 tweets, to analyze users' influence and how to measure it on the Twittersphere.

Finally, the Tweets2011 corpus is the dataset that will be used in the TREC Microblog Track. The corpus is a representative sample of the Twittersphere that also includes spam tweets. It consists of approximately 16 million tweets over a period of 2 weeks (24th January 2011 until 8th February, inclusive), which covers both the time period of the Egyptian revolution and the US Superbowl, among others.

To the best of our knowledge, the WePS-3 Task 2 test collection (described in Section 2.4) is the first dataset specifically built to address the problem of disambiguation of organization names in tweets. Recently, the RepLab evaluation campaign [4], held in CLEF 2012, has addressed the same problem but in a multilingual scenario: the RepLab dataset contains tweets written in English and Spanish.
(8) https://sites.google.com/site/microblogtrack
(9) http://caw2.barcelonamedia.org/node/41
(10) These datasets are no longer available, as per request from Twitter.

[...] entity). This result shows that Wikipedia has high coverage as a catalog of senses for tweet disambiguation. Differently from [24], Meij et al. [54] use supervised machine learning techniques to refine a list of candidate Wikipedia concepts that are potentially relevant to a given tweet. The candidate ranking list is generated by matching n-grams in the tweet with anchor texts in Wikipedia articles, taking into account the inter-Wikipedia link structure to compute the most probable Wikipedia concept for each n-gram. Michelson & Macskassy [55] focus on discovering the topics of interest of a particular Twitter user. Given the stream of tweets corresponding to the user, they firstly find the Wikipedia page of the entities mentioned in the tweets and secondly they build a topic profile from the high-level categories that cover these entities. Entity disambiguation is performed by calculating the overlap between the terms in the tweet and the terms in the page of each candidate entity. In this scenario, the accuracy of the disambiguation process is not critical, since the system takes into account the categories that occur frequently across all the entities found in order to produce a topic profile of a stream.

2.3. Automatic Keyphrase Extraction

We include here an overview of Automatic Keyphrase Extraction techniques, as this technique plays a crucial role in our approach to company name disambiguation on Twitter. Automatic Keyphrase Extraction is the task of identifying a set of relevant terms or phrases that summarize and characterize one or more given documents [77]. Most of the literature about automatic keyphrase extraction focuses on (well-written) technical documents, such as scientific and medical articles, since the keywords given by the authors can be used as a gold standard [74, 77, 25, 59, 58, 43]. Some authors address automatic keyword extraction as a way of automatically summarizing web sites [86, 87, 88, 85]. In [85], different keyword extraction methods are compared, including tf*idf, supervised methods, and heuristics based on both statistical and linguistic features of candidate terms. While Zhang et al. [85] study automatic keyword extraction from web site descriptions, in our work we explore a semi-supervised keyword extraction approach that extracts filter keywords over a stream of tweets.

Automatic keyword extraction is typically used to characterize the content of one or more documents, using features intrinsically associated with those documents. In order to detect both positive and negative filter keywords, however, we need to look into external resources to discriminate between related and unrelated keywords. Thus, automatic keyword extraction methods are not directly applicable to our filter keyword approach.

2.4. The WePS-3 Online Reputation Management Task

Disambiguation of company names in text streams (and in particular in microblog posts) is a necessary step in the monitoring of opinions about a company. However, it is not tackled explicitly in most research on the subject. Rather, most previous work assumes that query terms are not ambiguous in the retrieval process. The disambiguation task has been explicitly addressed in the WePS-3 evaluation campaign [3]. In this section we summarize the outcome of that campaign, analyzing the test collection and comparing system results.

[...] and the company homepage. The UvA system [72] does not employ any resource related to the company, but uses features that involve the use of language in the collection of tweets (URLs, hashtags, capital characters, punctuation, etc.). Finally, the KALMAR system [41] builds an initial model based on the terms extracted from the homepage to label a seed of tweets, and then uses them in a bootstrapping process, computing the point-wise mutual information between the word and the target's label.
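As an illustration of this bootstrapping idea, the point-wise mutual information between a term and a class label can be computed directly from a seed of labeled tweets. The sketch below is only a minimal illustration of that statistic (it is not the KALMAR implementation, and all names in it are ours):

import math

def pmi_term_label(tweets, labels, term, label, eps=1e-12):
    """Point-wise mutual information between a term and a label over a seed of
    labeled tweets: PMI(term, label) = log(P(term, label) / (P(term) * P(label))).
    `tweets` is a list of token lists and `labels` the parallel list of labels
    (e.g. "related" / "unrelated"). Illustrative sketch only."""
    n = len(tweets)
    p_term = sum(1 for toks in tweets if term in toks) / n
    p_label = sum(1 for lab in labels if lab == label) / n
    p_joint = sum(1 for toks, lab in zip(tweets, labels)
                  if lab == label and term in toks) / n
    return math.log((p_joint + eps) / (p_term * p_label + eps))

# Toy usage: terms with high PMI towards the "related" label are candidate
# positive filter keywords for the next bootstrapping round.
seed = [["apple", "ipad", "store"], ["apple", "pie", "recipe"]]
seed_labels = ["related", "unrelated"]
print(pmi_term_label(seed, seed_labels, "ipad", "related"))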
In the WePS exercise, accuracy (the ratio of correctly classified tweets) was used to rank systems. The best overall system (LSIR) obtained 0.83, but including manually produced filter keywords. The best automatic system (ITC-UT) reaches an accuracy of 0.75 (0.5 being the accuracy of a random classification), and includes a query classification step in order to predict the ratio of positive/negative tweets.

2.4.3. Other work using WePS-3 datasets

Recently, Yerva et al. [82] have explored the impact of extending the company profiles presented in [81] by using the related ratio and considering new tweets retrieved from the Twitter stream. By estimating the degree of ambiguity per entity from a subset of 50 tweets per entity, they reach 0.73 accuracy (14). Using this related ratio and considering co-occurrences with a given company profile, the original profile is extended with new terms extracted from tweets retrieved by querying the company name in Twitter. The expanded profile outperforms the original, achieving 0.81 accuracy. Zhang et al. [84] present a two-stage disambiguation system that combines supervised and semi-supervised methods. Firstly, they train a generic classifier using the training set, similarly to [81]. Then, they use this classifier to annotate a seed of tweets in the test set, using the Label Propagation algorithm to annotate the remaining tweets in the test set. Using Naive Bayes and Label Propagation they achieve a 0.75 accuracy, which matches the performance of the best automatic system in WePS-3.

The RepLab evaluation campaign held at CLEF 2012 addressed, as a subtask, the same filtering problem introduced in WePS-3. The main difference is that while the WePS-3 dataset contains only tweets written in English, the RepLab collection [4] contains tweets both in English and Spanish. Finally, the WePS-3 ORM task dataset has been extended with manual annotations for the task of entity profiling in microblog posts, which includes identifying entity aspects and opinion targets [69].

2.5. Wrap Up

There are two main findings on entity name disambiguation that motivate our research:

1. Use of filter keywords. Artiles et al. [8] studied the impact of query refinement in the Web People Search clustering task and concluded that "although in most occasions there is an expression that can be used as a near-perfect refinement to retrieve all and only those documents referring to an individual, the nature of these ideal refinements is unpredictable and very unlikely to be hypothesized by the user" [8].

(14) Note that, unlike the original formulation of the WePS-3 task, this is a supervised system, as it uses part of the test set for training. Hence, their results cannot be directly compared with the results in our work.

This is both a positive indication (there are keywords able to isolate relevant information well) and a suggestion that finding optimal keywords automatically might be a challenging task. Another piece of evidence in favor of using filter keywords is that the best results on the WePS-3 ORM Task were achieved by a system that used a set of manually produced (both positive and negative) filter keywords [81]. One of the goals of our work is to analyze query refinement in the scenario of Company Name Disambiguation on Twitter. In other words, we explore the impact of defining a set of keywords to filter both related and unrelated tweets for a given company.

2. Use of knowledge bases to represent candidate entities. Both entity linking and document enrichment systems use knowledge bases in order to characterize the possible entities that a mention may refer to. Most systems use Wikipedia as their knowledge base [57, 13, 19, 27, 60, 45, 24, 39, 21]. In this direction, we believe that looking at Wikipedia pages related to the company to disambiguate could give useful information to characterize positive filter keywords. We also explore the use of other resources such as the Open Directory Project (ODP), the company web site and the Web in general. Note, however, that not every company is listed in ODP or has an entry in Wikipedia.
[...] a manual selection of positive and negative keywords for all the companies in the WePS-3 corpus. Note that the annotator inspected pages in the web search results, but did not have access to the tweets in the corpus. Tables 1 and 2 show some examples of positive and negative keywords, respectively. Note that in the set of oracle keywords there are expressions that a human would hardly choose to describe a company (at least, without previously analyzing the tweet stream). For instance, the best positive oracle keywords for the Fox Entertainment Group do not include intuitive keywords such as tv or broadcast; instead, they include keywords closer to breaking news (leader, denouncing, etc.).

Company name | Oracle Positive Keywords | Manual Positive Keywords
amazon | sale, books, deal, deals, gift | electronics, apparel, books, computers, buy
apple | gizmodo, ipod, microsoft, itunes, auction | store, ipad, mac, iphone, computer
fox | money, weather, leader, denouncing, viewers | tv, broadcast, shows, episodes, fringe, bones
kiss | fair, rock, concert, allesegretti, stanley | tour, discography, lyrics, band, rockers, makeup design

Table 1: Differences between oracle and manual positive keywords for some of the company names in the test collection.

Looking at negative keywords (Table 2), we can find occasional oracle keywords that are closely related to the vocabulary used in microblogging services, such as followdaibosyu, nowplaying or wanna, while intuitive manual keywords like wildlife for jaguar are unlikely to occur in the Twitter collection.

Company name | Oracle Negative Keywords | Manual Negative Keywords
amazon | followdaibosyu, pest, plug, brothers, pirotta | river, rainforest, deforestation, bolivian, brazilian
apple | juice, pie, fruit, tea, fiona | fruit, diet, crunch, pie, recipe
fox | megan, matthew, lazy, valley, michael | animal, terrier, hunting, Volkswagen, racing
kiss | hug, nowplaying, lips, day, wanna | french, MegKevin, bangbang, RyanKline

Table 2: Differences between oracle and manual negative keywords for some of the company names in the test collection.

Remarkably, manual keywords extracted from the Web (around 10 per company) only reach 15% coverage of the tweets (compared with 40% coverage using 10 oracle keywords extracted from the tweet stream), with an accuracy of 0.86 (which is lower than expected for manually selected filter keywords). This seems an indication that the vocabulary and topics of microblogging are different from those found in the Web. Our experiments in Section 4 corroborate this finding.

Figure 2: Fingerprints for the bootstrapping classification strategy when applying (a) manual keywords or (b) 20 oracle keywords.

In order to better understand the results, Figure 2 shows the fingerprint representation [68] for manual keywords and 20 oracle keywords. This visualization technique consists of displaying the accuracy of the system (vertical axis) for each company/test case (dots) versus the ratio of related (positive) tweets for the company (horizontal axis). The three basic baselines (all true, all negative and random) are represented as three fixed lines: y = x, y = 1 - x and y = 0.5, respectively. The fingerprint visualization method helps in understanding and comparing systems' behavior, especially when class skews vary over different test cases.

Using 20 oracle keywords (see Figure 2b), the obtained average accuracy is 0.87. The fingerprint shows that the improvement resides in cases with a related ratio around 0.5, i.e. the cases where it is more likely to have enough training samples for both classes. Manual keywords, on the other hand, lead to annotations that tend to stick to the "all related" or "all unrelated" baselines, which indicates that they tend to describe only one class, and then the learning process is biased.

In summary, our results validate the idea that filter keywords can be a powerful tool in our filtering task, but also suggest that they will not be easy to find: descriptive web sources that can be attributed to the company do not lead to the keywords that are most useful or accurate in the Twitter domain.
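The fingerprint representation described above is straightforward to reproduce. The following plotting sketch assumes that, for each company/test case, the accuracy of the system and the ratio of related tweets are already known; all data in the example is made up, and the three baseline lines follow the definition given in the text.

import numpy as np
import matplotlib.pyplot as plt

def fingerprint(related_ratio, accuracy):
    """Fingerprint plot: one dot per company/test case, accuracy (y axis) vs.
    ratio of related tweets (x axis), plus the three fixed baselines."""
    x = np.linspace(0.0, 1.0, 100)
    plt.plot(x, x, "--", label="all related (y = x)")
    plt.plot(x, 1 - x, "--", label="all unrelated (y = 1 - x)")
    plt.plot(x, np.full_like(x, 0.5), "--", label="random (y = 0.5)")
    plt.scatter(related_ratio, accuracy, color="black", label="test cases")
    plt.xlabel("ratio of related tweets")
    plt.ylabel("accuracy")
    plt.legend(loc="lower center")
    plt.show()

# Toy data (hypothetical values, not the paper's results):
fingerprint(related_ratio=[0.1, 0.5, 0.9], accuracy=[0.88, 0.62, 0.91])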
[...] in the whole corpus, document frequency in the set of tweets for the company, how many times the term occurs as a Twitter hashtag, and the average distance between the term and the company name in the tweets.

(a) col_c_df: normalized document frequency in the collection of tweets for the company, T_c:

    col_c_df = df_t(T_c) / |T_c|    (1)

(b) col_c_specificity: ratio of the document frequency in the tweets for the company T_c over the document frequency in the whole corpus T:

    col_c_specificity = df_t(T_c) / df_t(T)    (2)

(c) col_hashtag: number of occurrences of the term as a hashtag (e.g. #jobs, #football) in T_c.

(d) col_c_prox_avg, col_c_prox_sd, col_c_prox_median: mean, standard deviation and median of the distance (number of terms) between the term and the company name in the tweets.

2. Web-based features (web_*): these features are computed from information about the term in the Web at large (approximated by search counts), the web site of the company, Wikipedia and the Open Directory Project (ODP). (16)

(a) web_c_assoc: intuitively, a term which is close to the company name has more chances of being a keyword (either positive or negative) than more generic terms. This feature represents the association, according to the search counts, between the term t and a company name c:

    web_c_assoc = (df_web(t OR c) / df_web(c)) / (df_web(t) / M)    (3)

(b) web_c_ngd: the Normalized Google Distance [16] (applied to the Yahoo! search engine), which is a measure of the semantic distance between two terms based on search counts. For a term t and a company name c, the Normalized Google Distance is given by (4):

    web_c_ngd = (max(log f(c), log f(t)) - log f(t AND c)) / (M - min(log f(t), log f(c)))    (4)

    where f(x) = df_web(x).

(c) web_dom_df: frequent terms in the company web site should be meaningful to characterize positive keywords. web_dom_df is the normalized document frequency of the term in the web site of the company:

    web_dom_df = df_web(t AND site:domain_c) / df_web(site:domain_c)    (5)

(16) Some companies in the WePS-3 collection have a Wikipedia page as reference page instead of the company web site. In these cases, the feature web_dom_df (which is also the numerator in the feature web_dom_assoc) is computed as the presence of the term t in the Wikipedia page. Also, the query used to get the values of the features web_odp_occ and web_wiki_occ is the title of the Wikipedia page.

(d) web_dom_assoc: degree of association of the term with the web site in comparison with the use of the term in the Web. This feature is analogous to web_c_assoc, using the web site domain instead of the company name c:

    web_dom_assoc = (df_web(t AND site:domain_c) / df_web(site:domain_c)) / (df_web(t) / M)    (6)

(e) web_odp_occ: number of occurrences of the term in all the items in dmoz(domain_c). Each item is composed of a URL, a title, a description and the ODP category to which it belongs.

(f) web_wiki_occ: number of occurrences of the term in the first 100 results in wikipedia(domain_c). In order to filter pages returned by the API that could be unrelated to the company, only pages that contain the string domain_c are considered.

3. Features expanded with co-occurrence: in order to avoid false zeros in web-based features, we expand some of the previous term features with the value obtained by the five most co-occurring terms. Given a feature f, a new feature is computed as the Euclidean norm (7) of the vector with components f(t_i) * w(t, t_i) for the five terms that most co-occur with t in the set of tweets T_c (8), where f(t_i) is the web-based feature value f for the term t_i and w(t, t_i) is the grade of co-occurrence of each term (9):

    cooc_agg(t, f) = sqrt( sum over t_i in cooc_t of (f(t_i) * w(t, t_i))^2 )    (7)
    cooc_t = set of the five terms which most co-occur with t    (8)
    w(t, t_i) = |co-occurrences_{T_c}(t, t_i)| / |T_c|    (9)
    f(t_i) = value of the feature f for the term t_i

This formula is applied to web_c_assoc, web_c_ngd, web_dom_df, web_dom_assoc, web_odp_occ and web_wiki_occ, resulting in the features enumerated below:

(a) cooc_c_assoc = cooc_agg(t, web_c_assoc)
(b) cooc_c_ngd = cooc_agg(t, web_c_ngd)
(c) cooc_dom_df = cooc_agg(t, web_dom_df)
(d) cooc_dom_assoc = cooc_agg(t, web_dom_assoc)
(e) cooc_odp_occ = cooc_agg(t, web_odp_occ)
(f) cooc_wiki_occ = cooc_agg(t, web_wiki_occ)
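To make the collection-based features and the co-occurrence expansion of equations (7)-(9) concrete, the sketch below computes col_c_df, col_c_specificity and cooc_agg from tokenized tweets. It is a simplified illustration under our own assumptions (the web_* features are omitted because they require search-count queries against an external API), and all function names are ours:

import math
from collections import Counter

def collection_features(term, company_tweets, all_tweets):
    """col_c_df and col_c_specificity, as in equations (1) and (2).
    Both tweet collections are lists of token lists."""
    df_company = sum(1 for toks in company_tweets if term in toks)
    df_corpus = sum(1 for toks in all_tweets if term in toks)
    col_c_df = df_company / len(company_tweets)
    col_c_specificity = df_company / df_corpus if df_corpus else 0.0
    return col_c_df, col_c_specificity

def cooc_agg(term, feature_values, company_tweets, k=5):
    """Co-occurrence expansion of a web-based feature, equations (7)-(9):
    Euclidean norm of f(t_i) * w(term, t_i) over the k terms that most
    co-occur with `term`. `feature_values` maps a term t_i to its feature
    value f(t_i)."""
    n = len(company_tweets)
    cooc = Counter()
    for toks in company_tweets:
        if term in toks:
            for other in set(toks) - {term}:
                cooc[other] += 1
    return math.sqrt(sum((feature_values.get(t_i, 0.0) * (count / n)) ** 2
                         for t_i, count in cooc.most_common(k)))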
rststepforthefeatureanalysisistodevelopagoldstandardsetofpositiveandnegativekeywords.Inordertogetsucienttrainingdataandtodealwithpossiblemiss-annotationsinthecorpus,wesetaprecisionof0.85ofaterminarelated/unrelatedsetoftweetsasafeasiblethresholdtoannotateatermasakeyword.Thosetermswithprecisionlowerthan0.85inbothclassesarelabeledasskipterms(10):18 (a)col c df (b)col c speci city (c)col hashtag (d)col c prox median (e)col c prox avg (f)col c prox sd (g)web odp occ (h)web wiki occ (i)cooc odp occ (j)cooc wiki occ (k)web c assoc (l)web c ngd (m)web dom df (n)web dom assoc (o)cooc c assoc (p)cooc c ngd (q)cooc dom df (r)cooc dom assocFigure3:Box-plotsrepresentingthedistributionofeachofthefeaturesinthepositive,negativeandskipclasses.ThebottomandtopoftheboxaretheQ1andQ3quartiles,respectively,andthebandnearthemiddleistheQ2quartile{i.e.,themedian.Thewhiskersextendtothemostextremedatapoint(1.5timesthelengthoftheboxawayfromthebox:�1:5IQRand1:5IQR,whereIQR=jQ3�Q1j).20 Theseplotshelpvisualizingtherangeofvaluesforeachfeature,aswellaswheremostofthevalueslie,allowingforaqualitativeanalysisofthefeatures.Wecanseethatfeaturescol hashtag(Fig.3c),web odp occ(Fig.3g)andcooc odp occ(Fig.3i)arenotinformative,becausealmostalloftheirvaluesarezero.Therearelessthan1%ofthetermsinthetestsetthatoccuratleastonetimeashashtag.Also,lessthan1%aretermsthatappearindescriptionsandtitlesofODPsearchresults.Featuresdescribingterm-companydistanceseemtocapturedi erencesbetweenkeywordandskipterms:bothnegativeandpositivekeywords,generallyoccurclosertothecompanynamethanskipterms.Whilepositiveandnegativekeywordssharesimilarmedianandstandarddeviation(Figs.3dand3f)ofproximitytothecompanyname,averagedistanceforpositivekeywordsisslightlysmallerthanfornegativekeywords(Fig.3e).Featurescol c df,col c speci city,web c assoc,web c ngdandtheirexpanded(byco-ocurrence)versionscooc c assocandcooc c ngdwerede nedtodiscriminate lterkeywordsfromskipterms.Themostdiscriminativefeatureseemstobecol c speci city(Fig.3b).Ontheotherhand,featuresweb wiki occ,web dom df,web dom assoc,cooc dom df,cooc dom assocandcooc wiki occweredesignedtodistinguishbetweenpositiveandnegative lterkey-words.Ata rstglance,positiveandnegativekeywordshavedi erentdistributionsinallthefeatures.Skipterms,ontheotherhand,tendtohavedistributionssimilartothoseofpositivekeywords.Thefeaturescooc dom assoc(Fig.3r)andcooc wiki occ(Fig.3j)seemtobethebesttodiscriminatepositivekeywordsfromnegativeandskipterms.Remarkably,featuresexpandedbyco-occurrenceseemtobemoreinformativethantheoriginalfeatures,whichtendtoconcentrateonlowvalues(themedianisnearzero).Whenexpandingtheoriginalvaluesbyco-occurrence,positivetermsreceivehighervaluesmoreconsistently.Inordertoquantitativelyevaluatethequalityoffeatures,wecomputetheMann-WhitneyUtest[51],whichisanon-parametrictestusedinstatisticalfeatureselectionwhenanormaldistributionofthefeaturescannotbeassumed.Thep-valuecouldbeusedtorankthefeatures,sincethesmallerthevalueofthep-value,themoreinformativethefeatureis[31].21 atonlythebesttwofeaturesaccordingtotheMann-WhitneyUtest: rst,wede neathresholdtoremoveskiptermsaccordingtothespeci cityw.r.t.thecollectionofthetweetsforthecompany(col c speci cityfeature).Thenwestate,forthefeaturethatmeasuresassociationwiththewebsite(cooc dom assocfeature)alowerboundtocapturepositive lterkeywordsandanupperboundtocapturenegative 
[...] at only the best two features according to the Mann-Whitney U test: first, we define a threshold to remove skip terms according to the specificity with respect to the collection of tweets for the company (the col_c_specificity feature). Then, for the feature that measures association with the web site (the cooc_dom_assoc feature), we set a lower bound to capture positive filter keywords and an upper bound to capture negative filter keywords. These three thresholds have been manually optimized using the training dataset. Finally, we also explored a third option: we apply machine learning using only the best two features instead of the whole feature set. We will refer to this method hereafter as "machine learning - 2 features".

We have experimented with several machine learning methods using RapidMiner [56]: Neural Nets, C4.5 and CART Decision Trees, Linear Support Vector Machines (SVM) and Naive Bayes. All methods have been used with "out-of-the-box" parameters. All the terms labeled over the WePS-3 training dataset were used to train the models. In the same way, terms extracted from the test dataset were used as the test set. Table 6 shows the values of the Area Under the ROC Curve (AUC) of each of the binary classifiers evaluated. AUC is an appropriate metric to measure the quality of binary classification models independently of the confidence threshold [23]. We analyzed three different subsets of features to represent the terms: (i) using all but the six features expanded by co-occurrence, (ii) using only the best two features (those used by the heuristic and machine learning - 2 features classifiers), and (iii) using all the features.

machine learning algorithm | not expanded by co-occurrence (pos / neg) | 2 best features (pos / neg) | all features (pos / neg)
Neural Net | 0.68 / 0.67 | 0.73 / 0.72 | 0.75 / 0.73
CART Dec. Trees | 0.58 / 0.61 | 0.72 / 0.71 | 0.63 / 0.64
Linear SVM | 0.50 / 0.50 | 0.73 / 0.71 | 0.50 / 0.50
Naive Bayes | 0.64 / 0.64 | 0.71 / 0.71 | 0.72 / 0.72
C4.5 Dec. Trees | 0.50 / 0.61 | 0.50 / 0.50 | 0.59 / 0.66

Table 6: Area Under the ROC Curve (AUC) values of the five classification models and the three feature sets used to classify positive and negative keywords.

The results obtained are similar for all models, except for C4.5 and SVM, which in several cases do not provide any useful information for classification (AUC = 0.5). Leaving out the "expanded by co-occurrence" features, the performance is, in general, lower for all the algorithms. This corroborates the results of our previous feature analysis. In the following experiments, we focus on the Neural Net algorithm to train both (positive versus others and negative versus others) classifiers, because it is consistently the best performing algorithm according to the AUC measure.

For each of the feature combinations described at the beginning of this section (machine learning - all features, heuristic and machine learning - best 2 features), we analyze the obtained results below. The methods were trained using terms from the WePS-3 training dataset and evaluated with the WePS-3 test set. Table 7 shows the confusion matrix obtained for the machine learning - all features method. The precision for the positive and negative classes is 62% and 56%, respectively, [...]

[...] strategy described in Section 3.2 is used to complete the task. It is also interesting to compare the bootstrapping strategy with the naive winner-takes-all baseline, which directly classifies all the tweets as related or unrelated depending on which is the dominant class in the seed of tweets, and with the winner-takes-remainder strategy, which consists of applying the winner-takes-all strategy only to those tweets that were not covered by any of the filter keywords.

Figure 4: Fingerprints for each of the keyword selection strategies combined with each of the different tweet classification strategies.

Figure 4 shows the fingerprint of each of the combinations tested and Table 10 shows the results. The best automatic method, which combines machine learning - all features to discover keywords and bootstrapping with the tweets annotated using those keywords, gives an accuracy of 0.73, which is higher than using manual keywords from [...]
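A compact way to summarize the tweet classification strategies compared here (winner-takes-all, winner-takes-remainder, and bootstrapping from the keyword-annotated seed) is sketched below. This is our own simplified rendering, not the exact implementation of Section 3.2; the classifier and vectorizer used in the bootstrapping step are placeholders passed in by the caller.

def label_with_keywords(tweets, positive_kw, negative_kw):
    """Seed annotation: label each tweet by the filter keywords it contains
    (None when no keyword applies, or when both classes of keywords apply)."""
    seed = []
    for toks in tweets:
        pos = any(k in toks for k in positive_kw)
        neg = any(k in toks for k in negative_kw)
        seed.append("related" if pos and not neg
                    else "unrelated" if neg and not pos else None)
    return seed

def winner_takes_all(seed):
    """Classify every tweet as the dominant class of the keyword-labeled seed."""
    related = sum(1 for label in seed if label == "related")
    unrelated = sum(1 for label in seed if label == "unrelated")
    majority = "related" if related >= unrelated else "unrelated"
    return [majority] * len(seed)

def winner_takes_remainder(seed):
    """Keep the keyword labels and assign the majority class only to the
    tweets not covered by any filter keyword."""
    majority = winner_takes_all(seed)[0]
    return [label if label is not None else majority for label in seed]

def bootstrap(seed, texts, classifier, vectorizer):
    """Bootstrapping: train a text classifier on the keyword-labeled seed and
    use it to classify the uncovered tweets (a stand-in for Section 3.2),
    e.g. bootstrap(seed, texts, MultinomialNB(), CountVectorizer())."""
    covered = [i for i, label in enumerate(seed) if label is not None]
    rest = [i for i, label in enumerate(seed) if label is None]
    classifier.fit(vectorizer.fit_transform([texts[i] for i in covered]),
                   [seed[i] for i in covered])
    result = list(seed)
    if rest:
        predictions = classifier.predict(
            vectorizer.transform([texts[i] for i in rest]))
        for i, prediction in zip(rest, predictions):
            result[i] = prediction
    return result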
6. Discussion

6.1. How Much of the Problem is Solved?

In order to shed light on the trade-off between quality and quantity of filter keywords, here we analyze the relation between accuracy and coverage of the tweets classified by considering different sets of filter keywords. Figure 6 shows the coverage/accuracy curves for oracle, manual and automatic filter keywords.

Figure 6: Coverage/accuracy curves for oracle, manual and automatic filter keywords.

The curves were generated as follows:

Oracle keywords. At step n, we consider the nth positive/negative oracle keyword that maximizes accuracy and, in case of ties, coverage of tweets (i.e., if two keywords have the same accuracy, the one that covers more tweets is considered first).

Manual keywords. At each step n we consider the nth positive/negative manual keyword that maximizes coverage of tweets.

Machine learning - all features. The set of terms considered in this analysis are those that were classified as positive or negative by using the confidence thresholds learned by the two classifiers in the "machine learning - all features" method. Skip terms are (i) those classified as skip by both binary classifiers and (ii) those classified simultaneously as positive and negative keywords. Then, we use the maximum of the confidence scores returned by the two classifiers (i.e., max(conf(positive), conf(negative))) to sort the keywords. The keyword with the highest confidence score is added at each step. The point in the curve with the highest coverage corresponds to the classifier used in the experiments described in Section 4.

Machine learning - 2 features. This curve is generated by using the two classifiers learned in "machine learning - 2 features", similarly to the curve generated for "machine learning - all features".

Heuristic. Since this classifier consists of manually defined thresholds over the training set, it does not provide any confidence score for the test cases. Hence, in the graphic it is represented as a single point.

The curve for oracle keywords provides a statistical upper bound on how many tweets can be directly covered using filter keywords. Considering the best 100 oracle keywords for each test case/company name, it is possible to directly tag 85% of the tweets with 0.99 accuracy. On the other hand, a more realistic upper bound is given by manual keywords. Here, we can observe how the accuracy remains stable around 0.85, while the coverage grows from approximately 10% to 15%. In the best possible case, with more keywords the curve would continue as the line y = 0.85. Note that manual keywords have been annotated by inspecting representative Web pages (from Google search results) rather than by inspecting tweets. Therefore, an automatic keyword classifier cannot achieve an accuracy above 0.85. Considering this, our automatic approaches establish a strong lower bound of 0.7 accuracy. In conclusion, it seems that a filter keyword classifier should reach an accuracy between 0.7 and 0.85 to be competitive.

6.2. Comparing Systems with Different Metrics

We have seen that related/unrelated tweets are not balanced in most of the test cases in WePS-3, and the proportion does not follow a normal distribution (extreme values seem to be as plausible as values around the mean). Because of this, accuracy may not be sufficient to understand the quality of systems, which is why we have complemented it with the fingerprint representation [68]. In this section, we evaluate (and compare) results with the most popular alternative evaluation metrics found in the literature. Considering the confusion matrix given by each system, where TP = true related tweets, FP = false related tweets, TN = true unrelated tweets, and FN = false unrelated tweets, we compute the following metrics, in addition to accuracy:

Normalized Utility. Utility has been used to evaluate document filtering tasks in TREC [34, 33] and is commonly used assigning a relative weight alpha between true positives and false positives:

    u(S, T) = alpha * TP - FP

As in the TREC-8 filtering task [33], here utility is normalized by means of the following scaling function:

    u_s(S, T) = (max(u(S, T), U(s)) - U(s)) / (MaxU(T) - U(s))

[...] and

    F1 = 2 * Precision * Recall / (Precision + Recall)

Table 11 reports the results for the baselines, the WePS-3 systems and our proposed systems under the metrics described above. All metrics were macro-averaged over topics, and undefined scores were considered as zero values.
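Given the confusion matrix of a system (TP, FP, TN and FN as defined above), accuracy, F1 and the scaled utility can be computed as follows. This is a sketch of the general formulas only; the utility weight and the scaling floor U(s) are left as parameters because their exact values follow the TREC-8 filtering setting [33], and the assumption that MaxU(T) is the utility of a perfect run (alpha times the number of related tweets) is ours.

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return (2 * precision * recall / (precision + recall)
            if (precision + recall) else 0.0)

def scaled_utility(tp, fp, fn, alpha=2.0, u_floor=-100.0):
    """u(S, T) = alpha * TP - FP, rescaled as in TREC-8:
    u_s(S, T) = (max(u, U(s)) - U(s)) / (MaxU(T) - U(s)).
    alpha and u_floor (U(s)) are illustrative defaults, and MaxU(T) is
    assumed here to be the utility of a perfect run, alpha * (TP + FN)."""
    u = alpha * tp - fp
    max_u = alpha * (tp + fn)
    return (max(u, u_floor) - u_floor) / (max_u - u_floor)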
System | accuracy | utility | lam% | F1(R,S) | F1
Gold standard | 1.00▲ | 1.00▲ | 0.00▲ | 1.00▲ | 1.00▲
supervised bootstr. | 0.85▲ | 0.69▲ | 0.28▲ | 0.30▲ | 0.62▲
WePS-3: LSIR (manual) | 0.83▲ | 0.66▲ | 0.28▲ | 0.27 | 0.62▲
ml-all feat. + bootstr. | 0.73 | 0.47 | 0.37 | 0.27 | 0.49
ml-2 feat. + wtr | 0.72 | 0.49 | 0.43 | 0.16▽ | 0.49
ml-2 feat. + bootstr. | 0.72 | 0.47 | 0.43 | 0.17▽ | 0.50
heuristic + bootstr. | 0.71 | 0.45 | 0.42 | 0.11▼ | 0.46
ml-all feat. + wtr | 0.71 | 0.44 | 0.39▼ | 0.21▼ | 0.43▼
ml-2 feat. + wta | 0.70 | 0.48 | 0.50▼ | 0.00▼ | 0.39▽
ml-all feat. + wta | 0.69▽ | 0.40▽ | 0.50▼ | 0.00▼ | 0.27▼
heuristic + wtr | 0.65▼ | 0.46 | 0.42 | 0.10▼ | 0.46
heuristic + wta | 0.64▼ | 0.44 | 0.50▼ | 0.00▼ | 0.39▽
WePS-3: ITC-UT | 0.75 | 0.52 | 0.37 | 0.20 | 0.49
WePS-3: SINAI | 0.64▽ | 0.38▽ | 0.35 | 0.12▼ | 0.30▼
WePS-3: UvA | 0.58▼ | 0.22▼ | 0.46▼ | 0.17▼ | 0.36▼
WePS-3: KALMAR | 0.47▼ | 0.35▼ | 0.43 | 0.16▽ | 0.48
baseline: all unrelated | 0.57▼ | 0.20▼ | 0.50▼ | 0.00▼ | 0.00▼
baseline: random | 0.49▼ | 0.20▼ | 0.49▼ | 0.16▼ | 0.37▼
baseline: all related | 0.43▼ | 0.40 | 0.50▼ | 0.00▼ | 0.52

Table 11: Results for the proposed systems, the WePS-3 systems and the baselines compared with different evaluation metrics. Best automatic runs are in boldface (ml = machine learning, wta = winner-takes-all, wtr = winner-takes-remainder). Statistical significance w.r.t. the ml-all feat. + bootstrapping run was computed using a two-tailed Student's t-test. Significant differences are indicated using ▲ (or ▼) for alpha = 0.01 and △ (or ▽) for alpha = 0.05.

Results show that, according to Reliability & Sensitivity, our best automatic system, ml-all features + bootstrapping, achieves the same score as the WePS-3 LSIR semi-automatic system (0.27), which is the best result at the competition and involves manual processing, and outperforms the best automatic system in WePS-3 (ITC-UT = 0.20), a 35% relative improvement. In terms of lam%, the SINAI system achieves the best automatic score of 0.35, followed by ITC-UT and ml-all features + bootstrapping, which reach 0.37 lam%. Note that lam% and R&S penalize non-informative, baseline-like behaviors. Because of this, the winner-takes-all systems and the "all (un)related" baselines get the worst scores under these metrics. According to utility, ITC-UT is still the best automatic system (0.52). Our best runs are between 0.47 and 0.49, ml-2 features + bootstrapping being the best of them.

Filter keywords | Homepage | Wikipedia | Both
related oracle keywords | 36% | 68% | 33%
unrelated oracle keywords | 9% | 19% | 6%

Table 12: Percentage of the 10 best oracle keywords extracted from the tweet stream that are covered by the company's homepage, its Wikipedia article, and both.

Overall, the only substantial overlap is for positive keywords in Wikipedia, indicating that representative Web pages are not the ideal place to look for effective filter keywords for Twitter. Note that the overlap of related oracle keywords with the company's Wikipedia page roughly doubles the overlap with its homepage. The same happens with unrelated keywords: almost 20% on Wikipedia and almost 10% on the homepage. The percentage of oracle keywords that occur both in the homepage and in the Wikipedia article is similar to that of the homepage alone, indicating that Wikipedia basically extends the keywords already present in the homepage.

In summary, exploring the nature of filter keywords leads us to the conclusion that the vocabulary characterizing a company in Twitter is substantially different from the vocabulary associated with the company in its homepage, in Wikipedia, and apparently in the Web at large. These findings are in line with the "vocabulary gap" that has been shown to exist between Twitter and other Web sources such as Wikipedia or news comments [73]. One way of alleviating this problem is using co-occurrence expansion of web-based features, which allows filter keywords to be recognized automatically more reliably. While the company's Wikipedia article seems to have more coverage of (perfect) filter keywords than the company's homepage, further investigation is needed on how to automatically infer the company's Wikipedia page from its homepage URL in order to extract additional keyword features from it.
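The overlap figures in Table 12 reduce to checking how many of the best oracle keywords occur in the vocabulary of a given page. A minimal sketch of that computation, with a simple regular-expression tokenizer standing in for whatever preprocessing was actually used, and purely hypothetical inputs:

import re

def keyword_coverage(keywords, page_text):
    """Percentage of the given filter keywords that occur in the vocabulary of
    a page (e.g. the company homepage or its Wikipedia article)."""
    vocabulary = set(re.findall(r"[a-z0-9]+", page_text.lower()))
    covered = sum(1 for k in keywords if k.lower() in vocabulary)
    return 100.0 * covered / len(keywords) if keywords else 0.0

# Toy usage with made-up inputs:
oracle_positive = ["sale", "books", "deal", "deals", "gift"]
homepage_text = "Shop online for books, electronics and more. Great deals every day."
print(keyword_coverage(oracle_positive, homepage_text))  # 40.0 for this toy page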
References

[1] Agirre, E., Edmonds, P., 2006. Word sense disambiguation: Algorithms and applications. Springer.
[2] Al-Kamha, R., Embley, D., 2004. Grouping search-engine returned citations for person-name queries. In: Proceedings of the 6th annual ACM international workshop on Web information and data management. pp. 96-103.
[3] Amigo, E., Artiles, J., Gonzalo, J., Spina, D., Liu, B., Corujo, A., 2010. WePS-3 Evaluation Campaign: Overview of the Online Reputation Management Task. In: CLEF 2010 Labs and Workshops Notebook Papers.
[4] Amigo, E., Corujo, A., Gonzalo, J., Meij, E., de Rijke, M., 2012. Overview of RepLab 2012: Evaluating Online Reputation Management Systems. In: CLEF 2012 Labs and Workshop Notebook Papers.
[5] Amigo, E., Gonzalo, J., Verdejo, F., 2012. Reliability and Sensitivity: Generic Evaluation Measures for Document Organization Tasks. Tech. rep., UNED.
[6] Artiles, J., October 2009. Web people search. Ph.D. thesis, UNED University.
[7] Artiles, J., Borthwick, A., Gonzalo, J., Sekine, S., Amigo, E., 2010. WePS-3 evaluation campaign: Overview of the web people search clustering and attribute extraction tasks.
[8] Artiles, J., Gonzalo, J., Amigo, E., 2009. The impact of query refinement in the web people search task. In: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. Association for Computational Linguistics, pp. 361-364.
[9] Artiles, J., Gonzalo, J., Sekine, S., 2007. The SemEval-2007 WePS evaluation: Establishing a benchmark for the web people search task. Proceedings of SemEval.
[10] Artiles, J., Gonzalo, J., Sekine, S., 2009. WePS 2 evaluation campaign: overview of the web people search clustering task. In: 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference.
[11] Bagga, A., Baldwin, B., 1998. Entity-based cross-document coreferencing using the vector space model. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1. pp. 79-85.
[12] Boyd, D., Golder, S., Lotan, G., 2010. Tweet, tweet, retweet: Conversational aspects of retweeting on Twitter. In: Proceedings of the 2010 43rd Hawaii International Conference on System Sciences. HICSS '10. IEEE Computer Society, Washington, DC, USA, pp. 1-10.
[13] Bunescu, R., Pasca, M., 2006. Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of EACL. Vol. 6. pp. 9-16.
[14] Cha, M., Haddadi, H., Benevenuto, F., Gummadi, K. P., 2010. Measuring User Influence in Twitter: The Million Follower Fallacy. In: Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (ICWSM). Washington DC, USA.
[15] Cheng, Z., Caverlee, J., Lee, K., 2010. You are where you tweet: a content-based approach to geo-locating Twitter users. In: Proceedings of the 19th ACM international conference on Information and knowledge management. pp. 759-768.
[16] Cilibrasi, R. L., Vitanyi, P. M., 2007. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering.
[17] Comm, J., Robbins, A., 2009. Twitter power: How to dominate your market one tweet at a time. John Wiley & Sons Inc.
[18] Cormack, G., Lynam, T., 2005. TREC 2005 spam track overview. In: The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings.
[19] Cucerzan, S., 2007. Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of EMNLP-CoNLL. Vol. 2007. pp. 708-716.
[20] Dellarocas, C., Awad, N., Zhang, X., 2004. Exploring the value of online reviews to organizations: Implications for revenue forecasting and planning. In: Proceedings of the International Conference on Information Systems.
[21] Dredze, M., McNamee, P., Rao, D., Gerber, A., Finin, T., 2010. Entity disambiguation for knowledge base population. In: Proceedings of the 23rd International Conference on Computational Linguistics. pp. 277-285.
[22] Edmonds, P., Cotton, S., 2001. Senseval-2: Overview. In: Proceedings of The Second International Workshop on Evaluating Word Sense Disambiguation Systems (SENSEVAL-2). pp. 1-6.
[23] Fawcett, T., 2006. An introduction to ROC analysis. Pattern Recognition Letters 27 (8), 861-874.
[24] Ferragina, P., Scaiella, U., 2010. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In: Proceedings of the 19th ACM international conference on Information and knowledge management. pp. 1625-1628.
[50] Mann, G., Yarowsky, D., 2003. Unsupervised personal name disambiguation. In: Proceedings of HLT-NAACL '03. pp. 33-40.
[51] Mann, H., Whitney, D., 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18 (1), 50-60.
[52] Martinez-Romo, J., Araujo, L., 2009. Web people search disambiguation using language model techniques.
[53] McNamee, P., Dang, H., 2009. Overview of the TAC 2009 knowledge base population track. In: Text Analysis Conference (TAC).
[54] Meij, E., Weerkamp, W., de Rijke, M., 2012. Adding semantics to microblog posts. In: Proceedings of the fifth ACM international conference on Web search and data mining.
[55] Michelson, M., Macskassy, S., 2010. Discovering users' topics of interest on Twitter: a first look. In: Proceedings of the fourth workshop on Analytics for noisy unstructured text data. pp. 73-80.
[56] Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T., 2006. YALE: Rapid prototyping for complex data mining tasks. In: Proceedings of the 12th ACM SIGKDD.
[57] Mihalcea, R., Csomai, A., 2007. Wikify!: linking documents to encyclopedic knowledge. In: CIKM. Vol. 7. pp. 233-242.
[58] Mihalcea, R., Tarau, P., 2004. TextRank: Bringing order into texts. In: Proceedings of EMNLP. Vol. 4. pp. 404-411.
[59] Milios, E., Zhang, Y., He, B., Dong, L., 2003. Automatic term extraction and document similarity in special text corpora. In: Proceedings of the Sixth Conference of the Pacific Association for Computational Linguistics. pp. 275-284.
[60] Milne, D., Witten, I., 2008. Learning to link with Wikipedia. In: Proceedings of the 17th ACM conference on Information and knowledge management. pp. 509-518.
[61] Milstein, S., Chowdhury, A., Hochmuth, G., Lorica, B., Magoulas, R., 2008. Twitter and the micro-messaging revolution: Communication, connections, and immediacy, 140 characters at a time. O'Reilly Media, Inc.
[62] Nadeau, D., Sekine, S., 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes 30 (1), 3-26.
[63] Pak, A., Paroubek, P., 2010. Twitter as a corpus for sentiment analysis and opinion mining. Proceedings of LREC 2010.
[64] Pollach, I., 2006. Electronic word of mouth: A genre analysis of product reviews on consumer opinion web sites. In: Proceedings of the 39th Annual Hawaii International Conference on System Sciences. Vol. 3.
[65] Quinlan, J., 1993. C4.5: programs for machine learning. Morgan Kaufmann.
[66] Sakaki, T., Okazaki, M., Matsuo, Y., 2010. Earthquake shakes Twitter users: real-time event detection by social sensors. In: Proceedings of the 19th international conference on World Wide Web. WWW '10. pp. 851-860.
[67] Soboroff, I., McCullough, D., Lin, J., Macdonald, C., Ounis, I., McCreadie, R., 2012. Evaluating real-time search over tweets.
[68] Spina, D., Amigo, E., Gonzalo, J., 2011. Filter keywords and majority class strategies for company name disambiguation in Twitter. In: CLEF 2011 Conference on Multilingual and Multimodal Information Access Evaluation. Springer Berlin/Heidelberg, pp. 50-61.
[69] Spina, D., Meij, E., Oghina, A., Bui, M. T., Breuss, M., de Rijke, M., 2012. A Corpus for Entity Profiling in Microblog Posts. In: LREC Workshop on Language Engineering for Online Reputation Management.
[70] Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., Demirbas, M., 2010. Short text classification in Twitter to improve information filtering. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '10. ACM, New York, NY, USA, pp. 841-842.
[71] Strube, M., Ponzetto, S., 2006. WikiRelate! Computing semantic relatedness using Wikipedia. In: Proceedings of the National Conference on Artificial Intelligence. Vol. 21. p. 1419.
[72] Tsagkias, M., Balog, K., 2010. The University of Amsterdam at WePS3. In: CLEF 2010 Labs and Workshops Notebook Papers.
[73] Tsagkias, M., de Rijke, M., Weerkamp, W., 2011. Linking online news and social media. In: Proceedings of the fourth ACM International Conference on Web Search and Data Mining.
[74] Turney, P., 1999. Learning to extract keyphrases from text. National Research Council, Institute for Information Technology, Technical Report ERB-1057.
[75] Wan, X., Gao, J., Li, M., Ding, B., 2005. Person resolution in person search results: WebHawk. In: Proceedings of the 14th ACM international conference on Information and knowledge management. pp. 163-170.
[76] Wilson, R., 2003. Keeping a watch on corporate reputation. Strategic Communications Management 7 (2).
[77] Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., Nevill-Manning, C. G., 1999. KEA: practical automatic keyphrase extraction. In: Proceedings of the fourth ACM conference on Digital Libraries. DL '99. pp. 254-255.
[78] Wu, S., Hofman, J. M., Mason, W. A., Watts, D. J., 2011. Who says what to whom on Twitter. In: Proceedings of the 20th International Conference on World Wide Web. WWW '11. ACM, New York, NY, USA, pp. 705-714.
[79] Yang, J., Leskovec, J., 2010. Modeling information diffusion in implicit networks. In: IEEE International Conference on Data Mining. pp. 599-608.
[80] Yang, J., Leskovec, J., 2011. Patterns of temporal variation in online media. In: Proceedings of the fourth ACM international conference on Web search and data mining. WSDM '11. pp. 177-186.
[81] Yerva, S. R., Miklos, Z., Aberer, K., 2010. It was easy when apples and blackberries were only fruits. In: CLEF 2010 Labs and Workshops Notebook Papers.
[82] Yerva, S. R., Miklos, Z., Aberer, K., 2012. Entity-based classification of Twitter messages. IJCSA 9 (1), 88-115.
[83] Yoshida, M., Matsushima, S., Ono, S., Sato, I., Nakagawa, H., 2010. ITC-UT: Tweet Categorization by Query Categorization for On-line Reputation Management. In: CLEF 2010 Labs and Workshops Notebook Papers.
[84] Zhang, S., Wu, J., Zheng, D., Meng, Y., Xia, Y., Yu, H., 2012. Two stages based organization name disambiguity. Computational Linguistics and Intelligent Text Processing, 249-257.
[85] Zhang, Y., Milios, E., Zincir-Heywood, N., 2007. A comparative study on key phrase extraction methods in automatic web site summarization. Journal of Digital Information Management 5 (5), 323.
[86] Zhang, Y., Zincir-Heywood, N., Milios, E., 2004. Term-based clustering and summarization of web page collections. Advances in Artificial Intelligence, 60-74.
[87] Zhang, Y., Zincir-Heywood, N., Milios, E., 2004. World Wide Web site summarization. Web Intelligence and Agent Systems 2, 39-54.
[88] Zhang, Y., Zincir-Heywood, N., Milios, E., 2005. Narrative text classification for automatic key phrase extraction in web document corpora. In: Proceedings of the 7th annual ACM international workshop on Web information and data management. pp. 51-58.
[89] Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., Li, X., 2011. Comparing Twitter and traditional media using topic models. In: Proceedings of the 33rd European Conference on Advances in Information Retrieval.
[90] Zhao, X., Jiang, J., He, J., Song, Y., Achananuparp, P., Lim, E., Li, X., 2011. Topical keyphrase extraction from Twitter.
