1. Introduction

The vast use of social media to share facts and opinions about entities, such as companies, brands and public figures, has generated the opportunity – and the necessity – of managing the online reputation of those entities. Online Reputation Management consists of monitoring – and handling – the opinion of Internet users (also referred to as electronic word of mouth, eWOM) on people, companies and products, and is already a fundamental tool in corporate communication [76, 20, 64, 32].

Over the last years, a wide variety of tools have been developed that facilitate the monitoring task of online reputation managers¹. The process followed by these tools typically consists of three tasks:

1. Retrieval of potential mentions. The user gives as input a set of keywords (e.g. the company name, products, CEO's name...), and the service uses these keywords to retrieve documents and content generated by users from different sources: broadcast news sources, social media, blogs, site reviews, etc.

2. Analysis of results. Retrieved documents are automatically processed in order to extract information that is relevant to the user: sentiment, authority and influence, background topics, etc.

3. Results visualization. Analyzed data is presented to the user in different ways: ranking documents, drawing graphics, generating tag clouds, etc.

A major problem concerning the first task (retrieval of potential mentions) is that brand names are often ambiguous. For instance, the query "Ford" retrieves information about the motor company, but might also retrieve results about Ford Models (the modeling agency), Tom Ford (the film director), etc. One might think that the query is too general, and that the user should provide a more specific query, such as "ford motor" or "ford cars". In fact, some tools explicitly suggest that the user refine possibly ambiguous queries². This approach has two main disadvantages: (i) users have to make an additional effort when defining unambiguous queries, and (ii) query refinement harms recall over the mentions of the brand on the Web, which can be particularly misleading in an online reputation management scenario.

Note that filtering out the mentions that do not refer to the monitored entity is also crucial when estimating its visibility. Quantifying the number of mentions on the Web about an entity, and how this number changes over time, is essential to track marketing or Public Relations campaigns. When the entity name is ambiguous, indicators given by tools such as Google Trends³ or Topsy⁴ can be misleading. We think that a component capable of filtering out mentions that do not refer to the entity being monitored (specified by the user as a keyword plus a representative URL)

¹ At the time of writing this paper, some popular reputation management tools are trackur (http://www.trackur.com/), BrandsEye (http://www.brandseye.com/), Alterian SM2 (http://socialmedia.alterian.com/) or SocialMention (http://socialmention.com), among others.
² Alterian SM2 service: http://www.sdlsm2.com/social-listening-for-business/industry/.
³ http://www.google.com/trends
⁴ http://www.topsy.com

polarized words and then picks the class with the greatest probability in a winner-takes-all scenario. In this work we use a similar approach, in the sense that we test, among others, the winner-takes-all strategy in order to classify all the tweets for a company name either as related or unrelated.

Zhao et al. [90] propose an automatic keyphrase extraction model to summarize topics in Twitter. Firstly, they identify topics using Twitter-LDA [89], which assumes a single topic assignment for each tweet. Secondly, they use a variation of Topical PageRank [48] where edges between two words are weighted taking into account co-occurrences in tweets assigned to the same topic. Then, they generate phrase candidates by looking for combinations of the top topic keywords that co-occur as frequent phrases in the text collection. Finally, topical keyphrases are ranked by using a probabilistic model that takes into account (i) the specificity of a phrase given a topic and (ii) the retweet ratio of tweets containing a keyphrase.

To the best of our knowledge, the Online Reputation Management Task on WePS-3, held at CLEF 2010, was the first campaign of NLP-centered tasks over Twitter. TREC 2011 and TREC 2012 held a track about Twitter: the TREC Microblog Track⁸. Here, the problem addressed is a real-time search task (Realtime Adhoc Task): given a query at a specific time, systems should return a list of both recent and relevant tweets [67].

2.1.2. Twitter Datasets

There are several Twitter datasets that are suitable for a variety of research purposes. A corpus of 900,000 tweets has been provided by the Content Analysis in Web 2.0 Workshop (CAW 2.0)⁹, to tackle the problem of text normalization on user generated contents. Yang & Leskovec [80] use a dataset of 580 million tweets to identify temporal patterns over the content published on the tweets, and Kwak et al. [46] use a representative sample of 467 million tweets from 20 million users, covering a 7 month period from June 1, 2009 to December 31, 2009, to study the information diffusion and the topological characteristics of Twitter¹⁰. Cha et al. [14] built a dataset comprising 54,981,152 users, connected to each other by 1,963,263,821 social links and including a total of 1,755,925,520 tweets, to analyze users' influence and how to measure it on the Twittersphere.

Finally, the Tweets2011 corpus is the dataset that will be used on the TREC Microblog Track. The corpus is a representative sample of the twittersphere that also includes spam tweets. It consists of approximately 16 million tweets over a period of 2 weeks (24th January 2011 until 8th February, inclusive), which covers both the time period of the Egyptian revolution and the US Superbowl, among others.

To the best of our knowledge, the WePS-3 Task 2 test collection (described in Section 2.4) is the first dataset specifically built to address the problem of disambiguation of organization names on tweets. Recently, the RepLab evaluation campaign [4], held in CLEF 2012, has addressed the same problem but in a multilingual scenario: the RepLab dataset contains tweets written in English and Spanish.
⁸ https://sites.google.com/site/microblogtrack
⁹ http://caw2.barcelonamedia.org/node/41
¹⁰ These datasets are no longer available as per request from Twitter.

entity). This result shows that Wikipedia has a high coverage as a catalog of senses for tweet disambiguation.

Different from [24], Meij et al. [54] use supervised machine learning techniques to refine a list of candidate Wikipedia concepts that are potentially relevant to a given tweet. The candidate ranking list is generated by matching n-grams in the tweet with anchor texts in Wikipedia articles, taking into account the inter-Wikipedia link structure to compute the most probable Wikipedia concept for each n-gram.

Michelson & Macskassy [55] focus on discovering the topics of interest for a particular Twitter user. Given the stream of tweets corresponding to the user, they firstly find the Wikipedia page of the entities mentioned on tweets, and secondly they build a topic profile from the high-level categories that cover these entities. Entity disambiguation is performed by calculating the overlap between the terms on the tweet and the terms on the page of each candidate entity. In this scenario, the accuracy of the disambiguation process is not critical, since the system takes into account the categories that occur frequently across all the entities found in order to produce a topic profile of a stream.

2.3. Automatic Keyphrase Extraction

We include here an overview of Automatic Keyphrase Extraction techniques, as this is a technique that plays a crucial role in our approach to company name disambiguation on Twitter. Automatic Keyphrase Extraction is the task of identifying a set of relevant terms or phrases that summarize and characterize one or more given documents [77].

Most of the literature about automatic keyphrase extraction is focused on (well-written) technical documents, such as scientific and medical articles, since the keywords given by the authors can be used as gold standard [74, 77, 25, 59, 58, 43]. Some authors address automatic keyword extraction as a way of automatically summarizing web sites [86, 87, 88, 85]. In [85] different keyword extraction methods are compared, including tf*idf, supervised methods, and heuristics based on both statistical and linguistic features of candidate terms. While Zhang et al. [85] study automatic keyword extraction from
website descriptions, in our work we explore a semi-supervised keyword extraction approach that extracts filter keywords over a stream of tweets.

Automatic keyword extraction is typically used to characterize the content of one or more documents, using features intrinsically associated with those documents. In order to detect both positive and negative filter keywords, however, we need to look into external resources in order to discriminate between related and unrelated keywords. Thus, automatic keyword extraction methods are not directly applicable to our filter keyword approach.

2.4. The WePS-3 Online Reputation Management Task

Disambiguation of company names in text streams (and in particular in microblog posts) is a necessary step in the monitoring of opinions about a company. However, it is not tackled explicitly in most research on the subject. Rather, most previous work assumes that query terms are not ambiguous in the retrieval process. The disambiguation task has been explicitly addressed in the WePS-3 evaluation campaign [3]. In this section we summarize the outcome of that campaign, analyzing the test collection and comparing system results.

and the company homepage. The UvA system [72] does not employ any resource related to the company, but uses features that involve the use of the language in the collection of tweets (URLs, hashtags, capital characters, punctuation, etc.). Finally, the KALMAR system [41] builds an initial model based on the terms extracted from the homepage to label a seed of tweets, and then uses them in a bootstrapping process, computing the point-wise mutual information between the word and the target's label.

In the WePS exercise, accuracy (ratio of correctly classified tweets) was used to rank systems. The best overall system (LSIR) obtained 0.83, but including manually produced filter keywords. The best automatic system (ITC-UT) reaches an accuracy of 0.75 (0.5 being the accuracy of a random classification), and includes a query classification step in order to predict the ratio of positive/negative tweets.

2.4.3. Other work using WePS-3 datasets

Recently, Yerva et al. [82] have explored the impact of extending the company profiles presented in [81] by using the related ratio and considering new tweets retrieved from the Twitter stream. By estimating the degree of ambiguity per entity from a subset of 50 tweets per entity, they reach 0.73 accuracy¹⁴. Using this related ratio and considering co-occurrences with a given company profile, the original profile is extended with new terms extracted from tweets retrieved by querying the company name in Twitter. The expanded profile outperforms the original, achieving 0.81 accuracy.

Zhang et al. [84] present a two-stage disambiguation system that combines supervised and semi-supervised methods. Firstly, they train a generic classifier using the training set, similarly to [81]. Then, they use this classifier to annotate a seed of tweets in the test set, using the Label Propagation algorithm to annotate the remainder of the tweets in the test set. Using Naïve Bayes and Label Propagation they achieve a 0.75 accuracy, which matches the performance of the best automatic system in WePS-3.

The RepLab evaluation campaign held at CLEF 2012 addressed, as a subtask, the same filtering problem introduced in WePS-3. The main difference was that while the WePS-3 dataset contains only tweets written in English, the RepLab collection [4] contains tweets both in English and Spanish. Finally, the WePS-3 ORM task dataset has been extended with manual annotations for the task of entity profiling in microblog posts, which includes identifying entity aspects and opinion targets [69].

2.5. Wrap Up

There are two main findings on entity name disambiguation that motivate our research:

1. Use of filter keywords. Artiles et al. [8] studied the impact of query refinement in the Web People Search clustering task and concluded that "although in most occasions there is an expression that can be used as a near-perfect refinement to retrieve all and only those documents referring to an individual, the nature of these ideal refinements is unpredictable and very unlikely to be hypothesized by the user" [8].

¹⁴ Note that, unlike the original formulation of the WePS-3 task, this is a supervised system, as it uses part of the test set for training. Hence, their results cannot be directly compared with the results in our work.
This is both a positive indication – there are keywords able to isolate relevant information well – and a suggestion that finding optimal keywords automatically might be a challenging task. Another piece of evidence in favor of using filter keywords is that the best results on the WePS-3 ORM Task were achieved by a system that used a set of manually produced – both positive and negative – filter keywords [81]. One of the goals of our work is to analyze query refinement in the scenario of Company Name Disambiguation on Twitter. In other words, we explore the impact of defining a set of keywords to filter both related and unrelated tweets for a given company.

2. Use of knowledge bases to represent candidate entities. Both entity linking and document enrichment systems use knowledge bases in order to characterize the possible entities that a mention may refer to. Most systems use Wikipedia as knowledge base [57, 13, 19, 27, 60, 45, 24, 39, 21]. In this direction, we believe that looking at Wikipedia pages related to the company to disambiguate could give useful information to characterize positive filter keywords. We also explore the use of other resources such as the Open Directory Project (ODP), the company website and the Web in general. Note, however, that not every company is listed in ODP or has an entry in Wikipedia.

a manual selection of positive and negative keywords for all the companies in the WePS-3 corpus. Note that the annotator inspected pages in the web search results, but did not have access to the tweets in the corpus. Tables 1 and 2 show some examples of positive and negative keywords, respectively.

Note that in the set of Oracle keywords there are expressions that a human would hardly choose to describe a company (at least, without previously analyzing the tweet stream). For instance, the best positive oracle keywords for the Fox Entertainment Group do not include intuitive keywords such as tv or broadcast; instead, they include keywords closer to breaking news (leader, denouncing, etc.).
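Since oracle keywords are extracted directly from the labelled tweet stream, they can be approximated by ranking terms by their precision towards one of the classes. The following is an illustrative sketch (our own formulation, not the paper's exact procedure); the 0.85 default mirrors the precision threshold used later for the gold standard in Section 4.2:

```python
from collections import Counter

def oracle_keywords(tweets, labels, top_n=5, min_precision=0.85):
    """tweets: list of token lists; labels[i] is True if tweet i is related.
    Rank terms by precision towards the related class, then by coverage."""
    in_related, in_total = Counter(), Counter()
    for tokens, related in zip(tweets, labels):
        for term in set(tokens):     # document frequency: count once per tweet
            in_total[term] += 1
            if related:
                in_related[term] += 1
    scored = [(term, in_related[term] / df, df)
              for term, df in in_total.items()
              if in_related[term] / df >= min_precision]
    # most precise first; ties broken by coverage, then alphabetically
    scored.sort(key=lambda x: (-x[1], -x[2], x[0]))
    return [term for term, _, _ in scored[:top_n]]
```

For example, over four labelled tweets about the "ford" query, the ambiguous term "ford" itself and the unrelated term "film" are discarded, while class-specific terms like "car" surface as positive keywords.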
Company name | Oracle Positive Keywords | Manual Positive Keywords
amazon | sale, books, deal, deals, gift | electronics, apparel, books, computers, buy
apple | gizmodo, ipod, microsoft, itunes, auction | store, ipad, mac, iphone, computer
fox | money, weather, leader, denouncing, viewers | tv, broadcast, shows, episodes, fringe, bones
kiss | fair, rock, concert, allesegretti, stanley | tour, discography, lyrics, band, rockers, makeup design

Table 1: Differences between oracle and manual positive keywords for some of the company names on the test collection.

Looking at negative keywords (Table 2), we can find occasional oracle keywords that are closely related to the vocabulary used in microblogging services, such as followdaibosyu, nowplaying or wanna, while intuitive manual keywords like wildlife for jaguar are unlikely to occur in the Twitter collection.

Company name | Oracle Negative Keywords | Manual Negative Keywords
amazon | followdaibosyu, pest, plug, brothers, pirotta | river, rainforest, deforestation, bolivian, brazilian
apple | juice, pie, fruit, tea, ona | fruit, diet, crunch, pie, recipe
fox | megan, matthew, lazy, valley, michael | animal, terrier, hunting, Volkswagen, racing
kiss | hug, nowplaying, lips, day, wanna | french, MegKevin, bangbang, RyanKline

Table 2: Differences between oracle and manual negative keywords for some of the company names on the test collection.

Remarkably, manual keywords extracted from the Web (around 10 per company) only reach 15% coverage of the tweets (compare with 40% coverage using 10 oracle keywords extracted from the tweet stream), with an accuracy of 0.86 (which is lower than expected for manually selected filter keywords). This seems an indication that the vocabulary and topics of microblogging are different from those found on the Web. Our experiments in Section 4 corroborate this finding.
Figure 2: Fingerprints for the bootstrapping classification strategy when applying (a) manual keywords (left) or (b) 20 oracle keywords (right).

In order to better understand the results, Figure 2 shows the fingerprint representation [68] for manual keywords and 20 oracle keywords. This visualization technique consists of displaying the accuracy of the system (vertical axis) for each company/test case (dots) vs. the ratio of related (positive) tweets for the company (horizontal axis). The three basic baselines (all true, all negative and random) are represented as three fixed lines: y = x, y = 1 − x and y = 0.5, respectively. The fingerprint visualization method helps in understanding and comparing systems' behavior, especially when class skews are variable over different test cases.

Using 20 oracle keywords (see Figure 2b), the obtained average accuracy is 0.87. The fingerprint shows that the improvement resides in cases with a related ratio around 0.5, i.e. the cases where it is more likely to have enough training samples for both classes. Manual keywords, on the other hand, lead to annotations that tend to stick to the "all related" or "all unrelated" baselines, which indicates that they tend to describe only one class, and then the learning process is biased.

In summary, our results validate the idea that filter keywords can be a powerful tool in our filtering task, but also suggest that they will not be easy to find: descriptive web sources that can be attributed to the company do not lead to the keywords that are most useful or accurate in the Twitter domain.

in the whole corpus, document frequency in the set of tweets for the company, how many times the term occurs as a twitter hashtag, and the average distance between the term and the company name in the tweets.

(a) col_c_df: Normalized document frequency in the collection of the tweets for the company T_c:

    df_t(T_c) / |T_c|    (1)

(b) col_c_specificity: Ratio of document frequency in the tweets for the company T_c over the document frequency in the whole corpus T:

    df_t(T_c) / df_t(T)    (2)

(c) col_hashtag: Number of occurrences of the term as a hashtag (e.g. #jobs, #football) in T_c.

(d) col_c_prox_avg, col_c_prox_sd, col_c_prox_median: Mean, standard deviation and median of the distance (number of terms) between the term and the company name in the tweets.

2. Web-based features (web_*): These features are computed from information about the term in the whole Web (approximated by search counts), the website of the company, Wikipedia and the Open Directory Project (ODP).¹⁶

(a) web_c_assoc: Intuitively, a term which is close to the company name has more chances to be a keyword (either positive or negative) than more generic terms. This feature represents the association, according to the search counts, between the term t and a company name c:

    [df_web(t OR c) / df_web(c)] / [df_web(t) / M]    (3)

(b) web_c_ngd: The Normalized Google Distance [16] (applied to the Yahoo! search engine), which is a measure of semantic distance between two terms from the search counts. For a term t and a company name c, the Normalized Google Distance is given by (4):

    [max(log(f(c)), log(f(t))) − log(f(t AND c))] / [M − min(log(f(t)), log(f(c)))]    (4)

where f(x) = df_web(x).

(c) web_dom_df: Frequent terms in the company website should be meaningful to characterize positive keywords. web_dom_df is the normalized document frequency of the term in the website of the company:

    df_web(t AND site:domain_c) / df_web(site:domain_c)    (5)

(d) web_dom_assoc: Degree of association of the term with the website in comparison with the use of the term on the Web. This feature is analogous to web_c_assoc, using the website domain instead of the company name c:

    [df_web(t AND site:domain_c) / df_web(site:domain_c)] / [df_web(t) / M]    (6)

(e) web_odp_occ: Number of occurrences of the term in all the items in dmoz(domain_c). Each item is composed of a URL, a title, a description and the ODP category to which it belongs.

(f) web_wiki_occ: Number of occurrences of the term in the first 100 results in wikipedia(domain_c). In order to filter pages returned by the API that could be unrelated to the company, only pages that contain the string domain_c are considered.

¹⁶ Some companies in the WePS-3 collection have a Wikipedia page as reference page instead of the company website. In these cases, the feature web_dom_df (which is also the numerator in the feature web_dom_assoc) is computed as the presence of the term t in the Wikipedia page. Also, the query used to get the values of the features web_odp_occ and web_wiki_occ is the title of the Wikipedia page.

3. Features expanded with co-occurrence: In order to avoid false zeros in web-based features, we expand some of the previous term features with the value obtained by the five most co-occurrent terms. Given a feature f, a new feature is computed as the Euclidean norm (7) of the vector with components f(t_i) · w(t, t_i) for the five terms most co-occurrent with t in the set of tweets T_c (8), where f(t_i) is the web-based feature value f for the term t_i and w(t, t_i) is the grade of co-occurrence of each term (9):

    cooc_agg(t, f) = sqrt( Σ_{t_i ∈ cooc_t} (f(t_i) · w(t, t_i))² )    (7)

    cooc_t = set of the five terms which most co-occur with t    (8)

    w(t, t_i) = |co-occurrences_{T_c}(t, t_i)| / |T_c|    (9)

This formula is applied to web_c_assoc, web_c_ngd, web_dom_df, web_dom_assoc, web_odp_occ and web_wiki_occ, resulting in the features enumerated below:

(a) cooc_c_assoc = cooc_agg(t, web_c_assoc)
(b) cooc_c_ngd = cooc_agg(t, web_c_ngd)
(c) cooc_dom_df = cooc_agg(t, web_dom_df)
(d) cooc_dom_assoc = cooc_agg(t, web_dom_assoc)
(e) cooc_odp_occ = cooc_agg(t, web_odp_occ)
(f) cooc_wiki_occ = cooc_agg(t, web_wiki_occ)

4.2. Feature Analysis

The first step for the feature analysis is to develop a gold standard set of positive and negative keywords. In order to get sufficient training data and to deal with possible mis-annotations in the corpus, we set a precision of 0.85 of a term in a related/unrelated set of tweets as a feasible threshold to annotate a term as a keyword. Those terms with precision lower than 0.85 in both classes are labeled as skip terms (10).

Figure 3: Box-plots representing the distribution of each of the features – (a) col_c_df, (b) col_c_specificity, (c) col_hashtag, (d) col_c_prox_median, (e) col_c_prox_avg, (f) col_c_prox_sd, (g) web_odp_occ, (h) web_wiki_occ, (i) cooc_odp_occ, (j) cooc_wiki_occ, (k) web_c_assoc, (l) web_c_ngd, (m) web_dom_df, (n) web_dom_assoc, (o) cooc_c_assoc, (p) cooc_c_ngd, (q) cooc_dom_df, (r) cooc_dom_assoc – in the positive, negative and skip classes. The bottom and top of the box are the Q1 and Q3 quartiles, respectively, and the band near the middle is the Q2 quartile, i.e., the median. The whiskers extend to the most extreme data point within 1.5 times the length of the box away from the box (1.5 · IQR, where IQR = |Q3 − Q1|).

These plots help visualize the range of values for each feature, as well as where most of the values lie, allowing for a qualitative analysis of the features. We can see that the features col_hashtag (Fig. 3c), web_odp_occ (Fig. 3g) and cooc_odp_occ (Fig. 3i) are not informative, because almost all of their values are zero. Less than 1% of the terms in the test set occur at least once as a hashtag. Also, less than 1% are terms that appear in descriptions and titles of ODP search results.

Features describing term–company distance seem to capture differences between keyword and skip terms: both negative and positive keywords generally occur closer to the company name than skip terms. While positive and negative keywords share similar median and standard deviation (Figs. 3d and 3f) of proximity to the company name, the average distance for positive keywords is slightly smaller than for negative keywords (Fig. 3e).

The features col_c_df, col_c_specificity, web_c_assoc, web_c_ngd and their expanded (by co-occurrence) versions cooc_c_assoc and cooc_c_ngd were defined to discriminate filter keywords from skip terms. The most discriminative feature seems to be col_c_specificity (Fig. 3b). On the other hand, the features web_wiki_occ, web_dom_df, web_dom_assoc, cooc_dom_df, cooc_dom_assoc and cooc_wiki_occ were designed to distinguish between positive and negative filter keywords. At a first glance, positive and negative keywords have different distributions in all the features. Skip terms, on the other hand, tend to have distributions similar to those of positive keywords. The features cooc_dom_assoc (Fig. 3r) and cooc_wiki_occ (Fig. 3j) seem to be the best to discriminate positive keywords from negative and skip terms. Remarkably, features expanded by co-occurrence seem to be more informative than the original features, which tend to concentrate on low values (the median is near zero). When expanding the original values by co-occurrence, positive terms receive higher values more consistently.

In order to quantitatively evaluate the quality of features, we compute the Mann-Whitney U test [51], which is a non-parametric test used in statistical feature selection when a normal distribution of the features cannot be assumed. The p-value can be used to rank the features, since the smaller the p-value, the more informative the feature is [31].

at only the best two features according to the Mann-Whitney U test: first, we define a threshold to remove skip terms according to the specificity w.r.t. the collection of the tweets for the company (the col_c_specificity feature). Then we state, for the feature that measures association with the website (the cooc_dom_assoc feature), a lower bound to capture positive filter keywords and an upper bound to capture negative filter keywords. These three thresholds have been manually optimized using the training dataset. Finally, we also explored a third option: we apply machine learning using only the best two features instead of the whole feature set. We will refer hereafter to this method as machine learning - 2 features.

We have experimented with several machine learning methods using Rapidminer [56]: Neural Nets, C4.5 and CART Decision Trees, Linear Support Vector Machines (SVM) and Naïve Bayes. All methods have been used with "out-of-the-box" parameters. All the terms labeled over the WePS-3 training dataset were used to train the models. In the same way, terms extracted from the test dataset were used as test set.

Table 6 shows the values of the Area Under the ROC Curve (AUC) of each of the binary classifiers evaluated. AUC is an appropriate metric to measure the quality of binary classification models independently of the confidence threshold [23]. We analyzed three different subsets of features to represent the terms: (i) using all but the six features expanded by co-occurrence, (ii) using only the best two features (those used by the heuristic and machine learning - 2 features classifiers), and (iii) using all the features.
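The three-threshold heuristic mentioned above can be sketched as follows. This is a minimal illustration: the threshold values below are placeholders, while the actual values were manually tuned on the WePS-3 training set and are not reproduced here.

```python
# Sketch of the two-feature heuristic classifier: one specificity threshold
# to discard skip terms, plus a lower/upper bound on the website-association
# feature to separate positive from negative filter keywords.
# All three threshold values are illustrative placeholders.
def heuristic_classify(col_c_specificity, cooc_dom_assoc,
                       skip_threshold=0.1, pos_bound=0.5, neg_bound=0.2):
    """Label a term as a 'positive', 'negative' or 'skip' filter keyword."""
    if col_c_specificity < skip_threshold:
        return "skip"        # too unspecific w.r.t. the company's tweets
    if cooc_dom_assoc >= pos_bound:
        return "positive"    # strongly associated with the company website
    if cooc_dom_assoc <= neg_bound:
        return "negative"    # weakly associated with the company website
    return "skip"            # in-between terms are left unlabelled
```

The design choice here is deliberate: terms whose website association falls between the two bounds are left as skip terms rather than forced into a class.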
machine learning algorithm | not expanded by co-occurrence features (pos / neg) | 2 best features (pos / neg) | all features (pos / neg)
Neural Net | 0.68 / 0.67 | 0.73 / 0.72 | 0.75 / 0.73
CART Dec. Trees | 0.58 / 0.61 | 0.72 / 0.71 | 0.63 / 0.64
Linear SVM | 0.50 / 0.50 | 0.73 / 0.71 | 0.50 / 0.50
Naïve Bayes | 0.64 / 0.64 | 0.71 / 0.71 | 0.72 / 0.72
C4.5 Dec. Trees | 0.50 / 0.61 | 0.50 / 0.50 | 0.59 / 0.66

Table 6: Area Under the ROC Curve (AUC) values of the five classification models and the three feature sets used to classify positive and negative keywords.

The results obtained are similar for all models, except for C4.5 and SVM, which in several cases do not provide any useful information for classification (AUC = 0.5). Leaving out the "expanded by co-occurrence" features, the performance is, in general, lower for all the algorithms. This corroborates the results of our previous feature analysis. In the following experiments, we focus on the Neural Net algorithm to train both (positive versus others and negative versus others) classifiers, because it is consistently the best performing algorithm according to the AUC measure.

For each of the feature combinations described at the beginning of this section (machine learning - all features, heuristic and machine learning - best 2 features), below we analyze the obtained results. The methods were trained using terms from the WePS-3 training dataset and evaluated with the WePS-3 test set. Table 7 shows the confusion matrix obtained for the machine learning - all features method. The precision for the positive and negative classes is 62% and 56%, respectively,

strategy described in Section 3.2 is used to complete the task. It is also interesting to compare the bootstrapping strategy with the naïve winner-takes-all baseline – which directly classifies all the tweets as related or unrelated depending on which is the dominant class in the seed of tweets – and the winner-takes-remainder strategy, which consists of applying the winner-takes-all strategy only to those tweets that were not covered by some of the filter keywords.
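The two baseline strategies just described can be sketched as follows. This is a simplified illustration under our own data-layout assumptions: each tweet carries either the label assigned by a matching filter keyword, or None if no keyword covered it.

```python
# Simplified sketch of the winner-takes-all (wta) and
# winner-takes-remainder (wtr) tweet classification strategies.
def winner_takes_all(seed_labels, n_tweets):
    """Classify every tweet with the dominant class of the annotated seed."""
    dominant = ("related"
                if seed_labels.count("related") >= seed_labels.count("unrelated")
                else "unrelated")
    return [dominant] * n_tweets

def winner_takes_remainder(keyword_labels):
    """keyword_labels: per-tweet label, or None if no filter keyword matched.
    Covered tweets keep their keyword label; the remainder gets the dominant
    class among the covered tweets."""
    covered = [label for label in keyword_labels if label is not None]
    dominant = ("related"
                if covered.count("related") >= covered.count("unrelated")
                else "unrelated")
    return [label if label is not None else dominant
            for label in keyword_labels]
```

The contrast with bootstrapping is that neither strategy learns anything from the tweet contents: wta ignores the keyword labels of individual tweets entirely, and wtr only propagates the majority class to the uncovered remainder.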
Figure 4: Fingerprints for each of the keyword selection strategies combined with each of the different tweet classification strategies.

Figure 4 shows the fingerprint of each of the combinations tested, and Table 10 shows the results. The best automatic method, which combines machine learning - all features to discover keywords and bootstrapping with the tweets annotated using those keywords, gives an accuracy of 0.73, which is higher than using manual keywords from

6. Discussion

6.1. How Much of the Problem is Solved?

In order to shed light on the trade-off between quality and quantity of filter keywords, here we analyze the relation between accuracy and coverage of the tweets classified by considering different sets of filter keywords. Figure 6 shows the coverage/accuracy curves for oracle, manual and automatic filter keywords.

Figure 6: Coverage/accuracy curves for oracle, manual and automatic filter keywords.

Curves were generated as follows:

Oracle keywords. At step n, we consider the nth positive/negative oracle keyword that maximizes accuracy and – in case of ties – coverage of tweets (i.e., in the case of two keywords having the same accuracy, the one that covers more tweets is considered first).

Manual keywords. At each step n we consider the nth positive/negative manual keyword that maximizes coverage of tweets.

Machine learning - all features. The set of terms considered in this analysis are those that were classified as positive or negative by using the confidence thresholds learned by the two classifiers in the method "machine learning - all features". Skip terms are (i) those classified as skip by both binary classifiers; (ii) those classified simultaneously as positive and negative keywords. Then, we use the maximum of the
confidence scores returned by the two classifiers (i.e., max(conf(positive), conf(negative))) to sort the keywords. The keyword with the highest confidence score is added at each step. The point in the curve with the highest coverage corresponds to the classifier used in the experiments explained in Section 4.

Machine learning - 2 features. This curve is generated by using the two classifiers learned in "machine learning - 2 features", similarly to the curve generated for "machine learning - all features".

Heuristic. Since this classifier consists of manually defining thresholds using the training set, it doesn't provide any confidence score for the test cases. Hence, in the graphic it is represented as a single point.

The curve for Oracle keywords provides a statistical upper bound of how many tweets can be directly covered using filter keywords. Considering the best 100 oracle keywords for each test case/company name, it is possible to directly tag 85% of the tweets with 0.99 accuracy. On the other hand, a more realistic upper bound is given by manual keywords. Here, we can observe how the accuracy remains stable around 0.85, while the coverage grows from 10% to 15% approximately. In the best possible case, with more keywords the curve would continue as the line y = 0.85. Note that manual keywords have been annotated by inspecting representative Web pages (from Google search results) rather than inspecting tweets. Therefore, an automatic keyword classifier cannot achieve an accuracy above 0.85. Considering this, our automatic approaches establish a strong lower bound of 0.7 accuracy. In conclusion, it seems that a filter keyword classifier should reach an accuracy between 0.7 and 0.85 to be competitive.

6.2. Comparing Systems with Different Metrics

We have seen that related/unrelated tweets are not balanced in most of the test cases in WePS-3, and the proportion does not follow a normal distribution (extreme values seem to be as plausible as values around the mean). Because of this, accuracy may not be sufficient to understand the quality of systems, and that is why we have complemented it with the fingerprint representation [68]. In this section, we evaluate (and compare) results with the most popular alternative evaluation metrics found in the literature.

Considering the confusion matrix given by each system, where TP = true related tweets, FP = false related tweets, TN = true unrelated tweets, and FN = false unrelated tweets, we compute the following metrics, in addition to accuracy:

Normalized Utility. Utility has been used to evaluate document filtering tasks in TREC [34, 33] and is commonly used assigning a relative weight between true positives and false positives:

    u(S, T) = TP − FP

As in the TREC-8 filtering task [33], here Utility is normalized by means of the following scaling function:

    u_s(S, T) = [max(u(S, T), U(s)) − U(s)] / [MaxU(T) − U(s)]

and

    F1 = 2 · Precision · Recall / (Precision + Recall)

Table 11 reports the results for the baselines, the WePS-3 systems and our proposed systems for the metrics described above. All metrics were macro-averaged by topics, and undefined scores were considered as zero values.

System | accuracy | utility | lam% | F1(R,S) | F1
Gold standard | 1.00▲ | 1.00▲ | 0.00▲ | 1.00▲ | 1.00▲
supervised bootstr. | 0.85▲ | 0.69▲ | 0.28▲ | 0.30▲ | 0.62▲
WePS-3: LSIR (manual) | 0.83▲ | 0.66▲ | 0.28▲ | 0.27 | 0.62▲
ml-all feat. + bootstr. | 0.73 | 0.47 | 0.37 | 0.27 | 0.49
ml-2 feat. + wtr | 0.72 | 0.49 | 0.43 | 0.16▽ | 0.49
ml-2 feat. + bootstr. | 0.72 | 0.47 | 0.43 | 0.17▽ | 0.50
heuristic + bootstr. | 0.71 | 0.45 | 0.42 | 0.11▼ | 0.46
ml-all feat. + wtr | 0.71 | 0.44 | 0.39▼ | 0.21▼ | 0.43▼
ml-2 feat. + wta | 0.70 | 0.48 | 0.50▼ | 0.00▼ | 0.39▽
ml-all feat. + wta | 0.69▽ | 0.40▽ | 0.50▼ | 0.00▼ | 0.27▼
heuristic + wtr | 0.65▼ | 0.46 | 0.42 | 0.10▼ | 0.46
heuristic + wta | 0.64▼ | 0.44 | 0.50▼ | 0.00▼ | 0.39▽
WePS-3: ITC-UT | 0.75 | 0.52 | 0.37 | 0.20 | 0.49
WePS-3: SINAI | 0.64▽ | 0.38▽ | 0.35 | 0.12▼ | 0.30▼
WePS-3: UvA | 0.58▼ | 0.22▼ | 0.46▼ | 0.17▼ | 0.36▼
WePS-3: KALMAR | 0.47▼ | 0.35▼ | 0.43 | 0.16▽ | 0.48
baseline: all unrelated | 0.57▼ | 0.20▼ | 0.50▼ | 0.00▼ | 0.00▼
baseline: random | 0.49▼ | 0.20▼ | 0.49▼ | 0.16▼ | 0.37▼
baseline: all related | 0.43▼ | 0.40 | 0.50▼ | 0.00▼ | 0.52
Table 11: Results for proposed systems, WePS-3 systems and baselines compared with different evaluation metrics. Best automatic runs are in boldface. (ml = machine learning, wta = winner-takes-all, wtr = winner-takes-remainder). Statistical significance w.r.t. the ml-all feat. + bootstrapping run was computed using a two-tailed Student's t-test. Significant differences are indicated using ▲ (or ▼) for α = 0.01 and △ (or ▽) for α = 0.05.

Results show that, according to Reliability & Sensitivity, our best automatic system, ml-all features + bootstrapping, achieves the same score as the WePS-3 LSIR semi-automatic system (0.27) – which is the best result at the competition and involves manual processing – and outperforms the best automatic system in WePS-3 (ITC-UT = 0.20), with a 35% relative improvement. In terms of lam%, the SINAI system achieves the best automatic score of 0.35, followed by ITC-UT and ml-all features + bootstrapping, which reach 0.37 lam%. Note that lam% and R&S penalize non-informative/baseline-like behaviors. Because of this, the winner-takes-all systems and the "all (un)related" baselines get the worst scores in these metrics. According to utility, ITC-UT is still the best automatic system (0.52). Our best runs are between 0.47 and 0.49, ml-2 features + bootstrapping being the best of them.

Filter keywords | Homepage | Wikipedia | Both
related oracle keywords | 36% | 68% | 33%
unrelated oracle keywords | 9% | 19% | 6%
Table 12: Percentage of the 10 best oracle keywords extracted from the tweet stream that are covered by the company's homepage, by its Wikipedia article, and by both.

Overall, the only substantial overlap is for positive keywords in Wikipedia, indicating that representative Web pages are not the ideal place to look for effective filter keywords in Twitter. Note that the overlap of related oracle keywords with the company's Wikipedia page roughly doubles the overlap with its homepage. The same happens with unrelated keywords: almost 20% on Wikipedia versus almost 10% on the homepage. The percentage of oracle keywords that occur both in the homepage and in the Wikipedia article is similar to that of the homepage alone, indicating that Wikipedia essentially extends the keywords already present in the homepage.

In summary, exploring the nature of filter keywords leads us to the conclusion that the vocabulary characterizing a company in Twitter is substantially different from the vocabulary associated with the company in its homepage, in Wikipedia and, apparently, in the Web at large. These findings are in line with the "vocabulary gap" that has been observed between Twitter and other Web sources such as Wikipedia or news comments [73]. One way of alleviating this problem is to use co-occurrence-based expansion of web-based features, which makes it easier to recognize filter keywords automatically. While the company's Wikipedia article seems to have better coverage of (perfect) filter keywords than the company's homepage, further investigation is needed on how to automatically infer the company's Wikipedia page from its homepage URL in order to extract additional keyword features from it.
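As an illustration of how the tweet-level counts map to the scores reported in Table 11, the following sketch is our own illustrative helper, not code from the paper: the lower bound U(s) is passed in as an assumed parameter `min_u`, and MaxU(T) is taken to be the utility of a perfect system on the topic, i.e. TP + FN.

```python
def tweet_filtering_scores(tp, fp, tn, fn, min_u=-100.0):
    """Per-company evaluation sketch: accuracy, scaled utility and F1.

    tp, fp, tn, fn: confusion-matrix counts over the company's tweets
    min_u: assumed lower bound U(s) for the utility scaling (illustrative value)
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total

    # u(S, T) = TP - FP, clipped at min_u and rescaled to [0, 1];
    # MaxU(T) is approximated here by TP + FN (utility of a perfect run).
    u = tp - fp
    max_u = tp + fn
    u_scaled = (max(u, min_u) - min_u) / (max_u - min_u)

    # F1 = 2 * Precision * Recall / (Precision + Recall);
    # undefined scores are counted as zero, as in the macro-averaging above.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    return {"accuracy": accuracy, "utility": u_scaled, "F1": f1}
```

When macro-averaging these scores over companies, a run that labels everything unrelated yields F1 = 0 for every topic, which is why the "all unrelated" baseline scores 0.00 on F1 in Table 11.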
References

[1] Agirre, E., Edmonds, P., 2006. Word sense disambiguation: Algorithms and applications. Springer.
[2] Al-Kamha, R., Embley, D., 2004. Grouping search-engine returned citations for person-name queries. In: Proceedings of the 6th annual ACM international workshop on Web information and data management. pp. 96-103.
[3] Amigo, E., Artiles, J., Gonzalo, J., Spina, D., Liu, B., Corujo, A., 2010. WePS-3 Evaluation Campaign: Overview of the Online Reputation Management Task. In: CLEF 2010 Labs and Workshops Notebook Papers.
[4] Amigo, E., Corujo, A., Gonzalo, J., Meij, E., de Rijke, M., 2012. Overview of RepLab 2012: Evaluating Online Reputation Management Systems. In: CLEF 2012 Labs and Workshop Notebook Papers.
[5] Amigo, E., Gonzalo, J., Verdejo, F., 2012. Reliability and Sensitivity: Generic Evaluation Measures for Document Organization Tasks. Tech. rep., UNED.
[6] Artiles, J., October 2009. Web people search. Ph.D. thesis, UNED University.
[7] Artiles, J., Borthwick, A., Gonzalo, J., Sekine, S., Amigo, E., 2010. WePS-3 evaluation campaign: Overview of the web people search clustering and attribute extraction tasks.
[8] Artiles, J., Gonzalo, J., Amigo, E., 2009. The impact of query refinement in the web people search task. In: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. Association for Computational Linguistics, pp. 361-364.
[9] Artiles, J., Gonzalo, J., Sekine, S., 2007. The SemEval-2007 WePS evaluation: Establishing a benchmark for the web people search task. Proceedings of SemEval.
[10] Artiles, J., Gonzalo, J., Sekine, S., 2009. WePS 2 evaluation campaign: overview of the web people search clustering task. In: 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference.
[11] Bagga, A., Baldwin, B., 1998. Entity-based cross-document coreferencing using the vector space model. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1. pp. 79-85.
[12] Boyd, D., Golder, S., Lotan, G., 2010. Tweet, tweet, retweet: Conversational aspects of retweeting on Twitter. In: Proceedings of the 2010 43rd Hawaii International Conference on System Sciences. HICSS '10. IEEE Computer Society, Washington, DC, USA, pp. 1-10.
[13] Bunescu, R., Pasca, M., 2006. Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of EACL. Vol. 6. pp. 9-16.
[14] Cha, M., Haddadi, H., Benevenuto, F., Gummadi, K. P., 2010. Measuring User Influence in Twitter: The Million Follower Fallacy. In: Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (ICWSM). Washington DC, USA.
[15] Cheng, Z., Caverlee, J., Lee, K., 2010. You are where you tweet: a content-based approach to geo-locating Twitter users. In: Proceedings of the 19th ACM international conference on Information and knowledge management. pp. 759-768.
[16] Cilibrasi, R. L., Vitanyi, P. M., 2007. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering.
[17] Comm, J., Robbins, A., 2009. Twitter power: How to dominate your market one tweet at a time. John Wiley & Sons Inc.
[18] Cormack, G., Lynam, T., 2005. TREC 2005 spam track overview. In: The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings.
[19] Cucerzan, S., 2007. Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of EMNLP-CoNLL. Vol. 2007. pp. 708-716.
[20] Dellarocas, C., Awad, N., Zhang, X., 2004. Exploring the value of online reviews to organizations: Implications for revenue forecasting and planning. In: Proceedings of the International Conference on Information Systems.
[21] Dredze, M., McNamee, P., Rao, D., Gerber, A., Finin, T., 2010. Entity disambiguation for knowledge base population. In: Proceedings of the 23rd International Conference on Computational Linguistics. pp. 277-285.
[22] Edmonds, P., Cotton, S., 2001. SENSEVAL-2: Overview. In: Proceedings of The Second International Workshop on Evaluating Word Sense Disambiguation Systems (SENSEVAL-2). pp. 1-6.
[23] Fawcett, T., 2006. An introduction to ROC analysis. Pattern Recognition Letters 27 (8), 861-874.
[24] Ferragina, P., Scaiella, U., 2010. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In: Proceedings of the 19th ACM international conference on Information and knowledge management. pp. 1625-1628.
Hopkins University.
[50] Mann, G., Yarowsky, D., 2003. Unsupervised personal name disambiguation. In: Proceedings of HLT-NAACL '03. pp. 33-40.
[51] Mann, H., Whitney, D., 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18 (1), 50-60.
[52] Martinez-Romo, J., Araujo, L., 2009. Web people search disambiguation using language model techniques.
[53] McNamee, P., Dang, H., 2009. Overview of the TAC 2009 knowledge base population track. In: Text Analysis Conference (TAC).
[54] Meij, E., Weerkamp, W., de Rijke, M., 2012. Adding semantics to microblog posts. In: Proceedings of the fifth ACM international conference on Web search and data mining.
[55] Michelson, M., Macskassy, S., 2010. Discovering users' topics of interest on Twitter: a first look. In: Proceedings of the fourth workshop on Analytics for noisy unstructured text data. pp. 73-80.
[56] Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T., 2006. YALE: Rapid prototyping for complex data mining tasks. In: Proceedings of the 12th ACM SIGKDD.
[57] Mihalcea, R., Csomai, A., 2007. Wikify!: linking documents to encyclopedic knowledge. In: CIKM. Vol. 7. pp. 233-242.
[58] Mihalcea, R., Tarau, P., 2004. TextRank: Bringing order into texts. In: Proceedings of EMNLP. Vol. 4. pp. 404-411.
[59] Milios, E., Zhang, Y., He, B., Dong, L., 2003. Automatic term extraction and document similarity in special text corpora. In: Proceedings of the Sixth Conference of the Pacific Association for Computational Linguistics. pp. 275-284.
[60] Milne, D., Witten, I., 2008. Learning to link with Wikipedia. In: Proceeding of the 17th ACM conference on Information and knowledge management. pp. 509-518.
[61] Milstein, S., Chowdhury, A., Hochmuth, G., Lorica, B., Magoulas, R., 2008. Twitter and the micro-messaging revolution: Communication, connections, and immediacy - 140 characters at a time. O'Reilly Media, Inc.
[62] Nadeau, D., Sekine, S., 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes 30 (1), 3-26.
[63] Pak, A., Paroubek, P., 2010. Twitter as a corpus for sentiment analysis and opinion mining. Proceedings of LREC 2010.
[64] Pollach, I., 2006. Electronic word of mouth: A genre analysis of product reviews on consumer opinion web sites. In: Proceedings of the 39th Annual Hawaii International Conference on System Sciences. Vol. 3.
[65] Quinlan, J., 1993. C4.5: programs for machine learning. Morgan Kaufmann.
[66] Sakaki, T., Okazaki, M., Matsuo, Y., 2010. Earthquake shakes Twitter users: real-time event detection by social sensors. In: Proceedings of the 19th international conference on World wide web. WWW '10. pp. 851-860.
[67] Soboroff, I., McCullough, D., Lin, J., Macdonald, C., Ounis, I., McCreadie, R., 2012. Evaluating real-time search over tweets.
[68] Spina, D., Amigo, E., Gonzalo, J., 2011. Filter keywords and majority class strategies for company name disambiguation in Twitter. In: CLEF 2011 Conference on Multilingual and Multimodal Information Access Evaluation. Springer Berlin/Heidelberg, pp. 50-61.
[69] Spina, D., Meij, E., Oghina, A., Bui, M. T., Breuss, M., de Rijke, M., 2012. A Corpus for Entity Profiling in Microblog Posts. In: LREC Workshop on Language Engineering for Online Reputation Management.
[70] Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., Demirbas, M., 2010. Short text classification in Twitter to improve information filtering. In: Proceeding of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '10. ACM, New York, NY, USA, pp. 841-842.
[71] Strube, M., Ponzetto, S., 2006. WikiRelate! Computing semantic relatedness using Wikipedia. In: Proceedings of the National Conference on Artificial Intelligence. Vol. 21. p. 1419.
[72] Tsagkias, M., Balog, K., 2010. The University of Amsterdam at WePS3. In: CLEF 2010 Labs and Workshops Notebook Papers.
[73] Tsagkias, M., de Rijke, M., Weerkamp, W., 2011. Linking online news and social media. In: Proceedings of the fourth ACM International Conference on Web Search and Data Mining.
[74] Turney, P., 1999. Learning to extract keyphrases from text. National Research Council, Institute for Information Technology, Technical Report ERB-1057.
[75] Wan, X., Gao, J., Li, M., Ding, B., 2005. Person resolution in person search results: WebHawk. In: Proceedings of the 14th ACM international conference on Information and knowledge management. pp. 163-170.
[76] Wilson, R., 2003. Keeping a watch on corporate reputation. Strategic Communications Management 7 (2).
[77] Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., Nevill-Manning, C. G., 1999. KEA: practical automatic keyphrase extraction. In: Proceedings of the fourth ACM conference on Digital libraries. DL '99. pp. 254-255.
[78] Wu, S., Hofman, J. M., Mason, W. A., Watts, D. J., 2011. Who says what to whom on Twitter. In: Proceedings of the 20th International Conference on World Wide Web. WWW '11. ACM, New York, NY, USA, pp. 705-714.
[79] Yang, J., Leskovec, J., 2010. Modeling information diffusion in implicit networks. In: IEEE International Conference on Data Mining. pp. 599-608.
[80] Yang, J., Leskovec, J., 2011. Patterns of temporal variation in online media. In: Proceedings of the fourth ACM international conference on Web search and data mining. WSDM '11. pp. 177-186.
[81] Yerva, S. R., Miklos, Z., Aberer, K., 2010. It was easy when apples and blackberries were only fruits. In: CLEF 2010 Labs and Workshops Notebook Papers.
[82] Yerva, S. R., Miklos, Z., Aberer, K., 2012. Entity-based classification of Twitter messages. IJCSA 9 (1), 88-115.
[83] Yoshida, M., Matsushima, S., Ono, S., Sato, I., Nakagawa, H., 2010. ITC-UT: Tweet Categorization by Query Categorization for On-line Reputation Management. In: CLEF 2010 Labs and Workshops Notebook Papers.
[84] Zhang, S., Wu, J., Zheng, D., Meng, Y., Xia, Y., Yu, H., 2012. Two stages based organization name disambiguity. Computational Linguistics and Intelligent Text Processing, 249-257.
[85] Zhang, Y., Milios, E., Zincir-Heywood, N., 2007. A comparative study on keyphrase extraction methods in automatic website summarization. Journal of Digital Information Management 5 (5), 323.
[86] Zhang, Y., Zincir-Heywood, N., Milios, E., 2004. Term-based clustering and summarization of web page collections. Advances in Artificial Intelligence, 60-74.
[87] Zhang, Y., Zincir-Heywood, N., Milios, E., 2004. World wide web site summarization. Web Intelligence and Agent Systems 2, 39-54.
[88] Zhang, Y., Zincir-Heywood, N., Milios, E., 2005. Narrative text classification for automatic keyphrase extraction in web document corpora. In: Proceedings of the 7th annual ACM international workshop on Web information and data management. pp. 51-58.
[89] Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., Li, X., 2011. Comparing Twitter and traditional media using topic models. In: Proceedings of the 33rd European conference on Advances in Information Retrieval.
[90] Zhao, X., Jiang, J., He, J., Song, Y., Achananuparp, P., Lim, E., Li, X., 2011. Topical keyphrase extraction from Twitter.