Identifying Breakpoints in Public Opinion

Cuneyt Gurcan Akcora, Murat Ali Bayir, Murat Demirbas
Computer Science & Eng. Department
University at Buffalo, SUNY
14260, Buffalo, NY, USA
{cgakcora, mbayir, demirbas}@cse.buffalo.edu

Hakan Ferhatosmanoglu
Computer Science & Eng. Department
The Ohio State University
Columbus, OH 43210, USA
hakan@cse.ohio-state.edu

ABSTRACT

While polls are traditionally used for observing public opinion, they provide a point snapshot, not a continuum. We consider the problem of identifying breakpoints in public opinion, and propose using micro-blogging sites to capture trends in public opinion. We develop methods to detect changes in public opinion, and find events that cause these changes. Our experiments show that the proposed methods are able to determine changes in public opinion and extract the major news about the events effectively. We also deploy an application where users can view the important news stories for a continuing event and find the related articles on web.

Categories and Subject Descriptors

H.2.8 [Information Systems]: Database Management - Database Applications, Data Mining; H.3 [Information Systems]: Information Storage and Retrieval

General Terms

Opinion Mining, Emotion Corpus, Microblogging, Sentiment Analysis.

1. INTRODUCTION

Since 1824^1, polls have been used to take a snapshot of public opinion, but they cannot reach many people nor capture opinions about the topics that are not asked in the questionnaire. Moreover, while events unfold rapidly and public opinion changes with those events, polls cannot account for the temporal changes in public opinion. With the advance of micro-blogging sites like Twitter [7, 10], we are now able to observe individual opinions and keep up with the changes in the public opinion. When carefully aggregated and classified, individual opinions can give us a better understanding of how some events are received by the public.
In this paper, we propose efficient methods to identify and classify opinions in a large stream of information, and pinpoint related events that stimulate users to express their opinions. In particular, the contributions of this paper are as follows:

• We develop and utilize an emotion corpus to detect emotions in tweets. This method enables expanding opinion representation from binary options ("positive or negative") to multiple dimensions by providing more granularity in classification.

• We propose combining set and vector space models to observe the public opinion and detect changes over time. From the experimental results, we found that using these two methods together reduces false positives and improves the accuracy of our findings.

• We develop a dynamic scoring function to give a synopsis of news (in terms of prominent words) that led to breakpoints in public opinion.

• We create a customized event tracking application that can notify users without "flooding" them with every new entry about the event. We show that our application is more user friendly than the Google Alert^2 service.

^1 Conducted in the contest for the United States presidency.
^2 http://www.google.com/alerts

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
1st Workshop on Social Media Analytics (SOMA '10), July 25, 2010, Washington, DC, USA.
Copyright 2010 ACM 978-1-4503-0217-3...$10.00.

2. RELATED WORK

Opinion mining has received great attention recently, and researchers have started to investigate people's opinions about certain topics or news [6].

Existing opinion mining methods are usually grouped under two categories [8, 11], called document based and attribute based approaches. These approaches focus on characterizing user opinions as positive or negative over domain specific web sites [4, 13] for different applications.

As a document level approach, Turney [14] proposed determining the polarity of documents by using the semantic orientation of extracted phrases. As an example of attribute based approaches, Zhuang et al. [15] proposed a method for grouping movie reviews based on frequent opinion terms. Differing from these approaches, we propose using a finer granularity classification (8 emotion classes) for opinions.

To account for the temporal changes in public opinion, a work related to our approach is proposed by Ku et al. [9], where the authors used the language characteristics of Chinese. In the temporal dimension, their method captures opinions and shows changes in overall sentiment about candidates in a presidential election.

3. METHODOLOGY

We begin our discussion of the methodology by first explaining what indicates a change in public opinion in streaming tweets. For this purpose, we note two observations on Twitter data.

Observation 1: If an event results in a change of public opinion, more tweets contain emotion words. Furthermore, the emotion pattern of tweets in that time period is different from the emotion pattern of the preceding period, but more similar to the emotion pattern of tweets in the following period, i.e., the news has an enduring impression on the public.

Example Tweet: (Transgression claims admitted by Woods.) Tiger Woods - What a disappointment.

Observation 2: If an important story about the event appears, the word pattern of tweets is different from that of the last period. On the other hand, the same word pattern repeats in the next period, i.e., tweets contain similar words in the next period because the same topic is still being discussed.

Example Tweet: (Companies start ending sponsorship agreements.) Accenture Dumps Tiger Woods From Corporate Homepage.

Following these observations, we conclude that, to claim a change in the public opinion, both the emotion pattern and the word pattern must change as described above. We are looking for news that are both major events and opinion changers. In Section 3.1 we discuss how we find emotion and word patterns and use the mentioned observations to detect opinion changes. We continue with finding topics related to the events in Section 3.2.

3.1 Opinion Detection

For the emotion pattern, we use an emotion corpus based method, while using the set space model for the word pattern.

Emotion Corpus Based Method is based on the vector space model for calculating document similarity. For the emotion detection in tweets, we use an emotion corpus that is based on 8 basic classes, E = {Anger, Sadness, Love, Fear, Disgust, Shame, Joy, Surprise}, from [12]. We built a 309 word emotion corpus to populate those 8 classes. Each class represents a dimension in the Boolean emotion vector of a tweet. We look for emotion words in a tweet, and if found, set the corresponding class dimension in the emotion vector to 1; otherwise it remains 0.

Tweet: I was on main street in Norfolk when I heard about tiger woods updates and it made me feel angry, on 2009-12-11.
Emotion vector: (1, 0, 0, 0, 0, 0, 0, 0).

For all the tweets in a chosen time interval, a centroid of all corresponding emotion vector dimensions is calculated, and this centroid is considered a document for each interval.

For a given time interval $T$ that contains $N$ tweets, let $V = \{v_1, v_2, \ldots, v_N\}$ be a set of vectors (with $l = 8$ dimensions each) generated from these tweets, and let $v_{jk}$ denote the $j$-th dimension of $v_k$. We define the centroid $\bar{v}$ for period $T$ as:

$$\bar{v} = \left( \frac{\sum_{k=1}^{N} v_{1k}}{N},\ \frac{\sum_{k=1}^{N} v_{2k}}{N},\ \ldots,\ \frac{\sum_{k=1}^{N} v_{lk}}{N} \right) \qquad (1)$$

After finding the centroid vector for each interval, we define the opinion similarity between two intervals $T_1$ and $T_2$ by calculating the cosine similarity between their centroid vectors $\bar{v}_1$ and $\bar{v}_2$:

$$Sim(T_1, T_2) = \frac{\bar{v}_1 \cdot \bar{v}_2}{|\bar{v}_1|\,|\bar{v}_2|} \qquad (2)$$

Set Space Model prescribes representing each interval by a single document which is the union of the tweets posted in that particular time interval. After removing the stop words and stemming the terms using the Porter stemmer^3, we collect all terms in a hash set for each interval. We define the similarity between two intervals $T_1$ and $T_2$ by calculating the Jaccard Similarity [2]:

$$Sim(T_1, T_2) = \frac{|Set_{T_1} \cap Set_{T_2}|}{|Set_{T_1} \cup Set_{T_2}|} \qquad (3)$$

To find the changes, neither the corpus based method nor the set space model alone is suitable. For the corpus based method, a change in the centroid can be misleading when the interval has very few emotion words compared to its neighbors. For the set space model, a change in similarity does not by itself imply an opinion change, because not all of the words are emotion words. In our method, we first analyze the vector space similarity. If we detect a possible change, we validate it by analyzing the Jaccard Similarity. Following Observations 1 and 2, if both methods detect the change, we report that point as a breakpoint.

$T_n$ is a time break if the following conditions are satisfied in both the corpus based method and the set space model:

$$Sim(T_{n-1}, T_n) < Sim(T_{n-2}, T_{n-1}) \qquad (4)$$

$$Sim(T_{n-1}, T_n) < Sim(T_n, T_{n+1}) \qquad (5)$$

That is, the similarity between $T_{n-1}$ and $T_n$ is lower than the similarities of both neighboring pairs of intervals.

^3 http://tartarus.org/martin/PorterStemmer/

3.2 Breakpoint Representation

After detecting the changes, we set out to identify the events that caused these changes. To this end, we look for the prominent words of an interval to represent the breakpoint. For the prominent word selection, we propose a TfIdf based dynamic scoring function. The algorithm should effectively find recently emerging keywords to guide users into catching breaking news, and pay special attention to the words which emerge in a period and start appearing in more periods as time progresses.

The Streaming TfIdf Algorithm. To identify the events that caused breakpoints, we need to find keywords that represent the topics of these events. We propose the Streaming TfIdf algorithm for extracting event related keywords from an information stream of tweets.

Document Phase. For breakpoint representation, the same time interval length as in the opinion detection is used, and for every time interval $T_n$, a document $D_n$ contains the union of stemmed words from all tweets in that period.

For word $x$ in document $D_n$, the Term Frequency $Tf_{x,D_n}$ is calculated as:

$$Tf_{x,D_n} \leftarrow \frac{Count_{x,D_n}}{\sum_{k=1}^{n} Count_{k,D_n}} \qquad (6)$$

where the sum in the denominator runs over the words $k$ of $D_n$.

For the total count of documents up to document $D_n$, the Inverse Document Frequency of a word $x$ in document $D_n$, $Idf_{x,D_n}$, is calculated as:

$$Idf_{x,D_n} = \log\left( \frac{n}{|\{\forall k,\ k \le n : x \in D_k\}|} \right) \qquad (7)$$

Note that $n$ is not a fixed value. As we move from the oldest document to the newest document, the total number of documents, $n$, increases. With this definition, the first appearance of a keyword generally has a larger Idf value than the following appearances of the word.

Based on the calculated $Idf_{x,D_n}$ and $Tf_{x,D_n}$, we calculate the TfIdf value as:

$$TfIdf_{x,D_n} \leftarrow Tf_{x,D_n} \times Idf_{x,D_n} \qquad (8)$$

Prominence Update Phase. For a keyword $x$ that recently appeared in $D_n$, we define the $Tf_{x,D_o}$ for the word $x$ in document $D_o$, where $o < n$, as:

$$Tf_{x,D_o} \leftarrow Tf_{x,D_o} + F(T_o, T_n) \times Tf_{x,D_n} \qquad (9)$$

Here, we apply a decay function $F(T_o, T_n)$ to prevent the word $x$ in the document $D_n$ from increasing the Tf value of $x$ in a too old document $D_o$. This follows from the fact that tweets are highly temporal, i.e., new events tend to affect user tweets only for a short period of time. As we move forward in the time domain, a keyword in a new period should not increase the prominence of a keyword in a much earlier period, because it is highly unlikely that the appearance of a keyword is due to a very old event.

For the period numbers $o$ and $n$, we define the decay function for two periods $T_o$ and $T_n$ as:

$$F(T_o, T_n) = \frac{1}{n - o} \qquad (10)$$

For the updated Tf values of the keyword $x$ in document $D_o$, we re-calculate the $TfIdf_{x,D_o}$ as:

$$TfIdf_{x,D_o} \leftarrow Tf_{x,D_o} \times Idf_{x,D_o} \qquad (11)$$

We choose the $p$ words with the highest TfIdf values from each document, and call them the prominent words of that document.

4. EXPERIMENTAL RESULTS

In this section, we present experimental results of our methods on Twitter. We analyzed data about two topics: (1) the Fort Hood shootings in Texas, USA, on November 05, 2009, and (2) the Tiger Woods car accident on November 27, 2009. Due to space limitations, here we only present the Tiger Woods news story.

We used a Twitter search engine, Twopular^4, to collect data. We processed 258548 tweets, and found 23280 emotion words in those tweets. Figure 1 shows the tweet count of each day.
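As an illustration of the opinion detection step of Section 3.1, the following is a minimal Python sketch, not the implementation used in the experiments. The four-word lexicon `TOY_CORPUS` is a hypothetical stand-in for the 309 word emotion corpus, the function names are invented for this sketch, and the strict "less than" comparison in `breakpoints` is an assumed reading of conditions (4) and (5).

```python
import math

# Order follows the class set E of Section 3.1.
EMOTION_CLASSES = ["anger", "sadness", "love", "fear",
                   "disgust", "shame", "joy", "surprise"]

# Hypothetical miniature lexicon; the paper's 309 word corpus is not reproduced.
TOY_CORPUS = {"angry": "anger", "disappointment": "sadness",
              "love": "love", "happy": "joy"}


def emotion_vector(tweet):
    """Boolean emotion vector: dimension j is 1 if a word of class j occurs."""
    vec = [0] * len(EMOTION_CLASSES)
    for word in tweet.lower().split():
        cls = TOY_CORPUS.get(word.strip(".,!?"))
        if cls is not None:
            vec[EMOTION_CLASSES.index(cls)] = 1
    return vec


def centroid(vectors):
    """Per-dimension mean of the tweet vectors of one interval (equation 1)."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine similarity of two centroids (equation 2); 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(set1, set2):
    """Jaccard similarity of two term sets (equation 3)."""
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0


def breakpoints(sims):
    """sims[i] holds Sim(T_i, T_{i+1}); return every n satisfying (4) and (5)."""
    return [n for n in range(2, len(sims))
            if sims[n - 1] < sims[n - 2] and sims[n - 1] < sims[n]]


def detect(emotion_sims, word_sims):
    """Report only the breaks found by both similarity series."""
    return sorted(set(breakpoints(emotion_sims)) & set(breakpoints(word_sims)))
```

Here `emotion_sims` would hold the cosine similarities of successive interval centroids and `word_sims` the Jaccard similarities of successive interval term sets, so `detect` returns the candidate breaks confirmed by both models.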
^4 www.twopular.com

Figure 1: Tweet Count of Days

4.1 Opinion Detection

The length of time intervals is an important factor in our analysis. We evaluated unit lengths varying from 2 hours to 24 hours. Intervals shorter than 12 hours lead to biased results, because they contain too few tweets to form a meaningful sample. On the other hand, intervals longer than 24 hours are not suitable for the problem domain (media news cycle). We chose 12 hours, because it is the shortest interval that provides meaningful data while enabling us to capture events in fine granularity.

In our data for 20 days, we found 8 possible breaks by the Emotion Corpus Method (Figure 2), {5, 10, 17, 23, 25, 27, 32, 36}, and 5 of them, {5, 10, 23, 25, 27}, were also captured by Jaccard similarity (Figure 3). Figure 2 contains black bars that represent outlier intervals with very few tweets.

We tested our findings with a timeline of Tiger Woods related events from CNN, ABC News and ESPN^5. Our 3 validated breaking points are related to the following events in successive order: (5) transgression claims accepted by Tiger Woods, (10) more women alleged to have affairs with Woods, (23) Gatorade ends a sponsorship agreement with Woods, and Twitter users start writing thousands of jokes about Woods with Santa Claus #hashtags nearing Christmas. Among the five breakpoints detected by both methods, 25 and 27 are false positives.

^5 http://sports.espn.go.com/golf/news/story?id=4922436

Figure 2: Emotion Vector Similarity of two successive intervals

Figure 3: Jaccard Similarity of two successive intervals

4.2 Breakpoint Representation

Upon detecting opinion changes in the Tiger Woods case, we found frequent keywords of all periods, and by using the Streaming TfIdf algorithm, we extracted the prominent words from these keywords.

While creating documents for each 12 hour period, we put the top F most frequent words into their respective documents. During this process, we used the Porter Stemmer to remove the commoner morphological and inflexional endings of words, and analyzed the frequency distribution graph of the words. We found F = 50 to be the best choice, because for values larger than 50, big clusters of words with low frequencies appear. For the number of prominent words p, we used p = 5.

The first document has the prominent words: crash, report, florida, injur, golfer. The prominent words can often be self explanatory: accenture, drop, stop, golfer, sponsorship. This refers to Accenture's decision to drop its sponsorship of Tiger Woods.

The algorithm can successfully detect the appearance dates of emerging topics. While the prominent words of the 11th document with the traditional TfIdf do not include the word "voicemail", the Streaming TfIdf algorithm correctly identifies it as breaking news and adds it to the prominent words.

Apart from identifying the prominent words, the algorithm correctly discriminates against words that are not related to the events. In the 11th interval, the word "Afghanistan" is in the set of prominent words. This is because of the tweets that protest Tiger Woods headlines while the "Afghanistan war" gets more violent. In the following days, the prominent word set of the document is updated and "Afghanistan" disappears from the prominent word set, as it is not actually related to the event.

The breakpoint representation method identifies the significant periods as 6, 11 and 24. Note that a break on the (n)th bar in the similarity graphs (Figures 2-3) indicates an opinion change between the (n)th and (n+1)th time periods. For these breakpoints, Table 1 gives the prominent words for the (n+1)th intervals.

Run Time Analysis of our methods shows a linear characteristic as the tweet count increases. In order to test scalability, we experimented with 5000, 10000 and 20000 tweets and found the run time of our methods to be 24224, 45985 and 92867 milliseconds on an AMD Turion Dual-Core 2.00 GHz processor.
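The Streaming TfIdf procedure of Section 3.2 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the code used for the measurements above: the function name `streaming_tfidf` is invented, the prominence update is assumed to be additive and is applied only to words already present in the older document, Idf values are assumed to stay fixed once a document arrives, and ties are broken alphabetically.

```python
import math
from collections import Counter


def streaming_tfidf(documents, p=5):
    """documents: list of token lists, oldest first.

    Returns the p highest-scoring words of every document after all
    prominence updates have been applied.
    """
    tf = []               # tf[o-1][x]: current (possibly boosted) Tf of x in D_o
    idf = []              # idf[o-1][x]: Idf of x computed when D_o arrived
    doc_freq = Counter()  # number of documents seen so far that contain x

    for n, tokens in enumerate(documents, start=1):
        counts = Counter(tokens)
        total = sum(counts.values())

        # Document phase: term frequency and inverse document frequency.
        tf_n = {x: c / total for x, c in counts.items()}
        for x in counts:
            doc_freq[x] += 1
        idf_n = {x: math.log(n / doc_freq[x]) for x in counts}

        # Prominence update phase (assumed additive): a word seen again in
        # D_n boosts its Tf in each older D_o, decayed by 1 / (n - o).
        for o in range(1, n):
            older = tf[o - 1]
            for x, tf_x in tf_n.items():
                if x in older:
                    older[x] += tf_x / (n - o)

        tf.append(tf_n)
        idf.append(idf_n)

    # Rank by Tf * Idf, highest first; alphabetical order breaks ties.
    return [sorted(tf_o, key=lambda x: (-tf_o[x] * idf_o[x], x))[:p]
            for tf_o, idf_o in zip(tf, idf)]
```

For example, with the toy stream `[["crash", "crash", "golf"], ["golf", "wife"], ["voicemail", "wife", "wife"]]` and `p=1`, the third document is represented by "voicemail", the newly emerging word. Under this reading of the Idf formula every word of the very first document has an Idf of log(1) = 0, so its ranking falls back to the tie-break.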
Period    Prominent Words
1         crash, florida, injur, golf, accident
6         crash, wife, accident, mistress, golf
11        voicemail, wife, f***, golf, cheat
24        drop, stop, santa, claus, gatorade

Table 1: Prominent Values for Significant Periods

5. CUSTOMIZED NEWS TRACKING

We developed a news tracking application on Twitter. The resulting application can be seen at the project web site^6, and its screen shot is given in Figure 4. The application uses an interactive Javascript interface that lists the tweet counts of each period. The user can click on the period columns to see the events of a time period depending on the prominent words. For each period, we search for the articles that are published in the date range of the period. We do not store those web links in a database, because the links can be removed or re-located over time.

Google Alert offers such a customized web service, and it provides a system which notifies users by email when a chosen keyword has a new entry on the web. Whereas Google sends updates about every entry on a tracked keyword, our application observes the public opinion to identify breaking points and finds keywords of important events to notify users about them.

^6 http://ubicomp.cse.buffalo.edu/upinion/

Figure 4: Upinion Application

6. CONCLUSIONS

In this paper we presented an efficient way to observe public opinion on the temporal dimension. Our methods can identify breakpoints, and find related events that caused these opinion changes. We tested our results with the timeline of the Tiger Woods case and showed the accuracy of our results. We developed an application that can serve users with news pages depending on the time period. We are currently working on expanding the emotion corpus for eliminating outlier intervals in our analysis.

As future work, we are planning to develop a customized version of our web service that enables web users to track their selected topics on Twitter. We are also working on a distributed implementation of our system over the Hadoop^7 Map/Reduce framework. Map/Reduce [5] allows large software frameworks [1, 3] to process very large amounts of data in a distributed manner. By using the power of the Map/Reduce paradigm, we are planning to handle millions of tweets belonging to multiple topics at the same time.

^7 http://hadoop.apache.org/

7. REFERENCES

[1] M. A. Bayir, I. H. Toroslu, A. Cosar, and G. Fidan. Smartminer: a new framework for mining large scale web usage data. In WWW, pages 161-170, 2009.
[2] M. W. Berry, editor. Survey of text mining: clustering, classification, and retrieval. Springer, 2004.
[3] H. Cao, D. Jiang, J. Pei, Q. He, Z. Liao, E. Chen, and H. Li. Context-aware query suggestion by mining click-through and session data. In KDD, pages 875-883, 2008.
[4] K. Dave, S. Lawrence, and D. M. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In WWW, pages 519-528, 2003.
[5] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137-150, 2004.
[6] N. Diakopoulos and D. A. Shamma. Characterizing debate performance via aggregated twitter sentiment. In Conference on Human Factors in Computing Systems (CHI), April 2010.
[7] A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pages 56-65. ACM, 2007.
[8] W. Jin, H. H. Ho, and R. K. Srihari. Opinionminer: a novel machine learning system for web opinion mining and extraction. In KDD, pages 1195-1204, 2009.
[9] L. Ku, Y. Liang, and H. Chen. Opinion extraction, summarization and tracking in news and blog corpora. In Proceedings of AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, pages 100-107, 2006.
[10] H. Kwak, C. Lee, H. Park, and S. B. Moon. What is twitter, a social network or a news media? In WWW, pages 591-600, 2010.
[11] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135, 2007.
[12] W. G. Parrott, editor. Emotions in social psychology: essential readings. Psychology Press, 2001.
[13] A. Popescu and O. Etzioni. Extracting product features and opinions from reviews. In EMNLP-05, 2005.
[14] P. D. Turney. Thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews. In ACL, pages 417-424, 2002.
[15] L. Zhuang, F. Jing, X.-Y. Zhu, and L. Zhang. Movie review mining and summarization. In CIKM-06, 2006.