
Identifying Breakpoints in Public Opinion

Cuneyt Gurcan Akcora, Murat Ali Bayir, Murat Demirbas
Computer Science & Eng. Department
University at Buffalo, SUNY
14260, Buffalo, NY, USA
{cgakcora, mbayir, demirbas}@cse.buffalo.edu

Hakan Ferhatosmanoglu
Computer Science & Eng. Department
The Ohio State University
Columbus, OH 43210, USA
hakan@cse.ohio-state.edu

ABSTRACT

While polls are traditionally used for observing public opinion, they provide a point snapshot, not a continuum. We consider the problem of identifying breakpoints in public opinion, and propose using micro-blogging sites to capture trends in public opinion. We develop methods to detect changes in public opinion, and find events that cause these changes. Our experiments show that the proposed methods are able to determine changes in public opinion and extract the major news about the events effectively. We also deploy an application where users can view the important news stories for a continuing event and find the related articles on the web.

Categories and Subject Descriptors

H.2.8 [Information Systems]: Database Management - Database Applications, Data Mining; H.3 [Information Systems]: Information Storage and Retrieval

General Terms

Opinion Mining, Emotion Corpus, Microblogging, Sentiment Analysis.

1. INTRODUCTION

Since 1824¹, polls have been used to take a snapshot of public opinion, but they cannot reach many people nor capture opinions about topics that are not asked about in the questionnaire. Moreover, while events unfold rapidly and public opinion changes with those events, polls cannot account for the temporal changes in public opinion. With the advance of micro-blogging sites like Twitter [7, 10], we are now able to observe individual opinions and keep up with the changes in public opinion. When carefully aggregated and classified, individual opinions can give us a better understanding of how some events are received by the public.
¹ Conducted in the contest for the United States presidency.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
1st Workshop on Social Media Analytics (SOMA '10), July 25, 2010, Washington, DC, USA.
Copyright 2010 ACM 978-1-4503-0217-3...$10.00.

In this paper, we propose efficient methods to identify and classify opinions in a large stream of information, and pinpoint related events that stimulate users to express their opinions. In particular, the contributions of this paper are as follows:

• We develop and utilize an emotion corpus to detect emotions in tweets. This method expands opinion representation from binary options ("positive or negative") to multiple dimensions, providing more granularity in classification.
• We propose combining set and vector space models to observe the public opinion and detect changes over time. From the experimental results, we found that using these two methods together reduces false positives and improves the accuracy of our findings.
• We develop a dynamic scoring function to give a synopsis of news (in terms of prominent words) that led to breakpoints in public opinion.
• We create a customized event tracking application that can notify users without flooding them with every new entry about the event. We show that our application is more user friendly than the Google Alert² service.

² http://www.google.com/alerts

2. RELATED WORK

Opinion mining has received great attention recently, and researchers have started to investigate people's opinions about certain topics or news [6].

Existing opinion mining methods are usually grouped under two categories [8, 11], called document based and attribute based approaches. These approaches focus on characterizing user opinions as positive or negative over domain specific web sites [4, 13] for different applications.

As a document level approach, Turney [14] proposed determining the polarity of documents by using the semantic orientation of extracted phrases. As an example of attribute based approaches, Zhuang et al. [15] proposed a method for grouping movie reviews based on frequent opinion terms. Differing from these supervised approaches, we propose using a finer granularity classification (8 emotion classes) for opinions.

To account for the temporal changes in public opinion, a work related to our approach is proposed by Ku et al. [9], where the authors used the language characteristics of Chinese. In the temporal dimension, their method captures opinions and shows changes in overall sentiment about candidates in a presidential election.

3. METHODOLOGY

We begin our discussion of methodology by first explaining what indicates a change in public opinion in streaming tweets. For this purpose, we note two observations on Twitter data.

Observation 1: If an event results in a change of public opinion, more tweets contain emotion words. Furthermore, the emotion pattern of tweets in that time period is different from the emotion pattern of the preceding period, but more similar to the emotion pattern of tweets in the following period, i.e., the news has an enduring impression on the public.

Example Tweet: (Transgression claims admitted by Woods.) Tiger Woods - What a disappointment.

Observation 2: If an important story about the event appears, the word pattern of tweets is different from that of the last period. On the other hand, the same word pattern repeats in the next period, i.e., tweets contain similar words in the next period because the same topic is still being discussed.

Example Tweet: (Companies start ending sponsorship agreements.) Accenture Dumps Tiger Woods From Corporate Homepage.

Following these observations, we conclude that, to claim a change in the public opinion, both the emotion pattern and the word pattern must change as described above. We are looking for news stories that are both major events and opinion changers.

In Section 3.1 we discuss how we find emotion and word patterns and use the mentioned observations to detect opinion changes. We continue with finding topics related to the events in Section 3.2.

3.1 Opinion Detection

For the emotion pattern, we use an emotion corpus based method, while using the set space model for the word pattern.

Emotion Corpus Based Method is based on the vector space model for calculating document similarity. For emotion detection in tweets, we use an emotion corpus that is based on 8 basic classes, E = {Anger, Sadness, Love, Fear, Disgust, Shame, Joy, Surprise}, from [12]. We built a 309 word emotion corpus to populate those 8 classes. Each class represents a dimension in the Boolean emotion vector of a tweet. We look for emotion words in a tweet and, if one is found, set the corresponding class dimension in the emotion vector to 1; otherwise it remains 0.

Tweet: I was on main street in Norfolk when I heard about tiger woods updates and it made me feel angry, on 2009-12-11.
Emotion vector: (1, 0, 0, 0, 0, 0, 0, 0).

For all the tweets in a chosen time interval, a centroid of the corresponding emotion vectors is calculated, and this centroid is considered the document for that interval.

For a given time interval $T$ that contains $N$ tweets, let $V = \{v_1, v_2, \ldots, v_N\}$ be the set of vectors (with $l = 8$ dimensions each) generated from these tweets, and let $v_k^{(j)}$ denote the $j$-th dimension of $v_k$. We define the centroid $\bar{v}$ for period $T$ as:

$$\bar{v} = \left( \frac{1}{N}\sum_{k=1}^{N} v_k^{(1)},\; \frac{1}{N}\sum_{k=1}^{N} v_k^{(2)},\; \ldots,\; \frac{1}{N}\sum_{k=1}^{N} v_k^{(l)} \right) \tag{1}$$

After finding the centroid vector for each interval, we define the opinion similarity between two intervals $T_1$ and $T_2$ by calculating the cosine similarity between their centroid vectors $\bar{v}_1$ and $\bar{v}_2$:

$$Sim(T_1, T_2) = \frac{\bar{v}_1 \cdot \bar{v}_2}{|\bar{v}_1|\,|\bar{v}_2|} \tag{2}$$

Set Space Model prescribes representing each interval by a single document which is the union of the tweets posted in that particular time interval. After removing the stop words and stemming the terms using the Porter stemmer³, we collect all terms in a hash set for each interval. We define the similarity between two intervals $T_1$ and $T_2$ by calculating the Jaccard Similarity [2]:

$$Sim(T_1, T_2) = \frac{|Set_{T_1} \cap Set_{T_2}|}{|Set_{T_1} \cup Set_{T_2}|} \tag{3}$$

³ http://tartarus.org/martin/PorterStemmer/

To find the changes, neither the corpus based method nor the set space model alone is suitable. For the corpus based method, a change in the centroid can be misleading when the interval has very few emotion words compared to its neighbors. For the set space model, a change in similarity does not by itself imply an opinion change, because not all of the words are emotion words. In our method, we first analyze the vector space similarity. If we detect a possible change, we validate it by analyzing the Jaccard Similarity. Following Observations 1 and 2, if both methods detect the change, we report that point as a breakpoint.

$T_n$ is a time break if the following conditions are satisfied in both the corpus based method and the set space model:

$$Sim(T_{n-1}, T_n) < Sim(T_{n-2}, T_{n-1}) \tag{4}$$

$$Sim(T_{n-1}, T_n) < Sim(T_n, T_{n+1}) \tag{5}$$

3.2 Breakpoint Representation

After detecting the changes, we set out to identify the events that caused these changes. To this end, we look for the prominent words of an interval to represent the breakpoint. For the prominent word selection, we propose a TfIdf based dynamic scoring function. The algorithm should effectively find recently emerging keywords to guide users into catching breaking news, and pay special attention to the words which emerge in a period and start appearing in more periods as time progresses.

The Streaming TfIdf Algorithm. To identify the events that caused breakpoints, we need to find keywords that represent the topics of these events. We propose the Streaming TfIdf algorithm for extracting event related keywords from an information stream of tweets.

Document Phase. For breakpoint representation, the same time interval length as in opinion detection is used, and for every time interval $T_n$, a document $D_n$ contains the union of stemmed words from all tweets in that period. For word $x$ in document $D_n$, the Term Frequency $Tf_{x,D_n}$ is calculated as:

$$Tf_{x,D_n} = \frac{Count_{x,D_n}}{\sum_{k=1}^{n} Count_{k,D_n}} \tag{6}$$

For the total count of documents up to document $D_n$, the Inverse Document Frequency of a word $x$ in document $D_n$, $Idf_{x,D_n}$, is calculated as:

$$Idf_{x,D_n} = \log\left(\frac{n}{|\{k : k \le n,\ x \in D_k\}|}\right) \tag{7}$$

Note that $n$ is not a fixed value. As we move from the oldest document to the newest document, the total number of documents, $n$, increases. As a result, a keyword that keeps appearing in successive documents has its largest Idf value at its first appearance and smaller values afterwards.

Based on the calculated $Idf_{x,D_n}$ and $Tf_{x,D_n}$, we calculate the TfIdf value as:

$$TfIdf_{x,D_n} = Tf_{x,D_n} \times Idf_{x,D_n} \tag{8}$$

Prominence Update Phase. For a keyword $x$ that recently appeared in $D_n$, we update $Tf_{x,D_o}$ for the word $x$ in each document $D_o$, where $o < n$, as:

$$Tf_{x,D_o} \leftarrow Tf_{x,D_o} + F(T_o, T_n) \times Tf_{x,D_n} \tag{9}$$

Here, we apply a decay function $F(T_o, T_n)$ to prevent the word $x$ in the document $D_n$ from increasing the Tf value of $x$ in a too old document $D_o$. This follows from the fact that tweets are highly temporal, i.e., new events tend to affect user tweets only for a short period of time. As we move forward in the time domain, a keyword in a new period should not increase the prominence of a keyword in a much older period, because it is highly unlikely that the appearance of a keyword is due to a very old event.

For the period numbers $o$ and $n$, we define the decay function for two periods $T_o$ and $T_n$ as:

$$F(T_o, T_n) = \frac{1}{n - o} \tag{10}$$

For the updated Tf values of the keyword $x$ in document $D_o$, we re-calculate $TfIdf_{x,D_o}$ as:

$$TfIdf_{x,D_o} = Tf_{x,D_o} \times Idf_{x,D_o} \tag{11}$$

We choose the $p$ words with the highest TfIdf values from each document, and call them the prominent words of that document.

4. EXPERIMENTAL RESULTS

In this section, we present experimental results of our methods on Twitter. We analyzed data about two topics: (1) the Fort Hood shootings in Texas, USA, on November 05, 2009 and (2) the Tiger Woods car accident of November 27, 2009. Due to space limitations, here we only present the Tiger Woods news story. We used a Twitter search engine, Twopular⁴, to collect data. We processed 258548 tweets, and found 23280 emotion words in those tweets. Figure 1 shows the tweet count of each day.
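As a rough illustration of the Streaming TfIdf procedure of Section 3.2 (Equations 6 to 11), the following Python sketch processes documents in arrival order. It is not the authors' implementation. The function name `streaming_tfidf`, the natural logarithm, the choice to keep each document's Idf fixed at its arrival time, the choice to apply the update for every word of the new document, and the alphabetical tie-break are all assumptions made for this example.

```python
import math
from collections import Counter


def streaming_tfidf(documents, p=5):
    """Return the p most prominent words of each document.

    documents: list of token lists, one per time interval, oldest first.
    """
    tf = []                 # tf[o-1][x]: Tf of x in D_o, boosted over time (Eq. 9)
    idf = []                # idf[o-1][x]: Idf of x, fixed when D_o arrived (Eq. 7)
    doc_freq = Counter()    # number of documents so far that contain x

    for n, tokens in enumerate(documents, start=1):
        counts = Counter(tokens)
        total = sum(counts.values())
        doc_freq.update(counts.keys())

        tf_n = {x: c / total for x, c in counts.items()}          # Eq. 6
        idf_n = {x: math.log(n / doc_freq[x]) for x in counts}    # Eq. 7

        # Prominence update (assumed reading of Eq. 9): a word seen in D_n
        # boosts its Tf in every older document D_o that already contains it,
        # damped by the decay 1 / (n - o) of Eq. 10.
        for o in range(1, n):
            decay = 1.0 / (n - o)
            older = tf[o - 1]
            for x, value in tf_n.items():
                if x in older:
                    older[x] += decay * value

        tf.append(tf_n)
        idf.append(idf_n)

    prominent = []
    for tf_o, idf_o in zip(tf, idf):
        scores = {x: tf_o[x] * idf_o[x] for x in tf_o}            # Eq. 8 and 11
        # Highest score first; ties are broken alphabetically (an assumption).
        ranked = sorted(scores, key=lambda x: (-scores[x], x))
        prominent.append(ranked[:p])
    return prominent
```

Calling `streaming_tfidf([["crash", "crash", "golf"], ["golf", "voicemail"], ["voicemail", "golf", "voicemail"]], p=1)` ranks "voicemail" first for the second and third documents, because "golf" occurs in every document and its Idf is therefore zero. Under Equation 7 as written, every word of the very first document also has Idf = log(1/1) = 0, so the ranking of that document depends only on the tie-break in this sketch.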
⁴ www.twopular.com

[Figure 1: Tweet Count of Days]

4.1 Opinion Detection

The length of time intervals is an important factor in our analysis. We evaluated unit lengths varying from 2 hours to 24 hours. Intervals shorter than 12 hours lead to biased results, because they contain too few tweets to form a meaningful sample. On the other hand, intervals longer than 24 hours are not suitable for the problem domain (media news cycle). We chose 12 hours, because it is the shortest interval that provides meaningful data while enabling us to capture events in fine granularity.

In our data for 20 days, we found 8 possible breaks by the Emotion Corpus Method (Figure 2), {5, 10, 17, 23, 25, 27, 32, 36}, and 5 of them, {5, 10, 23, 25, 27}, were also captured by Jaccard similarity (Figure 3). Figure 2 contains black bars that represent outlier intervals with very few tweets.

We tested our findings with a timeline of Tiger Woods related events from CNN, ABC News and ESPN⁵. Three of our validated breakpoints are related to the following events in successive order: (5) transgression claims accepted by Tiger Woods, (10) more women alleged to have affairs with Woods, (23) Gatorade ends a sponsorship agreement with Woods, and Twitter users start writing thousands of jokes about Woods with Santa Claus #hashtags nearing Christmas. Among the validated breakpoints, 25 and 27 are false positives.

⁵ http://sports.espn.go.com/golf/news/story?id=4922436

[Figure 2: Emotion Vector Similarity of two successive intervals]

[Figure 3: Jaccard Similarity of two successive intervals]

4.2 Breakpoint Representation

Upon detecting opinion changes in the Tiger Woods case, we found frequent keywords of all periods, and by using the Streaming TfIdf algorithm, we extracted the prominent words from these keywords.

While creating documents for each 12 hour period, we put the top F most frequent words into their respective documents. During this process, we used the Porter Stemmer to remove the commoner morphological and inflexional endings of words and analyzed the frequency distribution graph of the words. We found F = 50 to be the best choice, because for values larger than 50, big clusters of words with low frequencies appear. For the number of prominent words p, we used p = 5.

The first document has the prominent words: crash, report, florida, injur, golfer. The prominent words can often be self explanatory: accenture, drop, stop, golfer, sponsorship. This refers to Accenture's decision to drop its sponsorship of Tiger Woods.

The algorithm can successfully detect appearance dates of emerging topics. While the prominent words of the 11th document under the traditional TfIdf do not include the word "voicemail", the Streaming TfIdf algorithm correctly identifies it as breaking news and adds it to the prominent words.

Apart from identifying the prominent words, the algorithm correctly discriminates against words that are not related to the events. In the 11th interval, the word "Afghanistan" is in the set of prominent words. This is because of the tweets that protest Tiger Woods headlines while the "Afghanistan war" gets more violent. In the following days, the prominent word set of the document is updated and "Afghanistan" disappears from the prominent word set, as it is not actually related to the event.

The breakpoint representation method identifies the significant periods as 6, 11 and 24. Note that a break on the n-th bar in the similarity graphs (Figures 2-3) indicates an opinion change between the n-th and (n+1)-th time periods. For these breakpoints, Table 1 gives the prominent words for the (n+1)-th intervals.

Run Time Analysis of our methods shows a linear characteristic as the tweet count increases. In order to test scalability, we experimented with 5000, 10000 and 20000 tweets and found the run time of our methods to be 24224, 45985 and 92867 milliseconds on an AMD Turion Dual-Core 2.00 GHz processor.
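To make the two-stage check of Section 3.1 concrete (emotion centroids compared by cosine similarity, word sets compared by Jaccard similarity, and a break reported only when both similarity series dip), here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' code: the function names, the toy `corpus` mapping from words to emotion classes, the strict "less than" comparison in the dip test, and the omission of stop word removal and stemming are all choices made for this example.

```python
import math

# Order follows the class list E of Section 3.1.
EMOTIONS = ("anger", "sadness", "love", "fear",
            "disgust", "shame", "joy", "surprise")


def emotion_vector(tokens, corpus):
    """Boolean emotion vector of one tweet; corpus maps word -> emotion class."""
    found = {corpus[t] for t in tokens if t in corpus}
    return [1 if e in found else 0 for e in EMOTIONS]


def centroid(vectors):
    """Per-dimension mean of the emotion vectors of one interval."""
    if not vectors:
        return [0.0] * len(EMOTIONS)
    return [sum(v[j] for v in vectors) / len(vectors)
            for j in range(len(EMOTIONS))]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dips(sims):
    """Indices n where Sim(T[n-1], T[n]) is below both neighbouring similarities.

    sims[i] holds Sim(T[i], T[i+1]) for 0-based interval indices.
    """
    return {n for n in range(2, len(sims))
            if sims[n - 1] < sims[n - 2] and sims[n - 1] < sims[n]}


def find_breakpoints(intervals, corpus):
    """intervals: list of intervals; each interval is a list of tokenized tweets."""
    centroids = [centroid([emotion_vector(t, corpus) for t in tweets])
                 for tweets in intervals]
    word_sets = [{tok for t in tweets for tok in t} for tweets in intervals]
    pairs = range(len(intervals) - 1)
    emo = [cosine(centroids[i], centroids[i + 1]) for i in pairs]
    jac = [jaccard(word_sets[i], word_sets[i + 1]) for i in pairs]
    # A break is reported only when both models agree.
    return sorted(dips(emo) & dips(jac))
```

Intersecting the two sets of dips mirrors the validation step described in the text: a dip in emotion similarity alone may come from an interval with very few emotion words, and a dip in word overlap alone may reflect a topic shift without a change in sentiment.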
Table 1: Prominent Words for Significant Periods

Period   Prominent Words
1        crash, florida, injur, golf, accident
6        crash, wife, accident, mistress, golf
11       voicemail, wife, f***, golf, cheat
24       drop, stop, santa, claus, gatorade

5. CUSTOMIZED NEWS TRACKING

We developed a news tracking application on Twitter. The resulting application can be seen at the project web site⁶, and its screenshot is given in Figure 4. The application uses an interactive Javascript interface that lists the tweet counts of each period. The user can click on the period columns to see the events of a time period depending on the prominent words. For each period, we search for the articles that are published in the date range of the period. We do not store those web links in a database, because the links can be removed or re-located over time.

Google Alert offers such a customized web service, and it provides a system which notifies users by email when a chosen keyword has a new entry on the web. Whereas Google sends updates about every entry on a tracked keyword, our application observes the public opinion to identify breaking points and finds keywords of important events to notify users about them.

⁶ http://ubicomp.cse.buffalo.edu/upinion/

[Figure 4: Upinion Application]

6. CONCLUSIONS

In this paper we presented an efficient way to observe public opinion on the temporal dimension. Our methods can identify breakpoints, and find related events that caused these opinion changes. We tested our results with the timeline of the Tiger Woods case and showed the accuracy of our results. We developed an application that can serve users with news pages depending on the time period. We are currently working on expanding the emotion corpus to eliminate outlier intervals in our analysis.

As future work, we are planning to develop a customized version of our web service that enables web users to track their selected topics on Twitter. We are also working on a distributed implementation of our system over the Hadoop⁷ Map/Reduce framework. Map/Reduce [5] allows large software frameworks [1, 3] to process large amounts of data in a distributed manner. By using the power of the Map/Reduce paradigm, we are planning to handle millions of tweets belonging to multiple topics at the same time.

⁷ http://hadoop.apache.org/

7. REFERENCES

[1] M. A. Bayir, I. H. Toroslu, A. Cosar, and G. Fidan. Smartminer: a new framework for mining large scale web usage data. In WWW, pages 161-170, 2009.
[2] M. W. Berry, editor. Survey of text mining: clustering, classification, and retrieval. Springer, 2004.
[3] H. Cao, D. Jiang, J. Pei, Q. He, Z. Liao, E. Chen, and H. Li. Context-aware query suggestion by mining click-through and session data. In KDD, pages 875-883, 2008.
[4] K. Dave, S. Lawrence, and D. M. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In WWW, pages 519-528, 2003.
[5] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137-150, 2004.
[6] N. Diakopoulos and D. A. Shamma. Characterizing debate performance via aggregated twitter sentiment. In Conference on Human Factors in Computing Systems (CHI), April 2010.
[7] A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pages 56-65. ACM, 2007.
[8] W. Jin, H. H. Ho, and R. K. Srihari. Opinionminer: a novel machine learning system for web opinion mining and extraction. In KDD, pages 1195-1204, 2009.
[9] L. Ku, Y. Liang, and H. Chen. Opinion extraction, summarization and tracking in news and blog corpora. In Proceedings of AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, pages 100-107, 2006.
[10] H. Kwak, C. Lee, H. Park, and S. B. Moon. What is twitter, a social network or a news media? In WWW, pages 591-600, 2010.
[11] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135, 2007.
[12] W. G. Parrott, editor. Emotions in social psychology: essential readings. Psychology Press, 2001.
[13] A. Popescu and O. Etzioni. Extracting product features and opinions from reviews. In EMNLP-05, 2005.
[14] P. D. Turney. Thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews. In ACL, pages 417-424, 2002.
[15] L. Zhuang, F. Jing, X.-Y. Zhu, and L. Zhang. Movie review mining and summarization. In CIKM-06, 2006.