Meme-tracking and the Dynamics of the News Cycle
Jure Leskovec, Lars Backstrom, Jon Kleinberg

…cornell.edu, lars@cs.cornell.edu, kleinber@cs.cornell.edu

ABSTRACT
Tracking new topics, ideas, and memes across the Web has been an issue of considerable interest. Recent work has developed methods for tracking topic shifts over long time scales, as well as abru…


memes. And like genetic signatures, we find that while they remain recognizable as they appear in text over time, they also undergo significant mutation. As a result, a central computational challenge in this approach is to find robust ways of extracting and identifying all the mutational variants of each of these distinctive phrases, and to group them together. We develop scalable algorithms for this problem, so that memes end up corresponding to clusters containing all the mutational variants of a single phrase.

As an application of our technique, we use it to produce some of the first quantitative analyses of the global news cycle. To do this, we work with a massive set of 90 million news and blog articles that we collected over the final three months of the 2008 U.S. Presidential Election (starting August 1).¹ In this context, the collection of distinctive phrases that will act as tracers for memes is the set of quoted phrases and sentences that we find in articles, that is, quotations attributed to individuals. This is natural for the domain of news: quotes are an integral part of journalistic practice, and even if a news story is not specifically about a particular quote, quotes are deployed in essentially all articles, and they tend to follow iterations of a story as it evolves [28]. However, each individual quote tends to exhibit very high levels of variation across its occurrences in many articles, and so the aspects of our approach based on clustering mutational variants will be crucial.

Thus, our analysis of the news cycle will consist of studying the most significant groups of mutational variants as they evolve over time. We perform this analysis both at a global level (understanding the temporal variation as a whole) and at a local level (identifying recurring patterns in the growth and decay of a meme around its period of peak intensity).

At a global level, we find a structure in which individual memes compete with one another over short time periods, producing daily and weekly patterns of variation. We also show how the temporal patterns we observe arise naturally from a simple mathematical model in which news sources imitate each other's decisions about what to cover, but subject to recency effects penalizing older content. This combination of imitation and recency can produce synthetic temporal patterns resembling the real data; neither ingredient alone is able to do this.

At a local level, we identify some of the fine-grained dynamics governing how the intensity of a meme behaves. We find a characteristic spike around the point of peak intensity; in both directions away from the peak the volume decreases exponentially with time, but in an 8-hour window of time around the median, we find that volume $y$ as a function of time $t$ behaves like $y(t) \approx a \log(t)$. This function diverges at $t = 0$, indicating an explosive amount of activity right at the peak period.

Further interesting dynamics emerge when one separates the websites under consideration into two distinct categories: news media and blogs. We find that the peak of news-media attention to a phrase typically comes 2.5 hours earlier than the peak attention of the blogosphere. Moreover, if we look at the proportion of phrase mentions in blogs in a few-hour window around the peak, it displays a characteristic "heartbeat"-type shape as the meme bounces between mainstream media and blogs. We further break down the analysis to the level of individual blogs and news sources, characterizing the typical amount by which each source leads or lags the overall peak. Among the "fastest" sources we find a number of popular political blogs; this measure thus suggests a way of identifying sites that are regularly far ahead of the bulk of media attention to a topic.

¹ This is of course a period when news coverage was particularly high-intensity, but it gives us a chance to study the news cycle over precisely the kind of interval in which people's general intuitions about it are formed, and in which it is enormously consequential. In the latter regard, studying the effect of communication technology on elections is a research tradition that goes back at least to the work of Lazarsfeld, Berelson, and Gaudet in the 1940s [22].

Further related work. In addition to the range of different approaches for tracking topics, ideas, and memes discussed above, there has been considerable work in computer science focused on news data in particular. Two dominant themes in this work to date have been the use of algorithmic tools for organizing and filtering news; and the role of blogging and the production of news by individuals rather than professional media organizations. Some of the key research issues here have been the identification of topics over time [5, 11, 16], the evolving practices of bloggers [25, 26], the cascading adoption of stories [3, 14, 20, 23], and the ideological divisions in the blogosphere [2, 12, 13]. This has led to the development of a number of interesting tools to help people better understand the news (e.g. [5, 11, 12, 13, 16]).

Outside of computer science, the interplay between technology, the news media, and the political process has been a focus of considerable research interest for much of the past century [6, 22]. This research tradition has included work by sociologists, communication scholars, and media theorists, usually at a qualitative level, exploring the political and economic contexts in which news is produced [19], its effect on public opinion, and its ability to facilitate either polarization or consensus [15].

An important recent theme within this literature has been the increasing intensity of the news cycle, and the increasing role it plays in the political process. In their influential book Warp Speed: America in the Age of Mixed Media, Kovach and Rosenstiel discuss how the excesses of the news cycle have become intertwined with the fragmentation of the news audience, writing, "The classic function of journalism to sort out a true and reliable account of the day's events is being undermined. It is being displaced by the continuous news cycle, the growing power of sources over reporters, varying standards of journalism, and a fascination with inexpensive, polarizing argument. The press is also increasingly fixated on finding the 'big story' that will temporarily reassemble the now-fragmented mass audience" [19]. In addition to illuminating their effect on the producers and consumers of news, researchers have also investigated the role these issues play in policy-making by government. As Jayson Harsin observes, over time the news cycle has grown from being a dominant aspect of election campaign season to a constant feature of the political landscape more generally; "not only," he writes, "are campaign tactics normalized for governing but the communication tactics are themselves institutionally influenced by the twenty-four hour cable and internet news cycle" [15].

Moving beyond qualitative analysis has proven difficult here, and the intriguing assertions in the social-science work on this topic form a significant part of the motivation for our current approach. Specifically, the discussions in this area had largely left open the question of whether the "news cycle" is primarily a metaphorical construct that describes our perceptions of the news, or whether it is something that one could actually observe and measure. We show that by tracking essentially all news stories at the right level of granularity, it is indeed possible to build structures that closely match our intuitive picture of the news cycle, making it possible to begin a more formal and quantitative study of its basic properties.

2. ALGORITHMS FOR CLUSTERING MUTATIONAL VARIANTS OF PHRASES

We now discuss our algorithms for identifying and clustering textual variants of quotes, capable of scaling to our corpus of roughly a hundred million articles over a three-month period. These clusters will then form the basic objects in our subsequent analysis.
(e.g., nodes 13, 14, 15 in Fig. 2). So, the phrase cluster should be a subgraph for which all paths terminate in a single root node. Thus, informally, to identify phrase clusters, we would like to delete edges of small total weight from the phrase graph so it falls apart into disjoint pieces, with the property that each piece "feeds into" a single root phrase that can serve as the exemplar for the phrase cluster.

More precisely, we define a directed acyclic graph to be single-rooted if it contains exactly one root node. (Note that every DAG has at least one root.) We now define the following DAG partitioning problem:

DAG Partitioning: Given a directed acyclic graph with edge weights, delete a set of edges of minimum total weight so that each of the resulting components is single-rooted.

For example, Figure 2 shows a DAG with all edge weights equal to 1; deleting the indicated edges forms the unique optimal solution.

We now show that DAG Partitioning is computationally intractable to solve optimally. We then discuss the heuristic we use for the problem on our data, which we find to work well in practice.

PROPOSITION 1. DAG Partitioning is NP-hard.

Proof Sketch. We show that deciding whether an instance of DAG Partitioning has a solution of total edge weight at most $W$ is NP-complete, using a reduction from an NP-complete problem in discrete optimization known as the Multiway Cut problem [9, 10]. In an instance of Multiway Cut, we are given a weighted undirected graph $H$ in which a subset $T$ of the nodes has been designated as the set of terminals. The goal is to decide whether we can delete a set of edges of total weight at most $W$ so that each terminal in $T$ belongs to a distinct component. Due to space constraints we give the details of this construction at the supporting website [1].
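To make the feasibility condition of DAG Partitioning concrete, here is a minimal Python sketch that checks whether a set of kept edges leaves only single-rooted components. It assumes, following the statement that all paths terminate in the root, that a root is a node with no outgoing edge; the function name, the edge-list representation, and the toy graph are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def is_single_rooted_partition(nodes, kept_edges):
    """Return True if every weakly connected component of the kept
    subgraph contains exactly one root (a node with no outgoing edge)."""
    out_degree = {v: 0 for v in nodes}
    neighbors = defaultdict(set)  # undirected adjacency, used for components
    for u, v in kept_edges:
        out_degree[u] += 1
        neighbors[u].add(v)
        neighbors[v].add(u)

    seen = set()
    for start in nodes:
        if start in seen:
            continue
        # Collect one weakly connected component with a depth-first search.
        stack, component = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            component.append(u)
            for w in neighbors[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        roots = sum(1 for u in component if out_degree[u] == 0)
        if roots != 1:
            return False
    return True


# Hypothetical toy instance: node "a" feeds into two different roots.
nodes = ["a", "b", "c", "r1", "r2"]
all_edges = [("a", "r1"), ("a", "r2"), ("b", "r1"), ("c", "r2")]
without_a_r2 = [e for e in all_edges if e != ("a", "r2")]
```

On this toy graph the full edge set forms one component with two roots, so it is not a valid solution; deleting the single edge from "a" to "r2" splits it into two single-rooted components.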
An alternate heuristic. Given the intractability of DAG Partitioning, we develop a class of heuristics for it that we find to scale well and to produce good phrase clusters in practice. To motivate the heuristics, note first that in any optimal solution to DAG Partitioning, there is at least one outgoing edge from each non-root node that has not been deleted. (For if a non-root node had all its outgoing edges deleted, then we could put one back in and still preserve the validity of the solution.) Second, a subgraph of the DAG where each non-root node has only a single out-edge must necessarily have single-rooted components, since the edge sets of the components will all be in-branching trees. Finally, if (as a thought experiment) for each node $v$ we happened to know just a single edge $e$ that was not deleted in the optimal solution, then the subgraph consisting of all these edges $e$ would have the same components (when viewed as node sets) as the components in the optimal solution of DAG Partitioning. In other words, it is enough to find a single edge out of each node that is included in the optimal solution to identify the optimal components.

With this in mind, our heuristics proceed by choosing for each non-root node a single outgoing edge. Thus each of the components will be single-rooted, as noted above, and we take these as the components of our solution. We evaluate the heuristics by the total amount of edge weight kept in the clusters, relative to a baseline in which a random edge out of each phrase is kept. We found that keeping an edge to the shortest phrase gives a 9% improvement over the baseline, while keeping an edge to the most frequent phrase gives a 12% improvement. Proceeding from the roots down the DAG and greedily assigning each node to the cluster to which it has the most edges gives a 13% improvement over the baseline. We also experimented with simulated annealing but that did not improve the solution, suggesting further evidence for the effectiveness of our heuristics.
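The single-edge idea described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names, the dictionary representation of the DAG, the toy graph, and the `frequency` values are all hypothetical, and the "heaviest edge" rule is added only for comparison (it is not one of the rules evaluated in the text).

```python
def partition_by_single_edge(out_edges, choose):
    """Keep one outgoing edge per non-root node and return node -> root.

    out_edges: dict mapping every node to a list of (target, weight)
               pairs; roots map to an empty list. The graph must be a DAG.
    choose:    function that picks one (target, weight) pair from a
               non-empty list of outgoing edges.
    """
    kept = {u: choose(edges)[0] for u, edges in out_edges.items() if edges}
    root_of = {}

    def find_root(u):
        path = []
        # Follow kept edges until reaching a root or an already-resolved node.
        while u not in root_of and u in kept:
            path.append(u)
            u = kept[u]
        root = root_of.get(u, u)
        root_of[u] = root
        for p in path:
            root_of[p] = root
        return root

    for u in out_edges:
        find_root(u)
    return root_of


def kept_weight(out_edges, root_of):
    """Total weight of original edges whose endpoints share a cluster."""
    return sum(
        w
        for u, edges in out_edges.items()
        for v, w in edges
        if root_of[u] == root_of[v]
    )


# Hypothetical toy DAG and phrase frequencies.
out_edges = {
    "a": [("r1", 1.0), ("r2", 3.0)],
    "b": [("a", 2.0)],
    "c": [("r2", 1.0)],
    "r1": [],
    "r2": [],
}
frequency = {"a": 5, "r1": 50, "r2": 10}

by_weight = partition_by_single_edge(
    out_edges, lambda edges: max(edges, key=lambda e: e[1])
)
by_frequency = partition_by_single_edge(
    out_edges, lambda edges: max(edges, key=lambda e: frequency[e[0]])
)
```

Passing the selection rule as a function keeps the component-finding step identical across heuristics, so different rules can be compared with `kept_weight`, the same kind of kept-edge-weight measure the text uses for evaluation.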
Figure 3: Phrase volume distribution. We consider the volume of individual phrases, phrase-clusters, and the phrases that compose the "Lipstick on a pig" cluster. Notice that phrases and phrase-clusters have similar power-law distributions while the "Lipstick on a pig" cluster has a much fatter tail, which means that popular phrases have unexpectedly high popularity.

Dataset description. Our dataset covers three months of online mainstream and social media activity from August 1 to October 31, 2008, with about 1 million documents per day. In total it consists of 90 million documents (blog posts and news articles) from 1.65 million different sites that we obtained through the Spinn3r API [27]. The total dataset size is 390 GB and essentially includes complete online media coverage: we have all mainstream media sites that are part of Google News (20,000 different sites) plus 1.6 million blogs, forums and other media sites. From the dataset we extracted a total of 112 million quotes and discarded those with $L < 4$, $M < 10$, and those that fail our single-domain test with $\varepsilon = .25$. This left us with 47 million phrases out of which 22 million were distinct. Clustering the phrases took 9 hours and produced a DAG with 35,800 non-trivial components (clusters with at least two phrases) that together included 94,700 nodes (phrases).

Figure 3 shows the complementary cumulative distribution function (CCDF) of the phrase volume. For each volume $x$, we plot the number of phrases with volume $\geq x$. If the quantity of interest is power-law distributed with exponent $\gamma$, $p(x) \propto x^{-\gamma}$, then when plotted on log-log axes the CCDF will be a straight line with slope $-(\gamma + 1)$. In Figure 3 we superimpose three quantities of interest: the volume of individual phrases, phrase clusters (volume of all phrases in the cluster), and the individual phrases from the largest phrase-cluster in our dataset (the "lipstick on a pig" cluster). Notice that all quantities are power-law distributed. Moreover, the volume of individual phrases decays as $x^{-2.8}$, and of phrase-clusters as $x^{-3.1}$, which means that the tails are not very heavy, as for $\gamma > 3$ power-law distributions start to have finite variances. However, notice that the volume of the "lipstick on a pig" cluster decays as $x^{-1.85}$, in which case the tail is much heavier. In fact, for $\gamma < 2$ power-laws have infinite expectations. This means that variants of popular phrases, like "lipstick on a pig," are much "stickier" than what would be expected from the overall phrase volume distribution. Popular phrases have many variants and each of them appears more frequently than an "average" phrase.

3. GLOBAL ANALYSIS: TEMPORAL VARIATION AND A PROBABILISTIC MODEL

Having produced phrase clusters, we now construct the individual elements of the news cycle. We define a thread associated with a given phrase cluster to be the set of all items (news articles or blog posts) containing some phrase from the cluster, and we then track all threads over time, considering both their individual temporal dynamics as well as their interactions with one another.

Figure 4: Top 50 threads in the news cycle with highest volume for the period Aug. 1–Oct. 31, 2008. Each thread consists of all news articles and blog posts containing a textual variant of a particular quoted phrase. (Phrase variants for the two largest threads in each week are shown as labels pointing to the corresponding thread.) The data is drawn as a stacked plot in which the thickness of the strand corresponding to each thread indicates its volume over time. Interactive visualization is available at http://memetracker.org.

Figure 5: Temporal dynamics of top threads as generated by our model. Only two ingredients, namely imitation and a preference to recent threads, are enough to qualitatively reproduce the observed dynamics of the news cycle.

Using our approach we completely automatically created and also automatically labeled the plot in Figure 4, which depicts the 50 largest threads for the three-month period Aug. 1–Oct. 31. It is drawn as a stacked plot, a style of visualization (see e.g. [16]) in which the thickness of each strand corresponds to the volume of the corresponding thread over time, with the total area equal to the total volume. We see that the rising and falling pattern does in fact tell us about the patterns by which blogs and the media successively focus and defocus on common story lines.

An important point to note at the outset is that the total number of articles and posts, as well as the total number of quotes, is approximately constant over all weekdays in our dataset. (Refer to [1] for the plots.) As a result, the temporal variation exhibited in Figure 4 is not the result of variations in the overall amount of global news and blogging activity from one day to the next. Rather, the periods when the upper envelope of the curve is high correspond to times when there is a greater degree of convergence on key stories, while the low periods indicate that attention is more diffuse, spread out over many stories. There is a clear weekly pattern in this (again, despite the relatively constant overall volume), with the five large peaks between late August and late September corresponding, respectively, to the Democratic and Republican National Conventions, the overwhelming volume of the "lipstick on a pig" thread, the beginning of peak public attention to the financial crisis, and the negotiations over the financial bailout plan. Notice how the plot captures the dynamics of the presidential campaign coverage at a very fine resolution. Spikes and the phrases pinpoint the exact events and moments that triggered large amounts of attention.

Moreover, we have evaluated competing baselines in which we produce topic clusters using standard methods based on probabilistic term mixtures (e.g. [7, 8]).² The clusters produced for this time period correspond to much coarser divisions of the content (politics, technology, movies, and a number of essentially unrecognizable clusters). This is consistent with our initial observation in Section 1 that topical clusters are working at a level of granularity different from what is needed to talk about the news cycle. Similarly, producing clusters from the most linked-to documents [23] in the dataset produces a much finer granularity, at the level of individual articles. For reasons of space, we refer the reader to the supporting website [1] for the full results of these baseline approaches.

² As these do not scale to the size of the data we have here, we could only use a subset of the 10,000 most highly linked-to articles.

Global models for temporal variation. From a modeling per-

Figure 6: Only a single aspect of the model does not reproduce dynamic behavior. With only preference to recency (left) no thread prevails as at every time step the latest thread gets attention. With only imitation (right) a single thread gains most of the attention.
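The two ingredients named in the captions of Figures 5 and 6, imitation and recency, can be explored with a toy simulation. The sketch below is not the authors' model: the functional form (choice probability proportional to $(n_j + 1)^{\text{imitation}} \cdot e^{-\text{recency} \cdot \text{age}_j}$), the function and parameter names, and all default values are assumptions made purely for illustration.

```python
import math
import random


def simulate_threads(steps=200, sources=50, imitation=1.0, recency=0.1, seed=0):
    """Toy news-cycle simulation (every modeling choice here is an assumption).

    One new thread appears at each time step. Each source then mentions
    one existing thread j, chosen with probability proportional to
        (n_j + 1) ** imitation * exp(-recency * age_j),
    where n_j is the number of mentions thread j has received so far and
    age_j is the number of steps since it appeared.
    Returns the list of total mention counts, one entry per thread.
    """
    rng = random.Random(seed)
    counts = []  # mentions received by each thread
    births = []  # time step at which each thread appeared
    for t in range(steps):
        counts.append(0)
        births.append(t)
        # Weights are computed once per step from the counts so far.
        weights = [
            (n + 1) ** imitation * math.exp(-recency * (t - b))
            for n, b in zip(counts, births)
        ]
        for j in rng.choices(range(len(counts)), weights=weights, k=sources):
            counts[j] += 1
    return counts


both = simulate_threads()
recency_only = simulate_threads(imitation=0.0)
imitation_only = simulate_threads(recency=0.0)
```

Setting `imitation=0.0` or `recency=0.0` switches off one ingredient, which is the kind of comparison Figure 6 describes. Whether this toy reproduces the reported behavior depends on the assumed functional form and parameter values.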
Figure 7: Thread volume increase and decay over time. Notice the symmetry, quicker decay than buildup, and lower baseline popularity after the peak.

thread at a time $t$ is simply the number of items it contains with timestamp $t$. First we examine how the volume of a thread changes over time. A natural conjecture here would be to assume an exponential form for the change in the popularity of a phrase over time. However, somewhat surprisingly we show next that the exponential function does not increase fast enough to model the behavior.

Given a thread $p$, we define its peak time $t_p$ as the median of the times at which it occurred in the dataset. We find that threads tend to have particularly high volume right around this median time, and hence the value of $t_p$ is quite stable under the addition or deletion of moderate numbers of items to $p$. We focus on the 1,000 threads with the largest total volumes (i.e. the largest numbers of items). For each thread, we determine its volume as a function of time; we then normalize and align these curves so that $t_p = 0$ for each, and so that the volume of each at time 0 is equal to 1. Finally, for each time $t$ we plot the median volume at $t$ over all 1,000 phrase-clusters. This is depicted in Figure 7.

In general, one would expect the overall volume of a thread to be very low initially; then as the mass media begins joining in the volume would rise; and then as it percolates to blogs and other media it would slowly decay. However, it seems that the behavior tends to be quite different from this. First, notice that in Figure 7 the rise and drop in volume is surprisingly symmetric around the peak, which suggests little or no evidence for a quick build-up followed by a slow decay.

We find that no one simple justifiable function fits the data well. Rather, it appears that there are two distinct types of behavior: the volume outside an 8-hour window centered at the peak can be well modeled by an exponential function, $e^{-bx}$, while the 8-hour time window around the peak is best modeled by a logarithmic function, $a\,|\log(|x|)|$. The exponential function is increasing too slowly to be able to fit the peak, while the logarithm has a pole at $x = 0$ ($|\log(|x|)| \to \infty$ as $x \to 0$). This is surprising as it suggests that the peak is a point of "singularity" where the number of mentions effectively diverges. Another way to view this is as a form of Zeno's paradox: as we approach time 0 from either side, the volume increases by a fixed increment each time we shrink our distance to time 0 by a constant factor.

Fitting the function $a \log(t)$ to the spike we find that from the left, $t \to 0^-$, we have $a = 0.076$, while as $t \to 0^+$ we have $a = 0.092$. This suggests that the peak builds up more slowly and declines faster. A similar contrast holds for the exponential decay parameter $b$. We fit $e^{bt}$ and notice that from the left $b = 1.77$, while after the peak $b = 2.15$, which similarly suggests that the popularity slowly builds up, peaks and then decays somewhat more quickly. Finally, we also note that the background frequency before the peak is around 0.12, while after the peak it drops to around 0.09, which further suggests that threads are more popular before they reach their peak, and afterwards they decay very quickly.

Figure 8: Time lag for blogs and news media. Thread volume in blogs reaches its peak typically 2.5 hours after the peak thread volume in the news sources. Thread volume in news sources increases slowly but decreases quickly, while in blogs the increase is rapid and the decrease much slower.

Time lag between news media and blogs. A common assertion about the news cycle is that quoted phrases first appear in the news media, and then diffuse to the blogosphere, where they dwell for some time. However, the real question is, how often does this happen? What about the propagation in the opposite direction? What is the time lag? As we show next, using our approach we can determine the lag within a temporal resolution of less than an hour.

We labeled each of our 1.6 million sites as news media or blogs. To assign the label we used the following rule: if a site appears on Google News then we label it as news media, and otherwise we label it as a blog. Although this rule is not perfect we found it to work quite well in practice. There are 20,000 different news sites in Google News, which is a tiny number when compared to the 1.65 million sites that we track. However, these news media sites generate about 30% of the total number of documents in our dataset. Moreover, if we only consider documents that contain frequent phrases then the share of news media documents further rises to 44%.

By analogy with the previous experiment we take the top 1000 highest volume threads, align them so that each has an overall peak at $t_p = 0$, but now create two separate volume curves for each thread: one consisting of the subsequence of its blog items,
and the other consisting of the subsequence of its news media items. We will refer to the sizes of these as the blog volume and news volume of the thread.

Figure 8 plots the median news and blog volumes and reveals that this time our intuition was right. First, notice that the median for a thread in the news media typically occurs first, and then a median of 2.5 hours later the median for the thread among blogs occurs. Moreover, news volume increases both faster and higher, but also decreases more quickly than blog volume. For news volume we make an observation similar to what we saw in Figure 7: the volume increases more slowly than it decays. However, in blogs this is exactly the opposite. Here the number of mentions first quickly increases, reaches its peak 2.5 hours after the news media peak, but then decays more slowly.

Figure 9: Phrase handoff from news to blogs. Notice a heartbeat-like pulse when news media quickly takes over a phrase.

One interpretation is that a quoted phrase first becomes high-volume among news sources, and is then "handed off" to blogs. The news media are slower to heavily adopt a quoted phrase and subsequently quick in dropping it, as they move on to new content. On the other hand, bloggers rather quickly adopt phrases from the news media, with a 2.5-hour lag, and then discuss them for much longer. Thus we see a pattern in which a spike and then rapid drop in news volume feeds a later and more persistent increase in blog volume for the same thread.

Handoff of phrases from news media to blogs. To further investigate the dynamics and transitions of phrases from the news media to the blogosphere we perform the following experiment: we take the top 1000 threads, align them so that they all peak at time $t_p = 0$, but now calculate the ratio of blog volume to total volume for each thread as a function of time.

Figure 9 shows "heartbeat"-like dynamics in which the phrase "oscillates" between blogs and mainstream media. The fraction of blog volume is initially constant, but it turns upward about three hours before the peak as early bloggers mention the phrase. Once the news media joins in, around $t = -1$, the fraction of blog volume drops sharply; but it then jumps up after $t = 0$ once the news media begins dropping the thread and blogs continue adopting it. The fraction of blog mentions peaks around $t = 2.5$, and after 6-9 hours the hand-off is over and the fractions stabilize. It
is interesting that the constant fraction before the peak ($t < -6$) is 56%, while after the peak ($t > 9$) it is actually higher, which suggests a persistent effect in the blogosphere after the news media has moved on. This provides a picture of the very fine-scale temporal dynamics of the handoff of news from mainstream media to blogs, aggregated at the very large scale of 90 million news articles.

Lag of individual sites on mentioning a phrase. We also investigate how quickly different media sites mention a phrase. Thus, we define the lag of a site with respect to a given thread to be the time at which the site first mentions the associated quoted phrase, minus the phrase peak time. (Negative lags indicate that the site mentioned the quoted phrase before peak attention.) This measure gives us a sense for how early or late the site takes part in the thread, relative to the bulk of the coverage. Table 1 gives a list of sites with the minimum (i.e. most negative) lags. Notice that early mentioners are blogs and independent media sites; behind them, but still well ahead of the crowd, are large media organizations.

Rank   Lag [h]   Reported   Site
1      -26.5     42         hotair.com
2      -23       33         talkingpointsmemo.com
4      -19.5     56         politicalticker.blogs.cnn.com
5      -18       73         huffingtonpost.com
6      -17       49         digg.com
7      -16       89         breitbart.com
8      -15       31         thepoliticalcarnival.blogspot.com
9      -15       32         talkleft.com
10     -14.5     34         dailykos.com
16     -14       54         blogs.abcnews.com
30     -11       32         uk.reuters.com
34     -11       72         cnn.com
40     -10.5     78         washingtonpost.com
48     -10       53         online.wsj.com
49     -10       54         ap.org

Table 1: How quickly different media sites report a phrase. Lag: median time between the first mention of a phrase on a site and the time when its mentions peaked. Reported: percentage of top 100 phrases that the site mentioned.

Quotes migrating from blogs to news media. The majority of phrases first appear in the news media and then diffuse to blogs, where they are then discussed for a longer time. However, there are also phrases that propagate in the opposite way, percolating in the blogosphere until they are picked up by the news media. Such cases are very important as they show the importance of independent media. While there has been anecdotal evidence of this phenomenon, our approach and the comprehensiveness and scale of our dataset make it possible to automatically find instances of it.

To extract phrases that acquired non-trivial volume earlier in the blogosphere, we use the following simple heuristic. Let $t_m$ denote the median time of news volume for a thread. Then let $f_b$ be the fraction of the total thread volume consisting of blog items dated at least a week before $t_m$. We look for threads for which $0.15 < f_b < 0.5$. Here the threshold of 0.15 ensures that the phrase was sufficiently mentioned on the blogosphere well before the news media peak, and 0.5 selects only phrases that also had a significant presence in the news media. Table 2 lists the highest-volume threads as automatically returned by our rule. Manual inspection indicates that almost all correspond to intuitively natural cases of stories that were first "discovered" by bloggers. Moreover, out of 16,000 frequent phrases we considered in this experiment 760 passed the above filter. Interpreting this ratio in light of our heuristic, it suggests that about 3.5% of quoted phrases tend to percolate from blogs to news media, while diffusion in the other direction is much more common.

M       f_b    Phrase
2,141   .30    Well uh you know I think that whether you're looking at it from a theological perspective or uh a scientific perspective uh answering that question with specificity uh you know is uh above my pay grade.
826     .18    A changing environment will affect Alaska more than any other state because of our location I'm not one though who would attribute it to being man-made.
763     .18    It was Ronald Reagan who said that freedom is always just one generation away from extinction we don't pass it to our children in the bloodstream we have to fight for it and protect it and then hand it to them so that they shall do the same or we're going to find ourselves spending our sunset years telling our children and our children's children about a time in America back in the day when men and women were free.
745     .18    After trying to make experience the issue of this campaign John McCain celebrated his 72nd birthday by appointing a former small town mayor and brand new governor as his vice presidential nominee is this really who the republican party wants to be one heartbeat away from the presidency given Sarah Palin's lack of experience on every front and on nearly every issue this vice presidential pick doesn't show judgement it shows political panic.
670     .38    Clarion fund recently financed the distribution of some 28 million DVDs containing the film obsession radical islam's war against the west in what many political analysts describe as swing states in the upcoming presidential elections.

Table 2: Phrases first discovered by blogs and only later adopted by the news media. M: total phrase volume, f_b: fraction of blog mentions before 1 week of the news media peak.

5. CONCLUSION

We have developed a framework for tracking short, distinctive phrases that travel relatively intact through on-line text and presented scalable algorithms for identifying and clustering textual variants of such phrases that scale to a collection of 90 million articles, which makes the present study one of the largest analyses of on-line news in terms of data scale. Our work offers some of the first quantitative analyses of the global news cycle and the dynamics of information propagation between mainstream and social media. In particular, we observed a typical lag of 2.5 hours between the peaks of attention to a phrase in the news media and in blogs, with a "heartbeat"-like shape of the handoff between news and blogs. We also developed a mathematical model for the kinds of temporal variation that the system exhibits. As information mostly propagates from news to blogs, we also found that in only 3.5% of the cases stories first appear dominantly in the blogosphere and subsequently percolate into the mainstream media.

Our approach to meme-tracking opens an opportunity to pursue long-standing questions that before were effectively impossible to tackle. For example, how can we characterize the dynamics of mutation within phrases? How does information change as it propagates? Over long enough time periods, it may be possible to model the way in which the essential "core" of a widespread quoted phrase emerges and enters popular discourse more generally. One could combine the approaches here with information about the political orientations of the different news media and blog sources [2, 12, 13], to see how particular threads move within and between opposed groups. Introducing such types of orientation is challenging, however, since it requires reliable methods of labeling significant fractions of sources at this scale of data. Finally, a deeper understanding of simple mathematical models for the dynamics of the news cycle would be useful for media analysts; temporal relationships such as we find in Figure 8 suggest the possibility of employing a type of two-species predator-prey model [18] with blogs and the news media as the two interacting participants. More generally, it will be useful to further understand the roles different participants play in the process, as their collective behavior leads directly to the ways in which all of us experience news and its consequences.

Acknowledgements. We thank David Strang and Steve Strogatz for valuable conversations and the creators of Flare and Spinn3r for resources that facilitated the research.

6. REFERENCES

[1] Supporting website: http://memetracker.org/supp
[2] L. Adamic and N. Glance. The political blogosphere and the 2004 U.S. election. Workshop on Link Discovery, 2005.
[3] E. Adar, L. Zhang, L. Adamic, R. Lukose. Implicit structure and dynamics of blogspace. Wks. Weblogging Ecosystem '04.
[4] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Rev. of Modern Phys., 74:47–97, 2002.
[5] J. Allan (ed). Topic Detection and Tracking. Kluwer, 2002.
[6] L. Bennett. News: The Politics of Illusion. A. B. Longman (Classics in Political Science), seventh edition, 2006.
[7] D. Blei, J. Lafferty. Dynamic topic models. ICML, 2006.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, pages 3:993–1022, 2003.
[9] G. Calinescu, H. Karloff, Y. Rabani. An improved approximation algorithm for multiway cut. JCSS 60 (2000).
[10] E. Dahlhaus, D. S. Johnson, C. H. Papadimitriou, P. D. Seymour, and M. Yannakakis. The complexity of multiterminal cuts. SIAM J. Comput., 23(4):864–894, 1994.
[11] E. Gabrilovich, S. Dumais, and E. Horvitz. Newsjunkie: Providing personalized newsfeeds via analysis of information novelty. In WWW '04, 2004.
[12] M. Gamon, S. Basu, D. Belenko, D. Fisher, M. Hurst, and A. C. König. Blews: Using blogs to provide context for news articles. In ICWSM '08, 2008.
[13] N. Godbole, M. Srinivasaiah, and S. Skiena. Large-scale sentiment analysis for news and blogs. In ICWSM '07, 2007.
[14] D. Gruhl, D. Liben-Nowell, R. V. Guha, and A. Tomkins. Information diffusion through blogspace. In WWW '04, 2004.
[15] J. Harsin. The rumour bomb: Theorising the convergence of new and old trends in mediated U.S. politics. Southern Review: Communication, Politics and Culture, 39 (2006).
[16] S. Havre, B. Hetzler, L. Nowell. ThemeRiver: Visualizing theme changes over time. IEEE Symp. Info. Vis. 2000.
[17] J. Kleinberg. Bursty and hierarchical structure in streams. In KDD '02, pages 91–101, 2002.
[18] M. Kot. Elements of Mathematical Ecology. Cambridge University Press, 2001.
[19] B. Kovach and T. Rosenstiel. Warp Speed: America in the Age of Mixed Media. Century Foundation Press, 1999.
[20] R. Kumar, J. Novak, P. Raghavan, and A. Tomkins. Structure and evolution of blogspace. CACM, 47(12):35–39, 2004.
[21] M. Lacker and C. Peskin. Control of ovulation number in a model of ovarian follicular maturation. In AMS Symposium on Mathematical Biology, pages 21–32, 1981.
[22] P. F. Lazarsfeld, B. Berelson, and H. Gaudet. The People's Choice. Duell, Sloan, and Pearce, 1944.
[23] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, M. Hurst. Cascading behavior in large blog graphs. SDM '07.
[24] R. D. Malmgren, D. B. Stouffer, A. Motter, and L. A. N. Amaral. A poissonian explanation for heavy tails in e-mail communication. PNAS, to appear, 2008.
[25] J. Schmidt. Blogging practices: An analytical framework. Journal of Computer-Mediated Communication, 12(4), 2007.
[26] J. Singer. The political j-blogger. Journalism, 6 (2005).
[27] Spinn3r API. http://www.spinn3r.com. 2008.
[28] M. L. Stein, S. Paterno, and R. C. Burnett. Newswriter's Handbook: An Introduction to Journalism. Blackwell, 2006.
[29] A. Vazquez, J. G. Oliveira, Z. Deszo, K.-I. Goh, I. Kondor, and A.-L. Barabasi. Modeling bursts and heavy tails in human dynamics. Physical Review E, 73(036127), 2006.
[30] X. Wang and A. McCallum. Topics over time: a non-markov continuous-time model of topical trends. Proc. KDD, 2006.
[31] X. Wang, C. Zhai, X. Hu, R. Sproat. Mining correlated bursty topic patterns from coordinated text streams. KDD, 2007.
[32] F. Wu and B. Huberman. Novelty and collective attention. Proc. Natl. Acad. Sci. USA, 104, 2007.