Patterns of Temporal Variation in Online Media

Jaewon Yang, Stanford University, crucis@stanford.edu
Jure Leskovec, Stanford University, jure@cs.stanford.edu

ABSTRACT
Online content exhibits rich temporal dynamics, and diverse real-time user generated content further intensifies this process. However, temporal patterns by which online content grows and fades over time, and by which different pieces of content compete for attention remain largely unexplored. We study temporal patterns associated with online content and how the content's popularity grows and fades over time. The attention that content receives on the Web varies depending on many factors and occurs on very different time scales and at different resolutions. In order to uncover the temporal dynamics of online content we formulate a time series clustering problem using a similarity metric that is invariant to scaling and shifting. We develop the K-Spectral Centroid (K-SC) clustering algorithm that effectively finds cluster centroids with our similarity measure. By applying an adaptive wavelet-based incremental approach to clustering, we scale K-SC to large data sets. We demonstrate our approach on two massive datasets: a set of 580 million Tweets, and a set of 170 million blog posts and news media articles. We find that K-SC outperforms the K-means clustering algorithm in finding distinct shapes of time series. Our analysis shows that there are six main temporal shapes of attention of online content. We also present a simple model that reliably predicts the shape of attention by using information about only a small number of participants. Our analyses offer insight into common temporal patterns of the content on the Web and broaden the understanding of the dynamics of human attention.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: [Clustering]

General Terms
Algorithm, Measurement

Keywords
Social Media, Time Series Clustering

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WSDM'11, February 9–12, 2011, Hong Kong, China. Copyright 2011 ACM 978-1-4503-0493-1/11/02...$10.00.

Figure 1: Short textual phrases and Twitter hashtags exhibit large variability in number of mentions over time. [Plot: number of mentions per hour versus time (hours) for "I will sign this legislation...", "Lipstick on a pig", the average phrase, and #iranElection.]

1. INTRODUCTION
Online information is becoming increasingly dynamic and the emergence of online social media and rich user-generated content further intensifies this phenomenon. Popularity of various pieces of content on the Web, like news articles [30], blog posts [21, 27], videos [10], posts in online discussion forums [4] and product reviews [13], varies on very different temporal scales. For example, content on micro-blogging platforms, like Twitter [15, 34], is very volatile, and pieces of content become popular and fade away in a matter of hours. Short quoted textual phrases (“memes”) rise and decay on a temporal scale of days, and represent an integral part of the “news cycle” [22]. Temporal variation of named entities and general themes (like “economy” or “Obama”) exhibits variations at even larger temporal scale [3, 14, 31].

However, uncovering patterns of temporal variation on the Web is difficult because human behavior behind the temporal variation is highly unpredictable. Previous research on the timing of an individual's activity has reported that human actions range from random [26] to highly correlated [6]. Although the aggregate dynamics of individual activities tends to create seasonal trends or simple patterns, sometimes collective actions of people and the effects of personal networks result in a deviation from trends. Moreover, not all individuals are the same. For example, some act as “influentials” [33]. The overall picture of temporal activity on the Web is even more complex due to the interactions between individuals, small groups, and corporations. Bloggers and mainstream media are both producing and pushing new content into the system [16]. The content then gets adopted through personal social networks and discussed as it diffuses through the Web. Despite extensive qualitative research, there has been little work about temporal patterns by which content grows and fades over time and by which different
pieces of content compete for attention during this process.

Temporal patterns of online content. Here we study what temporal patterns exist in the popularity of content in social media. The popularity of online content varies rapidly and exhibits many different temporal patterns. We aim to uncover and detect such temporal patterns of online textual content. More specifically, we focus on the propagation of the hashtags on Twitter, and the quotation of short textual phrases in the news articles and blog posts on the Web. Such content exhibits rich temporal dynamics [21, 22, 26] and is a direct reflection of the attention that people pay to various topics. Moreover, the online media space is occupied by a wide spectrum of very distinct participants. First of all, there are many personal blogs and Twitter accounts, with a relatively small readership. Secondly, there are professional bloggers and small community-driven or professional online media sites (like The Huffington Post) that have specialized interests and respond quickly to events. Finally, mainstream mass media, like TV stations (e.g., CNN), large newspapers (e.g., The Washington Post) and news agencies (e.g., Reuters) all produce content and push it to the other contributors mentioned above. We aim to understand what kinds of temporal variations are exhibited by online content, how different media sites shape the temporal dynamics, and what kinds of temporal patterns they produce and influence.

The approach. We analyze a set of more than 170 million news articles and blog posts over a period of one year. In addition, we examine the adoption of Twitter hashtags in a massive set of 580 million Twitter posts collected over an 8 month period. We measure the attention given to various pieces of content by tracing the number of mentions (i.e., volume) over time. We formulate a time series clustering problem and use a time series shape similarity metric that is invariant to the total volume (popularity) and the time of peak activity. To find the common temporal patterns, we develop a K-Spectral Centroid (K-SC) clustering algorithm that allows the efficient computation of cluster centroids under our distance metric. We find that K-SC is more useful in finding diverse temporal patterns than the K-means clustering algorithm [17]. We develop an incremental approach based on Haar Wavelets to improve the scalability of K-SC for high-dimensional time series.

Findings. We find that temporal variation of popularity of content in online social media can be accurately described by a small set of time series shapes. Surprisingly, we find that both the adoption of hashtags in Twitter and the propagation of quoted phrases on the Web exhibit nearly identical temporal patterns. We find that such patterns are governed by a particular type of online media. Most press agency news exhibits a very rapid rise followed by a relatively slow decay, whereas bloggers play a very important role in determining the longevity of news on the Web. Depending on when bloggers start participating in the online discourse the news story may experience one or more rebounds in its popularity.

Moreover, we present a simple predictive model which, based on timings of only a few sites or Twitter users, predicts with 75% accuracy which of the temporal patterns the popularity time series will follow. We also observe complex interactions between different types of participants in the online discourse.

Consequences and applications. More generally, our work develops scalable computational tools to further extend understanding of the roles that different participants play in the online media space. We find that the collective behavior of various participants governs how we experience new content and react to it. Our results have direct applications for predicting the overall popularity and temporal trends exhibited by the online content. Moreover, our results can be used for better placing of content to maximize click-through rates [5] and for finding influential blogs and Twitter users [23].

2. FINDING TEMPORAL PATTERNS
In this section, we formally define the problem and then propose the K-Spectral Centroid (K-SC) clustering algorithm.

We start by assuming that we are given a time series of mentions or interactions with a particular piece of content. This could be a time series of clicks or plays of a popular video on YouTube, the number of times an article on a popular newspaper website was read, or the number of times that a popular hashtag in Twitter was used. Now we want to find patterns in the temporal variation of time series that are shared by many pieces of content.

We formally define this as a problem of clustering time series based on their shape. Given that online content has large variation in total popularity and occurs at very different times, we will first adopt a time series similarity metric that is invariant to scaling and shifting. Based on this metric, we develop a novel algorithm for clustering time series. Finally, we present a speed-up technique that greatly reduces the runtime and allows for scaling to large datasets.

2.1 Problem definition
We are given $N$ items of content and for each item $i$ we have a set of traces of the form $(s_j, t_j)_i$, which means that site $s_j$ mentioned item $i$ at time $t_j$. From these $N$ traces, we then construct a discrete time series $x_i(t)$ by counting the number of mentions of item $i$ at time interval $t$. Simply, we create a time series of the number of mentions of item $i$ at time $t$, where $t$ is measured in some time unit, e.g., hours. Intuitively, $x_i$ measures the popularity or attention given to item $i$ over time. For convenience let us also assume that all time series $x_i$ have the same length, $L$. The shape of the time series $x_i$ simply represents how the popularity or attention to item $i$ changed over time. We then aim to group together items so that items in the same group have a similar shape of the time series $x_i$. This way we can infer what items have a similar temporal pattern of popularity, and we can then consider the center of each cluster as the representative common pattern of the group.

2.2 Measure of time series shape similarity
In order to perform the clustering based on the shape of the item popularity curve, we first discuss how we can measure the shape similarity of two time series.

Figure 1 shows an example of temporal variability in the number of mentions of different textual phrases and Twitter hashtags. We plot the average popularity curve of 1,000 phrases with largest overall volume (after aligning them so that they all peak at the same time). The figure shows two individual phrases. First is the quote from U.S. president Barack Obama about the stimulus bill: “I will sign this legislation into law shortly and we'll begin making the immediate investments necessary to put people back to work doing the work America needs done.” and the second is the “Lipstick on a pig” phrase from the 2008 U.S. presidential election campaign. Notice the large difference among the patterns. Whereas average phrases almost symmetrically rise and fade, “Lipstick on a pig” has two spikes with the second being higher than the first, while the stimulus bill phrase shows a long streak of moderate activity.

A wide range of measures of time series similarity and approaches to time series clustering have been proposed and investigated. However, the problem we are addressing here has several characteristics that make our setting somewhat different, and thus common metrics such as Euclidean or Dynamic Time Warping are inappropriate in our case for at least two reasons. First, if two time series have very similar shape but different overall volume, they should still be considered similar. Thus, scaling the time series on the y-axis should not change the similarity. Second, different items appear and spike at different times. Again, even though two time series may be shifted, they should be considered similar provided that they have similar shape. Thus, translating time series on the time axis should not change the similarity between the two time series.

Time series similarity measure. As described above we require a time series similarity measure that is invariant to scaling and translation and allows for efficient computation.

Since the time series of popularity of items on the Web typically exhibit bursty and spiky behavior [10], one could address the invariance to translation by aligning the time series to peak at the same time. Even so, many challenges remain. For example, what exactly do we mean by “peak”? Is it the time of the peak popularity? How do we measure popularity? Should we align a smoothed version of the time series? How much should we smooth?

Even if we assume that somehow peak alignment works, the overall volume of distinct time series is too diverse to be directly compared. One might normalize each time series by some (necessarily arbitrary) criteria and then apply a simple distance measure such as the Euclidean norm. However, there are numerous ways to normalize and scale the time series. We could normalize so that total time series volume is 1, that the peak volume is 1, etc.

For example, Figure 2 illustrates the ambiguity of choosing a time series normalization method. Here we aim to group time series S1, ..., S4 in two clusters, where S1 and S2 have two peaks and S3 and S4 have only one sharp peak. First we align and scale time series by their peak volume and run the K-Means algorithm using Euclidean distance (bottom figures in (B) and (C)). (We choose this time series normalization method because we found it to perform best in our experiments in Section 3.) However, the K-Means algorithm identifies wrong clusters {S2, S3, S4} and {S1}. This is because the peak normalization tends to focus on the global peak and ignores other smaller peaks (figure (B)). To tackle this problem we adopt a different time series distance measure and develop a new clustering algorithm, which does not suffer from such behavior, i.e., it groups together the two-peaked time series S1 and S2, and puts single-peaked time series S3 and S4 in the other cluster.

Figure 2: (A) Four time series, S1, ..., S4. (B) Time series after scaling and alignment. (C) Cluster centroids. K-Means wrongly puts {S1} in its own cluster and {S2, S3, S4} in the second cluster, while K-SC nicely identifies clusters of two vs. single peaked time series.

First, we adopt a distance measure that is invariant to scaling and translation of the time series [9]. Given two time series $x$ and $y$, the distance $\hat{d}(x, y)$ between them is defined as follows:

$$\hat{d}(x, y) = \min_{\alpha, q} \frac{\| x - \alpha y_{(q)} \|}{\| x \|} \qquad (1)$$

where $y_{(q)}$ is the result of shifting time series $y$ by $q$ time units, and $\| \cdot \|$ is the $l_2$ norm. This measure finds the optimal alignment (translation $q$) and the scaling coefficient $\alpha$ for matching the shapes of the two time series. The computational complexity of this operation is reasonable since we can find a closed-form expression to compute the optimal $\alpha$ for fixed $q$. With $q$ fixed, $\| x - \alpha y_{(q)} \| / \| x \|$ is a convex function of $\alpha$, and therefore we can find the optimal $\alpha$ by setting the gradient to zero: $\alpha = x^T y_{(q)} / \| y_{(q)} \|^2$. Also, note that $\hat{d}(x, y)$ is symmetric in $x$ and $y$ (refer to the extended version for details [1]).

Whereas one can quickly find the optimal value of $\alpha$, there is no simple way to find the optimal $q$. In practice we first find the alignment $q_0$ that makes the time series peak at the same time and then search for the optimal $q$ around $q_0$. In our experiments, the starting point $q_0$ is very close to the optimal since most of our time series have a very sharp peak volume, as shown in Section 3. Therefore, this heuristic finds $q$ that is close to the optimal very quickly.

2.3 K-Spectral Centroid Clustering
Next, we present the K-Spectral Centroid (K-SC) clustering algorithm that finds clusters of time series that share a distinct temporal pattern. K-SC is an iterative algorithm similar to the classical K-means clustering algorithm [17] but enables efficient centroid computation under the scale and shift invariant distance metric that we use. K-means iterates a two step procedure, the assignment step and the refinement step. In the assignment step, K-means assigns each item to the cluster closest to it. In the refinement step the cluster centroids are then updated. By repeating these two steps, K-means minimizes the sum of the squared Euclidean distances between the members of the same cluster. Similarly, K-SC alternates the two steps to minimize the sum of squared distances, but the distance metric is not Euclidean but our distance metric $\hat{d}(x, y)$. K-means simply takes the average over all items in the cluster as the cluster centroid, which is inappropriate when we use our metric $\hat{d}(x, y)$. Therefore, we develop a K-Spectral Centroid (K-SC) clustering algorithm which appropriately computes cluster centroids under the time series distance metric $\hat{d}(x, y)$.

Figure 3: (a) A cluster of 7 single-peaked time series: 6 have the shape M1, and one has the shape M2. (b) The cluster centers found by K-Means (KM) and KSC. The KSC cluster center is less affected by the outlier and better represents the common shape of time series in the cluster. [Panels: (a) Time series; (b) Cluster center.]

For example, in Figure 2, K-SC discovers the correct clusters (blue group in panel (A)). When $\hat{d}(x, y)$ is used to compute the distance between S2 and the other time series, $\hat{d}(x, y)$ finds the optimal scaling of other time series with respect to S2. Then, S1 is much closer to S2 than S3 and S4, as it can match the variation in the second peak of S2 with the proper scaling (panel B). Because of accurate clustering, K-SC computes the common shape shared by the time series in the cluster (panel C).

Moreover, even if K-means and K-SC find the same clustering, the cluster center found by K-SC is more informative. In Figure 3, we show a cluster of single-peaked time series, and try to observe the common shape of time series in the cluster by computing a cluster center by K-means and K-SC. Since the cluster has 6 time series of the same shape (M1) and one outlier (M2), we want the cluster center to be similar to M1. Observe that K-SC finds a better center than K-means. As K-means computes the average shape of time series for a cluster center, the resulting center is sensitive to outliers, whereas K-SC scales each time series differently to find a cluster center, and this scaling decreases the influence of outliers.

More formally, we are given a set of time series $x_i$, and the number of clusters $K$. The goal then is to find for each cluster $k$ an assignment $C_k$ of time series to the cluster, and the centroid $\mu_k$ of the cluster, that minimize a function $F$ defined as follows:

$$F = \sum_{k=1}^{K} \sum_{x_i \in C_k} \hat{d}(x_i, \mu_k)^2. \qquad (2)$$

We start the K-SC algorithm with a random initialization of the cluster centers. In the assignment step, we assign each $x_i$ to the closest cluster, based on $\hat{d}(x, y)$. This is identical to the assignment step of K-means except that it uses a different distance metric. After finding the cluster membership of every time series, we update the cluster centroid. Simply updating the new center as the average of all members of the cluster is inappropriate, as this is not the minimizer of the sum of squared distances to members of the cluster (under $\hat{d}(x, y)$). The new cluster center $\mu_k^*$ should be the minimizer of the sum of $\hat{d}(x_i, \mu_k)^2$ over all $x_i \in C_k$:

$$\mu_k^* = \arg\min_{\mu} \sum_{x_i \in C_k} \hat{d}(x_i, \mu)^2. \qquad (3)$$

Since K-SC is an iterative algorithm it needs to update the cluster centroids many times before it converges. Thus it is crucial to find an efficient way to solve the above minimization problem. Next, we show that Eq. 3 has a unique minimizer that can be expressed in a closed form. We first combine Eqs. 1 and 3, using the symmetry of $\hat{d}$ to write each distance relative to $\mu$:

$$\mu_k^* = \arg\min_{\mu} \sum_{x_i \in C_k} \min_{\alpha_i, q_i} \frac{\| \alpha_i x_{i(q_i)} - \mu \|^2}{\| \mu \|^2}$$

Since we find the optimal translation $q_i$ in the assignment step of K-SC, consider (without loss of generality) that $x_i$ is already shifted by $q_i$. We then replace $\alpha_i$ with its optimal value (Sec. 2.2):

$$\mu_k^* = \arg\min_{\mu} \frac{1}{\| \mu \|^2} \sum_{x_i \in C_k} \left\| \frac{x_i^T \mu}{\| x_i \|^2} x_i - \mu \right\|^2$$

We flip the order of $x_i^T \mu \, x_i$ and simplify the expression:

$$\mu_k^* = \arg\min_{\mu} \frac{1}{\| \mu \|^2} \sum_{x_i \in C_k} \left\| \frac{x_i x_i^T \mu}{\| x_i \|^2} - \mu \right\|^2$$

$$= \arg\min_{\mu} \frac{1}{\| \mu \|^2} \sum_{x_i \in C_k} \left\| \left( \frac{x_i x_i^T}{\| x_i \|^2} - I \right) \mu \right\|^2$$

$$= \arg\min_{\mu} \frac{1}{\| \mu \|^2} \, \mu^T \sum_{x_i \in C_k} \left( I - \frac{x_i x_i^T}{\| x_i \|^2} \right) \mu$$

Finally, substituting $\sum_{x_i \in C_k} \left( I - \frac{x_i x_i^T}{\| x_i \|^2} \right)$ by $M$ leads to the following minimization problem:

$$\mu_k^* = \arg\min_{\mu} \frac{\mu^T M \mu}{\| \mu \|^2}. \qquad (4)$$

The solution of this problem is the eigenvector $u_m$ corresponding to the smallest eigenvalue $\lambda_m$ of matrix $M$ [12]. If we transform $\mu$ by multiplying the eigenvectors of $M$, then $\mu^T M \mu$ is equivalent to the weighted sum of the eigenvalues of $M$, whose smallest element is $\lambda_m \| \mu \|^2$. Therefore, the minimum of Eq. 4 is $\lambda_m$ and letting $\mu = u_m$ achieves the minimum. As $M$ is given by the $x_i$'s, we simply find the smallest eigenvector of $M$ for the new cluster center $\mu_k^*$. Since $\mu_k^*$ minimizes the spectral norm of $M$, we call $\mu_k^*$ the Spectral Centroid, and call the whole algorithm the K-Spectral Centroid (K-SC) clustering (Algorithm 1).

Algorithm 1 K-SC clustering algorithm: K-SC(x, C, K)
Require: Time series $x_i$, $i = 1, 2, \ldots, N$; the number of clusters $K$; initial cluster assignments $C = \{C_1, \ldots, C_K\}$
    repeat
        $\hat{C} \leftarrow C$
        for $j = 1$ to $K$ do {Refinement step}
            $M \leftarrow \sum_{i \in C_j} \left( I - \frac{x_i x_i^T}{\| x_i \|^2} \right)$
            $\mu_j \leftarrow$ the smallest eigenvector of $M$
            $C_j \leftarrow \emptyset$
        end for
        for $i = 1$ to $N$ do {Assignment step}
            $j^* \leftarrow \arg\min_{j = 1, \ldots, K} \hat{d}(x_i, \mu_j)$
            $C_{j^*} \leftarrow C_{j^*} \cup \{i\}$
        end for
    until $\hat{C} = C$
    return $C, \mu_1, \ldots, \mu_K$

2.4 Incremental K-SC algorithm
Since our time series are usually quite long and go into hundreds and sometimes thousands of elements, scalability of K-SC is important. Let us denote the number of time series by $N$, the number of clusters by $K$, and the length of the time series by $L$. The refinement step of K-SC computes $M$ first, and then finds its eigenvectors. Computing $M$ takes $O(L^2)$ for each $x_j$, and finding the eigenvectors of $M$ takes $O(L^3)$. Thus, the runtime of the refinement step is dominated by $O(\max(NL^2, KL^3))$. However, the assignment step takes only $O(KNL)$, and therefore the complexity of one iteration of K-SC is $O(\max(NL^2, KL^3))$.

A cubic complexity in $L$ is clearly an obstacle for K-SC to be used on large datasets. Moreover, there is another reason why applying K-SC directly to high dimensional data is not desirable. Like K-means, K-SC is a greedy hill-climbing algorithm for optimizing a non-convex objective function. Since K-SC starts at some initial point and then greedily optimizes the objective function, the rate of convergence is very sensitive to the initialization of the cluster centers [28]. If the initial centers are poorly chosen, the algorithm may be very slow, especially if $N$ or $L$ are large.

We address these two problems by adopting an approach similar to Incremental K-means [28] which utilizes the multi-resolution property of the Discrete Haar Wavelet Transform (DHWT) [7]. It operates as follows: the first few coefficients of the DHWT decomposition contain an approximation of the original time series at very coarse resolution, while additional coefficients show information in higher resolution. Given a set of time series $x$, we compute the Haar Wavelet decomposition for every time series $x_i$. The DHWT computation is fast, taking $O(L)$ for each time series.

By taking the first few coefficients of the Haar Wavelet decomposition of the time series, we approximate the time series at very coarse granularity. Thus, we first cluster the coarse-grained representations of the time series using the K-SC algorithm. In this case K-SC will run very quickly and will also be robust with respect to random initialization of the cluster centers. Then, we move to the next level of resolution of the time series and use the assignments from the previous iteration of K-SC as the initial assignments at the current level. We repeat this procedure until we reach the full resolution of the time series, i.e., all wavelet coefficients are used. Even when we are working with full resolution time series, K-SC converges much faster than if we started K-SC from a random initialization, since we start very close to the optimal point. Alg. 2 gives the pseudo-code of the Incremental K-SC algorithm.

Algorithm 2 Incremental K-SC
Require: Time series $x_i$, $i = 1, 2, \ldots, N$; the number of clusters $K$; initial assignments $C = \{C_1, \ldots, C_K\}$; start level $S$; the length $L$ of the $x_i$'s
    for $i = 1$ to $N$ do
        $z_i \leftarrow$ DiscreteHaarWaveletTransform($x_i$)
    end for
    for $j = S$ to $\log_2(L)$ do
        for $i = 1$ to $N$ do
            $y_i \leftarrow$ InverseDiscreteHaarWaveletTransform($z_i(1:2^j)$) {$z_i(1:n)$ means the first $n$ elements of $z_i$}
        end for
        $(C, \mu_1, \ldots, \mu_K) \leftarrow$ K-SC($y$, $C$, $K$)
    end for
    return $C, \mu_1, \ldots, \mu_K$

3. EXPERIMENTAL RESULTS
Next we describe the data, experimental setup, and evaluation of the clusters we find. We describe our findings in Section 4.

3.1 Experimental setup
First we apply our algorithm to a dataset of more than 172 million news articles and blog posts collected from 1 million online sources during a one-year period from September 1, 2008 to August 31, 2009. We use the MemeTracker [22] methodology to identify short quoted textual phrases and extract more than 343 million short phrases. To observe the complete lifetime of a phrase, we only keep phrases that first appeared after September 5. This step removes the phrases quoted repeatedly without a reference to a certain event, such as “I love you.”

After these preprocessing steps, we choose the 1,000 most frequent phrases and for each phrase create a time series of the number of mentions (i.e., volume) per unit time interval. To reduce rapid fluctuation in the time series, we apply Gaussian kernel smoothing.

Figure 4: Width of the peak of the time series versus the fraction of the threshold we set. [Plot: peak width (hours) versus ratio of the threshold to the peak value; curves for the medians of $T_p - T_1$, $T_2 - T_p$, and $T_2 - T_1$.]

Choosing the time series length. In principle the time series of each phrase contains $L = 8{,}760$ elements (i.e., the number of hours in 1 year). However, the volume of phrases tends to be concentrated around a peak [22], and thus taking such a long time series would not be a good idea. For example, suppose we measure the similarity between two phrases that are actively quoted for one week and abandoned for the rest of the time. We would be interested mainly in the differences between them during their active week. However, the differences in inactive periods may not be zero due to noise, and these small differences can dominate the overall similarity since they are accumulated over a long period. Therefore, we truncate the time series to focus on the “interesting” part of the time series.

To set the length of truncation, we measure how long the peak popularity spreads out: let $T_p$ be the time when the phrase reached peak volume, and let $v_p$ be the phrase volume at that time (i.e., number of mentions at hour $T_p$). For a threshold $x v_p$ (for $0 < x < 1$), we go back in time from $T_p$ of a given phrase and record as $T_1(x)$ the last time index when the phrase's volume gets below $x v_p$. Next, we go forward in time from $T_p$ and mark the first time index when its volume gets below the threshold as $T_2(x)$. Thus, $T_1(x)$ measures the width of the peak from the left, and $T_2(x)$ measures the width of the peak from the right.

Figure 4 plots the median value of $T_p - T_1(x)$, $T_2(x) - T_p$ and $T_2(x) - T_1(x)$ as a function of $x$. We note that most phrases maintain nontrivial volume for a very short time. For example, it takes only 40 hours
for the phrase volume to rise from 10% of the peak volume, reach the peak, and fall again below 10% of the peak volume. In general, the volume curve tends to be skewed to the right (i.e., $T_p - T_1(x)$ and $T_2(x) - T_p$ are far apart for small $x$). This means that in general the volume of phrases rather quickly reaches its peak and then slowly falls off.

Given the above results, we truncate the length of the time series to 128 hours, and shift it such that it peaks at 1/3 of the entire length of the time series (i.e., the 43rd index).

Choosing the number of clusters. The K-SC algorithm, like all the other variants of K-means, requires the number of clusters to be specified in advance. Although it is an open question how to choose the most appropriate number of clusters, we measure how the quality of clustering varies with the number of clusters. We ran K-SC with a different number of clusters, and measured Hartigan's Index and the Average Silhouette [17]. Figure 5 shows the values of the two measures as a function of the number of clusters. The higher the value the better the clustering. The two metrics do not necessarily agree with each other, but Figure 5 suggests that a lower value of $K$ gives better results. We chose $K = 6$ as the number of clusters. We also experimented with $K \in \{3, \ldots, 12\}$ and found that clusterings are quite stable. Even when $K = 12$, all of the 12 clusters are essentially variants of the clusters that we find using $K = 6$ (refer to the extended version of the paper [1] for details).

Figure 5: Clustering quality versus the number of clusters. [Panels: (a) Hartigan's Index; (b) Average Silhouette; both plotted against the number of clusters.]

3.2 Performance of K-SC Algorithm
Having described the data preprocessing and K-SC parameter settings, we evaluate the performance of our algorithm in terms of quality and speed. We compare the result of K-SC to that of K-means, which uses the Euclidean time series distance metric. In particular, we evaluate two variants of K-means that differ in the way we scale and align the time series. First, we align the time series to peak at the same time but do not scale them in the y-axis. In the second variant we not only align the time series but also scale them in the y-axis so that they all have the peak volume of 100.

For each algorithm we compute two performance metrics: (a) the value of the objective function $F$ as defined in Equation 2, and (b) the sum of the squared distances between the cluster centers, $\sum \hat{d}(\mu_i, \mu_j)^2$. Function $F$ measures the compactness of the clusters, while the distances between the cluster centers measure the diversity of the clusters. Thus, a good clustering has a low value of $F$ and large distances between the cluster centers. Table 1 presents the results and shows that K-SC achieves the smallest value of $F$, and the biggest distance between the clusters. Note that it is not trivial that K-SC achieves a bigger value of $\sum \hat{d}(\mu_i, \mu_j)^2$ than K-means, because K-SC does not optimize $\sum \hat{d}(\mu_i, \mu_j)^2$. Manual inspection of the clusters from each algorithm also suggests that K-means clusters are harder to interpret than K-SC clusters, and their shapes are less diverse than those of K-SC clusters. We also experimented with normalizing the sum of each time series to the same value and with standardization of the time series but were unable to make K-means work well (refer to [1] for details). We also note that K-SC does not require any kind of normalization, and performs better than K-means with the best normalization method.

Method    F (lower is better)    $\sum \hat{d}(\mu_i, \mu_j)^2$ (higher is better)
KM-NS     122.12                 2.12
KM-P      76.25                  3.94
K-SC      64.75                  4.53

Table 1: Cluster quality. KM-NS: K-means with peak alignment but no scaling; KM-P: K-means with alignment and scaling. Notice that K-SC improves over K-means in both criteria.

Scalability of K-SC. Last, we analyze the effect of the Wavelet-based incremental clustering procedure on the runtime of K-SC. In our dataset, we use relatively short time series ($L = 128$) and K-SC can easily handle them. In the following experiment we show that K-SC is generally applicable even if time series span more than hundreds of time indexes.

We assume that we are dealing with much longer time series. In particular, we take $L = 512$ instead of $L = 128$ for truncating the time series, and run Incremental K-SC five times, increasing the number of time series from 100 to 3,000. As a baseline, we perform K-SC without the incremental wavelet-based approach. Figure 6 shows the average values and the variances of the runtime with respect to the size of the dataset. The incremental approach reduces the runtime significantly. While the runtime of naive K-SC grows very quickly with the size of the data, the runtime of incremental K-SC grows much more slowly. Furthermore, notice also that the error bars on the runtime of incremental K-SC are very small. This means that incremental K-SC is also much more robust to the initialization of the cluster centers than naive K-SC, in that it takes almost the same time to perform the clustering regardless of the initial conditions.

Figure 6: Runtime of Naive K-SC and Incremental K-SC. [Plot: runtime (sec.) versus the number of time series (x1000).]

4. EXPERIMENTS ON MEMETRACKER
Now we describe the temporal patterns of the Memetracker phrases as identified by our K-SC algorithm. Figure 7 shows the cluster centers for $K = 6$ clusters, and Table 2 gives further descriptive statistics for each of the six clusters. We order the clusters so that C1 is the largest and C6 is the smallest. Notice the high variability in the cluster shapes. The largest three clusters in the top row exhibit somewhat different but still very spiky temporal behavior, where the peak lasts for less than 1 day. On the other hand, in the latter three clusters the peak lasts longer than one day. Although we present the clustering of the top 1,000 most frequent phrases, more than half of the phrases lose their attention after a single day.

Cluster    C1       C2       C3       C4       C5       C6
f_c        28.7%    23.2%    18.1%    13.3%    10.3%    6.4%
V          681      704      613      677      719      800
V_128      463      246      528      502      466      295
V_P        54       74       99       51       41       32
V_P/V      7.9%     10.6%    16.2%    7.5%     5.7%     4.0%
L_b        1.48     0.21     1.30     1.97     1.59     -0.34
F_B        33.3%    42.9%    29.1%    36.2%    45.0%    53.1%
F_B128     27.4%    35.6%    28.5%    32.6%    36.5%    53.4%

Table 2: Statistics of the clusters from Figure 7. f_c: Fraction of phrases in the cluster, V: Total volume (over 1 year), V_128: Volume around the peak (128 hours), V_P: Volume at the peak (1 hour), V_P/V: Peak to total volume, L_b: Blog Lag (hours), F_B: Fraction of blog volume over 1 year, F_B128: Fraction of blog volume around the peak.

The biggest cluster, C1, is the most spread out of all the “single peak” clusters that all share the common quick rise followed by a monotone decay. Notice that C1 looks very much like the average of all the phrases in Figure 1. This is natural because the average pattern would be likely to occur in a large number of phrases.
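As a reference point for how cluster centers like the ones discussed here can be computed, the following is a minimal Python sketch of the distance of Eq. 1 and the spectral centroid of Eq. 4. It is an illustration under stated assumptions, not the authors' implementation: the function names (`shift`, `ksc_distance`, `spectral_centroid`), the zero-padded shift, and the brute-force search over a window `max_shift` around zero (in place of the peak-alignment heuristic of Section 2.2) are choices made here for brevity. The centroid routine assumes the member series are nonzero and already aligned.

```python
import numpy as np

def shift(y, q):
    """Shift y by q time units, padding with zeros (assumption: no wrap-around)."""
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    if q > 0:
        out[q:] = y[:-q]
    elif q < 0:
        out[:q] = y[-q:]
    else:
        out[:] = y
    return out

def ksc_distance(x, y, max_shift=5):
    """Eq. 1: min over alpha and q of ||x - alpha * y_(q)|| / ||x||.

    Returns (distance, best shift). The shift is searched exhaustively in
    [-max_shift, max_shift], which is a simplification made in this sketch.
    """
    x = np.asarray(x, dtype=float)
    norm_x = np.linalg.norm(x)
    best_d, best_q = np.inf, 0
    for q in range(-max_shift, max_shift + 1):
        yq = shift(y, q)
        denom = yq @ yq
        # Closed-form optimal scaling for a fixed shift (Section 2.2).
        alpha = (x @ yq) / denom if denom > 0 else 0.0
        d = np.linalg.norm(x - alpha * yq) / norm_x
        if d < best_d:
            best_d, best_q = d, q
    return best_d, best_q

def spectral_centroid(X):
    """Eq. 4: eigenvector of M = sum_i (I - x_i x_i^T / ||x_i||^2)
    belonging to the smallest eigenvalue. Rows of X are aligned series."""
    X = np.asarray(X, dtype=float)
    n, L = X.shape
    M = n * np.eye(L)
    for x in X:
        M -= np.outer(x, x) / (x @ x)
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    mu = eigvecs[:, 0]
    # The eigenvector is defined up to sign; prefer a non-negative total.
    return -mu if mu.sum() < 0 else mu
```

Because the centroid is an eigenvector, it is determined only up to scale and sign; any positive rescaling is an equally valid center under $\hat{d}$, so the sketch only fixes the sign.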
Figure 7: Clusters identified by K-SC. We also plot the average of the time when a particular type of website first mentions the phrases in each cluster. The horizontal position corresponds to the average time of the first mention. P: professional blog, N: newspaper, A: news agency, T: TV station, B: blog aggregator. [Panels: (a) Cluster C1; (b) Cluster C2; (c) Cluster C3; (d) Cluster C4; (e) Cluster C5; (f) Cluster C6; each plotted against time (hours).]

Cluster C2 is narrower and has a quicker rise and decay than C1. Whereas C1 is not entirely symmetric, the rise and decay of C2 occur at around the same rate. C3 is characterized by a super quick rise just 1 hour before the peak and a slower decay than C1 and C2. The next two clusters, C4 and C5, experience a rebound in their popularity and have two peaks about 24 hours apart. While C4 experiences a big peak on the first day and a smaller peak on the second day, C5 does exactly the opposite. It has a small peak on the first day and a larger one on the second day. Finally, phrases in Cluster C6 stay popular for more than three days after the peak, with the height of the local peaks slowly declining.

Cluster statistics. We also collected statistics about the phrases in each of the clusters (Table 2). For each cluster we compute the following median statistics over all phrases in the cluster: the total phrase volume over the entire 1 year period, volume in the 128 hour period around the peak, the volume during the hour around the peak, and the ratio between the two. We also quantify the Blog Lag as follows: we use the classification of Google News and label all the sites indexed by Google News as mainstream media and all the other sites as blogs [22]. Then for each phrase we define Blog Lag as the difference between the median of the time when news media quote the phrase and the median of the time when blogs quote the phrase. Note that positive Blog Lag means that blogs trail mainstream media. Lastly, we compute the ratio of volume coming from the blogs to the total phrase volume for the two time horizons, one year and 128 hours around the peak.

We find that cluster C1 shows moderate values in most categories, confirming that this cluster is closest to the behavior of a typical phrase. Clusters C2 and C3 have sharp peaks, but their total volume around the peak is significantly different. This difference comes from the reaction of mainstream media. Although both clusters have a higher peak than other clusters, 74 and 99 respectively, the volume of C2 coming from mainstream media is only 30% of that of C3. Interestingly enough, the phrases in C3 have the largest volume around the peak and also by far the highest peak volume. The dominant force here is the attention from news media, because C3 shows the smallest fraction of the blog volume. The next two clusters, C4 and C5, have two peaks and are mirror versions of each other. They also show similar values for most categories. The only difference is that C4 has bigger volume from mainstream media and gets mentions from blogs for a longer time, which results in the larger value of the total volume around the peak. The last cluster, C6, is the most interesting one. The phrases in C6 have the highest overall volume, but the smallest volume around the peak. It seems that many phrases in this cluster correspond to hot topics that the blogosphere discusses for several days. Another interesting aspect of C6 is the role of blogs in the cluster. It has a distinctively high fraction of the blog volume, and is the only cluster where bloggers actually lead mainstream media.

Number of websites    50        100       200       300
Temporal features     76.62%    81.23%    88.73%    95.75%
Volume features       70.71%    77.05%    86.62%    95.59%
TF-IDF features       70.12%    77.05%    87.04%    94.74%

Table 3: Classification accuracy of the clusters with a different set of features. See the main text for description.

Modeling the time series shape. Our analysis so far shows that the clusters have very different characteristics as well as diverse shapes. Motivated by this result, we conduct a temporal analysis for an individual website with respect to each cluster. We hypothesize that if a certain website mentions the phrase this will create a distinctive temporal signature of the phrases. For example, from Table 2 we see that blogs tend to mention the phrases in C6 earlier. If the hypothesis is true, therefore, then we should be able
to predict which cluster a phrase belongs to, based solely on the information about which websites mentioned the phrase. That is, based on which sites mentioned the phrase we would like to predict the temporal pattern the phrase will exhibit.

For each phrase we construct a feature vector by recording for each website the time when it first mentioned the phrase. If a website does not mention a phrase, we treat the entry as missing data. We impute the missing time as the average of the times when the website first mentioned phrases. For comparison we also construct two other feature vectors. For each website, we first record the fraction of the phrase volume created by that website. In addition, we treat every phrase as a “document” and every site as a “word”, and then compute the TF-IDF score [29] of each phrase.

Given the feature vectors, we learn six separate logistic regression classifiers so that the i-th classifier predicts whether the phrase belongs to the i-th cluster or not. Moreover, we vary the length of the feature vectors (i.e., the number of sites used by the classifier) by choosing the largest websites in terms of phrase volume. We report the average classification accuracy in Table 3. By using the information from only the 100 largest websites, we can predict the shape of the phrase volume over time with an accuracy of 81%. Among the three types of features, we observe that the features based on temporal information give the best performance.

Time series shape and the types of websites. Encouraged by the above results, we further investigate how websites contribute to the shape of the phrase volume and interact with each other in each cluster. For the analysis we manually chose a set of 12 representative websites. We manually classified them into five categories based on the organization or the role they play in the media space: Newspapers, Professional blogs, TV, News agencies and Blogs (refer to the full version for the list of websites used [1]).

First, we repeat the classification task from the previous section but now with only the 12 websites. Surprisingly, we obtain an average classification accuracy of 75.2%. Moreover, if we choose the 12 websites largest by total volume we obtain an accuracy of 73.7%. By using l1-regularized logistic regression to select the optimal (i.e., most predictive) set of 12 websites we obtain an accuracy of 76.0%.

Second, using the classification of websites into 5 groups we compute the time when websites of each type tend to mention the phrases in a particular cluster. Figure 7 shows the measured average time for each type of website. Letters correspond to the types of websites and the horizontal position of a letter corresponds to the average time of the first mention. For example, it is the professional bloggers (P) that first mention the phrases in Clusters C1 and C2. For phrases in C1, this is followed by newspapers (N), news agencies (A), then television (T) and finally by bloggers (B). In C2 the order is a bit different, but the point is that all types mention the phrase very close together. Interestingly, for the phrases in C3 news agencies (A) mention the phrase first. Notice that C3 has the heaviest tail among all the single-peak clusters. This is probably due to the fact that many different organizations subscribe to and publish articles from news agencies, and thus the phrases in C3 slowly percolate into online media. We observe the process of percolation by looking at the time values in Figure 7: starting from news agencies to newspapers and professional bloggers, and finally to TV stations and small bloggers. In C4 and C5, we note that it is the bloggers that make the difference. In C4 bloggers come late and create the second, lower spike, while in C5 bloggers (both small ones and professional ones) are the earliest types. Finally, the phrases in C6 gain attention mainly in the blogosphere. We already saw that this cluster has the highest proportion of blog volume. Again, we note that bloggers mention the phrases in this cluster right at the peak popularity and the rest of the media follows later.

5. EXPERIMENTS ON TWITTER

We also analyze the temporal patterns of attention of content published on Twitter. In order to identify and trace content that appears on Twitter we focus on the appearance of URLs and “hashtags”. Users on Twitter often make references to interesting content by including the URL in a post. Similarly, many tweets are accompanied by hashtags (e.g., #ilovelifebecause), short textual tags that get widely adopted by the Twitter community. Links and hashtags, adopted by Twitter users, represent specific pieces of information that we can track as they get adopted across the network. As with the quoted phrases, our goal here is to apply K-SC in order to identify patterns in
the temporal variation of the popularity of hashtags and URLs mentioned in tweets and to explain the patterns based on individual users' participation.

Data Preparation. We collected nearly 580 million Twitter posts from 20 million users covering an 8 month period from June 2009 to February 2010. We estimate this is about 20-30% of all posts published on Twitter during that time frame. We identified 6 million different hashtags and 144 million URLs mentioned in these posts. For each kind of content item (i.e., separately for URLs and hashtags) we discard items which exhibit nearly uniform volume over time. Then we order the items by their total volume and focus on the 1,000 most frequently mentioned hashtags (URLs) and the 100,000 users that mentioned these items most frequently.

Analysis of the results. We present the results of identifying the temporal patterns of Twitter hashtags. We note that we obtain very similar results when using URLs. For each hashtag, we build a time series describing its volume following exactly the same protocol as with quoted phrases. We use a 1 hour time unit and truncate the series to 128 hours around the peak volume, with the peak occurring at 1/3 of the 128 hours. We run K-SC on these time series and present the shapes of the identified cluster centroids in Figure 8.

Whereas mass media and blogs mention phrases that are related to certain pieces of news or events, most Twitter users adopt hashtags entirely by personal motivation to describe their mood or current activity. This difference appears in Figure 8 in that most hashtags maintain nonzero volume over the whole time period. This means that there always exists a certain number of users who mention a hashtag even if it is outdated or old. Nevertheless, the patterns of temporal variation in hashtag popularity are very consistent with the clusters of temporal variation of quoted phrases identified in Figure 7. We can establish a perfect correspondence between the classes of temporal variation of these two very different types of online content, namely quoted phrases and Twitter hashtags (and URLs). We arrange the clusters in Figure 8 in the same order as in Figure 7 so that T1 corresponds to C1, T2 to C2, and so on. These results are very interesting, especially considering that the motivation for people on Twitter to mention hashtags appears to be different from the mechanisms that drive the adoption of quoted phrases. Although we omit the discussion of the temporal variation of URL mentions for brevity, we note that the obtained clusters are nearly identical to the hashtag clusters (see [1]).

Table 4 gives further statistics of the Twitter hashtag clusters. Comparing these statistics to the characteristics of the phrase clusters (Table 2) we observe several interesting differences. The largest Twitter cluster (T2) has more members than the largest phrase cluster (C1), while the smallest Twitter cluster has more members than the smallest phrase cluster. This shows that the sizes of Twitter clusters are somewhat more skewed. Moreover, we also note that Twitter clusters are less concentrated around the peak volume, with the peak volume accounting for only around 2-5% of the total volume (in phrase clusters the peak accounts for 4-16% of the total volume).

Cluster   T1      T2      T3      T4      T5      T6
fc        16.1%   35.1%   15.9%   10.9%   13.7%   8.3%
V         4083    3321    3151    3253    3972    3177
V128      760     604     481     718     738     520
VP        86      169     67      60      67      53
VP/V      2.1%    5.1%    2.1%    1.8%    1.7%    1.7%

Table 4: Twitter hashtag cluster statistics. Table 2 gives the description of the symbols.

Number of features    50       100      200      300
Temporal features     69.53%   78.30%   88.23%   95.35%
Volume features       66.31%   71.84%   81.39%   92.36%
TF-IDF features       64.17%   70.12%   79.54%   89.93%

Table 5: Classification accuracy of the clusters in Twitter with a different set of features. Refer to the main text for description.

Next we also perform the task of predicting the shape of the volume-over-time curve for Twitter hashtags. Twitter data is very sparse, as even the most active users mention only about 10 to 50 different hashtags. Thus we order users by the total number of hashtags they mention, collect them into groups of 100 users, and measure the collective behavior of each group of users.

For each hashtag, we build a feature vector where the i-th component stores the time of the earliest mention of the tag by any user in group i. As with quoted phrases, we construct a feature vector based on the fraction of the mentions from each group, and another feature vector based on the TF-IDF score, treating hashtags as “documents” and user groups as “words”. For each cluster, we perform a binary classification of the cluster against the rest using logistic regression, and report the average accuracy over the six classification tasks in Table 5. Again, the temporal features achieve the best accuracy, suggesting that the time when a user group adopts a hashtag is an important factor in determining how the popularity of the hashtag will vary over time. We also note that the accuracies are lower than for quoted phrases (Table 3) and the gap gets larger as we choose a smaller number of features. This gap suggests that a small number of large famous media sites and blogs has a much greater influence on the adoption of news media content than the most active groups of users have on the adoption of Twitter hashtags. Even though the large scale temporal dynamics of attention of Twitter and news media content seem similar, these results hint that the adoption of quoted phrases tends to be much quicker and driven by a small number of large influential sites. On the other hand, in Twitter it appears as if the influentials are much less influential and have a smaller cumulative impact on content popularity.

6. RELATED WORK

There are two distinct lines of work related to the topics presented here: work on the temporal dynamics of human activity, and research on general time series clustering.

Temporal dynamics of human activity. Patterns of human attention [34, 35], popularity [24, 30] and response dynamics [6, 10] have been extensively studied. Research has investigated temporal patterns of activity of news articles [5, 30], blog posts [3, 14, 21, 27], videos [10] and online discussion forums [4]. Our work here is different as we are not trying to find a unifying global model of temporal variation but rather explore techniques that allow us to quantify what kinds of temporal variations exist on the Web. In this light, our work aligns with the research on Web search queries that finds temporal correlation between social media [2] or queries whose temporal variations are similar to each other [8]. After temporal patterns are identified, one can then focus on optimizing media content placement to maximize clickthrough rates [5], predicting the popularity of news [30] or finding topic intensity streams [19].

[Figure 8, panels (a) to (f): centroids of Clusters T1 to T6. Horizontal axis: Time (1 hours), from -40 to 80.]

Figure 8: Shapes of attention of Twitter hashtags.

Time series clustering. Two key components of time series clustering are a distance measure [11] and a clustering algorithm [32]. While the Euclidean distance is a classical time series distance metric, more sophisticated measures such as Dynamic Time Warping and the Longest Common Subsequence [18] have also been proposed. Among clustering algorithms, agglomerative hierarchical clustering [20] and K-means clustering [28] are frequently used. Due to its simplicity and scalability, K-means inspired many variants such as k-medoids [17], fuzzy K-means [17], and the Expectation Maximization based variant [28]. To address the issues caused by the high dimensionality of time series data, transforms such as the Discrete Fourier Transform, the Discrete Haar Wavelet Transform [7], Principal Component Analysis and Symbolic Aggregate Approximation [25] have also been applied.

7. CONCLUSION

We explored temporal patterns arising in the popularity of online content. First we formulated a time series clustering problem and motivated a measure of time series similarity. We then developed K-SC, a novel algorithm for time series clustering that efficiently computes the cluster centroids under our distance metric. Finally, we improved the scalability of K-SC by using a wavelet-based incremental approach.

We investigated the dynamics of attention in two domains: a massive dataset of 170 million news documents and a set of 580 million Twitter posts. The proposed K-SC achieves better clustering than K-means in terms of intra-cluster homogeneity and inter-cluster diversity. We also found that there are six different shapes that the popularity of online content exhibits. Interestingly, the shapes are consistent across the two very different domains of study, namely, the short textual phrases arising in news media and the hashtags on Twitter. We showed how different participants in the online media space shape the dynamics of attention the content receives. Perhaps surprisingly, based on observing a small number of adopters of online content we can reliably predict the overall dynamics of content popularity over time.

All in all, our work provides means to study common temporal patterns in the popularity and attention of online content, by identifying the patterns from massive amounts of real world data. Our results have direct application to the optimal placement of online content [5]. Another application of our work is the discovery of the roles of websites, which can improve the identification of influential websites or Twitter users [23]. We believe that our approach offers a useful starting point for understanding the dynamics of online media and how the dynamics of attention evolve over time.

Acknowledgment

We thank Spinn3r for resources that facilitated the research, and reviewers for helpful suggestions. Jaewon Yang is supported by Samsung Scholarship. The research was supported in part by NSF grants CNS-1010921, IIS-1016909, Albert Yu & Mary Bechmann Foundation, IBM, Lightspeed, Microsoft and Yahoo.

8. REFERENCES

[1] Extended version of the paper. Patterns of temporal variation in online media. Technical Report, Stanford Infolab, 2010.
[2] E. Adar, D. Weld, B. Bershad, and S. Gribble. Why We Search: Visualizing and Predicting User Behavior. In WWW '07, 2007.
[3] E. Adar, L. Zhang, L. A. Adamic, and R. M. Lukose. Implicit structure and the dynamics of blogspace. In Workshop on the Weblogging Ecosystem, 2004.
[4] C. Aperjis, B. A. Huberman, and F. Wu. Harvesting collective intelligence: Temporal behavior in yahoo answers. ArXiv e-prints, Jan 2010.
[5] L. Backstrom, J. Kleinberg, and R. Kumar. Optimizing web traffic via the media scheduling problem. In KDD '09, 2009.
[6] A.-L. Barabási. The origin of bursts and heavy tails in human dynamics. Nature, 435:207, 2005.
[7] F. K.-P. Chan, A. W.-chee Fu, and C. Yu. Haar wavelets for efficient similarity search of time-series: With and without time warping. IEEE TKDE, 15(3):686–705, 2003.
[8] S. Chien and N. Immorlica. Semantic Similarity between Search Engine Queries Using Temporal Correlation. In WWW '05, 2005.
[9] K. K. W. Chu and M. H. Wong. Fast time-series searching with scaling and shifting. In PODS '99, 237–248, 1999.
[10] R. Crane and D. Sornette. Robust dynamic classes revealed by measuring the response function of a social system. PNAS, 105(41):15649–15653, October 2008.
[11] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. VLDB, 1(2):1542–1552, 2008.
[12] G. H. Golub and C. F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, 1996.
[13] D. Gruhl, R. Guha, R. Kumar, J. Novak, and A. Tomkins. The predictive power of online chatter. In KDD '05, 2005.
[14] D. Gruhl, D. Liben-Nowell, R. V. Guha, and A. Tomkins. Information diffusion through blogspace. In WWW, 2004.
[15] A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In WebKDD workshop, pages 56–65, 2007.
[16] E. Katz and P. Lazarsfeld. Personal influence: The part played by people in the flow of mass communications. Free Press, 1955.
[17] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis (Wiley Series in Probability and Statistics). Wiley-Interscience, March 2005.
[18] E. Keogh and C. Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3):358–386, 2005.
[19] A. Krause, J. Leskovec, and C. Guestrin. Data association for topic intensity tracking. In ICML '06, 2006.
[20] M. Kumar, N. R. Patel, and J. Woo. Clustering seasonality patterns in the presence of errors. In KDD '02, 2002.
[21] R. Kumar, J. Novak, P. Raghavan, and A. Tomkins. On the bursty evolution of blogspace. In WWW '02, 2003.
[22] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In KDD '09, 2009.
[23] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD '07, 2007.
[24] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Cascading behavior in large blog graphs. In SDM '07, 2007.
[25] J. Lin, E. Keogh, S. Lonardi, and B. Chiu. A symbolic representation of time series, with implications for streaming algorithms. In SIGMOD '03, 2003.
[26] R. D. Malmgren, D. B. Stouffer, A. E. Motter, and L. A. A. N. Amaral. A poissonian explanation for heavy tails in e-mail communication. PNAS, 105(47):18153–18158, 2008.
[27] Q. Mei, C. Liu, H. Su, and C. Zhai. A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In WWW '06, 2006.
[28] J. L. Michail, J. Lin, M. Vlachos, E. Keogh, and D. Gunopulos. Iterative incremental clustering of time series. In EDBT, 2004.
[29] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1986.
[30] G. Szabo and B. A. Huberman. Predicting the popularity of online content. ArXiv e-prints, Nov 2008.
[31] X. Wang, C. Zhai, X. Hu, and R. Sproat. Mining correlated bursty topic patterns from coordinated text streams. In KDD '07, page 793, 2007.
[32] T. Warren Liao. Clustering of time series data - a survey. Pattern Recognition, 38(11):1857–1874, 2005.
[33] D. J. Watts and P. S. Dodds. Influentials, networks, and public opinion formation. Journal of Consumer Research, 34(4):441–458, December 2007.
[34] F. Wu and B. A. Huberman. Novelty and collective attention. PNAS, 104(45):17599–17601, 2007.
[35] S. Yardi, S. A. Golder, and M. J. Brzozowski. Blogging at work and the corporate attention economy. In CHI '09, 2009.
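
As an illustration of the temporal feature vectors used in the classification experiments of Sections 4 and 5, the following is a minimal sketch in Python, not the authors' implementation. For each phrase it records the time at which each website first mentioned it, and it imputes a missing entry with the average first-mention time of that website over the phrases it did mention. The function name `build_temporal_features`, the input layout, the toy site and phrase names, and the fallback value of 0.0 for a website that mentions no phrase at all are assumptions made for this example.

```python
from statistics import mean


def build_temporal_features(first_mentions, websites):
    """Build one temporal feature vector per phrase.

    first_mentions: dict mapping phrase -> {website: first-mention time in hours}
                    (hypothetical input layout, assumed for this sketch).
    websites:       ordered list of websites; fixes the order of vector components.
    Returns a dict mapping phrase -> list of floats, one entry per website.
    """
    # Average first-mention time of each website, over the phrases it mentioned.
    site_mean = {}
    for site in websites:
        times = [m[site] for m in first_mentions.values() if site in m]
        # Assumption: a site that mentions no phrase at all falls back to 0.0.
        site_mean[site] = mean(times) if times else 0.0

    features = {}
    for phrase, mentions in first_mentions.items():
        # Use the observed time when available, the imputed site average otherwise.
        features[phrase] = [
            float(mentions.get(site, site_mean[site])) for site in websites
        ]
    return features


if __name__ == "__main__":
    toy = {
        "phrase_a": {"site1": -2.0, "site2": 1.0},
        "phrase_b": {"site1": 4.0},   # site2 never mentions phrase_b
        "phrase_c": {"site2": 3.0},   # site1 never mentions phrase_c
    }
    for phrase, vector in build_temporal_features(toy, ["site1", "site2"]).items():
        print(phrase, vector)
```

In the setting described in the text, each such vector would then be passed to six one-vs-rest logistic regression classifiers, one per cluster. For Twitter, the same construction would apply with groups of 100 users in place of websites.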