Densest Subgraph in Streaming and MapReduce
Bahman Bah

Uploaded On 2015-05-27

Ravi Kumar, Yahoo Research, Sunnyvale, CA, ravikumar@yahoo-inc.com
Sergei Vassilvitskii, Yahoo Research, New York, NY, sergei@yahoo-inc.com

ABSTRACT
The problem of finding locally dense components of a graph is an important primitive in data analysis with widera


comes NP-hard [26]. Andersen and Chellapilla [3] as well as Khuller and Saha [26] show how to obtain 2-approximations for this version of the problem.

While the algorithms proposed in the above line of work guarantee good approximation factors, they are not efficient when run on very large datasets. In this work we show how to use the principles underlying existing algorithms, especially [10], to develop new algorithms that can be run in the data stream and distributed computing models, for example, MapReduce; this also resolves the open problem posed in [1, 13].

1.1 Streaming and MapReduce

As the datasets have grown to tera- and petabyte input sizes, two paradigms have emerged for developing algorithms that scale to such large inputs: streaming and MapReduce.

In the streaming model [34], one assumes that the input can be read sequentially in a number of passes over the data, while the total amount of random access memory (RAM) available to the computation is sublinear in the size of the input. The goal is to reduce the number of passes needed, all the while minimizing the amount of RAM necessary to store intermediate results. In the case the input is a graph, the nodes $V$ are known in advance, and the edges are streamed (it is known that most non-trivial graph problems require $\Omega(|V|)$ RAM, even if multiple passes can be used [18]). The challenge in streaming algorithms lies in wisely using the limited amount of information that can be stored between passes.

Complementing streaming algorithms, MapReduce, and its open source implementation, Hadoop, has become the de-facto model for distributed computation on a massive scale. Unlike streaming, where a single machine eventually sees the whole dataset, in MapReduce, the input is partitioned across a set of machines, each of which can perform a series of computations on its local slice of the data. The process can then be repeated, yielding a multi-pass algorithm (see [16] for the exact framework, and [19, 25] for theoretical models). It is well known that simple operations like sum and other holistic measures [35] as well as some graph primitives, like finding connected components [25], can be implemented in MapReduce in a work-efficient manner. The challenge lies in reducing the total number of passes with no machine ever seeing the entire dataset.

1.2 Our contributions

In this work we focus on obtaining efficient algorithms for the densest subgraph problem that can work on massive graphs, where the graph cannot be stored in the main memory. Specifically, we show how to modify the approach of [10] so that the resulting algorithm makes only $O(\frac{1}{\epsilon} \log n)$ passes over the data and guarantees to return an answer within a $(2+\epsilon)$ factor of optimum. We show that our algorithm only requires the computation of basic graph parameters (e.g., the degree of each node and the overall density) and thus can be easily parallelized; we use the MapReduce model to demonstrate one such parallel implementation. Finally, we show that despite the $(2+\epsilon)$ worst-case approximation guarantee, the algorithm's output is often nearly optimal on real-world graphs; moreover it can easily scale to graphs with billions of edges.

2. RELATED WORK

The densest subgraph problem lies at the core of large scale data mining and as such it and its variants have been intensively studied. Goldberg [22] was one of the first to formally introduce the problem of finding the densest subgraph in an undirected graph and gave an algorithm that required $O(\log n)$ flow computations to find the optimal solution; see also [29]. Charikar [10] described a simple greedy algorithm and showed that it leads to a 2-approximation to the optimum. When augmented with a constraint requiring the solution be of size at least $k$, the problem becomes NP-hard [26]. On the positive side, Andersen and Chellapilla [3] gave a 2-approximation to this version of the problem, and [26] gave a faster algorithm that achieves the same solution quality.

In the case the underlying graph is directed, Kannan and Vinay [24] were the first to define the notion of density and gave an $O(\log n)$ approximation algorithm. This was further improved by Charikar [10] who showed that it can be solved exactly in polynomial time by solving $O(n^2)$ linear programs, and obtained a combinatorial 2-approximation algorithm. The latter algorithm was simplified in the work of Khuller and Saha [26].

In addition to the steady theoretical progress, there is a rich line of work that tailored the problem to the specific task at hand. Variants of the densest subgraph problem have been used in computational biology (see, for example [1, Chapter 14]), community mining [12, 17, 32], and even to decide what subset of people would form the most effective working group [20]. The specific problem of finding dense subgraphs on very large datasets was addressed in Gibson et al. [21] who eschewed approximation guarantees and used shingling approaches to find sets of nodes with high neighborhood overlap.

Streaming and MapReduce. Data streaming and MapReduce have emerged as two leading paradigms for handling computation on very large datasets. In the data stream model, the input is assumed too large to fit into main memory, and is instead streamed past one object at a time. For an introduction to streaming, see the excellent survey by Muthukrishnan [34]. When streaming graphs, the typical assumption is that the set of nodes is known ahead of time and can fit into main memory, and the edges arrive one by one; this is the semi-streaming model of computation [18]. Algorithms for a variety of graph primitives from matchings [31], to counting triangles [5, 6] have been proposed and analyzed in this setting.

While data streams are an efficient model of computation for a single machine, MapReduce has become a popular method for large-scale parallel processing. Beginning with the original work of Dean and Ghemawat [16], several algorithms have been proposed for distributed data analysis, from clustering [15] to solving set cover [13]. For graph problems, Karloff et al. [25] give algorithms for finding connected components and spanning trees; Suri and Vassilvitskii show how to count triangles effectively [37], while Lattanzi et al. [28] and Morales et al. [33] describe algorithms for finding matchings on massive graphs.

3. PRELIMINARIES

Let $G = (V, E)$ be an undirected graph. For a subset $S \subseteq V$, let the induced edge set be defined as $E(S) = E \cap S^2$

exist, since $S$ eventually becomes empty. Clearly, $S \supseteq S^*$. Let $i \in A(S) \cap S^*$. We have

$$\rho(S^*) \le \deg_{S^*}(i) \le \deg_S(i) \le (2+2\epsilon)\,\rho(S),$$

where the first inequality is (4.1), the second holds because $S \supseteq S^*$, and the third because $i \in A(S)$. This implies $\rho(S) \ge \rho(S^*)/(2+2\epsilon)$ and hence the algorithm outputs a $(2+2\epsilon)$-approximation.
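The listing of Algorithm 1 is not part of this text, so the following Python sketch is only an assumed reading of it: it uses the removal rule $\deg_S(i) \le 2(1+\epsilon)\rho(S)$ that the proof above relies on, with $\rho(S) = |E(S)|/|S|$, and borrows the loop structure of Algorithm 3. The function name `densest_subgraph_passes` is hypothetical, and an in-memory edge list stands in for the edge stream.

```python
from collections import defaultdict


def densest_subgraph_passes(edges, eps):
    """Peel low-degree nodes in passes; return (best set, its density, passes).

    Assumes a simple undirected graph given as a list of (u, v) pairs.
    """
    S = {u for edge in edges for u in edge}
    best_set, best_rho = set(S), -1.0
    passes = 0
    while S:
        # One pass over the edge "stream": induced degrees and edge count.
        deg = defaultdict(int)
        m = 0
        for u, v in edges:
            if u in S and v in S:
                deg[u] += 1
                deg[v] += 1
                m += 1
        rho = m / len(S)
        if rho > best_rho:
            best_set, best_rho = set(S), rho
        # Assumed removal rule: degree at most 2(1 + eps) times the density.
        threshold = 2 * (1 + eps) * rho
        S -= {i for i in S if deg[i] <= threshold}
        passes += 1
    return best_set, best_rho, passes


if __name__ == "__main__":
    # A 4-clique {0, 1, 2, 3} with a pendant path 3-4-5.
    demo = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
    print(densest_subgraph_passes(demo, eps=0.1))
```

Only the per-node degree counters and the current set are kept between passes, which matches the $O(n)$ memory budget of the semi-streaming setting. The minimum degree never exceeds the average degree $2\rho(S)$, so every pass removes at least one node and the loop terminates.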
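Next, we show that the algorithm removes a constant fraction of all of the nodes in every pass, and thus is guaranteed to terminate after $O(\log n)$ passes of the while loop.

Lemma 4. Algorithm 1 terminates in $O(\log_{1+\epsilon} n)$ passes.

Proof. At each step of the pass, we have

$$2|E(S)| = \sum_{i \in A(S)} \deg_S(i) + \sum_{i \in S \setminus A(S)} \deg_S(i) > 2(1+\epsilon)\,(|S| - |A(S)|)\,\rho(S) = 2(1+\epsilon)\,(|S| - |A(S)|)\,\frac{|E(S)|}{|S|},$$

where the inequality follows by considering only those nodes in $S \setminus A(S)$. Thus,

$$|A(S)| > \frac{\epsilon}{1+\epsilon}\,|S|. \qquad (4.2)$$

Equivalently, $|S \setminus A(S)| < \frac{1}{1+\epsilon}|S|$. Therefore, the cardinality of the remaining set $S$ decreases by a factor of at least $1+\epsilon$ during each pass. Hence, the algorithm terminates in $O(\log_{1+\epsilon} n)$ passes.

Notice that for small $\epsilon$, $\log(1+\epsilon) \approx \epsilon$ and hence the number of passes is $O(\frac{1}{\epsilon} \log n)$.

4.1.1 Lower bounds

In this section we show that our analysis is tight. In particular, we show that there are graphs on which Algorithm 1 makes $\Omega(\log n)$ passes. Furthermore, we also show that any algorithm that achieves a 2-approximation in $O(\log n)$ passes must use $\Omega(n/\log n)$ space. Note that Algorithm 1 comes close to this lower bound since it makes $O(\log n)$ passes and uses $O(n)$ memory.

Pass lower bound. We show that the analysis of the number of passes is tight up to constant factors. We begin with a slightly weaker result.

Lemma 5. There exists an unweighted graph on which Algorithm 1 requires $\Omega(\frac{\log n}{\log \log n})$ passes.

Proof. The graph consists of $k$ disjoint subsets $G_1, \ldots, G_k$, where $G_i$ is a $2^{i-1}$-regular graph on $|V_i| = 2^{2k+1-i}$ nodes, hence every $G_i$ has exactly $2^{2k-1}$ edges and has density $2^{i-2}$. For any $\ell \ge 1$, let $G_{\ge \ell} = \bigcup_{i \ge \ell} G_i$. The density of $G_{\ge \ell}$ is:

$$\rho(G_{\ge \ell}) = \frac{(k-\ell+1)\,2^{2k-1}}{2^{k+1}\,(2^{k-\ell+1} - 1)} \approx (k-\ell+1)\,2^{\ell-3}.$$

We claim that in every pass the algorithm removes $O(\log k)$ of these subgraphs. Suppose that we start with the subgraph $G_{\ge \ell}$ at the beginning of the pass. Then the nodes in $A(S)$ are exactly those that have their degree less than $\rho(G_{\ge \ell})(2+\epsilon) \approx (k-\ell+1)\,2^{\ell-2}$. Since a node in $G_i$ has degree $2^{i-1}$, this is equivalent to nodes in $G_i$ for $i < (\ell-1) + \log(k-\ell)$, and hence the subgraph in the next pass is $G_{\ge \ell + \log(k-\ell) - 1}$. Therefore, the algorithm will take at least $\Omega(k/\log k)$ passes to complete. Since $k = \Theta(\log n)$, the proof follows.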
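The construction in the proof of Lemma 5 can be explored numerically without building the graph, because the blocks $G_i$ are disjoint and regular: removing one block leaves the degrees inside the others unchanged. The sketch below simulates the peeling at block level. It assumes the removal rule $\deg_S(i) \le 2(1+\epsilon)\rho(S)$ used in the analysis of Algorithm 1; the function name `lemma5_pass_trace` is hypothetical.

```python
def lemma5_pass_trace(k, eps=0.0):
    """Return, for each pass, the index of the first block still present.

    Block i (1 <= i <= k) is a 2**(i-1)-regular graph on 2**(2k+1-i) nodes,
    as in the proof of Lemma 5. Only block sizes and degrees are tracked.
    """
    # Each entry: (block index, number of nodes, degree of every node).
    blocks = [(i, 2 ** (2 * k + 1 - i), 2 ** (i - 1)) for i in range(1, k + 1)]
    trace = []
    while blocks:
        trace.append(blocks[0][0])
        n_nodes = sum(size for _, size, _ in blocks)
        n_edges = sum(size * deg // 2 for _, size, deg in blocks)
        # Assumed removal rule: degree at most 2(1 + eps) times the density.
        threshold = 2 * (1 + eps) * n_edges / n_nodes
        blocks = [block for block in blocks if block[2] > threshold]
    return trace


if __name__ == "__main__":
    for k in (4, 8, 16):
        print(k, len(lemma5_pass_trace(k)))
```

For $k = 4$ and $\epsilon = 0$ the trace is `[1, 3, 4]`: the first pass removes $G_1$ and $G_2$, the second removes $G_3$, and the third removes $G_4$. Several low-degree blocks disappear together in the early passes, which is why the construction yields only the weaker $\Omega(k/\log k)$ bound.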
To show an example on which Algorithm 1 requires $\Omega(\log n)$ passes, we appeal to weighted graphs. Note that Algorithm 1 and the analysis easily generalize to finding the maximum density subgraph in an undirected weighted graph.

Lemma 6. There exists a weighted graph on which Algorithm 1 requires $\Omega(\log n)$ passes.

Proof Sketch. Consider a graph whose degree sequence follows a power law with exponent $0 < \alpha < 1$, i.e., if $d_i$ is the $i$th largest degree, then $d_i \propto i^{-\alpha}$. We have $\sum_{i=1}^{n} i^{-\alpha} \simeq \int_0^n x^{-\alpha}\,dx = \frac{n^{1-\alpha}}{1-\alpha}$, so if the graph has $m$ edges, we (approximately) have $d_i = (1-\alpha)\, i^{-\alpha}\, \frac{m}{n^{1-\alpha}}$. Hence, in the first pass of the algorithm, we remove all the nodes with

$$d_i = (1-\alpha)\, i^{-\alpha}\, \frac{m}{n^{1-\alpha}} \le (2+\epsilon)\,\frac{m}{n}.$$

Hence, the nodes such that

$$i < \left(\frac{1-\alpha}{2+\epsilon}\right)^{1/\alpha} n$$

go to the next pass; note that this is a constant fraction of the nodes. As long as the power law property of the degree sequence is preserved after removing the low degree nodes in each pass, we obtain the desired $\Omega(\log n)$ lower bound.

Consider the graphs generated by the preferential attachment process [2]. To avoid the stochasticity in the model, which only makes the analysis more complicated, one can consider the following deterministic variant of this process: whenever a new node $u$ arrives, it adds an edge to all of the existing nodes $v$ and assigns a weight $w_{u,v}$ to the edge $(u, v)$ which is proportional to the current degree of $v$. Then the degree of the $i$th node after a total of $n$ nodes have arrived follows a power law distribution, which is exactly what we needed to achieve.
Space lower bound. We show that the tradeoff between memory and number of passes is almost the best possible. Namely, any constant-pass streaming algorithm for approximating the densest subgraph to within a constant factor of 2 must use a linear amount of memory, and an algorithm making $O(\log n)$ passes must use $\Omega(n/\log n)$ memory.

Lemma 7. Any $p$-pass streaming $\alpha$-approximation algorithm for the densest subgraph problem, where $\alpha \ge 2$, needs $\Omega(n/(p\,\alpha^2))$ space.

Proof. Consider the standard disjointness problem in the $q$-party arbitrary round communication model. There are $q \ge 2$ players, and the $j$th player has the $n$-bit vector $x^j_1, \ldots, x^j_n$. Their goal is to decide if there is an index $i$ such that $\bigwedge_{j=1}^{q} x^j_i = 1$. It is known that this problem needs $\Omega(n/q)$ communication [4, 9] and the lower bound holds even under the promise that either the bit vectors are pairwise disjoint (NO instance) or they have a unique common element but are otherwise disjoint (YES instance).

necessarily disjoint subsets $S, T \subseteq V$. We assume that the ratio $c = |S|/|T|$ for the optimal sets $S, T$ is known to the algorithm. In practice, one can do a search for this value, by trying the algorithm for different values of $c$ and retaining the best result.

The algorithm then proceeds in a similar spirit as in the undirected case. We begin with $S = T = V$ and remove either those nodes $A(S)$ whose outdegree to $T$ is below average, or those nodes $B(T)$ whose indegree to $S$ is below average. (Formally we need the degrees to be below a threshold slightly above the average for the algorithm to converge.) A naive way to decide whether the set $A(S)$ or $B(T)$ should be removed in the current pass is to look at the maximum outdegree, $E(i, T)$, of nodes in $A(S)$ and the maximum indegree, $E(S, j)$, of nodes in $B(T)$. If $E(S, j)/E(i, T) \ge c$ then $A(S)$ can be removed and otherwise $B(T)$ can be removed. However, a better way is to make this choice directly based on the current sizes of $S$ and $T$. Intuitively, if $|S|/|T| > c$, then we should be removing the nodes from $S$ to get the ratio closer to $c$, otherwise we should remove those from $T$. In addition to being simpler, this way is also faster, mainly because it needs to compute only one of $A(S)$ and $B(T)$ in every pass, leading to a significant speedup in practice. Algorithm 3 contains the formal description.

Algorithm 3 Densest subgraph for directed graphs.
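Require: $G = (V, E)$, $c > 0$, and $\epsilon > 0$
  $\tilde{S}, \tilde{T}, S, T \leftarrow V$
  while $S \ne \emptyset$ and $T \ne \emptyset$ do
    if $|S|/|T| \ge c$ then
      $A(S) \leftarrow \{ i \in S \mid |E(i, T)| \le (1+\epsilon)\,|E(S, T)|/|S| \}$
      $S \leftarrow S \setminus A(S)$
    else
      $B(T) \leftarrow \{ j \in T \mid |E(S, j)| \le (1+\epsilon)\,|E(S, T)|/|T| \}$
      $T \leftarrow T \setminus B(T)$
    end if
    if $\rho(S, T) > \rho(\tilde{S}, \tilde{T})$ then
      $\tilde{S} \leftarrow S$, $\tilde{T} \leftarrow T$
    end if
  end while
  return $\tilde{S}, \tilde{T}$

First, we analyze the approximation factor of the algorithm.

Lemma 12. Algorithm 3 leads to a $(2+2\epsilon)$-approximation to the densest subgraph problem on directed graphs.

Proof. As in [10], we generate an assignment of the edges to the endpoints corresponding to the algorithm. Whenever $(i, j) \in E$, and $i \in A(S)$ is removed from $S$, we assign $(i, j)$ to $i$; a similar assignment is made for the nodes in $B(T)$. Let $\tilde{\rho} = \rho(\tilde{S}, \tilde{T})$. Let $\deg_{out}$ be the maximum outdegree and $\deg_{in}$ be the maximum indegree in $G$. We need to show that if $A(S)$ is removed, then

$$\forall i \in A(S), \quad \sqrt{c}\,|E(i, T)| \le (1+\epsilon)\,\rho(S, T).$$

Suppose that $|S|/|T| \ge c$, and so the nodes in $A(S)$ will be removed. For all $i \in A(S)$, we have

$$\sqrt{c}\,|E(i, T)| \le \sqrt{c}\,(1+\epsilon)\,\frac{|E(S, T)|}{|S|} \le \sqrt{c}\,(1+\epsilon)\,|E(S, T)|\,\sqrt{\frac{1}{c\,|S|\,|T|}} = (1+\epsilon)\,\frac{|E(S, T)|}{\sqrt{|S|\,|T|}} = (1+\epsilon)\,\rho(S, T).$$

The second inequality follows because $|S| \ge c\,|T| \Rightarrow |S| \ge \sqrt{c\,|S|\,|T|}$. Similarly, one can show that if $B(T)$ gets removed, then

$$\forall j \in B(T), \quad \frac{1}{\sqrt{c}}\,|E(S, j)| \le (1+\epsilon)\,\rho(S, T).$$

This proves that in the given assignment (of edges to endpoints), $\sqrt{c}\,\deg_{out} \le (1+\epsilon)\,\tilde{\rho}$ and $\frac{1}{\sqrt{c}}\,\deg_{in} \le (1+\epsilon)\,\tilde{\rho}$. Once we have such an assignment, we can use the same logic as in Lemmas 7 and 8 in [10] to conclude that the algorithm gives a $(2+2\epsilon)$-approximation:

$$\max_{S, T \subseteq V,\ |S|/|T| = c} \{\rho(S, T)\} \le (2+2\epsilon)\,\rho(\tilde{S}, \tilde{T}).$$

Next, we analyze the number of passes of the algorithm. The proof is similar to that of Lemma 4.

Lemma 13. Algorithm 3 terminates in $O(\log_{1+\epsilon} n)$ passes.

Proof. We have

$$|E(S, T)| = \sum_{i \in A(S)} |E(i, T)| + \sum_{i \in S \setminus A(S)} |E(i, T)| > (1+\epsilon)\,(|S| - |A(S)|)\,\frac{|E(S, T)|}{|S|},$$

which yields $|S \setminus A(S)| < \frac{1}{1+\epsilon}|S|$. Similarly, we can prove $|T \setminus B(T)| < \frac{1}{1+\epsilon}|T|$. Therefore, during each pass of the algorithm, either the size of the remaining set $S$ or the size of the remaining set $T$ goes down by a factor of at least $1+\epsilon$. Hence, in $O(\log_{1+\epsilon} n)$ passes, one of these sets becomes empty and the algorithm terminates.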
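A minimal Python sketch of Algorithm 3 follows, using the density $\rho(S, T) = |E(S, T)|/\sqrt{|S|\,|T|}$ that appears in the proof of Lemma 12. It is an illustration under stated assumptions rather than the authors' implementation: the function name `densest_directed` is hypothetical, an in-memory edge list stands in for the stream, and the density is evaluated at the start of each pass, which visits the same sequence of $(S, T)$ pairs as the listing.

```python
import math
from collections import defaultdict


def densest_directed(edges, c, eps):
    """Return (S, T, density, passes) for a directed edge list of (u, v)."""
    nodes = {u for edge in edges for u in edge}
    S, T = set(nodes), set(nodes)
    best_S, best_T, best_rho = set(S), set(T), -1.0
    passes = 0
    while S and T:
        # One pass: outdegree into T, indegree from S, and |E(S, T)|.
        out_deg = defaultdict(int)
        in_deg = defaultdict(int)
        m = 0
        for u, v in edges:
            if u in S and v in T:
                out_deg[u] += 1
                in_deg[v] += 1
                m += 1
        rho = m / math.sqrt(len(S) * len(T))
        if rho > best_rho:
            best_S, best_T, best_rho = set(S), set(T), rho
        if len(S) / len(T) >= c:
            # Remove sources whose outdegree is near or below average.
            limit = (1 + eps) * m / len(S)
            S -= {i for i in S if out_deg[i] <= limit}
        else:
            # Remove sinks whose indegree is near or below average.
            limit = (1 + eps) * m / len(T)
            T -= {j for j in T if in_deg[j] <= limit}
        passes += 1
    return best_S, best_T, best_rho, passes


if __name__ == "__main__":
    # Nodes 0 and 1 each point to 2, 3, 4; node 5 adds one stray edge.
    demo = [(0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (5, 0)]
    print(densest_directed(demo, c=0.5, eps=0.1))
```

On the demo graph the passes alternate between shrinking $S$ and shrinking $T$, and the best pair found is $S = \{0, 1\}$, $T = \{2, 3, 4\}$ with density $6/\sqrt{6}$. In practice the function would be called once per candidate value of $c$, keeping the densest result.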
5. PRACTICAL CONSIDERATIONS

In this section we describe two practical considerations in implementing the algorithms. The first (Section 5.1) is a heuristic method based on Count-Sketch to cut the memory requirements of the algorithm. The second (Section 5.2) is a discussion on how to realize the algorithms in the MapReduce computing model.

5.1 Heuristic improvements

We showed in Lemma 7 that any $p$-pass algorithm achieving a 2-approximation to the densest subgraph problem must use at least $\Omega(n/p)$ space. However, even this amount of space can be prohibitively large for very large datasets. To further reduce the space required by the algorithms we turn to

G            type        |V|     |E|
flickr       undirected  976K    7.6M
im           undirected  645M    6.1B
livejournal  directed    4.84M   68.9M
twitter      directed    50.7M   2.7B

Table 1: Parameters of the graphs used in the experiments.

quadratic time (linear time for each pass and a linear number of passes), which is still infeasible for these graphs. In order to circumvent this, we work with slightly smaller graphs just to compare the quality of the solution to that of the optimum (Section 6.2).

6.2 Quality of approximation

We study how good an approximation is obtained by our algorithm for the undirected case. To enable this, we need to compute the value of the optimum. Recall that, as mentioned in Section 1, both the directed and undirected densest subgraph problems can be solved exactly using parametric flow. In this section we want to obtain $\rho^*$, i.e., the value of the optimal solution, to argue that the approximation factor in practice is much better than the $2(1+\epsilon)$ guaranteed by Lemma 3. (To do such a test for directed graphs is very expensive because one has to try all $n^2$ values of $c$.)

In order to solve the densest subgraph problem exactly, we use the following linear programming (LP) formulation.

$$
\begin{aligned}
\max \quad & \sum_{ij} x_{ij} \\
\text{s.t.} \quad & x_{ij} \le y_i && \forall (i, j) \in E, \\
& x_{ij} \le y_j && \forall (i, j) \in E, \\
& \sum_i y_i \le 1, \\
& x_{ij},\ y_i \ge 0.
\end{aligned}
$$

Charikar [10] showed that the value of this LP is precisely equal to $\rho^*(G)$. We use this observation to measure the quality of approximation obtained by Algorithm 1. To solve the LP, we use the COIN-OR CLP solver (projects.coin-or.org/Clp). We use seven moderately-sized undirected graphs publicly available at SNAP (snap.stanford.edu). Table 2 shows the parameters of these graphs and the approximation factor of our algorithms for different settings of $\epsilon$. It is clear that the approximation factors obtained by our algorithm are much better than what Lemma 3 promises. Furthermore, even high values of $\epsilon$ seem to hardly hurt the approximation guarantees.

6.3 Undirected graphs

In this section we study the performance of our algorithms on two undirected graphs, namely, flickr and im. First, we study the effect of $\epsilon$ on the approximation factor and the number of passes. Figure 6.1 shows the results. For ease of comparison, we show the values relative to the density obtained by our algorithm for $\epsilon = 0$. (Note that the setting $\epsilon = 0$ is similar to Charikar's algorithm [10] in terms of the approximation factor but can run in far fewer passes; however, termination is not guaranteed for $\epsilon = 0$.) As we saw in Table 2, the approximation does not deteriorate for higher values of $\epsilon$ (note that the performance is not monotone in $\epsilon$). Choosing a value of $\epsilon \in [0.5, 1]$ seems to cut down the number of passes by half while losing only 10% of the optimum.

Figure 6.1: Effect of $\epsilon$ on the approximation and the number of passes.

We then move on to analyze the graph structure as the passes progress. Figure 6.2 shows the relative density as a function of the number of passes. (Curiously, we observe a unimodal behavior for flickr, but this does not seem to hold in general.) Figure 6.3 shows the number of nodes and edges in the graph after each pass. The shape of the plots suggests that the graph gets dramatically smaller even in the early passes. This is a very useful feature in practice, since if the graph gets very small early on, then the rest of the computation can be done in the main memory. This will avoid the overhead of additional passes.

Note also that the worst-case bound of $O(\log_{1+\epsilon} n)$ for the number of passes as given by Lemma 4 is never achieved by these graphs. This is possibly because of the heavy-tail nature of the degree distribution of graphs derived from social networks and their core connectivity properties; see [27, 30]. These properties may also contribute to achieving the good approximation ratio, i.e., the worst-case bound of Lemma 3 is not met by these graphs. Exploring these in further detail is outside the scope of this work and is an interesting area of future research.

6.4 Directed graphs

In this section we study the performance of the directed graph version of our algorithm. We use the livejournal and twitter graphs for this purpose. Recall that for directed graphs, we have to try various values of $c$ (Section 4.3). Of course, trying all $n^2$ possible values of $c$ is prohibitive. A simple alternative is to choose a resolution ($\delta > 1$) and try $c$ at different powers of $\delta$. (One can prove that this worsens the approximation guarantee by at most a factor $\delta$ [10].) Clearly, the running time is given by $2 \log n / \log \delta$.

G = (V, E)    |V|     |E|      ρ*(G)   ρ*(G)/ρ̃(G)
                                       ε=0.001  ε=0.1  ε=1
as20000102    6,474   13,233   9.29    1.229    1.268  1.194
ca-AstroPh    18,772  396,160  32.12   1.147    1.156  1.273
ca-CondMat    23,133  186,936  13.47   1.072    1.072  1.429
ca-GrQc       5,242   28,980   22.39   1.000    1.000  1.395
ca-HepPh      12,008  237,010  119.00  1.000    1.017  1.151
ca-HepTh      9,877   51,971   15.50   1.000    1.000  1.356
email-Enron   36,692  367,662  37.34   1.058    1.072  1.063

Table 2: Empirical approximation bounds for various values of $\epsilon$.

Figure 6.3: Number of nodes and edges in the graph after each step of the pass for flickr and im.

First, we study the effect of the choice of $\delta$ compared to the choice of $\epsilon$. Table 3 shows the results. From the values, it is easy to see that as long as $\delta$ remains reasonable, the effect of $\epsilon$ is as in the undirected case. To make the rest of the study less cumbersome, we fix $\delta = 2$ for the remainder of this section. First, we present the results for livejournal.
ε \ δ   2       10      100
0       325.27  312.13  307.96
1       334.38  308.70  306.91
2       294.50  284.47  179.59

Table 3: livejournal: $\rho$ for different $\epsilon$ and $\delta$.

We study the performance of the algorithm for various choices of $c$, given $\delta = 2$. In particular, we measure the density and the number of passes. Figure 6.4 shows the values. The behavior of density is quite complex, and for livejournal, the optimum occurs when the relative sizes of $S$ and $T$ are not skewed.

Figure 6.4: Density and the number of passes at $\delta = 2$ for livejournal.

Finally, Figure 6.5 shows the behavior of livejournal for the best setting of $c$ (which is 0.436) for $\delta = 2$, $\epsilon = 1$. It clearly shows the "alternate" nature of the simplified algorithm (Algorithm 3) that we developed in Section 4.3. As always, the number of nodes and edges falls dramatically as the passes progress.

For twitter, we used $\epsilon = 1$, $\delta = 2$ and studied the performance of the algorithm for various values of $c$. Figure 6.6 shows the density and the number of passes for various values of $c$. Unlike livejournal, the best value of $c$ is not concentrated around 1. This may be due to the highly skewed nature of the twitter graph: for example, there are about 600 popular users who are followed by more than 30 million other users. The results from livejournal and twitter suggest that, in practice, one can safely skip many values of $c$.

6.5 Performance of sketching

In this section we discuss the performance of the sketching heuristic presented in Section 5.1. We tested the algorithm on flickr, which has 976K nodes. Recall that the number of words in a Count-Sketch scheme using $b$ buckets and $t$ independent hash tables is $t \cdot b$. In Table 4, we show the ratio of the densities obtained with and without sketching, for various values of $b$ and $\epsilon$. The bottom row shows the main memory used by the algorithm with sketching compared to the algorithm without sketching. Clearly, for small values of $\epsilon$, the performance difference is not very significant even for $b = 30000$, which means only $5 \cdot 30000 / 976\text{K} \approx 16\%$ of main memory is used. This suggests that, despite the space lower bounds (Lemma 7), in practice, a sketching scheme can obtain significant savings in main memory.

Figure 6.2: Density as a function of the number of passes for various values of $\epsilon$, for flickr and im.

ε        b=30000  b=40000  b=50000
0        1.047    1.027    1.014
0.5      0.960    0.896    0.921
1        0.958    0.936    0.918
1.5      0.890    0.911    0.929
2        0.760    0.845    0.869
2.5      0.787    0.708    0.740
Memory   0.16     0.20     0.25

Table 4: Ratio of $\rho$ with and without sketching for flickr ($t = 5$).

6.6 MapReduce implementation

In this section we study the performance of the MapReduce implementation of our algorithms for both directed and undirected graphs. For this purpose, we use the im and twitter graphs since they are too big to be studied under the semi-streaming model. We implemented our algorithms in Hadoop (hadoop.apache.org) and ran them with 2000 mappers and 2000 reducers. Figure 6.7 shows the wall-clock running times for each pass for im, which is an undirected graph. It takes under 260 minutes for our algorithm to run on im (a massive graph with more than half a billion nodes). For twitter, which is a directed graph, our algorithm takes around 35 minutes for a given value of $c$ and for each iteration; Figure 6.6 shows that the number of iterations is between four and seven, and the number of values of $c$ to be tried is very small. These clearly show the scalability of our algorithms.

Figure 6.7: Time taken on im graph in MapReduce.

7. CONCLUSIONS

In this paper we studied the problem of finding dense subgraphs, a fundamental primitive in several data management applications, in streaming and MapReduce, two computational models that are increasingly being adopted by large-scale data processing applications. We showed a simple algorithm that makes a small number of passes over the graph and obtains a $(2+\epsilon)$-approximation to the densest subgraph. We then obtained several extensions of this algorithm: for the case when the subgraph is prescribed to be more than a certain size and when the graph is directed. To the best of our knowledge, these are the first algorithms for the densest subgraph problem that truly scale yet offer provable guarantees. Our experiments showed that the algorithms are indeed scalable and achieve quality and performance that is often much better than the theoretical guarantees. Our algorithm's scalability is the main reason it was possible to run it on a graph with more than half a billion nodes and six billion edges.

Acknowledgments

We thank the anonymous reviewers for their many useful comments.
Figure 6.5: Behavior of $|S|$, $|T|$, $|E(S, T)|$ for the best parameters of $c$, $\epsilon$ for livejournal.

Figure 6.6: Density and the number of passes at $\epsilon = 1$, $\delta = 2$ for twitter.

8. REFERENCES

[1] C. C. Aggarwal and H. Wang. Managing and Mining Graph Data. Springer Publishing Company, Incorporated, 1st edition, 2010.
[2] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Rev. Mod. Phys., 74(1):47-97, Jan 2002.
[3] R. Andersen and K. Chellapilla. Finding dense subgraphs with size bounds. In WAW, pages 25-37, 2009.
[4] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. JCSS, 68(4):702-732, 2004.
[5] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In SODA, pages 623-632, 2002.
[6] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In KDD, pages 16-24, 2008.
[7] B. Berger, J. Rompel, and P. W. Shor. Efficient NC algorithms for set cover with applications to learning and geometry. JCSS, 49(3):454-477, 1994.
[8] G. Buehrer and K. Chellapilla. A scalable pattern mining approach to web graph compression with communities. In WSDM, pages 95-106, 2008.
[9] A. Chakrabarti, S. Khot, and X. Sun. Near-optimal lower bounds on the multiparty communication complexity of set-disjointness. In CCC, pages 107-117, 2003.
[10] M. Charikar. Greedy approximation algorithms for finding dense components in a graph. In APPROX, pages 84-95, 2000.
[11] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. TCS, 312:3-15, 2004.
[12] J. Chen and Y. Saad. Dense subgraph extraction with application to community detection. TKDE, PP(99), 2011.
[13] F. Chierichetti, R. Kumar, and A. Tomkins. Max-Cover in Map-Reduce. In WWW, pages 231-240, 2010.
[14] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. In SODA, pages 937-946, 2002.
[15] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271-280, 2007.
[16] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. In OSDI, pages 137-150, 2004.
[17] Y. Dourisboure, F. Geraci, and M. Pellegrini. Extraction and classification of dense communities in the web. In WWW, pages 461-470, 2007.
[18] J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. On graph problems in a semi-streaming model. TCS, 348(2-3):207-216, 2005.
[19] J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and Z. Svitkina. On distributing symmetric streaming computations. TALG, 6:66:1-66:19, September 2010.
[20] A. Gajewar and A. D. Sarma. Multi-skill collaborative teams based on densest subgraphs. CoRR, abs/1102.3340, 2011.
[21] D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB, pages 721-732, 2005.
[22] A. V. Goldberg. Finding a maximum density subgraph. Technical Report UCB/CSD-84-171, EECS Department, University of California, Berkeley, 1984.
[23] R. Jin, Y. Xiang, N. Ruan, and D. Fuhry. 3-HOP: A high-compression indexing scheme for reachability query. In SIGMOD, pages 813-826, 2009.
[24] R. Kannan and V. Vinay. Analyzing the structure of large graphs, 1999. Manuscript.
[25] H. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In SODA, pages 938-948, 2010.
[26] S. Khuller and B. Saha. On finding dense subgraphs. In ICALP, pages 597-608, 2009.
[27] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In KDD, pages 611-617, 2006.
[28] S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: a method for solving graph problems in mapreduce. In SPAA, pages 85-94, 2011.
[29] E. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart, and Winston, 1976.
[30] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29-123, 2009.
[31] A. McGregor. Finding graph matchings in data streams. In APPROX, pages 170-181, 2005.
[32] M. Newman. Modularity and community structure in networks. PNAS, 103(23):8577-8582, 2006.
[33] G. D. F. Morales, A. Gionis, and M. Sozio. Social content matching in mapreduce. PVLDB, pages 460-469, 2011.
[34] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.
[35] A. Nandi, C. Yu, P. Bohannon, and R. Ramakrishnan. Distributed cube materialization on holistic measures. In ICDE, pages 183-194, 2011.
[36] B. Saha, A. Hoch, S. Khuller, L. Raschid, and X.-N. Zhang. Dense subgraphs with restrictions and applications to gene annotation graphs. In RECOMB, pages 456-472, 2010.
[37] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, pages 607-614, 2011.