[...] edu
Ravi Kumar, Yahoo Research, Sunnyvale, CA, ravikumar@yahoo-inc.com
Sergei Vassilvitskii, Yahoo Research, New York, NY, sergei@yahoo-inc.com

ABSTRACT
The problem of finding locally dense components of a graph is an important primitive in data analysis, with wide-ra [...]
[...] comes NP-hard [26]. Andersen and Chellapilla [3] as well as Khuller and Saha [26] show how to obtain 2-approximations for this version of the problem.

While the algorithms proposed in the above line of work guarantee good approximation factors, they are not efficient when run on very large datasets. In this work we show how to use the principles underlying existing algorithms, especially [10], to develop new algorithms that can be run in the data stream and distributed computing models, for example, MapReduce; this also resolves the open problem posed in [1, 13].

1.1 Streaming and MapReduce

As the datasets have grown to tera- and petabyte input sizes, two paradigms have emerged for developing algorithms that scale to such large inputs: streaming and MapReduce.

In the streaming model [34], one assumes that the input can be read sequentially in a number of passes over the data, while the total amount of random access memory (RAM) available to the computation is sublinear in the size of the input. The goal is to reduce the number of passes needed, all the while minimizing the amount of RAM necessary to store intermediate results. In the case the input is a graph, the nodes $V$ are known in advance, and the edges are streamed (it is known that most non-trivial graph problems require
$\Omega(|V|)$ RAM, even if multiple passes can be used [18]). The challenge in streaming algorithms lies in wisely using the limited amount of information that can be stored between passes.

Complementing streaming algorithms, MapReduce, and its open source implementation, Hadoop, has become the de facto model for distributed computation on a massive scale. Unlike streaming, where a single machine eventually sees the whole dataset, in MapReduce, the input is partitioned across a set of machines, each of which can perform a series of computations on its local slice of the data. The process can then be repeated, yielding a multi-pass algorithm (see [16] for the exact framework, and [19, 25] for theoretical models). It is well known that simple operations like sum and other holistic measures [35] as well as some graph primitives, like finding connected components [25], can be implemented in MapReduce in a work-efficient manner. The challenge lies in reducing the total number of passes with no machine ever seeing the entire dataset.

1.2 Our contributions

In this work we focus on obtaining efficient algorithms for the densest subgraph problem that can work on massive graphs, where the graph cannot be stored in the main memory. Specifically, we show how to modify the approach of [10] so that the resulting algorithm makes only $O(\frac{1}{\epsilon} \log n)$ passes over the data and guarantees to return an answer within a $(2+\epsilon)$ factor of optimum. We show that our algorithm only requires the computation of basic graph parameters (e.g., the degree of each node and the overall density) and thus can be easily parallelized; we use the MapReduce model to demonstrate one such parallel implementation. Finally, we show that despite the $(2+\epsilon)$ worst-case approximation guarantee, the algorithm's output is often nearly optimal on real-world graphs; moreover it can easily scale to graphs with billions of edges.

2. RELATED WORK

The densest subgraph problem lies at the core of large scale data mining and as such it and its variants have been intensively studied. Goldberg [22] was one of the first to formally introduce the problem of finding the densest subgraph in an undirected graph and gave an algorithm that required $O(\log n)$
flow computations to find the optimal solution; see also [29]. Charikar [10] described a simple greedy algorithm and showed that it leads to a 2-approximation to the optimum. When augmented with a constraint requiring the solution be of size at least $k$, the problem becomes NP-hard [26]. On the positive side, Andersen and Chellapilla [3] gave a 2-approximation to this version of the problem, and [26] gave a faster algorithm that achieves the same solution quality.

In the case the underlying graph is directed, Kannan and Vinay [24] were the first to define the notion of density and gave an $O(\log n)$ approximation algorithm. This was further improved by Charikar [10] who showed that it can be solved exactly in polynomial time by solving $O(n^2)$ linear programs, and obtained a combinatorial 2-approximation algorithm. The latter algorithm was simplified in the work of Khuller and Saha [26].

In addition to the steady theoretical progress, there is a rich line of work that tailored the problem to the specific task at hand. Variants of the densest subgraph problem have been used in computational biology (see, for example [1, Chapter 14]), community mining [12, 17, 32], and even to decide what subset of people would form the most effective working group [20]. The specific problem of finding dense subgraphs on very large datasets was addressed in Gibson et al. [21] who eschewed approximation guarantees and used shingling approaches to find sets of nodes with high neighborhood overlap.

Streaming and MapReduce. Data streaming and MapReduce have emerged as two leading paradigms for handling computation on very large datasets. In the data stream model, the input is assumed too large to fit into main memory, and is instead streamed past one object at a time. For an introduction to streaming, see the excellent survey by Muthukrishnan [34]. When streaming graphs, the typical assumption is that the set of nodes is known ahead of time and can fit into main memory, and the edges arrive one by one; this is the semi-streaming model of computation [18]. Algorithms for a variety of graph primitives, from matchings [31] to counting triangles [5, 6], have been proposed and analyzed in this setting.

While data streams are an efficient model of computation for a single machine, MapReduce has become a popular method for large-scale parallel processing. Beginning with the original work of Dean and Ghemawat [16], several algorithms have been proposed for distributed data analysis, from clustering [15] to solving set cover [13]. For graph problems, Karloff et al. [25] give algorithms for finding connected components and spanning trees; Suri and Vassilvitskii show how to count triangles effectively [37], while Lattanzi et al. [28] and Morales et al. [33] describe algorithms for finding matchings on massive graphs.

3. PRELIMINARIES

Let $G = (V, E)$ be an undirected graph. For a subset $S \subseteq V$, let the induced edge set be defined as $E(S) = E \cap S^2$ [...]

[...] exist, since $S$ eventually becomes empty. Clearly, $S \supseteq S^*$. Let $i \in A(S) \cap S^*$. We have
\[
\rho(S^*) \le \deg_{S^*}(i) \le \deg_S(i) \le (2+2\epsilon)\,\rho(S),
\]
where the first inequality follows from (4.1), the second from $S \supseteq S^*$, and the third from $i \in A(S)$. This implies $\rho(S) \ge \rho(S^*)/(2+2\epsilon)$ and hence the algorithm outputs a $(2+2\epsilon)$-approximation.

Next, we show that the algorithm removes a constant fraction of all of the nodes in every pass, and thus is guaranteed to terminate after $O(\log n)$ passes of the while loop.

Lemma 4. Algorithm 1 terminates in $O(\log_{1+\epsilon} n)$ passes.

Proof. At each step of the pass, we have
\[
2|E(S)| = \sum_{i \in A(S)} \deg_S(i) + \sum_{i \in S \setminus A(S)} \deg_S(i) > 2(1+\epsilon)\,(|S| - |A(S)|)\,\rho(S) = 2(1+\epsilon)\,(|S| - |A(S)|)\,\frac{|E(S)|}{|S|},
\]
where the inequality follows by considering only those nodes in $S \setminus A(S)$. Thus,
\[
|A(S)| > \frac{\epsilon}{1+\epsilon}\,|S|. \qquad (4.2)
\]
Equivalently,
\[
|S \setminus A(S)| < \frac{1}{1+\epsilon}\,|S|.
\]
Therefore, the cardinality of the remaining set $S$ decreases by a factor at least $1/(1+\epsilon)$ during each pass. Hence, the algorithm terminates in $O(\log_{1+\epsilon} n)$ passes.
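The pass structure analyzed in Lemmas 3 and 4 can be illustrated with a minimal Python sketch. It is a reconstruction from the proofs rather than the authors' code: it assumes the removal rule $A(S) = \{ i \in S : \deg_S(i) \le 2(1+\epsilon)\rho(S) \}$ with $\rho(S) = |E(S)|/|S|$, which is the rule the proof of Lemma 4 relies on. The function name `densest_subgraph` and the in-memory edge list are illustrative assumptions; a streaming implementation would re-read the edges from disk in every pass.

```python
from collections import defaultdict


def densest_subgraph(edges, epsilon=0.1):
    """Sketch of the multi-pass peeling rule (assumed form of Algorithm 1).

    edges: iterable of undirected (u, v) pairs without self-loops.
    Returns (best_set, best_density) with density |E(S)| / |S|.
    """
    edges = list(edges)
    S = {u for e in edges for u in e}
    best_S, best_rho = set(S), 0.0
    while S:
        # One pass over the edge stream: induced degrees and edge count.
        deg = defaultdict(int)
        m = 0
        for u, v in edges:
            if u in S and v in S:
                deg[u] += 1
                deg[v] += 1
                m += 1
        rho = m / len(S)
        if rho > best_rho:
            best_S, best_rho = set(S), rho
        # Remove every node whose induced degree is at most 2(1+eps)*rho.
        threshold = 2 * (1 + epsilon) * rho
        S = {i for i in S if deg[i] > threshold}
    return best_S, best_rho


# A 4-clique with a short path attached to node 3.
clique = [(a, b) for a in range(4) for b in range(a + 1, 4)]
found, density = densest_subgraph(clique + [(3, 4), (4, 5)], epsilon=0.1)
```

On this small example the first pass removes the two path nodes and the second pass sees the 4-clique, whose density 6/4 is the one retained. The sketch evaluates the density at the start of each pass rather than after each removal; both orderings consider the same candidate sets.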
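Notice that for small $\epsilon$, $\log(1+\epsilon) \approx \epsilon$ and hence the number of passes is $O(\frac{1}{\epsilon} \log n)$.

4.1.1 Lower bounds

In this section we show that our analysis is tight. In particular, we show that there are graphs on which Algorithm 1 makes $\Omega(\log n)$ passes. Furthermore, we also show that any algorithm that achieves a 2-approximation in $O(\log n)$ passes must use $\Omega(n/\log n)$ space. Note that Algorithm 1 comes close to this lower bound since it makes $O(\log n)$ passes and uses $O(n)$ memory.

Pass lower bound. We show that the analysis of the number of passes is tight up to constant factors. We begin with a slightly weaker result.

Lemma 5. There exists an unweighted graph on which Algorithm 1 requires $\Omega(\frac{\log n}{\log \log n})$ passes.

Proof. The graph consists of $k$ disjoint subsets $G_1, \ldots, G_k$, where $G_i$ is a $2^{i-1}$-regular graph on $|V_i| = 2^{2k+1-i}$ nodes, hence every $G_i$ has exactly $2^{2k-1}$ edges and has density of $2^{i-2}$. For any $\ell \ge 1$, let $G^\ell = \bigcup_{i \ge \ell} G_i$. The density of $G^\ell$ is:
\[
\rho(G^\ell) = \frac{(k-\ell+1)\,2^{2k-1}}{2^{k+1}\,(2^{k-\ell+1}-1)} \approx (k-\ell+1)\,2^{\ell-3}.
\]
We claim that in every pass the algorithm removes $O(\log k)$ of these subgraphs. Suppose that we start with the subgraph $G^\ell$ at the beginning of the pass. Then the nodes in $A(S)$ are exactly those that have their degree less than $\rho(G^\ell)(2+\epsilon) \approx (k-\ell+1)\,2^{\ell-2}$. Since a node in $G_i$ has degree $2^{i-1}$, this is equivalent to nodes in $G_i$ for $i \le (\ell-1) + \log(k-\ell)$, and hence the subgraph in the next pass is $G^{\ell+\log(k-\ell)-1}$. Therefore, the algorithm will take at least $\Omega(k/\log k)$ passes to complete. Since $k = \Theta(\log n)$, the proof follows.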
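The construction in the proof of Lemma 5 can be checked numerically without building the graph, because the $k$ regular pieces are disjoint and removing one piece leaves the degrees in the others unchanged. The sketch below is an illustration under that observation; the helper name `passes_on_lemma5_graph` is hypothetical, and the removal rule $\deg \le 2(1+\epsilon)\rho$ is the one assumed from the proof of Lemma 4. Exact rational arithmetic avoids rounding issues at the threshold.

```python
from fractions import Fraction


def passes_on_lemma5_graph(k, epsilon=Fraction(1, 10)):
    """Count passes of the degree-threshold rule on the Lemma 5 graph."""
    # Piece i (1 <= i <= k) is 2^(i-1)-regular on 2^(2k+1-i) nodes.
    degree = {i: 2 ** (i - 1) for i in range(1, k + 1)}
    nodes = {i: 2 ** (2 * k + 1 - i) for i in range(1, k + 1)}
    # Each piece has nodes * degree / 2 = 2^(2k-1) edges.
    edges = {i: nodes[i] * degree[i] // 2 for i in degree}

    remaining = set(degree)
    passes = 0
    while remaining:
        rho = Fraction(sum(edges[i] for i in remaining),
                       sum(nodes[i] for i in remaining))
        threshold = 2 * (1 + epsilon) * rho
        # All nodes of a piece share one degree, so pieces leave as a whole.
        remaining = {i for i in remaining if degree[i] > threshold}
        passes += 1
    return passes


passes_16 = passes_on_lemma5_graph(16)
```

For $k = 16$ and $\epsilon = 0.1$ the pieces leave in batches of sizes 4, 3, 3, 2, 2, 1, 1, which matches the claim that each pass removes only about $\log(k-\ell)$ of the remaining pieces.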
To show an example on which Algorithm 1 requires $\Omega(\log n)$ passes, we appeal to weighted graphs. Note that Algorithm 1 and the analysis easily generalize to finding the maximum density subgraph in an undirected weighted graph.

Lemma 6. There exists a weighted graph on which Algorithm 1 requires $\Omega(\log n)$ passes.

Proof Sketch. Consider a graph whose degree sequence follows a power law with exponent $0 < \alpha < 1$, i.e., if $d_i$ is the $i$th largest degree, then $d_i \propto i^{-\alpha}$. We have $\sum_{i=1}^{n} i^{-\alpha} \simeq \int_0^n x^{-\alpha}\,dx = \frac{n^{1-\alpha}}{1-\alpha}$, so if the graph has $m$ edges, we (approximately) have $d_i = (1-\alpha)\, i^{-\alpha}\, \frac{m}{n^{1-\alpha}}$. Hence, in the first pass of the algorithm, we remove all the nodes with
\[
d_i = (1-\alpha)\, i^{-\alpha}\, \frac{m}{n^{1-\alpha}} \le (2+\epsilon)\,\frac{m}{n}.
\]
Hence, the nodes such that
\[
i \le \left(\frac{1-\alpha}{2+\epsilon}\right)^{1/\alpha} n
\]
go to the next pass; note that this is a constant fraction of the nodes. As long as the power law property of the degree sequence is preserved after removing the low degree nodes in each pass, we obtain the desired $\Omega(\log n)$ lower bound.

Consider the graphs generated by the preferential attachment process [2]. To avoid the stochasticity in the model, which only makes the analysis more complicated, one can consider the following deterministic variant of this process: whenever a new node $u$ arrives, it adds an edge to all of the existing nodes $v$ and assigns a weight $w_{u,v}$ to the edge $(u, v)$ which is proportional to the current degree of $v$. Then the degree of the $i$th node after a total of $n$ nodes have arrived follows a power law distribution, which is exactly what we needed to achieve.
Space lower bound. We show that the tradeoff between memory and number of passes is almost the best possible. Namely, any constant-pass streaming algorithm for approximating the densest subgraph to within a constant factor of 2 must use a linear amount of memory, and an algorithm making $O(\log n)$ passes must use $\Omega(n/\log n)$ memory.

Lemma 7. Any $p$-pass streaming $\alpha$-approximation algorithm for the densest subgraph problem, where $\alpha \ge 2$, needs $\Omega(n/(p\alpha^2))$ space.

Proof. Consider the standard disjointness problem in the $q$-party arbitrary round communication model. There are $q \ge 2$ players, and the $j$th player has the $n$-bit vector $x^j_1, \ldots, x^j_n$. Their goal is to decide if there is an index $i$ such that $\bigwedge_{j=1}^{q} x^j_i = 1$. It is known that this problem needs $\Omega(n/q)$ communication [4, 9] and the lower bound holds even under the promise that either the bit vectors are pairwise disjoint (NO instance) or they have a unique common element but are otherwise disjoint (YES instance). [...]

[...] necessarily disjoint subsets $S, T \subseteq V$. We assume that the ratio $c = |S^*|/|T^*|$ for the optimal sets $S^*, T^*$ is known to the algorithm. In practice, one can do a search for this value, by trying the algorithm for different values of $c$ and retaining the best result.

The algorithm then proceeds in a similar spirit as in the undirected case. We begin with $S = T = V$ and remove either those nodes $A(S)$ whose outdegree to $T$ is below average, or those nodes $B(T)$ whose indegree to $S$ is below average. (Formally we need the degrees to be below a threshold slightly above the average for the algorithm to converge.) A naive way to decide whether the set $A(S)$ or $B(T)$ should be removed in the current pass is to look at the maximum outdegree, $E(i, T)$, of nodes in $A(S)$ and the maximum indegree, $E(S, j)$, of nodes in $B(T)$. If $E(S, j)/E(i, T) \ge c$ then $A(S)$ can be removed and otherwise $B(T)$ can be removed. However, a better way is to make this choice directly based on the current sizes of $S$ and $T$. Intuitively, if $|S|/|T| \ge c$, then we should be removing the nodes from $S$ to get the ratio closer to $c$, otherwise we should remove those from $T$. In addition to being simpler, this way is also faster mainly due to the fact that it needs to compute either $A(S)$ or $B(T)$ in every pass, leading to a significant speedup in practice. Algorithm 3 contains the formal description.

Algorithm 3 Densest subgraph for directed graphs.
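Require: $G = (V, E)$, $c > 0$, and $\epsilon > 0$
  $\tilde{S}, \tilde{T}, S, T \leftarrow V$
  while $S \ne \emptyset$ and $T \ne \emptyset$ do
    if $|S|/|T| \ge c$ then
      $A(S) \leftarrow \{ i \in S \mid |E(i, T)| \le (1+\epsilon)\,\frac{|E(S,T)|}{|S|} \}$
      $S \leftarrow S \setminus A(S)$
    else
      $B(T) \leftarrow \{ j \in T \mid |E(S, j)| \le (1+\epsilon)\,\frac{|E(S,T)|}{|T|} \}$
      $T \leftarrow T \setminus B(T)$
    end if
    if $\rho(S, T) > \rho(\tilde{S}, \tilde{T})$ then
      $\tilde{S} \leftarrow S$, $\tilde{T} \leftarrow T$
    end if
  end while
  return $\tilde{S}, \tilde{T}$

First, we analyze the approximation factor of the algorithm.

Lemma 12. Algorithm 3 leads to a $(2+2\epsilon)$-approximation to the densest subgraph problem on directed graphs.

Proof. As in [10], we generate an assignment of the edges to the endpoints corresponding to the algorithm. Whenever $(i, j) \in E$, and $i \in A(S)$ is removed from $S$, we assign $(i, j)$ to $i$; a similar assignment is made for the nodes in $B(T)$. Let $\tilde{\rho} = \rho(\tilde{S}, \tilde{T})$. Let $\deg_{out}$ be the maximum outdegree and $\deg_{in}$ be the maximum indegree in $G$. We need to show that if $A(S)$ is removed, then
\[
\forall i \in A(S), \quad \sqrt{c}\,|E(i, T)| \le (1+\epsilon)\,\rho(S, T).
\]
Suppose that $|S|/|T| \ge c$, and so the nodes in $A(S)$ will be removed. For all $i \in A(S)$, we have
\[
\begin{aligned}
\sqrt{c}\,|E(i, T)| &\le \sqrt{c}\,(1+\epsilon)\,\frac{|E(S,T)|}{|S|} \\
&\le \sqrt{c}\,(1+\epsilon)\,|E(S,T)|\,\sqrt{\frac{1}{c\,|S|\,|T|}} \\
&= (1+\epsilon)\,\frac{|E(S,T)|}{\sqrt{|S|\,|T|}} = (1+\epsilon)\,\rho(S, T).
\end{aligned}
\]
The second line follows because $|S| \ge c|T| \Rightarrow |S| \ge \sqrt{c\,|S|\,|T|}$. Similarly, one can show that if $B(T)$ gets removed, then
\[
\forall j \in B(T), \quad \frac{1}{\sqrt{c}}\,|E(S, j)| \le (1+\epsilon)\,\rho(S, T).
\]
This proves that in the given assignment (of edges to endpoints), $\sqrt{c}\,\deg_{out} \le (1+\epsilon)\tilde{\rho}$ and $\frac{1}{\sqrt{c}}\,\deg_{in} \le (1+\epsilon)\tilde{\rho}$. Once we have such an assignment, we can use the same logic as in Lemmas 7 and 8 in [10] to conclude that the algorithm gives a $(2+2\epsilon)$-approximation:
\[
\max_{S, T \subseteq V,\ |S|/|T| = c} \{ \rho(S, T) \} \le (2+2\epsilon)\,\rho(\tilde{S}, \tilde{T}).
\]

Next, we analyze the number of passes of the algorithm. The proof is similar to that of Lemma 4.

Lemma 13. Algorithm 3 terminates in $O(\log_{1+\epsilon} n)$ passes.

Proof. We have
\[
|E(S,T)| = \sum_{i \in A(S)} |E(i, T)| + \sum_{i \in S \setminus A(S)} |E(i, T)| > (1+\epsilon)\,(|S| - |A(S)|)\,\frac{|E(S,T)|}{|S|},
\]
which yields
\[
|S \setminus A(S)| < \frac{1}{1+\epsilon}\,|S|.
\]
Similarly, we can prove
\[
|T \setminus B(T)| < \frac{1}{1+\epsilon}\,|T|.
\]
Therefore, during each pass of the algorithm, either the size of the remaining set $S$ or the size of the remaining set $T$ goes down by a factor of at least $1/(1+\epsilon)$. Hence, in $O(\log_{1+\epsilon} n)$ passes, one of these sets becomes empty and the algorithm terminates.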
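Algorithm 3 translates almost line by line into code. The following Python sketch is an illustration only: the function name `densest_directed_subgraph` and the in-memory edge list are assumptions, and the density $\rho(S,T) = |E(S,T)|/\sqrt{|S||T|}$ is evaluated at the start of each pass instead of after each removal, which considers the same candidate pairs.

```python
import math
from collections import Counter


def densest_directed_subgraph(edges, c=1.0, epsilon=0.1):
    """Sketch of Algorithm 3 for a fixed ratio c (directed edges (i, j))."""
    edges = list(edges)
    nodes = {x for e in edges for x in e}
    S, T = set(nodes), set(nodes)
    best_S, best_T, best_rho = set(S), set(T), 0.0
    while S and T:
        # One pass: out-degrees into T, in-degrees from S, and |E(S, T)|.
        out_deg, in_deg = Counter(), Counter()
        m = 0
        for i, j in edges:
            if i in S and j in T:
                out_deg[i] += 1
                in_deg[j] += 1
                m += 1
        rho = m / math.sqrt(len(S) * len(T))
        if rho > best_rho:
            best_S, best_T, best_rho = set(S), set(T), rho
        if len(S) / len(T) >= c:
            # Remove A(S): out-degree at most (1+eps) times the average.
            cutoff = (1 + epsilon) * m / len(S)
            S = {i for i in S if out_deg[i] > cutoff}
        else:
            # Remove B(T): in-degree at most (1+eps) times the average.
            cutoff = (1 + epsilon) * m / len(T)
            T = {j for j in T if in_deg[j] > cutoff}
    return best_S, best_T, best_rho


# Complete bipartite block {a, b} -> {x, y, z} plus one stray edge p -> q.
block = [(s, t) for s in "ab" for t in "xyz"]
S_best, T_best, rho_best = densest_directed_subgraph(block + [("p", "q")], c=0.5)
```

On this example the passes alternate between shrinking $S$ and shrinking $T$, and the retained pair is the bipartite block with density $6/\sqrt{6}$. In practice the function would be called once per candidate value of $c$, keeping the best result.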
5. PRACTICAL CONSIDERATIONS

In this section we describe two practical considerations in implementing the algorithms. The first (Section 5.1) is a heuristic method based on Count-Sketch to cut the memory requirements of the algorithm. The second (Section 5.2) is a discussion on how to realize the algorithms in the MapReduce computing model.

5.1 Heuristic improvements

We showed in Lemma 7 that any $p$-pass algorithm achieving a 2-approximation to the densest subgraph problem must use at least $\Omega(\frac{n}{p})$ space. However, even this amount of space can be prohibitively large for very large datasets. To further reduce the space required by the algorithms we turn to [...]

G             type         |V|      |E|
flickr        undirected   976K     7.6M
im            undirected   645M     6.1B
livejournal   directed     4.84M    68.9M
twitter       directed     50.7M    2.7B

Table 1: Parameters of the graphs used in the experiments.

[...] quadratic time (linear time for each pass and a linear number of passes), which is still infeasible for these graphs. In order to circumvent this, we work with slightly smaller graphs just to compare the quality of the solution to that of the optimum (Section 6.2).

6.2 Quality of approximation

We study how good an approximation is obtained by our algorithm for the undirected case. To enable this, we need to compute the value of the optimum. Recall that, as mentioned in Section 1, both the directed and undirected densest subgraph problems can be solved exactly using parametric
flow. In this section we want to obtain $\rho^*$, i.e., the value of the optimal solution, to argue that the approximation factor in practice is much better than $2(1+\epsilon)$, guaranteed by Lemma 3. (To do such a test for directed graphs is very expensive because one has to try all $n^2$ values of $c$.)

In order to solve the densest subgraph problem exactly, we use the following linear programming (LP) formulation.
\[
\begin{aligned}
\max \quad & \sum_{ij} x_{ij} \\
& x_{ij} \le y_i, && \forall (i,j) \in E, \\
& x_{ij} \le y_j, && \forall (i,j) \in E, \\
& \sum_i y_i \le 1, \\
& x_{ij},\, y_i \ge 0.
\end{aligned}
\]
Charikar [10] showed that the value of this LP is precisely equal to $\rho^*(G)$. We use this observation to measure the quality of approximation obtained by Algorithm 1. To solve the LP, we use the COIN-OR CLP solver (projects.coin-or.org/Clp). We use seven moderately-sized undirected graphs publicly available at SNAP (snap.stanford.edu). Table 2 shows the parameters of these graphs and the approximation factor of our algorithms for different settings of $\epsilon$. It is clear that the approximation factors obtained by our algorithm are much better than what Lemma 3 promises. Furthermore, even high values of $\epsilon$ seem to hardly hurt the approximation guarantees.

6.3 Undirected graphs

In this section we study the performance of our algorithms on two undirected graphs, namely, flickr and im. First, we study the effect of $\epsilon$ on the approximation factor and the number of passes. Figure 6.1 shows the results. For ease of comparison, we show the values relative to the density obtained by our algorithm for $\epsilon = 0$. (Note that the setting $\epsilon = 0$ is similar to Charikar's algorithm [10] in terms of the approximation factor but can run in a much smaller number of passes; however, termination is not guaranteed for $\epsilon = 0$.) As we saw in Table 2, the approximation does not deteriorate for higher values of $\epsilon$ (note that the performance is not
monotone in $\epsilon$). Choosing a value of $\epsilon \in [0.5, 1]$ seems to cut down the number of passes by half while losing only 10% of the optimum.

Figure 6.1: Effect of $\epsilon$ on the approximation and the number of passes.

We then move on to analyze the graph structure as the passes progress. Figure 6.2 shows the relative density as a function of the number of passes. (Curiously, we observe a unimodal behavior for flickr, but this does not seem to hold in general.) Figure 6.3 shows the number of nodes and edges in the graph after each pass. The shape of the plots suggests that the graph gets dramatically smaller even in the early passes. This is a very useful feature in practice, since if the graph gets very small early on, then the rest of the computation can be done in the main memory. This will avoid the overhead of additional passes.

Note also that the worst-case bound of $O(\log_{1+\epsilon} n)$ for the number of passes as given by Lemma 4 is never achieved by these graphs. This is possibly because of the heavy-tail nature of the degree distribution of graphs derived from social networks and their core connectivity properties; see [27, 30]. These properties may also contribute to achieving the good approximation ratio, i.e., the worst-case bound of Lemma 3 is not met by these graphs. Exploring these in further detail is outside the scope of this work and is an interesting area of future research.

                                            $\rho^*(G)/\tilde{\rho}(G)$
G = (V, E)     |V|      |E|       $\rho^*(G)$   $\epsilon=0.001$   $\epsilon=0.1$   $\epsilon=1$
as20000102     6,474    13,233    9.29          1.229              1.268            1.194
ca-AstroPh     18,772   396,160   32.12         1.147              1.156            1.273
ca-CondMat     23,133   186,936   13.47         1.072              1.072            1.429
ca-GrQc        5,242    28,980    22.39         1.000              1.000            1.395
ca-HepPh       12,008   237,010   119.00        1.000              1.017            1.151
ca-HepTh       9,877    51,971    15.50         1.000              1.000            1.356
email-Enron    36,692   367,662   37.34         1.058              1.072            1.063

Table 2: Empirical approximation bounds for various values of $\epsilon$.

6.4 Directed graphs

In this section we study the performance of the directed graph version of our algorithm. We use the livejournal and twitter graphs for this purpose. Recall that for directed graphs, we have to try various values of $c$ (Section 4.3). Of course, trying all $n^2$ possible values of $c$ is prohibitive. A simple alternative is to choose a resolution (
$\delta > 1$) and try $c$ at different powers of $\delta$ (one can prove that this worsens the approximation guarantee by at most a factor $\delta$ [10]). Clearly, the running time is given by $2 \log n / \log \delta$.

Figure 6.3: Number of nodes and edges in the graph after each step of the pass for flickr and im.

First, we study the effect of the choice of $\delta$ compared to the choice of $\epsilon$. Table 3 shows the results. From the values, it is easy to see that as long as $\delta$ remains reasonable, the effect of $\epsilon$ is as in the undirected case. To make the rest of the study less cumbersome, we fix $\delta = 2$ for the remainder of this section. First, we present the results for livejournal.

$\epsilon$ \ $\delta$   2        10       100
0                       325.27   312.13   307.96
1                       334.38   308.70   306.91
2                       294.50   284.47   179.59

Table 3: livejournal: $\rho$ for different $\epsilon$ and $\delta$.

We study the performance of the algorithm for various choices of $c$, given $\delta = 2$. In particular, we measure the density and the number of passes. Figure 6.4 shows the values. The behavior of density is quite complex, and for livejournal, the optimum occurs when the relative sizes of $S$ and $T$ are not skewed.

Finally, Figure 6.5 shows the behavior of livejournal for the best setting of $c$ (which is 0.436) for $\delta = 2$, $\epsilon = 1$. It clearly shows the "alternate" nature of the simplified algorithm (Algorithm 3) that we developed in Section 4.3. As always, the number of nodes and edges falls dramatically as the passes progress.

For twitter, we used $\epsilon = 1$, $\delta = 2$ and studied the performance of the algorithm for various values of $c$. Figure 6.6 shows the density and the number of passes for various values of $c$. Unlike livejournal, the best value of $c$ is not concentrated around 1. This may be due to the highly skewed nature of the twitter graph: for example, there are about 600 popular users who are followed by more than 30 million other users. The results from livejournal and twitter suggest that, in practice, one can safely skip many values of $c$.

6.5 Performance of sketching

In this section we discuss the performance of the sketching heuristic presented in Section 5.1. We tested the algorithm on flickr, which has 976K nodes. Recall that the number of words in a Count-Sketch scheme using $b$ buckets and $t$
independent hash tables is $t \cdot b$. In Table 4, we show the ratio of the densest subgraph with and without sketching, for various values of $b$ and $\epsilon$. The bottom row shows the main memory used by the algorithm with sketching compared to the algorithm without sketching. Clearly, for small values of $\epsilon$, the performance difference is not very significant even for $b = 30000$, which means only $5 \cdot 30000 / 976\text{K} = 16\%$ of main memory is used. This suggests that, despite the space lower bounds (Lemma 7), in practice, a sketching scheme can obtain significant savings in main memory.

Figure 6.2: Density as a function of the number of passes for various values of $\epsilon$, for flickr and im.

$\epsilon$   b=30000   b=40000   b=50000
0            1.047     1.027     1.014
0.5          0.960     0.896     0.921
1            0.958     0.936     0.918
1.5          0.890     0.911     0.929
2            0.760     0.845     0.869
2.5          0.787     0.708     0.740
Memory       0.16      0.20      0.25

Table 4: Ratio of $\rho$ with and without sketching for flickr ($t = 5$).

6.6 MapReduce implementation

In this section we study the performance of the MapReduce implementation of our algorithms for both directed and undirected graphs. For this purpose, we use the im and twitter graphs since they are too big to be studied under the semi-streaming model. We implemented our algorithms in Hadoop (hadoop.apache.org) and ran them with 2000 mappers and 2000 reducers. Figure 6.7 shows the wall-clock running times for each pass for im, which is an undirected
graph. It takes under 260 minutes for our algorithm to run on im (a massive graph with more than half a billion nodes). For twitter, which is a directed graph, our algorithm takes around 35 minutes for a given value of $c$ and for each iteration; Figure 6.6 shows that the number of iterations is between four and seven, and the number of values of $c$ to be tried is very small. These clearly show the scalability of our algorithms.

Figure 6.4: Density and the number of passes at $\delta = 2$ for livejournal.

7. CONCLUSIONS

In this paper we studied the problem of finding dense subgraphs, a fundamental primitive in several data management applications, in streaming and MapReduce, two computational models that are increasingly being adopted by large-scale data processing applications. We showed a simple algorithm that makes a small number of passes over the graph and obtains a $(2+\epsilon)$-approximation to the densest subgraph. We then obtained several extensions of this algorithm: for the case when the subgraph is prescribed to be more than a certain size and when the graph is directed. To the best of our knowledge, these are the first algorithms for the densest subgraph problem that truly scale yet offer provable guarantees. Our experiments showed that the algorithms are indeed scalable and achieve quality and performance that is often much better than the theoretical guarantees. Our algorithm's scalability is the main reason it was possible to run it on a graph with more than half a billion nodes and six billion edges.

Acknowledgments

We thank the anonymous reviewers for their many useful comments.

Figure 6.7: Time taken on im graph in MapReduce.

Figure 6.5: Behavior of $|S|$, $|T|$, $|E(S,T)|$ for the best parameters of $c$ for livejournal.
Figure 6.6: Density and the number of passes at $\epsilon = 1$, $\delta = 2$ for twitter.

8. REFERENCES

[1] C. C. Aggarwal and H. Wang. Managing and Mining Graph Data. Springer Publishing Company, Incorporated, 1st edition, 2010.
[2] R. Albert and A.-L. Barabasi. Statistical mechanics of complex networks. Rev. Mod. Phys., 74(1):47–97, Jan 2002.
[3] R. Andersen and K. Chellapilla. Finding dense subgraphs with size bounds. In WAW, pages 25–37, 2009.
[4] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. JCSS, 68(4):702–732, 2004.
[5] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In SODA, pages 623–632, 2002.
[6] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In KDD, pages 16–24, 2008.
[7] B. Berger, J. Rompel, and P. W. Shor. Efficient NC algorithms for set cover with applications to learning and geometry. JCSS, 49(3):454–477, 1994.
[8] G. Buehrer and K. Chellapilla. A scalable pattern mining approach to web graph compression with communities. In WSDM, pages 95–106, 2008.
[9] A. Chakrabarti, S. Khot, and X. Sun. Near-optimal lower bounds on the multiparty communication complexity of set-disjointness. In CCC, pages 107–117, 2003.
[10] M. Charikar. Greedy approximation algorithms for finding dense components in a graph. In APPROX, pages 84–95, 2000.
[11] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. TCS, 312:3–15, 2004.
[12] J. Chen and Y. Saad. Dense subgraph extraction with application to community detection. TKDE, PP(99), 2011.
[13] F. Chierichetti, R. Kumar, and A. Tomkins. Max-Cover in Map-Reduce. In WWW, pages 231–240, 2010.
[14] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. In SODA, pages 937–946, 2002.
[15] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271–280, 2007.
[16] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. In OSDI, pages 137–150, 2004.
[17] Y. Dourisboure, F. Geraci, and M. Pellegrini. Extraction and classification of dense communities in the web. In WWW, pages 461–470, 2007.
[18] J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. On graph problems in a semi-streaming model. TCS, 348(2-3):207–216, 2005.
[19] J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and Z. Svitkina. On distributing symmetric streaming computations. TALG, 6:66:1–66:19, September 2010.
[20] A. Gajewar and A. D. Sarma. Multi-skill collaborative teams based on densest subgraphs. CoRR, abs/1102.3340, 2011.
[21] D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB, pages 721–732, 2005.
[22] A. V. Goldberg. Finding a maximum density subgraph. Technical Report UCB/CSD-84-171, EECS Department, University of California, Berkeley, 1984.
[23] R. Jin, Y. Xiang, N. Ruan, and D. Fuhry. 3-HOP: A high-compression indexing scheme for reachability query. In SIGMOD, pages 813–826, 2009.
[24] R. Kannan and V. Vinay. Analyzing the structure of large graphs, 1999. Manuscript.
[25] H. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In SODA, pages 938–948, 2010.
[26] S. Khuller and B. Saha. On finding dense subgraphs. In ICALP, pages 597–608, 2009.
[27] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In KDD, pages 611–617, 2006.
[28] S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: a method for solving graph problems in mapreduce. In SPAA, pages 85–94, 2011.
[29] E. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart, and Winston, 1976.
[30] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009.
[31] A. McGregor. Finding graph matchings in data streams. In APPROX, pages 170–181, 2005.
[32] M. Newman. Modularity and community structure in networks. PNAS, 103(23):8577–8582, 2006.
[33] G. D. F. Morales, A. Gionis, and M. Sozio. Social content matching in mapreduce. PVLDB, pages 460–469, 2011.
[34] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.
[35] A. Nandi, C. Yu, P. Bohannon, and R. Ramakrishnan. Distributed cube materialization on holistic measures. In ICDE, pages 183–194, 2011.
[36] B. Saha, A. Hoch, S. Khuller, L. Raschid, and X.-N. Zhang. Dense subgraphs with restrictions and applications to gene annotation graphs. In RECOMB, pages 456–472, 2010.
[37] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, pages 607–614, 2011.