One Trillion Edges: Graph Processing at Facebook-Scale


Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, Sambavi Muthukrishnan
Facebook, 1 Hacker Lane, Menlo Park, California
aching@fb.com, edunov@fb.com, majakabiljo@fb.com, dionysios@fb.com, sambavim@fb.com

ABSTRACT

Analyzing large graphs provides valuable insights for social networking and web companies in content ranking and recommendations. While numerous graph processing systems have been developed and evaluated on available benchmark graphs of up to 6.6B edges, they often face significant difficulties in scaling to much larger graphs. Industry graphs can be two orders of magnitude larger - hundreds of billions or up to one trillion edges. In addition to scalability challenges, real world applications often require much more complex graph processing workflows than previously evaluated. In this paper, we describe the usability, performance, and scalability improvements we made to Apache Giraph, an open-source graph processing system, in order to use it on Facebook-scale graphs of up to one trillion edges. We also describe several key extensions to the original Pregel model that make it possible to develop a broader range of production graph applications and workflows as well as improve code reuse. Finally, we report on real-world operations as well as performance characteristics of several large-scale production applications.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing info@vldb.org. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 12. Copyright 2015 VLDB Endowment 2150-8097/15/08.

1. INTRODUCTION

Graph structures are ubiquitous: they provide a basic model of entities with connections between them that can represent almost anything. Facebook manages a social graph [41] that is composed of people, their friendships, subscriptions, likes, posts, and many other connections. Open graph [7] allows application developers to connect objects in their applications with real-world actions (such as user X is listening to song Y). Analyzing these real world graphs at the scale of hundreds of billions or even a trillion (10^12) edges with available software was very difficult when we began a project to run Facebook-scale graph applications in the summer of 2012, and is still the case today.

Table 1: Popular benchmark graphs.
Graph                    Vertices  Edges
LiveJournal [9]          4.8M      69M
Twitter 2010 [31]        42M       1.5B
UK web graph 2007 [10]   109M      3.7B
Yahoo web [8]            1.4B      6.6B

Many specialized graph processing frameworks (e.g. [20, 21, 32, 44]) have been developed to run on web and social graphs such as those shown in Table 1. Unfortunately, real world social networks are orders of magnitude larger. Twitter has 288M monthly active users as of 3/2015 and an estimated average of 208 followers per user [4] for an estimated total of 60B followers (edges). Facebook has 1.39B active users as of 12/2014 with more than 400B edges. Many of the performance and scalability bottlenecks are very different when considering real-world industry workloads. Several studies [14, 24] have documented that many graph frameworks fail at much smaller scale, mostly due to inefficient memory usage. Asynchronous graph processing engines tend to have additional challenges in scaling to larger graphs due to unbounded message queues causing memory overload, vertex-centric locking complexity and overhead, and difficulty in leveraging high network bandwidth due to fine-grained computation.

Correspondingly, there is a lack of information on how applications perform and scale to practical problems on trillion-edge graphs. In practice, many web companies have chosen to build graph applications on their existing MapReduce infrastructure rather than a graph processing framework, for a variety of reasons. First, many such companies already run MapReduce applications [17] on existing Hadoop [2] infrastructure and do not want to maintain a different service that can only process graphs. Second, many of these frameworks are either closed source (i.e. Pregel) or written in a language other than Java (e.g. GraphLab is written in C++). Since much of today's data sets are stored in Hadoop, having easy access to HDFS and/or higher-level abstractions such as Hive tables is essential to interoperating with existing Hadoop infrastructure. With so many variants and versions of Hive/HDFS in use, providing native C++ or other language support is both unappealing and time consuming. Apache Giraph [1] fills this gap as it is written in Java and has vertex and edge input formats that can access MapReduce input formats as well as Hive tables [40]. Users can insert Giraph applications into existing Hadoop pipelines and leverage operational expertise from Hadoop. While Giraph initially did not scale to our needs at Facebook, with over 1.39B users and hundreds of billions of social connections, we improved the platform in a variety of ways to support our workloads and implement our production applications. We describe our experiences scaling and extending existing graph processing models to enable a broad set of both graph mining and iterative applications. Our contributions are the following:

- Present usability, performance, and scalability improvements for Apache Giraph that enable trillion-edge graph computations.
- Describe extensions to the Pregel model and why we found them useful for our graph applications.
- Real world applications and their performance on the Facebook graph.
- Share operational experiences running large-scale production applications on existing MapReduce infrastructure.
- Contribution of production code (including all extensions described in this paper) into the open-source Apache Giraph project.

The rest of the paper is organized as follows. Section 2 describes related work. Section 3 provides a summary of Giraph, details our reasons for selecting it as our initial graph processing platform, and explains our usability and scalability improvements. Section 4 describes our generalization to the original Pregel graph processing model for creating more powerful application building blocks and reusable code. Section 5 details Giraph applications and their performance for a variety of workloads. In Section 6, we share our graph processing operational experiences. In Section 7, we conclude our work and describe potential future work.

2. RELATED WORK

Large-scale graph computing based on the Bulk Synchronous Parallel (BSP) model [42] was first introduced by Malewicz et al. in the Pregel system [33]. Unfortunately, the Pregel source code was not made public. Apache Giraph was designed to bring large-scale graph processing to the open source community, based loosely on the Pregel model, while providing the ability to run on existing Hadoop infrastructure. Many other graph processing frameworks (e.g. [37, 15]) are also based on the BSP model.

MapReduce has been used to execute large-scale graph parallel algorithms and is also based on the BSP model. Unfortunately, graph algorithms tend to be iterative in nature and typically do not perform well in the MapReduce compute model. Even with these limitations, several graph and iterative computing libraries have been built on MapReduce due to its ability to run reliably in production environments [3, 30]. Iterative frameworks on MapReduce-style computing models are an area that has been explored in Twister [18] and HaLoop [13].

Asynchronous models of graph computing have been proposed in such systems as Signal-Collect [39], GraphLab [20] and GRACE [44]. While asynchronous graph computing has been demonstrated to converge faster for some applications, it adds considerable complexity to the system and the developer. Most notably, without program repeatability it is difficult to ascertain whether bugs lie in the system infrastructure or the application code. Furthermore, asynchronous messaging queues for certain vertices may unpredictably cause machines to run out of memory. DAG-based execution systems that generalize the MapReduce model to broader computation models, such as Hyracks [11], Spark [45], and Dryad [28], can also do graph and iterative computation. Spark additionally has a higher-level graph computing library built on top of it, called GraphX [21], that allows the user to process graphs in an interactive, distributed manner.

Single machine graph computing implementations such as Cassovary [23] are used at Twitter. Another single machine implementation, GraphChi [32], can efficiently process large graphs out-of-core. Latency-tolerant graph processing techniques for commodity processors, as opposed to hardware multithreading systems (e.g. Cray XMT), were explored in [34]. In Trinity [38], graph processing and databases are combined into a single system. Parallel BGL [22] parallelizes
graph computations on top of MPI. Piccolo [36] executes distributed graph computations on top of partitioned tables. Presto [43], a distributed R framework, implements matrix operations efficiently and can be used for graph analysis. Several DSL graph languages have been built, including Green-Marl [25], which can compile to Giraph code, and SPARQL [26], a graph traversal language.

3. APACHE GIRAPH

Apache Giraph is an iterative graph processing system designed to scale to hundreds or thousands of machines and process trillions of edges. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph was inspired by Pregel, the graph processing architecture developed at Google. Pregel provides a simple graph processing API for scalable batch execution of iterative graph algorithms. Giraph has greatly extended the basic Pregel model with new functionality such as master computation, sharded aggregators, edge-oriented input, out-of-core computation, composable computation, and more. Giraph has a steady development cycle and a growing community of users worldwide.

3.1 Early Facebook experimentation

In the summer of 2012, we began exploring a diverse set of graph algorithms across many different Facebook products as well as related academic literature. We selected a few representative use cases that cut across the problem space with different system bottlenecks and programming complexity. Our diverse use cases and the desired features of the programming framework drove the requirements for our system infrastructure. We required an iterative computing model, graph-based API, and easy access to Facebook data. We knew that this infrastructure would need to work at the scale of hundreds of billions of edges. Finally, as we would be rapidly developing this software, we had to be able to easily identify application and infrastructure bugs and have repeatable, reliable performance. Based on these requirements, we selected a few promising graph-processing platforms including Hive, GraphLab, and Giraph for evaluation.

We used label propagation, among other graph algorithms, to compare the selected platforms. Label propagation is an iterative graph algorithm that infers unlabeled data from labeled data. The basic idea is that during each iteration of the algorithm, every vertex propagates its probabilistic labels (for example, website classifications - see Figure 1) to its neighboring vertices, collects the labels from its neighbors, and calculates new probabilities for its labels. We implemented this algorithm on all platforms and compared the performance of Hive 0.9, GraphLab 2, and Giraph on a small scale of 25 million edges.

Figure 1: An example of label propagation: Inferring unknown website classifications from known website classifications in a graph where links are generated from overlapping website keywords.

We ended up choosing Giraph for several compelling reasons. Giraph directly interfaces with our internal version of HDFS (since Giraph is written in Java) as well as Hive. Since Giraph is scheduled as a MapReduce job, we can leverage our existing MapReduce (Corona) infrastructure stack with little operational overhead. With respect to performance, at the time of testing in late summer 2012, Giraph was faster than the other frameworks. Perhaps most importantly, the BSP model of Giraph was easy to debug as it provides repeatable results and was the most straightforward to scale, since we did not have to handle all the problems of asynchronous graph processing frameworks. BSP also made it easy to implement composable computation (see Section 4.3) and simple to do checkpointing.

3.2 Platform improvements

Even though we had chosen a platform, there was a lot of work ahead of us. Here are some of the limitations that we needed to address:

- Giraph's graph input model was only vertex centric, requiring us to either extend the model or do vertex centric graph preparation external to Giraph.
- Parallelizing Giraph infrastructure relied completely on MapReduce's task level parallelism and did not have multithreading support for fine grain parallelism.
- Giraph's flexible types and computing model were initially implemented using native Java objects and consumed excessive memory and garbage collection time.
- The aggregator framework was inefficiently implemented in ZooKeeper and we needed to support very large aggregators (e.g. gigabytes).

We selected three production applications, label propagation, variants of PageRank [12], and k-means clustering, to drive the direction of our development. Running these applications on graphs as large as the full Facebook friendship graph, with over 1 billion users and hundreds of billions of friendships, required us to address these shortcomings in Giraph. In the following sections, first, we describe our efforts with flexible vertex and edge based input to load and store graphs into Facebook's data warehouse. Second, we detail our performance and scalability enhancements with parallelization approaches, memory optimization, and sharded aggregators. In addition, whenever relevant, we add a citation in the form of GIRAPH-XXX as the Giraph JIRA number associated with the work described, which can be referenced at [5].

3.2.1 Flexible vertex/edge based input

Since Giraph is a computing platform, it needs to interface with external storage to read the input and write back the output of a batch computation. Similarly to MapReduce, we can define custom input and output formats for various data sources (e.g., HDFS files, HBase tables, Hive tables). The Facebook software stack is shown in Figure 2. Datasets fed to a Giraph job consist of vertices and edges, typically with some attached metadata. For instance, a label propagation algorithm might read vertices with initial labels and edges with attached weights. The output will consist of vertices with their final labels, possibly with confidence values.

Figure 2: Giraph leverages HiveIO to directly access Hive tables and can run on MapReduce and YARN.

The original input model in Giraph required a rather rigid layout: all data relative to a vertex, including outgoing edges, had to be read from the same record and were assumed to exist in the same data source, for instance, the same Hive table. We found these restrictions suboptimal for most of our use cases. First, graphs in our data warehouse are usually stored in a relational format (one edge per row), and grouping them by source vertex requires an extra MapReduce job for pre-processing. Second, vertex data, such as initial labels in the example above, may be stored separately from edge data. This required an extra pre-processing step to join vertices with their edges, adding unnecessary overhead. Finally, in some cases, we needed to combine vertex data from different tables into a single one before running a Giraph job.

To address these shortcomings, we modified Giraph to allow loading vertex data and edges from separate sources (GIRAPH-155). Each worker can read an arbitrary subset of the edges, which are then appropriately distributed so that each vertex has all its outgoing edges. This new model also encourages reusing datasets: different algorithms may require loading different vertex metadata while operating on the same graph (e.g., the graph of Facebook friendships). One further generalization we implemented is the ability to add an arbitrary number of data sources (GIRAPH-639), so that, for example, multiple Hive tables with different schemas can be combined into the input for a single Giraph job. One could also consider loading from different sources from the same application (e.g. loading vertex data from MySQL machines and edge data from Hadoop sequence files). All of these modifications give us the flexibility to run graph algorithms on our existing and diverse datasets, as well as reduce the required pre-processing to a minimum or eliminate it in many cases. These same techniques can be used in other graph processing frameworks as well.

3.2.2 Parallelization support

Since a Giraph application is scheduled as a single MapReduce job, it initially inherited the MapReduce way of parallelizing a job, that is, by increasing the number of workers (mappers) for the job. Unfortunately, it is hard to share resources with other Hadoop tasks running on the same machine due to differing requirements and resource expectations. When Giraph runs one monopolizing worker per machine in a homogenous cluster, it can mitigate issues of different resource availabilities for different workers (i.e. the slowest worker problem). To address these issues, we extended Giraph to provide two methods of parallelizing computations:

- Adding more workers per machine.
- Using worker local multithreading to take advantage of additional CPU cores.

Specifically, we added multithreading to loading the graph, computation (GIRAPH-374), and storing the computed results (GIRAPH-615). In CPU bound applications, such as k-means clustering, we have seen a near linear speedup due to multithreading the application code. In production, we parallelize our applications by taking over a set of entire machines with one worker per machine and use multithreading to maximize resource utilization.

Multithreading introduces some additional complexity. Taking advantage of multithreading in Giraph requires the user to partition the graph into n partitions across m machines, where the maximum compute parallelism per machine is n/m. Additionally, we had to add a new concept called WorkerContext that allows developers to access shared member variables. Many other graph processing frameworks do not allow multithreading support within the infrastructure as Giraph does. We found that this technique has significant advantages in reducing overhead (e.g. TCP connections, larger message batching) as opposed to more coarse grain parallelism by adding workers.

3.2.3 Memory optimization

In scaling to billions of edges per machine, memory optimization is a key concept. Few other graph processing systems support Giraph's flexible model of allowing arbitrary vertex id, vertex value, vertex edge, and message classes as well as graph mutation capabilities. Unfortunately, this flexibility can have a large memory overhead without careful implementation. In the 0.1 incubator release, Giraph was memory inefficient due to all data types being stored as separate Java objects. The JVM worked too hard, out of memory errors were a serious issue, garbage collection took a large portion of our compute time, and we could not load or process large graphs.

We addressed this issue via two improvements. First, by default we serialize the edges of every vertex into a byte array, rather than instantiating them as native Java objects (GIRAPH-417), using native direct (and non-portable) serialization methods. Messages on the server are serialized as well (GIRAPH-435). Second, we created an OutEdges interface that would allow developers to leverage Java primitives based on FastUtil for specialized edge stores (GIRAPH-528).

Given these optimizations, and knowing there are typically many more edges than vertices in our graphs (2 orders of magnitude or more in most cases), we can now roughly estimate the required memory usage for loading the graph based entirely on the edges. We simply count the number of bytes per edge, multiply by the total number of edges in the graph, and then multiply by 1.5x to take into account memory fragmentation and inexact byte array sizes. Prior to these changes, the object memory overhead could have been as high as 10x. Reducing memory use was a big factor in enabling the system to load and send messages to 1 trillion edges. Finally, we also improved the message combiner (GIRAPH-414) to further reduce memory usage and improve performance by around 30% in PageRank testing. Our improvements in memory allocation show that with the correct interfaces, we can support the full flexibility of Java classes for vertex ids, values, edges, and messages without inefficient per object allocations.

3.2.4 Sharded aggregators

Aggregators, as described in [33], provide efficient shared state across workers. While computing, vertices can aggregate (the operation must be commutative and associative) values into named aggregators to do global computation (i.e. min/max vertex value, error rate, etc.). The Giraph infrastructure aggregates these values across workers and makes the aggregated values available to vertices in the next superstep.

One way to implement k-means clustering is to use aggregators to calculate and distribute the coordinates of centroids. Some of our customers wanted hundreds of thousands or millions of centroids (and correspondingly aggregators), which would fail in early versions of Giraph. Originally, aggregators were implemented using ZooKeeper [27]. Workers would write partial aggregated values to znodes (ZooKeeper data storage). The master would aggregate all of them and write the final result back to its znode for workers to access it. This technique was sufficient for applications that only used a few simple aggregators, but wasn't scalable due to znode size constraints (maximum 1 megabyte) and ZooKeeper write limitations. We needed a solution that could efficiently handle tens of gigabytes of aggregator data coming from every worker.
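Whatever the transport, correctness rests on the requirement stated above that aggregator operations be commutative and associative: partial aggregates computed independently on each worker can be combined in any grouping or order without changing the result. A minimal plain-Java illustration of this property (not the Giraph aggregator API; the class name is ours):

```java
import java.util.List;

// Illustration only: a "max" aggregator. Because max is commutative and
// associative, combining per-worker partial results gives the same answer
// as aggregating all values in one place, however the values are sharded.
public class MaxAggregator {
    private long current = Long.MIN_VALUE;

    // Fold one value into the running aggregate.
    public void aggregate(long value) {
        current = Math.max(current, value);
    }

    public long get() {
        return current;
    }

    // Aggregate one worker's local values, producing a partial result.
    public static long partial(List<Long> values) {
        MaxAggregator agg = new MaxAggregator();
        for (long v : values) agg.aggregate(v);
        return agg.get();
    }

    // Combine partial results from several workers into the global value.
    public static long combine(List<Long> partials) {
        return partial(partials);
    }
}
```

Since max(max(a, b), c) equals max(a, max(b, c)), the infrastructure is free to choose where the combining happens, which is exactly the freedom a sharded aggregator design exploits.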
To solve this issue, first, we bypassed ZooKeeper and used Netty [6] to directly communicate aggregator values between the master and its workers. While this change allowed much larger aggregator data transfers, we still had a bottleneck. The amount of data the master was receiving, processing, and sending was growing linearly with the number of workers. In order to remove this bottleneck, we implemented sharded aggregators.

In the sharded aggregator architecture (Figure 3), each aggregator is now randomly assigned to one of the workers. The assigned worker is in charge of gathering the values of its aggregators from all workers, performing the aggregation, and distributing the final values to the master and other workers. Now, aggregation responsibilities are balanced across all workers rather than bottlenecked by the master, and aggregators are limited only by the total memory available on each worker.

Figure 3: After sharding aggregators, aggregated communication is distributed across workers.

void compute(Iterable<M> messages) {
  if (phase == K_MEANS) {
    // Do k-means
  } else if (phase == START_EDGE_CUT) {
    // Do first phase of edge cut
  } else if (phase == END_EDGE_CUT) {
    // Do second phase of edge cut
  }
}

Figure 5: Example vertex computation code to change phases.

4. COMPUTE MODEL EXTENSIONS

Usability and scalability improvements have allowed us to execute simple graph and iterative algorithms at Facebook-scale, but we soon realized that the Pregel model needed to be generalized to support more complex applications and make the framework more reusable. An easy way to depict the need for this generalization is through a very simple example: k-means clustering. K-means clustering, as shown in Figure 4, is a simple clustering heuristic for assigning input vectors to one of k centroids in an n-dimensional space. While k-means is not a graph application, it is iterative in nature and easily maps into the Giraph model, where input vectors are vertices and every centroid is an aggregator. The vertex (input vector) compute method calculates the distance to all the centroids and adds itself to the nearest one. As the centroids gain input vectors, they incrementally determine their new location. At the next superstep, the new location of every centroid is available to every vertex.

Figure 4: In k-means clustering, k centroids have some initial location (often random). Then input vectors are assigned to their nearest centroid. Centroid locations are updated and the process repeats.

4.1 Worker phases

The methods preSuperstep(), postSuperstep(), preApplication(), and postApplication() were added to the Computation class and have access to the worker state. One use case for the pre-superstep computation is to calculate the new position for each of the centroids. Without pre-superstep computation, a developer has to either incrementally calculate the position with every added input vector or calculate it for every distance calculation. In the preSuperstep() method, which is executed on every worker prior to every superstep, every worker can compute the final centroid locations just before the input vectors are processed. Determining the initial positions of the centroids can be done in the preApplication() method, since it is executed on every worker prior to any computation being executed. While these simple methods add a lot of functionality, they bypass the Pregel model and require special consideration for application specific techniques such as superstep checkpointing.

4.2 Master computation

Fundamentally, while the Pregel model defines a functional computation by "thinking like a vertex", some computations need to be executed in a centralized fashion for many of the reasons above. While executing the same code on each worker provides a lot of the same functionality, it is not well understood by developers and is error prone. GIRAPH-127 added master computation to do centralized computation prior to every superstep that can communicate with the workers via aggregators. We describe two example use cases below.

When using k-means to cluster the Facebook dataset of over 1 billion users, it is useful to aggregate the error to see whether the application is converging. It is straightforward to aggregate the distance of every input vector to its chosen centroid as another aggregator, but at Facebook we also have the social graph information. Periodically, we can also compute another metric of distance: the edge cut. We can use the friendships, subscriptions, and other social connections of our users to measure the edge cut, or the weighted
Figure6:Mastercomputationallowsk-meansclusteringtoperiodicallycalculateedgecutcomputations.edgecutiftheedgeshaveweights.Decidingwhentocalcu-latetheedgecutcanbethejobofthemastercomputation,forinstance,whenatleast15minutesofk-meanscompu-tationhavebeenexecuted.Withmastercomputation,thisfunctionalitycanbeimplementedbycheckingtoseehowlongithasbeensincethelastedgecutcalculation.Ifthetimelimithasbeenexceeded,setaspecialaggregator(i.e.executeedgecut)totrue.Theexecutionwork owisshowninFigure6.Thevertexcomputecodeonlyneedstochecktheaggregatorvaluetodecidewhethertobegincalculatinganedgecutorcontinueiteratingonk-means.Executingacoordinatedoperationsuchasthisoneontheworkerswithoutmastercomputationismorecomplicatedduetoclockskew,althoughpossible,forinstancewithmul-tiplesuperstepstodecideonwhethertodotheedgecutcalculation.AnotherexampleoftheusefulnessofmastercomputationcanbefoundintheexamplePageRankcodeinthePregelpaper[33].Intheexample,everyvertexmustcheckwhetherthedesirednumberofiterationshasbeencompletedtode-cidetovotetohalt.Thisisasimplecomputationthatneedstobeexecutedexactlyonceratherthanoneveryvertex.InthemastercomputationthereisahaltComputation()method,whereitissimpletocheckoncepriortostartingasuperstepwhethertheapplicationshouldcontinueratherthanexecutingthecheckonapervertexbasis.4.3ComposablecomputationInourproductionenvironment,weobservedthatgraphprocessingapplicationscanbecomplex,oftenconsistingof\stages",eachimplementingdistinctlogic.Mastercompu-tationallowsdeveloperstocomposetheirapplicationswithdi erentstages,butisstilllimitingsincetheoriginalPregelmodelonlyallowsonemessagetypeandonemessagecom-biner.Also,thevertexcomputecodegetsmessyasshowninFigure5.Inordertosupportapplicationsthatdodistinc-tivecomputations(suchask-means)inacleanerandmorereusableway,weaddedcomposablecomputation.Compos-ablecomputingsimplydecouplesthevertexfromthecom-putationasshowninFigure7.TheComputationclassabstractsthecomputationfromthevertexsothatdi 
er-enttypesofcomputationscanbeexecuted.Additionally,therearenowtwomessagetypesspeci ed.M1isthein-comingmessagetypeandM2istheoutgoingmessagetype.ThemastercomputemethodcanchoosethecomputationclasstoexecuteforthecurrentsuperstepwiththemethodsetComputation(Class?extendsComputation&#x]TJ/;༱ ;.96;d T; 7.;Ũ ;� Td;&#x [00;computa-tionClass).Themasteralsohasacorrespondingmethodtochangethemessagecombineraswell.Theinfrastructurecheckstoinsurethatalltypesmatchforcomputationsthatarechainedtogether.Nowthatvertexcomputationsaredecoupledtodi erentComputationimplementations,theycanbeusedasbuildingblocksformultipleapplications.Forexample,theedgecutcomputationscanbeusedinavarietyofclusteringalgorithms,notonlyk-means.Inourexamplek-meansapplicationwithaperiodicedgecut,wemightwanttouseamoresophisticatedcentroidini-tializationmethodsuchasinitializingthecentroidwitharandomuserandthenarandomsetoftheirfriends.Thecomputationstoinitializethecentroidswillhavedi erentmessagetypesthantheedgecutcomputations.InTable2, Table2:Exampleusageofcomposablecomputationsinak-meansapplicationwithdi erentcomputations(initialization,edgecut,andk-means) Computation Addrandomcentroid/randomfriends Addtocentroids K-means Startedgecut Endedgecut Inmessage Null Centroidmessage Null Null Cluster Outmessage Centroidmessage Null Null Cluster Null Combiner N/A N/A N/A Clustercombiner N/A publicabstractclassVertexI,V,E,&#x]TJ ;.8;) -;.4;a T; [0;M{publicabstractvoidcompute(&#xM000;Iterablemessages);}publicinterfaceComputationI,V,E,M1,&#x]TJ ;.8;) -;.4;a T; 
[0;M2{voidcompute(VertexV,&#xI,-5;─Evertex,&#xM100;Iterablemessages);}publicabstractclassMasterCompute{publicabstractvoidcompute();publicvoidsetComputation(Classextends&#x?-52;倀Computationcomputation);publicvoidsetMessageCombiner(Classextends&#x?-52;倀MessageCombinercombiner);publicvoidsetIncomingMessage(Classextends&#x?-52;倀WritableincomingMessage);publicvoidsetOutgoingMessage(Classextends&#x?-52;倀WritableoutgoingMessage);}Figure7:VertexwasgeneralizedintoComputationtosupportdi erentcomputations,messages,andmessagecombinersassetbyMasterCompute.weshowhowcomposablecomputationallowsustousedif-ferentmessagetypes,combiners,andcomputationstobuildapowerfulk-meansapplication.Weusethe rstcomputa-tionforaddingrandominputvectorstocentroidsandno-tifyingrandomfriendstoaddthemselvestothecentroids.Thesecondcomputationaddstherandomfriendstocen-troidsby guringoutitsdesiredcentroidfromtheorigi-nallyaddedinputvector.Itdoesn'tsendanymessagestothenextcomputation(k-means),sotheoutmessagetypeisnull.Notethatstartingtheedgecutistheonlycomputa-tionthatactuallyusesamessagecombiner,butonecoulduseanymessagecombinerindi erentcomputations.Composablecomputationmakesthemastercomputationlogicsimple.Otherexampleapplicationsthatcanbene 
tfromcomposablecomputationbesidesk-meansincludebal-ancedlabelpropagation[41]andanitypropagation[19].Balancedlabelpropagationusestwocomputations:com-putecandidatepartitionsforeachvertexandmovingver-ticestopartitions.Anitypropagationhasthreecompu-tations:calculateresponsibility,calculateavailability,andupdateexemplars.4.4SuperstepsplittingSomeapplicationshavemessagingpatternsthatcanex-ceedtheavailablememoryonthedestinationvertexowner.Formessagesthatareaggregatable,thatis,commutativeandassociative,messagecombiningsolvesthisproblem.How-ever,manyapplicationssendmessagesthatcannotbeag-gregated.Calculatingmutualfriends,forinstance,requireseachvertextosendallitsneighborsthevertexidsofitsneighborhood.Thismessagecannotbeaggregatedbythereceiveracrossallitsmessages.Anotherexampleisthemul-tiplephasesofanitypropagation-eachmessagemustberespondedtoindividuallyandisunabletobeaggregated.Graphprocessingframeworksthatareasynchronousarees-peciallypronetosuchissuessincetheymayreceivemessagesatafasterratethansynchronousframeworks.Insocialnetworks,oneexampleofthisissuecanoccurwhensendingmessagestoconnectionsofconnections(i.e.friendsoffriendsintheFacebooknetwork).WhileFacebookusersarelimitedto5000friends,theoreticallyoneuser'sver-texcouldreceiveupto25millionmessages.Oneofourpro-ductionapplications,friends-of-friendsscore,calculatesthestrengthofarelationshipbetweentwousersbasedontheirmutualfriendswithsomeweights.Themessagessentfromausertohis/herfriendscontainasetoftheirfriendsandsomeweights.Ourproductionapplicationactuallysends850GBfromeachworkerduringthecalculationwhenweuse200workers.Wedonothavemachineswith850GBandwhileGiraphsupportsout-of-corecomputationbyspillingthegraphand/ormessagestodisk,itismuchslower.Therefore,wehavecreatedatechniquefordoingthesamecomputationallin-memoryforsuchapplications:superstepsplitting.Thegeneralideaisthatinsuchamessageheavysuperstep,adevelopercansendafragmentofthemessagestotheirdestinationsanddoapartialcomputationthatup-datesthestateofthevertexvalue.Thelimita
tionsforthesuperstepsplittingtechniqueareasfollows:Themessage-basedupdatemustbecommutativeandassociative.Nosinglemessagecanover owthememorybu erofasinglevertex.Themastercomputationwillrunthesamesuperstepfora xednumberofiterations.Duringeachiteration,everyvertexusesahashfunctionwiththedestinationvertexidofeachofitspotentialsmessagestodeterminewhethertosendamessageornot.Avertexonlydoescomputationifitsvertexidpassesthehashfunctionforthecurrentsuperstep.Asanexamplefor50iterations(splitsofoursuperstep)ourfriends-of-friendsapplicationonlyuses17GBofmemoryperiteration.Thistechniquecaneliminatetheneedtogoout-of-coreandismemoryscalable(simplyaddmoreiterationstoproportionallyreducethememoryusage). Figure8:ScalingPageRankwithrespecttothenumberofworkers(a)andthenumberofedges(b)5.EXPERIMENTALRESULTSInthissection,wedescribeourresultsofrunningseveralexampleapplicationsinGiraphwhencomparedtotheirre-spectiveHiveimplementations.OurcomparisonwillincludeiterativegraphapplicationsaswellassimpleHivequeriesconvertedtoGiraphjobs.Weusetwometricsfortheperfor-mancecomparison:CPUtimeandelapsedtime.CPUtimeisthetimemeasuredbytheCPUtocompleteanapplicationwhileelapsedtimeisthetimeobservedbytheuserwhentheirapplicationcompletes.Allofthedescribedapplica-tionsareinproduction.Experimentsweregatheredonpro-ductionmachineswith16physicalcoresand10GbpsEth-ernet.TheGiraphexperimentswereallconductedinphysi-calmemorywithoutcheckpointing,whileHive/Hadoophasmanyphasesofdiskaccess(mapsidespilling,reducerfetch-ing,HDFSaccessperiterations,etc.)causingittoexhibitpoorerperformance.Weusedthedefaulthash-basedgraphpartitioningforallexperiments.5.1GiraphscalabilityWithasimpleunweightedPageRankbenchmark,weeval-uatedthescalabilityofGiraphonourproductionclusters.UnweightedPageRankisalightweightcomputationalappli-cationthatistypicallynetworkbound.InFigure8a,we rst 
fixed the edge count to 200B and scaled the number of workers from 50 to 300. There is a slight bit of variance, but overall Giraph performance closely tracks the ideal scaling curve based on 50 workers as the starting data point. In Figure 8b, we fix the number of workers at 50 and scale the problem size from 1B to 200B edges. As expected, Giraph performance scales up linearly with the number of edges since PageRank is network bound.

5.2 Applications

We describe three production iterative graph applications, as well as their algorithmic pseudocode, that have been run in both Hive and Giraph. The Giraph implementation used 200 machines in our tests and the Hive implementations used at least 200 machines (albeit possibly with failures, as MapReduce is built to handle them).

5.2.1 Label propagation

Computation 1: Send neighbors edge weights and label data
1. Send my neighbors my edge weights and labels.

Computation 2: Update labels
1. If the normalization base is not set, create it from the edge weights.
2. Update my labels based on the messages.
3. Trim labels to top n.
4. Send my neighbors my edge weights and labels.

Table 3: Label propagation comparison between Hive and Giraph implementations with 2 iterations.
  Graph size: 701M+ vertices, 48B+ edges
  Total CPU:    Hive 9.631M secs   Giraph 1.014M secs   Speedup 9x
  Elapsed time: Hive 1,666 mins    Giraph 19 mins       Speedup 87x

Label propagation was described briefly in Section 3.1. This algorithm (similar to [46]) was previously implemented in Hive as a series of Hive queries that had a bad impedance mismatch, since the algorithm is much simpler to express as a graph algorithm. There are two computation phases, as shown above. We need to normalize edges, and Giraph keeps only the out edges (not the in edges) to save memory. So computation 1 will start the first phase of
figuring out how to normalize the incoming edge weights by the destination vertex. The master computation will run computation 2 until the desired iterations have been met. Note that trimming to the top n labels is a common technique to avoid unlikely labels being propagated to all vertices. Trimming both saves a large amount of memory and reduces the amount of network data sent.

In Table 3, we describe the comparison between Hive and Giraph implementations of the same label propagation algorithm. From a CPU standpoint, we save 9x CPU seconds. From an elapsed time standpoint, the Giraph implementation is 87x faster with 200 machines, versus roughly the same number of machines for the Hive implementation. Our comparison was executed with only 2 iterations, primarily because it took so long in Hive/Hadoop. In production, we run minor variants of label propagation with many different types of labels that can be used as ranking features in a variety of applications.

5.2.2 PageRank

Computation: PageRank
1. Normalize the outgoing edge weights if this is the first superstep. Otherwise, update the PageRank value by summing up the incoming messages and calculating a new PageRank value.
2. Send every edge my PageRank value weighted by its normalized edge weight.

PageRank [12] is a useful application for finding influential entities in the graph. While originally developed for ranking websites, we can also apply PageRank in the social context. The algorithm above describes our implementation of weighted PageRank, where the weights represent the strength of the connection between users (i.e., based on the friends of friends score calculation). Weighted PageRank is slightly more expensive than unweighted PageRank due to sending each out edge a specialized message, but it also uses a combiner in production. The combiner simply adds the PageRank messages together and sums them into a single message.

We ran an iteration of PageRank against a snapshot of some portion of the user graph at Facebook with over 2B vertices and more than 400B social connections (this includes more connections than just friendships). In Table 4, we can see the results. At 600 minutes per iteration for Hive/Hadoop, computing 10 iterations would take over 4 days with more than 200 machines. In Giraph, 10 iterations could be done in only 50 minutes on 200 machines.

Table 4: Weighted PageRank comparison between Hive and Giraph implementations for a single iteration.
  Graph size: 2B+ vertices, 400B+ edges
  Total CPU:    Hive 16.5M secs   Giraph 0.6M secs   Speedup 26x
  Elapsed time: Hive 600 mins     Giraph 19 mins     Speedup 120x

5.2.3 Friends of friends score

Computation: Friends of friends score
1. Send friend list with weights to a hashed portion of my friend list.
2. If my vertex id matches the hash function on the superstep number, aggregate messages to generate my friends of friends score with the source vertices of the incoming messages.

As mentioned in Section 4.4, the friends of friends score is a valuable feature for various Facebook products. It is one of the few features we have that can calculate a score between users that are not directly connected. It uses the aforementioned superstep splitting technique to avoid going out of core when messaging sizes far exceed available memory. From Table 5 we can see that the Hive implementation is much less efficient than the Giraph implementation. The superstep splitting technique allows us to maintain a significant 65x elapsed time improvement on 200 machines, since the computation and message passing are happening in memory. While not shown, out-of-core messaging had a 2-3x overhead in performance in our experiments, while the additional synchronization superstep overhead of the superstep splitting technique was small.

Table 5: Performance comparison for friends of friends score between Hive and Giraph implementations.
  Graph size: 1B+ vertices, 76B+ edges
  Total CPU:    Hive 255M secs    Giraph 18M secs   Speedup 14x
  Elapsed time: Hive 7200 mins    Giraph 110 mins   Speedup 65x

5.3 Hive queries on Giraph

While Giraph has enabled us to run iterative graph algorithms much more efficiently than Hive, we have also found it a valuable tool for running certain expensive Hive queries in a faster way. Hive is much simpler to write, of course, but

Table 6: Double join performance in Hive and Giraph implementations for one expensive Hive query.
  Graph size: 450B connections, 2.5B+ unique ids
  Total CPU:    Hive 211 days    Giraph 43 days   Speedup 5x
  Elapsed time: Hive 425 mins    Giraph 50 mins   Speedup 8.5x

we have found many queries fall in the same pattern of join and also double join. These applications involve either only input and output, or just two supersteps (one iteration of message sending and processing). Partitioning of the graph during input loading can simulate a join query, and we do any vertex processing prior to dumping the data to Hive.

In one example, consider a join of two big tables, where one of them can be treated as edge input (representing connections between users, pages, etc.) and the other one as vertex input (some data about these users or pages). Implementing these kinds of queries in Giraph has proven to be 3-4x more CPU efficient than performing the same query in Hive. Doing a join on both sides of the edge table (double join) as shown in the above example query is up to 5x better, as shown in Table 6.

As another example, we have a large table with some action logs, with the id of the user performing the action, the id of the object it interacted with, and potentially some additional data about the action itself. One of our users was interested in calculating, for each object and each action description, how many different users have interacted with it. In Hive this would be expressed in the form: "select count(distinct userId) ... group by objectId, actionDescription". This type of query is very expensive, and expressing it in Giraph achieves up to a 6x CPU time improvement and an 11.3x elapsed time improvement, both big wins for the user, as shown in Table 7.

Table 7: Custom Hive query comparison with Giraph for a variety of data sizes, for calculating the number of users interacting with a specific action.
  360B actions, 40M objects:   Total CPU: Hive 22 days   Giraph 4 days   (5.5x)   Elapsed: Hive 64 mins    Giraph 7 mins   (9.1x)
  162B actions, 74M objects:   Total CPU: Hive 92 days   Giraph 18 days  (5.1x)   Elapsed: Hive 129 mins   Giraph 19 mins  (6.8x)
  620B actions, 110M objects:  Total CPU: Hive 485 days  Giraph 78 days  (6.2x)   Elapsed: Hive 510 mins   Giraph 45 mins  (11.3x)

Customers are happy to have a way to get significantly better performance for their expensive Hive queries. Although Giraph is not built for generalized relational queries, it outperforms Hive due to avoiding disk and using customized primitive data structures that are particular to the underlying data. In the future, we may consider building a more generalized query framework to make this transition easier for certain classes of queries.

5.4 Trillion edge PageRank

Our social graph of over 1.39B users is continuing to grow rapidly. The number of connections between users (friendships, subscriptions, public interaction, etc.) is also seeing explosive growth. To ensure that Giraph could continue to operate at a larger scale, we ran an iteration of unweighted PageRank on our 1.39B user dataset with over 1 trillion social connections. With some recent improvements in messaging (GIRAPH-701) and request processing (GIRAPH-704), we were able to execute PageRank on over a trillion social connections in less than 3 minutes per iteration with only 200 machines.

6. OPERATIONAL EXPERIENCE

In this section, we describe the operational experience we have gained after running Giraph in production at Facebook for over two years. Giraph executes applications for ads, search, entities, user modeling, insights, and many other teams at Facebook.

6.1 Scheduling

Giraph leverages our existing MapReduce scheduling framework (Corona) but also works on top of Hadoop. Hadoop scheduling assumes an incremental scheduling model, where map/reduce slots can be incrementally assigned to jobs, while Giraph requires that all slots are available prior to the application running. Preemption, available in some versions of Hadoop, causes failure of Giraph jobs. While we could turn on checkpointing to handle some of these failures, in practice we choose to disable checkpointing for three reasons:

1. Our HDFS implementation will occasionally fail on write operations (e.g. temporary namenode unavailability, write nodes unavailable, etc.), causing our checkpointing code to fail and ironically increasing the chance of failure.
2. Most of our production applications run in less than an hour and use less than 200 machines. The chance of failure is relatively low and handled well by restarts.
3. Checkpointing has an overhead.

Giraph jobs are run in non-preemptible FIFO pools in Hadoop, where the number of map slots of a job never exceeds the maximum number of map slots in the pool. This policy allows Giraph jobs to queue up in a pool, wait until they get all their resources, and then execute. In order to make this process less error prone, we added a few changes to Corona to prevent user error. First, we added an optional feature for a pool in Corona to automatically fail jobs that ask for more map slots than the maximum map slots of the pool.

Second, we configure Giraph clusters differently than typical MapReduce clusters. Giraph clusters are homogeneous and only have one map slot. This configuration allows Giraph jobs to take over an entire machine and all of its resources (CPU, memory, network, etc.). Since Giraph still runs on the same infrastructure as Hadoop, production engineering can use the same configuration management tools and prior MapReduce experience to maintain the Giraph infrastructure.

Jobs that fail are restarted automatically by the same scheduling system that also restarts failed Hive queries. In production, both Hive and Giraph retries are set to 4, and once a Giraph application is deployed to production we rarely see it fail consistently. The one case we saw during the past year occurred when a user changed production parameters without first testing their jobs at scale.

6.2 Graph preparation

Production grade graph preparation is a subject that is not well addressed in the research literature. Projects such as GraphBuilder [29] have built frameworks that help with areas such as graph formation, compression, transformation, partitioning, output formatting, and serialization. At Facebook, we take a simpler approach. Graph preparation can be handled in two different ways depending on the needs of the user. All Facebook warehouse data is stored in Hive tables but can be easily interpreted as vertices and/or edges with HiveIO. In any of the Giraph input formats, a user can add custom
filtering or transformation of the Hive table to create vertices and/or edges. As mentioned in Section 3.2.1, users can essentially scan tables and pick and choose the graph data they are interested in for the application. Users can also turn to Hive queries and/or MapReduce applications to prepare any input data for graph computation. Other companies that use Apache Pig [35], a high-level language for expressing data analysis programs, can execute similar graph preparation steps as a series of simple queries.

6.3 Production application workflow

Customers often ask us the workflow for deploying a Giraph application into production. We typically go through the following cycle:

1. Write your application and unit test it. Giraph can run in a local mode with the tools to create simple graphs for testing.
2. Run your application on a test dataset. We have small datasets that mimic the full Facebook graph as a single country. Typically these tests only need to run on a few machines.
3. Run your application at scale (typically a maximum of 200 workers). We have limited resources for non-production jobs to run at scale, so we ask users to tune their configuration in step 2.
4. Deploy to production. The user application and its configuration are entered into our higher-level scheduler to ensure that jobs are scheduled periodically and retries happen on failure. The Giraph oncall is responsible for ensuring that the production jobs complete.

Overall, customers are satisfied with this workflow. It is very similar to the Hive/Hadoop production workflow and is an easy transition for them.

7. CONCLUSIONS & FUTURE WORK

In this paper, we have detailed how a BSP-based, composable graph processing framework supports Facebook-scale production workloads. In particular, we have described the improvements to the Apache Giraph project that enabled us to scale to trillion edge graphs (much larger than those referenced in previous work). We described new graph processing techniques such as composable computation and superstep splitting that have allowed us to broaden the pool of potential applications. We have shared our experiences with several production applications and their performance improvements over our existing Hive/Hadoop infrastructure. We have contributed all systems code back into the Apache Giraph project so anyone can try out production quality graph processing code that can support trillion edge graphs. Finally, we have shared our operational experiences with Giraph jobs and how we schedule and prepare our graph data for computation pipelines.

While Giraph suits our current needs and provides much needed efficiency wins over our existing Hive/Hadoop infrastructure, we have identified several areas of future work that we have started to investigate. First, our internal experiments show that graph partitioning can have a significant effect on network bound applications such as PageRank. For long running applications, determining a good quality graph partitioning prior to our computation will likely be a net win in performance. Second, we have started to look at making our computations more asynchronous as a possible way to improve convergence speed. While our users enjoy a predictable application and the simplicity of the BSP model, they may consider asynchronous computing if the gain is significant. Finally, we are leveraging Giraph as a parallel machine-learning platform. We already have several ML algorithms implemented internally. Our matrix factorization based collaborative filtering implementation scales to over a hundred billion examples. The BSP computing model also appears to be a good
fit for AdaBoost logistic regression from Collins et al. [16]. We are able to train logistic regression models on 1.1 billion samples with 800 million sparse features and an average of 1000 active features per sample in minutes per iteration.

8. REFERENCES

[1] Apache Giraph. http://giraph.apache.org.
[2] Apache Hadoop. http://hadoop.apache.org/.
[3] Apache Mahout. http://mahout.apache.org.
[4] Beevolve Twitter study. http://www.beevolve.com/twitter-statistics.
[5] Giraph JIRA. https://issues.apache.org/jira/browse/GIRAPH.
[6] Netty. http://netty.io.
[7] Open Graph. https://developers.facebook.com/docs/opengraph.
[8] Yahoo! AltaVista web page hyperlink connectivity graph, circa 2002, 2012. http://webscope.sandbox.yahoo.com/.
[9] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: Membership, growth, and evolution. In KDD '06, pages 44-54, New York, NY, USA, 2006. ACM.
[10] P. Boldi, M. Santini, and S. Vigna. A large time-aware graph. SIGIR Forum, 42(2):33-38, 2008.
[11] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE '11, pages 1151-1162, Washington, DC, USA, 2011. IEEE Computer Society.
[12] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW7, pages 107-117, Amsterdam, The Netherlands, 1998. Elsevier Science Publishers B.V.
[13] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow., 3(1-2):285-296, Sept. 2010.
[14] Z. Cai, Z. J. Gao, S. Luo, L. L. Perez, Z. Vagena, and C. Jermaine. A comparison of platforms for implementing and running very large scale machine learning algorithms. In SIGMOD '14, pages 1371-1382, New York, NY, USA, 2014. ACM.
[15] R. Chen, M. Yang, X. Weng, B. Choi, B. He, and X. Li. Improving large graph processing on partitioned graphs in the cloud. In SoCC '12, pages 3:1-3:13, New York, NY, USA, 2012. ACM.
[16] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3):253-285, 2002.
[17] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, Jan. 2008.
[18] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: A runtime for iterative MapReduce. In The First International Workshop on MapReduce and its Applications, 2010.
[19] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315:972-976, 2007.
[20] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: distributed graph-parallel computation on natural graphs. In OSDI '12, pages 17-30, Berkeley, CA, USA, 2012. USENIX Association.
[21] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In OSDI '14, pages 599-613, Broomfield, CO, Oct. 2014. USENIX Association.
[22] D. Gregor and A. Lumsdaine. The Parallel BGL: A generic library for distributed graph computations. In Parallel Object-Oriented Scientific Computing (POOSC), July 2005.
[23] P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh. WTF: the who to follow service at Twitter. In WWW '13, pages 505-514, Republic and Canton of Geneva, Switzerland, 2013. International World Wide Web Conferences Steering Committee.
[24] M. Han, K. Daudjee, K. Ammar, M. T. Ozsu, X. Wang, and T. Jin. An experimental comparison of Pregel-like graph processing systems. PVLDB, 7(12):1047-1058, 2014.
[25] S. Hong, H. Chafi, E. Sedlar, and K. Olukotun. Green-Marl: a DSL for easy and efficient graph analysis. In ASPLOS XVII, pages 349-362, New York, NY, USA, 2012. ACM.
[26] J. Huang, D. J. Abadi, and K. Ren. Scalable SPARQL querying of large RDF graphs. PVLDB, 4(11):1123-1134, 2011.
[27] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: wait-free coordination for internet-scale systems. In USENIX ATC '10, pages 11-11, Berkeley, CA, USA, 2010. USENIX Association.
[28] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys '07, pages 59-72, New York, NY, USA, 2007. ACM.
[29] N. Jain, G. Liao, and T. L. Willke. GraphBuilder: scalable graph ETL framework. In GRADES '13, pages 4:1-4:6, New York, NY, USA, 2013. ACM.
[30] U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A peta-scale graph mining system implementation and observations. In ICDM '09, pages 229-238, Washington, DC, USA, 2009. IEEE Computer Society.
[31] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In WWW '10, pages 591-600, New York, NY, USA, 2010. ACM.
[32] A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: large-scale graph computation on just a PC. In OSDI '12, pages 31-46, Berkeley, CA, USA, 2012. USENIX Association.
[33] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD '10, pages 135-146, New York, NY, USA, 2010. ACM.
[34] J. Nelson, B. Myers, A. H. Hunter, P. Briggs, L. Ceze, C. Ebeling, D. Grossman, S. Kahan, and M. Oskin. Crunching large graphs with commodity processors. In HotPar '11, pages 10-10. USENIX Association, 2011.
[35] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD '08, pages 1099-1110, New York, NY, USA, 2008. ACM.
[36] R. Power and J. Li. Piccolo: building fast, distributed programs with partitioned tables. In OSDI '10, pages 1-14, Berkeley, CA, USA, 2010. USENIX Association.
[37] S. Salihoglu and J. Widom. GPS: A graph processing system. In Scientific and Statistical Database Management. Stanford InfoLab, July 2013.
[38] B. Shao, H. Wang, and Y. Li. Trinity: a distributed graph engine on a memory cloud. In SIGMOD '13, pages 505-516, New York, NY, USA, 2013. ACM.
[39] P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: graph algorithms for the (semantic) web. In ISWC '10, pages 764-780, Berlin, Heidelberg, 2010. Springer-Verlag.
[40] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow., 2(2):1626-1629, Aug. 2009.
[41] J. Ugander and L. Backstrom. Balanced label propagation for partitioning massive graphs. In WSDM '13, pages 507-516, New York, NY, USA, 2013. ACM.
[42] L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103-111, Aug. 1990.
[43] S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, and R. S. Schreiber. Presto: distributed machine learning and graph processing with sparse matrices. In EuroSys '13, pages 197-210, New York, NY, USA, 2013. ACM.
[44] G. Wang, W. Xie, A. J. Demers, and J. Gehrke. Asynchronous large-scale graph processing made easy. In CIDR, 2013.
[45] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In HotCloud '10, pages 10-10, Berkeley, CA, USA, 2010. USENIX Association.
[46] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, 2002.