One Trillion Edges: Graph Processing at Facebook-Scale

Avery Ching (aching@fb.com), Sergey Edunov (edunov@fb.com), Maja Kabiljo (majakabiljo@fb.com), Dionysios Logothetis (dionysios@fb.com), Sambavi Muthukrishnan (sambavim@fb.com)
Facebook, 1 Hacker Lane, Menlo Park, California

ABSTRACT

Analyzing large graphs provides valuable insights for social networking and web companies in content ranking and recommendations. While numerous graph processing systems have been developed and evaluated on available benchmark graphs of up to 6.6B edges, they often face significant difficulties in scaling to much larger graphs. Industry graphs can be two orders of magnitude larger: hundreds of billions or up to one trillion edges. In addition to scalability challenges, real world applications often require much more complex graph processing workflows than previously evaluated. In this paper, we describe the usability, performance, and scalability improvements we made to Apache Giraph, an open-source graph processing system, in order to use it on Facebook-scale graphs of up to one trillion edges. We also describe several key extensions to the original Pregel model that make it possible to develop a broader range of production graph applications and workflows as well as improve code reuse. Finally, we report on real-world operations as well as performance characteristics of several large-scale production applications.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing info@vldb.org. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 12. Copyright 2015 VLDB Endowment 2150-8097/15/08.

1. INTRODUCTION

Graph structures are ubiquitous: they provide a basic model of entities with connections between them that can represent almost anything. Facebook manages a social graph [41] that is composed of people, their friendships, subscriptions, likes, posts, and many other connections. Open graph [7] allows application developers to connect objects in their applications with real-world actions (such as user X is listening to song Y). Analyzing these real world graphs at the scale of hundreds of billions or even a trillion (10^12) edges with available software was very difficult when we began a project to run Facebook-scale graph applications in the summer of 2012 and is still the case today.

Table 1: Popular benchmark graphs.

Graph                   Vertices  Edges
LiveJournal [9]         4.8M      69M
Twitter 2010 [31]       42M       1.5B
UK web graph 2007 [10]  109M      3.7B
Yahoo web [8]           1.4B      6.6B

Many specialized graph processing frameworks (e.g. [20, 21, 32, 44]) have been developed to run on web and social graphs such as those shown in Table 1. Unfortunately, real world social networks are orders of magnitude larger. Twitter has 288M monthly active users as of 3/2015 and an estimated average of 208 followers per user [4] for an estimated total of 60B followers (edges). Facebook has 1.39B active users as of 12/2014 with more than 400B edges. Many of the performance and scalability bottlenecks are very different when considering real-world industry workloads. Several studies [14, 24] have documented that many graph frameworks fail at much smaller scale, mostly due to inefficient memory usage. Asynchronous graph processing engines tend to have additional challenges in scaling to larger graphs due to unbounded message queues causing memory overload, vertex-centric locking complexity and overhead, and difficulty in leveraging high network bandwidth due to fine-grained computation.

Correspondingly, there is a lack of information on how applications perform and scale to practical problems on trillion-edge graphs. In practice, many web companies have chosen to build graph applications on their existing MapReduce infrastructure rather than a graph processing framework for a variety of reasons. First, many such companies already run MapReduce applications [17] on existing Hadoop [2] infrastructure and do not want to maintain a different service that can only process graphs. Second, many of these frameworks are either closed source (i.e. Pregel) or written in a language other than Java (e.g. GraphLab is written in C++). Since much of today's datasets are stored in Hadoop, having easy access to HDFS and/or higher-level abstractions such as Hive tables is essential to interoperating with existing Hadoop infrastructure. With so many variants and versions of Hive/HDFS in use, providing native C++ or other language support is both unappealing and time consuming.

Apache Giraph [1] fills this gap as it is written in Java and has vertex and edge input formats that can access MapReduce input formats as well as Hive tables [40]. Users can insert Giraph applications into existing Hadoop pipelines and leverage operational expertise from Hadoop. While Giraph initially did not scale to our needs at Facebook with over 1.39B users and hundreds of billions of social connections, we improved the platform in a variety of ways to support our workloads and implement our production applications. We describe our experiences scaling and extending existing graph processing models to enable a broad set of both graph mining and iterative applications. Our contributions are the following:

- Present usability, performance, and scalability improvements for Apache Giraph that enable trillion edge graph computations.
- Describe extensions to the Pregel model and why we found them useful for our graph applications.
- Real world applications and their performance on the Facebook graph.
- Share operational experiences running large-scale production applications on existing MapReduce infrastructure.
- Contribution of production code (including all extensions described in this paper) into the open-source Apache Giraph project.

The rest of the paper is organized as follows. Section 2 describes related work. Section 3 provides a summary of Giraph, details our reasons for selecting it as our initial graph processing platform, and explains our usability and scalability improvements. Section 4 describes our generalization of the original Pregel graph processing model for creating more powerful application building blocks and reusable code. Section 5 details Giraph applications and their performance for a variety of workloads. In Section 6, we share our graph processing operational experiences. In Section 7, we conclude our work and describe potential future work.

2. RELATED WORK

Large-scale graph computing based on the Bulk Synchronous Processing (BSP) model [42] was first introduced by Malewicz et al. in the Pregel system [33]. Unfortunately, the Pregel source code was not made public. Apache Giraph was designed to bring large-scale graph processing to the open source community, based loosely on the Pregel model, while providing the ability to run on existing Hadoop infrastructure. Many other graph processing frameworks (e.g. [37, 15]) are also based on the BSP model.

MapReduce has been used to execute large-scale graph parallel algorithms and is also based on the BSP model. Unfortunately, graph algorithms tend to be iterative in nature and typically do not perform well in the MapReduce compute model. Even with these limitations, several graph and iterative computing libraries have been built on MapReduce due to its ability to run reliably in production environments [3, 30]. Iterative frameworks on MapReduce style computing models is an area that has been explored in Twister [18] and Haloop [13].

Asynchronous models of graph computing have been proposed in such systems as Signal-Collect [39], GraphLab [20] and GRACE [44]. While asynchronous graph computing has been demonstrated to converge faster for some applications, it adds considerable complexity to the system and the developer. Most notably, without program repeatability it is difficult to ascertain whether bugs lie in the system infrastructure or the application code. Furthermore, asynchronous messaging queues for certain vertices may unpredictably cause machines to run out of memory. DAG-based execution systems that generalize the MapReduce model to broader computation models, such as Hyracks [11], Spark [45], and Dryad [28], can also do graph and iterative computation. Spark additionally has a higher-level graph computing library built on top of it, called GraphX [21], that allows the user to process graphs in an interactive, distributed manner.

Single machine graph computing implementations such as Cassovary [23] are used at Twitter. Another single machine implementation, GraphChi [32], can efficiently process large graphs out-of-core. Latency-tolerant graph processing techniques for commodity processors, as opposed to hardware multithreading systems (e.g. Cray XMT), were explored in [34]. In Trinity [38], graph processing and databases are combined into a single system. Parallel BGL [22] parallelizes graph computations on top of MPI. Piccolo [36] executes distributed graph computations on top of partitioned tables. Presto [43], a distributed R framework, implements matrix operations efficiently and can be used for graph analysis. Several DSL graph languages have been built, including Green-Marl [25], which can compile to Giraph code, and SPARQL [26], a graph traversal language.

3. APACHE GIRAPH

Apache Giraph is an iterative graph processing system designed to scale to hundreds or thousands of machines and process trillions of edges. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph was inspired by Pregel, the graph processing architecture developed at Google. Pregel provides a simple graph processing API for scalable batch execution of iterative graph algorithms. Giraph has greatly extended the basic Pregel model with new functionality such as master computation, sharded aggregators, edge-oriented input, out-of-core computation, composable computation, and more. Giraph has a steady development cycle and a growing community of users worldwide.

3.1 Early Facebook experimentation

In the summer of 2012, we began exploring a diverse set of graph algorithms across many different Facebook products as well as related academic literature. We selected a few representative use cases that cut across the problem space with different system bottlenecks and programming complexity. Our diverse use cases and the desired features of the programming framework drove the requirements for our system infrastructure. We required an iterative computing model, graph-based API, and easy access to Facebook data. We knew that this infrastructure would need to work at the scale of hundreds of billions of edges. Finally, as we would
Figure 1: An example of label propagation: Inferring unknown website classifications from known website classifications in a graph where links are generated from overlapping website keywords.

be rapidly developing this software, we had to be able to easily identify application and infrastructure bugs and have repeatable, reliable performance. Based on these requirements, we selected a few promising graph-processing platforms including Hive, GraphLab, and Giraph for evaluation. We used label propagation, among other graph algorithms, to compare the selected platforms. Label propagation is an iterative graph algorithm that infers unlabeled data from labeled data. The basic idea is that during each iteration of the algorithm, every vertex propagates its probabilistic labels (for example, website classifications - see Figure 1) to its neighboring vertices, collects the labels from its neighbors, and calculates new probabilities for its labels. We implemented this algorithm on all platforms and compared the performance of Hive 0.9, GraphLab 2, and Giraph on a small scale of 25 million edges.

We ended up choosing Giraph for several compelling reasons. Giraph directly interfaces with our internal version of HDFS (since Giraph is written in Java) as well as Hive. Since Giraph is scheduled as a MapReduce job, we can leverage our existing MapReduce (Corona) infrastructure stack with little operational overhead. With respect to performance, at the time of testing in late summer 2012, Giraph was faster than the other frameworks. Perhaps most importantly, the BSP model of Giraph was easy to debug as it provides repeatable results and is the most straightforward to scale since we did not have to handle all the problems of asynchronous graph processing frameworks. BSP also made it easy to implement composable computation (see Section 4.3) and simple to do checkpointing.

3.2 Platform improvements

Even though we had chosen a platform, there was a lot of work ahead of us. Here are some of the limitations that we needed to address:

- Giraph's graph input model was only vertex centric, requiring us to either extend the model or do vertex centric graph preparation external to Giraph.
- Parallelizing Giraph infrastructure relied completely on MapReduce's task level parallelism and did not have multithreading support for fine grain parallelism.
- Giraph's flexible types and computing model were initially implemented using native Java objects and consumed excessive memory and garbage collection time.
- The aggregator framework was inefficiently implemented in ZooKeeper and we needed to support very large aggregators (e.g. gigabytes).

We selected three production applications, label propagation, variants of PageRank [12], and k-means clustering, to drive the direction of our development. Running these applications on graphs as large as the full Facebook friendship graph, with over 1 billion users and hundreds of billions of friendships, required us to address these shortcomings in Giraph. In the following sections, first, we describe our efforts with flexible vertex and edge based input to load and store graphs into Facebook's data warehouse. Second, we detail our performance and scalability enhancements with parallelization approaches, memory optimization, and sharded aggregators. In addition, whenever relevant, we add a citation in the form of GIRAPH-XXX as the Giraph JIRA number associated with the work described that can be referenced at [5].

3.2.1 Flexible vertex/edge based input

Since Giraph is a computing platform, it needs to interface with external storage to read the input and write back the output of a batch computation. Similarly to MapReduce, we can define custom input and output formats for various data sources (e.g., HDFS files, HBase tables, Hive tables). The Facebook software stack is shown in Figure 2.

Figure 2: Giraph leverages HiveIO to directly access Hive tables and can run on MapReduce and YARN.

Datasets fed to a Giraph job consist of vertices and edges, typically with some attached metadata. For instance, a label propagation algorithm might read vertices with initial labels and edges with attached weights. The output will consist of vertices with their final labels, possibly with confidence values. The original input model in Giraph required a rather rigid layout: all data relative to a vertex, including outgoing edges, had to be read from the same record and were assumed to exist in the same data source, for instance, the same Hive table. We found these restrictions suboptimal for most of our use cases. First, graphs in our data warehouse are usually stored in a relational format (one edge per row), and grouping them by source vertex requires an extra MapReduce job for pre-processing. Second, vertex data, such as initial labels in the example above, may be stored separately from edge data. This required an extra pre-processing step to join vertices with their edges, adding unnecessary overhead. Finally, in some cases, we needed to combine vertex data from different tables into a single one before running a Giraph job.

To address these shortcomings, we modified Giraph to allow loading vertex data and edges from separate sources (GIRAPH-155). Each worker can read an arbitrary subset of the edges, which are then appropriately distributed so that each vertex has all its outgoing edges. This new model also encourages reusing datasets: different algorithms may require loading different vertex metadata while operating on the same graph (e.g., the graph of Facebook friendships). One further generalization we implemented is the ability to add an arbitrary number of data sources (GIRAPH-639), so that, for example, multiple Hive tables with different schemas can be combined into the input for a single Giraph job. One could also consider loading from different sources from the same application (e.g. loading vertex data from MySQL machines and edge data from Hadoop sequence files). All of these modifications give us the flexibility to run graph algorithms on our existing and diverse datasets as well as reduce the required pre-processing to a minimum or eliminate it in many cases. These same techniques can be used in other graph processing frameworks as well.

3.2.2 Parallelization support

Since a Giraph application is scheduled as a single MapReduce job, it initially inherited the MapReduce way of parallelizing a job, that is, by increasing the number of workers (mappers) for the job. Unfortunately, it is hard to share resources with other Hadoop tasks running on the same machine due to differing requirements and resource expectations. When Giraph runs one monopolizing worker per machine in a homogeneous cluster, it can mitigate issues of different resource availabilities for different workers (i.e. the slowest worker problem). To address these issues, we extended Giraph to provide two methods of parallelizing computations:

- Adding more workers per machine.
- Use worker local multithreading to take advantage of additional CPU cores.

Specifically, we added multithreading to loading the graph, computation (GIRAPH-374), and storing the computed results (GIRAPH-615). In CPU bound applications, such as k-means clustering, we have seen a near linear speedup due to multithreading the application code. In production, we parallelize our applications by taking over a set of entire machines with one worker per machine and use multithreading to maximize resource utilization.

Multithreading introduces some additional complexity. Taking advantage of multithreading in Giraph requires the user to partition the graph into n partitions across m machines where the maximum compute parallelism is n/m. Additionally, we had to add a new concept called WorkerContext that allows developers to access shared member variables. Many other graph processing frameworks do not allow multithreading support within the infrastructure as Giraph does. We found that this technique has significant advantages in reducing overhead (e.g. TCP connections, larger message batching) as opposed to more coarse grain parallelism by adding workers.

3.2.3 Memory optimization

In scaling to billions of edges per machine, memory optimization is a key concept. Few other graph processing systems support Giraph's flexible model of allowing arbitrary vertex id, vertex value, vertex edge, and message classes as well as graph mutation capabilities. Unfortunately, this flexibility can have a large memory overhead without careful implementation. In the 0.1 incubator release, Giraph was memory inefficient due to all data types being stored as separate Java objects. The JVM worked too hard, out of memory errors were a serious issue, garbage collection took a large portion of our compute time, and we could not load or process large graphs.

We addressed this issue via two improvements. First, by default we serialize the edges of every vertex into a byte array rather than instantiating them as native Java objects (GIRAPH-417) using native direct (and non-portable) serialization methods. Messages on the server are serialized as well (GIRAPH-435). Second, we created an OutEdges interface that would allow developers to leverage Java primitives based on FastUtil for specialized edge stores (GIRAPH-528).

Given these optimizations and knowing there are typically many more edges than vertices in our graphs (2 orders of magnitude or more in most cases), we can now roughly estimate the required memory usage for loading the graph based entirely on the edges. We simply count the number of bytes per edge, multiply by the total number of edges in the graph, and then multiply by 1.5x to take into account memory fragmentation and inexact byte array sizes. Prior to these changes, the object memory overhead could have been as high as 10x. Reducing memory use was a big factor in enabling the system to load and send messages to 1 trillion edges. Finally, we also improved the message combiner (GIRAPH-414) to further reduce memory usage and improve performance by around 30% in PageRank testing. Our improvements in memory allocation show that with the correct interfaces, we can support the full flexibility of Java classes for vertex ids, values, edges, and messages without inefficient per object allocations.

3.2.4 Sharded aggregators

Aggregators, as described in [33], provide efficient shared state across workers. While computing, vertices can aggregate (the operation must be commutative and associative) values into named aggregators to do global computation (i.e. min/max vertex value, error rate, etc.). The Giraph infrastructure aggregates these values across workers and makes the aggregated values available to vertices in the next superstep.

One way to implement k-means clustering is to use aggregators to calculate and distribute the coordinates of centroids. Some of our customers wanted hundreds of thousands or millions of centroids (and correspondingly aggregators), which would fail in early versions of Giraph. Originally, aggregators were implemented using Zookeeper [27]. Workers would write partial aggregated values to znodes (Zookeeper data storage). The master would aggregate all of them, and write the final result back to its znode for workers to access it. This technique was sufficient for applications that only used a few simple aggregators, but wasn't scalable due to znode size constraints (maximum 1 megabyte) and Zookeeper write limitations. We needed a solution that could efficiently handle tens of gigabytes of aggregator data coming from every worker.
Figure 3: After sharding aggregators, aggregated communication is distributed across workers.

void compute(Iterable<M> messages) {
  if (phase == K_MEANS) {
    // Do k-means
  } else if (phase == START_EDGE_CUT) {
    // Do first phase of edge cut
  } else if (phase == END_EDGE_CUT) {
    // Do second phase of edge cut
  }
}

Figure 5: Example vertex computation code to change phases.

To solve this issue, first, we bypassed Zookeeper and used Netty [6] to directly communicate aggregator values between the master and its workers. While this change allowed much larger aggregator data transfers, we still had a bottleneck. The amount of data the master was receiving, processing, and sending was growing linearly with the number of workers. In order to remove this bottleneck, we implemented sharded aggregators.

In the sharded aggregator architecture (Figure 3), each aggregator is now randomly assigned to one of the workers. The assigned worker is in charge of gathering the values of its aggregators from all workers, performing the aggregation, and distributing the final values to the master and other workers. Now, aggregation responsibilities are balanced across all workers rather than bottlenecked by the master and aggregators are limited only by the total memory available on each worker.

4. COMPUTE MODEL EXTENSIONS

Usability and scalability improvements have allowed us to execute simple graph and iterative algorithms at Facebook-scale, but we soon realized that the Pregel model needed to be generalized to support more complex applications and make the framework more reusable. An easy way to depict the need for this generalization is through a very simple example: k-means clustering. K-means clustering, as shown in Figure 4, is a simple clustering heuristic for assigning input vectors to one of k centroids in an n-dimensional space. While k-means is not a graph application, it is iterative in nature and easily maps into the Giraph model where input vectors are vertices and every centroid is an aggregator. The vertex (input vector) compute method calculates the distance to all the centroids and adds itself to the nearest one. As the centroids gain input vectors, they incrementally determine their new location. At the next superstep, the new location of every centroid is available to every vertex.

Figure 4: In k-means clustering, k centroids have some initial location (often random). Then input vectors are assigned to their nearest centroid. Centroid locations are updated and the process repeats.

4.1 Worker phases

The methods preSuperstep(), postSuperstep(), preApplication(), and postApplication() were added to the Computation class and have access to the worker state. One use case for the pre-superstep computation is to calculate the new position for each of the centroids. Without pre-superstep computation, a developer has to either incrementally calculate the position with every added input vector or calculate it for every distance calculation. In the preSuperstep() method that is executed on every worker prior to every superstep, every worker can compute the final centroid locations just before the input vectors are processed. Determining the initial positions of the centroids can be done in the preApplication() method since it is executed on every worker prior to any computation being executed. While these simple methods add a lot of functionality, they bypass the Pregel model and require special consideration for application specific techniques such as superstep checkpointing.

4.2 Master computation

Fundamentally, while the Pregel model defines a functional computation by "thinking like a vertex", some computations need to be executed in a centralized fashion for many of the reasons above. While executing the same code on each worker provides a lot of the same functionality, it is not well understood by developers and is error prone. GIRAPH-127 added master computation to do centralized computation prior to every superstep that can communicate with the workers via aggregators. We describe two example use cases below.

When using k-means to cluster the Facebook dataset of over 1 billion users, it is useful to aggregate the error to see whether the application is converging. It is straightforward to aggregate the distance of every input vector to its chosen centroid as another aggregator, but at Facebook we also have the social graph information. Periodically, we can also compute another metric of distance: the edge cut. We can use the friendships, subscriptions, and other social connections of our users to measure the edge cut, or the weighted edge cut if the edges have weights. Deciding when to calculate the edge cut can be the job of the master computation, for instance, when at least 15 minutes of k-means computation have been executed. With master computation, this functionality can be implemented by checking to see how long it has been since the last edge cut calculation. If the time limit has been exceeded, set a special aggregator (i.e. execute edge cut) to true. The execution workflow is shown in Figure 6. The vertex compute code only needs to check the aggregator value to decide whether to begin calculating an edge cut or continue iterating on k-means. Executing a coordinated operation such as this one on the workers without master computation is more complicated due to clock skew, although possible, for instance with multiple supersteps to decide on whether to do the edge cut calculation.

Figure 6: Master computation allows k-means clustering to periodically calculate edge cut computations.

Another example of the usefulness of master computation can be found in the example PageRank code in the Pregel paper [33]. In the example, every vertex must check whether the desired number of iterations has been completed to decide to vote to halt. This is a simple computation that needs to be executed exactly once rather than on every vertex. In the master computation there is a haltComputation() method, where it is simple to check once prior to starting a superstep whether the application should continue rather than executing the check on a per vertex basis.

4.3 Composable computation

In our production environment, we observed that graph processing applications can be complex, often consisting of "stages", each implementing distinct logic. Master computation allows developers to compose their applications with different stages, but is still limiting since the original Pregel model only allows one message type and one message combiner. Also, the vertex compute code gets messy as shown in Figure 5. In order to support applications that do distinctive computations (such as k-means) in a cleaner and more reusable way, we added composable computation. Composable computing simply decouples the vertex from the computation as shown in Figure 7. The Computation class abstracts the computation from the vertex so that different types of computations can be executed. Additionally, there are now two message types specified. M1 is the incoming message type and M2 is the outgoing message type. The master compute method can choose the computation class to execute for the current superstep with the method setComputation(Class<? extends Computation> computationClass). The master also has a corresponding method to change the message combiner as well. The infrastructure checks to ensure that all types match for computations that are chained together.

public abstract class Vertex<I, V, E, M> {
  public abstract void compute(Iterable<M> messages);
}

public interface Computation<I, V, E, M1, M2> {
  void compute(Vertex<I, V, E> vertex, Iterable<M1> messages);
}

public abstract class MasterCompute {
  public abstract void compute();
  public void setComputation(Class<? extends Computation> computation);
  public void setMessageCombiner(Class<? extends MessageCombiner> combiner);
  public void setIncomingMessage(Class<? extends Writable> incomingMessage);
  public void setOutgoingMessage(Class<? extends Writable> outgoingMessage);
}

Figure 7: Vertex was generalized into Computation to support different computations, messages, and message combiners as set by MasterCompute.

Now that vertex computations are decoupled to different Computation implementations, they can be used as building blocks for multiple applications. For example, the edge cut computations can be used in a variety of clustering algorithms, not only k-means. In our example k-means application with a periodic edge cut, we might want to use a more sophisticated centroid initialization method such as initializing the centroid with a random user and then a random set of their friends. The computations to initialize the centroids will have different message types than the edge cut computations. In Table 2, we show how composable computation allows us to use different message types, combiners, and computations to build a powerful k-means application. We use the first computation for adding random input vectors to centroids and notifying random friends to add themselves to the centroids. The second computation adds the random friends to centroids by figuring out its desired centroid from the originally added input vector. It doesn't send any messages to the next computation (k-means), so the out message type is null. Note that starting the edge cut is the only computation that actually uses a message combiner, but one could use any message combiner in different computations.

Table 2: Example usage of composable computations in a k-means application with different computations (initialization, edge cut, and k-means)

Computation  | Add random centroid/random friends | Add to centroids | K-means | Start edge cut   | End edge cut
In message   | Null                               | Centroid message | Null    | Null             | Cluster
Out message  | Centroid message                   | Null             | Null    | Cluster          | Null
Combiner     | N/A                                | N/A              | N/A     | Cluster combiner | N/A

Composable computation makes the master computation logic simple. Other example applications that can benefit from composable computation besides k-means include balanced label propagation [41] and affinity propagation [19]. Balanced label propagation uses two computations: compute candidate partitions for each vertex and moving vertices to partitions. Affinity propagation has three computations: calculate responsibility, calculate availability, and update exemplars.

4.4 Superstep splitting

Some applications have messaging patterns that can exceed the available memory on the destination vertex owner. For messages that are aggregatable, that is, commutative and associative, message combining solves this problem. However, many applications send messages that cannot be aggregated. Calculating mutual friends, for instance, requires each vertex to send all its neighbors the vertex ids of its neighborhood. This message cannot be aggregated by the receiver across all its messages. Another example is the multiple phases of affinity propagation - each message must be responded to individually and is unable to be aggregated. Graph processing frameworks that are asynchronous are especially prone to such issues since they may receive messages at a faster rate than synchronous frameworks.

In social networks, one example of this issue can occur when sending messages to connections of connections (i.e. friends of friends in the Facebook network). While Facebook users are limited to 5000 friends, theoretically one user's vertex could receive up to 25 million messages. One of our production applications, friends-of-friends score, calculates the strength of a relationship between two users based on their mutual friends with some weights. The messages sent from a user to his/her friends contain a set of their friends and some weights. Our production application actually sends 850GB from each worker during the calculation when we use 200 workers. We do not have machines with 850GB and while Giraph supports out-of-core computation by spilling the graph and/or messages to disk, it is much slower. Therefore, we have created a technique for doing the same computation all in-memory for such applications: superstep splitting. The general idea is that in such a message heavy superstep, a developer can send a fragment of the messages to their destinations and do a partial computation that updates the state of the vertex value. The limitations for the superstep splitting technique are as follows:

- The message-based update must be commutative and associative.
- No single message can overflow the memory buffer of a single vertex.

The master computation will run the same superstep for a fixed number of iterations. During each iteration, every vertex uses a hash function with the destination vertex id of each of its potential messages to determine whether to send a message or not. A vertex only does computation if its vertex id passes the hash function for the current superstep. As an example, for 50 iterations (splits of our superstep) our friends-of-friends application only uses 17GB of memory per iteration. This technique can eliminate the need to go out-of-core and is memory scalable (simply add more iterations to proportionally reduce the memory usage).
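The hash-based message filtering that superstep splitting relies on can be sketched as follows (an illustrative fragment with assumed names, not the production implementation). Across s repetitions of the same superstep, each destination id is selected in exactly one repetition, so each repetition carries roughly 1/s of the total message volume:

```java
// Illustrative sketch of superstep splitting's hash filter (assumed
// names, not production code). A message to dstId is sent only in the
// split its hash selects, so peak message memory drops roughly by a
// factor of `splits`.
public class SuperstepSplitting {
    static boolean sendInThisSplit(long dstId, int split, int splits) {
        // Math.floorMod keeps the bucket non-negative for any hash value.
        return Math.floorMod(Long.hashCode(dstId), splits) == split;
    }
}
```

With splits = 50, each repetition handles roughly 1/50 of a vertex's outgoing messages, which is why adding more iterations proportionally reduces the per-iteration memory footprint.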
Figure8:ScalingPageRankwithrespecttothenumberofworkers(a)andthenumberofedges(b)5.EXPERIMENTALRESULTSInthissection,wedescribeourresultsofrunningseveralexampleapplicationsinGiraphwhencomparedtotheirre-spectiveHiveimplementations.OurcomparisonwillincludeiterativegraphapplicationsaswellassimpleHivequeriesconvertedtoGiraphjobs.Weusetwometricsfortheperfor-mancecomparison:CPUtimeandelapsedtime.CPUtimeisthetimemeasuredbytheCPUtocompleteanapplicationwhileelapsedtimeisthetimeobservedbytheuserwhentheirapplicationcompletes.Allofthedescribedapplica-tionsareinproduction.Experimentsweregatheredonpro-ductionmachineswith16physicalcoresand10GbpsEth-ernet.TheGiraphexperimentswereallconductedinphysi-calmemorywithoutcheckpointing,whileHive/Hadoophasmanyphasesofdiskaccess(mapsidespilling,reducerfetch-ing,HDFSaccessperiterations,etc.)causingittoexhibitpoorerperformance.Weusedthedefaulthash-basedgraphpartitioningforallexperiments.5.1GiraphscalabilityWithasimpleunweightedPageRankbenchmark,weeval-uatedthescalabilityofGiraphonourproductionclusters.UnweightedPageRankisalightweightcomputationalappli-cationthatistypicallynetworkbound.InFigure8a,werstxedtheedgecountto200Bandscalethenumberofworkerfrom50to300.Thereisaslightbitofvariance,butoverallGiraphperformancecloselytrackstheidealscalingcurvebasedon50workersasthestartingdatapoint.InFigure8b,wexthenumberofworkersat50andscaletheproblemsizefrom1Bto200Bedges.Asexpected,GiraphperformancescalesuplinearlywiththenumberofedgessincePageRankisnetworkbound.5.2ApplicationsWedescribethreeproductioniterativegraphapplicationsaswellasthealgorithmicpseudocodethathavebeenruninbothHiveandGiraph.TheGiraphimplementationused200machinesinourtestsandtheHiveimplementationsusedatleast200machines(albeitpossiblywithfailuresasMapReduceisbuilttohandlethem).5.2.1LabelpropagationComputation1:Sendneighborsedgeweightsandlabeldata1.Sendmyneighborsmyedgeweightsandlabels.Table3:LabelpropagationcomparisonbetweenHiveandGiraphimplementationswith2iterations. 
Graph size: 701M+ vertices, 48B+ edges
                Hive          Giraph        Speedup
Total CPU       9.631M secs   1.014M secs   9x
Elapsed time    1,666 mins    19 mins       87x

Computation 2: Update labels
1. If the normalization base is not set, create it from the edge weights.
2. Update my labels based on the messages.
3. Trim labels to top n.
4. Send my neighbors my edge weights and labels.

Label propagation was described briefly in Section 3.1. This algorithm (similar to [46]) was previously implemented in Hive as a series of Hive queries that had a bad impedance mismatch, since the algorithm is much simpler to express as a graph algorithm. There are two computation phases, as shown above. We need to normalize edges, and Giraph keeps only the out edges (not the in edges) to save memory. So computation 1 starts the first phase of figuring out how to normalize the incoming edge weights by the destination vertex. The master computation will run computation 2 until the desired iterations have been met. Note that trimming to the top n labels is a common technique to avoid unlikely labels being propagated to all vertices. Trimming both saves a large amount of memory and reduces the amount of network data sent.

In Table 3, we show the comparison between Hive and Giraph implementations of the same label propagation algorithm. From a CPU standpoint, we save 9x CPU seconds. From an elapsed time standpoint, the Giraph implementation is 87x faster with 200 machines versus roughly the same number of machines for the Hive implementation. Our comparison was executed with only 2 iterations, primarily because it took so long in Hive/Hadoop. In production, we run minor variants of label propagation with many different types of labels that can be used as ranking features in a variety of applications.

5.2.2 PageRank

Computation: PageRank
1. Normalize the outgoing edge weights if this is the first superstep. Otherwise, update the PageRank value by summing up the incoming messages and calculating a new PageRank value.
2. Send every edge my PageRank value weighted by its normalized edge weight.

PageRank [12] is a useful application for finding influential entities in the graph. While originally developed for ranking websites, we can also apply PageRank in the social context. The algorithm above describes our implementation of weighted PageRank, where the weights represent the strength of the connection between users (i.e. based on the friends of friends score calculation). Weighted PageRank is slightly more expensive than unweighted PageRank due to sending each out edge a specialized message, but it also uses a combiner in production. The combiner simply adds the PageRank messages together and sums them into a single message.

We ran an iteration of PageRank against a snapshot of some portion of the user graph at Facebook with over 2B vertices and more than 400B social connections (this includes more connections than just friendships). Table 4 shows the results. At 600 minutes per iteration for Hive/Hadoop, computing 10 iterations would take over 4 days with more than 200 machines. In Giraph, 10 iterations could be done in only 50 minutes on 200 machines.

Table 4: Weighted PageRank comparison between Hive and Giraph implementations for a single iteration.

Graph size: 2B+ vertices, 400B+ edges
                Hive          Giraph       Speedup
Total CPU       16.5M secs    0.6M secs    26x
Elapsed time    600 mins      19 mins      120x

5.2.3 Friends of friends score

Computation: Friends of friends score
1. Send friend list with weights to a hashed portion of my friend list.
2. If my vertex id matches the hash function on the superstep number, aggregate messages to generate my friends of friends score with the source vertices of the incoming messages.

As mentioned in Section 4.4, the friends of friends score is a valuable feature for various Facebook products. It is one of the few features we have that can calculate a score between users that are not directly connected. It uses the aforementioned superstep splitting technique to avoid going out of core when message sizes far exceed available memory.

From Table 5 we can see that the Hive implementation is much less efficient than the Giraph implementation. The superstep splitting technique allows us to maintain a significant 65x elapsed time improvement on 200 machines, since the computation and message passing happen in memory. While not shown, out-of-core messaging had a 2-3x performance overhead in our experiments, while the additional synchronization superstep overhead of the superstep splitting technique was small.

Table 5: Performance comparison for friends of friends score between Hive and Giraph implementations.

Graph size: 1B+ vertices, 76B+ edges
                Hive          Giraph       Speedup
Total CPU       255M secs     18M secs     14x
Elapsed time    7,200 mins    110 mins     65x

5.3 Hive queries on Giraph

While Giraph has enabled us to run iterative graph algorithms much more efficiently than Hive, we have also found it a valuable tool for running certain expensive Hive queries in a faster way. Hive is much simpler to write, of course.

Table 6: Double join performance in Hive and Giraph implementations for one expensive Hive query.

Graph size: 450B connections, 2.5B+ unique ids
                Hive          Giraph       Speedup
Total CPU       211 days      43 days      5x
Elapsed time    425 mins      50 mins      8.5x

Table 7: Custom Hive query comparison with Giraph for a variety of data sizes, calculating the number of users interacting with a specific action.

Data size                     Metric         Hive        Giraph     Speedup
360B actions, 40M objects     Total CPU      22 days     4 days     5.5x
                              Elapsed time   64 mins     7 mins     9.1x
162B actions, 74M objects     Total CPU      92 days     18 days    5.1x
                              Elapsed time   129 mins    19 mins    6.8x
620B actions, 110M objects    Total CPU      485 days    78 days    6.2x
                              Elapsed time   510 mins    45 mins    11.3x
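The query pattern behind Table 7 - counting distinct users per object and action - maps naturally onto the vertex-centric model with only input loading and a single round of message processing. Below is a minimal, illustrative Python sketch (not the Giraph API; the function name and data shapes are our own assumptions): each action "sends" its user id to the corresponding (object, action) vertex, which then emits its distinct-user count.

```python
# Sketch of "select count(distinct userId) ... group by objectId,
# actionDescription" expressed as one round of message passing.

from collections import defaultdict

def count_distinct_users(action_log):
    """action_log: iterable of (user_id, object_id, action_description)."""
    # "Superstep 1": deliver each user id to its (object, action) vertex.
    inbox = defaultdict(set)
    for user_id, object_id, action in action_log:
        inbox[(object_id, action)].add(user_id)  # a set gives us "distinct"
    # "Superstep 2": each vertex emits its distinct-user count.
    return {vertex: len(users) for vertex, users in inbox.items()}

log = [(1, "page_a", "like"), (2, "page_a", "like"),
       (1, "page_a", "like"), (2, "page_b", "comment")]
print(count_distinct_users(log))
# {('page_a', 'like'): 2, ('page_b', 'comment'): 1}
```

In the real system the per-vertex state would use the customized primitive data structures mentioned below rather than Python sets, and the input would be loaded from Hive tables via the Giraph input formats.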
However, we have found that many queries fall into the same pattern of join and also double join. These applications involve either only input and output, or just two supersteps (one iteration of message sending and processing). Partitioning of the graph during input loading can simulate a join query, and we do any vertex processing prior to dumping the data to Hive.

In one example, consider a join of two big tables, where one of them can be treated as edge input (representing connections between users, pages, etc.) and the other one as vertex input (some data about these users or pages). Implementing these kinds of queries in Giraph has proven to be 3-4x more CPU efficient than performing the same query in Hive. Doing a join on both sides of the edge table (double join), as in the example query above, is up to 5x better, as shown in Table 6.

As another example, we have a large table with action logs, containing the id of the user performing the action, the id of the object it interacted with, and potentially some additional data about the action itself. One of our users was interested in calculating, for each object and each action description, how many different users have interacted with it. In Hive this would be expressed in the form: "select count(distinct userId) ... group by objectId, actionDescription". This type of query is very expensive, and expressing it in Giraph achieves up to a 6x CPU time improvement and an 11.3x elapsed time improvement, both big wins for the user, as shown in Table 7. Customers are happy to have a way to get significantly better performance for their expensive Hive queries. Although Giraph is not built for generalized relational queries, it outperforms Hive by avoiding disk and using customized primitive data structures that are particular to the underlying data. In the future, we may consider building a
more generalized query framework to make this transition easier for certain classes of queries.

5.4 Trillion edge PageRank

Our social graph of over 1.39B users is continuing to grow rapidly. The number of connections between users (friendships, subscriptions, public interactions, etc.) is also seeing explosive growth. To ensure that Giraph could continue to operate at a larger scale, we ran an iteration of unweighted PageRank on our 1.39B user dataset with over 1 trillion social connections. With some recent improvements in messaging (GIRAPH-701) and request processing (GIRAPH-704), we were able to execute PageRank on over a trillion social connections in less than 3 minutes per iteration with only 200 machines.

6. OPERATIONAL EXPERIENCE

In this section, we describe the operational experience we have gained from running Giraph in production at Facebook for over two years. Giraph executes applications for ads, search, entities, user modeling, insights, and many other teams at Facebook.

6.1 Scheduling

Giraph leverages our existing MapReduce scheduling framework (Corona) but also works on top of Hadoop. Hadoop scheduling assumes an incremental scheduling model, where map/reduce slots can be incrementally assigned to jobs, while Giraph requires that all slots are available prior to the application running. Preemption, available in some versions of Hadoop, causes failure of Giraph jobs. While we could turn on checkpointing to handle some of these failures, in practice we choose to disable checkpointing for three reasons:

1. Our HDFS implementation will occasionally fail on write operations (e.g. temporary namenode unavailability, write nodes unavailable, etc.), causing our checkpointing code to fail and ironically increasing the chance of failure.
2. Most of our production applications run in less than an hour and use less than 200 machines. The chance of failure is relatively low and handled well by restarts.
3. Checkpointing has an overhead.

Giraph jobs are run in non-preemptible FIFO pools in Hadoop, where the number of map slots of a job never exceeds the maximum number of map slots in the pool. This policy allows Giraph jobs to queue up in a pool, wait until they get all their resources, and then execute. In order to make this process less error prone, we added a few changes to Corona to prevent user error. First, we added an optional feature for a pool in Corona to automatically fail jobs that ask for more map slots than the maximum map slots of the pool.

Second, we configure Giraph clusters differently than typical MapReduce clusters. Giraph clusters are homogeneous and only have one map slot. This configuration allows Giraph jobs to take over an entire machine and all of its resources (CPU, memory, network, etc.). Since Giraph still runs on the same infrastructure as Hadoop, production engineering can use the same configuration management tools and prior MapReduce experience to maintain the Giraph infrastructure. Jobs that fail are restarted automatically by the same scheduling system that also restarts failed Hive queries. In production, both Hive and Giraph retries are set to 4, and once a Giraph application is deployed to production we rarely see it fail consistently. The one case we saw during the past year occurred when a user changed production parameters without first testing their jobs at scale.

6.2 Graph preparation

Production grade graph preparation is a subject that is not well addressed in the research literature. Projects such as GraphBuilder [29] have built frameworks that help with areas such as graph formation, compression, transformation, partitioning, output formatting, and serialization. At Facebook, we take a simpler approach. Graph preparation can be handled in two different ways depending on the needs of the user. All Facebook warehouse data is stored in Hive tables but can be easily interpreted as vertices and/or edges with HiveIO. In any of the Giraph input formats, a user can add custom filtering or transformation of the Hive table to create vertices and/or edges. As mentioned in Section 3.2.1, users can essentially scan tables and pick and choose the graph data they are interested in for the application. Users can also turn to Hive queries and/or MapReduce applications to prepare any input data for graph computation. Other companies that use Apache Pig [35], a high-level language for expressing data analysis programs, can execute similar graph preparation steps as a series of simple queries.

6.3 Production application workflow

Customers often ask us the workflow for deploying a Giraph application into production. We typically go through the following cycle:

1. Write your application and unit test it. Giraph can run in a local mode with tools to create simple graphs for testing.
2. Run your application on a test dataset. We have small datasets that mimic the full Facebook graph as a single country. Typically these tests only need to run on a few machines.
3. Run your application at scale (typically a maximum of 200 workers). We have limited resources for non-production jobs to run at scale, so we ask users to tune their configuration in step 2.
4. Deploy to production. The user application and its configuration are entered into our higher-level scheduler to ensure that jobs are scheduled periodically and retries happen on failure. The Giraph oncall is responsible for ensuring that the production jobs complete.

Overall, customers are satisfied with this workflow. It is very similar to the Hive/Hadoop production workflow and is an easy transition for them.

7. CONCLUSIONS & FUTURE WORK

In this paper, we have detailed how a BSP-based, composable graph processing framework supports Facebook-scale production workloads. In particular, we have described the improvements to the Apache Giraph project that enabled us to scale to trillion edge graphs (much larger than those referenced in previous work). We described new graph processing
techniques such as composable computation and superstep splitting that have allowed us to broaden the pool of potential applications. We have shared our experiences with several production applications and their performance improvements over our existing Hive/Hadoop infrastructure. We have contributed all systems code back into the Apache Giraph project, so anyone can try out production quality graph processing code that can support trillion edge graphs. Finally, we have shared our operational experiences with Giraph jobs and how we schedule and prepare our graph data for computation pipelines.

While Giraph suits our current needs and provides much needed efficiency wins over our existing Hive/Hadoop infrastructure, we have identified several areas of future work that we have started to investigate. First, our internal experiments show that graph partitioning can have a significant effect on network bound applications such as PageRank. For long running applications, determining a good quality graph partitioning prior to our computation will likely provide a performance win. Second, we have started to look at making our computations more asynchronous as a possible way to improve convergence speed. While our users enjoy a predictable application and the simplicity of the BSP model, they may consider asynchronous computing if the gain is significant. Finally, we are leveraging Giraph as a parallel machine-learning platform. We already have several ML algorithms implemented internally. Our matrix factorization based collaborative filtering implementation scales to over a hundred billion examples. The BSP computing model also appears to be a good fit for AdaBoost logistic regression from Collins et al. [16]. We are able to train logistic regression models on 1.1 billion samples with 800 million sparse features and an average of 1,000 active features per sample in minutes per iteration.

8. REFERENCES

[1] Apache Giraph. http://giraph.apache.org.
[2] Apache Hadoop. http://hadoop.apache.org/.
[3] Apache Mahout. http://mahout.apache.org.
[4] Beevolve Twitter study. http://www.beevolve.com/twitter-statistics.
[5] Giraph JIRA. https://issues.apache.org/jira/browse/GIRAPH.
[6] Netty. http://netty.io.
[7] Open Graph. https://developers.facebook.com/docs/opengraph.
[8] Yahoo! AltaVista web page hyperlink connectivity graph, circa 2002, 2012. http://webscope.sandbox.yahoo.com/.
[9] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: Membership, growth, and evolution. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pages 44-54, New York, NY, USA, 2006. ACM.
[10] P. Boldi, M. Santini, and S. Vigna. A large time-aware graph. SIGIR Forum, 42(2):33-38, 2008.
[11] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE '11, pages 1151-1162, Washington, DC, USA, 2011. IEEE Computer Society.
[12] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Conference on World Wide Web, WWW7, pages 107-117, Amsterdam, The Netherlands, 1998. Elsevier Science Publishers B.V.
[13] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. Proc. VLDB Endow., 3(1-2):285-296, Sept. 2010.
[14] Z. Cai, Z. J. Gao, S. Luo, L. L. Perez, Z. Vagena, and C. Jermaine. A comparison of platforms for implementing and running very large scale machine learning algorithms. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 1371-1382, New York, NY, USA, 2014. ACM.
[15] R. Chen, M. Yang, X. Weng, B. Choi, B. He, and X. Li. Improving large graph processing on partitioned graphs in the cloud. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC '12, pages 3:1-3:13, New York, NY, USA, 2012. ACM.
[16] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3):253-285, 2002.
[17] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107-113, Jan. 2008.
[18] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: A runtime for iterative MapReduce. In The First International Workshop on MapReduce and its Applications, 2010.
[19] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315:972-976, 2007.
[20] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, pages 17-30, Berkeley, CA, USA, 2012. USENIX Association.
[21] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 599-613, Broomfield, CO, Oct. 2014. USENIX Association.
[22] D. Gregor and A. Lumsdaine. The Parallel BGL: A generic library for distributed graph computations. In Parallel Object-Oriented Scientific Computing (POOSC), July 2005.
[23] P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh. WTF: The who to follow service at Twitter. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13, pages 505-514, Republic and Canton of Geneva, Switzerland, 2013. International World Wide Web Conferences Steering Committee.
[24] M. Han, K. Daudjee, K. Ammar, M. T. Ozsu, X. Wang, and T. Jin. An experimental comparison of Pregel-like graph processing systems. PVLDB, 7(12):1047-1058, 2014.
[25] S. Hong, H. Chafi, E. Sedlar, and K. Olukotun. Green-Marl: A DSL for easy and efficient graph analysis. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 349-362, New York, NY, USA, 2012. ACM.
[26] J. Huang, D. J. Abadi, and K. Ren. Scalable SPARQL querying of large RDF graphs. PVLDB, 4(11):1123-1134, 2011.
[27] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Annual Technical Conference, USENIX ATC '10, pages 11-11, Berkeley, CA, USA, 2010. USENIX Association.
[28] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys '07, pages 59-72, New York, NY, USA, 2007. ACM.
[29] N. Jain, G. Liao, and T. L. Willke. GraphBuilder: Scalable graph ETL framework. In First International Workshop on Graph Data Management Experiences and Systems, GRADES '13, pages 4:1-4:6, New York, NY, USA, 2013. ACM.
[30] U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A peta-scale graph mining system - implementation and observations. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM '09, pages 229-238, Washington, DC, USA, 2009. IEEE Computer Society.
[31] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 591-600, New York, NY, USA, 2010. ACM.
[32] A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, pages 31-46, Berkeley, CA, USA, 2012. USENIX Association.
[33] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 135-146, New York, NY, USA, 2010. ACM.
[34] J. Nelson, B. Myers, A. H. Hunter, P. Briggs, L. Ceze, C. Ebeling, D. Grossman, S. Kahan, and M. Oskin. Crunching large graphs with commodity processors. In Proceedings of the 3rd USENIX Conference on Hot Topics in Parallelism, pages 10-10. USENIX Association, 2011.
[35] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1099-1110, New York, NY, USA, 2008. ACM.
[36] R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI '10, pages 1-14, Berkeley, CA, USA, 2010. USENIX Association.
[37] S. Salihoglu and J. Widom. GPS: A graph processing system. In Scientific and Statistical Database Management. Stanford InfoLab, July 2013.
[38] B. Shao, H. Wang, and Y. Li. Trinity: A distributed graph engine on a memory cloud. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 505-516, New York, NY, USA, 2013. ACM.
[39] P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web - Volume Part I, ISWC '10, pages 764-780, Berlin, Heidelberg, 2010. Springer-Verlag.
[40] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A warehousing solution over a map-reduce framework. Proc. VLDB Endow., 2(2):1626-1629, Aug. 2009.
[41] J. Ugander and L. Backstrom. Balanced label propagation for partitioning massive graphs. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM '13, pages 507-516, New York, NY, USA, 2013. ACM.
[42] L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103-111, Aug. 1990.
[43] S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, and R. S. Schreiber. Presto: Distributed machine learning and graph processing with sparse matrices. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 197-210, New York, NY, USA, 2013. ACM.
[44] G. Wang, W. Xie, A. J. Demers, and J. Gehrke. Asynchronous large-scale graph processing made easy. In CIDR, 2013.
[45] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud '10, pages 10-10, Berkeley, CA, USA, 2010. USENIX Association.
[46] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, 2002.