
Shark: SQL and Rich Analytics at Scale

Reynold Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica
AMPLab, EECS, UC Berkeley
{rxin, joshrosen, matei, franklin, shenker, istoica}@cs.berkeley.edu
Technical Report No. UCB/EECS-2012-214
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.html
November 26, 2012

Copyright © 2012, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

ABSTRACT

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engines provide. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.

1 Introduction

Modern data analysis faces a confluence of growing challenges. First, data volumes are expanding dramatically, creating the need to scale out across clusters of hundreds of commodity machines. Second, this new scale increases the incidence of faults and stragglers (slow tasks), complicating parallel database design. Third, the complexity of data analysis has also grown: modern data analysis employs sophisticated statistical methods, such as machine learning algorithms, that go well beyond the roll-up and drill-down capabilities of traditional enterprise data warehouse systems. Finally, despite these increases in scale and complexity, users still expect to be able to query data at interactive speeds.

To tackle the "big data" problem, two major lines of systems have recently been explored. The first, composed of MapReduce [13] and various generalizations [17, 9], offers a fine-grained fault tolerance model suitable for large clusters, where tasks on failed or slow nodes can be deterministically re-executed on other nodes. MapReduce is also fairly general: it has been shown to be able to express many statistical and learning algorithms [11]. It also easily supports unstructured data and "schema-on-read." However, MapReduce engines lack many of the features that make databases efficient, and have high latencies of tens of seconds to hours. Even systems that have significantly optimized MapReduce for SQL queries, such as Google's Tenzing [9], or that combine it with a traditional database on each node, such as HadoopDB [3], report a minimum latency of 10 seconds. As such, MapReduce approaches have largely been dismissed for interactive-speed queries [25], and even Google is developing new engines for such workloads [24].

Instead, most MPP analytic databases (e.g., Vertica, Greenplum, Teradata) and several of the new low-latency engines proposed for MapReduce environments (e.g., Google Dremel [24], Cloudera Impala [1]) employ a coarser-grained recovery model, where an entire query has to be resubmitted if a machine fails. (Dremel provides fault tolerance within a query, but Dremel is limited to aggregation trees instead of the more complex communication patterns in joins.) This works well for short queries where a retry is inexpensive, but faces significant challenges in long queries as clusters scale up [3]. In addition, these systems often lack the rich analytics functions that are easy to implement in MapReduce, such as machine learning and graph algorithms. Furthermore, while it may be possible to implement some of these functions using UDFs, these algorithms are often expensive, furthering the need for fault and straggler recovery for long queries. Thus, most organizations tend to use other systems alongside MPP databases to perform complex analytics.

To provide an effective environment for big data analysis, we believe that processing systems will need to support both SQL and complex analytics efficiently, and to provide fine-grained fault recovery across both types of operations. This paper describes a new system that meets these goals, called Shark. Shark is open source and compatible with Apache Hive, and has already been used at web companies to speed up queries by 40–100×.

Shark builds on a recently-proposed distributed shared memory abstraction called Resilient Distributed Datasets (RDDs) [33] to perform most computations in memory while offering fine-grained fault tolerance. In-memory computing is increasingly important in large-scale analytics for two reasons. First, many complex analytics functions, such as machine learning and graph algorithms, are iterative, going over the data multiple times; thus, the fastest systems deployed for these applications are in-memory [23, 22, 33]. Second, even traditional SQL warehouse workloads exhibit strong temporal and spatial locality, because more-recent fact table data and small dimension tables are read disproportionately often. A study of Facebook's Hive warehouse and Microsoft's Bing analytics cluster showed that over 95% of queries in both systems could be served out of memory using just 64 GB/node as a cache, even though each system manages more than 100 PB of total data [5].

The main benefit of RDDs is an efficient mechanism for fault recovery. Traditional main-memory databases support fine-grained updates to tables and replicate writes across the network for fault tolerance, which is expensive on large commodity clusters. In contrast, RDDs restrict the programming interface to coarse-grained deterministic operators that affect multiple data items at once, such as map, group-by and join, and recover from failures by tracking the lineage of each dataset and recomputing lost data. This approach works well for data-parallel relational queries, and has also been shown to support machine learning and graph computation [33].
Thus, when a node fails, Shark can recover mid-query by rerunning the deterministic operations used to build lost data partitions on other nodes, similar to MapReduce. Indeed, it typically recovers within seconds, by parallelizing this work across the cluster.

Figure 1: Performance of Shark vs. Hive/Hadoop on two SQL queries from an early user and one iteration of logistic regression (a classification algorithm that runs 10 such steps). Results measure the runtime (seconds) on a 100-node cluster.

To run SQL efficiently, however, we also had to extend the RDD execution model, bringing in several concepts from traditional analytical databases and some new ones. We started with an existing implementation of RDDs called Spark [33], and added several features. First, to store and process relational data efficiently, we implemented in-memory columnar storage and columnar compression. This reduced both the data size and the processing time by as much as 5× over naïvely storing the data in a Spark program in its original format. Second, to optimize SQL queries based on the data characteristics even in the presence of analytics functions and UDFs, we extended Spark with Partial DAG Execution (PDE): Shark can reoptimize a running query after running the first few stages of its task DAG, choosing better join strategies or the right degree of parallelism based on observed statistics. Third, we leverage other properties of the Spark engine not present in traditional MapReduce systems, such as control over data partitioning.

Our implementation of Shark is compatible with Apache Hive [28], supporting all of Hive's SQL dialect and UDFs and allowing execution over unmodified Hive data warehouses. It augments SQL with complex analytics functions written in Spark, using Spark's Java, Scala or Python APIs. These functions can be combined with SQL in a single execution plan, providing in-memory data sharing and fast recovery across both types of processing.

Experiments show that using RDDs and the optimizations above, Shark can answer SQL queries up to 100× faster than Hive, runs iterative machine learning algorithms up to 100× faster than Hadoop, and can recover from failures mid-query within seconds. Figure 1 shows three sample results. Shark's speed is comparable to that of MPP databases in benchmarks like Pavlo et al.'s comparison with MapReduce [25], but it offers fine-grained recovery and complex analytics features that these systems lack.

More fundamentally, our work shows that MapReduce-like execution models can be applied effectively to SQL, and offer a promising way to combine relational and complex analytics. In the course of presenting Shark, we also explore why SQL engines over previous MapReduce runtimes, such as Hive, are slow, and show how a combination of enhancements in Shark (e.g., PDE), and engine properties that have not been optimized in MapReduce, such as the overhead of launching tasks, eliminate many of the bottlenecks in traditional MapReduce systems.

2 System Overview

Shark is a data analysis system that supports both SQL query processing and machine learning functions. We have chosen to implement Shark to be compatible with Apache Hive. It can be used to query an existing Hive warehouse and return results much faster, without modification to either the data or the queries.

Thanks to its Hive compatibility, Shark can query data in any system that supports the Hadoop storage API, including HDFS and Amazon S3. It also supports a wide range of data formats such as text, binary sequence files, JSON, and XML. It inherits Hive's schema-on-read capability and nested data types [28].

In addition, users can choose to load high-value data into Shark's memory store for fast analytics, as shown below:

CREATE TABLE latest_logs
  TBLPROPERTIES ("shark.cache"=true)
AS SELECT * FROM logs WHERE date > now() - 3600;

Figure 2: Shark Architecture

Figure 2 shows the architecture of a Shark cluster, consisting of a single master node and a number of slave nodes, with the warehouse metadata stored in an external transactional database. It is built on top of Spark, a modern MapReduce-like cluster computing engine. When a query is submitted to the master, Shark compiles the query into an operator tree represented as RDDs, as we shall discuss in Section 2.4. These RDDs are then translated by Spark into a graph of tasks to execute on the slave nodes.

Cluster resources can optionally be allocated by a cluster resource manager (e.g., Hadoop YARN or Apache Mesos) that provides resource sharing and isolation between different computing frameworks, allowing Shark to coexist with engines like Hadoop.

In the remainder of this section, we cover the basics of Spark and the RDD programming model, followed by an explanation of how Shark query plans are generated and run.

2.1 Spark

Spark is the MapReduce-like cluster computing engine used by Shark. Spark has several features that differentiate it from traditional MapReduce engines [33]:

1. Like Dryad and Tenzing [17, 9], it supports general computation DAGs, not just the two-stage MapReduce topology.
2. It provides an in-memory storage abstraction called Resilient Distributed Datasets (RDDs) that lets applications keep data in memory across queries, and automatically reconstructs it after failures [33].
3. The engine is optimized for low latency. It can efficiently manage tasks as short as 100 milliseconds on clusters of thousands of cores, while engines like Hadoop incur a latency of 5–10 seconds to launch each task.
Figure 3: Lineage graph for the RDDs in our Spark example. Oblongs represent RDDs, while circles show partitions within a dataset. Lineage is tracked at the granularity of partitions.

RDDs are unique to Spark, and were essential to enabling mid-query fault tolerance. However, the other differences are important engineering elements that contribute to Shark's performance.

On top of these features, we have also modified the Spark engine for Shark to support partial DAG execution, that is, modification of the query plan DAG after only some of the stages have finished, based on statistics collected from these stages. Similar to [20], we use this technique to optimize join algorithms and other aspects of the execution mid-query, as we shall discuss in Section 3.1.

2.2 Resilient Distributed Datasets (RDDs)

Spark's main abstraction is resilient distributed datasets (RDDs), which are immutable, partitioned collections that can be created through various data-parallel operators (e.g., map, group-by, hash-join). Each RDD is either a collection stored in an external storage system, such as a file in HDFS, or a derived dataset created by applying operators to other RDDs. For example, given an RDD of (visitID, URL) pairs for visits to a website, we might compute an RDD of (URL, count) pairs by applying a map operator to turn each event into a (URL, 1) pair, and then a reduce to add the counts by URL.

In Spark's native API, RDD operations are invoked through a functional interface similar to DryadLINQ [19] in Scala, Java or Python. For example, the Scala code for the query above is:

val visits = spark.hadoopFile("hdfs://...")
val counts = visits.map(v => (v.url, 1))
                   .reduceByKey((a, b) => a + b)

RDDs can contain arbitrary data types as elements (since Spark runs on the JVM, these elements are Java objects), and are automatically partitioned across the cluster, but they are immutable once created, and they can only be created through Spark's deterministic parallel operators. These two restrictions, however, enable highly efficient fault recovery. In particular, instead of replicating each RDD across nodes for fault-tolerance, Spark remembers the lineage of the RDD (the graph of operators used to build it), and recovers lost partitions by recomputing them from base data [33]. (We assume that external files for RDDs representing external data do not change, or that we can take a snapshot of a file when we create an RDD from it.) For example, Figure 3 shows the lineage graph for the RDDs computed above. If Spark loses one of the partitions in the (URL, 1) RDD, for example, it can recompute it by rerunning the map on just the corresponding partition of the input file.

The RDD model offers several key benefits in our large-scale in-memory computing setting. First, RDDs can be written at the speed of DRAM instead of the speed of the network, because there is no need to replicate each byte written to another machine for fault-tolerance. DRAM in a modern server is over 10× faster than even a 10-Gigabit network. Second, Spark can keep just one copy of each RDD partition in memory, saving precious memory over a replicated system, since it can always recover lost data using lineage. Third, when a node fails, its lost RDD partitions can be rebuilt in parallel across the other nodes, allowing speedy recovery. (To provide fault tolerance across "shuffle" operations like a parallel reduce, the execution engine also saves the "map" side of the shuffle in memory on the source nodes, spilling to disk if necessary.) Fourth, even if a node is just slow (a "straggler"), we can recompute necessary partitions on other nodes because RDDs are immutable, so there are no consistency concerns with having two copies of a partition. These benefits make RDDs attractive as the foundation for our relational processing in Shark.

2.3 Fault Tolerance Guarantees

To summarize the benefits of RDDs explained above, Shark provides the following fault tolerance properties, which have been difficult to support in traditional MPP database designs:

1. Shark can tolerate the loss of any set of worker nodes. The execution engine will re-execute any lost tasks and recompute any lost RDD partitions using lineage. (Support for master recovery could also be added by reliably logging the RDD lineage graph and the submitted jobs, because this state is small, but we have not yet implemented this.) This is true even within a query: Spark will rerun any failed tasks, or lost dependencies of new tasks, without aborting the query.
2. Recovery is parallelized across the cluster. If a failed node contained 100 RDD partitions, these can be rebuilt in parallel on 100 different nodes, quickly recovering the lost data.
3. The deterministic nature of RDDs also enables straggler mitigation: if a task is slow, the system can launch a speculative "backup copy" of it on another node, as in MapReduce [13].
4. Recovery works even in queries that combine SQL and machine learning UDFs (Section 4), as these operations all compile into a single RDD lineage graph.

2.4 Executing SQL over RDDs

Shark runs SQL queries over Spark using a three-step process similar to traditional RDBMSs: query parsing, logical plan generation, and physical plan generation.

Given a query, Shark uses the Hive query compiler to parse the query and generate an abstract syntax tree. The tree is then turned into a logical plan and basic logical optimization, such as predicate pushdown, is applied. Up to this point, Shark and Hive share an identical approach. Hive would then convert the operator tree into a physical plan consisting of multiple MapReduce stages. In the case of Shark, its optimizer applies additional rule-based optimizations, such as pushing LIMIT down to individual partitions, and creates a physical plan consisting of transformations on RDDs rather than MapReduce jobs. We use a variety of operators already present in Spark, such as map and reduce, as well as new operators we implemented for Shark, such as broadcast joins. Spark's master then executes this graph using standard MapReduce scheduling techniques, such as placing tasks close to their input data, rerunning lost tasks, and performing straggler mitigation [33].
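To make the LIMIT pushdown mentioned above concrete, the following is a minimal sketch of the idea (our illustration, written against the modern Spark API rather than the 2012-era API Shark was built on; it is not Shark's actual operator): each partition emits at most n rows, and the driver then applies the global limit.

// Sketch only: pushing a LIMIT down to individual partitions.
// Assumes a current Spark build; not Shark's actual operator implementation.
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def limitPushedDown[T: ClassTag](rows: RDD[T], n: Int): Array[T] =
  rows
    .mapPartitions(iter => iter.take(n)) // each partition produces at most n rows
    .take(n)                             // the driver applies the final, global LIMIT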
While this basic approach makes it possible to run SQL over Spark, doing so efficiently is challenging. The prevalence of UDFs and complex analytic functions in Shark's workload makes it difficult to determine an optimal query plan at compile time, especially for new data that has not undergone ETL. In addition, even with such a plan, naïvely executing it over Spark (or other MapReduce runtimes) can be inefficient. In the next section, we discuss several extensions we made to Spark to efficiently store relational data and run SQL, starting with a mechanism that allows for dynamic, statistics-driven re-optimization at run-time.

3 Engine Extensions

In this section, we describe our modifications to the Spark engine to enable efficient execution of SQL queries.

3.1 Partial DAG Execution (PDE)

Systems like Shark and Hive are frequently used to query fresh data that has not undergone a data loading process. This precludes the use of static query optimization techniques that rely on accurate a priori data statistics, such as statistics maintained by indices. The lack of statistics for fresh data, combined with the prevalent use of UDFs, necessitates dynamic approaches to query optimization.

To support dynamic query optimization in a distributed setting, we extended Spark to support partial DAG execution (PDE), a technique that allows dynamic alteration of query plans based on data statistics collected at run-time.

We currently apply partial DAG execution at blocking "shuffle" operator boundaries where data is exchanged and repartitioned, since these are typically the most expensive operations in Shark. By default, Spark materializes the output of each map task in memory before a shuffle, spilling it to disk as necessary. Later, reduce tasks fetch this output.

PDE modifies this mechanism in two ways. First, it gathers customizable statistics at global and per-partition granularities while materializing map output. Second, it allows the DAG to be altered based on these statistics, either by choosing different operators or altering their parameters (such as their degrees of parallelism).

These statistics are customizable using a simple, pluggable accumulator API. Some example statistics include:

1. Partition sizes and record counts, which can be used to detect skew.
2. Lists of "heavy hitters," i.e., items that occur frequently in the dataset.
3. Approximate histograms, which can be used to estimate the distribution of each partition's data.

These statistics are sent by each worker to the master, where they are aggregated and presented to the optimizer. For efficiency, we use lossy compression to record the statistics, limiting their size to 1–2 KB per task. For instance, we encode partition sizes (in bytes) with logarithmic encoding, which can represent sizes of up to 32 GB using only one byte with at most 10% error. The master can then use these statistics to perform various run-time optimizations, as we shall discuss next.
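As an illustration of the compact statistics described above, the following is a minimal sketch (ours, not Shark's code) of a one-byte logarithmic size encoding; the base of 1.1 is our assumption, chosen so that 255 codes cover roughly the stated 32 GB range with bounded relative error.

// Sketch only: one-byte logarithmic encoding of partition sizes.
// Base 1.1 is an assumption; 1.1^255 is roughly 36 GB, and rounding to the
// nearest exponent keeps the relative error within about 10%.
object LogSizeEncoding {
  private val Base = 1.1

  def encode(sizeInBytes: Long): Byte =
    if (sizeInBytes <= 1) 0.toByte
    else math.min(255L, math.round(math.log(sizeInBytes.toDouble) / math.log(Base))).toByte

  def decode(code: Byte): Long =
    math.round(math.pow(Base, (code & 0xFF).toDouble))
}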
Partial DAG execution complements existing adaptive query optimization techniques that typically run in a single-node system [6, 20, 30], as we can use existing techniques to dynamically optimize the local plan within each node, and use PDE to optimize the global structure of the plan at stage boundaries. This fine-grained statistics collection, and the optimizations that it enables, differentiates PDE from graph rewriting features in previous systems, such as DryadLINQ [19].

3.1.1 Join Optimization

Partial DAG execution can be used to perform several run-time optimizations for join queries.

Figure 4: Data flows for map join and shuffle join. Map join broadcasts the small table to all large table partitions, while shuffle join repartitions and shuffles both tables.

Figure 4 illustrates two communication patterns for MapReduce-style joins. In shuffle join, both join tables are hash-partitioned by the join key. Each reducer joins corresponding partitions using a local join algorithm, which is chosen by each reducer based on run-time statistics. If one of a reducer's input partitions is small, then it constructs a hash table over the small partition and probes it using the large partition. If both partitions are large, then a symmetric hash join is performed by constructing hash tables over both inputs.

In map join, also known as broadcast join, a small input table is broadcast to all nodes, where it is joined with each partition of a large table. This approach can result in significant cost savings by avoiding an expensive repartitioning and shuffling phase.

Map join is only worthwhile if some join inputs are small, so Shark uses partial DAG execution to select the join strategy at run-time based on its inputs' exact sizes. By using sizes of the join inputs gathered at run-time, this approach works well even with input tables that have no prior statistics, such as intermediate results.

Run-time statistics also inform the join tasks' scheduling policies. If the optimizer has a prior belief that a particular join input will be small, it will schedule that task before other join inputs and decide to perform a map-join if it observes that the task's output is small. This allows the query engine to avoid performing the pre-shuffle partitioning of a large table once the optimizer has decided to perform a map-join.

3.1.2 Skew-handling and Degree of Parallelism

Partial DAG execution can also be used to determine operators' degrees of parallelism and to mitigate skew.

The degree of parallelism for reduce tasks can have a large performance impact: launching too few reducers may overload reducers' network connections and exhaust their memories, while launching too many may prolong the job due to task scheduling overhead. Hive's performance is especially sensitive to the number of reduce tasks, due to Hadoop's large scheduling overhead.

Using partial DAG execution, Shark can use individual partitions' sizes to determine the number of reducers at run-time by coalescing many small, fine-grained partitions into fewer coarse partitions that are used by reduce tasks. To mitigate skew, fine-grained partitions are assigned to coalesced partitions using a greedy bin-packing heuristic that attempts to equalize coalesced partitions' sizes [15]. This offers performance benefits, especially when good bin-packings exist.

Somewhat surprisingly, we discovered that Shark can obtain similar performance improvement by running a larger number of reduce tasks. We attribute this to Spark's low scheduling overhead.
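The coalescing step described above can be illustrated with a small sketch (ours, not Shark's implementation): map-output partitions are taken largest-first and each is assigned to the currently lightest coalesced partition, which tends to equalize the sizes handled by reduce tasks.

// Sketch only: greedy bin-packing of fine-grained partitions into
// coalesced partitions of roughly equal total size.
import scala.collection.mutable

def coalescePartitions(sizes: IndexedSeq[Long], numCoalesced: Int): Array[List[Int]] = {
  val assigned = Array.fill(numCoalesced)(List.empty[Int])
  // Priority queue ordered so the lightest coalesced partition is dequeued first.
  val lightest =
    mutable.PriorityQueue.empty[(Long, Int)](Ordering.by[(Long, Int), Long](_._1).reverse)
  (0 until numCoalesced).foreach(i => lightest.enqueue((0L, i)))

  // Consider the largest fine-grained partitions first.
  for ((size, partition) <- sizes.zipWithIndex.sortBy { case (s, _) => -s }) {
    val (load, bucket) = lightest.dequeue()
    assigned(bucket) = partition :: assigned(bucket)
    lightest.enqueue((load + size, bucket))
  }
  assigned
}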
3.2 Columnar Memory Store

In-memory computation is essential to low-latency query answering, given that memory's throughput is orders of magnitude higher than that of disks. Naïvely using Spark's memory store, however, can lead to undesirable performance. Shark implements a columnar memory store on top of Spark's memory store.

In-memory data representation affects both space footprint and read throughput. A naïve approach is to simply cache the on-disk data in its native format, performing on-demand deserialization in the query processor. This deserialization becomes a major bottleneck: in our studies, we saw that modern commodity CPUs can deserialize at a rate of only 200 MB per second per core.

The approach taken by Spark's default memory store is to store data partitions as collections of JVM objects. This avoids deserialization, since the query processor can directly use these objects, but leads to significant storage space overheads. Common JVM implementations add 12 to 16 bytes of overhead per object. For example, storing 270 MB of the TPC-H lineitem table as JVM objects uses approximately 971 MB of memory, while a serialized representation requires only 289 MB, nearly three times less space. A more serious implication, however, is the effect on garbage collection (GC). With a 200 B record size, a 32 GB heap can contain 160 million objects. The JVM garbage collection time correlates linearly with the number of objects in the heap, so it could take minutes to perform a full GC on a large heap. These unpredictable, expensive garbage collections cause large variability in workers' response times.

Shark stores all columns of primitive types as JVM primitive arrays. Complex data types supported by Hive, such as map and array, are serialized and concatenated into a single byte array. Each column creates only one JVM object, leading to fast GCs and a compact data representation. The space footprint of columnar data can be further reduced by cheap compression techniques at virtually no CPU cost. Similar to more traditional database systems [27], Shark implements CPU-efficient compression schemes such as dictionary encoding, run-length encoding, and bit packing.

Columnar data representation also leads to better cache behavior, especially for analytical queries that frequently compute aggregations on certain columns.

3.3 Distributed Data Loading

In addition to query execution, Shark also uses Spark's execution engine for distributed data loading. During loading, a table is split into small partitions, each of which is loaded by a Spark task. The loading tasks use the data schema to extract individual fields from rows, marshal a partition of data into its columnar representation, and store those columns in memory.

Each data loading task tracks metadata to decide whether each column in a partition should be compressed. For example, the loading task will compress a column using dictionary encoding if its number of distinct values is below a threshold. This allows each task to choose the best compression scheme for each partition, rather than conforming to a global compression scheme that might not be optimal for local partitions. These local decisions do not require coordination among data loading tasks, allowing the load phase to achieve a maximum degree of parallelism, at the small cost of requiring each partition to maintain its own compression metadata.
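As a concrete, simplified illustration of this per-partition decision, the sketch below keeps an integer column as a single primitive array and switches to dictionary encoding only when the number of distinct values observed in that partition is small. The threshold of 128 is our assumption, not Shark's actual setting, and Shark's real column format is richer than this.

// Sketch only: per-partition choice between a plain primitive column and a
// dictionary-encoded column. The threshold is an assumed value.
sealed trait IntColumn
case class PlainIntColumn(values: Array[Int]) extends IntColumn
case class DictIntColumn(dictionary: Array[Int], codes: Array[Byte]) extends IntColumn

def buildIntColumn(values: Array[Int], dictThreshold: Int = 128): IntColumn = {
  val distinct = values.distinct
  if (distinct.length <= dictThreshold) {
    val index = distinct.zipWithIndex.toMap                   // value -> small integer code
    DictIntColumn(distinct, values.map(v => index(v).toByte)) // one byte per row
  } else {
    PlainIntColumn(values)                                    // one JVM object for the whole column
  }
}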
It is important to clarify that an RDD's lineage does not need to contain the compression scheme and metadata for each partition. The compression scheme and metadata are simply byproducts of the RDD computation, and can be deterministically recomputed along with the in-memory data in the case of failures. As a result, Shark can load data into memory at the aggregated throughput of the CPUs processing incoming data.

Pavlo et al. [25] showed that Hadoop was able to perform data loading at 5 to 10 times the throughput of MPP databases. Tested using the same dataset used in [25], Shark provides the same throughput as Hadoop in loading data into HDFS. Shark is 5 times faster than Hadoop when loading data into its memory store.

3.4 Data Co-partitioning

In some warehouse workloads, two tables are frequently joined together. For example, the TPC-H benchmark frequently joins the lineitem and order tables. A technique commonly used by MPP databases is to co-partition the two tables based on their join key in the data loading process. In distributed file systems like HDFS, the storage system is schema-agnostic, which prevents data co-partitioning. Shark allows co-partitioning two tables on a common key for faster joins in subsequent queries. This can be accomplished with the DISTRIBUTE BY clause:

CREATE TABLE l_mem TBLPROPERTIES ("shark.cache"=true)
AS SELECT * FROM lineitem DISTRIBUTE BY L_ORDERKEY;

CREATE TABLE o_mem TBLPROPERTIES (
  "shark.cache"=true, "copartition"="l_mem")
AS SELECT * FROM order DISTRIBUTE BY O_ORDERKEY;

When joining two co-partitioned tables, Shark's optimizer constructs a DAG that avoids the expensive shuffle and instead uses map tasks to perform the join.

3.5 Partition Statistics and Map Pruning

Data tend to be stored in some logical clustering on one or more columns. For example, entries in a website's traffic log data might be grouped by users' physical locations, because logs are first stored in data centers that have the best geographical proximity to users. Within each data center, logs are append-only and are stored in roughly chronological order. As a less obvious case, a news site's logs might contain news_id and timestamp columns that have strongly correlated values. For analytical queries, it is typical to apply filter predicates or aggregations over such columns. For example, a daily warehouse report might describe how different visitor segments interact with the website; this type of query naturally applies a predicate on timestamps and performs aggregations that are grouped by geographical location. This pattern is even more frequent for interactive data analysis, during which drill-down operations are frequently performed.

Map pruning is the process of pruning data partitions based on their natural clustering columns. Since Shark's memory store splits data into small partitions, each block contains only one or a few logical groups on such columns, and Shark can avoid scanning certain blocks of data if their values fall out of the query's filter range.

To take advantage of these natural clusterings of columns, Shark's memory store on each worker piggybacks the data loading process to collect statistics. The information collected for each partition includes the range of each column and the distinct values if the number of distinct values is small (i.e., enum columns). The collected statistics are sent back to the master program and kept in memory for pruning partitions during query execution.

When a query is issued, Shark evaluates the query's predicates against all partition statistics; partitions that do not satisfy the predicate are pruned and Shark does not launch tasks to scan them.

We collected a sample of queries from the Hive warehouse of a video analytics company, and out of the 3833 queries we obtained, at least 3277 of them contain predicates that Shark can use for map pruning. Section 6 provides more details on this workload.
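To make the pruning step concrete, here is a minimal sketch (our illustration, not Shark's data structures) using only per-partition column ranges: a partition is scanned only if its [min, max] range for the filtered column can intersect the predicate's range.

// Sketch only: pruning partitions with per-partition column ranges.
case class ColumnRange(min: Long, max: Long)
case class PartitionStats(ranges: Map[String, ColumnRange])

// True if a predicate `column BETWEEN lo AND hi` could match rows in this partition.
def couldMatch(stats: PartitionStats, column: String, lo: Long, hi: Long): Boolean =
  stats.ranges.get(column) match {
    case Some(r) => r.max >= lo && r.min <= hi
    case None    => true // no statistics for this column, so we cannot prune safely
  }

// Returns the ids of partitions that still need to be scanned.
def prunePartitions(stats: Seq[(Int, PartitionStats)], column: String, lo: Long, hi: Long): Seq[Int] =
  stats.collect { case (id, s) if couldMatch(s, column, lo, hi) => id }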
4 Machine Learning Support

A key design goal of Shark is to provide a single system capable of efficient SQL query processing and sophisticated machine learning. Following the principle of pushing computation to data, Shark supports machine learning as a first-class citizen. This is enabled by the design decision to choose Spark as the execution engine and RDD as the main data structure for operators. In this section, we explain Shark's language and execution engine integration for SQL and machine learning.

Other research projects [12, 14] have demonstrated that it is possible to express certain machine learning algorithms in SQL and avoid moving data out of the database. The implementation of those projects, however, involves a combination of SQL, UDFs, and driver programs written in other languages. The systems become obscure and difficult to maintain; in addition, they may sacrifice performance by performing expensive parallel numerical computations on traditional database engines that were not designed for such workloads. Contrast this with the approach taken by Shark, which offers in-database analytics that push computation to data, but does so using a runtime that is optimized for such workloads and a programming model that is designed to express machine learning algorithms.

4.1 Language Integration

In addition to executing a SQL query and returning its results, Shark also allows queries to return the RDD representing the query plan. Callers to Shark can then invoke distributed computation over the query result using the returned RDD.

As an example of this integration, Listing 1 illustrates a data analysis pipeline that performs logistic regression over a user database. Logistic regression, a common classification algorithm, searches for a hyperplane w that best separates two sets of points (e.g., spammers and non-spammers). The algorithm applies gradient descent optimization by starting with a randomized w vector and iteratively updating it by moving along gradients towards an optimum value.

def logRegress(points: RDD[Point]): Vector = {
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")

val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")),
             ...)
}

val trainedVector = logRegress(features.cache())

Listing 1: Logistic Regression Example

The program begins by using sql2rdd to issue a SQL query to retrieve user information as a TableRDD. It then performs feature extraction on the query rows and runs logistic regression over the extracted feature matrix. Each iteration of logRegress applies a function of w to all data points to produce a set of gradients, which are summed to produce a net gradient that is used to update w.

The highlighted map, mapRows, and reduce functions are automatically parallelized by Shark to execute across a cluster, and the master program simply collects the output of the reduce function to update w.

Note that this distributed logistic regression implementation in Shark looks remarkably similar to a program implemented for a single node in the Scala language. The user can conveniently mix the best parts of both SQL and MapReduce-style programming.

Currently, Shark provides native support for Scala and Java, with support for Python in development. We have modified the Scala shell to enable interactive execution of both SQL and distributed machine learning algorithms. Because Shark is built on top of the JVM, it is trivial to support other JVM languages, such as Clojure or JRuby.

We have implemented a number of basic machine learning algorithms, including linear regression, logistic regression, and k-means clustering. In most cases, the user only needs to supply a mapRows function to perform feature extraction and can invoke the provided algorithms.

The above example demonstrates how machine learning computations can be performed on query results. Using RDD as the main data structure for query operators also enables the possibility of using SQL to query the results of machine learning computations in a single execution plan.

4.2 Execution Engine Integration

In addition to language integration, another key benefit of using RDDs as the data structure for operators is the execution engine integration. This common abstraction allows machine learning computations and SQL queries to share workers and cached data without the overhead of data movement.

Because SQL query processing is implemented using RDDs, lineage is kept for the whole pipeline, which enables end-to-end fault tolerance for the entire workflow. If failures occur during the machine learning stage, partitions on faulty nodes will automatically be recomputed based on their lineage.

5 Implementation

While implementing Shark, we discovered that a number of engineering details had significant performance impacts. Overall, to improve the query processing speed, one should minimize the tail latency of tasks and the CPU cost of processing each row.

Memory-based Shuffle: Both Spark and Hadoop write map output files to disk, hoping that they will remain in the OS buffer cache when reduce tasks fetch them. In practice, we have found that the extra system calls and file system journaling add significant overhead. In addition, the inability to control when buffer caches are flushed leads to variability in the execution time of shuffle tasks. A query's response time is determined by the last task to finish, and thus the increasing variability leads to long-tail latency, which significantly hurts shuffle performance. We modified the shuffle phase to materialize map outputs in memory, with the option to spill them to disk.

Temporary Object Creation: It is easy to write a program that creates many temporary objects, which can burden the JVM's garbage collector. For a parallel job, a slow GC at one task may slow the entire job. Shark operators and RDD transformations are written in a way that minimizes temporary object creations.

Bytecode Compilation of Expression Evaluators: In its current implementation, Shark sends the expression evaluators generated by the Hive parser as part of the tasks to be executed on each row. By profiling Shark, we discovered that for certain queries, when data is served out of the memory store the majority of the CPU cycles are wasted in interpreting these evaluators. We are working on a compiler to transform these expression evaluators into JVM bytecode, which can further increase the execution engine's throughput.

Specialized Data Structures: Using specialized data structures is another low-hanging optimization that we have yet to exploit. For example, Java's hash table is built for generic objects. When the hash key is a primitive type, the use of specialized data structures can lead to more compact data representations, and thus better cache behavior.
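To illustrate the kind of structure this refers to, the following is a minimal sketch (ours, not code from Shark) of an open-addressing hash map keyed by primitive longs and backed by flat arrays, avoiding the per-entry objects and boxing of a generic java.util.HashMap.

// Sketch only: a specialized long-to-long hash map using open addressing
// over primitive arrays. Capacity stays a power of two with load factor <= 0.5.
class LongToLongMap(initialCapacity: Int = 64) {
  private var capacity = Integer.highestOneBit(math.max(initialCapacity, 16)) * 2
  private var keys = new Array[Long](capacity)
  private var values = new Array[Long](capacity)
  private var used = new Array[Boolean](capacity)
  private var count = 0

  private def slot(key: Long): Int = {
    var i = key.hashCode & (capacity - 1)
    while (used(i) && keys(i) != key) i = (i + 1) & (capacity - 1) // linear probing
    i
  }

  def update(key: Long, value: Long): Unit = {
    if (count * 2 >= capacity) grow()
    val i = slot(key)
    if (!used(i)) { used(i) = true; keys(i) = key; count += 1 }
    values(i) = value
  }

  def getOrElse(key: Long, default: Long): Long = {
    val i = slot(key)
    if (used(i)) values(i) else default
  }

  private def grow(): Unit = {
    val (oldKeys, oldValues, oldUsed) = (keys, values, used)
    capacity *= 2
    keys = new Array[Long](capacity)
    values = new Array[Long](capacity)
    used = new Array[Boolean](capacity)
    count = 0
    var i = 0
    while (i < oldKeys.length) {
      if (oldUsed(i)) update(oldKeys(i), oldValues(i))
      i += 1
    }
  }
}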
6 Experiments

We evaluated Shark using four datasets:

1. Pavlo et al. Benchmark: 2.1 TB of data reproducing Pavlo et al.'s comparison of MapReduce vs. analytical DBMSs [25].
2. TPC-H Dataset: 100 GB and 1 TB datasets generated by the DBGEN program [29].
3. Real Hive Warehouse: 1.7 TB of sampled Hive warehouse data from an early industrial user of Shark.
4. Machine Learning Dataset: 100 GB synthetic dataset to measure the performance of machine learning algorithms.

Overall, our results show that Shark can perform up to 100× faster than Hive, even though we have yet to implement some of the performance optimizations mentioned in the previous section. In particular, Shark provides comparable performance gains to those reported for MPP databases in Pavlo et al.'s comparison [25]. In some cases where data fits in memory, Shark exceeds the performance reported for MPP databases.

We emphasize that we are not claiming that Shark is fundamentally faster than MPP databases; there is no reason why MPP engines could not implement the same processing optimizations as Shark. Indeed, our implementation has several disadvantages relative to commercial engines, such as running on the JVM. Instead, we aim to show that it is possible to achieve comparable performance while retaining a MapReduce-like engine, and the fine-grained fault recovery features that such engines provide. In addition, Shark can leverage this engine to perform high-speed machine learning functions on the same data, which we believe will be essential in future analytics workloads.

6.1 Methodology and Cluster Setup

Unless otherwise specified, experiments were conducted on Amazon EC2 using 100 m2.4xlarge nodes. Each node had 8 virtual cores, 68 GB of memory, and 1.6 TB of local storage.

The cluster was running 64-bit Linux 3.2.28, Apache Hadoop 0.20.205, and Apache Hive 0.9. For Hadoop MapReduce, the number of map tasks and the number of reduce tasks per node were set to 8, matching the number of cores. For Hive, we enabled JVM reuse between tasks and avoided merging small output files, which would take an extra step after each query to perform the merge.

We executed each query six times, discarded the first run, and report the average of the remaining five runs. We discard the first run in order to allow the JVM's just-in-time compiler to optimize common code paths. We believe that this more closely mirrors real-world deployments where the JVM will be reused by many queries.

6.2 Pavlo et al. Benchmarks

Pavlo et al. compared Hadoop versus MPP databases and showed that Hadoop excelled at data ingress, but performed unfavorably in query execution [25]. We reused the dataset and queries from their benchmarks to compare Shark against Hive.
Figure 5: Selection and aggregation query runtimes (seconds) from Pavlo et al. benchmark

The benchmark used two tables: a 1 GB/node rankings table, and a 20 GB/node uservisits table. For our 100-node cluster, we recreated a 100 GB rankings table containing 1.8 billion rows and a 2 TB uservisits table containing 15.5 billion rows. We ran the four queries in their experiments comparing Shark with Hive and report the results in Figures 5 and 6. In this subsection, we hand-tuned Hive's number of reduce tasks to produce optimal results for Hive. Despite this tuning, Shark outperformed Hive in all cases by a wide margin.

6.2.1 Selection Query

The first query was a simple selection on the rankings table:

SELECT pageURL, pageRank
FROM rankings WHERE pageRank > X;

In [25], Vertica outperformed Hadoop by a factor of 10 because a clustered index was created for Vertica. Even without a clustered index, Shark was able to execute this query 80× faster than Hive for in-memory data, and 5× on data read from HDFS.

6.2.2 Aggregation Queries

The Pavlo et al. benchmark ran two aggregation queries:

SELECT sourceIP, SUM(adRevenue)
FROM uservisits GROUP BY sourceIP;

SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 7);

In our dataset, the first query had two million groups and the second had approximately one thousand groups. Shark and Hive both applied task-local aggregations and shuffled the data to parallelize the final merge aggregation. Again, Shark outperformed Hive by a wide margin. The benchmarked MPP databases perform local aggregations on each node, and then send all aggregates to a single query coordinator for the final merging; this performed very well when the number of groups was small, but performed worse with large numbers of groups. The MPP databases' chosen plan is similar to choosing a single reduce task for Shark and Hive.

6.2.3 Join Query

The final query from Pavlo et al. involved joining the 2 TB uservisits table with the 100 GB rankings table.

SELECT INTO Temp sourceIP, AVG(pageRank), SUM(adRevenue) AS totalRevenue
FROM rankings AS R, uservisits AS UV
WHERE R.pageURL = UV.destURL
  AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
GROUP BY UV.sourceIP;

Figure 6: Join query runtime (seconds) from Pavlo benchmark

Again, Shark outperformed Hive in all cases. Figure 6 shows that for this query, serving data out of memory did not provide much benefit over disk. This is because the cost of the join step dominated the query processing. Co-partitioning the two tables, however, provided significant benefits as it avoided shuffling 2.1 TB of data during the join step.

6.2.4 Data Loading

Hadoop was shown by [25] to excel at data loading, as its data loading throughput was five to ten times higher than that of MPP databases. As explained in Section 2, Shark can be used to query data in HDFS directly, which means its data ingress rate is at least as fast as Hadoop's.

After generating the 2 TB uservisits table, we measured the time to load it into HDFS and compared that with the time to load it into Shark's memory store. We found the rate of data ingress was 5× higher in Shark's memory store than that of HDFS.

6.3 Micro-Benchmarks

To understand the factors affecting Shark's performance, we conducted a sequence of micro-benchmarks. We generated 100 GB and 1 TB of data using the DBGEN program provided by TPC-H [29]. We chose this dataset because it contains tables and columns of varying cardinality and can be used to create a myriad of micro-benchmarks for testing individual operators.

While performing experiments, we found that Hive and Hadoop MapReduce were very sensitive to the number of reducers set for a job. Hive's optimizer automatically sets the number of reducers based on the estimated data size. However, we found that Hive's optimizer frequently made the wrong decision, leading to incredibly long query execution times. We hand-tuned the number of reducers for Hive based on characteristics of the queries and through trial and error. We report Hive performance numbers for both optimizer-determined and hand-tuned numbers of reducers. Shark, on the other hand, was much less sensitive to the number of reducers and required minimal tuning.

6.3.1 Aggregation Performance

We tested the performance of aggregations by running group-by queries on the TPC-H lineitem table. For the 100 GB dataset, the lineitem table contained 600 million rows. For the 1 TB dataset, it contained 6 billion rows.

The queries were of the form:

SELECT [GROUP_BY_COLUMN], COUNT(*) FROM lineitem
GROUP BY [GROUP_BY_COLUMN]

We chose to run one query with no group-by column (i.e., a simple count), and three queries with group-by aggregations: SHIPMODE (7 groups), RECEIPTDATE (2500 groups), and SHIPMODE (150 million groups in 100 GB, and 537 million groups in 1 TB).
For both Shark and Hive, aggregations were first performed on each partition, and then the intermediate aggregated results were partitioned and sent to reduce tasks to produce the final aggregation. As the number of groups becomes larger, more data needs to be shuffled across the network.

Figure 7: Aggregation queries on lineitem table. X-axis indicates the number of groups for each aggregation query.

Figure 7 compares the performance of Shark and Hive, measuring Shark's performance on both in-memory data and data loaded from HDFS. As can be seen in the figure, Shark was 80× faster than hand-tuned Hive for queries with small numbers of groups, and 20× faster for queries with large numbers of groups, where the shuffle phase dominated the total execution cost.

We were somewhat surprised by the performance gain observed for on-disk data in Shark. After all, both Shark and Hive had to read data from HDFS and deserialize it for query processing. This difference, however, can be explained by Shark's very low task launching overhead, optimized shuffle operator, and other factors; see Section 7 for more details.

6.3.2 Join Selection at Run-time

In this experiment, we tested how partial DAG execution can improve query performance through run-time re-optimization of query plans. The query joined the lineitem and supplier tables from the 1 TB TPC-H dataset, using a UDF to select suppliers of interest based on their addresses. In this specific instance, the UDF selected 1000 out of 10 million suppliers. Figure 8 summarizes these results.

SELECT * FROM lineitem l JOIN supplier s
ON l.L_SUPPKEY = s.S_SUPPKEY
WHERE SOME_UDF(s.S_ADDRESS)

Figure 8: Join strategies chosen by optimizers (seconds)

Lacking good selectivity estimation on the UDF, a static optimizer would choose to perform a shuffle join on these two tables because the initial sizes of both tables are large. Leveraging partial DAG execution, after running the pre-shuffle map stages for both tables, Shark's dynamic optimizer realized that the filtered supplier table was small. It decided to perform a map-join, replicating the filtered supplier table to all nodes and performing the join using only map tasks on lineitem.

To further improve the execution, the optimizer can analyze the logical plan and infer that the probability of the supplier table being small is much higher than that of lineitem (since supplier is smaller initially, and there is a filter predicate on supplier). The optimizer chose to pre-shuffle only the supplier table, and avoided launching two waves of tasks on lineitem. This combination of static query analysis and partial DAG execution led to a 3× performance improvement over a naïve, statically chosen plan.

6.3.3 Fault Tolerance

To measure Shark's performance in the presence of node failures, we simulated failures and measured query performance before, during, and after failure recovery. Figure 9 summarizes five runs of our failure recovery experiment, which was performed on a 50-node m2.4xlarge EC2 cluster.

We used a group-by query on the 100 GB lineitem table to measure query performance in the presence of faults.
Figure 9: Query time with failures (seconds)

After loading the lineitem data into Shark's memory store, we killed a worker machine and re-ran the query. Shark gracefully recovered from this failure and parallelized the reconstruction of lost partitions on the other 49 nodes. This recovery had a small performance impact (3 seconds), but it was significantly cheaper than the cost of re-loading the entire dataset and re-executing the query.

After this recovery, subsequent queries operated against the recovered dataset, albeit with fewer machines. In Figure 9, the post-recovery performance was marginally better than the pre-failure performance; we believe that this was a side-effect of the JVM's JIT compiler, as more of the scheduler's code might have become compiled by the time the post-recovery queries were run.

6.4 Real Hive Warehouse Queries

An early industrial user provided us with a sample of their Hive warehouse data and two years of query traces from their Hive system. A leading video analytics company for content providers and publishers, the user built most of their analytics stack based on Hadoop. The sample we obtained contained 30 days of video session data, occupying 1.7 TB of disk space when decompressed. It consists of a single fact table containing 103 columns, with heavy use of complex data types such as array and struct. The sampled query log contains 3833 analytical queries, sorted in order of frequency. We filtered out queries that invoked proprietary UDFs and picked four frequent queries that are prototypical of other queries in the complete trace. These queries compute aggregate video quality metrics over different audience segments:

1. Query 1 computes summary statistics in 12 dimensions for users of a specific customer on a specific day.
2. Query 2 counts the number of sessions and distinct customer/client combinations grouped by countries with filter predicates on eight columns.
3. Query 3 counts the number of sessions and distinct users for all but 2 countries.
4. Query 4 computes summary statistics in 7 dimensions grouping by a column, and showing the top groups sorted in descending order.

Figure 10: Real Hive warehouse workloads

Figure 10 compares the performance of Shark and Hive on these queries. The result is very promising as Shark was able to process these real-life queries in sub-second latency in all but one case, whereas it took Hive 50 to 100 times longer to execute them. A closer look into these queries suggests that this data exhibits the natural clustering properties mentioned in Section 3.5. The map pruning technique, on average, reduced the amount of data scanned by a factor of 30.

6.5 Machine Learning

A key motivator of using SQL in a MapReduce environment is the ability to perform sophisticated machine learning on big data. We implemented two iterative machine learning algorithms, logistic regression and k-means, to compare the performance of Shark versus running the same workflow in Hive and Hadoop.
The dataset was synthetically generated and contained 1 billion rows and 10 columns, occupying 100 GB of space. Thus, the feature matrix contained 1 billion points, each with 10 dimensions. These machine learning experiments were performed on a 100-node m1.xlarge EC2 cluster.

Data was initially stored in relational form in Shark's memory store and HDFS. The workflow consisted of three steps: (1) selecting the data of interest from the warehouse using SQL, (2) extracting features, and (3) applying iterative algorithms. In step 3, both algorithms were run for 10 iterations.

Figure 11: Logistic regression, per-iteration runtime (seconds)
Figure 12: K-means clustering, per-iteration runtime (seconds)

Figures 11 and 12 show the time to execute a single iteration of logistic regression and k-means, respectively. We implemented two versions of the algorithms for Hadoop, one storing input data as text in HDFS and the other using a serialized binary format. The binary representation was more compact and had lower CPU cost in record deserialization, leading to improved performance. Our results show that Shark is 100× faster than Hive and Hadoop for logistic regression and 30× faster for k-means. K-means experienced less speedup because it was computationally more expensive than logistic regression, thus making the workflow more CPU-bound.

In the case of Shark, if data initially resided in its memory store, steps 1 and 2 were executed in roughly the same time it took to run one iteration of the machine learning algorithm. If data was not loaded into the memory store, the first iteration took 40 seconds for both algorithms. Subsequent iterations, however, reported numbers consistent with Figures 11 and 12. In the case of Hive and Hadoop, every iteration took the reported time because data was loaded from HDFS for every iteration.

7 Discussion

Shark shows that it is possible to run fast relational queries in a fault-tolerant manner using the fine-grained deterministic task model introduced by MapReduce. This design offers an effective way to scale query processing to ever-larger workloads, and to combine it with rich analytics. In this section, we consider two questions: first, why were previous MapReduce-based systems, such as Hive, slow, and what gave Shark its advantages? Second, are there other benefits to the fine-grained task model? We argue that fine-grained tasks also help with multitenancy and elasticity, as has been demonstrated in MapReduce systems.

7.1 Why are Previous MapReduce-Based Systems Slow?

Conventional wisdom is that MapReduce is slower than MPP databases for several reasons: expensive data materialization for fault tolerance, inferior data layout (e.g., lack of indices), and costlier execution strategies [25, 26]. Our exploration of Hive confirms these reasons, but also shows that a combination of conceptually simple "engineering" changes to the engine (e.g., in-memory storage) and more involved architectural changes (e.g., partial DAG execution) can alleviate them. We also find that a somewhat surprising variable not considered in detail in MapReduce systems, the task scheduling overhead, actually has a dramatic effect on performance, and greatly improves load balancing if minimized.

Intermediate Outputs: MapReduce-based query engines, such as Hive, materialize intermediate data to disk in two situations. First, within a MapReduce job, the map tasks save their output in case a reduce task fails [13]. Second, many queries need to be compiled into multiple MapReduce steps, and engines rely on replicated file systems, such as HDFS, to store the output of each step.

For the first case, we note that map outputs were stored on disk primarily as a convenience to ensure there is sufficient space to hold them in large batch jobs. Map outputs are not replicated across nodes, so they will still be lost if the mapper node fails [13]. Thus, if the outputs fit in memory, it makes sense to store them in memory initially, and only spill them to disk if they are large. Shark's shuffle implementation does this by default, and sees far faster shuffle performance (and no seeks) when the outputs fit in RAM. This is often the case in aggregations and filtering queries that return a much smaller output than their input.
(Systems like Hadoop also benefit from the OS buffer cache in serving map outputs, but we found that the extra system calls and file system journaling from writing map outputs to files still add overhead; see Section 5.) Another hardware trend that may improve performance, even for large shuffles, is SSDs, which would allow fast random access to a larger space than memory.

For the second case, engines that extend the MapReduce execution model to general task DAGs can run multi-stage jobs without materializing any outputs to HDFS. Many such engines have been proposed, including Dryad, Tenzing and Spark [17, 9, 33].

Data Format and Layout: While the naïve pure schema-on-read approach to MapReduce incurs considerable processing costs, many systems use more efficient storage formats within the MapReduce model to speed up queries. Hive itself supports "table partitions" (a basic index-like system where it knows that certain key ranges are contained in certain files, so it can avoid scanning a whole table), as well as column-oriented representation of on-disk data [28]. We go further in Shark by using fast in-memory columnar representations within Spark. Shark does this without modifying the Spark runtime by simply representing a block of tuples as a single Spark record (one Java object from Spark's perspective), and choosing its own representation for the tuples within this object.

Another feature of Spark that helps Shark, but was not present in previous MapReduce runtimes, is control over the data partitioning across nodes (Section 3.4). This lets us co-partition tables.

Finally, one capability of RDDs that we do not yet exploit is random reads. While RDDs only support coarse-grained operations for their writes, read operations on them can be fine-grained, accessing just one record [33]. This would allow RDDs to be used as indices. Tenzing can use such remote-lookup reads for joins [9].

Execution Strategies: Hive spends considerable time on sorting the data before each shuffle and writing the outputs of each MapReduce stage to HDFS, both limitations of the rigid, one-pass MapReduce model in Hadoop. More general runtime engines, such as Spark, alleviate some of these problems. For instance, Spark supports hash-based distributed aggregation and general task DAGs.
To truly optimize the execution of relational queries, however, we found it necessary to select execution plans based on data statistics. This becomes difficult in the presence of UDFs and complex analytics functions, which we seek to support as first-class citizens in Shark. To address this problem, we proposed partial DAG execution (PDE), which allows our modified version of Spark to change the downstream portion of an execution graph once each stage completes, based on data statistics. PDE goes beyond the runtime graph rewriting features in previous systems, such as DryadLINQ [19], by collecting fine-grained statistics about ranges of keys and by allowing switches to a completely different join strategy, such as broadcast join, instead of just selecting the number of reduce tasks.

Task Scheduling Cost: Perhaps the most surprising engine property that affected Shark, however, was a purely "engineering" concern: the overhead of launching tasks. Traditional MapReduce systems, such as Hadoop, were designed for multi-hour batch jobs consisting of tasks that were several minutes long. They launched each task in a separate OS process, and in some cases had a high latency to even submit a task. For instance, Hadoop uses periodic "heartbeats" from each worker every 3 seconds to assign tasks, and sees overall task startup delays of 5–10 seconds. This was sufficient for batch workloads, but clearly falls short for ad-hoc queries.

Spark avoids this problem by using a fast event-driven RPC library to launch tasks and by reusing its worker processes. It can launch thousands of tasks per second with only about 5 ms of overhead per task, making task lengths of 50–100 ms and MapReduce jobs of 500 ms viable. What surprised us is how much this affected query performance, even in large (multi-minute) queries. Sub-second tasks allow the engine to balance work across nodes extremely well, even when some nodes incur unpredictable delays (e.g., network delays or JVM garbage collection). They also help dramatically with skew. Consider, for example, a system that needs to run a hash aggregation on 100 cores. If the system launches 100 reduce tasks, the key range for each task needs to be carefully chosen, as any imbalance will slow down the entire job. If it could split the work among 1000 tasks, then the slowest task can be as much as 10× slower than the average without affecting the job response time much!

After implementing skew-aware partition selection in PDE, we were somewhat disappointed that it did not help compared to just having a higher number of reduce tasks in most workloads, because Spark could comfortably support thousands of such tasks. However, this property makes the engine highly robust to unexpected skew.

In this way, Spark stands in contrast to Hadoop/Hive, where using the wrong number of tasks was sometimes 10× slower than an optimal plan, and there has been considerable work to automatically choose the number of reduce tasks [21, 15]. Figure 13 shows how job execution times vary with the number of reduce tasks launched by Hadoop and Spark. Since a Spark job can launch thousands of reduce tasks without incurring much overhead, partition data skew can be mitigated by always launching many tasks.
Figure 13: Task launching overhead

More fundamentally, there are few reasons why sub-second tasks should not be feasible even at higher scales than we have explored, such as tens of thousands of nodes. Systems like Dremel [24] routinely run sub-second, multi-thousand-node jobs. Indeed, even if a single master cannot keep up with the scheduling decisions, the scheduling could be delegated across "lieutenant" masters for subsets of the cluster. Fine-grained tasks also offer many advantages over coarser-grained execution graphs beyond load balancing, such as faster recovery (by spreading out lost tasks across more nodes) and query elasticity; we discuss some of these next.

7.2 Other Benefits of the Fine-Grained Task Model

While this paper has focused primarily on the fault tolerance benefits of fine-grained deterministic tasks, the model also provides other attractive properties. We wish to point out two benefits that have been explored in MapReduce-based systems.

Elasticity: In traditional MPP databases, a distributed query plan is selected once, and the system needs to run at that level of parallelism for the whole duration of the query. In a fine-grained task system, however, nodes can appear or go away during a query, and pending work will automatically be spread onto them. This enables the database engine to naturally be elastic. If an administrator wishes to remove nodes from the engine (e.g., in a virtualized corporate data center), the engine can simply treat those as failed, or (better yet) proactively replicate their data to other nodes if given a few minutes' warning. Similarly, a database engine running on a cloud could scale up by requesting new VMs if a query is expensive. Amazon's Elastic MapReduce [2] already supports resizing clusters at runtime.

Multitenancy: The same elasticity, mentioned above, enables dynamic resource sharing between users. In a traditional MPP database, if an important query arrives while another large query is using most of the cluster, there are few options beyond canceling the earlier query. In systems based on fine-grained tasks, one can simply wait a few seconds for the current tasks from the first query to finish, and start giving the nodes tasks from the second query. For instance, Facebook and Microsoft have developed fair schedulers for Hadoop and Dryad that allow large historical queries, compute-intensive machine learning jobs, and short ad-hoc queries to safely coexist [32, 18].

8 Related Work

To the best of our knowledge, Shark is the only low-latency system that can efficiently combine SQL and machine learning workloads, while supporting fine-grained fault recovery.

We categorize large-scale data analytics systems into three classes. First, systems like Hive [28], Tenzing [9], SCOPE [8], and Cheetah [10] compile declarative queries into MapReduce-style jobs. Even though some of them introduce modifications to the execution engine they are built on, it is hard for these systems to achieve interactive query response times for reasons discussed in Section 7.

Second, several projects aim to provide low-latency engines using architectures resembling shared-nothing parallel databases. Such projects include PowerDrill [16] and Impala [1]. These systems do not support fine-grained fault tolerance. In case of mid-query faults, the entire query needs to be re-executed. Google's Dremel [24] does rerun lost tasks, but it only supports an aggregation tree topology for query execution, and not the more complex shuffle DAGs required for large joins or distributed machine learning.

A third class of systems takes a hybrid approach by combining a MapReduce-like engine with relational databases. HadoopDB [3] connects multiple single-node database systems using Hadoop as the communication layer.
Queries can be parallelized using Hadoop MapReduce, but within each MapReduce task, data processing is pushed into the relational database system. Osprey [31] is a middleware layer that adds fault-tolerance properties to parallel databases. It does so by breaking a SQL query into multiple small queries and sending them to parallel databases for execution. Shark presents a much simpler single-system architecture that supports all of the properties of this third class of systems, as well as statistical learning capabilities that HadoopDB and Osprey lack.

The partial DAG execution (PDE) technique introduced by Shark resembles adaptive query optimization techniques proposed in [6, 30, 20]. It is, however, unclear how these single-node techniques would work in a distributed setting and scale out to hundreds of nodes. In fact, PDE actually complements some of these techniques, as Shark can use PDE to optimize how data gets shuffled across nodes, and use the traditional single-node techniques within a local task. DryadLINQ [19] optimizes its number of reduce tasks at run-time based on map output sizes, but does not collect richer statistics, such as histograms, or make broader execution plan changes, such as changing join algorithms, like PDE can. RoPE [4] proposes using historical query information to optimize query plans, but relies on repeatedly executed queries. PDE works on queries that are executing for the first time.

Finally, Shark builds on the distributed approaches for machine learning developed in systems like GraphLab [22], HaLoop [7], and Spark [33]. However, Shark is unique in offering these capabilities in a SQL engine, allowing users to select data of interest using SQL and immediately run learning algorithms on it without time-consuming export to another system. Compared to Spark, Shark also provides far more efficient in-memory representation of relational data, and mid-query optimization using PDE.

9 Conclusion

We have presented Shark, a new data warehouse system that combines fast relational queries and complex analytics in a single, fault-tolerant runtime. Shark generalizes a MapReduce-like runtime to run SQL effectively, using both traditional database techniques, such as column-oriented storage, and a novel partial DAG execution (PDE) technique that lets it reoptimize queries at run-time based on fine-grained data statistics. This design enables Shark to generally match the speedups reported for MPP databases over MapReduce, while simultaneously providing machine learning functions in the same engine and fine-grained, mid-query fault tolerance across both SQL and machine learning. Overall, the system is up to 100× faster than Hive for SQL, and 100× faster than Hadoop for machine learning.

We have open sourced Shark at shark.cs.berkeley.edu, and have also worked with two Internet companies as early users. They report speedups of 40–100× on real queries, consistent with our results.

10 Acknowledgments

We thank Cliff Engle, Harvey Feng, Shivaram Venkataraman, Ram Sriharsha, Denny Britz, Antonio Lupher, Patrick Wendell, and Paul Ruan for their work on Shark. This research is supported in part by NSF CISE Expeditions award CCF-1139158, gifts from Amazon Web Services, Google, SAP, Blue Goji, Cisco, Cloudera, Ericsson, General Electric, Hewlett Packard, Huawei, Intel, Microsoft, NetApp, Oracle, Quanta, Splunk, VMware and by DARPA (contract #FA8650-11-C-7136).

11 References

[1] https://github.com/cloudera/impala.
[2] http://aws.amazon.com/about-aws/whats-new/2010/10/20/amazon-elastic-mapreduce-introduces-resizing-running-job-flows/.
[3] A. Abouzeid et al. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. VLDB, 2009.
[4] S. Agarwal et al. Re-optimizing data-parallel computing. In NSDI '12.
[5] G. Ananthanarayanan et al. PACMan: Coordinated memory caching for parallel jobs. In NSDI, 2012.
[6] R. Avnur and J. M. Hellerstein. Eddies: continuously adaptive query processing. In SIGMOD, 2000.
[7] Y. Bu et al. HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow., 2010.
[8] R. Chaiken et al. SCOPE: easy and efficient parallel processing of massive data sets. VLDB, 2008.
[9] B. Chattopadhyay et al. Tenzing: a SQL implementation on the MapReduce framework. PVLDB, 4(12):1318–1327, 2011.
[10] S. Chen. Cheetah: a high performance, custom data warehouse on top of MapReduce. VLDB, 2010.
[11] C. Chu et al. Map-reduce for machine learning on multicore. Advances in Neural Information Processing Systems, 19:281, 2007.
[12] J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, and C. Welton. MAD skills: new analysis practices for big data. VLDB, 2009.
[13] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[14] X. Feng et al. Towards a unified architecture for in-RDBMS analytics. In SIGMOD, 2012.
[15] B. Gufler et al. Handling data skew in MapReduce. In CLOSER, 2011.
[16] A. Hall et al. Processing a trillion cells per mouse click. VLDB.
[17] M. Isard et al. Dryad: distributed data-parallel programs from sequential building blocks. SIGOPS, 2007.
[18] M. Isard et al. Quincy: Fair scheduling for distributed computing clusters. In SOSP '09, 2009.
[19] M. Isard and Y. Yu. Distributed data-parallel computing using a high-level programming language. In SIGMOD, 2009.
[20] N. Kabra and D. J. DeWitt. Efficient mid-query re-optimization of sub-optimal query execution plans. In SIGMOD, 1998.
[21] Y. Kwon et al. SkewTune: mitigating skew in MapReduce applications. In SIGMOD '12, 2012.
[22] Y. Low et al. Distributed GraphLab: a framework for machine learning and data mining in the cloud. VLDB, 2012.
[23] G. Malewicz et al. Pregel: a system for large-scale graph processing. In SIGMOD, 2010.
[24] S. Melnik et al. Dremel: interactive analysis of web-scale datasets. Proc. VLDB Endow., 3:330–339, Sept. 2010.
[25] A. Pavlo et al. A comparison of approaches to large-scale data analysis. In SIGMOD, 2009.
[26] M. Stonebraker et al. MapReduce and parallel DBMSs: friends or foes? Commun. ACM.
[27] M. Stonebraker et al. C-Store: a column-oriented DBMS. In VLDB, 2005.
[28] A. Thusoo et al. Hive - a petabyte scale data warehouse using Hadoop. In ICDE, 2010.
[29] Transaction Processing Performance Council. TPC BENCHMARK H.
[30] T. Urhan, M. J. Franklin, and L. Amsaleg. Cost-based query scrambling for initial delays. In SIGMOD, 1998.
[31] C. Yang et al. Osprey: Implementing MapReduce-style fault tolerance in a shared-nothing distributed database. In ICDE, 2010.
[32] M. Zaharia et al. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys '10, 2010.
[33] M. Zaharia et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. NSDI, 2012.