Llama: Leveraging Columnar Storage for Scalable Join Processing in the MapReduce Framework

Yuting Lin (National University of Singapore), Divyakant Agrawal (University of California, Santa Barbara), Chun Chen (Zhejiang University, China), Beng Chin Ooi (National University of Singapore), Sai Wu (National University of Singapore)

ABSTRACT

[…]

…reducers to produce the final results. Figure 5(B) shows examples of three-way and four-way join queries in the chain pattern. The dashed vertical line indicates how we split the chain. To process a three-way join, at least two types of mappers are needed: one to retrieve one table's attributes from its basic group, and the other to perform a map-side join for the remaining two tables. All the intermediate results from the mappers are partitioned by the join key. To perform a four-way chain query, two types of mappers are created. They respectively join a PF group with the basic group of the adjacent table on each side of the split. All the intermediate results are shuffled via the join key and are finally joined in the reduce phase. In this approach, there may be missing attributes of some tables, because we process them via their PF groups for the map-merge join. To get the missing attributes of one such table, we could use one more kind of mapper to retrieve them from its basic group, or late-materialize them in the reduce phase. For the missing attributes of the other, only random access is possible. The reason is that the intermediate results from the mappers to the reducers are partitioned by the join key; since the shuffled results do not contain these attributes, the partitioner is unable to route the corresponding tuples individually to the proper reducer.

The primary difference between the join strategies of the two patterns is the following: in the star pattern, all joins between the fact table and the dimension tables are performed in the map phase; their intermediate results are shuffled via the primary key of the fact table, and the reducers essentially perform a merge-like operation. In the chain pattern, the intermediate results from the mappers are shuffled via the join key instead of the primary key of the fact table.

5.3 Hybrid Pattern

Given a complex query, Llama translates it into a set of sub-queries. Each sub-query is composed of a set of joins that can be processed by a single MapReduce job. Basically, sub-queries of the star pattern and the chain pattern are considered. Algorithm 1 illustrates the plan generation for complex queries in Llama.

Algorithm 1 generatePlan(QueryGraph G)
 1: nodes = getAllNodes(G)
 2: cost = MaximumCost
 3: plan = null
 4: if nodes.size() == 1 then return Materialization(G)
 5: for each table T in nodes do
 6:   tmpPlan = null
 7:   if T can be buffered in memory then
 8:     // fragment-replication join
 9:     tmpPlan = FragmentReplicationJoin(T, generatePlan(G - T))
10:   else if T is a fact table in G then
11:     // process in star pattern
12:     list = new List()
13:     for each connected subGraph Gi of G - T do
14:       list.add(generatePlan(Gi))
15:     tmpPlan = Join(T, list)
16:   else
17:     // process in chain pattern: split G into G1 and G2 at T
18:     (G1, G2) = split(G, T)
19:     tmpPlan = Join(generatePlan(G1), generatePlan(G2))
20:   if estimateCost(tmpPlan) < cost then
21:     plan = tmpPlan; cost = estimateCost(plan)
22:   tmpPlan = flattenPlan(tmpPlan)
23:   if estimateCost(tmpPlan) < cost then
24:     plan = tmpPlan; cost = estimateCost(plan)
25: return plan

As presented in Algorithm 1, if there is only one table in the graph, it returns a materialization operation (Line 4). Otherwise, it iterates over all the nodes and generates different execution plans according to the following cases:

If table T is small enough to be buffered in the memory, we employ the fragment-replication algorithm (Line 9) to join T with the remaining graph. That is, each mapper reads a replica of T and stores it in the local memory. The join can thus be performed in the map phase. The plan of the remaining graph is generated by the same algorithm as well.

If table T only acts as a fact table in the graph, that is, all its edges are pointing from its connected components, Llama applies the star pattern strategy on table T to generate the plan. For each of its connected components, Llama calls this algorithm to respectively generate their sub-plans (Line 14). All of them are finally joined together based on the primary key of the fact table (Line 15).

If table T is a dimension table joined with another table, that is, an edge is pointing from T, Llama splits the graph into two components, each of which contains one of the two tables (Line 18). It calls this algorithm to generate their sub-plans (Line 19) and finally joins them by the join key (T's primary key).

The cost estimation during plan generation focuses on the I/O cost incurred in initialization, shuffle and output. The calculation is similar to the overhead analysis in Section 4. In addition, the cost estimation also needs to check whether the required vertical group exists in the plan. If not, the overhead to generate the proper group should be taken into consideration. Note that the Join in the algorithm, which joins decomposed components, increases the depth of the plan tree by 1, implying that one more MapReduce job is needed
during the query processing. To make the plan tree compact, we call the flattenPlan function after the original plan is generated. This function combines two MapReduce jobs when the partition key of one job is a subset of that of its subsequent job. After flattening, their join operations are performed in the same phase, which reduces the number of MapReduce jobs and further reduces the intermediate I/O cost. However, it is worth noting that the flattened plan may not be better than the original one, because generating the proper PF group is also a MapReduce job and its overhead needs thorough consideration.

5.4 A Running Example

Taking query Q9 from the TPC-H benchmark as an example, we illustrate how the query plan is generated and executed in Llama using the proper vertical groups. Q9 determines how much profit is made on a given line of parts, and is expressed as Lineitem ⋈ Orders ⋈ PartSupp ⋈ Part ⋈ Supplier ⋈ Nation. For simplicity, we use L, O, PS, P, S and N to represent the corresponding tables. The query graph of Q9 is shown in Figure 6.

The size of table N is small enough to be fully buffered in memory, and N can hence be joined with the other tables by using the fragment-replication method. The original query is thus reduced to L ⋈ O ⋈ PS ⋈ P ⋈ S. Next, by analyzing the graph, the edges from its neighbors point to L; L is thus treated as a fact table. Splitting the graph at L by the star pattern strategy, we obtain two components connected to L: O, and PS ⋈ P ⋈ S. A similar approach can be adopted in the latter component by using PS as the fact table. Before flattening the plan, there are two MapReduce jobs to complete the query: the first job performs the joins of the PS component, and the second job joins L, O and the output of the first job. Since the partitioning key of the first job is a subset of the partitioning key of the second job, it is possible to flatten the two jobs into one.
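To make the recursion of Algorithm 1 and the flattening condition concrete, the following is a toy sketch in Python. The graph model (directed edges pointing from dimension to fact), the job-counting cost function, and all names such as generate_plan and can_flatten are illustrative assumptions for exposition, not Llama's implementation.

```python
# Toy sketch of Algorithm 1's plan enumeration (illustrative assumptions,
# not Llama's code). A query graph is a set of tables plus directed edges
# (dimension -> fact); the stand-in cost model counts reduce-side join jobs.

def _components(tables, edges):
    """Connected components restricted to `tables` (direction ignored)."""
    adj = {t: set() for t in tables}
    for a, b in edges:
        if a in adj and b in adj:
            adj[a].add(b)
            adj[b].add(a)
    comps, seen = [], set()
    for t in tables:
        if t in seen:
            continue
        comp, stack = set(), [t]
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(adj[cur] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def cost(plan):
    """Stand-in for estimateCost: each reduce-side join adds one job."""
    op = plan[0]
    if op == 'materialize':
        return 1
    sub = sum(cost(p) for p in plan[2])
    return sub if op == 'frj' else 1 + sub   # map-side frj adds no job

def generate_plan(tables, edges, sizes, mem):
    """Recursive plan generation over the three cases of Algorithm 1."""
    if len(tables) == 1:                      # single table: materialize
        return ('materialize', next(iter(tables)))
    best, best_cost = None, float('inf')
    for t in sorted(tables):
        rest = tables - {t}
        rest_edges = {(a, b) for a, b in edges if t not in (a, b)}
        incident = [(a, b) for a, b in edges if t in (a, b)]
        if sizes[t] <= mem:                   # fragment-replication join
            cand = ('frj', t, [generate_plan(c, rest_edges, sizes, mem)
                               for c in _components(rest, rest_edges)])
        elif all(b == t for a, b in incident):  # star: all edges point to t
            cand = ('star', t, [generate_plan(c, rest_edges, sizes, mem)
                                for c in _components(rest, rest_edges)])
        else:                                 # chain: split at an edge t -> fact
            fact = next(b for a, b in incident if a == t)
            cut = edges - {(t, fact)}
            cand = ('chain', (t, fact), [generate_plan(c, cut, sizes, mem)
                                         for c in _components(tables, cut)])
        if cost(cand) < best_cost:
            best, best_cost = cand, cost(cand)
    return best

def can_flatten(first_partition_key, second_partition_key):
    """flattenPlan's combining test: two consecutive jobs can be merged when
    the partition key of the first is a subset of the second's."""
    return set(first_partition_key) <= set(second_partition_key)
```

On a three-table chain where one table fits in memory, the sketch picks a fragment-replication join wrapped around a map-side star join, mirroring the cases discussed above; the subset test is the whole of the flattening condition once partition keys are known.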
[…] its corresponding partition of the fact table, it quickly locates the start position of the corresponding values in the dimension table via the joining key. Then, it can perform a merge join by sequentially scanning both the fact and the dimension tables. The result tuples are pipelined to the mappers. If the join selectivity is high, we exploit the index of the dimension table to seek to the corresponding block, which further reduces the scan cost.

To facilitate different types of processing procedures in one MapReduce job, we implement LlamaInputs, which specifies the particular input format, mappers, combiner and partitioner in terms of the different datasets. Since it provides more flexible user-defined configurations to handle heterogeneous data input, it is superior to its counterpart in Hadoop [2] and to TableInputs in [24]. To simplify the deployment of customized interfaces, it is defined as follows:

int LlamaInputs.addInput(configuration, dataset, inputformat, mapper, combiner, partitioner);

In this interface, configuration describes the job configuration in the MapReduce environment; dataset indicates the table(s) to be processed by this specific mapper; inputformat is the particular input format to handle different data sources; mapper, combiner and partitioner are respectively used to filter particular tuples, combine the intermediate results, and define the partition strategy to the reducers. The return value of this interface is an integer, which binds the input dataset to the customized processing. According to this integer, Llama can use the proper input format to materialize or join the tables and instantiate the specific mapper, combiner and partitioner to process the intermediate tuples.

To join heterogeneous data sources in the reduce phase, we introduce a joiner before the reducer, as in the Map-Join-Reduce proposal [24]. The join processing logic is specified by creating a joiner class and implementing specific join() functions. Joiners are registered to the system by the following function:

LlamaInputs.addJoiner(int[] tableID, Joiner joiner);

In this function, tableID indicates two or more tables to be joined by the same joining key; joiner provides the join() function to join the indicated tables in the reduce phase. Before the join processing, the data from these tables is sorted by the joining key. As Llama inherits and extends the joiner from the previous proposal [24], it is able to complete multiple joins of different join keys in the reduce phase.

7. EXPERIMENTS

In this section, we study the performance of Llama. We first compare the data storage with the existing file formats in Hadoop. Then we study the column materialization in the MapReduce framework. Finally, we benchmark our system against Hive with TPC-H queries. We choose Hive for comparison for three reasons. First, Hive is built on top of Hadoop and provides efficient performance on large-scale data analysis. Second, Hive has benchmarked itself against Hadoop and Pig using the TPC-H benchmark, and it provides scripts with reasonable query plans and execution parameters. Third, Hive provides the column-wise storage format RCFile, which can be adopted in HiveQL and compared with our approach to column-wise storage.

7.1 Experimental Environment

We conduct our experiments on the Amazon EC2 cloud with large instances. Each instance has 7.5 GB memory and 2 virtual cores with 2 EC2 compute units each. There is 850 GB of local storage on two local disks for each instance. The operating system is 64-bit Linux. In our configuration, one instance is used as the NameNode and JobTracker, to manage the HDFS cluster and schedule the MapReduce jobs. The others are used as the DataNodes and TaskTrackers to store the files and execute the tasks assigned by the JobTracker.

We implement Llama on top of Hadoop v0.20.2, and use Hive v0.5.0 for comparison purposes. Based on the hardware capacity, we adjust the maximum number of mappers and reducers to 2. We assign 1024 MB memory to each task. The chunk size in the HDFS is set to 512 MB and the replication factor is 3. Following the suggestion of the Hive TPC-H experiments, we set the replication factor of the intermediate results to 1; that is, the results of intermediate jobs are stored on 1 node instead of 3 nodes, to save the replication cost. The number of reducers is automatically configured by Hive's default setting.

Both Hadoop and Llama store the files in HDFS. We use dbgen from TPC-H to generate the synthetic dataset. We benchmark the performance with cluster sizes of 4, 8, 16, 32 and 64 nodes. Each node stores 10 GB of TPC-H data. The sizes of the TPC-H data are thus 40 GB, 80 GB, 160 GB, 320 GB and 640 GB respectively.

7.2 Comparisons Between Files

In this test, we compare the performance of different file formats implemented on the Hadoop system. They are TFile of Zebra [9], RCFile of Hive [4], HFile of HBase [3] and our CFile. First we compare their compression performance in terms of compression speed and compression ratio based on three general compression algorithms, namely BZip2, Gzip and Lzo. Then, we compare their performance on write, sequential scan, and random reads. All these operations are run on the HDFS. The original dataset is the Orders table in the TPC-H schema (15M tuples with 9 columns, 1.75 GB of uncompressed text data when the TPC-H scale is 10). Our objective here is to demonstrate that the CFile format has distinct advantages in large-scale data processing.

In the experiment, we first parse the data into columns and write them with the specific writers of these files. HFile and TFile use one column family to store all the columns. In HFile, one tuple is parsed into multiple key-value pairs, and each pair represents a value in one column. Its key is a composite of the orderID and the column qualifier; its value is the data in the corresponding column. No timestamp is included in our experiments. In TFile, one tuple is parsed into one key-value pair. Its key is the orderID, and its value is a byte array of the other columns, which is in row-wise format. For comparison purposes, we also use TFile to simulate column-wise storage. That is, each file represents one column family; the key is the orderID and the value is the specific value of that column. This key is necessary to combine different columns when randomly accessing several columns from different files. In our experiment, we use TFile-one and TFile-multi to represent these two storage approaches. The data is block compressed and stored in the HDFS. TFile and HFile only support Lzo and Gzip; therefore, the results of TFile and HFile presented below do not contain BZip2.

Storage Efficiency. Figure 7 shows the size of the files compressed with the different compression algorithms. The results show that BZip2 compresses the data with a higher ratio than Gzip and Lzo. CFile and RCFile are smaller than the other file formats, because they both compress the data in columns. On the other hand, since HFile and TFile-multi contain the row-id for each record, their sizes are relatively larger. Moreover, HFile has to store the column qualifier and is thus larger still.

Creation Overhead. Figure 8 presents the creation overhead of the different files for table
[Figure 7: Storage Efficiency | Figure 8: File Creation | Figure 9: All-Column Scan | Figure 10: Two-Column Scan | Figure 11: Two-Column Random Access | Figure 12: Column Materialization (running time of Late vs. Early Materialization against selectivity)]

Orders. Although the compression ratio of Lzo is not as good as that of Gzip and BZip2, its compression speed is much faster. The write overhead is primarily proportional to the size of the file. As such, the creation of CFile is an efficient process.

Access Performance. Figure 9 shows the performance of scanning all the columns. Similar to compression, Lzo decompresses faster than the other algorithms. CFile is not as good as RCFile and TFile-one, because CFile has to read all the columns from different files and decompress them individually. Figure 10 shows the speed of scanning the 2nd and the 7th columns of Orders. TFile-one and HFile are essentially row-wise storage in one column family; therefore they need to read all the data and decompress it, as when scanning all the columns. RCFile, on the other hand, groups each column on a separate mini block. This approach avoids decompressing the unnecessary columns, but it still has to read the whole block. Different from the above formats, TFile-multi and CFile only read the required columns and thus obtain better performance. Moreover, CFile outperforms TFile-multi because it is designed for column access with more compact storage.

The performance of randomly accessing 10000 records with the same two columns is reported in Figure 11. RCFile is designed for sequential scan and does not provide an interface for random accesses, so its results are not captured in the figure. Although CFile has to perform two seeks to randomly access two columns, its performance is as good as that of HFile and TFile-one, which perform only one seek.

CFile shows competitive performance compared to the existing file formats for large-scale data processing. In terms of storage efficiency, it is more compact with less storage overhead. In execution, even though it is not as good as RCFile in terms of scanning all the columns, its speed is fastest when only a few columns have to be scanned. Therefore, it is particularly suitable for analysis that only requires a few columns, rather than the whole table. In addition, CFile provides random access efficiently, similar to HFile and TFile.

7.3 Column Materialization

In this experiment, we first estimate the cost ratios discussed in Section 4. Then we run a simple aggregation query to study column materialization within the MapReduce framework. Based on our experiment, we find that LM is preferred when the selectivity is very high.

To estimate the cost ratio of random access in the cost model, we compare the average running time of scanning versus random accessing. By examining the execution times presented in Figure 10 and Figure 11, the ratio is about 1[…]. To estimate the cost ratio of shuffling, we run two MapReduce jobs that write no results back to the HDFS. The first job is a map-only job that scans a large dataset without any processing; its running time is thus proportional to the scanning overhead. The second job is a MapReduce job that shuffles all the input data to the reducers without any filtering; its running time is thus proportional to both the scanning and the shuffling overhead. Based on the running times of these two jobs on EC2, we estimate that the ratio is about 3. Obviously, the overhead of random access is much larger than that of scanning and shuffling. If the column size is small, the predicate discussed in Equation 6 can be further reduced to a simple selectivity threshold; that is, LM is picked only when the selectivity is smaller than 1[…]. To verify this estimation, we run a simple aggregation query to study when LM outperforms EM, by adjusting x in the query below:

select sum(l_price), sum(l_discount), sum(l_quantity), sum(l_tax)
from lineitem
where l_orderkey < x
group by l_orderkey;

The performance of these two strategies is summarized in Figure 12. LM is better than EM only when the selectivity is less than […]. Moreover, as the selectivity grows, the running time of LM increases rapidly. When the selectivity is low, EM is therefore preferred. Since the selectivity in TPC-H is low, Llama mainly employs EM for the following TPC-H queries.

7.4 Data Loading

We report the loading time of all the tables in the TPC-H benchmark in Figure 13, and briefly describe our loading procedures
[Figure 13: Load time | Figure 14: Aggregation Task: TPC-H Q1 | Figure 15: Join Task: TPC-H Q4]

here. First we use dbgen to generate the TPC-H data on the different local disks of the cluster. Then we use Hadoop's file utility "copyFromLocal" to upload the unaltered TPC-H text data from the local disks to the HDFS in parallel. It directly copies the text without parsing the data. The HDFS automatically partitions each file into blocks and replicates them to three datanodes.

In Hive, we transform the raw text to RCFile by HiveQL. For example, if there is raw text data located in directory '/A' and the schema is (int, int), we could use the following HiveQL to complete the format transformation:

CREATE EXTERNAL TABLE A (a1 int, a2 int) STORED AS TEXTFILE LOCATION '/A';
CREATE TABLE B (b1 int, b2 int) STORED AS RCFILE;
INSERT OVERWRITE TABLE B SELECT a1, a2 FROM A;

The first two commands declare the metadata information such as the schema and the data location. The third command executes the specific transformation. In our experiment, RCFile is block compressed by Lzo.

In Llama, we transform the raw text to CFile. To build the basic group, a map-only job is launched to read the data from the HDFS. It parses the text records into columns guided by the delimiters, and writes each column to the corresponding CFile. The black part of the graph in Figure 13 indicates the processing time required to build the basic group of the TPC-H dataset. The white part of the graph indicates the additional cost of building two groups: the PF group of Partsupp sorted by supplierID, and the group of Orders sorted by customerID. The first group is to facilitate the map-merge join of TPC-H Q9, while the second one is to facilitate the map-merge join of TPC-H Q3.

As shown in Figure 13, Llama performs slightly better than Hive if we do not take the sorting time into account. On the other hand, even though the transformation cost of Llama is higher than the pure HDFS copy because of the additional overhead of parsing and sorting, the transformation is worthwhile because it significantly reduces the processing time of the analytical tasks. As will be seen in the following experiments, the accumulated savings are significantly more than the loading overhead.

7.5 Aggregation Task

We use TPC-H Q1 as our aggregation task to measure the performance improvement gained by adopting column-wise storage. Q1 provides a pricing report for all the lineitems shipped on a given date. Intermediate results have to be exchanged between the nodes in the cluster. The results are shown in Figure 14. The two Hive variants represent the performance of Hive with the same execution plan but on row-wise and column-wise storage respectively. The results confirm the benefit of exploiting column-wise storage in compression. Under the compressed column-wise storage, both Hive and Llama save I/O cost. In contrast to RCFile, which stores the columns in one file in a record-columnar manner, CFile stores each column in an individual file. This guarantees that only necessary data is read, and hence saves more I/O during processing.

7.6 Join Task

We choose TPC-H Q4, Q3 and Q9 as our join tasks. There are one, two and five join operations in these queries respectively. They are chosen to study the performance and scalability of the system with respect to the data and cluster size. In Hive, the execution plans for a specific query are the same regardless of the underlying file formats.

TPC-H Q4. Q4 determines how well the order priority system is working and gives an assessment of customer satisfaction. It contains one join operation and is compiled into three main MapReduce jobs by Hive. Job 1 creates a temporary table Tmp that contains only the distinct qualified keys from Lineitem. Job 2 joins Tmp with Orders, and Job 3 aggregates the results. Llama processes the query using a similar approach. In the first job, Llama materializes Lineitem and creates a temporary table Tmp with the distinct orderIDs. It performs a map-merge join for Orders ⋈ Tmp in the second job and aggregates the final results in the last job.

As shown in Figure 15, Llama runs about 2 times faster than Hive. The performance benefit is derived from its column-wise storage. Applying the map-merge join, it further reduces the shuffling cost of the intermediate results. As the query becomes complex in Hive, the benefit of the I/O saving from column-wise storage becomes less obvious, because the storage layer only affects the performance in the initial phase.

TPC-H Q3. Q3 retrieves the 10 unshipped orders with the highest revenue, operating on tables Lineitem, Orders and Customer. Hive compiles Q3 into five MapReduce jobs. The first two jobs join the three tables, and the other three
jobs aggregate the tuples, sort them and get the top 10 results. Llama processes this query with two jobs. In the first job, there are two types of map tasks: joining Orders ⋈ Customer and materializing Lineitem. The intermediate results are shuffled to the reducers by the order key. The reducers perform the reduce-join followed by a local aggregation. The second job combines the partial aggregations for the final answer. To enable the concurrent join, the PF group of table Orders is built in the loading phase.

Figure 16 summarizes the results for different cluster sizes. The concurrent join facilitates a more flexible query plan with fewer jobs, which significantly reduces the job launching time and the intermediate I/O transfer cost, and thus makes Llama about 2 times faster than Hive. When the data size scales to 640 GB for a cluster size of 64 nodes, one reduce task at the second stage of Hive's job is shuffled more than 30 GB of data to process. It ran for a long time and was finally killed after failing to report its status within 600 seconds. Therefore, the execution time of Hive is not reported in the graph for the cluster size of 64.

[Figure 16: Join Task: TPC-H Q3 | Figure 17: Join Task: TPC-H Q9]

TPC-H Q9. Q9 determines how much profit is made on a given line of parts, broken down by supplier's nation and year. Hive compiles Q9 into seven MapReduce jobs. The first five jobs respectively join the six tables of the query. After these joins are finished, two additional jobs are launched for aggregation and ordering. The execution of Q9 in Llama has been presented earlier in Section 5.4. To facilitate the concurrent join, the PF group of table Partsupp is built during the loading phase.

Figure 17 shows the performance of Hive and Llama for Q9 with respect to different cluster sizes. Based on the results, the performance difference between Hive on the different storage formats is not very obvious. The main reason is that Hive configures its number of mappers and reducers by the size of the input dataset. With different file formats, the size of the input varies, which further affects the mappers and reducers in the subsequent jobs. In this case, the overall performance of the subsequent MapReduce jobs slightly deteriorates. Therefore, the benefit of the I/O saving is not apparent in the overall performance. As in TPC-H Q3, Q9 could not be completed in Hive within a specific time frame for the cluster size of 64 nodes.

On the other hand, the concurrent join method is capable of completing all the joins in one MapReduce job, which significantly reduces the materialization of intermediate results in the execution plan. As can be observed from Figure 17, the concurrent join runs nearly 5 times faster than the traditional execution plan.

In summary, Llama achieves very good scalability when the cluster size increases from 4 nodes (with 40 GB of total data) to 64 nodes (with 640 GB of total data). Its scalability is almost linear for large-scale data processing.

8. RELATED WORK

8.1 Join Processing in MapReduce

The MapReduce paradigm [21] has been introduced as a distributed programming framework for large-scale data analytics. Due to its ease of programming, scalability, and fault tolerance, the MapReduce paradigm has become popular for large-scale data analysis. An open source implementation of MapReduce, Hadoop [2], is widely available to both commercial and academic users. Building on top of Hadoop, Pig [7] and Hive [4] provide declarative query language interfaces and facilitate join operations to handle complex data analysis. Zebra [9] is a storage abstraction of Pig that provides a column-wise storage format for fast data projection.

To execute equi-joins in the MapReduce framework, Pig [7] and Hive [4] provide several join strategies in terms of the features of the joining datasets [6, 5]. For example, [29] proposes a set of strategies for the automatic optimization of parallel data flow programs such as Pig. On the other hand, HadoopDB [12] provides a hybrid solution which uses Hadoop as the task coordinator and communication layer to connect multiple single-node databases. The join operation can be pushed into the database if the involved tables are partitioned on the same attribute. Hadoop++ [22] provides non-invasive index and join techniques for co-partitioning the tables. The cost of data loading in these two systems is quite high. A comprehensive description and comparison of several equi-join implementations for the MapReduce framework appears in [16, 23]. However, in all of the above implementations, one MapReduce job can only process one join operation, with a non-trivial startup and checkpointing cost. To address this limitation, [13, 24] propose a one-to-many shuffling strategy to process multi-way joins in a single MapReduce job. However, as the number of joining tables increases, the tuple replication during the shuffle phase increases significantly. In another recent work [25], an intermediate storage system for MapReduce is proposed to augment fault tolerance while keeping the replication overheads low. [19] presents a modified version of the Hadoop MapReduce framework that supports online aggregation by pipelining. However, these works do not essentially improve the performance of MapReduce-based multi-way join processing.

8.2 Column-wise Storage in MapReduce

The fundamental idea of column-wise storage is to improve I/O performance in two ways: (i) reducing data transmission by avoiding fetching unnecessary columns; and (ii) improving the compression ratio by compressing the data blocks of individual columns. Although vertically partitioning tables has been around for a long time [20, 15], it has only recently gained wide-spread attention for building columnar analytic databases [27, 30, 8, 26], primarily for data warehousing and online analytical processing.

The column-wise data model is also preferred in MapReduce and distributed data storage systems. HadoopDB [12] can use a columnar database like C-Store [30] as its underlying storage. Dremel [28] proposed a specific storage format for nested data along with an execution model for interactive queries. Bigtable [17] proposed the column family to group one or more columns as a basic unit of access control. HBase [3], an open source implementation of Bigtable, has been developed as the Hadoop [2] database; HFile is its underlying column-wise storage. Besides, TFile and RCFile are two other popular file structures that have been used in the Zebra [9] and Hive [4] projects for large-scale data analysis on top of Hadoop. Each of these files represents one column family and contains one or more columns. Their records are presented as key-value pairs. In HFile, each record contains detailed information to indicate
the key, in the form (row:string, column-qualifier:string, timestamp:long), because it is specifically designed for storing sparse and real-time data. This makes HFile less compact and thus ineffective in large-scale data processing. TFile, on the other hand, does not store such metadata in each record. Each record is stored in the following format: (keyLength, key, valLength, value). The length information is necessary to state the boundary of the key and the value in each record. Similar to TFile, RCFile stores the same data on each block for the given columns. However, within each block it groups all the values of a particular column together on a separate mini block, which is similar to PAX [14]. RCFile also uses the key-value pair to represent the data, where the key contains the length information for each column in the block, and the value contains all the columns.

The above file formats store the columns of a column family on the same block within a file. This strategy provides good data locality when accessing several columns in the same file. However, it requires reading the entire block even if some columns are not needed in the query, resulting in wasted I/Os. Even when each file stores only one column, the file format is not compact, because the length information of both the key and the value for each record incurs non-trivial overhead, especially when the column is small, such as an integer type. Furthermore, these files are only designed to provide I/O efficiency; there is no effort to leverage the file formats to expedite query processing in MapReduce. In this respect, Zebra [9] and Hive [4] cannot be treated as truly column-wise data warehouses.

9. CONCLUSION

In this paper, we present Llama, a column-wise data management system on MapReduce. Llama applies a column-wise partitioning scheme to transform the imported data into CFiles, a special file format designed for Llama. To efficiently support data warehouse queries, Llama adopts a partition-aware query processing strategy. It exploits the map-side join to maximize the parallelism and reduce the shuffling cost. We study the problem of data materialization and develop a cost model to analyze the cost of data accesses. We evaluate Llama's performance by comparing it against Hive. Our experiments conducted on Amazon EC2 using TPC-H datasets show that Llama provides a speedup of 5 times compared to
Hive.TheperformanceevaluationconÞrmstherobustness,efÞciencyandscalabilityofLlama.10.ACKNOWLEDGEMENTSTheworkofYutingLin,BengChinOoi,SaiWu,andChunChenarerespectivelysupportedinpartbytheMinistryofEducationofSingapore(GrantNo.R-252-000-394-112)andtheNationalNatu-ralScienceFoundationofChina(GrantNo.61070155).WethankAmazonfortheresearchgrantofthefreeusageofAWS.Wealsothanktheanonymousreviewersfortheirinsightfulcomments.11.REFERENCES[1]Epic.[2]Hadoop.[3]Hbase.[4]Hive.[5]Hive/tutorial.[6]Joinframework.[7]Pig.[8]Vertica.[9]Zebra.[10]D.J.Abadi,S.Madden,andM.Ferreira.Integratingcompressionandexecutionincolumn-orienteddatabasesystems.In,2006.[11]D.J.Abadi,D.S.Myers,D.J.Dewitt,andS.R.Madden.Materializationstrategiesinacolumn-orienteddbms.In,2007.[12]A.Abouzeid,K.Bajda-Pawlikowski,D.J.Abadi,A.Rasin,andA.Silberschatz.Hadoopdb:Anarchitecturalhybridofmapreduceanddbmstechnologiesforanalyticalworkloads.,volume2,pages922Ð933,2009.[13]F.N.AfratiandJ.D.Ullman.Optimizingjoinsinamap-reduceenvironment.In,2010.[14]A.Ailamaki,D.J.DeWitt,M.D.Hill,andM.Skounakis.Weavingrelationsforcacheperformance.In,2001.[15]D.S.Batory.OnsearchingtransposedÞles.ACMTrans.DatabaseSyst.,4(4):531Ð544,1979.[16]S.Blanas,J.M.Patel,V.Ercegovac,J.Rao,E.J.Shekita,andY.Tian.Acomparisonofjoinalgorithmsforlogprocessinginmapreduce.In,2010.[17]F.Chang,J.Dean,S.Ghemawat,W.C.Hsieh,D.A.Wallach,M.Burrows,T.Chandra,A.Fikes,andR.E.Gruber.Bigtable:adistributedstoragesystemforstructureddata.In,2006.[18]C.Chen,G.Chen,D.Jiang,B.C.Ooi,H.T.Vo,S.Wu,andQ.Xu.Providingscalabledatabaseservicesonthecloud.In,2010.[19]T.Condie,N.Conway,P.Alvaro,J.M.Hellerstein,K.Elmeleegy,andR.Sears.Mapreduceonline.In[20]G.P.CopelandandS.N.KhoshaÞan.Adecompositionstoragemodel.In,1985.[21]J.DeanandS.Ghemawat.Mapreduce:simpliÞeddataprocessingonlargeclusters.In,2004.[22]J.Dittrich,J.-A.QuianŽ-Ruiz,A.Jindal,Y.Kargin,V.Setty,andJ.Schad.Hadoop++:Makingayellowelephantrunlikeacheetah(withoutitevennoticing).,3(1):518Ð529,[23]D.Jiang,B.C.Ooi,L.Shi,andS.Wu.Theperformanceof
mapreduce:Anin-depthstudy.,3(1):472Ð483,2010.[24]D.Jiang,A.K.H.Tung,andG.Chen.Map-join-reduce:TowardsscalableandefÞcientdataanalysisonlargeclusters.IEEETransactionsonKnowledgeandDataEngineering[25]S.Y.Ko,I.Hoque,B.Cho,andI.Gupta.Makingcloudintermediatedatafault-tolerant.In,2010.[26]R.MacNicolandB.French.Sybaseiqmultiplex-designedforanalytics.In,2004.[27]S.Manegold,P.A.Boncz,andM.L.Kersten.Optimizingdatabasearchitectureforthenewbottleneck:memoryVLDBJournal,9(3):231Ð246,2000.[28]S.Melnik,A.Gubarev,J.J.Long,G.Romer,S.Shivakumar,M.Tolton,andT.Vassilakis.Dremel:Interactiveanalysisofweb-scaledatasets.,3(1):330Ð339,2010.[29]C.Olston,B.Reed,A.Silberstein,andU.Srivastava.Automaticoptimizationofparalleldataßowprograms.InUSENIXAnnualTechnicalConference,2008.[30]M.Stonebraker,D.J.Abadi,A.Batkin,X.Chen,M.Cherniack,M.Ferreira,E.Lau,A.Lin,S.Madden,E.OÕNeil,P.OÕNeil,A.Rasin,N.Tran,andS.Zdonik.C-store:acolumn-orienteddbms.In,2005.