Voodoo - A Vector Algebra for Portable Database Performance on Modern Hardware

Holger Pirk, MIT CSAIL, holger@csail.mit.edu
Oscar Moll, MIT CSAIL, orm@csail.mit.edu
Matei Zaharia, MIT CSAIL, matei@csail.mit.edu
Sam Madden, MIT CSAIL, madden@csail.mit.edu

ABSTRACT

In-memory databases require careful tuning and many engineering tricks to achieve good performance. Such database performance engineering is hard: a plethora of data and hardware-dependent optimization techniques form a design space that is difficult to navigate for a skilled engineer, and even more so for a query compiler. To facilitate performance-oriented design exploration and query plan compilation, we present Voodoo, a declarative intermediate algebra that abstracts the detailed architectural properties of the hardware, such as multi- or many-core architectures, caches and SIMD registers, without losing the ability to generate highly tuned code. Because it consists of a collection of declarative, vector-oriented operations, Voodoo is easier to reason about and tune than low-level C and related hardware-focused extensions (Intrinsics, OpenCL, CUDA, etc.). This enables our Voodoo compiler to produce (OpenCL) code that rivals and even outperforms the fastest state-of-the-art in-memory databases for both GPUs and CPUs. In addition, Voodoo makes it possible to express techniques as diverse as cache-conscious processing, predication and vectorization (again on both GPUs and CPUs) with just a few lines of code. Central to our approach is a novel idea we termed control vectors, which allows a code-generating frontend to expose parallelism to the Voodoo compiler in an abstract manner, enabling portable performance across hardware platforms.

We used Voodoo to build an alternative backend for MonetDB, a popular open-source in-memory database. Our backend allows MonetDB to perform at the same level as highly tuned in-memory databases, including HyPeR and Ocelot. We also demonstrate Voodoo's usefulness when investigating hardware-conscious tuning techniques, assessing their performance on different queries, devices and data.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org.
Proceedings of the VLDB Endowment, Vol. 9, No. 14
Copyright 2016 VLDB Endowment 2150-8097/16/10.

1. INTRODUCTION

Increasing RAM capacities on modern hardware mean that many OLAP and database analytics applications can store their data entirely in memory. As a result, a new generation of main-memory-optimized databases, such as HyPeR [18], Legobase [14] and TupleWare [9], have arisen. These systems are designed to operate close to memory bandwidth speed by ad-hoc generating CPU-executable code. However, code generation is complex, and as a result most systems were designed to generate code for a specific hardware platform (and sometimes a specific dataset or workload), and require substantial changes to target an additional platform. This is because different architectures employ very different techniques to achieve performance, ranging from SIMD instructions to massively parallel co-processors such as GPUs or Intel's Xeon Phi to asymmetric chip designs such as ARM's big.LITTLE. Exploiting such hardware properly is tricky because the benefit of most machine code optimizations is data as well as hardware dependent.

Figure 1: Performance of branch-free selections based on cursor arithmetics [28] (a.k.a. predication) over a branching implementation (using if statements)

To illustrate the complexity of these trade-offs, Figure 1 shows the impact of predicate selectivity and architecture on in-memory selections (over one billion single-precision floats). Here, we filter a list of values with a predicate of variable selectivity, using one of two methods: conventional if-statements that evaluate the predicate on every item, and a branch-free approach where every input is copied to the output, but the address for the next output is computed by adding the outcome of the predicate (1 or 0) for the current value (so values that don't satisfy the predicate are overwritten by the next value). The branch-free implementation executes more instructions but avoids potentially expensive branch mispredictions. On GPUs, the branching implementation is often better and never significantly worse; on CPUs the branch-free implementation can sometimes be up to 4x better in the single-threaded case (where costs are strongly branch-dominated) and 2.5x better in the multi-threaded case (which is less branch-dominated). This example shows why generating high-performance code for modern hardware is hard: even straightforward optimizations such as the elimination of branch mispredictions must be applied with knowledge of both hardware and data!

Figure 2: The Voodoo Query Processing System

Unfortunately, implementing transformations like this in existing code generation engines is hard because it requires encoding knowledge about the hardware (GPU/CPU) and data (selectivities) into the code generator. Furthermore, many such optimizations require cross-cutting changes across operators and code components. As a result of this
, none of the aforementioned engines implement hardware-specific or data-driven optimizations such as the one shown in Figure 1.

To address this complexity, we developed a new intermediate algebra, called Voodoo, as the compilation target for query plans. Voodoo is both portable to a range of modern hardware architectures, including GPUs and CPUs, and expressive enough to easily capture most of the optimizations proposed for main-memory query processors in the literature, a property we will call tunability in the rest of this paper. For example, it can express different layouts (column vs. row) [26, 34], materialization strategies [1], sharing of common subexpressions [6, 14], vectorization [25], parallelization, predication (the above example) [28] as well as loop fusion and fission [32]. While supporting all of these optimizations, Voodoo maintains the efficiency of hand-written C code by just-in-time generating executable code.

Of course, Voodoo is not the first system to generate efficient executable code for modern hardware. However, existing systems occupy particular design points in the space, implementing architecture-specific techniques to achieve bandwidth and CPU efficiency. Table 1 displays the design choices of several well-known systems (we benchmark against some in our experiments). Porting any of these to a new hardware architecture (with a new set of data-specific bottlenecks) would involve a nearly complete rewrite of the system. We aim to develop an abstraction layer that makes it easy to obtain performance from new hardware architectures.

    System            Bandwidth Efficiency   CPU Efficiency    Hardware
                      Technique              Technique         Target
    ----------------  ---------------------  ----------------  -------------
    HyPeR [18]        Pipelining             Compilation       CPU-only
    MonetDB [7]       -                      Bulk-Processing   CPU-only
    Vectorwise [33]   Cache-friendly         Partitioning      CPU-only
    Tupleware [9]     Pipelining             Compilation       CPU-only
    Ocelot [13]       -                      Bulk-Processing   GPU-Optimized
    Voodoo            Tunable                Tunable           Tunable

Table 1: Techniques Used in Existing In-Memory DBMSs

We termed the key innovation that enables such "tunability" declarative partitioning, which allows a code-generating frontend to provide information about the desired parallelism of operations in a hardware-independent fashion. This is specifically embodied in Voodoo by a novel technique, called controlled folding, where virtual attributes (which we term control vectors) are attached to data vectors. By tuning the value of these virtual attributes, frontends can create more or fewer partitions (yielding more or less parallelism) to adapt to hardware with different lane-widths, cache sizes, and numbers of execution units.

Specifically, we make the following contributions:

- We present the Voodoo algebra and describe our implementation of it. Our implementation compiles into OpenCL, and serves as an alternative physical plan algebra and backend for MonetDB [7], an existing high-performance query processing engine that did not previously generate machine code.
- We present a set of principles that guide the design of Voodoo and allow it to efficiently adapt to a range of hardware. These include the use of a minimal collection of declarative operators that can be efficiently executed on different hardware platforms.
- We describe the design and implementation of controlled folding as well as a compiler that generates efficient OpenCL code from Voodoo programs.
- We show that our implementation is performance-competitive with existing database engines designed for specific hardware platforms and hand-optimized code, using simpler and more portable code, and that it can capture several existing and new architecture-specific optimizations.

Note that we do not address the problem of programmatically generating optimal Voodoo code. Instead, we show that Voodoo can be used to express a variety of sophisticated hardware-conscious optimizations, some novel and some previously proposed, and argue that these could eventually be chosen via an optimizer that generates Voodoo code.

We structure the rest of this paper around the architecture of Voodoo (shown in Figure 2): after a brief discussion of our design goals in the rest of this section, we present the Voodoo algebra that is used to encode query plans in Section 2. Section 3 provides an in-depth discussion of the Voodoo OpenCL backend. In Section 4 we describe how we integrated the Voodoo kernel with MonetDB to accelerate the query processing engine. We evaluate the complete system in Section 5, discuss relevant related (Section 6) and future (Section 7) work and conclude in Section 8.

2. THE VOODOO ALGEBRA

To introduce the Voodoo algebra, we start with an example that illustrates its key features and our design principles. After that, we go on to present Voodoo's data model and operators in more detail.

    1 input := Load("input") // Single Column: val
    2 ids := Range(input)
    3 partitionSize := Constant(1024)
    4 partitionIDs := Divide(ids, partitionSize)
    5 positions := Partition(partitionIDs)
    6 inputWPart := Zip(input, partitionIDs)
    7 partInput := Scatter(inputWPart, positions)
    8 pSum := FoldSum(partInput.val, partInput.partition)
    9 totalSum := FoldSum(pSum)

Figure 3: Multithreaded Hierarchical Aggregation in Voodoo

    1 3,4c3,4
    2 < partitionSize := Constant(1024)
    3 < partitionIDs := Divide(ids, partitionSize)
    4 ---
    5 > laneCount := Constant(2)
    6 > partitionIDs := Modulo(ids, laneCount)

Figure 4: Multithreading to SIMD in Voodoo (textual diff)

Figure 3 shows a Voodoo program (in static single assignment form) to perform a hierarchical summation: data is first partially summed on N processors, and then the N partial aggregates are themselves summed. Line 1 loads an input vector with a single column "val". Line 2 creates a vector of ids, ranging from 1 ... |input|. Lines 3 and 4 create a vector that maps each tuple to a partition by integer-dividing its id by the partitionSize (1024 tuples in the example). Line 5 computes the output position for each tuple, based on its partition. Line 6 attaches the generated partition ids to the input tuples. Line 7 partitions the input according to the positions computed in line 5 (note that this partitioning is purely logical, meaning it just causes the generated code to loop over the specified number of partitions, unless explicitly materialized). Finally, line 8 performs the per-partition aggregation, and line 9 performs the global aggregation.

This example illustrates several key properties of Voodoo and how they enable portability and performance.

Vector Oriented: The algebra consists of a small set of vector operations like Scatter and FoldSum, which can be parallelized on modern architectures. Vector instructions enable both portability and performance, as they can be parallelized on many hardware platforms while also yielding to straightforward implementations on any hardware platform. The specific operators in the algebra were chosen to be familiar to compiler designers, reflecting the design of vector machines, SIMD instruction sets, and functional languages, and expressive enough to capture a wide variety of new and previously proposed techniques for optimizing main-memory analytics. For example, in addition to the example in Figure 4, Voodoo is expressive enough to capture the different implementations shown in Figure 1, and compact enough that each implementation is just a few lines of code.

Declarative: Voodoo describes dataflow rather than explicit behavior. In particular, the operators only define how the output depends on the inputs, not how the outputs are produced. This allows Voodoo to, e.g., implement lane-wise parallelism (as in FoldSum) using SIMD instructions on CPUs and work-groups/warps on GPUs (which are conceptually very similar). This declarative property is also important for portability in Voodoo, as operators don't describe specific properties of hardware that they rely on. In addition, it makes the program shorter and simpler than the equivalent C++ program using, e.g., Intel's Threading Building Blocks (see Figure 5) while providing equivalent expressive power. Simplicity arises because it employs a single concept: vector operations. In contrast, TBB involves blocked ranges (line 4), functional lambdas (lines 6 and 13) using lexical scoping and a reducer (line 13).

     1 auto input = load("input");
     2 auto totalsum =
     3   parallel_deterministic_reduce(
     4     blocked_range<size_t>(0, input.size,
     5                           input.size / 1024),
     6     0, [&input](auto& range, auto partsum) {
     7       for (size_t i = range.begin();
     8            i < range.end(); i++) {
     9         partsum += input.elements[i].constant;
    10       }
    11       return partsum;
    12     },
    13     [](auto s1, auto s2) { return s1 + s2; });

Figure 5: Multithreaded Hierarchical Aggregation in TBB

     1 auto input = load("input");
     2 typedef int v4i __attribute__((vector_size(16)));
     3 auto vSize = (sizeof(v4i) / sizeof(int));
     4 v4i sums = {};
     5 for (size_t i = 0; i < input.size / vSize; i++)
     6   sums += ((v4i*)input.elements)[i];
     7 int* scalarSums = (int*)&sums;
     8 auto totalsum = 0l;
     9 for (size_t i = 0; i < 4; i++)
    10   totalsum += scalarSums[i];

Figure 6: Hierarchical Aggregation using SIMD Intrinsics

Finally, declarative operators allow us to avoid materializing many of the intermediate vectors in a Voodoo program. For example, in the program in Figure 3, most of the vectors (except input, parts, and total) are never stored. These vectors are simply used to control the degree of parallelism in the generated code, as described in Section 3.1 below.

Minimal: Voodoo consists of non-redundant, stateless operators (an example of hidden state would be an internal hash table when computing an aggregate). By keeping the API simple, frontends are able to effectively influence the generation and usage of intermediate data structures which may or may not be beneficial for performance. Hidden data structures are also problematic with respect to portability because they can be unbounded in size (again, hash tables come to mind). Since most co-processors do not efficiently support the (re-)allocation of memory at runtime, unbounded data structures impede portability. Simple, fine-grained operators also lend themselves to fine-grained cost models such as the one we defined in earlier work [21]. Effective reasoning about cost is an integral part of tunability. Non-redundancy in the operator set has three distinct advantages: a) improvements in one operator can improve many queries, b) it increases the number of opportunities for common subexpression elimination, and c) it simplifies backend implementation and maintenance.

Deterministic: Voodoo programs do not contain runtime control statements such as if or for that decide if an operator is executed. Such determinism enables efficient execution on architectures with no, or only expensive, execution control, such as GPUs and SIMD units, and also improves CPU performance by allowing the CPU to effectively speculate during program execution. It also simplifies cost modeling.

Determinism does come at a price: the lack of dynamic decisions prevents the frontend from influencing dynamic execution strategies such as load balancing, garbage collection, memory re-allocation or running loops of which the number of iterations is unknown at compile time. However, this does not imply that generated code cannot make decisions about what data to load (e.g., which is the next node in a tree index), as long as the operations on the data are known at compile time (footnote 1). Operations on trees, e.g., are expressible as long as the depth of the tree is (reasonably) bounded/balanced. To remove the need for such a bound, we plan to (re-)integrate dynamic decisions into Voodoo and will explore the impact of control statements in future work.

Explicit: As much as possible, every Voodoo program has exactly one implementation on each underlying hardware platform. Explicitness is important for tunability, because it means a code-generating frontend (or the developer thereof) can reason clearly about what a particular program will do on a particular hardware platform.

Tunable/Transformable: The final key property of Voodoo, as noted in the introduction, is that it is easy to tune to various hardware platforms using a single abstraction. Since Voodoo already follows a declarative operator model, it is natural to extend this model to create a declarative approach to tuning. In addition to compatibility with the algebra, this approach has an appealing property: it encodes conceptually similar techniques into structurally similar programs. To illustrate this, consider the example of parallelization using either multiple cores or multiple SIMD lanes of a modern multicore CPU. This is a non-trivial difference in C: Figure 6 shows code that is equivalent to Figure 5 but uses SIMD intrinsics instead of TBB multithreading. These are almost entirely different programs: the only lines that are shared are the loading of the input (line 1) and most of the output declaration (lines 2 and 8, respectively). In contrast, the changes to the Voodoo program are minimal (see Figure 4 for a textual diff): the constant in line 3 now encodes the number of lanes rather than the size of a partition and the generation of successive runs has been replaced with
the generation of circular lane ids in line 4. This change causes the records to be scattered in a round-robin pattern in lines 5-7, which naturally maps to SIMD instructions. This example shows how a laborious code change is made simple by Voodoo. This simplicity is what we mean by tunable.

The previous example also introduces the concept of controlled folding, which we describe in more detail later on.

In summary, the design of Voodoo is driven by two primary goals: portability and tunability. These goals are often contradictory, as exemplified by the case of platform-specific extensions of C such as SIMD intrinsics: they can improve performance if the hardware efficiently supports them but hurt portability and sometimes even performance if they are not or only badly implemented [24]. Voodoo alleviates these problems by providing a layer of abstraction that a) can be translated into efficient code for a variety of hardware platforms, b) allows easy and fine-grained control over the generated code while keeping the abstraction simple in order to c) allow reasoning about a program, both in terms of semantics as well as cost (given a hardware platform). In the rest of this section we describe the data model and operators that allow Voodoo to achieve these properties.

Footnote 1: Note that, since we generate code, we have information about factors such as data sizes at compile time.

2.1 Data: Structured Vectors

The Voodoo data model is based on integer-addressable vectors. We chose this because virtually all hardware platforms implement some kind of integer-addressable memory. Consequently, Voodoo stores data using a thin layer of abstraction over such integer-addressable memory: a model we term Structured Vectors. A Structured Vector is an ordered collection of fixed-size data items, all of which conform to the same schema. For convenience, we allow data items to contain (nest) other structured data items. Structured Vectors are equivalent to one-dimensional arrays of structs in ANSI C and can, thus, be mapped naturally to native code in a C-derived language such as OpenCL C. We currently only allow scalar types and nested structs as fields (but may, in the future, add fixed-size arrays as a convenience feature). To illustrate this, Figure 7 shows two vectors: an input (on top) and an output vector (bottom). The input vector has two attributes (.fold and .value) and 8 elements.

To address an attribute of a vector in Voodoo code, we use Keypaths to "navigate" the nested structures. In notation, keypaths are marked with a preceding dot: for example, the path .value designates the value component of every tuple in Figure 7. Since structures can be nested, keypaths can have more than one component (e.g., .input.value).

Pointers and NULL values. Voodoo has no notion of pointers. References to tuples of vectors can be represented similar to foreign keys: integer values encoding a position in another vector. We provide primitives to resolve these references (the gather operation). Regarding NULL values, we decided to not impose a specific way but enable common design patterns to deal with NULLs such as bitmaps
(implemented in, e.g., HyPeR), reserved values (MonetDB) or sparse attributes (Postgres) in Voodoo. In the rest of this paper, we implement NULL values using MonetDB's scheme. Voodoo does have the notion of an "empty" field value, which we denote with a dedicated symbol (e.g., in Figure 7). Empty slots occur if, e.g., values are not set in a scatter or not selected in a foldSelect. We illustrate the usefulness of this concept for efficient code generation in Section 3.

2.2 Controlled Folding

A key challenge in Voodoo is providing declarative operators that still give control over tuning parameters (such as parallelism). To address this challenge, we developed a concept we term Controlled Folding, which is used to express aggregation (producing a single value) as well as partition-wise selections (producing a sequence of tuple ids).

The basic idea of Controlled Folding (illustrated for the case of aggregation in Figure 7, as well as in the use of Divide and Modulo, respectively, in Figures 3 and 4) is similar to fold operations in functional languages (e.g., Haskell): reduce a sequence of values into a single value using a binary function. Controlled fold operators in Voodoo are a generalization of functional folds: in addition to the vector of values to fold, they accept a second vector that we term the control vector of the operation. The control vector effectively provides the partition ids for values being folded.

Figure 7: Voodoo Fold operations are controlled

The effect of the control vector on the output of an operator is that, when sequentially traversing the values, the operator also traverses the aligned control sequence. As it does this, it folds all adjacent tuples that have the same partition id into a single output value; the beginning of a new run of partition ids causes the fold to start a new result. The result of each sub-fold is written to the output cell at the start of the partition. The slots between one fold result and the next are padded with empty values. The padding avoids the need for synchronization when processing runs in parallel (we describe how to eliminate the storage overhead of this in Section 3.1).

Controlled folding is a key abstraction in Voodoo that allows us to obtain parallel performance from multiple hardware platforms. As we show in the following, it is a very powerful abstraction because it allows declarative specification of the output of partition-wise operations.

Table 2: Voodoo Operators

    Maintenance
      Load(.keypath)
          Load a vector identified by keypath from persistent storage.
      Persist(.keypath, V)
          Persist vector V, making it available from persistent storage under .keypath.

    Data Parallel
      BitShift(.out, V1, .kp1, V2, .kp2)
          Shift the value of each item in V1.kp1 by V2.kp2.
      LogicalAnd/Or(.o, V1, .p1, V2, .p2)
          Logical operations.
      Add, Subtract, Multiply, Divide, Modulo
          Arithmetic operations.
      Greater, Equals
          Comparison operations.
      Zip(.out1, V1, .kp1, .out2, V2, .kp2)
          Create new vector with substructure V1.kp1 as .out1 and V2.kp2 as .out2.
      Project(.out, V, .kp)
          Create new vector with substructure V.kp as .out.
      Upsert(V1, .out, V2, .kp)
          Copy V1 and replace or insert attribute .out with value V2.kp.
      Scatter(V1, V2, .kp2, V3, .pos)
          Create a new vector of size V2. Fill the slots of the new vector by placing each value of V1 into position V3.pos. Values are overwritten on conflict. Scatters are performed in order within a value-run in V2.kp2; runs have no order guarantees with respect to each other.
      Gather(V1, V2, .pos)
          Create a vector of size V2, filling it by resolving position V2.pos in V1. Out-of-bounds positions result in empty slots.
      Materialize(V1, V2, .kp2)
          Materialize vector V1 in memory. Materialize chunks of sizes according to runs in V2.kp2 (X100-style [33] processing).
      Break(V1, V2, .kp)
          Break up V1 into segments according to the runs in V2.kp (pure tuning hint).

    Fold
      Partition(.out, V1, .v, V2, .pv)
          Generate a scatter position vector to partition V1.v according to the list of pivots V2.pv.
      FoldSelect(.out, V1, .fold, .s)
          Generate a vector of positions of slots in V1 that have .s set to non-zero. Align the output to value runs in .fold (see Figure 7).
      FoldMax/Min/Sum(.out, V1, .fold, .agg)
          Calculate Max/Min/Sum for every run in .fold. Align output value with start of run.
      FoldScan(.out, V1, .fold, .s)
          Prefix-sum the values of V1.s (start of new run in .fold starts new sum).

    Shape
      Range(.kp, fromI, [vInt|v], stepI)
          Generate a vector with the same size as v with values starting from "from", increasing by "step".
      Cross(.kp1, v1, .kp2, v2)
          Generate the cross product of the positions of v1 and v2.

2.3 Operators

In this section, we briefly describe the core operators of Voodoo. The goal is to provide an intuition for the kinds of operators we support. A detailed description is provided in Table 2. Voodoo's operators fall into four categories:

1. Maintenance Operations manipulate the persistent state of the database. They import data, persist data to the database and load it back.

2. Data-parallel Operations operate on aligned tuples in two input vectors. They include standard operations such as arithmetic or logical expressions and Zips. Gather and Materialize also fall into this category because they are trivial to parallelize. The size of the output of these operators is the size of the smaller input. Scatter takes a third parameter that specifies the size of the output and technically involves a consistent write to memory. However, as virtually all hardware platforms implement such a primitive, we consider Scatter data parallel.

3. Fold Operations are operations that require some level of synchronization. In general, this includes all operations where the value at a position in the output depends on more than one value of the input vector. Naturally, this holds for aggregations. However, it also holds for foldSelect because it is order preserving: consequently, the position of a tuple in the output depends on the number of qualifying tuples that precede it. Partition also takes a vector of pivots as a second input. The size of the output is the size of the input vector (not the pivot vector for partitioning).

4. Shape Operations are operations that create vectors with values that are not based on the data values of other vectors but only on their size. Shape operations do not take any input keypaths because they do not process any attribute values. While we use these operators to generate constants, their most important use is instead to create run-control vectors for fold operations. By generating appropriate fold attribute values and maintaining metadata that describes the runs, we can declaratively control the degree of parallelism in the Voodoo program. Consequently, we call such generated attributes Control Attributes. We describe the functioning of these attributes in more detail in the next section.

3. VOODOO BACKENDS

The declarative nature of the Voodoo operators makes it easy to provide different backend implementations. While we foresee implementations in different contexts such as distributed processing, we focus our efforts on single-node (multi-core) backends in this paper. We have implemented two backends: an interpreter on top of C++ standard library container classes as well as a backend that compiles Voodoo code to highly efficient OpenCL kernels. We will start this section with an in-depth discussion of the design of the OpenCL backend and finish with a brief discussion of the interpreter.

Figure 8: Select & Hierarchically Aggregate in Voodoo

3.1 The OpenCL Compiler

The purpose of the OpenCL compiler is to generate highly efficient, parallel code that avoids unnecessary data materialization. It does so by generating fully inlined, function-call-free OpenCL kernels from sequences of multiple Voodoo operators. The generation of the kernels is strongly inspired by the code generation process in HyPeR [18], and,
following their nomenclature, we will refer to a generated piece of code as a fragment. Result materialization to memory only occurs at the seams between fragments. As in HyPeR's query compiler, the Voodoo to OpenCL compiler traverses the plan in a dependency order and appends statements to a fragment. However, the code generation process in Voodoo is more involved than in HyPeR for two reasons: first, we generate data-parallel code whenever possible and only generate code with reduced parallelism when necessary, and second, Voodoo query plans are DAGs rather than trees (see Figure 8) to enable sharing of intermediate results.

In the following, we illustrate the code generation process taking these complicating factors into account. As a running example, we use a simplified version of TPC-H Query 1 (see Figure 8). The query is:

    SELECT SUM(l_quantity) FROM lineitem GROUP BY l_returnflag

Figure 9: Select & Hierarchically Aggregate in OpenCL

3.1.1 Code Generation

As in earlier in-memory data management systems such as Vectorwise/MonetDB X100 [33] and HyPeR [18], we aim to avoid the materialization of intermediate results. Similar to HyPeR, we generate code by traversing the execution DAG in a linearized order (top to bottom in Figure 8).

Controlling Parallelism. Most Voodoo programs contain fully parallel as well as controlled fold operations. The OpenCL backend needs to efficiently map both to kernels with the appropriate degree of parallelism. To this end, Voodoo assigns an Extent and an Intent to each generated code fragment. The Extent is the degree of (data) parallelism (roughly equivalent to the global work size in OpenCL or the number of parallel threads working on a vector) while the Intent is the number of sequential iterations per parallel work-unit. A fragment of Extent 1 is fully sequential while a fragment of Intent 1 is fully parallel. Before describing how we derive these parameters from the Control Vectors, we first discuss how they affect a generated program.

The DAG property of our query plans forces us to maintain multiple active fragments at the same time. To illustrate this, consider the case of compiling a program in which a number of fully parallel statements are interrupted by a run-controlled sequential one. In this case, Voodoo creates a sequential fragment in addition to the already active parallel fragment; neither
isexecuteduntilneeded.Whenprocessinganinstruction,thecompilerchoosesacodefragmentthathasthesameextentasthecurrentstate-ment.Thisprocessisappliedinastraightforwardmannerfordata-parallel,maintenance,andshapeoperatorsinTa-ble2.Forfoldoperators,wedistinguishthreecases:a)iftherunsareoflength1(theextentisnandtheintent1),thefold-operationisfullydata-parallelandcanbeappendedtoacodefragmentoftherightextentb)ifthereisonlyasinglerunoflengthn(theintentisnandtheextent1),thefold-operationisfullysequentialandcanonlybeappendedtoasequentialfragmentoftherightintentc)ifthelengthoftherunsislessthanorequaltothesupportedpartitionsize,valuesarewrittentothebeginningoft
hepartition.Inthatcase,wedonotneedtointr
he partition. In that case, we do not need to introduce a global synchronization point, i.e., create a new fragment. This means the fold can apply to any fragment that has the same extent. Specifically, this can be achieved by synchronizing and sequentially processing the input using a subset of the active data items by disabling some of the cores in OpenCL. If no fragment can be found that has the same extent as the current operator, a new fragment is created.

Figure 10: Group & Aggregate

Example. To illustrate this process, consider Fragment 1 in Figure 8. It is easiest to understand this diagram in terms of the red operators, which either partially or completely interrupt the pipelining of the plan. Starting from the top, the break operator causes the greater-than expression to be computed in a fully data-parallel manner, and the result of this expression to be materialized in memory (or cache). Then, the first fold (.position = foldSelect) computes the positions in the lineitem table where the predicate on l_shipdate is satisfied. The parallelism of this operator is controlled by the .fold control vector. In computing .fold, the $intent variable encodes the length of the runs, allowing a transition from fully parallel ($intent = 1) to fully sequential ($intent = n). This foldSelect operator results in one loop per fold (the extent is |input| / $intent); each loop in this case just loops over the boolean values materialized at the break operator and emits the non-zero positions. Finally, the third red operator (foldSum) computes the partial summation of each fold. It uses the same parallelism as the foldSelect, so no additional materialization is necessary, and they can be pipelined into the same loop instance.

There are a few additional things to note about this diagram. First, the parallelism of each operator is determined by the parallelism of its first red ancestor, as the Voodoo compiler aggressively inlines operators between the red pipeline-breaking operations. Second, the purple operators (and associated vectors) are "virtual" in the sense that they simply control the parallelism of the generated program but are not computed (or generated) at runtime.

Figure 9 illustrates the resulting OpenCL program assuming $grainsize is set to 4. The figure illustrates that the predicate is evaluated fully data-parallel. The select as well as the first sum are evaluated using locally reduced parallelism but without the need for a global barrier. The final sum (Fragment 2 in Figure 8) is entirely sequential and requires a global barrier (in the form of a new OpenCL kernel).

Figure 11: Virtual Scatter

Note that, to keep the example concise, the foldSelect and the first foldSum share the fold attribute (stemming from the same control vector). This is not necessary in practice, as they are independently tunable. Naturally, selecting the optimal parallelization strategy is hard and not the focus of this paper. The example does, however, show how a complex optimization decision can be encoded into a (set of) integer constant(s). However, it hinges on an effective way to keep control vector metadata.

Maintaining Run Metadata. While control vectors allow us to specify the degree of parallelism in an abstract manner, we still have to generate appropriate code from that specification in the backend. For that purpose, the Voodoo compiler maintains descriptive metadata about each generated vector attribute: the start (from), a step factor and a modulo cap. Attribute values generated by range operators can, thus, be calculated using the following equation:

$v[i] = \mathit{from} + \lfloor i \cdot \mathit{step} \rfloor \bmod \mathit{cap}$

Dividing a vector by a constant x is equivalent to dividing step by x. A modulo by x sets the cap to x. When combining (e.g., through addition) a control vector with a data vector (to encode parallel grouped aggregation), we keep the values of the control vector in addition to the data values. We expect the frontend to ensure that the bits of the control vector do not conflict with the data bits and throw an exception when this assumption is violated.

3.1.2 Empty Slot Suppression

As described in Section 2.2, the outputs of all operations are of statically known size and are padded with empty slots. While this seems wasteful at first, it makes it possible to efficiently execute Voodoo on massively parallel hardware without the need for expensive write conflict handling. However, by applying control vector metadata knowledge, we can drastically reduce the memory consumption. To illustrate this, consider the foldSum steps in Figure 9. The folding creates a predictable number of empty cells in the vector. To reduce the memory footprint, slots that can be guaranteed to never be filled with values (e.g., when folding all values of a work group into a local sum) can simply not be allocated. We suppress empty slots by allocating a smaller buffer, appropriately modifying the generated write cursor and annotating the vector with a metadata field to keep track of the empty slots.

3.1.3 Virtual Scatter

Another case in which we exploit compile-time knowledge to avoid runtime materialization is the case of scatter. A naively implemented scatter would write all input values to a slot in the output, causing a break in the fragment and substantial random memory traffic. This may, however, be unnecessary if the scattered vector is only created to be used, e.g., as an input to an aggregation. Figure 10 depicts this very common case. To avoid the intermediate materialization, we can drop the assumption that OpenCL work items
(i.e., positions in the input vector) and data items (i.e., positions in the output vector) are aligned. This is illustrated in Figure 11: we tag a vector with a scatter position (denoted with an @ in the third vector). That scatter position is a per-work-item local variable (split into partition start and tuple offset) that will be used to determine the position of each output tuple if the vector is ever materialized. Since most vectors are never materialized, scatter can become a very cheap operation that is paid for only when and if it is fully materialized. Figure 11 illustrates how we exploit this in the case of a (single-partition) grouped aggregation: the partition generates a partition id which is work-item local. The scatter creates a (virtual) vector that contains the input values annotated with the scatter path. The vector is read by the foldCount (a macro on top of foldSum), which generates the (partition-aligned) counts. Finally, these counts are compacted into a contiguous memory region for the result.

3.2 The Interpreter

The interpreter mainly serves as a reference implementation; it uses vectors of maps to represent data as well as control vectors. The interpreter materializes all intermediate vectors and is, in that respect, a classic bulk-processor. However, since it stores all data in vectors of maps, it uses virtual function calls to retrieve values from tuples. The combination of full materialization and virtual function calls means that this backend is not designed for high performance. It is, rather, a small reference implementation that is useful for debugging and verification because all intermediates are materialized and, thus, inspectable.

4. A RELATIONAL FRONTEND

To demonstrate Voodoo's effectiveness as a database kernel as well as to simplify experimentation, we implemented a prototype of a relational query processing engine on top of the Voodoo algebra. Because MonetDB is open source and easy to extend, we chose to integrate Voodoo into it as an alternative execution engine. However, by replacing MonetDB's execution engine and physical optimizer, we effectively reduced its role to data loading and query parsing. To illustrate the effectiveness of the resulting system, let us briefly walk through the aspects of storage, query processing and optimization.

Loading. MonetDB exposes its internal catalog information, including pointers towards the backing files, through a queryable SQL interface. We exploit this to directly load data from the file system, bypassing the query processor. Upon startup, Voodoo loads data from the internal catalog of MonetDB. We directly copy data from disk into the processing device, using the same storage format MonetDB uses: binary column-wise, using dictionary encoding for strings.

Queries. We use MonetDB's SQL-to-relational-algebra compiler to parse SQL queries and remove logical concepts such as nested subqueries. From the relational algebra representation, we generate Voodoo plans, thus bypassing MonetDB's physical optimizer. Voodoo plans are similar in complexity to those that MonetDB uses for physical optimization and query evaluation.

Optimization. Since Voodoo compiles MonetDB's logical plans, it inherits the logical optimizations that MonetDB applied (join order, query unnesting, etc.). Beyond that, the physical optimizer has a number of optimization flags that enable hardware-specific optimizations: cache-conscious partitioning, predication, parallelization strategy (for selections and aggregations) as well as the targeted device.

While we use these flags for the microbenchmarks in Section 5.3, we disable them when comparing the macrobenchmark results (Section 5.2). This allows a comparison of the efficiency of the generated code to that of HyPeR (which also does not apply such optimizations). However, we enable the generation of parallel plans by specifying appropriate control vectors. Beyond that, we use identity hashing on open hash tables and derive their size from the input domain (using only min and max). A detailed per-query discussion of the query plans can be found in our upcoming technical report [23].

5. EVALUATION

To evaluate our approach, we study the system with respect to our two design goals: portability and tunability. The portability experiments in Section 5.2 show that Voodoo not only runs on different hardware platforms but matches (and even outperforms) existing systems tailored to specific hardware architectures. Specifically, we compare against HyPeR [18] and MonetDB/Ocelot [13], which target CPU and GPU architectures, respectively. We conduct the evaluation using a subset of the TPC-H benchmark that was used to evaluate these systems when they were presented.

The tunability experiments in Section 5.3 demonstrate how Voodoo facilitates the implementation and examination of various hardware-conscious optimizations on different platforms. We highlight a number of hardware- and data-dependent trade-offs that Voodoo allows us to explore.

5.1 Setup

We ran our CPU experiments, as well as all of our GPU experiments, on a Dell server with a single Intel Skylake Xeon E3-1270 v5 at 3.60 GHz with 64 GB of RAM, running Ubuntu Linux 15.10 (kernel 4.2.0-27). The GPU was a GeForce GTX TITAN X with 12 GB of global memory using CUDA version 7.5. All micro-experiments were compiled using Intel ICC 16. For both CPUs and GPUs, we only counted the execution time once the data was loaded into their respective memories.
We ignore costs for result output and do not address the PCI bottleneck.

5.2 Baseline Portability: TPC-H

We ran a significant subset of the TPC-H queries on a scale factor 10 dataset using Voodoo, HyPeR and, where applicable, Ocelot (Ocelot does not actually support all of the queries we evaluated). These queries cover most of the relational operators (selections, aggregates, group-by and joins), exposing standard problems in relational query processing [6]. Note that our goal is to demonstrate the capability of the Voodoo system to generate efficient code, not the quality of the relational query frontend and optimizer, which are still quite basic. In terms of the optimizations of the generated code, our implementation is roughly equivalent to the code generation that is implemented in HyPeR (no vectorization, no manual SIMD instructions). However, we aggressively exploit available metadata (min, max, FK constraints) which, in many cases, allows us to bypass operations such as hashing or collision management which HyPeR has to perform.

Figure 12: TPC-H Performance on GPU

The results on the CPU (Figure 13) and GPU show that, as an abstraction, Voodoo imposes little overhead. Generally, performance is comparable to HyPeR's. Voodoo performs better for compute- and lookup-intensive queries (such as 5, 6, 9 and 19) because of the aggressive exploitation of metadata and the use of SIMD instructions by the OpenCL compiler. HyPeR performs better for order-by/limit queries since it evaluates these using priority queues, which avoid expensive materialization of the full result (in Voodoo, the order-by/limit clauses were omitted).

When comparing to Ocelot, the benefit of executable code generation becomes clear: where HyPeR and Voodoo avoid expensive materialization of intermediate results, Ocelot pays a high price for doing so. In particular, the low-selectivity (high output cardinality) queries such as query 1 expose this problem. This is an indication that Ocelot was developed for an architecture with a high memory bandwidth such as GPUs. When evaluating on the GPU, with its 300 GB/s memory bandwidth, we see that Ocelot suffers significantly less from that design decision (see Figure 12).

5.3 Tunability

We now turn our attention to the ability of Voodoo to experiment with different hardware-aware optimizations, focusing on ease of implementation and the performance trade-offs of these different techniques.

Just-in-time layout transformations. Most in-memory databases store data using a fixed physical schema (usually row-wise or column-wise). However, it has been shown that performance can be increased by transforming the layout before or even during query execution [34]. As an operation that can benefit from such an optimization, we consider the evaluation of an indexed foreign-key join (essentially a positional lookup) on multiple columns of the same table.

We consider three possible implementations of this operation in C and Voodoo: a single traversal of the index column with lookups into both columns (termed Single Loop), two consecutive traversals of the positions, each resolving the keys into one of the target columns (termed Separate Loops), and the transformation of the target table from column- to row-wise storage, followed by a single loop over the positions and their resolution (termed Layout Transform).

As shown in Figure 14a, the best implementation is dependent on the lookup pattern into the target table: if the lookups are sequential, locality is always good and the Single Loop implementation performs best. If the lookups are random and the target table small (4 MB), the best implementation is to resolve the keys in two Separate Loops. If the lookups are random and the target table large (128 MB), a Layout Transform pays off: because the values of both projected columns are co-located, this optimization reduces the number of random cache misses by a factor of two.

Expressing these optimizations in Voodoo involves a break operator between the two gathers to switch from Single Loop to Separate Loops and a zip and materialize to switch to the Layout Transform implementation. As displayed in Figure 14b, Voodoo accurately matched the performance of the C implementation on the CPU. Figure 14c shows that the performance on the GPU is similar but the Separate Loops version is outperformed by the Layout Transform implementation in all cases. This experiment illustrates how the lack of large per-core caches on the GPU penalizes random accesses earlier than on a CPU. Still, the optimization ports reasonably well.

Selective Aggregation. Since selections are one of the most frequently used operators, an efficient implementation is one of the cornerstones of good in-memory performance. The main problem with the default (branching) implementation is the penalty for branch mispredictions. As shown in Figure 1, avoiding these branch mispredictions via a branch-free predication technique can be beneficial as it trades memory traffic for fewer mispredictions, although the benefit depends on selectivity. To avoid the additional memory traffic of predication, the processing can be vectorized: divided into cache-sized chunks, where for each chunk, a position list is generated using a branch-free implementation. This position list is then traversed and processed in a second (cache-sized) loop. We implemented these three design alternatives in C and Voodoo; Figure 15a shows the selectivity-dependent results.
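The three selection strategies described above (branching, branch-free predication and vectorized processing via a position list) can be sketched in C for the query of Figure 15, select sum(v2) from facts where v1 between $1 and $2. This is a minimal illustration rather than the code used in the experiments: the function names, the 32-bit column type and the chunk size CHUNK are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chunk size; assumed small enough for the position list to stay in cache. */
enum { CHUNK = 1024 };

/* Branching: one conditional jump per tuple, which the CPU has to predict. */
int64_t sum_branching(const int32_t *v1, const int32_t *v2, size_t n,
                      int32_t lo, int32_t hi) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v1[i] >= lo && v1[i] <= hi) {
            sum += v2[i];
        }
    }
    return sum;
}

/* Branch-free (predicated): every v2[i] is read and multiplied by the 0/1 predicate. */
int64_t sum_predicated(const int32_t *v1, const int32_t *v2, size_t n,
                       int32_t lo, int32_t hi) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t keep = (v1[i] >= lo) & (v1[i] <= hi);
        sum += keep * v2[i];
    }
    return sum;
}

/* Vectorized: per chunk, build a position list with cursor arithmetic,
 * then aggregate only the qualifying positions in a second loop. */
int64_t sum_vectorized(const int32_t *v1, const int32_t *v2, size_t n,
                       int32_t lo, int32_t hi) {
    uint32_t positions[CHUNK];
    int64_t sum = 0;
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t len = (n - base < CHUNK) ? n - base : CHUNK;
        size_t count = 0;
        for (size_t i = 0; i < len; i++) {
            positions[count] = (uint32_t)i;  /* unconditional write */
            /* the write cursor advances only for qualifying tuples */
            count += (size_t)((v1[base + i] >= lo) & (v1[base + i] <= hi));
        }
        for (size_t j = 0; j < count; j++) {
            sum += v2[base + positions[j]];
        }
    }
    return sum;
}
```

All three functions return the same sum; they differ only in how the predicate outcome steers control flow and memory accesses. Note that an optimizing C compiler may itself turn the branching loop into branch-free code, so any measurement of these variants should check the generated assembly.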
In Figure 15a, we see the characteristic behavior of speculative execution, with worst-case performance at 50% selectivity. In contrast, the branch-free implementation shows flat performance that outperforms the branching implementation for mid-range selectivities. The vectorized version performs significantly better than the branch-free implementation and, for selectivities above 1%, outperforms the branching version. Note that the C code of these versions looks very different: two loops and an additional buffer vs. a single loop with no buffering. In Voodoo, we achieve virtually identical performance (Figure 15b) but only need to insert a single additional operator: a materialize with a control vector to encode the intermediate buffer size.

Running the Voodoo code on the GPU creates a different picture (see Figure 15c): since the GPU does not speculatively execute code, the predicated version only adds additional memory traffic without any benefit. Surprisingly, the vectorized implementation hurts performance: the additional position buffer causes additional memory traffic and, even worse, is filled sequentially, which limits the degree of parallelism that can be used to hide latencies. We conclude that this optimization does not port well to GPUs.

Figure 13: TPC-H Performance on CPU

Figure 14: Just-in-time layout changes. (a) Implemented in C (b) Voodoo on CPU (c) Voodoo on GPU

Branch-Free Foreign-Key Joins. In the last microbenchmark, we present a novel tuning technique that illustrates the interplay between data access and processing. We consider a table scan, the application of a selection and an indexed foreign-key join into a single, large target table with subsequent aggregation. The SQL query is:

    SELECT sum(target.v)
    FROM facts, target
    WHERE facts.target_fk = target.pk AND facts.v $1

The straightforward approach is Branching: a sequential scan of the selection column (facts.v), the evaluation of the predicate, and a lookup and aggregation of qualifying tuples. A branch-free alternative is to unconditionally perform the lookups and multiply the resulting values with the outcome of the predicate (0 or 1) before aggregation (labeled Predicated Aggregation in Figure 16). The selectivity-dependent results are displayed in Figure 16a: the branching version exhibits the typical bell-shaped curve that indicates bad speculation. The branch-free variant is significantly more expensive, due to the high number of random cache misses that result from the unconditional lookups. To address the bad cache behavior, we devised an optimization we term Predicated Lookups: before performing the lookup, we multiply the position with the outcome of the selection predicate. This way, all non-qualifying lookups hit the same address (position zero), which will be held in one "very hot" cache line. This addresses the bad cache behavior but causes an extra arithmetic operation (note that the looked-up values still need to be predicated to ensure correctness). The result (Predicated Lookups in Figure 16a) is an implementation that performs significantly better than the branch-free version and outperforms the branching version for much of the parameter space. Voodoo matches this result very accurately on the CPU (see Figure 16b).

On the GPU, Voodoo exposes different performance trade-offs: the Branching implementation shows the best performance over most of the parameter space and is only outperformed by the Predicated Lookups version for selectivities above 80%. This result exhibits another GPU design decision: the sacrifice of integer arithmetic for floating point performance. Since the Predicated Lookups version performs two integer arithmetic operations, performance is dominated by that; this optimization also does not port well.

6. RELATED WORK

The Voodoo project was inspired by the disparity between the large number of optimization techniques in the literature and the small number of such techniques that are actually used in systems. Before concluding, let us provide an overview of some techniques and systems that try to use them.

Low-level optimizations for modern hardware. Most of the work in the literature addresses specific hardware components in isolation. Among these are techniques addressing hierarchical caches [3], branch predictors [28] and SIMD registers [25]. Much of this prior work shows that each of these techniques can lead to orders-of-magnitude performance improvements. However, these techniques were not studied in the context of a full data management system stack or generalized into a full operator model. The most recent of these, Polychroniou et al. [25], develop a set of algorithms that employ advanced techniques for SIMD-enabled processing. Most of these can be translated directly into equivalent Voodoo code (see our upcoming technical report [23] for details). The exceptions are the cases in which data structures are not write-once: when filling hash-table slots with unique marker values (to recognize conflicts) and subsequently overwriting them with data values, and when swapping values through a cuckoo table until an empty slot is found. The former can be implemented by using a second (logical) buffer to hold the marker values. The latter can only be approximated in Voodoo because each cuckoo iteration needs to (logically) create a new data structure. While the memory overhead can be removed at compile time, the program grows linearly with the number of
cuckoo iterations. This bounds the number of possible iterations to a (reasonably small) constant.

Figure 15: select sum(v2) from facts where v1 between $1 and $2. (a) Implemented in C (b) Voodoo on CPU (c) Voodoo on GPU

Figure 16: Selective Foreign-Key Join Performance. (a) Implemented in C (b) Voodoo (run on CPU) (c) Voodoo (run on GPU)

Another line of research targets the use of programmable GPUs with their diverse tuning techniques [11, 12, 22]. GPU programming textbooks (e.g., [20]) contain a number of platform-specific heuristics to choose the right set of techniques given a problem. Unfortunately, the low-level programming paradigm of frameworks such as OpenCL or CUDA makes this kind of optimization hard, often requiring a substantial rewrite of the program for each architecture or optimization.

High Performance Computing (HPC). The HPC community has aimed to create easy-to-use abstractions over highly parallel hardware for more than three decades. The most prominent artifact of this is the BLAS standard with several implementations. Aimed at linear algebra, however, BLAS implementations such as Intel's MKL [31], cuBLAS [4], MAGMA [2] or OpenBLAS solve a restricted and, most importantly, data-independent tuning problem. Hence, tuning is usually done by the developers of the library, rather than generating data-dependent code when the application runs. Compiler frameworks such as Delite [30] or Dandelion [29] are designed to facilitate this process but it still remains with the library developer and, thus, static.

General-purpose compile-time abstractions such as Intel's Array Building Blocks (ABB) [19] inherited this problem. ABB specifically was abandoned in favor of "tunable" approaches such as Cilk [5], Threading Building Blocks [27] and OpenMP [10]. While these offer high tunability, they are ill-suited to automatic code generation at runtime (see Section 2). The approach closest to ours is ArrayFire [17], which provides abstract vector operations backed by multiple hardware-specific backends (CUDA, OpenCL and C++). ArrayFire even generates code at runtime, but only for arithmetic expressions applied using a map operator.

State-of-the-art systems for in-memory analytics. Some of the techniques used by this work involve bulk-processing [7], vector processing (MonetDB/x100 [33]) and just-in-time compilation (a.k.a. JIT-compiling) [15]. HyPeR [18] employs a simple direct translation of relational algebra to LLVM assembler, which is then executed. HyPeR aims to run both analytic and transactional workloads in the same system. Legobase [14], in addition to generating LLVM or C code from SQL, lets database developers express internal database data structures and algorithms using a high-level language (Scala) and then have them compiled down to low level with the rest of the query. TupleWare [9] aims to handle a larger class of computations including iterative computations and UDFs, and employs the code generation technique to efficiently integrate framework code with the UDFs. Ocelot aims to port MonetDB to exploit GPUs [13]. Voodoo is complementary to HyPeR and Legobase in that Voodoo can be used as a lower layer for such systems.

7. FUTURE WORK

We believe Voodoo to be useful as a foundation for much future work in high- as well as low-level optimization of database queries. The machine-friendly design of Voodoo lends itself to automatic exploration of the database design space. Specifically, an automatic, incremental, runtime re-optimization system is enabled by the design of Voodoo. Such a system could employ current and future low-level optimizations. However, it will still have to handle the large design space and may require new abstractions. We believe that declarative optimizers [8, 16] can be effectively combined with Voodoo to handle this complexity.

The current Voodoo design deliberately omits control statements (for, if, while, ...). While these are not necessary to implement relational algebra, they enable runtime optimizations such as load balancing, dynamic resizing of data structures (e.g., hash tables) or data structures that have complex behavior (such as priority queues). However, these are exactly the kinds of optimizations that are difficult to port to massively parallel architectures. A solution to this problem is likely to be based on the cooperation of multiple devices (CPUs for control, GPUs for data processing). We plan to develop such a solution in the near future. We also plan to expand the current algorithms to add non-relational operations. An example of this is supporting regular expression matching (which has efficient massively parallel solutions) but also support for graphs or arrays.

8. CONCLUSION

Implementing efficient in-memory databases is challenging, and often requires being aware of the characteristics of the workload and data as well as the specific architectural properties of the hardware. In this work, we proposed Voodoo, a novel unifying framework to explore, implement and evaluate a range of hardware-specific tuning techniques that can capture all of these aspects of efficient database design. Voodoo consists of an intermediate vector algebra that abstracts away hardware specifics such as hierarchical caches, many-core architectures or SIMD instruction sets while still allowing frontend developers to optimize for them. Central to our approach is a novel technique called control vectors, which expose parallelism in the program and data to the
compiler, without the use of hardware-specific abstractions.

We showed that Voodoo can be used as an alternative backend for an existing system (MonetDB), and that it can match and even outperform previously proposed highly optimized in-memory databases. We also demonstrated that Voodoo makes it easy to tune programs and explore design alternatives for different hardware architectures.

9. REFERENCES

[1] D. Abadi, D. Myers, D. DeWitt, and S. Madden. Materialization strategies in a column-oriented DBMS. In ICDE 2007. IEEE, 2007.
[2] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series, volume 180, 2009.
[3] C. Balkesen, J. Teubner, G. Alonso, and M. T. Ozsu. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. ETH Zurich, Tech. Rep, 2012.
[4] S. Barrachina, M. Castillo, F. D. Igual, R. Mayo, and E. S. Quintana-Orti. Evaluation and tuning of the level 3 CUBLAS for graphics processors. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on. IEEE, 2008.
[5] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1), 1996.
[6] P. Boncz, T. Neumann, and O. Erling. TPC-H analyzed: Hidden messages and lessons learned from an influential benchmark. In TPC-TC. Springer, 2013.
[7] P. A. Boncz, M. L. Kersten, and S. Manegold. Breaking the memory wall in MonetDB. CACM, 12, 2008.
[8] T. Condie, D. Chu, J. M. Hellerstein, and P. Maniatis. Evita raced: metacompilation for declarative networks. PVLDB, 1(1), 2008.
[9] A. Crotty, A. Galakatos, K. Dursun, T. Kraska, U. Cetintemel, and S. Zdonik. Tupleware: "big" data, big analytics, small clusters. In CIDR, 2015.
[10] L. Dagum and R. Menon. OpenMP: an industry standard API for shared-memory programming. IEEE Computational Science and Engineering, 5(1), 1998.
[11] C. Gregg and K. Hazelwood. Where is the data? Why you cannot debate CPU vs. GPU performance without the answer. In ISPASS'11. IEEE, 2011.
[12] B. He, M. Lu, K. Yang, R. Fang, N. Govindaraju, Q. Luo, and P. Sander. Relational query coprocessing on graphics processors. TODS, 34(4):21, 2009.
[13] M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. VLDB, 2013.
[14] Y. Klonatos, C. Koch, T. Rompf, and H. Chafi. Building efficient query engines in a high-level language. PVLDB, 7(10), 2014.
[15] K. Krikellas, S. Viglas, and M. Cintra. Generating code for holistic query evaluation. In ICDE, 2010.
[16] M. Liu, Z. G. Ives, and B. T. Loo. Enabling incremental query re-optimization. In SIGMOD, 2016.
[17] J. Malcolm, P. Yalamanchili, C. McClanahan, V. Venugopalakrishnan, K. Patel, and J. Melonakos. ArrayFire: a GPU acceleration platform. In SPIE Defense, Security, and Sensing, 2012.
[18] T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9), 2011.
[19] C. J. Newburn, B. So, Z. Liu, M. McCool, A. Ghuloum, S. D. Toit, Z. G. Wang, Z. H. Du, Y. Chen, G. Wu, et al. Intel's Array Building Blocks: A retargetable, dynamic compiler and embedded language. In CGO, 2011.
[20] H. Nguyen. GPU Gems 3. Addison-Wesley Professional, 2007.
[21] H. Pirk et al. CPU and cache efficient management of memory-resident databases. In ICDE, 2013.
[22] H. Pirk, S. Manegold, and M. L. Kersten. Waste not... efficient co-processing of relational data. In ICDE 2014. IEEE, April 2014.
[23] H. Pirk, O. Moll, M. Zaharia, and S. Madden. Voodoo - portable database performance on modern hardware. Technical report, MIT CSAIL, 2016.
[24] H. Pirk, E. Petraki, S. Idreos, S. Manegold, and M. Kersten. Database cracking: fancy scan, not poor man's sort! In DaMoN. ACM, 2014.
[25] O. Polychroniou, A. Raghavan, and K. A. Ross. Rethinking SIMD vectorization for in-memory databases. In SIGMOD 2015. ACM, 2015.
[26] V. Raman, G. Attaluri, R. Barber, N. Chainani, D. Kalmuk, V. KulandaiSamy, J. Leenstra, S. Lightstone, S. Liu, G. M. Lohman, et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6(11), 2013.
[27] J. Reinders. Intel Threading Building Blocks: outfitting C++ for multi-core processor parallelism. O'Reilly Media, Inc., 2007.
[28] K. A. Ross. Selection conditions in main memory. ACM Trans. Database Syst., 29(1), Mar. 2004.
[29] C. J. Rossbach, Y. Yu, J. Currey, J.-P. Martin, and D. Fetterly. Dandelion: a compiler and runtime for heterogeneous systems. In SOSP. ACM, 2013.
[30] A. K. Sujeeth, K. J. Brown, H. Lee, T. Rompf, H. Chafi, M. Odersky, and K. Olukotun. Delite: A compiler architecture for performance-oriented embedded domain-specific languages. ACM TECS, 13:134, 2014.
[31] E. Wang, Q. Zhang, B. Shen, G. Zhang, X. Lu, Q. Wu, and Y. Wang. Intel Math Kernel Library. In High-Performance Computing on the Intel Xeon Phi. Springer, 2014.
[32] H. Wu, G. Diamos, J. Wang, S. Cadambi, S. Yalamanchili, and S. Chakradhar. Optimizing data warehousing applications for GPUs using kernel fusion/fission. In IPDPSW. IEEE, 2012.
[33] M. Zukowski, P. Boncz, N. Nes, and S. Heman. MonetDB/x100 - a DBMS in the CPU cache. IEEE Data Engineering Bulletin, 1001:17, 2005.
[34] M. Zukowski, N. Nes, and P. Boncz. DSM vs. NSM: CPU Performance Tradeoffs in Block-oriented Query Processing. In DaMoN