Voodoo - A Vector Algebra for Portable Database Performance on Modern Hardware

Uploaded On 2020-11-18

Holger Pirk, MIT CSAIL, holger@csail.mit.edu
Oscar Moll, MIT CSAIL, orm@csail.mit.edu
Matei Zaharia, MIT CSAIL, matei@csail.mit.edu
Sam Madden, MIT CSAIL, madden@csail.mit.edu

ABSTRACT

In-memory databases require careful tuning and many engineering tricks to achieve good performance. Such database performance engineering is hard: a plethora of data and hardware-dependent optimization techniques form a design space that is difficult to navigate for a skilled engineer, even more so for a query compiler. To facilitate performance-oriented design exploration and query plan compilation, we present Voodoo, a declarative intermediate algebra that abstracts the detailed architectural properties of the hardware, such as multi- or many-core architectures, caches and SIMD registers, without losing the ability to generate highly tuned code. Because it consists of a collection of declarative, vector-oriented operations, Voodoo is easier to reason about and tune than low-level C and related hardware-focused extensions (Intrinsics, OpenCL, CUDA, etc.). This enables our Voodoo compiler to produce (OpenCL) code that rivals and even outperforms the fastest state-of-the-art in-memory databases for both GPUs and CPUs. In addition, Voodoo makes it possible to express techniques as diverse as cache-conscious processing, predication and vectorization (again on both GPUs and CPUs) with just a few lines of code. Central to our approach is a novel idea we termed control vectors, which allows a code generating frontend to expose parallelism to the Voodoo compiler in an abstract manner, enabling portable performance across hardware platforms.

We used Voodoo to build an alternative backend for MonetDB, a popular open-source in-memory database. Our backend allows MonetDB to perform at the same level as highly tuned in-memory databases, including HyPeR and Ocelot. We also demonstrate Voodoo's usefulness when investigating hardware conscious tuning techniques, assessing their performance on different queries, devices and data.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org.
Proceedings of the VLDB Endowment, Vol. 9, No. 14
Copyright 2016 VLDB Endowment 2150-8097/16/10.

1. INTRODUCTION

Increasing RAM capacities on modern hardware mean that many OLAP and database analytics applications can store their data entirely in memory. As a result, a new generation of main-memory-optimized databases, such as HyPeR [18], Legobase [14] and TupleWare [9], has arisen. These systems are designed to operate close to memory bandwidth speed by ad-hoc generating CPU-executable code.

However, code generation is complex, and as a result most systems were designed to generate code for a specific hardware platform (and sometimes a specific dataset or workload), and require substantial changes to target an additional platform. This is because different architectures employ very different techniques to achieve performance, ranging from SIMD instructions to massively parallel co-processors such as GPUs or Intel's Xeon Phi to asymmetric chip designs such as ARM's big.LITTLE. Exploiting such hardware properly is tricky because the benefit of most machine code optimizations is data as well as hardware dependent.

Figure 1: Performance of branch-free selections based on cursor arithmetics [28] (a.k.a. predication) over a branching implementation (using if statements)

To illustrate the complexity of these tradeoffs, Figure 1 shows the impact of predicate selectivity and architecture on in-memory selections (over one billion single-precision floats). Here, we
filter a list of values with a predicate of variable selectivity, using one of two methods: conventional if-statements that evaluate the predicate on every item, and a branch-free approach where every input is copied to the output, but the address for the next output is computed by adding the outcome of the predicate (1 or 0) for the current value (so values that don't satisfy the predicate are overwritten by the next value). The branch-free implementation executes more instructions but avoids potentially expensive branch mispredictions. On GPUs, the branching implementation is often better and never significantly worse; on CPUs the branch-free implementation can sometimes be up to 4x better in the single-threaded case (where costs are strongly branch-dominated) and 2.5x better in the multi-threaded case (which is less branch-dominated). This example shows why generating high-performance code for modern hardware is hard: even straightforward optimizations such as the elimination of branch mispredictions must be applied with knowledge of both hardware and data!

Figure 2: The Voodoo Query Processing System

Unfortunately, implementing transformations like this in existing code generation engines is hard because it requires encoding knowledge about the hardware (GPU/CPU) and data (selectivities) into the code generator. Furthermore, many such optimizations require cross-cutting changes across operators and code components. As a result of this, none of the aforementioned engines implement hardware-specific or data-driven optimizations such as the one shown in Figure 1.

To address this complexity, we developed a new intermediate algebra, called Voodoo, as the compilation target for query plans. Voodoo is both portable to a range of modern hardware architectures, including GPUs and CPUs, and expressive enough to easily capture most of the optimizations proposed for main-memory query processors in the literature, a property we will call tunability in the rest of this paper. For example, it can express different layouts (column vs. row) [26, 34], materialization strategies [1], sharing of common subexpressions [6, 14], vectorization [25], parallelization, predication (the above example) [28] as well as loop fusion and fission [32]. While supporting all of these optimizations, Voodoo maintains the efficiency of hand-written C code by just-in-time generating executable code.

Of course, Voodoo is not the first system to generate efficient executable code for modern hardware. However, existing systems occupy particular design points in the space, implementing architecture-specific techniques to achieve bandwidth and CPU efficiency. Table 1 displays the design choices of several well-known systems (we benchmark against some in our experiments). Porting any of these to a new hardware architecture (with a new set of data-specific bottlenecks) would involve a nearly complete rewrite of the system. We aim to develop an abstraction layer that makes it easy to obtain performance from new hardware architectures.

Table 1: Techniques Used in Existing In-Memory DBMSs

System            Bandwidth Efficiency Technique   CPU Efficiency Technique   Hardware Target
HyPeR [18]        Pipelining                       Compilation                CPU-only
MonetDB [7]       -                                Bulk-Processing            CPU-only
Vectorwise [33]   Cache-friendly Partitioning                                 CPU-only
Tupleware [9]     Pipelining                       Compilation                CPU-only
Ocelot [13]       -                                Bulk-Processing            GPU-Optimized
Voodoo            Tunable

We termed the key innovation that enables such "tunability" declarative partitioning, which allows a code generating frontend to provide information about the desired parallelism of operations in a hardware-independent fashion. This is specifically embodied in Voodoo by a novel technique, called controlled folding, where virtual attributes (which we term control vectors) are attached to data vectors. By tuning the value of these virtual attributes, frontends can create more or fewer partitions (yielding more or less parallelism) to adapt to hardware with different lane-widths, cache sizes, and numbers of execution units.

Specifically, we make the following contributions:

- We present the Voodoo algebra and describe our implementation of it. Our implementation compiles into OpenCL, and serves as an alternative physical plan algebra and backend for MonetDB [7], an existing high-performance query processing engine that did not previously generate machine code.
- We present a set of principles that guide the design of Voodoo and allow it to efficiently adapt to a range of hardware. These include the use of a minimal collection of declarative operators that can be efficiently executed on different hardware platforms.
- We describe the design and implementation of controlled folding as well as a compiler that generates efficient OpenCL code from Voodoo programs.
- We show that our implementation is performance-competitive with existing database engines designed for specific hardware platforms and hand-optimized code, using simpler and more portable code, and that it can capture several existing and new architecture-specific optimizations.

Note that we do not address the problem of programmatically generating optimal Voodoo code. Instead, we show that Voodoo can be used to express a variety of sophisticated hardware-conscious optimizations, some novel and some previously proposed, and argue that these could eventually be chosen via an optimizer that generates Voodoo code.

We structure the rest of this paper around the architecture of Voodoo (shown in Figure 2): after a brief discussion of our design goals in the rest of this section, we present the Voodoo algebra that is used to encode query plans in Section 2. Section 3 provides an in-depth discussion of the Voodoo OpenCL backend. In Section 4 we describe how we integrated the Voodoo kernel with MonetDB to accelerate the query processing engine. We evaluate the complete system in Section 5, discuss relevant related (Section 6) and future (Section 7) work and conclude in Section 8.

2. THE VOODOO ALGEBRA

To introduce the Voodoo algebra, we start with an example that illustrates its key features and our design principles. After that, we go on to present Voodoo's data model and operators in more detail.

1 input := Load("input") // Single Column: val
2 ids := Range(input)
3 partitionSize := Constant(1024)
4 partitionIDs := Divide(ids, partitionSize)
5 positions := Partition(partitionIDs)
6 inputWPart := Zip(input, partitionIDs)
7 partInput := Scatter(inputWPart, positions)
8 pSum := FoldSum(partInput.val, partInput.partition)
9 totalSum := FoldSum(pSum)

Figure 3: Multithreaded Hierarchical Aggregation in Voodoo

1 3,4c3,4
2 < partitionSize := Constant(1024)
3 < partitionIDs := Divide(ids, partitionSize)
4 ---
5 > laneCount := Constant(2)
6 > partitionIDs := Modulo(ids, laneCount)

Figure 4: Multithreading to SIMD in Voodoo (textual diff)

Figure 3 shows a Voodoo program (in static single assignment form) to perform a hierarchical summation: data is first partially summed on N processors, and then the N partial aggregates are themselves summed. Line 1 loads an input vector with a single column "val". Line 2 creates a vector of ids, ranging from 1 ... |input|. Lines 3 and 4 create a vector that maps each tuple to a partition by integer-dividing its id by the partitionSize (1024 tuples in the example). Line 5 computes the output position for each tuple, based on its partition. Line 6 attaches the generated partition ids to the input tuples. Line 7 partitions the input according to the positions computed in line 5 (note that this partitioning is purely logical, meaning it just causes the generated code to loop over the specified number of partitions, unless explicitly materialized). Finally, line 8 performs the per-partition aggregation, and line 9 performs the global aggregation.

This example illustrates several key properties of Voodoo and how they enable portability and performance.

Vector Oriented: The algebra consists of a small set of vector operations like Scatter and FoldSum, which can be parallelized on modern architectures. Vector instructions enable both portability and performance, as they can be parallelized on many hardware platforms while also yielding to straightforward implementations on any hardware platform. The specific operators in the algebra were chosen to be familiar to compiler designers, reflecting the design of vector machines, SIMD instruction sets, and functional languages, and expressive enough to capture a wide variety of new and previously proposed techniques for optimizing main-memory analytics. For example, in addition to the example in Figure 4, Voodoo is expressive enough to capture the different implementations shown in Figure 1, and compact enough that each implementation is just a few lines of code.

Declarative: Voodoo describes dataflow rather than explicit behavior. In particular, the operators only define how the output depends on the inputs, not how the outputs are produced. This allows Voodoo to, e.g., implement lane-wise parallelism (as in FoldSum) using SIMD instructions on CPUs and work-groups/warps on GPUs (which are conceptually very similar). This declarative property is also important for portability in Voodoo, as operators don't describe specific properties of hardware that they rely on. In addition, it makes the program shorter and simpler than the equivalent C++ program using, e.g., Intel's Threading Building Blocks (see Figure 5) while providing equivalent expressive power. Simplicity arises because it employs a single concept: vector operations. In contrast, TBB involves blocked ranges (line 4), functional lambdas (lines 6 and 13) using lexical scoping and a reducer (line 13).

1 auto input = load("input");
2 auto totalssum =
3 parallel_deterministic_reduce(
4 blocked_range<size_t>(input.size,
5 input.size / 1024),
6 0, [&input](auto& range, auto partsum) {
7 for(size_t i = range.begin();
8 i < range.end(); i++) {
9 partsum += input.elements[i].constant;
10 }
11 return partsum;
12 },
13 [](auto s1, auto s2) { return s1 + s2; });

Figure 5: Multithreaded Hierarchical Aggregation in TBB

1 auto input = load("input");
2 typedef int v4i __attribute__((vector_size(16)));
3 auto vSize = (sizeof(v4i) / sizeof(int));
4 v4i sums = {};
5 for(size_t i = 0; i < input.size / vSize; i++)
6 sums += ((v4i*)input.elements)[i];
7 int* scalarSums = (int*)&sums;
8 auto totalsum = 0l;
9 for(size_t i = 0; i < 4; i++)
10 totalsum += scalarSums[i];

Figure 6: Hierarchical Aggregation using SIMD Intrinsics

Finally, declarative operators allow us to avoid materializing many of the intermediate vectors in a Voodoo program. For example, in the program in Figure 3, most of the vectors (except input, parts, and total) are never stored. These vectors are simply used to control the degree of parallelism in the generated code, as described in Section 3.1 below.

Minimal: Voodoo consists of non-redundant, stateless operators (an example of hidden state would be an internal hash table used when computing an aggregate). By keeping the API simple, frontends are able to effectively influence the generation and usage of intermediate data structures, which may or may not be beneficial for performance. Hidden data structures are also problematic with respect to portability because they can be unbounded in size (again, hash tables come to mind). Since most co-processors do not efficiently support the (re-)allocation of memory at runtime, unbounded data structures impede portability. Simple, fine-grained operators also lend themselves to fine-grained cost models such as the one we defined in earlier work [21]. Effective reasoning about cost is an integral part of tunability. Non-redundancy in the operator set has three distinct advantages: a) improvements in one operator can improve many queries, b) it increases the number of opportunities for common subexpression elimination, and c) it simplifies backend implementation and maintenance.

Deterministic: Voodoo programs do not contain runtime control statements such as if or for that decide if an operator is executed. Such determinism enables efficient execution on architectures with no, or only expensive, execution control, such as GPUs and SIMD units, and also improves CPU performance by allowing the CPU to effectively speculate during program execution. It also simplifies cost-modeling.

Determinism does come at a price: the lack of dynamic decisions prevents the frontend from influencing dynamic execution strategies such as load-balancing, garbage collection, memory re-allocation or running loops of which the number of iterations is unknown at compile time. However, this does not imply that generated code cannot make decisions about what data to load (e.g., which is the next node in a tree index), as long as the operations on the data are known at compile time (see footnote 1). Operations on trees, e.g., are expressible as long as the depth of the tree is (reasonably) bounded/balanced. To remove the need for such a bound, we plan to (re-)integrate dynamic decisions into Voodoo and will explore the impact of control-statements in future work.

Explicit: As much as possible, every Voodoo program has exactly one implementation on each underlying hardware platform. Explicitness is important for tunability, because it means a code generating frontend (or the developer thereof) can reason clearly about what a particular program will do on a particular hardware platform.

Tunable/Transformable: The final key property of Voodoo, as noted in the introduction, is that it is easy to tune to various hardware platforms using a single abstraction. Since Voodoo already follows a declarative operator model, it is natural to extend this model to create a declarative approach to tuning. In addition to compatibility with the algebra, this approach has an appealing property: it encodes conceptually similar techniques into structurally similar programs. To illustrate this, consider the example of parallelization using either multiple cores or multiple SIMD lanes of a modern multicore CPU. This is a non-trivial difference in C: Figure 6 shows code that is equivalent to Figure 5 but uses SIMD intrinsics instead of TBB multithreading. These are almost entirely different programs: the only lines that are shared are the loading of the input (line 1) and most of the output declaration (lines 2 and 9, respectively). In contrast, the changes to the Voodoo program are minimal (see Figure 4 for a textual diff): the constant in line 3 now encodes the number of lanes rather than the size of a partition and the generation of successive runs has been replaced with the generation of circular lane-ids in line 4. This change causes the records to be scattered in a round-robin pattern in lines 5-7, which naturally maps to SIMD instructions. This example shows how a laborious code change is made simple by Voodoo. This simplicity is what we mean by tunable.

The previous example also introduces the concept of controlled folding, which we describe in more detail later on.

In summary, the design of Voodoo is driven by two primary goals: portability and tunability. These goals are often contradictory, as exemplified by the case of platform-specific extensions of C such as SIMD intrinsics: they can improve performance if the hardware efficiently supports them but hurt portability and sometimes even performance if they are not or only badly implemented [24]. Voodoo alleviates these problems by providing a layer of abstraction that a) can be translated into efficient code for a variety of hardware platforms, b) allows easy and fine-grained control over the generated code while keeping the abstraction simple in order to c) allow reasoning about a program, both in terms of semantics as well as cost (given a hardware platform). In the rest of this section we describe the data model and operators that allow Voodoo to achieve these properties.

Footnote 1: Note that, since we generate code, we have information about factors such as data sizes at compile time.

2.1 Data: Structured Vectors

The Voodoo data model is based on integer-addressable vectors. We chose this because virtually all hardware platforms implement some kind of integer-addressable memory. Consequently, Voodoo stores data using a thin layer of abstraction over such integer-addressable memory: a model we term Structured Vectors. A Structured Vector is an ordered collection of fixed size data items, all of which conform to the same schema. For convenience, we allow data items to contain (nest) other structured data items. Structured Vectors are equivalent to one-dimensional arrays of structs in ANSI C and can, thus, be mapped naturally to native code in a C-derived language such as OpenCL C. We currently only allow scalar types and nested structs as fields (but may, in the future, add
fixed size arrays as a convenience feature).

To illustrate this, Figure 7 shows two vectors: an input (on top) and an output vector (bottom). The input vector has two attributes (.fold and .value) and 8 elements. To address an attribute of a vector in Voodoo code, we use Keypaths to "navigate" the nested structures. In notation, keypaths are marked with a preceding dot: for example, the path .value designates the value component of every tuple in Figure 7. Since structures can be nested, keypaths can have more than one component (e.g., .input.value).

Pointers and NULL values. Voodoo has no notion of pointers. References to tuples of vectors can be represented similar to foreign-keys: integer values encoding a position in another vector. We provide primitives to resolve these references (the gather operation). Regarding NULL values, we decided to not impose a specific way but enable common design patterns to deal with NULLs such as bitmaps
(implemented in, e.g., HyPeR), reserved values (MonetDB) or sparse attributes (Postgres) in Voodoo. In the rest of this paper, we implement NULL values using MonetDB's scheme.

Voodoo does have the notion of an "empty" field value, which we denote with a dedicated symbol (e.g., in Figure 7). Empty slots occur if, e.g., values are not set in a scatter or not selected in a foldSelect. We illustrate the usefulness of this concept for efficient code generation in Section 3.

2.2 Controlled Folding

A key challenge in Voodoo is providing declarative operators that still give control over tuning parameters (such as parallelism). To address this challenge, we developed a concept we term Controlled Folding, which is used to express aggregation (producing a single value) as well as partition-wise selections (producing a sequence of tuple ids).

The basic idea of Controlled Folding (illustrated for the case of aggregation in Figure 7, as well as in the use of Divide and Modulo, respectively, in Figures 3 and 4) is similar to fold operations in functional languages (e.g., Haskell): reduce a sequence of values into a single value using a binary function. Controlled fold operators in Voodoo are a generalization of functional folds: in addition to the vector of values to fold, they accept a second vector that we term the control vector of the operation. The control vector effectively provides the partition ids for values being folded.

Figure 7: Voodoo Fold operations are controlled

Table 2: Voodoo Operators

Maintenance
Load(.keypath): Load a vector identified by keypath from persistent storage.
Persist(.keypath, V): Persist vector V, making it available from persistent storage under .keypath.

Data Parallel
BitShift(.out, V1, .kp1, V2, .kp2): Shift the value of each item in V1.kp1 by V2.kp2.
LogicalAnd/Or(.o, V1, .p1, V2, .p2): Logical operations.
Add, Subtract, Multiply, Divide, Modulo: Arithmetic operations.
Greater, Equals: Comparison operations.
Zip(.out1, V1, .kp1, .out2, V2, .kp2): Create a new vector with substructure V1.kp1 as .out1 and V2.kp2 as .out2.
Project(.out, V, .kp): Create a new vector with substructure V.kp as .out.
Upsert(V1, .out, V2, .kp): Copy V1 and replace or insert attribute .out with value V2.kp.
Scatter(V1, V2, .kp2, V3, .pos): Create a new vector of size V2. Fill the slots of the new vector by placing each value of V1 into position V3.pos. Values are overwritten on conflict. Scatters are performed in order within a value-run in V2.kp2; runs have no order guarantees with respect to each other.
Gather(V1, V2, .pos): Create a vector of size V2, filling it by resolving position V2.pos in V1. Out-of-bounds positions result in empty slots.
Materialize(V1, V2, .kp2): Materialize vector V1 in memory. Materialize chunks of sizes according to runs in V2.kp2 (X100-style [33] processing).
Break(V1, V2, .kp): Break up V1 into segments according to the runs in V2.kp (pure tuning hint).

Fold
Partition(.out, V1, .v, V2, .pv): Generate a scatter position vector to partition V1.v according to the list of pivots V2.pv.
FoldSelect(.out, V1, .fold, .s): Generate a vector of positions of slots in V1 that have .s set to non-zero. Align the output to value runs in .fold (see Figure 7).
FoldMax/Min/Sum(.out, V1, .fold, .agg): Calculate Max/Min/Sum for every run in .fold. Align the output value with the start of the run.
FoldScan(.out, V1, .fold, .s): Prefix-sum the values of V1.s (the start of a new run in .fold starts a new sum).

Shape
Range(.kp, fromI, [vInt|v], stepI): Generate a vector with the same size as v, with values starting at from and increasing by step.
Cross(.kp1, v1, .kp2, v2): Generate the cross product of the positions of v1 and v2.

The effect of the control vector on the output of an operator is that, when sequentially traversing the values, the operator also traverses the aligned control-sequence. As it does this, it folds all adjacent tuples that have the same partition id into a single output value; the beginning of a new run of partition ids causes the fold to start a new result. The result of each sub-fold is written to the output cell at the start of the partition. The slots between one fold result and the next are padded with empty values. The padding avoids the need for synchronization when processing runs in parallel (we describe how to eliminate the storage overhead of this in Section 3.1). Controlled folding is a key abstraction in Voodoo that allows us to obtain parallel performance from multiple hardware platforms. As we show in the following, it is a very powerful abstraction because it allows declarative specification of the output of partition-wise operations.

2.3 Operators

In this section, we briefly describe the core operators of Voodoo. The goal is to provide an intuition for the kinds of operators we support. A detailed description is provided in Table 2. Voodoo's operators fall into four categories:

1. Maintenance Operations manipulate the persistent state of the database. They import data, persist data to the database and load it back.
2. Data-parallel Operations operate on aligned tuples in two input vectors. They include standard operations such as arithmetic or logical expressions and Zips. Gather and Materialize also fall into this category because they are trivial to parallelize. The size of the output of these operators is the size of the smaller input. Scatter takes a third parameter that specifies the size of the output and technically involves a consistent write to memory. However, as virtually all hardware platforms implement such a primitive, we consider Scatter data parallel.
3. Fold Operations are operations that require some level of synchronization. In general, this includes all operations where the value at a position in the output depends on more than one value of the input vector. Naturally, this holds for aggregations. However, it also holds for foldSelect because it is order preserving: consequently, the position of a tuple in the output depends on the number of qualifying tuples that precede it. Partition also takes a vector of pivots as a second input. The size of the output is the size of the input vector (not the pivot vector for partitioning).
4. Shape Operations are operations that create vectors with values that are not based on the data values of other vectors but only on their size. Shape operations do not take any input keypaths because they do not process any attribute values. While we use these operators to generate constants, their most important use is instead to create run-control vectors for fold operations. By generating appropriate fold attribute values and maintaining metadata that describes the runs, we can declaratively control the degree of parallelism in the Voodoo program. Consequently, we call such generated attributes Control Attributes. We describe the functioning of these attributes in more detail in the next section.

3. VOODOO BACKENDS

Figure 8: Select & Hierarchically Aggregate in Voodoo

The declarative nature of the Voodoo operators makes it easy to provide different backend implementations. While we foresee implementations in different contexts such as distributed processing, we focus our efforts on single-node (multi-core) backends in this paper. We have implemented two backends: an interpreter on top of C++ standard library container classes as well as a backend that compiles Voodoo code to highly efficient OpenCL kernels. We will start this section with an in-depth discussion of the design of the OpenCL backend and
finish with a brief discussion of the interpreter.

3.1 The OpenCL Compiler

The purpose of the OpenCL compiler is to generate highly efficient, parallel code that avoids unnecessary data materialization. It does so by generating fully inlined, function-call-free OpenCL kernels from sequences of multiple Voodoo operators. The generation of the kernels is strongly inspired by the code generation process in HyPeR [18], and, following their nomenclature, we will refer to a generated piece of code as a fragment. Result materialization to memory only occurs at the seams between fragments. As in HyPeR's query compiler, the Voodoo to OpenCL compiler traverses the plan in a dependency order and appends statements to a fragment. However, the code generation process in Voodoo is more involved than in HyPeR for two reasons: first, we generate data-parallel code whenever possible and only generate code with reduced parallelism when necessary, and second, Voodoo query plans are DAGs rather than trees (see Figure 8) to enable sharing of intermediate results.

In the following, we illustrate the code generation process taking these complicating factors into account. As a running example, we use a simplified version of TPC-H Query 1 (see Figure 8). The query is:

SELECT SUM(l_quantity) FROM lineitem GROUP BY l_returnflag

Figure 9: Select & Hierarchically Aggregate in OpenCL

3.1.1 Code Generation

As in earlier in-memory data management systems such as Vectorwise/MonetDB X100 [33] and HyPeR [18], we aim to avoid the materialization of intermediate results. Similar to HyPeR, we generate code by traversing the execution DAG in a linearized order (top to bottom in Figure 8).

Controlling Parallelism. Most Voodoo programs contain fully parallel as well as controlled fold operations. The OpenCL backend needs to efficiently map both to kernels with the appropriate degree of parallelism. To this end, Voodoo assigns an Extent and an Intent to each generated code fragment. The Extent is the degree of (data) parallelism (roughly equivalent to the global work size in OpenCL or the number of parallel threads working on a vector) while the Intent is the number of sequential iterations per parallel work-unit. A fragment of Extent 1 is fully sequential while a fragment of Intent 1 is fully parallel. Before describing how we derive these parameters from the Control Vectors, we first discuss how they affect a generated program.

The DAG property of our query plans forces us to maintain multiple active fragments at the same time. To illustrate this, consider the case of compiling a program in which a number of fully parallel statements are interrupted by a run-controlled sequential one. In this case, Voodoo creates a sequential fragment in addition to the already active parallel fragment; neither is executed until needed.

When processing an instruction, the compiler chooses a code fragment that has the same extent as the current statement. This process is applied in a straightforward manner for data-parallel, maintenance, and shape operators in Table 2. For fold operators, we distinguish three cases: a) if the runs are of length 1 (the extent is n and the intent 1), the fold operation is fully data-parallel and can be appended to a code fragment of the right extent; b) if there is only a single run of length n (the intent is n and the extent 1), the fold operation is fully sequential and can only be appended to a sequential fragment of the right intent; c) if the length of the runs is less than or equal to the supported partition size, values are written to the beginning of the partition. In that case, we do not need to introduce a global synchronization point, i.e., create a new fragment. This means the fold can apply to any fragment that has the same extent. Specifically, this can be achieved by synchronizing and sequentially processing the input using a subset of the active data items by disabling some of the cores in OpenCL. If no fragment can be found that has the same extent as the current operator, a new fragment is created.

Figure 10: Group & Aggregate

Example. To illustrate this process, consider Fragment 1 in Figure 8. It is easiest to understand this diagram in terms of the red operators, which either partially or completely interrupt the pipelining of the plan. Starting from the top, the break operator causes the greater-than expression to be computed in a fully data-parallel manner, and the result of this expression to be materialized in memory (or cache). Then, the first fold (.position = foldSelect) computes the positions in the lineitem table where the predicate on l_shipdate is satisfied. The parallelism of this operator is controlled by the .fold control vector. In computing .fold, the $intent variable encodes the length of the runs, which allows a transition from fully parallel ($intent = 1) to fully sequential ($intent = n). This foldSelect operator results in one loop per fold (the extent is ||input|| / $intent); each loop in this case just loops over the boolean values materialized at the break operator and emits the non-zero positions. Finally, the third red operator (foldSum) computes the partial summation of each fold. It uses the same parallelism as the foldSelect, so no additional materialization is necessary, and they can be pipelined into the same loop instance.

There are a few additional things to note about this diagram. First, the parallelism of each operator is determined by the parallelism of its first red ancestor, as the Voodoo compiler aggressively inlines operators between the red pipeline-breaking operations. Second, the purple operators (and associated vectors) are "virtual" in the sense that they simply control the parallelism of the generated program but are not computed (or generated) at runtime.

Figure 9 illustrates the resulting OpenCL program assuming $grainsize is set to 4. The
The figure illustrates that the predicate is evaluated fully data-parallel. The select as well as the first sum are evaluated using locally reduced parallelism but without the need for a global barrier. The final sum (Fragment 2 in Figure 8) is entirely sequential and requires a global barrier (in the form of a new OpenCL kernel). Note that, to keep the example concise, the foldSelect and the first foldSum share the fold attribute (stemming from the same control vector). This is not necessary in practice, as they are independently tunable. Naturally, selecting the optimal parallelization strategy is hard and not the focus of this paper. The example does, however, show how a complex optimization decision can be encoded into a (set of) integer constant(s). This, in turn, hinges on an effective way to keep control vector metadata.

[Figure 11: Virtual Scatter]

Maintaining Run Metadata. While control vectors allow us to specify the degree of parallelism in an abstract manner, we still have to generate appropriate code from that specification in the backend. For that purpose, the Voodoo compiler maintains descriptive metadata about each generated vector attribute: the start (from), a step factor (step) and a modulo cap (cap). Attribute values generated by range operators can, thus, be calculated using the following equation:

$$v[i] = \mathit{from} + \lfloor i \cdot \mathit{step} \rfloor \bmod \mathit{cap}$$

Dividing a vector by a constant x is equivalent to dividing step by x. A modulo by x sets the cap to x. When combining (e.g., through addition) a control vector with a data vector (to encode parallel grouped aggregation), we keep the values of the control vector in addition to the data values. We expect the frontend to ensure that the bits of the control vector do not conflict with the data bits and throw an exception when this assumption is violated.

3.1.2 Empty Slot Suppression
As described in Section 2.2, the outputs of all operations are of statically known size and are padded with empty slots. While this seems wasteful at first, it makes it possible to efficiently execute Voodoo on massively parallel hardware without the need for expensive write conflict handling. However, by applying control vector metadata knowledge, we can drastically reduce the memory consumption. To illustrate this, consider the foldSum steps in Figure 9. The folding creates a predictable number of empty cells in the vector. To reduce the memory footprint, slots that can be guaranteed to never be filled with values (e.g., when folding all values of a work group into a local sum) can simply not be allocated. We suppress empty slots by allocating a smaller buffer, appropriately modifying the generated write cursor and annotating the vector with a metadata field to keep track of the empty slots.

3.1.3 Virtual Scatter
Another case in which we exploit compile-time knowledge to avoid runtime materialization is the case of scatter. A naively implemented scatter would write all input values to a slot in the output, causing a break in the fragment and substantial random memory traffic. This may, however, be unnecessary if the scattered vector is only created to be used, e.g., as an input to an aggregation. Figure 10 depicts this very common case. To avoid the intermediate materialization, we can drop the assumption that OpenCL work items (i.e., positions in the input vector) and data items (i.e., positions in the output vector) are aligned. This is illustrated in Figure 11: we tag a vector with a scatter position (denoted with an @ in the third vector). That scatter position is a per-work-item local variable (split into partition start and tuple offset) that will be used to determine the position of each output tuple if the vector is ever materialized. Since most vectors are never materialized, scatter can become a very cheap operation that is paid for only when and if the vector is fully materialized. Figure 11 illustrates how we exploit this in the case of a (single-partition) grouped aggregation: the partition generates a partition id which is work-item local. The scatter creates a (virtual) vector that contains the input values annotated with the scatter path. The vector is read by the foldCount (a macro on top of foldSum) which generates the (partition-aligned) counts. Finally, these counts are compacted into a contiguous memory region for the result.

3.2 The Interpreter
The interpreter mainly serves as a reference implementation; it uses vectors of maps to represent data as well as control vectors. The interpreter materializes all intermediate vectors and is, in that respect, a classic bulk-processor. However, since it stores all data in vectors of maps, it uses virtual function calls to retrieve values from tuples. The combination of full materialization and virtual function calls means that this backend is not designed for high performance. It is, rather, a small reference implementation that is useful for debugging and verification because all intermediates are materialized and, thus, inspectable.

4. A RELATIONAL FRONTEND
To demonstrate Voodoo's effectiveness as a database kernel as well as to simplify experimentation, we implemented a prototype of a relational query processing engine on top of the Voodoo algebra. Because MonetDB is easily extensible and open source, we chose to integrate Voodoo as an alternative execution engine into MonetDB. However, by replacing MonetDB's execution engine and physical optimizer, we effectively reduced its role to data loading and query parsing. To illustrate the effectiveness of the resulting system, let us briefly
walk through the aspects of storage, query processing and optimization.

Loading. MonetDB exposes its internal catalog information, including pointers towards the backing files, through a queryable SQL interface. We exploit this to directly load data from the file system bypassing the query processor. Upon startup, Voodoo loads data from the internal catalog of MonetDB. We directly copy data from disk into the processing device, using the same storage format MonetDB uses: binary column-wise using dictionary encoding for strings.

Queries. We use MonetDB's SQL to relational algebra compiler to parse SQL queries and remove logical concepts such as nested subqueries. From the relational algebra representation, we generate Voodoo plans, thus bypassing MonetDB's physical optimizer. Voodoo plans are similar in complexity to those that MonetDB uses for physical optimization and query evaluation.

Optimization. Since Voodoo compiles MonetDB's logical plans, it inherits the logical optimizations that MonetDB applied (join-order, query-unnesting, etc.). Beyond that, the physical optimizer has a number of optimization flags that enable hardware-specific optimizations: cache-conscious partitioning, predication, parallelization strategy (for selections and aggregations) as well as the targeted device.

While we use these flags for the microbenchmarks in Section 5.3, we disable them when comparing the macrobenchmark results (Section 5.2). This allows a comparison of the efficiency of the generated code to that of HyPeR (which also does not apply such optimizations). However, we enable the generation of parallel plans by specifying appropriate control vectors. Beyond that, we use identity hashing on open hash tables and derive their size from the input domain (using only min and max). A detailed per-query discussion of the query plans can be found in our upcoming technical report [23].

5. EVALUATION
To evaluate our approach, we study the system with respect to our two design goals: portability and tunability. The portability experiments in Section 5.2 show that Voodoo not only runs on different hardware platforms but matches (and even outperforms) existing systems tailored to specific hardware architectures.
Specifically, we compare against HyPeR [18] and MonetDB/Ocelot [13], which target CPU and GPU architectures, respectively. We conduct the evaluation using a subset of the TPC-H benchmark that was used to evaluate these systems when they were presented.

The tunability experiments in Section 5.3 demonstrate how Voodoo facilitates the implementation and examination of various hardware-conscious optimizations on different platforms. We highlight a number of hardware- and data-dependent trade-offs that Voodoo allows us to explore.

5.1 Setup
We ran our CPU experiments, as well as all of our GPU experiments, on a Dell server with a single Intel Skylake Xeon E3-1270 v5 at 3.60 GHz with 64 GB of RAM, running Ubuntu Linux 15.10 (kernel 4.2.0-27). The GPU was a GeForce GTX TITAN X with 12 GB of global memory using CUDA version 7.5. All micro-experiments were compiled using Intel ICC 16. For both CPUs and GPUs, we only counted the execution time once the data was loaded into their respective memories and ignored the costs for result output. We do not address the PCI bottleneck.

5.2 Baseline Portability: TPC-H
We ran a significant subset of the TPC-H queries on a scale factor 10 dataset using Voodoo, HyPeR and, where applicable, Ocelot (Ocelot does not actually support all of the queries we evaluated). These queries cover most of the relational operators (selections, aggregates, group-by and joins), exposing standard problems in relational query processing [6]. Note that our goal is to demonstrate the capability of the Voodoo system to generate efficient code, not the quality of the relational query frontend and optimizer, which are still quite basic. In terms of the optimizations of the generated code, our implementation is roughly equivalent to the code generation that is implemented in HyPeR (no vectorization, no manual SIMD instructions). However, we aggressively exploit available metadata (min, max, FK-constraints) which, in many cases, allows us to bypass operations such as hashing or collision management which HyPeR has to perform.

[Figure 12: TPC-H Performance on GPU]

The results on the CPU (Figure 13) and GPU show that, as an abstraction, Voodoo imposes little overhead. Generally, performance is comparable to HyPeR's. Voodoo performs better for compute- and lookup-intensive queries (such as 5, 6, 9 and 19) because of the aggressive exploitation of metadata and the use of SIMD instructions by the OpenCL compiler. HyPeR performs better for order-by/limit queries since it evaluates these using priority queues which avoid expensive materialization of the full result (in Voodoo, the order-by/limit clauses were omitted).

When comparing to Ocelot, the benefit of executable code generation becomes clear: where HyPeR and Voodoo avoid expensive materialization of intermediate results, Ocelot pays a high price for doing so. In particular, the low-selectivity (high output cardinality) queries such as query 1 expose this problem. This is an indication that Ocelot was developed for an architecture with a high memory bandwidth such as GPUs. When evaluating on the GPU, with its 300 GB/s memory bandwidth, we see that Ocelot suffers significantly less from that design decision (see Figure 12).

5.3 Tunability
We now turn our attention to the ability of Voodoo to experiment with different
hardware-aware optimizations, focusing on ease of implementation and the performance trade-offs of these different techniques.

Just-in-time layout transformations. Most in-memory databases store data using a fixed physical schema (usually row-wise or column-wise). However, it has been shown that performance can be increased by transforming the layout before or even during query execution [34]. As an operation that can benefit from such an optimization, we consider the evaluation of an indexed foreign-key join (essentially a positional lookup) on multiple columns of the same table.

We consider three possible implementations of this operation in C and Voodoo: a single traversal of the index column with lookups into both columns (termed Single Loop), two consecutive traversals of the positions, each resolving the keys into one of the target columns (termed Separate Loops), and the transformation of the target table from column- to row-wise storage, followed by a single loop over the positions and their resolution (termed Layout Transform).

As shown in Figure 14a, the best implementation depends on the lookup pattern into the target table: if the lookups are sequential, locality is always good and the Single Loop implementation performs best. If the lookups are random and the target table is small (4 MB), the best implementation is to resolve the keys in two Separate Loops. If the lookups are random and the target table is large (128 MB), a Layout Transform pays off: because the values of both projected columns are co-located, this optimization reduces the number of random cache misses by a factor of two. Expressing these optimizations in Voodoo involves a break operator between the two gathers to switch from Single Loop to Separate Loops, and a zip and materialize to switch to the Layout Transform implementation. As displayed in Figure 14b, Voodoo accurately matched the performance of the C implementation on the CPU. Figure 14c shows that the performance on the GPU is similar but the Separate Loops version is outperformed by the Layout Transform implementation in all cases. This experiment illustrates how the lack of large per-core caches on the GPU penalizes random accesses earlier than on a CPU. Still, the optimization ports reasonably well.

Selective Aggregation. Since selections are one of the most frequently used operators, an efficient implementation is one of the cornerstones of good in-memory performance. The main problem with the default (branching) implementation is the penalty for branch mispredictions. As shown in Figure 1, avoiding these branch mispredictions via a branch-free predication technique can be beneficial as it trades memory traffic for fewer mispredictions, although the benefit depends on selectivity. To avoid the additional memory traffic of predication, the processing can be vectorized: divided into cache-sized chunks, where for each chunk, a position list is generated using a branch-free implementation. This position list is then traversed and processed in a second (cache-sized) loop. We implemented these three design alternatives in C and Voodoo.
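The three alternatives can be sketched as follows. This is a minimal Python illustration of the access patterns only (Python shows neither branch mispredictions nor cache effects), assuming a range predicate on a column `v1` and a sum over a column `v2`; the function names, the `chunk` parameter and the toy data are hypothetical and not taken from the C or Voodoo implementations.

```python
def sum_branching(v1, v2, lo, hi):
    """Default implementation: a data-dependent branch guards every access to v2."""
    total = 0
    for a, b in zip(v1, v2):
        if lo <= a <= hi:
            total += b
    return total


def sum_predicated(v1, v2, lo, hi):
    """Branch-free: v2 is always read and multiplied by the predicate outcome (0 or 1)."""
    total = 0
    for a, b in zip(v1, v2):
        total += b * int(lo <= a <= hi)
    return total


def sum_vectorized(v1, v2, lo, hi, chunk=4):
    """Per chunk: build a position list without branching, then process only those positions."""
    total = 0
    positions = [0] * chunk                  # small buffer, reused for every chunk
    for start in range(0, len(v1), chunk):
        end = min(start + chunk, len(v1))
        count = 0
        for i in range(start, end):
            positions[count] = i             # unconditional write
            count += int(lo <= v1[i] <= hi)  # cursor only advances for qualifying items
        for k in range(count):               # second loop over the position list
            total += v2[positions[k]]
    return total


# Hypothetical toy columns.
v1 = [5, 1, 9, 4, 7, 3, 8, 2, 6]
v2 = [10, 20, 30, 40, 50, 60, 70, 80, 90]
results = [f(v1, v2, 4, 7) for f in (sum_branching, sum_predicated, sum_vectorized)]
```

All three variants compute the same sum; they differ in whether `v2` is touched conditionally, unconditionally, or only through the per-chunk position list, which is the trade-off the experiment measures.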

The selectivity-dependent results are shown in Figure 15a: we see the characteristic behavior of speculative execution with worst-case performance at 50% selectivity. In contrast, the branch-free implementation shows flat performance that outperforms the branching implementation for mid-range selectivities. The vectorized version performs significantly better than the branch-free implementation and, for selectivities above 1%, outperforms the branching version. Note that the C code of these versions looks very different: two loops and an additional buffer vs. a single loop with no buffering. In Voodoo, we achieve virtually identical performance (Figure 15b) but only need to insert a single additional operator: a materialize with a control vector to encode the intermediate buffer size.

Running the Voodoo code on the GPU creates a different picture (see Figure 15c): since the GPU does not speculatively execute code, the predicated version only adds additional memory traffic without any benefit. Surprisingly, the vectorized implementation hurts performance: the additional position buffer causes additional memory traffic and, even worse, is filled sequentially, which limits the degree of parallelism that can be used to hide latencies. We conclude that this optimization does not port well to GPUs.

[Figure 13: TPC-H Performance on CPU]
[Figure 14: Just-in-time layout changes. Panels: (a) Implemented in C, (b) Voodoo on CPU, (c) Voodoo on GPU]

Branch-Free Foreign-Key Joins. In the last microbenchmark, we present a novel tuning technique that illustrates the interplay between data access and processing. We consider a table scan, the application of a selection and an indexed foreign-key join into a single, large target table with subsequent aggregation. The SQL query is:

    SELECT sum(target.v)
    FROM facts, target
    WHERE facts.target_fk = target.pk AND facts.v $1

The straightforward approach is Branching: a sequential scan of the selection column (facts.v), the evaluation of the predicate and a lookup and aggregation of qualifying tuples. A branch-free alternative is to unconditionally perform the lookups and multiply the resulting values with the outcome of the predicate (0 or 1) before aggregation (labeled Predicated Aggregation in Figure 16). The selectivity-dependent results are displayed in Figure 16a.
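The two variants described above can be sketched as follows. This is a hedged Python illustration rather than the benchmarked code: it assumes that the foreign key is a plain position into the target column (the text calls an indexed foreign-key join essentially a positional lookup), the `qualifies` callable stands in for the selection predicate, and all names and data are hypothetical. Each function also reports how many lookups it performed.

```python
def join_sum_branching(fact_v, fact_fk, target_v, qualifies):
    """Branching: look up and aggregate only the qualifying tuples."""
    total, lookups = 0, 0
    for v, fk in zip(fact_v, fact_fk):
        if qualifies(v):
            total += target_v[fk]
            lookups += 1
    return total, lookups


def join_sum_predicated(fact_v, fact_fk, target_v, qualifies):
    """Predicated Aggregation: always look up, then multiply by the predicate outcome."""
    total, lookups = 0, 0
    for v, fk in zip(fact_v, fact_fk):
        looked_up = target_v[fk]                  # unconditional lookup
        lookups += 1
        total += looked_up * int(qualifies(v))    # 0 or 1 decides the contribution
    return total, lookups


# Hypothetical toy data: five fact tuples pointing into a four-row target column.
fact_v = [1, 8, 3, 9, 5]
fact_fk = [2, 0, 3, 1, 2]
target_v = [100, 200, 300, 400]
branching = join_sum_branching(fact_v, fact_fk, target_v, lambda v: v < 5)
predicated = join_sum_predicated(fact_v, fact_fk, target_v, lambda v: v < 5)
```

Both variants return the same sum, but the predicated one performs a lookup for every fact tuple, including the non-qualifying ones.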
The branching version exhibits the typical bell-shaped curve that indicates bad speculation. The branch-free variant is significantly more expensive, due to the high number of random cache misses that result from the unconditional lookups. To address the bad cache behavior, we devised an optimization we term Predicated Lookups: before performing the lookup, we multiply the position with the outcome of the selection predicate. This way, all non-qualifying lookups hit the same address (position zero) which will be held in one "very hot" cache line. This addresses the bad cache behavior but causes an extra arithmetic operation (note that the looked-up values still need to be predicated to ensure correctness). The result (Predicated Lookups in Figure 16a) is an implementation that performs significantly better than the branch-free version and outperforms the branching version for much of the parameter space. Voodoo matches this result very accurately on the CPU (see Figure 16b).

On the GPU, Voodoo exposes different performance trade-offs: the Branching implementation shows the best performance over most of the parameter space and is only outperformed by the Predicated Lookups version for selectivities above 80%. This result exhibits another GPU design decision: the sacrifice of integer arithmetic for floating point performance. Since the Predicated Lookups version performs two integer arithmetic operations, performance is dominated by that; this optimization also does not port well.

6. RELATED WORK
The Voodoo project was inspired by the disparity between the large number of optimization techniques in the literature and the small number of such techniques that are actually used in systems. Before concluding, let us provide an overview of some techniques and systems that try to use them.

Low level optimizations for modern hardware. Most of the work in the literature addresses specific
hardware components in isolation. Among these are techniques addressing hierarchical caches [3], branch predictors [28] and SIMD registers [25]. Much of this prior work shows that each of these techniques can lead to orders of magnitude performance improvements. However, these techniques were not studied in the context of a full data management system stack or generalized into a full operator model.

The most recent of these, Polychroniou et al. [25], develop a set of algorithms that employ advanced techniques for SIMD-enabled processing. Most of these can be translated directly into equivalent Voodoo code (see our upcoming technical report [23] for details). The exceptions are the cases in which data structures are not write-once: when filling hash-table slots with unique marker values (to recognize conflicts) and subsequently overwriting them with data values, and when swapping values through a cuckoo table until an empty slot is found. The former can be implemented by using a second (logical) buffer to hold the marker values. The latter can only be approximated in Voodoo because each cuckoo iteration needs to (logically) create a new data structure. While the memory overhead can be removed at compile-time, the program grows linearly with the number of cuckoo iterations. This bounds the number of possible iterations to a (reasonably small) constant.

[Figure 15: select sum(v2) from facts where v1 between $1 and $2. Panels: (a) Implemented in C, (b) Voodoo on CPU, (c) Voodoo on GPU]
[Figure 16: Selective Foreign-Key Join Performance. Panels: (a) Implemented in C, (b) Voodoo (Run on CPU), (c) Voodoo (Run on GPU)]

Another line of research targets the use of programmable GPUs with their diverse tuning techniques [11, 12, 22]. GPU programming textbooks (e.g., [20]) contain a number of platform-specific heuristics to choose the right set of techniques given a problem. Unfortunately, the low-level programming paradigm of frameworks such as OpenCL or CUDA makes this kind of optimization hard, often requiring a substantial rewrite of the program for each architecture or optimization.

High Performance Computing (HPC). The HPC community has aimed to create easy-to-use abstractions over highly parallel hardware for more than three decades. The most prominent artifact of this is the BLAS standard with several implementations. Aimed at linear algebra, however, BLAS implementations such as Intel's MKL [31], cuBLAS [4], MAGMA [2] or OpenBLAS solve a restricted and, most importantly, data-independent tuning problem. Hence, tuning is usually done by the developers of the library, rather than by generating data-dependent code when the application runs. Compiler frameworks such as Delite [30] or Dandelion [29] are designed to facilitate this process but it still remains with the library developer and is, thus, static.

General purpose compile-time abstractions such as Intel's Array Building Blocks (ABB) [19] inherited this problem. ABB specifically was abandoned in favor of "tunable" approaches such as Cilk [5], Threading Building Blocks [27] and OpenMP [10]. While these offer high tunability, they are ill-suited to automatic code generation at runtime (see Section 2). The approach closest to ours is ArrayFire [17] which provides abstract vector operations backed by multiple hardware-specific
backends (CUDA, OpenCL and C++). ArrayFire even generates code at runtime but only for arithmetic expressions applied using a map operator.

State of the art systems for in-memory analytics. Some of the techniques used by this work involve bulk-processing [7], vector processing (MonetDB/x100 [33]) and just-in-time compilation (a.k.a. JIT-compiling) [15]. HyPeR [18] employs a simple direct translation of relational algebra to LLVM assembler which is then executed. HyPeR aims to run both analytic and transactional workloads in the same system. Legobase [14], in addition to generating LLVM or C code from SQL, lets database developers express internal database data structures and algorithms using a high-level language (Scala) and then have them compiled down to low level with the rest of the query. TupleWare [9] aims to handle a larger class of computations including iterative computations and UDFs, and employs the code generation technique to efficiently integrate framework code with the UDFs. Ocelot aims to port MonetDB to exploit GPUs [13]. Voodoo is complementary to HyPeR and Legobase in that Voodoo can be used as a lower layer for such systems.

7. FUTURE WORK
We believe Voodoo to be useful as a foundation for much future work in high- as well as low-level optimization of database queries. The machine-friendly design of Voodoo lends itself to automatic exploration of the database design space. Specifically, an automatic, incremental, runtime re-optimization system is enabled by the design of Voodoo. Such a system could employ current and future low-level optimizations. However, it will still have to handle the large design space and may require new abstractions. We believe that declarative optimizers [8, 16] can be effectively
combined with Voodoo to handle this complexity.

The current Voodoo design deliberately omits control statements (for, if, while, ...). While these are not necessary to implement relational algebra, they enable runtime optimizations such as load balancing, dynamic resizing of data structures (e.g., hash tables) or data structures that have complex behavior (such as priority queues). However, these are exactly the kinds of optimizations that are difficult to port to massively parallel architectures. A solution to this problem is likely to be based on the cooperation of multiple devices (CPUs for control, GPUs for data processing). We plan to develop such a solution in the near future.

We also plan to expand the current algorithms to add non-relational operations. An example of this is supporting regular expression matching (which has efficient massively parallel solutions) but also support for graphs or arrays.

8. CONCLUSION
Implementing efficient in-memory databases is challenging, and often requires being aware of both the characteristics of the workload, data and the specific architectural properties of the hardware. In this work, we proposed Voodoo, a novel unifying framework to explore, implement and evaluate a range of hardware-specific tuning techniques that can capture all of these aspects of efficient database design. Voodoo consists of an intermediate vector algebra that abstracts away hardware specifics such as hierarchical caches, many-core architectures or SIMD instruction sets while still allowing frontend developers to optimize for them. Central to our approach is a novel technique called control vectors that expose parallelism in the program and data to the compiler, without the use of hardware-specific abstractions. We showed that Voodoo can be used as an alternative backend for an existing system (MonetDB), and that it can match and even outperform previously proposed highly optimized in-memory databases. We also demonstrated that Voodoo makes it easy to tune programs and explore design alternatives for different hardware architectures.

9. REFERENCES
[1] D. Abadi, D. Myers, D. DeWitt, and S. Madden. Materialization strategies in a column-oriented dbms. In ICDE 2007. IEEE, 2007.
[2] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. Numerical linear algebra on emerging architectures: The plasma and magma projects. In Journal of Physics: Conference Series, volume 180, 2009.
[3] C. Balkesen, J. Teubner, G. Alonso, and M. T. Ozsu. Main-memory hash joins on multi-core cpus: Tuning to the underlying hardware. ETH Zurich, Tech. Rep, 2012.
[4] S. Barrachina, M. Castillo, F. D. Igual, R. Mayo, and E. S. Quintana-Orti. Evaluation and tuning of the level 3 cublas for graphics processors. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on. IEEE, 2008.
[5] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of parallel and distributed computing, 37(1), 1996.
[6] P. Boncz, T. Neumann, and O. Erling. Tpc-h analyzed: Hidden messages and lessons learned from an influential
benchmark. In TPC-TC. Springer, 2013.
[7] P. A. Boncz, M. L. Kersten, and S. Manegold. Breaking the memory wall in monetdb. CACM, 12 2008.
[8] T. Condie, D. Chu, J. M. Hellerstein, and P. Maniatis. Evita raced: metacompilation for declarative networks. PVLDB, 1(1), 2008.
[9] A. Crotty, A. Galakatos, K. Dursun, T. Kraska, U. Cetintemel, and S. Zdonik. Tupleware: "big" data, big analytics, small clusters. In CIDR, 2015.
[10] L. Dagum and R. Menon. Openmp: an industry standard api for shared-memory programming. IEEE computational science and engineering, 5(1), 1998.
[11] C. Gregg and K. Hazelwood. Where is the data? why you cannot debate cpu vs. gpu performance without the answer. In ISPASS'11. IEEE, 2011.
[12] B. He, M. Lu, K. Yang, R. Fang, N. Govindaraju, Q. Luo, and P. Sander. Relational query coprocessing on graphics processors. TODS, 34(4):21, 2009.
[13] M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. VLDB, 2013.
[14] Y. Klonatos, C. Koch, T. Rompf, and H. Chafi. Building efficient query engines in a high-level language. PVLDB, 7(10), 2014.
[15] K. Krikellas, S. Viglas, and M. Cintra. Generating code for holistic query evaluation. In ICDE, 2010.
[16] M. Liu, Z. G. Ives, and B. T. Loo. Enabling incremental query re-optimization. In SIGMOD, 2016.
[17] J. Malcolm, P. Yalamanchili, C. McClanahan, V. Venugopalakrishnan, K. Patel, and J. Melonakos. Arrayfire: a gpu acceleration platform. In SPIE Defense, Security, and Sensing, 2012.
[18] T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9), 2011.
[19] C. J. Newburn, B. So, Z. Liu, M. McCool, A. Ghuloum, S. D. Toit, Z. G. Wang, Z. H. Du, Y. Chen, G. Wu, et al. Intel's array building blocks: A retargetable, dynamic compiler and embedded language. In CGO, 2011.
[20] H. Nguyen. Gpu gems 3. Addison-Wesley Professional, 2007.
[21] H. Pirk et al. Cpu and cache efficient management of memory-resident databases. In ICDE, 2013.
[22] H. Pirk, S. Manegold, and M. L. Kersten. Waste not... efficient co-processing of relational data. In ICDE 2014. IEEE, April 2014.
[23] H. Pirk, O. Moll, M. Zaharia, and S. Madden. Voodoo - portable database performance on modern hardware. Technical report, MIT CSAIL, 2016.
[24] H. Pirk, E. Petraki, S. Idreos, S. Manegold, and M. Kersten. Database cracking: fancy scan, not poor man's sort! In DaMoN. ACM, 2014.
[25] O. Polychroniou, A. Raghavan, and K. A. Ross. Rethinking simd vectorization for in-memory databases. In SIGMOD 2015. ACM, 2015.
[26] V. Raman, G. Attaluri, R. Barber, N. Chainani, D. Kalmuk, V. KulandaiSamy, J. Leenstra, S. Lightstone, S. Liu, G. M. Lohman, et al. Db2 with blu acceleration: So much more than just a column store. PVLDB, 6(11), 2013.
[27] J. Reinders. Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O'Reilly Media, Inc., 2007.
[28] K. A. Ross. Selection conditions in main memory. ACM Trans. Database Syst., 29(1), Mar. 2004.
[29] C. J. Rossbach, Y. Yu, J. Currey, J.-P. Martin, and D. Fetterly. Dandelion: a compiler and runtime for heterogeneous systems. In SOSP. ACM, 2013.
[30] A. K. Sujeeth, K. J. Brown, H. Lee, T. Rompf, H. Chafi, M. Odersky, and K. Olukotun. Delite: A compiler architecture for performance-oriented embedded domain-specific languages. ACM TECS, 13:134, 2014.
[31] E. Wang, Q. Zhang, B. Shen, G. Zhang, X. Lu, Q. Wu, and Y. Wang. Intel math kernel library. In High-Performance Computing on the Intel Xeon Phi. Springer, 2014.
[32] H. Wu, G. Diamos, J. Wang, S. Cadambi, S. Yalamanchili, and S. Chakradhar. Optimizing data warehousing applications for gpus using kernel fusion/fission. In IPDPSW. IEEE, 2012.
[33] M. Zukowski, P. Boncz, N. Nes, and S. Heman. Monetdb/x100 - a dbms in the cpu cache. IEEE Data Engineering Bulletin, 1001:17, 2005.
[34] M. Zukowski, N. Nes, and P. Boncz. DSM vs. NSM: CPU Performance Tradeoffs in Block-oriented Query Processing. In DaMoN