HYRISE - A Main Memory Hybrid Storage Engine

Martin Grund (Hasso-Plattner-Institute), Jens Krüger (Hasso-Plattner-Institute), Hasso Plattner (Hasso-Plattner-Institute), Alexander Zeier (Hasso-Plattner-Institute), Philippe Cudré-Mauroux (MIT CSAIL), Samuel Madden (MIT CSAIL)

ABSTRACT

In this paper, we describe a main memory hybrid database system called HYRISE, which automatically partitions tables into vertical partitions of varying widths depending on how the columns of the table are accessed. For columns accessed as a part of analytical queries (e.g., via sequential scans), narrow partitions perform better, because, when scanning a single column, cache locality is improved if the values of that column are stored contiguously. In contrast, for columns accessed as a part of OLTP-style queries, wider partitions perform better, because such transactions frequently insert, delete, update, or access many of the fields of a row, and co-locating those fields leads to better cache locality. Using a highly accurate model of cache misses, HYRISE is able to predict the performance of different partitionings, and to automatically select the best partitioning using an automated database design algorithm. We show that, on a realistic workload derived from customer applications, HYRISE can achieve a 20% to 400% performance improvement over pure all-column or all-row designs, and that it is both more scalable and produces better designs than previous vertical partitioning approaches for main memory systems.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 37th International Conference on Very Large Data Bases, August 29th - September 3rd 2011, Seattle, Washington. Proceedings of the VLDB Endowment, Vol. 4, No. 2. Copyright 2010 VLDB Endowment 2150-8097/10/11... $10.00.

1. INTRODUCTION

Traditionally, the database market divides into transaction processing (OLTP) and analytical processing (OLAP) workloads. OLTP workloads are characterized by a mix of reads and writes to a few rows at a time, typically through a B+Tree or other index structures. Conversely, OLAP applications are characterized by bulk updates and large sequential scans spanning few columns but many rows of the database, for example to compute aggregate values. Typically, those two workloads are supported by two different types of database systems: transaction processing systems and warehousing systems.

This simple categorization of workloads, however, does not entirely reflect modern enterprise computing. First, there is an increasing need for "real-time analytics", that is, up-to-the-minute reporting on business processes that have traditionally been handled by warehousing systems. Although warehouse vendors are doing as much as possible to improve response times (e.g., by reducing load times), the explicit separation between transaction processing and analytics systems introduces a fundamental bottleneck in analytics response times. For some applications, directly answering analytics queries from the transactional system is preferable. For example, "available-to-promise" (ATP) applications process OLTP-style queries while aggregating stock levels in real-time using OLAP-style queries to determine if an order can be fulfilled.

Unfortunately, existing databases are not optimized for such mixed query workloads because their storage structures are usually optimized for one workload or the other. To address such workloads, we have built a main memory hybrid database system, called HYRISE, which partitions tables into vertical partitions of varying widths depending on how the columns of the tables are accessed (e.g., transactionally or analytically).

We focus on main memory systems because, like other researchers [22, 7], we believe that many future databases, particularly those that involve enterprise entities like customers, outstanding orders, products, stock levels, and employees, will fit into the memory of a small number of machines. Commercially available systems already offer up to 1 TB of main memory (e.g., the Fujitsu RX600 S5).

Main memory systems present a unique set of challenges and opportunities. Due to the architecture of modern CPUs and their complex cache hierarchy, comparing the performance of different main memory layouts can be challenging. In this paper, we carefully profile the cache performance of a modern multi-core machine and develop a cost model that allows us to predict the layout-dependent performance of a mixed OLTP/OLAP query workload on a fine-grained hybrid row/column database.

Our model captures the idea that it is preferable to use narrow partitions for columns that are accessed as a part of analytical queries, as is done in pure columnar systems [5, 6]. In addition, HYRISE stores columns that are accessed in OLTP-style queries in wider partitions, to reduce cache misses when performing single row retrievals. Though others have noted the importance of cache locality in main memory systems [6, 3, 12, 25], we believe we are the first to build a dedicated hybrid database system based on a detailed model of cache performance in mixed OLAP/OLTP settings.

Our work is closest in spirit to Data Morphing [12], which also proposes a hybrid storage model, but we have extended their approach with a more accurate model of cache and prefetching performance for modern processors that yields up to 60% fewer cache misses compared to the layouts proposed by Data Morphing. Furthermore, the layout algorithms described in the Data Morphing paper are exponential ($2^n$) in the number of attributes in the input relations, and as such do not scale to large relations. Our algorithms scale to relations with hundreds of columns, which occur frequently in real workloads.

We note that several analytics database vendors have announced support for hybrid storage layouts to optimize performance of par-

Figure 3: Modeled vs Measured L2 misses (a); CPU cycles (prefetcher off) (b); L2 Misses Row/Column Containers and Varying Selectivity (c)

3.3 Combining Partial Projections

The previous section discussed the projection of contiguous attributes from a container. However, query plans often need to project non-contiguous sets of attributes. Non-contiguous projections can be rewritten as a set of partial projections $\{\pi_1, \ldots, \pi_k\}$, each of which retrieves a list of contiguous attributes in $C$. Such projections define a set of gaps $\{\gamma_1, \ldots, \gamma_l\}$, i.e., contiguous attribute groups that are not projected. For instance, a projection on the first and third attribute of a five-attribute container is equivalent to two partial projections, one on the first and one on the third attribute. The projection defines two gaps, one on the second attribute, and a second one on the fourth and fifth attributes.

Two cases can occur depending on the width $\gamma.w$ of the gaps:

Full-scan: if $\forall \gamma_i \in \{\gamma_1, \ldots, \gamma_l\},\ \gamma_i.w < L_i.w$, all gaps are strictly smaller than a cache line and cannot be skipped. Thus, the projection results in a full scan of the container.

Independent projections: if $\exists \gamma_i \in \{\gamma_1, \ldots, \gamma_l\} \mid \gamma_i.w \ge L_i.w$, there exist portions of the container that might potentially be skipped when executing the projection. The projection is then equivalent to a set of $1 + \sum_{i=1}^{l} \mathbb{1}_{\gamma_i.w \ge L_i.w}$ partial projections defined by the gap boundaries (where $\mathbb{1}$ is an indicator function used to express the gaps that are greater than the cache line). Writing $\Gamma_i$ to express the $i$-th largest gap whose width $\Gamma_i.w \ge L_i.w$, the equivalent partial projections can be defined as

$$\pi^{eq}_i.o = \Gamma_i.o \qquad (9)$$

and

$$\pi^{eq}_i.w = \Gamma_{i+1}.o - (\Gamma_i.o + \Gamma_i.w) \qquad (10)$$

Similarly, we can merge the first and last projections by taking into account the fact that the last bytes of a row are stored contiguously with the first bytes of the following row. Hence, we merge the first and last projections when the gap $\Gamma_{row}$ between them is smaller than a cache line, i.e., when

$$\Gamma_{row} = (\pi^{eq}_{first}.o + C.w) - (\pi^{eq}_{last}.o + \pi^{eq}_{last}.w) < L_i.w \qquad (11)$$

The final projection $\pi^{eq}$ is in that case defined as follows: $\pi^{eq}.o = \pi^{eq}_{last}.o$ and $\pi^{eq}.w = \pi^{eq}_{first}.w + \pi^{eq}_{last}.w + \Gamma_{row}$.

Using this framework, we can model the impact of complex queries involving the projection of an arbitrary set of attributes from a container without erroneously counting misses twice.

3.4 Selections

In this subsection, we consider projections that only retrieve a specific subset $S$ of the rows in a container. We assume that we know the list of rows $r_i \in S$ that should be considered for the projection (e.g., from a position lookup operator). The selectivity of the projection $\pi.s$ represents the fraction of rows returned by the projection. Our equations in this section must capture the fact that highly selective projections touching a few isolated rows can generate more cache misses per result than full scans would.

Two cases can also occur here, depending on the relative sizes of the container, the projection, and the cache line:

Independent selections: whenever $C.w - \pi.w - 1 \ge L_i.w$, the gaps between the projected segments cause each row to be retrieved independently of the others. In that case, the cache misses incurred by each row retrieval are independent of the other row retrievals. The total number of cache misses incurred by the selection is the sum of all the misses incurred when independently projecting each row $r \in S$ from the set of selected rows:

$$\mathrm{Miss}_i(C, \pi)_{sel} = \sum_{r=0}^{C.n-1} \mathrm{Miss}_i(r, \pi) \cdot \pi.s \qquad (12)$$

This number of misses can be efficiently computed using the additive order approach described above.

Overlapping selections: when $C.w - \pi.w - 1 < L_i.w$, retrieving the projection for a given row may retrieve parts of the projection for a different row. This effect is particularly apparent for low-selectivity projections on narrow containers, for which several rows can fit on a single cache line. The average number of rows that can fit in one cache line is equal to $L_i.w / C.w$. For each cache line fetched to retrieve a selected row, there are on average

$$totalCachedRows = 1 + \pi.s \left( \frac{L_i.w}{C.w} - 1 \right) \qquad (13)$$

selected rows cached, assuming that the rows are selected independently of each other. The average number of misses is thus

$$\mathrm{Miss}_i(C, \pi)_{sel} = \frac{\pi.s}{totalCachedRows}\, \mathrm{Miss}_i(C, \pi) \qquad (14)$$

Figure 3(c) compares the measured and modeled number of cache misses for selections on two layouts: one consisting of 16 one-attribute containers and a second one consisting of one 16-attribute wide container. Both layouts have the same total width and both have 2M tuples. For low selectivities, using wider containers results in fewer cache misses, since each (whether narrow or wide) container generates at least one cache miss per tuple fragment retrieved for very selective projections.

3.5 Joins and Aggregates

Our system currently uses late-materialization based hash joins and hash-based GROUP BYs. These operations can both be modeled as partial projections and selections. For a join of tables R and S, where R is the table that is to be hashed, R is filtered via a partial projection and a position lookup (using positions from early operators in the plan). The resulting hash table is then probed with a similarly filtered version of S. For GROUP BYs, the grouping column is also filtered via a partial projection and position lookup.

3.6 Padded Containers and Reconstruction

Padding and Alignment: For partial projections it is possible to reduce the total number of cache misses by performing narrow row-padding such that the beginning of each row coincides with the beginning of a cache line. For a padded container $C$, the padding (empty space) to insert at the end of each row to achieve such an effect is $C.w \bmod L_i.w$ bytes wide for a given cache level. The expressions given above to compute the number of cache misses can be used on padded containers by replacing the width of the container
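The overlapping-selection estimate of Section 3.4 (Equations 13 and 14) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HYRISE implementation: the function names and the example numbers are hypothetical, the overlapping case is assumed to be the one where C.w - π.w - 1 is smaller than the cache-line width, and `full_misses` stands for the value Miss_i(C, π) that the model of the earlier sections would supply.

```python
def is_overlapping(container_width, projection_width, line_width):
    """Case test of Section 3.4 (assumed form): rows overlap on a cache
    line when C.w - pi.w - 1 is smaller than the cache-line width L_i.w."""
    return container_width - projection_width - 1 < line_width


def total_cached_rows(selectivity, line_width, container_width):
    """Equation 13: average number of selected rows brought in by one
    cache line that was fetched for a selected row."""
    rows_per_line = line_width / container_width  # L_i.w / C.w
    return 1.0 + selectivity * (rows_per_line - 1.0)


def overlapping_selection_misses(full_misses, selectivity,
                                 line_width, container_width):
    """Equation 14: scale Miss_i(C, pi) by pi.s / totalCachedRows."""
    cached = total_cached_rows(selectivity, line_width, container_width)
    return selectivity / cached * full_misses


if __name__ == "__main__":
    # Hypothetical numbers: 64-byte cache lines, a 4-byte one-attribute
    # container, and a placeholder value of 1000 for Miss_i(C, pi).
    for s in (0.01, 0.1, 0.5, 1.0):
        print(s, overlapping_selection_misses(1000.0, s, 64, 4))
```

With a 4-byte container and 64-byte lines, 16 rows share a line, so at a selectivity of 0.5 each fetched line serves 8.5 selected rows on average. The estimate therefore grows much more slowly than the selectivity itself, which is the effect the overlapping case is meant to capture.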
... HYRISE is the Data Morphing approach of Hankins and Patel [12]. Data Morphing partitions relations into both row and column-oriented storage. The main differences between our approaches and Data Morphing are in the fidelity of our cache-miss model (we model many cases that Data Morphing is unable to capture), and in our physical database design algorithm. Taken together, these make HYRISE significantly faster than Data Morphing, and also allow it to scale to tables with tens or hundreds of attributes, whereas Data Morphing cannot scale to tables with large numbers of attributes.

Vertical partitioning is a widely used technique that has been explored since the early days of the database community [24, 13, 11, 17, 2]. Some of this work [9, 10, 17, 2] attempts to automatically derive good partitions, but does so with an eye towards minimizing disk seeks and I/O costs rather than main memory costs as we do in HYRISE. As such, these systems do not include careful models of cache misses. The work of Agrawal et al. [2] is most similar to our approach in that it uses a cost-based mechanism to identify partitions that are likely to work well for a given workload.

Recently there has been a renewed interest in pure vertical partitioning into a "column-store", e.g., DSM [8], Monet and MonetDB/X100 [5, 6], C-Store [21]. As "pure" column systems, these approaches are quite different from HYRISE. The Monet system is perhaps most related because its authors develop complete models of cache performance in column stores.

There have been several attempts to build systems in the spirit of HYRISE that are row/column hybrids. PAX [3] is an early example; it stores data from multiple columns in a disk block, but uses a column-wise data representation for those columns. In comparison to the cache miss performance of HYRISE when scanning a narrow projection, PAX will incur somewhat more cache misses when scanning just a few columns from a table (since it will have to jump from one page to the next in memory). Similarly, in comparison to HYRISE scanning a wide projection, PAX will incur more cache misses when scanning many columns from a table (since it will have to jump from one column to the next in memory). We chose not to compare our work against PAX directly because the Data Morphing paper [12] showed that a hybrid system like HYRISE can be up to a factor of 2 faster for workloads that read just a few columns, and as we show in Section 5.3, HYRISE generally performs better than Data Morphing. Fractured Mirrors [19] and Schaffner et al. [20] are hybrid approaches that consider both row and column representations and answer queries from the representation that is best for a given query; this leads to good query performance but has substantial synchronization overhead. Unlike HYRISE, neither of these systems nor PAX vary their physical design based on the workload, and so do not focus on the automated design problem we address.

7. CONCLUSIONS

In this paper, we presented HYRISE, a main memory hybrid database system designed to maximize the cache performance of queries. HYRISE creates vertical partitions of tables of different widths, depending on the access patterns of queries over tables. To determine the best partitioning, we developed an accurate model of cache misses that is able to estimate the number of misses that a particular partitioning and access pattern will incur. We presented a database design algorithm based on this model that finds partitionings that minimize the number of cache misses for a given workload, and that is able to scale to tables with a large number of columns.

Our results show that HYRISE is able to produce designs that are 20% to 400% faster than either a pure-column or pure-row approach on a realistic benchmark derived from a widely used enterprise application. We also show that our approach leads to better physical designs and can scale to larger tables than Data Morphing [12], the previous state of the art workload-aware approach for partitioning main memory databases. As future work, we plan to examine horizontal partitioning as well as additional hybrid-based query optimizations, and to optimize HYRISE for future many-core processors with multiple memory channels and increasing parallelism.

8. REFERENCES

[1] D. J. Abadi, D. S. Myers, D. J. DeWitt, and S. Madden. Materialization Strategies in a Column-Oriented DBMS. In ICDE, pages 466–475, 2007.
[2] S. Agrawal, V. R. Narasayya, and B. Yang. Integrating Vertical and Horizontal Partitioning Into Automated Physical Database Design. In SIGMOD Conference, pages 359–370, 2004.
[3] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving Relations for Cache Performance. In VLDB, pages 169–180, 2001.
[4] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a Modern Processor: Where Does Time Go? In VLDB, pages 266–277, 1999.
[5] P. A. Boncz, S. Manegold, and M. L. Kersten. Database Architecture Optimized for the New Bottleneck: Memory Access. In VLDB, pages 54–65, 1999.
[6] P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-Pipelining Query Execution. In CIDR, pages 225–237, 2005.
[7] S. K. Cha and C. Song. P*TIME: Highly Scalable OLTP DBMS for Managing Update-Intensive Stream Workload. In VLDB, pages 1033–1044, 2004.
[8] G. P. Copeland and S. Khoshafian. A Decomposition Storage Model. In SIGMOD Conference, pages 268–279, 1985.
[9] D. W. Cornell and P. S. Yu. An Effective Approach to Vertical Partitioning for Physical Design of Relational Databases. IEEE Transactions on Software Engineering, 16(2):248–258, 1990.
[10] P. De, J. S. Park, and H. Pirkul. An integrated model of record segmentation and access path selection for databases. Information Systems, 13(1):13–30, 1988.
[11] M. Hammer and B. Niamir. A Heuristic Approach to Attribute Partitioning. In SIGMOD Conference, pages 93–101, 1979.
[12] R. A. Hankins and J. M. Patel. Data Morphing: An Adaptive, Cache-Conscious Storage Technique. In VLDB, pages 417–428, 2003.
[13] J. A. Hoffer and D. G. Severance. The Use of Cluster Analysis in Physical Data Base Design. In VLDB, pages 69–86, 1975.
[14] B. L. Johnston and F. Richman. Numbers and Symmetry: An Introduction to Algebra. CRC-Press, 1997.
[15] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48(1):96–129, 1998.
[16] S. Manegold, P. A. Boncz, and M. L. Kersten. Generic Database Cost Models for Hierarchical Memory Systems. In VLDB, pages 191–202, 2002.
[17] S. B. Navathe, S. Ceri, G. Wiederhold, and J. Dou. Vertical Partitioning Algorithms for Database Design. ACM Transactions on Database Systems, 9(4):680–710, 1984.
[18] H. Plattner. A common database approach for OLTP and OLAP using an in-memory column database. In SIGMOD Conf., pages 1–2, 2009.
[19] R. Ramamurthy, D. J. DeWitt, and Q. Su. A Case for Fractured Mirrors. In VLDB, pages 430–441, 2002.
[20] J. Schaffner, A. Bog, J. Krueger, and A. Zeier. A Hybrid Row-Column OLTP Database Architecture for Operational Reporting. In BIRTE, 2008.
[21] M. Stonebraker, D. J. Abadi, A. Batkin, et al. C-Store: A Column-oriented DBMS. In VLDB, pages 553–564, 2005.
[22] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The End of an Architectural Era (It's Time for a Complete Rewrite). In VLDB, pages 1150–1160, 2007.
[23] M. Stonebraker, L. A. Rowe, and M. Hirohama. The Implementation of Postgres. IEEE Transactions on Knowledge and Data Engineering, 2(1):125–142, 1990.
[24] P. J. Titman. An Experimental Data Base System Using Binary Relations. In IFIP Working Conference Data Base Management, 1974.
[25] M. Zukowski, N. Nes, and P. Boncz. DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing. In DaMoN, 2008.

APPENDIX

In these appendices, we provide several examples of the behavior of HYRISE's physical container design and describe several extensions that further improve HYRISE's performance on modern machines (Appendix A). In Appendix B we give an example of HYRISE's layout selection. We also describe the details of the new benchmark we have developed for this paper (Appendix C). In addition, we give a compact comparison to the Data Morphing cost model (Appendix D), and detailed information on Write Operations (Appendix E) and Layout Generation (Appendix F).

A. PHYSICAL DESIGN AND EXECUTION

Container Alignment Example: As described in Section 3, container alignment on cache boundaries can have a dramatic effect on the number of cache misses. For example, Figure 6 gives the number of cache misses for two containers and for partial projections retrieving 0 to 80 attributes from the containers. The first container is an 80-attribute wide container while the second container is an 86-attribute wide container (all attributes are 4 bytes wide). The first container has a width that is a multiple of the cache line size. The 86-attribute container is not aligned to the cache lines and suffers more cache misses for the partial projections, although the amount of data retrieved is the same in both cases. If this container were to be padded to 384 bytes (instead of using 344 bytes corresponding to the width of its 86 attributes) then both the 80 and the 86 wide containers would behave similarly in terms of cache misses. For this reason, properly aligning containers as done in HYRISE is essential.

Cache Set Collision: Cache collisions due to associativity conflicts can be problematic in cache-aware systems. For this reason, HYRISE automatically adjusts its container alignment policy in order to minimize these cache set collisions.

When the OS allocates a large memory region (for example when creating a container), it usually automatically aligns the beginning of the region with the beginning of a virtual memory page. Virtual memory pages have a fixed size: the address of their first byte always is a multiple of the system-level memory page size PAGESIZE (which is a system constant that can be determined by calling getconf PAGESIZE).

The total number of cache sets $\#sets$ is equal to $L_i.n / assoc$, where $assoc$ is the associativity of the cache. Each memory address $address$ is mapped to a unique cache set $set$ as follows:

$$set(address) = \left\lfloor \frac{address}{L_i.w} \right\rfloor \bmod \#sets \qquad (21)$$

This mapping is cyclic and starts over every $\#sets \cdot L_i.w$ bytes. When the memory page size is a multiple of this cycle length, i.e., when $\mathrm{PAGESIZE} \bmod (\#sets \cdot L_i.w) = 0$, the addresses corresponding to the beginning of the containers are all systematically mapped to the same cache set, thus severely limiting the amount of cache available when processing several containers in parallel. This problem often occurs in practice (it occurs for our test system described in Section 3, for instance).

Figure 6: L2 Misses for Containers with Different Alignments

To alleviate this problem, we maintain a variable $\#containers$ counting the number of containers. When a new container is created, the system shifts the beginning of the container by $L_i.w \cdot (\#containers \bmod \#sets)$ bytes, to maximize the utilization of the cache sets.

Figure 7 illustrates this point for our test system and 100 one-attribute wide containers. Each container is 4 bytes wide and the associativity of the L1 cache is 8 in this case. Without cache set collision optimization, the total number of cachable cache lines available when reading several containers in parallel is 8, since the containers are all aligned to the same cache set and share the same width. Cache evictions thus occur as soon as more than 8 attributes are read in parallel, significantly increasing the number of cache misses (see Figure 7). By offsetting the containers using the method described above, HYRISE is able to read all the containers in parallel without any early cache eviction (the system can read up to 512 containers in parallel in that case). Cache set collisions often occur for the L1 cache. They occur less frequently for the L2 cache, which typically contains a much larger number of sets and has a higher associativity than the L1 cache.

Prefetcher Selection: In addition to allocating and aligning containers to minimize cache misses, HYRISE supports several cache prefetching policies that can be switched on a per-operator basis. Modern CPUs prefetch cache lines that the processor determines are likely to be accessed in the future. The advantage of this approach is that the data for a prefetched cache line starts to be loaded while the previous cache line is still being processed.

Most processors provide several prefetchers and allow applications to select which prefetcher they wish to use. For example, Intel processors based on the Intel Core architecture provide two different L2 hardware prefetchers. The first prefetcher is called Streamer and loads data or instructions from memory to the second-level cache in blocks of 128 bytes. The first access to one of the two cache lines in blocks of 128 bytes triggers the streamer to prefetch the pair of lines. The second is the Data Prefetch Logic (DPL) hardware prefetcher that prefetches data to the second level cache based on request patterns observed in L1. DPL is able to detect more complicated access patterns, even when the program skips access to a certain number of cache lines; it is also able to disable prefetching in the presence of random accesses where prefetching may hurt performance.

To evaluate the performance impact of the different hardware prefetchers we created two layouts, $\lambda_1$, consisting of a single wide container of width $w$, and $\lambda_2$, consisting of a set of containers whose aggregate width was $w$. We accessed a list of random positions in each container, varying the selectivity from 0.0 to 1.0. For accesses to $\lambda_1$ there was no visible difference between the two prefetching implementations (Figure 8) but for accesses to $\lambda_2$, DPL used 24% fewer CPU cycles as it was able to predict skips between containers.

Figure 7: Experiment from Figure 3(a) with Cache Collision

Description            CPU Cycles    L2 Cache Misses
No Writes               1,105,317                  5
Non-temporal Writes    29,902,648             11,683
Normal Writes          40,289,157            557,346
Table 2: Comparing temporal and non-temporal writes

... version concurrency control) that are not re-accessed frequently and are private to a single transaction. These writes can result in cache pollution with data that is not likely to be read. To avoid this pollution, we use non-temporal writes provided by the Intel SIMD Streaming Extensions in HYRISE. When using non-temporal writes the write-combining buffer of the CPU will capture all writes to a single cache line and then write back the cache line directly to memory without storing the modified data in the cache.

To measure the benefit of non-temporal writes, we allocated a single-column container with the size of the L2 cache (6 MB). After scanning the container and reading all values we then measured the number of cache misses performed by an additional scan over the data (which by itself should not produce any L2 cache misses since the data is already cached and the container fits into cache). Concurrently, we write data to a second container; we alternate between using the SSE extension for non-temporal writes and using plain memcpy(). We use the _mm_stream_si128() function to generate the required MOVNTDQ assembler instruction, which causes the write-combining buffer of the CPU to collect all modifications for a cache line's worth of data and write it back to main memory without storing the data in any cache. This operation models the process of accumulating a read set during the execution of a transaction.

Table 2 shows the results of the experiment. The results show that non-temporal writes use only 75% of the CPU cycles and roughly 2% of the cache misses when compared to memcpy(), suggesting that this optimization significantly improves the performance of writes that are not frequently read.

F. HYRISE LAYOUT SELECTION

In this section we illustrate the influence of workload variations on the layouts produced by HYRISE.

Experiment 1: In this experiment, instead of simply running our layouter to find a layout for a given workload, we use a workload for which one of two layouts is optimal, given a mix of queries run with different frequencies (or weights). We use a table with 10 attributes ($a_1 \ldots a_{10}$) and a total width of 56 bytes. The workload consists of two queries: one, an OLTP-style query that scans attribute $a_1$ and, for the rows that match a highly selective predicate, returns all attributes; and two, an OLAP query that scans attribute $a_1$, applying a non-selective predicate, and then does a GROUP BY on the matching rows of attribute $a_1$ and aggregates over the values of attribute $a_4$.

For those two queries, one of two layouts is optimal. The first, $\lambda_1$, separates attribute ($a_1$) and keeps the group ($a_2 \ldots a_{10}$) together; the second layout, $\lambda_2$, also separates ($a_1$), but in addition splits the remaining group into ($a_4$) and ($a_2 \ldots a_3, a_5 \ldots a_{10}$). Given both layouts we want to observe when the layout algorithm chooses to switch from one design to the other.

Assuming that OLTP queries occur more frequently than OLAP queries, we wish to visualize when each layout is more appropriate. We vary the OLTP selectivity $\sigma_{OLTP}$ between 0 and 0.5 and the OLAP selectivity $\sigma_{OLAP}$ between 0.5 and 1. We then compute the cost of each query and determine the point at which the layouts have equal cost:

$$x \cdot Cost_{OLTP}(\lambda_1, \sigma_1) + Cost_{OLAP}(\lambda_1, \sigma_2) = x \cdot Cost_{OLTP}(\lambda_2, \sigma_1) + Cost_{OLAP}(\lambda_2, \sigma_2) \qquad (25)$$

Equation 25 shows the formula used to calculate the cost for layouts $\lambda_1$ and $\lambda_2$ based on the distinct selectivities $\sigma_1$ and $\sigma_2$. Figure 12 shows the result of this experiment. The contour lines define the region where $\lambda_1$'s cost exceeds that of $\lambda_2$ for different values of $\sigma_1$ and $\sigma_2$. For example, when $\sigma_1$ is .005 and $\sigma_2$ is .55, if there are more than 100 OLTP queries per OLAP query, $\lambda_1$ is preferable (in fact, for any $\sigma_1 > .01$, $\lambda_1$ is preferable when there are at least 100 OLTP queries per OLAP query).

Figure 12: Contour plot showing the number of OLTP queries per OLAP query required before workload $\lambda_1$'s cost exceeds that of $\lambda_2$ for different values of $\sigma_1$ and $\sigma_2$.

Experiment 2: A related question is how many partitions are typically generated by our algorithm. In the simplest case, for a table with $n$ attributes and $i$ independent queries that access the table on disjoint attributes with the same weight and selectivity, the table will be split into $i$ partitions.

For more complex cases, our algorithm will only generate a partitioning that optimizes the total workload cost, and won't create a separate storage container for infrequently run queries. In most enterprise applications, there are a small number of query classes (and hence attribute groups) that are accessed frequently; ad-hoc queries may be run but they will likely represent a very small fraction of total access. Hence, these more frequent (highly weighted) queries will dominate the cost and be more heavily represented in the storage layout, resulting in a modest number of storage containers for most workloads (i.e., it is unlikely that a full column-oriented layout will be optimal for non-OLAP workloads).

Figure 13 shows the result of an experiment where we used our approximate graph-partitioning algorithm to determine the best layout for a wide table of 500 4-byte attributes and an increasing number of frequently-posed queries. The experiment starts with a relatively expensive OLTP query (selecting all attributes with a selectivity of 40%). We then iteratively add three random OLAP queries that each project a random pair of attributes, until we have a total of 1000 random queries. As expected, the number of partitions slowly converges towards an "all-column" layout. The time taken to compute the layout using our approximate partitioning algorithm and for K = 10 varies from a few hundred milliseconds initially to a few minutes for the case with 1000 OLAP queries.

Figure 13: Increasing number of OLAP queries
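The break-even point of Experiment 1 follows from solving Equation 25 for x, which gives x = (Cost_OLAP(λ2, σ2) - Cost_OLAP(λ1, σ2)) / (Cost_OLTP(λ1, σ1) - Cost_OLTP(λ2, σ1)). The short Python sketch below only illustrates that algebra. The function name is hypothetical and the four cost values are made-up placeholders rather than measurements from the paper; in HYRISE they would come from the cache-miss cost model.

```python
def break_even_oltp_per_olap(oltp_cost_l1, olap_cost_l1,
                             oltp_cost_l2, olap_cost_l2):
    """Solve Equation 25 for x, the number of OLTP queries per OLAP query
    at which layouts lambda_1 and lambda_2 have equal total cost.

    Returns None when both layouts have the same OLTP cost, because the
    equation then has no unique solution.
    """
    oltp_difference = oltp_cost_l1 - oltp_cost_l2
    if oltp_difference == 0:
        return None
    return (olap_cost_l2 - olap_cost_l1) / oltp_difference


def workload_cost(x, oltp_cost, olap_cost):
    """One side of Equation 25: x OLTP queries plus one OLAP query."""
    return x * oltp_cost + olap_cost


if __name__ == "__main__":
    # Placeholder costs (hypothetical): lambda_1 is cheaper for the OLTP
    # query, lambda_2 is cheaper for the OLAP query.
    x = break_even_oltp_per_olap(10.0, 5000.0, 12.0, 4800.0)
    print("break-even OLTP queries per OLAP query:", x)
    print(workload_cost(x, 10.0, 5000.0), workload_cost(x, 12.0, 4800.0))
```

For these placeholder costs the two layouts cost the same at 100 OLTP queries per OLAP query, and above that ratio the layout with the cheaper OLTP query wins. Evaluating the same expression over a grid of σ1 and σ2 values is one way a contour plot such as Figure 12 could be produced.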