Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age
Viktor Leis, Peter Boncz*, Alfons Kemper, Thomas Neumann
Technische Universität München, *CWI
with some modifications by: S. Sudarshan

Viktor Leis 1/22

Introduction
* Number of CPU cores keeps growing: 4-socket Ivy Bridge EX with 60 cores, 120 threads, 1 TB RAM (50,000$)
* These systems support terabytes of NUMA RAM: disk is not a bottleneck
* For analytic workloads intra-query parallelization is necessary to utilize such systems

Contributions
* We present an architectural blueprint for a query engine incorporating the following
  * Morsel-driven query execution (work is distributed between threads dynamically using work stealing)
  * Set of fast parallel algorithms for the most important relational operators
  * Systematic approach to integrating NUMA-awareness into database systems
* Lots of prior work on algorithms for main-memory databases
  * Focus on storage, and on individual operations (hash join, merge join, aggregation, ...)
  * NUMA has been addressed by quite a few papers
* Focus of this paper is on efficiently evaluating a full query, and on algorithms that support pipelined evaluation

Related Work: Volcano-Style Parallelism (1)
* Encapsulation of Parallelism in the Volcano Query Processing System, Goetz Graefe, SIGMOD 1990 (SIGMOD Test of Time Award 2000)
* Plan-driven approach:
  * optimizer statically determines at query compile time how many threads should run
  * instantiates one query operator plan for each thread
  * connects these with exchange operators, which encapsulate parallelism and manage threads
* Elegant model which is used by many systems

Volcano-Style Parallelism (2)
+ Operators are largely oblivious to parallelism
+ Great for shared-nothing parallel systems
But can do better for shared memory parallel systems with all data in-memory
- Static work partitioning can cause load imbalances
- Degree of parallelism cannot easily be changed mid-query
- Not NUMA aware
- Overhead:
  * Thread oversubscription causes context switching
  * Hash re-partitioning often does not pay off
  * Exchange operators create additional copies of the tuples

Morsel-Driven Query Execution (1)
* Break input into constant-sized work units (morsels)
* Dispatcher assigns morsels to worker threads
* # worker threads = # hardware threads
* Operators are designed for parallel execution

Morsel-Driven Query Execution (2)
* Each pipeline is parallelized individually using all threads

Parallel In-Memory Hash Join
1. Several algorithms proposed earlier for parallel in-memory hash join
2. Option 1: partition relation and process each partition in parallel
3. Option 2: build a global hash table on build relation, but parallelize both building and probing
4. Earlier work shows Option 2 is better
5. Key issues: maximize locality, minimize synchronization

[Figure: NUMA-aware Processing of Build Phase]

[Figure: Morsel-Wise Processing of Probe Phase]

[Figure: Dispatcher]

Hash Table
* Unused bits in pointers act as a cheap Bloom filter

Lock-Free Insertion into Hash Table

    insert(entry) {
      // determine slot in hash table
      slot = entry->hash >> hashTableShift
      do {
        old = hashTable[slot]
        // set next to old entry without tag
        entry->next = removeTag(old)
        // add old and new tag
        new = entry | (old & tagMask) | tag(entry->hash)
        // try to set new value, repeat on failure
      } while (!CAS(hashTable[slot], old, new))
    }

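The tagging idea behind the insertion routine above can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions, not the HyPer implementation: the 48/16-bit split of a slot word, the choice of tag bit (`h % 16`), the class name `TaggedHashTable`, and the `chain_walks` counter are all hypothetical, list indices stand in for pointers, keys are assumed to already be 32-bit hash values, and the plain store in `insert` is where a real implementation would retry with compare-and-swap.

```python
PTR_BITS = 48                       # low bits: the "pointer" (here a list index)
TAG_BITS = 16                       # high bits: one-bit-per-entry filter tag
PTR_MASK = (1 << PTR_BITS) - 1
TAG_MASK = ((1 << TAG_BITS) - 1) << PTR_BITS
HASH_BITS = 32                      # assumed width of the hash values


def tag(h):
    """Select one of the 16 tag bits from the low bits of the hash."""
    return 1 << (PTR_BITS + (h % TAG_BITS))


class TaggedHashTable:
    def __init__(self, log2_slots=4):
        self.shift = HASH_BITS - log2_slots    # slot index = top bits of the hash
        self.slots = [0] * (1 << log2_slots)   # 0 means "empty slot"
        self.entries = [None]                  # index 0 stands in for the null pointer
        self.chain_walks = 0                   # probes that had to follow a chain

    def insert(self, h, value):
        slot = h >> self.shift
        old = self.slots[slot]
        # The new entry's next pointer is the old head with its tag removed.
        self.entries.append((h, value, old & PTR_MASK))
        new_ptr = len(self.entries) - 1
        # Keep the old tag bits and add the tag of the new entry.
        # Assumption: single-threaded; a concurrent version would CAS here.
        self.slots[slot] = new_ptr | (old & TAG_MASK) | tag(h)

    def probe(self, h):
        word = self.slots[h >> self.shift]
        if not (word & tag(h)):
            return None                        # filtered out: chain is never touched
        self.chain_walks += 1
        ptr = word & PTR_MASK
        while ptr:
            key, value, next_ptr = self.entries[ptr]
            if key == h:
                return value
            ptr = next_ptr
        return None
```

In this model, a probe whose tag bit is not set in the slot word returns immediately without following the chain, which is the sense in which the unused pointer bits behave like a small Bloom filter. A probe whose tag bit happens to collide with a stored entry still walks the chain and finds nothing, so the filter can produce false positives but never false negatives.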
Storage Implementation
1. Use large virtual memory pages (2 MB) both for the hash table and the tuple storage areas.
   1.1 The number of TLB misses is reduced, the page table is guaranteed to fit into L1 cache, and scalability problems from too many kernel page faults during the build phase are avoided.
2. Allocate the hash table using the Unix mmap system call, if available.
   2.1 A page gets allocated on first write, initialized to 0's
   2.2 Pages are located on the same NUMA node as the thread that first writes the page, ensuring locality if only a single NUMA node is used.
3. May be a good idea to partition tables using primary/foreign key
   3.1 e.g. order and lineitem on orderkey

Morsels
* No load imbalances: all workers finish very close in time
* Morsels make it possible to react to workload changes: priority-based scheduling of dynamic workloads possible

NUMA Awareness
* NUMA awareness at the morsel level
* E.g., Table scan:
  * Relations are partitioned over NUMA nodes
  * Worker threads ask for NUMA-local morsels
  * May steal morsels from other sockets to avoid idle workers

Parallel Aggregation
* Aggregation: partitioning-based with cheap pre-aggregation
* Stage 1: Fixed size hash table per thread, overflow to partitions
* Stage 2: Final aggregation: thread per partition

Parallel Merge Sort
* Sorting for order by and top-K only, sorting for merge join not efficient
* Local sort in parallel, followed by parallel merge
* Key issue: finding exact separators. Median-of-medians algo.

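The NUMA-local morsel assignment with stealing described under "NUMA Awareness" can be sketched as follows. This is a minimal single-threaded model under assumptions, not the dispatcher from the paper: the class name `MorselDispatcher`, the `rows_per_node` mapping, and the policy of stealing from the node with the most remaining morsels are hypothetical choices made for illustration, and the morsel size used in the usage note is unrealistically small.

```python
from collections import deque


class MorselDispatcher:
    """Hands out (node, start, end) row ranges, preferring NUMA-local work."""

    def __init__(self, rows_per_node, morsel_size):
        # rows_per_node: hypothetical mapping of NUMA node id -> rows stored there.
        # Each node gets its own queue of constant-sized row ranges (morsels);
        # only the last morsel of a node may be shorter.
        self.queues = {
            node: deque(
                (start, min(start + morsel_size, rows))
                for start in range(0, rows, morsel_size)
            )
            for node, rows in rows_per_node.items()
        }

    def next_morsel(self, worker_node):
        """Return the next morsel for a worker running on worker_node, or None."""
        local = self.queues[worker_node]
        if local:
            start, end = local.popleft()
            return worker_node, start, end
        # Local work is exhausted: steal from the node with the most morsels left
        # (assumed policy) so that the worker does not sit idle.
        victim = max(self.queues, key=lambda node: len(self.queues[node]))
        if self.queues[victim]:
            start, end = self.queues[victim].popleft()
            return victim, start, end
        return None  # the whole pipeline input has been handed out
```

With `MorselDispatcher({0: 10, 1: 4}, morsel_size=4)`, a worker on node 1 first receives its single local morsel and on its next request is handed a morsel from node 0. A concurrent implementation would additionally need to make the queue operations thread-safe, which this sketch leaves out.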
Evaluation: TPC-H (SF 100), Nehalem EX (32 cores)

TPC-H #   time [s]   speedup      TPC-H #   time [s]   speedup
   1        0.28       32.4          12       0.22       42.0
   2        0.08       22.3          13       1.95       40.0
   3        0.66       24.7          14       0.19       24.8
   4        0.38       21.6          15       0.44       19.8
   5        0.97       21.3          16       0.78       17.3
   6        0.17       27.5          17       0.44       30.5
   7        0.53       32.4          18       2.78       24.0
   8        0.35       31.2          19       0.88       29.5
   9        2.14       32.0          20       0.18       33.4
  10        0.60       20.0          21       0.91       28.0
  11        0.09       37.1          22       0.30       25.7

* single threaded: 30x faster than PostgreSQL, 10x faster than commercial column store, similar speed as Vectorwise
* multithreaded: 5x faster than Vectorwise, 50x faster than Cloudera Impala on 20-node cluster

[Figure: Scalability]

Conclusions
* Getting good scalability and performance on many-core systems is challenging but possible
* However, it is not possible to bolt on parallelism to an existing query engine; one must redesign it with modern hardware in mind
* With morsel-driven parallelism HyPer can finish ad hoc queries on hundreds of GBs in seconds

www.hyper-db.com