Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age

Uploaded On 2015-09-09


Presentation Transcript

Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age
Viktor Leis, Peter Boncz*, Alfons Kemper, Thomas Neumann
Technische Universität München, *CWI
with some modifications by: S. Sudarshan
Viktor Leis 1/22

Introduction
- The number of CPU cores keeps growing: 4-socket Ivy Bridge EX with 60 cores, 120 threads, 1 TB RAM ($50,000)
- These systems support terabytes of NUMA RAM: disk is not a bottleneck
- For analytic workloads, intra-query parallelization is necessary to utilize such systems

Contributions
- We present an architectural blueprint for a query engine incorporating the following:
  - Morsel-driven query execution (work is distributed between threads dynamically using work stealing)
  - Set of fast parallel algorithms for the most important relational operators
  - Systematic approach to integrating NUMA-awareness into database systems
- Lots of prior work on algorithms for main-memory databases
  - Focus on storage, and on individual operations (hash join, merge join, aggregation, ...)
  - NUMA has been addressed by quite a few papers
- Focus of this paper is on efficiently evaluating a full query, and on algorithms that support pipelined evaluation

Related Work: Volcano-Style Parallelism (1)
- Encapsulation of Parallelism in the Volcano Query Processing System, Goetz Graefe, SIGMOD 1990 (SIGMOD Test of Time Award 2000)
- Plan-driven approach:
  - optimizer statically determines at query compile time how many threads should run
  - instantiates one query operator plan for each thread
  - connects these with exchange operators, which encapsulate parallelism and manage threads
- Elegant model which is used by many systems

Volcano-Style Parallelism (2)
+ Operators are largely oblivious to parallelism
+ Great for shared-nothing parallel systems
- But can do better for shared-memory parallel systems with all data in memory
- Static work partitioning can cause load imbalances
- Degree of parallelism cannot easily be changed mid-query
- Not NUMA aware
- Overhead:
  - Thread oversubscription causes context switching
  - Hash re-partitioning often does not pay off
  - Exchange operators create additional copies of the tuples
Morsel-Driven Query Execution (1)
- Break input into constant-sized work units (morsels)
- Dispatcher assigns morsels to worker threads
- # worker threads = # hardware threads
- Operators are designed for parallel execution

Morsel-Driven Query Execution (2)
- Each pipeline is parallelized individually using all threads

Parallel In-Memory Hash Join
1. Several algorithms have been proposed earlier for parallel in-memory hash join
2. Option 1: partition the relation and process each partition in parallel
3. Option 2: build a global hash table on the build relation, but parallelize both building and probing
4. Earlier work shows Option 2 is better
5. Key issues: maximize locality, minimize synchronization

NUMA-Aware Processing of Build Phase
[Figure]

Morsel-Wise Processing of Probe Phase
[Figure]

Dispatcher
[Figure]

Hash Table
- Unused bits in pointers act as a cheap Bloom filter

Lock-Free Insertion into Hash Table

    insert(entry) {
        // determine slot in hash table
        slot = entry->hash >> hashTableShift
        do {
            old = hashTable[slot]
            // set next to old entry without tag
            entry->next = removeTag(old)
            // add old and new tag
            new = entry | (old & tagMask) | tag(entry->hash)
            // try to set new value, repeat on failure
        } while (!CAS(hashTable[slot], old, new))
    }
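The insertion pseudocode above can be made concrete as follows. This is a minimal C++ sketch, not the authors' implementation. It assumes that pointers fit in the low 48 bits of a 64-bit word (typical for x86-64 user space), so that the upper 16 bits can hold the filter tags. The names `TaggedHashTable`, `hashKey`, `kPointerMask`, `kTagMask`, and `demoInsertAndProbe`, the choice of hash function, and the way a tag bit is picked from the hash are all hypothetical and chosen only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

static_assert(sizeof(void*) == 8, "sketch assumes 64-bit pointers");

// Assumption: user-space addresses fit in the low 48 bits (typical x86-64).
constexpr std::uint64_t kPointerMask = (std::uint64_t{1} << 48) - 1;
constexpr std::uint64_t kTagMask = ~kPointerMask;

struct Entry {
    std::uint64_t hash;
    std::uint64_t key;
    Entry* next;
};

// Hypothetical hash function (multiplicative hashing); any 64-bit hash works.
inline std::uint64_t hashKey(std::uint64_t key) { return key * 0x9E3779B97F4A7C15ull; }

// One of the 16 tag bits, selected here by the low 4 hash bits.
inline std::uint64_t tag(std::uint64_t hash) { return std::uint64_t{1} << (48 + (hash & 15)); }

inline Entry* removeTag(std::uint64_t word) {
    return reinterpret_cast<Entry*>(static_cast<std::uintptr_t>(word & kPointerMask));
}

class TaggedHashTable {
public:
    // 2^bits slots; the slot index is taken from the high hash bits.
    explicit TaggedHashTable(unsigned bits)
        : shift_(64 - bits), slots_(std::size_t{1} << bits) {}

    void insert(Entry* entry) {
        const auto ptr = static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(entry));
        assert((ptr & kTagMask) == 0 && "pointer must fit in 48 bits");
        std::atomic<std::uint64_t>& slot = slots_[entry->hash >> shift_];
        std::uint64_t old = slot.load(std::memory_order_relaxed);
        std::uint64_t desired;
        do {
            entry->next = removeTag(old);                         // chain to old head, tag stripped
            desired = ptr | (old & kTagMask) | tag(entry->hash);  // keep old tags, add the new one
        } while (!slot.compare_exchange_weak(old, desired,        // on failure, old is reloaded
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }

    const Entry* find(std::uint64_t key) const {
        const std::uint64_t hash = hashKey(key);
        const std::uint64_t word = slots_[hash >> shift_].load(std::memory_order_acquire);
        if ((word & tag(hash)) == 0) return nullptr;  // tag bit unset: key cannot be in this chain
        for (const Entry* e = removeTag(word); e != nullptr; e = e->next)
            if (e->key == key) return e;
        return nullptr;
    }

private:
    unsigned shift_;
    std::vector<std::atomic<std::uint64_t>> slots_;
};

// Inserts keys 0..n-1, then checks that all of them (and no other key) are found.
inline bool demoInsertAndProbe(std::uint64_t n) {
    std::vector<Entry> entries(n);
    TaggedHashTable table(10);
    for (std::uint64_t k = 0; k < n; ++k) {
        entries[k] = Entry{hashKey(k), k, nullptr};
        table.insert(&entries[k]);
    }
    for (std::uint64_t k = 0; k < n; ++k)
        if (table.find(k) == nullptr) return false;
    return table.find(n + 12345) == nullptr;
}
```

The demo inserts from a single thread for simplicity. `insert` only touches the shared slot through the compare-and-swap loop, so several build threads could call it concurrently on distinct entries. The sketch assumes that probing starts only after the build phase has finished.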
Storage Implementation
1. Use large virtual memory pages (2 MB) both for the hash table and the tuple storage areas.
   1.1 The number of TLB misses is reduced, the page table is guaranteed to fit into L1 cache, and scalability problems from too many kernel page faults during the build phase are avoided.
2. Allocate the hash table using the Unix mmap system call, if available.
   2.1 Pages get allocated on first write, initialized to 0's
   2.2 Pages are located on the same NUMA node as the thread that first writes the page, ensuring locality if only a single NUMA node is used.
3. May be a good idea to partition tables using primary/foreign key
   3.1 e.g. order and lineitem on orderkey

Morsels
- No load imbalances: all workers finish very close in time
- Morsels make it possible to react to workload changes: priority-based scheduling of dynamic workloads is possible

NUMA Awareness
- NUMA awareness at the morsel level
- E.g., table scan:
  - Relations are partitioned over NUMA nodes
  - Worker threads ask for NUMA-local morsels
  - May steal morsels from other sockets to avoid idle workers

Parallel Aggregation
- Aggregation: partitioning-based with cheap pre-aggregation
- Stage 1: Fixed-size hash table per thread, overflow to partitions
- Stage 2: Final aggregation: thread per partition

Parallel Merge Sort
- Sorting for order by and top-K only; sorting for merge join is not efficient
- Local sort in parallel, followed by parallel merge
- Key issue: finding exact separators. Median-of-medians algorithm.
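Returning to the Morsels and NUMA Awareness slides, the dispatch policy for a table scan (hand out NUMA-local morsels first, steal from other sockets only when local work runs out) might be sketched as below. This is an illustrative C++ sketch under simplifying assumptions, not HyPer's dispatcher. The relation is modeled only as a tuple count per NUMA node, and `Morsel`, `MorselDispatcher`, and `next` are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// A half-open range [begin, end) of tuples stored on one NUMA node.
struct Morsel {
    std::size_t node;
    std::size_t begin;
    std::size_t end;
};

class MorselDispatcher {
public:
    // tuplesPerNode[i] is the number of tuples of the relation stored on node i.
    MorselDispatcher(const std::vector<std::size_t>& tuplesPerNode, std::size_t morselSize)
        : sizes_(tuplesPerNode), cursors_(tuplesPerNode.size()), morselSize_(morselSize) {
        assert(morselSize > 0);
    }

    // Fetches the next morsel for a worker running on localNode.
    // Local data is preferred; other nodes are tried only when the
    // local node is exhausted (work stealing). Returns false when
    // no work is left anywhere.
    bool next(std::size_t localNode, Morsel& out) {
        const std::size_t n = sizes_.size();
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t node = (localNode + i) % n;
            // Cheap check first so exhausted cursors are not advanced forever.
            if (cursors_[node].load(std::memory_order_relaxed) >= sizes_[node]) continue;
            // Atomically claim the next constant-sized range on this node.
            const std::size_t begin =
                cursors_[node].fetch_add(morselSize_, std::memory_order_relaxed);
            if (begin < sizes_[node]) {
                out = Morsel{node, begin, std::min(begin + morselSize_, sizes_[node])};
                return true;
            }
        }
        return false;
    }

private:
    std::vector<std::size_t> sizes_;
    std::vector<std::atomic<std::size_t>> cursors_;
    std::size_t morselSize_;
};
```

Each worker thread would call `next` in a loop with its own node id and run the current pipeline on the returned range. Because a thread only claims one small range at a time, fast threads simply fetch more morsels, which is how the load balancing described on the Morsels slide comes about.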
Evaluation: TPC-H (SF100), Nehalem EX (32 cores)

TPC-H #   time [s]   speedup
   1        0.28       32.4
   2        0.08       22.3
   3        0.66       24.7
   4        0.38       21.6
   5        0.97       21.3
   6        0.17       27.5
   7        0.53       32.4
   8        0.35       31.2
   9        2.14       32.0
  10        0.60       20.0
  11        0.09       37.1
  12        0.22       42.0
  13        1.95       40.0
  14        0.19       24.8
  15        0.44       19.8
  16        0.78       17.3
  17        0.44       30.5
  18        2.78       24.0
  19        0.88       29.5
  20        0.18       33.4
  21        0.91       28.0
  22        0.30       25.7

- single-threaded: 30x faster than PostgreSQL, 10x faster than a commercial column store, similar speed as Vectorwise
- multithreaded: 5x faster than Vectorwise, 50x faster than Cloudera Impala on a 20-node cluster

Scalability
[Figure]

Conclusions
- Getting good scalability and performance on many-core systems is challenging but possible
- However, it is not possible to bolt on parallelism to an existing query engine; one must redesign it with modern hardware in mind
- With morsel-driven parallelism, HyPer can finish ad hoc queries on hundreds of GBs in seconds

www.hyper-db.com