Efficient Parallel CKY Parsing on GPUs
Youngmin Yi, Chao-Yue Lai, Slav Petrov, Kurt Keutzer


… University of California, Berkeley, Berkeley, CA, USA
Google Research, New York, NY, USA

Abstract
Low-latency solutions for syntactic parsing are needed if parsing is to become an integral part of user-facing …


Figure 1: An example of a parse tree for the sentence "I love you."

…fine-grained parallelism and synchronization options that make this possible. We empirically evaluate the various parallel implementations on two NVIDIA GPUs (GTX 480 and GTX 285) in Section 5. We observe that some parallelization options are architecture dependent, emphasizing that a thorough understanding of the programming model and the underlying hardware is needed to achieve good results. Our implementation on NVIDIA's GTX 480 using CUDA results in a 26-fold speedup compared to the original sequential C implementation. On the GTX 285 GPU we obtain a 14-fold speedup.

Parallelizing natural language parsers has been studied previously (see Section 6); however, previous work has focused on scenarios where only a limited level of coarse-grained parallelism could be utilized, or the underlying hardware required unrealistic restrictions on the size of the context-free grammar. To the best of our knowledge, this is the first GPU-based parallel syntactic parser using a state-of-the-art grammar.

2 Natural Language Parsing

While we assume a basic familiarity with probabilistic CKY parsing, in this section we briefly review the CKY dynamic programming algorithm and the Viterbi algorithm for extracting the highest scoring path through the dynamic program.

2.1 Context-Free Grammars

In this work we focus our attention on constituency parsing and assume that a weighted CFG is available to us. In our experiments we will use a probabilistic latent variable CFG (Petrov et al., 2006). However, our algorithms can be used with any weighted CFG, including discriminative ones, such as the ones in Petrov and Klein (2007a) and Finkel et al. (2008).[1] The grammars in our experiments have on the order of thousands of nonterminals and millions of productions.

Figure 2: The chart that visualizes the bottom-up process of CKY parsing for the sentence "I love you."

Figure 1(a) shows a constituency parse tree. Leaf nodes in the parse tree, also called terminal nodes, correspond to words in the language. Preterminals correspond to part-of-speech tags, while the other nonterminals correspond to phrasal categories. For ease of exposition, we will say that terminal productions are part of a lexicon. For example, (L1) in Figure 1(b) is a lexical rule providing a score (of −0.23) for mapping the word "I" to the symbol "PRP." We assume that the grammar has been binarized and contains only unary and binary productions. We refer to the application of grammar rules as unary/binary relaxations.

2.2 Sequential CKY Parsing

The CKY algorithm is an exhaustive bottom-up algorithm that uses dynamic programming to incrementally build up larger tree structures. To keep track of the scores of these structures, a chart indexed by the start and end positions and the symbol under consideration is used: scores[start][end][symbol] (see also Figure 2). After initializing the preterminal level of the chart with the part-of-speech scores from the lexicon, the algorithm continues by repeatedly applying all binary and unary rules in order to build up larger spans (pseudocode is given in Figure 3). To reconstruct the highest scoring parse tree we perform a top-down search. We found this to be more efficient than keeping backpointers.[2]

One should also note that many real-world applications benefit from, or even expect, n-best lists of possible parse trees. Using the lazy evaluation algorithm of Huang and Chiang (2005) the extraction …

[1] For feature-rich discriminative models a trivially parallelizable pass can be used to pre-compute the rule-potentials.
[2] This observation is due to Dan Klein, p.c.
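The pseudocode of Figures 3 and 4 does not survive in this transcript, so the following is a minimal C-style sketch (not the paper's code) of the sequential binary relaxation pass described in Section 2.2. The scores[start][end][symbol] chart layout follows the text; the flat BinaryRule array, the cell() indexing helper, and all names are illustrative assumptions.

/* Minimal sketch of sequential CKY binary relaxation, assuming a flat
 * rule array and a chart flattened over [start][end][symbol]. */
typedef struct { int parent, left, right; float score; } BinaryRule;

/* Index into the flattened chart. */
static inline int cell(int start, int end, int sym, int nWords, int nSyms) {
    return (start * (nWords + 1) + end) * nSyms + sym;
}

/* One bottom-up pass over all spans, smallest first. The outer loop over
 * span lengths is inherently sequential (larger spans depend on smaller
 * ones); the inner loops are the ones that can be parallelized. */
void ckyBinaryPass(float *scores, int nWords, int nSyms,
                   const BinaryRule *rules, int nRules) {
    for (int length = 2; length <= nWords; length++)
        for (int start = 0; start + length <= nWords; start++) {
            int end = start + length;
            for (int split = start + 1; split < end; split++)
                for (int r = 0; r < nRules; r++) {      /* binary relaxation */
                    const BinaryRule *b = &rules[r];
                    float s = b->score
                            + scores[cell(start, split, b->left, nWords, nSyms)]
                            + scores[cell(split, end, b->right, nWords, nSyms)];
                    int p = cell(start, end, b->parent, nWords, nSyms);
                    if (s > scores[p]) scores[p] = s;   /* Viterbi max */
                }
        }
}

Scores here are log-probabilities, so rule application adds scores and keeps a running maximum per parent symbol; it is exactly this max-update that requires synchronization once the inner loops run in parallel on the GPU.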
Figure 5: The hierarchical organization of GPUs.

…branch and should be avoided as much as possible when designing parallel algorithms and mapping applications onto a GPU. In the programming model of CUDA, one or more warps are implicitly grouped into thread blocks. Different thread blocks are mapped to SMs and can be executed independently of one another.

3.3 Shared Memory

Generally speaking, manycore architectures (like GPUs) have more ALUs in place of on-chip caches, making arithmetic operations relatively cheaper and global memory accesses relatively more expensive. Thus, to achieve good performance, it is important to increase the ratio of Compute to Global Memory Access (CGMA) (Kirk and Hwu, 2010), which can be done in part by cleverly utilizing the different types of shared on-chip memory in each SM.

Threads in a thread block are mapped onto the same SM and can cooperate with one another by sharing data through the on-chip shared memory of the SM (shown in Figure 5). This shared memory has two orders of magnitude less latency than the off-chip global memory, but is very small (16KB to 64KB, depending on the architecture). CUDA therefore provides the programmer with the flexibility (and burden) to explicitly manage shared memory (i.e., loading a value from global memory and storing it).

Additionally, GPUs also have so-called texture memory and constant memory. Texture memory can be written only from the host CPU, but provides caching and is shared across different SMs. Hence it is often used for storing frequently accessed read-only data. Constant memory is very small and, as its name suggests, is only appropriate for storing constants used across thread blocks.

3.4 Synchronization

CUDA provides a set of APIs for thread synchronization. When threads perform a reduction, or need to access a single variable in a mutually exclusive way, atomic operations are used. Atomic operation APIs take as arguments the memory location (i.e., the pointer of the variable to be reduced) and the value. However, atomic operations on global memory can be very costly, as they need to serialize a potentially large number of threads in the kernel. To reduce this overhead, one usually applies atomic operations first to variables declared in the shared memory of each thread block. After these reductions have completed, another set of atomic operations is done.

In addition, CUDA provides an API (__syncthreads()) to realize a barrier synchronization between threads in the same thread block. This API forces each thread to wait until all threads in the block have reached the calling line. Note that there is no API for barrier synchronization between all threads in a kernel. Since a return from a kernel accomplishes a global barrier synchronization, one can use separate kernels when a global barrier synchronization is needed.

4 Parallel CKY Parsing on GPUs

The dynamic programming loops of the CKY algorithm provide various types of parallelism. While the loop in Figure 3 cannot be parallelized due to dependencies between iterations, all four loops in Figure 4 could in principle be parallelized. In this section, we discuss the different design choices and strategies for parallelizing the binary relaxation step that accounts for the bulk of the overall execution time of the CKY algorithm.

4.1 Thread Mapping

The essential step in designing applications on a parallel platform is to determine which execution entity in the parallel algorithm should be mapped to the underlying parallel hardware thread in the platform. For a CKY parser with millions of grammar rules and thousands of symbols, one can map either rules or symbols to threads. At first sight it might appear that mapping chart cells or symbols to threads is a natural choice, as it is equivalent to executing the first loop in Figure 4 in parallel. However, if we map a symbol to a thread, then it not only fails to provide enough parallelism to fully utilize the massive number of threads in …
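To make the two-level reduction of Section 3.4 concrete, here is a hedged CUDA sketch; it is not the paper's Figure 6/7 code. CUDA provides no atomicMax for floats, so the helper emulates one with atomicCAS, a standard idiom; atomicMaxFloat, maxReduceKernel, and all names are illustrative assumptions.

#include <cuda_runtime.h>
#include <math_constants.h>

/* Emulated float atomic max via atomicCAS (a common CUDA idiom). */
__device__ float atomicMaxFloat(float *addr, float val) {
    int *iaddr = (int *)addr;
    int old = *iaddr;
    while (val > __int_as_float(old)) {
        int assumed = old;
        old = atomicCAS(iaddr, assumed, __float_as_int(val));
        if (old == assumed) break;   /* our value was installed */
    }
    return __int_as_float(old);
}

/* Two-level reduction as described in Section 3.4: threads first combine
 * into a shared-memory variable (cheap), then thread 0 issues the single
 * costly atomic on global memory. */
__global__ void maxReduceKernel(const float *in, float *globalMax, int n) {
    __shared__ float blockMax;
    if (threadIdx.x == 0) blockMax = -CUDART_INF_F;
    __syncthreads();                              /* barrier within the block */

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicMaxFloat(&blockMax, in[i]);  /* shared-memory atomics */
    __syncthreads();

    if (threadIdx.x == 0)
        atomicMaxFloat(globalMax, blockMax);      /* one global atomic per block */
}

Each block thus pays for many cheap shared-memory atomics but only a single global atomic, which is the pattern the paper applies to the per-parent score updates.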
Figure 8: Creating virtual symbols whenever a symbol has too many rules.

…thread block needs to perform the check. Since this check involves a global memory access, it is costly. Minimizing the number of global memory accesses is key to the performance of parallel algorithms on GPUs.

A challenging aspect of the block-based mapping comes from the fact that the number of rules per symbol can exceed the maximum number of threads per thread block (1,024 or 512, depending on the GPU architecture). To circumvent this limitation, we introduce virtual symbols, which host different partitions of the original rules, as shown in Figure 8. Introducing virtual symbols does not increase the complexity of the algorithm, because virtual symbols only exist until we perform the maximum reductions, at which point they are converted to the original symbols.

4.2 Span-Level Parallelism

Another level of parallelism, which is orthogonal to the previously discussed mappings, is present in the first loop in Figure 4. Spans in the same level in the chart (see Figure 2) are independent of each other and can hence be executed in parallel by mapping them to thread blocks (line 1 of Figure 6 and Figure 7). Since CUDA provides up to three-dimensional (x, y, z) indexing of thread blocks, this can be easily accomplished: we create two-dimensional grids whose X axis corresponds to symbols in block-based mapping, or simply a group of rules in thread-based mapping, and whose Y axis corresponds to the spans.

4.3 Thread Synchronization

Thread synchronization is needed to correctly compute the maximum scores in parallel. Synchronization can be achieved by atomic operations or by parallel reductions using __syncthreads(), as explained in Section 3. The most viable synchronization method will of course vary depending on the mapping we choose.

Figure 9: Parallel reduction on shared memory between threads in the same thread block with 8 threads.

In practice, only atomic operations are an option in thread-based mapping, since we would otherwise need as many executions of parallel reductions as the number of different parent symbols in each thread block. In block-based mapping, on the other hand, both parallel reductions and atomic operations can be applied.

4.3.1 Atomic Operations

In thread-based mapping, to correctly update the score, each thread needs to call the atomic max operation API with a pointer to the desired write location. However, this operation can be very slow (as we will confirm in Section 5), so we instead perform a first reduction by calling the API with a pointer to a shared variable (as shown in line 14 of Figure 6), and then perform a second reduction with a pointer to the scores array (as shown in line 17 of Figure 6). When we call atomic operations on shared memory, shared variables need to be declared for all symbols. This is necessary because in thread-based mapping threads in the same thread block can have different parent symbols.

In block-based mapping we can also use atomic operations on shared memory. However, in this mapping all threads in a thread block have the same parent symbol, and therefore only one shared variable per thread block is needed for the parent symbol (as shown in line 15 of Figure 7). All the reductions are performed on this single shared variable. Compared to thread-based mapping, block-based mapping requires a fraction of the shared memory, and the costly atomic operations on global memory are performed only once (as shown in line 18 of Figure 7).
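Combining block-based mapping, span-level parallelism, and the parallel reduction of Figure 9 yields a kernel of roughly the following shape. This is a hedged sketch under assumed data layouts, not the paper's Figure 7: ruleOffset (rules grouped by parent symbol), the flattened chart, chart_out, and THREADS are all illustrative assumptions, and blockDim.x is assumed to equal the power of two THREADS.

#include <math_constants.h>

#define THREADS 512

struct BinRule { int left, right; float score; };

/* Block-based mapping: blockIdx.x picks a (possibly virtual) parent
 * symbol, blockIdx.y a span at the current chart level, and each thread
 * scores one binary rule of that parent. */
__global__ void binaryRelax(const BinRule *rules, const int *ruleOffset,
                            const float *chart, float *chart_out,
                            int nSyms, int nWords, int length) {
    __shared__ float part[THREADS];          /* per-thread partial maxima */
    int sym = blockIdx.x, start = blockIdx.y, end = start + length;

    float best = -CUDART_INF_F;
    int r = ruleOffset[sym] + threadIdx.x;   /* one rule per thread */
    if (r < ruleOffset[sym + 1]) {
        BinRule b = rules[r];
        for (int split = start + 1; split < end; split++) {
            float s = b.score
                    + chart[(start * (nWords + 1) + split) * nSyms + b.left]
                    + chart[(split * (nWords + 1) + end) * nSyms + b.right];
            best = fmaxf(best, s);
        }
    }
    part[threadIdx.x] = best;
    __syncthreads();

    /* Tree-shaped parallel reduction in shared memory (cf. Figure 9). */
    for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            part[threadIdx.x] = fmaxf(part[threadIdx.x], part[threadIdx.x + stride]);
        __syncthreads();
    }
    if (threadIdx.x == 0) {                  /* single global write per block */
        int p = (start * (nWords + 1) + end) * nSyms + sym;
        chart_out[p] = fmaxf(chart_out[p], part[0]);
    }
}

A launch such as binaryRelax<<<dim3(nSymsWithVirtual, nWords - length + 1), THREADS>>>(...) (nSymsWithVirtual being a hypothetical count that includes the virtual symbols) would then cover every (parent symbol, span) pair at one chart level, and the return from each per-level kernel provides the global barrier discussed in Section 3.4.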
Figure 11: Speedups of various versions of the parallel CKY parser with different mappings, synchronization methods, and memory access optimizations.

On the GTX 285, the transformed access pattern for the scores array, along with accesses to the shared memory (Block+PR+SH), improves the performance by about 10%, showing an 11.1× speedup. Placing the scores array in texture memory improves all implementations. The reduced binding cost due to the array reorganization results in additional gains of about 25% for Block+PR+SS+tex:scores and Block+PR+tex:scores against Block+PR+SS and Block+PR (for total speedups of 17.4× and 13.0×, respectively). However, placing the rule information in texture memory improves the performance little, as there are many more accesses to the scores array than to the rule information.

The GTX 480 is the Fermi architecture (NVIDIA, 2009), with many features added relative to the GTX 285. The number of cores doubled from 240 to 480, but the number of SMs was halved from 30 to 15. The biggest difference is the introduction of an L1 cache in addition to the shared memory per SM. For these reasons, all parallel implementations are faster on the GTX 480 than on the GTX 285. On the GTX 480, executing spans sequentially (SS) does not improve the performance, but actually degrades it. This is because this GPU has an L1 cache and a higher global memory bandwidth, so that reducing the parallelism actually limits the performance. Utilizing texture memory or shared memory for the scores array does not help either, because the GTX 480 hardware already caches the scores array in the L1 cache.

Interestingly, the ranking of the various parallelization configurations in terms of speedup is architecture dependent: on the GTX 285, the block-based mapping and sequential span processing are preferred, and the parallel reduction is preferred over shared-memory atomic operations. Using texture memory is also helpful on the GTX 285. On the GTX 480, block-based mapping is also preferred, but sequential span mapping is not. The parallel reduction is clearly better than shared-memory atomic operations, and there is no need to utilize texture memory on the GTX 480. It is important to understand how the different design choices affect the performance, since different choices might be necessary for grammars with different numbers of symbols and rules.

6 Related Work

A substantial body of related work on parallelizing natural language parsers has accumulated over the last two decades (van Lohuizen, 1999; Giachin and Rullent, 1989; Pontelli et al., 1998; Manousopoulou et al., 1997). However, none of this work is directly comparable to ours, as GPUs provide much more fine-grained possibilities for parallelization. The parallel parsers in past work are implemented on multicore systems, where the limited parallelization possibilities provided by the systems restrict the speedups that can be achieved. For example, van Lohuizen (1999) reports a 1.8× speedup, while Manousopoulou et al. (1997) claim a 7-8× speedup. In contrast, our parallel parser is implemented on a manycore system with an abundant number of threads and pro…
References

K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick. 2006. The landscape of parallel computing research: A view from Berkeley. Technical report, Electrical Engineering and Computer Sciences, University of California at Berkeley.

J. Bordim, Y. Ito, and K. Nakano. 2002. Accelerating the CKY parsing using FPGAs. In ICHPC'02.

E. Charniak and M. Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In ACL'05.

E. Charniak. 2000. A maximum-entropy-inspired parser. In NAACL'00.

J. Cocke and J. T. Schwartz. 1970. Programming languages and their compilers: Preliminary notes. Technical report, Courant Institute of Mathematical Sciences, New York University.

M. Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA, USA.

A. Dunlop, N. Bodenstab, and B. Roark. 2011. Efficient matrix-encoded grammars and low latency parallelization strategies for CYK. In IWPT'11.

J. Finkel, A. Kleeman, and C. D. Manning. 2008. Efficient, feature-based, conditional random field parsing. In ACL/HLT'08.

E. P. Giachin and C. Rullent. 1989. A parallel parser for spoken natural language. In IJCAI'89.

J. Goodman. 1997. Global thresholding and multiple-pass parsing. In EMNLP'97.

L. Huang and D. Chiang. 2005. Better k-best parsing. In IWPT'05.

T. Kasami. 1965. An efficient recognition and syntax-analysis algorithm for context-free languages. Scientific Report AFCRL-65-758, Air Force Cambridge Research Lab.

David B. Kirk and Wen-mei W. Hwu. 2010. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition.

D. Klein and C. Manning. 2003. A* parsing: Fast exact Viterbi parse selection. In NAACL'03.

E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. 2008. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39–55.

A. G. Manousopoulou, G. Manis, P. Tsanakas, and G. Papakonstantinou. 1997. Automatic generation of portable parallel natural language parsers. In Proceedings of the 9th Conference on Tools with Artificial Intelligence.

M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. In Computational Linguistics.

J. Nickolls, I. Buck, M. Garland, and K. Skadron. 2008. Scalable parallel programming with CUDA. ACM Queue, 6(2):40–53.

T. Ninomiya, K. Torisawa, K. Taura, and J. Tsujii. 1997. A parallel CKY parsing algorithm on large-scale distributed-memory parallel machines. In PACLING'97.

NVIDIA. 2009. Fermi: NVIDIA's next generation CUDA compute architecture. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.

A. Pauls and D. Klein. 2009. Hierarchical search for parsing. In NAACL-HLT'09.

S. Petrov and D. Klein. 2007a. Discriminative log-linear grammars with latent variables. In NIPS'07.

S. Petrov and D. Klein. 2007b. Improved inference for unlexicalized parsing. In NAACL'07.

S. Petrov, L. Barrett, R. Thibaux, and D. Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In ACL'06.

E. Pontelli, G. Gupta, J. Wiebe, and D. Farwell. 1998. Natural language processing: A case study. In AAAI'98.

M. P. van Lohuizen. 1999. Parallel processing of natural language parsers. In ParCo'99, pages 17–20.

D. H. Younger. 1967. Recognition and parsing of context-free languages in time n³. Information and Control, 10.