Parsing with Compositional Vector Grammars

Richard Socher, John Bauer, Christopher D. Manning, Andrew Y. Ng
Computer Science Department, Stanford University, Stanford, CA 94305, USA
richard@socher.org, horatio@gmail.com, manning@stanford.edu, ang@cs.stanford.edu

Abstract

Natural language parsing has typically been done with small sets of discrete categories [...]


[...] n-grams will never be seen during training, especially when n is large. In these cases, one cannot simply use distributional similarities to represent unseen phrases. In this work, we present a new solution to learn features and phrase representations even for very long, unseen n-grams.

We introduce a Compositional Vector Grammar Parser (CVG) for structure prediction. Like the above work on parsing, the model addresses the problem of representing phrases and categories. Unlike them, it jointly learns how to parse and how to represent phrases as both discrete categories and continuous vectors, as illustrated in Fig. 1. CVGs combine the advantages of standard probabilistic context free grammars (PCFG) with those of recursive neural networks (RNNs). The former can capture the discrete categorization of phrases into NP or PP, while the latter can capture fine-grained syntactic and compositional-semantic information on phrases and words. This information can help in cases where syntactic ambiguity can only be resolved with semantic information, such as in the PP attachment of the two sentences: They ate udon with forks. vs. They ate udon with chicken.

Previous RNN-based parsers used the same (tied) weights at all nodes to compute the vector representing a constituent (Socher et al., 2011b). This requires the composition function to be extremely powerful, since it has to combine phrases with different syntactic head words, and it is hard to optimize since the parameters form a very deep neural network. We generalize the fully tied RNN to one with syntactically untied weights. The weights at each node are conditionally dependent on the categories of the child constituents. This allows different composition functions when combining different types of phrases, and is shown to result in a large improvement in parsing accuracy.
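As a concrete illustration of syntactically untied composition, the following minimal Python/NumPy sketch selects a different composition matrix for each ordered pair of child categories. The specific category pairs, the 25-dimensional vectors, and the near-averaging initialization are assumptions made for this sketch, not details of the released parser.

```python
import numpy as np

n = 25  # illustrative vector dimensionality

def init_matrix(n):
    # Start near an "averaging" of the two children plus small noise
    # (an assumption for this sketch, not necessarily the parser's initialization).
    return 0.5 * np.hstack([np.eye(n), np.eye(n)]) + 0.01 * np.random.randn(n, 2 * n)

# One composition matrix per ordered pair of child categories (syntactically untied weights);
# a fully tied RNN would instead use a single W at every node.
W = {("DT", "NN"): init_matrix(n), ("VP", "PP"): init_matrix(n)}

def compose(left_cat, left_vec, right_cat, right_vec):
    """Parent vector p = f(W^{(B,C)} [b; c]), with W chosen by the children's categories."""
    children = np.concatenate([left_vec, right_vec])      # [b; c] in R^{2n}
    return np.tanh(W[(left_cat, right_cat)] @ children)   # p in R^n

# Example: combining a determiner vector with a noun vector under the (DT, NN) matrix.
p = compose("DT", np.random.randn(n), "NN", np.random.randn(n))
```

Because each category pair has its own matrix, a combination such as VP:PP can learn to weight the VP child more heavily, which is the soft head-word effect discussed in the model analysis below.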
Our compositional distributed representation allows a CVG parser to make accurate parsing decisions and capture similarities between phrases and sentences. Any PCFG-based parser can be improved with an RNN. We use a simplified version of the Stanford Parser (Klein and Manning, 2003a) as the base PCFG and improve its accuracy from 86.56 to 90.44% labeled F1 on all sentences of the WSJ section 23. The code of our parser is available at nlp.stanford.edu.

2 Related Work

The CVG is inspired by two lines of research: enriching PCFG parsers through more diverse sets of discrete states, and recursive deep learning models that jointly learn classifiers and continuous feature representations for variable-sized inputs.

Improving Discrete Syntactic Representations. As mentioned in the introduction, there are several approaches to improving discrete representations for parsing. Klein and Manning (2003a) use manual feature engineering, while Petrov et al. (2006) use a learning algorithm that splits and merges the syntactic categories in order to maximize likelihood on the treebank. Their approach splits categories into several dozen subcategories. Another approach is lexicalized parsers (Collins, 2003; Charniak, 2000) that describe each category with a lexical item, usually the head word. More recently, Hall and Klein (2012) combine several such annotation schemes in a factored parser. We extend the above ideas from discrete representations to richer continuous ones. The CVG can be seen as factoring discrete and continuous parsing in one model.

Another different approach to the above generative models is to learn discriminative parsers using many well designed features (Taskar et al., 2004; Finkel et al., 2008). We also borrow ideas from this line of research in that our parser combines the generative PCFG model with discriminatively learned RNNs.

Deep Learning and Recursive Deep Learning. Early attempts at using neural networks to describe phrases include Elman (1991), who used recurrent neural networks to create representations of sentences from a simple toy grammar and to analyze the linguistic expressiveness of the resulting representations. Words were represented as one-hot vectors, which was feasible since the grammar only included a handful of words. Collobert and Weston (2008) showed that neural networks can perform well on sequence labeling language processing tasks while also learning appropriate features. However, their model is lacking in that it cannot represent the recursive structure inherent in natural language. They partially circumvent this problem by using either independent window-based classifiers or a convolutional layer. RNN-specific training was introduced by Goller and Küchler (1996) to learn distributed representations of given, structured objects such as logical terms. In contrast, our model both predicts the structure and its representation.

[...] at the i-th index. So a_w = L e_i ∈ R^n. Henceforth, after mapping each word to its vector, we represent a sentence S as an ordered list of (word, vector) pairs: x = ((w_1, a_{w_1}), ..., (w_m, a_{w_m})). Now that we have discrete and continuous representations for all words, we can continue with the approach for computing tree structures and vectors for nonterminal nodes.

3.2 Max-Margin Training Objective for CVGs

The goal of supervised parsing is to learn a function g: X → Y, where X is the set of sentences and Y is the set of all possible labeled binary parse trees. The set of all possible trees for a given sentence x_i is defined as Y(x_i), and the correct tree for a sentence is y_i.

We first define a structured margin loss Δ(y_i, ŷ) for predicting a tree ŷ for a given correct tree. The loss increases the more incorrect the proposed parse tree is (Goodman, 1998). The discrepancy between trees is measured by counting the number of nodes N(y) with an incorrect span (or label) in the proposed tree:

\Delta(y_i, \hat{y}) = \sum_{d \in N(\hat{y})} \kappa \, \mathbf{1}\{d \notin N(y_i)\}    (1)

We set κ = 0.1 in all experiments. For a given set of training instances (x_i, y_i), we search for the function g_θ, parameterized by θ, with the smallest expected loss on a new sentence. It has the following form:

g_\theta(x) = \arg\max_{\hat{y} \in Y(x)} s(\mathrm{CVG}(\theta, x, \hat{y}))    (2)

where the tree is found by the Compositional Vector Grammar (CVG) introduced below and then scored via the function s. The higher the score of a tree, the more confident the algorithm is that its structure is correct. This max-margin, structure-prediction objective (Taskar et al., 2004; Ratliff et al., 2007; Socher et al., 2011b) trains the CVG so that the highest scoring tree will be the correct tree, g_θ(x_i) = y_i, and its score will be larger up to a margin to other possible trees ŷ ∈ Y(x_i):

s(\mathrm{CVG}(\theta, x_i, y_i)) \geq s(\mathrm{CVG}(\theta, x_i, \hat{y})) + \Delta(y_i, \hat{y})

This leads to the regularized risk function for m training examples:

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} r_i(\theta) + \frac{\lambda}{2} \lVert\theta\rVert_2^2, \quad \text{where} \quad
r_i(\theta) = \max_{\hat{y} \in Y(x_i)} \left( s(\mathrm{CVG}(x_i, \hat{y})) + \Delta(y_i, \hat{y}) \right) - s(\mathrm{CVG}(x_i, y_i))    (3)
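The loss in Eq. 1 and the per-sentence hinge term r_i(θ) in Eq. 3 can be sketched as follows, representing a tree simply by the set of its labeled spans. The (start, end, label) encoding, the score callable, and the explicit candidate list are simplifying assumptions for illustration; the parser searches for the highest-scoring candidate as described in Sec. 3.4.

```python
def margin_loss(gold_spans, pred_spans, kappa=0.1):
    """Eq. 1: count the proposed tree's nodes whose (start, end, label) span is not
    in the gold tree, weighted by kappa (0.1, as in the experiments)."""
    return kappa * sum(1 for d in pred_spans if d not in gold_spans)

def hinge_risk(score, gold_spans, candidate_span_sets):
    """r_i(theta) from Eq. 3 for one sentence: the margin-augmented score of the best
    candidate minus the score of the correct tree. `score` stands in for s(CVG(theta, x, y))."""
    augmented_best = max(score(y_hat) + margin_loss(gold_spans, y_hat)
                         for y_hat in candidate_span_sets)
    return augmented_best - score(gold_spans)

# Toy example: one incorrect span in the proposed tree costs kappa = 0.1.
gold = {(0, 2, "NP"), (2, 5, "VP"), (0, 5, "S")}
pred = {(0, 2, "NP"), (1, 5, "VP"), (0, 5, "S")}
assert abs(margin_loss(gold, pred) - 0.1) < 1e-12
```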
Intuitively, to minimize this objective, the score of the correct tree y_i is increased and the score of the highest scoring incorrect tree ŷ is decreased.

3.3 Scoring Trees with CVGs

For ease of exposition, we first describe how to score an existing fully labeled tree with a standard RNN and then with a CVG. The subsequent section will then describe a bottom-up beam search and its approximation for finding the optimal tree.

Assume, for now, we are given a labeled parse tree as shown in Fig. 2. We define the word representations as (vector, POS) pairs: ((a, A), (b, B), (c, C)), where the vectors are defined as in Sec. 3.1 and the POS tags come from a PCFG. The standard RNN essentially ignores all POS tags and syntactic categories, and each non-terminal node is associated with the same neural network (i.e., the weights across nodes are fully tied). We can represent the binary tree in Fig. 2 in the form of branching triplets (p → c_1 c_2). Each such triplet denotes that a parent node p has two children and each c_k can be either a word vector or a non-terminal node in the tree. For the example in Fig. 2, we would get the triples ((p_1 → b c), (p_2 → a p_1)). Note that in order to replicate the neural network and compute node representations in a bottom-up fashion, the parent must have the same dimensionality as the children: p ∈ R^n.

Given this tree structure, we can now compute activations for each node from the bottom up. We begin by computing the activation for p_1 using the children's word vectors. We first concatenate the children's representations b, c ∈ R^{n×1} into a vector [b; c] ∈ R^{2n×1}. Then the composition function multiplies this vector by the parameter weights of the RNN, W ∈ R^{n×2n}, and applies an element-wise nonlinearity function f = tanh to the output vector. The resulting output p^{(1)} is then given as input to compute p^{(2)}:

p^{(1)} = f\left(W \begin{bmatrix} b \\ c \end{bmatrix}\right), \qquad p^{(2)} = f\left(W \begin{bmatrix} a \\ p^{(1)} \end{bmatrix}\right)

[...] where P(P_1 → B C) comes from the PCFG. This can be interpreted as the log probability of a discrete-continuous rule application with the following factorization:

P((P_1, p_1) \rightarrow (B, b)(C, c)) = P(p_1 \rightarrow b\,c \mid P_1 \rightarrow B\,C) \, P(P_1 \rightarrow B\,C)    (5)

Note, however, that due to the continuous nature of the word vectors, the probability of such a CVG rule application is not comparable to probabilities provided by a PCFG, since the latter sum to 1 for all children. Assuming that node p_1 has syntactic category P_1, we compute the second parent vector via:

p^{(2)} = f\left(W^{(A,P_1)} \begin{bmatrix} a \\ p^{(1)} \end{bmatrix}\right)

The score of the last parent in this trigram is computed via:

s\left(p^{(2)}\right) = \left(v^{(A,P_1)}\right)^{\top} p^{(2)} + \log P(P_2 \rightarrow A\,P_1)
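A minimal sketch of this per-node computation, producing the parent vector and its score (the SU-RNN inner product plus the log PCFG rule probability), is shown below. Passing in the already-selected W and v for the node's child categories and supplying the rule probability from outside are simplifying assumptions, not the Stanford implementation.

```python
import numpy as np

def cvg_node(W, v, left_vec, right_vec, log_pcfg_rule_prob):
    """One CVG node: parent vector p = f(W^{(B,C)} [b; c]) and score
    s(p) = v^{(B,C)T} p + log P(P -> B C). W and v are the matrix and scoring
    vector already selected for this node's pair of child categories."""
    p = np.tanh(W @ np.concatenate([left_vec, right_vec]))
    score = float(v @ p) + log_pcfg_rule_prob
    return p, score

def cvg_tree_score(node_scores):
    """The tree score used in the next subsection (Eq. 6) is the sum of the node scores."""
    return sum(node_scores)
```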
3.4 Parsing with CVGs

The above scores (Eq. 4) are used in the search for the correct tree for a sentence. The goodness of a tree is measured in terms of its score, and the CVG score of a complete tree is the sum of the scores at each node:

s(\mathrm{CVG}(\theta, x, \hat{y})) = \sum_{d \in N(\hat{y})} s(p_d)    (6)

The main objective function in Eq. 3 includes a maximization over all possible trees, max_{ŷ ∈ Y(x)}. Finding the global maximum, however, cannot be done efficiently for longer sentences, nor can we use dynamic programming. This is due to the fact that the vectors break the independence assumptions of the base PCFG. A (category, vector) node representation is dependent on all the words in its span, and hence, to find the true global optimum, we would have to compute the scores for all binary trees. For a sentence of length n, there are Catalan(n) many possible binary trees, which is very large even for moderately long sentences.

One could use a bottom-up beam search, keeping a k-best list at every cell of the chart, possibly for each syntactic category. This beam search inference procedure is still considerably slower than using only the simplified base PCFG, especially since it has a small state space (see next section for details). Since each probability look-up is cheap but computing SU-RNN scores requires a matrix product, we would like to reduce the number of SU-RNN score computations to only those trees that require semantic information. We note that labeled F1 of the Stanford PCFG parser on the test set is 86.17%. However, if one used an oracle to select the best tree from the top 200 trees that it produces, one could get an F1 of 95.46%.

We use this knowledge to speed up inference via two bottom-up passes through the parsing chart. During the first one, we use only the base PCFG to run CKY dynamic programming through the tree. The k = 200-best parses at the top cell of the chart are calculated using the efficient algorithm of (Huang and Chiang, 2005). Then, the second pass is a beam search with the full CVG model (including the more expensive matrix multiplications of the SU-RNN). This beam search only considers phrases that appear in the top 200 parses. This is similar to a re-ranking setup but with one main difference: the SU-RNN rule score computation at each node still only has access to its child vectors, not the whole tree or other global features. This allows the second pass to be very fast. We use this setup in our experiments below.

3.5 Training SU-RNNs

The full CVG model is trained in two stages. First the base PCFG is trained and its top trees are cached and then used for training the SU-RNN conditioned on the PCFG. The SU-RNN is trained using the objective in Eq. 3 and the scores as exemplified by Eq. 6. For each sentence, we use the method described above to efficiently find an approximation for the optimal tree.

To minimize the objective we want to increase the scores of the correct tree's constituents and decrease the score of those in the highest scoring incorrect tree. Derivatives are computed via backpropagation through structure (BTS) (Goller and Küchler, 1996). The derivative of tree i has to be taken with respect to all parameter matrices W^{(AB)} that appear in it. The main difference between backpropagation in standard RNNs and SU-RNNs is that the derivatives at each node only add to the overall derivative of the specific matrix at that node. For more details on backpropagation through RNNs, see Socher et al. (2010).
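A minimal sketch of this backpropagation-through-structure step follows. The node attributes and the gradient dictionary are assumptions for illustration, and the gradient contributions from the per-node scoring vectors v are omitted for brevity; the key point is that each node's delta is added only to the gradient of the specific matrix W^{(AB)} used at that node.

```python
import numpy as np

def bts_accumulate(node, delta_parent, W, grads):
    """Backpropagation through structure for an SU-RNN (simplified sketch).
    `node.cats` is the (left category, right category) pair, `node.child_vecs` is the
    concatenated input [b; c] used at this node, `node.vec` is its output p = tanh(W [b; c]),
    and `node.children` are the two child nodes (leaves have is_leaf = True).
    `grads` maps each category pair to a zero-initialized matrix shaped like W[cats]."""
    delta = delta_parent * (1.0 - node.vec ** 2)            # backprop through tanh
    grads[node.cats] += np.outer(delta, node.child_vecs)    # only this node's W^{(AB)} is updated
    child_delta = W[node.cats].T @ delta                    # error message passed to the children
    n = node.vec.shape[0]
    for child, d in zip(node.children, (child_delta[:n], child_delta[n:])):
        if not child.is_leaf:
            bts_accumulate(child, d, W, grads)
```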
Parser             | dev (all) | test ≤ 40 | test (all)
Stanford PCFG      |   85.8    |   86.2    |   85.5
Stanford Factored  |   87.4    |   87.2    |   86.6
Factored PCFGs     |   89.7    |   90.1    |   89.4
Collins            |           |   87.7    |
SSN (Henderson)    |           |   89.4    |
Berkeley Parser    |           |   90.1    |
CVG (RNN)          |   85.7    |   85.1    |   85.0
CVG (SU-RNN)       |   91.2    |   91.1    |   90.4
Charniak-SelfTrain |           |   91.0    |
Charniak-RS        |           |   92.1    |

Table 1: Comparison of parsers with richer state representations on the WSJ. The last line is the self-trained re-ranked Charniak parser.

[...] performance and were faster than 50-, 100- or 200-dimensional ones. We hypothesize that the larger word vector sizes, while capturing more semantic knowledge, result in too many SU-RNN matrix parameters to train and hence perform worse.

4.2 Results on WSJ

The dev set accuracy of the best model is 90.93% labeled F1 on all sentences. This model resulted in 90.44% on the final test set (WSJ section 23). Table 1 compares our results to the two Stanford parser variants (the unlexicalized PCFG (Klein and Manning, 2003a) and the factored parser (Klein and Manning, 2003b)) and other parsers that use richer state representations: the Berkeley parser (Petrov and Klein, 2007), Collins parser (Collins, 1997), SSN: a statistical neural network parser (Henderson, 2004), Factored PCFGs (Hall and Klein, 2012), Charniak-SelfTrain: the self-training approach of McClosky et al. (2006), which bootstraps and parses additional large corpora multiple times, Charniak-RS: the state-of-the-art self-trained and discriminatively re-ranked Charniak-Johnson parser combining (Charniak, 2000; McClosky et al., 2006; Charniak and Johnson, 2005). See Kummerfeld et al. (2012) for more comparisons. We also compare to a standard RNN, 'CVG (RNN)', and to the proposed CVG with SU-RNNs.

4.3 Model Analysis

Analysis of Error Types. Table 2 shows a detailed comparison of different errors. We use the code provided by Kummerfeld et al. (2012) and compare to the previous version of the Stanford factored parser as well as to the Berkeley and Charniak-reranked-self-trained parsers (defined above). See Kummerfeld et al. (2012) for details and comparisons to other parsers.

Error Type    | Stanford | CVG  | Berkeley | Char-RS
PP Attach     |   1.02   | 0.79 |   0.82   |  0.60
Clause Attach |   0.64   | 0.43 |   0.50   |  0.38
Diff Label    |   0.40   | 0.29 |   0.29   |  0.31
Mod Attach    |   0.37   | 0.27 |   0.27   |  0.25
NP Attach     |   0.44   | 0.31 |   0.27   |  0.25
Co-ord        |   0.39   | 0.32 |   0.38   |  0.23
1-Word Span   |   0.48   | 0.31 |   0.28   |  0.20
Unary         |   0.35   | 0.22 |   0.24   |  0.14
NP Int.       |   0.28   | 0.19 |   0.18   |  0.14
Other         |   0.62   | 0.41 |   0.41   |  0.50

Table 2: Detailed comparison of different parsers.

One of the largest sources of improved performance over the original Stanford factored parser is in the correct placement of PP phrases. When measuring only the F1 of parse nodes that include at least one PP child, the CVG improves the Stanford parser by 6.2% to an F1 of 77.54%. This is a 0.23 reduction in the average number of bracket errors per sentence. The 'Other' category includes VP, PRN and other attachments, appositives and internal structures of modifiers and QPs.

Analysis of Composition Matrices. An analysis of the norms of the binary matrices reveals that the model learns a soft vectorized notion of head words: head words are given larger weights and importance when computing the parent vector. For the matrices combining siblings with categories VP:PP, VP:NP and VP:PRT, the weights in the part of the matrix that is multiplied with the VP child vector dominate. Similarly, NPs dominate DTs. Fig. 5 shows example matrices. The two strong diagonals are due to the initialization described in Sec. 3.7.

Semantic Transfer for PP Attachments. In this small model analysis, we use two pairs of sentences that the original Stanford parser and the CVG did not parse correctly after training on the WSJ. We then continue to train both parsers on two similar sentences and then analyze whether the parsers correctly transferred the knowledge. The training sentences are He eats spaghetti with a fork. and She eats spaghetti with pork. The very similar test sentences are He eats spaghetti with a spoon. and He eats spaghetti with meat. Initially, both parsers incorrectly attach the PP to the verb in both test sentences. After training, the CVG parses both correctly, while the factored Stanford parser incorrectly attaches both PPs to spaghetti. The CVG's ability to transfer the correct PP attachments is due to the semantic word vector similarity between the words in the sentences. Fig. 4 shows the outputs of the two parsers.

References

Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
P. F. Brown, P. V. de Souza, R. L. Mercer, V. J. Della Pietra, and J. C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18.
C. Callison-Burch. 2008. Syntactic constraints on paraphrases extracted from parallel corpora. In Proceedings of EMNLP, pages 196–205.
E. Charniak and M. Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In ACL.
E. Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of ACL, pages 132–139.
M. Collins. 1997. Three generative, lexicalised models for statistical parsing. In ACL.
M. Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637.
R. Collobert and J. Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of ICML, pages 160–167.
F. Costa, P. Frasconi, V. Lombardo, and G. Soda. 2003. Towards incremental parsing of natural language using recursive neural networks. Applied Intelligence.
J. Duchi, E. Hazan, and Y. Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12, July.
J. L. Elman. 1991. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2-3):195–225.
J. R. Finkel, A. Kleeman, and C. D. Manning. 2008. Efficient, feature-based, conditional random field parsing. In Proceedings of ACL, pages 959–967.
D. Gildea and M. Palmer. 2002. The necessity of parsing for predicate argument recognition. In Proceedings of ACL, pages 239–246.
C. Goller and A. Küchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of the International Conference on Neural Networks.
J. Goodman. 1998. Parsing Inside-Out. Ph.D. thesis, MIT.
D. Hall and D. Klein. 2012. Training factored PCFGs with expectation propagation. In EMNLP.
J. Henderson. 2003. Neural network probability estimation for broad coverage parsing. In Proceedings of EACL.
J. Henderson. 2004. Discriminative training of a neural network statistical parser. In ACL.
Liang Huang and David Chiang. 2005. Better k-best parsing. In Proceedings of the 9th International Workshop on Parsing Technologies (IWPT 2005).
E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In ACL.
D. Kartsaklis, M. Sadrzadeh, and S. Pulman. 2012. A unified sentence space for categorical distributional-compositional semantics: Theory and experiments. In Proceedings of the 24th International Conference on Computational Linguistics (COLING): Posters.
D. Klein and C. D. Manning. 2003a. Accurate unlexicalized parsing. In Proceedings of ACL, pages 423–430.
D. Klein and C. D. Manning. 2003b. Fast exact inference with a factored model for natural language parsing. In NIPS.
J. K. Kummerfeld, D. Hall, J. R. Curran, and D. Klein. 2012. Parser showdown at the Wall Street corral: An empirical investigation of error types in parser output. In EMNLP.
Q. V. Le, J. Ngiam, Z. Chen, D. Chia, P. W. Koh, and A. Y. Ng. 2010. Tiled convolutional neural networks. In NIPS.
T. Matsuzaki, Y. Miyao, and J. Tsujii. 2005. Probabilistic CFG with latent annotations. In ACL.
D. McClosky, E. Charniak, and M. Johnson. 2006. Effective self-training for parsing. In NAACL.
S. Menchetti, F. Costa, P. Frasconi, and M. Pontil. 2005. Wide coverage natural language processing using kernel methods and neural networks for structured data. Pattern Recognition Letters, 26(12):1896–1906.
T. Mikolov, W. Yih, and G. Zweig. 2013. Linguistic regularities in continuous space word representations. In HLT-NAACL.
S. Petrov and D. Klein. 2007. Improved inference for unlexicalized parsing. In NAACL.
S. Petrov, L. Barrett, R. Thibaux, and D. Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of ACL, pages 433–440.
N. Ratliff, J. A. Bagnell, and M. Zinkevich. 2007. (Online) subgradient methods for structured prediction. In Eleventh International Conference on Artificial Intelligence and Statistics (AIStats).
R. Socher, C. D. Manning, and A. Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop.