Deep Recursive Neural Networks for Compositionality in Language

Presentation Transcript

Deep Recursive Neural Networks for Compositionality in Language

Ozan Irsoy
Department of Computer Science, Cornell University
Ithaca, NY 14853
oirsoy@cs.cornell.edu

Claire Cardie
Department of Computer Science, Cornell University
Ithaca, NY 14853
cardie@cs.cornell.edu

Abstract

Recursive neural networks comprise a class of architecture that can operate on structured input. They have been previously successfully applied to model compositionality in natural language using parse-tree-based structural representations. Even though these architectures are deep in structure, they lack the capacity for hierarchical representation that exists in conventional deep feed-forward networks as well as in recently investigated deep recurrent neural networks. In this work we introduce a new architecture, a deep recursive neural network (deep RNN), constructed by stacking multiple recursive layers. We evaluate the proposed model on the task of fine-grained sentiment classification. Our results show that deep RNNs outperform associated shallow counterparts that employ the same number of parameters. Furthermore, our approach outperforms previous baselines on the sentiment analysis task, including a multiplicative RNN variant as well as the recently introduced paragraph vectors, achieving new state-of-the-art results. We provide exploratory analyses of the effect of multiple layers and show that they capture different aspects of compositionality in language.

1 Introduction

Deep connectionist architectures involve many layers of nonlinear information processing [1]. This allows them to incorporate meaning representations such that each succeeding layer potentially has a more abstract meaning. Recent advancements in efficiently training deep neural networks enabled their application to many problems, including those in natural language processing (NLP). A key advance for application to NLP tasks was the invention of word embeddings that represent a single word as a dense, low-dimensional vector in a meaning space [2], from which numerous problems have benefited [3, 4].

Recursive neural networks comprise a class of architecture that operates on structured inputs, and in particular, on directed acyclic graphs. A recursive neural network can be seen as a generalization of the recurrent neural network [5], which has a specific type of skewed tree structure (see Figure 1). They have been applied to parsing [6], sentence-level sentiment analysis [7, 8], and paraphrase detection [9]. Given the structural representation of a sentence, e.g. a parse tree, they recursively generate parent representations in a bottom-up fashion, by combining tokens to produce representations for phrases, eventually producing the whole sentence. The sentence-level representation (or, alternatively, its phrases) can then be used to make a final classification for a given input sentence, e.g. whether it conveys a positive or a negative sentiment.

Similar to how recurrent neural networks are deep in time, recursive neural networks are deep in structure, because of the repeated application of recursive connections.
Figure 1: Operation of a recursive net (a), untied recursive net (b) and a recurrent net (c) on an example sentence ("that movie was cool"). Black, orange and red dots represent input, hidden and output layers, respectively. Directed edges having the same color-style combination denote shared connections.

Recently, the notions of depth in time (the result of recurrent connections) and depth in space (the result of stacking multiple layers on top of one another) are distinguished for recurrent neural networks. In order to combine these concepts, deep recurrent networks were proposed [10, 11, 12]. They are constructed by stacking multiple recurrent layers on top of each other, which allows this extra notion of depth to be incorporated into temporal processing. Empirical investigations showed that this results in a natural hierarchy for how the information is processed [12]. Inspired by these recent developments, we make a similar distinction between depth in structure and depth in space, and to combine these concepts, propose the deep recursive neural network, which is constructed by stacking multiple recursive layers.

The architecture we study in this work is essentially a deep feedforward neural network with an additional structural processing within each layer (see Figure 2). During forward propagation, information travels through the structure within each layer (because of the recursive nature of the network, weights regarding structural processing are shared). In addition, every node in the structure (i.e. in the parse tree) feeds its own hidden state to its counterpart in the next layer. This can be seen as a combination of feedforward and recursive nets. In a shallow recursive neural network, a single layer is responsible for learning a representation of composition that is both useful and sufficient for the final decision. In a deep recursive neural network, a layer can learn some parts of the composition to apply, and pass this intermediate representation to the next layer for further processing of the remaining parts of the overall composition.

To evaluate the performance of the architecture and make exploratory analyses, we apply deep recursive neural networks to the task of fine-grained sentiment detection on the recently published Stanford Sentiment Treebank (SST) [8]. SST includes a supervised sentiment label for every node in the binary parse tree, not just at the root (sentence) level. This is especially important for deep learning, since it allows a richer supervised error signal to be backpropagated across the network, potentially alleviating vanishing gradients associated with deep neural networks [13].

We show that our deep recursive neural networks outperform shallow recursive nets of the same size in the fine-grained sentiment prediction task on the Stanford Sentiment Treebank. Furthermore, our models outperform multiplicative recursive neural network variants, achieving new state-of-the-art performance on the task. We conduct qualitative experiments that suggest that each layer handles a different aspect of compositionality, and that representations at each layer capture different notions of similarity.

2 Methodology

2.1 Recursive Neural Networks

Recursive neural networks (e.g. [6]) (RNNs) comprise an architecture in which the same set of weights is recursively applied within a structural setting: given a positional directed acyclic graph, the network visits the nodes in topological order, and recursively applies transformations to generate further representations from previously computed representations of children. In fact, a recurrent neural network is simply a recursive neural network with a particular structure (see Figure 1c). Even though RNNs can be applied to any positional directed acyclic graph, we limit our attention to RNNs over positional binary trees, as in [6].
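To make this structural setting concrete, the following sketch (Python; an illustrative reconstruction, not the authors' code, with the Node class, its field names, and the bracketing of the example sentence chosen here for exposition) builds a positional binary tree over a sentence and visits it in topological, children-first order:

```python
from dataclasses import dataclass
from typing import Optional, List
import numpy as np

@dataclass
class Node:
    word: Optional[str] = None          # set for leaves only
    left: Optional["Node"] = None       # left child (internal nodes only)
    right: Optional["Node"] = None      # right child (internal nodes only)
    h: Optional[np.ndarray] = None      # hidden vector, filled in by the forward pass

    def is_leaf(self) -> bool:
        return self.word is not None

def post_order(node: Node) -> List[Node]:
    """Topological order over a binary tree: both children before their parent."""
    if node.is_leaf():
        return [node]
    return post_order(node.left) + post_order(node.right) + [node]

# The example sentence "that movie was cool" under an illustrative bracketing
# ((that movie) (was cool)).
tree = Node(left=Node(left=Node(word="that"), right=Node(word="movie")),
            right=Node(left=Node(word="was"), right=Node(word="cool")))
print([n.word or "*" for n in post_order(tree)])   # leaves appear before internal nodes
```

The same Node objects are reused in the later sketches to hold the per-node hidden vectors.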
Given a binary tree structure with leaves having the initial representations, e.g. a parse tree with word vector representations at the leaves, a recursive neural network computes the representation at each internal node η as follows (see also Figure 1a):

    x_η = f(W_L x_{l(η)} + W_R x_{r(η)} + b)    (1)

where l(η) and r(η) are the left and right children of η, W_L and W_R are the weight matrices that connect the left and right children to the parent, and b is a bias vector. Given that W_L and W_R are square matrices, and not distinguishing whether l(η) and r(η) are leaf or internal nodes, this definition has an interesting interpretation: initial representations at the leaves and intermediate representations at the nonterminals lie in the same space. In the parse tree example, a recursive neural network combines the representations of two subphrases to generate a representation for the larger phrase, in the same meaning space [6]. We then have a task-specific output layer above the representation layer:

    y_η = g(U x_η + c)    (2)

where U is the output weight matrix and c is the bias vector to the output layer. In a supervised task, y_η is simply the prediction (class label or response value) for the node η, and supervision occurs at this layer. As an example, for the task of sentiment classification, y_η is the predicted sentiment label of the phrase given by the subtree rooted at η. Thus, during supervised learning, initial external errors are incurred on y_η, and backpropagated from the root, toward leaves [14].

2.2 Untying Leaves and Internals

Even though the aforementioned definition, which treats the leaf nodes and internal nodes the same, has some attractive properties (such as mapping individual words and larger phrases into the same meaning space), in this work we use an untied variant that distinguishes between a leaf and an internal node. We do this by a simple parametrization of the weights W with respect to whether the incoming edge emanates from a leaf or an internal node (see Figure 1b in contrast to 1a; the colors of the edges emanating from leaves and internal nodes are different):

    h_η = f(W_L^{l(η)} h_{l(η)} + W_R^{r(η)} h_{r(η)} + b)    (3)

where h_η = x_η ∈ X if η is a leaf and h_η ∈ H otherwise, and W^η = W^{xh} if η is a leaf and W^η = W^{hh} otherwise. X and H are vector spaces of words and phrases, respectively. The weights W^{xh} act as a transformation from word space to phrase space, and W^{hh} as a transformation from phrase space to itself.

With this untying, a recursive network becomes a generalization of the Elman type recurrent neural network, with h being analogous to the hidden layer of the recurrent network (memory) and x being analogous to the input layer (see Figure 1c). Benefits of this untying are twofold: (1) Now the weight matrices W^{xh} and W^{hh} are of size |h| × |x| and |h| × |h|, which means that we can use large pretrained word vectors and a small number of hidden units without a quadratic dependence on the word vector dimensionality |x|. Therefore, small but powerful models can be trained by using pretrained word vectors with a large dimensionality. (2) Since words and phrases are represented in different spaces, we can use rectifier activation units for f, which have previously been shown to yield good results when training deep neural networks [15]. Word vectors are dense and generally have positive and negative entries, whereas rectifier activation causes the resulting intermediate vectors to be sparse and nonnegative. Thus, when leaves and internals are represented in the same space, a discrepancy arises, and the same weight matrix is applied to both leaves and internal nodes and is expected to handle both sparse and dense cases, which might be difficult. Therefore separating leaves and internal nodes allows the use of rectifiers in a more natural manner.
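Continuing the sketch above, a minimal forward pass for the untied model of Eq. (3), with the per-node output of Eq. (2), might look as follows. The dimensions, weight names (Wxh_L, Whh_L, U, ...), the random initialization, and the stand-in embedding are assumptions for illustration rather than the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_word, d_hid, n_cls = 300, 50, 5                 # word-vector size, hidden width, label count

Wxh_L = rng.normal(0.0, 0.01, (d_hid, d_word))    # leaf (word) child -> parent
Wxh_R = rng.normal(0.0, 0.01, (d_hid, d_word))
Whh_L = 0.5 * np.eye(d_hid)                       # internal (phrase) child -> parent
Whh_R = 0.5 * np.eye(d_hid)
b = np.zeros(d_hid)
U = rng.normal(0.0, 0.01, (n_cls, d_hid))         # output layer, Eq. (2)
c = np.zeros(n_cls)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(node, embed):
    """Fill node.h bottom-up (Eq. 3); `embed` maps a word to its fixed vector."""
    if node.is_leaf():
        node.h = embed(node.word)                 # h = x for leaves: word space X
        return node.h
    hl = forward(node.left, embed)
    hr = forward(node.right, embed)
    WL = Wxh_L if node.left.is_leaf() else Whh_L  # untying: pick weights by child type
    WR = Wxh_R if node.right.is_leaf() else Whh_R
    node.h = relu(WL @ hl + WR @ hr + b)          # phrase space H
    return node.h

def predict(node):
    """Per-node sentiment posterior, Eq. (2)."""
    return softmax(U @ node.h + c)

# Usage with the tree above and a stand-in embedding (real experiments keep fixed,
# pretrained 300-dimensional word vectors).
embed = lambda w: rng.normal(0.0, 0.1, d_word)
forward(tree, embed)
print(predict(tree))                              # 5-class distribution at the root
```

Because SST provides a sentiment label at every node, predict can be applied to, and supervised at, every internal node rather than only the root.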
2.3 Deep Recursive Neural Networks

Recursive neural networks are deep in structure: with the recursive application of the nonlinear information processing they become as deep as the depth of the tree (or in general, the DAG). However, this notion of depth is unlikely to involve a hierarchical interpretation of the data. By applying the same computation recursively to compute the contribution of children to their parents, and the same computation to produce an output response, we are, in fact, representing every internal node (phrase) in the same space [6, 8]. However, in the more conventional stacked deep learners (e.g. deep feedforward nets), an important benefit of depth is the hierarchy among hidden representations: every hidden layer conceptually lies in a different representation space and potentially is a more abstract representation of the input than the previous layer [1].

To address these observations, we propose the deep recursive neural network, which is constructed by stacking multiple layers of individual recursive nets:

    h_η^{(i)} = f(W_L^{(i)} h_{l(η)}^{(i)} + W_R^{(i)} h_{r(η)}^{(i)} + V^{(i)} h_η^{(i-1)} + b^{(i)})    (4)

where i indexes the multiple stacked layers, W_L^{(i)}, W_R^{(i)}, and b^{(i)} are defined as before within each layer i, and V^{(i)} is the weight matrix that connects the (i-1)th hidden layer to the ith hidden layer. Note that the untying that we described in Section 2.2 is only necessary for the first layer, since we can map both x ∈ X and h^{(1)} ∈ H^{(1)} in the first layer to h^{(2)} ∈ H^{(2)} in the second layer using separate V^{(2)} matrices for leaves and internals (V^{xh(2)} and V^{hh(2)}). Therefore every node is represented in the same space at layers above the first, regardless of its "leafness". Figure 2 provides a visualization of weights that are untied or shared.

Figure 2: Operation of a 3-layer deep recursive neural network on an example sentence ("that movie was cool"). Red and black points denote output and input vectors, respectively; other colors denote intermediate memory representations. Connections denoted by the same color-style combination are shared (i.e. share the same set of weights).

For prediction, we connect the output layer to only the final hidden layer:

    y_η = g(U h_η^{(ℓ)} + c)    (5)

where ℓ is the total number of layers. Intuitively, connecting the output layer to only the last hidden layer forces the network to represent enough high-level information at the final layer to support the supervised decision. Connecting the output layer to all hidden layers is another option; however, in that case multiple hidden layers can have synergistic effects on the output and make it more difficult to qualitatively analyze each layer.

Learning a deep RNN can be conceptualized as interleaved applications of the conventional backpropagation across multiple layers, and backpropagation through structure within a single layer. During backpropagation a node receives error terms from both its parent (through structure) and from its counterpart in the higher layer (through space). Then it further backpropagates that error signal to both of its children, as well as to its counterpart in the lower layer.
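Continuing the earlier sketches, here is a minimal rendering of Eqs. (4) and (5): extra recursive layers are stacked on top of first-layer representations, and the prediction is read off the final layer only. For brevity this sketch assumes the first layer already maps words and phrases into a common d_hid space, so each layer's V is a single square matrix, whereas the paper unties V^(2) for leaves versus internals; how leaves are handled at upper layers is likewise an assumption here:

```python
import numpy as np

rng = np.random.default_rng(1)
d_hid, n_layers, n_cls = 50, 3, 5
relu = lambda z: np.maximum(0.0, z)
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# Parameters for layers 2..n_layers (the first layer is the untied shallow net above).
layers = [dict(WL=0.5 * np.eye(d_hid), WR=0.5 * np.eye(d_hid),
               V=rng.normal(0.0, 0.01, (d_hid, d_hid)), b=np.zeros(d_hid))
          for _ in range(n_layers - 1)]
U_deep = rng.normal(0.0, 0.01, (n_cls, d_hid))
c_deep = np.zeros(n_cls)

def forward_layer(node, p, h_prev, h_cur):
    """One stacked recursive layer (Eq. 4): a node combines its children within
    this layer with its own representation from the layer below."""
    below = p["V"] @ h_prev[id(node)] + p["b"]
    if node.is_leaf():
        h = relu(below)                      # leaves have no children to combine
    else:
        hl = forward_layer(node.left, p, h_prev, h_cur)
        hr = forward_layer(node.right, p, h_prev, h_cur)
        h = relu(p["WL"] @ hl + p["WR"] @ hr + below)
    h_cur[id(node)] = h
    return h

def forward_deep(root, h_layer1):
    """h_layer1: dict id(node) -> first-layer hidden vector for every tree node."""
    h_prev = h_layer1
    for p in layers:                         # propagate "in space", layer by layer
        h_cur = {}
        forward_layer(root, p, h_prev, h_cur)
        h_prev = h_cur
    return softmax(U_deep @ h_prev[id(root)] + c_deep)   # Eq. (5): final layer only

# Usage: fake uniform first-layer vectors of width d_hid for every node of the tree
# defined earlier (in the full model these come from the untied first layer).
h_layer1 = {id(n): rng.normal(0.0, 0.1, d_hid) for n in post_order(tree)}
print(forward_deep(tree, h_layer1))          # root-level 5-class distribution
```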
3 Experiments

3.1 Setting

Data. For experimental evaluation of our models, we use the recently published Stanford Sentiment Treebank (SST) [8], which includes labels for 215,154 phrases in the parse trees of 11,855 sentences, with an average sentence length of 19.1 tokens. Real-valued sentiment labels are converted to an integer ordinal label in {0, ..., 4} by simple thresholding. Therefore the supervised task is posed as a 5-class classification problem. We use the single training-validation-test set partitioning provided by the authors.

Baselines. In addition to experimenting among deep RNNs of varying width and depth, we compare our models to previous work on the same data. We use baselines from [8]: a naive Bayes classifier that operates on bigram counts (BINB); a shallow RNN (RNN) [6, 7] that, in contrast to our shallow RNNs, learns the word vectors from the supervised data and uses tanh units; a matrix-vector RNN in which every word is assigned a matrix-vector pair instead of a vector, and composition is defined with matrix-vector multiplications (MV-RNN) [16]; and the multiplicative recursive net (or recursive neural tensor network), in which composition is defined as a bilinear tensor product (RNTN) [8]. Additionally, we use a method that is capable of generating representations for larger pieces of text (PARAGRAPH VECTORS) [17], and the dynamic convolutional neural network (DCNN) [18]. We use the previously published results for comparison, using the same training-development-test partitioning of the data.

Activation Units. For the output layer, we employ the standard softmax activation: g(x)_i = e^{x_i} / \sum_j e^{x_j}. For the hidden layers we use the rectifier linear activation: f(x) = max{0, x}. Experimentally, rectifier activation gives better performance, faster convergence, and sparse representations. Previous work with rectifier units reported good results when training deep neural networks, with no pre-training step [15].

Word Vectors. In all of our experiments, we keep the word vectors fixed and do not fine-tune them, for simplicity of our models. We use the publicly available 300-dimensional word vectors of [19], trained on part of the Google News dataset (100B words).

Regularizer. For regularization of the networks, we use the recently proposed dropout technique, in which we randomly set entries of hidden representations to 0 with a probability called the dropout rate [20]. The dropout rate is tuned over the development set out of {0, 0.1, 0.3, 0.5}. Dropout prevents learned features from co-adapting, and it has been reported to yield good results when training deep neural networks [21, 22]. Note that dropped units are shared: for a single sentence and a layer, we drop the same units of the hidden layer at each node.

Since we are using a non-saturating activation function, intermediate representations are not bounded from above; hence, they can explode even with a strong regularization over the connections, which is confirmed by preliminary experiments. Therefore, for stability reasons, we use a small fixed additional L2 penalty (10^{-5}) over both the connection weights and the unit activations, which resolves the explosion problem.

Network Training. We use stochastic gradient descent with a fixed learning rate (0.01). We use a diagonal variant of AdaGrad for parameter updates [23]. AdaGrad yields a smooth and fast convergence. Furthermore, it can be seen as a natural tuning of individual learning rates per parameter. This is beneficial for our case since different layers have gradients at different scales, because of the scale of the non-saturating activations at each layer (which grows bigger at higher layers). We update weights after minibatches of 20 sentences. We run 200 epochs for training. Recursive weights within a layer (W^{hh}) are initialized as 0.5I + ε, where I is the identity matrix and ε is small uniformly random noise. This means that initially, the representation of each node is approximately the mean of its two children. All other weights are initialized as ε. We experiment with networks of various sizes; however, we have the same number of hidden units across the multiple layers of a single RNN. When we increase the depth, we keep the overall number of parameters constant, therefore deeper networks become narrower. We do not employ a pre-training step; deep architectures are trained with the supervised error signal, even when the output layer is connected to only the final hidden layer. Additionally, we employ early stopping: out of all iterations, the model with the best development set performance is picked as the final model to be evaluated.
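Two of the training details above, the near-identity initialization of the recursive weights and the diagonal AdaGrad update, together with the shared dropout mask, can be sketched as follows. This is a minimal illustration under the stated hyper-parameters; the function names, the AdaGrad epsilon, and the "inverted" dropout scaling are assumptions, not the authors' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
d_hid, lr, eps = 50, 0.01, 1e-8

def init_recursive(d, noise=0.01):
    """0.5*I plus small uniform noise: a parent starts out roughly as the mean of
    its two children (each child contributing through its own 0.5*I matrix)."""
    return 0.5 * np.eye(d) + rng.uniform(-noise, noise, (d, d))

Whh_L, Whh_R = init_recursive(d_hid), init_recursive(d_hid)

# Diagonal AdaGrad: a per-parameter learning rate from accumulated squared gradients.
grad_hist = {"Whh_L": np.zeros((d_hid, d_hid)), "Whh_R": np.zeros((d_hid, d_hid))}

def adagrad_step(name, w, grad):
    grad_hist[name] += grad ** 2
    w -= lr * grad / (np.sqrt(grad_hist[name]) + eps)   # in-place parameter update

# Dropout with a shared mask: within one sentence and one layer, the same hidden
# units are dropped at every node of the tree.
def shared_dropout_mask(d, rate):
    return (rng.random(d) >= rate).astype(float) / (1.0 - rate)

mask = shared_dropout_mask(d_hid, rate=0.3)   # applied as h_node = mask * h_node at each node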
Table 1: Accuracies for fine-grained (5-class) and binary predictions over SST, at the sentence level.

(a) Results for RNNs. ℓ and |h| denote the depth and width of the networks, respectively.

    ℓ   |h|   Fine-grained   Binary
    1    50       46.1        85.3
    2    45       48.0        85.5
    3    40       43.1        83.5

    1   340       48.1        86.4
    2   242       48.3        86.4
    3   200       49.5        86.7
    4   174       49.8        86.6
    5   157       49.0        85.5

(b) Results for previous work and our best model (DRNN).

    Method              Fine-grained   Binary
    Bigram NB               41.9        83.1
    RNN                     43.2        82.4
    MV-RNN                  44.4        82.9
    RNTN                    45.7        85.4
    DCNN                    48.5        86.8
    Paragraph Vectors       48.7        87.8
    DRNN (4, 174)           49.8        86.6

3.2 Results

Quantitative Evaluation. We evaluate on both fine-grained sentiment score prediction (5-class classification) and binary (positive-negative) classification. For binary classification, we do not train a separate network; we use the network trained for fine-grained prediction, and then decode the 5-dimensional posterior probability vector into a binary decision, which also effectively discards the neutral cases from the test set. This approach solves a harder problem; therefore, there might be room for improvement on the binary results by separately training a binary classifier. Experimental results of our models and previous work are given in Table 1.

Table 1a shows our models with varying depth and width (while keeping the overall number of parameters constant within each group). ℓ denotes the depth and |h| denotes the width of the networks (i.e. the number of hidden units in a single hidden layer). We observe that shallow RNNs get an improvement just by using pretrained word vectors, rectifiers, and dropout, compared to previous work (48.1 vs. 43.2 for the fine-grained task; see our shallow RNN with |h| = 340 in Table 1a and the RNN from [8] in Table 1b). This suggests a validation for untying leaves and internal nodes in the RNN as described in Section 2.2 and for using pre-trained word vectors.

Results on RNNs of various depths and sizes show that deep RNNs outperform single-layer RNNs with approximately the same number of parameters, which quantitatively validates the benefits of deep networks over shallow ones (see Table 1a). We see a consistent improvement as we use deeper and narrower networks, up to a certain depth. The 2-layer RNN for the smaller networks and the 4-layer RNN for the larger networks give the best performance with respect to the fine-grained score. Increasing the depth further starts to cause a degradation; an explanation for this might be the decrease in width dominating the gains from the increased depth.

Furthermore, our best deep RNN outperforms previous work on both the fine-grained and binary prediction tasks, and outperforms Paragraph Vectors on the fine-grained score, achieving a new state of the art (see Table 1b). We attribute an important contribution of the improvement to dropout: in a preliminary experiment with simple L2 regularization, a 3-layer RNN with 200 hidden units per layer achieved a fine-grained score of 46.06 (not shown here), compared to our current score of 49.5 with the dropout regularizer.
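The decoding of the 5-class posterior into a binary decision is not spelled out above; one plausible reading, sketched below purely as an assumption for illustration, compares the probability mass of the two negative classes against the two positive ones and drops sentences whose gold label is neutral:

```python
import numpy as np

def binary_from_fine_grained(posterior, gold_fine):
    """Decode a 5-class posterior (0 = very negative ... 4 = very positive) into a
    binary decision; sentences whose gold label is neutral (2) are discarded."""
    if gold_fine == 2:
        return None, False                    # dropped from the binary test set
    neg = posterior[0] + posterior[1]         # probability mass on the negative side
    pos = posterior[3] + posterior[4]         # probability mass on the positive side
    return ("positive" if pos >= neg else "negative"), True

print(binary_from_fine_grained(np.array([0.05, 0.10, 0.20, 0.40, 0.25]), gold_fine=4))
```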
Input Perturbation. In order to assess the scale at which different layers operate, we investigate the response of all layers to a perturbation in the input. One way of perturbing the input would be the addition of some noise; however, with a large amount of noise, it is possible that the resulting noisy input vector is outside of the manifold of meaningful word vectors. Therefore, instead, we simply pick a word from the sentence that carries positive sentiment, and alter it to a set of words that have sentiment values shifting towards the negative direction.

Figure 3: An example sentence with its parse tree (left) and the response measure of every layer (right) in a three-layered deep recursive net. We change the word "best" in the input to one of the words "coolest", "good", "average", "bad", "worst" (denoted by blue, light blue, black, orange and red, respectively) and measure the change of hidden layer representations in one-norm for every node in the path.

In Figure 3, we give an example sentence, "Roger Dodger is one of the best variations on this theme", with its parse tree. We change the word "best" into the set of words "coolest", "good", "average", "bad", "worst", and measure the response of this change along the path that connects the leaf to the root (nodes labeled from 1 to 8). Note that all other nodes keep the same representations, since a node is completely determined by its subtree. For each node, the response is measured as the change of its hidden representation in one-norm, for each of the three layers in the network, with respect to the hidden representation computed with the original word ("best").

In the first layer (bottom) we observe a shared trend of change as we go up in the tree. Note that "good" and "bad" are almost on top of each other, which suggests that there is not necessarily enough information captured in the first layer yet to make the correct sentiment decision. In the second layer (middle) an interesting phenomenon occurs: paths with "coolest" and "good" start close together, as do paths with "worst" and "bad". However, as we move up in the tree, paths with "worst" and "coolest" come closer together, as do the paths with "good" and "bad". This suggests that the second layer remembers the intensity of the sentiment, rather than its direction. The third layer (top) is the most consistent one as we traverse the tree upward, and correct sentiment decisions persist across the path.
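A minimal sketch of this response measure, reusing the per-layer node representations from the earlier sketches (the parents map, helper names, and dictionary layout are assumptions made for illustration):

```python
import numpy as np

def path_to_root(leaf, parents):
    """Nodes on the path from the perturbed leaf up to the root.
    parents: dict id(node) -> parent Node (absent for the root)."""
    path, node = [], leaf
    while node is not None:
        path.append(node)
        node = parents.get(id(node))
    return path

def perturbation_response(h_orig, h_pert, path, n_layers=3):
    """One-norm change per layer for every node on the path (the curves of Figure 3).
    h_orig / h_pert: dict layer index -> {id(node): hidden vector}."""
    return [[float(np.abs(h_orig[i][id(n)] - h_pert[i][id(n)]).sum()) for n in path]
            for i in range(n_layers)]
```

The same one-norm distance is used next as the similarity measure for retrieving nearest neighbor phrases.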
Nearest Neighbor Phrases. In order to evaluate the different notions of similarity in the meaning space captured by multiple layers, we look at nearest neighbors of short phrases. For a three-layer deep recursive neural network we compute hidden representations for all phrases in our data. Then, for a given phrase, we find its nearest neighbor phrases at each layer, with the one-norm distance measure. Two examples are given in Table 2.

Table 2: Example shortest phrases and their nearest neighbors across three layers.

    "charming results"
    Rank   Layer 1                                  Layer 2                           Layer 3
    1      charming ,                               interesting results               charming chemistry
    2      charming and                             riveting performances             perfect ingredients
    3      appealingly manic and energetic          gripping performances             brilliantly played
    4      refreshingly adult take on adultery      joyous documentary                perfect medium
    5      unpretentious , sociologically pointed   an amazing slapstick instrument   engaging film

    "not great"
    Rank   Layer 1          Layer 2                         Layer 3
    1      as great         nothing good                    not very informative
    2      a great          not compelling                  not really funny
    3      is great         only good                       not quite satisfying
    4      Isn't it great   too great                       thrashy fun
    5      be great         completely numbing experience   fake fun

For the first layer, we observe that similarity is dominated by one of the words that is composed, i.e. "charming" for the phrase "charming results" (and "appealing", "refreshing" for some neighbors), and "great" for the phrase "not great". This effect is so strong that it even discards the negation in the second case: "as great" and "is great" are considered similar to "not great". In the second layer, we observe a semantically more diverse set of phrases. On the other hand, this layer seems to be taking syntactic similarity more into account: in the first example, the nearest neighbors of "charming results" are comprised of adjective-noun combinations that also exhibit some similarity in meaning (e.g. "interesting results", "riveting performances"). The account is similar for "not great": its nearest neighbors are adverb-adjective combinations in which the adjectives exhibit some semantic overlap (e.g. "good", "compelling"). Sentiment is still not properly captured in this layer, however, as seen with the neighbor "too great" for the phrase "not great". In the third and final layer, we see a higher level of semantic similarity, in the sense that phrases are mostly related to one another in terms of sentiment. Note that since this is a supervised task on sentiment detection, it is sufficient for the network to capture only the sentiment (and how it is composed in context) in the last layer. Therefore, it should be expected to observe an even more diverse set of neighbors with only a sentiment connection.

4 Conclusion

In this work we propose the deep recursive neural network, which is constructed by stacking multiple recursive layers on top of each other. We apply this architecture to the task of fine-grained sentiment classification using binary parse trees as the structure. We empirically evaluated our models against shallow recursive nets. Additionally, we compared with previous work on the task, including a multiplicative RNN and the more recent Paragraph Vectors method. Our experiments show that deep models outperform their shallow counterparts of the same size. Furthermore, the deep RNN outperforms the baselines, achieving state-of-the-art performance on the task. We further investigate our models qualitatively by performing input perturbation and by examining nearest neighboring phrases of given examples.

These results suggest that adding depth to a recursive net is different from adding width. Each layer captures a different aspect of compositionality. Phrase representations focus on different aspects of meaning at each layer, as seen by the nearest neighbor phrase examples. Since our task was supervised, learned representations seemed to be focused on sentiment, as in previous work. An important future direction might be an application of the deep RNN to a broader, more general task, even an unsupervised one (e.g. as in [9]). This might provide better insights into the operation of different layers and their contribution, with a more general notion of composition. The effect of fine-tuning word vectors on the performance of the deep RNN is also open to investigation.

Acknowledgments

This work was supported in part by NSF grant IIS-1314778 and DARPA DEFT FA8750-13-2-0015. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF, DARPA or the U.S. Government.
References

[1] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.
[2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin, Jaz K, Thomas Hofmann, Tomaso Poggio, and John Shawe-Taylor. A neural probabilistic language model. In Advances in Neural Information Processing Systems, 2001.
[3] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160-167. ACM, 2008.
[4] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493-2537, November 2011.
[5] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179-211, 1990.
[6] Richard Socher, Cliff C. Lin, Andrew Ng, and Chris Manning. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129-136, 2011.
[7] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151-161. Association for Computational Linguistics, 2011.
[8] Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '13, 2013.
[9] Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, and Andrew Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801-809, 2011.
[10] Jürgen Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992.
[11] Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. In Advances in Neural Information Processing Systems, pages 493-499, 1995.
[12] Michiel Hermans and Benjamin Schrauwen. Training and analysing deep recurrent neural networks. In Advances in Neural Information Processing Systems, pages 190-198, 2013.
[13] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.
[14] Christoph Goller and Andreas Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks, volume 1, pages 347-352. IEEE, 1996.
[15] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, JMLR W&CP volume 15, pages 315-323, 2011.
[16] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201-1211. Association for Computational Linguistics, 2012.
[17] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053, 2014.
[18] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, June 2014.
[19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.
[20] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, volume 1, page 4, 2012.
[22] George E. Dahl, Tara N. Sainath, and Geoffrey E. Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8609-8613. IEEE, 2013.
[23] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011.
