Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection

Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305, USA
SLAC National Accelerator Laboratory, Stanford University, Stanford, CA 94309, USA
richard@socher.org, {ehhuang, jpennin, ang, manning}@stanford.edu

Abstract

Paraphrase detection is the task of examining two sentences and determining whether they have the same meaning. In order to obtain high accuracy on this task, thorough syntactic and semantic analysis of the two statements is needed. We introduce a method for paraphrase detection based on recursive autoencoders (RAE). Our unsupervised RAEs are based on a novel unfolding objective and learn feature vectors for phrases in syntactic trees. These features are used to measure the word- and phrase-wise similarity between two sentences. Since sentences may be of arbitrary length, the resulting matrix of similarity measures is of variable size. We introduce a novel dynamic pooling layer which computes a fixed-sized representation from the variable-sized matrices. The pooled representation is then used as input to a classifier. Our method outperforms other state-of-the-art approaches on the challenging MSRP paraphrase corpus.

1 Introduction

Paraphrase detection determines whether two phrases of arbitrary length and form capture the same meaning. Identifying paraphrases is an important task that is used in information retrieval, question answering [1], text summarization, plagiarism detection [2] and evaluation of machine translation [3], among others. For instance, in order to avoid adding redundant information to a summary one would like to detect that the following two sentences are paraphrases:

S1: The judge also refused to postpone the trial date of Sept. 29.
S2: Obus also denied a defense motion to postpone the September trial date.

We present a joint model that incorporates the similarities between both single word features as well as multi-word phrases extracted from the nodes of parse trees. Our model is based on two novel components as outlined in Fig. 1. The first component is an unfolding recursive autoencoder (RAE) for unsupervised feature learning from unlabeled parse trees. The RAE is a recursive neural network. It learns feature representations for each node in the tree such that the word vectors underneath each node can be recursively reconstructed. These feature representations are used to compute a similarity matrix that compares both the single words as well as all nonterminal node features in both sentences. In order to keep as much of the resulting global information of this comparison as possible and deal with the arbitrary length of the two sentences, we then introduce our second component: a new dynamic pooling layer which outputs a fixed-size representation. Any classifier such as a softmax classifier can then be used to classify whether the two sentences are paraphrases or not.

We first describe the unsupervised feature learning with RAEs followed by a description of pooling and classification. In experiments we show qualitative comparisons of different RAE models and describe our state-of-the-art results on the Microsoft Research Paraphrase (MSRP) Corpus introduced by Dolan et al. [4]. Lastly, we discuss related work.
Figure 1: An overview of our paraphrase model. The recursive autoencoder learns phrase features for each node in a parse tree. The distances between all nodes then fill a similarity matrix whose size depends on the length of the sentences. Using a novel dynamic pooling layer we can compare the variable-sized sentences and classify pairs as being paraphrases or not.

2 Recursive Autoencoders

In this section we describe two variants of unsupervised recursive autoencoders which can be used to learn features from parse trees. The RAE aims to find vector representations for variable-sized phrases spanned by each node of a parse tree. These representations can then be used for subsequent supervised tasks. Before describing the RAE, we briefly review neural language models which compute word representations that we give as input to our algorithm.

2.1 Neural Language Models

The idea of neural language models as introduced by Bengio et al. [5] is to jointly learn an embedding of words into an n-dimensional vector space and to use these vectors to predict how likely a word is given its context. Collobert and Weston [6] introduced a new neural network model to compute such an embedding. When these networks are optimized via gradient ascent the derivatives modify the word embedding matrix L ∈ R^{n×|V|}, where |V| is the size of the vocabulary. The word vectors inside the embedding matrix capture distributional syntactic and semantic information via the word's co-occurrence statistics. For further details and evaluations of these embeddings, see [5, 6, 7, 8].

Once this matrix is learned on an unlabeled corpus, we can use it for subsequent tasks by using each word's vector (a column in L) to represent that word. In the remainder of this paper, we represent a sentence (or any n-gram) as an ordered list of these vectors (x_1, ..., x_m). This word representation is better suited for autoencoders than the binary number representations used in previous related autoencoder models such as the recursive auto-associative memory (RAAM) model of Pollack [9, 10] or recurrent neural networks [11] since the activations are inherently continuous.

2.2 Recursive Autoencoder

Fig. 2 (left) shows an instance of a recursive autoencoder (RAE) applied to a given parse tree as introduced by [12]. Unlike in that work, here we assume that such a tree is given for each sentence by a parser. Initial experiments showed that having a syntactically plausible tree structure is important for paraphrase detection.

Assume we are given a list of word vectors x = (x_1, ..., x_m) as described in the previous section. The binary parse tree for this input is in the form of branching triplets of parents with children: (p → c_1 c_2). The trees are given by a syntactic parser. Each child can be either an input word vector x_i or a nonterminal node in the tree. For both examples in Fig. 2, we have the following triplets: ((y_1 → x_2 x_3), (y_2 → x_1 y_1)), where all x, y ∈ R^n.

Given this tree structure, we can now compute the parent representations. The first parent vector p = y_1 is computed from the children (c_1, c_2) = (x_2, x_3) by one standard neural network layer:

p = f(W_e [c_1; c_2] + b),    (1)

where [c_1; c_2] is simply the concatenation of the two children, f an element-wise activation function such as tanh, and W_e ∈ R^{n×2n} the encoding matrix that we want to learn.
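To make the composition step of Eq. 1 concrete, the following NumPy sketch encodes a binary parse tree bottom-up. It is only an illustration of the encoder, not the paper's implementation: the tree interface (is_leaf, left, right, word_index) and the parameter names We, b are assumptions made for this example.

import numpy as np

def encode_parent(c1, c2, We, b):
    # One RAE composition step (Eq. 1): p = f(W_e [c1; c2] + b), with f = tanh.
    # c1, c2: (n,) child vectors; We: (n, 2n) encoding matrix; b: (n,) bias.
    return np.tanh(We @ np.concatenate([c1, c2]) + b)

def encode_tree(node, word_vecs, We, b):
    # Recursively compute a vector for every node of a binary parse tree.
    # Leaves carry a word index into the embedding matrix; internal nodes are
    # composed from their two children.
    if node.is_leaf:
        return word_vecs[node.word_index]
    left = encode_tree(node.left, word_vecs, We, b)
    right = encode_tree(node.right, word_vecs, We, b)
    return encode_parent(left, right, We, b)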
Figure 2: Two autoencoder models with details of the reconstruction at node y_2. For simplicity we left out the reconstruction layer at the first node y_1 which is the same standard autoencoder for both models. Left: A standard autoencoder that tries to reconstruct only its direct children. Right: The unfolding autoencoder which tries to reconstruct all leaf nodes underneath each node.

One way of assessing how well this n-dimensional vector represents its direct children is to decode their vectors in a reconstruction layer and then to compute the Euclidean distance between the original input and its reconstruction:

[c'_1; c'_2] = f(W_d p + b_d),
E_rec(p) = ||[c_1; c_2] - [c'_1; c'_2]||^2.    (2)

In order to apply the autoencoder recursively, the same steps repeat. Now that y_1 is given, we can use Eq. 1 to compute y_2 by setting the children to be (c_1, c_2) = (x_1, y_1). Again, after computing the intermediate parent vector p = y_2, we can assess how well this vector captures the content of the children by computing the reconstruction error as in Eq. 2. The process repeats until the full tree is constructed and each node has an associated reconstruction error.

During training, the goal is to minimize the reconstruction error of all input pairs at nonterminal nodes p in a given parse tree T:

E_rec(T) = Σ_{p ∈ T} E_rec(p).    (3)

For the example in Fig. 2 (left), we minimize E_rec(T) = E_rec(y_1) + E_rec(y_2).

Since the RAE computes the hidden representations it then tries to reconstruct, it could potentially lower reconstruction error by shrinking the norms of the hidden layers. In order to prevent this, we add a length normalization layer p = p/||p|| to this RAE model (referred to as the standard RAE). Another, more principled solution is to use a model in which each node tries to reconstruct its entire subtree and then measure the reconstruction of the original leaf nodes. Such a model is described in the next section.

2.3 Unfolding Recursive Autoencoder

The unfolding RAE has the same encoding scheme as the standard RAE. The difference is in the decoding step which tries to reconstruct the entire spanned subtree underneath each node as shown in Fig. 2 (right). For instance, at node y_2, the reconstruction error is the difference between the leaf nodes underneath that node [x_1; x_2; x_3] and their reconstructed counterparts [x'_1; x'_2; x'_3]. The unfolding produces the reconstructed leaves by starting at y_2 and computing

[x'_1; y'_1] = f(W_d y_2 + b_d).    (4)

Then it recursively splits y'_1 again to produce vectors

[x'_2; x'_3] = f(W_d y'_1 + b_d).    (5)

In general, we repeatedly use the decoding matrix W_d to unfold each node with the same tree structure as during encoding. The reconstruction error is then computed from a concatenation of the word vectors in that node's span. For a node y that spans words i to j:

E_rec(y_{(i,j)}) = ||[x_i; ...; x_j] - [x'_i; ...; x'_j]||^2.    (6)

The unfolding autoencoder essentially tries to encode each hidden layer such that it best reconstructs its entire subtree to the leaf nodes. Hence, it will not have the problem of hidden layers shrinking in norm. Another potential problem of the standard RAE is that it gives equal weight to the last merged phrases even if one is only a single word (in Fig. 2, x_1 and y_1 have similar weight in the last merge). In contrast, the unfolding RAE captures the increased importance of a child when the child represents a larger subtree.
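A minimal sketch of the unfolding decoder (Eqs. 4-6), assuming the same illustrative binary-tree interface as before; Wd and bd denote the decoding parameters, and the function names are not from the paper.

import numpy as np

def unfold(node, vec, Wd, bd, n):
    # Recursively decode a node vector back into reconstructed leaf vectors
    # (Eqs. 4-5), reusing the tree structure from encoding.
    # vec: (n,) encoding of this node; Wd: (2n, n) decoding matrix; bd: (2n,) bias.
    if node.is_leaf:
        return [vec]
    d = np.tanh(Wd @ vec + bd)                  # [c1'; c2'] = f(W_d y + b_d)
    return (unfold(node.left, d[:n], Wd, bd, n) +
            unfold(node.right, d[n:], Wd, bd, n))

def unfolding_reconstruction_error(node, vec, leaves, Wd, bd, n):
    # Reconstruction error at one node (Eq. 6): compare the concatenated
    # original leaf vectors under the node with their reconstructions.
    recon = np.concatenate(unfold(node, vec, Wd, bd, n))
    original = np.concatenate(leaves)           # [x_i; ...; x_j]
    return np.sum((original - recon) ** 2)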
2.4 Deep Recursive Autoencoder

Both types of RAE can be extended to have multiple encoding layers at each node in the tree. Instead of transforming both children directly into parent p, we can have another hidden layer h in between. While the top layer at each node has to have the same dimensionality as each child (in order for the same network to be recursively compatible), the hidden layer may have arbitrary dimensionality. For the two-layer encoding network, we would replace Eq. 1 with the following:

h = f(W_e^{(1)} [c_1; c_2] + b_e^{(1)}),    (7)
p = f(W_e^{(2)} h + b_e^{(2)}).    (8)

2.5 RAE Training

For training we use a set of parse trees and then minimize the sum of all nodes' reconstruction errors. We compute the gradient efficiently via backpropagation through structure [13]. Even though the objective is not convex, we found that L-BFGS run with mini-batch training works well in practice. Convergence is smooth and the algorithm typically finds a good locally optimal solution.

After the unsupervised training of the RAE, we demonstrate that the learned feature representations capture syntactic and semantic similarities and can be used for paraphrase detection.

3 An Architecture for Variable-Sized Similarity Matrices

Now that we have described the unsupervised feature learning, we explain how to use these features to classify sentence pairs as being in a paraphrase relationship or not.

Figure 3: Example of the dynamic min-pooling layer finding the smallest number in a pooling window region of the original similarity matrix S.

3.1 Computing Sentence Similarity Matrices

Our method incorporates both single word and phrase similarities in one framework. First, the RAE computes phrase vectors for the nodes in a given parse tree. We then compute Euclidean distances between all word and phrase vectors of the two sentences. These distances fill a similarity matrix S as shown in Fig. 1. For computing the similarity matrix, the rows and columns are first filled by the words in their original sentence order. We then add to each row and column the nonterminal nodes in a depth-first, right-to-left order.

Simply extracting aggregate statistics of this table such as the average distance or a histogram of distances cannot accurately capture the global structure of the similarity comparison. For instance, paraphrases often have low or zero Euclidean distances in elements close to the diagonal of the similarity matrix. This happens when similar words align well between the two sentences. However, since the matrix dimensions vary based on the sentence lengths one cannot simply feed the similarity matrix into a standard neural network or classifier.

3.2 Dynamic Pooling

Consider a similarity matrix S generated by sentences of lengths n and m. Since the parse trees are binary and we also compare all nonterminal nodes, S ∈ R^{(2n-1)×(2m-1)}. We would like to map S into a matrix S_pooled of fixed size, n_p × n_p. Our first step in constructing such a map is to partition the rows and columns of S into n_p roughly equal parts, producing an n_p × n_p grid.¹ We then define S_pooled to be the matrix of minimum values of each rectangular region within this grid, as shown in Fig. 3. The matrix S_pooled loses some of the information contained in the original similarity matrix but it still captures much of its global structure.

¹ The partitions will only be of equal size if 2n-1 and 2m-1 are divisible by n_p. We account for this in the following way, although many alternatives are possible. Let the number of rows of S be R = 2n-1. Each pooling window then has ⌊R/n_p⌋ many rows. Let M = R mod n_p be the number of remaining rows. We then evenly distribute these extra rows to the last M window regions which will have ⌊R/n_p⌋ + 1 rows. The same procedure applies to the number of columns for the windows. This procedure will have a slightly finer granularity for the single word similarities which is desired for our task since word overlap is a good indicator for paraphrases. In the rare cases when n_p > R, the pooling layer needs to first up-sample. We achieve this by simply duplicating pixels row-wise until R ≥ n_p.
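The similarity matrix of Sec. 3.1 and the dynamic min-pooling of Sec. 3.2 can be sketched as follows in NumPy. The remainder handling follows footnote 1 (extra rows and columns go to the last windows), while the function names and the exact up-sampling implementation are illustrative assumptions.

import numpy as np

def similarity_matrix(nodes_a, nodes_b):
    # Pairwise Euclidean distances between all word and phrase vectors of two
    # sentences (Sec. 3.1). nodes_a: (2n-1, d) array, nodes_b: (2m-1, d) array.
    diff = nodes_a[:, None, :] - nodes_b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def window_sizes(R, n_p):
    # Split R rows into n_p windows of floor(R/n_p) rows each, giving the
    # remaining R mod n_p rows to the last windows (footnote 1).
    base, extra = divmod(R, n_p)
    return [base] * (n_p - extra) + [base + 1] * extra

def dynamic_min_pool(S, n_p=15):
    # Map the variable-sized matrix S onto a fixed n_p x n_p grid and keep the
    # minimum of every rectangular region (Sec. 3.2).
    if S.shape[0] < n_p:   # up-sample by duplicating rows when a dimension is too small
        S = np.repeat(S, int(np.ceil(n_p / S.shape[0])), axis=0)
    if S.shape[1] < n_p:
        S = np.repeat(S, int(np.ceil(n_p / S.shape[1])), axis=1)
    r_edges = np.cumsum([0] + window_sizes(S.shape[0], n_p))
    c_edges = np.cumsum([0] + window_sizes(S.shape[1], n_p))
    pooled = np.empty((n_p, n_p))
    for i in range(n_p):
        for j in range(n_p):
            pooled[i, j] = S[r_edges[i]:r_edges[i + 1],
                             c_edges[j]:c_edges[j + 1]].min()
    return pooled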
Since elements of S with small Euclidean distances show that there are similar words or phrases in both sentences, we keep this information by applying a min function to the pooling regions. Other functions, like averaging, are also possible, but might obscure the presence of similar phrases. This dynamic pooling layer could make use of overlapping pooling regions, but for simplicity, we consider only non-overlapping pooling regions. After pooling, we normalize each entry to have 0 mean and variance 1.

4 Experiments

For unsupervised RAE training we used a subset of 150,000 sentences from the NYT and AP sections of the Gigaword corpus. We used the Stanford parser [14] to create the parse trees for all sentences. For initial word embeddings we used the 100-dimensional vectors computed via the unsupervised method of Collobert and Weston [6] and provided by Turian et al. [8].

For all paraphrase experiments we used the Microsoft Research paraphrase corpus (MSRP) introduced by Dolan et al. [4]. The dataset consists of 5,801 sentence pairs. The average sentence length is 21, the shortest sentence has 7 words and the longest 36. 3,900 are labeled as being in the paraphrase relationship (technically defined as "mostly bidirectional entailment"). We use the standard split of 4,076 training pairs (67.5% of which are paraphrases) and 1,725 test pairs (66.5% paraphrases). All sentences were labeled by two annotators who agreed in 83% of the cases. A third annotator resolved conflicts. During dataset collection, negative examples were selected to have high lexical overlap to prevent trivial examples. For more information see [4, 15].

As described in Sec. 2.4, we can have deep RAE networks with two encoding or decoding layers. The hidden RAE layer (see h in Eq. 8) was set to have 200 units for both standard and unfolding RAEs.

4.1 Qualitative Evaluation of Nearest Neighbors

In order to show that the learned feature representations capture important semantic and syntactic information even for higher nodes in the tree, we visualize nearest neighbor phrases of varying length. After embedding sentences from the Gigaword corpus, we compute nearest neighbors for all nodes in all trees. In Table 1 the first phrase is a randomly chosen phrase and the remaining phrases are the closest phrases in the dataset that are not in the same sentence. We use Euclidean distance between the vector representations. Note that we do not constrain the neighbors to have the same word length.

We compare the two autoencoder models above: RAE and unfolding RAE without hidden layers, as well as a recursive averaging baseline (R.Avg). R.Avg recursively takes the average of both child vectors in the syntactic tree. We only report results of RAEs without hidden layers between the children and parent vectors. Even though the deep RAE networks have more parameters to learn complex encodings they do not perform as well in this and the next task. This is likely due to the fact that they get stuck in local optima during training.

Table 1: Nearest neighbors of randomly chosen phrases. Recursive averaging and the standard RAE focus mostly on the last merged words and incorrectly add extra information. The unfolding RAE captures most closely both syntactic and semantic similarities.

Center Phrase | Recursive Average | RAE | Unfolding RAE
the U.S. | the U.S. and German | the Swiss | the former U.S.
suffering low morale | suffering a 1.9 billion baht UNK 76 million | suffering due to no fault of my own | suffering heavy casualties
to watch hockey | to watch one Jordanian border policeman stamp the Israeli passports | to watch television | to watch a video
advance to the next round | advance to final qualifying round in Argentina | advance to the final of the UNK 1.1 million Kremlin Cup | advance to the semis
a prominent political figure | such a high-profile figure | the second high-profile opposition figure | a powerful business figure
Seventeen people were killed | "Seventeen people were killed, including a prominent politician" | Fourteen people were killed | Fourteen people were killed
conditions of his release | "conditions of peace, social stability and political harmony" | conditions of peace, social stability and political harmony | negotiations for their release
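The nearest-neighbor retrieval behind Table 1 amounts to a plain Euclidean lookup over all node vectors, skipping nodes that come from the query's own sentence. A minimal sketch, with array layouts chosen purely for illustration:

import numpy as np

def nearest_neighbors(query_vec, query_sent_id, node_vecs, node_sent_ids, k=3):
    # node_vecs: (N, d) phrase vectors for all tree nodes in the corpus;
    # node_sent_ids: (N,) sentence id of each node (used to exclude the query's sentence).
    dists = np.linalg.norm(node_vecs - query_vec, axis=1)
    dists[node_sent_ids == query_sent_id] = np.inf
    return np.argsort(dists)[:k]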
Table 1 shows several interesting phenomena. Recursive averaging is almost entirely focused on an exact string match of the last merged words of the current phrase in the tree. This leads the nearest neighbors to incorrectly add various extra information which would break the paraphrase relationship if we only considered the top node vectors and ignores syntactic similarity. The standard RAE does well though it is also somewhat focused on the last merges in the tree. Finally, the unfolding RAE captures most closely the underlying syntactic and semantic structure.

4.2 Reconstructing Phrases via Recursive Decoding

In this section we analyze the information captured by the unfolding RAE's 100-dimensional phrase vectors. We show that these 100-dimensional vector representations can not only capture and memorize single words but also longer, unseen phrases.

In order to show how much of the information can be recovered we recursively reconstruct sentences after encoding them. The process is similar to unfolding during training. It starts from a phrase vector of a nonterminal node in the parse tree. We then unfold the tree as given during encoding and find the nearest neighbor word to each of the reconstructed leaf node vectors. Table 2 shows that the unfolding RAE can very well reconstruct phrases of up to length five. No other method that we compared had such reconstruction capabilities. Longer phrases retain some correct words and usually the correct part of speech but the semantics of the words get merged. The results are from the unfolding RAE that directly computes the parent representation as in Eq. 1.

Table 2: Original inputs and generated output from unfolding and reconstruction. Words are the nearest neighbors to the reconstructed leaf node vectors. The unfolding RAE can reconstruct perfectly almost all phrases of 2 and 3 words and many with up to 5 words. Longer phrases start to get incorrect nearest neighbor words. For the standard RAE good reconstructions are only possible for two words. Recursive averaging cannot recover any words.

Encoding Input | Generated Text from Unfolded Reconstruction
a December summit | a December summit
the first qualifying session | the first qualifying session
English premier division club | Irish presidency division club
the safety of a flight | the safety of a flight
the signing of the accord | the signing of the accord
the U.S. House of Representatives | the U.S. House of Representatives
enforcement of the economic embargo | enforcement of the national embargo
visit and discuss investment possibilities | visit and postpone financial possibilities
the agreement it made with Malaysia | the agreement it made with Malaysia
the full bloom of their young lives | the lower bloom of their democratic lives
the organization for which the men work | the organization for Romania the reform work
a pocket knife was found in his suitcase in the plane's cargo hold | a bomb corpse was found in the mission in the Irish car language case

4.3 Evaluation on Full-Sentence Paraphrasing

We now turn to evaluating the unsupervised features and our dynamic pooling architecture in our main task of paraphrase detection. Methods which are based purely on vector representations invariably lose some information. For instance, numbers often have very similar representations, but even small differences are crucial to reject the paraphrase relation in the MSRP dataset. Hence, we add three number features. The first is 1 if two sentences contain exactly the same numbers or no number and 0 otherwise, the second is 1 if both sentences contain the same numbers and the third is 1 if the set of numbers in one sentence is a strict subset of the numbers in the other sentence. Since our pooling layer cannot capture sentence length or the number of exact string matches, we also add the difference in sentence length and the percentage of words and phrases in one sentence that are in the other sentence and vice-versa. We also report performance without these three features (only S).
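A rough sketch of these surface features follows. The tokenization, the regex-based number matching, and the word-only overlap are simplifications made for this example (the paper also counts phrase overlap), so treat the exact outputs as illustrative rather than a reproduction of the paper's features.

import re

def pair_features(s1, s2):
    # Surface features added alongside the pooled similarity matrix (Sec. 4.3).
    nums1 = set(re.findall(r"\d+(?:\.\d+)?", s1))
    nums2 = set(re.findall(r"\d+(?:\.\d+)?", s2))
    w1, w2 = s1.lower().split(), s2.lower().split()

    same_or_none = 1.0 if nums1 == nums2 else 0.0                   # same numbers, or neither has any
    same_nonempty = 1.0 if nums1 and nums1 == nums2 else 0.0        # both contain the same numbers
    strict_subset = 1.0 if (nums1 < nums2 or nums2 < nums1) else 0.0  # one set strictly contains the other

    len_diff = abs(len(w1) - len(w2))                               # difference in sentence length
    overlap_1_in_2 = sum(w in w2 for w in w1) / max(len(w1), 1)     # word overlap, sentence 1 in 2
    overlap_2_in_1 = sum(w in w1 for w in w2) / max(len(w2), 1)     # word overlap, sentence 2 in 1
    return [same_or_none, same_nonempty, strict_subset,
            len_diff, overlap_1_in_2, overlap_2_in_1]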
For all of our models and training setups, we perform 10-fold cross-validation on the training set to choose the best regularization parameters and n_p, the size of the pooling matrix S ∈ R^{n_p×n_p}. In our best model, the regularization for the RAE was 10^{-5} and 0.05 for the softmax classifier. The best pooling size was consistently n_p = 15, slightly less than the average sentence length. For all sentence pairs (S1, S2) in the training data, we also added (S2, S1) to the training set in order to make the most use of the training data. This improved performance by 0.2%.

Table 3: Test results on the MSRP paraphrase corpus. Comparisons of unsupervised feature learning methods (left), similarity feature extraction and supervised classification methods (center) and other approaches (right).

Model | Acc. | F1
All Paraphrase Baseline | 66.5 | 79.9
Rus et al. (2008) [16] | 70.6 | 80.5
Mihalcea et al. (2006) [17] | 70.3 | 81.3
Islam and Inkpen (2007) [18] | 72.6 | 81.3
Qiu et al. (2006) [19] | 72.0 | 81.6
Fernando and Stevenson (2008) [20] | 74.1 | 82.4
Wan et al. (2006) [21] | 75.6 | 83.0
Das and Smith (2009) [15] | 73.9 | 82.3
Das and Smith (2009) + 18 Features | 76.1 | 82.7
Unfolding RAE + Dynamic Pooling | 76.8 | 83.6

In our first set of experiments we compare several unsupervised feature learning methods: recursive averaging as defined in Sec. 4.1, standard RAEs and unfolding RAEs. For each of the three methods, we cross-validate on the training data over all possible hyperparameters and report the best performance. We observe that the dynamic pooling layer is very powerful because it captures the global structure of the similarity matrix which in turn captures the syntactic and semantic similarities of the two sentences. With the help of this powerful dynamic pooling layer and good initial word vectors even the standard RAE and recursive averaging perform well on this dataset with an accuracy of 75.5% and 75.9% respectively. We obtain the best accuracy of 76.8% with the unfolding RAE without hidden layers. We tried adding 1 and 2 hidden encoding and decoding layers but performance only decreased by 0.2% and training became slower.

Next, we compare the dynamic pooling to simpler feature extraction methods. Our comparison shows that the dynamic pooling architecture is important for achieving high accuracy. For every setting we again exhaustively cross-validate on the training data and report the best performance. The settings and their accuracies are:

(i) S-Hist: 73.0%. A histogram of values in the matrix S. The low performance shows that our dynamic pooling layer better captures the global similarity information than aggregate statistics.
(ii) Only Feat: 73.2%. Only the three features described above. This shows that simple binary string and number matching can detect many of the simple paraphrases but fails to detect complex cases.
(iii) Only S_pooled: 72.6%. Without the three features mentioned above. This shows that some information still gets lost in S_pooled and that a better treatment of numbers is needed. In order to better recover exact string matches it may be necessary to explore overlapping pooling regions.
(iv) Top Unfolding RAE Node: 74.2%. Instead of S_pooled, use the Euclidean distance between the two top sentence vectors. The performance shows that while the unfolding RAE is by itself very powerful, the dynamic pooling layer is needed to extract all information from its trees.

Table 3 shows our results compared to previous approaches (see next section). Our unfolding RAE and dynamic similarity pooling architecture achieves state-of-the-art performance without hand-designed semantic taxonomies and features such as WordNet. Note that the effective range of the accuracy lies between 66% (most frequent class baseline) and 83% (interannotator agreement).

In Table 4 we show several examples of correctly classified paraphrase candidate pairs together with their similarity matrix after dynamic min-pooling. The first and last pair are simple cases of paraphrase and not paraphrase. The second example shows a pooled similarity matrix when large chunks are swapped in both sentences. Our model is very robust to such transformations and gives a high probability to this pair. Even more complex examples such as the third with very few direct string matches (few blue squares) are correctly classified. The second to last example is highly interesting. Even though there is a clear diagonal with good string matches, the gap in the center shows that the first sentence contains much extra information. This is also captured by our model.
Table 4: Examples of sentence pairs with: ground truth labels L (P - Paraphrase, N - Not Paraphrase), the probabilities our model assigns to them (Pr(S1, S2) > 0.5 is assigned the label Paraphrase) and their similarity matrices after dynamic min-pooling. Simple paraphrase pairs have clear diagonal structure due to perfect word matches with Euclidean distance 0 (dark blue). That structure is preserved by our min-pooling layer. Best viewed in color. See text for details.

L | Pr | Sentences
P | 0.95 | (1) LLEYTON Hewitt yesterday traded his tennis racquet for his first sporting passion - Australian football - as the world champion relaxed before his Wimbledon title defence (2) LLEYTON Hewitt yesterday traded his tennis racquet for his first sporting passion - Australian rules football - as the world champion relaxed ahead of his Wimbledon defence
P | 0.82 | (1) The lies and deceptions from Saddam have been well documented over 12 years (2) It has been well documented over 12 years of lies and deception from Saddam
P | 0.67 | (1) Pollack said the plaintiffs failed to show that Merrill and Blodget directly caused their losses (2) Basically, the plaintiffs did not show that omissions in Merrill's research caused the claimed losses
N | 0.49 | (1) Prof Sally Baldwin, 63, from York, fell into a cavity which opened up when the structure collapsed at Tiburtina station, Italian railway officials said (2) Sally Baldwin, from York, was killed instantly when a walkway collapsed and she fell into the machinery at Tiburtina station
N | 0.44 | (1) Bremer, 61, is a onetime assistant to former Secretaries of State William P. Rogers and Henry Kissinger and was ambassador-at-large for counterterrorism from 1986 to 1989 (2) Bremer, 61, is a former assistant to former Secretaries of State William P. Rogers and Henry Kissinger
N | 0.11 | (1) The initial report was made to Modesto Police December 28 (2) It stems from a Modesto police report

5 Related Work

The field of paraphrase detection has progressed immensely in recent years. Early approaches were based purely on lexical matching techniques [22, 23, 19, 24]. Since these methods are often based on exact string matches of n-grams, they fail to detect similar meaning that is conveyed by synonymous words. Several approaches [17, 18] overcome this problem by using WordNet- and corpus-based semantic similarity measures. In their approach they choose for each open-class word the single most similar word in the other sentence. Fernando and Stevenson [20] improved upon this idea by computing a similarity matrix that captures all pair-wise similarities of single words in the two sentences. They then threshold the elements of the resulting similarity matrix and compute the mean of the remaining entries.
There are two shortcomings of such methods: they ignore (i) the syntactic structure of the sentences (by comparing only single words) and (ii) the global structure of such a similarity matrix (by computing only the mean).

Instead of comparing only single words, [21] adds features from dependency parses. Most recently, Das and Smith [15] adopted the idea that paraphrases have related syntactic structure. Their quasi-synchronous grammar formalism incorporates a variety of features from WordNet, a named entity recognizer, a part-of-speech tagger, and the dependency labels from the aligned trees. In order to obtain high performance they combine their parsing-based model with a logistic regression model that uses 18 hand-designed surface features.

We merge these word-based models and syntactic models in one joint framework: our matrix consists of phrase similarities and instead of just taking the mean of the similarities we can capture the global layout of the matrix via our min-pooling layer.

The idea of applying an autoencoder in a recursive setting was introduced by Pollack [9] and extended recently by [10]. Pollack's recursive auto-associative memories are similar to ours in that they are a connectionist, feedforward model. One of the major shortcomings of previous applications of recursive autoencoders to natural language sentences was their binary word representation as discussed in Sec. 2.1. Recently, Bottou discussed related ideas of recursive autoencoders [25] and recursive image and text understanding but without experimental results. Larochelle [26] investigated autoencoders with an unfolded "deep objective". Supervised recursive neural networks have been used for parsing images and natural language sentences by Socher et al. [27, 28]. Lastly, [12] introduced the standard recursive autoencoder as mentioned in Sect. 2.2.

6 Conclusion

We introduced an unsupervised feature learning algorithm based on unfolding, recursive autoencoders. The RAE captures syntactic and semantic information as shown qualitatively with nearest neighbor embeddings and quantitatively on a paraphrase detection task. Our RAE phrase features allow us to compare both single word vectors as well as phrases and complete syntactic trees. In order to make use of the global comparison of variable length sentences in a similarity matrix we introduce a new dynamic pooling architecture that produces a fixed-sized representation. We show that this pooled representation captures enough information about the sentence pair to determine the paraphrase relationship on the MSRP dataset with a higher accuracy than any previously published results.
References

[1] E. Marsi and E. Krahmer. Explorations in sentence fusion. In European Workshop on Natural Language Generation, 2005.
[2] P. Clough, R. Gaizauskas, S. S. L. Piao, and Y. Wilks. METER: MEasuring TExt Reuse. In ACL, 2002.
[3] C. Callison-Burch. Syntactic constraints on paraphrases extracted from parallel corpora. In Proceedings of EMNLP, pages 196-205, 2008.
[4] B. Dolan, C. Quirk, and C. Brockett. Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In COLING, 2004.
[5] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3, March 2003.
[6] R. Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML, 2008.
[7] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
[8] J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of ACL, pages 384-394, 2010.
[9] J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46, November 1990.
[10] T. Voegtlin and P. Dominey. Linear Recursive Distributed Representations. Neural Networks, 18(7), 2005.
[11] J. L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2-3), 1991.
[12] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In EMNLP, 2011.
[13] C. Goller and A. Küchler. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of the International Conference on Neural Networks (ICNN-96), 1996.
[14] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In ACL, 2003.
[15] D. Das and N. A. Smith. Paraphrase identification as probabilistic quasi-synchronous recognition. In Proc. of ACL-IJCNLP, 2009.
[16] V. Rus, P. M. McCarthy, M. C. Lintean, D. S. McNamara, and A. C. Graesser. Paraphrase identification with lexico-syntactic graph subsumption. In FLAIRS Conference, 2008.
[17] R. Mihalcea, C. Corley, and C. Strapparava. Corpus-based and Knowledge-based Measures of Text Semantic Similarity. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, 2006.
[18] A. Islam and D. Inkpen. Semantic Similarity of Short Texts. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2007), 2007.
[19] L. Qiu, M. Kan, and T. Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, 2006.
[20] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics, 2008.
[21] S. Wan, M. Dras, R. Dale, and C. Paris. Using dependency-based features to take the "para-farce" out of paraphrase. In Proceedings of the Australasian Language Technology Workshop 2006, 2006.
[22] R. Barzilay and L. Lee. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In NAACL, 2003.
[23] Y. Zhang and J. Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, 2005.
[24] Z. Kozareva and A. Montoyo. Paraphrase Identification on the Basis of Supervised Machine Learning Techniques. In Advances in Natural Language Processing, 5th International Conference on NLP, FinTAL, 2006.
[25] L. Bottou. From machine learning to machine reasoning. CoRR, abs/1102.1808, 2011.
[26] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. JMLR, 10, 2009.
[27] R. Socher, C. D. Manning, and A. Y. Ng. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop, 2010.
[28] R. Socher, C. Lin, A. Y. Ng, and C. D. Manning. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In ICML, 2011.
