Neural Word Embedding as Implicit Matrix Factorization

Omer Levy
Department of Computer Science
Bar-Ilan University
omerlevy@gmail.com

Yoav Goldberg
Department of Computer Science
Bar-Ilan University
yoav.goldberg@gmail.com

Abstract

We analyze skip-gram with negative-sampling (SGNS), a word embedding method introduced by Mikolov et al., and show that it is implicitly factorizing a word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant. We find that another embedding method, NCE, is implicitly factorizing a similar matrix, where each cell is the (shifted) log conditional probability of a word given its context. We show that using a sparse Shifted Positive PMI word-context matrix to represent words improves results on two word similarity tasks and one of two analogy tasks. When dense low-dimensional vectors are preferred, exact factorization with SVD can achieve solutions that are at least as good as SGNS's solutions for word similarity tasks. On analogy questions SGNS remains superior to SVD. We conjecture that this stems from the weighted nature of SGNS's factorization.

1 Introduction

Most tasks in natural language processing and understanding involve looking at words, and could benefit from word representations that do not treat individual words as unique symbols, but instead reflect similarities and dissimilarities between them. The common paradigm for deriving such representations is based on the distributional hypothesis of Harris [15], which states that words in similar contexts have similar meanings. This has given rise to many word representation methods in the NLP literature, the vast majority of which can be described in terms of a word-context matrix M in which each row i corresponds to a word, each column j to a context in which the word appeared, and each matrix entry M_ij corresponds to some association measure between the word and the context. Words are then represented as rows in M or in a dimensionality-reduced matrix based on M.

Recently, there has been a surge of work proposing to represent words as dense vectors, derived using various training methods inspired from neural-network language modeling [3, 9, 23, 21]. These representations, referred to as "neural embeddings" or "word embeddings", have been shown to perform well in a variety of NLP tasks [26, 10, 1]. In particular, a sequence of papers by Mikolov and colleagues [20, 21] culminated in the skip-gram with negative-sampling (SGNS) training method, which is both efficient to train and provides state-of-the-art results on various linguistic tasks. The training method (as implemented in the word2vec software package) is highly popular, but not well understood. While it is clear that the training objective follows the distributional hypothesis – by trying to maximize the dot-product between the vectors of frequently occurring word-context pairs, and minimize it for random word-context pairs – very little is known about the quantity being optimized by the algorithm, or the reason it is expected to produce good word representations.

In this work, we aim to broaden the theoretical understanding of neural-inspired word embeddings. Specifically, we cast SGNS's training method as weighted matrix factorization, and show that its objective is implicitly factorizing a shifted PMI matrix – the well-known word-context PMI matrix from the word-similarity literature, shifted by a constant offset. A similar result holds also for the NCE embedding method of Mnih and Kavukcuoglu [24]. While it is impractical to directly use the very high-dimensional and dense shifted PMI matrix, we propose to approximate it with the positive shifted PMI matrix (Shifted PPMI), which is sparse. Shifted PPMI is far better at optimizing SGNS's objective, and performs slightly better than word2vec derived vectors on several linguistic tasks.

Finally, we suggest a simple spectral algorithm that is based on performing SVD over the Shifted PPMI matrix. The spectral algorithm outperforms both SGNS and the Shifted PPMI matrix on the word similarity tasks, and is scalable to large corpora. However, it lags behind the SGNS-derived representation on word-analogy tasks. We conjecture that this behavior is related to the fact that SGNS performs weighted matrix factorization, giving more influence to frequent pairs, as opposed to SVD, which gives the same weight to all matrix cells. While the weighted and non-weighted objectives share the same optimal solution (perfect reconstruction of the shifted PMI matrix), they result in different generalizations when combined with dimensionality constraints.

2 Background: Skip-Gram with Negative Sampling (SGNS)

Our departure point is SGNS – the skip-gram neural embedding model introduced in [20] trained using the negative-sampling procedure presented in [21]. In what follows, we summarize the SGNS model and introduce our notation. A detailed derivation of the SGNS model is available in [14].

Setting and Notation  The skip-gram model assumes a corpus of words w ∈ V_W and their contexts c ∈ V_C, where V_W and V_C are the word and context vocabularies. In [20, 21] the words come from an unannotated textual corpus of words w_1, w_2, ..., w_n (typically n is in the billions) and the contexts for word w_i are the words surrounding it in an L-sized window w_{i-L}, ..., w_{i-1}, w_{i+1}, ..., w_{i+L}. Other definitions of contexts are possible [18]. We denote the collection of observed word and context pairs as D. We use #(w,c) to denote the number of times the pair (w,c) appears in D. Similarly, #(w) = Σ_{c'∈V_C} #(w,c') and #(c) = Σ_{w'∈V_W} #(w',c) are the number of times w and c occurred in D, respectively.

Each word w ∈ V_W is associated with a vector w⃗ ∈ R^d and similarly each context c ∈ V_C is represented as a vector c⃗ ∈ R^d, where d is the embedding's dimensionality. The entries in the vectors are latent, and treated as parameters to be learned. We sometimes refer to the vectors w⃗ as rows in a |V_W| × d matrix W, and to the vectors c⃗ as rows in a |V_C| × d matrix C. In such cases, W_i (C_i) refers to the vector representation of the i-th word (context) in the corresponding vocabulary. When referring to embeddings produced by a specific method x, we will usually use W^x and C^x explicitly, but may use just W and C when the method is clear from the discussion.

SGNS's Objective  Consider a word-context pair (w,c). Did this pair come from the observed data D? Let P(D=1|w,c) be the probability that (w,c) came from the data, and P(D=0|w,c) = 1 − P(D=1|w,c) the probability that (w,c) did not. The distribution is modeled as:

    P(D=1 \mid w,c) = \sigma(\vec{w} \cdot \vec{c}) = \frac{1}{1 + e^{-\vec{w} \cdot \vec{c}}}

where the vectors w⃗ and c⃗ (each a d-dimensional vector) are the model parameters to be learned. The negative sampling objective tries to maximize P(D=1|w,c) for observed (w,c) pairs while maximizing P(D=0|w,c) for randomly sampled "negative" examples (hence the name "negative sampling"), under the assumption that randomly selecting a context for a given word is likely to result in an unobserved (w,c) pair. SGNS's objective for a single (w,c) observation is then:

    \log \sigma(\vec{w} \cdot \vec{c}) + k \, E_{c_N \sim P_D}[\log \sigma(-\vec{w} \cdot \vec{c}_N)]    (1)

where k is the number of "negative" samples and c_N is the sampled context, drawn according to the empirical unigram distribution P_D(c) = #(c)/|D| (see footnote 1).

Footnote 1: In the algorithm described in [21], the negative contexts are sampled according to p_{3/4}(c) = #(c)^{3/4}/Z instead of the unigram distribution #(c)/Z. Sampling according to p_{3/4} indeed produces somewhat superior results on some of the semantic evaluation tasks. It is straightforward to modify the PMI metric in a similar fashion by replacing the p(c) term with p_{3/4}(c), and doing so shows similar trends in the matrix-based methods as it does in word2vec's stochastic gradient based training method. We do not explore this further in this paper, and report results using the unigram distribution.

The objective is trained in an online fashion using stochastic gradient updates over the observed pairs in the corpus D. The global objective then sums over the observed (w,c) pairs in the corpus:

    \ell = \sum_{w \in V_W} \sum_{c \in V_C} #(w,c) \Big( \log \sigma(\vec{w} \cdot \vec{c}) + k \, E_{c_N \sim P_D}[\log \sigma(-\vec{w} \cdot \vec{c}_N)] \Big)    (2)

Optimizing this objective makes observed word-context pairs have similar embeddings, while scattering unobserved pairs. Intuitively, words that appear in similar contexts should have similar embeddings, though we are not familiar with a formal proof that SGNS does indeed maximize the dot-product of similar words.
3 SGNS as Implicit Matrix Factorization

SGNS embeds both words and their contexts into a low-dimensional space R^d, resulting in the word and context matrices W and C. The rows of matrix W are typically used in NLP tasks (such as computing word similarities) while C is ignored. It is nonetheless instructive to consider the product W · C^T = M. Viewed this way, SGNS can be described as factorizing an implicit matrix M of dimensions |V_W| × |V_C| into two smaller matrices.

Which matrix is being factorized? A matrix entry M_ij corresponds to the dot product W_i · C_j = w⃗_i · c⃗_j. Thus, SGNS is factorizing a matrix in which each row corresponds to a word w ∈ V_W, each column corresponds to a context c ∈ V_C, and each cell contains a quantity f(w,c) reflecting the strength of association between that particular word-context pair. Such word-context association matrices are very common in the NLP and word-similarity literature, see e.g. [29, 2]. That said, the objective of SGNS (equation 2) does not explicitly state what this association metric is. What can we say about the association function f(w,c)? In other words, which matrix is SGNS factorizing?

3.1 Characterizing the Implicit Matrix

Consider the global objective (equation 2) above. For sufficiently large dimensionality d (i.e. allowing for a perfect reconstruction of M), each product w⃗ · c⃗ can assume a value independently of the others. Under these conditions, we can treat the objective ℓ as a function of independent w⃗ · c⃗ terms, and find the values of these terms that maximize it.

We begin by rewriting equation 2:

    \ell = \sum_{w \in V_W} \sum_{c \in V_C} #(w,c) \log \sigma(\vec{w} \cdot \vec{c}) + \sum_{w \in V_W} \sum_{c \in V_C} #(w,c) \Big( k \, E_{c_N \sim P_D}[\log \sigma(-\vec{w} \cdot \vec{c}_N)] \Big)
        = \sum_{w \in V_W} \sum_{c \in V_C} #(w,c) \log \sigma(\vec{w} \cdot \vec{c}) + \sum_{w \in V_W} #(w) \Big( k \, E_{c_N \sim P_D}[\log \sigma(-\vec{w} \cdot \vec{c}_N)] \Big)    (3)

and explicitly expressing the expectation term:

    E_{c_N \sim P_D}[\log \sigma(-\vec{w} \cdot \vec{c}_N)] = \sum_{c_N \in V_C} \frac{#(c_N)}{|D|} \log \sigma(-\vec{w} \cdot \vec{c}_N)
        = \frac{#(c)}{|D|} \log \sigma(-\vec{w} \cdot \vec{c}) + \sum_{c_N \in V_C \setminus \{c\}} \frac{#(c_N)}{|D|} \log \sigma(-\vec{w} \cdot \vec{c}_N)    (4)

Combining equations 3 and 4 reveals the local objective for a specific (w,c) pair:

    \ell(w,c) = #(w,c) \log \sigma(\vec{w} \cdot \vec{c}) + k \, #(w) \frac{#(c)}{|D|} \log \sigma(-\vec{w} \cdot \vec{c})    (5)

To optimize the objective, we define x = w⃗ · c⃗ and find its partial derivative with respect to x:

    \frac{\partial \ell}{\partial x} = #(w,c) \, \sigma(-x) - k \, #(w) \frac{#(c)}{|D|} \, \sigma(x)

We compare the derivative to zero, and after some simplification, arrive at:

    e^{2x} - \left( \frac{#(w,c)}{k \, #(w) \frac{#(c)}{|D|}} - 1 \right) e^{x} - \frac{#(w,c)}{k \, #(w) \frac{#(c)}{|D|}} = 0

If we define y = e^x, this equation becomes a quadratic equation of y, which has two solutions, y = −1 (which is invalid given the definition of y) and:

    y = \frac{#(w,c)}{k \, #(w) \frac{#(c)}{|D|}} = \frac{#(w,c) \cdot |D|}{#(w) \cdot #(c)} \cdot \frac{1}{k}

Substituting y with e^x and x with w⃗ · c⃗ reveals:

    \vec{w} \cdot \vec{c} = \log \left( \frac{#(w,c) \cdot |D|}{#(w) \cdot #(c)} \cdot \frac{1}{k} \right) = \log \frac{#(w,c) \cdot |D|}{#(w) \cdot #(c)} - \log k    (6)

Interestingly, the expression log( #(w,c)·|D| / (#(w)·#(c)) ) is the well-known pointwise mutual information (PMI) of (w,c), which we discuss in depth below.

Finally, we can describe the matrix M that SGNS is factorizing:

    M^{SGNS}_{ij} = W_i \cdot C_j = \vec{w}_i \cdot \vec{c}_j = PMI(w_i, c_j) - \log k    (7)

For a negative-sampling value of k = 1, the SGNS objective is factorizing a word-context matrix in which the association between a word and its context is measured by f(w,c) = PMI(w,c). We refer to this matrix as the PMI matrix, M^PMI. For negative-sampling values k > 1, SGNS is factorizing a shifted PMI matrix M^{PMI_k} = M^PMI − log k.

Other embedding methods can also be cast as factorizing implicit word-context matrices. Using a similar derivation, it can be shown that noise-contrastive estimation (NCE) [24] is factorizing the (shifted) log-conditional-probability matrix:

    M^{NCE}_{ij} = \vec{w}_i \cdot \vec{c}_j = \log \frac{#(w,c)}{#(c)} - \log k = \log P(w|c) - \log k    (8)
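As a quick numerical sanity check of equation (6), the short script below maximizes the local objective ℓ(w,c) of equation (5) over x = w⃗ · c⃗ by grid search and compares the maximizer to PMI(w,c) − log k. The counts are arbitrary illustrative values, not taken from the paper.

```python
import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)

# arbitrary illustrative counts: #(w,c), #(w), #(c), |D|, and the shift k
n_wc, n_w, n_c, D, k = 50.0, 1000.0, 2000.0, 1_000_000.0, 5

def local_objective(x):
    # equation (5): #(w,c) log sigma(x) + k #(w) #(c)/|D| log sigma(-x)
    return n_wc * log_sigmoid(x) + k * n_w * n_c / D * log_sigmoid(-x)

xs = np.linspace(-10.0, 10.0, 200001)
x_grid_max = xs[np.argmax(local_objective(xs))]
x_closed_form = np.log(n_wc * D / (n_w * n_c)) - np.log(k)   # equation (6)

print(x_grid_max, x_closed_form)   # the two values should agree to grid precision
```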
3.2 Weighted Matrix Factorization

We obtained that SGNS's objective is optimized by setting w⃗ · c⃗ = PMI(w,c) − log k for every (w,c) pair. However, this assumes that the dimensionality of w⃗ and c⃗ is high enough to allow for perfect reconstruction. When perfect reconstruction is not possible, some w⃗ · c⃗ products must deviate from their optimal values. Looking at the pair-specific objective (equation 5) reveals that the loss for a pair (w,c) depends on its number of observations (#(w,c)) and expected negative samples (k · #(w) · #(c)/|D|). SGNS's objective can now be cast as a weighted matrix factorization problem, seeking the optimal d-dimensional factorization of the matrix M^PMI − log k under a metric which pays more for deviations on frequent (w,c) pairs than deviations on infrequent ones.

3.3 Pointwise Mutual Information

Pointwise mutual information is an information-theoretic association measure between a pair of discrete outcomes x and y, defined as:

    PMI(x,y) = \log \frac{P(x,y)}{P(x) P(y)}    (9)

In our case, PMI(w,c) measures the association between a word w and a context c by calculating the log of the ratio between their joint probability (the frequency in which they occur together) and their marginal probabilities (the frequency in which they occur independently). PMI can be estimated empirically by considering the actual number of observations in a corpus:

    PMI(w,c) = \log \frac{#(w,c) \cdot |D|}{#(w) \cdot #(c)}    (10)

The use of PMI as a measure of association in NLP was introduced by Church and Hanks [8] and widely adopted for word similarity tasks [11, 27, 29].

Working with the PMI matrix presents some computational challenges. The rows of M^PMI contain many entries of word-context pairs (w,c) that were never observed in the corpus, for which PMI(w,c) = log 0 = −∞. Not only is the matrix ill-defined, it is also dense, which is a major practical issue because of its huge dimensions |V_W| × |V_C|. One could smooth the probabilities using, for instance, a Dirichlet prior by adding a small "fake" count to the underlying counts matrix, rendering all word-context pairs observed. While the resulting matrix will not contain any infinite values, it will remain dense.

An alternative approach, commonly used in NLP, is to replace the M^PMI matrix with M^{PMI}_0, in which PMI(w,c) = 0 in cases where #(w,c) = 0, resulting in a sparse matrix. We note that M^{PMI}_0 is inconsistent, in the sense that observed but "bad" (uncorrelated) word-context pairs have a negative matrix entry, while unobserved (hence worse) ones have 0 in their corresponding cell. Consider for example a pair of relatively frequent words (high P(w) and P(c)) that occur only once together. There is strong evidence that the words tend not to appear together, resulting in a negative PMI value, and hence a negative matrix entry. On the other hand, a pair of frequent words (same P(w) and P(c)) that is never observed occurring together in the corpus will receive a value of 0.

A sparse and consistent alternative from the NLP literature is to use the positive PMI (PPMI) metric, in which all negative values are replaced by 0:

    PPMI(w,c) = \max(PMI(w,c), 0)    (11)

When representing words, there is some intuition behind ignoring negative values: humans can easily think of positive associations (e.g. "Canada" and "snow") but find it much harder to invent negative ones ("Canada" and "desert"). This suggests that the perceived similarity of two words is more influenced by the positive context they share than by the negative context they share. It therefore makes some intuitive sense to discard the negatively associated contexts and mark them as "uninformative" (0) instead (see footnote 2). Indeed, it was shown that the PPMI metric performs very well on semantic similarity tasks [5].

Footnote 2: A notable exception is the case of syntactic similarity. For example, all verbs share a very strong negative association with being preceded by determiners, and past tense verbs have a very strong negative association to be preceded by "be" verbs and modals.

Both M^{PMI}_0 and M^PPMI are well known to the NLP community. In particular, systematic comparisons of various word-context association metrics show that PMI, and more so PPMI, provide the best results for a wide range of word-similarity tasks [5, 16]. It is thus interesting that the PMI matrix emerges as the optimal solution for SGNS's objective.
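The sketch below builds M^PPMI from a sparse word-context count matrix using equations (10) and (11). It assumes scipy's sparse types and touches only observed cells, which keeps the result sparse; the function name is illustrative rather than taken from any released code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def ppmi_matrix(counts: csr_matrix) -> csr_matrix:
    """counts[i, j] = #(w_i, c_j).  Returns M^PPMI of equation (11),
    computing PMI (equation 10) only on observed cells so the matrix stays sparse."""
    total = counts.sum()                                               # |D|
    w_counts = np.asarray(counts.sum(axis=1)).ravel().astype(float)    # #(w)
    c_counts = np.asarray(counts.sum(axis=0)).ravel().astype(float)    # #(c)

    coo = counts.tocoo()
    pmi = np.log(coo.data.astype(float) * total
                 / (w_counts[coo.row] * c_counts[coo.col]))
    ppmi = np.maximum(pmi, 0.0)                                        # clip negatives to 0
    return csr_matrix((ppmi, (coo.row, coo.col)), shape=counts.shape)
```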
4 Alternative Word Representations

As SGNS with k = 1 is attempting to implicitly factorize the familiar matrix M^PMI, a natural algorithm would be to use the rows of M^PPMI directly when calculating word similarities. Though PPMI is only an approximation of the original PMI matrix, it still brings the objective function very close to its optimum (see Section 5.1). In this section, we propose two alternative word representations that build upon M^PPMI.

4.1 Shifted PPMI

While the PMI matrix emerges from SGNS with k = 1, it was shown that different values of k can substantially improve the resulting embedding. With k > 1, the association metric in the implicitly factorized matrix is PMI(w,c) − log(k). This suggests the use of Shifted PPMI (SPPMI), a novel association metric which, to the best of our knowledge, was not explored in the NLP and word-similarity communities:

    SPPMI_k(w,c) = \max(PMI(w,c) - \log k, 0)    (12)

As with SGNS, certain values of k can improve the performance of M^{SPPMI_k} on different tasks.
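A minimal variant of the PPMI construction sketched above gives the SPPMI_k matrix of equation (12): the PMI values are shifted by log k before clipping at zero. As before, the helper name and the use of scipy sparse matrices are assumptions of the sketch.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sppmi_matrix(counts: csr_matrix, k: float) -> csr_matrix:
    """M^SPPMI_k of equation (12): max(PMI(w,c) - log k, 0) on observed cells."""
    total = counts.sum()
    w_counts = np.asarray(counts.sum(axis=1)).ravel().astype(float)
    c_counts = np.asarray(counts.sum(axis=0)).ravel().astype(float)

    coo = counts.tocoo()
    pmi = np.log(coo.data.astype(float) * total
                 / (w_counts[coo.row] * c_counts[coo.col]))
    sppmi = np.maximum(pmi - np.log(k), 0.0)      # shift by log k, then clip
    return csr_matrix((sppmi, (coo.row, coo.col)), shape=counts.shape)
```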
4.2 Spectral Dimensionality Reduction: SVD over Shifted PPMI

While sparse vector representations work well, there are also advantages to working with dense low-dimensional vectors, such as improved computational efficiency and, arguably, better generalization.

An alternative matrix factorization method to SGNS's stochastic gradient training is truncated Singular Value Decomposition (SVD) – a basic algorithm from linear algebra which is used to achieve the optimal rank-d factorization with respect to L2 loss [12]. SVD factorizes M into the product of three matrices U · Σ · V^T, where U and V are orthonormal and Σ is a diagonal matrix of singular values. Let Σ_d be the diagonal matrix formed from the top d singular values, and let U_d and V_d be the matrices produced by selecting the corresponding columns from U and V. The matrix M_d = U_d · Σ_d · V_d^T is the matrix of rank d that best approximates the original matrix M, in the sense that it minimizes the approximation errors. That is, M_d = argmin_{Rank(M') = d} ||M' − M||_2.

When using SVD, the dot-products between the rows of W = U_d · Σ_d are equal to the dot-products between rows of M_d. In the context of word-context matrices, the dense, d-dimensional rows of W are perfect substitutes for the very high-dimensional rows of M_d. Indeed, another common approach in the NLP literature is factorizing the PPMI matrix M^PPMI with SVD, and then taking the rows of W^SVD = U_d · Σ_d and C^SVD = V_d as word and context representations, respectively. However, using the rows of W^SVD as word representations consistently under-performs the W^SGNS embeddings derived from SGNS when evaluated on semantic tasks.

Symmetric SVD  We note that in the SVD-based factorization, the resulting word and context matrices have very different properties. In particular, the context matrix C^SVD is orthonormal while the word matrix W^SVD is not. On the other hand, the factorization achieved by SGNS's training procedure is much more "symmetric", in the sense that neither W^W2V nor C^W2V is orthonormal, and no particular bias is given to either of the matrices in the training objective. We therefore propose achieving similar symmetry with the following factorization:

    W^{SVD_{1/2}} = U_d \cdot \sqrt{\Sigma_d} \qquad C^{SVD_{1/2}} = V_d \cdot \sqrt{\Sigma_d}    (13)

While it is not theoretically clear why the symmetric approach is better for semantic tasks, it does work much better empirically (see footnote 3).

Footnote 3: The approach can be generalized to W^{SVD_α} = U_d · (Σ_d)^α, making α a tunable parameter. This observation was previously made by Caron [7] and investigated in [6, 28], showing that different values of α indeed perform better than others for various tasks. In particular, setting α = 0 performs well for many tasks. We do not explore tuning the α parameter in this work.
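A sketch of the spectral method: truncated SVD of the (sparse) SPPMI matrix followed by the symmetric square-root weighting of equation (13). It assumes the sppmi_matrix helper sketched earlier; scipy's svds returns singular values in ascending order, hence the reordering.

```python
import numpy as np
from scipy.sparse.linalg import svds

def svd_embeddings(sppmi, d):
    """Return W = U_d sqrt(Sigma_d) and C = V_d sqrt(Sigma_d) as in equation (13)."""
    u, s, vt = svds(sppmi, k=d)          # top-d singular triplets of the sparse matrix
    order = np.argsort(-s)               # svds yields singular values in ascending order
    u, s, vt = u[:, order], s[order], vt[order]
    W = u * np.sqrt(s)                   # scales column i by sqrt(sigma_i)
    C = vt.T * np.sqrt(s)
    # replacing np.sqrt(s) with s**alpha gives the generalization noted in footnote 3
    return W, C

# Illustrative usage (hypothetical variables):
# W, C = svd_embeddings(sppmi_matrix(counts, k=5), d=500)
```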
SVD versus SGNS  The spectral algorithm has two computational advantages over stochastic gradient training. First, it is exact, and does not require learning rates or hyper-parameter tuning. Second, it can be easily trained on count-aggregated data (i.e. {(w, c, #(w,c))} triplets), making it applicable to much larger corpora than SGNS's training procedure, which requires each observation of (w,c) to be presented separately.

On the other hand, the stochastic gradient method has advantages as well: in contrast to SVD, it distinguishes between observed and unobserved events; SVD is known to suffer from unobserved values [17], which are very common in word-context matrices. More importantly, SGNS's objective weighs different (w,c) pairs differently, preferring to assign correct values to frequent (w,c) pairs while allowing more error for infrequent pairs (see Section 3.2). Unfortunately, exact weighted SVD is a hard computational problem [25]. Finally, because SGNS cares only about observed (and sampled) (w,c) pairs, it does not require the underlying matrix to be a sparse one, enabling optimization of dense matrices, such as the exact PMI − log k matrix. The same is not feasible when using SVD.

An interesting middle-ground between SGNS and SVD is the use of stochastic matrix factorization (SMF) approaches, common in the collaborative filtering literature [17]. In contrast to SVD, the SMF approaches are not exact, and do require hyper-parameter tuning. On the other hand, they are better than SVD at handling unobserved values, and can integrate importance weighting for examples, much like SGNS's training procedure. However, like SVD and unlike SGNS's procedure, the SMF approaches work over aggregated (w,c) statistics, allowing (w, c, f(w,c)) triplets as input, making the optimization objective more direct, and scalable to significantly larger corpora. SMF approaches have additional advantages over both SGNS and SVD, such as regularization, opening the way to a range of possible improvements. We leave the exploration of SMF-based algorithms for word embeddings to future work.
5 Empirical Results

We compare the matrix-based algorithms to SGNS in two aspects. First, we measure how well each algorithm optimizes the objective, and then proceed to evaluate the methods on various linguistic tasks. We find that for some tasks there is a large discrepancy between optimizing the objective and doing well on the linguistic task.

Experimental Setup  All models were trained on English Wikipedia, pre-processed by removing non-textual elements, sentence splitting, and tokenization. The corpus contains 77.5 million sentences, spanning 1.5 billion tokens. All models were derived using a window of 2 tokens to each side of the focus word, ignoring words that appeared less than 100 times in the corpus, resulting in vocabularies of 189,533 terms for both words and contexts. To train the SGNS models, we used a modified version of word2vec which receives a sequence of pre-extracted word-context pairs [18] (see footnote 4). We experimented with three values of k (the number of negative samples in SGNS, the shift parameter in PMI-based methods): 1, 5, 15. For SVD, we take W = U_d · sqrt(Σ_d) as explained in Section 4.

Footnote 4: http://www.bitbucket.org/yoavgo/word2vecf

5.1 Optimizing the Objective

Now that we have an analytical solution for the objective, we can measure how well each algorithm optimizes this objective in practice. To do so, we calculated ℓ, the value of the objective (equation 2) given each word (and context) representation (see footnote 5). For sparse matrix representations, we substituted w⃗ · c⃗ with the matching cell's value (e.g. for SPPMI, we set w⃗ · c⃗ = max(PMI(w,c) − log k, 0)). Each algorithm's ℓ value was compared to ℓ_Opt, the objective when setting w⃗ · c⃗ = PMI(w,c) − log k, which was shown to be optimal (Section 3.1). The percentage of deviation from the optimum is defined by (ℓ − ℓ_Opt)/ℓ_Opt and presented in Table 1.

Footnote 5: Since it is computationally expensive to calculate the exact objective, we approximated it. First, instead of enumerating every observed word-context pair in the corpus, we sampled 10 million such pairs, according to their prevalence. Second, instead of calculating the expectation term explicitly (as in equation 4), we sampled a negative example (w, c_N) for each one of the 10 million "positive" examples, using the contexts' unigram distribution, as done by SGNS's optimization procedure (explained in Section 2).

Table 1: Percentage of deviation from the optimal objective value (lower values are better). See 5.1 for details.

Method    PMI − log k   SPPMI      SVD (d=100)  SVD (d=500)  SVD (d=1000)  SGNS (d=100)  SGNS (d=500)  SGNS (d=1000)
k = 1     0%            0.00009%   26.1%        25.2%        24.2%         31.4%         29.4%         7.40%
k = 5     0%            0.00004%   95.8%        95.1%        94.9%         39.3%         36.0%         7.13%
k = 15    0%            0.00002%   266%         266%         265%          7.80%         6.37%         5.97%

We observe that SPPMI is indeed a near-perfect approximation of the optimal solution, even though it discards a lot of information when considering only positive cells. We also note that for the factorization methods, increasing the dimensionality enables better solutions, as expected. SVD is slightly better than SGNS at optimizing the objective for d ≤ 500 and k = 1. However, while SGNS is able to leverage higher dimensions and reduce its error significantly, SVD fails to do so. Furthermore, SVD becomes very erroneous as k increases. We hypothesize that this is a result of the increasing number of zero-cells, which may cause SVD to prefer a factorization that is very close to the zero matrix, since SVD's L2 objective is unweighted, and does not distinguish between observed and unobserved matrix cells.

5.2 Performance of Word Representations on Linguistic Tasks

Linguistic Tasks and Datasets  We evaluated the word representations on four datasets, covering word similarity and relational analogy tasks. We used two datasets to evaluate pairwise word similarity: Finkelstein et al.'s WordSim353 [13] and Bruni et al.'s MEN [4]. These datasets contain word pairs together with human-assigned similarity scores. The word vectors are evaluated by ranking the pairs according to their cosine similarities, and measuring the correlation (Spearman's) with the human ratings.

The two analogy datasets present questions of the form "a is to a* as b is to b*", where b* is hidden, and must be guessed from the entire vocabulary. The Syntactic dataset [22] contains 8000 morpho-syntactic analogy questions, such as "good is to best as smart is to smartest". The Mixed dataset [20] contains 19544 questions, about half of the same kind as in Syntactic, and another half of a more semantic nature, such as capital cities ("Paris is to France as Tokyo is to Japan"). After filtering questions involving out-of-vocabulary words, i.e. words that appeared in English Wikipedia less than 100 times, we remain with 7118 instances in Syntactic and 19258 instances in Mixed. The analogy questions are answered using Levy and Goldberg's similarity multiplication method [19], which is state-of-the-art in analogy recovery:

    \arg\max_{b^* \in V_W \setminus \{a, a^*, b\}} \frac{\cos(b^*, a^*) \cdot \cos(b^*, b)}{\cos(b^*, a) + \varepsilon}

The evaluation metric for the analogy questions is the percentage of questions for which the argmax result was the correct answer (b*).
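For concreteness, here is a sketch of the similarity-multiplication scoring used above to answer the analogy questions. Rows of the word matrix are assumed to be L2-normalized so that dot products are cosines; details such as the value of ε and any rescaling of the cosines are assumptions of this sketch rather than the exact setup of [19].

```python
import numpy as np

def answer_analogy(W_norm, a, a_star, b, eps=1e-3):
    """'a is to a_star as b is to ?': return the index b* maximizing
    cos(b*, a_star) * cos(b*, b) / (cos(b*, a) + eps), excluding the question words.
    W_norm: |V_W| x d matrix with L2-normalized rows; a, a_star, b: row indices."""
    cos_a = W_norm @ W_norm[a]            # cosine with a, for every candidate b*
    cos_a_star = W_norm @ W_norm[a_star]
    cos_b = W_norm @ W_norm[b]
    scores = cos_a_star * cos_b / (cos_a + eps)
    scores[[a, a_star, b]] = -np.inf      # the question words are excluded from the argmax
    return int(np.argmax(scores))
```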
Results  Table 2 shows the experiments' results.

Table 2: A comparison of word representations on various linguistic tasks. The different representations were created by three algorithms (SPPMI, SVD, SGNS) with d = 1000 and different values of k.

WS353 (WordSim) [13]        MEN (WordSim) [4]           Mixed Analogies [20]        Synt. Analogies [22]
Representation    Corr.     Representation    Corr.     Representation    Acc.      Representation    Acc.
SVD (k=5)         0.691     SVD (k=1)         0.735     SPPMI (k=1)       0.655     SGNS (k=15)       0.627
SPPMI (k=15)      0.687     SVD (k=5)         0.734     SPPMI (k=5)       0.644     SGNS (k=5)        0.619
SPPMI (k=5)       0.670     SPPMI (k=5)       0.721     SGNS (k=15)       0.619     SGNS (k=1)        0.59
SGNS (k=15)       0.666     SPPMI (k=15)      0.719     SGNS (k=5)        0.616     SPPMI (k=5)       0.466
SVD (k=15)        0.661     SGNS (k=15)       0.716     SPPMI (k=15)      0.571     SVD (k=1)         0.448
SVD (k=1)         0.652     SGNS (k=5)        0.708     SVD (k=1)         0.567     SPPMI (k=1)       0.445
SGNS (k=5)        0.644     SVD (k=15)        0.694     SGNS (k=1)        0.540     SPPMI (k=15)      0.353
SGNS (k=1)        0.633     SGNS (k=1)        0.690     SVD (k=5)         0.472     SVD (k=5)         0.337
SPPMI (k=1)       0.605     SPPMI (k=1)       0.688     SVD (k=15)        0.341     SVD (k=15)        0.208

On the word similarity task, SPPMI yields better results than SGNS, and SVD improves even more. However, the difference between the top PMI-based method and the top SGNS configuration in each dataset is small, and it is reasonable to say that they perform on-par. It is also evident that different values of k have a significant effect on all methods: SGNS generally works better with higher values of k, whereas SPPMI and SVD prefer lower values of k. This may be due to the fact that only positive values are retained, and high values of k may cause too much loss of information. A similar observation was made for SGNS and SVD when observing how well they optimized the objective (Section 5.1). Nevertheless, tuning k can significantly increase the performance of SPPMI over the traditional PPMI configuration (k = 1).

The analogies task shows different behavior. First, SVD does not perform as well as SGNS and SPPMI. More interestingly, in the syntactic analogies dataset, SGNS significantly outperforms the rest. This trend is even more pronounced when using the additive analogy recovery method [22] (not shown). Linguistically speaking, the syntactic analogies dataset is quite different from the rest, since it relies more on contextual information from common words such as determiners ("the", "each", "many") and auxiliary verbs ("will", "had") to solve correctly. We conjecture that SGNS performs better on this task because its training procedure gives more influence to frequent pairs, as opposed to SVD's objective, which gives the same weight to all matrix cells (see Section 3.2).
6 Conclusion

We analyzed the SGNS word embedding algorithm, and showed that it is implicitly factorizing the (shifted) word-context PMI matrix M^PMI − log k using per-observation stochastic gradient updates. We presented SPPMI, a modification of PPMI inspired by our theoretical findings. Indeed, using SPPMI can improve upon the traditional PPMI matrix. Though SPPMI provides a far better solution to SGNS's objective, it does not necessarily perform better than SGNS on linguistic tasks, as evident with syntactic analogies. We suspect that this may be related to SGNS down-weighting rare words, which PMI-based methods are known to exaggerate.

We also experimented with an alternative matrix factorization method, SVD. Although SVD was relatively poor at optimizing SGNS's objective, it performed slightly better than the other methods on word similarity datasets. However, SVD underperforms on the word-analogy task. One of the main differences between SVD and SGNS is that SGNS performs weighted matrix factorization, which may be giving it an edge in the analogy task. As future work we suggest investigating weighted matrix factorizations of word-context matrices with PMI-based association metrics.

Acknowledgements

This work was partially supported by the EC-funded project EXCITEMENT (FP7 ICT-287923). We thank Ido Dagan and Peter Turney for their valuable insights.

References

[1] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL, 2014.
[2] Marco Baroni and Alessandro Lenci. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721, 2010.
[3] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
[4] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. Distributional semantics in technicolor. In ACL, 2012.
[5] John A. Bullinaria and Joseph P. Levy. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526, 2007.
[6] John A. Bullinaria and Joseph P. Levy. Extracting semantic representations from word co-occurrence statistics: Stop-lists, stemming, and SVD. Behavior Research Methods, 44(3):890–907, 2012.
[7] John Caron. Experiments with LSA scoring: Optimal rank and basis. In Proceedings of the SIAM Computational Information Retrieval Workshop, pages 157–169, 2001.
[8] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.
[9] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.
[10] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 2011.
[11] Ido Dagan, Fernando Pereira, and Lillian Lee. Similarity-based estimation of word cooccurrence probabilities. In ACL, 1994.
[12] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211–218, 1936.
[13] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept revisited. ACM TOIS, 2002.
[14] Yoav Goldberg and Omer Levy. word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.
[15] Zellig Harris. Distributional structure. Word, 10(23):146–162, 1954.
[16] Douwe Kiela and Stephen Clark. A systematic study of semantic vector space model parameters. In Workshop on Continuous Vector Space Models and their Compositionality, 2014.
[17] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 2009.
[18] Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In ACL, 2014.
[19] Omer Levy and Yoav Goldberg. Linguistic regularities in sparse and explicit word representations. In CoNLL, 2014.
[20] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[21] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[22] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In NAACL, 2013.
[23] Andriy Mnih and Geoffrey E. Hinton. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, pages 1081–1088, 2008.
[24] Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In NIPS, 2013.
[25] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In ICML, 2003.
[26] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: A simple and general method for semi-supervised learning. In ACL, 2010.
[27] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In ECML, 2001.
[28] Peter D. Turney. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, 44:533–585, 2012.
[29] Peter D. Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 2010.