Journal of Machine Learning Research 12 (2011) 2493-2537    Submitted 1/10; Revised 11/10; Published 8/11

Natural Language Processing (Almost) from Scratch

Ronan Collobert  RONAN@COLLOBERT.COM
Jason Weston†  JWESTON@GOOGLE.COM
Léon Bottou‡  LEON@BOTTOU.ORG
Michael Karlen  MICHAEL.KARLEN@GMAIL.COM
Koray Kavukcuoglu§  KORAY@CS.NYU.EDU
Pavel Kuksa¶  PKUKSA@CS.RUTGERS.EDU

NEC Laboratories America, 4 Independence Way, Princeton, NJ 08540

Editor: Michael Collins

Abstract
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
Keywords: natural language processing, neural networks

1. Introduction
Will a computer program ever be able to convert a piece of English text into a programmer friendly data structure that describes the meaning of the natural language text? Unfortunately, no consensus has emerged about the form or the existence of such a data structure. Until such fundamental Artificial Intelligence problems are resolved, computer scientists must settle for the reduced objective of extracting simpler representations that describe limited aspects of the textual information.
These simpler representations are often motivated by specific applications (for instance, bag-of-words variants for information retrieval), or by our belief that they capture something more general about natural language. They can describe syntactic information (e.g., part-of-speech tagging, chunking, and parsing) or semantic information (e.g., word-sense disambiguation, semantic role labeling, named entity extraction, and anaphora resolution). Text corpora have been manually annotated with such data structures in order to compare the performance of various systems. The availability of standard benchmarks has stimulated research in Natural Language Processing (NLP)

∗. Ronan Collobert is now with the Idiap Research Institute, Switzerland.
†. Jason Weston is now with Google, New York, NY.
‡. Léon Bottou is now with Microsoft, Redmond, WA.
§. Koray Kavukcuoglu is also with New York University, New York, NY.
¶. Pavel Kuksa is also with Rutgers University, New Brunswick, NJ.

©2011 Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu and Pavel Kuksa.
COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSAandeffectivesystemshavebeendesignedforallthesetasks.Suchsystemsareoftenviewedassoftwarecomponentsforconstructingreal-worldNLPsolutions.Theoverwhelmingmajorityofthesestate-of-the-artsystemsaddresstheirsinglebenchmarktaskbyapplyinglinearstatisticalmodelstoad-hocfeatures.Inotherwords,theresearchersthem-selvesdiscoverintermediaterepresentationsbyengineeringtask-specicfeatures.Thesefeaturesareoftenderivedfromtheoutputofpreexistingsystems,leadingtocomplexruntimedependencies.Thisapproachiseffectivebecauseresearchersleveragealargebodyoflinguisticknowledge.Ontheotherhand,thereisagreattemptationtooptimizetheperformanceofasystemforaspecicbenchmark.Althoughsuchperformanceimprovementscanbeveryusefulinpractice,theyteachuslittleaboutthemeanstoprogresstowardthebroadergoalsofnaturallanguageunderstandingandtheelusivegoalsofArticialIntelligence.Inthiscontribution,wetrytoexcelonmultiplebenchmarkswhileavoidingtask-specicengi-neering.Insteadweuseasinglelearningsystemabletodiscoveradequateinternalrepresentations.Infactweviewthebenchmarksasindirectmeasurementsoftherelevanceoftheinternalrepresen-tationsdiscoveredbythelearningprocedure,andwepositthattheseintermediaterepresentationsaremoregeneralthananyofthebenchmarks.Ourdesiretoavoidtask-specicengineeredfeaturespreventedusfromusingalargebodyoflinguisticknowledge.Insteadwereachgoodperformancelevelsinmostofthetasksbytransferringintermediaterepresentationsdiscoveredonlargeunlabeleddatasets.Wecallthisapproach“almostfromscratch”toemphasizethereduced(butstillimportant)relianceonaprioriNLPknowledge.Thepaperisorganizedasfollows.Section2describesthebenchmarktasksofinterest.Sec-tion3describestheuniedmodelandreportsbenchmarkresultsobtainedwithsupervisedtraining.Section4leverageslargeunlabeleddatasets(852millionwords)totrainthemodelonalanguagemodelingtask.Performanceimprovementsarethendemonstratedbytransferringtheunsupervisedinternalrepresentationsintothesupervisedbenchmarkmodels.Section5investigatesmultitasksupervisedtraining.Section6thenevaluateshowmuchfurtherimprovementcanbeachievedbyincorporatingstandardNLPtask-specicengineeringintooursystems.Driftingawayfromourini-tialgoalsgivesustheopportunitytoconstructanall-purposetaggerthatissimultaneouslyaccurate,practical,andfast.Wethenconcludewithashortdiscussionsection.2.TheBenchmarkTasksInthissection,webrieyintroducefourstandardNLPtasksonwhichwewillbenchmarkourarchitectureswithinthispaper:Part-Of-Speechtagging(POS),chunking(CHUNK),NamedEntityRecognition(NER)andSemanticRoleLabeling(SRL).Foreachofthem,weconsiderastandardexperimentalsetupandgiveanoverviewofstate-of-the-artsystemsonthissetup.TheexperimentalsetupsaresummarizedinTable1,whilestate-of-the-artsystemsarereportedinTable2.2.1Part-Of-SpeechTaggingPOSaimsatlabelingeachwordwithauniquetagthatindicatesitssyntacticrole,forexample,pluralnoun,adverb,...AstandardbenchmarksetupisdescribedindetailbyToutanovaetal.(2003).Sections0–18ofWallStreetJournal(WSJ)dataareusedfortraining,whilesections19–21areforvalidationandsections22–24fortesting.ThebestPOSclassiersarebasedonclassierstrainedonwindowsoftext,whicharethenfedtoabidirectionaldecodingalgorithmduringinference.Featuresincludeprecedingandfollowing2494 NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHTaskBenchmarkDatasetTrainingsetTestset(#tokens)(#tokens)(#tags) POSToutanovaetal.(2003)WSJsections0–18sections22–24(45)(912,344)(129,654) ChunkingCoNLL2000WSJsections15–18section20(42)(211,727)(47,377)(IOBES) NERCoNLL2003Reuters“eng.train”“eng.testb”(17)(203,621)(46,435)(IOBES) 
SRLCoNLL2005WSJsections2–21section23(186)(950,028)+3Brownsections(IOBES)(63,843)Table1:Experimentalsetup:foreachtask,wereportthestandardbenchmarkweused,thedatasetitrelatesto,aswellastrainingandtestinformation.SystemAccuracy Shenetal.(2007)97.33%Toutanovaetal.(2003)97.24%Gim´enezandMarquez(2004)97.16%(a)POSSystemF1 ShenandSarkar(2005)95.23%ShaandPereira(2003)94.29%KudoandMatsumoto(2001)93.91%(b)CHUNKSystemF1 AndoandZhang(2005)89.31%Florianetal.(2003)88.76%KudoandMatsumoto(2001)88.31%(c)NERSystemF1 Koomenetal.(2005)77.92%Pradhanetal.(2005)77.30%Haghighietal.(2005)77.04%(d)SRLTable2:State-of-the-artsystemsonfourNLPtasks.Performanceisreportedinper-wordaccuracyforPOS,andF1scoreforCHUNK,NERandSRL.Systemsinboldwillbereferredasbenchmarksystemsintherestofthepaper(seeSection2.6).tagcontextaswellasmultiplewords(bigrams,trigrams...)context,andhandcraftedfeaturestodealwithunknownwords.Toutanovaetal.(2003),whousemaximumentropyclassiersandinferenceinabidirectionaldependencynetwork(Heckermanetal.,2001),reach97:24%per-wordaccuracy.Gim´enezandMarquez(2004)proposedaSVMapproachalsotrainedontextwindowswithbidirectionalinferenceachievedwithtwoViterbidecoders(left-to-rightandright-to-left).Theyobtained97:16%per-wordaccuracy.Morerecently,Shenetal.(2007)pushedthestate-of-the-artupto97:33%,withanewlearningalgorithmtheycallguidedlearning,alsoforbidirectionalsequenceclassication.2495 COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSA2.2ChunkingAlsocalledshallowparsing,chunkingaimsatlabelingsegmentsofasentencewithsyntacticcon-stituentssuchasnounorverbphrases(NPorVP).Eachwordisassignedonlyoneuniquetag,oftenencodedasabegin-chunk(e.g.,B-NP)orinside-chunktag(e.g.,I-NP).ChunkingisoftenevaluatedusingtheCoNLL2000sharedtask.1Sections15–18ofWSJdataareusedfortrainingandsection20fortesting.Validationisachievedbysplittingthetrainingset.KudohandMatsumoto(2000)wontheCoNLL2000challengeonchunkingwithaF1-scoreof93:48%.TheirsystemwasbasedonSupportVectorMachines(SVMs).EachSVMwastrainedinapairwiseclassicationmanner,andfedwithawindowaroundthewordofinterestcontainingPOSandwordsasfeatures,aswellassurroundingtags.Theyperformdynamicprogrammingattesttime.Later,theyimprovedtheirresultsupto93:91%(KudoandMatsumoto,2001)usinganensembleofclassierstrainedwithdifferenttaggingconventions(seeSection3.3.3).Sincethen,acertainnumberofsystemsbasedonsecond-orderrandomeldswerereported(ShaandPereira,2003;McDonaldetal.,2005;Sunetal.,2008),allreportingaround94:3%F1score.Thesesystemsusefeaturescomposedofwords,POStags,andtags.Morerecently,ShenandSarkar(2005)obtained95:23%usingavotingclassierscheme,whereeachclassieristrainedondifferenttagrepresentations2(IOB,IOE,...).TheyusePOSfeaturescomingfromanexternaltagger,aswellcarefullyhand-craftedspecializationfeatureswhichagainchangethedatarepresentationbyconcatenatingsome(carefullychosen)chunktagsorsomewordswiththeirPOSrepresentation.Theythenbuildtrigramsoverthesefeatures,whicharenallypassedthroughaViterbidecoderatesttime.2.3NamedEntityRecognitionNERlabelsatomicelementsinthesentenceintocategoriessuchas“PERSON”or“LOCATION”.Asinthechunkingtask,eachwordisassignedatagprexedbyanindicatorofthebeginningortheinsideofanentity.TheCoNLL2003setup3isaNERbenchmarkdatasetbasedonReutersdata.Thecontestprovidestraining,validationandtestingsets.Florianetal.(2003)presentedthebestsystemattheNERCoNLL2003challenge,with88:76%F1score.Theyusedacombinationofvariousmachine-learningclassiers.Featurestheypickedincludedwords,POStags,CHUNKtags,prexesandsufxes,alargegazetteer(notprovidedbythechallenge),aswellastheoutputoftwootherNERclassierstrainedonrich
erdatasets.Chieu(2003),thesecondbestperformerofCoNLL2003(88:31%F1),alsousedanexternalgazetteer(theirperformancegoesdownto86:84%withnogazetteer)andseveralhand-chosenfeatures.Later,AndoandZhang(2005)reached89:31%F1withasemi-supervisedapproach.TheytrainedjointlyalinearmodelonNERwithalinearmodelontwoauxiliaryunsupervisedtasks.TheyalsoperformedViterbidecodingattesttime.Theunlabeledcorpuswas27MwordstakenfromReuters.Featuresincludedwords,POStags,sufxesandprexesorCHUNKtags,butoverallwerelessspecializedthanCoNLL2003challengers. 1.Seehttp://www.cnts.ua.ac.be/conll2000/chunking.2.SeeTable3fortaggingschemedetails.3.Seehttp://www.cnts.ua.ac.be/conll2003/ner.2496 NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCH2.4SemanticRoleLabelingSRLaimsatgivingasemanticroletoasyntacticconstituentofasentence.InthePropBank(Palmeretal.,2005)formalismoneassignsrolesARG0-5towordsthatareargumentsofaverb(ormoretechnically,apredicate)inthesentence,forexample,thefollowingsentencemightbetagged“[John]ARG0[ate]REL[theapple]ARG1”,where“ate”isthepredicate.Thepreciseargumentsdependonaverb'sframeandiftherearemultipleverbsinasentencesomewordsmighthavemulti-pletags.InadditiontotheARG0-5tags,therethereareseveralmodiertagssuchasARGM-LOC(locational)andARGM-TMP(temporal)thatoperateinasimilarwayforallverbs.WepickedCoNLL20054asourSRLbenchmark.Ittakessections2–21ofWSJdataastrainingset,andsec-tion24asvalidationset.Atestsetcomposedofsection23ofWSJconcatenatedwith3sectionsfromtheBrowncorpusisalsoprovidedbythechallenge.State-of-the-artSRLsystemsconsistofseveralstages:producingaparsetree,identifyingwhichparsetreenodesrepresenttheargumentsofagivenverb,andnallyclassifyingthesenodestocomputethecorrespondingSRLtags.Thisentailsextractingnumerousbasefeaturesfromtheparsetreeandfeedingthemintostatisticalmodels.Featurecategoriescommonlyusedbythesesysteminclude(GildeaandJurafsky,2002;Pradhanetal.,2004):thepartsofspeechandsyntacticlabelsofwordsandnodesinthetree;thenode'sposition(leftorright)inrelationtotheverb;thesyntacticpathtotheverbintheparsetree;whetheranodeintheparsetreeispartofanounorverbphrase;thevoiceofthesentence:activeorpassive;thenode'sheadword;andtheverbsub-categorization.Pradhanetal.(2004)takethesebasefeaturesanddeneadditionalfeatures,notablythepart-of-speechtagoftheheadword,thepredictednamedentityclassoftheargument,featuresprovidingwordsensedisambiguationfortheverb(theyadd25variantsof12newfeaturetypesoverall).Thissystemisclosetothestate-of-the-artinperformance.Pradhanetal.(2005)obtain77:30%F1withasystembasedonSVMclassiersandsimultaneouslyusingthetwoparsetreesprovidedfortheSRLtask.Inthesamespirit,Haghighietal.(2005)uselog-linearmodelsoneachtreenode,re-rankedgloballywithadynamicalgorithm.Theirsystemreaches77:04%usingthevetopCharniakparsetrees.Koomenetal.(2005)holdthestate-of-the-artwithWinnow-like(Littlestone,1988)classiers,followedbyadecodingstagebasedonanintegerprogramthatenforcesspecicconstraintsonSRLtags.Theyreach77:92%F1onCoNLL2005,thankstothevetopparsetreesproducedbytheCharniak(2000)parser(onlytherstonewasprovidedbythecontest)aswellastheCollins(1999)parsetree. 
4.Seehttp://www.lsi.upc.edu/˜srlconll.2497 COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSA2.5EvaluationInourexperiments,westrictlyfollowedthestandardevaluationprocedureofeachCoNLLchal-lengesforNER,CHUNKandSRL.Inparticular,wechosethehyper-parametersofourmodelaccordingtoasimplevalidationprocedure(seeRemark8laterinSection3.5),performedoverthevalidationsetavailableforeachtask(seeSection2).Allthesethreetasksareevaluatedbycomput-ingtheF1scoresoverchunksproducedbyourmodels.ThePOStaskisevaluatedbycomputingtheper-wordaccuracy,asitisthecaseforthestandardbenchmarkwereferto(Toutanovaetal.,2003).Weusedtheconllevalscript5forevaluatingPOS,6NERandCHUNK.ForSRL,weusedthesrl-eval.plscriptincludedinthesrlconllpackage.72.6DiscussionWhenparticipatinginan(open)challenge,itislegitimatetoincreasegeneralizationbyallmeans.ItisthusnotsurprisingtoseemanytopCoNLLsystemsusingexternallabeleddata,likeadditionalNERclassiersfortheNERarchitectureofFlorianetal.(2003)oradditionalparsetreesforSRLsystems(Koomenetal.,2005).Combiningmultiplesystemsortweakingcarefullyfeaturesisalsoacommonapproach,likeinthechunkingtopsystem(ShenandSarkar,2005).However,whencomparingsystems,wedonotlearnanythingofthequalityofeachsystemiftheyweretrainedwithdifferentlabeleddata.Forthatreason,wewillrefertobenchmarksystemsthatis,topexistingsystemswhichavoidusageofexternaldataandhavebeenwell-establishedintheNLPeld:Toutanovaetal.(2003)forPOSandShaandPereira(2003)forchunking.ForNERweconsiderAndoandZhang(2005)astheywereusingadditionalunlabeleddataonly.WepickedKoomenetal.(2005)forSRL,keepinginmindtheyuse4additionalparsetreesnotprovidedbythechallenge.Thesebenchmarksystemswillserveasbaselinereferencesinourexperiments.WemarkedtheminboldinTable2.Wenotethatforthefourtasksweareconsideringinthiswork,itcanbeseenthatforthemorecomplextasks(withcorrespondingloweraccuracies),thebestsystemsproposedhavemoreengineeredfeaturesrelativetothebestsystemsonthesimplertasks.Thatis,thePOStaskisoneofthesimplestofourfourtasks,andonlyhasrelativelyfewengineeredfeatures,whereasSRListhemostcomplex,andmanykindsoffeatureshavebeendesignedforit.ThisclearlyhasimplicationsforasyetunsolvedNLPtasksrequiringmoresophisticatedsemanticunderstandingthantheonesconsideredhere.3.TheNetworksAlltheNLPtasksabovecanbeseenastasksassigninglabelstowords.ThetraditionalNLPap-proachis:extractfromthesentencearichsetofhand-designedfeatureswhicharethenfedtoastandardclassicationalgorithm,forexample,aSupportVectorMachine(SVM),oftenwithalin-earkernel.Thechoiceoffeaturesisacompletelyempiricalprocess,mainlybasedrstonlinguisticintuition,andthentrialanderror,andthefeatureselectionistaskdependent,implyingadditionalresearchforeachnewNLPtask.ComplextaskslikeSRLthenrequirealargenumberofpossibly 5.Availableathttp://www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt.6.Weusedthe“-r”optionoftheconllevalscripttogettheper-wordaccuracy,forPOSonly.7.Availableathttp://www.lsi.upc.es/˜srlconll/srlconll-1.1.tgz.2498 NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCH InputWindow LookupTable Linear HardTanh Linear ...... 
xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx .LTWK xxxx xxxx xxxx xxxx xxxx M1 M2 wordofinterest d concat xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx n1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxx n2=#tags Figure1:Windowapproachnetwork.complexfeatures(e.g.,extractedfromaparsetree)whichcanimpactthecomputationalcostwhichmightbeimportantforlarge-scaleapplicationsorapplicationsrequiringreal-timeresponse.Instead,weadvocatearadicallydifferentapproach:asinputwewilltrytopre-processourfeaturesaslittleaspossibleandthenuseamultilayerneuralnetwork(NN)architecture,trainedinanend-to-endfashion.Thearchitecturetakestheinputsentenceandlearnsseverallayersoffeatureextractionthatprocesstheinputs.Thefeaturescomputedbythedeeplayersofthenetworkareautomaticallytrainedbybackpropagationtoberelevanttothetask.WedescribeinthissectionageneralmultilayerarchitecturesuitableforallourNLPtasks,whichisgeneralizabletootherNLPtasksaswell.OurarchitectureissummarizedinFigure1andFigure2.Therstlayerextractsfeaturesforeachword.Thesecondlayerextractsfeaturesfromawindowofwordsorfromthewholesentence,treatingitasasequencewithlocalandglobalstructure(i.e.,itisnottreatedlikeabagofwords).ThefollowinglayersarestandardNNlayers.3.1NotationsWeconsideraneuralnetworkq(),withparametersq.Anyfeed-forwardneuralnetworkwithLlayers,canbeseenasacompositionoffunctionslq(),correspondingtoeachlayerq()=Lq(L1q(:::1q():::)):2499 COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSA InputSentence LookupTable Convolution MaxOverTime Linear HardTanh Linear ...... xxxxxxxxxxxxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx .LTWK xxxxxxxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx xxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxx max()M2 M3 d PaddingPadding n1 M1 xxxxxxxxxxxxxxxxxxxx n1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx n2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxx n3=#tags Figure2:Sentenceapproachnetwork.Inthefollowing,wewilldescribeeachlayerweuseinournetworksshowninFigure1andFigure2.Weadoptfewnotations.GivenamatrixAwedenote[A]i;jthecoefcientatrowandcolumninthematrix.WealsodenotehAidwinithevectorobtainedbyconcatenatingthedwincolumnvectorsaroundthethcolumnvectorofmatrixA2Rd1d2hhAidwiniiT=[A]1;idwin=2:::[A]d1;idwin=2;:::;[A]1;i+dwin=2:::[A]d1;i+dwin=2:2500 
NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHAsaspecialcase,hAi1irepresentsthethcolumnofmatrixA.Foravectorv,wedenote[v]ithescalaratindexinthevector.Finally,asequenceofelementfx1;x2;:::;xTgiswritten[x]T1.Thethelementofthesequenceis[x]i3.2TransformingWordsintoFeatureVectorsOneofthekeypointsofourarchitectureisitsabilitytoperformwellwiththeuseof(almost8)rawwords.Theabilityforourmethodtolearngoodwordrepresentationsisthuscrucialtoourapproach.Forefciency,wordsarefedtoourarchitectureasindicestakenfromanitedictionaryD.Obviously,asimpleindexdoesnotcarrymuchusefulinformationabouttheword.However,therstlayerofournetworkmapseachofthesewordindicesintoafeaturevector,byalookuptableoperation.Givenataskofinterest,arelevantrepresentationofeachwordisthengivenbythecorrespondinglookuptablefeaturevector,whichistrainedbybackpropagation,startingfromarandominitialization.9WewillseeinSection4thatwecanlearnverygoodwordrepresenta-tionsfromunlabeledcorpora.Ourarchitectureallowustotakeadvantageofbettertrainedwordrepresentations,bysimplyinitializingthewordlookuptablewiththeserepresentations(insteadofrandomly).Moreformally,foreachwordw2D,aninternaldwrd-dimensionalfeaturevectorrepresentationisgivenbythelookuptablelayerLTW()LTW(w)=hWi1w;whereW2RdwrdjDjisamatrixofparameterstobelearned,hWi1w2RdwrdisthewthcolumnofWanddwrdisthewordvectorsize(ahyper-parametertobechosenbytheuser).GivenasentenceoranysequenceofTwords[w]T1inD,thelookuptablelayerappliesthesameoperationforeachwordinthesequence,producingthefollowingoutputmatrix:LTW([w]T1)=hWi1[w]1hWi1[w]2:::hWi1[w]T:(1)Thismatrixcanthenbefedtofurtherneuralnetworklayers,aswewillseebelow.3.2.1EXTENDINGTOANYDISCRETEFEATURESOnemightwanttoprovidefeaturesotherthanwordsifonesuspectsthatthesefeaturesarehelpfulforthetaskofinterest.Forexample,fortheNERtask,onecouldprovideafeaturewhichsaysifawordisinagazetteerornot.Anothercommonpracticeistointroducesomebasicpre-processing,suchasword-stemmingordealingwithupperandlowercase.Inthislatteroption,thewordwouldbethenrepresentedbythreediscretefeatures:itslowercasestemmedroot,itslowercaseending,andacapitalizationfeature.Generallyspeaking,wecanconsiderawordasrepresentedbyKdiscretefeaturesw2D1DK,whereDkisthedictionaryforthekthfeature.WeassociatetoeachfeaturealookuptableLTWk(),withparametersWk2RdkwrdjDkjwheredkwrd2Nisauser-speciedvectorsize.Givena 8.Wedidsomepre-processing,namelylowercasingandencodingcapitalizationasanotherfeature.Withenough(un-labeled)trainingdata,presumablywecouldlearnamodelwithoutthisprocessing.Ideally,anevenmorerawinputwouldbetolearnfromlettersequencesratherthanwords,howeverwefeltthatthiswasbeyondthescopeofthiswork.9.Asanyotherneuralnetworklayer.2501 
COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSAwordw,afeaturevectorofdimensiondwrd=åkdkwrdisthenobtainedbyconcatenatingalllookuptableoutputs:LTW1;:::;W(w)=0B@LTW1(w1)LTW(wK)1CA=0B@hW1i1w1hWKi1w1CA:Thematrixoutputofthelookuptablelayerforasequenceofwords[w]T1isthensimilarto(1),butwhereextrarowshavebeenaddedforeachdiscretefeature:LTW1;:::;W([w]T1)=0B@hW1i1[w1]1:::hW1i1[w1]ThWKi1[w]1:::hWKi1[w]T1CA:(2)Thesevectorfeaturesinthelookuptableeffectivelylearnfeaturesforwordsinthedictionary.Now,wewanttousethesetrainablefeaturesasinputtofurtherlayersoftrainablefeatureextractors,thatcanrepresentgroupsofwordsandthennallysentences.3.3ExtractingHigherLevelFeaturesfromWordFeatureVectorsFeaturevectorsproducedbythelookuptablelayerneedtobecombinedinsubsequentlayersoftheneuralnetworktoproduceatagdecisionforeachwordinthesentence.Producingtagsforeachelementinvariablelengthsequences(here,asentenceisasequenceofwords)isastandardprobleminmachine-learning.Weconsidertwocommonapproacheswhichtagonewordatthetime:awindowapproach,anda(convolutional)sentenceapproach.3.3.1WINDOWAPPROACHAwindowapproachassumesthetagofaworddependsmainlyonitsneighboringwords.Givenawordtotag,weconsideraxedsizeksz(ahyper-parameter)windowofwordsaroundthisword.Eachwordinthewindowisrstpassedthroughthelookuptablelayer(1)or(2),producingamatrixofwordfeaturesofxedsizedwrdksz.Thismatrixcanbeviewedasadwrdksz-dimensionalvectorbyconcatenatingeachcolumnvector,whichcanbefedtofurtherneuralnetworklayers.Moreformally,thewordfeaturewindowgivenbytherstnetworklayercanbewrittenas:1q=hLTW([w]T1)idwint=0BBBBBBB@hWi1[w]tdwin=2hWi1[w]thWi1[w]t+dwin=21CCCCCCCA:(3)LinearLayer.Thexedsizevector1qcanbefedtooneorseveralstandardneuralnetworklayerswhichperformafnetransformationsovertheirinputs:lq=Wll1q+bl;(4)whereWl2Rnlhunl1huandbl2Rnlhuaretheparameterstobetrained.Thehyper-parameternlhuisusuallycalledthenumberofhiddenunitsofthethlayer.2502 
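For illustration, the lookup table of Eq. (1), the window concatenation of Eq. (3) and one affine layer of Eq. (4) can be sketched in a few lines of NumPy. This sketch is ours, not the authors' implementation; the dictionary size and the initialization scale are arbitrary, and the window and layer sizes are merely plausible values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 50-dim word features, a 5-word window, 300 hidden units, toy dictionary.
vocab_size, d_wrd, d_win, n_hu = 1000, 50, 5, 300

# Lookup table W in R^{d_wrd x |D|}: column w holds the feature vector of word index w (Eq. 1).
W = rng.normal(scale=0.1, size=(d_wrd, vocab_size))

def lookup(word_indices):
    """LT_W([w]_1^T): select the columns of W given by the word indices."""
    return W[:, word_indices]                      # shape (d_wrd, T)

def window_vector(word_indices, t):
    """<LT_W([w]_1^T)>_t^{d_win}: concatenate the d_win column vectors centred on word t (Eq. 3).
    Border effects are ignored here; the paper pads the sentence with a special PADDING word."""
    half = d_win // 2
    cols = lookup(word_indices)[:, t - half: t + half + 1]
    return cols.reshape(-1, order="F")             # shape (d_wrd * d_win,)

# One affine layer f^l = M^l f^{l-1} + b^l (Eq. 4), with a small random initialization.
M1 = rng.normal(scale=0.01, size=(n_hu, d_wrd * d_win))
b1 = np.zeros(n_hu)

sentence = rng.integers(0, vocab_size, size=12)    # a toy sentence of word indices
x = window_vector(sentence, t=6)
h = M1 @ x + b1                                    # hidden representation for the word at position 6
print(h.shape)                                     # (300,)
```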
NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHHardTanhLayer.Severallinearlayersareoftenstacked,interleavedwithanon-linearityfunc-tion,toextracthighlynon-linearfeatures.Ifnonon-linearityisintroduced,ournetworkwouldbeasimplelinearmodel.Wechosea“hard”versionofthehyperbolictangentasnon-linearity.Ithastheadvantageofbeingslightlycheapertocompute(comparedtotheexacthyperbolictangent),whileleavingthegeneralizationperformanceunchanged(Collobert,2004).ThecorrespondinglayerappliesaHardTanhoveritsinputvector:hlqii=HardTanh(hl1qii);whereHardTanh(x)=8:1ifx1xif1=x=11ifx&#x-0.9;畵1:(5)Scoring.Finally,theoutputsizeofthelastlayerLofournetworkisequaltothenumberofpossibletagsforthetaskofinterest.Eachoutputcanbetheninterpretedasascoreofthecorrespondingtag(giventheinputofthenetwork),thankstoacarefullychosencostfunctionthatwewilldescribelaterinthissection.Remark1(BorderEffects)Thefeaturewindow(3)isnotwelldenedforwordsnearthebegin-ningortheendofasentence.Tocircumventthisproblem,weaugmentthesentencewithaspecial“PADDING”wordreplicateddwin=2timesatthebeginningandtheend.Thisisakintotheuseof“start”and“stop”symbolsinsequencemodels.3.3.2SENTENCEAPPROACHWewillseeintheexperimentalsectionthatawindowapproachperformswellformostnaturallanguageprocessingtasksweareinterestedin.HoweverthisapproachfailswithSRL,wherethetagofaworddependsonaverb(or,morecorrectly,predicate)chosenbeforehandinthesentence.Iftheverbfallsoutsidethewindow,onecannotexpectthiswordtobetaggedcorrectly.Inthisparticularcase,taggingawordrequirestheconsiderationofthewholesentence.Whenusingneuralnetworks,thenaturalchoicetotacklethisproblembecomesaconvolutionalapproach,rstintroducedbyWaibeletal.(1989)andalsocalledTimeDelayNeuralNetworks(TDNNs)intheliterature.Wedescribeindetailourconvolutionalnetworkbelow.Itsuccessivelytakesthecompletesen-tence,passesitthroughthelookuptablelayer(1),produceslocalfeaturesaroundeachwordofthesentencethankstoconvolutionallayers,combinesthesefeatureintoaglobalfeaturevectorwhichcanthenbefedtostandardafnelayers(4).Inthesemanticrolelabelingcase,thisoperationisperformedforeachwordinthesentence,andforeachverbinthesentence.Itisthusnecessarytoencodeinthenetworkarchitecturewhichverbweareconsideringinthesentence,andwhichwordwewanttotag.Forthatpurpose,eachwordatpositioninthesentenceisaugmentedwithtwofeaturesinthewaydescribedinSection3.2.1.Thesefeaturesencodetherelativedistancesposvandposwwithrespecttothechosenverbatpositionposv,andthewordtotagatpositionposwrespectively.ConvolutionalLayer.Aconvolutionallayercanbeseenasageneralizationofawindowap-proach:givenasequencerepresentedbycolumnsinamatrixl1q(inourlookuptablematrix(1)),amatrix-vectoroperationasin(4)isappliedtoeachwindowofsuccessivewindowsinthesequence.2503 COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSA 0 10 20 30 40 50 60 70Theproposedchangesalsowouldallowexecutivestoreportexercisesofoptionslaterandlessoften. xxxxxxxxxx xxxxxxxxxxxxxxxxxx xxxxxxxxxxx xxxxxx xxxxxx xxxxxx xxxx xxxxxxxxxx xxxx xxxx xxxx xx xxxx xx x xx xxxx 0 10 20 30 40 50 60 70Theproposedchangesalsowouldallowexecutivestoreportexercisesofoptionslaterandlessoften. 
xx xxxx x xxx xxxx xxxx xxxx xxxxxxxxxxxxxx xxxxx xxxxxx xxxxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxx Figure3:NumberoffeatureschosenateachwordpositionbytheMaxlayer.Weconsiderasen-tenceapproachnetwork(Figure2)trainedforSRL.Thenumberof“local”featuresoutputbytheconvolutionlayeris300perword.ByapplyingaMaxoverthesentence,weob-tain300featuresforthewholesentence.Itisinterestingtoseethatthenetworkcatchesfeaturesmostlyaroundtheverbofinterest(here“report”)andwordofinterest(“pro-posed”(left)or“often”(right)).Usingpreviousnotations,thethoutputcolumnofthethlayercanbecomputedas:hlqi1t=Wlhl1qidwint+bl8;(6)wheretheweightmatrixWlisthesameacrossallwindowsinthesequence.Convolutionallayersextractlocalfeaturesaroundeachwindowofthegivensequence.Asforstandardafnelayers(4),convolutionallayersareoftenstackedtoextracthigherlevelfeatures.Inthiscase,eachlayermustbefollowedbyanon-linearity(5)orthenetworkwouldbeequivalenttooneconvolutionallayer.MaxLayer.Thesizeoftheoutput(6)dependsonthenumberofwordsinthesentencefedtothenetwork.Localfeaturevectorsextractedbytheconvolutionallayershavetobecombinedtoobtainaglobalfeaturevector,withaxedsizeindependentofthesentencelength,inordertoapplysubsequentstandardafnelayers.Traditionalconvolutionalnetworksoftenapplyanaverage(possiblyweighted)oramaxoperationoverthe“time”ofthesequence(6).(Here,“time”justmeansthepositioninthesentence,thistermstemsfromtheuseofconvolutionallayersin,forexample,speechdatawherethesequenceoccursovertime.)Theaverageoperationdoesnotmakemuchsenseinourcase,asingeneralmostwordsinthesentencedonothaveanyinuenceonthesemanticroleofagivenwordtotag.Instead,weusedamaxapproach,whichforcesthenetworktocapturethemostusefullocalfeaturesproducedbytheconvolutionallayers(seeFigure3),forthetaskathand.Givenamatrixfl1qoutputbyaconvolutionallayer1,theMaxlayeroutputsavectorflqhlqii=maxthl1qii;t1nl1hu:(7)Thisxedsizedglobalfeaturevectorcanbethenfedtostandardafnenetworklayers(4).Asinthewindowapproach,wethennallyproduceonescoreperpossibletagforthegiventask.Remark2Thesamebordereffectsariseintheconvolutionoperation(6)asinthewindowap-proach(3).Weagainworkaroundthisproblembypaddingthesentenceswithaspecialword.2504 NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHSchemeBeginInsideEndSingleOtherIOBB-XI-XI-XB-XOIOEI-XI-XE-XE-XOIOBESB-XI-XE-XS-XO 
Table3:Varioustaggingschemes.Eachwordinasegmentlabeled“X”istaggedwithaprexedlabel,dependingofthewordpositioninthesegment(begin,inside,end).Singlewordsegmentlabelingisalsooutput.Wordsnotinalabeledsegmentarelabeled“O”.VariantsoftheIOB(andIOE)schemeexist,wheretheprexB(orE)isreplacedbyIforallsegmentsnotcontiguouswithanothersegmenthavingthesamelabel“X”.3.3.3TAGGINGSCHEMESAsexplainedearlier,thenetworkoutputlayerscomputescoresforallthepossibletagsforthetaskofinterest.Inthewindowapproach,thesetagsapplytothewordlocatedinthecenterofthewindow.Inthe(convolutional)sentenceapproach,thesetagsapplytotheworddesignatedbyadditionalmarkersinthenetworkinput.ThePOStaskindeedconsistsofmarkingthesyntacticroleofeachword.However,there-mainingthreetasksassociatelabelswithsegmentsofasentence.Thisisusuallyachievedbyusingspecialtaggingschemestoidentifythesegmentboundaries,asshowninTable3.Severalsuchschemeshavebeendened(IOB,IOE,IOBES,...)withoutclearconclusionastowhichschemeisbetteringeneral.State-of-the-artperformanceissometimesobtainedbycombiningclassierstrainedwithdifferenttaggingschemes(e.g.,KudoandMatsumoto,2001).ThegroundtruthfortheNER,CHUNK,andSRLtasksisprovidedusingtwodifferenttaggingschemes.Inordertoeliminatethisadditionalsourceofvariations,wehavedecidedtousethemostexpressiveIOBEStaggingschemeforalltasks.Forinstance,intheCHUNKtask,wedescribenounphrasesusingfourdifferenttags.Tag“S-NP”isusedtomarkanounphrasecontainingasingleword.Otherwisetags“B-NP”,“I-NP”,and“E-NP”areusedtomarktherst,intermediateandlastwordsofthenounphrase.Anadditionaltag“O”markswordsthatarenotmembersofachunk.Duringtesting,thesetagsarethenconvertedtotheoriginalIOBtaggingschemeandfedtothestandardperformanceevaluationscriptsmentionedinSection2.5.3.4TrainingAllourneuralnetworksaretrainedbymaximizingalikelihoodoverthetrainingdata,usingstochas-ticgradientascent.Ifwedenoteqtobeallthetrainableparametersofthenetwork,whicharetrainedusingatrainingsetTwewanttomaximizethefollowinglog-likelihoodwithrespecttoqq7!å(x;y)2Tlogp(yjx;q);(8)wherexcorrespondstoeitheratrainingwordwindoworasentenceanditsassociatedfeatures,andyrepresentsthecorrespondingtag.Theprobabilityp()iscomputedfromtheoutputsoftheneuralnetwork.Wewillseeinthissectiontwowaysofinterpretingneuralnetworkoutputsasprobabilities.2505 COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSA3.4.1WORD-LEVELLOG-LIKELIHOODInthisapproach,eachwordinasentenceisconsideredindependently.Givenaninputexamplex,thenetworkwithparametersqoutputsascoreq(x)i,forthethtagwithrespecttothetaskofinterest.Tosimplifythenotation,wedropxfromnow,andwewriteinsteadqi.Thisscorecanbeinterpretedasaconditionaltagprobabilityp(jx;q)byapplyingasoftmax(Bridle,1990)operationoverallthetags:p(jx;q)=e[f]i 
åje[f]j:(9)Deningthelog-addoperationaslogaddizi=log(åiezi);(10)wecanexpressthelog-likelihoodforonetrainingexample(x;y)asfollows:logp(yjx;q)=[q]ylogaddj[q]j:(11)Whilethistrainingcriterion,oftenreferredascross-entropyiswidelyusedforclassicationprob-lems,itmightnotbeidealinourcase,wherethereisoftenacorrelationbetweenthetagofawordinasentenceanditsneighboringtags.Wenowdescribeanothercommonapproachforneuralnetworkswhichenforcesdependenciesbetweenthepredictedtagsinasentence.3.4.2SENTENCE-LEVELLOG-LIKELIHOODIntaskslikechunking,NERorSRLweknowthattherearedependenciesbetweenwordtagsinasentence:notonlyaretagsorganizedinchunks,butsometagscannotfollowothertags.Trainingusingaword-levelapproachdiscardsthiskindoflabelinginformation.Weconsideratrainingschemewhichtakesintoaccountthesentencestructure:giventhepredictionsofalltagsbyournetworkforallwordsinasentence,andgivenascoreforgoingfromonetagtoanothertag,wewanttoencouragevalidpathsoftagsduringtraining,whilediscouragingallotherpaths.Weconsiderthematrixofscoresq([x]T1)outputbythenetwork.Asbefore,wedroptheinput[x]T1fornotationsimplication.Theelementqi;tofthematrixisthescoreoutputbythenetworkwithparametersq,forthesentence[x]T1andforthethtag,atthethword.Weintroduceatransitionscore[A]i;jforjumpingfromtotagsinsuccessivewords,andaninitialscore[A]i;0forstartingfromthethtag.Asthetransitionscoresaregoingtobetrained(asareallnetworkparametersq),wedene˜q=q[f[A]i;j8;g.Thescoreofasentence[x]T1alongapathoftags[]T1isthengivenbythesumoftransitionscoresandnetworkscores:s([x]T1;[]T1;˜q)=Tåt=1[A][i]t1;[i]t+[q][i]t;t:(12)Exactlyasfortheword-levellikelihood(11),wherewewerenormalizingwithrespecttoalltagsusingasoftmax(9),wenormalizethisscoreoverallpossibletagpaths[]T1usingasoftmax,andweinterprettheresultingratioasaconditionaltagpathprobability.Takingthelog,theconditionalprobabilityofthetruepath[y]T1isthereforegivenby:logp([y]T1j[x]T1;˜q)=s([x]T1;[y]T1;˜q)logadd8[j]T1s([x]T1;[]T1;˜q):(13)2506 
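As an aside, the word-level criterion of Eqs. (9)-(11) is short enough to sketch directly; the sentence-level recursion is sketched after the next passage. The snippet below is only an illustration with names of our own choosing, not the authors' code.

```python
import numpy as np

def logadd(z):
    """logadd_i z_i = log(sum_i exp(z_i)), computed stably (Eq. 10)."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def word_level_log_likelihood(scores, y):
    """Eq. (11): log p(y | x, theta) = [f_theta]_y - logadd_j [f_theta]_j.
    `scores` holds the network outputs for one word, `y` is the true tag index."""
    return scores[y] - logadd(scores)

scores = np.array([1.2, -0.3, 0.8, 2.1])   # toy scores for 4 tags
print(word_level_log_likelihood(scores, y=3))
```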
NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHWhilethenumberoftermsinthelogaddoperation(11)wasequaltothenumberoftags,itgrowsexponentiallywiththelengthofthesentencein(13).Fortunately,onecancomputeitinlineartimewiththefollowingstandardrecursionover(seeRabiner,1989),takingadvantageoftheassociativityanddistributivityonthesemi-ring10(R[f¥g;logadd;+)dt(k)=logaddf[j]t1\[j]t=kgs([x]t1;[]t1;˜q)=logaddilogaddf[j]t1\[j]t1=i\[j]t=kgs([x]t1;[]t11;˜q)+[A][j]t1;k+[q]k;t=logaddidt1()+[A]i;k+[q]k;t=[q]k;t+logaddidt1()+[A]i;k8k;(14)followedbytheterminationlogadd8[j]T1s([x]T1;[]T1;˜q)=logaddidT():(15)Wecannowmaximizein(8)thelog-likelihood(13)overallthetrainingpairs([x]T1;[y]T1)Atinferencetime,givenasentence[x]T1totag,wehavetondthebesttagpathwhichminimizesthesentencescore(12).Inotherwords,wemustndargmax[j]T1s([x]T1;[]T1;˜q):TheViterbialgorithmisthenaturalchoiceforthisinference.Itcorrespondstoperformingtherecursion(14)and(15),butwherethelogaddisreplacedbyamax,andthentrackingbacktheoptimalpaththrougheachmax.Remark3(GraphTransformerNetworks)Ourapproachisaparticularcaseofthediscrimina-tiveforwardtrainingforgraphtransformernetworks(GTNs)(Bottouetal.,1997;LeCunetal.,1998).Thelog-likelihood(13)canbeviewedasthedifferencebetweentheforwardscorecon-strainedoverthevalidpaths(inourcasethereisonlythelabeledpath)andtheunconstrainedforwardscore(15).Remark4(ConditionalRandomFields)Animportantfeatureofequation(12)istheabsenceofnormalization.Summingtheexponentialse[f]itoverallpossibletagsdoesnotnecessarilyyieldtheunity.Ifthiswasthecase,thescorescouldbeviewedasthelogarithmsofconditionaltransitionprobabilities,andourmodelwouldbesubjecttothelabel-biasproblemthatmotivatesConditionalRandomFields(CRFs)(Laffertyetal.,2001).ThedenormalizedscoresshouldinsteadbelikenedtothepotentialfunctionsofaCRF.Infact,aCRFmaximizesthesamelikelihood(13)usingalinearmodelinsteadofanonlinearneuralnetwork.CRFshavebeenwidelyusedintheNLPworld,suchasforPOStagging(Laffertyetal.,2001),chunking(ShaandPereira,2003),NER(McCallumandLi,2003)orSRL(CohnandBlunsom,2005).ComparedtosuchCRFs,wetakeadvantageofthenonlinearnetworktolearnappropriatefeaturesforeachtaskofinterest. 
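The sentence-level criterion and its inference step can be sketched in the same spirit. The following illustrative NumPy code implements the path score of Eq. (12), the forward recursion of Eqs. (14)-(15) and Viterbi decoding with logadd replaced by max; the array shapes and names are our own assumptions, not the paper's implementation.

```python
import numpy as np

def logadd(z, axis=None):
    """Stable log-sum-exp, the `logadd` of Eq. (10)."""
    m = np.max(z, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(z - m), axis=axis, keepdims=True)))

def sentence_score(f, A, A0, path):
    """Eq. (12): sum of transition scores and network scores along a tag path."""
    s = A0[path[0]] + f[path[0], 0]
    for t in range(1, f.shape[1]):
        s += A[path[t - 1], path[t]] + f[path[t], t]
    return s

def sentence_log_likelihood(f, A, A0, y):
    """Eq. (13): score of the true path minus the logadd over all paths,
    computed in linear time with the forward recursion of Eqs. (14)-(15)."""
    K, T = f.shape
    delta = A0 + f[:, 0]                           # delta_1(k)
    for t in range(1, T):
        # delta_t(k) = f[k, t] + logadd_i ( delta_{t-1}(i) + A[i, k] )
        delta = f[:, t] + logadd(delta[:, None] + A, axis=0)
    return sentence_score(f, A, A0, y) - logadd(delta)

def viterbi(f, A, A0):
    """Inference: same recursion with logadd replaced by max, then backtrack the best path."""
    K, T = f.shape
    delta, back = A0 + f[:, 0], np.zeros((K, T), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + A                  # cand[i, k]
        back[:, t] = np.argmax(cand, axis=0)
        delta = f[:, t] + np.max(cand, axis=0)
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return path[::-1]

rng = np.random.default_rng(0)
K, T = 5, 8                                        # 5 tags, 8 words
f = rng.normal(size=(K, T))                        # network scores [f_theta]_{k,t}
A, A0 = rng.normal(size=(K, K)), rng.normal(size=K)
y = list(rng.integers(0, K, size=T))               # a toy "true" tag path
print(sentence_log_likelihood(f, A, A0, y), viterbi(f, A, A0))
```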
10.Inotherwords,readlogaddasand+as\n.2507 COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSA3.4.3STOCHASTICGRADIENTMaximizing(8)withstochasticgradient(Bottou,1991)isachievedbyiterativelyselectingarandomexample(x;y)andmakingagradientstep:q q+l¶logp(yjx;q) ¶q;(16)wherelisachosenlearningrate.OurneuralnetworksdescribedinFigure1andFigure2areasuccessionoflayersthatcorrespondtosuccessivecompositionoffunctions.Theneuralnetworkisnallycomposedwiththeword-levellog-likelihood(11),orsuccessivelycomposedinthere-cursion(14)ifusingthesentence-levellog-likelihood(13).Thus,ananalyticalformulationofthederivative(16)canbecomputed,byapplyingthedifferentiationchainrulethroughthenetwork,andthroughtheword-levellog-likelihood(11)orthroughtherecurrence(14).Remark5(Differentiability)Ourcostfunctionsaredifferentiablealmosteverywhere.Non-differentiablepointsarisebecauseweusea“hard”transferfunction(5)andbecauseweusea“max”layer(7)inthesentenceapproachnetwork.Fortunately,stochasticgradientstillconvergestoameaningfullocalminimumdespitesuchminordifferentiabilityproblems(Bottou,1991,1998).Stochasticgradientiterationsthathitanon-differentiabilityaresimplyskipped.Remark6(ModularApproach)Thewellknown“back-propagation”algorithm(LeCun,1985;Rumelhartetal.,1986)computesgradientsusingthechainrule.Thechainrulecanalsobeusedinamodularimplementation.11OurmodulescorrespondtotheboxesinFigure1andFigure2.Givenderivativeswithrespecttoitsoutputs,eachmodulecanindependentlycomputederivativeswithrespecttoitsinputsandwithrespecttoitstrainableparameters,asproposedbyBottouandGallinari(1991).Thisallowsustoeasilybuildvariantsofournetworks.Fordetailsaboutgradientcomputations,seeAppendixA.Remark7(Tricks)Manytrickshavebeenreportedfortrainingneuralnetworks(LeCunetal.,1998).Whichonestochooseisoftenconfusing.Weemployedonlytwoofthem:theinitializationandupdateoftheparametersofeachnetworklayerweredoneaccordingtothe“fan-in”ofthelayer,thatisthenumberofinputsusedtocomputeeachoutputofthislayer(PlautandHinton,1987).Thefan-inforthelookuptable(1),thelthlinearlayer(4)andtheconvolutionlayer(6)arerespectively1,nl1huanddwinnl1hu.Theinitialparametersofthenetworkweredrawnfromacentereduniformdistribution,withavarianceequaltotheinverseofthesquare-rootofthefan-in.Thelearningratein(16)wasdividedbythefan-in,butstaysxedduringthetraining.3.5SupervisedBenchmarkResultsForPOS,chunkingandNERtasks,wereportresultswiththewindowarchitecture12describedinSection3.3.1.TheSRLtaskwastrainedusingthesentenceapproach(Section3.3.2).ResultsarereportedinTable4,inper-wordaccuracy(PWA)forPOS,andF1scoreforalltheothertasks.Weperformedexperimentsbothwiththeword-levellog-likelihood(WLL)andwiththesentence-levellog-likelihood(SLL).Thehyper-parametersofournetworksarereportedinTable5.Allour 11.Seehttp://torch5.sf.net.12.Wefoundthattrainingthesetaskswiththemorecomplexsentenceapproachwascomputationallyexpensiveandofferedlittleperformancebenets.ResultsdiscussedinSection5providemoreinsightaboutthisdecision.2508 NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHApproachPOSChunkingNERSRL(PWA)(F1)(F1)(F1) BenchmarkSystems97.2494.2989.3177.92 NN+WLL96.3189.1379.5355.40NN+SLL96.3790.3381.4770.99 
Table4:ComparisoningeneralizationperformanceofbenchmarkNLPsystemswithavanillaneu-ralnetwork(NN)approach,onPOS,chunking,NERandSRLtasks.Wereportresultswithboththeword-levellog-likelihood(WLL)andthesentence-levellog-likelihood(SLL).Generalizationperformanceisreportedinper-wordaccuracyrate(PWA)forPOSandF1scoreforothertasks.TheNNresultsarebehindthebenchmarkresults,inSection4weshowhowtoimprovethesemodelsusingunlabeleddata.TaskWindow/Conv.sizeWorddim.Capsdim.HiddenunitsLearningrate POSdwin=5d0=50d1=5n1hu=300l=0:01 CHUNK””””” NER””””” SRL”””n1hu=300n2hu=500”Table5:Hyper-parametersofournetworks.Theywerechosenbyaminimalvalidation(seeRe-mark8),preferringidenticalparametersformosttasks.Wereportforeachtaskthewindowsize(orconvolutionsize),wordfeaturedimension,capitalfeaturedimension,numberofhiddenunitsandlearningrate.networkswerefedwithtworawtextfeatures:lowercasewords,andacapitalletterfeature.Wechosetoconsiderlowercasewordstolimitthenumberofwordsinthedictionary.However,tokeepsomeuppercaseinformationlostbythistransformation,weaddeda“caps”featurewhichtellsifeachwordwasinlowercase,wasalluppercase,hadrstlettercapital,orhadatleastonenon-initialcapitalletter.Additionally,alloccurrencesofsequencesofnumberswithinawordarereplacedwiththestring“NUMBER”,soforexampleboththewords“PS1”and“PS2”wouldmaptothesingleword“psNUMBER”.Weusedadictionarycontainingthe100,000mostcommonwordsinWSJ(caseinsensitive).Wordsoutsidethisdictionarywerereplacedbyasinglespecial“RARE”word.Resultsshowthatneuralnetworks“out-of-the-box”arebehindbaselinebenchmarksystems.AlthoughtheinitialperformanceofournetworksfallsshortfromtheperformanceoftheCoNLLchallengewinners,itcompareshonorablywiththeperformanceofmostcompetitors.Thetrainingcriterionwhichtakesintoaccountthesentencestructure(SLL)seemstoboosttheperformancefortheChunking,NERandSRLtasks,withlittleadvantageforPOS.ThisresultisinlinewithexistingNLPstudiescomparingsentence-levelandword-levellikelihoods(Liangetal.,2008).Thecapacityofournetworkarchitecturesliesmainlyinthewordlookuptable,whichcontains50100;000parameterstotrain.IntheWSJdata,15%ofthemostcommonwordsappearabout90%ofthetime.Manywordsappearonlyafewtimes.Itisthusverydifculttotrainproperlytheircorresponding2509 COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSAFRANCEJESUSXBOXREDDISHSCRATCHEDMEGABITS45419736909117242986987025 
PERSUADETHICKETSDECADENTWIDESCREENODDPPAFAWSAVARYDIVOANTICAANCHIETAUDDINBLACKSTOCKSYMPATHETICVERUSSHABBYEMIGRATIONBIOLOGICALLYGIORGIJFKOXIDEAWEMARKINGKAYAKSHAHEEDKHWARAZMURBINATHUDHEUERMCLARENSRUMELIASTATIONERYEPOSOCCUPANTSAMBHAJIGLADWINPLANUMILIASEGLINTONREVISEDWORSHIPPERSCENTRALLYGOA'ULDGSNUMBEREDGINGLEAVENEDRITSUKOINDONESIACOLLATIONOPERATORFRGPANDIONIDAELIFELESSMONEOBACHAW.J.NAMSOSSHIRTMAHANNILGIRISTable6:WordembeddingsinthewordlookuptableofaSRLneuralnetworktrainedfromscratch,withadictionaryofsize100;000.Foreachcolumnthequeriedwordisfollowedbyitsindexinthedictionary(highermeansmorerare)andits10nearestneighbors(arbitrarilyusingtheEuclideanmetric).50dimensionalfeaturevectorsinthelookuptable.Ideally,wewouldlikesemanticallysimilarwordstobecloseintheembeddingspacerepresentedbythewordlookuptable:bycontinuityoftheneuralnetworkfunction,tagsproducedonsemanticallysimilarsentenceswouldbesimilar.WeshowinTable6thatitisnotthecase:neighboringwordsintheembeddingspacedonotseemtobesemanticallyrelated.Wewillfocusinthenextsectiononimprovingthesewordembeddingsbyleveragingunlabeleddata.Wewillseeourapproachresultsinaperformanceboostforalltasks.Remark8(Architectures)Inallourexperimentsinthispaper,wetunedthehyper-parametersbytryingonlyafewdifferentarchitecturesbyvalidation.Inpractice,thechoiceofhyperparameterssuchasthenumberofhiddenunits,providedtheyarelargeenough,hasalimitedimpactonthegeneralizationperformance.InFigure4,wereporttheF1scoreforeachtaskonthevalidationset,withrespecttothenumberofhiddenunits.Consideringthevariancerelatedtothenetworkinitial-ization,wechosethesmallestnetworkachieving“reasonable”performance,ratherthanpickingthenetworkachievingthetopperformanceobtainedonasinglerun.Remark9(TrainingTime)Trainingournetworkisquitecomputationallyexpensive.ChunkingandNERtakeaboutonehourtotrain,POStakesfewhours,andSRLtakesaboutthreedays.Trainingcouldbefasterwithalargerlearningrate,butwepreferredtosticktoasmallonewhichworks,ratherthanndingtheoptimaloneforspeed.Secondordermethods(LeCunetal.,1998)couldbeanotherspeeduptechnique.4.LotsofUnlabeledDataWewouldliketoobtainwordembeddingscarryingmoresyntacticandsemanticinformationthanshowninTable6.Sincemostofthetrainableparametersofoursystemareassociatedwiththewordembeddings,thesepoorresultssuggestthatweshoulduseconsiderablymoretrainingdata.2510 NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCH 95.5 96 96.5 100 300 500 700 900 (a)POS 90 90.5 91 91.5 100 300 500 700 900 (b)CHUNK 85 85.5 86 86.5 100 300 500 700 900 (c)NER 67 67.5 68 68.5 69 100 300 500 700 900 
(d)SRLFigure4:F1scoreonthevalidationset(y-axis)versusnumberofhiddenunits(x-axis)fordifferenttaskstrainedwiththesentence-levellikelihood(SLL),asinTable4.ForSRL,wevaryinthisgraphonlythenumberofhiddenunitsinthesecondlayer.Thescaleisadaptedforeachtask.Weshowthestandarddeviation(obtainedover5runswithdifferentrandominitialization),forthearchitecturewepicked(300hiddenunitsforPOS,CHUNKandNER,500forSRL).FollowingourNLPfromscratchphilosophy,wenowdescribehowtodramaticallyimprovetheseembeddingsusinglargeunlabeleddatasets.WethenusetheseimprovedembeddingstoinitializethewordlookuptablesofthenetworksdescribedinSection3.5.4.1DataSetsOurrstEnglishcorpusistheentireEnglishWikipedia.13Wehaveremovedallparagraphscon-tainingnon-romancharactersandallMediaWikimarkups.TheresultingtextwastokenizedusingthePennTreebanktokenizerscript.14Theresultingdatasetcontainsabout631millionwords.Asinourpreviousexperiments,weuseadictionarycontainingthe100,000mostcommonwordsinWSJ,withthesameprocessingofcapitalsandnumbers.Again,wordsoutsidethedictionarywerereplacedbythespecial“RARE”word.OursecondEnglishcorpusiscomposedbyaddinganextra221millionwordsextractedfromtheReutersRCV1(Lewisetal.,2004)dataset.15Wealsoextendedthedictionaryto130;000wordsbyaddingthe30;000mostcommonwordsinReuters.Thisisusefulinordertodeterminewhetherimprovementscanbeachievedbyfurtherincreasingtheunlabeleddatasetsize.4.2RankingCriterionversusEntropyCriterionWeusedtheseunlabeleddatasetstotrainlanguagemodelsthatcomputescoresdescribingtheacceptabilityofapieceoftext.TheselanguagemodelsareagainlargeneuralnetworksusingthewindowapproachdescribedinSection3.3.1andinFigure1.Asintheprevioussection,mostofthetrainableparametersarelocatedinthelookuptables.SimilarlanguagemodelswerealreadyproposedbyBengioandDucharme(2001)andSchwenkandGauvain(2002).Theirgoalwastoestimatetheprobabilityofawordgiventhepreviouswordsinasentence.Estimatingconditionalprobabilitiessuggestsacross-entropycriterionsimilartothosedescribedinSection3.4.1.Becausethedictionarysizeislarge,computingthenormalizationterm 13.Availableathttp://download.wikimedia.org.WetooktheNovember2007version.14.Availableathttp://www.cis.upenn.edu/˜treebank/tokenization.html.15.Nowavailableathttp://trec.nist.gov/data/reuters/reuters.html.2511 
COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSAcanbeextremelydemanding,andsophisticatedapproximationsarerequired.Moreimportantlyforus,neitherworkleadstosignicantwordembeddingsbeingreported.Shannon(1951)hasestimatedtheentropyoftheEnglishlanguagebetween0.6and1.3bitspercharacterbyaskinghumansubjectstoguessupcomingcharacters.CoverandKing(1978)givealowerboundof1.25bitspercharacterusingasubtlegamblingapproach.Meanwhile,usingasimplewordtrigrammodel,Brownetal.(1992b)reach1.75bitspercharacter.TeahanandCleary(1996)obtainentropiesaslowas1.46bitspercharacterusingvariablelengthcharactern-grams.Thehumansubjectsrelyofcourseonalltheirknowledgeofthelanguageandoftheworld.CanwelearnthegrammaticalstructureoftheEnglishlanguageandthenatureoftheworldbyleveragingthe0.2bitspercharacterthatseparatehumansubjectsfromsimplen-grammodels?Sincesuchtaskscertainlyrequirehighcapacitymodels,obtainingsufcientlysmallcondenceintervalsonthetestsetentropymayrequireprohibitivelylargetrainingsets.16Theentropycriterionlacksdynamicalrangebecauseitsnumericalvalueislargelydeterminedbythemostfrequentphrases.Inordertolearnsyntax,rarebutlegalphrasesarenolesssignicantthancommonphrases.Itisthereforedesirabletodenealternativetrainingcriteria.Weproposeheretouseapairwiserankingapproach(Cohenetal.,1998).Weseekanetworkthatcomputesahigherscorewhengivenalegalphrasethanwhengivenanincorrectphrase.Becausetherankingliteratureoftendealswithinformationretrievalapplications,manyauthorsdenecomplexrankingcriteriathatgivemoreweighttotheorderingofthebestrankinginstances(seeBurgesetal.,2007;Cl´emenc¸onandVayatis,2007).However,inourcase,wedonotwanttoemphasizethemostcommonphraseovertherarebutlegalphrases.Thereforeweuseasimplepairwisecriterion.Weconsiderawindowapproachnetwork,asdescribedinSection3.3.1andFigure1,withparametersqwhichoutputsascoreq(x)givenawindowoftextx=[w]dwin1.Weminimizetherankingcriterionwithrespecttoqqåx2Xåw2Dmaxn0;1fq(x)+fq(x(w))o;(17)whereisthesetofallpossibletextwindowswithdwinwordscomingfromourtrainingcorpus,Disthedictionaryofwords,andx(w)denotesthetextwindowobtainedbyreplacingthecentralwordoftextwindow[w]dwin1bythewordwOkanoharaandTsujii(2007)usearelatedapproachtoavoidingtheentropycriteriausingabinaryclassicationapproach(correct/incorrectphrase).Theirworkfocusesonusingakernelclassier,andnotonlearningwordembeddingsaswedohere.SmithandEisner(2005)alsoproposeacontrastivecriterionwhichestimatesthelikelihoodofthedataconditionedtoa“negative”neighborhood.Theyconsidervariousdataneighborhoods,includingsentencesoflengthdwindrawnfromDdwin.Theirgoalwashowevertoperformwellonsometaggingtaskonfullyunsuperviseddata,ratherthanobtaininggenericwordembeddingsusefulforothertasks.4.3TrainingLanguageModelsThelanguagemodelnetworkwastrainedbystochasticgradientminimizationoftherankingcrite-rion(17),samplingasentence-wordpair(s;w)ateachiteration. 
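For concreteness, one stochastic step of the ranking criterion (17) can be sketched as follows: a text window is scored by a small window-approach network and compared against a corrupted window whose centre word is replaced by a sampled dictionary word. This toy NumPy snippet is only an illustration under our own naming and sizing assumptions, not the trained language model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_wrd, d_win, n_hu = 1000, 50, 11, 100   # toy dictionary; 11-word window, 100 hidden units

# Toy window-approach scorer f_theta(x): lookup -> concatenate -> linear -> HardTanh -> linear -> scalar.
W  = rng.normal(scale=0.1,  size=(d_wrd, vocab_size))
M1 = rng.normal(scale=0.01, size=(n_hu, d_wrd * d_win)); b1 = np.zeros(n_hu)
M2 = rng.normal(scale=0.01, size=(1, n_hu));             b2 = np.zeros(1)

def score(window):
    x = W[:, window].reshape(-1, order="F")        # concatenated word vectors
    h = np.clip(M1 @ x + b1, -1.0, 1.0)            # HardTanh non-linearity (Eq. 5)
    return (M2 @ h + b2).item()

def ranking_loss(window, corrupt_word):
    """One term of Eq. (17): max(0, 1 - f(x) + f(x^(w))), where x^(w) replaces the centre word."""
    corrupted = window.copy()
    corrupted[d_win // 2] = corrupt_word
    return max(0.0, 1.0 - score(window) + score(corrupted))

# Sample a text window from a (toy) corpus and a replacement word from the dictionary,
# as done at each stochastic gradient iteration.
corpus = rng.integers(0, vocab_size, size=10_000)
t = int(rng.integers(d_win // 2, len(corpus) - d_win // 2))
window = corpus[t - d_win // 2: t + d_win // 2 + 1]
print(ranking_loss(window, corrupt_word=int(rng.integers(0, vocab_size))))
```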
16.However,KleinandManning(2002)describearareexampleofrealisticunsupervisedgrammarinductionusingacross-entropyapproachonbinary-branchingparsingtrees,thatis,byforcingthesystemtogenerateahierarchicalrepresentation.2512 NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHSincetrainingtimesforsuchlargescalesystemsarecountedinweeks,itisnotfeasibletotrymanycombinationsofhyperparameters.Italsomakessensetospeedupthetrainingtimebyinitializingnewnetworkswiththeembeddingscomputedbyearliernetworks.Inparticular,wefounditexpedienttotrainasuccessionofnetworksusingincreasinglylargedictionaries,eachnetworkbeinginitializedwiththeembeddingsofthepreviousnetwork.Successivedictionarysizesandswitchingtimesarechosenarbitrarily.Bengioetal.(2009)providesamoredetaileddiscussionofthis,the(asyet,poorlyunderstood)“curriculum”process.Forthepurposesofmodelselectionweusetheprocessof“breeding”.Theideaofbreedingisinsteadoftryingafullgridsearchofpossiblevalues(whichwedidnothaveenoughcomputingpowerfor)tosearchfortheparametersinanalogytobreedingbiologicalcelllines.Withineachline,childnetworksareinitializedwiththeembeddingsoftheirparentsandtrainedonincreasinglyrichdatasetswithsometimesdifferentparameters.Thatis,supposewehavekprocessors,whichismuchlessthanthepossiblesetofparametersonewouldliketotry.Onechooseskinitialparameterchoicesfromthelargeset,andtrainstheseonthekprocessors.Inourcase,possibleparameterstoadjustare:thelearningratel,thewordembeddingdimensionsd,numberofhiddenunitsn1huandinputwindowsizedwin.Onethentrainseachofthesemodelsinanonlinefashionforacertainamounoftime(i.e.,afewdays),andthenselectsthebestonesusingthevalidationseterrorrate.Thatis,breedingdecisionsweremadeonthebasisofthevalueoftherankingcriterion(17)estimatedonavalidationsetcomposedofonemillionwordsheldoutfromtheWikipediacorpus.Inthenextbreedingiteration,onethenchoosesanothersetofkparametersfromthepossiblegridofvaluesthatpermuteslightlythemostsuccessfulcandidatesfromthepreviousround.Asmanyoftheseparameterchoicescanshareweights,wecaneffectivelycontinueonlinetrainingretainingsomeofthelearningfromthepreviousiterations.Verylongtrainingtimesmakesuchstrategiesnecessaryfortheforeseeablefuture:ifwehadbeengivencomputerstentimesfaster,weprobablywouldhavefoundusesfordatasetstentimesbigger.However,weshouldsaywebelievethatalthoughweendedupwithaparticularchoiceofparameters,manyotherchoicesarealmostequallyasgood,althoughperhapsthereareothersthatarebetteraswecouldnotdoafullgridsearch.Inthefollowingsubsections,wereportresultsobtainedwithtwotrainedlanguagemodels.Theresultsachievedbythesetwomodelsarerepresentativeofthoseachievedbynetworkstrainedonthefullcorpora.LanguagemodelLM1hasawindowsizedwin=11andahiddenlayerwithn1hu=100units.Theembeddinglayersweredimensionedlikethoseofthesupervisednetworks(Table5).ModelLM1wastrainedonourrstEnglishcorpus(Wikipedia)usingsuccessivedictionariescomposedofthe5000,10;000,30;000,50;000andnally100;000mostcommonWSJwords.Thetotaltrainingtimewasaboutfourweeks.LanguagemodelLM2hasthesamedimensions.ItwasinitializedwiththeembeddingsofLM1,andtrainedforanadditionalthreeweeksonoursecondEnglishcorpus(Wikipedia+Reuters)usingadictionarysizeof130,000words.4.4EmbeddingsBothnetworksproducemuchmoreappealingwordembeddingsthaninSection3.5.Table7showsthetennearestneighborsofafewrandomlychosenquerywordsfortheLM1model.Thesyntactic2513 COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSAFRANCEJESUSXBOXREDDISHSCRATCHEDMEGABITS45419736909117242986987025 
AUSTRIAGODAMIGAGREENISHNAILEDOCTETSBELGIUMSATIPLAYSTATIONBLUISHSMASHEDMB/SGERMANYCHRISTMSXPINKISHPUNCHEDBIT/SITALYSATANIPODPURPLISHPOPPEDBAUDGREECEKALISEGABROWNISHCRIMPEDCARATSSWEDENINDRAPSNUMBERGREYISHSCRAPEDKBIT/SNORWAYVISHNUHDGRAYISHSCREWEDMEGAHERTZEUROPEANANDADREAMCASTWHITISHSECTIONEDMEGAPIXELSHUNGARYPARVATIGEFORCESILVERYSLASHEDGBIT/SSWITZERLANDGRACECAPCOMYELLOWISHRIPPEDAMPERESTable7:WordembeddingsinthewordlookuptableofthelanguagemodelneuralnetworkLM1trainedwithadictionaryofsize100;000.Foreachcolumnthequeriedwordisfollowedbyitsindexinthedictionary(highermeansmorerare)andits10nearestneighbors(usingtheEuclideanmetric,whichwaschosenarbitrarily).andsemanticpropertiesoftheneighborsareclearlyrelatedtothoseofthequeryword.TheseresultsarefarmoresatisfactorythanthosereportedinTable7forembeddingsobtainedusingpurelysupervisedtrainingofthebenchmarkNLPtasks.4.5Semi-supervisedBenchmarkResultsSemi-supervisedlearninghasbeentheobjectofmuchattentionduringthelastfewyears(seeChapelleetal.,2006).Previoussemi-supervisedapproachesforNLPcanberoughlycategorizedasfollows:Ad-hocapproachessuchasRosenfeldandFeldman(2007)forrelationextraction.Self-trainingapproaches,suchasUefngetal.(2007)formachinetranslation,andMcCloskyetal.(2006)forparsing.Thesemethodsaugmentthelabeledtrainingsetwithexamplesfromtheunlabeleddatasetusingthelabelspredictedbythemodelitself.Transductiveapproaches,suchasJoachims(1999)fortextclassicationcanbeviewedasarenedformofself-training.ParametersharingapproachessuchasAndoandZhang(2005);SuzukiandIsozaki(2008).AndoandZhangproposeamulti-taskapproachwheretheyjointlytrainmodelssharingcer-tainparameters.TheytrainPOSandNERmodelstogetherwithalanguagemodel(trainedon15millionwords)consistingofpredictingwordsgiventhesurroundingtokens.SuzukiandIsozakiembedagenerativemodel(HiddenMarkovModel)insideaCRFforPOS,ChunkingandNER.Thegenerativemodelistrainedononebillionwords.Theseapproachesshouldbeseenasalinearcounterpartofourwork.Usingmultilayermodelsvastlyexpandstheparametersharingopportunities(seeSection5).Ourapproachsimplyconsistsofinitializingthewordlookuptablesofthesupervisednetworkswiththeembeddingscomputedbythelanguagemodels.SupervisedtrainingisthenperformedasinSection3.5.Inparticularthesupervisedtrainingstageisfreetomodifythelookuptables.Thissequentialapproachiscomputationallyconvenientbecauseitseparatesthelengthytrainingofthe2514 NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHApproachPOSCHUNKNERSRL(PWA)(F1)(F1)(F1) BenchmarkSystems97.2494.2989.3177.92 NN+WLL96.3189.1379.5355.40NN+SLL96.3790.3381.4770.99 NN+WLL+LM197.0591.9185.6858.18NN+SLL+LM197.1093.6587.5873.84 NN+WLL+LM297.1492.0486.9658.34NN+SLL+LM297.2093.6388.6774.15 
Table8:ComparisoningeneralizationperformanceofbenchmarkNLPsystemswithour(NN)ap-proachonPOS,chunking,NERandSRLtasks.Wereportresultswithboththeword-levellog-likelihood(WLL)andthesentence-levellog-likelihood(SLL).Wereportwith(LMn)performanceofthenetworkstrainedfromthelanguagemodelembeddings(Table7).Gen-eralizationperformanceisreportedinper-wordaccuracy(PWA)forPOSandF1scoreforothertasks.languagemodelsfromtherelativelyfasttrainingofthesupervisednetworks.Oncethelanguagemodelsaretrained,wecanperformmultipleexperimentsonthesupervisednetworksinarela-tivelyshorttime.Notethatourprocedureisclearlylinkedtothe(semi-supervised)deeplearningproceduresofHintonetal.(2006),Bengioetal.(2007)andWestonetal.(2008).Table8clearlyshowsthatthissimpleinitializationsignicantlybooststhegeneralizationper-formanceofthesupervisednetworksforeachtask.Itisworthmentioningthelargerlanguagemodelledtoevenbetterperformance.Thissuggeststhatwecouldstilltakeadvantageofevenbiggerunlabeleddatasets.4.6RankingandLanguageThereisalargeagreementintheNLPcommunitythatsyntaxisanecessaryprerequisiteforse-manticrolelabeling(GildeaandPalmer,2002).Thisiswhystate-of-the-artsemanticrolelabelingsystemsthoroughlyexploitmultipleparsetrees.Theparsersthemselves(Charniak,2000;Collins,1999)containconsiderablepriorinformationaboutsyntax(onecanthinkofthisasakindofin-formedpre-processing).Oursystemdoesnotusesuchparsetreesbecauseweattempttolearnthisinformationfromtheunlabeleddataset.Itisthereforelegitimatetoquestionwhetherourrankingcriterion(17)hastheconceptualcapabilitytocapturesucharichhierarchicalinformation.Atrstglance,therankingtaskappearsunrelatedtotheinductionofprobabilisticgrammarsthatunderlystandardparsingalgorithms.Thelackofhierarchicalrepresentationseemsafatalaw(Chomsky,1956).However,rankingiscloselyrelatedtoanalternativedescriptionofthelanguagestructure:op-eratorgrammars(Harris,1968).Insteadofdirectlystudyingthestructureofasentence,Harrisdenesanalgebraicstructureonthespaceofallsentences.Startingfromacoupleofelementarysentenceforms,sentencesaredescribedbythesuccessiveapplicationofsentencetransformationoperators.Thesentencestructureisrevealedasasideeffectofthesuccessivetransformations.Sentencetransformationscanalsohaveasemanticinterpretation.2515 
COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSAInthespiritofstructurallinguistics,Harrisdescribesprocedurestodiscoversentencetrans-formationoperatorsbyleveragingthestatisticalregularitiesofthelanguage.Suchproceduresareobviouslyusefulformachinelearningapproaches.Inparticular,heproposesatesttodecidewhethertwosentencesformsaresemanticallyrelatedbyatransformationoperator.Herstdenesarankingcriterion(Harris,1968,Section4.1):“Startingforconveniencewithveryshortsentenceforms,sayABC,wechooseaparticularwordchoiceforalltheclasses,sayBqCq,exceptone,inthiscaseA;foreverypairofmembersAiAjofthatwordclassweaskhowthesentenceformedwithoneofthemembers,thatis,AiBqCqcomparesastoacceptabilitywiththesentenceformedwiththeothermember,thatis,AjBqCq.”Thesegradingsarethenusedtocomparesentenceforms:“Itnowturnsoutthat,giventhegradedn-tuplesofwordsforaparticularsentenceform,wecanndothersentencesformsofthesamewordclassesinwhichthesamen-tuplesofwordsproducethesamegradingofsentences.”Thisisanindicationthatthesetwosentenceformsexploitcommonwordswiththesamesyntac-ticfunctionandpossiblythesamemeaning.Thisobservationformstheempiricalbasisfortheconstructionofoperatorgrammarsthatdescribereal-worldnaturallanguagessuchasEnglish.Thereforetherearesolidreasonstobelievethattherankingcriterion(17)hastheconceptualpotentialtocapturestrongsyntacticandsemanticinformation.Ontheotherhand,thestructureofourlanguagemodelsisprobablytoorestrictiveforsuchgoals,andourcurrentapproachonlyexploitsthewordembeddingsdiscoveredduringtraining.5.Multi-TaskLearningItisgenerallyacceptedthatfeaturestrainedforonetaskcanbeusefulforrelatedtasks.Thisideawasalreadyexploitedintheprevioussectionwhencertainlanguagemodelfeatures,namelythewordembeddings,wereusedtoinitializethesupervisednetworks.Multi-tasklearning(MTL)leveragesthisideainamoresystematicway.Modelsforalltasksofinterestsarejointlytrainedwithanadditionallinkagebetweentheirtrainableparametersinthehopeofimprovingthegeneralizationerror.Thislinkagecantaketheformofaregularizationterminthejointcostfunctionthatbiasesthemodelstowardscommonrepresentations.Amuchsimplerapproachconsistsinhavingthemodelssharecertainparametersdenedapriori.Multi-tasklearninghasalonghistoryinmachinelearningandneuralnetworks.Caruana(1997)givesagoodoverviewofthesepastefforts.5.1JointDecodingversusJointTrainingMultitaskapproachesdonotnecessarilyinvolvejointtraining.Forinstance,modernspeechrecog-nitionsystemsuseBayesruletocombinetheoutputsofanacousticmodeltrainedonspeechdataandalanguagemodeltrainedonphoneticortextualcorpora(Jelinek,1976).ThisjointdecodingapproachhasbeensuccessfullyappliedtostructurallymorecomplexNLPtasks.SuttonandMcCal-lum(2005b)obtainimprovedresultsbycombiningthepredictionsofindependentlytrainedCRFmodelsusingajointdecodingprocessattesttimethatrequiresmoresophisticatedprobabilistic2516 
inference techniques. On the other hand, Sutton and McCallum (2005a) obtain results somewhat below the state-of-the-art using joint decoding for SRL and syntactic parsing. Musillo and Merlo (2006) also describe a negative result at the same joint task.

Joint decoding invariably works by considering additional probabilistic dependency paths between the models. Therefore it defines an implicit supermodel that describes all the tasks in the same probabilistic framework. Separately training a submodel only makes sense when the training data blocks these additional dependency paths (in the sense of d-separation, Pearl, 1988). This implies that, without joint training, the additional dependency paths cannot directly involve unobserved variables. Therefore, the natural idea of discovering common internal representations across tasks requires joint training.

Joint training is relatively straightforward when the training sets for the individual tasks contain the same patterns with different labels. It is then sufficient to train a model that computes multiple outputs for each pattern (Suddarth and Holden, 1991). Using this scheme, Sutton et al. (2007) demonstrate improvements on POS tagging and noun-phrase chunking using jointly trained CRFs. However the joint labeling requirement is a limitation because such data is not often available. Miller et al. (2000) achieves performance improvements by jointly training NER, parsing, and relation extraction in a statistical parsing model. The joint labeling requirement problem was weakened using a predictor to fill in the missing annotations.

Ando and Zhang (2005) propose a setup that works around the joint labeling requirements. They define linear models of the form $f_i(x) = w_i^\top \Phi(x) + v_i^\top \Theta \Psi(x)$, where $f_i$ is the classifier for the $i$-th task with parameters $w_i$ and $v_i$. Notations $\Phi(x)$ and $\Psi(x)$ represent engineered features for the pattern $x$. Matrix $\Theta$ maps the $\Psi(x)$ features into a low dimensional subspace common across all tasks. Each task is trained using its own examples without a joint labeling requirement. The learning procedure alternates the optimization of $w_i$ and $v_i$ for each task, and the optimization of $\Theta$ to minimize the average loss for all examples in all tasks. The authors also consider auxiliary unsupervised tasks for predicting substructures. They report excellent results on several tasks, including POS and NER.

5.2 Multi-Task Benchmark Results

Table 9 reports results obtained by jointly trained models for the POS, CHUNK, NER and SRL tasks using the same setup as Section 4.5. We trained jointly POS, CHUNK and NER using the window approach network. As we mentioned earlier, SRL can be trained only with the sentence approach network, due to long-range dependencies related to the verb predicate. We thus performed additional experiments, where all four tasks were trained using the sentence approach network. In both cases, all models share the lookup table parameters (2). The parameters of the first linear layers (4) were shared in the window approach case (see Figure 5), and the first convolution layer parameters (6) were shared in the sentence approach networks. For the window approach, best results were obtained by enlarging the first hidden layer size to $n^1_{hu} = 500$ (chosen by validation) in order to account for its shared responsibilities. We used the same architecture as SRL for the sentence approach network. The word embedding dimension was kept constant $d^0 = 50$ in order to reuse the language models of Section 4.5.

Training was achieved by minimizing the loss averaged across all tasks. This is easily achieved with stochastic gradient by alternatively picking examples for each task and applying (16) to all the parameters of the corresponding model, including the shared parameters. Note that this gives each task equal weight. Since each task uses the training sets described in Table 1, it is worth noticing that examples can come from quite different data sets. The generalization performance for each task was measured using the traditional testing data specified in Table 1. Fortunately, none of the training and test sets overlap across tasks.

[Figure 5: architecture diagram showing two window-approach networks (Lookup Table, Linear, HardTanh, task-specific Linear output layers of sizes n^2(t1) = #tags and n^2(t2) = #tags) that share the lookup tables and the first linear layer.]

Figure 5: Example of multitasking with NN. Task 1 and Task 2 are two tasks trained with the window approach architecture presented in Figure 1. Lookup tables as well as the first hidden layer are shared. The last layer is task specific. The principle is the same with more than two tasks.

It is worth mentioning that MTL can produce a single unified network that performs well for all these tasks using the sentence approach. However this unified network only leads to marginal improvements over using a separate network for each task: the most important MTL task appears to be the unsupervised learning of the word embeddings. As explained before, simple computational considerations led us to train the POS, Chunking, and NER tasks using the window approach. The baseline results in Table 9 also show that using the sentence approach for the POS, Chunking, and NER tasks yields no performance improvement (or degradation) over the window approach. The next section shows we can leverage known correlations between tasks in a more direct manner.

6. The Temptation

Results so far have been obtained by staying (almost^17) true to our from scratch philosophy. We have so far avoided specializing our architecture for any task, disregarding a lot of useful a priori

17. We did some basic preprocessing of the raw input words as described in Section 3.5, hence the "almost" in the title of this article. A completely from scratch approach would presumably not know anything about words at all and would work from letters only (or, taken to a further extreme, from speech or optical character recognition, as humans do).

Approach                 POS (PWA)  CHUNK (F1)  NER (F1)  SRL (F1)
Benchmark Systems        97.24      94.29       89.31     77.92
Window Approach
  NN+SLL+LM2             97.20      93.63       88.67     –
  NN+SLL+LM2+MTL         97.22      94.10       88.62     –
Sentence Approach
  NN+SLL+LM2             97.12      93.37       88.78     74.15
  NN+SLL+LM2+MTL         97.22      93.75       88.27     74.29

Table 9: Effect of multi-tasking on our neural architectures. We trained POS, CHUNK and NER in a MTL way, both for the window and sentence network approaches. SRL was only included in the sentence approach joint training. As a baseline, we show previous results of our window approach system, as well as additional results for our sentence approach system, when trained separately on each task. Benchmark system performance is also given for comparison.

Approach                 POS (PWA)  CHUNK (F1)  NER (F1)  SRL (F1)
Benchmark Systems        97.24      94.29       89.31     77.92
NN+SLL+LM2               97.20      93.63       88.67     74.15
NN+SLL+LM2+Suffix2       97.29      –           –         –
NN+SLL+LM2+Gazetteer     –          –           89.59     –
NN+SLL+LM2+POS           –          94.32       88.67     –
NN+SLL+LM2+CHUNK         –          –           –         74.72
Table10:ComparisoningeneralizationperformanceofbenchmarkNLPsystemswithourneuralnetworks(NNs)usingincreasingtask-specicengineering.Wereportresultsobtainedwithanetworktrainedwithouttheextratask-specicfeatures(Section5)andwiththeextratask-specicfeaturesdescribedinSection6.ThePOSnetworkwastrainedwithtwocharacterwordsufxes;theNERnetworkwastrainedusingthesmallCoNLL2003gazetteer;theCHUNKandNERnetworksweretrainedwithadditionalPOSfeatures;andnally,theSRLnetworkwastrainedwithadditionalCHUNKfeatures.NLPknowledge.Wehaveshownthat,thankstolargeunlabeleddatasets,ourgenericneuralnet-workscanstillachieveclosetostate-of-the-artperformancebydiscoveringusefulfeatures.Thissectionexploreswhathappenswhenweincreasetheleveloftask-specicengineeringinoursys-temsbyincorporatingsomecommontechniquesfromtheNLPliterature.Weoftenobtainfurtherimprovements.Theseguresareusefultoquantifyhowfarwewentbyleveraginglargedatasetsinsteadofrelyingonaprioriknowledge.2519 COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSA6.1SufxFeaturesWordsufxesinmanywesternlanguagesarestrongpredictorsofthesyntacticfunctionofthewordandthereforecanbenetthePOSsystem.Forinstance,Ratnaparkhi(1996)usesinputsrepresentingwordsufxesandprexesuptofourcharacters.WeachievethisinthePOStaskbyaddingdiscretewordfeatures(Section3.2.1)representingthelasttwocharactersofeveryword.Thesizeofthesufxdictionarywas455.ThisledtoasmallimprovementofthePOSperformance(Table10,rowNN+SLL+LM2+Sufx2).WealsotriedsufxesobtainedwiththePorter(1980)stemmerandobtainedthesameperformanceaswhenusingtwocharactersufxes.6.2GazetteersState-of-the-artNERsystemsoftenusealargedictionarycontainingwellknownnamedentities(e.g.,Florianetal.,2003).WerestrictedourselvestothegazetteerprovidedbytheCoNLLchal-lenge,containing8;000locations,personnames,organizations,andmiscellaneousentities.WetrainedaNERnetworkwith4additionalwordfeaturesindicating(feature“on”or“off”)whetherthewordisfoundinthegazetteerunderoneofthesefourcategories.Thegazetteerincludesnotonlywords,butalsochunksofwords.Ifasentencechunkisfoundinthegazetteer,thenallwordsinthechunkhavetheircorrespondinggazetteerfeatureturnedto“on”.Theresultingsystemdisplaysaclearperformanceimprovement(Table10,rowNN+SLL+LM2+Gazetteer),slightlyoutperformingthebaseline.Aplausibleexplanationofthislargeboostoverthenetworkusingonlythelanguagemodelisthatgazetteersincludewordchunks,whileweuseonlythewordrepresentationofourlanguagemodel.Forexample,“united”and“bicycle”seenseparatelyarelikelytobenon-entities,while“unitedbicycle”mightbeanentity,butcatchingitwouldrequirehigherlevelrepresentationsofourlanguagemodel.6.3CascadingWhenoneconsidersrelatedtasks,itisreasonabletoassumethattagsobtainedforonetaskcanbeusefulfortakingdecisionsintheothertasks.ConventionalNLPsystemsoftenusefeaturesobtainedfromtheoutputofotherpreexistingNLPsystems.Forinstance,ShenandSarkar(2005)describeachunkingsystemthatusesPOStagsasinput;Florianetal.(2003)describesaNERsystemwhoseinputsincludePOSandCHUNKtags,aswellastheoutputoftwootherNERclassiers.State-of-the-artSRLsystemsexploitparsetrees(GildeaandPalmer,2002;Punyakanoketal.,2005),relatedtoCHUNKtags,andbuiltusingPOStags(Charniak,2000;Collins,1999).Table10reportsresultsobtainedfortheCHUNKandNERtasksbyaddingdiscretewordfea-tures(Section3.2.1)representingthePOStags.Inordertofacilitatecomparisons,insteadofusingthemoreaccuratetagsfromourPOSnetwork,weuseforeachtaskthePOStagsprovidedbythecorrespondingCoNLLchallenge.WealsoreportresultsobtainedfortheSRLtaskbyaddingwordfeaturesrepresentingtheCHUNKtags(alsoprovidedbytheCoNLLchallenge).Weconsistentlyobtainmoderateimprovement
s.6.4EnsemblesConstructingensemblesofclassiersisaprovenwaytotradecomputationalefciencyforgeneral-izationperformance(Belletal.,2007).ThereforeitisnotsurprisingthatmanyNLPsystemsachievestate-of-the-artperformancebycombiningtheoutputsofmultipleclassiers.Forinstance,Kudo2520 NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHApproachPOSCHUNKNER(PWA)(F1)(F1) BenchmarkSystems97.2494.2989.31 NN+SLL+LM2+POSworst97.2993.9989.35NN+SLL+LM2+POSmean97.3194.1789.65NN+SLL+LM2+POSbest97.3594.3289.86 NN+SLL+LM2+POSvotingensemble97.3794.3489.70NN+SLL+LM2+POSjoinedensemble97.3094.3589.67 Table11:ComparisoningeneralizationperformanceforPOS,CHUNKandNERtasksofthenet-worksobtainedusingbycombiningtentrainingrunswithdifferentinitialization.andMatsumoto(2001)useanensembleofclassierstrainedwithdifferenttaggingconventions(seeSection3.3.3).Winningachallengeisofcoursealegitimateobjective.Yetitisoftendifculttogureoutwhichideasaremostresponsibleforthestate-of-the-artperformanceofalargeensemble.Becauseneuralnetworksarenonconvex,trainingrunswithdifferentinitialparametersusuallygivedifferentsolutions.Table11reportsresultsobtainedfortheCHUNKandNERtaskaftertentrainingrunswithrandominitialparameters.Votingthetennetworkoutputsonapertagbasis(“votingensemble”)leadstoasmallimprovementovertheaveragenetworkperformance.Wehavealsotriedamoresophisticatedensembleapproach:thetennetworkoutputscores(beforesentence-levellikelihood)werecombinedwithanadditionallinearlayer(4)andthenfedtoanewsentence-levellikelihood(13).Theparametersofthecombininglayerswerethentrainedontheexistingtrainingset,whilekeepingthetennetworksxed(“joinedensemble”).Thisapproachdidnotimproveonsimplevoting.Theseensemblescomeofcourseattheexpenseofatenfoldincreaseoftherunningtime.Ontheotherhand,multipletrainingtimescouldbeimprovedusingsmartsamplingstrategies(Neal,1996).Wecanalsoobservethattheperformancevariabilityamongthetennetworksisnotverylarge.Thelocalminimafoundbythetrainingalgorithmareusuallygoodlocalminima,thankstotheoversizedparameterspaceandtothenoiseinducedbythestochasticgradientprocedure(LeCunetal.,1998).Inordertoreducethevarianceinourexperimentalresults,wealwaysusethesameinitialparametersfornetworkstrainedonthesametask(exceptofcoursefortheresultsreportedinTable11.)6.5ParsingGildeaandPalmer(2002)andPunyakanoketal.(2005)offerseveralargumentssuggestingthatsyntacticparsingisanecessaryprerequisitefortheSRLtask.TheCoNLL2005SRLbenchmarktaskprovidesparsetreescomputedusingboththeCharniak(2000)andCollins(1999)parsers.State-of-the-artsystemsoftenexploitadditionalparsetreessuchasthektoprankingparsetrees(Koomenetal.,2005;Haghighietal.,2005).IncontrastourSRLnetworkssofardonotuseparsetreesatall.Theyrelyinsteadoninternalrepresentationstransferredfromalanguagemodeltrainedwithanobjectivefunctionthatcaptures2521 COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSALEVEL0 SNP Theluxuryautomaker i-npi-np NP lastyear b-npe-np VP sold s-vp NP 1,214cars b-npe-np PP in s-pp NP theU.S. b-npe-np LEVEL1 S Theluxuryautomakerlastyear oooooo VP sold1,214cars b-vpi-vpe-vp PP intheU.S. b-ppi-ppe-pp LEVEL2 S Theluxuryautomakerlastyear oooooo VP sold1,214carsintheU.S. 
i-vpi-vpi-vpi-vp Figure6:Charniakparsetreeforthesentence“Theluxuryautomakerlastyearsold1,214carsintheU.S.”.Level0istheoriginaltree.Levels1to4areobtainedbysuccessivelycollapsingterminaltreebranches.Foreachlevel,wordsreceivetagsdescribingtheseg-mentassociatedwiththecorrespondingleaf.Allwordsreceivetag“O”atlevel3inthisexample.alotofsyntacticinformation(seeSection4.6).Itisthereforelegitimatetoquestionwhetherthisapproachisanacceptablelightweightreplacementforparsetrees.Weanswerthisquestionbyprovidingparsetreeinformationasadditionalinputfeaturestooursystem.18WehavelimitedourselvestotheCharniakparsetreeprovidedwiththeCoNLL2005data.Consideringthatanodeinasyntacticparsetreeassignsalabeltoasegmentoftheparsedsentence,weproposeawaytofeed(partially)thislabeledsegmentationtoournetwork,throughadditionallookuptables.Eachoftheselookuptablesencodelabeledsegmentsofeachparsetreelevel(uptoacertaindepth).ThelabeledsegmentsarefedtothenetworkfollowingaIOBEStaggingscheme(seeSections3.3.3and3.2.1).Asthereare40differentphraselabelsinWSJ,eachadditionaltree-relatedlookuptableshas161entries(404+1)correspondingtotheIBESsegmenttags,plustheextraOtag.Wecalllevel0theinformationassociatedwiththeleavesoftheoriginalCharniakparsetree.Thelookuptableforlevel0encodesthecorrespondingIOBESphrasetagsforeachwords.Weobtainlevels1to4byrepeatedlytrimmingtheleavesasshowninFigure6.Welabeled“O”wordsbelongingtotherootnode“S”,orallwordsofthesentenceiftherootitselfhasbeentrimmed.ExperimentswereperformedusingtheLM2languagemodelusingthesamenetworkarchi-tectures(seeTable5)andusingadditionallookuptablesofdimension5foreachparsetreelevel.Table12reportstheperformanceimprovementsobtainedbyprovidingincreasinglevelsofparse 18.Inamorerecentwork(Collobert,2011),weproposeanextensionofthisapproachforthegenerationoffullsyntacticparsetrees,usingarecurrentversionofourarchitecture.2522 NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHApproachSRL (valid)(test) BenchmarkSystem(sixparsetrees)77.3577.92BenchmarkSystem(topCharniakparsetreeonly)74.76– NN+SLL+LM272.2974.15 NN+SLL+LM2+Charniak(level0only)74.4475.65NN+SLL+LM2+Charniak(levels0&1)74.5075.81NN+SLL+LM2+Charniak(levels0to2)75.0976.05NN+SLL+LM2+Charniak(levels0to3)75.1275.89NN+SLL+LM2+Charniak(levels0to4)75.4276.06 NN+SLL+LM2+CHUNK–74.72NN+SLL+LM2+PT0–75.49 
Table12:GeneralizationperformanceontheSRLtaskofourNNarchitecturecomparedwiththebenchmarksystem.WeshowperformanceofoursystemfedwithdifferentlevelsofdepthoftheCharniakparsetree.Wereportpreviousresultsofourarchitecturewithnoparsetreeasabaseline.Koomenetal.(2005)reporttestandvalidationperformanceusingsixparsetrees,aswellasvalidationperformanceusingonlythetopCharniakparsetree.Forcomparisonpurposes,wehencealsoreportvalidationperformance.Finally,wereportourperformancewiththeCHUNKfeature,andcompareitagainstalevel0featurePT0obtainedbyournetwork.treeinformation.Level0aloneincreasestheF1scorebyalmost1:5%.Additionallevelsyielddiminishingreturns.Thetopperformancereaches76:06%F1score.Thisisnottoofarfromthestate-of-the-artsystemwhichwenoteusessixparsetreesinsteadofone.Koomenetal.(2005)alsoreporta74:76%F1scoreonthevalidationsetusingonlytheCharniakparsetree.Usingtherstthreeparsetreelevels,wereachthisperformanceonthevalidationset.TheseresultscorroboratendingsintheNLPliterature(GildeaandPalmer,2002;Punyakanoketal.,2005)showingthatparsingisimportantfortheSRLtask.WealsoreportedinTable12ourpreviousperformanceobtainedwiththeCHUNKfeature(seeTable10).Itissurprisingtoobservethataddingchunkingfeaturesintothesemanticrolelabelingnetworkperformssignicantlyworsethanaddingfeaturesdescribingthelevel0oftheCharniakparsetree(Table12).Indeed,ifweignorethelabelprexes“BIES”deningthesegmentation,theparsetreeleaves(atlevel0)andthechunkinghaveidenticallabeling.However,theparsetreesidentifyleafsentencesegmentsthatareoftensmallerthanthoseidentiedbythechunkingtags,asshownbyHollingsheadetal.(2005).19InsteadofrelyingonCharniakparser,wechosetotrainasecondchunkingnetworktoidentifythesegmentsdelimitedbytheleavesofthePennTreebankparsetrees(level0).Ournetworkachieved92:25%F1scoreonthistask(wecallitPT0),whileweevaluatedCharniakperformanceas91:94%onthesametask.AsshowninTable12,feedingour 19.AsinHollingsheadetal.(2005),considerthesentenceandchunklabels“(NPThey)(VParestartingtobuy)(NPgrowthstocks)”.Theparsetreecanbewrittenas“(S(NPThey)(VPare(VPstarting(S(VPto(VPbuy(NPgrowthstocks)))))))”.Thetreeleavessegmentationisthusgivenby“(NPThey)(VPare)(VPstarting)(VPto)(VPbuy)(NPgrowthstocks)”.2523 
COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSAownPT0predictionintotheSRLsystemgivessimilarperformancetousingCharniakpredictions,andisconsistentlybetterthantheCHUNKfeature.6.6WordRepresentationsWehavedescribedhowweinducedusefulwordembeddingsbyapplyingourarchitecturetoalanguagemodelingtasktrainedusingalargeamountofunlabeledtextdata.Theseembeddingsimprovethegeneralizationperformanceonalltasks(Section4.)Theliteraturedescribesotherwaystoinducewordrepresentations.MnihandHinton(2007)proposedarelatedlanguagemodelap-proachinspiredfromRestrictedBoltzmannMachines.However,wordrepresentationsareperhapsmorecommonlyinferredfromn-gramlanguagemodelingratherthanpurelycontinuouslanguagemodels.OnepopularapproachistheBrownclusteringalgorithm(Brownetal.,1992a),whichbuildshierarchicalwordclustersbymaximizingthebigram'smutualinformation.TheinducedwordrepresentationhasbeenusedwithsuccessinawidevarietyofNLPtasks,includingPOS(Sch¨utze,1995),NER(Milleretal.,2004;RatinovandRoth,2009),orparsing(Kooetal.,2008).Otherrelatedapproachesexist,likephraseclustering(LinandWu,2009)whichhasbeenshowntoworkwellforNER.Finally,HuangandYates(2009)haverecentlyproposedasmoothedlanguagemodelingapproachbasedonaHiddenMarkovModel,withsuccessonPOSandChunkingtasks.Whileacomparisonofallthesewordrepresentationsisbeyondthescopeofthispaper,itisratherfairtoquestionthequalityofourwordembeddingscomparedtoapopularNLPapproach.Inthissection,wereportacomparisonofourwordembeddingsagainstBrownclusters,whenusedasfeaturesintoourneuralnetworkarchitecture.Wereportasbaselinepreviousresultswhereourwordembeddingsarene-tunedforeachtask.Wealsoreportperformancewhenourembeddingsarekeptxedduringtask-specictraining.SinceconvexmachinelearningalgorithmsarecommonpracticeinNLP,wenallyreportperformancesfortheconvexversionofourarchitecture.Fortheconvexexperiments,weconsideredthelinearversionofourneuralnetworks(insteadofhavingseverallinearlayersinterleavedwithanon-linearity).WhilewealwayspickedthesentenceapproachforSRL,wehadtoconsiderthewindowapproachinthisparticularconvexsetup,asthesentenceapproachnetwork(seeFigure2)includesaMaxlayer.Havingonlyonelinearlayerinourneuralnetworkisnotenoughtomakeourarchitectureconvex:alllookup-tables(foreachdiscretefeature)mustalsobexed.Theword-lookuptableissimplyxedtotheembeddingsobtainedfromourlanguagemodelLM2.Allotherdiscretefeaturelookup-tables(caps,POS,BrownClusters...)arexedtoastandardsparserepresentation.UsingthenotationintroducedinSection3.2.1,ifLTWkisthelookup-tableofthekthdiscretefeature,wehaveWk2RjDkjjDkjandtherepresentationofthediscreteinputwisobtainedwith:LTWk(w)=hWki1w=0;0;1atindexw;0;0T:(18)Trainingourarchitectureinthisconvexsetupwiththesentence-levellikelihood(13)correspondstotrainingaCRF.Inthatrespect,theseconvexexperimentsshowtheperformanceofourwordembeddingsinaclassicalNLPframework.FollowingtheRatinovandRoth(2009)andKooetal.(2008)setups,wegenerated1;000Brownclustersusingtheimplementation20fromLiang(2005).Tomakethecomparisonfair,theclusterswererstinducedontheconcatenationofWikipediaandReutersdatasets,aswedidinSection4 20.Availableathttp://www.eecs.berkeley.edu/˜pliang/software.2524 NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHApproachPOSCHUNKNERSRL(PWA)(F1)(F1)(F1) Non-convexApproachLM2(non-linearNN)97.2994.3289.5976.06LM2(non-linearNN,xedembeddings)97.1094.4588.7972.24BrownClusters(non-linearNN,130Kwords)96.9294.3587.1572.09BrownClusters(non-linearNN,allwords)96.8194.2186.6871.44 
ConvexApproachLM2(linearNN,xedembeddings)96.6993.5186.6459.11BrownClusters(linearNN,130Kwords)96.5694.2086.4651.54BrownClusters(linearNN,allwords)96.2894.2286.6356.42 Table13:Generalizationperformanceofourneuralnetworkarchitecturetrainedwithourlanguagemodel(LM2)wordembeddings,andwiththewordrepresentationsderivedfromtheBrownClusters.Asbefore,allnetworksarefedwithacapitalizationfeature.Addition-ally,POSisusingawordsufxofsize2feature,CHUNKisfedwithPOS,NERusestheCoNLL2003gazetteer,andSRLisfedwithlevels1–5oftheCharniakparsetree,aswellasaverbpositionfeature.Wereportperformancewithbothconvexandnon-convexarchitectures(300hiddenunitsforalltasks,withanadditional500hiddenunitslayerforSRL).WealsoprovideresultsforBrownClustersinducedwitha130Kworddictionary,aswellasBrownClustersinducedwithallwordsofthegiventasks.fortrainingourlargestlanguagemodelLM2,usinga130Kworddictionary.Thisdictionarycoversabout99%ofthewordsinthetrainingsetofeachtask.Tocoverthelast1%,weaugmentedthedictionarywiththemissingwords(reachingabout140Kwords)andinducedBrownClustersusingtheconcatenationofWSJ,Wikipedia,andReuters.TheBrownclusteringapproachishierarchicalandgeneratesabinarytreeofclusters.Eachwordinthevocabularyisassignedtoanodeinthetree.Featuresareextractedfromthistreebyconsideringthepathfromtheroottothenodecontainingthewordofinterest.FollowingRatinov&Roth,wepickedasfeaturesthepathprexesofsize4,6,10and20.Inthenon-convexexperiments,wefedthesefourBrownClusterfeaturestoourarchitectureusingfourdifferentlookuptables,replacingourwordlookuptable.Thesizeofthelookuptableswaschosentobe12byvalidation.Intheconvexcase,weusedtheclassicalsparserepresentation(18),asforanyotherdiscretefeature.WerstreportinTable13generalizationperformanceofourbestnon-convexnetworkstrainedwithourLM2languagemodelandwithBrownClusterfeatures.OurembeddingsperformatleastaswellasBrownClusters.Resultsaremoremitigatedinaconvexsetup.Formosttasks,goingnon-convexisbetterforbothwordrepresentationtypes.Ingeneral,“ne-tuning”ourembeddingsforeachtaskalsogivesanextraboost.Finally,usingabetterwordcoveragewithBrownClusters(“allwords”insteadof“130Kwords”inTable13)didnothelp.Morecomplexfeaturescouldbepossiblycombinedinsteadofusinganon-linearmodel.Forinstance,Turianetal.(2010)performedacomparisonofBrownClustersandembeddingstrainedinthesamespiritasours21,withadditionalfeaturescombininglabelsandtokens.Webelievethis 21.Howevertheydidnotreachourembeddingperformance.Thereareseveraldifferencesinhowtheytrainedtheirmodelsthatmightexplainthis.Firstly,theymayhaveexperienceddifcultiesbecausetheytrain50-dimensional2525 COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSATaskFeatures POSSufxofsize2CHUNKPOSNERCoNLL2003gazetteerPT0POSSRLPT0,verbposition Table14:FeaturesusedbySENNAimplementation,foreachtask.Inaddition,alltasksuse“lowcapsword”and“caps”features.TaskBenchmarkSENNA PartofSpeech(POS)(Accuracy)97.24%97.29%Chunking(CHUNK)(F1)94.29%94.32%NamedEntityRecognition(NER)(F1)89.31%89.59%ParseTreelevel0(PT0)(F1)91.94%92.25%SemanticRoleLabeling(SRL)(F1)77.92%75.49% 
Table15:Performanceoftheengineeredsweetspot(SENNA)onvarioustaggingtasks.ThePT0taskreplicatesthesentencesegmentationoftheparsetreeleaves.ThecorrespondingbenchmarkscoremeasuresthequalityoftheCharniakparsetreeleavesrelativetothePennTreebankgoldparsetrees.typeofcomparisonshouldbetakenwithcare,ascombiningagivenfeaturewithdifferentwordrepresentationsmightnothavethesameeffectoneachwordrepresentation.6.7EngineeringaSweetSpotWeimplementedastandaloneversionofourarchitecture,writtenintheClanguage.Wegavethename“SENNA”(Semantic/syntacticExtractionusingaNeuralNetworkArchitecture)totheresultingsystem.TheparametersofeacharchitecturearetheonesdescribedinTable5.Allthenetworksweretrainedseparatelyoneachtaskusingthesentence-levellikelihood(SLL).ThewordembeddingswereinitializedtoLM2embeddings,andthenne-tunedforeachtask.WesummarizefeaturesusedbyourimplementationinTable14,andwereportperformanceachievedoneachtaskinTable15.Theruntimeversion22containsabout2500linesofCcode,runsinlessthan150MBofmemory,andneedslessthanamillisecondperwordtocomputeallthetags.Table16comparesthetaggingspeedsforoursystemandforthefewavailablestate-of-the-artsystems:theToutanovaetal.(2003)POStagger23,theShenetal.(2007)POStagger24andtheKoomenetal.(2005)SRL embeddingsfor269Kdistinctwordsusingacomparativelysmalltrainingset(RCV1,37Mwords),unlikelytocontainenoughinstancesoftherarewords.Secondly,theypredictthecorrectnessofthenalwordofeachwindowinsteadofthecenterword(Turianetal.,2010),effectivelyrestrictingthemodeltounidirectionalprediction.Finally,theydonotnetunetheirembeddingsafterunsupervisedtraining.22.Availableathttp://ml.nec-labs.com/senna.23.Availableathttp://nlp.stanford.edu/software/tagger.shtml.Wepickedthe3.0version(May2010).24.Availableathttp://www.cis.upenn.edu/˜xtag/spinal.2526 NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHPOSSystemRAM(MB)Time(s) Toutanovaetal.(2003)80064Shenetal.(2007)2200833 SENNA324SRLSystemRAM(MB)Time(s) Koomenetal.(2005)34006253 SENNA12451 
Table 16: Runtime speed and memory consumption comparison between state-of-the-art systems and our approach (SENNA). We give the runtime in seconds for running both the POS and SRL taggers on their respective testing sets. Memory usage is reported in megabytes.

system.^25 All programs were run on a single 3GHz Intel core. The POS taggers were run with Sun Java 1.6 with a large enough memory allocation to reach their top tagging speed. The beam size of the Shen tagger was set to 3 as recommended in the paper. Regardless of implementation differences, it is clear that our neural networks run considerably faster. They also require much less memory. Our POS and SRL taggers run in 32MB and 120MB of RAM respectively. The Shen and Toutanova taggers slow down significantly when the Java machine is given less than 2.2GB and 800MB of RAM respectively, while the Koomen tagger requires at least 3GB of RAM.

25. Available at http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=SRL.

We believe that a number of reasons explain the speed advantage of our system. First, our system only uses rather simple input features and therefore avoids the non-negligible computation time associated with complex handcrafted features. Secondly, most network computations are dense matrix-vector operations. In contrast, systems that rely on a great number of sparse features experience memory latencies when traversing the sparse data structures. Finally, our compact implementation is self-contained. Since it does not rely on the outputs of disparate NLP systems, it does not suffer from communication latency issues.

7. Critical Discussion

Although we believe that this contribution represents a step towards the "NLP from scratch" objective, we are keenly aware that both our goal and our means can be criticized.

The main criticism of our goal can be summarized as follows. Over the years, the NLP community has developed a considerable expertise in engineering effective NLP features. Why should they forget this painfully acquired expertise and instead painfully acquire the skills required to train large neural networks? As mentioned in our introduction, we observe that no single NLP task really covers the goals of NLP. Therefore we believe that task-specific engineering (i.e., engineering that does not generalize to other tasks) is not desirable. But we also recognize how much our neural networks owe to previous NLP task-specific research.

The main criticism of our means is easier to address. Why did we choose to rely on a twenty-year-old technology, namely multilayer neural networks? We were simply attracted by their ability to discover hidden representations using a stochastic learning algorithm that scales linearly with
the number of examples. Most of the neural network technology necessary for our work has been described ten years ago (e.g., LeCun et al., 1998). However, if we had decided ten years ago to train the language model network LM2 using a vintage computer, training would only be nearing completion today. Training algorithms that scale linearly are most able to benefit from such tremendous progress in computer hardware.

8. Conclusion

We have presented a multilayer neural network architecture that can handle a number of NLP tasks with both speed and accuracy. The design of this system was determined by our desire to avoid task-specific engineering as much as possible. Instead we rely on large unlabeled data sets and let the training algorithm discover internal representations that prove useful for all the tasks of interest. Using this strong basis, we have engineered a fast and efficient "all purpose" NLP tagger that we hope will prove useful to the community.

Acknowledgments

We acknowledge the persistent support of NEC for this research effort. We thank Yoshua Bengio, Samy Bengio, Eric Cosatto, Vincent Etter, Hans-Peter Graf, Ralph Grishman, and Vladimir Vapnik for their useful feedback and comments.

Appendix A. Neural Network Gradients

We consider a neural network $f_\theta(\cdot)$, with parameters $\theta$. We maximize the likelihood (8), or minimize the ranking criterion (17), with respect to the parameters $\theta$, using stochastic gradient. By negating the likelihood, we now assume it all corresponds to minimizing a cost $C(f_\theta(\cdot))$ with respect to $\theta$.

Following the classical "back-propagation" derivations (LeCun, 1985; Rumelhart et al., 1986) and the modular approach shown in Bottou (1991), any feed-forward neural network with $L$ layers, like the ones shown in Figure 1 and Figure 2, can be seen as a composition of functions $f_\theta^l(\cdot)$ corresponding to each layer $l$:

$f_\theta(\cdot) = f_\theta^L(f_\theta^{L-1}(\dots f_\theta^1(\cdot)\dots))$.

Partitioning the parameters of the network with respect to the layers $1 \le l \le L$, we write:

$\theta = (\theta_1, \dots, \theta_l, \dots, \theta_L)$.

We are now interested in computing the gradients of the cost with respect to each $\theta_l$. Applying the chain rule (generalized to vectors) we obtain the classical backpropagation recursion:

$\dfrac{\partial C}{\partial \theta_l} = \dfrac{\partial f_\theta^l}{\partial \theta_l}\,\dfrac{\partial C}{\partial f_\theta^l}$   (19)

$\dfrac{\partial C}{\partial f_\theta^{l-1}} = \dfrac{\partial f_\theta^l}{\partial f_\theta^{l-1}}\,\dfrac{\partial C}{\partial f_\theta^l}$   (20)

In other words, we first initialize the recursion by computing the gradient of the cost with respect to the last layer output $\partial C / \partial f_\theta^L$. Then each layer $l$ computes the gradient with respect to its own parameters with (19), given the gradient coming from its output $\partial C / \partial f_\theta^l$. To perform the backpropagation, it also computes the gradient with respect to its own inputs, as shown in (20). We now derive the gradients for each layer we used in this paper.

A.1 Lookup Table Layer

Given a matrix of parameters $\theta^1 = W^1$ and word (or discrete feature) indices $[w]_1^T$, the layer outputs the matrix:

$f_\theta^l([w]_1^T) = \big( \langle W\rangle^1_{[w]_1} \;\; \langle W\rangle^1_{[w]_2} \;\dots\; \langle W\rangle^1_{[w]_T} \big)$.

The gradients of the weights $\langle W\rangle^1_i$ are given by:

$\dfrac{\partial C}{\partial \langle W\rangle^1_i} = \sum_{\{1\le t\le T:\, [w]_t = i\}} \Big\langle \dfrac{\partial C}{\partial f_\theta^l} \Big\rangle^1_t$.

This sum equals zero if the index $i$ in the lookup table does not correspond to a word in the sequence. In this case, the $i$th column of $W$ does not need to be updated. As a Lookup Table Layer is always the first layer, we do not need to compute its gradients with respect to the inputs.

A.2 Linear Layer

Given parameters $\theta^l = (W^l, b^l)$, and an input vector $f_\theta^{l-1}$, the output is given by:

$f_\theta^l = W^l f_\theta^{l-1} + b^l$.   (21)

The gradients with respect to the parameters are then obtained with:

$\dfrac{\partial C}{\partial W^l} = \dfrac{\partial C}{\partial f_\theta^l}\,\big(f_\theta^{l-1}\big)^{\!\top}$ and $\dfrac{\partial C}{\partial b^l} = \dfrac{\partial C}{\partial f_\theta^l}$,   (22)

and the gradients with respect to the inputs are computed with:

$\dfrac{\partial C}{\partial f_\theta^{l-1}} = \big(W^l\big)^{\!\top} \dfrac{\partial C}{\partial f_\theta^l}$.   (23)

A.3 Convolution Layer

Given an input matrix $f_\theta^{l-1}$, a Convolution Layer $f_\theta^l(\cdot)$ applies a Linear Layer operation (21) successively on each window $\langle f_\theta^{l-1}\rangle_t^{d_{win}}$ ($1 \le t \le T$) of size $d_{win}$. Using (22), the gradients of the parameters are thus given by summing over all windows:

$\dfrac{\partial C}{\partial W^l} = \sum_{t=1}^T \Big\langle \dfrac{\partial C}{\partial f_\theta^l} \Big\rangle^1_t \Big( \langle f_\theta^{l-1}\rangle_t^{d_{win}} \Big)^{\!\top}$ and $\dfrac{\partial C}{\partial b^l} = \sum_{t=1}^T \Big\langle \dfrac{\partial C}{\partial f_\theta^l} \Big\rangle^1_t$.

After initializing the input gradients $\partial C / \partial f_\theta^{l-1}$ to zero, we iterate (23) over all windows for $1 \le t \le T$, leading to the accumulation^26:

$\Big\langle \dfrac{\partial C}{\partial f_\theta^{l-1}} \Big\rangle_t^{d_{win}} \mathrel{+}= \big(W^l\big)^{\!\top} \Big\langle \dfrac{\partial C}{\partial f_\theta^l} \Big\rangle^1_t$.

26. We denote "+=" any accumulation operation.
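As a concrete companion to the derivations in A.1–A.3, the sketch below implements the forward pass and the backward accumulation for a lookup table layer, a linear layer, and a window convolution layer in NumPy. It is an illustrative reimplementation under the notation above, not the SENNA code; all function names and array shapes are ours.

```python
import numpy as np

def lookup_forward(W, word_ids):
    # A.1 forward: select the columns of W indexed by the word ids (output is d0 x T).
    return W[:, word_ids]

def lookup_backward(dOut, W_grad, word_ids):
    # A.1 backward: only columns of W corresponding to words present in the sequence are updated.
    for t, i in enumerate(word_ids):
        W_grad[:, i] += dOut[:, t]

def linear_forward(W, b, x):
    return W @ x + b                      # equation (21)

def linear_backward(W, x, dOut):
    dW = np.outer(dOut, x)                # equation (22)
    db = dOut.copy()
    dx = W.T @ dOut                       # equation (23)
    return dW, db, dx

def conv_forward(W, b, X, dwin):
    # A.3 forward: apply the linear layer to every window of dwin columns of X.
    T = X.shape[1] - dwin + 1
    return np.stack([linear_forward(W, b, X[:, t:t + dwin].reshape(-1))
                     for t in range(T)], axis=1)

def conv_backward(W, X, dOut, dwin):
    # A.3 backward: sum parameter gradients over windows and accumulate ("+=") input gradients.
    dW, db, dX = np.zeros_like(W), np.zeros(W.shape[0]), np.zeros_like(X)
    for t in range(dOut.shape[1]):
        dWt, dbt, dxt = linear_backward(W, X[:, t:t + dwin].reshape(-1), dOut[:, t])
        dW += dWt
        db += dbt
        dX[:, t:t + dwin] += dxt.reshape(X.shape[0], dwin)
    return dW, db, dX
```

Chaining these calls in reverse layer order reproduces the recursion (19)–(20): each backward function consumes the gradient arriving from its output and emits the gradient for its parameters and its input.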
A.4 Max Layer

Given a matrix $f_\theta^{l-1}$, the Max Layer computes

$\big[f_\theta^l\big]_i = \max_t \big[\langle f_\theta^{l-1}\rangle^1_t\big]_i$ and $a_i = \operatorname{argmax}_t \big[\langle f_\theta^{l-1}\rangle^1_t\big]_i \quad \forall i$,

where $a_i$ stores the index of the largest value. We only need to compute the gradient with respect to the inputs, as this layer has no parameters. The gradient is given by

$\Big[\Big\langle \dfrac{\partial C}{\partial f_\theta^{l-1}} \Big\rangle^1_t\Big]_i = \begin{cases} \big[\frac{\partial C}{\partial f_\theta^l}\big]_i & \text{if } t = a_i \\ 0 & \text{otherwise.} \end{cases}$

A.5 HardTanh Layer

Given a vector $f_\theta^{l-1}$, and the definition of the HardTanh (5), we get

$\Big[\dfrac{\partial C}{\partial f_\theta^{l-1}}\Big]_i = \begin{cases} 0 & \text{if } [f_\theta^{l-1}]_i < -1 \\ \big[\frac{\partial C}{\partial f_\theta^l}\big]_i & \text{if } -1 \le [f_\theta^{l-1}]_i \le 1 \\ 0 & \text{if } [f_\theta^{l-1}]_i > 1, \end{cases}$

if we ignore non-differentiability points.

A.6 Word-Level Log-Likelihood

The network outputs a score $[f_\theta]_i$ for each tag indexed by $i$. Following (11), if $y$ is the true tag for a given example, the stochastic score to minimize can be written as

$C(f_\theta) = \operatorname{logadd}_j [f_\theta]_j - [f_\theta]_y$.

Considering the definition of the logadd (10), the gradient with respect to $f_\theta$ is given by

$\dfrac{\partial C}{\partial [f_\theta]_i} = \dfrac{e^{[f_\theta]_i}}{\sum_k e^{[f_\theta]_k}} - \mathbf{1}_{i=y} \quad \forall i$.

A.7 Sentence-Level Log-Likelihood

The network outputs a matrix where each element $[f_\theta]_{i,t}$ gives a score for tag $i$ at word $t$. Given a tag sequence $[y]_1^T$ and an input sequence $[x]_1^T$, we maximize the likelihood (13), which corresponds to minimizing the score

$C(f_\theta, A) = \underbrace{\operatorname{logadd}_{\forall [j]_1^T} s([x]_1^T, [j]_1^T, \tilde\theta)}_{C_{\mathrm{logadd}}} - s([x]_1^T, [y]_1^T, \tilde\theta)$,

with

$s([x]_1^T, [y]_1^T, \tilde\theta) = \sum_{t=1}^T \big( [A]_{[y]_{t-1},[y]_t} + [f_\theta]_{[y]_t,\,t} \big)$.

We first initialize all gradients to zero:

$\dfrac{\partial C}{\partial [f_\theta]_{i,t}} = 0 \;\;\forall i,t$ and $\dfrac{\partial C}{\partial [A]_{i,j}} = 0 \;\;\forall i,j$.

We then accumulate gradients over the second part of the cost, $-s([x]_1^T, [y]_1^T, \tilde\theta)$, which gives:

$\dfrac{\partial C}{\partial [f_\theta]_{[y]_t,\,t}} \mathrel{+}= -1$, $\quad \dfrac{\partial C}{\partial [A]_{[y]_{t-1},[y]_t}} \mathrel{+}= -1 \quad \forall t$.

We now need to accumulate the gradients over the first part of the cost, that is $C_{\mathrm{logadd}}$. We differentiate $C_{\mathrm{logadd}}$ by applying the chain rule through the recursion (14). First we initialize our recursion with

$\dfrac{\partial C_{\mathrm{logadd}}}{\partial \delta_T(i)} = \dfrac{e^{\delta_T(i)}}{\sum_k e^{\delta_T(k)}} \quad \forall i$.

We then compute iteratively:

$\dfrac{\partial C_{\mathrm{logadd}}}{\partial \delta_{t-1}(i)} = \sum_j \dfrac{\partial C_{\mathrm{logadd}}}{\partial \delta_t(j)} \, \dfrac{e^{\delta_{t-1}(i)+[A]_{i,j}}}{\sum_k e^{\delta_{t-1}(k)+[A]_{k,j}}}$,

where at each step $t$ of the recursion we accumulate the gradients with respect to the inputs $f_\theta$ and the transition scores $[A]_{i,j}$:

$\dfrac{\partial C}{\partial [f_\theta]_{i,t}} \mathrel{+}= \dfrac{\partial C_{\mathrm{logadd}}}{\partial \delta_t(i)} \dfrac{\partial \delta_t(i)}{\partial [f_\theta]_{i,t}} = \dfrac{\partial C_{\mathrm{logadd}}}{\partial \delta_t(i)}$

$\dfrac{\partial C}{\partial [A]_{i,j}} \mathrel{+}= \dfrac{\partial C_{\mathrm{logadd}}}{\partial \delta_t(j)} \dfrac{\partial \delta_t(j)}{\partial [A]_{i,j}} = \dfrac{\partial C_{\mathrm{logadd}}}{\partial \delta_t(j)} \, \dfrac{e^{\delta_{t-1}(i)+[A]_{i,j}}}{\sum_k e^{\delta_{t-1}(k)+[A]_{k,j}}}$.

A.8 Ranking Criterion

We use the ranking criterion (17) for training our language model. In this case, given a "positive" example $x$ and a "negative" example $x^{(w)}$, we want to minimize:

$C\big(f_\theta(x), f_\theta(x^{(w)})\big) = \max\big\{0,\; 1 - f_\theta(x) + f_\theta(x^{(w)})\big\}$.

Ignoring the non-differentiability of $\max(0, \cdot)$ in zero, the gradient is simply given by:

$\Big( \dfrac{\partial C}{\partial f_\theta(x)},\; \dfrac{\partial C}{\partial f_\theta(x^{(w)})} \Big) = \begin{cases} (-1,\; 1) & \text{if } 1 - f_\theta(x) + f_\theta(x^{(w)}) > 0 \\ (0,\; 0) & \text{otherwise.} \end{cases}$
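To make the sentence-level log-likelihood of A.7 concrete, the following NumPy sketch computes the forward (logadd) recursion (14) and the resulting negative log-likelihood for a given tag sequence; the gradients can then be obtained as derived above or by automatic differentiation. This is an illustrative sketch with toy scores, not the paper's implementation: the function name and shapes are ours, and for simplicity we omit the transition from a dedicated start tag on both the gold path and the partition function.

```python
import numpy as np
from scipy.special import logsumexp

def sentence_log_likelihood(scores, A, tags):
    """Negative sentence-level log-likelihood.

    scores: (n_tags, T) network scores [f_theta]_{i,t}
    A:      (n_tags, n_tags) transition scores [A]_{i,j} (from tag i to tag j)
    tags:   length-T sequence of gold tag indices
    """
    n_tags, T = scores.shape
    # Forward recursion (14): delta_t(j) = logadd_i(delta_{t-1}(i) + A[i, j]) + scores[j, t]
    delta = scores[:, 0].copy()
    for t in range(1, T):
        delta = logsumexp(delta[:, None] + A, axis=0) + scores[:, t]
    log_partition = logsumexp(delta)

    # Score s(.) of the gold path: sum of emission and transition scores.
    gold = scores[tags[0], 0]
    for t in range(1, T):
        gold += A[tags[t - 1], tags[t]] + scores[tags[t], t]
    return log_partition - gold   # the cost C(f_theta, A) of A.7

# Toy usage with random scores for 5 tags and 7 words.
rng = np.random.default_rng(0)
s, A = rng.normal(size=(5, 7)), rng.normal(size=(5, 5))
print(sentence_log_likelihood(s, A, [0, 2, 2, 1, 4, 3, 0]))
```

The same recursion run with max instead of logadd yields the Viterbi decoding used at test time, which is why training and inference share most of their code in this setup.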
COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSAReferencesR.K.AndoandT.Zhang.Aframeworkforlearningpredictivestructuresfrommultipletasksandunlabeleddata.JournalofMachineLearningResearch(JMLR),6:1817–1953,2005.R.M.Bell,Y.Koren,andC.Volinsky.TheBellKorsolutiontotheNetixPrize.Technicalreport,AT&TLabs,2007.http://www.research.att.com/˜volinsky/netflixY.BengioandR.Ducharme.Aneuralprobabilisticlanguagemodel.InAdvancesinNeuralInfor-mationProcessingSystems(NIPS13),2001.Y.Bengio,P.Lamblin,D.Popovici,andH.Larochelle.Greedylayer-wisetrainingofdeepnetworks.InAdvancesinNeuralInformationProcessingSystems(NIPS19),2007.Y.Bengio,J.Louradour,R.Collobert,andJ.Weston.Curriculumlearning.InInternationalCon-ferenceonMachineLearning(ICML),2009.L.Bottou.Stochasticgradientlearninginneuralnetworks.InProceedingsofNeuro-Nˆmes.EC2,1991.L.Bottou.Onlinealgorithmsandstochasticapproximations.InDavidSaad,editor,OnlineLearningandNeuralNetworks.CambridgeUniversityPress,Cambridge,UK,1998.L.BottouandP.Gallinari.Aframeworkforthecooperationoflearningalgorithms.InAdvancesinNeuralInformationProcessingSystems(NIPS3).1991.L.Bottou,Y.LeCun,andYoshuaBengio.Globaltrainingofdocumentprocessingsystemsusinggraphtransformernetworks.InConferenceonComputerVisionandPatternRecognition(CVPR)pages489–493,1997.J.S.Bridle.Probabilisticinterpretationoffeedforwardclassicationnetworkoutputs,withrela-tionshipstostatisticalpatternrecognition.InF.FogelmanSouli´eandJ.H´erault,editors,Neu-rocomputing:Algorithms,ArchitecturesandApplications,pages227–236.NATOASISeries,1990.P.F.Brown,P.V.deSouza,R.L.Mercer,V.J.D.Pietra,andJC.Lai.Class-basedn-grammodelsofnaturallanguage.ComputationalLinguistics,18(4):467–479,1992a.P.F.Brown,V.J.DellaPietra,R.L.Mercer,S.A.DellaPietra,andJ.C.Lai.Anestimateofanupperboundfortheentropyofenglish.ComputationalLinguistics,18(1):31–41,1992b.C.J.C.Burges,R.Ragno,andQuocVietLe.Learningtorankwithnonsmoothcostfunctions.InAdvancesinNeuralInformationProcessingSystems(NIPS19),pages193–200.2007.R.Caruana.MultitaskLearning.MachineLearning,28(1):41–75,1997.O.Chapelle,B.Schlkopf,andA.Zien.Semi-SupervisedLearning.Adaptivecomputationandmachinelearning.MITPress,Cambridge,Mass.,USA,September2006.E.Charniak.Amaximum-entropy-inspiredparser.InConferenceoftheNorthAmericanChapteroftheAssociationforComputationalLinguistics&HumanLanguageTechnologies(NAACL-HLT)pages132–139,2000.2532 
NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHH.L.Chieu.Namedentityrecognitionwithamaximumentropyapproach.InConferenceonNaturalLanguageLearning(CoNLL),pages160–163,2003.N.Chomsky.Threemodelsforthedescriptionoflanguage.IRETransactionsonInformationTheory,2(3):113–124,September1956.S.Cl´emenc¸onandN.Vayatis.Rankingthebestinstances.JournalofMachineLearningResearch(JMLR),8:2671–2699,2007.W.W.Cohen,R.E.Schapire,andY.Singer.Learningtoorderthings.JournalofArticialIntelli-genceResearch(JAIR),10:243–270,1998.T.CohnandP.Blunsom.Semanticrolelabellingwithtreeconditionalrandomelds.InConferenceonComputationalNaturalLanguage(CoNLL),2005.M.Collins.Head-DrivenStatisticalModelsforNaturalLanguageParsing.PhDthesis,UniversityofPennsylvania,1999.R.Collobert.LargeScaleMachineLearning.PhDthesis,Universit´eParisVI,2004.R.Collobert.Deeplearningforefcientdiscriminativeparsing.InInternationalConferenceonArticialIntelligenceandStatistics(AISTATS),2011.T.CoverandR.King.Aconvergentgamblingestimateoftheentropyofenglish.IEEETransactionsonInformationTheory,24(4):413–421,July1978.R.Florian,A.Ittycheriah,H.Jing,andT.Zhang.Namedentityrecognitionthroughclassiercombination.InConferenceoftheNorthAmericanChapteroftheAssociationforComputationalLinguistics&HumanLanguageTechnologies(NAACL-HLT),pages168–171,2003.D.GildeaandD.Jurafsky.Automaticlabelingofsemanticroles.ComputationalLinguistics,28(3):245–288,2002.D.GildeaandM.Palmer.Thenecessityofparsingforpredicateargumentrecognition.MeetingoftheAssociationforComputationalLinguistics(ACL),pages239–246,2002.J.Gim´enezandL.Marquez.SVMTool:AgeneralPOStaggergeneratorbasedonsupportvectormachines.InConferenceonLanguageResourcesandEvaluation(LREC),2004.A.Haghighi,K.Toutanova,andC.D.Manning.Ajointmodelforsemanticrolelabeling.InConferenceonComputationalNaturalLanguageLearning(CoNLL),June2005.Z.S.Harris.MathematicalStructuresofLanguage.JohnWiley&SonsInc.,1968.D.Heckerman,D.M.Chickering,C.Meek,R.Rounthwaite,andC.Kadie.Dependencynetworksforinference,collaborativeltering,anddatavisualization.JournalofMachineLearningRe-search(JMLR),1:49–75,2001.G.E.Hinton,S.Osindero,andY.-W.Teh.Afastlearningalgorithmfordeepbeliefnets.NeuralComputation,18(7):1527–1554,July2006.2533 
COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSAK.Hollingshead,S.Fisher,andB.Roark.Comparingandcombiningnite-stateandcontext-freeparsers.InConferenceonHumanLanguageTechnologyandEmpiricalMethodsinNaturalLanguageProcessing(HLT-EMNLP),pages787–794,2005.F.HuangandA.Yates.Distributionalrepresentationsforhandlingsparsityinsupervisedsequence-labeling.InMeetingoftheAssociationforComputationalLinguistics(ACL),pages495–503,2009.F.Jelinek.Continuousspeechrecognitionbystatisticalmethods.ProceedingsoftheIEEE,64(4):532–556,1976.T.Joachims.Transductiveinferencefortextclassicationusingsupportvectormachines.InInter-nationalConferenceonMachinelearning(ICML),1999.D.KleinandC.D.Manning.Naturallanguagegrammarinductionusingaconstituent-contextmodel.InAdvancesinNeuralInformationProcessingSystems(NIPS14),pages35–42.2002.T.Koo,X.Carreras,andM.Collins.Simplesemi-superviseddependencyparsing.InMeetingoftheAssociationforComputationalLinguistics(ACL),pages595–603,2008.P.Koomen,V.Punyakanok,D.Roth,andW.Yih.Generalizedinferencewithmultiplesemanticrolelabelingsystems(sharedtaskpaper).InConferenceonComputationalNaturalLanguageLearning(CoNLL),pages181–184,2005.T.KudoandY.Matsumoto.Chunkingwithsupportvectormachines.InConferenceoftheNorthAmericanChapteroftheAssociationforComputationalLinguistics&HumanLanguageTech-nologies(NAACL-HLT),pages1–8,2001.T.KudohandY.Matsumoto.Useofsupportvectorlearningforchunkidentication.InConferenceonNaturalLanguageLearning(CoNLL)andSecondLearningLanguageinLogicWorkshop(LLL),pages142–144,2000.J.Lafferty,A.McCallum,andF.Pereira.Conditionalrandomelds:Probabilisticmodelsforseg-mentingandlabelingsequencedata.InInternationalConferenceonMachineLearning(ICML)2001.Y.LeCun,L.Bottou,Y.Bengio,andP.Haffner.Gradientbasedlearningappliedtodocumentrecognition.ProceedingsofIEEE,86(11):2278–2324,1998.Y.LeCun.Alearningschemeforasymmetricthresholdnetworks.InProceedingsofCognitivapages599–604,Paris,France,1985.Y.LeCun,L.Bottou,G.B.Orr,andK.-R.M¨uller.Efcientbackprop.InG.B.OrrandK.-R.M¨uller,editors,NeuralNetworks:TricksoftheTrade,pages9–50.Springer,1998.D.D.Lewis,Y.Yang,T.G.Rose,andF.Li.Rcv1:Anewbenchmarkcollectionfortextcategoriza-tionresearch.JournalofMachineLearningResearch(JMLR),5:361–397,2004.P.Liang.Semi-supervisedlearningfornaturallanguage.Master'sthesis,MassachusettsInstituteofTechnology,2005.2534 
NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHP.Liang,H.Daum´e,III,andD.Klein.Structurecompilation:tradingstructureforfeatures.InInternationalConferenceonMachinelearning(ICML),pages592–599,2008.D.LinandX.Wu.Phraseclusteringfordiscriminativelearning.InMeetingoftheAssociationforComputationalLinguistics(ACL),pages1030–1038,2009.N.Littlestone.Learningquicklywhenirrelevantattributesabound:Anewlinear-thresholdalgo-rithm.InMachineLearning,pages285–318,1988.A.McCallumandWeiLi.Earlyresultsfornamedentityrecognitionwithconditionalrandomelds,featureinductionandweb-enhancedlexicons.InConferenceoftheNorthAmericanChapteroftheAssociationforComputationalLinguistics&HumanLanguageTechnologies(NAACL-HLT)pages188–191,2003.D.McClosky,E.Charniak,andM.Johnson.Effectiveself-trainingforparsing.ConferenceoftheNorthAmericanChapteroftheAssociationforComputationalLinguistics&HumanLanguageTechnologies(NAACL-HLT),2006.R.McDonald,K.Crammer,andF.Pereira.Flexibletextsegmentationwithstructuredmultilabelclassication.InConferenceonHumanLanguageTechnologyandEmpiricalMethodsinNaturalLanguageProcessing(HLT-EMNLP),pages987–994,2005.S.Miller,H.Fox,L.Ramshaw,andR.Weischedel.Anoveluseofstatisticalparsingtoextractinformationfromtext.AppliedNaturalLanguageProcessingConference(ANLP),2000.S.Miller,J.Guinness,andA.Zamanian.Nametaggingwithwordclustersanddiscriminativetraining.InConferenceoftheNorthAmericanChapteroftheAssociationforComputationalLinguistics&HumanLanguageTechnologies(NAACL-HLT),pages337–342,2004.AMnihandG.E.Hinton.Threenewgraphicalmodelsforstatisticallanguagemodelling.InInternationalConferenceonMachineLearning(ICML),pages641–648,2007.G.MusilloandP.Merlo.RobustParsingofthePropositionBank.ROMAND2006:RobustMethodsinAnalysisofNaturallanguageData,2006.R.M.Neal.BayesianLearningforNeuralNetworks.Number118inLectureNotesinStatistics.Springer-Verlag,NewYork,1996.D.OkanoharaandJ.Tsujii.Adiscriminativelanguagemodelwithpseudo-negativesamples.Meet-ingoftheAssociationforComputationalLinguistics(ACL),pages73–80,2007.M.Palmer,D.Gildea,andP.Kingsbury.Thepropositionbank:Anannotatedcorpusofsemanticroles.ComputationalLinguistics,31(1):71–106,2005.J.Pearl.ProbabilisticReasoninginIntelligentSystems.MorganKaufman,SanMateo,1988.D.C.PlautandG.E.Hinton.Learningsetsofltersusingback-propagation.ComputerSpeechandLanguage,2:35–61,1987.M.F.Porter.Analgorithmforsufxstripping.Program,14(3):130–137,1980.2535 
COLLOBERT,WESTON,BOTTOU,KARLEN,KAVUKCUOGLUANDKUKSAS.Pradhan,W.Ward,K.Hacioglu,J.Martin,andD.Jurafsky.Shallowsemanticparsingusingsupportvectormachines.ConferenceoftheNorthAmericanChapteroftheAssociationforComputationalLinguistics&HumanLanguageTechnologies(NAACL-HLT),2004.S.Pradhan,K.Hacioglu,W.Ward,J.H.Martin,andD.Jurafsky.Semanticrolechunkingcombiningcomplementarysyntacticviews.InConferenceonComputationalNaturalLanguageLearning(CoNLL),pages217–220,2005.V.Punyakanok,D.Roth,andW.Yih.Thenecessityofsyntacticparsingforsemanticrolelabeling.InInternationalJointConferenceonArticialIntelligence(IJCAI),pages1117–1123,2005.L.R.Rabiner.AtutorialonhiddenMarkovmodelsandselectedapplicationsinspeechrecognition.ProceedingsoftheIEEE,77(2):257–286,1989.L.RatinovandD.Roth.Designchallengesandmisconceptionsinnamedentityrecognition.InCon-ferenceonComputationalNaturalLanguageLearning(CoNLL),pages147–155.AssociationforComputationalLinguistics,2009.A.Ratnaparkhi.Amaximumentropymodelforpart-of-speechtagging.InConferenceonEmpiricalMethodsinNaturalLanguageProcessing(EMNLP),pages133–142,1996.B.RosenfeldandR.Feldman.UsingCorpusStatisticsonEntitiestoImproveSemi-supervisedRelationExtractionfromtheWeb.MeetingoftheAssociationforComputationalLinguistics(ACL),pages600–607,2007.D.E.Rumelhart,G.E.Hinton,andR.J.Williams.Learninginternalrepresentationsbyback-propagatingerrors.InD.E.RumelhartandJ.L.McClelland,editors,ParallelDistributedPro-cessing:ExplorationsintheMicrostructureofCognition,volume1,pages318–362.MITPress,1986.H.Sch¨utze.Distributionalpart-of-speechtagging.InMeetingoftheAssociationforComputationalLinguistics(ACL),pages141–148,1995.H.SchwenkandJ.L.Gauvain.Connectionistlanguagemodelingforlargevocabularycontinuousspeechrecognition.InInternationalConferenceonAcoustics,Speech,andSignalProcessing(ICASSP),pages765–768,2002.F.ShaandF.Pereira.Shallowparsingwithconditionalrandomelds.InConferenceoftheNorthAmericanChapteroftheAssociationforComputationalLinguistics&HumanLanguageTech-nologies(NAACL-HLT),pages134–141,2003.C.E.Shannon.Predictionandentropyofprintedenglish.BellSystemsTechnicalJournal,30:50–64,1951.H.ShenandA.Sarkar.Votingbetweenmultipledatarepresentationsfortextchunking.AdvancesinArticialIntelligence,pages389–400,2005.L.Shen,G.Satta,andA.K.Joshi.Guidedlearningforbidirectionalsequenceclassication.InMeetingoftheAssociationforComputationalLinguistics(ACL),2007.2536 
NATURALLANGUAGEPROCESSING(ALMOST)FROMSCRATCHN.A.SmithandJ.Eisner.Contrastiveestimation:Traininglog-linearmodelsonunlabeleddata.InMeetingoftheAssociationforComputationalLinguistics(ACL),pages354–362,2005.S.C.SuddarthandA.D.C.Holden.Symbolic-neuralsystemsandtheuseofhintsfordevelopingcomplexsystems.InternationalJournalofMan-MachineStudies,35(3):291–311,1991.X.Sun,L.-P.Morency,D.Okanohara,andJ.Tsujii.Modelinglatent-dynamicinshallowparsing:alatentconditionalmodelwithimprovedinference.InInternationalConferenceonComputationalLinguistics(COLING),pages841–848,2008.C.SuttonandA.McCallum.Jointparsingandsemanticrolelabeling.InConferenceonComputa-tionalNaturalLanguage(CoNLL),pages225–228,2005a.C.SuttonandA.McCallum.Compositionofconditionalrandomeldsfortransferlearning.Confer-enceonHumanLanguageTechnologyandEmpiricalMethodsinNaturalLanguageProcessing(HLT-EMNLP),pages748–754,2005b.C.Sutton,A.McCallum,andK.Rohanimanesh.DynamicConditionalRandomFields:FactorizedProbabilisticModelsforLabelingandSegmentingSequenceData.JournalofMachineLearningResearch(JMLR),8:693–723,2007.J.SuzukiandH.Isozaki.Semi-supervisedsequentiallabelingandsegmentationusinggiga-wordscaleunlabeleddata.InConferenceoftheNorthAmericanChapteroftheAssociationforCom-putationalLinguistics&HumanLanguageTechnologies(NAACL-HLT),pages665–673,2008.W.J.TeahanandJ.G.Cleary.Theentropyofenglishusingppm-basedmodels.InDataCompres-sionConference(DCC),pages53–62.IEEEComputerSocietyPress,1996.K.Toutanova,D.Klein,C.D.Manning,andY.Singer.Feature-richpart-of-speechtaggingwithacyclicdependencynetwork.InConferenceoftheNorthAmericanChapteroftheAssociationforComputationalLinguistics&HumanLanguageTechnologies(NAACL-HLT),2003.J.Turian,L.Ratinov,andY.Bengio.Wordrepresentations:Asimpleandgeneralmethodforsemi-supervisedlearning.InMeetingoftheAssociationforComputationalLinguistics(ACL),pages384–392,2010.N.Uefng,G.Haffari,andA.Sarkar.Transductivelearningforstatisticalmachinetranslation.InMeetingoftheAssociationforComputationalLinguistics(ACL),pages25–32,2007.A.Waibel,T.Hanazawa,G.Hinton,K.Shikano,andK.J.Lang.Phonemerecognitionusingtime-delayneuralnetworks.IEEETransactionsonAcoustics,Speech,andSignalProcessing,37(3):328–339,1989.J.Weston,F.Ratle,andR.Collobert.Deeplearningviasemi-supervisedembedding.InInterna-tionalConferenceonMachinelearning(ICML),pages1168–1175,2008.2537