Journal of Machine Learning Research 10 (2009) 1737-1754    Submitted 1/09; Revised 4/09; Published 7/09

SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

Antoine Bordes (ANTOINE.BORDES@LIP6.FR)
LIP6, Université Pierre et Marie Curie, 104, Avenue du Président Kennedy, 75016 Paris, France (also at NEC Laboratories America, Inc.)

Léon Bottou (LEONB@NEC-LABS.COM)
NEC Laboratories America, Inc., 4 Independence Way, Princeton, NJ 08540, USA

Patrick Gallinari (PATRICK.GALLINARI@LIP6.FR)
LIP6, Université Pierre et Marie Curie, 104, Avenue du Président Kennedy, 75016 Paris, France

Editors: Soeren Sonnenburg, Vojtech Franc, Elad Yom-Tov and Michele Sebag

Abstract

The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of second-order information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires fewer iterations to achieve the same accuracy. This algorithm won the "Wild Track" of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008).

Keywords: support vector machine, stochastic gradient descent

1. Introduction

The last decades have seen a massive increase of data quantities. In various domains such as biology, networking, or information retrieval, fast classification methods able to scale to millions of training instances are needed. Real-world applications demand learning algorithms with low time and memory requirements. The first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008) was designed to identify which machine learning techniques best address these new concerns. A generic evaluation framework and various data sets have been provided; this material and its documentation can be found at http://largescale.first.fraunhofer.de/. Evaluations were carried out on the basis of various performance curves such as training time versus test error, data set size versus test error, and data set size versus training time.

Our entry in this competition, named SGD-QN, is a carefully designed Stochastic Gradient Descent (SGD) for linear Support Vector Machines (SVM). Nonlinear models could in fact reach much better generalization performance on most of the proposed data sets. Unfortunately, even in the Wild Track case, the evaluation criteria for the competition reward good scaling properties and short training durations more than they punish suboptimal test errors. Nearly all the competitors chose to implement linear models in order to avoid the
additional penalty implied by nonlinearities. Although SGD-QN can work on nonlinear models (stochastic gradient works well in models with nonlinear parametrization; for SVMs with nonlinear kernels we would prefer dual methods, e.g., Bordes et al., 2005, which can exploit the sparsity of the kernel expansion), we only report its performance in the context of linear SVMs.

Stochastic algorithms are known for their poor optimization performance. However, in the large scale setup, when the bottleneck is the computing time rather than the number of training examples, Bottou and Bousquet (2008) have shown that stochastic algorithms often yield the best generalization performance in spite of being worse optimizers. SGD algorithms were therefore a natural choice for the "Wild Track" of the competition, which focuses on the relation between training time and test performance.

SGD algorithms have been the object of a number of recent works. Bottou (2007) and Shalev-Shwartz et al. (2007) demonstrate that the plain Stochastic Gradient Descent yields particularly effective algorithms when the input patterns are very sparse, taking less than O(d) space and time per iteration to optimize a system with d parameters. It can greatly outperform sophisticated batch methods on large data sets but suffers from slow convergence rates, especially on ill-conditioned problems. Various remedies have been proposed. Stochastic Meta-Descent (Schraudolph, 1999) heuristically determines a learning rate for each coefficient of the parameter vector; although it can solve some ill-conditioning issues, it does not help much for linear SVMs. Natural Gradient Descent (Amari et al., 2000) replaces the learning rate by the inverse of the Riemannian metric tensor; this quasi-Newton stochastic method is statistically efficient but is penalized in practice by the cost of storing and manipulating the metric tensor. Online BFGS (oBFGS) and Online Limited storage BFGS (oLBFGS) (Schraudolph et al., 2007) are stochastic adaptations of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization algorithm. The limited storage version of this algorithm is a quasi-Newton stochastic method whose cost per iteration is a small multiple of the cost of a standard SGD iteration. Unfortunately this penalty is often bigger than the gains associated with the quasi-Newton update. Online dual solvers for SVMs (Bordes et al., 2005; Hsieh et al., 2008) have also shown good performance on large scale data sets. These solvers can be applied to both linear and nonlinear SVMs. In the linear case, these dual algorithms are surprisingly close to SGD but do not require fiddling with learning rates. Although this is often viewed as an advantage, we feel that this aspect restricts the improvement opportunities.

The contributions of this paper are twofold:

1. We conduct an analysis of different factors, ranging from algorithmic refinements to implementation details, which can affect the learning speed of SGD algorithms.

2. We present a novel algorithm, denoted SGD-QN, that carefully exploits these speedup opportunities. We empirically validate its properties by benchmarking it against state-of-the-art SGD solvers and by summarizing its results at the PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008).

The paper is organized as follows: Section 2 analyses the potential gains of quasi-Newton techniques for SGD algorithms. Sections 3 and 4 discuss the sparsity and implementation issues. Finally, Section 5 presents the SGD-QN algorithm, and Section 6 reports experimental results.
2. Analysis

This section describes our notations and summarizes theoretical results that are relevant to the design of a fast variant of stochastic gradient algorithms.

2.1 SGD for Linear SVMs

Consider a binary classification problem with examples z = (x, y) ∈ ℝ^d × {−1, +1}. The linear SVM classifier is obtained by minimizing the primal cost function

  P_n(w) = (λ/2)‖w‖² + (1/n) Σ_{i=1}^{n} ℓ(y_i w^⊤ x_i) = (1/n) Σ_{i=1}^{n} [ (λ/2)‖w‖² + ℓ(y_i w^⊤ x_i) ],   (1)

where the hyper-parameter λ > 0 controls the strength of the regularization term. Although typical SVMs use mildly nonregular convex loss functions, we assume in this paper that the loss ℓ(s) is convex and twice differentiable with continuous derivatives (ℓ ∈ C²(ℝ)). This could be simply achieved by smoothing the traditional loss functions in the vicinity of their nonregular points.

Each iteration of the SGD algorithm consists of drawing a random training example (x_t, y_t) and computing a new value of the parameter w_t as

  w_{t+1} = w_t − (1/(t + t_0)) B g_t(w_t)   where   g_t(w_t) = λ w_t + ℓ′(y_t w_t^⊤ x_t) y_t x_t,   (2)

where the rescaling matrix B is positive definite. Since the SVM theory provides simple bounds on the norm of the optimal parameter vector (Shalev-Shwartz et al., 2007), the positive constant t_0 is heuristically chosen to ensure that the first few updates do not produce a parameter with an implausibly large norm.

The traditional first-order SGD algorithm, with decreasing learning rate, is obtained by setting B = λ⁻¹ I in the generic update (2):

  w_{t+1} = w_t − (1/(λ (t + t_0))) g_t(w_t).   (3)

The second-order SGD algorithm is obtained by setting B to the inverse of the Hessian matrix H = P″_n(w*_n) computed at the optimum w*_n of the primal cost P_n(w):

  w_{t+1} = w_t − (1/(t + t_0)) H⁻¹ g_t(w_t).   (4)

Randomly picking examples could lead to expensive random accesses to the slow memory. In practice, one simply performs sequential passes over the randomly shuffled training set.
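To make the update (3) concrete, here is a minimal C++ sketch of one epoch of first-order SGD for a linear SVM with the squared hinge loss ℓ(s) = ½ max(0, 1 − s)² that the experiments of Section 6 use. The dense data layout and the function name are our own illustrative choices, not code from the paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One epoch of plain first-order SGD (update (3)) for a linear SVM with the
// squared hinge loss  l(s) = 0.5 * max(0, 1 - s)^2,  whose derivative is
// l'(s) = -max(0, 1 - s).  Dense features; names are illustrative.
void sgd_epoch(std::vector<double>& w,                        // parameter vector, dimension d
               const std::vector<std::vector<double>>& x,     // training patterns
               const std::vector<int>& y,                     // labels in {-1, +1}
               double lambda, double t0, long& t)             // hyper-parameters, iteration counter
{
    const std::size_t d = w.size();
    for (std::size_t n = 0; n < x.size(); ++n, ++t) {
        double eta = 1.0 / (lambda * (t + t0));               // decreasing learning rate
        // margin s = y * w'x
        double s = 0.0;
        for (std::size_t i = 0; i < d; ++i) s += w[i] * x[n][i];
        s *= y[n];
        double dloss = -std::max(0.0, 1.0 - s);               // l'(s) for the squared hinge loss
        // w <- w - eta * (lambda * w + l'(y w'x) * y * x),  that is, update (3)
        for (std::size_t i = 0; i < d; ++i)
            w[i] -= eta * (lambda * w[i] + dloss * y[n] * x[n][i]);
    }
}
```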
2.2 What Matters Are the Constant Factors

Bottou and Bousquet (2008) characterize the asymptotic learning properties of stochastic gradient algorithms in the large scale regime, that is, when the bottleneck is the computing time rather than the number of training examples. The first three columns of Table 1 report the time for a single iteration, the number of iterations needed to reach a predefined accuracy ρ, and their product, the time needed to reach accuracy ρ.

  Stochastic Gradient   Cost of one    Iterations         Time to reach    Time to reach
  Algorithm             iteration      to reach ρ         accuracy ρ       E ≤ c(E_app + ε)
  1st-order SGD         O(d)           ν κ²/ρ + o(1/ρ)    O(d ν κ²/ρ)      O(d ν κ²/ε)
  2nd-order SGD         O(d²)          ν/ρ + o(1/ρ)       O(d² ν/ρ)        O(d² ν/ε)

Table 1: Asymptotic results for stochastic gradient algorithms, reproduced from Bottou and Bousquet (2008). Compare the second last column (time to optimize) with the last column (time to reach the excess test error ε). Legend: n, number of examples; d, parameter dimension; c, positive constant that appears in the generalization bounds; κ, condition number of the Hessian matrix H; ν = tr(G H⁻¹), with G the Fisher matrix (see Theorem 1 for more details). The implicit proportionality coefficients in the notations O(·) and o(·) are of course independent of these quantities.

The excess test error E measures how much the test error is worse than the best possible error for this problem. Bottou and Bousquet (2008) decompose the test error as the sum of three terms, E = E_app + E_est + E_opt. The approximation error E_app measures how closely the chosen family of functions can approximate the optimal solution, the estimation error E_est measures the effect of minimizing the empirical risk instead of the expected risk, and the optimization error E_opt measures the impact of the approximate optimization on the generalization performance.

The fourth column of Table 1 gives the time necessary to reduce the excess test error E below a target that depends on ε > 0. This is the important metric because the test error is the measure that matters in machine learning. Both the first-order and the second-order SGD require a time inversely proportional to ε to reach the target test error. Only the constants differ. The second-order algorithm is insensitive to the condition number κ of the Hessian matrix but suffers from a penalty proportional to the dimension d of the parameter vector. Therefore, algorithmic changes that exploit the second-order information in SGD algorithms are unlikely to yield superlinear speedups. We can at best improve the constant factors.

This property is not limited to SGD algorithms. To reach an excess error ε, the most favorable generalization bounds suggest that one needs a number of examples proportional to 1/ε. Therefore, the time complexity of any algorithm that processes a nonvanishing fraction of these examples cannot scale better than 1/ε. In fact, Bottou and Bousquet (2008) obtain slightly worse scaling laws for typical non-stochastic gradient algorithms.

2.3 Limited Storage Approximations of Second-Order SGD

Since the second-order SGD algorithm is penalized by the high cost of performing the update (2) using a full rescaling matrix B = H⁻¹, it is tempting to consider matrices that admit a sparser representation and yet approximate the inverse Hessian well enough to reduce the negative impact of the condition number κ.

The following theorem describes how the convergence speed of the generic SGD algorithm (2) is related to the spectrum of the matrix HB.

Theorem 1. Let E denote the expectation with respect to the random selection of the examples (x_t, y_t) drawn independently from the training set at each iteration. Let w*_n = argmin_w P_n(w) be an optimum of the primal cost. Define the Hessian matrix H = ∂²P_n(w*_n)/∂w² and the Fisher matrix G = E_t[ g_t(w*_n) g_t(w*_n)^⊤ ]. If the eigenvalues of HB are in the range [λ_min, λ_max] with λ_max ≥ λ_min > 1/2, and if the SGD algorithm (2) converges to w*_n, the following inequality holds:

  tr(HBGB)/(2λ_max − 1) · t⁻¹ + o(t⁻¹)  ≤  E[ P_n(w_t) − P_n(w*_n) ]  ≤  tr(HBGB)/(2λ_min − 1) · t⁻¹ + o(t⁻¹).
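The following short derivation is our addition, not part of the original text; it shows how the two canonical choices of B specialize the bracket of Theorem 1 and recover the constants of Table 1. The paper states these results as Corollaries 2 and 3 below.

```latex
% Second-order case, B = H^{-1}: then HB = I, so \lambda_{\min}=\lambda_{\max}=1 and
% \operatorname{tr}(HBGB) = \operatorname{tr}(GH^{-1}) = \nu; both sides of the bracket coincide.
\mathbb{E}\!\left[P_n(w_t)-P_n(w_n^*)\right]
  = \frac{\operatorname{tr}(GH^{-1})}{2\cdot 1-1}\, t^{-1}+o(t^{-1})
  = \nu\, t^{-1}+o(t^{-1}).

% First-order case, B = \lambda^{-1} I: then HB = \lambda^{-1}H has eigenvalues in [1,\kappa]
% because the regularizer makes H \succeq \lambda I, so \lambda_{\min}\ge 1 and
% \operatorname{tr}(HBGB) = \lambda^{-2}\operatorname{tr}(HG)
%   \le \lambda^{-2}\|H\|^{2}\operatorname{tr}(GH^{-1}) = \kappa^{2}\nu,
% which gives the first-order constant.
\mathbb{E}\!\left[P_n(w_t)-P_n(w_n^*)\right]
  \le \lambda^{-2}\operatorname{tr}(HG)\, t^{-1}+o(t^{-1})
  \le \kappa^{2}\nu\, t^{-1}+o(t^{-1}).
```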
The proof of the theorem is provided in the appendix. Note that the theorem assumes that the generic SGD algorithm converges. Convergence in the first-order case holds under very mild assumptions (e.g., Bottou, 1998). Convergence in the generic SGD case holds because it reduces to the first-order case with the change of variable w → B^{1/2} w. Convergence also holds under slightly stronger assumptions when the rescaling matrix B changes over time (e.g., Driancourt, 1994).

The following two corollaries recover the maximal numbers of iterations listed in Table 1, with ν = tr(G H⁻¹) and κ = λ⁻¹ ‖H‖. Corollary 2 gives a very precise equality for the second-order case because the lower bound and the upper bound of the theorem take identical values. Corollary 3 gives a much less refined bound in the first-order case.

Corollary 2. Assume B = H⁻¹ as in the second-order SGD algorithm (4). Under the assumptions of Theorem 1, we have

  E[ P_n(w_t) − P_n(w*_n) ] = tr(G H⁻¹) t⁻¹ + o(t⁻¹) = ν t⁻¹ + o(t⁻¹).

Corollary 3. Assume B = λ⁻¹ I as in the first-order SGD algorithm (3). Under the assumptions of Theorem 1, we have

  E[ P_n(w_t) − P_n(w*_n) ] ≤ λ⁻² tr(H² G H⁻¹) t⁻¹ + o(t⁻¹) ≤ κ² ν t⁻¹ + o(t⁻¹).

An often rediscovered property of second-order SGD provides a useful point of reference:

Theorem 4 (Fabian, 1973; Murata and Amari, 1999; Bottou and LeCun, 2005). Let w* = argmin_w [ (λ/2)‖w‖² + E_{x,y}[ ℓ(y w^⊤ x) ] ]. Given a sample of n independent examples (x_i, y_i), define w*_n = argmin_w P_n(w) and compute w̄_n by applying the second-order SGD update (4) to each of the n examples. If they converge, both n E[‖w*_n − w*‖²] and n E[‖w̄_n − w*‖²] converge to the same positive constant K when n increases.

This result means that, asymptotically and on average, the parameter w̄_n obtained after one pass of second-order SGD is as close to the infinite training set solution w* as the true optimum of the primal, w*_n. Therefore, when the training set is large enough, we can expect that a single pass of second-order SGD (n iterations of (4)) optimizes the primal accurately enough to replicate the test error of the actual SVM solution.

When we replace the full second-order rescaling matrix B = H⁻¹ by a more computationally acceptable approximation, Theorem 1 indicates that we lose a constant factor κ on the required number of iterations to reach that accuracy. In other words, we can expect to replicate the SVM test error after κ passes over the randomly reshuffled training set. On the other hand, a well chosen approximation of the rescaling matrix can save a large constant factor on the computation of the generic SGD update (2). The best training times are therefore obtained by carefully trading the quality of the approximation for sparse representations.
2.4 More Speedup Opportunities

We have argued that carefully designed quasi-Newton techniques can save a constant factor on the training times. There are of course many other ways to save constant factors:

- Exploiting the sparsity of the patterns (see Section 3) can save a constant factor in the cost of each first-order iteration. The benefits are more limited in the second-order case, because the inverse Hessian matrix is usually not sparse.

- Implementation details (see Section 4) such as compiler technology or parallelization on a predetermined number of processors can also reduce the learning time by constant factors.

Such opportunities are often dismissed as engineering tricks. However, they should be considered on an equal footing with quasi-Newton techniques. Constant factors matter regardless of their origin. The following two sections provide a detailed discussion of sparsity and implementation.

3. Scheduling Stochastic Updates to Exploit Sparsity

First-order SGD iterations can be made substantially faster when the patterns x_t are sparse. The first-order SGD update has the form

  w_{t+1} = w_t − α_t w_t − β_t x_t,   (5)

where α_t and β_t are scalar coefficients. Subtracting β_t x_t from the parameter vector involves solely the nonzero coefficients of the pattern x_t. On the other hand, subtracting α_t w_t involves all d coefficients. A naive implementation of (5) would therefore spend most of the time processing this first term. Shalev-Shwartz et al. (2007) circumvent this problem by representing the parameter w_t as the product s_t v_t of a scalar and a vector. The update (5) can then be computed as s_{t+1} = (1 − α_t) s_t and v_{t+1} = v_t − β_t x_t / s_{t+1}, in time proportional to the number of nonzero coefficients in x_t (a short sketch of this representation is given below).
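The following minimal C++ sketch illustrates the scalar-vector representation just described. The SparseVector alias and the ScaledVector name are our own, and the assumption α_t < 1 is ours as well; this is an illustration of the trick, not code from the paper.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sparse pattern: ordered list of (index, value) pairs.
using SparseVector = std::vector<std::pair<std::size_t, double>>;

// Parameter represented as w = s * v, as in Shalev-Shwartz et al. (2007).
struct ScaledVector {
    double s = 1.0;
    std::vector<double> v;

    // Update (5): w <- w - a*w - b*x, in time proportional to the nonzeros of x.
    // Assumes a < 1 so that the scale factor never becomes zero.
    void update(double a, double b, const SparseVector& x) {
        s *= (1.0 - a);                    // handles the -a*w term for all d coordinates at once
        for (const auto& [i, xi] : x)      // handles the -b*x term on nonzero coordinates only
            v[i] -= b * xi / s;
    }
};
```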
Although this simple approach works well for the first-order SGD algorithm, it does not extend nicely to quasi-Newton SGD algorithms. A more general method consists of treating the regularization term in the primal cost (1) as an additional training example occurring with an arbitrarily chosen frequency and a specific loss function.

                       Frequency    Loss
  Special example      n/skip       skip · (λ/2) ‖w‖²
  Examples 1 to n      1            ℓ(y_i w^⊤ x_i)

Table 2: The regularization term in the primal cost can be viewed as an additional training example with an arbitrarily chosen frequency and a specific loss function.

Consider examples with the frequencies and losses listed in Table 2 and write the average loss:

  (1/(n + n/skip)) [ (n/skip) · skip (λ/2) ‖w‖² + Σ_{i=1}^{n} ℓ(y_i w^⊤ x_i) ] = (skip/(1 + skip)) [ (λ/2) ‖w‖² + (1/n) Σ_{i=1}^{n} ℓ(y_i w^⊤ x_i) ].

Minimizing this loss is of course equivalent to minimizing the primal cost (1) with its regularization term. Applying the SGD algorithm to the examples defined in Table 2 separates the regularization updates, which involve the special example, from the pattern updates, which involve the real examples. The parameter skip regulates the relative frequencies of these updates. The SVMSGD2 algorithm (Bottou, 2007) measures the average pattern sparsity and picks a frequency that ensures that the amortized cost of the regularization update is proportional to the number of nonzero coefficients.

  SGD
    Require: λ, w_0, t_0, T
     1: t = 0
     2: while t ≤ T do
     3:   w_{t+1} = w_t − (1/(λ(t + t_0))) (λ w_t + ℓ′(y_t w_t^⊤ x_t) y_t x_t)
     9:   t = t + 1
    10: end while
    11: return w_T
    (lines 4-8 are left blank in the original to align with SVMSGD2)

  SVMSGD2
    Require: λ, w_0, t_0, T, skip
     1: t = 0; count = skip
     2: while t ≤ T do
     3:   w_{t+1} = w_t − (1/(λ(t + t_0))) ℓ′(y_t w_t^⊤ x_t) y_t x_t
     4:   count = count − 1
     5:   if count ≤ 0 then
     6:     w_{t+1} = w_{t+1} − (skip/(t + t_0)) w_{t+1}
     7:     count = skip
     8:   end if
     9:   t = t + 1
    10: end while
    11: return w_T

Figure 1: Detailed pseudo-codes of the SGD and SVMSGD2 algorithms.

Figure 1 compares the pseudo-codes of the naive first-order SGD and of the first-order SVMSGD2. Both algorithms handle the real examples at each iteration (line 3), but SVMSGD2 only performs a regularization update every skip iterations (line 6). Assume s is the average proportion of nonzero coefficients in the patterns x_i and set skip to c/s, where c is a predefined constant (we use c = 16 in our experiments). Each pattern update (line 3) requires sd operations. Each regularization update (line 6) requires d operations but occurs only once every skip = c/s iterations. The average cost per iteration is therefore proportional to O(sd) instead of O(d).
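The SVMSGD2 pseudo-code of Figure 1 translates almost line for line into C++. The sketch below is our illustration of the scheduling idea for the squared hinge loss, not the authors' released implementation; the SparseVector alias matches the earlier sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using SparseVector = std::vector<std::pair<std::size_t, double>>;  // same alias as above

// One SVMSGD2 pass (Figure 1, right): pattern updates every iteration,
// regularization update only every `skip` iterations.
void svmsgd2_epoch(std::vector<double>& w,
                   const std::vector<SparseVector>& x,
                   const std::vector<int>& y,
                   double lambda, double t0, long skip,
                   long& t, long& count)
{
    for (std::size_t n = 0; n < x.size(); ++n, ++t) {
        double eta = 1.0 / (lambda * (t + t0));
        // margin and loss derivative (squared hinge: l'(s) = -max(0, 1 - s))
        double s = 0.0;
        for (const auto& [i, xi] : x[n]) s += w[i] * xi;
        double dloss = -std::max(0.0, 1.0 - y[n] * s);
        // pattern update (line 3): touches only the nonzero coefficients of x_t
        for (const auto& [i, xi] : x[n]) w[i] -= eta * dloss * y[n] * xi;
        // amortized regularization update (line 6): touches all d coefficients, but rarely
        if (--count <= 0) {
            double shrink = static_cast<double>(skip) / (t + t0);
            for (double& wi : w) wi -= shrink * wi;
            count = skip;
        }
    }
}
```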
4. Implementation

In the optimization literature, a superior algorithm implemented with a slow scripting language usually beats careful implementations of inferior algorithms. This is because the superior algorithm minimizes the training error with a higher order of convergence. This is no longer true in the case of large scale machine learning, because we care about the test error instead of the training error. As explained above, algorithm improvements do not improve the order of the test error convergence. They can simply improve constant factors and therefore compete evenly with implementation improvements. Time spent refining the implementation is time well spent.

There are many methods for representing sparse vectors, with sharply different computing requirements for sequential and random access. Our C++ implementation always uses a full vector for representing the parameter w, but handles the patterns x using either a full vector representation or a sparse representation as an ordered list of index/value pairs. Each calculation can be achieved directly on the sparse representation or after a conversion to the full representation (see Table 3).

                                                             Full         Sparse
  Random access to a single coefficient                      Θ(1)         Θ(s)
  In-place addition into a full vector of dimension d        Θ(d)         Θ(s)
  In-place addition into a sparse vector with s′ nonzeros    Θ(d + s′)    Θ(s + s′)

Table 3: Costs of various operations on a vector of dimension d with s nonzero coefficients.

Inappropriate choices have outrageous costs. For example, on a dense data set with 500 attributes, using sparse vectors increases the training time by 50%; on the sparse RCV1 data set (see Table 4), using a sparse vector to represent the parameter w increases the training time by more than 900%.
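To make the costs of Table 3 concrete, the following sketch (ours, reusing the SparseVector alias from above) contrasts adding a sparse pattern into a dense parameter vector, which costs Θ(s), with merging it into a sparse accumulator, which costs Θ(s + s′).

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using SparseVector = std::vector<std::pair<std::size_t, double>>;  // ordered (index, value) pairs

// In-place addition  w += c * x  into a full vector: Theta(s), s = nonzeros of x.
void axpy_into_full(std::vector<double>& w, double c, const SparseVector& x) {
    for (const auto& [i, xi] : x) w[i] += c * xi;
}

// The same operation into a sparse accumulator requires merging the two ordered
// index lists: Theta(s + s') even when only a few coefficients actually change.
SparseVector axpy_into_sparse(const SparseVector& w, double c, const SparseVector& x) {
    SparseVector out;
    std::size_t a = 0, b = 0;
    while (a < w.size() || b < x.size()) {
        if (b == x.size() || (a < w.size() && w[a].first < x[b].first)) {
            out.push_back(w[a]); ++a;                                  // coefficient only in w
        } else if (a == w.size() || x[b].first < w[a].first) {
            out.push_back({x[b].first, c * x[b].second}); ++b;         // coefficient only in x
        } else {
            out.push_back({w[a].first, w[a].second + c * x[b].second}); ++a; ++b;  // in both
        }
    }
    return out;
}
```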
Modern processors often sport specialized instructions to handle vectors and multiple cores. Linear algebra libraries, such as BLAS, may or may not use them in ways that suit our purposes. Compilation flags have nontrivial impacts on the learning times.

Such implementation improvements are often (but not always) orthogonal to the algorithmic improvements described above. The main issue consists of deciding how much development resources are allocated to implementation and to algorithm design. This trade-off depends on the available competencies.

5. SGD-QN: A Careful Diagonal Quasi-Newton SGD

As explained in Section 2, designing an efficient quasi-Newton SGD algorithm involves a careful trade-off between the sparsity of the scaling matrix representation B and the quality of its approximation of the inverse Hessian H⁻¹. The two obvious choices are diagonal approximations (Becker and LeCun, 1989) and low rank approximations (Schraudolph et al., 2007).

5.1 Diagonal Rescaling Matrices

Among numerous practical suggestions for running SGD algorithms in multilayer neural networks, LeCun et al. (1998) emphatically recommend rescaling each input space feature in order to improve the condition number κ of the Hessian matrix. In the case of a linear model, such preconditioning is similar to using a constant diagonal scaling matrix.

Rescaling the input space defines transformed patterns X_t such that [X_t]_i = b_i [x_t]_i, where the notation [v]_i represents the i-th coefficient of the vector v. This transformation does not change the classification if the parameter vectors are modified as [W_t]_i = [w_t]_i / b_i. The first-order SGD update on these modified variables, with learning rate η_t, is then

  ∀ i = 1 … d,   [W_{t+1}]_i = [W_t]_i − η_t ( λ [W_t]_i + ℓ′(y_t W_t^⊤ X_t) y_t [X_t]_i )
                             = [W_t]_i − η_t ( λ [W_t]_i + ℓ′(y_t w_t^⊤ x_t) y_t b_i [x_t]_i ).

Multiplying by b_i shows how the original parameter vector w_t is affected:

  ∀ i = 1 … d,   [w_{t+1}]_i = [w_t]_i − η_t ( λ [w_t]_i + ℓ′(y_t w_t^⊤ x_t) y_t b_i² [x_t]_i ).

We observe that rescaling the input is equivalent to multiplying the gradient by a fixed diagonal matrix B whose elements are the squares of the coefficients b_i.

Ideally we would like to make the product BH spectrally close to the identity matrix. Unfortunately we do not know the value of the Hessian matrix H at the optimum w*_n. Instead we could consider the current value of the Hessian, H_{w_t} = P″_n(w_t), and compute the diagonal rescaling matrix B that makes B H_{w_t} closest to the identity. This computation could be very costly because it involves the full Hessian matrix. Becker and LeCun (1989) approximate the optimal diagonal rescaling matrix by inverting the diagonal coefficients of the Hessian. The method relies on the analytical derivation of these diagonal coefficients for multilayer neural networks. This derivation does not extend to arbitrary models. It certainly does not work in the case of traditional SVMs because the hinge loss has zero curvature almost everywhere.

5.2 Low Rank Rescaling Matrices

The popular LBFGS optimization algorithm (Nocedal, 1980) maintains a low rank approximation of the inverse Hessian by storing the k most recent rank-one BFGS updates instead of the full inverse Hessian matrix. When the successive full gradients P′_n(w_{t−1}) and P′_n(w_t) are available, standard rank-one updates can be used to directly estimate the inverse Hessian matrix H⁻¹. Using this method with stochastic gradient is tricky because the full gradients P′_n(w_{t−1}) and P′_n(w_t) are not readily available. Instead we only have access to the stochastic estimates g_{t−1}(w_{t−1}) and g_t(w_t), which are too noisy to compute good rescaling matrices. The oLBFGS algorithm (Schraudolph et al., 2007) compares instead the derivatives g_{t−1}(w_{t−1}) and g_{t−1}(w_t) for the same example (x_{t−1}, y_{t−1}). This reduces the noise to an acceptable level at the expense of the computation of the additional gradient vector g_{t−1}(w_t).

Compared to the first-order SGD, each iteration of the oLBFGS algorithm computes the additional quantity g_{t−1}(w_t) and updates the list of k rank-one updates. The most expensive part however remains the multiplication of the gradient g_t(w_t) by the low-rank estimate of the inverse Hessian. With k = 10, each iteration of our oLBFGS implementation runs empirically 11 times slower than a first-order SGD iteration.

5.3 SGD-QN

The SGD-QN algorithm estimates a diagonal rescaling matrix using a technique inspired by oLBFGS. For any pair of parameters w_{t−1} and w_t, a Taylor series of the gradient of the primal cost P_n provides the secant equation:

  w_t − w_{t−1} ≈ H_{w_t}⁻¹ ( P′_n(w_t) − P′_n(w_{t−1}) ).   (6)

We would then like to replace the inverse Hessian matrix H_{w_t}⁻¹ by a diagonal estimate B,

  w_t − w_{t−1} ≈ B ( P′_n(w_t) − P′_n(w_{t−1}) ).

Since we are designing a stochastic algorithm, we do not have access to the full gradient P′_n. Following oLBFGS, we replace it by the local gradients g_{t−1}(w_t) and g_{t−1}(w_{t−1}) and obtain

  w_t − w_{t−1} ≈ B ( g_{t−1}(w_t) − g_{t−1}(w_{t−1}) ).

Since we chose to use a diagonal rescaling matrix B, we can write the term-by-term equality

  [w_t − w_{t−1}]_i ≈ B_ii [g_{t−1}(w_t) − g_{t−1}(w_{t−1})]_i,

where the notation [v]_i still represents the i-th coefficient of the vector v. This leads to computing B_ii as the average of the ratio [w_t − w_{t−1}]_i / [g_{t−1}(w_t) − g_{t−1}(w_{t−1})]_i. An online estimation is easily achieved during the course of learning by performing a leaky average of these ratios,

  B_ii ← B_ii + (2/r) ( [w_t − w_{t−1}]_i / [g_{t−1}(w_t) − g_{t−1}(w_{t−1})]_i − B_ii ),   ∀ i = 1 … d,

where the integer r is incremented whenever we update the matrix B.
The weights of the scaling matrix B are initialized to λ⁻¹ because this corresponds to the exact setup of first-order SGD. Since the curvature of the primal cost (1) is always larger than λ, the ratio [g_{t−1}(w_t) − g_{t−1}(w_{t−1})]_i / [w_t − w_{t−1}]_i is always larger than λ. Therefore the coefficients B_ii never exceed their initial value λ⁻¹. Basically these scaling factors slow down the convergence along some axes. The speedup does not occur because we follow the trajectory faster, but because we follow a better trajectory.

Performing the weight update (2) with a diagonal rescaling matrix B consists in performing term-by-term operations with a time complexity that is marginally greater than the complexity of the first-order SGD update (3). The computation of the additional gradient vector g_{t−1}(w_t) and the reestimation of all the coefficients B_ii essentially triples the computing time of a first-order SGD iteration with non-sparse inputs, and is considerably slower than a first-order SGD iteration with sparse inputs implemented as discussed in Section 3. Fortunately this higher computational cost per iteration can be nearly avoided by scheduling the reestimation of the rescaling matrix with the same frequency as the regularization updates. Section 5.1 has shown that a diagonal rescaling matrix does little more than rescaling the input variables. Since a fixed diagonal rescaling matrix already works quite well, there is little need to update its coefficients very often.

Figure 2 compares the SVMSGD2 and SGD-QN algorithms. Whenever SVMSGD2 performs a regularization update, we set the flag updateB to schedule a reestimation of the rescaling coefficients during the next iteration. This is appropriate because both operations have comparable computing times. Therefore the rescaling matrix reestimation schedule can be regulated with the same skip parameter as the regularization updates. In practice, we observe that each SGD-QN iteration demands less than twice the time of a first-order SGD iteration.

Because SGD-QN reestimates the rescaling matrix after a pattern update, special care must be taken when the ratio [w_t − w_{t−1}]_i / [g_{t−1}(w_t) − g_{t−1}(w_{t−1})]_i has the form 0/0 because the corresponding input coefficient [x_{t−1}]_i is zero. Since the secant Equation (6) is valid for any two values of the parameter vector, one can compute the ratios with the parameter vectors w_{t−1} and w_t + ε and derive the correct value by continuity when ε → 0. When [x_{t−1}]_i = 0, we can write

  [(w_t + ε) − w_{t−1}]_i / [g_{t−1}(w_t + ε) − g_{t−1}(w_{t−1})]_i
    = [(w_t + ε) − w_{t−1}]_i / ( λ [(w_t + ε) − w_{t−1}]_i + ( ℓ′(y_{t−1} (w_t + ε)^⊤ x_{t−1}) − ℓ′(y_{t−1} w_{t−1}^⊤ x_{t−1}) ) y_{t−1} [x_{t−1}]_i )
    = 1 / ( λ + ( ℓ′(y_{t−1} (w_t + ε)^⊤ x_{t−1}) − ℓ′(y_{t−1} w_{t−1}^⊤ x_{t−1}) ) y_{t−1} [x_{t−1}]_i / [(w_t + ε) − w_{t−1}]_i )
    = 1 / ( λ + 0/[ε]_i )   →   λ⁻¹   as ε → 0.

  SVMSGD2
    Require: λ, w_0, t_0, T, skip
     1: t = 0; count = skip
     3: while t ≤ T do
     4:   w_{t+1} = w_t − (1/(λ(t + t_0))) ℓ′(y_t w_t^⊤ x_t) y_t x_t
    11:   count = count − 1
    12:   if count ≤ 0 then
    13:     w_{t+1} = w_{t+1} − (skip/(t + t_0)) w_{t+1}
    14:     count = skip
    15:   end if
    16:   t = t + 1
    17: end while
    18: return w_T
    (lines 2 and 5-10 are left blank in the original to align with SGD-QN)

  SGD-QN
    Require: λ, w_0, t_0, T, skip
     1: t = 0; count = skip
     2: B = λ⁻¹ I; updateB = false; r = 2
     3: while t ≤ T do
     4:   w_{t+1} = w_t − (t + t_0)⁻¹ ℓ′(y_t w_t^⊤ x_t) y_t B x_t
     5:   if updateB = true then
     6:     p_t = g_t(w_{t+1}) − g_t(w_t)
     7:     ∀ i, B_ii = B_ii + (2/r) ( [w_{t+1} − w_t]_i [p_t]_i⁻¹ − B_ii )
     8:     ∀ i, B_ii = max(B_ii, 10⁻² λ⁻¹)
     9:     r = r + 1; updateB = false
    10:   end if
    11:   count = count − 1
    12:   if count ≤ 0 then
    13:     w_{t+1} = w_{t+1} − skip λ (t + t_0)⁻¹ B w_{t+1}
    14:     count = skip; updateB = true
    15:   end if
    16:   t = t + 1
    17: end while
    18: return w_T

Figure 2: Detailed pseudo-codes of the SVMSGD2 and SGD-QN algorithms.
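Putting the pieces together, the sketch below is our C++ rendering of the SGD-QN pseudo-code of Figure 2 for dense patterns and the squared hinge loss. It is an illustration rather than the released libsgdqn code, and the helper names are ours.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stochastic gradient  g_t(w) = lambda*w + l'(y w'x) y x  for the squared hinge loss.
static std::vector<double> gradient(const std::vector<double>& w,
                                    const std::vector<double>& x, int y, double lambda) {
    double s = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i) s += w[i] * x[i];
    double dloss = -std::max(0.0, 1.0 - y * s);
    std::vector<double> g(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) g[i] = lambda * w[i] + dloss * y * x[i];
    return g;
}

// One SGD-QN pass (Figure 2, right) over dense patterns.
void sgdqn_epoch(std::vector<double>& w, std::vector<double>& B,   // B: diagonal rescaling matrix
                 const std::vector<std::vector<double>>& x, const std::vector<int>& y,
                 double lambda, double t0, long skip,
                 long& t, long& count, long& r, bool& updateB)
{
    const std::size_t d = w.size();
    for (std::size_t n = 0; n < x.size(); ++n, ++t) {
        std::vector<double> w_old = w;
        std::vector<double> g_old = gradient(w, x[n], y[n], lambda);   // g_t(w_t)
        // pattern update (line 4):  w <- w - (t+t0)^-1 l'(y w'x) y B x
        double s = 0.0;
        for (std::size_t i = 0; i < d; ++i) s += w[i] * x[n][i];
        double dloss = -std::max(0.0, 1.0 - y[n] * s);
        for (std::size_t i = 0; i < d; ++i)
            w[i] -= (1.0 / (t + t0)) * dloss * y[n] * B[i] * x[n][i];
        // scheduled reestimation of the diagonal rescaling coefficients (lines 5-10)
        if (updateB) {
            std::vector<double> g_new = gradient(w, x[n], y[n], lambda);   // g_t(w_{t+1})
            for (std::size_t i = 0; i < d; ++i) {
                double p = g_new[i] - g_old[i];
                double ratio = (p != 0.0) ? (w[i] - w_old[i]) / p : 1.0 / lambda;  // 0/0 limit
                B[i] += (2.0 / r) * (ratio - B[i]);
                B[i] = std::max(B[i], 0.01 / lambda);
            }
            ++r;
            updateB = false;
        }
        // amortized regularization update, which also schedules the next B update (lines 11-15)
        if (--count <= 0) {
            for (std::size_t i = 0; i < d; ++i)
                w[i] -= static_cast<double>(skip) * lambda / (t + t0) * B[i] * w[i];
            count = skip;
            updateB = true;
        }
    }
}
```

Before the first pass, B is set to λ⁻¹ in every coordinate, r to 2, count to skip, and updateB to false, matching lines 1-2 of the SGD-QN pseudo-code.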
6. Experiments

We demonstrate the good scaling properties of SGD-QN in two ways: we present a detailed comparison with other stochastic gradient methods, and we summarize the results obtained on the PASCAL Large Scale Challenge.

  Data Set   Train. Ex.   Test. Ex.   Features   s        λ       t_0    skip
  ALPHA      100,000      50,000      500        1        10⁻⁵    10⁶    16
  DELTA      100,000      50,000      500        1        10⁻⁴    10⁴    16
  RCV1       781,265      23,149      47,152     0.0016   10⁻⁴    10⁵    9,965

Table 4: Data sets and parameters used for the experiments.

Table 4 describes the three binary classification tasks we used for comparative experiments. The Alpha and Delta tasks were defined for the PASCAL Large Scale Challenge (Sonnenburg et al., 2008). We train with the first 100,000 examples and test with the last 50,000 examples of the official training sets because the official testing sets are not available. Alpha and Delta are dense data sets with relatively severe conditioning problems. The third task is the classification of RCV1 documents belonging to class CCAT (Lewis et al., 2004). This task has become a standard benchmark for linear SVMs on sparse data. Despite its larger size, the RCV1 task is much easier than the Alpha and Delta tasks. All methods discussed in this paper perform well on RCV1.

The experiments reported in Section 6.4 use the hinge loss ℓ(s) = max(0, 1 − s). All other experiments use the squared hinge loss ℓ(s) = ½ (max(0, 1 − s))². In practice, there is no need to make the losses twice differentiable by smoothing their behavior near s = 0. Unlike most batch optimizers, stochastic algorithms do not aim directly for nondifferentiable points, but randomly hop around them. The stochastic noise implicitly smoothes the loss.

The SGD, SVMSGD2, oLBFGS, and SGD-QN algorithms were implemented using the same C++ code base (implementations and experiment scripts are available in the libsgdqn library on http://www.mloss.org). All experiments are carried out in single precision. We did not experience numerical accuracy issues, probably because of the influence of the regularization term. Our implementation of oLBFGS maintains a rank 10 rescaling matrix. Setting the oLBFGS gain schedule is rather delicate. We obtained fairly good results by replicating the gain schedule of the VieCRF package (http://www.ofai.at/~jeremy.jancsary). We also propose a comparison with the online dual linear SVM solver (Hsieh et al., 2008) implemented in the LibLinear package (http://www.csie.ntu.edu.tw/~cjlin/liblinear). We did not reimplement this algorithm because the LibLinear implementation has proved as simple and as efficient as ours.

The t_0 parameter is determined using an automatic procedure: since the size of the training set does not affect the results of Theorem 1, we simply pick a subset containing 10% of the training examples, perform one SGD-QN pass over this subset with several values of t_0, and pick the value for which the primal cost decreases the most. These values are given in Table 4.
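The t_0 selection heuristic just described is easy to automate. The sketch below assumes the sgdqn_epoch function from the previous listing and a hypothetical primal_cost helper that evaluates the cost (1); both assumptions are ours, not code from the paper.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// assumed available: the sgdqn_epoch sketch above and a helper evaluating the primal cost (1)
void sgdqn_epoch(std::vector<double>& w, std::vector<double>& B,
                 const std::vector<std::vector<double>>& x, const std::vector<int>& y,
                 double lambda, double t0, long skip,
                 long& t, long& count, long& r, bool& updateB);
double primal_cost(const std::vector<double>& w, const std::vector<std::vector<double>>& x,
                   const std::vector<int>& y, double lambda);   // hypothetical helper

// Pick t0 by running one SGD-QN pass over a 10% subset for each candidate value
// and keeping the candidate that decreases the primal cost the most.
double select_t0(const std::vector<std::vector<double>>& x, const std::vector<int>& y,
                 double lambda, long skip, const std::vector<double>& candidates)
{
    std::size_t m = x.size() / 10;                                   // 10% subset
    std::vector<std::vector<double>> xs(x.begin(), x.begin() + m);
    std::vector<int> ys(y.begin(), y.begin() + m);

    double best_t0 = candidates.front();
    double best_cost = std::numeric_limits<double>::max();
    for (double t0 : candidates) {
        std::vector<double> w(x[0].size(), 0.0), B(x[0].size(), 1.0 / lambda);
        long t = 0, count = skip, r = 2;
        bool updateB = false;
        sgdqn_epoch(w, B, xs, ys, lambda, t0, skip, t, count, r, updateB);
        double cost = primal_cost(w, xs, ys, lambda);
        if (cost < best_cost) { best_cost = cost; best_t0 = t0; }
    }
    return best_t0;
}
```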
6.1 Sparsity Tricks

             ALPHA   RCV1
  SGD        0.13    36.8
  SVMSGD2    0.10    0.20
  SGD-QN     0.21    0.37

Table 5: Time (sec.) for performing one pass over the training set.

Table 5 illustrates the influence of the scheduling tricks described in Section 3. The table displays the training times of SGD and SVMSGD2; the only difference between these two algorithms is the scheduling tricks. SVMSGD2 trains 180 times faster than SGD on the sparse data set RCV1. This table also demonstrates that iterations of the quasi-Newton SGD-QN are not prohibitively expensive.

6.2 Quasi-Newton

Figure 3 shows how the primal cost P_n(w) of the Alpha data set evolves with the number of passes (left) and the training time (right).

[Figure 3: Primal costs according to the number of epochs (left) and the training duration (right) on the Alpha data set. Curves: SVMSGD2, SGD-QN, oLBFGS.]

Compared to the first-order SVMSGD2, both the oLBFGS and SGD-QN algorithms dramatically decrease the number of passes required to achieve similar values of the primal. Even though it uses a more precise approximation of the inverse Hessian, oLBFGS does not perform better after a single pass than SGD-QN. Besides, running a single pass of oLBFGS is much slower than running multiple passes of SVMSGD2 or SGD-QN. The benefits of its second-order approximation are canceled by its greater time requirements per iteration. On the other hand, each SGD-QN iteration is only marginally slower than a SVMSGD2 iteration; the reduction of the number of iterations is sufficient to offset this cost.

6.3 Training Speed

Figure 4 displays the test errors achieved on the Alpha, Delta and RCV1 data sets as a function of the number of passes (left) and the training time (right).

[Figure 4: Test errors (in %) according to the number of epochs (left) and training duration (right) on the Alpha, Delta, and RCV1 data sets. Curves: SVMSGD2, SGD-QN, LibLinear, and oLBFGS (Alpha and Delta only).]

These results show again that both oLBFGS and SGD-QN require fewer iterations than SVMSGD2 to achieve the same test error. However, oLBFGS suffers from the relatively high complexity of its update process. The SGD-QN algorithm is competitive with the dual solver LibLinear on the dense data sets Alpha and Delta; it runs significantly faster on the sparse RCV1 data set.

According to Theorem 4, given a large enough training set, a perfect second-order SGD algorithm would reach the batch test error after a single pass. One-pass learning is attractive when we are dealing with high volume streams of examples that cannot be stored and retrieved quickly. Figure 4 (left) shows that oLBFGS is a little bit closer to that ideal than SGD-QN and could become attractive for problems where the example retrieval time is much greater than the computing time.

6.4 PASCAL Large Scale Challenge Results

The SGD-QN algorithm has been submitted to the "Wild Track" of the PASCAL Large Scale Challenge. Wild Track contributors were free to do anything leading to more efficient and more accurate methods. Forty-two methods have been submitted to this track. Table 6 shows the SGD-QN ranks determined by the organizers of the challenge according to their evaluation criteria. The SGD-QN algorithm always ranks among the top five submissions and ranks first in overall score (tie with another Newton method).
  Data Set   λ       skip     Passes   Rank
  Alpha      10⁻⁵    16       10       1st
  Beta       10⁻⁴    16       15       3rd
  Gamma      10⁻³    16       10       1st
  Delta      10⁻³    16       10       1st
  Epsilon    10⁻⁵    16       10       5th
  Zeta       10⁻⁵    16       10       4th
  OCR        10⁻⁵    16       10       2nd
  Face       10⁻⁵    16       20       4th
  DNA        10⁻³    64       10       2nd
  Webspam    10⁻⁵    71,066   10       4th

Table 6: Parameters and final ranks obtained by SGD-QN in the "Wild Track" of the first PASCAL Large Scale Learning Challenge. All competing algorithms were run by the organizers. (Note: the competition results were obtained with a preliminary version of SGD-QN. In particular, the λ parameters listed above are different from the values used for all experiments in this paper and listed in Table 4.)

7. Conclusion

The SGD-QN algorithm strikes a good compromise for large scale applications because it has low time and memory requirements per iteration and because it reaches competitive test errors after a small number of iterations. We have shown how this performance is the result of a careful design taking into account the theoretical knowledge about second-order SGD and a precise understanding of its computational requirements. Finally, although this contribution presents SGD-QN as a solver for linear SVMs, this algorithm can be easily extended to nonlinear models for which we can analytically compute the gradients. We plan to further investigate the performance of SGD-QN in this context.

Acknowledgments

Part of this work was funded by NSF grant CCR-0325463 and by the EU Network of Excellence PASCAL2. Antoine Bordes was also supported by the French DGA.

Appendix A. Proof of Theorem 1

Define v_t = w_t − w*_n and observe that a second-order Taylor expansion of the primal gives

  P_n(w_t) − P_n(w*_n) = v_t^⊤ H v_t + o(t⁻²) = tr(H v_t v_t^⊤) + o(t⁻²).

Let E_{t−1} represent the conditional expectation over the choice of the example at iteration t − 1, given all the choices made during the previous iterations. Since we assume that convergence takes place, we have

  E_{t−1}[ g_{t−1}(w_{t−1}) g_{t−1}(w_{t−1})^⊤ ] = E_{t−1}[ g_{t−1}(w*_n) g_{t−1}(w*_n)^⊤ ] + o(1) = G + o(1)

and

  E_{t−1}[ g_{t−1}(w_{t−1}) ] = P′_n(w_{t−1}) = H v_{t−1} + o(v_{t−1}) = I_ε H v_{t−1},

where the notation I_ε is a shorthand for I + o(1), that is, a matrix that converges to the identity. Expressing H v_t v_t^⊤ using the generic SGD update (2) gives

  H v_t v_t^⊤ = H v_{t−1} v_{t−1}^⊤ − H v_{t−1} g_{t−1}(w_{t−1})^⊤ B / (t + t_0) − H B g_{t−1}(w_{t−1}) v_{t−1}^⊤ / (t + t_0) + H B g_{t−1}(w_{t−1}) g_{t−1}(w_{t−1})^⊤ B / (t + t_0)²,

  E_{t−1}[ H v_t v_t^⊤ ] = H v_{t−1} v_{t−1}^⊤ − H v_{t−1} v_{t−1}^⊤ H I_ε B / (t + t_0) − H B I_ε H v_{t−1} v_{t−1}^⊤ / (t + t_0) + H B G B / (t + t_0)² + o(t⁻²),

  E_{t−1}[ tr(H v_t v_t^⊤) ] = tr(H v_{t−1} v_{t−1}^⊤) − 2 tr(H B I_ε H v_{t−1} v_{t−1}^⊤) / (t + t_0) + tr(HBGB) / (t + t_0)² + o(t⁻²),

  E[ tr(H v_t v_t^⊤) ] = E[ tr(H v_{t−1} v_{t−1}^⊤) ] − 2 E[ tr(H B I_ε H v_{t−1} v_{t−1}^⊤) ] / (t + t_0) + tr(HBGB) / (t + t_0)² + o(t⁻²).

Let λ_max ≥ λ_min > 1/2 be the extreme eigenvalues of HB. Since, for any positive matrix X,

  (λ_min − o(1)) tr(X) ≤ tr(H B I_ε X) ≤ (λ_max + o(1)) tr(X),

we can bracket E[tr(H v_t v_t^⊤)] between the expressions

  ( 1 − 2λ_max/t + o(1/t) ) E[ tr(H v_{t−1} v_{t−1}^⊤) ] + tr(HBGB)/(t + t_0)² + o(t⁻²)

and

  ( 1 − 2λ_min/t + o(1/t) ) E[ tr(H v_{t−1} v_{t−1}^⊤) ] + tr(HBGB)/(t + t_0)² + o(t⁻²).

By recursively applying this bracket, we obtain

  u_{λ_max}(t + t_0) ≤ E[ tr(H v_t v_t^⊤) ] ≤ u_{λ_min}(t + t_0),

where the notation u_l(·) represents a sequence of reals satisfying the recursive relation

  u_l(t) = ( 1 − 2l/t + o(1/t) ) u_l(t − 1) + tr(HBGB)/t² + o(1/t²).

From (Bottou and LeCun, 2005, Lemma 1), l > 1/2 implies t u_l(t) → tr(HBGB)/(2l − 1). Then

  tr(HBGB)/(2λ_max − 1) · t⁻¹ + o(t⁻¹) ≤ E[ tr(H v_t v_t^⊤) ] ≤ tr(HBGB)/(2λ_min − 1) · t⁻¹ + o(t⁻¹)

and

  tr(HBGB)/(2λ_max − 1) · t⁻¹ + o(t⁻¹) ≤ E[ P_n(w_t) − P_n(w*_n) ] ≤ tr(HBGB)/(2λ_min − 1) · t⁻¹ + o(t⁻¹).
References

S.-I. Amari, H. Park, and K. Fukumizu. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12:1399-1409, 2000.

S. Becker and Y. LeCun. Improving the convergence of back-propagation learning with second-order methods. In Proc. 1988 Connectionist Models Summer School, pages 29-37. Morgan Kaufman, 1989.

A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. J. Machine Learning Research, 6:1579-1619, September 2005.

L. Bottou. Online algorithms and stochastic approximations. In David Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998.

L. Bottou. Stochastic gradient descent on toy problems, 2007. http://leon.bottou.org/projects/sgd.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, volume 20. MIT Press, 2008.

L. Bottou and Y. LeCun. On-line learning for very large datasets. Applied Stochastic Models in Business and Industry, 21(2):137-151, 2005.

X. Driancourt. Optimisation par descente de gradient stochastique de systèmes modulaires combinant réseaux de neurones et programmation dynamique. PhD thesis, Université Paris XI, Orsay, France, 1994.

V. Fabian. Asymptotically efficient stochastic approximation; the RM case. Annals of Statistics, 1(3):486-495, 1973.

C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proc. 25th Intl. Conf. on Machine Learning (ICML'08), pages 408-415. Omnipress, 2008.

Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag, 1998.

D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. J. Machine Learning Research, 5:361-397, 2004.

N. Murata and S.-I. Amari. Statistical analysis of learning dynamics. Signal Processing, 74(1):3-28, 1999.

J. Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35:773-782, 1980.

N. Schraudolph. Local gain adaptation in stochastic gradient descent. In Proc. of the 9th Intl. Conf. on Artificial Neural Networks, pages 569-574, 1999.

N. Schraudolph, J. Yu, and S. Günter. A stochastic quasi-Newton method for online convex optimization. In Proc. 11th Intl. Conf. on Artificial Intelligence and Statistics (AIstats), pages 433-440. Soc. for Artificial Intelligence and Statistics, 2007.

S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. 24th Intl. Conf. on Machine Learning (ICML'07), pages 807-814. ACM, 2007.

S. Sonnenburg, V. Franc, E. Yom-Tov, and M. Sebag. PASCAL large scale learning challenge. ICML'08 Workshop, 2008. http://largescale.first.fraunhofer.de.