Journal of Machine Learning Research 10 (2009) 1737-1754. Submitted 1/09; Revised 4/09; Published 7/09.

SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

Antoine Bordes* (ANTOINE.BORDES@LIP6.FR), LIP6, Université Pierre et Marie Curie, 104 Avenue du Président Kennedy, 75016 Paris, France
Léon Bottou (LEONB@NEC-LABS.COM), NEC Laboratories America, Inc., 4 Independence Way, Princeton, NJ 08540, USA
Patrick Gallinari (PATRICK.GALLINARI@LIP6.FR), LIP6, Université Pierre et Marie Curie, 104 Avenue du Président Kennedy, 75016 Paris, France

Editors: Soeren Sonnenburg, Vojtech Franc, Elad Yom-Tov and Michele Sebag

Abstract

The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of second-order information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires fewer iterations to achieve the same accuracy. This algorithm won the Wild Track of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008).

Keywords: support vector machine, stochastic gradient descent

1. Introduction

The last decades have seen a massive increase of data quantities. In various domains such as biology, networking, or information retrieval, fast classification methods able to scale on millions of training instances are needed. Real-world applications demand learning algorithms with low time and memory requirements. The first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008) was designed to identify which machine learning techniques best address these new concerns. A generic evaluation framework and various data sets have been provided. Evaluations were carried out on the basis of various performance curves such as training time versus test error, data set size versus test error, and data set size versus training time.¹

Our entry in this competition, named SGD-QN, is a carefully designed Stochastic Gradient Descent (SGD) for linear Support Vector Machines (SVM). Nonlinear models could in fact reach much better generalization performance on most of the proposed data sets. Unfortunately, even in the Wild Track case, the evaluation criteria for the competition reward good scaling properties and short training durations more than they punish suboptimal test errors. Nearly all the competitors chose to implement linear models in order to avoid the additional penalty implied by nonlinearities. Although SGD-QN can work on nonlinear models,² we only report its performance in the context of linear SVMs.

*. Also at NEC Laboratories America, Inc.
1. This material and its documentation can be found at http://largescale.first.fraunhofer.de/.
© 2009 Antoine Bordes, Léon Bottou and Patrick Gallinari.

Stochastic algorithms are known for their poor optimization performance. However, in the large scale setup, when the bottleneck is the computing time rather than the number of training examples, Bottou and Bousquet (2008) have shown that stochastic algorithms often yield the best generalization performance in spite of being worse optimizers. SGD algorithms were therefore a natural choice for the Wild Track of the competition, which focuses on the relation between training time and test performance.

SGD algorithms have been the object of a number of recent works. Bottou (2007) and Shalev-Shwartz et al. (2007) demonstrate that the plain Stochastic Gradient Descent yields particularly effective algorithms when the input patterns are very sparse, taking less than O(d) space and time per iteration to optimize a system with d parameters. It can greatly outperform sophisticated batch methods on large data sets but suffers from slow convergence rates, especially on ill-conditioned problems. Various remedies have been proposed. Stochastic Meta-Descent (Schraudolph, 1999) heuristically determines a learning rate for each coefficient of the parameter vector. Although it can solve some ill-conditioning issues, it does not help much for linear SVMs. Natural Gradient Descent (Amari et al., 2000) replaces the learning rate by the inverse of the Riemannian metric tensor. This quasi-Newton stochastic method is statistically efficient but is penalized in practice by the cost of storing and manipulating the metric tensor. Online BFGS (oBFGS) and Online Limited-storage BFGS (oLBFGS) (Schraudolph et al., 2007) are stochastic adaptations of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization algorithm. The limited-storage version of this algorithm is a quasi-Newton stochastic method whose cost per iteration is a small multiple of the cost of a standard SGD iteration. Unfortunately this penalty is often bigger than the gains associated with the quasi-Newton update. Online Dual Solvers for SVMs (Bordes et al., 2005; Hsieh et al., 2008) have also shown good performance on large scale data sets. These solvers can be applied to both linear and nonlinear SVMs. In the linear case, these dual algorithms are surprisingly close to SGD but do not require fiddling with learning rates. Although this is often viewed as an advantage, we feel that this aspect restricts the improvement opportunities.

The contributions of this paper are twofold:

1. We conduct an analysis of different factors, ranging from algorithmic refinements to implementation details, which can affect the learning speed of SGD algorithms.

2. We present a novel algorithm, denoted SGD-QN, that carefully exploits these speedup opportunities. We empirically validate its properties by benchmarking it against state-of-the-art SGD solvers and by summarizing its results at the PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008).

The paper is organized as follows: Section 2 analyses the potential gains of quasi-Newton techniques for SGD algorithms. Sections 3 and 4 discuss the sparsity and implementation issues. Finally Section 5 presents the SGD-QN algorithm, and Section 6 reports experimental results.
2. Stochastic gradient works well in models with nonlinear parametrization. For SVMs with nonlinear kernels, we would prefer dual methods (e.g., Bordes et al., 2005), which can exploit the sparsity of the kernel expansion.

2. Analysis

This section describes our notations and summarizes theoretical results that are relevant to the design of a fast variant of stochastic gradient algorithms.

2.1 SGD for Linear SVMs

Consider a binary classification problem with examples z = (x, y) ∈ R^d × {−1, +1}. The linear SVM classifier is obtained by minimizing the primal cost function

  P_n(w) = (λ/2)‖w‖² + (1/n) Σ_{i=1}^n ℓ(y_i w x_i) = (1/n) Σ_{i=1}^n [ (λ/2)‖w‖² + ℓ(y_i w x_i) ],   (1)

where the hyper-parameter λ > 0 controls the strength of the regularization term. Although typical SVMs use mildly nonregular convex loss functions, we assume in this paper that the loss ℓ(s) is convex and twice differentiable with continuous derivatives (ℓ ∈ C²(R)). This could be simply achieved by smoothing the traditional loss functions in the vicinity of their nonregular points.

Each iteration of the SGD algorithm consists of drawing a random training example (x_t, y_t) and computing a new value of the parameter w_t as

  w_{t+1} = w_t − (t + t₀)⁻¹ B g_t(w_t)   where   g_t(w_t) = λ w_t + ℓ′(y_t w_t x_t) y_t x_t,   (2)

where the rescaling matrix B is positive definite. Since the SVM theory provides simple bounds on the norm of the optimal parameter vector (Shalev-Shwartz et al., 2007), the positive constant t₀ is heuristically chosen to ensure that the first few updates do not produce a parameter with an implausibly large norm.

The traditional first-order SGD algorithm, with decreasing learning rate, is obtained by setting B = λ⁻¹I in the generic update (2):

  w_{t+1} = w_t − (λ(t + t₀))⁻¹ g_t(w_t).   (3)

The second-order SGD algorithm is obtained by setting B to the inverse of the Hessian matrix H = [P″_n(w*_n)] computed at the optimum w*_n of the primal cost P_n(w):

  w_{t+1} = w_t − (t + t₀)⁻¹ H⁻¹ g_t(w_t).   (4)
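As a concrete illustration of the first-order update (3), the following minimal sketch trains a linear SVM with the squared hinge loss ℓ(s) = ½ max(0, 1−s)². The data structures, toy values and function names below are ours, not the paper's libsgdqn code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Example { std::vector<double> x; double y; };

// Primal cost (1) with the squared hinge loss.
double primal(const std::vector<Example>& data,
              const std::vector<double>& w, double lambda) {
    double cost = 0.0;
    for (const Example& e : data) {
        double s = 0.0;
        for (std::size_t i = 0; i < w.size(); ++i) s += w[i] * e.x[i];
        double m = std::max(0.0, 1.0 - e.y * s);
        cost += 0.5 * m * m;
    }
    cost /= data.size();
    double norm2 = 0.0;
    for (double wi : w) norm2 += wi * wi;
    return cost + 0.5 * lambda * norm2;
}

// One pass of the first-order update (3): w <- w - (lambda*(t+t0))^-1 g_t(w).
void sgd_pass(const std::vector<Example>& data, std::vector<double>& w,
              double lambda, double t0, std::size_t& t) {
    for (const Example& e : data) {
        double s = 0.0;
        for (std::size_t i = 0; i < w.size(); ++i) s += w[i] * e.x[i];
        double dloss = -std::max(0.0, 1.0 - e.y * s);  // l'(s) for squared hinge
        double eta = 1.0 / (lambda * (t + t0));
        for (std::size_t i = 0; i < w.size(); ++i)
            w[i] -= eta * (lambda * w[i] + dloss * e.y * e.x[i]);
        ++t;
    }
}
```

For readability the sketch performs sequential passes in the given order; the analysis above assumes randomly shuffled examples.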
Randomly picking examples could lead to expensive random accesses to the slow memory. In practice, one simply performs sequential passes over the randomly shuffled training set.

2.2 What Matters Are the Constant Factors

Bottou and Bousquet (2008) characterize the asymptotic learning properties of stochastic gradient algorithms in the large scale regime, that is, when the bottleneck is the computing time rather than the number of training examples. The first three columns of Table 1 report the time for a single iteration, the number of iterations needed to reach a predefined accuracy ρ, and their product, the time needed to reach accuracy ρ.

  Stochastic Gradient Algorithm | Cost of one iteration | Iterations to reach ρ | Time to reach accuracy ρ | Time to reach excess error ε
  1st-Order SGD | O(d) | νκ²/ρ + o(1/ρ) | O(dνκ²/ρ) | O(dνκ²/ε)
  2nd-Order SGD | O(d²) | ν/ρ + o(1/ρ) | O(d²ν/ρ) | O(d²ν/ε)

Table 1: Asymptotic results for stochastic gradient algorithms, reproduced from Bottou and Bousquet (2008). Compare the second last column (time to optimize) with the last column (time to reach the excess test error ε). Legend: n number of examples; d parameter dimension; c positive constant that appears in the generalization bounds; κ condition number of the Hessian matrix H; ν = tr(GH⁻¹) with G the Fisher matrix (see Theorem 1 for more details). The implicit proportionality coefficients in notations O(·) and o(·) are of course independent of these quantities.

The excess test error E measures how much the test error is worse than the best possible error for this problem. Bottou and Bousquet (2008) decompose the test error as the sum of three terms, E = E_app + E_est + E_opt. The approximation error E_app measures how closely the chosen family of functions can approximate the optimal solution, the estimation error E_est measures the effect of minimizing the empirical risk instead of the expected risk, and the optimization error E_opt measures the impact of the approximate optimization on the generalization performance. The fourth column of Table 1 gives the time necessary to reduce the excess test error E below a target that depends on ε. This is the important metric because the test error is the measure that matters in machine learning.

Both the first-order and the second-order SGD require a time inversely proportional to ε to reach the target test error. Only the constants differ. The second-order algorithm is insensitive to the condition number κ of the Hessian matrix but suffers from a penalty proportional to the dimension d of the parameter vector. Therefore, algorithmic changes that exploit the second-order information in SGD algorithms are unlikely to yield superlinear speedups. We can at best improve the constant factors.

This property is not limited to SGD algorithms. To reach an excess error ε, the most favorable generalization bounds suggest that one needs a number of examples proportional to 1/ε. Therefore, the time complexity of any algorithm that processes a nonvanishing fraction of these examples cannot scale better than 1/ε. In fact, Bottou and Bousquet (2008) obtain slightly worse scaling laws for typical non-stochastic gradient algorithms.

2.3 Limited Storage Approximations of Second-Order SGD

Since the second-order SGD algorithm is penalized by the high cost of performing the update (2) using a full rescaling matrix B = H⁻¹, it is tempting to consider matrices that admit a sparser representation and yet approximate the inverse Hessian well enough to reduce the negative impact of the condition number κ. The following theorem describes how the convergence speed of the generic SGD algorithm (2) is related to the spectrum of matrix HB.

Theorem 1  Let E denote the expectation with respect to the random selection of the examples (x_t, y_t) drawn independently from the training set at each iteration. Let w*_n = argmin_w P_n(w) be an optimum of the primal cost. Define the Hessian matrix H = ∂²P_n(w*_n)/∂w² and the Fisher matrix G = E[ g_t(w*_n) g_t(w*_n)′ ]. If the eigenvalues of HB are in the range λ_max ≥ λ_min > 1/2, and if the SGD algorithm (2) converges to w*_n, the following inequality holds:

  tr(HBGB)/(2λ_max − 1) · t⁻¹ + o(t⁻¹) ≤ E[P_n(w_t) − P_n(w*_n)] ≤ tr(HBGB)/(2λ_min − 1) · t⁻¹ + o(t⁻¹).

The proof of the theorem is provided in the appendix. Note that the theorem assumes that the generic SGD algorithm converges. Convergence in the first-order case holds under very mild assumptions (e.g., Bottou, 1998). Convergence in the generic SGD case holds because it reduces to the first-order case with the change of variable w → B^{−1/2} w. Convergence also holds under slightly stronger assumptions when the rescaling matrix B changes over time (e.g., Driancourt, 1994).

The following two corollaries recover the maximal number of iterations listed in Table 1 with ν = tr(GH⁻¹) and κ = λ⁻¹‖H‖. Corollary 2 gives a very precise equality for the second-order case because the lower bound and the upper bound of the theorem take identical values. Corollary 3 gives a much less refined bound in the first-order case.

Corollary 2  Assume B = H⁻¹ as in the second-order SGD algorithm (4). Under the assumptions of Theorem 1, we have

  E[P_n(w_t) − P_n(w*_n)] = tr(GH⁻¹) t⁻¹ + o(t⁻¹) = ν t⁻¹ + o(t⁻¹).

Corollary 3  Assume B = λ⁻¹I as in the first-order SGD algorithm (3). Under the assumptions of Theorem 1, we have

  E[P_n(w_t) − P_n(w*_n)] ≤ λ⁻² tr(HG) t⁻¹ + o(t⁻¹) ≤ κ²ν t⁻¹ + o(t⁻¹).

An often rediscovered property of second-order SGD provides a useful point of reference:

Theorem 4 (Fabian, 1973; Murata and Amari, 1999; Bottou and LeCun, 2005)  Let w* = argmin_w (λ/2)‖w‖² + E_{xy}[ℓ(y w x)]. Given a sample of n independent examples (x_i, y_i), define w*_n = argmin_w P_n(w) and compute w_n by applying the second-order SGD update (4) to each of the n examples. If they converge, both nE‖w*_n − w*‖² and nE‖w_n − w*‖² converge to a same positive constant K when n increases.

This result means that, asymptotically and on average, the parameter w_n obtained after one pass of second-order SGD is as close to the infinite training set solution w* as the true optimum of the primal w*_n. Therefore, when the training set is large enough, we can expect that a single pass of second-order SGD (n iterations of (4)) optimizes the primal accurately enough to replicate the test error of the actual SVM solution.

When we replace the full second-order rescaling matrix B = H⁻¹ by a more computationally acceptable approximation, Theorem 1 indicates that we lose a constant factor κ on the required number of iterations to reach that accuracy. In other words, we can expect to replicate the SVM test error after κ passes over the randomly reshuffled training set. On the other hand, a well chosen approximation of the rescaling matrix can save a large constant factor on the computation of the generic SGD update (2). The best training times are therefore obtained by carefully trading the quality of the approximation for sparse representations.

  Frequency | Loss
  Special example: n/skip | skip · (λ/2)‖w‖²
  Examples 1 to n: 1 | ℓ(y_i w x_i)
Table 2: The regularization term in the primal cost can be viewed as an additional training example with an arbitrarily chosen frequency and a specific loss function.

2.4 More Speedup Opportunities

We have argued that carefully designed quasi-Newton techniques can save a constant factor on the training times. There are of course many other ways to save constant factors. Exploiting the sparsity of the patterns (see Section 3) can save a constant factor in the cost of each first-order iteration. The benefits are more limited in the second-order case, because the inverse Hessian matrix is usually not sparse. Implementation details (see Section 4) such as compiler technology or parallelization on a predetermined number of processors can also reduce the learning time by constant factors.

Such opportunities are often dismissed as engineering tricks. However they should be considered on an equal footing with quasi-Newton techniques. Constant factors matter regardless of their origin. The following two sections provide a detailed discussion of sparsity and implementation.

3. Scheduling Stochastic Updates to Exploit Sparsity

First-order SGD iterations can be made substantially faster when the patterns x_t are sparse. The first-order SGD update has the form

  w_{t+1} = w_t − α_t w_t − β_t x_t,   (5)

where α_t and β_t are scalar coefficients. Subtracting β_t x_t from the parameter vector involves solely the nonzero coefficients of the pattern x_t. On the other hand, subtracting α_t w_t involves all d coefficients. A naive implementation of (5) would therefore spend most of the time processing this first term.

Shalev-Shwartz et al. (2007) circumvent this problem by representing the parameter w_t as the product s_t v_t of a scalar and a vector. The update (5) can then be computed as s_{t+1} = (1 − α_t) s_t and v_{t+1} = v_t − β_t x_t / s_{t+1}, in time proportional to the number of nonzero coefficients in x_t. Although this simple approach works well for the first-order SGD algorithm, it does not extend nicely to quasi-Newton SGD algorithms.

A more general method consists of treating the regularization term in the primal cost (1) as an additional training example occurring with an arbitrarily chosen frequency and with a specific loss function. Consider examples with the frequencies and losses listed in Table 2 and write the average loss:

  1/(n + n/skip) · [ (n/skip) · skip · (λ/2)‖w‖² + Σ_{i=1}^n ℓ(y_i w x_i) ] = skip/(skip + 1) · [ (λ/2)‖w‖² + (1/n) Σ_{i=1}^n ℓ(y_i w x_i) ].

Minimizing this loss is of course equivalent to minimizing the primal cost (1) with its regularization term. Applying the SGD algorithm to the examples defined in Table 2 separates the regularization updates, which involve the special example, from the pattern updates, which involve the real examples. The parameter skip regulates the relative frequencies of these updates.

The SVMSGD2 algorithm (Bottou, 2007) measures the average pattern sparsity and picks a frequency that ensures that the amortized cost of the regularization update is proportional to the number of nonzero coefficients. Figure 1 compares the pseudo-codes of the naive first-order SGD and of the first-order SVMSGD2.

  SGD
  Require: λ, w₀, t₀, T
  1: t = 0
  2: while t ≤ T do
  3:   w_{t+1} = w_t − (λ(t + t₀))⁻¹ (λ w_t + ℓ′(y_t w_t x_t) y_t x_t)
  9:   t = t + 1
  10: end while
  11: return w_T

  SVMSGD2
  Require: λ, w₀, t₀, T, skip
  1: t = 0; count = skip
  2: while t ≤ T do
  3:   w_{t+1} = w_t − (λ(t + t₀))⁻¹ ℓ′(y_t w_t x_t) y_t x_t
  4:   count = count − 1
  5:   if count ≤ 0 then
  6:     w_{t+1} = w_{t+1} − (skip/(t + t₀)) w_{t+1}
  7:     count = skip
  8:   end if
  9:   t = t + 1
  10: end while
  11: return w_T

Figure 1: Detailed pseudo-codes of the SGD and SVMSGD2 algorithms.

Both algorithms handle the real examples at each iteration (line 3) but SVMSGD2 only performs a regularization update every skip iterations (line 6). Assume s is the average proportion of nonzero coefficients in the patterns x_i and set skip to c/s, where c is a predefined constant (we use c = 16 in our experiments). Each pattern update (line 3) requires s·d operations. Each regularization update (line 6) requires d operations but occurs c/s times less often. The average cost per iteration is therefore proportional to O(sd) instead of O(d).

4. Implementation

In the optimization literature, a superior algorithm implemented with a slow scripting language usually beats careful implementations of inferior algorithms. This is because the superior algorithm minimizes the training error with a higher order of convergence. This is no longer true in the case of large scale machine learning, because we care about the test error instead of the training error. As explained above, algorithm improvements do not improve the order of the test error convergence. They can simply improve constant factors, and therefore compete evenly with implementation improvements. Time spent refining the implementation is time well spent.
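The scheduling trick of Section 3 (Figure 1, right column) can be sketched as follows. The sparse-pattern layout, names and toy constants are ours, not the paper's libsgdqn code; the squared hinge loss is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// A sparse pattern as an ordered list of (index, value) pairs.
struct SparseX { std::vector<std::pair<std::size_t, double>> nz; };

// One pass of SVMSGD2: pattern updates touch only the nonzero coordinates
// of x_t; the dense regularization shrink runs once every `skip` iterations.
void svmsgd2_pass(const std::vector<SparseX>& xs, const std::vector<double>& ys,
                  std::vector<double>& w, double lambda, double t0,
                  int skip, std::size_t& t, int& count) {
    for (std::size_t k = 0; k < xs.size(); ++k) {
        double s = 0.0;
        for (const auto& iv : xs[k].nz) s += w[iv.first] * iv.second;
        double dloss = -std::max(0.0, 1.0 - ys[k] * s);  // squared hinge l'(s)
        double eta = 1.0 / (lambda * (t + t0));
        for (const auto& iv : xs[k].nz)                  // sparse pattern update
            w[iv.first] -= eta * dloss * ys[k] * iv.second;
        if (--count <= 0) {                              // amortized dense update
            double shrink = 1.0 - static_cast<double>(skip) / (t + t0);
            for (double& wi : w) wi *= shrink;
            count = skip;
        }
        ++t;
    }
}
```

With skip chosen as c/s, the dense shrink amortizes to O(sd) per iteration, matching the analysis above.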
There are lots of methods for representing sparse vectors, with sharply different computing requirements for sequential and random access. Our C++ implementation always uses a full vector for representing the parameter w, but handles the patterns x using either a full vector representation or a sparse representation as an ordered list of index/value pairs. Each calculation can be achieved directly on the sparse representation or after a conversion to the full representation (see Table 3). Inappropriate choices have outrageous costs. For example, on a dense data set with 500 attributes, using sparse vectors increases the training time by 50%; on the sparse RCV1 data set (see Table 4), using a sparse vector to represent the parameter w increases the training time by more than 900%.

  | Full | Sparse
  Random access to a single coefficient: | Θ(1) | Θ(s)
  In-place addition into a full vector of dimension d: | Θ(d) | Θ(s)
  In-place addition into a sparse vector with s′ nonzeros: | Θ(d + s′) | Θ(s + s′)

Table 3: Costs of various operations on a vector of dimension d with s nonzero coefficients.

Modern processors often sport specialized instructions to handle vectors and multiple cores. Linear algebra libraries, such as BLAS, may or may not use them in ways that suit our purposes. Compilation flags have nontrivial impacts on the learning times.

Such implementation improvements are often (but not always) orthogonal to the algorithmic improvements described above. The main issue consists of deciding how much development resources are allocated to implementation and to algorithm design. This trade-off depends on the available competencies.

5. SGD-QN: A Careful Diagonal Quasi-Newton SGD

As explained in Section 2, designing an efficient quasi-Newton SGD algorithm involves a careful trade-off between the sparsity of the scaling matrix representation B and the quality of its approximation of the inverse Hessian H⁻¹. The two obvious choices are diagonal approximations (Becker and LeCun, 1989) and low-rank approximations (Schraudolph et al., 2007).

5.1 Diagonal Rescaling Matrices

Among numerous practical suggestions for running SGD algorithms in multilayer neural networks, LeCun et al. (1998) emphatically recommend to rescale each input space feature in order to improve the condition number κ of the Hessian matrix. In the case of a linear model, such preconditioning is similar to using a constant diagonal scaling matrix.

Rescaling the input space defines transformed patterns X_t such that [X_t]_i = b_i [x_t]_i, where the notation [v]_i represents the i-th coefficient of vector v. This transformation does not change the classification if the parameter vectors are modified as [W_t]_i = [w_t]_i / b_i. The first-order SGD update on these modified variables is then

  ∀i = 1…d  [W_{t+1}]_i = [W_t]_i − η_t ( λ[W_t]_i + ℓ′(y_t W_t X_t) y_t [X_t]_i )
                        = [W_t]_i − η_t ( λ[W_t]_i + ℓ′(y_t w_t x_t) y_t b_i [x_t]_i ).

Multiplying by b_i shows how the original parameter vector w_t is affected:

  ∀i = 1…d  [w_{t+1}]_i = [w_t]_i − η_t ( λ[w_t]_i + ℓ′(y_t w_t x_t) y_t b_i² [x_t]_i ).

We observe that rescaling the input is equivalent to multiplying the gradient by a fixed diagonal matrix B whose elements are the squares of the coefficients b_i.

Ideally we would like to make the product BH spectrally close to the identity matrix. Unfortunately we do not know the value of the Hessian matrix H at the optimum w*_n. Instead we could consider the current value of the Hessian H_{w_t} = P″_n(w_t) and compute the diagonal rescaling matrix B that makes BH_{w_t} closest to the identity. This computation could be very costly because it involves the full Hessian matrix. Becker and LeCun (1989) approximate the optimal diagonal rescaling matrix by inverting the diagonal coefficients of the Hessian. The method relies on the analytical derivation of these diagonal coefficients for multilayer neural networks. This derivation does not extend to arbitrary models. It certainly does not work in the case of traditional SVMs because the hinge loss has zero curvature almost everywhere.

5.2 Low Rank Rescaling Matrices

The popular LBFGS optimization algorithm (Nocedal, 1980) maintains a low-rank approximation of the inverse Hessian by storing the k most recent rank-one BFGS updates instead of the full inverse Hessian matrix. When the successive full gradients P′_n(w_{t−1}) and P′_n(w_t) are available, standard rank-one updates can be used to directly estimate the inverse Hessian matrix H⁻¹.
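The equivalence derived in Section 5.1 can be checked numerically. The sketch below (our own toy values; squared hinge loss assumed) performs one SGD step in the rescaled variables W_i = w_i / b_i, X_i = b_i x_i, maps it back, and compares it with the direct update that multiplies the loss gradient by B_ii = b_i².

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One SGD step in the rescaled variables, mapped back to the original coordinates.
std::vector<double> step_via_rescaling(const std::vector<double>& w,
                                       const std::vector<double>& x, double y,
                                       const std::vector<double>& b,
                                       double lambda, double eta) {
    std::size_t d = w.size();
    std::vector<double> W(d), X(d);
    for (std::size_t i = 0; i < d; ++i) { W[i] = w[i] / b[i]; X[i] = b[i] * x[i]; }
    double s = 0.0;
    for (std::size_t i = 0; i < d; ++i) s += W[i] * X[i];   // equals w.x
    double dloss = (1.0 - y * s > 0.0) ? -(1.0 - y * s) : 0.0;  // squared hinge l'
    std::vector<double> out(d);
    for (std::size_t i = 0; i < d; ++i) {
        double Wnext = W[i] - eta * (lambda * W[i] + dloss * y * X[i]);
        out[i] = b[i] * Wnext;                               // back to w coordinates
    }
    return out;
}

// The same step written directly on w, with b_i^2 scaling the loss gradient.
std::vector<double> step_via_diagonal(const std::vector<double>& w,
                                      const std::vector<double>& x, double y,
                                      const std::vector<double>& b,
                                      double lambda, double eta) {
    std::size_t d = w.size();
    double s = 0.0;
    for (std::size_t i = 0; i < d; ++i) s += w[i] * x[i];
    double dloss = (1.0 - y * s > 0.0) ? -(1.0 - y * s) : 0.0;
    std::vector<double> out(d);
    for (std::size_t i = 0; i < d; ++i)
        out[i] = w[i] - eta * (lambda * w[i] + dloss * y * b[i] * b[i] * x[i]);
    return out;
}
```

Both paths produce the same parameters up to floating-point rounding, which is the content of the derivation above.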
Using this method with stochastic gradient is tricky because the full gradients P′_n(w_{t−1}) and P′_n(w_t) are not readily available. Instead we only have access to the stochastic estimates g_{t−1}(w_{t−1}) and g_t(w_t), which are too noisy to compute good rescaling matrices. The oLBFGS algorithm (Schraudolph et al., 2007) compares instead the derivatives g_{t−1}(w_{t−1}) and g_{t−1}(w_t) for the same example (x_{t−1}, y_{t−1}). This reduces the noise to an acceptable level at the expense of the computation of the additional gradient vector g_{t−1}(w_t).

Compared to the first-order SGD, each iteration of the oLBFGS algorithm computes the additional quantity g_{t−1}(w_t) and updates the list of k rank-one updates. The most expensive part however remains the multiplication of the gradient g_t(w_t) by the low-rank estimate of the inverse Hessian. With k = 10, each iteration of our oLBFGS implementation runs empirically 11 times slower than a first-order SGD iteration.

5.3 SGD-QN

The SGD-QN algorithm estimates a diagonal rescaling matrix using a technique inspired by oLBFGS. For any pair of parameters w_{t−1} and w_t, a Taylor series of the gradient of the primal cost provides the secant equation:

  w_t − w_{t−1} ≈ H⁻¹_{w_t} ( P′_n(w_t) − P′_n(w_{t−1}) ).   (6)

We would then like to replace the inverse Hessian matrix H⁻¹_{w_t} by a diagonal estimate B:

  w_t − w_{t−1} ≈ B ( P′_n(w_t) − P′_n(w_{t−1}) ).

Since we are designing a stochastic algorithm, we do not have access to the full gradient P′_n. Following oLBFGS, we replace them by the local gradients g_{t−1}(w_t) and g_{t−1}(w_{t−1}) and obtain

  w_t − w_{t−1} ≈ B ( g_{t−1}(w_t) − g_{t−1}(w_{t−1}) ).

Since we chose to use a diagonal rescaling matrix B, we can write the term-by-term equality

  [w_t − w_{t−1}]_i ≈ B_ii [ g_{t−1}(w_t) − g_{t−1}(w_{t−1}) ]_i,

where the notation [v]_i still represents the i-th coefficient of vector v. This leads to computing B_ii as the average of the ratio [w_t − w_{t−1}]_i / [g_{t−1}(w_t) − g_{t−1}(w_{t−1})]_i. An online estimation is easily achieved during the course of learning by performing a leaky average of these ratios,

  B_ii ← B_ii + (2/r) ( [w_t − w_{t−1}]_i / [g_{t−1}(w_t) − g_{t−1}(w_{t−1})]_i − B_ii )   ∀i = 1…d,

where the integer r is incremented whenever we update the matrix B.

The weights of the scaling matrix B are initialized to λ⁻¹ because this corresponds to the exact setup of first-order SGD. Since the curvature of the primal cost (1) is always larger than λ, the ratio [g_{t−1}(w_t) − g_{t−1}(w_{t−1})]_i / [w_t − w_{t−1}]_i is always larger than λ. Therefore the coefficients B_ii never exceed their initial value λ⁻¹. Basically these scaling factors slow down the convergence along some axes. The speedup does not occur because we follow the trajectory faster, but because we follow a better trajectory.

Performing the weight update (2) with a diagonal rescaling matrix B consists in performing term-by-term operations with a time complexity that is marginally greater than the complexity of the first-order SGD update (3). The computation of the additional gradient vector g_{t−1}(w_t) and the reestimation of all the coefficients B_ii essentially triples the computing time of a first-order SGD iteration with non-sparse inputs (3), and is considerably slower than a first-order SGD iteration with sparse inputs implemented as discussed in Section 3. Fortunately this higher computational cost per iteration can be nearly avoided by scheduling the reestimation of the rescaling matrix with the same frequency as the regularization updates. Section 5.1 has shown that a diagonal rescaling matrix does little more than rescaling the input variables. Since a fixed diagonal rescaling matrix already works quite well, there is little need to update its coefficients very often.

Figure 2 compares the SVMSGD2 and SGD-QN algorithms. Whenever SVMSGD2 performs a regularization update, we set the flag updateB to schedule a reestimation of the rescaling coefficients during the next iteration. This is appropriate because both operations have comparable computing times. Therefore the rescaling matrix reestimation schedule can be regulated with the same skip parameter as the regularization updates. In practice, we observe that each SGD-QN iteration demands less than twice the time of a first-order SGD iteration.

Because SGD-QN reestimates the rescaling matrix after a pattern update, special care must be taken when the ratio [w_t − w_{t−1}]_i / [g_{t−1}(w_t) − g_{t−1}(w_{t−1})]_i has the form 0/0 because the corresponding input coefficient [x_{t−1}]_i is zero. Since the secant Equation (6) is valid for any two values of the parameter vector, one can compute the ratios with parameter vectors w_{t−1} and w_t + ε and derive the correct value by continuity when ε → 0. When [x_{t−1}]_i = 0, we can write

  [(w_t + ε) − w_{t−1}]_i / [g_{t−1}(w_t + ε) − g_{t−1}(w_{t−1})]_i
    = [(w_t + ε) − w_{t−1}]_i / ( λ[(w_t + ε) − w_{t−1}]_i + ( ℓ′(y_{t−1}(w_t + ε)x_{t−1}) − ℓ′(y_{t−1}w_{t−1}x_{t−1}) ) y_{t−1} [x_{t−1}]_i )
    = ( λ + ( ℓ′(y_{t−1}(w_t + ε)x_{t−1}) − ℓ′(y_{t−1}w_{t−1}x_{t−1}) ) y_{t−1} [x_{t−1}]_i / [(w_t + ε) − w_{t−1}]_i )⁻¹
    = ( λ + 0 · [ε]_i⁻¹ )⁻¹ → λ⁻¹ as ε → 0.

  SVMSGD2
  Require: λ, w₀, t₀, T, skip
  1: t = 0; count = skip
  2:
  3: while t ≤ T do
  4:   w_{t+1} = w_t − (λ(t + t₀))⁻¹ ℓ′(y_t w_t x_t) y_t x_t
  11:  count = count − 1
  12:  if count ≤ 0 then
  13:    w_{t+1} = w_{t+1} − skip (t + t₀)⁻¹ w_{t+1}
  14:    count = skip
  15:  end if
  16:  t = t + 1
  17: end while
  18: return w_T

  SGD-QN
  Require: λ, w₀, t₀, T, skip
  1: t = 0; count = skip
  2: B = λ⁻¹I; updateB = false; r = 2
  3: while t ≤ T do
  4:   w_{t+1} = w_t − (t + t₀)⁻¹ ℓ′(y_t w_t x_t) y_t B x_t
  5:   if updateB = true then
  6:     p_t = g_t(w_{t+1}) − g_t(w_t)
  7:     ∀i, B_ii = B_ii + (2/r) ( [w_{t+1} − w_t]_i [p_t]_i⁻¹ − B_ii )
  8:     ∀i, B_ii = max(B_ii, 10⁻² λ⁻¹)
  9:     r = r + 1; updateB = false
  10:  end if
  11:  count = count − 1
  12:  if count ≤ 0 then
  13:    w_{t+1} = w_{t+1} − skip λ (t + t₀)⁻¹ B w_{t+1}
  14:    count = skip; updateB = true
  15:  end if
  16:  t = t + 1
  17: end while
  18: return w_T

Figure 2: Detailed pseudo-codes of the SVMSGD2 and SGD-QN algorithms.

  Data Set | Train. Ex. | Test. Ex. | Features | s | λ | t₀ | skip
  ALPHA | 100,000 | 50,000 | 500 | 1 | 10⁻⁵ | 10⁶ | 16
  DELTA | 100,000 | 50,000 | 500 | 1 | 10⁻⁴ | 10⁴ | 16
  RCV1 | 781,265 | 23,149 | 47,152 | 0.0016 | 10⁻⁴ | 10⁵ | 9,965

Table 4: Data sets and parameters used for experiments.

6. Experiments

We demonstrate the good scaling properties of SGD-QN in two ways: we present a detailed comparison with other stochastic gradient methods, and we summarize the results obtained on the PASCAL Large Scale Challenge.

Table 4 describes the three binary classification tasks we used for comparative experiments. The Alpha and Delta tasks were defined for the PASCAL Large Scale Challenge (Sonnenburg et al., 2008). We train with the first 100,000 examples and test with the last 50,000 examples of the official training sets because the official testing sets are not available. Alpha and Delta are dense data sets with relatively severe conditioning problems. The third task is the classification of RCV1 documents belonging to class CCAT (Lewis et al., 2004). This task has become a standard benchmark for linear SVMs on sparse data. Despite its larger size, the RCV1 task is much easier than the Alpha and Delta tasks. All methods discussed in this paper perform well on RCV1.

  | ALPHA | RCV1
  SGD | 0.13 | 36.8
  SVMSGD2 | 0.10 | 0.20
  SGD-QN | 0.21 | 0.37

Table 5: Time (sec.) for performing one pass over the training set.

The experiments reported in Section 6.4 use the hinge loss ℓ(s) = max(0, 1 − s). All other experiments use the squared hinge loss ℓ(s) = ½ (max(0, 1 − s))². In practice, there is no need to make the losses twice differentiable by smoothing them near their nondifferentiable point. Unlike most batch optimizers, stochastic algorithms do not aim directly for nondifferentiable points, but randomly hop around them. The stochastic noise implicitly smoothes the loss.

The SGD, SVMSGD2, oLBFGS, and SGD-QN algorithms were implemented using the same C++ code base.³ All experiments are carried out in single precision. We did not experience numerical accuracy issues, probably because of the influence of the regularization term. Our implementation of oLBFGS maintains a rank-10 rescaling matrix. Setting the oLBFGS gain schedule is rather delicate. We obtained fairly good results by replicating the gain schedule of the VieCRF package.⁴ We also propose a comparison with the online dual linear SVM solver (Hsieh et al., 2008) implemented in the LibLinear package.⁵ We did not reimplement this algorithm because the LibLinear implementation has proved as simple and as efficient as ours.

The t₀ parameter is determined using an automatic procedure: since the size of the training set does not affect the results of Theorem 1, we simply pick a subset containing 10% of the training examples, perform one SGD-QN pass over this subset with several values for t₀, and pick the value for which the primal cost decreases the most. These values are given in Table 4.

6.1 Sparsity Tricks

Table 5 illustrates the influence of the scheduling tricks described in Section 3. The table displays the training times of SGD and SVMSGD2. The only difference between these two algorithms are the scheduling tricks. SVMSGD2 trains 180 times faster than SGD on the sparse data set RCV1. This table also demonstrates that iterations of the quasi-Newton SGD-QN are not prohibitively expensive.

6.2 Quasi-Newton

Figure 3 shows how the primal cost P_n(w) of the Alpha data set evolves with the number of passes (left) and the training time (right). Compared to the first-order SVMSGD2, both the oLBFGS and SGD-QN algorithms dramatically decrease the number of passes required to achieve similar values of the primal.
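As a concrete illustration of lines 6-8 of the SGD-QN pseudo-code (Figure 2), the diagonal reestimation can be sketched as below. The function name and the test scenario are ours, not the paper's libsgdqn code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fold the secant ratios [w_t - w_{t-1}]_i / [g_{t-1}(w_t) - g_{t-1}(w_{t-1})]_i
// into the diagonal B with a leaky average of gain 2/r, the floor 0.01/lambda,
// and the continuity limit 1/lambda when a zero input coordinate yields 0/0.
void update_diag(std::vector<double>& B, const std::vector<double>& dw,
                 const std::vector<double>& dg, double lambda, int& r) {
    for (std::size_t i = 0; i < B.size(); ++i) {
        // When [x_{t-1}]_i = 0 both differences vanish; the limit is 1/lambda.
        double ratio = (dg[i] != 0.0) ? dw[i] / dg[i] : 1.0 / lambda;
        B[i] += (2.0 / r) * (ratio - B[i]);    // leaky average of the ratios
        B[i] = std::max(B[i], 1e-2 / lambda);  // safeguard floor (Figure 2, line 8)
    }
    ++r;
}
```

Note that with the initial value r = 2 the first reestimation adopts the observed ratios outright, since the leaky-average gain 2/r equals one.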
Even if it uses a more precise approximation of the inverse Hessian, oLBFGS does not perform better after a single pass than SGD-QN. Besides, running a single pass of oLBFGS is much slower than running multiple passes of SVMSGD2 or SGD-QN. The benefits of its second-order approximation are canceled by its greater time requirements per iteration. On the other hand, each SGD-QN iteration is only marginally slower than a SVMSGD2 iteration; the reduction of the number of iterations is sufficient to offset this cost.

3. Implementations and experiment scripts are available in the libsgdqn library on http://www.mloss.org.
4. This can be found at http://www.ofai.at/~jeremy.jancsary.
5. This can be found at http://www.csie.ntu.edu.tw/~cjlin/liblinear.

Figure 3: Primal costs according to the number of epochs (left) and the training duration (right) on the Alpha data set. [Curves: SVMSGD2, SGD-QN, oLBFGS.]

6.3 Training Speed

Figure 4 displays the test errors achieved on the Alpha, Delta and RCV1 data sets as a function of the number of passes (left) and the training time (right). These results show again that both oLBFGS and SGD-QN require fewer iterations than SVMSGD2 to achieve the same test error. However, oLBFGS suffers from the relatively high complexity of its update process. The SGD-QN algorithm is competitive with the dual solver LibLinear on the dense data sets Alpha and Delta; it runs significantly faster on the sparse RCV1 data set.

According to Theorem 4, given a large enough training set, a perfect second-order SGD algorithm would reach the batch test error after a single pass. One-pass learning is attractive when we are dealing with high volume streams of examples that cannot be stored and retrieved quickly. Figure 4 (left) shows that oLBFGS is a little bit closer to that ideal than SGD-QN and could become attractive for problems where the example retrieval time is much greater than the computing time.

6.4 PASCAL Large Scale Challenge Results

The SGD-QN algorithm has been submitted to the Wild Track of the PASCAL Large Scale Challenge. Wild Track contributors were free to do anything leading to more efficient and more accurate methods. Forty-two methods have been submitted to this track. Table 6 shows the SGD-QN ranks determined by the organizers of the challenge according to their evaluation criteria. The SGD-QN algorithm always ranks among the top five submissions and ranks first in overall score (tie with another Newton method).

Figure 4: Test errors (in %) according to the number of epochs (left) and training duration (right). [Panels: Alpha (top), Delta (middle), RCV1 (bottom) data sets; curves: SVMSGD2, SGD-QN, oLBFGS, LibLinear.]

  Data Set | λ | skip | Passes | Rank
  Alpha | 10⁻⁵ | 16 | 10 | 1st
  Beta | 10⁻⁴ | 16 | 15 | 3rd
  Gamma | 10⁻³ | 16 | 10 | 1st
  Delta | 10⁻³ | 16 | 10 | 1st
  Epsilon | 10⁻⁵ | 16 | 10 | 5th
  Zeta | 10⁻⁵ | 16 | 10 | 4th
  OCR | 10⁻⁵ | 16 | 10 | 2nd
  Face | 10⁻⁵ | 16 | 20 | 4th
  DNA | 10⁻³ | 64 | 10 | 2nd
  Webspam | 10⁻⁵ | 71,066 | 10 | 4th
Table 6: Parameters and final ranks obtained by SGD-QN in the Wild Track of the first PASCAL Large Scale Learning Challenge. All competing algorithms were run by the organizers. (Note: the competition results were obtained with a preliminary version of SGD-QN. In particular the λ parameters listed above are different from the values used for all experiments in this paper and listed in Table 4.)

7. Conclusion

The SGD-QN algorithm strikes a good compromise for large scale applications because it has low time and memory requirements per iteration and because it reaches competitive test errors after a small number of iterations. We have shown how this performance is the result of a careful design taking into account the theoretical knowledge about second-order SGD and a precise understanding of its computational requirements. Finally, although this contribution presents SGD-QN as a solver for linear SVMs, this algorithm can be easily extended to nonlinear models for which we can analytically compute the gradients. We plan to further investigate the performance of SGD-QN in this context.

Acknowledgments

Part of this work was funded by NSF grant CCR-0325463 and by the EU Network of Excellence PASCAL2. Antoine Bordes was also supported by the French DGA.

Appendix A. Proof of Theorem 1

Define v_t = w_t − w*_n and observe that a second-order Taylor expansion of the primal gives

  P_n(w_t) − P_n(w*_n) = ½ v_t′ H v_t + o(t⁻¹) = ½ tr(H v_t v_t′) + o(t⁻¹).

Let E_{t−1} represent the conditional expectation over the choice of the example at iteration t − 1 given all the choices made during the previous iterations. Since we assume that convergence takes place, we have

  E_{t−1}[ g_{t−1}(w_{t−1}) g_{t−1}(w_{t−1})′ ] = E_{t−1}[ g_{t−1}(w*_n) g_{t−1}(w*_n)′ ] + o(1) = G + o(1)

and

  E_{t−1}[ g_{t−1}(w_{t−1}) ] = P′_n(w_{t−1}) = H v_{t−1} + o(v_{t−1}) = I_ε H v_{t−1},

where the notation I_ε is a shorthand for I + o(1), that is, a matrix that converges to the identity.

Expressing H v_t v_t′ using the generic SGD update (2) gives

  H v_t v_t′ = H v_{t−1} v_{t−1}′ − H v_{t−1} g_{t−1}(w_{t−1})′ B / (t + t₀) − H B g_{t−1}(w_{t−1}) v_{t−1}′ / (t + t₀) + H B g_{t−1}(w_{t−1}) g_{t−1}(w_{t−1})′ B / (t + t₀)².

Taking the trace and the conditional expectation,

  E_{t−1}[ tr(H v_t v_t′) ] = tr(H v_{t−1} v_{t−1}′) − 2 tr(H B I_ε H v_{t−1} v_{t−1}′) / (t + t₀) + tr(HBGB) / (t + t₀)² + o(t⁻²).

Let λ_max ≥ λ_min > 1/2 be the extreme eigenvalues of HB. Since, for any positive matrix X,

  (λ_min + o(1)) tr(X) ≤ tr(H B I_ε X) ≤ (λ_max + o(1)) tr(X),

we can bracket E[tr(H v_t v_t′)] between the expressions

  (1 − 2λ_max/t + o(1/t)) tr(H v_{t−1} v_{t−1}′) + tr(HBGB)/(t + t₀)² + o(t⁻²)

and

  (1 − 2λ_min/t + o(1/t)) tr(H v_{t−1} v_{t−1}′) + tr(HBGB)/(t + t₀)² + o(t⁻²).

By recursively applying this bracket, we obtain u_{λ_max}(t + t₀) ≤ E[tr(H v_t v_t′)] ≤ u_{λ_min}(t + t₀), where the notation u_l(·) represents a sequence of reals satisfying the recursive relation

  u_l(t) = (1 − 2l/t + o(1/t)) u_l(t − 1) + tr(HBGB)/t² + o(1/t²).

From (Bottou and LeCun, 2005, Lemma 1), l > 1/2 implies t · u_l(t) → tr(HBGB)/(2l − 1). Then

  tr(HBGB)/(2λ_max − 1) t⁻¹ + o(t⁻¹) ≤ E[tr(H v_t v_t′)] ≤ tr(HBGB)/(2λ_min − 1) t⁻¹ + o(t⁻¹)

and

  tr(HBGB)/(2λ_max − 1) t⁻¹ + o(t⁻¹) ≤ E[P_n(w_t) − P_n(w*_n)] ≤ tr(HBGB)/(2λ_min − 1) t⁻¹ + o(t⁻¹). ∎

References

S.-I. Amari, H. Park, and K. Fukumizu. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12:1409, 2000.

S. Becker and Y. LeCun. Improving the convergence of back-propagation learning with second-order methods. In Proc. 1988 Connectionist Models Summer School, pages 29-37. Morgan Kaufman, 1989.

A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. J. Machine Learning Research, 6:1579-1619, September 2005.

L. Bottou. Online algorithms and stochastic approximations. In David Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998.

L. Bottou. Stochastic gradient descent on toy problems, 2007. http://leon.bottou.org/projects/sgd.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. in Neural Information Processing Systems, volume 20. MIT Press, 2008.

L. Bottou and Y. LeCun. On-line learning for very large datasets. Applied Stochastic Models in Business and Industry, 21(2):137-151, 2005.

X. Driancourt. Optimisation par descente de gradient stochastique de systèmes modulaires combinant réseaux de neurones et programmation dynamique. PhD thesis, Université Paris XI, Orsay, France, 1994.

V. Fabian. Asymptotically efficient stochastic approximation; the RM case. Annals of Statistics, 1(3):486-495, 1973.

C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proc. 25th Intl. Conf. on Machine Learning (ICML'08), pages 408-415. Omnipress, 2008.

Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks, Tricks of the Trade.
rade,LectureNotesinComputerScienceLNCS1524.SpringerVerlag,1998.D.D.Lewis,Y.Yang,T.G.Rose,andF.Li.RCV1:Anewbenchmarkcollectionfortextcatego-rizationresearch.J.MachineLearningResearch,5:361397,2004.N.MurataandS.-I.Amari.Statisticalanalysisoflearningdynamics.SignalProcessing,74(1):328,1999.J.Nocedal.Updatingquasi-Newtonmatriceswithlimitedstorage.MathematicsofComputation35:773782,1980.N.Schraudolph.Localgainadaptationinstochasticgradientdescent.InInProc.ofthe9thIntl.Conf.onArticialNeuralNetworks,pages569574,1999.1753 BORDES,BOTTOUANDGALLINARIN.Schraudolph,J.Yu,andS.Günter.Astochasticquasi-Newtonmethodforonlineconvexopti-mization.InProc.11thIntl.Conf.onArticialIntelligenceandStatistics(AIstats),pages433440.Soc.forArticialIntelligenceandStatistics,2007.S.Shalev-Shwartz,Y.Singer,andN.Srebro.Pegasos:PrimalestimatedsubgradientsolverforSVM.InProc.24thIntl.Conf.onMachineLearning(ICML'07),pages807814.ACM,2007.S.Sonnenburg,V.Franc,E.Yom-Tov,andM.Sebag.Pascallargescalelearningchallenge.ICML'08Workshop,2008.http://largescale.first.fraunhofer.de1754
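Editorial note (not part of the original paper): the asymptotic bound of Theorem 1 can be checked numerically. The sketch below uses a hypothetical toy quadratic objective $P(w) = \frac{1}{2} w^\top H w$ with stochastic gradients $g_t(w) = Hw + \xi$, where $\xi$ has covariance $G$, so that the generic update $w_{t+1} = w_t - \frac{1}{t+t_0} B\, g_t(w_t)$ makes $\mathbf{E}[v_t v_t^\top]$ obey an exact covariance recursion; iterating that recursion avoids Monte-Carlo noise. All of $H$, $G$, $B = H^{-1}$, $t_0$, and the horizon $T$ are arbitrary illustrative choices. With $B = H^{-1}$, the eigenvalues of $HB$ are all one, so the theorem predicts $t\, \mathbf{E}[\mathrm{tr}(H v_t v_t^\top)] \to \mathrm{tr}(HBGB)/(2\lambda - 1)$ with $\lambda = 1$.

```python
import numpy as np

# Toy quadratic setup (illustrative values, not from the paper):
# P(w) = 0.5 w'Hw, stochastic gradient g_t(w) = Hw + xi, Cov(xi) = G.
H = np.diag([1.0, 2.0])
G = np.diag([0.5, 0.5])
B = np.linalg.inv(H)           # second-order scaling: HB = I, lambda = 1
t0, T = 10, 2000

# v_t = (I - eta B H) v_{t-1} - eta B xi with eta = 1/(t + t0), so the
# covariance C_t = E[v_t v_t'] satisfies C_t = A C_{t-1} A' + eta^2 B G B'.
C = np.zeros_like(H)           # start from w_0 = w*, i.e., C_0 = 0
for t in range(1, T + 1):
    eta = 1.0 / (t + t0)
    A = np.eye(2) - eta * (B @ H)
    C = A @ C @ A.T + eta**2 * (B @ G @ B.T)

# Theorem 1 limit: t * E[tr(H v_t v_t')] -> tr(HBGB) / (2*lambda - 1).
predicted = np.trace(H @ B @ G @ B) / (2 * 1.0 - 1.0)
observed = T * np.trace(H @ C)
print(predicted, observed)
```

At this horizon the finite-$t$ value sits within about one percent of the predicted limit; the gap shrinks as $T$ grows, matching the $o(t^{-1})$ term in the theorem.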