Journal of Machine Learning Research 10 (2009) 777-801. Submitted 6/08; Revised 11/08; Published 3/09.

Sparse Online Learning via Truncated Gradient

John Langford (JL@YAHOO-INC.COM), Yahoo! Research, New York, NY, USA
Lihong Li (LIHONG@CS.RUTGERS.EDU), Department of Computer Science, Rutgers University, Piscataway, NJ, USA
Tong Zhang (TONGZ@RCI.RUTGERS.EDU), Department of Statistics, Rutgers University, Piscataway, NJ, USA

Editor: Manfred Warmuth

Abstract

We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss functions. This method has several essential properties:
1. The degree of sparsity is continuous: a parameter controls the rate of sparsification from no sparsification to total sparsification.
2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees.
3. The approach works well empirically. We apply the approach to several datasets and find that for datasets with large numbers of features, substantial sparsity is discoverable.

Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso

1. Introduction

We are concerned with machine learning over large datasets. As an example, the largest dataset we use here has over 10^7 sparse examples and 10^9 features using about 10^11 bytes. In this setting, many common approaches fail, simply because they cannot load the dataset into memory or they are not sufficiently efficient. There are roughly two classes of approaches which can work:

1. Parallelize a batch-learning algorithm over many machines (e.g., Chu et al., 2008).
2. Stream the examples to an online-learning algorithm (e.g., Littlestone, 1988; Littlestone et al., 1995; Cesa-Bianchi et al., 1996; Kivinen and Warmuth, 1997).

(Partially supported by NSF grant DMS-0706805. (c) 2009 John Langford, Lihong Li and Tong Zhang.)
This paper focuses on the second approach. Typical online-learning algorithms have at least one weight for every feature, which is too much in some applications for a couple of reasons:

1. Space constraints. If the state of the online-learning algorithm overflows RAM it cannot efficiently run. A similar problem occurs if the state overflows the L2 cache.
2. Test-time constraints on computation. Substantially reducing the number of features can yield substantial improvements in the computational time required to evaluate a new sample.

This paper addresses the problem of inducing sparsity in learned weights while using an online-learning algorithm. There are several ways to do this wrong for our problem. For example:

1. Simply adding L1-regularization to the gradient of an online weight update does not work because gradients don't induce sparsity. The essential difficulty is that a gradient update has the form a + b where a and b are two floats. Very few float pairs add to 0 (or any other default value), so there is little reason to expect a gradient update to accidentally produce sparsity.
2. Simply rounding weights to 0 is problematic because a weight may be small due to being useless or small because it has been updated only once (either at the beginning of training or because the set of features appearing is also sparse). Rounding techniques can also play havoc with standard online-learning guarantees.
3. Black-box wrapper approaches which eliminate features and test the impact of the elimination are not efficient enough. These approaches typically run an algorithm many times, which is particularly undesirable with large datasets.

1.1 What Others Do

In the literature, the Lasso algorithm (Tibshirani, 1996) is commonly used to achieve sparsity for linear regression using L1-regularization. This algorithm does not work automatically in an online fashion. There are two formulations of L1-regularization. Consider a loss function L(w, z_i) which is convex in w, where z_i = (x_i, y_i) is an input/output pair. One is the convex constraint formulation

$$\hat{w} = \arg\min_w \sum_{i=1}^n L(w, z_i) \quad \text{subject to} \quad \|w\|_1 \le s, \qquad (1)$$

where s is a tunable parameter. The other is the soft regularization formulation, where

$$\hat{w} = \arg\min_w \sum_{i=1}^n L(w, z_i) + g\|w\|_1. \qquad (2)$$

With appropriately chosen g, the two formulations are equivalent. The convex constraint formulation has a simple online version using the projection idea of Zinkevich (2003), which requires the projection of the weight w into an L1-ball at every online step. This operation is difficult to implement efficiently for large-scale data with many features even if all examples have sparse features, although recent progress was made (Duchi et al., 2008) to reduce the amortized time complexity to O(k log d), where k is the number of nonzero entries in x_i, and d is the total number of features (i.e., the dimension of x_i).
In contrast, the soft-regularization method is efficient for a batch setting (Lee et al., 2007), so we pursue it here in an online setting, where we develop an algorithm whose complexity is linear in k but independent of d; these algorithms are therefore more efficient in problems where d is prohibitively large.

More recently, Duchi and Singer (2008) propose a framework for empirical risk minimization with regularization called Forward Looking Subgradients, or FOLOS in short. The basic idea is to solve a regularized optimization problem after every gradient-descent step. This family of algorithms allows general convex regularization functions, and reproduces a special case of the truncated gradient algorithm we will introduce in Section 3.3 (with θ set to ∞) when L1-regularization is used.

The Forgetron algorithm (Dekel et al., 2006) is an online-learning algorithm that manages memory use. It operates by decaying the weights on previous examples and then rounding these weights to zero when they become small. The Forgetron is stated for kernelized online algorithms, while we are concerned with the simpler linear setting. When applied to a linear kernel, the Forgetron is not computationally or space competitive with approaches operating directly on feature weights.

A different, Bayesian approach to learning sparse linear classifiers is taken by Balakrishnan and Madigan (2008). Specifically, their algorithms approximate the posterior by a Gaussian distribution, and hence need to store second-order covariance statistics which require O(d^2) space and time per online step. In contrast, our approach is much more efficient, requiring only O(d) space and O(k) time at every online step.

After completing the paper, we learned that Carpenter (2008) independently developed an algorithm similar to ours.

1.2 What We Do

We pursue an algorithmic strategy which can be understood as an online version of an efficient L1 loss optimization approach (Lee et al., 2007). At a high level, our approach works with the soft-regularization formulation (2) and decays the weight to a default value after every online stochastic gradient step. This simple approach enjoys minimal time complexity (which is linear in k and independent of d) as well as a strong performance guarantee, as discussed in Sections 3 and 5. For instance, the algorithm never performs much worse than a standard online-learning algorithm, and the additional loss due to sparsification is controlled continuously with a single real-valued parameter. The theory gives a family of algorithms with convex loss functions for inducing sparsity, one per online-learning algorithm. We instantiate this for square loss and show how an efficient implementation can take advantage of sparse examples in Section 4. In addition to the L1-regularization formulation (2), the family of algorithms we consider also includes some non-convex sparsification techniques.

As mentioned in the introduction, we are mainly interested in sparse online methods for large scale problems with sparse features. For such problems, our algorithm should satisfy the following requirements:

- The algorithm should be computationally efficient: the number of operations per online step should be linear in the number of nonzero features, and independent of the total number of features.
- The algorithm should be memory efficient: it needs to maintain a list of active features, and can insert (when the corresponding weight becomes nonzero) and delete (when the corresponding weight becomes zero) features dynamically.
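The two requirements above suggest keeping the weights in a hash map indexed by feature id. The following is a minimal sketch of such a weight store (our illustration, not code from the paper; the class and method names are ours): only nonzero weights are kept, prediction touches only the nonzero features of an example, and a coordinate is inserted or removed as its weight becomes nonzero or returns to zero.

class SparseWeights:
    """Hash-map weight vector that stores only nonzero coordinates."""

    def __init__(self):
        self.w = {}  # feature id -> weight

    def predict(self, x):
        # x is a sparse example given as {feature_id: value}; cost is O(k),
        # where k is the number of nonzero features of x.
        return sum(self.w.get(j, 0.0) * v for j, v in x.items())

    def add(self, j, delta):
        new = self.w.get(j, 0.0) + delta
        if new == 0.0:
            self.w.pop(j, None)  # delete the feature when its weight reaches zero
        else:
            self.w[j] = new      # insert the feature when its weight becomes nonzero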
Our solution, referred to as truncated gradient, is a simple modification of the standard stochastic gradient rule. It is defined in (6) as an improvement over simpler ideas such as rounding and the sub-gradient method with L1-regularization. The implementation details, showing our methods satisfy the above requirements, are provided in Section 5.

Theoretical results stating how much sparsity is achieved using this method generally require additional assumptions which may or may not be met in practice. Consequently, we rely on experiments in Section 6 to show our method achieves good sparsity in practice. We compare our approach to a few others, including L1-regularization on small data, as well as online rounding of coefficients to zero.

2. Online Learning with Stochastic Gradient Descent

In the setting of standard online learning, we are interested in sequential prediction problems where repeatedly from i = 1, 2, ...:

1. An unlabeled example x_i arrives.
2. We make a prediction based on existing weights w_i ∈ R^d.
3. We observe y_i, let z_i = (x_i, y_i), and incur some known loss L(w_i, z_i) that is convex in the parameter w_i.
4. We update the weights according to some rule: w_{i+1} ← f(w_i).

We want to come up with an update rule f which allows us to bound the sum of losses Σ_{i=1}^T L(w_i, z_i) as well as achieving sparsity. For this purpose, we start with the standard stochastic gradient descent (SGD) rule, which is of the form

$$f(w_i) = w_i - \eta \nabla_1 L(w_i, z_i), \qquad (3)$$

where ∇_1 L(a, b) is a sub-gradient of L(a, b) with respect to the first variable a. The parameter η > 0 is often referred to as the learning rate. In our analysis, we only consider a constant learning rate with fixed η > 0 for simplicity. In theory, it might be desirable to have a decaying learning rate η_i which becomes smaller when i increases, to get the so-called no-regret bound without knowing T in advance. However, if T is known in advance, one can select a constant η accordingly so the regret vanishes as T → ∞. Since our focus is on sparsity, not how to adapt the learning rate, for clarity we use a constant learning rate in the analysis because it leads to simpler bounds.

The above method has been widely used in online learning (Littlestone et al., 1995; Cesa-Bianchi et al., 1996). Moreover, it is argued to be efficient even for solving batch problems, where we repeatedly run the online algorithm over the training data multiple times. For example, the idea has been successfully applied to solve large-scale standard SVM formulations (Shalev-Shwartz et al., 2007; Zhang, 2004). In the scenario outlined in the introduction, online-learning methods are more suitable than some traditional batch-learning methods.
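As a concrete reference point, the plain SGD rule (3) for the square loss can be written in a few lines; this is our own minimal sketch (the function name and the dense NumPy representation are illustrative, not code from the paper).

import numpy as np

def sgd_step(w, x, y, eta):
    """One step of rule (3) for the square loss L(w, z) = (w.x - y)^2.

    The gradient with respect to w is 2 * (w.x - y) * x, so the update is
    w <- w - eta * 2 * (w.x - y) * x.  Note that nothing in this update drives
    a weight exactly to zero, which is the sparsity problem addressed below.
    """
    residual = np.dot(w, x) - y
    return w - eta * 2.0 * residual * x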
Figure 1: Plots for the truncation functions, T_0 and T_1, which are defined in the text.

However, a main drawback of (3) is that it does not achieve sparsity, which we address in this paper. In the literature, the stochastic gradient descent rule is often referred to as gradient descent (GD). There are other variants, such as exponentiated gradient descent (EG). Since our focus in this paper is sparsity, not GD versus EG, we shall only consider modifications of (3) for simplicity.

3. Sparse Online Learning

In this section, we examine several methods for achieving sparsity in online learning. The first idea is simple coefficient rounding, which is the most natural method. We will then consider another method which is the online counterpart of L1-regularization in batch learning. Finally, we combine these two ideas and introduce truncated gradient. As we shall see, all these ideas are closely related.

3.1 Simple Coefficient Rounding

In order to achieve sparsity, the most natural method is to round small coefficients (that are no larger than a threshold θ > 0) to zero after every K online steps. That is, if i/K is not an integer, we use the standard GD rule in (3); if i/K is an integer, we modify the rule as:

$$f(w_i) = T_0(w_i - \eta \nabla_1 L(w_i, z_i), \theta), \qquad (4)$$

where for a vector v = [v_1, ..., v_d] ∈ R^d and a scalar θ ≥ 0, T_0(v, θ) = [T_0(v_1, θ), ..., T_0(v_d, θ)], with T_0 defined by (cf. Figure 1)

$$T_0(v_j, \theta) = \begin{cases} 0 & \text{if } |v_j| \le \theta \\ v_j & \text{otherwise.} \end{cases}$$

That is, we first apply the standard stochastic gradient descent rule, and then round small coefficients to zero.

In general, we should not take K = 1, especially when η is small, since each step modifies w_i by only a small amount. If a coefficient is zero, it remains small after one online update, and the rounding operation pulls it back to zero. Consequently, rounding can be done only after every K steps (with a reasonably large K); in this case, nonzero coefficients have sufficient time to go above the threshold θ. However, if K is too large, then in the training stage we will need to keep many more nonzero features in the intermediate steps before they are rounded to zero. In the extreme case, we may simply round the coefficients in the end, which does not solve the storage problem in the training phase. The sensitivity in choosing an appropriate K is a main drawback of this method; another drawback is the lack of a theoretical guarantee for its online performance.
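A minimal sketch of one rounding step (4) follows (our illustration; the dense NumPy representation and the function name are assumptions, not the paper's reference code): a standard gradient step, followed by T_0 applied coordinate-wise whenever the step index is a multiple of K.

import numpy as np

def rounding_step(w, grad, eta, theta, i, K):
    """Rule (4): an SGD step, then round coefficients with |w_j| <= theta to
    zero on every K-th step.  `grad` is a subgradient of L(w, z_i) in w."""
    w = w - eta * grad
    if i % K == 0:
        w = np.where(np.abs(w) <= theta, 0.0, w)  # coordinate-wise T_0
    return w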
3.2 A Sub-gradient Algorithm for L1-Regularization

In our experiments, we combine rounding-in-the-end-of-training with a simple online sub-gradient method for L1-regularization with a regularization parameter g > 0:

$$f(w_i) = w_i - \eta \nabla_1 L(w_i, z_i) - \eta g\, \mathrm{sgn}(w_i), \qquad (5)$$

where for a vector v = [v_1, ..., v_d], sgn(v) = [sgn(v_1), ..., sgn(v_d)], and sgn(v_j) = 1 when v_j > 0, sgn(v_j) = -1 when v_j < 0, and sgn(v_j) = 0 when v_j = 0. In the experiments, the online method (5) plus rounding in the end is used as a simple baseline. This method does not produce sparse weights online. Therefore it does not handle large-scale problems for which we cannot keep all features in memory.

3.3 Truncated Gradient

In order to obtain an online version of the simple rounding rule in (4), we observe that the direct rounding to zero is too aggressive. A less aggressive version is to shrink the coefficient to zero by a smaller amount. We call this idea truncated gradient.

The amount of shrinkage is measured by a gravity parameter g_i > 0:

$$f(w_i) = T_1(w_i - \eta \nabla_1 L(w_i, z_i), \eta g_i, \theta), \qquad (6)$$

where for a vector v = [v_1, ..., v_d] ∈ R^d and a scalar α ≥ 0, T_1(v, α, θ) = [T_1(v_1, α, θ), ..., T_1(v_d, α, θ)], with T_1 defined by (cf. Figure 1)

$$T_1(v_j, \alpha, \theta) = \begin{cases} \max(0, v_j - \alpha) & \text{if } v_j \in [0, \theta] \\ \min(0, v_j + \alpha) & \text{if } v_j \in [-\theta, 0] \\ v_j & \text{otherwise.} \end{cases}$$

Again, the truncation can be performed every K online steps. That is, if i/K is not an integer, we let g_i = 0; if i/K is an integer, we let g_i = Kg for a gravity parameter g > 0. This particular choice is equivalent to (4) when we set g such that ηKg ≥ θ. This requires a large g when η is small. In practice, one should set a small, fixed g, as implied by our regret bound developed later. In general, the larger the parameters g and θ are, the more sparsity is incurred.

Due to the extra truncation T_1, this method can lead to sparse solutions, which is confirmed in our experiments described later. In those experiments, the degree of sparsity discovered varies with the problem.

A special case, which we will try in the experiments, is to let g = θ in (6). In this case, we can use only one parameter g to control sparsity. Since ηKg ≤ θ when ηK is small, the truncation operation is less aggressive than the rounding in (4). At first sight, the procedure appears to be an ad hoc way to fix (4). However, we can establish a regret bound for this method, showing it is theoretically sound.
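A minimal sketch of one truncated-gradient step (6) follows (our illustration, with assumed function names and a dense NumPy representation); setting theta to infinity gives the special case (7) discussed next, and setting the gravity to zero recovers plain SGD (3).

import numpy as np

def truncate(v, alpha, theta):
    """Coordinate-wise T_1: shrink entries with |v_j| <= theta toward zero by
    alpha; leave larger entries untouched."""
    shrunk = np.sign(v) * np.maximum(0.0, np.abs(v) - alpha)
    return np.where(np.abs(v) <= theta, shrunk, v)

def truncated_gradient_step(w, grad, eta, g_i, theta=np.inf):
    """Rule (6): an SGD step followed by truncation with gravity g_i."""
    return truncate(w - eta * grad, eta * g_i, theta)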
Setting θ = ∞ yields another important special case of (6), which becomes

$$f(w_i) = T(w_i - \eta \nabla_1 L(w_i, z_i), g_i \eta), \qquad (7)$$

where for a vector v = [v_1, ..., v_d] ∈ R^d and a scalar α ≥ 0, T(v, α) = [T(v_1, α), ..., T(v_d, α)], with

$$T(v_j, \alpha) = \begin{cases} \max(0, v_j - \alpha) & \text{if } v_j > 0 \\ \min(0, v_j + \alpha) & \text{otherwise.} \end{cases}$$

The method is a modification of the standard sub-gradient descent method with L1-regularization given in (5). The parameter g_i ≥ 0 controls the sparsity that can be achieved with the algorithm. Note that when g_i = 0, the update rule is identical to the standard stochastic gradient descent rule. In general, we may perform a truncation every K steps. That is, if i/K is not an integer, we let g_i = 0; if i/K is an integer, we let g_i = Kg for a gravity parameter g > 0. The reason for doing so (instead of using a constant g) is that we can perform a more aggressive truncation with gravity parameter Kg after each K steps. This may lead to better sparsity.

An alternative way to derive a procedure similar to (7) is through an application of the convex hull projection idea of Zinkevich (2003) to the L1-regularized loss, as in (5). However, instead of working with the original feature set, we need to consider a 2d-dimensional duplicated feature set [x_i, -x_i], with the non-negativity constraint w_j ≥ 0 for each component of w (w will also have dimension 2d in this case). The resulting method is similar to ours, with a similar theoretical guarantee as in Theorem 3.1. The proof presented in this paper is more specialized to truncated gradient, and directly works with x_i instead of the augmented data [x_i, -x_i]. Moreover, our analysis does not require the loss function to have bounded gradient, and thus can directly handle the least squares loss.

The procedure in (7) can be regarded as an online counterpart of L1-regularization in the sense that it approximately solves an L1-regularization problem in the limit of η → 0. Truncated gradient for L1-regularization is different from (5), which is a naïve application of the stochastic gradient descent rule with an added L1-regularization term. As pointed out in the introduction, the latter fails because it rarely leads to sparsity. Our theory shows that even with sparsification, the prediction performance is still comparable to standard online-learning algorithms. In the following, we develop a general regret bound for this general method, which also shows how the regret may depend on the sparsification parameter g.

3.4 Regret Analysis

Throughout the paper, we use ||·||_1 for the 1-norm and ||·|| for the 2-norm. For reference, we make the following assumption regarding the loss function:

Assumption 3.1 We assume L(w, z) is convex in w, and there exist non-negative constants A and B such that ||∇_1 L(w, z)||^2 ≤ A L(w, z) + B for all w ∈ R^d and z ∈ R^{d+1}.

For linear prediction problems, we have a general loss function of the form L(w, z) = f(w^T x, y). The following are some common loss functions f(·,·) with corresponding choices of parameters A and B (which are not unique), under the assumption sup_x ||x|| ≤ C:

- Logistic: f(p, y) = ln(1 + exp(-py)); A = 0 and B = C^2. This loss is for binary classification problems with y ∈ {±1}.
- SVM (hinge loss): f(p, y) = max(0, 1 - py); A = 0 and B = C^2. This loss is for binary classification problems with y ∈ {±1}.
- Least squares (square loss): f(p, y) = (p - y)^2; A = 4C^2 and B = 0. This loss is for regression problems.

Our main result is Theorem 3.1, which is parameterized by A and B. The proof is left to the appendix. Specializing it to particular losses yields several corollaries. A corollary applicable to the least squares loss is given later in Corollary 4.1.
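For instance, the constants for the square loss can be verified directly; the following short check is ours, added for completeness under the stated assumption ||x|| ≤ C:

$$\nabla_1 L(w, z) = 2\,(w^\top x - y)\,x
\;\Longrightarrow\;
\|\nabla_1 L(w, z)\|^2 = 4\,(w^\top x - y)^2 \|x\|^2 \le 4C^2\,(w^\top x - y)^2 = 4C^2\,L(w, z),$$

so Assumption 3.1 holds with A = 4C^2 and B = 0.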
Theorem 3.1 (Sparse Online Regret) Consider the sparse online update rule (6) with w_1 = 0 and η > 0. If Assumption 3.1 holds, then for all w̄ ∈ R^d we have

$$\frac{1 - 0.5 A \eta}{T} \sum_{i=1}^T \left[ L(w_i, z_i) + \frac{g_i}{1 - 0.5 A \eta} \| w_{i+1} \cdot I(|w_{i+1}| \le \theta) \|_1 \right]
\le \frac{\eta}{2} B + \frac{\|\bar{w}\|^2}{2 \eta T} + \frac{1}{T} \sum_{i=1}^T \left[ L(\bar{w}, z_i) + g_i \| \bar{w} \cdot I(|w_{i+1}| \le \theta) \|_1 \right],$$

where for vectors v = [v_1, ..., v_d] and v' = [v'_1, ..., v'_d], we let

$$\| v \cdot I(|v'| \le \theta) \|_1 = \sum_{j=1}^d |v_j| \, I(|v'_j| \le \theta),$$

and I(·) is the set indicator function.

We state the theorem with a constant learning rate η. As mentioned earlier, it is possible to obtain a result with a variable learning rate where η = η_i decays as i increases. Although this may lead to a no-regret bound without knowing T in advance, it introduces extra complexity to the presentation of the main idea. Since our focus is on sparsity rather than adapting the learning rate, we do not include such a result for clarity. If T is known in advance, then in the above bound one can simply take η = O(1/√T) and the L1-regularized regret is of order O(1/√T).

In the above theorem, the right-hand side involves a term g_i ||w̄ · I(|w_{i+1}| ≤ θ)||_1 depending on w_{i+1}, which is not easily estimated. To remove this dependency, a trivial upper bound of θ = ∞ can be used, leading to the L1 penalty g_i ||w̄||_1. In the general case of θ < ∞, we cannot replace w_{i+1} by w̄ because the effective regularization condition (as shown on the left-hand side) is the non-convex penalty g_i ||w · I(|w| ≤ θ)||_1. Solving such a non-convex formulation is hard both in the online and batch settings. In general, we only know how to efficiently discover a local minimum which is difficult to characterize. Without a good characterization of the local minimum, it is not possible for us to replace g_i ||w̄ · I(|w_{i+1}| ≤ θ)||_1 on the right-hand side by g_i ||w̄ · I(|w̄| ≤ θ)||_1, because such a formulation implies we can efficiently solve a non-convex problem with a simple online update rule. Still, when θ < ∞, one naturally expects the right-hand side penalty g_i ||w̄ · I(|w_{i+1}| ≤ θ)||_1 to be much smaller than the corresponding L1 penalty g_i ||w̄||_1, especially when w̄ has many components close to 0. Therefore the situation with θ < ∞ can potentially yield better performance on some data. This is confirmed in our experiments.

Theorem 3.1 also implies a trade-off between sparsity and regret performance. We may simply consider the case where g_i = g is a constant. When g is small, we have less sparsity but the regret term g ||w̄ · I(|w_{i+1}| ≤ θ)||_1 ≤ g ||w̄||_1 on the right-hand side is also small. When g is large, we are able to achieve more sparsity but the regret g ||w̄ · I(|w_{i+1}| ≤ θ)||_1 on the right-hand side also becomes large. Such a trade-off (sparsity versus prediction accuracy) is empirically studied in Section 6. Our observation suggests we can gain significant sparsity with only a small decrease of accuracy (that is, using a small g).

Now consider the case θ = ∞ and g_i = g. When T → ∞, if we let η → 0 and ηT → ∞, then Theorem 3.1 implies

$$\frac{1}{T} \sum_{i=1}^T \left[ L(w_i, z_i) + g \|w_i\|_1 \right] \le \inf_{\bar{w} \in \mathbb{R}^d} \left[ \frac{1}{T} \sum_{i=1}^T L(\bar{w}, z_i) + g \|\bar{w}\|_1 \right] + o(1).$$

In other words, if we let L'(w, z) = L(w, z) + g||w||_1 be the L1-regularized loss, then the L1-regularized regret is small when η → 0 and T → ∞. In particular, if we let η = 1/√T, then the theorem implies the L1-regularized regret is

$$\sum_{i=1}^T \left( L(w_i, z_i) + g\|w_i\|_1 \right) - \sum_{i=1}^T \left( L(\bar{w}, z_i) + g\|\bar{w}\|_1 \right)
\le \frac{\sqrt{T}}{2} \left( B + \|\bar{w}\|^2 \right) \left( 1 + \frac{A}{2\sqrt{T}} \right)
+ \frac{A}{2\sqrt{T}} \left( \sum_{i=1}^T L(\bar{w}, z_i) + g \sum_{i=1}^T \left( \|\bar{w}\|_1 - \|w_{i+1}\|_1 \right) \right) + o(\sqrt{T}),$$

which is O(√T) for a bounded loss function L and weights w_i. These observations imply our procedure can be regarded as the online counterpart of L1-regularization methods. In the stochastic setting where the examples are drawn iid from some underlying distribution, the sparse online gradient method proposed in this paper solves the L1-regularization problem.
3.5 Stochastic Setting

SGD-based online-learning methods can be used to solve large-scale batch optimization problems, often quite successfully (Shalev-Shwartz et al., 2007; Zhang, 2004). In this setting, we can go through training examples one-by-one in an online fashion, and repeat multiple times over the training data. In this section, we analyze the performance of such a procedure using Theorem 3.1.

To simplify the analysis, instead of assuming we go through the data one by one, we assume each additional data point is drawn from the training data randomly with equal probability. This corresponds to the standard stochastic optimization setting, in which observed samples are iid from some underlying distributions. The following result is a simple consequence of Theorem 3.1. For simplicity, we only consider the case with θ = ∞ and constant gravity g_i = g.

Theorem 3.2 Consider a set of training data z_i = (x_i, y_i) for i = 1, ..., n, and let

$$R(w, g) = \frac{1}{n} \sum_{i=1}^n L(w, z_i) + g \|w\|_1$$

be the L1-regularized loss over the training data. Let ŵ_1 = w_1 = 0, and define recursively for t = 1, 2, ...:

$$w_{t+1} = T(w_t - \eta \nabla_1 L(w_t, z_{i_t}), g\eta), \qquad \hat{w}_{t+1} = \hat{w}_t + \frac{w_{t+1} - \hat{w}_t}{t + 1},$$

where each i_t is drawn from {1, ..., n} uniformly at random. If Assumption 3.1 holds, then at any time T, the following inequalities are valid for all w̄ ∈ R^d:

$$E_{i_1,\ldots,i_T} \left[ (1 - 0.5 A \eta)\, R\!\left( \hat{w}_T, \frac{g}{1 - 0.5 A \eta} \right) \right]
\le E_{i_1,\ldots,i_T} \left[ \frac{1 - 0.5 A \eta}{T} \sum_{i=1}^T R\!\left( w_i, \frac{g}{1 - 0.5 A \eta} \right) \right]
\le \frac{\eta}{2} B + \frac{\|\bar{w}\|^2}{2 \eta T} + R(\bar{w}, g).$$

Proof. Note the recursion of ŵ_t implies

$$\hat{w}_T = \frac{1}{T} \sum_{t=1}^T w_t$$

from telescoping the update rule. Because R(w, g) is convex in w, the first inequality follows directly from Jensen's inequality. It remains to prove the second inequality. Theorem 3.1 implies the following:

$$\frac{1 - 0.5 A \eta}{T} \sum_{t=1}^T \left[ L(w_t, z_{i_t}) + \frac{g}{1 - 0.5 A \eta} \|w_t\|_1 \right]
\le g \|\bar{w}\|_1 + \frac{\eta}{2} B + \frac{\|\bar{w}\|^2}{2 \eta T} + \frac{1}{T} \sum_{t=1}^T L(\bar{w}, z_{i_t}). \qquad (8)$$

Observe that

$$E_{i_t} \left[ L(w_t, z_{i_t}) + \frac{g}{1 - 0.5 A \eta} \|w_t\|_1 \right] = R\!\left( w_t, \frac{g}{1 - 0.5 A \eta} \right)$$

and

$$g \|\bar{w}\|_1 + E_{i_1,\ldots,i_T} \left[ \frac{1}{T} \sum_{t=1}^T L(\bar{w}, z_{i_t}) \right] = R(\bar{w}, g).$$

The second inequality is obtained by taking the expectation with respect to E_{i_1,...,i_T} in (8).
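The recursion in Theorem 3.2 is straightforward to mirror in code. The sketch below is ours (the function and argument names are illustrative): it draws a uniformly random training index at every step, applies rule (7) with constant gravity g, and maintains the running average ŵ_t alongside the current iterate.

import numpy as np

def stochastic_truncated_gradient(X, y, grad_fn, eta, g, T, seed=0):
    """Stochastic setting of Theorem 3.2 for a generic convex loss.
    grad_fn(w, x, y) must return a subgradient of L(w, (x, y)) with respect
    to w.  Returns the final iterate and the running average of the iterates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)       # w_1 = 0
    w_bar = np.zeros(d)   # running average, \hat{w}_1 = 0
    for t in range(1, T + 1):
        i = rng.integers(n)                                          # i_t uniform in {1,...,n}
        step = w - eta * grad_fn(w, X[i], y[i])
        w = np.sign(step) * np.maximum(0.0, np.abs(step) - eta * g)  # rule (7)
        w_bar += (w - w_bar) / (t + 1)                               # averaging recursion
    return w, w_bar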
If we let η → 0 and ηT → ∞, the bound in Theorem 3.2 becomes

$$E\left[ R(\hat{w}_T, g) \right] \le E\left[ \frac{1}{T} \sum_{t=1}^T R(w_t, g) \right] \le \inf_{\bar{w}} R(\bar{w}, g) + o(1).$$

That is, on average, ŵ_T approximately solves the L1-regularization problem

$$\inf_w \left[ \frac{1}{n} \sum_{i=1}^n L(w, z_i) + g \|w\|_1 \right].$$

If we choose a random stopping time T, then the above inequalities say that on average w_T also solves this L1-regularization problem approximately. Therefore in our experiments, we use the last solution w_T instead of the aggregated solution ŵ_T. For practical purposes, this is adequate even though we do not intentionally choose a random stopping time. Since L1-regularization is frequently used to achieve sparsity in the batch learning setting, the connection to L1-regularization can be regarded as an alternative justification for the sparse online algorithm developed in this paper.

4. Truncated Gradient for Least Squares

The method in Section 3 can be directly applied to least squares regression. This leads to Algorithm 1, which implements sparsification for square loss according to Equation (6). In the description, we use the superscripted symbol w^j to denote the j-th component of the vector w (in order to differentiate it from w_i, which we have used to denote the i-th weight vector). For clarity, we also drop the index i from w_i. Although we keep the choice of gravity parameters g_i open in the algorithm description, in practice we only consider the following choice:

$$g_i = \begin{cases} Kg & \text{if } i/K \text{ is an integer} \\ 0 & \text{otherwise.} \end{cases}$$

This may give a more aggressive truncation (thus sparsity) after every K-th iteration. Since we do not have a theorem formalizing how much more sparsity one can gain from this idea, its effect will only be examined empirically in Section 6.

Algorithm 1: Truncated Gradient for Least Squares
  Inputs: threshold θ ≥ 0; gravity sequence g_i ≥ 0; learning rate η ∈ (0, 1); example oracle O.
  Initialize weights w^j ← 0 (j = 1, ..., d).
  For trial i = 1, 2, ...:
    1. Acquire an unlabeled example x = [x^1, x^2, ..., x^d] from oracle O.
    2. For all weights w^j (j = 1, ..., d):
       (a) if w^j > 0 and w^j ≤ θ, then w^j ← max{w^j - g_i η, 0};
       (b) else if w^j < 0 and w^j ≥ -θ, then w^j ← min{w^j + g_i η, 0}.
    3. Compute the prediction: ŷ = Σ_j w^j x^j.
    4. Acquire the label y from oracle O.
    5. Update the weights for all features: w^j ← w^j + 2η(y - ŷ) x^j.

In many online-learning situations (such as web applications), only a small subset of the features have nonzero values for any example x. It is thus desirable to deal with sparsity only in this small subset rather than in all features, while simultaneously inducing sparsity on all feature weights. Moreover, it is important to store only features with non-zero coefficients (if the number of features is too large to be stored in memory, this approach allows us to use a hash table to track only the nonzero coefficients). We describe how this can be implemented efficiently in the next section. For reference, we present a specialization of Theorem 3.1 in the following corollary, which is directly applicable to Algorithm 1.

Corollary 4.1 (Sparse Online Square Loss Regret) If there exists C > 0 such that for all x, ||x|| ≤ C, then for all w̄ ∈ R^d we have

$$\frac{1 - 2C^2\eta}{T} \sum_{i=1}^T \left[ (w_i^\top x_i - y_i)^2 + \frac{g_i}{1 - 2C^2\eta} \|w_i \cdot I(|w_i| \le \theta)\|_1 \right]
\le \frac{\|\bar{w}\|^2}{2 \eta T} + \frac{1}{T} \sum_{i=1}^T \left[ (\bar{w}^\top x_i - y_i)^2 + g_{i+1} \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 \right],$$

where w_i = [w^1, ..., w^d] ∈ R^d is the weight vector used for prediction at the i-th step of Algorithm 1, and (x_i, y_i) is the data point observed at the i-th step.

This corollary explicitly states that the average square loss incurred by the learner (the left-hand side) is bounded by the average square loss of the best weight vector w̄, plus a term related to the size of w̄ which decays as 1/T and an additive offset controlled by the sparsity threshold θ and the gravity parameter g_i.
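Returning to Algorithm 1, a compact sparse rendering of it might look as follows. This is our own sketch, not the Vowpal Wabbit implementation: it keeps only nonzero weights in a dictionary, assumes the example oracle yields sparse examples as {feature_id: value} mappings, and for simplicity uses a constant gravity g at every trial (i.e., K = 1).

def truncated_gradient_least_squares(examples, eta, g, theta):
    """Sketch of Algorithm 1 for square loss on sparse examples."""
    w = {}  # only nonzero weights are stored
    for x, y in examples:
        # Step 2: shrink small stored weights toward zero (gravity g).
        for j in list(w):
            wj = w[j]
            if 0 < wj <= theta:
                wj = max(wj - g * eta, 0.0)
            elif -theta <= wj < 0:
                wj = min(wj + g * eta, 0.0)
            if wj == 0.0:
                del w[j]          # feature truncated away
            else:
                w[j] = wj
        # Steps 3-5: predict, observe the label, take the gradient step.
        y_hat = sum(w.get(j, 0.0) * v for j, v in x.items())
        for j, v in x.items():
            new = w.get(j, 0.0) + 2.0 * eta * (y - y_hat) * v
            if new == 0.0:
                w.pop(j, None)
            else:
                w[j] = new
    return w

Looping only over the stored nonzero weights in step 2 is equivalent to Algorithm 1's loop over all j, because a zero weight is left unchanged by the truncation.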
5. Efficient Implementation

We altered a standard gradient-descent implementation, VOWPAL WABBIT (Langford et al., 2007), according to Algorithm 1. VOWPAL WABBIT optimizes square loss on a linear representation w^T x via gradient descent (3), with a couple of caveats:

1. The prediction is normalized by the square root of the number of nonzero entries in a sparse vector, w^T x / √||x||_0. This alteration is just a constant rescaling on dense vectors which is effectively removable by an appropriate rescaling of the learning rate.
2. The prediction is clipped to the interval [0, 1], implying the loss function is not square loss for unclipped predictions outside of this dynamic range. Instead the update is a constant value, equivalent to the gradient of a linear loss function.

The learning rate in VOWPAL WABBIT is controllable, supporting 1/i decay as well as a constant learning rate (and rates in-between). The program operates in an entirely online fashion, so the memory footprint is essentially just the weight vector, even when the amount of data is very large.

As mentioned earlier, we would like the algorithm's computational complexity to depend linearly on the number of nonzero features of an example, rather than the total number of features. The approach we took was to store a time-stamp τ_j for each feature j. The time-stamp was initialized to the index of the example where feature j was nonzero for the first time. During online learning, we simply went through all nonzero features j of example i, and could "simulate" the shrinkage of w^j after τ_j in a batch mode. These weights are then updated, and their time-stamps are set to i. This lazy-update idea of delaying the shrinkage calculation until needed is the key to an efficient implementation of truncated gradient. Specifically, instead of using update rule (6) for weight w^j, we shrink the weights of all nonzero features j differently, by

$$f(w^j) = T_1\!\left( w^j + 2\eta (y - \hat{y}) x^j,\; \left\lfloor \frac{i - \tau_j}{K} \right\rfloor K \eta g,\; \theta \right),$$

and τ_j is updated by

$$\tau_j \leftarrow \tau_j + \left\lfloor \frac{i - \tau_j}{K} \right\rfloor K.$$

This lazy-update trick can be applied to the other two algorithms given in Section 3. In the coefficient rounding algorithm (4), for instance, for each nonzero feature j of example i, we can first perform a regular gradient descent step on the square loss, and then do the following: if |w^j| is below the threshold θ and i ≥ τ_j + K, we round w^j to 0 and set τ_j to i.

This implementation shows the truncated gradient method satisfies the following requirements needed for solving large scale problems with sparse features:

- The algorithm is computationally efficient: the number of operations per online step is linear in the number of nonzero features, and independent of the total number of features.
- The algorithm is memory efficient: it maintains a list of active features, and a feature can be inserted when observed, and deleted when the corresponding weight becomes zero.

If we directly apply the online projection idea of Zinkevich (2003) to solve (1), then in the update rule (7) one has to pick the smallest g_i ≥ 0 such that ||w_{i+1}||_1 ≤ s. We do not know an efficient method to find this specific g_i using operations independent of the total number of features. A standard implementation relies on sorting all weights, which requires O(d log d) operations, where d is the total number of (nonzero) features. This complexity is unacceptable for our purpose. However, in an important recent work, Duchi et al. (2008) proposed an efficient online L1-projection method. The idea is to use a balanced tree to keep track of weights, which allows efficient threshold finding and tree updates in O(k ln d) operations on average, where k denotes the number of nonzero coefficients in the current training example. Although the algorithm still has a weak dependency on d, it is applicable to large-scale practical applications.

The theoretical analysis presented in this paper shows we can obtain a meaningful regret bound by picking an arbitrary g_i. This is useful because the resulting method is much simpler to implement and is computationally more efficient per online step. Moreover, our method allows non-convex updates closely related to the simple coefficient rounding idea. Due to the complexity of implementing the balanced tree strategy in Duchi et al. (2008), we shall not compare to it in this paper and leave it as a future direction. However, we believe the sparsity achieved with their approach should be comparable to the sparsity achieved with our method.
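To make the lazy-update bookkeeping concrete, the sketch below shows how the batched shrinkage can be applied the next time a stored feature is touched (our illustration; function and variable names are assumptions). The caller invokes it for each nonzero feature of the current example before predicting and taking the gradient step, so the per-example cost stays linear in the number of nonzero features.

def lazy_shrink(w, tau, j, i, K, eta, g, theta):
    """Catch up on the truncations that stored weight j missed since its
    time-stamp tau[j]: floor((i - tau[j]) / K) truncation events, each with
    gravity K * g, applied in one batch as described above.  Updates w and
    tau in place and removes j from w if it is truncated to zero."""
    blocks = (i - tau[j]) // K          # missed truncation events
    if blocks <= 0:
        return
    alpha = blocks * K * eta * g        # accumulated shrinkage amount
    wj = w[j]
    if abs(wj) <= theta:
        sign = 1.0 if wj > 0 else -1.0
        wj = sign * max(0.0, abs(wj) - alpha)
    tau[j] += blocks * K
    if wj == 0.0:
        del w[j]
    else:
        w[j] = wj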
6. Empirical Results

We applied VOWPAL WABBIT with the efficiently implemented sparsify option, as described in the previous section, to a selection of datasets, including eleven datasets from the UCI repository (Asuncion and Newman, 2007), the much larger dataset rcv1 (Lewis et al., 2004), and a private large-scale dataset Big_Ads related to ad interest prediction. While the UCI datasets are useful for benchmark purposes, rcv1 and Big_Ads are more interesting since they embody real-world datasets with large numbers of features, many of which are less informative for making predictions than others. The datasets are summarized in Table 1.

The UCI datasets used do not have many features, so we expect that a large fraction of these features are useful for making predictions. For comparison purposes, as well as to better demonstrate the behavior of our algorithm, we also added 1000 random binary features to those datasets. Each feature has value 1 with probability 0.05 and 0 otherwise.

  Data Set   #features   #train data   #test data   task
  ad         1411        2455          824          classification
  crx        47          526           164          classification
  housing    14          381           125          regression
  krvskp     74          2413          783          classification
  magic04    11          14226         4794         classification
  mushroom   117         6079          2045         classification
  spambase   58          3445          1156         classification
  wbc        10          520           179          classification
  wdbc       31          421           148          classification
  wpbc       33          153           45           classification
  zoo        17          77            24           regression
  rcv1       38853       781265        23149        classification
  Big_Ads    3x10^9      26x10^6       2.7x10^6     classification

Table 1: Data Set Summary.

6.1 Feature Sparsification of Truncated Gradient

In the first set of experiments, we are interested in how much reduction in the number of features is possible without affecting learning performance significantly; specifically, we require the accuracy to be reduced by no more than 1% for classification tasks, and the total square loss to be increased by no more than 1% for regression tasks. As is common practice, we allowed the algorithm to run on the training dataset for multiple passes with a decaying learning rate. For each dataset, we performed 10-fold cross validation over the training set to identify the best set of parameters, including the learning rate η (ranging from 0.1 to 0.5), the sparsification rate g (ranging from 0 to 0.3), the number of passes over the training set (ranging from 5 to 30), and the decay of learning rate across these passes (ranging from 0.5 to 0.9). The optimized parameters were used to train VOWPAL WABBIT on the whole training set. Finally, the learned classifier/regressor was evaluated on the test set. We fixed K = 1 and θ = ∞, and will study the effects of K and θ in later subsections.

Figure 2 shows the fraction of reduced features after sparsification is applied to each dataset. For the UCI datasets, we also include experiments with 1000 random features added to the original feature set. We do not add random features to rcv1 and Big_Ads since the experiment is not as interesting.

For UCI datasets with randomly added features, VOWPAL WABBIT is able to reduce the number of features by a fraction of more than 90%, except for the ad dataset in which only a 71% reduction is observed. This less satisfying result might be improved by a more extensive parameter search in cross validation. However, if we can tolerate a 1.3% decrease in accuracy (instead of 1% as for other datasets) during cross validation, VOWPAL WABBIT is able to achieve a 91.4% reduction, indicating that a large reduction is still possible at the tiny additional cost of a 0.3% accuracy loss. With this slightly more aggressive sparsification, the test-set accuracy drops from 95.9% (when only 1% loss in accuracy is allowed in cross validation) to 95.4%, while the accuracy without sparsification is 96.5%.
Even for the original UCI datasets without artificially added features, VOWPAL WABBIT manages to filter out some of the less useful features while maintaining the same level of performance. For example, for the ad dataset, a reduction of 83.4% is achieved. Compared to the results above, it seems the most effective feature reductions occur on datasets with a large number of less useful features, exactly where sparsification is needed.

For rcv1, more than 75% of the features are removed after the sparsification process, indicating the effectiveness of our algorithm on real-life problems. We were not able to try many parameters in cross validation because of the size of rcv1. It is expected that more reduction is possible when a more thorough parameter search is performed.

The previous results do not exercise the full power of the approach presented here because the standard Lasso (Tibshirani, 1996) is or may be computationally viable on these datasets. We have also applied this approach to a large non-public dataset Big_Ads where the goal is predicting which of two ads was clicked on given context information (the content of the ads and query information). Here, accepting a 0.009 increase in classification error (from error rate 0.329 to error rate 0.338) allows us to reduce the number of features from about 3x10^9 to about 24x10^6, a factor of 125 decrease in the number of features.

For classification tasks, we also study how our sparsification solution affects AUC (Area Under the ROC Curve), which is a standard metric for classification.(1) Using the same sets of parameters from the 10-fold cross validation described above, we find the criterion is not affected significantly by sparsification, and in some cases it is actually slightly improved. The reason may be that our sparsification method removed some of the features that could have confused VOWPAL WABBIT. The ratios of the AUC with and without sparsification for all classification tasks are plotted in Figure 3. Often these ratios are above 98%.

6.2 The Effects of K

As we argued before, using a K value larger than 1 may be desired in the truncated gradient and rounding algorithms. This advantage is empirically demonstrated here. In particular, we try K = 1, K = 10, and K = 20 in both algorithms. As before, cross validation is used to select parameters in the rounding algorithm, including the learning rate η, the number of passes over the data during training, and the learning rate decay over training passes.

Figures 4 and 5 give the AUC vs. number-of-features plots, where each data point is generated by running the respective algorithm using a different value of g (for truncated gradient) and θ (for the rounding algorithm). We used θ = ∞ in truncated gradient.

The effect of K is large in the rounding algorithm. For instance, on the ad dataset the algorithm using K = 1 achieves an AUC of 0.94 with 322 features, while only 13 and 7 features are needed using K = 10 and K = 20, respectively. However, the same benefit of using a larger K is not observed in truncated gradient, although the performances with K = 10 or 20 are at least as good as those with K = 1, and for the spambase dataset further feature reduction is achieved at the same level of performance, reducing the number of features from 76 (when K = 1) to 25 (when K = 10 or 20) with an AUC of about 0.89.
(1) We use AUC here and in later subsections because it is insensitive to the threshold, which is unlike accuracy.

Figure 2: Plots showing the amount of features left after sparsification using truncated gradient for each dataset, when the performance is changed by at most 1% due to sparsification. The solid bar: with the original feature set; the dashed bar: with 1000 random features added to each example. Plot on left: fraction left with respect to the total number of features (original with 1000 artificial features for the dashed bar). Plot on right: fraction left with respect to the original features (not counting the 1000 artificial features in the denominator for the dashed bar).

Figure 3: A plot showing the ratio of the AUC when sparsification is used over the AUC when no sparsification is used. The same process as in Figure 2 is used to determine empirically good parameters. The first result is for the original dataset, while the second result is for the modified dataset where 1000 random features are added to each example.

6.3 The Effects of θ in Truncated Gradient

In this subsection, we empirically study the effect of θ in truncated gradient. The rounding algorithm is also included for comparison due to its similarity to truncated gradient when θ = g. Again, we used cross validation to choose parameters for each θ value tried, and focused on the AUC metric in the eight UCI classification tasks, except the degenerate one of wpbc. We fixed K = 10 in both algorithms.

Figure 4: Effect of K on AUC in the rounding algorithm (AUC vs. number of features for K = 1, 10, 20 on the ad, crx, krvskp, magic04, mushroom, spambase, wbc, and wdbc datasets).

Figure 5: Effect of K on AUC in truncated gradient (same datasets and values of K as in Figure 4).
Figure 6 gives the AUC vs. number-of-features plots, where each data point is generated by running the respective algorithms using a different value of g (for truncated gradient) and θ (for the rounding algorithm). A few observations are in order. First, the results verify the observation that the behavior of truncated gradient with θ = g is similar to the rounding algorithm. Second, these results suggest that, in practice, it may be desirable to use θ = ∞ in truncated gradient because it avoids the local-minimum problem.

Figure 6: Effect of θ on AUC in truncated gradient (AUC vs. number of features for the rounding algorithm, truncated gradient with θ = g, and truncated gradient with θ = ∞, on the ad, crx, krvskp, magic04, mushroom, spambase, wbc, and wdbc datasets).

6.4 Comparison to Other Algorithms

The next set of experiments compares truncated gradient to other algorithms regarding their abilities to balance feature sparsification and performance. Again, we focus on the AUC metric in the UCI classification tasks except wpbc. The algorithms for comparison include:

- The truncated gradient algorithm: we fixed K = 10 and θ = ∞, used cross-validated parameters, and altered the gravity parameter g.
- The rounding algorithm described in Section 3.1: we fixed K = 10, used cross-validated parameters, and altered the rounding threshold θ.
- The sub-gradient algorithm described in Section 3.2: we fixed K = 10, used cross-validated parameters, and altered the regularization parameter g.
- The Lasso (Tibshirani, 1996) for batch L1-regularization: we used a publicly available implementation (Sjöstrand, 2005).

Note that we do not attempt to compare these algorithms on rcv1 and Big_Ads simply because their sizes are too large for the Lasso.

Figure 7 gives the results. First, truncated gradient is consistently competitive with the other two online algorithms and significantly outperformed them in some problems. This suggests the effectiveness of truncated gradient. Second, it is interesting to observe that the qualitative behavior of truncated gradient is often similar to LASSO, especially when very sparse weight vectors are allowed (the left side in the graphs). This is consistent with Theorem 3.2, which shows the relation between them. However, LASSO usually has worse performance when the allowed number of nonzero weights is set too large (the right side of the graphs). In this case, LASSO seems to overfit, while truncated gradient is more robust to overfitting. The robustness of online learning is often attributed to early stopping, which has been extensively discussed in the literature (e.g., Zhang, 2004).

Figure 7: Comparison of four algorithms (AUC vs. number of features for truncated gradient, rounding, the sub-gradient method, and the Lasso, on the ad, crx, krvskp, magic04, mushroom, spambase, wbc, and wdbc datasets).

Finally, it is worth emphasizing that the experiments in this subsection try to shed some light on the relative strengths of these algorithms in terms of feature sparsification. For large datasets such as Big_Ads, only truncated gradient, coefficient rounding, and the sub-gradient algorithms are applicable. As we have shown and argued, the rounding algorithm is quite ad hoc and may not work robustly in some problems, and the sub-gradient algorithm does not lead to sparsity in general during training.

7. Conclusion

This paper covers the first sparsification technique for large-scale online learning with strong theoretical guarantees. The algorithm, truncated gradient, is the natural extension of Lasso-style regression to the online-learning setting.
Theorem 3.1 proves the technique is sound: it never harms performance much compared to standard stochastic gradient descent in adversarial situations. Furthermore, we show the asymptotic solution of one instance of the algorithm is essentially equivalent to Lasso regression, thus justifying the algorithm's ability to produce sparse weight vectors when the number of features is intractably large. The theorem is verified experimentally in a number of problems. In some cases, especially for problems with many irrelevant features, this approach achieves a one or two order of magnitude reduction in the number of features.

Acknowledgments

We thank Alex Strehl for discussions and help in developing VOWPAL WABBIT. Part of this work was done when Lihong Li and Tong Zhang were at Yahoo! Research in 2007.

Appendix A. Proof of Theorem 3.1

The following lemma is the essential step in our analysis.

Lemma 1 Suppose update rule (6) is applied to weight vector w on example z = (x, y) with gravity parameter g_i = g, and results in a weight vector w'. If Assumption 3.1 holds, then for all w̄ ∈ R^d we have

$$(1 - 0.5 A \eta) L(w, z) + g \|w' \cdot I(|w'| \le \theta)\|_1
\le L(\bar{w}, z) + g \|\bar{w} \cdot I(|w'| \le \theta)\|_1 + \frac{\eta}{2} B + \frac{\|\bar{w} - w\|^2 - \|\bar{w} - w'\|^2}{2\eta}.$$

Proof. Consider any target vector w̄ ∈ R^d and let w̃ = w - η∇_1 L(w, z). We have w' = T_1(w̃, gη, θ). Let

$$u(\bar{w}, w') = g \|\bar{w} \cdot I(|w'| \le \theta)\|_1 - g \|w' \cdot I(|w'| \le \theta)\|_1.$$

Then the update equation implies the following:

$$\begin{aligned}
\|\bar{w} - w'\|^2
&\le \|\bar{w} - w'\|^2 + \|w' - \tilde{w}\|^2 \\
&= \|\bar{w} - \tilde{w}\|^2 - 2 (\bar{w} - w')^\top (w' - \tilde{w}) \\
&\le \|\bar{w} - \tilde{w}\|^2 + 2\eta\, u(\bar{w}, w') \\
&= \|\bar{w} - w\|^2 + \|w - \tilde{w}\|^2 + 2 (\bar{w} - w)^\top (w - \tilde{w}) + 2\eta\, u(\bar{w}, w') \\
&= \|\bar{w} - w\|^2 + \eta^2 \|\nabla_1 L(w, z)\|^2 + 2\eta\, (\bar{w} - w)^\top \nabla_1 L(w, z) + 2\eta\, u(\bar{w}, w') \\
&\le \|\bar{w} - w\|^2 + \eta^2 \|\nabla_1 L(w, z)\|^2 + 2\eta \left( L(\bar{w}, z) - L(w, z) \right) + 2\eta\, u(\bar{w}, w') \\
&\le \|\bar{w} - w\|^2 + \eta^2 \left( A L(w, z) + B \right) + 2\eta \left( L(\bar{w}, z) - L(w, z) \right) + 2\eta\, u(\bar{w}, w').
\end{aligned}$$

Here, the first and second equalities follow from algebra, and the third from the definition of w̃. The first inequality follows because a square is always non-negative. The second inequality follows because w' = T_1(w̃, gη, θ), which implies

$$(w' - \tilde{w})^\top w' = -g\eta \|w' \cdot I(|\tilde{w}| \le \theta)\|_1 = -g\eta \|w' \cdot I(|w'| \le \theta)\|_1
\quad \text{and} \quad |w'_j - \tilde{w}_j| \le g\eta\, I(|w'_j| \le \theta).$$

Therefore,

$$-(\bar{w} - w')^\top (w' - \tilde{w}) = -\bar{w}^\top (w' - \tilde{w}) + w'^\top (w' - \tilde{w})
\le \sum_{j=1}^d |\bar{w}_j| \, |w'_j - \tilde{w}_j| + (w' - \tilde{w})^\top w'
\le g\eta \sum_{j=1}^d |\bar{w}_j| \, I(|w'_j| \le \theta) + (w' - \tilde{w})^\top w'
= \eta\, u(\bar{w}, w').$$

The third inequality follows from the definition of the sub-gradient of a convex function, implying

$$(\bar{w} - w)^\top \nabla_1 L(w, z) \le L(\bar{w}, z) - L(w, z)$$

for all w and w̄; the fourth inequality follows from Assumption 3.1. Rearranging the above inequality leads to the desired bound.
Proof (of Theorem 3.1). Applying Lemma 1 to the update on trial i gives

$$(1 - 0.5 A \eta) L(w_i, z_i) + g_i \|w_{i+1} \cdot I(|w_{i+1}| \le \theta)\|_1
\le L(\bar{w}, z_i) + \frac{\|\bar{w} - w_i\|^2 - \|\bar{w} - w_{i+1}\|^2}{2\eta} + g_i \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 + \frac{\eta}{2} B.$$

Now summing over i = 1, 2, ..., T, we obtain

$$\begin{aligned}
\sum_{i=1}^T \left[ (1 - 0.5 A \eta) L(w_i, z_i) + g_i \|w_{i+1} \cdot I(|w_{i+1}| \le \theta)\|_1 \right]
&\le \sum_{i=1}^T \left[ \frac{\|\bar{w} - w_i\|^2 - \|\bar{w} - w_{i+1}\|^2}{2\eta} + L(\bar{w}, z_i) + g_i \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 + \frac{\eta}{2} B \right] \\
&= \frac{\|\bar{w} - w_1\|^2 - \|\bar{w} - w_{T+1}\|^2}{2\eta} + \frac{\eta}{2} T B + \sum_{i=1}^T \left[ L(\bar{w}, z_i) + g_i \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 \right] \\
&\le \frac{\|\bar{w}\|^2}{2\eta} + \frac{\eta}{2} T B + \sum_{i=1}^T \left[ L(\bar{w}, z_i) + g_i \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 \right].
\end{aligned}$$

The first equality follows from the telescoping sum and the second inequality follows from the initial condition (all weights are zero) and dropping negative quantities. The theorem follows by dividing with respect to T and rearranging terms.

References

Arthur Asuncion and David J. Newman. UCI machine learning repository, 2007. University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/MLRepository.html.

Suhrid Balakrishnan and David Madigan. Algorithms for sparse linear classifiers in the massive data setting. Journal of Machine Learning Research, 9:313–337, 2008.

Bob Carpenter. Lazy sparse stochastic gradient descent for regularized multinomial logistic regression. Technical report, April 2008.

Nicolò Cesa-Bianchi, Philip M. Long, and Manfred Warmuth. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7(3):604–619, 1996.

Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems 20 (NIPS-07), 2008.

Ofer Dekel, Shai Shalev-Shwartz, and Yoram Singer. The Forgetron: A kernel-based perceptron on a fixed budget. In Advances in Neural Information Processing Systems 18 (NIPS-05), pages 259–266, 2006.

John Duchi and Yoram Singer. Online and batch learning using forward looking subgradients. Unpublished manuscript, September 2008.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the L1-ball for learning in high dimensions. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML-08), pages 272–279, 2008.

Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

John Langford, Lihong Li, and Alexander L. Strehl. Vowpal Wabbit (fast online learning), 2007. http://hunch.net/~vw/.

Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems 19 (NIPS-06), pages 801–808, 2007.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.

Nick Littlestone, Philip M. Long, and Manfred K. Warmuth. On-line learning of linear functions. Computational Complexity, 5(2):1–23, 1995.

Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML-07), 2007.

Karl Sjöstrand. Matlab implementation of LASSO, LARS, the elastic net and SPCA, June 2005. Version 2.0, http://www2.imm.dtu.dk/pubdb/p.php?3897.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, B., 58(1):267–288, 1996.

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML-04), pages 919–926, 2004.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-03), pages 928–936, 2003.