Journal of Machine Learning Research 10 (2009) 777-801. Submitted 6/08; Revised 11/08; Published 3/09.

Sparse Online Learning via Truncated Gradient

John Langford (JL@YAHOO-INC.COM), Yahoo! Research, New York, NY, USA
Lihong Li (LIHONG@CS.RUTGERS.EDU), Department of Computer Science, Rutgers University, Piscataway, NJ, USA
Tong Zhang (TONGZ@RCI.RUTGERS.EDU), Department of Statistics, Rutgers University, Piscataway, NJ, USA

Editor: Manfred Warmuth

Abstract

We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss functions. This method has several essential properties:
1. The degree of sparsity is continuous: a parameter controls the rate of sparsification, from no sparsification to total sparsification.
2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees.
3. The approach works well empirically. We apply the approach to several data sets and find that for data sets with large numbers of features, substantial sparsity is discoverable.

Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso

1. Introduction

We are concerned with machine learning over large data sets. As an example, the largest data set we use here has over 10^7 sparse examples and 10^9 features using about 10^11 bytes. In this setting, many common approaches fail, simply because they cannot load the data set into memory or they are not sufficiently efficient. There are roughly two classes of approaches which can work:

1. Parallelize a batch-learning algorithm over many machines (e.g., Chu et al., 2008).
2. Stream the examples to an online-learning algorithm (e.g., Littlestone, 1988; Littlestone et al., 1995; Cesa-Bianchi et al., 1996; Kivinen and Warmuth, 1997).

(Partially supported by NSF grant DMS-0706805. Copyright 2009 John Langford, Lihong Li and Tong Zhang.)
This paper focuses on the second approach. Typical online-learning algorithms have at least one weight for every feature, which is too much in some applications for a couple of reasons:

1. Space constraints. If the state of the online-learning algorithm overflows RAM it cannot run efficiently. A similar problem occurs if the state overflows the L2 cache.
2. Test-time constraints on computation. Substantially reducing the number of features can yield substantial improvements in the computational time required to evaluate a new sample.

This paper addresses the problem of inducing sparsity in learned weights while using an online-learning algorithm. There are several ways to do this wrong for our problem. For example:

1. Simply adding L1-regularization to the gradient of an online weight update does not work because gradients do not induce sparsity. The essential difficulty is that a gradient update has the form a + b where a and b are two floats. Very few float pairs add to 0 (or any other default value), so there is little reason to expect a gradient update to accidentally produce sparsity.
2. Simply rounding weights to 0 is problematic because a weight may be small due to being useless or small because it has been updated only once (either at the beginning of training or because the set of features appearing is also sparse). Rounding techniques can also play havoc with standard online-learning guarantees.
3. Black-box wrapper approaches which eliminate features and test the impact of the elimination are not efficient enough. These approaches typically run an algorithm many times, which is particularly undesirable with large data sets.

1.1 What Others Do

In the literature, the Lasso algorithm (Tibshirani, 1996) is commonly used to achieve sparsity for linear regression using L1-regularization. This algorithm does not work automatically in an online fashion. There are two formulations of L1-regularization. Consider a loss function L(w, z_i) which is convex in w, where z_i = (x_i, y_i) is an input/output pair. One is the convex constraint formulation

$$\hat w = \arg\min_w \sum_{i=1}^{n} L(w, z_i) \quad \text{subject to} \quad \|w\|_1 \le s, \qquad (1)$$

where s is a tunable parameter. The other is the soft regularization formulation, where

$$\hat w = \arg\min_w \sum_{i=1}^{n} L(w, z_i) + g\|w\|_1. \qquad (2)$$

With appropriately chosen g, the two formulations are equivalent. The convex constraint formulation has a simple online version using the projection idea of Zinkevich (2003), which requires the projection of the weight w into an L1-ball at every online step. This operation is difficult to implement efficiently for large-scale data with many features even if all examples have sparse features, although recent progress was made (Duchi et al., 2008) to reduce the amortized time complexity to O(k log d), where k is the number of nonzero entries in x_i and d is the total number of features (i.e., the dimension of x_i).
In contrast, the soft-regularization method is efficient for a batch setting (Lee et al., 2007), so we pursue it here in an online setting where we develop an algorithm whose complexity is linear in k but independent of d; these algorithms are therefore more efficient in problems where d is prohibitively large.

More recently, Duchi and Singer (2008) propose a framework for empirical risk minimization with regularization called Forward Looking Subgradients, or FOLOS in short. The basic idea is to solve a regularized optimization problem after every gradient-descent step. This family of algorithms allows general convex regularization functions, and reproduces a special case of the truncated gradient algorithm we will introduce in Section 3.3 (with θ set to ∞) when L1-regularization is used.

The Forgetron algorithm (Dekel et al., 2006) is an online-learning algorithm that manages memory use. It operates by decaying the weights on previous examples and then rounding these weights to zero when they become small. The Forgetron is stated for kernelized online algorithms, while we are concerned with the simpler linear setting. When applied to a linear kernel, the Forgetron is not computationally or space competitive with approaches operating directly on feature weights.

A different, Bayesian approach to learning sparse linear classifiers is taken by Balakrishnan and Madigan (2008). Specifically, their algorithms approximate the posterior by a Gaussian distribution, and hence need to store second-order covariance statistics which require O(d^2) space and time per online step. In contrast, our approach is much more efficient, requiring only O(d) space and O(k) time at every online step.

After completing the paper, we learned that Carpenter (2008) independently developed an algorithm similar to ours.

1.2 What We Do

We pursue an algorithmic strategy which can be understood as an online version of an efficient L1 loss optimization approach (Lee et al., 2007). At a high level, our approach works with the soft-regularization formulation (2) and decays the weight to a default value after every online stochastic gradient step. This simple approach enjoys minimal time complexity (which is linear in k and independent of d) as well as strong performance guarantees, as discussed in Sections 3 and 5. For instance, the algorithm never performs much worse than a standard online-learning algorithm, and the additional loss due to sparsification is controlled continuously with a single real-valued parameter. The theory gives a family of algorithms with convex loss functions for inducing sparsity, one per online-learning algorithm. We instantiate this for square loss and show how an efficient implementation can take advantage of sparse examples in Section 4. In addition to the L1-regularization formulation (2), the family of algorithms we consider also includes some non-convex sparsification techniques.

As mentioned in the introduction, we are mainly interested in sparse online methods for large-scale problems with sparse features. For such problems, our algorithm should satisfy the following requirements (a minimal sketch of a data structure meeting them follows this list):

- The algorithm should be computationally efficient: the number of operations per online step should be linear in the number of nonzero features, and independent of the total number of features.
- The algorithm should be memory efficient: it needs to maintain a list of active features, and can insert (when the corresponding weight becomes nonzero) and delete (when the corresponding weight becomes zero) features dynamically.
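To make the memory requirement concrete, here is a minimal illustrative sketch, not taken from the paper, of a sparse weight container in Python; the class and method names are our own assumptions.

```python
class SparseWeights:
    """Illustrative sparse weight vector: stores only nonzero coefficients."""

    def __init__(self):
        self.w = {}  # feature index -> nonzero weight

    def get(self, j):
        # Absent features implicitly carry weight zero.
        return self.w.get(j, 0.0)

    def set(self, j, value):
        # Insert a feature when its weight becomes nonzero,
        # delete it when the weight returns to exactly zero.
        if value == 0.0:
            self.w.pop(j, None)
        else:
            self.w[j] = value

    def predict(self, x):
        # x is a dict of the example's nonzero features; the cost is
        # O(k) in the number of nonzero features, independent of d.
        return sum(v * self.get(j) for j, v in x.items())
```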
Our solution, referred to as truncated gradient, is a simple modification of the standard stochastic gradient rule. It is defined in (6) as an improvement over simpler ideas such as rounding and the sub-gradient method with L1-regularization. The implementation details, showing our methods satisfy the above requirements, are provided in Section 5.

Theoretical results stating how much sparsity is achieved using this method generally require additional assumptions which may or may not be met in practice. Consequently, we rely on experiments in Section 6 to show our method achieves good sparsity in practice. We compare our approach to a few others, including L1-regularization on small data, as well as online rounding of coefficients to zero.

2. Online Learning with Stochastic Gradient Descent

In the setting of standard online learning, we are interested in sequential prediction problems where repeatedly, for i = 1, 2, ...:

1. An unlabeled example x_i arrives.
2. We make a prediction based on existing weights w_i ∈ R^d.
3. We observe y_i, let z_i = (x_i, y_i), and incur some known loss L(w_i, z_i) that is convex in the parameter w_i.
4. We update the weights according to some rule: w_{i+1} ← f(w_i).

We want to come up with an update rule f which allows us to bound the sum of losses Σ_i L(w_i, z_i) as well as achieving sparsity. For this purpose, we start with the standard stochastic gradient descent (SGD) rule, which is of the form

$$f(w_i) = w_i - \eta\,\nabla_1 L(w_i, z_i), \qquad (3)$$

where ∇_1 L(a, b) is a sub-gradient of L(a, b) with respect to the first variable a. The parameter η > 0 is often referred to as the learning rate. In our analysis, we only consider a constant learning rate with fixed η > 0 for simplicity. In theory, it might be desirable to have a decaying learning rate η_i which becomes smaller when i increases, to get the so-called no-regret bound without knowing T in advance. However, if T is known in advance, one can select a constant η accordingly so the regret vanishes as T → ∞. Since our focus is on sparsity, not how to adapt the learning rate, for clarity we use a constant learning rate in the analysis because it leads to simpler bounds.

The above method has been widely used in online learning (Littlestone et al., 1995; Cesa-Bianchi et al., 1996). Moreover, it is argued to be efficient even for solving batch problems where we repeatedly run the online algorithm over training data multiple times. For example, the idea has been successfully applied to solve large-scale standard SVM formulations (Shalev-Shwartz et al., 2007; Zhang, 2004). In the scenario outlined in the introduction, online-learning methods are more suitable than some traditional batch-learning methods.

Figure 1: Plots for the truncation functions, T_0 and T_1, which are defined in the text.

However, a main drawback of (3) is that it does not achieve sparsity, which we address in this paper. In the literature, the stochastic gradient descent rule is often referred to as gradient descent (GD). There are other variants, such as exponentiated gradient descent (EG). Since our focus in this paper is sparsity, not GD versus EG, we shall only consider modifications of (3) for simplicity.

3. Sparse Online Learning

In this section, we examine several methods for achieving sparsity in online learning. The first idea is simple coefficient rounding, which is the most natural method. We will then consider another method which is the online counterpart of L1-regularization in batch learning. Finally, we combine these two ideas and introduce truncated gradient. As we shall see, all these ideas are closely related.

3.1 Simple Coefficient Rounding

In order to achieve sparsity, the most natural method is to round small coefficients (those no larger in magnitude than a threshold θ ≥ 0) to zero after every K online steps. That is, if i/K is not an integer, we use the standard GD rule in (3); if i/K is an integer, we modify the rule as

$$f(w_i) = T_0\big(w_i - \eta\,\nabla_1 L(w_i, z_i),\ \theta\big), \qquad (4)$$

where for a vector v = [v_1, ..., v_d] ∈ R^d and a scalar θ ≥ 0, T_0(v, θ) = [T_0(v_1, θ), ..., T_0(v_d, θ)], with T_0 defined by (cf. Figure 1)

$$T_0(v_j, \theta) = \begin{cases} 0 & \text{if } |v_j| \le \theta \\ v_j & \text{otherwise.} \end{cases}$$
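As an illustration only (not code from the paper), the following Python sketch applies the rounding rule (4) on top of a plain SGD step (3) for a generic convex loss; the gradient function passed in and the dense numpy weight vector are simplifying assumptions.

```python
import numpy as np

def sgd_step(w, grad, x, y, eta):
    """One standard stochastic gradient step, Equation (3)."""
    return w - eta * grad(w, x, y)

def rounding_step(w, grad, x, y, eta, theta, i, K):
    """SGD with coefficient rounding, Equation (4): on every K-th step,
    coefficients whose magnitude is at most theta are set to zero."""
    w = sgd_step(w, grad, x, y, eta)
    if i % K == 0:  # the T_0 truncation is applied only every K online steps
        w = np.where(np.abs(w) <= theta, 0.0, w)
    return w

# Example: square loss L(w, (x, y)) = (w.x - y)^2 has gradient 2(w.x - y)x.
square_loss_grad = lambda w, x, y: 2.0 * (np.dot(w, x) - y) * x
```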
That is, we first apply the standard stochastic gradient descent rule, and then round small coefficients to zero.

In general, we should not take K = 1, especially when η is small, since each step modifies w_i by only a small amount. If a coefficient is zero, it remains small after one online update, and the rounding operation pulls it back to zero. Consequently, rounding can be done only after every K steps (with a reasonably large K); in this case, nonzero coefficients have sufficient time to go above the threshold θ. However, if K is too large, then in the training stage we will need to keep many more nonzero features in the intermediate steps before they are rounded to zero. In the extreme case, we may simply round the coefficients in the end, which does not solve the storage problem in the training phase. The sensitivity in choosing an appropriate K is a main drawback of this method; another drawback is the lack of a theoretical guarantee for its online performance.

3.2 A Sub-gradient Algorithm for L1-Regularization

In our experiments, we combine rounding-in-the-end-of-training with a simple online sub-gradient method for L1-regularization with a regularization parameter g > 0:

$$f(w_i) = w_i - \eta\,\nabla_1 L(w_i, z_i) - \eta g\,\mathrm{sgn}(w_i), \qquad (5)$$

where for a vector v = [v_1, ..., v_d], sgn(v) = [sgn(v_1), ..., sgn(v_d)], and sgn(v_j) = 1 when v_j > 0, sgn(v_j) = -1 when v_j < 0, and sgn(v_j) = 0 when v_j = 0. In the experiments, the online method (5) plus rounding in the end is used as a simple baseline. This method does not produce sparse weights online. Therefore it does not handle large-scale problems for which we cannot keep all features in memory.

3.3 Truncated Gradient

In order to obtain an online version of the simple rounding rule in (4), we observe that the direct rounding to zero is too aggressive. A less aggressive version is to shrink the coefficient to zero by a smaller amount. We call this idea truncated gradient. The amount of shrinkage is measured by a gravity parameter g_i ≥ 0:

$$f(w_i) = T_1\big(w_i - \eta\,\nabla_1 L(w_i, z_i),\ \eta g_i,\ \theta\big), \qquad (6)$$

where for a vector v = [v_1, ..., v_d] ∈ R^d and scalars α, θ ≥ 0, T_1(v, α, θ) = [T_1(v_1, α, θ), ..., T_1(v_d, α, θ)], with T_1 defined by (cf. Figure 1)

$$T_1(v_j, \alpha, \theta) = \begin{cases} \max(0, v_j - \alpha) & \text{if } v_j \in [0, \theta] \\ \min(0, v_j + \alpha) & \text{if } v_j \in [-\theta, 0] \\ v_j & \text{otherwise.} \end{cases}$$

Again, the truncation can be performed every K online steps. That is, if i/K is not an integer we let g_i = 0; if i/K is an integer we let g_i = Kg for a gravity parameter g > 0. This particular choice is equivalent to (4) when we set g such that ηKg ≥ θ. This requires a large g when η is small. In practice, one should set a small, fixed g, as implied by our regret bound developed later. In general, the larger the parameters g and θ are, the more sparsity is incurred. Due to the extra truncation T_1, this method can lead to sparse solutions, which is confirmed in our experiments described later. In those experiments, the degree of sparsity discovered varies with the problem.

A special case, which we will try in the experiments, is to let g = θ in (6). In this case, we can use only one parameter g to control sparsity. Since ηKg ≤ θ when ηK is small, the truncation operation is less aggressive than the rounding in (4). At first sight, the procedure appears to be an ad hoc way to fix (4). However, we can establish a regret bound for this method, showing it is theoretically sound.

Setting θ = ∞ yields another important special case of (6), which becomes

$$f(w_i) = T\big(w_i - \eta\,\nabla_1 L(w_i, z_i),\ g_i\eta\big), \qquad (7)$$

where for a vector v = [v_1, ..., v_d] ∈ R^d and a scalar α ≥ 0, T(v, α) = [T(v_1, α), ..., T(v_d, α)], with

$$T(v_j, \alpha) = \begin{cases} \max(0, v_j - \alpha) & \text{if } v_j > 0 \\ \min(0, v_j + \alpha) & \text{otherwise.} \end{cases}$$
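For concreteness, here is a small illustrative sketch of the truncation operator T_1 and the resulting update (6); the dense numpy weight vector and the user-supplied sub-gradient function are our own assumptions, and passing theta = np.inf recovers the special case (7).

```python
import numpy as np

def T1(v, alpha, theta):
    """Componentwise truncation T_1(v, alpha, theta): shrink coefficients
    lying in [-theta, theta] toward zero by alpha, leave the rest untouched."""
    small = np.abs(v) <= theta
    shrunk = np.sign(v) * np.maximum(0.0, np.abs(v) - alpha)
    return np.where(small, shrunk, v)

def truncated_gradient_step(w, grad, x, y, eta, g_i, theta):
    """One truncated-gradient update, Equation (6);
    theta = np.inf gives the special case (7)."""
    return T1(w - eta * grad(w, x, y), eta * g_i, theta)
```

In this sketch, taking the shrinkage amount alpha at least as large as theta reproduces the rounding operator T_0, matching the remark that (6) is equivalent to (4) when ηKg ≥ θ.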
The method is a modification of the standard sub-gradient descent method with L1-regularization given in (5). The parameter g_i ≥ 0 controls the sparsity that can be achieved with the algorithm. Note that when g_i = 0, the update rule is identical to the standard stochastic gradient descent rule. In general, we may perform a truncation every K steps. That is, if i/K is not an integer we let g_i = 0; if i/K is an integer we let g_i = Kg for a gravity parameter g > 0. The reason for doing so (instead of using a constant g) is that we can perform a more aggressive truncation with gravity parameter Kg after each K steps. This may lead to better sparsity.

An alternative way to derive a procedure similar to (7) is through an application of the convex hull projection idea of Zinkevich (2003) to the L1-regularized loss, as in (5). However, instead of working with the original feature set, we need to consider a 2d-dimensional duplicated feature set [x_i, -x_i], with the non-negativity constraint w^j ≥ 0 for each component of w (w will also have dimension 2d in this case). The resulting method is similar to ours, with a similar theoretical guarantee as in Theorem 3.1. The proof presented in this paper is more specialized to truncated gradient, and directly works with x_i instead of the augmented data [x_i, -x_i]. Moreover, our analysis does not require the loss function to have bounded gradient, and thus can directly handle the least squares loss.

The procedure in (7) can be regarded as an online counterpart of L1-regularization in the sense that it approximately solves an L1-regularization problem in the limit of η → 0. Truncated gradient for L1-regularization is different from (5), which is a naive application of the stochastic gradient descent rule with an added L1-regularization term. As pointed out in the introduction, the latter fails because it rarely leads to sparsity. Our theory shows that even with sparsification, the prediction performance is still comparable to standard online-learning algorithms. In the following, we develop a general regret bound for this general method, which also shows how the regret may depend on the sparsification parameter g_i.

3.4 Regret Analysis

Throughout the paper, we use ‖·‖_1 for the 1-norm and ‖·‖ for the 2-norm. For reference, we make the following assumption regarding the loss function:

Assumption 3.1 We assume L(w, z) is convex in w, and there exist non-negative constants A and B such that ‖∇_1 L(w, z)‖^2 ≤ A·L(w, z) + B for all w ∈ R^d and z ∈ R^{d+1}.

For linear prediction problems, we have a general loss function of the form L(w, z) = f(w^T x, y). The following are some common loss functions f(·,·) with corresponding choices of parameters A and B (which are not unique), under the assumption sup_x ‖x‖ ≤ C:

- Logistic: f(p, y) = ln(1 + exp(-py)); A = 0 and B = C^2. This loss is for binary classification problems with y ∈ {±1}.
- SVM (hinge loss): f(p, y) = max(0, 1 - py); A = 0 and B = C^2. This loss is for binary classification problems with y ∈ {±1}.
- Least squares (square loss): f(p, y) = (p - y)^2; A = 4C^2 and B = 0. This loss is for regression problems.

Our main result is Theorem 3.1, which is parameterized by A and B. The proof is left to the appendix. Specializing it to particular losses yields several corollaries. A corollary applicable to the least squares loss is given later in Corollary 4.1.

Theorem 3.1 (Sparse Online Regret) Consider the sparse online update rule (6) with w_1 = 0 and η > 0. If Assumption 3.1 holds, then for all w̄ ∈ R^d we have

$$\frac{1 - 0.5A\eta}{T}\sum_{i=1}^{T}\left[L(w_i, z_i) + \frac{g_i}{1 - 0.5A\eta}\,\big\|w_{i+1}\cdot I(|w_{i+1}| \le \theta)\big\|_1\right] \le \frac{\eta}{2}B + \frac{\|\bar w\|^2}{2\eta T} + \frac{1}{T}\sum_{i=1}^{T}\Big[L(\bar w, z_i) + g_i\,\big\|\bar w\cdot I(|w_{i+1}| \le \theta)\big\|_1\Big],$$

where for vectors v = [v_1, ..., v_d] and v' = [v'_1, ..., v'_d] we let

$$\big\|v\cdot I(|v'| \le \theta)\big\|_1 = \sum_{j=1}^{d} |v_j|\, I(|v'_j| \le \theta),$$

where I(·) is the set indicator function.

We state the theorem with a constant learning rate η. As mentioned earlier, it is possible to obtain a result with a variable learning rate where η = η_i decays as i increases. Although this may lead to a no-regret bound without knowing T in advance, it introduces extra complexity to the presentation of the main idea. Since our focus is on sparsity rather than adapting the learning rate, we do not include such a result for clarity. If T is known in advance, then in the above bound one can simply take η = O(1/√T) and the L1-regularized regret is of order O(1/√T).
In the above theorem, the right-hand side involves a term g_i‖w̄·I(|w_{i+1}| ≤ θ)‖_1 depending on w_{i+1}, which is not easily estimated. To remove this dependency, a trivial upper bound of θ = ∞ can be used, leading to the L1 penalty g_i‖w̄‖_1. In the general case of θ < ∞, we cannot replace w_{i+1} by w̄ because the effective regularization condition (as shown on the left-hand side) is the non-convex penalty g_i‖w·I(|w| ≤ θ)‖_1. Solving such a non-convex formulation is hard both in the online and batch settings. In general, we only know how to efficiently discover a local minimum which is difficult to characterize. Without a good characterization of the local minimum, it is not possible for us to replace g_i‖w̄·I(|w_{i+1}| ≤ θ)‖_1 on the right-hand side by g_i‖w̄·I(|w̄| ≤ θ)‖_1, because such a formulation implies we can efficiently solve a non-convex problem with a simple online update rule. Still, when θ < ∞, one naturally expects the right-hand side penalty g_i‖w̄·I(|w_{i+1}| ≤ θ)‖_1 to be much smaller than the corresponding L1 penalty g_i‖w̄‖_1, especially when w̄ has many components close to 0. Therefore the situation with θ < ∞ can potentially yield better performance on some data. This is confirmed in our experiments.

Theorem 3.1 also implies a trade-off between sparsity and regret performance. We may simply consider the case where g_i = g is a constant. When g is small, we have less sparsity but the regret term g‖w̄·I(|w_{i+1}| ≤ θ)‖_1 ≤ g‖w̄‖_1 on the right-hand side is also small. When g is large, we are able to achieve more sparsity but the regret g‖w̄·I(|w_{i+1}| ≤ θ)‖_1 on the right-hand side also becomes large. Such a trade-off (sparsity versus prediction accuracy) is empirically studied in Section 6. Our observation suggests we can gain significant sparsity with only a small decrease of accuracy (that is, using a small g).

Now consider the case θ = ∞ and g_i = g. When T → ∞, if we let η → 0 and ηT → ∞, then Theorem 3.1 implies

$$\frac{1}{T}\sum_{i=1}^{T}\big[L(w_i, z_i) + g\|w_i\|_1\big] \le \inf_{\bar w \in R^d}\left[\frac{1}{T}\sum_{i=1}^{T} L(\bar w, z_i) + g\|\bar w\|_1\right] + o(1).$$

In other words, if we let L'(w, z) = L(w, z) + g‖w‖_1 be the L1-regularized loss, then the L1-regularized regret is small when η → 0 and T → ∞. In particular, if we let η = 1/√T, then the theorem implies the L1-regularized regret is

$$\sum_{i=1}^{T}\big(L(w_i, z_i) + g\|w_i\|_1\big) - \sum_{i=1}^{T}\big(L(\bar w, z_i) + g\|\bar w\|_1\big) \le \frac{\sqrt T}{2}\big(B + \|\bar w\|^2\big)\left(1 + \frac{A}{2\sqrt T}\right) + \frac{A}{2\sqrt T}\left(\sum_{i=1}^{T} L(\bar w, z_i) + g\sum_{i=1}^{T}\big(\|\bar w\|_1 - \|w_{i+1}\|_1\big)\right) + o(\sqrt T),$$

which is O(√T) for a bounded loss function L and weights w_i. These observations imply our procedure can be regarded as the online counterpart of L1-regularization methods. In the stochastic setting where the examples are drawn iid from some underlying distribution, the sparse online gradient method proposed in this paper solves the L1-regularization problem.

3.5 Stochastic Setting

SGD-based online-learning methods can be used to solve large-scale batch optimization problems, often quite successfully (Shalev-Shwartz et al., 2007; Zhang, 2004). In this setting, we can go through training examples one-by-one in an online fashion, and repeat multiple times over the training data. In this section, we analyze the performance of such a procedure using Theorem 3.1. To simplify the analysis, instead of assuming we go through the data one by one, we assume each additional data point is drawn from the training data randomly with equal probability. This corresponds to the standard stochastic optimization setting, in which observed samples are iid from some underlying distribution. The following result is a simple consequence of Theorem 3.1. For simplicity, we only consider the case with θ = ∞ and constant gravity g_i = g.

Theorem 3.2 Consider a set of training data z_i = (x_i, y_i) for i = 1, ..., n, and let

$$R(w, g) = \frac{1}{n}\sum_{i=1}^{n} L(w, z_i) + g\|w\|_1$$

be the L1-regularized loss over the training data. Let w̄_1 = w_1 = 0, and define recursively for t = 1, 2, ...

$$w_{t+1} = T\big(w_t - \eta\,\nabla_1 L(w_t, z_{i_t}),\ g\eta\big), \qquad \bar w_{t+1} = \bar w_t + \frac{w_{t+1} - \bar w_t}{t+1},$$

where each i_t is drawn from {1, ..., n} uniformly at random. If Assumption 3.1 holds, then at any time T, the following inequalities are valid for all w̄ ∈ R^d:

$$E_{i_1,\ldots,i_T}\left[(1 - 0.5A\eta)\, R\!\left(\bar w_T, \frac{g}{1 - 0.5A\eta}\right)\right] \le E_{i_1,\ldots,i_T}\left[\frac{1 - 0.5A\eta}{T}\sum_{i=1}^{T} R\!\left(w_i, \frac{g}{1 - 0.5A\eta}\right)\right] \le \frac{\eta}{2}B + \frac{\|\bar w\|^2}{2\eta T} + R(\bar w, g).$$
Proof Note the recursion of w̄_t implies

$$\bar w_T = \frac{1}{T}\sum_{t=1}^{T} w_t$$

from telescoping the update rule. Because R(w, g) is convex in w, the first inequality follows directly from Jensen's inequality. It remains to prove the second inequality. Theorem 3.1 implies the following:

$$\frac{1 - 0.5A\eta}{T}\sum_{t=1}^{T}\left[L(w_t, z_{i_t}) + \frac{g}{1 - 0.5A\eta}\|w_t\|_1\right] \le g\|\bar w\|_1 + \frac{\eta}{2}B + \frac{\|\bar w\|^2}{2\eta T} + \frac{1}{T}\sum_{t=1}^{T} L(\bar w, z_{i_t}). \qquad (8)$$

Observe that

$$E_{i_t}\left[L(w_t, z_{i_t}) + \frac{g}{1 - 0.5A\eta}\|w_t\|_1\right] = R\!\left(w_t, \frac{g}{1 - 0.5A\eta}\right)$$

and

$$g\|\bar w\|_1 + E_{i_1,\ldots,i_T}\left[\frac{1}{T}\sum_{t=1}^{T} L(\bar w, z_{i_t})\right] = R(\bar w, g).$$

The second inequality is obtained by taking the expectation with respect to i_1, ..., i_T in (8).

If we let η → 0 and ηT → ∞, the bound in Theorem 3.2 becomes

$$E\big[R(\bar w_T, g)\big] \le E\left[\frac{1}{T}\sum_{t=1}^{T} R(w_t, g)\right] \le \inf_{\bar w} R(\bar w, g) + o(1).$$

That is, on average, w̄_T approximately solves the L1-regularization problem

$$\inf_w\left[\frac{1}{n}\sum_{i=1}^{n} L(w, z_i) + g\|w\|_1\right].$$

If we choose a random stopping time T, then the above inequalities say that on average w_T also solves this L1-regularization problem approximately. Therefore in our experiments, we use the last solution w_T instead of the aggregated solution w̄_T. For practical purposes, this is adequate even though we do not intentionally choose a random stopping time. Since L1-regularization is frequently used to achieve sparsity in the batch learning setting, the connection to L1-regularization can be regarded as an alternative justification for the sparse online algorithm developed in this paper.

4. Truncated Gradient for Least Squares

The method in Section 3 can be directly applied to least squares regression. This leads to Algorithm 1, which implements sparsification for square loss according to Equation (6).

Algorithm 1 Truncated Gradient for Least Squares
  Inputs:
    threshold θ ≥ 0
    gravity sequence g_i ≥ 0
    learning rate η ∈ (0, 1)
    example oracle O
  Initialize weights w^j ← 0 (j = 1, ..., d)
  For trial i = 1, 2, ...
    1. Acquire an unlabeled example x = [x^1, x^2, ..., x^d] from oracle O.
    2. For all weights w^j (j = 1, ..., d):
       (a) if w^j > 0 and w^j ≤ θ then w^j ← max{w^j - g_i η, 0}
       (b) else if w^j < 0 and w^j ≥ -θ then w^j ← min{w^j + g_i η, 0}
    3. Compute the prediction: ŷ = Σ_j w^j x^j.
    4. Acquire the label y from oracle O.
    5. Update the weights for all features j: w^j ← w^j + 2η(y - ŷ)x^j.

In the description, we use the superscripted symbol w^j to denote the j-th component of the vector w (in order to differentiate from w_i, which we have used to denote the i-th weight vector). For clarity, we also drop the index i from w_i. Although we keep the choice of gravity parameters g_i open in the algorithm description, in practice we only consider the following choice:

$$g_i = \begin{cases} Kg & \text{if } i/K \text{ is an integer} \\ 0 & \text{otherwise.} \end{cases}$$

This may give a more aggressive truncation (thus sparsity) after every K-th iteration. Since we do not have a theorem formalizing how much more sparsity one can gain from this idea, its effect will only be examined empirically in Section 6.

In many online-learning situations (such as web applications), only a small subset of the features have nonzero values for any example x. It is thus desirable to deal with sparsity only in this small subset rather than in all features, while simultaneously inducing sparsity on all feature weights. Moreover, it is important to store only features with non-zero coefficients (if the number of features is too large to be stored in memory, this approach allows us to use a hash table to track only the nonzero coefficients). We describe how this can be implemented efficiently in the next section.

For reference, we present a specialization of Theorem 3.1 in the following corollary, which is directly applicable to Algorithm 1.

Corollary 4.1 (Sparse Online Square Loss Regret) If there exists C > 0 such that for all x, ‖x‖ ≤ C, then for all w̄ ∈ R^d we have

$$\frac{1 - 2C^2\eta}{T}\sum_{i=1}^{T}\left[(w_i^\top x_i - y_i)^2 + \frac{g_i}{1 - 2C^2\eta}\big\|w_i\cdot I(|w_i| \le \theta)\big\|_1\right] \le \frac{\|\bar w\|^2}{2\eta T} + \frac{1}{T}\sum_{i=1}^{T}\left[(\bar w^\top x_i - y_i)^2 + g_{i+1}\big\|\bar w\cdot I(|w_{i+1}| \le \theta)\big\|_1\right],$$
where w_i = [w^1, ..., w^d] ∈ R^d is the weight vector used for prediction at the i-th step of Algorithm 1, and (x_i, y_i) is the data point observed at the i-th step.

This corollary explicitly states that the average square loss incurred by the learner (the left-hand side) is bounded by the average square loss of the best weight vector w̄, plus a term related to the size of w̄ which decays as 1/T and an additive offset controlled by the sparsity threshold θ and the gravity parameter g_i.

5. Efficient Implementation

We altered a standard gradient-descent implementation, VOWPAL WABBIT (Langford et al., 2007), according to Algorithm 1. VOWPAL WABBIT optimizes square loss on a linear representation w^T x via gradient descent (3) with a couple of caveats:

1. The prediction is normalized by the square root of the number of nonzero entries in a sparse vector, w^T x / √‖x‖_0. This alteration is just a constant rescaling on dense vectors which is effectively removable by an appropriate rescaling of the learning rate.
2. The prediction is clipped to the interval [0, 1], implying the loss function is not square loss for unclipped predictions outside of this dynamic range. Instead the update is a constant value, equivalent to the gradient of a linear loss function.

The learning rate in VOWPAL WABBIT is controllable, supporting 1/i decay as well as a constant learning rate (and rates in-between). The program operates in an entirely online fashion, so the memory footprint is essentially just the weight vector, even when the amount of data is very large.

As mentioned earlier, we would like the algorithm's computational complexity to depend linearly on the number of nonzero features of an example, rather than the total number of features. The approach we took was to store a time-stamp t_j for each feature j. The time-stamp was initialized to the index of the example where feature j was nonzero for the first time. During online learning, we simply went through all nonzero features j of example i, and could simulate the shrinkage of w^j after t_j in a batch mode. These weights are then updated, and their time-stamps are set to i. This lazy-update idea of delaying the shrinkage calculation until needed is the key to an efficient implementation of truncated gradient. Specifically, instead of using the update rule (6) for weight w^j, we shrink the weights of all nonzero features j differently by the following:

$$f(w^j) = T_1\!\left(w^j + 2\eta(y - \hat y)x^j,\ \left\lfloor\frac{i - t_j}{K}\right\rfloor K\eta g,\ \theta\right),$$

and t_j is updated by

$$t_j \leftarrow t_j + \left\lfloor\frac{i - t_j}{K}\right\rfloor K.$$
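To make the lazy-update idea concrete, here is an illustrative Python sketch (it is not the VOWPAL WABBIT code; the sparse-dict example format, class name, and helpers are our own assumptions) of Algorithm 1 with per-feature time-stamps, so each online step touches only the nonzero features of the current example.

```python
import math

def truncate(v, alpha, theta):
    """Componentwise T_1: shrink v toward zero by alpha when |v| <= theta."""
    if abs(v) > theta:
        return v
    return math.copysign(max(0.0, abs(v) - alpha), v)

class LazyTruncatedGradient:
    """Sketch of Algorithm 1 with the time-stamp (lazy shrinkage) trick."""

    def __init__(self, eta, g, theta, K):
        self.eta, self.g, self.theta, self.K = eta, g, theta, K
        self.w = {}   # feature index -> weight (only nonzero features stored)
        self.t = {}   # feature index -> time-stamp of last applied shrinkage
        self.i = 0    # example counter

    def _catch_up(self, j):
        # Apply, in one batch, the floor((i - t_j)/K) truncation events that
        # feature j missed, each of size eta*K*g, as in the lazy-update rule.
        missed = (self.i - self.t[j]) // self.K
        if missed > 0:
            self.w[j] = truncate(self.w[j], missed * self.K * self.eta * self.g,
                                 self.theta)
            self.t[j] += missed * self.K
            if self.w[j] == 0.0:
                del self.w[j], self.t[j]   # keep the weight table sparse

    def update(self, x, y):
        """One online trial on a sparse example x = {feature index: value}."""
        self.i += 1
        for j in list(x):
            if j in self.w:
                self._catch_up(j)
        y_hat = sum(self.w.get(j, 0.0) * v for j, v in x.items())
        for j, v in x.items():            # square-loss gradient step
            w_new = self.w.get(j, 0.0) + 2.0 * self.eta * (y - y_hat) * v
            if w_new != 0.0:
                self.w[j], self.t[j] = w_new, self.i
            else:
                self.w.pop(j, None)
                self.t.pop(j, None)
        return y_hat
```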
This lazy-update trick can be applied to the other two algorithms given in Section 3. In the coefficient rounding algorithm (4), for instance, for each nonzero feature j of example i, we can first perform a regular gradient descent step on the square loss, and then do the following: if |w^j| is below the threshold θ and i ≥ t_j + K, we round w^j to 0 and set t_j to i.

This implementation shows the truncated gradient method satisfies the following requirements needed for solving large-scale problems with sparse features:

- The algorithm is computationally efficient: the number of operations per online step is linear in the number of nonzero features, and independent of the total number of features.
- The algorithm is memory efficient: it maintains a list of active features, and a feature can be inserted when observed, and deleted when the corresponding weight becomes zero.

If we directly apply the online projection idea of Zinkevich (2003) to solve (1), then in the update rule (7), one has to pick the smallest g_i ≥ 0 such that ‖w_{i+1}‖_1 ≤ s. We do not know an efficient method to find this specific g_i using operations independent of the total number of features. A standard implementation relies on sorting all weights, which requires O(d log d) operations, where d is the total number of (nonzero) features. This complexity is unacceptable for our purpose. However, in an important recent work, Duchi et al. (2008) proposed an efficient online l1-projection method. The idea is to use a balanced tree to keep track of weights, which allows efficient threshold finding and tree updates in O(k ln d) operations on average, where k denotes the number of nonzero coefficients in the current training example. Although the algorithm still has a weak dependency on d, it is applicable to large-scale practical applications.

The theoretical analysis presented in this paper shows we can obtain a meaningful regret bound by picking an arbitrary g_i. This is useful because the resulting method is much simpler to implement and is computationally more efficient per online step. Moreover, our method allows non-convex updates closely related to the simple coefficient rounding idea. Due to the complexity of implementing the balanced tree strategy in Duchi et al. (2008), we shall not compare to it in this paper and leave it as a future direction. However, we believe the sparsity achieved with their approach should be comparable to the sparsity achieved with our method.

6. Empirical Results

We applied VOWPAL WABBIT with the efficiently implemented sparsify option, as described in the previous section, to a selection of data sets, including eleven data sets from the UCI repository (Asuncion and Newman, 2007), the much larger data set rcv1 (Lewis et al., 2004), and a private large-scale data set Big_Ads related to ad interest prediction. While UCI data sets are useful for benchmark purposes, rcv1 and Big_Ads are more interesting since they embody real-world data sets with large numbers of features, many of which are less informative for making predictions than others. The data sets are summarized in Table 1.

The UCI data sets used do not have many features, so we expect that a large fraction of these features are useful for making predictions. For comparison purposes, as well as to better demonstrate the behavior of our algorithm, we also added 1000 random binary features to those data sets. Each feature has value 1 with probability 0.05 and 0 otherwise.

  Data Set   #features   #train data   #test data   task
  ad         1411        2455          824          classification
  crx        47          526           164          classification
  housing    14          381           125          regression
  krvskp     74          2413          783          classification
  magic04    11          14226         4794         classification
  mushroom   117         6079          2045         classification
  spambase   58          3445          1156         classification
  wbc        10          520           179          classification
  wdbc       31          421           148          classification
  wpbc       33          153           45           classification
  zoo        17          77            24           regression
  rcv1       38853       781265        23149        classification
  Big_Ads    3 x 10^9    26 x 10^6     2.7 x 10^6   classification
Table 1: Data set summary.

6.1 Feature Sparsification of Truncated Gradient

In the first set of experiments, we are interested in how much reduction in the number of features is possible without affecting learning performance significantly; specifically, we require the accuracy to be reduced by no more than 1% for classification tasks, and the total square loss to be increased by no more than 1% for regression tasks. As is common practice, we allowed the algorithm to run on the training data set for multiple passes with a decaying learning rate. For each data set, we performed 10-fold cross validation over the training set to identify the best set of parameters, including the learning rate η (ranging from 0.1 to 0.5), the sparsification rate g (ranging from 0 to 0.3), the number of passes over the training set (ranging from 5 to 30), and the decay of the learning rate across these passes (ranging from 0.5 to 0.9). The optimized parameters were used to train VOWPAL WABBIT on the whole training set. Finally, the learned classifier/regressor was evaluated on the test set. We fixed K = 1 and θ = ∞, and will study the effects of K and θ in later subsections.

Figure 2 shows the fraction of reduced features after sparsification is applied to each data set. For UCI data sets, we also include experiments with 1000 random features added to the original feature set. We do not add random features to rcv1 and Big_Ads since the experiment is not as interesting.

For UCI data sets with randomly added features, VOWPAL WABBIT is able to reduce the number of features by a fraction of more than 90%, except for the ad data set in which only 71% reduction is observed. This less satisfying result might be improved by a more extensive parameter search in cross validation. However, if we can tolerate a 1.3% decrease in accuracy (instead of 1% as for other data sets) during cross validation, VOWPAL WABBIT is able to achieve a 91.4% reduction, indicating that a large reduction is still possible at the tiny additional cost of 0.3% accuracy loss. With this slightly more aggressive sparsification, the test-set accuracy drops from 95.9% (when only 1% loss in accuracy is allowed in cross validation) to 95.4%, while the accuracy without sparsification is 96.5%.
Even for the original UCI data sets without artificially added features, VOWPAL WABBIT manages to filter out some of the less useful features while maintaining the same level of performance. For example, for the ad data set, a reduction of 83.4% is achieved. Compared to the results above, it seems the most effective feature reductions occur on data sets with a large number of less useful features, exactly where sparsification is needed.

For rcv1, more than 75% of the features are removed after the sparsification process, indicating the effectiveness of our algorithm in real-life problems. We were not able to try many parameters in cross validation because of the size of rcv1. It is expected that more reduction is possible when a more thorough parameter search is performed.

The previous results do not exercise the full power of the approach presented here because the standard Lasso (Tibshirani, 1996) is, or may be, computationally viable on these data sets. We have also applied this approach to a large non-public data set Big_Ads where the goal is predicting which of two ads was clicked on given context information (the content of the ads and query information). Here, accepting a 0.009 increase in classification error (from error rate 0.329 to error rate 0.338) allows us to reduce the number of features from about 3 x 10^9 to about 24 x 10^6, a factor of 125 decrease in the number of features.

For classification tasks, we also study how our sparsification solution affects AUC (Area Under the ROC Curve), which is a standard metric for classification.[1] Using the same sets of parameters from the 10-fold cross validation described above, we find the criterion is not affected significantly by sparsification and in some cases is actually slightly improved. The reason may be that our sparsification method removed some of the features that could have confused VOWPAL WABBIT. The ratios of the AUC with and without sparsification for all classification tasks are plotted in Figure 3. Often these ratios are above 98%.

6.2 The Effects of K

As we argued before, using a K value larger than 1 may be desired in the truncated gradient and rounding algorithms. This advantage is empirically demonstrated here. In particular, we try K = 1, K = 10, and K = 20 in both algorithms. As before, cross validation is used to select parameters in the rounding algorithm, including the learning rate η, the number of passes over the data during training, and the learning rate decay over training passes.

Figures 4 and 5 give the AUC vs. number-of-features plots, where each data point is generated by running the respective algorithm using a different value of g (for truncated gradient) and θ (for the rounding algorithm). We used θ = ∞ in truncated gradient.

The effect of K is large in the rounding algorithm. For instance, in the ad data set the algorithm using K = 1 achieves an AUC of 0.94 with 322 features, while only 13 and 7 features are needed using K = 10 and K = 20, respectively. However, the same benefit of using a larger K is not observed in truncated gradient, although the performance with K = 10 or 20 is at least as good as that with K = 1, and for the spambase data set further feature reduction is achieved at the same level of performance, reducing the number of features from 76 (when K = 1) to 25 (when K = 10 or 20) with an AUC of about 0.89.
[1] We use AUC here and in later subsections because it is insensitive to the threshold, unlike accuracy.

Figure 2: Plots showing the amount of features left after sparsification using truncated gradient for each data set, when the performance is changed by at most 1% due to sparsification. The solid bar: with the original feature set; the dashed bar: with 1000 random features added to each example. Plot on left: fraction left with respect to the total number of features (original with 1000 artificial features for the dashed bar). Plot on right: fraction left with respect to the original features (not counting the 1000 artificial features in the denominator for the dashed bar).

Figure 3: A plot showing the ratio of the AUC when sparsification is used over the AUC when no sparsification is used. The same process as in Figure 2 is used to determine empirically good parameters. The first result is for the original data set, while the second result is for the modified data set where 1000 random features are added to each example.

6.3 The Effects of θ in Truncated Gradient

In this subsection, we empirically study the effect of θ in truncated gradient. The rounding algorithm is also included for comparison due to its similarity to truncated gradient when θ = g. Again, we used cross validation to choose parameters for each θ value tried, and focused on the AUC metric in the eight UCI classification tasks, except the degenerate one of wpbc. We fixed K = 10 in both algorithms.

Figure 4: Effect of K on AUC in the rounding algorithm.

Figure 5: Effect of K on AUC in truncated gradient.
Figure 6 gives the AUC vs. number-of-features plots, where each data point is generated by running the respective algorithm using a different value of g (for truncated gradient) and θ (for the rounding algorithm). A few observations are in order. First, the results verify the observation that the behavior of truncated gradient with θ = g is similar to the rounding algorithm. Second, these results suggest that, in practice, it may be desirable to use θ = ∞ in truncated gradient because it avoids the local-minimum problem.

Figure 6: Effect of θ on AUC in truncated gradient.

6.4 Comparison to Other Algorithms

The next set of experiments compares truncated gradient to other algorithms regarding their abilities to balance feature sparsification and performance. Again, we focus on the AUC metric in the UCI classification tasks except wpbc. The algorithms for comparison include:

- The truncated gradient algorithm: we fixed K = 10 and θ = ∞, used cross-validated parameters, and varied the gravity parameter g.
- The rounding algorithm described in Section 3.1: we fixed K = 10, used cross-validated parameters, and varied the rounding threshold θ.
- The sub-gradient algorithm described in Section 3.2: we fixed K = 10, used cross-validated parameters, and varied the regularization parameter g.
- The Lasso (Tibshirani, 1996) for batch L1-regularization: we used a publicly available implementation (Sjöstrand, 2005).

Note that we do not attempt to compare these algorithms on rcv1 and Big_Ads simply because their sizes are too large for the Lasso.

Figure 7 gives the results. Truncated gradient is consistently competitive with the other two online algorithms and significantly outperformed them in some problems. This suggests the effectiveness of truncated gradient. Second, it is interesting to observe that the qualitative behavior of truncated gradient is often similar to LASSO, especially when very sparse weight vectors are allowed (the left side of the graphs). This is consistent with Theorem 3.2 showing the relation between them. However, LASSO usually has worse performance when the allowed number of nonzero weights is set too large (the right side of the graphs). In this case, LASSO seems to overfit, while truncated gradient is more robust to overfitting. The robustness of online learning is often attributed to early stopping, which has been extensively discussed in the literature (e.g., Zhang, 2004).

Figure 7: Comparison of four algorithms.

Finally, it is worth emphasizing that the experiments in this subsection try to shed some light on the relative strengths of these algorithms in terms of feature sparsification. For large data sets such as Big_Ads only truncated gradient, coefficient rounding, and the sub-gradient algorithms are applicable. As we have shown and argued, the rounding algorithm is quite ad hoc and may not work robustly in some problems, and the sub-gradient algorithm does not lead to sparsity in general during training.
7. Conclusion

This paper covers the first sparsification technique for large-scale online learning with strong theoretical guarantees. The algorithm, truncated gradient, is the natural extension of Lasso-style regression to the online-learning setting. Theorem 3.1 proves the technique is sound: it never harms performance much compared to standard stochastic gradient descent in adversarial situations. Furthermore, we show the asymptotic solution of one instance of the algorithm is essentially equivalent to Lasso regression, thus justifying the algorithm's ability to produce sparse weight vectors when the number of features is intractably large.

The theorem is verified experimentally in a number of problems. In some cases, especially for problems with many irrelevant features, this approach achieves a one or two order of magnitude reduction in the number of features.

Acknowledgments

We thank Alex Strehl for discussions and help in developing VOWPAL WABBIT. Part of this work was done when Lihong Li and Tong Zhang were at Yahoo! Research in 2007.

Appendix A. Proof of Theorem 3.1

The following lemma is the essential step in our analysis.

Lemma 1 Suppose the update rule (6) is applied to weight vector w on example z = (x, y) with gravity parameter g_i = g, and results in a weight vector w'. If Assumption 3.1 holds, then for all w̄ ∈ R^d we have

$$(1 - 0.5A\eta)\,L(w, z) + g\big\|w'\cdot I(|w'| \le \theta)\big\|_1 \le L(\bar w, z) + g\big\|\bar w\cdot I(|w'| \le \theta)\big\|_1 + \frac{\eta}{2}B + \frac{\|\bar w - w\|^2 - \|\bar w - w'\|^2}{2\eta}.$$

Proof Consider any target vector w̄ ∈ R^d and let w̃ = w - η∇_1 L(w, z). We have w' = T_1(w̃, gη, θ). Let

$$u(\bar w, w') = g\big\|\bar w\cdot I(|w'| \le \theta)\big\|_1 - g\big\|w'\cdot I(|w'| \le \theta)\big\|_1.$$

Then the update equation implies the following:

$$\begin{aligned}
\|\bar w - w'\|^2 &\le \|\bar w - w'\|^2 + \|w' - \tilde w\|^2 \\
&= \|\bar w - \tilde w\|^2 - 2(\bar w - w')^\top(w' - \tilde w) \\
&\le \|\bar w - \tilde w\|^2 + 2\eta\, u(\bar w, w') \\
&= \|\bar w - w\|^2 + \|w - \tilde w\|^2 + 2(\bar w - w)^\top(w - \tilde w) + 2\eta\, u(\bar w, w') \\
&= \|\bar w - w\|^2 + \eta^2\|\nabla_1 L(w, z)\|^2 + 2\eta(\bar w - w)^\top\nabla_1 L(w, z) + 2\eta\, u(\bar w, w') \\
&\le \|\bar w - w\|^2 + \eta^2\|\nabla_1 L(w, z)\|^2 + 2\eta\big(L(\bar w, z) - L(w, z)\big) + 2\eta\, u(\bar w, w') \\
&\le \|\bar w - w\|^2 + \eta^2\big(A\,L(w, z) + B\big) + 2\eta\big(L(\bar w, z) - L(w, z)\big) + 2\eta\, u(\bar w, w').
\end{aligned}$$

Here, the first and second equalities follow from algebra, and the third from the definition of w̃. The first inequality follows because a square is always non-negative. The second inequality follows because w' = T_1(w̃, gη, θ), which implies

$$(w' - \tilde w)^\top w' = -g\eta\big\|w'\cdot I(|\tilde w| \le \theta)\big\|_1 = -g\eta\big\|w'\cdot I(|w'| \le \theta)\big\|_1$$

and |w'_j - w̃_j| ≤ gη I(|w'_j| ≤ θ). Therefore,

$$-(\bar w - w')^\top(w' - \tilde w) = -\bar w^\top(w' - \tilde w) + w'^\top(w' - \tilde w) \le \sum_{j=1}^{d}|\bar w_j|\,|w'_j - \tilde w_j| + (w' - \tilde w)^\top w' \le g\eta\sum_{j=1}^{d}|\bar w_j|\, I(|w'_j| \le \theta) + (w' - \tilde w)^\top w' = \eta\, u(\bar w, w').$$

The third inequality follows from the definition of a sub-gradient of a convex function, which implies

$$(\bar w - w)^\top\nabla_1 L(w, z) \le L(\bar w, z) - L(w, z)$$

for all w and w̄; the fourth inequality follows from Assumption 3.1. Rearranging the above inequality leads to the desired bound.
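As a quick sanity check on the algebra above (this script is our own addition, not part of the original paper), the following Python snippet numerically verifies the inequality of Lemma 1 for the square loss, using A = 4‖x‖^2 and B = 0 from Assumption 3.1, on randomly drawn instances; all names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def T1(v, alpha, theta):
    small = np.abs(v) <= theta
    return np.where(small, np.sign(v) * np.maximum(0.0, np.abs(v) - alpha), v)

def lemma_holds_once(d=5, eta=0.05, g=0.3, theta=0.5):
    w, wbar, x = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
    y = rng.normal()
    loss = lambda u: (u @ x - y) ** 2
    grad = 2.0 * (w @ x - y) * x                  # gradient of square loss at w
    w_prime = T1(w - eta * grad, g * eta, theta)  # update rule (6)
    A, B = 4.0 * (x @ x), 0.0                     # Assumption 3.1 for square loss
    keep = np.abs(w_prime) <= theta               # the indicator I(|w'| <= theta)
    lhs = (1 - 0.5 * A * eta) * loss(w) + g * np.sum(np.abs(w_prime)[keep])
    rhs = (loss(wbar) + g * np.sum(np.abs(wbar)[keep]) + 0.5 * eta * B
           + (np.sum((wbar - w) ** 2) - np.sum((wbar - w_prime) ** 2)) / (2 * eta))
    return lhs <= rhs + 1e-9                      # small tolerance for rounding

assert all(lemma_holds_once() for _ in range(1000))
```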
Proof (of Theorem 3.1) Applying Lemma 1 to the update on trial i gives

$$(1 - 0.5A\eta)\,L(w_i, z_i) + g_i\big\|w_{i+1}\cdot I(|w_{i+1}| \le \theta)\big\|_1 \le L(\bar w, z_i) + \frac{\|\bar w - w_i\|^2 - \|\bar w - w_{i+1}\|^2}{2\eta} + g_i\big\|\bar w\cdot I(|w_{i+1}| \le \theta)\big\|_1 + \frac{\eta}{2}B.$$

Now summing over i = 1, 2, ..., T, we obtain

$$\begin{aligned}
\sum_{i=1}^{T}\Big[(1 - 0.5A\eta)\,L(w_i, z_i) + g_i\big\|w_{i+1}\cdot I(|w_{i+1}| \le \theta)\big\|_1\Big]
&\le \sum_{i=1}^{T}\left[\frac{\|\bar w - w_i\|^2 - \|\bar w - w_{i+1}\|^2}{2\eta} + L(\bar w, z_i) + g_i\big\|\bar w\cdot I(|w_{i+1}| \le \theta)\big\|_1 + \frac{\eta}{2}B\right] \\
&= \frac{\|\bar w - w_1\|^2 - \|\bar w - w_{T+1}\|^2}{2\eta} + \frac{\eta}{2}TB + \sum_{i=1}^{T}\Big[L(\bar w, z_i) + g_i\big\|\bar w\cdot I(|w_{i+1}| \le \theta)\big\|_1\Big] \\
&\le \frac{\|\bar w\|^2}{2\eta} + \frac{\eta}{2}TB + \sum_{i=1}^{T}\Big[L(\bar w, z_i) + g_i\big\|\bar w\cdot I(|w_{i+1}| \le \theta)\big\|_1\Big].
\end{aligned}$$

The first equality follows from the telescoping sum, and the second inequality follows from the initial condition (all weights are zero) and dropping negative quantities. The theorem follows by dividing by T and rearranging terms.

References

Arthur Asuncion and David J. Newman. UCI machine learning repository, 2007. University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/MLRepository.html.

Suhrid Balakrishnan and David Madigan. Algorithms for sparse linear classifiers in the massive data setting. Journal of Machine Learning Research, 9:313-337, 2008.

Bob Carpenter. Lazy sparse stochastic gradient descent for regularized multinomial logistic regression. Technical report, April 2008.

Nicolò Cesa-Bianchi, Philip M. Long, and Manfred Warmuth. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7(3):604-619, 1996.

Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems 20 (NIPS-07), 2008.

Ofer Dekel, Shai Shalev-Shwartz, and Yoram Singer. The Forgetron: A kernel-based perceptron on a fixed budget. In Advances in Neural Information Processing Systems 18 (NIPS-05), pages 259-266, 2006.

John Duchi and Yoram Singer. Online and batch learning using forward looking subgradients. Unpublished manuscript, September 2008.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML-08), pages 272-279, 2008.

Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, 1997.

John Langford, Lihong Li, and Alexander L. Strehl. Vowpal Wabbit (fast online learning), 2007. http://hunch.net/~vw/.

Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems 19 (NIPS-06), pages 801-808, 2007.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.

Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285-318, 1988.

Nick Littlestone, Philip M. Long, and Manfred K. Warmuth. On-line learning of linear functions. Computational Complexity, 5(2):1-23, 1995.

Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML-07), 2007.

Karl Sjöstrand. Matlab implementation of LASSO, LARS, the elastic net and SPCA, June 2005. Version 2.0, http://www2.imm.dtu.dk/pubdb/p.php?3897.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML-04), pages 919-926, 2004.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-03), pages 928-936, 2003.