
Journal of Machine Learning Research 8 (2007) 1519-1555. Submitted 7/06; Revised 1/07; Published 7/07.

An Interior-Point Method for Large-Scale $\ell_1$-Regularized Logistic Regression

Kwangmoo Koh (DENEB1@STANFORD.EDU), Seung-Jean Kim (SJKIM@STANFORD.EDU), Stephen Boyd (BOYD@STANFORD.EDU)
Information Systems Laboratory, Electrical Engineering Department, Stanford University, Stanford, CA 94305-9510, USA

Editor: Yi Lin

(c) 2007 Kwangmoo Koh, Seung-Jean Kim and Stephen Boyd.

Abstract

Logistic regression with $\ell_1$ regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale $\ell_1$-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, which uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently.

Keywords: logistic regression, feature selection, $\ell_1$ regularization, regularization path, interior-point methods

1. Introduction

In this section we describe the basic logistic regression problem, the $\ell_2$- and $\ell_1$-regularized versions, and the regularization path. We set out our notation, and review existing methods and literature. Finally, we give an outline of this paper.

1.1 Logistic Regression

Let $x \in \mathbf{R}^n$ denote a vector of explanatory or feature variables, and $b \in \{-1,+1\}$ denote the associated binary output or outcome. The logistic model has the form
$\mathrm{Prob}(b \mid x) = 1/(1+\exp(-b(w^T x + v))) = \exp(b(w^T x + v))/(1+\exp(b(w^T x + v)))$,
where $\mathrm{Prob}(b \mid x)$ is the conditional probability of $b$, given $x \in \mathbf{R}^n$. The logistic model has parameters $v \in \mathbf{R}$ (the intercept) and $w \in \mathbf{R}^n$ (the weight vector). When $w \neq 0$, $w^T x + v = 0$ defines the neutral hyperplane in feature space, on which the conditional probability of each outcome is $1/2$.
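As a concrete illustration of the model above, the following sketch (Python/NumPy; this is illustrative only and is not the authors' implementation, which is written in Matlab and C) evaluates the logistic conditional probability for a given intercept and weight vector. The variable names are chosen here purely for illustration.

```python
import numpy as np

def prob(b, x, w, v):
    """Conditional probability Prob(b | x) under the logistic model
    with weight vector w and intercept v, for b in {-1, +1}."""
    return 1.0 / (1.0 + np.exp(-b * (w @ x + v)))

# On the neutral hyperplane w^T x + v = 0 both outcomes have probability 1/2;
# one unit away (w^T x + v = 1) the probabilities are about 0.73 and 0.27.
w, v = np.array([2.0, -1.0]), 0.5
x = np.array([0.5, 0.5])            # here w @ x + v = 1
print(prob(+1, x, w, v))            # approx. 0.731
print(prob(-1, x, w, v))            # approx. 0.269
```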
On the shifted parallel hyperplane $w^T x + v = 1$, which is a distance $1/\|w\|_2$ from the neutral hyperplane, the conditional probability of outcome $b=1$ is $1/(1+1/e) \approx 0.73$, and the conditional probability of $b=-1$ is $1/(1+e) \approx 0.27$. On the hyperplane $w^T x + v = -1$, these conditional probabilities are reversed. As $w^T x + v$ increases above one, the conditional probability of outcome $b=1$ rapidly approaches one; as $w^T x + v$ decreases below $-1$, the conditional probability of outcome $b=-1$ rapidly approaches one. The slab in feature space defined by $|w^T x + v| \leq 1$ defines the ambiguity region, in which there is substantial probability of each outcome; outside this slab, one outcome is much more likely than the other.

Suppose we are given a set of (observed or training) examples
$(x_i, b_i) \in \mathbf{R}^n \times \{-1,+1\}$, $i=1,\ldots,m$,
assumed to be independent samples from a distribution. We use $p_{\mathrm{log}}(v,w) \in \mathbf{R}^m$ to denote the vector of conditional probabilities, according to the logistic model,
$p_{\mathrm{log}}(v,w)_i = \mathrm{Prob}(b_i \mid x_i) = \exp(w^T a_i + v b_i)/(1+\exp(w^T a_i + v b_i))$, $i=1,\ldots,m$,
where $a_i = b_i x_i$. The likelihood function associated with the samples is $\prod_{i=1}^m p_{\mathrm{log}}(v,w)_i$, and the log-likelihood function is given by
$\sum_{i=1}^m \log p_{\mathrm{log}}(v,w)_i = -\sum_{i=1}^m f(w^T a_i + v b_i)$,
where $f$ is the logistic loss function
$f(z) = \log(1+\exp(-z))$.  (1)
This loss function is convex, so the log-likelihood function is concave. The negative of the log-likelihood function is called the (empirical) logistic loss, and dividing by $m$ we obtain the average logistic loss,
$l_{\mathrm{avg}}(v,w) = (1/m)\sum_{i=1}^m f(w^T a_i + v b_i)$.
We can determine the model parameters $w$ and $v$ by maximum likelihood estimation from the observed examples, by solving the convex optimization problem
minimize $l_{\mathrm{avg}}(v,w)$,  (2)
with variables $v \in \mathbf{R}$ and $w \in \mathbf{R}^n$, and problem data $A = [a_1 \cdots a_m]^T \in \mathbf{R}^{m\times n}$ and the vector of binary outcomes $b = [b_1 \cdots b_m]^T \in \mathbf{R}^m$. The problem (2) is called the logistic regression problem (LRP).

The average logistic loss is always nonnegative, that is, $l_{\mathrm{avg}}(v,w) \geq 0$, since $f(z) \geq 0$ for any $z$. For the choice $w=0$, $v=0$, we have $l_{\mathrm{avg}}(0,0) = \log 2 \approx 0.693$, so the optimal value of the LRP lies between $0$ and $\log 2$. In particular, the optimal value can range (roughly) between $0$ and $1$. The optimal value is $0$ only when the original data are linearly separable, that is, there exist $w$ and $v$ such that $w^T x_i + v > 0$ when $b_i = 1$, and $w^T x_i + v < 0$ when $b_i = -1$. In this case the optimal value of the LRP (2) is not achieved (except in the limit with $w$ and $v$ growing arbitrarily large). The optimal value is $\log 2$, that is, $w=0$, $v=0$ are optimal, only if $\sum_{i=1}^m b_i = 0$ and $\sum_{i=1}^m a_i = 0$. (This follows from the expression for $\nabla l_{\mathrm{avg}}$, given in Section 2.1.) This occurs only when the number of positive examples (i.e., those for which $b_i = 1$) is equal to the number of negative examples, and the average of $x_i$ over the positive examples is the negative of the average value of $x_i$ over the negative examples.
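The average logistic loss, and the sanity check $l_{\mathrm{avg}}(0,0)=\log 2$, follow directly from the definitions above; the short sketch below (again illustrative NumPy, not the authors' code) reproduces them on random data.

```python
import numpy as np

def logistic_loss(z):
    # f(z) = log(1 + exp(-z)), computed in a numerically stable way
    return np.logaddexp(0.0, -z)

def l_avg(v, w, X, b):
    """Average logistic loss (1/m) sum_i f(w^T a_i + v b_i), with a_i = b_i x_i."""
    z = b * (X @ w + v)          # equals w^T a_i + v b_i
    return logistic_loss(z).mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))       # m = 50 examples, n = 10 features
b = np.sign(rng.standard_normal(50))
print(l_avg(0.0, np.zeros(10), X, b))   # log 2 ~ 0.693 for w = 0, v = 0
```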
The LRP (2) is a smooth convex optimization problem, and can be solved by a wide variety of methods, such as gradient descent, steepest descent, Newton, quasi-Newton, or conjugate-gradients (CG) methods (see, for example, Hastie et al., 2001, §4.4). Once we find maximum likelihood values of $v$ and $w$, that is, a solution of (2), we can predict the probability of the two possible outcomes, given a new feature vector $x \in \mathbf{R}^n$, using the associated logistic model. For example, when $w \neq 0$, we can form the logistic classifier
$\phi(x) = \mathrm{sgn}(w^T x + v)$,  (3)
where $\mathrm{sgn}(z) = +1$ for $z > 0$ and $-1$ for $z \leq 0$, which picks the more likely outcome, given $x$, according to the logistic model. This classifier is linear, meaning that the boundary between the two decision outcomes is a hyperplane (defined by $w^T x + v = 0$).

1.2 $\ell_2$-Regularized Logistic Regression

When $m$, the number of observations or training examples, is not large enough compared to $n$, the number of feature variables, simple logistic regression leads to over-fit. That is, the classifier found by solving the LRP (2) performs perfectly (or very well) on the training examples, but may perform poorly on unseen examples. Over-fitting tends to occur when the fitted model has many feature variables with (relatively) large weights in magnitude, that is, $w$ is large.

A standard technique to prevent over-fitting is regularization, in which an extra term that penalizes large weights is added to the average logistic loss function. The $\ell_2$-regularized logistic regression problem is
minimize $l_{\mathrm{avg}}(v,w) + \lambda\|w\|_2^2 = (1/m)\sum_{i=1}^m f(w^T a_i + v b_i) + \lambda\sum_{i=1}^n w_i^2$.  (4)
Here $\lambda > 0$ is the regularization parameter, used to control the trade-off between the average logistic loss and the size of the weight vector, as measured by the $\ell_2$-norm. No penalty term is imposed on the intercept, since it is a parameter for thresholding the weighted sum $w^T x$ in the linear classifier (3). The solution of the $\ell_2$-regularized regression problem (4) (which exists and is unique) can be interpreted in a Bayesian framework, as the maximum a posteriori probability (MAP) estimate of $w$ and $v$, when $w$ has a Gaussian prior distribution on $\mathbf{R}^n$ with zero mean and covariance $\lambda I$, and $v$ has the (improper) uniform prior on $\mathbf{R}$; see, for example, Chaloner and Larntz (1989) or Jaakkola and Jordan (2000).

The objective function in the $\ell_2$-regularized LRP is smooth and convex, and so (like the ordinary, unregularized LRP) can be minimized by standard methods such as gradient descent, steepest descent, Newton, quasi-Newton, truncated Newton, or CG methods; see, for example, Luenberger (1984), Lin et al. (2007), Minka (2003), Nocedal and Wright (1999), and Nash (2000). Other methods that have been used include optimization transfer (Krishnapuram and Hartemink, 2005; Zhang et al., 2004) and iteratively re-weighted least squares (Komarek, 2004). Newton's method and variants are very effective for small and medium sized problems, while conjugate-gradients and limited memory Newton (or truncated Newton) methods can handle very large problems.
In Minka (2003) the author compares several methods for $\ell_2$-regularized LRPs with large data sets. The fastest methods turn out to be conjugate-gradients and limited memory Newton methods, outperforming IRLS, gradient descent, and steepest descent methods. Truncated Newton methods have been applied to large-scale problems in several other fields, for example, image restoration (Fu et al., 2006) and support vector machines (Keerthi and DeCoste, 2005). For large-scale iterative methods such as truncated Newton or CG, the convergence typically improves as the regularization parameter $\lambda$ is increased, since (roughly speaking) this makes the objective more quadratic, and improves the conditioning of the problem.

1.3 $\ell_1$-Regularized Logistic Regression

More recently, $\ell_1$-regularized logistic regression has received much attention. The $\ell_1$-regularized logistic regression problem is
minimize $l_{\mathrm{avg}}(v,w) + \lambda\|w\|_1 = (1/m)\sum_{i=1}^m f(w^T a_i + v b_i) + \lambda\sum_{i=1}^n |w_i|$,  (5)
where $\lambda > 0$ is the regularization parameter. The only difference with $\ell_2$-regularized logistic regression is that we measure the size of $w$ by its $\ell_1$-norm, instead of its $\ell_2$-norm. A solution of the $\ell_1$-regularized logistic regression problem must exist, but it need not be unique. Any solution of the $\ell_1$-regularized logistic regression problem (5) can be interpreted in a Bayesian framework as a MAP estimate of $w$ and $v$, when $w$ has a Laplacian prior distribution and $v$ has the (improper) uniform prior.

The objective function in the $\ell_1$-regularized LRP (5) is convex, but not differentiable (specifically, when any of the weights is zero), so solving it is more of a computational challenge than solving the $\ell_2$-regularized LRP (4).

Despite the additional computational challenge posed by $\ell_1$-regularized logistic regression, compared to $\ell_2$-regularized logistic regression, interest in its use has been growing. The main motivation is that $\ell_1$-regularized LR typically yields a sparse vector $w$, that is, $w$ typically has relatively few nonzero coefficients. (In contrast, $\ell_2$-regularized LR typically yields $w$ with all coefficients nonzero.) When $w_j = 0$, the associated logistic model does not use the $j$th component of the feature vector, so sparse $w$ corresponds to a logistic model that uses only a few of the features, that is, components of the feature vector. Indeed, we can think of a sparse $w$ as a selection of the relevant or important features (i.e., those associated with nonzero $w_j$), as well as the choice of the intercept value and weights (for the selected features). A logistic model with sparse $w$ is, in a sense, simpler or more parsimonious than one with nonsparse $w$. It is not surprising that $\ell_1$-regularized LR can outperform $\ell_2$-regularized LR, especially when the number of observations is smaller than the number of features (Ng, 2004; Wainwright et al., 2007).

We refer to the number of nonzero components in $w$ as its cardinality, denoted $\mathbf{card}(w)$. Thus, $\ell_1$-regularized LR tends to yield $w$ with $\mathbf{card}(w)$ small; the regularization parameter $\lambda$ roughly controls $\mathbf{card}(w)$, with larger $\lambda$ typically (but not always) yielding smaller $\mathbf{card}(w)$.

The general idea of $\ell_1$ regularization for the purposes of model or feature selection (or just sparsity of solution) is quite old, and widely used in other contexts such as geophysics (Claerbout and Muir, 1973; Taylor et al., 1979; Levy and Fullagar, 1981; Oldenburg et al., 1983). In statistics, it is used in the well-known Lasso algorithm (Tibshirani, 1996) for $\ell_1$-regularized linear regression, and its extensions such as the fused Lasso (Tibshirani et al., 2005), the grouped Lasso (Kim et al., 2006; Yuan and Lin, 2006; Zhao et al., 2007), and the monotone Lasso (Hastie et al., 2007). The idea also comes up in signal processing in basis pursuit (Chen and Donoho, 1994; Chen et al., 2001),
signal recovery from incomplete measurements (Candes et al., 2006, 2005; Donoho, 2006), wavelet thresholding (Donoho et al., 1995), decoding of linear codes (Candes and Tao, 2005), portfolio optimization (Lobo et al., 2005), controller design (Hassibi et al., 1999), computer-aided design of integrated circuits (Boyd et al., 2001), computer vision (Bhusnurmath and Taylor, 2007), sparse principal component analysis (d'Aspremont et al., 2005; Zou et al., 2006), graphical model selection (Wainwright et al., 2007), maximum likelihood estimation of graphical models (Banerjee et al., 2006; Dahl et al., 2005), boosting (Rosset et al., 2004), and $\ell_1$-norm support vector machines (Zhu et al., 2004). A recent survey of the idea can be found in Tropp (2006). Donoho and Elad (2003) and Tropp (2006) give some theoretical analysis of why $\ell_1$ regularization leads to a sparse model in linear regression. Recently, theoretical properties of $\ell_1$-regularized linear regression have been studied by several researchers; see, for example, Zou (2006), Zhao and Yu (2006), and Zou et al. (2007).

To solve the $\ell_1$-regularized LRP (5), generic methods for nondifferentiable convex problems can be used, such as the ellipsoid method or subgradient methods (Shor, 1985; Polyak, 1987). These methods are usually very slow in practice, however. (Because $\ell_1$-regularized LR typically results in a weight vector with (many) zero components, we cannot simply ignore the nondifferentiability of the objective in the $\ell_1$-regularized LRP (5), hoping not to encounter points of nondifferentiability.)

Another approach is to transform the problem to one with differentiable objective and constraint functions. We can solve the $\ell_1$-regularized LRP (5) by solving an equivalent formulation, with linear inequality constraints,
minimize $l_{\mathrm{avg}}(v,w) + \lambda \mathbf{1}^T u$ subject to $-u_i \leq w_i \leq u_i$, $i=1,\ldots,n$,  (6)
where the variables are the original ones $v \in \mathbf{R}$, $w \in \mathbf{R}^n$, along with $u \in \mathbf{R}^n$. Here $\mathbf{1}$ denotes the vector with all components one, so $\mathbf{1}^T u$ is the sum of the components of $u$. (To see the equivalence with the $\ell_1$-regularized LRP (5), we note that at the optimal point for (6), we must have $u_i = |w_i|$, in which case the objectives in (6) and (5) are the same.)

The reformulated problem (6) is a convex optimization problem, with a smooth objective and linear constraint functions, so it can be solved by standard convex optimization methods such as SQP, augmented Lagrangian, interior-point, and other methods. High quality solvers that can directly handle the problem (6) (and therefore carry out $\ell_1$-regularized LR) include, for example, LOQO (Vanderbei, 1997), LANCELOT (Conn et al., 1992), MOSEK (MOSEK ApS, 2002), and NPSOL (Gill et al., 1986). These general purpose solvers can solve small and medium scale $\ell_1$-regularized LRPs quite effectively.

Other recently developed computational methods for $\ell_1$-regularized logistic regression include the IRLS method (Lee et al., 2006; Lokhorst, 1999), a generalized LASSO method (Roth, 2004) that extends the LASSO method proposed in Osborne et al. (2000) to generalized linear models, generalized iterative scaling (Goodman, 2004), bound optimization algorithms (Krishnapuram et al., 2005), online algorithms (Balakrishnan and Madigan, 2006; Perkins and Theiler, 2003), coordinate descent methods (Friedman et al., 2007; Genkin et al., 2006), and the Gauss-Seidel method (Shevade and Keerthi, 2003). Some of these methods can handle very large problems (assuming some sparsity in the data) with modest accuracy. But the additional computational cost required for these methods to achieve higher accuracy can be very large.
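For small instances, the reformulation (6), or the problem (5) itself, can simply be handed to a generic convex solver. The sketch below uses CVXPY, a modeling layer not mentioned in the paper; it is an illustrative stand-in for the general purpose solvers listed above, not the authors' method, and the data here are synthetic.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 50
X = rng.standard_normal((m, n))
b = np.sign(rng.standard_normal(m))
lam = 0.01

w = cp.Variable(n)
v = cp.Variable()
# cp.logistic(z) = log(1 + exp(z)), so f(w^T a_i + v b_i) = logistic(-b_i (x_i^T w + v))
loss = cp.sum(cp.logistic(-cp.multiply(b, X @ w + v))) / m
prob = cp.Problem(cp.Minimize(loss + lam * cp.norm1(w)))
prob.solve()
print("optimal objective:", prob.value)
print("card(w):", int(np.sum(np.abs(w.value) > 1e-6)))
```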
The main goal of this paper is to describe a specialized interior-point method for solving the $\ell_1$-regularized logistic regression problem that is very efficient, for all size problems. In particular our method handles very large problems, attains high accuracy, and is not much slower than the fastest large-scale methods (conjugate-gradients and limited memory Newton) applied to the $\ell_2$-regularized LRP. Numerical experiments show that our method is as fast as, or faster than, other methods, and reliably provides very accurate solutions. Compared with high-quality implementations of general purpose primal-dual interior-point methods, our method is far faster, especially for large problems. Compared with first-order methods such as coordinate descent methods, our method is comparable in solving large problems with modest accuracy, but is able to solve them with high accuracy with relatively small additional computational cost.

In this paper we focus on methods for solving the $\ell_1$-regularized LRP; we do not discuss the benefits or advantages of $\ell_1$-regularized LR, compared to $\ell_2$-regularized LR or other methods for modeling or constructing classifiers for two-class data.

1.4 Regularization Path

Let $(v^\star_\lambda, w^\star_\lambda)$ be a solution of the $\ell_1$-regularized LRP with regularization parameter $\lambda$. The family of solutions, as $\lambda$ varies over $(0,\infty)$, is called the ($\ell_1$-) regularization path. In many applications, the regularization path (or some portion of it) needs to be computed, in order to determine an appropriate value of $\lambda$. At the very least, the $\ell_1$-regularized LRP must be solved for multiple, and often many, values of $\lambda$.

In $\ell_1$-regularized linear regression, which is the problem of minimizing $\|Fz - g\|_2^2 + \lambda\|z\|_1$ over the variable $z$, where $\lambda > 0$ is the regularization parameter, $F \in \mathbf{R}^{p\times n}$ is the covariate matrix, and $g \in \mathbf{R}^p$ is the vector of responses, it can be shown that the regularization path is piecewise linear, with kinks at each point where any component of the variable $z$ transitions from zero to nonzero, or vice versa. Using this fact, the entire regularization path in a (small or medium size) $\ell_1$-regularized linear regression problem can be computed efficiently (Hastie et al., 2004; Efron et al., 2004; Rosset, 2005; Rosset and Zhu, 2007; Osborne et al., 2000). These methods are related to numerical continuation techniques for following piecewise smooth curves, which have been well studied in the optimization literature (Allgower and Georg, 1993).

Path-following methods have been applied to several problems (Hastie et al., 2004; Park and Hastie, 2006a,b; Rosset, 2005). Park and Hastie (2006a) describe an algorithm for (approximately) computing the entire regularization path for generalized linear models (GLMs) including logistic regression models. In Rosset (2005), a general path-following method based on a predictor-corrector method is described for general regularized convex loss minimization problems. Path-following methods can be slow for large-scale problems, where the number of kinks or events is very large (at least $n$). When the number of kinks on the portion of the regularization path of interest is modest, however, these path-following methods can be very fast, requiring just a small multiple of the effort needed to solve one regularized problem to compute the whole path (or a portion of it).

In this paper we describe a fast method for computing a large number of points on the regularization path, using a warm-start technique and our interior-point method. Unlike the methods mentioned above, our method does not attempt to track the path exactly (i.e., jumping from kink to kink on the path); it remains efficient even when successive values of $\lambda$ jump over many kinks. This is essential when computing the regularization path in a large-scale problem. Our method allows us to compute a large number of points (but much fewer than $n$, when $n$ is very large) on the regularization path, much more efficiently than by solving a family of the problems independently.
Our method is far more efficient than path-following methods in computing a good approximation of the regularization path for a medium-sized or large data set.

1.5 Outline

In Section 2, we give (necessary and sufficient) optimality conditions, and a dual problem, for the $\ell_1$-regularized LRP. Using the dual problem, we show how to compute a lower bound on the suboptimality of any pair $(v,w)$. We describe our basic interior-point method in Section 3, and demonstrate its performance in Section 4 with small and medium scale synthetic and machine learning benchmark examples. We show that $\ell_1$-regularized LR can be carried out within around 35 or so iterations, where each iteration has the same complexity as solving an $\ell_2$-regularized linear regression problem. In Section 5, we describe a variation on our basic method that uses a preconditioned conjugate gradient approach to compute the search direction. This variation can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in under an hour, on a PC, provided the data matrix is sufficiently sparse. In Section 6, we consider the problem of computing the regularization path efficiently, at a variety of values of $\lambda$ (but potentially far fewer than the number of kinks on the path). Using warm-start techniques, we show how this can be done much more efficiently than by solving a family of problems independently. In Section 7, we compare the interior-point method with several existing methods for $\ell_1$-regularized logistic regression. In Section 8, we describe generalizations of our method to other $\ell_1$-regularized convex loss minimization problems.

2. Optimality Conditions and Dual

In this section we derive a necessary and sufficient optimality condition for the $\ell_1$-regularized LRP, as well as a Lagrange dual problem, from which we obtain a lower bound on the objective that we will use in our algorithm.

2.1 Optimality Conditions

The objective function of the $\ell_1$-regularized LRP, $l_{\mathrm{avg}}(v,w) + \lambda\|w\|_1$, is convex but not differentiable, so we use a first-order optimality condition based on subdifferential calculus (see Bertsekas, 1999, Prop. B.24, or Borwein and Lewis, 2000, §2). The average logistic loss is differentiable, with
$\nabla_v l_{\mathrm{avg}}(v,w) = (1/m)\sum_{i=1}^m f'(w^T a_i + v b_i)\,b_i = -(1/m)\sum_{i=1}^m (1 - p_{\mathrm{log}}(v,w)_i)\,b_i = -(1/m)\,b^T(\mathbf{1} - p_{\mathrm{log}}(v,w))$,
and
$\nabla_w l_{\mathrm{avg}}(v,w) = (1/m)\sum_{i=1}^m f'(w^T a_i + v b_i)\,a_i = -(1/m)\sum_{i=1}^m (1 - p_{\mathrm{log}}(v,w)_i)\,a_i = -(1/m)\,A^T(\mathbf{1} - p_{\mathrm{log}}(v,w))$.
The subdifferential of $\|w\|_1$ is given by
$(\partial\|w\|_1)_i = \{1\}$ if $w_i > 0$; $\{-1\}$ if $w_i < 0$; $[-1,1]$ if $w_i = 0$.
The necessary and sufficient condition for $(v,w)$ to be optimal for the $\ell_1$-regularized LRP (5) is
$\nabla_v l_{\mathrm{avg}}(v,w) = 0$, $0 \in \nabla_w l_{\mathrm{avg}}(v,w) + \lambda\,\partial\|w\|_1$,
which can be expressed as
$b^T(\mathbf{1} - p_{\mathrm{log}}(v,w)) = 0$,  (7)
and
$(1/m)\left(A^T(\mathbf{1} - p_{\mathrm{log}}(v,w))\right)_i \in \{+\lambda\}$ if $w_i > 0$; $\{-\lambda\}$ if $w_i < 0$; $[-\lambda,\lambda]$ if $w_i = 0$, for $i=1,\ldots,n$.  (8)

Let us analyze when a pair of the form $(v,0)$ is optimal. This occurs if and only if
$b^T(\mathbf{1} - p_{\mathrm{log}}(v,0)) = 0$, $\|(1/m)\,A^T(\mathbf{1} - p_{\mathrm{log}}(v,0))\|_\infty \leq \lambda$.
The first condition is equivalent to $v = \log(m_+/m_-)$, where $m_+$ is the number of training examples with outcome $1$ (called positive) and $m_-$ is the number of training examples with outcome $-1$ (called negative). Using this value of $v$, the second condition becomes
$\lambda \geq \lambda_{\max} = \|(1/m)\,A^T(\mathbf{1} - p_{\mathrm{log}}(\log(m_+/m_-),\,0))\|_\infty$.
The number $\lambda_{\max}$ gives us an upper bound on the useful range of the regularization parameter $\lambda$: For any larger value of $\lambda$, the logistic model obtained from $\ell_1$-regularized LR has weight zero (and therefore has no ability to classify). Put another way, for $\lambda \geq \lambda_{\max}$, we get a maximally sparse weight vector, that is, one with $\mathbf{card}(w) = 0$.

We can give a more explicit formula for $\lambda_{\max}$:
$\lambda_{\max} = (1/m)\big\| (m_-/m)\sum_{b_i=1} a_i + (m_+/m)\sum_{b_i=-1} a_i \big\|_\infty = (1/m)\big\| X^T\tilde b \big\|_\infty$,
where $\tilde b_i = m_-/m$ if $b_i = 1$ and $\tilde b_i = -m_+/m$ if $b_i = -1$, $i=1,\ldots,m$. Thus, $\lambda_{\max}$ is a maximum correlation between the individual features and the (weighted) output vector $\tilde b$. When the features have been standardized, we have $\sum_{i=1}^m x_i = 0$, so we get the simplified expression
$\lambda_{\max} = (1/m)\big\|\sum_{b_i=1} x_i\big\|_\infty = (1/m)\big\|\sum_{b_i=-1} x_i\big\|_\infty$.
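The quantity $\lambda_{\max}$ is easy to evaluate directly from the data; the sketch below (illustrative NumPy, not the authors' code) computes it from the gradient expression above.

```python
import numpy as np

def lambda_max(X, b):
    """Smallest lambda for which the l1-regularized LR solution has w = 0.
    X is the m x n matrix of feature vectors x_i, b the +/-1 outcome vector."""
    m = len(b)
    m_pos, m_neg = np.sum(b == 1), np.sum(b == -1)
    v = np.log(m_pos / m_neg)                 # optimal intercept for w = 0
    p = 1.0 / (1.0 + np.exp(-b * v))          # p_log(v, 0)
    A = X * b[:, None]                        # rows a_i = b_i x_i
    return np.linalg.norm(A.T @ (1.0 - p), np.inf) / m

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
b = np.sign(rng.standard_normal(100))
print(lambda_max(X, b))
```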
2.2 Dual Problem

To derive a Lagrange dual of the $\ell_1$-regularized LRP (5), we first introduce a new variable $z \in \mathbf{R}^m$, as well as new equality constraints $z_i = w^T a_i + v b_i$, $i=1,\ldots,m$, to obtain the equivalent problem
minimize $(1/m)\sum_{i=1}^m f(z_i) + \lambda\|w\|_1$ subject to $z_i = w^T a_i + v b_i$, $i=1,\ldots,m$.  (9)
Associating dual variables $\theta_i \in \mathbf{R}$ with the equality constraints, the Lagrangian is
$L(v,w,z,\theta) = (1/m)\sum_{i=1}^m f(z_i) + \lambda\|w\|_1 + \theta^T(-z + Aw + bv)$.
The dual function is
$\inf_{v,w,z} L(v,w,z,\theta) = (1/m)\inf_z \sum_{i=1}^m (f(z_i) - m\theta_i z_i) + \inf_w \left(\lambda\|w\|_1 + \theta^T A w\right) + \inf_v \theta^T b\, v$,
which equals $-(1/m)\sum_{i=1}^m f^*(m\theta_i)$ when $\|A^T\theta\|_\infty \leq \lambda$ and $b^T\theta = 0$, and $-\infty$ otherwise, where $f^*$ is the conjugate of the logistic loss function $f$:
$f^*(y) = \sup_{u\in\mathbf{R}}(yu - f(u)) = (-y)\log(-y) + (1+y)\log(1+y)$ for $-1 < y < 0$; $0$ for $y=-1$ or $y=0$; $\infty$ otherwise.
For general background on convex duality and conjugates, see, for example, Boyd and Vandenberghe (2004, Chap. 5) or Borwein and Lewis (2000).

Thus, we have the following Lagrange dual of the $\ell_1$-regularized LRP (5):
maximize $G(\theta)$ subject to $\|A^T\theta\|_\infty \leq \lambda$, $b^T\theta = 0$,  (10)
where
$G(\theta) = -(1/m)\sum_{i=1}^m f^*(m\theta_i)$
is the dual objective. The dual problem (10) is a convex optimization problem with variable $\theta \in \mathbf{R}^m$, and has the form of an $\ell_\infty$-norm constrained maximum generalized entropy problem. We say that $\theta \in \mathbf{R}^m$ is dual feasible if it satisfies $\|A^T\theta\|_\infty \leq \lambda$ and $b^T\theta = 0$. From standard results in convex optimization we have the following.

Weak duality. Any dual feasible point $\theta$ gives a lower bound on the optimal value $p^\star$ of the (primal) $\ell_1$-regularized LRP (5):
$G(\theta) \leq p^\star$.  (11)

Strong duality. The $\ell_1$-regularized LRP (5) satisfies a variation on Slater's constraint qualification, so there is an optimal solution of the dual (10), $\theta^\star$, which satisfies $G(\theta^\star) = p^\star$. In other words, the optimal values of the primal (5) and dual (10) are equal.

We can relate a primal optimal point $(v^\star, w^\star)$ and a dual optimal point $\theta^\star$ to the optimality conditions (7) and (8). They are related by
$\theta^\star = -(1/m)(\mathbf{1} - p_{\mathrm{log}}(v^\star, w^\star))$.
We also note that the dual problem (10) can be derived starting from the equivalent problem (6), by introducing new variables $z_i$ (as we did in (9)), and associating dual variables $\theta_+ \geq 0$ for the inequalities $w \leq u$, and $\theta_- \geq 0$ for the inequalities $-u \leq w$. By identifying $\theta = \theta_+ - \theta_-$ we obtain the dual problem (10).

2.3 Suboptimality Bound

We now derive an easily computed bound on the suboptimality of a pair $(v,w)$, by constructing a dual feasible point $\bar\theta$ from an arbitrary $w$. Define $\bar v$ as
$\bar v = \operatorname{argmin}_v\, l_{\mathrm{avg}}(v,w)$,  (12)
that is, $\bar v$ is the optimal intercept for the weight vector $w$, characterized by $b^T(\mathbf{1} - p_{\mathrm{log}}(\bar v, w)) = 0$. Now, we define $\bar\theta$ as
$\bar\theta = -(s/m)(\mathbf{1} - p_{\mathrm{log}}(\bar v, w))$,  (13)
where the scaling constant $s$ is
$s = \min\left\{ m\lambda / \|A^T(\mathbf{1} - p_{\mathrm{log}}(\bar v, w))\|_\infty,\ 1 \right\}$.
Evidently $\bar\theta$ is dual feasible, so $G(\bar\theta)$ is a lower bound on $p^\star$, the optimal value of the $\ell_1$-regularized LRP (5).

To compute the lower bound $G(\bar\theta)$, we first compute $\bar v$. This is a one-dimensional smooth convex optimization problem, which can be solved very efficiently, for example, by a bisection method on the optimality condition
$b^T(\mathbf{1} - p_{\mathrm{log}}(v,w)) = 0$,
since the left hand side is a monotone function of $v$. Newton's method can be used to ensure extremely fast terminal convergence to $\bar v$. From $\bar v$, we compute $\bar\theta$ using (13), and then evaluate the lower bound $G(\bar\theta)$.

The difference between the primal objective value of $(v,w)$ and the associated lower bound $G(\bar\theta)$ is called the duality gap, and denoted $\eta$:
$\eta(v,w) = l_{\mathrm{avg}}(v,w) + \lambda\|w\|_1 - G(\bar\theta) = (1/m)\sum_{i=1}^m\left(f(w^T a_i + v b_i) + f^*(m\bar\theta_i)\right) + \lambda\|w\|_1$.  (14)
We always have $\eta \geq 0$, and (by weak duality (11)) the point $(v,w)$ is no more than $\eta$-suboptimal. At the optimal point $(v^\star, w^\star)$, we have $\eta = 0$.
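The suboptimality bound above is cheap to evaluate. The sketch below (illustrative NumPy under the sign conventions written above; not the authors' implementation) computes the dual feasible point and the duality gap for a given $(v,w)$, using bisection for the optimal intercept.

```python
import numpy as np

def f(z):                      # logistic loss f(z) = log(1 + exp(-z))
    return np.logaddexp(0.0, -z)

def fstar(y):                  # conjugate of f, finite on [-1, 0]
    y = np.clip(y, -1.0, 0.0)
    with np.errstate(all="ignore"):
        val = np.where((y > -1) & (y < 0),
                       (-y) * np.log(-y) + (1 + y) * np.log(1 + y), 0.0)
    return val

def duality_gap(v, w, X, b, lam):
    m = len(b)
    A = X * b[:, None]                          # rows a_i = b_i x_i
    # bisection on b^T (1 - p_log(v, w)) = 0 for the optimal intercept vbar
    def r(vv):
        return b @ (1.0 - 1.0 / (1.0 + np.exp(-(A @ w + vv * b))))
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if r(mid) > 0 else (lo, mid)
    vbar = 0.5 * (lo + hi)
    g = 1.0 - 1.0 / (1.0 + np.exp(-(A @ w + vbar * b)))   # 1 - p_log(vbar, w)
    s = min(m * lam / np.linalg.norm(A.T @ g, np.inf), 1.0)
    theta = -(s / m) * g                        # dual feasible point (13)
    primal = f(A @ w + v * b).mean() + lam * np.abs(w).sum()
    dual = -fstar(m * theta).mean()             # G(theta)
    return primal - dual                        # eta >= 0
```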
3. An Interior-Point Method

In this section we describe an interior-point method for solving the $\ell_1$-regularized LRP (5), in the equivalent formulation
minimize $l_{\mathrm{avg}}(v,w) + \lambda \mathbf{1}^T u$ subject to $-u_i \leq w_i \leq u_i$, $i=1,\ldots,n$,
with variables $w, u \in \mathbf{R}^n$ and $v \in \mathbf{R}$.

3.1 Logarithmic Barrier and Central Path

The logarithmic barrier for the bound constraints $-u_i \leq w_i \leq u_i$ is
$\Phi(w,u) = -\sum_{i=1}^n \log(u_i + w_i) - \sum_{i=1}^n \log(u_i - w_i) = -\sum_{i=1}^n \log(u_i^2 - w_i^2)$,
with domain
$\mathbf{dom}\,\Phi = \{(w,u) \in \mathbf{R}^n\times\mathbf{R}^n \mid |w_i| < u_i,\ i=1,\ldots,n\}$.
The logarithmic barrier function is smooth and convex. We augment the weighted objective function by the logarithmic barrier, to obtain
$\phi_t(v,w,u) = t\,l_{\mathrm{avg}}(v,w) + t\lambda\mathbf{1}^T u + \Phi(w,u)$,
where $t > 0$ is a parameter. This function is smooth, strictly convex, and bounded below, and so has a unique minimizer which we denote $(v^\star(t), w^\star(t), u^\star(t))$. This defines a curve in $\mathbf{R}\times\mathbf{R}^n\times\mathbf{R}^n$, parametrized by $t$, called the central path. (See Boyd and Vandenberghe, 2004, Chap. 11, for more on the central path and its properties.)

With the point $(v^\star(t), w^\star(t), u^\star(t))$ we associate
$\theta^\star(t) = -(1/m)(\mathbf{1} - p_{\mathrm{log}}(v^\star(t), w^\star(t)))$,
which can be shown to be dual feasible. (Indeed, it coincides with the dual feasible point $\bar\theta$ constructed from $w^\star(t)$ using the method of Section 2.3.) The associated duality gap satisfies
$l_{\mathrm{avg}}(v^\star(t), w^\star(t)) + \lambda\|w^\star(t)\|_1 - G(\theta^\star(t)) \leq l_{\mathrm{avg}}(v^\star(t), w^\star(t)) + \lambda\mathbf{1}^T u^\star(t) - G(\theta^\star(t)) = 2n/t$.
In particular, $(v^\star(t), w^\star(t))$ is no more than $2n/t$-suboptimal, so the central path leads to an optimal solution.

In a primal interior-point method, we compute a sequence of points on the central path, for an increasing sequence of values of $t$, using Newton's method to minimize $\phi_t(v,w,u)$, starting from the previously computed central point. A typical method uses the sequence $t = t_0, \mu t_0, \mu^2 t_0, \ldots$, where $\mu$ is between 2 and 50 (see Boyd and Vandenberghe, 2004, §11.3). The method can be terminated when $2n/t \leq \epsilon$, since then we can guarantee $\epsilon$-suboptimality of $(v^\star(t), w^\star(t))$. The reader is referred to Nesterov and Nemirovsky (1994), Wright (1997), and Ye (1997) for more on (primal) interior-point methods.

3.2 A Custom Interior-Point Method

Using our method for cheaply computing a dual feasible point and associated duality gap for any $(v,w)$ (and not just for $(v,w)$ on the central path, as in the general case), we can construct a custom interior-point method that updates the parameter $t$ at each iteration.

CUSTOM INTERIOR-POINT METHOD FOR $\ell_1$-REGULARIZED LR

given tolerance $\epsilon > 0$, line search parameters $\alpha \in (0,1/2)$, $\beta \in (0,1)$.
Set initial values: $t := 1/\lambda$, $v := \log(m_+/m_-)$, $w := 0$, $u := \mathbf{1}$.
repeat
  1. Compute search direction. Solve the Newton system $\nabla^2\phi_t(v,w,u)\,[\Delta v\ \Delta w\ \Delta u]^T = -\nabla\phi_t(v,w,u)$.
  2. Backtracking line search. Find the smallest integer $k \geq 0$ that satisfies
     $\phi_t(v+\beta^k\Delta v,\ w+\beta^k\Delta w,\ u+\beta^k\Delta u) \leq \phi_t(v,w,u) + \alpha\beta^k\,\nabla\phi_t(v,w,u)^T[\Delta v\ \Delta w\ \Delta u]^T$.
  3. Update. $(v,w,u) := (v,w,u) + \beta^k(\Delta v, \Delta w, \Delta u)$.
  4. Set $v := \bar v$, the optimal value of the intercept, as in (12).
  5. Construct the dual feasible point $\bar\theta$ from (13).
  6. Evaluate the duality gap $\eta$ from (14).
  7. quit if $\eta \leq \epsilon$.
  8. Update $t$.

This description is complete, except for the rule for updating the parameter $t$, which will be described below. Our choice of initial values for $v$, $w$, $u$, and $t$ can be explained as follows. The choice $w=0$ and $u=\mathbf{1}$ seems to work very well, especially when the original data are standardized. The choice $v = \log(m_+/m_-)$ is the optimal value of $v$ when $w=0$ and $u=\mathbf{1}$, and the choice $t = 1/\lambda$ minimizes $\|(1/t)\nabla\phi_t(\log(m_+/m_-), 0, \mathbf{1})\|_2$. (In any case, the choice of the initial values does not greatly affect performance.) The construction of a dual feasible point and duality gap, in steps 4–6, is explained in Section 2.3. Typical values for the line search parameters are $\alpha = 0.01$, $\beta = 0.5$, but here too, these parameter values do not have a large effect on performance. The computational effort per iteration is dominated by step 1, the search direction computation.

There are many possible update rules for the parameter $t$. In a classical primal barrier method, $t$ is held constant until $\phi_t$ is (approximately) minimized, that is, $\|\nabla\phi_t\|_2$ is small; when this occurs, $t$ is increased by a factor typically between 2 and 50. More sophisticated update rules can be found in, for example, Nesterov and Nemirovsky (1994), Wright (1997), and Ye (1997). The update rule we propose is
$t := \max\{\mu\min\{\hat t, t\},\ t\}$ if $s \geq s_{\min}$, and $t := t$ if $s < s_{\min}$,  (15)
where $\hat t = 2n/\eta$, and $s = \beta^k$ is the step length chosen in the line search. Here $\mu > 1$ and $s_{\min} \in (0,1]$ are algorithm parameters; we have found good performance with $\mu = 2$ and $s_{\min} = 0.5$.

To explain the update rule (15), we first give an interpretation of $\hat t$. If $(v,w,u)$ is on the central path, that is, $\phi_t$ is minimized, the duality gap is $\eta = 2n/t$. Thus $\hat t$ is the value of $t$ for which the associated central point has the same duality gap as the current point. Another interpretation is that if $t$ were held constant at $t = \hat t$, then $(v,w,u)$ would converge to $(v^\star(\hat t), w^\star(\hat t), u^\star(\hat t))$, at which point the duality gap would be exactly $\eta$.

We use the step length $s$ as a crude measure of proximity to the central path. When the current point is near the central path, that is, $\phi_t$ is nearly minimized, we have $s = 1$; far from the central path, we typically have $s \ll 1$. Now we can explain the update rule (15). When the current point is near the central path, as judged by $s \geq s_{\min}$ and $\hat t \geq t$, we increase $t$ by a factor $\mu$; otherwise, we keep $t$ at its current value.
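The outer loop of the custom method can be written down compactly. The sketch below is illustrative Python pseudocode of the loop structure only, with the Newton solve, barrier evaluation, and duality-gap routine supplied as callables; parameter names follow the description above, and nothing here is the authors' released code.

```python
import numpy as np

def custom_interior_point(grad_phi, hess_solve, phi, dual_gap, opt_intercept,
                          v, w, u, lam, n, eps=1e-8, alpha=0.01, beta=0.5,
                          mu=2.0, s_min=0.5, max_iter=200):
    """Outer loop of the custom interior-point method (structure only).
    grad_phi/hess_solve/phi evaluate phi_t and its Newton step;
    dual_gap and opt_intercept implement (12)-(14)."""
    t = 1.0 / lam
    for _ in range(max_iter):
        dv, dw, du = hess_solve(t, v, w, u)          # step 1: Newton direction
        g = grad_phi(t, v, w, u)
        step, f0 = 1.0, phi(t, v, w, u)
        while phi(t, v + step*dv, w + step*dw, u + step*du) > \
                f0 + alpha * step * (g @ np.concatenate(([dv], dw, du))):
            step *= beta                              # step 2: backtracking
        v, w, u = v + step*dv, w + step*dw, u + step*du   # step 3
        v = opt_intercept(w)                          # step 4: v := vbar
        eta = dual_gap(v, w)                          # steps 5-6
        if eta <= eps:                                # step 7
            return v, w, u
        t_hat = 2.0 * n / eta                         # step 8: update rule (15)
        if step >= s_min:
            t = max(mu * min(t_hat, t), t)
    return v, w, u
```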
We can give an informal justification of convergence of the custom interior-point algorithm. (A formal proof of convergence would be quite long.) Assume that the algorithm does not terminate. Since $t$ never decreases, it either increases without bound, or converges to some value $\bar t$. In the first case, the duality gap $\eta$ converges to zero, so the algorithm must exit. In the second case, the algorithm reduces (roughly) to Newton's method for minimizing $\phi_{\bar t}$. This must converge, which means that $(v,w,u)$ converges to $(v^\star(\bar t), w^\star(\bar t), u^\star(\bar t))$. Therefore the duality gap converges to $\bar\eta = 2n/\bar t$. A basic property of Newton's method is that near the solution, the step length is one. At the limit, we therefore have
$\bar t = \max\{\mu\min\{2n/\bar\eta, \bar t\}, \bar t\} = \mu\bar t$,
which is a contradiction since $\mu > 1$.

3.3 Gradient and Hessian

In this section we give explicit formulas for the gradient and Hessian of $\phi_t$. The gradient $g = \nabla\phi_t(v,w,u)$ is given by $g = [g_1\ g_2\ g_3]^T \in \mathbf{R}^{2n+1}$, where
$g_1 = \nabla_v\phi_t(v,w,u) = -(t/m)\,b^T(\mathbf{1} - p_{\mathrm{log}}(v,w)) \in \mathbf{R}$,
$g_2 = \nabla_w\phi_t(v,w,u) = -(t/m)\,A^T(\mathbf{1} - p_{\mathrm{log}}(v,w)) + \left[\,2w_1/(u_1^2-w_1^2)\ \cdots\ 2w_n/(u_n^2-w_n^2)\,\right]^T \in \mathbf{R}^n$,
$g_3 = \nabla_u\phi_t(v,w,u) = t\lambda\mathbf{1} - \left[\,2u_1/(u_1^2-w_1^2)\ \cdots\ 2u_n/(u_n^2-w_n^2)\,\right]^T \in \mathbf{R}^n$.
The Hessian $H = \nabla^2\phi_t(v,w,u)$ is given by
$H = \begin{bmatrix} t\,b^T D_0 b & t\,b^T D_0 A & 0 \\ t\,A^T D_0 b & t\,A^T D_0 A + D_1 & D_2 \\ 0 & D_2 & D_1 \end{bmatrix} \in \mathbf{R}^{(2n+1)\times(2n+1)}$,
where
$D_0 = (1/m)\,\mathbf{diag}(f''(w^Ta_1+vb_1), \ldots, f''(w^Ta_m+vb_m))$,
$D_1 = \mathbf{diag}\big(2(u_1^2+w_1^2)/(u_1^2-w_1^2)^2, \ldots, 2(u_n^2+w_n^2)/(u_n^2-w_n^2)^2\big)$,
$D_2 = \mathbf{diag}\big(-4u_1w_1/(u_1^2-w_1^2)^2, \ldots, -4u_nw_n/(u_n^2-w_n^2)^2\big)$.
Here, we use $\mathbf{diag}(z_1,\ldots,z_m)$ to denote the diagonal matrix with diagonal entries $z_1,\ldots,z_m$, where $z_i \in \mathbf{R}$, $i=1,\ldots,m$. The Hessian $H$ is symmetric and positive definite.

3.4 Computing the Search Direction

The search direction is defined by the linear equations (Newton system)
$\begin{bmatrix} t\,b^T D_0 b & t\,b^T D_0 A & 0 \\ t\,A^T D_0 b & t\,A^T D_0 A + D_1 & D_2 \\ 0 & D_2 & D_1 \end{bmatrix} \begin{bmatrix} \Delta v \\ \Delta w \\ \Delta u \end{bmatrix} = -\begin{bmatrix} g_1 \\ g_2 \\ g_3 \end{bmatrix}$.
We first eliminate $\Delta u$ to obtain the reduced Newton system
$H_{\mathrm{red}}\begin{bmatrix}\Delta v \\ \Delta w\end{bmatrix} = -g_{\mathrm{red}}$,  (16)
where
$H_{\mathrm{red}} = \begin{bmatrix} t\,b^T D_0 b & t\,b^T D_0 A \\ t\,A^T D_0 b & t\,A^T D_0 A + D_3 \end{bmatrix}$, $g_{\mathrm{red}} = \begin{bmatrix} g_1 \\ g_2 - D_2 D_1^{-1} g_3 \end{bmatrix}$, $D_3 = D_1 - D_2 D_1^{-1} D_2$.
Once this reduced system is solved, $\Delta u$ can be recovered as
$\Delta u = -D_1^{-1}(g_3 + D_2\Delta w)$.
Several methods can be used to solve the reduced Newton system (16), depending on the relative sizes of $n$ and $m$ and the sparsity of the data $A$.

3.4.1 More Examples than Features

We first consider the case when $m \geq n$, that is, there are more examples than features. We form $H_{\mathrm{red}}$, at a cost of $O(mn^2)$ flops (floating-point operations), then solve the reduced system (16) by Cholesky factorization of $H_{\mathrm{red}}$, followed by back and forward substitution steps, at a cost of $O(n^3)$ flops. The total cost using this method is $O(mn^2 + n^3)$ flops, which is the same as $O(mn^2)$ when there are more examples than features.

When $A$ is sufficiently sparse, the matrix $t\,A^T D_0 A + D_3$ is sparse, so $H_{\mathrm{red}}$ is sparse, with a dense first row and column. By exploiting sparsity in forming $t\,A^T D_0 A + D_3$, and using a sparse Cholesky factorization to factor $H_{\mathrm{red}}$, the complexity can be much smaller than $O(mn^2)$ flops (see Boyd and Vandenberghe, 2004, App. C, or George and Liu, 1981).

3.4.2 Fewer Examples than Features

When $m \leq n$, that is, there are fewer examples than features, the matrix $H_{\mathrm{red}}$ is a diagonal matrix plus a rank $m+1$ matrix, so we can use the Sherman-Morrison-Woodbury formula to solve the reduced Newton system (16) at a cost of $O(m^2 n)$ flops (see Boyd and Vandenberghe, 2004, §4.3). We start by eliminating $\Delta w$ from (16) to obtain
$(t\,b^T D_0 b - t^2 b^T D_0 A S^{-1} A^T D_0 b)\,\Delta v = -g_1 + t\,b^T D_0 A S^{-1}(g_2 - D_2 D_1^{-1} g_3)$,
where $S = t\,A^T D_0 A + D_3$. By the Sherman-Morrison-Woodbury formula (Golub and Van Loan, 1996, p. 50), the inverse of $S$ is given by
$S^{-1} = D_3^{-1} - D_3^{-1} A^T \left((1/t)D_0^{-1} + A D_3^{-1} A^T\right)^{-1} A D_3^{-1}$.
We can now calculate $\Delta v$ via Cholesky factorization of the matrix $(1/t)D_0^{-1} + A D_3^{-1} A^T$ and two back substitutions (Boyd and Vandenberghe, 2004, App. C). Once we compute $\Delta v$, we can compute the other components of the search direction as
$\Delta w = -S^{-1}(g_2 - D_2 D_1^{-1} g_3 + t\,A^T D_0 b\,\Delta v)$, $\Delta u = -D_1^{-1}(g_3 + D_2\Delta w)$.
The total cost of computing the search direction is $O(m^2 n)$ flops. We can exploit sparsity in the Cholesky factorization, whenever $(1/t)D_0^{-1} + A D_3^{-1} A^T$ is sufficiently sparse, to reduce the complexity.

3.4.3 Summary

In summary, the number of flops needed to compute the search direction is
$O(\min(n,m)^2\max(n,m))$,
using dense matrix methods. If $m \geq n$ and $A^T A$ is sparse, or $m \leq n$ and $A A^T$ is sparse, we can use (direct) sparse matrix methods to compute the search direction with less effort. In each of these cases, the computational effort per iteration of the interior-point method is the same as the effort of solving one $\ell_2$-regularized linear regression problem.
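For the case with more examples than features, the reduced system (16) can be formed and solved directly; the sketch below (illustrative dense NumPy, not the authors' Matlab/C implementation) assembles $H_{\mathrm{red}}$ and $g_{\mathrm{red}}$ from the block formulas of Section 3.3 and recovers $(\Delta v, \Delta w, \Delta u)$.

```python
import numpy as np

def search_direction(t, v, w, u, A, b, lam):
    """Newton step for phi_t via the reduced system (16); dense, m >= n case."""
    m, n = A.shape
    z = A @ w + v * b
    p = 1.0 / (1.0 + np.exp(-z))                       # p_log(v, w)
    d0 = p * (1.0 - p) / m                             # diag(D0): f''(z_i)/m
    g1 = -(t / m) * (b @ (1.0 - p))
    g2 = -(t / m) * (A.T @ (1.0 - p)) + 2.0 * w / (u**2 - w**2)
    g3 = t * lam - 2.0 * u / (u**2 - w**2)
    d1 = 2.0 * (u**2 + w**2) / (u**2 - w**2)**2
    d2 = -4.0 * u * w / (u**2 - w**2)**2
    d3 = d1 - d2**2 / d1
    Hred = np.empty((n + 1, n + 1))
    Hred[0, 0] = t * (b * d0) @ b
    Hred[0, 1:] = Hred[1:, 0] = t * A.T @ (d0 * b)
    Hred[1:, 1:] = t * (A.T * d0) @ A + np.diag(d3)
    gred = np.concatenate(([g1], g2 - d2 / d1 * g3))
    sol = np.linalg.solve(Hred, -gred)                 # Cholesky in practice
    dv, dw = sol[0], sol[1:]
    du = -(g3 + d2 * dw) / d1
    return dv, dw, du
```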
4. Numerical Examples

In this section we give some numerical examples to illustrate the performance of the interior-point method described in Section 3, using algorithm parameters
$\alpha = 0.01$, $\beta = 0.5$, $s_{\min} = 0.5$, $\mu = 2$, $\epsilon = 10^{-8}$.
(The algorithm performs well for much smaller values of $\epsilon$, but this accuracy is more than adequate for any practical use.) The algorithm was implemented in both Matlab and C, and run on a 3.2GHz Pentium IV under Linux. The C implementation, which is more efficient than the Matlab implementation (especially for sparse problems), is available online (www.stanford.edu/~boyd/l1_logreg).

4.1 Benchmark Problems

The data are four small or medium standard data sets taken from the UCI machine learning benchmark repository (Newman et al., 1998) and other sources. The first data set is leukemia cancer gene expression data (Golub et al., 1999), the second is colon tumor gene expression data (Alon et al., 1999), the third is ionosphere data (Newman et al., 1998), and the fourth is spambase data (Newman et al., 1998). For each data set, we considered four values of the regularization parameter: $\lambda = 0.5\lambda_{\max}$, $\lambda = 0.1\lambda_{\max}$, $\lambda = 0.05\lambda_{\max}$, and $\lambda = 0.01\lambda_{\max}$. We discarded examples with missing data, and standardized each data set. The dimensions of each problem, along with the number of interior-point method iterations (IP iterations) needed, and the execution time, are given in Table 1. In reporting $\mathbf{card}(w)$, we consider a component $w_i$ to be zero when
$\big|\,(1/m)\left(A^T(\mathbf{1} - p_{\mathrm{log}}(v,w))\right)_i\,\big| \leq \tau\lambda$,
where $\tau = 0.9999$. This rule is inspired by the optimality condition in (8).

  Data                               Features n   Examples m   lambda/lambda_max   card(w)   IP iterations   Time (sec)
  Leukemia (Golub et al., 1999)           7129           38               0.5            6              37         0.60
                                                                          0.1           14              38         0.62
                                                                          0.05          14              39         0.63
                                                                          0.01          18              37         0.60
  Colon (Alon et al., 1999)               2000           62               0.5            7              35         0.26
                                                                          0.1           22              32         0.25
                                                                          0.05          25              33         0.26
                                                                          0.01          28              32         0.25
  Ionosphere (Newman et al., 1998)          34          351               0.5            3              30         0.02
                                                                          0.1           11              29         0.02
                                                                          0.05          14              30         0.02
                                                                          0.01          24              33         0.03
  Spambase (Newman et al., 1998)            57         4061               0.5            8              31         0.63
                                                                          0.1           28              32         0.66
                                                                          0.05          38              33         0.69
                                                                          0.01          52              36         0.75

Table 1: Performance of the interior-point method on 4 data sets, each for 4 values of $\lambda$.

In all sixteen examples, around 35 iterations were required. We have observed this behavior over a large number of other examples as well. The execution times are well predicted by the complexity order $\min(m,n)^2\max(m,n)$.

Figure 1 shows the progress of the interior-point method on the four data sets, for the same four values of $\lambda$. The vertical axis shows duality gap, and the horizontal axis shows iteration number, which is the natural measure of computational effort when dense linear algebra methods are used. The figures show that the algorithm has linear convergence, with duality gap decreasing by a factor around 1.85 in each iteration.

4.2 Randomly Generated Problems

To examine the effect of problem size on the number of iterations required, we generate 100 random problem instances for each of 20 values of $n$, ranging from $n=100$ to $n=10000$, with $m=0.1n$, that is, 10 times more features than examples. Each problem has an equal number of positive and negative examples, that is, $m_+ = m_- = m/2$. Features of positive (negative) examples are independent and identically distributed, drawn from a normal distribution $N(v,1)$, where $v$ is in turn drawn from a uniform distribution on $[0,1]$ ($[-1,0]$). For each of the 2000 data sets, we solve the $\ell_1$-regularized LRP for $\lambda = 0.5\lambda_{\max}$, $\lambda = 0.1\lambda_{\max}$, and $\lambda = 0.05\lambda_{\max}$. The left hand plot in Figure 2 shows the mean and standard deviation of the number of iterations required to solve the 100 problem instances associated with each value of $n$ and $\lambda$. It can be seen that the number of iterations required is very near 35, for all 6000 problem instances.

In the same way, we generate a family of data sets with $m = 10n$, that is, 10 times more examples than features, with 100 problem instances for each of 20 values of $n$ ranging from $n=10$ to $n=1000$, and for the same 3 values of $\lambda$. The right hand plot in Figure 2 shows the mean and standard deviation
Figure 1: Progress of the interior-point method on 4 data sets, showing duality gap versus iteration number. Top left: Leukemia cancer gene data set. Top right: Colon tumor gene data set. Bottom left: Ionosphere data set. Bottom right: Spambase data set.

of the number of iterations required to solve the 100 problem instances associated with each value of $n$ and $\lambda$. The results are quite similar to the case with $m = 0.1n$.

Figure 2: Average number of iterations required to solve 100 randomly generated $\ell_1$-regularized LRPs with different problem size and regularization parameter. Left: $n = 10m$. Right: $n = 0.1m$. Error bars show standard deviation.

5. Truncated Newton Interior-Point Method

In this section we describe a variation on our interior-point method that can handle very large problems, provided the data matrix $A$ is sparse, at the cost of having a runtime that is less predictable. The basic idea is to compute the search direction approximately, using a preconditioned conjugate gradients (PCG) method. When the search direction in Newton's method is computed approximately, using an iterative method such as PCG, the overall algorithm is called a conjugate gradient Newton method, or a truncated Newton method (Ruszczynski, 2006; Dembo and Steihaug, 1983). Truncated Newton methods have been applied to interior-point methods (see, for example, Vandenberghe and Boyd, 1995, and Portugal et al., 2000).

5.1 Preconditioned Conjugate Gradients

The PCG algorithm (Demmel, 1997, §6.6) computes an approximate solution of the linear equations $Hx = g$, where $H \in \mathbf{R}^{N\times N}$ is symmetric positive definite. It uses a preconditioner $P \in \mathbf{R}^{N\times N}$, also symmetric positive definite.

PRECONDITIONED CONJUGATE GRADIENTS ALGORITHM

given relative tolerance $\epsilon_{\mathrm{pcg}} > 0$, iteration limit $N_{\mathrm{pcg}}$, and $x_0 \in \mathbf{R}^N$
$k := 0$, $r_0 := g - Hx_0$, $y_0 := P^{-1}r_0$, $p_1 := y_0$.
repeat
  $k := k+1$
  $z := Hp_k$
  $q_k := y_{k-1}^T r_{k-1} / p_k^T z$
  $x_k := x_{k-1} + q_k p_k$
  $r_k := r_{k-1} - q_k z$
  $y_k := P^{-1} r_k$
  $\mu_{k+1} := y_k^T r_k / y_{k-1}^T r_{k-1}$
  $p_{k+1} := y_k + \mu_{k+1} p_k$
until $\|r_k\|_2 / \|g\|_2 \leq \epsilon_{\mathrm{pcg}}$ or $k = N_{\mathrm{pcg}}$.

Each iteration of the PCG algorithm involves a handful of inner products, the matrix-vector product $Hp_k$, and a solve step with $P$ in computing $P^{-1}r_k$. With exact arithmetic, and ignoring the stopping condition, the PCG algorithm is guaranteed to compute the exact solution $x = H^{-1}g$ in $N$ steps. When $P^{-1/2}HP^{-1/2}$ is well conditioned, or has just a few extreme eigenvalues, the PCG algorithm can compute an approximate solution in a number of steps that can be far smaller than $N$. Since $P^{-1}r_k$ is computed in each step, we need this computation to be efficient.
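A direct transcription of the PCG iteration above is short; the sketch below (generic NumPy, assuming $H$ and $P^{-1}$ are supplied as matrix-vector callables, and not tied to the authors' implementation) follows the same recurrences and stopping rule.

```python
import numpy as np

def pcg(H_mul, Pinv_mul, g, x0, eps_pcg=0.1, N_pcg=5000):
    """Preconditioned CG for H x = g, with H and P^{-1} given as
    matrix-vector product callables. Stops when ||r||/||g|| <= eps_pcg."""
    x = x0.copy()
    r = g - H_mul(x)
    y = Pinv_mul(r)
    p = y.copy()
    rho = y @ r
    gnorm = np.linalg.norm(g)
    for _ in range(N_pcg):
        if np.linalg.norm(r) <= eps_pcg * gnorm:
            break
        z = H_mul(p)
        q = rho / (p @ z)
        x = x + q * p
        r = r - q * z
        y = Pinv_mul(r)
        rho_new = y @ r
        p = y + (rho_new / rho) * p
        rho = rho_new
    return x
```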
5.2 Truncated Newton Interior-Point Method

The truncated Newton interior-point method is the same as the interior-point algorithm described in Section 3, with the search direction computed using the PCG algorithm. We can compute $Hp_k$ in the PCG algorithm using
$Hp_k = \begin{bmatrix} t\,b^T D_0 b & t\,b^T D_0 A & 0 \\ t\,A^T D_0 b & t\,A^T D_0 A + D_1 & D_2 \\ 0 & D_2 & D_1 \end{bmatrix}\begin{bmatrix} p_{k1} \\ p_{k2} \\ p_{k3} \end{bmatrix} = \begin{bmatrix} b^T\nu \\ A^T\nu + D_1 p_{k2} + D_2 p_{k3} \\ D_2 p_{k2} + D_1 p_{k3} \end{bmatrix}$,
where $\nu = t\,D_0(b\,p_{k1} + A\,p_{k2}) \in \mathbf{R}^m$. The cost of computing $Hp_k$ is $O(p)$ flops when $A$ is sparse with $p$ nonzero elements. (We assume $p \geq n$, which holds if each example has at least one nonzero feature.)

We now describe a simple choice for the preconditioner $P$. The Hessian can be written as
$H = t\,\nabla^2 l_{\mathrm{avg}}(v,w) + \nabla^2\Phi(w,u)$.
To obtain the preconditioner, we replace the first term with its diagonal part, to get
$P = \mathbf{diag}\big(t\,\nabla^2 l_{\mathrm{avg}}(v,w)\big) + \nabla^2\Phi(w,u) = \begin{bmatrix} d_0 & 0 & 0 \\ 0 & D_3 & D_2 \\ 0 & D_2 & D_1 \end{bmatrix}$,  (17)
where
$d_0 = t\,b^T D_0 b$, $D_3 = \mathbf{diag}(t\,A^T D_0 A) + D_1$.
(Here $\mathbf{diag}(S)$ is the diagonal matrix obtained by setting the off-diagonal entries of the matrix $S$ to zero.) This preconditioner approximates the Hessian of $t\,l_{\mathrm{avg}}$ with its diagonal entries, while retaining the Hessian of the logarithmic barrier. For this preconditioner, $P^{-1}r_k$ can be computed cheaply as
$P^{-1}r_k = \begin{bmatrix} d_0 & 0 & 0 \\ 0 & D_3 & D_2 \\ 0 & D_2 & D_1 \end{bmatrix}^{-1}\begin{bmatrix} r_{k1} \\ r_{k2} \\ r_{k3} \end{bmatrix} = \begin{bmatrix} r_{k1}/d_0 \\ (D_1 D_3 - D_2^2)^{-1}(D_1 r_{k2} - D_2 r_{k3}) \\ (D_1 D_3 - D_2^2)^{-1}(-D_2 r_{k2} + D_3 r_{k3}) \end{bmatrix}$,
which requires $O(n)$ flops.

We can now explain how implicit standardization can be carried out. When using standardized data, we work with the matrix $A^{\mathrm{std}}$ defined in (20), instead of $A$. As mentioned in Appendix A, $A^{\mathrm{std}}$ is in general dense, so we should not form the matrix. In the truncated Newton interior-point method we do not need to form the matrix $A^{\mathrm{std}}$; we only need a method for multiplying a vector by $A^{\mathrm{std}}$ and a method for multiplying a vector by $(A^{\mathrm{std}})^T$. But this is easily done efficiently, using the fact that $A^{\mathrm{std}}$ is a sparse matrix (i.e., $A$) times a diagonal matrix, plus a rank-one matrix; see (20) in Appendix A.

There are several good choices for the initial point in the PCG algorithm (labeled $x_0$ in Section 5.1), such as the negative gradient, or the previous search direction. We have found good performance with both, with a small advantage in using the previous search direction.

The PCG relative tolerance parameter $\epsilon_{\mathrm{pcg}}$ has to be carefully chosen to obtain good efficiency in a truncated Newton method. If the tolerance is too small, too many PCG steps are needed to compute each search direction; if the tolerance is too high, then the computed search directions do not give adequate reduction in duality gap per iteration. We experimented with several methods of adjusting the PCG relative tolerance, and found good results with the adaptive rule
$\epsilon_{\mathrm{pcg}} = \min\{0.1,\ \xi\eta/\|g\|_2\}$,  (18)
where $g$ is the gradient and $\eta$ is the duality gap at the current iterate. Here $\xi$ is an algorithm parameter. We have found that $\xi = 0.3$ works well for a wide range of problems. In other words, we solve the Newton system with low accuracy (but never worse than 10%) at early iterations, and solve it more accurately as the duality gap decreases. This adaptive rule is similar in spirit to standard methods used in inexact and truncated Newton methods (see Nocedal and Wright, 1999).

The computational effort of the truncated Newton interior-point algorithm is the product of $s$, the total number of PCG steps required over all iterations, and the cost of a PCG step, which is $O(p)$, where $p$ is the number of nonzero entries in $A$, that is, the total number of (nonzero) features appearing in all examples. In extensive testing, we found the truncated Newton interior-point method to be very efficient, requiring a total number of PCG steps ranging between a few hundred (for medium size problems) and several thousand (for large problems). For medium size (and sparse) problems it was faster than the basic interior-point method; moreover the truncated Newton interior-point method was able to solve very large problems, for which forming the Hessian $H$ (let alone computing the search direction) would be prohibitively expensive.

While the total number of iterations in the basic interior-point method is around 35, and nearly independent of the problem size and problem data, the total number of PCG iterations required by the truncated Newton interior-point method can vary significantly with problem data and the value of the regularization parameter $\lambda$. In particular, for small values of $\lambda$ (which lead to large values of $\mathbf{card}(w)$), the truncated Newton interior-point method requires a larger total number of PCG steps.
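The two ingredients that make each PCG step cheap are the Hessian-vector product and the preconditioner solve; both are a few lines once the diagonal quantities $D_0$, $D_1$, $D_2$ are available. The sketch below (illustrative NumPy/SciPy-style code operating on a sparse $A$, not the authors' implementation) builds both maps so they can be passed to a PCG routine such as the one sketched earlier.

```python
import numpy as np

def make_hessian_ops(t, A, b, d0_diag, d1, d2):
    """Return H*p and P^{-1}*r callables for the truncated Newton method.
    A may be a scipy.sparse matrix; d0_diag, d1, d2 are the diagonals of
    D0, D1, D2. Vectors are packed as [dv, dw, du] of length 2n+1."""
    n = A.shape[1]
    Asq = A.power(2) if hasattr(A, "power") else A**2      # elementwise square
    d3 = np.asarray(Asq.T @ (t * d0_diag)).ravel() + d1    # diag(t A^T D0 A) + D1
    d0 = t * (b * d0_diag) @ b                              # scalar b^T (t D0) b
    det = d1 * d3 - d2**2                                   # diag of D1 D3 - D2^2

    def H_mul(p):
        p1, p2, p3 = p[0], p[1:n+1], p[n+1:]
        nu = t * d0_diag * (b * p1 + A @ p2)                # t D0 (b p1 + A p2)
        return np.concatenate(([b @ nu],
                               A.T @ nu + d1 * p2 + d2 * p3,
                               d2 * p2 + d1 * p3))

    def Pinv_mul(r):
        r1, r2, r3 = r[0], r[1:n+1], r[n+1:]
        return np.concatenate(([r1 / d0],
                               (d1 * r2 - d2 * r3) / det,
                               (-d2 * r2 + d3 * r3) / det))

    return H_mul, Pinv_mul
```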
Algorithm performance that depends substantially on problem data, as well as problem dimension, is typical of all iterative (i.e., nondirect) methods, and is the price paid for the ability to solve very large problems.

5.3 Numerical Examples

In this section we give some numerical examples to illustrate the performance of the truncated Newton interior-point method. We use the same algorithm parameters for line search, update rule, and stopping criterion as those used in Section 4, and the PCG tolerance given in (18) with $\xi = 0.3$. We chose the parameter $N_{\mathrm{pcg}}$ to be large enough (5000) that the iteration limit was never reached in our experiments; the typical number of PCG iterations was far smaller. The algorithm is implemented in both Matlab and C, on a 3.2GHz Pentium IV running Linux, except for very large problems. For very large problems whose data could not be handled on this computer, the method was run on an AMD Opteron 254 with 8GB main memory. The C implementation is available online at www.stanford.edu/~boyd/l1_logreg.

5.3.1 A Medium Sparse Problem

We consider the Internet advertisements data set (Newman et al., 1998) with 1430 features and 2359 examples (discarding examples with missing data). The total number of nonzero entries in the data matrix $A$ is $p = 39011$. We standardized the data set using implicit standardization, as explained in Section 5.2, solving four $\ell_1$-regularized LRPs, with $\lambda = 0.5\lambda_{\max}$, $\lambda = 0.1\lambda_{\max}$, $\lambda = 0.05\lambda_{\max}$, and $\lambda = 0.01\lambda_{\max}$.

Figure 3: Progress of the truncated Newton interior-point method on the Internet advertisements data set with four regularization parameters: (a) $\lambda = 0.5\lambda_{\max}$, (b) $\lambda = 0.1\lambda_{\max}$, (c) $\lambda = 0.05\lambda_{\max}$, and (d) $\lambda = 0.01\lambda_{\max}$.

Figure 3 shows the convergence behavior. The left hand plot shows the duality gap versus outer iterations; the right hand plot shows duality gap versus cumulative PCG iterations, which is the more accurate measure of computational effort. The left hand plot shows that the number of Newton iterations required to solve the problem is not much more than in the basic interior-point method described in Section 3. The right hand plot shows that the total number of PCG steps is several hundred, and depends substantially on the value of $\lambda$. Thus, the search directions are computed using on the order of ten PCG iterations.

To give a very rough comparison with the direct method applied to this sparse problem, the truncated Newton interior-point method is much more efficient than the basic interior-point method that does not exploit the sparsity of the data. It is comparable to or faster than the basic interior-point method that uses sparse linear algebra methods, when the regularization parameter is not too small.

5.3.2 A Large Sparse Problem

Our next example uses the 20 Newsgroups data set (Lang, 1995). We processed the data set in a way similar to Keerthi and DeCoste (2005). The positive class consists of the 10 groups with names of form sci.*, comp.*, and misc.forsale, and the negative class consists of the other 10 groups. We used McCallum's Rainbow program (McCallum, 1996) with the command
rainbow -g 3 -h -s -O 2 -i
to tokenize the (text) data set. These options specify trigrams, skip message headers, no stoplist, and drop terms occurring fewer than two times. The resulting data set has $n = 777811$ features (trigrams) and $m = 11314$ examples (articles). Each example contains an average of 425 nonzero features. The total number of nonzero entries in the data matrix $A$ is $p = 4802169$. We standardized the data set using implicit standardization, as explained in Section 5.2, solving three $\ell_1$-regularized LRPs, with $\lambda = 0.5\lambda_{\max}$, $\lambda = 0.1\lambda_{\max}$, and $\lambda = 0.05\lambda_{\max}$. (For the value $\lambda = 0.01\lambda_{\max}$, the runtime is on the order of one hour. This case is not of practical interest, and so is not reported here, since
the cardinality of the optimal solution is around 10000, comparable to the number of examples.) The performance of the algorithm, and the cardinality of the weight vectors, is given in Table 2.

  lambda/lambda_max   card(w)   Iterations   PCG iterations   Time (sec)
  0.5                       9           43              558          134
  0.1                     544           60             1036          256
  0.05                   2531           58             2090          501

Table 2: Performance of the truncated Newton interior-point method on the 20 Newsgroups data set ($n = 777811$ features, $m = 11314$ examples) for 3 values of $\lambda$.

Figure 4: Progress of the truncated Newton interior-point method on the 20 Newsgroups data set for (a) $\lambda = 0.5\lambda_{\max}$, (b) $\lambda = 0.1\lambda_{\max}$, and (c) $\lambda = 0.05\lambda_{\max}$. Left: Duality gap versus iterations. Right: Duality gap versus cumulative PCG iterations.

Figure 4 shows the progress of the algorithm, with duality gap versus iteration (left hand plot), and duality gap versus cumulative PCG iteration (right hand plot). The number of iterations required to solve the problems ranges between 43 and 60, depending on $\lambda$. The more relevant measure of computational effort is the total number of PCG iterations, which ranges between around 500 and 2000, again, increasing with decreasing $\lambda$, which corresponds to increasing $\mathbf{card}(w)$. The average number of PCG iterations, per iteration of the truncated Newton interior-point method, is around 13 for $\lambda = 0.5\lambda_{\max}$, 17 for $\lambda = 0.1\lambda_{\max}$, and 36 for $\lambda = 0.05\lambda_{\max}$. (The variance in the number of PCG iterations required per iteration, however, is large.) The running time is consistent with a cost of around 0.24 seconds per PCG iteration. The increase in running time, for decreasing $\lambda$, is due primarily to an increase in the average number of PCG iterations required per iteration, but also to an increase in the overall number of iterations required.

5.3.3 Randomly Generated Problems

We generated a family of 21 data sets, with the number of features $n$ varying from one hundred to ten million, and $m = 0.1n$ examples. The data were generated using the same general method described in Section 4.2, but with $A$ sparse, with an average number of nonzero features per example around 30. Thus, the total number of nonzero entries in $A$ is $p \approx 30m$. We standardized the data
set using implicit standardization, as explained in Section 5.2, solving each problem instance for the three values $\lambda = 0.5\lambda_{\max}$, $\lambda = 0.1\lambda_{\max}$, and $\lambda = 0.05\lambda_{\max}$.

Figure 5: Runtime of the truncated Newton interior-point method, for randomly generated sparse problems, with three values of $\lambda$.

The total runtime, for the 63 $\ell_1$-regularized LRPs, is shown in Figure 5. The plot shows that runtime increases as $\lambda$ decreases, and grows approximately linearly with problem size.

We compare the runtimes of the truncated Newton interior-point method and the basic interior-point method using dense linear algebra methods to compute the search direction. Figure 6 shows the results for $\lambda = 0.1\lambda_{\max}$. The truncated Newton interior-point method is far more efficient for medium problems. For large problems, the basic interior-point method fails due to memory limitations, or extremely long computation times. By fitting an exponent to the data over the range from $n = 320$ to the largest problem successfully solved by each method, we find that the basic interior-point method scales as $O(n^{2.8})$ (which is consistent with the basic flop count analysis, which predicts $O(n^3)$). For the truncated Newton interior-point method, the empirical complexity is $O(n^{1.3})$.

When sparse matrix methods are used to compute the search direction in the basic interior-point method, we get an empirical complexity of $O(n^{2.2})$ for the Matlab implementation of the basic interior-point method that uses sparse matrix methods, showing a good efficiency gain over dense methods, for medium scale problems. The C implementation would have the same empirical complexity as the Matlab one, with a smaller constant hidden in the $O(\cdot)$ notation.

5.3.4 Preconditioner Performance

To examine the effect of the preconditioner (17) on the efficiency of the approximate search direction computation, we compare the eigenvalue distributions of the Hessian $H$ and the preconditioned Hessian $P^{-1/2}HP^{-1/2}$, for the colon gene tumor problem ($n = 2000$ features, $m = 62$ examples) at the 15th iterate, in Figure 7. The eigenvalues of the preconditioned Hessian are tightly clustered.
6. Computing the Regularization Path

Suppose we wish to compute points on the regularization path at $M$ values of the regularization parameter, $\lambda_{\max} = \lambda_1 > \lambda_2 > \cdots > \lambda_M > 0$. One approach is to solve these $M$ problems independently; this is efficient when multiple processors are used, since the LRPs can be solved simultaneously, on different processors. But when one processor is used, we can solve these $M$ problems much more efficiently by solving them sequentially, using the previously computed solution as a starting point for the next computation. This is called a warm-start approach.

We first note that the solution for $\lambda = \lambda_1 = \lambda_{\max}$ is $(\log(m_+/m_-), 0, 0)$. Since this point does not satisfy $|w_i| < u_i$, it cannot be used to initialize the computation for $\lambda = \lambda_2$. We modify it by adding a small increment to $u$ to get
$(v^{(1)}, w^{(1)}, u^{(1)}) = (\log(m_+/m_-),\ 0,\ (\epsilon_{\mathrm{abs}}/(n\lambda))\mathbf{1})$,
which is strictly feasible. In fact, it is on the central path with parameter $t = 2n/\epsilon_{\mathrm{abs}}$, and so is $\epsilon_{\mathrm{abs}}$-suboptimal. Note that so far we have expended no computational effort.

Now for $k = 2,\ldots,M$ we compute the solution $(v^{(k)}, w^{(k)}, u^{(k)})$ of the problem with $\lambda = \lambda_k$, by applying the interior-point method, with starting point modified to be
$(v^{\mathrm{init}}, w^{\mathrm{init}}, u^{\mathrm{init}}) = (v^{(k-1)}, w^{(k-1)}, u^{(k-1)})$,
and initial value of $t$ set to $t = 2n/\epsilon_{\mathrm{abs}}$.

In the warm-start technique described above, the number of grid points, $M$, is fixed in advance. The grid points (and $M$) can be chosen adaptively on the fly, while taking into account the curvature of the regularization path trajectories, as described in Park and Hastie (2006a).
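In code, the warm-start loop is a thin wrapper around a single-$\lambda$ solver; the sketch below (illustrative Python, with `solve_l1_lr` standing in for any solver of (5) that accepts a starting point; this interface is hypothetical and is not the authors' API) follows the initialization described above.

```python
import numpy as np

def regularization_path(solve_l1_lr, X, b, lambdas, eps_abs=1e-8):
    """Warm-start computation of points on the regularization path.
    `lambdas` should be decreasing, with lambdas[0] = lambda_max;
    `solve_l1_lr(X, b, lam, v0, w0, u0, t0)` is a hypothetical single-problem
    solver returning (v, w, u)."""
    m_pos, m_neg = np.sum(b == 1), np.sum(b == -1)
    n = X.shape[1]
    # eps_abs-suboptimal, strictly feasible point for lambda = lambda_max
    v = np.log(m_pos / m_neg)
    w = np.zeros(n)
    u = (eps_abs / (n * lambdas[0])) * np.ones(n)
    path = [(lambdas[0], v, w.copy())]
    for lam in lambdas[1:]:
        v, w, u = solve_l1_lr(X, b, lam, v0=v, w0=w, u0=u, t0=2 * n / eps_abs)
        path.append((lam, v, w.copy()))
    return path
```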
6.1 Numerical Results

Our first example is the leukemia cancer gene expression data, for $M = 100$ values of $\lambda$, uniformly distributed on a logarithmic scale over the interval $[0.001\lambda_{\max}, \lambda_{\max}]$. (For this example, $\lambda_{\max} = 0.37$.) The left plot in Figure 8 shows the regularization path, that is, $w^{(k)}$, versus the regularization parameter $\lambda$. The right plot shows the number of iterations required to solve each problem from a warm-start, and from a cold-start. The number of cold-start iterations required is always near 36, while the number of warm-start iterations varies, but is always smaller, and typically much smaller, with an average value of 3.1. Thus the computational savings for this example is over 11:1.

Figure 8: Left: Regularization path for leukemia cancer gene expression data. Right: Iterations required for cold-start and warm-start methods.

Our second example is the 20 Newsgroups data set, with $M = 100$ values of $\lambda$ uniformly spaced on a logarithmic scale over $[0.05\lambda_{\max}, \lambda_{\max}]$. For this problem we have $\lambda_{\max} = 0.12$. The top plot in Figure 9 shows the regularization path. The bottom left plot shows the total number of PCG iterations required to solve each problem, with the warm-start and cold-start methods. The bottom right plot shows the cardinality of $w$ as a function of $\lambda$. Here too the warm-start method gives a substantial advantage over the cold-start method, at least for $\lambda$ not too small, that is, as long as the optimal weight vector is relatively sparse. The total runtime using the warm-start method is around 2.8 hours, and the total runtime using the cold-start method is around 6.2 hours, so the warm-start method gives a savings of around 2:1. If we consider only the range from $0.1\lambda_{\max}$ to $\lambda_{\max}$, the savings increases to 5:1. We note that for this example, the number of events (i.e., a weight transitioning between zero and nonzero) along the regularization path is very large, so methods that attempt to track every event will be very slow.

Figure 9: Top: Regularization path for 20 newsgroup data. Bottom left: Total PCG iterations required by cold-start and warm-start methods. Bottom right: card(w) versus $\lambda$.

7. Comparison

In this section we compare the performance of our basic and truncated Newton interior-point methods, implemented in C (called l1_logreg), with several existing methods for $\ell_1$-regularized logistic regression. We make comparisons with MOSEK (MOSEK ApS, 2002), IRLS-LARS (Lee et al., 2006), BBR (Genkin et al., 2006), and glmpath (Park and Hastie, 2006a).

MOSEK is a general purpose primal-dual interior-point solver, which is known to be quite efficient compared to other standard solvers. MOSEK can solve $\ell_1$-regularized LRPs using the separable convex formulation (9), or by treating the problem as a geometric program (GP) (see Boyd et al., 2006). We used both formulations and report the better result here in each case. MOSEK uses a stopping criterion based on the duality gap, like our method.

IRLS-LARS alternates between approximating the average logistic loss by a quadratic approximation at the current iterate, and solving the resulting $\ell_1$-regularized least squares problem using the LARS method (Efron et al., 2004) to update the iterate. IRLS-LARS outperforms many existing methods for $\ell_1$-regularized logistic regression including GenLASSO (Roth, 2004), SCGIS (Goodman, 2004), Gl1ce (Lokhorst, 1999), and Grafting (Perkins and Theiler, 2003). The IRLS-LARS used in our comparison is implemented in Matlab and C, with the LARS portion implemented in C. The hybrid implementation is called IRLS-LARS-MC and is available from http://ai.stanford.edu/~silee/softwares/irlslars.htm. We ran it until the primal objective was within tolerance of the optimal objective value, which is computed using l1_logreg with small ($10^{-12}$) duality gap.

BBR, implemented in C, uses the cyclic coordinate descent method for Bayesian logistic regression. The C implementation is available from http://www.stat.rutgers.edu/~madigan/BBR/. The stopping criterion is based on lack of progress, and not on a suboptimality bound or duality gap. Tightening the tolerance for BBR greatly increased its running time, and only had a minor effect on the final accuracy.

glmpath uses a path-following method for generalized linear models, including logistic models, and computes a portion of the regularization path. It is implemented in the R environment and available from http://cran.r-project.org/src/contrib/Descriptions/glmpath.html. We compared glmpath to our warm-start method, described in Section 6.
We report the runtimes of the methods, using the different stopping criteria described above. We also report the actual accuracy achieved, that is, the difference between the achieved primal objective and the optimal objective value (as computed by l1_logreg with duality gap $10^{-12}$). However, it is important to point out that it is very difficult, if not impossible, to carry out a fair comparison of solution methods, due to the issue of implementation (which can have a great influence on the algorithm performance), the choice of algorithm parameters, and the different stopping criteria. Therefore, the comparison results reported below should be interpreted with caution.

We report comparison results using four data sets: the leukemia cancer gene expression and spambase data sets, the two dense benchmark data sets used in Section 4.1; the Internet advertisements data set, the medium sparse data set used in Section 5.3; and the 20 Newsgroups data set, the large sparse data set used in Section 5.3. When the large 20 Newsgroups data set was standardized, the three existing solvers could not handle a data set of this size, so it was not standardized. The solvers could handle the standardized Internet advertisements data set but do not exploit the sparse plus rank-one structure of the standardized data matrix. Therefore, the Internet advertisements data set was not standardized as well. For small problems (leukemia and spambase), the solvers were all run on a 3.2GHz Pentium IV under Linux; for medium and large problems (Internet advertisements and 20 Newsgroups), the solvers were run on an AMD Opteron 254 (with 8GB RAM) under Linux.

The regularization parameter $\lambda$ can strongly affect the runtime of the methods, including ours. For each data set, we considered many values of the regularization parameter $\lambda$ over the interval $[0.0001\lambda_{\max}, \lambda_{\max}]$ (which appears to cover the range of interest for many standardized problems). We report here the results with $\lambda = 0.001\lambda_{\max}$ for the two small benchmark data sets. The cardinality of the optimal solution is 21 for the leukemia data and 54 for the spambase data, larger than half of the minimum of the number of features and the number of examples. For the two unstandardized problems, $\ell_1$-regularized LR with a regularization parameter in the interval $[0.0001\lambda_{\max}, \lambda_{\max}]$ yields a very sparse solution. We used $\lambda = 0.0001\lambda_{\max}$ for the Internet advertisements data and $\lambda = 0.001\lambda_{\max}$ for the 20 Newsgroups data. The cardinalities of the optimal solutions are relatively very small (19 for the Internet advertisements and 247 for the 20 Newsgroups) compared with the sizes of the problems.

Table 3 summarizes the comparison results for the four problems. Here, the tolerance has a different meaning depending on the method, as described above. As shown in this table, our method is as fast as, or faster than, existing methods for the two small data sets, but the discrepancy in performance is not significant, since the problems are small. For the unstandardized Internet advertisements data set, our method is most efficient. MOSEK could not handle the unstandardized large 20 Newsgroups data set, so we compared l1_logreg with BBR and IRLS-LARS-MC on the unstandardized 20 Newsgroups data set, for the regularization parameter $\lambda = 0.001\lambda_{\max}$. The truncated Newton interior-point method solves the problem to the accuracy $2\times10^{-6}$ in around 100 seconds, and to a higher accuracy with relatively small additional runtime; BBR solves it to this ultimate accuracy in a comparable time, but slows down when attempting to compute a solution with higher accuracy.

Finally, we compare the runtimes of the warm-start method and glmpath. We consider the leukemia cancer gene expression data set and the Internet advertisements data set as benchmark examples of small dense and medium sparse problems, respectively. For each data set, the warm-start method finds $M$ points ($w^{(k)}$, $k=1,\ldots,M$) on the regularization path, with $\lambda$ uniformly spaced on a logarithmic scale over the interval $[0.001\lambda_{\max}, \lambda_{\max}]$. glmpath finds an approximation of the regularization path by choosing the kink points adaptively over the same interval. The results are
Program        Tol.     Leukemia           Spambase           Internet adv.      20 Newsgroups
                        time   accuracy    time   accuracy    time   accuracy    time   accuracy
l1logreg       10^-4    0.37   2·10^-6     0.34   2·10^-5     0.14   4·10^-5     90     2·10^-6
               10^-8    0.57   1·10^-10    0.74   1·10^-9     0.27   5·10^-9     140    2·10^-10
IRLS-LARS-MC   10^-4    0.75   6·10^-6     0.29   9·10^-6     1.2    1·10^-5     450    8·10^-5
               10^-8    0.81   6·10^-11    0.37   3·10^-11    2.5    1·10^-10    1200   7·10^-9
MOSEK          10^-4    9      8·10^-6     10     8·10^-8     1.3    3·10^-9     --     --
               10^-8    10     3·10^-9     11     5·10^-13    1.4    4·10^-13    --     --
BBR            10^-4    15     1·10^-7     39     3·10^-6     0.44   1·10^-7     140    2·10^-6
               10^-11   73     4·10^-8     300    1·10^-11    1.1    1·10^-12    850    1·10^-6

Table 3: Comparison results with two standardized data sets (leukemia and spambase) and two unstandardized data sets (Internet advertisements and 20 Newsgroups). The regularization parameter is taken as λ = 0.001λmax for leukemia and spambase, λ = 0.0001λmax for Internet advertisements, and λ = 0.001λmax for 20 Newsgroups.

Data        warm-start (M = 25)    warm-start (M = 100)    glmpath
Leukemia    2.8                    6.4                     1.9
Internet    5.2                    13                      940

Table 4: Regularization path computation time (in seconds) of the warm-start method and glmpath for the standardized leukemia cancer gene expression data and Internet advertisements data.

The results are shown in Table 4. For the small leukemia data set, glmpath is faster than the warm-start method. The warm-start method is more efficient than glmpath for the Internet advertisements data set, a medium-sized sparse problem. The performance discrepancy is partially explained by the fact that our warm-start method exploits the sparse plus rank-one structure of the standardized data matrix, whereas glmpath does not.

8. Extensions and Variations

The basic interior-point method and the truncated Newton variation can be extended to general ℓ1-regularized convex loss minimization problems, with twice differentiable loss functions, that have the form

\[
\begin{array}{ll}
\mbox{minimize} & (1/m)\sum_{i=1}^m f(z_i) + \lambda\|w\|_1 \\
\mbox{subject to} & z_i = w^T a_i + v b_i + c_i, \quad i = 1,\ldots,m,
\end{array}
\tag{19}
\]

with variables v ∈ R, w ∈ R^n, and z ∈ R^m, and problem data c_i ∈ R, a_i ∈ R^n, and b_i ∈ R determined by a set of given (observed or training) examples (x_i, y_i) ∈ R^n × R, i = 1, ..., m. Here f : R → R is a loss function which is convex and twice differentiable. Prior work related to the extension includes Park and Hastie (2006a), Rosset (2005), and Tibshirani (1997).
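To make the general formulation (19) concrete, here is a minimal sketch of it in CVXPY, a convex modeling package that is not used in this paper; it plays the same role as the separable formulation handed to a general-purpose solver such as MOSEK in Section 7. The logistic loss is used below; substituting any other convex, twice differentiable loss gives the other instances discussed next.

    import numpy as np
    import cvxpy as cp

    def l1_regularized_loss_min(A, b, c, lam):
        """Problem (19) with the logistic loss f(z) = log(1 + exp(-z))."""
        m, n = A.shape
        v = cp.Variable()
        w = cp.Variable(n)
        z = A @ w + v * b + c                  # z_i = w^T a_i + v b_i + c_i
        # cp.logistic(u) = log(1 + exp(u)), so the logistic loss of z is cp.logistic(-z)
        objective = cp.sum(cp.logistic(-z)) / m + lam * cp.norm(w, 1)
        problem = cp.Problem(cp.Minimize(objective))
        problem.solve()
        return v.value, w.value

Such a generic formulation is convenient for small and medium problems, but it does not exploit problem structure the way the custom interior-point method does.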
In ℓ1-regularized (binary) classification, we have y_i ∈ {−1, +1} (binary labels), and z_i has the form z_i = y_i(w^T x_i + v), so we have the form (19) with a_i = y_i x_i, b_i = y_i, and c_i = 0. The associated classifier is given by ŷ = sgn(w^T x + v). The loss function f is small for positive arguments, and grows for negative arguments. When f is the logistic loss function f in (1), this general ℓ1-regularized classification problem reduces to the ℓ1-regularized LRP. When f is the convex loss function f(u) = −log Φ(u), where Φ is the cumulative distribution function of the standard normal distribution, this ℓ1-regularized classification problem reduces to the ℓ1-regularized probit regression problem. More generally, ℓ1-regularized estimation problems that arise in generalized linear models (McCullagh and Nelder, 1989; Hastie et al., 2001) for binary response variables (which include logistic and probit models) can be formulated as problems of the form (19); see Park and Hastie (2006a) for the precise formulation.

In ℓ1-regularized linear regression, we have y_i ∈ R, and z_i has the form z_i = w^T x_i + v − y_i, the difference between the predicted value w^T x_i + v and y_i. Thus ℓ1-regularized regression problems have the form (19) with a_i = x_i, b_i = 1, and c_i = −y_i. Typically f is symmetric, with f(0) = 0. When the loss function is quadratic, that is, f(u) = u^2, the convex loss minimization problem (19) is the ℓ1-regularized least squares regression problem studied extensively in the literature.

The dual of the ℓ1-regularized convex loss minimization problem (19) is

\[
\begin{array}{ll}
\mbox{maximize} & -(1/m)\sum_{i=1}^m f^*(m\theta_i) + \theta^T c \\
\mbox{subject to} & \|A^T\theta\|_\infty \le \lambda, \quad b^T\theta = 0,
\end{array}
\]

where A = [a_1 ··· a_m]^T ∈ R^{m×n}, the variable is θ ∈ R^m, and f^* is the conjugate of the loss function f,

\[
f^*(y) = \sup_{u\in\mathbf{R}}\,(yu - f(u)).
\]

As with ℓ1-regularized logistic regression, we can derive a bound on the suboptimality of (v̄, w) by constructing a dual feasible point θ̄ from an arbitrary w,

\[
\bar\theta = (s/m)\,p(\bar v, w), \qquad
p(\bar v, w) = \left[\begin{array}{c} f'(w^T a_1 + \bar v b_1 + c_1) \\ \vdots \\ f'(w^T a_m + \bar v b_m + c_m) \end{array}\right],
\]

where v̄ is the optimal intercept for the offset w,

\[
\bar v = \mathop{\rm argmin}_v\;(1/m)\sum_{i=1}^m f(w^T a_i + v b_i + c_i),
\]

and the scaling constant s is given by s = min{ mλ / ||A^T p(v̄, w)||_∞, 1 }. Using this method for cheaply computing a dual feasible point and associated duality gap for any (v, w), we can extend the custom interior-point method for ℓ1-regularized LRPs to general ℓ1-regularized convex (twice differentiable) loss problems.
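As a concrete illustration of this construction, the sketch below evaluates the duality gap for the logistic loss viewed as an instance of (19). It is not the authors' l1logreg code: the conjugate formula is specific to the logistic loss, the signs follow the dual as written above, and the one-dimensional minimization defining v̄ is delegated to a generic scalar optimizer.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.special import expit, xlogy

    # Logistic loss as an instance of the general framework:
    #   f(z)  = log(1 + exp(-z))
    #   f'(z) = -1 / (1 + exp(z))
    #   f*(y) = (-y) log(-y) + (1 + y) log(1 + y)   for -1 <= y <= 0
    f      = lambda z: np.logaddexp(0.0, -z)
    fprime = lambda z: -expit(-z)
    fstar  = lambda y: xlogy(-y, -y) + xlogy(1.0 + y, 1.0 + y)

    def duality_gap(A, b, c, lam, w):
        """Cheap suboptimality bound for problem (19) at a given w."""
        m = A.shape[0]
        # optimal intercept for the fixed offset w (a smooth one-dimensional problem)
        vbar = minimize_scalar(lambda v: np.mean(f(A @ w + v * b + c))).x
        z = A @ w + vbar * b + c
        p = fprime(z)                          # p_i = f'(w^T a_i + vbar*b_i + c_i)
        s = min(m * lam / np.linalg.norm(A.T @ p, np.inf), 1.0)
        theta = (s / m) * p                    # dual feasible by construction
        primal = np.mean(f(z)) + lam * np.linalg.norm(w, 1)
        dual = -np.mean(fstar(m * theta)) + theta @ c
        return primal - dual

The returned gap bounds the suboptimality of (v̄, w), so it can serve as a stopping criterion, exactly as in the logistic case.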
Other possible extensions include ℓ1-regularized Cox proportional hazards models (Cox, 1972). The associated ℓ1-regularized problem does not have the form (19), but the idea behind the custom interior-point method for ℓ1-regularized LRPs can be readily extended. The reader is referred to Park and Hastie (2006a) and Tibshirani (1997) for related work on computational methods for ℓ1-regularized Cox proportional hazards models.

Acknowledgments

This work is supported in part by the National Science Foundation under grants #0423905 and (through October 2005) #0140700, by the Air Force Office of Scientific Research under grant #F49620-01-1-0365, by MARCO Focus Center for Circuit & System Solutions contract #2003-CT-888, and by MIT DARPA contract #N00014-05-1-0700. The authors thank Michael Grant, Trevor Hastie, Honglak Lee, Suin Lee, Mee Young Park, Robert Tibshirani, Yinyu Ye, and Sungroh Yoon for helpful comments and suggestions. The authors thank the anonymous reviewers for helpful comments and suggestions (especially on the comparison with existing methods).

Appendix A. Standardization

Standardization is a widely used pre-processing step applied to the feature vector, so that each (transformed) feature has zero mean and unit variance (over the examples) (Ryan, 1997). The mean feature vector is μ = (1/m)∑_{i=1}^m x_i, and the vector of feature standard deviations s is defined by

\[
s_j = \Big( (1/m)\sum_{i=1}^m (x_{ij} - \mu_j)^2 \Big)^{1/2}, \qquad j = 1,\ldots,n,
\]

where x_{ij} is the jth component of x_i. The standardized feature vector is defined as

\[
x_{\rm std} = \mathbf{diag}(s)^{-1}(x - \mu).
\]

When the examples are standardized, we obtain the standardized data matrix

\[
A_{\rm std} = \mathbf{diag}(b)\,(X - \mathbf{1}\mu^T)\,\mathbf{diag}(s)^{-1}
            = A\,\mathbf{diag}(s)^{-1} - b\mu^T\mathbf{diag}(s)^{-1},
\tag{20}
\]

where X = [x_1 ··· x_m]^T. We carry out logistic regression (possibly regularized) using the data matrix A_std in place of A, to obtain (standardized) logistic model parameters w_std, v_std. In terms of the original feature vector, our logistic model is

\[
\mathrm{Prob}(b\,|\,x) = \frac{1}{1+\exp(-b\,(w_{\rm std}^T x_{\rm std} + v_{\rm std}))}
                       = \frac{1}{1+\exp(-b\,(w^T x + v))},
\]

where

\[
w = \mathbf{diag}(s)^{-1} w_{\rm std}, \qquad v = v_{\rm std} - w_{\rm std}^T \mathbf{diag}(s)^{-1}\mu.
\]

We point out one subtlety here related to sparsity of the data matrix. For small or medium sized problems, or when the original data matrix A is dense, forming the standardized data matrix A_std does no harm. But when the original data matrix A is sparse, which is the key to efficient solution of large-scale ℓ1-regularized LRPs, forming A_std is disastrous, since A_std is in general dense, even when A is sparse. But we can get around this problem, when working with very large problems, by never actually forming the matrix A_std, as explained in Section 5.
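This remark can be made concrete with an implicit operator that applies A_std and its transpose using (20), without ever forming a dense matrix. The sketch below illustrates the idea (it is not the authors' implementation) by wrapping the sparse-plus-rank-one products in a SciPy LinearOperator; matrix-vector products with A_std and A_std^T are exactly what iterative methods such as the PCG-based truncated Newton variant require.

    import numpy as np
    from scipy.sparse.linalg import LinearOperator

    def standardized_operator(A, b, mu, s):
        """A_std = A diag(s)^{-1} - b mu^T diag(s)^{-1}, applied implicitly."""
        m, n = A.shape
        mu_s = mu / s                          # the row vector mu^T diag(s)^{-1}
        matvec  = lambda u: A @ (u / s) - b * (mu_s @ u)        # A_std u
        rmatvec = lambda y: (A.T @ y - mu * (b @ y)) / s        # A_std^T y
        return LinearOperator((m, n), matvec=matvec, rmatvec=rmatvec)

Each product then costs one sparse multiply with A (or A^T) plus O(m + n) additional work, so sparsity is preserved even though A_std itself is dense.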
References

E. Allgower and K. Georg. Continuation and path following. Acta Numerica, 2:1–64, 1993.

U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine. Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96:6745–6750, 1999.

S. Balakrishnan and D. Madigan. Algorithms for sparse linear classifiers in the massive data setting, 2006. Manuscript. Available from www.stat.rutgers.edu/~madigan/papers/.

O. Banerjee, L. El Ghaoui, A. d'Aspremont, and G. Natsoulis. Convex optimization techniques for fitting sparse Gaussian graphical models. In Proceedings of the 23rd International Conference on Machine Learning, 2006.

D. Bertsekas. Nonlinear Programming. Athena Scientific, second edition, 1999.

A. Bhusnurmath and C. Taylor. Solving the graph cut problem via ℓ1 norm minimization, 2007. University of Pennsylvania CIS Tech Report number MS-CIS-07-10.

J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2000.

S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi. A tutorial on geometric programming, 2006. To appear in Optimization and Engineering.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

S. Boyd, L. Vandenberghe, A. El Gamal, and S. Yun. Design of robust global power and ground networks. In Proceedings of ACM/SIGDA International Symposium on Physical Design (ISPD), pages 60–65, 2001.

E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2005.

E. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.

E. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

K. Chaloner and K. Larntz. Optimal Bayesian design applied to logistic regression experiments. Journal of Statistical Planning and Inference, 21:191–208, 1989.

S. Chen and D. Donoho. Basis pursuit. In Proceedings of the Twenty-Eighth Asilomar Conference on Signals, Systems and Computers, volume 1, pages 41–44, 1994.

S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.

J. Claerbout and F. Muir. Robust modeling of erratic data. Geophysics, 38(5):826–844, 1973.

A. Conn, N. Gould, and Ph. Toint. LANCELOT: A Fortran Package for Large-Scale Nonlinear Optimization (Release A), volume 17 of Springer Series in Computational Mathematics. Springer-Verlag, 1992.

D. Cox. Regression models and life-tables. Journal of the Royal Statistical Society, Series B, 34(2):187–220, 1972.

J. Dahl, V. Roychowdhury, and L. Vandenberghe. Maximum likelihood estimation of Gaussian graphical models: Numerical implementation and topology selection. Submitted. Available from www.ee.ucla.edu/~vandenbe/covsel.html, 2005.

A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming, 2005. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 41–48. MIT Press.

R. Dembo and T. Steihaug. Truncated-Newton algorithms for large-scale unconstrained optimization. Math. Program., 26:190–212, 1983.

J. Demmel. Applied Numerical Linear Algebra. Society for Industrial and Applied Mathematics, 1997.

D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

D. Donoho and M. Elad. Optimally sparse representation in general (non-orthogonal) dictionaries via ℓ1 minimization. Proc. Nat. Aca. Sci., 100(5):2197–2202, March 2003.

D. Donoho, I. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet shrinkage: Asymptopia? J. R. Statist. Soc. B., 57(2):301–337, 1995.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.

J. Friedman, T. Hastie, and R. Tibshirani. Pathwise coordinate optimization, 2007. Manuscript available from www-stat.stanford.edu/~hastie/pub.htm.

H. Fu, M. Ng, M. Nikolova, and J. Barlow. Efficient minimization methods of mixed ℓ1-ℓ1 and ℓ2-ℓ1 norms for image restoration. SIAM Journal on Scientific Computing, 27(6):1881–1902, 2006.

A. Genkin, D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization, 2006. To appear in Technometrics. Available from www.stat.rutgers.edu/~madigan/papers/.

A. George and J. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, 1981.

P. Gill, W. Murray, M. Saunders, and M. Wright. User's guide for NPSOL (Version 4.0): A FORTRAN package for nonlinear programming. Technical Report SOL 86-2, Operations Research Dept., Stanford University, Stanford, California 94305, January 1986.
G. Golub and C. Van Loan. Matrix Computations, volume 13 of Studies in Applied Mathematics. Johns Hopkins University Press, third edition, 1996.

T. Golub, D. Slonim, P. Tamayo, C. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.

J. Goodman. Exponential priors for maximum entropy models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2004.

A. Hassibi, J. How, and S. Boyd. Low-authority controller design via convex optimization. In Proceedings of the IEEE Conference on Decision and Control, pages 140–145, 1999.

T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391–1415, 2004.

T. Hastie, J. Taylor, R. Tibshirani, and G. Walther. Forward stagewise regression and the monotone lasso. Electronic Journal of Statistics, 1:1–29, 2007.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer-Verlag, New York, 2001. ISBN 0-387-95284-5.

T. Jaakkola and M. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25–37, 2000.

S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341–361, 2005.

Y. Kim, J. Kim, and Y. Kim. Blockwise sparse regression. Statistica Sinica, 16:375–390, 2006.

P. Komarek. Logistic Regression for Data Mining and High-Dimensional Classification. PhD thesis, Carnegie Mellon University, 2004.

B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, 2005.

B. Krishnapuram and A. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, 2005. ISSN 0162-8828.

K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), pages 331–339, 1995.

S. Lee, H. Lee, P. Abbeel, and A. Ng. Efficient ℓ1-regularized logistic regression. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06), 2006.

S. Levy and P. Fullagar. Reconstruction of a sparse spike train from a portion of its spectrum and application to high-resolution deconvolution. Geophysics, 46(9):1235–1243, 1981.
C.-J. Lin, R. Weng, and S. Keerthi. Trust region Newton methods for large-scale logistic regression, 2007. To appear in Proceedings of the 24th International Conference on Machine Learning (ICML).

M. Lobo, M. Fazel, and S. Boyd. Portfolio optimization with linear and fixed transaction costs. Annals of Operations Research, 2005.

J. Lokhorst. The LASSO and generalised linear models, 1999. Honors Project, Department of Statistics, The University of Adelaide, South Australia, Australia.

D. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, second edition, 1984.

A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Available from www.cs.cmu.edu/~mccallum/bow, 1996.

P. McCullagh and J. Nelder. Generalized Linear Models. Chapman & Hall/CRC, second edition, 1989.

T. Minka. A comparison of numerical optimizers for logistic regression, 2003. Technical report. Available from research.microsoft.com/~minka/papers/logreg/.

MOSEK ApS. The MOSEK Optimization Tools Version 2.5. User's Manual and Reference, 2002. Available from www.mosek.com.

S. Nash. A survey of truncated-Newton methods. Journal of Computational and Applied Mathematics, 124:45–59, 2000.

Y. Nesterov and A. Nemirovsky. Interior-Point Polynomial Methods in Convex Programming, volume 13 of Studies in Applied Mathematics. SIAM, Philadelphia, PA, 1994.

D. Newman, S. Hettich, C. Blake, and C. Merz. UCI repository of machine learning databases, 1998. Available from www.ics.uci.edu/~mlearn/MLRepository.html.

A. Ng. Feature selection, ℓ1 vs. ℓ2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), pages 78–85, New York, NY, USA, 2004. ACM Press. ISBN 1-58113-828-5.

J. Nocedal and S. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 1999.

D. Oldenburg, T. Scheuer, and S. Levy. Recovery of the acoustic impedance from reflection seismograms. Geophysics, 48(10):1318–1337, 1983.

M. Osborne, B. Presnell, and B. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000.

M.-Y. Park and T. Hastie. An ℓ1 regularization-path algorithm for generalized linear models, 2006a. To appear in Journal of the Royal Statistical Society, Series B. Available from www-stat.stanford.edu/~hastie/pub.htm.
M.-Y. Park and T. Hastie. Regularization path algorithms for detecting gene interactions, 2006b. Manuscript. Available from www-stat.stanford.edu/~hastie/Papers/glasso.pdf.

S. Perkins and J. Theiler. Online feature selection using grafting. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), pages 592–599. ACM Press, 2003.

B. Polyak. Introduction to Optimization. Optimization Software, 1987. Translated from Russian.

L. Portugal, M. Resende, G. Veiga, and J. Júdice. A truncated primal-infeasible dual-feasible network interior point method. Networks, 35:91–108, 2000.

S. Rosset. Tracking curved regularized optimization solution paths. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, 2005.

S. Rosset and J. Zhu. Piecewise linear regularized solution paths, 2007. To appear in Annals of Statistics.

S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5:941–973, 2004.

V. Roth. The generalized LASSO. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

A. Ruszczynski. Nonlinear Optimization. Princeton University Press, 2006.

T. Ryan. Modern Regression Methods. Wiley, 1997.

S. Shevade and S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.

N. Z. Shor. Minimization Methods for Non-differentiable Functions. Springer Series in Computational Mathematics. Springer, 1985.

H. Taylor, S. Banks, and J. McCoy. Deconvolution with the ℓ1 norm. Geophysics, 44(1):39–52, 1979.

R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

R. Tibshirani. The Lasso for variable selection in the Cox model. Statistics in Medicine, 16:385–395, 1997.

R. Tibshirani, M. Saunders, S. Rosset, and J. Zhu. Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society, Series B, 67(1):91–108, 2005.

J. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, 2006.

L. Vandenberghe and S. Boyd. A primal-dual potential reduction method for problems involving matrix inequalities. Math. Program., 69:205–236, 1995.
R. Vanderbei. LOQO User's Manual -- Version 3.10, 1997. Available from www.orfe.princeton.edu/loqo.

M. Wainwright, P. Ravikumar, and J. Lafferty. High-dimensional graphical model selection using ℓ1-regularized logistic regression, 2007. To appear in Advances in Neural Information Processing Systems (NIPS) 19. Available from http://www.eecs.berkeley.edu/~wainwrig/Pubs/publist.html#High-dimensional.

S. Wright. Primal-Dual Interior-Point Methods. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1997. ISBN 0-89871-382-X.

Y. Ye. Interior Point Algorithms: Theory and Analysis. John Wiley & Sons, 1997.

M. Yuan and L. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.

Z. Zhang, J. Kwok, and D. Yeung. Surrogate maximization/minimization algorithms for AdaBoost and the logistic regression model. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), pages 927–934, New York, NY, USA, 2004. ACM Press.

P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties, 2007. Tech Report 703, Statistics Department, UC Berkeley. Available from www.stat.berkeley.edu/~binyu/ps/703.pdf.

P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 49–56, Cambridge, MA, 2004. MIT Press.

H. Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):262–286, 2006.

H. Zou, T. Hastie, and R. Tibshirani. On the degrees of freedom of the Lasso, 2007. To appear in Annals of Statistics.