Journal of Machine Learning Research 5 (2004) 1391-1415. Submitted 3/04; Published 10/04.

The Entire Regularization Path for the Support Vector Machine

Trevor Hastie (HASTIE@STANFORD.EDU), Department of Statistics, Stanford University, Stanford, CA 94305, USA
Saharon Rosset (SROSSET@US.IBM.COM), IBM Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA
Robert Tibshirani (TIBS@STANFORD.EDU), Department of Statistics, Stanford University, Stanford, CA 94305, USA
Ji Zhu (JIZHU@UMICH.EDU), Department of Statistics, University of Michigan, 439 West Hall, 550 East University, Ann Arbor, MI 48109-1092, USA

Editor: Nello Cristianini

Abstract

The support vector machine (SVM) is a widely used tool for classification. Many efficient implementations exist for fitting a two-class SVM model. The user has to supply values for the tuning parameters: the regularization cost parameter, and the kernel parameters. It seems a common practice is to use a default value for the cost parameter, often leading to the least restrictive model. In this paper we argue that the choice of the cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. We illustrate our algorithm on some examples, and use our representation to give further insight into the range of SVM solutions.

Keywords: support vector machines, regularization, coefficient path

1. Introduction

In this paper we study the support vector machine (SVM) (Vapnik, 1996; Schölkopf and Smola, 2001) for two-class classification. We have a set of $n$ training pairs $x_i, y_i$, where $x_i \in \mathbb{R}^p$ is a $p$-vector of real-valued predictors (attributes) for the $i$th observation, and $y_i \in \{-1, +1\}$ codes its binary response. We start off with the simple case of a linear classifier, where our goal is to estimate a linear decision function

$$f(x) = \beta_0 + \beta^T x, \qquad (1)$$

and its associated classifier

$$\mathrm{Class}(x) = \mathrm{sign}[f(x)]. \qquad (2)$$

[Figure 1: A simple example shows the elements of a SVM model. The "+1" points are solid, the "-1" hollow. $C = 2$, and the width of the soft margin is $2/\|\beta\| = 2 \times 0.587$. Two hollow points $\{3, 5\}$ are misclassified, while the two solid points $\{10, 12\}$ are correctly classified, but on the wrong side of their margin $f(x) = +1$; each of these has $\xi_i > 0$. The three square-shaped points $\{2, 6, 7\}$ are exactly on the margin.]

There are many ways to fit such a linear classifier, including linear regression, Fisher's linear discriminant analysis, and logistic regression (Hastie et al., 2001, Chapter 4). If the training data are linearly separable, an appealing approach is to ask for the decision boundary $\{x : f(x) = 0\}$ that maximizes the margin between the two classes (Vapnik, 1996). Solving such a problem is an exercise in convex optimization; the popular setup is

$$\min_{\beta_0, \beta} \tfrac{1}{2}\|\beta\|^2 \quad \text{subject to, for each } i:\ y_i(\beta_0 + x_i^T\beta) \ge 1. \qquad (3)$$

A bit of linear algebra shows that $\frac{1}{\|\beta\|}(\beta_0 + x_i^T\beta)$ is the signed distance from $x_i$ to the decision boundary. When the data are not separable, this criterion is modified to

$$\min_{\beta_0, \beta} \tfrac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \xi_i, \qquad (4)$$

subject to, for each $i$: $y_i(\beta_0 + x_i^T\beta) \ge 1 - \xi_i$, $\xi_i \ge 0$.
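As a concrete illustration of the optimization problem (4), the following is a minimal sketch assuming the cvxpy modeling library; it is our illustration, not the authors' code (their implementation is the R package svmpath), and it solves the soft-margin primal directly for a single value of $C$.

```python
# Minimal sketch of the soft-margin primal (4), assuming cvxpy is available.
# X is the n x p matrix of predictors, y the n-vector of +/-1 labels.
import cvxpy as cp
import numpy as np

def fit_soft_margin_svm(X, y, C=2.0):
    n, p = X.shape
    beta = cp.Variable(p)
    beta0 = cp.Variable()
    xi = cp.Variable(n, nonneg=True)            # slack variables xi_i >= 0
    margin = cp.multiply(y, X @ beta + beta0)   # y_i (beta_0 + x_i^T beta)
    obj = cp.Minimize(0.5 * cp.sum_squares(beta) + C * cp.sum(xi))
    cp.Problem(obj, [margin >= 1 - xi]).solve()
    return beta.value, beta0.value, xi.value
```

Refitting such a program at many values of $C$ is precisely the expense the path algorithm of this paper avoids.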
[Figure 2: The hinge loss penalizes observation margins $y f(x)$ less than +1 linearly, and is indifferent to margins greater than +1. The negative binomial log-likelihood (deviance) has the same asymptotes, but operates in a smoother fashion near the elbow at $y f(x) = 1$.]

Here the $\xi_i$ are non-negative slack variables that allow points to be on the wrong side of their "soft margin" ($f(x) = \pm 1$), as well as the decision boundary, and $C$ is a cost parameter that controls the amount of overlap. Figure 1 shows a simple example. If the data are separable, then for sufficiently large $C$ the solutions to (3) and (4) coincide. If the data are not separable, as $C$ gets large the solution approaches the minimum overlap solution with largest margin, which is attained for some finite value of $C$.

Alternatively, we can formulate the problem using a Loss + Penalty criterion (Wahba et al., 2000; Hastie et al., 2001):

$$\min_{\beta_0, \beta} \sum_{i=1}^n [1 - y_i(\beta_0 + \beta^T x_i)]_+ + \frac{\lambda}{2}\|\beta\|^2. \qquad (5)$$

The regularization parameter $\lambda$ in (5) corresponds to $1/C$, with $C$ in (4). Here the hinge loss $L(y, f(x)) = [1 - y f(x)]_+$ can be compared to the negative binomial log-likelihood $L(y, f(x)) = \log[1 + \exp(-y f(x))]$ for estimating the linear function $f(x) = \beta_0 + \beta^T x$; see Figure 2. This formulation emphasizes the role of regularization. In many situations we have sufficient variables (e.g. gene expression arrays) to guarantee separation. We may nevertheless avoid the maximum margin separator ($\lambda \downarrow 0$), which is governed by observations on the boundary, in favor of a more regularized solution involving more observations.

This formulation also admits a class of more flexible, nonlinear generalizations

$$\min_{f \in \mathcal{H}} \sum_{i=1}^n L(y_i, f(x_i)) + \lambda J(f), \qquad (6)$$

where $f(x)$ is an arbitrary function in some Hilbert space $\mathcal{H}$, and $J(f)$ is a functional that measures the "roughness" of $f$ in $\mathcal{H}$.

The nonlinear kernel SVMs arise naturally in this context. In this case $f(x) = \beta_0 + g(x)$, and $J(f) = J(g)$ is a norm in a Reproducing Kernel Hilbert Space of functions $\mathcal{H}_K$ generated by a positive-definite kernel $K(x, x')$. By the well-studied properties of such spaces (Wahba, 1990; Evgeniou et al., 2000), the solution to (6) is finite dimensional (even if $\mathcal{H}_K$ is infinite dimensional), in this case with a representation $f(x) = \beta_0 + \sum_{i=1}^n \theta_i K(x, x_i)$. Consequently (6) reduces to the finite form

$$\min_{\beta_0, \theta} \sum_{i=1}^n L\Big[y_i,\ \beta_0 + \sum_{j=1}^n \theta_j K(x_i, x_j)\Big] + \frac{\lambda}{2}\sum_{j=1}^n\sum_{j'=1}^n \theta_j \theta_{j'} K(x_j, x_{j'}). \qquad (7)$$

With $L$ the hinge loss, this is an alternative route to the kernel SVM; see Hastie et al. (2001) for more details.

It seems that the regularization parameter $C$ (or $\lambda$) is often regarded as a genuine "nuisance" in the community of SVM users. Software packages, such as the widely used SVMlight (Joachims, 1999), provide default settings for $C$, which are then used without much further exploration. A recent introductory document (Hsu et al., 2003) supporting the LIBSVM package does encourage grid search for $C$.

[Figure 3: Simulated data illustrate the need for regularization. The 200 data points are generated from a pair of mixture densities. The two SVM models used radial kernels with the scale and cost parameters as indicated at the top of the plots (left: $C = 2$, $\gamma = 1$, training error 0.160, test error 0.218, Bayes error 0.210; right: $C = 10{,}000$, $\gamma = 1$, training error 0.065, test error 0.307, Bayes error 0.210). The thick black curves are the decision boundaries, the dotted curves the margins. The less regularized fit on the right overfits the training data, and suffers dramatically on test error. The broken purple curve is the optimal Bayes decision boundary.]

Figure 3 shows the results of fitting two SVM models to the same simulated data set. The data are generated from a pair of mixture densities, described in detail in Hastie et al. (2001, Chapter 2).¹ The radial kernel function $K(x, x') = \exp(-\gamma\|x - x'\|^2)$ was used, with $\gamma = 1$. The model on the left is more regularized than that on the right ($C = 2$ vs $C = 10{,}000$, or $\lambda = 0.5$ vs $\lambda = 0.0001$), and performs much better on test data.

1. The actual training data and test distribution are available from http://www-stat.stanford.edu/ElemStatLearn.
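The loss comparison of Figure 2 is easy to reproduce numerically; the following is a small numpy sketch of the two loss functions, our illustration rather than code from the paper.

```python
# Sketch: hinge loss [1 - yf]_+ of (5) versus the negative binomial
# log-likelihood log(1 + exp(-yf)); they share asymptotes, but the
# deviance is smooth near the elbow at yf = 1 (Figure 2).
import numpy as np

yf = np.linspace(-3, 3, 121)           # observation margins y * f(x)
hinge = np.maximum(0.0, 1.0 - yf)      # linear for yf < 1, zero for yf > 1
deviance = np.log1p(np.exp(-yf))       # smooth everywhere
```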
[Figure 4: Test error curves for the mixture example, using four different values for the radial kernel parameter $\gamma$ ($\gamma = 5, 1, 0.5, 0.1$), plotted against $C = 1/\lambda$. Small values of $C$ correspond to heavy regularization, large values of $C$ to light regularization. Depending on the value of $\gamma$, the optimal $C$ can occur at either end of the spectrum or anywhere in between, emphasizing the need for careful selection.]

For these examples we evaluate the test error by integration over the lattice indicated in the plots. Figure 4 shows the test error as a function of $C$ for these data, using four different values for the kernel scale parameter $\gamma$. Here we see a dramatic range in the correct choice for $C$ (or $\lambda = 1/C$); when $\gamma = 5$, the most regularized model is called for, and we will see in Section 6 that the SVM is really performing kernel density classification. On the other hand, when $\gamma = 0.1$, we would want to choose among the least regularized models.

One of the reasons that investigators avoid extensive exploration of $C$ is the computational cost involved (a sketch of the naive grid strategy appears at the end of this section). In this paper we develop an algorithm which fits the entire path of SVM solutions $[\beta_0(C), \beta(C)]$, for all possible values of $C$, with essentially the computational cost of fitting a single model for a particular value of $C$. Our algorithm exploits the fact that the Lagrange multipliers implicit in (4) are piecewise-linear in $C$. This means that the coefficients $\beta(C)$ are also piecewise-linear in $C$. This is true for all SVM models, both linear and nonlinear kernel-based SVMs. Figure 8 shows these Lagrange paths for the mixture example. This work was inspired by the related "Least Angle Regression" (LAR) algorithm for fitting LASSO models (Efron et al., 2004), where again the coefficient paths are piecewise linear.

These speedups have a big impact on the estimation of the accuracy of the classifier, using a validation dataset (e.g. as in $K$-fold cross-validation). We can rapidly compute the fit for each test data point for any and all values of $C$, and hence the generalization error for the entire validation set as a function of $C$.

In the next section we develop our algorithm, and then demonstrate its capabilities on a number of examples. Apart from offering dramatic computational savings when computing multiple solutions (Section 4.3), the nature of the path, in particular at the boundaries, sheds light on the action of the kernel SVM (Section 6).
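For contrast with the path algorithm developed below, here is a sketch of the naive grid strategy it replaces: one full SVM fit per value of $C$. The use of scikit-learn's SVC is an assumption for illustration only; the authors' implementation is the R package svmpath.

```python
# Sketch of grid search over C: each grid value costs a full SVM fit,
# which is the computational burden the path algorithm removes.
import numpy as np
from sklearn.svm import SVC

def test_error_curve(X_tr, y_tr, X_te, y_te, gamma=1.0,
                     C_grid=np.logspace(-1, 3, 20)):
    errs = []
    for C in C_grid:
        fit = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
        errs.append(1.0 - fit.score(X_te, y_te))
    return np.array(errs)
```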
2. Problem Setup

We use a criterion equivalent to (4), implementing the formulation in (5):

$$\min_{\beta_0, \beta} \sum_{i=1}^n \xi_i + \frac{\lambda}{2}\beta^T\beta \qquad (8)$$

subject to $1 - y_i f(x_i) \le \xi_i$; $\xi_i \ge 0$; $f(x) = \beta_0 + \beta^T x$.

Initially we consider only linear SVMs to get the intuitive flavor of our procedure; we then generalize to kernel SVMs.

We construct the Lagrange primal function

$$L_P:\ \sum_{i=1}^n \xi_i + \frac{\lambda}{2}\beta^T\beta + \sum_{i=1}^n \alpha_i\big(1 - y_i f(x_i) - \xi_i\big) - \sum_{i=1}^n \gamma_i \xi_i \qquad (9)$$

and set the derivatives to zero. This gives

$$\frac{\partial}{\partial \beta}:\quad \beta = \frac{1}{\lambda}\sum_{i=1}^n \alpha_i y_i x_i, \qquad (10)$$

$$\frac{\partial}{\partial \beta_0}:\quad \sum_{i=1}^n y_i \alpha_i = 0, \qquad (11)$$

$$\frac{\partial}{\partial \xi_i}:\quad \alpha_i = 1 - \gamma_i, \qquad (12)$$

along with the KKT conditions

$$\alpha_i\big(1 - y_i f(x_i) - \xi_i\big) = 0, \qquad (13)$$

$$\gamma_i \xi_i = 0. \qquad (14)$$

We see that $0 \le \alpha_i \le 1$, with $\alpha_i = 1$ when $\xi_i > 0$ (which is when $y_i f(x_i) < 1$). Also when $y_i f(x_i) > 1$, $\xi_i = 0$ since no cost is incurred, and $\alpha_i = 0$. When $y_i f(x_i) = 1$, $\alpha_i$ can lie between 0 and 1.²

We wish to find the entire solution path for all values of $\lambda \ge 0$. The basic idea of our algorithm is as follows. We start with $\lambda$ large and decrease it toward zero, keeping track of all the events that occur along the way. As $\lambda$ decreases, $\|\beta\|$ increases, and hence the width of the margin decreases (see Figure 1). As this width decreases, points move from being inside to outside the margin. Their corresponding $\alpha_i$ change from $\alpha_i = 1$ when they are inside the margin ($y_i f(x_i) < 1$) to $\alpha_i = 0$ when they are outside the margin ($y_i f(x_i) > 1$). By continuity, points must linger on the margin ($y_i f(x_i) = 1$) while their $\alpha_i$ decrease from 1 to 0. We will see that the $\alpha_i(\lambda)$ trajectories are piecewise-linear in $\lambda$, which affords a great computational savings: as long as we can establish the breakpoints, all values in between can be found by simple linear interpolation. Note that points can return to the margin, after having passed through it.

2. For readers more familiar with the traditional SVM formulation (4), we note that there is a simple connection between the corresponding Lagrange multipliers, $\alpha_i' = \alpha_i/\lambda = C\alpha_i$, and hence in that case $\alpha_i' \in [0, C]$. We prefer our formulation here since our $\alpha_i \in [0, 1]$, and this simplifies the definition of the paths we define.

It is easy to show that if the $\alpha_i(\lambda)$ are piecewise linear in $\lambda$, then both $\alpha_i'(C) = C\alpha_i(C)$ and $\beta(C)$ are piecewise linear in $C$. It turns out that $\beta_0(C)$ is also piecewise linear in $C$. We will frequently switch between these two representations.

We denote by $\mathcal{I}_+$ the set of indices corresponding to $y_i = +1$ points, there being $n_+ = |\mathcal{I}_+|$ in total. Likewise for $\mathcal{I}_-$ and $n_-$. Our algorithm keeps track of the following sets (with names inspired by the hinge loss function in Figure 2):

$\mathcal{E} = \{i : y_i f(x_i) = 1,\ 0 \le \alpha_i \le 1\}$, $\mathcal{E}$ for Elbow,
$\mathcal{L} = \{i : y_i f(x_i) < 1,\ \alpha_i = 1\}$, $\mathcal{L}$ for Left of the elbow,
$\mathcal{R} = \{i : y_i f(x_i) > 1,\ \alpha_i = 0\}$, $\mathcal{R}$ for Right of the elbow.
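These three sets can be computed directly from any fitted solution; the following numpy sketch (ours, not the paper's) tags each observation by its margin $y_i f(x_i)$, with a tolerance for the equality $y_i f(x_i) = 1$.

```python
# Sketch: partition observations into Elbow / Left / Right given the
# fitted margins yf_i = y_i f(x_i); alpha is implied on L and R.
import numpy as np

def elbow_sets(yf, tol=1e-8):
    E = np.where(np.abs(yf - 1.0) <= tol)[0]   # y_i f(x_i) = 1, alpha in [0, 1]
    L = np.where(yf < 1.0 - tol)[0]            # y_i f(x_i) < 1, alpha = 1
    R = np.where(yf > 1.0 + tol)[0]            # y_i f(x_i) > 1, alpha = 0
    return E, L, R
```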
3. Initialization

We need to establish the initial state of the sets defined above. When $\lambda$ is very large ($\infty$), from (10) $\beta = 0$, and the initial values of $\beta_0$ and the $\alpha_i$ depend on whether $n_- = n_+$ or not. If the classes are balanced, one can directly find the initial configuration by finding the most extreme points in each class. We will see that when $n_- \ne n_+$, this is no longer the case, and in order to satisfy the constraint (11), a quadratic programming algorithm is needed to obtain the initial configuration.

In fact, our SvmPath algorithm can be started at any intermediate solution of the SVM optimization problem (i.e. the solution for any $\lambda$), since the values of $\alpha_i$ and $f(x_i)$ determine the sets $\mathcal{L}$, $\mathcal{E}$ and $\mathcal{R}$. We will see in Section 6 that if there is no intercept in the model, the initialization is again trivial, no matter whether the classes are balanced or not.

We have prepared some MPEG movies to illustrate the two special cases detailed below. The movies can be downloaded at the website http://www-stat.stanford.edu/~hastie/Papers/svm/MOVIE/.

3.1 Initialization: $n_+ = n_-$

Lemma 1. For $\lambda$ sufficiently large, all the $\alpha_i = 1$. The initial $\beta_0 \in [-1, 1]$; any value gives the same loss $\sum_{i=1}^n \xi_i = n_+ + n_-$.

Proof. Our proof relies on the criterion and the KKT conditions in Section 2. Since $\beta = 0$, $f(x) = \beta_0$. To minimize $\sum_{i=1}^n \xi_i$, we should clearly restrict $\beta_0$ to $[-1, 1]$. For $\beta_0 \in (-1, 1)$, all the $\xi_i > 0$, $\gamma_i = 0$ in (12), and hence $\alpha_i = 1$. Picking one of the endpoints, say $\beta_0 = -1$, causes $\alpha_i = 1, i \in \mathcal{I}_+$, and hence also $\alpha_i = 1, i \in \mathcal{I}_-$, for (11) to hold.

We also have that for these early and large values of $\lambda$, $\beta = \frac{1}{\lambda}\beta^*$, where

$$\beta^* = \sum_{i=1}^n y_i x_i. \qquad (15)$$

Now in order that (11) remain satisfied, we need that one or more positive and negative examples hit the elbow simultaneously. Hence as $\lambda$ decreases, we require that for all $i$

$$y_i f(x_i) \ge 1 \quad \text{or} \quad y_i\left[\frac{\beta^{*T} x_i}{\lambda} + \beta_0\right] \ge 1, \qquad (16)$$

or

$$\beta_0 \le 1 - \frac{\beta^{*T} x_i}{\lambda} \quad \text{for all } i \in \mathcal{I}_+, \qquad (17)$$

$$\beta_0 \ge -1 - \frac{\beta^{*T} x_i}{\lambda} \quad \text{for all } i \in \mathcal{I}_-. \qquad (18)$$

Pick $i_+ = \arg\max_{i \in \mathcal{I}_+} \beta^{*T} x_i$ and $i_- = \arg\min_{i \in \mathcal{I}_-} \beta^{*T} x_i$ (for simplicity we assume that these are unique). Then at this point of entry and beyond for a while we have $\alpha_{i_+} = \alpha_{i_-}$, and $f(x_{i_+}) = 1$ and $f(x_{i_-}) = -1$. This gives us two equations to solve for the initial point of entry $\lambda_0$ and $\beta_0$, with solutions

$$\lambda_0 = \frac{\beta^{*T} x_{i_+} - \beta^{*T} x_{i_-}}{2}, \qquad (19)$$

$$\beta_0 = -\left(\frac{\beta^{*T} x_{i_+} + \beta^{*T} x_{i_-}}{\beta^{*T} x_{i_+} - \beta^{*T} x_{i_-}}\right). \qquad (20)$$

[Figure 5: The initial paths of the coefficients in a small simulated data set with $n_- = n_+$; the panels show $\beta_0(C)$ and $\beta(C)$ against $C = 1/\lambda$. We see the zone of allowable values for $\beta_0$ shrinking toward a fixed point (20). The vertical lines indicate the breakpoints in the piecewise linear coefficient paths.]

Figure 5 (left panel) shows a trajectory of $\beta_0(C)$ as a function of $C$, for a small simulated data set. These solutions were computed directly using a quadratic-programming algorithm, using a predefined grid of values for $\lambda$. The arbitrariness of the initial values is indicated by the zig-zag nature of this path. The breakpoints were found using our exact-path algorithm.

3.2 Initialization: $n_+ > n_-$

In this case, when $\beta = 0$, the optimal choice for $\beta_0$ is 1, and the loss is $\sum_{i=1}^n \xi_i = 2n_-$. However, we also require that (11) holds.

Lemma 2. With $\beta^*(\alpha) = \sum_{i=1}^n y_i \alpha_i x_i$, let

$$\{\alpha_i^*\} = \arg\min_\alpha \|\beta^*(\alpha)\|^2 \qquad (21)$$

$$\text{s.t. } \alpha_i \in [0, 1] \text{ for } i \in \mathcal{I}_+,\ \alpha_i = 1 \text{ for } i \in \mathcal{I}_-, \text{ and } \sum_{i \in \mathcal{I}_+} \alpha_i = n_-. \qquad (22)$$

Then for some $\lambda_0$ we have that for all $\lambda > \lambda_0$, $\alpha_i = \alpha_i^*$, and $\beta = \beta^*/\lambda$, with $\beta^* = \sum_{i=1}^n y_i \alpha_i^* x_i$.

Proof. The Lagrange dual corresponding to (9) is obtained by substituting (10)-(12) into (9) (Hastie et al., 2001, Equation 12.13):

$$L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2\lambda}\sum_{i=1}^n\sum_{i'=1}^n \alpha_i \alpha_{i'} y_i y_{i'} x_i^T x_{i'}. \qquad (23)$$

Since we start with $\beta = 0$, $\beta_0 = 1$, all the $\mathcal{I}_-$ points are misclassified, and hence we will have $\alpha_i = 1\ \forall i \in \mathcal{I}_-$, and hence from (11) $\sum_{i=1}^n \alpha_i = 2n_-$. This latter sum will remain $2n_-$ for a while as $\beta$ grows away from zero. This means that during this phase, the first term in the Lagrange dual is constant; the second term is equal to $-\frac{1}{2\lambda}\|\beta^*(\alpha)\|^2$, and since we maximize the dual, this proves the result.

[Figure 6: The initial paths of the coefficients in a case where $n_- < n_+$; the panels show $\beta_0(C)$, $\beta(C)$, $\alpha(C)$ and the margins against $C = 1/\lambda$. All the $n_-$ points are misclassified, and start off with a margin of $-1$. The $\alpha_i^*$ remain constant until one of the points in $\mathcal{I}_-$ reaches the margin. The vertical lines indicate the breakpoints in the piecewise linear $\beta(C)$ paths. Note that the $\alpha_i(C)$ are not piecewise linear in $C$, but rather in $\lambda = 1/C$.]

We now establish the "starting point" $\lambda_0$ and $\beta_0$ when the $\alpha_i$ start to change. Let $\beta^*$ be the fixed coefficient direction corresponding to $\alpha_i^*$ (as in (15)):

$$\beta^* = \sum_{i=1}^n \alpha_i^* y_i x_i. \qquad (24)$$

There are two possible scenarios:

1. There exist two or more elements in $\mathcal{I}_+$ with $0 < \alpha_i^* < 1$, or
2. $\alpha_i^* \in \{0, 1\}\ \forall i \in \mathcal{I}_+$.

Consider the first scenario (depicted in Figure 6), and suppose $\alpha_{i_+}^* \in (0, 1)$ (on the margin). Let $i_- = \arg\min_{i \in \mathcal{I}_-} \beta^{*T} x_i$. Then since the point $i_+$ remains on the margin until an $\mathcal{I}_-$ point reaches its margin, we can find

$$\lambda_0 = \frac{\beta^{*T} x_{i_+} - \beta^{*T} x_{i_-}}{2}, \qquad (25)$$

identical in form to (19), as is the corresponding $\beta_0$ to (20). For the second scenario, it is easy to see that we find ourselves in the same situation as in Section 3.1: a point from $\mathcal{I}_-$ and one of the points in $\mathcal{I}_+$ with $\alpha_i^* = 1$ must reach the margin simultaneously. Hence we get an analogous situation, except with $i_+ = \arg\max_{i \in \mathcal{I}_+^1} \beta^{*T} x_i$, where $\mathcal{I}_+^1$ is the subset of $\mathcal{I}_+$ with $\alpha_i^* = 1$.
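For the balanced case of Section 3.1, the initialization (15), (19) and (20) is a few lines of linear algebra. The following is a hedged numpy sketch of those formulas for the linear kernel; it is our illustration of the equations, not the authors' code.

```python
# Sketch of the balanced initialization (n+ = n-): all alpha_i = 1,
# beta* = sum_i y_i x_i as in (15), and lambda_0, beta_0 from (19)-(20).
import numpy as np

def initialize_balanced(X, y):
    beta_star = X.T @ y                 # beta* of (15), since all alpha_i = 1
    f = X @ beta_star                   # beta*^T x_i for every i
    f_plus = f[y == +1].max()           # i+ = argmax over I+ of beta*^T x_i
    f_minus = f[y == -1].min()          # i- = argmin over I- of beta*^T x_i
    lam0 = (f_plus - f_minus) / 2.0                   # eq (19)
    beta0 = -(f_plus + f_minus) / (f_plus - f_minus)  # eq (20)
    return lam0, beta0
```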
3.3 Kernels

The development so far has been in the original feature space, since it is easier to visualize. It is easy to see that the entire development carries through with "kernels" as well. In this case $f(x) = \beta_0 + g(x)$, and the only change that occurs is that (10) is changed to

$$g(x_i) = \frac{1}{\lambda}\sum_{j=1}^n \alpha_j y_j K(x_i, x_j),\quad i = 1, \ldots, n, \qquad (26)$$

or $\theta_j(\lambda) = \alpha_j y_j/\lambda$ using the notation in (7).

Our initial conditions are defined in terms of expressions $\beta^{*T} x_{i_+}$, for example, and again it is easy to see that the relevant quantities are

$$g^*(x_{i_+}) = \sum_{j=1}^n \alpha_j^* y_j K(x_{i_+}, x_j), \qquad (27)$$

where the $\alpha_i^*$ are all 1 in Section 3.1, and defined by Lemma 2 in Section 3.2.

Hereafter we will develop our algorithm for this more general kernel case.

4. The Path

The algorithm hinges on the set of points $\mathcal{E}$ sitting at the elbow of the loss function, i.e. on the margin. These points have $y_i f(x_i) = 1$ and $\alpha_i \in [0, 1]$. These are distinct from the points $\mathcal{R}$ to the right of the elbow, with $y_i f(x_i) > 1$ and $\alpha_i = 0$, and those points $\mathcal{L}$ to the left with $y_i f(x_i) < 1$ and $\alpha_i = 1$. We consider this set at the point that an event has occurred. The event can be either:

1. The initial event, which means 2 or more points start at the elbow, with their initial values of $\alpha \in [0, 1]$.
2. A point from $\mathcal{L}$ has just entered $\mathcal{E}$, with its value of $\alpha_i$ initially 1.
3. A point from $\mathcal{R}$ has reentered $\mathcal{E}$, with its value of $\alpha_i$ initially 0.
4. One or more points in $\mathcal{E}$ has left the set, to join either $\mathcal{R}$ or $\mathcal{L}$.

Whichever the case, for continuity reasons this set will stay stable until the next event occurs, since to pass through $\mathcal{E}$, a point's $\alpha_i$ must change from 0 to 1 or vice versa. Since all points in $\mathcal{E}$ have $y_i f(x_i) = 1$, we can establish a path for their $\alpha_i$.

Event 4 allows for the possibility that $\mathcal{E}$ becomes empty while $\mathcal{L}$ is not. If this occurs, then the KKT condition (11) implies that $\mathcal{L}$ is balanced w.r.t. +1s and -1s, and we resort to the initial condition as in Section 3.1.

We use the subscript $\ell$ to index the sets above immediately after the $\ell$th event has occurred. Suppose $|\mathcal{E}_\ell| = m$, and let $\alpha_i^\ell$, $\beta_0^\ell$ and $\lambda_\ell$ be the values of these parameters at the point of entry. Likewise $f_\ell$ is the function at this point. For convenience we define $\alpha_0 = \lambda\beta_0$, and hence $\alpha_0^\ell = \lambda_\ell \beta_0^\ell$. Since

$$f(x) = \frac{1}{\lambda}\left[\sum_{j=1}^n y_j \alpha_j K(x, x_j) + \alpha_0\right], \qquad (28)$$

for $\lambda_{\ell+1} < \lambda < \lambda_\ell$ we can write

$$f(x) = f(x) - \frac{\lambda_\ell}{\lambda} f_\ell(x) + \frac{\lambda_\ell}{\lambda} f_\ell(x) = \frac{1}{\lambda}\left[\sum_{j \in \mathcal{E}_\ell} (\alpha_j - \alpha_j^\ell)\, y_j K(x, x_j) + (\alpha_0 - \alpha_0^\ell) + \lambda_\ell f_\ell(x)\right]. \qquad (29)$$

The second line follows because all the observations in $\mathcal{L}_\ell$ have their $\alpha_i = 1$, and those in $\mathcal{R}_\ell$ have their $\alpha_i = 0$, for this range of $\lambda$. Since each of the $m$ points $x_i \in \mathcal{E}_\ell$ are to stay at the elbow, we have that

$$\frac{1}{\lambda}\left[\sum_{j \in \mathcal{E}_\ell} (\alpha_j - \alpha_j^\ell)\, y_i y_j K(x_i, x_j) + y_i(\alpha_0 - \alpha_0^\ell) + \lambda_\ell\right] = 1, \quad \forall i \in \mathcal{E}_\ell. \qquad (30)$$

Writing $\delta_j = \alpha_j^\ell - \alpha_j$, from (30) we have

$$\sum_{j \in \mathcal{E}_\ell} \delta_j y_i y_j K(x_i, x_j) + y_i \delta_0 = \lambda_\ell - \lambda, \quad \forall i \in \mathcal{E}_\ell. \qquad (31)$$

Furthermore, since at all times $\sum_{i=1}^n y_i \alpha_i = 0$, we have that

$$\sum_{j \in \mathcal{E}_\ell} y_j \delta_j = 0. \qquad (32)$$

Equations (31) and (32) constitute $m + 1$ linear equations in $m + 1$ unknowns $\delta_j$, and can be solved. Denoting by $\mathbf{K}^*_\ell$ the $m \times m$ matrix with $ij$th entry $y_i y_j K(x_i, x_j)$ for $i$ and $j$ in $\mathcal{E}_\ell$, we have from (31) that

$$\mathbf{K}^*_\ell \delta + \delta_0 y = (\lambda_\ell - \lambda)\mathbf{1}, \qquad (33)$$

where $y$ is the $m$-vector with entries $y_i, i \in \mathcal{E}_\ell$. From (32) we have

$$y^T \delta = 0. \qquad (34)$$

We can combine these two into one matrix equation as follows. Let

$$A_\ell = \begin{pmatrix} 0 & y^T \\ y & \mathbf{K}^*_\ell \end{pmatrix}, \quad \delta^a = \begin{pmatrix} \delta_0 \\ \delta \end{pmatrix}, \quad \text{and} \quad \mathbf{1}^a = \begin{pmatrix} 0 \\ \mathbf{1} \end{pmatrix}; \qquad (35)$$

then (34) and (33) can be written

$$A_\ell \delta^a = (\lambda_\ell - \lambda)\mathbf{1}^a. \qquad (36)$$

If $A_\ell$ has full rank, then we can write

$$b^a = A_\ell^{-1}\mathbf{1}^a, \qquad (37)$$

and hence

$$\alpha_j = \alpha_j^\ell - (\lambda_\ell - \lambda)b_j, \quad j \in \{0\} \cup \mathcal{E}_\ell. \qquad (38)$$

Hence for $\lambda_{\ell+1} < \lambda < \lambda_\ell$, the $\alpha_j$ for points at the elbow proceed linearly in $\lambda$.
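One step of the path is the solution of the small system (36). The sketch below is our illustration, under the assumption that $A_\ell$ is nonsingular; it assembles $A_\ell$ from the elbow points and returns $b^a = A_\ell^{-1}\mathbf{1}^a$ of (37).

```python
# Sketch of (35)-(37): assemble A_l from the m elbow points and solve
# A_l b^a = 1^a; then alpha_j = alpha_j^l - (lambda_l - lambda) b_j, eq (38).
import numpy as np

def elbow_step_direction(K_E, y_E):
    # K_E: m x m kernel matrix K(x_i, x_j) on elbow points; y_E: their labels
    Kstar = (y_E[:, None] * y_E[None, :]) * K_E   # entries y_i y_j K(x_i, x_j)
    m = len(y_E)
    A = np.zeros((m + 1, m + 1))
    A[0, 1:] = y_E                                # first row (0, y^T)
    A[1:, 0] = y_E                                # first column (0, y)
    A[1:, 1:] = Kstar
    one_a = np.r_[0.0, np.ones(m)]                # 1^a = (0, 1, ..., 1)^T
    return np.linalg.solve(A, one_a)              # b^a = (b_0, b_1, ..., b_m)
```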
From (29) we have

$$f(x) = \frac{\lambda_\ell}{\lambda}\left[f_\ell(x) - h_\ell(x)\right] + h_\ell(x), \qquad (39)$$

where

$$h_\ell(x) = \sum_{j \in \mathcal{E}_\ell} y_j b_j K(x, x_j) + b_0. \qquad (40)$$

Thus the function itself changes in a piecewise-inverse manner in $\lambda$.

If $A_\ell$ does not have full rank, then the solution paths for some of the $\alpha_i$ are not unique, and more care has to be taken in solving the system (36). This occurs, for example, when two training observations are identical (tied in $x$ and $y$). Other degeneracies can occur, but rarely in practice, such as three different points on the same margin in $\mathbb{R}^2$. These issues and some of the related updating and downdating schemes are an area we are currently researching, and will be reported elsewhere.

4.1 Finding $\lambda_{\ell+1}$

The paths (38)-(39) continue until one of the following events occurs:

1. One of the $\alpha_i$ for $i \in \mathcal{E}_\ell$ reaches a boundary (0 or 1). For each $i$ the value of $\lambda$ for which this occurs is easily established from (38).
2. One of the points in $\mathcal{L}_\ell$ or $\mathcal{R}_\ell$ attains $y_i f(x_i) = 1$. From (39) this occurs for point $i$ at

$$\lambda = \lambda_\ell\, \frac{f_\ell(x_i) - h_\ell(x_i)}{y_i - h_\ell(x_i)}. \qquad (41)$$

By examining these conditions, we can establish the largest $\lambda < \lambda_\ell$ for which an event occurs, and hence establish $\lambda_{\ell+1}$ and update the sets (a sketch of this search appears at the end of this subsection).

One special case not addressed above is when the set $\mathcal{E}$ becomes empty during the course of the algorithm. In this case, we revert to an initialization setup using the points in $\mathcal{L}$. It must be the case that these points have an equal number of +1's as -1's, and so we are in the balanced situation as in 3.1.

By examining in detail the linear boundary in examples where $p = 2$, we observed several different types of behavior:

1. If $|\mathcal{E}_\ell| = 0$, then as $\lambda$ decreases, the orientation of the decision boundary stays fixed, but the margin width narrows as $\lambda$ decreases.
2. If $|\mathcal{E}_\ell| = 1$ or $|\mathcal{E}_\ell| = 2$, but with the pair of points of opposite classes, then the orientation typically rotates as the margin width gets narrower.
3. If $|\mathcal{E}_\ell| = 2$, with both points having the same class, then the orientation remains fixed, with the one margin stuck on the two points as the decision boundary gets shrunk toward it.
4. If $|\mathcal{E}_\ell| \ge 3$, then the margins and hence $f(x)$ remain fixed, as the $\alpha_i(\lambda)$ change. This implies that $h_\ell = f_\ell$ in (39).
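The breakpoint search of Section 4.1 referenced above reduces to a few scalar computations per observation. A sketch, with our own bookkeeping conventions (the variable names are ours):

```python
# Sketch of Section 4.1: lambda_{l+1} is the largest lambda < lambda_l at
# which an alpha in E hits 0 or 1 (from (38)) or a point in L or R
# reaches y_i f(x_i) = 1 (from (41)).
import numpy as np

def next_breakpoint(lam_l, alpha_E, b_E, f_l, h_l, y, LR_idx):
    cand = []
    for a, b in zip(alpha_E, b_E):        # alpha_j(lam) = a - (lam_l - lam) b
        if b != 0.0:
            cand += [lam_l - a / b, lam_l - (a - 1.0) / b]
    for i in LR_idx:                      # event (41) for points in L or R
        if y[i] != h_l[i]:
            cand.append(lam_l * (f_l[i] - h_l[i]) / (y[i] - h_l[i]))
    valid = [c for c in cand if 0.0 <= c < lam_l]
    return max(valid) if valid else 0.0
```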
4.2 Termination

In the separable case, we terminate when $\mathcal{L}$ becomes empty. At this point, all the $\xi_i$ in (8) are zero, and further movement increases the norm of $\beta$ unnecessarily.

In the non-separable case, $\lambda$ runs all the way down to zero. For this to happen without $f$ "blowing up" in (39), we must have $f_\ell - h_\ell = 0$, and hence the boundary and margins remain fixed at a point where $\sum_i \xi_i$ is as small as possible, and the margin is as wide as possible subject to this constraint.

4.3 Computational Complexity

At any update event $\ell$ along the path of our algorithm, the main computational burden is solving the system of equations of size $m_\ell = |\mathcal{E}_\ell|$. While this normally involves $O(m_\ell^3)$ computations, since $\mathcal{E}_{\ell+1}$ differs from $\mathcal{E}_\ell$ by typically one observation, inverse updating/downdating can reduce the computations to $O(m_\ell^2)$. The computation of $h_\ell(x_i)$ in (40) requires $O(n m_\ell)$ computations. Beyond that, several checks of cost $O(n)$ are needed to evaluate the next move.

[Figure 7: The elbow sizes $|\mathcal{E}_\ell|$ as a function of $\lambda$, for different values of the radial-kernel parameter $\gamma$ ($\gamma = 0.1, 0.5, 1, 5$). The vertical lines show the positions used to compare the times with libsvm.]

We have explored using partitioned inverses for updating/downdating the solutions to the elbow equations (for the nonsingular case), and our experiences are mixed. In our R implementations, the computational savings appear negligible for the problems we have tackled, and after repeated updating, rounding errors can cause drift. At the time of this publication, we in fact do not use updating at all, and simply solve the system each time. We are currently exploring numerically stable ways for managing these updates.

Although we have no hard results, our experience so far suggests that the total number of moves is $O(k \min(n_+, n_-))$, for $k$ around 4-6; hence typically some small multiple $c$ of $n$. If the average size of $\mathcal{E}_\ell$ is $m$, this suggests the total computational burden is $O(cn^2 m + nm^2)$, which is similar to that of a single SVM fit.

Our R function SvmPath computes all 632 steps in the mixture example ($n_+ = n_- = 100$, radial kernel, $\gamma = 1$) in 1.44 (0.02) secs on a Pentium 4, 2 GHz Linux machine; the svm function (using the optimized code libsvm, from the R library e1071) takes 9.28 (0.06) seconds to compute the solution at 10 points along the path. Hence it takes our procedure about 50% more time to compute the entire path, than it costs libsvm to compute a typical single solution.
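For completeness, the partitioned-inverse updating mentioned (and ultimately not used) in Section 4.3 can look like the following. This is a generic Schur-complement sketch of ours, not the authors' code, and as they note such updates can drift numerically after repeated application.

```python
# Sketch: grow A^{-1} by one row/column in O(m^2) via the Schur complement,
# instead of re-solving the (m+1) x (m+1) system from scratch in O(m^3).
import numpy as np

def grow_inverse(A_inv, b, d):
    # New matrix is [[A, b], [b^T, d]]; requires Schur complement s != 0.
    u = A_inv @ b
    s = d - b @ u
    m = A_inv.shape[0]
    out = np.empty((m + 1, m + 1))
    out[:m, :m] = A_inv + np.outer(u, u) / s
    out[:m, m] = -u / s
    out[m, :m] = -u / s
    out[m, m] = 1.0 / s
    return out
```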
We often wish to make predictions at new inputs. We can also do this efficiently for all values of $\lambda$, because from (28) we see that (modulo $1/\lambda$), these also change in a piecewise-linear fashion in $\lambda$. Hence we can compute the entire fit path for a single input $x$ in $O(n)$ calculations, plus an additional $O(nq)$ operations to compute the kernel evaluations (assuming it costs $O(q)$ operations to compute $K(x, x_i)$).

5. Examples

In this section we look at three examples, two synthetic and one real. We examine our running mixture example in some more detail, and expose the nature of quadratic regularization in the kernel feature space. We then simulate and examine a scaled-down version of the $p \gg n$ problem: many more inputs than samples. Despite the fact that perfect separation is possible with large margins, a heavily regularized model is optimal in this case. Finally we fit SVM path models to some microarray cancer data.

5.1 Mixture Simulation

In Figure 4 we show the test-error curves for a large number of values of $\lambda$, and four different values for $\gamma$ for the radial kernel. These $\lambda$ are in fact the entire collection of change points as described in Section 4. For example, for the second panel, with $\gamma = 1$, there are 623 change points. Figure 8 [upper plot] shows the paths of all the $\alpha_i(\lambda)$, as well as [lower plot] a few individual examples. An MPEG movie of the sequence of models can be downloaded from the first author's website.

We were at first surprised to discover that not all these sequences achieved zero training errors on the 200 training data points, at their least regularized fit. In fact the minimal training errors, and the corresponding values for $\gamma$, are summarized in Table 1.

Table 1: The number of minimal training errors for different values of the radial kernel scale parameter $\gamma$, for the mixture simulation example. Also shown is the effective rank of the $200 \times 200$ Gram matrix $\mathbf{K}_\gamma$.

  $\gamma$            5      1      0.5    0.1
  Training Errors     0      12     21     33
  Effective Rank      200    177    143    76

It is sometimes argued that the implicit feature space is "infinite dimensional" for this kernel, which suggests that perfect separation is always possible. The last row of the table shows the effective rank of the kernel Gram matrix $\mathbf{K}_\gamma$ (which we defined to be the number of singular values greater than $10^{-12}$). This $200 \times 200$ matrix has elements $K(x_i, x_j)$, $i, j = 1, \ldots, n$. In general a full rank $\mathbf{K}_\gamma$ is required to achieve perfect separation. Similar observations have appeared in the literature (Bach and Jordan, 2002; Williams and Seeger, 2000). This emphasizes the fact that not all features in the feature map implied by $K$ are of equal stature; many of them are shrunk way down to zero. Alternatively, the regularization in (6) and (7) penalizes unit-norm features by the inverse of their eigenvalues, which effectively annihilates some, depending on $\gamma$. Small $\gamma$ implies wide, flat kernels, and a suppression of wiggly, "rough" functions.

[Figure 8: (Upper plot) The entire collection of piecewise linear paths $\alpha_i(\lambda)$, $i = 1, \ldots, N$, for the mixture example. Note: $\lambda$ is plotted on the log-scale. (Lower plot) Paths for 5 selected observations; $\lambda$ is not on the log scale.]
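The effective rank reported in Table 1 is straightforward to compute; a numpy sketch under the paper's definition (singular values of the Gram matrix exceeding $10^{-12}$):

```python
# Sketch: effective rank of the radial-kernel Gram matrix, defined in
# Section 5.1 as the number of singular values greater than 1e-12.
import numpy as np

def effective_rank(X, gamma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # ||x_i - x_j||^2
    K = np.exp(-gamma * sq)                              # radial kernel
    s = np.linalg.svd(K, compute_uv=False)
    return int((s > 1e-12).sum())
```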
Writing (7) in matrix form,

$$\min_{\beta_0, \theta} L[y, \mathbf{K}\theta] + \frac{\lambda}{2}\theta^T \mathbf{K}\theta, \qquad (42)$$

we reparametrize using the eigen-decomposition of $\mathbf{K} = \mathbf{U}\mathbf{D}\mathbf{U}^T$. Let $\mathbf{K}\theta = \mathbf{U}\theta^*$, where $\theta^* = \mathbf{D}\mathbf{U}^T\theta$. Then (42) becomes

$$\min_{\beta_0, \theta^*} L[y, \mathbf{U}\theta^*] + \frac{\lambda}{2}\theta^{*T}\mathbf{D}^{-1}\theta^*. \qquad (43)$$

Now the columns of $\mathbf{U}$ are unit-norm basis functions (in $\mathbb{R}^2$) spanning the column space of $\mathbf{K}$; from (43) we see that those members corresponding to near-zero eigenvalues (the elements of the diagonal matrix $\mathbf{D}$) get heavily penalized and hence ignored. Figure 9 shows the elements of $\mathbf{D}$ for the four values of $\gamma$. See Hastie et al. (2001, Chapter 5) for more details.

[Figure 9: The eigenvalues (on the log scale) for the kernel matrices corresponding to the four values of $\gamma$ as in Figure 4 ($\gamma = 0.1, 0.5, 1, 5$). The larger eigenvalues correspond in this case to smoother eigenfunctions, the small ones to rougher. The rougher eigenfunctions get penalized exponentially more than the smoother ones. For smaller values of $\gamma$, the effective dimension of the space is truncated.]

5.2 $p \gg n$ Simulation

The SVM is popular in situations where the number of features exceeds the number of observations. Gene expression arrays are a leading example, where a typical dataset has $p > 10{,}000$ while $n < 100$. Here one typically fits a linear classifier, and since it is easy to separate the data, the optimal margin classifier is the de facto choice. We argue here that regularization can play an important role for these kinds of data.

We mimic a simulation found in Marron (2003). We have $p = 50$ and $n = 40$, with a 20-20 split of "+" and "-" class members. The $x_{ij}$ are all iid realizations from a $N(0, 1)$ distribution, except for the first coordinate, which has mean +2 and -2 in the respective classes.³ The Bayes classifier in this case uses only the first coordinate of $x$, with a threshold at 0. The Bayes risk is 0.012.

3. Here we have one important feature; the remaining 49 are noise. With expression arrays, the important features typically occur in groups, but the total number $p$ is much larger.

Figure 10 summarizes the experiment. We see that the most regularized models do the best here, not the maximal margin classifier. Although the most regularized linear SVM is the best in this example, we notice a disturbing aspect of its endpoint behavior in the top-right plot. Although $\beta$ is determined by all the points, the threshold $\beta_0$ is determined by the two most extreme points in the two classes (see Section 3.1). This can lead to irregular behavior, and indeed in some realizations from this model this was the case. For values of $\lambda$ larger than the initial value $\lambda_1$, we saw in Section 3 that the endpoint behavior depends on whether the classes are balanced or not. In either case, as $\lambda$ increases, the error converges to the estimated null error rate $n_{\min}/n$.

[Figure 10: $p \gg n$ simulation; horizontal axes $\lambda = 1/C$. (Top Left) The training data projected onto the space spanned by the (known) optimal coordinate 1, and the optimal margin coefficient vector found by a non-regularized SVM. We see the large gap in the margin, while the Bayes-optimal classifier (vertical red line) is actually expected to make a small number of errors. (Top Right) The same as the left panel, except we now project onto the most regularized SVM coefficient vector. This solution is closer to the Bayes-optimal solution. (Lower Left) The angles between the Bayes-optimal direction, and the directions found by the SVM (S) along the regularized path. Included in the figure are the corresponding coefficients for regularized LDA (R) (Hastie et al., 2001, Chapter 4) and regularized logistic regression (L) (Zhu and Hastie, 2004), using the same quadratic penalties. (Lower Right) The test errors corresponding to the three paths. The horizontal line is the estimated Bayes rule using only the first coordinate.]

This same objection is often made at the other extreme of the optimal margin; however, it typically involves more support points (19 points on the margin here), and tends to be more stable (but still no good in this case). For solutions in the interior of the regularization path, these objections no longer hold. Here the regularization forces more points to overlap the margin (support points), and hence determine its orientation.

Included in the figures are regularized linear discriminant analysis and logistic regression models (using the same $\lambda$ sequence as the SVM). Both show similar behavior to the regularized SVM, having the most regularized solutions perform the best.

Logistic regression can be seen to assign weights $p_i(1 - p_i)$ to observations in the fitting of its coefficients $\beta$ and $\beta_0$, where

$$p_i = \frac{1}{1 + e^{-\beta_0 - \beta^T x_i}} \qquad (44)$$

is the estimated probability of +1 occurring at $x_i$ (see, e.g., Hastie and Tibshirani, 1990). Since the decision boundary corresponds to $p(x) = 0.5$, these weights can be seen to die down in a quadratic fashion from 1/4, as we move away from the boundary. The rate at which the weights die down with distance from the boundary depends on $\|\beta\|$; the smaller this norm, the slower the rate. It can be shown, for separated classes, that the limiting solution ($\lambda \downarrow 0$) for the regularized logistic regression model is identical to the SVM solution: the maximal margin separator (Rosset et al., 2003). Not surprisingly, given the similarities in their loss functions (Figure 2), both regularized SVMs and logistic regression involve more or fewer observations in determining their solutions, depending on the amount of regularization. This "involvement" is achieved in a smoother fashion by logistic regression.
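The simulation of Section 5.2 is easy to reproduce; the following is a sketch under the stated design (our seed and generator choices are arbitrary assumptions).

```python
# Sketch of the p >> n design: p = 50, n = 40, iid N(0,1) entries,
# except the first coordinate has mean +2 / -2 in the two classes.
import numpy as np

def simulate_p_much_greater_n(n_per_class=20, p=50, seed=0):
    rng = np.random.default_rng(seed)
    y = np.r_[np.ones(n_per_class), -np.ones(n_per_class)]
    X = rng.standard_normal((2 * n_per_class, p))
    X[:, 0] += 2.0 * y                  # the only informative coordinate
    return X, y
```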
5.3 Microarray Classification

We illustrate our algorithm on a large cancer expression data set (Ramaswamy et al., 2001). There are 144 training tumor samples and 54 test tumor samples, spanning 14 common tumor classes that account for 80% of new cancer diagnoses in the U.S.A. There are 16,063 genes for each sample. Hence $p = 16{,}063$ and $n = 144$. We denote the number of classes by $K = 14$. A goal is to build a classifier for predicting the cancer class of a new sample, given its expression values.

We used a common approach for extending the SVM from two-class to multi-class classification:

1. Fit $K$ different SVM models, each one classifying a single cancer class (+1) versus the rest (-1).
2. Let $[f_1^\lambda(x), \ldots, f_K^\lambda(x)]$ be the vector of evaluations of the fitted functions (with parameter $\lambda$) at a test observation $x$.
3. Classify $\mathrm{Cl}(x) = \arg\max_k f_k^\lambda(x)$.

Other, more direct, multi-class generalizations exist (Rosset et al., 2003; Weston and Watkins, 1998); although exact path algorithms are possible here too, we were able to implement our approach most easily with the "one vs all" strategy above.

Figure 11 shows the results of fitting this family of SVM models. Shown are the training error, test error, as well as 8-fold balanced cross-validation.⁴ The training error is zero everywhere, but both the test and CV error increase sharply when the model is too regularized. The right plot shows similar results using quadratically regularized multinomial regression (Zhu and Hastie, 2004). Although the least regularized SVM and multinomial models do the best, this is still not very good. With fourteen classes, this is a tough classification problem.

4. By balanced we mean the 14 cancer classes were represented equally in each of the folds; 8 folds were used to accommodate this balance, since the class sizes in the training set were multiples of 8.

[Figure 11: Misclassification rates for cancer classification by gene expression measurements. The left panel shows the training (lower green), cross-validation (middle black, with standard errors) and test error (upper blue) curves for the entire SVM path. Although the CV and test error curves appear to have quite different levels, the region of interesting behavior is the same (with a curious dip at about $\lambda = 3000$). Seeing the entire path leaves no guesswork as to where the region of interest might be. The right panel shows the same for the regularized multiple logistic regression model. Here we do not have an exact path algorithm, so a grid of 15 values of $\lambda$ is used (on a log scale).]

It is worth noting that:

- The 14 different classification problems are very "lop-sided"; in many cases 8 observations in one class vs the 136 others. This tends to produce solutions with all members of the small class on the boundary, a somewhat unnatural situation.
- For both the SVM and the quadratically regularized multinomial regression, one can reduce the logistics by pre-transforming the data. If $\mathbf{X}$ is the $n \times p$ data matrix, with $p \gg n$, let its singular-value decomposition be $\mathbf{U}\mathbf{D}\mathbf{V}^T$. We can replace $\mathbf{X}$ by the $n \times n$ matrix $\mathbf{X}\mathbf{V} = \mathbf{U}\mathbf{D} = \mathbf{R}$ and obtain identical results (Hastie and Tibshirani, 2003). The same transformation $\mathbf{V}$ is applied to the test data. This transformation is applied once upfront, and the transformed data is used in all subsequent analyses (i.e. $K$-fold cross-validation as well); a sketch follows below.
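The SVD pre-transformation described in the last item is a few lines in practice; a numpy sketch (ours, for illustration):

```python
# Sketch of the Section 5.3 reduction: with p >> n and X = U D V^T,
# replace X by R = X V = U D (n x n) and apply the same V to test data;
# quadratically regularized fits are unchanged (Hastie and Tibshirani, 2003).
import numpy as np

def reduce_by_svd(X_train, X_test):
    U, d, Vt = np.linalg.svd(X_train, full_matrices=False)
    R_train = U * d                     # equals X_train @ Vt.T
    R_test = X_test @ Vt.T              # same transformation V for test data
    return R_train, R_test
```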
6. No Intercept and Kernel Density Classification

Here we consider a simplification of the models (6) and (7) where we leave out the intercept term $\beta_0$. It is easy to show that the solution for $g(x)$ has the identical form as in (26):

$$g(x) = \frac{1}{\lambda}\sum_{j=1}^n \alpha_j y_j K(x, x_j). \qquad (45)$$

However, $f(x) = g(x)$ (or $f(x) = \beta^T x$ in the linear case), and we lose the constraint (11), which arose from the intercept term. This also adds considerable simplification to our algorithm, in particular the initial conditions.

It is easy to see that initially $\alpha_i = 1\ \forall i$, since $f(x)$ is close to zero for large $\lambda$, and hence all points are in $\mathcal{L}$. This is true whether or not $n_- = n_+$, unlike the situation when an intercept is present (Section 3.2). With $f^*(x) = \sum_{j=1}^n y_j K(x, x_j)$, the first element of $\mathcal{E}$ is $i^* = \arg\max_i |f^*(x_i)|$, with $\lambda_1 = |f^*(x_{i^*})|$. For $\lambda \in [\lambda_1, \infty)$, $f(x) = f^*(x)/\lambda$.

The linear equations that govern the points in $\mathcal{E}_\ell$ are similar to (33):

$$\mathbf{K}^*_\ell \delta = (\lambda_\ell - \lambda)\mathbf{1}. \qquad (46)$$

We now show that in the most regularized case, these no-intercept kernel models are actually performing kernel density classification. Initially, for $\lambda > \lambda_1$, we classify to class +1 if $f^*(x)/\lambda > 0$, else to class -1. But

$$f^*(x) = \sum_{j \in \mathcal{I}_+} K(x, x_j) - \sum_{j \in \mathcal{I}_-} K(x, x_j) = n\left(\frac{n_+}{n}\cdot\frac{1}{n_+}\sum_{j \in \mathcal{I}_+} K(x, x_j) - \frac{n_-}{n}\cdot\frac{1}{n_-}\sum_{j \in \mathcal{I}_-} K(x, x_j)\right) \qquad (47)$$

$$\propto \hat\pi_+ \hat h_+(x) - \hat\pi_- \hat h_-(x). \qquad (48)$$

In other words, this is the estimated Bayes decision rule, with $\hat h_+$ the kernel density (Parzen window) estimate for the + class, $\hat\pi_+$ the sample prior, and likewise for $\hat h_-(x)$ and $\hat\pi_-$. A similar observation is made in Schölkopf and Smola (2001), for the model with intercept. So at this end of the regularization scale, the kernel parameter $\gamma$ plays a crucial role, as it does in kernel density classification. As $\gamma$ increases, the behavior of the classifier approaches that of the 1-nearest neighbor classifier. For very small $\gamma$, or in fact a linear kernel, this amounts to closest centroid classification.

As $\lambda$ is relaxed, the $\alpha_i(\lambda)$ will change, giving ultimately zero weight to points well within their own class, and sharing the weights among points near the decision boundary. In the context of nearest neighbor classification, this has the flavor of "editing", a way of thinning out the training set retaining only those prototypes essential for classification (Ripley, 1996).

All these interpretations get blurred when the intercept $\beta_0$ is present in the model. For the radial kernel, a constant term is included in $\mathrm{span}\{K(x, x_i)\}_1^n$, so it is not strictly necessary to include one in the model. However, it will get regularized (shrunk toward zero) along with all the other coefficients, which is usually why these intercept terms are separated out and freed from regularization. Adding a constant $b^2$ to $K(\cdot, \cdot)$ will reduce the amount of shrinking on the intercept (since the amount of shrinking of an eigenfunction of $K$ is inversely proportional to its eigenvalue; see Section 5). For the linear SVM, we can augment the $x_i$ vectors with a constant element $b$, and then fit the no-intercept model. The larger $b$, the closer the solution will be to that of the linear SVM with intercept.
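The kernel density reading of (47)-(48) can be made concrete; a small numpy sketch of the heavily regularized no-intercept rule (our illustration):

```python
# Sketch of (47)-(48): for lambda > lambda_1 the no-intercept SVM classifies
# by the sign of f*(x) = sum_{I+} K(x, x_j) - sum_{I-} K(x, x_j), i.e. it
# compares the Parzen-window estimates pi_+ h_+(x) and pi_- h_-(x).
import numpy as np

def parzen_classify(x, X, y, gamma):
    k = np.exp(-gamma * ((X - x) ** 2).sum(axis=1))  # radial kernel to each x_j
    return 1 if (y * k).sum() > 0 else -1
```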
7. Discussion

Our work on the SVM path algorithm was inspired by earlier work on exact path algorithms in other settings. "Least Angle Regression" (Efron et al., 2002) shows that the coefficient path for the sequence of "lasso" coefficients (Tibshirani, 1996) is piecewise linear. The lasso solves the following regularized linear regression problem,

$$\min_{\beta_0, \beta} \sum_{i=1}^n (y_i - \beta_0 - x_i^T\beta)^2 + \lambda|\beta|, \qquad (49)$$

where $|\beta| = \sum_{j=1}^p |\beta_j|$ is the $L_1$ norm of the coefficient vector. This $L_1$ constraint delivers a sparse solution vector $\beta_\lambda$; the larger $\lambda$, the more elements of $\beta_\lambda$ are zero, the remainder shrunk toward zero. In fact, any model with an $L_1$ constraint and a quadratic, piecewise quadratic, piecewise linear, or mixed quadratic and linear loss function, will have piecewise linear coefficient paths, which can be calculated exactly and efficiently for all values of $\lambda$ (Rosset and Zhu, 2003). These models include, among others:

- A robust version of the lasso, using a "Huberized" loss function.
- The $L_1$ constrained support vector machine (Zhu et al., 2003).

The SVM model has a quadratic constraint and a piecewise linear ("hinge") loss function. This leads to a piecewise linear path in the dual space, hence the Lagrange coefficients $\alpha_i$ are piecewise linear. Other models that would share this property include:

- The $\varepsilon$-insensitive SVM regression model.
- Quadratically regularized $L_1$ regression, including flexible models based on kernels or smoothing splines.

Of course, quadratic criterion + quadratic constraints also lead to exact path solutions, as in the classic case of ridge regression, since a closed form solution is obtained via the SVD. However, these paths are nonlinear in the regularization parameter.

For general non-quadratic loss functions and $L_1$ constraints, the solution paths are typically piecewise non-linear. Logistic regression is a leading example. In this case, approximate path-following algorithms are possible (Rosset, 2005).

The general techniques employed in this paper are known as parametric programming via active sets in the convex optimization literature (Allgower and Georg, 1992). The closest we have seen to our work in the literature employs similar techniques in incremental learning for SVMs (Fine and Scheinberg, 2002; Cauwenberghs and Poggio, 2001; DeCoste and Wagstaff, 2000). These authors do not, however, construct exact paths as we do, but rather focus on updating and downdating the solutions as more (or less) data arises. Diehl and Cauwenberghs (2003) allow for updating the parameters as well, but again do not construct entire solution paths. The work of Pontil and Verri (1998) recently came to our notice; they also observed that the Lagrange multipliers for the margin vectors change in a piecewise linear fashion, while the others remain constant.

The SvmPath has been implemented in the R computing environment (contributed library svmpath at CRAN), and is available from the first author's website.

Acknowledgments

The authors thank Jerome Friedman for helpful discussions, and Mee-Young Park for assisting with some of the computations. They also thank two referees and the associate editor for helpful comments. Trevor Hastie was partially supported by grant DMS-0204162 from the National Science Foundation, and grant RO1-EB0011988-08 from the National Institutes of Health. Tibshirani was partially supported by grant DMS-9971405 from the National Science Foundation and grant RO1-EB0011988-08 from the National Institutes of Health.

References

Eugene Allgower and Kurt Georg. Continuation and path following. Acta Numerica, pages 1-64, 1992.

Francis Bach and Michael Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48, 2002.
Gert Cauwenberghs and Tomaso Poggio. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems (NIPS 2000), volume 13. MIT Press, Cambridge, MA, 2001.

Dennis DeCoste and Kiri Wagstaff. Alpha seeding for support vector machines. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 345-349. ACM Press, 2000.

Christopher Diehl and Gert Cauwenberghs. SVM incremental learning, adaptation and optimization. In Proceedings of the 2003 International Joint Conference on Neural Networks, pages 2685-2690, 2003. Special series on Incremental Learning.

Brad Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Technical report, Stanford University, 2002.

Brad Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 2004. (to appear, with discussion).

Theodorus Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1-50, 2000.

Shai Fine and Katya Scheinberg. Incas: An incremental active set method for SVM. Technical report, IBM Research Labs, Haifa, 2002.

Trevor Hastie and Robert Tibshirani. Generalized Additive Models. Chapman and Hall, 1990.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning; Data Mining, Inference and Prediction. Springer Verlag, New York, 2001.

Trevor Hastie and Rob Tibshirani. Efficient quadratic regularization for expression arrays. Technical report, Stanford University, 2003.

Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, 2003. http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Thorsten Joachims. Practical Advances in Kernel Methods - Support Vector Learning, chapter Making large scale SVM learning practical. MIT Press, 1999. See http://svmlight.joachims.org.

Steve Marron. An overview of support vector machines and kernel methods. Talk, 2003. Available from the author's website: http://www.stat.unc.edu/postscript/papers/marron/Talks/

Massimiliano Pontil and Alessandro Verri. Properties of support vector machines. Neural Computation, 10(4):955-974, 1998.

S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. Mesirov, T. Poggio, W. Gerald, M. Loda, E. Lander, and T. Golub. Multiclass cancer diagnosis using tumor gene expression signature. PNAS, 98:15149-15154, 2001.
B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.

Saharon Rosset. Tracking curved regularized optimization solution paths. In Advances in Neural Information Processing Systems (NIPS 2004), volume 17. MIT Press, Cambridge, MA, 2005. To appear.

Saharon Rosset and Ji Zhu. Piecewise linear regularized solution paths. Technical report, Stanford University, 2003. http://www-stat.stanford.edu/~saharon/papers/piecewise.ps

Saharon Rosset, Ji Zhu, and Trevor Hastie. Margin maximizing loss functions. In Advances in Neural Information Processing Systems (NIPS 2003), volume 16. MIT Press, Cambridge, MA, 2004.

Bernard Schölkopf and Alex Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). MIT Press, 2001.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58:267-288, 1996.

Vladimir Vapnik. The Nature of Statistical Learning. Springer-Verlag, 1996.

G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, 1990.

G. Wahba, Y. Lin, and H. Zhang. GACV for support vector machines. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 297-311, Cambridge, MA, 2000. MIT Press.

J. Weston and C. Watkins. Multi-class support vector machines, 1998. URL citeseer.nj.nec.com/8884.html.

Christopher K. I. Williams and Matthias Seeger. The effect of the input density distribution on kernel-based classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 1159-1166. Morgan Kaufmann Publishers Inc., 2000.

Ji Zhu and Trevor Hastie. Classification of gene microarrays by penalized logistic regression. Biostatistics, 2004. (to appear).

Ji Zhu, Saharon Rosset, Trevor Hastie, and Robert Tibshirani. L1 norm support vector machines. Technical report, Stanford University, 2003.