Scalable Kernel Methods via Doubly Stochastic Gradients

Bo Dai^1, Bo Xie^1, Niao He^1, Yingyu Liang^2, Anant Raj^1, Maria-Florina Balcan^3, Le Song^1
^1 Georgia Institute of Technology, ^2 Princeton University, ^3 Carnegie Mellon University
{bodai, bxie33, nhe6, araj34}@gatech.edu, lsong@cc.gatech.edu, yingyul@cs.princeton.edu, ninamf@cs.cmu.edu

Abstract

The general perception is that kernel methods are not scalable, so neural nets become the choice for large-scale nonlinear learning problems. Have we tried hard enough for kernel methods? In this paper, we propose an approach that scales up kernel methods using a novel concept called "doubly stochastic functional gradients". Based on the fact that many kernel methods can be expressed as convex optimization problems, our approach solves the optimization problems by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random features associated with the kernel, and performing descent steps with this noisy functional gradient. Our algorithm is simple, does not need to commit to a preset number of random features, and allows the flexibility of the function class to grow as we see more incoming data in the streaming setting. We demonstrate that a function learned by this procedure after $t$ iterations converges to the optimal function in the reproducing kernel Hilbert space at rate $O(1/t)$, and achieves a generalization bound of $O(1/\sqrt{t})$. Our approach can readily scale kernel methods up to the regimes which are dominated by neural nets. We show competitive performance of our approach as compared to neural nets on datasets such as 2.3 million energy materials from MolecularSpace, 8 million handwritten digits from MNIST, and 1 million photos from ImageNet using convolution features.

1 Introduction

The general perception is that kernel methods are not scalable. When it comes to large-scale nonlinear learning problems, the methods of choice so far are neural nets, although theoretical understanding remains incomplete. Are kernel methods really not scalable? Or is it simply because we have not tried hard enough, while neural nets have exploited sophisticated design of feature architectures, virtual example generation for dealing with invariance, stochastic gradient descent for efficient training, and GPUs for further speedup?

A bottleneck in scaling up kernel methods comes from the storage and computation cost of the dense kernel matrix, $K$. Storing the matrix requires $O(n^2)$ space, and computing it takes $O(n^2 d)$ operations, where $n$ is the number of data points and $d$ is the dimension. There have been many great attempts to scale up kernel methods, including efforts from the perspectives of numerical linear algebra, functional analysis, and numerical optimization.

A common numerical linear algebra approach is to approximate the kernel matrix using low-rank factorizations, $K \approx A^\top A$, with $A \in \mathbb{R}^{r \times n}$ and rank $r \leq n$. This low-rank approximation allows subsequent kernel algorithms to directly operate on $A$, but computing the approximation requires $O(nr^2 + nrd)$ operations. Many works followed this strategy, including greedy basis selection techniques [1], Nyström approximation [2] and incomplete Cholesky decomposition [3]. In practice, one observes that kernel methods with approximated kernel matrices often incur a few percentage points of loss in performance. In fact, without further assumptions on the regularity of the kernel matrix, the generalization ability after using a low-rank approximation is typically of order $O(1/\sqrt{r} + 1/\sqrt{n})$ [4, 5], which implies that the rank needs to be nearly linear in the number of data points!
Thus, in order for kernel methods to achieve the best generalization ability, low-rank approximation based approaches immediately become impractical for big datasets because of their $O(n^3 + n^2 d)$ preprocessing time and $O(n^2)$ storage.

Random feature approximation is another popular approach for scaling up kernel methods [6, 7]. The method directly approximates the kernel function, instead of the kernel matrix, using explicit feature maps. The advantage of this approach is that the random feature matrix for $n$ data points can be computed in time $O(nrd)$ using $O(nr)$ storage, where $r$ is the number of random features. Subsequent algorithms then only need to operate on an $O(nr)$ matrix. Similar to the low-rank kernel matrix approximation approach, the generalization ability of this approach is of order $O(1/\sqrt{r} + 1/\sqrt{n})$ [8, 9], which implies that the number of random features also needs to be $O(n)$. Another common drawback of these two approaches is that adapting the solution from a small $r$ to a larger $r'$ is not easy if one wants to increase the rank of the approximated kernel matrix or the number of random features for better generalization ability. Special procedures need to be designed to reuse the solution obtained from a small $r$, which is not straightforward.

Another approach that addresses the scalability issue arises from the optimization perspective. One general strategy is to solve the dual forms of kernel methods using block coordinate descent (e.g., [10, 11, 12]). Each iteration of this algorithm only incurs $O(nrd)$ computation and $O(nr)$ storage, where $r$ is the block size. A second strategy is to perform functional gradient descent based on a batch of data points at each epoch (e.g., [13, 14]). Thus, the computation and storage required in each iteration are also $O(nrd)$ and $O(nr)$, respectively, where $r$ is the batch size. These approaches can straightforwardly adapt to a different $r$ without restarting the optimization procedure and exhibit no generalization loss, since they do not approximate the kernel matrix or function. However, a serious drawback of these approaches is that, without further approximation, all support vectors need to be stored for testing, which can be as big as the entire training set (e.g., kernel ridge regression and non-separable nonlinear classification problems).

In summary, there exists a delicate trade-off between computation, storage and statistics when scaling up kernel methods. Inspired by various previous efforts, we propose a simple yet general strategy that scales up many kernel methods using a novel concept called "doubly stochastic functional gradients". Our method relies on the fact that most kernel methods can be expressed as convex optimization problems over functions in reproducing kernel Hilbert spaces (RKHS) and solved via functional gradient descent. Our algorithm proceeds by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random functions associated with the kernel, and then descending using this noisy functional gradient. The key intuitions behind our algorithm originate from (i) the property of stochastic gradient descent that as long as the stochastic gradient is unbiased, the convergence of the algorithm is guaranteed [15]; and (ii) the property of pseudo-random number generators that the random samples can in fact be completely determined by an initial value (a seed). We exploit these properties and enable kernel methods to achieve better balances between computation, storage, and statistics. Our method interestingly integrates kernel methods, functional analysis, stochastic optimization, and algorithmic tricks, and it possesses a number of desiderata:

Generality and simplicity. Our approach applies to many kernel methods, such as kernel versions of ridge regression, support vector machines, logistic regression and the two-sample test, as well as many different types of kernels, such as shift-invariant, polynomial, and general inner product kernels. The algorithm can be summarized in just a few lines of code (Algorithms 1 and 2). For a different problem and kernel, we just need to replace the loss function and the random feature generator.
Flexibility. While previous approaches based on random features typically require a prefixed number of features, our approach allows the number of random features, and hence the flexibility of the function class, to grow with the number of data points. Therefore, unlike the previous random feature approach, our approach applies to the data streaming setting and achieves the full potential of nonparametric methods.

Efficient computation. The key computation of our method comes from evaluating the doubly stochastic functional gradient, which involves the generation of the random features given specific seeds and also the evaluation of these features on a small batch of data points. At iteration $t$, the computational complexity is $O(td)$.

Small memory. While most approaches require saving all the support vectors, our algorithm avoids keeping the support vectors, since it only requires a small program to regenerate the random features and to sample historical features according to specific random seeds. At iteration $t$, the memory needed is $O(t)$, independent of the dimension of the data.

Theoretical guarantees. We provide a novel and nontrivial analysis involving Hilbert space martingales and a newly proved recurrence relation, and demonstrate that the estimator produced by our algorithm, which might be outside of the RKHS, converges to the optimal RKHS function. More specifically, both in expectation and with high probability, our algorithm estimates the optimal function in the RKHS at the rate $O(1/t)$ and achieves a generalization bound of $O(1/\sqrt{t})$, which are indeed optimal [15]. The variance of the random features, introduced in our second approximation to the functional gradient, only contributes additively to the constant in the convergence rate. These results are the first of their kind in the literature, and could be of independent interest.

Strong empirical performance. Our algorithm can readily scale kernel methods up to the regimes which were previously dominated by neural nets. We show that our method compares favorably to other scalable kernel methods on medium scale datasets, and to neural nets on big datasets with millions of data points.

In the remainder, we first introduce preliminaries on kernel methods and functional gradients. We then describe our algorithm and provide both theoretical and empirical support.

2 Duality between Kernels and Random Processes

Kernel methods owe their name to the use of kernel functions, $k(x, x'): \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$, which are symmetric positive definite (PD), meaning that for all $n \geq 1$, $x_1, \ldots, x_n \in \mathcal{X}$, and $c_1, \ldots, c_n \in \mathbb{R}$, we have $\sum_{i,j=1}^{n} c_i c_j k(x_i, x_j) \geq 0$. There is an intriguing duality between kernels and stochastic processes which will play a crucial role in our algorithm design later. More specifically,

Theorem 1 (e.g., Devinatz [16]; Hein & Bousquet [17]) If $k(x, x')$ is a PD kernel, then there exist a set $\Omega$, a measure $\mathbb{P}$ on $\Omega$, and a random function $\phi_\omega(x): \mathcal{X} \mapsto \mathbb{R}$ from $L_2(\Omega, \mathbb{P})$, such that $k(x, x') = \int_\Omega \phi_\omega(x)\,\phi_\omega(x')\, d\mathbb{P}(\omega)$.

Essentially, the above integral representation relates the kernel function to a random process $\omega$ with measure $\mathbb{P}(\omega)$. Note that the integral representation may not be unique. For instance, the random process can be a Gaussian process on $\mathcal{X}$ with sample function $\phi_\omega(x)$, and $k(x, x')$ is simply the covariance function between two points $x$ and $x'$. If the kernel is also continuous and shift invariant, i.e., $k(x, x') = k(x - x')$ for $x \in \mathbb{R}^d$, then the integral representation specializes into a form characterized by the inverse Fourier transformation (e.g., [18, Theorem 6.6]),

Theorem 2 (Bochner) A continuous, real-valued, symmetric and shift-invariant function $k(x - x')$ on $\mathbb{R}^d$ is a PD kernel if and only if there is a finite non-negative measure $\mathbb{P}(\omega)$ on $\mathbb{R}^d$, such that
$$k(x - x') = \int_{\mathbb{R}^d} e^{i\omega^\top (x - x')}\, d\mathbb{P}(\omega) = \int_{\mathbb{R}^d \times [0, 2\pi]} 2\cos(\omega^\top x + b)\cos(\omega^\top x' + b)\, d(\mathbb{P}(\omega) \times \mathbb{P}(b)),$$
where $\mathbb{P}(b)$ is the uniform distribution on $[0, 2\pi]$, and $\phi_\omega(x) = \sqrt{2}\cos(\omega^\top x + b)$.

For the Gaussian RBF kernel, $k(x - x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$, this yields a Gaussian distribution $\mathbb{P}(\omega)$ with density proportional to $\exp(-\sigma^2 \|\omega\|^2 / 2)$; for the Laplace kernel, this yields a Cauchy distribution; and for the Matérn kernel, this yields convolutions of the unit ball [19]. Similar representations where the explicit forms of $\phi_\omega(x)$ and $\mathbb{P}(\omega)$ are known can also be derived for rotation invariant kernels, $k(x, x') = k(\langle x, x'\rangle)$, using Fourier transformation on the sphere [19]. For polynomial kernels, $k(x, x') = (\langle x, x'\rangle + c)^p$, a random tensor sketching approach can also be used [20]. Instead of finding the random process $\mathbb{P}(\omega)$ and functions $\phi_\omega(x)$ given a kernel, one can go in the reverse direction and construct kernels from random processes and functions (e.g., Wendland [18]).

Theorem 3 If $k(x, x') = \int_\Omega \phi_\omega(x)\phi_\omega(x')\, d\mathbb{P}(\omega)$ for a nonnegative measure $\mathbb{P}(\omega)$ on $\Omega$ and $\phi_\omega(x): \mathcal{X} \mapsto \mathbb{R}$ from $L_2(\Omega, \mathbb{P})$, then $k(x, x')$ is a PD kernel.

For instance, $\phi_\omega(x) := \cos(\omega^\top \psi_\theta(x) + b)$, where $\psi_\theta(x)$ can be a random convolution of the input $x$ parametrized by $\theta$. Another important concept is the reproducing kernel Hilbert space (RKHS). An RKHS $\mathcal{H}$ on $\mathcal{X}$ is a Hilbert space of functions from $\mathcal{X}$ to $\mathbb{R}$. $\mathcal{H}$ is an RKHS if and only if there exists a $k(x, x'): \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ such that $\forall x \in \mathcal{X},\ k(x, \cdot) \in \mathcal{H}$, and $\forall f \in \mathcal{H},\ \langle f(\cdot), k(x, \cdot)\rangle_{\mathcal{H}} = f(x)$. If such a $k(x, x')$ exists, it is unique and it is a PD kernel. A function $f \in \mathcal{H}$ if and only if $\|f\|_{\mathcal{H}}^2 := \langle f, f\rangle_{\mathcal{H}} < \infty$, and its $L_2$ norm is dominated by the RKHS norm, $\|f\|_{L_2} \leq \|f\|_{\mathcal{H}}$.
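To make the integral representation concrete, the following sketch (our illustration, not part of the original paper) draws seeded random Fourier features for the Gaussian RBF kernel and checks Theorems 1 and 2 by Monte Carlo; the helper names sample_rbf_feature and phi are ours.

```python
import numpy as np

def sample_rbf_feature(dim, sigma, seed):
    """Sample (omega, b) for the Gaussian RBF kernel from a fixed seed (Theorem 2)."""
    rng = np.random.RandomState(seed)
    omega = rng.randn(dim) / sigma           # spectral density N(0, I / sigma^2)
    b = rng.uniform(0.0, 2.0 * np.pi)        # phase, uniform on [0, 2*pi]
    return omega, b

def phi(x, omega, b):
    """Random feature phi_omega(x) = sqrt(2) * cos(omega^T x + b)."""
    return np.sqrt(2.0) * np.cos(x @ omega + b)

# Monte Carlo check of k(x, x') = E_omega[ phi_omega(x) * phi_omega(x') ]
sigma, dim, n_features = 1.0, 5, 20000
x, x_prime = np.random.randn(dim), np.random.randn(dim)
estimate = np.mean([phi(x, *sample_rbf_feature(dim, sigma, s)) *
                    phi(x_prime, *sample_rbf_feature(dim, sigma, s))
                    for s in range(n_features)])
exact = np.exp(-np.linalg.norm(x - x_prime) ** 2 / (2 * sigma ** 2))
print(estimate, exact)   # the two values agree up to Monte Carlo error
```

Note that each feature is regenerated from its seed rather than stored; this seeded-regeneration idea is exactly what the algorithm in Section 4 exploits to keep memory small.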
3 Doubly Stochastic Functional Gradients

Many kernel methods can be written as convex optimization problems over functions in the RKHS and solved using functional gradient methods [13, 14]. Inspired by these previous works, we introduce a novel concept called "doubly stochastic functional gradients" to address the scalability issue. Let $l(u, y)$ be a scalar loss function convex in $u \in \mathbb{R}$, and let the subgradient of $l(u, y)$ with respect to $u$ be $l'(u, y)$. Given a PD kernel $k(x, x')$ and the associated RKHS $\mathcal{H}$, many kernel methods try to find a function $f_* \in \mathcal{H}$ which solves the optimization problem
$$\operatorname*{argmin}_{f \in \mathcal{H}}\ R(f) := \mathbb{E}_{(x,y)}[l(f(x), y)] + \frac{\nu}{2}\|f\|_{\mathcal{H}}^2 \quad\Longleftrightarrow\quad \operatorname*{argmin}_{\|f\|_{\mathcal{H}} \leq B(\nu)}\ \mathbb{E}_{(x,y)}[l(f(x), y)], \qquad (1)$$
where $\nu > 0$ is a regularization parameter, $B(\nu)$ is a non-increasing function of $\nu$, and the data $(x, y)$ follow a distribution $\mathbb{P}(x, y)$. The functional gradient $\nabla R(f)$ is defined as the linear term in the change of the objective after we perturb $f$ by $\epsilon$ in the direction of $g$, i.e.,
$$R(f + \epsilon g) = R(f) + \epsilon\,\langle \nabla R(f), g\rangle_{\mathcal{H}} + O(\epsilon^2). \qquad (2)$$
For instance, applying the above definition, we have $\nabla f(x) = \nabla\langle f, k(x, \cdot)\rangle_{\mathcal{H}} = k(x, \cdot)$, and $\nabla\|f\|_{\mathcal{H}}^2 = \nabla\langle f, f\rangle_{\mathcal{H}} = 2f$.

Stochastic functional gradient. Given a data point $(x, y) \sim \mathbb{P}(x, y)$ and $f \in \mathcal{H}$, the stochastic functional gradient of $\mathbb{E}_{(x,y)}[l(f(x), y)]$ with respect to $f \in \mathcal{H}$ is
$$\xi(\cdot) := l'(f(x), y)\, k(x, \cdot), \qquad (3)$$
which is essentially a single-data-point approximation to the true functional gradient. Furthermore, for any $g \in \mathcal{H}$, we have $\langle \xi(\cdot), g\rangle_{\mathcal{H}} = l'(f(x), y)\, g(x)$. Inspired by the duality between kernel functions and random processes, we can make an additional approximation to the stochastic functional gradient using a random function $\phi_\omega(x)$ sampled according to $\mathbb{P}(\omega)$. More specifically,

Doubly stochastic functional gradient. Let $\omega \sim \mathbb{P}(\omega)$. Then the doubly stochastic gradient of $\mathbb{E}_{(x,y)}[l(f(x), y)]$ with respect to $f \in \mathcal{H}$ is
$$\zeta(\cdot) := l'(f(x), y)\, \phi_\omega(x)\, \phi_\omega(\cdot). \qquad (4)$$
Note that the stochastic functional gradient $\xi(\cdot)$ is in the RKHS $\mathcal{H}$, but $\zeta(\cdot)$ may be outside $\mathcal{H}$, since $\phi_\omega(\cdot)$ may be outside the RKHS. For instance, for the Gaussian RBF kernel, the random function $\phi_\omega(x) = \sqrt{2}\cos(\omega^\top x + b)$ is outside the RKHS associated with the kernel. However, these functional gradients are related by $\xi(\cdot) = \mathbb{E}_\omega[\zeta(\cdot)]$ (indeed, $\mathbb{E}_\omega[\zeta(\cdot)] = l'(f(x), y)\,\mathbb{E}_\omega[\phi_\omega(x)\phi_\omega(\cdot)] = l'(f(x), y)\,k(x, \cdot) = \xi(\cdot)$ by Theorem 1), which leads to unbiased estimators of the original functional gradient, i.e.,
$$\nabla R(f) = \mathbb{E}_{(x,y)}[\xi(\cdot)] + \nu f(\cdot), \quad\text{and}\quad \nabla R(f) = \mathbb{E}_{(x,y)}\mathbb{E}_{\omega}[\zeta(\cdot)] + \nu f(\cdot). \qquad (5)$$
We emphasize that the source of randomness associated with the random function is not present in the data, but is artificially introduced by us. This is crucial for the development of our scalable algorithm in the next section. Meanwhile, it also creates additional challenges in the analysis of the algorithm, which we deal with carefully.
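As a minimal illustration of (4) (our own sketch, assuming the squared loss $l(u, y) = \frac{1}{2}(u - y)^2$ with derivative $l'(u, y) = u - y$, and reusing the hypothetical sample_rbf_feature and phi helpers from the sketch in Section 2), a single doubly stochastic gradient is fully described by one scalar weight together with the seed that regenerates its random feature:

```python
# One doubly stochastic gradient zeta(.) = l'(f(x), y) * phi_w(x) * phi_w(.)  -- eq. (4) --
# for the squared loss. Keeping (weight, seed) suffices to evaluate zeta anywhere later,
# since (omega, b) can be regenerated from the seed.

def doubly_stochastic_gradient(f_x, x, y, seed, sigma):
    """Return the scalar weight of zeta(.) = weight * phi_{omega(seed)}(.)."""
    omega, b = sample_rbf_feature(len(x), sigma, seed)
    return (f_x - y) * phi(x, omega, b)              # l'(f(x), y) * phi_w(x)

def evaluate_zeta(weight, seed, x_new, sigma):
    """Evaluate zeta(x_new) = weight * phi_{omega(seed)}(x_new)."""
    omega, b = sample_rbf_feature(len(x_new), sigma, seed)
    return weight * phi(x_new, omega, b)
```

Averaging evaluate_zeta over many independently seeded draws of $\omega$ recovers the stochastic functional gradient $\xi(x_{\text{new}}) = l'(f(x), y)\,k(x, x_{\text{new}})$, which is the unbiasedness property used in (5).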
4 Doubly Stochastic Kernel Machines

Algorithm 1: $\{\alpha_i\}_{i=1}^{t}$ = Train($\mathbb{P}(x, y)$)
Require: $\mathbb{P}(\omega)$, $\phi_\omega(x)$, $l(f(x), y)$, $\nu$.
1: for $i = 1, \ldots, t$ do
2:   Sample $(x_i, y_i) \sim \mathbb{P}(x, y)$.
3:   Sample $\omega_i \sim \mathbb{P}(\omega)$ with seed $i$.
4:   $f(x_i)$ = Predict($x_i$, $\{\alpha_j\}_{j=1}^{i-1}$).
5:   $\alpha_i = -\gamma_i\, l'(f(x_i), y_i)\, \phi_{\omega_i}(x_i)$.
6:   $\alpha_j = (1 - \gamma_i \nu)\, \alpha_j$ for $j = 1, \ldots, i - 1$.
7: end for

Algorithm 2: $f(x)$ = Predict($x$, $\{\alpha_i\}_{i=1}^{t}$)
Require: $\mathbb{P}(\omega)$, $\phi_\omega(x)$.
1: Set $f(x) = 0$.
2: for $i = 1, \ldots, t$ do
3:   Sample $\omega_i \sim \mathbb{P}(\omega)$ with seed $i$.
4:   $f(x) = f(x) + \alpha_i\, \phi_{\omega_i}(x)$.
5: end for

The first key intuition behind our algorithm originates from the property of stochastic gradient descent that as long as the stochastic gradient is bounded and unbiased, the convergence of the algorithm is guaranteed [15]. In our algorithm, we exploit this property and introduce two sources of randomness, one from data and another artificial, to scale up kernel methods.

The second key intuition behind our algorithm is that the random functions used in the doubly stochastic functional gradients are sampled according to pseudo-random number generators, where the sequence of apparently random samples can in fact be completely determined by an initial value (a seed). Although these random samples are not "true" random samples in the purest sense of the word, they suffice for our task in practice.

To be more specific, our algorithm proceeds by making two stochastic approximations to the functional gradient in each iteration, and then descending using this noisy functional gradient. The overall algorithms for training and prediction are summarized in Algorithms 1 and 2. The training algorithm essentially just performs sampling of random functions and evaluations of doubly stochastic gradients, and maintains a collection of real numbers $\{\alpha_i\}$, which is computationally efficient and memory friendly. A crucial step in the algorithm is to sample the random functions with "seed $i$". The seeds have to be aligned between training and prediction, and with the corresponding $\alpha_i$ obtained from each iteration. The learning rate $\gamma_t$ in the algorithm needs to be chosen as $O(1/t)$, as shown by our later analysis, to achieve the best rate of convergence. For now, we assume that we have access to the data generating distribution $\mathbb{P}(x, y)$. This can be modified to sample uniformly at random from a fixed dataset, without affecting the algorithm or the later convergence analysis. Let the sampled data and random function parameters be $D_t := \{(x_i, y_i)\}_{i=1}^{t}$ and $\omega_t := \{\omega_i\}_{i=1}^{t}$, respectively, after $t$ iterations. The function obtained by Algorithm 1 is a simple additive form of the doubly stochastic functional gradients
$$f_{t+1}(\cdot) = f_t(\cdot) - \gamma_t\left(\zeta_t(\cdot) + \nu f_t(\cdot)\right) = \sum_{i=1}^{t} a_i^t\, \zeta_i(\cdot), \quad \forall t > 1, \quad\text{and}\quad f_1(\cdot) = 0, \qquad (6)$$
where $a_i^t = -\gamma_i \prod_{j=i+1}^{t}(1 - \gamma_j \nu)$ are deterministic values depending on the step sizes $\gamma_j$ ($i \leq j \leq t$) and the regularization parameter $\nu$. This simple form makes it easy for us to analyze its convergence. We note that our algorithm can also take a mini-batch of points and random functions at each step, and estimate an empirical covariance for preconditioning, to achieve potentially better performance.
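Below is a minimal runnable sketch of Algorithms 1 and 2 (our own illustration), specialized to the squared loss and Gaussian RBF random features, with step size $\gamma_i = \theta/i$; the class name DoublySGD and all parameter defaults are our choices, not the paper's. Only the scalar coefficients $\alpha_i$ are stored, and every random feature is regenerated from its seed, both during training and at prediction time.

```python
import numpy as np

class DoublySGD:
    """Sketch of Algorithms 1 and 2 for the squared loss and the Gaussian RBF kernel."""

    def __init__(self, sigma=1.0, nu=1e-4, theta=1.0):
        self.sigma, self.nu, self.theta = sigma, nu, theta
        self.alpha = []                                   # one coefficient per iteration

    def _feature(self, X, seed):
        """phi_omega(X) = sqrt(2) cos(omega^T x + b), with (omega, b) regenerated from seed."""
        rng = np.random.RandomState(seed)
        omega = rng.randn(X.shape[-1]) / self.sigma
        b = rng.uniform(0.0, 2.0 * np.pi)
        return np.sqrt(2.0) * np.cos(X @ omega + b)

    def predict(self, X):
        """Algorithm 2: f(x) = sum_i alpha_i * phi_{omega_i}(x)."""
        f = np.zeros(X.shape[0] if X.ndim > 1 else 1)
        for i, a in enumerate(self.alpha):
            f += a * self._feature(X, seed=i)
        return f

    def train(self, X, y, iterations):
        """Algorithm 1 with gamma_i = theta / i and squared loss, so l'(u, y) = u - y."""
        n = X.shape[0]
        data_rng = np.random.RandomState(0)
        for i in range(1, iterations + 1):
            j = data_rng.randint(n)                       # sample (x_i, y_i)
            gamma = self.theta / i
            f_xi = self.predict(X[j:j + 1])[0]            # uses previously stored seeds
            # shrink old coefficients: alpha_j <- (1 - gamma * nu) * alpha_j
            self.alpha = [(1.0 - gamma * self.nu) * a for a in self.alpha]
            # new coefficient: alpha_i = -gamma * l'(f(x_i), y_i) * phi_{omega_i}(x_i)
            self.alpha.append(-gamma * (f_xi - y[j]) * self._feature(X[j], seed=i - 1))
        return self

# usage sketch: fit y = sin(x) from noisy samples
X = np.random.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(500)
model = DoublySGD(sigma=0.5, nu=1e-4, theta=1.0).train(X, y, iterations=500)
print(np.mean((model.predict(X) - y) ** 2))               # training mean squared error
```

Calling predict inside the training loop costs $O(id)$ at iteration $i$, matching the per-iteration complexity stated in Section 1; mini-batching and caching of feature values are the natural optimizations for a practical implementation.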
5 Theoretical Guarantees

In this section, we show that, both in expectation and with high probability, our algorithm can estimate the optimal function in the RKHS at rate $O(1/t)$ and achieve a generalization bound of $O(1/\sqrt{t})$. The analysis for our algorithm has a new twist compared to previous analyses of stochastic gradient descent algorithms, since the random function approximation results in an estimator which is outside the RKHS. Besides the analysis for stochastic functional gradient descent, we need to use martingales and the corresponding concentration inequalities to prove that the sequence of estimators, $f_{t+1}$, outside the RKHS converges to the optimal function, $f_*$, in the RKHS. We make the following standard assumptions for later reference:

A. There exists an optimal solution, denoted $f_*$, to the problem of our interest (1).
B. The loss function $l(u, y): \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ and its first-order derivative are $L$-Lipschitz continuous in terms of the first argument.
C. For any data $\{(x_i, y_i)\}_{i=1}^{t}$ and any trajectory $\{f_i(\cdot)\}_{i=1}^{t}$, there exists $M > 0$ such that $|l'(f_i(x_i), y_i)| \leq M$. Note that in our situation $M$ exists and $M < \infty$, since we assume a bounded domain and the functions $f_t$ we generate are always bounded as well.
D. There exist $\kappa > 0$ and $\phi > 0$ such that $k(x, x') \leq \kappa$ and $|\phi_\omega(x)\phi_\omega(x')| \leq \phi$ for all $x, x' \in \mathcal{X}$ and $\omega \in \Omega$. For example, when $k(\cdot, \cdot)$ is the Gaussian RBF kernel, we have $\kappa = 1$, $\phi = 2$.

We now present our main theorems. Due to space restrictions, we only provide a short sketch of the proofs here. The full proofs for these theorems are given in the appendix.

Theorem 4 (Convergence in expectation) When $\gamma_t = \frac{\theta}{t}$ with $\theta > 0$ such that $\theta\nu \in (1, 2) \cup \mathbb{Z}_+$,
$$\mathbb{E}_{D_t, \omega_t}\left[|f_{t+1}(x) - f_*(x)|^2\right] \leq \frac{2C^2 + 2\kappa Q_1^2}{t}, \quad\text{for any } x \in \mathcal{X},$$
where $Q_1 = \max\left\{\|f_*\|_{\mathcal{H}},\ \frac{Q_0 + \sqrt{Q_0^2 + (2\theta\nu - 1)(1 + \theta\nu)^2\theta^2\kappa M^2}}{2\theta\nu - 1}\right\}$, with $Q_0 = 2\sqrt{2}\kappa^{1/2}(\kappa + \phi)LM\theta^2$, and $C^2 = 4(\kappa + \phi)^2 M^2\theta^2$.

Theorem 5 (Convergence with high probability) When $\gamma_t = \frac{\theta}{t}$ with $\theta > 0$ such that $\theta\nu \in \mathbb{Z}_+$, for any $x \in \mathcal{X}$, we have with probability at least $1 - 3\delta$ over $(D_t, \omega_t)$,
$$|f_{t+1}(x) - f_*(x)|^2 \leq \frac{C^2\ln(2/\delta)}{t} + \frac{2\kappa Q_2^2\ln(2t/\delta)\ln^2(t)}{t},$$
where $C$ is as above and $Q_2 = \max\left\{\|f_*\|_{\mathcal{H}},\ Q_0 + \sqrt{Q_0^2 + \kappa M^2(1 + \theta\nu)^2\theta^2(2 + 16\theta/\nu)}\right\}$, with $Q_0 = 4\sqrt{2}\kappa^{1/2}M\theta(8 + (\kappa + \phi)L\theta)$.

Proof sketch: We focus on the convergence in expectation; the high probability bound can be established in a similar fashion. The main technical difficulty is that $f_{t+1}$ may not be in the RKHS $\mathcal{H}$. The key of the proof is then to construct an intermediate function $h_{t+1}$, such that the difference between $f_{t+1}$ and $h_{t+1}$ and the difference between $h_{t+1}$ and $f_*$ can both be bounded. More specifically,
$$h_{t+1}(\cdot) = h_t(\cdot) - \gamma_t\left(\xi_t(\cdot) + \nu h_t(\cdot)\right) = \sum_{i=1}^{t} a_i^t\, \xi_i(\cdot), \quad \forall t > 1, \quad\text{and}\quad h_1(\cdot) = 0, \qquad (7)$$
where $\xi_t(\cdot) = \mathbb{E}_{\omega_t}[\zeta_t(\cdot)]$. Then for any $x$, the error can be decomposed into two terms
$$|f_{t+1}(x) - f_*(x)|^2 \leq \underbrace{2\,|f_{t+1}(x) - h_{t+1}(x)|^2}_{\text{error due to random functions}} + \underbrace{2\kappa\,\|h_{t+1} - f_*\|_{\mathcal{H}}^2}_{\text{error due to random data}}.$$
For the error term due to random functions, $h_{t+1}$ is constructed such that $f_{t+1} - h_{t+1}$ is a martingale, and the step sizes are chosen such that $|a_i^t| \leq \frac{\theta}{t}$, which allows us to bound the martingale. In other words, the choice of step sizes keeps $f_{t+1}$ close to the RKHS. For the error term due to random data, since $h_{t+1} \in \mathcal{H}$, we can now apply the standard arguments for stochastic approximation in the RKHS. Due to the additional randomness, the recursion is slightly more complicated,
$$e_{t+1} \leq \left(1 - \frac{2\theta\nu}{t}\right)e_t + \frac{\beta_1}{t}\cdot\frac{\sqrt{e_t}}{t} + \frac{\beta_2}{t^2},$$
where $e_{t+1} = \mathbb{E}_{D_t, \omega_t}\left[\|h_{t+1} - f_*\|_{\mathcal{H}}^2\right]$, and $\beta_1$ and $\beta_2$ depend on the related parameters. Solving this recursion then leads to a bound for the second error term.

Theorem 6 (Generalization bound) Let the true risk be $R_{\text{true}}(f) = \mathbb{E}_{(x,y)}[l(f(x), y)]$. Then with probability at least $1 - 3\delta$ over $(D_t, \omega_t)$, and with $C$ and $Q_2$ defined as previously,
$$R_{\text{true}}(f_{t+1}) - R_{\text{true}}(f_*) \leq \frac{\left(C\sqrt{\ln(8\sqrt{e}\,t/\delta)} + \sqrt{2\kappa}\,Q_2\sqrt{\ln(2t/\delta)}\,\ln(t)\right)L}{\sqrt{t}}.$$
Proof: By the Lipschitz continuity of $l(\cdot, y)$ and Jensen's inequality, we have
$$R_{\text{true}}(f_{t+1}) - R_{\text{true}}(f_*) \leq L\,\mathbb{E}_x|f_{t+1}(x) - f_*(x)| \leq L\sqrt{\mathbb{E}_x|f_{t+1}(x) - f_*(x)|^2} = L\,\|f_{t+1} - f_*\|_2.$$
Again, $\|f_{t+1} - f_*\|_2^2$ can be decomposed into two terms, $O(\|f_{t+1} - h_{t+1}\|_2^2)$ and $O(\|h_{t+1} - f_*\|_{\mathcal{H}}^2)$, which can be bounded similarly as in Theorem 5 (see Corollary 12 in the appendix).
Remarks. The overall rate of convergence in expectation, which is $O(1/t)$, is indeed optimal. Classical complexity theory (see, e.g., references in [15]) shows that to obtain an $\epsilon$-accurate solution, the number of iterations needed for stochastic approximation is $\Omega(1/\epsilon)$ for the strongly convex case and $\Omega(1/\epsilon^2)$ for the general convex case. Different from the classical setting of stochastic approximation, our case imposes not one but two sources of randomness/stochasticity in the gradient, which, intuitively speaking, might require a higher order number of iterations for the general convex case. However, our method is still able to achieve the same rate as in the classical setting. The rate of the generalization bound is also nearly optimal up to log factors. However, these bounds may be further refined with more sophisticated techniques and analysis. For example, mini-batching and preconditioning can be used to reduce the constant factors in the bounds significantly; this analysis is left for future study. Theorem 4 also yields bounds in the $L_\infty$ and $L_2$ sense, as in Section A.2 in the appendix. The choices of step sizes $\gamma_t$ and the tuning parameters given in these bounds are only sufficient conditions for simple analysis; other choices can also lead to bounds of the same order.

6 Computation, Storage and Statistics Trade-off

To investigate the computation, storage and statistics trade-off, we fix the desired $L_2$ error in the function estimation to $\epsilon$, i.e., $\|f - f_*\|_2^2 \leq \epsilon$, and work out the dependency of other quantities on $\epsilon$. These other quantities include the preprocessing time, the number of samples and random features (or rank), the number of iterations of each algorithm, and the computational cost and storage requirement for learning and prediction. We assume that the number of samples, $n$, needed to achieve the prescribed error $\epsilon$ is of order $O(1/\epsilon)$, the same for all methods. Furthermore, we make no other regularity assumption about margin properties or the kernel matrix, such as fast spectrum decay. Thus the required number of random features (or rank), $r$, will be of order $O(n) = O(1/\epsilon)$ [4, 5, 8, 9].

We pick a few representative algorithms for comparison, namely: (i) NORMA [13]: kernel methods trained with stochastic functional gradients; (ii) k-SDCA [12]: kernelized version of stochastic dual coordinate ascent; (iii) r-SDCA: first approximate the kernel function with random features, then run stochastic dual coordinate ascent; (iv) n-SDCA: first approximate the kernel matrix using Nyström's method, then run stochastic dual coordinate ascent; similarly, we combine the Pegasos algorithm [21] with random features and Nyström's method to obtain (v) r-Pegasos and (vi) n-Pegasos. The comparisons are summarized in the table below. From the table, one can see that our method, r-SDCA and r-Pegasos achieve the best dependency on the dimension $d$ of the data. However, often one is interested in increasing the number of random features as more data points are observed, to obtain better generalization ability. Then special procedures need to be designed for updating the r-SDCA and r-Pegasos solutions, which we are not clear how to implement easily and efficiently.

Algorithms          | Preprocessing computation | Total computation cost (training / prediction) | Total storage cost (training / prediction)
Doubly SGD          | O(1)           | O(d/ε²) / O(d/ε) | O(1/ε) / O(1/ε)
NORMA / k-SDCA      | O(1)           | O(d/ε²) / O(d/ε) | O(d/ε) / O(d/ε)
r-Pegasos / r-SDCA  | O(1)           | O(d/ε²) / O(d/ε) | O(1/ε) / O(1/ε)
n-Pegasos / n-SDCA  | O(1/ε³)        | O(d/ε²) / O(d/ε) | O(1/ε) / O(1/ε)

7 Experiments

We show that our method compares favorably to other kernel methods on medium scale datasets and to neural nets on large scale datasets. We examined both regression and classification problems with smooth and almost smooth loss functions. Below is a summary of the datasets used¹; a more detailed description of these datasets and of the experimental settings can be found in the appendix.
Name                       | Model      | # of samples | Input dim | Output range   | Virtual
(1) Adult                  | K-SVM      | 32K          | 123       | {−1, 1}        | no
(2) MNIST 8M 8 vs. 6 [25]  | K-SVM      | 1.6M         | 784       | {−1, 1}        | yes
(3) Forest                 | K-SVM      | 0.5M         | 54        | {−1, 1}        | no
(4) MNIST 8M [25]          | K-logistic | 8M           | 1568      | {0, …, 9}      | yes
(5) CIFAR 10 [26]          | K-logistic | 60K          | 2304      | {0, …, 9}      | yes
(6) ImageNet [27]          | K-logistic | 1.3M         | 9216      | {0, …, 999}    | yes
(7) QuantumMachine [28]    | K-ridge    | 6K           | 276       | [−800, −2000]  | yes
(8) MolecularSpace [28]    | K-ridge    | 2.3M         | 2850      | [0, 13]        | no

¹ A "yes" in the last column means that virtual examples are generated from the dataset for training. K-ridge stands for kernel ridge regression; K-SVM stands for kernel SVM; K-logistic stands for kernel logistic regression.

Experiment settings. For datasets (1)–(3), we compare the algorithms discussed in Section 6. For algorithms based on low-rank kernel matrix approximation and random features, i.e., Pegasos and SDCA, we set the rank and the number of random features to 2^8. We use the same batch size for both our algorithm and the competitors. We stop all algorithms when they have passed through the entire dataset once. This stopping criterion (SC1) is designed to justify our conjecture that the bottleneck in the performance of the vanilla methods with explicit features comes from the accuracy of the kernel approximation. To this end, we investigate the performance of these algorithms under different levels of random feature approximation but with the same number of training samples. To further investigate the computational efficiency of the proposed algorithm, we also conduct experiments where we stop all algorithms within the same time budget (SC2). Due to space limitations, the comparison on the synthetic regression dataset under SC1 and on (1)–(3) under SC2 are illustrated in Appendix B.2. We do not count the preprocessing time of Nyström's method for n-Pegasos and n-SDCA. The algorithms are executed on a machine with 16 AMD 2.4 GHz Opteron CPU cores and 200 GB memory; note that this allows NORMA and k-SDCA to keep all the data in memory. We report our numerical results in Figure 1 (1)–(8), with explanations below. For full details of our experimental setups, please refer to Section B.1 in the appendix.

Adult. The result is illustrated in Figure 1(1). NORMA and k-SDCA achieve the best error rate, 15%, while our algorithm achieves a comparable rate, 15.3%.
Figure 1: Experimental results for datasets (1)–(8): (1) Adult, (2) MNIST 8M 8 vs. 6, (3) Forest, (4) MNIST 8M, (5) CIFAR 10, (6) ImageNet, (7) QuantumMachine, (8) MolecularSpace.

MNIST 8M 8 vs. 6. The result is shown in Figure 1(2). Our algorithm achieves the best test error, 0.26%. Compared to the methods with the full kernel, the methods using random/Nyström features achieve better test errors, probably because of the underlying low-rank structure of the dataset.

Forest. The result is shown in Figure 1(3). Our algorithm achieves a test error of about 15%, much better than n/r-Pegasos and n/r-SDCA. Our method is preferable for this scenario, i.e., huge datasets with sophisticated decision boundaries, considering the trade-off between cost and accuracy.

MNIST 8M. The result is shown in Figure 1(4). Better than the 0.6% error provided by the fixed and jointly-trained neural nets, our method reaches an error of 0.5% very quickly.

CIFAR 10. The result is shown in Figure 1(5). We compare our algorithm to a neural net with two convolution layers (after contrast normalization and max-pooling layers) and two local layers, achieving 11% test error. The specification is at https://code.google.com/p/cuda-convnet/. Our method achieves comparable performance but much faster.

ImageNet. The result is shown in Figure 1(6). Our method achieves a test error of 44.5% by further max-voting over 10 transformations of the test set, while the jointly-trained neural net arrives at 42% (without variations in color and illumination), and the fixed neural net only achieves 46% with max-voting.

QuantumMachine / MolecularSpace. The results are shown in Figure 1(7) and (8). On dataset (7), our method achieves a Mean Absolute Error of 2.97 kcal/mole, outperforming neural nets at 3.51 kcal/mole, and close to the 1 kcal/mole required for chemical accuracy. Moreover, the comparison on dataset (8) is the first in the literature, and our method is still comparable with neural nets.

Acknowledgements. M.B. is supported in part by NSF CCF-0953192, CCF-1451177, CCF-1101283, and CCF-1422910, ONR N00014-09-1-0751, and AFOSR FA9550-09-1-0538. L.S. is supported in part by NSF IIS-1116886, NSF/NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, and a Raytheon Faculty Fellowship.
References

[1] A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In ICML, 2000.
[2] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, NIPS, 2000.
[3] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2001.
[4] P. Drineas and M. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR, 6:2153–2175, 2005.
[5] C. Cortes, M. Mohri, and A. Talwalkar. On the impact of kernel approximation on learning accuracy. In AISTATS, 2010.
[6] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2008.
[7] Q. V. Le, T. Sarlos, and A. J. Smola. Fastfood — computing Hilbert space expansions in loglinear time. In ICML, 2013.
[8] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, 2009.
[9] D. Lopez-Paz, S. Sra, A. Smola, Z. Ghahramani, and B. Schölkopf. Randomized nonlinear component analysis. In ICML, 2014.
[10] J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998.
[11] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184, Cambridge, MA, 1999. MIT Press.
[12] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
[13] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8), August 2004.
[14] N. Ratliff and J. Bagnell. Kernel conjugate gradient for fast kernel machines. In IJCAI, 2007.
[15] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, January 2009.
[16] A. Devinatz. Integral representation of pd functions. Transactions of the AMS, 74(1):56–77, 1953.
[17] M. Hein and O. Bousquet. Kernels, associated structures, and generalizations. Technical Report 127, Max Planck Institute for Biological Cybernetics, 2004.
[18] H. Wendland. Scattered Data Approximation. Cambridge University Press, Cambridge, UK, 2005.
[19] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[20] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In KDD, 2013.
[21] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, 2007.
[22] C. D. Dang and G. Lan. Stochastic block mirror descent methods for nonsmooth and stochastic optimization. Technical report, University of Florida, 2013.
[23] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[24] A. Cotter, S. Shalev-Shwartz, and N. Srebro. Learning optimally sparse support vector machines. In ICML, 2013.
[25] G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines with selective sampling. In Large Scale Kernel Machines, pages 301–320. MIT Press, 2007.
[26] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[27] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[28] G. Montavon, K. Hansen, S. Fazli, M. Rupp, F. Biegler, A. Ziehe, A. Tkatchenko, A. Lilienfeld, and K. Müller. Learning invariant representations of molecules for atomization energy prediction. In NIPS, 2012.
[29] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, pages 449–456, 2012.
[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.