Fastfood: Approximating Kernel Expansions in Loglinear Time

…Tamas Sarlos (stamas@google.com), Alex Smola (alex@smola.org)
Google Knowledge, 1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA

Abstract

Despite their successes, what makes kernel methods difficult to use in many large scale problems is the fact that computing the decision function is typically expensive, especially at prediction time. …
Random Kitchen Sinks approximates the function $f$ by multiplying the input with a Gaussian random matrix, followed by the application of a nonlinearity. If the expansion dimension is $n$ and the input dimension is $d$ (i.e., the Gaussian matrix is $n \times d$), it requires $O(nd)$ time and memory to evaluate the decision function $f$. For large problems with sample size $m \gg n$, this is typically much faster than the aforementioned "kernel trick" because the computation is independent of the size of the training set. Experiments also show that this approximation method achieves accuracy comparable to RBF kernels while offering a significant speedup.

Our proposed approach, Fastfood, accelerates Random Kitchen Sinks from $O(nd)$ to $O(n \log d)$ time. The speedup is most significant when the input dimension $d$ is larger than 1000, which is typical in most applications. For instance, a tiny 32x32x3 image in CIFAR-10 (Krizhevsky, 2009) already has 3072 dimensions (and non-linear function classes have been shown to work well for MNIST (Scholkopf & Smola, 2002) and CIFAR-10). Our approach relies on the fact that Hadamard matrices, when combined with Gaussian scaling matrices, behave very much like Gaussian random matrices. This means that these two matrices can be used in place of Gaussian matrices in Random Kitchen Sinks, thereby speeding up the computation for a large range of kernel functions. The computational gain is achieved because, unlike Gaussian random matrices, Hadamard matrices and scaling matrices are inexpensive to multiply and to store.

We prove that the Fastfood approximation is unbiased, has low variance, and concentrates almost at the same rate as Random Kitchen Sinks. Moreover, extensive experiments with a wide range of datasets show that Fastfood achieves accuracy similar to full kernel expansions and Random Kitchen Sinks while being 100x faster with 1000x less memory. These improvements, especially in terms of memory usage, make it possible to use kernel methods even for embedded applications. Our experiments also demonstrate that Fastfood, thanks to its speedup in training, achieves state-of-the-art accuracy on the CIFAR-10 dataset (Krizhevsky, 2009) among permutation-invariant methods.

¹ …(Rahimi & Recht, 2008) generalized it further. We refer to it by the title of the latter instead of the ambiguous phrase "random features" to better differentiate it from our own.
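To make the Hadamard construction concrete, here is a minimal NumPy sketch of one Fastfood block in the form $V = \frac{1}{\sigma\sqrt{d}} S H G \Pi H B$ that appears later in Theorem 4. This is our own illustration, not the authors' released code: we assume $d$ is a power of two, and the chi-distributed rescaling in $S$ follows our reading of the construction, so treat its exact normalization as an assumption.

    import numpy as np

    def fwht(x):
        # Fast Walsh-Hadamard transform (unnormalized, entries +-1), O(d log d).
        x = x.copy()
        d, h = len(x), 1
        while h < d:
            for i in range(0, d, 2 * h):
                a = x[i:i + h].copy()
                x[i:i + h] = a + x[i + h:i + 2 * h]
                x[i + h:i + 2 * h] = a - x[i + h:i + 2 * h]
            h *= 2
        return x

    rng = np.random.default_rng(0)
    d, sigma = 1024, 1.0                     # toy sizes; d must be a power of two (pad if not)
    B = rng.choice([-1.0, 1.0], size=d)      # random sign flips (diagonal B)
    Pi = rng.permutation(d)                  # random permutation (Pi)
    G = rng.normal(size=d)                   # i.i.d. Gaussian scales (diagonal G)
    # S makes row norms chi-distributed, like the rows of a true Gaussian matrix
    # (our reading of the scaling; an assumption, not reproduced from the paper).
    S = np.sqrt(rng.chisquare(d, size=d)) / np.linalg.norm(G)

    def fastfood_features(x):
        # 2d real features approximating the Gaussian RBF kernel with bandwidth sigma.
        z = fwht(B * x)                      # H B x
        z = fwht(G * z[Pi])                  # H G Pi H B x
        z = S * z / (sigma * np.sqrt(d))     # V x
        return np.concatenate([np.cos(z), np.sin(z)]) / np.sqrt(d)

Because $B$, $G$, $\Pi$, and $S$ are stored as vectors and $H$ is never materialized, one block costs $O(d \log d)$ time and $O(d)$ memory; stacking $n/d$ independent blocks yields an $n$-dimensional expansion in $O(n \log d)$ time.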
Other related work. Speeding up kernel methods has been a research focus for many years. Early work compresses function expansions after the problem is solved (Burges, 1996) by means of reduced-set expansions. Subsequent work aimed to reduce the memory footprint and complexity by finding subspaces in which to expand functions (Smola & Scholkopf, 2000; Fine & Scheinberg, 2001; Williams & Seeger, 2001). These methods typically require $O(n^3 + mnd)$ steps to process $m$ observations and to expand $d$-dimensional data into an $n$-dimensional function space. Moreover, they require $O(n^2)$ storage, at least at preprocessing time, to obtain suitable basis functions. Despite these efforts, such costs are still expensive for practical applications. Along the lines of Rahimi & Recht (2007; 2008), fast multipole expansions (Lee & Gray, 2009; Gray & Moore, 2003) offer another interesting avenue for efficient function expansions. While this idea is attractive when the dimensionality $d$ of the input is small, it becomes computationally intractable for large $d$ due to the curse of dimensionality in partitioning space.

2. Random Kitchen Sinks

We start by reviewing some basic tools from kernel methods (Scholkopf & Smola, 2002) and then analyze the key ideas behind Random Kitchen Sinks.

2.1. Mercer's Theorem and Expansions

At the heart of kernel methods lies the theorem of Mercer (1909), which guarantees that kernels can be expressed as an inner product in some Hilbert space.

Theorem 1 (Mercer) Any kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfying $\int k(x, x') f(x) f(x')\,dx\,dx' \geq 0$ for all $L_2(\mathcal{X})$ measurable functions $f$ can be expanded into

    $k(x, x') = \sum_j \lambda_j \phi_j(x) \phi_j(x')$    (2)

Here $\lambda_j \geq 0$ and the $\phi_j$ are orthonormal on $L_2(\mathcal{X})$.

The key idea of Rahimi & Recht (2007; 2008) is to use sampling to approximate the sum in (2). In other words, they draw² $\lambda_i \sim p(\lambda)$ where $p(i) \propto \lambda_i$ (3) and

    $k(x, x') \approx \frac{\sum_j \lambda_j}{n} \sum_{i=1}^{n} \phi_{\lambda_i}(x) \phi_{\lambda_i}(x')$    (4)

Note that the basic connection to random basis functions was established early on, e.g., by Neal (1994) in proving that the Gaussian Process is a limit of an infinite number of basis functions. The expansion (3) is possible whenever the following conditions hold:

1. An inner product expansion of the form (2) is known for a given kernel $k$.
2. The basis functions $\phi_j$ are sufficiently inexpensive to compute.
3. The sum $\sum_j \lambda_j < \infty$ converges, i.e., $k$ corresponds to a trace class operator (Kreyszig, 1989).

² The normalization arises from the fact that it is an $n$-sample from the distribution over basis functions.

…$\mathcal{N}(0, \sigma^2 I_d)$; hence we can directly appeal to the construction of (Rahimi & Recht, 2008). This also holds for $V'$, the main difference being that the rows in $V'$ are considerably more correlated.

3.3. Approximation Guarantees

In this section we prove that the approximation error we incur relative to a Gaussian random matrix is mild.

3.3.1. Low Variance

Theorem 4 shows that, when approximating the RBF kernel with $n$ features, the variance of Fastfood (even without the scaling matrix $S$) is at most the variance of straightforward Gaussian features, the first term in (12), plus $O(1/n)$. In fact, we will see in experiments that our approximation works as well as an exact kernel expansion and Random Kitchen Sinks.

Since the kernel values are real numbers, let us consider the real version of the complex feature map for simplicity. Set $y_j = [V(x' - x)]_j$ and recall that $\bar{\phi}_j(x) \phi_j(x') = n^{-1} \exp(i y_j) = n^{-1} (\cos(y_j) + i \sin(y_j))$. Thus we can replace $\phi(x) \in \mathbb{C}^n$ with $\phi'(x) \in \mathbb{R}^{2n}$, where $\phi'_{2j-1}(x) = n^{-1/2} \cos([Vx]_j)$ and $\phi'_{2j}(x) = n^{-1/2} \sin([Vx]_j)$; see (Rahimi & Recht, 2007).

Theorem 4 Let $C(\alpha) = 6 \alpha^4 \left[ e^{\alpha^2} + \frac{\alpha^2}{3} \right]$ and $v = (x - x')/\sigma$. Then for the feature map $\phi': \mathbb{R}^d \to \mathbb{R}^{2n}$ obtained by stacking $n/d$ i.i.d. copies of the matrix $V' = \frac{1}{\sqrt{d}} H G \Pi H B$ we have

    $\mathrm{Var}\left[ \phi'(x)^\top \phi'(x') \right] \leq \frac{2 \left( 1 - e^{-\|v\|^2} \right)^2}{n} + \frac{C(\|v\|)}{n}$.    (12)

Moreover, the same holds for $V = \frac{1}{\sqrt{d}} S H G \Pi H B$.

Proof  $\phi'(x)^\top \phi'(x')$ is the average of $n/d$ independent estimates, each arising from $2d$ features. Hence it suffices to prove the claim for a single block, i.e., when $n = d$. We show the latter for $V'$ in Theorem 5 and omit the near-identical argument for $V$.
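To make the quantity bounded in Theorem 4 concrete: with a plain Gaussian matrix $V$ (the Random Kitchen Sinks baseline), the real feature map $\phi'$ defined above gives an unbiased estimate of the Gaussian RBF kernel. The following self-contained check is our own illustration, not code from the paper; the values of d, n, and sigma are arbitrary toy choices. The deviation shrinks like $O(n^{-1/2})$, consistent with the $O(1/n)$ variance in (12).

    import numpy as np

    rng = np.random.default_rng(1)
    d, n, sigma = 16, 4096, 1.0
    x, x2 = rng.normal(size=d), rng.normal(size=d)

    # Random Kitchen Sinks for the RBF kernel: rows of V drawn from N(0, I / sigma^2).
    V = rng.normal(size=(n, d)) / sigma

    def phi(x):
        # Real feature map phi': R^d -> R^{2n} built from n^{-1/2} cos/sin pairs.
        z = V @ x
        return np.concatenate([np.cos(z), np.sin(z)]) / np.sqrt(n)

    exact = np.exp(-np.linalg.norm(x - x2) ** 2 / (2 * sigma ** 2))
    approx = phi(x) @ phi(x2)
    print(f"exact {exact:.4f}   approx {approx:.4f}")   # agree to about 1/sqrt(n)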
Theorem 5 Let $v = (x - x')/\sigma$ and let $\psi_j(v) = \cos(d^{-1/2} [H G \Pi H B v]_j)$ denote the estimate of the kernel value that comes from the $j$-th pair of random features, for each $j \in \{1, \ldots, d\}$. Then for each $j$ we have

    $\mathrm{Var}[\psi_j(v)] = \frac{1}{2} \left( 1 - e^{-\|v\|^2} \right)^2$,    (13)

and

    $\mathrm{Var}\left[ \sum_{j=1}^{d} \psi_j(v) \right] \leq \frac{d}{2} \left( 1 - e^{-\|v\|^2} \right)^2 + d\, C(\|v\|)$    (14)

where $C(\alpha) = 6 \alpha^4 \left[ e^{\alpha^2} + \frac{\alpha^2}{3} \right]$.

Proof  Since $\mathrm{Var}(\sum_j X_j) = \sum_{j,t} \mathrm{Cov}(X_j, X_t)$ for any random variables $X_j$, our goal is to compute $\mathrm{Cov}(\psi(v), \psi(v)) = \mathbb{E}[\psi(v) \psi(v)^\top] - \mathbb{E}[\psi(v)]\, \mathbb{E}[\psi(v)]^\top$. Let $w = \frac{1}{\sqrt{d}} H B v$, $u = \Pi w$, and $z = H G u$, hence $\psi_j(v) = \cos(z_j)$.

Now condition on the value of $u$. Then it follows that $\mathrm{Cov}(z_j, z_t \mid u) = \rho_{jt}(u) \|v\|^2$, where $\rho_{jt}(u) \in [-1, 1]$ is the correlation of $z_j$ and $z_t$. To simplify the notation, in what follows we write $\rho$ instead of $\rho_{jt}(u)$. Observe that the marginal distribution of each $z_j$ is $\mathcal{N}(0, \|v\|^2)$, as $\|u\| = \|v\|$ and each element of $H$ is $\pm 1$. Thus the joint distribution of $z_j$ and $z_t$ is a Gaussian with mean 0 and covariance

    $\mathrm{Cov}[(z_j, z_t) \mid u] = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \|v\|^2 = L L^\top$, with Cholesky factor $L = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1 - \rho^2} \end{pmatrix} \|v\|$.

Hence

    $\mathrm{Cov}(\psi_j(v), \psi_t(v) \mid u) = \mathbb{E}_g[\cos([Lg]_1) \cos([Lg]_2)] - \mathbb{E}_g[\cos(z_j)]\, \mathbb{E}_g[\cos(z_t)]$    (15)

where $g \in \mathbb{R}^2$ is drawn from $\mathcal{N}(0, I)$. From the trigonometric identity $\cos(\alpha) \cos(\beta) = \frac{1}{2} [\cos(\alpha - \beta) + \cos(\alpha + \beta)]$ it follows that we can rewrite

    $\mathbb{E}_g[\cos([Lg]_1) \cos([Lg]_2)] = \frac{1}{2} \mathbb{E}_h[\cos(a_- h) + \cos(a_+ h)] = \frac{1}{2} \left[ e^{-\frac{1}{2} a_-^2} + e^{-\frac{1}{2} a_+^2} \right]$

with $h \sim \mathcal{N}(0, 1)$ and $a_\pm^2 = \| [1, \pm 1] L \|^2 = 2 \|v\|^2 (1 \pm \rho)$. That is, after applying the addition theorem we explicitly computed the now one-dimensional Gaussian integrals. Likewise, since by construction $z_j$ and $z_t$ have zero mean and variance $\|v\|^2$, we have

    $\mathbb{E}_g[\cos(z_j)]\, \mathbb{E}_g[\cos(z_t)] = \mathbb{E}_h[\cos(\|v\| h)]^2 = e^{-\|v\|^2}$.

Combining both terms, we obtain that the covariance can be written as

    $\mathrm{Cov}[\psi_j(v), \psi_t(v) \mid u] = e^{-\|v\|^2} \left[ \cosh(\rho \|v\|^2) - 1 \right]$    (16)

To prove the first claim, note that here $j = t$ and correspondingly $\rho = 1$. Plugging this into (16) yields $e^{-\|v\|^2} [\cosh(\|v\|^2) - 1] = \frac{1}{2} [1 + e^{-2\|v\|^2} - 2 e^{-\|v\|^2}] = \frac{1}{2} (1 - e^{-\|v\|^2})^2$, which is our first claim (13). To prove our second claim, observe that from the Taylor series of $\cosh$ with remainder in Lagrange form, it…
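The first claim of Theorem 5 only uses the marginal distribution $z_j \sim \mathcal{N}(0, \|v\|^2)$ derived in the proof, so it is easy to check numerically; the snippet below is a sanity check of ours, not part of the paper, and norm_v is an arbitrary toy value.

    import numpy as np

    # Empirical check of (13): for z ~ N(0, ||v||^2),
    # Var[cos(z)] = (1 - exp(-||v||^2))^2 / 2.
    rng = np.random.default_rng(2)
    norm_v = 1.3
    z = rng.normal(scale=norm_v, size=1_000_000)
    empirical = np.cos(z).var()
    closed_form = 0.5 * (1 - np.exp(-norm_v ** 2)) ** 2
    print(f"empirical {empirical:.5f}   closed form {closed_form:.5f}")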
Table 2. Test set RMSE of different kernel computation methods. Fastfood methods perform comparably with exact RBF, the Nystrom method, and Random Kitchen Sinks (RKS). m and d are the size of the training set and the dimension of the input.

    Dataset                        m        d    Exact  Nystrom  RKS     Fastfood  Fastfood  Exact   Fastfood
                                                 RBF    RBF      (RBF)   FFT       RBF       Matern  Matern
    Insurance Company              5,822    85   0.231  0.232    0.266   0.266     0.264     0.234   0.235
    Wine Quality                   4,080    11   0.819  0.797    0.740   0.721     0.740     0.753   0.720
    Parkinson Telemonitor          4,700    21   0.059  0.058    0.054   0.052     0.054     0.053   0.052
    CPU                            6,554    21   7.271  6.758    7.103   4.544     7.366     4.345   4.211
    Location of CT slices (axial)  42,800   384  n.a.   60.683   49.491  58.425    43.858    n.a.    14.868
    KEGG Metabolic Network         51,686   27   n.a.   17.872   17.837  17.826    17.818    n.a.    17.846
    Year Prediction MSD            463,715  90   n.a.   0.113    0.123   0.106     0.115     n.a.    0.116
    Forest                         522,910  54   n.a.   0.837    0.840   0.838     0.840     n.a.    0.976

Table 1. Runtime, speed, and memory improvements of Fastfood relative to Random Kitchen Sinks (RKS).

    d      n       Fastfood   RKS       Speedup  RAM
    1,024  16,384  0.00058s   0.0139s   24x      256x
    4,096  32,768  0.00136s   0.1224s   90x      1024x
    8,192  65,536  0.00268s   0.5360s   200x     2048x

Figure 2. Test RMSE on the CPU dataset as a function of the number of basis functions. As the number of basis functions increases, the quality of the regression generally improves.

…to approximate the Gaussian RBF kernel, they perform well compared to other variants and improve as $n$ increases. This suggests that learning the kernel by direct spectral adjustment might be a useful application of our proposed method.

4.2. Speed of kernel computations

In the previous experiments we observed that Fastfood is on par with exact kernel computation, the Nystrom method, and Random Kitchen Sinks. The key point, however, is to establish whether the algorithm offers computational savings. For this purpose we compare Random Kitchen Sinks using Eigen⁵ and our method using Spiral⁶; both are highly optimized numerical linear algebra libraries in C++. We are interested in the time it takes to go from the raw features of a vector with dimension $d$ to the label prediction for that vector. On a small problem with $d = 1{,}024$ and $n = 16{,}384$, performing prediction with Random Kitchen Sinks takes 0.07 seconds. Our method is around 24x faster, taking only 0.003 seconds to compute the label for one input vector. The speed gain is even more significant for larger problems, as is evident in Table 1. This confirms experimentally the $O(n \log d)$ vs. $O(nd)$ runtime and the $O(n)$ vs. $O(nd)$ storage of Fastfood relative to Random Kitchen Sinks.

⁵ http://eigen.tuxfamily.org/index.php?title=Main_Page
⁶ http://spiral.net

4.3. Random features for CIFAR-10

To understand the importance of nonlinear feature expansions for a practical application, we benchmarked Fastfood and Random Kitchen Sinks on the CIFAR-10 dataset (Krizhevsky, 2009), which has 50,000 training images and 10,000 test images. Each image has 32x32 pixels and 3 channels ($d = 3072$). In our experiments, linear SVMs achieve 42.3% accuracy on the test set. Non-linear expansions improve the classification accuracy significantly. In particular, Fastfood FFT ("Fourier features") achieves 63.1%, while Fastfood ("Hadamard features") and Random Kitchen Sinks achieve 62.4% with an expansion of $n = 16{,}384$. These are also the best known classification accuracies using permutation-invariant representations on this dataset. In terms of speed, Random Kitchen Sinks is 5x slower (in total training time) and 20x slower (in predicting a label given an image) compared to Fastfood and Fastfood FFT. This demonstrates that non-linear expansions are needed even when the raw data is high-dimensional, and that Fastfood is more practical for such problems. In particular, linear function classes are in many cases used because they provide fast training, and especially fast test, times, not because they offer better accuracy. The results on CIFAR-10 demonstrate that Fastfood can overcome this obstacle.
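The asymptotic gap behind these measurements is easy to reproduce with a toy benchmark. The snippet below is our own sanity check, not the Eigen/Spiral comparison used above: it times a dense Gaussian projection, $O(nd)$ per input, against a vectorized Walsh-Hadamard pass, $O(n \log d)$, for one $n = d$ block of toy size.

    import time
    import numpy as np

    def fwht(x):
        # Vectorized Walsh-Hadamard butterfly; len(x) must be a power of two.
        d = len(x)
        x = x.copy()
        h = 1
        while h < d:
            x = x.reshape(-1, 2, h)
            lo = x[:, 0, :].copy()
            x[:, 0, :] = lo + x[:, 1, :]
            x[:, 1, :] = lo - x[:, 1, :]
            x = x.reshape(d)
            h *= 2
        return x

    rng = np.random.default_rng(3)
    d = 2048                             # one n = d block
    V = rng.normal(size=(d, d))          # dense projection: O(d^2) time and memory
    x = rng.normal(size=d)

    t0 = time.perf_counter(); _ = V @ x; t_dense = time.perf_counter() - t0
    t0 = time.perf_counter(); _ = fwht(x); t_fwht = time.perf_counter() - t0
    print(f"dense {1e3 * t_dense:.3f} ms   fwht {1e3 * t_fwht:.3f} ms")

A full Fastfood block needs two such transforms plus three diagonal multiplies, so it remains $O(d \log d)$, and only the length-$d$ vectors $B$, $G$, $\Pi$, and $S$ need to be stored, which is where the memory savings in Table 1 come from.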
Summary

We demonstrated that it is possible to compute $n$ nonlinear basis functions in $O(n \log d)$ time, a significant speedup over the best competing algorithms. This means that kernel methods become more practical for problems that have large datasets and/or require real-time prediction. In fact, Fastfood can even be run on cell phones, because not only is it fast, it also requires only a small amount of storage.

Acknowledgments

We thank John Langford and Ravi Kumar for fruitful discussions.

References

Ailon, N. and Chazelle, B. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SICOMP, 2009.
Aizerman, M. A., Braverman, A. M., and Rozonoer, L. I. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.
Aronszajn, N. La théorie générale des noyaux reproduisants et ses applications. Proc. Cambridge Philos. Soc., 39:133-153, 1944.
Boser, B., Guyon, I., and Vapnik, V. A training algorithm for optimal margin classifiers. COLT, 1992.
Burges, C. J. C. Simplified support vector decision rules. ICML, 1996.
Cortes, C. and Vapnik, V. Support vector networks. Machine Learning, 20(3):273-297, 1995.
Dasgupta, A., Kumar, R., and Sarlos, T. Fast locality-sensitive hashing. SIGKDD, pp. 1073-1081, 2011.
Fine, S. and Scheinberg, K. Efficient SVM training using low-rank kernel representations. JMLR, 2001.
Frank, A. and Asuncion, A. UCI machine learning repository. http://archive.ics.uci.edu/ml.
Girosi, F. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455-1480, 1998.
Girosi, F., Jones, M., and Poggio, T. Regularization theory and neural networks architectures. Neural Computation, 7(2):219-269, 1995.
Gray, A. G. and Moore, A. W. Rapid evaluation of multiple density models. AISTATS, 2003.
Jin, R., Yang, T., Mahdavi, M., Li, Y. F., and Zhou, Z. H. Improved bound for the Nystrom's method and its application to kernel classification, 2011. URL http://arxiv.org/abs/1111.2262.
Kimeldorf, G. S. and Wahba, G. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41:495-502, 1970.
Kreyszig, E. Introductory Functional Analysis with Applications. Wiley, 1989.
Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Ledoux, M. Isoperimetry and Gaussian analysis. Lectures on Probability Theory and Statistics, 1996.
Lee, D. and Gray, A. G. Fast high-dimensional kernel summations using the Monte Carlo multipole method. NIPS, 2009.
MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
Mercer, J. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, London, A 209:415-446, 1909.
Micchelli, C. A. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11-22, 1986.
Neal, R. Priors for infinite networks. Technical Report CRG-TR-94-1, University of Toronto, 1994.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. NIPS 20, 2007.
Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. NIPS 21, 2008.
Scholkopf, B., Smola, A. J., and Muller, K.-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
Scholkopf, B. and Smola, A. J. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
Smola, A. J. and Scholkopf, B. Sparse greedy matrix approximation for machine learning. ICML, 2000.
Smola, A. J., Scholkopf, B., and Muller, K.-R. The connection between regularization operators and support vector kernels. Neural Networks, 1998.
Steinwart, I. and Christmann, A. Support Vector Machines. Springer, 2008.
Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. NIPS 16, 2004.
Tropp, J. A. Improved analysis of the subsampled randomized Hadamard transform. Adv. Adapt. Data Anal., 2011.
Vapnik, V., Golowich, S., and Smola, A. Support vector method for function approximation, regression estimation, and signal processing. NIPS 9, 1997.
Wahba, G. Spline Models for Observational Data. CBMS-NSF vol. 59, SIAM, Philadelphia, 1990.
Williams, C. K. I. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In Jordan, M. I. (ed.), Learning and Inference in Graphical Models. Kluwer, 1998.
Williams, C. K. I. and Seeger, M. Using the Nystrom method to speed up kernel machines. NIPS 13, 2001.