Fastfood: Approximating Kernel Expansions in Loglinear Time

Quoc Le (qvl@google.com), Tamas Sarlos (stamas@google.com), Alex Smola (alex@smola.org)
Google Knowledge, 1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA

Abstract. Despite their successes, what makes kernel methods difficult to use in many large scale problems is the fact that co[...]

[Random Kitchen Sinks] approximates the function f by means of multiplying the input with a Gaussian random matrix, followed by the application of a nonlinearity. If the expansion dimension is n and the input dimension is d (i.e., the Gaussian matrix is n x d), it requires O(nd) time and memory to evaluate the decision function f. For large problems with sample size m >> n, this is typically much faster than the aforementioned "kernel trick" because the computation is independent of the size of the training set. Experiments also show that this approximation method achieves accuracy comparable to RBF kernels while offering a significant speedup. (Rahimi & Recht (2007) introduced the method and Rahimi & Recht (2008) generalized it further; we refer to it by the title of the latter rather than the ambiguous phrase "random features" to better differentiate it from our own.)

Our proposed approach, Fastfood, accelerates Random Kitchen Sinks from O(nd) to O(n log d) time. The speedup is most significant when the input dimension d is larger than 1000, which is typical in most applications. For instance, a tiny 32x32x3 image in CIFAR-10 (Krizhevsky, 2009) already has 3072 dimensions (and non-linear function classes have been shown to work well for MNIST (Scholkopf & Smola, 2002) and CIFAR-10). Our approach relies on the fact that Hadamard matrices, when combined with Gaussian scaling matrices, behave very much like Gaussian random matrices. These two matrices can therefore be used in place of Gaussian matrices in Random Kitchen Sinks, speeding up the computation for a large range of kernel functions. The computational gain is achieved because, unlike Gaussian random matrices, Hadamard matrices and scaling matrices are inexpensive to multiply and to store.

We prove that the Fastfood approximation is unbiased, has low variance, and concentrates almost at the same rate as Random Kitchen Sinks. Moreover, extensive experiments with a wide range of datasets show that Fastfood achieves accuracy similar to full kernel expansions and Random Kitchen Sinks while being 100x faster with 1000x less memory. These improvements, especially in terms of memory usage, make it possible to use kernel methods even for embedded applications. Our experiments also demonstrate that Fastfood, thanks to its speedup in training, achieves state-of-the-art accuracy on the CIFAR-10 dataset (Krizhevsky, 2009) among permutation-invariant methods.

Other related work. Speeding up kernel methods has been a research focus for many years. Early work compresses function expansions after the problem has been solved (Burges, 1996) by means of reduced-set expansions. Subsequent work aimed to reduce memory footprint and complexity by finding subspaces in which to expand functions (Smola & Scholkopf, 2000; Fine & Scheinberg, 2001; Williams & Seeger, 2001). These methods typically require O(n^3 + mnd) steps to process m observations and to expand d-dimensional data into an n-dimensional function space. Moreover, they require O(n^2) storage, at least at preprocessing time, to obtain suitable basis functions. Despite these efforts, such costs are still expensive for practical applications. Along the lines of Rahimi & Recht's (2007; 2008) work, fast multipole expansions (Lee & Gray, 2009; Gray & Moore, 2003) offer another interesting avenue for efficient function expansions. While this idea is attractive when the dimensionality d of the input is small, such methods become computationally intractable for large d due to the curse of dimensionality in the partitioning.

2. Random Kitchen Sinks

We start by reviewing some basic tools from kernel methods (Scholkopf & Smola, 2002) and then analyze the key ideas behind Random Kitchen Sinks.

2.1. Mercer's Theorem and Expansions

At the heart of kernel methods is the theorem of Mercer (1909), which guarantees that kernels can be expressed as an inner product in some Hilbert space.

Theorem 1 (Mercer). Any kernel k: X \times X \to R satisfying \int k(x, x') f(x) f(x') \, dx \, dx' \ge 0 for all L_2(X) measurable functions f can be expanded into

    k(x, x') = \sum_j \lambda_j \phi_j(x) \phi_j(x')    (2)

Here \lambda_j \ge 0 and the \phi_j are orthonormal on L_2(X).

The key idea of Rahimi & Recht (2007; 2008) is to use sampling to approximate the sum in (2). In other words, they draw

    \lambda_i \sim p(\lambda)  where  p(\lambda_i) \propto \lambda_i    (3)

and use

    k(x, x') \approx \frac{\sum_j \lambda_j}{n} \sum_{i=1}^{n} \phi_i(x) \phi_i(x')    (4)

(The normalization arises from the fact that this is an n-sample from the distribution over basis functions.) Note that the connection to random basis functions was well established, e.g., by Neal (1994), who proved that the Gaussian process is a limit of an infinite number of basis functions. The expansion (3) is possible whenever the following conditions hold:

1. An inner product expansion of the form (2) is known for a given kernel k.
2. The basis functions \phi_j are sufficiently inexpensive to compute.
3. The sum \sum_j \lambda_j converges, i.e., k corresponds to a trace class operator (Kreyszig, 1989).
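For the Gaussian RBF kernel k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2)), the sampled expansion above specializes to the feature map used in Section 3.3.1 below: the rows of an n x d matrix are drawn from N(0, \sigma^{-2} I_d) and the nonlinearity is the (cos, sin) pair. The following minimal NumPy sketch illustrates this Random Kitchen Sinks construction and its O(nd) cost per evaluation; it is an illustration only (the function name rks_features and the constants are not from the paper, and this is not the Eigen implementation benchmarked in Section 4.2).

```python
import numpy as np

def rks_features(X, n, sigma, rng):
    """Random Kitchen Sinks features for the Gaussian RBF kernel (illustrative sketch).

    X     : (m, d) array of inputs
    n     : number of sampled frequencies (the map returns 2n real features)
    sigma : bandwidth in k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    """
    d = X.shape[1]
    Z = rng.normal(0.0, 1.0 / sigma, size=(n, d))  # dense Gaussian matrix: O(nd) memory
    Y = X @ Z.T                                    # O(nd) work per input vector: the step Fastfood accelerates
    # Stacking cos and sin gives phi(x)^T phi(x') ~= k(x, x').
    return np.hstack([np.cos(Y), np.sin(Y)]) / np.sqrt(n)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 32))
Phi = rks_features(X, n=4096, sigma=2.0, rng=rng)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * 2.0 ** 2))
print(np.abs(Phi @ Phi.T - K).max())  # approximation error; shrinks roughly as 1/sqrt(n)
```

The n x d matrix Z is exactly the object that Fastfood replaces with a structured product of Hadamard, permutation, and diagonal matrices, which is what removes the O(nd) bottleneck.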
[...] N(0, \sigma^{-2} I_d), hence we can directly appeal to the construction of (Rahimi & Recht, 2008). This also holds for V'. The main difference is that the rows of V' are considerably more correlated.

3.3. Approximation Guarantees

In this section we prove that the approximation error we incur relative to a Gaussian random matrix is mild.

3.3.1. Low Variance

Theorem 4 shows that when approximating the RBF kernel with n features, the variance of Fastfood (even without the scaling matrix S) is at most the variance of straightforward Gaussian features, the first term in (12), plus O(1/n). In fact, we will see in experiments that our approximation works as well as an exact kernel expansion and Random Kitchen Sinks.

Since the kernel values are real numbers, let us consider the real version of the complex feature map for simplicity. Set y_j = [V(x' - x)]_j and recall that

    \bar{\phi}_j(x) \phi_j(x') = n^{-1} \exp(i y_j) = n^{-1} (\cos(y_j) + i \sin(y_j)).

Thus we can replace \phi(x) \in C^n with \phi'(x) \in R^{2n}, where \phi'_{2j-1}(x) = n^{-1/2} \cos([Vx]_j) and \phi'_{2j}(x) = n^{-1/2} \sin([Vx]_j); see (Rahimi & Recht, 2007).

Theorem 4. Let C(\alpha) = 6\alpha^4 \left[ e^{-\alpha^2} + \alpha^2/3 \right] and v = (x - x')/\sigma. Then for the feature map \phi': R^d \to R^{2n} obtained by stacking n/d i.i.d. copies of the matrix V' = \frac{1}{\sigma \sqrt{d}} H G \Pi H B we have

    Var\left[ \phi'(x)^\top \phi'(x') \right] \le \frac{\left(1 - e^{-\|v\|^2}\right)^2}{2n} + \frac{C(\|v\|)}{n}.    (12)

Moreover, the same holds for V = \frac{1}{\sigma \sqrt{d}} S H G \Pi H B.

Proof. \phi'(x)^\top \phi'(x') is the average of n/d independent estimates, each arising from 2d features. Hence it is sufficient to prove the claim for a single block, i.e., when n = d. We show the latter for V' in Theorem 5 and omit the near-identical argument for V.
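To make the structure of V concrete, the following is a minimal NumPy sketch of the block V' = \frac{1}{\sigma\sqrt{d}} H G \Pi H B applied with an in-place fast Walsh-Hadamard transform, so that each d x d block costs O(d log d) time and O(d) memory rather than O(d^2). Here B is a diagonal matrix of random signs, \Pi a random permutation, G a diagonal matrix of i.i.d. standard normals, and H the (unnormalized, \pm 1) Walsh-Hadamard matrix. The sketch is illustrative only: the names fwht, fastfood_block, and fastfood_features are not from the paper; it assumes d is a power of 2 and n a multiple of d; it uses plain Python loops rather than the optimized Spiral code benchmarked in Section 4.2; and it omits the extra diagonal scaling matrix S of the full construction V = \frac{1}{\sigma\sqrt{d}} S H G \Pi H B (Theorem 4 notes that the variance bound holds even without S).

```python
import numpy as np

def fwht(a):
    """In-place unnormalized fast Walsh-Hadamard transform; len(a) must be a power of 2. O(d log d)."""
    h, d = 1, len(a)
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

def fastfood_block(x, B, Pi, G, sigma):
    """Apply one d x d block V' = (1 / (sigma sqrt(d))) H G Pi H B to x without ever forming H."""
    d = len(x)
    t = fwht((B * x).astype(float))   # H B x
    t = t[Pi]                         # Pi H B x
    t = fwht(G * t)                   # H G Pi H B x
    return t / (sigma * np.sqrt(d))

def fastfood_features(X, n, sigma, rng):
    """Stack n/d independent blocks and return 2n real (cos, sin) features per row of X."""
    m, d = X.shape
    Y_blocks = []
    for _ in range(n // d):
        B = rng.choice([-1.0, 1.0], size=d)  # random sign flips (diagonal of B)
        Pi = rng.permutation(d)              # random permutation
        G = rng.normal(size=d)               # i.i.d. Gaussian scaling (diagonal of G)
        Y_blocks.append(np.stack([fastfood_block(x, B, Pi, G, sigma) for x in X]))
    Y = np.hstack(Y_blocks)
    return np.hstack([np.cos(Y), np.sin(Y)]) / np.sqrt(n)
```

Stacked this way, the (cos, sin) features approximate the same Gaussian RBF kernel as the rks_features sketch above, yet only the length-d vectors B, G and the permutation need to be generated and stored per block.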
Theorem 5. Let v = (x - x')/\sigma and let \psi_j(v) = \cos\left( d^{-1/2} [H G \Pi H B \, v]_j \right) denote the estimate of the kernel value that comes from the j-th pair of random features, for each j \in \{1, \dots, d\}. Then for each j we have

    Var[\psi_j(v)] = \frac{1}{2} \left(1 - e^{-\|v\|^2}\right)^2,    (13)

and

    Var\left[ \sum_{j=1}^{d} \psi_j(v) \right] \le \frac{d}{2} \left(1 - e^{-\|v\|^2}\right)^2 + d \, C(\|v\|)    (14)

where C(\alpha) = 6\alpha^4 \left[ e^{-\alpha^2} + \alpha^2/3 \right].

Proof. Since Var(\sum_j X_j) = \sum_{j,t} Cov(X_j, X_t) for any random variables X_j, our goal is to compute Cov(\psi(v), \psi(v)) = E[\psi(v) \psi(v)^\top] - E[\psi(v)] E[\psi(v)]^\top. Let w = \frac{1}{\sqrt{d}} H B v, u = \Pi w, and z = H G u, hence \psi_j(v) = \cos(z_j). Now condition on the value of u. Then it follows that Cov(z_j, z_t | u) = \rho_{jt}(u) \|v\|^2, where \rho_{jt}(u) \in [-1, 1] is the correlation of z_j and z_t. To simplify the notation, in what follows we write \rho instead of \rho_{jt}(u). Observe that the marginal distribution of each z_j is N(0, \|v\|^2), since \|u\| = \|v\| and each element of H is \pm 1. Thus the joint distribution of z_j and z_t is Gaussian with mean 0 and covariance

    Cov[(z_j, z_t) | u] = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \|v\|^2 = L L^\top, \qquad L = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1 - \rho^2} \end{pmatrix} \|v\|

being its Cholesky factor. Hence

    Cov(\psi_j(v), \psi_t(v) | u) = E_g[\cos([Lg]_1) \cos([Lg]_2)] - E_g[\cos(z_j)] E_g[\cos(z_t)]    (15)

where g \in R^2 is drawn from N(0, I). From the trigonometric identity \cos(\alpha)\cos(\beta) = \frac{1}{2}[\cos(\alpha - \beta) + \cos(\alpha + \beta)] it follows that we can rewrite

    E_g[\cos([Lg]_1) \cos([Lg]_2)] = \frac{1}{2} E_h[\cos(a_- h) + \cos(a_+ h)] = \frac{1}{2} \left[ e^{-\frac{1}{2} a_-^2} + e^{-\frac{1}{2} a_+^2} \right]

where h \sim N(0, 1) and a_\pm^2 = \| L^\top [1, \pm 1]^\top \|^2 = 2 \|v\|^2 (1 \pm \rho). That is, after applying the addition theorem we explicitly computed the now one-dimensional Gaussian integrals. Likewise, since by construction z_j and z_t have zero mean and variance \|v\|^2, we have

    E_g[\cos(z_j)] E_g[\cos(z_t)] = E_h[\cos(\|v\| h)]^2 = e^{-\|v\|^2}.

Combining both terms we obtain that the covariance can be written as

    Cov[\psi_j(v), \psi_t(v) | u] = e^{-\|v\|^2} \left[ \cosh(\rho \|v\|^2) - 1 \right]    (16)

To prove the first claim, note that there j = t and correspondingly \rho = 1; plugging this into the above covariance expression and simplifying yields (13). To prove our second claim, observe that from the Taylor series of cosh with remainder in Lagrange form, it [...]
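Claim (13) is easy to check numerically: conditioned on u, each z_j is distributed as N(0, \|v\|^2), so Var[\cos(z_j)] should equal \frac{1}{2}(1 - e^{-\|v\|^2})^2. A short Monte Carlo check (an illustration, not part of the paper):

```python
import numpy as np

# Check (13): if z ~ N(0, s^2) with s = ||v||, then Var[cos(z)] = (1 - exp(-s^2))^2 / 2.
rng = np.random.default_rng(0)
for s in [0.5, 1.0, 2.0]:
    z = rng.normal(0.0, s, size=2_000_000)
    print(s, np.cos(z).var(), 0.5 * (1.0 - np.exp(-s ** 2)) ** 2)
# The last two columns should agree to roughly three decimal places.
```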
Table 2. Test set RMSE of different kernel computation methods. We can see that the Fastfood methods perform comparably with exact RBF, Nystrom, or Random Kitchen Sinks. m and d are the size of the training set and the dimension of the input.

Dataset | m | d | Exact RBF | Nystrom RBF | Random Kitchen Sinks (RBF) | Fastfood FFT | Fastfood RBF | Exact Matern | Fastfood Matern
Insurance Company | 5,822 | 85 | 0.231 | 0.232 | 0.266 | 0.266 | 0.264 | 0.234 | 0.235
Wine Quality | 4,080 | 11 | 0.819 | 0.797 | 0.740 | 0.721 | 0.740 | 0.753 | 0.720
Parkinson Telemonitor | 4,700 | 21 | 0.059 | 0.058 | 0.054 | 0.052 | 0.054 | 0.053 | 0.052
CPU | 6,554 | 21 | 7.271 | 6.758 | 7.103 | 4.544 | 7.366 | 4.345 | 4.211
Location of CT slices (axial) | 42,800 | 384 | n.a. | 60.683 | 49.491 | 58.425 | 43.858 | n.a. | 14.868
KEGG Metabolic Network | 51,686 | 27 | n.a. | 17.872 | 17.837 | 17.826 | 17.818 | n.a. | 17.846
Year Prediction MSD | 463,715 | 90 | n.a. | 0.113 | 0.123 | 0.106 | 0.115 | n.a. | 0.116
Forest | 522,910 | 54 | n.a. | 0.837 | 0.840 | 0.838 | 0.840 | n.a. | 0.976

Table 1. Runtime, speed and memory improvements of Fastfood relative to Random Kitchen Sinks.

d | n | Fastfood | RKS | Speedup | RAM
1,024 | 16,384 | 0.00058 s | 0.0139 s | 24x | 256x
4,096 | 32,768 | 0.00136 s | 0.1224 s | 90x | 1024x
8,192 | 65,536 | 0.00268 s | 0.5360 s | 200x | 2048x

Figure 2. Test RMSE on the CPU dataset with respect to the number of basis functions. As the number of basis functions increases, the quality of the regression generally improves.

[...] to approximate the Gaussian RBF kernel, they perform well compared to other variants and improve as n increases. This suggests that learning the kernel by direct spectral adjustment might be a useful application of our proposed method.

4.2. Speed of kernel computations

In the previous experiments, we observe that Fastfood is on par with exact kernel computation, the Nystrom method, and Random Kitchen Sinks. The key point, however, is to establish whether the algorithm offers computational savings. For this purpose we compare Random Kitchen Sinks using Eigen (http://eigen.tuxfamily.org/index.php?title=Main_Page) and our method using Spiral (http://spiral.net). Both are highly optimized numerical linear algebra libraries in C++. We are interested in the time it takes to go from the raw features of a vector of dimension d to the label prediction for that vector. On a small problem with d = 1,024 and n = 16,384, performing prediction with Random Kitchen Sinks takes 0.07 seconds. Our method is around 24x faster, taking only 0.003 seconds to compute the label for one input vector. The speed gain is even more significant for larger problems, as is evident in Table 1. This confirms experimentally the O(n log d) vs. O(nd) runtime and the O(n) vs. O(nd) storage of Fastfood relative to Random Kitchen Sinks.
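The O(n) vs. O(nd) storage gap can also be made concrete by counting the parameters of the two feature maps (a back-of-the-envelope illustration only; it counts just the random projection parameters, ignores the scaling matrix S and implementation overheads, and therefore does not exactly reproduce the RAM column of Table 1):

```python
# Parameters stored for d = 1,024 and n = 16,384, as in the first row of Table 1.
d, n = 1_024, 16_384
rks_params = n * d                  # Random Kitchen Sinks: the full n x d Gaussian matrix
fastfood_params = (n // d) * 3 * d  # Fastfood: per block, the length-d vectors B, G and the permutation
print(rks_params, fastfood_params, rks_params / fastfood_params)  # ratio ~ d / 3, a few hundredfold here
```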
4.3. Random features for CIFAR-10

To understand the importance of nonlinear feature expansions for a practical application, we benchmarked Fastfood and Random Kitchen Sinks on the CIFAR-10 dataset (Krizhevsky, 2009), which has 50,000 training images and 10,000 test images. Each image has 32x32 pixels and 3 channels (d = 3072). In our experiments, linear SVMs achieve 42.3% accuracy on the test set. Non-linear expansions improve the classification accuracy significantly. In particular, Fastfood FFT ("Fourier features") achieves 63.1%, while Fastfood ("Hadamard features") and Random Kitchen Sinks achieve 62.4% with an expansion of n = 16,384. These are also the best known classification accuracies using permutation-invariant representations on this dataset. In terms of speed, Random Kitchen Sinks is 5x slower (in total training time) and 20x slower (in predicting a label for an image) compared to Fastfood and Fastfood FFT. This demonstrates that non-linear expansions are needed even when the raw data is high-dimensional, and that Fastfood is more practical for such problems. In particular, in many cases linear function classes are used because they provide fast training and, especially, fast test times, not because they offer better accuracy. The results on CIFAR-10 demonstrate that Fastfood can overcome this obstacle.

Summary

We demonstrated that it is possible to compute n nonlinear basis functions in O(n log d) time, a significant speedup over the best competing algorithms. This means that kernel methods become more practical for problems that have large datasets and/or require real-time prediction. In fact, Fastfood can run on cellphones because not only is it fast, it also requires only a small amount of storage.

Acknowledgments

We thank John Langford and Ravi Kumar for fruitful discussions.

References

Ailon, N. and Chazelle, B. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SICOMP, 2009.
Aizerman, M. A., Braverman, A. M., and Rozonoer, L. I. Theoretical foundations of the potential function method in pattern recognition learning. Autom. Remote Control, 25:821-837, 1964.
Aronszajn, N. La theorie generale des noyaux reproduisants et ses applications. Proc. Cambridge Philos. Soc., 39:133-153, 1944.
Boser, B., Guyon, I., and Vapnik, V. A training algorithm for optimal margin classifiers. COLT, 1992.
Burges, C. J. C. Simplified support vector decision rules. ICML, 1996.
Cortes, C. and Vapnik, V. Support vector networks. Machine Learning, 20(3):273-297, 1995.
Dasgupta, A., Kumar, R., and Sarlos, T. Fast locality-sensitive hashing. SIGKDD, pp. 1073-1081, 2011.
Fine, S. and Scheinberg, K. Efficient SVM training using low-rank kernel representations. JMLR, 2001.
Frank, A. and Asuncion, A. UCI machine learning repository. http://archive.ics.uci.edu/ml.
Girosi, F. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455-1480, 1998.
Girosi, F., Jones, M., and Poggio, T. Regularization theory and neural networks architectures. Neural Computation, 7(2):219-269, 1995.
Gray, A. G. and Moore, A. W. Rapid evaluation of multiple density models. AISTATS, 2003.
Jin, R., Yang, T., Mahdavi, M., Li, Y. F., and Zhou, Z. H. Improved bound for the Nystrom's method and its application to kernel classification, 2011. URL http://arxiv.org/abs/1111.2262.
Kimeldorf, G. S. and Wahba, G. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41:495-502, 1970.
Kreyszig, E. Introductory Functional Analysis with Applications. Wiley, 1989.
Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Ledoux, M. Isoperimetry and Gaussian analysis. Lectures on Probability Theory and Statistics, 1996.
Lee, D. and Gray, A. G. Fast high-dimensional kernel summations using the Monte Carlo multipole method. NIPS, 2009.
MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
Mercer, J. Functions of positive and negative type and their connection with the theory of integral equations. Royal Society London, A 209:415-446, 1909.
Micchelli, C. A. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11-22, 1986.
Neal, R. Priors for infinite networks. CRG-TR-94-1, University of Toronto, 1994.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. NIPS 20, 2007.
Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. NIPS 21, 2008.
Scholkopf, B., Smola, A. J., and Muller, K.-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
Scholkopf, B. and Smola, A. J. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
Smola, A. J. and Scholkopf, B. Sparse greedy matrix approximation for machine learning. ICML, 2000.
Smola, A. J., Scholkopf, B., and Muller, K.-R. The connection between regularization operators and support vector kernels. Neural Networks, 1998.
Steinwart, I. and Christmann, A. Support Vector Machines. Springer, 2008.
Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. NIPS 16, 2004.
Tropp, J. A. Improved analysis of the subsampled randomized Hadamard transform. Adv. Adapt. Data Anal., 2011.
Vapnik, V., Golowich, S., and Smola, A. Support vector method for function approximation, regression estimation, and signal processing. NIPS 9, 1997.
Wahba, G. Spline Models for Observational Data. CBMS-NSF vol. 59, SIAM, Philadelphia, 1990.
Williams, C. K. I. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In Jordan, M. I. (ed.), Learning and Inference in Graphical Models, Kluwer, 1998.
Williams, C. K. I. and Seeger, M. Using the Nystrom method to speed up kernel machines. NIPS 13, 2001.