Fastfood: Approximating Kernel Expansions in Loglinear Time

Quoc Le (qvl@google.com), Tamas Sarlos (stamas@google.com), Alex Smola (alex@smola.org)
Google Knowledge, 1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA

Abstract. Despite their successes, what makes kernel methods difficult to use in many large scale problems is the fact that co[...]

[Random Kitchen Sinks] approximates the function f by means of multiplying the input with a Gaussian random matrix, followed by the application of a nonlinearity. If the expansion dimension is n and the input dimension is d (i.e., the Gaussian matrix is n x d), it requires O(nd) time and memory to evaluate the decision function f. For large problems with sample size m >> n, this is typically much faster than the aforementioned "kernel trick" because the computation is independent of the size of the training set. Experiments also show that this approximation method achieves accuracy comparable to RBF kernels while offering a significant speedup. (Rahimi & Recht (2007) introduced the method and Rahimi & Recht (2008) generalized it further; we refer to it by the title of the latter rather than the ambiguous phrase "random features" to better differentiate it from our own.)

Our proposed approach, Fastfood, accelerates Random Kitchen Sinks from O(nd) to O(n log d) time. The speedup is most significant when the input dimension d is larger than 1000, which is typical in most applications. For instance, a tiny 32x32x3 image in CIFAR-10 (Krizhevsky, 2009) already has 3072 dimensions (and non-linear function classes have been shown to work well for MNIST (Scholkopf & Smola, 2002) and CIFAR-10). Our approach relies on the fact that Hadamard matrices, when combined with Gaussian scaling matrices, behave very much like Gaussian random matrices. These two matrices can therefore be used in place of Gaussian matrices in Random Kitchen Sinks, speeding up the computation for a large range of kernel functions. The computational gain is achieved because, unlike Gaussian random matrices, Hadamard matrices and scaling matrices are inexpensive to multiply and to store.

We prove that the Fastfood approximation is unbiased, has low variance, and concentrates almost at the same rate as Random Kitchen Sinks. Moreover, extensive experiments with a wide range of datasets show that Fastfood achieves accuracy similar to full kernel expansions and Random Kitchen Sinks while being 100x faster with 1000x less memory. These improvements, especially in terms of memory usage, make it possible to use kernel methods even for embedded applications. Our experiments also demonstrate that Fastfood, thanks to its speedup in training, achieves state-of-the-art accuracy on the CIFAR-10 dataset (Krizhevsky, 2009) among permutation-invariant methods.

Other related work. Speeding up kernel methods has been a research focus for many years. Early work compresses function expansions after the problem has been solved (Burges, 1996) by means of reduced-set expansions. Subsequent work aimed to reduce memory footprint and complexity by finding subspaces in which to expand functions (Smola & Scholkopf, 2000; Fine & Scheinberg, 2001; Williams & Seeger, 2001). These methods typically require O(n^3 + mnd) steps to process m observations and to expand d-dimensional data into an n-dimensional function space. Moreover, they require O(n^2) storage, at least at preprocessing time, to obtain suitable basis functions. Despite these efforts, such costs are still expensive for practical applications. Along the lines of Rahimi & Recht's (2007; 2008) work, fast multipole expansions (Lee & Gray, 2009; Gray & Moore, 2003) offer another interesting avenue for efficient function expansions. While this idea is attractive when the dimensionality d of the input is small, such methods become computationally intractable for large d due to the curse of dimensionality in the partitioning.

2. Random Kitchen Sinks

We start by reviewing some basic tools from kernel methods (Scholkopf & Smola, 2002) and then analyze the key ideas behind Random Kitchen Sinks.

2.1. Mercer's Theorem and Expansions

At the heart of kernel methods is the theorem of Mercer (1909), which guarantees that kernels can be expressed as an inner product in some Hilbert space.

Theorem 1 (Mercer). Any kernel k: X \times X \to R satisfying \int k(x, x') f(x) f(x') \, dx \, dx' \ge 0 for all L_2(X) measurable functions f can be expanded into

    k(x, x') = \sum_j \lambda_j \phi_j(x) \phi_j(x')    (2)

Here \lambda_j \ge 0 and the \phi_j are orthonormal on L_2(X).

The key idea of Rahimi & Recht (2007; 2008) is to use sampling to approximate the sum in (2). In other words, they draw

    \lambda_i \sim p(\lambda)  where  p(\lambda_i) \propto \lambda_i    (3)

and use

    k(x, x') \approx \frac{\sum_j \lambda_j}{n} \sum_{i=1}^{n} \phi_i(x) \phi_i(x')    (4)

(The normalization arises from the fact that this is an n-sample from the distribution over basis functions.) Note that the connection to random basis functions was well established, e.g., by Neal (1994), who proved that the Gaussian process is a limit of an infinite number of basis functions. The expansion (3) is possible whenever the following conditions hold:

1. An inner product expansion of the form (2) is known for a given kernel k.
2. The basis functions \phi_j are sufficiently inexpensive to compute.
3. The sum \sum_j \lambda_j converges, i.e., k corresponds to a trace class operator (Kreyszig, 1989).
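For the Gaussian RBF kernel k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2)), the sampled expansion above specializes to the feature map used in Section 3.3.1 below: the rows of an n x d matrix are drawn from N(0, \sigma^{-2} I_d) and the nonlinearity is the (cos, sin) pair. The following minimal NumPy sketch illustrates this Random Kitchen Sinks construction and its O(nd) cost per evaluation; it is an illustration only (the function name rks_features and the constants are not from the paper, and this is not the Eigen implementation benchmarked in Section 4.2).

```python
import numpy as np

def rks_features(X, n, sigma, rng):
    """Random Kitchen Sinks features for the Gaussian RBF kernel (illustrative sketch).

    X     : (m, d) array of inputs
    n     : number of sampled frequencies (the map returns 2n real features)
    sigma : bandwidth in k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    """
    d = X.shape[1]
    Z = rng.normal(0.0, 1.0 / sigma, size=(n, d))  # dense Gaussian matrix: O(nd) memory
    Y = X @ Z.T                                    # O(nd) work per input vector: the step Fastfood accelerates
    # Stacking cos and sin gives phi(x)^T phi(x') ~= k(x, x').
    return np.hstack([np.cos(Y), np.sin(Y)]) / np.sqrt(n)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 32))
Phi = rks_features(X, n=4096, sigma=2.0, rng=rng)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * 2.0 ** 2))
print(np.abs(Phi @ Phi.T - K).max())  # approximation error; shrinks roughly as 1/sqrt(n)
```

The n x d matrix Z is exactly the object that Fastfood replaces with a structured product of Hadamard, permutation, and diagonal matrices, which is what removes the O(nd) bottleneck.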
[...] N(0, \sigma^{-2} I_d), hence we can directly appeal to the construction of (Rahimi & Recht, 2008). This also holds for V'. The main difference is that the rows of V' are considerably more correlated.

3.3. Approximation Guarantees

In this section we prove that the approximation error we incur relative to a Gaussian random matrix is mild.

3.3.1. Low Variance

Theorem 4 shows that when approximating the RBF kernel with n features, the variance of Fastfood (even without the scaling matrix S) is at most the variance of straightforward Gaussian features, the first term in (12), plus O(1/n). In fact, we will see in experiments that our approximation works as well as an exact kernel expansion and Random Kitchen Sinks.

Since the kernel values are real numbers, let us consider the real version of the complex feature map for simplicity. Set y_j = [V(x' - x)]_j and recall that

    \bar{\phi}_j(x) \phi_j(x') = n^{-1} \exp(i y_j) = n^{-1} (\cos(y_j) + i \sin(y_j)).

Thus we can replace \phi(x) \in C^n with \phi'(x) \in R^{2n}, where \phi'_{2j-1}(x) = n^{-1/2} \cos([Vx]_j) and \phi'_{2j}(x) = n^{-1/2} \sin([Vx]_j); see (Rahimi & Recht, 2007).

Theorem 4. Let C(\alpha) = 6\alpha^4 \left[ e^{-\alpha^2} + \alpha^2/3 \right] and v = (x - x')/\sigma. Then for the feature map \phi': R^d \to R^{2n} obtained by stacking n/d i.i.d. copies of the matrix V' = \frac{1}{\sigma \sqrt{d}} H G \Pi H B we have

    Var\left[ \phi'(x)^\top \phi'(x') \right] \le \frac{\left(1 - e^{-\|v\|^2}\right)^2}{2n} + \frac{C(\|v\|)}{n}.    (12)

Moreover, the same holds for V = \frac{1}{\sigma \sqrt{d}} S H G \Pi H B.

Proof. \phi'(x)^\top \phi'(x') is the average of n/d independent estimates, each arising from 2d features. Hence it is sufficient to prove the claim for a single block, i.e., when n = d. We show the latter for V' in Theorem 5 and omit the near-identical argument for V.
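To make the structure of V concrete, the following is a minimal NumPy sketch of the block V' = \frac{1}{\sigma\sqrt{d}} H G \Pi H B applied with an in-place fast Walsh-Hadamard transform, so that each d x d block costs O(d log d) time and O(d) memory rather than O(d^2). Here B is a diagonal matrix of random signs, \Pi a random permutation, G a diagonal matrix of i.i.d. standard normals, and H the (unnormalized, \pm 1) Walsh-Hadamard matrix. The sketch is illustrative only: the names fwht, fastfood_block, and fastfood_features are not from the paper; it assumes d is a power of 2 and n a multiple of d; it uses plain Python loops rather than the optimized Spiral code benchmarked in Section 4.2; and it omits the extra diagonal scaling matrix S of the full construction V = \frac{1}{\sigma\sqrt{d}} S H G \Pi H B (Theorem 4 notes that the variance bound holds even without S).

```python
import numpy as np

def fwht(a):
    """In-place unnormalized fast Walsh-Hadamard transform; len(a) must be a power of 2. O(d log d)."""
    h, d = 1, len(a)
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

def fastfood_block(x, B, Pi, G, sigma):
    """Apply one d x d block V' = (1 / (sigma sqrt(d))) H G Pi H B to x without ever forming H."""
    d = len(x)
    t = fwht((B * x).astype(float))   # H B x
    t = t[Pi]                         # Pi H B x
    t = fwht(G * t)                   # H G Pi H B x
    return t / (sigma * np.sqrt(d))

def fastfood_features(X, n, sigma, rng):
    """Stack n/d independent blocks and return 2n real (cos, sin) features per row of X."""
    m, d = X.shape
    Y_blocks = []
    for _ in range(n // d):
        B = rng.choice([-1.0, 1.0], size=d)  # random sign flips (diagonal of B)
        Pi = rng.permutation(d)              # random permutation
        G = rng.normal(size=d)               # i.i.d. Gaussian scaling (diagonal of G)
        Y_blocks.append(np.stack([fastfood_block(x, B, Pi, G, sigma) for x in X]))
    Y = np.hstack(Y_blocks)
    return np.hstack([np.cos(Y), np.sin(Y)]) / np.sqrt(n)
```

Stacked this way, the (cos, sin) features approximate the same Gaussian RBF kernel as the rks_features sketch above, yet only the length-d vectors B, G and the permutation need to be generated and stored per block.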
Theorem 5. Let v = (x - x')/\sigma and let \psi_j(v) = \cos\left( d^{-1/2} [H G \Pi H B \, v]_j \right) denote the estimate of the kernel value that comes from the j-th pair of random features, for each j \in \{1, \dots, d\}. Then for each j we have

    Var[\psi_j(v)] = \frac{1}{2} \left(1 - e^{-\|v\|^2}\right)^2,    (13)

and

    Var\left[ \sum_{j=1}^{d} \psi_j(v) \right] \le \frac{d}{2} \left(1 - e^{-\|v\|^2}\right)^2 + d \, C(\|v\|)    (14)

where C(\alpha) = 6\alpha^4 \left[ e^{-\alpha^2} + \alpha^2/3 \right].

Proof. Since Var(\sum_j X_j) = \sum_{j,t} Cov(X_j, X_t) for any random variables X_j, our goal is to compute Cov(\psi(v), \psi(v)) = E[\psi(v) \psi(v)^\top] - E[\psi(v)] E[\psi(v)]^\top. Let w = \frac{1}{\sqrt{d}} H B v, u = \Pi w, and z = H G u, hence \psi_j(v) = \cos(z_j). Now condition on the value of u. Then it follows that Cov(z_j, z_t | u) = \rho_{jt}(u) \|v\|^2, where \rho_{jt}(u) \in [-1, 1] is the correlation of z_j and z_t. To simplify the notation, in what follows we write \rho instead of \rho_{jt}(u). Observe that the marginal distribution of each z_j is N(0, \|v\|^2), since \|u\| = \|v\| and each element of H is \pm 1. Thus the joint distribution of z_j and z_t is Gaussian with mean 0 and covariance

    Cov[(z_j, z_t) | u] = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \|v\|^2 = L L^\top, \qquad L = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1 - \rho^2} \end{pmatrix} \|v\|

being its Cholesky factor. Hence

    Cov(\psi_j(v), \psi_t(v) | u) = E_g[\cos([Lg]_1) \cos([Lg]_2)] - E_g[\cos(z_j)] E_g[\cos(z_t)]    (15)

where g \in R^2 is drawn from N(0, I). From the trigonometric identity \cos(\alpha)\cos(\beta) = \frac{1}{2}[\cos(\alpha - \beta) + \cos(\alpha + \beta)] it follows that we can rewrite

    E_g[\cos([Lg]_1) \cos([Lg]_2)] = \frac{1}{2} E_h[\cos(a_- h) + \cos(a_+ h)] = \frac{1}{2} \left[ e^{-\frac{1}{2} a_-^2} + e^{-\frac{1}{2} a_+^2} \right]

where h \sim N(0, 1) and a_\pm^2 = \| L^\top [1, \pm 1]^\top \|^2 = 2 \|v\|^2 (1 \pm \rho). That is, after applying the addition theorem we explicitly computed the now one-dimensional Gaussian integrals. Likewise, since by construction z_j and z_t have zero mean and variance \|v\|^2, we have

    E_g[\cos(z_j)] E_g[\cos(z_t)] = E_h[\cos(\|v\| h)]^2 = e^{-\|v\|^2}.

Combining both terms we obtain that the covariance can be written as

    Cov[\psi_j(v), \psi_t(v) | u] = e^{-\|v\|^2} \left[ \cosh(\rho \|v\|^2) - 1 \right]    (16)

To prove the first claim, note that there j = t and correspondingly \rho = 1; plugging this into the above covariance expression and simplifying yields (13). To prove our second claim, observe that from the Taylor series of cosh with remainder in Lagrange form, it [...]
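Claim (13) is easy to check numerically: conditioned on u, each z_j is distributed as N(0, \|v\|^2), so Var[\cos(z_j)] should equal \frac{1}{2}(1 - e^{-\|v\|^2})^2. A short Monte Carlo check (an illustration, not part of the paper):

```python
import numpy as np

# Check (13): if z ~ N(0, s^2) with s = ||v||, then Var[cos(z)] = (1 - exp(-s^2))^2 / 2.
rng = np.random.default_rng(0)
for s in [0.5, 1.0, 2.0]:
    z = rng.normal(0.0, s, size=2_000_000)
    print(s, np.cos(z).var(), 0.5 * (1.0 - np.exp(-s ** 2)) ** 2)
# The last two columns should agree to roughly three decimal places.
```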
Table 2. Test set RMSE of different kernel computation methods. We can see that the Fastfood methods perform comparably with exact RBF, Nystrom, or Random Kitchen Sinks. m and d are the size of the training set and the dimension of the input.

Dataset | m | d | Exact RBF | Nystrom RBF | Random Kitchen Sinks (RBF) | Fastfood FFT | Fastfood RBF | Exact Matern | Fastfood Matern
Insurance Company | 5,822 | 85 | 0.231 | 0.232 | 0.266 | 0.266 | 0.264 | 0.234 | 0.235
Wine Quality | 4,080 | 11 | 0.819 | 0.797 | 0.740 | 0.721 | 0.740 | 0.753 | 0.720
Parkinson Telemonitor | 4,700 | 21 | 0.059 | 0.058 | 0.054 | 0.052 | 0.054 | 0.053 | 0.052
CPU | 6,554 | 21 | 7.271 | 6.758 | 7.103 | 4.544 | 7.366 | 4.345 | 4.211
Location of CT slices (axial) | 42,800 | 384 | n.a. | 60.683 | 49.491 | 58.425 | 43.858 | n.a. | 14.868
KEGG Metabolic Network | 51,686 | 27 | n.a. | 17.872 | 17.837 | 17.826 | 17.818 | n.a. | 17.846
Year Prediction MSD | 463,715 | 90 | n.a. | 0.113 | 0.123 | 0.106 | 0.115 | n.a. | 0.116
Forest | 522,910 | 54 | n.a. | 0.837 | 0.840 | 0.838 | 0.840 | n.a. | 0.976

Table 1. Runtime, speed and memory improvements of Fastfood relative to Random Kitchen Sinks.

d | n | Fastfood | RKS | Speedup | RAM
1,024 | 16,384 | 0.00058 s | 0.0139 s | 24x | 256x
4,096 | 32,768 | 0.00136 s | 0.1224 s | 90x | 1024x
8,192 | 65,536 | 0.00268 s | 0.5360 s | 200x | 2048x

Figure 2. Test RMSE on the CPU dataset with respect to the number of basis functions. As the number of basis functions increases, the quality of the regression generally improves.

[...] to approximate the Gaussian RBF kernel, they perform well compared to other variants and improve as n increases. This suggests that learning the kernel by direct spectral adjustment might be a useful application of our proposed method.

4.2. Speed of kernel computations

In the previous experiments, we observe that Fastfood is on par with exact kernel computation, the Nystrom method, and Random Kitchen Sinks. The key point, however, is to establish whether the algorithm offers computational savings. For this purpose we compare Random Kitchen Sinks using Eigen (http://eigen.tuxfamily.org/index.php?title=Main_Page) and our method using Spiral (http://spiral.net). Both are highly optimized numerical linear algebra libraries in C++. We are interested in the time it takes to go from the raw features of a vector of dimension d to the label prediction for that vector. On a small problem with d = 1,024 and n = 16,384, performing prediction with Random Kitchen Sinks takes 0.07 seconds. Our method is around 24x faster, taking only 0.003 seconds to compute the label for one input vector. The speed gain is even more significant for larger problems, as is evident in Table 1. This confirms experimentally the O(n log d) vs. O(nd) runtime and the O(n) vs. O(nd) storage of Fastfood relative to Random Kitchen Sinks.
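The O(n) vs. O(nd) storage gap can also be made concrete by counting the parameters of the two feature maps (a back-of-the-envelope illustration only; it counts just the random projection parameters, ignores the scaling matrix S and implementation overheads, and therefore does not exactly reproduce the RAM column of Table 1):

```python
# Parameters stored for d = 1,024 and n = 16,384, as in the first row of Table 1.
d, n = 1_024, 16_384
rks_params = n * d                  # Random Kitchen Sinks: the full n x d Gaussian matrix
fastfood_params = (n // d) * 3 * d  # Fastfood: per block, the length-d vectors B, G and the permutation
print(rks_params, fastfood_params, rks_params / fastfood_params)  # ratio ~ d / 3, a few hundredfold here
```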
4.3. Random features for CIFAR-10

To understand the importance of nonlinear feature expansions for a practical application, we benchmarked Fastfood and Random Kitchen Sinks on the CIFAR-10 dataset (Krizhevsky, 2009), which has 50,000 training images and 10,000 test images. Each image has 32x32 pixels and 3 channels (d = 3072). In our experiments, linear SVMs achieve 42.3% accuracy on the test set. Non-linear expansions improve the classification accuracy significantly. In particular, Fastfood FFT ("Fourier features") achieves 63.1%, while Fastfood ("Hadamard features") and Random Kitchen Sinks achieve 62.4% with an expansion of n = 16,384. These are also the best known classification accuracies using permutation-invariant representations on this dataset. In terms of speed, Random Kitchen Sinks is 5x slower (in total training time) and 20x slower (in predicting a label for an image) compared to Fastfood and Fastfood FFT. This demonstrates that non-linear expansions are needed even when the raw data is high-dimensional, and that Fastfood is more practical for such problems. In particular, in many cases linear function classes are used because they provide fast training and, especially, fast test times, not because they offer better accuracy. The results on CIFAR-10 demonstrate that Fastfood can overcome this obstacle.

Summary

We demonstrated that it is possible to compute n nonlinear basis functions in O(n log d) time, a significant speedup over the best competing algorithms. This means that kernel methods become more practical for problems that have large datasets and/or require real-time prediction. In fact, Fastfood can run on cellphones because not only is it fast, it also requires only a small amount of storage.

Acknowledgments

We thank John Langford and Ravi Kumar for fruitful discussions.

References

Ailon, N. and Chazelle, B. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SICOMP, 2009.
Aizerman, M. A., Braverman, A. M., and Rozonoer, L. I. Theoretical foundations of the potential function method in pattern recognition learning. Autom. Remote Control, 25:821-837, 1964.
Aronszajn, N. La theorie generale des noyaux reproduisants et ses applications. Proc. Cambridge Philos. Soc., 39:133-153, 1944.
Boser, B., Guyon, I., and Vapnik, V. A training algorithm for optimal margin classifiers. COLT, 1992.
Burges, C. J. C. Simplified support vector decision rules. ICML, 1996.
Cortes, C. and Vapnik, V. Support vector networks. Machine Learning, 20(3):273-297, 1995.
Dasgupta, A., Kumar, R., and Sarlos, T. Fast locality-sensitive hashing. SIGKDD, pp. 1073-1081, 2011.
Fine, S. and Scheinberg, K. Efficient SVM training using low-rank kernel representations. JMLR, 2001.
Frank, A. and Asuncion, A. UCI machine learning repository. http://archive.ics.uci.edu/ml.
Girosi, F. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455-1480, 1998.
Girosi, F., Jones, M., and Poggio, T. Regularization theory and neural networks architectures. Neural Computation, 7(2):219-269, 1995.
Gray, A. G. and Moore, A. W. Rapid evaluation of multiple density models. AISTATS, 2003.
Jin, R., Yang, T., Mahdavi, M., Li, Y. F., and Zhou, Z. H. Improved bound for the Nystrom's method and its application to kernel classification, 2011. URL http://arxiv.org/abs/1111.2262.
Kimeldorf, G. S. and Wahba, G. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41:495-502, 1970.
Kreyszig, E. Introductory Functional Analysis with Applications. Wiley, 1989.
Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Ledoux, M. Isoperimetry and Gaussian analysis. Lectures on Probability Theory and Statistics, 1996.
Lee, D. and Gray, A. G. Fast high-dimensional kernel summations using the Monte Carlo multipole method. NIPS, 2009.
MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
Mercer, J. Functions of positive and negative type and their connection with the theory of integral equations. Royal Society London, A 209:415-446, 1909.
Micchelli, C. A. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11-22, 1986.
Neal, R. Priors for infinite networks. CRG-TR-94-1, University of Toronto, 1994.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. NIPS 20, 2007.
Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. NIPS 21, 2008.
Scholkopf, B., Smola, A. J., and Muller, K.-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
Scholkopf, B. and Smola, A. J. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
Smola, A. J. and Scholkopf, B. Sparse greedy matrix approximation for machine learning. ICML, 2000.
Smola, A. J., Scholkopf, B., and Muller, K.-R. The connection between regularization operators and support vector kernels. Neural Networks, 1998.
Steinwart, I. and Christmann, A. Support Vector Machines. Springer, 2008.
Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. NIPS 16, 2004.
Tropp, J. A. Improved analysis of the subsampled randomized Hadamard transform. Adv. Adapt. Data Anal., 2011.
Vapnik, V., Golowich, S., and Smola, A. Support vector method for function approximation, regression estimation, and signal processing. NIPS 9, 1997.
Wahba, G. Spline Models for Observational Data. CBMS-NSF vol. 59, SIAM, Philadelphia, 1990.
Williams, C. K. I. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In Jordan, M. I. (ed.), Learning and Inference in Graphical Models, Kluwer, 1998.
Williams, C. K. I. and Seeger, M. Using the Nystrom method to speed up kernel machines. NIPS 13, 2001.