Online Dictionary Learning for Sparse Coding

Julien Mairal  JULIEN.MAIRAL@INRIA.FR
Francis Bach  FRANCIS.BACH@INRIA.FR
INRIA,¹ 45 rue d'Ulm, 75005 Paris, France
Jean Ponce  JEAN.PONCE@ENS.FR
Ecole Normale Supérieure,¹ 45 rue d'Ulm, 75005 Paris, France
Guillermo Sapiro  GUILLE@UMN.EDU
University of Minnesota, Department of Electrical and Computer Engineering, 200 Union Street SE, Minneapolis, USA

¹ WILLOW Project, Laboratoire d'Informatique de l'Ecole Normale Supérieure, ENS/INRIA/CNRS UMR 8548.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on learning the basis set, also called dictionary, to adapt it to specific data, an approach that has recently proven to be very effective for signal reconstruction and classification in the audio and image processing domains. This paper proposes a new online optimization algorithm for dictionary learning, based on stochastic approximations, which scales up gracefully to large datasets with millions of training samples. A proof of convergence is presented, along with experiments with natural images demonstrating that it leads to faster performance and better dictionaries than classical batch algorithms for both small and large datasets.

1. Introduction

The linear decomposition of a signal using a few atoms of a learned dictionary instead of a predefined one—based on wavelets (Mallat, 1999) for example—has recently led to state-of-the-art results for numerous low-level image processing tasks such as denoising (Elad & Aharon, 2006) as well as higher-level tasks such as classification (Raina et al., 2007; Mairal et al., 2009), showing that sparse learned models are well adapted to natural signals. Unlike decompositions based on principal component analysis and its variants, these models do not impose that the basis vectors be orthogonal, allowing more flexibility to adapt the representation to the data. While learning the dictionary has proven to be critical to achieve (or improve upon) state-of-the-art results, effectively solving the corresponding optimization problem is a significant computational challenge, particularly in the context of the large-scale datasets involved in image processing tasks, that may include millions of training samples. Addressing this challenge is the topic of this paper.

Concretely, consider a signal x in R^m. We say that it admits a sparse approximation over a dictionary D in R^{m x k}, with k columns referred to as atoms, when one can find a linear combination of a "few" atoms from D that is "close" to the signal x. Experiments have shown that modelling a signal with such a sparse decomposition (sparse coding) is very effective in many signal processing applications (Chen et al., 1999). For natural images, predefined dictionaries based on various types of wavelets (Mallat, 1999) have been used for this task. However, learning the dictionary instead of using off-the-shelf bases has been shown to dramatically improve signal reconstruction (Elad & Aharon, 2006). Although some of the learned dictionary elements may sometimes "look like" wavelets (or Gabor filters), they are tuned to the input images or signals, leading to much better results in practice.

Most recent algorithms for dictionary learning (Olshausen & Field, 1997; Aharon et al., 2006; Lee et al., 2007) are second-order iterative batch procedures, accessing the whole training set at each iteration in order to minimize a cost function under some constraints. Although they have been shown experimentally to be much faster than first-order gradient descent methods (Lee et al., 2007), they cannot effectively handle very large training sets (Bottou & Bousquet, 2008), or dynamic training data changing over time, such as video sequences.
To address these issues, we propose an online approach that processes one element (or a small subset) of the training set at a time. This is particularly important in the context of image and video processing (Protter & Elad, 2009), where it is common to learn dictionaries adapted to small patches, with training data that may include several millions of these patches (roughly one per pixel and per frame). In this setting, online techniques based on stochastic approximations are an attractive alternative to batch methods (Bottou, 1998). For example, first-order stochastic gradient descent with projections on the constraint set is sometimes used for dictionary learning (see Aharon and Elad (2008) for instance). We show in this paper that it is possible to go further and exploit the specific structure of sparse coding in the design of an optimization procedure dedicated to the problem of dictionary learning, with low memory consumption and lower computational cost than classical second-order batch algorithms and without the need of explicit learning rate tuning. As demonstrated by our experiments, the algorithm scales up gracefully to large datasets with millions of training samples, and it is usually faster than more standard methods.

1.1. Contributions

This paper makes three main contributions.

• We cast in Section 2 the dictionary learning problem as the optimization of a smooth nonconvex objective function over a convex set, minimizing the (desired) expected cost when the training set size goes to infinity.

• We propose in Section 3 an iterative online algorithm that solves this problem by efficiently minimizing at each step a quadratic surrogate function of the empirical cost over the set of constraints. This method is shown in Section 4 to converge with probability one to a stationary point of the cost function.

• As shown experimentally in Section 5, our algorithm is significantly faster than previous approaches to dictionary learning on both small and large datasets of natural images. To demonstrate that it is adapted to difficult, large-scale image-processing tasks, we learn a dictionary on a 12-Megapixel photograph and use it for inpainting.

2. Problem Statement

Classical dictionary learning techniques (Olshausen & Field, 1997; Aharon et al., 2006; Lee et al., 2007) consider a finite training set of signals X = [x_1, ..., x_n] in R^{m x n} and optimize the empirical cost function

$$ f_n(D) \;\triangleq\; \frac{1}{n}\sum_{i=1}^{n} \ell(x_i, D), \qquad (1) $$

where D in R^{m x k} is the dictionary, each column representing a basis vector, and ℓ is a loss function such that ℓ(x, D) should be small if D is "good" at representing the signal x. The number of samples n is usually large, whereas the signal dimension m is relatively small, for example, m = 100 for 10 x 10 image patches, and n ≥ 100,000 for typical image processing applications. In general, we also have k ≪ n (e.g., k = 200 for n = 100,000), and each signal only uses a few elements of D in its representation. Note that, in this setting, overcomplete dictionaries with k > m are allowed. As others (see (Lee et al., 2007) for example), we define ℓ(x, D) as the optimal value of the ℓ1-sparse coding problem:

$$ \ell(x, D) \;\triangleq\; \min_{\alpha \in \mathbb{R}^k} \; \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1, \qquad (2) $$

where λ is a regularization parameter.² This problem is also known as basis pursuit (Chen et al., 1999), or the Lasso (Tibshirani, 1996). It is well known that the ℓ1 penalty yields a sparse solution for α, but there is no analytic link between the value of λ and the corresponding effective sparsity ||α||_0. To prevent D from being arbitrarily large (which would lead to arbitrarily small values of α), it is common to constrain its columns (d_j)_{j=1..k} to have an ℓ2 norm less than or equal to one. We will call C the convex set of matrices verifying this constraint:

$$ \mathcal{C} \;\triangleq\; \{ D \in \mathbb{R}^{m \times k} \ \text{s.t.} \ \forall j = 1,\dots,k, \ d_j^T d_j \le 1 \}. \qquad (3) $$

Note that the problem of minimizing the empirical cost f_n(D) is not convex with respect to D. It can be rewritten as a joint optimization problem with respect to the dictionary D and the coefficients α = [α_1, ..., α_n] of the sparse decomposition, which is not jointly convex, but convex with respect to each of the two variables D and α when the other one is fixed:

$$ \min_{D \in \mathcal{C},\, \alpha \in \mathbb{R}^{k \times n}} \; \frac{1}{n}\sum_{i=1}^{n} \Big( \frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1 \Big). \qquad (4) $$

² The ℓ_p norm of a vector x in R^m is defined, for p ≥ 1, by ||x||_p ≜ (Σ_{i=1}^{m} |x[i]|^p)^{1/p}. Following tradition, we denote by ||x||_0 the number of nonzero elements of the vector x. This "ℓ0" sparsity measure is not a true norm.
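To make the quantities above concrete, the following is a minimal numpy sketch (ours, not the authors' C++/Matlab code) that evaluates the loss ℓ(x, D) of Eq. (2) with a plain ISTA loop and averages it into the empirical cost f_n(D) of Eq. (1). The function names and the use of ISTA are illustrative choices; the paper itself relies on LARS for sparse coding (see Section 3.2).

    import numpy as np

    def ista_sparse_code(x, D, lam, n_iter=200):
        """Approximately solve Eq. (2): min_a 0.5*||x - D a||_2^2 + lam*||a||_1 (plain ISTA)."""
        a = np.zeros(D.shape[1])
        # Step size from the Lipschitz constant of the smooth part (largest eigenvalue of D^T D).
        L = np.linalg.norm(D, ord=2) ** 2 + 1e-12
        for _ in range(n_iter):
            z = a - D.T @ (D @ a - x) / L
            a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
        return a

    def loss(x, D, lam):
        """l(x, D): value of Eq. (2), evaluated at the ISTA solution."""
        a = ista_sparse_code(x, D, lam)
        return 0.5 * np.sum((x - D @ a) ** 2) + lam * np.sum(np.abs(a))

    def empirical_cost(X, D, lam):
        """f_n(D) from Eq. (1), with the n signals stored as columns of X."""
        return np.mean([loss(X[:, i], D, lam) for i in range(X.shape[1])])

    # Tiny usage example with random data and a dictionary whose columns lie in the set C of Eq. (3).
    rng = np.random.default_rng(0)
    m, k, n = 16, 32, 100
    D = rng.standard_normal((m, k))
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)   # project each column onto the unit l2 ball
    X = rng.standard_normal((m, n))
    print(empirical_cost(X, D, lam=1.2 / np.sqrt(m)))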
A natural approach to solving this problem is to alternate between the two variables, minimizing over one while keeping the other one fixed, as proposed by Lee et al. (2007) (see also Aharon et al. (2006), who use ℓ0 rather than ℓ1 penalties, for related approaches).³ Since the computation of α dominates the cost of each iteration, a second-order optimization technique can be used in this case to accurately estimate D at each step when α is fixed.

³ In our setting, as in (Lee et al., 2007), we use the convex ℓ1 norm, which has empirically proven to be better behaved in general than the ℓ0 pseudo-norm for dictionary learning.

As pointed out by Bottou and Bousquet (2008), however, one is usually not interested in a perfect minimization of the empirical cost f_n(D), but in the minimization of the expected cost

$$ f(D) \;\triangleq\; \mathbb{E}_x[\ell(x, D)] \;=\; \lim_{n \to \infty} f_n(D) \quad \text{a.s.}, \qquad (5) $$

where the expectation (which is assumed finite) is taken relative to the (unknown) probability distribution p(x) of the data.⁴ In particular, given a finite training set, one should not spend too much effort on accurately minimizing the empirical cost, since it is only an approximation of the expected cost.

⁴ We use "a.s." (almost sure) to denote convergence with probability one.

Bottou and Bousquet (2008) have further shown both theoretically and experimentally that stochastic gradient algorithms, whose rate of convergence is not good in conventional optimization terms, may in fact in certain settings be the fastest in reaching a solution with low expected cost. With large training sets, classical batch optimization techniques may indeed become impractical in terms of speed or memory requirements.

In the case of dictionary learning, classical projected first-order stochastic gradient descent (as used by Aharon and Elad (2008) for instance) consists of a sequence of updates of D:

$$ D_t \;=\; \Pi_{\mathcal{C}}\Big[ D_{t-1} - \frac{\rho}{t}\,\nabla_D\, \ell(x_t, D_{t-1}) \Big], \qquad (6) $$

where ρ is the gradient step, Π_C is the orthogonal projector on C, and the training set x_1, x_2, ... are i.i.d. samples of the (unknown) distribution p(x). As shown in Section 5, we have observed that this method can be competitive compared to batch methods with large training sets, when a good learning rate ρ is selected.

The dictionary learning method we present in the next section falls into the class of online algorithms based on stochastic approximations, processing one sample at a time, but exploits the specific structure of the problem to efficiently solve it. Contrary to classical first-order stochastic gradient descent, it does not require explicit learning rate tuning and minimizes a sequence of quadratic local approximations of the expected cost.

3. Online Dictionary Learning

We present in this section the basic components of our online algorithm for dictionary learning (Sections 3.1–3.3), as well as two minor variants which speed up our implementation (Section 3.4).

3.1. Algorithm Outline

Our algorithm is summarized in Algorithm 1.

Algorithm 1  Online dictionary learning.
Require: x ∈ R^m ~ p(x) (random variable and an algorithm to draw i.i.d. samples of p), λ ∈ R (regularization parameter), D_0 ∈ R^{m x k} (initial dictionary), T (number of iterations).
1: A_0 ← 0, B_0 ← 0 (reset the "past" information).
2: for t = 1 to T do
3:   Draw x_t from p(x).
4:   Sparse coding: compute, using LARS,
       $$ \alpha_t \;\triangleq\; \operatorname*{arg\,min}_{\alpha \in \mathbb{R}^k} \; \frac{1}{2}\|x_t - D_{t-1}\alpha\|_2^2 + \lambda\|\alpha\|_1. \qquad (8) $$
5:   A_t ← A_{t-1} + α_t α_t^T.
6:   B_t ← B_{t-1} + x_t α_t^T.
7:   Compute D_t using Algorithm 2, with D_{t-1} as warm restart, so that
       $$ D_t \;\triangleq\; \operatorname*{arg\,min}_{D \in \mathcal{C}} \; \frac{1}{t}\sum_{i=1}^{t}\Big(\frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1\Big) \;=\; \operatorname*{arg\,min}_{D \in \mathcal{C}} \; \frac{1}{t}\Big(\frac{1}{2}\mathrm{Tr}(D^T D A_t) - \mathrm{Tr}(D^T B_t)\Big). \qquad (9) $$
8: end for
9: Return D_T (learned dictionary).
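As a quick sanity check on step 7, the numpy fragment below (our own illustration, not the paper's implementation) builds A_t and B_t as in steps 5 and 6 and verifies that the two expressions in Eq. (9), the sum over past samples and the trace form, differ only by a term that does not depend on D, so both define the same minimizer over C.

    import numpy as np

    rng = np.random.default_rng(1)
    m, k, t, lam = 8, 12, 50, 0.1

    # Fake past samples x_i and sparse codes alpha_i (in Algorithm 1 the alpha_i come from step 4).
    X = rng.standard_normal((m, t))
    alpha = rng.standard_normal((k, t)) * (rng.random((k, t)) < 0.2)   # mostly zeros, i.e. sparse

    # Steps 5-6: A_t = sum_i alpha_i alpha_i^T, B_t = sum_i x_i alpha_i^T.
    A = alpha @ alpha.T
    B = X @ alpha.T

    def sum_form(D):
        """(1/t) * sum_i [ 0.5*||x_i - D alpha_i||^2 + lam*||alpha_i||_1 ]  (first expression in Eq. (9))."""
        r = X - D @ alpha
        return (0.5 * np.sum(r ** 2) + lam * np.sum(np.abs(alpha))) / t

    def trace_form(D):
        """(1/t) * [ 0.5*Tr(D^T D A_t) - Tr(D^T B_t) ]  (second expression in Eq. (9))."""
        return (0.5 * np.trace(D.T @ D @ A) - np.trace(D.T @ B)) / t

    # The two forms differ by a constant independent of D, hence they share the same minimizer over C.
    D1, D2 = rng.standard_normal((m, k)), rng.standard_normal((m, k))
    assert np.isclose(sum_form(D1) - trace_form(D1), sum_form(D2) - trace_form(D2))
    print("constant offset:", sum_form(D1) - trace_form(D1))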
Assuming that the training set is composed of i.i.d. samples of a distribution p(x), its inner loop draws one element x_t at a time, as in stochastic gradient descent, and alternates classical sparse coding steps for computing the decomposition α_t of x_t over the dictionary D_{t-1} obtained at the previous iteration, with dictionary update steps where the new dictionary D_t is computed by minimizing over C the function

$$ \hat{f}_t(D) \;\triangleq\; \frac{1}{t}\sum_{i=1}^{t}\Big(\frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1\Big), \qquad (7) $$

where the vectors α_i are computed during the previous steps of the algorithm. The motivation behind our approach is twofold:

• The quadratic function f̂_t aggregates the past information computed during the previous steps of the algorithm, namely the vectors α_i, and it is easy to show that it upper-bounds the empirical cost f_t(D_t) from Eq. (1). One key aspect of the convergence analysis will be to show that f̂_t(D_t) and f_t(D_t) converge almost surely to the same limit, and thus that f̂_t acts as a surrogate for f_t.

• Since f̂_t is close to f̂_{t-1}, D_t can be obtained efficiently using D_{t-1} as warm restart.

3.2. Sparse Coding

The sparse coding problem of Eq. (2) with fixed dictionary is an ℓ1-regularized linear least-squares problem. A number of recent methods for solving this type of problems are based on coordinate descent with soft thresholding (Fu, 1998; Friedman et al., 2007). When the columns of the dictionary have low correlation, these simple methods have proven to be very efficient. However, the columns of learned dictionaries are in general highly correlated, and we have empirically observed that a Cholesky-based implementation of the LARS-Lasso algorithm, a homotopy method (Osborne et al., 2000; Efron et al., 2004) that provides the whole regularization path—that is, the solutions for all possible values of λ—can be as fast as approaches based on soft thresholding, while providing the solution with a higher accuracy.

Algorithm 2  Dictionary Update.
Require: D = [d_1, ..., d_k] ∈ R^{m x k} (input dictionary), A = [a_1, ..., a_k] ∈ R^{k x k} = Σ_{i=1}^{t} α_i α_i^T, B = [b_1, ..., b_k] ∈ R^{m x k} = Σ_{i=1}^{t} x_i α_i^T.
1: repeat
2:   for j = 1 to k do
3:     Update the j-th column to optimize for (9):
         $$ u_j \;\leftarrow\; \frac{1}{A_{jj}}(b_j - D a_j) + d_j, \qquad d_j \;\leftarrow\; \frac{1}{\max(\|u_j\|_2, 1)}\, u_j. \qquad (10) $$
4:   end for
5: until convergence
6: Return D (updated dictionary).

3.3. Dictionary Update

Our algorithm for updating the dictionary uses block-coordinate descent with warm restarts, and one of its main advantages is that it is parameter-free and does not require any learning rate tuning, which can be difficult in a constrained optimization setting. Concretely, Algorithm 2 sequentially updates each column of D. Using some simple algebra, it is easy to show that Eq. (10) gives the solution of the dictionary update (9) with respect to the j-th column d_j, while keeping the other ones fixed under the constraint d_j^T d_j ≤ 1. Since this convex optimization problem admits separable constraints in the updated blocks (columns), convergence to a global optimum is guaranteed (Bertsekas, 1999). In practice, since the vectors α_i are sparse, the coefficients of the matrix A are in general concentrated on the diagonal, which makes the block-coordinate descent more efficient.⁵ Since our algorithm uses the value of D_{t-1} as a warm restart for computing D_t, a single iteration has empirically been found to be enough.

⁵ Note that this assumption does not exactly hold: to be more exact, if a group of columns in D are highly correlated, the coefficients of the matrix A can concentrate on the corresponding principal submatrices of A.
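A direct transcription of Algorithm 2 and Eq. (10) in numpy could look as follows. This is a sketch under our own conventions rather than the released implementation, and skipping columns with A_jj = 0 (atoms that have not been used yet) is an assumption on our part, in the spirit of the purging heuristic of Section 3.4.

    import numpy as np

    def dictionary_update(D, A, B, n_passes=1, eps=1e-10):
        """Algorithm 2: block-coordinate descent on Eq. (9) with unit-l2-ball constraints on the columns.

        D: (m, k) current dictionary (updated in place and returned)
        A: (k, k) matrix sum_i alpha_i alpha_i^T
        B: (m, k) matrix sum_i x_i alpha_i^T
        """
        k = D.shape[1]
        for _ in range(n_passes):                  # one pass is usually enough with a warm restart
            for j in range(k):
                if A[j, j] < eps:                  # assumption: leave never-used atoms untouched
                    continue
                # Eq. (10): u_j <- (b_j - D a_j)/A_jj + d_j, then project onto the unit l2 ball.
                u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
                D[:, j] = u / max(np.linalg.norm(u), 1.0)
        return D

    # Usage on random statistics (in Algorithm 1, A and B come from steps 5-6, D from the previous iterate).
    rng = np.random.default_rng(2)
    m, k = 8, 12
    D = rng.standard_normal((m, k))
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    alpha = rng.standard_normal((k, 50))
    X = rng.standard_normal((m, 50))
    D = dictionary_update(D, alpha @ alpha.T, X @ alpha.T)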
Other approaches have been proposed to update D: for instance, Lee et al. (2007) suggest using a Newton method on the dual of Eq. (9), but this requires inverting a k x k matrix at each Newton iteration, which is impractical for an online algorithm.

3.4. Optimizing the Algorithm

We have presented so far the basic building blocks of our algorithm. This section discusses simple improvements that significantly enhance its performance.

Handling Fixed-Size Datasets. In practice, although it may be very large, the size of the training set is often finite (of course this may not be the case when the data consists of a video stream that must be treated on the fly, for example). In this situation, the same data points may be examined several times, and it is very common in online algorithms to simulate an i.i.d. sampling of p(x) by cycling over a randomly permuted training set (Bottou & Bousquet, 2008). This method works experimentally well in our setting but, when the training set is small enough, it is possible to further speed up convergence: in Algorithm 1, the matrices A_t and B_t carry all the information from the past coefficients α_1, ..., α_t. Suppose that at time t_0, a signal x is drawn and the vector α_{t_0} is computed. If the same signal x is drawn again at time t > t_0, one would like to remove the "old" information concerning x from A_t and B_t, that is, to write A_t ← A_{t-1} + α_t α_t^T − α_{t_0} α_{t_0}^T, for instance (a small bookkeeping sketch illustrating this idea is given at the end of this section). When dealing with large training sets, it is impossible to store all the past coefficients α_{t_0}, but it is still possible to partially exploit the same idea, by carrying in A_t and B_t the information from the current and previous epochs (cycles through the data) only.

Mini-Batch Extension. In practice, we can improve the convergence speed of our algorithm by drawing η > 1 signals at each iteration instead of a single one, which is a classical heuristic in stochastic gradient descent algorithms. Let us denote by x_{t,1}, ..., x_{t,η} the signals drawn at iteration t. We can then replace lines 5 and 6 of Algorithm 1 by

$$ A_t \;\leftarrow\; \beta A_{t-1} + \sum_{i=1}^{\eta} \alpha_{t,i}\alpha_{t,i}^T, \qquad B_t \;\leftarrow\; \beta B_{t-1} + \sum_{i=1}^{\eta} x_{t,i}\alpha_{t,i}^T, \qquad (11) $$

where β is chosen so that β = (θ + 1 − η)/(θ + 1), with θ = tη if t < η and θ = η² + t − η if t ≥ η, which is compatible with our convergence analysis.

Purging the Dictionary from Unused Atoms. Every dictionary learning technique sometimes encounters situations where some of the dictionary atoms are never (or very seldom) used, which happens typically with a very bad initialization. A common practice is to replace them during the optimization by elements of the training set, which in practice solves this problem in most cases.
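One way to realize the fixed-size-dataset bookkeeping mentioned above, caching the last code computed for every training index so that its stale contribution can be subtracted from A_t and B_t when the sample is revisited, is sketched below. The class and its interface are our own illustration, not part of the paper.

    import numpy as np

    class SufficientStats:
        """Maintain A and B of Algorithm 1 while discounting stale codes of revisited samples (Section 3.4)."""

        def __init__(self, m, k):
            self.A = np.zeros((k, k))
            self.B = np.zeros((m, k))
            self.last_alpha = {}            # training index -> code used the last time the sample was drawn

        def update(self, idx, x, alpha):
            old = self.last_alpha.get(idx)
            if old is not None:             # remove the "old" information about this sample
                self.A -= np.outer(old, old)
                self.B -= np.outer(x, old)
            self.A += np.outer(alpha, alpha)
            self.B += np.outer(x, alpha)
            self.last_alpha[idx] = alpha

For training sets too large to store one code per sample, the paper instead keeps only the contributions of the current and previous epochs, as described above.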
4. Convergence Analysis

Although our algorithm is relatively simple, its stochastic nature and the non-convexity of the objective function make the proof of its convergence to a stationary point somewhat involved. The main tools used in our proofs are the convergence of empirical processes (Van der Vaart, 1998) and, following Bottou (1998), the convergence of quasi-martingales (Fisk, 1965). Our analysis is limited to the basic version of the algorithm, although it can in principle be carried over to the optimized version discussed in Section 3.4. Because of space limitations, we will restrict ourselves to the presentation of our main results and a sketch of their proofs, which will be presented in details elsewhere, and we first state the (reasonable) assumptions under which our analysis holds.

4.1. Assumptions

(A) The data admits a bounded probability density p with compact support K. Assuming a compact support for the data is natural in audio, image, and video processing applications, where it is imposed by the data acquisition process.

(B) The quadratic surrogate functions f̂_t are strictly convex with lower-bounded Hessians. We assume that the smallest eigenvalue of the positive semi-definite matrix (1/t)A_t defined in Algorithm 1 is greater than or equal to a non-zero constant κ_1 (making A_t invertible and f̂_t strictly convex with Hessian lower-bounded). This hypothesis is in practice verified experimentally after a few iterations of the algorithm when the initial dictionary is reasonable, consisting for example of a few elements from the training set, or of any one of the "off-the-shelf" dictionaries, such as DCT (bases of cosine products) or wavelets. Note that it is easy to enforce this assumption by adding a term (κ_1/2)||D||_F^2 to the objective function, which is equivalent in practice to replacing the positive semi-definite matrix (1/t)A_t by (1/t)A_t + κ_1 I. We have omitted for simplicity this penalization in our analysis.

(C) A sufficient uniqueness condition of the sparse coding solution is verified. Given some x ∈ K, where K is the support of p, and D ∈ C, let us denote by Λ the set of indices j such that |d_j^T (x − Dα*)| = λ, where α* is the solution of Eq. (2). We assume that there exists κ_2 > 0 such that, for all x in K and all dictionaries D in the subset S of C considered by our algorithm, the smallest eigenvalue of D_Λ^T D_Λ is greater than or equal to κ_2. This matrix is thus invertible, and classical results (Fuchs, 2005) ensure the uniqueness of the sparse coding solution. It is of course easy to build a dictionary D for which this assumption fails. However, having D_Λ^T D_Λ invertible is a common assumption in linear regression and in methods such as the LARS algorithm aimed at solving Eq. (2) (Efron et al., 2004). It is also possible to enforce this condition using an elastic net penalization (Zou & Hastie, 2005), replacing ||α||_1 by ||α||_1 + (κ_2/2)||α||_2^2 and thus improving the numerical stability of homotopy algorithms such as LARS. Again, we have omitted this penalization for simplicity.

4.2. Main Results and Proof Sketches

Given assumptions (A) to (C), let us now show that our algorithm converges to a stationary point of the objective function.

Proposition 1 (convergence of f(D_t) and of the surrogate function). Let f̂_t denote the surrogate function defined in Eq. (7). Under assumptions (A) to (C): f̂_t(D_t) converges a.s.; f(D_t) − f̂_t(D_t) converges a.s. to 0; and f(D_t) converges a.s.

Proof sketch: The first step in the proof is to show that D_t − D_{t-1} = O(1/t), which, although it does not ensure the convergence of D_t, ensures the convergence of the series Σ_{t=1}^{∞} ||D_t − D_{t-1}||_F^2, a classical condition in gradient descent convergence proofs (Bertsekas, 1999). In turn, this reduces to showing that D_t minimizes a parametrized quadratic function over C with parameters (1/t)A_t and (1/t)B_t, then showing that the solution is uniformly Lipschitz with respect to these parameters, borrowing some ideas from perturbation theory (Bonnans & Shapiro, 1998). At this point, and following Bottou (1998), proving the convergence of the sequence f̂_t(D_t) amounts to showing that the stochastic positive process

$$ u_t \;\triangleq\; \hat{f}_t(D_t) \;\ge\; 0 \qquad (12) $$

is a quasi-martingale. To do so, denoting by F_t the filtration of the past information, a theorem by Fisk (1965) states that if the positive sum Σ_{t=1}^{∞} E[max(E[u_{t+1} − u_t | F_t], 0)] converges, then u_t is a quasi-martingale which converges with probability one. Using some results on empirical processes (Van der Vaart, 1998, Chap. 19.2, Donsker Theorem), we obtain a bound that ensures the convergence of this series. It follows from the convergence of u_t that f_t(D_t) − f̂_t(D_t) converges to zero with probability one. Then, a classical theorem from perturbation theory (Bonnans & Shapiro, 1998, Theorem 4.1) shows that ℓ(x, D) is C^1. This allows us to use a last result on empirical processes ensuring that f(D_t) − f̂_t(D_t) converges almost surely to 0. Therefore f(D_t) converges as well with probability one.
Proposition 2 (convergence to a stationary point). Under assumptions (A) to (C), D_t is asymptotically close to the set of stationary points of the dictionary learning problem with probability one.

Proof sketch: The first step in the proof is to show, using classical analysis tools, that, given assumptions (A) to (C), f is C^1 with a Lipschitz gradient. Considering Ã and B̃, two accumulation points of (1/t)A_t and (1/t)B_t respectively, we can define the corresponding surrogate function f̂_∞ such that for all D in C, f̂_∞(D) = (1/2)Tr(D^T D Ã) − Tr(D^T B̃), and its optimum D_∞ on C. The next step consists of showing that ∇f̂_∞(D_∞) = ∇f(D_∞) and that −∇f(D_∞) is in the normal cone of the set C—that is, D_∞ is a stationary point of the dictionary learning problem (Borwein & Lewis, 2006).

5. Experimental Validation

In this section, we present experiments on natural images to demonstrate the efficiency of our method.

5.1. Performance Evaluation

For our experiments, we have randomly selected 1.25 x 10^6 patches from images in the Berkeley segmentation dataset, which is a standard image database; 10^6 of these are kept for training, and the rest for testing. We used these patches to create three datasets A, B, and C with increasing patch and dictionary sizes, representing various typical settings in image processing applications:

  Data | Signal size m     | Number k of atoms | Type
  A    | 8 x 8 = 64        | 256               | b&w
  B    | 12 x 12 x 3 = 432 | 512               | color
  C    | 16 x 16 = 256     | 1024              | b&w

We have normalized the patches to have unit ℓ2-norm and used the regularization parameter λ = 1.2/√m in all of our experiments. The 1/√m term is a classical normalization factor (Bickel et al., 2007), and the constant 1.2 has been experimentally shown to yield reasonable sparsities (about 10 nonzero coefficients) in these experiments. We have implemented the proposed algorithm in C++ with a Matlab interface. All the results presented in this section use the mini-batch refinement from Section 3.4, since it has been shown empirically to improve speed by a factor of 10 or more. This requires tuning the parameter η, the number of signals drawn at each iteration. Trying different powers of 2 for this variable has shown that η = 256 was a good choice (lowest objective function values on the training set; empirically, this setting also yields the lowest values on the test set), but values of 128 and 512 have given very similar performances.

Our implementation can be used in both the online setting it is intended for, and in a regular batch mode where it uses the entire dataset at each iteration (corresponding to the mini-batch version with η = n). We have also implemented a first-order stochastic gradient descent algorithm that shares most of its code with our algorithm, except for the dictionary update step. This setting allows us to draw meaningful comparisons between our algorithm and its batch and stochastic gradient alternatives, which would have been difficult otherwise. For example, comparing our algorithm to the Matlab implementation of the batch approach from (Lee et al., 2007) developed by its authors would have been unfair since our C++ program has a built-in speed advantage. Although our implementation is multi-threaded, our experiments have been run for simplicity on a single-CPU, single-core 2.4GHz machine.

To measure and compare the performances of the three tested methods, we have plotted the value of the objective function on the test set, acting as a surrogate of the expected cost, as a function of the corresponding training time.

Online vs Batch. Figure 1 (top) compares the online and batch settings of our implementation. The full training set consists of 10^6 samples. The online version of our algorithm draws samples from the entire set, and we have run its batch version on the full dataset as well as on subsets of size 10^4 and 10^5 (see figure). The online setting systematically outperforms its batch counterpart for every training set size and desired precision. We use a logarithmic scale for the computation time, which shows that in many situations the difference in performance can be dramatic. Similar experiments have given similar results on smaller datasets.
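As a small illustration of the protocol above, the unit ℓ2-norm patch normalization and the choice λ = 1.2/√m can be written as follows; the patch matrix is assumed to already contain vectorized patches as columns (patch extraction itself is not shown).

    import numpy as np

    def normalize_patches(patches, lam_const=1.2):
        """Scale each patch (column) to unit l2 norm and return lam = 1.2/sqrt(m) as in Section 5.1."""
        m = patches.shape[0]                     # e.g. m = 64, 432, 256 for datasets A, B, C
        norms = np.linalg.norm(patches, axis=0)
        X = patches / np.maximum(norms, 1e-12)   # avoid dividing all-zero patches by zero
        lam = lam_const / np.sqrt(m)
        return X, lam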
Comparison with Stochastic Gradient Descent. Our experiments have shown that obtaining good performance with stochastic gradient descent requires using both the mini-batch heuristic and carefully choosing the learning rate ρ. To give the fairest comparison possible, we have thus optimized these parameters, sampling η values among powers of 2 (as before) and ρ values among powers of 10. The combination of values ρ = 10^4, η = 512 gives the best results on the training and test data for stochastic gradient descent. Figure 1 (bottom) compares our method with stochastic gradient descent for different ρ values around 10^4 and a fixed value of η = 512. We observe that the larger the value of ρ is, the better the eventual value of the objective function is after many iterations, but the longer it will take to achieve a good precision. Although our method performs better at such high-precision settings for dataset C, it appears that, in general, for a desired precision and a particular dataset, it is possible to tune the stochastic gradient descent algorithm to achieve a performance similar to that of our algorithm. Note that both stochastic gradient descent and our method only start decreasing the objective function value after a few iterations. Slightly better results could be obtained by using smaller gradient steps during the first iterations, using a learning rate of the form ρ/(t + t_0) for the stochastic gradient descent, and initializing A_0 ← t_0 I and B_0 ← t_0 D_0 for the matrices A_t and B_t, where t_0 is a new parameter.
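For reference, the projected first-order stochastic gradient baseline of Eq. (6) used in this comparison can be sketched as below. The gradient expression grad = −(x − Dα)α^T, evaluated at the sparse code α of x on the current dictionary, is a standard identity that we assume here (it is not spelled out in the paper), and the small ISTA coder merely stands in for any Lasso solver.

    import numpy as np

    def sparse_code(x, D, lam, n_iter=100):
        # Minimal ISTA solver for Eq. (2), as in the earlier sketch.
        a = np.zeros(D.shape[1])
        L = np.linalg.norm(D, ord=2) ** 2 + 1e-12
        for _ in range(n_iter):
            z = a - D.T @ (D @ a - x) / L
            a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        return a

    def project_columns(D):
        # Orthogonal projection onto C: rescale every column whose l2 norm exceeds one (Eq. (3)).
        return D / np.maximum(np.linalg.norm(D, axis=0), 1.0)

    def sgd_dictionary_learning(X, k, lam, rho, T, seed=0):
        """Projected first-order SGD, Eq. (6): D_t = Pi_C[ D_{t-1} - (rho/t) * grad_D l(x_t, D_{t-1}) ]."""
        rng = np.random.default_rng(seed)
        m, n = X.shape
        D = project_columns(rng.standard_normal((m, k)))
        for t in range(1, T + 1):
            x = X[:, rng.integers(n)]
            a = sparse_code(x, D, lam)
            grad = -np.outer(x - D @ a, a)      # assumed gradient of l(x, D) at the sparse code a
            D = project_columns(D - (rho / t) * grad)
        return D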
[Figure 1: objective function on the test set versus training time in seconds (log scale), one panel per evaluation set A, B, and C; curves for our method, its batch version with n = 10^4, 10^5, 10^6 (top), and stochastic gradient descent with ρ = 5 x 10^3, 10^4, 2 x 10^4 (bottom).]

Figure 1. Top: Comparison between online and batch learning for various training set sizes. Bottom: Comparison between our method and stochastic gradient (SG) descent with different learning rates. In both cases, the value of the objective function evaluated on the test set is reported as a function of computation time on a logarithmic scale. Values of the objective function greater than its initial value are truncated.

5.2. Application to Inpainting

Our last experiment demonstrates that our algorithm can be used for a difficult large-scale image processing task, namely, removing the text (inpainting) from the damaged 12-Megapixel image of Figure 2. Using a multi-threaded version of our implementation, we have learned a dictionary with 256 elements from the roughly 7 x 10^6 undamaged 12 x 12 color patches in the image with two epochs in about 500 seconds on a 2.4GHz machine with eight cores. Once the dictionary has been learned, the text is removed using the sparse coding technique for inpainting of Mairal et al. (2008). Our intent here is of course not to evaluate our learning procedure in inpainting tasks, which would require a thorough comparison with state-of-the-art techniques on standard datasets. Instead, we just wish to demonstrate that the proposed method can indeed be applied to a realistic, non-trivial image processing task on a large image. Indeed, to the best of our knowledge, this is the first time that dictionary learning is used for image restoration on such large-scale data. For comparison, the dictionaries used for inpainting in the state-of-the-art method of Mairal et al. (2008) are learned (in batch mode) on only 200,000 patches.

6. Discussion

We have introduced in this paper a new stochastic online algorithm for learning dictionaries adapted to sparse coding tasks, and proven its convergence. Preliminary experiments demonstrate that it is significantly faster than batch alternatives on large datasets that may contain millions of training examples, yet it does not require learning rate tuning like regular stochastic gradient descent methods. More experiments are of course needed to better assess the promise of this approach in image restoration tasks such as denoising, deblurring, and inpainting. Beyond this, we plan to use the proposed learning framework for sparse coding in computationally demanding video restoration tasks (Protter & Elad, 2009), with dynamic datasets whose size is not fixed, and also plan to extend this framework to different loss functions to address discriminative tasks such as image classification (Mairal et al., 2009), which are more sensitive to overfitting than reconstructive ones, and to various matrix factorization tasks, such as non-negative matrix factorization with sparseness constraints and sparse principal component analysis.

Acknowledgments

This paper was supported in part by ANR under grant MGA. The work of Guillermo Sapiro is partially supported by ONR, NGA, NSF, ARO, and DARPA.
[Figure 2: the damaged and restored photographs, with zoomed-in details of both.]

Figure 2. Inpainting example on a 12-Megapixel image. Top: Damaged and restored images. Bottom: Zooming on the damaged and restored images. (Best seen in color.)

References

Aharon, M., & Elad, M. (2008). Sparse and redundant modeling of image content using an image-signature-dictionary. SIAM Imaging Sciences, 1, 228–247.

Aharon, M., Elad, M., & Bruckstein, A. M. (2006). The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representations. IEEE Transactions on Signal Processing, 54, 4311–4322.

Bertsekas, D. (1999). Nonlinear programming. Athena Scientific, Belmont, Mass.

Bickel, P., Ritov, Y., & Tsybakov, A. (2007). Simultaneous analysis of Lasso and Dantzig selector. Preprint.

Bonnans, J., & Shapiro, A. (1998). Optimization problems with perturbation: A guided tour. SIAM Review, 40, 202–227.

Borwein, J., & Lewis, A. (2006). Convex analysis and nonlinear optimization: Theory and examples. Springer.

Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad (Ed.), Online learning and neural networks.

Bottou, L., & Bousquet, O. (2008). The tradeoffs of large scale learning. Advances in Neural Information Processing Systems, 20, 161–168.

Chen, S., Donoho, D., & Saunders, M. (1999). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20, 33–61.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407–499.

Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 54, 3736–3745.

Fisk, D. (1965). Quasi-martingale. Transactions of the American Mathematical Society, 359–388.

Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Statistics, 1, 302–332.

Fu, W. (1998). Penalized regressions: The bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7, 397–416.

Fuchs, J. (2005). Recovery of exact sparse representations in the presence of bounded noise. IEEE Transactions on Information Theory, 51, 3601–3608.

Lee, H., Battle, A., Raina, R., & Ng, A. Y. (2007). Efficient sparse coding algorithms. Advances in Neural Information Processing Systems, 19, 801–808.

Mairal, J., Elad, M., & Sapiro, G. (2008). Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17, 53–69.

Mairal, J., Bach, F., Ponce, J., Sapiro, G., & Zisserman, A. (2009). Supervised dictionary learning. Advances in Neural Information Processing Systems, 21, 1033–1040.

Mallat, S. (1999). A wavelet tour of signal processing, second edition. Academic Press, New York.

Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.

Osborne, M., Presnell, B., & Turlach, B. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20, 389–403.

Protter, M., & Elad, M. (2009). Image sequence denoising via sparse and redundant representations. IEEE Transactions on Image Processing, 18, 27–36.

Raina, R., Battle, A., Lee, H., Packer, B., & Ng, A. Y. (2007). Self-taught learning: Transfer learning from unlabeled data. Proceedings of the 24th International Conference on Machine Learning, 759–766.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.

Van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press.

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301–320.