
Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

Tianbao Yang
NEC Labs America, Cupertino, CA 95014
tyang@nec-labs.com

Abstract

We present and study a distributed optimization algorithm that employs a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often perform better than stochastic gradient descent methods in optimizing regularized loss minimization problems, yet little effort has been devoted to studying them in a distributed framework. We make progress along this line by presenting a distributed stochastic dual coordinate ascent algorithm in a star network, with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performances.

1 Introduction

In recent years, machine learning applications have seen an unprecedented growth in data size. In order to efficiently solve large scale machine learning problems with millions or even billions of data points, it has become popular to exploit the computational power of multiple cores in a single machine or multiple machines on a cluster, and to optimize the problems in a parallel or distributed fashion [2].

In this paper, we consider the following generic optimization problem arising ubiquitously in supervised machine learning applications:

$$\min_{w\in\mathbb{R}^d} P(w), \quad\text{where } P(w)=\frac{1}{n}\sum_{i=1}^n \phi(w^\top x_i, y_i) + \lambda g(w), \qquad (1)$$

where $w\in\mathbb{R}^d$ denotes the linear predictor to be optimized, $(x_i, y_i)$, $x_i\in\mathbb{R}^d$, $i=1,\ldots,n$ denote the instance-label pairs of a set of data points, $\phi(z, y)$ denotes a loss function and $g(w)$ denotes a regularization on the linear predictor. Throughout the paper, we assume the loss function $\phi(z, y)$ is convex w.r.t. the first argument, and we refer to the problem in (1) as the Regularized Loss Minimization (RLM) problem. The RLM problem has been studied extensively in machine learning, and many efficient sequential algorithms have been developed in the past decades [8, 16, 10]. In this work, we aim to solve the problem in a distributed framework by leveraging the capabilities of tens or hundreds of CPU cores. In contrast to previous works on distributed optimization that are based on either (stochastic) gradient descent (GD and SGD) methods [21, 11] or alternating direction methods of multipliers (ADMM) [2, 23], we motivate our research from the recent advances on (stochastic) dual coordinate ascent (DCA and SDCA) algorithms [8, 16]. It has been observed that DCA and SDCA algorithms can have comparable and sometimes even better convergence speed than GD and SGD methods. However, little effort has been made to study them in a distributed fashion and to compare them with SGD-based and ADMM-based distributed algorithms.
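To make the pieces of (1) concrete, the following is a minimal sketch (our own illustration, not code from the paper or its implementation) that evaluates the RLM objective for one common instantiation, the smooth L2 hinge loss with $g(w)=\frac{1}{2}\|w\|_2^2$; the function name rlm_objective and the synthetic data are invented for the example.

```python
import numpy as np

# Sketch: evaluate P(w) in (1) for the L2 hinge loss phi(z, y) = max(0, 1 - y*z)^2
# and g(w) = 0.5 * ||w||_2^2 (an assumed instantiation, chosen only for illustration).
def rlm_objective(w, X, y, lam):
    margins = np.maximum(0.0, 1.0 - y * (X @ w))   # per-example hinge margins
    return np.mean(margins ** 2) + lam * 0.5 * np.dot(w, w)

rng = np.random.RandomState(0)
X = rng.randn(100, 5) / np.sqrt(5)    # rows x_i scaled so that ||x_i||_2 is roughly <= 1
y = np.sign(rng.randn(100))
print(rlm_objective(np.zeros(5), X, y, lam=1e-3))  # phi(0) <= 1 for this loss, so P(0) <= 1
```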
In this work, we bridge the gap by developing a Distributed Stochastic Dual Coordinate Ascent (DisDCA) algorithm for solving the RLM problem. We summarize the proposed algorithm and our contributions as follows:

- The presented DisDCA algorithm possesses two key characteristics: (i) parallel computation over K machines (or cores); (ii) sequential updating of m dual variables per iteration on individual machines, followed by a "reduce" step for communication among processes. It enjoys a strong guarantee of convergence rates for smooth or non-smooth loss functions.
- We analyze the tradeoff between computation and communication of DisDCA invoked by m and K. Intuitively, increasing the number m of dual variables per iteration aims at reducing the number of iterations needed for convergence and therefore mitigating the pressure caused by communication. Theoretically, our analysis reveals the effective region of m, K versus the regularization path of $\lambda$.
- We present a practical variant of DisDCA and make a comparison with distributed ADMM.
- We verify our analysis by experiments and demonstrate the effectiveness of DisDCA by comparing with SGD-based and ADMM-based distributed optimization algorithms running in the same distributed framework.

2 Related Work

Recent years have seen a great emergence of distributed algorithms for solving machine learning related problems [2, 9]. In this section, we focus our review on distributed optimization techniques. Many of them are based on stochastic gradient descent methods or alternating direction methods of multipliers. Distributed SGD methods utilize the computing resources of multiple machines to handle a large number of examples simultaneously, which to some extent alleviates the high computational load per iteration of GD methods and also improves the performance of sequential SGD methods. The simplest implementation of a distributed SGD method is to calculate the stochastic gradients on multiple machines and to collect these stochastic gradients for updating the solution on a master machine (a minimal sketch of this scheme appears below). This idea has been implemented in a MapReduce framework [13, 4] and an MPI framework [21, 11]. Many variants of GD methods have been deployed in a similar style [1]. ADMM has been employed for solving machine learning problems in a distributed fashion [2, 23], due to its superior convergence and performance [5, 23]. The original ADMM [7] was proposed for solving equality constrained minimization problems. The algorithms that adopt ADMM for solving RLM problems in a distributed framework are based on the idea of global variable consensus. Recently, several works [19, 14] have made efforts to extend ADMM to online or stochastic versions. However, these suffer from relatively low convergence rates.

The advances on DCA and SDCA algorithms [12, 8, 16] motivate the present work. These studies have shown that in some regimes (e.g., when a relatively accurate solution is needed), SDCA can outperform SGD methods. In particular, S. Shalev-Shwartz and T. Zhang [16] have derived new bounds on the duality gap, which have been shown to be superior to earlier results. However, efforts are still lacking to extend these types of methods to a distributed fashion and to compare them with SGD-based and ADMM-based distributed algorithms. We bridge this gap by presenting and studying a distributed stochastic dual coordinate ascent algorithm. It has been brought to our attention that M. Takáč et al. [20] have recently published a paper studying the parallel speedup of mini-batch primal and dual methods for SVM with the hinge loss, and establishing convergence bounds of mini-batch Pegasos and SDCA that depend on the size of the mini-batch. This work differs from theirs in that (i) we explicitly take into account the tradeoff between computation and communication; (ii) we present a more practical variant and make a comparison between the proposed algorithm and ADMM in view of solving the subproblems; and (iii) we conduct empirical studies for comparison with these algorithms. Other related but different work includes [3], which presents Shotgun, a parallel coordinate descent algorithm for solving $\ell_1$-regularized minimization problems.
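As a concrete illustration of the simplest master-worker distributed SGD scheme reviewed above, here is a sketch with simulated machines and a hinge-loss subgradient; it does not reproduce any particular system from [13, 4, 21, 11], and all names are our own.

```python
import numpy as np

# Each simulated "worker" computes a stochastic (sub)gradient on its own mini-batch;
# the "master" averages the gradients and takes one step on the shared solution.
def distributed_sgd_step(w, X_parts, y_parts, lam, lr, batch, rng):
    grads = []
    for Xk, yk in zip(X_parts, y_parts):               # workers
        idx = rng.randint(Xk.shape[0], size=batch)
        Xb, yb = Xk[idx], yk[idx]
        margins = 1.0 - yb * (Xb @ w)
        active = margins > 0                            # hinge-loss subgradient support
        g = -(Xb[active].T @ yb[active]) / batch + lam * w
        grads.append(g)
    g_avg = np.mean(grads, axis=0)                      # master: collect and average
    return w - lr * g_avg
```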
There are other unique issues arising in distributed optimization, e.g., synchronization vs. asynchronization, star network vs. arbitrary network. All of these issues are related to the tradeoff between communication and computation [22, 24]. Research in these aspects is beyond the scope of this work and can be considered as future work.

3 Distributed Stochastic Dual Coordinate Ascent

In this section, we present a distributed stochastic dual coordinate ascent (DisDCA) algorithm and its convergence bound, and analyze the tradeoff between computation and communication. We also present a practical variant of DisDCA and make a comparison with ADMM. We first present some notation and preliminaries. For simplicity of presentation, we let $\phi_i(w^\top x_i)=\phi(w^\top x_i, y_i)$. Let $\phi_i^*(\alpha)$ and $g^*(v)$ be the convex conjugates of $\phi_i(z)$ and $g(w)$, respectively. We assume $g^*(v)$ is continuously differentiable. It is easy to show that the problem in (1) has the dual problem given below:

$$\max_{\alpha\in\mathbb{R}^n} D(\alpha), \quad\text{where } D(\alpha)=\frac{1}{n}\sum_{i=1}^n -\phi_i^*(-\alpha_i) - \lambda g^*\!\left(\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\right). \qquad (2)$$

Let $w_*$ be the optimal solution to the primal problem in (1) and $\alpha_*$ be the optimal solution to the dual problem in (2). If we define $v(\alpha)=\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i$ and $w(\alpha)=\nabla g^*(v(\alpha))$, it can be verified that $w(\alpha_*)=w_*$ and $P(w(\alpha_*))=D(\alpha_*)$. In this paper, we aim to optimize the dual problem (2) in a distributed environment where the data are distributed evenly across K machines. Let $(x_{k,i}, y_{k,i})$, $i=1,\ldots,n_k$ denote the training examples on machine k. For ease of analysis, we assume $n_k = n/K$. We denote by $\alpha_{k,i}$ the dual variable associated with $x_{k,i}$, and by $\phi_{k,i}(\cdot)$, $\phi_{k,i}^*(\cdot)$ the corresponding loss function and its convex conjugate. To simplify the analysis of our algorithm, and without loss of generality, we make the following assumptions about the problem:

- $\phi_i(z)$ is either a $(1/\gamma)$-smooth function or an L-Lipschitz continuous function (c.f. the definitions given below). Exemplary smooth loss functions include, e.g., the L2 hinge loss $\phi_i(z)=\max(0, 1-y_i z)^2$ and the logistic loss $\phi_i(z)=\log(1+\exp(-y_i z))$. Commonly used Lipschitz continuous functions are the L1 hinge loss $\phi_i(z)=\max(0, 1-y_i z)$ and the absolute loss $\phi_i(z)=|y_i - z|$.
- $g(w)$ is a 1-strongly convex function w.r.t. $\|\cdot\|_2$. Examples include the squared $\ell_2$ norm $\frac{1}{2}\|w\|_2^2$ and the elastic net $\frac{1}{2}\|w\|_2^2 + c\|w\|_1$.
- For all i, $\|x_i\|_2\le 1$, $\phi_i(z)\ge 0$ and $\phi_i(0)\le 1$.

Definition 1. A function $\phi(z):\mathbb{R}\to\mathbb{R}$ is L-Lipschitz continuous if for all $a, b\in\mathbb{R}$, $|\phi(a)-\phi(b)|\le L|a-b|$. A function $\phi(z):\mathbb{R}\to\mathbb{R}$ is $(1/\gamma)$-smooth if it is differentiable and its gradient $\nabla\phi(z)$ is $(1/\gamma)$-Lipschitz continuous, or equivalently, for all $a, b\in\mathbb{R}$, $\phi(a)\le\phi(b)+(a-b)\nabla\phi(b)+\frac{1}{2\gamma}(a-b)^2$. A convex function $g(w):\mathbb{R}^d\to\mathbb{R}$ is $\mu$-strongly convex w.r.t. a norm $\|\cdot\|$ if for any $s\in[0,1]$ and $w_1, w_2\in\mathbb{R}^d$, $g(sw_1+(1-s)w_2)\le s\, g(w_1)+(1-s)\, g(w_2)-\frac{1}{2}s(1-s)\mu\|w_1-w_2\|^2$.

3.1 DisDCA Algorithm: The Basic Variant

The detailed steps of the basic variant of the DisDCA algorithm are described in the pseudocode of Figure 1. The algorithm deploys K processes running simultaneously on K machines (or cores)¹, each of which only accesses its associated training examples. Each machine calls the same procedure SDCA-mR, where mR manifests two unique characteristics of SDCA-mR compared to SDCA. (i) At each iteration of the outer loop, m examples, instead of one, are randomly sampled for updating their dual variables. This is implemented by an inner loop that costs the most computation at each outer iteration. (ii) After updating the m randomly selected dual variables, it invokes a function Reduce to collect the updated information from all K machines, which accommodates naturally to the distributed environment. The Reduce function acts exactly like MPI::AllReduce if one wants to implement the algorithm in an MPI framework. It essentially sends $\frac{1}{\lambda n}\sum_{j=1}^m \Delta\alpha_{k,i_j} x_{k,i_j}$ to a process, adds all of them to $v^{t-1}$, and then broadcasts the updated $v^t$ to all K processes. It is this step that involves the communication among the K machines. Intuitively, a smaller m yields less computation and slower convergence, and therefore more communication, and vice versa. In the next subsection we give a rigorous analysis of the convergence, computation and communication.

¹ We use process and machine interchangeably.
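As a small self-contained check of the primal-dual pair (1)-(2), the sketch below (assuming the L1 hinge loss with $g(w)=\frac{1}{2}\|w\|_2^2$, so that $w(\alpha)=v(\alpha)$; the helper names are ours, and this is not the paper's code) evaluates both objectives at a feasible dual point and confirms a nonnegative duality gap.

```python
import numpy as np

# For the hinge loss, -phi_i^*(-alpha_i) = alpha_i * y_i with the constraint alpha_i*y_i in [0, 1].
def primal_obj(w, X, y, lam):
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean() + 0.5 * lam * np.dot(w, w)

def dual_obj(alpha, X, y, lam):
    n = X.shape[0]
    v = X.T @ alpha / (lam * n)                 # v(alpha) = (1/(lam*n)) sum_i alpha_i x_i
    return np.mean(alpha * y) - 0.5 * lam * np.dot(v, v)

rng = np.random.RandomState(0)
n, d, lam = 200, 20, 1e-2
X = rng.randn(n, d) / np.sqrt(d)                # keep ||x_i||_2 roughly <= 1
y = np.sign(rng.randn(n))
alpha = y * rng.uniform(0.0, 1.0, n)            # feasible: alpha_i * y_i in [0, 1]
w = X.T @ alpha / (lam * n)                     # w(alpha) = v(alpha) since g = 0.5*||.||^2
print("duality gap:", primal_obj(w, X, y, lam) - dual_obj(alpha, X, y, lam))  # always >= 0
```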
Remark: The goal of the updates is to increase the dual objective. The particular options presented in the routine IncDual maximize lower bounds of the dual objective. More options are provided in the supplementary materials. The solutions to Option I have closed forms for several loss functions (e.g., the L1 and L2 hinge losses, the square loss and the absolute loss) [16]. Note that, different from the options presented in [16], the ones in IncDual use a slightly different scalar factor mK in the quadratic term to account for the number of updated dual variables.

DisDCA Algorithm (The Basic Variant)
Start K processes by calling the following procedure SDCA-mR with input m and T.

Procedure SDCA-mR
  Input: number of iterations T, number of samples m at each iteration
  Let: $\alpha_k^0=0$, $v^0=0$, $w^0=\nabla g^*(0)$
  Read Data: $(x_{k,i}, y_{k,i})$, $i=1,\ldots,n_k$
  Iterate: for $t=1,\ldots,T$
    Iterate: for $j=1,\ldots,m$
      Randomly pick $i\in\{1,\ldots,n_k\}$ and let $i_j=i$
      Find $\Delta\alpha_{k,i}$ by calling routine IncDual($w=w^{t-1}$, scl $=mK$)
      Set $\alpha_{k,i}^t=\alpha_{k,i}^{t-1}+\Delta\alpha_{k,i}$
    Reduce: $v^t \leftarrow v^{t-1} + \sum_{k=1}^K \frac{1}{\lambda n}\sum_{j=1}^m \Delta\alpha_{k,i_j} x_{k,i_j}$
    Update: $w^t=\nabla g^*(v^t)$

Routine IncDual($w$, scl)
  Option I: Let $\Delta\alpha_{k,i} = \arg\max_{\Delta\alpha}\; -\phi_{k,i}^*\!\big(-(\alpha_{k,i}^{t-1}+\Delta\alpha)\big) - \Delta\alpha\, x_{k,i}^\top w - \frac{\mathrm{scl}}{2\lambda n}(\Delta\alpha)^2\|x_{k,i}\|_2^2$
  Option II: Let $z_{k,i}^{t-1} = -\partial\phi_{k,i}(x_{k,i}^\top w) - \alpha_{k,i}^{t-1}$ and $\Delta\alpha_{k,i}=s_{k,i} z_{k,i}^{t-1}$, where $s_{k,i}\in[0,1]$ maximizes
    $s\big(\phi_{k,i}^*(-\alpha_{k,i}^{t-1}) + \phi_{k,i}(x_{k,i}^\top w) + z_{k,i}^{t-1} x_{k,i}^\top w\big) + \frac{\gamma s(1-s)}{2}(z_{k,i}^{t-1})^2 - \frac{\mathrm{scl}\, s^2}{2\lambda n}(z_{k,i}^{t-1})^2\|x_{k,i}\|_2^2$

Figure 1: The Basic Variant of the DisDCA Algorithm

3.2 Convergence Analysis: Tradeoff between Computation and Communication

In this subsection, we present the convergence bound of the DisDCA algorithm and analyze the tradeoff between computation, convergence and communication. The theorem below states the convergence rate of the DisDCA algorithm for smooth loss functions (the omitted proofs and other derivations can be found in the supplementary materials).

Theorem 1. For a $(1/\gamma)$-smooth loss function $\phi_i$ and a 1-strongly convex function $g(w)$, to obtain an $\epsilon_P$ duality gap $\mathbb{E}[P(w^T)-D(\alpha^T)]\le\epsilon_P$, it suffices to have
$$T \ge \left(\frac{n}{mK}+\frac{1}{\lambda\gamma}\right)\log\!\left(\left(\frac{n}{mK}+\frac{1}{\lambda\gamma}\right)\frac{1}{\epsilon_P}\right).$$

Remark: In [20], the authors established a convergence bound of mini-batch SDCA for L1-SVM that depends on the spectral norm of the data. Applying their trick to our algorithmic framework is equivalent to replacing the scalar mK in the DisDCA algorithm with $\beta_{mK}$, which characterizes the spectral norm of the sampled data across all machines $X_{mK}=(x_{11},\ldots,x_{1m},\ldots,x_{Km})$. The resulting convergence bound for $(1/\gamma)$-smooth loss functions is obtained by substituting the term $\frac{1}{\lambda\gamma}$ with $\frac{\beta_{mK}}{mK}\cdot\frac{1}{\lambda\gamma}$. The value of $\beta_{mK}$ is usually smaller than mK, and the authors in [20] provide an expression for computing $\beta_{mK}$ based on the spectral norm $\sigma$ of the data matrix $X/\sqrt{n}=(x_1,\ldots,x_n)/\sqrt{n}$. However, in practice the value of $\sigma$ cannot be computed exactly. A safe upper bound of $\sigma=1$, assuming $\|x_i\|_2\le 1$, gives the value mK to $\beta_{mK}$, which reduces to the scalar as presented in Figure 1. The authors in [20] also presented an aggressive variant that adjusts $\beta$ adaptively and observed improvements. In Section 3.3 we develop a practical variant that enjoys more speed-up than the basic variant and their aggressive variant.

Tradeoff between Computation and Communication. We are now ready to discuss the tradeoff between computation and communication based on the worst-case analysis indicated by Theorem 1.
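The following compact single-process simulation is a sketch of the basic variant in Figure 1, under the assumptions of the L1 hinge loss and $g(w)=\frac{1}{2}\|w\|_2^2$; it is not the authors' C++/openMPI implementation. The closed-form Option I update for the hinge loss is the SDCA-style clipped step of [16] with the scalar mK in the quadratic term, and the Reduce step is emulated by summing the local increments over the K simulated machines; all function and variable names are our own.

```python
import numpy as np

def disdca_basic(X_parts, y_parts, lam, m, T, seed=0):
    rng = np.random.RandomState(seed)
    K = len(X_parts)
    n = sum(Xk.shape[0] for Xk in X_parts)
    alphas = [np.zeros(Xk.shape[0]) for Xk in X_parts]
    v = np.zeros(X_parts[0].shape[1])        # v = (1/(lam*n)) * sum_i alpha_i x_i
    scl = m * K
    for _ in range(T):
        w = v                                # w^t = grad g*(v^t) = v^t for g = 0.5*||.||^2
        delta_v = np.zeros_like(v)           # what Reduce/AllReduce would sum over machines
        for k in range(K):
            Xk, yk, ak = X_parts[k], y_parts[k], alphas[k]
            for _ in range(m):
                i = rng.randint(Xk.shape[0])
                xi, yi = Xk[i], yk[i]
                # Option I closed form for the hinge loss: a clipped coordinate step
                step = (1.0 - yi * np.dot(xi, w)) * lam * n / (scl * np.dot(xi, xi) + 1e-12)
                new_ay = np.clip(ak[i] * yi + step, 0.0, 1.0)
                delta = yi * new_ay - ak[i]
                ak[i] += delta
                delta_v += delta * xi / (lam * n)
        v = v + delta_v                      # Reduce: v^t = v^{t-1} + sum over machines
    return v                                 # = w_T, the final primal solution

# Usage on synthetic data split across K = 4 simulated machines
rng = np.random.RandomState(1)
X = rng.randn(400, 20) / np.sqrt(20)
y = np.sign(X @ rng.randn(20) + 0.1 * rng.randn(400))
w = disdca_basic(np.array_split(X, 4), np.array_split(y, 4), lam=1e-2, m=8, T=50)
```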
For the analysis of the tradeoff between computation and communication invoked by the number of samples m and the number of machines K, we fix the number of examples n and the number of dimensions d. When we analyze the tradeoff involving m, we fix K, and vice versa. In the following analysis, we assume that the size of the model to be communicated is fixed at d and is independent of m, though in some cases (e.g., high dimensional sparse data) one may communicate a smaller amount of data that depends on m. It is notable that in the bound on the number of iterations there is a term $\frac{1}{\lambda\gamma}$. To take this term into account, we first consider an interesting region of $\lambda$ for achieving a good generalization error. Several pieces of work [17, 18, 6] have suggested that, in order to obtain an optimal generalization error, the optimal $\lambda$ scales like $\Theta(n^{-1/(1+\tau)})$, where $\tau\in(0,1]$. For example, the analysis in [18] suggests $\lambda=1/\sqrt{n}$ for SVM.

First, we consider the tradeoff involving the number of samples m by fixing the number of processes K. We note that the communication cost is proportional to the number of iterations $T=\Theta\!\left(\frac{n}{mK}+\frac{n^{1/(1+\tau)}}{\gamma}\right)$, while the computation cost per node is proportional to $mT=\Theta\!\left(\frac{n}{K}+\frac{m\, n^{1/(1+\tau)}}{\gamma}\right)$, because each iteration involves m examples. When $m\le \gamma n^{\tau/(1+\tau)}/K$, the communication cost decreases as m increases, and the computation cost increases as m increases, though it is dominated by $\Theta(n/K)$. When the value of m is greater than $\gamma n^{\tau/(1+\tau)}/K$, the communication cost is dominated by $\Theta(n^{1/(1+\tau)}/\gamma)$; then increasing the value of m becomes less influential in reducing the communication cost, while the computation cost blows up substantially. Similarly, we can understand how the number of nodes K affects the tradeoff between the communication cost, proportional to $\widetilde{\Theta}(KT)=\widetilde{\Theta}\!\left(\frac{n}{m}+\frac{K n^{1/(1+\tau)}}{\gamma}\right)$², and the computation cost, proportional to $\Theta\!\left(\frac{n}{K}+\frac{m\, n^{1/(1+\tau)}}{\gamma}\right)$. When $K\le\gamma n^{\tau/(1+\tau)}/m$, as K increases the computation cost decreases and the communication cost increases. When K is greater than $\gamma n^{\tau/(1+\tau)}/m$, the computation cost is dominated by $\Theta(m\, n^{1/(1+\tau)}/\gamma)$ and the effect of increasing K on reducing the computation cost diminishes. According to the above analysis, we conclude that when $mK\le\Theta(n\lambda\gamma)$, to which we refer as the effective region of m and K, the communication cost can be reduced by increasing the number of samples m and the computation cost can be reduced by increasing the number of nodes K. Meanwhile, increasing the number of samples m increases the computation cost, and similarly increasing the number of nodes K increases the communication cost. It is notable that the larger the value of $\lambda$, the wider the effective region of m and K, and vice versa. To verify the tradeoff between communication and computation, we present empirical studies in Section 4.

² We simply ignore the communication delay in our analysis.
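To give a feel for the worst-case tradeoff above, the small numeric illustration below (our own, with arbitrarily chosen n, K, $\tau$, $\gamma$ and $\lambda=n^{-1/(1+\tau)}$) tabulates the iteration, communication and per-node computation proxies as m grows:

```python
n, K, tau, gamma = 1_000_000, 10, 1.0, 1.0
m_star = gamma * n ** (tau / (1 + tau)) / K            # boundary of the effective region
for m in [1, 10, int(m_star), 10 * int(m_star), 100 * int(m_star)]:
    T = n / (m * K) + n ** (1 / (1 + tau)) / gamma     # iterations ~ communication cost
    print(f"m={m:>6d}  T~{T:9.0f}  computation per node ~ m*T = {m * T:12.0f}")
# Communication keeps dropping until m reaches about m_star = gamma * n**(tau/(1+tau)) / K;
# beyond that, T is dominated by the 1/(lambda*gamma) term and extra samples mainly
# inflate the per-node computation, matching the effective region mK <= Theta(n*lambda*gamma).
```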
Although the smooth loss functions are the most interesting, we present in the theorem below the convergence of DisDCA for Lipschitz continuous loss functions.

Theorem 2. For an L-Lipschitz continuous loss function $\phi_i$ and a 1-strongly convex function $g(w)$, to obtain an $\epsilon_P$ duality gap $\mathbb{E}[P(\bar{w}_T)-D(\bar{\alpha}_T)]\le\epsilon_P$, it suffices to have
$$T \ge \frac{4L^2}{\lambda\epsilon_P} + T_0 + \frac{n}{mK} \ge \frac{20L^2}{\lambda\epsilon_P} + \max\!\left(0, \frac{n}{mK}\log\!\left(\frac{\lambda n}{2mKL^2}\right)\right) + \frac{n}{mK},$$
where $\bar{w}_T=\sum_{t=T_0}^{T-1} w^t/(T-T_0)$ and $\bar{\alpha}_T=\sum_{t=T_0}^{T-1}\alpha^t/(T-T_0)$.

Remark: In this case, the effective region of m and K is $mK\le\Theta(n\lambda\epsilon_P)$, which is narrower than that for smooth loss functions, especially when $\epsilon_P\le\gamma$. Similarly, if one can obtain an accurate estimate of the spectral norm of all the data and use $\beta_{mK}$ in place of mK in Figure 1, the convergence bound can be improved with $\frac{4L^2}{\lambda\epsilon_P}\cdot\frac{\beta_{mK}}{mK}$ in place of $\frac{4L^2}{\lambda\epsilon_P}$. Again, the practical variant presented in the next section yields more speed-up.

3.3 A Practical Variant of DisDCA and a Comparison with ADMM

In this section, we first present a practical variant of DisDCA motivated by intuition, and then we make a comparison between DisDCA and ADMM, which provides more insight into the practical variant of DisDCA and the differences between the two algorithms. In what follows, we are particularly interested in $\ell_2$ norm regularization, where $g(w)=\frac{1}{2}\|w\|_2^2$ and $v=w$.

A Practical Variant. We note that in Algorithm 1, when updating the values of the subsequently sampled dual variables, the algorithm does not use the updated information but instead $w^{t-1}$ from the last iteration. Therefore, a potential improvement is to leverage the up-to-date information when updating the dual variables. To this end, we maintain a local copy of $w_k$ on each machine. At the beginning of iteration t, all $w_k^0$, $k=1,\ldots,K$ are synchronized with the global $w^{t-1}$. Then, on individual machines, the j-th sampled dual variable is updated by IncDual($w_k^{j-1}$, K), and the local copy is also updated by $w_k^j=w_k^{j-1}+\frac{1}{\lambda n_k}\Delta\alpha_{k,i_j} x_{k,i_j}$ for updating the next dual variable. At the end of the iteration, the local solutions are synchronized to the global variable $w^t=w^{t-1}+\frac{1}{\lambda n}\sum_{k=1}^K\sum_{j=1}^m \Delta\alpha_{k,i_j}^t x_{k,i_j}$. It is important to note that the scalar factor in IncDual is now K, because the dual variables are updated incrementally and there are K processes running in parallel. The detailed steps are presented in Figure 2, where we abuse the same notation $u_t^j$ for the local variable on all processes. The experiments in Section 4 verify the improvements of the practical variant over the basic variant. It remains an open problem to us what the convergence bound of this practical variant is. However, next we establish a connection between DisDCA and ADMM that sheds light on the motivation behind the practical variant and the differences between the two algorithms.

The practical updates at the t-th iteration
  Initialize: $u_t^0=w^{t-1}$
  Iterate: for $j=1,\ldots,m$
    Randomly pick $i\in\{1,\ldots,n_k\}$ and let $i_j=i$
    Find $\Delta\alpha_{k,i}$ by calling routine IncDual($w=u_t^{j-1}$, scl $=K$)
    Update $\alpha_{k,i}^t=\alpha_{k,i}^{t-1}+\Delta\alpha_{k,i}$ and update $u_t^j=u_t^{j-1}+\frac{1}{\lambda n_k}\Delta\alpha_{k,i} x_{k,i}$

Figure 2: The updates at the t-th iteration of the practical variant of DisDCA
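A sketch of one iteration of the practical updates in Figure 2 (again assuming the L1 hinge loss and $g(w)=\frac{1}{2}\|w\|_2^2$; the function and variable names are ours, and this is only an illustration, not the authors' implementation):

```python
import numpy as np

def practical_local_updates(Xk, yk, alpha_k, w_prev, lam, n, K, m, rng):
    u = w_prev.copy()                        # u_t^0 = w^{t-1}
    nk = Xk.shape[0]
    local_delta = np.zeros_like(w_prev)      # (1/(lam*n)) * sum_j delta_alpha_j * x_{i_j}
    for _ in range(m):
        i = rng.randint(nk)
        xi, yi = Xk[i], yk[i]
        # Option I closed form for the hinge loss with scl = K (incremental local updates)
        step = (1.0 - yi * np.dot(xi, u)) * lam * n / (K * np.dot(xi, xi) + 1e-12)
        new_ay = np.clip(alpha_k[i] * yi + step, 0.0, 1.0)
        delta = yi * new_ay - alpha_k[i]
        alpha_k[i] += delta
        u += delta * xi / (lam * nk)         # local copy uses 1/(lam*n_k), as in Figure 2
        local_delta += delta * xi / (lam * n)
    return local_delta                       # summed over machines: w^t = w^{t-1} + sum_k local_delta
```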
A Comparison with ADMM. First, we note that the goal of the updates at each iteration in DisDCA is to increase the dual objective by maximizing the following objective:

$$\max_{\alpha}\;\frac{1}{n_k}\sum_{i=1}^m -\phi_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\hat{w}^{t-1}+\frac{1}{\lambda n_k}\sum_{i=1}^m \alpha_i x_i\right\|_2^2, \qquad (3)$$

where $\hat{w}^{t-1}=w^{t-1}-\frac{1}{\lambda n_k}\sum_{i=1}^m \alpha_i^{t-1} x_i$ and we suppress the subscript k associated with each machine. The updates presented in Algorithm 1 are solutions to maximizing lower bounds of the above objective obtained by decoupling the m dual variables. It is not difficult to derive that the dual problem in (3) has the following primal problem (a detailed derivation and others can be found in the supplementary materials):

$$\text{DisDCA:}\quad \min_{w}\;\frac{1}{n_k}\sum_{i=1}^m \phi_i(x_i^\top w) + \frac{\lambda}{2}\left\|w-\Big(w^{t-1}-\frac{1}{\lambda n_k}\sum_{i=1}^m \alpha_i^{t-1} x_i\Big)\right\|_2^2. \qquad (4)$$

We refer to $\hat{w}^{t-1}$ as the penalty solution. Second, let us recall the updating scheme in ADMM. The (deterministic) ADMM algorithm at iteration t solves the following problem on each machine:

$$\text{ADMM:}\quad w_t^k=\arg\min_{w}\;\frac{1}{n_k}\sum_{i=1}^{n_k}\phi_i(x_i^\top w)+\frac{\rho K}{2}\big\|w-\underbrace{(w^{t-1}-u_{t-1}^k)}_{\hat{w}_{t-1}^k}\big\|_2^2, \qquad (5)$$

where $\rho$ is a penalty parameter and $w^{t-1}$ is the global primal variable, updated by
$$w^t=\frac{\rho K(\bar{w}_t+\bar{u}_{t-1})}{K\rho+\lambda}, \quad\text{with}\quad \bar{w}_t=\frac{1}{K}\sum_{k=1}^K w_t^k, \quad \bar{u}_{t-1}=\frac{1}{K}\sum_{k=1}^K u_{t-1}^k,$$
and $u_{t-1}^k$ is the local "dual" variable, updated by $u_t^k=u_{t-1}^k+w_t^k-w^t$. Comparing the subproblem (4) in DisDCA and the subproblem (5) in ADMM leads to the following observations. (1) Both aim at solving the same type of problem to increase the dual objective or decrease the primal objective. DisDCA uses only m randomly selected examples, while ADMM uses all examples. (2) However, the penalty solution $\hat{w}^{t-1}$ and the penalty parameter are different. In DisDCA, $\hat{w}^{t-1}$ is constructed by subtracting from the global solution the local solution defined by the dual variables $\alpha$, while in ADMM it is constructed by subtracting from the global solution the local Lagrangian variables u. The penalty parameter in DisDCA is given by the regularization parameter $\lambda$, while in ADMM it is a parameter $\rho$ that needs to be specified by the user.

Now let us explain the practical variant of DisDCA from the viewpoint of inexactly solving the subproblem (4). Note that if the optimal solution to (3) is denoted by $\alpha_i^*$, $i=1,\ldots,m$, then the optimal solution $u_*$ to (4) is given by $u_*=\hat{w}^{t-1}+\frac{1}{\lambda n_k}\sum_{i=1}^m \alpha_i^* x_i$. In fact, the updates at the t-th iteration of the practical variant of DisDCA optimize the subproblem (4) by the SDCA algorithm with only one pass over the sampled data points and an initialization of $\alpha_i^0=\alpha_i^{t-1}$, $i=1,\ldots,m$. It means that the initial primal solution for solving the subproblem (3) is $u^0=\hat{w}^{t-1}+\frac{1}{\lambda n_k}\sum_{i=1}^m \alpha_i^{t-1} x_i=w^{t-1}$. That explains the initialization step in Figure 2. In a recent work [23] applying ADMM to solve the L2-SVM problem in the same distributed fashion, the authors exploited different strategies for solving the subproblem (5) associated with L2-SVM, among which the DCA algorithm with only one pass over all data points gives the best performance in terms of running time (e.g., it is better than DCA with several passes over all data points, and also better than a trust region Newton method). This, from another point of view, validates the practical variant of DisDCA. Finally, it is worth mentioning that, unlike ADMM, whose performance is significantly affected by the value of the penalty parameter, DisDCA is a parameter-free algorithm.
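For contrast, the sketch below spells out one round of the global-variable-consensus ADMM scheme in (5) (assumptions: L2 hinge loss, $g(w)=\frac{1}{2}\|w\|_2^2$, and a crude gradient-descent inner solver standing in for the DCA or Newton solvers used in [23]; the names local_solve and admm_round are ours):

```python
import numpy as np

def local_solve(Xk, yk, w_hat, pen, w_init, steps=20, lr=0.1):
    # Approximately minimize (1/n_k) * sum_i max(0, 1 - y_i x_i^T w)^2 + (pen/2)*||w - w_hat||^2
    w = w_init.copy()
    nk = Xk.shape[0]
    for _ in range(steps):
        margins = 1.0 - yk * (Xk @ w)
        active = margins > 0
        grad = -2.0 * (Xk[active].T @ (yk[active] * margins[active])) / nk + pen * (w - w_hat)
        w -= lr * grad
    return w

def admm_round(X_parts, y_parts, w_global, u_list, lam, rho):
    K = len(X_parts)
    w_locals = []
    for k in range(K):                                      # each machine solves (5) approximately
        w_hat = w_global - u_list[k]                        # penalty center: w^{t-1} - u_{t-1}^k
        w_locals.append(local_solve(X_parts[k], y_parts[k], w_hat, rho * K, w_global))
    w_bar, u_bar = np.mean(w_locals, axis=0), np.mean(u_list, axis=0)
    w_new = rho * K * (w_bar + u_bar) / (lam + rho * K)     # global consensus update
    u_list = [u_list[k] + w_locals[k] - w_new for k in range(K)]
    return w_new, u_list
```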
4 Experiments

In this section, we present experimental results to verify the theoretical analysis and the empirical performance of the proposed algorithms. We implement the algorithms in C++ and openMPI and run them on a cluster. On each machine, we launch only one process. The experiments are performed on two large data sets with different numbers of features, covtype and kdd. The covtype data has a total of 581,012 examples and 54 features. The kdd data is a large data set used in KDD Cup 2010, which contains 19,264,097 training examples and 29,890,095 features. For the covtype data, we use 522,911 examples for training. We apply the algorithms to solving two SVM formulations, namely L2-SVM with the squared hinge loss and L1-SVM with the hinge loss, to demonstrate the capabilities of DisDCA in solving problems with smooth loss functions and with Lipschitz continuous loss functions. In the legends of the figures, we use DisDCA-b to denote the basic variant, DisDCA-p to denote the practical variant, and DisDCA-a to denote the aggressive variant of DisDCA [20].

Tradeoff between Communication and Computation. To verify the convergence analysis, we show in Figures 3(a), 3(b), 3(d), 3(e) the duality gap of the basic variant and the practical variant of the DisDCA algorithm versus the number of iterations, varying the number of samples m per iteration, the number of machines K, and the value of $\lambda$. The results verify the convergence bound in Theorem 1. When the values of m or K are first increased, the performances improve. However, once their values exceed a certain number, the impact of further increasing m or K diminishes. Additionally, the larger the value of $\lambda$, the wider the effective region of m and K. It is notable that the effective region of m and K for the practical variant is much larger than that of the basic variant. We also briefly report a running time result: to obtain an $\epsilon=10^{-3}$ duality gap for optimizing L2-SVM on the covtype data with $\lambda=10^{-3}$, the running times of DisDCA-p with $m=1, 10, 10^2, 10^3$ (fixing K=10) are 30, 4, 0, 5 seconds³, respectively, and the running times with $K=1, 5, 10, 20$ (fixing m=100) are 3, 0, 0, 1 seconds, respectively. The speed-up gain on the kdd data from increasing m is even larger because the communication cost is much higher. In the supplement, we present more results visualizing the communication and computation tradeoff.

³ 0 seconds means less than 1 second. We exclude the time for computing the duality gap at each iteration.

The Practical Variant vs. the Basic Variant. To further demonstrate the usefulness of the practical variant, we present a comparison between the practical variant and the basic variant for optimizing the two SVM formulations in the supplementary material. We also include the performance of the aggressive variant proposed in [20], obtained by applying the aggressive updates to the m sampled examples in each machine without incurring additional communication cost. The results show that the practical variant converges much faster than the basic variant and the aggressive variant.

Figure 3: (a, b): duality gap with varying m; (d, e): duality gap with varying K; (c, f): comparison of different algorithms for optimizing SVMs. More results can be found in the supplementary materials.

Comparison with Other Baselines. Lastly, we compare DisDCA with SGD-based and ADMM-based distributed algorithms running in the same distributed framework. For optimizing L2-SVM, we implement the stochastic average gradient (SAG) algorithm [15], which also enjoys a linear convergence rate for smooth and strongly convex problems. We use the constant step size $1/L_s$ suggested by the authors for obtaining good practical performance, where $L_s$ denotes the smoothness parameter of the problem, set to $2R+\lambda$ given $\|x_i\|_2^2\le R$, $\forall i$. For optimizing L1-SVM, we compare to the stochastic Pegasos. For ADMM-based algorithms, we implement the stochastic ADMM of [14] (ADMM-s) and the deterministic ADMM of [23] (ADMM-dca) that employs the DCA algorithm for solving the subproblems. In the stochastic ADMM, there is a step size parameter $\eta_t\propto 1/\sqrt{t}$. We choose the best initial step size in $[10^{-3}, 10^{3}]$. We run all algorithms on K=10 machines and set $m=10^4$, $\lambda=10^{-6}$ for all stochastic algorithms. In terms of the parameter $\rho$ in ADMM, we find that $\rho=10^{-6}$ yields good performance after searching over a range of values. We compare DisDCA with SAG, Pegasos and ADMM-s in Figures 3(c), 3(f)⁴, which clearly demonstrate that DisDCA is a strong competitor in optimizing SVMs. In the supplement, we compare DisDCA, setting $m=n_k$, against ADMM-dca with four different values of $\rho=10^{-6}, 10^{-4}, 10^{-2}, 1$ on kdd. The results show that the performance deteriorates significantly if $\rho$ is not appropriately set, while DisDCA produces comparable performance without additional effort in tuning the parameter.

5 Conclusions

We have presented a distributed stochastic dual coordinate ascent algorithm and its convergence rates, and analyzed the tradeoff between computation and communication. The practical variant has substantial improvements over the basic variant and other variants. We have also made a comparison with other distributed algorithms and observed competitive performances.
⁴ The primal objective of Pegasos on covtype is above the display range.

[Figure 3 plots omitted: duality gap vs. number of iterations (×100) for DisDCA-b and DisDCA-p on covtype L2-SVM with varying m ($m=1, 10, 10^2, 10^3, 10^4$), varying K ($K=1, 5, 10$) and $\lambda\in\{10^{-3}, 10^{-6}\}$; primal objective vs. number of iterations (×100) comparing DisDCA-p with ADMM-s and SAG (L2-SVM) or Pegasos (L1-SVM) on covtype and kdd with $K=10$, $m=10^4$, $\lambda=10^{-6}$.]

References

[1] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In CDC, pages 5451–5452, 2012.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3:1–122, 2011.
[3] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for L1-regularized loss minimization. In ICML, 2011.
[4] C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. In NIPS, pages 281–288, 2006.
[5] W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, 2012.
[6] M. Eberts and I. Steinwart. Optimal learning rates for least squares SVMs using Gaussian kernels. In NIPS, pages 1539–1547, 2011.
[7] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl., 2:17–40, 1976.
[8] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, pages 408–415, 2008.
[9] H. Daumé III, J. M. Phillips, A. Saha, and S. Venkatasubramanian. Protocols for learning classifiers on distributed data. JMLR Proceedings Track, 22:282–290, 2012.
[10] S. Lacoste-Julien, M. Jaggi, M. W. Schmidt, and P. Pletscher. Stochastic block-coordinate Frank-Wolfe optimization for structural SVMs. CoRR, abs/1207.4747, 2012.
[11] J. Langford, A. Smola, and M. Zinkevich. Slow learners are fast. In NIPS, pages 2331–2339, 2009.
[12] Z. Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, pages 7–35, 1992.
[13] G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In NIPS, pages 1231–1239, 2009.
[14] H. Ouyang, N. He, L. Tran, and A. G. Gray. Stochastic alternating direction method of multipliers. In ICML, pages 80–88, 2013.
[15] N. L. Roux, M. W. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2672–2680, 2012.
[16] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 2013.
[17] S. Smale and D.-X. Zhou. Estimating the approximation error in learning theory. Anal. Appl. (Singap.), 1(1):17–41, 2003.
[18] K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. In NIPS, pages 1545–1552, 2008.
[19] T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In ICML, pages 392–400, 2013.
[20] M. Takáč, A. S. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In ICML, 2013.
[21] C. H. Teo, S. Vishwanthan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. JMLR, pages 311–365, 2010.
[22] K. I. Tsianos, S. Lawlor, and M. G. Rabbat. Communication/computation tradeoffs in consensus-based distributed optimization. In NIPS, pages 1952–1960, 2012.
[23] C. Zhang, H. Lee, and K. G. Shin. Efficient distributed linear classification algorithms via the alternating direction method of multipliers. In AISTATS, pages 1398–1406, 2012.
[24] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In NIPS, pages 2595–2603, 2010.