Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

Tianbao Yang
NEC Labs America, Cupertino, CA 95014
tyang@nec-labs.com

Abstract

We present and study a distributed optimization algorithm based on a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often perform better than stochastic gradient descent methods when optimizing regularized loss minimization problems, yet little effort has gone into studying them in a distributed framework. We make progress along this line by presenting a distributed stochastic dual coordinate ascent algorithm for a star network, with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performance.

1 Introduction

In recent years, machine learning applications have seen data sizes grow at an unprecedented rate. In order to efficiently solve large-scale machine learning problems with millions or even billions of data points, it has become popular to take advantage of the computational power of multiple cores in a single machine, or of multiple machines in a cluster, and to optimize the problems in a parallel or distributed fashion [2].

In this paper, we consider the following generic optimization problem arising ubiquitously in supervised machine learning applications:

$$\min_{w\in\mathbb{R}^d} P(w), \quad\text{where } P(w)=\frac{1}{n}\sum_{i=1}^{n}\phi(w^\top x_i; y_i)+\lambda g(w), \qquad (1)$$

where $w\in\mathbb{R}^d$ denotes the linear predictor to be optimized; $(x_i, y_i)$, $x_i\in\mathbb{R}^d$, $i=1,\ldots,n$ denote the instance-label pairs of a set of data points; $\phi(z; y)$ denotes a loss function; and $g(w)$ denotes a regularizer on the linear predictor. Throughout the paper, we assume the loss function $\phi(z; y)$ is convex w.r.t. the first argument, and we refer to the problem in (1) as the Regularized Loss Minimization (RLM) problem. The RLM problem has been studied extensively in machine learning, and many efficient sequential algorithms have been developed over the past decades [8, 16, 10]. In this work, we aim to solve the problem in a distributed framework by leveraging the capabilities of tens or hundreds of CPU cores. In contrast to previous work on distributed optimization based either on (stochastic) gradient descent (GD and SGD) methods [21, 11] or on alternating direction methods of multipliers (ADMM) [2, 23], we motivate our research by recent advances in (stochastic) dual coordinate ascent (DCA and SDCA) algorithms [8, 16]. It has been observed that DCA and SDCA algorithms can have comparable, and sometimes even better, convergence speed than GD and SGD methods. However, little effort has been devoted to studying them in a distributed fashion and to comparing them with SGD-based and ADMM-based distributed algorithms.
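To make the objective in (1) concrete, the minimal sketch below evaluates $P(w)$ for an L2-regularized SVM with the squared hinge loss, one of the smooth losses considered later in the paper; the function name and the tiny synthetic data are ours, for illustration only.

```python
import numpy as np

def primal_objective(w, X, y, lam):
    """RLM objective (1) with the smooth L2 hinge (squared hinge) loss
    phi(z; y) = max(0, 1 - y*z)^2 and the regularizer g(w) = ||w||^2 / 2."""
    margins = np.maximum(0.0, 1.0 - y * (X @ w))
    return np.mean(margins ** 2) + 0.5 * lam * (w @ w)

# Tiny synthetic check: at w = 0 every example contributes phi(0) = 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = np.sign(rng.normal(size=8))
print(primal_objective(np.zeros(3), X, y, lam=0.1))  # -> 1.0
```

At $w=0$ each example contributes $\phi(0)=1$ and the regularizer vanishes, matching the assumption $\phi_i(0)\le 1$ used in the analysis below.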
In this work, we bridge the gap by developing a Distributed Stochastic Dual Coordinate Ascent (DisDCA) algorithm for solving the RLM problem. We summarize the proposed algorithm and our contributions as follows:

- The presented DisDCA algorithm possesses two key characteristics: (i) parallel computation over K machines (or cores); (ii) sequential updating of m dual variables per iteration on each machine, followed by a "reduce" step for communication among the processes. It enjoys strong guarantees on convergence rates for smooth and non-smooth loss functions.
- We analyze the tradeoff between the computation and the communication of DisDCA governed by m and K. Intuitively, increasing the number m of dual variables per iteration reduces the number of iterations to convergence and therefore mitigates the pressure caused by communication. Theoretically, our analysis reveals the effective region of m and K versus the regularization path of $\lambda$.
- We present a practical variant of DisDCA and make a comparison with distributed ADMM.
- We verify our analysis by experiments and demonstrate the effectiveness of DisDCA by comparing with SGD-based and ADMM-based distributed optimization algorithms running in the same distributed framework.

2 Related Work

Recent years have seen a great emergence of distributed algorithms for solving machine learning problems [2, 9]. In this section, we focus our review on distributed optimization techniques, many of which are based on stochastic gradient descent methods or alternating direction methods of multipliers.

Distributed SGD methods utilize the computing resources of multiple machines to handle a large number of examples simultaneously, which to some extent alleviates the high per-iteration computational load of GD methods and also improves on the performance of sequential SGD methods. The simplest implementation of a distributed SGD method is to calculate stochastic gradients on multiple machines and to collect these stochastic gradients for updating the solution on a master machine. This idea has been implemented in MapReduce frameworks [13, 4] and MPI frameworks [21, 11]. Many variants of GD methods have been deployed in a similar style [1]. ADMM has been employed for solving machine learning problems in a distributed fashion [2, 23] due to its superior convergence and performance [5, 23]. The original ADMM [7] was proposed for solving equality-constrained minimization problems. The algorithms that adopt ADMM for solving RLM problems in a distributed framework are based on the idea of global variable consensus. Recently, several works [19, 14] have made efforts to extend ADMM to online or stochastic versions; however, these suffer from relatively slow convergence rates.

The advances in DCA and SDCA algorithms [12, 8, 16] motivate the present work. These studies have shown that in some regimes (e.g., when a relatively accurate solution is needed), SDCA can outperform SGD methods. In particular, S. Shalev-Shwartz and T. Zhang [16] have derived new bounds on the duality gap, which have been shown to be superior to earlier results. However, there is still a lack of effort in extending these types of methods to a distributed fashion and comparing them with SGD-based and ADMM-based distributed algorithms. We bridge this gap by presenting and studying a distributed stochastic dual coordinate ascent algorithm. It has been brought to our attention that M. Takáč et al. [20] have recently published a paper studying the parallel speedup of mini-batch primal and dual methods for SVMs with the hinge loss, establishing convergence bounds for mini-batch Pegasos and SDCA that depend on the size of the mini-batch. Our work differentiates itself from theirs in that (i) we explicitly take into account the tradeoff between computation and communication; (ii) we present a more practical variant and compare the proposed algorithm with ADMM in view of solving the subproblems; and (iii) we conduct empirical studies for comparison with these algorithms. Other related but different work includes [3], which presents Shotgun, a parallel coordinate descent algorithm for solving $\ell_1$-regularized minimization problems.

There are other issues unique to distributed optimization, e.g., synchronization vs. asynchronization, star networks vs. arbitrary networks. All these issues relate to the tradeoff between communication and computation [22, 24]. Research in these aspects is beyond the scope of this work and can be considered future work.
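The simplest distributed SGD scheme described above, in which workers compute stochastic gradients on their own shards and a master averages them, can be sketched in a few lines. This is a single-process simulation under our own choices (logistic loss, four simulated workers, synthetic data), not code from any of the cited systems.

```python
import numpy as np

def worker_grad(w, X_shard, y_shard, lam, batch, rng):
    """One worker: a mini-batch stochastic gradient of the regularized
    logistic loss, log(1 + exp(-y w.x)) + lam/2 ||w||^2, on its own shard."""
    idx = rng.integers(0, X_shard.shape[0], size=batch)
    Xb, yb = X_shard[idx], y_shard[idx]
    p = 1.0 / (1.0 + np.exp(np.clip(yb * (Xb @ w), -30, 30)))
    return -(yb * p) @ Xb / batch + lam * w

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = np.sign(X @ rng.normal(size=5))          # linearly separable labels
shards = np.array_split(np.arange(400), 4)   # data split across 4 "machines"
w, lam, step = np.zeros(5), 1e-3, 0.5
for _ in range(200):                         # master averages the worker gradients
    g = np.mean([worker_grad(w, X[s], y[s], lam, 32, rng) for s in shards], axis=0)
    w -= step * g
acc = np.mean(np.sign(X @ w) == y)
```

Each round communicates one d-dimensional gradient per worker and one model broadcast, which is exactly the communication pattern the computation/communication tradeoff in this paper is concerned with.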
3 Distributed Stochastic Dual Coordinate Ascent

In this section, we present the distributed stochastic dual coordinate ascent (DisDCA) algorithm and its convergence bound, and analyze the tradeoff between computation and communication. We also present a practical variant of DisDCA and make a comparison with ADMM.

We first present some notation and preliminaries. For simplicity of presentation, we let $\phi_i(w^\top x_i)=\phi(w^\top x_i; y_i)$. Let $\phi_i^*(\alpha)$ and $g^*(v)$ be the convex conjugates of $\phi_i(z)$ and $g(w)$, respectively. We assume $g^*(v)$ is continuously differentiable. It is easy to show that the problem in (1) has the dual problem given below:

$$\max_{\alpha\in\mathbb{R}^n} D(\alpha), \quad\text{where } D(\alpha)=\frac{1}{n}\sum_{i=1}^{n}-\phi_i^*(-\alpha_i)-\lambda g^*\Big(\frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i x_i\Big). \qquad (2)$$

Let $w_*$ be the optimal solution to the primal problem (1) and $\alpha_*$ be the optimal solution to the dual problem (2). If we define $v(\alpha)=\frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i x_i$ and $w(\alpha)=\nabla g^*(v(\alpha))$, it can be verified that $w(\alpha_*)=w_*$ and $P(w_*)=D(\alpha_*)$.

In this paper, we aim to optimize the dual problem (2) in a distributed environment where the data are distributed evenly across K machines. Let $(x_{k,i}, y_{k,i})$, $i=1,\ldots,n_k$, denote the training examples on machine k. For ease of analysis, we assume $n_k=n/K$. We denote by $\alpha_{k,i}$ the dual variable associated with $x_{k,i}$, and by $\phi_{k,i}(\cdot)$, $\phi_{k,i}^*(\cdot)$ the corresponding loss function and its convex conjugate. To simplify the analysis of our algorithm, and without loss of generality, we make the following assumptions about the problem:

- $\phi_i(z)$ is either a $(1/\gamma)$-smooth function or an L-Lipschitz continuous function (cf. the definitions given below). Exemplar smooth loss functions include the L2 hinge loss $\phi_i(z)=\max(0, 1-y_i z)^2$ and the logistic loss $\phi_i(z)=\log(1+\exp(-y_i z))$. Commonly used Lipschitz continuous functions are the L1 hinge loss $\phi_i(z)=\max(0, 1-y_i z)$ and the absolute loss $\phi_i(z)=|y_i-z|$.
- $g(w)$ is a 1-strongly convex function w.r.t. $\|\cdot\|_2$. Examples include the squared $\ell_2$ norm $\frac{1}{2}\|w\|_2^2$ and the elastic net $\frac{1}{2}\|w\|_2^2+\|w\|_1$.
- For all i, $\|x_i\|_2\le 1$, $\phi_i(z)\ge 0$, and $\phi_i(0)\le 1$.

Definition 1. A function $\phi(z):\mathbb{R}\to\mathbb{R}$ is L-Lipschitz continuous if for all $a, b\in\mathbb{R}$, $|\phi(a)-\phi(b)|\le L|a-b|$. A function $\phi(z):\mathbb{R}\to\mathbb{R}$ is $(1/\gamma)$-smooth if it is differentiable and its gradient $\nabla\phi(z)$ is $(1/\gamma)$-Lipschitz continuous, or, for all $a, b\in\mathbb{R}$, we have $\phi(a)\le\phi(b)+(a-b)\nabla\phi(b)+\frac{1}{2\gamma}(a-b)^2$. A convex function $g(w):\mathbb{R}^d\to\mathbb{R}$ is $\beta$-strongly convex w.r.t. a norm $\|\cdot\|$ if for any $s\in[0,1]$ and $w_1, w_2\in\mathbb{R}^d$, $g(sw_1+(1-s)w_2)\le s\,g(w_1)+(1-s)\,g(w_2)-\frac{\beta}{2}s(1-s)\|w_1-w_2\|^2$.

3.1 DisDCA Algorithm: The Basic Variant

The detailed steps of the basic variant of the DisDCA algorithm are described by the pseudocode in Figure 1. The algorithm deploys K processes running simultaneously on K machines (or cores)¹, each of which accesses only its associated training examples. Each machine calls the same procedure SDCA-mR, where "mR" manifests the two unique characteristics of SDCA-mR compared to SDCA. (i) At each iteration of the outer loop, m examples, instead of one, are randomly sampled for updating their dual variables. This is implemented by an inner loop that costs the most computation at each outer iteration. (ii) After updating the m randomly selected dual variables, it invokes a function Reduce to collect the updated information from all K machines, which accommodates naturally to the distributed environment. The Reduce function acts exactly like MPI::AllReduce if one implements the algorithm in an MPI framework. It essentially sends $\Delta v_k=\frac{1}{\lambda n}\sum_{j=1}^{m}\Delta\alpha_{k,i_j}x_{k,i_j}$ to one process, adds all of them to $v_{t-1}$, and then broadcasts the updated $v_t$ to all K processes. It is this step that involves communication among the K machines. Intuitively, a smaller m yields less computation and slower convergence, and therefore more communication, and vice versa. In the next subsection, we give a rigorous analysis of the convergence, computation, and communication.

Remark: The goal of the updates is to increase the dual objective. The particular options presented in the routine IncDual maximize lower bounds of the dual objective. More options are provided in the supplementary materials. The solutions to Option I have closed forms for several loss functions (e.g., the L1 and L2 hinge losses, the square loss, and the absolute loss) [16]. Note that, different from the options presented in [16], the ones in IncDual use a slightly different scalar factor mK in the quadratic term to account for the number of updated dual variables.

¹We use process and machine interchangeably.

DisDCA Algorithm (The Basic Variant)
Start K processes by calling the following procedure SDCA-mR with input m and T.

Procedure SDCA-mR
Input: number of iterations T, number of samples m at each iteration
Let: $\alpha^0_k=0$, $v_0=0$, $w_0=\nabla g^*(0)$
Read data: $(x_{k,i}, y_{k,i})$, $i=1,\ldots,n_k$
Iterate: for $t=1,\ldots,T$
  Iterate: for $j=1,\ldots,m$
    Randomly pick $i\in\{1,\ldots,n_k\}$ and let $i_j=i$
    Find $\Delta\alpha_{k,i}$ by calling routine IncDual($w=w_{t-1}$, scl $=mK$)
    Set $\alpha^t_{k,i}=\alpha^{t-1}_{k,i}+\Delta\alpha_{k,i}$
  Reduce: add $\frac{1}{\lambda n}\sum_{j=1}^{m}\Delta\alpha_{k,i_j}x_{k,i_j}$ from all machines to $v_{t-1}$ to obtain $v_t$
  Update: $w_t=\nabla g^*(v_t)$

Routine IncDual(w, scl)
Option I: Let $\Delta\alpha_{k,i}$ maximize $-\phi^*_{k,i}\big(-(\alpha^{t-1}_{k,i}+\Delta\alpha)\big)-\Delta\alpha\, x_{k,i}^\top w-\frac{\mathrm{scl}}{2\lambda n}(\Delta\alpha)^2\|x_{k,i}\|_2^2$
Option II: Let $z^{t-1}_{k,i}=-\nabla\phi_{k,i}(x_{k,i}^\top w)-\alpha^{t-1}_{k,i}$ and let $\Delta\alpha_{k,i}=s_{k,i}\,z^{t-1}_{k,i}$, where $s_{k,i}\in[0,1]$ maximizes
$$s\big(\phi_{k,i}(x_{k,i}^\top w_{t-1})+\phi^*_{k,i}(-\alpha^{t-1}_{k,i})+z^{t-1}_{k,i}x_{k,i}^\top w\big)+\frac{\gamma\, s(1-s)}{2}(z^{t-1}_{k,i})^2-\frac{\mathrm{scl}}{2\lambda n}s^2(z^{t-1}_{k,i})^2\|x_{k,i}\|_2^2$$

Figure 1: The Basic Variant of the DisDCA Algorithm

3.2 Convergence Analysis: Tradeoff between Computation and Communication

In this subsection, we present the convergence bound of the DisDCA algorithm and analyze the tradeoff between computation, convergence, and communication. The theorem below states the convergence rate of the DisDCA algorithm for smooth loss functions (the omitted proofs and other derivations can be found in the supplementary materials).

Theorem 1. For a $(1/\gamma)$-smooth loss function $\phi_i$ and a 1-strongly convex function $g(w)$, to obtain an $\epsilon_P$ duality gap $\mathbb{E}[P(w_T)-D(\alpha_T)]\le\epsilon_P$, it suffices to have
$$T\ge\Big(\frac{n}{mK}+\frac{1}{\lambda\gamma}\Big)\log\Big(\Big(\frac{n}{mK}+\frac{1}{\lambda\gamma}\Big)\frac{1}{\epsilon_P}\Big).$$

Remark: In [20], the authors established a convergence bound of mini-batch SDCA for the L1-SVM that depends on the spectral norm of the data. Applying their trick to our algorithmic framework is equivalent to replacing the scalar mK in the DisDCA algorithm with $\beta_{mK}$, which characterizes the spectral norm of the sampled data across all machines, $X_{mK}=(x_{11},\ldots,x_{1m},\ldots,x_{Km})$. The resulting convergence bound for $(1/\gamma)$-smooth loss functions is given by substituting the term $\frac{1}{\lambda\gamma}$ with $\frac{\beta_{mK}}{mK}\frac{1}{\lambda\gamma}$. The value of $\beta_{mK}$ is usually smaller than mK, and the authors in [20] provided an expression for computing $\beta_{mK}$ based on the spectral norm $\beta$ of the data matrix $X/\sqrt{n}=(x_1,\ldots,x_n)/\sqrt{n}$. However, in practice the value of $\beta$ cannot be computed exactly. A safe upper bound of $\beta=1$ (assuming $\|x_i\|_2\le 1$) gives the value mK to $\beta_{mK}$, which reduces to the scalar presented in Figure 1. The authors in [20] also presented an aggressive variant that adjusts $\beta$ adaptively, and observed improvements. In Section 3.3, we develop a practical variant that enjoys more speed-up compared to the basic variant and their aggressive variant.

Tradeoff between Computation and Communication. We are now ready to discuss the tradeoff between computation and communication based on the worst-case analysis indicated by Theorem 1. For the analysis of the tradeoff governed by the number of samples m and the number of machines K, we fix the number of examples n and the number of dimensions d. When we analyze the tradeoff involving m, we fix K, and vice versa. In the following analysis, we assume the size of the model to be communicated is a fixed d and is independent of m, though in some cases (e.g., high-dimensional sparse data) one may communicate a smaller amount of data that depends on m. It is notable that the bound on the number of iterations contains a term $1/(\lambda\gamma)$. To take this term into account, we first consider an interesting region of $\lambda$ for achieving a good generalization error. Several pieces of work [17, 18, 6] have suggested that, in order to obtain an optimal generalization error, the optimal $\lambda$ scales like $\Theta(n^{-1/(1+\tau)})$, where $\tau\in(0,1]$. For example, the analysis in [18] suggests $\lambda=\frac{1}{\sqrt{n}}$ for SVMs.

First, we consider the tradeoff involving the number of samples m, fixing the number of processes K. Note that the communication cost is proportional to the number of iterations $T=\Omega\big(\frac{n}{mK}+\frac{n^{1/(1+\tau)}}{\gamma}\big)$, while the computation cost per node is proportional to $mT=\Omega\big(\frac{n}{K}+\frac{m\,n^{1/(1+\tau)}}{\gamma}\big)$, because each iteration involves m examples. When $m\le\frac{\gamma n^{\tau/(1+\tau)}}{K}$, the communication cost decreases as m increases, and the computation cost increases as m increases, though it is dominated by $\Omega(n/K)$. When the value of m is greater than $\frac{\gamma n^{\tau/(1+\tau)}}{K}$, the communication cost is dominated by $\Omega\big(\frac{n^{1/(1+\tau)}}{\gamma}\big)$; then increasing the value of m becomes less influential in reducing the communication cost, while the computation cost blows up substantially.

Similarly, we can understand how the number of nodes K affects the tradeoff between the communication cost, proportional to $\tilde\Omega(KT)=\tilde\Omega\big(\frac{n}{m}+\frac{K\,n^{1/(1+\tau)}}{\gamma}\big)$², and the computation cost, proportional to $\Omega\big(\frac{n}{K}+\frac{m\,n^{1/(1+\tau)}}{\gamma}\big)$. When $K\le\frac{\gamma n^{\tau/(1+\tau)}}{m}$, as K increases the computation cost decreases and the communication cost increases. When K is greater than $\frac{\gamma n^{\tau/(1+\tau)}}{m}$, the computation cost is dominated by $\Omega\big(\frac{m\,n^{1/(1+\tau)}}{\gamma}\big)$ and the effect of increasing K on reducing the computation cost diminishes.

According to the above analysis, we conclude that when $mK\le\Theta(\lambda\gamma n)$, to which we refer as the effective region of m and K, the communication cost can be reduced by increasing the number of samples m, and the computation cost can be reduced by increasing the number of nodes K. Meanwhile, increasing the number of samples m increases the computation cost, and similarly, increasing the number of nodes K increases the communication cost. It is notable that the larger the value of $\lambda$, the wider the effective region of m and K, and vice versa. To verify the tradeoff between communication and computation, we present empirical studies in Section 4.

Although the smooth loss functions are the most interesting, we present in the theorem below the convergence of DisDCA for Lipschitz continuous loss functions.

Theorem 2. For an L-Lipschitz continuous loss function $\phi_i$ and a 1-strongly convex function $g(w)$, to obtain an $\epsilon_P$ duality gap $\mathbb{E}[P(\bar w_T)-D(\bar\alpha_T)]\le\epsilon_P$, it suffices to have
$$T\ge\frac{4L^2}{\lambda\epsilon_P}+T_0+\frac{n}{mK},\qquad T_0\ge\frac{20L^2}{\lambda\epsilon_P}+\max\Big(0,\frac{n}{mK}\log\frac{\lambda n}{2mKL^2}\Big)+\frac{n}{mK},$$
where $\bar w_T=\sum_{t=T_0}^{T-1}w_t/(T-T_0)$ and $\bar\alpha_T=\sum_{t=T_0}^{T-1}\alpha_t/(T-T_0)$.

Remark: In this case, the effective region of m and K is $mK\le\Theta(\lambda n\epsilon_P)$, which is narrower than that for smooth loss functions, especially when $\epsilon_P\le\gamma$. Similarly, if one can obtain an accurate estimate of the spectral norm of all the data and use $\beta_{mK}$ in place of mK in Figure 1, the convergence bound can be improved with $\frac{4L^2}{\lambda\epsilon_P}\frac{\beta_{mK}}{mK}$ in place of $\frac{4L^2}{\lambda\epsilon_P}$. Again, the practical variant presented in the next section yields more speed-up.
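Plugging illustrative numbers into the Theorem 1 bound makes the tradeoff tangible: communication rounds (proportional to T) drop as m grows until the $1/(\lambda\gamma)$ term dominates, while per-node computation (proportional to mT) keeps growing. The data size, $\epsilon$, and constants below are our own choices for illustration, not values from the paper's experiments.

```python
import math

def iters_bound(n, m, K, lam, gamma, eps=1e-3):
    """Worst-case iteration count from Theorem 1 (up to constants):
    T ~ (n/(mK) + 1/(lam*gamma)) * log((n/(mK) + 1/(lam*gamma)) / eps)."""
    c = n / (m * K) + 1.0 / (lam * gamma)
    return c * math.log(c / eps)

n, K, gamma = 500_000, 10, 1.0
lam = 1.0 / math.sqrt(n)          # lambda ~ n^{-1/2}, the SVM scaling suggested by [18]
for m in (1, 10, 100, 1000, 10000):
    T = iters_bound(n, m, K, lam, gamma)
    print(f"m={m:>6}: rounds ~ {T:12,.0f}   per-node work ~ m*T = {m * T:14,.0f}")
```

With these numbers, $\lambda\gamma n=\sqrt{n}\approx 707$, so with K = 10 the effective region ends near $m\approx 70$: beyond that, the printed round count flattens while per-node work keeps climbing, which is exactly the regime change described above.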
²We simply ignore the communication delay in our analysis.

The practical updates at the t-th iteration:
Initialize: $u^0_t=w_{t-1}$
Iterate: for $j=1,\ldots,m$
  Randomly pick $i\in\{1,\ldots,n_k\}$ and let $i_j=i$
  Find $\Delta\alpha_{k,i}$ by calling routine IncDual($w=u^{j-1}_t$, scl $=K$)
  Update $\alpha^t_{k,i}=\alpha^{t-1}_{k,i}+\Delta\alpha_{k,i}$ and update $u^j_t=u^{j-1}_t+\frac{1}{\lambda n_k}\Delta\alpha_{k,i}x_{k,i}$

Figure 2: the updates at the t-th iteration of the practical variant of DisDCA

3.3 A Practical Variant of DisDCA and a Comparison with ADMM

In this section, we first present a practical variant of DisDCA motivated by intuition, and then we make a comparison between DisDCA and ADMM, which provides more insight into the practical variant of DisDCA and the differences between the two algorithms. In what follows, we are particularly interested in $\ell_2$-norm regularization, where $g(w)=\|w\|_2^2/2$ and $v=w$.

A Practical Variant. We note that in the basic variant (Figure 1), when updating the values of the subsequently sampled dual variables, the algorithm does not use the updated information, but instead uses $w_{t-1}$ from the last iteration. A potential improvement is therefore to leverage the up-to-date information when updating the dual variables. To this end, we maintain a local copy of w on each machine. At the beginning of iteration t, all copies $w^0_k$, $k=1,\ldots,K$, are synchronized with the global $w_{t-1}$. Then, on each machine, the j-th sampled dual variable is updated by IncDual($w^{j-1}_k$, K), and the local copy is updated by $w^j_k=w^{j-1}_k+\frac{1}{\lambda n_k}\Delta\alpha_{k,i_j}x_{k,i_j}$ for updating the next dual variable. At the end of the iteration, the local solutions are synchronized into the global variable $w_t=w_{t-1}+\frac{1}{\lambda n}\sum_{k=1}^{K}\sum_{j=1}^{m}\Delta\alpha^t_{k,i_j}x_{k,i_j}$. It is important to note that the scalar factor in IncDual is now K, because the dual variables are updated incrementally and there are K processes running in parallel. The detailed steps are presented in Figure 2, where we abuse the same notation $u^j_t$ for the local variable on all processes. The experiments in Section 4 verify the improvement of the practical variant over the basic variant. It remains an open problem to us what the convergence bound of this practical variant is. However, next we establish a connection between DisDCA and ADMM that sheds light on the motivation behind the practical variant and the differences between the two algorithms.

A Comparison with ADMM. First, we note that the goal of the updates at each iteration of DisDCA is to increase the dual objective by maximizing the following objective:

$$\max_{\alpha_1,\ldots,\alpha_m}\ \frac{1}{n_k}\sum_{i=1}^{m}-\phi^*_i(-\alpha_i)-\frac{\lambda}{2}\Big\|\hat w_{t-1}+\frac{1}{\lambda n_k}\sum_{i=1}^{m}\alpha_i x_i\Big\|_2^2, \qquad (3)$$

where $\hat w_{t-1}=w_{t-1}-\frac{1}{\lambda n_k}\sum_{i=1}^{m}\alpha^{t-1}_i x_i$, and we suppress the subscript k associated with each machine. The updates presented in Figure 1 are solutions to maximizing lower bounds of the above objective, obtained by decoupling the m dual variables. It is not difficult to derive that the dual problem in (3) has the following primal problem (a detailed derivation and others can be found in the supplementary materials):

DisDCA: $$\min_{w}\ \frac{1}{n_k}\sum_{i=1}^{m}\phi_i(x_i^\top w)+\frac{\lambda}{2}\Big\|w-\Big(w_{t-1}-\frac{1}{\lambda n_k}\sum_{i=1}^{m}\alpha^{t-1}_i x_i\Big)\Big\|_2^2. \qquad (4)$$

We refer to $\hat w_{t-1}$ as the penalty solution. Second, let us recall the updating scheme in ADMM. The (deterministic) ADMM algorithm at iteration t solves the following problem on each machine:

ADMM: $$w^t_k=\arg\min_{w}\ \frac{1}{n_k}\sum_{i=1}^{n_k}\phi_i(x_i^\top w)+\frac{\rho}{2}\big\|w-\underbrace{(w_{t-1}-u^{t-1}_k)}_{\hat w^{t-1}_k}\big\|_2^2, \qquad (5)$$

where $\rho$ is a penalty parameter and $w_{t-1}$ is the global primal variable, updated by
$$w_t=\frac{\rho K(\bar w_t+\bar u_{t-1})}{\lambda+\rho K},\quad\text{with } \bar w_t=\frac{1}{K}\sum_{k=1}^{K}w^t_k,\ \ \bar u_{t-1}=\frac{1}{K}\sum_{k=1}^{K}u^{t-1}_k,$$
and $u^{t-1}_k$ is the local dual variable, updated by $u^t_k=u^{t-1}_k+w^t_k-w_t$.

Comparing the subproblem (4) in DisDCA and the subproblem (5) in ADMM leads to the following observations. (1) Both aim at solving the same type of problem to increase the dual objective or decrease the primal objective. DisDCA uses only m randomly selected examples, while ADMM uses all examples. (2) However, the penalty solution $\hat w_{t-1}$ and the penalty parameter are different. In DisDCA, $\hat w_{t-1}$ is constructed by subtracting from the global solution the local solution defined by the dual variables, while in ADMM it is constructed by subtracting from the global solution the local Lagrangian variables $u_k$. The penalty parameter in DisDCA is given by the regularization parameter $\lambda$, while in ADMM it is a parameter $\rho$ that needs to be specified by the user.

Now, let us explain the practical variant of DisDCA from the viewpoint of inexactly solving the subproblem (4). Note that if the optimal solution to (3) is denoted by $\alpha^*_i$, $i=1,\ldots,m$, then the optimal solution $u^*$ to (4) is given by $u^*=\hat w_{t-1}+\frac{1}{\lambda n_k}\sum_{i=1}^{m}\alpha^*_i x_i$. In fact, the updates at the t-th iteration of the practical variant of DisDCA optimize the subproblem (4) by the SDCA algorithm with only one pass over the sampled data points and an initialization of $\alpha^0_i=\alpha^{t-1}_i$, $i=1,\ldots,m$. This means that the initial primal solution for solving the subproblem (3) is $u^0=\hat w_{t-1}+\frac{1}{\lambda n_k}\sum_{i=1}^{m}\alpha^{t-1}_i x_i=w_{t-1}$, which explains the initialization step in Figure 2. In a recent work [23] applying ADMM to solving the L2-SVM problem in the same distributed fashion, the authors exploited different strategies for solving the subproblem (5) associated with L2-SVM, among which the DCA algorithm with only one pass over all data points gives the best performance in terms of running time (e.g., it is better than DCA with several passes over all data points, and also better than a trust-region Newton method). This validates, from another point of view, the practical variant of DisDCA. Finally, it is worth mentioning that, unlike ADMM, whose performance is significantly affected by the value of the penalty parameter, DisDCA is a parameter-free algorithm.

4 Experiments

In this section, we present experimental results to verify the theoretical analysis and the empirical performance of the proposed algorithms. We implement the algorithms in C++ and OpenMPI and run them on a cluster, launching only one process per machine. The experiments are performed on two large data sets with different numbers of features, covtype and kdd. The covtype data has a total of 581,012 examples and 54 features. The kdd data is a large data set used in KDD Cup 2010, which contains 19,264,097 training examples and 29,890,095 features. For the covtype data, we use 522,911 examples for training. We apply the algorithms to solving two SVM formulations, namely L2-SVM with the squared hinge loss and L1-SVM with the hinge loss, to demonstrate the capabilities of DisDCA for solving smooth loss functions and Lipschitz continuous loss functions. In the legends of the figures, we use DisDCA-b to denote the basic variant, DisDCA-p to denote the practical variant, and DisDCA-a to denote the aggressive variant of DisDCA [20].

Tradeoff between Communication and Computation. To verify the convergence analysis, we show in Figures 3(a), 3(b), 3(d), 3(e) the duality gap of the basic variant and the practical variant of the DisDCA algorithm versus the number of iterations, varying the number of samples m per iteration, the number of machines K, and the values of $\lambda$. The results verify the convergence bound in Theorem 1. As the values of m or K begin to increase, the performance improves. However, when their values exceed a certain number, the impact of increasing m or K diminishes. Additionally, the larger the value of $\lambda$, the wider the effective region of m and K. It is notable that the effective region of m and K of the practical variant is much larger than that of the basic variant. We also briefly report a running-time result: to obtain an $\epsilon=10^{-3}$ duality gap for optimizing L2-SVM on the covtype data with $\lambda=10^{-3}$, the running times of DisDCA-p with $m=1, 10, 10^2, 10^3$ (fixing K = 10) are 30, 4, 0, 5 seconds³, respectively, and the running times with K = 1, 5, 10, 20 (fixing m = 100) are 3, 0, 0, 1 seconds, respectively. The speed-up gain on the kdd data from increasing m is even larger because the communication cost is much higher. In the supplement, we present more results visualizing the communication and computation tradeoff.

³0 second means less than 1 second. We exclude the time for computing the duality gap at each iteration.

Figure 3: (a, b): duality gap with varying m; (d, e): duality gap with varying K; (c, f): comparison of different algorithms for optimizing SVMs. More results can be found in the supplementary materials.

The Practical Variant vs. The Basic Variant. To further demonstrate the usefulness of the practical variant, we present a comparison between the practical variant and the basic variant for optimizing the two SVM formulations in the supplementary material. We also include the performance of the aggressive variant proposed in [20], applying the aggressive updates on the m sampled examples in each machine without incurring additional communication cost. The results show that the practical variant converges much faster than the basic variant and the aggressive variant.

Comparison with Other Baselines. Lastly, we compare DisDCA with SGD-based and ADMM-based distributed algorithms running in the same distributed framework. For optimizing L2-SVM, we implement the stochastic average gradient (SAG) algorithm [15], which also enjoys linear convergence for smooth and strongly convex problems. We use the constant step size $1/L_s$ suggested by the authors for obtaining good practical performance, where $L_s$ denotes the smoothness parameter of the problem, set to $2R+\lambda$ given $\|x_i\|_2^2\le R$ for all i. For optimizing L1-SVM, we compare to stochastic Pegasos. For the ADMM-based algorithms, we implement the stochastic ADMM of [14] (ADMM-s) and the deterministic ADMM of [23] (ADMM-dca), which employs the DCA algorithm for solving the subproblems. In the stochastic ADMM, there is a step-size parameter $\eta_t\propto 1/\sqrt{t}$; we choose the best initial step size in the range $[10^{-3}, 10^3]$. We run all algorithms on K = 10 machines and set $m=10^4$, $\lambda=10^{-6}$ for all stochastic algorithms. For the parameter $\rho$ in ADMM, we find that $\rho=10^{-6}$ yields good performance after searching over a range of values. We compare DisDCA with SAG, Pegasos, and ADMM-s in Figures 3(c) and 3(f)⁴, which clearly demonstrate that DisDCA is a strong competitor for optimizing SVMs. In the supplement, we compare DisDCA with $m=n_k$ against ADMM-dca with four different values of $\rho=10^{-6}, 10^{-4}, 10^{-2}, 1$ on kdd. The results show that the performance deteriorates significantly if $\rho$ is not appropriately set, while DisDCA produces comparable performance without any additional effort in tuning a parameter.

5 Conclusions

We have presented a distributed stochastic dual coordinate ascent algorithm and its convergence rates, and analyzed the tradeoff between computation and communication. The practical variant has substantial improvements over the basic variant and other variants. We also made a comparison with other distributed algorithms and observed competitive performance.
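To make the practical variant concrete, the following is a minimal single-process simulation of the Figure 2 updates for the L1-SVM (hinge loss, $g(w)=\|w\|_2^2/2$), using the closed-form Option I update for the hinge loss with scl = K. It is a sketch on synthetic data, not the paper's C++/OpenMPI implementation, and all names are ours.

```python
import numpy as np

def disdca_practical(X, y, lam, K, m, T, seed=0):
    """Single-process simulation of the practical DisDCA updates (Figure 2)
    for the L1-SVM: hinge loss, g(w) = ||w||^2/2, closed-form Option I, scl = K."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    nk = n // K                                  # examples per machine (n divisible by K)
    shards = [np.arange(k * nk, (k + 1) * nk) for k in range(K)]
    alpha, w = np.zeros(n), np.zeros(d)
    for _ in range(T):
        delta_v = np.zeros(d)                    # (1/(lam*n)) * sum of dalpha * x over machines
        for k in range(K):                       # the K machines run this in parallel in reality
            u = w.copy()                         # local copy synced to the global w_{t-1}
            for _ in range(m):
                i = shards[k][rng.integers(nk)]
                xi, yi = X[i], y[i]
                beta = alpha[i] * yi             # for the hinge loss, beta must stay in [0, 1]
                step = lam * n * (1.0 - yi * (u @ xi)) / (K * (xi @ xi))
                dalpha = yi * (np.clip(beta + step, 0.0, 1.0) - beta)
                alpha[i] += dalpha
                u += dalpha * xi / (lam * nk)    # incremental local update (Figure 2)
                delta_v += dalpha * xi / (lam * n)
        w = w + delta_v                          # Reduce: synchronize the global solution
    return w, alpha

def duality_gap(w, alpha, X, y, lam):
    """P(w) - D(alpha); valid here because the algorithm keeps w = v(alpha)."""
    primal = np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + 0.5 * lam * (w @ w)
    dual = np.mean(alpha * y) - 0.5 * lam * (w @ w)
    return primal - dual

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1))[:, None]   # enforce ||x_i|| <= 1
y = np.sign(X @ rng.normal(size=10))
w, alpha = disdca_practical(X, y, lam=0.01, K=4, m=25, T=20)
print(duality_gap(w, alpha, X, y, lam=0.01))
```

Because the practical variant maintains $w_t=v(\alpha_t)$ exactly, the duality gap can be monitored directly: at $(w,\alpha)=(0,0)$ it equals $P(0)-D(0)=1$, and it shrinks as the dual objective increases over the outer iterations.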
⁴The primal objective of Pegasos on covtype is above the display range.
References

[1] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In CDC, pages 5451-5452, 2012.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3:1-122, 2011.
[3] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for L1-regularized loss minimization. In ICML, 2011.
[4] C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. In NIPS, pages 281-288, 2006.
[5] W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, 2012.
[6] M. Eberts and I. Steinwart. Optimal learning rates for least squares SVMs using Gaussian kernels. In NIPS, pages 1539-1547, 2011.
[7] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl., 2:17-40, 1976.
[8] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, pages 408-415, 2008.
[9] H. Daumé III, J. M. Phillips, A. Saha, and S. Venkatasubramanian. Protocols for learning classifiers on distributed data. JMLR Proceedings Track, 22:282-290, 2012.
[10] S. Lacoste-Julien, M. Jaggi, M. W. Schmidt, and P. Pletscher. Stochastic block-coordinate Frank-Wolfe optimization for structural SVMs. CoRR, abs/1207.4747, 2012.
[11] J. Langford, A. Smola, and M. Zinkevich. Slow learners are fast. In NIPS, pages 2331-2339, 2009.
[12] Z. Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, pages 7-35, 1992.
[13] G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In NIPS, pages 1231-1239, 2009.
[14] H. Ouyang, N. He, L. Tran, and A. G. Gray. Stochastic alternating direction method of multipliers. In ICML, pages 80-88, 2013.
[15] N. L. Roux, M. W. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2672-2680, 2012.
[16] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 2013.
[17] S. Smale and D.-X. Zhou. Estimating the approximation error in learning theory. Anal. Appl. (Singap.), 1(1):17-41, 2003.
[18] K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. In NIPS, pages 1545-1552, 2008.
[19] T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In ICML, pages 392-400, 2013.
[20] M. Takáč, A. S. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In ICML, 2013.
[21] C. H. Teo, S. Vishwanathan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. JMLR, pages 311-365, 2010.
[22] K. I. Tsianos, S. Lawlor, and M. G. Rabbat. Communication/computation tradeoffs in consensus-based distributed optimization. In NIPS, pages 1952-1960, 2012.
[23] C. Zhang, H. Lee, and K. G. Shin. Efficient distributed linear classification algorithms via the alternating direction method of multipliers. In AISTATS, pages 1398-1406, 2012.
[24] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In NIPS, pages 2595-2603, 2010.