An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization

Qihang Lin (QIHANG LIN UIOWA EDU), The University of Iowa, Iowa City, IA 52245 USA
Lin Xiao (LIN XIAO MICROSOFT EDU), Microsoft Research, Redmond, WA 98052 USA

Abstract

We first propose an adaptive accelerated proximal gradient (APG) method for minimizing strongly convex composite functions with unknown convexity parameters. This method incorporates a restarting scheme to automatically estimate the strong convexity parameter and achieves a nearly optimal iteration

complexity. Then we consider the $\ell_1$-regularized least-squares ($\ell_1$-LS) problem in the high-dimensional setting. Although such an objective function is not strongly convex, it has restricted strong convexity over sparse vectors. We exploit this property by combining the adaptive APG method with a homotopy continuation scheme, which generates a sparse solution path towards optimality. This method obtains a global linear rate of convergence and its overall iteration complexity has a weaker dependency on the restricted condition number than previous work.

1. Introduction

We consider

first-order methods for minimizing composite objective functions, i.e., the problem of

$$\operatorname*{minimize}_{x \in \mathbb{R}^n} \; \big\{ \phi(x) \triangleq f(x) + \Psi(x) \big\}, \tag{1}$$

where $f(x)$ and $\Psi(x)$ are lower-semicontinuous, proper convex functions (Rockafellar, 1970, Section 7). We assume that $f$ is differentiable on an open set containing $\operatorname{dom}\Psi$ and its gradient $\nabla f$ is Lipschitz continuous on $\operatorname{dom}\Psi$, i.e., there exists a constant $L_f$ such that

$$\|\nabla f(x) - \nabla f(y)\|_2 \le L_f \|x - y\|_2, \quad \forall\, x, y \in \operatorname{dom}\Psi. \tag{2}$$

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

We also assume $\Psi(x)$ is simple (Nesterov, 2013), meaning that

for any $y \in \operatorname{dom}\Psi$, the following auxiliary problem can be solved efficiently or in closed form:

$$T_L(y) = \operatorname*{argmin}_{x} \Big\{ \nabla f(y)^T x + \frac{L}{2}\|x - y\|_2^2 + \Psi(x) \Big\}. \tag{3}$$

This is the case, e.g., when $\Psi(x) = \lambda\|x\|_1$ for any $\lambda > 0$, or $\Psi(x)$ is the indicator function of a closed convex set that admits an easy projection mapping.

The so-called proximal gradient (PG) method simply uses (3) as its update rule: $x^{(k+1)} = T_L(x^{(k)})$, for $k = 0, 1, 2, \ldots$, where $L$ is set to $L_f$ or determined by a line search procedure. The iteration complexity for the PG method is $O(L_f/\epsilon)$ (Nesterov, 2004; 2013), which means that, to obtain an $\epsilon$-optimal solution (whose objective value is within $\epsilon$ of the

optimum), the PG method needs $O(L_f/\epsilon)$ iterations. A far better iteration complexity, $O(\sqrt{L_f/\epsilon})$, can be obtained by accelerated proximal gradient (APG) methods (Nesterov, 2013; Beck & Teboulle, 2009; Tseng, 2008).

The iteration complexities above imply that both PG and APG methods have a sublinear convergence rate. However, if $f$ is strongly convex, i.e., there exists a constant $\mu_f > 0$ (the convexity parameter) such that

$$f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu_f}{2}\|x - y\|_2^2 \tag{4}$$

for all $x, y \in \operatorname{dom}\Psi$, then both PG and APG methods will achieve a linear convergence rate with the iteration complexities being $O(\kappa_f \log(1/\epsilon))$ and $O(\sqrt{\kappa_f}\log(1/\epsilon))$ (Nesterov, 2004; 2013), respectively. Here,

$\kappa_f = L_f/\mu_f$ is called the condition number of the function $f$. Since $\kappa_f$ is typically a large number, the iteration complexity of the APG methods can be significantly better than that of the PG method for ill-conditioned problems. However, in order to obtain this better complexity, the APG methods need to use the convexity parameter $\mu_f$, or a lower bound of it,
explicitly in their updates. In many applications, an effective lower bound of $\mu_f$ can be hard to estimate.

To address this problem, our first

contribution in this paper is an adaptive APG method for solving problem (1) when $f$ is strongly convex but $\mu_f$ is unknown. This method incorporates a restart scheme that can automatically estimate $\mu_f$ on the fly and achieves an iteration complexity of $O(\sqrt{\kappa_f}\,\log \kappa_f \cdot \log(1/\epsilon))$.

Even if $f$ is not strongly convex ($\mu_f = 0$), problem (1) may have special structure that may still allow the development of first-order methods with linear convergence. This is the case for the $\ell_1$-regularized least-squares ($\ell_1$-LS) problem, defined as

$$\operatorname*{minimize}_{x} \; \phi_\lambda(x) \triangleq \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1, \tag{5}$$

where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ are the problem data, and $\lambda > 0$ is a regularization

parameter. The problem has important applications in machine learning, signal processing, and statistics; see, e.g., Tibshirani (1996); Chen et al. (1998); Bruckstein et al. (2009). We are especially interested in solving this problem in the high-dimensional case ($m < n$) and when the solution, denoted as $x^\star(\lambda)$, is sparse.

In terms of the general model in (1), we have $f(x) = (1/2)\|Ax - b\|_2^2$ and $\Psi(x) = \lambda\|x\|_1$. Here $f$ has a constant Hessian $\nabla^2 f(x) = A^T A$, and we have $L_f = \rho_{\max}(A^T A)$ and $\mu_f = \rho_{\min}(A^T A)$, where $\rho_{\max}(\cdot)$ and $\rho_{\min}(\cdot)$ denote the largest and smallest eigenvalues, respectively, of a symmetric matrix. Under the assumption $m < n$, the matrix

$A^T A$ is singular, hence $\mu_f = 0$ (i.e., $f$ is not strongly convex). Therefore, we only expect sublinear convergence rates (at least globally) when using first-order optimization methods.

Nevertheless, even in the case of $m < n$, when the solution $x^\star(\lambda)$ is sparse, the PG method often exhibits fast convergence when it gets close to the optimal solution. Indeed, local linear convergence can be established for the PG method provided that the active submatrix (columns of $A$ corresponding to the nonzero entries of the sparse iterates) is well conditioned (Luo & Tseng, 1992; Hale et al., 2008; Bredies & Lorenz,

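2008). To explain this more formally, we define the restricted eigenvalues of $A$ at the sparsity level $s$ as

$$\rho_+(A, s) = \sup\Big\{ \frac{x^T A^T A x}{x^T x} : x \neq 0,\ \|x\|_0 \le s \Big\}, \qquad \rho_-(A, s) = \inf\Big\{ \frac{x^T A^T A x}{x^T x} : x \neq 0,\ \|x\|_0 \le s \Big\}, \tag{6}$$

where $s$ is a positive integer and $\|x\|_0$ denotes the number of nonzero entries of a vector $x \in \mathbb{R}^n$. From the above definitions, we have

$$\mu_f \le \rho_-(A, s) \le \rho_+(A, s) \le L_f, \quad \forall\, s > 0.$$

As discussed before, we have $\mu_f = 0$ for $m < n$. But it is still possible that $\rho_-(A, s) > 0$ holds for some $s < m$. In this case, we say that the matrix $A$ satisfies the restricted eigenvalue condition at the sparsity level $s$. Let $\operatorname{supp}(x) = \{j : x_j \neq 0\}$, and assume that $x, y \in \mathbb{R}^n$ satisfy $|\operatorname{supp}(x) \cup \operatorname{supp}(y)| \le s$. Then it can be shown (Xiao &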

Zhang, 2013, Lemma 3) that

$$f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\rho_-(A, s)}{2}\|x - y\|_2^2.$$

The above inequality gives the notion of restricted strong convexity (cf. strong convexity defined in (4)). Intuitively, if the iterates of the PG method become sparse and their supports do not fluctuate much from each other, then restricted strong convexity leads to (local) linear convergence. This is exactly what happens when the PG method speeds up while getting close to the optimal solution.

Moreover, such a local linear convergence can be exploited by a homotopy continuation strategy to obtain much faster global convergence (Hale et al.,

2008; Wright et al., 2009; Xiao & Zhang, 2013). The basic idea is to solve the $\ell_1$-LS problem (5) with a large value of $\lambda$ first, and then gradually decrease the value of $\lambda$ until the target regularization is reached. For each value of $\lambda$, Xiao & Zhang (2013) employ the PG method to solve (5) up to an adequate precision, and then use the resulting approximate solution to warm start the PG method for (5) with the next value of $\lambda$.

It is shown (Xiao & Zhang, 2013) that under suitable assumptions for sparse recovery (mainly the restricted eigenvalue condition), an appropriate homotopy strategy can ensure that all iterates of the PG method are sparse, hence linear convergence at each stage can be established. As a result, the overall iteration complexity of such a proximal-gradient homotopy (PGH) method is $\tilde{O}(\kappa_s \log(1/\epsilon))$, where $\kappa_s$ denotes the restricted condition number at some sparsity level $s > 0$, i.e.,

$$\kappa_s \triangleq \kappa(A, s) = \frac{\rho_+(A, s)}{\rho_-(A, s)}, \tag{7}$$

and the notation $\tilde{O}(\cdot)$ hides additional $\log(\kappa_s)$ factors.

Our second contribution in this paper is to show that, by using the adaptive APG method developed in this paper in a homotopy continuation scheme, we can further improve the iteration complexity for solving the $\ell_1$-LS problem to $\tilde{O}(\sqrt{\kappa_{s'}}\log(1/\epsilon))$,

where the sparsity level $s'$ is slightly larger than the one for the PGH method. We note that this result is not a trivial extension of the convergence results for the PGH method in Xiao & Zhang (2013). In particular, the adaptive APG method does not have the property of
monotone decrease, which was important for the analysis of the PGH method. In order to overcome this difficulty, we had to show a “non-blowout” property of our adaptive APG method, which is interesting in its own right.

2. An

APG method for minimizing strongly convex functions

The main iteration of the APG method is based on a composite gradient mapping introduced by Nesterov (2013). For any fixed point $y$ and a given constant $L > 0$, we define a local model of $\phi(x)$ around $y$ using a quadratic approximation of $f$ but keeping $\Psi$ intact:

$$\psi_L(x; y) = f(y) + \nabla f(y)^T (x - y) + \frac{L}{2}\|x - y\|_2^2 + \Psi(x).$$

According to (3), we have

$$T_L(y) = \operatorname*{argmin}_{x} \; \psi_L(x; y). \tag{8}$$

Then the composite gradient mapping of $f$ at $y$ is defined as

$$g_L(y) = L\,(y - T_L(y)).$$

Following (Nesterov, 2013), we also define a local Lipschitz parameter

$$S_L(y) = \frac{\|\nabla f(T_L(y)) - \nabla f(y)\|_2}{\|T_L(y) - y\|_2}.$$

With the machinery of composite gradient mapping, Nesterov

(2004; 2013) developed several variants of the APG methods. As discussed in the introduction, compared to the PG method, the iteration complexity of the accelerated methods has a better dependence on the accuracy $\epsilon$ when $f$ is not strongly convex, and a better dependence on the condition number $\kappa_f$ when $f$ is strongly convex. However, in contrast with the PG method, the better complexity bound of the APG method in the strongly convex case relies on the knowledge of the convexity parameter $\mu_f$, or an effective lower bound of it, both of which can be hard to obtain in practice.

To address this problem, we

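propose an adaptive APG method that can be applied without knowing $\mu_f$ and still obtains a linear convergence rate. To do so, we first present an APG method in Algorithm 1 and Algorithm 2, upon which the development of the adaptive APG method is based. We name this method scAPG, where “sc” stands for “strongly convex.” To use this algorithm, we need to first choose an initial optimistic estimate $L_{\min}$ for the Lipschitz constant $L_f$, with $0 < L_{\min} \le L_f$, and two adjustment parameters $\gamma_{\rm dec} \ge 1$ and $\gamma_{\rm inc} > 1$. In addition, this method requires an input parameter $\mu > 0$, which is an estimate of the true convexity parameter $\mu_f$.

Algorithm 1: $\{\hat{x}, \hat{M}\} \leftarrow \mathrm{scAPG}(x^{(0)}, L_0, \mu, \hat\epsilon)$
  parameter: $L_{\min} \ge \mu > 0$, $\gamma_{\rm dec} \ge 1$
  $x^{(-1)} \leftarrow x^{(0)}$, $\alpha_{-1} = 1$
  repeat (for $k = 0, 1, 2, \ldots$)
    $\{x^{(k+1)}, M_k, \alpha_k, g^{(k)}, S_k\} \leftarrow \mathrm{AccelLineSearch}(x^{(k)}, x^{(k-1)}, L_k, \mu, \alpha_{k-1})$
    $L_{k+1} \leftarrow \max\{L_{\min}, M_k/\gamma_{\rm dec}\}$
  until $\omega(x^{(k+1)}) \le \hat\epsilon$
  $\hat{x} \leftarrow x^{(k+1)}$, $\hat{M} \leftarrow M_k$

Algorithm 2: $\{x^{(k+1)}, M_k, \alpha_k, g^{(k)}, S_k\} \leftarrow \mathrm{AccelLineSearch}(x^{(k)}, x^{(k-1)}, L_k, \mu, \alpha_{k-1})$
  parameter: $\gamma_{\rm inc} > 1$
  $L \leftarrow L_k/\gamma_{\rm inc}$
  repeat
    $L \leftarrow L\,\gamma_{\rm inc}$
    $\alpha_k \leftarrow \sqrt{\mu/L}$
    $y^{(k)} \leftarrow x^{(k)} + \frac{\alpha_k(1 - \alpha_{k-1})}{\alpha_{k-1}(1 + \alpha_k)}\,(x^{(k)} - x^{(k-1)})$
    $x^{(k+1)} \leftarrow T_L(y^{(k)})$
  until $\phi(x^{(k+1)}) \le \psi_L(x^{(k+1)}; y^{(k)})$
  $M_k \leftarrow L$, $g^{(k)} \leftarrow M_k\,(y^{(k)} - x^{(k+1)})$, $S_k \leftarrow S_L(y^{(k)})$

The scAPG method generates the following three sequences:

$$\alpha_k = \sqrt{\mu/M_k}, \qquad y^{(k)} = x^{(k)} + \frac{\alpha_k(1 - \alpha_{k-1})}{\alpha_{k-1}(1 + \alpha_k)}\,(x^{(k)} - x^{(k-1)}), \qquad x^{(k+1)} = T_{M_k}(y^{(k)}), \tag{9}$$

where $M_k$ is found by the line-search procedure in Algorithm 2. The line search procedure starts with an estimated Lipschitz constant $L_k$, and increases its value by the factor $\gamma_{\rm inc}$ until $\phi(x^{(k+1)}) \le \psi_{M_k}(x^{(k+1)}; y^{(k)})$,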

which is sufficient to guarantee the convergence. In each iteration of Algorithm 1, the scAPG method tries to start the line search at a smaller initial value by setting $L_{k+1}$ to be $\max\{L_{\min}, M_k/\gamma_{\rm dec}\}$.

The scAPG algorithm can be considered as an extension of the constant step scheme of Nesterov (2004) for minimizing composite functions in (1) when $\mu_f > 0$. Indeed, if $M_k = L_f$, we have $\alpha_k = \sqrt{\mu/L_f}$ for all $k$ and the update for $y^{(k)}$ becomes

$$y^{(k)} = x^{(k)} + \frac{\sqrt{L_f} - \sqrt{\mu}}{\sqrt{L_f} + \sqrt{\mu}}\,(x^{(k)} - x^{(k-1)}), \tag{10}$$
which is the same as Algorithm (2.2.11) in Nesterov (2004). Note that one cannot

directly apply Algorithm 1 or Nesterov’s constant scheme to problems without strong convexity by simply setting $\mu = 0$.

Another difference from Nesterov’s method is that Algorithm 1 has an explicit stopping criterion based on the optimality residue $\omega(x^{(k+1)})$, which is defined as

$$\omega(x) \triangleq \min_{\xi \in \partial\Psi(x)} \|\nabla f(x) + \xi\|_\infty, \tag{11}$$

where $\partial\Psi(x)$ is the subdifferential of $\Psi$ at $x$. The optimality residue measures how close a solution $x$ is to the optimality condition of (1) in the sense that $\omega(x^\star) = 0$ if and only if $x^\star$ is a solution to (1).

The following theorem states that, if $\mu$ is a positive lower bound of $\mu_f$, the scAPG converges

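geometrically and it has an iteration complexity $O(\sqrt{L_f/\mu}\,\log(1/\epsilon))$.

Theorem 1. Suppose $x^\star$ is the optimal solution of (1) and $0 < \mu \le \mu_f$. Then Algorithm 1 guarantees that

$$\phi(x^{(k)}) - \phi(x^\star) \le \tau_k \Big[ \phi(x^{(0)}) - \phi(x^\star) + \frac{\mu}{2}\|x^{(0)} - x^\star\|_2^2 \Big], \tag{12}$$

$$\frac{\mu}{2}\|y^{(k)} - x^\star\|_2^2 \le \tau_k \Big[ \phi(x^{(0)}) - \phi(x^\star) + \frac{\mu}{2}\|x^{(0)} - x^\star\|_2^2 \Big], \tag{13}$$

where

$$\tau_k = \begin{cases} 1 & k = 0, \\ \prod_{i=0}^{k-1} (1 - \alpha_i) & k \ge 1. \end{cases} \tag{14}$$

Moreover, $\tau_k \le \big(1 - \sqrt{\mu/(L_f\gamma_{\rm inc})}\big)^k$.

In addition to the geometric convergence of $\phi(x^{(k)})$, this theorem states that the auxiliary sequence $y^{(k)}$ also converges to the unique optimizer $x^\star$ with a geometric rate.

If $\mu$ does not satisfy $\mu \le \mu_f$, Theorem 1 may not hold anymore. However, we can show that, in this case, Algorithm 1 will at least not blow out. More precisely, we show that $\phi(x^{(k)}) \le \phi(x^{(0)})$ for all $k \ge 1$ as long as $\mu \le L_{\min}$, which can be easily enforced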

in the implementation of the algorithm.

Lemma 1. Suppose $0 < \mu \le L_{\min}$. Then Algorithm 1 guarantees that

$$\phi(x^{(k+1)}) \le \phi(x^{(0)}) - \frac{M_k}{2}\|x^{(k+1)} - y^{(k)}\|_2^2. \tag{15}$$

The non-blowout property is also critical in our analysis of the homotopy method for solving the $\ell_1$-LS problem presented in Section 4. In particular, it helps to show the sparsity of $x^{(k)}$ once $x^{(0)}$ is sparse. (All proofs for our results are given in the supporting materials.)

3. An Adaptive APG method with restart

When applied to strongly convex minimization problems, Nesterov’s constant step scheme (10) needs to use $L_f$ and $\mu_f$ as input parameters. Thanks to the line-search technique,

Algorithm 1 does not need to know $L_f$ explicitly. However, it still needs to know the convexity parameter $\mu_f$ or a nontrivial lower bound of it in order to guarantee the geometric convergence rate given in Theorem 1.

Compared to line search on $L_f$, estimating $\mu_f$ on the fly is much more sophisticated. Nesterov (2013) suggested a restarting scheme to estimate $\mu_f$, which does not require any lower bound of $\mu_f$, and can be shown to have linear convergence (up to a logarithmic factor). In this section, we adapt his restarting technique to Algorithm 1 and obtain an adaptive APG method. This method has the same convergence guarantees as Nesterov’s scheme. However, there are two important differences, which we will elaborate on at the end of this section.

We first describe the basic idea of the restart scheme for estimating $\mu_f$. Suppose we simply run Algorithm 1 with a guessed value $\mu$. At each iteration, we can check if the inequality (12) is satisfied. If not, we must have $\mu > \mu_f$ according to Theorem 1, and therefore need to reduce $\mu$ to ensure Algorithm 1 converges at a linear rate. However, (12) cannot be evaluated because $x^\star$ is unknown. Fortunately, we can show in the following lemma that, if

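$0 < \mu \le \mu_f$, the norm of the gradient mapping $g^{(k)}$ generated in Algorithm 1 also decreases at a linear rate.

Lemma 2. Suppose $0 < \mu \le \mu_f$ and the initial point $x^{(0)}$ of Algorithm 1 is obtained by calling Algorithm 2, i.e., $\{x^{(0)}, M_{-1}, \alpha_{-1}, g^{(-1)}, S_{-1}\} \leftarrow \mathrm{AccelLineSearch}(x^{\rm ini}, x^{\rm ini}, L_{\rm ini}, \mu, 1)$ with an arbitrary $x^{\rm ini} \in \mathbb{R}^n$ and $L_{\rm ini} \ge L_{\min}$. Then, for any $k \ge 0$ in Algorithm 1, we have

$$\|g^{(k)}\|_2 \le 2\sqrt{2\tau_k}\,\frac{M_k}{\mu}\Big(1 + \frac{S_{-1}}{M_{-1}}\Big)\|g^{(-1)}\|_2. \tag{16}$$

Unlike the inequality (12), the inequality (16) can be checked explicitly and, if it does not hold, we know $\mu > \mu_f$ and need to reduce $\mu$.

Now we are ready to develop the adaptive APG method. Let $\theta_{\rm sc} \in (0, 1)$ be a desired shrinking factor. We check the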

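following two conditions at iteration $k$ of Algorithm 1:

  A: $\|g^{(k)}\|_2 \le \theta_{\rm sc}\,\|g^{(-1)}\|_2$;
  B: $2\sqrt{2\tau_k}\,\dfrac{M_k}{\mu}\Big(1 + \dfrac{S_{-1}}{M_{-1}}\Big) \le \theta_{\rm sc}$.

If A is satisfied first, then we restart Algorithm 1 with $x^{(k+1)}$ as the new starting point, set $k = 0$, and update the three quantities $g^{(-1)}$, $S_{-1}$ and $M_{-1}$ accordingly (again use $\alpha_{-1} = 1$ and $\tau_0 = 1$). If A is not satisfied but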
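B is satisfied first, it means that $\mu$ is larger than $\mu_f$. In fact, if $\mu \le \mu_f$, then combining condition B with Lemma 2 would imply that A also holds. This contradiction indicates that if B is satisfied first, we must have $\mu > \mu_f$, and we have to reduce $\mu$, say by a factor $\gamma_{\rm sc} > 1$. In this case, we restart Algorithm 1 still at $x^{(0)}$ and keep $g^{(-1)}$, $S_{-1}$ and $M_{-1}$ unchanged. If neither condition is satisfied, we continue Algorithm 1 to its next iterate until the optimality residue is smaller than a prescribed value. We present the above procedure formally in Algorithm 3, whose iteration complexity is given by the following theorem.

Algorithm 3: $\{\hat{x}, \hat{M}, \hat\mu\} \leftarrow \mathrm{AdapAPG}(x^{\rm ini}, L_{\rm ini}, \mu_0, \hat\epsilon)$
  parameter: $L_{\min} \ge \mu_0$, $\gamma_{\rm dec} \ge 1$, $\gamma_{\rm sc} > 1$, $\theta_{\rm sc} \in (0, 1)$
  $\{x^{(0)}, M_{-1}, \alpha_{-1}, g^{(-1)}, S_{-1}\} \leftarrow \mathrm{AccelLineSearch}(x^{\rm ini}, x^{\rm ini}, L_{\rm ini}, \mu_0, 1)$
  $x^{(-1)} \leftarrow x^{(0)}$, $L_0 \leftarrow M_{-1}$, $\mu \leftarrow \mu_0$, $\alpha_{-1} \leftarrow 1$, $\tau_0 \leftarrow 1$, $k \leftarrow 0$
  repeat
    $\{x^{(k+1)}, M_k, \alpha_k, g^{(k)}, S_k\} \leftarrow \mathrm{AccelLineSearch}(x^{(k)}, x^{(k-1)}, L_k, \mu, \alpha_{k-1})$
    $\tau_{k+1} \leftarrow \tau_k(1 - \alpha_k)$
    if condition A holds, then
      $x^{(0)} \leftarrow x^{(k+1)}$, $x^{(-1)} \leftarrow x^{(k+1)}$, $g^{(-1)} \leftarrow g^{(k)}$, $M_{-1} \leftarrow M_k$, $S_{-1} \leftarrow S_k$, $L_0 \leftarrow M_k$, $k \leftarrow 0$
    else
      if condition B holds, then
        $\mu \leftarrow \mu/\gamma_{\rm sc}$, $k \leftarrow 0$
      else
        $L_{k+1} \leftarrow \max\{L_{\min}, M_k/\gamma_{\rm dec}\}$, $k \leftarrow k + 1$
      end if
    end if
  until $\omega(x^{(k+1)}) \le \hat\epsilon$
  $\hat{x} \leftarrow x^{(k+1)}$, $\hat{M} \leftarrow M_k$, $\hat\mu \leftarrow \mu$

Theorem 2. Assume $\mu_0 > \mu_f > 0$. Let $g^{\rm ini}$ denote the first $g^{(-1)}$ computed by Algorithm 3, and $N_A$ and $N_B$ the number of times that conditions A and B are satisfied, respectively. Then

$$N_A \le \Big\lceil \log_{1/\theta_{\rm sc}} \Big( \Big(1 + \frac{L_f}{L_{\min}}\Big) \frac{\|g^{\rm ini}\|_2}{\hat\epsilon} \Big) \Big\rceil \quad \text{and} \quad N_B \le \Big\lceil \log_{\gamma_{\rm sc}} \Big( \frac{\mu_0}{\mu_f} \Big) \Big\rceil,$$

and the total number of iterations is at most

$$(N_A + N_B)\,\sqrt{\frac{L_f \gamma_{\rm inc}\gamma_{\rm sc}}{\mu_f}}\;\ln\Big( 8 \Big( \frac{L_f \gamma_{\rm inc}\gamma_{\rm sc}}{\mu_f \theta_{\rm sc}} \Big)^2 \Big( 1 + \frac{L_f}{L_{\min}} \Big)^2 \Big).$$

Note that if $0 < \mu_0 \le \mu_f$, then $N_B = 0$.

The total number of iterations given in Theorem 2 is asymptotically $O\big(\kappa_f^{1/2}\log(\kappa_f)\log(1/\hat\epsilon)\big)$. This is the same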

complexity as for the restart scheme proposed by Nesterov for his accelerated dual gradient (ADG) method (Nesterov, 2013, Section 5.3). Despite using a similar restart scheme and having the same complexity bound, here we elaborate on some important differences between our method and Nesterov’s.

• Nesterov’s ADG method exploits strong convexity in $\Psi$ instead of $f$. In order to use it under our assumption (that $f$ is strongly convex), one needs to relocate a strong convexity term from $f$ to $\Psi$, and this relocated term needs to be adjusted whenever the estimate $\mu$ is reduced.
• The restart scheme suggested

in (Nesterov, 2013, Section 5.3) uses an extra line-search at each iteration, solely for the purpose of computing the gradient mapping at $x^{(k)}$. Our method directly uses the gradient mapping at $y^{(k)}$, which does not require the extra line-search; therefore the computational cost per iteration is lower.

4. Homotopy continuation for sparse optimization

In this section, we focus on the $\ell_1$-regularized least-squares ($\ell_1$-LS) problem (5) in the high-dimensional setting, i.e., with $m < n$. This is a special case of (1), but the function $f(x) = (1/2)\|Ax - b\|_2^2$ is not strongly convex when $m < n$. Therefore, we only expect a

sublinear convergence rate (at least globally) when using traditional first-order optimization methods.

Nevertheless, as explained in the introduction, one can use a homotopy continuation strategy to obtain much faster convergence. The key idea is to solve the $\ell_1$-LS problem with a large regularization parameter $\lambda_0$ first, and then gradually decrease the value of $\lambda$ until the target regularization is reached. In Xiao & Zhang (2013), the PG method is employed to solve the $\ell_1$-LS problem for a fixed $\lambda$ up to an adequate precision, then the solution is used to warm start the next stage.
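The warm-started continuation just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain proximal-gradient steps with a fixed step size $1/L$ instead of a line search, takes $L = \|A\|_F^2$ as a crude upper bound on the largest eigenvalue of $A^T A$, and stops each stage on a sup-norm optimality residue. The function names, the defaults `eta=0.8` and `delta=0.2`, and the tiny test problem are all hypothetical choices made for the example.

```python
import math

def soft_threshold(v, t):
    """Closed-form prox step for Psi(x) = lam*||x||_1: shrink each entry toward 0 by t."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]

def gradient(A, b, x):
    """Gradient of f(x) = 0.5*||Ax - b||^2, i.e. A^T (Ax - b)."""
    r = [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]

def residue(g, x, lam):
    """Smallest sup-norm of g + lam*xi over subgradients xi of ||x||_1 (assumed stopping measure)."""
    return max(
        abs(gj + lam * math.copysign(1.0, xj)) if xj != 0.0 else max(abs(gj) - lam, 0.0)
        for gj, xj in zip(g, x)
    )

def pg_stage(A, b, x, lam, L, tol, max_iter=100000):
    """Proximal-gradient iterations from a warm start x until the residue is at most tol."""
    for _ in range(max_iter):
        g = gradient(A, b, x)
        if residue(g, x, lam) <= tol:
            break
        x = soft_threshold([xj - gj / L for xj, gj in zip(x, g)], lam / L)
    return x

def pg_homotopy(A, b, lam_tgt, eps=1e-10, eta=0.8, delta=0.2):
    """Solve a sequence of l1-LS problems with geometrically decreasing lam, warm-starting each."""
    n = len(A[0])
    L = sum(a * a for row in A for a in row)       # ||A||_F^2, an upper bound on rho_max(A^T A)
    x = [0.0] * n
    lam = max(abs(g) for g in gradient(A, b, x))   # ||A^T b||_inf: x = 0 is optimal at this lam
    while lam * eta > lam_tgt:
        lam *= eta
        x = pg_stage(A, b, x, lam, L, delta * lam) # intermediate stage: proportional precision
    return pg_stage(A, b, x, lam_tgt, L, eps)      # final stage: absolute precision

A = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
b = [1.0, -0.3]
x_hat = pg_homotopy(A, b, lam_tgt=0.1)
```

On this toy instance the third coordinate stays at zero along the whole path, so every stage works on a well-conditioned two-column subproblem even though $A^T A$ itself is singular.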

It was shown that under a restricted eigenvalue condition on $A$, such a homotopy scheme guarantees that all iterates generated by the method are sufficiently sparse, which implies restricted strong convexity. As a result, a linear rate of convergence can be established for each homotopy stage, and the overall complexity is $\tilde{O}(\kappa_s \log(1/\epsilon))$ for a certain sparsity level $s$, where $\kappa_s$ is the restricted condition number defined in (7), and the notation $\tilde{O}(\cdot)$ hides additional $\log(\kappa_s)$ factors.

In this section, we show that, by combining the adaptive APG method (Algorithm 3) with the same homotopy continuation scheme, the iteration complexity for solving the $\ell_1$-LS problem can be improved to $\tilde{O}(\sqrt{\kappa_{s'}}\log(1/\epsilon))$, with $s'$ slightly larger than $s$.
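To give a feel for the difference between a linear and a square-root dependence on the condition number, the following Python sketch counts iterations of plain gradient descent and of the constant-momentum accelerated scheme (the extrapolation coefficient $(\sqrt{L} - \sqrt{\mu})/(\sqrt{L} + \sqrt{\mu})$ from (10)) on a toy strongly convex quadratic. It is only an illustration under simplifying assumptions: $\Psi = 0$, the exact values of $L$ and $\mu$ are known, and the function name, the diagonal test problem and the tolerance are hypothetical.

```python
import math

def iterations_to_tolerance(d, accelerated, tol=1e-8, max_iter=100000):
    """Minimise f(x) = 0.5 * sum(d_i * x_i^2) from x = (1, ..., 1); return iterations used."""
    L, mu = max(d), min(d)
    beta = 0.0
    if accelerated:
        beta = (math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))
    x_prev = x = [1.0] * len(d)
    for k in range(max_iter):
        # The optimal value is 0, so f(x) itself is the objective gap.
        if 0.5 * sum(di * xi * xi for di, xi in zip(d, x)) <= tol:
            return k
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]  # extrapolation step
        x_prev, x = x, [yi - di * yi / L for di, yi in zip(d, y)]   # gradient step of size 1/L
    return max_iter

d = [1.0, 100.0]  # eigenvalues of the Hessian, so the condition number is 100
pg_iters = iterations_to_tolerance(d, accelerated=False)
apg_iters = iterations_to_tolerance(d, accelerated=True)
print(pg_iters, apg_iters)
```

With `accelerated=False` the slow coordinate contracts by a factor $1 - \mu/L$ per step, while the accelerated recurrence contracts roughly like $1 - \sqrt{\mu/L}$, which is why the second count is much smaller.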
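Algorithm 4: $\hat{x}^{(\rm tgt)} \leftarrow \mathrm{APGHomotopy}(A, b, \lambda_{\rm tgt}, \epsilon, L_0, \hat\mu_0)$
  input: $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $\lambda_{\rm tgt} > 0$, $\epsilon > 0$, $L_0 \ge \hat\mu_0 > 0$
  parameter: $\eta \in (0, 1)$, $\delta \in (0, 1)$
  initialize: $\lambda_0 \leftarrow \|A^T b\|_\infty$, $\hat{x}^{(0)} \leftarrow 0$, $\hat{M}_0 \leftarrow L_0$, $N \leftarrow \lfloor \ln(\lambda_0/\lambda_{\rm tgt}) / \ln(1/\eta) \rfloor$
  for $K = 0, 1, 2, \ldots, N - 1$ do
    $\lambda_{K+1} \leftarrow \eta\lambda_K$
    $\hat\epsilon_{K+1} \leftarrow \delta\lambda_{K+1}$
    $\{\hat{x}^{(K+1)}, \hat{M}_{K+1}, \hat\mu_{K+1}\} \leftarrow \mathrm{AdapAPG}(\hat{x}^{(K)}, \hat{M}_K, \hat\mu_K, \hat\epsilon_{K+1}, \lambda_{K+1})$
  end for
  $\{\hat{x}^{(\rm tgt)}, \hat{M}_{\rm tgt}, \hat\mu_{\rm tgt}\} \leftarrow \mathrm{AdapAPG}(\hat{x}^{(N)}, \hat{M}_N, \hat\mu_N, \epsilon, \lambda_{\rm tgt})$
  return: $\hat{x}^{(\rm tgt)}$

The APG homotopy method is presented in Algorithm 4. To avoid confusion over the notations, we use $\lambda_{\rm tgt}$ to denote the target regularization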

parameter in (5). The method starts with $\lambda_0 = \|A^T b\|_\infty$, which is the smallest $\lambda$ such that the $\ell_1$-LS problem has the trivial solution $0$ (by examining the optimality condition). This method has two extra parameters $\eta \in (0, 1)$ and $\delta \in (0, 1)$. They control the algorithm as follows:

• The sequence of values for the regularization parameter is determined as $\lambda_K = \eta^K \lambda_0$ for $K = 1, 2, \ldots$, until the target value $\lambda_{\rm tgt}$ is reached.
• For each $\lambda_K$ except $\lambda_{\rm tgt}$, we solve problem (5) with a proportional precision $\delta\lambda_K$. For the last stage with $\lambda_{\rm tgt}$, we solve to the absolute precision $\epsilon$.

Our convergence analysis of the APG homotopy method is based on the following

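assumption, which involves the restricted eigenvalues defined in (6).

Assumption 1. Suppose $b = A\bar{x} + z$. Let $\bar{S} = \operatorname{supp}(\bar{x})$ and $\bar{s} = |\bar{S}|$. There exist $\gamma > 0$ and $\delta' \in (0, 0.2]$ such that $\gamma > (1 + \delta')/(1 - \delta')$ and

$$\lambda_{\rm tgt} \ge 4\,\frac{\max\{2, \gamma + 1\}}{(1 - \delta')\gamma - (1 + \delta')}\,\|A^T z\|_\infty. \tag{17}$$

Moreover, we assume there exists an integer $\tilde{s}$ such that $\rho_-(A, \bar{s} + 3\tilde{s}) > 0$ and

$$\tilde{s} > \frac{24\big(\gamma_{\rm inc}\,\rho_+(A, \bar{s} + 3\tilde{s}) + 3\rho_+(A, \tilde{s})\big)}{\rho_-(A, \bar{s} + \tilde{s})}\,(1 + \gamma)\,\bar{s}. \tag{18}$$

We also assume that $L_{\min} \le \gamma_{\rm inc}\,\rho_+(A, \bar{s} + 3\tilde{s})$.

According to Zhang & Huang (2008), the above assumption implies $\|x^\star(\lambda)_{\bar{S}^c}\|_0 \le \tilde{s}$ whenever $\lambda \ge \lambda_{\rm tgt}$ (here $\bar{S}^c$ denotes the complement of the support set $\bar{S}$). We will show that by choosing the parameters $\eta$ and $\delta$ in Algorithm 4 appropriately, these conditions also imply that all iterates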

along the solution path are sparse. We note that Assumption 1 is very similar to Assumption 1 in Xiao & Zhang (2013) (they differ only in the constants in the conditions), and interpretations and remarks made there also apply here. More specifically,

• The existence of $\tilde{s}$ satisfying the conditions like (18) is necessary and standard in sparse recovery analysis. It is closely related to the restricted isometry property (RIP) of Candès & Tao (2005), which assumes that there exist some $s > 0$ and $\nu \in (0, 1)$ such that $\kappa(A, s) < (1 + \nu)/(1 - \nu)$. See Xiao & Zhang (2013, Section 3) for an example of sufficient RIP

conditions. Another sufficient condition is $\kappa(A, \bar{s} + 3\tilde{s}) < c\,\tilde{s}/\bar{s}$ with $c = 1/(24(1 + \gamma)(3 + \gamma_{\rm inc}))$, which is more accessible but can be very conservative.
• The RIP-like condition (18) can be much stronger than the corresponding conditions established in the sparse recovery literature (see, e.g., Li & Mo (2011) and references therein), which are only concerned about the recovery property of the optimal solution $x^\star$. In contrast, our condition needs to guarantee sparsity for all iterates along the solution path, thus is “dynamic” in nature. In particular, in addition to the matrix $A$, it also depends on

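algorithmic parameters $\gamma_{\rm inc}$, $\eta$ and $\delta$ (Theorem 4 will relate $\eta$ to $\delta$ and $\delta'$).

Our first result below concerns the local linear convergence of Algorithm 3 when applied to solve the $\ell_1$-LS problem at each stage of the homotopy method. Basically, if the starting point $x^{(0)}$ is sparse and the optimality condition is satisfied with adequate precision, then all iterates along the solution path are sparse. This implies that restricted strong convexity holds and Algorithm 3 actually has linear convergence.

Theorem 3. Suppose Assumption 1 holds. If the initial point $x^{\rm ini}$ in Algorithm 3 satisfies

$$\|x^{\rm ini}_{\bar{S}^c}\|_0 \le \tilde{s}, \qquad \omega_\lambda(x^{\rm ini}) \le \delta'\lambda, \tag{19}$$

then for all $k \ge 0$, we have $\|x^{(k)}_{\bar{S}^c}\|_0 \le \tilde{s}$. Moreover, all the three conclusions of Theorem 2 hold by replacing $L_f$ and $\mu_f$ with $\rho_+(A, \bar{s} + 3\tilde{s})$ and $\rho_-(A, \bar{s} + 3\tilde{s})$, respectively.

Our next result gives the overall iteration complexity of the APG homotopy method in Algorithm 4. To simplify presentation, we let $s' = \bar{s} + 3\tilde{s}$, and use the following notations:

$$\rho_+(s') = \rho_+(A, \bar{s} + 3\tilde{s}), \qquad \rho_-(s') = \rho_-(A, \bar{s} + 3\tilde{s}), \qquad \kappa(s') = \frac{\rho_+(A, \bar{s} + 3\tilde{s})}{\rho_-(A, \bar{s} + 3\tilde{s})}.$$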
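For very small instances, the restricted eigenvalues of (6) and the restricted condition number of (7) that enter this notation can be evaluated by brute force, which may help build intuition for how they differ from the ordinary eigenvalues of $A^T A$. The sketch below is an illustration only: it is restricted to sparsity level $s = 2$ so that each Gram block has closed-form eigenvalues, it assumes $A$ has at least two columns, and the function name and the example matrix are hypothetical. Enumerating supports is exponential in $s$ in general, so this is not a practical procedure.

```python
import math
from itertools import combinations

def restricted_eigenvalues_s2(A):
    """Return (rho_plus, rho_minus) of A at sparsity level s = 2 by enumerating column pairs."""
    n = len(A[0])
    cols = [[row[j] for row in A] for j in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    rho_plus, rho_minus = 0.0, math.inf
    for i, j in combinations(range(n), 2):
        # 2x2 Gram block [[a, c], [c, d]] of columns i and j.
        a, d, c = dot(cols[i], cols[i]), dot(cols[j], cols[j]), dot(cols[i], cols[j])
        mean, radius = 0.5 * (a + d), math.hypot(0.5 * (a - d), c)
        rho_plus = max(rho_plus, mean + radius)    # largest eigenvalue of this block
        rho_minus = min(rho_minus, mean - radius)  # smallest eigenvalue of this block
    return rho_plus, rho_minus

A = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
rho_plus, rho_minus = restricted_eigenvalues_s2(A)
kappa = rho_plus / rho_minus
```

Here $A$ has more columns than rows, so $A^T A$ is singular, yet `rho_minus` is strictly positive: every pair of columns is linearly independent, which is exactly the restricted eigenvalue condition at $s = 2$.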
Roughly speaking, if the parameters $\delta$ and $\eta$ are chosen appropriately, then the total number of proximal-gradient steps in Algorithm 4 for finding an

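$\epsilon$-optimal solution is $\tilde{O}\big(\sqrt{\kappa(s')}\,\ln(1/\epsilon)\big)$.

Theorem 4. Suppose Assumption 1 holds for some $\delta'$, $\gamma$ and $\tilde{s}$, and the parameters $\delta$ and $\eta$ in Algorithm 4 are chosen such that $\frac{1 + \delta}{1 + \delta'} \le \eta < 1$. Let $N = \lfloor \ln(\lambda_0/\lambda_{\rm tgt}) / \ln(1/\eta) \rfloor$ as in the algorithm. Then:

1. Condition (19) holds for each call of Algorithm 3. For $K = 0, \ldots, N - 1$, the number of gradient steps in each call of Algorithm 3 is no more than [bound and its two auxiliary constants lost in extraction]. It is independent of $\lambda_K$.

2. For each $K \ge 0$, the outer iterate $\hat{x}^{(K)}$ satisfies

$$\phi_{\lambda_{\rm tgt}}(\hat{x}^{(K)}) - \phi^\star_{\lambda_{\rm tgt}} \le \eta^{2(K+1)}\,\frac{4.5(1 + \gamma)\lambda_0^2\,\bar{s}}{\rho_-(A, \bar{s} + \tilde{s})},$$

and the following bound on sparse recovery holds:

$$\|\hat{x}^{(K)} - \bar{x}\|_2 \le \eta^{K+1}\,\frac{2\lambda_0\sqrt{\bar{s}}}{\rho_-(A, \bar{s} + \tilde{s})}.$$

3. When Algorithm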

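4 terminates, the total number of proximal-gradient steps is $\tilde{O}\big(\sqrt{\kappa(s')}\,\ln(1/\epsilon)\big)$. Moreover, the output $\hat{x}^{(\rm tgt)}$ satisfies

$$\phi_{\lambda_{\rm tgt}}(\hat{x}^{(\rm tgt)}) - \phi^\star_{\lambda_{\rm tgt}} \le \frac{4(1 + \gamma)\lambda_{\rm tgt}\,\bar{s}}{\rho_-(A, \bar{s} + \tilde{s})}\,\epsilon.$$

Our $\tilde{O}\big(\sqrt{\kappa(s')}\,\ln(1/\epsilon)\big)$ complexity of the APG homotopy method improves the $\tilde{O}\big(\kappa(s')\ln(1/\epsilon)\big)$ complexity of PGH in the dependence on restricted condition number. We note that this result is not a simple extension of those in Xiao & Zhang (2013). In particular, the AdapAPG method does not have the property of monotone decrease, which is key for establishing the complexity of the PGH method in Xiao & Zhang (2013). Instead, our proof relies on the non-blowout property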

(Lemma 1) to show that all iterates along the solution path are sparse (details are given in the supporting materials).

5. Numerical experiments

In this section, we present preliminary numerical experiments to support our theoretical analysis. In addition to the PG and PGH methods (Xiao & Zhang, 2013), we also compare our method with FISTA (Beck & Teboulle, 2009) and its homotopy variants.

We implemented FISTA with an adaptive line-search over the Lipschitz constant $L_f$, but it does not use or estimate the convexity parameter $\mu_f$. Hence it has a sublinear complexity $O(\sqrt{L_f/\epsilon})$. In our experiments, we

also compare with a simple restart scheme for FISTA suggested by O’Donoghue & Candès (2012): restart FISTA whenever it exhibits nonmonotone behaviors. In particular, we implemented the gradient scheme: restart whenever $g(y^{(k-1)})^T (x^{(k)} - x^{(k-1)}) > 0$, where $g(\cdot)$ is the gradient mapping and $x^{(k)}$ and $y^{(k)}$ are two sequences generated by FISTA, similar to those in our AdapAPG method. O’Donoghue & Candès (2012) show that for strongly convex pure quadratic functions, this restart scheme leads to the optimal complexity of $O(\sqrt{\kappa_f}\,\ln(1/\epsilon))$. However, their analysis does not hold for the $\ell_1$-LS problem or other non-quadratic functions. We call this method FISTA+RS

(meaning FISTA with ReStart).

For our AdapAPG method (Algorithm 3) and APG homotopy method (Algorithm 4), we use the following values of the parameters unless otherwise stated:

  parameters   $\gamma_{\rm inc}$   $\gamma_{\rm dec}$   $\theta_{\rm sc}$   $\gamma_{\rm sc}$   $\eta$   $\delta$
  values       2                    2                    0.1                 10                  0.8      0.2

To make the comparison clear, we generate an ill-conditioned random matrix $A$ following the experimental setup in Agarwal et al. (2012):

• Generate a random matrix $B \in \mathbb{R}^{m \times n}$ with $B_{ij}$ following i.i.d. standard normal distribution.
• Choose $\omega \in [0, 1)$, and for $i = 1, \ldots, m$, generate each row $A_{i,:}$ by $A_{i,1} = B_{i,1}/\sqrt{1 - \omega^2}$ and $A_{i,j+1} = \omega A_{i,j} + B_{i,j}$ for $j = 2, \ldots, n$.

It can be shown that the eigenvalues of the covariance matrix of the rows of $A$ lie within

the interval $\big[\frac{1}{(1 + \omega)^2}, \frac{2}{(1 - \omega)^2(1 + \omega)}\big]$. If $\omega = 0$, then $A = B$ and the covariance matrix is well conditioned. As $\omega \to 1$, it becomes progressively more ill-conditioned. In our experiments, we generate the matrix $A$ with $m = 1000$, $n = 5000$, and $\omega = 0.9$.

Figure 1 shows the computational results of the four different methods: PG, FISTA, FISTA+RS, AdapAPG, and their homotopy continuation variants (denoted by “+H”). For each method, we initialize the Lipschitz constant by $L_0 = \max_{j \in \{1, \ldots, n\}} \|A_{:,j}\|_2^2$. For the AdapAPG method, we initialize the estimate of convexity parameter with two different values, $\mu_0 = L_0/10$ and $\mu_0 = L_0/100$, and denote their results by

AdapAPG1 and AdapAPG2, respectively.
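The data generation and initialization described above can be sketched as follows. This is a hedged illustration, not the authors' code: the index range of the recursion is ambiguous in the text, so the sketch assumes that each new entry of a row combines the previous entry with a fresh standard normal draw (which makes $\omega = 0$ reproduce $A = B$, as stated). The function names, the seed handling, and the small sizes used here are assumptions chosen to keep the example quick.

```python
import math
import random

def generate_matrix(m, n, omega, seed=0):
    """Rows follow an AR(1)-type recursion: A[i][0] = B[i][0] / sqrt(1 - omega^2),
    then A[i][j+1] = omega * A[i][j] + B[i][j+1] (assumed indexing of the noise)."""
    rng = random.Random(seed)
    A = []
    for _ in range(m):
        B_row = [rng.gauss(0.0, 1.0) for _ in range(n)]  # i.i.d. standard normal entries
        row = [B_row[0] / math.sqrt(1.0 - omega ** 2)]
        for j in range(n - 1):
            row.append(omega * row[j] + B_row[j + 1])
        A.append(row)
    return A

def initial_lipschitz(A):
    """L_0 = max_j ||A[:, j]||_2^2, the largest squared column norm."""
    return max(sum(row[j] ** 2 for row in A) for j in range(len(A[0])))

A = generate_matrix(m=20, n=50, omega=0.9)
L0 = initial_lipschitz(A)
mu0_candidates = (L0 / 10.0, L0 / 100.0)  # the two initial guesses for the convexity parameter
```

The largest squared column norm is a natural optimistic starting value for the line search: it is the Lipschitz constant of the gradient restricted to a single coordinate, and the line search can increase it when the sufficient-decrease test fails.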
[Figure 1, plots not reproduced. Top row: PG, FISTA, FISTA+RS, AdapAPG1 and AdapAPG2 over 3000 iterations. Bottom row: the homotopy variants PG+H, FISTA+H, FISTA+RS+H, AdapAPG1+H and AdapAPG2+H over 1800 iterations. Left column: convergence curves on a logarithmic vertical axis. Middle column: sparsity of the iterates (0 to 5000). Right column: the tuned Lipschitz and convexity parameter estimates for AdapAPG1 and AdapAPG2.]

Figure 1. Solving an ill-conditioned $\ell_1$-LS problem. AdapAPG1 starts with $\mu_0 = L_0/10$, and AdapAPG2 uses $\mu_0 = L_0/100$.

From the top-left plot, we observe that PG, FISTA+RS and AdapAPG all go through a slow plateau before reaching fast local linear convergence. FISTA without restart does not exploit the strong convexity and is the slowest asymptotically. Their homotopy continuation variants shown in the bottom-left plot are much faster. Each vertical jump on the curves indicates a change in the

value of $\lambda$ in the homotopy scheme. In particular, it is clear that all except FISTA+H enter the final homotopy stage with fast linear convergence. In the final stage, the PGH method has a rather flat slope due to ill-conditioning of the $A$ matrix; in contrast, FISTA+RS and AdapAPG have much steeper slopes due to their accelerated schemes. AdapAPG1 started with a modest slope, and then detected that the $\mu$ value was too big and reduced it by a factor of $\gamma_{\rm sc} = 10$, which resulted in the same fast convergence rate as AdapAPG2 after that.

The two plots in the middle show the

sparsity of each iterate along the solution paths of these methods. We observe that FISTA+RS and AdapAPG entered fast local convergence precisely when their iterates became sufficiently sparse, i.e., when $\|x^{(k)}\|_0$ became close to that of the final solution. In contrast, the homotopy variants of these algorithms kept all iterates sparse by using the warm start from previous stages. Therefore, restricted strong convexity holds along the whole path and linear convergence was maintained at each stage.

The right column shows the automatic tuning of the local Lipschitz constant $M_k$ and the

restricted convexity parameter $\mu$. We see that the homotopy methods (bottom-right plot) have relatively smaller $M_k$ and larger $\mu$ than the ones without using homotopy continuation (top-right plot), which means much better conditioning along the iterates. In particular, the homotopy AdapAPG method used fewer reductions of $\mu$, for both initializations of $\mu_0$.

Overall, we observe that for the $\ell_1$-LS problem, the homotopy continuation scheme is very effective in speeding up different methods. Even with the overhead of estimating and tuning $\mu$, the AdapAPG+H method is close in efficiency compared

with the FISTA+RS+H method. If the initial guess of $\mu$ is not far off, then AdapAPG+H gives the best performance. Finally, we note that unlike the AdapAPG method, the optimal complexity of the FISTA+RS method has not been established for minimizing general strongly convex functions (including $\ell_1$-LS). Although FISTA+RS is often quite competitive in practice, we have observed non-quadratic cases in which it demonstrates less desirable convergence (see examples in the supporting materials and also comments in O’Donoghue & Candès (2012)).

References

Agarwal, A., Negahban, S. N., and Wainwright, M. J. Fast global convergence of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, 40(5):2452–2482, 2012.
Beck, A. and Teboulle, M. A fast iterative shrinkage-threshold algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Bredies, K. and Lorenz, D. A. Linear convergence of iterative soft-thresholding. Journal of Fourier Analysis and Applications, 14:813–837, 2008.
Bruckstein, A. M., Donoho, D. L., and Elad, M. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, 2009.
Candès, E. J. and Tao, T. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, December 2005.
Chen, S. S., Donoho, D. L., and Saunders, M. A. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
Hale, E. T., Yin, W., and Zhang, Y. Fixed-point continuation for $\ell_1$-minimization: Methodology and convergence. SIAM Journal on Optimization, 19(3):1107–1130, 2008.
Li, S. and Mo, Q. New bounds on the restricted isometry constant $\delta_{2k}$. Applied and Computational Harmonic Analysis, 31(3):460–468, 2011.
Luo, Z.-Q. and Tseng, P. On the linear convergence of descent methods for convex essentially smooth minimization. SIAM Journal on Control and Optimization, 30(2):408–425, 1992.
Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston, 2004.
Nesterov, Y. Gradient methods for minimizing composite functions. Mathematical Programming, Series B, 140:125–161, 2013.
O’Donoghue, B. and Candès, E. J. Adaptive restart for accelerated gradient schemes. Manuscript, April 2012. To appear in Foundations of Computational Mathematics.
Rockafellar, R. T. Convex Analysis. Princeton University Press, 1970.
Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58:267–288, 1996.
Tseng, P. On accelerated proximal gradient methods for convex-concave optimization. Manuscript, 2008.
Wright, S. J., Nowak, R. D., and Figueiredo, M. A. T. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, July 2009.
Xiao, L. and Zhang, T. A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization, 23(2):1062–1091, 2013.
Zhang, C.-H. and Huang, J. The sparsity and bias of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567–1594, 2008.