Signal Processing and Networking for Big Data Applications
Lecture 5: Sparse Optimization
Zhu Han, University of Houston
Thanks to Dr. Shaohua Qin's efforts on the slides
Outline (chapter 4)
- Sparse optimization models
- Classic solvers and omitted solvers (BSUM and ADMM)
- Other algorithms
  - Shrinkage operator
  - Prox-linear algorithms
  - Dual algorithms
  - Bregman method
  - Homotopy algorithm and parametric quadratic programming
  - Continuation, varying step sizes, and line search
- Greedy algorithms
  - Greedy pursuit algorithms
  - Iterative support detection
  - Hard thresholding
- Algorithms for low-rank matrices
- How to choose an algorithm

http://www.math.ucla.edu/~wotaoyin/summer2013/lectures.html
Traditional Signal Acquisition Approach
The typical signal acquisition approach: sample a signal very densely (at least twice the highest frequency), and then compress the information for storage or transmission.
Example (image acquisition): an 18.1-megapixel digital camera senses 18.1e6 samples to construct an image. The image is then compressed using JPEG to an average size smaller than 3 MB, a compression ratio of about 12.
Compressive Sensing?
Move the burden from sampling to reconstruction. A natural question to ask is: could the two processes (sensing and compression) be combined? The answer is YES! This is what Compressive Sensing (CS) is about.
CS Concept
- Sparse x
- Random linear projection: dimension reduction from x to y
- Number of measurements M > K log(N/K)
- Recovery algorithm for the ill-posed inverse problem
CS Concept
K-sparse signal -> random linear projection (satisfying the RIP) -> compressed samples -> exact recovery, with K < m << n.
What is Compressive Sensing (CS) About?
- An emerging field of research (ICASSP)
- Beats the Nyquist sampling theorem
- Exploits sparsity and redundancy of signals
- Combines sensing and compression into one step
- Offers algorithms that recover the signal with overwhelming probability
Sparse Optimization Models
Some concepts about norms (definitions and formulas shown on the slides).
The L1 norm of x can be replaced by regularization functions corresponding to transform sparsity, joint sparsity, or low rank, as well as by objectives involving more than one regularization term.
Shrinkage Operator
The L1 norm is non-differentiable but component-wise separable, so it is easy to minimize together with other component-wise separable functions of x: the joint problem reduces to a sequence of independent scalar subproblems. For example, the problem
    min_x  mu*||x||_1 + (1/2)*||x - z||^2
is equivalent to solving, for each component i,
    min_{x_i}  mu*|x_i| + (1/2)*(x_i - z_i)^2.
Applying a component-wise analysis over the three cases of the solution (positive, zero, and negative), one obtains the closed-form solution
    x_i = sign(z_i) * max(|z_i| - mu, 0).
This is called soft-thresholding or shrinkage.
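The closed form above is a one-liner in code. A minimal sketch (the function name `shrink` is our own, not from the slides):

```python
import numpy as np

def shrink(z, tau):
    """Soft-thresholding (shrinkage): the component-wise minimizer of
    tau*|x_i| + 0.5*(x_i - z_i)^2."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)
```

Entries with |z_i| <= tau are set exactly to zero, which is how the operator produces sparsity.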
Illustration of shrinkage: visually, the operator takes the input vector z and shrinks it toward zero component-wise.
Prox-linear Algorithms
The optimization problem can be written in the general form
    min_x  r(x) + f(x),   (5)
where r is the regularization function and f is the data-fidelity function. The prox-linear algorithm is based on linearizing the smooth term f(x) at each iterate x^k and adding a proximal term, giving rise to the iteration
    x^{k+1} = argmin_x  r(x) + <grad f(x^k), x - x^k> + (1/(2*delta_k))*||x - x^k||^2.   (6)
The last two terms can be combined into a complete square plus a constant, so the iteration is equivalent to
    x^{k+1} = argmin_x  r(x) + (1/(2*delta_k))*||x - (x^k - delta_k * grad f(x^k))||^2.
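When r(x) = mu*||x||_1 and f(x) = (1/2)*||Ax - b||^2, each prox-linear step is a gradient step on f followed by shrinkage; this is the classic ISTA scheme. A minimal sketch (the fixed step size 1/||A||_2^2 and the iteration count are our own choices):

```python
import numpy as np

def ista(A, b, mu, iters=2000):
    """Prox-linear iteration for min_x mu*||x||_1 + 0.5*||Ax - b||^2:
    linearize the fidelity term at x_k, then apply the shrinkage operator."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = Lipschitz const. of grad f
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - step * (A.T @ (A @ x - b))   # gradient step on the fidelity term
        x = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0.0)  # shrinkage
    return x
```

Note that the shrinkage threshold is step*mu, matching the complete-square form of the iteration above.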
Besides the prox-linear interpretation above, there are alternative derivations, such as forward-backward operator splitting, optimization transfer, and successive upper (first-order) approximation. Forward-backward operator splitting is general and very elegant.
Lemma. A point x_opt is the minimizer of a convex function f if and only if 0 is in the subdifferential of f at x_opt.
Introduce the operators T1 = the subdifferential of r and T2 = the gradient of f. Then the optimality condition (under regularity) is
    0 in T1(x*) + T2(x*),
which also holds after multiplying by a factor delta > 0.
Hence we have the fixed-point characterization
    x* = (I + delta*T1)^{-1} (I - delta*T2)(x*).   (7)
Letting delta = delta_k and applying the resolvent of T1, we obtain
    x^{k+1} = (I + delta_k*T1)^{-1} (x^k - delta_k * grad f(x^k)),   (8)
which differs from the right-hand side of (6) only by a quantity independent of x. Hence (6) and (8) yield the same solution, and (6) is precisely a fixed-point iteration of (7), which tends to produce more stable solutions.
Dual Algorithms
For a constrained optimization problem, duality gives rise to a companion problem. Between the original (primal) problem and its companion (dual) problem there exists a weak, and sometimes also a strong, duality relation, which makes solving one of the problems essentially equivalent to solving the other. This brings new choices of efficient algorithms for solving the optimization problem.
Examples worked out on the slides: the dual of basis pursuit, group sparsity, and nuclear-norm minimization.
Bregman Method
There are different versions of Bregman algorithms: original Bregman, linearized Bregman, and split Bregman. The Bregman method is based on successively minimizing the Bregman distance (also called the Bregman divergence).
Definition. The Bregman distance induced by a convex function r(.) is defined as
    D_r^p(x, y) = r(x) - r(y) - <p, x - y>,  where p is a subgradient of r at y.
Geometrically, the Bregman distance is the gap between r(x) and the supporting line of r at y (the length of the dashed line in the figure).
The original Bregman algorithm iteration has two steps: update a subgradient, then minimize the sum of the Bregman distance and the fidelity term. An interesting alternative form of the Bregman algorithm is the iteration of "adding back the residual":
    b^{k+1} = b^k + (b - A*x^k),
    x^{k+1} = argmin_x  mu*r(x) + (1/2)*||A*x - b^{k+1}||^2.
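The "add back the residual" form is easy to prototype on top of any solver for the l1-regularized least-squares subproblem. A minimal sketch, using plain ISTA as the (assumed) inner solver; the outer and inner iteration counts are our own choices:

```python
import numpy as np

def bregman(A, b, mu, outer=5, inner=500):
    """Bregman method in the 'adding back the residual' form: each outer pass
    solves min_x mu*||x||_1 + 0.5*||Ax - b_k||^2 (here with plain ISTA),
    then augments the measurements by the remaining residual."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    bk = b.copy()
    for _ in range(outer):
        for _ in range(inner):                 # inner ISTA solve
            z = x - step * (A.T @ (A @ x - bk))
            x = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0.0)
        bk = bk + (b - A @ x)                  # add back the residual
    return x
```

As in the slides' example, successive outer iterations remove the bias of the l1 penalty and recover the non-zero magnitudes more accurately.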
Bregman iterations and denoising. Figure panels: (a) the BPDN solution; (b) Bregman iteration 1; (c) Bregman iteration 3; (d) Bregman iteration 5.
The parameters of the example: we generated a signal x0 of n = 250 dimensions with k = 20 non-zero components that were i.i.d. samples of the standard Gaussian distribution; x0 is depicted by the black dots in the figure. The measurement matrix A had m = 100 rows and n = 250 columns, with its entries sampled i.i.d. from the standard Gaussian distribution. The measurements were b = A*x0 + n, where the noise n was independent Gaussian with standard deviation 0.1.
The results of the example: due to the noise n in the measurements, neither approach gave an exact reconstruction of x0. The BPDN solution had a large number of false non-zero components and slightly mismatched the non-zero components of x0. The Bregman solutions had significantly fewer false non-zero components; iterations 1 and 3 missed certain non-zero components of x0, while iteration 5 matched x0 much better, achieving a relative 2-norm error of 3.50% versus 6.19% for the BPDN solution.
Homotopy Algorithm and Parametric Quadratic Programming
For the model
    min_x  lambda*||x||_1 + (1/2)*||A*x - b||^2,
assuming the uniqueness of the solution x*(lambda) for each lambda > 0, the solution path x*(lambda) is continuous and piece-wise linear in lambda. Let us fix lambda and study the optimality condition satisfied by x*:
    0 in lambda * d||x*||_1 + A^T(A*x* - b),
where the subdifferential of the L1 norm is given component-wise by d|x_i| = {sign(x_i)} if x_i != 0, and [-1, 1] if x_i = 0. The aim is to make most components 0, so that the working dimension is small.
Since the subdifferential is component-wise separable, the condition reduces to a condition on each component, or equivalently,
    a_i^T(b - A*x*) = lambda * sign(x_i*)  if x_i* != 0,
    |a_i^T(b - A*x*)| <= lambda            if x_i* = 0,
where a_i is the i-th column of A. By the definition of the subdifferential, these two cases are exactly the component-wise optimality conditions.
This offers a way to rewrite the optimality conditions: if we let the active set Lambda be the support of x*, then the conditions split into equalities on Lambda and inequalities off Lambda, and the optimality condition can be rewritten equivalently as system (9).
In (9a)-(9c) (shown on the slides), A_Lambda denotes the submatrix of A formed by the columns of A in the active set Lambda, and a_i is the i-th column of A. Combining (9a) and (9b), we obtain a linear system for the non-zero part of x*, whose solution is piece-wise linear in lambda; the homotopy algorithm follows this path, updating the active set at the breakpoints.
Continuation, Varying Step Sizes, and Line Search
For the optimization problem (3), the convergence of the prox-linear iterations depends highly on the model parameter mu and the step size. A smaller mu gives more emphasis to the regularization function r(x) and often leads to more structured, or sparser, solutions. A large mu can lead to prox-linear iterations that are extremely slow.
Continuation on mu is a simple technique used in several codes, such as FPC and SpaRSA, to accelerate the convergence of prox-linear iterations. To apply this technique, one does not use the mu given in (3) (which we call the target mu) to start the iterations; instead, one uses a small mu and gradually increases it over the iterations until it reaches the target. This lets the iteration produce highly structured (low r(x)) yet less-fitted (large f(Ax - b)) solutions first, corresponding to smaller mu values, and use them to "warm start" the iterations that produce solutions with more balanced structure and fit, corresponding to larger mu values.
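A minimal continuation sketch in the spirit of FPC/SpaRSA (the geometric schedule, the stage and iteration counts, and the mapping of mu to an effective l1 weight are our own assumptions):

```python
import numpy as np

def ista_continuation(A, b, mu_target, n_stages=6, inner=300):
    """Solve min_x (1/mu)*||x||_1 + 0.5*||Ax - b||^2 by continuation:
    start with a small mu (heavy regularization, very sparse iterates)
    and warm-start ISTA as mu grows geometrically toward mu_target."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for mu in mu_target * 4.0 ** np.arange(-(n_stages - 1), 1):
        tau = 1.0 / mu                        # effective l1 weight at this stage
        for _ in range(inner):
            z = x - step * (A.T @ (A @ x - b))
            x = np.sign(z) * np.maximum(np.abs(z) - step * tau, 0.0)
    return x
```

Each stage reuses the previous stage's solution as its starting point, so the expensive, well-fitted final stage starts close to its own solution.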
Line search is widely used in non-linear optimization to speed up objective descent and, in some cases, to guarantee convergence. It allows one to vary the step size in a systematic way: letting phi(x) = r(x) + f(x) denote the overall objective, the modified Armijo-Wolfe line search accepts a trial step only if it achieves sufficient decrease in phi (the precise iteration is given on the slides).
Greedy Pursuit Algorithms
Strictly speaking, the greedy algorithms for sparse optimization are not optimization algorithms, since they are not driven by an overall objective. Instead, they build a solution part by part; yet in many cases the solution is indeed the sparsest one.
A basic greedy algorithm is the orthogonal matching pursuit (OMP). With an initial point x^0 = 0 and an empty initial support S^0, starting at k = 1 it iterates
    1. r^{k-1} = b - A*x^{k-1}                                              (residual)
    2. S^k = S^{k-1} union {i*}, where i* minimizes min_a ||a*a_i - r^{k-1}|| over i   (support update)
    3. x^k = argmin_x ||A*x - b||  subject to supp(x) in S^k               (least-squares refit)
until a stopping criterion is satisfied.

J. Tropp and A. Gilbert, Signal recovery from partial information via orthogonal matching pursuit, IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4655-4666, 2007.
Step 1 computes the measurement residual; step 2 adds to the running support the column of A that best explains the residual; step 3 updates the solution to the one that best explains the measurements while confined to the running support. In step 2, for each i the inner minimizer is a = a_i^T r^{k-1} / ||a_i||^2; hence the selected i has the smallest remaining residual norm over all i (equivalently, the largest normalized correlation |a_i^T r^{k-1}| / ||a_i||). Step 3 solves a least-squares problem restricting x to the support S^k.
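The three steps above map directly to code. A minimal OMP sketch (the tolerance-based stopping rule is our own choice; we assume the columns of A are normalized so the correlation test suffices):

```python
import numpy as np

def omp(A, b, tol=1e-8, max_iter=None):
    """Orthogonal matching pursuit: (1) compute the residual, (2) add the
    column most correlated with it to the support, (3) least-squares refit
    on the support. Assumes the columns of A are unit-normalized."""
    m, n = A.shape
    max_iter = max_iter if max_iter is not None else m
    S = []                                   # running support S^k
    x = np.zeros(n)
    r = b.copy()                             # residual r^k
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        i = int(np.argmax(np.abs(A.T @ r)))  # column that best explains r
        S.append(i)
        xs, *_ = np.linalg.lstsq(A[:, S], b, rcond=None)
        x = np.zeros(n)
        x[S] = xs
        r = b - A @ x
    return x
```

After each least-squares refit the residual is orthogonal to the selected columns, so the same index is never picked twice.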
Many further developments of OMP lead to algorithms with improved solution quality and/or speed, such as stage-wise OMP (StOMP) [1], regularized OMP [2], subspace pursuit [3], CoSaMP [4], and hard-thresholding pursuit (HTP) [5].

[1] D. Donoho, Y. Tsaig, I. Drori, and J.-L. Starck, Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit, IEEE Transactions on Information Theory, 2006.
[2] D. Needell and R. Vershynin, Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit, IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 2, pp. 310-316, 2010.
[3] W. Dai and O. Milenkovic, Subspace pursuit for compressive sensing: Closing the gap between performance and complexity, arXiv:0803.0811, 2008.
[4] D. Needell and J. Tropp, CoSaMP: Iterative signal recovery from incomplete and inaccurate samples, Preprint, 2008.
[5] D. Ge, X. Jiang, and Y. Ye, A note on complexity of Lp minimization, Stanford University Technical Report, 2010.
With an initial point x^0 and an estimated sparsity level s, starting at k = 1, CoSaMP iterates the steps shown on the slides until a stopping criterion is satisfied, where the candidate set is the support of the best 2s-term approximation of the proxy A^T r^{k-1} and, similarly, the restriction step keeps the best s-term approximation. Over the iterations, supp(x^k) is updated but kept to no more than s components; hence the merged candidate support has no more than 3s components. Unlike OMP, CoSaMP's support is updated batch by batch.
Iterative Support Detection
Iterative support detection (ISD) aims to achieve fast reconstruction and a reduced requirement on the number of measurements compared to the classical L1-minimization approach; it addresses reconstructions that fail because L1 minimization has insufficient measurements. ISD estimates a support set I from the current reconstruction and obtains a new reconstruction by solving the truncated problem
    min_x  sum of |x_i| over i not in I,  subject to A*x = b,
and it iterates these two steps a small number of times. ISD differs from the orthogonal matching pursuit method and its variants in two ways:
- the index set I in ISD is not necessarily nested or increasing;
- the minimization problem above updates all the components of x at the same time.
The basis pursuit (BP) problem is
    min_x  ||x||_1  subject to A*x = b.
ISD runs as fast as the best BP algorithms but requires significantly fewer measurements. ISD alternately calls its two components, support detection and signal reconstruction: from an incorrect reconstruction, support detection identifies an index set I containing some elements of supp(x0); signal reconstruction then solves the truncated problem above.
The algorithmic framework of ISD is shown on the slide. Compared with greedy algorithms, where once an index enters the support it stays there forever, in ISD an index can be kicked out of the support at any time.
Hard Thresholding
Hard thresholding refers to the operator that keeps the K largest entries of a vector in magnitude, for a given K; we write it as hardthr(x). For a vector x with T the index set of its K largest-magnitude entries, hardthr(x) keeps x on T and sets all other entries to zero. Unlike soft thresholding, the values of the K largest entries are not changed. The basic iterative hard-thresholding (IHT) iteration is
    x^{k+1} = hardthr(x^k + A^T(b - A*x^k)).
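A minimal IHT sketch. Since the basic iteration's convergence argument needs ||A||_2 <= 1 (see the next slide), we rescale the problem first; the iteration count is our own choice:

```python
import numpy as np

def hardthr(x, K):
    """Keep the K largest-magnitude entries of x, zero out the rest."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-K:]
    out[keep] = x[keep]
    return out

def iht(A, b, K, iters=500):
    """Basic iterative hard thresholding: a gradient step on 0.5*||Ax - b||^2
    followed by hard thresholding to the K largest entries. The problem is
    rescaled so the operator norm of the matrix is 1."""
    scale = np.linalg.norm(A, 2)
    As, bs = A / scale, b / scale
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = hardthr(x + As.T @ (bs - As @ x), K)
    return x
```

Note that rescaling (A, b) leaves the solutions of A*x = b unchanged, so the recovered x needs no un-scaling.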
The iteration can be used to solve the problem
    min_x  ||A*x - b||^2  subject to ||x||_0 <= K,
with the guarantee of converging to one of its local minima, provided that ||A||_2 <= 1. Moreover, under the RIP assumption on A, this local minimum x* has bounded distance to the original unknown signal x0. The normalized iterative hard-thresholding iteration
    x^{k+1} = hardthr(x^k + alpha_k * A^T(b - A*x^k)),
where alpha_k is a changing step size, has much better performance. The rule for choosing alpha_k depends on whether the resulting supp(x^{k+1}) equals supp(x^k) or not.
Hard-thresholding pursuit (HTP) combines the (original and normalized) IHT with the idea of matching pursuit: instead of taking the hard-thresholded vector itself as x^{k+1}, HTP uses hard thresholding only to select the support and then performs a least-squares fit on that support, as in step 3 of OMP.
Algorithms for Low-Rank Matrices
When the recovery of a low-rank matrix is based on minimizing the nuclear norm and solving a convex optimization problem, most of the convex optimization algorithms above for L1 minimization can be extended to nuclear-norm minimization. Suppose the rank s of the underlying matrix is given, either exactly or as a current overestimate of the true rank. The underlying matrix can then be factorized as X = USV* or X = PQ, where U and P have s columns, V* and Q have s rows, and S is an s x s diagonal matrix. If one solves for (U, S, V) or (P, Q) instead of X, then X can be recovered from these variables and has rank no more than s. Hence these factorizations and changes of variable replace the role of minimizing the nuclear norm and give rise to a low-rank matrix.
Solvers based on these factorizations include OptSpace [1], LMaFit [2], and an SDPLR-based algorithm [3].

[1] S. Ma, D. Goldfarb, and L. Chen, Fixed point and Bregman iterative methods for matrix rank minimization, Mathematical Programming Series A, vol. 128, pp. 321-353, 2011.
[2] E. Hale, W. Yin, and Y. Zhang, Fixed-point continuation for l1-minimization: Methodology and convergence, SIAM Journal on Optimization, vol. 19, p. 1107, 2008.
[3] B. Recht, M. Fazel, and P. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review, vol. 52, no. 3, pp. 471-501, 2010.
OptSpace and LMaFit are based on minimizing
    min_{P,Q,Z}  ||P*Q - Z||_F^2  subject to Z_Omega = M_Omega
for recovering a low-rank matrix M from a subset of its entries, where Omega is the set of observed entries. The SDPLR-based algorithm solves a non-convex problem whose global solution (P*, Q*) generates X* = P*Q*, which is the solution to the corresponding nuclear-norm minimization problem.
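As an illustration of recovering a low-rank matrix from a subset of its entries, here is a simple sketch that alternates filling in the observed entries with a rank-s SVD projection. This is a plain "impute-and-project" variant of the idea, not LMaFit's actual factored update scheme:

```python
import numpy as np

def lowrank_complete(M, mask, s, iters=300):
    """Matrix completion sketch: alternate (1) overwriting the observed
    entries (mask == True) with their known values and (2) projecting onto
    the set of rank-s matrices via a truncated SVD."""
    X = np.zeros_like(M)
    for _ in range(iters):
        Z = np.where(mask, M, X)                 # enforce Z_Omega = M_Omega
        U, sig, Vt = np.linalg.svd(Z, full_matrices=False)
        X = (U[:, :s] * sig[:s]) @ Vt[:s]        # best rank-s approximation of Z
    return X
```

Only the masked entries of M are ever read, so the routine genuinely works from partial observations.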
How to Choose an Algorithm
Greedy versus optimization algorithms. First, both approaches can be trusted to recover sparse solutions given sufficient measurements. Secondly, because greedy algorithms progressively build the solution support, their performance on signals whose entries follow a faster-decaying distribution is better: fewer measurements are required when the decay is faster. L1 minimization, on the other hand, tends to have more consistent recovery performance; its solution quality is less affected by the decay speed. Thirdly, the two approaches are extended in different ways, which to a large extent determines which one is more suitable for a specific application.
The greedy approach can be extended to model-based CS, in which the underlying signals are not only sparse but their supports are restricted to certain models, e.g., a tree structure; some such models are difficult to express as optimization models. On the other hand, it is difficult to extend greedy approaches to handle general objective or energy functions when these are present along with the sparse objective in a model. Sparse optimization naturally accepts objective functions and constraints of many kinds, especially convex ones. Algorithms based on prox-linear iterations, variable splitting, and block coordinate descent are very flexible, so they are among the best choices for engineering models with "side" objectives and constraints. Overall, the two approaches are very complementary to each other.
Convex versus non-convex. There is no doubt that on some problems where convex optimization (for example, L1 minimization) fails to return faithful reconstructions, non-convex optimization succeeds; examples include problems with fast-decaying signals and very few measurements. On the other hand, the theory and generalizations of non-convex algorithms lag well behind: their recovery guarantees typically assume global optimality, whereas in convex optimization the solution quality is quite predictable. It is fair to say that on a given application of sparse optimization, one should perform extensive tests to confirm the performance and reliability of non-convex optimization algorithms.
Very large scale sparse optimization. To solve very large scale optimization problems, we face memory and computing bottlenecks. One remedy is to avoid general data and use data that have computation-friendly structures. For example, in CS, structured sensing matrices A include the discrete Fourier ensemble and partial Toeplitz or circulant matrices, which allow fast computation of A*x and A^T*y (A^* y in the complex case), and even (A*A^T)^{-1}, without storing the entire matrix in memory. When an algorithm is based on such matrix-vector operations, it can take advantage of the faster computation and be matrix-free.
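For instance, a circulant matrix is diagonalized by the DFT, so both C*x and C^T*y reduce to FFTs. A minimal matrix-free sketch:

```python
import numpy as np

def circulant_matvec(c, x):
    """Matrix-free product C @ x, where C is the circulant matrix whose
    first column is c: circular convolution via the FFT, in O(n log n) time
    and O(n) memory instead of storing the n-by-n matrix."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_rmatvec(c, y):
    """Matrix-free adjoint product C.T @ y (for real c), using the
    conjugated spectrum of c."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(c)) * np.fft.fft(y)))
```

An iterative solver that only calls these two routines never needs the dense matrix, which is exactly the matrix-free property described above.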
Distributed computation. This approach divides the original problem into a series of smaller subproblems, each involving just a subset of the data, or of the solution, or both. The smaller subproblems are solved, typically in parallel, on distributed computing nodes. There is a trade-off between the acceleration due to parallel computing and the overheads of synchronization and splitting.
Stochastic approximation. This takes advantage of the fact that in many problems, a small (random) subset of the data (even a single point) can generate a gradient (or Hessian) that approximates the actual one, whose computation would require the entire batch of data. Between using just one point and using the entire batch, there lies an optimal trade-off among memory, speed, and accuracy. If higher solution accuracy is needed, one can try stochastic approximation methods with dynamic sample sizes.
Solution structure also provides opportunities to address the memory and computing bottlenecks. When the solution is a highly sparse vector, algorithms such as BCD with the Gauss-Southwell update, homotopy algorithms, and some dual algorithms can keep all the intermediate solutions sparse, which effectively reduces data/solution storage and access. When the solution is a low-rank matrix (or tensor) of very large size, algorithms that store and update the factors, which have much smaller sizes, instead of the matrix (or tensor) itself have significant advantages.