Signal Processing and Networking for Big Data Applications
Lecture 5: Sparse Optimization
Zhu Han, University of Houston
Thanks to Dr. Shaohua Qin's efforts on the slides
Outline (chapter 4)
- Sparse optimization models
- Classic solvers and omitted solvers (BSUM and ADMM)
- Other algorithms
  - Shrinkage operator
  - Prox-linear algorithms
  - Dual algorithms
  - Bregman method
  - Homotopy algorithm and parametric quadratic programming
  - Continuation, varying step sizes, and line search
- Greedy algorithms
  - Greedy pursuit algorithms
  - Iterative support detection
  - Hard thresholding
- Algorithms for low-rank matrices
- How to choose an algorithm

http://www.math.ucla.edu/~wotaoyin/summer2013/lectures.html
Traditional Signal Acquisition Approach
The typical signal acquisition approach: sample a signal very densely (at least twice the highest frequency), and then compress the information for storage or transmission.
Example (image acquisition): an 18.1-megapixel digital camera senses 18.1e6 samples to construct an image. The image is then compressed using JPEG to an average size smaller than 3 MB, a compression ratio of about 12.
Compressive Sensing?
Move the burden from sampling to reconstruction. A natural question to ask is: could the two processes (sensing and compression) be combined? The answer is YES! This is what Compressive Sensing (CS) is about.
CS Concept
- Sparse x
- Random linear projection: dimension reduction from x to y
- Number of measurements M > K log(N/K)
- Recovery algorithm for the ill-posed inverse problem
CS Concept
K-sparse signal -> random linear projection (satisfying the RIP) -> compressed samples -> exact recovery, with K < m << n.
What is Compressive Sensing (CS) About?
- An emerging field of research (ICASSP)
- Beats the Nyquist sampling theorem
- Exploits sparsity and redundancy of signals
- Combines sensing and compression into one step
- Offers algorithms that recover the signal with overwhelming probability
Sparse Optimization Models
Some concepts about norms (definitions and formulas shown on the slides).
The L1 norm of x can be replaced by regularization functions corresponding to transform sparsity, joint sparsity, or low rank, as well as by objectives involving more than one regularization term.
Shrinkage Operator
The L1 norm is non-differentiable but component-wise separable, so it is easy to minimize together with other component-wise separable functions of x: the joint problem reduces to a sequence of independent scalar subproblems. For example, the problem
    min_x  mu*||x||_1 + (1/2)*||x - z||^2
is equivalent to solving, for each component i,
    min_{x_i}  mu*|x_i| + (1/2)*(x_i - z_i)^2.
Applying a component-wise analysis over the three cases of the solution (positive, zero, and negative), one obtains the closed-form solution
    x_i = sign(z_i) * max(|z_i| - mu, 0).
This is called soft-thresholding or shrinkage.
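The closed form above is a one-liner in code. A minimal sketch (the function name `shrink` is our own, not from the slides):

```python
import numpy as np

def shrink(z, tau):
    """Soft-thresholding (shrinkage): the component-wise minimizer of
    tau*|x_i| + 0.5*(x_i - z_i)^2."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)
```

Entries with |z_i| <= tau are set exactly to zero, which is how the operator produces sparsity.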
Illustration of shrinkage: visually, the operator takes the input vector z and shrinks it toward zero component-wise.
Prox-linear Algorithms
The optimization problem can be written in the general form
    min_x  r(x) + f(x),   (5)
where r is the regularization function and f is the data-fidelity function. The prox-linear algorithm is based on linearizing the smooth term f(x) at each iterate x^k and adding a proximal term, giving rise to the iteration
    x^{k+1} = argmin_x  r(x) + <grad f(x^k), x - x^k> + (1/(2*delta_k))*||x - x^k||^2.   (6)
The last two terms can be combined into a complete square plus a constant, so the iteration is equivalent to
    x^{k+1} = argmin_x  r(x) + (1/(2*delta_k))*||x - (x^k - delta_k * grad f(x^k))||^2.
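When r(x) = mu*||x||_1 and f(x) = (1/2)*||Ax - b||^2, each prox-linear step is a gradient step on f followed by shrinkage; this is the classic ISTA scheme. A minimal sketch (the fixed step size 1/||A||_2^2 and the iteration count are our own choices):

```python
import numpy as np

def ista(A, b, mu, iters=2000):
    """Prox-linear iteration for min_x mu*||x||_1 + 0.5*||Ax - b||^2:
    linearize the fidelity term at x_k, then apply the shrinkage operator."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = Lipschitz const. of grad f
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - step * (A.T @ (A @ x - b))   # gradient step on the fidelity term
        x = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0.0)  # shrinkage
    return x
```

Note that the shrinkage threshold is step*mu, matching the complete-square form of the iteration above.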
Besides the prox-linear interpretation above, there are alternative derivations, such as forward-backward operator splitting, optimization transfer, and successive upper (first-order) approximation. Forward-backward operator splitting is general and very elegant.
Lemma. A point x_opt is the minimizer of a convex function f if and only if 0 is in the subdifferential of f at x_opt.
Introduce the operators T1 = the subdifferential of r and T2 = the gradient of f. Then the optimality condition (under regularity) is
    0 in T1(x*) + T2(x*),
which also holds after multiplying by a factor delta > 0.
Hence we have the fixed-point characterization
    x* = (I + delta*T1)^{-1} (I - delta*T2)(x*).   (7)
Letting delta = delta_k and applying the resolvent of T1, we obtain
    x^{k+1} = (I + delta_k*T1)^{-1} (x^k - delta_k * grad f(x^k)),   (8)
which differs from the right-hand side of (6) only by a quantity independent of x. Hence (6) and (8) yield the same solution, and (6) is precisely a fixed-point iteration of (7), which tends to produce more stable solutions.
Dual Algorithms
For a constrained optimization problem, duality gives rise to a companion problem. Between the original (primal) problem and its companion (dual) problem there exists a weak, and sometimes also a strong, duality relation, which makes solving one of the problems essentially equivalent to solving the other. This brings new choices of efficient algorithms for solving the optimization problem.
Examples worked out on the slides: the dual of basis pursuit, group sparsity, and nuclear-norm minimization.
Bregman Method
There are different versions of Bregman algorithms: original Bregman, linearized Bregman, and split Bregman. The Bregman method is based on successively minimizing the Bregman distance (also called the Bregman divergence).
Definition. The Bregman distance induced by a convex function r(.) is defined as
    D_r^p(x, y) = r(x) - r(y) - <p, x - y>,  where p is a subgradient of r at y.
Geometrically, the Bregman distance is the gap between r(x) and the supporting line of r at y (the length of the dashed line in the figure).
The original Bregman algorithm iteration has two steps: update a subgradient, then minimize the sum of the Bregman distance and the fidelity term. An interesting alternative form of the Bregman algorithm is the iteration of "adding back the residual":
    b^{k+1} = b^k + (b - A*x^k),
    x^{k+1} = argmin_x  mu*r(x) + (1/2)*||A*x - b^{k+1}||^2.
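The "add back the residual" form is easy to prototype on top of any solver for the l1-regularized least-squares subproblem. A minimal sketch, using plain ISTA as the (assumed) inner solver; the outer and inner iteration counts are our own choices:

```python
import numpy as np

def bregman(A, b, mu, outer=5, inner=500):
    """Bregman method in the 'adding back the residual' form: each outer pass
    solves min_x mu*||x||_1 + 0.5*||Ax - b_k||^2 (here with plain ISTA),
    then augments the measurements by the remaining residual."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    bk = b.copy()
    for _ in range(outer):
        for _ in range(inner):                 # inner ISTA solve
            z = x - step * (A.T @ (A @ x - bk))
            x = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0.0)
        bk = bk + (b - A @ x)                  # add back the residual
    return x
```

As in the slides' example, successive outer iterations remove the bias of the l1 penalty and recover the non-zero magnitudes more accurately.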
Bregman iterations and denoising. Figure panels: (a) the BPDN solution; (b) Bregman iteration 1; (c) Bregman iteration 3; (d) Bregman iteration 5.
The parameters of the example: we generated a signal x0 of n = 250 dimensions with k = 20 non-zero components that were i.i.d. samples of the standard Gaussian distribution; x0 is depicted by the black dots in the figure. The measurement matrix A had m = 100 rows and n = 250 columns, with its entries sampled i.i.d. from the standard Gaussian distribution. The measurements were b = A*x0 + n, where the noise n was independent Gaussian with standard deviation 0.1.
The results of the example: due to the noise n in the measurements, neither approach gave an exact reconstruction of x0. The BPDN solution had a large number of false non-zero components and slightly mismatched the non-zero components of x0. The Bregman solutions had significantly fewer false non-zero components; iterations 1 and 3 missed certain non-zero components of x0, while iteration 5 matched x0 much better, achieving a relative 2-norm error of 3.50% versus 6.19% for the BPDN solution.
Homotopy Algorithm and Parametric Quadratic Programming
For the model
    min_x  lambda*||x||_1 + (1/2)*||A*x - b||^2,
assuming the uniqueness of the solution x*(lambda) for each lambda > 0, the solution path x*(lambda) is continuous and piece-wise linear in lambda. Let us fix lambda and study the optimality condition satisfied by x*:
    0 in lambda * d||x*||_1 + A^T(A*x* - b),
where the subdifferential of the L1 norm is given component-wise by d|x_i| = {sign(x_i)} if x_i != 0, and [-1, 1] if x_i = 0. The aim is to make most components 0, so that the working dimension is small.
Since the subdifferential is component-wise separable, the condition reduces to a condition on each component, or equivalently,
    a_i^T(b - A*x*) = lambda * sign(x_i*)  if x_i* != 0,
    |a_i^T(b - A*x*)| <= lambda            if x_i* = 0,
where a_i is the i-th column of A. By the definition of the subdifferential, these two cases are exactly the component-wise optimality conditions.
This offers a way to rewrite the optimality conditions: if we let the active set Lambda be the support of x*, then the conditions split into equalities on Lambda and inequalities off Lambda, and the optimality condition can be rewritten equivalently as system (9).
In (9a)-(9c) (shown on the slides), A_Lambda denotes the submatrix of A formed by the columns of A in the active set Lambda, and a_i is the i-th column of A. Combining (9a) and (9b), we obtain a linear system for the non-zero part of x*, whose solution is piece-wise linear in lambda; the homotopy algorithm follows this path, updating the active set at the breakpoints.
Continuation, Varying Step Sizes, and Line Search
For the optimization problem (3), the convergence of the prox-linear iterations depends highly on the model parameter mu and the step size. A smaller mu gives more emphasis to the regularization function r(x) and often leads to more structured, or sparser, solutions. A large mu can lead to prox-linear iterations that are extremely slow.
Continuation on mu is a simple technique used in several codes, such as FPC and SpaRSA, to accelerate the convergence of prox-linear iterations. To apply this technique, one does not use the mu given in (3) (which we call the target mu) to start the iterations; instead, one uses a small mu and gradually increases it over the iterations until it reaches the target. This lets the iteration produce highly structured (low r(x)) yet less-fitted (large f(Ax - b)) solutions first, corresponding to smaller mu values, and use them to "warm start" the iterations that produce solutions with more balanced structure and fit, corresponding to larger mu values.
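A minimal continuation sketch in the spirit of FPC/SpaRSA (the geometric schedule, the stage and iteration counts, and the mapping of mu to an effective l1 weight are our own assumptions):

```python
import numpy as np

def ista_continuation(A, b, mu_target, n_stages=6, inner=300):
    """Solve min_x (1/mu)*||x||_1 + 0.5*||Ax - b||^2 by continuation:
    start with a small mu (heavy regularization, very sparse iterates)
    and warm-start ISTA as mu grows geometrically toward mu_target."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for mu in mu_target * 4.0 ** np.arange(-(n_stages - 1), 1):
        tau = 1.0 / mu                        # effective l1 weight at this stage
        for _ in range(inner):
            z = x - step * (A.T @ (A @ x - b))
            x = np.sign(z) * np.maximum(np.abs(z) - step * tau, 0.0)
    return x
```

Each stage reuses the previous stage's solution as its starting point, so the expensive, well-fitted final stage starts close to its own solution.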
Line search is widely used in non-linear optimization to speed up objective descent and, in some cases, to guarantee convergence. It allows one to vary the step size in a systematic way: letting phi(x) = r(x) + f(x) denote the overall objective, the modified Armijo-Wolfe line search accepts a trial step only if it achieves sufficient decrease in phi (the precise iteration is given on the slides).
Greedy Pursuit Algorithms
Strictly speaking, the greedy algorithms for sparse optimization are not optimization algorithms, since they are not driven by an overall objective. Instead, they build a solution part by part; yet in many cases the solution is indeed the sparsest one.
A basic greedy algorithm is the orthogonal matching pursuit (OMP). With an initial point x^0 = 0 and an empty initial support S^0, starting at k = 1 it iterates
    1. r^{k-1} = b - A*x^{k-1}                                              (residual)
    2. S^k = S^{k-1} union {i*}, where i* minimizes min_a ||a*a_i - r^{k-1}|| over i   (support update)
    3. x^k = argmin_x ||A*x - b||  subject to supp(x) in S^k               (least-squares refit)
until a stopping criterion is satisfied.

J. Tropp and A. Gilbert, Signal recovery from partial information via orthogonal matching pursuit, IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4655-4666, 2007.
Step 1 computes the measurement residual; step 2 adds to the running support the column of A that best explains the residual; step 3 updates the solution to the one that best explains the measurements while confined to the running support. In step 2, for each i the inner minimizer is a = a_i^T r^{k-1} / ||a_i||^2; hence the selected i has the smallest remaining residual norm over all i (equivalently, the largest normalized correlation |a_i^T r^{k-1}| / ||a_i||). Step 3 solves a least-squares problem restricting x to the support S^k.
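The three steps above map directly to code. A minimal OMP sketch (the tolerance-based stopping rule is our own choice; we assume the columns of A are normalized so the correlation test suffices):

```python
import numpy as np

def omp(A, b, tol=1e-8, max_iter=None):
    """Orthogonal matching pursuit: (1) compute the residual, (2) add the
    column most correlated with it to the support, (3) least-squares refit
    on the support. Assumes the columns of A are unit-normalized."""
    m, n = A.shape
    max_iter = max_iter if max_iter is not None else m
    S = []                                   # running support S^k
    x = np.zeros(n)
    r = b.copy()                             # residual r^k
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        i = int(np.argmax(np.abs(A.T @ r)))  # column that best explains r
        S.append(i)
        xs, *_ = np.linalg.lstsq(A[:, S], b, rcond=None)
        x = np.zeros(n)
        x[S] = xs
        r = b - A @ x
    return x
```

After each least-squares refit the residual is orthogonal to the selected columns, so the same index is never picked twice.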
Many further developments of OMP lead to algorithms with improved solution quality and/or speed, such as stage-wise OMP (StOMP) [1], regularized OMP [2], subspace pursuit [3], CoSaMP [4], and hard-thresholding pursuit (HTP) [5].

[1] D. Donoho, Y. Tsaig, I. Drori, and J.-L. Starck, Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit, IEEE Transactions on Information Theory, 2006.
[2] D. Needell and R. Vershynin, Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit, IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 2, pp. 310-316, 2010.
[3] W. Dai and O. Milenkovic, Subspace pursuit for compressive sensing: Closing the gap between performance and complexity, arXiv:0803.0811, 2008.
[4] D. Needell and J. Tropp, CoSaMP: Iterative signal recovery from incomplete and inaccurate samples, Preprint, 2008.
[5] D. Ge, X. Jiang, and Y. Ye, A note on complexity of Lp minimization, Stanford University Technical Report, 2010.
With an initial point x^0 and an estimated sparsity level s, starting at k = 1, CoSaMP iterates the steps shown on the slides until a stopping criterion is satisfied, where the candidate set is the support of the best 2s-term approximation of the proxy A^T r^{k-1} and, similarly, the restriction step keeps the best s-term approximation. Over the iterations, supp(x^k) is updated but kept to no more than s components; hence the merged candidate support has no more than 3s components. Unlike OMP, CoSaMP's support is updated batch by batch.
Iterative Support Detection
Iterative support detection (ISD) aims to achieve fast reconstruction and a reduced requirement on the number of measurements compared to the classical L1-minimization approach; it addresses reconstructions that fail because L1 minimization has insufficient measurements. ISD estimates a support set I from the current reconstruction and obtains a new reconstruction by solving the truncated problem
    min_x  sum of |x_i| over i not in I,  subject to A*x = b,
and it iterates these two steps a small number of times. ISD differs from the orthogonal matching pursuit method and its variants in two ways:
- the index set I in ISD is not necessarily nested or increasing;
- the minimization problem above updates all the components of x at the same time.
The basis pursuit (BP) problem is
    min_x  ||x||_1  subject to A*x = b.
ISD runs as fast as the best BP algorithms but requires significantly fewer measurements. ISD alternately calls its two components, support detection and signal reconstruction: from an incorrect reconstruction, support detection identifies an index set I containing some elements of supp(x0); signal reconstruction then solves the truncated problem above.
The algorithmic framework of ISD is shown on the slide. Compared with greedy algorithms, where once an index enters the support it stays there forever, in ISD an index can be kicked out of the support at any time.
Hard Thresholding
Hard thresholding refers to the operator that keeps the K largest entries of a vector in magnitude, for a given K; we write it as hardthr(x). For a vector x with T the index set of its K largest-magnitude entries, hardthr(x) keeps x on T and sets all other entries to zero. Unlike soft thresholding, the values of the K largest entries are not changed. The basic iterative hard-thresholding (IHT) iteration is
    x^{k+1} = hardthr(x^k + A^T(b - A*x^k)).
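A minimal IHT sketch. Since the basic iteration's convergence argument needs ||A||_2 <= 1 (see the next slide), we rescale the problem first; the iteration count is our own choice:

```python
import numpy as np

def hardthr(x, K):
    """Keep the K largest-magnitude entries of x, zero out the rest."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-K:]
    out[keep] = x[keep]
    return out

def iht(A, b, K, iters=500):
    """Basic iterative hard thresholding: a gradient step on 0.5*||Ax - b||^2
    followed by hard thresholding to the K largest entries. The problem is
    rescaled so the operator norm of the matrix is 1."""
    scale = np.linalg.norm(A, 2)
    As, bs = A / scale, b / scale
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = hardthr(x + As.T @ (bs - As @ x), K)
    return x
```

Note that rescaling (A, b) leaves the solutions of A*x = b unchanged, so the recovered x needs no un-scaling.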
The iteration can be used to solve the problem
    min_x  ||A*x - b||^2  subject to ||x||_0 <= K,
with the guarantee of converging to one of its local minima, provided that ||A||_2 <= 1. Moreover, under the RIP assumption on A, this local minimum x* has bounded distance to the original unknown signal x0. The normalized iterative hard-thresholding iteration
    x^{k+1} = hardthr(x^k + alpha_k * A^T(b - A*x^k)),
where alpha_k is a changing step size, has much better performance. The rule for choosing alpha_k depends on whether the resulting supp(x^{k+1}) equals supp(x^k) or not.
Hard-thresholding pursuit (HTP) combines the (original and normalized) IHT with the idea of matching pursuit: instead of taking the hard-thresholded vector itself as x^{k+1}, HTP uses hard thresholding only to select the support and then performs a least-squares fit on that support, as in step 3 of OMP.
Algorithms for Low-Rank Matrices
When the recovery of a low-rank matrix is based on minimizing the nuclear norm and solving a convex optimization problem, most of the convex optimization algorithms above for L1 minimization can be extended to nuclear-norm minimization. Suppose the rank s of the underlying matrix is given, either exactly or as a current overestimate of the true rank. The underlying matrix can then be factorized as X = USV* or X = PQ, where U and P have s columns, V* and Q have s rows, and S is an s x s diagonal matrix. If one solves for (U, S, V) or (P, Q) instead of X, then X can be recovered from these variables and has rank no more than s. Hence these factorizations and changes of variable replace the role of minimizing the nuclear norm and give rise to a low-rank matrix.
Solvers based on these factorizations include OptSpace [1], LMaFit [2], and an SDPLR-based algorithm [3].

[1] S. Ma, D. Goldfarb, and L. Chen, Fixed point and Bregman iterative methods for matrix rank minimization, Mathematical Programming Series A, vol. 128, pp. 321-353, 2011.
[2] E. Hale, W. Yin, and Y. Zhang, Fixed-point continuation for l1-minimization: Methodology and convergence, SIAM Journal on Optimization, vol. 19, p. 1107, 2008.
[3] B. Recht, M. Fazel, and P. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review, vol. 52, no. 3, pp. 471-501, 2010.
OptSpace and LMaFit are based on minimizing
    min_{P,Q,Z}  ||P*Q - Z||_F^2  subject to Z_Omega = M_Omega
for recovering a low-rank matrix M from a subset of its entries, where Omega is the set of observed entries. The SDPLR-based algorithm solves a non-convex problem whose global solution (P*, Q*) generates X* = P*Q*, which is the solution to the corresponding nuclear-norm minimization problem.
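As an illustration of recovering a low-rank matrix from a subset of its entries, here is a simple sketch that alternates filling in the observed entries with a rank-s SVD projection. This is a plain "impute-and-project" variant of the idea, not LMaFit's actual factored update scheme:

```python
import numpy as np

def lowrank_complete(M, mask, s, iters=300):
    """Matrix completion sketch: alternate (1) overwriting the observed
    entries (mask == True) with their known values and (2) projecting onto
    the set of rank-s matrices via a truncated SVD."""
    X = np.zeros_like(M)
    for _ in range(iters):
        Z = np.where(mask, M, X)                 # enforce Z_Omega = M_Omega
        U, sig, Vt = np.linalg.svd(Z, full_matrices=False)
        X = (U[:, :s] * sig[:s]) @ Vt[:s]        # best rank-s approximation of Z
    return X
```

Only the masked entries of M are ever read, so the routine genuinely works from partial observations.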
How to Choose an Algorithm
Greedy versus optimization algorithms. First, both approaches can be trusted to recover sparse solutions given sufficient measurements. Secondly, because greedy algorithms progressively build the solution support, their performance on signals whose entries follow a faster-decaying distribution is better: fewer measurements are required when the decay is faster. L1 minimization, on the other hand, tends to have more consistent recovery performance; its solution quality is less affected by the decay speed. Thirdly, the two approaches are extended in different ways, which to a large extent determines which one is more suitable for a specific application.
The greedy approach can be extended to model-based CS, in which the underlying signals are not only sparse but their supports are restricted to certain models, e.g., a tree structure; some such models are difficult to express as optimization models. On the other hand, it is difficult to extend greedy approaches to handle general objective or energy functions when these are present along with the sparse objective in a model. Sparse optimization naturally accepts objective functions and constraints of many kinds, especially convex ones. Algorithms based on prox-linear iterations, variable splitting, and block coordinate descent are very flexible, so they are among the best choices for engineering models with "side" objectives and constraints. Overall, the two approaches are very complementary to each other.
Convex versus non-convex. There is no doubt that on some problems where convex optimization (for example, L1 minimization) fails to return faithful reconstructions, non-convex optimization succeeds; examples include problems with fast-decaying signals and very few measurements. On the other hand, the theory and generalizations of non-convex algorithms lag well behind: their recovery guarantees typically assume global optimality, whereas in convex optimization the solution quality is quite predictable. It is fair to say that on a given application of sparse optimization, one should perform extensive tests to confirm the performance and reliability of non-convex optimization algorithms.
Very large scale sparse optimization. To solve very large scale optimization problems, we face memory and computing bottlenecks. One remedy is to avoid general data and use data that have computation-friendly structures. For example, in CS, structured sensing matrices A include the discrete Fourier ensemble and partial Toeplitz or circulant matrices, which allow fast computation of A*x and A^T*y (A^* y in the complex case), and even (A*A^T)^{-1}, without storing the entire matrix in memory. When an algorithm is based on such matrix-vector operations, it can take advantage of the faster computation and be matrix-free.
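For instance, a circulant matrix is diagonalized by the DFT, so both C*x and C^T*y reduce to FFTs. A minimal matrix-free sketch:

```python
import numpy as np

def circulant_matvec(c, x):
    """Matrix-free product C @ x, where C is the circulant matrix whose
    first column is c: circular convolution via the FFT, in O(n log n) time
    and O(n) memory instead of storing the n-by-n matrix."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_rmatvec(c, y):
    """Matrix-free adjoint product C.T @ y (for real c), using the
    conjugated spectrum of c."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(c)) * np.fft.fft(y)))
```

An iterative solver that only calls these two routines never needs the dense matrix, which is exactly the matrix-free property described above.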
Distributed computation. This approach divides the original problem into a series of smaller subproblems, each involving just a subset of the data, or of the solution, or both. The smaller subproblems are solved, typically in parallel, on distributed computing nodes. There is a trade-off between the acceleration due to parallel computing and the overheads of synchronization and splitting.
Stochastic approximation. This takes advantage of the fact that in many problems, a small (random) subset of the data (even a single point) can generate a gradient (or Hessian) that approximates the actual one, whose computation would require the entire batch of data. Between using just one point and using the entire batch, there lies an optimal trade-off among memory, speed, and accuracy. If higher solution accuracy is needed, one can try stochastic approximation methods with dynamic sample sizes.
Solution structure also provides opportunities to address the memory and computing bottlenecks. When the solution is a highly sparse vector, algorithms such as BCD with the Gauss-Southwell update, homotopy algorithms, and some dual algorithms can keep all the intermediate solutions sparse, which effectively reduces data/solution storage and access. When the solution is a low-rank matrix (or tensor) of very large size, algorithms that store and update the factors, which have much smaller sizes, instead of the matrix (or tensor) itself have significant advantages.