
Slide1

Deep Linear Networks

Kaiqi Jiang

Feb 17

Slide2

Model formulation

Recall the model of fully-connected neural networks:

$f(x) = W_L\,\sigma\big(W_{L-1}\,\sigma(\cdots\,\sigma(W_1 x + b_1)\cdots) + b_{L-1}\big) + b_L,$

with weight matrices $W_1, \dots, W_L$, biases $b_1, \dots, b_L$, and activation function $\sigma$.

When $\sigma$ is the identity, $\sigma(z) = z$: Linear Networks.

In the following slides, we only consider linear networks without bias:

$f(x) = W_L W_{L-1} \cdots W_1 x.$
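A minimal NumPy sketch of this model (the dimensions and weights here are arbitrary, just for illustration): the layer-by-layer map collapses to a single end-to-end matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_1, d_2, d_y = 4, 5, 3, 2                 # arbitrary layer widths (illustrative)
W1 = rng.standard_normal((d_1, d_x))
W2 = rng.standard_normal((d_2, d_1))
W3 = rng.standard_normal((d_y, d_2))

def forward(x):
    """Layer-by-layer forward pass of the bias-free linear network f(x) = W3 W2 W1 x."""
    return W3 @ (W2 @ (W1 @ x))

x = rng.standard_normal(d_x)
W_end_to_end = W3 @ W2 @ W1                     # the equivalent single matrix
assert np.allclose(forward(x), W_end_to_end @ x)
```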

Slide3

Outline

Why are we interested in linear networks

Landscape of linear networks

local (global) minima, saddle points ...

Learning dynamics: gradient flow, gradient descent

Slide4

Why do we study linear networks

Simple

The output $f(x) = W_L W_{L-1}\cdots W_1 x$ is a linear function of the input $x$

Easier to analyze Gradient Descent (and other optimization algorithms) than for nonlinear networks

Not that simple, though: linear networks also exhibit some interesting behaviors

As shown in [1], despite the linearity of the input-output map, linear networks have nonlinear gradient descent dynamics on the weights.
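A tiny illustration of this point (my own toy example, not from [1]): with a two-layer scalar "network" $f(x) = w_2 w_1 x$, the model is linear in $x$, but the gradient-descent dynamics couple $w_1$ and $w_2$ nonlinearly.

```python
# Fit the scalar target y = 1 * x with the product w2 * w1.
# Loss: L(w1, w2) = 0.5 * (w2*w1 - 1)^2.
w1, w2, lr = 0.1, 0.2, 0.1
for step in range(200):
    r = w2 * w1 - 1.0                  # residual
    g1, g2 = w2 * r, w1 * r            # each gradient multiplies the *other* weight: nonlinear dynamics
    w1, w2 = w1 - lr * g1, w2 - lr * g2
print(w1 * w2)                         # ~1.0, reached along a nonlinear trajectory in (w1, w2)
```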

[1] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Slide5

Why do we study linear networks

Helps us understand matrix factorization

Matrices in real applications often have (approximately) low-rank structure, for example in recommendation problems.

$M \in \mathbb{R}^{m \times n}$ is a low-rank matrix and is factorized as $M \approx W_2 W_1$ with $W_2 \in \mathbb{R}^{m \times r}$, $W_1 \in \mathbb{R}^{r \times n}$, $r \ll \min(m, n)$: exactly a two-layer linear network.
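A minimal sketch of this connection (synthetic data, dimensions, and step size are my choices): gradient descent on a two-layer linear network is gradient descent on a low-rank factorization.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 20, 30, 3
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # ground-truth rank-r matrix

# Factorize M ~ W2 @ W1 with W2 in R^{m x r}, W1 in R^{r x n} by gradient descent
# on the loss 0.5 * ||M - W2 W1||_F^2 (a two-layer linear network with identity input).
W1 = 0.1 * rng.standard_normal((r, n))
W2 = 0.1 * rng.standard_normal((m, r))
lr = 0.01
for step in range(5000):
    R = W2 @ W1 - M                  # residual
    g2, g1 = R @ W1.T, W2.T @ R      # gradients w.r.t. W2 and W1
    W2, W1 = W2 - lr * g2, W1 - lr * g1
print(np.linalg.norm(W2 @ W1 - M) / np.linalg.norm(M))   # small relative error
```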

Slide6

Landscape of linear networks

We care about critical points: points $\theta$ with $\nabla L(\theta) = 0$.

Several types of critical points:

Local maximum (Hessian $\nabla^2 L(\theta) \preceq 0$)

Local minimum (Hessian $\nabla^2 L(\theta) \succeq 0$)

Neither a local minimum nor a local maximum: saddle points

If $\lambda_{\min}\big(\nabla^2 L(\theta)\big) < 0$, i.e. the Hessian has a strictly negative eigenvalue: strict saddle points
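A concrete instance (reusing the toy two-layer scalar loss from the sketch above): for $L(w_1, w_2) = \tfrac{1}{2}(w_2 w_1 - 1)^2$, the origin is a critical point whose Hessian has a strictly negative eigenvalue, so it is a strict saddle point.

```python
import numpy as np

# L(w1, w2) = 0.5 * (w2*w1 - 1)^2.
# Gradient: (w2*(w2*w1 - 1), w1*(w2*w1 - 1)), which vanishes at the origin.
# Hessian entries: d2L/dw1^2 = w2^2, d2L/dw2^2 = w1^2, d2L/dw1dw2 = 2*w1*w2 - 1.
H_at_origin = np.array([[0.0, -1.0],
                        [-1.0, 0.0]])
print(np.linalg.eigvalsh(H_at_origin))   # [-1.  1.] -> strict saddle at (0, 0)
```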

Slide7

Landscape: 1 hidden layer

Setting considered in [2]: a one-hidden-layer linear network $y = ABx$ with $n$ input units, $p < n$ hidden units, and $n$ output units, trained with the square loss (Figure 1 in [2] depicts this network).

Observations

Symmetry: the optimal $A$ and $B$ are not unique. For every invertible $p \times p$ matrix $C$, $(AC)(C^{-1}B) = AB$, so $(AC, C^{-1}B)$ realizes the same map.

Rank constraint: let $W = AB$; then $\operatorname{rank}(W) \le p$.

[2] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.

Slide8

Landscape: 1 hidden layer

Without rank constraints

Assume the input covariance $\Sigma_{XX}$ is invertible; then the unique minimizer of the square loss over all linear maps $W$ is the ordinary least-squares solution $W^\ast = \Sigma_{YX}\Sigma_{XX}^{-1}$.

With rank constraints ($\operatorname{rank}(W) \le p$)

Guess: the solution should be related to the projection onto some low-dimensional subspace (of dimension $p$).

Slide9

Landscape: 1 hidden layer

Without rank constraints: $W^\ast = \Sigma_{YX}\Sigma_{XX}^{-1}$ (previous slide).

With rank constraints

Define $\Sigma = \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$. Suppose $\operatorname{rank}(W) = p$; then every critical $W$ is the orthogonal projection of the unconstrained solution onto a subspace spanned by $p$ eigenvectors of $\Sigma$ [2]:

$W = P_{\mathcal{I}}\,\Sigma_{YX}\Sigma_{XX}^{-1},$

where $\mathcal{I}$ is any ordered $p$-index set, $U_{\mathcal{I}}$ denotes the matrix formed by the $p$ orthonormal eigenvectors of $\Sigma$ indexed by $\mathcal{I}$, and $P_{\mathcal{I}} = U_{\mathcal{I}}U_{\mathcal{I}}^\top$ is the orthogonal projection onto $\operatorname{span}(U_{\mathcal{I}})$.

Assuming further that $\Sigma$ has $n$ distinct eigenvalues $\lambda_1 > \cdots > \lambda_n$, [2] also determined which of these critical points are minima and which are saddle points (next slide).

[2] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
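A numerical sanity check of this characterization (a sketch; the data, dimensions, and learning rate are mine): gradient descent on a one-hidden-layer linear network should end up at the projected least-squares solution $P_{\mathcal{I}}\,\Sigma_{YX}\Sigma_{XX}^{-1}$ with $\mathcal{I} = \{1, \dots, p\}$, i.e. the top-$p$ eigenvectors of $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, N = 6, 2, 2000                       # input/output dimension, hidden width, sample size
X = rng.standard_normal((n, N))
Y = rng.standard_normal((n, n)) @ X + 0.1 * rng.standard_normal((n, N))

Sxx = X @ X.T / N
Syx = Y @ X.T / N
W_ols = Syx @ np.linalg.inv(Sxx)                     # unconstrained (full-rank) minimizer
Sigma = Syx @ np.linalg.inv(Sxx) @ Syx.T             # Sigma = Syx Sxx^{-1} Sxy
eigvals, U = np.linalg.eigh(Sigma)
U_top = U[:, np.argsort(eigvals)[::-1][:p]]          # top-p eigenvectors of Sigma
W_pred = U_top @ U_top.T @ W_ols                     # predicted global minimizer under rank <= p

# Gradient descent on the one-hidden-layer linear network y ~ A B x
A = 0.01 * rng.standard_normal((n, p))
B = 0.01 * rng.standard_normal((p, n))
lr = 0.05
for step in range(20000):
    G = (A @ B) @ Sxx - Syx                          # gradient of 0.5*E||y - ABx||^2 w.r.t. AB
    A, B = A - lr * (G @ B.T), B - lr * (A.T @ G)
print(np.linalg.norm(A @ B - W_pred) / np.linalg.norm(W_pred))   # small relative error
```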

Slide10

Landscape: 1 hidden layer

Landscape

The critical map $W$ associated with the index set $\mathcal{I} = \{1, \dots, p\}$ (the top $p$ eigenvectors of $\Sigma$) is the unique local and global minimum of the loss. The remaining $p$-index sets correspond to saddle points.

All additional critical points (defined by the formula on the previous slide, but not of full rank) are also saddle points, and can be characterized in terms of orthogonal projections onto subspaces spanned by fewer than $p$ eigenvectors of $\Sigma$ [2].

Figure 2 in [2] depicts this landscape.

[2] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.

Slide11

Landscape: more hidden layers

Theorem 2.3 in [3]

Assume that $XX^\top$ and $XY^\top$ are of full rank with $d_y \le d_x$, and that $\Sigma = YX^\top(XX^\top)^{-1}XY^\top$ has $d_y$ distinct eigenvalues. Then, for any depth $H \ge 1$, any layer widths, and any input-output dimensions $d_x, d_y$, the loss function $L(W)$ has the following properties:

Every local minimum is a global minimum.

Every critical point that is not a global minimum is a saddle point.

If $\operatorname{rank}(W_H \cdots W_2) = p$, then the Hessian at any saddle point has at least one (strictly) negative eigenvalue, i.e. the saddle point is strict.

Here $H$ is the number of hidden layers and $p = \min\{d_1, \dots, d_H\}$ is the smallest width of a hidden layer.

[3] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

Slide12

Landscape: more hidden layers

Corollary 2.4 in [3]

Under the assumptions of the previous theorem:

For three-layer networks (i.e., $H = 1$, a single hidden layer), the Hessian at any saddle point has at least one (strictly) negative eigenvalue, i.e. every saddle point is strict.

For networks deeper than three layers (i.e., $H \ge 2$), there exist saddle points at which the Hessian does not have any negative eigenvalue.

[3] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

Slide13

Escape strict saddle points

Gradient Descent can escape strict saddle points

If every saddle point is strict, then gradient descent with random initialization converges to a local minimizer almost surely [5], but it can take exponential time (exponential in the dimension) [7].

More efficient algorithms: Perturbed Gradient Descent

Gradient descent with perturbations [4, 6] can find an approximate local minimizer in polynomial time.

These results hold for general networks (not necessarily linear).

[4] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.

[5] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.

[6] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724–1732, 2017.

[7] Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnabas Poczos, and Aarti Singh. Gradient descent can take exponential time to escape saddle points. arXiv preprint arXiv:1705.10412, 2017.
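A minimal sketch of the perturbation idea behind [4, 6] (not the exact algorithms from those papers; the test function and all constants are my choices): when the gradient is small, add a small random perturbation so that iterates do not stall at a strict saddle.

```python
import numpy as np

def grad(x):
    # f(x, y) = 0.5*x^2 - 0.5*y^2 + 0.25*y^4 has a strict saddle at the origin
    # and local minima at (0, +1) and (0, -1).
    return np.array([x[0], -x[1] + x[1] ** 3])

rng = np.random.default_rng(3)
x = np.array([1e-3, 0.0])          # start (almost) on the saddle's stable manifold
lr, radius = 0.1, 1e-2
for step in range(500):
    g = grad(x)
    if np.linalg.norm(g) < 1e-3:   # near a critical point: perturb to escape strict saddles
        x = x + radius * rng.standard_normal(2)
    x = x - lr * g
print(x)                            # ends up near a local minimizer (0, +-1), not the saddle
```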

Slide14

Learning dynamics: Gradient Flow

Layers are automatically balanced

[8] and [9] showed that, for linear networks, gradient flow conserves the difference between adjacent Gram matrices:

$\frac{d}{dt}\Big(W_{j+1}(t)^\top W_{j+1}(t) - W_j(t)\,W_j(t)^\top\Big) = 0, \qquad j = 1, \dots, L-1.$

If the initial layers are balanced, i.e. $W_{j+1}(0)^\top W_{j+1}(0) = W_j(0)\,W_j(0)^\top$ for all $j$, then the layers remain balanced at all times!

Let $W_j = U_j \Sigma_j V_j^\top$ be the SVD of $W_j$. Balancedness $W_{j+1}^\top W_{j+1} = W_j W_j^\top$ then gives $V_{j+1}\Sigma_{j+1}^\top\Sigma_{j+1} V_{j+1}^\top = U_j \Sigma_j\Sigma_j^\top U_j^\top$: adjacent layers share the same singular values, and the right singular vectors of $W_{j+1}$ align with the left singular vectors of $W_j$.

[8] Simon S. Du, Wei Hu, and Jason D. Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. arXiv preprint arXiv:1806.00900, 2018.

[9] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018.
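A quick numerical check (a sketch; small-step gradient descent stands in for gradient flow, and the data, sizes, and step size are my choices): starting from a balanced initialization, $W_2^\top W_2 - W_1 W_1^\top$ stays essentially constant during training.

```python
import numpy as np

rng = np.random.default_rng(4)
d_x, d_h, d_y, N = 5, 3, 3, 200
X = rng.standard_normal((d_x, N))
Y = rng.standard_normal((d_y, d_x)) @ X

# Balanced initialization: build (W1, W2) from the SVD of a small random matrix W0,
# so that W2 @ W1 = W0 and W2.T @ W2 = W1 @ W1.T exactly.
W0 = 0.1 * rng.standard_normal((d_y, d_x))
U, s, Vt = np.linalg.svd(W0, full_matrices=False)
W2 = U @ np.diag(np.sqrt(s))          # d_y x d_h  (here d_h = d_y)
W1 = np.diag(np.sqrt(s)) @ Vt         # d_h x d_x

Sxx, Syx = X @ X.T / N, Y @ X.T / N
imbalance = lambda: np.linalg.norm(W2.T @ W2 - W1 @ W1.T)
print("initial imbalance:", imbalance())
lr = 1e-3
for step in range(5000):
    G = (W2 @ W1) @ Sxx - Syx         # gradient of the square loss w.r.t. the product W2 W1
    W2, W1 = W2 - lr * (G @ W1.T), W1 - lr * (W2.T @ G)
print("final   imbalance:", imbalance())   # stays small (exactly 0 under gradient flow)
```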

Slide15

Learning dynamics: Gradient Flow

This balanced property comes from a symmetry property: for any invertible matrix $C$, replacing $(W_{j+1}, W_j)$ by $(W_{j+1}C^{-1}, C W_j)$ leaves the end-to-end map unchanged.

Noether's theorem: if a system has a continuous symmetry property, then there are corresponding quantities whose values are conserved in time [10].

How this property helps optimization

The set of global minima, $\{(W_1, \dots, W_L) : W_L \cdots W_1 = W^\ast\}$, is not compact: the norms of individual layers $W_j$ can go to infinity while the product stays fixed (scale one layer up and a neighboring layer down).

Restricted to balanced weights, $W_{j+1}^\top W_{j+1} = W_j W_j^\top$, the set of global minima becomes compact!

[10] Noether's theorem, Wikipedia: https://en.wikipedia.org/wiki/Noether%27s_theorem

Slide16

Learning dynamics: Gradient Flow

Can generalize to nonlinear networks

[8] showed that for general nonlinear networks, if each activation $\sigma$ is homogeneous, namely $\sigma(z) = \sigma'(z)\,z$, then the incoming and outgoing weights of each hidden neuron stay balanced.

Corollary 2.1 in [8]: under gradient flow,

$\frac{d}{dt}\Big(\big\|[W_j]_{i,\cdot}\big\|^2 - \big\|[W_{j+1}]_{\cdot,i}\big\|^2\Big) = 0,$

where $[W_j]_{i,\cdot}$ and $[W_{j+1}]_{\cdot,i}$ represent the $i$-th row of $W_j$ (the weights coming into neuron $i$ of layer $j$) and the $i$-th column of $W_{j+1}$ (the weights going out of that neuron), respectively.

ReLU, Leaky ReLU, and linear activation functions satisfy the homogeneity property.

[8] Simon S. Du, Wei Hu, and Jason D. Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. arXiv preprint arXiv:1806.00900, 2018.
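A numerical check of this conservation law for a ReLU network (a sketch; the two-layer architecture, data, and step size are my choices, and small-step gradient descent stands in for gradient flow):

```python
import numpy as np

rng = np.random.default_rng(5)
d, h, N = 4, 6, 300
X = rng.standard_normal((d, N))
y = rng.standard_normal(N)

W = 0.5 * rng.standard_normal((h, d))     # incoming weights: row i of W feeds neuron i
a = 0.5 * rng.standard_normal(h)          # outgoing weights: a[i] leaves neuron i
relu = lambda z: np.maximum(z, 0.0)

def balance():
    # per-neuron quantity ||incoming||^2 - ||outgoing||^2, conserved under gradient flow
    return np.sum(W ** 2, axis=1) - a ** 2

print("before:", balance())
lr = 1e-3
for step in range(3000):
    Z = W @ X                               # pre-activations, shape (h, N)
    H = relu(Z)
    r = a @ H - y                           # residuals, shape (N,)
    grad_a = H @ r / N
    grad_W = ((Z > 0) * (a[:, None] * r[None, :])) @ X.T / N
    W, a = W - lr * grad_W, a - lr * grad_a
print("after: ", balance())                 # approximately unchanged
```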

Slide17

Learning dynamics: Gradient Flow

Trajectory analysis (gradient flow)

Let $W(t) = W_L(t)\cdots W_1(t)$ be the end-to-end matrix at time $t$, and consider the square loss $\ell(W) = \tfrac{1}{2}\|Y - WX\|_F^2$; for linear networks the training objective is $L(W_1, \dots, W_L) = \tfrac{1}{2}\|Y - W_L\cdots W_1 X\|_F^2$.

[9] showed that, under balanced initialization, $W(t)$ moves according to the following differential equation:

$\frac{d}{dt}W(t) = -\sum_{j=1}^{L}\big(W(t)W(t)^\top\big)^{\frac{L-j}{L}}\;\nabla\ell\big(W(t)\big)\;\big(W(t)^\top W(t)\big)^{\frac{j-1}{L}}.$

The gradient is pre- and post-multiplied by positive semidefinite matrices that depend on the current end-to-end matrix: implicit preconditioning induced by depth.

[9] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018.
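To see where this comes from, here is the depth-two case worked out (a short derivation assuming balancedness, with the notation above and $W = W_2 W_1$). Under gradient flow, $\dot W_1 = -\nabla_{W_1}L = -W_2^\top\,\nabla\ell(W)$ and $\dot W_2 = -\nabla_{W_2}L = -\nabla\ell(W)\,W_1^\top$, so

$\dot W = \dot W_2\,W_1 + W_2\,\dot W_1 = -\nabla\ell(W)\,W_1^\top W_1 - W_2 W_2^\top\,\nabla\ell(W).$

Balancedness $W_2^\top W_2 = W_1 W_1^\top$ gives $W^\top W = W_1^\top(W_2^\top W_2)W_1 = (W_1^\top W_1)^2$ and $WW^\top = W_2(W_1 W_1^\top)W_2^\top = (W_2 W_2^\top)^2$, hence $W_1^\top W_1 = (W^\top W)^{1/2}$ and $W_2 W_2^\top = (WW^\top)^{1/2}$, and therefore

$\dot W = -(WW^\top)^{1/2}\,\nabla\ell(W) - \nabla\ell(W)\,(W^\top W)^{1/2},$

which is the $L = 2$ case of the equation above.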

Slide18

Learning dynamics: Gradient Descent

Gradient Descent approximately keeps the balancedness across layers

If the initial layers are approximately balanced, then they remain approximately balanced throughout the optimization process [11].

Linear convergence of Gradient Descent [11]

Let $W(t) = W_L(t)\cdots W_1(t)$ be the end-to-end matrix at iteration $t$, and consider the square loss; for linear networks it becomes $L(W_1, \dots, W_L) = \tfrac{1}{2}\|Y - W_L\cdots W_1 X\|_F^2$.

If the initial layers are approximately balanced and the initial end-to-end matrix has a proper deficiency margin (the detailed definition can be found in [11]), then gradient descent with a proper choice of learning rate converges linearly to a global minimum.

[11] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.
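A small experiment sketch of this behavior (my setup, simplified; not the exact setting analyzed in [11]): a depth-three linear network with a balanced, full-rank initialization and whitened data, where the loss shrinks by a roughly constant factor per block of iterations, i.e. geometrically.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 3
A = rng.standard_normal((d, d))
T = A @ A.T + 0.5 * np.eye(d)            # well-conditioned target end-to-end matrix

# Balanced initialization: W3 = W2 = W1 = c * I, so W_{j+1}^T W_{j+1} = W_j W_j^T = c^2 I.
c = 0.5 ** (1.0 / 3.0)
W1, W2, W3 = (c * np.eye(d) for _ in range(3))

loss = lambda: 0.5 * np.linalg.norm(W3 @ W2 @ W1 - T) ** 2   # whitened square loss
lr, prev = 5e-3, loss()
for step in range(1, 2001):
    R = W3 @ W2 @ W1 - T                                      # derivative w.r.t. the end-to-end matrix
    g1, g2, g3 = (W3 @ W2).T @ R, W3.T @ R @ W1.T, R @ (W2 @ W1).T
    W1, W2, W3 = W1 - lr * g1, W2 - lr * g2, W3 - lr * g3
    if step % 400 == 0:
        cur = loss()
        print(f"step {step:5d}  loss {cur:.3e}  ratio vs 400 steps ago {cur / prev:.3e}")
        prev = cur
```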

Slide19

Conclusion

Landscape of linear networks

Every local minimum is a global minimum

Every saddle point is strict for linear networks with one hidden layer; for networks with more hidden layers there exist non-strict saddle points

Learning Dynamics of linear networks

Layers are automatically balanced under gradient flow, and approximately balanced under gradient descent

Gradient Descent has linear convergence under some conditions