Deep Linear Networks
Kaiqi Jiang
Feb 17
Model formulation
Recall the model of fully-connected neural networks:
$f(x) = W_N\,\sigma\big(W_{N-1}\,\sigma(\cdots\,\sigma(W_1 x + b_1)\cdots) + b_{N-1}\big) + b_N$
When the activation $\sigma$ is the identity, we obtain a linear network.
In the following slides, we only consider linear networks without bias:
$f(x) = W_N W_{N-1} \cdots W_1 x$
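As a concrete reference, here is a minimal NumPy sketch of the model above (the dimensions and initialization are illustrative assumptions, not taken from the slides):

```python
import numpy as np

# Illustrative dimensions (assumptions): input d_x, two hidden widths, output d_y.
d_x, d_1, d_2, d_y = 8, 16, 16, 4
rng = np.random.default_rng(0)

# One weight matrix per layer; no biases and no activation (a linear network).
W = [rng.standard_normal((d_1, d_x)) / np.sqrt(d_x),
     rng.standard_normal((d_2, d_1)) / np.sqrt(d_1),
     rng.standard_normal((d_y, d_2)) / np.sqrt(d_2)]

def forward(x, weights):
    """Deep linear network: f(x) = W_N ... W_2 W_1 x."""
    h = x
    for W_j in weights:
        h = W_j @ h
    return h

x = rng.standard_normal(d_x)
print(np.allclose(forward(x, W), W[2] @ W[1] @ W[0] @ x))  # True: the end-to-end map is linear
```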
Outline
Why are we interested in linear networks
Landscape of linear networks
local (global) minima, saddle points ...
Learning dynamics: gradient flow, gradient descent
Why do we study linear networks
Simple
The output $f(x) = W_N \cdots W_1 x$ is a linear function of the input $x$.
Easier to analyze Gradient Descent (and other optimization algorithms) than for nonlinear networks.
Not that simple: linear networks also have some interesting behaviors.
As was shown in [1], despite the linearity of the input-output map, linear networks have nonlinear gradient descent dynamics on the weights.
[1] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
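A scalar example (our illustration, not from [1]) makes this concrete: for a depth-2 network with weights $w_1, w_2 \in \mathbb{R}$ and loss $L(w_1, w_2) = \frac{1}{2}(w_2 w_1 - a)^2$, gradient flow gives
$\dot{w}_1 = -w_2\,(w_2 w_1 - a), \qquad \dot{w}_2 = -w_1\,(w_2 w_1 - a),$
a pair of coupled cubic ODEs: the weight dynamics are nonlinear even though the input-output map $x \mapsto w_2 w_1 x$ is linear.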
Why do we study linear networks
Help us understand matrix factorization
Matrices in real applications may have low-rank structure, for example in the recommendation problem.
A low-rank matrix $M \in \mathbb{R}^{m \times n}$ is factorized as $M \approx UV$ with $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{r \times n}$, and $r \ll \min(m, n)$.
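A minimal sketch of this connection, assuming a synthetic low-rank target and a plain gradient-descent loop (names, rank, and step size are illustrative): factorizing $M$ as $UV$ is training a two-layer linear model.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 30, 20, 3                        # illustrative sizes and rank (assumptions)

# Synthetic low-rank target M (stand-in for, e.g., a ratings matrix).
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Factorize M ~ U @ V by gradient descent on 0.5 * ||U V - M||_F^2,
# i.e. train a two-layer linear model.
U = 0.1 * rng.standard_normal((m, r))
V = 0.1 * rng.standard_normal((r, n))
lr = 0.01
for _ in range(5000):
    R = U @ V - M                          # residual
    U, V = U - lr * (R @ V.T), V - lr * (U.T @ R)

print(np.linalg.norm(U @ V - M))           # close to 0: the low-rank structure is recovered
```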
Landscape of linear networks
We care about critical points: points where $\nabla L(\theta) = 0$.
Several types of critical points:
$\nabla^2 L(\theta) \preceq 0$ (negative semidefinite): local maximum
$\nabla^2 L(\theta) \succeq 0$ (positive semidefinite): local minimum
Neither a local minimum nor a local maximum: saddle point
If $\lambda_{\min}\big(\nabla^2 L(\theta)\big) < 0$: strict saddle point
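As a sanity check, here is a toy instance of this classification (our example, not from the slides): for the depth-2 scalar loss $L(w_1, w_2) = \frac{1}{2}(w_2 w_1 - a)^2$ used above, the origin is a critical point whose Hessian has one strictly negative eigenvalue, i.e. a strict saddle.

```python
import numpy as np

a = 1.0   # target of the toy loss L(w1, w2) = 0.5 * (w2*w1 - a)^2

def grad(w1, w2):
    r = w1 * w2 - a
    return np.array([w2 * r, w1 * r])

def hessian(w1, w2):
    # Hand-computed second derivatives of L.
    return np.array([[w2 ** 2,         2 * w1 * w2 - a],
                     [2 * w1 * w2 - a, w1 ** 2        ]])

print(grad(0.0, 0.0))                          # [0. 0.]  -> the origin is a critical point
print(np.linalg.eigvalsh(hessian(0.0, 0.0)))   # [-1.  1.] -> one negative eigenvalue: strict saddle
```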
Landscape: 1 hidden layer
Setting considered in [2]: a single-hidden-layer linear network $\hat{y} = W_2 W_1 x$ trained with the squared loss (Figure 1 in [2] shows this network).
Observations
Symmetry: the optimal $W_1$ and $W_2$ are not unique. For every invertible matrix $C$, the pair $(W_2 C^{-1},\, C W_1)$ realizes the same end-to-end map as $(W_2,\, W_1)$.
Rank constraint: let $W = W_2 W_1$; we have $\mathrm{rank}(W) \le \min(d_y, d_1, d_x)$, where $d_1$ is the width of the hidden layer.
[2] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
Landscape: 1 hidden layer
Without rank constraints
Assume $\Sigma_{xx} = \sum_i x_i x_i^\top$ is invertible; then the unique minimizer of $L(W) = \frac{1}{2}\sum_i \|y_i - W x_i\|^2$ over all matrices $W$ is the least-squares solution $W^* = \Sigma_{yx}\Sigma_{xx}^{-1}$, where $\Sigma_{yx} = \sum_i y_i x_i^\top$.
With rank constraints
Guess: the minimizer is related to the projection onto some low-dimensional subspace (of dimension $d_1$).
Landscape: 1 hidden layer
Without rank constraints: the unique minimizer is $W^* = \Sigma_{yx}\Sigma_{xx}^{-1}$ (previous slide).
With rank constraints
Suppose $W_1$ and $W_2$ are of full rank $d_1$; then every critical $W = W_2 W_1$ is the orthogonal projection of $\Sigma_{yx}\Sigma_{xx}^{-1}$ onto a subspace spanned by eigenvectors of $\Sigma = \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}$ [2]:
$W = P_{U_{\mathcal{I}}}\, \Sigma_{yx}\Sigma_{xx}^{-1},$
where $P_{U_{\mathcal{I}}} = U_{\mathcal{I}}U_{\mathcal{I}}^\top$ is the orthogonal projection onto $\mathrm{span}(U_{\mathcal{I}})$, $\mathcal{I}$ is any ordered $d_1$-index set, and $U_{\mathcal{I}}$ denotes the matrix formed by the $d_1$ orthonormal eigenvectors of $\Sigma$ indexed by $\mathcal{I}$.
Assuming that $\Sigma$ has $d_y$ distinct eigenvalues $\lambda_1 > \lambda_2 > \cdots > \lambda_{d_y}$, [2] also showed the following.
Landscape: 1 hidden layer
Landscape
The critical map $W$ associated with the index set $\mathcal{I} = \{1, \dots, d_1\}$ (the top $d_1$ eigenvectors of $\Sigma$) is the unique local and global minimum of $L$. The remaining $d_1$-index sets correspond to saddle points.
All additional critical points (those for which $W = W_2 W_1$ is not of full rank) are also saddle points, and can be characterized in terms of orthogonal projections onto subspaces spanned by eigenvectors indexed by sets with fewer than $d_1$ elements.
Figure 2 in [2]: the landscape of $L$.
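A small NumPy sketch of this characterization (data and dimensions are illustrative assumptions): the global minimum of the rank-constrained problem is the least-squares map projected onto the top $d_1$ eigenvectors of $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(2)
d_x, d_y, d_1, n = 10, 6, 2, 500                  # illustrative sizes; d_1 = hidden width

X = rng.standard_normal((d_x, n))
Y = rng.standard_normal((d_y, d_x)) @ X + 0.1 * rng.standard_normal((d_y, n))

Sxx = X @ X.T
Syx = Y @ X.T
W_ls = Syx @ np.linalg.inv(Sxx)                   # unconstrained least-squares minimizer

# Sigma = Syx Sxx^{-1} Sxy; project W_ls onto the span of its top-d_1 eigenvectors.
Sigma = W_ls @ Syx.T
vals, U = np.linalg.eigh(Sigma)
U_top = U[:, np.argsort(vals)[::-1][:d_1]]        # top-d_1 orthonormal eigenvectors
W_star = U_top @ U_top.T @ W_ls                   # the unique local (= global) minimum

print(np.linalg.matrix_rank(W_star))              # d_1
print(0.5 * np.linalg.norm(Y - W_star @ X) ** 2)  # optimal loss under the rank constraint
```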
Landscape: more hidden layers
Theorem 2.3 in [3]
Assume that $XX^\top$ and $XY^\top$ are of full rank with $d_y \le d_x$, and that $\Sigma = YX^\top(XX^\top)^{-1}XY^\top$ has $d_y$ distinct eigenvalues. Then, for any depth $H \ge 1$, any layer widths $d_1, \dots, d_H \ge 1$, and any input-output dimensions $d_x, d_y$, the loss function $L$ has the following properties:
Every local minimum is a global minimum.
Every critical point that is not a global minimum is a saddle point.
If $\mathrm{rank}(W_H \cdots W_2) = p$, then the Hessian at any saddle point has at least one (strictly) negative eigenvalue (strict saddle).
[3] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
Notation: $p = \min\{d_1, \dots, d_H\}$ denotes the smallest width of a hidden layer.
Landscape: more hidden layers
Corollary 2.4 in [3]
Under the assumptions of the previous theorem:
For three-layer networks (i.e., $H = 1$), the Hessian at any saddle point has at least one (strictly) negative eigenvalue (strict saddle).
For networks deeper than three layers (i.e., $H \ge 2$), there exist saddle points at which the Hessian does not have any negative eigenvalue.
Escape strict saddle points
Gradient Descent can escape strict saddle points
If every saddle point is strict, then gradient descent with random initialization converges to a local minimizer almost surely [5], but it can take time exponential in the dimension [7].
More efficient algorithms: Perturbed Gradient Descent
Gradient descent with perturbations [4, 6] can find an approximate local minimizer in polynomial time.
These results hold for general networks (not necessarily linear).
[4] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
[5] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
[6] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724–1732, 2017.
[7] Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnabas Poczos, and Aarti Singh. Gradient descent can take exponential time to escape saddle points. arXiv preprint arXiv:1705.10412, 2017.
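A minimal sketch of the perturbation idea in [4, 6] (the toy loss, thresholds, and noise scale are our illustrative choices, not the exact algorithms or constants from those papers): when the gradient becomes very small, add a small random perturbation so the iterate can fall off a strict saddle along a negative-curvature direction.

```python
import numpy as np

rng = np.random.default_rng(3)
a = 1.0

def loss(w):                      # toy strict-saddle loss with a saddle at the origin
    w1, w2 = w
    return 0.5 * (w1 * w2 - a) ** 2

def grad(w):
    w1, w2 = w
    r = w1 * w2 - a
    return np.array([w2 * r, w1 * r])

w = np.zeros(2)                   # start exactly at the saddle point
lr, g_thresh, noise = 0.1, 1e-3, 1e-2

for _ in range(2000):
    g = grad(w)
    if np.linalg.norm(g) < g_thresh:
        w = w + noise * rng.standard_normal(2)   # perturbation step: plain GD would stay stuck
    else:
        w = w - lr * g                           # ordinary gradient step

print(w, loss(w))                 # ends near the minimizer set {w1 * w2 = a}, loss ~ 0
```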
Learning dynamics: Gradient Flow
Layers are automatically balanced
[8] and [9] showed that, for linear networks under gradient flow,
$\frac{d}{dt}\Big( W_{j+1}(t)^\top W_{j+1}(t) - W_j(t)\,W_j(t)^\top \Big) = 0 \quad$ for every adjacent pair of layers $j$.
If the initial layers are balanced, i.e. $W_{j+1}(0)^\top W_{j+1}(0) = W_j(0)\,W_j(0)^\top$, then the layers will remain balanced at all times!
Let $W_j = U_j \Sigma_j V_j^\top$ be the SVD of $W_j$; balancedness then means that adjacent layers share the same singular values and have aligned singular vectors ($V_{j+1} = U_j$, up to rotations within subspaces of equal singular values).
[8] Simon S. Du, Wei Hu, and Jason D. Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. arXiv preprint arXiv:1806.00900, 2018.
[9] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018.
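A quick numerical check of this conservation law, using small-step gradient descent as a stand-in for gradient flow (the data, sizes, and step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
d_x, d_1, d_y, n = 6, 4, 3, 100              # illustrative sizes (assumptions)

X = rng.standard_normal((d_x, n))
Y = rng.standard_normal((d_y, d_x)) @ X       # realizable linear targets

W1 = 0.5 * rng.standard_normal((d_1, d_x))
W2 = 0.5 * rng.standard_normal((d_y, d_1))

def balance_gap(W1, W2):
    # Quantity conserved under gradient flow: W2^T W2 - W1 W1^T.
    return W2.T @ W2 - W1 @ W1.T

gap0 = balance_gap(W1, W2)
lr = 1e-3                                     # small steps, mimicking gradient flow
for _ in range(10000):
    R = (W2 @ W1 @ X - Y) / n                 # residual of the mean square loss
    gW2 = R @ (W1 @ X).T                      # dL/dW2
    gW1 = W2.T @ R @ X.T                      # dL/dW1
    W1, W2 = W1 - lr * gW1, W2 - lr * gW2

print(np.linalg.norm(balance_gap(W1, W2) - gap0))   # small: the gap is (approximately) conserved
```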
Learning dynamics: Gradient Flow
This balancedness property comes from the symmetry of linear networks: the end-to-end map is unchanged if we replace $(W_{j+1}, W_j)$ by $(W_{j+1}C^{-1},\, C W_j)$ for any invertible $C$.
How does this property help optimization?
Noether's theorem: if a system has a continuous symmetry property, then there are corresponding quantities whose values are conserved in time [10].
The set of global minima $\{(W_1, \dots, W_N) : W_N \cdots W_1 = W^*\}$ (where $W^*$ minimizes the loss over end-to-end matrices) is not compact: the norms of individual layers $W_j$ can go to infinity while the product stays fixed. Restricted to balanced parameters, this set becomes compact!
[10] Noether's theorem, Wikipedia: https://en.wikipedia.org/wiki/Noether%27s_theorem
Learning dynamics: Gradient Flow
Can generalize to nonlinear networks
[8] showed that for general nonlinear networks, if each activation $\sigma$ is homogeneous, namely $\sigma(x) = \sigma'(x)\,x$, then (Corollary 2.1 in [8]):
$\frac{d}{dt}\Big( \big\|(W_j)_{i,\cdot}\big\|^2 - \big\|(W_{j+1})_{\cdot,i}\big\|^2 \Big) = 0,$
where $(W_j)_{i,\cdot}$ and $(W_{j+1})_{\cdot,i}$ represent the $i$-th row of $W_j$ and the $i$-th column of $W_{j+1}$, respectively.
ReLU, Leaky ReLU, and linear activation functions satisfy the homogeneity property.
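A numerical check of this neuron-wise conservation for a two-layer ReLU network, again using small-step gradient descent as a proxy for gradient flow (sizes, data, and step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
d_x, d_1, d_y, n = 6, 8, 3, 200               # illustrative sizes (assumptions)

X = rng.standard_normal((d_x, n))
Y = rng.standard_normal((d_y, n))

W1 = 0.5 * rng.standard_normal((d_1, d_x))
W2 = 0.5 * rng.standard_normal((d_y, d_1))

def per_neuron_gap(W1, W2):
    # For hidden neuron i: ||i-th row of W1||^2 - ||i-th column of W2||^2.
    return (W1 ** 2).sum(axis=1) - (W2 ** 2).sum(axis=0)

gap0 = per_neuron_gap(W1, W2)
lr = 1e-3
for _ in range(5000):
    A = W1 @ X
    H = np.maximum(A, 0.0)                    # ReLU: a homogeneous activation
    R = (W2 @ H - Y) / n                      # residual of the mean square loss
    gW2 = R @ H.T
    gW1 = ((W2.T @ R) * (A > 0)) @ X.T        # backprop through the ReLU mask
    W1, W2 = W1 - lr * gW1, W2 - lr * gW2

print(np.abs(per_neuron_gap(W1, W2) - gap0).max())   # stays close to 0
```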
Learning dynamics: Gradient Flow
Trajectory analysis (gradient flow)
Notation: let $W(t) = W_N(t) W_{N-1}(t) \cdots W_1(t)$ be the end-to-end matrix at time $t$. Consider the square loss; for linear networks it becomes a function $\ell(W)$ of the end-to-end matrix alone.
[9] showed that, under balanced initialization, $W(t)$ moves according to the following differential equation:
$\dot{W}(t) = -\sum_{j=1}^{N} \big[W(t)W(t)^\top\big]^{\frac{j-1}{N}}\, \nabla \ell\big(W(t)\big)\, \big[W(t)^\top W(t)\big]^{\frac{N-j}{N}},$
which can be written as $\mathrm{vec}(\dot{W}) = -P_{W}\,\mathrm{vec}\big(\nabla \ell(W)\big)$, where $P_{W}$ is a positive semidefinite matrix (in particular, all of its eigenvalues are greater than or equal to $0$): an implicit preconditioning.
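A sketch comparing the two views for depth $N = 2$ (data, sizes, and step size are illustrative assumptions): gradient descent on the two layers with balanced initialization, versus explicitly integrating the end-to-end equation above.

```python
import numpy as np

rng = np.random.default_rng(6)
d_x, d_y, n = 5, 3, 100                          # illustrative sizes (assumptions)
X = rng.standard_normal((d_x, n))
Y = rng.standard_normal((d_y, d_x)) @ X

def grad_loss(W):                                # gradient of l(W) = 1/(2n) ||W X - Y||_F^2
    return (W @ X - Y) @ X.T / n

def psd_sqrt(M):                                 # symmetric PSD square root via eigendecomposition
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

# Balanced two-layer factorization of a common starting point W0 = W2 @ W1.
W0 = 0.3 * rng.standard_normal((d_y, d_x))
U, S, Vt = np.linalg.svd(W0, full_matrices=False)
W2, W1 = U * np.sqrt(S), np.sqrt(S)[:, None] * Vt

W_end = W0.copy()
lr = 1e-3
for _ in range(3000):
    # (a) ordinary gradient descent on the two layers
    G = grad_loss(W2 @ W1)
    W1, W2 = W1 - lr * (W2.T @ G), W2 - lr * (G @ W1.T)
    # (b) the end-to-end preconditioned dynamics from [9], specialized to N = 2
    Ge = grad_loss(W_end)
    W_end = W_end - lr * (psd_sqrt(W_end @ W_end.T) @ Ge + Ge @ psd_sqrt(W_end.T @ W_end))

print(np.linalg.norm(W2 @ W1 - W_end))           # small: the two trajectories stay close
```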
Learning dynamics: Gradient Descent
Gradient Descent approximately keeps the balancedness across layers:
If the initial layers are approximately balanced, then they remain approximately balanced throughout the optimization process [11].
Linear convergence of Gradient Descent [11]:
If the initial layers are approximately balanced, and the initial end-to-end matrix has a proper deficiency margin (detailed definition can be found in [11]), then gradient descent with a proper choice of learning rate has linear convergence to global minima.
[11] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.
Conclusion
Landscape of linear networks
Every local minimum is a global minimum
Every saddle point is strict for linear networks with one hidden layer; non-strict saddle points exist for networks with more hidden layers.
Learning Dynamics of linear networks
Layers are automatically balanced under Gradient Flow / Descent
Gradient Descent has linear convergence under some conditions