Slide1
Two approaches to non-convex machine learning
Yuchen Zhang
Stanford University
Slide2
Non-convexity in modern machine learning
State-of-the-art AI models are learnt by minimizing (often non-convex) loss functions.
Traditional optimization algorithms only guarantee to find locally optimal solutions.
Slide3
This talk
Two ideas to attack non-convexity and local minima:
Idea 1: Injecting large random noise into SGD.
A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics. Yuchen Zhang, Percy Liang, Moses Charikar (COLT’17).
Idea 2: Convex relaxation.
Convexified Convolutional Neural Networks. Yuchen Zhang, Percy Liang, Martin Wainwright (ICML’17).
Slide4
Part I: Injecting Large Random Noise into SGD
Slide5
Gradient descent and local minima
Problem: $\min_{x \in \mathbb{R}^d} f(x)$, where $f$ may be non-convex.
Gradient Descent (GD): $x_{t+1} = x_t - \eta \nabla f(x_t)$.
Running GD on a non-convex function may converge to a sub-optimal local minimum.
[Figure: a non-convex curve with a global minimum and several local minima.]
Slide6
Adding noise to gradient descent
Draw a noise vector: $w_t \sim N(0, I_{d \times d})$.
Update: $x_{t+1} = x_t - \eta\,(\nabla f(x_t) + w_t)$.
The noise is vacuous when $\eta \to 0$: over the $\sim 1/\eta$ steps needed to move a constant distance, the accumulated noise is only $O(\sqrt{\eta})$.
Slide7
Langevin Monte Carlo (LMC) (Roberts and Tweedie 1996)
Imitate Langevin diffusion in physics.
Choose a temperature $1/\xi$ and a stepsize $\eta$. Iteratively update:
$x_{t+1} = x_t - \eta \nabla f(x_t) + \sqrt{2\eta/\xi}\; w_t$, where $w_t \sim N(0, I_{d \times d})$.
The noise dominates the gradient when $\eta \to 0$: the noise term scales as $\sqrt{\eta}$, the gradient step as $\eta$.
LMC escapes local minima even with small step sizes.
Slide8
Stochastic Gradient Langevin Dynamics (SGLD) (Welling and Teh 2011)
Use a stochastic gradient $g(x_t)$, with $\mathbb{E}[g(x_t)] = \nabla f(x_t)$, instead of the exact gradient $\nabla f(x_t)$.
Iteratively update: $x_{t+1} = x_t - \eta\, g(x_t) + \sqrt{2\eta/\xi}\; w_t$.
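For concreteness, below is a minimal NumPy sketch of the SGLD iteration above; the function name, stepsize, temperature, and the toy objective are illustrative choices, not values from the paper.

```python
import numpy as np

def sgld(grad_fn, x0, stepsize=1e-3, inv_temp=10.0, num_steps=10_000, seed=0):
    """SGLD: x <- x - eta * g(x) + sqrt(2 * eta / xi) * w, with w ~ N(0, I).

    grad_fn(x) returns a (possibly stochastic) unbiased estimate of the gradient of f at x.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    noise_scale = np.sqrt(2.0 * stepsize / inv_temp)
    iterates = [x.copy()]
    for _ in range(num_steps):
        w = rng.standard_normal(x.shape)              # isotropic Gaussian noise
        x = x - stepsize * grad_fn(x) + noise_scale * w
        iterates.append(x.copy())
    return np.array(iterates)

# Toy example: f(x) = x^4 - 2x^2 has two minima; SGLD can hop between them.
grad_f = lambda x: 4 * x**3 - 4 * x
trace = sgld(grad_f, x0=np.array([1.0]))
```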
Slide9
Stationary distribution
With a small stepsize $\eta$, the distribution of $x_t$ converges to a stationary distribution:
$\mu(x) \propto \exp(-\xi f(x))$.
If the temperature $1/\xi$ is low and $x \sim \mu$, then $x$ approximately minimizes $f$.
Slide10
SGLD in practice
SGLD outperforms SGD on several modern applications:
Preventing over-fitting (Welling and Teh 2011): logistic regression, independent components analysis.
Learning deep neural networks: Neural Programmer (Neelakantan et al. 2015), Neural Random-Access Machines (Kurach et al. 2015), Neural GPUs (Kaiser and Sutskever 2015), deep bidirectional LSTM (Zeyer et al. 2016).
Slide11
Mixing time (time to converge to $\mu$)
For smooth functions and a small enough stepsize:
SGLD asymptotically converges to $\mu$ (Roberts and Tweedie 1996, Teh et al. 2016).
For convex $f$: the LMC mixing time is polynomial (Bubeck et al. 2015, Dalalyan 2016).
For non-convex $f$: the SGLD mixing time was rigorously characterized (Raginsky et al. 2017). However, it can be exponential in the dimension and the inverse temperature.
Slide12
Mixing time is too pessimistic?
SGLD can hit a good solution much earlier than it converges to the stationary distribution.
Example: W-shaped function.
Slide13
Our analysis
SGLD's hitting time to an arbitrary target set.
Polynomial upper bounds on the hitting time.
Application: non-convex empirical risk minimization.
[Figure: an SGLD trajectory entering a highlighted target set.]
Slide14
Preliminaries (I)
For any function $f$, define a probability measure $\mu_f$:
$\mu_f(A) \propto \int_A e^{-f(x)}\,dx$.
Slide15
Preliminaries (II)
Given a function $f$, for any set $A$, define its boundary measure (informally, its surface area):
$\mu_f(\partial A) := \liminf_{\epsilon \to 0^+} \frac{\mu_f(A_\epsilon) - \mu_f(A)}{\epsilon}$, where $A_\epsilon$ is the $\epsilon$-shell (the set of points within distance $\epsilon$ of $A$).
Slide16
Restricted Cheeger Constant
Intuition: the constant is small if and only if some subset is "isolated" from the rest.
Given $f$ and a set $V$, define the Restricted Cheeger Constant:
$\mathcal{C}_f(V) := \inf_{A \subseteq V} \frac{\mu_f(\partial A)}{\mu_f(A)}$  (surface area over volume).
Claim: $\mathcal{C}_{\xi f}(V)$ measures the efficiency of SGLD (run on $f$ with inverse temperature $\xi$) at escaping the set $V$.
[Figure: if some subset $A$ is isolated, $\mathcal{C}_f(V)$ is small; if all subsets are well-connected, $\mathcal{C}_f(V)$ is large.]
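For intuition only, here is a rough one-dimensional illustration of this quantity. It makes simplifying assumptions that are not part of the formal definition: the infimum is taken only over interval subsets, the measure is normalized on $V$, and the boundary measure of an interval is approximated by the density at its two endpoints.

```python
import numpy as np

def restricted_cheeger_1d(f, v_lo, v_hi, xi=2.0, num_grid=2000, stride=20):
    """Rough estimate of inf over intervals A = [l, r] in V of mu(boundary A) / mu(A),
    for mu(x) proportional to exp(-xi * f(x)) on V = [v_lo, v_hi]."""
    xs = np.linspace(v_lo, v_hi, num_grid)
    dens = np.exp(-xi * f(xs))
    # cumulative trapezoid integral of the unnormalized density
    cum = np.concatenate([[0.0], np.cumsum(0.5 * (dens[1:] + dens[:-1]) * np.diff(xs))])
    dens, cum = dens / cum[-1], cum / cum[-1]          # normalize so that mu(V) = 1
    best = np.inf
    for i in range(0, num_grid - 1, stride):           # left endpoint of A
        for j in range(i + stride, num_grid, stride):  # right endpoint of A
            volume = cum[j] - cum[i]                   # mu(A)
            surface = dens[i] + dens[j]                # ~ boundary measure of A
            if volume > 1e-12:
                best = min(best, surface / volume)
    return best

# W-shaped function: two wells separated by a barrier. At lower temperature (larger xi)
# each well becomes more isolated, so the constant shrinks.
w_shape = lambda x: (x**2 - 1.0) ** 2
print(restricted_cheeger_1d(w_shape, -1.5, 1.5, xi=2.0))
print(restricted_cheeger_1d(w_shape, -1.5, 1.5, xi=20.0))
```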
Slide17
Stability property
Lemma: If $\sup_x |f(x) - g(x)| \le \epsilon$, then $\mathcal{C}_f(V)$ and $\mathcal{C}_g(V)$ differ by at most a multiplicative factor of $e^{O(\epsilon)}$.
If two functions are pointwise close, then their Restricted Cheeger Constants are close.
If $f \approx g$, the efficiency of SGLD on $f$ and on $g$ is almost equivalent.
Our strategy: run SGLD on $f$, but analyze its efficiency on $g$.
Example: $f$ = empirical risk, $g$ = population risk.
Slide18
General theorem
Theorem: For an arbitrary function $f$ and target set $U$, SGLD's hitting time to $U$ is (with high probability) at most polynomial in $1/\mathcal{C}_{\xi f}(U^c)$ and the other problem parameters.
This reduces the problem to lower bounding $\mathcal{C}_{\xi f}(U^c)$, the Restricted Cheeger Constant of the complement of $U$.
It is sufficient to study the geometric properties of $f$ and $U$.
Studying geometric properties is much easier than studying the SGLD trajectory.
Slide19
Lower bounds on the Restricted Cheeger Constant
For an arbitrary smooth function $f$:
Lemma: Under the following conditions:
$U$ = {the $\epsilon$-approximate local minima of $f$}, and the inverse temperature $\xi$ is large enough,
we have a lower bound: $\mathcal{C}_{\xi f}(U^c)$ is at least polynomial in the problem parameters (not exponentially small).
[Figure: the target set $U$ covers the local minima; saddle points lie outside it.]
Slide20
Lower bound + General theorem + Stability property
Theorem: Run SGLD on $f$. For any proxy function $F$ satisfying:
$F$ is smooth, and $\sup_x |f(x) - F(x)|$ is small,
SGLD hits an $\epsilon$-approximate local minimum of $F$ in poly-time.
Function $F$: a perturbation of $f$ that eliminates as many local minima as possible.
SGLD efficiently escapes all local minima of $f$ that don't exist in $F$.
Slide21
SGLD for empirical risk minimization
Empirical risk minimization: empirical risk $\hat{f}(x) = \frac{1}{n}\sum_{i=1}^n \ell(x; z_i)$ for i.i.d. samples $z_1, \dots, z_n$; population risk $f(x) = \mathbb{E}_z[\ell(x; z)]$.
Facts:
Under mild conditions, $\sup_x |\hat{f}(x) - f(x)| \to 0$ as $n \to \infty$.
For large enough $n$, SGLD efficiently finds a local minimum of the population risk.
It doesn't need the regularity conditions on the empirical risk that are required by SGD.
Slide22
Learning a linear classifier with the 0-1 loss
Assumption: labels are corrupted by Massart noise (each label is flipped with probability at most $\frac{1}{2} - c$).
(Awasthi et al. 2016): learns in polynomial time. SGLD: learns in polynomial time.
[Figure: the one-dimensional empirical 0-1 loss, sample size $n$.]
Slide23
Summary
SGLD is asymptotically optimal for non-convex optimization, but its mixing time can be exponentially long.
The hitting time inversely depends on the Restricted Cheeger Constant.
Under certain conditions, the hitting time can be polynomial.
If $f \approx F$, then running SGLD on $f$ hits optimal points of $F$.
SGLD is more robust than SGD for empirical risk minimization.
Slide24
Part II: Convexified Convolutional Neural Networks
Slide25
Why convexify CNN?
CNN uses “convolutional filters” to extract local features.
Generalizes better than fully-connected NNs.
Requires non-convex optimization.
What if I want a globally optimal solution?
ScatNet (Bruna and Mallat 2013)
PCANet (Chan et al. 2014)
Convolutional kernel networks (Mairal et al. 2014)
CNN with random filters (Daniely et al. 2016)
However, none of them is guaranteed to be as good as the classical CNN.
Slide26
CNN: convolutional layer
A convolutional layer applies non-linear filters to a sliding window of patches.
Patches: $z_1(x), \dots, z_P(x)$, the overlapping local windows of the input.
Filters: $h_1, \dots, h_r$, each of the form $h_j(z) = \sigma(w_j^\top z)$ for a non-linear activation $\sigma$.
Input: an image $x$.
Output: the filter responses $h_j(z_p(x))$ for every patch $p$ and every filter $j$.
[Figure: a convolutional layer.]
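A minimal NumPy sketch of such a layer (the helper names, the stride-1 patch extraction, and the choice of tanh as the non-linear activation are illustrative assumptions):

```python
import numpy as np

def extract_patches(image, k):
    """All k x k patches (stride 1) of a 2-D image, flattened into rows."""
    H, W = image.shape
    return np.stack([image[i:i + k, j:j + k].ravel()
                     for i in range(H - k + 1) for j in range(W - k + 1)])

def conv_layer(image, filters, k, activation=np.tanh):
    """Apply non-linear filters h_j(z) = sigma(w_j . z) to every patch z of the image."""
    Z = extract_patches(image, k)      # (P, k*k): one patch per row
    return activation(Z @ filters.T)   # (P, r): response of every filter on every patch

# Example: a random 8x8 image and r = 4 filters over 3x3 patches.
rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
W = rng.standard_normal((4, 9))        # filter parameters w_1, ..., w_4
H_out = conv_layer(image, W, k=3)      # shape (36, 4)
```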
Slide27
CNN: output and loss function
Output: defined by a linear fully-connected layer.
Example: a two-layer CNN on an image $x$:
$f_k(x) = \sum_{j=1}^{r} \sum_{p=1}^{P} \alpha_{k,j,p}\; \sigma(w_j^\top z_p(x))$, with filter parameters $w_j$ and output parameters $\alpha$.
Loss function: $\frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)$, where $\ell$ is e.g. a cross-entropy, hinge, or squared loss.
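Continuing the patch-extraction sketch above (it reuses `conv_layer`, `rng`, `image`, and `W` from that block), one plausible forward pass and loss for this two-layer model, here with a squared loss for simplicity:

```python
def two_layer_cnn(image, W, alpha, k=3):
    """Two-layer CNN: a convolutional layer followed by a linear fully-connected layer.

    W:     (r, k*k) filter parameters.
    alpha: (num_classes, P, r) output parameters, one weight per (patch, filter) response.
    """
    H_out = conv_layer(image, W, k)                # (P, r) filter responses
    return np.einsum('kpr,pr->k', alpha, H_out)    # class scores f_1(x), ..., f_K(x)

def squared_loss(scores, onehot_label):
    return 0.5 * np.sum((scores - onehot_label) ** 2)

# Example: score the image from the previous sketch against 10 classes.
alpha = rng.standard_normal((10, 36, 4))
scores = two_layer_cnn(image, W, alpha)
loss = squared_loss(scores, np.eye(10)[3])
```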
Slide28
Challenges
The CNN loss is non-convex because of:
The non-linear activation function $\sigma$.
Parameter sharing: the same filter $w_j$ is shared across all patches.
Question: how to train a CNN by convex optimization while preserving non-linear filters and parameter sharing?
Slide29
Convexifying linear two-layer CNNs (I)
Linear CNN: $f_k(x) = \sum_{j}\sum_{p} \alpha_{k,j,p}\; w_j^\top z_p(x)$ (the activation is the identity).
Three matrices:
Design matrix $Z(x)$, where the $p$-th row is the patch $z_p(x)$.
Filter matrix $W$, where the $j$-th column is the filter $w_j$.
Output matrix, whose entries are the output parameters $\alpha_{k,j,p}$.
Write the outputs as a linear function of a single parameter matrix $A$ (the product of the filter and output matrices):
$f_k(x) = \langle A_k,\, Z(x) \rangle$.
Constraint: $\mathrm{rank}(A) \le r$, the number of filters.
[Figure: filter outputs.]
Slide30
Convexifying linear two-layer CNNs (II)
Re-parameterization: $A := $ (filter matrix) $\times$ (output matrix), so that $\mathrm{rank}(A) \le r$.
Learning the two factors directly is a non-convex problem.
Relax the rank constraint to a nuclear-norm constraint: $\|A\|_* \le B$.
Then solve a convex optimization problem:
$\min_{A}\ \frac{1}{n}\sum_{i=1}^{n} \ell\big(\langle A, Z(x_i)\rangle,\, y_i\big)$ subject to $\|A\|_* \le B$.
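A hedged sketch of one way to solve a problem of this form, using projected gradient descent onto the nuclear-norm ball, with a scalar output and squared loss for simplicity; the paper's actual optimizer, loss, and matrix shapes may differ.

```python
import numpy as np

def project_nuclear_ball(A, radius):
    """Euclidean projection of A onto {M : ||M||_* <= radius}: project the singular values
    onto the l1 ball (Duchi et al. 2008) while keeping the singular vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    if s.sum() <= radius:
        return A
    u = np.sort(s)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - radius) / (np.arange(len(u)) + 1) > 0)[0][-1]
    theta = (css[rho] - radius) / (rho + 1)
    return (U * np.maximum(s - theta, 0.0)) @ Vt

def ccnn_pgd(Z_list, y, radius, lr=0.01, num_steps=500):
    """min_A (1/n) * sum_i (<A, Z_i> - y_i)^2  s.t.  ||A||_* <= radius,
    where Z_i is the patch (design) matrix of example i and <.,.> is the Frobenius inner product."""
    A = np.zeros_like(Z_list[0])
    n = len(Z_list)
    for _ in range(num_steps):
        grad = sum(2.0 * (np.sum(A * Z_i) - y_i) * Z_i for Z_i, y_i in zip(Z_list, y)) / n
        A = project_nuclear_ball(A - lr * grad, radius)
    return A

# Example: 20 random "design matrices" with 6 patches of dimension 9 and scalar labels.
rng = np.random.default_rng(1)
Z_list = [rng.standard_normal((6, 9)) for _ in range(20)]
y = [float(np.sum(Z)) for Z in Z_list]
A_hat = ccnn_pgd(Z_list, y, radius=5.0)
```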
Slide31
Convexifying non-linear two-layer CNNs (I)
Non-linear filter: $h(z) = \sigma(w^\top z)$.
Re-parameterize $h$ in a Reproducing Kernel Hilbert Space (RKHS):
$h(z) = \langle \bar{w},\, \phi(z) \rangle$, where $k(z, z') = \langle \phi(z), \phi(z') \rangle$ is a kernel function and $\phi$ is its non-linear mapping.
The non-linear CNN filter becomes a linear RKHS filter.
Slide32
Convexifying non-linear two-layer CNNs (II)
CNN filter → RKHS filter: this re-parameterization defines a convex loss.
Construct a matrix $Q$ s.t. $\langle Q_u, Q_v \rangle = k(z_u, z_v)$ for all pairs of patches $z_u, z_v$ in the training set.
Re-define the patches: replace each patch $z$ by its corresponding row of $Q$.
Then optimize the linear CNN loss (the nuclear-norm-constrained convex problem from the linear case) on the re-defined patches.
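A minimal sketch of this kernelization step, assuming a Gaussian kernel and an exact eigendecomposition of the full kernel matrix (in practice the patch set is large, so an approximate factorization would be needed); the helper names are illustrative.

```python
import numpy as np

def gaussian_kernel(Z1, Z2, gamma=1.0):
    """Gaussian (RBF) kernel matrix between two sets of patch vectors stored as rows."""
    sq = np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :] - 2.0 * Z1 @ Z2.T
    return np.exp(-gamma * sq)

def kernelize_patches(all_patches, gamma=1.0, eps=1e-10):
    """Return Q with one row per patch such that Q @ Q.T equals the kernel matrix K,
    so that linear functions of the rows of Q realize RKHS filters."""
    K = gaussian_kernel(all_patches, all_patches, gamma)    # (N, N) kernel matrix
    evals, evecs = np.linalg.eigh(K)                        # K is symmetric PSD
    Q = evecs * np.sqrt(np.clip(evals, eps, None))          # guard tiny negative eigenvalues
    return Q

# Example: kernelize 50 random 9-dimensional patches.
rng = np.random.default_rng(2)
Q = kernelize_patches(rng.standard_normal((50, 9)), gamma=0.5)
```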
Slide33
What filters can be re-parameterized?
Recall the CNN filters: $h(z) = \sigma(w^\top z)$.
If the activation $\sigma$ is smooth, then the filter $h$ will be smooth.
By properly choosing the kernel $k$, the corresponding RKHS will cover all sufficiently smooth functions, including $h$.
We choose the kernel for training (e.g. a Gaussian kernel).
A smooth $\sigma$ is only required by the theoretical analysis.
Slide34
Theoretical results for convexifying two-layer CNN
If the activation $\sigma$ is sufficiently smooth:
Tractability: the Convexified CNN (CCNN) can be learnt in polynomial time.
Optimality: the generalization loss of CCNN converges, at a polynomial rate in the sample size, to a value at least as good as that of the best possible CNN.
Sample efficiency: a fully-connected NN can require up to a polynomial factor more training examples than CCNN to achieve the same generalization loss.
Slide35
Multi-layer CCNN
1. Estimate the parameter matrix $\hat{A}$ for a two-layer CCNN.
2. Factorize $\hat{A}$ into filter and output parameters through an SVD.
3. Extract the RKHS filters from the leading singular vectors.
4. Repeat steps 1-3, using the filter outputs as input, to train the 2nd convolutional layer.
Recursively train the 3rd, 4th, … layer, if necessary.
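One plausible way to implement the SVD step above; how exactly the singular values are split between the filter and output factors is a design choice here, not necessarily the paper's.

```python
import numpy as np

def extract_filters(A_hat, r):
    """Split the learnt CCNN parameter matrix into r filters and output weights
    via a rank-r truncated SVD, so that filters.T @ outputs approximates A_hat."""
    U, s, Vt = np.linalg.svd(A_hat, full_matrices=False)
    filters = U[:, :r].T              # r filter directions (one per row)
    outputs = s[:r, None] * Vt[:r]    # corresponding output weights
    return filters, outputs

# Example: recover 4 filters from a random 9 x 36 parameter matrix.
rng = np.random.default_rng(3)
filters, outputs = extract_filters(rng.standard_normal((9, 36)), r=4)
```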
Slide36
Empirical results on multi-layer CCNN
10k/2k/50k examples for training/validation/testing.
MNIST variations (random noise, image background, random rotation, …).
(CCNN outperforms state-of-the-art results on rand, img, and img+rot.)
Slide37
Summary
Two challenges of convexifying CNN: non-linear activation and parameter sharing.
CCNN is a combination of two ideas:
CNN filters → RKHS filters.
Parameter sharing → nuclear-norm constraint.
Two-layer CCNN: strong optimality guarantee.
Deeper CCNN: convexification improves empirical results.
Slide38
Final summary of this talk
Non-convex optimization is hard, but we don't always need to solve non-convex optimization.
Optimization → diffusion process: SGD → SGLD.
Non-linear / low-rank → RKHS / nuclear-norm: CNN → CCNN.
High-level open question: is there a better abstraction for machine learning?