Two approaches to non-convex machine learning - PowerPoint Presentation

Presentation Transcript

Slide1

Two approaches to non-convex machine learning

Yuchen Zhang

Stanford University

Slide2

Non-convexity in modern machine learning


State-of-the-art AI models are learnt by minimizing (often non-convex) loss functions.

Traditional optimization algorithms only guarantee to find locally optimal solutions.

Slide3

This talk


Two ideas to attack non-convexity and local minima:

Idea 1: Injecting large random noise to SGD.
A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics
Yuchen Zhang, Percy Liang, Moses Charikar (COLT'17)

Idea 2: Convex relaxation.
Convexified Convolutional Neural Networks
Yuchen Zhang, Percy Liang, Martin Wainwright (ICML'17)

Slide4

Part I: Injecting Large Random Noise to SGD

Slide5

Gradient descent and local minima

Problem: $\min_{x \in \mathbb{R}^d} f(x)$.

Gradient Descent: $x_{t+1} = x_t - \eta\,\nabla f(x_t)$.

Running GD on a non-convex function may converge to a sub-optimal local minimum.

(Figure: a non-convex function with its global minimum and several local minima marked.)

Slide6

Adding noise to gradient descent

Draw a noise vector: $w_t \sim N(0, I_{d \times d})$.

Update: $x_{t+1} = x_t - \eta\,(\nabla f(x_t) + w_t)$.

Noise will be vacuous when $\eta \to 0$.

Slide7

Langevin Monte Carlo (LMC) (Roberts and Tweedie 1996)

Imitate Langevin diffusion in physics.

Choose a temperature parameter $\xi > 0$ and stepsize $\eta > 0$. Iteratively update:

$x_{t+1} = x_t - \eta\,\nabla f(x_t) + \sqrt{2\eta/\xi}\;w_t, \quad w_t \sim N(0, I_{d \times d}).$

Noise will dominate the gradient when $\eta \to 0$.

LMC escapes local minima even with small step sizes.
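For context, a short LaTeX sketch (standard background, not taken from the slides) of how this update arises as the Euler discretization of the Langevin diffusion:

```latex
% Langevin diffusion and its Euler-Maruyama discretization with stepsize \eta:
\[
  \mathrm{d}X_t = -\nabla f(X_t)\,\mathrm{d}t + \sqrt{2/\xi}\;\mathrm{d}B_t
  \qquad\Longrightarrow\qquad
  x_{t+1} = x_t - \eta\,\nabla f(x_t) + \sqrt{2\eta/\xi}\;w_t,\quad w_t \sim N(0, I).
\]
% The diffusion's stationary density is proportional to e^{-\xi f(x)},
% which matches the stationary distribution discussed on the following slides.
```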

Slide8

Stochastic Gradient Langevin Dynamics (SGLD) (Welling and Teh 2011)

Use a stochastic gradient $g(x)$ with $\mathbb{E}[g(x)] = \nabla f(x)$ instead of the exact gradient $\nabla f(x)$.

Iteratively update:

$x_{t+1} = x_t - \eta\,g(x_t) + \sqrt{2\eta/\xi}\;w_t, \quad w_t \sim N(0, I_{d \times d}).$
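To make the update concrete, here is a minimal NumPy sketch of the SGLD iteration on a toy one-dimensional double-well ("W-shaped") objective; the objective, noise model, step size, and temperature value are illustrative choices, not part of the original slides:

```python
import numpy as np

def sgld(grad_estimate, x0, eta=5e-3, xi=3.0, steps=50_000, seed=0):
    """SGLD: x_{t+1} = x_t - eta * g(x_t) + sqrt(2 * eta / xi) * w_t, with w_t ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    trace = [x.copy()]
    for _ in range(steps):
        g = grad_estimate(x, rng)                 # stochastic gradient: E[g(x)] = grad f(x)
        x = x - eta * g + np.sqrt(2 * eta / xi) * rng.standard_normal(x.shape)
        trace.append(x.copy())
    return np.array(trace)

# Toy non-convex objective: a double well with a shallow local minimum near x = +1
# and the global minimum near x = -1.
f = lambda x: (x**2 - 1.0)**2 + 0.3 * x
grad_f = lambda x: 4.0 * x * (x**2 - 1.0) + 0.3
noisy_grad = lambda x, rng: grad_f(x) + 0.5 * rng.standard_normal(x.shape)

trace = sgld(noisy_grad, x0=[1.0])                # start inside the shallow (local) well
print("best objective value visited:", f(trace).min())
```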

Slide9

Stationary distribution

With a small stepsize $\eta$, the distribution of $x_t$ converges to a stationary distribution:

$\mu(x) \propto e^{-\xi f(x)}.$

If the temperature is low ($\xi$ is large) and $x \sim \mu$, then $x$ approximately minimizes $f$.

Slide10

SGLD in practice

SGLD outperforms SGD on several modern applications:

Prevent over-fitting (Welling and Teh 2011):
Logistic regression
Independent Components Analysis

Learn deep neural networks:
Neural programmer (Neelakantan et al. 2015)
Neural random-access machines (Kurach et al. 2015)
Neural GPUs (Kaiser and Sutskever 2015)
Deep bidirectional LSTM (Zeyer et al. 2016)

Slide11

Mixing time (time for converging to $\mu$)

For smooth functions and small enough stepsize:
SGLD asymptotically converges to $\mu$ (Roberts and Tweedie 1996, Teh et al. 2016).
For convex $f$: the LMC mixing time is polynomial (Bubeck et al. 2015, Dalalyan 2016).
For non-convex $f$: the SGLD mixing time was rigorously characterized (Raginsky et al. 2017). However, it can be exponential in the dimension $d$ and the temperature parameter $\xi$.

Slide12

Mixing time is too pessimistic?

SGLD can hit a good solution much earlier than it converges to the stationary distribution.

Example: W-shaped function.

Slide13

Our analysis

SGLD's hitting time to an arbitrary target set.

Polynomial upper bounds on the hitting time.

Application: non-convex empirical risk minimization.

(Figure: SGLD iterates hitting a target set.)

Slide14

Preliminaries (I)

For any function $f$, define a probability measure $\mu_f$:

$\mu_f(A) \propto \int_A e^{-\xi f(x)}\,\mathrm{d}x.$

Slide15

Preliminaries (II)

Given a function $f$, for any set $A$, define its boundary measure (informally, surface area):

$\mu_f(\partial A) := \liminf_{\epsilon \to 0^+} \frac{\mu_f(A_\epsilon) - \mu_f(A)}{\epsilon},$

where $A_\epsilon$ is the $\epsilon$-shell around $A$ (the set of points within distance $\epsilon$ of $A$).

(Figure: a set $A$ and its $\epsilon$-shell.)

Slide16

 

Restricted Cheeger Constant

Given $f$ and a set $K$, define the Restricted Cheeger Constant:

$\mathcal{C}_f(K) := \inf_{A \subseteq K} \frac{\mu_f(\partial A)}{\mu_f(A)}.$

Intuition: $\mathcal{C}_f(K)$ is small if and only if some subset $A \subseteq K$ is "isolated" from the rest.

Claim: $\mathcal{C}_f(K)$ measures the efficiency of SGLD (defined on $f$ and $K$) at escaping the set $K$.

(Figure: if some subset is isolated, its surface-area-to-volume ratio is small and $\mathcal{C}_f(K)$ is small; if all subsets are well-connected, $\mathcal{C}_f(K)$ is large.)

Slide17

Stability property

Lemma: If $\sup_x |\hat{f}(x) - F(x)| \le \epsilon$, then $\mathcal{C}_{\hat{f}}(K)$ and $\mathcal{C}_{F}(K)$ are close.

If two functions are pointwise close, then their Restricted Cheeger Constants are close.

If $\epsilon$ is small, the efficiency of SGLD on $\hat{f}$ and on $F$ is almost equivalent.

Our strategy: run SGLD on $\hat{f}$, but analyze its efficiency on $F$.

Example: $\hat{f}$ = empirical risk, $F$ = population risk.

Slide18

General theorem

Theorem: For an arbitrary function $f$ and target set $U$, SGLD's hitting time to $U$ satisfies (with high probability) an upper bound that depends inversely on the Restricted Cheeger Constant $\mathcal{C}_f(K \setminus U)$ and polynomially on the remaining problem parameters.

This reduces the problem to lower bounding $\mathcal{C}_f(K \setminus U)$.

It is sufficient to study the geometric properties of $f$ and $U$.

Studying geometric properties is much easier than studying the SGLD trajectory.

Slide19

Lower bounds on the Restricted Cheeger Constant

For an arbitrary smooth function $F$:

Lemma: Under the following conditions:
$U$ = {$\epsilon$-approximate local minima of $F$}, and
$F$ satisfies standard smoothness and regularity conditions,
we have an explicit lower bound on $\mathcal{C}_F(K \setminus U)$.

(Figure: local minima and a saddle point of a non-convex function.)

Slide20

Lower bound + General theorem + Stability property

Theorem: Run SGLD on $\hat{f}$. For a proxy function $F$ satisfying:
$\sup_x |\hat{f}(x) - F(x)|$ is sufficiently small,
$F$ is smooth, and
mild regularity conditions,
SGLD hits an $\epsilon$-approximate local minimum of $F$ in poly-time.

Function $F$: a perturbation of $\hat{f}$ that eliminates as many local minima as possible.

SGLD efficiently escapes all local minima of $\hat{f}$ that don't exist in $F$.

Slide21

SGLD for empirical risk minimization

Empirical risk minimization:
Empirical risk: $\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} \ell(x; a_i)$ for i.i.d. samples $a_1, \dots, a_n$.
Population risk: $f(x) = \mathbb{E}_a[\ell(x; a)]$.

Facts:
Under mild conditions, $\sup_x |\hat{f}(x) - f(x)| \to 0$ as $n \to \infty$.
For large enough $n$, SGLD efficiently finds a local minimum of the population risk.
It doesn't need smoothness or other regularity of the empirical risk (which are required by SGD).

Slide22

Learning a linear classifier with 0-1 loss

Assumption: labels are corrupted by Massart noise (each label is flipped with probability at most $q < 1/2$).

(Awasthi et al. 2016): learns in … time.
SGLD: learns in … time.

(Figure: the one-dimensional 0-1 empirical loss for sample size $n$.)

Slide23

Summary

SGLD is asymptotically optimal for non-convex optimization, but its mixing time can be exponentially long.

The hitting time inversely depends on the Restricted Cheeger Constant. Under certain conditions, the hitting time can be polynomial.

If $\sup_x |\hat{f}(x) - F(x)|$ is small, then running SGLD on $\hat{f}$ hits optimal points of $F$.

SGLD is more robust than SGD for empirical risk minimization.

Slide24

Part II: Convexified Convolutional Neural Networks

Slide25

Why convexify CNN?

CNN uses "convolutional filters" to extract local features.
It generalizes better than fully-connected NNs.
It requires non-convex optimization.

What if I want a globally optimal solution?
ScatNet (Bruna and Mallat 2013)
PCANet (Chan et al. 2014)
Convolutional kernel networks (Mairal et al. 2014)
CNN with random filters (Daniely et al. 2016)

However, none of them is guaranteed to be as good as the classical CNN.

Slide26

CNN: convolutional layer

A convolutional layer applies non-linear filters to a sliding window of patches.

Input: an image $x$.
Patches: $z_1(x), \dots, z_P(x)$, one vector per sliding-window location.
Filters: $f_1, \dots, f_r$, where $f_j(z) = \sigma(\langle w_j, z \rangle)$.
Output: the filter outputs $f_j(z_p(x))$ for every filter $j$ and patch $p$.

(Figure: a convolutional layer.)
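A tiny NumPy sketch of this layer as a matrix operation, assuming a single-channel image, stride-1 square patches, and tanh as the non-linearity; all names and sizes here are illustrative:

```python
import numpy as np

def extract_patches(x, k):
    """All k x k patches of a 2-D image x (stride 1), flattened into rows: shape (P, k*k)."""
    H, W = x.shape
    return np.stack([x[i:i + k, j:j + k].ravel()
                     for i in range(H - k + 1)
                     for j in range(W - k + 1)])

def conv_layer(x, filters, sigma=np.tanh):
    """Apply r non-linear filters to every patch: output[p, j] = sigma(<w_j, z_p(x)>)."""
    k = int(np.sqrt(filters.shape[1]))          # filters has shape (r, k*k)
    Z = extract_patches(x, k)                   # P x (k*k) patch matrix
    return sigma(Z @ filters.T)                 # P x r filter outputs

x = np.random.randn(8, 8)                       # toy 8 x 8 input image
W = np.random.randn(4, 9)                       # r = 4 filters of size 3 x 3
H = conv_layer(x, W)                            # 36 patches x 4 filter outputs
```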

Slide27

CNN: output and loss function

Output: defined by a linear fully-connected layer.

Example: two-layer CNN: $h(x) = \sum_{j=1}^{r} \sum_{p=1}^{P} \alpha_{j,p}\,\sigma(\langle w_j, z_p(x) \rangle)$, where $x$ is the input image, the $w_j$ are the filter parameters, the $\alpha_{j,p}$ are the output parameters, and $h(x)$ is the output.

Loss function: $\sum_{i=1}^{n} \ell(h(x_i), y_i)$, where $\ell$ is a cross-entropy loss, hinge loss, squared loss, etc.

Slide28

Challenges

The CNN loss is non-convex because of:
Non-linear activation function $\sigma$.
Parameter sharing: each filter $w_j$ is shared across all patches.

Question: How to train a CNN by convex optimization while preserving non-linear filters and parameter sharing?

Slide29

Convexifying linear two-layer CNNs (I)

Linear CNN: $h(x) = \sum_{j,p} \alpha_{j,p}\,\langle w_j, z_p(x) \rangle$.

Three matrices:
Design matrix: $Z(x)$, where the $p$-th row is $z_p(x)$.
Filter matrix: $W$, where the $j$-th column is $w_j$.
Output matrix: $\Lambda$, where the $(j,p)$-th element is $\alpha_{j,p}$.

Write $h(x)$ as: $h(x) = \mathrm{tr}(Z(x)\,W\Lambda)$, where the entries of $Z(x)\,W$ are the filter outputs $\langle w_j, z_p(x) \rangle$.

Parameter matrix: $A := W\Lambda$.

Constraint: $\mathrm{rank}(A) \le r$.

Slide30

Convexifying linear two-layer CNNs (II)

Re-parameterization: $h(x) = \mathrm{tr}(Z(x)\,A)$, where $A = W\Lambda$ and $\mathrm{rank}(A) \le r$.

Learning $A$ under a rank constraint is a non-convex problem.

Relax the rank constraint to a nuclear-norm constraint: $\|A\|_* \le R$.

Then solve a convex optimization problem:

$\min_{A\,:\,\|A\|_* \le R}\ \sum_{i=1}^{n} \ell\big(\mathrm{tr}(Z(x_i)\,A),\ y_i\big).$
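A minimal NumPy sketch of the relaxed problem above, assuming a single real-valued output and a squared loss; the projected-gradient solver, function names, and hyper-parameters are illustrative choices, not the paper's implementation:

```python
import numpy as np

def project_l1(s, R):
    """Project a nonnegative vector s (here: singular values) onto the l1-ball of radius R."""
    if s.sum() <= R:
        return s
    u = np.sort(s)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - R)[0][-1]
    theta = (css[rho] - R) / (rho + 1.0)
    return np.maximum(s - theta, 0.0)

def project_nuclear(A, R):
    """Project a matrix onto the nuclear-norm ball {A : ||A||_* <= R} via its SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(project_l1(s, R)) @ Vt

def train_linear_ccnn(Z, y, R, lr=0.1, iters=500):
    """Projected gradient descent on min_{||A||_* <= R} (1/n) sum_i (tr(Z_i A) - y_i)^2,
    where Z has shape (n, P, d1) and Z[i] is the patch matrix Z(x_i)."""
    n, P, d1 = Z.shape
    A = np.zeros((d1, P))
    for _ in range(iters):
        preds = np.einsum('ipd,dp->i', Z, A)              # tr(Z(x_i) A) for every example
        grad = (2.0 / n) * np.einsum('i,ipd->dp', preds - y, Z)
        A = project_nuclear(A - lr * grad, R)             # gradient step + nuclear-norm projection
    return A
```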

Slide31

Convexifying non-linear two-layer CNNs (I)

Non-linear filter: $f(z) = \sigma(\langle w, z \rangle)$.

Re-parameterize $f$ in a Reproducing Kernel Hilbert Space (RKHS):

$f(z) = \sigma(\langle w, z \rangle) = \langle v, \varphi(z) \rangle,$

where $\varphi$ is the non-linear mapping of a kernel function $k(z, z') = \langle \varphi(z), \varphi(z') \rangle$. The non-linear CNN filter becomes a linear RKHS filter.

Slide32

Convexifying non-linear two-layer CNNs (II)

CNN filter $\to$ RKHS filter.

Construct $Q$ such that $Q Q^\top = K$, where $K$ holds the kernel values $k(z, z')$ for all pairs of patches in the training set.

Re-define patches: replace each patch $z_p(x)$ by its corresponding row of $Q$.

Then optimize the linear CNN loss: $\min_{\|A\|_* \le R} \sum_{i=1}^{n} \ell\big(\mathrm{tr}(Z(x_i)\,A),\ y_i\big)$, with $Z(x_i)$ built from the re-defined patches. The re-parameterization defines a convex loss.
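A small NumPy sketch of this re-parameterization step: build the kernel matrix over all training patches, factor it as $Q Q^\top = K$, and use the rows of $Q$ as the new "linear" patches. The exact eigendecomposition and the Gaussian-kernel choice are illustrative; a large-scale implementation would need a low-rank approximation instead:

```python
import numpy as np

def gaussian_kernel(Z1, Z2, gamma=1.0):
    """k(z, z') = exp(-gamma * ||z - z'||^2) for all pairs of patches (rows of Z1, Z2)."""
    sq_dists = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def kernelize_patches(Z_all, gamma=1.0, eps=1e-10):
    """Return Q with Q @ Q.T = K, where K is the kernel matrix over every training patch.
    Row i of Q is the re-defined (linear) patch replacing the original patch i."""
    K = gaussian_kernel(Z_all, Z_all, gamma)
    evals, evecs = np.linalg.eigh(K)            # K is symmetric positive semi-definite
    evals = np.clip(evals, eps, None)           # guard against tiny negative eigenvalues
    return evecs * np.sqrt(evals)               # Q = V diag(sqrt(lambda)), so Q Q^T = K
```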

Slide33

What filters can be re-parameterized?

Recall the CNN filters: $f(z) = \sigma(\langle w, z \rangle)$.

If $\sigma$ is smooth, then $f$ will be smooth.

By properly choosing the kernel $k$, the corresponding RKHS will cover all sufficiently smooth functions, including $f$.

We choose such a kernel for training (e.g. the Gaussian kernel).

A smooth $\sigma$ is only required by the theoretical analysis.

Slide34

Theoretical results for convexifying two-layer CNN

If $\sigma$ is sufficiently smooth:

Tractability: The Convexified CNN (CCNN) can be learnt in polynomial time.
Optimality: The generalization loss of CCNN converges, at a … rate, to one at least as good as that of the best possible CNN.
Sample efficiency: A fully-connected NN requires up to … times more training examples than CCNN to achieve the same generalization loss.

Slide35

Multi-layer CCNN

1. Estimate the parameter matrix $A$ for a two-layer CCNN.
2. Factorize $A$ into filter parameters and output parameters through an SVD.
3. Extract the RKHS filters and compute their outputs.
4. Repeat steps 1-3, using the filter outputs as the input to train the 2nd convolutional layer.

Recursively train the 3rd, 4th, ... layer, if necessary.
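A sketch of steps 2-3 in NumPy, assuming the two-layer parameter matrix $A$ (of size $d_1 \times P$) has already been learned; the particular rank-$r$ split shown here (left factor as filters, right factor as outputs) is one of many valid factorizations, not necessarily the paper's exact choice:

```python
import numpy as np

def factorize_ccnn(A, r):
    """Split the learned CCNN parameter matrix A (d1 x P) into filter and output
    parameters via a rank-r truncated SVD: A ~= W_hat @ Lambda_hat."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    W_hat = U[:, :r]                       # d1 x r : (RKHS) filter parameters
    Lambda_hat = s[:r, None] * Vt[:r, :]   # r x P  : output parameters
    return W_hat, Lambda_hat

def filter_outputs(Z, W_hat):
    """Filter outputs Z(x) @ W_hat (shape P x r): the representation fed as input
    when training the next convolutional layer."""
    return Z @ W_hat
```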

Slide36

Empirical results on multi-layer CCNN

10k/2k/50k examples for training/validation/testing.

MNIST variations (random noise, image background, random rotation, ...).

(Table of results: CCNN outperforms state-of-the-art results on rand, img, and img+rot.)

Slide37

Summary

Two challenges of convexifying CNN: non-linear activation and parameter sharing.

CCNN is a combination of two ideas:
CNN filters $\to$ RKHS filters.
Parameter sharing $\to$ nuclear-norm constraint.

Two-layer CCNN: strong optimality guarantee.
Deeper CCNN: convexification improves empirical results.

Slide38

Final summary of this talk

Non-convex optimization is hard, but we don't always need to solve non-convex optimization.

Optimization $\to$ diffusion process: SGD $\to$ SGLD.
Non-linear / low-rank $\to$ RKHS / nuclear norm: CNN $\to$ CCNN.

High-level open question: is there a better abstraction for machine learning?