Training convolutional networks


Presentation Transcript


Training convolutional networks

Last time

Linear classifiers on pixels bad, need non-linear classifiers

Multi-layer perceptrons are overparametrized

Reduce parameters by local connections and shift invariance => Convolution

Intersperse subsampling to capture ever larger deformations

Stick a final classifier on top

Convolutional networks

[Figure: the convnet pipeline: conv (filters) → subsample → conv (filters) → subsample → linear (weights).]

Empirical Risk Minimization

Convolutional network
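The objective on this slide is not in the extracted text; a standard way to write empirical risk minimization with a convolutional network f(x; w) plugged in as the hypothesis would be:

$$ \min_{w} \; \frac{1}{N} \sum_{i=1}^{N} L\big(f(x_i; w),\, y_i\big) $$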

Computing the gradient of the loss

Convolutional networks

[Figure: the same conv (filters) → subsample → conv (filters) → subsample → linear (weights) pipeline as above.]

The gradient of convnets

[Figure, built up over several slides: a chain network x → f1(x; w1) = z1 → f2(z1; w2) = z2 → … → f5(z4; w5) = z5 = z. Expanding the chain rule for the gradient of z with respect to each w_i reveals a recurrence going backward through the chain: backpropagation.]

Backpropagation for a sequence of functions

The recurrence: dz/dz_{i-1} = (dz/dz_i) · (dz_i/dz_{i-1}), where dz/dz_i is the previous term (already computed going backward) and dz_i/dz_{i-1} is the derivative of the function f_i.

Backpropagation for a sequence of functions

Assume we can compute partial derivatives of each function

Use g(z_i) to store the gradient of z w.r.t. z_i, and g(w_i) for w_i

Calculate the g(z_i) by iterating backwards

Use the g(z_i) to compute the gradients of the parameters, g(w_i)

Backpropagation for a sequence of functions

Each “function” has a “forward” and “backward” module

The forward module for f_i takes z_{i-1} and the weight w_i as input and produces z_i as output

The backward module for f_i takes g(z_i) as input and produces g(z_{i-1}) and g(w_i) as output
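To make the forward/backward bookkeeping concrete, here is a minimal Python sketch of backpropagation through a sequence of such modules. The scalar ScaleLayer, the chain of five layers, and all variable names are illustrative assumptions, not code from the lecture.

```python
class ScaleLayer:
    """One f_i: forward computes z_i = w_i * z_{i-1};
    backward maps g(z_i) to g(z_{i-1}) and g(w_i)."""
    def __init__(self, w):
        self.w = w

    def forward(self, z_prev):
        self.z_prev = z_prev          # cache the input for the backward pass
        return self.w * z_prev

    def backward(self, g_z):
        g_w = g_z * self.z_prev       # dz_i/dw_i = z_{i-1}
        g_z_prev = g_z * self.w       # dz_i/dz_{i-1} = w_i
        return g_z_prev, g_w

def run_chain(layers, x):
    # Forward pass: z_0 = x, z_i = f_i(z_{i-1}; w_i)
    z = x
    for layer in layers:
        z = layer.forward(z)
    # Backward pass: start from g(z_5) = dz/dz_5 = 1 and iterate backwards
    g_z, grads = 1.0, []
    for layer in reversed(layers):
        g_z, g_w = layer.backward(g_z)
        grads.append(g_w)
    return z, list(reversed(grads))   # grads[i] = g(w_{i+1})

layers = [ScaleLayer(w) for w in (0.5, 2.0, -1.0, 0.3, 1.5)]
z, grads = run_chain(layers, x=2.0)
```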

Backpropagation for a sequence of functions

[Figure: the forward module of f_i, taking z_{i-1} and w_i as inputs and producing z_i.]

Backpropagation for a sequence of functions

[Figure: the backward module of f_i, taking g(z_i) as input and producing g(z_{i-1}) and g(w_i).]

Chain rule for vectors

Jacobian
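The equation for this slide did not survive extraction; the standard vector chain rule it refers to, with Jacobian matrices in place of scalar derivatives, is:

$$ \frac{\partial z}{\partial z_{i-1}} = \frac{\partial z}{\partial z_i}\,\frac{\partial z_i}{\partial z_{i-1}}, \qquad \left[\frac{\partial z_i}{\partial z_{i-1}}\right]_{jk} = \frac{\partial (z_i)_j}{\partial (z_{i-1})_k} $$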

Loss as a function

[Figure: the convnet pipeline, conv (filters) → subsample → conv (filters) → subsample → linear (weights), followed by the loss, which also takes the label as input.]

Neural network frameworks

Caffe

Beyond sequences: computation graphs

Arbitrary graphs of functions

No distinction between intermediate outputs and parameters

[Figure: an example computation graph with function nodes f, g, h, k, l operating on quantities x, y, w, u, z.]

Why computation graphs

Allows multiple functions to reuse the same intermediate output (see the sketch after this list)

Allows one function to combine multiple intermediate outputs

Allows trivial parameter sharing

Allows crazy ideas, e.g., one function predicting parameters for another
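To illustrate the first two points, reusing one intermediate output in several functions and accumulating the gradients from all its consumers, here is a minimal sketch. The tiny Node class and the example graph are assumptions for illustration, not an API from the lecture.

```python
class Node:
    """A node in a computation graph: a value, an accumulated gradient,
    and a list of (parent, local_derivative) edges."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = list(parents)   # [(parent_node, d(self)/d(parent)), ...]

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def backward(output):
    # Visit the graph reachable from `output` in reverse topological order.
    order, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for parent, _ in n.parents:
            visit(parent)
        order.append(n)
    visit(output)
    output.grad = 1.0
    for n in reversed(order):
        for parent, local in n.parents:
            parent.grad += n.grad * local   # accumulate over all consumers

# x is reused by two functions; its gradient sums the contributions of both paths.
x, w = Node(3.0), Node(2.0)
y = add(mul(x, w), mul(x, x))               # y = w*x + x^2
backward(y)
print(x.grad, w.grad)                       # dy/dx = w + 2x = 8.0, dy/dw = x = 3.0
```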

Computation graphs

[Figure, repeated over several slides: a single node f_i inside a larger graph, with edges a, b, c, d, illustrating how its forward and backward passes plug into the rest of the graph.]

Neural network frameworks

Stochastic gradient descent

Gradient on single example = unbiased sample of true gradient

Idea: at each iteration, sample a single example x^(t)

Con: variance in the estimate of the gradient → slow convergence, jumping around near the optimum

[Equation on slide: the SGD update, with a step size (learning rate).]
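The update rule itself is not in the extracted text; the standard SGD step it describes, with step size η, is:

$$ w^{(t+1)} = w^{(t)} - \eta\, \nabla_w L\big(f(x^{(t)}; w^{(t)}),\, y^{(t)}\big) $$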

Minibatch stochastic gradient descent

Compute the gradient on a small batch of examples

Same mean (= true gradient), but variance inversely proportional to minibatch size
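In symbols (a standard statement, not text recovered from the slide): for a minibatch B of size |B|,

$$ \hat{g} = \frac{1}{|B|} \sum_{i \in B} \nabla_w L(f(x_i; w), y_i), \qquad \mathbb{E}[\hat{g}] = \nabla_w \mathcal{L}, \qquad \operatorname{Var}[\hat{g}] \propto \frac{1}{|B|} $$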

Momentum

Average multiple gradient steps

Use exponential averaging
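The update equations on the slide are not in the extracted text; the usual exponential-averaging form of momentum, with momentum coefficient μ and step size η, is:

$$ v^{(t+1)} = \mu\, v^{(t)} + \nabla_w L^{(t)}, \qquad w^{(t+1)} = w^{(t)} - \eta\, v^{(t+1)} $$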

Weight decay

Add -a·w^(t) to the gradient

Prevents w^(t) from growing to infinity

Equivalent to L2 regularization of the weights
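In symbols (a standard identity, not text recovered from the slide): an L2 penalty on the weights adds a·w to the gradient, so each update shrinks the weights,

$$ \min_w\; L(w) + \frac{a}{2}\,\lVert w\rVert^2 \;\;\Rightarrow\;\; w^{(t+1)} = w^{(t)} - \eta\big(\nabla_w L(w^{(t)}) + a\, w^{(t)}\big) $$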

Learning rate decay

Large step size / learning rate

Faster convergence initially

Bouncing around at the end because of noisy gradients

Learning rate must be decreased over time

Usually done in steps
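A common step schedule (an illustrative formula; the slide does not give one): drop the rate by a factor γ every s iterations,

$$ \eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor}, \qquad 0 < \gamma < 1 $$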

Convolutional network training

Initialize network

Sample a minibatch of images

Forward pass to compute loss

Backpropagate loss to compute gradient

Combine gradient with momentum and weight decay

Take step according to current learning rate
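A minimal sketch of this loop in PyTorch (the choice of PyTorch, the model and loader objects, and all hyperparameter values are assumptions for illustration; the lecture's frameworks slide only mentions Caffe):

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=30):
    # SGD with momentum and weight decay, plus stepped learning-rate decay
    opt = torch.optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for images, labels in loader:          # sample a minibatch of images
            logits = model(images)             # forward pass
            loss = loss_fn(logits, labels)     # compute loss
            opt.zero_grad()
            loss.backward()                    # backpropagate to get gradients
            opt.step()                         # momentum + weight decay + step
        sched.step()                           # decrease learning rate in steps
```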

Vagaries of optimization

Non-convex

Local optima

Sensitivity to initialization

Vanishing / exploding gradients

If each term is (much) greater than 1 → explosion of gradients

If each term is (much) less than 1 → vanishing gradients

Vanishing and exploding gradients
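The "terms" referred to on the previous slide are the per-layer derivatives in the chain-rule product (the equation is not in the extracted text; this is its standard form):

$$ \frac{\partial z}{\partial z_i} = \prod_{j=i+1}^{n} \frac{\partial z_j}{\partial z_{j-1}} $$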

Sigmoids cause vanishing gradients

Gradient close to 0
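Concretely (a standard fact, not text from the slide): the sigmoid's derivative is at most 1/4 and approaches 0 in the saturated regions, so stacked sigmoids shrink gradients multiplicatively:

$$ \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,\big(1 - \sigma(x)\big) \le \tfrac{1}{4} $$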

Rectified Linear Unit (ReLU)

max(x, 0)

Also called half-wave rectification (signal processing)
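Its derivative (standard, not from the slide) is 1 wherever the unit is active, so active units pass gradients through unchanged:

$$ \mathrm{ReLU}(x) = \max(x, 0), \qquad \mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases} $$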

Image Classification

How to do machine learning

Create training / validation sets

Identify loss functions

Choose hypothesis class

Find best hypothesis by minimizing training loss

How to do machine learning

Create training / validation sets

Identify loss functions

Choose hypothesis class

Find best hypothesis by minimizing training loss

Multiclass classification!!

MNIST Classification

Method                          Error rate (%)
Linear classifier over pixels   12
Kernel SVM over HOG             0.56
Convolutional Network           0.8

ImageNet

1000 categories

~1000 instances per category

Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. (* = equal contribution) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015.

ImageNet

Top-5 error: algorithm makes 5 predictions, true label must be in top 5

Useful for incomplete labelings
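A minimal sketch of computing top-5 error with NumPy (the array names and shapes are illustrative assumptions):

```python
import numpy as np

def top5_error(scores, labels):
    """scores: (N, 1000) class scores; labels: (N,) true class indices."""
    top5 = np.argsort(-scores, axis=1)[:, :5]      # the 5 highest-scoring classes per image
    hit = (top5 == labels[:, None]).any(axis=1)    # is the true label among the top 5?
    return 1.0 - hit.mean()

scores = np.random.randn(4, 1000)
labels = np.array([3, 17, 999, 42])
print(top5_error(scores, labels))
```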

Convolutional Networks

Why ConvNets? Why now?

Why do convnets work?

Claim: ConvNets have way more parameters than traditional models
Wrong: contemporary models had the same or more parameters

Claim: Deep models are more expressive than shallow models
Wrong: 3-layer neural networks are universal function approximators

What does depth provide?
More non-linearities: many ways of expressing non-linear functions
More module reuse: really long switch-case vs. functions
More parameter sharing: most computation is shared amongst categories

Why do convnets work?

We've had really long pipelines before. What's new?

End-to-end learning: all functions tuned for the final loss

Follows prior trend: more learning is better.

Visualizing convolutional networks I

Visualizing convolutional networks I

Rich feature hierarchies for accurate object detection and semantic segmentation. R. Girshick, J. Donahue, T. Darrell, J. Malik. In CVPR, 2014.

Visualizing convolutional networks II

Image pixels important for classification = pixels that, when blocked, cause misclassification

Visualizing and Understanding Convolutional Networks. M. Zeiler and R. Fergus. In ECCV, 2014.

Myths of convolutional networks

They have too many parameters!

So does everything else

They are hard to understand!

So is everything else

They are non-convex!

So what?

Why did we take so long?

Convolutional networks have been around since the 80s. Why now?

Early vision problems were too simple

Fewer categories

Less intra-class variation

Large differences between categories

Early vision datasets were too small
Easy to overfit on small datasets
Small datasets encourage less learning

Data, data, data! I cannot make bricks without clay!

- Sherlock Holmes

Transfer learning

Transfer learning with convolutional networks

[Figure: a convnet trained on its original task (predicting labels such as "Horse"), with its earlier layers kept as a trained feature extractor.]

Transfer learning with convolutional networks

Dataset       Non-Convnet Method   Non-Convnet perf   Pretrained convnet + classifier   Improvement
Caltech 101   MKL                  84.3               87.7                              +3.4
VOC 2007      SIFT+FK              61.7               79.7                              +18
CUB 200       SIFT+FK              18.8               61.0                              +42.2
Aircraft      SIFT+FK              61.0               45.0                              -16
Cars          SIFT+FK              59.2               36.5                              -22.7

Why transfer learning?

Availability of training data

Computational cost

Ability to pre-compute feature vectors and use for multiple tasks

Con: NO end-to-end learning

Finetuning

[Figure: the pretrained network on its original task, predicting labels such as "Horse".]

Finetuning

[Figure: the same network re-targeted to a new task with new labels such as "Bakery".]

Initialize with the pre-trained network, then train with a low learning rate
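A minimal finetuning sketch in PyTorch (torchvision's ResNet-18, the 10-class head, and the learning rate are assumptions for illustration, not the setup used in the lecture):

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a network pretrained on ImageNet
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the final classifier for the new task (here assumed to have 10 classes)
model.fc = nn.Linear(model.fc.in_features, 10)

# Train the whole network with a low learning rate so the pretrained
# features are only gently adjusted
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```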

Finetuning

Dataset       Non-Convnet Method   Non-Convnet perf   Pretrained convnet + classifier   Finetuned convnet   Improvement
Caltech 101   MKL                  84.3               87.7                              88.4                +4.1
VOC 2007      SIFT+FK              61.7               79.7                              82.4                +20.7
CUB 200       SIFT+FK              18.8               61.0                              70.4                +51.6
Aircraft      SIFT+FK              61.0               45.0                              74.1                +13.1
Cars          SIFT+FK              59.2               36.5                              79.8                +20.6

Exploring convnet architectures

Deeper is better

[Figure: comparison of a 7-layer and a 16-layer network.]

Deeper is better

[Figure: AlexNet vs. VGG16.]

The VGG pattern

Every convolution is 3x3, padded by 1

Every convolution followed by ReLU

ConvNet is divided into "stages"

Layers within a stage: no subsampling
Subsampling by 2 at the end of each stage
Layers within a stage have the same number of channels
Every subsampling → double the number of channels
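A sketch of one such stage in PyTorch (the channel counts and the number of convolutions per stage are assumptions for illustration; the 3x3 / pad 1 / ReLU / subsample-by-2 pattern follows the slide):

```python
import torch.nn as nn

def vgg_stage(in_ch, out_ch, num_convs):
    """One VGG-style stage: 3x3 convs (padding 1) + ReLU, then subsample by 2."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))   # subsample by 2
    return nn.Sequential(*layers)

# Every subsampling doubles the number of channels, e.g. 64 -> 128 -> 256
features = nn.Sequential(vgg_stage(3, 64, 2),
                         vgg_stage(64, 128, 2),
                         vgg_stage(128, 256, 3))
```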

Challenges in training: exploding / vanishing gradients

Vanishing / exploding gradients

If each term is (much) greater than 1 → explosion of gradients

If each term is (much) less than 1 → vanishing gradients

Challenges in training: dependence on init

Solutions

Careful init

Batch normalization

Residual connections

Careful initialization

Key idea: want variance to remain approx. constant

Variance increases in backward pass => exploding gradient

Variance decreases in backward pass => vanishing gradient

“MSRA initialization”

weights = Gaussian with 0 mean and variance = 2/(k·k·d), where k is the kernel size and d the number of input channels

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. K. He, X. Zhang, S. Ren, J. Sun
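A sketch of this initialization with NumPy (the (out_channels, in_channels, k, k) weight layout is an assumption for illustration):

```python
import numpy as np

def msra_init(out_channels, in_channels, k):
    """He/MSRA init: zero-mean Gaussian with variance 2 / (k*k*d), d = input channels."""
    std = np.sqrt(2.0 / (k * k * in_channels))
    return np.random.normal(0.0, std, size=(out_channels, in_channels, k, k))

w = msra_init(64, 3, 3)   # e.g. a first conv layer: 64 filters, 3 input channels, 3x3
```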

Batch normalization

Key idea: normalize so that each layer output has zero mean and unit variance

Compute mean and variance for each channel

Aggregate over batch

Subtract the mean, divide by the standard deviation

Need to reconcile train and test: there are no "batches" at test time
After training, compute means and variances on the train set and store them

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. S. Ioffe, C. Szegedy. In ICML, 2015.
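Per channel, the normalization is (standard formulation; the learned scale and shift γ, β are part of the original method but do not appear in the extracted slide text):

$$ \hat{x} = \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma^2_{\text{batch}} + \epsilon}}, \qquad y = \gamma\,\hat{x} + \beta $$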

Residual connections

In general, gradients tend to vanish

Key idea: allow gradients to flow unimpeded
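The residual form this refers to (the standard ResNet formulation, written in the notation of the chain used earlier) adds an identity term to every Jacobian, so gradients can flow through unimpeded:

$$ z_i = z_{i-1} + f_i(z_{i-1}; w_i) \quad\Rightarrow\quad \frac{\partial z_i}{\partial z_{i-1}} = I + \frac{\partial f_i}{\partial z_{i-1}} $$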


Residual connections

Assumes all z_i have the same size

True within a stage

Across stages?
Doubling of feature channels, subsampling
Increase channels by 1x1 convolution
Decrease spatial resolution by subsampling

A residual block

Instead of single layers, have residual connections over block

[Figure: the block is Conv → BN → ReLU → Conv → BN, with a skip connection added around it.]
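A sketch of such a block in PyTorch (the channel count and the ReLU after the addition are assumptions for illustration; the Conv-BN-ReLU-Conv-BN body follows the slide):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv -> BN -> ReLU -> Conv -> BN, added to the input via a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Skip connection: gradients flow through the identity term
        return self.relu(x + self.body(x))

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))
```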

Bottleneck blocks

Problem: when the number of channels c increases, 3x3 convolutions introduce many parameters: 3×3×c²

Key idea: use a 1x1 convolution to project to a lower dimensionality d, do the 3x3 convolution there, then come back: c×d + 3×3×d² + d×c
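For example (illustrative numbers matching the typical ResNet choice of c = 256, d = 64; the slide itself gives no values):

$$ 3\cdot 3\cdot 256^2 \approx 590\text{k parameters} \quad \text{vs.} \quad 256\cdot 64 + 3\cdot 3\cdot 64^2 + 64\cdot 256 \approx 70\text{k} $$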

The ResNet pattern

Decrease resolution substantially in first layer

Reduces memory consumption due to intermediate outputs

Divide into stages

maintain resolution, channels in each stage

halve resolution, double channels between stages
Divide each stage into residual blocks
At the end, compute the average value of each channel to feed the linear classifier

Putting it all together - Residual networks