

Presentation Transcript

Slide1

Regularization

ECE 6504 Deep Learning for Perception

Xiao Lin, Peng Zhang

Virginia Tech

Slide2

Papers

Srivastava et al. "Dropout: A simple way to prevent neural networks from overfitting". JMLR 2014.

Hinton. "Brains, Sex, and Machine Learning". YouTube 2012.

Wan et al. "Regularization of neural networks using DropConnect". ICML 2013.

Goodfellow et al. "Maxout networks". ICML 2013.

Slide3

Training Data

 

 

Slide4

Training Data

 

 

Slide5

Sparsity (L1)
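The slide's own equations are not reproduced here; as an assumption about the standard form intended, an L1-regularized training objective is

    \min_{w} \; \sum_{i=1}^{N} \ell\big(f(x_i; w),\, y_i\big) \;+\; \lambda \,\lVert w \rVert_1

The L1 penalty drives many weights exactly to zero, which is what produces sparsity.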

 

 

Slide6

Optimal Hyperplane (L2)
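Again as an assumption about what the slide's equations showed, the L2 / max-margin objective behind the "optimal hyperplane" can be written as

    \min_{w,b} \; \frac{1}{2}\lVert w \rVert_2^2 \;+\; C \sum_{i=1}^{N} \max\big(0,\; 1 - y_i (w^\top x_i + b)\big)

where shrinking the L2 norm of w corresponds to maximizing the margin of the separating hyperplane.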

 

 

 

 

 

Slide7

Optimal Neural Net?

 

 

Slide8

Optimal Neural Net?

 

 

Slide9

Bayesian Prediction?

 

 

 

 

 

?

Slide10

 

 

Bayesian Prediction?

?

Slide11

 

 

?

 

 

Bayesian Prediction?

Slide12

 

 

?

Bayesian Prediction?

Slide13

Regularization

How to select a good model?
How to regularize neural nets?

Slide14

Regularization

How to select a good model?
How to regularize neural nets?

How to make a good prediction?
How to approximately vote over all models?

Slide15

Regularization

How to select a good model? How to regularize neural nets? Dropout.

How to make a good prediction? How to approximately vote over all models? Dropout.

Slide16

Dropout in ILSVRC 2010

ImageNet

Large Scale Visual Recognition Challenge (ILSVRC)

1.2 million images over 1000 categories
Report top-1 and top-5 error rate

N. Srivastava et al. 2014

Slide17

Dropout in ILSVRC 2010

ImageNet

Large Scale Visual Recognition Challenge (ILSVRC)

1.2 million images over 1000 categories
Report top-1 and top-5 error rate
ILSVRC 2010

N. Srivastava et al. 2014

Slide18

Dropout in ILSVRC 2010

G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012

Slide19

Dropout in ILSVRC 2010

G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012

Slide20

Dropout in ILSVRC 2010

G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012

Slide21

What is Dropout

N. Srivastava et al. 2014

Slide22

What is Dropout

During training, for each data point: randomly set inputs to 0 with probability 0.5 (the "dropout ratio").

N. Srivastava et al. 2014

Slide23

What is Dropout

During training, for each data point: randomly set inputs to 0 with probability 0.5 (the "dropout ratio").

N. Srivastava et al. 2014

Slide24

What is Dropout

During training, for each data point: randomly set inputs to 0 with probability 0.5 (the "dropout ratio").

N. Srivastava et al. 2014

Slide25

What is Dropout

During training, for each data point: randomly set inputs to 0 with probability 0.5 (the "dropout ratio").
During testing: halve the weights; no dropout.

N. Srivastava et al. 2014

Slide26

What is Dropout

During training, for each data point: randomly set inputs to 0 with the "dropout ratio" probability.
During testing: multiply the weights by the retain probability (1 − dropout ratio); no dropout.

N. Srivastava et al. 2014

Slide27

What is Dropout

During training, for each data point: randomly set inputs to 0 with the "dropout ratio" probability.
During testing: multiply the data by the retain probability; no dropout.

N. Srivastava et al. 2014

Slide28

What is Dropout

During training, for each data point: randomly set inputs to 0 with the "dropout ratio" probability.
During testing: multiply the data by the retain probability; no dropout.

[Figure: dropout drawn as its own Dropout "Layer" feeding an Inner Product layer.]

N. Srivastava et al. 2014

Slide29

What is Dropout

During training, for each data point: randomly set inputs to 0 with the "dropout ratio" probability.
During testing: multiply the data by the retain probability; no dropout.

[Figure: dropout drawn as its own Dropout "Layer" feeding an Inner Product layer.]

N. Srivastava et al. 2014

Slide30

What is Dropout

During training, for each data point: randomly set inputs to 0 with the "dropout ratio" probability.
During testing: multiply the data by the retain probability; no dropout.

N. Srivastava et al. 2014

layer {
  name: "drop6"
  type: "dropout"
  dropout_ratio: 0.5
}   (Caffe)

nn.Dropout(0.5)   (Torch)
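A minimal NumPy sketch of the recipe above (my own illustration, not the course's code; the name dropout_ratio follows the Caffe snippet, and the test-time scaling uses the retain probability as on the earlier slides):

    import numpy as np

    def dropout_forward(x, dropout_ratio=0.5, train=True):
        """Dropout as described on the slides: zero inputs at train time,
        scale by the retain probability at test time."""
        keep_prob = 1.0 - dropout_ratio
        if train:
            # Bernoulli mask: each element is kept with probability keep_prob.
            mask = (np.random.rand(*x.shape) < keep_prob).astype(x.dtype)
            return x * mask
        # Test time: no dropout, multiply the data by the retain probability.
        return x * keep_prob

    x = np.random.randn(4, 10)
    h_train = dropout_forward(x, 0.5, train=True)   # roughly half the entries zeroed
    h_test  = dropout_forward(x, 0.5, train=False)  # deterministic, scaled by 0.5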

Slide31

Dropout on MLP

[Figure: MLP without dropout: X → Inner Product → ReLU → Inner Product → ReLU → Inner Product → Scores.]

Slide32

Dropout on MLP

[Figure: the same MLP without dropout, and with Dropout layers inserted before the second and third Inner Product layers: X → Inner Product → ReLU → Dropout → Inner Product → ReLU → Dropout → Inner Product → Scores.]
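A sketch of the "with dropout" pipeline in PyTorch-style code (an assumption on my part: the slides use Caffe/Torch, and the layer sizes below are made up for illustration):

    import torch.nn as nn

    # Hypothetical sizes; the slides do not specify the MLP dimensions.
    mlp_with_dropout = nn.Sequential(
        nn.Linear(784, 1024),   # Inner Product
        nn.ReLU(),
        nn.Dropout(p=0.5),      # dropout on the hidden activations
        nn.Linear(1024, 1024),  # Inner Product
        nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(1024, 10),    # Inner Product -> Scores
    )

Note that PyTorch's nn.Dropout uses the "inverted" convention: it rescales the kept activations by 1/(1 − p) during training, so no extra scaling is needed at test time, unlike the scale-at-test formulation on these slides.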

Slide33

Dropout on MLP

MNIST

Classification of hand written digits

10 classes, 60k train

N. Srivastava et al. 2014

Slide34

Dropout on MLP

MNIST

Classification of hand written digits

10 classes, 60k train

TIMIT
Classification of phones
39 classes*, ? train

N. Srivastava et al. 2014

Slide35

Dropout on MLP

MNIST

Classification of hand written digits

10 classes, 60k train

TIMIT
Classification of phones
39 classes*, ? train

Reuters
Topic classification of documents as bags of words
50 classes, 800k train

N. Srivastava et al. 2014

Slide36

Dropout on MLP

MNIST

N. Srivastava et al. 2014

Slide37

Dropout on MLP

TIMIT

N. Srivastava et al. 2014

Slide38

Dropout on MLP

Reuters

3 layer NN: 31.05%

3 layer NN + Dropout: 29.62%

N. Srivastava et al. 2014

Slide39

Dropout on MLP

MNIST

N. Srivastava et al. 2014

Slide40

Dropout on MLP

TIMIT

G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012

Slide41

Dropout on MLP

Reuters

G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012

Slide42

Dropout on ConvNets

[Figure: ConvNet without dropout: X → Conv → ReLU → Pool → … → Inner Product → Scores.]

Slide43

Dropout on ConvNets

[Figure: the same ConvNet shown without dropout and with Dropout layers inserted between the Conv → ReLU → Pool blocks and before the Inner Product layer.]

Slide44

Dropout on ConvNets

Street View House Numbers (SVHN)

Color images of house numbers centered on 1 digit

10 classes, 73k train

N. Srivastava et al. 2014

Slide45

Dropout on ConvNets

Street View House Numbers (SVHN)

Color images of house numbers centered on 1 digit

10 classes, 73k train

CIFAR-10 and CIFAR-100
Color images of object categories
10/100 classes, both 50k train

N. Srivastava et al. 2014

Slide46

Dropout on ConvNets

Street View House Numbers (SVHN)

Color images of house numbers centered on 1 digit

10 classes, 73k train

CIFAR-10 and CIFAR-100
Color images of object categories
10/100 classes, both 50k train

ImageNet (ILSVRC series)
Color images of object categories
1000 classes, 1.2 million train

N. Srivastava et al. 2014

Slide47

Dropout on ConvNets

Street View House Numbers (SVHN)

N. Srivastava et al. 2014

Slide48

Dropout on ConvNets

CIFAR-10 and CIFAR-100

N. Srivastava et al. 2014

Slide49

Dropout on ConvNets

ImageNet

ILSVRC 2010
ILSVRC 2012

N. Srivastava et al. 2014

Slide50

Dropout on Simple Classifiers

Caltech 101

Classifying objects into 101 categories

J. Donahue et al. "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition". arXiv 2013

Slide51

Dropout on Simple Classifiers

Caltech 101

Classifying objects into 101 categories

Deep Convolutional Activation Feature (DeCAF)
Activations of CNN (CaffeNet) layers as features

J. Donahue et al. "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition". arXiv 2013

Slide52

Dropout on Simple Classifiers

Caltech 101

Classifying objects into 101 categories

Deep Convolutional Activation Feature (DeCAF)
Activations of CNN (CaffeNet) layers as features
Logistic regression and SVM on DeCAF on Caltech 101

J. Donahue et al. "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition". arXiv 2013

Slide53

Why Does Dropout Work?

How to select a good model?
How to make a good prediction?

N. Srivastava et al. 2014

Slide54

Why Does Dropout Work?

How to select a good model?
Prevent co-adaptation
Induce sparse activations
Data augmentation

How to make a good prediction?

N. Srivastava et al. 2014

Slide55

Why Does Dropout Work?

How to select a good model?
Prevent co-adaptation
Induce sparse activations
Data augmentation

How to make a good prediction?
Bagging with shared weights
Efficient approximate averaging

N. Srivastava et al. 2014

Slide56

Why Does Dropout Work?

Prevent co-adaptation

G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012

Slide57

Why Does Dropout Work?

Prevent co-adaptation

G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012

Slide58

Why Does Dropout Work?

Prevent co-adaptation
Filters (autoencoder on MNIST)

N. Srivastava et al. 2014

Slide59

Why Does Dropout Work?

Prevent co-adaptation

G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012

Slide60

Why Does Dropout Work?

Prevent co-adaptation

G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012

Slide61

Why Does Dropout Work?

Induce sparse activations
Activations (autoencoder on MNIST)

N. Srivastava et al. 2014

Slide62

Why Does Dropout Work?

Data augmentation
Dropout as adding "pepper" noise

[Figure labels: Cheetah, Train, Morel.]

Slide63

Why Does Dropout Work?

Bagging with shared weights
Exponentially many architectures

N. Srivastava et al. 2014

Slide64

Why Does Dropout Work?

Bagging with shared weights
Exponentially many architectures
Approximate geometric mean of all architectures

G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012

Slide65

Why Does Dropout Work?

Bagging with shared weights
Exponentially many architectures
Approximate geometric mean of all architectures

G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012

Slide66

Why Does Dropout Work?

Bagging with shared weights
Exponentially many architectures
Approximate geometric mean of all architectures
Geometric mean over all sub-networks

G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012
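The single-softmax-layer case can be checked numerically. A small NumPy sketch (my own, not from the slides): enumerating all dropout masks and taking the normalized geometric mean of the softmax outputs gives exactly the prediction obtained by scaling the weights by the retain probability.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=5)          # one input with 5 features
    W = rng.normal(size=(3, 5))     # a single softmax layer with 3 classes
    p_keep = 0.5

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Geometric mean of the predictions of all 2^5 masked sub-models.
    log_probs = []
    for mask in itertools.product([0, 1], repeat=5):
        log_probs.append(np.log(softmax(W @ (np.array(mask) * x))))
    geo_mean = np.exp(np.mean(log_probs, axis=0))
    geo_mean /= geo_mean.sum()       # renormalize

    # Weight-scaling approximation: one forward pass with scaled weights.
    scaled = softmax((p_keep * W) @ x)

    print(np.allclose(geo_mean, scaled))   # True for a single softmax layer

For deeper networks the equality no longer holds exactly, which is why the weight-scaled forward pass is an approximation to the geometric mean.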

 Slide67

Why Does Dropout Work?

Bagging with shared weights
Exponentially many architectures
Approximate geometric mean of all architectures

[Figure: "Exact" model averaging compared with basic dropout.]

N. Srivastava et al. 2014

Slide68

Effects of Dropout Parameters

Retaining probability (1 − dropout ratio)

N. Srivastava et al. 2014

Slide69

Effects of Dropout Parameters

Dataset size

N. Srivastava et al. 2014

Slide70

Effects of Dropout Parameters

Combining with other regularizers

N. Srivastava et al. 2014

Slide71

Generalizations of Dropout

Gaussian dropout

Gaussian instead of 0/1 Bernoulli

DropConnect
Dropping connections instead of data: larger space for bagging
Averaging after the activation function: better approximation

Maxout
Dropout as cross-channel pooling in ReLU
Using Maxout instead of ReLU

Slide72

Generalizations of Dropout

[Diagram: dropout generalized along three axes: Distribution (R.V. from another distribution), Selection scheme (DropConnect), Activation function (Maxout).]

Slide73

Generalizations of Dropout

[Diagram repeated: Distribution (R.V. from another distribution), Selection scheme (DropConnect), Activation function (Maxout).]

Slide74

Multiplicative Gaussian Noise

Multiplying hidden activations by Bernoulli distributed variables, or by random variables from another distribution.

Each activation can be multiplied either by a Bernoulli variable or by a Gaussian variable with mean 1.

Result credit: N. Srivastava et al. 2014
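A sketch of the multiplicative Gaussian variant (my own illustration; the variance choice (1 − keep_prob)/keep_prob, which matches the variance of rescaled Bernoulli dropout, is an assumption about the usual setting):

    import numpy as np

    def gaussian_dropout(x, keep_prob=0.5, train=True):
        """Multiplicative Gaussian noise instead of a 0/1 Bernoulli mask."""
        if not train:
            return x  # the noise has mean 1, so no test-time scaling is needed
        sigma = np.sqrt((1.0 - keep_prob) / keep_prob)
        noise = np.random.normal(loc=1.0, scale=sigma, size=x.shape)
        return x * noise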

Slide75

Generalizations of Dropout

[Diagram repeated: Distribution (R.V. from another distribution), Selection scheme (DropConnect), Activation function (Maxout).]

Slide76

Dropout

Keep each element of a layer's output with probability p; otherwise set it to 0 (with probability 1 − p).

[Figure: the same network shown with and without dropout.]

Figure credit: Wan et al. ICML 2013

Slide77

DropConnect

Dropout: keep each element of a layer's output with probability p; otherwise set it to 0 (with probability 1 − p).

DropConnect: keep each connection (weight) of a layer with probability p; otherwise set it to 0 (with probability 1 − p).
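A minimal NumPy sketch of the contrast (my own illustration, not the authors' code; activations are omitted): dropout masks entries of the layer's output, DropConnect masks entries of the weight matrix.

    import numpy as np

    def dropout_layer(x, W, p_keep=0.5):
        h = W @ x
        mask = np.random.rand(*h.shape) < p_keep   # mask the outputs
        return h * mask

    def dropconnect_layer(x, W, p_keep=0.5):
        mask = np.random.rand(*W.shape) < p_keep   # mask the connections (weights)
        return (W * mask) @ x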

Figure credit: Wan et al. ICML 2013

Slide78

Inference step

Old approximation (dropout)

Better approximation

1-D Gaussian distribution
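The better approximation replaces the single scaled forward pass with sampling: under my reading of Wan et al., each pre-activation of the DropConnect layer is approximated by a 1-D Gaussian via moment matching, and the activation is averaged over draws from it. A NumPy sketch (the moment formulas below are my reconstruction, not code from the paper):

    import numpy as np

    def dropconnect_inference(v, W, p_keep=0.5, n_samples=100, act=np.tanh):
        """Moment-matched Gaussian approximation of u = (M * W) @ v,
        followed by averaging the activation over samples."""
        mean = p_keep * (W @ v)
        var = p_keep * (1 - p_keep) * ((W ** 2) @ (v ** 2))
        samples = np.random.normal(mean, np.sqrt(var),
                                   size=(n_samples,) + mean.shape)
        return act(samples).mean(axis=0)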

Figure credit: Wan et al. ICML 2013

Slide79

Experiments

Dataset augmentation
5 independent networks
Manually tuned parameters, e.g. learning rate
Train with dropout, DropConnect, and neither
Fixed sample number at the inference step

Datasets: SVHN, CIFAR-10, MNIST

Slide80

MNIST

Test error on MNIST dataset

with a 2-hidden-layer network

Figure credit: Wan et al. ICML 2013

Slide81

MNIST

Test error as the dropout rate changes

Figure credit: Wan et al. ICML 2013

Slide82

MNIST

Classification error rate without data augmentation

Figure credit: Wan et al. ICML 2013

Slide83

CIFAR-10

On top of a 3-layer AlexNet-style feature extractor, compare no dropout, DropConnect, and dropout.

Figure credit: Wan et al. ICML 2013

Slide84

CIFAR-10

On top of 2 convolutional layers and 2 fully connected layers

Figure credit: Wan et al. ICML 2013

Slide85

SVHN

512 units between the feature extractor and the softmax layer

Figure credit: Wan et al. ICML 2013

Slide86

Generalizations of Dropout

[Diagram repeated: Distribution (R.V. from another distribution), Selection scheme (DropConnect), Activation function (Maxout).]

Slide87

Maxout

Dropout is well suited to training an ensemble of large models and approximating the average of their predictions.

Maxout enhances dropout's abilities as a model averaging technique.

Slide88

Maxout

[Figure: ReLU f(x) compared with a general piecewise linear function f(x).]

Slide89

Maxout

Maxout uses the max unit: take the maximum value over k linear functions (the k models).
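In Goodfellow et al.'s notation the unit is h_i(x) = max_{j in [1,k]} z_{ij} with z_{ij} = x^T W_{:ij} + b_{ij}. A NumPy sketch (the array shapes are my own choice for illustration):

    import numpy as np

    def maxout_layer(x, W, b):
        """Maxout: k linear functions per output unit, take the max.
        x: (d,), W: (k, m, d), b: (k, m)  ->  returns (m,)."""
        z = np.einsum('kmd,d->km', W, x) + b   # k linear pieces for each of m units
        return z.max(axis=0)                    # maximum over the k pieces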

Slide90

Universal approximator

Maxout is a universal approximator: a maxout model with just two hidden units can approximate any continuous function.

Figure credit: Goodfellow et al. ICML 2013

Slide91

Universal approximator

Any continuous piecewise linear function g(v) can be expressed as the difference of two convex piecewise linear functions.

For any continuous function f(v) and any ε > 0, there exists a continuous piecewise linear function g(v) such that |f(v) − g(v)| < ε over the domain.

Slide92

MNIST

3 convolutional maxout hidden layers with max pooling, followed by a softmax layer.

Figure credit: Goodfellow et al. ICML 2013

Slide93

CIFAR-10

3 convolutional maxout layers
1 fully connected maxout layer
1 softmax layer

Figure credit: Goodfellow et al. ICML 2013

Slide94

Why maxout performs better

Enhances dropout's approximate model averaging technique
Yields better optimization performance
Decreases the saturation rate

Slide95

Better model averaging technique

Dropout training encourages maxout units to have larger linear regions around the training inputs.

Figure credit: Goodfellow et al. ICML 2013

Slide96

Better model averaging technique

Two experiments:
2 convolutional layers on SVHN data: rectifier units 7.3% training error, maxout units 5.1% training error
A deep and narrow model on MNIST

Figure credit: Goodfellow et al. ICML 2013

Slide97

Saturation

Training with SGD: units saturate at 0 less than 5% of the time
Training with dropout: the saturation rate increases
The rectifier activation is constant on the left; maxout has a gradient everywhere

Figure credit: Goodfellow et al. ICML 2013

Slide98

Saturation

High proportion of zeros and the difficulty of escaping them impairs the optimization performance

Train two MLPs with 2 hidden layers

One fails to use 17.6% of filters in the first layer and 39.2% in the second layer; the other fails to use 0.1% of filters.

Slide99

Concluding remarks

"If your deep neural net is significantly overfitting, dropout will usually reduce the number of errors by a lot. If your deep neural net is not overfitting you should be using a bigger one!"

Hinton, Lecture 6a, CSC2535, 2013

Slide100

Marginalizing Dropout

Marginalizing dropout in linear regression yields a special form of ridge regression.
Marginalizing dropout on deep networks?
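The linear-regression case can be written out; the notation below is my reconstruction of the result in Srivastava et al. If each input feature is kept with probability p (mask R), minimizing the expected squared error over the mask distribution is equivalent to a ridge-style problem:

    \mathbb{E}_{R \sim \mathrm{Bernoulli}(p)} \big\lVert y - (R \ast X)\, w \big\rVert^2
      \;=\; \lVert y - p X w \rVert^2 \;+\; p(1-p)\, \lVert \Gamma w \rVert^2,
    \qquad \Gamma = \big(\mathrm{diag}(X^\top X)\big)^{1/2}

so dropout on the inputs of a linear model acts as a data-dependent weight decay.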

N. Srivastava et al. 2014

Slide101

Concluding Remarks

"If your deep neural net is significantly overfitting, dropout will usually reduce the number of errors by a lot. If your deep neural net is not overfitting you should be using a bigger one!"

G. Hinton, Lecture 6a, CSC 2535

Slide102

Slide103

[Figure: two f(x) vs. x plots.]

Slide104

Inference Step

Slide105

Regularization

Many approaches have been proposed for regularizing deep neural networks:

Adding an L1 or L2 penalty
Early stopping of training
Bayesian methods
Dropout

Slide106

Why DropConnect works

Averaging models

Rademacher complexity, where:
p = 0: model complexity is zero
p = 1: returns the standard model complexity
p = 0.5: all sub-models have equal complexity