Regularization
ECE 6504 Deep Learning for Perception
Xiao Lin, Peng Zhang
Virginia Tech
Papers
Srivastava et al. "Dropout: A simple way to prevent neural networks from overfitting". JMLR 2014.
Hinton. "Brains, Sex, and Machine Learning". YouTube 2012.
Wan et al. "Regularization of neural networks using DropConnect". ICML 2013.
Goodfellow et al. "Maxout networks". ICML 2013.
Training Data
Sparsity (L1)
Optimal Hyperplane (L2)
Optimal Neural Net?
Bayesian Prediction?
Regularization
How to select a good model? How to regularize neural nets?
How to make a good prediction? How to approximately vote over all models?

Dropout
Dropout in ILSVRC 2010
ImageNet Large Scale Visual Recognition Challenge (ILSVRC): 1.2 million images over 1000 categories; top-1 and top-5 error rates are reported.
(N. Srivastava et al. 2014)
ILSVRC 2010 results.
(G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012)
What is Dropout?
During training, for each data point, randomly set each input to 0 with probability equal to the "dropout ratio" (e.g. 0.5).
During testing, apply no dropout and instead multiply the data by (1 - dropout ratio); for a ratio of 0.5 this is equivalent to halving the weights of the following inner product layer.
Dropout is implemented as a "layer" placed in front of an inner product layer.
(N. Srivastava et al. 2014)

Caffe:
layer {
  name: "drop6"
  type: "Dropout"
  dropout_param { dropout_ratio: 0.5 }
}

Torch:
nn.Dropout(0.5)
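As a sketch of the same recipe outside any framework, here is a minimal NumPy version (illustrative; the function and variable names are ours, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, ratio=0.5, train=True):
    # Training: zero each element independently with probability `ratio`
    if train:
        mask = rng.random(x.shape) >= ratio  # keep with probability 1 - ratio
        return x * mask
    # Testing: no sampling; multiply the data by (1 - ratio)
    return x * (1.0 - ratio)

x = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_forward(x, train=True))   # a random subset of entries zeroed
print(dropout_forward(x, train=False))  # every entry halved for ratio 0.5

Modern frameworks usually implement the equivalent "inverted" form, scaling by 1/(1 - ratio) during training so that test time needs no rescaling.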
Dropout on MLP
Without dropout: X -> Inner Product -> ReLU -> Inner Product -> ReLU -> Inner Product -> Scores
With dropout: X -> Dropout -> Inner Product -> ReLU -> Dropout -> Inner Product -> ReLU -> Dropout -> Inner Product -> Scores
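The same two pipelines written as a minimal sketch in modern PyTorch (the slides use Caffe and Lua Torch; the layer widths here are illustrative assumptions):

import torch.nn as nn

# Without dropout: X -> Linear -> ReLU -> Linear -> ReLU -> Linear -> Scores
mlp_plain = nn.Sequential(
    nn.Linear(784, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

# With dropout: a Dropout layer in front of each Linear (inner product) layer
mlp_dropout = nn.Sequential(
    nn.Dropout(0.5), nn.Linear(784, 1024), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(1024, 1024), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(1024, 10),
)

# mlp_dropout.train() samples masks; mlp_dropout.eval() disables dropout
# (PyTorch scales by 1/(1-p) during training, so eval needs no rescaling).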
Dropout on MLP: Datasets
MNIST: classification of handwritten digits; 10 classes, 60k train.
TIMIT: classification of phones; 39 classes*, train set size unspecified.
Reuters: topic classification of documents as bags of words; 50 classes, 800k train.
(N. Srivastava et al. 2014)
Dropout on MLP: Results
MNIST and TIMIT: dropout lowers test error (N. Srivastava et al. 2014).
Reuters: 3-layer NN 31.05% error; 3-layer NN + dropout 29.62% (N. Srivastava et al. 2014).
Corresponding MNIST, TIMIT, and Reuters plots also appear in G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012.
Dropout on ConvNets
Without dropout: X -> Conv -> ReLU -> Pool -> Conv -> ReLU -> Pool -> Inner Product -> Scores
With dropout: the same pipeline with Dropout layers inserted in front of the inner product layer.
Dropout on ConvNets: Datasets
Street View House Numbers (SVHN): color images of house numbers centered on one digit; 10 classes, 73k train.
CIFAR-10 and CIFAR-100: color images of object categories; 10/100 classes, both 50k train.
ImageNet (ILSVRC series): color images of object categories; 1000 classes, 1.2 MILLION train.
(N. Srivastava et al. 2014)
Dropout on ConvNets: Results on SVHN, CIFAR-10/100, and ImageNet (ILSVRC 2010 and 2012).
(N. Srivastava et al. 2014)
Dropout on Simple Classifiers
Caltech 101: classifying objects into 101 categories.
Deep Convolutional Activation Feature (DeCAF): activations of CNN (CaffeNet) layers used as features.
Logistic regression and SVM trained on DeCAF features for Caltech 101.
(J. Donahue et al. "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition". arXiv 2013)
Why Does Dropout Work?
How to select a good model? Prevent co-adaptation; induce sparse activations; data augmentation.
How to make a good prediction? Bagging with shared weights; efficient approximate averaging.
(N. Srivastava et al. 2014)
Why Does Dropout Work? Preventing co-adaptation
Filters learned by an autoencoder on MNIST, with and without dropout.
(N. Srivastava et al. 2014; G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012)
Why Does Dropout Work? Inducing sparse activations
Activations of an autoencoder on MNIST.
(N. Srivastava et al. 2014)
Why Does Dropout Work? Data augmentation
Dropout can be seen as adding "pepper" noise to the training data (illustrated with cheetah and morel images).
Why Does Dropout Work? Bagging with shared weights
Each forward pass samples one of exponentially many architectures, all sharing weights.
(N. Srivastava et al. 2014)
Why Does Dropout Work? Approximate model averaging
Dropout performs bagging with shared weights over exponentially many architectures, and test-time weight scaling approximates the geometric mean of all architectures' predictions. For a single inner product layer with dropout followed by softmax, the geometric mean over all masks is exact; this is the "basic dropout" case.
(G. Hinton, "Brains, Sex, and Machine Learning", YouTube 2012; N. Srivastava et al. 2014)
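A sketch of the exact single-layer case (reconstructing the standard result; notation ours): with ratio 0.5, average uniformly over all M = 2^N masks m, score s_c(m) = w_c^T (m ⊙ x), and per-mask softmax output P_m(c|x). Then

\Big(\prod_{m} P_m(c \mid x)\Big)^{1/M}
  \;\propto\; \exp\!\Big(\frac{1}{M}\sum_{m} s_c(m)\Big)
  \;=\; \exp\!\big(w_c^{\top}(x/2)\big),

so renormalizing the geometric mean gives exactly the softmax computed with halved weights, which is the test-time dropout rule.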
Effects of Dropout Parameters
Retaining probability (1 - dropout ratio); dataset size; combining with other regularizers.
(N. Srivastava et al. 2014)
Generalizations of Dropout
Gaussian dropout: Gaussian noise instead of 0/1 Bernoulli noise.
DropConnect: dropping connections instead of data (a larger space for bagging); averaging after the activation function (a better approximation).
Maxout: dropout as cross-channel pooling in ReLU; using Maxout instead of ReLU.
Each generalization varies one axis of dropout: the noise distribution (random variables from another distribution), the selection scheme, or the activation function.
Multiplicative Gaussian Noise
Standard dropout multiplies hidden activations by Bernoulli-distributed variables; the random variables can instead come from another distribution, so each activation can be multiplied by a Bernoulli variable or by a Gaussian variable with mean 1.
(Result credit: N. Srivastava et al. 2014)
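A minimal NumPy sketch of the two noise choices (matching the Gaussian's variance to the scaled Bernoulli mask follows Srivastava et al. 2014, but treat the exact expression here as our assumption):

import numpy as np

rng = np.random.default_rng(0)

def multiplicative_noise(h, ratio=0.5, kind="bernoulli"):
    # Multiply activations by unit-mean noise. "bernoulli" is (inverted)
    # standard dropout; "gaussian" replaces the mask with N(1, sigma^2).
    keep = 1.0 - ratio
    if kind == "bernoulli":
        # mask/keep has mean 1 and variance ratio/keep
        return h * (rng.random(h.shape) < keep) / keep
    # Gaussian noise with the same mean (1) and variance (ratio/keep)
    sigma = np.sqrt(ratio / keep)
    return h * rng.normal(1.0, sigma, size=h.shape)

h = np.ones(8)
print(multiplicative_noise(h, kind="bernoulli"))
print(multiplicative_noise(h, kind="gaussian"))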
Dropout vs. DropConnect
Dropout: keep each element of a layer's output with probability p, otherwise set it to 0 (with probability 1-p).
DropConnect: keep each connection of a layer with probability p, otherwise set it to 0 (with probability 1-p).
(Figure credit: Wan et al. ICML 2013)
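A minimal NumPy sketch of one training-time pass contrasting the two masking schemes (shapes and names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
p = 0.5                      # keep probability
v = rng.normal(size=4)       # layer input
W = rng.normal(size=(3, 4))  # weight matrix

# Dropout: mask the elements of the layer's *output*
dropout_out = (W @ v) * (rng.random(3) < p)

# DropConnect: mask individual *connections* (weight entries)
M = rng.random(W.shape) < p  # one Bernoulli mask per weight
dropconnect_out = (M * W) @ v

print(dropout_out, dropconnect_out)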
Inference step
Old approximation (dropout): a single deterministic pass with scaled weights.
Better approximation (DropConnect): model each pre-activation with a 1-D Gaussian distribution, sample from it, and average the activations.
(Figure credit: Wan et al. ICML 2013)
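Reconstructing the standard form of this moment-matching step from Wan et al. (notation ours): each pre-activation u_i is a sum of many independently masked terms and is therefore approximately Gaussian,

u_i \sim \mathcal{N}\!\big(\,p\,(Wv)_i,\;\; p(1-p)\,\big((W \odot W)(v \odot v)\big)_i\,\big),
\qquad
r_i \approx \frac{1}{Z}\sum_{z=1}^{Z} a\!\big(u_i^{(z)}\big),

i.e. instead of the dropout rule a(E[u]), DropConnect averages the activation a(.) over Z Gaussian samples, which is a better approximation when a is nonlinear.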
Experiments
Dataset augmentation; 5 independent networks; manually tuned hyperparameters (e.g. learning rate); training with dropout, DropConnect, and neither; fixed number of samples at the inference step.
Datasets: MNIST, CIFAR-10, SVHN.
MNIST
Test error with a 2-hidden-layer network; test error as the dropout rate changes; classification error rates without data augmentation.
(Figure credit: Wan et al. ICML 2013)
CIFAR-10
No dropout, dropout, and DropConnect compared on top of a 3-layer AlexNet-style model, and on a model with 2 convolutional and 2 fully connected layers.
(Figure credit: Wan et al. ICML 2013)
SVHN
512 units between the feature extractor and the softmax layer.
(Figure credit: Wan et al. ICML 2013)
Maxout
Dropout is well suited to training an ensemble of large models and approximating averaged model predictions; Maxout enhances dropout's abilities as a model averaging technique.
Maxout vs. ReLU
ReLU is one fixed piecewise-linear activation f(x); a maxout unit implements a learned piecewise-linear function.
Maxout
Maxout uses the max unit: each unit contains k linear functions and outputs the maximum of their values, i.e. it takes the maximum over k models:
h_i(x) = max_{j in 1..k} z_{ij}, where z_{ij} = x^T W_{.ij} + b_{ij}.
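A minimal NumPy sketch of this unit (shapes are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def maxout(x, W, b):
    # h_i = max_j (x . W[:, i, j] + b[i, j]): each output unit takes
    # the max over its k linear functions ("k models")
    z = np.einsum("d,dik->ik", x, W) + b  # z[i, j] = x . W[:, i, j] + b[i, j]
    return z.max(axis=-1)

d, m, k = 4, 3, 5      # input dim, output units, linear pieces per unit
x = rng.normal(size=d)
W = rng.normal(size=(d, m, k))
b = rng.normal(size=(m, k))
print(maxout(x, W, b))  # shape (m,)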
Universal approximator
Maxout is a universal approximator: a maxout model with just two hidden units can approximate any continuous function.
Sketch of the argument: any continuous piecewise-linear function g(v) can be expressed as the difference between two convex piecewise-linear functions, and for any continuous function f(v) there exists a continuous piecewise-linear function g(v) approximating it arbitrarily well.
(Figure credit: Goodfellow et al. ICML 2013)
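Written out (a reconstruction of the standard statement in Goodfellow et al.; each h_i is the convex piecewise-linear function computed by one maxout unit, and the domain is assumed compact):

g(v) \;=\; h_1(v) - h_2(v),
\qquad
\sup_{v}\,\lvert f(v) - g(v) \rvert \;<\; \varepsilon
\quad \text{for any } \varepsilon > 0.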
MNIST
3 convolutional maxout hidden layers with max pooling, followed by a softmax layer.
CIFAR-10
3 convolutional maxout layers, 1 fully connected maxout layer, and 1 softmax layer.
(Figure credit: Goodfellow et al. ICML 2013)
Why Maxout performs better
It enhances dropout's approximate model averaging, yields better optimization performance, and decreases the saturation rate.
Better model averaging
Dropout training encourages maxout units to have larger linear regions around the training inputs, which makes the averaging approximation more accurate.
(Figure credit: Goodfellow et al. ICML 2013)
Two experiments: 2 convolutional layers on SVHN data (rectifier units: 7.3% training error; maxout units: 5.1% training error) and a deep, narrow model on MNIST.
(Figure credit: Goodfellow et al. ICML 2013)
Saturation
Training with plain SGD, units saturate at 0 less than 5% of the time; training with dropout, the saturation rate increases. ReLU's activation is constant on the left, so saturated units receive no gradient, whereas maxout has a gradient everywhere.
A high proportion of zeros, and the difficulty of escaping them, impairs optimization: training two MLPs with 2 hidden layers, the rectifier network fails to use 17.6% of the filters in the first layer and 39.2% in the second, while the maxout network fails to use only 0.1% of its filters.
(Figure credit: Goodfellow et al. ICML 2013)
Concluding remarks
"If your deep neural net is significantly overfitting, dropout will usually reduce the number of errors by a lot. If your deep neural net is not overfitting you should be using a bigger one!"
(G. Hinton, Lecture 6a, CSC2535, 2013)
Marginalizing Dropout
Marginalizing dropout in linear regression yields a special form of ridge regression; marginalizing dropout in deep networks remains an open question.
(N. Srivastava et al. 2014)
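Concretely, for linear regression with design matrix X, targets y, and an elementwise Bernoulli(p) mask R on X (following the marginalized-dropout result in Srivastava et al. 2014; notation ours):

\mathbb{E}_{R}\,\big\| y - (R \odot X)\,w \big\|^2
  \;=\; \big\| y - p\,X w \big\|^2 \;+\; p(1-p)\,\big\| \Gamma w \big\|^2,
\qquad
\Gamma = \big(\operatorname{diag}(X^{\top}X)\big)^{1/2},

so the dropout noise acts as a ridge penalty whose strength on each weight scales with the second moment of the corresponding feature.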
Regularization
Many approaches have been proposed for regularizing deep neural networks: adding an L1 or L2 penalty, early stopping of training, Bayesian methods, and dropout.
Why DropConnect works
Averaging models; the Rademacher complexity of the model depends on the keep probability p:
p = 0: model complexity is zero.
p = 1: recovers the standard model complexity.
p = 0.5: all sub-models have equal complexity.