Slide 1: Tips for Training Deep Network
Slide 2: Outline
Training Strategy: Batch Normalization
Activation Function: SELU
Network Structure: Highway Network
Slide 3: Batch Normalization
Slide 4: Feature Scaling
For each dimension $i$, compute over the $R$ training examples the mean $m_i = \frac{1}{R}\sum_{r=1}^{R} x_i^r$ and the standard deviation $\sigma_i = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(x_i^r - m_i\right)^2}$, then normalize each feature: $\tilde{x}_i^r = \frac{x_i^r - m_i}{\sigma_i}$.
After scaling, the means of all dimensions are 0 and the variances are all 1.
In general, gradient descent converges much faster with feature scaling than without it.
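A minimal numpy sketch of the scaling described above; the function name and shapes are illustrative, not from the slides:

```python
import numpy as np

def feature_scale(X):
    """Standardize each feature dimension to mean 0 and variance 1.

    X: array of shape (R, D) -- R training examples, D feature dimensions.
    """
    m = X.mean(axis=0)                 # per-dimension mean m_i
    sigma = X.std(axis=0)              # per-dimension standard deviation sigma_i
    return (X - m) / (sigma + 1e-8)    # small epsilon guards against sigma_i = 0

# Example: features with wildly different scales.
X = np.random.rand(100, 3) * np.array([1.0, 50.0, 1000.0])
X_scaled = feature_scale(X)
print(X_scaled.mean(axis=0))   # ~0 in every dimension
print(X_scaled.var(axis=0))    # ~1 in every dimension
```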
Slide 5: How about Hidden Layers?
The output of Layer 1 is the input of the next layer, so feature scaling could be applied to the hidden layers' outputs as well.
Difficulty: the statistics of each hidden layer's output change during training (internal covariate shift). A smaller learning rate can be helpful, but the training would be slower.
Solution: batch normalization.
Slide 6: Batch
A batch of examples $x^1, x^2, x^3$ is processed in parallel. Stacking them as the columns of a matrix, $\begin{bmatrix} z^1 & z^2 & z^3 \end{bmatrix} = W \begin{bmatrix} x^1 & x^2 & x^3 \end{bmatrix}$, computes the whole batch with a single matrix multiplication; the sigmoid is then applied elementwise, $a^i = \sigma(z^i)$.
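A small sketch of this batching trick, with illustrative shapes (4 neurons, 3 input dimensions, batch of 3):

```python
import numpy as np

W = np.random.randn(4, 3)                    # layer weights
x1, x2, x3 = (np.random.randn(3) for _ in range(3))

# Per-example: z^i = W x^i
z_separate = [W @ x for x in (x1, x2, x3)]

# Batched: stack examples as columns; one matrix product does all of them.
X = np.stack([x1, x2, x3], axis=1)           # shape (3, batch)
Z = W @ X                                    # shape (4, batch)
A = 1.0 / (1.0 + np.exp(-Z))                 # sigmoid, elementwise

assert np.allclose(Z[:, 0], z_separate[0])   # same result, computed in parallel
```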
Slide 7: Batch Normalization
Over a batch of pre-activations $z^1, \dots, z^N$, compute $\mu = \frac{1}{N}\sum_{i=1}^{N} z^i$ and $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(z^i - \mu\right)^2}$ (elementwise). Note that $\mu$ and $\sigma$ depend on the $z^i$.
Note: batch normalization cannot be applied to a small batch, because the batch statistics would be unreliable estimates.
Slide 8: Batch Normalization
Normalize each pre-activation, $\tilde{z}^i = \frac{z^i - \mu}{\sigma}$, before the sigmoid.
How do we do backpropagation? Since $\mu$ and $\sigma$ depend on the $z^i$, the gradients must also flow through $\mu$ and $\sigma$ rather than treating them as constants.
Slide 9: Batch Normalization
A learnable transform restores flexibility after normalization: $\hat{z}^i = \gamma \odot \tilde{z}^i + \beta$. Here $\mu$ and $\sigma$ are from the batch, while $\gamma$ and $\beta$ are network parameters learned by gradient descent.
At the testing stage we do not have a batch.
Ideal solution: compute $\mu$ and $\sigma$ using the whole training dataset.
Practical solution: compute the moving averages of $\mu$ and $\sigma$ over the batches during training, and use them at testing time.
[Figure: accuracy v.s. number of updates.]
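A minimal sketch of the train/test behavior described above, assuming the common momentum-based moving average (the class name and hyperparameter values are illustrative):

```python
import numpy as np

class BatchNorm:
    """Minimal batch normalization over a batch of pre-activations z.

    gamma, beta are learned network parameters; mu, sigma come from the
    batch during training and from moving averages at the testing stage.
    """
    def __init__(self, dim, momentum=0.99, eps=1e-5):
        self.gamma = np.ones(dim)       # learned scale
        self.beta = np.zeros(dim)       # learned shift
        self.mu_avg = np.zeros(dim)     # moving average of batch means
        self.var_avg = np.ones(dim)     # moving average of batch variances
        self.momentum, self.eps = momentum, eps

    def __call__(self, z, training=True):
        if training:
            mu, var = z.mean(axis=0), z.var(axis=0)   # statistics of this batch
            self.mu_avg = self.momentum * self.mu_avg + (1 - self.momentum) * mu
            self.var_avg = self.momentum * self.var_avg + (1 - self.momentum) * var
        else:
            mu, var = self.mu_avg, self.var_avg       # no batch at testing stage
        z_tilde = (z - mu) / np.sqrt(var + self.eps)  # normalize
        return self.gamma * z_tilde + self.beta       # learnable transform
```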
Batch Normalization - Benefits
BN reduces training time and makes very deep networks trainable.
Because there is less internal covariate shift, we can use larger learning rates.
There is less of an exploding/vanishing gradient problem; BN is especially effective for sigmoid, tanh, etc.
Learning is less affected by initialization.
BN reduces the demand for regularization.
To learn more ……
Batch Renormalization
Layer Normalization
Instance Normalization
Weight Normalization
Spectral Normalization
Slide 14: Activation Function: SELU
Slide 15: Rectified Linear Unit (ReLU)
$a = \max(0, z)$
Reasons:
1. Fast to compute
2. Biological reason
3. Equivalent to an infinite number of sigmoids with different biases
4. Alleviates the vanishing gradient problem
ReLU - variants
Leaky ReLU: $a = \max(0.01z,\ z)$
Parametric ReLU: $a = \max(\alpha z,\ z)$, where $\alpha$ is also learned by gradient descent
Slide 17: ReLU - variants
Exponential Linear Unit (ELU): $a = z$ for $z > 0$; $a = \alpha\left(e^z - 1\right)$ for $z \le 0$
Scaled ELU (SELU): $a = \lambda z$ for $z > 0$; $a = \lambda\alpha\left(e^z - 1\right)$ for $z \le 0$, with $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$
https://github.com/bioinf-jku/SNNs
SELU
Positive and negative values: the whole ReLU family has this property except the original ReLU.
Saturation region: ELU also has this property.
Slope larger than 1: only SELU has this property.
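The whole family in a few lines of numpy, using the constants from the SELU paper (function names are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    # Negative side saturates at -alpha instead of being cut to 0.
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def selu(z, alpha=1.6732632423543772, lam=1.0507009873554805):
    # lam > 1 gives the "slope larger than 1"; the pair (alpha, lam) is
    # chosen so that mean 0 / variance 1 is a fixed point of the layer.
    return lam * np.where(z > 0, z, alpha * (np.exp(z) - 1.0))
```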
Slide 19: SELU
Consider a neuron $z = \sum_{k=1}^{K} w_k a_k$, where the inputs $a_k$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2$ (they do not have to be Gaussian). Assume $\mu = 0$ and $\sigma^2 = 1$. Then $E[z] = \mu \sum_k w_k = 0$.
Slide 20: SELU
With the same assumptions ($\mu = 0$, $\sigma^2 = 1$), if the weights satisfy $\sum_k w_k^2 = 1$, then $\mathrm{Var}[z] = \sigma^2 \sum_k w_k^2 = 1$. Assuming $z$ is Gaussian (by the central limit theorem), passing it through SELU keeps the output mean at 0 and the variance at 1, which is the target of self-normalization.
Demo
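The lecture's demo is not reproduced here; as a stand-in, this small numpy sketch checks the self-normalizing claim: with weights drawn so that $\sum_k w_k^2 \approx 1$, activations keep mean ~0 and variance ~1 even after 50 layers (the layer count and width are arbitrary choices):

```python
import numpy as np

def selu(z, alpha=1.6732632423543772, lam=1.0507009873554805):
    return lam * np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

np.random.seed(0)
K = 512                                       # neurons per layer
a = np.random.randn(10000, K)                 # i.i.d. inputs, mean 0, variance 1

for layer in range(1, 51):
    W = np.random.randn(K, K) / np.sqrt(K)    # E[w] = 0, sum_k w_k^2 ~= 1
    a = selu(a @ W)                           # SELU keeps the statistics stable
    if layer % 10 == 0:
        print(f"layer {layer}: mean={a.mean():+.3f}, var={a.var():.3f}")
```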
Slide 22: Source of the joke: https://zhuanlan.zhihu.com/p/27336839 (the paper's proof runs to 93 pages).
SELU is actually more general.
Slide 23: The latest activation function: Self-Normalizing Neural Networks (SELU)
[Experimental results on MNIST and CIFAR-10.]
Slide 24: Demo
Slide 25: Highway Network & Grid LSTM
Slide 26: Feedforward v.s. Recurrent
Feedforward: $t$ is the layer index. $x \to f_1 \to a^1 \to f_2 \to a^2 \to f_3 \to a^3 \to f_4 \to y$; each layer applies a different function, $a^t = f_t(a^{t-1})$.
Recurrent: $t$ is the time step. $h^t = f(h^{t-1}, x^t)$; the same $f$ with the same parameters is applied at every step.
Idea: apply the gated structure of recurrent networks in a feedforward network.
Feedforward v.s. Recurrent:
1. A feedforward network does not have an input at each step.
2. A feedforward network has different parameters for each layer.
Slide 27: GRU → Highway Network
GRU: from $h^{t-1}$ and $x^t$, compute a reset gate $r$ and an update gate $z$; form the candidate $h'$ from $x^t$ and the reset-gated $h^{t-1}$; output $h^t = z \odot h^{t-1} + (1 - z) \odot h'$ and $y^t$.
To turn the GRU into a highway network:
No input $x^t$ at each step: $a^{t-1}$ is the output of the $(t-1)$-th layer and $a^t$ is the output of the $t$-th layer.
No output $y^t$ at each step.
No reset gate.
Slide 28: Highway Network v.s. Residual Network
Highway Network: a gate controller decides how much of the input layer to copy to the output layer and how much to transform.
Training Very Deep Networks: https://arxiv.org/pdf/1507.06228v2.pdf
Residual Network: the transformed input is added (+) to a copy of the input.
Deep Residual Learning for Image Recognition: http://arxiv.org/abs/1512.03385
The Highway Network automatically determines the layers needed! (See the sketch below.)
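A minimal sketch of one highway layer following the GRU-style update on the slides ($a^t = z \odot a^{t-1} + (1-z) \odot h'$); all names and shapes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(a_prev, W, b, Wg, bg):
    """One highway layer: an update gate decides how much input to copy.

    a_prev: output of the (t-1)-th layer (no x_t input at each step).
    W, b:   parameters of the nonlinear transform h'.
    Wg, bg: parameters of the gate controller.
    """
    h = np.tanh(W @ a_prev + b)            # candidate transform h'
    z = sigmoid(Wg @ a_prev + bg)          # update gate (no reset gate)
    return z * a_prev + (1.0 - z) * h      # gated mix of copy and transform

# Illustrative stack; a real highway network has different parameters
# per layer (here one set is reused for brevity).
dim = 8
a = np.random.randn(dim)
params = (np.random.randn(dim, dim), np.zeros(dim),
          np.random.randn(dim, dim), np.zeros(dim))
for t in range(4):
    a = highway_layer(a, *params)
```

When the gate $z$ saturates at 1 for a unit, the layer simply copies its input, which is how the network can effectively skip layers it does not need.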
Slide 30: Highway Network
Slide 31: Grid LSTM
LSTM: takes input $x$ and the previous memory $c$ and hidden state $h$, and outputs $y$ along with the updated $c'$ and $h'$.
Grid LSTM: memory for both time and depth. Along the time dimension it carries $(c, h)$ to $(c', h')$; along the depth dimension it carries $(a, b)$ to $(a', b')$.
Slide 32: Grid LSTM
[Figure: Grid LSTM blocks unrolled in two dimensions. Along time, $(c^{t-1}, h^{t-1})$ flows into the block at step $t$, which emits $(c^t, h^t)$ to step $t+1$; along depth, $(a^{l-1}, b^{l-1})$ flows into layer $l$, which emits $(a^l, b^l)$ to layer $l+1$.]
Slide 33: Grid LSTM
Inside the block, the hidden states of the two dimensions, $h$ and $b$, drive the gates $z$ (with $\tanh$), $z^i$, $z^f$, and $z^o$. The time memory is updated as $c' = z^f \odot c + z^i \odot z$ and $h' = z^o \odot \tanh(c')$; the depth memory $(a, b)$ is updated to $(a', b')$ in the same way.
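A minimal sketch of this block, assuming (as the diagram suggests) that the gates are driven by the concatenation of the two hidden states $h$ and $b$, and that $a$ plays the role of the cell along depth just as $c$ does along time; all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_transform(H, c, P):
    """Standard LSTM update driven by the shared hidden vector H."""
    Wz, Wi, Wf, Wo = P
    z  = np.tanh(Wz @ H)              # block input
    zi = sigmoid(Wi @ H)              # input gate
    zf = sigmoid(Wf @ H)              # forget gate
    zo = sigmoid(Wo @ H)              # output gate
    c_new = zf * c + zi * z           # c' = zf . c + zi . z
    h_new = zo * np.tanh(c_new)       # h' = zo . tanh(c')
    return h_new, c_new

def grid_lstm_block(h, c, b, a, P_time, P_depth):
    """2D Grid LSTM: one LSTM update per dimension, sharing H = [h; b]."""
    H = np.concatenate([h, b])
    h_new, c_new = lstm_transform(H, c, P_time)    # time dimension: (h, c)
    b_new, a_new = lstm_transform(H, a, P_depth)   # depth dimension: (b, a)
    return h_new, c_new, b_new, a_new

# Illustrative shapes: each weight matrix maps the concatenated hidden
# vector (2 * dim) back to dim.
dim = 16
h, c, b, a = (np.zeros(dim) for _ in range(4))
P_time  = tuple(np.random.randn(dim, 2 * dim) for _ in range(4))
P_depth = tuple(np.random.randn(dim, 2 * dim) for _ in range(4))
h, c, b, a = grid_lstm_block(h, c, b, a, P_time, P_depth)
```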
Slide 34: 3D Grid LSTM
[Figure: a 3D Grid LSTM block with memory pairs $(c, h)$, $(a, b)$, and $(e, f)$ along its three dimensions, producing $(c', h')$, $(a', b')$, and $(e', f')$.]
Slide 35: 3D Grid LSTM
Images are composed of pixels, e.g., 3 × 3 images.