
Slide1

Tips for Training Deep Network

Slide2

Outline

Training Strategy: Batch Normalization

Activation Function: SELU

Network Structure: Highway Network

Slide3

Batch Normalization

Slide4

Feature Scaling

For each dimension $i$ of the training examples $x^1, \dots, x^R$:

mean: $m_i = \frac{1}{R}\sum_{r=1}^{R} x_i^r$

standard deviation: $\sigma_i = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(x_i^r - m_i\right)^2}$

$x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}$

After scaling, the means of all dimensions are 0, and the variances are all 1.

In general, gradient descent converges much faster with feature scaling than without it.
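As a concrete illustration, here is a minimal numpy sketch of the scaling above (the function name and test data are illustrative, not from the slides):

```python
import numpy as np

def feature_scale(X):
    """Scale each dimension i to mean 0 and variance 1.

    X: array of shape (R, d) -- R training examples, d feature dimensions.
    """
    m = X.mean(axis=0)        # m_i: mean of dimension i over the R examples
    sigma = X.std(axis=0)     # sigma_i: standard deviation of dimension i
    return (X - m) / sigma    # x_i^r <- (x_i^r - m_i) / sigma_i

# Example: three dimensions on very different scales
X = np.random.rand(100, 3) * np.array([1.0, 10.0, 1000.0])
X_scaled = feature_scale(X)
print(X_scaled.mean(axis=0))  # all ~0
print(X_scaled.std(axis=0))   # all ~1
```

After scaling, no single dimension dominates the error surface, which is why gradient descent converges faster.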

Slide5

How about the Hidden Layers?

[Figure: $x \to$ Layer 1 $\to a^1 \to$ Layer 2 $\to a^2 \to \cdots$; should feature scaling also be applied to the input of each hidden layer?]

Difficulty: the statistics of each layer's input change during training (Internal Covariate Shift). A smaller learning rate can be helpful, but then training would be slower.

Solution: Batch normalization.

Slide6

Batch

$x^1, x^2, x^3$ in a batch are processed in parallel:

$z^i = W^1 x^i$, i.e. $\left[\,z^1\; z^2\; z^3\,\right] = W^1 \left[\,x^1\; x^2\; x^3\,\right]$

$\mu = \frac{1}{3}\sum_{i=1}^{3} z^i \qquad \sigma = \sqrt{\frac{1}{3}\sum_{i=1}^{3}\left(z^i - \mu\right)^2}$ (computed element-wise over the batch)

Each $z^i$ is normalized using $\mu$ and $\sigma$ and then passed through the sigmoid.

Slide7

Batch normalization

$\tilde{z}^i = \frac{z^i - \mu}{\sigma}$ (the division is element-wise)

Note that $\mu$ and $\sigma$ depend on $z^i$.

Note: Batch normalization cannot be applied to small batches, because $\mu$ and $\sigma$ estimated from only a few samples are unreliable.
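A minimal numpy sketch of this normalization step for one batch (the small constant eps guards against division by zero; it is standard practice rather than something shown on the slide):

```python
import numpy as np

def batch_norm(Z, eps=1e-5):
    """Normalize a batch of pre-activations.

    Z: array of shape (N, d) holding z^1 .. z^N for one batch.
    Returns z-tilde with zero mean and unit variance per dimension.
    """
    mu = Z.mean(axis=0)               # mu depends on every z^i in the batch
    sigma = Z.std(axis=0)             # and so does sigma
    return (Z - mu) / (sigma + eps)   # element-wise normalization
```

With a large batch, mu and sigma are good estimates of the statistics of the whole dataset; with a tiny batch they are noisy, which is what the note above warns about.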

Slide8

Batch normalization

$\tilde{z}^i = \frac{z^i - \mu}{\sigma}$, and each $\tilde{z}^i$ is passed through the sigmoid.

How to do backpropagation? $\mu$ and $\sigma$ depend on $z^i$, so they cannot be treated as constants: the gradient must also be propagated through $\mu$ and $\sigma$.

Slide9

Batch normalization

$\tilde{z}^i = \frac{z^i - \mu}{\sigma} \qquad \hat{z}^i = \gamma \odot \tilde{z}^i + \beta$

$\mu$ and $\sigma$ depend on $z^i$, while $\gamma$ and $\beta$ are learned parameters that let the network restore the scale and shift of the representation if pure normalization is too restrictive.

Slide10

Batch normalization

At testing stage:

$\tilde{z} = \frac{z - \mu}{\sigma} \qquad \hat{z} = \gamma \odot \tilde{z} + \beta$

$\mu$ and $\sigma$ are computed from the batch; $\gamma$ and $\beta$ are network parameters. The problem: we do not have a batch at testing stage.

Ideal solution: compute $\mu$ and $\sigma$ using the whole training dataset.

Practical solution: compute the moving average of the $\mu$ and $\sigma$ of the batches during training, accumulated over the updates and weighted toward the end of training.
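A sketch of the practical solution: keep an exponential moving average of the batch statistics during training and use it at testing stage (the momentum value 0.99 and all names here are illustrative assumptions, not from the slides):

```python
import numpy as np

d = 64                           # layer width (illustrative)
momentum = 0.99                  # typical moving-average weight (assumption)
running_mu = np.zeros(d)
running_sigma = np.ones(d)

def update_running_stats(mu, sigma):
    """Call once per training batch, after computing its mu and sigma."""
    global running_mu, running_sigma
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_sigma = momentum * running_sigma + (1 - momentum) * sigma

def batch_norm_test(z, gamma, beta, eps=1e-5):
    """Testing stage: no batch is available, so use the accumulated averages."""
    z_tilde = (z - running_mu) / (running_sigma + eps)
    return gamma * z_tilde + beta
```

Because recent batches are weighted more heavily, the moving average also tracks the statistics as the network parameters drift during training.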

 

 

 

Slide11

Batch normalization - Benefit

BN reduces training time and makes very deep networks trainable.

Because of less internal covariate shift, we can use larger learning rates.

Less exploding/vanishing gradient; this is especially helpful for sigmoid, tanh, etc.

Learning is less affected by initialization.

BN reduces the demand for regularization.

Slide12

Slide13

To learn more ……

Batch Renormalization

Layer Normalization

Instance Normalization

Weight Normalization

Spectral Normalization

Slide14

Activation Function: SELU

Slide15

Rectified Linear Unit (ReLU)

$a = z$ if $z > 0$; $a = 0$ if $z \le 0$

Reasons for using it:

1. Fast to compute

2. Biological reason

3. Equivalent to an infinite number of sigmoids with different biases

4. Handles the vanishing gradient problem

Slide16

ReLU - variant

Leaky ReLU: $a = z$ if $z > 0$; $a = 0.01z$ if $z \le 0$

Parametric ReLU: $a = z$ if $z > 0$; $a = \alpha z$ if $z \le 0$, where $\alpha$ is also learned by gradient descent

Slide17

ReLU - variant

Exponential Linear Unit (ELU): $a = z$ if $z > 0$; $a = \alpha\left(e^z - 1\right)$ if $z \le 0$

Scaled ELU (SELU): $a = \lambda z$ if $z > 0$; $a = \lambda\alpha\left(e^z - 1\right)$ if $z \le 0$, with $\alpha = 1.6732632423543772\ldots$ and $\lambda = 1.0507009873554805\ldots$

https://github.com/bioinf-jku/SNNs
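The whole family side by side, as a numpy sketch; the SELU constants α and λ are the values published with the SNNs code linked above:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

# Constants from the SNNs paper / repository
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(z):
    return LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1))
```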

 

 

 

 

Slide18

SELU

Positive and negative values: the whole ReLU family has this property, except the original ReLU.

Saturation region (for very negative $z$): ELU also has this property.

Slope larger than 1 (since $\lambda > 1$): only SELU has this property.

Slide19

SELU

The inputs $a_1, \dots, a_K$ are i.i.d. random variables with mean $\mu = 0$ and variance $\sigma^2 = 1$. They do not have to be Gaussian.

$z = \sum_{k=1}^{K} w_k a_k$

$\mu_z = E[z] = \sum_{k=1}^{K} w_k\, E[a_k] = \mu \sum_{k=1}^{K} w_k = 0$

Slide20

SELU

The inputs $a_1, \dots, a_K$ are i.i.d. random variables with mean $\mu = 0$ and variance $\sigma^2 = 1$.

$\sigma_z^2 = E\left[\left(z - \mu_z\right)^2\right] = \sum_{k=1}^{K} w_k^2\, \sigma^2 = \sigma^2 \sum_{k=1}^{K} w_k^2 = 1$ if $\sum_{k=1}^{K} w_k^2 = 1$

Target: $\mu_z = 0$, $\sigma_z^2 = 1$. Now assume $z$ is Gaussian: then $a = \operatorname{selu}(z)$ again has mean 0 and variance 1, which is exactly what the constants $\alpha$ and $\lambda$ are chosen to guarantee.
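A quick numerical check of this fixed point, reusing the selu sketch from above: feed a standard Gaussian through SELU, and the output mean and variance should stay near 0 and 1.

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805

def selu(z):
    return LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1))

z = np.random.randn(1_000_000)   # z assumed Gaussian with mean 0, variance 1
a = selu(z)
print(a.mean(), a.var())         # both should come out close to 0 and 1
```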

 

Slide21

Demo

Slide22

Source of joke: https://zhuanlan.zhihu.com/p/27336839

The proof is 93 pages long.

SELU is actually more general.

Slide23

The latest activation function: Self-Normalizing Neural Networks (SELU)

Experiments on MNIST and CIFAR-10

Slide24

Demo

Slide25

Highway Network & Grid LSTM

Slide26

Feedforward network: $x \to f_1 \to a^1 \to f_2 \to a^2 \to f_3 \to a^3 \to f_4 \to y$, i.e. $a^t = f_t\left(a^{t-1}\right)$, where $t$ is the layer index.

Recurrent network: $h^t = f\left(h^{t-1}, x^t\right)$ with output $y^t$, where $t$ is the time step and $f$ is the same at every step.

Idea: apply the gated structure of recurrent networks in a feedforward network.

Feedforward vs. Recurrent:

1. A feedforward network does not have an input at each step.

2. A feedforward network has different parameters for each layer.

Slide27

From GRU to Highway Network

GRU: at each time step, $h^{t-1}$ and $x^t$ pass through a reset gate $r$ and an update gate $z$ to form the candidate $h'$, and the new state is $h^t = z \odot h^{t-1} + (1 - z) \odot h'$, with output $y^t$.

To turn this into a Highway Network layer:

No input $x^t$ at each step; instead, $a^{t-1}$ is the output of the (t-1)-th layer and $a^t$ is the output of the t-th layer.

No output $y^t$ at each step.

No reset gate.

Slide28

Highway Network vs. Residual Network

Highway Network: $h' = \sigma\left(W a^{t-1}\right)$, gate controller $z = \sigma\left(W' a^{t-1}\right)$, and $a^t = z \odot a^{t-1} + (1 - z) \odot h'$, so the gate decides how much of the input is copied through.

Residual Network: $a^t = a^{t-1} + h$, where the input is copied and added to the layer output.

Training Very Deep Networks: https://arxiv.org/pdf/1507.06228v2.pdf

Deep Residual Learning for Image Recognition: http://arxiv.org/abs/1512.03385
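A minimal numpy sketch of one highway layer following the gating equation above (the tanh nonlinearity for h' and all parameter names are illustrative choices, not fixed by the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(a_prev, W, b, W_gate, b_gate):
    """One highway layer: a^t = z * a^(t-1) + (1 - z) * h'."""
    h = np.tanh(W @ a_prev + b)              # candidate transformation h'
    z = sigmoid(W_gate @ a_prev + b_gate)    # gate controller: how much to copy
    return z * a_prev + (1 - z) * h

def residual_layer(a_prev, W, b):
    """Residual counterpart: copy the input and add, with no gate."""
    return a_prev + np.tanh(W @ a_prev + b)
```

When the gate z saturates at 1, the layer is a pure copy, which is how a very deep highway network can effectively skip layers it does not need (the point of the next slide).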

 

 

 

 

 

 

 

 

 

 

 

Slide29

[Figure: three networks of different depths, each running from input layer to output layer.]

Highway Network automatically determines the layers needed!

Slide30

Highway Network

Slide31

Grid LSTM

LSTM: takes the time memory $c$ and hidden state $h$ together with input $x$, and produces output $y$ plus updated $c$ and $h$.

Grid LSTM: keeps memory for both time and depth. The pair $(c, h)$ flows along the time direction, the pair $(b, a)$ flows along the depth direction, and one block maps $(c, h, b, a)$ to $(c', h', b', a')$.

Slide32

Grid LSTM

[Figure: Grid LSTM blocks unrolled over time and depth. Along time, the block at step $t$ passes $h^t, c^t$ on to step $t+1$; along depth, layer $l$ passes $a^l, b^l$ up to layer $l+1$, whose blocks receive $a^l, b^l$ from below and $h^{t-1}, c^{t-1}$ from the left.]

Slide33

Grid LSTM

Inside one Grid LSTM block: $h$ and $a$ are combined to compute the candidate $z$ and the gates $z^i$, $z^f$, $z^o$. The time memory is updated as $c' = z^f \odot c + z^i \odot z$ and $h' = z^o \odot \tanh\left(c'\right)$, and the same gated structure updates the depth memory to produce $b'$ and $a'$.
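A rough numpy sketch of this block under the assumptions above: a standard LSTM update along time and a second one along depth, both reading the concatenated (h, a). The weight shapes and sharing choices are illustrative, not fixed by the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(c, v, W):
    """Standard LSTM update; W maps the input v to the four gate pre-activations."""
    d = c.shape[0]
    gates = W @ v                    # W has shape (4d, len(v))
    z  = np.tanh(gates[:d])          # candidate values
    zi = sigmoid(gates[d:2*d])       # input gate
    zf = sigmoid(gates[2*d:3*d])     # forget gate
    zo = sigmoid(gates[3*d:])        # output gate
    c_new = zf * c + zi * z
    h_new = zo * np.tanh(c_new)
    return c_new, h_new

def grid_lstm_block(c, h, b, a, W_time, W_depth):
    """Map (c, h, b, a) to (c', h', b', a'): one LSTM along time, one along depth."""
    v = np.concatenate([h, a])                 # both directions see h and a
    c_new, h_new = lstm_cell(c, v, W_time)     # memory along time
    b_new, a_new = lstm_cell(b, v, W_depth)    # memory along depth
    return c_new, h_new, b_new, a_new
```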

Slide34

3D Grid LSTM

[Figure: a 3D Grid LSTM block with three memory/hidden pairs, $(c, h)$, $(a, b)$, and $(e, f)$, one per dimension of the grid, each updated to $(c', h')$, $(a', b')$, and $(e', f')$.]

Slide35

3D Grid LSTM

Images are composed of pixels, e.g. 3 × 3 images.