Introduction to Deep Learning: How to make your own deep learning framework
Ryota Tomioka (ryoto@microsoft.com), MSR Summer School, 2 July 2018
Azure iPython Notebook: https://notebooks.azure.com/ryotat/libraries/DLTutorial
Agenda
This lecture covers:
- Introduction to machine learning (keywords: model, training, inference, stochastic gradient descent, overfitting)
- How to compute the gradient (keywords: backpropagation, multi-layer perceptrons, activation function)
What is Machine Learning (ML)?
The goal of ML is to learn from data. Technically, it combines statistics and computational tools (optimization).
Example (supervised learning) tasks, each a model mapping an input to an output:
- Image recognition / classification: image -> Model -> "Cat" or "Dog"
- Speech recognition: audio -> Model -> "Hello"
- Machine translation: "How are you?" -> Model -> "Wie geht's dir?" (or "Hello" -> "Bonjour")
- Conversational agent / chatbot: "How are you?" -> Model -> "I am fine thank you"
Other forms of learning: unsupervised learning, reinforcement learning.
Training and inference
- Training: the loss tells what the output of the model should have been. The training objective can be overly optimistic (overfitting).
- Inference (validation): the parameters are frozen; we care about the performance in this setting.
[Figure: training data (image labeled "cat") -> model -> prediction 0.99 (cat), 0.1 (dog) -> loss]
Learning objective
Objective: minimize the expected loss E_{(x,y)~D}[ loss(f_θ(x), y) ] for a randomly chosen (x, y) from some distribution D.
- We don't know the distribution D.
- We only have access to (training) samples from D.
[Figure: input x (image of a digit) -> model (parameters θ) -> prediction; label y = 4]
Training objective
Objective function: approximate the unknown distribution D with the training data average:
L(θ) = (1/n) Σ_{i=1}^{n} loss(f_θ(x_i), y_i)
[Figure: each training example, e.g. (image, 4) and (image, 5), contributes one loss term; the terms are summed]
Mapping from input to prediction
input x (784 dim) -> linear map (parameters θ) -> score z (10 dim) -> Softmax -> probability p (10 dim)
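As a concrete illustration, here is a minimal NumPy sketch of this mapping (the names `predict`, `W`, `b` are illustrative, not from the slides):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; softmax is shift-invariant.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def predict(x, W, b):
    # x: (784,) input image; W: (784, 10) and b: (10,) play the role of the parameters theta.
    z = x @ W + b        # score z, 10 dim
    return softmax(z)    # probability p, 10 dim
```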
Cross-entropy loss
- Interpretation 1: you pay penalty -log p_y, the negative log of the probability the model assigns to the correct label.
- Interpretation 2: Kullback-Leibler divergence between the target distribution (all the probability mass on the correct label, '4') and the prediction p.
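A minimal sketch of interpretation 1, assuming the probability vector `p` from the predictor above and an integer label `y`:

```python
import numpy as np

def cross_entropy(p, y):
    # Penalty paid: minus the log-probability the model assigns to the correct label y.
    return -np.log(p[y])
```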
Landscape of training objective
[Figure: the objective function L(θ) as a surface over the parameter space; training moves from the initial parameters to the final parameters]
[Figure: the same training run viewed in parameter space and in example space]
Gradient descent
Initialize θ randomly.
For t in 0, …, T_maxiter:
    θ ← θ − η ∇L(θ)
where η is the learning rate (step size) and ∇L(θ) is the gradient of the objective.
Computation of ∇L(θ) requires a full sweep over the training data: per-iteration computational cost = O(n).
Stochastic gradient descent (SGD)
Initialize θ randomly.
For t in 0, …, T_maxiter:
    θ ← θ − η ∇loss(f_θ(x_i), y_i), where index i is chosen randomly
The stochastic gradient requires only one training example: per-iteration computational cost = O(1).
Minibatch stochastic gradient descent
Initialize θ randomly.
For t in 0, …, T_maxiter:
    θ ← θ − η (1/|B|) Σ_{i∈B} ∇loss(f_θ(x_i), y_i), where minibatch B is chosen randomly
The minibatch gradient is the average gradient over a random subset of the data of size B: per-iteration computational cost = O(B).
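The three variants differ only in how many examples enter each gradient estimate. A sketch of the minibatch loop, assuming a hypothetical helper `grad_loss(theta, X_batch, Y_batch)` that returns the average gradient over the batch; with B = n this reduces to full-batch gradient descent, with B = 1 to plain SGD:

```python
import numpy as np

def minibatch_sgd(theta, grad_loss, X, Y, eta=0.1, B=32, max_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    for t in range(max_iter):
        idx = rng.choice(n, size=B, replace=False)  # minibatch B chosen randomly
        g = grad_loss(theta, X[idx], Y[idx])        # average gradient over the minibatch, O(B)
        theta = theta - eta * g                     # step of size eta (learning rate)
    return theta
```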
Overfitting – what is signal vs noise?
- Powerful models are more likely to overfit.
- We need validation data: leave out some portion of the training data to validate the generalizability of the model.
[Figure: the data is split into training data (labeled cat/dog images) and held-out validation data]
Typical learning curve
[Figure: training loss and validation loss plotted against the number of training steps]
Techniques to reduce overfitting
- Reduce the number of parameters
- Parameter sharing (convnets, recurrent neural nets)
- Weight decay (aka L2 regularization): penalizes the magnitude of the parameters (see the sketch after this list)
- Early stopping: indirectly controls the magnitude of the parameters
- More recent techniques: dropout, batch normalization
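For instance, weight decay can be folded into the SGD update, since the L2 penalty (λ/2)·||θ||² contributes λ·θ to the gradient (a sketch; `lam` is an illustrative hyperparameter):

```python
def sgd_step_with_weight_decay(theta, grad, eta=0.1, lam=1e-4):
    # The L2 penalty (lam/2) * ||theta||^2 adds lam * theta to the gradient,
    # shrinking the parameters towards zero at every step.
    return theta - eta * (grad + lam * theta)
```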
Summary so far
A machine learning problem can be specified by:
- Task: what's the input? what's the output?
- Model: maps from the input to some numbers
- Loss function: measures how the model is doing
- Training: mini-batch SGD on the sum of empirical losses
- Validation: are we overfitting?
How do we compute the gradient?
[Figure: two example computation graphs built from primitive operations (Square, Exp, Add, Mul) applied to an input x]
How do we compute the gradient?
- Manually: tedious (model specific), error prone, and not easy to explore new models
- Algorithmically: back-propagation; allows researchers to focus on model building rather than implementing each model's gradient correctly
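To make the "algorithmically" option concrete, here is a minimal scalar reverse-mode autodiff sketch in the spirit of the Add/Mul graphs above (class and function names are illustrative, not part of any particular framework):

```python
class Node:
    """A scalar in the computation graph: value, accumulated gradient, and a backward rule."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents
        self._backward = lambda grad: None  # leaves have nothing to propagate

def add(a, b):
    out = Node(a.value + b.value, (a, b))
    def backward(grad):
        a.grad += grad            # d(a+b)/da = 1
        b.grad += grad            # d(a+b)/db = 1
    out._backward = backward
    return out

def mul(a, b):
    out = Node(a.value * b.value, (a, b))
    def backward(grad):
        a.grad += grad * b.value  # d(a*b)/da = b
        b.grad += grad * a.value  # d(a*b)/db = a
    out._backward = backward
    return out

def backprop(loss):
    # Topologically order the graph, then push gradients from the loss backwards,
    # so shared sub-expressions are visited once and their gradients are accumulated.
    order, seen = [], set()
    def visit(node):
        if node not in seen:
            seen.add(node)
            for p in node.parents:
                visit(p)
            order.append(node)
    visit(loss)
    loss.grad = 1.0
    for node in reversed(order):
        node._backward(node.grad)

# Example: loss = x * w + b  (a tiny linear predictor without the softmax)
x, w, b = Node(2.0), Node(3.0), Node(1.0)
loss = add(mul(x, w), b)
backprop(loss)
print(w.grad, b.grad)  # 2.0 1.0: d(loss)/dw = x, d(loss)/db = 1
```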
Back propagation for the linear predictor
Identify how each variable influences the loss.
[Figure: computation graph x, w, b -> z -> p -> Loss]
Back propagation
Don't repeat shared compute: propagate the gradients backward.
[Figure: the same graph x, w, b -> z -> p -> Loss, with gradients flowing from the loss back to each variable]
Going deeper
A multi-layer perceptron inserts a hidden layer between the input and the score:
input x (784 dim) -> linear (parameters θ) -> pre-hidden h_in (1024 dim) -> ReLU -> hidden h (1024 dim) -> linear -> score z (10 dim) -> Softmax -> probability p (10 dim)
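A forward pass for this two-layer network might look as follows (a sketch reusing the `softmax` function from the earlier snippet; the intermediates are kept because a backward pass, sketched later, needs them):

```python
import numpy as np

def mlp_forward(x, W0, b0, W1, b1):
    # x: (784,), W0: (784, 1024), b0: (1024,), W1: (1024, 10), b1: (10,)
    h_in = x @ W0 + b0           # pre-hidden, 1024 dim
    h = np.maximum(h_in, 0.0)    # ReLU hidden, 1024 dim
    z = h @ W1 + b1              # score, 10 dim
    p = softmax(z)               # probability, 10 dim (softmax defined earlier)
    return p, (x, h_in, h, z)    # cache intermediates for backpropagation
```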
Activation functions
- Rectified Linear Unit (ReLU)
- Hyperbolic tangent (tanh)
- Sigmoid
These nonlinearities are what let a multi-layer network solve problems a linear model cannot, e.g. the XOR problem.
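In NumPy the three activations are one-liners (a sketch):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)         # rectified linear unit: max(0, x)

def tanh(x):
    return np.tanh(x)                 # hyperbolic tangent, range (-1, 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # logistic sigmoid, range (0, 1)
```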
Rectified linear unit (ReLU)
[Figure: the ReLU function, built up over several slides: zero for negative inputs, identity for positive inputs]
Comparison to tanh
- tanh: the gradient is almost zero in the flat (saturated) regions.
- ReLU: the gradient can flow backwards; "ReLU is non-saturating".
Chain rule
Identify how each variable influences the loss.
[Figure: computation graph of the two-layer network: x, W0, b0 -> h_in -> h, then W1, b1 -> z -> p -> Loss]
Back propagation
Don't repeat shared compute: propagate the gradients backward.
[Figure: the same two-layer graph (Linear -> ReLU -> Linear -> Softmax) with gradients flowing from the loss back through z, h, h_in to x, W0, b0, W1, b1]
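Written out by hand for the two-layer network above (assuming the `mlp_forward` sketch and a softmax + cross-entropy loss), the backward pass reuses each downstream gradient instead of recomputing it:

```python
import numpy as np

def mlp_backward(y, p, cache, W1):
    # y: correct class index; p and cache are the outputs of mlp_forward above.
    x, h_in, h, z = cache
    dz = p.copy()
    dz[y] -= 1.0                 # gradient of softmax + cross-entropy w.r.t. the score z
    dW1 = np.outer(h, dz)        # second linear layer
    db1 = dz
    dh = W1 @ dz                 # propagate backwards; computed once and shared
    dh_in = dh * (h_in > 0)      # ReLU gate: gradient flows only where h_in > 0
    dW0 = np.outer(x, dh_in)     # first linear layer
    db0 = dh_in
    return dW0, db0, dW1, db1
```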
More complex example
[Figure: a variant of the two-layer graph (Linear -> ReLU -> Linear -> Softmax) with an additional intermediate variable z_0; backpropagation proceeds the same way through the larger graph]
Demo
Deep learning frameworks
- Collection of implementations of popular layers (or modules), e.g. ReLU, Softmax, Convolution, RNNs
- Provide an easy front-end to the layers/modules
- Handle different array libraries / hardware backends (CPUs, GPUs, …)
If there were an exchange format…
Books
- Convex Optimization. Stephen Boyd and Lieven Vandenberghe. Cambridge University Press.
- Information Theory, Inference and Learning Algorithms. David J. C. MacKay. 2003.
- Neural Networks for Pattern Recognition. Christopher M. Bishop.
Conclusion
Training a network consists of:
- Forward propagation: computing the loss
- Backward propagation: computing the gradient
- Parameter update: move in the direction of the computed stochastic gradient
A fairly standard set of building blocks is used to build complex models: Linear, ReLU, Softmax, Tanh, …
Advanced topics
- How to prevent overfitting
- How to scale neural network training to multiple machines / devices
Minibatch size and convergence speed
[Figure: error vs number of samples processed for full-batch gradient descent, large-minibatch gradient descent, and small-minibatch gradient descent; each marker is one parameter update, and one pass over the dataset is one epoch]
Effect of minibatch size B
The learning rate is scaled together with the minibatch size B.
Dropout – randomly drops activations during training
[Figure: three training instances of the same network, each with a different random subset of activations dropped]
- Idea: randomly drop activations during training
- Benefit: reduces overfitting and improves generalization
- Can be implemented as a layer (see the sketch below)
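A dropout layer can be sketched as follows ("inverted" dropout, so inference needs no rescaling; the names and the keep/rescale convention are this sketch's choices):

```python
import numpy as np

def dropout_forward(h, drop_prob=0.5, train=True, rng=None):
    # Training: zero each activation independently with probability drop_prob and
    # rescale the survivors so the expected activation is unchanged.
    if not train or drop_prob == 0.0:
        return h, None                    # inference: dropout is a no-op
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) >= drop_prob) / (1.0 - drop_prob)
    return h * mask, mask

def dropout_backward(dh, mask):
    # The same random mask gates the gradient on the way back.
    return dh if mask is None else dh * mask
```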
Batch normalization
[Figure: inputs scaled to [0, 1] (784 dim), weights and biases drawn from N(0, 1); each layer's activations (1024 dim) are random variables with a bounded scale, layer after layer]
Batch normalization [Ioffe & Szegedy, 2015]
- Idea: normalize the activation of each unit to have zero mean and unit standard deviation, using a mini-batch estimate of the mean and variance.
- Benefit: more stable and faster training; often generalizes better.
- Can be implemented as a layer (see the sketch below).
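The training-time forward pass of such a layer might look like this (a sketch; gamma and beta are the learnable scale and shift, and a full implementation would also track running statistics for inference):

```python
import numpy as np

def batchnorm_forward(H, gamma, beta, eps=1e-5):
    # H: (batch, units). Normalize each unit with the minibatch mean and variance,
    # then apply the learnable scale gamma and shift beta.
    mu = H.mean(axis=0)
    var = H.var(axis=0)
    H_hat = (H - mu) / np.sqrt(var + eps)   # zero mean, unit std per unit
    return gamma * H_hat + beta
```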
More optimization algorithms
Momentum SGD: improves SGD by incorporating "momentum".
Adaptive optimization algorithms
Adam [Kingma & Ba 2015]: uses first- and second-order statistics of the gradients so that the gradients are normalized. Benefit: prevents the vanishing/exploding gradient problem.
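Sketches of the two update rules (variable names are illustrative; the caller keeps the state v, m and the step counter t between calls):

```python
import numpy as np

def momentum_step(theta, grad, v, eta=0.1, mu=0.9):
    # Momentum SGD: keep a decaying running sum of past gradients and step along it.
    v = mu * v + grad
    return theta - eta * v, v

def adam_step(theta, grad, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: first-order (m) and second-order (v) statistics of the gradients,
    # used to normalize the step size per parameter.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)               # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v
```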
Learning rate decay
Reduce the learning rate (step-size parameter η) once in a while.
Typical setting: multiply η by 0.98 every epoch.
Summary
- Training and inference
- Training objective and optimization
- Neural networks and backpropagation
- Importance of software tools – turns research into Lego-block engineering
- Various tricks to speed up training and reduce overfitting
Gradient explosion/diminishing problem
[Figure: a deep stack of Linear + ReLU layers; in the backward pass the gradient passes through every Linear layer]
The gradient is magnified or diminished by a factor of W at every layer. If we have many layers, gradients can explode or diminish to zero.
What is a model?
A model is a function specified by a set of parameters.
Example: linear predictor with parameters w_1, …, w_5 and b:
prediction = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + b
[Figure: the inputs x_1, …, x_5 are weighted by w_1, …, w_5, summed, and the bias b is added, producing e.g. the output 0.99]
Loss functions for binary classification
With p = prediction and y = ground truth (0 or 1):
- Misclassification loss (= −accuracy)
- Squared loss: (p − y)²
- Cross-entropy loss: −y log p − (1 − y) log(1 − p)
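For a scalar prediction p in (0, 1) and label y in {0, 1}, the three losses can be sketched as:

```python
import numpy as np

def misclassification_loss(p, y):
    return float((p >= 0.5) != y)                        # 1 if the thresholded prediction is wrong

def squared_loss(p, y):
    return (p - y) ** 2

def binary_cross_entropy(p, y):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```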