Presentation Transcript

Slide1

Introduction to Deep Learning: How to make your own deep learning framework

Ryota Tomioka (ryoto@microsoft.com)
MSR Summer School, 2 July 2018

Azure iPython Notebook: https://notebooks.azure.com/ryotat/libraries/DLTutorial

Slide2

Agenda

This lecture covers:

Introduction to machine learning (keywords: model, training, inference, stochastic gradient descent, overfitting)

How to compute the gradient (keywords: backpropagation, multi-layer perceptrons, activation function)

Slide3

What is Machine Learning (ML)?

The goal of ML is to learn from data. Technically, it combines statistics and computational tools (optimization).

Example (supervised learning) task: image recognition / classification (a model maps an image to "cat" or "dog").

Slide4

What is Machine Learning (ML)?

The goal of ML is to learn from data. Technically, it combines statistics and computational tools (optimization).

Example (supervised learning) task: speech recognition (a model maps an audio clip to the text "Hello").

Slide5

What is Machine Learning (ML)?

The goal of ML is to learn from data. Technically, it combines statistics and computational tools (optimization).

Example (supervised learning) task: machine translation (a model maps "How are you?" to "Wie geht's dir?").

Slide6

What is Machine Learning (ML)?

The goal of ML is to learn from data. Technically, it combines statistics and computational tools (optimization).

Example (supervised learning) task: conversational agent / chatbot (a model maps "How are you?" to "I am fine thank you").

Slide7

What is Machine Learning (ML)?

The goal of ML is to learn from data. Technically, it combines statistics and computational tools (optimization).

Example (supervised learning) tasks:

[Diagram: a single model box with the previous examples as inputs and outputs: an image to "cat" or "dog", "How are you?" to "I am fine thank you", "How are you?" to "Wie geht's dir?".]

Slide8

What is Machine Learning (ML)?

The goal of ML is to learn from data. Technically, it combines statistics and computational tools (optimization).

Example (supervised learning) tasks: image recognition ("cat" or "dog"), speech recognition ("Hello"), machine translation ("Hello" to "Bonjour").

Other forms of learning: unsupervised learning, reinforcement learning.

Slide9

Training and inference

Training: the loss tells what the output of the model should have been. The training objective can be overly optimistic (overfitting).

Inference (validation): we care about the performance in this setting; the parameters are frozen.

[Diagram: a training image labeled "cat" goes into the model; the model outputs 0.99 (cat) and 0.1 (dog), and the loss compares the prediction with the label.]

Slide10

Slide11

Learning objective

Objective: minimize loss(f_θ(x), y) for a randomly chosen (x, y) from some distribution D.

We don't know the distribution D. We only have access to (training) samples from D.

[Diagram: input x with its label 4; the model, with parameters θ, produces a prediction.]

Slide12

Training objective

Objective function: J(θ) = (1/n) Σ_i loss(f_θ(x_i), y_i)

Approximate the unknown distribution D with the training data average.

[Diagram: each training example (one with label 4, one with label 5, ...) contributes a loss term, and the terms are added up.]

Slide13

Mapping from input to prediction

input x (784 dim) → score z = Wx + b (10 dim) → Softmax → probability p (10 dim)

θ = (W, b): parameters

Slide14

Cross-entropy loss

Interpretation 1: you pay penalty −log p_y, the negative log-probability that the prediction assigns to the correct label y.

Interpretation 2: the Kullback-Leibler divergence between the target distribution, which puts all the probability mass on the correct label ('4'), and the prediction p.
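To make the softmax and the cross-entropy loss concrete, here is a minimal NumPy sketch (not the notebook's code; the function names are mine):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; exp would otherwise overflow.
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(p, y):
    # p: predicted probabilities (10 dim), y: index of the correct label.
    # Interpretation 1: pay the penalty -log p[y].
    return -np.log(p[y])

z = np.array([1.0, -2.0, 0.5, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # scores for 10 classes
p = softmax(z)
print(p.sum())               # sums to 1: a valid probability distribution
print(cross_entropy(p, 3))   # small, because most mass is on label 3
```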

Slide15

Landscape of training objective

Objective function J(θ) viewed as a landscape over the parameters θ.

[Figure: the loss surface, with the initial parameters and the final parameters marked.]

Slide16

Landscape of training objective

 

 

[Figure: the objective viewed in parameter space and in example space.]

Slide17

Landscape of training objective

 

 

[Figure: the objective viewed in parameter space and in example space.]

Slide18

Gradient descent

Initialize θ randomly.

For t in 0, …, T_maxiter:

θ ← θ − η ∇J(θ)

Here ∇J(θ) is the gradient of the objective and η is the learning rate (step size).

Computation of ∇J(θ) requires a full sweep over the training data.

Per-iteration comp. cost = O(n)
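A minimal NumPy sketch of this loop on a toy least-squares objective (the objective, data, and names are illustrative assumptions, not the slides' MNIST setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # n = 100 training examples, 5 features
y = X @ np.array([1., -2., 0.5, 0., 3.]) + 0.1 * rng.normal(size=100)

def grad_J(theta):
    # Gradient of J(theta) = (1/n) * sum_i 0.5 * (x_i . theta - y_i)^2
    # Requires a full sweep over the training data: O(n) per iteration.
    return X.T @ (X @ theta - y) / len(y)

theta = rng.normal(size=5)             # initialize randomly
eta = 0.1                              # learning rate (step size)
for t in range(500):                   # T_maxiter
    theta = theta - eta * grad_J(theta)
```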

 

Slide19

Stochastic gradient descent (SGD)

Initialize θ randomly.

For t in 0, …, T_maxiter:

θ ← θ − η ∇loss(f_θ(x_i), y_i), where index i is chosen randomly.

The stochastic gradient requires only one training example.

Per-iteration comp. cost = O(1)

Slide20

Minibatch stochastic gradient descent

Initialize θ randomly.

For t in 0, …, T_maxiter:

θ ← θ − η (1/|B|) Σ_{i ∈ B} ∇loss(f_θ(x_i), y_i), where minibatch B is chosen randomly.

The minibatch gradient is the average gradient over a random subset of the data of size B.

Per-iteration comp. cost = O(B)
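A minimal sketch of the minibatch loop, reusing the toy least-squares setup from the gradient-descent sketch above (again an assumption, not the slides' exact code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1., -2., 0.5, 0., 3.]) + 0.1 * rng.normal(size=1000)

theta = rng.normal(size=5)   # initialize randomly
eta, B = 0.1, 32             # learning rate and minibatch size
for t in range(2000):        # T_maxiter
    idx = rng.choice(len(y), size=B, replace=False)   # minibatch chosen randomly
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ theta - yb) / B               # average gradient over the minibatch: O(B)
    theta = theta - eta * grad
```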

 

Slide21

Overfitting – what is signal vs noise?

Powerful models are more likely to overfit.

We need validation data: leave out some portion of the training data to validate the generalizability of the model.

[Diagram: the labeled images (cat, dog) are split into training data and validation data.]

Slide22

Typical learning curve

[Figure: training loss and validation loss plotted against the number of training steps.]

Slide23

Techniques to reduce overfitting

Reduce the number of parameters.

Parameter sharing (convnets, recurrent neural nets).

Weight decay (aka L2 regularization): penalizes the magnitude of the parameters.

Early stopping: indirectly controls the magnitude of the parameters.

More recent techniques: dropout, batch normalization.

Slide24

Summary so far

A machine learning problem can be specified by:

Task: What's the input? What's the output?

Model: maps from the input to some numbers.

Loss function: measures how the model is doing.

Training: mini-batch SGD on the sum of empirical losses.

Validation: Are we overfitting?

Slide25

How do we compute the gradient?

Slide26

 

[Diagram: a function of x written as a computational graph with Square, Exp, and Add nodes.]

Slide27

 

[Diagram: a computational graph with Mul, Exp, and Add nodes.]

Slide28

How do we compute the gradient?

Manually: tedious (model specific), error prone, and not easy to explore new models.

Algorithmically: back-propagation. Allows researchers to focus on model building rather than implementing each model correctly.

Slide29

Back propagation for the linear predictor

Identify how each variable influences the loss.

[Diagram: computational graph x, w, b → z → p → Loss.]

Slide30

Back propagation for the linear predictor

Identify how each variable influences the loss.

[Diagram: the same graph x, w, b → z → p → Loss, annotated with the expressions for the gradient of the loss with respect to each variable.]

Slide31

Back propagation

Don't repeat shared compute. Propagate the gradients backward.

[Diagram: computational graph x, w, b → z → p → Loss, with the gradient flowing backward from the Loss.]

Slide32

Back propagation

Don't repeat shared compute. Propagate the gradients backward.

[Diagram: the same graph, where each backward step reuses the gradient already computed at the node downstream of it: dLoss/dp → dLoss/dz → dLoss/dw, dLoss/db.]
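A small NumPy sketch of these backward steps for the linear predictor with softmax and cross-entropy (a sketch following the standard derivation dLoss/dz = p − onehot(y); the variable names mirror the diagram, the code itself is mine):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.random.randn(784)        # input
y = 4                           # correct label
W = 0.01 * np.random.randn(10, 784)
b = np.zeros(10)

# Forward: x -> z -> p -> Loss
z = W @ x + b
p = softmax(z)
loss = -np.log(p[y])

# Backward: propagate gradients from the loss, reusing shared results.
dz = p.copy()
dz[y] -= 1.0                    # dLoss/dz = p - onehot(y)
dW = np.outer(dz, x)            # dLoss/dW
db = dz                         # dLoss/db
dx = W.T @ dz                   # dLoss/dx (not needed for training, shown for completeness)
```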

 

 

 

 

 

 

Slide33

Going deeper

Slide34

Going deeper

input x (784 dim) → Linear → pre-hidden h_in (1024 dim) → ReLU → hidden h (1024 dim) → Linear → score z (10 dim) → Softmax → probability p (10 dim)

θ = (W0, b0, W1, b1): parameters
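A minimal NumPy sketch of this deeper forward pass (dimensions as on the slide; the initialization scale and the variable names are my assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.random.rand(784)                  # input x, 784 dim (a flattened 28x28 image)
W0 = 0.01 * np.random.randn(1024, 784); b0 = np.zeros(1024)
W1 = 0.01 * np.random.randn(10, 1024);  b1 = np.zeros(10)

h_in = W0 @ x + b0                       # pre-hidden, 1024 dim
h = relu(h_in)                           # hidden, 1024 dim
z = W1 @ h + b1                          # score, 10 dim
p = softmax(z)                           # probability, 10 dim
```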

 

Slide35

Activation functions

Rectified Linear Unit (ReLU): relu(x) = max(0, x)

Hyperbolic tangent: tanh(x)

Sigmoid: σ(x) = 1 / (1 + exp(−x))
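For reference, the three activations and their derivatives in NumPy (a small sketch using the standard definitions):

```python
import numpy as np

def relu(x):        return np.maximum(0.0, x)
def relu_grad(x):   return (x > 0).astype(float)

def tanh(x):        return np.tanh(x)
def tanh_grad(x):   return 1.0 - np.tanh(x) ** 2

def sigmoid(x):     return 1.0 / (1.0 + np.exp(-x))
def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)
```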

 

Slide36

XOR problem

Slide37

Rectified linear unit (ReLU)

Slide38

Rectified linear unit (ReLU)

Slide39

Rectified linear unit (ReLU)

Slide40

Rectified linear unit (ReLU)

Slide41

Rectified linear unit (ReLU)

Slide42

Comparison to tanh

ReLU is "non-saturating".

[Figure: the tanh curve, where the gradient is almost zero in the flat regions, compared with ReLU, where the gradient can flow backwards.]

Slide43

Chain rule

Identify how each variable influences the loss.

[Diagram: computational graph x, (W0, b0) → h_in → h → (W1, b1) → z → p → Loss, annotated with the chain-rule factors for each link.]

Slide44

Back propagation

Don't repeat shared compute. Propagate the gradients backward.

[Diagram: the same graph x, (W0, b0) → Linear → h_in → ReLU → h → (W1, b1) → Linear → z → Softmax → p → Loss, with the gradient of the loss flowing backward through each block.]
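Continuing the forward-pass sketch from the "Going deeper" slide, a minimal NumPy sketch of the backward pass (my own code, following the standard chain rule for Linear, ReLU, and Softmax with cross-entropy):

```python
import numpy as np

def relu(x): return np.maximum(0.0, x)
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.random.rand(784); y = 4
W0 = 0.01 * np.random.randn(1024, 784); b0 = np.zeros(1024)
W1 = 0.01 * np.random.randn(10, 1024);  b1 = np.zeros(10)

# Forward
h_in = W0 @ x + b0
h = relu(h_in)
z = W1 @ h + b1
p = softmax(z)
loss = -np.log(p[y])

# Backward: reuse each gradient as it flows toward the inputs.
dz = p.copy(); dz[y] -= 1.0            # softmax + cross-entropy
dW1 = np.outer(dz, h); db1 = dz        # second Linear layer
dh = W1.T @ dz
dh_in = dh * (h_in > 0)                # ReLU gate
dW0 = np.outer(dh_in, x); db0 = dh_in  # first Linear layer
```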

Slide45

More complex example

[Diagram: a more complex computational graph built from the same Linear, ReLU, and Softmax blocks (variables x, W0, b0, h_in, h, z0, z, p, Loss), with the gradients again propagated backward through the graph.]

Slide46

Demo

Slide47

Deep learning frameworks

Collection of implementations of popular layers (or modules), e.g., ReLU, Softmax, Convolution, RNNs.

Provides an easy front-end to the layers/modules.

Handles different array libraries / hardware backends (CPUs, GPUs, …).

If there were an exchange format…
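To make the "layers as building blocks" idea concrete, here is a tiny sketch of how such a front-end might look; the class and method names (Layer, forward, backward) are illustrative assumptions, not any particular framework's API:

```python
import numpy as np

class Layer:
    """A module exposes a forward pass and a backward pass."""
    def forward(self, x): raise NotImplementedError
    def backward(self, grad_out): raise NotImplementedError

class Linear(Layer):
    def __init__(self, n_in, n_out):
        self.W = 0.01 * np.random.randn(n_out, n_in)
        self.b = np.zeros(n_out)
    def forward(self, x):
        self.x = x                      # cache the input for the backward pass
        return self.W @ x + self.b
    def backward(self, grad_out):
        self.dW = np.outer(grad_out, self.x)
        self.db = grad_out
        return self.W.T @ grad_out      # gradient w.r.t. the input

class ReLU(Layer):
    def forward(self, x):
        self.mask = x > 0
        return np.maximum(0.0, x)
    def backward(self, grad_out):
        return grad_out * self.mask

# Layers compose into a network; backpropagation walks the list in reverse.
net = [Linear(784, 1024), ReLU(), Linear(1024, 10)]
x = np.random.rand(784)
for layer in net:
    x = layer.forward(x)
grad = np.random.randn(10)              # stand-in for dLoss/dz
for layer in reversed(net):
    grad = layer.backward(grad)
```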

Slide48

Slide49

Books

Convex Optimization. Stephen Boyd and Lieven Vandenberghe. Cambridge University Press.

Information Theory, Inference and Learning Algorithms. David J. C. MacKay. 2003.

Neural Networks for Pattern Recognition. Christopher M. Bishop.

Slide50

Conclusion

Training a network consists of:

Forward propagation: computing the loss.

Backward propagation: computing the gradient.

Parameter update: move in the direction opposite to the computed stochastic gradient.

A fairly standard set of building blocks is used to build complex models: Linear, ReLU, Softmax, Tanh, …

Advanced topics

How to prevent overfitting

How to scale neural network training to multiple machines / devices

Slide51

Advanced topics

Slide52

Minibatch size and convergence speed

[Figure: error vs. number of samples processed during 1 pass over the dataset (1 epoch), for full-batch gradient descent, large-minibatch gradient descent, and small-minibatch gradient descent; markers indicate each parameter update.]

Slide53

Effect of minibatch size B

Learning rate scaled as a function of the minibatch size B.

Slide54

Dropout: randomly drops activations during training

Instance 1

Slide55

Dropout: randomly drops activations during training

Instance 2

Slide56

Dropout: randomly drops activations during training

Instance 3

Slide57

Dropout

Idea: randomly drop activations during training.

Benefit: reduces overfitting and improves generalization.

Can be implemented as a layer.
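A minimal sketch of dropout as a layer (this uses inverted dropout, which rescales at training time so inference needs no change; the keep probability and names are my assumptions):

```python
import numpy as np

class Dropout:
    def __init__(self, p_drop=0.5):
        self.p_drop = p_drop
    def forward(self, x, training=True):
        if not training:
            return x                      # inference: use all activations
        # Randomly drop activations; scale the survivors to keep the expected value.
        self.mask = (np.random.rand(*x.shape) >= self.p_drop) / (1.0 - self.p_drop)
        return x * self.mask
    def backward(self, grad_out):
        return grad_out * self.mask       # dropped units pass no gradient

h = np.random.randn(1024)
drop = Dropout(0.5)
h_train = drop.forward(h, training=True)   # a different random mask per instance
h_test = drop.forward(h, training=False)
```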

Slide58

Batch normalization

[Diagram: inputs scaled to [0, 1] (784 dim) feed a linear layer whose weights and biases are drawn from N(0, 1); each of the 1024 resulting activations is a random variable with bounded scale.]

Slide59

Batch normalization

[Diagram: the same setup continued over further layers, tracking the scale of the activations after each layer.]

Slide60

Batch normalization [Ioffe & Szegedy, 2015]

Idea: normalize the activation of each unit to have zero mean and unit standard deviation, using a mini-batch estimate of the mean and variance.

Benefit: more stable and faster training. Often generalizes better.

Can be implemented as a layer.
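A minimal sketch of the training-time computation as a layer (only the normalization with the mini-batch mean and variance plus the usual learned scale and shift; running statistics for inference are omitted; all names are mine):

```python
import numpy as np

class BatchNorm:
    def __init__(self, dim, eps=1e-5):
        self.gamma = np.ones(dim)    # learned scale
        self.beta = np.zeros(dim)    # learned shift
        self.eps = eps
    def forward(self, x):
        # x: minibatch of activations, shape (B, dim)
        mean = x.mean(axis=0)                           # mini-batch estimate of the mean
        var = x.var(axis=0)                             # mini-batch estimate of the variance
        x_hat = (x - mean) / np.sqrt(var + self.eps)    # zero mean, unit std per unit
        return self.gamma * x_hat + self.beta

bn = BatchNorm(1024)
h = np.random.randn(32, 1024)   # minibatch of 32 activation vectors
out = bn.forward(h)
```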

Slide61

More optimization algorithms

Momentum SGD: improves SGD by incorporating “momentum”
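A sketch of the momentum update (the standard formulation with a velocity buffer; the coefficient 0.9 and the toy gradient are my assumptions, not values from the slides):

```python
import numpy as np

theta = np.random.randn(5)     # parameters
v = np.zeros_like(theta)       # velocity (running accumulation of past gradients)
eta, mu = 0.1, 0.9             # learning rate and momentum coefficient

def grad(theta):               # stand-in for the minibatch gradient
    return theta               # gradient of 0.5 * ||theta||^2

for t in range(100):
    v = mu * v + grad(theta)   # accumulate momentum
    theta = theta - eta * v    # move along the smoothed direction
```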

Slide62

Adaptive optimization algorithms

Adam [Kingma & Ba, 2015]: uses first- and second-order statistics of the gradients so that the gradients are normalized.

Benefit: prevents the vanishing/exploding gradient problem.
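A sketch of the Adam update following Kingma & Ba (2015); the hyperparameter values are the commonly used defaults and the toy gradient is illustrative, not taken from the slides:

```python
import numpy as np

theta = np.random.randn(5)
m = np.zeros_like(theta)       # first-moment estimate of the gradient
v = np.zeros_like(theta)       # second-moment estimate of the gradient
eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

def grad(theta):               # stand-in for the minibatch gradient
    return theta

for t in range(1, 1001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)   # normalized step
```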

Slide63

Learning rate decay

Reduce the learning rate or step-size parameter (η) once in a while.

Typical setting: multiply η by 0.98 every epoch.
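As a sketch, the decay is a one-line change in the training loop (the 0.98 factor is the slide's typical setting; the surrounding loop is illustrative):

```python
eta = 0.1                      # initial learning rate
for epoch in range(50):
    # ... run one epoch of minibatch SGD with learning rate eta ...
    eta *= 0.98                # multiply eta by 0.98 every epoch
```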

Slide64

Summary

Training and inference.

Training objective and optimization.

Neural networks and backpropagation.

Importance of software tools: turns research into Lego-block engineering.

Various tricks to speed up training and reduce overfitting.

Slide65

Gradient explosion/diminishing problem

[Diagram: a deep chain of Linear + ReLU blocks; the gradient of the loss is propagated backward through every block in turn.]

Slide66

Gradient explosion/diminishing problem

[Diagram: the same deep chain with only Linear blocks; at every layer the backward pass multiplies the gradient by the weight W.]

Gradient is magnified or diminished by factor W at every layer.

If we have many layers, they can explode or diminish to zero.
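A tiny numerical illustration of this effect (an assumption-laden toy: one-dimensional "layers" whose weight is a scalar w, so the gradient scales like w to the power of the depth):

```python
# Backward pass through 50 one-dimensional linear layers: the gradient is
# multiplied by the weight w at every layer.
for w in (0.9, 1.1):
    grad = 1.0
    for layer in range(50):
        grad *= w
    print(w, grad)   # 0.9 -> about 0.005 (diminishes), 1.1 -> about 117 (explodes)
```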

Slide67

What is a model?

A model is a function specified by a set of parameters.

Example: linear predictor f(x) = w·x + b, with parameters w and b.

[Diagram: an input image is mapped by the model to the output 0.99.]

Slide68

What is a model?

A model is a function specified by a set of parameters.

Example: linear predictor

f(x) = w1 x1 + w2 x2 + w3 x3 + w4 x4 + w5 x5 + b (multiply each input xi by its weight wi, sum, and add the bias b)

parameters: w1, …, w5 and b

Slide69

Training and inference

Training: the loss tells what the output of the model should have been. The training objective can be overly optimistic (overfitting).

Inference (validation): we care about the performance in this setting; the parameters are frozen.

[Diagram: a training image labeled "cat" goes into the model; the model outputs 0.99 (cat) and 0.1 (dog), and the loss compares the prediction with the label.]

Slide70

Slide71

Loss functions for binary classification

Misclassification loss (−accuracy): 1 if the thresholded prediction disagrees with y, 0 otherwise.

Squared loss: (p − y)^2

Cross-entropy loss: −y log p − (1 − y) log(1 − p)

p = prediction, y = ground truth (0 or 1)
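A small NumPy sketch of these three losses (thresholding the prediction at 0.5 for the misclassification loss is my assumption):

```python
import numpy as np

def misclassification_loss(p, y):
    # 1 if the thresholded prediction disagrees with the label, else 0.
    return float((p >= 0.5) != bool(y))

def squared_loss(p, y):
    return (p - y) ** 2

def cross_entropy_loss(p, y):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

p, y = 0.8, 1   # p = prediction, y = ground truth (0 or 1)
print(misclassification_loss(p, y), squared_loss(p, y), cross_entropy_loss(p, y))
```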