Slide 1: CS 4/527: Artificial Intelligence
Deep Learning
Instructor: Jared Saia, University of New Mexico
[These slides created by Dan Klein, Pieter Abbeel, Anca Dragan, and Josh Hug for CS188 Intro to AI at UC Berkeley. All CS188 materials available at http://ai.berkeley.edu.]
Slide 2: Last Time: Linear Classifiers
Inputs are feature values
Each feature has a weight
Sum is the activation: $\text{activation}_w(x) = \sum_i w_i \cdot f_i(x) = w \cdot f(x)$
If the activation is:
Positive, output +1
Negative, output -1
[Diagram: a single perceptron: inputs f1, f2, f3 with weights w1, w2, w3 feeding a >0 threshold unit]
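A minimal sketch of this decision rule in Python; the function name and example numbers are illustrative, not from the slides:

```python
import numpy as np

def perceptron_classify(w, f):
    """Linear decision rule: output +1 if the activation w . f is positive, else -1."""
    activation = np.dot(w, f)
    return 1 if activation > 0 else -1

w = np.array([0.5, -1.0, 2.0])    # weights w1, w2, w3
f = np.array([1.0, 2.0, 0.5])     # feature values f1, f2, f3
print(perceptron_classify(w, f))  # activation = 0.5 - 2.0 + 1.0 = -0.5, so -1
```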
Slide 3: Non-Linearity
Slide 4: Non-Linear Separators
Data that is linearly separable works out great for linear decision rules:
But what are we going to do if the dataset is just too hard?
How about… mapping data to a higher-dimensional space:
[Figure: 1-d data around 0 that no threshold on x can separate becomes linearly separable after mapping x to x^2]
This and next slide adapted from Ray Mooney, UT
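A small sketch of the idea, assuming the classic x -> (x, x^2) map; the data and labels here are made up for illustration:

```python
import numpy as np

# 1-d points labeled by whether |x| > 1: not linearly separable on the line,
# but separable after the feature map x -> (x, x^2).
xs = np.array([-2.0, -0.5, 0.5, 2.0])
ys = np.array([1, -1, -1, 1])

phi = np.stack([xs, xs**2], axis=1)  # lift to 2-d: (x, x^2)

# In the lifted space, the rule "x^2 > 1" is linear: w = (0, 1), bias -1.
w, b = np.array([0.0, 1.0]), -1.0
preds = np.where(phi @ w + b > 0, 1, -1)
print(preds)  # [ 1 -1 -1  1], matching ys
```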
Slide 5: Computer Vision
Slide 6: Object Detection
Slide 7: Manual Feature Design
Slide 8: Features and Generalization
[Dalal and Triggs, 2005]
Slide 9: Features and Generalization
[Figure: an image and its HoG (histogram of oriented gradients) representation]
Slide 10: Manual Feature Design -> Deep Learning
Manual feature design requires:
Domain-specific expertise
Domain-specific effort
What if we could learn the features, too? -> Deep Learning
Slide 11: Perceptron
[Diagram: the single perceptron again: inputs f1, f2, f3, weights w1, w2, w3, threshold >0]
Slide 12: Two-Layer Perceptron Network
[Diagram: inputs f1, f2, f3 feed three hidden >0 threshold units through weights w11..w33; the hidden outputs feed a final >0 unit through weights w1, w2, w3]
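A sketch of the forward pass this picture describes; the shapes and the random weights are assumptions for illustration:

```python
import numpy as np

def step(z):
    """Hard threshold: +1 where z > 0, else -1 (the >0? units in the diagram)."""
    return np.where(z > 0, 1.0, -1.0)

def two_layer_perceptron(f, W1, w2):
    """A layer of hidden threshold units, then one output threshold unit."""
    h = step(W1 @ f)     # hidden activations, shape (3,)
    return step(w2 @ h)  # final +1/-1 decision

f = np.array([1.0, -2.0, 0.5])
W1 = np.random.randn(3, 3)  # weights w_ij into the hidden units
w2 = np.random.randn(3)     # weights w1..w3 into the output unit
print(two_layer_perceptron(f, W1, w2))
```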
Slide 13: N-Layer Perceptron Network
[Diagram: inputs f1, f2, f3 pass through many stacked layers of >0 threshold units, ending in a single >0 output unit]
Slide 14: Local Search
Simple, general idea:
Start wherever
Repeat: move to the best neighboring state
If no neighbors better than current, quit
Neighbors = small perturbations of w (see the sketch after this list)
Properties:
Plateaus and local optima
How to escape plateaus and find a good local optimum?
How to deal with very large parameter vectors? E.g.,
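The sketch referenced above, assuming a generic objective function and Gaussian perturbations as the neighbor set:

```python
import numpy as np

def local_search(objective, w, step=0.1, iters=1000, n_neighbors=20):
    """Hill climbing on w: try random small perturbations, keep the best.
    The objective, step size, and neighbor count are all assumptions."""
    best = objective(w)
    for _ in range(iters):
        candidates = [w + step * np.random.randn(*w.shape) for _ in range(n_neighbors)]
        scores = [objective(c) for c in candidates]
        i = int(np.argmax(scores))
        if scores[i] <= best:  # no neighbor is better: quit (plateau or local optimum)
            break
        w, best = candidates[i], scores[i]
    return w
```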
Slide 15: Perceptron: Accuracy via Zero-One Loss
Objective: Classification Accuracy
[Diagram: the single-perceptron network from Slide 11]
Slide 16: Perceptron: Accuracy via Zero-One Loss
Objective: Classification Accuracy
Issue: many plateaus; how to measure incremental progress?
[Diagram: the single-perceptron network from Slide 11]
Slide 17: Soft-Max
Score for $y=+1$: $z_{+1} = w \cdot f(x)$. Score for $y=-1$: $z_{-1} = -w \cdot f(x)$
Probability of label: $P(y \mid x; w) = \dfrac{e^{z_y}}{e^{z_{+1}} + e^{z_{-1}}}$
Slide 18: Soft-Max
Score for $y=+1$: $z_{+1} = w \cdot f(x)$. Score for $y=-1$: $z_{-1} = -w \cdot f(x)$
Probability of label: $P(y \mid x; w) = \dfrac{e^{z_y}}{e^{z_{+1}} + e^{z_{-1}}}$
Objective: maximize the likelihood of the training labels. Log: $ll(w) = \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$
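A small sketch of the two-class soft-max above; the helper name and example vectors are illustrative:

```python
import numpy as np

def label_probability(w, f, y):
    """Soft-max probability of label y in {+1, -1} with scores z_y = y * (w . f)."""
    z = np.dot(w, f)
    scores = {1: z, -1: -z}
    denom = np.exp(scores[1]) + np.exp(scores[-1])
    return np.exp(scores[y]) / denom

w = np.array([0.5, -1.0, 2.0])
f = np.array([1.0, 2.0, 0.5])
print(label_probability(w, f, 1))   # equals sigmoid(2 * w.f)
print(label_probability(w, f, -1))  # the two probabilities sum to 1
```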
Slide 19: Two-Layer Neural Network
[Diagram: same structure as the two-layer perceptron network of Slide 12, except the output unit has no >0 threshold; it emits the raw score, which feeds the soft-max]
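A sketch contrasting this with Slide 12's network: the hidden >0 units remain, but the output is a raw score fed into the two-class soft-max (shapes assumed):

```python
import numpy as np

def two_layer_net(f, W1, w2):
    """Hidden threshold units as before, but the output unit emits its raw score."""
    h = np.where(W1 @ f > 0, 1.0, -1.0)  # hidden >0? units
    z = w2 @ h                           # raw output score: no threshold
    return 1.0 / (1.0 + np.exp(-2 * z))  # soft-max probability of y = +1
```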
Slide 20: N-Layer Neural Network
[Diagram: same structure as the N-layer perceptron network of Slide 13, with the final threshold removed so the network outputs a raw score]
Slide 21: Our Status
Our objective $ll(w)$:
Changes smoothly with changes in w
Doesn't suffer from the same plateaus as the perceptron network
Challenge: how to find a good w?
Equivalently: maximizing $ll(w)$ is the same as minimizing $-ll(w)$
Slide 22: 1-D Optimization
Could evaluate $g(w_0 + h)$ and $g(w_0 - h)$
Then step in best direction
Or, evaluate derivative: $\dfrac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \dfrac{g(w_0 + h) - g(w_0 - h)}{2h}$
Which tells which direction to step in
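A sketch of both options on a toy 1-d objective; the function, step sizes, and iteration count are assumptions:

```python
def numerical_derivative(g, w0, h=1e-6):
    """Central-difference estimate of dg/dw at w0."""
    return (g(w0 + h) - g(w0 - h)) / (2 * h)

g = lambda w: -(w - 3.0) ** 2  # toy 1-d objective with its peak at w = 3
w = 0.0
for _ in range(100):
    w += 0.1 * numerical_derivative(g, w)  # step in the uphill direction
print(round(w, 3))  # approaches 3.0
```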
Slide 23: 2-D Optimization
Source: Thomas Jungblut’s Blog
Slide 24: Steepest Descent
Idea:
Start somewhere
Repeat: take a step in the steepest descent direction
Figure source: Mathworks
Slide 25: Steepest Direction
Steepest direction = direction of the gradient: $\nabla g = \left[ \dfrac{\partial g}{\partial w_1}, \dfrac{\partial g}{\partial w_2}, \dots \right]$
Slide 26: Optimization Procedure: Gradient Descent
Init: pick a starting $w$
For i = 1, 2, …: $w \leftarrow w - \alpha \, \nabla_w g(w)$, where $\alpha$ is the step size (learning rate)
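A minimal gradient descent loop, assuming the gradient function, learning rate, and toy objective shown:

```python
import numpy as np

def gradient_descent(grad_g, w, alpha=0.1, iters=1000):
    """Basic gradient descent: repeatedly step against the gradient."""
    for _ in range(iters):
        w = w - alpha * grad_g(w)
    return w

# Toy quadratic g(w) = ||w - c||^2 with gradient 2(w - c); its minimum is at c.
c = np.array([1.0, -2.0])
print(gradient_descent(lambda w: 2 * (w - c), np.zeros(2)))  # ~ [ 1. -2.]
```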
Slide 27: Computing Gradients
Gradient = sum of gradients for each term: $\nabla ll(w) = \sum_i \nabla \log P(y^{(i)} \mid x^{(i)}; w)$
How do we compute the gradient for one term at the current weights?
Slide 28: N-Layer Neural Network
[Diagram: the N-layer network from Slide 20 again]
Slide 29: Represent as Computational Graph
[Diagram: computational graph: inputs x and W enter a * node producing s (scores); s enters a hinge-loss node; W also enters a regularizer R; the hinge loss and R enter a + node producing the total loss L]
Slides 30-41: A worked example
e.g. $f(x, y, z) = (x + y)\,z$, with $x = -2$, $y = 5$, $z = -4$
Forward pass: $q = x + y = 3$, then $f = q z = -12$
Want: $\dfrac{\partial f}{\partial x}$, $\dfrac{\partial f}{\partial y}$, $\dfrac{\partial f}{\partial z}$
Local gradients: $\dfrac{\partial f}{\partial q} = z = -4$, $\dfrac{\partial f}{\partial z} = q = 3$, $\dfrac{\partial q}{\partial x} = 1$, $\dfrac{\partial q}{\partial y} = 1$
Chain rule: $\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial q} \cdot \dfrac{\partial q}{\partial x} = -4$; likewise $\dfrac{\partial f}{\partial y} = -4$ and $\dfrac{\partial f}{\partial z} = 3$
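The same walkthrough as straight-line Python; the variable names are illustrative:

```python
# Backprop on f(x, y, z) = (x + y) * z at x = -2, y = 5, z = -4.
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y  # q = 3
f = q * z  # f = -12

# Backward pass: chain local gradients from the output back to the inputs
df_dq = z          # -4
df_dz = q          #  3
df_dx = df_dq * 1  # dq/dx = 1, so -4
df_dy = df_dq * 1  # dq/dy = 1, so -4
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```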
Slides 42-47: Backprop at a single gate
[Diagram: a gate f receives input activations and produces an output; during backprop, the gradient arriving at the gate's output is multiplied by the gate's "local gradient" to give the gradients on its inputs]
Slides 48-61: Another example
$f(w, x) = \dfrac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$, with $w_0 = 2$, $x_0 = -1$, $w_1 = -3$, $x_1 = -2$, $w_2 = -3$
Forward pass: $w_0 x_0 + w_1 x_1 + w_2 = -2 + 6 - 3 = 1$; then $-1 \to e^{-1} = 0.37 \to 1.37 \to 1/1.37 = 0.73$
Backward pass, gate by gate ([local gradient] x [its gradient]):
$1/x$ gate: $(-1/1.37^2) \times 1 = -0.53$
$+1$ gate: $1 \times (-0.53) = -0.53$
$e^x$ gate: $e^{-1} \times (-0.53) = -0.20$
$\times(-1)$ gate: (-1) * (-0.20) = 0.20
$+$ gate: [1] x [0.2] = 0.2 (both inputs!)
$\times$ gates: x0: [2] x [0.2] = 0.4; w0: [-1] x [0.2] = -0.2; and similarly for w1, x1, w2
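The same computation gate by gate in Python; the intermediate names are illustrative:

```python
import math

# Forward pass for f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2))), gate by gate
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0
dot = w0 * x0 + w1 * x1 + w2  # 1.0
neg = -dot                    # -1.0
e = math.exp(neg)             # 0.37
denom = 1.0 + e               # 1.37
f = 1.0 / denom               # 0.73

# Backward pass: multiply each local gradient by the gradient from above
ddenom = -1.0 / denom**2         # 1/x gate:   -0.53
de = 1.0 * ddenom                # +1 gate:    -0.53
dneg = e * de                    # exp gate:   -0.20
ddot = -1.0 * dneg               # *(-1) gate:  0.20
dw0, dx0 = x0 * ddot, w0 * ddot  # -0.2, 0.4
dw1, dx1 = x1 * ddot, w1 * ddot  # -0.4, -0.6
dw2 = ddot                       #  0.2
print(round(f, 2), round(dw0, 2), round(dx0, 2))  # 0.73 -0.2 0.39
```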
Slide 62
sigmoid function: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$, with derivative $\dfrac{d\sigma(x)}{dx} = (1 - \sigma(x))\,\sigma(x)$
sigmoid gate: the chain of gates $\times(-1)$, $e^x$, $+1$, $1/x$ above is exactly a sigmoid, so it can be treated as one gate
Slide 63
sigmoid gate: backprop in a single step using the local gradient $(1 - \sigma)\,\sigma$:
(0.73) * (1 - 0.73) = 0.2
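A one-step version using the sigmoid shortcut; the values match the slide:

```python
import math

def sigmoid(x):
    """Sigmoid function: squashes x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Backprop through the whole sigmoid gate at once via d(sigma)/dx = (1 - sigma) * sigma
out = sigmoid(1.0)                      # forward pass: 0.73
local_grad = (1 - out) * out            # 0.73 * 0.27 = 0.2
upstream = 1.0                          # gradient arriving from above
print(round(local_grad * upstream, 2))  # 0.2, matching the four-gate computation
```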
Slide 64: Mini-Batches and Stochastic Gradient Descent
Typical objective: $g(w) = \dfrac{1}{N} \sum_{i=1}^{N} \log P(y^{(i)} \mid x^{(i)}; w)$ = average log-likelihood of label given input
$\approx \dfrac{1}{k} \sum_{i=1}^{k} \log P(y^{(i)} \mid x^{(i)}; w)$ = estimate based on mini-batch 1…k
Mini-batch gradient descent: compute the gradient on a mini-batch (+ cycle over mini-batches: 1..k, k+1…2k, …; make sure to randomize the permutation of the data!)
Stochastic gradient descent: k = 1
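A mini-batch SGD sketch under these definitions; grad_one, the learning rate, batch size, and epoch count are all assumptions:

```python
import numpy as np

def minibatch_sgd(grad_one, data, w, alpha=0.01, k=32, epochs=10):
    """Cycle over random mini-batches of size k, stepping along each batch's
    average gradient; grad_one(w, example) returns the gradient of one term.
    Setting k = 1 gives stochastic gradient descent."""
    data = list(data)
    for _ in range(epochs):
        np.random.shuffle(data)  # randomize the permutation each pass
        for start in range(0, len(data), k):
            batch = data[start:start + k]
            g = np.mean([grad_one(w, ex) for ex in batch], axis=0)
            w = w + alpha * g    # ascent on log-likelihood = descent on -ll
    return w
```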