
Presentation Transcript

Slide1

CS 4/527: Artificial Intelligence

Deep Learning

Instructor:

Jared Saia, University of New Mexico
[These slides created by Dan Klein, Pieter Abbeel, Anca Dragan, and Josh Hug for CS188 Intro to AI at UC Berkeley. All CS188 materials available at http://ai.berkeley.edu.]

Slide2

Last Time: Linear Classifiers

Inputs are feature values
Each feature has a weight
Sum is the activation
If the activation is:
Positive, output +1
Negative, output -1

[Diagram: a single unit; feature values f1, f2, f3 are multiplied by weights w1, w2, w3, summed into the activation, and thresholded (>0?)]
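As a minimal sketch of this decision rule (the function names and example numbers are illustrative, not from the slides):

    # Binary linear classifier: weighted sum of features, thresholded at zero.
    def activation(weights, features):
        # the activation is the sum of feature values times their weights
        return sum(w * f for w, f in zip(weights, features))

    def classify(weights, features):
        # positive activation -> +1, otherwise -1
        return +1 if activation(weights, features) > 0 else -1

    # three features f1, f2, f3 with weights w1, w2, w3
    print(classify([0.5, -1.0, 2.0], [1.0, 2.0, 0.5]))   # activation = -0.5, so -1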

Slide3

Non-Linearity

Slide4

Non-Linear Separators

Data that is linearly separable works out great for linear decision rules:

But what are we going to do if the dataset is just too hard?

How about… mapping data to a higher-dimensional space:

[Figure: 1-D data on an x axis that is not linearly separable becomes separable after mapping each point x to (x, x^2)]

This and next slide adapted from Ray Mooney, UT
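A quick sketch of the idea, assuming the classic one-dimensional example where each point x is mapped to (x, x^2) (the exact mapping on the slide is an assumption here):

    # Lift 1-D points into 2-D so a linear separator can work.
    def phi(x):
        return (x, x ** 2)

    # suppose the outer points are one class and the inner points the other:
    # no single threshold on x separates them
    points = [-3.0, -0.5, 0.0, 0.5, 3.0]
    lifted = [phi(x) for x in points]
    print([x2 > 1.0 for (x1, x2) in lifted])  # the line x2 = 1 separates them in 2-D
    # [True, False, False, False, True]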

Slide5

Computer Vision

Slide6

Object Detection

Slide7

Manual Feature Design

Slide8

Features and Generalization

[Dalal and Triggs, 2005]

Slide9

Features and Generalization

[Figure: an example image alongside its HoG (Histogram of Oriented Gradients) feature representation]

Slide10

Manual Feature Design -> Deep Learning

Manual feature design requires:

Domain-specific expertise
Domain-specific effort
What if we could learn the features, too? -> Deep Learning

Slide11

Perceptron

[Diagram: the single perceptron unit: features f1, f2, f3, weights w1, w2, w3, threshold (>0?)]

Slide12

Two-Layer Perceptron Network

[Diagram: features f1, f2, f3 feed three hidden threshold units (>0?) through weights w11 through w33; the hidden outputs feed a final threshold unit (>0?) through weights w1, w2, w3]
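As a sketch of how such a network computes its output (layer sizes, weights, and function names are illustrative, not taken from the slides):

    # Two-layer perceptron network: hidden threshold units feeding an output threshold unit.
    def step(z):
        return 1.0 if z > 0 else 0.0   # the ">0?" unit

    def two_layer_forward(features, hidden_weights, output_weights):
        # hidden_weights holds one weight vector per hidden unit
        hidden = [step(sum(w * f for w, f in zip(ws, features))) for ws in hidden_weights]
        # the output unit thresholds a weighted sum of the hidden activations
        return step(sum(w * h for w, h in zip(output_weights, hidden)))

    print(two_layer_forward([1.0, -2.0, 0.5],
                            [[0.2, 0.4, -0.1], [1.0, 1.0, 1.0], [-0.5, 0.1, 0.3]],
                            [1.0, -2.0, 0.5]))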

Slide13

N-Layer Perceptron Network

[Diagram: features f1, f2, f3 pass through several fully connected layers of threshold units (>0?)]

Slide14

Local Search

Simple, general idea:

Start wherever

Repeat: move to the best neighboring state
If no neighbors better than current, quit
Neighbors = small perturbations of w
Properties:
Plateaus and local optima

How to escape plateaus and find a good local optimum?

How to deal with very large parameter vectors?
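A small sketch of this kind of local search over a weight vector, using random perturbations of w as the neighborhood (the objective g, step size, and neighbor count are placeholders):

    import random

    # Hill climbing: repeatedly move to the best random perturbation of w,
    # stopping when no neighbor improves the objective g (plateau / local optimum).
    def hill_climb(g, w, step=0.1, n_neighbors=20, n_iters=100):
        for _ in range(n_iters):
            neighbors = [[wi + random.uniform(-step, step) for wi in w]
                         for _ in range(n_neighbors)]
            best = max(neighbors, key=g)
            if g(best) <= g(w):
                break
            w = best
        return w

    # example: maximize a simple concave objective with optimum at w = (3, -1)
    print(hill_climb(lambda w: -(w[0] - 3) ** 2 - (w[1] + 1) ** 2, [0.0, 0.0]))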

Slide15

Perceptron: Accuracy via Zero-One Loss

Objective: Classification Accuracy

[Diagram: the single perceptron unit: features f1, f2, f3, weights w1, w2, w3, threshold (>0?)]

Slide16

Perceptron: Accuracy via Zero-One Loss

Objective: Classification Accuracy

Issue: many plateaus; how to measure incremental progress?

[Diagram: the single perceptron unit: features f1, f2, f3, weights w1, w2, w3, threshold (>0?)]

Slide17

Soft-Max

Score for y=1: Score for y=-1:

Probability of label:

Slide18

Soft-Max

Score for y=1: Score for y=-1:

Probability of label:

Objective: the probability of the training labels.  Log: the sum of log-probabilities (the log-likelihood).
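As a concrete sketch, assuming the two scores are the activation z = w · f(x) and its negation (a standard binary soft-max setup; the slide's exact notation may differ):

    import math

    # Binary soft-max: P(y | x; w) = exp(score_y) / (exp(z) + exp(-z)),
    # where z = w . f(x), score(+1) = z and score(-1) = -z (an assumption here).
    def prob_label(weights, features, y):
        z = sum(w * f for w, f in zip(weights, features))   # the activation
        score = z if y == +1 else -z
        return math.exp(score) / (math.exp(z) + math.exp(-z))

    # Objective: the log-likelihood of the training labels (to be maximized).
    def log_likelihood(weights, data):
        return sum(math.log(prob_label(weights, f, y)) for f, y in data)

    data = [([1.0, 0.0], +1), ([0.0, 1.0], -1)]
    print(log_likelihood([2.0, -2.0], data))   # close to 0, i.e. high probability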

Slide19

Two-Layer Neural Network

[Diagram: features f1, f2, f3 feed three hidden threshold units (>0?) through weights w11 through w33; the hidden outputs are combined with weights w1, w2, w3 into a real-valued output score (no threshold on the output)]

Slide20

N-Layer Neural Network

[Diagram: features f1, f2, f3 pass through several fully connected layers of units (>0?)]

Slide21

Our Status

Our objective:
Changes smoothly with changes in w
Doesn't suffer from the same plateaus as the perceptron network
Challenge: how to find a good w?
Equivalently: find the w that maximizes the (log-)likelihood of the training labels

Slide22

1-d optimization

Could evaluate the objective at w + h and at w - h (for a small h)
Then step in the best direction
Or, evaluate the derivative, which tells which direction to step in
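A tiny sketch of both options for a 1-d objective (the function g, the step size, and h are placeholders):

    # 1-d optimization: compare "evaluate nearby points" with "evaluate the derivative".
    def g(w):                                   # placeholder objective to increase
        return -(w - 3.0) ** 2

    def derivative(g, w, h=1e-5):
        # central finite difference: (g(w + h) - g(w - h)) / (2h)
        return (g(w + h) - g(w - h)) / (2 * h)

    w, step = 0.0, 0.1
    # option 1: evaluate g on both sides of w and move toward the larger value
    w1 = w + step if g(w + step) > g(w - step) else w - step
    # option 2: the sign of the derivative gives the same direction directly
    w2 = w + step if derivative(g, w) > 0 else w - step
    print(w1, w2)   # both move toward the optimum at w = 3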

Slide23

2-D Optimization

Source: Thomas Jungblut’s Blog

Slide24

Steepest Descent

Idea:

Start somewhere

Repeat: take a step in the steepest descent direction

Figure source: Mathworks

Slide25

Steepest Direction

Steepest Direction = direction of the gradient

Slide26

Optimization Procedure: Gradient Descent

Init: pick a starting w (e.g., at random)
For i = 1, 2, …:
    w <- w - alpha * (gradient of the objective at w)    [alpha = learning rate]
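A minimal sketch of this loop in code (the gradient function, learning rate, and iteration count are placeholders):

    # Gradient descent: repeatedly step against the gradient of the objective.
    def gradient_descent(gradient_fn, w, alpha=0.1, n_iters=100):
        for _ in range(n_iters):
            grad = gradient_fn(w)                           # gradient at the current w
            w = [wi - alpha * gi for wi, gi in zip(w, grad)]
        return w

    # example: minimize (w0 - 1)^2 + (w1 + 2)^2, gradient (2(w0 - 1), 2(w1 + 2))
    print(gradient_descent(lambda w: [2 * (w[0] - 1), 2 * (w[1] + 2)], [0.0, 0.0]))
    # converges toward [1, -2]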

  

Slide27

Computing Gradients

Gradient = sum of gradients for each term

How do we compute the gradient for one term at the current weights?

Slide28

N-Layer Neural Network

[Diagram: the N-layer network from before: features f1, f2, f3 pass through several layers of units (>0?)]

Slide29

Represent as Computational Graph

[Graph: inputs x and W feed a multiplication node (*) producing the scores s; the scores feed a hinge loss node; a regularization term R is added (+) to give the total loss L]
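A sketch of the forward pass this graph describes, with a multiclass hinge loss and an L2 regularizer standing in for the details the figure leaves unspecified:

    import numpy as np

    # Forward pass: s = W x (the * node), L = hinge_loss(s, y) + R(W) (the + node).
    def forward(W, x, y, reg=0.1):
        s = W @ x                                       # scores
        margins = np.maximum(0.0, s - s[y] + 1.0)       # multiclass hinge loss
        margins[y] = 0.0
        data_loss = margins.sum()
        R = reg * np.sum(W * W)                         # regularization term
        return data_loss + R                            # total loss L

    W = np.array([[0.2, -0.5], [1.0, 0.3], [-0.3, 0.8]])
    x = np.array([1.0, 2.0])
    print(forward(W, x, y=1))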

Slides 30-41

e.g. x = -2, y = 5, z = -4

Want: the gradient of the output with respect to x, y, and z

[Repeated over several slides: the forward pass fills in the value at each node of a small computational graph, then the backward pass fills in one gradient at a time, applying the chain rule at each intermediate node.]
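As a concrete version of this kind of worked example, here is the textbook case f(x, y, z) = (x + y) * z with the same input values (the exact expression on the slides is an assumption):

    # Tiny computational graph: q = x + y, f = q * z.
    x, y, z = -2.0, 5.0, -4.0

    # forward pass
    q = x + y              # q = 3
    f = q * z              # f = -12

    # backward pass (chain rule, one node at a time)
    df_dq = z              # local gradient of * with respect to q
    df_dz = q              # local gradient of * with respect to z
    df_dx = df_dq * 1.0    # the + node passes the gradient through to x ...
    df_dy = df_dq * 1.0    # ... and to y
    print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0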

Slides 42-47

[Diagram, built up over several slides: a single gate f receives input activations and produces an output activation on the forward pass; on the backward pass, the gradient arriving from the output side is multiplied by the gate's "local gradient" with respect to each input, and the results are passed back as the gradients for those inputs.]
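In code, that forward/backward pattern for a single gate might look like the following (a multiply gate is used as the example; the class and method names are illustrative):

    # A gate caches its input activations on the forward pass, then multiplies the
    # incoming gradient by its local gradients on the backward pass.
    class MultiplyGate:
        def forward(self, a, b):
            self.a, self.b = a, b              # cache activations
            return a * b

        def backward(self, grad_out):
            # local gradient of a*b is b with respect to a, and a with respect to b
            return grad_out * self.b, grad_out * self.a

    gate = MultiplyGate()
    out = gate.forward(3.0, -4.0)              # forward activation: -12.0
    grad_a, grad_b = gate.backward(1.0)        # gradients passed back: (-4.0, 3.0)
    print(out, grad_a, grad_b)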

Slides 48-61

Another example: a larger expression is worked through gate by gate, applying [local gradient] x [gradient from above] at each step of the backward pass. Some of the individual steps: one gate computes (-1) * (-0.20) = 0.20; a gate whose local gradient is 1 passes [1] x [0.2] = 0.2 to both of its inputs; at a multiplication gate, x0 gets [2] x [0.2] = 0.4 and w0 gets [-1] x [0.2] = -0.2.

Slides 62-63

sigmoid function: sigma(x) = 1 / (1 + e^(-x)), with derivative dsigma/dx = (1 - sigma(x)) * sigma(x)

sigmoid gate: the whole sigmoid expression can be treated as a single gate with that simple local gradient, e.g. (0.73) * (1 - 0.73) = 0.2
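A sketch of the sigmoid gate's forward and backward steps, reproducing the 0.73 and 0.2 values above (the gate interface is illustrative):

    import math

    # Sigmoid treated as a single gate: local gradient (1 - sigma) * sigma.
    class SigmoidGate:
        def forward(self, x):
            self.out = 1.0 / (1.0 + math.exp(-x))
            return self.out

        def backward(self, grad_out):
            return grad_out * (1.0 - self.out) * self.out

    gate = SigmoidGate()
    print(round(gate.forward(1.0), 2))     # 0.73
    print(round(gate.backward(1.0), 2))    # 0.2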

Slide64

Mini-batches and Stochastic Gradient Descent

Typical objective: the average log-likelihood of the label given the input, (1/N) * sum_i log P(y(i) | x(i); w), maximized over w

Estimate based on mini-batch 1…k: (1/k) * sum over the k examples in the mini-batch of log P(y(i) | x(i); w)

Mini-batch gradient descent: compute the gradient on one mini-batch, then cycle over mini-batches (1..k, k+1…2k, …); make sure to randomize the permutation of the data!

Stochastic gradient descent: k = 1
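A closing sketch of the mini-batch SGD loop described above (grad_fn, the learning rate, and the toy data are placeholders):

    import random

    # Mini-batch stochastic gradient descent: shuffle the data each epoch, then
    # take a gradient step per mini-batch. With k = 1 this is plain SGD.
    def sgd(grad_fn, w, data, k=8, alpha=0.1, n_epochs=100):
        for _ in range(n_epochs):
            random.shuffle(data)                     # randomize the permutation of the data
            for start in range(0, len(data), k):     # cycle over mini-batches 1..k, k+1..2k, ...
                batch = data[start:start + k]
                w = [wi - alpha * gi for wi, gi in zip(w, grad_fn(w, batch))]
        return w

    # example: fit y = a*x + b by least squares on toy data
    data = [(i / 100.0, 2.0 * (i / 100.0) + 1.0) for i in range(100)]
    def grad_fn(w, batch):
        a, b = w
        ga = sum(2 * (a * x + b - y) * x for x, y in batch) / len(batch)
        gb = sum(2 * (a * x + b - y) for x, y in batch) / len(batch)
        return [ga, gb]
    print(sgd(grad_fn, [0.0, 0.0], data))            # approaches [2.0, 1.0]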