Presentation Transcript

Slide1

CS440/ECE448 Lecture 8: Logistic Regression

Mark Hasegawa-Johnson, 2/2022. License: CC-BY 4.0

Slide2

Outline

One-hot vectors: rewriting the perceptron to look like linear regression
Softmax: Soft category boundaries
Cross-entropy = negative log probability of the training data
Stochastic gradient descent for logistic regression

Slide3

Comparison of Multi-Class Perceptron to Multiple Regression

[Figure: two block diagrams. Left, the multi-class perceptron: inputs 1, x1, x2, …, xD are multiplied by weight vectors, and an argmax over the class scores produces the output. Right, multiple regression: the same inputs and weights, but the outputs are the real-valued scores themselves.]

Slide4

Here's a weird question: can we come up with some new notation that can be used to write both the multi-class perceptron AND the linear regression algorithm?

Slide5

Comparison of multi-class perceptron and multiple linear regression

Multi-class perceptron:
For c = y (the class that should have been the output but wasn't): w_c ← w_c + ηx
For c = ŷ (the class that shouldn't have been the output but was): w_c ← w_c − ηx
For all other classes c: w_c is unchanged

Multiple linear regression:
For all classes, 1 ≤ c ≤ V: w_c ← w_c − η(w_c^T x − y_c)x

Slide6

New notation: Don’t change the multi-class perceptron algorithm, but make it easier to write

Instead of defining the label y as an integer, let's define it to be a vector: y = [y_1, …, y_V]^T.
For a multi-class perceptron, this only makes sense if y is what's called a one-hot vector: y_c = 1 if c is the correct class, and y_c = 0 otherwise.

Slide7

New notation: Don’t change the multi-class perceptron algorithm, but make it easier to write

Let's also define the output to be a one-hot vector: ŷ = [ŷ_1, …, ŷ_V]^T, where ŷ_c = 1 if c = argmax_{c'} w_{c'}^T x, and ŷ_c = 0 otherwise.
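To make the notation concrete, here is a minimal NumPy sketch (the function name one_hot and the example values are my own, not from the slides) of turning an integer class label into a one-hot vector:

```python
import numpy as np

def one_hot(label, num_classes):
    """Return a length-V vector with a 1 in position `label` and 0 elsewhere."""
    # Note: classes are indexed 0..V-1 here, rather than 1..V as on the slides.
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

print(one_hot(2, 5))   # [0. 0. 1. 0. 0.]
```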

 

Slide8

Example: Binary classifier

Consider the classifier ŷ_c = 1 if c = argmax_{c'} w_{c'}^T x (and ŷ_c = 0 otherwise), with only two classes. Then the classification regions might look like this:
[Figure: the input plane divided into two regions by a linear decision boundary, one region per class.]

Slide9

Multi-Class Linear Classifiers

Consider the classifier ŷ_c = 1 if c = argmax_{c'} w_{c'}^T x (and ŷ_c = 0 otherwise), with 20 classes. Then some of the classifications might look like this:
[Figure: the plane partitioned into 20 convex regions, one per class. Image by Balu Ertl, own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=38534275]

Slide10

Now the perceptron has a vector error, just like linear regression

Now we can define an error term for every output: ε_c = ŷ_c − y_c.
If c was the correct class label (y_c = 1), but the network didn't get it right (ŷ_c = 0), then it undershot: ε_c = −1.
If the network thought the correct answer was c (ŷ_c = 1), but it wasn't (y_c = 0), then it overshot: ε_c = +1.
Otherwise, ε_c = 0.

Slide11

Multi-class perceptron, written in terms of one-hot vectors

But with this definition, we can write the perceptron update the same as the linear regression update: w_c ← w_c − ηε_c x = w_c − η(ŷ_c − y_c)x.

Slide12

Comparison of Multi-Class Perceptron to Multiple Regression

[Figure: the same two block diagrams. Left, the multi-class perceptron: inputs 1, x1, x2, …, xD, weight vectors, and a one-hot output. Right, multiple regression: the same inputs and weights, but a real-valued output.]

Slide13

Comparison of multi-class perceptron and multiple linear regression

Multi-class perceptron: for all classes, 1 ≤ c ≤ V, w_c ← w_c − η(ŷ_c − y_c)x, where ŷ is the one-hot argmax output.
Multiple linear regression: for all classes, 1 ≤ c ≤ V, w_c ← w_c − η(ŷ_c − y_c)x, where ŷ_c = w_c^T x is the real-valued output.

Slide14

Outline

One-hot vectors: rewriting the perceptron to look like linear regression
Softmax: Soft category boundaries
Cross-entropy = negative log probability of the training data
Stochastic gradient descent for logistic regression

Slide15

Probabilistic boundaries

Instead of trying to find the exact boundaries, logistic regression models the probability that token x belongs to class c: P(Class = c | X = x).
[Figure: the same classification regions, but with soft boundaries: well inside each region, P(Class = c | x) ≈ 1 for that region's class, and the probabilities blend smoothly near the boundaries.]

Slide16

Perceptron versus logistic regression

Remember that for the perceptron, we have ŷ_c = 1 if c = argmax_{c'} w_{c'}^T x, and ŷ_c = 0 otherwise.
For logistic regression, we have ŷ_c = P(Class = c | x) = exp(w_c^T x) / Σ_{c'=1}^V exp(w_{c'}^T x).

Slide17

The softmax function

This is called the softmax function:
ŷ = softmax(Wx), with ŷ_c = exp(w_c^T x) / Σ_{c'=1}^V exp(w_{c'}^T x)
…where the matrix W is defined to be the matrix whose rows are the class weight vectors, W = [w_1, …, w_V]^T.
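As a sketch of the formula above (function and variable names are mine; the max-subtraction is a standard numerical-stability trick, not something the slide specifies), rows of W hold the class weight vectors w_c:

```python
import numpy as np

def softmax(W, x):
    """softmax(Wx): rows of W are the class weight vectors w_c."""
    scores = W @ x                    # one score w_c^T x per class
    scores = scores - scores.max()    # subtract the max score: avoids overflow, same result
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

W = np.array([[1.0, -2.0],
              [0.5,  0.5],
              [-1.0, 1.0]])           # V = 3 classes, D = 2 features (made-up numbers)
x = np.array([2.0, 1.0])
print(softmax(W, x))                  # three probabilities that sum to 1
```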

Slide18

Argmax and Softmax

In both cases, we have ŷ = f(Wx): for the perceptron, f is the argmax (expressed as a one-hot vector); for logistic regression, f is the softmax.

Slide19

Argmax and Softmax

In both cases, we can interpret these as probabilities: 0 ≤ ŷ_c ≤ 1 for every class c, and Σ_{c=1}^V ŷ_c = 1.
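A quick, self-contained numerical check of that claim (the weights and input are made-up values): both the one-hot argmax output and the softmax output are nonnegative and sum to one.

```python
import numpy as np

W = np.array([[1.0, -2.0], [0.5, 0.5], [-1.0, 1.0]])   # made-up weights
x = np.array([2.0, 1.0])
scores = W @ x

y_argmax = np.zeros_like(scores)        # argmax output: a one-hot vector
y_argmax[np.argmax(scores)] = 1.0

exp_s = np.exp(scores - scores.max())   # softmax output: soft probabilities
y_softmax = exp_s / exp_s.sum()

for y in (y_argmax, y_softmax):
    assert np.all(y >= 0) and np.isclose(y.sum(), 1.0)   # valid probability vectors
print(y_argmax, y_softmax)
```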

Slide20

Some details: Logistic function

The probability ŷ_1 = P(Class = 1 | x) in the two-class case has an interesting form. It's called the "logistic sigmoid" function:
ŷ_1 = σ(z) = 1 / (1 + e^(−z)), where z = (w_1 − w_2)^T x.
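A small numerical sketch of this identity (the weight vectors and input are made-up values): the two-class softmax probability of class 1 equals the logistic sigmoid of z = (w_1 − w_2)^T x.

```python
import numpy as np

w1 = np.array([1.0, -0.5])      # made-up weight vectors and input
w2 = np.array([-0.3, 0.8])
x  = np.array([2.0, 1.0])

# Two-class softmax probability of class 1
p_softmax = np.exp(w1 @ x) / (np.exp(w1 @ x) + np.exp(w2 @ x))

# Logistic sigmoid of z = (w1 - w2)^T x
z = (w1 - w2) @ x
p_sigmoid = 1.0 / (1.0 + np.exp(-z))

print(p_softmax, p_sigmoid)     # the same number, up to rounding
assert np.isclose(p_softmax, p_sigmoid)
```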

 

Slide21

Some details: Logistic function

This function, σ(z) = 1 / (1 + e^(−z)), is called the "logistic sigmoid function."

It’s called “sigmoid” because it is S-shaped.

It was first discovered by Verhulst in the 1830s, as a model of population growth. The idea was that the population grows exponentially until it runs up against resource limitations, and then starts to stagnate.

 

Slide22

Logistic Regression

We can frame the basic idea of logistic regression in this way: replace the non-differentiable decision function
ŷ_c = 1 if c = argmax_{c'} w_{c'}^T x (and ŷ_c = 0 otherwise)
with a differentiable decision function:
ŷ_c = exp(w_c^T x) / Σ_{c'=1}^V exp(w_{c'}^T x)
…so that the classifier can be trained using gradient descent.

Slide23

Outline

One-hot vectors: rewriting the perceptron to look like linear regression
Softmax: Soft category boundaries
Cross-entropy = negative log probability of the training data

Stochastic gradient descent for logistic regression

Slide24

Learning logistic regression

Suppose we have some data. We want to learn vectors w_c so that
exp(w_c^T x) / Σ_{c'=1}^V exp(w_{c'}^T x) ≈ P(Class = c | X = x).

Slide25

Learning logistic regression: Training data

Data: D = {(x_1, y_1), …, (x_n, y_n)}, where each x_i is a vector, and each y_i is an integer encoding the true class label.
[Figure: a scatter plot of the training tokens x_i, with color and shape indicating the true class label y_i.]

Slide26

Learning logistic regression: Model parameters

We want to learn the model parameters W = [w_1, …, w_V]^T so that
exp(w_c^T x_i) / Σ_{c'=1}^V exp(w_{c'}^T x_i) ≈ P(Class = c | X = x_i) for every training token x_i and every class c.

Slide27

Learning logistic regression: Training criterion

We want to learn the model parameters, W, in order to maximize the probability of the observed data:
W = argmax_W P(training labels | training tokens) = argmax_W Π_{i=1}^n P(Y = y_i | X = x_i)

Slide28

Learning logistic regression

We want to learn the model parameters, W, in order to maximize the probability of the observed data:
W = argmax_W Π_{i=1}^n P(Y = y_i | X = x_i) = argmax_W Π_{i=1}^n ŷ_{i,y_i}
…where ŷ_{i,c} denotes the model's softmax output for token x_i and class c.

Slide29

Learning logistic regression

We want to learn the model parameters, W, in order to maximize the probability of the observed data:
W = argmax_W Π_{i=1}^n ŷ_{i,y_i} = argmax_W Π_{i=1}^n exp(w_{y_i}^T x_i) / Σ_{c=1}^V exp(w_c^T x_i)

Slide30

How do you maximize a function?

Our goal is to find W in order to maximize Π_{i=1}^n P(Y = y_i | X = x_i).
Here are some useful things to know:
1. Logarithms turn products into sums.
2. Maximizing ln f(W) is the same thing as minimizing −ln f(W).

Slide31

1. Logarithms turn products into sums

ln(x) (the natural logarithm of x, shown as the curve labeled ln(x) in the plot at right) is a monotonically increasing function of x.
Since it's monotonically increasing, argmax_W f(W) = argmax_W ln f(W).
Almost always, maximizing the log probability is easier than maximizing the probability, because logarithms turn products into sums.
[Figure: Logarithm_plots.png, CC-SA 3.0, Richard F. Lyon, 2011]

Slide32

1. Logarithms turn products into sums

Our goal is to find W in order to maximize Π_{i=1}^n ŷ_{i,y_i} or, equivalently, to maximize Σ_{i=1}^n ln ŷ_{i,y_i}.
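One practical reason the log matters: a product of many probabilities underflows to zero in floating point, while the equivalent sum of logs stays well-behaved. A tiny illustration with made-up probabilities:

```python
import numpy as np

p = np.full(2000, 0.1)        # 2000 made-up token probabilities

print(np.prod(p))             # 0.0 -- the product underflows in float64
print(np.sum(np.log(p)))      # about -4605.2, perfectly usable
```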

 

Slide33

2. Maximizing ln f(W) is the same thing as minimizing −ln f(W)

Our goal is to find W in order to maximize Σ_{i=1}^n ln ŷ_{i,y_i}, where ŷ_{i,y_i} = exp(w_{y_i}^T x_i) / Σ_{c=1}^V exp(w_c^T x_i).
Choosing W to maximize the numerator, exp(w_{y_i}^T x_i), is kind of obvious: just set w_{y_i} = Ax_i, where A is a scalar that's as big as possible. Maximizing the whole ratio, including the denominator, is not obvious.

Slide34

2. Maximizing ln f(W) is the same thing as minimizing −ln f(W)

To emphasize the hard part of the problem, there is a convention that, instead of maximizing Σ_{i=1}^n ln ŷ_{i,y_i}, we minimize its negative:
Our goal is to find W in order to minimize ℒ = −Σ_{i=1}^n ln ŷ_{i,y_i}.
The curly ℒ is a symbol we use to denote a "loss function." A loss function is something you want to minimize.

Slide35

Some details: Cross entropy

The loss function is called "cross entropy," because it is similar in some ways to the entropy of a thermodynamic system in physics. When you implement this in software, it's a good idea to normalize by the number of training tokens, so that the scale is easier to understand:
ℒ = −(1/n) Σ_{i=1}^n ln ŷ_{i,y_i}
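A minimal sketch of that normalized loss in NumPy (the function name and example values are mine): average the negative log of the probability the model assigns to each token's correct class.

```python
import numpy as np

def cross_entropy(Y_hat, labels):
    """Mean negative log probability of the correct class.
    Y_hat: (n, V) softmax outputs; labels: (n,) integer class labels (0-based)."""
    n = len(labels)
    return -np.mean(np.log(Y_hat[np.arange(n), labels]))

Y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(cross_entropy(Y_hat, labels))   # -(ln 0.7 + ln 0.8) / 2, about 0.29
```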

 

Slide36

Outline

One-hot vectors: rewriting the perceptron to look like linear regression
Softmax: Soft category boundaries

Cross-entropy = negative log probability of the training data

Stochastic gradient descent for logistic regression

Slide37

Logistic regression training

In each iteration, present a batch of training data, {(x_i, y_i)}.
If the batch contains all the data, this is called "gradient descent."
If the batch contains a randomly chosen subset of the data, this is called "stochastic gradient descent."
Calculate ŷ_{i,c} for each training token x_i in the batch, for each class c.
Update all the weight vectors using stochastic gradient descent.
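Putting the pieces together, here is a hedged sketch of the whole training loop on a made-up synthetic dataset (all names, sizes, and hyperparameter values are mine, chosen only for illustration); the update it applies is w_c ← w_c − η(ŷ_c − y_c)x, averaged over the batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n tokens, D features, V classes (purely illustrative)
n, D, V = 300, 2, 3
X = rng.normal(size=(n, D))
true_W = rng.normal(size=(V, D))
labels = np.argmax(X @ true_W.T, axis=1)      # integer class labels, 0-based

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

W = np.zeros((V, D))        # model weights, one row per class
eta = 0.1                   # learning rate (chosen by trial and error)
batch_size = 32

for step in range(200):
    batch = rng.choice(n, size=batch_size, replace=False)   # a random subset: "stochastic"
    Xb, yb = X[batch], labels[batch]
    Y_hat = softmax_rows(Xb @ W.T)                 # (batch, V) class probabilities
    Y = np.zeros_like(Y_hat)
    Y[np.arange(batch_size), yb] = 1.0             # one-hot targets
    grad = (Y_hat - Y).T @ Xb / batch_size         # (V, D) gradient of the mean loss
    W -= eta * grad                                # gradient step

print("training accuracy:", np.mean(np.argmax(X @ W.T, axis=1) == labels))
```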

 

Slide38

Logistic regression training example

Start with the given dataset D = {(x_1, y_1), …, (x_n, y_n)}. Here the true class is indicated by both color and shape.
[Figure: a scatter plot of the training tokens, with color and shape indicating the true class.]

Slide39

Logistic regression training example

Randomly initialize the weight vectors, and then calculate the probabilities ŷ_{i,c} for every class c, for every training token (shown as transparency and color change, left side).
[Figure: the same scatter plot; each token's transparency and color now reflect the class probabilities under the randomly initialized model.]

Slide40

Logistic regression training example

Modify the weight vectors to reduce the loss function: w_c ← w_c − η∇_{w_c}ℒ.

Slide41

Logistic regression training example

Repeat until the loss stops decreasing.

Slide42

Stochastic gradient descent

Our goal is to find W in order to minimize ℒ = −(1/n) Σ_{i=1}^n ln ŷ_{i,y_i}.
Just like in linear regression, let's do that one token at a time. Choose a training token (x_i, y_i), and try to minimize ℒ_i = −ln ŷ_{i,y_i}.

Slide43

Stochastic gradient descent

Our goal is to find W in order to minimize ℒ_i = −ln ŷ_{i,y_i}.
We do that by adjusting w_c ← w_c − η∇_{w_c}ℒ_i, where η is called the learning rate. Typically η is a small positive constant, but it's very hard to know in advance what learning rate will work for a particular problem; you need to experiment to see what works.
∇_{w_c}ℒ_i is the gradient of the loss with respect to w_c.

Slide44

The gradient of the cross-entropy of a softmax

Now, let's calculate that gradient:
∇_{w_c}ℒ_i = (ŷ_{i,c} − y_{i,c}) x_i
…where y_i is our old friend the one-hot vector: y_{i,c} = 1 if c is the correct class of token i, and y_{i,c} = 0 otherwise.
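The result ∇_{w_c}ℒ_i = (ŷ_{i,c} − y_{i,c})x_i is easy to verify numerically. A small sketch (names and values are mine) compares the analytic gradient with a finite-difference estimate of one entry:

```python
import numpy as np

def softmax(W, x):
    s = W @ x
    e = np.exp(s - s.max())
    return e / e.sum()

def loss(W, x, label):
    return -np.log(softmax(W, x)[label])     # cross-entropy for one token

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))                  # made-up weights and input
x = rng.normal(size=4)
label = 2

# Analytic gradient: (y_hat - y) outer x
y_hat = softmax(W, x)
y = np.zeros(3)
y[label] = 1.0
grad = np.outer(y_hat - y, x)

# Finite-difference estimate of one entry
eps = 1e-6
W2 = W.copy()
W2[0, 1] += eps
numeric = (loss(W2, x, label) - loss(W, x, label)) / eps
print(grad[0, 1], numeric)                   # agree to several decimal places
```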

 

Slide45

The gradient of the cross-entropy of a softmax

ℒ_i = −ln ŷ_{i,y_i} = −w_{y_i}^T x_i + ln Σ_{c'=1}^V exp(w_{c'}^T x_i)
∇_{w_c}ℒ_i = −y_{i,c} x_i + (exp(w_c^T x_i) / Σ_{c'=1}^V exp(w_{c'}^T x_i)) x_i = (ŷ_{i,c} − y_{i,c}) x_i

Slide46

Comparison of perceptron, linear regression, and logistic regression

Perceptron: w_c ← w_c − η(ŷ_c − y_c)x, where ŷ is the one-hot argmax output.
Linear Regression: w_c ← w_c − η(ŷ_c − y_c)x, where ŷ_c = w_c^T x.
Logistic Regression: w_c ← w_c − η(ŷ_c − y_c)x, where ŷ_c = exp(w_c^T x) / Σ_{c'=1}^V exp(w_{c'}^T x).
The only difference is how you define the network output (argmax, linear, or softmax).
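To illustrate that point, a single hedged sketch of the update step in which only the output function changes (argmax turned into a one-hot vector, identity for linear regression, softmax for logistic regression); all names and numbers are mine:

```python
import numpy as np

def argmax_onehot(scores):                  # perceptron output
    y = np.zeros_like(scores)
    y[np.argmax(scores)] = 1.0
    return y

def identity(scores):                       # linear regression output
    return scores

def softmax(scores):                        # logistic regression output
    e = np.exp(scores - scores.max())
    return e / e.sum()

def update(W, x, y, output, eta=0.1):
    """One step of w_c <- w_c - eta * (yhat_c - y_c) * x, for every class c."""
    y_hat = output(W @ x)
    return W - eta * np.outer(y_hat - y, x)

x = np.array([1.0, 2.0, -1.0])              # made-up token
y = np.array([0.0, 1.0, 0.0])               # one-hot label (also the regression target)
W = np.zeros((3, 3))
for f in (argmax_onehot, identity, softmax):
    print(f.__name__)
    print(update(W, x, y, f))
```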

 

Slide47

Outline

One-hot vectors: rewriting the perceptron to look like linear regression
Probabilistic-boundary classifiers
How do you maximize a function?

Learning a logistic regression

Two-class logistic regression

Slide48

Some details: Binary cross entropy

For two-class problems, it's wasteful to compute both ŷ_1 and ŷ_2, so sometimes we don't.
Instead, we use binary cross entropy, which is:
ℒ = −(1/n) Σ_{i=1}^n [ y_{i,1} ln ŷ_{i,1} + (1 − y_{i,1}) ln(1 − ŷ_{i,1}) ]
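A minimal sketch of binary cross entropy (names and values are mine), computed from a single predicted probability p = P(Class = 1 | x) per token:

```python
import numpy as np

def binary_cross_entropy(p, labels):
    """p: (n,) predicted P(Class = 1 | x); labels: (n,) values in {0, 1} (1 means class 1)."""
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

p = np.array([0.9, 0.2, 0.7])
labels = np.array([1, 0, 1])
print(binary_cross_entropy(p, labels))   # -(ln 0.9 + ln 0.8 + ln 0.7) / 3, about 0.23
```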

 

Slide49

Some details: Logistic function

The probability ŷ_1 = P(Class = 1 | x) in the two-class case is particularly simple. It's
ŷ_1 = σ(z) = 1 / (1 + e^(−z)), where z = (w_1 − w_2)^T x.

Slide50

Some details: Logistic function

This function, σ(z) = 1 / (1 + e^(−z)), is called the "logistic sigmoid function."

It’s called “sigmoid” because it is S-shaped.

It was first discovered by Verhulst in the 1830s, as a model of population growth. The idea was that the population grows exponentially until it runs up against resource limitations, and then starts to stagnate.

 

Slide51

Logistic Regression

We can frame the basic idea of logistic regression in this way: replace the non-differentiable decision function
ŷ_c = 1 if c = argmax_{c'} w_{c'}^T x (and ŷ_c = 0 otherwise)
with a differentiable decision function:
ŷ_c = exp(w_c^T x) / Σ_{c'=1}^V exp(w_{c'}^T x)
…so that the classifier can be trained using gradient descent.

Slide52

Conclusion

Perceptron: w_c ← w_c − η(ŷ_c − y_c)x, where ŷ is the one-hot argmax output.
Linear Regression: w_c ← w_c − η(ŷ_c − y_c)x, where ŷ_c = w_c^T x.
Logistic Regression: w_c ← w_c − η(ŷ_c − y_c)x, where ŷ_c = exp(w_c^T x) / Σ_{c'=1}^V exp(w_{c'}^T x).
The only difference is how you define the network output (argmax, linear, or softmax).

Slide53

Comparison of Multi-Class Perceptron to Multiple Regression

[Figure: the same two block diagrams. Left, the multi-class perceptron: inputs 1, x1, x2, …, xD, weight vectors, and a one-hot output. Right, multiple regression: the same inputs and weights, but a real-valued output.]

Slide54

Logistic Regression

[Figure: the corresponding block diagram for logistic regression: inputs 1, x1, x2, …, xD, weight vectors, and a softmax; the output is a vector of probabilities.]