Presentation Transcript

Slide 1

Discriminative classifiers: Logistic Regression, SVMs

CISC 5800

Professor Daniel Leeds

Slide 2

Maximum A Posteriori: a quick review

Likelihood: P(D|θ) = θ^αH (1-θ)^αT, where αH and αT count the observed Heads and Tails

Prior: P(θ) ∝ θ^(βH-1) (1-θ)^(βT-1)

Posterior ∝ Likelihood x prior. MAP estimate: θMAP = (αH + βH - 1) / (αH + αT + βH + βT - 2)

Choose βH and βT to give the prior belief of Heads bias:
Higher βH: Heads more likely
Higher βT: Tails more likely
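To make the review concrete, here is a short Python sketch of the MLE and MAP estimates for a coin's Heads bias; the counts and the prior pseudo-counts are made-up illustration values, not from the slides.

```python
# MAP estimate of a coin's Heads bias with a Beta(beta_H, beta_T) prior.
# The counts and prior values below are illustrative, not from the lecture.
alpha_H, alpha_T = 7, 3      # observed Heads and Tails
beta_H, beta_T = 5, 5        # prior "pseudo-counts" encoding belief about the bias

theta_mle = alpha_H / (alpha_H + alpha_T)
theta_map = (alpha_H + beta_H - 1) / (alpha_H + alpha_T + beta_H + beta_T - 2)

print(f"MLE estimate: {theta_mle:.3f}")   # 0.700 -- uses the data only
print(f"MAP estimate: {theta_map:.3f}")   # 0.611 -- pulled toward the prior's 0.5
```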

Slide 3

Estimate each P(Xi|Y) through MAP

Incorporate a prior for each class.

Note: both X and Y can take on multiple values (binary and beyond).

The prior acts like extra "frequency" counts for class j, and "frequencies" summed over all classes, added to the observed counts.

Slide 4

Classification strategy: generative vs. discriminative

Generative, e.g., Bayes/Naïve Bayes:
Identify a probability distribution for each class
Determine the class with maximum probability for the data example

Discriminative, e.g., Logistic Regression:
Identify the boundary between classes
Determine which side of the boundary a new data example lies on

Slide 5

Linear algebra: data features

Vector – a list of numbers: each number describes a data feature.

Example word-occurrence vector for one document:
Wolf 12, Lion 16, Monkey 14, Broker 0, Analyst 1, Dividend 1, ... (d features in all)

Matrix – a list of lists of numbers: the features for each data point. Stacking the word-occurrence vectors for Document 1, Document 2, Document 3, ... gives a matrix with one vector of counts per document.

Slide 6

Feature space

Each data feature defines a dimension in space.

[Figure: the documents from the previous slide plotted as points (doc1, doc2, doc3) in a feature space with axes for the "wolf" and "lion" counts]

Slide 7

The dot product

The dot product compares two vectors w and x:

wTx = Σi wi xi

Slide 8

The dot product, continued

The (squared) magnitude of a vector is the sum of the squares of its elements: ‖x‖² = Σi xi²

If w has unit magnitude, wTx is the "projection" of x onto w.
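A quick NumPy illustration of these quantities; the vectors are made-up values.

```python
import numpy as np

w = np.array([0.6, 0.8])        # unit-magnitude direction (0.36 + 0.64 = 1)
x = np.array([3.0, 4.0])

dot = w @ x                     # w^T x = sum_i w_i * x_i
mag_sq = np.sum(x ** 2)         # squared magnitude of x
projection = dot                # since ||w|| = 1, w^T x is the projection of x onto w

print(dot, mag_sq, projection)  # 5.0 25.0 5.0
```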

Slide 9

Separating boundary, defined by w

A separating hyperplane splits class 0 and class 1.

The plane is defined by the vector w perpendicular to the plane.

Is data point x in class 0 or class 1?
wTx > 0: class 0
wTx < 0: class 1

Slide 10

Separating boundary, defined by w

A separating hyperplane splits class 0 and class 1.

The plane is defined by the vector w perpendicular to the plane.

Is data point x in class 0 or class 1? More typically:
wTx > 0: class 1
wTx < 0: class 0
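A minimal sketch of this decision rule in NumPy; w and the test points are illustrative.

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])          # normal vector defining the boundary

def classify(x, w):
    """Assign class 1 if w^T x > 0, class 0 otherwise (the 'more typical' convention)."""
    return 1 if w @ x > 0 else 0

print(classify(np.array([1.0, 0.5, 2.0]), w))   # w.x = 2 - 0.5 + 1 = 2.5  -> class 1
print(classify(np.array([0.0, 3.0, 1.0]), w))   # w.x = -3 + 0.5 = -2.5    -> class 0
```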

Slide 11

From real-number projection to 0/1 label

Binary classification: 0 is class A, 1 is class B.

The sigmoid function stands in for p(y|x):

g(h) = 1 / (1 + e^-h)

[Figure: plot of g(h) against h, rising from 0 through 0.5 at h = 0 toward 1]
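A short NumPy sketch of the sigmoid; the inputs are chosen only to show its range.

```python
import numpy as np

def sigmoid(h):
    """Logistic (sigmoid) function g(h) = 1 / (1 + exp(-h))."""
    return 1.0 / (1.0 + np.exp(-h))

# Large positive w^T x -> probability near 1 (class B); large negative -> near 0 (class A)
print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # approx [0.0000454, 0.5, 0.9999546]
```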

Slide 12

Learning parameters for classification

Similar to MLE for the Bayes classifier.

"Likelihood" for data points y1, …, yn (different from the Bayesian likelihood):
If yi is in class A, yi = 0: multiply in (1 - g(xi; w))
If yi is in class B, yi = 1: multiply in g(xi; w)

L(w) = Πi g(xi; w)^yi (1 - g(xi; w))^(1-yi)

Slide 13

Learning parameters for classification

Maximize the log of the likelihood:

l(w) = log L(w) = Σi [ yi log g(xi; w) + (1 - yi) log(1 - g(xi; w)) ]

Slide 14

Learning parameters for classification

Gradient of the log likelihood with respect to each weight wj:

∂l(w)/∂wj = Σi xij (yi - g(wTxi))

Update: wj ← wj + ε Σi xij (yi - g(wTxi)), where ε is the step size

Slide 15

Iterative gradient ascent

Begin with initial guessed weights w.

For each data point (yi, xi), update each weight wj:
wj ← wj + ε xij (yi - g(wTxi))

Choose the step size ε so the change is not too big or too small.

Intuition: yi is the true data label, g(wTxi) is the computed data label.
If yi = 1 and g(wTxi) = 0, and xij > 0, make wj larger and push wTxi to be larger.
If yi = 0 and g(wTxi) = 1, and xij > 0, make wj smaller and push wTxi to be smaller.
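Putting slides 12-15 together, here is a minimal NumPy sketch of iterative gradient ascent for logistic regression. The dataset, step size, and epoch count are illustrative choices, and a constant 1 feature stands in for the offset b discussed later.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def fit_logistic(X, y, step_size=0.1, n_epochs=500):
    """Iterative gradient ascent for logistic regression.

    X: (n_points, n_features) data matrix, y: 0/1 labels.
    Each pass applies the per-point update w_j <- w_j + eps * x_ij * (y_i - g(w^T x_i)).
    """
    w = np.zeros(X.shape[1])                 # begin with initial guessed weights
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - sigmoid(w @ x_i)   # true label minus computed label
            w = w + step_size * error * x_i
    return w

# Tiny illustrative dataset; the first column is a constant 1 acting as the offset b
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 1.0, 0.5],
              [1.0, 3.0, 2.5],
              [1.0, 2.5, 3.0]])
y = np.array([0, 0, 1, 1])

w = fit_logistic(X, y)
print(np.round(sigmoid(X @ w), 3))   # low probabilities for the class-0 points, high for class 1
```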

Slide 16

Separating boundary, defined by w

A separating hyperplane splits class 0 and class 1.

The plane is defined by the vector w perpendicular to the plane.

Is data point x in class 0 or class 1?
wTx > 0: class 1
wTx < 0: class 0

Slide 17

But, where do we place the boundary?

Logistic regression: each data point is considered for the boundary.

Outlier data pulls the boundary towards it.

Slide 18

Max margin classifiers

Focus on the boundary points.

Find the largest margin between boundary points on both sides.

Works well in practice.

We can call the boundary points "support vectors".

Slide 19

Classifying with additional dimensions

[Figure: the same data shown in the original space (no linear separator) and in a higher-dimensional space (linear separator)]

Slide 20

Mapping function(s)

Map φ from a low-dimensional space to a higher-dimensional space.

N data points are guaranteed to be separable in a space of N-1 dimensions or more.

Classifying x: use wTφ(x) in place of wTx.

Slide 21

Discriminative classifiers

Find a separator to minimize classification error:
Logistic Regression
Support Vector Machines

Slide 22

Logistic Regression review

Maximize the likelihood: L(w) = Πi g(wTxi)^yi (1 - g(wTxi))^(1-yi)

Update w: wj ← wj + ε Σi xij (yi - g(wTxi))

Logistic function: g(h) = 1 / (1 + e^-h)

Slide 23

MAP for discriminative classifier

MLE: P(y=1|x;w) ~ g(wTx), P(y=0|x;w) ~ 1 - g(wTx)

MAP: P(y=1,w|x) ∝ P(y=1|x;w) P(w) ~ g(wTx) P(w) (different from the Bayesian posterior)

P(w) priors:
L2 regularization – keep all weights small
L1 regularization – minimize the number of non-zero weights

Slide 24

MAP – L2 regularization

P(y=1,w|x) ∝ P(y=1|x;w) P(w), with a Gaussian prior P(w) ∝ exp(-λ Σj wj²)

Prevents wTx from getting too large.

[Figure: the prior density p(w), a Gaussian bump centered at w = 0]

Slide 25

MAP – L1 regularization

P(y=1,w|x) ∝ P(y=1|x;w) P(w), with a Laplace prior P(w) ∝ exp(-λ Σj |wj|)

Forces most dimensions to 0.

[Figure: the prior density p(w), a sharply peaked Laplace distribution centered at w = 0]
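As a sketch of how the Gaussian (L2) prior changes the learning rule, the gradient gains a weight-shrinking term. This specific update is a standard consequence of the prior rather than something spelled out on the slides, and the step size and λ below are illustrative.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def update_l2(w, x_i, y_i, step_size=0.1, lam=0.01):
    """One MAP gradient-ascent step with a Gaussian (L2) prior on w.

    The prior contributes an extra -2 * lam * w term to the gradient, which
    nudges every weight toward zero on each update.
    """
    error = y_i - sigmoid(w @ x_i)
    return w + step_size * (error * x_i - 2.0 * lam * w)

w = np.array([1.0, -2.0, 0.5])
w = update_l2(w, x_i=np.array([0.2, 0.4, 0.1]), y_i=1)
print(w)    # updated weights after one step
```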

Slide 26

Parameters for learning

Regularization: selecting λ influences the strength of your bias.

Gradient ascent: selecting the step size ε influences the effect of individual data points in learning.

Bayesian: selecting β indicates the strength of the class prior beliefs.

λ, ε, and β are parameters controlling our learning.

Slide 27

Multi-class logistic regression: class probability

Recall binary class: P(y=1|x;w) = g(wTx) = 1 / (1 + e^-wTx), P(y=0|x;w) = 1 - g(wTx)

Multi-class – m classes, with one weight vector wk per class:
P(y=k|x) = e^(wkTx) / (1 + Σj<m e^(wjTx)) for k < m
P(y=m|x) = 1 / (1 + Σj<m e^(wjTx))

Slide 28

Multi-class logistic regression: likelihood

Recall binary class: L(w) = Πi g(wTxi)^yi (1 - g(wTxi))^(1-yi)

Multi-class: L(w1, …, wm-1) = Πi P(y=yi|xi), the product over data points of the probability assigned to each point's true class

Slide 29

Multi-class logistic regression: update rule

Recall binary class: wj ← wj + ε Σi xij (yi - g(wTxi))

Multi-class: wkj ← wkj + ε Σi xij (1[yi = k] - P(y=k|xi)), updating the weights for each class k
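A minimal sketch of the multi-class class-probability computation, using a softmax with one weight vector per class (a slight simplification of the m-1-vector form above); the weights are illustrative.

```python
import numpy as np

def class_probabilities(W, x):
    """Multi-class logistic regression probabilities.

    W: (m, n) matrix with one weight vector per class, x: (n,) feature vector.
    Returns P(y = k | x) for each of the m classes (softmax of the scores w_k^T x).
    """
    scores = W @ x
    scores -= scores.max()              # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])            # three classes, two features (illustrative values)
x = np.array([2.0, 0.5])
print(class_probabilities(W, x))        # highest probability for class 0 here
```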

Slide 30

Logistic regression: how many parameters?

N features, M classes.

Learn wk for each of M-1 classes: N (M-1) parameters.

Actually, it would be better to allow an offset from the origin, wTx + b: N+1 parameters per class.

Total: (N+1)(M-1) parameters. For example, with N = 100 features and M = 5 classes, that is 101 x 4 = 404 parameters.

Slide 31

Max margin classifiers

Focus on the boundary points.

Find the largest margin between boundary points on both sides.

Works well in practice.

We can call the boundary points "support vectors".

Slide 32

Maximum margin definitions

M is the margin width.

x+ is a +1 point closest to the boundary; x- is a -1 point closest to the boundary.

Classify as +1 if wTx + b ≥ 1
Classify as -1 if wTx + b ≤ -1
Undefined if -1 < wTx + b < 1

Slide 33

M derivation

wTx+ + b = 1 and wTx- + b = -1

The margin is the separation between x+ and x- along the direction w:

M = (x+ - x-) · w / ‖w‖ = 2 / ‖w‖

Slide 34

M derivation

Maximize M = 2 / ‖w‖  ⇔  minimize ‖w‖, or equivalently minimize wTw

Slide 35

Support vector machine (SVM) optimization

argmax(w,b) M  =  argmin(w,b) wTw

subject to: wTxi + b ≥ 1 for xi in class +1, wTxi + b ≤ -1 for xi in class -1

Optimization with constraints: with Lagrange multipliers.
Gradient descent
Matrix calculus
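In practice this constrained optimization is usually handed to a library solver. Below is a minimal sketch using scikit-learn's SVC (an assumed tool, not mentioned in the slides); a large C approximates the hard-margin problem and the toy data are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Small linearly separable toy set with labels +1 / -1 (illustrative values)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # large C approximates the hard-margin problem
clf.fit(X, y)

print(clf.coef_, clf.intercept_)    # learned w and b
print(clf.support_vectors_)         # only the boundary points define the separator
print(clf.predict([[3.0, 3.0]]))
```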

Slide 36

Alternate SVM formulation

Support vectors are the data points xi with αi > 0.

yi are the data labels, +1 or -1.

To classify sample xj, compute: sign( Σi αi yi (xiTxj) + b )
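A sketch of classification in the dual form; the support vectors, multipliers αi, and offset b below are made-up illustration values rather than the solution of an actual SVM.

```python
import numpy as np

def classify_dual(x_new, X_sv, y_sv, alpha, b):
    """Dual-form SVM decision: sign( sum_i alpha_i * y_i * (x_i . x_new) + b ).

    X_sv, y_sv, alpha hold the support vectors, their +/-1 labels, and their
    multipliers; all values below are made up for illustration.
    """
    score = np.sum(alpha * y_sv * (X_sv @ x_new)) + b
    return 1 if score > 0 else -1

X_sv = np.array([[2.0, 1.5], [4.0, 4.0]])   # the two support vectors
y_sv = np.array([-1, 1])
alpha = np.array([0.4, 0.4])                # equal multipliers (sum of alpha_i * y_i = 0)
b = -5.15
print(classify_dual(np.array([5.0, 5.0]), X_sv, y_sv, alpha, b))   # falls on the +1 side
print(classify_dual(np.array([1.0, 1.0]), X_sv, y_sv, alpha, b))   # falls on the -1 side
```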

Slide 37

Support vector machine (SVM) optimization with slack variables

What if the data are not linearly separable?

argmin(w,b) wTw + C Σi ξi

subject to: wTxi + b ≥ 1 - ξi for xi in class +1, wTxi + b ≤ -1 + ξi for xi in class -1, with ξi ≥ 0

Each error is penalized based on its distance from the separator.

Slide 38

Classifying with additional dimensions

[Figure: the same data with no linear separator in the original space and a linear separator after mapping to more dimensions]

Note: more dimensions make it easier to separate N training points: training error is minimized, but this may risk over-fitting.

Slide 39

Quadratic mapping function

x1, x2, x3, x4 -> x1, x2, x3, x4, x1², x2², x3², x4², x1x2, x1x3, x1x4, …, x2x4, x3x4

N features -> on the order of N² features, so N² values to learn for w in the higher-dimensional space.

Or, observe: the quadratic-space dot product can be computed as (xTv + 1)², using a vector v with only N elements operating in quadratic space.

Slide 40

Kernels

Classifying xj: compute sign( Σi αi yi φ(xi)Tφ(xj) + b )

Kernel trick: estimate the high-dimensional dot product with a function K(xi, xj) ≈ φ(xi)Tφ(xj)

E.g., K(xi, xj) = (xiTxj + 1)² for the quadratic mapping
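A small NumPy check that the quadratic kernel equals a dot product in an explicit quadratic feature space. The √2 scaling of the feature map is the standard choice that makes the identity hold exactly; it is an assumption here, since the slides do not give the map explicitly.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map whose dot product equals (x.z + 1)**2."""
    n = len(x)
    feats = [1.0]
    feats.extend(np.sqrt(2.0) * x)                  # linear terms
    for i in range(n):
        for j in range(i, n):
            coef = 1.0 if i == j else np.sqrt(2.0)  # squared terms vs. cross terms
            feats.append(coef * x[i] * x[j])
    return np.array(feats)

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(phi(x) @ phi(z))          # dot product in the quadratic space
print((x @ z + 1.0) ** 2)       # kernel evaluated in the original space -- same value
```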

Slide 41

The power of SVM (+kernels)

Boundary defined by a few support vectors.
Caused by: maximizing the margin.
Causes: less overfitting.
Similar to: regularization.

Kernels keep the number of learned parameters in check.

Slide 42

Multi-class SVMs

Learn a boundary for class k vs. all other classes.

Find the boundary that gives the highest margin for data point xi.
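A minimal sketch of the one-vs-rest prediction rule; the per-class weight vectors and offsets are illustrative values.

```python
import numpy as np

def predict_one_vs_rest(x, W, b):
    """One-vs-rest multi-class SVM prediction.

    W holds one weight vector per class (class k vs. all others) and b the
    matching offsets; pick the class whose boundary gives the largest margin
    for x. All numbers below are illustrative.
    """
    margins = W @ x + b
    return int(np.argmax(margins))

W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
b = np.array([0.0, 0.5, 0.0])
print(predict_one_vs_rest(np.array([2.0, 1.0]), W, b))   # margins [2.0, -0.5, -1.0] -> class 0
```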

Slide 43

Benefits of generative methods

Modeling P(X|Y) and P(Y) can generate a non-linear boundary.

E.g.: Gaussians with multiple variances.