Discriminative classifiers: Logistic Regression, SVMs
CISC 5800
Professor Daniel Leeds
Maximum A Posteriori: a quick review
- Likelihood: $P(D|\theta) = \theta^{\alpha_H}(1-\theta)^{\alpha_T}$, for $\alpha_H$ observed Heads and $\alpha_T$ observed Tails
- Prior: $P(\theta) \propto \theta^{\beta_H - 1}(1-\theta)^{\beta_T - 1}$
- Posterior $\propto$ Likelihood x prior; MAP estimate: $\hat{\theta} = \frac{\alpha_H + \beta_H - 1}{\alpha_H + \alpha_T + \beta_H + \beta_T - 2}$
- Choose $\beta_H$ and $\beta_T$ to give the prior belief of Heads bias:
  - Higher $\beta_H$: Heads more likely
  - Higher $\beta_T$: Tails more likely
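As a worked check of the MAP formula above, here is a minimal Python sketch; the counts and prior pseudo-counts are made-up illustration values.

```python
# Minimal sketch of the Beta-Bernoulli MAP estimate above;
# counts and prior pseudo-counts are made-up illustration values.
alpha_H, alpha_T = 7, 3   # observed Heads and Tails
beta_H, beta_T = 2, 2     # prior pseudo-counts for Heads and Tails

theta_mle = alpha_H / (alpha_H + alpha_T)              # 0.70
theta_map = (alpha_H + beta_H - 1) / \
            (alpha_H + alpha_T + beta_H + beta_T - 2)  # 8/12, about 0.67
print(theta_mle, theta_map)  # the prior pulls the estimate toward 0.5
```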
Estimate each P(Xi|Y) through MAP
- Estimate each $P(X_i | Y = y_j)$ through MAP, incorporating a prior for each class.
- Note: both X and Y can take on multiple values (binary and beyond).
- $\beta_j$ – "frequency" of class j; $\sum_k \beta_k$ – "frequencies" of all classes
Classification strategy: generative vs. discriminative
Generative, e.g., Bayes/Naïve Bayes:
- Identify a probability distribution for each class
- Determine the class with maximum probability for the data example
Discriminative, e.g., Logistic Regression:
- Identify the boundary between classes
- Determine which side of the boundary a new data example lies on
[Figure: two scatter plots of the same data, one showing per-class distributions, one showing a class boundary]
Linear algebra: data features
- Vector – a list of numbers; each number describes a data feature.
- Matrix – a list of lists of numbers; the features for each data point.
Example feature vector ($d$ word-occurrence counts) for a document: Wolf 12, Lion 16, Monkey 14, Broker 0, Analyst 1, Dividend 1, ...
[Table: the word-count vectors for Documents 1-3, stacked as rows of a matrix]
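A minimal sketch of this vector/matrix view; the word counts here are illustrative stand-ins, not the exact numbers from the slide's table.

```python
import numpy as np

# Each row is a document, each column a word feature;
# counts are illustrative stand-ins for the slide's table.
words = ["wolf", "lion", "monkey", "broker", "analyst", "dividend"]
X = np.array([
    [12, 16, 14, 0, 1, 1],   # Document 1: animal-heavy vocabulary
    [ 0,  1,  0, 8, 4, 9],   # Document 2: finance-heavy vocabulary
    [ 1,  0,  2, 5, 7, 3],   # Document 3
])
print(X[0])      # feature vector for Document 1
print(X.shape)   # (n_documents, n_features)
```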
Feature space
Each data feature defines a dimension in space.
[Figure: doc1, doc2, and doc3 plotted as points in a 2-D feature space with axes "wolf" and "lion"]
The dot product
The dot product compares two vectors:
$\mathbf{a} \cdot \mathbf{b} = \mathbf{a}^T\mathbf{b} = \sum_i a_i b_i$
[Figure: two vectors compared in the plane]
The dot product, continued
Magnitude of a vector is the square root of the sum of the squares of its elements:
$\|\mathbf{a}\| = \sqrt{\sum_i a_i^2}$
If $\mathbf{w}$ has unit magnitude, $\mathbf{w}^T\mathbf{x}$ is the "projection" of $\mathbf{x}$ onto $\mathbf{w}$.
[Figure: projection of a vector onto a unit-magnitude vector]
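A minimal numpy sketch of these three definitions, using made-up 2-D vectors:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([2.0, 0.0])

dot = a @ b                    # sum_i a_i * b_i = 6.0
mag = np.sqrt(np.sum(a**2))    # ||a|| = 5.0
w = b / np.linalg.norm(b)      # scale b to unit magnitude
proj = w @ a                   # projection of a onto w = 3.0
```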
Separating boundary, defined by w
- A separating hyperplane splits class 0 and class 1.
- The plane is defined by the vector w perpendicular to the plane.
- Is data point x in class 0 or class 1?
  $\mathbf{w}^T\mathbf{x} > 0$: class 0; $\mathbf{w}^T\mathbf{x} < 0$: class 1
Separating boundary, defined by w
More typically, the labels are assigned the other way:
$\mathbf{w}^T\mathbf{x} > 0$: class 1; $\mathbf{w}^T\mathbf{x} < 0$: class 0
From real-number projection to 0/1 label
- Binary classification: 0 is class A, 1 is class B.
- The sigmoid function stands in for $p(y|\mathbf{x})$:
  $g(h) = \frac{1}{1 + e^{-h}}$
[Figure: sigmoid curve g(h) rising from 0 to 1, crossing 0.5 at h = 0]
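A minimal sketch of the sigmoid mapping the projection $\mathbf{w}^T\mathbf{x}$ to a probability; the weights and data point are made-up values:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

w = np.array([0.8, -0.5])    # made-up learned weights
x = np.array([2.0, 1.0])     # made-up data point
p_class_B = sigmoid(w @ x)   # about 0.75: probability of label y = 1
```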
Learning parameters for classification
- Similar to MLE for the Bayes classifier.
- "Likelihood" for data points $y_1, \ldots, y_n$ (different from the Bayesian likelihood):
  $L(\mathbf{w}) = \prod_i g(\mathbf{x}_i; \mathbf{w})^{y_i} (1 - g(\mathbf{x}_i; \mathbf{w}))^{1 - y_i}$
- If $y_i$ is in class A, $y_i = 0$: multiply by $(1 - g(\mathbf{x}_i; \mathbf{w}))$
- If $y_i$ is in class B, $y_i = 1$: multiply by $g(\mathbf{x}_i; \mathbf{w})$
Learning parameters for classification
Taking the log gives the log-likelihood to maximize:
$\ell(\mathbf{w}) = \sum_i y_i \log g(\mathbf{x}_i; \mathbf{w}) + (1 - y_i) \log (1 - g(\mathbf{x}_i; \mathbf{w}))$
Learning parameters for classification
Differentiating with respect to each weight gives the gradient:
$\frac{\partial \ell}{\partial w_j} = \sum_i (y_i - g(\mathbf{x}_i; \mathbf{w})) \, x_{ij}$
Iterative gradient ascent
- Begin with initial guessed weights w.
- For each data point $(y_i, \mathbf{x}_i)$, update each weight $w_j$:
  $w_j \leftarrow w_j + \epsilon \, (y_i - g(\mathbf{w}^T\mathbf{x}_i)) \, x_{ij}$
- Choose $\epsilon$ so the change is not too big or too small – the "step size".
Intuition: $y_i$ is the true data label; $g(\mathbf{w}^T\mathbf{x}_i)$ is the computed data label.
- If $y_i = 1$, $g(\mathbf{w}^T\mathbf{x}_i) = 0$, and $x_{ij} > 0$: make $w_j$ larger, pushing $\mathbf{w}^T\mathbf{x}_i$ to be larger.
- If $y_i = 0$, $g(\mathbf{w}^T\mathbf{x}_i) = 1$, and $x_{ij} > 0$: make $w_j$ smaller, pushing $\mathbf{w}^T\mathbf{x}_i$ to be smaller.
A sketch of this loop appears below.
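A minimal sketch of the update loop, assuming the sigmoid and per-point update rule above; the toy data, step size, and pass count are illustrative choices:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

X = np.array([[0.5, 1.2], [1.0, 0.3],    # class A (y = 0)
              [2.0, 2.5], [2.5, 1.8]])   # class B (y = 1)
y = np.array([0, 0, 1, 1])
w = np.zeros(2)    # initial guessed weights
eps = 0.1          # step size

for _ in range(100):                     # repeated passes over the data
    for x_i, y_i in zip(X, y):
        error = y_i - sigmoid(w @ x_i)   # true label minus computed label
        w += eps * error * x_i           # pushes w^T x_i toward the label
```

Each update nudges $\mathbf{w}^T\mathbf{x}_i$ in the direction that shrinks the label error, matching the intuition above.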
Separating boundary, defined by w (recap)
The separating hyperplane, defined by the perpendicular vector w, splits class 0 and class 1:
$\mathbf{w}^T\mathbf{x} > 0$: class 1; $\mathbf{w}^T\mathbf{x} < 0$: class 0
But, where do we place the boundary?
Logistic regression: every data point is considered when placing the boundary.
Outlier data pulls the boundary towards it.
Max margin classifiers
- Focus on the boundary points.
- Find the largest margin between boundary points on both sides.
- Works well in practice.
We call the boundary points "support vectors".
Classifying with additional dimensions
[Figure: data with no linear separator in the original space becomes linearly separable in a higher-dimensional space]
Mapping function(s)
- Map from the low-dimensional space to a higher-dimensional space: $\mathbf{x} \rightarrow \phi(\mathbf{x})$
- N data points are guaranteed to be separable in a space of N-1 dimensions or more.
- Classifying: check the sign of $\mathbf{w}^T\phi(\mathbf{x})$
Discriminative classifiers
Find a separator to minimize classification error:
- Logistic Regression
- Support Vector Machines
Logistic Regression review
- Maximize the likelihood: $L(\mathbf{w}) = \prod_i g(\mathbf{w}^T\mathbf{x}_i)^{y_i} (1 - g(\mathbf{w}^T\mathbf{x}_i))^{1 - y_i}$
- Logistic function: $g(h) = \frac{1}{1 + e^{-h}}$
- Update w: $w_j \leftarrow w_j + \epsilon \sum_i (y_i - g(\mathbf{w}^T\mathbf{x}_i)) \, x_{ij}$
MAP for discriminative classifier
- MLE: $P(y=1|\mathbf{x};\mathbf{w}) = g(\mathbf{w}^T\mathbf{x})$, $P(y=0|\mathbf{x};\mathbf{w}) = 1 - g(\mathbf{w}^T\mathbf{x})$
- MAP: $P(y=1,\mathbf{w}|\mathbf{x}) \propto P(y=1|\mathbf{x};\mathbf{w}) \, P(\mathbf{w}) = g(\mathbf{w}^T\mathbf{x}) \, P(\mathbf{w})$ (different from the Bayesian posterior)
- $P(\mathbf{w})$ priors:
  - L2 regularization – minimize all weights
  - L1 regularization – minimize the number of non-zero weights
MAP – L2 regularization
$P(y=1,\mathbf{w}|\mathbf{x}) \propto P(y=1|\mathbf{x};\mathbf{w}) \, P(\mathbf{w})$, with a Gaussian prior on each weight: $P(w_j) \propto e^{-\lambda w_j^2}$
Prevents $\mathbf{w}^T\mathbf{x}$ from getting too large.
[Figure: Gaussian prior density p(w) over weight value w]
MAP – L1 regularization
$P(y=1,\mathbf{w}|\mathbf{x}) \propto P(y=1|\mathbf{x};\mathbf{w}) \, P(\mathbf{w})$, with a Laplacian prior on each weight: $P(w_j) \propto e^{-\lambda |w_j|}$
Forces most dimensions to 0.
[Figure: Laplacian prior density p(w), sharply peaked at w = 0]
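A minimal sketch of how the two priors change the gradient-ascent step, folding each log-prior's derivative into the update; `lam` is an assumed regularization strength:

```python
import numpy as np

def l2_step(w, grad, eps, lam):
    # Gaussian (L2) prior: shrink every weight in proportion to its size
    return w + eps * (grad - lam * w)

def l1_step(w, grad, eps, lam):
    # Laplacian (L1) prior: constant pull toward 0, zeroing small weights
    return w + eps * (grad - lam * np.sign(w))
```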
Parameters for learning
- Regularization: selecting $\lambda$ influences the strength of your bias.
- Gradient ascent: selecting $\epsilon$ influences the effect of individual data points in learning.
- Bayesian: selecting $\beta$ indicates the strength of the class prior beliefs.
$\lambda$, $\epsilon$, and $\beta$ are parameters controlling our learning.
Multi-class logistic regression: class probability
- Recall binary class: $P(y=1|\mathbf{x};\mathbf{w}) = g(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}$
- Multi-class – m classes, with weights $\mathbf{w}^k$ for classes $k < m$ and class m as the reference:
  $P(y=k|\mathbf{x}) = \frac{e^{(\mathbf{w}^k)^T\mathbf{x}}}{1 + \sum_{j<m} e^{(\mathbf{w}^j)^T\mathbf{x}}}$, and $P(y=m|\mathbf{x}) = \frac{1}{1 + \sum_{j<m} e^{(\mathbf{w}^j)^T\mathbf{x}}}$
Multi-class logistic regression: likelihood
- Recall binary class: $L(\mathbf{w}) = \prod_i g(\mathbf{w}^T\mathbf{x}_i)^{y_i} (1 - g(\mathbf{w}^T\mathbf{x}_i))^{1 - y_i}$
- Multi-class: $L(\mathbf{w}^1, \ldots, \mathbf{w}^{m-1}) = \prod_i \prod_k P(y=k|\mathbf{x}_i)^{\mathbf{1}[y_i = k]}$
Multi-class logistic regression: update rule
- Recall binary class: $w_j \leftarrow w_j + \epsilon \sum_i (y_i - g(\mathbf{w}^T\mathbf{x}_i)) \, x_{ij}$
- Multi-class: $w^k_j \leftarrow w^k_j + \epsilon \sum_i (\mathbf{1}[y_i = k] - P(y=k|\mathbf{x}_i)) \, x_{ij}$
A sketch of this update appears below.
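A minimal sketch of the multi-class update, assuming the reference-class formulation above with m-1 weight vectors; the shapes and step size are illustrative:

```python
import numpy as np

def class_probs(W, x):
    # W: (m-1, n) weight vectors; returns P(y=k|x) for all m classes,
    # with the last class as the reference class
    scores = np.exp(W @ x)
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)

def update(W, x, y, eps):
    # y is an integer class label in 0..m-1
    p = class_probs(W, x)
    for k in range(W.shape[0]):
        indicator = 1.0 if y == k else 0.0
        W[k] += eps * (indicator - p[k]) * x
    return W
```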
Logistic regression: how many parameters?
- N features, M classes.
- Learn $\mathbf{w}^k$ for each of M-1 classes: N(M-1) parameters.
- Actually, it is better to allow an offset from the origin, $\mathbf{w}^T\mathbf{x} + b$: N+1 parameters per class.
- Total: (N+1)(M-1) parameters.
Max margin classifiers (recap)
Focus on the boundary points, the "support vectors", and find the largest margin between boundary points on both sides.
Maximum margin definitions
- M is the margin width.
- $\mathbf{x}^+$ is a +1 point closest to the boundary; $\mathbf{x}^-$ is a -1 point closest to the boundary.
- Classify as +1 if $\mathbf{w}^T\mathbf{x} + b \geq 1$
- Classify as -1 if $\mathbf{w}^T\mathbf{x} + b \leq -1$
- Undefined if $-1 < \mathbf{w}^T\mathbf{x} + b < 1$
M derivation
From the two margin planes, $\mathbf{w}^T\mathbf{x}^+ + b = 1$ and $\mathbf{w}^T\mathbf{x}^- + b = -1$, the margin width is
$M = \frac{2}{\|\mathbf{w}\|}$
M derivation, continued
Maximizing $M = \frac{2}{\|\mathbf{w}\|}$ is equivalent to minimizing $\frac{1}{2}\|\mathbf{w}\|^2$.
Support vector machine (SVM) optimization
$\arg\max_{\mathbf{w},b} \frac{2}{\|\mathbf{w}\|}$ becomes $\arg\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2$
subject to: $\mathbf{w}^T\mathbf{x}_i + b \geq 1$ for $\mathbf{x}_i$ in class +1; $\mathbf{w}^T\mathbf{x}_i + b \leq -1$ for $\mathbf{x}_i$ in class -1
Optimization with constraints: Lagrange multipliers, gradient descent, matrix calculus.
Alternate SVM formulation
- Support vectors $\mathbf{x}_i$ have $\alpha_i > 0$; all other points have $\alpha_i = 0$.
- $y_i$ are the data labels, +1 or -1.
- To classify sample $\mathbf{x}^j$, compute: $\mathrm{sign}\left(\sum_i \alpha_i y_i \, \mathbf{x}_i^T \mathbf{x}^j + b\right)$
A sketch of this computation appears below.
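A minimal numpy sketch of the dual-form classification; the support vectors, labels, multipliers, and offset are made-up values standing in for a trained SVM:

```python
import numpy as np

X_sv = np.array([[1.0, 2.0],   # support vectors (alpha_i > 0)
                 [3.0, 1.0]])
y_sv = np.array([+1, -1])      # their labels
alpha = np.array([0.7, 0.4])   # made-up learned multipliers
b = 0.5                        # made-up offset

def classify(x_new):
    score = np.sum(alpha * y_sv * (X_sv @ x_new)) + b
    return np.sign(score)

print(classify(np.array([0.0, 3.0])))   # +1 or -1
```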
Support vector machine (SVM) optimization with slack variables
What if the data are not linearly separable?
$\arg\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i$
subject to: $\mathbf{w}^T\mathbf{x}_i + b \geq 1 - \xi_i$ for $\mathbf{x}_i$ in class +1; $\mathbf{w}^T\mathbf{x}_i + b \leq -1 + \xi_i$ for $\mathbf{x}_i$ in class -1; $\xi_i \geq 0$
Each error is penalized based on its distance from the separator.
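This constrained problem is often solved through its equivalent hinge-loss form, $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$. The subgradient step below is one standard approach, not necessarily the one used in the lecture; C and the step size are assumed constants:

```python
import numpy as np

def soft_margin_step(w, b, X, y, C, eps):
    # Points with y_i (w^T x_i + b) < 1 lie inside the margin (slack > 0)
    viol = y * (X @ w + b) < 1
    # Subgradient of the hinge-loss objective over the violating points
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    return w - eps * grad_w, b - eps * grad_b
```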
Classifying with additional dimensions
[Figure: no linear separator in the original space; a linear separator after adding dimensions]
Note: more dimensions make it easier to separate N training points; training error is minimized, but this may risk over-fitting.
Quadratic mapping function
$x_1, x_2, x_3, x_4 \rightarrow x_1, x_2, x_3, x_4, x_1^2, x_2^2, x_3^2, x_4^2, x_1x_2, x_1x_3, x_1x_4, \ldots, x_2x_4, x_3x_4$
N features become $O(N^2)$ features: $N^2$ values to learn for $\mathbf{w}$ in the higher-dimensional space.
Or, observe: the quadratic-space dot product can be computed from a vector $\mathbf{v}$ with only N elements operating in the original space.
Kernels
Classifying: $\mathrm{sign}\left(\sum_i \alpha_i y_i \, \phi(\mathbf{x}_i)^T \phi(\mathbf{x}^j) + b\right)$
Kernel trick: estimate the high-dimensional dot product with a function $K(\mathbf{x}_i, \mathbf{x}^j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}^j)$
E.g., $K(\mathbf{x}, \mathbf{z}) = (1 + \mathbf{x}^T\mathbf{z})^2$
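A minimal check of the kernel trick for the quadratic kernel above, using one standard explicit quadratic map $\phi$ for 2-D inputs (an assumed illustration, not necessarily the lecture's exact map):

```python
import numpy as np

def phi(x):
    # explicit quadratic feature map whose dot product equals (1 + x.z)^2
    x1, x2 = x
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     x1*x1, x2*x2, np.sqrt(2)*x1*x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
# Same value, without ever building the 6-D vectors during classification
assert np.isclose((1 + x @ z)**2, phi(x) @ phi(z))
```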
The power of SVM (+kernels)
- The boundary is defined by a few support vectors.
- Caused by: maximizing the margin.
- Causes: less overfitting.
- Similar to: regularization.
Kernels keep the number of learned parameters in check.
Multi-class SVMs
Learn a boundary for class k vs. all other classes.
Find the boundary that gives the highest margin for data point $\mathbf{x}_i$.
Benefits of generative methods
Generative models can produce a non-linear boundary.
E.g.: Gaussians with multiple variances.