Logistic Regression
Presentation Transcript

Maria-Florina Balcan, 02/08/2019

Naïve Bayes Recap

• Classifier: $f(x) = \arg\max_y P(y \mid x)$
• NB Assumption: $P(X_1, \dots, X_d \mid Y) = \prod_{i=1}^{d} P(X_i \mid Y)$
• NB Classifier: $f_{NB}(x) = \arg\max_y \prod_{i=1}^{d} P(x_i \mid y)\, P(y)$
• Assume a parametric form for $P(X_i \mid Y)$ and $P(Y)$; estimate the parameters using MLE/MAP and plug them in.
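
As a minimal sketch of the recap above (not part of the slides), the NB decision rule can be written out directly; the binary features, Laplace smoothing, and toy data are illustrative assumptions.

```python
import numpy as np

# Toy sketch of f_NB(x) = argmax_y prod_i P(x_i|y) P(y) with Bernoulli features.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])  # n x d binary features
y = np.array([1, 1, 0, 0])

def nb_fit(X, y, alpha=1.0):
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    # P(X_i = 1 | Y = c), estimated from (smoothed) counts
    cond = {c: (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
            for c in classes}
    return classes, priors, cond

def nb_predict(x, classes, priors, cond):
    # argmax_y P(y) * prod_i P(x_i | y), computed in log space for stability
    scores = {c: np.log(priors[c])
                 + np.sum(x * np.log(cond[c]) + (1 - x) * np.log(1 - cond[c]))
              for c in classes}
    return max(scores, key=scores.get)

classes, priors, cond = nb_fit(X, y)
print(nb_predict(np.array([1, 0, 0]), classes, priors, cond))
```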

Generative vs. Discriminative Classifiers

Generative classifiers (e.g., Naïve Bayes):
• Assume some functional form for P(X, Y) (or for P(X|Y) and P(Y)).
• Estimate the parameters of P(X|Y) and P(Y) directly from training data.
• Use Bayes rule to calculate P(Y|X).

Why not learn P(Y|X) directly? Or better yet, why not learn the decision boundary directly?

Discriminative classifiers (e.g., Logistic Regression):
• Assume some functional form for P(Y|X) or for the decision boundary.
• Estimate the parameters of P(Y|X) directly from training data.

Logistic Regression

Assumes the following functional form for P(Y|X):

$P(Y=1 \mid X) = \dfrac{1}{1 + \exp\!\left(-(w_0 + \sum_i w_i X_i)\right)} = \dfrac{\exp\!\left(w_0 + \sum_i w_i X_i\right)}{1 + \exp\!\left(w_0 + \sum_i w_i X_i\right)}$

This is the logistic function (or sigmoid), $\sigma(z) = \dfrac{1}{1 + e^{-z}}$, applied to a linear function of the data. Features can be discrete or continuous!
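
A minimal sketch of this functional form in code (the weights, feature values, and function names are illustrative assumptions, not values from the slides):

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, w0, w):
    # P(Y=1 | X) = sigmoid(w0 + sum_i w_i x_i)
    return sigmoid(w0 + np.dot(w, x))

# Illustrative weights and a single example with two continuous features
w0, w = 0.5, np.array([1.0, 0.3])
x = np.array([2.0, -1.0])
print(p_y1_given_x(x, w0, w))  # a probability strictly between 0 and 1
```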

Logistic Regression is a Linear Classifier! (Linear Decision Boundary)

Decision boundary: $w_0 + \sum_i w_i X_i = 0$

$P(Y=1 \mid X) > P(Y=0 \mid X) \;\Leftrightarrow\; w_0 + \sum_i w_i X_i > 0$

Assumes a linear decision boundary: there are weights $w_i$ such that when $w_0 + \sum_i w_i X_i > 0$ the example is more likely to be positive, and when this linear function is negative ($w_0 + \sum_i w_i X_i < 0$) the example is more likely to be negative.

• $w_0 + \sum_i w_i X_i = 0 \;\Rightarrow\; P(Y=1 \mid X) = \tfrac{1}{2}$
• $w_0 + \sum_i w_i X_i \to +\infty \;\Rightarrow\; P(Y=1 \mid X) \to 1$
• $w_0 + \sum_i w_i X_i \to -\infty \;\Rightarrow\; P(Y=1 \mid X) \to 0$

Since $P(Y=0 \mid X) = \dfrac{1}{1 + \exp\!\left(w_0 + \sum_i w_i X_i\right)}$, we have

$\dfrac{P(Y=1 \mid X)}{P(Y=0 \mid X)} = \exp\!\left(w_0 + \sum_i w_i X_i\right) > 1 \;\Leftrightarrow\; w_0 + \sum_i w_i X_i > 0$
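
To make the equivalence above concrete, here is a small sketch (the weights and test points are illustrative assumptions) checking that thresholding the probability at 1/2 gives the same prediction as checking the sign of the linear score:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_by_probability(x, w0, w):
    p1 = sigmoid(w0 + np.dot(w, x))
    return 1 if p1 > 1 - p1 else 0            # P(Y=1|X) > P(Y=0|X)?

def predict_by_linear_score(x, w0, w):
    return 1 if w0 + np.dot(w, x) > 0 else 0  # sign of the linear function

# Illustrative weights and points: both rules agree everywhere
w0, w = -0.2, np.array([0.8, -0.5])
for x in [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.3, 0.3])]:
    assert predict_by_probability(x, w0, w) == predict_by_linear_score(x, w0, w)
print("both decision rules agree")
```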

Training Logistic Regression

We'll focus on binary classification:

$P(Y=1 \mid X, w) = \dfrac{\exp\!\left(w_0 + \sum_i w_i X_i\right)}{1 + \exp\!\left(w_0 + \sum_i w_i X_i\right)}$, $\quad P(Y=0 \mid X, w) = \dfrac{1}{1 + \exp\!\left(w_0 + \sum_i w_i X_i\right)}$

How do we learn the parameters $w_0, w_1, \dots, w_d$?

Training data: $\{(X^j, Y^j)\}_{j=1}^{n}$, where $X^j = (X_1^j, \dots, X_d^j)$.

Maximum Likelihood Estimates: $\hat{w}_{MLE} = \arg\max_w \prod_{j=1}^{n} P(X^j, Y^j \mid w)$

But there's a problem: we don't have a model for P(X) or P(X|Y), only for P(Y|X).

So instead use Maximum (Conditional) Likelihood Estimates:

$\hat{w}_{MCLE} = \arg\max_w \prod_{j=1}^{n} P(Y^j \mid X^j, w)$

Discriminative philosophy: don't waste effort learning P(X); focus on P(Y|X), since that's all that matters for classification!

Expressing Conditional log Likelihood

$P(Y=1 \mid X, w) = \dfrac{\exp\!\left(w_0 + \sum_i w_i X_i\right)}{1 + \exp\!\left(w_0 + \sum_i w_i X_i\right)}$, $\quad P(Y=0 \mid X, w) = \dfrac{1}{1 + \exp\!\left(w_0 + \sum_i w_i X_i\right)}$

$l(w) \equiv \ln \prod_j P(Y^j \mid X^j, w) = \sum_j \left[ Y^j \left( w_0 + \sum_{i=1}^{d} w_i x_i^j \right) - \ln\!\left( 1 + \exp\!\left( w_0 + \sum_{i=1}^{d} w_i x_i^j \right) \right) \right]$

Maximizing Conditional log Likelihood

Good news: $l(w)$ is concave in $w$. Local optimum = global optimum. We want $\max_w l(w)$.
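
A sketch of this conditional log likelihood as code (the toy data and names are illustrative assumptions):

```python
import numpy as np

def conditional_log_likelihood(w0, w, X, y):
    # l(w) = sum_j [ y^j (w0 + w.x^j) - ln(1 + exp(w0 + w.x^j)) ]
    z = w0 + X @ w
    return np.sum(y * z - np.log1p(np.exp(z)))

# Illustrative data
X = np.array([[0.5, 1.2], [-1.0, 0.3], [2.0, -0.7]])
y = np.array([1, 0, 1])
# With all weights at zero this equals 3 * ln(1/2), roughly -2.08
print(conditional_log_likelihood(0.0, np.zeros(2), X, y))
```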

Bad news: there is no closed-form solution that maximizes $l(w)$.

Good news: concave functions are easy to optimize (unique maximum).

Optimizing concave/convex functions

• The conditional likelihood for Logistic Regression is concave.
• Maximizing a concave function is equivalent to minimizing the corresponding convex function (its negative): Gradient Ascent (concave) / Gradient Descent (convex).
• Gradient: $\nabla_w l(w) = \left[ \dfrac{\partial l(w)}{\partial w_0}, \dots, \dfrac{\partial l(w)}{\partial w_d} \right]$
• Update rule, with learning rate $\eta > 0$: $\Delta w = \eta \nabla_w l(w)$, i.e., $w_i^{(t+1)} = w_i^{(t)} + \eta \left. \dfrac{\partial l(w)}{\partial w_i} \right|_t$

Gradient Ascent for Logistic Regression

The partial derivative for each weight looks at the actual labels of the examples and compares them to the current predictions ("predict what the current weights think label Y should be"); for each example $j$, that difference is multiplied by the feature value $x_i^j$, and the results are summed.

Gradient ascent algorithm: iterate until the change is smaller than $\epsilon$.

For $t = 1, \dots, T$, repeat:

$w_0^{(t+1)} = w_0^{(t)} + \eta \sum_j \left[ Y^j - \hat{P}\!\left( Y^j = 1 \mid X^j, w^{(t)} \right) \right]$

$w_i^{(t+1)} = w_i^{(t)} + \eta \sum_j x_i^j \left[ Y^j - \hat{P}\!\left( Y^j = 1 \mid X^j, w^{(t)} \right) \right]$
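
A runnable sketch of these updates (the step size, stopping threshold, and toy data below are illustrative assumptions, not values from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent_lr(X, y, eta=0.1, eps=1e-6, max_iter=10000):
    n, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(max_iter):
        p1 = sigmoid(w0 + X @ w)            # current P_hat(Y=1 | x^j, w)
        err = y - p1                         # y^j - P_hat(Y^j = 1 | x^j, w)
        w0_new = w0 + eta * err.sum()        # bias update
        w_new = w + eta * (X.T @ err)        # per coordinate: sum_j x_i^j * err_j
        change = max(abs(w0_new - w0), np.max(np.abs(w_new - w)))
        w0, w = w0_new, w_new
        if change < eps:                     # iterate until change < eps
            break
    return w0, w

# Illustrative toy data: label 1 when the first feature is large
X = np.array([[2.0, 0.1], [1.5, -0.3], [-1.0, 0.2], [-2.0, 0.4]])
y = np.array([1, 1, 0, 0])
w0, w = gradient_ascent_lr(X, y)
print(w0, w, sigmoid(w0 + X @ w))
```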

Gradient ascent is the simplest of the optimization approaches; alternatives include Newton's method, conjugate gradient ascent, and IRLS (see Bishop 4.3.3).
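
As an aside not taken from the slides: since the objective is concave, one can also hand the negative conditional log likelihood and its gradient to an off-the-shelf optimizer as a sanity check. The sketch below uses scipy.optimize.minimize with BFGS; the data and function names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_cll(theta, X, y):
    # theta = [w0, w_1, ..., w_d]; minimize the negative of the concave objective
    z = theta[0] + X @ theta[1:]
    return -np.sum(y * z - np.log1p(np.exp(z)))

def neg_cll_grad(theta, X, y):
    z = theta[0] + X @ theta[1:]
    err = y - sigmoid(z)
    return -np.concatenate(([err.sum()], X.T @ err))

# Illustrative data
X = np.array([[2.0, 0.1], [1.5, -0.3], [-1.0, 0.2], [-2.0, 0.4]])
y = np.array([1.0, 1.0, 0.0, 0.0])
res = minimize(neg_cll, np.zeros(X.shape[1] + 1), args=(X, y),
               jac=neg_cll_grad, method="BFGS")
print(res.x)  # [w0, w_1, ..., w_d] at (approximately) the maximizer
```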

Effect of the step size $\eta$

• Large $\eta$: fast convergence, but a larger residual error and possible oscillations.
• Small $\eta$: slow convergence, but a small residual error.

That's all M(C)LE. How about MAP?

• One common approach is to define a prior on w: $P(w \mid Y, X) \propto P(Y \mid X, w)\, P(w)$
  e.g., a Normal distribution with zero mean and identity covariance; this "pushes" the parameters towards zero.
• This corresponds to regularization: it helps avoid very large weights and overfitting. More on this later in the semester.
• M(C)AP estimate: $w^* = \arg\max_w \ln \left[ P(w) \prod_{j=1}^{n} P(Y^j \mid X^j, w) \right]$
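
A hedged sketch of the M(C)AP idea: with a zero-mean Gaussian prior, the gradient of the log posterior simply gains an extra $-\lambda w_i$ term. The prior precision lam, the choice to leave the bias unpenalized, and the toy data are all illustrative assumptions, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_gradient_ascent_lr(X, y, lam=0.1, eta=0.1, n_iter=2000):
    # Zero-mean Gaussian prior on w <=> subtract lam * w_i from each gradient component.
    # Leaving the bias w0 unpenalized is an extra assumption, not stated on the slides.
    n, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(n_iter):
        err = y - sigmoid(w0 + X @ w)
        w0 += eta * err.sum()
        w += eta * (X.T @ err - lam * w)   # likelihood gradient plus prior term
    return w0, w

# Illustrative data: the prior keeps the weights from growing without bound
X = np.array([[2.0, 0.1], [1.5, -0.3], [-1.0, 0.2], [-2.0, 0.4]])
y = np.array([1, 1, 0, 0])
print(map_gradient_ascent_lr(X, y))
```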

What you should know

• LR is a linear classifier: the decision rule is a hyperplane.
• LR is optimized via the conditional likelihood:
  - no closed-form solution;
  - the objective is concave, so gradient ascent reaches the global optimum;
  - the maximum conditional a posteriori estimate corresponds to regularization.