Regularization
Jia-Bin Huang, Virginia Tech, Spring 2019, ECE-5424G / CS-5824
Administrative: Women in Data Science Blacksburg
Location: Holtzman Alumni Center
- Welcome, 3:30-3:40, Assembly hall
- Keynote Speaker: Milinda Lakkam, "Detecting automation on LinkedIn's platform," 3:40-4:05, Assembly hall
- Career Panel, 4:05-5:00, Assembly hall
- Break, 5:00-5:20, Grand hall
- Keynote Speaker: Sally Morton, "Bias," 5:20-5:45, Assembly hall
- Dinner with breakout discussion groups, 5:45-7:00, Museum
- Introductory track tutorial: Jennifer Van Mullekom, "Data Visualization," 7:00-8:15, Assembly hall
- Advanced track tutorial: Cheryl Danner, "Focal-loss-based Deep Learning for Object Detection," 7:00-8:15, 2nd floor board room
k-NN (Classification/Regression)
- Model: non-parametric; the training set itself is the model
- Cost function: none
- Learning: do nothing (just store the training data)
- Inference: $\hat{y} = y^{(\mathrm{nn})}$, where $\mathrm{nn} = \arg\min_{i} d(x, x^{(i)})$ (majority vote / average over the $k$ nearest neighbors for $k > 1$)
Linear regression (Regression)
- Model: $h_\theta(x) = \theta^\top x$
- Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$
- Learning:
  1) Gradient descent: Repeat { $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big) x_j^{(i)}$ }
  2) Solving the normal equation: $\theta = (X^\top X)^{-1} X^\top y$
- Inference: $\hat{y} = \theta^\top x$
Naïve Bayes (Classification)
- Model: $P(Y \mid X) \propto P(Y) \prod_j P(X_j \mid Y)$
- Cost function: maximum likelihood estimation; maximum a posteriori estimation
- Learning: estimate $P(Y)$ and, for each feature, $P(X_j \mid Y)$ by counting (discrete $X_j$) or by fitting a Gaussian with class-conditional mean $\mu_{jy}$ and variance $\sigma_{jy}^2$ (continuous $X_j$)
- Inference: $\hat{y} = \arg\max_y P(y) \prod_j P(x_j \mid y)$
Logistic regression (Classification)
- Model: $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
- Cost function: $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \big]$
- Learning: gradient descent: Repeat { $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big) x_j^{(i)}$ }
- Inference: predict $y = 1$ if $h_\theta(x) \geq 0.5$
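The learning and inference steps above can be sketched in NumPy (a minimal sketch, not from the slides; `logistic_gd` and its defaults are my own names, and `X` is assumed to carry a leading column of ones for the intercept):

```python
import numpy as np

def sigmoid(z):
    # logistic function h(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, alpha=0.5, iters=2000):
    """Batch gradient descent for (unregularized) logistic regression."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        # gradient of the cross-entropy cost: (1/m) X^T (h - y)
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= alpha * grad
    return theta
```

On linearly separable data the learned boundary lands between the two classes, and thresholding $h_\theta(x)$ at 0.5 reproduces the labels.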
Logistic Regression
- Hypothesis representation
- Cost function
- Logistic regression with gradient descent
- Regularization
- Multi-class classification
How about MAP?
- Maximum conditional likelihood estimate (MCLE): $\theta_{\mathrm{MCLE}} = \arg\max_\theta \prod_{i} P\big(y^{(i)} \mid x^{(i)}, \theta\big)$
- Maximum conditional a posteriori estimate (MCAP): $\theta_{\mathrm{MCAP}} = \arg\max_\theta P(\theta) \prod_{i} P\big(y^{(i)} \mid x^{(i)}, \theta\big)$
Prior $P(\theta)$
- Common choice of $P(\theta)$: normal distribution, zero mean, identity covariance
- "Pushes" parameters towards zero
- Corresponds to $\ell_2$ regularization
- Helps avoid very large weights and overfitting
Slide credit: Tom Mitchell
MLE vs. MAP
- Maximum conditional likelihood estimate (MCLE): $\theta_{\mathrm{MCLE}} = \arg\max_\theta \prod_{i} P\big(y^{(i)} \mid x^{(i)}, \theta\big)$
- Maximum conditional a posteriori estimate (MCAP): $\theta_{\mathrm{MCAP}} = \arg\max_\theta P(\theta) \prod_{i} P\big(y^{(i)} \mid x^{(i)}, \theta\big)$
Logistic Regression
- Hypothesis representation
- Cost function
- Logistic regression with gradient descent
- Regularization
- Multi-class classification
Multi-class classification
- Email foldering/tagging: Work, Friends, Family, Hobby
- Medical diagnosis: Not ill, Cold, Flu
- Weather: Sunny, Cloudy, Rain, Snow
Slide credit: Andrew Ng
[Figure: binary classification vs. multiclass classification.]
One-vs-all (one-vs-rest)
[Figure: a three-class dataset split into three binary problems, one per class: class 1 vs. rest, class 2 vs. rest, class 3 vs. rest.]
Slide credit: Andrew Ng
One-vs-all
- Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
- Given a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
Slide credit: Andrew Ng
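A minimal NumPy sketch of the one-vs-all recipe (function names and the inner gradient-descent trainer are my own, not from the slides; `X` is assumed to include a bias column of ones, and labels are integers `0..K-1`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.2, iters=5000):
    # plain batch gradient descent for one binary logistic classifier
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta

def one_vs_all(X, y, num_classes):
    # one binary classifier per class: class c is positive, the rest negative
    return np.stack([train_logistic(X, (y == c).astype(float))
                     for c in range(num_classes)])

def predict(X, Theta):
    # pick the class whose classifier assigns the highest probability
    return sigmoid(X @ Theta.T).argmax(axis=1)
```

Each row of `Theta` is one binary classifier; `predict` applies all of them and takes the argmax of the probabilities, exactly as the slide describes.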
Generative Approach (e.g., Naïve Bayes)
- Estimate $P(Y)$ and $P(X \mid Y)$
- Prediction: $\hat{y} = \arg\max_y P(y)\, P(x \mid y)$
Discriminative Approach (e.g., logistic regression)
- Estimate $P(Y \mid X)$ directly (or a discriminant function: e.g., SVM)
- Prediction: $\hat{y} = \arg\max_y P(y \mid x)$
Further readings
- Tom M. Mitchell. Generative and discriminative classifiers: Naïve Bayes and logistic regression. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
- Andrew Ng, Michael Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf
Regularization
- Overfitting
- Cost function
- Regularized linear regression
- Regularized logistic regression
Example: Linear regression
[Figure: price ($) in 1000's vs. size in feet², fit with three models of increasing complexity: underfitting, just right, overfitting.]
Slide credit: Andrew Ng
Overfitting
If we have too many features (i.e., a complex model), the learned hypothesis may fit the training set very well but fail to generalize to new examples (e.g., fail to predict prices of new houses).
Slide credit: Andrew Ng
Example: Linear regression
[Figure: the same three fits of price ($) in 1000's vs. size in feet²; underfitting corresponds to high bias, overfitting to high variance.]
Slide credit: Andrew Ng
Bias-Variance Tradeoff
- Bias: difference between what you expect to learn and the truth
  - Measures how well you expect to represent the true solution
  - Decreases with more complex models
- Variance: difference between what you expect to learn and what you learn from a particular dataset
  - Measures how sensitive the learner is to the specific dataset
  - Increases with more complex models
[Figure: dartboard illustration of the four combinations: low/high bias × low/high variance.]
Bias–variance decomposition
Training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, with $y = f(x) + \varepsilon$, $E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$. We want $\hat{f}$ that minimizes $E\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}[\hat{f}(x)]^2 + \mathrm{Var}[\hat{f}(x)] + \sigma^2$.
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
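With $y = f(x) + \varepsilon$, $E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\varepsilon$ independent of the learned $\hat{f}$, the decomposition follows by expanding the square (the cross terms vanish because $E[\varepsilon] = 0$ and $E\big[\hat{f} - E[\hat{f}]\big] = 0$):

```latex
\begin{aligned}
E\big[(y - \hat{f}(x))^2\big]
  &= E\big[(f(x) + \varepsilon - \hat{f}(x))^2\big] \\
  &= \big(f(x) - E[\hat{f}(x)]\big)^2
     + E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]
     + E[\varepsilon^2] \\
  &= \mathrm{Bias}[\hat{f}(x)]^2 + \mathrm{Var}[\hat{f}(x)] + \sigma^2 .
\end{aligned}
```

The noise term $\sigma^2$ is irreducible: no choice of $\hat{f}$ can remove it.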
Overfitting
[Figure: tumor size vs. age with three decision boundaries: underfitting, just right, overfitting.]
Slide credit: Andrew Ng
Addressing overfitting
Features: size of house, no. of bedrooms, no. of floors, age of house, average income in neighborhood, kitchen size.
[Figure: high-order polynomial fit of price ($) in 1000's vs. size in feet².]
Slide credit: Andrew Ng
Addressing overfitting
1. Reduce the number of features.
   - Manually select which features to keep.
   - Model selection algorithm (later in the course).
2. Regularization.
   - Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
   - Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
Slide credit: Andrew Ng
Overfitting Thriller https://www.youtube.com/watch?v=DQWI1kvmwRg
Regularization
- Overfitting
- Cost function
- Regularized linear regression
- Regularized logistic regression
Intuition
Suppose we penalize $\theta_3$ and $\theta_4$ and make them really small; the high-order fit then behaves like a quadratic.
[Figure: quadratic vs. high-order polynomial fit of price ($) in 1000's vs. size in feet².]
Slide credit: Andrew Ng
Regularization
Small values for parameters $\theta_0, \theta_1, \ldots, \theta_n$:
- "Simpler" hypothesis
- Less prone to overfitting
Housing:
- Features: $x_1, x_2, \ldots, x_{100}$
- Parameters: $\theta_0, \theta_1, \ldots, \theta_{100}$
Slide credit: Andrew Ng
Regularization
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
$\lambda$: Regularization parameter
Slide credit: Andrew Ng
Question
What if $\lambda$ is set to an extremely large value?
- Algorithm works fine; setting $\lambda$ to be very large can't hurt it
- Algorithm fails to eliminate overfitting
- Algorithm results in underfitting (fails to fit even the training data well)
- Gradient descent will fail to converge
Slide credit: Andrew Ng
Question
What if $\lambda$ is set to an extremely large value?
[Figure: with very large $\lambda$, $\theta_1, \ldots, \theta_n \approx 0$, so $h_\theta(x) \approx \theta_0$: a flat line that underfits the price-vs.-size data.]
Slide credit: Andrew Ng
Regularization
- Overfitting
- Cost function
- Regularized linear regression
- Regularized logistic regression
Regularized linear regression
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
$n$: number of features; $\theta_0$ is not penalized
Slide credit: Andrew Ng
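The regularized cost can be written directly in NumPy (a minimal sketch; `ridge_cost` is my own name, and `theta[0]` is treated as the unpenalized intercept):

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """J(theta) = (1/2m) [ sum of squared residuals + lam * sum_{j>=1} theta_j^2 ].
    theta[0] (the intercept) is deliberately excluded from the penalty."""
    m = len(y)
    residual = X @ theta - y
    return (residual @ residual + lam * theta[1:] @ theta[1:]) / (2 * m)
```

When `theta` fits the data exactly, only the penalty term remains, which makes the effect of `lam` easy to see.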
Gradient descent (Previously)
Repeat {
$\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big) x_j^{(i)}$  ($j = 0, 1, \ldots, n$)
}
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big) x_0^{(i)}$
$\theta_j \leftarrow \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$  ($j = 1, \ldots, n$)
}
Slide credit: Andrew Ng
Comparison
Regularized linear regression: $\theta_j \leftarrow \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big) x_j^{(i)}$
Un-regularized linear regression: $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big) x_j^{(i)}$
$\left(1 - \alpha \frac{\lambda}{m}\right) < 1$: Weight decay (each step shrinks $\theta_j$ before the usual gradient update)
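The weight-decay form of the update is easy to verify numerically (a sketch with my own function name `ridge_gd`; `X` is assumed to include an intercept column, which is left unpenalized):

```python
import numpy as np

def ridge_gd(X, y, lam=1.0, alpha=0.1, iters=5000):
    """Gradient descent for regularized linear regression.
    Adding (lam/m) * theta_j to the gradient is equivalent to shrinking
    theta_j (j >= 1) by the factor (1 - alpha*lam/m) -- weight decay --
    before the usual gradient update."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m
        grad[1:] += (lam / m) * theta[1:]   # the intercept is not penalized
        theta -= alpha * grad
    return theta
```

Run to convergence, this reaches the same solution as the regularized normal equation.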
Normal equation
$\theta = \big(X^\top X + \lambda I'\big)^{-1} X^\top y$, where $I'$ is the $(n{+}1) \times (n{+}1)$ identity matrix with the top-left entry (for $\theta_0$) set to 0. For $\lambda > 0$, the matrix is invertible even when $X^\top X$ is singular.
Slide credit: Andrew Ng
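A direct implementation of this closed form (a sketch; `ridge_normal_equation` is my own name, and `X` is assumed to include an intercept column first):

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """theta = (X^T X + lam * I')^{-1} X^T y, where I' is the identity with
    its top-left entry zeroed so the intercept theta_0 is not penalized."""
    n = X.shape[1]
    Ip = np.eye(n)
    Ip[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * Ip, X.T @ y)
```

With $\lambda = 0$ this reduces to ordinary least squares; increasing $\lambda$ shrinks the slope coefficients towards zero while leaving the intercept free.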
Regularization
- Overfitting
- Cost function
- Regularized linear regression
- Regularized logistic regression
Regularized logistic regression
Cost function:
$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
[Figure: tumor size vs. age with a regularized (smoother) decision boundary.]
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big) x_0^{(i)}$
$\theta_j \leftarrow \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$  ($j = 1, \ldots, n$)
}
Same update rule as regularized linear regression, but with $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$.
Slide credit: Andrew Ng
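The same update in NumPy (a minimal sketch; `reg_logistic_gd` and its defaults are my own, and `X` is assumed to include an intercept column of ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_logistic_gd(X, y, lam=1.0, alpha=0.1, iters=5000):
    """Gradient descent for L2-regularized logistic regression.
    Identical to the regularized linear-regression update, except that the
    hypothesis is the sigmoid; theta[0] (the intercept) is not penalized."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        grad[1:] += (lam / m) * theta[1:]
        theta -= alpha * grad
    return theta
```

A larger `lam` yields smaller weights, which is exactly the shrinkage effect the regularizer is meant to produce.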
$\ell_1$: Lasso regularization
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$
LASSO: Least Absolute Shrinkage and Selection Operator
Single predictor: Soft Thresholding
With a single (standardized) predictor, the lasso solution is $\hat{\theta} = S\!\left(\frac{1}{m} \langle x, y \rangle, \lambda\right)$, where the soft-thresholding operator is $S(z, \gamma) = \operatorname{sign}(z) \max(|z| - \gamma, 0)$.
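The soft-thresholding operator itself is a one-liner (a sketch; the function name is mine):

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0): shrink z towards zero by
    gamma, and set it to exactly zero when |z| <= gamma."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)
```

Setting small coefficients to exactly zero is what gives the lasso its feature-selection behavior, in contrast to the ridge penalty, which only shrinks.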
Multiple predictors: Cyclic Coordinate Descent
For each $j$, update $\theta_j \leftarrow S\!\left(\frac{1}{m} \langle x_j, r_j \rangle, \lambda\right)$, where $r_j = y - \sum_{k \neq j} x_k \theta_k$ is the partial residual (predictors assumed standardized).
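A sketch of cyclic coordinate descent for the lasso objective $\frac{1}{2m}\|y - X\theta\|^2 + \lambda\|\theta\|_1$ (my own function name; the `z` factor generalizes the update to columns that are not exactly unit-variance):

```python
import numpy as np

def lasso_cd(X, y, lam, iters=100):
    """Cyclic coordinate descent for the lasso.
    Each coordinate update soft-thresholds the univariate least-squares
    coefficient computed on the partial residual."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        for j in range(n):
            # partial residual: remove every feature's contribution except j's
            r = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r / m
            z = X[:, j] @ X[:, j] / m
            theta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return theta
```

With $\lambda = 0$ this recovers least squares; with a large $\lambda$ every coefficient is thresholded to exactly zero.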
L1 and L2 balls Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf
Terminology
- $\lambda \|\theta\|_2^2$ (Tikhonov regularization): Ridge regression; closed-form solution
- $\lambda \|\theta\|_1$: LASSO regression; proximal gradient descent, least angle regression
- $\lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2$ (Elastic net regularization): proximal gradient descent
Things to remember
- Overfitting
- Cost function
- Regularized linear regression
- Regularized logistic regression