Linear Regression CS771: Introduction to Machine Learning



Presentation Transcript

1. Linear Regression CS771: Introduction to Machine Learning (Nisheeth)

2. Linear Regression: Pictorially
Linear regression is like fitting a line or (hyper)plane to a set of points. The line/plane must also predict outputs well for unseen (test) inputs.
[Figures: a line fit to a single input feature vs. the output, and a plane fit to two input features (Feature 1, Feature 2) vs. the output.]
What if a line/plane doesn't model the input-output relationship very well, e.g., if their relationship is better modeled by a nonlinear curve or curved surface? Do linear models become useless in such cases?
No. We can even fit a curve using a linear model after suitably transforming the inputs. The transformation can be predefined or learned (e.g., using kernel methods or a deep neural network based feature extractor). More on this later.
[Figures: with the original (single) feature a nonlinear curve is needed; after mapping the input to two features, a plane (linear) can be fit.]
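To make the feature-transformation idea concrete, here is a minimal sketch (my own, not from the slides) that fits a nonlinear curve with a linear model by mapping a single input feature x to the hand-designed features (1, x, x^2):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * rng.normal(size=100)   # nonlinear ground truth

# Transform the single feature into [1, x, x^2]; the model stays linear in the weights
X = np.column_stack([np.ones_like(x), x, x**2])
w = np.linalg.lstsq(X, y, rcond=None)[0]

print(w)   # approximately [1.0, 2.0, -0.5]
```

The fitted curve is nonlinear in the original input x, but the model remains linear in the parameters w, which is why linear regression machinery still applies.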

3. Simplest Possible Linear Regression Model
This is the base model for all statistical machine learning. x is a single-feature data variable, and y is the value we are trying to predict. The regression model is
y = w0 + w1 * x + eps
There are two parameters to estimate: the slope of the line w1 and the y-intercept w0. eps is the unexplained, random, or error component.
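As a small illustration (my own, not the slide's), we can generate data from this model and recover the two parameters, e.g., with numpy's degree-1 polynomial fit:

```python
import numpy as np

rng = np.random.default_rng(1)
w0_true, w1_true = 3.0, 1.5
x = rng.uniform(0, 10, size=200)
eps = rng.normal(scale=0.5, size=200)       # unexplained/random error component
y = w0_true + w1_true * x + eps

w1_hat, w0_hat = np.polyfit(x, y, deg=1)    # returns [slope, intercept]
print(w0_hat, w1_hat)                       # close to 3.0 and 1.5
```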

4. Solving the regression problem
We basically want to find {w0, w1} that minimize deviations from the predictor line. How do we do it?
- Iterate over all possible w values along the two dimensions?
- Same, but smarter? [next class]
- No, we can do this in closed form with just plain calculus.
Very few optimization problems in ML have closed-form solutions; the ones that do are interesting for that reason.

5. Parameter estimation via calculus
We just need to set the partial derivatives of the squared error with respect to w0 and w1 to zero (full derivation), then simplify to obtain closed-form expressions for w0 and w1.
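For reference, a worked version of the step this slide sketches, for the model y = w0 + w1 x + eps from the previous slide (the standard normal-equation derivation):

```latex
% Least-squares objective for the two-parameter model
E(w_0, w_1) = \sum_{n=1}^{N} \left( y_n - w_0 - w_1 x_n \right)^2

% Setting the partial derivatives to zero
\frac{\partial E}{\partial w_0} = -2 \sum_{n=1}^{N} \left( y_n - w_0 - w_1 x_n \right) = 0
\qquad
\frac{\partial E}{\partial w_1} = -2 \sum_{n=1}^{N} x_n \left( y_n - w_0 - w_1 x_n \right) = 0

% Simplifying (with \bar{x}, \bar{y} the sample means of the inputs and outputs)
w_1 = \frac{\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})}{\sum_{n=1}^{N} (x_n - \bar{x})^2}
\qquad
w_0 = \bar{y} - w_1 \bar{x}
```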

6. More generally
Given: Training data with N input-output pairs {(x_n, y_n)}, n = 1, ..., N, with x_n in R^D and y_n in R.
Goal: Learn a model f that predicts the output for new test inputs.
Assume the function that approximates the I/O relationship to be a linear model: f(x) = w^T x.
Let's write the total error or "loss" of this model over the training data as
L(w) = sum_{n=1}^N loss(y_n, w^T x_n)
where loss(y_n, w^T x_n) measures the prediction error or "loss" or "deviation" of the model on a single training input.
The goal of learning is to find the w that minimizes this loss and also does well on test data.
Unlike models like KNN and DT, here we have an explicit problem-specific objective (loss function) that we wish to optimize for.
We can also write all the inputs and outputs compactly using matrix-vector notation: stacking the inputs as rows of X (N x D) and the outputs as y (N x 1), the model is y approximately equal to Xw.
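A minimal sketch of this total training loss in matrix-vector form (my own illustration; the per-example loss is left pluggable, with squared error as the default, which is the special case the next slide uses):

```python
import numpy as np

def training_loss(w, X, y, per_example_loss=lambda err: err ** 2):
    """L(w) = sum_n loss(y_n, w^T x_n), with the per-example loss applied to each error."""
    errors = y - X @ w          # X is N x D (one row per training example), w is D-dimensional
    return np.sum(per_example_loss(errors))
```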

7. Linear Regression with Squared Loss
In this case, the loss function will be
L(w) = sum_{n=1}^N (y_n - w^T x_n)^2
In matrix-vector notation, we can write it compactly as L(w) = (y - Xw)^T (y - Xw) = ||y - Xw||^2.
This is the "least squares" (LS) problem (Gauss-Legendre, 18th century).
Let us find the w that optimizes (minimizes) the above squared loss. We need calculus and optimization to do this! The LS problem can be solved easily and has a closed-form solution:
w = (X^T X)^{-1} X^T y
(Link to a nice derivation.) Note: this requires inverting a D x D matrix, which can be expensive. There are ways to handle this; we will see them later.
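A direct translation of the closed-form solution into numpy (a sketch; in practice one solves the normal equations rather than forming the explicit inverse):

```python
import numpy as np

def least_squares(X, y):
    """Closed-form least-squares solution w = (X^T X)^{-1} X^T y."""
    # Solving the D x D linear system is cheaper and more stable than an explicit inverse
    return np.linalg.solve(X.T @ X, X.T @ y)

# Example usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)
print(least_squares(X, y))    # approximately [1.0, -2.0, 0.5]
```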

8. Alternative loss functions
Many possible loss functions exist for regression problems. The choice of loss function usually depends on the nature of the data; also, some loss functions result in an easier optimization problem than others.
- Squared loss: the squared error (y_n - f(x_n))^2. Very commonly used for regression; leads to an easy-to-solve optimization problem.
- Absolute loss: the absolute error |y_n - f(x_n)|. Grows more slowly than squared loss, thus better suited when the data has some outliers (inputs on which the model makes large errors).
- Huber loss: squared loss for small errors (say up to delta); absolute loss for larger errors. Good for data with outliers.
- epsilon-insensitive loss (a.k.a. Vapnik loss): zero loss for small errors (say up to epsilon); absolute loss for larger errors. Note: can also use squared loss instead of absolute loss for the larger errors.
[Figures: plots of each loss as a function of the prediction error.]
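The four losses as functions of the error e = y - f(x), in a hedged sketch (delta and epsilon are the thresholds mentioned above; the Huber form below uses the conventional 1/2 factors so the quadratic and linear pieces join smoothly, which the slide does not spell out):

```python
import numpy as np

def squared_loss(e):
    return e ** 2

def absolute_loss(e):
    return np.abs(e)

def huber_loss(e, delta=1.0):
    # Quadratic for |e| <= delta, linear (absolute-like) beyond that
    quadratic = 0.5 * e ** 2
    linear = delta * (np.abs(e) - 0.5 * delta)
    return np.where(np.abs(e) <= delta, quadratic, linear)

def eps_insensitive_loss(e, eps=0.1):
    # Zero inside the epsilon tube, absolute loss outside it
    return np.maximum(0.0, np.abs(e) - eps)
```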

9. How do we ensure generalization?
We minimized the objective w.r.t. w and got w = (X^T X)^{-1} X^T y.
Problem: The matrix X^T X may not be invertible. This may lead to non-unique solutions for w.
Problem: Overfitting, since we only minimized the loss defined on the training data. Weights may become arbitrarily large to fit the training data perfectly; such weights may perform poorly on the test data, however.
One solution: Minimize a regularized objective L_reg(w) = L(w) + lambda * R(w). The regularizer will prevent the elements of w from becoming too large. Reason: now we are minimizing training error + the magnitude of the vector w.
R(w) is called the regularizer and measures the "magnitude" of w. lambda is the regularization hyperparameter; it controls how much we wish to regularize (needs to be tuned via cross-validation).
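As a sketch of the regularized objective, using the squared l2 norm for R(w) (the concrete choice the next slide makes; lambda is the hyperparameter to be tuned):

```python
import numpy as np

def regularized_objective(w, X, y, lam):
    """L_reg(w) = training loss + lam * R(w), with R(w) = squared l2 norm of w."""
    train_loss = np.sum((y - X @ w) ** 2)   # squared-error training loss
    regularizer = np.sum(w ** 2)            # penalizes large weight magnitudes
    return train_loss + lam * regularizer
```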

10. Regularized Least Squares (a.k.a. Ridge Regression)
Recall that the regularized objective is of the form L_reg(w) = L(w) + lambda * R(w).
One possible/popular regularizer: the squared Euclidean (l2 squared) norm of w, R(w) = ||w||^2 = w^T w.
With this regularizer, we have the regularized least squares problem as
w_ridge = argmin_w sum_{n=1}^N (y_n - w^T x_n)^2 + lambda * ||w||^2
Proceeding just like the LS case, we can find the optimal w, which is given by
w_ridge = (X^T X + lambda * I_D)^{-1} X^T y
Why is the method called "ridge" regression? Look at the form of the solution: we are adding a small value lambda to the diagonals of the D x D matrix X^T X (like adding a ridge/mountain to some land).
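The ridge solution in numpy, again solving the linear system rather than forming the inverse (a sketch):

```python
import numpy as np

def ridge_regression(X, y, lam):
    """w = (X^T X + lam * I)^{-1} X^T y; lam > 0 also makes the matrix invertible."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)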

11. A closer look at regularization
The regularized objective we minimized is L_reg(w) = sum_{n=1}^N (y_n - w^T x_n)^2 + lambda * ||w||^2.
Minimizing it w.r.t. w gives a solution for w that (1) keeps the training error small, and (2) has a small squared l2 norm ||w||^2 = sum_d w_d^2. Small entries in w are good since they lead to "smooth" models. This is good because, consequently, the individual entries of the weight vector are also prevented from becoming too large. Remember: in general, weights with large magnitude are bad since they can cause overfitting on the training data and may not work well on test data.
Example from the slide: two training inputs with feature vectors [1.2, 0.5, 2.4, 0.3, 0.8, 0.1, 0.9, 2.1], exactly the same except for a small difference in just one feature, yet with very different outputs (maybe one of these two training examples is an outlier). A typical w learned without regularization: [10000, 3.2, 1.8, 1.3, 2.1, 2.5, 3.1, 0.1]. Just to fit the training data where one of the inputs was possibly an outlier, this weight became too big. Such a weight vector will possibly do poorly on normal test inputs. It is not a "smooth" model, since its test-data predictions may change drastically even with small changes in some feature's value.
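A tiny numerical illustration of this point (my own construction, not the slide's numbers): two inputs that differ only slightly in one feature but have very different outputs force an unregularized fit to use a huge weight, while even a small amount of ridge regularization keeps the weights small.

```python
import numpy as np

# Two training examples, two features; the second feature differs by only 0.001
X = np.array([[1.0, 1.000],
              [1.0, 1.001]])
y = np.array([1.0, 10.0])       # very different outputs (one may be an outlier)

w_unreg = np.linalg.solve(X.T @ X, X.T @ y)                     # exact fit to the data
w_ridge = np.linalg.solve(X.T @ X + 0.1 * np.eye(2), X.T @ y)   # ridge with lambda = 0.1

print(w_unreg)   # roughly [-8999, 9000]: huge weights, not a "smooth" model
print(w_ridge)   # small entries: trades a bit of training error for smoothness
```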

12. Other Ways to Control Overfitting
Use a regularizer defined by other norms, e.g.:
- l1 norm regularizer: R(w) = ||w||_1 = sum_d |w_d|
- l0 norm regularizer: ||w||_0 (counts the number of nonzeros in w)
When should I use these regularizers instead of the l2 regularizer? Use them if you have a very large number of features but many irrelevant features. These regularizers can help in automatic feature selection: using such regularizers gives a sparse weight vector as the solution. Sparse means many entries in w will be zero or near zero; those features will be considered irrelevant by the model and will not influence prediction. Note that optimizing loss functions with such regularizers is usually harder than ridge regression, but several advanced techniques exist (we will see some of those later). A quick illustration of the sparsity effect follows below.
Use non-regularization-based approaches:
- Early stopping (stopping training as soon as we have decent validation-set accuracy)
- Dropout (in each iteration, don't update some of the weights)
- Injecting noise into the inputs
All of these are very popular ways to control overfitting in deep learning models. More on these later when we talk about deep learning.
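For a quick feel of the sparsity effect, here is an illustration using scikit-learn's Ridge (l2) and Lasso (l1-regularized least squares) estimators; this is my own example, not part of the slides, and the regularization strengths are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
N, D = 100, 20
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:3] = [2.0, -1.5, 0.7]            # only 3 of the 20 features are relevant
y = X @ w_true + 0.1 * rng.normal(size=N)

ridge = Ridge(alpha=1.0).fit(X, y)        # l2 (squared) regularizer: shrinks all weights
lasso = Lasso(alpha=0.1).fit(X, y)        # l1 regularizer: drives many weights to exactly zero

print("nonzero weights (ridge):", np.sum(np.abs(ridge.coef_) > 1e-6))   # typically all 20
print("nonzero weights (lasso):", np.sum(np.abs(lasso.coef_) > 1e-6))   # typically close to 3
```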

13. Linear Regression as Solving a System of Linear Equations
The form of the linear regression model y ~ Xw is akin to a system of linear equations. Assuming N training examples with D features each, we have
First training example:  y_1 = x_{11} w_1 + x_{12} w_2 + ... + x_{1D} w_D
Second training example: y_2 = x_{21} w_1 + x_{22} w_2 + ... + x_{2D} w_D
...
N-th training example:   y_N = x_{N1} w_1 + x_{N2} w_2 + ... + x_{ND} w_D
Note: here x_{nd} denotes the d-th feature of the n-th training example. This is a system of N equations in D unknowns, written compactly as y = Xw, where X is N x D, y is N x 1, and w is D x 1. Now solve this!
However, in regression we rarely have N = D, but rather N < D or N > D. Thus we have an underdetermined (N < D) or overdetermined (N > D) system. Methods to solve over/underdetermined systems can be used for linear regression as well; many of these methods don't require expensive matrix inversion.
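For instance, numpy's least-squares solver handles both cases without an explicit matrix inversion (a sketch: it returns the least-squares solution when the system is overdetermined and the minimum-norm solution when it is underdetermined):

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined (N > D): least-squares solution of y ~ Xw
X_over = rng.normal(size=(50, 5))
y_over = rng.normal(size=50)
w_over, *_ = np.linalg.lstsq(X_over, y_over, rcond=None)

# Underdetermined (N < D): lstsq returns the minimum-norm solution
X_under = rng.normal(size=(5, 50))
y_under = rng.normal(size=5)
w_under, *_ = np.linalg.lstsq(X_under, y_under, rcond=None)

print(w_over.shape, w_under.shape)   # (5,), (50,)
```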

14. Next Lecture
Solving linear regression using iterative optimization methods, which are faster and don't require matrix inversion. Brief intro to optimization techniques.