
1. Lecture 5: Linear Regression, Confidence Intervals, Standard Errors

2. Lecture Outline
Simple Regression: predictor variables, standard errors, evaluating significance of predictors, hypothesis testing, how well do we know $\hat{\beta}$?, how well do we know $\hat{y}$?
Multiple Linear Regression: categorical predictors, collinearity, hypothesis testing, interaction terms
Polynomial Regression

3. Standard Errors
The standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$ are called their standard errors, $SE(\hat{\beta}_0)$ and $SE(\hat{\beta}_1)$. If our data are drawn from a larger set of observations, then we can estimate the standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$ empirically through bootstrapping. If we know the variance $\sigma^2$ of the noise $\epsilon$, we can compute the standard errors analytically, using the formulae below:
$SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}, \qquad SE(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2} \right]$
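Both routes take only a few lines of Python. Below is a minimal sketch on synthetic data (the slides' Advertising data is not included here, so the true intercept, slope, and noise level are assumptions for illustration, and the function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: y = 2 + 0.5 x + Gaussian noise
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, n)

def fit_line(x, y):
    """Least-squares estimates of the intercept beta0 and slope beta1."""
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()
    return beta0, beta1

# Analytic standard errors, using the known noise variance sigma^2 = 1
sigma2 = 1.0
sxx = np.sum((x - x.mean()) ** 2)
se_beta1 = np.sqrt(sigma2 / sxx)
se_beta0 = np.sqrt(sigma2 * (1.0 / n + x.mean() ** 2 / sxx))

# Bootstrap standard errors: refit the line on resampled (x, y) pairs
boot = np.empty((2000, 2))
for b in range(boot.shape[0]):
    idx = rng.integers(0, n, n)        # sample n indices with replacement
    boot[b] = fit_line(x[idx], y[idx])

print("analytic SE :", se_beta0, se_beta1)
print("bootstrap SE:", boot.std(axis=0))  # columns: SE(beta0), SE(beta1)
```

With enough bootstrap replicates, the two estimates should agree closely, which is exactly the comparison made in the tables a few slides below.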

4. Standard Errors
More data: $n \uparrow$ and $\sum_i (x_i - \bar{x})^2 \uparrow$. Larger coverage of $x$: $\sum_i (x_i - \bar{x})^2 \uparrow$, or equivalently $\bar{x}^2 / \sum_i (x_i - \bar{x})^2 \downarrow$. Better data: $\sigma \downarrow$.
In practice, we do not know the theoretical value of $\sigma$, since we do not know the exact distribution of the noise $\epsilon$.
Remember: $SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}, \qquad SE(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2} \right]$

5. Standard Errors
In practice, we do not know the theoretical value of $\sigma$, since we do not know the exact distribution of the noise $\epsilon$. However, if we make the following assumptions:
the errors $\epsilon_i$ and $\epsilon_j$ are uncorrelated for $i \neq j$,
each $\epsilon_i$ is normally distributed with mean 0 and variance $\sigma^2$,
then we can empirically estimate $\sigma^2$ from the data and our regression line:
$\hat{\sigma}^2 = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - 2}$
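In code, this estimator is a one-liner on the residuals. A hedged sketch (the helper name `sigma_hat` is ours, not from the slides):

```python
import numpy as np

def sigma_hat(x, y):
    """Estimate the noise standard deviation from the residuals of the
    fitted line, dividing by n - 2 (two coefficients were estimated)."""
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()
    residuals = y - (beta0 + beta1 * x)
    return np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))
```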

6. Standard Errors
More data: $n \uparrow$ and $\sum_i (x_i - \bar{x})^2 \uparrow$. Larger coverage of $x$: $\sum_i (x_i - \bar{x})^2 \uparrow$ or $\bar{x}^2 / \sum_i (x_i - \bar{x})^2 \downarrow$. Better data: $\sigma \downarrow$. Better model: smaller residuals, so $\hat{\sigma} \downarrow$.
Question: What happens to $SE(\hat{\beta}_0)$ and $SE(\hat{\beta}_1)$ under these scenarios?

7. Standard Errors
The following results are for the coefficient of TV advertising.
SE of the TV coefficient:
  Analytic formula: 0.0061    Bootstrap: 0.0061
The same coefficient, but restricting the coverage of x:
  Analytic formula: 0.0068    Bootstrap: 0.0068
The same coefficient, but with extra noise added:
  Analytic formula: 0.0028    Bootstrap: 0.0023
This makes no sense?

8. Importance of predictors
We have discussed finding the importance of predictors by examining the sampling distribution of $\hat{\beta}_1$ and determining the cumulative probability from $-\infty$ to 0.

9. Hypothesis Testing
Hypothesis testing is a formal process through which we evaluate the validity of a statistical hypothesis by considering evidence for or against the hypothesis, gathered by random sampling of the data.

10. Hypothesis Testing
[Tables: six copies of a TV vs. sales sample; the sales column is identical in every copy, while the TV values have been shuffled]
Random sampling of the data: shuffle the values of the predictor variable.
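The shuffling idea amounts to a permutation test: shuffling the predictor destroys any real association with the response, so slopes refitted on shuffled copies form a null distribution. A minimal sketch (the arrays `tv` and `sales` in the usage comments are hypothetical stand-ins for the Advertising columns):

```python
import numpy as np

rng = np.random.default_rng(1)

def slope(x, y):
    """Least-squares slope of y on x."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

def permutation_null(x, y, n_shuffles=1000):
    """Null distribution of the slope: shuffling x breaks any real
    association with y, so these slopes arise purely by chance."""
    return np.array([slope(rng.permutation(x), y) for _ in range(n_shuffles)])

# Hypothetical usage with arrays tv and sales:
# null = permutation_null(tv, sales)
# p_value = np.mean(np.abs(null) >= abs(slope(tv, sales)))
```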


12. Importance of predictors
Translate this to Kevin's language: let's look at the distance of the estimated value of the coefficient from zero, in units of its standard error,
$t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$

13. Importance of predictors
We can also evaluate how often a particular value of $t$ occurs by accident (using the shuffled data). We expect that $t$ will have a t-distribution with $n-2$ degrees of freedom. Computing the probability of observing any value equal to $|t|$ or larger, assuming $\beta_1 = 0$, is easy. We call this probability the p-value. A small p-value indicates that it is unlikely we would observe such a substantial association between the predictor and the response due to chance.

14. Hypothesis Testing
Hypothesis testing is a formal process through which we evaluate the validity of a statistical hypothesis by considering evidence for or against the hypothesis, gathered by random sampling of the data.
1. State the hypotheses: typically a null hypothesis, $H_0$, and an alternative hypothesis, $H_1$, that is the negation of the former.
2. Choose a type of analysis, i.e., how to use sample data to evaluate the null hypothesis. Typically this involves choosing a single test statistic.
3. Compute the test statistic.
4. Use the value of the test statistic to either reject or not reject the null hypothesis.

15. Hypothesis testing
1. State the hypotheses.
Null hypothesis $H_0$: there is no relation between X and Y ($\beta_1 = 0$).
Alternative $H_1$: there is some relation between X and Y ($\beta_1 \neq 0$).
2. Choose the test statistic.
To test the null hypothesis, we need to determine whether $\hat{\beta}_1$, our estimate for $\beta_1$, is sufficiently far from zero that we can be confident that $\beta_1$ is non-zero. We use the following test statistic:
$t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$

16. Hypothesis testing
3. Compute the statistic.
Using the estimated $SE(\hat{\beta}_1)$, we calculate the t-statistic.
4. Reject or do not reject the hypothesis.
If there is really no relationship between X and Y, then we expect that $t$ will have a t-distribution with $n-2$ degrees of freedom. Computing the probability of observing any value equal to $|t|$ or larger, assuming $\beta_1 = 0$, is easy. We call this probability the p-value. A small p-value indicates that it is unlikely we would observe such a substantial association between the predictor and the response due to chance.
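Steps 3 and 4 can be sketched as follows in the simple-regression setting of the slides; `scipy.stats.t.sf` gives the upper-tail probability of the t-distribution:

```python
import numpy as np
from scipy import stats

def t_test_slope(x, y):
    """t-statistic and two-sided p-value for H0: beta1 = 0
    in a simple linear regression of y on x."""
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    beta0 = y.mean() - beta1 * x.mean()
    resid = y - (beta0 + beta1 * x)
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))  # residual std. error
    se_beta1 = sigma_hat / np.sqrt(sxx)
    t = beta1 / se_beta1
    p = 2 * stats.t.sf(abs(t), df=n - 2)               # P(|T| >= |t|) under H0
    return t, p
```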

17. Hypothesis testing
P-values for all three predictors, computed independently:
  Analytic formula: 0.353, 0.0023
  Bootstrap: 0.328, 0.0028

18. Things to Consider
Comparison of two models: how do we choose between two different models?
Model fitness: how well does the model predict?
Evaluating significance of predictors: does the outcome depend on the predictors?
How well do we know $\hat{y}$? The confidence intervals of our predictions $\hat{y}$.

19. How well do we know $\hat{y}$?
Our confidence in $\hat{y}$ is directly connected with our confidence in the $\hat{\beta}$s. So for each set of sampled $\hat{\beta}$s we can determine a model.

20. How well do we know $\hat{y}$?
Here we show two different models, given the coefficients fitted on two different subsamples.

21. How well do we know $\hat{y}$?
There is one such regression line for every imaginable sub-sample.

22.-24. How well do we know $\hat{y}$?
Below we show all regression lines for a thousand such sub-samples. For a given $x$, we examine the distribution of $\hat{y}$ and determine its mean and standard deviation.

25. How well do we know $\hat{y}$?
For every $x$, we calculate the mean of the models $\hat{y}$ (shown with a dotted line) and the 95% CI of those models (shaded area).

26. Confidence in predicting $y$
For a given $x$, we have a distribution of models $\hat{f}(x)$. For each of these models, the prediction is $y = \hat{f}(x) + \epsilon$. The prediction CI is therefore wider than the CI of the models, since it must also account for the spread of the noise $\epsilon$.
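A sketch of how such model bands can be computed by bootstrapping (the function and variable names are ours; percentile bands are one common choice for the 95% CI):

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_bands(x, y, x_grid, n_boot=1000):
    """Fit a line to each bootstrap sample and collect the fitted
    values on x_grid, giving a distribution of models at each x."""
    preds = np.empty((n_boot, len(x_grid)))
    for b in range(n_boot):
        idx = rng.integers(0, len(x), len(x))   # resample with replacement
        xb, yb = x[idx], y[idx]
        b1 = np.sum((xb - xb.mean()) * (yb - yb.mean())) / np.sum((xb - xb.mean()) ** 2)
        b0 = yb.mean() - b1 * xb.mean()
        preds[b] = b0 + b1 * x_grid
    mean = preds.mean(axis=0)
    lo, hi = np.percentile(preds, [2.5, 97.5], axis=0)  # 95% band of the models
    return mean, lo, hi
```

A prediction interval for a new observation would widen these bands further, since each individual prediction also carries the noise term $\epsilon$.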

27. Multiple Linear Regression

28. Multiple Linear Regression
If you have to guess someone's height, would you rather be told:
their weight only;
their weight and gender;
their weight, gender, and income;
or their weight, gender, income, and favorite number?
Of course, you'd always want as much data about a person as possible. Even though height and favorite number may not be strongly related, at worst you could just ignore the information on favorite number. We want our models to be able to take in lots of data as they make their predictions.

29. Response vs. Predictor Variables
TV     radio  newspaper  sales
230.1  37.8   69.2       22.1
44.5   39.3   45.1       10.4
17.2   45.9   69.3        9.3
151.5  41.3   58.5       18.5
180.8  10.8   58.4       12.9
Y: the outcome, also called the response variable or dependent variable.
X: the predictors, also called features or covariates.
The data matrix has n observations (rows) and p predictors (columns).

30. Multilinear Models
In practice, it is unlikely that any response variable Y depends solely on one predictor x. Rather, we expect that Y is a function of multiple predictors, $f(X_1, \dots, X_J)$. Using the notation we introduced last lecture, $Y = (y_1, \dots, y_n)$ and $X_j = (x_{1,j}, \dots, x_{n,j})$ for $j = 1, \dots, J$. In this case, we can still assume a simple form for $f$ - a multilinear form:
$f(X_1, \dots, X_J) = \beta_0 + \beta_1 X_1 + \dots + \beta_J X_J$
Hence, $\hat{f}$ has the form
$\hat{f}(X_1, \dots, X_J) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_J X_J$

31. Multiple Linear Regression
Again, to fit this model means to compute the $\hat{\beta}_0, \dots, \hat{\beta}_J$ that minimize a loss function; we will again choose the MSE as our loss function. Given a set of n observations, the data and the model can be expressed in vector notation:
$\mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_{1,1} & \dots & x_{1,J} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \dots & x_{n,J} \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_J \end{pmatrix}$

32. Multiple Linear Regression
The model takes a simple algebraic form:
$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$
Thus, the MSE can be expressed in vector notation as
$\text{MSE}(\boldsymbol{\beta}) = \frac{1}{n} \|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|^2$
Minimizing the MSE using vector calculus yields
$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} = \operatorname*{argmin}_{\boldsymbol{\beta}} \text{MSE}(\boldsymbol{\beta})$
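A direct NumPy translation of this result, as a sketch (the function name is ours):

```python
import numpy as np

def fit_multilinear(X_raw, y):
    """Least-squares coefficients via the normal equations:
    solve (X^T X) beta = X^T y, with a prepended column of ones
    so that beta[0] is the intercept."""
    X = np.column_stack([np.ones(len(y)), X_raw])
    return np.linalg.solve(X.T @ X, X.T @ y)
```

Solving the linear system directly is more stable than forming $(\mathbf{X}^\top \mathbf{X})^{-1}$ explicitly; `np.linalg.lstsq(X, y, rcond=None)` is an even safer alternative when predictors are nearly collinear, which is the topic of the next slides.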

33. Collinearity
Collinearity refers to the case in which two or more predictors are correlated (related). We will revisit collinearity in the next lectures, but for now we want to examine how collinearity affects our confidence in the coefficients and, consequently, in the importance of those coefficients. First, let's look at some examples:

34. Collinearity
Three individual models:
TV:
             Coef.   Std.Err.  t       P>|t|      [0.025  0.975]
  intercept  6.679   0.478     13.957  2.804e-31  5.735   7.622
  TV         0.048   0.0027    17.303  1.802e-41  0.042   0.053
RADIO:
  intercept  9.567   0.553     17.279  2.133e-41  8.475   10.659
  radio      0.195   0.020     9.429   1.134e-17  0.154   0.236
NEWS:
  intercept  11.55   0.576     20.036  1.628e-49  10.414  12.688
  newspaper  0.074   0.014     5.134   6.734e-07  0.0456  0.102
One model, with all three predictors:
  intercept  2.602   0.332     7.820   3.176e-13  1.945   3.258
  TV         0.046   0.0015    29.887  6.314e-75  0.043   0.049
  radio      0.175   0.0094    18.576  4.297e-45  0.156   0.194
  newspaper  0.013   0.028     2.338   0.0203     0.008   0.035
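Output like the tables above can be reproduced with statsmodels. The sketch below assumes a hypothetical DataFrame `ads` with columns TV, radio, newspaper, and sales (the Advertising data used in the slides):

```python
import pandas as pd
import statsmodels.api as sm

def compare_models(ads: pd.DataFrame):
    """Fit three single-predictor models and one joint model on a
    DataFrame with columns TV, radio, newspaper, sales."""
    for col in ["TV", "radio", "newspaper"]:
        single = sm.OLS(ads["sales"], sm.add_constant(ads[[col]])).fit()
        print(single.summary().tables[1])   # coef, std err, t, P>|t|, CI
    joint = sm.OLS(ads["sales"],
                   sm.add_constant(ads[["TV", "radio", "newspaper"]])).fit()
    print(joint.summary().tables[1])
```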

35. Collinearity
Collinearity refers to the case in which two or more predictors are correlated (related). We will revisit collinearity in the next lectures, but for now we want to examine how collinearity affects our confidence in the coefficients and, consequently, in the importance of those coefficients.
Assuming uncorrelated noise, we can show:
$\operatorname{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$

36. Finding Significant Predictors: Hypothesis Testing
For checking the significance of linear regression coefficients:
we set up our hypotheses,
$H_0: \beta_1 = \beta_2 = \dots = \beta_J = 0$,
$H_1$: at least one $\beta_j$ is non-zero;
we choose the F-stat to evaluate the null hypothesis.

37. Finding Significant Predictors: Hypothesis Testing
We can compute the F-stat for linear regression models by
$F = \frac{(\text{TSS} - \text{RSS}) / J}{\text{RSS} / (n - J - 1)}$
where $\text{TSS} = \sum_i (y_i - \bar{y})^2$ and $\text{RSS} = \sum_i (y_i - \hat{y}_i)^2$. If $F \approx 1$, we consider this evidence for $H_0$; if $F \gg 1$, we consider this evidence against $H_0$.
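A sketch of this computation (the function name is ours):

```python
import numpy as np

def f_statistic(y, y_hat, n_predictors):
    """F-statistic for H0: all J slope coefficients are zero."""
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
    rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
    return ((tss - rss) / n_predictors) / (rss / (n - n_predictors - 1))
```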

38. Qualitative Predictors
So far, we have assumed that all variables are quantitative. But in practice, often some predictors are qualitative. Example: the Credit data set contains information about balance, age, cards, education, income, limit, and rating for a number of potential customers.
Income   Limit  Rating  Cards  Age  Education  Gender  Student  Married  Ethnicity  Balance
14.890   3606   283     2      34   11         Male    No       Yes      Caucasian  333
106.02   6645   483     3      82   15         Female  Yes      Yes      Asian      903
104.59   7075   514     4      71   11         Male    No       No       Asian      580
148.92   9504   681     3      36   11         Female  No       No       Asian      964
55.882   4897   357     2      68   16         Male    No       Yes      Caucasian  331

39. Qualitative Predictors
If the predictor takes only two values, then we create an indicator or dummy variable that takes on two possible numerical values. For example, for gender we create a new variable:
$x_i = 1$ if the $i$-th person is female, $x_i = 0$ if the $i$-th person is male.
We then use this variable as a predictor in the regression equation:
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
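In pandas this is a one-liner; the tiny DataFrame below is fabricated for illustration, not the actual Credit data:

```python
import pandas as pd

# Fabricated mini-sample in the spirit of the Credit data
credit = pd.DataFrame({"Gender": ["Male", "Female", "Male", "Female"],
                       "Balance": [333, 903, 580, 964]})

# Indicator variable: 1 if the i-th person is female, 0 if male
credit["female"] = (credit["Gender"] == "Female").astype(int)
print(credit)
```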

40. Qualitative Predictors
Question: What is the interpretation of $\beta_0$ and $\beta_1$?

41. Qualitative Predictors
Question: What is the interpretation of $\beta_0$ and $\beta_1$?
$\beta_0$ is the average credit card balance among males, $\beta_0 + \beta_1$ is the average credit card balance among females, and $\beta_1$ is the average difference in credit card balance between females and males.
Exercise: Calculate $\hat{\beta}_0$ and $\hat{\beta}_1$ for the Credit data.

42. More than two levels: One hot encoding
Often, the qualitative predictor takes more than two values (e.g., ethnicity in the Credit data). In this situation, a single dummy variable cannot represent all possible values. We create additional dummy variables, for example:
$x_{i,1} = 1$ if the $i$-th person is Asian, 0 otherwise;
$x_{i,2} = 1$ if the $i$-th person is Caucasian, 0 otherwise.

43. More than two levels: One hot encoding
We then use these variables as predictors, and the regression equation becomes:
$y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i$
Again, the interpretation is as before: the level with no dummy variable of its own acts as the baseline, and each coefficient measures the average difference from that baseline.
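pandas can generate these dummy columns directly; the Ethnicity values below are a toy sample, not the full Credit data:

```python
import pandas as pd

# Toy Ethnicity column with three levels, as in the Credit data
credit = pd.DataFrame(
    {"Ethnicity": ["Caucasian", "Asian", "Asian", "African American"]})

# One dummy column per level; drop_first=True leaves one level out
# as the baseline, avoiding a redundant, perfectly collinear column
dummies = pd.get_dummies(credit["Ethnicity"], drop_first=True)
print(dummies)
```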

44. Beyond linearity
In the Advertising data, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media. If we assume a linear model, then the average effect on sales of a one-unit increase in TV is always $\beta_1$, regardless of the amount spent on radio.
A synergy effect or interaction effect occurs when an increase in the radio budget changes the effectiveness of TV spending on sales.

45. Beyond linearity
We change
$\text{sales} = \beta_0 + \beta_1 \cdot \text{TV} + \beta_2 \cdot \text{radio} + \epsilon$
to
$\text{sales} = \beta_0 + \beta_1 \cdot \text{TV} + \beta_2 \cdot \text{radio} + \beta_3 \cdot (\text{TV} \times \text{radio}) + \epsilon$
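With the statsmodels formula API, the interaction term is written in the formula itself. The DataFrame below reuses the five Advertising rows shown earlier plus one more fabricated row, purely for illustration:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Small stand-in for the Advertising data
ads = pd.DataFrame({
    "sales": [22.1, 10.4, 9.3, 18.5, 12.9, 7.2],
    "TV":    [230.1, 44.5, 17.2, 151.5, 180.8, 8.7],
    "radio": [37.8, 39.3, 45.9, 41.3, 10.8, 48.9],
})

# "TV:radio" adds only the interaction term; "TV*radio" would add
# both main effects and the interaction in one go.
model = smf.ols("sales ~ TV + radio + TV:radio", data=ads).fit()
print(model.params)
```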

46. What does it mean?
Rewriting the model as $\text{sales} = \beta_0 + (\beta_1 + \beta_3 \cdot \text{radio}) \cdot \text{TV} + \beta_2 \cdot \text{radio} + \epsilon$ shows that the effective slope of TV now depends on the radio budget.

47. Predictors, predictors, predictors
We have a lot of predictors! Is that a problem?
Yes: computational cost.
Yes: overfitting.
Wait, there is more...

48. Polynomial Regression

49. Polynomial Regression
The simplest non-linear model we can consider, for a response Y and a predictor X, is a polynomial model of degree M:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_M x^M + \epsilon$
Just as in the case of linear regression with cross terms, polynomial regression is a special case of linear regression: we treat each power $x^m$ as a separate predictor. Thus, we can write
$\mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_1 & x_1^2 & \dots & x_1^M \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \dots & x_n^M \end{pmatrix}$

50. Polynomial Regression
Again, minimizing the MSE using vector calculus yields
$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} = \operatorname*{argmin}_{\boldsymbol{\beta}} \text{MSE}(\boldsymbol{\beta})$
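A sketch of polynomial regression as plain linear regression on the powers of x, using NumPy's Vandermonde matrix (the function name is ours):

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Polynomial regression as linear regression on the powers of x:
    the design matrix is the Vandermonde matrix [1, x, x^2, ..., x^M]."""
    X = np.vander(x, degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
    return beta
```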