Slide1
Week 5, Lecture 2
Chapter 8. Regression Wisdom
Slide2
Percentage of Men Smokers (18–24 years of age) from 1965 through 2009
The Centers for Disease Control and Prevention tracks cigarette smoking in the US. How has the percentage of people who smoke changed since the danger became clear during the last half of the 20th century?
Slide3
Percentage of Men Smokers (18–24 years of age) from 1965 through 2009
The scatterplot shows the percentage of smokers among men 18–24 years of age, as estimated by surveys, from 1965 through 2009.
The percentage of men aged 18–24 who smoke decreased dramatically between 1965 and 1990, but the trend has not been consistent since then.
The association between year and the percentage of men aged 18–24 who smoke is very strong from 1965 to 1990, but erratic after 1990.
A linear model is not appropriate for the trend in the percentage of men aged 18–24 who smoke: the relationship is not straight.
The regression equation is:
male smoking % = 986.99552 - 0.47919438 Year
R-sq = 0.7047499 (70.47%)
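As a small worked example, the fitted line above can be evaluated at the two ends of the study period. This is only a sketch: the coefficients are copied from the regression equation on this slide, and the function name `predicted_male_smoking_pct` is made up for illustration.

```python
# Coefficients copied from the regression equation on the slide.
INTERCEPT = 986.99552
SLOPE = -0.47919438  # percentage points per year


def predicted_male_smoking_pct(year):
    """Value of the fitted line for a given year (hypothetical helper name)."""
    return INTERCEPT + SLOPE * year


pred_1965 = predicted_male_smoking_pct(1965)
pred_2009 = predicted_male_smoking_pct(2009)
change_per_decade = 10 * SLOPE

print(f"1965: {pred_1965:.1f}%")
print(f"2009: {pred_2009:.1f}%")
print(f"Change per decade: {change_per_decade:.1f} points")
```

The line describes a steady drop of roughly 4.8 percentage points per decade, from about 45% in 1965 to about 24% in 2009. Because the scatterplot bends after 1990, these fitted values are a rough summary rather than a faithful description of the trend.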
Slide4
Checking the Assumptions of the Regression Model
The residuals appear to be approximately normally distributed.
Slide5
Checking the Assumptions of the Regression Model
Plot: Residuals vs. Predictor Variable (Year)
The nonlinearity is more prominent in this plot.
The residuals are not randomly scattered around the zero line, and they are not evenly spread out.
The residuals form a curved pattern.
The linear regression model is not appropriate.
Slide6
Checking the Assumptions of the Regression Model
No regression analysis is complete without a display of the residuals to check that the linear model is reasonable.
Residuals often reveal subtleties that were not clear from a plot of the original data (e.g., a scatterplot of y vs. x).
Sometimes they reveal violations of the regression conditions that require our attention.
It is good to look at both a histogram of the residuals (or a histogram of the standardized residuals, or a normal QQ plot of the residuals) and a scatterplot of the residuals vs. the predictor variable.
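The residual check described above can be sketched in a few lines of plain Python. The data here are made up, with a deliberate bend (a decline that levels off); they are not the smoking data, and `fit_line` is a hypothetical helper that computes an ordinary least-squares line.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up data with a deliberate bend: a decline that levels off.
xs = list(range(21))
ys = [50 - 2 * x + 0.08 * x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Crude text version of a residuals-vs-predictor plot: one sign per point,
# in order of increasing x.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)
```

The residuals are positive at both ends and negative in the middle. That run of signs is the curved pattern a residual plot would show, even though the residuals still sum to zero.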
Slide7
Percentage of Both Men and Women Smokers (18–24 years of age) from 1965 through 2009
The Centers for Disease Control and Prevention tracks cigarette smoking in the US. How have the percentages of men and women who smoke changed since the danger became clear during the last half of the 20th century?
Slide8
Scatterplot for Men and Women Smokers (18–24 years of age) from 1965 through 2009
Smoking rates for both men and women in the US have decreased significantly over the period from 1965 to 2009.
Smoking rates are generally lower for women than for men.
The trend in the smoking rates for women seems a bit straighter than the trend for men.
The apparent curvature in the scatterplot for men could be due to just a few points, rather than indicating a serious violation of the linearity condition.
Slide9
Scatterplot for Men and Women Smokers (18–24 years of age) from 1965 through 2009
StatCrunch Command: Graph > Scatter Plot
X-variable: Year
Y-variable: Smoking %
Group by: Sex
Grouping Options: Color points by group
Overlay polynomial order: 1
Group properties: Color scheme: Alternate – 7 colors
Click Compute
Slide10
Men and Women Smokers (18–24 years of age) from 1965 through 2009
Graph on the left: Not taking group into account
Graph on the right: Identified by group (male or female)
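Outside StatCrunch, the two graphs correspond to two kinds of fit: one line through all points, and one line per group. The sketch below shows both in plain Python. The rows are invented for illustration and are not the CDC survey values; `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up illustrative rows (year, sex, smoking %); NOT the CDC survey values.
years = [1965, 1975, 1985, 1995, 2005]
rows = [(y, "Male", 50 - 0.5 * (y - 1965)) for y in years]
rows += [(y, "Female", 35 - 0.3 * (y - 1965)) for y in years]

# Left-hand graph: one line through all points, ignoring the group.
pooled = fit_line([r[0] for r in rows], [r[2] for r in rows])

# Right-hand graph: one line per group.
by_group = {}
for sex in ("Male", "Female"):
    subset = [r for r in rows if r[1] == sex]
    by_group[sex] = fit_line([r[0] for r in subset], [r[2] for r in subset])

# Residuals from the pooled line, kept together with their group label.
pooled_residuals = [
    (sex, pct - (pooled[0] + pooled[1] * year)) for year, sex, pct in rows
]

print("pooled slope:", round(pooled[1], 3))
for sex, (b0, b1) in by_group.items():
    print(sex, "slope:", round(b1, 3))
```

With these invented numbers the pooled slope lies between the two group slopes, and every pooled residual is positive for men and negative for women. A split like that in the residuals is a sign that the data contain two groups.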
Slide11
Men and Women Smokers (18–24 years of age) from 1965 through 2009
Not taking group into account
Smoking % = 953.31052 - 0.46382114 Year
Sample size: 34
R (correlation coefficient) = -0.80476796
R-sq = 0.64765148
Slide12
Analysis of Residual Points
It looks like we have two groups.
Slide13
Analysis of Residual Points
An examination of residuals often leads us to discover groups of observations that are different from the rest.
A histogram of the residuals might show multiple modes.
When we discover that there is more than one group in a regression, we may decide to analyze the groups separately, using a different model for each group.
Slide14
Outliers
Any point that stands away from the others can be called an outlier and deserves special attention.
Outlying points can strongly influence a regression. Even a single point far from the body of the data can dominate the analysis.
Slide15
High Leverage Points
A data point whose x-value is far from the mean of the x-values is called a high leverage point.
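Leverage can be put on a numeric scale. The sketch below uses the standard hat-value formula for simple linear regression, h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2. The formula is not stated on the slide, and the x-values are made up.

```python
def leverages(xs):
    """Hat values for simple linear regression (they depend only on x)."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]


# Made-up x-values: five close together and one far from the mean.
xs = [1, 2, 3, 4, 5, 20]
hs = leverages(xs)

for x, h in zip(xs, hs):
    print(f"x = {x:2d}  leverage = {h:.3f}")
```

In a simple regression the leverages always add up to 2, so a single value near 1, as for x = 20 here, means that point has far more pull on the line than the others. Leverage depends only on x; whether the point actually changes the fit also depends on its y-value.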
Examples:
Slide16
Influential Observations
A data point is influential if omitting it from the analysis gives a very different model.
Example:
Relationship between murder rate and poverty level for the 50 states plus DC (51 observations)
Note: DC is far from the rest of the data (the overall pattern) and lies in a different direction from the rest.
Dependent Variable: Murder Rate
Independent Variable: Poverty Rate
Murder Rate = -3.6792483 + 0.68731484 Poverty Rate
Sample size: 51
R (correlation coefficient) = 0.4735608
R-sq = 0.22425983
Estimate of error standard deviation: 3.9143851
Slide17
Omitting the Observation for DC
Example:
Relationship between murder rate and poverty level for the 50 states (excluding DC)
Dependent Variable: Murder Rate
Independent Variable: Poverty Rate
Murder Rate = -0.65671571 + 0.41331907 Poverty Rate
Sample size: 50
R (correlation coefficient) = 0.53936435
R-sq = 0.29091391
Slide18
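The with-and-without comparison for DC (slope 0.687 with DC included, 0.413 with it excluded) is the basic check for influence: fit the line twice and compare. The sketch below applies the same check to made-up data, not the murder and poverty rates. It also includes a far-out point that follows the trend and therefore barely changes the fit. `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up body of data with a clear upward trend.
base_x = [1, 2, 3, 4, 5]
base_y = [3.1, 4.9, 7.2, 8.8, 11.0]
_, slope_base = fit_line(base_x, base_y)

# Far-out x with y in line with the trend: high leverage, little influence.
_, slope_on_trend = fit_line(base_x + [20], base_y + [40.5])

# Same far-out x with y well off the trend: high leverage and influential.
_, slope_off_trend = fit_line(base_x + [20], base_y + [10.0])

print(f"without the extra point: {slope_base:.3f}")
print(f"extra point on trend:    {slope_on_trend:.3f}")
print(f"extra point off trend:   {slope_off_trend:.3f}")
```

Both added points sit at the same extreme x-value, so both have high leverage. Only the one that departs from the trend changes the slope substantially.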
High Leverage Point BUT Not An Influential Observation
Slide19
Restricted-range Problem
When one of the variables is restricted (you only look at some of its values), the correlation can be surprisingly low.
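A small numeric sketch of this effect, using made-up data in which y follows x with a fixed wobble of plus or minus 10. `pearson_r` is a hypothetical helper that computes the ordinary correlation coefficient.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


# Made-up data: y tracks x, plus 10 for even x and minus 10 for odd x.
xs = list(range(100))
ys = [x + (10 if x % 2 == 0 else -10) for x in xs]

r_full = pearson_r(xs, ys)

# Restrict the range: keep only the points with 40 <= x < 50.
kept = [(x, y) for x, y in zip(xs, ys) if 40 <= x < 50]
r_restricted = pearson_r([x for x, _ in kept], [y for _, y in kept])

print(f"full range:       r = {r_full:.3f}")
print(f"restricted range: r = {r_restricted:.3f}")
```

The relationship between x and y is the same in both cases. Over the narrow slice, however, the wobble is large compared with the spread in x, so the correlation drops sharply.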
We will visit an example from the web, by David Lane:
http://davidmlane.com/hyperstat/A68809.html
The demo video is found here:
http://onlinestatbook.com/2/describing_bivariate_data/restriction_demo.html
Slide20
Working with Summary Statistics
The first graph shows what appears to be a strong, positive, linear association between weight (in pounds) and height (in inches) for men.
The second graph shows that if, instead of data on individuals, we only had the mean weight for each height value, we would see an even stronger association.
The points are less scattered.
This can give a false impression of how well a line summarizes the data.
It leads to a problem of overestimating or underestimating.
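The effect of replacing individuals by group means can be sketched with made-up heights and weights. The numbers are invented, and they are constructed so that the mean weight at each height falls exactly on a line, which makes the contrast easy to see. `pearson_r` is a hypothetical helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


# Made-up data: three men at each height (inches), with weights (pounds)
# spread 15 lb below, at, and 15 lb above a straight-line trend.
heights = [64, 66, 68, 70, 72, 74]
individuals = [
    (h, 5 * h - 180 + offset) for h in heights for offset in (-15, 0, 15)
]

# Summary statistics: the mean weight at each height.
mean_weights = [
    sum(w for hh, w in individuals if hh == h) / 3 for h in heights
]

r_individuals = pearson_r(
    [h for h, _ in individuals], [w for _, w in individuals]
)
r_means = pearson_r(heights, mean_weights)

print(f"individuals: r = {r_individuals:.3f}")
print(f"group means: r = {r_means:.3f}")
```

Averaging removes the person-to-person scatter within each height, so the correlation computed from the means is higher than the one computed from the individuals. The line fits the means well, but that says little about how closely any one person's weight follows it.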