Slide 1

Simple Linear Regression & Correlation
Instructor: Prof. Wei Zhu
11/21/2013
AMS 572 Group Project
Slide 2

Outline
1. Motivation & Introduction - Lizhou Nie
2. A Probabilistic Model for Simple Linear Regression - Long Wang
3. Fitting the Simple Linear Regression Model - Zexi Han
4. Statistical Inference for Simple Linear Regression - Lichao Su
5. Regression Diagnostics - Jue Huang
6. Correlation Analysis - Ting Sun
7. Implementation in SAS - Qianyi Chen
8. Application and Summary - Jie Shuai
Slide 3

1. Motivation

Fig. 1.1 Simplified model of the solar system
http://popperfont.net/2012/11/13/the-ultimate-solar-system-animated-gif/

Fig. 1.2 Obama & Romney during the presidential election campaign
http://outfront.blogs.cnn.com/2012/08/14/the-most-negative-in-campaign-history/
Slide 4

Introduction

Regression Analysis
- Linear Regression:
  - Simple Linear Regression: {y; x}
  - Multiple Linear Regression: {y; x1, ..., xp}
  - Multivariate Linear Regression: {y1, ..., yn; x1, ..., xp}
Correlation Analysis
- Pearson Product-Moment Correlation Coefficient: a measurement of the linear relationship between two variables
Slide 5

History

- Adrien-Marie Legendre: earliest form of regression, the least squares method
- Carl Friedrich Gauss: further development of least squares theory, including the Gauss-Markov theorem
- Sir Francis Galton: coined the term "regression"
- George Udny Yule & Karl Pearson: extension to a more general statistical context

http://en.wikipedia.org/wiki/Regression_analysis
http://en.wikipedia.org/wiki/Adrien_Marie_Legendre
http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss
http://en.wikipedia.org/wiki/Francis_Galton
http://www.york.ac.uk/depts/maths/histstat/people/yule.gif
http://en.wikipedia.org/wiki/Karl_Pearson
Slide 6

2. A Probabilistic Model

Simple Linear Regression
- A special case of linear regression: one response variable and one explanatory variable

General Setting
- We denote the explanatory variable by x and the response variable by y
- n pairs of observations {(x_i, y_i)}, i = 1 to n
Slide 7

2. A Probabilistic Model

Sketch the graph; an example point is (29, 5.5).

The 100 observations (abridged):

  i       X        Y
  1     37.70     9.82
  2     16.31     5.00
  3     28.37     9.27
  4    -12.13     2.98
 ...      ...      ...
 98      9.06     7.34
 99     28.54    10.37
100    -17.19     2.33
Slide 8

2. A Probabilistic Model

In simple linear regression, the data are described as:

    Y_i = β0 + β1 x_i + ε_i,   where ε_i ~ N(0, σ²), i = 1, ..., n

The fitted model:

    ŷ = β̂0 + β̂1 x

where β0 is the intercept and β1 is the slope of the regression line.
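The probabilistic model can be illustrated with a short simulation. The Python sketch below (illustrative only; the deck's own code is in MATLAB and SAS, and the true parameter values here are hypothetical) generates data from Y_i = β0 + β1 x_i + ε_i and checks that least squares roughly recovers the parameters:

```python
import random

random.seed(0)

# Hypothetical true parameters, chosen for illustration
beta0, beta1, sigma = 5.0, 0.2, 1.0
n = 100

x = [random.uniform(-20, 40) for _ in range(n)]          # explanatory values
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]  # model: error ~ N(0, sigma^2)

# Least squares estimates should be close to the true values
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
print(b0, b1)
```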
Slide 9

3. Fitting the Simple Linear Regression Model

Table 3.1.

Mileage (in 1000 miles)   Groove Depth (in mils)
 0                        394.33
 4                        329.50
 8                        291.00
12                        255.17
16                        229.33
20                        204.83
24                        179.00
28                        163.83
32                        150.33

Fig 3.1. Scatter plot of tire tread wear vs. mileage. From: Statistics and Data Analysis; Tamhane and Dunlop; Prentice Hall.
Slide 10

3. Fitting the Simple Linear Regression Model

The difference between the fitted line and the real data is

    e_i = y_i - ŷ_i = y_i - (β̂0 + β̂1 x_i)

Our goal: minimize the sum of squares

    Q = Σ [y_i - (β0 + β1 x_i)]²

Fig 3.2. e_i is the vertical distance between the fitted line and the real data.
Slide 11

3. Fitting the Simple Linear Regression Model

Least Squares Method

Set the partial derivatives of Q with respect to β0 and β1 to zero, which gives the normal equations:

    Σ (y_i - β0 - β1 x_i) = 0
    Σ x_i (y_i - β0 - β1 x_i) = 0

Slide 12

3. Fitting the Simple Linear Regression Model

Solving the normal equations gives the least squares estimates:

    β̂1 = Σ (x_i - x̄)(y_i - ȳ) / Σ (x_i - x̄)²
    β̂0 = ȳ - β̂1 x̄

Slide 13

3. Fitting the Simple Linear Regression Model

To simplify, we denote:

    S_xx = Σ (x_i - x̄)²,  S_yy = Σ (y_i - ȳ)²,  S_xy = Σ (x_i - x̄)(y_i - ȳ)

so that β̂1 = S_xy / S_xx.
Slide 14

3. Fitting the Simple Linear Regression Model

Back to the example: for the tire data, n = 9, x̄ = 16, ȳ = 244.15, S_xx = 960, and S_xy = -6989.4, so

    β̂1 = S_xy / S_xx ≈ -7.281,   β̂0 = ȳ - β̂1 x̄ ≈ 360.64

Slide 15

3. Fitting the Simple Linear Regression Model

Therefore, the equation of the fitted line is

    ŷ = 360.64 - 7.281 x

Not enough!
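The fitted line can be reproduced directly from the data in Table 3.1. A minimal Python sketch (illustrative; the deck's own code is MATLAB/SAS):

```python
# Tire tread wear data from Table 3.1
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]
n = len(x)

xbar = sum(x) / n        # 16
ybar = sum(y) / n        # about 244.15
Sxx = sum((xi - xbar) ** 2 for xi in x)                       # 960
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # -6989.4

b1 = Sxy / Sxx           # slope, about -7.281
b0 = ybar - b1 * xbar    # intercept, about 360.64
print(b0, b1)
```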
Slide 16

3. Fitting the Simple Linear Regression Model

Check the goodness of fit of the LS line. We define:

    SST = Σ (y_i - ȳ)²        (total sum of squares)
    SSR = Σ (ŷ_i - ȳ)²        (regression sum of squares)
    SSE = Σ (y_i - ŷ_i)²      (error sum of squares)

One can prove: SST = SSR + SSE.

The ratio

    r² = SSR / SST

is called the coefficient of determination.
Slide 17

3. Fitting the Simple Linear Regression Model

Check the goodness of fit of the LS line. Back to the example:

    r² = SSR / SST ≈ 0.953,   r = -√0.953 ≈ -0.976

where the sign of r follows from the sign of β̂1. Since 95.3% of the variation in tread wear is accounted for by linear regression on mileage, the relationship between the two is strongly linear with a negative slope.
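The decomposition SST = SSR + SSE and the 95.3% figure can be verified numerically. A Python sketch (illustrative, not part of the original slides):

```python
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SST = sum((yi - ybar) ** 2 for yi in y)               # total SS
SSR = sum((yh - ybar) ** 2 for yh in yhat)            # regression SS
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error SS

r2 = SSR / SST                               # about 0.953
r = (1 if b1 > 0 else -1) * r2 ** 0.5        # sign follows the slope
print(r2, r)
```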
Slide 18

3. Fitting the Simple Linear Regression Model

r is the sample correlation coefficient between X and Y:

    r = S_xy / √(S_xx S_yy)

For simple linear regression, r² equals the coefficient of determination, and r = β̂1 √(S_xx / S_yy).
Slide 19

3. Fitting the Simple Linear Regression Model

Estimation of σ². The variance σ² measures the scatter of the Y_i around their means. An unbiased estimate of σ² is given by

    s² = SSE / (n - 2)
Slide 20

3. Fitting the Simple Linear Regression Model

From the example, we have SSE = 2531.5 (the sum of the squared residuals in Table 5.1) and n - 2 = 7, therefore

    s² = 2531.5 / 7 ≈ 361.6

which has 7 d.f. The estimate of σ is s ≈ 19.0.
Slide 21

4. Statistical Inference for SLR

Slide 22

Under the normal error assumption:

* Point estimators:

    β̂1 = S_xy / S_xx,   β̂0 = ȳ - β̂1 x̄

* Sampling distributions of β̂0 and β̂1:

    β̂1 ~ N(β1, σ² / S_xx),   β̂0 ~ N(β0, σ² [1/n + x̄²/S_xx])
Slide 23

Derivation

Slide 24

Derivation

For the mathematical derivations, please refer to the Tamhane and Dunlop textbook, p. 331.

Slide 25

Statistical Inference on β0 and β1

* Pivotal Quantities (P.Q.'s):

    (β̂1 - β1) / SE(β̂1) ~ t_{n-2},   (β̂0 - β0) / SE(β̂0) ~ t_{n-2}

  where SE(β̂1) = s / √S_xx and SE(β̂0) = s √(1/n + x̄²/S_xx).

* Confidence Intervals (C.I.'s):

    β̂1 ± t_{n-2, α/2} SE(β̂1),   β̂0 ± t_{n-2, α/2} SE(β̂0)
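These confidence intervals can be computed for the tire data. A Python sketch (illustrative; the critical value 2.365 is taken from a t table for 7 d.f. at the 95% level):

```python
import math

x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

# Unbiased estimate of sigma^2, then the standard errors
s2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
se_b1 = math.sqrt(s2 / Sxx)
se_b0 = math.sqrt(s2 * (1 / n + xbar ** 2 / Sxx))

tcrit = 2.365  # t_{7, .025} from a t table
ci_b1 = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)
ci_b0 = (b0 - tcrit * se_b0, b0 + tcrit * se_b0)
print(ci_b1)   # entirely below 0, so the slope is significant at the 5% level
```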
Slide 26

Statistical Inference on β0 and β1

A useful application is to test whether there is a linear relationship between x and y.

Hypothesis tests:
- H0: β1 = 0 vs. H1: β1 ≠ 0. Reject H0 at level α if |β̂1| / SE(β̂1) > t_{n-2, α/2}
- H0: β0 = 0 vs. H1: β0 ≠ 0. Reject H0 at level α if |β̂0| / SE(β̂0) > t_{n-2, α/2}

Slide 27

Analysis of Variance (ANOVA)

Mean Square: a sum of squares divided by its degrees of freedom.
Slide 28

Analysis of Variance (ANOVA)

ANOVA Table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F
(Source)              (SS)             (d.f.)               (MS)
Regression            SSR              1                    MSR = SSR/1         F = MSR/MSE
Error                 SSE              n - 2                MSE = SSE/(n-2)
Total                 SST              n - 1
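The ANOVA F statistic can be computed for the tire data; in simple linear regression it equals the square of the t statistic for H0: β1 = 0. A Python sketch (illustrative, not part of the original slides):

```python
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SSR = sum((yh - ybar) ** 2 for yh in yhat)            # regression SS, 1 d.f.
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error SS, n - 2 d.f.
MSR, MSE = SSR / 1, SSE / (n - 2)

F = MSR / MSE                   # ANOVA F statistic
t = b1 / (MSE / Sxx) ** 0.5     # t statistic for H0: beta1 = 0
print(F, t ** 2)                # the two agree: F = t^2 in SLR
```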
Slide 29

5. Regression Diagnostics

5.1 Checking the Model Assumptions
  5.1.1 Checking for Linearity
  5.1.2 Checking for Constant Variance
  5.1.3 Checking for Normality
  Primary tool: residual plots
5.2 Checking for Outliers and Influential Observations
  5.2.1 Checking for Outliers
  5.2.2 Checking for Influential Observations
  5.2.3 How to Deal with Outliers and Influential Observations
Slide 31

5. Regression Diagnostics

5.1.1 Checking for Linearity

Table 5.1. The x_i, y_i, ŷ_i, e_i for the Tire Wear Data

 i   x_i    y_i      ŷ_i      e_i
 1    0    394.33   360.64    33.69
 2    4    329.50   331.51    -2.01
 3    8    291.00   302.39   -11.39
 4   12    255.17   273.27   -18.10
 5   16    229.33   244.15   -14.82
 6   20    204.83   215.02   -10.19
 7   24    179.00   185.90    -6.90
 8   28    163.83   156.78     7.05
 9   32    150.33   127.66    22.67

Figure 5.1. Scatter plot of x_i, y_i, ŷ_i for the Tire Wear Data
Slide 32

5. Regression Diagnostics

5.1.1 Checking for Linearity (Data Transformation)

Figure 5.2. Typical scatter plot shapes and corresponding linearizing transformations: depending on the curvature, plot y against x², x³, log x, or -1/x, or plot y², y³, log y, or -1/y against x.
Slide 33

5. Regression Diagnostics

5.1.1 Checking for Linearity (Data Transformation)

Table 5.2. The x_i, y_i, log-scale fits, ŷ_i, e_i for the Tire Wear Data

 i   x_i    y_i      log-fit   ŷ_i       e_i
 1    0    394.33    5.926     374.64    19.69
 2    4    329.50    5.807     332.58    -3.08
 3    8    291.00    5.688     295.24    -4.24
 4   12    255.17    5.569     262.09    -6.92
 5   16    229.33    5.450     232.67    -3.34
 6   20    204.83    5.331     206.54    -1.71
 7   24    179.00    5.211     183.36    -4.36
 8   28    163.83    5.092     162.77     1.06
 9   32    150.33    4.973     144.50     5.83

Figure 5.2. Scatter plot of x_i, y_i, ŷ_i for the transformed Tire Wear Data
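The fitted values in Table 5.2 come from regressing log y on x and exponentiating back. A Python sketch of the transformation (illustrative; the deck's own code is MATLAB/SAS):

```python
import math

x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]
n = len(x)

# Linearizing transformation: regress log y on x
ly = [math.log(yi) for yi in y]
xbar, lbar = sum(x) / n, sum(ly) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (li - lbar) for xi, li in zip(x, ly)) / Sxx
b0 = lbar - b1 * xbar            # about 5.926, matching Table 5.2 at x = 0

# Fitted values back on the original scale, and residuals
yhat = [math.exp(b0 + b1 * xi) for xi in x]
resid = [yi - yh for yi, yh in zip(y, yhat)]
print(b0, b1)
```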
Slide 35

5. Regression Diagnostics

5.1.2 Checking for Constant Variance

Plot the residuals against the fitted values. If the constant variance assumption is correct, the dispersion of the e_i's is approximately constant with respect to the ŷ_i's.

Figure 5.3. Plots of residuals
Figure 5.4. Plots of residuals
Slide 37

5. Regression Diagnostics

5.1.3 Checking for Normality

Make a normal plot of the residuals. The residuals have a zero mean and an approximately constant variance (assuming the other assumptions about the model are correct), so if the errors are normal, the plot should be roughly a straight line.

Figure 5.5. Normal plot of the residuals
Slide 39

5. Regression Diagnostics

5.2 Checking for Outliers and Influential Observations

Outlier: an observation that does not follow the general pattern of the relationship between y and x. A large residual indicates an outlier.

Standardized residuals are given by

    e_i* = e_i / (s √(1 - h_ii))

If |e_i*| > 2, then the corresponding observation may be regarded as an outlier.

Influential observation: an observation with an extreme x-value, an extreme y-value, or both. If we express the fitted value ŷ_i as a linear combination of all the y_j, the coefficient h_ii of y_i is called its leverage; for SLR,

    h_ii = 1/n + (x_i - x̄)² / S_xx

If h_ii exceeds twice the average leverage (i.e., h_ii > 4/n for SLR), then the corresponding observation may be regarded as influential.
Slide 40

5. Regression Diagnostics

Table 5.3. Standardized residuals e_i* and leverages h_ii for the transformed data

 i   e_i*      h_ii
 1    2.8653   0.3778
 2   -0.4113   0.2611
 3   -0.5367   0.1778
 4   -0.8505   0.1278
 5   -0.4067   0.1111
 6   -0.2102   0.1278
 7   -0.5519   0.1778
 8    0.1416   0.2611
 9    0.8484   0.3778
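The values in Table 5.3 can be reproduced from the transformed fit. A Python sketch (illustrative; the deck's own code is MATLAB/SAS):

```python
import math

x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]
n = len(x)

# Fit on the log scale (as in Table 5.2); residuals on the original scale
ly = [math.log(yi) for yi in y]
xbar, lbar = sum(x) / n, sum(ly) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (li - lbar) for xi, li in zip(x, ly)) / Sxx
b0 = lbar - b1 * xbar
e = [yi - math.exp(b0 + b1 * xi) for xi, yi in zip(x, y)]

# Leverages depend only on the x's
h = [1 / n + (xi - xbar) ** 2 / Sxx for xi in x]

# Standardized residuals; |e*| > 2 flags a possible outlier
s = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2))
estar = [ei / (s * math.sqrt(1 - hi)) for ei, hi in zip(e, h)]
print([round(v, 4) for v in estar])  # only the first observation exceeds 2
```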
Slide 41

MATLAB Code for Regression Diagnostics

clear; clc;
x = [0 4 8 12 16 20 24 28 32];
y = [394.33 329.50 291.00 255.17 229.33 204.83 179.00 163.83 150.33];
y1 = log(y);                        % data transformation
p = polyfit(x, y, 1)                % linear regression predicting y from x
% p = polyfit(x, log(y), 1)         % alternative: fit on the log scale
yfit = polyval(p, x);               % use p to predict y
yresid = y - yfit;                  % compute the residuals
% yresid = y - exp(yfit);           % residuals (original scale) for the log fit
ssresid = sum(yresid.^2);           % residual sum of squares
sstotal = (length(y) - 1) * var(y); % total sum of squares
rsq = 1 - ssresid/sstotal;          % R square
normplot(yresid)                    % normal plot of the residuals
[h, pval, jbstat, critval] = jbtest(yresid)  % test normality
scatter(x, y, 500, 'r', '.')        % generate the scatter plot
lsline
axis([-5, 35, 100, 420])
xlabel('x_i')
ylabel('y_i')
title('plot of ...')
n = length(x);
% check for outliers: standardized residuals
lev = 1/n + (x - mean(x)).^2/960;                % leverages h_ii (S_xx = 960)
rstd = yresid ./ (std(yresid) * sqrt(1 - lev));  % standardized residuals
% check for influential observations: leverages lev above
Slide 42

6.1 Correlation Analysis

Why do we need this? Regression analysis is used to model the relationship between two variables when one is treated as fixed. But when there is no such distinction and both variables are random, correlation analysis is used to study the strength of the relationship.
Slide 43

6.1 Correlation Analysis - Example

Figure 6.1. Examples of correlated variables: flu cases reported, people who get flu shots, temperature; life expectancy, economy level, economic growth.
Slide 44

6.2 Bivariate Normal Distribution

Because we need to investigate the correlation between X and Y, we model them jointly.

Figure 6.2
Source: http://wiki.stat.ucla.edu/socr/index.php/File:SOCR_BivariateNormal_JS_Activity_Fig7.png
Slide 45

6.2 Why introduce the Bivariate Normal Distribution?

First, we need to do some computation: if (X, Y) is bivariate normal, then

    E(Y | X = x) = μ_Y + ρ (σ_Y/σ_X)(x - μ_X),   Var(Y | X = x) = σ_Y² (1 - ρ²)

Compare with the regression model, where E(Y | x) = β0 + β1 x and Var(Y | x) = σ². So, if (X, Y) have a bivariate normal distribution, then the regression model is true with

    β1 = ρ σ_Y/σ_X,   β0 = μ_Y - β1 μ_X,   σ² = σ_Y² (1 - ρ²)

Slide 46

6.3 Statistical Inference on ρ

Define the r.v. R corresponding to r. But the distribution of R is quite complicated.
Figure 6.3. Sampling density f(r) of R for ρ = -0.7, -0.3, 0, and 0.5
Slide 47

6.3 Exact test when ρ = 0

Test: H0: ρ = 0 vs. Ha: ρ ≠ 0

Test statistic:

    t0 = r √(n - 2) / √(1 - r²) ~ t_{n-2} under H0

Reject H0 iff |t0| > t_{n-2, α/2}.

Example: A researcher wants to determine if two test instruments give similar results. The two test instruments are administered to a sample of 15 students. The correlation coefficient between the two sets of scores is found to be 0.7. Is this correlation statistically significant at the .01 level?

    H0: ρ = 0 vs. Ha: ρ ≠ 0
    t0 = 0.7 √13 / √(1 - 0.49) = 3.534 > t_{13, .005} = 3.012

So, we reject H0.
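The arithmetic of the example can be checked in a couple of lines (Python sketch, illustrative only):

```python
import math

# Example from the slide: correlation 0.7 between two test scores, n = 15
r, n = 0.7, 15
t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t0, 3))   # 3.534, which exceeds t_{13, .005} = 3.012
```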
Slide 48

6.3 Note: they are the same!

Because r = β̂1 √(S_xx / S_yy), the t statistic for H0: β1 = 0 can be rewritten as t0 = r √(n - 2) / √(1 - r²). So we can say the tests of H0: β1 = 0 and H0: ρ = 0 are equivalent.

Slide 49

6.3 Approximate test when ρ ≠ 0

Because the exact distribution of R is not very useful for making inferences on ρ, R. A. Fisher showed that we can apply the transformation

    Z = (1/2) ln[(1 + R)/(1 - R)]

to make it approximately normal:

    Z ≈ N( (1/2) ln[(1 + ρ)/(1 - ρ)], 1/(n - 3) )
Slide 50

6.3 Steps to do the approximate test on ρ

1. H0: ρ = ρ0 vs. H1: ρ ≠ ρ0
2. Point estimator: z = (1/2) ln[(1 + r)/(1 - r)]
3. T.S.: Z0 = (z - ζ0) √(n - 3), where ζ0 = (1/2) ln[(1 + ρ0)/(1 - ρ0)]; reject H0 at level α if |Z0| > z_{α/2}
4. C.I.: back-transform z ± z_{α/2}/√(n - 3) to the ρ scale
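For the same example as the exact test (r = 0.7, n = 15), Fisher's transformation gives an approximate 95% confidence interval for ρ. A Python sketch (illustrative; the 95% level and z critical value 1.96 are assumptions for the example):

```python
import math

r, n = 0.7, 15                              # same example as the exact test
z = 0.5 * math.log((1 + r) / (1 - r))       # Fisher's z transform (= atanh(r))
se = 1 / math.sqrt(n - 3)                   # approximate standard error
zcrit = 1.96                                # z_{.025} for a 95% interval
lo, hi = z - zcrit * se, z + zcrit * se
ci = (math.tanh(lo), math.tanh(hi))         # back-transform to the rho scale
print(ci)                                   # excludes 0, consistent with the exact test
```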
Slide 51

6.4 The Pitfalls of Correlation Analysis

- Lurking variable: a third variable may drive both X and Y, so correlation does not imply causation.
- Over-extrapolation: the observed relationship may not hold outside the range of the data.
Slide 52

7. Implementation in SAS

Table 7.1. Vote example data

  i  state  district  democA  voteA  expendA  expendB  prtystrA  lexpendA  lexpendB  shareA
  1  "AL"      7        1      68     328.3     8.74      41     5.793916  2.167567   97.41
  2  "AK"      1        0      62     626.38  402.48      60     6.439952  5.997638   60.88
  3  "AZ"      2       ...
...
173  "WI"      8        1      30      14.42  227.82      47     2.668685  5.428569    5.95
Slide 53

7. Implementation in SAS

SAS code for the vote example:

proc corr data=vote1;
  var F4 F10;
run;

proc reg data=vote1;
  model F4 = F10;
  label F4 = voteA;
  label F10 = shareA;
  output out=fitvote residual=R;
run;

Table 7.2. Correlation coefficients

Pearson Correlation Coefficients, N = 173
Prob > |r| under H0: Rho=0

       F4        F10
F4     1.00000   0.92528
Slide 54

7. Implementation in SAS

SAS output:

Analysis of Variance

Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model               1   41486            41486         1017.70   <.0001
Error             171    6970.77364         40.76476
Corrected Total   172   48457

Root MSE          6.38473    R-Square   0.8561
Dependent Mean   50.50289    Adj R-Sq   0.8553
Coeff Var        12.64230

Parameter Estimates

Variable    Label      DF   Parameter Estimate   Standard Error   t Value
Intercept   Intercept   1   26.81254             0.88719          30.22
F10         F10         1    0.46382             0.01454          31.90

Table 7.3. SAS output for the vote example
Slide 55

7. Implementation in SAS

Figure 7.1. Plot of residual vs. shareA for the vote example

Slide 56

7. Implementation in SAS

Figure 7.2. Plot of voteA vs. shareA for the vote example

Slide 57

7. Implementation in SAS - Check Homoscedasticity

Figure 7.3. Plots of SAS output for the vote example
Slide 58

7. Implementation in SAS - Check Normality of Residuals

SAS code:

proc univariate data=fitvote normal;
  var R;
  qqplot R / normal (Mu=est Sigma=est);
run;

Tests for Location: Mu0=0

Test          Statistic           p Value
Student's t   t         0         Pr > |t|    1.0000
Sign          M      -0.5         Pr >= |M|   1.0000
Signed Rank   S    -170.5         Pr >= |S|   0.7969

Tests for Normality

Test                 Statistic          p Value
Shapiro-Wilk         W     0.952811     Pr < W       0.7395
Kolmogorov-Smirnov   D     0.209773     Pr > D      >0.1500
Cramer-von Mises     W-Sq  0.056218     Pr > W-Sq   >0.2500
Anderson-Darling     A-Sq  0.30325      Pr > A-Sq   >0.2500

Table 7.4. SAS output for checking normality
Slide 59

7. Implementation in SAS - Check Normality of Residuals

Figure 7.4. Plot of residual vs. normal quantiles for the vote example
Slide 60

8. Application

Linear regression is widely used to describe possible relationships between variables, and it ranks as one of the most important tools in many disciplines:
- Marketing/business analytics
- Healthcare
- Finance
- Economics
- Ecology/environmental science
Slide 61

8. Application

Prediction, forecasting, or deduction: linear regression can be used to fit a predictive model to an observed data set of Y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of Y, the fitted model can be used to make a prediction of the value of Y.
Slide 62

8. Application

Quantifying the strength of a relationship: given a variable Y and a number of variables X1, ..., Xp that may be related to Y, linear regression analysis can be applied to assess which Xj may have no relationship with Y at all, and to identify which subsets of the Xj contain redundant information about Y.
Slide 63

8. Application

Example 1. Trend line. A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. Trend lines are sometimes used in business analytics to show changes in data over time.

Figure 8.1. Refrigerator sales over a 13-year period
http://www.likeoffice.com/28057/Excel-2007-Formatting-charts
Slide 64

8. Application

Example 2. Clinical drug trials. Regression analysis is widely utilized in healthcare. The graph shows an example in which we investigate the relationship between protein concentration and absorbance using linear regression analysis.

Figure 8.2. BSA protein concentration vs. absorbance
http://openwetware.org/wiki/User:Laura_Flynn/Notebook/Experimental_Biological_Chemistry/2011/09/13
Slide 65

Summary

Linear Regression Analysis
- Probabilistic models
- Least squares estimates
- Model assumptions: linearity, constant variance & normality; data transformation
- Outliers & influential observations
- Statistical inference

Correlation Analysis
- Correlation coefficient (bivariate normal distribution, exact t-test, approximate z-test)
Slide 66

Acknowledgement & References

Sincere thanks go to Prof. Wei Zhu.

References:
- Statistics and Data Analysis, Ajit Tamhane & Dorothy Dunlop.
- Introductory Econometrics: A Modern Approach, Jeffrey M. Wooldridge, 5th ed.
- http://en.wikipedia.org/wiki/Regression_analysis
- http://en.wikipedia.org/wiki/Adrien_Marie_Legendre
- etc. (web links have already been included in the slides)