Presentation Transcript

Slide1

Simple Linear Regression & Correlation
Instructor: Prof. Wei Zhu
11/21/2013

AMS 572 Group Project

Slide2

Outline

1. Motivation & Introduction – Lizhou Nie
2. A Probabilistic Model for Simple Linear Regression – Long Wang
3. Fitting the Simple Linear Regression Model – Zexi Han
4. Statistical Inference for Simple Linear Regression – Lichao Su
5. Regression Diagnostics – Jue Huang
6. Correlation Analysis – Ting Sun
7. Implementation in SAS – Qianyi Chen
8. Application and Summary – Jie Shuai

Slide3

1. Motivation

Fig. 1.1 Simplified Model for Solar System
http://popperfont.net/2012/11/13/the-ultimate-solar-system-animated-gif/

Fig. 1.2 Obama & Romney during Presidential Election Campaign
http://outfront.blogs.cnn.com/2012/08/14/the-most-negative-in-campaign-history/

Slide4

Introduction

Regression Analysis
- Linear Regression:
  - Simple Linear Regression: {y; x}
  - Multiple Linear Regression: {y; x1, …, xp}
  - Multivariate Linear Regression: {y1, …, yn; x1, …, xp}
- Correlation Analysis:
  - Pearson Product-Moment Correlation Coefficient: a measure of the linear relationship between two variables

Slide5

History

- Adrien-Marie Legendre: earliest form of regression, the least squares method
- Carl Friedrich Gauss: further development of least squares theory, including the Gauss-Markov theorem
- Sir Francis Galton: coined the term "regression"
- George Udny Yule & Karl Pearson: extension to a more generalized statistical context

http://en.wikipedia.org/wiki/Regression_analysis
http://en.wikipedia.org/wiki/Adrien_Marie_Legendre
http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss
http://en.wikipedia.org/wiki/Francis_Galton
http://www.york.ac.uk/depts/maths/histstat/people/yule.gif
http://en.wikipedia.org/wiki/Karl_Pearson

Slide6

2. A Probabilistic Model

Simple Linear Regression
- A special case of linear regression: one response variable and one explanatory variable

General Setting
- We denote the explanatory variable by Xi and the response variable by Yi
- n pairs of observations {(xi, yi)}, i = 1, …, n

Slide7

2. A Probabilistic Model

Sketch the graph of the n = 100 observations (the point (29, 5.5) is highlighted on the slide):

i | X | Y
1 | 37.70 | 9.82
2 | 16.31 | 5.00
3 | 28.37 | 9.27
4 | -12.13 | 2.98
… | … | …
98 | 9.06 | 7.34
99 | 28.54 | 10.37
100 | -17.19 | 2.33

Slide8

2. A Probabilistic Model

In simple linear regression, the data are described as:

    Yi = β0 + β1 xi + εi,   where εi ~ N(0, σ²)

The fitted model:

    ŷ = β̂0 + β̂1 x

where β̂0 is the intercept and β̂1 is the slope of the regression line.
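As a quick illustration of this model, data like the table above can be simulated (a Python sketch; the true parameter values below are made up for the demo, not taken from the slides):

```python
import random

# Hypothetical "true" parameters (illustration only)
b0_true, b1_true, sigma = 2.0, 0.25, 1.0

random.seed(572)
x = [random.uniform(-20, 40) for _ in range(100)]   # 100 x-values, as in the table
# Y_i = b0 + b1*x_i + eps_i, with eps_i ~ N(0, sigma^2) independent
y = [b0_true + b1_true * xi + random.gauss(0, sigma) for xi in x]

print(len(x), len(y))   # 100 100
```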

Slide9

3. Fitting the Simple Linear Regression Model

Table 3.1. Tire tread wear vs. mileage:

Mileage (in 1000 miles) | Groove Depth (in mils)
0 | 394.33
4 | 329.50
8 | 291.00
12 | 255.17
16 | 229.33
20 | 204.83
24 | 179.00
28 | 163.83
32 | 150.33

Fig 3.1. Scatter plot of tire tread wear vs. mileage. From: Statistics and Data Analysis; Tamhane and Dunlop; Prentice Hall.

Slide10

3. Fitting the Simple Linear Regression Model

The difference between the fitted line and the observed data is the residual

    ei = yi − ŷi

Our goal: minimize the sum of squared residuals.

Fig 3.2. ei is the vertical distance between the fitted line and the data point.

Slide11

3. Fitting the Simple Linear Regression Model

Least Squares Method: choose β̂0 and β̂1 to minimize

    Q = Σ (yi − β0 − β1 xi)²

Slide12

3. Fitting the Simple Linear Regression Model

Setting ∂Q/∂β0 = 0 and ∂Q/∂β1 = 0 gives the normal equations, whose solution is

    β̂1 = Sxy / Sxx,   β̂0 = ȳ − β̂1 x̄

Slide13

3. Fitting the Simple Linear Regression Model

To simplify, we denote:

    Sxx = Σ (xi − x̄)²,   Syy = Σ (yi − ȳ)²,   Sxy = Σ (xi − x̄)(yi − ȳ)

Slide14

3. Fitting the Simple Linear Regression Model

Back to the example: for the tire data, x̄ = 16, ȳ = 244.15, Sxx = 960, and Sxy = −6989.4, so

    β̂1 = −6989.4 / 960 = −7.281,   β̂0 = 244.15 − (−7.281)(16) = 360.64

Slide15

3. Fitting the Simple Linear Regression Model

Therefore, the equation of the fitted line is:

    ŷ = 360.64 − 7.281x

Not enough! We still need to check how well this line fits the data.
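The least squares fit for Table 3.1 can be checked with a short script (a Python sketch; the deck's own code is in MATLAB and SAS, so this is only a cross-check):

```python
# Least squares fit for the tire wear data of Table 3.1,
# using the S_xx / S_xy formulas from the preceding slides.
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]                     # mileage (1000 miles)
y = [394.33, 329.50, 291.00, 255.17, 229.33,
     204.83, 179.00, 163.83, 150.33]                      # groove depth (mils)

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)                   # = 960
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx                                            # slope estimate
b0 = ybar - b1 * xbar                                     # intercept estimate
print(round(b0, 2), round(b1, 3))                         # 360.64 -7.281
```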

Slide16

3. Fitting the Simple Linear Regression Model — Check the goodness of fit of the LS line

We define:

    SST = Σ (yi − ȳ)²   (total sum of squares)
    SSR = Σ (ŷi − ȳ)²   (regression sum of squares)
    SSE = Σ (yi − ŷi)²  (error sum of squares)

One can prove that SST = SSR + SSE. The ratio

    r² = SSR / SST

is called the coefficient of determination.

Slide17

3. Fitting the Simple Linear Regression Model — Check the goodness of fit of the LS line

Back to the example:

    r² = SSR / SST = 0.953,   r = −√0.953 = −0.976

where the sign of r follows from the sign of β̂1. Since 95.3% of the variation in tread wear is accounted for by linear regression on mileage, the relationship between the two is strongly linear with a negative slope.
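The decomposition and r² can be verified numerically (a Python sketch, recomputing the fit from Table 3.1):

```python
import math

# Goodness of fit for the tire wear data: SST = SSR + SSE, r^2 = SSR/SST.
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33,
     204.83, 179.00, 163.83, 150.33]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * xi for xi in x]                       # fitted values
SST = sum((yi - ybar) ** 2 for yi in y)
SSR = sum((yh - ybar) ** 2 for yh in yhat)
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))

rsq = SSR / SST                                         # coefficient of determination
r = -math.sqrt(rsq)                                     # sign follows the sign of b1
s = math.sqrt(SSE / (n - 2))                            # estimate of sigma (used later)
print(round(rsq, 3), round(r, 3), round(s, 2))          # 0.953 -0.976 19.02
```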

Slide18

3. Fitting the Simple Linear Regression Model

r is the sample correlation coefficient between X and Y:

    r = Sxy / √(Sxx Syy)

For simple linear regression,

    r = β̂1 √(Sxx / Syy),   so that r² = SSR / SST

Slide19

3. Fitting the Simple Linear Regression Model

Estimation of σ²: the variance σ² measures the scatter of the Yi around their means μi = β0 + β1 xi. An unbiased estimate of σ² is given by

    s² = SSE / (n − 2)

Slide20

3. Fitting the Simple Linear Regression Model

From the example, we have SSE = 2531.5 and n − 2 = 7; therefore

    s² = 2531.5 / 7 = 361.6

which has 7 d.f. The estimate of σ is s = √361.6 = 19.02.

Slide21

4. Statistical Inference For SLR

Slide22

4. Statistical Inference for SLR

Under the normal error assumption:

* Point estimators: the LS estimates β̂0 and β̂1 are unbiased for β0 and β1.
* Sampling distributions of β̂0 and β̂1:

    β̂1 ~ N(β1, σ² / Sxx),   β̂0 ~ N(β0, σ² Σxi² / (n Sxx))

Slide23

Derivation

For the mathematical derivations, please refer to the Tamhane and Dunlop textbook, p. 331.

Slide25

Statistical Inference on β0 and β1

* Pivotal Quantities (P.Q.'s):

    (β̂1 − β1) / SE(β̂1) ~ t(n−2),   (β̂0 − β0) / SE(β̂0) ~ t(n−2)

  where SE(β̂1) = s / √Sxx and SE(β̂0) = s √(Σxi² / (n Sxx)).

* Confidence Intervals (C.I.'s):

    β̂1 ± t(n−2, α/2) SE(β̂1),   β̂0 ± t(n−2, α/2) SE(β̂0)

Slide26

A useful application is to test whether there is a linear relationship between x and y.

Hypothesis tests:

- H0: β1 = β1⁰ vs. H1: β1 ≠ β1⁰ — reject H0 at level α if |t| = |β̂1 − β1⁰| / SE(β̂1) ≥ t(n−2, α/2)
- H0: β0 = β0⁰ vs. H1: β0 ≠ β0⁰ — reject H0 at level α if |t| = |β̂0 − β0⁰| / SE(β̂0) ≥ t(n−2, α/2)

In particular, H0: β1 = 0 corresponds to no linear relationship between x and y.
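For the tire example, the test of H0: β1 = 0 can be sketched as follows (Python; the fit is recomputed from Table 3.1, and 2.365 is t(7, .025) read from a t-table):

```python
import math

# t-test and 95% C.I. for the slope of the tire wear fit.
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33,
     204.83, 179.00, 163.83, 150.33]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(SSE / (n - 2))

se_b1 = s / math.sqrt(Sxx)            # standard error of the slope
t0 = b1 / se_b1                       # test statistic for H0: beta1 = 0
t_crit = 2.365                        # t(7, .025) from a t-table
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(round(t0, 2), abs(t0) >= t_crit)        # -11.86 True -> reject H0
print(tuple(round(v, 2) for v in ci))         # (-8.73, -5.83)
```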

Slide27

Analysis of Variance (ANOVA)

Mean Square: a sum of squares divided by its degrees of freedom.

Slide28

Analysis of Variance (ANOVA)

ANOVA Table

Source of Variation (Source) | Sum of Squares (SS) | Degrees of Freedom (d.f.) | Mean Square (MS) | F
Regression | SSR | 1 | MSR = SSR/1 | F = MSR/MSE
Error | SSE | n − 2 | MSE = SSE/(n − 2) |
Total | SST | n − 1 | |
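The table can be filled in numerically for the tire example (a Python sketch; recall that for simple linear regression the F statistic equals the square of the slope t statistic):

```python
# ANOVA for the tire wear fit: F = MSR / MSE with (1, n-2) d.f.
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33,
     204.83, 179.00, 163.83, 150.33]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

SSR = sum((b0 + b1 * xi - ybar) ** 2 for xi in x)       # regression SS, 1 d.f.
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # error SS, n-2 d.f.

MSR = SSR / 1
MSE = SSE / (n - 2)
F = MSR / MSE
print(round(F, 1))            # 140.7, i.e. about (-11.86)^2
```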

Slide29

5. Regression Diagnostics

5.1 Checking the Model Assumptions
  5.1.1 Checking for Linearity
  5.1.2 Checking for Constant Variance
  5.1.3 Checking for Normality
  (Primary tool: residual plots)
5.2 Checking for Outliers and Influential Observations
  5.2.1 Checking for Outliers
  5.2.2 Checking for Influential Observations
  5.2.3 How to Deal with Outliers and Influential Observations


Slide31

5. Regression Diagnostics

5.1.1 Checking for Linearity

Table 5.1. The xi, yi, ŷi, ei for the Tire Wear Data:

i | xi | yi | ŷi | ei
1 | 0 | 394.33 | 360.64 | 33.69
2 | 4 | 329.50 | 331.51 | -2.01
3 | 8 | 291.00 | 302.39 | -11.39
4 | 12 | 255.17 | 273.27 | -18.10
5 | 16 | 229.33 | 244.15 | -14.82
6 | 20 | 204.83 | 215.02 | -10.19
7 | 24 | 179.00 | 185.90 | -6.90
8 | 28 | 163.83 | 156.78 | 7.05
9 | 32 | 150.33 | 127.66 | 22.67

Figure 5.1. Plot for the Tire Wear Data.

Slide32

5. Regression Diagnostics

5.1.1 Checking for Linearity (Data transformation)

Figure 5.2. Typical Scatter Plot Shapes and Corresponding Linearizing Transformations (depending on the shape: plot x², x³, log x, or −1/x against y; or plot x against log y, 1/y, y², or y³).

Slide33

5. Regression Diagnostics

5.1.1 Checking for Linearity (Data transformation)

Table 5.2. The xi, yi, log-scale fit, ŷi, ei for the Tire Wear Data:

i | xi | yi | fitted log(yi) | ŷi | ei
1 | 0 | 394.33 | 5.926 | 374.64 | 19.69
2 | 4 | 329.50 | 5.807 | 332.58 | -3.08
3 | 8 | 291.00 | 5.688 | 295.24 | -4.24
4 | 12 | 255.17 | 5.569 | 262.09 | -6.92
5 | 16 | 229.33 | 5.450 | 232.67 | -3.34
6 | 20 | 204.83 | 5.331 | 206.54 | -1.71
7 | 24 | 179.00 | 5.211 | 183.36 | -4.36
8 | 28 | 163.83 | 5.092 | 162.77 | 1.06
9 | 32 | 150.33 | 4.973 | 144.50 | 5.83

Figure 5.2. Plot of the transformed fit for the Tire Wear Data.


Slide35

5. Regression Diagnostics

5.1.2 Checking for Constant Variance

Plot the residuals ei against the fitted values ŷi. If the constant variance assumption is correct, the dispersion of the ei's is approximately constant with respect to the ŷi's.

Figure 5.3. Plots of Residuals.
Figure 5.4. Plots of Residuals.


Slide37

5. Regression Diagnostics

5.1.3 Checking for Normality

Make a normal plot of the residuals ei. They have zero mean and an approximately constant variance (assuming the other assumptions about the model are correct), so if the normality assumption holds the plot should be roughly linear.

Figure 5.5. Normal plot of the residuals.


Slide39

5. Regression Diagnostics

Outlier: an observation that does not follow the general pattern of the relationship between y and x. A large residual indicates an outlier. Standardized residuals are given by

    ei* = ei / (s √(1 − hii))

If |ei*| > 2, the corresponding observation may be regarded as an outlier.

Influential Observation: an influential observation has an extreme x-value, an extreme y-value, or both. If we express the fitted value of y as a linear combination of all the yj,

    ŷi = Σj hij yj,   with leverage hii = 1/n + (xi − x̄)² / Sxx

If hii > 2(k + 1)/n (for simple linear regression, k = 1, so hii > 4/n), the corresponding observation may be regarded as influential.

Slide40

5. Regression Diagnostics

5.2 Checking for Outliers and Influential Observations

Table 5.3. Standardized residuals ei* and leverage hii for the transformed data:

i | ei* | hii
1 | 2.8653 | 0.3778
2 | -0.4113 | 0.2611
3 | -0.5367 | 0.1778
4 | -0.8505 | 0.1278
5 | -0.4067 | 0.1111
6 | -0.2102 | 0.1278
7 | -0.5519 | 0.1778
8 | 0.1416 | 0.2611
9 | 0.8484 | 0.3778
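The leverage column of this table can be reproduced directly from hii = 1/n + (xi − x̄)²/Sxx (a Python sketch; the cutoff hii > 2(k+1)/n = 4/n is the usual textbook rule for simple regression):

```python
# Leverages h_ii for the tire data: h_ii = 1/n + (x_i - xbar)^2 / S_xx.
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]      # tire data x-values
n = len(x)
xbar = sum(x) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)    # = 960

h = [1 / n + (xi - xbar) ** 2 / Sxx for xi in x]
influential = [i + 1 for i, hii in enumerate(h) if hii > 4 / n]

print([round(v, 4) for v in h])    # [0.3778, 0.2611, 0.1778, ..., 0.3778]
print(influential)                 # [] -- no h_ii exceeds 4/9
```

No leverage exceeds 4/9 ≈ 0.444, so no observation is flagged as influential; observation 1, with |e1*| = 2.8653 > 2, may be regarded as an outlier.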

Slide41

MATLAB Code for Regression Diagnostics

clear; clc;
x = [0 4 8 12 16 20 24 28 32];
y = [394.33 329.50 291.00 255.17 229.33 204.83 179.00 163.83 150.33];
y1 = log(y);                          % data transformation
p = polyfit(x, y, 1)                  % linear regression predicting y from x
% p = polyfit(x, log(y), 1)           % fit for the transformed data
yfit = polyval(p, x);                 % use p to predict y
yresid = y - yfit;                    % compute the residuals
% yresid = y - exp(yfit);             % residuals for the transformed data
ssresid = sum(yresid.^2);             % residual sum of squares
sstotal = (length(y) - 1) * var(y);   % total sum of squares
rsq = 1 - ssresid/sstotal;            % R-square
normplot(yresid)                      % normal plot of the residuals
[h, pval, jbstat, critval] = jbtest(yresid)   % Jarque-Bera test of normality
scatter(x, y, 500, 'r', '.')          % generate the scatter plot
lsline
axis([-5, 35, -10, 25])
xlabel('x_i'); ylabel('y_i'); title('plot of ...')
for i = 1:length(x)                   % leverage and standardized residuals
    lev(i) = 1/length(x) + (x(i) - mean(x))^2/960;       % S_xx = 960
    estd(i) = yresid(i)/(std(yresid)*sqrt(1 - lev(i)));  % check for outliers
end
% lev(i) > 4/length(x) flags influential observations

Slide42

6.1 Correlation Analysis

Why do we need this? Regression analysis is used to model the relationship between two variables when one is treated as the explanatory variable and the other as the response. But when there is no such distinction and both variables are random, correlation analysis is used to study the strength of the relationship.

Slide43

6.1 Correlation Analysis – Example

Figure 6.1. Example variables: flu reported, life expectancy, economy level, people who get flu shot, temperature, economic growth.

Slide44

6.2 Bivariate Normal Distribution

Because we need to investigate the correlation between X and Y, we model (X, Y) with the bivariate normal distribution.

Figure 6.2.
Source: http://wiki.stat.ucla.edu/socr/index.php/File:SOCR_BivariateNormal_JS_Activity_Fig7.png

Slide45

6.2 Why introduce the Bivariate Normal Distribution?

First, we need to do some computation. If (X, Y) is bivariate normal, the conditional distribution of Y given X = x is normal with mean

    E(Y | X = x) = μY + ρ (σY/σX)(x − μX)

Compare this with the regression model E(Y | x) = β0 + β1 x: the conditional mean is linear in x, with β1 = ρ σY/σX and β0 = μY − β1 μX. So, if (X, Y) have a bivariate normal distribution, then the regression model is true.

Slide46

6.3 Statistical Inference on r

Define the r.v. R corresponding to r. But the distribution of R is quite complicated.

Figure 6.3. Density f(r) of R for ρ = −0.7, −0.3, 0, and 0.5.

Slide47

6.3 Exact test when ρ = 0

Test: H0: ρ = 0 vs. Ha: ρ ≠ 0. Test statistic:

    t0 = r √(n − 2) / √(1 − r²) ~ t(n−2) under H0

Reject H0 iff |t0| ≥ t(n−2, α/2).

Example: A researcher wants to determine if two test instruments give similar results. The two test instruments are administered to a sample of 15 students. The correlation coefficient between the two sets of scores is found to be 0.7. Is this correlation statistically significant at the .01 level?

H0: ρ = 0 vs. Ha: ρ ≠ 0

    t0 = 0.7 √13 / √(1 − 0.49) = 3.534 > t(13, .005) = 3.012

So, we reject H0.
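The example's arithmetic, sketched in Python (3.012 is t(13, .005) read from a t-table, as on the slide):

```python
import math

# Exact t test for H0: rho = 0 with r = 0.7, n = 15.
r, n = 0.7, 15
t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # ~ t(n-2) under H0
t_crit = 3.012                                      # t(13, .005)

print(round(t0, 3), t0 > t_crit)                    # 3.534 True -> reject H0
```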

Slide48

6.3 Note: They are the same!

Because r = β̂1 √(Sxx/Syy), the t statistic for H0: ρ = 0,

    t0 = r √(n − 2) / √(1 − r²)

is algebraically identical to the t statistic β̂1 / SE(β̂1) for H0: β1 = 0. So we can say the tests of H0: β1 = 0 and H0: ρ = 0 are equivalent.

Slide49

6.3 Approximate test when ρ ≠ 0

Because the exact distribution of R is not very useful for making inferences on ρ, R. A. Fisher showed that we can apply the transformation

    Z = (1/2) ln[(1 + R)/(1 − R)] = tanh⁻¹(R)

which is approximately normal:

    Z ≈ N( (1/2) ln[(1 + ρ)/(1 − ρ)], 1/(n − 3) )

Slide50

6.3 Steps to do the approximate test on ρ

1. H0: ρ = ρ0 vs. H1: ρ ≠ ρ0
2. Point estimator: ẑ = tanh⁻¹(r)
3. T.S.: z0 = (ẑ − tanh⁻¹(ρ0)) √(n − 3), approximately N(0, 1) under H0
4. C.I. for ρ: tanh( ẑ ± z(α/2) / √(n − 3) )
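These steps can be sketched in Python; r = 0.7 and n = 15 are reused from the earlier example, while ρ0 = 0.5 is an illustrative null value not taken from the slides (z(.025) = 1.96):

```python
import math

r, n = 0.7, 15
rho0 = 0.5                         # hypothetical null value (illustration only)

zhat = math.atanh(r)               # step 2: Fisher z of r
se = 1 / math.sqrt(n - 3)

z0 = (zhat - math.atanh(rho0)) / se          # step 3: approx N(0,1) under H0
lo = math.tanh(zhat - 1.96 * se)             # step 4: 95% C.I., back-transformed
hi = math.tanh(zhat + 1.96 * se)

print(round(z0, 2))                # 1.1 -> |z0| < 1.96, do not reject H0
print(round(lo, 3), round(hi, 3))  # C.I. for rho on the original scale
```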

Slide51

6.4 The pitfalls of correlation analysis

- Lurking variables
- Over-extrapolation

Slide52

7. Implementation in SAS

Table 7.1. Vote example data:

obs | state | district | democA | voteA | expendA | expendB | prtystrA | lexpendA | lexpendB | shareA
1 | "AL" | 7 | 1 | 68 | 328.3 | 8.74 | 41 | 5.793916 | 2.167567 | 97.41
2 | "AK" | 1 | 0 | 62 | 626.38 | 402.48 | 60 | 6.439952 | 5.997638 | 60.88
3 | "AZ" | 2 | 1 | 73 | 99.61 | 3.07 | 55 | 4.601233 | 1.120048 | 97.01
… | | | | | | | | | |
173 | "WI" | 8 | 1 | 30 | 14.42 | 227.82 | 47 | 2.668685 | 5.428569 | 5.95

Slide53

7. Implementation in SAS

SAS code of the vote example:

proc corr data=vote1;
  var F4 F10;
run;

proc reg data=vote1;
  model F4 = F10;
  label F4 = voteA;
  label F10 = shareA;
  output out=fitvote residual=R;
run;

Table 7.2. Correlation coefficients:

Pearson Correlation Coefficients, N = 173
Prob > |r| under H0: Rho=0

   | F4      | F10
F4 | 1.00000 | 0.92528

Slide54

7. Implementation in SAS

SAS output:

Analysis of Variance

Source          | DF  | Sum of Squares | Mean Square | F Value | Pr > F
Model           | 1   | 41486          | 41486       | 1017.70 | <.0001
Error           | 171 | 6970.77364     | 40.76476    |         |
Corrected Total | 172 | 48457          |             |         |

Root MSE 6.38473 | R-Square 0.8561
Dependent Mean 50.50289 | Adj R-Sq 0.8553
Coeff Var 12.64230 |

Parameter Estimates

Variable  | Label     | DF | Parameter Estimate | Standard Error | t Value
Intercept | Intercept | 1  | 26.81254           | 0.88719        | 30.22
F10       | F10       | 1  | 0.46382            | 0.01454        | 31.90

Table 7.3. SAS output for the vote example.

Slide55

Figure 7.1. Plot of Residual vs. shareA for the vote example.

7. Implementation in SAS

Slide56

Figure 7.2. Plot of voteA vs. shareA for the vote example.

7. Implementation in SAS

Slide57

7. Implementation in SAS

SAS — Check Homoscedasticity

Figure 7.3. Plots of SAS output for the vote example.

Slide58

7. Implementation in SAS

SAS — Check Normality of Residuals

SAS code:

proc univariate data=fitvote normal;
  var R;
  qqplot R / normal (Mu=est Sigma=est);
run;

Tests for Location: Mu0=0

Test        | Statistic  | p Value
Student's t | t = 0      | Pr > |t| = 1.0000
Sign        | M = -0.5   | Pr >= |M| = 1.0000
Signed Rank | S = -170.5 | Pr >= |S| = 0.7969

Tests for Normality

Test               | Statistic       | p Value
Shapiro-Wilk       | W = 0.952811    | Pr < W = 0.7395
Kolmogorov-Smirnov | D = 0.209773    | Pr > D > 0.1500
Cramer-von Mises   | W-Sq = 0.056218 | Pr > W-Sq > 0.2500
Anderson-Darling   | A-Sq = 0.30325  | Pr > A-Sq > 0.2500

Table 7.4. SAS output for checking normality.

Slide59

7. Implementation in SAS

SAS — Check Normality of Residuals

Figure 7.4. Plot of Residual vs. Normal Quantiles for the vote example.

Slide60

8. Application

Linear regression is widely used to describe possible relationships between variables, and it ranks as one of the most important tools in many disciplines:

- Marketing/business analytics
- Healthcare
- Finance
- Economics
- Ecology/environmental science

Slide61

8. Application

Prediction, forecasting, or deduction

Linear regression can be used to fit a predictive model to an observed data set of Y and X values. After developing such a model, if an additional value of X is given without its accompanying value of Y, the fitted model can be used to predict the value of Y.
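For instance, the tire wear line fitted in Section 3 (ŷ = 360.64 − 7.281x, computed from Table 3.1) can serve as such a predictive model (a Python sketch; per the over-extrapolation warning in Section 6.4, predictions are only trustworthy within the observed mileage range, 0 to 32):

```python
# Fitted tire wear model from Section 3: yhat = 360.64 - 7.281 * x
b0, b1 = 360.64, -7.281

def predict(mileage):
    """Predicted groove depth (mils) at the given mileage (in 1000 miles)."""
    return b0 + b1 * mileage

print(round(predict(25), 1))   # 178.6 -- predicted depth at 25,000 miles
```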

Slide62

8. Application

Quantifying the strength of the relationship

Given a variable Y and a number of variables X1, …, Xp that may be related to Y, linear regression analysis can be applied to assess which Xj may have no relationship with Y at all, and to identify which subsets of the Xj contain redundant information about Y.

Slide63

8. Application

Example 1. Trend line

A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. Trend lines are sometimes used in business analytics to show changes in data over time.

Figure 8.1. Refrigerator sales over a 13-year period.
http://www.likeoffice.com/28057/Excel-2007-Formatting-charts

Slide64

8. Application

Example 2. Clinical drug trials

Regression analysis is widely utilized in healthcare. The graph shows an example in which we investigate the relationship between protein concentration and absorbance using linear regression analysis.

Figure 8.2. BSA Protein Concentration vs. Absorbance.
http://openwetware.org/wiki/User:Laura_Flynn/Notebook/Experimental_Biological_Chemistry/2011/09/13

Slide65

Summary

Linear Regression Analysis
- Probabilistic models
- Least squares estimate
- Statistical inference
- Model assumptions: linearity, constant variance & normality; data transformation
- Outliers & influential observations

Correlation Analysis
- Correlation coefficient (bivariate normal distribution, exact t-test, approximate z-test)

Slide66

Acknowledgement & References

Acknowledgement: Sincere thanks go to Prof. Wei Zhu.

References:
- Statistics and Data Analysis, Ajit Tamhane & Dorothy Dunlop.
- Introductory Econometrics: A Modern Approach, Jeffrey M. Wooldridge, 5th ed.
- http://en.wikipedia.org/wiki/Regression_analysis
- http://en.wikipedia.org/wiki/Adrien_Marie_Legendre
- etc. (web links have already been included in the slides)