/
Xuhua Xia Xuhua Xia

Xuhua Xia - PowerPoint Presentation

kittie-lecroy
kittie-lecroy . @kittie-lecroy
Follow
383 views
Uploaded On 2017-06-20

Xuhua Xia - PPT Presentation

Stepwise Regression Y may depend on many independent variables How to find a subset of Xs that best predict Y There are several criteria eg adjusted R 2 AIC BIC likelihood ratio test etc for model selection and many algorithms for including or excluding Xs in the model forwa ID: 561435

true false weight age false true age weight runtime runpulse maxpulse rstpulse model oxy aic xia xuhua leaps adjusted 168 176 156

Share:

Link:

Embed:

Download Presentation from below link

Download Presentation The PPT/PDF document "Xuhua Xia" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Slide1

Xuhua Xia

Stepwise Regression

Y may depend on many independent variablesHow to find a subset of X’s that best predict Y?There are several criteria (e.g., adjusted R2, AIC, BIC, likelihood ratio test, etc.) for model selection and many algorithms for including or excluding X’s in the model: forward selection, backward elimination, stepwise regression, etc.With the availability of statistical packages, stepwise regression is now most commonly used.

X

2

X

3

X

4

Y

X

1

X

5

X

6Slide2

Xuhua Xia

A Data Set for Multiple Regression

Measurements on men involved in a physical fitness course at N. C. State University. Fitness is typically measured by oxygen intake rate (oxy) which is difficult (at least cumbersome when one is exercising oneself) to measure. The study goal is to develop an equation to predict oxy based on exercise tests rather than on oxygen consumption measurements.

The dataset has 31 observations. The variables in the data set are:

age (in years)

weight (in kg)

oxy (oxygen intake rate, ml per kg body weight per minute)

runtime (time to run 1.5 miles, in minutes)

rstpulse (heart rate while resting)runpulse (heart rate while running, at the same time when oxygen rate was measured)maxpulse (maximum heart rate recorded while running).Slide3

Xuhua Xia

oxy

ageweight

runtime

rstpulse

runpulsemaxpulse

44.60944

89.4711.376217818259.5714268.158.17

4016617245.6814075.9811.9570176

18060.0553881.878.634817018644.754

4566.4511.125117617649.1564981.42

8.9544

18018546.7744891.6310.2548162164

46.085479.3811.176215616545.11851

67.2511.0848

17217250.5455759.089.9349148155

47.467

52

82.78

10.5

53

170

172

45.313

40

75.07

10.07

62

185

185

49.874

38

89.02

9.22

55

178

180

49.091

43

81.19

10.85

64

162

170

50.541

44

73.03

10.13

45

168

168

47.273

47

79.15

10.6

47

162

164

40.836

51

69.63

10.95

57

168

172

50.388

49

73.37

10.08

67

168

168

45.441

52

76.32

9.63

48

164

166

39.203

54

91.63

12.88

44

168

172

48.673

49

76.32

9.4

56

186

188

54.297

44

85.84

8.65

45

156

168

44.811

47

77.45

11.63

58

176

176

39.442

44

81.42

13.08

63

174

176

37.388

45

87.66

14.03

56

186

192

51.855

54

83.12

10.33

50

166

170

46.672

51

77.91

10

48

162

168

39.407

57

73.37

12.63

58

174

176

54.625

50

70.87

8.92

48

146

155

45.79

51

73.71

10.47

59

186

188

47.92

48

61.24

11.5

52

170

176Slide4

Xuhua Xia

Correlation matrix

age weight oxy runtime rstpulse runpulse maxpulse

age 1.00000 -0.23354 -0.30459 0.18875 -0.16410 -0.33787 -0.43292

0.2061 0.0957 0.3092 0.3777 0.0630 0.0150

weight -0.23354 1.00000 -0.16275 0.14351 0.04397 0.18152 0.24938

0.2061 0.3817 0.4412 0.8143 0.3284 0.1761

oxy -0.30459 -0.16275 1.00000 -0.86219 -0.39936 -0.39797 -0.23674 0.0957 0.3817 <.0001 0.0260 0.0266 0.1997runtime 0.18875 0.14351 -0.86219 1.00000 0.45038 0.31365 0.22610 0.3092 0.4412 <.0001 0.0110 0.0858 0.2213rstpulse -0.16410 0.04397 -0.39936 0.45038 1.00000 0.35246 0.30512 0.3777 0.8143 0.0260 0.0110 0.0518 0.0951runpulse -0.33787 0.18152 -0.39797 0.31365 0.35246 1.00000 0.92975 0.0630 0.3284 0.0266 0.0858 0.0518 <.0001maxpulse -0.43292 0.24938 -0.23674 0.22610 0.30512 0.92975 1.00000

0.0150 0.1761 0.1997 0.2213 0.0951 <.0001Slide5

Xuhua Xia

Scatterplot matrixSlide6

rcorr in Hmisc

oxy age weight runtime

rstpulse

runpulse

maxpulse

oxy 1.00 -0.30 -0.16 -0.86 -0.40 -0.40 -0.24age -0.30 1.00 -0.23 0.19 -0.16 -0.34 -0.43

weight -0.16 -0.23 1.00 0.14 0.04 0.18 0.25runtime -0.86 0.19 0.14 1.00 0.45 0.31 0.23

rstpulse -0.40 -0.16 0.04 0.45 1.00 0.35 0.31runpulse -0.40 -0.34 0.18 0.31 0.35 1.00 0.93maxpulse -0.24 -0.43 0.25 0.23 0.31 0.93 1.00P oxy age weight runtime rstpulse runpulse

maxpulseoxy 0.0957 0.3817 0.0000 0.0260 0.0266 0.1997 age 0.0957 0.2061 0.3092 0.3777 0.0630 0.0150 weight 0.3817 0.2061 0.4412 0.8143 0.3284 0.1761 runtime 0.0000 0.3092 0.4412 0.0110 0.0858 0.2213 rstpulse 0.0260 0.3777 0.8143 0.0110 0.0518 0.0951 runpulse 0.0266 0.0630 0.3284 0.0858 0.0518 0.0000 maxpulse

0.1997 0.0150 0.1761 0.2213 0.0951 0.0000 > print(rmat) oxy age weight runtime rstpulse runpulse maxpulseoxy 1.00 -0.30 -0.16 -0.86 -0.40 -0.40 -0.24age -0.30 1.00 -0.23 0.19 -0.16 -0.34 -0.43

weight -0.16 -0.23 1.00 0.14 0.04 0.18 0.25runtime -0.86 0.19 0.14 1.00 0.45 0.31 0.23rstpulse -0.40 -0.16 0.04 0.45 1.00 0.35 0.31runpulse -0.40 -0.34 0.18 0.31 0.35 1.00 0.93maxpulse -0.24 -0.43 0.25 0.23 0.31 0.93 1.00Slide7

Backward elimination

Xuhua Xia

Start: AIC=58.16

oxy ~ age + weight + runtime + rstpulse + runpulse + maxpulse

Df Sum of Sq RSS AIC

- rstpulse 1 0.571 129.41 56.299

<none> 128.84 58.162

- weight 1 9.911 138.75 58.459- maxpulse 1 26.491 155.33 61.958- age 1 27.746 156.58 62.208- runpulse 1 51.058 179.90 66.510- runtime 1 250.822 379.66 89.664Step: AIC=56.3oxy ~ age + weight + runtime + runpulse + maxpulse

Df Sum of Sq RSS AIC<none> 129.41 56.299- weight 1 9.52 138.93 56.499- maxpulse 1 26.83 156.23 60.139- age 1 27.37 156.78 60.247- runpulse 1 52.60 182.00 64.871- runtime 1 320.36 449.77 92.917

the current model, i.e., without eliminating rstpulseSlide8

Forward addition

Xuhua Xia

Start: AIC=104.7

oxy ~ 1

Df Sum of Sq RSS AIC

+ runtime 1 632.90 218.48 64.534

+ rstpulse 1 135.78 715.60 101.313

+ runpulse 1 134.84 716.54 101.354+ age 1 78.99 772.39 103.681<none> 851.38 104.699+ maxpulse 1 47.72 803.67 104.911+ weight 1 22.55 828.83 105.867Step: AIC=64.53oxy ~ runtime

Df Sum of Sq RSS AIC+ age 1 17.7656 200.72 63.905+ runpulse 1 15.3621 203.12 64.274<none> 218.48 64.534+ maxpulse 1 1.5674 216.91 66.311+ weight 1 1.3236 217.16 66.346+ rstpulse 1 0.1301 218.35 66.516

Step: AIC=63.9oxy ~ runtime + age Df Sum of Sq RSS AIC

+ runpulse 1 39.885 160.83 59.037+ maxpulse 1 14.885 185.83 63.516<none> 200.72 63.905+ weight 1 5.605 195.11 65.027+ rstpulse 1 2.641 198.07 65.494Step: AIC=59.04oxy ~ runtime + age + runpulse Df Sum of Sq RSS AIC+ maxpulse 1 21.9007 138.93 56.499

<none> 160.83 59.037+ weight 1 4.5958 156.24 60.139

+ rstpulse 1 0.4901 160.34 60.943Step: AIC=56.5oxy ~ runtime + age + runpulse + maxpulse

IVs whose addition will improve fit

IVs whose addition will make it worseSlide9

R Functions

Xuhua Xia

library(

Hmisc

)

cor

(

myD,method="pearson|spearman")pairs(~age+weight+runtime+rstpulse+runpulse+maxpulse+oxy)rmat<-rcorr(as.matrix(myD), type="pearson|spearman")rmat

print(rmat[1],digits=5)fit<-lm(oxy~age+weight+runtime+rstpulse+runpulse+maxpulse)anova(fit)summary(fit)full.model<-lm(oxy~age+weight+runtime+rstpulse+runpulse+maxpulse)best.model<-step(full.model,direction

="backward")min.model<-lm(oxy~1)best.model<-step(min.model,direction="forward", scope="~age+weight+runtime+rstpulse+runpulse+maxpulse")new<-data.frame(age=50,weight=82,......)predict(fit,new,interval="confidence")predict(fit,new,interval

="prediction")Two R functions for computing Pearson correlation: cor in basic package does not provide associated p, and rcorr in the Hmisc package includes p.Slide10

Package leapsXuhua Xia

x<-

as.matrix

(

myD

)

DV<-x[,1]IV<-x[,2:7]

library(leaps) solR2a<-leaps(IV, DV, names=names(myD)[2:7], method="adjr2")solCp<-leaps(IV, DV, names=names(myD)[2:7], method="Cp")

Package leaps includes a function leaps that offers two more criteria for model selection:adjusted r2 Mallow's Cp (which is used less frequently now)The input to leaps is not a data frame but a vector for DV and a matrix for IVs:Slide11

leaps evaluates all linear modelsXuhua Xia$which age weight runtime rstpulse runpulse maxpulse

1 FALSE

FALSE TRUE FALSE FALSE FALSE

1 FALSE

FALSE FALSE TRUE FALSE

FALSE1 FALSE FALSE

FALSE FALSE

TRUE FALSE1 TRUE FALSE FALSE FALSE FALSE FALSE1 FALSE FALSE FALSE FALSE FALSE

TRUE1 FALSE TRUE FALSE FALSE FALSE FALSE2 TRUE FALSE TRUE FALSE FALSE FALSE2 FALSE FALSE TRUE FALSE TRUE FALSE2 FALSE FALSE

TRUE FALSE FALSE TRUE2 FALSE TRUE TRUE FALSE FALSE FALSE2 FALSE FALSE TRUE TRUE FALSE FALSE

2 TRUE FALSE FALSE FALSE TRUE FALSE2 TRUE FALSE FALSE TRUE FALSE FALSE2 FALSE FALSE FALSE FALSE TRUE TRUE

2 TRUE FALSE FALSE

FALSE FALSE TRUE2 FALSE FALSE FALSE TRUE TRUE FALSE3 TRUE FALSE TRUE FALSE TRUE FALSE3 FALSE FALSE TRUE FALSE TRUE TRUE

3 TRUE FALSE TRUE FALSE FALSE TRUE3 TRUE TRUE TRUE FALSE FALSE FALSE3 TRUE FALSE TRUE TRUE FALSE FALSE3 FALSE

FALSE TRUE TRUE

TRUE FALSE3 FALSE TRUE TRUE FALSE TRUE FALSE3 FALSE TRUE TRUE FALSE FALSE TRUE3 FALSE FALSE TRUE TRUE FALSE TRUE3 FALSE TRUE TRUE

TRUE

FALSE

FALSE

……

The best model is one with the greatest adjusted r

2

or a

Cp

closest to the total number of IVs (6 in our case)

The next two slides show results of evaluationSlide12

Model evaluation: adjusted r2Xuhua Xia$size [1] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6[39] 6 6 6 6 7$adjr2 [1] 0.734531140 0.130502041 0.129362176 0.061492963 0.023495780 [6] -0.007080873 0.747407426 0.744382656 0.727022568 0.726715841

[11] 0.725213882 0.331423675 0.250289567 0.238663728 0.207123289

[16] 0.180390053 0.790104958 0.788876048 0.757477967 0.745367336[21] 0.741499364 0.735442756 0.735365598 0.717949838 0.716918697[26] 0.716790424 0.811713247 0.788260638 0.787518105 0.782696332[31] 0.781231246 0.753335728 0.750114011 0.740422515 0.725675821[36] 0.707129090 0.817602176 0.804437584 0.781067331 0.779299361[41] 0.746441303 0.464879114 0.810839895

The number of coefficients in each model

Given the set of adjusted r

2 values for the 43 alternative models, which one is the maximum?

maxAdjR2<-max(solR2a$adjr2);bestModelInd<-match(maxAdjR2,solR2a$adjr2)

solR2a$which[bestModelInd,]bestModelInd<-which.max(solR2a$adjr2)Find the max adjusted r2Find the index of the max adjusted r2 Show the best model

Find the index of the max adjusted r2Slide13

leaps output for Cp: 2$size [1] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6[39] 6 6 6 6 7$Cp [1] 13.698840 106.302108 106.476860 116.881841 122.707162 127.394846 [7] 12.389449 12.837184 15.406872 15.452274 15.674598 73.964510

[13] 85.974204 87.695093 92.363796 96.320923 6.959627 7.135037

[19] 11.616680 13.345306 13.897406 14.761903 14.772916 17.258776[25] 17.405958 17.424267 4.879958 8.103512 8.205573 8.868324[31] 9.069700 12.903931 13.346755 14.678848 16.705777 19.255019[37] 5.106275 6.846150 9.934837 10.168497 14.511122 51.723275[43] 7.000000

The number of coefficients in each model

Given the set of

Cp values for the 43 alternative models, which one is closest to 6?

solCpbestModelInd

<-which(abs(solCp$Cp-6)==min(abs(solCp$Cp-6)))solCp$which[bestModelInd,]This leads to a model that is suboptimal based on AIC or adjusted r2Slide14

Xuhua Xia

Criteria used in model selection

R

a

2 Cp

SBC (BIC)AICSignificance level

Burnham, K. P. and D. R. Anderson. 2002 Model selection and multimodel inference: a practical information-theoretic approach. 2nd ed. Springer. (Best book on model selection)