Professor William Greene Stern School of Business IOMS Department Department of Economics Statistics and Data Analysis Part 18 Regression Modeling Linear Regression Models ID: 654789
Slide1
Statistics and Data Analysis
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
Slide2
Statistics and Data Analysis
Part 18 - Regression Modeling
Slide3
Linear Regression Models
Least squares results
Regression model
Sample statistics
Estimates of population parameters
How good is the model?
In the abstract
Statistical measures of model fit
Assessing the validity of the relationship
Slide4
Regression Model
Regression relationship: y_i = α + β x_i + ε_i
Random ε_i implies random y_i
Observed random y_i has two unobserved components:
Explained: α + β x_i
Unexplained: ε_i
Random component ε_i: zero mean, standard deviation σ, normal distribution.
Slide5
Linear Regression: Model Assumption
Slide6
Least Squares Results
Slide7
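As a rough illustration of the least squares results named on the preceding slide, the sketch below computes the intercept a and slope b with the usual textbook formulas, b = S_xy / S_xx and a = mean(y) - b mean(x). The sample and the function name least_squares are made up for illustration and do not come from the slides.

```python
def least_squares(x, y):
    """Return (a, b) for the fitted line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross products and sum of squares, both in deviations from the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope estimate
    a = y_bar - b * x_bar    # intercept estimate
    return a, b


# Hypothetical sample, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(x, y)
print(f"a = {a:.3f}, b = {b:.3f}")
```

The fitted line always passes through the point of sample means, which is why the intercept is obtained from the means once the slope is known.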
Using the Regression Model
Prediction: Use x_i as information to predict y_i.
Without x_i, the natural predictor of y_i is the sample mean of y.
x_i provides more information.
With x_i, the predictor is a + b x_i.
Slide8
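To make the prediction step from the "Using the Regression Model" slide concrete, here is a minimal sketch that plugs a value of x into the fitted line a + b x. The coefficients are the ones reported in the Internet Buzz regression output later in this deck; the Buzz value of 0.5 and the function name predict are hypothetical.

```python
def predict(a, b, x):
    """Predicted y from the fitted line a + b*x."""
    return a + b * x


# Coefficients from the Internet Buzz regression output in this deck
a = -14.360
b = 72.72

buzz = 0.5  # hypothetical value of x, chosen only for illustration
box_office_hat = predict(a, b, buzz)
print(f"Predicted BoxOffice at Buzz = {buzz}: {box_office_hat:.2f}")
```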
Regression Fits
Regression of salary vs. years of experience
Regression of fuel bill vs. number of rooms for a sample of homes
Slide9
Regression Arithmetic
Slide10
Analysis of Variance
Slide11
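The analysis of variance splits the total variation in y into a regression part and a residual part. The sketch below checks that decomposition using the sums of squares printed in the Internet Buzz regression output later in this deck; the variable names are my own, and the identities used (total = regression + residual, mean square = sum of squares / degrees of freedom, s = square root of the residual mean square) are the standard ones, assumed to match the slide.

```python
import math

# Sums of squares and degrees of freedom from the Internet Buzz ANOVA table
ss_regression = 7913.6
ss_residual = 10751.5
df_regression = 1
df_residual = 60

# Total variation is the sum of the explained and unexplained parts
ss_total = ss_regression + ss_residual

# Mean squares divide each sum of squares by its degrees of freedom
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# The standard error of the regression, reported as S in the output
s = math.sqrt(ms_residual)

print(f"Total SS = {ss_total:.1f}")
print(f"MS residual = {ms_residual:.1f}, s = {s:.4f}")
```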
Fit of the Model to the Data
Slide12
Explained Variation
The proportion of variation “explained” by the regression is called R-squared (R²).
It is also called the Coefficient of Determination.
Slide13
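As a small worked example of the "proportion of variation explained," the sketch below computes R² as regression sum of squares over total sum of squares, using the figures from the Internet Buzz ANOVA table later in this deck. The variable names are illustrative only.

```python
# Sums of squares from the Internet Buzz ANOVA table
ss_regression = 7913.6
ss_residual = 10751.5
ss_total = 18665.1

# Proportion of the total variation explained by the regression
r_squared = ss_regression / ss_total

# Equivalent form: one minus the unexplained proportion
r_squared_alt = 1 - ss_residual / ss_total

print(f"R-squared = {r_squared:.3f}")
```

Rounded to three places this agrees with the R-Sq = 42.4% shown in that output.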
Movie Madness Fit
R²
Slide14
Regression Fits
R² = 0.522
R² = 0.880
R² = 0.424
R² = 0.924
Slide15
R² = 0.338
R² is still positive even if the correlation is negative.
Slide16
R Squared Benchmarks
Aggregate time series: expect .9+
Cross sections, .5 is good. Sometimes we do much better.
Large survey data sets, .2 is not bad.
R² = 0.924 in this cross section.
Slide17
Correlation Coefficient
Slide18
Correlations
r_xy = 0.723
r_xy = -.402
r_xy = +1.000
Slide19
R-Squared is r_xy²
R-squared is the square of the correlation between y_i and the predicted y_i, which is a + b x_i.
The correlation between y_i and (a + b x_i) is the same, apart from sign when b is negative, as the correlation between y_i and x_i.
Therefore,….
A regression with a high R² predicts y_i well.
Slide20
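The claim that R² equals the squared correlation can be checked numerically. The sketch below uses a made-up, downward-sloping sample so that the correlation is negative while R² stays positive; the data and helper names are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def correlation(u, v):
    """Sample correlation coefficient between two equal-length lists."""
    u_bar, v_bar = mean(u), mean(v)
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_vv = sum((vi - v_bar) ** 2 for vi in v)
    return s_uv / math.sqrt(s_uu * s_vv)


# Hypothetical sample with a negative relationship
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [9.5, 8.1, 7.4, 5.2, 4.9, 2.8]

# Least squares fit and fitted values
x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

# R-squared from the sums of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - sse / sst

r_xy = correlation(x, y)          # negative for this sample
r_y_yhat = correlation(y, y_hat)  # equals the absolute value of r_xy

print(f"r_xy = {r_xy:.4f}, r_xy squared = {r_xy ** 2:.4f}, R-squared = {r_squared:.4f}")
```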
Adjusted R-Squared
As we will see when we study regression with more than one variable, a researcher can increase R² just by adding variables to a model, even if those variables do not really explain y or have any real relationship with it at all.
To have a fit measure that accounts for this, “Adjusted R²” is a number that increases with the correlation, but decreases with the number of variables.
Slide21
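The slides do not spell out the adjustment in the extracted text, so the sketch below assumes the common textbook definition, adjusted R² = 1 - (1 - R²)(N - 1)/(N - K - 1), where K is the number of explanatory variables. Applied to the Internet Buzz figures from later in this deck (N = 62, K = 1), it gives a value that rounds to the R-Sq(adj) = 41.4% printed there. The function name is illustrative.

```python
def adjusted_r_squared(r_squared, n_obs, n_vars):
    """Adjusted R-squared under the common textbook definition (an assumption here)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_vars - 1)


# Internet Buzz regression: sums of squares from the ANOVA table, N = 62, one regressor
r_squared = 7913.6 / 18665.1
adj = adjusted_r_squared(r_squared, n_obs=62, n_vars=1)
print(f"R-squared = {r_squared:.3f}, adjusted R-squared = {adj:.3f}")

# Holding R-squared fixed, more variables lower the adjusted value
adj_more_vars = adjusted_r_squared(r_squared, n_obs=62, n_vars=5)
```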
Movie Madness Fit
Slide22
Notes About Adjusted R²
Slide23
Is R² Large?
Is there really a relationship between x and y?
We cannot be 100% certain.
We can be “statistically certain” (within limits) by examining R².
F is used for this purpose.
Slide24
The F Ratio
Slide25
Is R² Large?
Since F = (N-2)R²/(1 - R²), if R² is “large,” then F will be large.
For a model with one explanatory variable in it, the standard benchmark value for a ‘large’ F is 4.
Slide26
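The formula F = (N-2)R²/(1 - R²) can be checked against the Internet Buzz output later in this deck, where N = 62. The sketch below is only an arithmetic check; the function name is my own, and the benchmark of 4 is the rule of thumb stated on the slide.

```python
def f_ratio(r_squared, n_obs):
    """F statistic for a regression with one explanatory variable."""
    return (n_obs - 2) * r_squared / (1 - r_squared)


# R-squared from the Internet Buzz sums of squares, N = 62 observations
r_squared = 7913.6 / 18665.1
f_stat = f_ratio(r_squared, n_obs=62)

benchmark = 4.0  # rule-of-thumb value for a "large" F with one explanatory variable
print(f"F = {f_stat:.2f}, larger than benchmark: {f_stat > benchmark}")
```

The result rounds to the F = 44.16 reported in that output's ANOVA table.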
Movie Madness Fit
R²
F
Slide27
Why Use F and not R²?
When is R² “large?” We have no benchmarks to decide. How large is “large?”
We have a table for F statistics to determine when F is statistically large: yes or no.
Slide28
F Table
The “critical value” depends on the number of observations. If F is larger than the appropriate value in the table, conclude that there is a “statistically significant” relationship.
There is a huge F table on pages 732-742 of your text. Analysts now use computer programs, not tables like this, to find the critical values of F for their model/data.
n_2 is N-2
Slide29
Internet Buzz Regression
Regression Analysis: BoxOffice versus Buzz
The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz

Predictor      Coef   SE Coef      T      P
Constant    -14.360     5.546  -2.59  0.012
Buzz          72.72     10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1

n_2 is N-2
Slide30
$135 Million
http://www.nytimes.com/2006/06/19/arts/design/19klim.html?ex=1308369600&en=37eb32381038a749&ei=5088&partner=rssnyt&emc=rss
Klimt, to Ronald Lauder
Slide31
$100 Million … sort of
Stephen Wynn with a Prized Possession, 2007
Slide32
An Enduring Art Mystery
Why do larger paintings command higher prices?
The Persistence of Memory. Salvador Dali, 1931
The Persistence of Statistics. Hildebrand, Ott and Gray, 2005
Graphics show relative sizes of the two works.
Slide33
Slide34
Slide35
Monet in Large and Small
Log of $price = a + b log surface area + e
Sale prices of 328 signed Monet paintings
The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.
Slide36
The Data
Note: Logs are used in this context. This is common when analyzing financial measurements (e.g., price) and when percentage changes are more interesting than unit changes (e.g., what is the % premium when the painting is 10% larger?).
Slide37
Monet Regression: There seems to be a regression. Is there a theory?
Slide38
Conclusions about F
R² answers the question of how well the model fits the data.
F answers the question of whether there is a statistically valid fit (as opposed to no fit).
What remains is the question of whether there is a valid relationship, i.e., is β different from zero?
Slide39
The Regression Slope
The model is y_i = α + β x_i + ε_i
The “relationship” depends on β.
If β equals zero, there is no relationship.
The least squares slope, b, is the estimate of β based on the sample.
It is a statistic based on a random sample.
We cannot be sure it equals the true β.
To accommodate this view, we form a range of uncertainty around b, i.e., a confidence interval.
Slide40
Uncertainty About the Regression Slope
Hypothetical Regression
Fuel Bill vs. Number of Rooms
The regression equation is
Fuel Bill = -252 + 136 Number of Rooms

Predictor     Coef   SE Coef      T      P
Constant    -251.9     44.88  -5.20  0.000
Rooms        136.2      7.09   19.9  0.000

S = 144.456   R-Sq = 72.2%   R-Sq(adj) = 72.0%

Coef for Rooms (136.2): This is b, the estimate of β.
SE Coef for Rooms (7.09): This “Standard Error” (SE) is the measure of uncertainty about the true value.
The “range of uncertainty” is b ± 2 SE(b). (Actually 1.96, but people use 2.)
Slide41
Internet Buzz Regression
Regression Analysis: BoxOffice versus Buzz
The regression equation is
BoxOffice = - 14.4 + 72.7 Buzz

Predictor      Coef   SE Coef      T      P
Constant    -14.360     5.546  -2.59  0.012
Buzz          72.72     10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   7913.6  7913.6  44.16  0.000
Residual Error  60  10751.5   179.2
Total           61  18665.1

Range of Uncertainty for b is 72.72 - 1.96(10.94) to 72.72 + 1.96(10.94) = [51.27 to 94.17]
Slide42
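The range of uncertainty for the Buzz coefficient can be reproduced with a few lines of arithmetic. This is a sketch using the rounded coefficient and standard error printed in the output (72.72 and 10.94); with those rounded inputs the endpoints come out within about 0.01 of the interval shown on the slide, which I assume was computed from unrounded values. The function name is hypothetical.

```python
def range_of_uncertainty(b, se, multiplier=1.96):
    """Return (lower, upper) endpoints of b plus or minus multiplier * SE(b)."""
    half_width = multiplier * se
    return b - half_width, b + half_width


# Rounded values from the Internet Buzz regression output
b = 72.72
se_b = 10.94

lower, upper = range_of_uncertainty(b, se_b)

# Does the interval contain zero? If not, conclude the slope is not zero.
includes_zero = lower <= 0.0 <= upper

print(f"Range of uncertainty: [{lower:.2f} to {upper:.2f}], includes zero: {includes_zero}")
```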
Elasticity in the Monet Regression: b = 1.7246.
This is the elasticity of price with respect to area.
The confidence interval would be 1.7246 ± 1.96(.1908) = [1.3506 to 2.0986]
The fact that this does not include 1.0 is an important result: prices for Monet paintings are extremely elastic with respect to the area.
Slide43
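To connect the elasticity to the question raised on "The Data" slide (what is the % premium when the painting is 10% larger?), the sketch below uses the standard reading of a log-log regression: scaling area by a factor k scales the predicted price by k raised to the power b. The 10% size increase comes from that slide; the function name is my own, and the endpoints are the ones in the confidence interval above.

```python
def price_premium(elasticity, size_increase):
    """Proportional price change implied by a log-log slope for a given proportional size change."""
    return (1 + size_increase) ** elasticity - 1


b = 1.7246                        # estimated elasticity from the Monet regression
lower, upper = 1.3506, 2.0986     # confidence interval endpoints from the slide

premium = price_premium(b, 0.10)
premium_low = price_premium(lower, 0.10)
premium_high = price_premium(upper, 0.10)

print(f"Premium for a 10% larger painting: {premium:.1%}")
print(f"Using the interval endpoints: {premium_low:.1%} to {premium_high:.1%}")
```

Because the whole interval lies above 1.0, the implied premium exceeds 10% even at the lower endpoint.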
Conclusion about b
So, should we conclude the slope is not zero?
Does the range of uncertainty include zero?
No, then you should conclude the slope is not zero.
Yes, then you can’t be very sure that β is not zero.
Tying it together. If the range of uncertainty does not include 0.0 then,
The ratio b/SE is larger than 2 (in absolute value).
The square of the ratio is larger than 4.
The square of the ratio is F.
F larger than 4 gave the same conclusion.
They are looking at the same thing.
Slide44
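The link between the ratio b/SE and F can be checked with the Internet Buzz numbers. The sketch below squares the ratio computed from the rounded coefficient and standard error and compares it with the reported F of 44.16; the small gap is what one would expect from rounding in the printed output. Variable names are illustrative.

```python
# Rounded values from the Internet Buzz regression output
b = 72.72
se_b = 10.94
f_reported = 44.16

t_ratio = b / se_b            # reported as T = 6.65 in the output
t_squared = t_ratio ** 2      # should match F up to rounding

print(f"b/SE = {t_ratio:.2f}, squared = {t_squared:.2f}, reported F = {f_reported}")

# Both rules of thumb point the same way
slope_looks_nonzero = abs(t_ratio) > 2 and t_squared > 4
```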
Summary
The regression model - theory
Least squares results, a, b, s, R²
The fit of the regression model to the data
ANOVA and R²
The F statistic and R²
Uncertainty about the regression slope