Slide1
QM222 Class 19
Omitted Variable Bias pt 2
Different slopes for a single variable
QM222 Fall 2017 Section A1
Slide2
To-dos
Assignment 5 is due today. But you can only do it if you have your Stata dataset.
Test: 6pm Oct 31 (location TBD)
Slide3
Today we will…
Omitted variable bias:
- Review the idea of omitted variable bias
- Do the graph from last class's in-class exercise
- Learn the algebra of omitted variable bias
Different slopes for a single variable (start)
Slide4
Omitted variable bias
In a simple regression of Y on X1, the coefficient b1 measures the combined effects of:
- the direct (often called "causal") effect of the included variable X1 on Y
PLUS
- an "omitted variable bias" due to factors that were left out (omitted) from the regression.
Often we want to measure the direct, causal effect. In this case, the coefficient in the simple regression is biased.
Slide5
Regressions without and with Age
Regression (1):
WS48 = .1203 - .0325 INJURED   (has bias!)   adjRsq = .0359
       (66.37)  (-6.34)
Regression (2):
WS48 = .1991 - .0274 INJURED - .00279 Age    adjRsq = .0826
       (18.38)  (-5.41)        (-7.37)
(t-stats in parentheses)
Putting Age in regression (2) added .0051 to the INJURED coefficient (i.e. made it a smaller negative).
The omitted variable bias in Regression (1) was the difference in the coefficients on INJURED: -.0325 - (-.0274) = -.0051
More generally, omitted variable bias occurs when:
1. The omitted variable (Age) has an effect on the dependent variable (WS48), AND
2. The omitted variable (Age) is correlated with the explanatory variable of interest (INJURED).
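The subtraction on this slide is easy to get wrong by a decimal place, so here is a minimal Python sketch of it. It only reuses the two INJURED coefficients reported above; the variable names c1 and b1 are labels chosen for this illustration (limited-model and full-model coefficient), not output from Stata.

```python
# Coefficients on INJURED, copied from the two regressions on this slide.
c1 = -0.0325   # Regression (1): Age omitted
b1 = -0.0274   # Regression (2): Age included

# Omitted variable bias = coefficient without Age minus coefficient with Age.
bias = round(c1 - b1, 4)
print("omitted variable bias:", bias)
```

A negative result means the simple regression makes the effect of INJURED look more negative than the regression that controls for Age.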
Slide6
We learned the graphic way last week
Really, both being injured and age affect WS48, as in the multiple regression
Y = b0 + b1 X1 + b2 X2
This is drawn below.
Let's call this the Full model.
Let's call b1 and b2 the direct effects.
Slide7
The mis-specified or Limited model
However, in the simple (one X variable) regression, we measure only a combined effect of INJURED on WS48. Call its coefficient c1:
Y = c0 + c1 X1
Let's call c1 the combined effect because it combines the direct effect of X1 and the bias.
Slide8
The reason that there is an omitted variable bias in the simple regression of Y on X1 is that there is a Background Relationship between the X's
We intuited that there is a relationship between X1 (INJURED) and X2 (Age). We call this the Background Relationship:
correlate WS48 INJURED Age
(obs=1,051)
             |     WS48  INJURED      Age
-------------+---------------------------
        WS48 |   1.0000
     INJURED |  -0.1920   1.0000
         Age |  -0.2425   0.1388   1.0000
This background relationship, shown in the graph as a1, is positive.
Slide9
But in the limited model without X2 in the regression, the combined effect c1 includes both:
- X1's direct effect b1, and
- the indirect effect (blue arrow) working through X2.
i.e. when X1 changes, X2 also tends to change (a1). This change in X2 has another effect on Y (b2).
The indirect effect (blue arrow) is the omitted variable bias, and its sign is the sign of a1 times the sign of b2.
Slide10
In the basketball case
WS48 = .1203 - .0325 INJURED   (limited model)
WS48 = .1991 - .0274 INJURED - .00279 Age   (full model)
The effect of INJURED on WS48 has two channels.
The first one is the direct effect b1 (-.0274).
The second channel is the indirect effect working through X2 (Age):
- When X1 (INJURED) changes, X2 (Age) also tends to change (a1) (correlation +.1388).
- This change in X2 has its own effect on Y (b2) (-.00279).
The indirect effect (blue arrow) is the omitted variable bias, and its sign is the sign of a1 times the sign of b2: pos*neg = neg.
Slide11
In-Class exercise (t-stats in parentheses)
Regression 1:
Score = 61.809 – 5.68 Pay_Program                    adjR2 = .0175
        (93.5)   (-3.19)
Regression 2:
Score = 10.80 + 3.73 Pay_Program + 0.826 OldScore    adjR2 = .6687
        (6.52)  (3.46)             (31.68)
Slide12
Pay Program graph
(1) Score = 61.809 – 5.68 Pay_Program
(2) Score = 10.80 + 3.73 Pay_Program + 0.826 OldScore
[Graph: Pay Program to Score, direct effect b1 = 3.73; Pay Program to Old Score, a1; Old Score to Score, b2 = +.826]
bias = -5.68 - 3.73 = -9.41
a1 has the sign of the correlation between Pay Program and Old Score. Since the bias is negative, its sign = sign of a1 * sign of b2, and b2 is positive, a1 must be negative.
In words: It must be that OldScore is correlated with who chooses the Pay Program, and in particular that schools with bad (old) scores chose the pay program.
Slide13
Algebra
Limited model: Y = c0 + c1 X1
Full model: Y = b0 + b1 X1 + b2 X2
Background model: X2 = a0 + a1 X1
We want the Full model, but we only have the limited one with only X1. So substitute the background model into the full model:
Y = b0 + b1 X1 + b2 (a0 + a1 X1)
Collect terms:
Y = (b0 + b2 a0) + (b1 + b2 a1) X1
so c0 = b0 + b2 a0 and c1 = b1 + b2 a1.
So the bias of the coefficient on X1 in the limited model is b2 a1.
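To see that c1 = b1 + b2 a1 holds when all three models are fit by least squares on the same data, here is a small Python sketch. The eight data points are hypothetical numbers made up for this illustration (they are not from any course dataset), and the helper functions mean and cov are written here rather than taken from a library.

```python
# Hypothetical illustration data (NOT from the course datasets).
x1 = [0, 0, 0, 1, 1, 1, 0, 1]
x2 = [20, 22, 25, 27, 30, 31, 24, 29]
y = [10, 9, 8, 5, 4, 3, 9, 4]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
s1y, s2y = cov(x1, y), cov(x2, y)

# Limited model: Y = c0 + c1 X1
c1 = s1y / s11
# Background model: X2 = a0 + a1 X1
a1 = s12 / s11
# Full model: Y = b0 + b1 X1 + b2 X2 (least-squares formulas for two regressors)
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

print("c1           =", c1)
print("b1 + b2 * a1 =", b1 + b2 * a1)
print("bias b2 * a1 =", b2 * a1)
```

The first two printed numbers agree up to floating-point rounding, which is the algebra on this slide in numerical form.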
Slide14
Let's apply this to Brookline Condos
Limited model: Price = 520729 – 46969 BEACON
Full model: Price = 6981 + 409.4 SIZE + 32936 BEACON
Background relationship: SIZE = 1254 – 195.17 BEACON
c1 = b1 + b2 a1
Check: -46969 ≈ 32936 + (-195.17 * 409.4)
Bias is b2 a1, or -195.17 * 409.4, which is negative.
We are UNDERESTIMATING the direct effect.
a1: background relationship (negative)
c1: combined effect (negative)
b1: direct effect (positive)
Slide15
Pay Program algebra
(1) Score = 61.809 – 5.68 Pay_Program   (limited)
(2) Score = 10.80 + 3.73 Pay_Program + 0.826 OldScore   (full)
Here is the regression of the background model:
. regress OLDSCORE PAY_PROGRAM
      Source |       SS       df       MS              Number of obs =     515
-------------+------------------------------           F(1, 513)     =   42.23
       Model |  7952.87922     1  7952.87922           Prob > F      =  0.0000
    Residual |  96613.7883   513  188.330971           R-squared     =  0.0761
-------------+------------------------------           Adj R-squared =  0.0743
       Total |  104566.667   514  203.437096           Root MSE      =  13.723
------------------------------------------------------------------------------
    OLDSCORE |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 PAY_PROGRAM |  -11.39843   1.754058    -6.50   0.000    -14.84445   -7.952413
       _cons |   61.78153   .6512825    94.86   0.000     60.50202    63.06104
------------------------------------------------------------------------------
Write the equation of the background model.
Combine it with the two models above to get the value and sign of the omitted variable bias in the coefficient -5.68 in the limited model.
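One way to check your answer is to plug the reported coefficients into bias = b2 * a1. The Python sketch below does only that arithmetic. It assumes a1 is the PAY_PROGRAM coefficient from the background regression output above, and the names b1, b2, a1, c1 follow the algebra slide. Because the slide coefficients are rounded, the rebuilt combined effect should match -5.68 only approximately.

```python
# Coefficients copied from the slide.
b1 = 3.73        # Pay_Program in the full model (direct effect)
b2 = 0.826       # OldScore in the full model
a1 = -11.39843   # Pay_Program in the background regression of OLDSCORE
c1 = -5.68       # Pay_Program in the limited model (combined effect)

bias = b2 * a1          # omitted variable bias
combined = b1 + bias    # should be close to c1

print("bias     =", round(bias, 3))
print("b1+b2*a1 =", round(combined, 3))
print("c1       =", c1)
```

The bias comes out negative, in line with the sign argument on the Pay Program graph slide: a1 is negative and b2 is positive.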
Slide16
One variable with different slopes
Slide17
Review of simple derivatives
A derivative is the same as a slope.
In a line, the slope is always the same.
In a curve, the slope changes. The rules of derivatives tell you how to calculate the slope at any point of a curve.
We write the derivative as dY/dX instead of the slope ∆Y/∆X.
Slide18
Three rules of calculus
1. The derivative (slope) of two terms added together = the derivative of each term added together:
   If Y = A + B, where A and B are terms with X in them, then dY/dX = dA/dX + dB/dX
2. The derivative (slope) of a constant is zero:
   If Y = 5, then dY/dX = 0
3. If Y = a X^b, then dY/dX = b a X^(b-1)
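A quick numerical check can make rule 3 feel less abstract. The Python sketch below compares the rule's answer with a slope measured directly from two nearby points on the curve. The curve Y = 25 X^2, the point X = 3, and the function names are choices made for this illustration.

```python
def power_rule(a, b, x):
    """Slope of Y = a * X**b at x, using rule 3: b * a * X**(b - 1)."""
    return b * a * x ** (b - 1)

def numeric_slope(f, x, h=1e-6):
    """Rise over run between two points very close to x."""
    return (f(x + h) - f(x - h)) / (2 * h)

def curve(x):
    # Illustrative curve: Y = 25 X^2
    return 25 * x ** 2

x = 3.0
print("rule 3 slope :", power_rule(25, 2, x))
print("measured     :", numeric_slope(curve, x))
```

Changing x shows the slope changing along the curve, which is the difference between a curve and a line.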
Slide19
Examples
If Y = 25 X^2, then dY/dX = 2 · 25 X^(2-1) = 50 X
Another example combining the three rules is:
Y = 25 X^2 + 200 X + 3000
then, recalling that X^0 = 1,
dY/dX = 2 · 25 X^(2-1) + 1 · 200 X^(1-1) + 0 = 50 X + 200
The exponent does not have to be either positive or an integer. Example:
If Y = 20 X^(-2.5), then dY/dX = -2.5 · 20 X^(-2.5-1) = -50 X^(-3.5)
Slide20
Now we're ready for different slopes
Slide21
Movie dataset
Here is a regression of movie lifetime revenues on Budget and a dummy for whether it is a SciFi movie:
Revenues = 16.6 + 1.12 Budget - 9.79 SciFi
          (5.28)  (.102)        (11.6)
(standard errors in parentheses)
What does an observation represent in this data set?
What do we learn from the standard errors about each coefficient's significance?
What is the slope dRevenues/dSciFi?
What is the slope dRevenues/dBudget?
Are these results what you expect?
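For the significance question, a t-statistic is the coefficient divided by its standard error. The Python sketch below does that division for the three coefficients above. The cutoff of 2 is a common rule of thumb for roughly the 5% level and is an assumption of this sketch, not something stated on the slide.

```python
# Coefficients and standard errors copied from the regression above.
coefs = {"constant": 16.6, "Budget": 1.12, "SciFi": -9.79}
std_errs = {"constant": 5.28, "Budget": 0.102, "SciFi": 11.6}

# t-statistic = coefficient / standard error
t_stats = {name: coefs[name] / std_errs[name] for name in coefs}

for name, t in t_stats.items():
    # Rule of thumb (assumption): |t| of about 2 or more is significant.
    verdict = "significant" if abs(t) >= 2 else "not significant"
    print(name, round(t, 2), verdict)
```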
Slide22
Do you think that budget will matter similarly for all types of movies?
In particular, what do we expect about the coefficient on budget (slope) for SciFi movies (compared to others)?
Slide23
If we think that each budget dollar affects SciFi movies differently…
The simplest way to model this in a regression is:
1. Make an additional variable by multiplying Budget x SciFi
2. Make an additional variable by multiplying Budget x non-SciFi
3. Replace Budget with these two variables (keeping in SciFi)
These are called interaction terms.
Slide24
Steps 1 and 2: Replace budget with two new variables, Budget x SciFi and Budget x Non-SciFi
gen budgetscifi = budget*scifi
gen budgetnonscifi = budget*(1-scifi)
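To see what the two gen commands produce, here is the same calculation as a small Python sketch (Python is used only for illustration; in class the variables are made in Stata). The three rows are taken from the course's movie spreadsheet, and the dictionary layout is an assumption of this sketch.

```python
# Three rows from the movie spreadsheet: one non-SciFi and two SciFi movies.
movies = [
    {"moviename": "Jumanji", "scifi": 0, "budget": 65},
    {"moviename": "Starship Troopers", "scifi": 1, "budget": 100},
    {"moviename": "Event Horizon", "scifi": 1, "budget": 60},
]

for m in movies:
    # Same formulas as the two gen commands.
    m["budgetscifi"] = m["budget"] * m["scifi"]
    m["budgetnonscifi"] = m["budget"] * (1 - m["scifi"])
    print(m["moviename"], m["budgetscifi"], m["budgetnonscifi"])
```

For every movie, exactly one of the two new variables equals the budget and the other is 0.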
Slide25
What data looks like in a spreadsheet
moviename                        revenue  scifi  budget  budgetscifi  budgetnonscifi
The Bridges of Madison County    71.5166      0      22            0              22
Dead Man Walking                 39.3636      0      11            0              11
Rob Roy                          31.5969      0      28            0              28
Clueless                         56.6316      0    13.7            0            13.7
Babe                             63.6589      0      30            0              30
Jumanji                              100      0      65            0              65
Showgirls                        20.3508      0      40            0              40
Starship Troopers                54.8144      1     100          100               0
Bad Boys                          65.807      0      23            0              23
Event Horizon                    26.6732      1      60           60               0
Jefferson in Paris               2.47367      0      14            0              14
To Die For                       21.2845      0      20            0              20
Star Trek: Insurrection          70.1877      1      70           70               0
Sphere                           37.0203      0      73            0              73
Out of Sight                     37.5626      0      48            0              48
Saving Private Ryan                  220      0      65            0              65
Enemy of the State                   110      0      85            0              85
The Big Lebowski                 17.4519      0      15            0              15
Lost in Space                    69.1176      1      80           80               0
Mortal Kombat                    70.4541      0      20            0              20
Copycat                          32.0519      0      20            0              20
Slide26
3. Replace Budget with these two variables (keeping in SciFi)
regress revenues scifi budgetscifi budgetnonscifi
You get:
revenues = 19.91 – 72.07 scifi + 2.04 budgetscifi + 1.04 budgetnonscifi
          (5.36)   (25.5)        (0.352)            (0.105)
What is the slope drevenues/dbudget?
drevenues/dbudget = 2.04 scifi + 1.04 (1 - scifi)
If it is a scifi movie: slope drevenues/dbudget = 2.04 (since the last term is 0)
If it is not a scifi movie: slope drevenues/dbudget = 1.04
Each budget dollar is more important if it is a scifi/fantasy movie.
Note also: All coefficients are significant.
Slide27
Graph of this model
[Graph: Revenues vs. Budget, with separate lines for SciFi movies and Other movies]
Slide28
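This also allows the effect of being a scifi movie to depend on the budget
From the previous overhead:
revenues = 19.91 – 72.07 scifi + 2.04 budgetscifi + 1.04 budgetnonscifi
What is the slope drevenues/dscifi?
Since budgetscifi = budget*scifi and budgetnonscifi = budget*(1-scifi), switching scifi from 0 to 1 turns on the 2.04 budget term and turns off the 1.04 budget term:
drevenues/dscifi = -72.07 + (2.04 - 1.04) budget = -72.07 + 1.00 budget
So if budget = 100, drevenues/dscifi = -72.07 + 1.00*100 = 27.93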
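The Python sketch below works out both slopes directly from the fitted equation revenues = 19.91 – 72.07 scifi + 2.04 budgetscifi + 1.04 budgetnonscifi. The effect of being a scifi movie is computed as the predicted revenues with scifi = 1 minus the predicted revenues with scifi = 0 at the same budget, so both interaction terms are taken into account. The function names and constant names are choices made for this illustration.

```python
# Coefficients from the fitted equation with interaction terms.
B0, B_SCIFI, B_BUD_SCIFI, B_BUD_NON = 19.91, -72.07, 2.04, 1.04

def predicted_revenues(budget, scifi):
    """Fitted value; budgetscifi = budget*scifi, budgetnonscifi = budget*(1-scifi)."""
    return (B0 + B_SCIFI * scifi
            + B_BUD_SCIFI * budget * scifi
            + B_BUD_NON * budget * (1 - scifi))

def budget_slope(scifi):
    """drevenues/dbudget for a scifi (1) or non-scifi (0) movie."""
    return B_BUD_SCIFI * scifi + B_BUD_NON * (1 - scifi)

def scifi_effect(budget):
    """Predicted revenues as a scifi movie minus as a non-scifi movie, same budget."""
    return predicted_revenues(budget, 1) - predicted_revenues(budget, 0)

print("budget slope, scifi    :", budget_slope(1))
print("budget slope, non-scifi:", budget_slope(0))
print("scifi effect, budget=100:", round(scifi_effect(100), 2))
```

Trying a small budget such as 20 gives a negative scifi effect, which shows how the effect of being a scifi movie depends on the budget.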
Compare to our equation without the "interaction terms", where drevenues/dscifi = -9.79
Slide29
Making Regression Tables
Regressions of Points per game per player
                                 1            2            3            4
height                     -0.0226    0.2133***    0.2240***       4.7306
                           (-0.28)       (4.77)       (4.93)       (0.85)
year                                -0.01545***     -0.0166*       0.1667
                                        (-2.99)      (-1.79)       (0.74)
Height x year                                                     -0.0023
                                                                  (-0.81)
yr1970                                                0.5629       0.5522
                                                      (1.12)       (1.10)
yr1975                                            -1.0830***   -1.1161***
                                                     (-2.48)      (-2.55)
yr1980                                               -0.3757      -0.4313
                                                     (-0.96)      (-1.08)
yr1985                                               -0.1960      -0.2582
                                                     (-0.55)      (-0.71)
yr1990                                               -0.1043      -0.1580
                                                     (-0.33)      (-0.49)
yr1995                                             -0.54223*    -0.5841**
                                                     (-1.86)      (-1.97)
yr2000                                             -0.5964**    -0.6269**
                                                     (-2.20)      (-2.29)
yr2005                                             -0.5453**    -0.5576**
                                                     (-2.08)      (-2.13)
yr2010                                            -0.1962502   -0.2026464
                                                     (-0.73)      (-0.75)
center                               -0.8795***   -0.8917***   -0.8946***
                                        (-5.14)      (-5.20)      (-5.22)
minutes played                        0.5260***    0.5253***    0.5251***
                                        (76.41)      (76.42)      (76.35)
Constant (intercept)      9.872891     11.21819     13.02334    -352.6088
                           (-1.51)      (-1.12)      (-0.71)      (-0.78)
# observations               1,509        1,509        1,509        1,509
MSE SSE                     6.0404       2.6943       2.6812       2.6815
adjusted r squared         -0.0006       0.8009       0.8028       0.8028
t-statistics in parentheses; * p<.1, ** p<.05, *** p<.01; Omitted year: 1965
from "Have Taller "Big Men" Become Less Productive in the NBA over Recent Years" Austin Hirsch 2015
When you want to report several regressions and allow readers to compare them, you report the regressions in a table.
All variables are listed in the first column.
Each regression is another column.
Slide30
Main elements of the regression table:
- Each column represents a different equation.
- If all regressions have the same dependent (Y) variable, then tell people what the dependent variable is in the table title (like I did here). If different columns have different dependent variables, list them as each column's first row (instead of "1", "2").
- Each explanatory variable in any equation should be listed in the first column, leaving 2 rows for each variable.
- For each regression, first put the coefficient itself in the column. Then, in the cell below it, put either the coefficient's standard error or the coefficient's t-statistic in parentheses. Somewhere in the title or the notes to the table, inform the reader which statistic is in parentheses.
- It is particularly helpful to put asterisks next to significant coefficients. For instance, I put the following: *** if p<.01, ** if p<.05, * if p<.10. This should be explained in the table's footnotes.
- Not every column needs to have a coefficient for every variable, since not every regression included every variable. Looking at the table, we can clearly see which explanatory variables were included in each regression.