QM222 Class 19 Omitted Variable Bias

celsa-spraggs (@celsa-spraggs)
Uploaded On 2018-03-12


Presentation Transcript

Slide1

QM222 Class 19
Omitted Variable Bias pt 2
Different slopes for a single variable

QM222 Fall 2017 Section A1

Slide2

To-dos
Assignment 5 is due today. But you can only do it if you have your Stata dataset.
Test: 6pm Oct 31 (location TBD)

Slide3

Today we will…
Omitted variable bias:
  Review the idea of omitted variable bias
  Do the graph from last class’s in-class exercise
  Learn the algebra of omitted variable bias
Different slopes for a single variable (start)

Slide4

Omitted variable bias
In a simple regression of Y on X1, the coefficient b1 measures the combined effects of:
  the direct (often called “causal”) effect of the included variable X1 on Y
  PLUS
  an “omitted variable bias” due to factors that were left out (omitted) from the regression.
Often we want to measure the direct, causal effect. In this case, the coefficient in the simple regression is biased.

Slide5

Regressions without and with Age
Regression (1):
WS48 = .1203 - .0325 INJURED   (has bias!)
      (66.37)  (-6.34)                        adjRsq = .0359

Regression (2):
WS48 = .1991 - .0274 INJURED - .00279 Age
      (18.38)  (-5.41)         (-7.37)        adjRsq = .0826
(t-stats in parentheses)

Putting Age in Regression (2) added .0051 to the INJURED coefficient (i.e. made it a smaller negative).
The omitted variable bias in Regression (1) is the difference between the two coefficients on INJURED: -.0325 - (-.0274) = -.0051

More generally, omitted variable bias occurs when:
1. The omitted variable (Age) has an effect on the dependent variable (WS48) AND
2. The omitted variable (Age) is correlated with the explanatory variable of interest (INJURED).
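As a quick check of the arithmetic on this slide, the sketch below recomputes the bias from the two INJURED coefficients reported above. It is written in Python rather than Stata purely for illustration, and the variable names are made up for this example.

```python
# Coefficients on INJURED copied from the two regressions on this slide.
c1_limited = -0.0325   # Regression (1): Age omitted
b1_full = -0.0274      # Regression (2): Age included

# Omitted variable bias = limited-model coefficient minus full-model coefficient.
bias = c1_limited - b1_full

print(f"omitted variable bias = {bias:.4f}")
```

The result is negative, which says the simple regression makes INJURED look more harmful than it is once Age is held constant.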

Slide6

We learned the graphic way last week
Really, both being injured and age affect WS48, as in the multiple regression
Y = b0 + b1 X1 + b2 X2
This is drawn below.
Let’s call this the Full model.
Let’s call b1 and b2 the direct effects.

Slide7

The mis-specified or Limited model
However, in the simple (1 X variable) regression, we measure only a (combined) effect of INJURED on WS48. Call its coefficient c1:
Y = c0 + c1 X1
Let’s call c1 the combined effect because it combines the direct effect of X1 and the bias.

Slide8

The reason that there is an omitted variable bias in the simple regression of Y on X1 is that there is a Background Relationship between the X’s
We intuited that there is a relationship between X1 (INJURED) and X2 (Age). We call this the Background Relationship:

correlate WS48 INJURED Age
(obs=1,051)

             |     WS48  INJURED      Age
-------------+---------------------------
        WS48 |   1.0000
     INJURED |  -0.1920   1.0000
         Age |  -0.2425   0.1388   1.0000

This background relationship, shown in the graph as a1, is positive.

Slide9

But in the limited model, without X2 in the regression, the combined effect c1 includes both:
  X1’s direct effect b1, and
  the indirect effect (blue arrow) working through X2.
i.e. when X1 changes, X2 also tends to change (a1).
This change in X2 has another effect on Y (b2).
The indirect effect (blue arrow) is the omitted variable bias, and its sign is the sign of a1 times the sign of b2.

Slide10

In the basketball case
WS48 = .1203 - .0325 INJURED   (limited model)
WS48 = .1991 - .0274 INJURED - .00279 Age   (full model)

The effect of INJURED on WS48 has two channels.
The first one is the direct effect b1 (-.0274).
The second channel is the indirect effect working through X2 (Age):
  When X1 (INJURED) changes, X2 (Age) also tends to change (a1) (correlation +.1388).
  This change in X2 has its own effect on Y (b2) (-.00279).
The indirect effect (blue arrow) is the omitted variable bias, and its sign is the sign of a1 times the sign of b2: pos*neg = neg
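The sign rule above can be sketched in a few lines. This is an illustrative Python snippet, not part of the course's Stata workflow; the helper names sign and bias_sign are invented for this example.

```python
def sign(x: float) -> int:
    """Return +1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def bias_sign(a1: float, b2: float) -> int:
    """Sign of the omitted variable bias: sign(a1) times sign(b2)."""
    return sign(a1) * sign(b2)


# Basketball case: the INJURED-Age correlation (+.1388) gives the sign of a1,
# and the Age coefficient in the full model (-.00279) is b2.
print(bias_sign(0.1388, -0.00279))  # prints -1: the bias is negative
```

Only the signs matter here, so the correlation can stand in for a1 even though it is not the regression slope itself.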

Slide11

In-Class exercise (t-stats in parentheses)
Regression 1:
Score = 61.809 - 5.68 Pay_Program                      adjR2 = .0175
        (93.5)   (-3.19)
Regression 2:
Score = 10.80 + 3.73 Pay_Program + 0.826 OldScore      adjR2 = .6687
        (6.52)  (3.46)             (31.68)

Slide12

Pay Program graph
(1) Score = 61.809 - 5.68 Pay_Program
(2) Score = 10.80 + 3.73 Pay_Program + 0.826 OldScore

[Graph: Pay Program to SCORE, direct effect b1 = 3.73; Pay Program to Old Score, a1; Old Score to SCORE, b2 = +.826]
bias = -5.68 - 3.73 = -9.41

a1 has the sign of the correlation between Pay Program and Old Score. Since the bias is negative, its sign = sign of a1 * sign of b2, and b2 is positive, a1 must be negative.
In words: It must be that OldScore is correlated with who chooses the Pay Program, and particularly that schools with bad (old) scores chose the pay program.

Slide13

Algebra
Limited model:     Y = c0 + c1 X1
Full model:        Y = b0 + b1 X1 + b2 X2
Background model:  X2 = a0 + a1 X1

We want the Full model, but we only have the Limited one with only X1. So substitute the background model into the full model:
Y = b0 + b1 X1 + b2 (a0 + a1 X1)
Collect terms:
Y = (b0 + b2 a0) + (b1 + b2 a1) X1
so c0 = b0 + b2 a0 and c1 = b1 + b2 a1.

So the bias of the coefficient on X1 in the limited model is b2 a1.
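To see the algebra at work, the sketch below fits the limited, full, and background models on simulated data and compares c1 with b1 + b2 a1. It uses plain Python instead of Stata, and the data-generating numbers (3.0, 4.0, -0.5, and so on) and helper names are arbitrary choices for illustration, not values from the course datasets.

```python
import random

random.seed(0)
n = 1000

# Simulated data (illustrative only): X2 depends on X1, and Y depends on both.
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [2.0 - 0.5 * v + random.gauss(0, 1) for v in x1]
y = [1.0 + 3.0 * u + 4.0 * v + random.gauss(0, 1) for u, v in zip(x1, x2)]


def mean(values):
    return sum(values) / len(values)


def cov(u, v):
    """Sum of cross-products of deviations from the means (unscaled covariance)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def simple_slope(y, x):
    """OLS slope from regressing y on one x (with an intercept)."""
    return cov(x, y) / cov(x, x)


def two_slopes(y, x1, x2):
    """OLS slopes from regressing y on x1 and x2 (with an intercept)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


c1 = simple_slope(y, x1)        # limited model
b1, b2 = two_slopes(y, x1, x2)  # full model
a1 = simple_slope(x2, x1)       # background model

print(f"c1           = {c1:.4f}")
print(f"b1 + b2 * a1 = {b1 + b2 * a1:.4f}")
print(f"bias b2 * a1 = {b2 * a1:.4f}")
```

The first two printed numbers agree up to floating-point rounding, because for least-squares estimates computed on the same sample the relationship c1 = b1 + b2 a1 holds exactly.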

Slide14

Let’s apply this to Brookline Condos
Limited model: Price = 520729 - 46969 BEACON
Full model: Price = 6981 + 409.4 SIZE + 32936 BEACON
Background relationship: SIZE = 1254 - 195.17 BEACON

Here X1 is BEACON and X2 is SIZE, so b1 = 32936, b2 = 409.4 and a1 = -195.17.
c1 = (b1 + b2 a1)
Check: -46969 ≈ 32936 + (-195.17 * 409.4), up to rounding in the reported coefficients.
The bias is b2 a1, or -195.17 * 409.4, which is negative.
We are UNDERESTIMATING the direct effect.

a1 (negative)
c1 combined effect (negative)
b1 direct effect (positive)

Slide15

Pay Program algebra
(1) Score = 61.809 - 5.68 Pay_Program   (limited)
(2) Score = 10.80 + 3.73 Pay_Program + 0.826 OldScore   (full)

Here is the regression of the background model:

. regress OLDSCORE PAY_PROGRAM

      Source |       SS       df       MS              Number of obs =     515
-------------+------------------------------           F(1, 513)     =   42.23
       Model |  7952.87922     1  7952.87922           Prob > F      =  0.0000
    Residual |  96613.7883   513  188.330971           R-squared     =  0.0761
-------------+------------------------------           Adj R-squared =  0.0743
       Total |  104566.667   514  203.437096           Root MSE      =  13.723

------------------------------------------------------------------------------
    OLDSCORE |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 PAY_PROGRAM |  -11.39843   1.754058    -6.50   0.000    -14.84445   -7.952413
       _cons |   61.78153   .6512825    94.86   0.000     60.50202    63.06104
------------------------------------------------------------------------------

Write the equation of the background model.
Combine it with the two above models to get the value and sign of the omitted variable bias in the coefficient -5.68 in the limited model.
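One way to work the exercise is sketched below, under the assumption that the background model is read straight off the Stata output above (OLDSCORE = 61.78 - 11.40 PAY_PROGRAM). The snippet is Python used as a calculator; the variable names are chosen for this example only.

```python
# Full-model coefficients from regression (2) on this slide.
b1 = 3.73    # direct effect of Pay_Program on Score
b2 = 0.826   # effect of OldScore on Score

# Background model from the Stata output: OLDSCORE = a0 + a1 * PAY_PROGRAM.
a0 = 61.78153
a1 = -11.39843

bias = b2 * a1          # omitted variable bias in the limited model
c1_implied = b1 + bias  # should be close to the limited-model coefficient, -5.68

print(f"bias      = {bias:.2f}")
print(f"b1 + bias = {c1_implied:.2f}")
```

The bias is negative and roughly the size of the gap between the two Pay_Program coefficients; the small remaining difference from -5.68 comes from rounding in the reported coefficients.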

Slide16

One variable with different slopes

Slide17

Review of simple derivatives
A derivative is the same as a slope.
In a line, the slope is always the same.
In a curve, the slope changes.
The rules of derivatives tell you how to calculate the slope at any point of a curve.
We write the derivative as dY/dX instead of the slope ∆Y/∆X.

Slide18

Three rules of calculus
1. The derivative (slope) of two terms added together = the derivative of each term added together:
   Y = A + B, where A and B are terms with X in them
   dY/dX = dA/dX + dB/dX
2. The derivative (slope) of a constant is zero:
   If Y = 5, dY/dX = 0
3. If Y = a X^b, then dY/dX = b a X^(b-1)
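Rule 3 can be checked numerically by comparing it with the slope measured over a very small step. The sketch below is illustrative Python; the function names and the particular curve and point are chosen for this example.

```python
def power_rule_slope(a: float, b: float, x: float) -> float:
    """Slope of Y = a * X**b at X = x, using rule 3: dY/dX = b * a * X**(b - 1)."""
    return b * a * x ** (b - 1)


def numerical_slope(f, x: float, h: float = 1e-6) -> float:
    """Approximate the slope of f at x with a small central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Illustrative curve (values chosen for this example): Y = 25 * X**2 at X = 3.
a, b, x = 25.0, 2.0, 3.0
print(power_rule_slope(a, b, x))                 # slope from rule 3
print(numerical_slope(lambda t: a * t ** b, x))  # should be very close to the line above
```

The same comparison works for negative or fractional exponents, as long as x is positive.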

Slide19

Examples
If Y = 25 X^2, then
dY/dX = 2 · 25 X^(2-1) = 50 X

Another example combining the three rules is:
Y = 25 X^2 + 200 X + 3000
Then, recalling that X^0 = 1,
dY/dX = 2 · 25 X^(2-1) + 1 · 200 X^(1-1) + 0 = 50 X + 200

The exponent does not have to be either positive or an integer. Example:
Y = 20 X^(-2.5), then
dY/dX = -2.5 · 20 X^(-2.5-1) = -50 X^(-3.5)

Slide20

Now we’re ready for different slopes

Slide21

Movie dataset
Here is a regression of Movie lifetime revenues on Budget and a dummy for if it is a SciFi movie:
Revenues = 16.6 + 1.12 Budget - 9.79 SciFi
          (5.28)  (.102)        (11.6)
(standard errors in parentheses)
What does an observation represent in this data set?
What do we learn from the standard errors about each coefficient’s significance?
What is the slope dRevenues/dSciFi?
What is the slope dRevenues/dBudget?
Are these results what you expect?

Slide22

Do you think that budget will matter similarly for all types of movies?
Particularly, what do we expect about the coefficient on budget (slope) for SciFi movies (compared to others)?

Slide23

If we think that each budget dollar affects SciFi movies differently…
The simplest way to model this in a regression is:
1. Make an additional variable by multiplying Budget x SciFi
2. Make an additional variable by multiplying Budget x non-SciFi
3. Replace Budget with these two variables (keeping in SciFi)
These are called interaction terms.

Slide24

Steps 1 and 2: Replace budget with two new variables, Budget x SciFi and Budget x Non-SciFi
gen budgetscifi = budget*scifi
gen budgetnonscifi = budget*(1-scifi)

Slide25

What data looks like in a spreadsheet

moviename                       revenue  scifi  budget  budgetscifi  budgetnonscifi
The Bridges of Madison County   71.5166      0      22            0              22
Dead Man Walking                39.3636      0      11            0              11
Rob Roy                         31.5969      0      28            0              28
Clueless                        56.6316      0    13.7            0            13.7
Babe                            63.6589      0      30            0              30
Jumanji                             100      0      65            0              65
Showgirls                       20.3508      0      40            0              40
Starship Troopers               54.8144      1     100          100               0
Bad Boys                         65.807      0      23            0              23
Event Horizon                   26.6732      1      60           60               0
Jefferson in Paris              2.47367      0      14            0              14
To Die For                      21.2845      0      20            0              20
Star Trek: Insurrection         70.1877      1      70           70               0
Sphere                          37.0203      0      73            0              73
Out of Sight                    37.5626      0      48            0              48
Saving Private Ryan                 220      0      65            0              65
Enemy of the State                  110      0      85            0              85
The Big Lebowski                17.4519      0      15            0              15
Lost in Space                   69.1176      1      80           80               0
Mortal Kombat                   70.4541      0      20            0              20
Copycat                         32.0519      0      20            0              20
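The two gen commands from Steps 1 and 2 amount to simple row-by-row multiplication. The sketch below mirrors them in plain Python on a few rows copied from the spreadsheet above; it is an illustration of the arithmetic, not a replacement for the Stata commands.

```python
# A few rows copied from the table above: (moviename, scifi, budget).
movies = [
    ("Clueless", 0, 13.7),
    ("Starship Troopers", 1, 100),
    ("Event Horizon", 1, 60),
    ("Saving Private Ryan", 0, 65),
]

rows = []
for name, scifi, budget in movies:
    rows.append({
        "moviename": name,
        "scifi": scifi,
        "budget": budget,
        # Same arithmetic as: gen budgetscifi = budget*scifi
        "budgetscifi": budget * scifi,
        # Same arithmetic as: gen budgetnonscifi = budget*(1-scifi)
        "budgetnonscifi": budget * (1 - scifi),
    })

for row in rows:
    print(row)
```

For every movie exactly one of the two new variables equals the budget and the other is zero, which is what lets the regression estimate a separate budget slope for each group.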

Slide26

3. Replace Budget with these two variables (keeping in SciFi)
regress revenues scifi budgetscifi budgetnonscifi
You get:
revenues = 19.91 - 72.07 scifi + 2.04 budgetscifi + 1.04 budgetnonscifi
          (5.36)   (25.5)        (0.352)            (0.105)
What is the slope drevenues/dbudget?
drevenues/dbudget = 2.04 scifi + 1.04 (1 - scifi)
If it is a scifi movie: slope drevenues/dbudget = 2.04 (since the last term is 0)
If it is not a scifi movie: slope drevenues/dbudget = 1.04
Each budget dollar is more important if it is a scifi/fantasy movie.
Note also: All coefficients are significant.

Slide27

Graph of this model
[Graph: Revenues against Budget, with separate lines for SciFi movies and Other movies]

Slide28

This also allows the effect of being a scifi movie to depend on the budget
From the previous overhead:
revenues = 19.91 - 72.07 scifi + 2.04 budgetscifi + 1.04 budgetnonscifi
What is the slope drevenues/dscifi?
Since budgetscifi = budget*scifi and budgetnonscifi = budget*(1-scifi), scifi appears in all three terms:
drevenues/dscifi = -72.07 + 2.04 budget - 1.04 budget = -72.07 + 1.00 budget
So if budget = 100, drevenues/dscifi = -72.07 + 1.00*100 = 27.93
Compare to our equation without the “interaction terms”, with:
drevenues/dscifi = -9.79

Slide29

Making Regression Tables

Regressions of Points per game per player

                               1           2           3           4
height                   -0.0226   0.2133***   0.2240***      4.7306
                         (-0.28)      (4.77)      (4.93)      (0.85)
year                              -.01545***    -0.0166*      0.1667
                                     (-2.99)     (-1.79)      (0.74)
Height x year                                                -0.0023
                                                             (-0.81)
yr1970                                            0.5629      0.5522
                                                  (1.12)      (1.10)
yr1975                                        -1.0830***  -1.1161***
                                                 (-2.48)     (-2.55)
yr1980                                           -0.3757     -0.4313
                                                 (-0.96)     (-1.08)
yr1985                                           -0.1960     -0.2582
                                                 (-0.55)     (-0.71)
yr1990                                           -0.1043     -0.1580
                                                 (-0.33)     (-0.49)
yr1995                                         -0.54223*   -0.5841**
                                                 (-1.86)     (-1.97)
yr2000                                         -0.5964**   -0.6269**
                                                 (-2.20)     (-2.29)
yr2005                                         -0.5453**    0.5576**
                                                 (-2.08)     (-2.13)
yr2010                                        -0.1962502  -0.2026464
                                                 (-0.73)     (-0.75)
center                            -0.8795***  -0.8917***  -0.8946***
                                     (-5.14)     (-5.20)     (-5.22)
minutes played                     0.5260***   0.5253***   0.5251***
                                     (76.41)     (76.42)     (76.35)
Constant (intercept)    9.872891    11.21819    13.02334   -352.6088
                         (-1.51)     (-1.12)     (-0.71)     (-0.78)
# observations             1,509       1,509       1,509       1,509
MSE SSE                   6.0404      2.6943      2.6812      2.6815
adjusted r squared       -0.0006      0.8009      0.8028      0.8028

t-statistics in parentheses; * p<.1, ** p<.05, *** p<.01; Omitted year: 1965
from "Have Taller "Big Men" Become Less Productive in the NBA over Recent Years" Austin Hirsch 2015

When you want to report several regressions and allow readers to compare them, you report the regressions in a table.
All variables are listed in the first column.
Each regression is another column.

Slide30

Main elements of the regression table:
Each column represents a different equation.
If all regressions have the same dependent (Y) variable, then tell people what the dependent variable is in the table title (like I did here). If different columns have different dependent variables, list them as each column’s first row (instead of “1”, “2”).
Each explanatory variable in any equation should be listed in the first column, leaving 2 rows for each variable.
For each regression, first put the coefficient itself in the column. Then, in the cell below it, put either the coefficient’s standard error or the coefficient’s t-statistic in parentheses. Somewhere in the title or the notes to the table, inform the reader which statistic is in parentheses.
It is particularly helpful to put asterisks next to significant coefficients. For instance, I put the following: *** if the p-value is <.01, ** if p<.05, * if p<.10. This should be explained in the table’s footnotes.
Not every column needs to have a coefficient for every variable, since not every regression included every variable. Looking at the table, we can clearly see which explanatory variables were included in each regression.
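The asterisk and parenthesis conventions described above are easy to apply mechanically. The sketch below shows one possible way in Python; the helper names stars and table_cell are invented for this example, and the p-value passed in at the end is a made-up input rather than a number from the table.

```python
def stars(p_value: float) -> str:
    """Asterisks for the convention on this slide: *** p<.01, ** p<.05, * p<.10."""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.10:
        return "*"
    return ""


def table_cell(coef: float, t_stat: float, p_value: float):
    """Return the two rows for one variable: coefficient with stars, then t-stat in parentheses."""
    return f"{coef:.4f}{stars(p_value)}", f"({t_stat:.2f})"


# Hypothetical inputs for illustration; the p-value is made up, not taken from the table.
top, bottom = table_cell(coef=0.2133, t_stat=4.77, p_value=0.0001)
print(top)
print(bottom)
```

Each variable then occupies two rows in its column: the starred coefficient on top and the statistic in parentheses underneath, matching the layout of the table on the previous slide.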
