Statistical Inference and Regression Analysis: GB.3302.30
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
Statistics and Data Analysis
Part 6 – Regression Model 1: Conditional Mean
U.S. Gasoline Price
(Figures: the price over 6 months and over 5 years.)
Impact of Change in Gasoline Price on Consumer Demand?
Elasticity concepts
Long term vs. short term
Income
Demand for gasoline
Demand for food
Movie Success vs. Movie Online Buzz Before Release (2009)
Internet Buzz and Movie Success
Box office sales vs. “Can’t wait” votes 3 weeks before release
Is There Really a Relationship?
BoxOffice is obviously not equal to f(Buzz) for some function. But they do appear to be “related,” perhaps statistically – that is, stochastically. There is a covariance. The linear regression summarizes it.
A predictor would be Box Office = a + b Buzz.
Is b really > 0? What would be implied by b > 0?
Covariation – Education and Life Expectancy
Causality? Covariation? Does more education make people live longer?
Is there a hidden driver of both? (Per capita GDP?)
Using Regression to Predict
Predictor: Overseas box office = a + b Domestic box office
The prediction will not be perfect. We construct a range of “uncertainty.”
The equation would not predict Titanic.
Conditional Variation and Regression
Conditional distribution of a pair of random variables: f(y|x) or P(y|x)
Mean function: E[y|x] = the regression of y on x.
y|x ~ Normal[20 + 3x, 4²], x = 1, 2, 3, 4; Poisson
(Figure: the conditional distributions of y at x = 1, 2, 3, 4.)
Expected Income Depends on Household Size
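A quick simulation sketch of the conditional-mean idea above (the code and sample sizes are mine, not from the deck): within each x group, the sample mean of y should track E[y|x] = 20 + 3x.

```python
import random
import statistics

random.seed(42)

# Simulate y | x ~ Normal(20 + 3x, sd = 4) for x = 1, 2, 3, 4 and check
# that the sample mean of y within each x group tracks E[y|x] = 20 + 3x.
def draw_y(x, n=50_000):
    return [random.gauss(20 + 3 * x, 4) for _ in range(n)]

for x in (1, 2, 3, 4):
    ybar = statistics.mean(draw_y(x))
    print(x, round(ybar, 2))  # close to 23, 26, 29, 32
```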
Average Box Office by Internet Buzz Index
(Each point = average box office for Buzz in the interval.)
Linear Regression?
Fuel Bills vs. Number of Rooms
Independent vs. Dependent Variables
Y in the model:
Dependent variable
Response variable
X in the model:
Independent variable (note the meaning of ‘independent’)
Regressor
Covariate
Conditional vs. joint distribution
Linearity and Functional Form
y = g(x)
h(y) = α + βf(x):
y = α + βx
y = exp(α + βx); log y = α + βx
y = α + β(1/x) = α + βf(x)
y = e^α x^β; log y = α + β log x. Etc.
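A tiny check of the last functional form (the values of α and β are illustrative, not from the slides): taking logs turns y = e^α x^β into a line in log x.

```python
import math

# The model y = e^alpha * x^beta is nonlinear in x, but taking logs gives
# log y = alpha + beta * log x, which is linear in log x.
alpha, beta = 0.5, 2.0  # illustrative values, not from the slides

def y(x):
    return math.exp(alpha) * x ** beta

# The slope of log y against log x is exactly beta:
slope = (math.log(y(4.0)) - math.log(y(2.0))) / (math.log(4.0) - math.log(2.0))
assert math.isclose(slope, beta)
assert math.isclose(math.log(y(3.0)), alpha + beta * math.log(3.0))
```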
Inference and Regression
Least Squares
Fitting a Line to a Set of Points
Choose a and b to minimize the sum of squared residuals – Gauss’s method of least squares.
(Figure: data points (xi, yi), the predictions a + bxi, and the residuals.)
Least Squares Regression
Least Squares Algebra
Least Squares
Normal Equations: Σi ei = 0 and Σi xi ei = 0
Computing the Least Squares Parameters a and b
b = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² = sxy/sx²,  a = ȳ − b x̄
(We will use sy² later.)
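A minimal sketch of the formulas above (the data are made up; points lying exactly on y = 1 + 2x recover a = 1, b = 2):

```python
import statistics

# b = sum (xi - xbar)(yi - ybar) / sum (xi - xbar)^2,  a = ybar - b * xbar
def least_squares(x, y):
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    return a, b

a, b = least_squares([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(a, b)  # 1.0 2.0
```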
Least Absolute Deviations
Least Squares vs. LAD
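A sketch contrasting the two criteria on invented toy data with one outlier. For simple regression, an optimal LAD line always passes through at least two data points, so with a small sample it can be found by brute force over pairs:

```python
from itertools import combinations
import statistics

x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, 30]   # the last observation is an outlier

def ls_fit(x, y):
    # Least squares: minimize the sum of squared residuals.
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    return ybar - b * xbar, b

def lad_fit(x, y):
    # Least absolute deviations: minimize the sum of |residuals|,
    # searching over lines through each pair of points.
    def sad(a, b):
        return sum(abs(yi - a - b * xi) for xi, yi in zip(x, y))
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue
        b = (y2 - y1) / (x2 - x1)
        a = y1 - b * x1
        if best is None or sad(a, b) < sad(*best):
            best = (a, b)
    return best

a_ls, b_ls = ls_fit(x, y)
a_lad, b_lad = lad_fit(x, y)
print(b_ls, b_lad)  # the LS slope is pulled up by the outlier; LAD stays near 1
```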
Inference and Regression
Regression Model
b Measures Covariation
Predictor: Box Office = a + b Buzz
Interpreting the Function
a = the life expectancy associated with 0 years of education. No country has 0 average years of education; the regression only applies in the range of experience.
b = the increase in life expectancy associated with each additional year of average education.
(The range of experience = the observed range of education.)
Covariation and Causality
Does more education make you live longer (on average)?
Causality?
Height (inches) and Income ($/mo.) in first post-MBA job (men). WSJ, 12/30/86.
Ht. Inc. Ht. Inc. Ht. Inc.
70 2990 68 2910 75 3150
67 2870 66 2840 68 2860
69 2950 71 3180 69 2930
70 3140 68 3020 76 3210
65 2790 73 3220 71 3180
73 3230 73 3370 66 2670
64 2880 70 3180 69 3050
70 3140 71 3340 65 2750
69 3000 69 2970 67 2960
73 3170 73 3240 70 3050
Estimated Income = -451 + 50.2 Height
Correlation = 0.84 (!)
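As a check on the reported numbers, the table can be refit directly (pairs read row by row from the table above; loose tolerances since the slide rounds):

```python
import math
import statistics

# Height (inches) and income ($/mo.) pairs from the table above.
height = [70, 68, 75, 67, 66, 68, 69, 71, 69, 70, 68, 76, 65, 73, 71,
          73, 73, 66, 64, 70, 69, 70, 71, 65, 69, 69, 67, 73, 73, 70]
income = [2990, 2910, 3150, 2870, 2840, 2860, 2950, 3180, 2930, 3140,
          3020, 3210, 2790, 3220, 3180, 3230, 3370, 2670, 2880, 3180,
          3050, 3140, 3340, 2750, 3000, 2970, 2960, 3170, 3240, 3050]

xbar, ybar = statistics.mean(height), statistics.mean(income)
sxy = sum((h - xbar) * (i - ybar) for h, i in zip(height, income))
sxx = sum((h - xbar) ** 2 for h in height)
syy = sum((i - ybar) ** 2 for i in income)
b = sxy / sxx                     # slope
a = ybar - b * xbar               # intercept
r = sxy / math.sqrt(sxx * syy)    # correlation
print(round(a, 1), round(b, 1), round(r, 2))
```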
Inference and Regression
Analysis of Variance
Regression Fits
(Left: regression of salary vs. years of experience. Right: regression of fuel bill vs. number of rooms for a sample of homes.)
Regression Arithmetic
Variance Decomposition
Fit of the Equation to the Data
Regression vs. Residual SS
Analysis of Variance Table
Source       Degrees of Freedom   Sum of Squares   Mean Square   F Ratio   P Value
Regression   1                    SSR              SSR/1         MSR/MSE   2P[z > √F]*
Residual     N-2                  SSE              SSE/(N-2)
Total        N-1                  SST
Explained Variation
The proportion of variation “explained” by the regression is called R-squared (R²). It is also called the Coefficient of Determination. (It is the square of something – to be shown later.)
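A sketch of the decomposition the ANOVA table summarizes, on made-up data: SST = SSR + SSE exactly, R² = SSR/SST, and F = (SSR/1) / (SSE/(N−2)).

```python
import statistics

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]

# Fit the least squares line.
xbar, ybar = statistics.mean(x), statistics.mean(y)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
yhat = [a + b * xi for xi in x]

# Variance decomposition.
sst = sum((yi - ybar) ** 2 for yi in y)             # total SS
ssr = sum((yh - ybar) ** 2 for yh in yhat)          # regression SS
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # residual SS

assert abs(sst - (ssr + sse)) < 1e-9   # the decomposition is exact
r2 = ssr / sst
f = (ssr / 1) / (sse / (len(y) - 2))
print(round(r2, 3), round(f, 1))
```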
Movie Madness Fit
Regression Fits
(Four example fits, with R² = 0.522, 0.360, 0.880, and 0.424.)
R Squared Benchmarks
Aggregate time series: expect .9+.
Cross sections: .5 is good; sometimes we do much better.
Large survey data sets: .2 is not bad.
R² = 0.924 in this cross section.
Correlation Coefficient
Correlations
(Example scatter plots with rxy = 0.6, rxy = 0.723, rxy = −0.402, rxy = +1.000.)
R-Squared is rxy²
R-squared is the square of the correlation between yi and the predicted yi, which is a + bxi. The correlation between yi and (a + bxi) is the same as the correlation between yi and xi. Therefore, a regression with a high R² predicts yi well.
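A quick numerical check of the claim on made-up data: R² equals corr(y, ŷ)², which equals corr(x, y)² (the slope is positive here, so the correlations match in sign too).

```python
import math
import statistics

def corr(u, v):
    # Sample correlation coefficient.
    ub, vb = statistics.mean(u), statistics.mean(v)
    suv = sum((a - ub) * (c - vb) for a, c in zip(u, v))
    return suv / math.sqrt(sum((a - ub) ** 2 for a in u) *
                           sum((c - vb) ** 2 for c in v))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.4, 3.6, 5.1]

# Least squares fit and fitted values.
xbar, ybar = statistics.mean(x), statistics.mean(y)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
yhat = [a + b * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
r2 = 1 - sse / sst

assert math.isclose(r2, corr(y, yhat) ** 2)   # R^2 = corr(y, yhat)^2
assert math.isclose(corr(y, yhat), corr(x, y))  # same as corr(x, y) when b > 0
```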
Squared Correlations
(Example scatter plots with rxy² = 0.36, 0.522, .161, .924.)
Movie Madness
Annotations on the regression output:
Estimated equation, with estimated coefficients a and b
S = se = estimated standard deviation of ε
R-Sq = square of the sample correlation between x and y
Sum of squared residuals, Σi ei²
N − 2 = degrees of freedom
S² = se²
Software
http://apps.stern.nyu.edu
http://estore.onthehub.com
http://people.stern.nyu.edu/wgreene/MathStat/MinitabViaCitrix.pdf
https://apps.stern.nyu.edu
MONET.MPJ
Use File > Open Worksheet to open an Excel .xls or .xlsx file.
Stat > Basic Statistics > Display Descriptive Statistics
Stat > Regression > Regression
Results to Report
Linear Regression
Sample Regression Line
http://people.stern.nyu.edu/wgreene/MathStat/IRAnlogit5setup.exe
Project > Import Variables imports .csv files.
Command Typed in Editing Window
Put the cursor in the desired line of text (or highlight more than one line), then press the GO button.
Typing Commands in the Editor
Important Commands:
SAMPLE ; first - last $
Sample ; 1 - 1000 $
Sample ; All $
CREATE ; Variable = transformation $
Create ; LogMilk = Log(Milk) $
Create ; LMC = .5*Log(Milk)*Log(Cows) $
Create ; … any algebraic transformation $
Name Conventions
CREATE ; name = any function desired $
Name is the name of a new variable:
No more than 8 characters in a name
The first character must be a letter
May not contain -, +, *, /. May contain _.
Model Command
Model ; Lhs = dependent variable
      ; Rhs = list of independent variables $
Regress ; Lhs = Milk ; Rhs = ONE,Feed,Labor,Land $
ONE requests the constant term.
The Go Button
“Submitting” Commands
One command: place the cursor on that line and press the “Go” button.
More than one command: highlight all the lines (as in any text editor) and press the “Go” button.
Compute a Regression
Sample ; All $
Regress ; Lhs = YIT
        ; Rhs = One,X1,X2,X3,X4 $
One provides the constant term in the model.
Standard Three Window Operation
The project window shows variables, results appear in the output window, and commands are typed in the editing window.
Inference and Regression
Regression Model
The Linear Regression Statistical Model
The linear regression model
Sample statistics and population quantities
Specifying the regression model
A Linear Regression
Predictor: Box Office = -14.36 + 72.72 Buzz
Data and Relationship
We suggested the relationship between box office and internet buzz is
Box Office = -14.36 + 72.72 Buzz
Note the obvious inconsistency in the figure. This is not the relationship. How do we reconcile the equation with the data?
Modeling the Underlying Process
A model that explains the process that produces the data that we observe:
Observed outcome = the sum of two parts
(1) Explained: the regression line
(2) Unexplained (noise): the remainder
Regression model: the “model” is the statement that part (1) is the same process from one observation to the next.
The Population Regression
THE model: a specific statement about the parts of the model
(1) Explained: Explained Box Office = α + β Buzz
(2) Unexplained: the rest is “noise,” ε. Random ε has certain characteristics.
Model statement: Box Office = α + β Buzz + ε
The Data Include the Noise
What Explains the Noise?
Assumptions
(Regression) The equation linking “Box Office” and “Buzz” is stable:
E[Box Office | Buzz] = α + β Buzz
Another sample of movies, say 2012, would obey the same fundamental relationship.
Model Assumptions
yi = α + βxi + εi
α + βxi is the “regression function.” It contains the “information” about yi in xi. It is unobserved because α and β are not known for certain.
εi is the “disturbance.” It is the unobserved random component.
Observed yi is the sum of two unobserved parts.
Model Assumptions About εi
εi is a random variable.
Mean zero: the regression is the mean of yi; εi is the deviation from the regression.
Variance σ².
Noise: εi is unrelated to any values of xi (no covariance) – it’s “random noise.”
εi is unrelated to any other observation εj (not “autocorrelated”).
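These assumptions can be illustrated with a simulation (the parameter values are mine, not the movie data): generate yi = α + βxi + εi with εi mean-zero, constant-variance noise unrelated to xi, and confirm that least squares recovers α and β.

```python
import random
import statistics

random.seed(1)

alpha, beta, sigma = 20.0, 3.0, 4.0   # illustrative values
n = 50_000
x = [random.uniform(0, 10) for _ in range(n)]
eps = [random.gauss(0, sigma) for _ in range(n)]   # mean 0, sd sigma, unrelated to x
y = [alpha + beta * xi + e for xi, e in zip(x, eps)]

# Least squares on the simulated sample.
xbar, ybar = statistics.mean(x), statistics.mean(y)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
print(round(a, 2), round(b, 2))  # close to alpha = 20 and beta = 3
```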
Sample “Estimate” vs. Population
Application: Health Care Data
German Health Care Usage Data: 27,326 observations on German households, 1984-1994.
DOCTOR = 1(Number of doctor visits > 0)
HOSPITAL = 1(Number of hospital visits > 0)
HSAT = health satisfaction, coded 0 (low) - 10 (high)
DOCVIS = number of doctor visits in last three months
HOSPVIS = number of hospital visits in last calendar year
PUBLIC = insured in public health insurance = 1; otherwise = 0
ADDON = insured by add-on insurance = 1; otherwise = 0
INCOME = household nominal monthly net income in German marks / 10000
HHKIDS = children under age 16 in the household = 1; otherwise = 0
EDUC = years of schooling
AGE = age in years
MARRIED = marital status
Sample vs. Population
For the full ‘population’ of 27,326:
Income = .12609 + .01996 * Educ + ε
For a random sample of 52 households, least squares regression produces
Income = .06856 + .02079 * Educ + e
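The contrast above can be mimicked with a simulation: treat the population coefficients from the slide as true, draw a sample of 52, and refit. (The education range and the noise standard deviation here are my assumptions, not the actual data, so the sample estimates will differ from the slide's.)

```python
import random
import statistics

random.seed(7)

alpha, beta = 0.12609, 0.01996   # 'population' coefficients from the slide
sigma = 0.15                     # assumed noise sd (not from the data)

educ = [random.uniform(7, 18) for _ in range(52)]
income = [alpha + beta * e + random.gauss(0, sigma) for e in educ]

# Least squares on the sample of 52.
ebar, ibar = statistics.mean(educ), statistics.mean(income)
b = sum((e - ebar) * (i - ibar) for e, i in zip(educ, income)) / \
    sum((e - ebar) ** 2 for e in educ)
a = ibar - b * ebar
print(round(a, 5), round(b, 5))  # near, but not equal to, alpha and beta
```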
Sample vs. Population
Disturbances vs. Residuals
ε = y − α − β Buzz
e = y − a − b Buzz
Standard Deviation of Residuals
The standard deviation of εi = yi − α − βxi is σ.
σ = √E[εi²] (the mean of εi is zero).
Sample a and b estimate α and β.
Residual ei = yi − a − bxi estimates εi.
Use √((1/N) Σi ei²) to estimate σ? Close, but not quite: use √((1/(N−2)) Σi ei²).
Why N−2? It relates to the fact that two parameters (α, β) were estimated. Proof to come later.
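A small sketch of the N−2 correction (data made up): the N−2 divisor always gives a larger estimate than the naive 1/N divisor, and the least squares residuals sum to zero.

```python
import math
import statistics

x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]

# Fit by least squares and form the residuals.
xbar, ybar = statistics.mean(x), statistics.mean(y)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
e = [yi - a - b * xi for xi, yi in zip(x, y)]

n = len(y)
sse = sum(ei ** 2 for ei in e)
s_naive = math.sqrt(sse / n)        # divides by N: too small
s_e = math.sqrt(sse / (n - 2))      # divides by N - 2: the usual estimate

assert s_e > s_naive
assert abs(sum(e)) < 1e-9           # residuals sum to zero
```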
Residuals
Samples and Populations
Population (Theory):
yi = α + βxi + εi
Parameters α, β
Regression: α + βxi = mean of yi | xi
Disturbance εi: mean 0, standard deviation σ, no correlation with xi
Sample (Observed):
yi = a + bxi + ei
Estimates a, b
Fitted regression: a + bxi = predicted yi | xi
Residuals ei: sample mean 0, sample std. dev. se, sample Cov[x, e] = 0
Linear Regression
Sample Regression Line
A Cost Model
Electricity.mpj
Total cost in $Million; output in Million KWH
N = 123 American electric utilities
Model: Cost = α + β KWH + ε
Cost Relationship
Sample Regression
Interpreting the Model
Cost = 2.44 + 0.00529 Output + e
Cost is $Million, Output is Million KWH.
Fixed cost = cost when output = 0: Fixed Cost = $2.44 Million
Marginal cost = change in cost / change in output
= .00529 $Million/Million KWH = .00529 $/KWH = 0.529 cents/KWH
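The unit arithmetic above as a quick check, using the fitted coefficients from the slide:

```python
# Fitted cost function from the slide:
# Cost ($Million) = 2.44 + 0.00529 * Output (Million KWH)
a, b = 2.44, 0.00529

def cost(output_mkwh):
    return a + b * output_mkwh

fixed_cost = cost(0)                # cost at zero output, in $Million
marginal_cents_per_kwh = b * 100    # $Million/Million KWH = $/KWH; x100 for cents

assert fixed_cost == 2.44
assert abs(marginal_cents_per_kwh - 0.529) < 1e-9
```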