Statistics and Data Analysis

Slide1

Statistics and Data Analysis

Professor William Greene, Stern School of Business, IOMS Department, Department of Economics

Slide2

Statistics and Data Analysis

Part 18 – Regression Modeling

Slide3

Linear Regression Models

Least squares results

Regression model

Sample statistics

Estimates of population parameters

How good is the model?

In the abstract

Statistical measures of model fit

Assessing the validity of the relationship

Slide4

Regression Model

Regression relationship: $y_i = \alpha + \beta x_i + \varepsilon_i$

Random $\varepsilon_i$ implies random $y_i$.

Observed random $y_i$ has two unobserved components:

Explained: $\alpha + \beta x_i$

Unexplained: $\varepsilon_i$

Random component $\varepsilon_i$: zero mean, standard deviation $\sigma$, normal distribution.
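A minimal sketch of what the model says, assuming made-up values for the intercept, slope and standard deviation (none of these numbers come from the slides). Each simulated observation is built from its explained part and a normally distributed random part with zero mean.

```python
import random

def simulate_regression(x_values, alpha, beta, sigma, seed=0):
    """Draw y_i = alpha + beta * x_i + eps_i with eps_i ~ Normal(0, sigma)."""
    rng = random.Random(seed)
    data = []
    for x in x_values:
        explained = alpha + beta * x      # explained component
        eps = rng.gauss(0.0, sigma)       # unexplained random component
        data.append((x, explained, eps, explained + eps))
    return data

# Hypothetical parameter values, chosen only for illustration.
rows = simulate_regression(range(1, 11), alpha=5.0, beta=2.0, sigma=1.5)
for x, explained, eps, y in rows[:3]:
    print(f"x={x}  explained={explained:.2f}  eps={eps:+.2f}  y={y:.2f}")
```

In real data only x and y are observed; the split into explained and unexplained parts is visible here only because the data were simulated.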

Slide5

Linear Regression: Model Assumption

Slide6

Least Squares Results

Slide7

Using the Regression Model

Prediction: Use $x_i$ as information to predict $y_i$.

The natural predictor is the mean, $\bar{y}$. $x_i$ provides more information.

With $x_i$, the predictor is $\hat{y}_i = a + b x_i$.
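The two predictors can be compared on a small example. This is a sketch that uses the standard least squares formulas for a and b; the rooms and fuel bill numbers are invented for illustration and are not the data behind the slides.

```python
def least_squares(x, y):
    """Return (a, b) minimizing the sum of squared residuals of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Made-up data for illustration only.
rooms = [3, 4, 5, 6, 7, 8]
bill = [150, 310, 420, 580, 690, 850]

a, b = least_squares(rooms, bill)
y_bar = sum(bill) / len(bill)
print(f"without x: predict the mean {y_bar:.1f} for every home")
print(f"with x:    predict a + b*x = {a:.1f} + {b:.1f}*x")
print(f"prediction for 6 rooms: {a + b * 6:.1f}")
```

The fitted line always passes through the point of means, so at the average number of rooms the two predictors agree.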

Slide8

Regression Fits

Regression of salary vs. years of experience

Regression of fuel bill vs. number of rooms for a sample of homes

Slide9

Regression Arithmetic

Slide10

Analysis of Variance

Slide11

Fit of the Model to the Data

Slide12

Explained Variation

The proportion of variation “explained” by the regression is called R-squared ($R^2$).

It is also called the Coefficient of Determination.
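The formulas on the Regression Arithmetic, Analysis of Variance and Fit of the Model slides did not survive in this transcript. The standard decomposition behind "explained variation" is given here as a reference, on the assumption that it is what those slides showed. It uses the least squares values a and b, with residual $e_i$ as notation introduced for this note.

```latex
\begin{align*}
\hat{y}_i &= a + b x_i, \qquad e_i = y_i - \hat{y}_i \\
\underbrace{\sum_{i=1}^{N} (y_i - \bar{y})^2}_{\text{total}}
  &= \underbrace{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}_{\text{regression}}
   + \underbrace{\sum_{i=1}^{N} e_i^2}_{\text{residual}} \\
R^2 &= \frac{\text{regression SS}}{\text{total SS}}
     = 1 - \frac{\text{residual SS}}{\text{total SS}}
\end{align*}
```

Because both sums of squares on the right are nonnegative, $R^2$ always lies between 0 and 1.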

Slide13

Movie Madness Fit

$R^2$

Slide14

Regression Fits

$R^2$ = 0.522

$R^2$ = 0.880

$R^2$ = 0.424

$R^2$ = 0.924

Slide15

$R^2$ = 0.338

$R^2$ is still positive even if the correlation is negative.

Slide16

R Squared Benchmarks

Aggregate time series: expect .9+

Cross sections, .5 is good. Sometimes we do much better.

Large survey data sets, .2 is not bad.

$R^2$ = 0.924 in this cross section.

Slide17

Correlation Coefficient

Slide18

Correlations

$r_{xy}$ = 0.723

$r_{xy}$ = -0.402

$r_{xy}$ = +1.000

Slide19

R-Squared is $r_{xy}^2$

R-squared is the square of the correlation between $y_i$ and the predicted $y_i$, which is $a + b x_i$.

The correlation between $y_i$ and $(a + b x_i)$ is the same, apart from sign, as the correlation between $y_i$ and $x_i$.

Therefore, $R^2 = r_{xy}^2$, and a regression with a high $R^2$ predicts $y_i$ well.
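The identity can be checked numerically. The sketch below uses invented data with a negative relationship, computes the correlation directly, and compares its square with R-squared obtained from the residuals of the least squares line. The function names are illustrative only.

```python
import math

def correlation(u, v):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_vv = sum((vi - v_bar) ** 2 for vi in v)
    return s_uv / math.sqrt(s_uu * s_vv)

def r_squared(x, y):
    """R^2 = 1 - (residual SS / total SS) for the least squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst

# Made-up data with a negative relationship.
x = [1, 2, 3, 4, 5, 6]
y = [9.1, 8.2, 6.8, 6.1, 4.7, 4.2]

r = correlation(x, y)
print(f"r_xy = {r:.4f}, r_xy^2 = {r * r:.4f}, R^2 = {r_squared(x, y):.4f}")
```

The correlation is negative here, yet its square and R-squared agree and are positive.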

Slide20

Adjusted R-Squared

As we will discover when we study regression with more than one variable, a researcher can increase $R^2$ just by adding variables to a model, even if those variables do not really explain y or have any real relationship at all.

To have a fit measure that accounts for this, “Adjusted $R^2$” is a number that increases with the correlation, but decreases with the number of variables.
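The slide does not give the formula. The sketch below assumes the usual textbook definition, adjusted $R^2 = 1 - (1 - R^2)(N - 1)/(N - K - 1)$ for N observations and K explanatory variables, and applies it to the sums of squares printed in the Internet Buzz output later in the deck.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k explanatory variables
    (standard textbook formula, assumed here; not stated on the slide)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# From the Internet Buzz output: SS(Regression) = 7913.6,
# SS(Total) = 18665.1, Total DF = 61 so N = 62, one regressor.
r2 = 7913.6 / 18665.1
adj = adjusted_r_squared(r2, n=62, k=1)
print(f"R-Sq      = {100 * r2:.1f}%")
print(f"R-Sq(adj) = {100 * adj:.1f}%")

# Same R^2 with a second (hypothetical) regressor: the penalty grows.
print(f"with k=2  = {100 * adjusted_r_squared(r2, n=62, k=2):.1f}%")
```

With these inputs the first two lines round to the 42.4% and 41.4% reported in that output.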

Slide21

Movie Madness Fit

Slide22

Notes About Adjusted $R^2$

Slide23

Is $R^2$ Large?

Is there really a relationship between x and y?

We cannot be 100% certain.

We can be “statistically certain” (within limits) by examining $R^2$.

F is used for this purpose.

Slide24

The F Ratio

Slide25

Is $R^2$ Large?

Since $F = (N-2)R^2 / (1 - R^2)$, if $R^2$ is “large,” then F will be large.

For a model with one explanatory variable in it, the standard benchmark value for a ‘large’ F is 4.
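To see how quickly F grows with $R^2$, the formula can be evaluated directly. This is a sketch: the $R^2$ values are those shown on the Regression Fits slides plus one small hypothetical value, and N = 62 is assumed for every row purely so the rows are comparable.

```python
def f_from_r_squared(r2, n):
    """F ratio for a regression with one explanatory variable."""
    return (n - 2) * r2 / (1.0 - r2)

# 0.05 is a hypothetical low value; the rest appear on earlier slides.
# N = 62 is an assumption applied to every row.
for r2 in (0.05, 0.338, 0.424, 0.522, 0.880, 0.924):
    f = f_from_r_squared(r2, n=62)
    verdict = "large" if f > 4 else "not large"
    print(f"R^2 = {r2:.3f}  ->  F = {f:7.2f}  ({verdict})")
```

Only the hypothetical 0.05 row falls below the benchmark of 4 at this sample size. With a smaller N the same $R^2$ gives a smaller F.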

Slide26

Movie Madness Fit

$R^2$

F

Slide27

Why Use F and not $R^2$?

When is $R^2$ “large”? We have no benchmarks to decide.

How large is “large”? We have a table for F statistics to determine when F is statistically large: yes or no.

Slide28

F Table

The “critical value” depends on the number of observations. If F is larger than the appropriate value in the table, conclude that there is a “statistically significant” relationship.

There is a huge F table on pages 732-742 of your text. Analysts now use computer programs, not tables like this, to find the critical values of F for their model/data.

$n_2$ is N-2

Slide29

Internet Buzz Regression

Regression Analysis: BoxOffice versus Buzz

The regression equation is

BoxOffice = - 14.4 + 72.7 Buzz

Predictor Coef SE Coef T P

Constant -14.360 5.546 -2.59 0.012

Buzz 72.72 10.94 6.65 0.000

S = 13.3863 R-Sq = 42.4% R-Sq(adj) = 41.4%

Analysis of Variance

Source DF SS MS F P

Regression 1 7913.6 7913.6 44.16 0.000

Residual Error 60 10751.5 179.2

Total 61 18665.1

$n_2$ is N-2
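The summary numbers in this output all follow from the sums of squares and degrees of freedom in the Analysis of Variance table. A short sketch of the arithmetic, using only values printed above (variable names are illustrative):

```python
import math

# Sums of squares and degrees of freedom copied from the ANOVA table.
ss_regression, df_regression = 7913.6, 1
ss_residual, df_residual = 10751.5, 60
ss_total = ss_regression + ss_residual       # 18665.1 in the table

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual      # mean square = SS / DF
f_ratio = ms_regression / ms_residual
s = math.sqrt(ms_residual)                   # standard error of the regression
r_sq = ss_regression / ss_total

print(f"MS residual = {ms_residual:.1f}")
print(f"F           = {f_ratio:.2f}")
print(f"S           = {s:.3f}")
print(f"R-Sq        = {100 * r_sq:.1f}%")
```

The results agree with the printed MS, F, S and R-Sq up to the rounding of the displayed sums of squares.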

Slide30

$135 Million

http://www.nytimes.com/2006/06/19/arts/design/19klim.html?ex=1308369600&en=37eb32381038a749&ei=5088&partner=rssnyt&emc=rss

Klimt, to Ronald Lauder

Slide31

$100 Million … sort of

Stephen Wynn with a Prized Possession, 2007

Slide32

An Enduring Art Mystery

Why do larger paintings command higher prices?

The Persistence of Memory. Salvador Dali, 1931

The Persistence of Statistics. Hildebrand, Ott and Gray, 2005

Graphics show relative sizes of the two works.

Slide33

Slide34

Slide35

Monet in Large and Small

$\log(\text{price}) = a + b \log(\text{surface area}) + e$, with price in dollars

Sale prices of 328 signed Monet paintings

The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.

Slide36

The Data

Note: Using logs in this context. This is common when analyzing financial measurements (e.g., price) and when percentage changes are more interesting than unit changes. (E.g., what is the % premium when the painting is 10% larger?)

Slide37

Monet Regression: There seems to be a regression. Is there a theory?

Slide38

Conclusions about F

$R^2$ answers the question of how well the model fits the data.

F answers the question of whether there is a statistically valid fit (as opposed to no fit).

What remains is the question of whether there is a valid relationship, i.e., whether $\beta$ is different from zero.

Slide39

The Regression Slope

The model is $y_i = \alpha + \beta x_i + \varepsilon_i$.

The “relationship” depends on $\beta$.

If $\beta$ equals zero, there is no relationship.

The least squares slope, b, is the estimate of $\beta$ based on the sample. It is a statistic based on a random sample. We cannot be sure it equals the true $\beta$.

To accommodate this view, we form a range of uncertainty around b, i.e., a confidence interval.

Slide40

Uncertainty About the Regression Slope

Hypothetical Regression

Fuel Bill vs. Number of Rooms

The regression equation is

Fuel Bill = -252 + 136 Number of Rooms

Predictor Coef SE Coef T P

Constant -251.9 44.88 -5.20 0.000

Rooms 136.2 7.09 19.9 0.000

S = 144.456 R-Sq = 72.2% R-Sq(adj) = 72.0%

The Coef value for Rooms (136.2) is b, the estimate of $\beta$.

The SE Coef value for Rooms (7.09) is the “Standard Error” (SE), the measure of uncertainty about the true value.

The “range of uncertainty” is $b \pm 2\,SE(b)$. (Actually 1.96, but people use 2.)

Slide41

Internet Buzz Regression

Regression Analysis: BoxOffice versus Buzz

The regression equation is

BoxOffice = - 14.4 + 72.7 Buzz

Predictor Coef SE Coef T P

Constant -14.360 5.546 -2.59 0.012

Buzz 72.72 10.94 6.65 0.000

S = 13.3863 R-Sq = 42.4% R-Sq(adj) = 41.4%

Analysis of Variance

Source DF SS MS F P

Regression 1 7913.6 7913.6 44.16 0.000

Residual Error 60 10751.5 179.2

Total 61 18665.1

Range of Uncertainty for b is

72.72 - 1.96(10.94) to 72.72 + 1.96(10.94) = [51.28 to 94.16]
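The interval is simple enough to compute directly. A small sketch using the Buzz coefficient and standard error from the output above; the function name is illustrative.

```python
def range_of_uncertainty(b, se, z=1.96):
    """Return (lower, upper) = b -/+ z * SE(b)."""
    half_width = z * se
    return b - half_width, b + half_width

# Coefficient and standard error for Buzz from the regression output.
lower, upper = range_of_uncertainty(72.72, 10.94)
print(f"[{lower:.2f} to {upper:.2f}]")
print("includes zero:", lower <= 0.0 <= upper)
```

This reproduces the interval on the slide up to rounding in the second decimal place. Zero lies far outside it.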

Slide42

Elasticity in the Monet Regression: b = 1.7246.

This is the elasticity of price with respect to area.

The confidence interval would be $1.7246 \pm 1.96(0.1908)$ = [1.3506 to 2.0986]

The fact that this does not include 1.0 is an important result: prices for Monet paintings are extremely elastic with respect to the area.
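A sketch of the same calculation, extended to the question raised on the data slide about the premium for a painting that is 10% larger. It assumes the log-log model holds exactly, so that a proportional change in area of g multiplies the price by $(1+g)^b$. The helper name is hypothetical.

```python
b, se = 1.7246, 0.1908     # elasticity and its standard error, from the slide
lower, upper = b - 1.96 * se, b + 1.96 * se
print(f"confidence interval: [{lower:.4f} to {upper:.4f}]")
print("includes 1.0:", lower <= 1.0 <= upper)

def price_premium(elasticity, area_increase=0.10):
    """Proportional price change implied by a log-log model for a given
    proportional increase in area, all else fixed."""
    return (1.0 + area_increase) ** elasticity - 1.0

for label, e in (("lower", lower), ("estimate", b), ("upper", upper)):
    print(f"{label:>8}: a 10% larger painting sells for about "
          f"{100 * price_premium(e):.1f}% more")
```

An elasticity of exactly 1.0 would give a 10% premium, so an interval that lies entirely above 1.0 means price rises more than proportionally with area.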

Slide43

Conclusion about b

So, should we conclude the slope is not zero?

Does the range of uncertainty include zero?

No: then you should conclude the slope is not zero.

Yes: then you can’t be very sure that $\beta$ is not zero.

Tying it together. If the range of uncertainty does not include 0.0, then:

The ratio b/SE is larger than 2 in absolute value.

The square of the ratio is larger than 4.

The square of the ratio is F.

F larger than 4 gave the same conclusion.

They are looking at the same thing.
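The chain of statements can be checked with the Internet Buzz numbers. This sketch uses the standard normal distribution from the Python standard library to recover the 1.96 behind the rule of thumb, then compares the squared t ratio with the reported F.

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)    # two-sided 95% normal critical value
print(f"z = {z:.3f}, z^2 = {z * z:.2f}")   # close to the benchmarks 2 and 4

# Buzz slope from the regression output: b = 72.72, SE(b) = 10.94, F = 44.16.
b, se, f_reported = 72.72, 10.94, 44.16
t = b / se
print(f"t = {t:.2f}, t^2 = {t * t:.2f}, reported F = {f_reported}")

# Three ways of asking the same question.
interval_excludes_zero = abs(b) - 2 * se > 0
print(interval_excludes_zero, abs(t) > 2, t * t > 4)
```

The squared ratio differs from the printed F only in the second decimal, because b and SE(b) are displayed with rounding.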

Slide44

Summary

The regression model – theory

Least squares results, a, b, s, $R^2$

The fit of the regression model to the data

ANOVA and $R^2$

The F statistic and $R^2$

Uncertainty about the regression slope