Slide1
Statistics for Social and Behavioral Sciences
Part IV: Causality
Multivariate Regression
Chapter 11
Prof. Amine Ouazad
Slide2
Movie Buzz
Can we predict the success of a movie?
Avatar (2009): $760,505,847
Titanic (1997): $658,672,302
The Avengers (2012): $623,279,547
The Dark Knight (2008): $533,316,061
Star Wars: Episode I – The Phantom Menace (1999): $474,544,677
Slide3
Data
Box_mil = First run U.S. box office (millions of $)
MPRating = 1 if movie is PG13 or R, 0 if the movie is G or PG
Budget = Production budget (millions of $)
Starpowr = Index of star power
Sequel = 1 if movie is a sequel, 0 if not
Action = 1 if action film, 0 if not
Comedy = 1 if comedy film, 0 if not
Animated = 1 if animated film, 0 if not
Horror = 1 if horror film, 0 if not
Addict = Trailer views at traileraddict.com
Cmngsoon = Message board comments at comingsoon.net
Fandango = Attention at fandango.com
Cntwait3 = Percentage of Fandango votes that can't wait to see
Slide4
Statistics Course Outline
Part I. Introduction and Research Design (Week 1)
  Four Steps of “Thinking Like a Statistician”
  Study Design: Simple Random Sampling, Cluster Sampling, Stratified Sampling
  Biases: Nonresponse bias, Response bias, Sampling bias
Part II. Describing data (Weeks 2-4)
  Sample statistics: Mean, Median, SD, Variance, Percentiles, IQR, Empirical Rule
  Bivariate sample statistics: Correlation, Slope
Part III. Drawing conclusions from data: Inferential Statistics (Weeks 5-9)
  Estimating a parameter using sample statistics. Confidence Interval at 90%, 95%, 99%
  Testing a hypothesis using the CI method and the t method
Part IV. Correlation and Causation: Two Groups, Regression Analysis (Weeks 10-14)
  Multivariate regression now!
Slide5
Coming up
“Comparison of Two Groups”: last week.
“Univariate Regression Analysis”: last Saturday, Section 9.5.
“Association and Causality: Multivariate Regression”: last Saturday, Chapter 10; today and tomorrow, Chapter 11.
“Randomized Experiments and ANOVA”: Wednesday, Chapter 12.
“Robustness Checks and Wrap Up”: last Thursday.
Slide6
Outline
Multivariate regression
Interpreting coefficients
Ceteris Paribus
Standardized Coefficient
Multiple Correlation and R Squared
Next time: Multivariate regression: the F test (Continued)
Slide7
Data: Variables
y    Box = First run U.S. box office ($)
x1   MPRating = 1 if movie is PG13 or R, 0 if the movie is G or PG.
x2   Budget = Production budget ($Mil)
x3   Starpowr = Index of star power
x4   Sequel = 1 if movie is a sequel, 0 if not
x5   Action = 1 if action film, 0 if not
x6   Comedy = 1 if comedy film, 0 if not
x7   Animated = 1 if animated film, 0 if not
x8   Horror = 1 if horror film, 0 if not
x9   Addict = Trailer views at traileraddict.com
x10  Cmngsoon = Message board comments at comingsoon.net
x11  Fandango = Attention at fandango.com
x12  Cntwait3 = Percentage of Fandango votes that can't wait to see.
Slide8
Multivariate Regression
With variables x1, x2, …, x12.
We are trying to get the true impact:
b1 of variable x1 on y.
b2 of variable x2 on y.
…
b12 of variable x12 on y.
True model: y = a + b1 x1 + b2 x2 + b3 x3 + … + b12 x12 + e
We would get those if we had the population of all possible movies.
Slide9
Multivariate Regression
Instead we estimate b1, b2, …, bK (here K = 12) on the sample, by minimizing the sum of the squared prediction errors.
With these estimates we can predict the success of a movie.
Slide10
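The least squares estimation described above (choose a, b1, …, bK to minimize the sum of squared prediction errors, then use them to predict) can be sketched as follows. This is only an illustration under stated assumptions: the six observations are invented, just two explanatory variables are used instead of twelve, and the helper names `solve_linear_system`, `ols`, and `predict` are hypothetical, not part of the course material.

```python
# Illustrative sketch of ordinary least squares with two explanatory variables.
# The data are invented for this example; they are NOT the movie data.

def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix [matrix | rhs], stored as floats.
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Use the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def ols(rows, y):
    """Return [a, b1, ..., bK] minimizing the sum of squared prediction errors."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept a
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Predicted y for one observation: a + b1*x1 + ... + bK*xK."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))


# Hypothetical sample: (budget in $ millions, sequel dummy).
rows = [(10, 0), (20, 0), (30, 1), (40, 0), (50, 1), (60, 1)]
# y is built exactly as 5 + 0.5 * budget + 8 * sequel (no noise),
# so least squares should recover these three numbers.
y = [5 + 0.5 * budget + 8 * sequel for budget, sequel in rows]

coefs = ols(rows, y)
prediction = predict(coefs, (35, 1))
print([round(c, 3) for c in coefs])
print(round(prediction, 3))
```

Because the invented y values contain no error term, the estimates match the coefficients used to build them. With real data the estimates differ from sample to sample, which is why the sampling distribution matters.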
Sampling Distribution of b3
We only observe one coefficient estimate b3, because we have only one sample.
But across all possible samples, the sampling distribution of b3 is bell-shaped.
Hence we can design a test of H0: “b3 = 0”.
Under H0, the t statistic (the estimate b3 divided by its standard error) follows a t distribution with N – (K + 1) degrees of freedom.
Slide11
Hypothesis testing for H0: “b3 = 0”
Reject the null hypothesis at the 95% confidence level if the absolute value of the t statistic is greater than the t score with N – (K + 1) degrees of freedom at 95%.
Equivalently, reject if the p value is lower than 0.05.
There are as many null hypotheses as there are coefficients to estimate. Here, there are
Slide12
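The rejection rule for H0: “b3 = 0” can be written out as a small sketch. Every number here is an assumption made for illustration: the sample size N = 62, the estimate 0.8, and the standard error 0.3 are invented, and the critical value of about 2.01 is the two-sided 5% t score one would look up in a t table for 49 degrees of freedom.

```python
# Sketch of the decision rule for H0: "b3 = 0".
# All numeric inputs are hypothetical; none come from the movie regression.

def t_statistic(estimate, standard_error):
    """t statistic for the null hypothesis that the coefficient is 0."""
    return estimate / standard_error


def reject_null(t_stat, critical_t):
    """Reject H0 when the absolute t statistic exceeds the critical t score."""
    return abs(t_stat) > critical_t


n_movies = 62     # hypothetical sample size N
k_variables = 12  # K explanatory variables, as in the lecture
degrees_of_freedom = n_movies - (k_variables + 1)

b3_hat = 0.8      # hypothetical coefficient estimate
se_b3 = 0.3       # hypothetical standard error
t_stat = t_statistic(b3_hat, se_b3)

# Two-sided 5% critical value for 49 degrees of freedom, taken from a t table.
critical_t = 2.01

decision = reject_null(t_stat, critical_t)
print(degrees_of_freedom, round(t_stat, 2), decision)
```

The same comparison is repeated for each coefficient, each with its own estimate and standard error but the same degrees of freedom.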
Outline
Multivariate regression
Interpreting coefficients
Ceteris Paribus
Standardized Coefficient
Multiple Correlation and R Squared
Next time: Multivariate regression (Continued)
Slide13
Ceteris Paribus = “All other things equal”
“All other things equal”, what is the impact of variable x3 on box office outcome in millions of $?
Increase in star power (variable x3), all other things equal.
Keep x1, x2, x4, x5, x6, x7, x8, x9, x10, x12 constant, and change x3.
Increase in x3 (Star power)
Slide14
Ceteris Paribus = “All other things equal”
“All other things equal”, what is the impact of variable x2 on box office outcome in millions of $?
Increase in budget (variable x2), all other things equal.
Keep x1, x3, x4, x5, x6, x7, x8, x9, x10, x12 constant, and change x2.
Increase in x2 (Budget) by 1 million $
Slide15
Slide16
Reading the coefficients
An increase in budget by 1 million $ leads to a rise in box office of 0.144 million $, all other things equal.
An action movie has, on average and all other things equal, a lower box office outcome, by $12 million.
An increase in the ‘Percentage of Fandango votes that can't wait to see’ (cntwait3) by 1 percentage point leads to a 0.01 * 32.15 = 0.3215 M$ increase in box office outcome.
We multiply by 0.01 (1%) because cntwait3 ranges from 0 to 1.
Slide17
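The coefficient readings above are simple arithmetic: coefficient times the change in the variable, with everything else held fixed. A minimal sketch, using the coefficients quoted on the slide; the action coefficient is entered as -12.0 because the slide only gives the rounded "$12 million lower" figure, and the function name `predicted_change` is made up for this example.

```python
# Coefficients as quoted on the slide (box office measured in $ millions).
coef_budget = 0.144    # per extra $1 million of production budget
coef_action = -12.0    # action dummy; rounded value, the exact estimate is not given
coef_cntwait3 = 32.15  # per unit of cntwait3, which ranges from 0 to 1


def predicted_change(d_budget=0.0, d_action=0.0, d_cntwait3=0.0):
    """Change in predicted box office ($ millions), all other things equal."""
    return (coef_budget * d_budget
            + coef_action * d_action
            + coef_cntwait3 * d_cntwait3)


# One extra million $ of budget.
one_million_more_budget = predicted_change(d_budget=1)
# One extra percentage point of "can't wait" votes: 0.01 on the 0-to-1 scale.
one_point_more_cntwait3 = predicted_change(d_cntwait3=0.01)
# Ten extra million $ of budget scales the effect by ten.
ten_million_more_budget = predicted_change(d_budget=10)

print(round(one_million_more_budget, 4))
print(round(one_point_more_cntwait3, 4))
print(round(ten_million_more_budget, 4))
```

Only one argument is changed in each call, which is the ceteris paribus idea expressed in code.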
Which coefficients are statistically significant?
At 10%  At 5%  At 1%   Variable
❏       ❏      ❏       x1 MPRating = 1 if movie is PG13 or R, 0 if the movie is G or PG.
❏       ❏      ❏       x2 Budget = Production budget ($Mil)
❏       ❏      ❏       x3 Starpowr = Index of star power
❏       ❏      ❏       x4 Sequel = 1 if movie is a sequel, 0 if not
❏       ❏      ❏       x5 Action = 1 if action film, 0 if not
❏       ❏      ❏       x6 Comedy = 1 if comedy film, 0 if not
❏       ❏      ❏       x7 Animated = 1 if animated film, 0 if not
❏       ❏      ❏       x8 Horror = 1 if horror film, 0 if not
❏       ❏      ❏       x9 Addict = Trailer views at traileraddict.com
❏       ❏      ❏       x10 Cmngsoon = Message board comments at comingsoon.net
❏       ❏      ❏       x11 Fandango = Attention at fandango.com
❏       ❏      ❏       x12 Cntwait3 = Percentage of Fandango votes that can't wait to see.
Read the p value! Or compare the t stat to the t score with N – 13 degrees of freedom.
Slide18
With Budget
Slide19
Without Budget
Slide20
Budget and Can’t Wait to See the movie!
Without budget among the variables, the popularity measure cntwait3 has a bigger impact than with budget included.
Diagram nodes: Budget, Cntwait3, Other variables, Box office (box_mil).
We know that Budget and Cntwait3 are correlated (an arrow in one direction, in the other, or both) because including Budget affects the coefficient of Cntwait3.
Slide21
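The point about Budget and Cntwait3 can be reproduced on simulated data: when two explanatory variables are correlated and one is left out, the other's coefficient absorbs part of its effect. The sketch below is not the movie data. The sample size, the "true" coefficients (0.5 for budget, 30 for cntwait3), and the strength of the link between the two variables are all assumptions chosen for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(u, v):
    """Sample covariance of two equally long lists (cov(u, u) is the variance)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Simulated movies: popularity (cntwait3) is constructed to rise with budget.
budget = [rng.uniform(10, 100) for _ in range(n)]
cntwait3 = [0.2 + 0.005 * b + rng.gauss(0, 0.1) for b in budget]
# Assumed "true" model: both budget and cntwait3 raise box office.
box = [0.5 * b + 30 * c + rng.gauss(0, 5) for b, c in zip(budget, cntwait3)]

# Regression WITHOUT budget: slope of box office on cntwait3 alone.
coef_without_budget = cov(cntwait3, box) / cov(cntwait3, cntwait3)

# Regression WITH budget: least squares slope on cntwait3 with two regressors.
s_cc = cov(cntwait3, cntwait3)
s_bb = cov(budget, budget)
s_bc = cov(budget, cntwait3)
s_cy = cov(cntwait3, box)
s_by = cov(budget, box)
coef_with_budget = (s_cy * s_bb - s_by * s_bc) / (s_cc * s_bb - s_bc ** 2)

print(round(coef_without_budget, 2), round(coef_with_budget, 2))
```

In this simulation the coefficient on cntwait3 is much larger when budget is omitted, and it falls back toward the assumed value of 30 once budget is included. If the two variables were uncorrelated, adding budget would leave the cntwait3 coefficient roughly unchanged.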
Outline
Multivariate regression
Interpreting coefficients
Ceteris Paribus
Standardized Coefficient
Multiple Correlation and R Squared
Next time: Multivariate regression (Continued)
Slide22
Standardized Coefficient
We just saw: an increase in budget by 1 million $ leads to a rise in box office of 0.144 million $, all other things equal.
But is 1 million $ big? Is 0.144 million $ big?
Slide23
Standardized Coefficient
“A 1 standard deviation increase in x2 leads to a … % of a standard deviation increase in y.”
Standard deviation of x2 (budget): 42.9.
Standard deviation of y (box office outcome): 17.5.
Coefficient of budget: 0.144.
Fill in the blank.
Slide24
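One way to work the standardized coefficient exercise is to rescale the coefficient by the ratio of the two standard deviations. The sketch below uses only the three numbers given on the slide; the variable names are chosen for this example.

```python
sd_budget = 42.9      # standard deviation of x2 (budget), from the slide
sd_box_office = 17.5  # standard deviation of y (box office outcome), from the slide
coef_budget = 0.144   # coefficient of budget, from the slide

# A 1 SD increase in budget changes predicted y by coef_budget * sd_budget;
# dividing by the SD of y expresses that change in standard deviations of y.
standardized_coef = coef_budget * sd_budget / sd_box_office

print(f"{standardized_coef:.3f} standard deviations of y")
print(f"{100 * standardized_coef:.1f}% of a standard deviation")
```

With these inputs, a 1 standard deviation increase in budget corresponds to roughly 35% of a standard deviation of box office, all other things equal.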
Standardized Coefficient
An increase in budget by 1 million $ leads to a rise in box office of 0.144 million $, all other things equal.
An action movie has, on average and all other things equal, a lower box office outcome, by $12 million.
An increase in the ‘Percentage of Fandango votes that can't wait to see’ (cntwait3) by 1 percentage point leads to a 0.01 * 32.15 = 0.3215 M$ increase in box office outcome.
We multiply by 0.01 (1%) because cntwait3 ranges from 0 to 1.
Slide25
Outline
Multivariate regression
Interpreting coefficients
Ceteris Paribus
Standardized Coefficient
Multiple Correlation and R Squared
Next time: Multivariate regression (Continued)
Slide26
R Squared
How good are we at predicting the success of a movie?
The multiple correlation is 1 if we are absolutely correct in our predictions: ei = 0 for every movie.
The multiple correlation is 0 if we do no better than taking the average: ei = yi – ȳ, where ȳ is the average box office.
Slide27
R squared = ESS/TSS = 13356/18665 = 0.7156
Slide28
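The R squared computation can be checked directly from the two sums of squares quoted on the slide. The sketch assumes ESS denotes the explained sum of squares and TSS the total sum of squares, which is the reading under which ESS/TSS equals R squared; the residual sum of squares and the multiple correlation are derived quantities added for illustration.

```python
import math

ess = 13356.0  # explained sum of squares, from the slide
tss = 18665.0  # total sum of squares, from the slide

r_squared = ess / tss
# Residual sum of squares: the part of the variation left unexplained.
rss = tss - ess
# The multiple correlation is the square root of R squared.
multiple_correlation = math.sqrt(r_squared)

print(round(r_squared, 4))
print(rss)
print(round(multiple_correlation, 4))
```

Under this reading, about 72% of the variation in box office across movies is accounted for by the twelve variables.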
Wrap up
We can use a number of variables to explain a dependent variable.
Multiple regression accounts for multiple causes.
The coefficients minimize the sum of the squared residuals.
Understand the t test and the p value.
The coefficients should be understood “all other things equal” or “ceteris paribus”.
The standardized coefficients express effects in terms of standard deviations.
The R squared, between 0 and 100%, measures how accurate our predictions are.
Slide29
Coming up:
Schedule for next week: Chapter on “Association and Causality”, and “Multivariate Regression”.
Make sure you come to sessions and recitations.
Sunday
Monday: Multivariate Regression
Tuesday: Multivariate Regression, the F test
Wednesday: Randomized Experiments and ANOVA
Thursday: Wrap up
Recitation
Evening session, 7.30pm, West Administration 002
Usual class, 12.45pm, usual room
Evening session, 7.30pm, West Administration 001
Usual class, 12.45pm, usual room