Finding help
Stata manuals

You have all these as pdf! Check the folder /Stata12/docs
ASSUMPTION CHECKING AND OTHER NUISANCES

In regression analysis with Stata
In logistic regression analysis with Stata

NOTE: THIS WILL BE EASIER IN Stata THAN IT WAS IN SPSS
Assumption checking in “normal” multiple regression with Stata
Assumptions in regression analysis

- No multi-collinearity
- All relevant predictor variables included
- Homoscedasticity: all residuals are from a distribution with the same variance
- Linearity: the “true” model should be linear
- Independent errors: having information about the value of a residual should not give you information about the value of other residuals
- Errors are distributed normally
6
FIRST THE ONE THAT LEADS TO
NOTHING NEW IN STATA
(NOTE: SLIDE TAKEN LITERALLY FROM MMBR)
Independent
errors
:
having
information
about
the value of a residual
should not give you information
about
the
value
of
other
residuals
Detect
:
ask
yourself
whether
it
is
likely
that
knowledge
about
one
residual
would
tell
you
something
about
the
value
of
another
residual
.
Typical
cases:
-
repeated
measures
-
clustered
observations
(
people
within
firms
/
pupils
within
schools)
Consequence
s
: as
for
heteroscedasticity
Usually
,
your
confidence
intervals
are
estimated
too
small
(
think
about
why
that
is!).
Cure
:
use
multi
-level analyses
part 2 of
this
courseSlide7
The rest, in Stata:

Example: the Stata “auto.dta” data set

sysuse auto
corr              (correlation)
vif               (variance inflation factors)
ovtest            (omitted variable test)
hettest           (heteroskedasticity test)
predict e, resid
swilk e           (test for normality)
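The commands above can be run in one sequence. Note that vif, ovtest and hettest are postestimation commands, so a regression must come first; the model below (price on mpg and weight) is only an illustrative choice.

```stata
* A minimal diagnostic run on the auto data (a sketch).
sysuse auto, clear
regress price mpg weight
vif                  // variance inflation factors
ovtest               // Ramsey RESET omitted-variable test
hettest              // Breusch-Pagan test for heteroskedasticity
predict e, resid     // save the residuals
swilk e              // Shapiro-Wilk test for normality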
Finding the commands

Type “help regress” or “help regress postestimation” and you will find most of them (and more) there
Multi-collinearity

A strong correlation between two or more of your predictor variables.

You don’t want it, because:
- It is more difficult to get higher R’s
- The importance of predictors can be difficult to establish (b-hats tend to go to zero)
- The estimates for b-hats are unstable under slightly different regression attempts (“bouncing betas”)

Detect:
- Look at the correlation matrix of predictor variables
- Calculate VIF factors while running regression

Cure: delete variables so that multi-collinearity disappears, for instance by combining them into a single variable
Stata: calculating the correlation matrix (“corr” or “pwcorr”) and VIF statistics (“vif”)
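Both detection routes can be sketched on the auto data; the predictor choice (mpg, weight, length) is illustrative only.

```stata
* Detecting multi-collinearity (a sketch on the auto data).
sysuse auto, clear
corr mpg weight length           // correlation matrix
pwcorr mpg weight length, sig    // pairwise, with p-values
regress price mpg weight length
vif                              // high VIFs signal collinearity
```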
Misspecification tests
(replaces: all relevant predictor variables included)

Also run “ovtest, rhs” here. Both tests should be non-significant.
Note that there are two ways to interpret “all relevant predictor variables included”
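The two versions of the test correspond to the two interpretations; a sketch, again with an illustrative model:

```stata
* Both RESET-style misspecification tests (a sketch).
sysuse auto, clear
regress price mpg weight
ovtest        // uses powers of the fitted values
ovtest, rhs   // uses powers of the right-hand-side variables
* Both tests should be non-significant.
```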
Homoscedasticity: all residuals are from a distribution with the same variance

Consequences: heteroscedasticity does not necessarily lead to biases in your estimated coefficients (b-hat), but it does lead to biases in the estimate of the width of the confidence interval, and the estimation procedure itself is not efficient.

This can be done in Stata too (check for yourself)
Testing for heteroscedasticity in Stata

Your residuals should have the same variance for all values of Y: hettest
Your residuals should have the same variance for all values of X: hettest, rhs
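A visual companion to the formal tests is a residual-versus-fitted plot, where a funnel shape suggests heteroscedasticity. A sketch, with an illustrative model:

```stata
* Formal and visual heteroskedasticity checks (a sketch).
sysuse auto, clear
regress price mpg weight
rvfplot, yline(0)   // residuals vs fitted values
hettest             // variance constant across fitted Y
hettest, rhs        // variance constant across the X variables
```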
Errors distributed normally

Errors should be distributed normally (just the errors, not the variables themselves!)

Detect: look at the residual plots, test for normality, or save residuals and test directly

Consequences: rule of thumb: if n > 600, no problem. Otherwise confidence intervals are wrong.

Cure: try to fit a better model (or use more difficult ways of modeling instead; ask an expert).
Errors distributed normally

First calculate the errors (after regress):
predict e, resid

Then test for normality:
swilk e
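A graphical check goes well with the formal test: a normal quantile plot of the residuals. A sketch, with an illustrative model on the auto data:

```stata
* Normality of the residuals, tested and plotted (a sketch).
sysuse auto, clear
regress price mpg weight
predict e, resid
swilk e       // Shapiro-Wilk test for normality
qnorm e       // points should hug the diagonal if normal
```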
Assumption checking in multi-level multiple regression with Stata
In multi-level

Test all that you would test for multiple regression – poor man’s test: do this using multiple regression! (e.g. “hettest”)

Add: xttest0 (see last week)

Add (extra): test visually whether the normality assumption holds
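xttest0 is a postestimation command, run after a random-effects model. A sketch; the variable names y, x, and school are hypothetical:

```stata
* Breusch-Pagan LM test for random effects (a sketch).
xtset school     // declare the grouping variable
xtreg y x, re    // random-effects regression
xttest0          // should be significant if groups matter
```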
Note: extra material
(= not on the exam, bonus points if you know how to use it)

tab school, gen(sch_)
reg y sch_2-sch_28
gen coefs = .
forvalues X = 2/28 {
    replace coefs = _b[sch_`X'] if _n == `X'
}
swilk coefs
Assumption checking in logistic regression with Stata

Note: based on http://www.ats.ucla.edu/stat/stata/webbooks/logistic/chapter3/statalog3.htm
Assumptions in logistic regression

- Y is 0/1
- Independence of errors (as in multiple regression)
- No cases where you have complete separation (Stata will try to remove these cases automatically)
- Linearity in the logit (comparable to “the true model should be linear” in multiple regression) – “specification error”
- No multi-collinearity (as in m.r.)

Think!
Think!

What will happen if you try “logit y x1 x2” in this case?
This!

Because all cases with x==1 lead to y==1, the weight of x should be +infinity. Stata therefore rightly disregards these cases.

Do realize that, even though you do not see them in the regression, these are extremely important cases!
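Complete separation is easy to provoke with made-up data. In the sketch below (the numbers are invented for illustration), every observation with x1 == 1 has y == 1, so Stata notes that x1 predicts success perfectly, drops x1, and leaves those observations out of the estimation.

```stata
* Tiny invented data set demonstrating complete separation.
clear
input y x1 x2
0 0 0
0 0 1
1 0 0
1 0 1
1 1 0
1 1 1
1 1 0
1 1 1
end
logit y x1 x2   // x1 is dropped: it predicts success perfectly
```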
(Checking for) multi-collinearity

In regression, we had “vif”. Here we need to download a user-created command: “collin” (try “findit collin” in Stata)
(Checking for) specification error

The equivalent of “ovtest” is the command “linktest”
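A sketch of linktest after a logit; the model (foreign on mpg and weight, from the auto data) is an illustrative choice. The squared-prediction term _hatsq should be non-significant if the model is well specified.

```stata
* Link test for specification error after a logit (a sketch).
sysuse auto, clear
logit foreign mpg weight
linktest        // check the p-value of _hatsq
```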
(Checking for) specification error – part 2
Further things to do:

- Check for useful transformations of variables, and interaction effects
- Check for outliers / influential cases:
  1) using a plot of stdres (against n) and dbeta (against n)
  2) using a plot of ldfbeta’s (against n)
  3) using regress and diag (but don’t tell anyone that I suggested this)
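The first of these plots can be sketched as follows; the model is again an illustrative choice on the auto data, and "n" is simply the observation number.

```stata
* Standardized residuals and dbeta against observation number
* after a logit (a sketch).
sysuse auto, clear
logit foreign mpg weight
predict stdres, rstandard   // standardized residuals
predict db, dbeta           // Pregibon's dbeta influence measure
gen n = _n                  // observation number
scatter stdres n
scatter db n
```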
Checking for outliers

… check the file auto_outliers.do for this …

Try the taxi tipping data