Slide1
External validity
Dr. John Jerrim
UCL Institute of Education
Slide2Aims
To understand what ‘external validity’ is and why it is important….
The difference between sample and population average treatment effects (SATE and PATE)
The assumptions under which estimates of SATE = PATE
How you may investigate external validity of your RCT further….
Methods of ‘correcting’ SATE estimates to get closer to PATE…
Gain experience of considering external validity of trial data using Stata
Slide3Name of the game = PATE
Why do we do evaluations (RCTs)?
- Work out whether it is worth rolling out the policy/intervention more widely
Therefore, what do we want to know?
- Likely effect in the population we want to roll out to…
- Hence want an estimate of PATE……
- …Average treatment effect in the population of interest
External validity
- Extent we can generalise results from RCT to population of interest….
- The extent to which we believe we have estimated PATE
- I.E. Got what we really want….
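The two quantities can be written down in potential-outcomes notation. The notation is an assumption (the slides do not fix one): Y_i(1) and Y_i(0) are the outcomes of unit i with and without treatment, S is the set of n units in the trial, and P is the set of N units in the population of interest.

```latex
\tau_i = Y_i(1) - Y_i(0), \qquad
\mathrm{SATE} = \frac{1}{n} \sum_{i \in S} \tau_i, \qquad
\mathrm{PATE} = \frac{1}{N} \sum_{i \in P} \tau_i
```

External validity is then a question of how far the average over S can stand in for the average over P.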
Slide4Recall: The best way to estimate PATE….
Step 1: Population of interest
Step 2: Sampling frame for population
Step 3: Recruit random sample
Step 4: Randomly assign to treatment group and control group
Step 5: 100% follow-up in both groups
Slide5The problem
RCTs don’t typically randomly recruit into the study (step 3 doesn’t happen)…..
Often not a good sampling frame (step 2 doesn’t happen)……
Population of interest often loosely defined (step 1 doesn’t happen)….
The result
- Non-random convenience samples…..
- Different from population in observed (and unobserved) ways…
- Testing treatment on a ‘strange’ group?
- E.g. Particularly adventurous? Enthusiastic?
Concern: Will our results really generalise!?
Slide6The problem
A. Bradford Hill (1966) ‘Reflections on the Controlled Trial’ (The Heberden Oration), Annals of the Rheumatic Diseases
Slide7RCTs = SATE
What most RCTs really give you is SATE (Sample Average Treatment Effect)….
Effectiveness of treatment for your ‘sample’
SATE is a useful piece of information
- Does treatment work even when people are willing / enthusiastic about it?
- If no, then would seem even less likely to work in population……
- ….where some individuals less willing / enthusiastic about change
- Likely to be important in context of social interventions……
But, at the end of the day, SATE isn’t what we really want!
- SATE likely to give upper bound for PATE?
Slide8Other issues….
- Standard errors, p-values, confidence intervals, power calculations….
- Fundamental in RCT analysis
- But rely upon an assumption of random / probabilistic sampling…..
How do we estimate sampling variation?
- Not clear!
- Such statistics do not technically exist
- Hard to judge uncertainty in estimates due to having a ‘sample’
- Hard-line view: should not even report them.
Big limitation – Many of our standard tools no longer technically appropriate / valid..
Slide9Why does this matter?
Case study: The Polio (Salk) vaccine RCT
Slide10‘Rate’ refers to polio rate per 100,000 population
Note
Polio rate is much lower in ‘no consent’ group than the control group…..
This is despite neither group getting the vaccine….
Why?
Non-random selection into the trial!
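Children from better-off families were more at risk of polio (less early exposure, so less natural immunity)…..
….and better-off parents were more likely to consent to take part!
Poorer families: children less at risk of polio, and parents less likely to consent!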
Slide11How much is external validity considered in social science RCTs?
Slide12When will our estimate of SATE = an estimate of PATE?
Slide13When will SATE = PATE
Random recruitment into trial (as noted)
- Ensures, in expectation, that characteristics of sample = characteristics of population
Assumption of homogeneous treatment effect
- You may recruit more of one type of individual than another…..
- But if this characteristic does not interact with the treatment….
- Then……. so what!!
- Won’t result in any difference between SATE and PATE….
If either condition holds, it is enough to mean SATE = PATE
Slide14When will SATE ≠ PATE
1. When treatment effects are heterogeneous
- E.g. Intervention more effective for those enthusiastic about it….
- E.g. Intervention more effective for motivated individuals….
And
2. When we disproportionately recruit such groups into the RCT
- E.g. People who believe treatment more effective more likely to take part
- E.g. Highly motivated individuals more likely to take part
Both conditions have to hold for SATE ≠ PATE!
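A small simulation can illustrate why both conditions are needed. This is a sketch under made-up assumptions: the 30% share of motivated people, the effect sizes (0.4 and 0.1) and the recruitment probabilities (0.5 and 0.1) are all hypothetical, and the function `simulate` is not part of the lecture. It uses the true individual effects directly, so it shows the true SATE and PATE rather than estimates of them.

```python
import random
import statistics

def simulate(heterogeneous, selective, n_pop=20000, seed=1):
    """Return (SATE, PATE) for a hypothetical population.

    All numbers are illustrative assumptions, not trial data.
    """
    rng = random.Random(seed)
    effects_pop, effects_sample = [], []
    for _ in range(n_pop):
        motivated = rng.random() < 0.3
        # Condition 1: effect is larger for motivated people (if heterogeneous).
        effect = 0.4 if (heterogeneous and motivated) else 0.1
        # Condition 2: motivated people are over-recruited (if selective).
        p_join = 0.5 if (selective and motivated) else 0.1
        effects_pop.append(effect)
        if rng.random() < p_join:
            effects_sample.append(effect)
    return statistics.mean(effects_sample), statistics.mean(effects_pop)

only_selective = simulate(heterogeneous=False, selective=True)
only_heterogeneous = simulate(heterogeneous=True, selective=False)
both = simulate(heterogeneous=True, selective=True)
for label, (sate, pate) in [("selective recruitment only", only_selective),
                            ("heterogeneous effects only", only_heterogeneous),
                            ("both", both)]:
    print(f"{label}: SATE = {sate:.3f}, PATE = {pate:.3f}")
```

Only the run with both conditions switched on should show a clear gap between SATE and PATE; with heterogeneous effects alone the two differ only by sampling noise.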
Slide15Think about this in the context of social science vs medicine
Medicine (e.g. a new oral drug)
Those who believe it will be effective are probably more likely to enter the RCT….
But as long as the person takes the tablet when meant to….
….hard to see the treatment effect varying greatly by motivation (biological reaction)
Hence SATE approximating PATE may be credible?
Social science (e.g. teaching children how to play chess)
Those who believe it will be effective are probably more likely to enter the RCT….
Seems very likely effectiveness will depend upon motivation / willingness to try new things / belief that it will work
Hence highly unlikely SATE = PATE…….
Slide16How to further consider external validity of your RCT?
(Assuming random sampling is not possible)
Slide17
1. Compare sample to population (in terms of observables)
- RCT sample and population must ‘differ’ for SATE ≠ PATE….
- Therefore compare sample and population in terms of observables….
Closer correspondence between the population and sample…..
….More credible argument that SATE = PATE
Why?
If sample looks like population in terms of observables…..
…then any heterogeneous effect of treatment by these variables will not matter!
Limitation
Only as credible as those characteristics we can observe in both sample and population
Important things we can’t observe in population data (e.g. motivation)
Slide18Example: Maths Mastery…..
Not a random sample of schools
Compare pupils in trial to those in England state school population using NPD.
Trial has:
More FSM
Fewer white
More black & Asian
More low achievers (figures not shown)
Slide19
2. Investigate possible heterogeneity (observables)
SATE and PATE will only differ if treatment effect heterogeneous…..
….has more impact on some sub-groups than others.
As part of RCTs, typically collect additional baseline information...
- Baseline test scores
- Demographics (gender, ethnicity, measure of poverty)
Can do sub-group analysis by these variables……
….or can include an interaction term in our statistical model.
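As a rough sketch of the sub-group approach: with a binary treatment and a binary sub-group indicator, the interaction coefficient in a fully interacted model equals the difference between the two sub-group treatment effects. The function name `subgroup_effects`, the `fsm` indicator and the simulated effect sizes below are hypothetical choices for illustration only.

```python
import random
import statistics

def subgroup_effects(rows):
    """rows: list of (treated, subgroup, outcome), treated/subgroup in {0, 1}.

    Returns (effect in subgroup 0, effect in subgroup 1, interaction),
    where each effect is a treatment-control difference in mean outcomes.
    """
    def diff_in_means(group):
        treated = [y for t, s, y in rows if s == group and t == 1]
        control = [y for t, s, y in rows if s == group and t == 0]
        return statistics.mean(treated) - statistics.mean(control)

    effect_0, effect_1 = diff_in_means(0), diff_in_means(1)
    return effect_0, effect_1, effect_1 - effect_0

# Simulated trial (assumed numbers): true effect 0.3 for FSM pupils, 0.1 otherwise.
rng = random.Random(0)
rows = []
for _ in range(4000):
    fsm = int(rng.random() < 0.35)
    treated = int(rng.random() < 0.5)
    outcome = rng.gauss(0, 1) + (0.3 if fsm else 0.1) * treated
    rows.append((treated, fsm, outcome))

print(subgroup_effects(rows))
```

Re-running with a different seed gives a feel for how noisy the interaction estimate is compared with the main effect.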
Slide20
2. Investigate possible heterogeneity (observables)
Limitations
Observable characteristics only…..
…unobservable heterogeneous treatment effects likely to be important
Statistical power….
Often limited in our ability to detect even main effects…
We have a lot less power to detect interactions / sub-group effects…
Most investigations of interactions will probably be statistically ‘insignificant’…
…but this doesn’t mean they don’t exist!
Slide21
3. Model selection into the RCT…..
Can think of non-random participation into RCT as a ‘selection problem’…..
E.g. Just like we think about survey non-response…
Can therefore model the selection process (in terms of observables)…..
… and create Inverse Probability Weights to apply in analysis
If we can accurately model the ‘selection process’ (in terms of observables)….
….we can ‘correct’ our SATE estimates into PATE estimates
Limitations
Requires rich population level data
Correction in terms of observables only
Slide22Creating and applying IPW in RCTs
Stage 1: Estimate selection model by probit/logit
- Every observation in population of interest included in model
- Response. 0 = not in trial; 1 = in trial.
Stage 2: Create weights
- Create predicted probability of being in trial for each observation
- Create IPW as the reciprocal of this probability
Stage 3: Estimate ‘adjusted’ SATE
- Standard methods covered in previous lectures….
- Just now apply the IPW in analysis
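The three stages can be sketched in a few lines. This is a simplified illustration, not the lecture's Stata workflow: Stage 1 uses cell proportions for a single categorical covariate `x` (a logit that is saturated in that covariate gives the same fitted probabilities), and Stage 3 is a weighted difference in means. All function names and the toy data are hypothetical.

```python
from collections import defaultdict

def selection_probabilities(population):
    """Stage 1: estimate P(in trial | x) from cell proportions.

    population: list of (x, in_trial) pairs covering every unit in the
    population of interest, with in_trial coded 0 or 1.
    """
    counts = defaultdict(lambda: [0, 0])   # x -> [n in trial, n total]
    for x, in_trial in population:
        counts[x][0] += in_trial
        counts[x][1] += 1
    return {x: n_in / n_all for x, (n_in, n_all) in counts.items()}

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def ipw_effect(trial, p_select):
    """Stages 2 and 3: weight each trial unit by 1 / P(in trial | x),
    then take the weighted treatment-control difference in mean outcomes.

    trial: list of (x, treated, outcome) for units in the trial.
    """
    arms = {0: ([], []), 1: ([], [])}      # treated -> (outcomes, weights)
    for x, treated, y in trial:
        arms[treated][0].append(y)
        arms[treated][1].append(1.0 / p_select[x])
    return weighted_mean(*arms[1]) - weighted_mean(*arms[0])

# Hypothetical population: x = 1 units are over-represented in the trial.
population = [(1, 1)] * 4 + [(1, 0)] * 6 + [(0, 1)] * 2 + [(0, 0)] * 18
# Trial units as (x, treated, outcome); here the treatment only works when x = 1.
trial = [(1, 1, 1.0), (1, 1, 1.0), (1, 0, 0.0), (1, 0, 0.0),
         (0, 1, 0.0), (0, 0, 0.0)]

p_select = selection_probabilities(population)
unweighted = ipw_effect(trial, {0: 1.0, 1: 1.0})   # equal weights: plain difference in means
adjusted = ipw_effect(trial, p_select)
print(unweighted, adjusted)
```

In this toy example one third of the population has x = 1, so the weighted estimate moves from the trial's own average towards the population average. The correction only works for characteristics that enter the selection model.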
Slide23
4. Consider an observational study as well?
RCTs
High internal validity
Low external validity
Observational data
Low internal validity
High external validity
RCTs and observational studies have different +ives and –ives
Use both to complement each other
Observational study
- Make sure it covers your population of interest (plus high response rate)
- As plentiful controls as possible (longitudinal data = even better)
Consistent evidence: You are probably in business!
Slide24Imai (2008): Pros and cons of different research designs
http://gking.harvard.edu/files/matchse.pdf
Slide25
5. All else fails – be honest!!
RCTs are often made out to be the ‘gold standard’
They have many benefits – but also limitations….
These limitations (external validity in particular) need to be more widely recognised…
Common to say generalisability / external validity ‘limited’….
But maybe should do more?
E.g. Recognise that an observational study may help overcome some weaknesses…
Slide26Case study: Chess in Schools
Slide27The intervention
→ Children to receive 30 hours of chess lessons during one academic year (year 5)
→ Follows a fully developed curriculum by the Chess in Schools and Communities (CSC) team
→ Chess lessons likely to be accompanied by an after school chess club
RQ. Does teaching primary school children how to play chess lead to an improvement in their educational attainment?
Slide28Step 1. Defined the population using administrative data…..
→ 11 LEAs (geographic areas) in England purposefully selected
→ Year 5 (age 9 / 10) children in 2013 / 14 academic year (born Sep 2003 – Aug 2004)
→ Disadvantaged schools (> 37% of KS 2 pupils eligible for FSM in the last six years)
→ Total of 442 schools on population list (sampling frame)
Slide29Step 2. ‘Randomly sample’ from these 442 schools…..
→ Could not do / achieve this…….
→ Ended up recruiting 100 out of the 442 schools…..
→ In other words, like having a 23% response rate to a survey (100 / 442)
= Not great! (Though better than what most people do!)
→ Attempt to get some sense of ‘external validity’ by comparing characteristics of pupils in trial to the population as a whole!
Slide30How did the sample compare to the study population?
Representativity? Pretty good!!

                      | Trial participants | Population of interest
Key Stage 1 maths     |                    |
  Level 1             | 12%                | 12%
  Level 2A            | 24%                | 24%
  Level 2B            | 31%                | 30%
  Level 2C            | 19%                | 20%
  Level 3             | 12%                | 11%
  Missing             | 2%                 | 3%
KS1 average points    | -0.280             | -0.289
School n              | 100                | 442
Pupil n               | 3,775              | 16,397

                      | Trial participants | Population of interest
Eligible for FSM      |                    |
  No                  | 66%                | 65%
  Yes                 | 35%                | 35%
Gender                |                    |
  Female              | 50%                | 50%
  Male                | 50%                | 51%
Language Group        |                    |
  English             | 65%                | 63%
  Other               | 34%                | 37%
Ethnic Group          |                    |
  White               | 52%                | 54%
  Black               | 22%                | 19%
  Asian               | 12%                | 14%
  Mixed               | 8%                 | 7%
  Other               | 4%                 | 4%
  Unclassified        | 1%                 | 1%
  Chinese             | 0%                 | 1%
School n              | 100                | 442
Pupil n               | 4,003              | 16,397
Slide31Sample compared to England as a whole?
Representative? NO!
Can’t generalise results to country as a whole.

                      | Trial participants | England as a whole
Key Stage 1 maths     |                    |
  Level 1             | 12%                | 8%
  Level 2A            | 24%                | 27%
  Level 2B            | 31%                | 57%
  Level 2C            | 19%                | 15%
  Level 3             | 12%                | 20%
  Missing             | 2%                 | 2%
KS1 average points    | -0.280             | 0.00
School n              | 100                |
Pupil n               | 3,775              | 570,344

                      | Trial participants | England as a whole
Eligible for FSM      |                    |
  No                  | 66%                | 82%
  Yes                 | 35%                | 18%
Gender                |                    |
  Female              | 50%                | 49%
  Male                | 50%                | 51%
Language Group        |                    |
  English             | 65%                | 82%
  Other               | 34%                | 18%
Ethnic Group          |                    |
  White               | 52%                | 77%
  Black               | 22%                | 5%
  Asian               | 12%                | 10%
  Mixed               | 8%                 | 5%
  Other               | 4%                 | 2%
  Unclassified        | 1%                 | 1%
  Chinese             | 0%                 | 0%
School n              | 100                |
Pupil n               | 4,003              | 570,344
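One simple way to summarise such comparisons is the percentage-point gap between the trial and each comparison group. The sketch below copies a handful of percentages from the two comparisons above (trial, study population, England); the choice of rows and the helper `max_gap` are illustrative assumptions, not part of the original analysis.

```python
# (trial %, study population %, England %) copied from the comparison tables.
characteristics = {
    "FSM eligible":      (35, 35, 18),
    "English language":  (65, 63, 82),
    "White":             (52, 54, 77),
    "Black":             (22, 19, 5),
    "KS1 maths Level 3": (12, 11, 20),
}

def max_gap(column):
    """Largest absolute percentage-point gap between the trial (column 0)
    and the chosen comparison column (1 = study population, 2 = England)."""
    return max(abs(vals[0] - vals[column]) for vals in characteristics.values())

for name, (trial, study_pop, england) in characteristics.items():
    print(f"{name:18s} trial vs study population: {trial - study_pop:+d} pp   "
          f"trial vs England: {trial - england:+d} pp")

print("Largest gap vs study population:", max_gap(1), "pp")
print("Largest gap vs England:", max_gap(2), "pp")
```

Small gaps on observables support, but cannot prove, the claim that SATE is close to PATE, because unobserved characteristics such as motivation are not in the comparison.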
Slide32External validity vs internal validity for some other evaluation methods…..
Slide33Before and after. Example of seatbelts.
Terrible internal validity…….
‘Perfect’ external validity……
I.e. This is actually what happened in our population of interest!
From this evidence, are we convinced that the introduction of seatbelts saved lives?
Lesson
Let’s not abandon common sense!
Slide34Before and after. Estimated ‘counterfactual’…..
[Figure: observed values plotted against the estimated ‘counterfactual’]
Slide35Question
Think about this example of seatbelts.
If an RCT was run instead, would the evidence of this being a ‘good’ policy be more or less convincing?
(Opinion! There is no ‘correct’ answer!)
Slide36RDD. Example = Tuition programme
Very good internal validity……..
External validity = Very narrow population!
Only those within the space of the discontinuity………..
Slide37Extending the region around discontinuity. Trade-off!
Trade-off
Bad = ↓ internal validity…..
Good = ↑ external validity…..
[Figure labels: ‘Receive treatment’ / ‘Do not receive treatment’]
Slide38Propensity Score Matching
‘Match’ treated individuals to controls who ‘look similar’…
Create propensity score; match individuals with a similar score...
…and throw out any observation that cannot be matched.
Narrow caliper
‘Better match’ = ↑ internal validity
More observations thrown out = ↓ external validity
Altering caliper = Trading off internal and external validity…..
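The trade-off can be seen in a toy matching routine. This sketch assumes propensity scores have already been estimated; the greedy 1:1 nearest-neighbour rule, the function `caliper_match` and the scores themselves are hypothetical choices for illustration, and real matching software offers many other variants.

```python
def caliper_match(treated_scores, control_scores, caliper):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    Returns (treated_score, control_score) pairs whose propensity scores
    differ by at most `caliper`. Treated units with no available control
    inside the caliper are dropped.
    """
    available = sorted(control_scores)
    pairs = []
    for t in sorted(treated_scores):
        if not available:
            break
        nearest = min(available, key=lambda c: abs(c - t))
        if abs(nearest - t) <= caliper:
            pairs.append((t, nearest))
            available.remove(nearest)
    return pairs

# Made-up propensity scores.
treated = [0.30, 0.50, 0.70, 0.90]
control = [0.28, 0.45, 0.60, 0.20]

narrow = caliper_match(treated, control, caliper=0.03)
wide = caliper_match(treated, control, caliper=0.15)
print(len(narrow), "pairs with narrow caliper:", narrow)
print(len(wide), "pairs with wide caliper:", wide)
```

The narrow caliper keeps only the closest pair, while the wide caliper keeps more of the treated group at the cost of looser matches.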
Slide39Instrumental variables (LATE interpretation…..)
LATE = Effect of instrument-induced shift in treatment……
… I.E. Individuals who changed behaviour because of the IV
If IV assumptions met, high internal validity…..
…..but what about external validity?
IV estimate will be instrument specific. Potentially different if you were to use a different IV.
A weird ‘population’ to whom results generalise……
… not really chosen by the researcher a priori
… but determined by the data and who responds to the IV
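For a binary instrument, the standard Wald form of the IV estimand makes the point explicit. The notation is an assumption (it is not in the slides): Z is the instrument, D is treatment received and Y is the outcome.

```latex
\mathrm{LATE} = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[D \mid Z = 1] - E[D \mid Z = 0]}
```

The denominator is the share of individuals whose treatment status is shifted by the instrument, so the estimate is an average over that group only, and a different instrument would typically shift a different group.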
Slide40Conclusions
External validity is important!
Most RCTs give SATE and not PATE
SATE ≠ PATE if there are heterogeneous treatment effects and non-random samples
Methods to look into / account for external validity
- Compare sample to population
- Look for heterogeneous treatment effects
- IPW
- Heckman selection models
- Observational study to complement RCT
Slide41Summary