Slide1
Linking administrative data to RCTs
John Jerrim
(UCL Institute of Education)
Slide2
What do we mean by administrative data?
Central government records
Typically available for every person in the population
Not typically collected for research purposes…..
…rather for ‘record keeping’ / registration purposes
Examples include
Health
Education
Finance (tax)
Criminal records
Slide3
Example: The National Pupil Database (NPD)
One of the most widely used administrative datasets in England!
Data for all state school children in England…
…but excludes (or is missing a lot of information for) private school kids
Test scores at age 5, 7, 11, 14, 16 and 18
Demographic information (e.g. FSM, ethnicity, EAL)
Available from around the mid 1990s
Now routinely linked to education RCTs in England (via EEF)
England is lucky to have this data! Most countries don’t have it!
Slide4
Example. Cluster (school) level data…
‘Admin’ data doesn’t have to be at individual level….
Can have information on an administrative unit that a person attends…
E.g. A school, hospital, police station, prison.
Often more easily accessible than pupil level data
Particularly useful in cluster RCTs (when you are randomising the cluster itself).
Education example
School inspection (OFSTED) ratings…..
School level demographics (e.g. % children eligible for FSM)
School level prior achievement
Slide5
What are the benefits of admin data?
Low cost
Not intrusive to collect from participants
Regularly updated with new information
Often collected in a consistent way across individuals
Low levels of missing data
Low levels of measurement error
Together, this makes administrative data very attractive to include in our analysis of RCTs!
Slide6
What are some of the main challenges we face in RCTs?
(…and how can admin data be used to try and resolve them?)
Slide7
1. Boost statistical power
Slide8
A lack of statistical power…
In education: mostly cluster RCTs
Rather than randomise individuals, randomise whole schools
Issue = ICC (ρ). Low power…
EXAMPLE
Secondary schools (clusters) = 100
200 children per school
ρ = 0.20
20,000 pupils in trial
Minimum detectable effect = 0.25 standard deviations
95% CI = 0 to 0.50 standard deviations
Slide9
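As a rough check on the cluster RCT example above (100 schools, 200 children per school, ρ = 0.20), the sketch below uses a standard textbook approximation for the minimum detectable effect size. The function name, the 50/50 allocation and the multiplier of 2.8 (roughly 80% power with a two-sided 5% test) are assumptions made for illustration; the slides do not say which formula was used.

```python
import math

def mdes_cluster_rct(n_clusters, cluster_size, icc, multiplier=2.8, treated_share=0.5):
    """Approximate minimum detectable effect size (SD units) for a two-arm cluster RCT.

    Assumption: equal-sized clusters, no covariates, and a multiplier of about 2.8
    (80% power, two-sided 5% significance).
    """
    denom = treated_share * (1 - treated_share) * n_clusters
    between = icc / denom                        # variance driven by school-level differences
    within = (1 - icc) / (denom * cluster_size)  # variance driven by pupils within schools
    return multiplier * math.sqrt(between + within)

# Numbers from the example: 100 schools, 200 pupils each, ICC of 0.20
mdes = mdes_cluster_rct(n_clusters=100, cluster_size=200, icc=0.20)
print(round(mdes, 2))  # about 0.25

# Same 20,000 pupils with no clustering at all (ICC = 0), for comparison
mdes_no_clustering = mdes_cluster_rct(n_clusters=100, cluster_size=200, icc=0.0)
print(round(mdes_no_clustering, 2))  # about 0.04
```

Under these assumptions almost all of the lost power comes from the between-school term, which is why adding more pupils per school helps far less than adding more schools.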
Example: admin data to ↑ power…
One way to ↑ power is to control for variables that are linked to the outcome…
…use NPD for this purpose
EXAMPLE
Maths mastery
Year 7 kids
New way of teaching them maths
Test end of year 7
CONTROL for KS2 MATH scores from NPD
Detectable effect = 0.36 without controls (CI = 0 to 0.72)
Detectable effect = 0.22 with NPD controls (CI = 0 to 0.44)
MASSIVE BOOST TO POWER
Slide10
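To get a feel for the Maths Mastery figures above (0.36 without controls, 0.22 with NPD controls), the sketch below backs out how much outcome variance the KS2 controls would need to explain. It assumes the detectable effect shrinks in proportion to sqrt(1 - R²), which only holds if the controls explain the same share of variance between schools and within schools; that simplification and the function name are mine, not the slides'.

```python
import math

def implied_r_squared(mdes_without, mdes_with):
    """Share of outcome variance the controls must explain, assuming MDES scales with sqrt(1 - R^2)."""
    return 1 - (mdes_with / mdes_without) ** 2

r2 = implied_r_squared(mdes_without=0.36, mdes_with=0.22)
implied_correlation = math.sqrt(r2)

print(round(r2, 2))                   # 0.63
print(round(implied_correlation, 2))  # 0.79
```

Read this way, the reported gain is what you would expect if prior KS2 scores correlate at roughly 0.8 with the end-of-Year-7 test.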
2. Reduce evaluation cost
Slide11
Costly (including to test)…
Imagine it costs £5 to test each child in this trial…
…you have spent £100,000 just on a post-test!
Got to deliver intervention in 50 schools (expensive…)
Many EEF secondary school RCTs cost > £500,000…
…average detectable effect across trials = 0.25
Big ££ for quite wide confidence intervals…
Slide12
Example: administrative data to reduce cost…..
In previous example, could have conducted a pre-test rather than use NPD.
Maths Mastery in 50 schools of 200 children = 10,000 kids
£5 per test. Hence pre-test would have cost a
minimum of £50,000
ADMINISTRATIVE DATA SAVED THIS MONEY….
NPD data is there, ready to use.
- LET'S USE IT!
- Doing a separate pre-test here would have had almost no benefit
Slide13
3. Minimise attrition
Slide14
Attrition…
Schools (and pupils within schools) drop out of the trial…
…particularly when assigned to the control group!
Problems
- Breaks randomisation. Loses key advantage of the RCT
- Lose power
Example (my trial)
- 50 schools. 25 treatment and 25 control
- Treatment follow-up = 23 / 25 schools
- Control follow-up = 9 / 25 schools
Worst of all worlds:
- Bias (selection effects)
- Low power
- High cost
Slide15
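The school follow-up counts in the attrition example above (23 of 25 treatment schools, 9 of 25 control schools) can be turned into rates as follows. This is only an illustrative calculation; the function name is an assumption.

```python
def follow_up_rate(followed_up, randomised):
    """Share of randomised schools that were still in the trial at follow-up."""
    return followed_up / randomised

treatment_rate = follow_up_rate(23, 25)
control_rate = follow_up_rate(9, 25)
differential = treatment_rate - control_rate

print(f"Treatment follow-up: {treatment_rate:.0%}")  # 92%
print(f"Control follow-up:   {control_rate:.0%}")    # 36%
print(f"Differential:        {differential:.0%}")    # 56%
```

The gap between the arms, not just the overall loss, is what undermines the randomisation.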
Example: NPD to reduce attrition
Schools would have had to take time out of maths lessons to conduct this pre-test…
…there would be significant administrative burden on them to conduct the test
This burden is a major reason for control schools dropping out
Administrative data has…
(i) massively reduced the burden on schools
(ii) improved validity of the trial
Slide16
4. Allow long-run follow-up
Slide17
Administrative data for long-run follow-up
Test / follow-up often immediately at the end of the trial…
…often when intervention most effective
BUT we are really interested in long-run, lasting effects
I.e. is there much point in ↑ age 11 test scores if kids don’t do any better at age 16?
Ideally want short, medium and long-term follow-up…
…but this again ↑ $$$
However, administrative data may include long-run follow-up information about individuals…
Slide18
5. Insight into external validity
Slide19
External validity
Most RCTs recruit participants via convenience sampling…
…not from a well-defined population
How “weird” is our sample of trial participants?
Have mainly rich pupils?
Have only high-performing schools?
How far can we generalise results?
BIG ISSUE:
- Will we still get an effect when we scale up / roll-out?
BUT, FRANKLY, OFTEN IGNORED IN RCTs
Slide20
NPD for external validity / generalisability
Most RCTs based upon non-random samples of willing participants.
Big issue. But often glossed over!
Without random samples, how do we know if study results generalise to a wider (target) population?
Admin data gives us some handle on this…
As we have data for (almost) every child/person in the country…
…we can examine how similar trial participants are to target population in terms of observable characteristics
Slide21
6. Additional characteristics in dataset
Slide22
Additional characteristics
Administrative records may include information we did not collect as part of our RCT.....
…. because it was too difficult to collect
…. because too costly
…. because we forgot!?
These are additional variables we can use in our analysis of our trial.
E.g. Additional variables we can perform ‘balance checks’ with….
E.g. Additional variables to examine heterogeneous effects…
Slide23
Example: Maths Mastery heterogeneous effects….
Linked in cluster (school) level administrative data on OFSTED (inspection) ratings…
Found big heterogeneity by OFSTED rating!
ONLY POSSIBLE AFTER WE LINKED TO ADMIN DATA!!!
Slide24
7. Potential for clever designs…
See this paper:
Improving recruitment of older people to clinical trials: use of the cohort multiple randomised controlled trial design.
Age Ageing 2015 doi:10.1093/ageing/afv044
Slide25
Points to note
1. You never make any contact with control group!
2. If everyone you ask says yes – then you have a perfect RCT! (Both internal & external validity)
3. Statistical power very high….
4. ‘Business as usual control’ (by necessity)…
5. Non-compliance = People saying no when you approach them = the issue (ITT vs CA-ITT analysis)
Step 1: Admin data on population
Step 2 (treatment group): Randomly ask people if they want to receive treatment
Step 2 (control group): Individuals not approached
Step 3: Follow up in admin data (both groups)
Slide26
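The three steps of the cohort multiple RCT design above can be sketched as a small simulation. Everything here is hypothetical: the function name, the population of 1,000 IDs, the 200 people approached and the 70% acceptance probability are illustrative assumptions, not figures from the paper or the slides.

```python
import random

def cohort_multiple_rct(population_ids, n_to_approach, accept_probability, seed=0):
    """Sketch of a cohort multiple RCT built on an administrative population list."""
    rng = random.Random(seed)
    # Step 2 (treatment group): randomly choose who is offered the treatment
    approached = set(rng.sample(sorted(population_ids), n_to_approach))
    # Step 2 (control group): everyone else is never contacted
    control = [pid for pid in population_ids if pid not in approached]
    # Non-compliance: some of those approached say no
    accepted = {pid for pid in approached if rng.random() < accept_probability}
    return {"approached": approached, "accepted": accepted, "control": control}

# Step 1: the population comes from administrative records (hypothetical IDs here)
population = list(range(1000))
design = cohort_multiple_rct(population, n_to_approach=200, accept_probability=0.7)

# Step 3: outcomes for both groups would be read from admin data. An intention-to-treat
# comparison uses everyone approached, whether or not they accepted.
print(len(design["approached"]), len(design["accepted"]), len(design["control"]))
```

Because the control group is simply "everyone not approached", it can be very large at no extra cost, which is where the high statistical power comes from.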
Issues with linking to administrative data…
Slide27
Sensitive data = high levels of data security…
Most administrative data is potentially identifiable… you know who the person is!!
Some data probably won’t be given to you (e.g. names)…
You may not be the one doing the linking…
…it may be left up to others (who may not do this correctly!)
When you have access to linked data, you need to store it securely.
E.g. UCL Safe Data Haven.
https://www.ucl.ac.uk/isd/itforslms/services/handling-sens-data/tech-soln
Potential for big penalties if you don’t abide by the rules…
£500,000 fine…
Jail…
Slide28
Ethics and consent….
Participants usually need to give you consent to link their admin data to the RCT…
Opt-in consent = They need to tick the box saying that you can link
Opt-out consent = They only need to contact you if they don’t consent.
Sometimes the person giving consent is not the person themselves…..
Example (education)
Opt-in consent from schools needed to access children’s NPD data…
Parents typically asked about opt-out consent…
Ethical issue with long-term linking?
What happens if your school and parent give consent to link when you are 10….
…..but then you decide you don’t want this at age 18?
…should we have to re-ask for consent once children become adults?
Slide29
Practicalities. How do you link?
1. Unique ID
Variable that uniquely identifies an individual in both data files to be merged
E.g. UPN in NPD; national insurance number in tax records.
2. By name
Individuals named in both data files…
Not as straightforward as it may sound!
Names spelt wrong/differently across files…
Maiden vs married names…
Individuals with same name (e.g. NPD and children called Mohammed in London)
3. By individual characteristics
AKA: ‘fuzzy matching’
Need enough characteristics to identify individuals…
E.g. Gender, Date of Birth, FSM etc. The more, the better!
Slide30
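The three linking routes above can be combined into a simple fallback rule: use the unique ID when it is there, otherwise match on characteristics and use name similarity to screen the candidates. The sketch below is one possible way to do that with the Python standard library. The records, field names, scores and the 0.8 similarity threshold are all made up for illustration and do not reflect the real NPD layout.

```python
from difflib import SequenceMatcher

# Hypothetical trial and admin records (not real data)
trial = [
    {"upn": "A001", "name": "Jon Smith", "dob": "2004-01-15", "gender": "M"},
    {"upn": None, "name": "Sara Khan", "dob": "2003-11-02", "gender": "F"},
    {"upn": None, "name": "Alex Brown", "dob": "2004-03-03", "gender": "M"},
]
admin = [
    {"upn": "A001", "name": "John Smith", "dob": "2004-01-15", "gender": "M", "ks2": 27},
    {"upn": "B002", "name": "Sarah Khan", "dob": "2003-11-02", "gender": "F", "ks2": 31},
]

def name_similarity(a, b):
    """Similarity between two names, from 0 (no overlap) to 1 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_record(trial_row, admin_rows, name_threshold=0.8):
    # 1. Unique ID: the cleanest route when the identifier is present
    if trial_row["upn"] is not None:
        for row in admin_rows:
            if row["upn"] == trial_row["upn"]:
                return row, "unique_id"
    # 3. Individual characteristics: exact match on date of birth and gender
    candidates = [r for r in admin_rows
                  if r["dob"] == trial_row["dob"] and r["gender"] == trial_row["gender"]]
    # 2. Name: tolerate spelling differences, but require a close match
    candidates = [r for r in candidates
                  if name_similarity(r["name"], trial_row["name"]) >= name_threshold]
    # Only accept an unambiguous match
    if len(candidates) == 1:
        return candidates[0], "characteristics+name"
    return None, "unmatched"

results = [link_record(row, admin) for row in trial]
methods = [method for _, method in results]
print(methods)
```

Requiring exactly one candidate is a deliberately cautious choice: with common names and shared birth dates, an ambiguous match is left unlinked rather than guessed.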
Case study. Chess in Schools and Communities.
www.bbc.co.uk/news/education-13343943
Slide31
The intervention
→ Children to receive 30 hours of chess lessons during one academic year (year 5)
→ Follows a fully developed curriculum by the Chess in Schools and Communities (CSC) team
→ Chess lessons likely to be accompanied by an after school chess club
RQ. Does teaching primary school children how to play chess lead to an improvement in their educational attainment?
Slide32
Why is this of interest?
In 30 countries (e.g. Russia) chess is part of the national curriculum
‘Well-known’ that chess influences maths test scores (at least within the chess world!)
“we have scientific support for what we have known all along--chess makes kids smarter!” (Chess Life, November, p. 16 / Johan Christiaen)
Reasonably strong previous evidence
A cluster RCT in Italy produced effect size 0.35
Though caution – external validity!
Slide33
Big previous effect sizes… but poor research designs
Slide34
Why is this of interest?
Intervention is VERY cheap to implement
- If +ive impact, then also likely cost effective!
Fairly serious money invested in the project: £700K ($1m) for this RCT alone
Putting men into primary schools
More information see:
http://www.psmcd.net/otherfiles/BenefitsOfChessInEdScreen2.pdf
Slide35
An interesting feature of this particular RCT is that it used administrative data only!!
Slide36
Step 1. Defined the population using administrative data…..
→ 11 LEAs (geographic areas) in England purposefully selected
→ Year 5 (age 9 / 10) children in 2013 / 14 academic year (born Sep 2003 – Aug 2004)
→ Disadvantaged schools: > 37% of KS2 pupils eligible for FSM in the last six years
→ Total of 450 schools on population list (sampling frame)
Slide37
Step 2. Pre-specified use of administrative data in study protocol…
→ Primary outcome = Key Stage 2 math test score
- National examination in England
- Children will sit 1 year after end of intervention
- Due to sit tests in June 2015 (children age 11)
- ‘Intention to treat’ (ITT) analysis
- Information from NPD (administrative data)
- Should get 100% follow-up (very rare for RCT!)
→ Secondary outcome
- Math sub-domains (e.g. mental arithmetic)
- English & science test scores
Slide38
Step 3: Power calculation
Assumptions
Between school ICC = 0.15
60 children per school on average
Correlation pre / post test (Key Stage 1 and Key Stage 2 test scores) = 0.65
80% power for 95% CI
NOTE: We can base these assumptions on analysis of admin data from previous years! Strong basis!
With 100 schools, we can detect an effect size of 0.20.
Hence recruit 100 schools…
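The power calculation above can be approximated with a standard cluster RCT formula. This is a sketch under stated assumptions, not the trial's actual calculation: the function, the 50/50 allocation, the multiplier of 2.8 (roughly 80% power, two-sided 5% test) and the choice to treat the squared pre/post correlation (0.65² ≈ 0.42) as the variance explained are all mine.

```python
import math

def mdes(n_schools, pupils_per_school, icc, r2_school=0.0, r2_pupil=0.0, multiplier=2.8):
    """Approximate minimum detectable effect size for a 50/50 cluster RCT with a pre-test covariate."""
    denom = 0.25 * n_schools  # P(1-P)J with half of schools treated
    between = icc * (1 - r2_school) / denom
    within = (1 - icc) * (1 - r2_pupil) / (denom * pupils_per_school)
    return multiplier * math.sqrt(between + within)

# Assumptions from the slide: 100 schools, 60 pupils each, ICC 0.15, pre/post correlation 0.65
r2 = 0.65 ** 2
no_pretest = mdes(100, 60, 0.15)
with_pretest = mdes(100, 60, 0.15, r2_school=r2, r2_pupil=r2)

print(round(no_pretest, 2))    # 0.23
print(round(with_pretest, 2))  # 0.17
```

With these inputs the formula gives roughly 0.17 to 0.23, depending on how much of the between-school variance the Key Stage 1 scores are assumed to explain, which brackets the 0.20 quoted above.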
Slide39
Step 4: Selection of the ‘sample’ (and external validity)
→ Chess in Schools given list of all 450 schools
→ Asked to recruit 100 from this list
→ Sampling fraction of around 22%
→ How does our sample of children from the 100 recruited schools…
…compare to the ‘population’ of children from the 450 schools?
→ USE ADMINISTRATIVE DATA TO FIND OUT!!!!
Slide40
Example: Using the NPD to investigate external validity
Chess in Schools
Able to show participants very similar to population of interest (in terms of observables…)
…but very different to population of England as a whole!

Variable              Trial participants   All eligible pupils   England
Key Stage 1 maths
  Level 1             12%                  12%                   8%
  Level 2A            24%                  24%                   27%
  Level 2B            31%                  30%                   27%
  Level 2C            19%                  20%                   15%
  Level 3             12%                  11%                   20%
  Missing             2%                   3%                    2%
Eligible for FSM
  No                  66%                  65%                   82%
  Yes                 35%                  35%                   18%
Gender
  Female              50%                  50%                   49%
  Male                50%                  51%                   51%
Language Group
  English             65%                  63%                   82%
  Other               34%                  37%                   18%
School n              100                  442                   0
Pupil n               3,775                16,397                570,344
Slide41
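One simple way to read the external validity table above is to compute percentage-point gaps between trial participants and each comparison group. The sketch below does this for three rows copied from the table; the variable names and the choice of rows are illustrative only.

```python
# (trial participants %, all eligible pupils %, England %), taken from the table above
table = {
    "KS1 maths Level 3": (12, 11, 20),
    "Eligible for FSM": (35, 35, 18),
    "English language group": (65, 63, 82),
}

gaps = {}
for label, (trial, eligible, england) in table.items():
    gaps[label] = {
        "vs_eligible": trial - eligible,  # gap to the target population
        "vs_england": trial - england,    # gap to the country as a whole
    }
    print(f"{label}: {gaps[label]['vs_eligible']:+d} pts vs eligible, "
          f"{gaps[label]['vs_england']:+d} pts vs England")
```

The gaps to the eligible population are within a couple of points, while the gaps to England are as large as 17 points, which is the pattern the slide describes.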
Step 5: Random assignment
→ Stratify schools into 9 groups
- 3*3 matrix of %FSM and KS2 test scores at school level
→ Randomly assign schools within each of these strata
→ 50 Treatment schools (children taught chess)
→ 50 Control schools (business as usual)
→ All children within these schools taking part in the trial.
→ Q. WAS BALANCE ACHIEVED?
A. USE ADMINISTRATIVE DATA TO FIND OUT!!!!
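A stratified assignment like the one described in Step 5 could look roughly like the sketch below. It assumes schools have already been banded into thirds on %FSM and on KS2 scores (bands 0 to 2), which gives the 3*3 grid. The school list, the seed and the rule for placing the odd school in a stratum are hypothetical choices, not the procedure actually used in the trial.

```python
import random
from collections import defaultdict

def stratified_assignment(schools, seed=2013):
    """Randomly assign schools to arms within strata; schools are (id, fsm_band, ks2_band) tuples."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for school_id, fsm_band, ks2_band in schools:
        strata[(fsm_band, ks2_band)].append(school_id)

    counts = {"treatment": 0, "control": 0}
    assignment = {}
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)  # random order within the stratum
        half = len(members) // 2
        arms = ["treatment"] * half + ["control"] * half
        if len(members) % 2:
            # Odd-sized stratum: give the spare school to whichever arm is currently smaller
            arms.append(min(counts, key=counts.get))
        for school_id, arm in zip(members, arms):
            assignment[school_id] = arm
            counts[arm] += 1
    return assignment

# 100 hypothetical schools spread over the 9 strata
schools = [(i, i % 3, (i // 3) % 3) for i in range(100)]
assignment = stratified_assignment(schools)
n_treatment = sum(arm == "treatment" for arm in assignment.values())
print(n_treatment, len(assignment) - n_treatment)  # 50 50
```

Randomising within strata keeps the two arms similar on school-level %FSM and prior attainment by construction, and the balance checks then confirm this on pupil-level characteristics.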
Slide42
Balance on prior achievement using admin data…..
Balance upon KS1 average points scores….
These are tests children took at age 7… two years before the intervention took place! (But that’s ok!)
Slide43
Balance on other characteristics…
[Figure panels: ethnicity, pre-test, SES]
Slide44
By using NPD, we almost eliminated attrition…
Clever design with NPD data means we can (almost) eliminate drop-out
EXAMPLE: Chess in Schools
- Year 5 children learn how to play chess during one school year
- 50 treatment schools receive chess
- 50 control schools = ‘business as usual’
- Use age 7 (Key Stage 1) as the pre-test scores
- Use age 11 (Key Stage 2) as the post-test scores
Almost no burden on schools (no testing to be done)
Key Stage 2 results for all children
Have test scores even if they move schools…
…should have very little attrition
Slide45
Randomised: school n = 100, pupil n = 4,009
Allocation
Intervention: school n = 50, pupil n = 2,055
Control: school n = 50, pupil n = 1,954
Analysis
Analysed (intervention): school n = 50, pupil n = 1,965
Analysed (control): school n = 50, pupil n = 1,900
Almost zero attrition!
Slide46
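The pupil counts in the flow diagram above imply the following attrition rates. The pairing of the two "Analysed" boxes with the arms is an inference: 1,965 is taken to be the intervention arm because it exceeds the 1,954 pupils randomised to control.

```python
# (pupils randomised, pupils analysed) per arm, from the flow diagram
flow = {
    "intervention": (2055, 1965),
    "control": (1954, 1900),
}

attrition = {arm: 1 - analysed / randomised for arm, (randomised, analysed) in flow.items()}

total_randomised = sum(randomised for randomised, _ in flow.values())
total_analysed = sum(analysed for _, analysed in flow.values())
overall_attrition = 1 - total_analysed / total_randomised

for arm, rate in attrition.items():
    print(f"{arm}: {rate:.1%}")              # intervention: 4.4%, control: 2.8%
print(f"overall: {overall_attrition:.1%}")   # overall: 3.6%
```

No schools were lost in either arm, so all of the attrition here is at pupil level.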
Did it work? Outcomes 1 year post-intervention

Outcome                     Effect size   P-value
Mathematics                 +0.01         0.93
Reading                     -0.06         0.44
Science                     -0.01         0.82
Mental arithmetic           +0.00         0.94
Sub-groups (mathematics)
  Boys                      -0.02         0.77
  Girls                     +0.03         0.73
  FSM children              +0.01         0.95

Answer: NO!
Note: Able to look at heterogeneous effect by FSM due to admin data link…
Slide47
Planned long-run follow-up (using admin data)
Trial conducted in Year 5 (age 9/10). First follow-up at end of Year 6 (age 10/11).
Treatment and control children then move onto secondary school.
Will be able to track these children via their unique pupil number. Hence long-run control:
Do treatment children do better in math GCSE? (Age 16)
Are they more likely to study maths post-16?
Are they more likely to enter a high-status university?
Administrative data means we can answer these questions at little extra cost.
Can answer the question – is there a lasting impact of the treatment?
Slide48
Limitations
Exclusive use of administrative data meant
1. Could only look at educational attainment measures…
…and not look at impact upon ‘non-cognitive’ skills.
2. Outcome measured one year after intervention…
…might there have been an immediate effect?
3. Statistical power would have probably been higher with a specific pre-test…
…but also would have been costly!
4. Balance checks and heterogeneous effects limited to characteristics observable in administrative data only.