Linking administrative data to RCTs
(UCL Institute of Education)
What do we mean by administrative data?
Central government records
Typically available for every person in the population
Not typically collected for research purposes…..
…rather for ‘record keeping’ / registration purposes
Example. The National Pupil Database (NPD)
One of the most widely used administrative datasets in England!
Data for all state school children
…..excludes (or is missing a lot of information for) private school kids
Test scores at age 5, 7, 11, 14, 16 and 18
Demographic information (e.g. FSM, ethnicity, EAL)
Available from around the mid 1990s
Now routinely linked to education RCTs in England (via EEF)
England is lucky to have this data! Most countries don’t have it!
Example. Cluster (school) level data…
‘Admin’ data doesn’t have to be at individual level….
Can have information on an administrative unit that a person attends…
E.g. A school, hospital, police station, prison.
Often more easily accessible than pupil level data
Particularly useful in cluster RCTs (when you are randomising the cluster itself).
School inspection (OFSTED) ratings…..
School level demographics (e.g. % children eligible for FSM)
School level prior achievement
What are the benefits of admin data?
to collect from participants……
with new information…..
collected in a consistent way
Low levels of
Low levels of
Together, this makes administrative data very attractive to include in our analysis of RCTs!
What are some of the main challenges we face in RCTs?.....
(….and how can admin data be used to try and resolve them?)
1. Boost statistical power
A lack of statistical power……
In education: mostly
Rather than randomise individuals….. Randomise whole schools
Issue = ICC (intra-cluster correlation). Low power……
Secondary schools (clusters) = 100
200 children per school
20,000 pupils in trial
Minimum detectable effect =
0.25 standard deviations
95% CI =
0 to 0.50 standard deviations
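The loss of power from randomising whole schools can be sketched with the standard design effect, 1 + (m - 1) × ICC, where m is the number of pupils per school. The ICC of 0.20 used below is an illustrative assumption (the slide does not state a value), and the function names are hypothetical.

```python
def design_effect(cluster_size: int, icc: float) -> float:
    """Variance inflation from cluster randomisation: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_clusters: int, cluster_size: int, icc: float) -> float:
    """Number of independently randomised pupils the clustered sample is 'worth'."""
    total_pupils = n_clusters * cluster_size
    return total_pupils / design_effect(cluster_size, icc)


# 100 secondary schools of 200 pupils each, as in the slide.
# ICC = 0.20 is an assumed, illustrative value.
deff = design_effect(200, 0.20)
n_eff = effective_sample_size(100, 200, 0.20)
print(f"design effect = {deff:.1f}, effective sample size = {n_eff:.0f}")
```

Under that assumed ICC, the 20,000 pupils carry roughly the information of about 490 individually randomised pupils, which is why the detectable effect stays large.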
Example admin data to ↑ power….
One way to ↑ power is to control for stuff that is linked to the outcome….
…use NPD for this purpose
Year 7 kids
New way of teaching them maths
Test end of year 7
CONTROL for KS2 MATH scores from NPD
Detectable effect = 0.36 without control (CI = 0 to 0.72)
= 0.22 with NPD controls (CI = 0 to 0.44)
MASSIVE BOOST TO POWER
2. Reduce evaluation cost
Costly (including to test)….
Imagine it costs
£5 to test
each child in this trial……
…you have spent
just on a
Got to deliver intervention in 50 schools (expensive…..)
Many EEF secondary school RCTs > £500,000
…..average detectable effect
across trials = 0.25
wide confidence intervals
Example: administrative data to reduce cost…..
In previous example, could have conducted a pre-test rather than use NPD.
Maths Mastery in 50 schools of 200 children = 10,000 kids
£5 per test. Hence pre-test would have cost a
minimum of £50,000
ADMINISTRATIVE DATA SAVED THIS MONEY….
NPD data is there, ready to use.
- LET’S USE IT!
- Doing a separate pre-test here would have had almost no benefit
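The cost arithmetic above can be checked in a couple of lines. The figures are those on the slide; the function name is hypothetical.

```python
def pretest_cost(n_schools: int, pupils_per_school: int, cost_per_test: float) -> float:
    """Total cost of testing every pupil in the trial once."""
    return n_schools * pupils_per_school * cost_per_test


# Figures from the slide: 50 schools, 200 children each, £5 per test.
cost = pretest_cost(50, 200, 5.0)
print(f"Pre-test cost: £{cost:,.0f}")
```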
3. Minimise attrition
Schools (and pupils within schools)
of the trial…..
….particularly when assigned to the
. Loses key advantage of the RCT
50 schools. 25 Treatment and 25 control
Treatment follow-up =
23 / 25 schools
Control follow-up =
9 / 25 schools
Worst of all worlds:
Bias (selection effects)
- Low power
- High cost
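A minimal sketch of the follow-up rates in this example, and of the gap between arms (differential attrition). The school counts are those on the slide; the function name is hypothetical.

```python
def follow_up_rate(followed_up: int, randomised: int) -> float:
    """Share of randomised schools still providing outcome data."""
    return followed_up / randomised


treatment_rate = follow_up_rate(23, 25)
control_rate = follow_up_rate(9, 25)
differential = treatment_rate - control_rate  # gap between the two arms

print(f"Treatment follow-up: {treatment_rate:.0%}")
print(f"Control follow-up:   {control_rate:.0%}")
print(f"Differential attrition: {differential * 100:.0f} percentage points")
```

A gap this large between arms means the schools left in the control group are a self-selected subset, so the comparison is no longer protected by randomisation.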
Example: NPD to
Schools would have had to have taken time out of maths lessons to conduct this pre-test…..
…there would be significant administrative burden on them to conduct the test
This burden is a major reason for control schools dropping out
Administrative data has….
(i) Massively reduced the burden on schools
(ii) Improved validity of the trial
4. Allow long-run follow-up
Administrative data for
/ follow-up often immediately at the
end of the trial
...often when intervention
BUT we are really interested in
long-run, lasting effects
I.e. Is there much point in ↑ age 11 test scores if kids don’t do any better at age 16?
….but this again
However, administrative data may include long-run follow-up information about individuals….
5. Insight into external validity
Most RCTs recruit participants via
….not from a well defined population
How “weird” is our sample of trial participants?
Have mainly rich pupils?
Have only high-performing schools?
How far can we
- Will we still get an effect when we scale up / roll-out?
BUT, FRANKLY, OFTEN IGNORED IN RCTs
Most RCTs based upon
of willing participants.
Big issue. But often glossed over!
Without random samples, how do we know if study results generalise to a wider (target) population?
Admin data – give us some handle on this……..
As we have data for (almost) every child/person in the country…….
…….We can examine how similar trial participants are to target population in terms of
6. Additional characteristics in dataset
Administrative records may include information we did not collect as part of our RCT.....
…. because it was too difficult to
…. because too costly
…. because we forgot!?
These are additional variables we can use in our analysis of our trial.
E.g. Additional variables we can perform ‘balance checks’ with….
E.g. Additional variables to examine heterogeneous effects…..
Example: Maths Mastery heterogeneous effects….
Linked in cluster (school) level administrative data on OFSTED (inspection) ratings…
Found big heterogeneity by OFSTED rating!
ONLY POSSIBLE AFTER WE LINKED TO ADMIN DATA!!!
7. Potential for clever designs….
See this paper:
Improving recruitment of older people to clinical trials: use of the cohort multiple randomised controlled trial design.
Age Ageing 2015 doi:10.1093/ageing/afv044
Points to note
1. You never make any contact with control group!
2. If everyone you ask says yes – then you have a perfect RCT! (Both internal & external validity)
3. Statistical power very high….
4. ‘Business as usual control’ (by necessity)…
5. Non-compliance = People saying no when you approach them = the issue (ITT vs CA-ITT analysis)
Step 1: Admin data on population
Step 2: Randomly ask people if they want to receive treatment
Step 2: Control group.
Individuals not approached
Step 3: Follow up in admin data
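The allocation steps of this design can be sketched as follows. This is only an illustration: the population of 1,000 record IDs, the 200 offers and the function name are all hypothetical.

```python
import random


def cmrct_allocate(population_ids, n_offered, seed=0):
    """Split an admin-data population into 'offered' and 'not approached' groups."""
    rng = random.Random(seed)
    ordered = sorted(population_ids)           # fixed order so the seed is reproducible
    offered = set(rng.sample(ordered, n_offered))
    control = set(ordered) - offered           # never contacted; followed up in admin data
    return offered, control


# Hypothetical population of 1,000 admin-record IDs; 200 are offered treatment.
population = range(1000)
offered, control = cmrct_allocate(population, n_offered=200)
print(len(offered), len(control))
```

An intention-to-treat comparison then contrasts everyone in the offered group with everyone in the control group, whether or not they accepted the offer.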
Issues with linking to administrative data….
Sensitive data = high levels data security….
Most administrative is
……. you know who the person is!!
Some data probably won’t be given to you (e.g. names)……
You may not be the one doing the linking
…..it may be left up to others (who may not do this correctly!)
When you have access to linked data, you need to store it securely.
Safe Data Haven
if you don’t abide by the rules…..
Ethics and consent….
Participants usually need to give you consent to link their admin data to the RCT….
Opt-in consent = They need to tick the box saying that you can link
Opt-out consent = They only need to contact you if they don’t consent.
Sometimes the person giving consent is not the person themselves…..
needed to access children’s NPD data….
typically asked about
Ethical issue with long-term linking?
What happens if your school and parent give consent to link when you are 10….
…..but then you decide you don’t want this at age 18?
…..should we have to re-ask for consent once children become adults?
Practicalities. How do you link?
1. Unique ID
Variable that uniquely identifies an individual in both
to be merged
E.g. UPN in NPD; national insurance number in tax records.
2. By name
Individuals named in both
Not as straightforward as it may sound!
Names spelt wrong/differently across files…..
Maiden vs married names…..
Individuals with same name (e.g. NPD and children called Mohammed in London)
3. By individual characteristics
AKA: ‘fuzzy matching’
Need enough characteristics so can identify individuals…..
E.g. Gender, Date of Birth, FSM etc. The more, the better!
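A minimal sketch of two of the linking routes above: link on the unique ID where it is present, otherwise fall back to individual characteristics and accept the link only if exactly one admin record matches. All records and field names (`upn`, `gender`, `dob`, `fsm`, `ks2_maths`) are made up for illustration. The fallback here is a simple exact match on characteristics, not a probabilistic matching method.

```python
from collections import defaultdict


def link_records(trial, admin, id_key="upn", fallback_keys=("gender", "dob", "fsm")):
    """Return (linked, unlinked) lists of (trial_record, admin_record_or_None)."""
    by_id = {rec[id_key]: rec for rec in admin if rec.get(id_key)}
    by_traits = defaultdict(list)
    for rec in admin:
        by_traits[tuple(rec.get(k) for k in fallback_keys)].append(rec)

    linked, unlinked = [], []
    for rec in trial:
        match = by_id.get(rec.get(id_key))
        if match is None:
            # No usable ID: accept a characteristics match only if it is unique.
            candidates = by_traits.get(tuple(rec.get(k) for k in fallback_keys), [])
            match = candidates[0] if len(candidates) == 1 else None
        (linked if match is not None else unlinked).append((rec, match))
    return linked, unlinked


# Made-up records for illustration only.
admin = [
    {"upn": "A1", "gender": "F", "dob": "2004-01-15", "fsm": True, "ks2_maths": 27},
    {"upn": "A2", "gender": "M", "dob": "2003-10-02", "fsm": False, "ks2_maths": 31},
    {"upn": "A3", "gender": "M", "dob": "2003-10-02", "fsm": False, "ks2_maths": 24},
]
trial = [
    {"pupil": 1, "upn": "A2", "gender": "M", "dob": "2003-10-02", "fsm": False},
    {"pupil": 2, "upn": None, "gender": "F", "dob": "2004-01-15", "fsm": True},
    {"pupil": 3, "upn": None, "gender": "M", "dob": "2003-10-02", "fsm": False},
]
linked, unlinked = link_records(trial, admin)
print(len(linked), "linked;", len(unlinked), "unlinked")
```

Pupil 3 stays unlinked because two admin records share the same gender, date of birth and FSM status, which is why more characteristics make matching easier.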
Case study. Chess in Schools and Communities.
→ Children to receive 30 hours of chess lessons during one academic year (year 5)
Follows a fully developed curriculum by the Chess in Schools and Communities (CSC) team
→ Chess lessons likely to be accompanied by an after school chess club
RQ. Does teaching primary school children how to play chess lead to an improvement in their educational attainment?
Why is this of interest?
countries (e.g. Russia) chess is part of the national curriculum
’ that influences maths test scores (at least within the chess world!)
“we have scientific support for what we have known all along--chess makes kids smarter!”
Chess Life, November, p. 16 /
Reasonably strong previous evidence
cluster RCT in Italy produced
effect size 0.35
caution – external validity
Big previous effect sizes….but poor research designs
Why is this of interest?
Intervention is VERY cheap to implement
- If +
impact, then also likely
Fairly serious money invested in the project: £700K ($1m) for this RCT alone
Putting men into primary schools
More information see:
An interesting feature of this particular RCT is that it used administrative data only!!
Step 1. Defined the population using administrative data…..
→ 11 LEAs (geographic areas) in England purposefully selected
→ Year 5 (age 9 / 10) children in 2013 / 14 academic year (born Sep 2003 – Aug 2004)
→ Disadvantaged schools
> 37% of KS 2 pupils eligible for FSM in the last six years
→ Total of 450 schools on population list (sampling frame)
Step 2. Pre-specified use of administrative data in study protocol…
→ Primary outcome: Key Stage 2 math test score
- National examination in England
- Children will sit 1 year after end of intervention
- Due to sit tests in June 2015 (children age 11)
- ‘Intention to treat’ (ITT) analysis
- Information from NPD (administrative data)
- Should get 100% follow-up (very rare for RCT!)
→ Secondary outcome
- Math sub-domains (e.g. mental arithmetic)
- English & science test scores
Step 3: Power calculation
Between school ICC = 0.15
60 children per school on average
Correlation pre / post test (Key Stage 1 and Key Stage 2 test scores) = 0.65
80% power for 95% CI
We can base these assumptions on analysis of admin data from previous years! Strong basis!
With 100 schools, we can detect an effect size of 0.20.
Hence recruit 100 schools …....
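A rough sketch of this kind of power calculation, using a common normal-approximation formula for the minimum detectable effect size (MDES) in a two-arm cluster RCT. The slides do not say which formula or software was used, so this need not reproduce the 0.20 figure exactly. The function name, the equal split of schools between arms, and the assumption that the pre-test explains the same share of variance (0.65 squared) at school and pupil level are all assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist


def mdes(n_clusters, cluster_size, icc, r2=0.0, p_treat=0.5, alpha=0.05, power=0.80):
    """Approximate minimum detectable effect size for a two-arm cluster RCT.

    r2 is the share of outcome variance explained by the baseline covariate,
    assumed here to be the same at school and pupil level.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    denom = p_treat * (1 - p_treat) * n_clusters
    variance = (1 - r2) * (icc / denom + (1 - icc) / (denom * cluster_size))
    return multiplier * sqrt(variance)


# Slide assumptions: 100 schools, 60 pupils per school, ICC = 0.15,
# pre/post-test correlation = 0.65, 80% power, 5% two-sided significance.
no_covariate = mdes(100, 60, 0.15)
with_ks1 = mdes(100, 60, 0.15, r2=0.65 ** 2)
print(f"MDES without pre-test:  {no_covariate:.2f}")
print(f"MDES with KS1 pre-test: {with_ks1:.2f}")
```

Under these assumptions the sketch gives roughly 0.23 without the pre-test and roughly 0.17 with it, so the slide's 0.20 presumably rests on somewhat different assumptions about how much the pre-test explains.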
Step 4: Selection of the ‘sample’ (and external validity)
→ Chess in schools given list of all 450 schools
→ Asked to recruit 100 from this list
→ Sampling fraction of around 22%
How does our sample of children from the 100 recruited schools…..
……compare to the ‘population’ of children from the 450 schools?
USE ADMINISTRATIVE DATA TO FIND OUT!!!!
Example: Using the NPD to investigate external validity..
Chess in Schools
Able to show participants very similar to population of interest (in terms of observables…..)
…but very different to population of England as a whole!
[Table: comparison with all eligible pupils on Key Stage 1 maths and eligibility for FSM]
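One simple way to make such a comparison is a standardised difference: the gap between the trial-sample mean and the population mean, in population standard deviation units. The scores below are made-up illustrative values, not NPD data, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def standardised_difference(sample, population):
    """(sample mean - population mean) / population standard deviation."""
    return (mean(sample) - mean(population)) / pstdev(population)


# Made-up Key Stage 1 scores, for illustration only.
population_scores = [11, 13, 15, 15, 17, 19]   # all eligible pupils
sample_scores = [13, 15, 19]                   # pupils in recruited schools

diff = standardised_difference(sample_scores, population_scores)
print(f"Standardised difference: {diff:.2f} SD")
```

Values close to zero on the observable characteristics suggest the recruited sample resembles the target population, at least on what the admin data records.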
Step 5: Random assignment
→ Stratify schools into 9 groups
- 3*3 matrix of %FSM and KS2 test scores at school level
→ Randomly assign schools within each of these strata
→ 50 Treatment schools (children taught chess)
→ 50 Control schools (business as usual)
→All children within these schools taking part in the trial.
→ Q. WAS BALANCE ACHIEVED?
USE ADMINISTRATIVE DATA TO FIND OUT!!!!
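A minimal sketch of stratified random assignment of schools on a 3*3 grid of school-level %FSM and KS2 scores. The school data, the cut-offs and the function names are hypothetical, and the trial's actual randomisation procedure may have differed in detail.

```python
import random
from collections import defaultdict


def band(value, cutoffs):
    """Return 0, 1 or 2 depending on how many of the two cut-offs value exceeds."""
    return sum(value > c for c in cutoffs)


def stratified_assignment(schools, fsm_cutoffs, ks2_cutoffs, seed=0):
    """Randomly assign schools to arms within each %FSM x KS2 stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for school in schools:
        key = (band(school["pct_fsm"], fsm_cutoffs), band(school["ks2"], ks2_cutoffs))
        strata[key].append(school["id"])

    assignment = {}
    for key in sorted(strata):
        ids = strata[key]
        rng.shuffle(ids)
        half = len(ids) // 2   # in odd-sized strata the extra school goes to control
        for position, school_id in enumerate(ids):
            assignment[school_id] = "treatment" if position < half else "control"
    return assignment


# 100 hypothetical schools with made-up school-level characteristics.
demo_rng = random.Random(1)
schools = [
    {"id": i, "pct_fsm": demo_rng.uniform(37, 70), "ks2": demo_rng.uniform(24, 30)}
    for i in range(100)
]
assignment = stratified_assignment(schools, fsm_cutoffs=(48, 59), ks2_cutoffs=(26, 28))
n_treat = sum(arm == "treatment" for arm in assignment.values())
print(n_treat, "treatment schools;", len(assignment) - n_treat, "control schools")
```

Because assignment happens within strata, the two arms are forced to be similar on %FSM and KS2 scores, which is what the balance checks then confirm.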
Balance on prior achievement using admin data…..
Balance upon KS1 average points scores….
These are tests children took at age 7………..Two years before the intervention took place! (But that’s ok!)
Balance on other characteristics……
By using NPD, we almost
with NPD data means we can (almost)
EXAMPLE: Chess in Schools
- Year 5 children learn how to play chess during one school year
- 50 treatment schools receive chess
- 50 control schools = ‘business as usual’
- Use age 7 (Key Stage 1) test scores as the pre-test
- Use age 11 (Key Stage 2) test scores as the post-test
Almost no burden on schools (no testing to be done)
Key stage 2 results for all children
Have test scores even if they move schools……
very little attrition
Almost zero attrition!
Did it work? Outcomes 1 year post-intervention
Able to look at heterogeneous effect by FSM due to admin data link…..
Planned long-run follow-up (using admin data)
Trial conducted in Year 5 (age 9/10). First follow-up at end of Year 6 (age 10/11).
Treatment and control children then move onto secondary school.
Will be able to track these children via their unique pupil number. Hence long-run control:
Do treatment children do better in math GCSE? (Age 16)
Are they more likely to study maths post-16?
Are they more likely to enter a high-status university?
Administrative data means we can answer these questions at little extra cost.
Can answer the question –
is there a lasting impact of the treatment
Exclusive use of
1. Could only look at
educational attainment measures
…..and not look at impact upon ‘
2. Outcome measured
one year after intervention
…..might there have been an
would have probably been higher with a
…..but also would have
limited to characteristics observable in administrative data only.