1. Sampling procedures for assessing accuracy of record linkage
Paul A. Smith, S3RI, University of Southampton
Shelley Gammon, Sarah Cummins, Christos Chatzoglou, Office for National Statistics
Dick Heasman
2. Outline
Record linkage and strategies
Quality measures for record linkage
Stratified sampling
Inverse sampling
Results
Conclusions
3. Record linkage
Increasingly important in official statistics
Many strategies for linkage
Most high-quality linkage routines contain multiple passes, e.g.
  exact matching
  rule-based matching
  probabilistic matching methods
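4. Quality measures
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F-measure = 2 × precision × recall / (precision + recall)
All rely on a truth assessment, which is usually clerical and expensive
TP = true positive, FP = false positive, FN = false negative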
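As a minimal sketch of the three quality measures, assuming the standard definitions and using purely hypothetical counts (the function name `linkage_quality` is illustrative, not from the slides):

```python
def linkage_quality(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall and F-measure from clerically verified counts."""
    precision = tp / (tp + fp)   # share of declared links that are true matches
    recall = tp / (tp + fn)      # share of true matches that were linked
    f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Hypothetical counts, for illustration only.
quality = linkage_quality(tp=900, fp=100, fn=300)
print(quality)
```

True negatives do not enter any of the three measures, which is convenient because the number of non-linked pairs is usually very large.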
5. Overall and component quality
Want overall assessment of matching quality from all stages
Also interested in quality of the different stages
  treat stages as strata
  stratification advantageous if strata different, and internally homogeneous
  expect differences in precision in different match stages
Can stratification improve precision/reduce costs of assessment?
  how many records do I need to assess clerically?
6. Example with known outcome
Use linkage from 2011 England & Wales Population Census and Census Coverage Survey
  linkage was done automatically with clerical resolution of 'uncertain' cases
  take clerical linkage result as ‘truth’
Evaluate automated linkage procedures
  compare precision and recall
  experiment with accuracy measures for inference on quality measures
7. Choice of stratifying variables – true/false +ves
600k links from a new automated process
0.26% FPs compared to original clerical linkage ‘truth’
Sample 500 TPs and 500 FPs by srs as basis for modelling
Candidate stratifying variables:
  Census hard to count index (1-5)
  Sex (1 or 2)
  Age group (0-17, 18-24, 25-39, 40-64, 65+, unknown)
  Whether in London (1) or not London (0)
  Ethnicity (White, Asian/Asian British, Black/Black British, mixed, other, unknown)
  Linkage pass (1, 2, 3, 4, 5, 6, 7, 11)
8. Stratifying variables
BIC to penalise overfitting
Best model has
  hard to count index (HtC)
  ‘match pass’ (linkage method)
Merge similar levels
  HtC {1&2}, {3} and {4&5}
  match pass with levels {1}, {2-7} and {11} (= exact, deterministic and probabilistic matching)
9. Choice of stratifying variables – true/false -ves
30k ‘most likely’ non-links (deduplication)
Approx. 1/3 are FNs
Sample 500 TNs and 500 FNs by srs as basis for modelling
BIC
  best model includes match probability, London, age group and ethnicity
Merge similar levels
  age group to {0-24}, {25-64} and {65+}
  ethnicity to {White, mixed}, {Asian, Black, other} and {unknown}
10. Estimation
Precision
  estimate as proportion of all links
  straightforward – know links
Recall
  estimate as proportion of real matches (TP + FN)
  not straightforward – need to estimate FN and calculate
  FN estimated as proportion of non-links
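The estimation steps can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the helper names are hypothetical, the stratum figures are invented, and the stratified precision is taken as the size-weighted mean of stratum precisions.

```python
def stratified_proportion(sizes, props):
    """Size-weighted overall proportion from per-stratum estimates."""
    return sum(n * p for n, p in zip(sizes, props)) / sum(sizes)


def estimate_recall(n_links, precision_hat, n_nonlinks, fn_rate_hat):
    """Recall = TP / (TP + FN), with both counts estimated from samples."""
    tp_hat = n_links * precision_hat      # estimated true positives among links
    fn_hat = n_nonlinks * fn_rate_hat     # estimated missed matches among non-links
    return tp_hat / (tp_hat + fn_hat)


# Hypothetical strata (e.g. match passes) and sampled precisions.
link_counts = [600, 300, 100]
stratum_precision = [0.99, 0.95, 0.80]

precision_hat = stratified_proportion(link_counts, stratum_precision)
recall_hat = estimate_recall(
    n_links=sum(link_counts),
    precision_hat=precision_hat,
    n_nonlinks=200,      # hypothetical count of candidate non-links
    fn_rate_hat=0.2,     # hypothetical sampled share of non-links that are FNs
)
print(precision_hat, recall_hat)
```

Recall depends on two sampled quantities, so its uncertainty combines the sampling error from the links and from the non-links.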
11. Stratified sampling
Stratified sampling for proportions (Cochran 1977)
  requires ‘knowledge’ of the ph
  then allows control of var(p)
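A minimal sketch of Neyman allocation for a stratified proportion, assuming guessed stratum proportions and ignoring finite population corrections. The stratum sizes, guesses and function names below are hypothetical. The sketch uses var(p) = Σ Wh² ph(1 − ph)/nh, with Wh the stratum share of the population.

```python
import math


def neyman_allocation(sizes, props, target_cv):
    """Per-stratum sample sizes n_h giving roughly the target cv of p."""
    total = sum(sizes)
    weights = [size / total for size in sizes]
    sds = [math.sqrt(p * (1 - p)) for p in props]          # S_h for a proportion
    p_overall = sum(w * p for w, p in zip(weights, props))
    ws_sum = sum(w * s for w, s in zip(weights, sds))
    # Under Neyman allocation var(p) = (sum W_h S_h)^2 / n, solved for n.
    n_total = ws_sum ** 2 / (target_cv * p_overall) ** 2
    # n_h proportional to W_h S_h, rounded up.
    return [math.ceil(n_total * w * s / ws_sum) for w, s in zip(weights, sds)]


def stratified_cv(sizes, props, alloc):
    """cv of the stratified estimate for a given allocation (no fpc)."""
    total = sum(sizes)
    weights = [size / total for size in sizes]
    p_overall = sum(w * p for w, p in zip(weights, props))
    var = sum(w ** 2 * p * (1 - p) / n for w, p, n in zip(weights, props, alloc))
    return math.sqrt(var) / p_overall


# Hypothetical strata and guessed error rates.
sizes = [500_000, 90_000, 10_000]
guessed_props = [0.001, 0.01, 0.1]
alloc = neyman_allocation(sizes, guessed_props, target_cv=0.1)
achieved = stratified_cv(sizes, guessed_props, alloc)
print(alloc, achieved)
```

The achieved cv only meets the target if the guessed proportions are close to the truth, which is the weakness that motivates inverse sampling.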
12. Variance or cv?
Control of var(p) only helpful if p guessed/known accurately
  often unknown initially
Inverse sampling (Haldane 1945) controls cv(p) by continuing sampling until m ‘successes’ observed, m fixed
  sequential procedure with stopping rule
  fixed m does not mean fixed sample size
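The stopping rule can be sketched as below, assuming records arrive in random order and each is clerically checked as a ‘success’ or not. The function name is hypothetical, and the estimator (m − 1)/(n − 1) is Haldane's unbiased estimator for inverse sampling; the slides do not state which estimator was used.

```python
import random


def inverse_sample(outcomes, m):
    """Check records in order until m successes; return (n, estimate of p)."""
    if m < 2:
        raise ValueError("m must be at least 2 for the (m-1)/(n-1) estimator")
    successes = 0
    for n, is_success in enumerate(outcomes, start=1):
        if is_success:
            successes += 1
        if successes == m:
            return n, (m - 1) / (n - 1)
    raise ValueError("ran out of records before observing m successes")


# Small deterministic example: successes at positions 3, 5 and 6.
n_small, p_small = inverse_sample([0, 0, 1, 0, 1, 1, 0], m=3)

# Simulated stream with a hypothetical true rate of 1%.
rng = random.Random(1)
stream = (rng.random() < 0.01 for _ in range(1_000_000))
n_checked, p_hat = inverse_sample(stream, m=20)
print(n_small, p_small, n_checked, p_hat)
```

The number of records checked is random: a rarer ‘success’ means more clerical checks before the stopping rule is met, so the cost is not known in advance.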
13. Inverse sampling
14. Stratified inverse sampling
No clear solution for allocation of nh from var(ph)
Numerical search approach
  minimum mh = m* allocated in each stratum
  increase mh in stratum where largest decrease in cv(p), based on expected no. of cases to next ‘success’
  repeat until target cv achieved
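One possible reading of the numerical search, offered as a sketch and not as the authors' algorithm. It makes two assumptions that are not in the slides: the first-order approximation var(ph) ≈ ph²(1 − ph)/mh for inverse sampling, and a selection rule that divides the cv reduction by the expected number of cases to the next ‘success’ (1/ph). All names and inputs are hypothetical.

```python
import math


def overall_cv(weights, props, m):
    """Approximate cv of p = sum W_h p_h under stratified inverse sampling."""
    p = sum(w * ph for w, ph in zip(weights, props))
    var = sum(w ** 2 * ph ** 2 * (1 - ph) / mh
              for w, ph, mh in zip(weights, props, m))
    return math.sqrt(var) / p


def allocate_successes(weights, props, target_cv, m_min=2, max_steps=100_000):
    """Greedy search for the per-stratum numbers of successes m_h."""
    m = [m_min] * len(props)                 # minimum m* in every stratum
    for _ in range(max_steps):
        current = overall_cv(weights, props, m)
        if current <= target_cv:
            return m
        best_h, best_gain = None, 0.0
        for h, ph in enumerate(props):
            trial = m.copy()
            trial[h] += 1
            # cv reduction per expected extra case (expected cost is 1 / ph)
            gain = (current - overall_cv(weights, props, trial)) * ph
            if gain > best_gain:
                best_h, best_gain = h, gain
        if best_h is None:
            break                            # no stratum improves the cv
        m[best_h] += 1
    raise RuntimeError("target cv not reached")


# Hypothetical stratum weights and guessed proportions.
weights = [0.80, 0.15, 0.05]
props = [0.001, 0.01, 0.1]
m_alloc = allocate_successes(weights, props, target_cv=0.2)
final_cv = overall_cv(weights, props, m_alloc)
print(m_alloc, final_cv)
```

Like the stratified design, this search needs some initial values for the ph to evaluate the expected gains.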
15. Stratum ‘success’ sizes
16. Overall ‘success’ size
17. Sample sizes
18. Overall sample size
19. Estimated p’s, correct
20. Achieved cv’s, correct
21. Estimated p’s,
22. Achieved cv’s
23. Estimated p’s, varying
24. Achieved cv’s
25. Strategy
This suggests a strategy for sampling to assess the quality of linkage, consisting of:
  1. If reasonable estimates of the ph are available, use them in a Neyman allocation in a stratified design. This will give the smallest sample size with a reasonable chance of achieving the required cv.
  2. If these are not available, and only an overall estimate is required, use inverse sampling on randomly sorted data.
  3. If separate estimates in the strata are desirable, follow the algorithm for stratified inverse sampling.
Possible combination approach: use inverse sampling to get initial estimates of the ph to feed into a Neyman allocation
26. Conclusions
Stratified inverse sampling not effective?
Sample sizes smaller for stratified sampling (much smaller for p~0.1, not much different for p~0.001)
Combination strategies may offer improved results when initial estimates not available
27. Questions?
Paul Smith
p.a.smith@soton.ac.uk