Sampling procedures for assessing accuracy of record linkage

Uploaded On 2023-10-29

Presentation Transcript

1. Sampling procedures for assessing accuracy of record linkage
Paul A. Smith, S3RI, University of Southampton
Shelley Gammon, Sarah Cummins, Christos Chatzoglou, Office for National Statistics
Dick Heasman

2. Outline
- Record linkage and strategies
- Quality measures for record linkage
- Stratified sampling
- Inverse sampling
- Results
- Conclusions

3. Record linkage
- increasingly important in official statistics
- many strategies for linkage
- most high-quality linkage routines contain multiple passes, e.g.
  - exact matching
  - rule-based matching
  - probabilistic matching methods

4. Quality measures
- precision, recall, F-measure
- rely on truth assessment – usually clerical and expensive
- TP = true positive, FN = false negative, etc.
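The three measures named on this slide have standard definitions in terms of true positives (TP), false positives (FP) and false negatives (FN). A minimal sketch follows; the function name `linkage_quality` and the counts are hypothetical and only illustrate the arithmetic.

```python
def linkage_quality(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall and F-measure from clerically confirmed counts."""
    precision = tp / (tp + fp)  # share of declared links that are true matches
    recall = tp / (tp + fn)     # share of true matches that were actually linked
    # F-measure: harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Hypothetical counts, for illustration only
q = linkage_quality(tp=900, fp=100, fn=300)
print(q)
```

All three measures need the true match status of each record pair, which is why the clerical truth assessment drives the cost.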

5. Overall and component quality
- Want overall assessment of matching quality from all stages
- Also interested in quality of the different stages
  - treat stages as strata
  - stratification advantageous if strata different, and internally homogeneous
  - expect differences in precision in different match stages
- Can stratification improve precision/reduce costs of assessment?
- How many records do I need to assess clerically?

6. Example with known outcome
- Use linkage from 2011 England & Wales Population Census and Census Coverage Survey
  - linkage was done automatically with clerical resolution of ‘uncertain’ cases
  - take clerical linkage result as ‘truth’
- Evaluate automated linkage procedures
  - compare precision and recall
  - experiment with accuracy measures for inference on quality measures

7. Choice of stratifying variables – true/false +ves
- 600k links from a new automated process
- 0.26% FPs compared to original clerical linkage ‘truth’
- sample 500 TPs and 500 FPs by srs as basis for modelling
- Census hard to count index (1-5)
- Sex (1 or 2)
- Age group (0-17, 18-24, 25-39, 40-64, 65+, unknown)
- Whether in London (1) or not London (0)
- Ethnicity (White, Asian/Asian British, Black/Black British, mixed, other, unknown)
- Linkage pass (1, 2, 3, 4, 5, 6, 7, 11)

8. Stratifying variables
- BIC to penalise overfitting
- best model has
  - hard to count index (HtC)
  - ‘match pass’ (linkage method)
- merge similar levels
  - HtC {1&2}, {3} and {4&5}
  - match pass with levels {1}, {2-7} and {11} (= exact, deterministic and probabilistic matching)
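BIC penalises each extra parameter by ln(n), so a larger model is preferred only if it improves the log-likelihood enough. A minimal sketch using the standard definition BIC = k ln(n) - 2 ln(L) follows. The two candidate models, their log-likelihoods and their parameter counts are invented for illustration and are not results from the presentation.

```python
import math


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion; lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Hypothetical fits to n = 1000 sampled cases (made-up numbers)
n_obs = 1000
candidates = {
    "small": {"log_likelihood": -410.0, "n_params": 5},
    "large": {"log_likelihood": -400.0, "n_params": 22},
}
scores = {
    name: bic(c["log_likelihood"], c["n_params"], n_obs)
    for name, c in candidates.items()
}
best = min(scores, key=scores.get)
print(scores, best)
```

With these made-up values, the larger model's better fit does not offset the penalty on its 17 extra parameters, so the smaller model is selected.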

9. Choice of stratifying variables – true/false -ves
- 30k ‘most likely’ non-links (deduplication)
- approx. 1/3 are FNs
- sample 500 TNs and 500 FNs by srs as basis for modelling
- BIC
  - best model includes match probability, London, age group and ethnicity
- merge similar levels
  - age group to {0-24}, {25-64} and {65+}
  - ethnicity to {White, mixed}, {Asian, Black, other} and {unknown}

10. Estimation
- precision
  - estimate as proportion of all links
  - straightforward – know links
- recall
  - estimate as proportion of real matches (TP + FN)
  - not straightforward – need to estimate FN and calculate
  - FN estimated as proportion of non-links
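The two estimates can be sketched as below, assuming one clerically checked simple random sample of links and one of non-links. The function name, argument names and sample counts are hypothetical; only the logic (precision from the links sample, FN scaled up from the non-links sample, then recall = TP / (TP + FN)) follows the slide.

```python
def estimate_precision_recall(
    n_links: int,
    links_checked: int,
    links_confirmed: int,
    n_nonlinks: int,
    nonlinks_checked: int,
    nonlinks_found_matching: int,
) -> tuple:
    """Estimate precision and recall from two clerically checked samples."""
    # Precision: proportion of sampled links confirmed as true matches
    precision = links_confirmed / links_checked
    tp_hat = precision * n_links  # estimated number of true positives

    # FN: proportion of sampled non-links that are really matches,
    # scaled up to the number of non-links considered
    fn_hat = (nonlinks_found_matching / nonlinks_checked) * n_nonlinks

    recall = tp_hat / (tp_hat + fn_hat)
    return precision, recall


# Hypothetical sample outcomes, for illustration only
precision, recall = estimate_precision_recall(
    n_links=600_000, links_checked=1000, links_confirmed=997,
    n_nonlinks=30_000, nonlinks_checked=1000, nonlinks_found_matching=330,
)
print(round(precision, 4), round(recall, 4))
```

Recall inherits sampling error from both samples, which is one reason it is harder to estimate than precision.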

11. Stratified sampling
- Stratified sampling for proportions (Cochran 1977)
- requires ‘knowledge’ of p_h
- then allows control of var(p)
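A minimal sketch of stratified sampling for a proportion follows, assuming the textbook variance of the stratified estimator with the finite population correction ignored, and Neyman allocation with n_h proportional to N_h sqrt(p_h (1 - p_h)). The stratum sizes, guessed proportions and function names are hypothetical.

```python
import math


def neyman_allocation(N, p, n_total):
    """Allocate n_total so that n_h is proportional to N_h * sqrt(p_h * (1 - p_h))."""
    weights = [Nh * math.sqrt(ph * (1 - ph)) for Nh, ph in zip(N, p)]
    total = sum(weights)
    return [n_total * w / total for w in weights]


def stratified_cv(N, p, n):
    """cv of p_st = sum(W_h * p_h), ignoring the finite population correction."""
    N_total = sum(N)
    W = [Nh / N_total for Nh in N]
    p_st = sum(Wh * ph for Wh, ph in zip(W, p))
    var = sum(Wh ** 2 * ph * (1 - ph) / nh for Wh, ph, nh in zip(W, p, n))
    return math.sqrt(var) / p_st


# Hypothetical strata: sizes and guessed proportions
N = [400_000, 150_000, 50_000]
p_guess = [0.001, 0.01, 0.1]
n_total = 3000

alloc = neyman_allocation(N, p_guess, n_total)
proportional = [n_total * Nh / sum(N) for Nh in N]
cv_neyman = stratified_cv(N, p_guess, alloc)
cv_proportional = stratified_cv(N, p_guess, proportional)
print([round(a) for a in alloc], round(cv_neyman, 3), round(cv_proportional, 3))
```

The allocation is only as good as the guessed p_h: if the guesses are poor, the achieved cv can differ from the planned one. In practice the allocations would be rounded to whole records.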

12. Variance or cv?
- Control of var(p) only helpful if p guessed/known accurately
- Often unknown initially
- Inverse sampling (Haldane 1945) controls cv(p) by continuing sampling until m ‘successes’ observed, m fixed
  - sequential procedure with stopping rule
  - fixed m does not mean fixed sample size

13. Inverse sampling
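Inverse sampling can be illustrated with a small simulation: draw cases one at a time until m successes have been seen, then estimate p with the unbiased estimator (m - 1)/(n - 1), where n is the number of cases drawn. This sketch assumes independent draws with constant success probability (a very large population); the function names, seed and parameter values are hypothetical.

```python
import random


def inverse_sample(true_p: float, m: int, rng: random.Random) -> int:
    """Draw cases until m successes are observed; return the number drawn."""
    n = 0
    successes = 0
    while successes < m:
        n += 1
        if rng.random() < true_p:
            successes += 1
    return n


def inverse_estimate(m: int, n: int) -> float:
    """Unbiased estimator of p under inverse sampling (requires m >= 2)."""
    return (m - 1) / (n - 1)


rng = random.Random(2011)
m, true_p = 20, 0.05
sizes = [inverse_sample(true_p, m, rng) for _ in range(2000)]
estimates = [inverse_estimate(m, n) for n in sizes]
mean_estimate = sum(estimates) / len(estimates)
print(min(sizes), max(sizes), round(mean_estimate, 4))
```

The spread between the smallest and largest simulated sample size shows that fixing m does not fix the sample size.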

14. Stratified inverse sampling
- no clear solution for allocation of n_h from var(p_h)
- numerical search approach
  - minimum m_h = m* allocated in each stratum
  - increase m_h in stratum where largest decrease in cv(p), based on expected no. of cases to next ‘success’
  - repeat until target cv achieved
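The numerical search can be sketched as a greedy loop. This is one possible reading of the slide, not the authors' implementation, and it rests on three assumptions: the approximation var(p_h) ≈ p_h^2 (1 - p_h)/(m_h - 2) for the inverse-sampling estimator, an expected 1/p_h cases to the next success in stratum h, and ranking strata by cv decrease per expected additional case. All names and inputs are hypothetical.

```python
import math


def cv_overall(W, p, m):
    """Approximate cv of sum(W_h * p_h) when stratum h is sampled to m_h successes."""
    p_bar = sum(Wh * ph for Wh, ph in zip(W, p))
    # Assumed approximation: var(p_h) ~ p_h^2 * (1 - p_h) / (m_h - 2)
    var = sum(
        Wh ** 2 * ph ** 2 * (1 - ph) / (mh - 2)
        for Wh, ph, mh in zip(W, p, m)
    )
    return math.sqrt(var) / p_bar


def allocate_successes(W, p, target_cv, m_min=3, max_steps=100_000):
    """Greedy search for per-stratum success counts m_h that reach target_cv."""
    m = [m_min] * len(W)  # minimum m* in every stratum (must exceed 2)
    for _ in range(max_steps):
        current = cv_overall(W, p, m)
        if current <= target_cv:
            break
        best_h, best_gain = 0, -1.0
        for h in range(len(W)):
            trial = list(m)
            trial[h] += 1
            # cv decrease per expected extra case (1 / p_h cases per success)
            gain = (current - cv_overall(W, p, trial)) * p[h]
            if gain > best_gain:
                best_h, best_gain = h, gain
        m[best_h] += 1
    return m


def expected_sample_size(p, m):
    """Expected number of cases needed to observe m_h successes in each stratum."""
    return sum(mh / ph for ph, mh in zip(p, m))


# Hypothetical stratum weights and guessed proportions
W = [0.6, 0.3, 0.1]
p_guess = [0.001, 0.01, 0.1]
m_alloc = allocate_successes(W, p_guess, target_cv=0.1)
print(m_alloc, round(expected_sample_size(p_guess, m_alloc)))
```

Because the minimum m* applies in every stratum, a stratum with very small p_h can dominate the expected sample size even when it contributes little to the overall variance.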

15. Stratum ‘success’ sizes

16. Overall ‘success’ size

17. Sample sizes

18. Overall sample size

19. Estimated p’s, correct …

20. Achieved cv’s, correct …

21. Estimated p’s, …

22. Achieved cv’s

23. Estimated p’s, varying …

24. Achieved cv’s

25. Strategy
This suggests a strategy for sampling to assess the quality of linkage consisting of:
  1. If reasonable estimates of p_h are available, use them in a Neyman allocation in a stratified design. This will give the smallest sample size with a reasonable chance of achieving the required cv.
  2. If these are not available, and only an overall estimate is required, use inverse sampling on randomly sorted data.
  3. If separate estimates in the strata are desirable, follow the algorithm for stratified inverse sampling.
Possible combination approach: use inverse sampling to get initial estimates of p_h to feed into Neyman allocation.

26. Conclusions
- Stratified inverse sampling not effective?
- Sample sizes smaller for stratified sampling (much smaller for p~0.1, not much different for p~0.001)
- Combination strategies may offer improved results when initial estimates not available.

27. Questions?
Paul Smith
p.a.smith@soton.ac.uk