Statistical Data Editing

Uploaded On 2023-10-31

Presentation Transcript

1. Statistical Data Editing and Imputation

2. Presented by Sander Scholtus, Statistics Netherlands

3. Introduction
Data arrive at a statistical institute...

ID | size class | number of employees | turnover (x €1000) | labour costs (x €1000) | other costs (x €1000) | total costs (x €1000)
0001 | large | 21349,827030,47930,479
0002 | large | 364,933
0003 | medium | 421,462511,513
0004 | medium | 296,3018916,350
0005 | small | 4875,00098,000547,000645,000
0006 | small | 81,716175998
0007 | small | 061447153570

4. Introduction
Data arrive at a statistical institute...
- …containing errors and implausible values
- …containing missing values
To produce statistical output of sufficient quality, these data problems have to be treated:
- Statistical data editing deals with errors
- Imputation deals with missing values

5. Statistical data editing
Overview:
- Goals
- Edit rules
- Different editing methods and how to combine them
- Modules in the handbook

6. Statistical data editing – goals
Traditional goal of editing:
- Detect and correct all errors in the collected data
Problems:
- Very labour-intensive
- Very time-consuming
- Highly inefficient: measurement error is not the only source of error in statistical output

7. Statistical data editing – goals
Modern goals of editing:
- To identify possible sources of errors so that the statistical process may be improved in the future.
- To provide information about the quality of the data collected and published.
- To detect and correct influential errors in the collected data.
- If necessary, to provide complete and consistent micro-data.
Sources: Granquist (1997), EDIMBUS (2007)

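8. Statistical data editing – edit rules
Edit rules (edits, edit checks, checking rules)
- Used to detect errors
- Can be either hard (must hold for every correct record) or soft (a violation is suspicious but not necessarily an error)
- General form: IF (unit ∈ edit group) THEN (test variable ∈ acceptance region)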

9. Statistical data editing – edit rules
Examples of edit rules:
- Turnover ≥ 0 (non-negativity edit, hard)
- Profit = Turnover – Total costs (balance edit, hard)
- IF (Size class = “Small”) THEN (0 ≤ Number of employees < 10) (conditional edit, soft)
- IF (Economic activity = “Construction”) THEN (a < Turnover / Number of employees < b) (ratio edit, soft)
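
The first three example edits above can be sketched as executable checks. This is a minimal illustration, not a tool from the presentation: the field names (`turnover`, `total_costs`, `profit`, `size_class`, `employees`) and the example record are assumptions, and the ratio edit is left out because the slide does not give values for the bounds a and b.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]

# (name, kind, check); check(record) is True when the record satisfies the edit.
EDITS: List[Tuple[str, str, Callable[[Record], bool]]] = [
    ("non-negativity", "hard", lambda r: r["turnover"] >= 0),
    ("balance", "hard",
     lambda r: r["profit"] == r["turnover"] - r["total_costs"]),
    # IF size class = "Small" THEN 0 <= number of employees < 10
    ("size class vs. employees", "soft",
     lambda r: r["size_class"] != "Small" or 0 <= r["employees"] < 10),
]


def failed_edits(record: Record) -> List[Tuple[str, str]]:
    """Return (name, kind) for every edit that the record violates."""
    return [(name, kind) for name, kind, check in EDITS if not check(record)]


# Hypothetical record, amounts in thousands of euros.
record = {"size_class": "Small", "employees": 48,
          "turnover": 875, "total_costs": 645, "profit": 230}
print(failed_edits(record))
```

A failed hard edit means the record certainly contains an error; a failed soft edit only marks it for closer inspection.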

10. Statistical data editing – methods
Flow chart: raw microdata → deductive editing → selective editing → manual editing (selected records) or automatic editing (not selected records) → macro-editing → statistical microdata

11. Statistical data editing – methods
Deductive editing
- Directed at systematic errors
- Deterministic detection and amendment
  - if-then rules
  - algorithms
- Examples:
  - unit of measurement errors (e.g. “4,000,000” instead of “4,000”)
  - sign errors (e.g. “–10” instead of “10”)
  - simple typing errors (e.g. “192” instead of “129”)
  - subject-matter specific errors
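
Two of the systematic errors listed above lend themselves to simple if-then rules. The sketch below is one possible formulation, not the rule set used in the presentation: the reference value (for instance last year's figure) and the ratio thresholds 300 and 3000 are assumptions chosen for illustration.

```python
from typing import Tuple


def correct_thousand_error(reported: float, reference: float,
                           low: float = 300.0,
                           high: float = 3000.0) -> Tuple[float, bool]:
    """Divide by 1000 when the reported value is roughly 1000 times a
    reference value. Thresholds are illustrative assumptions.
    Returns (value, was_corrected)."""
    if reference > 0 and low < reported / reference < high:
        return reported / 1000.0, True
    return reported, False


def correct_sign_error(value: float) -> Tuple[float, bool]:
    """Flip the sign of a variable that must be non-negative.
    Returns (value, was_corrected)."""
    if value < 0:
        return -value, True
    return value, False


print(correct_thousand_error(4_000_000, reference=4_100))
print(correct_sign_error(-10))
```

Both functions return a flag alongside the value so that amended items can be traced later.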

13. Statistical data editing – methods
Selective editing
- Prioritise records according to the expected benefit of their manual amendment on target estimates
- Records can be selected as they arrive (input editing)
- Common approach based on score functions
  - Local scores for key target variables, e.g.,
  - Use a global score to summarise local scores (e.g., sum or maximum)
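
The local score formula did not survive in the transcript. A form that is often used in the selective editing literature is the weighted absolute difference between the reported and an anticipated value, divided by an estimate of the total; the sketch below assumes that form, which may differ from the one on the original slide. All record values, the anticipated values and the threshold are hypothetical.

```python
from typing import Dict, List


def local_score(weight: float, reported: float,
                anticipated: float, estimated_total: float) -> float:
    """Assumed local score: weight * |reported - anticipated| / estimated total."""
    return weight * abs(reported - anticipated) / estimated_total


def global_score(local_scores: List[float], how: str = "max") -> float:
    """Summarise local scores by their maximum or their sum."""
    if how == "max":
        return max(local_scores)
    if how == "sum":
        return sum(local_scores)
    raise ValueError("how must be 'max' or 'sum'")


def select_records(records: List[dict], totals: Dict[str, float],
                   threshold: float, how: str = "max") -> List[str]:
    """Return the ids of records whose global score reaches the threshold."""
    selected = []
    for rec in records:
        scores = [local_score(rec["weight"], rec["reported"][v],
                              rec["anticipated"][v], totals[v])
                  for v in totals]
        if global_score(scores, how) >= threshold:
            selected.append(rec["id"])
    return selected


# Hypothetical data: one key target variable, amounts in thousands of euros.
records = [
    {"id": "A", "weight": 10, "reported": {"turnover": 875_000},
     "anticipated": {"turnover": 900}},
    {"id": "B", "weight": 10, "reported": {"turnover": 1_716},
     "anticipated": {"turnover": 1_650}},
]
totals = {"turnover": 1_000_000}
print(select_records(records, totals, threshold=0.05))
```

Records above the threshold go to manual editing; the rest can be left to automatic editing.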

15. Statistical data editing – methods
Manual editing
- Requires:
  - Human editors (subject-matter specialists)
  - Dedicated software (interactive editing)
  - Edit rules (hard and soft)
  - Editing instructions
- Re-contacts with businesses are sometimes used
- Important as a source for improvements in future rounds of a repeated survey

17. Statistical data editing – methods
Automatic editing
- Obtain consistent micro-data for non-influential records
- Paradigm of Fellegi and Holt (1976): Data should be made consistent with the edit rules by changing the fewest possible (weighted) number of items.
- Leads to error localisation as a mathematical optimisation problem
- Imputation of new values as a separate step
- Requires:
  - (Hard) edit rules
  - Dedicated software (e.g. Banff by Statistics Canada; SLICE by Statistics Netherlands; R package editrules)
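
To make the Fellegi-Holt paradigm concrete, the sketch below solves the error localisation problem by brute force on a toy record. It is only an illustration under strong assumptions: every variable takes integer values in a small finite domain, the weights are all 1, and the record is hypothetical. The software named above relies on far more efficient optimisation techniques.

```python
from itertools import combinations, product
from typing import Callable, Dict, Iterable, Optional, Set

Record = Dict[str, int]

# Hard edits: each function returns True when the record satisfies the edit.
EDITS: Dict[str, Callable[[Record], bool]] = {
    "turnover >= 0": lambda r: r["turnover"] >= 0,
    "total_costs >= 0": lambda r: r["total_costs"] >= 0,
    "profit = turnover - total_costs":
        lambda r: r["profit"] == r["turnover"] - r["total_costs"],
}


def localise_errors(record: Record,
                    edits: Dict[str, Callable[[Record], bool]],
                    domains: Dict[str, Iterable[int]],
                    weights: Dict[str, float]) -> Optional[Set[str]]:
    """Return a minimum-weight set of variables that can be changed so that
    all edits hold, or None if no such set exists within the domains."""
    variables = list(record)
    subsets = [s for k in range(len(variables) + 1)
               for s in combinations(variables, k)]
    subsets.sort(key=lambda s: sum(weights[v] for v in s))
    for subset in subsets:
        # Try every combination of new values for the freed variables.
        for values in product(*(domains[v] for v in subset)):
            candidate = {**record, **dict(zip(subset, values))}
            if all(check(candidate) for check in edits.values()):
                return set(subset)
    return None


# Hypothetical record with a sign error in turnover.
record = {"turnover": -100, "total_costs": 60, "profit": 40}
domains = {v: range(-200, 201) for v in record}
weights = {v: 1.0 for v in record}
print(localise_errors(record, EDITS, domains, weights))
```

The function only identifies which items to change; choosing the new values is left to a separate imputation step, as on the slide.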

19. Statistical data editing – methods
Macro-editing
- Also known as output editing
- Same purpose as selective editing
- Uses data from all available records at once
- Aggregate method:
  - Compute high-level aggregates
  - Check their plausibility
  - Drill down to suspicious lower-level aggregates
  - Eventually: drill down to suspicious individual records
  - Feedback to manual editing
- Graphical aids (scatter plots, etc.) to find outliers

20. Statistical data editing – modules
Modules in the handbook:
- Main theme module
- Deductive editing
- Selective editing
- Automatic editing
- Manual editing
- Macro-editing
- Editing administrative data
- Editing for longitudinal data

21. Imputation
Overview:
- Missing data
- Imputation methods
- Special topics
- Modules in the handbook

22. Imputation – missing data
Missing data may occur because of
- Logical reasons
  - A particular question does not apply to a particular unit
- Unit non-response
  - No data observed at all for a particular unit
- Item non-response
  - Unit is not able to answer a particular question
  - Unit is not willing to answer a particular question
- Editing
  - Originally observed value discarded during automatic editing

23. Imputation – missing data
Imputation: filling in new (estimated) values for data items that are missing
- Commonly used for missing data due to item non-response and editing
- Obtain a completed micro-data file prior to estimation
  - Simplifies the estimation step
  - Prevents inconsistencies in the output

24. Imputation – methods
- Deductive imputation
- Model-based imputation
- Donor imputation
Assumption: all observed values are correct
- Imputation applied after error localisation

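25. Imputation – methods
Deductive imputation
- Derive (rather than estimate) missing values from observed values based on
  - logical relations (edit rules)
  - substantive imputation rules
- Can be very useful as a first imputation step

Observed data (- = missing):
ID | turnover (sales) | turnover (services) | turnover (other) | turnover (total)
1001 | 154 | 10 | - | 166
1002 | 147 | - | - | 147

After imputing record 1001 (components must add up to the total):
ID | turnover (sales) | turnover (services) | turnover (other) | turnover (total)
1001 | 154 | 10 | 2 | 166
1002 | 147 | - | - | 147

After imputing record 1002 (components must add up to the total and be non-negative):
ID | turnover (sales) | turnover (services) | turnover (other) | turnover (total)
1001 | 154 | 10 | 2 | 166
1002 | 147 | 0 | 0 | 147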
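
The two deductions in this example (records 1001 and 1002) follow from a balance edit combined with non-negativity. A minimal sketch of that reasoning, assuming missing values are stored as `None` and using hypothetical field names:

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[int]]


def impute_from_balance(record: Record, components: List[str],
                        total: str) -> Record:
    """Fill in missing values implied by sum(components) == total
    with all components >= 0. Values that cannot be derived stay None."""
    rec = dict(record)
    missing = [c for c in components if rec[c] is None]
    if rec[total] is None:
        if not missing:  # all components observed: the total is their sum
            rec[total] = sum(rec[c] for c in components)
        return rec
    remainder = rec[total] - sum(rec[c] for c in components
                                 if rec[c] is not None)
    if len(missing) == 1 and remainder >= 0:
        rec[missing[0]] = remainder  # the only value that balances
    elif len(missing) > 1 and remainder == 0:
        for c in missing:  # non-negative values summing to 0 must all be 0
            rec[c] = 0
    return rec


COMPONENTS = ["sales", "services", "other"]
r1001 = {"sales": 154, "services": 10, "other": None, "total": 166}
r1002 = {"sales": 147, "services": None, "other": None, "total": 147}
print(impute_from_balance(r1001, COMPONENTS, "total"))
print(impute_from_balance(r1002, COMPONENTS, "total"))
```

When several components are missing and the remainder is positive, nothing can be derived and the values are left for a later model-based or donor imputation step.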

26. Imputation – methods
Model-based imputation
- Imputations based on a predictive model
- Model fitted on the observed data, then used to impute the missing data

27. Imputation – methods
Model-based imputation
Special cases:
- Mean imputation
  - Model: [...], with [...]
  - Imputed value: [...]
- Ratio imputation
  - Model: [...], with [...]
  - Imputed value: [...]
- (Linear) regression imputation
  - Model: [...]
  - Imputed value: [...]
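
The formulas on this slide did not survive in the transcript. For orientation, the textbook forms of the three special cases are commonly written as below. The notation is an assumption and may differ from the original slide: y_i is the target variable, x_i an auxiliary variable (a vector in the regression case), "obs" the set of units with y observed, and a tilde marks an imputed value.

```latex
% Assumed notation; not recovered from the original slide.
\begin{align*}
&\text{Mean imputation:}
  & y_i &= \mu + \varepsilon_i, \quad E(\varepsilon_i) = 0,
  & \tilde{y}_i &= \bar{y}_{\mathrm{obs}}
      = \frac{1}{n_{\mathrm{obs}}} \sum_{k \in \mathrm{obs}} y_k \\
&\text{Ratio imputation:}
  & y_i &= R\, x_i + \varepsilon_i, \quad E(\varepsilon_i) = 0,
  & \tilde{y}_i &= \hat{R}\, x_i, \quad
      \hat{R} = \frac{\sum_{k \in \mathrm{obs}} y_k}{\sum_{k \in \mathrm{obs}} x_k} \\
&\text{Regression imputation:}
  & y_i &= \alpha + \beta^{\top} x_i + \varepsilon_i, \quad E(\varepsilon_i) = 0,
  & \tilde{y}_i &= \hat{\alpha} + \hat{\beta}^{\top} x_i
\end{align*}
```

Mean imputation is the regression model without auxiliary variables, and ratio imputation is a regression through the origin with a single auxiliary variable.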

28. Imputation – methods
Model-based imputation
- Choice of model depends on intended use of data
  - Estimating means and totals: mean or ratio imputation may be sufficient
  - General purpose micro-data: important to model relationships
- Multivariate model-based imputation
  - Multivariate regression imputation (joint model for all variables)
  - Sequential regression / chained equations (separate model for each variable, conditional on the other variables)

29. Imputation – methods
Donor imputation
- Missing values imputed by ‘borrowing’ observed values from other (similar) units
  - Unit with observed value: donor
  - Unit with missing value: recipient
- Hot deck: donor and recipient in the same data file

30. Imputation – methods
Donor imputation
Special cases:
- Random hot deck imputation
  - Donor selected at random (within classes)
  - Use auxiliary variables to define imputation classes
- Nearest-neighbour imputation
  - Donor selected with minimal distance to recipient
  - Use auxiliary variables to define distance
- Predictive mean matching
  - Special case of nearest-neighbour imputation
  - Distance based on predicted values from a regression model
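
Nearest-neighbour imputation can be sketched in a few lines. The example assumes a single numeric auxiliary variable with absolute difference as the distance, and all data are made up; with several auxiliary variables a proper distance function would be needed.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def nearest_neighbour_impute(records: List[Record], target: str,
                             aux: str) -> List[Record]:
    """Impute missing `target` values (None) with the value of the donor
    whose `aux` value is closest to that of the recipient."""
    donors = [r for r in records if r[target] is not None]
    if not donors:
        raise ValueError("no donors available")
    completed = []
    for rec in records:
        rec = dict(rec)
        rec["imputed"] = rec[target] is None  # flag imputed values
        if rec["imputed"]:
            donor = min(donors, key=lambda d: abs(d[aux] - rec[aux]))
            rec[target] = donor[target]
        completed.append(rec)
    return completed


# Hypothetical data: turnover is missing for the third unit.
data = [
    {"employees": 4, "turnover": 600},
    {"employees": 36, "turnover": 4_900},
    {"employees": 40, "turnover": None},
]
for row in nearest_neighbour_impute(data, target="turnover", aux="employees"):
    print(row)
```

Replacing the distance on `aux` by a distance between regression predictions of `target` would turn this into predictive mean matching.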

31. Imputation – special topics
Choice of method/model/auxiliary variables
- General problem in multivariate analysis
- Auxiliary variables should explain
  - the target variable(s)
  - the missing data mechanism
- Compare model fit among item respondents
  - Can be misleading (“imputation bias”)
- Simulation experiments with historical data

32. Imputation – special topics
Imputation for longitudinal data
- Repeated cross-sectional surveys
- Panel studies
Special imputation methods for longitudinal data
- Last observation carried forward
- Interpolation
- Extrapolation
- Little and Su method
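
The first two longitudinal methods in this list are simple enough to sketch directly. The example assumes one unit observed at equally spaced time points, with `None` marking a missing value; the series itself is made up.

```python
from typing import List, Optional

Series = List[Optional[float]]


def locf(series: Series) -> Series:
    """Last observation carried forward; leading gaps stay missing."""
    completed, last = [], None
    for value in series:
        if value is not None:
            last = value
        completed.append(last)
    return completed


def interpolate(series: Series) -> Series:
    """Linear interpolation between neighbouring observed values.
    Gaps before the first or after the last observation stay missing."""
    completed = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            step = (series[right] - series[left]) * (i - left) / (right - left)
            completed[i] = series[left] + step
    return completed


turnover = [10, None, None, 16, None]
print(locf(turnover))
print(interpolate(turnover))
```

The trailing gap is filled by last observation carried forward but not by interpolation; filling it from the trend would be extrapolation.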

33. Imputation – special topics
Imputations are estimates
- Imputed values should be flagged
Variance estimation with imputed data
- Variance likely to be underestimated when…
  - …imputations are treated as observed values
  - …model predictions are imputed without a disturbance term
  - …single imputation is used
- Alternative approach: multiple imputation
  - Not often used in official statistics (yet)
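
A small made-up example shows why treating imputations as observed values understates variability: mean imputation adds values with zero deviation from the mean, so the sample variance of the completed data is smaller than that of the observed data. The function also returns flags, in line with the advice to mark imputed values.

```python
from statistics import mean, variance
from typing import List, Optional, Tuple


def mean_impute(values: List[Optional[float]]) -> Tuple[List[float], List[bool]]:
    """Replace missing values (None) by the mean of the observed values.
    Returns the completed data and a flag per value (True = imputed)."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    completed = [m if v is None else v for v in values]
    flags = [v is None for v in values]
    return completed, flags


values = [2, 4, 6, 8, None, None]
completed, flags = mean_impute(values)
observed = [v for v in values if v is not None]

print("variance, observed values only:", variance(observed))
print("variance, imputations treated as observed:", variance(completed))
```

Adding a random disturbance term to each imputed value, or using multiple imputation, are ways to counter this shrinkage.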

34. Imputation – special topics
Imputed values may be invalid/inconsistent
- Examples:
  - Turnover = –100 (invalid)
  - Labour costs = 0, Number of employees = 15 (inconsistent)
- Need not be a problem for estimating aggregates
- Can be a problem if micro-data are distributed further
Imputation under edit constraints
- One-step method: constrained imputation model
- Two-step method: imputation followed by data reconciliation
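
One simple way to carry out the reconciliation step of the two-step method is to rescale the imputed values proportionally so that a balance edit holds, while leaving observed values untouched. This is only one possible reconciliation technique, chosen here for illustration; the field names and figures are hypothetical.

```python
from typing import Dict, List


def prorate(record: Dict[str, float], components: List[str], total: str,
            imputed: List[str]) -> Dict[str, float]:
    """Rescale the imputed components so that all components sum to the
    observed total. Observed components and the total are not changed."""
    rec = dict(record)
    fixed = sum(rec[c] for c in components if c not in imputed)
    target = rec[total] - fixed          # what the imputed values must sum to
    current = sum(rec[c] for c in imputed)
    if target < 0 or current <= 0:
        raise ValueError("cannot reconcile by proportional adjustment")
    factor = target / current
    for c in imputed:
        rec[c] = rec[c] * factor
    return rec


# Hypothetical record: services and other were imputed and break the balance edit
# (154 + 9 + 6 = 169, but the observed total is 166).
record = {"sales": 154, "services": 9, "other": 6, "total": 166}
adjusted = prorate(record, ["sales", "services", "other"], "total",
                   imputed=["services", "other"])
print(adjusted)
```

Proportional adjustment keeps non-negative imputations non-negative, which is one reason it is a convenient choice for balance edits.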

35. Imputation – modules
Modules in the handbook:
- Main theme module
- Deductive imputation
- Model-based imputation
- Donor imputation
- Imputation for longitudinal data
- Little and Su method
- Imputation under edit constraints

36. Thank you for your attention!

37. References
- EDIMBUS (2007), Recommended Practices for Editing and Imputation in Cross-Sectional Business Surveys.
- Fellegi, I.P. and D. Holt (1976), A Systematic Approach to Automatic Edit and Imputation. Journal of the American Statistical Association 71, pp. 17–35.
- Granquist, L. (1997), The New View on Editing. International Statistical Review 65, pp. 381–387.