Preferred Capabilities of Record Linkage Systems for Facilitating Research Record Linkage

Uploaded On 2023-12-30


Krista Park, U.S. Census Bureau, Center for Optimization and Data Science. Presentation for FEDCASIC 2023.




Presentation Transcript

1. Preferred Capabilities of Record Linkage Systems for Facilitating Research Record Linkage
Krista Park
U.S. Census Bureau
Center for Optimization and Data Science
Presentation for FEDCASIC 2023

2. Disclaimers
This presentation is released to inform interested parties of research and research requirements and to encourage discussion of a work in progress. Any views expressed are those of the authors and not those of the U.S. Census Bureau. Further, although all of the contributing authors for the report upon which this presentation is based are credited for their large efforts on the project, the contents of this presentation are solely those of the presenter. Not all other contributors may have had the opportunity to do a deep review of this presentation.
This presentation includes no content derived from restricted data sets.

3. Researcher Stakeholder Team / Co-Authors
Author | Name | Directorate | Center/Division | Field
* | Mishal Ahmed | R&M | CES | Economist
* | Glenn Ambill | Decennial | DITD | IT Specialist (APPSW)
* | John Cuffe | DEPDIR | DEPDIR | Survey Statistician / MAMBA
* | Khoa Dong | ECON | ESMD | Mathematical Statistician
* | Suzanne Dorinski | R&M | CES | Survey Statistician
* | Juan C. Humud | ECON | ERD | Survey Statistician / PVS Team
* | Shawn Klimek | R&M | CES | Economist
* | Daniel Moshinsky | Decennial | ADDC | IT Specialist (APPSW)
* | Kevin Shaw | Decennial | DITD | Mathematical Statistician / PEARSIS
* | Damon R. Smith | ECON | ERD | Mathematical Statistician / PVS Team
* | Yves Thibaudeau | R&M | CSRM | Research Mathematical Statistician
* | Christine Tomaszewski | ECON | ERD | Mathematical Statistician / PVS Team
* | Victoria Udalova | DEMO | ADDP | Economist / EHEALTH
* | Daniel Weinberg | R&M | CSRM | Research Mathematical Statistician
* | Daniel Whitehead | ECON | ESMD | Mathematical Statistician

4. Coordination, Leadership Oversight, Computer Environment Support, Program Management Support
Author | Name | Directorate | Center/Division | Field
* | Krista Park | R&M | Center for Optimization and Data Science (CODS) |
* | Casey Blalock | R&M | Center for Economic Studies (CES) | Statistician (Data Scientist)
* | Steven Nesbit | Contractor | OCIO / ADRM |
  | J. David Brown | R&M | Center for Economic Studies (CES) | Economist
  | Kristee Camilletti | R&M | ADRM | Contracts / Acquisitions
  | Jaya Damineni | R&M | Center for Optimization and Data Science (CODS) | CODS ADC - Software Engineering
  | Ken Haase | R&M | ADRM | ST – Computer Science
  | Anup Mathur | R&M | Center for Optimization and Data Science (CODS) | Chief, CODS
  | Vincent T. Mule | Decennial | DSSD | Mathematical Statistician
And thank you to the many people I didn’t list.

5. Outline of Linkage Process

6. Research v. Production Record Linkage
Research
- Variety of data set formats / schema
- Easy to experiment with multiple blocking & matching algorithms
- Easy to change settings
- Detailed metadata for both the matching & computational systems
Production
- Restricted data set formats / schema
- Able to do planned blocking & matching algorithm
- Metadata required for downstream production

7. Requirements Gathering & Refinement Phases
- Current State Assessment (March 2021 Start)
- Elicit Capability Requirements and Weights
- Demonstration
- Gating
- Technical Solutions Assessment (TSA) Scoring and Selection
- POCs / QaQI and Score Update
- TSA Model and Results
- Findings

8. Phase 1: Current State
Repository of Census Bureau Record Linkage
- Software
- Research Papers
- Internal Reports
External Analysis
- List of commercial, academic, and open source record linkage solutions
- Reviewed reports by research and advisory companies or organizations such as Forrester, Gartner, USDA, and MIT

9. Phase 2: Elicit Capability Requirements, Criteria and Weights
- The Subject Matter Experts (SMEs) were divided into five teams.
- Each team met for 3-6 requirements-gathering workshops. Each workshop lasted approximately 2-3 hours.
- The groups were then shuffled into three teams, which met for 3-hour workshops (over 10 workshops in total) to refine the requirements and finalize criteria and weights for each requirement.
- Finally, the entire group met to walk through and approve the requirements in another 3-hour workshop.
- Total of over 35 sessions

10. Capability Requirements Categories and Topics
Categories and capability requirement topics (with number of capability requirements per topic):
Technical (182)
- Data Handling (81)
- Outputs/Results (15)
- Indexing, Blocking, Clustering or Equivalent (16)
- Use Cases (6)
- Field Comparison (32)
- Technical Risk (1)
- Matching / Classification (31)
Operations (102)
- Configurability (16)
- Operability (21)
- Supportability (13)
- Diagnosability (7)
- Reportability (26)
- Operational Risk (1)
- Maintainability (3)
- Monitorability (6)
- Manageability (1)
- Serviceability (8)
Performance (58)
- Adaptability (1)
- Scalability (12)
- QaQI Quality Metrics and Experience (20)*
- Elasticity (2)
- Response Time (5)
- Risk (2)
- Integration (5)
- Accuracy (2)
- Interoperability (3)
- Performance Metrics (6)
Cost Factors (9)
- Enterprise Licensing Agreement (1)
- Source Code Availability (1)
- Cost Factor Risk (1)
- Solution Price Model (1)
- Commitment Requirement (1)
- Training Price Model (1)
- Hidden Costs (1)
- IT Implementation and Support Costs (1)
- Maintenance and Support Costs (1)
User Experience (9)
- Accessibility – 508 (1)
- Product Documentation (1)
- Localization (1)
- Technical Documentation (1)
- Usability (4)
- Training (1)
Security (18)
- Resilience (1)
- Cybersecurity (11)
- Compliance (6)
Total number of Capability Requirements: 378
Requirements are organized into 6 categories aligned with the Census Bureau Technical Solutions Assessment (TSA) Framework published by the Chief Technology Office (CTO)/Office of Systems Engineering (OSE). The categories were used to build out a comprehensive set of topics within each category encapsulating the Records Linkage Capability Requirements.
*Developed in a later project phase
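As a quick consistency check on the counts in this slide, the per-topic numbers can be summed per category and overall. The sketch below only transcribes the slide's figures; the variable and function names are illustrative.

```python
# Topic counts transcribed from the slide: category -> {topic: number of requirements}.
CATEGORIES = {
    "Technical": {
        "Data Handling": 81, "Outputs/Results": 15,
        "Indexing, Blocking, Clustering or Equivalent": 16, "Use Cases": 6,
        "Field Comparison": 32, "Technical Risk": 1, "Matching / Classification": 31,
    },
    "Operations": {
        "Configurability": 16, "Operability": 21, "Supportability": 13,
        "Diagnosability": 7, "Reportability": 26, "Operational Risk": 1,
        "Maintainability": 3, "Monitorability": 6, "Manageability": 1,
        "Serviceability": 8,
    },
    "Performance": {
        "Adaptability": 1, "Scalability": 12,
        "QaQI Quality Metrics and Experience": 20, "Elasticity": 2,
        "Response Time": 5, "Risk": 2, "Integration": 5, "Accuracy": 2,
        "Interoperability": 3, "Performance Metrics": 6,
    },
    "Cost Factors": {
        "Enterprise Licensing Agreement": 1, "Source Code Availability": 1,
        "Cost Factor Risk": 1, "Solution Price Model": 1,
        "Commitment Requirement": 1, "Training Price Model": 1, "Hidden Costs": 1,
        "IT Implementation and Support Costs": 1, "Maintenance and Support Costs": 1,
    },
    "User Experience": {
        "Accessibility - 508": 1, "Product Documentation": 1, "Localization": 1,
        "Technical Documentation": 1, "Usability": 4, "Training": 1,
    },
    "Security": {"Resilience": 1, "Cybersecurity": 11, "Compliance": 6},
}


def category_totals(categories):
    """Sum the topic counts within each category."""
    return {name: sum(topics.values()) for name, topics in categories.items()}


totals = category_totals(CATEGORIES)
grand_total = sum(totals.values())

# The 20 QaQI requirements were developed in a later project phase.
total_before_qaqi = grand_total - CATEGORIES["Performance"]["QaQI Quality Metrics and Experience"]

print(totals)
print(grand_total, total_before_qaqi)
```

The category sums agree with the totals printed on the slide (182, 102, 58, 9, 9, 18; 378 overall). Excluding the 20 later-phase QaQI requirements leaves 358, which appears to be the denominator quoted on the Condensed TSA slide (194/358).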

11. Phase 3: Demonstrations
- 3-hour presentations on commercial, open source, and internal record linkage solutions, consisting of:
  - 90-minute demonstration
  - 45 minutes of answers to distributed questions
  - 45 minutes of interactive Q&A plus suggestions/recommendations
- Demonstrations used simulated data generated by the febrl data generator (200k records in the original file; 300k records in the duplicate file)
- Demonstrations were conducted explicitly NOT as part of an acquisition, with several layers of protection to ensure that Census Bureau staff who participate in future acquisitions in this zone will not get the results from these demonstrations.
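To illustrate the kind of simulated input described above (an original file plus a file of error-laden duplicates), here is a minimal standard-library sketch. It is not the febrl generator and does not reproduce its options or error model; the field names, name lists, and the `corrupt` and `make_files` helpers are all hypothetical, and the sizes are scaled down from 200k/300k to 200/300.

```python
import random


def corrupt(value, rng):
    """Attempt one random single-character edit: delete, substitute, or transpose.

    The edit can occasionally leave the value unchanged (for example when two
    identical neighbouring characters are transposed).
    """
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    op = rng.choice(["delete", "substitute", "transpose"])
    if op == "delete":
        return value[:i] + value[i + 1:]
    if op == "substitute":
        return value[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + value[i + 1:]
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]


def make_files(n_original, n_duplicate, seed=0):
    """Build an 'original' file and a 'duplicate' file with known true links."""
    rng = random.Random(seed)
    given_names = ["james", "maria", "chen", "omar", "sofia", "anna"]  # hypothetical
    surnames = ["smith", "garcia", "nguyen", "johnson", "patel", "kim"]  # hypothetical
    originals = [
        {
            "rec_id": f"org-{i}",
            "given": rng.choice(given_names),
            "surname": rng.choice(surnames),
            "birth_year": str(rng.randint(1930, 2010)),
        }
        for i in range(n_original)
    ]
    duplicates = []
    for j in range(n_duplicate):
        source = rng.choice(originals)
        dup = dict(source)
        dup["rec_id"] = f"dup-{j}"
        dup["true_id"] = source["rec_id"]  # ground truth kept for later scoring
        field = rng.choice(["given", "surname", "birth_year"])
        dup[field] = corrupt(dup[field], rng)
        duplicates.append(dup)
    return originals, duplicates


originals, duplicates = make_files(200, 300, seed=42)
print(len(originals), len(duplicates))
print(duplicates[0])
```

Keeping the `true_id` on each duplicate is what makes it possible to compute quality metrics on the linkage output later.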

12. Phase 4: Gating / Phase 5: Technical Solutions Assessment (TSA) Scoring and Selection
- 43 key capabilities were identified. These were used to assess the entire pool of candidate software packages and eliminate (gate) those that wouldn’t meet the key requirements.
- Remaining packages were scored against the entire TSA.
Lessons Learned
- Small commercial solutions, internal solutions, and open-source solutions were primarily built for the specific purpose of performing records linkage at a component level, rather than attempting to deliver end-to-end enterprise capabilities as larger commercial solutions do.
- A large, complex, simulated data set is needed to enable this type of product research. The simulated data set used was too small to fully evaluate the products.
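The two-step process described above (gate on key capabilities, then score the survivors against weighted requirements) can be sketched as follows. This is only an illustration under stated assumptions: the TSA's actual scoring scale and weights are not given in the presentation, so the 0-5 scale, the weights, the tool names, and the scores are all made up. The capability names are borrowed from the gating topics listed in the presentation.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    name: str
    capabilities: set  # key capabilities the solution satisfies
    scores: dict       # requirement -> score (a 0-5 scale is assumed here)


def gate(solutions, key_capabilities):
    """Split solutions into those meeting every key capability and those eliminated."""
    passed, eliminated = [], {}
    for sol in solutions:
        missing = key_capabilities - sol.capabilities
        if missing:
            eliminated[sol.name] = sorted(missing)  # record why it was gated out
        else:
            passed.append(sol)
    return passed, eliminated


def weighted_score(solution, weights):
    """Weighted average of a solution's scores; unscored requirements count as 0."""
    total_weight = sum(weights.values())
    return sum(w * solution.scores.get(req, 0) for req, w in weights.items()) / total_weight


# Hypothetical inputs for illustration only.
KEY_CAPABILITIES = {"Logging", "Support Deduplication", "Process Missing Values"}
WEIGHTS = {"Field Comparison": 3, "Scalability": 2, "Usability": 1}

solutions = [
    Solution("tool_a", {"Logging", "Support Deduplication", "Process Missing Values"},
             {"Field Comparison": 4, "Scalability": 5, "Usability": 2}),
    Solution("tool_b", {"Support Deduplication", "Process Missing Values"},
             {"Field Comparison": 5, "Scalability": 5, "Usability": 5}),
    Solution("tool_c", {"Logging", "Support Deduplication", "Process Missing Values"},
             {"Field Comparison": 3, "Scalability": 3, "Usability": 5}),
]

passed, eliminated = gate(solutions, KEY_CAPABILITIES)
ranking = sorted(((weighted_score(s, WEIGHTS), s.name) for s in passed), reverse=True)

print(eliminated)
print(ranking)
```

Note that `tool_b` is eliminated despite having the highest raw scores: gating is applied before scoring, so a single missing key capability removes a package from consideration.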

13. Reasons for Initial Gating
Each category lists its gating capability requirement topics, followed by the solutions eliminating criteria.

Technical
- Gating Capability Requirement Topics: Data Formats; Databases Supported; Maintain Original Dataset; Local Storage Locations; Turn Off Built-In Standardizer; Process Missing Values; Isolate/Handle Errors w/o Disruption; Logging; Select/Customize Blocking Variables; Only Non-Traditional Indexing Explains Approach; Support Deduplication; Match a single input to another; Matching a single dataset to another; Matching 1:M datasets; Matching M:M datasets; Select two or more datasets to match; De-Duplication Facility; Identify Best Match Available; Store Possible Matches; Match Quality Indicators; Technical Risk
- Solutions Eliminating Criteria: Data Formats; Databases Supported; Isolate/Handle Errors w/o Disruption; Logging; Maintain Original Dataset; Turn Off Built-In Standardizer; Technical Risk

Operations
- Gating Capability Requirement Topics: Census Bureau Approved Operating System; Commercial Software Ability to Obtain SWG Approval; Operational Risk (e.g., Stability)
- Solutions Eliminating Criteria: Operational Risk

Performance
- Gating Capability Requirement Topics: Enterprise Standards Profile Ability to be Compatible; Performance Risk (e.g., Stability)
- Solutions Eliminating Criteria: Performance Risk

Cost Factors
- Gating Capability Requirement Topics: Solution Pricing Model; Commitment Requirement; Hidden Costs; Maintenance and Support Costs; Cost Factor Risk (e.g., No Commercial Pricing, No U.S. Sales)
- Solutions Eliminating Criteria: Solution Pricing Model; Cost Factor Risk

User Experience
- Gating Capability Requirement Topics: Localization; Minimal Skillset Level; Overall Product Documentation; Technical Documentation
- Solutions Eliminating Criteria: Technical Documentation

Security
- Gating Capability Requirement Topics: Internet Connections; Security Monitoring; Enterprise Certificate Authority; RDP or SSH Support; Cryptography; Secure Sockets Layer (SSL); Transport Layer Security (TLS); FedRAMP
- Solutions Eliminating Criteria: Solution National Origin of Solution

14. Phase 5 (con’t): Condensed TSA
- SMEs realized that Census internal and open-source solutions were not intended to result in end-to-end solutions (the model the TSA was designed for).
- Created a subset of the Capability Requirements (194/358) focused on Records Linkage engines, better suited to this set of solutions.
- Team recommendations for the hands-on test were informed by, not dictated by, the scores. Considerations:
  - Number of solutions achievable within the schedule
  - Is there value add for more algorithms or features in Records Linkage in Python?
  - The capability to process the number of records defined in the QaQI Use Cases
  - User base for the solution vs. the language
  - Can the solution run on Spark?
  - Relative speed of the solution
  - Ability to change solution code
  - Amount of debugging required to support the QaQI
  - QaQI team skillset

15. Phase 6: QAQI and Score Updates
- Replaced a proof-of-concept test involving commercial projects due to logistical hurdles
- Hands-on testing with only internal and open source packages
- Cloud computing environment
- Mix of Decennial Census & business use cases
- Results documented:
  - Quality metrics calculations
  - Revisions to existing TSA Capability Requirements
  - 20 new QaQI Quality Metrics and Experience Capability Requirements defined and scored
  - User experience write-ups
- QAQI results evaluation: forthcoming paper by Yves Thibaudeau
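The presentation does not say which quality metrics were calculated. As an assumption, the sketch below uses pairwise precision, recall, and F1, which are commonly computed for record linkage output when true match status is known (as it is for simulated data). The function name and the example pairs are hypothetical.

```python
def linkage_metrics(predicted_pairs, true_pairs):
    """Pairwise precision, recall, and F1 for a set of predicted links.

    Both arguments are iterables of (id_in_file_a, id_in_file_b) tuples.
    """
    predicted = set(predicted_pairs)
    truth = set(true_pairs)
    tp = len(predicted & truth)   # links that are real matches
    fp = len(predicted - truth)   # links that are not real matches
    fn = len(truth - predicted)   # real matches that were missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: four true matches, three predicted links, two of them correct.
true_pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
predicted_pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]

metrics = linkage_metrics(predicted_pairs, true_pairs)
print(metrics)
```

In this example precision is 2/3 and recall is 1/2: one of the three predicted links is wrong, and two of the four true matches were missed.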

16. Phase 7: TSA Model and Results
- Viable commercial options exist that meet current requirements
- Commercial options are often pipelines (they are user friendly, with GUIs)
- Open source options are less frequently complete pipelines
- Performance and quality benchmarks were developed as part of the QAQI effort
- Internal and open source solutions that passed the gating review were benchmarked using the process developed during the QAQI effort

17. Phase 8: Findings
- Census has emerging requirements that aren’t met by existing solutions
- Census users often need a complete pipeline and would prefer a GUI
- Users want:
  - A broad range of data transformation / standardization tools built in
  - A variety of different (1) blocking and (2) linking algorithm options available within the tool
  - Native support for multiprocessing, multi-core processors, and threading
  - Realtime monitoring of the environment and resources, including load history, activity, errors, workflows, and services, including triggering actions when criteria are met
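To make the request for interchangeable blocking and linking options concrete, here is a minimal standard-library sketch in which both steps are plain functions that can be swapped independently. It is not drawn from any of the evaluated tools; the record fields, function names, and the 0.85 similarity threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


# --- Blocking options: each maps a record to a block key. ---
def block_by_surname_initial(rec):
    return rec["surname"][:1].lower()


def block_by_birth_year(rec):
    return rec["birth_year"]


# --- Linking options: each decides whether two records match. ---
def exact_match(a, b):
    return (a["given"].lower() == b["given"].lower()
            and a["surname"].lower() == b["surname"].lower())


def fuzzy_match(a, b, threshold=0.85):
    name_a = f'{a["given"]} {a["surname"]}'.lower()
    name_b = f'{b["given"]} {b["surname"]}'.lower()
    return SequenceMatcher(None, name_a, name_b).ratio() >= threshold


def link(file_a, file_b, block_key, matcher):
    """Compare only records that share a block key; return matched id pairs."""
    index = defaultdict(list)
    for rec in file_b:
        index[block_key(rec)].append(rec)
    return [(a["id"], b["id"])
            for a in file_a
            for b in index.get(block_key(a), [])
            if matcher(a, b)]


# Hypothetical records for illustration.
file_a = [
    {"id": "a1", "given": "Maria", "surname": "Garcia", "birth_year": "1980"},
    {"id": "a2", "given": "Jon", "surname": "Smith", "birth_year": "1975"},
]
file_b = [
    {"id": "b1", "given": "Maria", "surname": "Garcia", "birth_year": "1980"},
    {"id": "b2", "given": "John", "surname": "Smith", "birth_year": "1975"},
    {"id": "b3", "given": "Omar", "surname": "Patel", "birth_year": "1980"},
]

exact_links = link(file_a, file_b, block_by_surname_initial, exact_match)
fuzzy_links = link(file_a, file_b, block_by_birth_year, fuzzy_match)
print(exact_links)
print(fuzzy_links)
```

Because `link` takes the blocking and matching functions as arguments, trying a different combination is a one-line change, which is the kind of easy experimentation the research use case calls for.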