AHMED K ELMAGARMID PURDUE UNIVERSITY WEST LAFAYETTE IN Senior member IEEE PANAGIOTIS G IPEIROTIS LEONARD N STERN SCHOOL OF BUSINESS NEW YORK NY Member IEEE computer security VASSILIOS S VERYKIOS ID: 443033
Download Presentation The PPT/PDF document "Duplicate record detection" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
Slide1
Duplicate record detection
AHMED K. ELMAGARMID
PURDUE UNIVERSITY, WEST LAFAYETTE,
IN
Senior member, IEEE
PANAGIOTIS G. IPEIROTIS
LEONARD N. STERN SCHOOL OF BUSINESS, NEW YORK,
NY
Member, IEEE computer security
VASSILIOS S. VERYKIOS
UNIVERSITY OF THESSALY, VOLOS,
GREECE
Member ,IEEE computer security.
PRESENTED BY
SHILPA
MURTHY Slide2
INTRODUCTION TO THE PROBLEM
D
atabases
play an important role in today’s IT based economy
Many businesses and organizations depend on the quality of data(or the lack thereof) stored in the databases.
Any discrepancies in the data can have significant cost implications to a system that relies on information to function.Slide3
DATA QUALITY
Data are not carefully controlled for quality nor defined in a consistent way across different data sources, thus data quality is compromised due to many factors .//examples
D
ata errors.Ex:
M
icrosft
instead of Microsoft
Integrity errors. Ex:
EmployeeAge
=567
Multiple conventions for information.Ex: 44 W.4
th
street and 44 west fourth street.Slide4
DATA HETEROGENEITY
While integrating data from different sources into a warehouse , organizations become aware potential systematic differences and these problems and conflicts fall under a umbrella term called as “DATA HETEROGENEITY”.
Two types of heterogeneity can be distinguished: Structural heterogeneity and lexical heterogeneity.Slide5
DATA QUALITY
Data cleaning refers to the process of resolving identification problems in the data.
Structural heterogeneity
Different record structure
Addr
versus
City, State, and Zip code [1]
Lexical heterogeneity
Identical record structure, but data is different
44 W. 4
th
St.
versus
44 West Fourth Street [1]Slide6
TERMINOLOGYSlide7
DATA PREPARATION
Step before the duplicate record detection.
Improves the quality of the data
Makes data more comparable and more usable.
Data preparation stage includes three steps.Slide8
STEPS IN DATA PREPARATION
Parsing
Data transformation
StandardizationSlide9
PARSING
Locates, identifies and isolates individual data elements
Makes it easier to correct, standardize and match data
Comparison of individual components rather than complex strings
For example, the appropriate parsing of the name and address components into consistent packets is a very important step.Slide10
DATA TRANSFORMATION
Simple conversions of data type
Field renaming
Decoding field values
Range
checking: involves
examining data in a field to ensure that it falls within the expected range ,usually a numeric or date range
Dependency
checking
: is slightly more complex kind of data transformation where we check the values in a particular field to the values in another field to ensure minimal level of consistency in data Slide11
DATA STANDARDIZATION
Represent certain fields in a standard format
Addresses
US Postal Service Address Verification
tool
Date
and time formatting
Names (first, last, middle, prefix, suffix)
TitlesSlide12
LAST STEP IN DATA PREPARATION
Store data in
tables having comparable fields.
Identify fields suitable for
comparison
Not foolproof
Data
may still contain
inconsistencies due to misspellings and different conventions to represent dataSlide13
FIELD MATCHING TECHNIQUES
Most common sources of mismatches in database entries is due to typographical errors
The field matching metrics that have been designed to overcome this problem are :
Character –based similarity metrics
Token based similarity metrics
Phonetic similarity metrics
Numeric similarity metricsSlide14
CHARACTER BASED SIMILARITY
Works best on typographical errors
Edit distance
Shortest sequence of edit commands that can transform a string
s
into
t
Three types of edit operations .
I
f (cost =1) this version of edit distance is referred to as the
“
Levenshein
”
distance.
Insert, delete, replace operations
.
Example. S1=“tin” s2= “tan”
We need to replace “I”
to
“A” to convert string s1 to s2.
The edit distance here is 1. because we needed only one operation to convert s1 to s2.Slide15
CHARACTER BASED SIMILARITY
Affine gap distance
Strings
that have been truncated
John R. Smith versus Jonathan Richard Smith
Smith-Waterman distance
Substring matching which ignores the prefix and
suffix
Example: Prof.
John.R.Smith
and
John.R.Smith,Prof
Jaro
distance
Compares first and last name
Q-Grams
Divides string into a series of substrings of length
q
.
E.g.: NELSON
and
NELSEN
are phonetically similar but spelled differently. The q-grams for these words are
NE LS ON
and
NE LS EN .Slide16
TOKEN BASED SIMILARITY
Works best when word (tokens) are transposed
Atomic Strings
Computational average
WHIRL
Weights words based on frequency to determine
similarity
The words in the database have a weight associated with it, which is calculated using a cosine similarity metric.
Example: in a database of company names the words “AT&T” and “IBM” are less frequent than the word “
inc.
”
Similarity of John Smith and
Mr.John
Smith is close to 1.
But the similarity of
comptr
department and
deprtment
of computer is zero since it doesn’t take care of misspelled words.
Q
-Grams
with weighting
Extends WHIRL to handle spelling errorsSlide17
PHONETIC SIMILARITY
Comparison based on how words soundSlide18
NUMERIC SIMILARITY
Considers only numbers
Convert numbers to text data
Simple range queries
Authors provided no insight in this areaSlide19
SUMMARY OF METRICSSlide20
DUPLICATE RECORD DETECTION
The methods described till now have been describing about similarity checking in single fields.
The real life situations consist of multiple fields which have to be checked for
d
uplicate records.Slide21
CATEGORIZING METHODS
Probabilistic approaches and supervised machine learning techniques
Approaches that rely on domain knowledge or
Generic distance metricsSlide22
PROBABILISTIC MATCHINGMODELS
Models derived from Bayes theorem
Use prior knowledge to make decision about current data
set
A tuple pair is assigned to one of the two classes M or U. M class represents(match) same entity, and the U class represents(non-match) different entity.
This can be determined by calculating the probability distribution.
Rule-based decision
tree
If-then-else traversalSlide23
Supervised learning
Relies on the existence of trained data.
The trained data is in the form of record pairs.
These record pairs are labeled matching or not.
SVM approach out performs all the simpler approaches.
The post processing step is to create a graph for all the records linking the matching records.
Records are considered identical using the transitivity relation applied on the connected components.Slide24
ACTIVE LEARNINGSlide25
DISTANCE BASED TECHNIQUES
This method can be used when
there is absence of training data or human effort to create matching models.
Treat
a record as a one long field
Use a distance metric
Best matches are ranked using a weighting algorithm
Alternatively, use a single field
Must be highly discriminatingSlide26
RULE BASED TECHNIQUES
Relies on business rules to derive key
Must determine functional dependencies
Requires subject matter expert to build matching
rulesSlide27
Rule based techniques
This figure depicts the equation theory that dictates the logic of domain equivalence.
It specifies an inference about the similarity of the records.Slide28
UNSUPERVISED LEARNING
Classify data as matched or unmatched without a training
set.
The comparison vector generally depicts which category it belongs to. If it does not then it has to be done manually.
One way to avoid manual labeling is to use the clustering algorithms.
Group together similar comparison vectors.
Each cluster contains vectors with similar characteristics.
By knowing the real class of only few vectors we can infer the class of all the vectors.Slide29
techniques to improve efficiency
Reduce the number of record comparisons
Improve the efficiency of record comparisonSlide30
COMPARATIVE METRICS
Elementary nested loop
Compare every record in one table to another table
Requires A*B comparisons (Cartesian product) which is very expensive
Cost required for a single comparison
Must consider number of fields/recordSlide31
Reduce Record Comparisons
Blocking
Sorted Neighborhood
Clustering and Canopies
Set JoinsSlide32
Blocking
Basic: Compute a hash value for each record
Only compare records in the same bucket
Subdivide files into subsets (blocks)
Soundex
, NYSIIS, or
Metaphone
Drawback
Increases in speed may increase number of false mismatches
Compromise is multiple runs using different blocking fieldsSlide33
Sorted Neighborhood
Create composite key, sort data, merge
Assumption
Duplicate records will be close in sorted system
Highly dependent upon the comparison keySlide34
Clustering and canopies
Clustering
: Duplicate records are kept in a cluster and only the representative of a cluster is kept for future comparisons.
This reduces the total number of record comparisons without compromising the accuracy.
Canopies
: The records are grouped into overlapping clusters called as “canopies” and then the records are compared which lead to better qualitative results.Slide35
SOFTWARE TOOLS
Open Architecture
Freely Extensible Biomedical Record Linkage (FEBRL)
- Python
TAILOR –
MLC++
,
DBGen
WHIRL
– C++
Flamingo Project
- C
BigMatch
- CSlide36
DATABASE TOOLS
Commercial RDBMS
SQL Server 2005 implements “
fuzzy matches
”
Oracle 11g implements these techniques in its
utl_match
package
Levenshtein
Distance
Soundex
Jaro
WinklerSlide37
CONCLUSIONS
Lack of a standardized, large-scale benchmarking data set
Training data is needed to produce matching models
Research diversion
Databases emphasize simple, fast, and efficient techniques
Machine learning and statistics rely on sophisticated techniques and probabilistic models
More synergy is needed among various communities
Detection systems need to be adaptive over time