Duplicate record detection - PowerPoint Presentation

Uploaded by liane-varnes on 2016-08-11

Authors: Ahmed K. Elmagarmid (Purdue University, West Lafayette, IN; Senior Member, IEEE), Panagiotis G. Ipeirotis (Leonard N. Stern School of Business, New York, NY; Member, IEEE Computer Society), Vassilios S. Verykios. ID: 443033




Presentation Transcript

Slide1

Duplicate record detection

AHMED K. ELMAGARMID

PURDUE UNIVERSITY, WEST LAFAYETTE, IN

Senior Member, IEEE

PANAGIOTIS G. IPEIROTIS

LEONARD N. STERN SCHOOL OF BUSINESS, NEW YORK, NY

Member, IEEE Computer Society

VASSILIOS S. VERYKIOS

UNIVERSITY OF THESSALY, VOLOS, GREECE

Member, IEEE Computer Society

PRESENTED BY

SHILPA MURTHY

Slide2

INTRODUCTION TO THE PROBLEM

Databases play an important role in today's IT-based economy.

Many businesses and organizations depend on the quality of the data (or the lack thereof) stored in their databases.

Any discrepancies in the data can have significant cost implications for a system that relies on information to function.

Slide3

DATA QUALITY

Data are not carefully controlled for quality, nor defined in a consistent way across different data sources, so data quality is compromised by many factors. Examples:

Data errors. Ex: "Microsft" instead of "Microsoft"

Integrity errors. Ex: EmployeeAge = 567

Multiple conventions for information. Ex: 44 W. 4th Street and 44 West Fourth Street.

Slide4

DATA HETEROGENEITY

When integrating data from different sources into a warehouse, organizations become aware of potential systematic differences; these problems and conflicts fall under the umbrella term "DATA HETEROGENEITY".

Two types of heterogeneity can be distinguished: structural heterogeneity and lexical heterogeneity.

Slide5

DATA QUALITY

Data cleaning refers to the process of resolving identification problems in the data.

Structural heterogeneity

Different record structure

Addr versus City, State, and Zip code [1]

Lexical heterogeneity

Identical record structure, but data is different

44 W. 4th St. versus 44 West Fourth Street [1]

Slide6

TERMINOLOGY

Slide7

DATA PREPARATION

A step before duplicate record detection.

Improves the quality of the data.

Makes data more comparable and more usable.

The data preparation stage includes three steps.

Slide8

STEPS IN DATA PREPARATION

Parsing

Data transformation

Standardization

Slide9

PARSING

Locates, identifies, and isolates individual data elements.

Makes it easier to correct, standardize, and match data.

Allows comparison of individual components rather than complex strings.

For example, appropriately parsing the name and address components into consistent packets is a very important step.

Slide10
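As a rough illustration of the parsing step described above, the sketch below splits a free-form name string into individual components. The field names and the prefix list are hypothetical, not from the paper:

```python
# Hypothetical sketch of the parsing step: isolate the components of a
# free-form name field so they can be compared individually.
PREFIXES = {"prof.", "dr.", "mr.", "mrs.", "ms."}

def parse_name(raw):
    """Split a raw name string into prefix, first, middle, and last parts."""
    tokens = raw.replace(",", " ").split()
    prefix = ""
    if tokens and tokens[0].lower() in PREFIXES:
        prefix = tokens.pop(0)
    first = tokens[0] if tokens else ""
    last = tokens[-1] if len(tokens) > 1 else ""
    middle = " ".join(tokens[1:-1])
    return {"prefix": prefix, "first": first, "middle": middle, "last": last}
```

With components isolated this way, two records can be matched field by field instead of as one opaque string.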

DATA TRANSFORMATION

Simple conversions of data type

Field renaming

Decoding field values

Range checking: involves examining data in a field to ensure that it falls within the expected range, usually a numeric or date range.

Dependency checking: a slightly more complex kind of data transformation, where we check the values in one field against the values in another field to ensure a minimal level of consistency in the data.

Slide11
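A minimal sketch of the range and dependency checks just described; the field names and the specific bounds are made-up illustrations:

```python
# Sketch of range checking and dependency checking during data
# transformation; field names and bounds are hypothetical.

def range_check(value, low, high):
    """Range checking: is the value within the expected numeric range?"""
    return low <= value <= high

def dependency_check(record):
    """Dependency checking: compare one field against another.
    Example rule: age at hire, derived from birth_year and hire_year,
    must be plausible."""
    age_at_hire = record["hire_year"] - record["birth_year"]
    return range_check(age_at_hire, 14, 100)
```

The slide's EmployeeAge = 567 example would fail `range_check(567, 0, 120)`, while a record with a plausible birth year and hire year passes the dependency check.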

DATA STANDARDIZATION

Represent certain fields in a standard format:

Addresses (e.g., the US Postal Service address verification tool)

Date and time formatting

Names (first, last, middle, prefix, suffix)

Titles

Slide12

LAST STEP IN DATA PREPARATION

Store data in tables having comparable fields.

Identify fields suitable for comparison.

Not foolproof: data may still contain inconsistencies due to misspellings and different conventions for representing data.

Slide13

FIELD MATCHING TECHNIQUES

The most common source of mismatches in database entries is typographical errors.

The field matching metrics designed to overcome this problem are:

Character-based similarity metrics

Token-based similarity metrics

Phonetic similarity metrics

Numeric similarity metrics

Slide14

CHARACTER-BASED SIMILARITY

Works best on typographical errors.

Edit distance

The shortest sequence of edit commands that can transform a string s into a string t.

Three types of edit operations: insert, delete, replace.

If each operation has cost 1, this version of edit distance is referred to as the Levenshtein distance.

Example: s1 = "tin", s2 = "tan".

We need to replace "i" with "a" to convert string s1 into s2.

The edit distance here is 1, because only one operation was needed to convert s1 to s2.

Slide15
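The unit-cost edit distance described above can be computed with the standard dynamic-programming recurrence; a minimal sketch:

```python
def levenshtein(s, t):
    """Edit distance with unit cost for insert, delete, and replace
    (the Levenshtein distance), computed by dynamic programming."""
    m, n = len(s), len(t)
    # d[i][j] = edit distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                          # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete
                          d[i][j - 1] + 1,          # insert
                          d[i - 1][j - 1] + cost)   # replace or match
    return d[m][n]
```

On the slide's example, `levenshtein("tin", "tan")` returns 1, and the "Microsft"/"Microsoft" data error from earlier is also a single edit away.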

CHARACTER-BASED SIMILARITY

Affine gap distance

Handles strings that have been truncated: John R. Smith versus Jonathan Richard Smith.

Smith-Waterman distance

Substring matching that ignores the prefix and suffix. Example: "Prof. John R. Smith" and "John R. Smith, Prof."

Jaro distance

Primarily used for comparing first and last names.

Q-grams

Divides a string into a series of substrings of length q.

E.g., NELSON and NELSEN are phonetically similar but spelled differently. Their 2-grams are NE EL LS SO ON and NE EL LS SE EN, most of which overlap.

Slide16
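The q-gram idea above can be turned into a similarity score by measuring how much the two q-gram multisets overlap; the scaling below is one common choice, shown as an illustration rather than the paper's exact metric:

```python
from collections import Counter

def qgrams(s, q=2):
    """Sliding-window substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_similarity(a, b, q=2):
    """Overlap of the two q-gram multisets, scaled to [0, 1]
    (one common way to score q-gram similarity)."""
    ga, gb = Counter(qgrams(a, q)), Counter(qgrams(b, q))
    overlap = sum((ga & gb).values())
    return 2 * overlap / (sum(ga.values()) + sum(gb.values()))
```

For the slide's example, NELSON and NELSEN share three of their five 2-grams each, giving a similarity of 0.6 despite the spelling difference.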

TOKEN-BASED SIMILARITY

Works best when words (tokens) are transposed.

Atomic strings

Similarity is computed from the number of matching atomic strings, averaged over the two fields.

WHIRL

Weights words based on frequency to determine similarity.

Each word in the database has an associated weight, and similarity is calculated using the cosine similarity metric.

Example: in a database of company names, the words "AT&T" and "IBM" are less frequent than the word "Inc."

The similarity of "John Smith" and "Mr. John Smith" is close to 1.

But the similarity of "Comptr Department" and "Deprtment of Computer" is zero, since WHIRL does not handle misspelled words.

Q-grams with weighting

Extends WHIRL to handle spelling errors.

Slide17
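A sketch in the spirit of WHIRL: tokens are weighted by inverse document frequency over a corpus, and two strings are compared with the cosine of their weighted token vectors. The tiny corpus and the exact IDF formula here are illustrative assumptions, not the paper's formulation:

```python
import math
from collections import Counter

# Hypothetical corpus of field values used to estimate token weights.
CORPUS = ["john smith", "mr john smith", "comptr department",
          "deprtment of computer", "at&t inc", "ibm inc"]

def idf(token):
    """Rarer tokens (like 'ibm') get higher weight than common ones."""
    df = sum(token in doc.split() for doc in CORPUS)
    return math.log(len(CORPUS) / (1 + df))

def vector(s):
    """Token-frequency vector scaled by IDF weights."""
    return {t: c * idf(t) for t, c in Counter(s.split()).items()}

def cosine(a, b):
    va, vb = vector(a), vector(b)
    dot = sum(va[t] * vb.get(t, 0.0) for t in va)
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

This reproduces the slide's two behaviors: "john smith" versus "mr john smith" scores high, while the misspelled "comptr department" versus "deprtment of computer" shares no token at all and scores zero.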

PHONETIC SIMILARITY

Comparison based on how words sound.

Slide18
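A minimal sketch of a classic phonetic code, Soundex (one of the codes named later in the blocking slide): keep the first letter, map the remaining consonants to digit classes, drop vowels and adjacent duplicates, and pad to four characters:

```python
def soundex(name):
    """Classic Soundex code: first letter plus up to three digit codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:        # skip repeats of the same code
            out += code
        if ch not in "hw":               # h and w do not separate codes
            prev = code
    return (out + "000")[:4]
```

The NELSON/NELSEN pair from the q-grams slide both map to N425, so a phonetic comparison judges them identical even though their spellings differ.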

NUMERIC SIMILARITY

Considers only numbers.

Convert numbers to text data, or use simple range queries.

The authors provide no further insight in this area.

Slide19

SUMMARY OF METRICS

Slide20

DUPLICATE RECORD DETECTION

The methods described so far address similarity checking in single fields.

Real-life situations involve records with multiple fields, all of which have to be checked for duplicates.

Slide21

CATEGORIZING METHODS

Probabilistic approaches and supervised machine learning techniques

Approaches that rely on domain knowledge or generic distance metrics

Slide22

PROBABILISTIC MATCHING MODELS

Models derived from Bayes' theorem.

Use prior knowledge to make decisions about the current data set.

A tuple pair is assigned to one of two classes, M or U: the M (match) class represents the same entity, and the U (non-match) class represents different entities.

The assignment can be determined by calculating the probability distribution.

Rule-based decision tree: if-then-else traversal.

Slide23
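A minimal sketch of such a probabilistic match decision, in the Fellegi-Sunter spirit: for each field, assume a probability m that it agrees on a true match and a probability u that it agrees on a non-match, sum the log-likelihood ratios, and compare against a threshold. The field names, m/u values, and threshold here are all made-up illustrations:

```python
import math

# Hypothetical per-field parameters:
# (P(field agrees | match M), P(field agrees | non-match U)).
FIELDS = {"last_name": (0.95, 0.01), "zip": (0.90, 0.05), "year": (0.85, 0.10)}

def match_weight(agreements):
    """Sum log-likelihood ratios over fields.
    agreements maps field name -> True/False (did the field agree?)."""
    w = 0.0
    for field, (m, u) in FIELDS.items():
        if agreements[field]:
            w += math.log(m / u)
        else:
            w += math.log((1 - m) / (1 - u))
    return w

def classify(agreements, threshold=2.0):
    """Assign the tuple pair to class M (match) or U (non-match)."""
    return "M" if match_weight(agreements) > threshold else "U"
```

Agreement on a rare-to-agree field (low u) pushes the weight strongly toward M, while disagreement on a field that almost always agrees on matches (high m) pushes strongly toward U.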

Supervised learning

Relies on the existence of training data, in the form of record pairs labeled as matching or not.

The SVM approach outperforms all the simpler approaches.

A post-processing step creates a graph over all the records, linking the matching records.

Records are considered identical using the transitivity relation applied to the connected components.

Slide24
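The transitivity step above can be sketched with a union-find structure: matched pairs form graph edges, and the connected components give the transitive closure of "is a duplicate of". Record IDs here are just indices for illustration:

```python
# Post-processing sketch: connected components over matched pairs
# via union-find, so that A~B and B~C implies A~C.

def duplicate_groups(n_records, matched_pairs):
    """Return lists of record indices judged identical by transitivity."""
    parent = list(range(n_records))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)           # union the two components

    groups = {}
    for r in range(n_records):
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())
```

For example, if the matcher links (0, 1) and (1, 2), records 0, 1, and 2 end up in one group even though the pair (0, 2) was never compared directly.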

ACTIVE LEARNING

Slide25

DISTANCE-BASED TECHNIQUES

Can be used in the absence of training data, or of the human effort needed to create matching models.

Treat a record as one long field and use a distance metric.

Best matches are ranked using a weighting algorithm.

Alternatively, use a single field, which must be highly discriminating.

Slide26

RULE-BASED TECHNIQUES

Rely on business rules to derive a key.

Must determine functional dependencies.

Require a subject matter expert to build the matching rules.

Slide27

Rule-based techniques

This figure depicts the equational theory that dictates the logic of domain equivalence.

It specifies an inference about the similarity of the records.

Slide28

UNSUPERVISED LEARNING

Classifies data as matched or unmatched without a training set.

The comparison vector generally indicates which category a pair belongs to; when it does not, the pair has to be labeled manually.

One way to avoid manual labeling is to use clustering algorithms: group together similar comparison vectors, so that each cluster contains vectors with similar characteristics.

By knowing the real class of only a few vectors, we can infer the class of all the vectors.

Slide29

TECHNIQUES TO IMPROVE EFFICIENCY

Reduce the number of record comparisons

Improve the efficiency of each record comparison

Slide30

COMPARATIVE METRICS

Elementary nested loop: compare every record in one table to every record in the other table.

Requires A*B comparisons (the Cartesian product), which is very expensive.

The cost of a single comparison must also consider the number of fields per record.

Slide31

Reduce Record Comparisons

Blocking

Sorted Neighborhood

Clustering and Canopies

Set Joins

Slide32

Blocking

Basic idea: compute a hash value for each record, and only compare records in the same bucket.

Subdivides the files into subsets (blocks), e.g., using Soundex, NYSIIS, or Metaphone codes as keys.

Drawback: increases in speed may increase the number of false mismatches.

A compromise is multiple runs using different blocking fields.

Slide33
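The blocking idea above can be sketched in a few lines: bucket records by a cheap key and enumerate candidate pairs only within each bucket. The key function here (first letter plus length) is a deliberately crude stand-in for Soundex, NYSIIS, or Metaphone:

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(name):
    # Hypothetical cheap key; real systems use phonetic codes such as
    # Soundex, NYSIIS, or Metaphone as the blocking key.
    return name[0].upper() + str(len(name))

def candidate_pairs(records):
    """Bucket records by blocking key; compare only within a bucket."""
    buckets = defaultdict(list)
    for i, name in enumerate(records):
        buckets[blocking_key(name)].append(i)
    pairs = []
    for ids in buckets.values():
        pairs.extend(combinations(ids, 2))   # pairs within one bucket only
    return pairs
```

On ["NELSON", "NELSEN", "SMITH"] only the NELSON/NELSEN pair survives, cutting the three Cartesian pairs down to one; the drawback is that true duplicates landing in different buckets are never compared, which is exactly the false-mismatch risk noted above.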

Sorted Neighborhood

Create a composite key, sort the data, and merge within a sliding window.

Assumption: duplicate records will be close to each other in the sorted order.

Highly dependent upon the comparison key.

Slide34
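A minimal sketch of the sorted-neighborhood method just described: sort record indices by a composite key, then generate candidate pairs only within a sliding window of size w. The example records and the composite key (first three letters of last name plus zip) are hypothetical:

```python
def sorted_neighborhood_pairs(records, key, w=3):
    """Sort by composite key; pair each record with the next w-1 records."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos in range(len(order)):
        for nxt in range(pos + 1, min(pos + w, len(order))):
            pairs.add((min(order[pos], order[nxt]),
                       max(order[pos], order[nxt])))
    return pairs

# Hypothetical records (last_name, zip) and composite key.
records = [("Smith", "10012"), ("Smyth", "10012"), ("Jones", "60601"),
           ("Smith", "10013")]
key = lambda r: r[0][:3].upper() + r[1]
```

Because the two Smith records sort adjacently, they fall inside the window and become a candidate pair, while distant records like Smyth/Jones are never compared; a poorly chosen key would break exactly this adjacency assumption.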

Clustering and canopies

Clustering: duplicate records are kept in a cluster, and only a representative of each cluster is kept for future comparisons. This reduces the total number of record comparisons without compromising accuracy.

Canopies: the records are grouped into overlapping clusters called "canopies", and records are then compared only within a canopy, which leads to better qualitative results.

Slide35

SOFTWARE TOOLS

Open architecture:

Freely Extensible Biomedical Record Linkage (FEBRL) - Python

TAILOR - MLC++, DBGen

WHIRL - C++

Flamingo Project - C

BigMatch - C

Slide36

DATABASE TOOLS

Commercial RDBMSs:

SQL Server 2005 implements "fuzzy matches".

Oracle 11g implements these techniques in its utl_match package:

Levenshtein Distance

Soundex

Jaro-Winkler

Slide37

CONCLUSIONS

Lack of a standardized, large-scale benchmarking data set.

Training data is needed to produce matching models.

Research diversion: databases emphasize simple, fast, and efficient techniques, while machine learning and statistics rely on sophisticated techniques and probabilistic models.

More synergy is needed among the various communities.

Detection systems need to be adaptive over time.