Generic entity resolution with data confidences


karlyn-bohler
Uploaded On 2017-03-22




Presentation Transcript

• We define a generic framework for managing confidences during entity resolution (Sections 2 and 3).
• We present Koosh, an algorithm for resolution when confidences are involved (Section 4).
• We present three improvements over Koosh that can significantly reduce the amount of work during resolution: domination, packages and thresholds. We identify properties that must hold in order for these improvements to be achievable (Sections 5, 6, and 7).
• We evaluate the algorithms and quantify the potential performance gains (Section 8).

2. MODEL

Each record r consists of a confidence r.C and a set of attributes r.A. For illustration purposes, we can think of each attribute as a label-value pair, although this view is not essential for our work. For example, the following record may represent a person:

0.7 [name: "Fred", age: {45, 50}, zip: 94305]

In our example, we write r.C (0.7 in this example) in front of the attributes. (A record's confidences could simply be considered as one of its attributes, but here we treat confidences separately to make it easier to refer to them.) Note that the value for an attribute may be a set. In our example, the age attribute has two values, 45 and 50. Multiple values may be present in input records, or arise during integration: a record may report an age of 45 while another one reports 50. Some merge functions may combine the ages into a single number (say, the average), while others may decide to keep both possibilities, as shown in this example.

Note that we are using a single number to represent the confidence of a record. We believe that single numbers (in the range 0 to 1) are the most common way to represent confidences in the ER process, but more general confidence models are possible. For example, a confidence could be a vector, stating the confidences in individual attributes. Similarly, the confidence could include lineage information explaining how the confidence was derived. However, these richer models make it harder for application programmers to develop merge functions (see below), so in practice, the applications we have seen all use a single number.

Generic ER relies on two black-box functions, the match and the merge function, which we will assume here work on two records at a time:
• A match function M(r, s) returns true if records r and s represent the same entity. When M(r, s) = true we say that r and s match, denoted r ≈ s.
• A merge function creates a composite record from two matching records. We represent the merge of records r and s by ⟨r, s⟩.

Note that the match and merge functions can use global information to make their decisions. For instance, in an initialization phase we can compute, say, the distribution of terms used in product descriptions, so that when we compare records we can take into account these term frequencies. Similarly, we can run a clustering algorithm to identify sets of input records that are "similar." Then the match function can consult these results to decide if records match. As new records are generated, the global statistics need to be updated (by the merge function): these updates can be done incrementally or in batch mode, if accuracy is not essential.

The pairwise approach to match and merge is often used in practice because it is easier to write the functions. (For example, ER products from IBM, Fair Isaac, Oracle, and others use pairwise functions.) For instance, it is extremely rare to see functions that merge more than two records at a time. To illustrate, say we want to merge 4 records containing different spellings of the name "Schwartz." In principle, one could consider all 4 names and come up with some good "centroid" name, but in practice it is more common to use simpler strategies. For example, we can just accumulate all spellings as we merge records, or we can map each spelling to the closest name in a dictionary of canonical names. Either approach can easily be implemented in a pairwise fashion.

Of course, in some applications pairwise match functions may not be the best approach. For example, one may want to use a set-based match function that considers a set of records and identifies the pair that should be matched next, i.e., M(S) returns records r, s ∈ S that are the best candidates for merging. Although we do not cover it here, we believe that the concepts we present here (e.g., thresholds, domination) can also be applied when set-based match functions are used, and that our algorithms can be modified to use set-based functions.

Pairwise match and merge are generally not arbitrary functions, but have some properties, which we can leverage to enable efficient entity resolution. We assume that the match and merge functions satisfy the following properties:

• Commutativity: ∀r, s, r ≈ s ⇔ s ≈ r and if r ≈ s then ⟨r, s⟩ = ⟨s, r⟩.
• Idempotence: ∀r, r ≈ r and ⟨r, r⟩ = r.

We expect these properties to hold in almost all applications (unless the functions are not properly implemented). In one ER application we studied, for example, the implemented match function was not idempotent: a record would not match itself if the fields used for comparison were missing. However, it was trivial to add a comparison for record equality to the match function to achieve idempotence. (The advantage of using an idempotent function will become apparent when we see the efficient options for ER.)

Some readers may wonder if merging two identical records should really give the same record. For example, say the records represent two observations of some phenomenon. Then perhaps the merged record should have a higher confidence because there are two observations? The confidence would only be higher if the two records represent independent observations, not if they are identical. We assume that independent observations would differ in some way, e.g., in an attribute recording the time of observation. Thus, two identical records should really merge into the same record.

3. GENERIC ENTITY RESOLUTION

Given the match and merge functions, we can now ask what is the correct result of an entity resolution algorithm. It is clear that if two records match, they should be merged together. If the merged record matches another record, then those two should be merged together as well. But what should happen to the original matching records? Consider:

r1 = 0.8 [name: Alice, areacode: 202]
r2 = 0.7 [name: Alice, phone: 555-1212]

The merge of the two records might be:

r12 = 0.56 [name: Alice, areacode: 202, phone: 555-1212]

In this case, the merged record has all of the information
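As a rough illustration of the model so far, the following Python sketch encodes records as a confidence plus a set of label-value pairs and reproduces the r1, r2 merge. It is only a sketch under stated assumptions: the names `Record`, `make_record`, `match` and `merge` are hypothetical, the "same name" match rule is invented for the example, and the merge (union of attributes, product of confidences) is one choice that happens to yield 0.56 here. The explicit equality checks are what make both functions idempotent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A confidence in [0, 1] plus a set of (label, value) pairs."""
    confidence: float
    attributes: frozenset


def make_record(confidence, **attrs):
    # Hypothetical helper: one (label, value) pair per keyword argument.
    return Record(confidence, frozenset(attrs.items()))


def match(r, s):
    """Illustrative match rule (an assumption): records match if they are
    identical or share at least one value for the 'name' label."""
    if r == s:  # equality check guarantees r matches itself
        return True
    names_r = {v for (label, v) in r.attributes if label == "name"}
    names_s = {v for (label, v) in s.attributes if label == "name"}
    return bool(names_r & names_s)


def merge(r, s):
    """Illustrative merge (an assumption): union the attributes and
    multiply the confidences; merging a record with itself returns it."""
    if r == s:  # idempotence: <r, r> = r
        return r
    return Record(r.confidence * s.confidence, r.attributes | s.attributes)


r1 = make_record(0.8, name="Alice", areacode="202")
r2 = make_record(0.7, name="Alice", phone="555-1212")
r12 = merge(r1, r2)  # confidence 0.8 * 0.7, about 0.56 up to float rounding
```

Because the attribute union and the confidence product are both symmetric, `merge(r1, r2)` and `merge(r2, r1)` give the same record, which is the commutativity property stated above.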
• Domination Property: If s ≤ r and s ≈ x then r ≈ x and ⟨s, x⟩ ≤ ⟨r, x⟩.

This domination property may or may not hold in a given application. For instance, let us return to our r1, r2, r3 example at the beginning of this section. Consider a fourth record r4 = 0.9 [name: Alice, areacode: 717, phone: 555-1212, age: 20]. A particular match function may decide that r4 does not match r3 because the area codes are different, but r4 and r2 may match since this conflict does not exist with r2. In this scenario, we cannot discard r2 when we generate a record that dominates it (r3), since r2 can still play a role in some matches.

However, in applications where having more information in a record can never reduce its match chances, the domination property can hold and we can take advantage of it. If the domination property holds then we can throw away dominated records as we find them while computing NER(R). We prove this fact in the extended version of this paper.

5.1 Algorithm Koosh-ND

Koosh can be modified to eliminate dominated records early as follows. First, Koosh-ND begins by removing all dominated records from the input set. Second, within the body of the algorithm, whenever a new merged record m is created (line 10), the algorithm checks whether m is dominated by any record in R or R′. If so, then m is immediately discarded, before it is used for any unnecessary comparisons. Note that we do not check if m dominates any other records, as this check would be expensive in the inner loop of the algorithm. Finally, since we do not incrementally check if m dominates other records, we add a step at the end to remove all dominated records from the output set.

Koosh-ND relies on two complex operations: removing all dominated records from a set and checking if a record is dominated by a member of a set. These seem like expensive operations that might outweigh the gains obtained by eliminating the comparisons of dominated records. However, using an inverted list index that maps label-value pairs to the records that contain them, we can make these operations quite efficient. The correctness of Koosh-ND is proven in the extended version of this paper.

6. THE PACKAGES ALGORITHM

In Section 3, we illustrated why ER with confidences is expensive, on the records r1 and r2 that merged into r3:

r1 = 0.8 [name: Alice, areacode: 202],
r2 = 0.7 [name: Alice, phone: 555-1212],
r3 = 0.56 [name: Alice, areacode: 202, phone: 555-1212].

Recall that r2 cannot be discarded essentially because it has a higher confidence than the resulting record r3. However, notice that other than the confidence, r3 contains more label-value pairs, and hence, if it were not for r2's higher confidence, r2 would not be necessary. This observation leads us to consider a scenario where the records minus confidences can be resolved efficiently, and then to add the confidence computations in a second phase.

In particular, let us assume that our merge function is "information preserving" in the following sense: When a record r merges with other records, the information carried by r's attributes is not lost. We formalize this notion of "information" by defining a relation "⊑": r ⊑ s means that the attributes of s carry at least as much information as those of r. We assume that this relation is transitive. Note that r ⊑ s and s ⊑ r does not imply that r = s; it only implies that r.A carries as much information as s.A.

The property that merges are information preserving is formalized as follows:

• Property P1: If r ≈ s then r ⊑ ⟨r, s⟩ and s ⊑ ⟨r, s⟩.
• Property P2: If s ⊑ r, s ≈ x and r ≈ x, then ⟨s, x⟩ ⊑ ⟨r, x⟩.

For example, a merge function that unions the attributes of records would have properties P1 and P2. Such functions are common in "intelligence gathering" applications, where one wishes to collect all information known about entities, even if contradictory. For instance, say two records report different passport numbers or different ages for a person. If the records merge (e.g., due to evidence in other attributes) such applications typically gather all the facts, since the person may be using fake passports reporting different ages.

Furthermore, we assume that adding information to a record does not change the outcome of match. In addition, we also assume that the match function does not consider confidences, only the attributes of records. These characteristics are formalized by:

• Property P3: If s ⊑ r and s ≈ x, then r ≈ x.

Having a match function that ignores confidences is not very constraining: If two records are unlikely to match due to low confidences, the merge function can still assign a low confidence to the resulting record to indicate it is unlikely. The second aspect of Property P3 rules out "negative evidence": adding information to a record cannot rule out a future match. However, negative information can still be handled by decreasing the confidence of the resulting record.

The algorithm of Figure 2 exploits these properties to perform ER more efficiently. It proceeds in two phases: a first phase bypasses confidences and groups records into disjoint packages. Because of the properties, this first phase can be done efficiently, and records that fall into different packages are known not to match. The second phase runs ER with confidences on each package separately. We next explain and justify each of these two phases.

6.1 Phase 1

In Phase 1, we may use any generic ER algorithm, such as those in [2], to resolve the base records, but with some additional bookkeeping. For example, when two base records r1 and r2 merge into r3, we combine all three records together into a package p3. The package p3 contains two things: (i) a root r(p3) which in this case is r3, and (ii) the base records b(p3) = {r1, r2}.

Actually, base records can also be viewed as packages. For example, record r2 can be treated as package p2 with r(p2) = r2, b(p2) = {r2}. Thus, the algorithm starts with a set of packages, and we generalize our match and merge functions to operate on packages.

For instance, suppose we want to compare p3 with a package p4 containing only base record r4. That is, r(p4) = r4 and b(p4) = {r4}. To compare the packages, we only compare their roots: That is, M(p3, p4) is equivalent to M(r(p3), r(p4)), or in this example equivalent to M(r3, r4). (We use the same symbol M for record and package matching.) Say these records do match, so we generate a new

Given the threshold property, we can compute TER(R) more efficiently. In the extended version of this paper, we prove that if the threshold property holds, then all results can be obtained from above-threshold records.

7.1 Algorithms Koosh-T and Koosh-TND

As with removing dominated records, Koosh can be easily modified to drop below-threshold records. First, we add an initial scan to remove all base records that are already below threshold. Then, we simply add the following conjunct to the condition of Line 10 of the algorithm:

merged.C ≥ T

Thus, merged records are dropped if they are below the confidence threshold.

Theorem 7.2. When TER(R) is finite, Koosh-T terminates and computes TER(R).

By performing the same modification as above on Koosh-ND, we obtain the algorithm Koosh-TND, which computes the set NER(R) ∩ TER(R) of records in ER(R) that are neither dominated nor below threshold.

7.2 Packages-T and Packages-TND

If the threshold property holds, Koosh-T or Koosh-TND can be used for Phase 2 of the Packages algorithm, to obtain algorithm Packages-T or Packages-TND. In that case, below-threshold and/or dominated records are dropped as each package is expanded.

8. EXPERIMENTS

To summarize, we have discussed three main algorithms: BFA, Koosh, and Packages. For each of those basic three, there are three variants, adding in thresholds (T), non-domination (ND), or both (TND). In this section, we will compare the three algorithms against each other using both thresholds and non-domination. We will also investigate how performance is affected by varying threshold values, and, independently, by removing dominated records.

To test our algorithms, we ran them on synthetic data. Synthetic data gives us the flexibility to carefully control the distribution of confidences, the probability that two records match, as well as other important parameters. Our goal in generating the data was to emulate a realistic scenario where n records describe various aspects of m real-world entities (n ≫ m). If two of our records refer to the same entity, we expect them to match with much higher probability than if they referred to different entities.

To emulate this scenario, we assume that the real-world entities can be represented as points on a number line. Records about a particular entity with value x contain an attribute A with a value "close" to x. (The value is normally distributed with mean x, see below.) Thus, the match function can simply compare the A attribute of records: if the values are close, the records match. Records are also assigned a confidence, as discussed below.

For our experiments we use an "intelligence gathering" merge function as discussed in Section 6, which unions attributes. Thus, as a record merges with others, it accumulates A values and increases its chances of matching other records related to the particular real-world entity.

Figure 1: Thresholds vs. Matches

To be more specific, our synthetic data was generated using the following parameters (and their default values):

• n, the number of records to generate (default: 1000)
• m, the number of entities to simulate (default: 100)
• margin, the separation between entities (default: 75)
• σ, the standard deviation of the normal curve around each entity (default: 10)
• μc, the mean of the confidence values (default: 0.8)

To generate one record r, we proceed as follows: First, pick a uniformly distributed random integer i in the range [0, m − 1]. This integer represents the value for the real-world entity that r will represent. For the A value of r, generate a random floating point value v from a normal distribution with standard deviation σ and a mean of margin · i. To generate r's confidence, compute a uniformly distributed value c in the range [μc − 0.1, μc + 0.1] (with μc ∈ [0.1, 0.9] so that c stays in [0, 1]). Now create a record r with r.C = c and r.A = {A: v}. Repeat all of these steps n times to create n synthetic records.

Our merge function takes in the two records r1 and r2, and creates a new record rm, where rm.C = r1.C × r2.C and rm.A = r1.A ∪ r2.A. The match function detects a match if, for the A attribute, there exists a value v1 in r1.A and a value v2 in r2.A where |v1 − v2| < k, for a parameter k chosen in advance (k = 25 except where otherwise noted).

Naturally, our first experiment compares the performance of our three algorithms, BFA-TND, Koosh-TND and Packages-TND, against each other. We varied the threshold values to get a sense of how much faster the algorithms are when a higher threshold causes more records to be discarded. Each algorithm was run at the given threshold value three times, and the resulting number of comparisons was averaged over the three runs to get our final results.

Figure 1 shows the results of this first experiment. The first three lines on the graph represent the performance of our three algorithms. On the horizontal axis, we vary the threshold value. The vertical axis (logarithmic) indicates the number of calls to the match function, which we use as a measure of the work performed by the algorithms. The first thing we notice is that the work performed by the algorithms grows exponentially as the threshold is decreased. Thus, clearly thresholds are a very powerful tool: one can get high-confidence results at a relatively modest cost, while computing the lower confidence records gets progressively more expensive! Also interestingly, the BFA-TND and Koosh-TND lines are parallel to each other. This means that they are consistently a constant factor apart. Roughly, BFA does 10 times the number of comparisons that Koosh does.

The Packages-TND algorithm is far more efficient than the other two algorithms. Of course, Packages can only be used if Properties P1, P2 and P3 hold, but when they do hold, the savings can be dramatic. We believe that these savings can be a strong incentive for the application expert to design match and merge functions that satisfy the properties.

Figure 2: Thresholds vs. Merges

We also compared our algorithms based on the number of merges performed. In Figure 2, the vertical axis indicates the number of merges that are performed by the algorithms. We can see that Koosh-TND and Packages-TND are still a great improvement over BFA. BFA performs extra merges because in each iteration of its main loop, it recompares all records and merges any matches found. The extra merges result in duplicate records which are eliminated when they are added to the result set. Packages performs slightly more merges than Koosh, since the second phase of the algorithm does not use any of the merges that occurred in the first phase. If we subtract the Phase 1 merges from Packages (not shown in the figure), Koosh and Packages perform roughly the same number of merges.

In our next experiment, we compare the performance of our algorithms as we vary the probability that base records match. We can control the match probability by changing parameters k or σ, but we use the resulting match probability as the horizontal axis to provide more intuition. In particular, to generate Figure 3, we vary parameter k from 5 to 55 in increments of 5 (keeping the threshold value constant at 0.6). During each run, we measure the match probability as the fraction of base record matches that are positive. (The results are similar when we compute the match probability over all matches.) For each run, we then plot the match probability versus the number of calls to the match function, for our three algorithms.

Figure 3: Selectivity vs. Comparisons

As expected, the work increases with greater match probability, since more records are produced. Furthermore, we note that the BFA and Koosh lines are roughly parallel, but the Packages line stays level until a quick rise in the amount of work performed once the match probability reaches about 0.011. The Packages optimization takes advantage of the fact that records can be separated into packages that do not merge with one another.

In practice, we would expect to operate in the range of Figure 3 where the match probability is low and Packages outperforms Koosh. In our scenario with high match probabilities, records that refer to different entities are being merged, which means the match function is not doing its job. One could also get high match probabilities if there were very few entities, so that packages do not partition the problem finely. But again, in practice one would expect records to cover a large number of entities.

9. RELATED WORK

Originally introduced by Newcombe et al. [17] under the name of record linkage, and formalized by Fellegi and Sunter [9], the ER problem was studied under a variety of names, such as Merge/Purge [12], deduplication [18], reference reconciliation [8], object identification [21], and others. Most of the work in this area (see [23, 11] for recent surveys) focuses on the "matching" problem, i.e., on deciding which records do represent the same entities and which ones do not. This is generally done in two phases: Computing measures of how similar atomic values are (e.g., using edit-distances [20], TF-IDF [6], or adaptive techniques such as q-grams [4]), then feeding these measures into a model (with parameters), which makes matching decisions for records. Proposed models include unsupervised clustering techniques [12, 5], Bayesian networks [22], decision trees, SVMs, and conditional random fields [19]. The parameters of these models are learned either from a labeled training set (possibly with the help of a user, through active learning [18]), or using unsupervised techniques such as the EM algorithm [24].

All the techniques above manipulate and produce numerical values, when comparing atomic values (e.g., TF-IDF scores), as parameters of their internal model (e.g., thresholds, regression parameters, attribute weights), or as their output. But these numbers are often specific to the techniques at hand, and do not have a clear interpretation in terms of "confidence" in the records or the values. On the other hand, representations of uncertain data exist, which soundly model confidence in terms of probabilities (e.g., [1, 10]), or beliefs [14]. However, these approaches focus on computing the results and confidences of exact queries, extended with simple "fuzzy" operators for value comparisons (e.g., see [7]), and are not capable of any advanced form of entity resolution. We propose a flexible solution for ER that accommodates any model for confidences, and proposes efficient algorithms based on their properties.

Our generic approach departs from existing techniques in that it interleaves merges with matches. The first phase of the Packages algorithm is similar to the set-union algorithm described in [16], but our use of a merge function allows the selection of a true representative record. The presence of "custom" merges is an important part of ER, and it makes confidences non-trivial to compute. The need for iterating matches and merges was identified by [3] and is also used in [8], but their record merges are simple aggregations (similar to our "information gathering" merge), and they do not consider the propagation of confidences through merges.

10. CONCLUSION

In this paper we look at ER with confidences as a "generic database" problem, where we are given black-boxes that compare and merge records, and we focus on efficient algorithms that reduce the number of calls to these boxes. The key to reducing work is to exploit generic properties (like the threshold property) that an application may have. If such properties hold we can use the optimizations we have studied (e.g., Koosh-T when the threshold property holds). Of the three optimizations, thresholds is the most flexible one, as it gives us a "knob" (the threshold) that one can control: For a high threshold, we only get high-confidence records, but we get them very efficiently. As we decrease the threshold, we start adding lower-confidence results to our answer, but the computational cost increases. The other two optimizations, domination and packages, can also reduce the cost of ER very substantially but do not provide such a control knob.

11. REFERENCES

[1] D. Barbara, H. Garcia-Molina, and D. Porter. The management of probabilistic data. IEEE Transactions on Knowledge and Data Engineering, 4(5):487-502, 1992.
[2] O. Benjelloun, H. Garcia-Molina, J. Jonas, Q. Su, and J. Widom. Swoosh: A generic approach to entity resolution. Technical report, Stanford University, 2005.
[3] I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In Proc. of the SIGMOD 2004 Workshop on Research Issues on Data Mining and Knowledge Discovery, June 2004.
[4] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In Proc. of ACM SIGMOD, pages 313-324. ACM Press, 2003.
[5] S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In Proc. of ICDE, Tokyo, Japan, 2005.
[6] William Cohen. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18:288-321, 2000.
[7] Nilesh N. Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases. In VLDB, pages 864-875, 2004.
[8] X. Dong, A. Y. Halevy, J. Madhavan, and E. Nemes. Reference reconciliation in complex information spaces. In Proc. of ACM SIGMOD, 2005.
[9] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, 1969.
[10] Norbert Fuhr and Thomas Rolleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst., 15(1):32-66, 1997.
[11] L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, 2003.
[12] M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. In Proc. of ACM SIGMOD, pages 127-138, 1995.
[13] Saul Kripke. Semantical considerations on modal logic. Acta Philosophica Fennica, 16:83-94, 1963.
[14] Suk Kyoon Lee. An extended relational database model for uncertain and imprecise information. In Li-Yan Yuan, editor, VLDB, pages 211-220. Morgan Kaufmann, 1992.
[15] David Menestrina, Omar Benjelloun, and Hector Garcia-Molina. Generic entity resolution with data confidences (extended version). Technical report, Stanford University, 2006.
[16] A. E. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In DMKD, pages 0-, 1997.
[17] H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130(3381):954-959, 1959.
[18] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proc. of ACM SIGKDD, Edmonton, Alberta, 2002.
[19] Parag Singla and Pedro Domingos. Object identification with attribute-mediated dependences. In Proc. of PKDD, pages 297-308, 2005.
[20] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195-197, 1981.
[21] S. Tejada, C. A. Knoblock, and S. Minton. Learning object identification rules for information integration. Information Systems Journal, 26(8):635-656, 2001.
[22] Vassilios S. Verykios, George V. Moustakides, and Mohamed G. Elfeky. A bayesian decision model for cost optimal record matching. The VLDB Journal, 12(1):28-40, 2003.
[23] W. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC, 1999.
[24] W. E. Winkler. Using the EM algorithm for weight computation in the fellegi-sunter model of record linkage. American Statistical Association, Proceedings of the Section on Survey Research Methods, pages 667-671, 1988.
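
As a supplement to the experimental setup of Section 8, the synthetic record generator and the match and merge functions used there can be sketched in Python roughly as follows. This is a minimal sketch, not the authors' implementation: the function and parameter names (`generate_records`, `mean_conf`, `seed`) are hypothetical, records are represented as plain `(confidence, set of A values)` tuples for brevity, and the match comparison is assumed to be a strict "less than k".

```python
import random


def generate_records(n=1000, m=100, margin=75, sigma=10, mean_conf=0.8, seed=None):
    """Generate n synthetic records as (confidence, frozenset of A values).

    Defaults follow the parameter list in Section 8; the tuple layout and
    the seed argument are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        i = rng.randint(0, m - 1)         # entity index, uniform on [0, m - 1]
        v = rng.gauss(margin * i, sigma)  # A value, normal around margin * i
        c = rng.uniform(mean_conf - 0.1, mean_conf + 0.1)  # confidence
        records.append((c, frozenset([v])))
    return records


def match(r1, r2, k=25):
    """True if some A value of r1 is within k of some A value of r2."""
    return any(abs(v1 - v2) < k for v1 in r1[1] for v2 in r2[1])


def merge(r1, r2):
    """Multiply the confidences and union the A values."""
    return (r1[0] * r2[0], r1[1] | r2[1])


# Small hand-made example: a and b match, a and c do not, but once a and b
# are merged the accumulated A values let the result match c.
a = (0.8, frozenset([0.0]))
b = (0.7, frozenset([20.0]))
c = (0.9, frozenset([40.0]))
ab = merge(a, b)
```

The hand-made records at the end illustrate the remark that a record accumulates A values as it merges and thereby increases its chances of matching other records about the same entity.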