- We define a generic framework for managing confidences during entity resolution (Sections 2 and 3).
- We present Koosh, an algorithm for resolution when confidences are involved (Section 4).
- We present three improvements over Koosh that can significantly reduce the amount of work during resolution: domination, packages and thresholds. We identify properties that must hold in order for these improvements to be achievable (Sections 5, 6, and 7).
- We evaluate the algorithms and quantify the potential performance gains (Section 8).

2. MODEL

Each record r consists of a confidence r.C and a set of attributes r.A. For illustration purposes, we can think of each attribute as a label-value pair, although this view is not essential for our work. For example, the following record may represent a person:

0.7 [name: "Fred", age: {45, 50}, zip: 94305]

In our example, we write r.C (0.7 in this example) in front of the attributes. (A record's confidences could simply be considered as one of its attributes, but here we treat confidences separately to make it easier to refer to them.)

Note that the value for an attribute may be a set. In our example, the age attribute has two values, 45 and 50. Multiple values may be present in input records, or arise during integration: a record may report an age of 45 while another one reports 50. Some merge functions may combine the ages into a single number (say, the average), while others may decide to keep both possibilities, as shown in this example.

Note that we are using a single number to represent the confidence of a record. We believe that single numbers (in the range 0 to 1) are the most common way to represent confidences in the ER process, but more general confidence models are possible. For example, a confidence could be a vector, stating the confidences in individual attributes. Similarly, the confidence could include lineage information explaining how the confidence was derived. However, these richer models make it harder for application programmers to develop merge functions (see below), so in practice, the applications we have seen all use a single number.

Generic ER relies on two black-box functions, the match and the merge function, which we will assume here work on two records at a time:

- A match function $M(r, s)$ returns true if records r and s represent the same entity. When $M(r, s) = \text{true}$ we say that r and s match, denoted $r \approx s$.
- A merge function creates a composite record from two matching records. We represent the merge of record r and s by $\langle r, s \rangle$.

Note that the match and merge functions can use global information to make their decisions. For instance, in an initialization phase we can compute, say, the distribution of terms used in product descriptions, so that when we compare records we can take into account these term frequencies. Similarly, we can run a clustering algorithm to identify sets of input records that are "similar." Then the match function can consult these results to decide if records match. As new records are generated, the global statistics need to be updated (by the merge function): these updates can be done incrementally or in batch mode, if accuracy is not essential.

The pairwise approach to match and merge is often used in practice because it is easier to write the functions. (For example, ER products from IBM, Fair Isaac, Oracle, and others use pairwise functions.) For instance, it is extremely rare to see functions that merge more than two records at a time. To illustrate, say we want to merge 4 records containing different spellings of the name "Schwartz." In principle, one could consider all 4 names and come up with some good "centroid" name, but in practice it is more common to use simpler strategies. For example, we can just accumulate all spellings as we merge records, or we can map each spelling to the closest name in a dictionary of canonical names. Either approach can easily be implemented in a pairwise fashion.

Of course, in some applications pairwise match functions may not be the best approach. For example, one may want to use a set-based match function that considers a set of records and identifies the pair that should be matched next, i.e., $M(S)$ returns records $r, s \in S$ that are the best candidates for merging. Although we do not cover it here, we believe that the concepts we present here (e.g., thresholds, domination) can also be applied when set-based match functions are used, and that our algorithms can be modified to use set-based functions.

Pairwise match and merge are generally not arbitrary functions, but have some properties, which we can leverage to enable efficient entity resolution. We assume that the match and merge functions satisfy the following properties:

- Commutativity: $\forall r, s$: $r \approx s \Leftrightarrow s \approx r$, and if $r \approx s$ then $\langle r, s \rangle = \langle s, r \rangle$.
- Idempotence: $\forall r$: $r \approx r$ and $\langle r, r \rangle = r$.

We expect these properties to hold in almost all applications (unless the functions are not properly implemented). In one ER application we studied, for example, the implemented match function was not idempotent: a record would not match itself if the fields used for comparison were missing. However, it was trivial to add a comparison for record equality to the match function to achieve idempotence. (The advantage of using an idempotent function will become apparent when we see the efficient options for ER.)

Some readers may wonder if merging two identical records should really give the same record. For example, say the records represent two observations of some phenomenon. Then perhaps the merged record should have a higher confidence because there are two observations? The confidence would only be higher if the two records represent independent observations, not if they are identical. We assume that independent observations would differ in some way, e.g., in an attribute recording the time of observation. Thus, two identical records should really merge into the same record.

3. GENERIC ENTITY RESOLUTION

Given the match and merge functions, we can now ask what is the correct result of an entity resolution algorithm. It is clear that if two records match, they should be merged together. If the merged record matches another record, then those two should be merged together as well. But what should happen to the original matching records? Consider:

r1 = 0.8 [name: Alice; areacode: 202]
r2 = 0.7 [name: Alice; phone: 555-1212]

The merge of the two records might be:

r12 = 0.56 [name: Alice; areacode: 202; phone: 555-1212]

In this case, the merged record has all of the information [...]

- Domination Property: If $s \le r$ and $s \approx x$ then $r \approx x$ and $\langle s, x \rangle \le \langle r, x \rangle$.

This domination property may or may not hold in a given application. For instance, let us return to our r1, r2, r3 example at the beginning of this section. Consider a fourth record r4 = 0.9 [name: Alice, areacode: 717, phone: 555-1212, age: 20]. A particular match function may decide that r4 does not match r3 because the area codes are different, but r4 and r2 may match since this conflict does not exist with r2. In this scenario, we cannot discard r2 when we generate a record that dominates it (r3), since r2 can still play a role in some matches.

However, in applications where having more information in a record can never reduce its match chances, the domination property can hold and we can take advantage of it. If the domination property holds then we can throw away dominated records as we find them while computing NER(R). We prove this fact in the extended version of this paper.

5.1 Algorithm Koosh-ND

Koosh can be modified to eliminate dominated records early as follows. First, Koosh-ND begins by removing all dominated records from the input set. Second, within the body of the algorithm, whenever a new merged record m is created (line 10), the algorithm checks whether m is dominated by any record in R or R'. If so, then m is immediately discarded, before it is used for any unnecessary comparisons. Note that we do not check if m dominates any other records, as this check would be expensive in the inner loop of the algorithm. Finally, since we do not incrementally check if m dominates other records, we add a step at the end to remove all dominated records from the output set.

Koosh-ND relies on two complex operations: removing all dominated records from a set and checking if a record is dominated by a member of a set. These seem like expensive operations that might outweigh the gains obtained by eliminating the comparisons of dominated records. However, using an inverted list index that maps label-value pairs to the records that contain them, we can make these operations quite efficient. The correctness of Koosh-ND is proven in the extended version of this paper.

6. THE PACKAGES ALGORITHM

In Section 3, we illustrated why ER with confidences is expensive, on the records r1 and r2 that merged into r3:

r1 = 0.8 [name: Alice; areacode: 202]
r2 = 0.7 [name: Alice; phone: 555-1212]
r3 = 0.56 [name: Alice; areacode: 202, phone: 555-1212]

Recall that r2 cannot be discarded essentially because it has a higher confidence than the resulting record r3. However, notice that other than the confidence, r3 contains more label-value pairs, and hence, if it were not for its higher confidence, r2 would not be necessary. This observation leads us to consider a scenario where the records minus confidences can be resolved efficiently, and then to add the confidence computations in a second phase.

In particular, let us assume that our merge function is "information preserving" in the following sense: When a record r merges with other records, the information carried by r's attributes is not lost. We formalize this notion of "information" by defining a relation "$\sqsubseteq$": $r \sqsubseteq s$ means that the attributes of s carry more information than those of r. We assume that this relation is transitive. Note that $r \sqsubseteq s$ and $s \sqsubseteq r$ does not imply that r = s; it only implies that r.A carries as much information as s.A.

The property that merges are information preserving is formalized as follows:

- Property P1: If $r \approx s$ then $r \sqsubseteq \langle r, s \rangle$ and $s \sqsubseteq \langle r, s \rangle$.
- Property P2: If $s \sqsubseteq r$, $s \approx x$ and $r \approx x$, then $\langle s, x \rangle \sqsubseteq \langle r, x \rangle$.

For example, a merge function that unions the attributes of records would have properties P1 and P2. Such functions are common in "intelligence gathering" applications, where one wishes to collect all information known about entities, even if contradictory. For instance, say two records report different passport numbers or different ages for a person. If the records merge (e.g., due to evidence in other attributes) such applications typically gather all the facts, since the person may be using fake passports reporting different ages.

Furthermore, we assume that adding information to a record does not change the outcome of match. In addition, we also assume that the match function does not consider confidences, only the attributes of records. These characteristics are formalized by:
- Property P3: If $s \sqsubseteq r$ and $s \approx x$, then $r \approx x$.

Having a match function that ignores confidences is not very constraining: If two records are unlikely to match due to low confidences, the merge function can still assign a low confidence to the resulting record to indicate it is unlikely. The second aspect of Property P3 rules out "negative evidence": adding information to a record cannot rule out a future match. However, negative information can still be handled by decreasing the confidence of the resulting record.

The algorithm of Figure 2 exploits these properties to perform ER more efficiently. It proceeds in two phases: a first phase bypasses confidences and groups records into disjoint packages. Because of the properties, this first phase can be done efficiently, and records that fall into different packages are known not to match. The second phase runs ER with confidences on each package separately. We next explain and justify each of these two phases.

6.1 Phase 1

In Phase 1, we may use any generic ER algorithm, such as those in [2], to resolve the base records, but with some additional bookkeeping. For example, when two base records r1 and r2 merge into r3, we combine all three records together into a package p3. The package p3 contains two things: (i) a root r(p3) which in this case is r3, and (ii) the base records b(p3) = {r1, r2}.

Actually, base records can also be viewed as packages. For example, record r2 can be treated as package p2 with r(p2) = r2, b(p2) = {r2}. Thus, the algorithm starts with a set of packages, and we generalize our match and merge functions to operate on packages.

For instance, suppose we want to compare p3 with a package p4 containing only base record r4. That is, r(p4) = r4 and b(p4) = {r4}. To compare the packages, we only compare their roots: That is, M(p3, p4) is equivalent to M(r(p3), r(p4)), or in this example equivalent to M(r3, r4). (We use the same symbol M for record and package matching.) Say these records do match, so we generate a new [...]
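The package bookkeeping of Phase 1 described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the record model (a frozenset of label-value pairs, with confidences left out because Phase 1 ignores them), the names `Package`, `as_package`, `match_packages`, `merge_packages` and `phase1`, and the example functions `shares_pair` and `union_merge`. The rule used to merge two packages (merge the roots, union the base records) is an assumption generalized from the r1, r2, r3 example, and the worklist loop is a naive stand-in for whichever generic ER algorithm is actually used in Phase 1.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Tuple

# Hypothetical record model: a frozenset of (label, value) pairs.
# Confidences are omitted because Phase 1 ignores them.
Record = FrozenSet[Tuple[str, str]]


@dataclass(frozen=True)
class Package:
    root: Record             # r(p): the record that represents the package
    base: FrozenSet[Record]  # b(p): the base records collected in the package


def as_package(r: Record) -> Package:
    """View a base record r as the package with r(p) = r and b(p) = {r}."""
    return Package(root=r, base=frozenset([r]))


def match_packages(match: Callable[[Record, Record], bool],
                   p: Package, q: Package) -> bool:
    """Packages are compared through their roots only: M(p, q) = M(r(p), r(q))."""
    return match(p.root, q.root)


def merge_packages(merge: Callable[[Record, Record], Record],
                   p: Package, q: Package) -> Package:
    """Assumed rule: the new root merges the two roots, the bases are unioned."""
    return Package(root=merge(p.root, q.root), base=p.base | q.base)


def phase1(records: Iterable[Record],
           match: Callable[[Record, Record], bool],
           merge: Callable[[Record, Record], Record]) -> List[Package]:
    """Naive worklist loop: keep merging until no two packages match."""
    todo = [as_package(r) for r in records]
    done: List[Package] = []
    while todo:
        p = todo.pop()
        partner = next((q for q in done if match_packages(match, p, q)), None)
        if partner is None:
            done.append(p)
        else:
            # Replace the matching package by the merged one and re-examine it.
            done.remove(partner)
            todo.append(merge_packages(merge, p, partner))
    return done


# Illustrative match and merge functions (assumptions, not from the paper).
def shares_pair(r: Record, s: Record) -> bool:
    return bool(r & s)      # match when some label-value pair is shared


def union_merge(r: Record, s: Record) -> Record:
    return r | s            # "information preserving" union of attributes
```

With the two Alice records of Section 6 plus an unrelated record, this sketch yields two packages, one of which has the union of the Alice attributes as its root and {r1, r2} as its base records. Because the union merge only ever adds label-value pairs, a shared pair can never disappear, which is the kind of behavior Property P3 asks for.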
Given the threshold property, we can compute TER(R) more efficiently. In the extended version of this paper, we prove that if the threshold property holds, then all results can be obtained from above-threshold records.

7.1 Algorithms Koosh-T and Koosh-TND

As with removing dominated records, Koosh can be easily modified to drop below-threshold records. First, we add an initial scan to remove all base records that are already below threshold. Then, we simply add the following conjunct to the condition of Line 10 of the algorithm:

$\mathit{merged}.C \ge T$

Thus, merged records are dropped if they are below the confidence threshold.

Theorem 7.2. When TER(R) is finite, Koosh-T terminates and computes TER(R).

By performing the same modification as above on Koosh-ND, we obtain the algorithm Koosh-TND, which computes the set $NER(R) \cap TER(R)$ of records in ER(R) that are neither dominated nor below threshold.

7.2 Packages-T and Packages-TND

If the threshold property holds, Koosh-T or Koosh-TND can be used for Phase 2 of the Packages algorithm, to obtain algorithm Packages-T or Packages-TND. In that case, below-threshold and/or dominated records are dropped as each package is expanded.

8. EXPERIMENTS

To summarize, we have discussed three main algorithms: BFA, Koosh, and Packages. For each of those basic three, there are three variants, adding in thresholds (T), non-domination (ND), or both (TND). In this section, we will compare the three algorithms against each other using both thresholds and non-domination. We will also investigate how performance is affected by varying threshold values, and, independently, by removing dominated records.

To test our algorithms, we ran them on synthetic data. Synthetic data gives us the
flexibility to carefully control the distribution of confidences, the probability that two records match, as well as other important parameters. Our goal in generating the data was to emulate a realistic scenario where n records describe various aspects of m real-world entities ($n > m$). If two of our records refer to the same entity, we expect them to match with much higher probability than if they referred to different entities.

To emulate this scenario, we assume that the real-world entities can be represented as points on a number line. Records about a particular entity with value x contain an attribute A with a value "close" to x. (The value is normally distributed with mean x, see below.) Thus, the match function can simply compare the A attribute of records: if the values are close, the records match. Records are also assigned a confidence, as discussed below.

For our experiments we use an "intelligence gathering" merge function as discussed in Section 6, which unions attributes. Thus, as a record merges with others, it accumulates A values and increases its chances of matching other records related to the particular real-world entity.

To be more specific, our synthetic data was generated using the following parameters (and their default values):

- n, the number of records to generate (default: 1000)
- m, the number of entities to simulate (default: 100)
- margin, the separation between entities (default: 75)
- $\sigma$, the standard deviation of the normal curve around each entity (default: 10)
- $\mu_c$, the mean of the confidence values (default: 0.8)

Figure 1: Thresholds vs. Matches

To generate one record r, we proceed as follows: First, pick a uniformly distributed random integer i in the range $[0, m-1]$. This integer represents the value for the real-world entity that r will represent. For the A value of r, generate a random
floating point value v from a normal distribution with standard deviation $\sigma$ and a mean of $\mathit{margin} \cdot i$. To generate r's confidence, compute a uniformly distributed value c in the range $[\mu_c - 0.1, \mu_c + 0.1]$ (with $\mu_c \in [0.1, 0.9]$ so that c stays in $[0, 1]$). Now create a record r with r.C = c and r.A = {A: v}. Repeat all of these steps n times to create n synthetic records.

Our merge function takes in the two records r1 and r2, and creates a new record rm, where $r_m.C = r_1.C \times r_2.C$ and $r_m.A = r_1.A \cup r_2.A$. The match function detects a match if for the A attribute, there exists a value v1 in r1.A and a value v2 in r2.A where $|v_1 - v_2| < k$, for a parameter k chosen in advance (k = 25 except where otherwise noted).

Naturally, our first experiment compares the performance of our three algorithms, BFA-TND, Koosh-TND and Packages-TND, against each other. We varied the threshold values to get a sense of how much faster the algorithms are when a higher threshold causes more records to be discarded. Each algorithm was run at the given threshold value three times, and the resulting number of comparisons was averaged over the three runs to get our final results.

Figure 1 shows the results of this first experiment. The first three lines on the graph represent the performance of our three algorithms. On the horizontal axis, we vary the threshold value. The vertical axis (logarithmic) indicates the number of calls to the match function, which we use as a measure of the work performed by the algorithms. The first thing we notice is that the work performed by the algorithms grows exponentially as the threshold is decreased. Thus, clearly thresholds are a very powerful tool: one can get high-confidence results at a relatively modest cost, while computing the lower confidence records gets progressively more expensive! Also interestingly, the BFA-TND and Koosh-TND lines are parallel to each other. This means that they are consistently a constant factor apart. Roughly, BFA does 10 times the number of comparisons that Koosh does.

Figure 2: Thresholds vs. Merges

The Packages-TND algorithm is far more efficient than the other two algorithms. Of course, Packages can only be used if Properties P1, P2 and P3 hold, but when they do hold, the savings can be dramatic. We believe that these savings can be a strong incentive for the application expert to design match and merge functions that satisfy the properties.

We also compared our algorithms based on the number of merges performed. In Figure 2, the vertical axis indicates the number of merges that are performed by the algorithms. We can see that Koosh-TND and Packages-TND are still a great improvement over BFA. BFA performs extra merges because in each iteration of its main loop, it recompares all records and merges any matches found. The extra merges result in duplicate records which are eliminated when they are added to the result set. Packages performs slightly more merges than Koosh, since the second phase of the algorithm does not use any of the merges that occurred in the first phase. If we subtract the Phase 1 merges from Packages (not shown in the figure), Koosh and Packages perform roughly the same number of merges.

In our next experiment, we compare the performance of our algorithms as we vary the probability that base records match. We can control the match probability by changing parameters k or $\sigma$, but we use the resulting match probability as the horizontal axis to provide more intuition. In particular, to generate Figure 3, we vary parameter k from 5 to 55 in increments of 5 (keeping the threshold value constant at 0.6). During each run, we measure the match probability as the fraction of base record matches that are positive. (The results are similar when we compute the match probability over all matches.) For each run, we then plot the match probability versus the number of calls to the match function, for our three algorithms.

As expected, the work increases with greater match probability, since more records are produced. Furthermore, we note that the BFA and Koosh lines are roughly parallel, but the Packages line stays level until a quick rise in the amount of work performed once the match probability reaches about 0.011. The Packages optimization takes advantage of the fact that records can be separated into packages that do not merge with one another.

Figure 3: Selectivity vs. Comparisons

In practice, we would expect to operate in the range of Figure 3 where the match probability is low and Packages outperforms Koosh. In our scenario with high match probabilities, records that refer to different entities are being merged, which means the match function is not doing its job. One could also get high match probabilities if there were very few entities, so that packages do not partition the problem finely. But again, in practice one would expect records to cover a large number of entities.

9. RELATED WORK

Originally introduced by Newcombe et al. [17] under the name of record linkage, and formalized by Fellegi and Sunter [9], the ER problem was studied under a variety of names, such as Merge/Purge [12], deduplication [18], reference reconciliation [8], object identification [21], and others. Most of the work in this area (see [23, 11] for recent surveys) focuses on the "matching" problem, i.e., on deciding which records do represent the same entities and which ones do not. This is generally done in two phases: Computing measures of how similar atomic values are (e.g., using edit-distances [20], TF-IDF [6], or adaptive techniques such as q-grams [4]), then feeding these measures into a model (with parameters), which makes matching decisions for records. Proposed models include unsupervised clustering techniques [12, 5], Bayesian networks [22], decision trees, SVMs, and conditional random fields [19]. The parameters of these models are learned either from a labeled training set (possibly with the help of a user, through active learning [18]), or using unsupervised techniques such as the EM algorithm [24].

All the techniques above manipulate and produce numerical values, when comparing atomic values (e.g., TF-IDF scores), as parameters of their internal model (e.g., thresholds, regression parameters, attribute weights), or as their output. But these numbers are often specific to the techniques at hand, and do not have a clear interpretation in terms of "confidence" in the records or the values. On the other hand, representations of uncertain data exist, which soundly model confidence in terms of probabilities (e.g., [1, 10]), or beliefs [14]. However, these approaches focus on computing the results and confidences of exact queries, extended with simple "fuzzy" operators for value comparisons (e.g., see [7]), and are not capable of any advanced form of entity resolution. We propose a flexible solution for ER that accommodates any model for confidences, and propose efficient algorithms based on their properties.

Our generic approach departs from existing techniques in that it interleaves merges with matches. The first phase of the Packages algorithm is similar to the set-union algorithm described in [16], but our use of a merge function allows the selection of a true representative record. The presence of "custom" merges is an important part of ER, and it makes confidences non-trivial to compute. The need for iterating matches and merges was identified by [3] and is also used in [8], but their record merges are simple aggregations (similar to our "information gathering" merge), and they do not consider the propagation of confidences through merges.

10. CONCLUSION

In this paper we look at ER with confidences as a "generic database" problem, where we are given black-boxes that compare and merge records, and we focus on efficient algorithms that reduce the number of calls to these boxes. The key to reducing work is to exploit generic properties (like the threshold property) that an application may have. If such properties hold we can use the optimizations we have studied (e.g., Koosh-T when the threshold property holds). Of the three optimizations, thresholds is the most flexible one, as it gives us a "knob" (the threshold) that one can control: For a high threshold, we only get high-confidence records, but we get them very efficiently. As we decrease the threshold, we start adding lower-confidence results to our answer, but the computational cost increases. The other two optimizations, domination and packages, can also reduce the cost of ER very substantially but do not provide such a control knob.

11. REFERENCES

[1] D. Barbara, H. Garcia-Molina, and D. Porter. The management of probabilistic data. IEEE Transactions on Knowledge and Data Engineering, 4(5):487-502, 1992.
[2] O. Benjelloun, H. Garcia-Molina, J. Jonas, Q. Su, and J. Widom. Swoosh: A generic approach to entity resolution. Technical report, Stanford University, 2005.
[3] I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In Proc. of the SIGMOD 2004 Workshop on Research Issues on Data Mining and Knowledge Discovery, June 2004.
[4] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In Proc. of ACM SIGMOD, pages 313-324. ACM Press, 2003.
[5] S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In Proc. of ICDE, Tokyo, Japan, 2005.
[6] William Cohen. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18:288-321, 2000.
[7] Nilesh N. Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases. In VLDB, pages 864-875, 2004.
[8] X. Dong, A. Y. Halevy, J. Madhavan, and E. Nemes. Reference reconciliation in complex information spaces. In Proc. of ACM SIGMOD, 2005.
[9] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, 1969.
[10] Norbert Fuhr and Thomas Rolleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst., 15(1):32-66, 1997.
[11] L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, 2003.
[12] M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. In Proc. of ACM SIGMOD, pages 127-138, 1995.
[13] Saul Kripke. Semantical considerations on modal logic. Acta Philosophica Fennica, 16:83-94, 1963.
[14] Suk Kyoon Lee. An extended relational database model for uncertain and imprecise information. In Li-Yan Yuan, editor, VLDB, pages 211-220. Morgan Kaufmann, 1992.
[15] David Menestrina, Omar Benjelloun, and Hector Garcia-Molina. Generic entity resolution with data confidences (extended version). Technical report, Stanford University, 2006.
[16] A. E. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In DMKD, pages 0-, 1997.
[17] H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130(3381):954-959, 1959.
[18] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proc. of ACM SIGKDD, Edmonton, Alberta, 2002.
[19] Parag Singla and Pedro Domingos. Object identification with attribute-mediated dependences. In Proc. of PKDD, pages 297-308, 2005.
[20] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195-197, 1981.
[21] S. Tejada, C. A. Knoblock, and S. Minton. Learning object identification rules for information integration. Information Systems Journal, 26(8):635-656, 2001.
[22] Vassilios S. Verykios, George V. Moustakides, and Mohamed G. Elfeky. A Bayesian decision model for cost optimal record matching. The VLDB Journal, 12(1):28-40, 2003.
[23] W. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC, 1999.
[24] W. E. Winkler. Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. American Statistical Association, Proceedings of the Section on Survey Research Methods, pages 667-671, 1988.
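The synthetic data generation of Section 8, together with its match and merge functions, can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used for the experiments: the names `Record`, `generate_records`, `match` and `merge`, the `seed` argument, and the use of a strict comparison against k are all hypothetical choices. The equality shortcut in `merge` is likewise an addition made here so that the sketch respects the idempotence property of Section 2.

```python
import random
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Record:
    conf: float                # r.C, the confidence of the record
    values: FrozenSet[float]   # the values of the single attribute A in r.A


def generate_records(n: int = 1000, m: int = 100, margin: float = 75.0,
                     sigma: float = 10.0, mu_c: float = 0.8,
                     seed: Optional[int] = None) -> List[Record]:
    """Generate n synthetic records about m entities on a number line."""
    if not 0.1 <= mu_c <= 0.9:
        raise ValueError("mu_c must lie in [0.1, 0.9] so that c stays in [0, 1]")
    rng = random.Random(seed)  # seed is a convenience added for repeatability
    records = []
    for _ in range(n):
        i = rng.randint(0, m - 1)                # entity index, inclusive range
        v = rng.gauss(margin * i, sigma)         # A value near the entity
        c = rng.uniform(mu_c - 0.1, mu_c + 0.1)  # confidence around mu_c
        records.append(Record(conf=c, values=frozenset([v])))
    return records


def match(r1: Record, r2: Record, k: float = 25.0) -> bool:
    """Match if some pair of A values is closer than k (strictness assumed)."""
    return any(abs(v1 - v2) < k for v1 in r1.values for v2 in r2.values)


def merge(r1: Record, r2: Record) -> Record:
    """Multiply the confidences and union the A values."""
    if r1 == r2:
        return r1  # assumption: keeps the merge idempotent on identical records
    return Record(conf=r1.conf * r2.conf, values=r1.values | r2.values)
```

Multiplying confidences means that a merged record never has a higher confidence than either of its inputs, which is why a confidence threshold can prune merged records in this setting.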