(RR). Although LEARN-ONE-RULE is described in detail later, intuitively, it adds {method, attribute} pairs to conjunctions to yield higher and higher RR values, until the RR stops improving. So, even though a restrictive conjunction might not cover all of the true matches, as SCA adds more disjuncts, more of the true matches will be covered, increasing the blocking scheme's pairs completeness (PC). However, the classic SCA needs to be modified so that we learn as many conjunctions as are necessary to cover all of the example true matches. This helps maximize the PC for the training data. Also, since the algorithm greedily constructs and learns the conjunctions, it may learn a simpler conjunction at a later iteration which covers the true matches of a previous conjunction. This is an opportunity to simplify the blocking scheme by replacing the more restrictive conjunction with the less restrictive one. These two issues are addressed by our slight modifications to SCA, shown in bold in Table 2.

Table 2: Modified Sequential Covering Algorithm

SEQUENTIAL-COVERING(class, attributes, examples)
    LearnedRules ← {}
    Rule ← LEARN-ONE-RULE(class, attributes, examples)
    While examples left to cover, do
        LearnedRules ← LearnedRules ∪ Rule
        Examples ← Examples − {Examples covered by Rule}
        Rule ← LEARN-ONE-RULE(class, attributes, examples)
        If Rule contains any previously learned rules, remove these contained rules.
    Return LearnedRules
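To make the control flow of Table 2 concrete, the following is a minimal Python sketch of the modified SCA (our own illustration, not the authors' code). Conjunctions are modeled as frozensets of (method, attribute) pairs so that subset testing implements the containment check; `learn_one_rule` and `covers` are hypothetical stand-ins for the components defined later.

```python
# Sketch of the modified Sequential Covering Algorithm (Table 2).
# Rules are frozensets of (method, attribute) pairs. A newly learned rule r2
# makes an earlier rule r1 redundant when r2's pairs are a proper subset of
# r1's, since the less restrictive conjunction covers everything r1 covers.

def sequential_covering(examples, learn_one_rule, covers):
    """Learn a disjunction of conjunctions covering all example true matches."""
    learned_rules = []
    while examples:  # modification 1: continue until every true match is covered
        rule = learn_one_rule(examples)
        if rule is None:  # no conjunction meets the minimum-PC threshold
            break
        examples = {e for e in examples if not covers(rule, e)}
        # modification 2: drop any previously learned conjunction that the
        # new, simpler conjunction is contained in
        learned_rules = [r for r in learned_rules if not rule < r]
        learned_rules.append(rule)
    return learned_rules
```

For instance, if ({token-match, last name} ∧ {token-match, phone}) was learned first and ({token-match, last name}) is learned later, the subset test removes the longer conjunction, exactly as described in the text.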
The first modification ensures that we keep learning rules while there are still true matches not covered by a previously learned rule. This forces our blocking scheme to maximize PC for the training data. SCA learns conjunctions independently of each other, since LEARN-ONE-RULE does not take into account any previously learned rules. So we can check the previously learned rules for containment within the newly learned rule. Since each of the conjunctions is disjoined together, any records covered by a more restrictive conjunction will be covered by a less restrictive conjunction. For example, if our current scheme includes the conjunction ({token-match, last name} ∧ {token-match, phone}) and we just learned the conjunction ({token-match, last name}), then we can remove the longer conjunction from our learned rules, because all of the records covered by the more restrictive rule will also be covered by the less specific one. This is really just an optimization step, because the same candidate set will be generated using both rules versus just the simpler one. However, fewer rules require fewer steps to do the blocking, which is desirable as a preprocessing step for record linkage. Note that we can do the conjunction containment check as we learn the rules. This is preferred to checking all of the conjunctions against each other when we finish running SCA. As will be shown later, we can guarantee that each newly learned rule is simpler than the rules learned before it that cover the same attribute(s). The proof of this is given below, where we define the LEARN-ONE-RULE step.

Learning each conjunction

As mentioned previously, the algorithm learns each conjunction, or blocking criterion, during the LEARN-ONE-RULE step of the modified SCA. To make each step generate as few candidates as possible, LEARN-ONE-RULE learns a conjunction with as high a reduction ratio (RR) as possible. The algorithm itself, shown in Table 3, is straightforward and intuitive. Starting with an empty conjunction, we try each {method, attribute} pair, and keep the ones that yield the higher RR while maintaining a pairs completeness (PC) above a minimum threshold. Then, we keep adding {method, attribute} pairs to the top conjunctions until the RR no longer improves, while still maintaining a PC greater than the threshold.

Note that LEARN-ONE-RULE uses a greedy, general-to-specific beam search. General-to-specific beam search makes each conjunction as restrictive as possible, because at each iteration we add another {method, attribute} pair to the best conjunct. Although any individual rule learned by general-to-specific beam search might only have a minimum PC, the disjunction of the final rules, as outlined above, will combine these rules to increase the PC, just as in the multi-pass approach. Thus, the goal of each LEARN-ONE-RULE is to learn a rule that maximizes RR as much as it can, so that when the rule is disjoined with the other rules, it contributes as few false-positive candidate matches to the final candidate set as possible. We use a beam search to allow for some backtracking as well, since we use a greedy approach.

The constraint that a conjunction has a minimum PC ensures that the learned conjunction does not over-fit to the data. If this restriction were not in place, it would be possible for LEARN-ONE-RULE to learn a conjunction that returns no candidates, uselessly producing an optimal RR. The algorithm's behavior is well defined for the minimum PC threshold. Consider the case where the algorithm is learning as restrictive a rule as it can with the minimum coverage. In this case, the parameter ends up partitioning the space of the cross product of example records by the threshold amount. That is, if we set the threshold amount to 50% of the examples covered, the most restrictive first rule covers 50% of the examples. The next rule covers 50% of what is remaining, which is 25% of the examples. The next will cover 12.5% of the examples, etc. In this sense, the parameter is well defined. If we set the threshold high, we will learn fewer, less restrictive conjunctions, possibly limiting our RR, although this may increase PC slightly. If we set it lower, we cover more examples, but we need to learn more conjunctions to do so.

Table 3: Learning a conjunction of {method, attribute} pairs

LEARN-ONE-RULE(attributes, examples, min_thresh, k)
    Best-Conjunction ← {}
    Candidate-conjunctions ← all {method, attribute} pairs
    While Candidate-conjunctions not empty, do
        For each ch ∈ Candidate-conjunctions
            If not first iteration
                ch ← ch ∪ {method, attribute}
            Remove any ch that are duplicates, inconsistent or not max. specific
            If REDUCTION-RATIO(ch) > REDUCTION-RATIO(Best-Conjunction)
               and PAIRS-COMPLETENESS(ch) ≥ min_thresh
                Best-Conjunction ← ch
        Candidate-conjunctions ← best k members of Candidate-conjunctions
    Return Best-Conjunction
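As a concrete illustration of the general-to-specific beam search of Table 3, here is a short Python sketch (ours, not the authors' implementation). The `rr` and `pc` arguments stand in for REDUCTION-RATIO and PAIRS-COMPLETENESS evaluated on the training examples; the representation and signatures are assumptions.

```python
# Hypothetical sketch of LEARN-ONE-RULE (Table 3): a greedy general-to-specific
# beam search over conjunctions, modeled as frozensets of (method, attribute)
# pairs. rr(c) and pc(c) score a conjunction's reduction ratio and pairs
# completeness; k is the beam width; min_thresh is the minimum PC allowed.

def learn_one_rule(pairs, rr, pc, min_thresh, k):
    best, best_rr = None, -1.0
    beam = [frozenset({p}) for p in pairs]  # start general: single pairs
    while beam:
        for cand in beam:
            # keep the candidate with the best RR that still meets the PC floor
            if rr(cand) > best_rr and pc(cand) >= min_thresh:
                best, best_rr = cand, rr(cand)
        # specialize: extend each beam member with one more pair
        # (the set comprehension also removes duplicate conjunctions)
        extended = {c | {p} for c in beam for p in pairs if p not in c}
        # keep only the k highest-RR candidates for the next level
        beam = sorted(extended, key=rr, reverse=True)[:k]
        if best is not None and all(rr(c) <= best_rr for c in beam):
            break  # the RR has stopped improving
    return best
```

Note how the PC floor does the over-fitting control described above: a fully specialized conjunction may have a perfect RR, but it is rejected if it no longer covers `min_thresh` of the true matches.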
BS = ({token, model} ∧ {token, year} ∧ {token, trim}) ∪ ({token, model} ∧ {token, year} ∧ {synonym, trim})

Again, the comparison in Table 5 shows that both perform equally well in terms of PC. In fact, there is no statistically significant difference with respect to each other, using a two-tailed t-test with α = 0.05. Yet, despite the similar PC, our generated rules increase the RR by more than 50%.

Table 5: Blocking scheme results on cars

        RR      PC
BSL     99.86   99.92
HFM     47.92   99.97

In the final comparison, we use synthetic Census data, called dataset 4 from (Gu & Baxter 2004). Gu and Baxter present their adaptive filtering method using one method, bigram indexing (Baxter, Christen, & Churches 2003), on one attribute. We cannot generate a different blocking scheme from theirs, since there is only one method and one attribute to choose from. Instead, we took the attributes and methods from the 11 blocking criteria presented in (Winkler 2005) for matching the 2000 Decennial Census to an Accuracy and Coverage Estimation (ACE) file. We remove the 2-way switch method because it is specific to two attributes, and we focus on methods that apply generally to the attributes. This left 3 methods to use: {token, first, first-3}. Token and first are described above. First-3 is similar to first, but it matches the first three letters of an attribute, rather than just the first one. There are 8 attributes to choose from: {first name, surname, zip, date-of-birth, day-of-birth, month-of-birth, phone, house number}. From these methods and attributes, we learn blocking schemes such as the following:

BS = ({first-3, surname} ∧ {first, zip}) ∪ ({token, house number} ∧ {first-3, phone}) ∪ ({token, house number} ∧ {first-3, date-of-birth}) ∪ ({first-3, first name}) ∪ ({token, phone}) ∪ ({token, day-of-birth} ∧ {token, zip} ∧ {first-3, date-of-birth}) ∪ ({token, house number} ∧ {first, first name})

For comparison, we present our results against a blocking scheme composed of the best 5 conjunctions of (Winkler 2005). As stated in the paper, the best five blocking scheme is:

BS = ({token, zip} ∧ {first, surname}) ∪ ({token, phone}) ∪ ({first-3, zip} ∧ {token, day-of-birth} ∧ {token, month-of-birth}) ∪ ({token, zip} ∧ {token, house number}) ∪ ({first-3, surname} ∧ {first-3, first name})

Note that the blocking criteria constructed by Winkler are meant for real census data, not the synthetic data we use. However, the best-five rule still does impressively well on this data, as shown in Table 6. This is not surprising, as Winkler is regarded as an expert on matching census data, and the synthetic data is not as difficult to match as the real census data. Encouragingly, Table 6 also shows that our generated rules performed better on PC and only slightly worse on RR as compared to the domain expert. Note that the RR and PC results are statistically significant with respect to each other.

Table 6: Blocking scheme results on synthetic census data

                    RR      PC
BSL                 98.12   99.85
Best five           99.52   99.16
Adaptive Filtering  99.9    92.7

Table 6 also compares the adaptive filtering method (Gu & Baxter 2004) to our method. Although the adaptive filtering results are for all the records in the set, rather than half as in our 2-fold cross validation, we still think comparing these results to ours sheds some insight. Observe that adaptive filtering maximizes RR tremendously. However, we believe that it is misguided to focus only on reducing the size of the candidate set without addressing the coverage of true matches. In their best reported result (shown in Table 6), adaptive filtering achieves a PC of 92.7. That would represent leaving out roughly 365 of the 5,000 true matches in the candidate set.

As stated in the beginning of this section, we think training on 50% of the data is unrealistic. Labeling only 10% of the data for training represents a much more practical supervised learning scenario, so we ran our experiments again using 10% of the data for training and testing on the other 90%. Table 7 compares the results, averaged over 10 trials.

Table 7 shows that the algorithm scales well to less training data. In the cars and census experiments, the degradation in performance for the learned blocking schemes is small. The more interesting case is the restaurants data. Here the blocking scheme trained on 10% of the data did not learn to cover as many true matches as when we used 50% of the data for training. In this case, 10% of the data for training represents 54 records and 11.5 matches, on average, which was not enough for BSL to learn as good a blocking scheme. However, the small size of the restaurant data set makes the result appear worse than it is. Although the percentage drop in PC seems large, it really only represents missing 6.5 true matches on average versus missing 1. This highlights a problem with our approach to learning blocking schemes. The learned blocking scheme's ability to cover true matches is limited by the number of true matches supplied for training. This is usually a concern with supervised learning techniques: the robustness of the learning is constrained by the amount of training data. In the case of the restaurant data, because the data set is small, this problem can be highlighted easily. As Figure 2 shows, as we increase the training data, we can improve the performance of the PC.
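The RR and PC figures reported throughout these comparisons follow the standard definitions: RR = 1 − |C| / (|A| · |B|), where C is the candidate set generated by a blocking scheme over record sets A and B, and PC is the fraction of true matches that appear in C. A small sketch of these metrics (variable names are ours):

```python
# Sketch of the reduction ratio (RR) and pairs completeness (PC) metrics used
# in Tables 5-7. `candidates` is the set of record pairs a blocking scheme
# generates; `true_matches` is the full set of labeled true-match pairs.

def reduction_ratio(candidates, num_records_a, num_records_b):
    """Fraction of the full cross product that blocking avoids comparing."""
    return 1.0 - len(candidates) / (num_records_a * num_records_b)

def pairs_completeness(candidates, true_matches):
    """Fraction of true matches that survive into the candidate set."""
    return len(true_matches & candidates) / len(true_matches)
```

Under these definitions, adaptive filtering's PC of 92.7 on a data set with about 5,000 true matches corresponds to roughly 5,000 × (1 − 0.927) ≈ 365 true matches excluded from the candidate set, which is the figure cited above.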