(RR). Although LEARN-ONE-RULE is described in detail later, intuitively, it adds {method, attribute} pairs to conjunctions to yield higher and higher RR values, until the RR stops improving. So, even though a restrictive conjunction might not cover all of the true matches, as SCA adds more disjuncts, more of the true matches will be covered, increasing the blocking scheme's pairs completeness (PC). However, the classic SCA needs to be modified so that we learn as many conjunctions as are necessary to cover all of the example true matches. This helps maximize the PC for the training data. Also, since the algorithm greedily constructs and learns the conjunctions, it may learn a simpler conjunction at a later iteration which covers the true matches of a previous conjunction. This is an opportunity to simplify the blocking scheme by replacing the more restrictive conjunction with the less restrictive one. These two issues are addressed by our slight modifications to SCA, shown in Table 2.

Table 2: Modified Sequential Covering Algorithm

    SEQUENTIAL-COVERING(class, attributes, examples)
        LearnedRules ← {}
        Rule ← LEARN-ONE-RULE(class, attributes, examples)
        While examples left to cover, do
            LearnedRules ← LearnedRules ∪ Rule
            Examples ← Examples - {Examples covered by Rule}
            Rule ← LEARN-ONE-RULE(class, attributes, examples)
            If Rule contains any previously learned rules,
                remove these contained rules
        Return LearnedRules

The first modification ensures that we keep learning rules while there are still true matches not covered by a previously learned rule. This forces our blocking scheme to maximize PC for the training data. SCA learns conjunctions independently of each other, since LEARN-ONE-RULE does not take into account any previously learned rules. So we can check the previously learned rules for containment within the newly learned rule. Since the conjunctions are disjoined together, any records covered by a more restrictive conjunction will also be covered by a less restrictive conjunction. For example, if our current scheme includes the conjunction ({token-match, last name} ∧ {token-match, phone}) and we just learned the conjunction ({token-match, last name}), then we can remove the longer conjunction from our learned rules, because all of the records covered by the more restrictive rule will also be covered by the less specific one. This is really just an optimization step, because using both rules generates the same candidate set as using just the simpler one. However, fewer rules require fewer steps to do the blocking, which is desirable in a preprocessing step for record linkage. Note that we can do the conjunction containment check as we learn the rules. This is preferable to checking all of the conjunctions against each other after SCA finishes. As will be shown later, we can guarantee that each newly learned rule is simpler than the rules learned before it that cover the same attribute(s). The proof of this is given below, where we define the LEARN-ONE-RULE step.
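To make Table 2 concrete, the modified sequential covering loop can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in our experiments: learn_one_rule and covers are hypothetical helpers standing in for LEARN-ONE-RULE and the rule-coverage test, and a conjunction is represented as a frozenset of (method, attribute) pairs so that the containment check reduces to a subset test.

    from typing import FrozenSet, List, Tuple

    Conjunction = FrozenSet[Tuple[str, str]]  # a set of (method, attribute) pairs

    def sequential_covering(attributes, examples, learn_one_rule, covers):
        """Modified SCA of Table 2: keep learning rules until every training
        true match is covered, pruning older rules that a newer, simpler rule
        makes redundant. learn_one_rule and covers are hypothetical helpers."""
        learned_rules: List[Conjunction] = []
        rule = learn_one_rule(attributes, examples)
        while examples and rule is not None:
            learned_rules.append(rule)
            # Remove the true matches the new rule already covers.
            examples = [ex for ex in examples if not covers(rule, ex)]
            rule = learn_one_rule(attributes, examples) if examples else None
            if rule is not None:
                # Containment check: a new conjunction that is a proper subset
                # of an older one covers at least the same records, so the
                # older, more restrictive rule is redundant.
                learned_rules = [old for old in learned_rules if not rule < old]
        return learned_rules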
Learning each conjunction

As mentioned previously, the algorithm learns each conjunction, or "blocking criteria," during the LEARN-ONE-RULE step of the modified SCA. To make each step generate as few candidates as possible, LEARN-ONE-RULE learns a conjunction with as high a reduction ratio (RR) as possible. The algorithm itself, shown in Table 3, is straightforward and intuitive. Starting with an empty conjunction, we try each {method, attribute} pair and keep the ones that yield the highest RR while maintaining a pairs completeness (PC) above a minimum threshold. Then we keep adding {method, attribute} pairs to the top conjunctions until the RR no longer improves, while still maintaining a PC greater than the threshold. Note that LEARN-ONE-RULE uses a greedy, general-to-specific beam search. General-to-specific beam search makes each conjunction as restrictive as possible, because at each iteration we add another {method, attribute} pair to the best conjunct. Although any individual rule learned by general-to-specific beam search might only have the minimum PC, the disjunction of the final rules, as outlined above, combines these rules to increase the PC, just as in the multi-pass approach. Thus, the goal of each LEARN-ONE-RULE call is to learn a rule that maximizes RR as much as it can, so that when the rule is disjoined with the other rules, it contributes as few false-positive candidate matches to the final candidate set as possible. We use a beam search to allow for some backtracking as well, since we use a greedy approach.

The constraint that a conjunction has a minimum PC ensures that the learned conjunction does not overfit to the data. If this restriction were not in place, it would be possible for LEARN-ONE-RULE to learn a conjunction that returns no candidates, uselessly producing an optimal RR. The algorithm's behavior is well defined with respect to the minimum PC threshold. Consider the case where the algorithm is learning as restrictive a rule as it can with the minimum coverage. In this case, the parameter ends up partitioning the space of the cross product of example records by the threshold amount. That is, if we set the threshold to 50% of the examples covered, the most restrictive first rule covers 50% of the examples. The next rule covers 50% of what remains, which is 25% of the examples. The next will cover 12.5% of the examples, and so on. In this sense, the parameter is well defined. If we set the threshold high, we will learn fewer, less restrictive conjunctions, possibly limiting our RR, although this may increase PC slightly. If we set it lower, we cover more examples, but we need to learn more conjunctions.

Table 3: Learning a conjunction of {method, attribute} pairs

    LEARN-ONE-RULE(attributes, examples, min_thresh, k)
        Best-Conjunction ← {}
        Candidate-conjunctions ← all {method, attribute} pairs
        While Candidate-conjunctions not empty, do
            For each ch ∈ Candidate-conjunctions
                If not first iteration
                    ch ← ch ∪ {method, attribute}
                Remove any ch that are duplicates, inconsistent, or not maximally specific
                If REDUCTION-RATIO(ch) > REDUCTION-RATIO(Best-Conjunction)
                   and PAIRS-COMPLETENESS(ch) ≥ min_thresh
                    Best-Conjunction ← ch
            Candidate-conjunctions ← best k members of Candidate-conjunctions
        Return Best-Conjunction
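A corresponding sketch of the general-to-specific beam search in Table 3 is given below, under the same caveats: reduction_ratio and pairs_completeness are hypothetical scoring functions over the training examples, k is the beam width, and Table 3's pruning of inconsistent or non-maximally-specific candidates is simplified here to duplicate removal.

    def learn_one_rule(methods, attributes, examples,
                       reduction_ratio, pairs_completeness,
                       min_thresh=0.5, k=4):
        """General-to-specific beam search over conjunctions of
        (method, attribute) pairs, per Table 3."""
        all_pairs = [frozenset([(m, a)]) for m in methods for a in attributes]
        candidates = list(all_pairs)      # first iteration: single pairs
        best, best_rr = None, 0.0         # an empty conjunction blocks nothing
        while candidates:
            scored = []
            for ch in candidates:
                rr = reduction_ratio(ch, examples)
                # Keep the conjunction with the best RR that still covers at
                # least min_thresh of the true matches.
                if rr > best_rr and pairs_completeness(ch, examples) >= min_thresh:
                    best, best_rr = ch, rr
                scored.append((rr, ch))
            # Beam step: retain the k best candidates, then specialize each by
            # conjoining one more {method, attribute} pair.
            scored.sort(key=lambda pair: pair[0], reverse=True)
            beam = [ch for _, ch in scored[:k]]
            seen, candidates = set(), []
            for ch in beam:
                for p in all_pairs:
                    new = ch | p
                    if len(new) > len(ch) and new not in seen:  # drop duplicates
                        seen.add(new)
                        candidates.append(new)
            # Discard specializations whose PC has fallen below the threshold;
            # the search stops once no candidate remains.
            candidates = [ch for ch in candidates
                          if pairs_completeness(ch, examples) >= min_thresh]
        return best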
For the cars data, for example, our approach learned the following blocking scheme:

BS = ({token, model} ∧ {token, year} ∧ {token, trim})
     ∪ ({token, model} ∧ {token, year} ∧ {synonym, trim})

Again, the comparison in Table 5 shows that both perform equally well in terms of PC. In fact, there is no statistically significant difference between them, using a two-tailed t-test with α = 0.05. Yet, despite the similar PC, our generated rules increase the RR by more than 50%.

Table 5: Blocking scheme results on cars

           RR      PC
    BSL    99.86   99.92
    HFM    47.92   99.97

In the final comparison, we use synthetic Census data, called "dataset4," from (Gu & Baxter 2004). Gu and Baxter present their adaptive filtering method using one method, bigram indexing (Baxter, Christen, & Churches 2003), on one attribute. We cannot generate a different blocking scheme from theirs, since there is only one method and one attribute to choose from. Instead, we took the attributes and methods from the 11 blocking criteria presented in (Winkler 2005) for matching the 2000 Decennial Census to an Accuracy and Coverage Estimation (ACE) file. We remove the 2-way switch method because it is specific to two attributes, and we focus on methods that apply generally to the attributes. This leaves 3 methods to use: {token, first, first-3}. Token and first are described above. First-3 is similar to first, but it matches the first three letters of an attribute, rather than just the first one. There are 8 attributes to choose from: {first name, surname, zip, date-of-birth, day-of-birth, month-of-birth, phone, house number}. From these methods and attributes, we learn blocking schemes such as the following:

BS = ({first-3, surname} ∧ {first, zip})
     ∪ ({token, house number} ∧ {first-3, phone})
     ∪ ({token, house number} ∧ {first-3, date-of-birth})
     ∪ ({first-3, first name})
     ∪ ({token, phone})
     ∪ ({token, day-of-birth} ∧ {token, zip} ∧ {first-3, date-of-birth})
     ∪ ({token, house number} ∧ {first, first name})

For comparison, we present our results against a blocking scheme composed of the best 5 conjunctions of (Winkler 2005). As stated in that paper, the "best five" blocking scheme is:

BS = ({token, zip} ∧ {first, surname})
     ∪ ({token, phone})
     ∪ ({first-3, zip} ∧ {token, day-of-birth} ∧ {token, month-of-birth})
     ∪ ({token, zip} ∧ {token, house number})
     ∪ ({first-3, surname} ∧ {first-3, first name})

Note that the blocking criteria constructed by Winkler are meant for real census data, not the synthetic data we use. However, the best-five rule still does impressively well on this data, as shown in Table 6. This is not surprising, as Winkler is regarded as an expert on matching census data, and the synthetic data is not as difficult to match as the real census data. Encouragingly, Table 6 also shows that our generated rules performed better on PC and only slightly worse on RR compared to the domain expert. Note that the RR and PC results are statistically significant with respect to each other.

Table 6: Blocking scheme results on synthetic census data

                          RR      PC
    BSL                   98.12   99.85
    "Best five"           99.52   99.16
    Adaptive Filtering    99.9    92.7
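When reading Tables 5 and 6, it helps to recall how the two metrics are computed. The paper's own definitions appear earlier, outside this excerpt, so the sketch below uses the standard formulations from the blocking literature: RR is the fraction of the full cross product of records that blocking discards, and PC is the fraction of true matches that survive blocking.

    def reduction_ratio(num_candidates, num_record_pairs):
        """RR = 1 - |C| / |S x T|: how much of the cross product of the two
        data sets blocking removes from consideration."""
        return 1.0 - num_candidates / num_record_pairs

    def pairs_completeness(true_matches_in_candidates, total_true_matches):
        """PC: the fraction of all true matches kept in the candidate set."""
        return true_matches_in_candidates / total_true_matches

    # Sanity check against the adaptive-filtering row of Table 6: a PC of
    # 92.7% over 5,000 true matches leaves out (1 - 0.927) * 5000 ≈ 365.
    print(round((1 - 0.927) * 5000))  # -> 365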
Table 6 also compares the adaptive filtering method (Gu & Baxter 2004) to our method. Although the adaptive filtering results are for all the records in the set, rather than half as in our 2-fold cross validation, we still think comparing these results to ours sheds some insight. Observe that adaptive filtering maximizes RR tremendously. However, we believe that it is misguided to focus only on reducing the size of the candidate set without addressing the coverage of true matches. In their best reported result (shown in Table 6), adaptive filtering achieves a PC of 92.7. That would represent leaving roughly 365 of the 5,000 true matches out of the candidate set.

As stated at the beginning of this section, we think training on 50% of the data is unrealistic. Labeling only 10% of the data for training represents a much more practical supervised learning scenario, so we ran our experiments again using 10% of the data for training and testing on the other 90%. Table 7 compares the results, averaged over 10 trials.

Table 7 shows that the algorithm scales well to less training data. In the cars and census experiments, the degradation in performance for the learned blocking schemes is small. The more interesting case is the restaurants data. Here the blocking scheme trained on 10% of the data did not learn to cover as many true matches as when we used 50% of the data for training. In this case, 10% of the data for training represents 54 records and 11.5 matches, on average, which was not enough for BSL to learn as good a blocking scheme. However, the small size of the restaurant data set makes the result appear worse than it is. Although the percentage drop in PC seems large, it really only represents missing 6.5 true matches on average versus missing 1.

This highlights a problem with our approach to learning blocking schemes. The learned blocking scheme's ability to cover true matches is limited by the number of true matches supplied for training. This is usually a concern with supervised learning techniques: the robustness of the learning is constrained by the amount of training data. In the case of the restaurant data, because the data set is small, this problem can be highlighted easily. As Figure 2 shows, as we increase the training data, we can improve the performance of the PC.
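Finally, the 10%/90% protocol just described can be sketched as follows. The train_scheme and score_scheme arguments are hypothetical stand-ins for training a blocking scheme (e.g., via sequential covering) and scoring it with RR and PC; the sampling and averaging logic is generic rather than taken from our implementation.

    import random

    def evaluate_blocking(records, train_scheme, score_scheme,
                          train_frac=0.10, trials=10, seed=0):
        """Average RR and PC over repeated train/test splits, mirroring the
        10%/90% experiments behind Table 7."""
        rng = random.Random(seed)
        rr_total = pc_total = 0.0
        for _ in range(trials):
            shuffled = records[:]
            rng.shuffle(shuffled)
            cut = int(len(shuffled) * train_frac)
            train, test = shuffled[:cut], shuffled[cut:]
            scheme = train_scheme(train)          # learn a blocking scheme
            rr, pc = score_scheme(scheme, test)   # measure RR and PC on test
            rr_total += rr
            pc_total += pc
        return rr_total / trials, pc_total / trials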