Michelson & Knoblock

This requires the agent to contain two data gathering mechanisms: the ability to query sources and the ability to integrate relevant sources of information. However, these data gathering mechanisms assume that the sources themselves are designed to support relational queries, such as having a well-defined schema and standard values for the attributes. Yet this is not always the case. There are many data sources on the World Wide Web that would be useful to query, but the textual data within them is unstructured and is not designed to support querying. We call the text of such data sources "posts." Examples of "posts" include the text of eBay auction listings, Internet classifieds like Craigslist, bulletin boards such as Bidding For Travel [3], or even the summary text below the hyperlinks returned after querying Google. As a running example, consider the three posts for used car classifieds shown in Table 1.

Table 1: Three posts for Honda Civics from Craigslist

Craigslist Post
93 civic 5speed runs great obo (ri) $1800
93- 4dr Honda Civc LX Stick Shift $1800
94 DEL SOL Si Vtec (Glendale) $3000

The current method to query posts, whether by an agent or a person, is keyword search. However, keyword search is inaccurate and cannot support relational queries. For example, a difference in spelling between the keyword and that same attribute within a post would prevent that post from being returned in the search. This would be the case if a user searched the example listings for "Civic," since the second post would not be returned. Another factor which limits keyword accuracy is the exclusion of redundant attributes. For example, some classified posts about cars only include the car model, and not the make, since the make is implied by the model. This is shown in the first and third posts of Table 1. In these cases, if a user does a keyword search using the make "Honda," these posts will not be returned.

Moreover, keyword search is not a rich query framework. For instance, consider the query, "What is the average price for all Hondas from 1999 or later?" To do this with keyword search requires a user to search on "Honda" and retrieve all posts that are from 1999 or later. Then the user must traverse the returned set, keeping track of the prices and removing incorrectly returned posts.

However, if a schema with standardized attribute values is defined over the entities in the posts, then a user could run the example query using a simple SQL statement and do so accurately, addressing both problems created by keyword search. The standardized attribute values ensure invariance to issues such as spelling differences. Also, each post is associated with a full schema with values, so even though a post might not contain a car make, for instance, its schema does and has the correct value for it, so it will be returned in a query on car makes. Furthermore, these standardized values allow for integration of the source with outside sources. Integrating sources usually entails joining the two sources directly on attributes or translations of the attributes. Without standardized values and [...]

[3] www.biddingfortravel.com

[...] Edmunds car buying guide, we use wrapper technologies (AgentBuilder [7] in this case) to scrape data from the Web source, using the schema that the source defines for the car.

To use a reference set to build a relational data set we exploit the attributes in the reference set to determine the attributes from the post that can be extracted. The first step of our algorithm finds the best matching member of the reference set for the post. This is called the "record linkage" step. By matching a post to a member of the reference set we can define schema elements for the post using the schema of the reference set, and we can provide standard values for these attributes by using the attribute values from the reference set when a user queries the posts.

Next, we perform information extraction to extract the actual values in the post that match the schema elements defined by the reference set. This step is the information extraction step. During the information extraction step, the parts of the post are extracted that best match the attribute values from the reference set member chosen during the record linkage step. In this step we also extract attributes that are not easily represented by reference sets, such as prices or dates. Although we already have the schema and standardized attributes required to create a relational data set over the posts, we still extract the actual attributes embedded within the post so that we can more accurately learn to extract the attributes not represented by a reference set, such as prices and dates. While these attributes can be extracted using regular expressions, if we extract the actual attributes within the post we might be able to do so more accurately. For example, consider the "Ford 500" car. Without actually extracting the attributes within a post, we might extract "500" as a price, when it is actually a car name. Our overall approach is outlined in Figure 1.

Although we previously described a similar approach to semantically annotating posts (Michelson & Knoblock, 2005), this paper extends that research by combining the annotation with our work on more scalable record matching (Michelson & Knoblock, 2006). Not only does this make the matching step for our annotation more scalable, it also demonstrates that our work on efficient record matching extends to our unique problem of matching posts, with embedded attributes, to structured, relational data. This paper also presents a more detailed description than our past work, including a more thorough evaluation of the procedure than previously, using larger experimental data sets including a reference set that includes tens of thousands of records.

This article is organized as follows. We first describe our algorithm for aligning the posts to the best matching members of the reference set in Section 2. In particular, we show how this matching takes place, and how we efficiently generate candidate matches to make the matching procedure more scalable. In Section 3, we demonstrate how to exploit the matches to extract the attributes embedded within the post. We present some experiments in Section 4, validating our approaches to blocking, matching and information extraction for unstructured and ungrammatical text. We follow with a discussion of these results in Section 5 and then present related work in Section 6. We finish with some final thoughts and conclusions in Section 7.
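The introduction's example query ("average price for all Hondas from a given year or later") can be sketched against a relational view of the posts. This is a minimal illustration only: the table name, column names, and the standardized values filled in for the three posts of Table 1 are assumptions made for the example, not the schema or data used in the paper.

```python
import sqlite3

# Hypothetical relational view of the annotated posts. The table layout and
# the standardized make/model/year values below are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (post TEXT, make TEXT, model TEXT, year INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
    [
        ("93 civic 5speed runs great obo (ri) $1800", "Honda", "Civic", 1993, 1800.0),
        ("93- 4dr Honda Civc LX Stick Shift $1800", "Honda", "Civic", 1993, 1800.0),
        ("94 DEL SOL Si Vtec (Glendale) $3000", "Honda", "Civic Del Sol", 1994, 3000.0),
    ],
)


def average_price(connection, make, min_year):
    """Average price of all posts whose standardized make and year qualify."""
    row = connection.execute(
        "SELECT AVG(price) FROM posts WHERE make = ? AND year >= ?",
        (make, min_year),
    ).fetchone()
    return row[0]  # None when no post qualifies


honda_average = average_price(conn, "Honda", 1993)
```

Because the query runs over the standardized make rather than the post text, the first and third posts are included even though neither contains the word "Honda," and the misspelled "Civc" in the second post does not matter.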
[7] A product of Fetch Technologies http://www.fetch.com/products.asp

Relational Data from Unstructured Data Sources

[...] make}) to ({token-match, make} ∧ {token-match, trim}) we still cover new true matches, but we generate fewer additional candidates. Therefore effective blocking schemes should learn conjunctions that minimize the false positives, but learn enough of these conjunctions to cover as many true matches as possible. These two goals of blocking can be clearly defined by the Reduction Ratio and Pairs Completeness (Elfeky, Verykios, & Elmagarmid, 2002).

The Reduction Ratio (RR) quantifies how well the current blocking scheme minimizes the number of candidates. Let C be the number of candidate matches and N be the size of the cross product between both data sets.

RR = 1 - C/N

It should be clear that adding more {method, attribute} pairs to a conjunction increases its RR, as when we changed ({token-match, zip}) to ({token-match, zip} ∧ {token-match, first name}).

Pairs Completeness (PC) measures the coverage of true positives, i.e., how many of the true matches are in the candidate set versus those in the entire set. If S_m is the number of true matches in the candidate set, and N_m is the number of matches in the entire data set, then:

PC = S_m / N_m

Adding more disjuncts can increase our PC. For example, we added the second conjunction to our example blocking scheme because the first did not cover all of the matches.

The blocking approach in this paper, "Blocking Scheme Learner" (BSL), learns effective blocking schemes in disjunctive normal form by maximizing the reduction ratio and pairs completeness. In this way, BSL tries to maximize the two goals of blocking. Previously we showed BSL aided the scalability of record linkage (Michelson & Knoblock, 2006), and this paper extends that idea by showing that it also can work in the case of matching posts to the reference set records.

The BSL algorithm uses a modified version of the Sequential Covering Algorithm (SCA), used to discover disjunctive sets of rules from labeled training data (Mitchell, 1997). In our case, SCA will learn disjunctive sets of conjunctions consisting of {method, attribute} pairs. Basically, each call to LEARN-ONE-RULE generates a conjunction, and BSL keeps iterating over this call, covering the true matches left over after each iteration. This way SCA learns a full blocking scheme. The BSL algorithm is shown in Table 2.

There are two modifications to the classic SCA algorithm, which are shown in bold. First, BSL runs until there are no more examples left to cover, rather than stopping at some threshold. This ensures that we maximize the number of true matches generated as candidates by the final blocking rule (Pairs Completeness). Note that this might, in turn, yield a large number of candidates, hurting the Reduction Ratio. However, omitting true matches directly affects the accuracy of record linkage, and blocking is a preprocessing step for record linkage, so it is more important to cover as many true matches as possible. This way BSL fulfills one of the blocking goals: not eliminating true matches if possible. Second, if we learn a new conjunction (in the LEARN-ONE-RULE step) and our current blocking scheme has a rule that already contains the newly learned rule, then we can remove the rule containing the newly learned rule. This is an optimization that allows us to check rule containment as we go, rather than at the end.

[...]

The algorithm's behavior is well defined for the minimum PC threshold. Consider the case where the algorithm is learning as restrictive a rule as it can with the minimum coverage. In this case, the parameter ends up partitioning the space of the cross product of example records by the threshold amount. That is, if we set the threshold amount to 50% of the examples covered, the most restrictive first rule covers 50% of the examples. The next rule covers 50% of what is remaining, which is 25% of the examples. The next will cover 12.5% of the examples, etc. In this sense, the parameter is well defined. If we set the threshold high, we will learn fewer, less restrictive conjunctions, possibly limiting our RR, although this may increase PC slightly. If we set it lower, we cover more examples, but we need to learn more conjuncts. These newer conjuncts, in turn, may be subsumed by later conjuncts, so they will be a waste of time to learn. So, as long as this parameter is small enough, it should not affect the coverage of the final blocking scheme, and setting it smaller than that just slows down the learning. We set this parameter to 50% for our experiments [8].

Now we analyze the running time of BSL and we show how BSL can take into account the running time of different blocking methods, if need be. Assume that we have x (method, attribute) pairs such as (token, first name). Now, assume that our beam size is b, since we use general-to-specific beam search in our LEARN-ONE-RULE procedure. Also, for the time being, assume each (method, attribute) pair can generate its blocking candidates in O(1) time. (We relax this assumption later.) Each time we hit LEARN-ONE-RULE within BSL, we will try all rules in the beam with all of the (method, attribute) pairs not in the current beam rules. So, in the worst case, this takes O(bx) each time, since for each (method, attribute) pair in the beam, we try it against all other (method, attribute) pairs. Now, in the worst case, each learned disjunct would only cover 1 training example, so our rule is a disjunction of all x pairs. Therefore, we run LEARN-ONE-RULE x times, resulting in a learning time of O(bx^2). If we have e training examples, the full training time is O(ebx^2) for BSL to learn the blocking scheme.

Now, while we assumed above that each (method, attribute) runs in O(1) time, this is clearly not the case, since there is a substantial amount of literature on blocking methods and [...]

[8] Setting this parameter lower than 50% had an insignificant effect on our results, and setting it much higher, to 90%, only increased the PC by a small amount (if at all), while decreasing the RR.

Table 3: Learning a conjunction of {method, attribute} pairs

LEARN-ONE-RULE(attributes, examples, min_thresh, k)
  Best-Conjunction ← {}
  Candidate-conjunctions ← all {method, attribute} pairs
  While Candidate-conjunctions not empty, do
    For each ch ∈ Candidate-conjunctions
      If not first iteration
        ch ← ch ∪ {method, attribute}
      Remove any ch that are duplicates, inconsistent or not max. specific
      if REDUCTION-RATIO(ch) > REDUCTION-RATIO(Best-Conjunction)
         and PAIRS-COMPLETENESS(ch) ≥ min_thresh
        Best-Conjunction ← ch
    Candidate-conjunctions ← best k members of Candidate-conjunctions
  return Best-Conjunction
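The two quantities that LEARN-ONE-RULE compares, REDUCTION-RATIO and PAIRS-COMPLETENESS, follow directly from the definitions RR = 1 - C/N and PC = S_m / N_m. The sketch below computes them on a small made-up data set; the record identifiers and candidate pairs are invented purely for illustration.

```python
def reduction_ratio(num_candidates, cross_product_size):
    """RR = 1 - C/N: fraction of the cross product that blocking prunes away."""
    return 1.0 - num_candidates / cross_product_size


def pairs_completeness(candidates, true_matches):
    """PC = S_m / N_m: fraction of all true matches kept in the candidate set."""
    return len(candidates & true_matches) / len(true_matches)


# Toy data (hypothetical): 4 posts and 5 reference records, so N = 20 pairs.
posts = ["p1", "p2", "p3", "p4"]
references = ["r1", "r2", "r3", "r4", "r5"]
true_matches = {("p1", "r1"), ("p2", "r2"), ("p3", "r3"), ("p4", "r4")}

# Pairs produced by some blocking scheme: 5 candidates, 3 of them true matches.
candidates = {("p1", "r1"), ("p1", "r2"), ("p2", "r2"), ("p3", "r3"), ("p3", "r5")}

rr = reduction_ratio(len(candidates), len(posts) * len(references))  # 1 - 5/20
pc = pairs_completeness(candidates, true_matches)                    # 3/4
```

In this toy case both values are 0.75: the scheme discards three quarters of the cross product but also loses one of the four true matches, which is the kind of trade-off the minimum PC threshold is meant to control.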
[...] experiments, for blocking it is more important to pick the right attribute combinations, as BSL does, even using simple methods, than to do blocking using the most sophisticated methods.

We can easily extend our BSL algorithm to handle the case of matching posts to members of the reference set. This is a special case because the posts have all the attributes embedded within them while the reference set data is relational and structured into schema elements. To handle this special case, rather than matching attribute and method pairs across the data sources during our LEARN-ONE-RULE, we instead compare attribute and method pairs from the relational data to the entire post. This is a small change, showing that the same algorithm works well even in this special case.

Once we learn a good blocking scheme, we can now efficiently generate candidates from the post set to align to the reference set. This blocking step is essential for mapping large amounts of unstructured and ungrammatical data sources to larger and larger reference sets.

2.2 The Matching Step

From the set of candidates generated during blocking one can find the member of the reference set that best matches the current post. That is, one data source's record (the post) must align to a record from the other data source (the reference set candidates). While the whole alignment procedure is referred to as record linkage (Fellegi & Sunter, 1969), we refer to finding the particular matches after blocking as the "matching step."
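The candidates that the matching step consumes come from applying a learned blocking scheme in which each reference set attribute is compared to the entire post. The following is a minimal sketch of that idea under simplifying assumptions: only a token-match method is implemented, a scheme is a list of conjunctions over attribute names, and the reference records are invented for the example.

```python
def tokens(text):
    """Lower-cased whitespace tokens of a string."""
    return set(text.lower().split())


def token_match(post, attribute_value):
    """True if the whole post shares at least one token with the attribute value."""
    return bool(tokens(post) & tokens(attribute_value))


def generate_candidates(post, reference_set, scheme):
    """Apply a blocking scheme in disjunctive normal form.

    `scheme` is a list of conjunctions; each conjunction is a list of
    attribute names that must all token-match the post.
    """
    return [
        record
        for record in reference_set
        if any(all(token_match(post, record[a]) for a in conj) for conj in scheme)
    ]


# Hypothetical reference records with the schema make, model, trim, year.
reference_set = [
    {"make": "Honda", "model": "Civic", "trim": "4D LX", "year": "1993"},
    {"make": "Honda", "model": "Accord", "trim": "4D LX", "year": "1993"},
    {"make": "Ford", "model": "Focus", "trim": "SE", "year": "2001"},
]
post = "93- 4dr Honda Civc LX Stick Shift $1800"

# ({token-match, make} AND {token-match, trim}) keeps both Honda records.
by_make_and_trim = generate_candidates(post, reference_set, [["make", "trim"]])
# ({token-match, model}) keeps nothing, because the post misspells "Civic".
by_model = generate_candidates(post, reference_set, [["model"]])
```

The second call shows why a single exact-token rule on the model would lose this true match, and why a learned scheme may need several disjuncts or a fuzzier method to keep Pairs Completeness high.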
Figure 2: The traditional record linkage problem

However, the record linkage problem presented in this article differs from the "traditional" record linkage problem and is not well studied. Traditional record linkage matches a record from one data source to a record from another data source by relating their respective, decomposed attributes. For instance, using the second post from Table 1, and assuming decomposed attributes, the make from the post is compared to the make of the reference set. This is also done for the models, the trims, etc. The record from the reference set that best matches the post based on the similarities between the attributes would be considered the match. This is represented in Figure 2.

Figure 3: The problem of matching a post to the reference set

Yet, the attributes of the posts are embedded within a single piece of text and not yet identified. This text is compared to the reference set, which is already decomposed into attributes and which does not have the extraneous tokens present in the post. Figure 3 depicts this problem. With this type of matching, traditional record linkage approaches do not apply.

Instead, the matching step compares the post to all of the attributes of the reference set concatenated together. Since the post is compared to a whole record from the reference set (in the sense that it has all of the attributes), this comparison is at the "record level" and it approximately reflects how similar all of the embedded attributes of the post are to all of the attributes of the candidate match. This mimics the idea of traditional record linkage, that comparing all of the fields determines the similarity at the record level.

However, by using only the record level similarity it is possible for two candidates to generate the same record level similarity while differing on individual attributes. If one of these attributes is more discriminative than the other, there needs to be some way to reflect that. For example, consider Figure 4. In the figure, the two candidates share the same make and model. However, the first candidate shares the year while the second candidate shares the trim. Since both candidates share the same make and model, and both have another attribute in common, it is possible that they generate the same record level comparison. Yet a shared trim, especially a rare one such as "Hatchback," should be more discriminative than a shared year, since there are lots of cars with the same make, model and year that differ only by the trim. This difference in individual attributes needs to be reflected.

Figure 4: Two records with equal record level but different field level similarities

To discriminate between attributes, the matching step borrows the idea from traditional record linkage that incorporating the individual comparisons between each attribute from each data source is the best way to determine a match. That is, the record level information alone is not enough to discriminate matches; field level comparisons must be exploited as well. To do "field level" comparisons the matching step compares the post to each individual attribute of the reference set.

These record and field level comparisons are represented by a vector of different similarity functions called RL_scores. By incorporating different similarity functions, RL_scores reflects the different types of similarity that exist between text. Hence, for the record level comparison, the matching step generates the RL_scores vector between the post and all of the attributes concatenated. To generate field level comparisons, the matching step calculates the RL_scores between the post and each of the individual attributes of the reference set. All of these RL_scores vectors are then stored in a vector called V_RL. Once populated, V_RL represents the record and field level similarities between a post and a member of the reference set.

In the example reference set from Figure 3, the schema has 4 attributes ⟨make, model, trim, year⟩. Assuming the current candidate is ⟨"Honda", "Civic", "4D LX", "1993"⟩, then the V_RL looks like:

V_RL = ⟨RL_scores(post, "Honda"),
        RL_scores(post, "Civic"),
        RL_scores(post, "4D LX"),
        RL_scores(post, "1993"),
        RL_scores(post, "Honda Civic 4D LX 1993")⟩

Or more generally:

V_RL = ⟨RL_scores(post, attribute_1),
        RL_scores(post, attribute_2),
        ...,
        RL_scores(post, attribute_n),
        RL_scores(post, attribute_1 attribute_2 ... attribute_n)⟩

The RL_scores vector is meant to include notions of the many ways that exist to define the similarity between the textual values of the data sources. It might be the case that one attribute differs from another in a few misplaced, missing or changed letters. This sort of similarity identifies two attributes that are similar, but misspelled, and is called "edit distance." Another type of textual similarity looks at the tokens of the attributes and defines similarity based upon the number of tokens shared between the attributes. This "token level" similarity is not robust to spelling mistakes, but it puts no emphasis on the order of the tokens, whereas edit distance requires that the order of the tokens match in order for the attributes to be similar. Lastly, there are cases where one attribute may sound like another, even if they are both spelled differently, or one attribute may share a common root word with another attribute, which implies a "stemmed" similarity. These last two examples are neither token nor edit distance based similarities.

To capture all these different similarity types, the RL_scores vector is built of three vectors that reflect each of the different similarity types discussed above. Hence, RL_scores is:

RL_scores(post, attribute) = ⟨token_scores(post, attribute),
                              edit_scores(post, attribute),
                              other_scores(post, attribute)⟩

The vector token_scores comprises three token level similarity scores. Two similarity scores included in this vector are based on the Jensen-Shannon distance, which defines similarities over probability distributions of the tokens. One uses a Dirichlet prior (Cohen, Ravikumar, & Feinberg, 2003) and the other smooths its token probabilities using a Jelinek-Mercer mixture model (Zhai & Lafferty, 2001). The last metric in the token_scores vector is the Jaccard similarity. With all of the scores included, the token_scores vector takes the form:

token_scores(post, attribute) = ⟨Jensen-Shannon-Dirichlet(post, attribute),
                                 Jensen-Shannon-JM-Mixture(post, attribute),
                                 Jaccard(post, attribute)⟩

The vector edit_scores consists of the edit distance scores, which are comparisons between strings at the character level defined by operations that turn one string into another. For instance, the edit_scores vector includes the Levenshtein distance (Levenshtein, 1966), which returns the minimum number of operations to turn string S into string T, and the Smith-Waterman distance (Smith & Waterman, 1981), which is an extension to the Levenshtein distance. The last score in the vector edit_scores is the Jaro-Winkler similarity (Winkler & Thibaudeau, 1991), which is an extension of the Jaro metric (Jaro, 1989) used to find similar proper nouns. While not a strict edit distance, because it does not regard operations of transformations, the Jaro-Winkler metric is a useful determinant of string similarity. With all of the character level metrics, the edit_scores vector is defined as: [...]

Figure 6: Our approach to matching posts to records from a reference set

In addition to providing a standardized set of values to query the posts, these standardized values allow for integration with outside sources because the values can be standardized to canonical values. For instance, if we want to integrate our car classifieds with a safety ratings website, we can now easily join the sources across the attribute values. In this manner, by approaching annotation as a record linkage problem, we can create relational data from unstructured and ungrammatical data sources. However, to aid in the extraction of attributes not easily represented in reference sets, we perform information extraction on the posts as well.

3. Extracting Data from Posts

Although the record linkage step creates most of the relational data from the posts, there are still attributes we would like to extract from the post which are not easily represented by reference sets, which means the record linkage step cannot be used for these attributes. Examples of such attributes are dates and prices. Although many such attributes can be extracted using simple techniques, such as regular expressions, we can make their extraction and annotation even more accurate by using sophisticated information extraction. To motivate this idea, consider the Ford car model called the "500." If we just used regular expressions, we might extract 500 as the price of the car, which would be wrong. However, if we try to extract all of the attributes, including the model, then we would extract "500" as the model correctly. Furthermore, we might want to extract the actual attributes from a post, as they are, and our extraction algorithm allows this.

To perform extraction, the algorithm infuses information extraction with extra knowledge, rather than relying on possibly inconsistent characteristics. To garner this extra knowledge, the approach exploits the idea of reference sets by using the attributes from the matching reference set member as a basis for identifying similar attributes in the post. Then, the algorithm can label these extracted values from the post with the schema from the reference set, thus adding annotation based on the extracted values.

In a broad sense, the algorithm has two parts. First we label each token with a possible attribute label or as "junk" to be ignored. After all the tokens in a post are labeled, we then clean each of the extracted labels. Figure 7 shows the whole procedure graphically, in detail, using the second post from Table 1. Each of the steps shown in this figure is described in detail below.
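The "500" ambiguity that motivates this section can be reproduced in a few lines. The sketch below is not the extraction algorithm of this article; it only contrasts a naive regular expression with the same expression applied after tokens matching known reference set values have been set aside. The post text and the pattern are hypothetical.

```python
import re

# Naive, hypothetical price pattern: an optional dollar sign followed by digits.
NUMBER = re.compile(r"\$?\d+")


def naive_price(post):
    """Return the first number-like substring, with no knowledge of the car."""
    match = NUMBER.search(post)
    return match.group() if match else None


def price_excluding(post, known_values):
    """Skip tokens already explained by the matched reference set record."""
    for token in post.split():
        if token in known_values:
            continue
        if NUMBER.fullmatch(token):
            return token
    return None


post = "Ford 500 runs great $4500"             # made-up post for illustration
wrong = naive_price(post)                       # picks up the model name
right = price_excluding(post, {"Ford", "500"})  # model already accounted for
```

Here the naive pattern returns "500", the model name, while excluding the tokens explained by the reference record leaves "$4500" as the only price candidate.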
Figure 7: Extraction process for attributes

To begin the extraction process, the post is broken into tokens. Using the first post from Table 1 as an example, the set of tokens becomes {"93", "civic", "5speed", ...}. Each of these tokens is then scored against each attribute of the record from the reference set that was deemed the match.

To score the tokens, the extraction process builds a vector of scores, V_IE. Like the V_RL vector of the matching step, V_IE is composed of vectors which represent the similarities between the token and the attributes of the reference set. However, the composition of V_IE is slightly different from V_RL. It contains no comparison to the concatenation of all the attributes, and the vectors that compose V_IE are different from those that compose V_RL. Specifically, the vectors that form V_IE are called IE_scores, and are similar to the RL_scores that compose V_RL, except they do not contain the token_scores component, since each IE_scores only uses one token from the post at a time. The RL_scores vector:

RL_scores(post, attribute) = ⟨token_scores(post, attribute),
                              edit_scores(post, attribute),
                              other_scores(post, attribute)⟩

becomes:

IE_scores(token, attribute) = ⟨edit_scores(token, attribute),
                               other_scores(token, attribute)⟩

The other main difference between V_IE and V_RL is that V_IE contains a unique vector that contains user defined functions, such as regular expressions, to capture attributes that are not easily represented by reference sets, such as prices or dates. These attribute types generally exhibit consistent characteristics that allow them to be extracted, and they are usually infeasible to represent in reference sets. This makes traditional extraction methods a good choice for these attributes. This vector is called common_scores because the types of characteristics used to extract these attributes are "common" enough between posts to be used for extraction.

Using the first post of Table 1, assume the reference set match has the make "Honda," the model "Civic" and the year "1993." This means the matching tuple would be {"Honda", "Civic", "1993"}. This match generates the following V_IE for the token "civic" of the post:

V_IE = ⟨common_scores("civic"),
        IE_scores("civic", "Honda"),
        IE_scores("civic", "Civic"),
        IE_scores("civic", "1993")⟩

More generally, for a given token, V_IE looks like:

V_IE = ⟨common_scores(token),
        IE_scores(token, attribute_1),
        IE_scores(token, attribute_2),
        ...,
        IE_scores(token, attribute_n)⟩

Each V_IE is then passed to a structured SVM (Tsochantaridis, Joachims, Hofmann, & Altun, 2005; Tsochantaridis, Hofmann, Joachims, & Altun, 2004) trained to give it an attribute type label, such as make, model, or price. Intuitively, similar attribute types should have similar V_IE vectors. The makes should generally have high scores against the make attribute of the reference set, and small scores against the other attributes. Further, structured SVMs are able to infer the extraction labels collectively, which helps in deciding between possible token labels. This makes the use of structured SVMs an ideal machine learning method for our task. Note that since each V_IE is not a member of a cluster where the winner takes all, there is no binary rescoring.

Since there are many irrelevant tokens in the post that should not be annotated, the SVM learns that any V_IE that does not associate with a learned attribute type should be labeled as [...]
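The construction of V_IE for a single token can be sketched as follows. This is a deliberately reduced illustration: edit_scores is collapsed to one normalized Levenshtein similarity, other_scores is omitted, and the two regular expressions in common_scores are hypothetical stand-ins for user defined functions. None of these choices is claimed to match the feature set actually used by the system.

```python
import re


def levenshtein(s, t):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        current = [i]
        for j, ct in enumerate(t, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (cs != ct)))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(token, attribute):
    """Levenshtein distance rescaled to [0, 1]; 1.0 means identical (ignoring case)."""
    a, b = token.lower(), attribute.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def common_scores(token):
    """Hypothetical user defined functions: does the token look like a price or a year?"""
    looks_like_price = 1.0 if re.fullmatch(r"\$\d+(\.\d{2})?", token) else 0.0
    looks_like_year = 1.0 if re.fullmatch(r"(19|20)?\d{2}", token) else 0.0
    return [looks_like_price, looks_like_year]


def build_v_ie(token, reference_record):
    """common_scores(token) followed by one similarity per reference attribute."""
    vector = common_scores(token)
    for attribute in reference_record:
        vector.append(edit_similarity(token, attribute))
    return vector


match = ("Honda", "Civic", "1993")
v_civic = build_v_ie("civic", match)
v_price = build_v_ie("$1800", match)
```

For the token "civic" the only non-zero entry is the similarity to the model "Civic", which is the pattern a learned extractor could associate with the label model; for "$1800" the price indicator fires while all reference set similarities stay low.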
Figure 8: Improving extraction accuracy with reference set attributes

Note, however, that we do not limit the machine learning component of our extraction algorithm to SVMs. Instead, we claim that in some cases, reference sets can aid extraction in general, and to test this, in our architecture we can replace the SVM component with other methods. For example, in our extraction experiments we replace the SVM extractor with a Conditional Random Field (CRF) (Lafferty, McCallum, & Pereira, 2001) extractor that uses the V_IE as features.

Therefore, the whole extraction process takes a token of the text, creates the V_IE and passes this to the machine-learning extractor, which generates a label for the token. Then each field is cleaned and the extracted attribute is saved.

4. Results

The Phoebus system was built to experimentally validate our approach to building relational data from unstructured and ungrammatical data sources. Specifically, Phoebus tests the technique's accuracy in both the record linkage and the extraction, and incorporates the BSL algorithm for learning and using blocking schemes. The experimental data comes from three domains of posts: hotels, comic books, and cars.

The data from the hotel domain contains the attributes hotel name, hotel area, star rating, price and dates, which are extracted to test the extraction algorithm. This data comes from the Bidding For Travel website [9], which is a forum where users share successful bids for Priceline on items such as airline tickets and hotel rates. The experimental data is limited to postings about hotel rates in Sacramento, San Diego and Pittsburgh, which compose a data set with 1125 posts, with 1028 of these posts having a match in the reference set. The reference set comes from the Bidding For Travel hotel guides, which are special posts listing all of the hotels ever posted about a given area. These special posts provide hotel names, hotel areas and star ratings, which are the reference set attributes. Therefore, these are the 3 attributes for which the standardized values are used, allowing us to treat these posts as a relational data set. This reference set contains 132 records.

[9] www.biddingfortravel.com

Algorithm 3.1: CleanAttribute(E, R)
  comment: Clean extracted attribute E using reference set attribute R
  RemovalCandidates C ← null
  JaroWinklerBaseline ← JaroWinkler(E, R)
  JaccardBaseline ← Jaccard(E, R)
  for each token t ∈ E do
    X_t ← RemoveToken(t, E)
    JaroWinkler_Xt ← JaroWinkler(X_t, R)
    Jaccard_Xt ← Jaccard(X_t, R)
    if JaroWinkler_Xt > JaroWinklerBaseline and Jaccard_Xt > JaccardBaseline
      then C ← C ∪ t
  if C = null
    then return (E)
    else E ← RemoveMaxCandidate(C, E)
         CleanAttribute(E, R)

Figure 9: Algorithm to clean an extracted attribute

The experimental data for the comic domain comes from posts for items for sale on eBay. To generate this data set, eBay was searched by the keywords "Incredible Hulk" and "Fantastic Four" in the comic books section of their website. (This returned some items that are not comics, such as t-shirts and some sets of comics not limited to those searched for, which makes the problem more difficult.) The returned records contain the attributes comic title, issue number, price, publisher, publication year and the description, which are extracted. (Note: the description is a few-word description commonly associated with a comic book, such as "1st appearance the Rhino.") The total number of posts in this data set is 776, of which 697 have matches. The comic domain reference set uses data from the Comics Price Guide [10], which lists all the Incredible Hulk and Fantastic Four comics. This reference set has the attributes title, issue number, description, and publisher and contains 918 records.

The cars data consists of posts made to Craigslist regarding cars for sale. This data set consists of classifieds for cars from Los Angeles, San Francisco, Boston, New York, New [...]

[10] http://www.comicspriceguide.com/
[...] problems. Methods such as Bigram indexing are techniques that make the process of each blocking pass on an attribute more efficient. The goal of BSL, however, is to select which attribute combinations should be used for blocking as a whole, trying different attribute and method pairs. Nonetheless, we contend that it is more important to select the right attribute combinations, even using simple methods, than it is to use more sophisticated methods, but without insight as to which attributes might be useful. To test this hypothesis, we compare BSL using the token and 3-gram methods to Bigram indexing over all of the attributes. This is equivalent to forming a disjunction over all attributes using Bigram indexing as the method. We chose Bigram indexing in particular because it is designed to perform "fuzzy blocking," which seems necessary in the case of noisy post data. As stated previously (Baxter et al., 2003), we use a threshold of 0.3 for Bigram indexing, since that works the best. We also compare BSL to running a disjunction over all attributes using the simple token method only. In our results, we call this blocking rule "Disjunction." This disjunction mirrors the idea of picking the simplest possible blocking method: namely using all attributes with a very simple method.

As stated previously, the two goals of blocking can be quantified by the Reduction Ratio (RR) and the Pairs Completeness (PC). Table 5 shows not only these values but also how many candidates were generated on average over the entire test set, comparing the three different approaches. Table 5 also shows how long it took each method to learn the rule and run the rule. Lastly, the column "Time match" shows how long the classifier needs to run given the number of candidates generated by the blocking scheme.

Table 6 shows a few example blocking schemes that the algorithm generated. For a comparison of the attributes BSL selected to the attributes picked manually for different domains where the data is structured, the reader is pointed to our previous work on the topic (Michelson & Knoblock, 2006).

The results of Table 5 validate our idea that it is more important to pick the correct attributes to block on (using simple methods) than to use sophisticated methods without attention to the attributes. Comparing the BSL rule to the Bigram results, the combination of PC and RR is always better using BSL. Note that although in the Cars domain Bigram took significantly less time with the classifier due to its large RR, it did so because it only had a PC of 4%. In this case, Bigram indexing was not even covering 5% of the true matches.

Further, the BSL results are better than using the simplest method possible (the Disjunction), especially in the cases where there are many records to test upon. As the number of records scales up, it becomes increasingly important to gain a good RR, while maintaining a good PC value as well. This savings is dramatically demonstrated by the Cars domain, where BSL outperformed the Disjunction in both PC and RR.

One surprising aspect of these results is how prevalent the token method is within all the domains. We expected that the n-gram method would be used almost exclusively since there are many spelling mistakes within the posts. However, this is not the case. We hypothesize that the learning algorithm uses the token methods because they occur with more regularity across the posts than the common n-grams would, since the spelling mistakes might vary quite differently across the posts. This suggests that there might be more regularity, in terms of what we can learn from the data, across the posts than we initially surmised.

Another interesting result is the poor reduction ratio of the Comic domain. This happens because most of the rules contain the disjunct that finds a common token within the comic title. This rule produces such a poor reduction ratio because the value for this attribute is the same across almost all reference set records. That is to say, when there are just a few unique values for the BSL algorithm to use for blocking, the reduction ratio will be small. In this domain, there are only two values for the comic title attribute, "Fantastic Four" and "Incredible Hulk." So it makes sense that if blocking is done using the title attribute only, the reduction is about half, since blocking on the value "Fantastic Four" just gets rid of the "Incredible Hulk" comics.

This points to an interesting limitation of the BSL algorithm. If there are not many distinct values for the different attribute and method pairs that BSL can use to learn from, then this lack of values cripples the performance of the reduction ratio. Intuitively though, this makes sense, since it is hard to distinguish good candidate matches from bad candidate matches if they share the same attribute values.

Another result worth mentioning is that in the Hotels domain we get a lower RR but the same PC when we use less training data. This happens because our BSL algorithm runs until it has no more examples to cover, so if those last few examples introduce a new disjunct that produces a lot of candidates, while only covering a few more true positives, then this would cause the RR to decrease, while keeping the PC at the same high rate. This is in fact what happens in this case. One way to curb this behavior would be to set some sort of stopping threshold for BSL, but as we said, maximizing the PC is the most important thing, so we choose not to do this. We want BSL to cover as many true positives as it can, even if that means losing a bit in the reduction.

In fact, we next test this notion explicitly. We set a threshold in the SCA such that after 95% of the training examples are covered, the algorithm stops and returns the learned blocking scheme. This helps to avoid the situation where BSL learns a very general conjunction solely to cover the last few remaining training examples. When that happens, BSL might end up lowering the RR just to cover those last training examples, because the rule learned to cover them is overly general and returns too many candidate matches.
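The effect of the 95% stopping threshold can be sketched with a simplified covering loop. In this sketch LEARN-ONE-RULE is replaced by a stand-in that picks, from a fixed pool of hypothetical conjunctions, the usable one generating the fewest candidates; the pool, the coverage sets, and the candidate counts are all invented for illustration.

```python
def learn_scheme(rule_pool, true_matches, stop_percent=100):
    """Greedy covering that stops once stop_percent of the true matches are covered.

    rule_pool maps a rule name to (set of true matches it covers,
    number of candidates it generates). Integer arithmetic avoids
    floating-point surprises in the stopping test.
    """
    total = len(true_matches)
    covered = set()
    scheme = []
    while stop_percent * total > len(covered) * 100:
        usable = [
            name for name, (matches, _) in rule_pool.items()
            if name not in scheme and matches - covered
        ]
        if not usable:
            break
        # Stand-in for LEARN-ONE-RULE: prefer the most restrictive usable rule.
        best = min(usable, key=lambda name: rule_pool[name][1])
        scheme.append(best)
        covered |= rule_pool[best][0]
    return scheme, covered


true_matches = set(range(20))
rule_pool = {
    "make AND trim": (set(range(0, 12)), 40),
    "model": (set(range(12, 19)), 60),
    "year": (set(range(20)), 900),  # overly general: needed only for match 19
}

full_scheme, full_cover = learn_scheme(rule_pool, true_matches)
capped_scheme, capped_cover = learn_scheme(rule_pool, true_matches, stop_percent=95)

full_cost = sum(rule_pool[name][1] for name in full_scheme)      # 1000 candidates
capped_cost = sum(rule_pool[name][1] for name in capped_scheme)  # 100 candidates
```

Covering the single remaining example forces the very general "year" rule into the scheme and multiplies the candidate count by ten in this toy setting, which mirrors the RR loss described above; stopping at 95% avoids that rule at the price of one uncovered true match.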
                       Record Linkage
Domain                 F-Measure         RR       PC
Hotels Domain
  No Thresh (30%)      90.63             81.56    99.79
  95% Thresh (30%)     90.63             87.63    97.66
Comic Domain
  No Thresh (30%)      91.30             42.97    99.75
  95% Thresh (30%)     91.47             42.97    99.69
Cars Domain
  No Thresh (10%)      77.04             88.48    92.23
  95% Thresh (10%)     67.14             92.67    83.95

Table 7: A comparison of BSL covering all training examples, and covering 95% of the training examples

\[ \text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

The record linkage approach in this article is compared to WHIRL (Cohen, 2000). WHIRL performs record linkage by performing soft-joins using vector-based cosine similarities between the attributes. Other record linkage systems require decomposed attributes for matching, which is not the case with the posts. WHIRL serves as the benchmark because it does not have this requirement. To mirror the alignment task of Phoebus, the experiment supplies WHIRL with two tables: the test set of posts (either 70% or 90% of the posts) and the reference set with the attributes concatenated to approximate a record level match. The concatenation is also used because when matching on each individual attribute, it is not obvious how to combine the matching attributes to construct a whole matching reference set member.

To perform the record linkage, WHIRL does soft-joins across the tables, which produces a list of matches, ordered by descending similarity score. For each post with matches from the join, the reference set member(s) with the highest similarity score(s) is called its match. In the Cars domain the matches are 1-N, so this means that only 1 match from the reference set will be exploited later in the information extraction step. To mirror this idea, the number of possible matches in a 1-N domain is counted as the number of posts that have a match in the reference set, rather than the reference set members themselves that match. Also, this means that we only add a single match to our total number of correct matches for a given post, rather than all of the correct matches, since only one matters. This is done for both WHIRL and Phoebus, and more accurately reflects how well each algorithm would perform as the processing step before our information extraction step.

The record linkage results for both Phoebus and WHIRL are shown in Table 8. Note that the amount of training data for each domain is shown in parentheses. All results are statistically significant using a two-tailed paired t-test with α = 0.05, except for the precision between WHIRL and Phoebus in the Cars domain, and the precision between Phoebus trained on 10% and 30% of the training data in the Comic domain.

Phoebus outperforms WHIRL because it uses many similarity types to distinguish matches. Also, since Phoebus uses both record level and attribute level similarities, it is able to distinguish between records that differ in more discriminative attributes. This is especially apparent in the Cars domain. First, these results indicate the difficulty of matching car posts to the large reference set. This is the largest experimental domain yet used for this problem, and it is encouraging how well our approach outperforms the baseline. It is also interesting that the results suggest that both techniques are equally accurate in terms of precision (in fact, there is no statistically significant difference between them in this sense) but Phoebus is able to retrieve many more relevant matches. This means Phoebus can capture richer features that predict matches than WHIRL's cosine similarity alone. We expect this behavior because Phoebus has a notion of both field and token level similarity, using many different similarity measures. This justifies our use of the many similarity types and field and record level information, since our goal is to find as many matches as we can.

It is also encouraging that using only 10% of the data for labeling, Phoebus is able to perform almost as well as using 30% of the data for training. Since the amount of data on the Web is vast, only having to label 10% of the data to get comparable results is preferable

                         Precision   Recall   F-Measure
Hotels
  Phoebus (30%)          87.70       93.78    90.63
  Concatenation Only     88.49       93.19    90.78
  WHIRL                  83.61       83.53    83.13
Comic
  Phoebus (30%)          87.49       95.46    91.30
  Concatenation Only     61.81       46.55    51.31
  WHIRL                  73.89       81.63    77.57
Cars
  Phoebus (10%)          69.98       85.68    77.04
  Concatenation Only     47.94       58.73    52.79
  WHIRL                  70.43       63.36    66.71
Table 9: Matching using only the concatenation

also uses a concatenation of the attributes. This is because WHIRL uses information-retrieval-style matching to find the best match, and the machine learning technique tries to learn the characteristics of the best match. Clearly, it is very difficult to learn what such characteristics are.

In the Hotels domain, we do not find a statistically significant difference in F-measure using the concatenation alone. This means that the concatenation is sufficient to determine the matches, so there is no need for individual fields to play a role. More specifically, the hotel name and area seem to be the most important attributes for matching and by including them as part of the concatenation, the concatenation is still distinguishable enough between all records to determine matches. Since in two of the three domains we see a huge improvement, and we never lose in F-measure, using both the concatenation and the individual attributes is valid for the matching. Also, since in two domains the concatenation alone was worse than WHIRL, we conclude that part of the reason Phoebus can outperform WHIRL is the use of the individual attributes for matching.

Our next experiment tests how important it is to include all of the string metrics in our feature vector for matching. To test this idea, we compare using all the metrics to using just one, the Jensen-Shannon distance. We choose the Jensen-Shannon distance because it outperformed both TF/IDF and even a "soft" TF/IDF (one that accounts for fuzzy token matches) in the task of selecting the right reference sets for a given set of posts (Michelson & Knoblock, 2007). These results are shown in Table 10.

As Table 10 shows, using all the metrics yielded a statistically significant, large improvement in F-measure for the Comic and Cars domains. This means that some of the other string metrics, such as the edit distances, were capturing similarities that the Jensen-Shannon distance alone did not. Interestingly, in both domains, using Phoebus with only the Jensen-Shannon distance does not dominate WHIRL's performance. Therefore, the results of Table 10 and Table 9 demonstrate that Phoebus benefits from the combination

                         Precision   Recall   F-Measure
Hotels
  Phoebus (30%)          87.70       93.78    90.63
  No Binary Rescoring    75.44       81.82    78.50
  Phoebus (10%)          87.85       92.46    90.09
  No Binary Rescoring    73.49       78.40    75.86
Comic
  Phoebus (30%)          87.49       95.46    91.30
  No Binary Rescoring    84.87       89.91    87.31
  Phoebus (10%)          85.35       93.18    89.09
  No Binary Rescoring    81.52       88.26    84.75
Cars
  Phoebus (10%)          69.98       85.68    77.04
  No Binary Rescoring    39.78       48.77    43.82

Table 11: Record linkage results with and without binary rescoring

4.2 Extraction Results

This section presents results that experimentally validate our approach to extracting the actual attributes embedded within the post. We also compare our approach to two other information extraction methods that rely on the structure or grammar of the posts.

First, the experiments compare Phoebus against a baseline Conditional Random Field (CRF) (Lafferty et al., 2001) extractor. A Conditional Random Field is a probabilistic model that can label and segment data. In labeling tasks, such as Part-of-Speech tagging, CRFs outperform Hidden Markov Models and Maximum-Entropy Markov Models. Therefore, by representing the state-of-the-art probabilistic graphical model, they present a strong comparison to our approach to extraction. CRFs have also been used effectively for information extraction. For instance, CRFs have been used to combine information extraction and coreference resolution with good results (Wellner, McCallum, Peng, & Hay, 2004). These experiments use the SimpleTagger implementation of CRFs from the MALLET (McCallum, 2002) suite of text processing tools.

Further, as stated in Section 3 on Extraction, we also created a version of Phoebus that uses CRFs, which we call PhoebusCRF. PhoebusCRF uses the same extraction features (V_IE) as Phoebus using the SVM, such as the common score regular expressions and the string similarity metrics. We include PhoebusCRF to show that extraction in general can benefit from our reference set matching.

Second, the experiments compare Phoebus to Natural Language Processing (NLP) based extraction techniques. Since the posts are ungrammatical and have unreliable lexical characteristics, these NLP based systems are not expected to do as well on this type of data. The Amilcare system (Ciravegna, 2001), which uses shallow NLP for extraction, has been shown to outperform other symbolic systems in extraction tasks, and so we use Amilcare as the other system to compare against. Since Amilcare can exploit gazetteers for extra

Hotel                         Recall   Precision   F-Measure   Frequency
Area
  Phoebus (30%)               83.73    84.76       84.23       ~580
  Phoebus (10%)               77.80    83.58       80.52
  PhoebusCRF (30%)            85.13    86.93       86.02
  PhoebusCRF (10%)            80.71    83.38       82.01
  SimpleTagger CRF (30%)      78.62    79.38       79.00
  Amilcare (30%)              64.78    71.59       68.01
Date
  Phoebus (30%)               85.41    87.02       86.21       ~700
  Phoebus (10%)               82.13    83.06       82.59
  PhoebusCRF (30%)            87.20    87.11       87.15
  PhoebusCRF (10%)            84.39    84.48       84.43
  SimpleTagger CRF (30%)      63.60    63.25       63.42
  Amilcare (30%)              86.18    94.10       89.97
Name
  Phoebus (30%)               77.27    75.18       76.21       ~750
  Phoebus (10%)               75.59    74.25       74.92
  PhoebusCRF (30%)            85.70    85.07       85.38
  PhoebusCRF (10%)            81.46    81.69       81.57
  SimpleTagger CRF (30%)      74.43    84.86       79.29
  Amilcare (30%)              58.96    67.44       62.91
Price
  Phoebus (30%)               93.06    98.38       95.65       ~720
  Phoebus (10%)               93.12    98.46       95.72
  PhoebusCRF (30%)            92.56    94.90       93.71
  PhoebusCRF (10%)            90.34    92.60       91.46
  SimpleTagger CRF (30%)      71.68    73.45       72.55
  Amilcare (30%)              88.04    91.10       89.54
Star
  Phoebus (30%)               97.39    97.01       97.20       ~730
  Phoebus (10%)               96.94    96.90       96.92
  PhoebusCRF (30%)            96.83    98.06       97.44
  PhoebusCRF (10%)            96.17    96.74       96.45
  SimpleTagger CRF (30%)      97.16    96.55       96.85
  Amilcare (30%)              95.58    97.35       96.46

Table 12: Field level extraction results: Hotels domain

Cars                          Recall   Precision   F-Measure   Frequency
Make
  Phoebus (10%)               98.21    99.93       99.06       ~580
  PhoebusCRF (10%)            90.73    96.71       93.36
  SimpleTagger CRF (10%)      85.68    95.69       90.39
  Amilcare (10%)              97.58    91.76       94.57
Model
  Phoebus (10%)               92.61    96.67       94.59       ~620
  PhoebusCRF (10%)            84.58    94.10       88.79
  SimpleTagger CRF (10%)      78.76    91.21       84.52
  Amilcare (10%)              78.44    84.31       81.24
Price
  Phoebus (10%)               97.17    95.91       96.53       ~580
  PhoebusCRF (10%)            93.59    92.59       93.09
  SimpleTagger CRF (10%)      83.66    98.16       90.33
  Amilcare (10%)              90.06    91.27       90.28
Trim
  Phoebus (10%)               63.11    70.15       66.43       ~375
  PhoebusCRF (10%)            55.61    64.95       59.28
  SimpleTagger CRF (10%)      55.94    66.49       60.57
  Amilcare (10%)              27.21    53.99       35.94
Year
  Phoebus (10%)               88.48    98.23       93.08       ~600
  PhoebusCRF (10%)            85.54    96.44       90.59
  SimpleTagger CRF (10%)      91.12    76.78       83.31
  Amilcare (10%)              86.32    91.92       88.97
Table 14: Field level extraction results: Cars domain.

           Num. of Max. F-Measures
Domain     Phoebus   PhoebusCRF   Amilcare   SimpleTagger   Total Attributes
Hotel      1         3            1          0              5
Comic      2         1            0          0              6
Cars       5         0            0          0              5
All        8         4            1          0              16

Table 15: Summary results for extraction showing the number of times each system had statistically significant highest F-Measure for an attribute.

in the reference set. This lends credibility to our claim earlier in the section that by training the system to extract all of the attributes, even those in the reference set, we can more accurately extract attributes not in the reference set because we are training the system to identify what something is not.

The overall performance of Phoebus validates this approach to semantic annotation. By infusing information extraction with the outside knowledge of reference sets, Phoebus is able to perform well across three different domains, each representative of a different type of source of posts: the auction sites, Internet classifieds and forum/bulletin boards.

5. Discussion

The goal of this research is to produce relational data from unstructured and ungrammatical data sources so that they can be accurately queried and integrated with other sources. By representing the attributes embedded within a post with the standardized values from the reference set, we can support structural queries and integration. For instance, we can perform aggregate queries because we can treat the data source as a relational database now. Furthermore, we have standardized values for performing joins across data sources, a key for integration of multiple sources. These standardized values also aid in the cases where the post actually does not contain the attribute. For instance, in Table 1, two of the listings do not include the make "Honda." However, once matched to the reference set, they contain a standardized value for this attribute which can then be used for querying and integrating these posts. This is especially powerful since the posts never explicitly stated these attribute values.

The reference set attributes also provide a solution for the cases where the extraction is extremely difficult. For example, none of the systems extracted the description attribute of the Comic domain well. However, if one instead considers the description attribute from the reference set, which is quantified by the record linkage results for the Comic domain, this yields an improvement of over 50% in the F-Measure for identifying the description for a post.

It may seem that using the reference set attributes for annotation is enough since the values are already cleaned, and that extraction is unnecessary. However, this is not the case. For one thing, one may want to see the actual values entered for different attributes. For instance, a user might want to discover the most common spelling mistake or abbreviation for an attribute. Also, there are cases when the extraction results outperform the record linkage results. This happens because even if a post is matched to an incorrect member of the reference set, that incorrect member is most likely very close to the correct match, and so it can be used to correctly extract much of the information. For a strong example of this, consider the Cars domain. The F-measure for the record linkage results is not as good as that for the extraction results in this domain. This means that most of the incorrectly chosen matches probably differ from the correct match by something small. For example, a true match could have the trim as "2 Door" while the incorrectly chosen match might have the trim "4 Door," but there would still be enough information, such as the rest of the trim tokens, the year, the make and the model, to correctly extract those different attributes from the post itself. By performing the extraction for the values from the post itself, we can overcome the mistakes of the record linkage step because we can still exploit most of the information in the incorrectly chosen reference set member.
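The aggregate queries mentioned at the start of this discussion can be illustrated with a minimal sketch. The three post strings follow Table 1 with spacing restored by hand, while the annotation values (make, model, year, price), the table layout, and the function name are hypothetical stand-ins for what a reference-set match might produce; they are not output from Phoebus.

```python
import sqlite3
from typing import Optional

# Hypothetical annotations: each post paired with standardized values that a
# reference-set match could supply (illustrative only).
ANNOTATED_POSTS = [
    ("93 civic 5speed runs great obo (ri) $1800", "Honda", "Civic", 1993, 1800),
    ("93- 4dr Honda Civc LX Stick Shift $1800", "Honda", "Civic", 1993, 1800),
    ("94 DEL SOL Si Vtec (Glendale) $3000", "Honda", "Del Sol", 1994, 3000),
]


def average_price(make: str, min_year: int) -> Optional[float]:
    """Run a relational aggregate over the annotated posts."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE posts "
            "(post TEXT, make TEXT, model TEXT, year INTEGER, price INTEGER)"
        )
        conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?, ?)", ANNOTATED_POSTS)
        row = conn.execute(
            "SELECT AVG(price) FROM posts WHERE make = ? AND year >= ?",
            (make, min_year),
        ).fetchone()
        return row[0]  # None when no post satisfies the condition
    finally:
        conn.close()


# Keyword search only sees the post that literally contains the make.
keyword_hits = [p for p in ANNOTATED_POSTS if "honda" in p[0].lower()]

# The annotated table answers the aggregate query over all three posts.
honda_avg = average_price("Honda", 1993)

print(len(keyword_hits), honda_avg)
```

Because the make is stored as a standardized attribute rather than searched for in the raw text, the two posts that never mention "Honda" still take part in the aggregate.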
Edmunds classifies the Mazda Protege5 as a "wagon," while Kelley Blue Book (see footnote 16) classifies it as a "hatchback." This seems to invalidate the idea that "wagon" is different in meaning from "hatchback." They appear to be simple synonyms, but this would remain unknown without the outside knowledge of Kelley Blue Book. More generally, one assumes that the reference set is a correct set of standardized values, but this is not an absolute truth. That is why the most meaningful reference sets are those that can be constructed from agreed-upon ontologies from the Semantic Web. For instance, a reference set derived from an ontology for cars created by all of the biggest automotive businesses should alleviate many of the issues in meaning, and a thesaurus scheme could work out the discrepancies introduced by the users, rather than the reference sets.

6. Related Work

Our research is driven by the principle that the cost of annotating documents for the Semantic Web should be free, that is, automatic and invisible to users (Hendler, 2001). Many researchers have followed this path, attempting to automatically mark up documents for the Semantic Web, as proposed here (Cimiano, Handschuh, & Staab, 2004; Dingli, Ciravegna, & Wilks, 2003; Handschuh, Staab, & Ciravegna, 2002; Vargas-Vera, Motta, Domingue, Lanzoni, Stutt, & Ciravegna, 2002). However, these systems rely on lexical information, such as part-of-speech tagging or shallow Natural Language Processing, to do their extraction/annotation (e.g., Amilcare, Ciravegna, 2001). This is not an option when the data is ungrammatical, like the post data. In a similar vein, there are systems such as ADEL (Lerman, Gazen, Minton, & Knoblock, 2004) which rely on the structure to identify and annotate records in Web pages. Again, the failure of the posts to exhibit structure makes this approach inappropriate. So, while there is a fair amount of work in automatic labeling, there is little emphasis on techniques that could label text that is both unstructured and ungrammatical.

Although the idea of record linkage is not new (Fellegi & Sunter, 1969) and is well studied even now (Bilenko & Mooney, 2003), most current research focuses on matching one set of records to another set of records based on their decomposed attributes. There is little work on matching data sets where one record is a single string composed
of the other data set's attributes to match on, as in the case with posts and reference sets. The WHIRL system (Cohen, 2000) allows for record linkage without decomposed attributes, but as shown in Section 4.1 Phoebus outperforms WHIRL, since WHIRL relies solely on the vector-based cosine similarity between the attributes, while Phoebus exploits a larger set of features to represent both field and record level similarity. We note with interest the EROCS system (Chakaravarthy, Gupta, Roy, & Mohania, 2006) where the authors tackle the problem of linking full text documents with relational databases. The technique involves filtering out all non-nouns from the text, and then finding the matches in the database. This is an intriguing approach; interesting future work would involve performing a similar filtering for larger documents and then applying the Phoebus algorithm to match the remaining nouns to reference sets.

Using the reference set's attributes as normalized values is similar to the idea of data cleaning. However, most data cleaning algorithms assume tuple-to-tuple transformations

16. www.kbb.com

about using ontology-based information extraction as a means to semantically annotate unstructured data such as car classifieds (Ding, Embley, & Liddle, 2006). However, in contrast to our work, the information extraction is performed by a keyword-lookup into the ontology along with structural and contextual rules to aid the labeling. The ontology itself contains keyword misspellings and abbreviations, so that the look-up can be performed in the presence of noisy data. We believe the ontology-based extraction approach is less scalable than a record linkage type matching task because creating and maintaining the ontology requires extensive data engineering in order to encompass all possible common spelling mistakes and abbreviations. Further, if new data is added to the ontology, additional data engineering must be performed. In our work, we can simply add new tuples to our reference set. Lastly, in contrast to our work, this ontology based work assumes contextual and structural rules will apply, making an assumption about the data to extract from. In our work, we make no such assumptions about the structure of the text we are extracting from.

Yet another interesting approach to information extraction using ontologies is the Textpresso system which extracts data from biological text (Muller & Sternberg, 2004). This system uses a regular expression based keyword look-up to label tokens in some text based on the ontology. Once all tokens are labeled, Textpresso can perform "fact extraction" by extracting sequences of labeled tokens that fit a particular pattern, such as gene-allele reference associations. Although this system again uses a reference set for extraction, it differs in that it does a keyword look-up into the lexicon.

In recent work on learning efficient blocking schemes, Bilenko et al. (2006) developed a system for learning disjunctive normal form blocking schemes. However, they learn their schemes using a graphical set covering algorithm, while we use a version of the Sequential Covering Algorithm (SCA). There are also similarities between our BSL algorithm and work on mining association rules from transaction data (Agrawal, Imielinski, & Swami, 1993). Both algorithms discover propositional rules. Further, both algorithms use multiple passes over a data set to discover their rules. However, despite these similarities, the techniques really solve different problems. BSL generates a set of candidate matches with a minimal number of false positives. To do this, BSL learns conjunctions that are maximally specific (eliminating many false positives) and unions them together as a single disjunctive rule (to cover the different true positives). Since the conjunctions are maximally specific, BSL uses SCA underneath, which learns rules in a depth-first, general to specific manner (Mitchell, 1997). On the other hand, the work of mining association rules (Agrawal et al., 1993) looks for actual patterns in the data that represent some internal relationships. There may be many such relationships in the data that could be discovered, so this approach covers the data in a breadth-first fashion, selecting the set of rules at each iteration and extending them by appending to each a new possible item.

7. Conclusion

This article presents an algorithm for semantically annotating text that is ungrammatical and unstructured. Unstructured, ungrammatical sources contain much information, but cannot support structured queries. Our technique allows for more informative use of the sources. Using our approach, eBay agents could monitor the auctions looking for the best deals, or a user could find the average price of a four-star hotel in San Diego. Such semantic

References

Agichtein, E., & Ganti, V. (2004). Mining reference tables for automatic text segmentation. In the Proceedings of the 10th ACM Conference on Knowledge Discovery and Data Mining, pp. 20-29. ACM Press.

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207-216. ACM Press.

Baxter, R., Christen, P., & Churches, T. (2003). A comparison of fast blocking methods for record linkage. In Proceedings of the 9th ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Identification, pp. 25-27.

Bellare, K., & McCallum, A. (2007). Learning extractors from unlabeled text using relevant databases. In Proceedings of the AAAI Workshop on Information Integration on the Web, pp. 10-16.

Bilenko, M., Kamath, B., & Mooney, R. J. (2006). Adaptive blocking: Learning to scale up record linkage and clustering. In Proceedings of the 6th IEEE International Conference on Data Mining, pp. 87-96.

Bilenko, M., & Mooney, R. J. (2003). Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining, pp. 39-48. ACM Press.

Borkar, V., Deshmukh, K., & Sarawagi, S. (2001). Automatic segmentation of text into structured records. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 175-186. ACM Press.

Califf, M. E., & Mooney, R. J. (1999). Relational learning of pattern-match rules for information extraction. In Proceedings of the 16th National Conference on Artificial Intelligence, pp. 328-334.

Chakaravarthy, V. T., Gupta, H., Roy, P., & Mohania, M. (2006). Efficiently linking text documents with relevant structured information. In Proceedings of the International Conference on Very Large Data Bases, pp. 667-678. VLDB Endowment.

Chaudhuri, S., Ganjam, K., Ganti, V., & Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning. In Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 313-324. ACM Press.

Cimiano, P., Handschuh, S., & Staab, S. (2004). Towards the self-annotating web. In Proceedings of the 13th International Conference on World Wide Web, pp. 462-471. ACM Press.

Ciravegna, F. (2001). Adaptive information extraction from text by rule induction and generalisation. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, pp. 1251-1256.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pp. 282-289. Morgan Kaufmann.

Lee, M.-L., Ling, T. W., Lu, H., & Ko, Y. T. (1999). Cleansing data for mining and warehousing. In Proceedings of the 10th International Conference on Database and Expert Systems Applications, pp. 751-760. Springer-Verlag.

Lerman, K., Gazen, C., Minton, S., & Knoblock, C. A. (2004). Populating the semantic web. In Proceedings of the AAAI Workshop on Advances in Text Extraction and Mining.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. English translation in Soviet Physics Doklady, 10(8), 707-710.

Mansuri, I. R., & Sarawagi, S. (2006). Integrating unstructured data into relational databases. In Proceedings of the International Conference on Data Engineering, p. 29. IEEE Computer Society.

McCallum, A. (2002). Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

McCallum, A., Nigam, K., & Ungar, L. H. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the 6th ACM SIGKDD, pp. 169-178.

Michalowski, M., Thakkar, S., & Knoblock, C. A. (2005). Automatically utilizing secondary sources to align information across sources. In AI Magazine, Special Issue on Semantic Integration, Vol. 26, pp. 33-45.

Michelson, M., & Knoblock, C. A. (2005). Semantic annotation of unstructured and ungrammatical text. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp. 1091-1098.

Michelson, M., & Knoblock, C. A. (2006). Learning blocking schemes for record linkage. In Proceedings of the 21st National Conference on Artificial Intelligence.

Michelson, M., & Knoblock, C. A. (2007). Unsupervised information extraction from unstructured, ungrammatical data sources on the world wide web. International Journal of Document Analysis and Recognition (IJDAR), Special Issue on Noisy Text Analytics.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York.

Muller, H.-M., & Sternberg, E. E. K. P. W. (2004). Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLoS Biology, 2(11).

Muslea, I., Minton, S., & Knoblock, C. A. (2001). Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1/2), 93-114.