This requires the agent to contain two data gathering mechanisms: the ability to query sources and the ability to integrate relevant sources of information. However, these data gathering mechanisms assume that the sources themselves are designed to support relational queries, such as having a well defined schema and standard values for the attributes. Yet this is not always the case. There are many data sources on the World Wide Web that would be useful to query, but the textual data within them is unstructured and is not designed to support querying. We call the text of such data sources "posts." Examples of posts include the text of eBay auction listings, Internet classifieds like Craigslist, bulletin boards such as BiddingForTravel (3), or even the summary text below the hyperlinks returned after querying Google. As a running example, consider the three posts for used car classifieds shown in Table 1.

Table 1: Three posts for Honda Civics from Craigslist

    Craigslist Post
    93 civic 5 speed runs great obo (ri) $1800
    93 - 4dr Honda Civc LX Stick Shift $1800
    94 DEL SOL Si Vtec (Glendale) $3000

The current method to query posts, whether by an agent or a person, is keyword search. However, keyword search is inaccurate and cannot support relational queries. For example, a difference in spelling between the keyword and that same attribute within a post would prevent that post from being returned in the search. This would be the case if a user searched the example listings for "Civic," since the second post would not be returned. Another factor which limits keyword accuracy is the exclusion of redundant attributes. For example, some classified posts about cars only include the car model, and not the make, since the make is implied by the model. This is shown in the first and third posts of Table 1. In these cases, if a user does a keyword search using the make "Honda," these posts will not be returned. Moreover, keyword search is not a rich query framework. For instance, consider the query: What is the average price for all Hondas from 1999 or later? To do this with keyword search requires a user to search on "Honda" and retrieve all results that are from 1999 or later. Then the user must traverse the returned set, keeping track of the prices and removing incorrectly returned posts. However, if a schema with standardized attribute values is defined over the entities in the posts, then a user could run the example query using a simple SQL statement and do so accurately, addressing both problems created by keyword search. The standardized attribute values ensure invariance to issues such as spelling differences. Also, each post is associated with a full schema with values, so even though a post might not contain a car make, for instance, its schema does and has the correct value for it, so it will be returned in a query on car makes.

3. www.biddingfortravel.com
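To make the contrast with keyword search concrete, the following is a minimal sketch (not from the article) of how the example query could be answered once posts carry standardized attribute values. The table name, column names, and the extra fourth row are hypothetical illustrations of the kind of schema the authors describe.

    import sqlite3

    # Hypothetical relational view of annotated posts: each post has been linked to
    # standardized make/model/year values and an extracted price.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (post TEXT, make TEXT, model TEXT, year INTEGER, price INTEGER)")
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
        [
            ("93 civic 5 speed runs great obo (ri) $1800", "Honda", "Civic", 1993, 1800),
            ("93 - 4dr Honda Civc LX Stick Shift $1800",   "Honda", "Civic", 1993, 1800),
            ("94 DEL SOL Si Vtec (Glendale) $3000",        "Honda", "Civic del Sol", 1994, 3000),
            ("00 Accord EX low miles $5500",               "Honda", "Accord", 2000, 5500),
        ],
    )

    # "What is the average price for all Hondas from 1999 or later?"
    avg_price = conn.execute(
        "SELECT AVG(price) FROM posts WHERE make = 'Honda' AND year >= 1999"
    ).fetchone()[0]
    print(avg_price)  # only the hypothetical 2000 Accord qualifies in this toy data -> 5500.0

Note that the second post is returned by the make/model columns even though it misspells "Civic," and the first and third posts carry the make "Honda" even though it never appears in their text, which is exactly the invariance the standardized values provide.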
Furthermore, these standardized values allow for integration of the source with outside sources. Integrating sources usually entails joining the two sources directly on attributes or translations of the attributes. Without standardized values and ...

... the Edmunds car buying guide, we use wrapper technologies (AgentBuilder (7) in this case) to scrape data from the Web source, using the schema that the source defines for the car. To use a reference set to build a relational data set, we exploit the attributes in the reference set to determine the attributes from the post that can be extracted. The first step of our algorithm finds the best matching member of the reference set for the post. This is called the "record linkage" step. By matching a post to a member of the reference set we can define schema elements for the post using the schema of the reference set, and we can provide standard values for these attributes by using the attribute values from the reference set when a user queries the posts. Next, we perform information extraction to extract the actual values in the post that match the schema elements defined by the reference set. This is the information extraction step. During the information extraction step, the parts of the post are extracted that best match the attribute values from the reference set member chosen during the record linkage step. In this step we also extract attributes that are not easily represented by reference sets, such as prices or dates. Although we already have the schema and standardized attributes required to create a relational data set over the posts, we still extract the actual attributes embedded within the post so that we can more accurately learn to extract the attributes not represented by a reference set, such as prices and dates. While these attributes can be extracted using regular expressions, if we extract the actual attributes within the post we might be able to do so more accurately. For example, consider the "Ford 500" car. Without actually extracting the attributes within a post, we might extract "500" as a price, when it is actually a car name. Our overall approach is outlined in Figure 1.

7. A product of Fetch Technologies, http://www.fetch.com/products.asp

Although we previously described a similar approach to semantically annotating posts (Michelson & Knoblock, 2005), this paper extends that research by combining the annotation with our work on more scalable record matching (Michelson & Knoblock, 2006). Not only does this make the matching step for our annotation more scalable, it also demonstrates that our work on efficient record matching extends to our unique problem of matching posts, with embedded attributes, to structured, relational data. This paper also presents a more detailed description than our past work, including a more thorough evaluation of the procedure, using larger experimental data sets, including a reference set that contains tens of thousands of records.

This article is organized as follows. We first describe our algorithm for aligning the posts to the best matching members of the reference set in Section 2. In particular, we show how this matching takes place, and how we efficiently generate candidate matches to make the matching procedure more scalable. In Section 3, we demonstrate how to exploit the matches to extract the attributes embedded within the post. We present some experiments in Section 4, validating our approaches to blocking, matching and information extraction for unstructured and ungrammatical text. We follow with a discussion of these results in Section 5 and then present related work in Section 6. We finish with some final thoughts and conclusions in Section 7.
... ({token-match, make}) to ({token-match, make} ∧ {token-match, trim}), we still cover new true matches, but we generate fewer additional candidates. Therefore effective blocking schemes should learn conjunctions that minimize the false positives, but learn enough of these conjunctions to cover as many true matches as possible. These two goals of blocking can be clearly defined by the Reduction Ratio and Pairs Completeness (Elfeky, Verykios, & Elmagarmid, 2002). The Reduction Ratio (RR) quantifies how well the current blocking scheme minimizes the number of candidates. Let C be the number of candidate matches and N be the size of the cross product between both data sets:

    RR = 1 - C/N

It should be clear that adding more {method, attribute} pairs to a conjunction increases its RR, as when we changed ({token-match, zip}) to ({token-match, zip} ∧ {token-match, first name}). Pairs Completeness (PC) measures the coverage of true positives, i.e., how many of the true matches are in the candidate set versus those in the entire set. If S_m is the number of true matches in the candidate set, and N_m is the number of matches in the entire data set, then:

    PC = S_m / N_m

Adding more disjuncts can increase our PC. For example, we added the second conjunction to our example blocking scheme because the first did not cover all of the matches. The blocking approach in this paper, the "Blocking Scheme Learner" (BSL), learns effective blocking schemes in disjunctive normal form by maximizing the reduction ratio and pairs completeness. In this way, BSL tries to maximize the two goals of blocking. Previously we showed that BSL aided the scalability of record linkage (Michelson & Knoblock, 2006), and this paper extends that idea by showing that it also works in the case of matching posts to the reference set records.

The BSL algorithm uses a modified version of the Sequential Covering Algorithm (SCA), used to discover disjunctive sets of rules from labeled training data (Mitchell, 1997). In our case, SCA learns disjunctive sets of conjunctions consisting of {method, attribute} pairs. Basically, each call to LEARN-ONE-RULE generates a conjunction, and BSL keeps iterating over this call, covering the true matches left over after each iteration. This way SCA learns a full blocking scheme. The BSL algorithm is shown in Table 2. There are two modifications to the classic SCA algorithm. First, BSL runs until there are no more examples left to cover, rather than stopping at some threshold. This ensures that we maximize the number of true matches generated as candidates by the final blocking rule (Pairs Completeness). Note that this might, in turn, yield a large number of candidates, hurting the Reduction Ratio. However, omitting true matches directly affects the accuracy of record linkage, and blocking is a preprocessing step for record linkage, so it is more important to cover as many true matches as possible. This way BSL fulfills one of the blocking goals: not eliminating true matches if possible. Second, if we learn a new conjunction (in the LEARN-ONE-RULE step) and our current blocking scheme has a rule that already contains the newly learned rule, then we can remove the rule containing the newly learned rule. This is an optimization that allows us to check rule containment as we go, rather than at the end.
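The two blocking metrics defined above are simple to compute. The following is a minimal Python sketch (an illustration, not part of BSL) of RR and PC for a candidate set produced by some blocking scheme; the data set sizes, the representation of a candidate as a pair of record ids, and the toy numbers are assumptions made only for the example.

    def reduction_ratio(num_candidates, size_a, size_b):
        # RR = 1 - C/N, where N = |A| * |B| is the size of the cross product.
        return 1.0 - num_candidates / (size_a * size_b)

    def pairs_completeness(candidate_pairs, true_matches):
        # PC = S_m / N_m: the fraction of true matches kept in the candidate set.
        true_matches = set(true_matches)
        covered = sum(1 for pair in candidate_pairs if pair in true_matches)
        return covered / len(true_matches)

    # Toy numbers: 1,000 posts and 10,000 reference records give N = 10,000,000 possible
    # pairs; a scheme that generates 50,000 candidates therefore has RR = 0.995.
    print(reduction_ratio(50_000, 1_000, 10_000))                           # 0.995
    print(pairs_completeness([(1, 4), (2, 9)], [(1, 4), (2, 9), (3, 7)]))   # ~0.667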
The algorithm's behavior is well defined for the minimum PC threshold. Consider the case where the algorithm is learning as restrictive a rule as it can with the minimum coverage. In this case, the parameter ends up partitioning the space of the cross product of example records by the threshold amount. That is, if we set the threshold to 50% of the examples covered, the most restrictive first rule covers 50% of the examples. The next rule covers 50% of what is remaining, which is 25% of the examples. The next will cover 12.5% of the examples, and so on. In this sense, the parameter is well defined. If we set the threshold high, we will learn fewer, less restrictive conjunctions, possibly limiting our RR, although this may increase PC slightly. If we set it lower, we cover more examples, but we need to learn more conjuncts. These newer conjuncts, in turn, may be subsumed by later conjuncts, so they will be a waste of time to learn. So, as long as this parameter is small enough, it should not affect the coverage of the final blocking scheme, and setting it smaller than that just slows down the learning. We set this parameter to 50% for our experiments (8).

Now we analyze the running time of BSL and show how BSL can take into account the running time of different blocking methods, if need be. Assume that we have x (method, attribute) pairs such as (token, first-name). Now, assume that our beam size is b, since we use general-to-specific beam search in our LEARN-ONE-RULE procedure. Also, for the time being, assume each (method, attribute) pair can generate its blocking candidates in O(1) time. (We relax this assumption later.) Each time we hit LEARN-ONE-RULE within BSL, we try all rules in the beam with all of the (attribute, method) pairs not in the current beam rules. So, in the worst case, this takes O(bx) each time, since for each (method, attribute) pair in the beam, we try it against all other (method, attribute) pairs. Now, in the worst case, each learned disjunct would only cover 1 training example, so our rule is a disjunction of all x pairs. Therefore, we run LEARN-ONE-RULE x times, resulting in a learning time of O(bx^2). If we have e training examples, the full training time for BSL to learn the blocking scheme is O(ebx^2). Now, while we assumed above that each (method, attribute) pair runs in O(1) time, this is clearly not the case, since there is a substantial amount of literature on blocking methods and ...

8. Setting this parameter lower than 50% had an insignificant effect on our results, and setting it much higher, to 90%, only increased the PC by a small amount (if at all), while decreasing the RR.

Table 3: Learning a conjunction of {method, attribute} pairs

    LEARN-ONE-RULE(attributes, examples, min_thresh, k)
      Best-Conjunction <- {}
      Candidate-conjunctions <- all {method, attribute} pairs
      While Candidate-conjunctions not empty, do
        For each ch in Candidate-conjunctions
          If not first iteration
            ch <- ch ∪ {method, attribute}
          Remove any ch that are duplicates, inconsistent or not maximally specific
          If REDUCTION-RATIO(ch) > REDUCTION-RATIO(Best-Conjunction)
             and PAIRS-COMPLETENESS(ch) >= min_thresh
            Best-Conjunction <- ch
        Candidate-conjunctions <- best k members of Candidate-conjunctions
      Return Best-Conjunction
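The following is a rough, self-contained Python sketch of the general-to-specific beam search in Table 3, written as an illustration rather than a reimplementation of BSL. The two blocking methods (a shared-token test and a first-three-characters test), the toy posts, reference records and true matches, and the scoring helpers are all invented for the example; only the control flow follows the table: extend each conjunction in the beam by one {method, attribute} pair, keep a conjunction when it improves RR while keeping PC above min_thresh, and retain the best k conjunctions for the next pass.

    from itertools import product

    # Invented toy data: posts, structured reference records, and known true matches.
    posts = ["93 civic 5 speed $1800", "94 del sol si vtec $3000", "00 accord ex $5500"]
    refs = [("Honda", "Civic"), ("Honda", "Civic del Sol"), ("Honda", "Accord"), ("Toyota", "Camry")]
    true_matches = {(0, 0), (1, 1), (2, 2)}

    def token_match(post, value):
        return bool(set(post.lower().split()) & set(value.lower().split()))

    def first3_match(post, value):
        return any(tok[:3] == value.lower()[:3] for tok in post.lower().split())

    METHODS = {"token": token_match, "first3": first3_match}
    ATTRIBUTES = {"make": 0, "model": 1}

    def candidates(conjunction):
        # (post, ref) pairs kept by every {method, attribute} pair in the conjunction.
        kept = set()
        for p, r in product(range(len(posts)), range(len(refs))):
            if all(METHODS[m](posts[p], refs[r][ATTRIBUTES[a]]) for m, a in conjunction):
                kept.add((p, r))
        return kept

    def rr(cands):
        return 1.0 - len(cands) / (len(posts) * len(refs))

    def pc(cands):
        return len(cands & true_matches) / len(true_matches)

    def learn_one_rule(min_thresh=0.5, k=4):
        pairs = list(product(METHODS, ATTRIBUTES))
        beam = [(p,) for p in pairs]          # start from single {method, attribute} pairs
        best = None
        while beam:
            extended = []
            for conj in beam:
                cands = candidates(conj)
                if pc(cands) >= min_thresh and (best is None or rr(cands) > rr(candidates(best))):
                    best = conj
                # grow the conjunction by one unused {method, attribute} pair
                extended.extend(conj + (p,) for p in pairs if p not in conj)
            beam = sorted(set(extended), key=lambda c: rr(candidates(c)), reverse=True)[:k]
        return best

    print(learn_one_rule())   # e.g. (('first3', 'model'),) on this toy data

As in the paper, the outer BSL loop would call a routine like this repeatedly, removing the covered true matches and unioning the learned conjunctions into a disjunctive blocking scheme.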
... experiments, for blocking it is more important to pick the right attribute combinations, as BSL does, even using simple methods, than to do blocking using the most sophisticated methods.

We can easily extend our BSL algorithm to handle the case of matching posts to members of the reference set. This is a special case because the posts have all the attributes embedded within them, while the reference set data is relational and structured into schema elements. To handle this special case, rather than matching attribute and method pairs across the data sources during our LEARN-ONE-RULE, we instead compare attribute and method pairs from the relational data to the entire post. This is a small change, showing that the same algorithm works well even in this special case. Once we learn a good blocking scheme, we can efficiently generate candidates from the post set to align to the reference set. This blocking step is essential for mapping large amounts of unstructured and ungrammatical data to larger and larger reference sets.

2.2 The Matching Step

From the set of candidates generated during blocking one can find the member of the reference set that best matches the current post. That is, one data source's record (the post) must align to a record from the other data source (the reference set candidates). While the whole alignment procedure is referred to as record linkage (Fellegi & Sunter, 1969), we refer to finding the particular matches after blocking as the "matching step."

Figure 2: The traditional record linkage problem

However, the record linkage problem presented in this article differs from the "traditional" record linkage problem and is not well studied. Traditional record linkage matches a record from one data source to a record from another data source by relating their respective, decomposed attributes. For instance, using the second post from Table 1, and assuming decomposed attributes, the make from the post is compared to the make of the reference set. This is also done for the models, the trims, etc. The record from the reference set that best matches the post based on the similarities between the attributes would be considered the match. This is represented in Figure 2. Yet the attributes of the posts are embedded within a single piece of text and not yet identified. This text is compared to the reference set, which is already decomposed into attributes and which does not have the extraneous tokens present in the post. Figure 3 depicts this problem.

Figure 3: The problem of matching a post to the reference set

With this type of matching, traditional record linkage approaches do not apply. Instead, the matching step compares the post to all of the attributes of the reference set concatenated together. Since the post is compared to a whole record from the reference set (in the sense that it has all of the attributes), this comparison is at the "record level" and it approximately reflects how similar all of the embedded attributes of the post are to all of the attributes of the candidate match. This mimics the idea of traditional record linkage, that comparing all of the fields determines the similarity at the record level. However, by using only the record level similarity it is possible for two candidates to generate the same record level similarity while differing on individual attributes. If one of these attributes is more discriminative than the other, there needs to be some way to reflect that. For example, consider Figure 4. In the figure, the two candidates share the same make and model. However, the first candidate shares the year while the second candidate shares the trim. Since both candidates share the same make and model, and both have another attribute in common, it is possible that they generate the same record level comparison. Yet a trim on a car, especially a rare one like "Hatchback," should be more discriminative than sharing a year, since there are lots of cars with the same make, model and year that differ only by the trim. This difference in individual attributes needs to be reflected.

Figure 4: Two records with equal record level but different field level similarities

To discriminate between attributes, the matching step borrows the idea from traditional record linkage that incorporating the individual comparisons between each attribute from each data source is the best way to determine a match. That is, the record level information alone is not enough to discriminate matches; field level comparisons must be exploited as well. To do "field level" comparisons, the matching step compares the post to each individual attribute of the reference set. These record and field level comparisons are represented by a vector of different similarity functions called RL_scores. By incorporating different similarity functions, RL_scores reflects the different types of similarity that exist between text. Hence, for the record level comparison, the matching step generates the RL_scores vector between the post and all of the attributes concatenated. To generate field level comparisons, the matching step calculates the RL_scores between the post and each of the individual attributes of the reference set. All of these RL_scores vectors are then stored in a vector called V_RL. Once populated, V_RL represents the record and field level similarities between a post and a member of the reference set.
In the example reference set from Figure 3, the schema has 4 attributes: make, model, trim and year. Assuming the current candidate is {"Honda", "Civic", "4DLX", "1993"}, then V_RL looks like:

    V_RL = < RL_scores(post, "Honda"),
             RL_scores(post, "Civic"),
             RL_scores(post, "4DLX"),
             RL_scores(post, "1993"),
             RL_scores(post, "Honda Civic 4DLX 1993") >

Or more generally:

    V_RL = < RL_scores(post, attribute_1),
             RL_scores(post, attribute_2),
             ...,
             RL_scores(post, attribute_n),
             RL_scores(post, attribute_1 attribute_2 ... attribute_n) >

The RL_scores vector is meant to include notions of the many ways that exist to define the similarity between the textual values of the data sources. It might be the case that one attribute differs from another in a few misplaced, missing or changed letters. This sort of similarity identifies two attributes that are similar, but misspelled, and is called "edit distance." Another type of textual similarity looks at the tokens of the attributes and defines similarity based upon the number of tokens shared between the attributes. This "token level" similarity is not robust to spelling mistakes, but it puts no emphasis on the order of the tokens, whereas edit distance requires that the order of the tokens match in order for the attributes to be similar. Lastly, there are cases where one attribute may sound like another, even if they are spelled differently, or one attribute may share a common root word with another attribute, which implies a "stemmed" similarity. These last two examples are neither token nor edit distance based similarities. To capture all these different similarity types, the RL_scores vector is built of three vectors that reflect each of the different similarity types discussed above. Hence, RL_scores is:

    RL_scores(post, attribute) = < token_scores(post, attribute),
                                   edit_scores(post, attribute),
                                   other_scores(post, attribute) >

The vector token_scores comprises three token level similarity scores. Two similarity scores included in this vector are based on the Jensen-Shannon distance, which defines similarities over probability distributions of the tokens. One uses a Dirichlet prior (Cohen, Ravikumar, & Feinberg, 2003) and the other smooths its token probabilities using a Jelinek-Mercer mixture model (Zhai & Lafferty, 2001). The last metric in the token_scores vector is the Jaccard similarity. With all of the scores included, the token_scores vector takes the form:

    token_scores(post, attribute) = < Jensen-Shannon-Dirichlet(post, attribute),
                                      Jensen-Shannon-JM-Mixture(post, attribute),
                                      Jaccard(post, attribute) >

The vector edit_scores consists of the edit distance scores, which are comparisons between strings at the character level defined by operations that turn one string into another. For instance, the edit_scores vector includes the Levenshtein distance (Levenshtein, 1966), which returns the minimum number of operations to turn string S into string T, and the Smith-Waterman distance (Smith & Waterman, 1981), which is an extension to the Levenshtein distance. The last score in the vector edit_scores is the Jaro-Winkler similarity (Winkler & Thibaudeau, 1991), which is an extension of the Jaro metric (Jaro, 1989) used to find similar proper nouns. While not a strict edit distance, because it does not regard operations of transformations, the Jaro-Winkler metric is a useful determinant of string similarity. With all of the character level metrics, the edit_scores vector is defined as: ...
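To make the structure of these feature vectors concrete, here is a small Python sketch (an illustration, not the Phoebus implementation) that builds a reduced RL_scores containing one token-level score and two character-level scores, and stacks them into V_RL as described above. The Jensen-Shannon, Smith-Waterman, Jaro-Winkler and "other" scores from the article are omitted, and the substring indicator is an invented stand-in, to keep the example short.

    def jaccard(a, b):
        # Token-level similarity: shared tokens over all tokens.
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def levenshtein(a, b):
        # Character-level edit distance (insert/delete/substitute).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def rl_scores(post, attribute):
        # Reduced RL_scores: one token_scores entry, one edit_scores entry, one stand-in.
        max_len = max(len(post), len(attribute), 1)
        return [
            jaccard(post, attribute),
            1.0 - levenshtein(post.lower(), attribute.lower()) / max_len,
            1.0 if attribute.lower() in post.lower() else 0.0,
        ]

    def v_rl(post, reference_record):
        # Field-level RL_scores per attribute, plus one record-level RL_scores on the concatenation.
        vectors = [rl_scores(post, attr) for attr in reference_record]
        vectors.append(rl_scores(post, " ".join(reference_record)))
        return vectors

    post = "93 - 4dr Honda Civc LX Stick Shift $1800"
    candidate = ("Honda", "Civic", "4DLX", "1993")
    print(v_rl(post, candidate))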
Figure 6: Our approach to matching posts to records from a reference set

In addition to providing a standardized set of values to query the posts, these standardized values allow for integration with outside sources because the values can be standardized to canonical values. For instance, if we want to integrate our car classifieds with a safety ratings website, we can now easily join the sources across the attribute values. In this manner, by approaching annotation as a record linkage problem, we can create relational data from unstructured and ungrammatical data sources. However, to aid in the extraction of attributes not easily represented in reference sets, we perform information extraction on the posts as well.

3. Extracting Data from Posts

Although the record linkage step creates most of the relational data from the posts, there are still attributes we would like to extract from the post which are not easily represented by reference sets, which means the record linkage step cannot be used for these attributes. Examples of such attributes are dates and prices. Although many of these attributes can be extracted using simple techniques, such as regular expressions, we can make their extraction and annotation even more accurate by using sophisticated information extraction. To motivate this idea, consider the Ford car model called the "500." If we just used regular expressions, we might extract 500 as the price of the car, but this would not be correct. However, if we try to extract all of the attributes, including the model, then we would correctly extract "500" as the model. Furthermore, we might want to extract the actual attributes from a post, as they are, and our extraction algorithm allows this.

To perform extraction, the algorithm infuses information extraction with extra knowledge, rather than relying on possibly inconsistent characteristics. To garner this extra knowledge, the approach exploits the idea of reference sets by using the attributes from the matching reference set member as a basis for identifying similar attributes in the post. Then, the algorithm can label these extracted values from the post with the schema from the reference set, thus adding annotation based on the extracted values. In a broad sense, the algorithm has two parts. First, we label each token with a possible attribute label or as "junk" to be ignored. After all the tokens in a post are labeled, we then clean each of the extracted labels. Figure 7 shows the whole procedure graphically, in detail, using the second post from Table 1. Each of the steps shown in this figure is described in detail below.
Figure 7: Extraction process for attributes

To begin the extraction process, the post is broken into tokens. Using the first post from Table 1 as an example, the set of tokens becomes {"93", "civic", "5speed", ...}. Each of these tokens is then scored against each attribute of the record from the reference set that was deemed the match. To score the tokens, the extraction process builds a vector of scores, V_IE. Like the V_RL vector of the matching step, V_IE is composed of vectors which represent the similarities between the token and the attributes of the reference set. However, the composition of V_IE is slightly different from V_RL. It contains no comparison to the concatenation of all the attributes, and the vectors that compose V_IE are different from those that compose V_RL. Specifically, the vectors that form V_IE are called IE_scores, and are similar to the RL_scores that compose V_RL, except they do not contain the token_scores component, since each IE_scores only uses one token from the post at a time. The RL_scores vector:

    RL_scores(post, attribute) = < token_scores(post, attribute),
                                   edit_scores(post, attribute),
                                   other_scores(post, attribute) >

becomes:

    IE_scores(token, attribute) = < edit_scores(token, attribute),
                                    other_scores(token, attribute) >

The other main difference between V_IE and V_RL is that V_IE contains a unique vector of user defined functions, such as regular expressions, to capture attributes that are not easily represented by reference sets, such as prices or dates. These attribute types generally exhibit consistent characteristics that allow them to be extracted, and they are usually infeasible to represent in reference sets. This makes traditional extraction methods a good choice for these attributes. This vector is called common_scores because the characteristics used to extract these attributes are "common" enough to be used for extraction.

Using the first post of Table 1, assume the reference set match has the make "Honda," the model "Civic" and the year "1993." This means the matching tuple would be {"Honda", "Civic", "1993"}. This match generates the following V_IE for the token "civic" of the post:

    V_IE = < common_scores("civic"),
             IE_scores("civic", "Honda"),
             IE_scores("civic", "Civic"),
             IE_scores("civic", "1993") >

More generally, for a given token, V_IE looks like:

    V_IE = < common_scores(token),
             IE_scores(token, attribute_1),
             IE_scores(token, attribute_2),
             ...,
             IE_scores(token, attribute_n) >
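The following Python sketch (again an illustration under assumed helpers, not the article's code) shows how a V_IE vector for a single token might be assembled: a common_scores component built from simple regular expressions for prices and dates, plus an IE_scores component for each reference set attribute. The particular regexes, the normalized edit score and the exact-match indicator are invented stand-ins for the paper's richer score set.

    import re

    def edit_scores(token, attribute):
        # Character-level scores between one post token and one reference attribute.
        def levenshtein(a, b):
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
                prev = cur
            return prev[-1]
        a, b = token.lower(), attribute.lower()
        max_len = max(len(a), len(b), 1)
        return [1.0 - levenshtein(a, b) / max_len,   # normalized edit similarity
                1.0 if a == b else 0.0]              # exact-match indicator, a stand-in "other" score

    def common_scores(token):
        # User-defined recognizers for attributes with no reference set, e.g. prices and dates.
        return [1.0 if re.fullmatch(r"\$?\d{3,6}", token) else 0.0,                     # price-like
                1.0 if re.fullmatch(r"\d{1,2}/\d{1,2}(/\d{2,4})?", token) else 0.0]     # date-like

    def v_ie(token, reference_record):
        vector = [common_scores(token)]
        vector += [edit_scores(token, attr) for attr in reference_record]
        return vector

    # Token "civic" scored against the matched reference record {"Honda", "Civic", "1993"}.
    print(v_ie("civic", ("Honda", "Civic", "1993")))

In Phoebus, a vector like this is what gets handed to the trained extractor (a structured SVM or a CRF, as described below) to decide the token's attribute label.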
Each V_IE is then passed to a structured SVM (Tsochantaridis, Joachims, Hofmann, & Altun, 2005; Tsochantaridis, Hofmann, Joachims, & Altun, 2004) trained to give it an attribute type label, such as make, model, or price. Intuitively, similar attribute types should have similar V_IE vectors. The makes should generally have high scores against the make attribute of the reference set, and small scores against the other attributes. Further, structured SVMs are able to infer the extraction labels collectively, which helps in deciding between possible token labels. This makes structured SVMs an ideal machine learning method for our task. Note that since each V_IE is not a member of a cluster where the winner takes all, there is no binary rescoring. Since there are many irrelevant tokens in the post that should not be annotated, the SVM learns that any V_IE that does associate with a learned attribute type should be labeled as ...

Figure 8: Improving extraction accuracy with reference set attributes

Note, however, that we do not limit the machine learning component of our extraction algorithm to SVMs. Instead, we claim that in some cases reference sets can aid extraction in general, and to test this, in our architecture we can replace the SVM component with other methods. For example, in our extraction experiments we replace the SVM extractor with a Conditional Random Field (CRF) (Lafferty, McCallum, & Pereira, 2001) extractor that uses the V_IE as features. Therefore, the whole extraction process takes a token of the text, creates the V_IE and passes this to the machine-learning extractor, which generates a label for the token. Then each field is cleaned and the extracted attribute is saved.

Figure 9: Algorithm to clean an extracted attribute

    Algorithm 3.1: CleanAttribute(E, R)
      comment: Clean extracted attribute E using reference set attribute R
      RemovalCandidates C <- null
      JaroWinklerBaseline <- JaroWinkler(E, R)
      JaccardBaseline <- Jaccard(E, R)
      for each token t in E do
        X_t <- RemoveToken(t, E)
        JaroWinkler_Xt <- JaroWinkler(X_t, R)
        Jaccard_Xt <- Jaccard(X_t, R)
        if JaroWinkler_Xt >= JaroWinklerBaseline and Jaccard_Xt >= JaccardBaseline
          then C <- C ∪ {t}
      if C = null
        return E
      else
        E <- RemoveMaxCandidate(C, E)
        CleanAttribute(E, R)

4. Results

The Phoebus system was built to experimentally validate our approach to building relational data from unstructured and ungrammatical data sources. Specifically, Phoebus tests the technique's accuracy in both the record linkage and the extraction, and incorporates the BSL algorithm for learning and using blocking schemes. The experimental data comes from three domains of posts: hotels, comic books, and cars.

The data from the hotel domain contains the attributes hotel name, hotel area, star rating, price and dates, which are extracted to test the extraction algorithm. This data comes from the BiddingForTravel website (9), which is a forum where users share successful bids for Priceline on items such as airline tickets and hotel rates. The experimental data is limited to postings about hotel rates in Sacramento, San Diego and Pittsburgh, which compose a data set with 1125 posts, 1028 of which have a match in the reference set. The reference set comes from the BiddingForTravel hotel guides, which are special posts listing all of the hotels ever posted about a given area. These special posts provide hotel names, hotel areas and star ratings, which are the reference set attributes. Therefore, these are the 3 attributes for which the standardized values are used, allowing us to treat these posts as a relational data set. This reference set contains 132 records.

The experimental data for the comic domain comes from posts for items for sale on eBay. To generate this data set, eBay was searched by the keywords "Incredible Hulk" and "Fantastic Four" in the comic books section of their website. (This returned some items that are not comics, such as t-shirts, and some sets of comics not limited to those searched for, which makes the problem more difficult.) The returned records contain the attributes comic title, issue number, price, publisher, publication year and the description, which are extracted. (Note: the description is a few-word description commonly associated with a comic book, such as "1st appearance the Rhino.") The total number of posts in this data set is 776, of which 697 have matches. The comic domain reference set uses data from the Comics Price Guide (10), which lists all the Incredible Hulk and Fantastic Four comics. This reference set has the attributes title, issue number, description, and publisher, and contains 918 records.

The cars data consists of posts made to Craigslist regarding cars for sale. This data set consists of classifieds for cars from Los Angeles, San Francisco, Boston, New York, New ...

9. www.biddingfortravel.com
10. http://www.comicspriceguide.com/
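As a rough Python rendering of the cleaning procedure in Figure 9 above, included only as an illustration: the sketch below assumes whitespace tokenization and substitutes a normalized edit similarity for the paper's Jaro-Winkler metric, so the scores differ from Phoebus, but the control flow follows the pseudocode: collect the tokens whose removal does not hurt either baseline, drop the best such reduction, and recurse.

    def jaccard(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def edit_sim(a, b):
        # Normalized edit similarity, standing in for the Jaro-Winkler score in Figure 9.
        a, b = a.lower(), b.lower()
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return 1.0 - prev[-1] / max(len(a), len(b), 1)

    def clean_attribute(extracted, reference):
        # Repeatedly drop the token whose removal improves both similarity baselines.
        tokens = extracted.split()
        edit_base, jac_base = edit_sim(extracted, reference), jaccard(extracted, reference)
        candidates = []
        for i in range(len(tokens)):
            reduced = " ".join(tokens[:i] + tokens[i + 1:])
            e, j = edit_sim(reduced, reference), jaccard(reduced, reference)
            if e >= edit_base and j >= jac_base:
                candidates.append((e + j, reduced))
        if not candidates:
            return extracted
        _, best_reduced = max(candidates)      # RemoveMaxCandidate: keep the best reduction
        return clean_attribute(best_reduced, reference)

    print(clean_attribute("civic lx runs great", "Civic LX"))   # -> "civic lx"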
... problems. Methods such as Bigram indexing are techniques that make the process of each blocking pass on an attribute more efficient. The goal of BSL, however, is to select which attribute combinations should be used for blocking as a whole, trying different attribute and method pairs. Nonetheless, we contend that it is more important to select the right attribute combinations, even using simple methods, than it is to use more sophisticated methods without insight as to which attributes might be useful. To test this hypothesis, we compare BSL using the token and 3-gram methods to Bigram indexing over all of the attributes. This is equivalent to forming a disjunction over all attributes using Bigram indexing as the method. We chose Bigram indexing in particular because it is designed to perform "fuzzy blocking," which seems necessary in the case of noisy post data. As stated previously (Baxter et al., 2003), we use a threshold of 0.3 for Bigram indexing, since that works the best. We also compare BSL to running a disjunction over all attributes using the simple token method only. In our results, we call this blocking rule "Disjunction." This disjunction mirrors the idea of picking the simplest possible blocking method: namely, using all attributes with a very simple method.

As stated previously, the two goals of blocking can be quantified by the Reduction Ratio (RR) and the Pairs Completeness (PC). Table 5 shows not only these values but also how many candidates were generated on average over the entire test set, comparing the three different approaches. Table 5 also shows how long it took each method to learn the rule and run the rule. Lastly, the column "Time match" shows how long the classifier needs to run given the number of candidates generated by the blocking scheme. Table 6 shows a few example blocking schemes that the algorithm generated. For a comparison of the attributes BSL selected to the attributes picked manually for different domains where the data is structured, the reader is pointed to our previous work on the topic (Michelson & Knoblock, 2006).

The results of Table 5 validate our idea that it is more important to pick the correct attributes to block on (using simple methods) than to use sophisticated methods without attention to the attributes. Comparing the BSL rule to the Bigram results, the combination of PC and RR is always better using BSL. Note that although in the Cars domain Bigram took significantly less time with the classifier due to its large RR, it did so because it only had a PC of 4%. In this case, Bigrams was not even covering 5% of the true matches. Further, the BSL results are better than using the simplest method possible (the Disjunction), especially in the cases where there are many records to test upon. As the number of records scales up, it becomes increasingly important to gain a good RR while maintaining a good PC value as well. This savings is dramatically demonstrated by the Cars domain, where BSL outperformed the Disjunction in both PC and RR.

One surprising aspect of these results is how prevalent the token method is within all the domains. We expected that the n-gram method would be used almost exclusively, since there are many spelling mistakes within the posts. However, this is not the case. We hypothesize that the learning algorithm uses the token methods because they occur with more regularity across the posts than the common n-grams would, since the spelling mistakes might vary quite differently across the posts. This suggests that there might be more regularity, in terms of what we can learn from the data, across the posts than we initially surmised. Another interesting result is the poor reduction ratio of the Comic domain. This happens because most of the rules contain the disjunct that finds a common token within the comic title.
This rule produces such a poor reduction ratio because the value for this attribute is the same across almost all reference set records. That is to say, when there are just a few unique values for the BSL algorithm to use for blocking, the reduction ratio will be small. In this domain, there are only two values for the comic title attribute, "Fantastic Four" and "Incredible Hulk." So it makes sense that if blocking is done using the title attribute only, the reduction is about half, since blocking on the value "Fantastic Four" just gets rid of the "Incredible Hulk" comics. This points to an interesting limitation of the BSL algorithm. If there are not many distinct values for the different attribute and method pairs that BSL can use to learn from, then this lack of values cripples the performance of the reduction ratio. Intuitively though, this makes sense, since it is hard to distinguish good candidate matches from bad candidate matches if they share the same attribute values.

Another result worth mentioning is that in the Hotels domain we get a lower RR but the same PC when we use less training data. This happens because our BSL algorithm runs until it has no more examples to cover, so if those last few examples introduce a new disjunct that produces a lot of candidates, while only covering a few more true positives, then this would cause the RR to decrease, while keeping the PC at the same high rate. This is in fact what happens in this case. One way to curb this behavior would be to set some sort of stopping threshold for BSL, but as we said, maximizing the PC is the most important thing, so we choose not to do this. We want BSL to cover as many true positives as it can, even if that means losing a bit in the reduction.

In fact, we next test this notion explicitly. We set a threshold in the SCA such that after 95% of the training examples are covered, the algorithm stops and returns the learned blocking scheme. This helps to avoid the situation where BSL learns a very general conjunction solely to cover the last few remaining training examples. When that happens, BSL might end up lowering the RR, at the expense of covering just those last training examples, because the rule learned to cover those last examples is overly general and returns too many candidate matches.
Table 7: A comparison of BSL covering all training examples, and covering 95% of the training examples

    Domain / Scheme          Record Linkage F-Measure    RR       PC
    Hotels Domain
      No Thresh (30%)        90.63                       81.56    99.79
      95% Thresh (30%)       90.63                       87.63    97.66
    Comic Domain
      No Thresh (30%)        91.30                       42.97    99.75
      95% Thresh (30%)       91.47                       42.97    99.69
    Cars Domain
      No Thresh (10%)        77.04                       88.48    92.23
      95% Thresh (10%)       67.14                       92.67    83.95

    F-Measure = (2 × Precision × Recall) / (Precision + Recall)

The record linkage approach in this article is compared to WHIRL (Cohen, 2000). WHIRL performs record linkage by performing soft-joins using vector-based cosine similarities between the attributes. Other record linkage systems require decomposed attributes for matching, which is not the case with the posts. WHIRL serves as the benchmark because it does not have this requirement. To mirror the alignment task of Phoebus, the experiment supplies WHIRL with two tables: the test set of posts (either 70% or 90% of the posts) and the reference set with the attributes concatenated to approximate a record level match. The concatenation is also used because when matching on each individual attribute, it is not obvious how to combine the matching attributes to construct a whole matching reference set member. To perform the record linkage, WHIRL does soft-joins across the tables, which produces a list of matches, ordered by descending similarity score. For each post with matches from the join, the reference set member(s) with the highest similarity score(s) is called its match. In the Cars domain the matches are 1-N, so this means that only 1 match from the reference set will be exploited later in the information extraction step. To mirror this idea, the number of possible matches in a 1-N domain is counted as the number of posts that have a match in the reference set, rather than the reference set members themselves that match. Also, this means that we only add a single match to our total number of correct matches for a given post, rather than all of the correct matches, since only one matters. This is done for both WHIRL and Phoebus, and more accurately reflects how well each algorithm would perform as the processing step before our information extraction step.

The record linkage results for both Phoebus and WHIRL are shown in Table 8. Note that the amount of training data for each domain is shown in parentheses. All results are statistically significant using a two-tailed paired t-test with α = 0.05, except for the precision between WHIRL and Phoebus in the Cars domain, and the precision between Phoebus trained on 10% and 30% of the training data in the Comic domain. Phoebus outperforms WHIRL because it uses many similarity types to distinguish matches. Also, since Phoebus uses both record level and attribute level similarities, it is able to distinguish between records that differ in more discriminative attributes. This is especially apparent in the Cars domain. First, these results indicate the difficulty of matching car posts to the large reference set. This is the largest experimental domain yet used for this problem, and it is encouraging how well our approach outperforms the baseline. It is also interesting that the results suggest that both techniques are equally accurate in terms of precision (in fact, there is no statistically significant difference between them in this sense) but Phoebus is able to retrieve many more relevant matches. This means Phoebus can capture richer features that predict matches than WHIRL's cosine similarity alone. We expect this behavior because Phoebus has a notion of both field and token level similarity, using many different similarity measures. This justifies our use of the many similarity types and field and record level information, since our goal is to find as many matches as we can. It is also encouraging that using only 10% of the data for labeling, Phoebus is able to perform almost as well as using 30% of the data for training. Since the amount of data on the Web is vast, only having to label 10% of the data to get comparative results is preferable ...
Table 9: Matching using only the concatenation

                              Precision    Recall    F-Measure
    Hotels
      Phoebus (30%)           87.70        93.78     90.63
      Concatenation Only      88.49        93.19     90.78
      WHIRL                   83.61        83.53     83.13
    Comic
      Phoebus (30%)           87.49        95.46     91.30
      Concatenation Only      61.81        46.55     51.31
      WHIRL                   73.89        81.63     77.57
    Cars
      Phoebus (10%)           69.98        85.68     77.04
      Concatenation Only      47.94        58.73     52.79
      WHIRL                   70.43        63.36     66.71

... also uses a concatenation of the attributes. This is because WHIRL uses information-retrieval-style matching to find the best match, and the machine learning technique tries to learn the characteristics of the best match. Clearly, it is very difficult to learn what such characteristics are. In the Hotels domain, we do not find a statistically significant difference in F-measure using the concatenation alone. This means that the concatenation is sufficient to determine the matches, so there is no need for individual fields to play a role. More specifically, the hotel name and area seem to be the most important attributes for matching, and by including them as part of the concatenation, the concatenation is still distinguishable enough between all records to determine matches. Since in two of the three domains we see a huge improvement, and we never lose in F-measure, using both the concatenation and the individual attributes is valid for the matching. Also, since in two domains the concatenation alone was worse than WHIRL, we conclude that part of the reason Phoebus can outperform WHIRL is the use of the individual attributes for matching.

Our next experiment tests how important it is to include all of the string metrics in our feature vector for matching. To test this idea, we compare using all the metrics to using just one, the Jensen-Shannon distance. We choose the Jensen-Shannon distance because it outperformed both TF/IDF and even a "soft" TF/IDF (one that accounts for fuzzy token matches) in the task of selecting the right reference sets for a given set of posts (Michelson & Knoblock, 2007). These results are shown in Table 10. As Table 10 shows, using all the metrics yielded a statistically significant, large improvement in F-measure for the Comic and Cars domains. This means that some of the other string metrics, such as the edit distances, were capturing similarities that the Jensen-Shannon distance alone did not. Interestingly, in both domains, using Phoebus with only the Jensen-Shannon distance does not dominate WHIRL's performance. Therefore, the results of Table 10 and Table 9 demonstrate that Phoebus benefits from the combination ...

Table 11: Record linkage results with and without binary rescoring

                              Precision    Recall    F-Measure
    Hotels
      Phoebus (30%)           87.70        93.78     90.63
      No Binary Rescoring     75.44        81.82     78.50
      Phoebus (10%)           87.85        92.46     90.09
      No Binary Rescoring     73.49        78.40     75.86
    Comic
      Phoebus (30%)           87.49        95.46     91.30
      No Binary Rescoring     84.87        89.91     87.31
      Phoebus (10%)           85.35        93.18     89.09
      No Binary Rescoring     81.52        88.26     84.75
    Cars
      Phoebus (10%)           69.98        85.68     77.04
      No Binary Rescoring     39.78        48.77     43.82

4.2 Extraction Results

This section presents results that experimentally validate our approach to extracting the actual attributes embedded within the post. We also compare our approach to two other information extraction methods that rely on the structure or grammar of the posts. First, the experiments compare Phoebus against a baseline Conditional Random Field (CRF) (Lafferty et al., 2001) extractor. A Conditional Random Field is a probabilistic model that can label and segment data. In labeling tasks, such as Part-of-Speech tagging, CRFs outperform Hidden Markov Models and Maximum-Entropy Markov Models. Therefore, by representing the state-of-the-art probabilistic graphical model, they present a strong comparison to our approach to extraction. CRFs have also been used effectively for information extraction. For instance, CRFs have been used to combine information extraction and coreference resolution with good results (Wellner, McCallum, Peng, & Hay, 2004). These experiments use the SimpleTagger implementation of CRFs from the MALLET (McCallum, 2002) suite of text processing tools. Further, as stated in Section 3 on extraction, we also created a version of Phoebus that uses CRFs, which we call PhoebusCRF. PhoebusCRF uses the same extraction features (V_IE) as Phoebus using the SVM, such as the common score regular expressions and the string similarity metrics. We include PhoebusCRF to show that extraction in general can benefit from our reference set matching.

Second, the experiments compare Phoebus to Natural Language Processing (NLP) based extraction techniques. Since the posts are ungrammatical and have unreliable lexical characteristics, these NLP based systems are not expected to do as well on this type of data. The Amilcare system (Ciravegna, 2001), which uses shallow NLP for extraction, has been shown to outperform other symbolic systems in extraction tasks, and so we use Amilcare as the other system to compare against. Since Amilcare can exploit gazetteers for extra ...

Table 12: Field level extraction results: Hotels domain

    Attribute  System                  Recall   Precision  F-Measure  Frequency
    Area       Phoebus (30%)           83.73    84.76      84.23      ~580
               Phoebus (10%)           77.80    83.58      80.52
               PhoebusCRF (30%)        85.13    86.93      86.02
               PhoebusCRF (10%)        80.71    83.38      82.01
               SimpleTagger CRF (30%)  78.62    79.38      79.00
               Amilcare (30%)          64.78    71.59      68.01
    Date       Phoebus (30%)           85.41    87.02      86.21      ~700
               Phoebus (10%)           82.13    83.06      82.59
               PhoebusCRF (30%)        87.20    87.11      87.15
               PhoebusCRF (10%)        84.39    84.48      84.43
               SimpleTagger CRF (30%)  63.60    63.25      63.42
               Amilcare (30%)          86.18    94.10      89.97
    Name       Phoebus (30%)           77.27    75.18      76.21      ~750
               Phoebus (10%)           75.59    74.25      74.92
               PhoebusCRF (30%)        85.70    85.07      85.38
               PhoebusCRF (10%)        81.46    81.69      81.57
               SimpleTagger CRF (30%)  74.43    84.86      79.29
               Amilcare (30%)          58.96    67.44      62.91
    Price      Phoebus (30%)           93.06    98.38      95.65      ~720
               Phoebus (10%)           93.12    98.46      95.72
               PhoebusCRF (30%)        92.56    94.90      93.71
               PhoebusCRF (10%)        90.34    92.60      91.46
               SimpleTagger CRF (30%)  71.68    73.45      72.55
               Amilcare (30%)          88.04    91.10      89.54
    Star       Phoebus (30%)           97.39    97.01      97.20      ~730
               Phoebus (10%)           96.94    96.90      96.92
               PhoebusCRF (30%)        96.83    98.06      97.44
               PhoebusCRF (10%)        96.17    96.74      96.45
               SimpleTagger CRF (30%)  97.16    96.55      96.85
               Amilcare (30%)          95.58    97.35      96.46

Table 14: Field level extraction results: Cars domain

    Attribute  System                  Recall   Precision  F-Measure  Frequency
    Make       Phoebus (10%)           98.21    99.93      99.06      ~580
               PhoebusCRF (10%)        90.73    96.71      93.36
               SimpleTagger CRF (10%)  85.68    95.69      90.39
               Amilcare (10%)          97.58    91.76      94.57
    Model      Phoebus (10%)           92.61    96.67      94.59      ~620
               PhoebusCRF (10%)        84.58    94.10      88.79
               SimpleTagger CRF (10%)  78.76    91.21      84.52
               Amilcare (10%)          78.44    84.31      81.24
    Price      Phoebus (10%)           97.17    95.91      96.53      ~580
               PhoebusCRF (10%)        93.59    92.59      93.09
               SimpleTagger CRF (10%)  83.66    98.16      90.33
               Amilcare (10%)          90.06    91.27      90.28
    Trim       Phoebus (10%)           63.11    70.15      66.43      ~375
               PhoebusCRF (10%)        55.61    64.95      59.28
               SimpleTagger CRF (10%)  55.94    66.49      60.57
               Amilcare (10%)          27.21    53.99      35.94
    Year       Phoebus (10%)           88.48    98.23      93.08      ~600
               PhoebusCRF (10%)        85.54    96.44      90.59
               SimpleTagger CRF (10%)  91.12    76.78      83.31
               Amilcare (10%)          86.32    91.92      88.97

Table 15: Summary results for extraction showing the number of times each system had the statistically significant highest F-Measure for an attribute

    Domain   Num. of Max. F-Measures                              Total Attributes
             Phoebus   PhoebusCRF   Amilcare   SimpleTagger
    Hotel    1         3            1          0                  5
    Comic    2         1            0          0                  6
    Cars     5         0            0          0                  5
    All      8         4            1          0                  16

... in the reference set. This lends credibility to our claim earlier in the section that by training the system to extract all of the attributes, even those in the reference set, we can more accurately extract attributes not in the reference set, because we are training the system to identify what something is not. The overall performance of Phoebus validates this approach to semantic annotation. By infusing information extraction with the outside knowledge of reference sets, Phoebus is able to perform well across three different domains, each representative of a different type of source of posts: auction sites, Internet classifieds and forum/bulletin boards.

5. Discussion

The goal of this research is to produce relational data from unstructured and ungrammatical data sources so that they can be accurately queried and integrated with other sources. By representing the attributes embedded within a post with the standardized values from the reference set, we can support structural queries and integration. For instance, we can perform aggregate queries because we can now treat the data source as a relational database. Furthermore, we have standardized values for performing joins across data sources, a key for integration of multiple sources. These standardized values also aid in the cases where the post does not actually contain the attribute. For instance, in Table 1, two of the listings do not include the make "Honda." However, once matched to the reference set, they contain a standardized value for this attribute which can then be used for querying and integrating these posts. This is especially powerful since the posts never explicitly stated these attribute values. The reference set attributes also provide a solution for the cases where the extraction is extremely difficult. For example, none of the systems extracted the description attribute of the Comic domain well. However, if one instead considers the description attribute from the reference set, which is quantified by the record linkage results for the Comic domain, this yields an improvement of over 50% in the F-Measure for identifying the description for a post.

It may seem that using the reference set attributes for annotation is enough, since the values are already cleaned, and that extraction is unnecessary. However, this is not the case. For one thing, one may want to see the actual values entered for different attributes. For instance, a user might want to discover the most common spelling mistake or abbreviation for an attribute. Also, there are cases when the extraction results outperform the record linkage results. This happens because even if a post is matched to an incorrect member of the reference set, that incorrect member is most likely very close to the correct match, and so it can be used to correctly extract much of the information. For a strong example of this, consider the Cars domain. The F-measure for the record linkage results is not as good as that for the extraction results in this domain. This means most matches that were chosen incorrectly probably differ from the correct match by something small. For example, a true match could have the trim as "2 Door" while the incorrectly chosen match might have the trim "4 Door," but there would still be enough information, such as the rest of the trim tokens, the year, the make and the model, to correctly extract those different attributes from the post itself. By performing the extraction for the values from the post itself, we can overcome the mistakes of the record linkage step because we can still exploit most of the information in the incorrectly chosen reference set member.

... Edmunds classifies the Mazda Protege5 as a "wagon," while Kelly Blue Book (16) classifies it as a "hatchback." This seems to invalidate the idea that "wagon" is different in meaning from "hatchback." They appear to be simple synonyms, but this would remain unknown without the outside knowledge of Kelly Blue Book. More generally, one assumes that the reference set is a correct set of standardized values, but this is not an absolute truth. That is why the most meaningful reference sets are those that can be constructed from agreed-upon ontologies from the Semantic Web. For instance, a reference set derived from an ontology for cars created by all of the biggest automotive businesses should alleviate many of the issues in meaning, and a thesaurus scheme could work out the discrepancies introduced by the users, rather than the reference sets.

16. www.kbb.com

6. Related Work

Our research is driven by the principle that the cost of annotating documents for the Semantic Web should be free, that is, automatic and invisible to users (Hendler, 2001). Many researchers have followed this path, attempting to automatically mark up documents for the Semantic Web, as proposed here (Cimiano, Handschuh, & Staab, 2004; Dingli, Ciravegna, & Wilks, 2003; Handschuh, Staab, & Ciravegna, 2002; Vargas-Vera, Motta, Domingue, Lanzoni, Stutt, & Ciravegna, 2002). However, these systems rely on lexical information, such as part-of-speech tagging or shallow Natural Language Processing, to do their extraction/annotation (e.g., Amilcare, Ciravegna, 2001). This is not an option when the data is ungrammatical, like the post data. In a similar vein, there are systems such as ADEL (Lerman, Gazen, Minton, & Knoblock, 2004) which rely on the structure to identify and annotate records in Web pages. Again, the failure of the posts to exhibit structure makes this approach inappropriate. So, while there is a fair amount of work in automatic labeling, there is little emphasis on techniques that could label text that is both unstructured and ungrammatical.

Although the idea of record linkage is not new (Fellegi & Sunter, 1969) and is well studied even now (Bilenko & Mooney, 2003), most current research focuses on matching one set of records to another set of records based on their decomposed attributes. There is little work on matching data sets where one record is a single string composed of the other data set's attributes to match on, as is the case with posts and reference sets. The WHIRL system (Cohen, 2000) allows for record linkage without decomposed attributes, but as shown in Section 4.1, Phoebus outperforms WHIRL, since WHIRL relies solely on the vector-based cosine similarity between the attributes, while Phoebus exploits a larger set of features to represent both field and record level similarity. We note with interest the EROCS system (Chakaravarthy, Gupta, Roy, & Mohania, 2006) where the authors tackle the problem of linking full text documents with relational databases. The technique involves filtering out all non-nouns from the text, and then finding the matches in the database. This is an intriguing approach; interesting future work would involve performing a similar filtering for larger documents and then applying the Phoebus algorithm to match the remaining nouns to reference sets.

Using the reference set's attributes as normalized values is similar to the idea of data cleaning. However, most data cleaning algorithms assume tuple-to-tuple transformations ...

... about using ontology-based information extraction as a means to semantically annotate unstructured data such as car classifieds (Ding, Embley, & Liddle, 2006). However, in contrast to our work, the information extraction is performed by a keyword lookup into the ontology along with structural and contextual rules to aid the labeling. The ontology itself contains keyword misspellings and abbreviations, so that the lookup can be performed in the presence of noisy data. We believe the ontology-based extraction approach is less scalable than a record linkage type matching task because creating and maintaining the ontology requires extensive data engineering in order to encompass all possible common spelling mistakes and abbreviations. Further, if new data is added to the ontology, additional data engineering must be performed. In our work, we can simply add new tuples to our reference set. Lastly, in contrast to our work, this ontology based work assumes contextual and structural rules will apply, making an assumption about the data to extract from. In our work, we make no such assumptions about the structure of the text we are extracting from. Yet another interesting approach to information extraction using ontologies is the Textpresso system, which extracts data from biological text (Muller & Sternberg, 2004). This system uses a regular expression based keyword lookup to label tokens in some text based on the ontology. Once all tokens are labeled, Textpresso can perform "fact extraction" by extracting sequences of labeled tokens that fit a particular pattern, such as gene-allele reference associations. Although this system again uses a reference set for extraction, it differs in that it does a keyword lookup into the lexicon.

In recent work on learning efficient blocking schemes, Bilenko et al. (2006) developed a system for learning disjunctive normal form blocking schemes. However, they learn their schemes using a graphical set covering algorithm, while we use a version of the Sequential Covering Algorithm (SCA). There are also similarities between our BSL algorithm and work on mining association rules from transaction data (Agrawal, Imielinski, & Swami, 1993). Both algorithms discover propositional rules. Further, both algorithms use multiple passes over a data set to discover their rules. However, despite these similarities, the techniques really solve different problems. BSL generates a set of candidate matches with a minimal number of false positives. To do this, BSL learns conjunctions that are maximally specific (eliminating many false positives) and unions them together as a single disjunctive rule (to cover the different true positives). Since the conjunctions are maximally specific, BSL uses SCA underneath, which learns rules in a depth-first, general to specific manner (Mitchell, 1997). On the other hand, the work of mining association rules (Agrawal et al., 1993) looks for actual patterns in the data that represent some internal relationships. There may be many such relationships in the data that could be discovered, so this approach covers the data in a breadth-first fashion, selecting the set of rules at each iteration and extending them by appending to each a new possible item.

7. Conclusion

This article presents an algorithm for semantically annotating text that is ungrammatical and unstructured. Unstructured, ungrammatical sources contain much information, but cannot support structured queries. Our technique allows for more informative use of these sources. Using our approach, eBay agents could monitor the auctions looking for the best deals, or a user could find the average price of a four-star hotel in San Diego. Such semantic ...
References

Agichtein, E., & Ganti, V. (2004). Mining reference tables for automatic text segmentation. In Proceedings of the 10th ACM Conference on Knowledge Discovery and Data Mining, pp. 20-29. ACM Press.

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207-216. ACM Press.

Baxter, R., Christen, P., & Churches, T. (2003). A comparison of fast blocking methods for record linkage. In Proceedings of the 9th ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Identification, pp. 25-27.

Bellare, K., & McCallum, A. (2007). Learning extractors from unlabeled text using relevant databases. In Proceedings of the AAAI Workshop on Information Integration on the Web, pp. 10-16.

Bilenko, M., Kamath, B., & Mooney, R. J. (2006). Adaptive blocking: Learning to scale up record linkage and clustering. In Proceedings of the 6th IEEE International Conference on Data Mining, pp. 87-96.

Bilenko, M., & Mooney, R. J. (2003). Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining, pp. 39-48. ACM Press.

Borkar, V., Deshmukh, K., & Sarawagi, S. (2001). Automatic segmentation of text into structured records. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 175-186. ACM Press.

Califf, M. E., & Mooney, R. J. (1999). Relational learning of pattern-match rules for information extraction. In Proceedings of the 16th National Conference on Artificial Intelligence, pp. 328-334.

Chakaravarthy, V. T., Gupta, H., Roy, P., & Mohania, M. (2006). Efficiently linking text documents with relevant structured information. In Proceedings of the International Conference on Very Large Data Bases, pp. 667-678. VLDB Endowment.

Chaudhuri, S., Ganjam, K., Ganti, V., & Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 313-324. ACM Press.

Cimiano, P., Handschuh, S., & Staab, S. (2004). Towards the self-annotating web. In Proceedings of the 13th International Conference on World Wide Web, pp. 462-471. ACM Press.

Ciravegna, F. (2001). Adaptive information extraction from text by rule induction and generalisation. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, pp. 1251-1256.

...

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pp. 282-289. Morgan Kaufmann.

Lee, M.-L., Ling, T. W., Lu, H., & Ko, Y. T. (1999). Cleansing data for mining and warehousing. In Proceedings of the 10th International Conference on Database and Expert Systems Applications, pp. 751-760. Springer-Verlag.

Lerman, K., Gazen, C., Minton, S., & Knoblock, C. A. (2004). Populating the semantic web. In Proceedings of the AAAI Workshop on Advances in Text Extraction and Mining.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. English translation in Soviet Physics Doklady, 10(8), 707-710.

Mansuri, I. R., & Sarawagi, S. (2006). Integrating unstructured data into relational databases. In Proceedings of the International Conference on Data Engineering, p. 29. IEEE Computer Society.

McCallum, A. (2002). MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.

McCallum, A., Nigam, K., & Ungar, L. H. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the 6th ACM SIGKDD, pp. 169-178.

Michalowski, M., Thakkar, S., & Knoblock, C. A. (2005). Automatically utilizing secondary sources to align information across sources. AI Magazine, Special Issue on Semantic Integration, 26, 33-45.

Michelson, M., & Knoblock, C. A. (2005). Semantic annotation of unstructured and ungrammatical text. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp. 1091-1098.

Michelson, M., & Knoblock, C. A. (2006). Learning blocking schemes for record linkage. In Proceedings of the 21st National Conference on Artificial Intelligence.

Michelson, M., & Knoblock, C. A. (2007). Unsupervised information extraction from unstructured, ungrammatical data sources on the world wide web. International Journal of Document Analysis and Recognition (IJDAR), Special Issue on Noisy Text Analytics.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York.

Muller, H.-M., Kenny, E. E., & Sternberg, P. W. (2004). Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLoS Biology, 2(11).

Muslea, I., Minton, S., & Knoblock, C. A. (2001). Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1/2), 93-114.

...