In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)

A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts

Michael Thelen and Ellen Riloff
School of Computing
University of Utah
Salt Lake City, UT 84112 USA
{thelenm, riloff}@cs.utah.edu

Abstract

This paper describes a bootstrapping algorithm called Basilisk that learns high-quality semantic lexicons for multiple categories. Basilisk begins with an unannotated corpus and seed words for each semantic category, which are then bootstrapped to learn new words for each category. Basilisk hypothesizes the semantic class of a word based on collective information over a large body of extraction pattern contexts. We evaluate Basilisk on six semantic categories. The semantic lexicons produced by Basilisk have higher precision than those produced by previous techniques, with several categories showing substantial improvement.

1 Introduction

In recent years, several algorithms have been developed to acquire semantic lexicons automatically or semi-automatically using corpus-based techniques. For our purposes, the term semantic lexicon will refer to a dictionary of words labeled with semantic classes (e.g., "bird" is an animal and "truck" is a vehicle). Semantic class information has proven to be useful for many natural language processing tasks, including information extraction (Riloff and Schmelzenbach, 1998; Soderland et al., 1995), anaphora resolution (Aone and Bennett, 1996), question answering (Moldovan et al., 1999; Hirschman et al., 1999), and prepositional phrase attachment (Brill and Resnik, 1994). Although some semantic dictionaries do exist (e.g., WordNet (Miller, 1990)), these resources often do not contain the specialized vocabulary and jargon that is needed for specific domains. Even for relatively general texts, such as the Wall Street Journal (Marcus et al., 1993) or terrorism articles (MUC-4 Proceedings, 1992), Roark and Charniak (Roark and Charniak, 1998) reported that 3 of every 5 terms generated by their semantic lexicon learner were not present in WordNet. These results suggest that automatic semantic lexicon acquisition could be used to enhance existing resources such as WordNet, or to produce semantic lexicons for specialized domains.

We have developed a weakly supervised bootstrapping algorithm called Basilisk that automatically generates semantic lexicons. Basilisk hypothesizes the semantic class of a word by gathering collective evidence about semantic associations from extraction pattern contexts. Basilisk also learns multiple semantic classes simultaneously, which helps constrain the bootstrapping process.

First, we present Basilisk's bootstrapping algorithm and explain how it differs from previous work on semantic lexicon induction. Second, we present empirical results showing that Basilisk outperforms a previous algorithm. Third, we explore the idea of learning multiple semantic categories simultaneously by adding this capability to Basilisk as well as another bootstrapping algorithm. Finally, we present results showing that learning multiple semantic categories simultaneously improves performance.

2 Bootstrapping using Collective Evidence from Extraction Patterns

Basilisk (Bootstrapping Approach to SemantIc Lexicon Induction using Semantic Knowledge) is a weakly supervised bootstrapping algorithm that automatically generates semantic lexicons. Figure 1 shows the high-level view of Basilisk's bootstrapping process. The input to Basilisk is an unannotated text corpus and a few manually defined seed words for each semantic category. Before bootstrapping begins, we run an extraction pattern learner over the corpus which generates patterns to extract every noun phrase in the corpus.
The bootstrapping process begins by selecting a subset of the extraction patterns that tend to extract the seed words. We call this the pattern pool. The nouns extracted by these patterns become candidates for the lexicon and are placed in a candidate word pool. Basilisk scores each candidate word by gathering all patterns that extract it and measuring how strongly those contexts are associated with words that belong to the semantic category. The five best candidate words are added to the lexicon, and the process starts over again. In this section, we describe Basilisk's bootstrapping algorithm in more detail and discuss related work.

Figure 1: Basilisk Algorithm (flow diagram: seed words initialize the semantic lexicon; the best patterns are selected into a pattern pool; extractions of the best patterns fill the candidate word pool; the best candidate words are added back to the semantic lexicon)

2.1 Basilisk

The input to Basilisk is a text corpus and a set of seed words. We generated seed words by sorting the words in the corpus by frequency and manually identifying the 10 most frequent nouns that belong to each category. These seed words form the initial semantic lexicon. In this section we describe the learning process for a single semantic category. In Section 3 we will explain how the process is adapted to handle multiple categories simultaneously.

To identify new lexicon entries, Basilisk relies on extraction patterns to provide contextual evidence that a word belongs to a semantic class. As our representation for extraction patterns, we used the AutoSlog system (Riloff, 1996). AutoSlog's extraction patterns represent linguistic expressions that extract a noun phrase in one of three syntactic roles: subject, direct object, or prepositional phrase object. For example, three patterns that would extract people are: "<subject> was arrested", "murdered <direct object>", and "collaborated with <object>". Extraction patterns represent linguistic contexts that often reveal the meaning of a word by virtue of syntax and lexical semantics. Extraction patterns are typically designed to capture role relationships. For example, consider the verb "robbed" when it occurs in the active voice. The subject of "robbed" identifies the perpetrator, while the direct object of "robbed" identifies the victim or target.

Before bootstrapping begins, we run AutoSlog exhaustively over the corpus to generate an extraction pattern for every noun phrase that appears. The patterns are then applied to the corpus and all of their extracted noun phrases are recorded. Figure 2 shows the bootstrapping process that follows, which we explain in the following sections.

  Figure 2: Basilisk's bootstrapping algorithm
  Generate all extraction patterns in the corpus and record their extractions.
  lexicon := seed words
  i := 0
  BOOTSTRAPPING
  1. Score all extraction patterns.
  2. pattern_pool := top-ranked 20+i patterns.
  3. candidate_word_pool := extractions of patterns in pattern_pool.
  4. Score candidate words in candidate_word_pool.
  5. Add top 5 candidate words to lexicon.
  6. i := i + 1.
  7. Go to Step 1.

2.1.1 The Pattern Pool and Candidate Pool

The first step in the bootstrapping process is to score the extraction patterns based on their tendency to extract known category members. All words that are currently defined in the semantic lexicon are considered to be category members. Basilisk scores each pattern using the RlogF metric that has been used for extraction pattern learning (Riloff, 1996). The score for each pattern is computed as:

  RlogF(pattern_i) = (F_i / N_i) * log2(F_i)    (1)

where F_i is the number of category members extracted by pattern_i and N_i is the total number of nouns extracted by pattern_i. Intuitively, the RlogF metric is a weighted conditional probability; a pattern receives a high score if a high percentage of its extractions are category members, or if a moderate percentage of its extractions are category members and it extracts a lot of them.

The top N extraction patterns are put into a pattern pool. Basilisk uses a value of N=20 for the first iteration, which allows a variety of patterns to be considered, yet is small enough that all of the patterns are strongly associated with the category. "Depleted" patterns are not included in this set. A pattern is depleted if all of its extracted nouns are already defined in the lexicon, in which case it has no unclassified words to contribute.
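To make the pattern-ranking step concrete, the following Python sketch implements the RlogF scoring and the growing pattern pool described above. It is an illustration rather than the paper's code: the corpus is assumed to be preprocessed into a dictionary mapping each extraction pattern to the set of head nouns it extracts, and the function names and data layout are assumptions made for the example.

import math
from typing import Dict, List, Set

def rlogf(extractions: Set[str], lexicon: Set[str]) -> float:
    """RlogF for one pattern (Equation 1): (F/N) * log2(F), where F is the
    number of its extracted nouns already in the category lexicon and N is the
    total number of distinct nouns it extracts."""
    n = len(extractions)
    f = len(extractions & lexicon)
    if f == 0 or n == 0:
        return float("-inf")  # no evidence that this pattern favors the category
    return (f / n) * math.log2(f)

def select_pattern_pool(patterns: Dict[str, Set[str]],
                        lexicon: Set[str],
                        iteration: int) -> List[str]:
    """Rank patterns by RlogF and keep the top 20 + i, skipping 'depleted'
    patterns whose extractions are all already defined in the lexicon."""
    usable = [p for p, ext in patterns.items() if ext - lexicon]
    ranked = sorted(usable, key=lambda p: rlogf(patterns[p], lexicon), reverse=True)
    return ranked[:20 + iteration]

Here `patterns` is the hypothetical precomputed map from a pattern (e.g., "was killed in <object>") to the head nouns it extracted anywhere in the corpus; patterns that extract no known category members sink to the bottom of the ranking.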
The purpose of the pattern pool is to narrow down the field of candidates for the lexicon. Basilisk collects all noun phrases (NPs) extracted by patterns in the pattern pool and puts the head noun of each NP into the candidate word pool. Only these nouns are considered for addition to the lexicon.

As the bootstrapping progresses, using the same value N=20 causes the candidate pool to become stagnant. For example, let's assume that Basilisk performs perfectly, adding only valid category words to the lexicon. After some number of iterations, all of the valid category members extracted by the top 20 patterns will have been added to the lexicon, leaving only non-category words left to consider. For this reason, the pattern pool needs to be infused with new patterns so that more nouns (extractions) become available for consideration. To achieve this effect, we increment the value of N by one after each bootstrapping iteration. This ensures that there is always at least one new pattern contributing words to the candidate word pool on each successive iteration.

2.1.2 Selecting Words for the Lexicon

The next step is to score the candidate words. For each word, Basilisk collects every pattern that extracted the word. All extraction patterns are used during this step, not just the patterns in the pattern pool. Initially, we used a scoring function that computes the average number of category members extracted by the patterns. The formula is:

  score(word_i) = (sum over j=1..P_i of F_j) / P_i    (2)

where P_i is the number of patterns that extract word_i, and F_j is the number of distinct category members extracted by pattern_j. A word receives a high score if it is extracted by patterns that also have a tendency to extract known category members.

As an example, suppose the word "Peru" is in the candidate word pool as a possible location. Basilisk finds all patterns that extract "Peru" and computes the average number of known locations extracted by those patterns. Let's assume that the three patterns shown below extract "Peru" and that the starred words are known locations. "Peru" would receive a score of (2+3+2)/3 = 2.3. Intuitively, this means that patterns that extract "Peru" also extract, on average, 2.3 known location words.

  "was killed in <object>"
  Extractions: Peru, clashes, a shootout, El Salvador*, Colombia*

  "<subject> was divided"
  Extractions: the country*, the Medellin cartel, Colombia*, Peru, the army, Nicaragua*

  "ambassador to <object>"
  Extractions: Nicaragua*, Peru, the UN, Panama*

Unfortunately, this scoring function has a problem. The average can be heavily skewed by one pattern that extracts a large number of category members. For example, suppose word_i is extracted by 10 patterns, 9 of which do not extract any category members but the tenth extracts 50 category members. The average number of category members extracted by these patterns will be 5. This is misleading because the only evidence linking word_i with the semantic category is a single, high-frequency extraction pattern (which may extract words that belong to other categories as well).

To alleviate this problem, we modified the scoring function to compute the average logarithm of the number of category members extracted by each pattern. The logarithm reduces the influence of any single pattern. We will refer to this scoring metric as the AvgLog function, which is defined below. Since log(1)=0, we add one to each frequency count so that patterns which extract a single category member contribute a positive value.

  AvgLog(word_i) = (sum over j=1..P_i of log2(F_j + 1)) / P_i    (3)

Using this scoring metric, all words in the candidate word pool are scored and the top five words are added to the semantic lexicon. The pattern pool and candidate word pool are then emptied, and the bootstrapping process starts over again.
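The AvgLog candidate scoring translates just as directly. The sketch below reuses the same assumed pattern-to-extractions map; `avg_log` and `best_candidates` are illustrative names for this example, not functions from the paper.

import math
from typing import Dict, List, Set

def avg_log(word: str, patterns: Dict[str, Set[str]], lexicon: Set[str]) -> float:
    """AvgLog (Equation 3): the mean of log2(F_j + 1) over every pattern j that
    extracts the word, where F_j counts the distinct lexicon members that
    pattern j extracts.  Every pattern is consulted, not just the pattern pool."""
    extractors = [ext for ext in patterns.values() if word in ext]
    if not extractors:
        return 0.0
    return sum(math.log2(len(ext & lexicon) + 1) for ext in extractors) / len(extractors)

def best_candidates(candidate_pool: Set[str], patterns: Dict[str, Set[str]],
                    lexicon: Set[str], k: int = 5) -> List[str]:
    """Return the k top-scoring candidate words; Basilisk adds five per iteration."""
    return sorted(candidate_pool,
                  key=lambda w: avg_log(w, patterns, lexicon),
                  reverse=True)[:k]

As in the "Peru" example, a candidate scores well only when the patterns that extract it also extract several known category members, and the logarithm keeps a single prolific pattern from dominating the average.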
2.1.3 Related Work

Several weakly supervised learning algorithms have previously been developed to generate semantic lexicons from text corpora. Riloff and Shepherd (Riloff and Shepherd, 1997) developed a bootstrapping algorithm that exploits lexical co-occurrence statistics, and Roark and Charniak (Roark and Charniak, 1998) refined this algorithm to focus more explicitly on certain syntactic structures. Hale, Ge, and Charniak (Ge et al., 1998) devised a technique to learn the gender of words. Caraballo (Caraballo, 1999) and Hearst (Hearst, 1992) created techniques to learn hypernym/hyponym relationships. None of these previous algorithms used extraction patterns or similar contexts to infer semantic class associations.

Several learning algorithms have also been developed for named entity recognition (e.g., (Collins and Singer, 1999; Cucerzan and Yarowsky, 1999)). (Collins and Singer, 1999) used contextual information of a different sort than we do. Furthermore, our research aims to learn general nouns (e.g., "artist") rather than proper nouns, so many of the features commonly used to great advantage for named entity recognition (e.g., capitalization and title words) are not applicable to our task.

The algorithm most closely related to Basilisk is meta-bootstrapping (Riloff and Jones, 1999), which also uses extraction pattern contexts for semantic lexicon induction. Meta-bootstrapping identifies a single extraction pattern that is highly correlated with a semantic category and then assumes that all of its extracted noun phrases belong to the same category. However, this assumption is often violated, which allows incorrect terms to enter the lexicon. Riloff and Jones acknowledged this issue and used a second level of bootstrapping (the "Meta" bootstrapping level) to alleviate this problem. While meta-bootstrapping trusts individual extraction patterns to make unilateral decisions, Basilisk gathers collective evidence from a large set of extraction patterns. As we will demonstrate in Section 2.2, Basilisk's approach produces better results than meta-bootstrapping and is also considerably more efficient because it uses only a single bootstrapping loop (meta-bootstrapping uses nested bootstrapping). However, meta-bootstrapping produces category-specific extraction patterns in addition to a semantic lexicon, while Basilisk focuses exclusively on semantic lexicon induction.

2.2 Single Category Results

To evaluate Basilisk's performance, we ran experiments with the MUC-4 corpus (MUC-4 Proceedings, 1992), which contains 1700 texts associated with terrorism. We used Basilisk to learn semantic lexicons for six semantic categories: building, event, human, location, time, and weapon. Before we ran these experiments, one of the authors manually labeled every head noun in the corpus that was found by an extraction pattern. These manual annotations were the gold standard. Table 1 shows the breakdown of semantic categories for the head nouns. These numbers represent a baseline: an algorithm that randomly selects words would be expected to get accuracies consistent with these numbers.

  Category    Total   Percentage
  building      188       2.2%
  event         501       5.9%
  human        1856      21.9%
  location     1018      12.0%
  time          112       1.3%
  weapon        147       1.7%
  other        4638      54.8%
  Table 1: Breakdown of semantic categories

Three semantic lexicon learners have previously been evaluated on the MUC-4 corpus (Riloff and Shepherd, 1997; Roark and Charniak, 1998; Riloff and Jones, 1999), and of these meta-bootstrapping achieved the best results. So we implemented the meta-bootstrapping algorithm ourselves to directly compare its performance with that of Basilisk. A difference between the original implementation and ours is that our version learns individual nouns (as does Basilisk) instead of noun phrases. We believe that learning individual nouns is a more conservative approach because noun phrases often overlap (e.g., "high-power bombs" and "incendiary bombs" would count as two different lexicon entries in the original meta-bootstrapping algorithm). Consequently, our meta-bootstrapping results differ from those reported in (Riloff and Jones, 1999).

Figure 3 shows the results for Basilisk (ba-1) and meta-bootstrapping (mb-1). We ran both algorithms for 200 iterations, so that 1000 words were added to the lexicon (5 words per iteration). The X axis shows the number of words learned, and the Y axis shows how many were correct.
The Y axes have different ranges because some categories are more prolific than others.

Figure 3: Basilisk and Meta-Bootstrapping Results, Single Category (one panel per category: Building, Event, Human, Location, Time, Weapon; X axis: Total Lexicon Entries; Y axis: Correct Lexicon Entries; curves: ba-1 and mb-1)

Basilisk outperforms meta-bootstrapping for every category, often substantially. For the human and location categories, Basilisk learned hundreds of words, with accuracies in the 80-89% range through much of the bootstrapping. It is worth noting that Basilisk's performance held up well on the human and location categories even at the end, achieving 79.5% (795/1000) accuracy for humans and 53.2% (532/1000) accuracy for locations.

3 Learning Multiple Semantic Categories Simultaneously

We also explored the idea of bootstrapping multiple semantic classes simultaneously. Our hypothesis was that errors of confusion between semantic categories can be lessened by using information about multiple categories. (We use the term confusion to refer to errors where a word is labeled as one category when it really belongs to another.) This hypothesis makes sense only if a word cannot belong to more than one semantic class. In general, this is not true because words are often polysemous. But within a limited domain, a word usually has a dominant word sense. Therefore we make a "one sense per domain" assumption (similar to the "one sense per discourse" observation (Gale et al., 1992)) that a word belongs to a single semantic category within a limited domain. All of our experiments involve the MUC-4 terrorism domain and corpus, for which this assumption seems appropriate.

3.1 Motivation

Figure 4 shows one way of viewing the task of semantic lexicon induction. The set of all words in the corpus is visualized as a search space. Each category owns a certain territory within the space (demarcated with a dashed line), representing the words that are true members of that category. Not all territories are the same size, since some categories have more members than others.

Figure 4: Bootstrapping a Single Category (diagram of the search space divided into category territories)

Figure 4 illustrates what happens when a semantic lexicon is generated for a single category. The seed words for the category (in this case, category C) are represented by the solid black area in category C's territory. The hypothesized words in the growing lexicon are represented by a shaded area. The goal of the bootstrapping algorithm is to expand the area of hypothesized words so that it exactly matches the category's true territory. If the shaded area expands beyond the category's true territory, then incorrect words have been added to the lexicon. In Figure 4, category C has claimed a significant number of words that belong to categories B and E. When generating a lexicon for one category at a time, these confusion errors are impossible to detect because the learner has no knowledge of the other categories.
Figure 5: Bootstrapping Multiple Categories (the same search-space diagram with all category lexicons growing at once)

Figure 5 shows the same search space when lexicons are generated for six categories simultaneously. If the lexicons cannot overlap, then we constrain the ability of a category to overstep its bounds. Category C is stopped when it begins to encroach upon the territories of categories B and E because words in those areas have already been claimed.

3.2 Simple Conflict Resolution

The easiest way to take advantage of multiple categories is to add simple conflict resolution that enforces the "one sense per domain" constraint. If more than one category tries to claim a word, then we use conflict resolution to decide which category should win. We incorporated a simple conflict resolution procedure into Basilisk, as well as the meta-bootstrapping algorithm. For both algorithms, the conflict resolution procedure works as follows. (1) If a word is hypothesized for category A but has already been assigned to category B during a previous iteration, then the category A hypothesis is discarded. (2) If a word is hypothesized for both category A and category B during the same iteration, then it is assigned to the category for which it receives the highest score. In Section 3.4, we will present empirical results showing how this simple conflict resolution scheme affects performance.

Figure 6: Basilisk, MCAT vs. 1CAT (one panel per category: Building, Event, Human, Location, Time, Weapon; X axis: Total Lexicon Entries; Y axis: Correct Lexicon Entries; curves: ba-M and ba-1)

3.3 A Smarter Scoring Function for Multiple Categories

Simple conflict resolution helps the algorithm recognize when it has encroached on another category's territory, but it does not actively steer the bootstrapping in a more promising direction. A more intelligent way to handle multiple categories is to incorporate knowledge about other categories directly into the scoring function. We modified Basilisk's scoring function to prefer words that have strong evidence for one category but little or no evidence for competing categories. Each word w_i in the candidate word pool receives a score for category c_a based on the following formula:

  diff(w_i, c_a) = AvgLog(w_i, c_a) - max over all other categories c_b of AvgLog(w_i, c_b)

where AvgLog is the candidate scoring function used previously by Basilisk (see Equation 3) and the max function returns the maximum AvgLog value over all competing categories. For example, the score for each candidate location word will be its AvgLog score for the location category minus its maximum AvgLog score for all other categories. A word is ranked highly only if it has a high score for the targeted category and there is little evidence that it belongs to a different category. This has the effect of steering the bootstrapping process away from ambiguous parts of the search space.
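The multi-category machinery of Sections 3.2 and 3.3 can be sketched on top of the same pieces. The code below reuses `avg_log` from the earlier sketch and assumes one lexicon per category; `diff_score` and `resolve_conflicts` are illustrative names for this example, not the authors' implementation.

from typing import Dict, Set

def diff_score(word: str, category: str,
               patterns: Dict[str, Set[str]],
               lexicons: Dict[str, Set[str]]) -> float:
    """diff(w, c_a) = AvgLog(w, c_a) minus the best AvgLog over competing categories.
    `lexicons` maps each category name to its current lexicon; avg_log is assumed
    to be in scope from the earlier sketch."""
    own = avg_log(word, patterns, lexicons[category])
    competing = [avg_log(word, patterns, lex)
                 for cat, lex in lexicons.items() if cat != category]
    return own - max(competing, default=0.0)

def resolve_conflicts(proposals: Dict[str, Dict[str, float]],
                      assigned: Dict[str, str]) -> Dict[str, str]:
    """Simple conflict resolution: a word assigned in an earlier iteration keeps
    its old category (rule 1); a word proposed by several categories in the same
    iteration goes to the highest-scoring one (rule 2).  `proposals` maps each
    newly hypothesized word to its per-category scores for this iteration."""
    winners = {}
    for word, scores in proposals.items():
        if word in assigned:
            continue
        winners[word] = max(scores, key=scores.get)
    return winners

With diff_score, a candidate that looks plausible for two categories at once receives a score near zero for both, which is what steers the bootstrapping away from ambiguous regions of the search space.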
Figure 7: Meta-Bootstrapping, MCAT vs. 1CAT (one panel per category: Building, Event, Human, Location, Time, Weapon; X axis: Total Lexicon Entries; Y axis: Correct Lexicon Entries; curves: mb-M and mb-1)

3.4 Multiple Category Results

We will use the abbreviation 1CAT to indicate that only one semantic category was bootstrapped, and MCAT to indicate that multiple semantic categories were simultaneously bootstrapped. Figure 6 compares the performance of Basilisk-MCAT with conflict resolution (ba-M) against Basilisk-1CAT (ba-1). Most categories show small performance gains, with the building, location, and weapon categories benefitting the most. However, the improvement usually doesn't kick in until many bootstrapping iterations have passed. This phenomenon is consistent with the visualization of the search space in Figure 5. Since the seed words for each category are not generally located near each other in the search space, the bootstrapping process is unaffected by conflict resolution until the categories begin to encroach on each other's territories.

Figure 7 compares the performance of Meta-Bootstrapping-MCAT with conflict resolution (mb-M) against Meta-Bootstrapping-1CAT (mb-1). Learning multiple categories improves the performance of meta-bootstrapping dramatically for most categories. We were surprised that the improvement for meta-bootstrapping was much more pronounced than for Basilisk. It seems that Basilisk was already doing a better job with errors of confusion, so meta-bootstrapping had more room for improvement.

Figure 8: MetaBoot-MCAT vs. Basilisk-MCAT vs. Basilisk-MCAT+ (one panel per category: Building, Event, Human, Location, Time, Weapon; X axis: Total Lexicon Entries; Y axis: Correct Lexicon Entries; curves: ba-M+, ba-M, and mb-M)

Finally, we evaluated Basilisk using the diff function to handle multiple categories. Figure 8 compares all three MCAT algorithms, with the smarter version of Basilisk labeled as ba-M+. Overall, this version of Basilisk performs best, showing a small improvement over the version with simple conflict resolution. Both multiple category versions of Basilisk also consistently outperform the multiple category version of meta-bootstrapping.

Table 2 summarizes the improvement of the best version of Basilisk over the original meta-bootstrapping algorithm. The left-hand column represents the number of words learned and each cell indicates how many of those words were correct. These results show that Basilisk produces substantially better accuracy and coverage than meta-bootstrapping.

  Category   Total Words   MetaBoot 1CAT   Basilisk MCAT+
  building       100         21 (21.0%)      39 (39.0%)
                 200         28 (14.0%)      72 (36.0%)
                 500         33  (6.6%)     100 (20.0%)
                 800         39  (4.9%)     109 (13.6%)
                1000         43  (4.3%)      n/a
  event          100         61 (61.0%)      61 (61.0%)
                 200         89 (44.5%)     114 (57.0%)
                 500        146 (29.2%)     186 (37.2%)
                 800        172 (21.5%)     240 (30.0%)
                1000        190 (19.0%)     266 (26.6%)
  human          100         36 (36.0%)      84 (84.0%)
                 200         53 (26.5%)     173 (86.5%)
                 500        143 (28.6%)     431 (86.2%)
                 800        224 (28.0%)     681 (85.1%)
                1000        278 (27.8%)     829 (82.9%)
  location       100         54 (54.0%)      84 (84.0%)
                 200         99 (49.5%)     175 (87.5%)
                 500        237 (47.4%)     371 (74.2%)
                 800        302 (37.8%)     509 (63.6%)
                1000        310 (31.0%)      n/a
  time           100          9  (9.0%)      30 (30.0%)
                 200         13  (6.5%)      33 (16.5%)
                 500         21  (4.2%)      37  (7.4%)
                 800         25  (3.1%)      43  (5.4%)
                1000         26  (2.6%)      45  (4.5%)
  weapon         100         23 (23.0%)      42 (42.0%)
                 200         24 (12.0%)      62 (31.0%)
                 500         29  (5.8%)      85 (17.0%)
                 800         33  (4.1%)      88 (11.0%)
                1000         33  (3.3%)      n/a
  Table 2: Lexicon Results

Figure 9 shows examples of words learned by Basilisk. Inspection of the lexicons reveals many unusual words that could be easily overlooked by someone building a dictionary by hand. For example, the words "deserter" and "narcoterrorists" appear in a variety of terrorism articles but they are not commonly used words in general.
  Figure 9: Example Semantic Lexicon Entries
  Building: theatre, store, cathedral, temple, palace, penitentiary, academy, houses, school, mansions
  Event: ambush, assassination, uprisings, sabotage, takeover, incursion, kidnappings, clash, shoot-out
  Human: boys, snipers, detainees, commandoes, extremists, deserter, narcoterrorists, demonstrators, cronies, missionaries
  Location: suburb, Soyapango, capital, Oslo, regions, cities, neighborhoods, Quito, corregimiento
  Time: afternoon, evening, decade, hour, March, weeks, Saturday, eve, anniversary, Wednesday
  Weapon: cannon, grenade, launchers, firebomb, car-bomb, rifle, pistol, machineguns, firearms

We also measured the recall of Basilisk's lexicons after 1000 words had been learned, based on the gold standard data shown in Table 1. The recall results range from 40-60%, which indicates that a good percentage of the category words are being found, although there are clearly more category words lurking in the corpus.

4 Conclusions

Basilisk's bootstrapping algorithm exploits two ideas: (1) collective evidence from extraction patterns can be used to infer semantic category associations, and (2) learning multiple semantic categories simultaneously can help constrain the bootstrapping process. The accuracy achieved by Basilisk is substantially higher than that of previous techniques for semantic lexicon induction on the MUC-4 corpus, and empirical results show that both of Basilisk's ideas contribute to its performance. We also demonstrated that learning multiple semantic categories simultaneously improves the meta-bootstrapping algorithm, which suggests that this is a general observation which may improve other bootstrapping algorithms as well.

5 Acknowledgments

This research was supported by the National Science Foundation under award IRI-9704240.

References

Chinatsu Aone and Scott William Bennett. 1996. Applying machine learning to anaphora resolution. In Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Understanding, pages 302-314. Springer-Verlag, Berlin.

E. Brill and P. Resnik. 1994. A Transformation-based Approach to Prepositional Phrase Attachment Disambiguation. In Proceedings of the Fifteenth International Conference on Computational Linguistics (COLING-94).

S. Caraballo. 1999. Automatic Acquisition of a Hypernym-Labeled Noun Hierarchy from Text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 120-126.
M. Collins and Y. Singer. 1999. Unsupervised Models for Named Entity Classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99).

S. Cucerzan and D. Yarowsky. 1999. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99).

W. Gale, K. Church, and David Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415-439.

Niyu Ge, John Hale, and Eugene Charniak. 1998. A statistical approach to anaphora resolution. In Proceedings of the Sixth Workshop on Very Large Corpora.

M. Hearst. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING-92).

Lynette Hirschman, Marc Light, Eric Breck, and John D. Burger. 1999. Deep Read: A reading comprehension system. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics.

M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.

George Miller. 1990. WordNet: An on-line lexical database. In International Journal of Lexicography.

Dan Moldovan, Sanda Harabagiu, Marius Pasca, Rada Mihalcea, Richard Goodrum, Roxana Girju, and Vasile Rus. 1999. Lasso: A tool for surfing the answer net. In Proceedings of the Eighth Text REtrieval Conference (TREC-8).

MUC-4 Proceedings. 1992. Proceedings of the Fourth Message Understanding Conference (MUC-4). Morgan Kaufmann, San Mateo, CA.

E. Riloff and R. Jones. 1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence.

E. Riloff and M. Schmelzenbach. 1998. An Empirical Approach to Conceptual Case Frame Acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49-56.

E. Riloff and J. Shepherd. 1997. A Corpus-Based Approach for Building Semantic Lexicons. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117-124.

E. Riloff. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1044-1049. The AAAI Press/MIT Press.

B. Roark and E. Charniak. 1998. Noun-phrase Co-occurrence Statistics for Semi-automatic Semantic Lexicon Construction. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, pages 1110-1116.

Stephen Soderland, David Fisher, Jonathan Aseltine, and Wendy Lehnert. 1995. CRYSTAL: Inducing a conceptual dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1314-1319.
