In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)

A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts

Michael Thelen and Ellen Riloff
School of Computing, University of Utah, Salt Lake City, UT 84112 USA
{thelenm, riloff}

Abstract

This paper describes a bootstrapping algorithm called Basilisk that learns high-quality semantic lexicons for multiple categories. Basilisk begins with an unannotated corpus and seed words for each semantic category, which are then bootstrapped to learn new words for each category. Basilisk hypothesizes the semantic class of a word based on collective information over a large body of extraction pattern contexts. We evaluate Basilisk on six semantic categories. The semantic lexicons produced by Basilisk have higher precision than those produced by previous techniques, with several categories showing substantial improvement.

1 Introduction

In recent years, several algorithms have been developed to acquire semantic lexicons automatically or semi-automatically using corpus-based techniques. For our purposes, the term semantic lexicon will refer to a dictionary of words labeled with semantic classes (e.g., "bird" is an animal and "truck" is a vehicle). Semantic class information has proven to be useful for many natural language processing tasks, including information extraction (Riloff and Schmelzenbach, 1998; Soderland et al., 1995), anaphora resolution (Aone and Bennett, 1996), question answering (Moldovan et al., 1999; Hirschman et al., 1999), and prepositional phrase attachment (Brill and Resnik, 1994). Although some semantic dictionaries do exist (e.g., WordNet (Miller, 1990)), these resources often do not contain the specialized vocabulary and jargon that is needed for specific domains. Even for relatively general texts, such as the Wall Street Journal (Marcus et al., 1993) or terrorism articles (MUC-4 Proceedings, 1992), Roark and Charniak (Roark and Charniak, 1998) reported that 3 of every 5 terms generated by their semantic lexicon learner were not present in WordNet. These results suggest that automatic semantic lexicon acquisition could be used to enhance existing resources such as WordNet, or to produce semantic lexicons for specialized domains.

We have developed a weakly supervised bootstrapping algorithm called Basilisk that automatically generates semantic lexicons. Basilisk hypothesizes the semantic class of a word by gathering collective evidence about semantic associations from extraction pattern contexts. Basilisk also learns multiple semantic classes simultaneously, which helps constrain the bootstrapping process.

First, we present Basilisk's bootstrapping algorithm and explain how it differs from previous work on semantic lexicon induction. Second, we present empirical results showing that Basilisk outperforms a previous algorithm. Third, we explore the idea of learning multiple semantic categories simultaneously by adding this capability to Basilisk as well as another bootstrapping algorithm. Finally, we present results showing that learning multiple semantic categories simultaneously improves performance.

2 Bootstrapping using Collective Evidence from Extraction Patterns

Basilisk (Bootstrapping Approach to SemantIc Lexicon Induction using Semantic Knowledge) is a weakly supervised bootstrapping algorithm that automatically generates semantic lexicons. Figure 1 shows the high-level view of Basilisk's bootstrapping process. The input to Basilisk is an unannotated text corpus and a few manually defined seed words for each semantic category. Before bootstrapping begins, we run an extraction pattern learner over the corpus which generates patterns to extract every noun phrase in the corpus.

The bootstrapping process begins by selecting a subset of the extraction patterns that tend to extract the seed words. We call this the pattern pool. The nouns extracted by these patterns become candidates for the lexicon and are placed in a candidate word pool. Basilisk scores each candidate word by gathering all patterns that extract it and measuring how strongly those contexts are associated with words that belong to the semantic category. The five best candidate words are added to the lexicon, and the process starts over again. In this section, we describe Basilisk's bootstrapping algorithm in more detail and discuss related work.
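The loop just described (score patterns, pool the best ones, score the nouns they extract, add the five best words, repeat) can be illustrated with a short Python sketch. This is a reconstruction for exposition only, not the authors' implementation: the two scoring functions are simplified stand-ins for the RlogF and AvgLog metrics the paper defines later, and the data structures are invented for illustration.

```python
# Illustrative sketch of Basilisk's bootstrapping loop (a reconstruction for
# exposition, not the authors' code). `extractions` maps each extraction
# pattern to the set of head nouns it extracts from the corpus. The two
# scoring functions below are simplified stand-ins for the paper's RlogF
# (pattern scoring) and AvgLog (word scoring) metrics.

def score_pattern(extracted, lexicon):
    # Stand-in: fraction of a pattern's extractions that are known members.
    return len(extracted & lexicon) / len(extracted) if extracted else 0.0

def score_word(word, extractions, lexicon):
    # Stand-in: total known members co-extracted with `word`, summed over
    # all patterns that extract it (the paper averages log counts instead).
    return sum(len(nouns & lexicon)
               for nouns in extractions.values() if word in nouns)

def basilisk(extractions, seed_words, iterations=5, words_per_iter=5):
    lexicon = set(seed_words)          # seed words form the initial lexicon
    for i in range(iterations):
        # 1-2. Score all patterns; pattern pool = top 20 + i patterns.
        ranked = sorted(extractions,
                        key=lambda p: score_pattern(extractions[p], lexicon),
                        reverse=True)
        pattern_pool = ranked[:20 + i]
        # 3. Candidate pool: nouns extracted by the pool, minus known words.
        candidates = set().union(*(extractions[p] for p in pattern_pool))
        candidates -= lexicon
        # 4-5. Score candidates and add the best few to the lexicon.
        best = sorted(candidates,
                      key=lambda w: score_word(w, extractions, lexicon),
                      reverse=True)
        lexicon.update(best[:words_per_iter])
    return lexicon
```

On a toy corpus where a candidate such as "Peru" is co-extracted with known locations, this loop pulls that candidate into the lexicon ahead of unrelated nouns, mirroring the behavior the paper describes.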
[Figure 1: Basilisk Algorithm. Seed words initialize the semantic lexicon; the best patterns are selected into a pattern pool; extractions of the best patterns populate the candidate word pool; the best candidate words are added back to the lexicon.]

2.1 Basilisk

The input to Basilisk is a text corpus and a set of seed words. We generated seed words by sorting the words in the corpus by frequency and manually identifying the 10 most frequent nouns that belong to each category. These seed words form the initial semantic lexicon. In this section we describe the learning process for a single semantic category. In Section 3 we will explain how the process is adapted to handle multiple categories simultaneously.

To identify new lexicon entries, Basilisk relies on extraction patterns to provide contextual evidence that a word belongs to a semantic class. As our representation for extraction patterns, we used the AutoSlog system (Riloff, 1996). AutoSlog's extraction patterns represent linguistic expressions that extract a noun phrase in one of three syntactic roles: subject, direct object, or prepositional phrase object. For example, three patterns that would extract people are: "<subject> was arrested", "murdered <direct object>", and "collaborated with <pp object>". Extraction patterns represent linguistic contexts that often reveal the meaning of a word by virtue of syntax and lexical semantics. Extraction patterns are typically designed to capture role relationships. For example, consider the verb "robbed" when it occurs in the active voice. The subject of "robbed" identifies the perpetrator, while the direct object of "robbed" identifies the victim or target.

Before bootstrapping begins, we run AutoSlog exhaustively over the corpus to generate an extraction pattern for every noun phrase that appears. The patterns are then applied to the corpus and all of their extracted noun phrases are recorded. Figure 2 shows the bootstrapping process that follows, which we explain in the following sections.

  Generate all extraction patterns in the corpus and record their extractions.
  lexicon := {seed words}
  i := 0
  BOOTSTRAPPING
  1. Score all extraction patterns
  2. pattern_pool := top ranked 20+i patterns
  3. candidate_word_pool := extractions of patterns in pattern_pool
  4. Score candidate words in candidate_word_pool
  5. Add top 5 candidate words to lexicon
  6. i := i + 1
  7. Go to Step 1.

Figure 2: Basilisk's bootstrapping algorithm

2.1.1 The Pattern Pool and Candidate Pool

The first step in the bootstrapping process is to score the extraction patterns based on their tendency to extract known category members. All words that are currently defined in the semantic lexicon are considered to be category members. Basilisk scores each pattern using the RlogF metric that has been used for extraction pattern learning (Riloff, 1996). The score for each pattern is computed as:

  RlogF(pattern_i) = (F_i / N_i) * log2(F_i)    (1)

where F_i is the number of category members extracted by pattern_i and N_i is the total number of nouns extracted by pattern_i. Intuitively, the RlogF metric is a weighted conditional probability; a pattern receives a high score if a high percentage of its extractions are category members, or if a moderate percentage of its extractions are category members and it extracts a lot of them.

The top N extraction patterns are put into a pattern pool. Basilisk uses a value of N=20 for the first iteration, which allows a variety of patterns to be considered, yet is small enough that all of the patterns are strongly associated with the category. "Depleted" patterns are not included in this set. A pattern is depleted if all of its extracted nouns are already defined in the lexicon, in which case it has no unclassified words to contribute.
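To make Equation (1) concrete, the RlogF computation can be sketched as below. Only the formula itself comes from the paper; the noun sets and the tie-breaking return value for patterns with no known members are invented for illustration.

```python
import math

def rlogf(extracted_nouns, lexicon):
    """RlogF metric of Equation (1): (F/N) * log2(F), where F is the number
    of extracted nouns that are known category members and N is the total
    number of nouns the pattern extracts."""
    n = len(extracted_nouns)
    f = len(set(extracted_nouns) & lexicon)
    if f == 0 or n == 0:
        return float("-inf")  # no known members: rank the pattern last
    return (f / n) * math.log2(f)

# Two patterns with the same hit rate (2/3 vs. 4/6): the one that extracts
# more category members overall receives the higher score.
locations = {"colombia", "nicaragua", "el salvador", "panama"}
assert rlogf(["colombia", "nicaragua", "clashes"], locations) < \
       rlogf(["colombia", "nicaragua", "panama", "el salvador",
              "clashes", "the army"], locations)
```

This matches the intuition stated above: with the hit rate F/N held fixed, the log2(F) factor rewards patterns that extract many category members.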
The purpose of the pattern pool is to narrow down the field of candidates for the lexicon. Basilisk collects all noun phrases (NPs) extracted by patterns in the pattern pool and puts the head noun of each NP into the candidate word pool. Only these nouns are considered for addition to the lexicon.

As the bootstrapping progresses, using the same value N=20 causes the candidate pool to become stagnant. For example, let's assume that Basilisk performs perfectly, adding only valid category words to the lexicon. After some number of iterations, all of the valid category members extracted by the top 20 patterns will have been added to the lexicon, leaving only non-category words left to consider. For this reason, the pattern pool needs to be infused with new patterns so that more nouns (extractions) become available for consideration. To achieve this effect, we increment the value of N by one after each bootstrapping iteration. This ensures that there is always at least one new pattern contributing words to the candidate word pool on each successive iteration.

2.1.2 Selecting Words for the Lexicon

The next step is to score the candidate words. For each word, Basilisk collects every pattern that extracted the word. All extraction patterns are used during this step, not just the patterns in the pattern pool. Initially, we used a scoring function that computes the average number of category members extracted by the patterns. The formula is:

  score(word_i) = (sum_{j=1..P_i} F_j) / P_i    (2)

where P_i is the number of patterns that extract word_i, and F_j is the number of distinct category members extracted by pattern_j. A word receives a high score if it is extracted by patterns that also have a tendency to extract known category members.

As an example, suppose the word "Peru" is in the candidate word pool as a possible location. Basilisk finds all patterns that extract "Peru" and computes the average number of known locations extracted by those patterns. Let's assume that the three patterns shown below extract "Peru" and that the words marked with * are known locations. "Peru" would receive a score of (2+3+2)/3 = 2.3. Intuitively, this means that patterns that extract "Peru" also extract, on average, 2.3 known location words.

  "was killed in <x>"  Extractions: Peru, clashes, a shootout, El Salvador*, Colombia*
  "<x> was divided"  Extractions: the country*, the Medellin cartel, Colombia*, Peru, the army, Nicaragua*
  "ambassador to <x>"  Extractions: Nicaragua*, Peru, the UN, Panama*

Unfortunately, this scoring function has a problem. The average can be heavily skewed by one pattern that extracts a large number of category members. For example, suppose word_i is extracted by 10 patterns, 9 of which do not extract any category members but the tenth extracts 50 category members. The average number of category members extracted by these patterns will be 5. This is misleading because the only evidence linking word_i with the semantic category is a single, high-frequency extraction pattern (which may extract words that belong to other categories as well).

To alleviate this problem, we modified the scoring function to compute the average logarithm of the number of category members extracted by each pattern. The logarithm reduces the influence of any single pattern. We will refer to this scoring metric as the AvgLog function, which is defined below. Since log(1)=0, we add one to each frequency count so that patterns which extract a single category member contribute a positive value.

  AvgLog(word_i) = (sum_{j=1..P_i} log2(F_j + 1)) / P_i    (3)

Using this scoring metric, all words in the candidate word pool are scored and the top five words are added to the semantic lexicon. The pattern pool and candidate word pool are then emptied, and the bootstrapping process starts over again.

2.1.3 Related Work

Several weakly supervised learning algorithms have previously been developed to generate semantic lexicons from text corpora. Riloff and Shepherd (Riloff and Shepherd, 1997) developed a bootstrapping algorithm that exploits lexical co-occurrence statistics, and Roark and Charniak (Roark and Charniak, 1998) refined this algorithm to focus more explicitly on certain syntactic structures. Hale, Ge, and Charniak (Ge et al., 1998) devised a technique to learn the gender of words. Caraballo (Caraballo, 1999) and Hearst (Hearst, 1992) created techniques to learn hypernym/hyponym relationships. None of these previous algorithms used extraction patterns or similar contexts to infer semantic class associations.

Several learning algorithms have also been developed for named entity recognition (e.g., (Collins
and Singer, 1999; Cucerzan and Yarowsky, 1999)). (Collins and Singer, 1999) used contextual information of a different sort than we do. Furthermore, our research aims to learn general nouns (e.g., "artist") rather than proper nouns, so many of the features commonly used to great advantage for named entity recognition (e.g., capitalization and title words) are not applicable to our task.

The algorithm most closely related to Basilisk is meta-bootstrapping (Riloff and Jones, 1999), which also uses extraction pattern contexts for semantic lexicon induction. Meta-bootstrapping identifies a single extraction pattern that is highly correlated with a semantic category and then assumes that all of its extracted noun phrases belong to the same category. However, this assumption is often violated, which allows incorrect terms to enter the lexicon. Riloff and Jones acknowledged this issue and used a second level of bootstrapping (the "meta" bootstrapping level) to alleviate this problem. While meta-bootstrapping trusts individual extraction patterns to make unilateral decisions, Basilisk gathers collective evidence from a large set of extraction patterns. As we will demonstrate in Section 2.2, Basilisk's approach produces better results than meta-bootstrapping and is also considerably more efficient because it uses only a single bootstrapping loop (meta-bootstrapping uses nested bootstrapping). However, meta-bootstrapping produces category-specific extraction patterns in addition to a semantic lexicon, while Basilisk focuses exclusively on semantic lexicon induction.

2.2 Single Category Results

To evaluate Basilisk's performance, we ran experiments with the MUC-4 corpus (MUC-4 Proceedings, 1992), which contains 1700 texts associated with terrorism. We used Basilisk to learn semantic lexicons for six semantic categories: building, event, human, location, time, and weapon. Before we ran these experiments, one of the authors manually labeled every head noun in the corpus that was found by an extraction pattern. These manual annotations were the gold standard. Table 1 shows the breakdown of semantic categories for the head nouns. These numbers represent a baseline: an algorithm that randomly selects words would be expected to get accuracies consistent with these numbers.

Three semantic lexicon learners have previously been evaluated on the MUC-4 corpus (Riloff and Shepherd, 1997; Roark and Charniak, 1998; Riloff and Jones, 1999), and of these, meta-bootstrapping achieved the best results. So we implemented the meta-bootstrapping algorithm ourselves to directly

  Category    Total   Percentage
  building      188      2.2%
  event         501      5.9%
  human        1856     21.9%
  location     1018     12.0%
  time          112      1.3%
  weapon        147      1.7%
  other        4638     54.8%

Table 1: Breakdown of semantic categories

compare its performance with that of Basilisk. A difference between the original implementation and ours is that our version learns individual nouns (as does Basilisk) instead of noun phrases. We believe that learning individual nouns is a more conservative approach because noun phrases often overlap (e.g., "high-power bombs" and "incendiary bombs" would count as two different lexicon entries in the original meta-bootstrapping algorithm). Consequently, our meta-bootstrapping results differ from those reported in (Riloff and Jones, 1999).

Figure 3 shows the results for Basilisk (ba-1) and meta-bootstrapping (mb-1). We ran both algorithms for 200 iterations, so that 1000 words were added to the lexicon (5 words per iteration). The X axis shows the number of words learned, and the Y axis shows how many were correct. The Y axes have different ranges because some categories are more prolific than others. Basilisk outperforms meta-bootstrapping for every category, often substantially. For the human and location categories, Basilisk learned hundreds of words, with accuracies in the 80-89% range through much of the bootstrapping. It is worth noting that Basilisk's performance held up well on the human and location categories even at the end, achieving 79.5% (795/1000) accuracy for humans and 53.2% (532/1000) accuracy for locations.

3 Learning Multiple Semantic Categories Simultaneously

We also explored the idea of bootstrapping multiple semantic classes simultaneously. Our hypothesis was that errors of confusion between semantic categories can be lessened by using information about multiple categories. This hypothesis makes sense only if a word cannot belong to more than one semantic class. In general, this is not true because words are often polysemous. But within a limited domain, a word usually has a dominant word sense. Therefore we make a "one sense per domain" assumption (similar
to the "one sense per discourse" observation (Gale et al., 1992)) that a word belongs to a single semantic category within a limited domain. (We use the term confusion to refer to errors where a word is labeled as category A when it really belongs to category B.) All of our experiments involve the MUC-4 terrorism domain and corpus, for which this assumption seems appropriate.

[Figure 3: Basilisk and Meta-Bootstrapping Results, Single Category. Six plots (Building, Event, Human, Location, Time, Weapon), each showing correct lexicon entries (Y) against total lexicon entries (X, 0-1000) for ba-1 and mb-1.]

3.1 Motivation

Figure 4 shows one way of viewing the task of semantic lexicon induction. The set of all words in the corpus is visualized as a search space. Each category owns a certain territory within the space (demarcated with a dashed line), representing the words that are true members of that category. Not all territories are the same size, since some categories have more members than others.
[Figure 4: Bootstrapping a Single Category. The word search space is divided into territories for several categories (B, C, D, E, F).]

Figure 4 illustrates what happens when a semantic lexicon is generated for a single category. The seed words for the category (in this case, category C) are represented by the solid black area in category C's territory. The hypothesized words in the growing lexicon are represented by a shaded area. The goal of the bootstrapping algorithm is to expand the area of hypothesized words so that it exactly matches the category's true territory. If the shaded area expands beyond the category's true territory, then incorrect words have been added to the lexicon. In Figure 4, category C has claimed a significant number of words that belong to categories B and E. When generating a lexicon for one category at a time, these confusion errors are impossible to detect because the learner has no knowledge of the other categories.

[Figure 5: Bootstrapping Multiple Categories. The same search space when lexicons are generated for six categories simultaneously.]

Figure 5 shows the same search space when lexicons are generated for six categories simultaneously. If the lexicons cannot overlap, then we constrain the ability of a category to overstep its bounds. Category C is stopped when it begins to encroach upon the territories of categories B and E because words in those areas have already been claimed.

3.2 Simple Conflict Resolution

The easiest way to take advantage of multiple categories is to add simple conflict resolution that enforces the "one sense per domain" constraint. If more than one category tries to claim a word, then we use conflict resolution to decide which category should win. We incorporated a simple conflict resolution procedure into Basilisk, as well as the meta-bootstrapping algorithm. For both algorithms, the conflict resolution procedure works as follows. (1) If a word is hypothesized for category A but has already been assigned to category B during a previous iteration, then the category A hypothesis is discarded. (2) If a word is hypothesized for both category A and category B during the same iteration, then it

[Figure 6: Basilisk, MCAT vs. 1CAT. Six plots comparing ba-M against ba-1 on the Building, Event, Human, Location, Time, and Weapon categories.]

is assigned to the category for which it receives the highest score. In Section 3.4, we will present empirical results showing how this simple conflict resolution scheme affects performance.

3.3 A Smarter Scoring Function for Multiple Categories

Simple conflict resolution helps the algorithm recognize when it has encroached on another category's territory, but it does not actively steer the bootstrapping in a more promising direction. A more intelligent way to handle multiple categories is to incorporate knowledge about other categories directly into the scoring function. We modified Basilisk's scoring function to prefer words that have strong evidence for one category but little or no evidence for competing categories. Each word w_i in the candidate word pool receives a score for category c_a based on the following formula:

  diff(w_i, c_a) = AvgLog(w_i, c_a) - max_{b != a}(AvgLog(w_i, c_b))

where AvgLog is the candidate scoring function used previously by Basilisk (see Equation 3) and the max function returns the maximum AvgLog value over all competing categories. For example, the score for each candidate location word will be its AvgLog score for the location category minus its maximum AvgLog score for all other categories. A word is ranked highly only if it has a high score for the
targeted category and there is little evidence that it belongs to a different category. This has the effect of steering the bootstrapping process away from ambiguous parts of the search space.

3.4 Multiple Category Results

We will use the abbreviation 1CAT to indicate that only one semantic category was bootstrapped, and MCAT to indicate that multiple semantic categories were simultaneously bootstrapped. Figure 6 compares the performance of Basilisk-MCAT with conflict resolution (ba-M) against Basilisk-1CAT (ba-1). Most categories show small performance gains, with the building, location, and weapon categories benefitting the most. However, the improvement usually doesn't kick in until many bootstrapping iterations have passed. This phenomenon is consistent with the visualization of the search space in Figure 5. Since the seed words for each category are not generally located near each other in the search space, the bootstrapping process is unaffected by conflict resolution until the categories begin to encroach on each other's territories.

[Figure 7: Meta-Bootstrapping, MCAT vs. 1CAT. Six plots comparing mb-M against mb-1 on the Building, Event, Human, Location, Time, and Weapon categories.]

Figure 7 compares the performance of Meta-Bootstrapping-MCAT with conflict resolution (mb-M) against Meta-Bootstrapping-1CAT (mb-1). Learning multiple categories improves the performance of meta-bootstrapping dramatically for most categories. We were surprised that the improvement for meta-bootstrapping was much

[Figure 8: MetaBoot-MCAT vs. Basilisk-MCAT vs. Basilisk-MCAT+. Six plots comparing ba-M+, ba-M, and mb-M on the Building, Event, Human, Location, Time, and Weapon categories.]

more pronounced than for Basilisk. It seems that Basilisk was already doing a better job with errors of confusion, so meta-bootstrapping had more room for improvement.

Finally, we evaluated Basilisk using the diff function to handle multiple categories. Figure 8 compares all three MCAT algorithms, with the smarter version of Basilisk labeled as ba-M+. Overall, this version of Basilisk performs best, showing a small improvement over the version with simple conflict resolution. Both multiple category versions of Basilisk also consistently outperform the multiple category version of meta-bootstrapping.

Table 2 summarizes the improvement of the best version of Basilisk over the original meta-bootstrapping algorithm. The left-hand column represents the number of words learned and each cell indicates how many of those words were correct. These results show that Basilisk produces substantially better accuracy and coverage than meta-bootstrapping. Figure 9 shows examples of words learned by Basilisk. Inspection of the lexicons reveals many unusual words that could be easily overlooked by someone building a dictionary by hand. For example, the words "deserter" and "narcoterrorists" appear in a variety of terrorism articles but they are not commonly used words in general.

We also measured the recall of Basilisk's lexicons after 1000 words had been learned, based on the gold

  Category   Total Words   MetaBoot 1CAT   Basilisk MCAT+
  building       100        21 (21.0%)       39 (39.0%)
               200        28 (14.0%)       72 (36.0%)
               500        33 ( 6.6%)      100 (20.0%)
               800        39 ( 4.9%)      109 (13.6%)
              1000        43 ( 4.3%)      n/a
  event          100        61 (61.0%)       61 (61.0%)
               200        89 (44.5%)      114 (57.0%)
               500       146 (29.2%)      186 (37.2%)
               800       172 (21.5%)      240 (30.0%)
              1000       190 (19.0%)      266 (26.6%)
  human          100        36 (36.0%)       84 (84.0%)
               200        53 (26.5%)      173 (86.5%)
               500       143 (28.6%)      431 (86.2%)
               800       224 (28.0%)      681 (85.1%)
              1000       278 (27.8%)      829 (82.9%)
  location       100        54 (54.0%)       84 (84.0%)
               200        99 (49.5%)      175 (87.5%)
               500       237 (47.4%)      371 (74.2%)
               800       302 (37.8%)      509 (63.6%)
              1000       310 (31.0%)      n/a
  time           100         9 ( 9.0%)       30 (30.0%)
               200        13 ( 6.5%)       33 (16.5%)
               500        21 ( 4.2%)       37 ( 7.4%)
               800        25 ( 3.1%)       43 ( 5.4%)
              1000        26 ( 2.6%)       45 ( 4.5%)
  weapon         100        23 (23.0%)       42 (42.0%)
               200        24 (12.0%)       62 (31.0%)
               500        29 ( 5.8%)       85 (17.0%)
               800        33 ( 4.1%)       88 (11.0%)
              1000        33 ( 3.3%)      n/a

Table 2: Lexicon Results

standard data shown in Table 1. The recall results range from 40-60%, which indicates that a good percentage of the category words are being found, although there are clearly more category words lurking in the corpus.

4 Conclusions

Basilisk's bootstrapping algorithm exploits two ideas: (1) collective evidence from extraction patterns can be used to infer semantic category associations, and (2) learning multiple semantic categories simultaneously can help constrain the bootstrapping process. The accuracy achieved by Basilisk is substantially higher than that of previous techniques for semantic lexicon induction on the MUC-4 corpus, and empirical results show that both of Basilisk's ideas contribute to its performance. We also demonstrated that learning multiple semantic categories simultaneously improves the meta-bootstrapping algorithm, which suggests that this is a general observation which may improve other bootstrapping algorithms as well.

  Building: theatre, store, cathedral, temple, palace, penitentiary, academy, houses, school, mansions
  Event: ambush, assassination, uprisings, sabotage, takeover, incursion, kidnappings, clash, shoot-out
  Human: boys, snipers, detainees, commandoes, extremists, deserter, narcoterrorists, demonstrators, cronies, missionaries
  Location: suburb, Soyapango, capital, Oslo, regions, cities, neighborhoods, Quito, corregimiento
  Time: afternoon, evening, decade, hour, March, weeks, Saturday, eve, anniversary, Wednesday
  Weapon: cannon, grenade, launchers, firebomb, car-bomb, rifle, pistol, machineguns, firearms

Figure 9: Example Semantic Lexicon Entries

5 Acknowledgments

This research was supported by the National Science Foundation under award IRI-9704240.

References

Chinatsu Aone and Scott William Bennett. 1996. Applying machine learning to anaphora resolution. In Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Understanding, pages 302-314. Springer-Verlag, Berlin.

E. Brill and P. Resnik. 1994. A Transformation-based Approach to Prepositional Phrase Attachment Disambiguation. In Proceedings of the Fifteenth International Conference on Computational Linguistics (COLING-94).

S. Caraballo. 1999. Automatic Acquisition of a Hypernym-Labeled Noun Hierarchy from Text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 120-126.

M. Collins and Y. Singer. 1999. Unsupervised Models for Named Entity Classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99).

S. Cucerzan and D. Yarowsky. 1999. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99).

W. Gale, K. Church, and David Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415-439.

Niyu Ge, John Hale, and Eugene Charniak. 1998. A statistical approach to anaphora resolution. In Proceedings of the Sixth Workshop on Very Large Corpora.

M. Hearst. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING-92).

Lynette Hirschman, Marc Light, Eric Breck, and John D. Burger. 1999. Deep Read: A reading comprehension system. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics.

M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.

George Miller. 1990. WordNet: An on-line lexical database. International Journal of Lexicography.

Dan Moldovan, Sanda Harabagiu, Marius Pasca, Rada Mihalcea, Richard Goodrum, Roxana Girju, and Vasile Rus. 1999. Lasso: A tool for surfing the answer net. In Proceedings of the Eighth Text REtrieval Conference (TREC-8).

MUC-4 Proceedings. 1992. Proceedings of the Fourth Message Understanding Conference (MUC-4). Morgan Kaufmann, San Mateo, CA.

E. Riloff and R. Jones. 1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence.

E. Riloff and M. Schmelzenbach. 1998. An Empirical Approach to Conceptual Case Frame Acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49-56.

E. Riloff and J. Shepherd. 1997. A Corpus-Based Approach for Building Semantic Lexicons. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117-124.

E. Riloff. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1044-1049. The AAAI Press/MIT Press.

B. Roark and E. Charniak. 1998. Noun-phrase Co-occurrence Statistics for Semi-automatic Semantic Lexicon Construction. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, pages 1110-1116.

Stephen Soderland, David Fisher, Jonathan Aseltine, and Wendy Lehnert. 1995. CRYSTAL: Inducing a conceptual dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1314-1319.