Autonomously Semantifying Wikipedia
Fei Wu and Daniel S. Weld

Computer Science & Engineering Department, University of Washington, Seattle, WA, USA
{wufei, weld}@cs.washington.edu

ABSTRACT
Berners-Lee's compelling vision of a Semantic Web is hindered by a chicken-and-egg problem, which can be best solved by a bootstrapping ...


Figure 1: Sample Wikipedia infobox.

With over 1.7 million articles, Wikipedia is appropriately sized: big enough to provide a sufficient dataset, yet enough smaller than the full Web that a hundred-node cluster is unnecessary for corpus processing.

1.2 Overview

Our grand vision is a combination of autonomous and collaborative techniques for semantically marking up Wikipedia. Such a system would create or complete infoboxes by extracting information from the page, rationalize tags, merge replicated data using microformats, disambiguate links and add additional links where needed, engage humans to verify information as needed, and perhaps add new topic pages (e.g., by looking for proper nouns with no corresponding primary page, or perhaps by tracking news stories).

As a first step towards this vision we present KYLIN, a prototype which automates part of it. KYLIN looks for classes of pages with similar infoboxes, determines common attributes, creates training examples, learns CRF extractors, and runs them on each page, creating new infoboxes and completing others. KYLIN also automatically identifies missing links for proper nouns on each page, resolving each to a unique identifier. Experiments show that the performance of KYLIN is roughly comparable with manual labelling in terms of precision and recall. On one domain, it does even better.

2. GENERATING INFOBOXES

Many Wikipedia articles include infoboxes: concise, tabular summaries of the subject's attributes. For example, Figure 1 shows a sample infobox from the article on "Abbeville County," which was generated dynamically from the data shown in Figure 2. Because of their relational nature, infoboxes may be easily converted to semantic form, as shown by Auer and Lehmann's DBpedia [3]. Furthermore, for each class of objects, infoboxes and their templates implicitly define the most important and representative attributes; hence, infoboxes are valuable ontological resources.

Figure 2: Attribute/value data generating the infobox in Fig. 1.

In this section we explain how KYLIN automatically constructs and completes infoboxes. The basic idea is to use existing infoboxes as a source of training data with which to learn extractors for gathering more data. As shown in Figure 3, KYLIN's infobox generation module has three main components: preprocessor, classifier, and extractor.

The preprocessor performs several functions. First, it selects and refines infobox schemata, choosing relevant attributes. Secondly, it generates a dataset for training machine learners.

KYLIN trains two types of classifiers. The first type predicts whether a given Wikipedia article belongs to a certain class. The second type predicts whether a given sentence contains the value of a given attribute. If there are C classes and A attributes per class, KYLIN automatically learns C + A·C different classifiers.

Extractors are learned routines which actually identify and clip out the necessary attribute values. KYLIN learns one extractor per attribute per class; each extractor is a conditional random fields (CRF) model. Training data are taken from existing infoboxes as dictated by the predictions of the classifiers.

We explain the operation and performance of these modules below. But first we discuss the nature of the existing Wikipedia infoboxes, and why they are harder to use for training than might be expected.

2.1 Challenges for Infobox Completion

While infoboxes contain much valuable information, they suffer from several challenging problems:

Incompleteness: Since infobox and article text are kept separate in Wikipedia, existing infoboxes are manually created when human authors create or edit an article, a tedious and time-consuming process. As a result, many articles have no infoboxes, and the majority of infoboxes which do exist are incomplete. For example, in the "U.S. County" class less than 50% of the articles have an infobox. Still, in many classes there is plenty of data for training.

Inconsistency: The manual creation process is noisy, causing contradictions between the article text and the infobox summary. For example, when we manually checked a random sample of 50 infoboxes in the "U.S. County" class, we found that 16% contained one or more errors. We suspect that many of the errors are introduced when an author updates an article with a revised attribute value (e.g., population) and neglects to change both the text and the infobox, another effect of keeping infobox and text separate.

Figure 3: Architecture of KYLIN's infobox generator.

Figure 4: Usage percentage for attributes of the "U.S. County" infobox template.

Schema Drift: Since users are free to create or modify infobox templates, and since they typically create an article by copying parts (e.g., the infobox template) from a similar article, the infobox schema for a class of articles tends to evolve during the course of authoring. This leads to several problems: schema duplication, attribute duplication, and sparseness. As an example of schema duplication, note that four different templates, "U.S. County" (1428), "US County" (574), "Counties" (50) and "County" (19), are used to describe the same type of object. Similarly, multiple tags denote the same semantic attribute. For example, "Census Yr", "Census Estimate Yr", "Census Est." and "Census Year" all mean the same thing. Furthermore, many attributes are used very rarely. Figure 4 shows the percent usage for the attributes of the "U.S. County" infobox template; only 29% of the attributes are used by 30% or more of the articles, and only 46% of the attributes are used by at least 15% of the articles.

Type-free System: The design of Wikipedia is deliberately low-tech, to facilitate human generation of content, and infoboxes are no exception. In particular, there is no type system for infobox attributes. For example, the infobox for "King County, Washington" has a tuple binding the attribute "land area" to equal "2126 square miles" and another tuple defining "land area km" to be "5506 square km", despite the fact that one can be easily derived from the other. Clearly this simple approach bloats the schema and increases inconsistency; the similarity between these related attributes also increases the complexity of extraction.

Irregular Lists: List pages, which link to large numbers of similar articles, are a potential source of valuable type information. Unfortunately, because they are designed for human consumption, automated processing is difficult. For example, some list pages separate information in items, while others use tables with different schemas. Sometimes, lists are nested in an irregular, hierarchical manner, which greatly complicates extraction. For example, the "List of cities, towns, and villages in the United States" has an item called "Places in Florida", which in turn contains "List of counties in Florida."

Flattened Categories: While Wikipedia's category tag system seems promising as a source of ontological structure, it is so flat and quirky that its utility is low. Furthermore, many tags are purely administrative, e.g., "Articles to be merged since March 2007."

2.2 Preprocessor

The preprocessor is responsible for creating a training suite that can be used to learn extraction code for creating infoboxes. We divide this work into two functions: schema refinement and the construction of training datasets.

Schema Refinement: The previous section explained how collaborative authoring leads to infobox schema drift, resulting in the problems of schema duplication, attribute duplication and sparsity. Thus, a necessary prerequisite for generating good infoboxes for a given class is determining a uniform target schema. This can be viewed as an instance of the difficult problem of schema matching [11]. Clearly, many sophisticated techniques can be brought to bear, but we adopt a simple statistical approach for our prototype. KYLIN scans the Wikipedia corpus and selects all articles containing the exact name of the given infobox template. Next, it catalogs all attributes mentioned and selects the most common. Our current implementation restricts attention to attributes used in at least 15% of the articles, which yields plenty of training data.

Constructing Training Datasets: Next, the preprocessor constructs training datasets for use when learning classifiers and extractors. KYLIN iterates through the articles. For each article with an infobox mentioning one or more target attributes, KYLIN segments the document into sentences using the OpenNLP library [1]. Then, for each target attribute, KYLIN tries to find a unique, corresponding sentence in the article. The resulting labelled sentences form positive training examples for each attribute; other sentences form negative training examples.

Our current implementation uses several heuristics to match sentences to attributes. If an attribute's value is composed of several sub-values (e.g., "hub cities"), KYLIN splits them and processes each sub-value as follows:

1. For each internal hyperlink in the article and the infobox attributes, find its unique primary URI in Wikipedia (through a redirect page if necessary). For example, both "USA" and "United States of America" will be redirected to "United States". Replace the anchor text of the hyperlink with this identifier.

2. If the attribute value is mentioned by exactly one sentence in the article, use that sentence and the matching token as a training example.

3. If the value is mentioned by several sentences, KYLIN determines what percentage of the tokens in the attribute's name are in each sentence. If the sentence matching the highest percentage of tokens has at least 60% of these keywords, then it is selected as a positive training example. For example, if the current attribute is "Total Area: 123", then KYLIN might select the sentence "It has a total area of 123 square kms", because "123" and the tokens "total" and "area" are all contained in the sentence.

Unfortunately, there are several difficulties preventing us from getting a perfect training dataset. First, OpenNLP's sentence detector is imperfect. Second, the article may not even have a sentence which corresponds to an infobox attribute value. Third, we require exact value-matching between attribute values in the sentence and the infobox. While this strict heuristic ensures precision, it substantially lowers recall: the values given in many articles are incomplete or written differently than in the infobox. Together, these factors conspire to produce a rather incomplete dataset. (Alternatively, one can view our heuristics as explicitly preferring incompleteness over noise, a logical consequence of our choice of high-precision extraction over high-recall.) Fortunately, we are still able to train our learning algorithms effectively.

2.3 Classifying Documents & Sentences

KYLIN learns two types of classifiers. For each class of article being processed, a heuristic document classifier is used to recognize members of the class. For each target attribute within a class, a sentence classifier is trained in order to predict whether a given sentence is likely to contain the attribute's value.

Document Classifier: To accomplish autonomous infobox generation, KYLIN must first locate candidate articles for a given class, a familiar document classification problem. Wikipedia's manually-generated list pages, which gather concepts with similar properties, and category tags are highly informative features for this task. For example, the "List of U.S. counties in alphabetical order" points to 3099 items; furthermore, 68% of those items have additionally been tagged as "county" or "counties." Eventually, we will use lists and tags as features in a Naive Bayes, Maximum Entropy or SVM classifier, but as an initial baseline we used a simple, heuristic approach. First, KYLIN locates all list pages whose titles contain infobox class keywords. Second, KYLIN iterates through each page, retrieving the corresponding articles but ignoring tables. If the category tags of the retrieved article also contain infobox class keywords, KYLIN classifies the article as a member of the class. As shown in Section 4, our baseline document classifier achieves very high precision (98.5%) and reasonable recall (68.8%).

Sentence Classifier: It proves useful for KYLIN to be able to predict which attribute values, if any, are contained in a given sentence. This can be seen as a multi-class, multi-label text classification problem. To learn these classifiers, KYLIN uses the training set produced by the preprocessor (Section 2.2). For features, we seek a domain-independent set which is fast to compute; our current implementation uses the sentence's tokens and their part-of-speech (POS) tags as features. For our classifier, we employed a Maximum Entropy model [24] as implemented in Mallet [21], which predicts attribute labels in a probabilistic way, suitable for multi-class and multi-label classifications. (We also experimented with a Naive Bayes model, but its performance was slightly worse.) To decrease the impact of a noisy and incomplete training dataset, we employed bagging [6] rather than boosting [22], as recommended by [25].

2.4 Learning Extractors

Extracting attribute values from a sentence may be viewed as a sequential data-labelling problem. We use the features shown in Table 1. Conditional random fields (CRFs) [19] are a natural choice given their leading performance on this task; we use the Mallet [21] implementation. We were confronted with two interesting choices in extractor design, and both concerned the role of the sentence classifier. We also discuss the issue of multiple extractions.

Table 1: Feature sets used by the CRF extractor.

| Feature Description | Example |
| First token of sentence | "Hello world" |
| In first half of sentence | "Hello world" |
| In second half of sentence | "Hello world" |
| Starts with capital | Hawaii |
| Starts with capital, ends with period | Mr. |
| Single capital | A |
| All capital, ends with period | CORP. |
| Contains at least one digit | AB3 |
| Made up of two digits | 99 |
| Made up of four digits | 1999 |
| Contains a dollar sign | 20$ |
| Contains an underline symbol | km_square |
| Contains a percentage symbol | 20% |
| Stop word | the; a; of |
| Purely numeric | 1929 |
| Number type | 1932; 1,234; 5.6 |
| Part-of-speech tag | |
| Token itself | |
| NP chunking tag | |
| String normalization: capital to "A", lowercase to "a", digit to "1", others to "0" | TF-1 => AA01 |
| Part of anchor text | MachineLearning |
| Beginning of anchor text | MachineLearning |
| Previous tokens (window size 5) | |
| Following tokens (window size 5) | |
| Previous token anchored | MachineLearning |
| Next token anchored | MachineLearning |

Training Methodology: Recall that when producing training data for extractor-learning, the preprocessor uses a strict pairing model. Since this may cause numerous sentences to be incorrectly labelled as negative examples, KYLIN uses the sentence classifier to relabel some of the training data as follows. All sentences which were assigned to be negative training examples by the preprocessor are sent through the sentence classifier; if the classifier disagrees with the preprocessor (i.e., it labels them positive), then they are relabelled as positive examples.
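To make the feature design concrete, here is a minimal Python sketch of a few of the Table 1 surface features, including the string-normalization feature (capital to "A", lowercase to "a", digit to "1", others to "0"). The function names are hypothetical illustrations, not KYLIN's actual code, which uses Mallet's CRF implementation.

```python
import re

def normalize(token: str) -> str:
    """Table 1 string normalization: capital -> "A", lowercase -> "a",
    digit -> "1", anything else -> "0" (so "TF-1" becomes "AA01")."""
    return "".join(
        "A" if c.isupper() else
        "a" if c.islower() else
        "1" if c.isdigit() else "0"
        for c in token
    )

def token_features(token: str) -> dict:
    """A handful of the per-token surface features from Table 1."""
    return {
        "starts_with_capital": token[:1].isupper(),                        # Hawaii
        "single_capital": bool(re.fullmatch(r"[A-Z]", token)),             # A
        "all_capital_end_period": bool(re.fullmatch(r"[A-Z]+\.", token)),  # CORP.
        "contains_digit": any(c.isdigit() for c in token),                 # AB3
        "two_digits": bool(re.fullmatch(r"\d{2}", token)),                 # 99
        "four_digits": bool(re.fullmatch(r"\d{4}", token)),                # 1999
        "contains_percent": "%" in token,                                  # 20%
        "normalized": normalize(token),
    }
```

In a CRF, each such boolean or string feature would be emitted for the current token and for the neighboring tokens in a window (size 5 in Table 1), alongside POS and NP-chunking tags.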
| Class | County | Airline | Actor | University |
| Tagged | 1245 | 791 | 3819 | 4025 |
| Recall (%) | 98.1 | 85.3 | 41.3 | 50.3 |

Table 3: Estimated recall of the document classifier.

Figure 5: Precision vs. recall curves of infobox generation. The individual points correspond to the performance of Wikipedia users' manual editing.

Document Classification: We use sampling plus human labelling to estimate the precision and recall of the classifiers. We measure the precision of a class's classifier by taking a random sample of 50 pages which were predicted to be of that class and manually checking their accuracy. Table 2 lists the estimated precision for our four classes. On average, the classifiers achieve 98.5% precision.

To estimate recall, we introduce some notation, saying that an article is "tagged" with a class if it has had an infobox of that type manually created by a human author. We use the set of tagged pages as a sample from the universal set and count how many of them are identified by the classifier. Table 3 shows the detailed results; averaging uniformly over the four classes yields an average recall of 68.8%. This is quite good for our baseline implementation, and it seems likely that a machine-learning approach could result in substantially higher recall.

Note that there are some potential biases which might affect our estimates of precision and recall. First, as mentioned in Section 2.1, some list pages are challenging to exploit, and list page formatting varies a lot between different classes. Second, articles with user-added infobox classes tend to be on more popular topics, and these may have a greater chance of being included in list pages. This could lead to minor overestimation of KYLIN's recall. But we believe that these factors are small, and our estimates of precision and recall are accurate.

Infobox Attribute Extractor: In order to be useful as an autonomous system, KYLIN must be able to extract attribute values with very high precision. High recall is also good, but of less importance. Since our CRF extractor outputs a confidence score for its extraction, we can modulate the confidence threshold to control the precision/recall tradeoff, as shown in Figure 5. Interestingly, the precision/recall curves are extremely flat, which means the precision is rather stable with respect to the variation of recall. In practice, KYLIN is able to automatically tune the confidence threshold, based on training data provided by the preprocessor, for various precision/recall requirements. In order to reduce the need for human fact checking, one can set a high threshold (e.g., 0.99), boosting precision. A lower threshold (e.g., 0.3) extends recall substantially, at only a small cost in precision. In our next experiment, we use a fixed threshold of 0.6, which achieves both reasonable precision and recall for all classes.

| Class | People Pre. (%) | People Rec. (%) | System Pre. (%) | System Rec. (%) |
| County | 97.6 | 65.9 | 97.3 | 95.9 |
| Airline | 92.3 | 86.7 | 87.2 | 63.7 |
| Actor | 94.2 | 70.1 | 88.0 | 68.2 |
| University | 97.6 | 90.5 | 73.9 | 60.5 |

Table 4: Relative performance of people and KYLIN on infobox attribute extraction.

We now ask how KYLIN compares against strong competition: human authors. For each class, we randomly selected 50 articles with existing infobox templates. By manually extracting all attributes mentioned in the articles, we could check the performance of both the human authors and of KYLIN. The results are shown in Table 4. We were proud to see that KYLIN performs better on the "U.S. County" domain, mainly because its numeric attributes are relatively easy to extract. In this domain, KYLIN was able to successfully recover a number of values which had been neglected by humans. For the "Actor" and "Airline" domains, KYLIN performed slightly worse than people. And in the "University" domain, KYLIN performed rather badly, because of implicit references and the flexible type of language used in those articles. For example, KYLIN extracted "Dwight D. Eisenhower" as the president of "Columbia University" from the following sentence:

Former U.S. President Dwight D. Eisenhower served as President of the University.

Unfortunately, this is incorrect, because Eisenhower was a former president (as indicated somewhere else in the article) and is thus the incorrect value for the current president. Implicit expressions also lead to challenging extractions. For example, the article on "Binghamton University" individually describes the number of undergraduate and graduate students in each college and school. In order to correctly extract the total number of students, KYLIN would need to reason about disjoint sets and perform arithmetic, which is beyond the abilities of most textual entailment systems [9, 20], let alone one that scales to a Wikipedia-sized corpus.

Using the Sentence Classifier with the Attribute Extractor: Recall that KYLIN uses the sentence classifier to prune some of the negative training examples generated by the preprocessor before training the extractor. We also explored two ways of connecting the sentence classifier to the extractor: as a pipeline (where only sentences satisfying the classifier are sent to the extractor) or by feeding the extractor every sentence, but letting it use the classifier's output as a feature. In this experiment, we consider four possible configurations:

Relabel, Pipeline: uses the classifier's results to relabel the training dataset for the extractor, and uses a pipeline architecture.

Relabel, #Pipeline: also uses the classifier's results to relabel the training dataset for the extractor, but doesn't use a pipeline (instead it provides the classifier's output to the CRF extractor).

#Relabel, Pipeline: training examples are not relabelled, but the pipelined architecture is used.

#Relabel, #Pipeline: training examples are not relabelled and a pipeline is eschewed (the classifier's output is fed directly to the extractor).

Figure 6: Precision and recall curves with different policies for combining the sentence classifier and CRF extractor. The cross points correspond to Wikipedia users' manual editing.

Figure 6 shows the detailed results. In most cases the "Relabel, Pipeline" policy achieves the best performance. We draw the following observations:

Noise and incompleteness within the training dataset provided by the preprocessor make the CRF extractor unstable, and hamper its performance (especially recall) in most cases.

By using the classifier to refine the training dataset, many false negative training examples are pruned; this helps to enhance the CRF extractor's performance, especially in terms of robustness and recall.

The pipeline architecture improves precision in most cases by reducing the risk of false positive extractions on irrelevant sentences. But since fewer sentences are even given to the extractor, recall suffers.

4.2 Internal Link Generation

In this section we address the following questions: What are the precision and recall of KYLIN's link-generation method? How do the heuristic rules affect this performance? Can KYLIN figure out the right order in which to apply the heuristics via self-supervised learning?

The first question is easy. Sampling over 50 randomly generated pages, we found 1213 unique human-generated hyperlinks, 852 of which were anchored by proper nouns. The set of pages contained an additional 369 unique proper nouns that we judged deserving of a link. (This judgment is a bit subjective, and it is important to note that we are treating links as semantic structure whose objective is to uniquely specify the noun's meaning, not to facilitate a human's reading pleasure.) Thus, we see that human authors display 69.8% recall, presumably with near 100% precision. When KYLIN was asked to find links for the proper nouns left unlinked by humans, it generated 291 links, of which 261 were correct. This yields a link-generation precision of 89.7% and a human-level recall of 70.7%.

| Heuristic | Pre. (%) | Rec. (%) |
| MatchTitle | 100 | 0.2 |
| MatchAnchor | 92.8 | 9.0 |
| MatchUrl | 85.0 | 49.2 |
| Disambiguation | 38.5 | 1.8 |
| InTitle | 18.8 | 0.3 |
| InAnchor | 14.3 | 0.1 |

Table 5: Performance of various link-generation heuristics on existing links.

| Heuristic | Pre. (%) | Rec. (%) |
| MatchTitle | 100 | 4.06 |
| MatchAnchor | 97.9 | 12.7 |
| MatchUrl | 90.8 | 42.5 |
| Disambiguation | 62.5 | 4.07 |
| InTitle | 75.4 | 11.6 |
| InAnchor | 57.8 | 17.1 |

Table 6: Performance of various link-generation heuristics on new links.

The six heuristics listed in Section 3 form the core of KYLIN's link generator. Each of the heuristics sounds plausible, but as one might suspect, the correct order matters considerably. Interestingly, we can use the set of human-authored internal links as another training set and use it to choose the right order for the heuristics, another form of self-supervised learning. We first measure each heuristic's performance on the suite of user-added hyperlinks, and then order them by decreasing precision score. Table 5 shows each heuristic's precision on a random sample of 50 Wikipedia articles. The order determined by KYLIN matches well with our intuitions. To demonstrate that the ordering works well on as-yet unlinked noun phrases, we tested it on a manually-labelled sample of 50 articles (Table 6). Note that while the individual numbers have changed substantially, the relative order is stable.

The difference in quantitative precision/recall numbers is due to the distinct characteristics of the nouns in the datasets corresponding to Tables 5 and 6. Table 5 is based on existing anchor texts edited by users. A quick check of Wikipedia articles reveals that users seldom add links pointing to the current article, which leads to a performance decrease for the "MatchTitle" and "InTitle" heuristics. If the same NP appears several times, users tend to add a link only at its first occurrence, which affects the "MatchAnchor" and "InAnchor" heuristics. When users add a link to help disambiguate concepts, the concepts are usually "harder" than randomly picked noun phrases, which explains the lower performance of the "Disambiguation" heuristic in Table 5.

If human linking behavior is so different from KYLIN's exhaustive approach, one might question the utility of automatic link generation itself. We agree that KYLIN's links might not help human readers as much as those added by human authors, but our objective is to help programs, not people! By disambiguating numerous noun phrases through links to a unique identifier, we greatly simplify subsequent processing. Furthermore, we note that many of KYLIN's links are to pages which have not yet been linked in the article.

For the next experiment, we enumerated all 6! = 720 heuristic orderings and measured the precision and recall of the collection as a whole. Table 7 lists the performance of KYLIN's ordering as well as the best and worst orders. We can see there is little difference between KYLIN's ordering and the optimal one, and both of them perform more than 10% better than the worst ordering.

| Ordering | Pre. (%) | Rec. (%) |
| KYLIN | 89.7 | 70.7 |
| Best Order | 90.0 | 71.0 |
| Worst Order | 78.4 | 61.8 |

Table 7: Effect of different heuristic orders on link-generation performance.

5. RELATED WORK

We group related work into several categories: bootstrapping the Semantic Web, unsupervised information extraction, extraction from Wikipedia, and related Wikipedia-based systems.

Bootstrapping the Semantic Web: REVERE [17] aims to cross the chasm between structured and unstructured data by providing a platform to facilitate the authoring, querying and sharing of data. It relies on human effort to gain semantic data, while KYLIN is fully autonomous. DeepMiner [30] bootstraps domain ontologies for Semantic Web services from source websites. It extracts concepts and instances from semi-structured data on source interface and data pages, while KYLIN handles both semi-structured and unstructured data in Wikipedia. The SemTag and Seeker [10] systems perform automated semantic tagging of large corpora. They use the TAP knowledge base [27] as the standard ontology, and use it to match instances on the Web. In contrast, KYLIN doesn't assume any particular ontology, and tries to extract all desired semantic data within Wikipedia.

Unsupervised Information Extraction: Since the Web is large and highly heterogeneous, unsupervised and self-supervised learning is necessary for scaling. Several systems of this form have been proposed. MULDER [18] and AskMSR [7, 13] use the Web to answer questions, exploiting the fact that most important facts are stated multiple times in different ways, which licenses the use of simple syntactic processing. KNOWITALL [14] and TEXTRUNNER [4] use search engines to compute statistical properties enabling extraction. Each of these systems relies heavily on the Web's information redundancy. However, unlike the Web, Wikipedia has little redundancy: there is only one article for each unique concept. Instead of utilizing redundancy, KYLIN exploits Wikipedia's unique structure and the presence of user-tagged data to train machine learners.
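The self-supervised ordering experiment of Section 4.2 can be sketched in a few lines of Python. This is an illustrative sketch with hypothetical names (order_by_precision, link_noun_phrase are not KYLIN's actual code); it assumes each heuristic either proposes a link target or abstains by returning None.

```python
# Precision/recall (%) of each heuristic on existing human-added links (Table 5).
TABLE5 = {
    "MatchTitle": (100.0, 0.2),
    "MatchAnchor": (92.8, 9.0),
    "MatchUrl": (85.0, 49.2),
    "Disambiguation": (38.5, 1.8),
    "InTitle": (18.8, 0.3),
    "InAnchor": (14.3, 0.1),
}

def order_by_precision(stats):
    """Self-supervised ordering: sort heuristic names by decreasing
    precision as measured on the suite of user-added hyperlinks."""
    return sorted(stats, key=lambda name: stats[name][0], reverse=True)

def link_noun_phrase(np, ordered_names, heuristics):
    """First-match cascade: apply each heuristic in the learned order
    and return the first link target proposed, or None if none fires."""
    for name in ordered_names:
        target = heuristics[name](np)
        if target is not None:
            return target
    return None
```

Ordering by the Table 5 precisions yields MatchTitle, MatchAnchor, MatchUrl, Disambiguation, InTitle, InAnchor, which is the order whose cascade performance is within 0.3% of the best of all 720 orderings (Table 7).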
Information Extraction from Wikipedia: Several other systems have addressed information extraction from Wikipedia. Auer and Lehmann developed the DBpedia [3] system, which extracts information from existing infoboxes within articles and encapsulates it in a semantic form for query. In contrast, KYLIN populates infoboxes with new attribute values. Suchanek et al. describe the YAGO system [28], which extends WordNet using facts extracted from Wikipedia's category tags. But in contrast to KYLIN, which can learn to extract values for any attribute, YAGO only extracts values for a limited number of predefined relations. Nguyen et al. proposed to extract relations from Wikipedia by exploiting syntactic and semantic information [23]. Their work is the most similar to ours in the sense of stepping towards autonomously semantifying both semi-structured and unstructured data. However, there are still several obvious distinctions. First, their ...

8. REFERENCES

[1] http://opennlp.sourceforge.net/.
[2] S. F. Adafre and M. de Rijke. Discovering missing links in Wikipedia. In Proceedings of the 3rd International Workshop on Link Discovery at KDD-05, Chicago, USA, August 2005.
[3] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In ESWC, 2007.
[4] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open information extraction from the Web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.
[5] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, May 2001.
[6] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[7] E. Brill, S. Dumais, and M. Banko. An analysis of the AskMSR question-answering system. In Proceedings of EMNLP, 2002.
[8] C. L. A. Clarke, G. V. Cormack, and T. R. Lynam. Exploiting redundancy in question answering. In Proceedings of the 24th Annual International ACM SIGIR Conference, 2001.
[9] R. de Salvo Braz, R. Girju, V. Punyakanok, D. Roth, and M. Sammons. An inference model for semantic entailment in natural language. In National Conference on Artificial Intelligence (AAAI), pages 1678-1679, 2005.
[10] S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan, A. Tomkins, J. Tomlin, and J. Y. Zien. SemTag and Seeker: bootstrapping the Semantic Web via automated semantic annotation. In Proceedings of the 12th International World Wide Web Conference, pages 178-186, 2003.
[11] A. Doan and A. Halevy. Semantic integration research in the database community: A brief survey. AI Magazine, Special Issue on Semantic Integration, 2005.
[12] D. Downey, O. Etzioni, and S. Soderland. A probabilistic model of redundancy in information extraction. In Proceedings of IJCAI, 2005.
[13] S. Dumais, M. Banko, E. Brill, J. Lin, and A. Ng. Web question answering: Is more always better? In Proceedings of the 25th Annual International ACM SIGIR Conference, 2002.
[14] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. Unsupervised named-entity extraction from the Web: An experimental study. Artificial Intelligence, 165(1):91-134, 2005.
[15] E. Gabrilovich and S. Markovitch. Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In Proceedings of the 21st National Conference on Artificial Intelligence, pages 1301-1306, 2006.
[16] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, January 2007.
[17] A. Y. Halevy, O. Etzioni, A. Doan, Z. G. Ives, J. Madhavan, L. McDowell, and I. Tatarinov. Crossing the structure chasm. In Proceedings of CIDR, 2003.
[18] C. T. Kwok, O. Etzioni, and D. Weld. Scaling question answering to the Web. ACM Transactions on Information Systems (TOIS), 19(3):242-262, 2001.
[19] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, 2001.
[20] B. MacCartney and C. D. Manning. Natural logic for textual inference. In Workshop on Textual Entailment and Paraphrasing, ACL 2007, 2007.
[21] A. K. McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
[22] R. Meir and G. Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning, pages 118-183, 2003.
[23] D. P. Nguyen, Y. Matsuo, and M. Ishizuka. Exploiting syntactic and semantic information for relation extraction from Wikipedia. In IJCAI-07 TextLink Workshop, 2007.
[24] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In Proceedings of the IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999.
[25] D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, pages 169-198, 1999.
[26] S. P. Ponzetto and M. Strube. Deriving a large scale taxonomy from Wikipedia. In Proceedings of the 22nd National Conference on Artificial Intelligence, pages 1440-1445, 2007.
[27] E. Riloff and J. Shepherd. A corpus-based approach for building semantic lexicons. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117-124, Providence, RI, 1997.
[28] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge unifying WordNet and Wikipedia. In Proceedings of the 16th International Conference on World Wide Web, 2007.
[29] M. Völkel, M. Krötzsch, D. Vrandecic, H. Haller, and R. Studer. Semantic Wikipedia. In Proceedings of the 15th International Conference on World Wide Web, 2006.
[30] W. Wu, A. Doan, C. Yu, and W. Meng. Bootstrapping domain ontology for Semantic Web services from source web sites. In Proceedings of the VLDB-05 Workshop on Technologies for E-Services, 2005.