Finding Advertising Keywords on Web Pages

Wen-tau Yih, Microsoft Research, 1 Microsoft Way, Redmond, WA 98052, scottyih@microsoft.com
Joshua Goodman, Microsoft Research, 1 Microsoft Way, Redmond, WA 98052, joshuago@microsoft.com
Vitor R. Carvalho, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, vitor@cs.cmu.edu

ABSTRACT

A large and growing number of web pages display contextual advertising based on keywords automatically extracted from the text of the page, and this is a substantial source of revenue supporting the web today. Despite the importance of this area, little formal, published research exists. We describe a system that learns how to extract keywords from web pages for advertisement targeting. The system uses a number of features, such as term frequency of each potential keyword, inverse document frequency, presence in meta data, and how often the term occurs in search query logs. The system is trained with a set of example pages that have been hand-labeled with "relevant" keywords. Based on this training, it can then extract new keywords from previously unseen pages. Accuracy is substantially better than several baseline systems.

Categories and Subject Descriptors: H.3.1 [Content Analysis and Indexing]: Abstracting methods; H.4.m [Information Systems]: Miscellaneous

General Terms: Algorithms, experimentation

Keywords: keyword extraction, information extraction, advertising

1. INTRODUCTION

Content-targeted advertising systems, such as Google's AdSense program and Yahoo's Contextual Match product, are becoming an increasingly important part of the funding for free web services. These programs automatically find relevant keywords on a web page, and then display advertisements based on those keywords. In this paper, we systematically analyze techniques for determining which keywords are relevant. We demonstrate that a learning-based technique using TFIDF features, web page meta data, and, most surprisingly, information from query log files, substantially outperforms competing methods, and even approaches human levels of performance by at least one measure.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2006, May 23-26, 2006, Edinburgh, Scotland. ACM 1-59593-323-9/06/0005.

Typical content-targeted advertising systems analyze a web page, such as a blog, a news page, or another source of information, to find prominent keywords on that page. These keywords are then sent to an advertising system, which matches the keywords against a database of ads. Advertising appropriate to the keyword is displayed to the user. Typically, if a user clicks on the ad, the advertiser is charged a fee, most of which is given to the web page owner, with a portion kept by the advertising service.

Picking appropriate keywords helps users in at least two ways. First, choosing appropriate keywords can lead to users seeing ads for products or services they would be interested in purchasing. Second, the better targeted the advertising, the more revenue that is earned by the web page provider, and thus the more interesting the applications that can be supported. For instance, free blogging services and free email accounts with large amounts of storage are both enabled by good advertising systems.

From the perspective of the advertiser, it is even more important to pick good keywords. For most areas of research, such as speech recognition, a 10% improvement leads to better products, but the increase in revenue is usually much smaller. For keyword selection, however, a 10% improvement might actually lead to nearly a 10% higher click-through rate, directly increasing potential revenue and profit.

In this paper, we systematically investigated several different aspects of keyword extraction. First, we compared looking at each occurrence of a word or phrase in a document separately, versus combining all of our information about the word or phrase. We also compared approaches that look at the word or phrase monolithically to approaches that decompose a phrase into separate words. Second, we examined a wide variety of information sources, analyzing which sources were most helpful. These included various meta tags, title information, and even the words in the URL of the page. One surprisingly useful source of information was query frequency information from MSN Search query logs. That is, knowing the overall query frequency of a particular word or phrase on a page was helpful in determining if that word or phrase was relevant to that page.

We compared these approaches to several different baseline approaches, including a traditional TFIDF model; a model using TF and IDF features but with learned weights; and the KEA system [7]. KEA is also a machine learning system, but with a simpler learning mechanism and fewer features. As we will show, our system is substantially better than any of these baseline systems. We also compared our system to the maximum achievable given the human labeling, and found that on one measure, our system was in the same range as human levels of performance.
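A traditional TFIDF baseline of the kind compared against above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name, token handling, and corpus statistics below are invented for the example:

```python
import math
from collections import Counter

def tfidf_rank(doc_tokens, doc_freq, num_docs, top_n=10):
    """Rank candidate keywords in one document by TF x IDF.

    doc_tokens: list of tokens appearing on the page
    doc_freq:   dict mapping term -> number of corpus documents containing it
    num_docs:   total number of documents in the corpus
    """
    tf = Counter(doc_tokens)

    def score(term):
        # Smoothed IDF: terms occurring in nearly every document
        # (e.g. "the") get a near-zero or negative weight.
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
        return tf[term] * idf

    return sorted(tf, key=score, reverse=True)[:top_n]

# Hypothetical page tokens and corpus statistics:
ranked = tfidf_rank(
    ["digital", "camera", "review", "the", "the", "digital", "camera"],
    doc_freq={"the": 1000, "digital": 40, "camera": 30, "review": 200},
    num_docs=1000,
)
# Rare-but-frequent-on-page terms outrank the ubiquitous "the".
```

As the paper notes, this unsupervised score is the weakest baseline; the learned systems described next replace the fixed TF x IDF combination with trained weights over many more features.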
2. SYSTEM ARCHITECTURE

In this section, we introduce the general architecture of our keyword extraction system, which consists of the following four stages: preprocessor, candidate selector, classifier, and postprocessor.

2.1 Preprocessor

The main purpose of the preprocessor is to transform an HTML document into an easy-to-process plain-text based document, while still maintaining important information. In particular, we want to preserve the blocks in the original HTML document, but remove the HTML tags. For example, text in the same table should be placed together without tags like <table>, <tr>, or <td>. We also preserve information about which phrases are part of the anchor text of the hypertext links. The meta section of an HTML document header is also an important source of useful information, even though most of the fields except the title are not displayed by web browsers.

The preprocessor first parses an HTML document, and returns blocks of text in the body, hypertext information, and meta information in the header.¹ Because a keyword should not cross sentence boundaries, we apply a sentence splitter to separate text in the same block into various sentences. To evaluate whether linguistic information can help keyword extraction, we also apply a state-of-the-art part-of-speech (POS) tagger [6], and record the POS tag of each word. In addition, we have observed that most words or phrases that are relevant are short noun phrases. Therefore, having this information available as a feature would potentially be useful. We thus applied a state-of-the-art chunker to detect the base noun phrases in each document [15].²

¹ We used "Beautiful Soup", which is a python library for HTML parsing. Beautiful Soup can be downloaded from http://www.crummy.com/software/BeautifulSoup/
² These natural language processing tools can be downloaded from http://l2r.cs.uiuc.edu/cogcomp

2.2 Candidate Selector

Our system considers each word or phrase (consecutive words) up to length 5 that appears in the document as a candidate keyword. This includes all keywords that appear in the title section, or in meta tags, as well as words and phrases in the body. As mentioned previously, a phrase is not selected as a candidate if it crosses sentence or block boundaries. This strategy not only eliminates many trivial errors but also speeds up the processing time by considering fewer keyword candidates.

Each phrase can be considered separately, or can be combined with all other occurrences of the same phrase in the same document. In addition, phrases can be considered monolithically, or can be decomposed into their constituent words. Putting together these possibilities, we ended up considering three different candidate selectors.

2.2.1 Monolithic, Separate (MoS)

In the Monolithic Separate candidate selector, fragments that appear in different document locations are considered as different candidates even if their content is identical. That is, if the phrase "digital camera" occurred once in the beginning of the document, and once in the end, we would consider them as two separate candidates, with potentially different features. While some features, such as the phrase length and TF values, would all be the same for all of these candidates, others, such as whether the phrase was capitalized, would be different. We call this variation Separate. In this candidate selector, all features of a candidate phrase are based on the phrase as a whole. For example, term frequency counts the number of times the exact phrase occurs in the document, rather than using the term frequency of individual words. We refer to this as Monolithic. To simplify our description, we use MoS to refer to this design.

2.2.2 Monolithic, Combined (MoC)

Since we only care about the ranked list of keywords, not where the keywords are extracted, we can reduce the number of candidates by combining identical (case ignored) fragments. For instance, "Weather report" in the title and "weather report" in the body are treated as only one candidate. We use MoC to refer to this design. Note that even in the Combined case, word order matters; e.g., the phrase "weather report" is treated as different than the phrase "report weather."

2.2.3 Decomposed, Separate (DeS)

Keyword extraction seems closely related to well-studied areas like information extraction, named-entity recognition and phrase labeling: they all attempt to find important phrases in documents. State-of-the-art information extraction systems (e.g., [5, 20]), named-entity recognition systems (e.g., [21]), and phrase labeling systems (e.g., [3]) typically decompose phrases into individual words, rather than examining them monolithically. It thus seemed worthwhile to examine similar techniques for keyword extraction. Decomposing a phrase into its individual words might have certain advantages. For instance, if the phrase "pet store" occurred only once in a document, but the phrases "pet" and "store" each occurred many times separately, such a decomposed approach would make it easy to use this knowledge.

Instead of selecting phrases directly as candidates, the decomposed approach tries to assign a label to each word in a document, as is done in related fields. That is, each of the words in a document is selected as a candidate, with multiple possible labels. The labels can be B (beginning of a keyphrase, when the following word is also part of the keyphrase), I (inside a keyphrase, but not the first or last word), L (last word of a keyphrase), U (unique word of a keyword of length 1), and finally O (outside any keyword or keyphrase). This word-based framework requires a multi-class classifier to assign these 5 labels to a candidate word. In addition, it also needs a somewhat more sophisticated inference procedure to construct a ranked list of keywords in the postprocessing stage. The details will be described in Section 2.4. We use DeS to refer to this design.

2.3 Classifier

The core of our keyword extraction system is a classifier trained using machine learning. When a monolithic framework (MoS or MoC) is used, we train a binary classifier. Given a phrase candidate, the classifier predicts whether the word or phrase is a keyword or not. When a decomposed framework (DeS) is used, we train a multi-class classifier. Given a word, the goal is to predict the label as B, I, L, U, or O. Since whether a phrase is a keyword is ambiguous by nature, instead of a hard prediction, we actually need the classifier to predict how likely it is that a candidate has a particular label. In other words, the classifier needs to output some kind of confidence scores or probabilities. The scores or probabilities can then be used later to generate a ranked list of keywords, given a document. We introduce the learning algorithm and the features we used as follows.

2.3.1 Learning Algorithm

We used logistic regression as the learning algorithm in all experiments reported here. Logistic regression models are also called maximum entropy models, and are equivalent to single layer neural networks that are trained to minimize entropy. In related experiments on an email corpus, we tried other learning algorithms, such as linear support-vector machines, decision trees, and naive Bayes, but in every case we found that logistic regression was equally good or better than the other learning algorithms for this type of task.

Formally, we want to predict an output variable Y, given a set of input features, X. In the monolithic framework, Y would be 1 if a candidate phrase is a relevant keyword, and 0 otherwise. X would be a vector of feature values associated with a particular candidate. For example, the vector might include the distance from the start of the document, the term frequency (TF) and the document frequency (DF) values for the phrase. The model returns the estimated probability, P(Y=1 | X=x). The logistic regression model learns a vector of weights, w, one for each input feature in X. Continuing our example, it might learn a weight for the TF value, a weight for the DF value, and a weight for the distance value. The actual probability returned is

    P(Y=1 | X=x) = exp(x . w) / (1 + exp(x . w))

An interesting property of this model is its ability to simulate some simpler models. For instance, if we take the logarithms of the TF and DF values as features, then if the corresponding weights are +1 and -1, we end up with

    log(TF) - log(DF) = log(TF/DF)

In other words, by including TF and DF as features, we can simulate the TFIDF model, but the logistic regression model also has the option to learn different weightings.

To actually train a logistic regression model, we take a set of training data, and try to find the weight vector w that makes it as likely as possible. In our case, the training instances consist of every possible candidate phrase selected from the training documents, with Y=1 if they were labeled relevant, and 0 otherwise. We use the SCGIS method [9] as the actual training method. However, because there is a unique best logistic regression model for any training set, and the space is convex, the actual choice of training algorithm makes relatively little difference.

In the monolithic models, we only try to model one decision, whether a word or phrase is relevant or not, so the variable Y can only take values 0 or 1. But for the decomposed framework, we try to determine for each word whether it is the beginning of a phrase, inside a phrase, etc., with 5 different possibilities (the BILUO labels). In this case, we use a generalized form of the logistic regression model:

    P(Y=i | x) = exp(x . w_i) / sum_{j=1..5} exp(x . w_j)

That is, there are 5 different sets of weights, one for each possible output value. Note that in practice, for both forms of logistic regression, we always append a special "always on" feature (i.e. a value of 1) to the x vector, that serves as a bias term. In order to prevent overfitting, we also apply a Gaussian prior with variance 0.3 for smoothing [4].

2.3.2 Features

We experimented with various features that are potentially useful. Some of these features are binary, taking only the values 0 or 1, such as whether the phrase appears in the title. Others are real-valued, such as the TF or DF values or their logarithms. Below we describe these features and their variations when used in the monolithic (MoS) and decomposed (DeS) frameworks.

Features used in the monolithic, combined (MoC) framework are basically the same as in the MoS framework. If, in the document, a candidate phrase in the MoC framework has several occurrences, which correspond to several candidate phrases in the MoS framework, the features are combined using the following rules.

1. For binary features, the combined feature is the union of the corresponding features. That is, if this feature is active in any of the occurrences, then it is also active in the combined candidate.

2. For real-valued features, the combined feature takes the smallest value of the corresponding features.

To give an example for the binary case, if one occurrence of the term "digital camera" is in the anchor text of a hyperlink, then the anchor text feature is active in the combined candidate. Similarly, for the location feature, which is a real-valued case, the location of the first occurrence of this phrase will be used as the corresponding combined value, since its value is the smallest. In the following description, features are binary unless otherwise noted.

2.3.2.1 Lin: linguistic features. The linguistic information used in feature extraction includes two types of POS tags, noun (NN & NNS) and proper noun (NNP & NNPS), and one type of chunk, noun phrase (NP). The variations used in MoS are: whether the phrase contains these POS tags; whether all the words in that phrase share the same POS tags (either proper noun or noun); and whether the whole candidate phrase is a noun phrase. For DeS, they are: whether the word has the POS tag; whether the word is the beginning of a noun phrase; whether the word is in a noun phrase, but not the first word; and whether the word is outside any noun phrase.

2.3.2.2 C: capitalization. Whether a word is capitalized is an indication of being part of a proper noun, or an important word. This set of features for MoS is defined as: whether all the words in the candidate phrase are capitalized; whether the first word of the candidate phrase is capitalized; and whether the candidate phrase has a capitalized word. For DeS, it is simply whether the word is capitalized.

2.3.2.3 H: hypertext. Whether a candidate phrase or word is part of the anchor text for a hypertext link is extracted as the following features. For MoS, they are: whether the whole candidate phrase matches exactly the anchor text of a link; whether all the words of the candidate phrase are in the same anchor text; and whether any word of the candidate phrase belongs to the anchor text of a link. For DeS, they are: whether the word is the beginning of the anchor text; whether the word is in the anchor text of a link, but not the first word; and whether the word is outside any anchor text.

2.3.2.4 Ms: meta section features. The header of an HTML document may provide additional information embedded in <meta> tags. Although the text in this region is usually not seen by readers, whether a candidate appears in this meta section seems important. For MoS, the feature is whether the whole candidate phrase is in the meta section. For DeS, they are: whether the word is the first word in a meta tag; and whether the word occurs somewhere in a meta tag, but not as the first word.

2.3.2.5 T: title. The only human readable text in the HTML header is the <title>, which is usually put in the window caption by the browser. For MoS, the feature is whether the whole candidate phrase is in the title. For DeS, the features are: whether the word is the beginning of the title; and whether the word is in the title, but not the first word.

2.3.2.6 M: meta features. In addition to <title>, several meta tags are potentially related to keywords, and are used to derive features. In the MoS framework, the features are: whether the whole candidate phrase is in the meta-description; whether the whole candidate phrase is in the meta-keywords; and whether the whole candidate phrase is in the meta-title. For DeS, the features are: whether the word is the beginning of the meta-description; whether the word is in the meta-description, but not the first word; whether the word is the beginning of the meta-keywords; whether the word is in the meta-keywords, but not the first word; whether the word is the beginning of the meta-title; and whether the word is in the meta-title, but not the first word.

2.3.2.7 U: URL. A web document has one additional highly useful property: the name of the document, which is its URL. For MoS, the features are: whether the whole candidate phrase is in part of the URL string; and whether any word of the candidate phrase is in the URL string. In the DeS framework, the feature is whether the word is in the URL string.

2.3.2.8 IR: information retrieval oriented features. We consider the TF (term frequency) and DF (document frequency) values of the candidate as real-valued features. The document frequency is derived by counting how many documents in our web page collection contain the given term. In addition to the original TF and DF frequency numbers, log(TF+1) and log(DF+1) are also used as features. The features used in the monolithic and the decomposed frameworks are basically the same, where for DeS, the "term" is the candidate word.

2.3.2.9 Loc: relative location of the candidate. The beginning of a document often contains an introduction or summary with important words and phrases. Therefore, the location of the occurrence of the word or phrase in the document is also extracted as a feature. Since the length of a document or a sentence varies considerably,
we take only the ratio of the location instead of the absolute number. For example, if a word appears in the 10th position, while the whole document contains 200 words, the ratio is then 0.05. These features used for the monolithic and decomposed frameworks are the same. When the candidate is a phrase, its first word is used as its location.

There are three different relative locations used as features: wordRatio, the relative location of the candidate in the sentence; sentRatio, the location of the sentence where the candidate is in divided by the total number of sentences in the document; and wordDocRatio, the relative location of the candidate in the document. In addition to these 3 real-valued features, we also use their logarithms as features. Specifically, we used log(1+wordRatio), log(1+sentRatio), and log(1+wordDocRatio).

2.3.2.10 Len: sentence and document length. The length (in words) of the sentence (sentLen) where the candidate occurs, and the length of the whole document (docLen) (words in the header are not included) are used as features. Similarly, log(1+sentLen) and log(1+docLen) are also included.

2.3.2.11 phLen: length of the candidate phrase. For the monolithic framework, the length of the candidate phrase (phLen) in words and log(1+phLen) are included as features. These features are not used in the decomposed framework.

2.3.2.12 Q: query log. The query log of a search engine reflects the distribution of the keywords people are most interested in. We use the information to create the following features. For these experiments, unless otherwise mentioned, we used a log file with the most frequent 7.5 million queries.

For the monolithic framework, we consider one binary feature, whether the phrase appears in the query log, and two real-valued features, the frequency with which it appears and the log value, log(1+frequency). For the decomposed framework, we consider more variations of this information: whether the word appears in the query log file as the first word of a query; whether the word appears in the query log file as an interior word of a query; and whether the word appears in the query log file as the last word of a query. The frequency values of the above features and their log values (log(1+f), where f is the corresponding frequency value) are also used as real-valued features. Finally, whether the word never appears in any query log entries is also a feature.

2.4 Postprocessor

After the classifier predicts the probabilities of the candidates associated with the possible labels, our keyword extraction system generates a list of keywords ranked by the probabilities. When the Monolithic Combined framework is used, we simply return the most probable words or phrases. When the Monolithic Separate (MoS) framework is used, the highest probability of identical fragments is picked as the probability of the phrase.

In the Decomposed Separate (DeS) framework, we need to convert probabilities of individual words being phrase components (Beginning, Inside, etc.) into probabilities of relevance of whole phrases. Typically, in phrase labeling problems like information extraction, this conversion is done with the Viterbi [17] algorithm, to find the most probable assignment of the word label sequence of each sentence [5]. We rejected this method for two reasons. First, because in our training set, only a few words per document tend to be labeled as relevant, our system almost never assigns high probability to a particular phrase, and thus the Viterbi assignment would typically reject all phrases. Second, in typical information extraction applications, we not only want to find various entities, we also want to find their relationships and roles. In our application, we simply want a list of potential entities. It is thus fine to extract potentially overlapping strings as potential keywords, something that would not work well if role-assignment was required.

Therefore, we use the following mechanism to estimate the probability of a phrase in the DeS framework. Given a phrase of length n, we calculate the overall probability for the phrase by multiplying the probabilities of the individual words having the correct label of the label sequence. For example, if n=3, then the correct label sequence is B, I, L. The probability of this phrase being a keyword, P1, is derived by P(w1=B) * P(w2=I) * P(w3=L). If the phrase is not a keyword, then the correct label sequence is O, O, O. The corresponding probability, P0, is then P(w1=O) * P(w2=O) * P(w3=O). The actual probability used for this phrase is then P1/(P0+P1). Among the normalization methods we tried, this strategy works the best in practice.

3. EXPERIMENTS

This section reports the experimental results comparing our system with several baseline systems, the comparisons between the variations of our system, the contribution of individual features, and the impact of reducing the search query log size. We first describe how the documents were obtained and annotated, as well as the performance measures.

3.1 Data and Evaluation Criteria

The first step was to obtain and label data, namely a set of web pages. We collected these documents at random from the web, subject to the following criteria: First, each page was in a crawl of the web from MSN Search. Second, each page displayed content-targeted advertising (detected by the presence of special Javascript). This criterion was designed to make sure we focused on the kind of web pages where content-targeted advertising would be desirable, as opposed to truly random pages. Third, the pages also had to occur in the Internet Archive.³ We hope to share our labeled corpus, so other researchers can experiment with it. Choosing pages in the Internet Archive may make that sharing easier. Fourth, we selected no more than one page per domain (in part, because of copyright issues, and in part to ensure diversity of the corpus). Fifth, we tried to eliminate foreign language and adult pages. Altogether, we collected 1109 pages.
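The DeS phrase-scoring rule of Section 2.4, which scores a phrase by the probability of its B/I/L labeling normalized against the all-O labeling, can be sketched as follows. This is a toy illustration: the function name and the per-word label distributions are invented, not taken from the paper.

```python
def phrase_score(label_probs):
    """label_probs: one dict per word, mapping label -> P(label | word).

    For a phrase of length n >= 2 the 'keyword' label sequence is
    B, I, ..., I, L (just B, L when n == 2); a single word uses U.
    The 'non-keyword' sequence is O, ..., O.
    Returns P1 / (P0 + P1), as in Section 2.4.
    """
    n = len(label_probs)
    keyword_labels = ["U"] if n == 1 else ["B"] + ["I"] * (n - 2) + ["L"]
    p1 = p0 = 1.0
    for probs, label in zip(label_probs, keyword_labels):
        p1 *= probs[label]   # probability of the keyword labeling
        p0 *= probs["O"]     # probability of the all-outside labeling
    return p1 / (p0 + p1)

# Hypothetical classifier outputs for the two words of "pet store":
words = [
    {"B": 0.30, "I": 0.05, "L": 0.05, "U": 0.10, "O": 0.50},
    {"B": 0.05, "I": 0.05, "L": 0.40, "U": 0.10, "O": 0.40},
]
score = phrase_score(words)  # P1 = 0.30*0.40, P0 = 0.50*0.40
```

Note how the normalization sidesteps the problem the authors describe with Viterbi decoding: even though P1 is small in absolute terms, the phrase still receives a usable relative score.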
³ http://www.archive.org

Next, we selected 30 of these pages at random to be used for inter-annotator agreement measurements. We then randomly split the remaining pages into 8 sets of 120 pages and one set of 119 pages. We gave each set to one of our annotators. For each annotator, the first time they labeled a set of web pages, we also gave them the additional 30 pages for inter-annotator agreement, scattered randomly throughout (so, they received 150 pages total the first time, 120 or 119 on subsequent sets). Several pages were rejected for various reasons, such as one annotator who did not complete the process, several foreign language pages that slipped through our screening process, etc. As a result, 828 web documents had legitimate annotations and were used to train and test the system.

Annotators were instructed to look at each web page, and determine appropriate keywords. In particular, they were given about 3 pages of instructions with actual examples. The core of the instructions was essentially as follows:

    Advertising on the web is often done through keywords. Advertisers pick a keyword, and their ad appears based on that. For instance, an advertiser like Amazon.com would pick a keyword like "book" or "books." If someone searches for the word "book" or "books", an Amazon ad is shown. Similarly, if the keyword "book" is highly prominent on a web page, Amazon would like an ad to appear. We need to show our computer program examples of web pages, and then tell it which keywords are "highly prominent." That way, it can learn that words like "the" and "click here" are never highly prominent. It might learn that words that appear on the right (or maybe the left) are more likely to be highly prominent, etc. Your task is to create the examples for the system to learn from. We will give you web pages, and you should list the highly prominent words that an advertiser might be interested in.

There was one more important instruction, which was to try to use only words or phrases that actually occurred on the page being labeled. The remaining portion of the instructions gave examples and described technical details of the labeling process.

We used a snapshot of the pages, to make sure that the training, testing, and labeling processes all used identical pages. The snapshotting process also had the additional advantage that most images, and all content-targeted advertising, were not displayed to the annotators, preventing them from either selecting terms that occurred only in images, or from being polluted by a third-party keyword selection process.

3.2 Performance measures

We computed three different performance measures of our various systems. The first performance measure is simply the top-1 score. To compute this measure, we counted the number of times the top output of our system for a given page was in the list of terms described by the annotator for that page. We divided this number by the maximum achievable top-1 score, and multiplied by 100. To get the maximum achievable top-1 score, for each test document, we first removed any annotations that did not occur somewhere in the web page (to eliminate spelling mistakes, and occasional annotator confusion). The best achievable score was then the number of documents that still had at least one annotation. We counted answers as correct if they matched, ignoring case, and with all whitespace collapsed to a single space.

The second performance measure is the top-10 score, which is similar to the top-1 measure but considers 10 candidates instead of 1. To compute this measure, for each web page, we counted how many of the top 10 outputs of our system were in the list of terms described by the annotator for that page. The sum of these numbers was then divided by the maximum achievable top-10 score, and multiplied by 100.

It is important in an advertising system to not only extract accurate lists of keywords, but to also have accurate probability estimates. For instance, consider two keywords on a particular page, say "digital camera", which might monetize at $1 per click, and "CD-R media", which monetizes at, say, 10 cents per click. If there is a 50% chance that "CD-R media" is relevant, and only a 10% chance that "digital camera" is relevant, overall expected revenue from showing "digital camera" ads will still be higher than from showing "CD-R media" ads. Therefore, accurately estimating probabilities leads to potentially higher advertising revenue.

The most commonly used measure of the accuracy of probability estimates is entropy. For a given term t and a given web page p, our system computes the probability that the word is relevant to the page, P(t|p) (i.e., the probability that the annotator listed the word on the relevant list). The entropy of a given prediction is

    -log2 P(t|p)        if t is relevant to p
    -log2 (1 - P(t|p))  if t is not relevant to p

Lower entropies correspond to better probability estimates, with 0 being ideal. When we report the entropy, we will report the average entropy across all words and phrases up to length 5 in all web pages in the test set, with duplicates within a page counted as one occurrence. The DeS framework was not easily amenable to this kind of measurement, so we will only report the entropy measure for the MoS and MoC methods.

3.3 Inter-annotator Agreement

We wanted to compute some form of inter-annotator agreement, where the typical choice for this measure is the kappa measure. However, that measure is designed for annotations with a small, fixed number of choices, where a prior probability of selecting each choice can be determined. It was not clear how to apply it for a problem with thousands of possible annotations, with the possible annotations different on every page.

Instead, we used our inter-annotator agreement data to compute an upper bound for the possible performance on our top-1 measure. In particular, we designed an accuracy measure called committee-top-1. Essentially this number measures roughly how well a committee of four people could do on this task, by taking the most common answer selected by the four annotators. To compute this, we performed the following steps. First, we selected a page at random from the set of 30 labeled web pages for inter-annotator agreement. Then, we randomly selected one of the five annotators as arbitrator, whose results we called "correct." We then merged the results of the remaining four annotators, and found the single most frequent one. Finally, if that result was on the "correct" keywords list, we considered it correct; otherwise we considered it wrong. The average committee-top-1 score from 1000 samples was 25.8. This gives a general feeling for the difficulty of the problem. Due to the small number of documents used (30) and the small number of annotators (5), there is considerable uncertainty in this estimate. In fact, in some cases, our experimental results are slightly better than this number, which we attribute to the small size of the sample used for inter-annotator agreement.

3.4 Results

All experiments were performed using 10-way cross validation. In most cases, we picked an appropriate baseline, and computed statistical significance of differences between results and this baseline. We indicate the baseline with a b symbol. We indicate significance at the 95% level with a † symbol, and at the 99% level with a ‡ symbol. Significance tests were computed with a two-tailed paired t-test.

3.4.1 Overall Performance

We begin by comparing the overall performance of different systems and configurations. The top-1 and top-10 scores of these different systems are listed in Table 1. Among the system configurations we tested, the monolithic combined (MoC) framework is in general the best. In particular, when using all but the linguistic features, this configuration achieves the highest top-1 and top-10 scores in all the experiments, just slightly (and not statistically significantly) better than the same framework with all features.

The monolithic separate (MoS) system with all features performs worse for both top-1 and top-10, although only the top-10 result was statistically significant. Despite its prevalence in the information extraction community, for this task, the decomposed separate (DeS) framework is significantly worse than the monolithic approach. We hypothesize that the advantages of using features based on phrases as a whole (used in MoS and MoC) outweigh the advantages of features that combine information across words.

In addition to the different configurations of our systems, we also compare with a state-of-the-art keyword extraction system, KEA [7], by providing it with our preprocessed, plain-text documents. KEA is also a machine learning-based system. Although it relies on simpler features, it uses more sophisticated information retrieval techniques like removing stop words and stemming for preprocessing. Our best system is substantially better than KEA, with relative improvements of 27.5% and 22.9% on top-1 and top-10 scores, respectively.

We tried a very simple baseline system, MoC TFIDF, which simply uses traditional TFIDF scores. This system only reaches 13.01 in top-1 and 19.03 in top-10. We also tried using TF and IDF features, but allowing the weights to be trained with logistic regression (MoC IR). Training slightly improved top-1 to 13.63, and substantially improved the top-10 score to 25.67.

Notice that the top-1 score of our best system actually exceeds the committee-top-1 inter-annotator agreement score. There was considerable uncertainty in that number due to the small number of annotators and small number of documents for inter-annotator agreement, but we interpret these results to mean that our best system is in the same general range as human performance, at least on top-1 score.
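Section 2.3.1's observation, that logistic regression over log TF and log DF features can recover TFIDF-like behavior (the MoC IR baseline above), can be sketched with a toy trainer. This uses plain per-example gradient ascent rather than the paper's SCGIS method, which is a reasonable substitution since the objective is convex; the data and names are invented for the example:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=500):
    """Per-example gradient ascent on the logistic log-likelihood.
    A stand-in for the authors' SCGIS trainer: the objective is convex,
    so any reasonable optimizer approaches the same optimum."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xj in enumerate(xi):
                w[j] += lr * (yi - p) * xj
    return w

# Toy candidates as (term frequency, document frequency) pairs; the
# numbers are made up. Features mirror the paper's IR set,
# [log(TF+1), log(DF+1)], plus the "always on" bias feature.
candidates = [(20, 5), (15, 8), (12, 10), (2, 900), (3, 700), (1, 950)]
relevant = [1, 1, 1, 0, 0, 0]
X = [[math.log(tf + 1), math.log(df + 1), 1.0] for tf, df in candidates]
w = train_logreg(X, relevant)
# Training recovers TFIDF-like weights: positive on log TF,
# negative on log DF, as Section 2.3.1 suggests.
```

On this separable toy data the learned weights take the expected signs; on real pages the model can of course learn weightings other than the fixed +1/-1 that exact TFIDF corresponds to, which is the point of training them.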
    System                               top-1    top-10
    MoC (Monolithic, Combined), -Lin     30.06♭   46.97♭
    MoC (Monolithic, Combined), All      29.94    46.45
    MoS (Monolithic, Separate), All      27.95    44.13‡
    DeS (Decomposed, Separate), All      24.25‡   39.11‡
    KEA [7]                              23.57‡   38.21‡
    MoC (Monolithic, Combined), IR       13.63‡   25.67‡
    MoC (Monolithic, Combined), TFIDF    13.01‡   19.03‡

Table 1: Performance of different systems

3.4.2 Feature Contribution

One important and interesting question is the contribution of different types of features: namely, how important is a particular type of information to the keyword extraction task? We studied this problem by conducting a series of ablation studies: removing features of a specific type to see how much they contribute. We repeated these experiments in the three different candidate selector frameworks of our keyword extraction system to see if the same information affects the system differently when the framework changes. Tables 2, 3, and 4 show the detailed results. Each row is either the baseline system which uses all features, or the compared system that uses all except one specific kind of feature.

In addition to the top-1 and top-10 scores, we also compare the entropies of the various systems using the same framework. Recall that lower entropies are better. Entropy is a somewhat smoother, more sensitive measure than top-n scores, making it easier to see differences between different systems. Note that because the numbers of candidates are different, entropies from systems using different frameworks are not comparable.
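The exact entropy definition used in this paper is given earlier in the text; purely as an illustration (an assumption, not the paper's formula), one common entropy-style evaluation is the average negative log2 probability that the model assigns to the annotator-chosen keyword, so that lower values mean the model concentrates probability mass on the right answers:

```python
import math

def keyword_entropy(predictions):
    """Average negative log2 probability assigned to the correct keyword.

    predictions: list of (prob_dict, correct_keyword) pairs, where
    prob_dict maps each candidate phrase to the model's probability.
    Lower is better: a confident, correct model scores near zero.
    """
    total = 0.0
    for probs, correct in predictions:
        p = max(probs.get(correct, 0.0), 1e-12)  # floor to avoid log(0)
        total += -math.log2(p)
    return total / len(predictions)
```

Unlike top-n scores, a measure of this form changes continuously as the model shifts probability toward or away from the correct keyword, which is why small differences between systems show up more readily.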
    Features              top-1    top-10   entropy
    A    all              27.95♭   44.13♭   0.0120040♭
    -C   capitalization   27.39    43.50    0.0120945‡
    -H   hypertext        27.39    43.82    0.0120751‡
    -IR  IR               18.77‡   33.60‡   0.0149899‡
    -Len length           27.10    42.45†   0.0121040
    -Lin linguistic       28.05    44.89    0.0122166‡
    -Loc location         27.24    42.64‡   0.0121860‡
    -M   meta             27.81    44.05    0.0120080
    -Ms  meta section     27.52    43.09†   0.0120390
    -Q   query log        20.68‡   39.10‡   0.0129330‡
    -T   title            27.81    44.25    0.0120040
    -U   URL              26.09†   44.14    0.0121409†

Table 2: The system performance by removing one set of features in the MoS framework

By examining these results, we found that IR features and Query Log features are the most helpful consistently in all three frameworks. The system performance dropped significantly after removing either of them. However, the impact of other features was not so clear. The location feature also seems to be important, but it affects the top-10 score much more than the top-1 score. Surprisingly, unlike the cases in other typical phrase labeling problems, linguistic features don't seem to help in this keyword extraction domain. In fact, removing this set of features in the decomposed separate framework improves the top-1 score a little. We also observe more statistically significant differences in the entropy metric.

    Features              top-1    top-10   entropy
    A    all              29.94♭   46.45♭   0.0113732♭
    -C   capitalization   30.11    46.27    0.0114219†
    -H   hypertext        30.79    45.85†   0.0114370
    -IR  IR               25.42‡   42.26‡   0.0119463‡
    -Len length           30.49    44.74†   0.0119803‡
    -Lin linguistic       30.06    46.97    0.0114853‡
    -Loc location         29.52    44.63†   0.0116400‡
    -M   meta             30.10    46.78    0.0113633‡
    -Ms  meta section     29.33    46.33    0.0114031
    -Q   query log        24.82†   42.30‡   0.0121417‡
    -T   title            28.83    46.94    0.0114020
    -U   URL              30.53    46.39    0.0114310

Table 3: The system performance by removing one set of features in the MoC framework

    Features              top-1    top-10
    A    all              24.25    39.11
    -C   capitalization   24.26    39.20
    -H   hypertext        24.96    39.67
    -IR  IR               19.10‡   30.56‡
    -Lin linguistic       25.96†   39.36
    -Loc location         24.84    37.93
    -M   meta             24.69    38.91
    -Ms  meta section     24.68    38.71
    -Q   query log        21.27‡   33.95‡
    -T   title            24.40    38.89
    -U   URL              25.21    38.97

Table 4: The system performance by removing one set of features in the DeS framework

Evaluating the contribution of a feature by removing it from the system does not always show its value. For instance, if two types of features have similar effects, then the system performance may not change by eliminating only one of them. We thus conducted a different set of experiments by adding features to a baseline system.

We chose a system in the monolithic combined framework (MoC), using only IR features, as the baseline system. We then built different systems by adding one additional set of features, and comparing with the baseline system. Table 5 shows performance on the top-1, top-10, and entropy metrics, ordered by difference in top-1.

    Features              top-1    top-10   entropy
         IR               13.63♭   25.67♭   0.0163299♭
    +Q   query log        22.36‡   35.88‡   0.0134891‡
    +T   title            19.90‡   34.17‡   0.0152316‡
    +Len length           19.22‡   33.43‡   0.0134298‡
    +Ms  meta section     19.02‡   31.90‡   0.0154484‡
    +H   hypertext        18.46‡   30.36‡   0.0150824‡
    +Lin linguistic       18.20‡   32.26‡   0.0146324‡
    +C   capitalization   17.41‡   33.16‡   0.0146999‡
    +Loc location         17.01‡   32.76‡   0.0154064‡
    +U   URL              16.72†   28.63‡   0.0157466‡
    +M   meta             16.71†   28.19‡   0.0160472†

Table 5: The system performance by adding one set of features in the MoC framework

Interestingly, all the features seem to be helpful when they are combined with the baseline IR features, including the linguistic features. We thus conclude that the linguistic features were not really useless, but instead were redundant with other features like capitalization (which helps detect proper nouns) and the Query Log features (which help detect linguistically appropriate phrases). The Query Log features are still the most effective among all the features we experimented with. However, the performance gap between our best system and a simple system using only IR and Query Log features is still quite large.

3.4.3 Different Query Log Sizes

The query log features were some of the most helpful features, second only to the IR features. These features used the top 7.5 million English queries from MSN Search. In practice, deploying systems using this many features may be problematic. For instance, if we want to use 20 languages, and if each query entry uses about 20 bytes, the query log files would use 3 GB. This amount of memory usage would not affect servers dedicated to keyword extraction, but might be problematic for servers that perform other functions (e.g. serving ads, or as part of a more general blogging or news system). In scenarios where keyword extraction is done on the client side, the query log file size is particularly important.

Query logs may also help speed up our system. One strategy we wanted to try is to consider only the phrases that appeared in the query log file as potential matches for keyword extraction. In this case, smaller query logs lead to even larger speedups. We thus ran a set of experiments under the MoC framework, using different query log file sizes (as measured by a frequency threshold cutoff). The log file sizes were very roughly inversely proportional to the cutoff. The leftmost point, with a cutoff of 18, corresponds to our 7.5 million query log file.

The graph in Figure 1 shows both the top-1 and top-10 scores at different sizes. "Res. top-1" and "Res. top-10" are the results where we restricted the candidate phrases by eliminating those that do not appear in the query log file. As a comparison, the graph also shows the non-restricting version (i.e., top-1 and top-10). Note that of course the top-1 scores cannot be compared to the top-10 scores.

When the frequency cutoff threshold is small (i.e., large query log file size), this restricting strategy in fact improves the top-1 score slightly, but hurts the top-10 score. This phenomenon is not surprising since restricting candidates may eliminate some keywords. This tends to not affect the top-1 score since when using the query log file as an input, the most probable keywords in a document almost always appear in the keyword file, if the keyword file is large. Less probable, but still good keywords, are less likely to appear in the keyword file. As the query log file size becomes smaller, the negative effect becomes more significant. However, if only the top-1 score is relevant in the application, then a reasonable cutoff point may be 1000, where the query log file has less than 100,000 entries.

On the other hand, the top-1 and top-10 scores in the non-restricting version decrease gradually as the size of the query log file decreases. Fairly small files can be used with the non-restricting version if space is at a premium. After all, the extreme case is when no query log file is used, and as we have seen in Table 3, in that case the top-1 and top-10 scores are still 24.82 and 42.30, respectively.

[Figure 1: The performance of using different sizes of the query log file. Axes: score vs. query log frequency threshold; curves: Res. top-1, Res. top-10, top-1, top-10.]

4. RELATED WORK

There has been a moderate amount of previous work that is directly relevant to keyword extraction. In most cases, the keyword extraction was not applied to web pages, but instead was applied to a different text domain. Most previous systems were fairly simple, using either a small number of features, a simple learning method (Naive Bayes), or both. In this section, we describe this previous research in more detail.

4.1 GenEx

One of the best known programs for keyword extraction is Turney's GenEx system [22]. GenEx is a rule-based keyphrase extraction system with 12 parameters tuned using a Genetic Algorithm. GenEx has been trained and evaluated on a collection of 652 documents of three different types: journal articles, email messages, and web pages. Precisions of top 5 phrases and top 15 phrases are reported as evaluation metrics, where the numbers are 0.239 and 0.128 respectively for all documents. Turney showed GenEx's superiority by comparing it with an earlier keyphrase extraction system trained by C4.5 [16] with several complicated features [22].

4.2 KEA and Variations

Concurrently with the development of GenEx, Frank et al. developed the KEA keyphrase extraction algorithm using a simple machine learning approach [7]. KEA first processes documents by removing stop words and by stemming. The candidate phrases are represented using only three features: TFIDF, distance (number of words before the first occurrence of the phrase, divided by the number of words in the whole document), and keyphrase-frequency, which is the number of times the candidate phrase occurs in other documents. The classifier is trained using the naive Bayes learning algorithm [14]. Frank et al. compared KEA with GenEx, and showed that KEA is slightly better in general, but the difference is not statistically significant [7].
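The three KEA features just described can be sketched directly, taking a single-token "phrase" for simplicity. This is an illustrative helper (hypothetical function name), not KEA's implementation, which also stems and removes stop words first:

```python
import math

def kea_features(phrase, doc_tokens, corpus):
    """Compute KEA-style features for one candidate phrase.

    Returns (tfidf, distance, keyphrase_frequency) per the description
    in Frank et al. [7]: distance is the number of words before the
    first occurrence of the phrase divided by the document length, and
    keyphrase-frequency counts occurrences in the other documents.
    """
    n = len(doc_tokens)
    tf = doc_tokens.count(phrase) / n
    df = sum(1 for d in corpus if phrase in d)
    idf = math.log(len(corpus) / df) if df else 0.0  # simple IDF (an assumption)
    tfidf = tf * idf
    distance = doc_tokens.index(phrase) / n if phrase in doc_tokens else 1.0
    keyphrase_freq = sum(d.count(phrase) for d in corpus if d is not doc_tokens)
    return tfidf, distance, keyphrase_freq
```

With only three such numbers per candidate, a naive Bayes classifier can be trained cheaply, which is part of why KEA is so simple compared with the feature sets used in this paper.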
KEA's performance was improved later by adding Web-related features [23]. In short, the number of documents returned by a search engine using the keyphrase as query terms is used as additional information. This feature is particularly useful when training and testing data are from different domains. In both intra- and inter-domain evaluations, the relative performance gain in precision is about 10%.

Kelleher and Luz also reported an enhanced version of KEA for web documents [13]. They exploited the link information of a web document by adding a "semantic ratio" feature, which is the frequency of the candidate phrase P in the original document D, divided by the frequency of P in documents linked by D. Using the Meta Keyword HTML tag as the source of annotations, they experimented with their approach on four sets of documents, where documents are connected by hyperlinks in each collection. Among three of them, they reported significant improvement (45% to 52%) compared to the original version of KEA. Adding the semantic ratio to our system would be an interesting area of future research. It should, however, be noted that using the semantic ratio requires downloading and parsing several times as many web pages, which would introduce substantial load. Also, in practice, following arbitrary links from a page could result in simulating clicks on, e.g., unsubscribe links, leading to unexpected or undesirable results.

The use of linguistic information for keyword extraction was first studied by Hulth [12]. In this work, noun phrases and predefined part-of-speech tag patterns were used to help select phrase candidates. In addition, whether the candidate phrase has certain POS tags is used as features. Along with the three features in the KEA system, Hulth applied bagging [1] as the learning algorithm to train the system. The experimental results showed different degrees of improvement compared with systems that did not use linguistic information. However, direct comparison to KEA was not reported.

4.3 Information Extraction

Analogous to keyword extraction, information extraction is also a problem that aims to extract or label phrases given a document [8, 2, 19, 5]. Unlike keyword extraction, information extraction tasks are usually associated with predefined semantic templates. The goal of the extraction tasks is to find certain phrases in the documents to fill the templates. For example, given a seminar announcement, the task may be finding the name of the speaker, or the starting time of the seminar. Similarly, the named entity recognition [21] task is to label phrases as semantic entities like person, location, or organization.

Information extraction is in many ways similar to keyword extraction, but the techniques used for information extraction are typically different. While most keyword extraction systems use the monolithic combined (MoC) framework, typically, information extraction systems use the DeS framework, or sometimes MoS. Since identical phrases may have different labels (e.g., "Washington" can be either a person or a location even in the same document), candidates in a document are never combined. The choice of features is also very different. These systems typically use lexical features (the identity of specific words in or around the phrase), linguistic features, and even conjunctions of these features. In general, the feature space in these problems is huge: often hundreds of thousands of features. Lexical features for keyword extraction would be an interesting area of future research, although our intuition is that these features are less likely to be useful in this case.

4.4 Impedance Coupling

Ribeiro-Neto et al. [18] describe an Impedance Coupling technique for content-targeted advertising. Their work is perhaps the most extensive previously published work specifically on content-targeted advertising. However, Ribeiro-Neto et al.'s work is quite a bit different from ours. Most importantly, their work focused not on finding keywords on web pages, but on directly matching advertisements to those web pages. They used a variety of information, including the text of the advertisements, the destination web page of the ad, and the full set of keywords tied to a particular ad (as opposed to considering keywords one at a time). They then looked at the cosine similarity of these measures to potential destination pages.

There are several reasons we do not compare directly to the work of Ribeiro-Neto et al. Most importantly, we, like most researchers, do not have a database of advertisements (as opposed to just keywords) that we could use for experiments of this sort. In addition, they are solving a somewhat different problem than we are. Our goal is to find the most appropriate keywords for a specific page. This dovetails well with how contextual advertising is sold today: advertisers bid on specific keywords, and their bids on different keywords may be very different. For instance, a purveyor of digital cameras might use the same ad for "camera", "digital camera", "digital camera reviews", or "digital camera prices." The advertiser's bids will be very different in the different cases, because the probability that a click on such an ad leads to a sale will be very different.

Ribeiro-Neto et al.'s goal is not to extract keywords, but to match web pages to advertisements. They do not try to determine which particular keyword on a page is a match, and in some cases, they match web pages to advertisements even when the web page does not contain any keywords chosen by an advertiser, which could make pricing difficult. Their techniques are also somewhat time-consuming, because they compute the cosine similarity between each web page and a bundle of words associated with each ad. They used only a small database (100,000 ads) applied to a small number of web pages (100), making such time-consuming comparisons possible. Real world application would be far more challenging: optimizations are possible, but would need to be an area of additional research. In contrast, our methods are directly applicable today.

4.5 News Query Extraction

Henzinger et al. [11] explored the domain of keyword extraction from a news source, to automatically drive queries. In particular, they extracted query terms from the closed captioning of television news stories, to drive a search system that would retrieve related online news articles. Their basic system used TFIDF to score individual phrases, and achieved slight improvements from using TFIDF2. They tried stemming, with mixed results. Because of the nature of broadcast news programs, where boundaries between topics are not explicit, they found improvements by using a history feature that automatically detected topic boundaries. They achieved their largest improvements by postprocessing articles to remove those that seemed too different from the news broadcast. This seems closely related to the previously mentioned work of Ribeiro-Neto et al., who found analogous comparisons between documents and advertisements helpful.

4.6 Email Query Extraction

Goodman and Carvalho [10] previously applied techniques similar to the ones described here to query extraction for email. Their goal was somewhat different from our goal here, namely to find good search queries, to drive traffic to search engines, although the same technology they used could be applied to find keywords for advertising. Much of their research focused on email-specific features, such as word occurrence in subject lines, and distinguishing new parts of an email message from earlier, "in-reply-to" sections.

In contrast, our work focuses on many web-page-specific features, such as keywords in the URL, and in meta-data. Our research goes beyond theirs in a number of other ways. Most importantly, we tried both information-extraction-inspired methods (DeS) and linguistic features. Here, we also examine the tradeoff between query file size and accuracy, showing that large query files are not necessary for near-optimal performance, if the restriction on words occurring in the query file is removed. Goodman et al. compare only to simple TFIDF-style baselines, while in our research, we compare to KEA. We also compute a form of inter-annotator agreement, something that was not previously done. The improvement over KEA and the near-human performance measurements are very important for demonstrating the high quality of our results.

5. CONCLUSIONS

The better we can monetize the web, the more features that can be provided. To give two examples, the explosion of free blogging tools and of free web-based email systems with large quotas have both been supported in large part by advertising, much of it using content-targeting. Better targeted ads are less annoying for users, and more likely to inform them of products that deliver real value to them. Better advertising thus creates a win-win-win situation: better web-based features, less annoying ads, and more information for users.

Our results demonstrate a large improvement over KEA. We attribute this improvement to the large number of helpful features we employed. While KEA employs only three features, we employed 12 different sets of features; since each set contained multiple features, we actually had about 40 features overall. As we showed in Table 5, every one of these sets was helpful, although as we also showed, some of them were redundant with each other. GenEx, with 12 features, works about as well as KEA: the choice of features and the learning algorithm are also important. Our most important new feature was the query frequency file from MSN Search.

6. ACKNOWLEDGMENTS

We thank Emily Cook, Natalie Jennings, Samantha Dickson, Krystal Paige and Katie Neely, who annotated the data. We are also grateful to Alexei Bocharov, who helped us measure inter-annotator agreement.

7. REFERENCES

[1] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[2] M. Califf and R. Mooney. Bottom-up relational learning of pattern matching rules for information extraction. JMLR, 4:177-210, 2003.
[3] X. Carreras, L. Marquez, and J. Castro. Filtering-ranking perceptron learning for partial parsing. Machine Learning, 60(1-3):41-71, 2005.
[4] S. F. Chen and R. Rosenfeld. A gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, CMU, 1999.
[5] H. Chieu and H. Ng. A maximum entropy approach to information extraction from semi-structured and free text. In Proc. of AAAI-02, pages 786-791, 2002.
[6] Y. Even-Zohar and D. Roth. A sequential model for multiclass classification. In EMNLP-01, 2001.
[7] E. Frank, G. W. Paynter, I. H. Witten, C. Gutwin, and C. G. Nevill-Manning. Domain-specific keyphrase extraction. In Proc. of IJCAI-99, pages 668-673, 1999.
[8] D. Freitag. Machine learning for information extraction in informal domains. Machine Learning, 39(2/3):169-202, 2000.
[9] J. Goodman. Sequential conditional generalized iterative scaling. In ACL '02, 2002.
[10] J. Goodman and V. R. Carvalho. Implicit queries for email. In CEAS-05, 2005.
[11] M. Henzinger, B. Chang, B. Milch, and S. Brin. Query-free news search. In Proceedings of the 12th World Wide Web Conference, pages 1-10, 2003.
[12] A. Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proc. of EMNLP-03, pages 216-223, 2003.
[13] D. Kelleher and S. Luz. Automatic hypertext keyphrase detection. In IJCAI-05, 2005.
[14] T. Mitchell. Tutorial on machine learning over natural language documents, 1997. Available from.
[15] V. Punyakanok and D. Roth. The use of classifiers in sequential inference. In NIPS-00, 2001.
[16] J. R. Quinlan. C4.5: programs for machine learning. Morgan Kaufmann, San Mateo, CA, 1993.
[17] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), February 1989.
[18] B. Ribeiro-Neto, M. Cristo, P. B. Golgher, and E. S. de Moura. Impedance coupling in content-targeted advertising. In SIGIR-05, pages 496-503, 2005.
[19] D. Roth and W. Yih. Relational learning via propositional algorithms: An information extraction case study. In IJCAI-01, pages 1257-1263, 2001.
[20] C. Sutton and A. McCallum. Composition of conditional random fields for transfer learning. In Proceedings of HLT/EMNLP-05, 2005.
[21] E. F. Tjong Kim Sang. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In CoNLL-02, 2002.
[22] P. D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303-336, 2000.
[23] P. D. Turney. Coherent keyphrase extraction via web mining. In Proc. of IJCAI-03, pages 434-439, 2003.