Finding Advertising Keywords on Web Pages

Wen-tau Yih, Microsoft Research, 1 Microsoft Way, Redmond, WA 98052, scottyih@microsoft.com
Joshua Goodman, Microsoft Research, 1 Microsoft Way, Redmond, WA 98052, joshuago@microsoft.com
Vitor R. Carvalho, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, vitor@cs.cmu.edu

ABSTRACT

A large and growing number of web pages display contextual advertising based on keywords automatically extracted from the text of the page, and this is a substantial source of revenue supporting the web today. Despite the importance of this area, little formal, published research exists. We describe a system that learns how to extract keywords from web pages for advertisement targeting. The system uses a number of features, such as term frequency of each potential keyword, inverse document frequency, presence in meta-data, and how often the term occurs in search query logs. The system is trained with a set of example pages that have been hand-labeled with "relevant" keywords. Based on this training, it can then extract new keywords from previously unseen pages. Accuracy is substantially better than several baseline systems.

Categories and Subject Descriptors

H.3.1 [Content Analysis and Indexing]: Abstracting methods; H.4.m [Information Systems]: Miscellaneous

General Terms

Algorithms, experimentation

Keywords

keyword extraction, information extraction, advertising

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2006, May 23-26, 2006, Edinburgh, Scotland. ACM 1-59593-323-9/06/0005.

1. INTRODUCTION

Content-targeted advertising systems, such as Google's AdSense program and Yahoo's Contextual Match product, are becoming an increasingly important part of the funding for free web services. These programs automatically find relevant keywords on a web page, and then display advertisements based on those keywords. In this paper, we systematically analyze techniques for determining which keywords are relevant. We demonstrate that a learning-based technique using TFxIDF features, web page metadata, and, most surprisingly, information from query log files, substantially outperforms competing methods, and even approaches human levels of performance by at least one measure.

Typical content-targeted advertising systems analyze a web page, such as a blog, a news page, or another source of information, to find prominent keywords on that page. These keywords are then sent to an advertising system, which matches the keywords against a database of ads. Advertising appropriate to the keyword is displayed to the user. Typically, if a user clicks on the ad, the advertiser is charged a fee, most of which is given to the web page owner, with a portion kept by the advertising service.

Picking appropriate keywords helps users in at least two ways. First, choosing appropriate keywords can lead to users seeing ads for products or services they would be interested in purchasing. Second, the better targeted the advertising, the more revenue that is earned by the web page provider, and thus the more interesting the applications that can be supported. For instance, free blogging services and free email accounts with large amounts of storage are both enabled by good advertising systems.

From the perspective of the advertiser, it is even more important to pick good keywords. For most areas of research, such as speech recognition, a 10% improvement leads to better products, but the increase in revenue is usually much smaller. For keyword selection, however, a 10% improvement might actually lead to nearly a 10% higher click-through-rate, directly increasing potential revenue and profit.

In this paper, we systematically investigated several different aspects of keyword extraction. First, we compared looking at each occurrence of a word or phrase in a document separately, versus combining all of our information about the word or phrase. We also compared approaches that look at the word or phrase monolithically to approaches that decompose a phrase into separate words. Second, we examined a wide variety of information sources, analyzing which sources were most helpful. These included various meta-tags, title information, and even the words in the URL of the page. One surprisingly useful source of information was query frequency information from MSN Search query logs. That is, knowing the overall query frequency of a particular word or phrase on a page was helpful in determining if that word or phrase was relevant to that page.

We compared these approaches to several different baseline approaches, including a traditional TFxIDF model; a model using TF and IDF features but with learned weights; and the KEA system [7]. KEA is also a machine learning system, but with a simpler learning mechanism and fewer features. As we will show, our system is substantially better than any of these baseline systems. We also compared our system to the maximum achievable given the human labeling, and found that on one measure, our system was in the same range as human levels of performance.

2. SYSTEM ARCHITECTURE

In this section, we introduce the general architecture of our keyword extraction system, which consists of the following four stages: preprocessor, candidate selector, classifier, and postprocessor.

2.1 Preprocessor

The main purpose of the preprocessor is to transform an HTML document into an easy-to-process plain-text based document, while still maintaining important information. In particular, we want to preserve the blocks in the original HTML document, but remove the HTML tags. For example, text in the same table should be placed together, without tags like <table>, <tr>, or <td>. We also preserve information about which phrases are part of the anchor text of the hypertext links. The meta section of an HTML document header is also an important source of useful information, even though most of the fields except the title are not displayed by web browsers.

The preprocessor first parses an HTML document, and returns blocks of text in the body, hypertext information, and meta information in the header [1]. Because a keyword should not cross sentence boundaries, we apply a sentence splitter to separate text in the same block into various sentences. To evaluate whether linguistic information can help keyword extraction, we also apply a state-of-the-art part-of-speech (POS) tagger [6], and record the POS tag of each word. In addition, we have observed that most words or phrases that are relevant are short noun phrases. Therefore, having this information available as a feature would potentially be useful. We thus applied a state-of-the-art chunker to detect the base noun phrases in each document [15] [2].

[1] We used "Beautiful Soup", which is a python library for HTML parsing. Beautiful Soup can be downloaded from http://www.crummy.com/software/BeautifulSoup/
[2] These natural language processing tools can be downloaded from http://l2r.cs.uiuc.edu/cogcomp

2.2 Candidate Selector

Our system considers each word or phrase (consecutive words) up to length 5 that appears in the document as a candidate keyword. This includes all keywords that appear in the title section, or in meta-tags, as well as words and phrases in the body. As mentioned previously, a phrase is not selected as a candidate if it crosses sentence or block boundaries. This strategy not only eliminates many trivial errors but also speeds up the processing time by considering fewer keyword candidates.

Each phrase can be considered separately, or can be combined with all other occurrences of the same phrase in the same document. In addition, phrases can be considered monolithically, or can be decomposed into their constituent words. Putting together these possibilities, we ended up considering three different candidate selectors.

2.2.1 Monolithic, Separate (MoS)

In the Monolithic Separate candidate selector, fragments that appear in different document locations are considered as different candidates even if their content is identical. That is, if the phrase "digital camera" occurred once in the beginning of the document, and once in the end, we would consider them as two separate candidates, with potentially different features. While some features, such as the phrase length and TF values, would all be the same for all of these candidates, others, such as whether the phrase was capitalized, would be different. We call this variation Separate. In this candidate selector, all features of a candidate phrase are based on the phrase as a whole. For example, term frequency counts the number of times the exact phrase occurs in the document, rather than using the term frequency of individual words. We refer to this as Monolithic. To simplify our description, we use MoS to refer to this design.

2.2.2 Monolithic, Combined (MoC)

Since we only care about the ranked list of keywords, not where the keywords are extracted, we can reduce the number of candidates by combining identical (case ignored) fragments. For instance, "Weather report" in the title and "weather report" in the body are treated as only one candidate. We use MoC to refer to this design. Note that even in the Combined case, word order matters; e.g., the phrase "weather report" is treated as different than the phrase "report weather."

2.2.3 Decomposed, Separate (DeS)

Keyword extraction seems closely related to well-studied areas like information extraction, named-entity recognition and phrase labeling: they all attempt to find important phrases in documents. State-of-the-art information extraction systems (e.g., [5, 20]), named-entity recognition systems (e.g., [21]), and phrase labeling systems (e.g., [3]) typically decompose phrases into individual words, rather than examining them monolithically. It thus seemed worthwhile to examine similar techniques for keyword extraction. Decomposing a phrase into its individual words might have certain advantages. For instance, if the phrase "pet store" occurred only once in a document, but the phrases "pet" and "store" each occurred many times separately, such a decomposed approach would make it easy to use this knowledge.

Instead of selecting phrases directly as candidates, the decomposed approach tries to assign a label to each word in a document, as is done in related fields. That is, each of the words in a document is selected as a candidate, with multiple possible labels. The labels can be B (beginning of a keyphrase, when the following word is also part of the keyphrase), I (inside a keyphrase, but not the first or last word), L (last word of a keyphrase), U (unique word of a keyword of length 1), and finally O (outside any keyword or keyphrase).

This word-based framework requires a multi-class classifier to assign these 5 labels to a candidate word. In addition, it also needs a somewhat more sophisticated inference procedure to construct a ranked list of keywords in the postprocessing stage. The details will be described in Section 2.4. We use DeS to refer to this design.

2.3 Classifier

The core of our keyword extraction system is a classifier trained using machine learning. When a monolithic framework (MoS or MoC) is used, we train a binary classifier. Given a phrase candidate, the classifier predicts whether the word or phrase is a keyword or not. When a decomposed framework (DeS) is used, we train a multi-class classifier. Given a word, the goal is to predict the label as B, I, L, U, or O. Since whether a phrase is a keyword is ambiguous by nature, instead of a hard prediction, we actually need
the classifier to predict how likely it is that a candidate has a particular label. In other words, the classifier needs to output some kind of confidence scores or probabilities. The scores or probabilities can then be used later to generate a ranked list of keywords, given a document. We introduce the learning algorithm and the features we used as follows.

2.3.1 Learning Algorithm

We used logistic regression as the learning algorithm in all experiments reported here. Logistic regression models are also called maximum entropy models, and are equivalent to single layer neural networks that are trained to minimize entropy. In related experiments on an email corpus, we tried other learning algorithms, such as linear support-vector machines, decision trees, and naive Bayes, but in every case we found that logistic regression was equally good or better than the other learning algorithms for this type of task.

Formally, we want to predict an output variable Y, given a set of input features, X. In the monolithic framework, Y would be 1 if a candidate phrase is a relevant keyword, and 0 otherwise. X would be a vector of feature values associated with a particular candidate. For example, the vector might include the distance from the start of the document, the term frequency (TF) and the document frequency (DF) values for the phrase. The model returns the estimated probability, P(Y=1 | X = x).

The logistic regression model learns a vector of weights, w, one for each input feature in X. Continuing our example, it might learn a weight for the TF value, a weight for the DF value, and a weight for the distance value. The actual probability returned is

    P(Y=1 | X = x) = exp(x . w) / (1 + exp(x . w))

An interesting property of this model is its ability to simulate some simpler models. For instance, if we take the logarithms of the TF and DF values as features, then if the corresponding weights are +1 and -1, we end up with

    log(TF) - log(DF) = log(TF/DF)

In other words, by including TF and DF as features, we can simulate the TFxIDF model, but the logistic regression model also has the option to learn different weightings.

To actually train a logistic regression model, we take a set of training data, and try to find the weight vector w that makes it as likely as possible. In our case, the training instances consist of every possible candidate phrase selected from the training documents, with Y=1 if they were labeled relevant, and 0 otherwise. We use the SCGIS method [9] as the actual training method. However, because there is a unique best logistic regression model for any training set, and the space is convex, the actual choice of training algorithm makes relatively little difference.

In the monolithic models, we only try to model one decision (whether a word or phrase is relevant or not), so the variable Y can only take values 0 or 1. But for the decomposed framework, we try to determine for each word whether it is the beginning of a phrase, inside a phrase, etc., with 5 different possibilities (the BILUO labels). In this case, we use a generalized form of the logistic regression model:

    P(Y=i | x) = exp(x . w_i) / sum_{j=1..5} exp(x . w_j)

That is, there are 5 different sets of weights, one for each possible output value. Note that in practice, for both forms of logistic regression, we always append a special "always on" feature (i.e. a value of 1) to the x vector, that serves as a bias term. In order to prevent overfitting, we also apply a Gaussian prior with variance 0.3 for smoothing [4].

2.3.2 Features

We experimented with various features that are potentially useful. Some of these features are binary, taking only the values 0 or 1, such as whether the phrase appears in the title. Others are real-valued, such as the TF or DF values or their logarithms. Below we describe these features and their variations when used in the monolithic (MoS) and decomposed (DeS) frameworks.

Features used in the monolithic, combined (MoC) framework are basically the same as in the MoS framework. If, in the document, a candidate phrase in the MoC framework has several occurrences, which correspond to several candidate phrases in the MoS framework, the features are combined using the following rules.

1. For binary features, the combined feature is the union of the corresponding features. That is, if this feature is active in any of the occurrences, then it is also active in the combined candidate.

2. For real-valued features, the combined feature takes the smallest value of the corresponding features.

To give an example for the binary case, if one occurrence of the term "digital camera" is in the anchor text of a hyperlink, then the anchor text feature is active in the combined candidate. Similarly, for the location feature, which is a real-valued case, the location of the first occurrence of this phrase will be used as the corresponding combined value, since its value is the smallest. In the following description, features are binary unless otherwise noted.

2.3.2.1 Lin: linguistic features.

The linguistic information used in feature extraction includes two types of POS tags, noun (NN & NNS) and proper noun (NNP & NNPS), and one type of chunk, noun phrase (NP). The variations used in MoS are: whether the phrase contains these POS tags; whether all the words in that phrase share the same POS tags (either proper noun or noun); and whether the whole candidate phrase is a noun phrase. For DeS, they are: whether the word has the POS tag; whether the word is the beginning of a noun phrase; whether the word is in a noun phrase, but not the first word; and whether the word is outside any noun phrase.

2.3.2.2 C: capitalization.

Whether a word is capitalized is an indication of being part of a proper noun, or an important word. This set of features for MoS is defined as: whether all the words in the candidate phrase are capitalized; whether the first word of the candidate phrase is capitalized; and whether the candidate phrase has a capitalized word. For DeS, it is simply whether the word is capitalized.

2.3.2.3 H: hypertext.

Whether a candidate phrase or word is part of the anchor text for a hypertext link is extracted as the following
features. For MoS, they are: whether the whole candidate phrase matches exactly the anchor text of a link; whether all the words of the candidate phrase are in the same anchor text; and whether any word of the candidate phrase belongs to the anchor text of a link. For DeS, they are: whether the word is the beginning of the anchor text; whether the word is in the anchor text of a link, but not the first word; and whether the word is outside any anchor text.

2.3.2.4 Ms: meta section features.

The header of an HTML document may provide additional information embedded in meta tags. Although the text in this region is usually not seen by readers, whether a candidate appears in this meta section seems important. For MoS, the features are whether the whole candidate phrase is in the meta section. For DeS, they are: whether the word is the first word in a meta tag; and whether the word occurs somewhere in a meta tag, but not as the first word.

2.3.2.5 T: title.

The only human readable text in the HTML header is the title, which is usually put in the window caption by the browser. For MoS, the feature is whether the whole candidate phrase is in the title. For DeS, the features are: whether the word is the beginning of the title; and whether the word is in the title, but not the first word.

2.3.2.6 M: meta features.

In addition to the title, several meta tags are potentially related to keywords, and are used to derive features. In the MoS framework, the features are: whether the whole candidate phrase is in the meta-description; whether the whole candidate phrase is in the meta-keywords; and whether the whole candidate phrase is in the meta-title. For DeS, the features are: whether the word is the beginning of the meta-description; whether the word is in the meta-description, but not the first word; whether the word is the beginning of the meta-keywords; whether the word is in the meta-keywords, but not the first word; whether the word is the beginning of the meta-title; and whether the word is in the meta-title, but not the first word.

2.3.2.7 U: URL.

A web document has one additional highly useful property: the name of the document, which is its URL. For MoS, the features are: whether the whole candidate phrase is in part of the URL string; and whether any word of the candidate phrase is in the URL string. In the DeS framework, the feature is whether the word is in the URL string.

2.3.2.8 IR: information retrieval oriented features.

We consider the TF (term frequency) and DF (document frequency) values of the candidate as real-valued features. The document frequency is derived by counting how many documents in our web page collection contain the given term. In addition to the original TF and DF frequency numbers, log(TF+1) and log(DF+1) are also used as features. The features used in the monolithic and the decomposed frameworks are basically the same, where for DeS, the "term" is the candidate word.

2.3.2.9 Loc: relative location of the candidate.

The beginning of a document often contains an introduction or summary with important words and phrases. Therefore, the location of the occurrence of the word or phrase in the document is also extracted as a feature. Since the length of a document or a sentence varies considerably, we take only the ratio of the location instead of the absolute number. For example, if a word appears in the 10th position, while the whole document contains 200 words, the ratio is then 0.05. These features used for the monolithic and decomposed frameworks are the same. When the candidate is a phrase, its first word is used as its location.

There are three different relative locations used as features: wordRatio, the relative location of the candidate in the sentence; sentRatio, the location of the sentence where the candidate is, divided by the total number of sentences in the document; and wordDocRatio, the relative location of the candidate in the document. In addition to these 3 real-valued features, we also use their logarithms as features. Specifically, we used log(1+wordRatio), log(1+sentRatio), and log(1+wordDocRatio).

2.3.2.10 Len: sentence and document length.

The length (in words) of the sentence (sentLen) where the candidate occurs, and the length of the whole document (docLen) (words in the header are not included) are used as features. Similarly, log(1+sentLen) and log(1+docLen) are also included.

2.3.2.11 phLen: length of the candidate phrase.

For the monolithic framework, the length of the candidate phrase (phLen) in words and log(1+phLen) are included as features. These features are not used in the decomposed framework.

2.3.2.12 Q: query log.

The query log of a search engine reflects the distribution of the keywords people are most interested in. We use the information to create the following features. For these experiments, unless otherwise mentioned, we used a log file with the most frequent 7.5 million queries.

For the monolithic framework, we consider one binary feature (whether the phrase appears in the query log) and two real-valued features: the frequency with which it appears, and the log value, log(1+frequency). For the decomposed framework, we consider more variations of this information: whether the word appears in the query log file as the first word of a query; whether the word appears in the query log file as an interior word of a query; and whether the word appears in the query log file as the last word of a query. The frequency values of the above features and their log values (log(1+f), where f is the corresponding frequency value) are also used as real-valued features. Finally, whether the word never appears in any query log entries is also a feature.

2.4 Postprocessor

After the classifier predicts the probabilities of the candidates associated with the possible labels, our keyword extraction system generates a list of keywords ranked by the probabilities. When the Monolithic Combined framework is used, we simply return the most probable words or phrases. When the Monolithic Separate (MoS) framework is used, the highest probability of identical fragments is picked as
the probability of the phrase.

In the Decomposed Separate (DeS) framework, we need to convert probabilities of individual words being phrase components (Beginning, Inside, etc.) into probabilities of relevance of whole phrases. Typically, in phrase labeling problems like information extraction, this conversion is done with the Viterbi [17] algorithm, to find the most probable assignment of the word label sequence of each sentence [5]. We rejected this method for two reasons. First, because in our training set, only a few words per document tend to be labeled as relevant, our system almost never assigns high probability to a particular phrase, and thus the Viterbi assignment would typically reject all phrases. Second, in typical information extraction applications, we not only want to find various entities, we also want to find their relationships and roles. In our application, we simply want a list of potential entities. It is thus fine to extract potentially overlapping strings as potential keywords, something that would not work well if role-assignment was required.

Therefore, we use the following mechanism to estimate the probability of a phrase in the DeS framework. Given a phrase of length n, we calculate the overall probability for the phrase by multiplying the probabilities of the individual words being assigned the correct labels of the label sequence. For example, if n=3, then the correct label sequence is B, I, L. The probability of this phrase being a keyword, p1, is derived by P(w1=B) x P(w2=I) x P(w3=L). If the phrase is not a keyword, then the correct label sequence is O, O, O. The corresponding probability, p0, is then P(w1=O) x P(w2=O) x P(w3=O). The actual probability used for this phrase is then p1 / (p0 + p1). Among the normalization methods we tried, this strategy works the best in practice.

3. EXPERIMENTS

This section reports the experimental results comparing our system with several baseline systems, the comparisons between the variations of our system, the contribution of individual features, and the impact of reducing the search query log size. We first describe how the documents were obtained and annotated, as well as the performance measures.

3.1 Data and Evaluation Criteria

The first step was to obtain and label data, namely a set of web pages. We collected these documents at random from the web, subject to the following criteria. First, each page was in a crawl of the web from MSN Search. Second, each page displayed content-targeted advertising (detected by the presence of special Javascript). This criterion was designed to make sure we focused on the kind of web pages where content-targeted advertising would be desirable, as opposed to truly random pages. Third, the pages also had to occur in the Internet Archive [3]. We hope to share our labeled corpus, so other researchers can experiment with it. Choosing pages in the Internet Archive may make that sharing easier. Fourth, we selected no more than one page per domain (in part, because of copyright issues, and in part to ensure diversity of the corpus). Fifth, we tried to eliminate foreign language and adult pages. Altogether, we collected 1109 pages.

[3] http://www.archive.org

Next, we selected 30 of these pages at random to be used for inter-annotator agreement measurements. We then randomly split the remaining pages into 8 sets of 120 pages and one set of 119 pages. We gave each set to one of our annotators. For each annotator, the first time they labeled a set of web pages, we also gave them the additional 30 pages for inter-annotator agreement, scattered randomly throughout (so, they received 150 pages total the first time, 120 or 119 on subsequent sets). Several pages were rejected for various reasons, such as one annotator who did not complete the process, several foreign language pages that slipped through our screening process, etc. As a result, 828 web documents had legitimate annotations and were used to train and test the system.

Annotators were instructed to look at each web page, and determine appropriate keywords. In particular, they were given about 3 pages of instructions with actual examples. The core of the instructions was essentially as follows:

Advertising on the web is often done through keywords. Advertisers pick a keyword, and their ad appears based on that. For instance, an advertiser like Amazon.com would pick a keyword like "book" or "books." If someone searches for the word "book" or "books", an Amazon ad is shown. Similarly, if the keyword "book" is highly prominent on a web page, Amazon would like an ad to appear.

We need to show our computer program examples of web pages, and then tell it which keywords are "highly prominent." That way, it can learn that words like "the" and "click here" are never highly prominent
. It might learn that words that appear on the right (or maybe the left) are more likely to be highly prominent, etc. Your task is to create the examples for the system to learn from. We will give you web pages, and you should list the highly prominent words that an advertiser might be interested in.

There was one more important instruction, which was to try to use only words or phrases that actually occurred on the page being labeled. The remaining portion of the instructions gave examples and described technical details of the labeling process.

We used a snapshot of the pages, to make sure that the training, testing, and labeling processes all used identical pages. The snapshotting process also had the additional advantage that most images, and all content-targeted advertising, were not displayed to the annotators, preventing them from either selecting terms that occurred only in images, or from being polluted by a third-party keyword selection process.

3.2 Performance measures

We computed three different performance measures of our various systems. The first performance measure is simply the top-1 score. To compute this measure, we counted the number of times the top output of our system for a given page was in the list of terms described by the annotator for that page. We divided this number by the maximum achievable top-1 score, and multiplied by 100. To get the maximum achievable top-1 score, for each test document, we first removed any annotations that did not occur somewhere in the web page (to eliminate spelling mistakes, and
occasional annotator confusion). The best achievable score was then the number of documents that still had at least one annotation. We counted answers as correct if they matched, ignoring case, and with all whitespace collapsed to a single space.

The second performance measure is the top-10 score, which is similar to the top-1 measure but considers 10 candidates instead of 1. To compute this measure, for each web page, we counted how many of the top 10 outputs of our system were in the list of terms described by the annotator for that page. The sum of these numbers was then divided by the maximum achievable top-10 score, and multiplied by 100.

It is important in an advertising system to not only extract accurate lists of keywords, but to also have accurate probability estimates. For instance, consider two keywords on a particular page, say "digital camera", which might monetize at $1 per click, and "CD-R media", which monetizes at, say, 10 cents per click. If there is a 50% chance that "CD-R media" is relevant, and only a 10% chance that "digital camera" is relevant, overall expected revenue from showing "digital camera" ads will still be higher than from showing "CD-R media" ads. Therefore, accurately estimating probabilities leads to potentially higher advertising revenue.

The most commonly used measure of the accuracy of probability estimates is entropy. For a given term t and a given web page p, our system computes the probability that the word is relevant to the page, P(t|p) (i.e., the probability that the annotator listed the word on the relevant list). The entropy of a given prediction is

    -log2 P(t|p)        if t is relevant to p
    -log2 (1 - P(t|p))  if t is not relevant to p

Lower entropies correspond to better probability estimates, with 0 being ideal. When we report the entropy, we will report the average entropy across all words and phrases up to length 5 in all web pages in the test set, with duplicates within a page counted as one occurrence. The DeS framework was not easily amenable to this kind of measurement, so we will only report the entropy measure for the MoS and MoC methods.

3.3 Inter-annotator Agreement

We wanted to compute some form of inter-annotator agreement, where the typical choice for this measure is the kappa measure. However, that measure is designed for annotations with a small, fixed number of choices, where a prior probability of selecting each choice can be determined. It was not clear how to apply it for a problem with thousands of possible annotations, with the possible annotations different on every page.

Instead, we used our inter-annotator agreement data to compute an upper-bound for the possible performance on our top-1 measure. In particular, we designed an accuracy measure called committee-top-1. Essentially this number measures roughly how well a committee of four people could do on this task, by taking the most common answer selected by the four annotators. To compute this, we performed the following steps. First, we selected a page at random from the set of 30 labeled web pages for inter-annotator agreement. Then, we randomly selected one of the five annotators as arbitrator, whose results we called "correct." We then merged the results of the remaining four annotators, and found the single most frequent one. Finally, if that result was on the "correct" keywords list, we considered it correct; otherwise we considered it wrong. The average committee-top-1 score from 1000 samples was 25.8. This gives a general feeling for the difficulty of the problem. Due to the small number of documents used (30) and the small number of annotators (5), there is considerable uncertainty in this estimate. In fact, in some cases, our experimental results are slightly better than this number, which we attribute to the small size of the sample used for inter-annotator agreement.

3.4 Results

All experiments were performed using 10-way cross validation. In most cases, we picked an appropriate baseline, and computed statistical significance of differences between results and this baseline. We indicate the baseline with a ♭ symbol. We indicate significance at the 95% level with a † symbol, and at the 99% level with a ‡ symbol. Significance tests were computed with a two-tailed paired t-test.

3.4.1 Overall Performance

We begin by comparing the overall performance of different systems and configurations. The top-1 and top-10 scores of these different systems are listed in Table 1. Among the system configurations we tested, the monolithic combined (MoC) framework is in general the best. In particular, when using all but the linguistic features, this configuration achieves the highest top-1 and top-10 scores in all the experiments, just slightly (and not statistically significantly) better than the same framework with all features.

The monolithic separate (MoS) system with all features performs worse for both top-1 and top-10, although only the top-10 result was statistically significant. Despite its prevalence in the information extraction community, for this task, the decomposed separate (DeS) framework is significantly worse than the monolithic approach. We hypothesize that the advantages of using features based on phrases as a whole (used in MoS and MoC) outweigh the advantages of features that combine information across words.

In addition to the different configurations of our systems, we also compare with a state-of-the-art keyword extraction system, KEA [7], by providing it with our preprocessed, plain-text documents. KEA is also a machine learning-based system. Although it relies on simpler features, it uses more sophisticated information retrieval techniques like removing stop words and stemming for preprocessing. Our best system is substantially better than KEA, with relative improvements of 27.5% and 22.9% on top-1 and top-10 scores, respectively.

We tried a very simple baseline system, MoC TFIDF, which simply uses traditional TFxIDF scores. This system only reaches 13.01 in top-1 and 19.03 in top-10. We also tried using TF and IDF features, but allowing the weights to be trained with logistic regression (MoC IR). Training slightly improved top-1 to 13.63, and substantially improved the top-10 score to 25.67.

Notice that the top-1 score of our best system actually exceeds the committee-top-1 inter-annotator agreement score. There was considerable uncertainty in that number due to the small number of annotators and small number of documents for inter-annotator agreement, but we interpret these results to mean that our best system is in the same general range as human performance, at least on top-1 score.

    system                               top-1     top-10
    MoC (Monolithic, Combined), -Lin     30.06♭    46.97♭
    MoC (Monolithic, Combined), All      29.94     46.45
    MoS (Monolithic, Separate), All      27.95     44.13‡
    DeS (Decomposed, Separate), All      24.25‡    39.11‡
    KEA [7]                              23.57‡    38.21‡
    MoC (Monolithic, Combined), IR       13.63‡    25.67‡
    MoC (Monolithic, Combined), TFIDF    13.01‡    19.03‡

    Table 1: Performance of different systems

3.4.2 Feature Contribution

One important and interesting question is the contribution of different types of features. Namely, how important is a particular type of information to the keyword extraction task. We studied this problem by conducting a series of ablation studies: removing features of a specific type to see how much they contribute. We repeated these experiments in the three different candidate selector frameworks of our keyword extraction system to see if the same information affects the system differently when the framework changes. Tables 2, 3, and 4 show the detailed results. Each row is either the baseline system which uses all features, or the compared system that uses all except one specific kind of feature. In addition to the top-1 and top-10 scores, we also compare the entropies of the various systems using the same framework. Recall that lower entropies are better. Entropy is a somewhat smoother, more sensitive measure than top-n scores, making it easier to see differences between different systems. Note that because the numbers of candidates are different, entropies from systems using different frameworks are not comparable.
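The entropy measure described in Section 3.2 is straightforward to compute. The sketch below is our own minimal illustration, not the authors' code; the helper name `average_entropy` and its input format (a list pairing a predicted probability with the annotator's relevance judgment) are assumptions for this example.

```python
import math

def average_entropy(predictions):
    """Average entropy (log loss) of probability estimates, per Section 3.2.

    `predictions` is a list of (p, relevant) pairs, where p is the model's
    estimated probability P(t|p) that a candidate term is relevant to the
    page, and `relevant` is the annotator's judgment for that candidate.
    Lower is better; 0 is ideal.
    """
    total = 0.0
    for p, relevant in predictions:
        # -log2 P(t|p) if t is relevant to p, -log2(1 - P(t|p)) otherwise
        total += -math.log2(p) if relevant else -math.log2(1.0 - p)
    return total / len(predictions)
```

A perfectly calibrated, confident model scores near 0; a model that always outputs 0.5 scores exactly 1 bit per candidate, which is why small differences in this measure are easier to detect than differences in top-n scores.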
 features             top-1    top-10   entropy
 A    all             27.95♭   44.13♭   0.0120040♭
 -C   capitalization  27.39    43.50    0.0120945‡
 -H   hypertext       27.39    43.82    0.0120751‡
 -IR  IR              18.77‡   33.60‡   0.0149899‡
 -Len length          27.10    42.45†   0.0121040
 -Lin linguistic      28.05    44.89    0.0122166‡
 -Loc location        27.24    42.64‡   0.0121860‡
 -M   meta            27.81    44.05    0.0120080
 -Ms  meta section    27.52    43.09†   0.0120390
 -Q   query log       20.68‡   39.10‡   0.0129330‡
 -T   title           27.81    44.25    0.0120040
 -U   URL             26.09†   44.14    0.0121409†

Table 2: The system performance by removing one set of features in the MoS framework

By examining these results, we found that IR features and Query Log features are consistently the most helpful in all three frameworks. The system performance dropped significantly after removing either of them. However, the impact of other features was not so clear. The location feature also seems to be important, but it affects the top-10 score much more than the top-1 score. Surprisingly, unlike the cases in other typical phrase labeling problems, linguistic features don't seem to help in this keyword extraction domain. In fact, removing this set of features in the decomposed separate framework improves the top-1 score a little. We also observe more statistically significant differences in the entropy metric.

 features             top-1    top-10   entropy
 A    all             29.94♭   46.45♭   0.0113732♭
 -C   capitalization  30.11    46.27    0.0114219†
 -H   hypertext       30.79    45.85†   0.0114370
 -IR  IR              25.42‡   42.26‡   0.0119463‡
 -Len length          30.49    44.74†   0.0119803‡
 -Lin linguistic      30.06    46.97    0.0114853‡
 -Loc location        29.52    44.63†   0.0116400‡
 -M   meta            30.10    46.78    0.0113633‡
 -Ms  meta section    29.33    46.33    0.0114031
 -Q   query log       24.82†   42.30‡   0.0121417‡
 -T   title           28.83    46.94    0.0114020
 -U   URL             30.53    46.39    0.0114310

Table 3: The system performance by removing one set of features in the MoC framework

 features             top-1    top-10
 A    all             24.25    39.11
 -C   capitalization  24.26    39.20
 -H   hypertext       24.96    39.67
 -IR  IR              19.10‡   30.56‡
 -Lin linguistic      25.96†   39.36
 -Loc location        24.84    37.93
 -M   meta            24.69    38.91
 -Ms  meta section    24.68    38.71
 -Q   query log       21.27‡   33.95‡
 -T   title           24.40    38.89
 -U   URL             25.21    38.97

Table 4: The system performance by removing one set of features in the DeS framework

Evaluating the contribution of a feature by removing it from the system does not always show its value. For instance, if two types of features have similar effects, then the system performance may not change by eliminating only one of them. We thus conducted a different set of experiments by adding features to a baseline system. We chose a system in the monolithic combined framework (MoC), using only IR features, as the baseline system. We then built different systems by adding one additional set of features, and comparing with the baseline system. Table 5 shows performance on the top-1, top-10, and entropy metrics, ordered by difference in top-1.

Interestingly, all the features seem to be helpful when they are combined with the baseline IR features, including the linguistic features. We thus conclude that the linguistic features were not really useless, but instead were redundant with other features like capitalization (which helps detect proper nouns) and the Query Log features (which help detect linguistically appropriate phrases). The Query Log features are still the most effective among all the features we experimented with. However, the performance gap between our best system and a simple system using only IR and Query Log features is still quite large.

3.4.3 Different Query Log Sizes

The query log features were some of the most helpful features, second only to the IR features. These features used the top 7.5 million English queries from MSN Search.

 features             top-1    top-10   entropy
 IR                   13.63♭   25.67♭   0.0163299♭
 +Q   query log       22.36‡   35.88‡   0.0134891‡
 +T   title           19.90‡   34.17‡   0.0152316‡
 +Len length          19.22‡   33.43‡   0.0134298‡
 +Ms  meta section    19.02‡   31.90‡   0.0154484‡
 +H   hypertext       18.46‡   30.36‡   0.0150824‡
 +Lin linguistic      18.20‡   32.26‡   0.0146324‡
 +C   capitalization  17.41‡   33.16‡   0.0146999‡
 +Loc location        17.01‡   32.76‡   0.0154064‡
 +U   URL             16.72†   28.63‡   0.0157466‡
 +M   meta            16.71†   28.19‡   0.0160472†

Table 5: The system performance by adding one set of features in the MoC framework
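The MoC systems behind these tables score each candidate phrase with a single trained classifier over the feature sets. A minimal sketch of such a scorer is below; the weights and feature names are made up for illustration (a real system learns the weights, e.g. by logistic regression as in MoC IR, over many more features):

```python
import math

# Hypothetical trained weights over a few feature sets (IR + query log).
# These values are illustrative, not learned from the paper's data.
WEIGHTS = {"bias": -3.0, "log_tf": 1.2, "log_idf": 0.8, "log_query_freq": 1.5}

def candidate_score(features):
    """P(keyword | phrase) under a logistic model over phrase features."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def rank_candidates(candidates):
    """Sort candidate phrases (mapping phrase -> feature dict), best first."""
    return sorted(candidates, key=lambda p: -candidate_score(candidates[p]))
```

Scoring whole phrases this way, rather than labeling word by word, is what distinguishes the monolithic frameworks from the decomposed (DeS) one.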
In practice, deploying systems using this many features may be problematic. For instance, if we want to use 20 languages, and if each query entry uses about 20 bytes, the query log files would use 3GB. This amount of memory usage would not affect servers dedicated to keyword extraction, but might be problematic for servers that perform other functions (e.g. serving ads, or as part of a more general blogging or news system). In scenarios where keyword extraction is done on the client side, the query log file size is particularly important.

Query logs may also help speed up our system. One strategy we wanted to try is to consider only the phrases that appeared in the query log file as potential matches for keyword extraction. In this case, smaller query logs lead to even larger speedups. We thus ran a set of experiments under the MoC framework, using different query log file sizes (as measured by a frequency threshold cutoff). The log file sizes were very roughly inversely proportional to the cutoff. The leftmost point, with a cutoff of 18, corresponds to our 7.5 million query log file. The graph in Figure 1 shows both the top-1 and top-10 scores at different sizes. "Res. top-1" and "Res. top-10" are the results where we restricted the candidate phrases by eliminating those that do not appear in the query log file. As a comparison, the graph also shows the non-restricting version (i.e., top-1 and top-10). Note that of course the top-1 scores cannot be compared to the top-10 scores.

When the frequency cutoff threshold is small (i.e., large query log file size), this restricting strategy in fact improves the top-1 score slightly, but hurts the top-10 score. This phenomenon is not surprising, since restricting candidates may eliminate some keywords. This tends not to affect the top-1 score: when using the query log file as an input, the most probable keywords in a document almost always appear in the keyword file, if the keyword file is large. Less probable, but still good, keywords are less likely to appear in the keyword file. As the query log file size becomes smaller, the negative effect becomes more significant. However, if only the top-1 score is relevant in the application, then a reasonable cutoff point may be 1000, where the query log file has less than 100,000 entries. On the other hand, the top-1 and top-10 scores in the non-restricting version decrease gradually as the size of the query log file decreases. Fairly small files can be used with the non-restricting version if space is at a premium. After all, the extreme case is when no query log file is used, and as we have seen in Table 3, in that case the top-1 and top-10 scores are still 24.82 and 42.30, respectively.

[Figure 1: The performance of using different sizes of the query log file. The x-axis is the query log frequency threshold (10 to 100,000, log scale); the y-axis is the score; the curves show Res. top-1, Res. top-10, top-1, and top-10.]

4. RELATED WORK

There has been a moderate amount of previous work that is directly relevant to keyword extraction. In most cases, the keyword extraction was not applied to web pages, but instead was applied to a different text domain. Most previous systems were fairly simple, using either a small number of features, a simple learning method (Naive Bayes), or both. In this section, we describe this previous research in more detail.

4.1 GenEx

One of the best known programs for keyword extraction is Turney's GenEx system [22]. GenEx is a rule-based keyphrase extraction system with 12 parameters tuned using a Genetic Algorithm. GenEx has been trained and evaluated on a collection of 652 documents of three different types: journal articles, email messages, and web pages. Precisions of the top 5 phrases and top 15 phrases are reported as evaluation metrics, where the numbers are 0.239 and 0.128 respectively for all documents. Turney showed GenEx's superiority by comparing it with an earlier keyphrase extraction system trained by C4.5 [16] with several complicated features [22].

4.2 KEA and Variations

Concurrently with the development of GenEx, Frank et al. developed the KEA keyphrase extraction algorithm using a simple machine learning approach [7]. KEA first processes documents by removing stopwords and by stemming. The candidate phrases are represented using only three features: TFIDF; distance (the number of words before the first occurrence of the phrase, divided by the number of words in the whole document); and keyphrase-frequency, which is the number of times the candidate phrase occurs in other documents. The classifier is trained using the naive Bayes learning algorithm [14]. Frank et al. compared KEA with GenEx, and showed that KEA is slightly better in general, but the difference is not statistically significant [7].

KEA's performance was improved later by adding Web-related features [23]. In short, the number of documents returned by a search engine using the keyphrase as query terms is used as additional information. This feature is particularly useful when training and testing data are from different domains. In both intra- and inter-domain evaluations, the relative performance gain in precision is about 10%.

Kelleher and Luz also reported an enhanced version of KEA for web documents [13]. They exploited the link information of a web document by adding a "semantic ratio" feature, which is the frequency of the candidate phrase P in the original document D, divided by the frequency of P in documents linked by D. Using the Meta Keyword HTML tag as the source of annotations, they experimented with their approach on four sets of documents, where documents are connected by hyperlinks in each collection. On three of them, they reported significant improvement (45% to 52%) compared to the original version of KEA. Adding the semantic ratio to our system would be an interesting area of future research. It should, however, be noted that using the semantic ratio requires downloading and parsing several times as many web pages, which would introduce substantial load. Also, in practice, following arbitrary links from a page could result in simulating clicks on, e.g., unsubscribe links, leading to unexpected or undesirable results.

The use of linguistic information for keyword extraction was first studied by Hulth [12]. In this work, noun phrases and predefined part-of-speech tag patterns were used to help select phrase candidates. In addition, whether the candidate phrase has certain POS tags is used as a feature. Along with the three features in the KEA system, Hulth applied bagging [1] as the learning algorithm to train the system. The experimental results showed different degrees of improvement compared with systems that did not use linguistic information. However, a direct comparison to KEA was not reported.

4.3 Information Extraction

Analogous to keyword extraction, information extraction is also a problem that aims to extract or label phrases given a document [8, 2, 19, 5]. Unlike keyword extraction, information extraction tasks are usually associated with predefined semantic templates. The goal of the extraction tasks is to find certain phrases in the documents to fill the templates. For example, given a seminar announcement, the task may be finding the name of the speaker, or the starting time of the seminar. Similarly, the named entity recognition task [21] is to label phrases as semantic entities like person, location, or organization.

Information extraction is in many ways similar to keyword extraction, but the techniques used for information extraction are typically different. While most keyword extraction systems use the monolithic combined (MoC) framework, information extraction systems typically use the DeS framework, or sometimes MoS. Since identical phrases may have different labels (e.g., "Washington" can be either a person or a location even in the same document), candidates in a document are never combined. The choice of features is also very different. These systems typically use lexical features (the identity of specific words in or around the phrase), linguistic features, and even conjunctions of these features. In general, the feature space in these problems is huge, often hundreds of thousands of features. Lexical features for keyword extraction would be an interesting area of future research, although our intuition is that these features are less likely to be useful in this case.

4.4 Impedance Coupling

Ribeiro-Neto et al. [18] describe an Impedance Coupling technique for content-targeted advertising. Their work is perhaps the most extensive previously published work specifically on content-targeted advertising. However, Ribeiro-Neto et al.'s work is quite a bit different from ours. Most importantly, their work focused not on finding keywords on web pages, but on directly matching advertisements to those web pages. They used a variety of information, including the text of the advertisements, the destination web page of the ad, and the full set of keywords tied to a particular ad (as opposed to considering keywords one at a time). They then looked at the cosine similarity of these measures to potential destination pages.

There are several reasons we do not compare directly to the work of Ribeiro-Neto et al. Most importantly, we, like most researchers, do not have a database of advertisements (as opposed to just keywords) that we could use for experiments of this sort. In addition, they are solving a somewhat different problem than we are. Our goal is to find the most appropriate keywords for a specific page. This dovetails well with how contextual advertising is sold today: advertisers bid on specific keywords, and their bids on different keywords may be very different. For instance, a purveyor of digital cameras might use the same ad for "camera", "digital camera", "digital camera reviews", or "digital camera prices." The advertiser's bids will be very different in the different cases, because the probability that a click on such an ad leads to a sale will be very different. Ribeiro-Neto et al.'s goal is not to extract keywords, but to match web pages to advertisements. They do not try to determine which particular keyword on a page is a match, and in some cases, they match web pages to advertisements even when the web page does not contain any keywords chosen by an advertiser, which could make pricing difficult. Their techniques are also somewhat time-consuming, because they compute the cosine similarity between each web page and a bundle of words associated with each ad. They used only a small database (100,000 ads) applied to a small number of web pages (100), making such time-consuming comparisons possible. Real-world application would be far more challenging: optimizations are possible, but would need to be an area of additional research. In contrast, our methods are directly applicable today.

4.5 News Query Extraction

Henzinger et al. [11] explored the domain of keyword extraction from a news source, to automatically drive queries. In particular, they extracted query terms from the closed captioning of television news stories, to drive a search system that would retrieve related online news articles. Their basic system used TFIDF to score individual phrases, and achieved slight improvements from using TFIDF². They tried stemming, with mixed results. Because of the nature of broadcast news programs, where boundaries between topics are not explicit, they found improvements by using a history feature that automatically detected topic boundaries. They achieved their largest improvements by postprocessing articles to remove those that seemed too different from the news broadcast. This seems closely related to the previously mentioned work of Ribeiro-Neto et al., who found analogous comparisons between documents and advertisements helpful.

4.6 Email Query Extraction

Goodman and Carvalho [10] previously applied techniques similar to the ones described here to query extraction for email. Their goal was somewhat different than our goal here, namely to find good search queries, to drive traffic to search engines, although the same technology they used could be applied to find keywords for advertising. Much of their research focused on email-specific features, such as word occurrence in subject lines, and distinguishing new parts of an email message from earlier, "in-reply-to" sections. In contrast, our work focuses on many web-page-specific features, such as keywords in the URL and in meta-data. Our research goes beyond theirs in a number of other ways. Most importantly, we tried both information-extraction-inspired methods (DeS) and linguistic features. Here, we also examine the tradeoff between query file size and accuracy, showing that large query files are not necessary for near-optimal performance, if the restriction on words occurring in the query file is removed. Goodman et al. compare only to simple TFIDF-style baselines, while in our research, we compare to KEA. We also compute a form of inter-annotator agreement, something that was not previously done. The improvement over KEA and the near-human performance measurements are very important for demonstrating the high quality of our results.

5. CONCLUSIONS

The better we can monetize the web, the more features that can be provided. To give two examples, the explosion of free blogging tools and of free web-based email systems with large quotas have both been supported in large part by advertising, much of it using content-targeting. Better targeted ads are less annoying for users, and more likely to inform them of products that deliver real value to them. Better advertising thus creates a win-win-win situation: better web-based features, less annoying ads, and more information for users.

Our results demonstrate a large improvement over Kea. We attribute this improvement to the large number of helpful features we employed. While Kea employs only three features, we employed 12 different
sets of features; since each set contained multiple features, we actually had about 40 features overall. As we showed in Table 5, every one of these sets was helpful, although as we also showed, some of them were redundant with each other. GenEx, with 12 features, works about as well as Kea: the choice of features and the learning algorithm are also important. Our most important new feature was the query frequency file from MSN Search.

6. ACKNOWLEDGMENTS

We thank Emily Cook, Natalie Jennings, Samantha Dickson, Krystal Paige and Katie Neely, who annotated the data. We are also grateful to Alexei Bocharov, who helped us measure inter-annotator agreement.

7. REFERENCES

[1] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[2] M. Califf and R. Mooney. Bottom-up relational learning of pattern matching rules for information extraction. JMLR, 4:177-210, 2003.
[3] X. Carreras, L. Marquez, and J. Castro. Filtering-ranking perceptron learning for partial parsing. Machine Learning, 60(1-3):41-71, 2005.
[4] S. F. Chen and R. Rosenfeld. A gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, CMU, 1999.
[5] H. Chieu and H. Ng. A maximum entropy approach to information extraction from semi-structured and free text. In Proc. of AAAI-02, pages 786-791, 2002.
[6] Y. Even-Zohar and D. Roth. A sequential model for multiclass classification. In EMNLP-01, 2001.
[7] E. Frank, G. W. Paynter, I. H. Witten, C. Gutwin, and C. G. Nevill-Manning. Domain-specific keyphrase extraction. In Proc. of IJCAI-99, pages 668-673, 1999.
[8] D. Freitag. Machine learning for information extraction in informal domains. Machine Learning, 39(2/3):169-202, 2000.
[9] J. Goodman. Sequential conditional generalized iterative scaling. In ACL '02, 2002.
[10] J. Goodman and V. R. Carvalho. Implicit queries for email. In CEAS-05, 2005.
[11] M. Henzinger, B. Chang, B. Milch, and S. Brin. Query-free news search. In Proceedings of the 12th World Wide Web Conference, pages 1-10, 2003.
[12] A. Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proc. of EMNLP-03, pages 216-223, 2003.
[13] D. Kelleher and S. Luz. Automatic hypertext keyphrase detection. In IJCAI-05, 2005.
[14] T. Mitchell. Tutorial on machine learning over natural language documents, 1997. Available from.
[15] V. Punyakanok and D. Roth. The use of classifiers in sequential inference. In NIPS-00, 2001.
[16] J. R. Quinlan. C4.5: programs for machine learning. Morgan Kaufmann, San Mateo, CA, 1993.
[17] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), February 1989.
[18] B. Ribeiro-Neto, M. Cristo, P. B. Golgher, and E. S. de Moura. Impedance coupling in content-targeted advertising. In SIGIR-05, pages 496-503, 2005.
[19] D. Roth and W. Yih. Relational learning via propositional algorithms: An information extraction case study. In IJCAI-01, pages 1257-1263, 2001.
[20] C. Sutton and A. McCallum. Composition of conditional random fields for transfer learning. In Proceedings of HLT/EMNLP-05, 2005.
[21] E. F. Tjong Kim Sang. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In CoNLL-02, 2002.
[22] P. D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303-336, 2000.
[23] P. D. Turney. Coherent keyphrase extraction via web mining. In Proc. of IJCAI-03, pages 434-439, 2003.