Finding Advertising Keywords on Web Pages

Wen-tau Yih
Microsoft Research
1 Microsoft Way
Redmond, WA 98052
scottyih@microsoft.com

Joshua Goodman
Microsoft Research
1 Microsoft Way
Redmond, WA 98052
joshuago@microsoft.com

Vitor R. Carvalho
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
vitor@cs.cmu.edu

ABSTRACT

A large and growing number of web pages display contextual advertising based on keywords automatically extracted from the text of the page, and this is a substantial source of revenue supporting the web today. Despite the importance of this area, little formal, published research exists. We describe a system that learns how to extract keywords from web pages for advertisement targeting. The system uses a number of features, such as term frequency of each potential keyword, inverse document frequency, presence in meta-data, and how often the term occurs in search query logs. The system is trained with a set of example pages that have been hand-labeled with "relevant" keywords. Based on this training, it can then extract new keywords from previously unseen pages. Accuracy is substantially better than several baseline systems.

Categories and Subject Descriptors: H.3.1 [Content Analysis and Indexing]: Abstracting methods; H.4.m [Information Systems]: Miscellaneous

General Terms: Algorithms, experimentation

Keywords: keyword extraction, information extraction, advertising

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2006, May 23-26, 2006, Edinburgh, Scotland. ACM 1-59593-323-9/06/0005.

1. INTRODUCTION

Content-targeted advertising systems, such as Google's AdSense program and Yahoo's Contextual Match product, are becoming an increasingly important part of the funding for free web services. These programs automatically find relevant keywords on a web page, and then display advertisements based on those keywords. In this paper, we systematically analyze techniques for determining which keywords are relevant. We demonstrate that a learning-based technique using TFIDF features, web page metadata, and, most surprisingly, information from query log files, substantially outperforms competing methods, and even approaches human levels of performance by at least one measure.

Typical content-targeted advertising systems analyze a web page, such as a blog, a news page, or another source of information, to find prominent keywords on that page. These keywords are then sent to an advertising system, which matches the keywords against a database of ads. Advertising appropriate to the keyword is displayed to the user. Typically, if a user clicks on the ad, the advertiser is charged a fee, most of which is given to the web page owner, with a portion kept by the advertising service.

Picking appropriate keywords helps users in at least two ways. First, choosing appropriate keywords can lead to users seeing ads for products or services they would be interested in purchasing. Second, the better targeted the advertising, the more revenue that is earned by the web page provider, and thus the more interesting the applications that can be supported. For instance, free blogging services and free email accounts with large amounts of storage are both enabled by good advertising systems.

From the perspective of the advertiser, it is even more important to pick good keywords. For most areas of research, such as speech recognition, a 10% improvement leads to better products, but the increase in revenue is usually much smaller. For keyword selection, however, a 10% improvement might actually lead to nearly a 10% higher click-through rate, directly increasing potential revenue and profit.
In this paper, we systematically investigated several different aspects of keyword extraction. First, we compared looking at each occurrence of a word or phrase in a document separately, versus combining all of our information about the word or phrase. We also compared approaches that look at the word or phrase monolithically to approaches that decompose a phrase into separate words. Second, we examined a wide variety of information sources, analyzing which sources were most helpful. These included various meta-tags, title information, and even the words in the URL of the page. One surprisingly useful source of information was query frequency information from MSN Search query logs. That is, knowing the overall query frequency of a particular word or phrase on a page was helpful in determining if that word or phrase was relevant to that page.

We compared these approaches to several different baseline approaches, including a traditional TFIDF model; a model using TF and IDF features but with learned weights; and the KEA system [7]. KEA is also a machine learning system, but with a simpler learning mechanism and fewer features. As we will show, our system is substantially better than any of these baseline systems. We also compared our system to the maximum achievable given the human labeling, and found that on one measure, our system was in the same range as human levels of performance.

2. SYSTEM ARCHITECTURE

In this section, we introduce the general architecture of our keyword extraction system, which consists of the following four stages: preprocessor, candidate selector, classifier, and postprocessor.

2.1 Preprocessor

The main purpose of the preprocessor is to transform an HTML document into an easy-to-process, plain-text based document, while still maintaining important information. In particular, we want to preserve the blocks in the original HTML document, but remove the HTML tags. For example, text in the same table should be placed together without tags like <table>, <tr>, or <td>. We also preserve information about which phrases are part of the anchor text of the hypertext links. The meta section of an HTML document header is also an important source of useful information, even though most of the fields except the title are not displayed by web browsers.

The preprocessor first parses an HTML document, and returns blocks of text in the body, hypertext information, and meta information in the header. (We used Beautiful Soup, a Python library for HTML parsing, which can be downloaded from http://www.crummy.com/software/BeautifulSoup/) Because a keyword should not cross sentence boundaries, we apply a sentence splitter to separate text in the same block into various sentences. To evaluate whether linguistic information can help keyword extraction, we also apply a state-of-the-art part-of-speech (POS) tagger [6], and record the POS tag of each word. In addition, we have observed that most words or phrases that are relevant are short noun phrases. Therefore, having this information available as a feature would potentially be useful. We thus applied a state-of-the-art chunker to detect the base noun phrases in each document [15]. (These natural language processing tools can be downloaded from http://l2r.cs.uiuc.edu/cogcomp)

2.2 Candidate Selector

Our system considers each word or phrase (consecutive words) up to length 5 that appears in the document as a candidate keyword. This includes all keywords that appear in the title section, or in meta-tags, as well as words and phrases in the body. As mentioned previously, a phrase is not selected as a candidate if it crosses sentence or block boundaries. This strategy not only eliminates many trivial errors but also speeds up the processing time by considering fewer keyword candidates (see the sketch below).

Each phrase can be considered separately, or can be combined with all other occurrences of the same phrase in the same document. In addition, phrases can be considered monolithically, or can be decomposed into their constituent words. Putting together these possibilities, we ended up considering three different candidate selectors.
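The following is a minimal sketch of the candidate selection rule just described: every word n-gram up to length 5, generated within a single sentence, so that no candidate crosses sentence (or block) boundaries. The function name and input representation are illustrative assumptions, not the paper's actual code.

MAX_PHRASE_LEN = 5

def candidate_phrases(sentences):
    """Yield (phrase, sentence_index, word_index) triples.

    `sentences` is a list of token lists, one per sentence, as produced
    by the preprocessor's sentence splitter.
    """
    for s_idx, tokens in enumerate(sentences):
        for start in range(len(tokens)):
            for length in range(1, MAX_PHRASE_LEN + 1):
                end = start + length
                if end > len(tokens):
                    break
                yield (" ".join(tokens[start:end]), s_idx, start)

# Example: two sentences from a hypothetical page.
sents = [["digital", "camera", "reviews"], ["buy", "online"]]
print([p for p, _, _ in candidate_phrases(sents)])
# Note that "reviews buy" is never generated: it would cross a
# sentence boundary.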
2.2.1 Monolithic, Separate (MoS)

In the Monolithic Separate candidate selector, fragments that appear in different document locations are considered as different candidates even if their content is identical. That is, if the phrase "digital camera" occurred once in the beginning of the document, and once in the end, we would consider them as two separate candidates, with potentially different features. While some features, such as the phrase length and TF values, would all be the same for all of these candidates, others, such as whether the phrase was capitalized, would be different. We call this variation Separate. In this candidate selector, all features of a candidate phrase are based on the phrase as a whole. For example, term frequency counts the number of times the exact phrase occurs in the document, rather than using the term frequency of individual words. We refer to this as Monolithic. To simplify our description, we use MoS to refer to this design.

2.2.2 Monolithic, Combined (MoC)

Since we only care about the ranked list of keywords, not where the keywords are extracted, we can reduce the number of candidates by combining identical (case ignored) fragments. For instance, "Weather report" in the title and "weather report" in the body are treated as only one candidate. We use MoC to refer to this design. Note that even in the Combined case, word order matters; e.g., the phrase "weather report" is treated as different than the phrase "report weather."

2.2.3 Decomposed, Separate (DeS)

Keyword extraction seems closely related to well-studied areas like information extraction, named-entity recognition, and phrase labeling: they all attempt to find important phrases in documents. State-of-the-art information extraction systems (e.g., [5, 20]), named-entity recognition systems (e.g., [21]), and phrase labeling systems (e.g., [3]) typically decompose phrases into individual words, rather than examining them monolithically. It thus seemed worthwhile to examine similar techniques for keyword extraction. Decomposing a phrase into its individual words might have certain advantages. For instance, if the phrase "pet store" occurred only once in a document, but the phrases "pet" and "store" each occurred many times separately, such a decomposed approach would make it easy to use this knowledge.

Instead of selecting phrases directly as candidates, the decomposed approach tries to assign a label to each word in a document, as is done in related fields. That is, each of the words in a document is selected as a candidate, with multiple possible labels. The labels can be B (beginning of a keyphrase, when the following word is also part of the keyphrase), I (inside a keyphrase, but not the first or last word), L (last word of a keyphrase), U (unique word of a keyword of length 1), and finally O (outside any keyword or keyphrase). This word-based framework requires a multi-class classifier to assign these 5 labels to a candidate word. In addition, it also needs a somewhat more sophisticated inference procedure to construct a ranked list of keywords in the postprocessing stage. The details will be described in Section 2.4. We use DeS to refer to this design.
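As a concrete illustration, here is a sketch, under assumed data structures, of how a hand-labeled keyword span could be turned into the per-word B/I/L/U/O labels used by the decomposed framework. The helper name and its inputs are hypothetical.

def bilou_labels(tokens, keyword_spans):
    """tokens: list of words; keyword_spans: list of (start, end)
    pairs, end exclusive, marking annotated keyphrases."""
    labels = ["O"] * len(tokens)
    for start, end in keyword_spans:
        if end - start == 1:
            labels[start] = "U"          # single-word keyword
        else:
            labels[start] = "B"          # first word of a keyphrase
            for i in range(start + 1, end - 1):
                labels[i] = "I"          # interior words
            labels[end - 1] = "L"        # last word
    return labels

tokens = ["cheap", "digital", "camera", "deals", "today"]
print(bilou_labels(tokens, [(1, 3), (4, 5)]))
# ['O', 'B', 'L', 'O', 'U']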
2.3 Classifier

The core of our keyword extraction system is a classifier trained using machine learning. When a monolithic framework (MoS or MoC) is used, we train a binary classifier. Given a phrase candidate, the classifier predicts whether the word or phrase is a keyword or not. When a decomposed framework (DeS) is used, we train a multi-class classifier. Given a word, the goal is to predict the label as B, I, L, U, or O. Since whether a phrase is a keyword is ambiguous by nature, instead of a hard prediction, we actually need the classifier to predict how likely it is that a candidate has a particular label. In other words, the classifier needs to output some kind of confidence scores or probabilities. The scores or probabilities can then be used later to generate a ranked list of keywords, given a document. We introduce the learning algorithm and the features we used as follows.

2.3.1 Learning Algorithm

We used logistic regression as the learning algorithm in all experiments reported here. Logistic regression models are also called maximum entropy models, and are equivalent to single layer neural networks that are trained to minimize entropy. In related experiments on an email corpus, we tried other learning algorithms, such as linear support-vector machines, decision trees, and naive Bayes, but in every case we found that logistic regression was equally good or better than the other learning algorithms for this type of task.

Formally, we want to predict an output variable Y, given a set of input features X. In the monolithic framework, Y would be 1 if a candidate phrase is a relevant keyword, and 0 otherwise. X would be a vector of feature values associated with a particular candidate. For example, the vector might include the distance from the start of the document, and the term frequency (TF) and document frequency (DF) values for the phrase. The model returns the estimated probability, P(Y=1 | X = x). The logistic regression model learns a vector of weights, w, one for each input feature in X. Continuing our example, it might learn a weight for the TF value, a weight for the DF value, and a weight for the distance value. The actual probability returned is

    P(Y=1 | X = x) = exp(x · w) / (1 + exp(x · w))

An interesting property of this model is its ability to simulate some simpler models. For instance, if we take the logarithms of the TF and DF values as features, then if the corresponding weights are +1 and -1, we end up with

    log(TF) - log(DF) = log(TF/DF)

In other words, by including TF and DF as features, we can simulate the TFIDF model, but the logistic regression model also has the option to learn different weightings.

To actually train a logistic regression model, we take a set of training data, and try to find the weight vector w that makes it as likely as possible. In our case, the training instances consist of every possible candidate phrase selected from the training documents, with Y=1 if they were labeled relevant, and 0 otherwise. We use the SCGIS method [9] as the actual training method. However, because there is a unique best logistic regression model for any training set, and the space is convex, the actual choice of training algorithm makes relatively little difference.

In the monolithic models, we only try to model one decision, whether a word or phrase is relevant or not, so the variable Y can only take values 0 or 1. But for the decomposed framework, we try to determine for each word whether it is the beginning of a phrase, inside a phrase, etc., with 5 different possibilities (the BILUO labels). In this case, we use a generalized form of the logistic regression model:

    P(Y=i | x) = exp(x · w_i) / Σ_{j=1}^{5} exp(x · w_j)

That is, there are 5 different sets of weights, one for each possible output value. Note that in practice, for both forms of logistic regression, we always append a special "always on" feature (i.e. a value of 1) to the x vector, which serves as a bias term. In order to prevent overfitting, we also apply a Gaussian prior with variance 0.3 for smoothing [4].
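The following is a minimal sketch of the binary model above, with the "always on" bias feature appended. The weights here are illustrative; the paper trains them with SCGIS, which we do not reproduce.

import math

def p_relevant(features, weights):
    """features and weights are lists; a constant 1.0 is appended to
    the features to serve as the bias term, so weights has one extra
    entry."""
    x = features + [1.0]
    score = sum(f * w for f, w in zip(x, weights))
    return math.exp(score) / (1.0 + math.exp(score))

# With log(TF) and log(DF) as features and weights +1 and -1 (bias 0),
# the score is log(TF) - log(DF) = log(TF/DF), i.e. the model can
# simulate a TFIDF-style ranker while remaining free to learn
# different weightings.
tf, df = 12.0, 3.0
print(p_relevant([math.log(tf), math.log(df)], [1.0, -1.0, 0.0]))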
2.3.2 Features

We experimented with various features that are potentially useful. Some of these features are binary, taking only the values 0 or 1, such as whether the phrase appears in the title. Others are real-valued, such as the TF or DF values or their logarithms. Below we describe these features and their variations when used in the monolithic (MoS) and decomposed (DeS) frameworks.

Features used in the monolithic, combined (MoC) framework are basically the same as in the MoS framework. If a candidate phrase in the MoC framework has several occurrences in the document, which correspond to several candidate phrases in the MoS framework, the features are combined using the following rules.

1. For binary features, the combined feature is the union of the corresponding features. That is, if this feature is active in any of the occurrences, then it is also active in the combined candidate.

2. For real-valued features, the combined feature takes the smallest value of the corresponding features.

To give an example for the binary case, if one occurrence of the term "digital camera" is in the anchor text of a hyperlink, then the anchor text feature is active in the combined candidate. Similarly, for the location feature, which is a real-valued case, the location of the first occurrence of this phrase will be used as the corresponding combined value, since its value is the smallest.
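A sketch of these combination rules follows: over all MoS occurrences of the same (case-folded) phrase, binary features are OR-ed together and real-valued features take their smallest value. The feature names and the dict representation are assumptions for illustration.

def combine_occurrences(occurrences, binary_keys, real_keys):
    """occurrences: list of per-occurrence feature dicts for one
    phrase, as produced in the MoS framework."""
    combined = {}
    for key in binary_keys:
        # Rule 1: active in the combined candidate if active anywhere.
        combined[key] = any(occ[key] for occ in occurrences)
    for key in real_keys:
        # Rule 2: e.g. location keeps the first (smallest) occurrence.
        combined[key] = min(occ[key] for occ in occurrences)
    return combined

occs = [{"in_anchor": False, "wordDocRatio": 0.05},
        {"in_anchor": True,  "wordDocRatio": 0.90}]
print(combine_occurrences(occs, ["in_anchor"], ["wordDocRatio"]))
# {'in_anchor': True, 'wordDocRatio': 0.05}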
In the following description, features are binary unless otherwise noted.

2.3.2.1 Lin: linguistic features.

The linguistic information used in feature extraction includes two types of POS tags, noun (NN & NNS) and proper noun (NNP & NNPS), and one type of chunk, noun phrase (NP). The variations used in MoS are: whether the phrase contains these POS tags; whether all the words in that phrase share the same POS tags (either proper noun or noun); and whether the whole candidate phrase is a noun phrase. For DeS, they are: whether the word has the POS tag; whether the word is the beginning of a noun phrase; whether the word is in a noun phrase, but not the first word; and whether the word is outside any noun phrase.

2.3.2.2 C: capitalization.

Whether a word is capitalized is an indication of being part of a proper noun, or an important word. This set of features for MoS is defined as: whether all the words in the candidate phrase are capitalized; whether the first word of the candidate phrase is capitalized; and whether the candidate phrase has a capitalized word. For DeS, it is simply whether the word is capitalized.

2.3.2.3 H: hypertext.

Whether a candidate phrase or word is part of the anchor text for a hypertext link is extracted as the following features. For MoS, they are: whether the whole candidate phrase matches exactly the anchor text of a link; whether all the words of the candidate phrase are in the same anchor text; and whether any word of the candidate phrase belongs to the anchor text of a link. For DeS, they are: whether the word is the beginning of the anchor text; whether the word is in the anchor text of a link, but not the first word; and whether the word is outside any anchor text.

2.3.2.4 Ms: meta section features.

The header of an HTML document may provide additional information embedded in <meta> tags. Although the text in this region is usually not seen by readers, whether a candidate appears in this meta section seems important. For MoS, the feature is whether the whole candidate phrase is in the meta section. For DeS, the features are: whether the word is the first word in a meta tag; and whether the word occurs somewhere in a meta tag, but not as the first word.

2.3.2.5 T: title.

The only human readable text in the HTML header is the <title>, which is usually put in the window caption by the browser. For MoS, the feature is whether the whole candidate phrase is in the title. For DeS, the features are: whether the word is the beginning of the title; and whether the word is in the title, but not the first word.

2.3.2.6 M: meta features.

In addition to <title>, several meta tags are potentially related to keywords, and are used to derive features. In the MoS framework, the features are: whether the whole candidate phrase is in the meta-description; whether the whole candidate phrase is in the meta-keywords; and whether the whole candidate phrase is in the meta-title. For DeS, the features are: whether the word is the beginning of the meta-description; whether the word is in the meta-description, but not the first word; whether the word is the beginning of the meta-keywords; whether the word is in the meta-keywords, but not the first word; whether the word is the beginning of the meta-title; and whether the word is in the meta-title, but not the first word.

2.3.2.7 U: URL.

A web document has one additional highly useful property: the name of the document, which is its URL. For MoS, the features are: whether the whole candidate phrase is part of the URL string; and whether any word of the candidate phrase is in the URL string. In the DeS framework, the feature is whether the word is in the URL string.

2.3.2.8 IR: information retrieval oriented features.

We consider the TF (term frequency) and DF (document frequency) values of the candidate as real-valued features. The document frequency is derived by counting how many documents in our web page collection contain the given term. In addition to the original TF and DF frequency numbers, log(TF+1) and log(DF+1) are also used as features. The features used in the monolithic and the decomposed frameworks are basically the same, where for DeS, the "term" is the candidate word.

2.3.2.9 Loc: relative location of the candidate.

The beginning of a document often contains an introduction or summary with important words and phrases. Therefore, the location of the occurrence of the word or phrase in the document is also extracted as a feature. Since the length of a document or a sentence varies considerably, we take only the ratio of the location instead of the absolute number. For example, if a word appears in the 10th position, while the whole document contains 200 words, the ratio is then 0.05. These features are the same for the monolithic and decomposed frameworks. When the candidate is a phrase, its first word is used as its location.

There are three different relative locations used as features: wordRatio, the relative location of the candidate in the sentence; sentRatio, the location of the sentence where the candidate is, divided by the total number of sentences in the document; and wordDocRatio, the relative location of the candidate in the document. In addition to these 3 real-valued features, we also use their logarithms as features. Specifically, we used log(1+wordRatio), log(1+sentRatio), and log(1+wordDocRatio).

2.3.2.10 Len: sentence and document length.

The length (in words) of the sentence (sentLen) where the candidate occurs, and the length of the whole document (docLen) (words in the header are not included) are used as features. Similarly, log(1+sentLen) and log(1+docLen) are also included.

2.3.2.11 phLen: length of the candidate phrase.

For the monolithic framework, the length of the candidate phrase (phLen) in words and log(1+phLen) are included as features. These features are not used in the decomposed framework.

2.3.2.12 Q: query log.

The query log of a search engine reflects the distribution of the keywords people are most interested in. We use this information to create the following features. For these experiments, unless otherwise mentioned, we used a log file with the most frequent 7.5 million queries.

For the monolithic framework, we consider one binary feature, whether the phrase appears in the query log, and two real-valued features, the frequency with which it appears and the log value, log(1+frequency); these are sketched below. For the decomposed framework, we consider more variations of this information: whether the word appears in the query log file as the first word of a query; whether the word appears in the query log file as an interior word of a query; and whether the word appears in the query log file as the last word of a query. The frequency values of the above features and their log values (log(1+f), where f is the corresponding frequency value) are also used as real-valued features. Finally, whether the word never appears in any query log entries is also a feature.
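Here is a sketch of the monolithic query-log features: whether the phrase occurs in the query log at all, its raw frequency, and log(1+frequency). Representing the 7.5-million-query file as a phrase-to-frequency dict is our assumption for illustration.

import math

def query_log_features(phrase, query_log):
    freq = query_log.get(phrase.lower(), 0)
    return {
        "in_query_log": 1 if freq > 0 else 0,   # binary feature
        "query_freq": float(freq),              # real-valued feature
        "log_query_freq": math.log(1 + freq),   # its log variant
    }

log = {"digital camera": 51234, "weather report": 10987}
print(query_log_features("Digital Camera", log))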
2.4 Postprocessor

After the classifier predicts the probabilities of the candidates associated with the possible labels, our keyword extraction system generates a list of keywords ranked by the probabilities. When the Monolithic Combined framework is used, we simply return the most probable words or phrases. When the Monolithic Separate (MoS) framework is used, the highest probability among identical fragments is picked as the probability of the phrase.

In the Decomposed Separate (DeS) framework, we need to convert probabilities of individual words being phrase components (Beginning, Inside, etc.) into probabilities of relevance of whole phrases. Typically, in phrase labeling problems like information extraction, this conversion is done with the Viterbi [17] algorithm, to find the most probable assignment of the word label sequence of each sentence [5]. We rejected this method for two reasons. First, because in our training set only a few words per document tend to be labeled as relevant, our system almost never assigns high probability to a particular phrase, and thus the Viterbi assignment would typically reject all phrases. Second, in typical information extraction applications, we not only want to find various entities, we also want to find their relationships and roles. In our application, we simply want a list of potential entities. It is thus fine to extract potentially overlapping strings as potential keywords, something that would not work well if role-assignment was required.

Therefore, we use the following mechanism to estimate the probability of a phrase in the DeS framework. Given a phrase of length n, we calculate the overall probability for the phrase by multiplying the probabilities of the individual words having the correct labels of the label sequence. For example, if n=3, then the correct label sequence is B, I, L. The probability of this phrase being a keyword, p1, is derived by p1 = P(w1=B) × P(w2=I) × P(w3=L). If the phrase is not a keyword, then the correct label sequence is O, O, O. The corresponding probability, p0, is then p0 = P(w1=O) × P(w2=O) × P(w3=O). The actual probability used for this phrase is then p1/(p0+p1). Among the normalization methods we tried, this strategy works the best in practice.
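Below is a sketch of this postprocessing rule. `word_probs` holds, for each word of a candidate phrase, a dict of per-label probabilities from the multi-class classifier; the data layout is an assumption, and the single-word case (label U) is our extrapolation from the labeling scheme, since the text only works through the n=3 example.

def phrase_probability(word_probs):
    n = len(word_probs)
    if n == 1:
        keyword_seq = ["U"]            # assumed handling of length-1
    else:
        keyword_seq = ["B"] + ["I"] * (n - 2) + ["L"]
    p1 = 1.0  # probability of the "is a keyword" label sequence
    p0 = 1.0  # probability of the all-O ("not a keyword") sequence
    for probs, label in zip(word_probs, keyword_seq):
        p1 *= probs[label]
        p0 *= probs["O"]
    return p1 / (p0 + p1)

# Three words, as in the n = 3 example (correct sequence B, I, L):
probs = [{"B": 0.5, "I": 0.1, "L": 0.1, "U": 0.1, "O": 0.2},
         {"B": 0.1, "I": 0.6, "L": 0.1, "U": 0.1, "O": 0.1},
         {"B": 0.1, "I": 0.1, "L": 0.5, "U": 0.1, "O": 0.2}]
print(phrase_probability(probs))  # p1 / (p0 + p1)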
3. EXPERIMENTS

This section reports the experimental results comparing our system with several baseline systems, the comparisons between the variations of our system, the contribution of individual features, and the impact of reducing the search query log size. We first describe how the documents were obtained and annotated, as well as the performance measures.

3.1 Data and Evaluation Criteria

The first step was to obtain and label data, namely a set of web pages. We collected these documents at random from the web, subject to the following criteria. First, each page was in a crawl of the web from MSN Search. Second, each page displayed content-targeted advertising (detected by the presence of special JavaScript). This criterion was designed to make sure we focused on the kind of web pages where content-targeted advertising would be desirable, as opposed to truly random pages. Third, the pages also had to occur in the Internet Archive (http://www.archive.org). We hope to share our labeled corpus, so other researchers can experiment with it; choosing pages in the Internet Archive may make that sharing easier. Fourth, we selected no more than one page per domain (in part because of copyright issues, and in part to ensure diversity of the corpus). Fifth, we tried to eliminate foreign language and adult pages. Altogether, we collected 1109 pages.

Next, we selected 30 of these pages at random to be used for inter-annotator agreement measurements. We then randomly split the remaining pages into 8 sets of 120 pages and one set of 119 pages. We gave each set to one of our annotators. For each annotator, the first time they labeled a set of web pages, we also gave them the additional 30 pages for inter-annotator agreement, scattered randomly throughout (so they received 150 pages total the first time, and 120 or 119 on subsequent sets). Several pages were rejected for various reasons, such as one annotator who did not complete the process, several foreign language pages that slipped through our screening process, etc. As a result, 828 web documents had legitimate annotations and were used to train and test the system.

Annotators were instructed to look at each web page, and determine appropriate keywords. In particular, they were given about 3 pages of instructions with actual examples. The core of the instructions was essentially as follows:

    Advertising on the web is often done through keywords. Advertisers pick a keyword, and their ad appears based on that. For instance, an advertiser like Amazon.com would pick a keyword like "book" or "books." If someone searches for the word "book" or "books", an Amazon ad is shown. Similarly, if the keyword "book" is highly prominent on a web page, Amazon would like an ad to appear.

    We need to show our computer program examples of web pages, and then tell it which keywords are "highly prominent." That way, it can learn that words like "the" and "click here" are never highly prominent. It might learn that words that appear on the right (or maybe the left) are more likely to be highly prominent, etc.

    Your task is to create the examples for the system to learn from. We will give you web pages, and you should list the highly prominent words that an advertiser might be interested in.

There was one more important instruction, which was to try to use only words or phrases that actually occurred on the page being labeled. The remaining portion of the instructions gave examples and described technical details of the labeling process.

We used a snapshot of the pages, to make sure that the training, testing, and labeling processes all used identical pages. The snapshotting process also had the additional advantage that most images, and all content-targeted advertising, were not displayed to the annotators, preventing them from either selecting terms that occurred only in images, or from being polluted by a third-party keyword selection process.

3.2 Performance measures

We computed three different performance measures of our various systems. The first performance measure is simply the top-1 score. To compute this measure, we counted the number of times the top output of our system for a given page was in the list of terms described by the annotator for that page. We divided this number by the maximum achievable top-1 score, and multiplied by 100. To get the maximum achievable top-1 score, for each test document, we first removed any annotations that did not occur somewhere in the web page (to eliminate spelling mistakes, and occasional annotator confusion). The best achievable score was then the number of documents that still had at least one annotation. We counted answers as correct if they matched, ignoring case, and with all whitespace collapsed to a single space.

The second performance measure is the top-10 score, which is similar to the top-1 measure but considers 10 candidates instead of 1. To compute this measure, for each web page, we counted how many of the top 10 outputs of our system were in the list of terms described by the annotator for that page. The sum of these numbers was then divided by the maximum achievable top-10 score, and multiplied by 100.

It is important in an advertising system to not only extract accurate lists of keywords, but to also have accurate probability estimates. For instance, consider two keywords on a particular page, say "digital camera", which might monetize at $1 per click, and "CD-R media", which monetizes at, say, 10 cents per click. If there is a 50% chance that "CD-R media" is relevant, and only a 10% chance that "digital camera" is relevant, overall expected revenue from showing "digital camera" ads will still be higher than from showing "CD-R media" ads. Therefore, accurately estimating probabilities leads to potentially higher advertising revenue.

The most commonly used measure of the accuracy of probability estimates is entropy. For a given term t and a given web page p, our system computes the probability that the word is relevant to the page, P(t|p) (i.e., the probability that the annotator listed the word on the relevant list). The entropy of a given prediction is -log2 P(t|p) if t is relevant to p, and -log2(1 - P(t|p)) if t is not relevant to p. Lower entropies correspond to better probability estimates, with 0 being ideal. When we report the entropy, we will report the average entropy across all words and phrases up to length 5 in all web pages in the test set, with duplicates within a page counted as one occurrence. The DeS framework was not easily amenable to this kind of measurement, so we will only report the entropy measure for the MoS and MoC methods.
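The following sketch restates these measures in code. Matching ignores case and collapses whitespace, as specified; treating the per-page maximum for top-N as min(N, number of surviving annotations) is our reading, not stated explicitly in the text, and the data structures are assumptions.

import math
import re

def normalize(s):
    return re.sub(r"\s+", " ", s.strip().lower())

def top_n_score(pages, n):
    """pages: (ranked_outputs, annotator_terms) pairs; annotator terms
    that do not occur on the page are assumed already removed."""
    correct = achievable = 0
    for outputs, gold in pages:
        gold = {normalize(g) for g in gold}
        achievable += min(n, len(gold))
        correct += sum(1 for o in outputs[:n] if normalize(o) in gold)
    return 100.0 * correct / achievable if achievable else 0.0

def prediction_entropy(p, relevant):
    """-log2 P(t|p) if t is relevant to p, else -log2(1 - P(t|p));
    the reported metric averages this over all candidates."""
    return -math.log2(p) if relevant else -math.log2(1.0 - p)

pages = [(["digital camera", "the"], ["Digital  Camera"]),
         (["foo"], ["bar"])]
print(top_n_score(pages, 1))            # 50.0
print(prediction_entropy(0.9, True))    # ~0.152: confident and correct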
3.3 Inter-annotator Agreement

We wanted to compute some form of inter-annotator agreement, and the typical choice for this is the kappa measure. However, that measure is designed for annotations with a small, fixed number of choices, where a prior probability of selecting each choice can be determined. It was not clear how to apply it to a problem with thousands of possible annotations, with the possible annotations different on every page.

Instead, we used our inter-annotator agreement data to compute an upper bound for the possible performance on our top-1 measure. In particular, we designed an accuracy measure called committee-top-1. Essentially this number measures roughly how well a committee of four people could do on this task, by taking the most common answer selected by the four annotators. To compute this, we performed the following steps. First, we selected a page at random from the set of 30 web pages labeled for inter-annotator agreement. Then, we randomly selected one of the five annotators as arbitrator, whose results we called "correct." We then merged the results of the remaining four annotators, and found the single most frequent one. Finally, if that result was on the "correct" keywords list, we considered it correct; otherwise we considered it wrong. The average committee-top-1 score from 1000 samples was 25.8. This gives a general feeling for the difficulty of the problem. Due to the small number of documents used (30) and the small number of annotators (5), there is considerable uncertainty in this estimate. In fact, in some cases, our experimental results are slightly better than this number, which we attribute to the small size of the sample used for inter-annotator agreement.
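A sketch of the committee-top-1 procedure follows: sample a page, pick one of the five annotators as arbitrator, take the most frequent answer among the other four, and score it against the arbitrator's list. The data layout (each page as a list of five keyword lists) is an assumption.

import random
from collections import Counter

def committee_top1(pages, samples=1000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        annotations = rng.choice(pages)           # one of the 30 pages
        arbiter = rng.randrange(len(annotations)) # "correct" annotator
        correct = set(annotations[arbiter])
        merged = Counter()
        for i, terms in enumerate(annotations):
            if i != arbiter:
                merged.update(terms)
        top, _ = merged.most_common(1)[0]         # committee's answer
        hits += top in correct
    return 100.0 * hits / samples

pages = [[["book"], ["book", "novel"], ["book"], ["novel"], ["book"]]]
print(committee_top1(pages, samples=100))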
3.4 Results

All experiments were performed using 10-way cross validation. In most cases, we picked an appropriate baseline, and computed statistical significance of differences between results and this baseline. We indicate the baseline with a * symbol. We indicate significance at the 95% level with a † symbol, and at the 99% level with a ‡ symbol. Significance tests were computed with a two-tailed paired t-test.

3.4.1 Overall Performance

We begin by comparing the overall performance of different systems and configurations. The top-1 and top-10 scores of these different systems are listed in Table 1. Among the system configurations we tested, the monolithic combined (MoC) framework is in general the best. In particular, when using all but the linguistic features, this configuration achieves the highest top-1 and top-10 scores in all the experiments, just slightly (and not statistically significantly) better than the same framework with all features.

The monolithic separate (MoS) system with all features performs worse for both top-1 and top-10, although only the top-10 result was statistically significant. Despite its prevalence in the information extraction community, for this task, the decomposed separate (DeS) framework is significantly worse than the monolithic approach. We hypothesize that the advantages of using features based on phrases as a whole (used in MoS and MoC) outweigh the advantages of features that combine information across words.

In addition to the different configurations of our systems, we also compare with a state-of-the-art keyword extraction system, KEA [7], by providing it with our preprocessed, plain-text documents. KEA is also a machine learning-based system. Although it relies on simpler features, it uses more sophisticated information retrieval techniques like removing stop words and stemming for preprocessing. Our best system is substantially better than KEA, with relative improvements of 27.5% and 22.9% on top-1 and top-10 scores, respectively.

We tried a very simple baseline system, MoC TFIDF, which simply uses traditional TFIDF scores. This system only reaches 13.01 in top-1 and 19.03 in top-10. We also tried using TF and IDF features, but allowing the weights to be trained with logistic regression (MoC IR). Training slightly improved top-1 to 13.63, and substantially improved the top-10 score to 25.67.

Notice that the top-1 score of our best system actually exceeds the committee-top-1 inter-annotator agreement score. There was considerable uncertainty in that number due to the small number of annotators and small number of documents for inter-annotator agreement, but we interpret these results to mean that our best system is in the same general range as human performance, at least on top-1 score.

  system                               top-1    top-10
  MoC (Monolithic, Combined), -Lin     30.06*   46.97*
  MoC (Monolithic, Combined), All      29.94    46.45
  MoS (Monolithic, Separate), All      27.95    44.13‡
  DeS (Decomposed, Separate), All      24.25‡   39.11‡
  KEA [7]                              23.57‡   38.21‡
  MoC (Monolithic, Combined), IR       13.63‡   25.67‡
  MoC (Monolithic, Combined), TFIDF    13.01‡   19.03‡

  Table 1: Performance of different systems

3.4.2 Feature Contribution

One important and interesting question is the contribution of different types of features; namely, how important is a particular type of information to the keyword extraction task. We studied this problem by conducting a series of ablation studies: removing features of a specific type to see how much they contribute. We repeated these experiments in the three different candidate selector frameworks of our keyword extraction system to see if the same information affects the system differently when the framework changes. Tables 2, 3, and 4 show the detailed results. Each row is either the baseline system, which uses all features, or a compared system that uses all except one specific kind of feature. In addition to the top-1 and top-10 scores, we also compare the entropies of the various systems using the same framework. Recall that lower entropies are better. Entropy is a somewhat smoother, more sensitive measure than top-n scores, making it easier to see differences between different systems. Note that because the numbers of candidates are different, entropies from systems using different frameworks are not comparable.
  features              top-1    top-10   entropy
  A    all              27.95*   44.13*   0.0120040*
  -C   capitalization   27.39    43.50    0.0120945‡
  -H   hypertext        27.39    43.82    0.0120751‡
  -IR  IR               18.77‡   33.60‡   0.0149899‡
  -Len length           27.10    42.45†   0.0121040
  -Lin linguistic       28.05    44.89    0.0122166‡
  -Loc location         27.24    42.64‡   0.0121860‡
  -M   meta             27.81    44.05    0.0120080
  -Ms  meta section     27.52    43.09†   0.0120390
  -Q   query log        20.68‡   39.10‡   0.0129330‡
  -T   title            27.81    44.25    0.0120040
  -U   URL              26.09†   44.14    0.0121409†

  Table 2: The system performance by removing one set of features in the MoS framework

  features              top-1    top-10   entropy
  A    all              29.94*   46.45*   0.0113732*
  -C   capitalization   30.11    46.27    0.0114219†
  -H   hypertext        30.79    45.85†   0.0114370
  -IR  IR               25.42‡   42.26‡   0.0119463‡
  -Len length           30.49    44.74†   0.0119803‡
  -Lin linguistic       30.06    46.97    0.0114853‡
  -Loc location         29.52    44.63†   0.0116400‡
  -M   meta             30.10    46.78    0.0113633‡
  -Ms  meta section     29.33    46.33    0.0114031
  -Q   query log        24.82†   42.30‡   0.0121417‡
  -T   title            28.83    46.94    0.0114020
  -U   URL              30.53    46.39    0.0114310

  Table 3: The system performance by removing one set of features in the MoC framework

  features              top-1    top-10
  A    all              24.25*   39.11*
  -C   capitalization   24.26    39.20
  -H   hypertext        24.96    39.67
  -IR  IR               19.10‡   30.56‡
  -Lin linguistic       25.96†   39.36
  -Loc location         24.84    37.93
  -M   meta             24.69    38.91
  -Ms  meta section     24.68    38.71
  -Q   query log        21.27‡   33.95‡
  -T   title            24.40    38.89
  -U   URL              25.21    38.97

  Table 4: The system performance by removing one set of features in the DeS framework

By examining these results, we found that IR features and Query Log features are the most helpful, consistently in all three frameworks. The system performance dropped significantly after removing either of them. However, the impact of other features was not so clear. The location feature also seems to be important, but it affects the top-10 score much more than the top-1 score. Surprisingly, unlike the cases in other typical phrase labeling problems, linguistic features don't seem to help in this keyword extraction domain. In fact, removing this set of features in the decomposed separate framework improves the top-1 score a little. We also observe more statistically significant differences in the entropy metric.

Evaluating the contribution of a feature by removing it from the system does not always show its value. For instance, if two types of features have similar effects, then the system performance may not change by eliminating only one of them. We thus conducted a different set of experiments by adding features to a baseline system. We chose a system in the monolithic combined framework (MoC), using only IR features, as the baseline system. We then built different systems by adding one additional set of features, and compared them with the baseline system. Table 5 shows performance on the top-1, top-10, and entropy metrics, ordered by difference in top-1.

Interestingly, all the features seem to be helpful when they are combined with the baseline IR features, including the linguistic features. We thus conclude that the linguistic features were not really useless, but instead were redundant with other features like capitalization (which helps detect proper nouns) and the Query Log features (which help detect linguistically appropriate phrases). The Query Log features are still the most effective among all the features we experimented with. However, the performance gap between our best system and a simple system using only IR and Query Log features is still quite large.

  features              top-1    top-10   entropy
  IR                    13.63*   25.67*   0.0163299*
  +Q   query log        22.36‡   35.88‡   0.0134891‡
  +T   title            19.90‡   34.17‡   0.0152316‡
  +Len length           19.22‡   33.43‡   0.0134298‡
  +Ms  meta section     19.02‡   31.90‡   0.0154484‡
  +H   hypertext        18.46‡   30.36‡   0.0150824‡
  +Lin linguistic       18.20‡   32.26‡   0.0146324‡
  +C   capitalization   17.41‡   33.16‡   0.0146999‡
  +Loc location         17.01‡   32.76‡   0.0154064‡
  +U   URL              16.72†   28.63‡   0.0157466‡
  +M   meta             16.71†   28.19‡   0.0160472†

  Table 5: The system performance by adding one set of features in the MoC framework

3.4.3 Different Query Log Sizes

The query log features were some of the most helpful features, second only to the IR features. These features used the top 7.5 million English queries from MSN Search.
In practice, deploying systems using this many features may be problematic. For instance, if we want to use 20 languages, and if each query entry uses about 20 bytes, the query log files would use 3 GB. This amount of memory usage would not affect servers dedicated to keyword extraction, but might be problematic for servers that perform other functions (e.g. serving ads, or as part of a more general blogging or news system). In scenarios where keyword extraction is done on the client side, the query log file size is particularly important.

Query logs may also help speed up our system. One strategy we wanted to try is to consider only the phrases that appeared in the query log file as potential matches for keyword extraction. In this case, smaller query logs lead to even larger speedups. We thus ran a set of experiments under the MoC framework, using different query log file sizes (as measured by a frequency threshold cutoff). The log file sizes were very roughly inversely proportional to the cutoff. The leftmost point, with a cutoff of 18, corresponds to our 7.5 million query log file. The graph in Figure 1 shows both the top-1 and top-10 scores at different sizes. "Res. top-1" and "Res. top-10" are the results where we restricted the candidate phrases by eliminating those that do not appear in the query log file. As a comparison, the graph also shows the non-restricting version (i.e., top-1 and top-10). Note that of course the top-1 scores cannot be compared to the top-10 scores.

When the frequency cutoff threshold is small (i.e., large query log file size), this restricting strategy in fact improves the top-1 score slightly, but hurts the top-10 score. This phenomenon is not surprising, since restricting candidates may eliminate some keywords. This tends not to affect the top-1 score since, when using the query log file as an input, the most probable keywords in a document almost always appear in the keyword file, if the keyword file is large. Less probable, but still good, keywords are less likely to appear in the keyword file. As the query log file size becomes smaller, the negative effect becomes more significant. However, if only the top-1 score is relevant in the application, then a reasonable cutoff point may be 1000, where the query log file has less than 100,000 entries.

On the other hand, the top-1 and top-10 scores in the non-restricting version decrease gradually as the size of the query log file decreases. Fairly small files can be used with the non-restricting version if space is at a premium. After all, the extreme case is when no query log file is used, and as we have seen in Table 3, in that case the top-1 and top-10 scores are still 24.82 and 42.30, respectively.
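A sketch of the candidate-restriction strategy above: keep only candidates whose phrase appears in the query log with frequency at or above a cutoff threshold, so that smaller files (higher cutoffs) save memory and speed up extraction at some cost in top-10 score. The names and representation are illustrative.

def restrict_candidates(candidates, query_log, cutoff):
    """candidates: iterable of phrases; query_log: phrase -> frequency."""
    return [c for c in candidates
            if query_log.get(c.lower(), 0) >= cutoff]

log = {"digital camera": 51234, "cd-r media": 420}
cands = ["Digital Camera", "CD-R media", "click here"]
print(restrict_candidates(cands, log, cutoff=1000))
# ['Digital Camera']  ('cd-r media' falls below the cutoff of 1000)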
[Figure 1: The performance of using different sizes of the query log file. The plot shows the top-1, top-10, "Res. top-1", and "Res. top-10" scores against the query log frequency threshold (roughly 10 to 100,000).]

4. RELATED WORK

There has been a moderate amount of previous work that is directly relevant to keyword extraction. In most cases, the keyword extraction was not applied to web pages, but instead was applied to a different text domain. Most previous systems were fairly simple, using either a small number of features, a simple learning method (naive Bayes), or both. In this section, we describe this previous research in more detail.

4.1 GenEx

One of the best known programs for keyword extraction is Turney's GenEx system [22]. GenEx is a rule-based keyphrase extraction system with 12 parameters tuned using a genetic algorithm. GenEx has been trained and evaluated on a collection of 652 documents of three different types: journal articles, email messages, and web pages. Precisions of the top 5 phrases and top 15 phrases are reported as evaluation metrics, where the numbers are 0.239 and 0.128 respectively for all documents. Turney showed GenEx's superiority by comparing it with an earlier keyphrase extraction system trained by C4.5 [16] with several complicated features [22].

4.2 KEA and Variations

Concurrently with the development of GenEx, Frank et al. developed the KEA keyphrase extraction algorithm using a simple machine learning approach [7]. KEA first processes documents by removing stop words and by stemming. The candidate phrases are represented using only three features: TFIDF; distance (number of words before the first occurrence of the phrase, divided by the number of words in the whole document); and keyphrase-frequency, which is the number of times the candidate phrase occurs in other documents. The classifier is trained using the naive Bayes learning algorithm [14]. Frank et al. compared KEA with GenEx, and showed that KEA is slightly better in general, but the difference is not statistically significant [7].

KEA's performance was improved later by adding Web-related features [23]. In short, the number of documents returned by a search engine using the keyphrase as query terms is used as additional information. This feature is particularly useful when training and testing data are from different domains. In both intra- and inter-domain evaluations, the relative performance gain in precision is about 10%.

Kelleher and Luz also reported an enhanced version of KEA for web documents [13]. They exploited the link information of a web document by adding a "semantic ratio" feature, which is the frequency of the candidate phrase P in the original document D, divided by the frequency of P in documents linked by D. Using the Meta Keyword HTML tag as the source of annotations, they experimented with their approach on four sets of documents, where documents are connected by hyperlinks in each collection. On three of them, they reported significant improvement (45% to 52%) compared to the original version of KEA. Adding the semantic ratio to our system would be an interesting area of future research. It should, however, be noted that using the semantic ratio requires downloading and parsing several times as many web pages, which would introduce substantial load. Also, in practice, following arbitrary links from a page could result in simulating clicks on, e.g., unsubscribe links, leading to unexpected or undesirable results.

The use of linguistic information for keyword extraction was first studied by Hulth [12]. In this work, noun phrases and predefined part-of-speech tag patterns were used to help select phrase candidates. In addition, whether the candidate phrase has certain POS tags is used as a feature. Along with the three features in the KEA system, Hulth applied bagging [1] as the learning algorithm to train the system. The experimental results showed different degrees of improvement compared with systems that did not use linguistic information.
However, a direct comparison to KEA was not reported.

4.3 Information Extraction

Analogous to keyword extraction, information extraction is also a problem that aims to extract or label phrases given a document [8, 2, 19, 5]. Unlike keyword extraction, information extraction tasks are usually associated with predefined semantic templates. The goal of the extraction tasks is to find certain phrases in the documents to fill the templates. For example, given a seminar announcement, the task may be finding the name of the speaker, or the starting time of the seminar. Similarly, the named entity recognition [21] task is to label phrases as semantic entities like person, location, or organization.

Information extraction is in many ways similar to keyword extraction, but the techniques used for information extraction are typically different. While most keyword extraction systems use the monolithic combined (MoC) framework, information extraction systems typically use the DeS framework, or sometimes MoS. Since identical phrases may have different labels (e.g., "Washington" can be either a person or a location even in the same document), candidates in a document are never combined. The choice of features is also very different. These systems typically use lexical features (the identity of specific words in or around the phrase), linguistic features, and even conjunctions of these features. In general, the feature space in these problems is huge, often hundreds of thousands of features. Lexical features for keyword extraction would be an interesting area of future research, although our intuition is that these features are less likely to be useful in this case.

4.4 Impedance Coupling

Ribeiro-Neto et al. [18] describe an Impedance Coupling technique for content-targeted advertising. Their work is perhaps the most extensive previously published work specifically on content-targeted advertising. However, Ribeiro-Neto et al.'s work is quite a bit different from ours. Most importantly, their work focused not on finding keywords on web pages, but on directly matching advertisements to those web pages. They used a variety of information, including the text of the advertisements, the destination web page of the ad, and the full set of keywords tied to a particular ad (as opposed to considering keywords one at a time). They then looked at the cosine similarity of these measures to potential destination pages.

There are several reasons we do not compare directly to the work of Ribeiro-Neto et al. Most importantly, we, like most researchers, do not have a database of advertisements (as opposed to just keywords) that we could use for experiments of this sort. In addition, they are solving a somewhat different problem than we are. Our goal is to find the most appropriate keywords for a specific page. This dovetails well with how contextual advertising is sold today: advertisers bid on specific keywords, and their bids on different keywords may be very different. For instance, a purveyor of digital cameras might use the same ad for "camera", "digital camera", "digital camera reviews", or "digital camera prices." The advertiser's bids will be very different in the different cases, because the probability that a click on such an ad leads to a sale will be very different.
Ribeiro-Neto et al.'s goal is not to extract keywords, but to match web pages to advertisements. They do not try to determine which particular keyword on a page is a match, and in some cases, they match web pages to advertisements even when the web page does not contain any keywords chosen by an advertiser, which could make pricing difficult. Their techniques are also somewhat time-consuming, because they compute the cosine similarity between each web page and a bundle of words associated with each ad. They used only a small database (100,000 ads) applied to a small number of web pages (100), making such time-consuming comparisons possible. Real world application would be far more challenging: optimizations are possible, but would need to be an area of additional research. In contrast, our methods are directly applicable today.

4.5 News Query Extraction

Henzinger et al. [11] explored the domain of keyword extraction from a news source, to automatically drive queries. In particular, they extracted query terms from the closed captioning of television news stories, to drive a search system that would retrieve related online news articles. Their basic system used TFIDF to score individual phrases, and achieved slight improvements from using TFIDF². They tried stemming, with mixed results. Because of the nature of broadcast news programs, where boundaries between topics are not explicit, they found improvements by using a history feature that automatically detected topic boundaries. They achieved their largest improvements by postprocessing articles to remove those that seemed too different from the news broadcast. This seems closely related to the previously mentioned work of Ribeiro-Neto et al., who found analogous comparisons between documents and advertisements helpful.

4.6 Email Query Extraction

Goodman and Carvalho [10] previously applied techniques similar to the ones described here to query extraction for email. Their goal was somewhat different than our goal here, namely to find good search queries, to drive traffic to search engines, although the same technology they used could be applied to find keywords for advertising. Much of their research focused on email-specific features, such as word occurrence in subject lines, and distinguishing new parts of an email message from earlier, "in-reply-to" sections. In contrast, our work focuses on many web-page-specific features, such as keywords in the URL, and in meta-data.

Our research goes beyond theirs in a number of other ways. Most importantly, we tried both information-extraction-inspired methods (DeS) and linguistic features. Here, we also examine the tradeoff between query file size and accuracy, showing that large query files are not necessary for near-optimal performance, if the restriction on words occurring in the query file is removed. Goodman and Carvalho compare only to simple TFIDF-style baselines, while in our research, we compare to KEA. We also compute a form of inter-annotator agreement, something that was not previously done. The improvement over KEA and the near-human performance measurements are very important for demonstrating the high quality of our results.

5. CONCLUSIONS

The better we can monetize the web, the more features that can be provided. To give two examples, the explosion of free blogging tools and of free web-based email systems with large quotas have both been supported in large part by advertising, much of it using content-targeting. Better targeted ads are less annoying for users, and more likely to inform them of products that deliver real value to them. Better advertising thus creates a win-win-win situation: better web-based features, less annoying ads, and more information for users.

Our results demonstrate a large improvement over KEA. We attribute this improvement to the large number of helpful features we employed. While KEA employs only three features, we employed 12 different sets of features; since each set contained multiple features, we actually had about 40 features overall.
As we showed in Table 5, every one of these sets was helpful, although as we also showed, some of them were redundant with each other. GenEx, with 12 features, works about as well as KEA: the choice of features and the learning algorithm are both important. Our most important new feature was the query frequency file from MSN Search.

6. ACKNOWLEDGMENTS

We thank Emily Cook, Natalie Jennings, Samantha Dickson, Krystal Paige and Katie Neely, who annotated the data. We are also grateful to Alexei Bocharov, who helped us measure inter-annotator agreement.

7. REFERENCES

[1] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[2] M. Califf and R. Mooney. Bottom-up relational learning of pattern matching rules for information extraction. JMLR, 4:177-210, 2003.
[3] X. Carreras, L. Marquez, and J. Castro. Filtering-ranking perceptron learning for partial parsing. Machine Learning, 60(1-3):41-71, 2005.
[4] S. F. Chen and R. Rosenfeld. A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, CMU, 1999.
[5] H. Chieu and H. Ng. A maximum entropy approach to information extraction from semi-structured and free text. In Proc. of AAAI-02, pages 786-791, 2002.
[6] Y. Even-Zohar and D. Roth. A sequential model for multiclass classification. In EMNLP-01, 2001.
[7] E. Frank, G. W. Paynter, I. H. Witten, C. Gutwin, and C. G. Nevill-Manning. Domain-specific keyphrase extraction. In Proc. of IJCAI-99, pages 668-673, 1999.
[8] D. Freitag. Machine learning for information extraction in informal domains. Machine Learning, 39(2/3):169-202, 2000.
[9] J. Goodman. Sequential conditional generalized iterative scaling. In ACL '02, 2002.
[10] J. Goodman and V. R. Carvalho. Implicit queries for email. In CEAS-05, 2005.
[11] M. Henzinger, B. Chang, B. Milch, and S. Brin. Query-free news search. In Proceedings of the 12th World Wide Web Conference, pages 1-10, 2003.
[12] A. Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proc. of EMNLP-03, pages 216-223, 2003.
[13] D. Kelleher and S. Luz. Automatic hypertext keyphrase detection. In IJCAI-05, 2005.
[14] T. Mitchell. Tutorial on machine learning over natural language documents, 1997.
[15] V. Punyakanok and D. Roth. The use of classifiers in sequential inference. In NIPS-00, 2001.
[16] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[17] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), February 1989.
[18] B. Ribeiro-Neto, M. Cristo, P. B. Golgher, and E. S. de Moura. Impedance coupling in content-targeted advertising. In SIGIR-05, pages 496-503, 2005.
[19] D. Roth and W. Yih. Relational learning via propositional algorithms: An information extraction case study. In IJCAI-01, pages 1257-1263, 2001.
[20] C. Sutton and A. McCallum. Composition of conditional random fields for transfer learning. In Proceedings of HLT/EMNLP-05, 2005.
[21] E. F. Tjong Kim Sang. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In CoNLL-02, 2002.
[22] P. D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303-336, 2000.
[23] P. D. Turney. Coherent keyphrase extraction via web mining. In Proc. of IJCAI-03, pages 434-439, 2003.
