Uploaded On 2015-10-12

[...] at random from document $k$; and $\lambda$ is the interpolation parameter¹. The a-priori probability of relevance $P(D_k)$ is usually taken to be a linear function of the document length, modelling the empirical fact that longer documents have a higher probability of relevance.

2.1 Combining external information

The basic ranking model is based on the content of the web pages. There is evidence that other sources of information (link structure, anchor text) play a decisive role in the ranking process of entry pages (e.g. Google²). The preferred way to incorporate extra information about web pages is to include this information in the model. A clean method is to incorporate this information in the prior probability of a document. A second method is to model different types of evidence as different types of ranking models, and to combine these models via interpolation.

$$\mathrm{score}_{combi} = \alpha \, \mathrm{score}_{content} + (1 - \alpha) \, \mathrm{score}_{features} \qquad (2)$$

Equation 2 shows how two ranking functions can be combined by interpolation. The combined score is a weighted function of the unigram document model and the posterior probability given the document feature set and a Bayesian classifier trained on the training set. As features we experimented with the number of inlinks and the URL form. For interpolation, however, scores have to be normalised across queries, because the interpolation scheme is query independent. Therefore, for the interpolation method we normalised the content score by the query length. The ranking models based on other document information that we applied are (discriminative) probabilities and thus need no normalisation. The interpolation method has been shown to work well in cases where score normalisation is a key factor [6]. For the experiments we describe here, we have applied both methods and they yield similar results. In a context where score normalisation is not necessary, we prefer method one.

We determined the document priors (document-content independent prior probabilities) using various techniques, either postulating a relationship, or learning priors from training data, conditioning on e.g. the URL form. This process is described in more detail in Section 3.

2.2 Smoothing variants

Recent experiments have shown that the particular choice of smoothing technique can have a large influence on the retrieval effectiveness. For title adhoc queries, Zhai and Lafferty [8] found Dirichlet smoothing to be more effective than linear interpolation³. Both methods start from the idea that the probability estimate for unseen terms, $P_u(T_i|D_k)$, is modelled as a constant times the collection based estimate $P(T_i|C)$. A crucial difference between Dirichlet and Jelinek-Mercer smoothing is that the smoothing constant depends on the document length for Dirichlet, reflecting the fact that probability estimates are more reliable for longer documents. Equation (3) shows the weighting formula for Dirichlet smoothing, where $c(T_i; D_k)$ is the term frequency of term $T_i$ in document $D_k$, $\sum_w c(w; D_k)$ is the length of document $D_k$ and $\mu$ is a constant. The collection specific smoothing constant is in this case $\mu / (\sum_w c(w; D_k) + \mu)$, whereas the smoothing constant is $(1 - \lambda)$ in the Jelinek-Mercer based model.

$$P(T_1, T_2, \ldots, T_n | D_k) \, P(D_k) = P(D_k) \prod_{i=1}^{n} \frac{c(T_i; D_k) + \mu \, P(T_i|C)}{\sum_w c(w; D_k) + \mu} \qquad (3)$$

3 Entry page Search

To improve the results of a content only run in the entry page finding task, we experimented with various link, URL and anchor based methods. We tested several well-known and novel techniques on the set of 100 training topics provided by NIST and found that each method we tested was more or less beneficial for finding entry pages. This contrasts with last year's findings, where link based techniques didn't add anything in an adhoc search task [7]. In the following subsections, we discuss in turn link based methods, URL based methods and anchor based methods, along with our findings on the training data.

¹ We apply a simplified version of the model developed in [3], where $\lambda$ is term specific, denoting the term importance.
² http://www.google.com
³ Also called Jelinek-Mercer smoothing.
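As a rough illustration of the two smoothing variants in Section 2.2, the sketch below scores a query against a document in log space, once with the Dirichlet weighting of Equation (3) and once with Jelinek-Mercer interpolation. It is not the authors' implementation: the function names, the tokenised-list representation of documents, the default values of `mu` and `lam`, and the choice to skip query terms that never occur in the collection are all assumptions made for the example.

```python
import math
from collections import Counter


def collection_model(docs):
    """Maximum-likelihood unigram model P(t|C) over a list of token lists."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def dirichlet_log_score(query, doc, p_coll, mu=1000.0, log_prior=0.0):
    """Log of Equation (3): log P(D) + sum_i log((c(T_i;D) + mu*P(T_i|C)) / (|D| + mu)).

    mu=1000 is an illustrative default, not a value taken from the paper.
    """
    tf = Counter(doc)
    score = log_prior
    for term in query:
        if term not in p_coll:  # assumption: ignore terms unseen in the collection
            continue
        score += math.log((tf[term] + mu * p_coll[term]) / (len(doc) + mu))
    return score


def jelinek_mercer_log_score(query, doc, p_coll, lam=0.5, log_prior=0.0):
    """Linear interpolation: the collection weight (1 - lam) is the same for every document."""
    tf = Counter(doc)
    score = log_prior
    for term in query:
        if term not in p_coll:  # same assumption as above
            continue
        p_doc = tf[term] / len(doc) if doc else 0.0
        score += math.log(lam * p_doc + (1.0 - lam) * p_coll[term])
    return score
```

The difference the text describes shows up in the denominators: with Dirichlet smoothing the weight on the collection model, `mu / (len(doc) + mu)`, shrinks as the document gets longer, while `1 - lam` stays fixed.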
Fig. 2. P(relevant | doclen) (panels: Adhoc search, Entry page search)

run                        MRR
content                    0.26
content + doclen prior     0.21
0.7*content + 0.3*inlink   0.38

Table 1. MRRs Inlink and doclen priors on training data

Kleinberg

The second link-based approach we experimented with is based on Kleinberg's hub and authority algorithm [5]. This algorithm identifies authorities (important sources of information) and hubs (lists of pointers to authorities) by analysing the structure of hyperlinks. Since entry pages can be seen as authorities on a very specific subject (a certain organisation), Kleinberg's algorithm can be useful for the entry page search task. The algorithm works by iteratively assigning hub and authority scores to documents in such a way that good hubs are pages that refer to many good authorities and good authorities are referenced by many good hubs:

1. Take the top N results from the content run
2. Extend this set S with all documents that are linked to S (either through in or through outlinks)
3. Initialise all hub and authority scores in this set to 1.
4. $hub(D) = \sum_{\{i \,|\, \text{link } D \rightarrow i \text{ exists}\}} auth(i)$
5. $auth(D) = \sum_{\{i \,|\, \text{link } i \rightarrow D \text{ exists}\}} hub(i)$
6. normalise hub and auth scores such that $\sum_{s \in S} hub^2(s) = \sum_{s \in S} auth^2(s) = 1$
7. repeat steps 4-6

We computed hubs and authorities for the top N of the content only run and used the resulting authority scores to rank the documents. Table 2 shows the results for different values of N. As the results show, taking only the top 5 or top 10 ranks from the content run and computing authority scores starting from those, is sufficient to improve the results. Apparently, if an entry page is not in the top 5 from the content run, it is often in the set of documents linked to these 5 documents.

N         MRR
content   0.26
1         0.18
5         0.33
10        0.32
50        0.30

Table 2. MRRs Kleinberg@10 results on training data

3.2 URLs

Apart from content and links, a third source of information are the document's URLs. Entry page URLs often contain the name or acronym of the corresponding organisation. Therefore, an obvious way of exploiting URL information is trying to match query terms and URL terms. Our URL approach however, is based on the observation that entry page URLs tend to be higher in a server's document tree than other web pages, i.e. the number of slashes ('/') in an entry page URL tends to be relatively small. We define 4 different types of URLs:

– root: a domain name, optionally followed by 'index.html' (e.g. http://trec.nist.gov)
– subroot: a domain name, followed by a single directory, optionally followed by 'index.html' (e.g. http://trec.nist.gov/pubs/)
– path: a domain name, followed by an arbitrarily deep path, but not ending in a file name other than 'index.html' (e.g. http://trec.nist.gov/pubs/trec9/papers/)
– file: anything ending in a filename other than 'index.html' (e.g. http://trec.nist.gov/pubs/trec9/t9_proceedings.html)

We analysed WT10g and the relevant entry pages for half of the training documents to see how entry pages and other documents are distributed over these URL types. Table 3 shows the statistics.

URL type   # entry pages   # WT10g
root       38 (71.7%)      11680 (0.6%)
subroot    7 (13.2%)       37959 (2.2%)
path       3 (5.7%)        83734 (4.9%)
file       3 (5.7%)        1557719 (92.1%)

Table 3. Distributions of entry pages and WT10g over URL types

From these statistics, we estimated prior probabilities of being an entry page on the basis of the URL type, $P(\text{entry page} \,|\, \text{URLtype} = t)$, for all URL types $t$. We then interpolated these priors with the normalised content only scores (cf. eq. 2) and tested this on the other 50 entry page search topics of the training data. This gave a major improvement on the content only results (see table 4).

run                          MRR
content only                 0.26
0.7*content + 0.3*URLprior   0.79

Table 4. URL prior results

3.3 Anchors

The fourth source of information is provided by the anchor texts of outlinks. These anchor texts are the underlined and highlighted texts of hyperlinks in web pages. We gathered all anchor texts of the outlinks, combined all texts pointing [...]
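The URL typing of Section 3.2 and the priors derived from Table 3 can be sketched as follows. This is an illustrative reading rather than the authors' code: the rule that a final path segment containing a dot counts as a file name is an assumption, and the ratio estimator in `estimate_priors` (entry pages of a type divided by WT10g documents of that type) is one plausible way to turn the table into $P(\text{entry page} \,|\, \text{URLtype} = t)$ up to a constant; the paper does not spell out its estimator. The names `url_type`, `estimate_priors`, `ENTRY_PAGES` and `WT10G` are hypothetical.

```python
from urllib.parse import urlparse

# Counts copied from Table 3.
ENTRY_PAGES = {"root": 38, "subroot": 7, "path": 3, "file": 3}
WT10G = {"root": 11680, "subroot": 37959, "path": 83734, "file": 1557719}


def url_type(url):
    """Classify a URL as 'root', 'subroot', 'path' or 'file'."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if segments and segments[-1] == "index.html":
        segments.pop()  # a trailing index.html does not change the type
    elif segments and "." in segments[-1]:
        return "file"  # assumption: a dot in the last segment marks a file name
    if not segments:
        return "root"
    if len(segments) == 1:
        return "subroot"
    return "path"


def estimate_priors(entry_counts, collection_counts):
    """Relative frequency of entry pages within each URL type (unnormalised prior)."""
    return {t: entry_counts[t] / collection_counts[t] for t in collection_counts}


if __name__ == "__main__":
    priors = estimate_priors(ENTRY_PAGES, WT10G)
    for url in ("http://trec.nist.gov", "http://trec.nist.gov/pubs/trec9/papers/"):
        t = url_type(url)
        print(url, t, priors[t])
```

Under this estimator a root URL is several orders of magnitude more likely to be an entry page than a file URL, which is consistent with the large gain reported in Table 4.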
Fig. 3. Probability of relevance and probability of being retrieved as a function of document length

[...] orientation as the ideal $P(rel|dlen)$ curve (fig. 3). The Dirichlet smoothing scheme is less sensitive to query length [8], and the preference for longer documents is inherent, since less smoothing is applied to longer documents. Figure 3 illustrates the effect. The Dirichlet run follows the shape of the $P(rel|dlen)$ line more closely than the runs based on Jelinek-Mercer smoothing. The JM run based on a document length dependent prior indeed follows the ideal curve better in the lower ranges of document lengths, but overcompensates for the higher document length ranges.

4.2 Entry page task

For the entry page task, we submitted four runs: a content only run, an anchor only run, a content run with URL prior⁵ and a run with content, anchors and URL priors. We did some additional runs to have results for all sensible combinations of content, anchors and priors, as well as an inlink prior run and a Kleinberg run. The mean reciprocal ranks for all runs are shown in table 7 (official runs marked with *). Figure 4 shows the success rate at N for all runs⁶ (on a logarithmic scale to emphasise high precision).

The first thing that should be noted from the results is that each combination of content and another source of information outperforms the content only run. The same holds for combinations with the anchor run. However, the improvement from adding URL information is less impressive for the anchor run than for the content run. This is probably due to the differences between the two runs. Although these runs have similar scores (MRR around 0.33), they have different characteristics. The anchor run is a high precision run, whereas the content run also has a reasonable recall. Therefore, it is hard to improve the anchor run, since the entry pages that are retrieved are already in the top ranks and the other entry pages are simply not retrieved at all. Figure 4 shows the differences between the two runs: the anchor run has a slightly higher success rate for the lower ranks, but as the ranks get higher, the content run takes over.

As mentioned in section 2.1, our preferred way of combining sources of information when normalisation is not necessary, is to incorporate the additional information in the prior probability of a document. However, in the runs listed in table 7 we interpolated URL priors and inlink priors with the content scores. We did additional runs in which we used the priors exactly as in equation 1; Table 8 shows the results.

⁵ We recomputed the priors on the whole set of training data.
⁶ The number of entry pages retrieved within the top N documents returned.

run                scores                                                   description                                                                  MRR
tnout10epC*        content score                                            Content only run                                                             0.3375
tnout10epA*        anchor score                                             Anchor only run                                                              0.3306
tnout10epCU*       0.7 content score + 0.3 urlprior                         Content run combined with URL priors                                         0.7716
tnout10epAU        0.7 anchor score + 0.3 urlpriors                         Anchor run combined with URL priors                                          0.4798
tnout10epCA        0.9 content score + 0.1 anchor score                     Interpolation of Content and Anchor runs                                     0.4500
tnout10epCAU*      0.63 content score + 0.07 anchor score + 0.3 urlpriors   Interpolation of Content and Anchor runs combined with URL priors            0.7745
tnout10epInlinks   0.7 content score + 0.3 inlinkprior                      Content run combined with Inlink priors                                      0.4872
tnout10epKlein10   Kleinberg's auth. score @10                              Authority scores after Kleinberg algorithm on top 10 ranks from Content run  0.3548

Table 7. Entry Page results

Fig. 4. Entry Page results: success@N

run                                  MRR
content only                         0.3375
content * URL prior                  0.7743
content * inlink prior               0.4251
content * inlink prior * URL prior   0.5440
content * combi prior                0.7746

Table 8. Results with clean (non-interpolated) priors

Table 8 shows that the priors also improve our results when we use them in the clean way (cf. eq. 1). Comparing these results to the ones in table 7, we see no difference in performance between the interpolated inlinks and the clean inlinks. The interpolated URL priors are slightly better than the clean ones. When we take a combination of inlink and URL information as a prior, by simply multiplying the two priors, our results drop (see table 8). This indicates that the two sources of information are not independent. We therefore dropped the independence assumption and had another look at the training data. Just as with the estimation of the URL priors, we subdivided the collection into different categories and estimated prior probabilities of being an entry page given a certain category. As a basis for the categories, we took the 4 URL types defined in section 3.2 and subdivided the root type into 4 subtypes on the basis of the number of inlinks. Again we counted the number of entry pages from the training data and the number of documents from WT10g that fell into each category and estimated the prior probabilities from that. Table 9 shows the statistics for the different categories.

Document type                # entry pages   # WT10g
root with 1-10 inlinks       39 (36.1%)      8938 (0.5%)
root with 11-100 inlinks     25 (23.1%)      2905 (0.2%)
root with 101-1000 inlinks   11 (10.2%)      377 (0.0%)
root with 1000+ inlinks      4 (3.7%)        38 (0.0%)
subroot                      15 (13.9%)      37959 (2.2%)
path                         8 (7.4%)        83734 (4.9%)
file                         6 (5.6%)        1557719 (92.1%)

Table 9. Distribution of entry pages and WT10g over different document types

As can be seen in table 8, this proper combination of URL and inlink information (i.e. without the independence assumption) performs as well as or better than the two separate priors.

5 Conclusion

Post hoc runs show that the Dirichlet smoothing technique yields superior performance for title adhoc queries on the web collection. This is probably due to the document length dependent smoothing constant, but further investigation is needed.

The entry page finding task turns out to be very different from an adhoc task. In previous web tracks link information didn't seem to help for general searches [7][2]. This year, we found that in addition to content, other sources of information can be very useful for identifying entry pages. We described two different ways of combining different sources of information into our unigram language model: either as a proper prior or by interpolating results from different ranking models. We used both methods successfully when combining content information with other sources as diverse as inlinks, URLs and anchors. URL information gives the best prior. Adding inlinks yields a marginal improvement.
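The two combination schemes named in the conclusion can be contrasted in a few lines. The sketch below is only a schematic reading of the text: `interpolated_score` follows the form of Equation 2 with the 0.7/0.3 weights listed in Table 7, and `prior_score` adds a log prior to a content log-likelihood, which is how a document prior enters a product-form ranking formula. The function names, the toy scores and the assumption that content scores have already been normalised for the interpolated case are all hypothetical.

```python
import math


def interpolated_score(content_score, feature_score, alpha=0.7):
    """Weighted sum of a normalised content score and a feature-based probability."""
    return alpha * content_score + (1.0 - alpha) * feature_score


def prior_score(content_log_likelihood, prior):
    """'Clean' combination: log P(D) + log P(query | D)."""
    return math.log(prior) + content_log_likelihood


def rank(scores):
    """Document ids sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy numbers, invented for illustration only.
    normalised_content = {"a": 0.6, "b": 0.5}
    content_log_likelihood = {"a": -2.0, "b": -2.5}
    url_prior = {"a": 0.1, "b": 0.9}

    by_interpolation = {d: interpolated_score(normalised_content[d], url_prior[d])
                        for d in url_prior}
    by_prior = {d: prior_score(content_log_likelihood[d], url_prior[d])
                for d in url_prior}
    print(rank(by_interpolation), rank(by_prior))
```

In this toy case both schemes promote document "b" over "a" because of its higher URL prior, even though "a" has the better content score. The interpolated variant needs content scores on a comparable scale across queries, whereas the prior variant does not.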