Uploaded On 2015-10-12

at random from document k; and $\lambda$ is the interpolation parameter¹. The a-priori probability of relevance $P(D_k)$ is usually taken to be a linear function of the document length, modelling the empirical fact that longer documents have a higher probability of relevance.

2.1 Combining external information

The basic ranking model is based on the content of the web pages. There is evidence that other sources of information (link structure, anchor text) play a decisive role in the ranking process of entry pages (e.g. Google²). The preferred way to incorporate extra information about web pages is to include this information in the model. A clean method is to incorporate this information in the prior probability of a document. A second approach is to model different types of evidence as different ranking models, and to combine these models via interpolation.

$$score_{combi} = \alpha \, score_{content} + (1 - \alpha) \, score_{features} \qquad (2)$$

Equation 2 shows how two ranking functions can be combined by interpolation. The combined score is a weighted function of the unigram document model and the posterior probability given the document feature set and a Bayesian classifier trained on the training set. As features we experimented with the number of inlinks and the URL form. However, for interpolation, scores have to be normalised across queries, because the interpolation scheme is query independent. Therefore, for the interpolation method we normalised the content score by the query length; the ranking models based on other document information that we applied are (discriminative) probabilities and thus need no normalisation. The interpolation method has been shown to work well in cases where score normalisation is a key factor [6]. For the experiments we describe here, we have applied both methods and they yield similar results. In a context where score normalisation is not necessary, we prefer method one.

We determined the document priors (document-content independent prior probabilities) using various techniques, either postulating a relationship, or learning priors from training data conditioning on e.g. the URL form. This process will be described in more detail in Section 3.

2.2 Smoothing variants

Recent experiments have shown that the particular choice of smoothing technique can have a large influence on the retrieval effectiveness. For title ad hoc queries, Zhai and Lafferty [8] found Dirichlet smoothing to be more effective than linear interpolation³. Both methods start from the idea that the probability estimate for unseen terms, $P_u(T_i \mid D_k)$, is modelled as a constant times the collection based estimate $P(T_i \mid C)$. A crucial difference between Dirichlet and Jelinek-Mercer smoothing is that the smoothing constant is dependent on the document length for Dirichlet, reflecting the fact that probability estimates are more reliable for longer documents. Equation (3) shows the weighting formula for Dirichlet smoothing, where $c(T_i; D_k)$ is the term frequency of term $T_i$ in document $D_k$, $\sum_w c(w; D_k)$ is the length of document $D_k$ and $\mu$ is a constant. The collection specific smoothing constant is in this case $\mu / (\sum_w c(w; D_k) + \mu)$, whereas the smoothing constant is $(1 - \lambda)$ in the Jelinek-Mercer based model.

$$P(T_1, T_2, \ldots, T_n \mid D_k)\,P(D_k) = P(D_k) \prod_{i=1}^{n} \frac{c(T_i; D_k) + \mu P(T_i \mid C)}{\sum_w c(w; D_k) + \mu} \qquad (3)$$

3 Entry page Search

To improve the results of a content only run in the entry page finding task, we experimented with various link, URL and anchor based methods. We tested several well-known and novel techniques on the set of 100 training topics provided by NIST and found that each method we tested was more or less beneficial for finding entry pages. This contrasts with last year's findings where link based techniques didn't add anything in an ad hoc search task [7]. In the following subsections, we discuss link based methods, URL based methods and anchor based methods in turn, along with our findings on the training data.

¹ We apply a simplified version of the model developed in [3], where $\lambda$ is term specific, denoting the term importance.
² http://www.google.com
³ Also called Jelinek-Mercer smoothing.
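
The difference between the two smoothing schemes of Section 2.2 can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used for the runs: the function names, the parameter names `mu` (the Dirichlet constant) and `lam` (the interpolation parameter) and their default values are all hypothetical, documents are assumed to be non-empty token lists, and `coll_prob` is assumed to hold a non-zero collection probability for every query term. The document prior $P(D_k)$ is left out.

```python
import math
from collections import Counter


def dirichlet_log_score(query, doc, coll_prob, mu=1000.0):
    """Log of the product in equation (3), without the prior P(D_k)."""
    tf = Counter(doc)   # c(T_i; D_k): term frequencies in the document
    dlen = len(doc)     # sum_w c(w; D_k): the document length
    return sum(
        math.log((tf[t] + mu * coll_prob[t]) / (dlen + mu)) for t in query
    )


def jelinek_mercer_log_score(query, doc, coll_prob, lam=0.5):
    """Linear interpolation with a fixed weight (1 - lam) on P(T_i|C)."""
    tf = Counter(doc)
    dlen = len(doc)     # assumed to be greater than zero
    return sum(
        math.log((1.0 - lam) * coll_prob[t] + lam * tf[t] / dlen) for t in query
    )


def dirichlet_collection_weight(dlen, mu=1000.0):
    """Smoothing constant mu / (dlen + mu): shrinks as documents get longer."""
    return mu / (dlen + mu)
```

With the hypothetical default `mu=1000.0`, a document of 100 terms gives the collection model a weight of about 0.91 and a document of 10000 terms a weight of about 0.09, whereas the Jelinek-Mercer weight `1 - lam` is the same for every document.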
[Fig. 2. P(relevant | doclen); panels: Ad hoc search, Entry page search]

run                          MRR
content                      0.26
content + doclen prior       0.21
0.7*content + 0.3*inlink     0.38
Table 1. MRRs Inlink and doclen priors on training data

Kleinberg

The second link-based approach we experimented with is based on Kleinberg's hub and authority algorithm [5]. This algorithm identifies authorities (important sources of information) and hubs (lists of pointers to authorities) by analysing the structure of hyperlinks. Since entry pages can be seen as authorities on a very specific subject (a certain organisation), Kleinberg's algorithm can be useful for the entry page search task. The algorithm works by iteratively assigning hub and authority scores to documents in such a way that good hubs are pages that refer to many good authorities and good authorities are referenced by many good hubs:

1. Take the top N results from the content run
2. Extend this set S with all documents that are linked to S (either through in or through outlinks)
3. Initialise all hub and authority scores in this set to 1.
4. $hub(D) = \sum_{\{i \mid \text{link } D \to i \text{ exists}\}} auth(i)$
5. $auth(D) = \sum_{\{i \mid \text{link } i \to D \text{ exists}\}} hub(i)$
6. normalise hub and auth scores such that $\sum_{s \in S} hub^2(s) = \sum_{s \in S} auth^2(s) = 1$
7. repeat steps 4-6

We computed hubs and authorities for the top N of the content only run and used the resulting authority scores to rank the documents. Table 2 shows the results for different values of N. As the results show, taking only the top 5 or top 10 ranks from the content run and computing authority scores starting from those, is sufficient to improve the results. Apparently, if an entry page is not in the top 5 from the content run, it is often in the set of documents linked to these 5 documents.

N          MRR
content    0.26
1          0.18
5          0.33
10         0.32
50         0.30
Table 2. MRRs Kleinberg@10 results on training data

3.2 URLs

Apart from content and links, a third source of information are the document's URLs. Entry page URLs often contain the name or acronym of the corresponding organisation. Therefore, an obvious way of exploiting URL information is trying to match query terms and URL terms. Our URL approach however, is based on the observation that entry page URLs tend to be higher in a server's document tree than other web pages, i.e. the number of slashes ('/') in an entry page URL tends to be relatively small. We define 4 different types of URLs:

– root: a domain name, optionally followed by 'index.html' (e.g. http://trec.nist.gov)
– subroot: a domain name, followed by a single directory, optionally followed by 'index.html' (e.g. http://trec.nist.gov/pubs/)
– path: a domain name, followed by an arbitrarily deep path, but not ending in a file name other than 'index.html' (e.g. http://trec.nist.gov/pubs/trec9/papers/)
– file: anything ending in a filename other than 'index.html' (e.g. http://trec.nist.gov/pubs/trec9/t9_proceedings.html)

We analysed WT10g and the relevant entry pages for half of the training documents to see how entry pages and other documents are distributed over these URL types. Table 3 shows the statistics.

URL type    # entry pages    # WT10g
root        38 (71.7%)       11680 (0.6%)
subroot     7 (13.2%)        37959 (2.2%)
path        3 (5.7%)         83734 (4.9%)
file        3 (5.7%)         1557719 (92.1%)
Table 3. Distributions of entry pages and WT10g over URL types

From these statistics, we estimated prior probabilities of being an entry page on the basis of the URL type, $P(entrypage \mid URLtype = t)$, for all URL types $t$. We then interpolated these priors with the normalised content only scores (cf. eq. 2) and tested this on the other 50 entry page search topics of the training data. This gave a major improvement on the content only results (see table 4).

run                            MRR
content only                   0.26
0.7*content + 0.3*URLprior     0.79
Table 4. URL prior results

3.3 Anchors

The fourth source of information is provided by the anchor texts of outlinks. These anchor texts are the underlined and highlighted texts of hyperlinks in web pages. We gathered all anchor texts of the outlinks, combined all texts pointing
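
Returning to the link-based approach: the hub and authority iteration listed in the Kleinberg steps (steps 2 to 7) can be sketched in Python as follows. This is a minimal illustration, not the code behind the reported runs. The function names, the representation of links as `(source, target)` pairs and the fixed `iterations` count are assumptions made for the example. It also assumes that step 5 uses the hub scores just updated in step 4, which the step list does not state explicitly.

```python
import math


def expand_seed_set(seed, links):
    """Step 2: add every document linked to or from the seed set."""
    seed = set(seed)
    expanded = set(seed)
    for src, dst in links:
        if src in seed or dst in seed:
            expanded.update((src, dst))
    return expanded


def _normalise(scores):
    """Step 6: scale the scores so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {d: v / norm for d, v in scores.items()}


def hubs_and_authorities(nodes, links, iterations=20):
    """Steps 3 to 7 on the documents in `nodes`; returns (hub, auth) dicts."""
    nodes = set(nodes)
    # Only links with both ends inside the set contribute.
    edges = [(s, d) for s, d in links if s in nodes and d in nodes]
    hub = {d: 1.0 for d in nodes}     # step 3
    auth = {d: 1.0 for d in nodes}
    for _ in range(iterations):       # step 7 (fixed number of rounds)
        new_hub = {d: 0.0 for d in nodes}
        for s, d in edges:            # step 4: sum auth over outlinks
            new_hub[s] += auth[d]
        new_auth = {d: 0.0 for d in nodes}
        for s, d in edges:            # step 5: sum updated hub over inlinks
            new_auth[d] += new_hub[s]
        hub, auth = _normalise(new_hub), _normalise(new_auth)
    return hub, auth
```

A ranking by authority score is then `sorted(auth, key=auth.get, reverse=True)`, where the seed passed to `expand_seed_set` would be the top N documents of the content run.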
[Fig. 3. Probability of relevance and probability of being retrieved as a function of document length]

orientation as the ideal $P(rel \mid dlen)$ curve (fig. 3). The Dirichlet smoothing scheme is less sensitive to query length [8], and the preference for longer documents is inherent, since less smoothing is applied to longer documents. Figure 3 illustrates the effect. The Dirichlet run follows the shape of the $P(rel \mid dlen)$ line more closely than the runs based on Jelinek-Mercer smoothing. The JM run based on a document length dependent prior indeed follows the ideal curve better in the lower ranges of document lengths, but overcompensates for the higher document length ranges.

4.2 Entry page task

For the entry page task, we submitted four runs: a content only run, an anchor only run, a content run with URL prior⁵ and a run with content, anchors and URL priors. We did some additional runs to have results for all sensible combinations of content, anchors and priors, as well as an inlink prior run and a Kleinberg run. The mean reciprocal ranks for all runs are shown in table 7 (official runs in bold face). Figure 4 shows the success rate at N for all runs⁶ (on a logarithmic scale to emphasise high precision).

The first thing that should be noted from the results is that each combination of content and another source of information outperforms the content only run. The same holds for combinations with the anchor run. However, the improvement when adding URL information is less impressive for the anchor run than for the content run. This is probably due to the differences in the two runs. Although these runs have similar scores (MRR around 0.33), they have different characteristics. The anchor run is a high precision run, whereas the content run also has a reasonable recall. Therefore, it is hard to improve the anchor run since the entry pages that are retrieved are already in the top ranks and the other entry pages are simply not retrieved at all. Figure 4 shows the differences between the two runs: the anchor run has a slightly higher success rate for the lower ranks, but as the ranks get higher, the content run takes over.

As mentioned in section 2.1, our preferred way of combining sources of information when normalisation is not necessary, is to incorporate the additional information in the prior probability of a document. However, in the runs listed in table 7 we interpolated URL priors and inlink priors with the content scores. We did additional runs in which we used the priors exactly as in equation 1; Table 8 shows the results.

⁵ We recomputed the priors on the whole set of training data.
⁶ The number of entry pages retrieved within the top N documents returned

run                 scores                                                       description                                                                    MRR
tnout10epC          content score                                                Content only run                                                               0.3375
tnout10epA          anchor score                                                 Anchor only run                                                                0.3306
tnout10epCU         0.7 content score + 0.3 urlprior                             Content run combined with URL priors                                           0.7716
tnout10epAU         0.7 anchor score + 0.3 urlpriors                             Anchor run combined with URL priors                                            0.4798
tnout10epCA         0.9 content score + 0.1 anchor score                         Interpolation of Content and Anchor runs                                       0.4500
tnout10epCAU        0.63 content score + 0.07 anchor score + 0.3 urlpriors       Interpolation of Content and Anchor runs combined with URL priors              0.7745
tnout10epInlinks    0.7 content score + 0.3 inlinkprior                          Content run combined with Inlink priors                                        0.4872
tnout10epKlein10    Kleinberg's auth. score @10                                  Authority scores after Kleinberg algorithm on top 10 ranks from Content run    0.3548
Table 7. Entry Page results

[Fig. 4. Entry Page results: success@N]

run                                     MRR
content only                            0.3375
content * URL prior                     0.7743
content * inlink prior                  0.4251
content * inlink prior * URL prior      0.5440
content * combiprior                    0.7746
Table 8. Results with clean (non-interpolated) priors

Table 8 shows that also when we use priors in the clean way (cf. eq 1), they improve our results. Comparing these results to the ones in table 7, we see no difference in performance between the interpolated inlinks and the clean inlinks. The interpolated URL priors are slightly better than the clean ones. When we take a combination of inlink and URL information as a prior, by simply multiplying the two priors, our results drop (see table 8). This indicates that the two sources of information are not independent. We therefore dropped the independence assumption and had another look at the training data. Just like with the estimation of the URL priors, we subdivided the collection into different categories and estimated prior probabilities of being an entry page given a certain category. As a basis for the categories, we took the 4 URL types defined in section 3.2 and subdivided the root type into 4 subtypes on the basis of the number of inlinks. Again we counted the number of entry pages from the training data and the number of documents from WT10g that fell into each category and estimated the prior probabilities from that. Table 9 shows the statistics for the different categories.

Document type                  # entry pages    # WT10g
root with 1-10 inlinks         39 (36.1%)       8938 (0.5%)
root with 11-100 inlinks       25 (23.1%)       2905 (0.2%)
root with 101-1000 inlinks     11 (10.2%)       377 (0.0%)
root with 1000+ inlinks        4 (3.7%)         38 (0.0%)
subroot                        15 (13.9%)       37959 (2.2%)
path                           8 (7.4%)         83734 (4.9%)
file                           6 (5.6%)         1557719 (92.1%)
Table 9. Distribution entry pages and WT10g over different document types

As can be seen in table 8, this proper combination of URL and inlink information (i.e. without the independence assumption) performs as well as or better than the two separate priors.

5 Conclusion

Post hoc runs show that the Dirichlet smoothing technique yields superior performance for title ad hoc queries on the web collection. This is probably due to the document length dependent smoothing constant, but further investigation is needed.

The Entry page finding task turns out to be very different from an ad hoc task. In previous web tracks link information didn't seem to help for general searches [7][2]. This year, we found that in addition to content, other sources of information can be very useful for identifying entry pages. We described two different ways of combining different sources of information into our unigram language model: either as a proper prior or by interpolating results from different ranking models. We used both methods successfully when combining content information with other sources as diverse as inlinks, URLs and anchors. URL info gives the best prior info. Adding inlinks yields marginal improvement.
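
To illustrate how a combined URL and inlink prior of this kind might be derived, the following Python sketch assigns a page to one of the Table 9 categories and estimates a prior per category. It rests on several assumptions that the text does not spell out: the prior is taken as the simple relative frequency (entry pages divided by WT10g documents in the category), a last path segment containing a dot is treated as a file name, root pages with 0 inlinks are put in the "1-10" bucket, and a page with exactly 1000 inlinks is put in the "101-1000" bucket. All function and variable names are hypothetical.

```python
import math
from urllib.parse import urlparse

# (entry pages in the training data, documents in WT10g), copied from Table 9.
TABLE9_COUNTS = {
    "root, 1-10 inlinks": (39, 8938),
    "root, 11-100 inlinks": (25, 2905),
    "root, 101-1000 inlinks": (11, 377),
    "root, 1000+ inlinks": (4, 38),
    "subroot": (15, 37959),
    "path": (8, 83734),
    "file": (6, 1557719),
}


def url_type(url):
    """Classify a URL as root, subroot, path or file (Section 3.2)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if segments and segments[-1] == "index.html":
        segments.pop()                 # 'index.html' does not count as a file
    elif segments and "." in segments[-1]:
        return "file"                  # assumption: a dot marks a file name
    if not segments:
        return "root"
    return "subroot" if len(segments) == 1 else "path"


def category(url, inlinks):
    """Map a page to a Table 9 category; only root pages use the inlink count."""
    kind = url_type(url)
    if kind != "root":
        return kind
    if inlinks <= 10:
        return "root, 1-10 inlinks"    # assumption: 0 inlinks also lands here
    if inlinks <= 100:
        return "root, 11-100 inlinks"
    if inlinks <= 1000:
        return "root, 101-1000 inlinks"
    return "root, 1000+ inlinks"


def estimate_priors(counts):
    """Relative-frequency estimate of P(entry page | category)."""
    return {cat: entry / total for cat, (entry, total) in counts.items()}


def log_score_with_prior(log_content_score, prior):
    """Use the prior as a proper prior: add its log to the content log score."""
    return log_content_score + math.log(prior)
```

Under these assumptions a root page with more than 1000 inlinks receives a prior of 4/38, several orders of magnitude above the 6/1557719 assigned to a file page, which is consistent with the observation that URL information gives the strongest prior.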
