Online edition (c) 2009 Cambridge UP

DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.

8 Evaluation in information retrieval

We have seen in the preceding chapters many alternatives in designing an IR system. How do we know which of these techniques are effective in which applications? Should we use stop lists? Should we stem? Should we use inverse document frequency weighting? Information retrieval has developed as a highly empirical discipline, requiring careful and thorough evaluation to demonstrate the superior performance of novel techniques on representative document collections.

In this chapter we begin with a discussion of measuring the effectiveness of IR systems (Section 8.1) and the test collections that are most often used for this purpose (Section 8.2). We then present the straightforward notion of relevant and nonrelevant documents and the formal evaluation methodology that has been developed for evaluating unranked retrieval results (Section 8.3). This includes explaining the kinds of evaluation measures that are standardly used for document retrieval and related tasks like text classification and why they are appropriate. We then extend these notions and develop further measures for evaluating ranked retrieval results (Section 8.4) and discuss developing reliable and informative test collections (Section 8.5).

We then step back to introduce the notion of user utility, and how it is approximated by the use of document relevance (Section 8.6). The key utility measure is user happiness. Speed of response and the size of the index are factors in user happiness. It seems reasonable to assume that relevance of results is the most important factor: blindingly fast, useless answers do not make a user happy. However, user perceptions do not always coincide with system designers' notions of quality. For example, user happiness commonly depends very strongly on user interface design issues, including the layout, clarity, and responsiveness of the user interface, which are independent of the quality of the results returned. We touch on other measures of the quality of a system, in particular the generation of high-quality result summary snippets, which strongly influence user utility, but are not measured in the basic relevance ranking paradigm (Section 8.7).

8.1 Information retrieval system evaluation

To measure ad hoc information retrieval effectiveness in the standard way, we need a test collection consisting of three things:

1. A document collection
2. A test suite of information needs, expressible as queries
3. A set of relevance judgments, standardly a binary assessment of either relevant or nonrelevant for each query-document pair.

The standard approach to information retrieval system evaluation revolves around the notion of relevant and nonrelevant documents. With respect to a user information need, a document in the test collection is given a binary classification as either relevant or nonrelevant. This decision is referred to as the gold standard or ground truth judgment of relevance. The test document collection and suite of information needs have to be of a reasonable size: you need to average performance over fairly large test sets, as results are highly variable over different documents and information needs. As a rule of thumb, 50 information needs has usually been found to be a sufficient minimum.

Relevance is assessed relative to an information need, not a query. For example, an information need might be:

    Information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.

This might be translated into a query such as:

    wine AND red AND white AND heart AND attack AND effective

A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query. This distinction is often misunderstood in practice, because the information need is not overt. But, nevertheless, an information need is present. If a user types python into a web search engine, they might be wanting to know where they can purchase a pet python. Or they might be wanting information on the programming language Python. From a one word query, it is very difficult for a system to know what the information need is. But, nevertheless, the user has one, and can judge the returned results on the basis of their relevance to it. To evaluate a system, we require an overt expression of an information need, which can be used for judging returned documents as relevant or nonrelevant. At this point, we make a
simplification: relevance can reasonably be thought of as a scale, with some documents highly relevant and others marginally so. But for the moment, we will use just a binary decision of relevance. We discuss the reasons for using binary relevance judgments and alternatives in Section 8.5.1.

Many systems contain various weights (often known as parameters) that can be adjusted to tune system performance. It is wrong to report results on a test collection which were obtained by tuning these parameters to maximize performance on that collection. That is because such tuning overstates the expected performance of the system, because the weights will be set to maximize performance on one particular set of queries rather than for a random sample of queries. In such cases, the correct procedure is to have one or more development test collections, and to tune the parameters on the development test collection. The tester then runs the system with those weights on the test collection and reports the results on that collection as an unbiased estimate of performance.

8.2 Standard test collections

Here is a list of the most standard test collections and evaluation series. We focus particularly on test collections for ad hoc information retrieval system evaluation, but also mention a couple of similar test collections for text classification.

The Cranfield collection. This was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness, but is nowadays too small for anything but the most elementary pilot experiments. Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.

Text Retrieval Conference (TREC). The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992. Within this framework, there have been many tracks over a range of different test collections, but the best known test collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations between 1992 and 1999. In total, these test collections comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450 information needs, which are called topics and specified in detailed text passages. Individual test collections are defined over different subsets of this data. The early TRECs each consisted of 50 information needs, evaluated over different but overlapping sets of documents. TRECs 6–8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles. This is probably the best subcollection to use in future work, because it is the largest and the topics are more consistent. Because the test document collections are so large, there are no exhaustive relevance judgments. Rather, NIST assessors' relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed.

In more recent years, NIST has done evaluations on larger document collections, including the 25 million page GOV2 web page collection. From the beginning, the NIST test document collections were orders of magnitude larger than anything available to researchers previously and GOV2 is now the largest Web collection easily available for research purposes. Nevertheless, the size of GOV2 is still more than 2 orders of magnitude smaller than the current size of the document collections indexed by the large web search companies.

NII Test Collections for IR Systems (NTCIR). The NTCIR project has built various test collections of similar sizes to the TREC collections, focusing on East Asian language and cross-language information retrieval, where queries are made in one language over a document collection containing documents in one or more other languages. See: http://research.nii.ac.jp/ntcir/data/data-en.html

Cross Language Evaluation Forum (CLEF). This evaluation series has concentrated on European languages and cross-language information retrieval. See: http://www.clef-campaign.org/

Reuters-21578 and Reuters-RCV1. For text classification, the most used test collection has been the Reuters-21578 collection of 21578 newswire articles; see Chapter 13,
page 279. More recently, Reuters released the much larger Reuters Corpus Volume 1 (RCV1), consisting of 806,791 documents; see Chapter 4, page 69. Its scale and rich annotation makes it a better basis for future research.

20 Newsgroups. This is another widely used text classification collection, collected by Ken Lang. It consists of 1000 articles from each of 20 Usenet newsgroups (the newsgroup name being regarded as the category). After the removal of duplicate articles, as it is usually used, it contains 18941 articles.

8.3 Evaluation of unranked retrieval sets

Given these ingredients, how is system effectiveness measured? The two most frequent and basic measures for information retrieval effectiveness are precision and recall. These are first defined for the simple case where an IR system returns a set of documents for a query. We will see later how to extend these notions to ranked retrieval situations.

Precision (P) is the fraction of retrieved documents that are relevant

\[ \text{Precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})} = P(\text{relevant} \mid \text{retrieved}) \tag{8.1} \]

Recall (R) is the fraction of relevant documents that are retrieved

\[ \text{Recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})} = P(\text{retrieved} \mid \text{relevant}) \tag{8.2} \]

These notions can be made clear by examining the following contingency table:

(8.3)
                   Relevant                Nonrelevant
    Retrieved      true positives (tp)     false positives (fp)
    Not retrieved  false negatives (fn)    true negatives (tn)

Then:

\[ P = tp/(tp + fp) \qquad R = tp/(tp + fn) \tag{8.4} \]

An obvious alternative that may occur to the reader is to judge an information retrieval system by its accuracy, that is, the fraction of its classifications that are correct. In terms of the contingency table above, accuracy = (tp + tn)/(tp + fp + fn + tn). This seems plausible, since there are two actual classes, relevant and nonrelevant, and an information retrieval system can be thought of as a two-class classifier which attempts to label them as such (it retrieves the subset of documents which it believes to be relevant). This is precisely the effectiveness measure often used for evaluating machine learning classification problems.

There is a good reason why accuracy is not an appropriate measure for information retrieval problems. In almost all circumstances, the data is extremely skewed: normally over 99.9% of the documents are in the nonrelevant category. A system tuned to maximize accuracy can appear to perform well by simply deeming all documents nonrelevant to all queries. Even if the system is quite good, trying to label some documents as relevant will almost always lead to a high rate of false positives. However, labeling all documents as nonrelevant is completely unsatisfying to an information retrieval system user. Users are always going to want to see some documents, and can be assumed to have a certain tolerance for seeing some false positives providing that they get some useful information. The measures of precision and recall concentrate the evaluation on the return of true positives, asking what percentage of the relevant documents have been found and how many false positives have also been returned.

The advantage of having the two numbers for precision and recall is that one is more important than the other in many circumstances. Typical web surfers would like every result on the first page to be relevant (high precision) but have not the slightest interest in knowing let alone looking at every document that is relevant. In contrast, various professional searchers such as paralegals and intelligence analysts are very concerned with trying to get as high recall as possible, and will tolerate fairly low precision results in order to get it. Individuals searching their hard disks are also often interested in high recall searches. Nevertheless, the two quantities clearly trade off against one another: you can always get a recall of 1 (but very low precision) by retrieving all documents for all queries! Recall is a non-decreasing function of the number of documents retrieved. On the other hand, in a good system, precision usually decreases as the number of documents retrieved is increased. In general we want to get some amount of recall while tolerating only a certain percentage of false positives.

A single measure that trades off precision versus recall is the F measure, which is the weighted harmonic mean of precision and recall:

\[ F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} \quad \text{where} \quad \beta^2 = \frac{1 - \alpha}{\alpha} \tag{8.5} \]
where $\alpha \in [0, 1]$ and thus $\beta^2 \in [0, \infty]$. The default balanced F measure equally weights precision and recall, which means making $\alpha = 1/2$ or $\beta = 1$. It is commonly written as $F_1$, which is short for $F_{\beta=1}$, even though the formulation in terms of $\alpha$ more transparently exhibits the F measure as a weighted harmonic mean. When using $\beta = 1$, the formula on the right simplifies to:

\[ F_{\beta=1} = \frac{2PR}{P + R} \tag{8.6} \]

However, using an even weighting is not the only choice. Values of $\beta < 1$ emphasize precision, while values of $\beta > 1$ emphasize recall. For example, a value of $\beta = 3$ or $\beta = 5$ might be used if recall is to be emphasized. Recall, precision, and the F measure are inherently measures between 0 and 1, but they are also very commonly written as percentages, on a scale between 0 and 100.

Why do we use a harmonic mean rather than the simpler average (arithmetic mean)? Recall that we can always get 100% recall by just returning all documents, and therefore we can always get a 50% arithmetic mean by the same process. This strongly suggests that the arithmetic mean is an unsuitable measure to use. In contrast, if we assume that 1 document in 10,000 is relevant to the query, the harmonic mean score of this strategy is 0.02%. The harmonic mean is always less than or equal to the arithmetic mean and the geometric mean. When the values of two numbers differ greatly, the harmonic mean is closer to their minimum than to their arithmetic mean; see Figure 8.1.

[Figure 8.1: Graph comparing the harmonic mean to other means. The graph shows a slice through the calculation of various means of precision and recall for the fixed recall value of 70%. The harmonic mean is always less than either the arithmetic or geometric mean, and often quite close to the minimum of the two numbers. When the precision is also 70%, all the measures coincide.]

Exercise 8.1 [⋆] An IR system returns 8 relevant documents, and 10 nonrelevant documents. There are a total of 20 relevant documents in the collection. What is the precision of the system on this search, and what is its recall?

Exercise 8.2 [⋆] The balanced F measure (a.k.a. F1) is defined as the harmonic mean of precision and recall. What is the advantage of using the harmonic mean rather than "averaging" (using the arithmetic mean)?
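
As a small illustration of the set-based measures in Section 8.3, the following Python sketch computes precision, recall, accuracy, and the F measure of Equation (8.5) from contingency-table counts. The function names and the convention of returning 0.0 for a zero denominator are assumptions made for this example and do not come from the text. The sketch replays the two degenerate strategies discussed above on a hypothetical collection in which 1 document in 10,000 is relevant.

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant (Equation 8.1)."""
    retrieved = tp + fp
    return tp / retrieved if retrieved else 0.0  # assumed convention: empty result set scores 0


def recall(tp, fn):
    """Fraction of relevant documents that are retrieved (Equation 8.2)."""
    relevant = tp + fn
    return tp / relevant if relevant else 0.0


def accuracy(tp, fp, fn, tn):
    """Fraction of all relevant/nonrelevant decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall (right-hand form of Equation 8.5)."""
    b2 = beta * beta
    denom = b2 * p + r
    return (b2 + 1) * p * r / denom if denom else 0.0


# Strategy A: deem every document nonrelevant (retrieve nothing).
reject_all_accuracy = accuracy(tp=0, fp=0, fn=1, tn=9999)

# Strategy B: retrieve every document in the collection.
p_all = precision(tp=1, fp=9999)
r_all = recall(tp=1, fn=0)
arithmetic_mean = (p_all + r_all) / 2
f1_all = f_measure(p_all, r_all)

print(f"reject-all accuracy:        {reject_all_accuracy:.4f}")
print(f"return-all arithmetic mean: {arithmetic_mean:.2%}")
print(f"return-all F1:              {f1_all:.2%}")
```

Under these assumptions, retrieving nothing looks excellent by accuracy, and retrieving everything looks respectable by the arithmetic mean, while F1 stays near zero in line with the 0.02% figure quoted above.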

[Figure 8.2: Precision/recall graph.]

Exercise 8.3 [⋆⋆] Derive the equivalence between the two formulas for F measure shown in Equation (8.5), given that $\alpha = 1/(\beta^2 + 1)$.

8.4 Evaluation of ranked retrieval results

Precision, recall, and the F measure are set-based measures. They are computed using unordered sets of documents. We need to extend these measures (or to define new measures) if we are to evaluate the ranked retrieval results that are now standard with search engines. In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents. For each such set, precision and recall values can be plotted to give a precision-recall curve, such as the one shown in Figure 8.2. Precision-recall curves have a distinctive saw-tooth shape: if the (k+1)th document retrieved is nonrelevant then recall is the same as for the top k documents, but precision has dropped. If it is relevant, then both precision and recall increase, and the curve jags up and to the right. It is often useful to remove these jiggles and the standard way to do this is with an interpolated precision: the interpolated precision $p_{interp}$ at a certain recall level $r$ is defined as the highest precision found for any recall level $r' \geq r$:

\[ p_{interp}(r) = \max_{r' \geq r} p(r') \tag{8.7} \]

The justification is that almost anyone would be prepared to look at a few more documents if it would increase the percentage of the viewed set that were relevant (that is, if the precision of the larger set is higher). Interpolated precision is shown by a thinner line in Figure 8.2. With this definition, the interpolated precision at a recall of 0 is well-defined (Exercise 8.4).

Table 8.1: Calculation of 11-point Interpolated Average Precision. This is for the precision-recall curve shown in Figure 8.2.

    Recall   Interp. Precision
    0.0      1.00
    0.1      0.67
    0.2      0.63
    0.3      0.55
    0.4      0.45
    0.5      0.41
    0.6      0.36
    0.7      0.29
    0.8      0.13
    0.9      0.10
    1.0      0.08

Examining the entire precision-recall curve is very informative, but there is often a desire to boil this information down to a few numbers, or perhaps even a single number. The traditional way of doing this (used for instance in the first 8 TREC Ad Hoc evaluations) is
the 11-point interpolated average precision. For each information need, the interpolated precision is measured at the 11 recall levels of 0.0, 0.1, 0.2, ..., 1.0. For the precision-recall curve in Figure 8.2, these 11 values are shown in Table 8.1. For each recall level, we then calculate the arithmetic mean of the interpolated precision at that recall level for each information need in the test collection. A composite precision-recall curve showing 11 points can then be graphed. Figure 8.3 shows an example graph of such results from a representative good system at TREC 8.

[Figure 8.3: Averaged 11-point precision/recall graph across 50 queries for a representative TREC system. The Mean Average Precision for this system is 0.2553.]

In recent years, other measures have become more common. Most standard among the TREC community is Mean Average Precision (MAP), which provides a single-figure measure of quality across recall levels. Among evaluation measures, MAP has been shown to have especially good discrimination and stability. For a single information need, Average Precision is the average of the precision value obtained for the set of top k documents existing after each relevant document is retrieved, and this value is then averaged over information needs. That is, if the set of relevant documents for an information need $q_j \in Q$ is $\{d_1, \ldots, d_{m_j}\}$ and $R_{jk}$ is the set of ranked retrieval results from the top result until you get to document $d_k$, then

\[ \text{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \text{Precision}(R_{jk}) \tag{8.8} \]

When a relevant document is not retrieved at all [1], the precision value in the above equation is taken to be 0. For a single information need, the average precision approximates the area under the uninterpolated precision-recall curve, and so the MAP is roughly the average area under the precision-recall curve for a set of queries.

[1] A system may not fully order all documents in the collection in response to a query or at any rate an evaluation exercise may be based on submitting only the top k results for each information need.

Using MAP, fixed recall levels are not chosen, and there is no interpolation. The MAP value for a test collection is the arithmetic mean of average precision values for individual information needs. (This has the effect of weighting each information need equally in the final reported number, even if many documents are relevant to some queries whereas very few are relevant to other queries.) Calculated MAP scores normally vary widely across information needs when measured within a single system, for instance, between 0.1 and 0.7. Indeed, there is normally more agreement in MAP for an individual information need across systems than for MAP scores for different information needs for the same system. This means that a set of test information needs must be large and diverse enough to be representative of system effectiveness across different queries.

The above measures factor in precision at all recall levels. For many prominent applications, particularly web search, this may not be germane to users. What matters is rather how many good results there are on the first page or the first three pages. This leads to measuring precision at fixed low levels of retrieved results, such as 10 or 30 documents. This is referred to as "Precision at k", for example "Precision at 10". It has the advantage of not requiring any estimate of the size of the set of relevant documents but the disadvantages that it is the least stable of the commonly used evaluation measures and that it does not average well, since the total number of relevant documents for a query has a strong influence on precision at k.

An alternative, which alleviates this problem, is R-precision. It requires having a set of known relevant documents Rel, from which we calculate the precision of the top |Rel| documents returned. (The set Rel may be incomplete, such as when Rel is formed by creating relevance judgments for the pooled top k results of particular systems in a set of experiments.) R-precision adjusts for the size of the set of relevant documents: A perfect system could score 1 on this metric for each query, whereas, even a perfect system could only achieve a precision at 20 of 0.4 if there were only 8 documents in the collection relevant to an information need. Averaging this measure across queries thus makes more sense. This measure is harder to explain to naive users than Precision at k but easier to explain than MAP. If there are |Rel| relevant documents for a query, we examine the top |Rel| results of a system, and find that r are relevant, then by definition, not only is the precision (and hence R-precision) r/|Rel|, but the recall of this result set is also r/|Rel|. Thus, R-precision turns out to be identical to the break-even point, another measure which is sometimes used, defined in terms of this equality relationship holding. Like Precision at k, R-precision describes only one point on the precision-recall curve, rather than attempting to summarize effectiveness across the curve, and it is somewhat unclear why you should be interested in the break-even point rather than either the best point on the curve (the point with maximal F-measure) or a retrieval level of interest to a particular application (Precision at k). Nevertheless, R-precision turns out to be highly correlated with MAP empirically, despite measuring only a single point on the curve.

[Figure 8.4: The ROC curve corresponding to the precision-recall curve in Figure 8.2.]

Another concept sometimes used in evaluation is an ROC curve. ("ROC" stands for "Receiver Operating Characteristics", but knowing that doesn't help most people.) An ROC curve plots the true positive rate or sensitivity against the false positive rate or (1 − specificity). Here, sensitivity is just another term for recall. The false positive rate is given by fp/(fp + tn). Figure 8.4 shows the ROC curve corresponding to the precision-recall curve in Figure 8.2. An ROC curve always goes from the bottom left to the top right of the graph. For a good system, the graph climbs steeply on the left side. For unranked result sets, specificity, given by tn/(fp + tn), was not seen as a very useful notion. Because the set of true negatives is always so large, its value would be almost 1 for all information needs (and, correspondingly, the value of the false positive rate would be almost 0). That is, the "interesting" part of Figure 8.2 is 0 < recall < 0.4, a part which is compressed to a small corner of Figure 8.4. But an ROC curve could make sense when looking over the full retrieval spectrum, and it provides another way of looking at the data. In many fields, a common aggregate measure is to report the area under the ROC curve, which is the ROC analog of MAP. Precision-recall curves are sometimes loosely referred to as ROC curves. This is understandable, but not accurate.

A final approach that has seen increasing adoption, especially when employed with machine learning approaches to ranking (see Section 15.4, page 341) is measures of cumulative gain, and in particular normalized discounted cumulative gain (NDCG). NDCG is designed for situations of non-binary notions of relevance (cf. Section 8.5.1). Like precision at k, it is evaluated over some number k of top search results. For a set of queries Q, let R(j, d) be the relevance score assessors gave to document d for query j. Then,

\[ \text{NDCG}(Q, k) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} Z_{kj} \sum_{m=1}^{k} \frac{2^{R(j,m)} - 1}{\log_2(1 + m)} \tag{8.9} \]

where $Z_{kj}$ is a normalization factor calculated to make it so that a perfect ranking's NDCG at k for query j is 1. For queries for which $k' < k$ documents are retrieved, the last summation is done up to $k'$.

Exercise 8.4 [⋆] What are the possible values for interpolated precision at a recall level of 0?

Exercise 8.5 [⋆⋆] Must there always be a break-even point between precision and recall? Either show there must be or give a counter-example.

Exercise 8.6 [⋆⋆] What is the relationship between the value of F1 and the break-even point?

Exercise 8.7 [⋆⋆] The Dice coefficient of two sets is a measure of their intersection scaled by their size (giving a value in the range 0 to 1):

\[ \text{Dice}(X, Y) = \frac{2 |X \cap Y|}{|X| + |Y|} \]

Show that the balanced F-measure (F1) is equal to the Dice coefficient of the retrieved and relevant document sets.

Exercise 8.8 [⋆] Consider an information need for which there are 4 relevant documents in the collection. Contrast two systems run on this collection. Their top 10 results are judged for relevance as follows (the leftmost item is the top ranked search result):

    System 1   R N R N N N N N R R
    System 2   N R N N R R R N N N

a. What is the MAP of each system? Which has a higher MAP?
b. Does this result intuitively make sense? What does it say about what is important in getting a good MAP score?
c. What is the R-precision of each system? (Does it rank the systems the same as MAP?)
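
The ranked-retrieval measures of Section 8.4 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the 0/1 list encoding of a judged ranking, and the example ranking itself are hypothetical and not taken from the text. The NDCG function handles a single query and takes its normalizer from the best ordering of whatever graded judgments are supplied, which is one assumed reading of the factor $Z_{kj}$ in Equation (8.9); averaging over queries is left to the caller.

```python
import math


def average_precision(ranking, num_relevant):
    """Mean of the precision at each rank that holds a relevant document.

    ranking is a list of 0/1 judgments, top result first. Relevant documents
    that are never retrieved contribute a precision of 0 (cf. Equation 8.8).
    """
    hits, total = 0, 0.0
    for k, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / num_relevant


def mean_average_precision(runs):
    """runs: list of (ranking, num_relevant) pairs, one per information need."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)


def interpolated_precision(ranking, num_relevant, r):
    """Highest precision over all cutoffs whose recall is at least r (Equation 8.7)."""
    best, hits = 0.0, 0
    for k, rel in enumerate(ranking, start=1):
        hits += rel
        if hits / num_relevant >= r:
            best = max(best, hits / k)
    return best


def eleven_point_precisions(ranking, num_relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one information need."""
    return [interpolated_precision(ranking, num_relevant, i / 10) for i in range(11)]


def r_precision(ranking, num_relevant):
    """Precision of the top |Rel| results."""
    return sum(ranking[:num_relevant]) / num_relevant


def dcg(gains, k):
    """Discounted cumulative gain over the top k graded judgments."""
    return sum((2 ** g - 1) / math.log2(1 + m)
               for m, g in enumerate(gains[:k], start=1))


def ndcg(gains, k, judged_gains=None):
    """Per-query NDCG; judged_gains lists all known grades (defaults to gains)."""
    pool = gains if judged_gains is None else judged_gains
    ideal = dcg(sorted(pool, reverse=True), k)
    return dcg(gains, k) / ideal if ideal else 0.0


# Hypothetical judged ranking: 4 relevant documents exist, 3 of them are retrieved.
ranking = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
ap = average_precision(ranking, num_relevant=4)
points = eleven_point_precisions(ranking, num_relevant=4)
rp = r_precision(ranking, num_relevant=4)
print(ap, rp, points)
```

For the 11-point measure described above, the per-need lists returned by `eleven_point_precisions` would then be averaged level by level across the information needs in the test collection.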
Exercise 8.9 [⋆⋆] The following list of Rs and Ns represents relevant (R) and nonrelevant (N) returned documents in a ranked list of 20 documents retrieved in response to a query from a collection of 10,000 documents. The top of the ranked list (the document the system thinks is most likely to be relevant) is on the left of the list. This list shows 6 relevant documents. Assume that there are 8 relevant documents in total in the collection.

    R R N N N N N N R N R N N N R N N N N R

a. What is the precision of the system on the top 20?
b. What is the F1 on the top 20?
c. What is the uninterpolated precision of the system at 25% recall?
d. What is the interpolated precision at 33% recall?
e. Assume that these 20 documents are the complete result set of the system. What is the MAP for the query?

Assume, now, instead, that the system returned the entire 10,000 documents in a ranked list, and these are the first 20 results returned.

f. What is the largest possible MAP that this system could have?
g. What is the smallest possible MAP that this system could have?
h. In a set of experiments, only the top 20 results are evaluated by hand. The result in (e) is used to approximate the range (f)–(g). For this example, how large (in absolute terms) can the error for the MAP be by calculating (e) instead of (f) and (g) for this query?

8.5 Assessing relevance

To properly evaluate a system, your test information needs must be germane to the documents in the test document collection, and appropriate for predicted usage of the system. These information needs are best designed by domain experts. Using random combinations of query terms as an information need is generally not a good idea because typically they will not resemble the actual distribution of information needs.

Given information needs and documents, you need to collect relevance assessments. This is a time-consuming and expensive process involving human beings. For tiny collections like Cranfield, exhaustive judgments of relevance for each query and document pair were obtained. For large modern collections, it is usual for relevance to be assessed only for a subset of the documents for each query. The most standard approach is pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems (usually the ones to be evaluated), and perhaps other sources such as the results of Boolean keyword searches or documents found by expert searchers in an interactive process.

Table 8.2: Calculating the kappa statistic.

                             Judge 2 Relevance
                             Yes     No      Total
    Judge 1      Yes         300     20      320
    Relevance    No           10     70       80
                 Total       310     90      400

    Observed proportion of the times the judges agreed
        P(A) = (300 + 70)/400 = 370/400 = 0.925
    Pooled marginals
        P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
        P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875
    Probability that the two judges agreed by chance
        P(E) = P(nonrelevant)^2 + P(relevant)^2 = 0.2125^2 + 0.7875^2 = 0.665
    Kappa statistic
        kappa = (P(A) − P(E))/(1 − P(E)) = (0.925 − 0.665)/(1 − 0.665) = 0.776

A human is not a device that reliably reports a gold standard judgment of relevance of a document to a query. Rather, humans and their relevance judgments are quite idiosyncratic and variable. But this is not a problem to be solved: in the final analysis, the success of an IR system depends on how good it is at satisfying the needs of these idiosyncratic humans, one information need at a time.

Nevertheless, it is interesting to consider and measure how much agreement between judges there is on relevance judgments. In the social sciences, a common measure for agreement between judges is the kappa statistic. It is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement.

\[ \text{kappa} = \frac{P(A) - P(E)}{1 - P(E)} \tag{8.10} \]

where P(A) is the proportion of the times the judges agreed, and P(E) is the proportion of the times they would be expected to agree by chance. There are choices in how the latter is estimated: if we simply say we are making a two-class decision and assume nothing more, then the expected chance agreement rate is 0.5. However, normally the class distribution assigned is skewed, and it is usual to use marginal statistics to calculate expected agreement [2]. There are still two ways to do this, depending on how the marginals are combined across judges.

[2] For a contingency table, as in Table 8.2, a marginal statistic is formed by summing a row or column. The marginal $a_{i.k} = \sum_j a_{ijk}$.
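
The arithmetic in Table 8.2 can be checked with a short Python sketch. The function name and argument names below are hypothetical, chosen for this example; the calculation follows Equation (8.10) with the marginals pooled across the two judges, as in the table, and assumes binary judgments.

```python
def pooled_kappa(both_yes, yes_no, no_yes, both_no):
    """Kappa for two judges and binary labels, with marginals pooled across judges.

    yes_no counts documents judge 1 called relevant and judge 2 nonrelevant;
    no_yes counts the reverse disagreement.
    """
    n = both_yes + yes_no + no_yes + both_no
    p_agree = (both_yes + both_no) / n
    # Each judge contributes n labels, so the pooled marginal is taken over 2n labels.
    p_relevant = ((both_yes + yes_no) + (both_yes + no_yes)) / (2 * n)
    p_nonrelevant = 1 - p_relevant
    p_chance = p_relevant ** 2 + p_nonrelevant ** 2
    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_agree - p_chance) / (1 - p_chance)


# Counts from Table 8.2.
kappa = pooled_kappa(both_yes=300, yes_no=20, no_yes=10, both_no=70)
print(round(kappa, 3))
```

With the Table 8.2 counts this gives a value that rounds to 0.776, matching the table.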
One can either pool the marginal distribution across judges or use the marginals for each judge separately; both forms have been used, but we present the pooled version because it is more conservative in the presence of systematic differences in assessments across judges. The calculations are shown in Table 8.2. The kappa value will be 1 if two judges always agree, 0 if they agree only at the rate given by chance, and negative if they are worse than random. If there are more than two judges, it is normal to calculate an average pairwise kappa value. As a rule of thumb, a kappa value above 0.8 is taken as good agreement, a kappa value between 0.67 and 0.8 is taken as fair agreement, and agreement below 0.67 is seen as data providing a dubious basis for an evaluation, though the precise cutoffs depend on the purposes for which the data will be used.

Interjudge agreement of relevance has been measured within the TREC evaluations and for medical IR collections. Using the above rules of thumb, the level of agreement normally falls in the range of "fair" (0.67–0.8). The fact that human agreement on a binary relevance judgment is quite modest is one reason for not requiring more fine-grained relevance labeling from the test set creator. To answer the question of whether IR evaluation results are valid despite the variation of individual assessors' judgments, people have experimented with evaluations taking one or the other of two judges' opinions as the gold standard. The choice can make a considerable absolute difference to reported scores, but has in general been found to have little impact on the relative effectiveness ranking of either different systems or variants of a single system which are being compared for effectiveness.

8.5.1 Critiques and justifications of the concept of relevance

The advantage of system evaluation, as enabled by the standard model of relevant and nonrelevant documents, is that we have a fixed setting in which we can vary IR systems and system parameters to carry out comparative experiments. Such formal testing is much less expensive and allows clearer diagnosis of the effect of changing system parameters than doing user studies of retrieval effectiveness. Indeed, once we have a formal measure that we have confidence in, we can proceed to optimize effectiveness by machine learning methods, rather than tuning parameters by hand. Of course, if the formal measure poorly describes what users actually want, doing this
willn

16 otbeeffectiveinimprovingusersatisfaction
otbeeffectiveinimprovingusersatisfaction.Ourperspectiveisthat,inpractice,thestandardformalmeasuresforIRevaluation,althoughasimpli-cation,aregoodenough,andrecentworkinoptimizingformalevaluationmeasuresinIRhassucceededbrilliantly.Therearenumerousexamplesoftechniquesdevelopedinformalevaluationsettings,whichimproveeffec-tivenessinoperationalsettings,suchasthedevelopmentofdocumentlengthnormalizationmethodswithinthecontextofTREC(Sections 6.4.4 and 11.4.3 ) Online edition (c)\n2009 Cambridge UP 8.5Assessingrelevance167 andmachinelearningmethodsforadjustingparameterweightsinscoring(Section 6.1.2 ).Thatisnottosaythattherearenotproblemslatentwithintheabstrac-tionsused.Therelevanceofonedocumentistreatedasindependentoftherelevanceofotherdocumentsinthecollection.(Thisassumptionisactuallybuiltintomostretrievalsystems–documentsarescoredagainstqueries,notagainsteachother–aswellasbeingassumedintheevaluationmethods.)Assessmentsarebinary:therearen'tanynuancedassessmentsofrelevance.Relevanceofadocumenttoaninformationneedistreatedasanabsolute,objectivedecision.Butjudgmentsofrelevancearesubjective,varyingacrosspeople,aswediscussedabove.Inpractice,humanassessorsarealsoimper-fectmeasuringinstruments,susceptibletofailuresofunderstandingandat-tention.Wealsohavetoassumethatusers'informationneedsdonotchangeastheystartlookingatretrievalresults.Anyresultsbasedononecollectionareheavilyskewedbythechoiceofcollection,queries,andrelevancejudg-mentset:theresultsmaynottranslatefromonedomaintoanotherortoadifferentuserpopulation.Someoftheseproblemsmaybexable.Anumberofrecentevaluations,includingINEX,someTRECtracks,andNTCIRhaveadoptedanordinalnotionofrelevancewithdocumentsdividedinto3or4classes,distinguish-ingslightlyrelevantdocumentsfromhighlyrelevantdocuments.SeeSec-tion 10.4 (page 210 
) for a detailed discussion of how this is implemented in the INEX evaluations.

One clear problem with the relevance-based assessment that we have presented is the distinction between relevance and marginal relevance: whether a document still has distinctive usefulness after the user has looked at certain other documents (Carbonell and Goldstein 1998). Even if a document is highly relevant, its information can be completely redundant with other documents which have already been examined. The most extreme case of this is documents that are duplicates – a phenomenon that is actually very common on the World Wide Web – but it can also easily occur when several documents provide a similar precis of an event. In such circumstances, marginal relevance is clearly a better measure of utility to the user. Maximizing marginal relevance requires returning documents that exhibit diversity and novelty. One way to approach measuring this is by using distinct facts or entities as evaluation units. This perhaps more directly measures true utility to the user but doing this makes it harder to create a test collection.

Exercise 8.10 [??]
Below is a table showing how two human judges rated the relevance of a set of 12 documents to a particular information need (0 = nonrelevant, 1 = relevant). Let us assume that you've written an IR system that for this query returns the set of documents {4, 5, 6, 7, 8}.

docID   Judge 1   Judge 2
1       0         0
2       0         0
3       1         1
4       1         1
5       1         0
6       1         0
7       1         0
8       1         0
9       0         1
10      0         1
11      0         1
12      0         1

a. Calculate the kappa measure between the two judges.
b. Calculate precision, recall, and F1 of your system if a document is considered relevant only if the two judges agree.
c.
Calculate precision, recall, and F1 of your system if a document is considered relevant if either judge thinks it is relevant.

8.6 A broader perspective: System quality and user utility

Formal evaluation measures are at some distance from our ultimate interest in measures of human utility: how satisfied is each user with the results the system gives for each information need that they pose? The standard way to measure human satisfaction is by various kinds of user studies. These might include quantitative measures, both objective, such as time to complete a task, as well as subjective, such as a score for satisfaction with the search engine, and qualitative measures, such as user comments on the search interface. In this section we will touch on other system aspects that allow quantitative evaluation and the issue of user utility.

8.6.1 System issues

There are many practical benchmarks on which to rate an information retrieval system beyond its retrieval quality. These include:

• How fast does it index, that is, how many documents per hour does it index for a certain distribution over document lengths? (cf. Chapter 4)
• How fast does it search, that is, what is its latency as a function of index size?
• How expressive is its query language? How fast is it on complex queries?
• How large is its document collection, in terms of the number of documents or the collection having information distributed across a broad range of topics?

All these criteria apart from query language expressiveness are straightforwardly measurable: we can quantify the speed or size. Various kinds of feature checklists can make query language expressiveness semi-precise.

8.6.2 User utility

What we would really like is a way of quantifying aggregate user happiness, based on the relevance, speed, and user interface of a system. One part of this is understanding the distribution of people we wish to make happy, and this depends entirely on the setting. For a web search engine, happy search users are those who find what they want. One indirect measure of such users is that they tend to return to the same engine. Measuring the rate of return of users is thus an effective metric, which would of course be more effective if you could also measure how much these users used other search engines. But advertisers are also users of modern web search engines. They are happy if customers click through to their sites and then make purchases. On an eCommerce web site, a user is likely to be wanting to purchase something. Thus, we can measure the time to purchase, or the fraction of searchers who become buyers. On a shopfront web site, perhaps both the user's and the store owner's needs are satisfied if a purchase is made. Nevertheless, in general, we need to decide whether it is the end user's or the eCommerce site owner's happiness that we are trying to optimize. Usually, it is the store owner who is paying us.

For an “enterprise” (company, government, or academic) intranet search engine, the relevant metric is more likely to be user productivity: how much time do users spend looking for information that they need. There are also many other practical criteria concerning such matters as information security, which we mentioned in Section 4.6 (page 80).

User happiness is elusive to measure, and this is part of why the standard methodology uses the proxy of relevance of search results. The standard direct way to get at user satisfaction is to run user studies, where people engage in tasks, and usually various metrics are measured, the participants are observed, and ethnographic interview techniques are used to get qualitative information on satisfaction. User studies are very useful in system design, but they are time consuming and expensive to do. They are also difficult to do well, and expertise is required to design the studies and to interpret the results. We will not discuss the details of human usability testing here.

8.6.3 Refining a deployed system

If an IR system has been built and is being used by a large number of users, the system's builders can evaluate possible changes by deploying variant versions of the system and recording measures that are indicative of user satisfaction with one variant vs. others as they are being used. This method is frequently used by web search engines.

The most common version of this is A/B testing, a term borrowed from the advertising industry. For such a test, precisely one thing is changed between the current system and a proposed system, and a small proportion of traffic (say, 1–10% of users) is randomly directed to the variant system, while most users use the current system. For example, if we wish to investigate a change to the ranking algorithm, we redirect a random sample of users to a variant system and evaluate measures such as the frequency with which people click on the top result, or any result on the first page. (This particular analysis method is referred to as clickthrough log analysis or clickstream mining. It is further discussed as a method of implicit feedback in Section 9.1.7 (page 187
).) The basis of A/B testing is running a bunch of single variable tests (either in sequence or in parallel): for each test only one parameter is varied from the control (the current live system). It is therefore easy to see whether varying each parameter has a positive or negative effect.

Such testing of a live system can easily and cheaply gauge the effect of a change on users, and, with a large enough user base, it is practical to measure even very small positive and negative effects. In principle, more analytic power can be achieved by varying multiple things at once in an uncorrelated (random) way, and doing standard multivariate statistical analysis, such as multiple linear regression. In practice, though, A/B testing is widely used, because A/B tests are easy to deploy, easy to understand, and easy to explain to management.

8.7 Results snippets

Having chosen or ranked the documents matching a query, we wish to present a results list that will be informative to the user. In many cases the user will not want to examine all the returned documents and so we want to make the results list informative enough that the user can do a final ranking of the documents for themselves based on relevance to their information need.³ The standard way of doing this is to provide a snippet, a short summary of the document, which is designed so as to allow the user to decide its relevance. Typically, the snippet consists of the document title and a short summary, which is automatically extracted. The question is how to design the summary so as to maximize its usefulness to the user.

3. There are exceptions, in domains where recall is emphasized. For instance, in many legal disclosure cases, a legal associate will review every document that matches a keyword search.

The two basic kinds of summaries are static, which are always the same regardless of the query, and dynamic (or query-dependent), which are customized according to the user's information need as deduced from a query. Dynamic summaries attempt to explain why a particular document was retrieved for the query at hand.

A static summary is generally comprised of either or both a subset of the document and metadata associated with the document. The simplest form of summary takes the first two sentences or 50 words of a document, or extracts particular zones of a document, such as the title and author. Instead of zones of a document, the summary can instead use metadata associated with the document. This may be an alternative way to provide an author or date, or may include elements which are designed to give a summary, such as the description metadata which can appear in the meta element of a web HTML page. This summary is typically extracted and cached at indexing time, in such a way that it can be retrieved and presented quickly when displaying search results, whereas having to access the actual document content might be a relatively expensive operation.

There has been extensive work within natural language processing (NLP) on better ways to do text summarization. Most such work still aims only to choose sentences from the original document to present and concentrates on how to select good sentences. The models typically combine positional factors, favoring the first and last paragraphs of documents and the first and last sentences of paragraphs, with content factors, emphasizing sentences with key terms, which have low document frequency in the collection as a whole, but high frequency and good distribution across the particular document being returned. In sophisticated NLP approaches, the system synthesizes sentences for a summary, either by doing full text generation or by editing and perhaps combining sentences used in the document. For example, it might delete a relative clause or replace a pronoun with the noun phrase that it refers to. This last class of methods remains in the realm of research and is seldom used for search results: it is easier, safer, and often even better to just use sentences from the original document.

Dynamic summaries display one or more “windows” on the document, aiming to present the pieces that have the most utility to the user in evaluating the document with respect to their information need. Usually these windows contain one or several of the query terms, and so are often referred to as keyword-in-context (KWIC) snippets, though sometimes they may still be pieces of the text such as the title that are selected for their query-independent information value just as in the case of static summarization. Dynamic summaries are generated in conjunction with scoring. If the query is found as a phrase, occurrences of the phrase in the document will be
shown as the summary. If not, windows within the document that contain multiple query terms will be selected. Commonly these windows may just stretch some number of words to the left and right of the query terms. This is a place where NLP techniques can usefully be employed: users prefer snippets that read well because they contain complete phrases.

... In recent years, Papua New Guinea has faced severe economic difficulties and economic growth has slowed, partly as a result of weak governance and civil war, and partly as a result of external factors such as the Bougainville civil war which led to the closure in 1989 of the Panguna mine (at that time the most important foreign exchange earner and contributor to Government finances), the Asian financial crisis, a decline in the prices of gold and copper, and a fall in the production of oil. PNG's economic development record over the past few years is evidence that governance issues underly many of the country's problems. Good governance, which may be defined as the transparent and accountable management of human, natural, economic and financial resources for the purposes of equitable and sustainable development, flows from proper public sector management, efficient fiscal and accounting mechanisms, and a willingness to make service delivery a priority in practice. ...

Figure 8.5 An example of selecting text for a dynamic snippet. This snippet was generated for a document in response to the query new guinea economic development. The figure shows in bold italic where the selected snippet text occurred in the original document.

Dynamic summaries are generally regarded as greatly improving the usability of IR systems, but they present a complication for IR system design. A dynamic summary cannot be precomputed, but, on the other hand, if a system has only a positional index, then it cannot easily reconstruct the context surrounding search engine hits in order to generate such a dynamic summary. This is one reason for using static summaries. The standard solution to this in a world of large and cheap disk drives is to locally cache all the documents at index time (notwithstanding that this approach raises various legal, information security and control issues that are far from resolved) as shown in Figure 7.5 (page 147). Then, a system can simply scan a document which is about to appear in a displayed results list to find snippets containing the query words.

Beyond simply access to the text, producing a good KWIC snippet requires some care. Given a variety of keyword occurrences in a document, the goal is to choose fragments which are: (i) maximally informative about the discussion of those terms in the document, (ii) self-contained enough to be easy to read, and (iii) short enough to fit within the normally strict constraints on the space available for summaries.

Generating snippets must be fast since the system is typically generating many snippets for each query that it handles. Rather than caching an entire document, it is common to cache only a generous but fixed size prefix of the document, such as perhaps 10,000 characters. For most common, short documents, the entire document is thus cached, but huge amounts of local storage will not be wasted on potentially vast documents. Summaries of documents whose length exceeds the prefix size will be based on material in the prefix only, which is in general a useful zone in which to look for a document summary anyway.

If a document has been updated since it was last processed by a crawler and indexer, these changes will be neither in the cache nor in the index. In these circumstances, neither the index nor the summary will accurately reflect the current contents of the document, but it is the differences between the summary and the actual document content that will be more glaringly obvious to the end user.

8.8 References and further reading

Definition and implementation of the notion of relevance to a query got off to a rocky start in 1953. Swanson (1988) reports that in an evaluation in that year between two teams, they agreed that 1390 documents were variously relevant to a set of 98 questions, but disagreed on a further 1577 documents, and the disagreements were never resolved.

Rigorous formal testing of IR systems was first completed in the Cranfield experiments, beginning in the late 1950s. A retrospective discussion of the Cranfield test collection and experimentation with it can be found in (Cleverdon 1991). The other seminal series of early IR experiments were those on the SMART system by Gerard Salton and colleagues (Salton 1971b; 1991). The TREC evaluations are described in detail by Voorhees and Harman (2005). Online information is available at http://trec.nist.gov/. Initially, few researchers computed the statistical significance of their experimental results, but the IR community increasingly demands this (Hull 1993). User studies of IR system effectiveness began more recently (Saracevic and Kantor 1988; 1996).

The notions of recall and precision were first used by Kent et al. (1955), although the term precision did not appear until later. The F measure (or, rather its complement E = 1 − F) was introduced by van Rijsbergen (1979). He provides an extensive theoretical discussion, which shows how adopting a principle of decreasing marginal relevance (at some point a user will be unwilling to sacrifice a unit of precision for an added unit of recall) leads to the harmonic mean being the appropriate method for combining precision and recall (and hence to its adoption rather than the minimum or geometric mean).

Buckley and Voorhees (2000) compare several evaluation measures, including precision at k, MAP, and R-precision, and evaluate the error rate of each measure. R-precision was adopted as the official evaluation metric in the TREC HARD track (Allan 2005). Aslam and Yilmaz (2005) examine its surprisingly close correlation to MAP, which had been noted in earlier studies (Tague-Sutcliffe and Blustein 1995, Buckley and Voorhees 2000). A standard program for evaluating IR systems which computes many measures of ranked retrieval effectiveness is Chris Buckley's trec_eval program used in the TREC evaluations. It can be downloaded from: http://trec.nist.gov/trec_eval/.

Kekäläinen and Järvelin (2002) argue for the superiority of graded relevance judgments when dealing with very large document collections, and Järvelin and Kekäläinen (2002) introduce cumulated gain-based methods for IR system evaluation in this context. Sakai (2007) does a study of the stability and sensitivity of evaluation measures based on graded relevance judgments from NTCIR tasks, and concludes that NDCG is best for evaluating document ranking.

Schamber et al. (1990) examine the concept of relevance, stressing its multidimensional and context-specific nature, but also arguing that it can be measured effectively. (Voorhees 2000) is the standard article for examining variation in relevance judgments and their effects on retrieval system scores and ranking for the TREC Ad Hoc task.
Voorhees concludes that although the numbers change, the rankings are quite stable. Hersh et al. (1994) present similar analysis for a medical IR collection. In contrast, Kekäläinen (2005) analyzes some of the later TRECs, exploring a 4-way relevance judgment and the notion of cumulative gain, arguing that the relevance measure used does substantially affect system rankings. See also Harter (1998). Zobel (1998) studies whether the pooling method used by TREC to collect a subset of documents that will be evaluated for relevance is reliable and fair, and concludes that it is.

The kappa statistic and its use for language-related purposes are discussed by Carletta (1996). Many standard sources (e.g., Siegel and Castellan 1988) present pooled calculation of the expected agreement, but Di Eugenio and Glass (2004) argue for preferring the unpooled agreement (though perhaps presenting multiple measures). For further discussion of alternative measures of agreement, which may in fact be better, see Lombard et al. (2002) and Krippendorff (2003).

Text summarization has been actively explored for many years. Modern work on sentence selection was initiated by Kupiec et al. (1995). More recent work includes (Barzilay and Elhadad 1997) and (Jing 2000), together with a broad selection of work appearing at the yearly DUC conferences and at other NLP venues. Tombros and Sanderson (1998) demonstrate the advantages of dynamic summaries in the IR context. Turpin et al. (2007) address how to generate snippets efficiently.
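As a concrete illustration of the keyword-in-context snippets described in Section 8.7, the following is a minimal Python sketch and not the method of any of the cited papers. It assumes whitespace tokenization and takes one window of words around each query-term hit, preferring windows that cover more distinct query terms. It scans only a fixed-size prefix of the cached document, as the text suggests. The function name kwic_snippet and the parameters width, max_windows, and prefix_chars are hypothetical names chosen for this example.

```python
import re


def kwic_snippet(text, query_terms, width=5, max_windows=2, prefix_chars=10_000):
    """Build a keyword-in-context snippet from the cached prefix of a document.

    width, max_windows and prefix_chars are illustrative assumptions; only the
    10,000-character prefix echoes an example figure given in the text.
    """
    words = text[:prefix_chars].split()
    # Normalize tokens for matching: strip punctuation and lowercase.
    norm = [re.sub(r"\W+", "", w).lower() for w in words]
    terms = {t.lower() for t in query_terms}

    # One candidate window per query-term hit, scored by distinct terms covered.
    candidates = []
    for i, token in enumerate(norm):
        if token in terms:
            lo, hi = max(0, i - width), min(len(words), i + width + 1)
            covered = len(terms.intersection(norm[lo:hi]))
            candidates.append((covered, lo, hi))

    # Greedily keep the best-scoring windows that do not overlap.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    chosen = []
    for _, lo, hi in candidates:
        if len(chosen) >= max_windows:
            break
        if all(hi <= c_lo or lo >= c_hi for c_lo, c_hi in chosen):
            chosen.append((lo, hi))

    # Present the windows in document order, separated by ellipses.
    chosen.sort()
    return " ... ".join(" ".join(words[lo:hi]) for lo, hi in chosen)
```

For instance, kwic_snippet("a b gold c d e f g h copper i j", ["gold", "copper"], width=1) returns "b gold c ... h copper i". A production system would also want windows that end on phrase or sentence boundaries, which is where the text notes that NLP techniques can help.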
Clickthrough log analysis is studied in (Joachims 2002b, Joachims et al. 2005). In a series of papers, Hersh, Turpin and colleagues show how improvements in formal retrieval effectiveness, as evaluated in batch experiments, do not always translate into an improved system for users (Hersh et al. 2000a; b; 2001, Turpin and Hersh 2001; 2002).

User interfaces for IR and human factors such as models of human information seeking and usability testing are outside the scope of what we cover in this book. More information on these topics can be found in other textbooks, including (Baeza-Yates and Ribeiro-Neto 1999, ch. 10) and (Korfhage 1997), and collections focused on cognitive aspects (Sp