Online edition (c) 2009 Cambridge UP

DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.

8 Evaluation in information retrieval

We have seen in the preceding chapters many alternatives in designing an IR system. How do we know which of these techniques are effective in which applications? Should we use stop lists? Should we stem? Should we use inverse document frequency weighting? Information retrieval has developed as a highly empirical discipline, requiring careful and thorough evaluation to demonstrate the superior performance of novel techniques on representative document collections.

In this chapter we begin with a discussion of measuring the effectiveness of IR systems (Section 8.1) and the test collections that are most often used for this purpose (Section 8.2). We then present the straightforward notion of relevant and nonrelevant documents and the formal evaluation methodology that has been developed for evaluating unranked retrieval results (Section 8.3). This includes explaining the kinds of evaluation measures that are standardly used for document retrieval and related tasks like text classification and why they are appropriate. We then extend these notions and develop further measures for evaluating ranked retrieval results (Section 8.4) and discuss developing reliable and informative test collections (Section 8.5).

We then step back to introduce the notion of user utility, and how it is approximated by the use of document relevance (Section 8.6). The key utility measure is user happiness. Speed of response and the size of the index are factors in user happiness. It seems reasonable to assume that relevance of results is the most important factor: blindingly fast, useless answers do not make a user happy. However, user perceptions do not always coincide with system designers' notions of quality. For example, user happiness commonly depends very strongly on user interface design issues, including the layout, clarity, and responsiveness of the user interface, which are independent of the quality of the results returned. We touch on other measures of the quality of a system, in particular the generation of high-quality result summary snippets, which strongly influence user utility, but are not measured in the basic relevance ranking paradigm (Section 8.7).

8.1 Information retrieval system evaluation

To measure ad hoc information retrieval effectiveness in the standard way, we need a test collection consisting of three things:

1. A document collection
2. A test suite of information needs, expressible as queries
3. A set of relevance judgments, standardly a binary assessment of either relevant or nonrelevant for each query-document pair.

The standard approach to information retrieval system evaluation revolves around the notion of relevant and nonrelevant documents. With respect to a user information need, a document in the test collection is given a binary classification as either relevant or nonrelevant. This decision is referred to as the gold standard or ground truth judgment of relevance. The test document collection and suite of information needs have to be of a reasonable size: you need to average performance over fairly large test sets, as results are highly variable over different documents and information needs. As a rule of thumb, 50 information needs has usually been found to be a sufficient minimum.

Relevance is assessed relative to an information need, not a query. For example, an information need might be:

    Information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.

This might be translated into a query such as:

    wine AND red AND white AND heart AND attack AND effective

A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query. This distinction is often misunderstood in practice, because the information need is not overt. But, nevertheless, an information need is present. If a user types python into a web search engine, they might be wanting to know where they can purchase a pet python. Or they might be wanting information on the programming language Python. From a one word query, it is very difficult for a system to know what the information need is. But, nevertheless, the user has one, and can judge the returned results on the basis of their relevance to it. To evaluate a system, we require an overt expression of an information need, which can be used for judging returned documents as relevant or nonrelevant. At this point, we make a simplification: relevance can reasonably be thought of as a scale, with some documents highly relevant and others marginally so. But for the moment, we will use just a binary decision of relevance. We discuss the reasons for using binary relevance judgments and alternatives in Section 8.5.1.

Many systems contain various weights (often known as parameters) that can be adjusted to tune system performance. It is wrong to report results on a test collection which were obtained by tuning these parameters to maximize performance on that collection. That is because such tuning overstates the expected performance of the system, because the weights will be set to maximize performance on one particular set of queries rather than for a random sample of queries. In such cases, the correct procedure is to have one or more development test collections, and to tune the parameters on the development test collection. The tester then runs the system with those weights on the test collection and reports the results on that collection as an unbiased estimate of performance.

8.2 Standard test collections

Here is a list of the most standard test collections and evaluation series. We focus particularly on test collections for ad hoc information retrieval system evaluation, but also mention a couple of similar test collections for text classification.

The Cranfield collection. This was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness, but is nowadays too small for anything but the most elementary pilot experiments. Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.

Text Retrieval Conference (TREC). The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992. Within this framework, there have been many tracks over a range of different test collections, but the best known test collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations between 1992 and 1999. In total, these test collections comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450 information needs, which are called topics and specified in detailed text passages. Individual test collections are defined over different subsets of this data. The early TRECs each consisted of 50 information needs, evaluated over different but overlapping sets of documents. TRECs 6-8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles. This is probably the best subcollection to use in future work, because it is the largest and the topics are more consistent. Because the test document collections are so large, there are no exhaustive relevance judgments. Rather, NIST assessors' relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed.

In more recent years, NIST has done evaluations on larger document collections, including the 25 million page GOV2 web page collection. From the beginning, the NIST test document collections were orders of magnitude larger than anything available to researchers previously and GOV2 is now the largest Web collection easily available for research purposes. Nevertheless, the size of GOV2 is still more than 2 orders of magnitude smaller than the current size of the document collections indexed by the large web search companies.

NII Test Collections for IR Systems (NTCIR). The NTCIR project has built various test collections of similar sizes to the TREC collections, focusing on East Asian language and cross-language information retrieval, where queries are made in one language over a document collection containing documents in one or more other languages. See: http://research.nii.ac.jp/ntcir/data/data-en.html

Cross Language Evaluation Forum (CLEF). This evaluation series has concentrated on European languages and cross-language information retrieval. See: http://www.clef-campaign.org/

Reuters-21578 and Reuters-RCV1. For text classification, the most used test collection has been the Reuters-21578 collection of 21578 newswire articles; see Chapter 13,
page 279. More recently, Reuters released the much larger Reuters Corpus Volume 1 (RCV1), consisting of 806,791 documents; see Chapter 4, page 69. Its scale and rich annotation makes it a better basis for future research.

20 Newsgroups. This is another widely used text classification collection, collected by Ken Lang. It consists of 1000 articles from each of 20 Usenet newsgroups (the newsgroup name being regarded as the category). After the removal of duplicate articles, as it is usually used, it contains 18941 articles.

8.3 Evaluation of unranked retrieval sets

Given these ingredients, how is system effectiveness measured? The two most frequent and basic measures for information retrieval effectiveness are precision and recall. These are first defined for the simple case where an IR system returns a set of documents for a query. We will see later how to extend these notions to ranked retrieval situations.

Precision (P) is the fraction of retrieved documents that are relevant

$$\text{Precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})} = P(\text{relevant} \mid \text{retrieved}) \qquad (8.1)$$

Recall (R) is the fraction of relevant documents that are retrieved

$$\text{Recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})} = P(\text{retrieved} \mid \text{relevant}) \qquad (8.2)$$

These notions can be made clear by examining the following contingency table:

(8.3)
                 Relevant                Nonrelevant
Retrieved        true positives (tp)     false positives (fp)
Not retrieved    false negatives (fn)    true negatives (tn)

Then:

$$P = \frac{tp}{tp + fp} \qquad R = \frac{tp}{tp + fn} \qquad (8.4)$$

An obvious alternative that may occur to the reader is to judge an information retrieval system by its accuracy, that is, the fraction of its classifications that are correct. In terms of the contingency table above, accuracy = (tp + tn)/(tp + fp + fn + tn). This seems plausible, since there are two actual classes, relevant and nonrelevant, and an information retrieval system can be thought of as a two-class classifier which attempts to label them as such (it retrieves the subset of documents which it believes to be relevant). This is precisely the effectiveness measure often used for evaluating machine learning classification problems.

There is a good reason why accuracy is not an appropriate measure for information retrieval problems. In almost all circumstances, the data is extremely skewed: normally over 99.9% of the documents are in the nonrelevant category. A system tuned to maximize accuracy can appear to perform well by simply deeming all documents nonrelevant to all queries. Even if the system is quite good, trying to label some documents as relevant will almost always lead to a high rate of false positives. However, labeling all documents as nonrelevant is completely unsatisfying to an information retrieval system user. Users are always going to want to see some documents, and can be assumed to have a certain tolerance for seeing some false positives providing that they get some useful information. The measures of precision and recall concentrate the evaluation on the return of true positives, asking what percentage of the relevant documents have been found and how many false positives have also been returned.

The advantage of having the two numbers for precision and recall is that one is more important than the other in many circumstances. Typical web surfers would like every result on the first page to be relevant (high precision) but have not the slightest interest in knowing let alone looking at every document that is relevant. In contrast, various professional searchers such as paralegals and intelligence analysts are very concerned with trying to get as high recall as possible, and will tolerate fairly low precision results in order to get it. Individuals searching their hard disks are also often interested in high recall searches. Nevertheless, the two quantities clearly trade off against one another: you can always get a recall of 1 (but very low precision) by retrieving all documents for all queries! Recall is a non-decreasing function of the number of documents retrieved. On the other hand, in a good system, precision usually decreases as the number of documents retrieved is increased. In general we want to get some amount of recall while tolerating only a certain percentage of false positives.

A single measure that trades off precision versus recall is the F measure, which is the weighted harmonic mean of precision and recall:

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} \quad \text{where} \quad \beta^2 = \frac{1 - \alpha}{\alpha} \qquad (8.5)$$
where $\alpha \in [0, 1]$ and thus $\beta^2 \in [0, \infty]$. The default balanced F measure equally weights precision and recall, which means making $\alpha = 1/2$ or $\beta = 1$. It is commonly written as $F_1$, which is short for $F_{\beta=1}$, even though the formulation in terms of $\alpha$ more transparently exhibits the F measure as a weighted harmonic mean. When using $\beta = 1$, the formula on the right simplifies to:

$$F_{\beta=1} = \frac{2PR}{P + R} \qquad (8.6)$$

However, using an even weighting is not the only choice. Values of $\beta < 1$ emphasize precision, while values of $\beta > 1$ emphasize recall. For example, a value of $\beta = 3$ or $\beta = 5$ might be used if recall is to be emphasized. Recall, precision, and the F measure are inherently measures between 0 and 1, but they are also very commonly written as percentages, on a scale between 0 and 100.

Why do we use a harmonic mean rather than the simpler average (arithmetic mean)? Recall that we can always get 100% recall by just returning all documents, and therefore we can always get a 50% arithmetic mean by the same process. This strongly suggests that the arithmetic mean is an unsuitable measure to use. In contrast, if we assume that 1 document in 10,000 is relevant to the query, the harmonic mean score of this strategy is 0.02%. The harmonic mean is always less than or equal to the arithmetic mean and the geometric mean. When the values of two numbers differ greatly, the harmonic mean is closer to their minimum than to their arithmetic mean; see Figure 8.1.

[Figure 8.1: Graph comparing the harmonic mean to other means. The graph shows a slice through the calculation of various means of precision and recall for the fixed recall value of 70%. The harmonic mean is always less than either the arithmetic or geometric mean, and often quite close to the minimum of the two numbers. When the precision is also 70%, all the measures coincide.]

Exercise 8.1 [*] An IR system returns 8 relevant documents, and 10 nonrelevant documents. There are a total of 20 relevant documents in the collection. What is the precision of the system on this search, and what is its recall?

Exercise 8.2 [*] The balanced F measure (a.k.a. F1) is defined as the harmonic mean of precision and recall. What is the advantage of using the harmonic mean rather than "averaging" (using the arithmetic mean)?
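
The set-based measures of Section 8.3 are simple to compute from the contingency counts. The Python sketch below is illustrative only: the function names, the convention of returning 0 when a denominator is zero, and the scenario (a 10,000-document collection with exactly one relevant document, where the system returns everything, as in the harmonic-mean discussion above) are assumptions made for this example.

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant (Eq. 8.1)."""
    retrieved = tp + fp
    # Assumed convention: precision is 0 when nothing is retrieved.
    return tp / retrieved if retrieved else 0.0


def recall(tp, fn):
    """Fraction of relevant documents that are retrieved (Eq. 8.2)."""
    relevant = tp + fn
    # Assumed convention: recall is 0 when nothing is relevant.
    return tp / relevant if relevant else 0.0


def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall (Eq. 8.5)."""
    b2 = beta * beta
    denom = b2 * p + r
    # Assumed convention: F is 0 when the denominator vanishes.
    return (b2 + 1) * p * r / denom if denom else 0.0


# Hypothetical scenario: return all 10,000 documents, 1 of which is relevant.
tp, fp, fn = 1, 9999, 0
p = precision(tp, fp)        # 1 / 10,000
r = recall(tp, fn)           # 1.0
arithmetic = (p + r) / 2     # roughly 0.5, misleadingly high
f1 = f_measure(p, r)         # roughly 0.0002, i.e. about 0.02%
f3 = f_measure(p, r, beta=3) # beta > 1 weights recall more heavily

print(f"P={p:.4f} R={r:.1f} arithmetic={arithmetic:.4f} F1={f1:.6f} F3={f3:.6f}")
```

The contrast between `arithmetic` and `f1` is the point of the example: returning everything earns about 50% under the arithmetic mean but only about 0.02% under the harmonic mean.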

[Figure 8.2: Precision/recall graph.]

Exercise 8.3 [**] Derive the equivalence between the two formulas for F measure shown in Equation (8.5), given that $\alpha = 1/(\beta^2 + 1)$.

8.4 Evaluation of ranked retrieval results

Precision, recall, and the F measure are set-based measures. They are computed using unordered sets of documents. We need to extend these measures (or to define new measures) if we are to evaluate the ranked retrieval results that are now standard with search engines. In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents. For each such set, precision and recall values can be plotted to give a precision-recall curve, such as the one shown in Figure 8.2. Precision-recall curves have a distinctive saw-tooth shape: if the (k+1)th document retrieved is nonrelevant then recall is the same as for the top k documents, but precision has dropped. If it is relevant, then both precision and recall increase, and the curve jags up and to the right. It is often useful to remove these jiggles and the standard way to do this is with an interpolated precision: the interpolated precision $p_{interp}$ at a certain recall level $r$ is defined as the highest precision found for any recall level $r' \geq r$:

$$p_{interp}(r) = \max_{r' \geq r} p(r') \qquad (8.7)$$

The justification is that almost anyone would be prepared to look at a few more documents if it would increase the percentage of the viewed set that were relevant (that is, if the precision of the larger set is higher). Interpolated precision is shown by a thinner line in Figure 8.2. With this definition, the interpolated precision at a recall of 0 is well-defined (Exercise 8.4).

Table 8.1: Calculation of 11-point Interpolated Average Precision. This is for the precision-recall curve shown in Figure 8.2.

Recall    Interp. Precision
0.0       1.00
0.1       0.67
0.2       0.63
0.3       0.55
0.4       0.45
0.5       0.41
0.6       0.36
0.7       0.29
0.8       0.13
0.9       0.10
1.0       0.08

Examining the entire precision-recall curve is very informative, but there is often a desire to boil this information down to a few numbers, or perhaps even a single number. The traditional way of doing this (used for instance in the first 8 TREC Ad Hoc evaluations) is
the 11-point interpolated average precision. For each information need, the interpolated precision is measured at the 11 recall levels of 0.0, 0.1, 0.2, ..., 1.0. For the precision-recall curve in Figure 8.2, these 11 values are shown in Table 8.1. For each recall level, we then calculate the arithmetic mean of the interpolated precision at that recall level for each information need in the test collection. A composite precision-recall curve showing 11 points can then be graphed. Figure 8.3 shows an example graph of such results from a representative good system at TREC 8.

[Figure 8.3: Averaged 11-point precision/recall graph across 50 queries for a representative TREC system. The Mean Average Precision for this system is 0.2553.]

In recent years, other measures have become more common. Most standard among the TREC community is Mean Average Precision (MAP), which provides a single-figure measure of quality across recall levels. Among evaluation measures, MAP has been shown to have especially good discrimination and stability. For a single information need, Average Precision is the average of the precision value obtained for the set of top k documents existing after each relevant document is retrieved, and this value is then averaged over information needs. That is, if the set of relevant documents for an information need $q_j \in Q$ is $\{d_1, \ldots, d_{m_j}\}$ and $R_{jk}$ is the set of ranked retrieval results from the top result until you get to document $d_k$, then

$$\text{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \text{Precision}(R_{jk}) \qquad (8.8)$$

When a relevant document is not retrieved at all,[1] the precision value in the above equation is taken to be 0. [Footnote 1: A system may not fully order all documents in the collection in response to a query or at any rate an evaluation exercise may be based on submitting only the top k results for each information need.] For a single information need, the average precision approximates the area under the uninterpolated precision-recall curve, and so the MAP is roughly the average area under the precision-recall curve for a set of queries.

Using MAP, fixed recall levels are not chosen, and there is no interpolation. The MAP value for a test collection is the arithmetic mean of average precision values for individual information needs. (This has the effect of weighting each information need equally in the final reported number, even if many documents are relevant to some queries whereas very few are relevant to other queries.) Calculated MAP scores normally vary widely across information needs when measured within a single system, for instance, between 0.1 and 0.7. Indeed, there is normally more agreement in MAP for an individual information need across systems than for MAP scores for different information needs for the same system. This means that a set of test information needs must be large and diverse enough to be representative of system effectiveness across different queries.

The above measures factor in precision at all recall levels. For many prominent applications, particularly web search, this may not be germane to users. What matters is rather how many good results there are on the first page or the first three pages. This leads to measuring precision at fixed low levels of retrieved results, such as 10 or 30 documents. This is referred to as "Precision at k", for example "Precision at 10". It has the advantage of not requiring any estimate of the size of the set of relevant documents but the disadvantages that it is the least stable of the commonly used evaluation measures and that it does not average well, since the total number of relevant documents for a query has a strong influence on precision at k.

An alternative, which alleviates this problem, is R-precision. It requires having a set of known relevant documents Rel, from which we calculate the precision of the top |Rel| documents returned. (The set Rel may be incomplete, such as when Rel is formed by creating relevance judgments for the pooled top k results of particular systems in a set of experiments.) R-precision adjusts for the size of the set of relevant documents: a perfect system could score 1 on this metric for each query, whereas even a perfect system could only achieve a precision at 20 of 0.4 if there were only 8 documents in the collection relevant to an information need. Averaging this measure across queries thus makes more sense. This measure is harder to explain to naive users than Precision at k but easier to explain than MAP. If there are |Rel| relevant documents for a query, we examine the top |Rel| results of a system, and find that r are relevant, then by definition, not only is the precision (and hence R-precision) r/|Rel|, but the recall of this result set is also r/|Rel|. Thus, R-precision turns out to be identical to the break-even point, another measure which is sometimes used, defined in terms of this equality relationship holding. Like Precision at k, R-precision describes only one point on the precision-recall curve, rather than attempting to summarize effectiveness across the curve, and it is somewhat unclear why you should be interested in the break-even point rather than either the best point on the curve (the point with maximal F-measure) or a retrieval level of interest to a particular application (Precision at k). Nevertheless, R-precision turns out to be highly correlated with MAP empirically, despite measuring only a single point on the curve.

[Figure 8.4: The ROC curve corresponding to the precision-recall curve in Figure 8.2.]

Another concept sometimes used in evaluation is an ROC curve. ("ROC" stands for "Receiver Operating Characteristics", but knowing that doesn't help most people.) An ROC curve plots the true positive rate or sensitivity against the false positive rate or (1 - specificity). Here, sensitivity is just another term for recall. The false positive rate is given by fp/(fp + tn). Figure 8.4 shows the ROC curve corresponding to the precision-recall curve in Figure 8.2. An ROC curve always goes from the bottom left to the top right of the graph. For a good system, the graph climbs steeply on the left side. For unranked result sets, specificity, given by tn/(fp + tn), was not seen as a very useful notion. Because the set of true negatives is always so large, its value would be almost 1 for all information needs (and, correspondingly, the value of the false positive rate would be almost 0). That is, the "interesting" part of Figure 8.2 is 0 < recall < 0.4, a part which is compressed to a small corner of Figure 8.4. But an ROC curve could make sense when looking over the full retrieval spectrum, and it provides another way of looking at the data. In many fields, a common aggregate measure is to report the area under the ROC curve, which is the ROC analog of MAP. Precision-recall curves are sometimes loosely referred to as ROC curves. This is understandable, but not accurate.

A final approach that has seen increasing adoption, especially when employed with machine learning approaches to ranking (see Section 15.4, page 341) is measures of cumulative gain, and in particular normalized discounted cumulative gain (NDCG). NDCG is designed for situations of non-binary notions of relevance (cf. Section 8.5.1). Like precision at k, it is evaluated over some number k of top search results. For a set of queries Q, let R(j, d) be the relevance score assessors gave to document d for query j. Then,

$$\text{NDCG}(Q, k) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} Z_{kj} \sum_{m=1}^{k} \frac{2^{R(j,m)} - 1}{\log_2(1 + m)} \qquad (8.9)$$

where $Z_{kj}$ is a normalization factor calculated to make it so that a perfect ranking's NDCG at k for query j is 1. For queries for which k' < k documents are retrieved, the last summation is done up to k'.

Exercise 8.4 [*] What are the possible values for interpolated precision at a recall level of 0?

Exercise 8.5 [**] Must there always be a break-even point between precision and recall? Either show there must be or give a counter-example.

Exercise 8.6 [**] What is the relationship between the value of F1 and the break-even point?

Exercise 8.7 [**] The Dice coefficient of two sets is a measure of their intersection scaled by their size (giving a value in the range 0 to 1):

$$\text{Dice}(X, Y) = \frac{2 |X \cap Y|}{|X| + |Y|}$$

Show that the balanced F-measure (F1) is equal to the Dice coefficient of the retrieved and relevant document sets.

Exercise 8.8 [*] Consider an information need for which there are 4 relevant documents in the collection. Contrast two systems run on this collection. Their top 10 results are judged for relevance as follows (the leftmost item is the top ranked search result):

System 1    R N R N N N N N R R
System 2    N R N N R R R N N N

a. What is the MAP of each system? Which has a higher MAP?
b. Does this result intuitively make sense? What does it say about what is important in getting a good MAP score?
c. What is the R-precision of each system? (Does it rank the systems the same as MAP?)
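
The ranked measures of Section 8.4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names and the toy rankings are hypothetical, rankings are given as lists of booleans (or graded scores for NDCG) in rank order, and the NDCG normalizer is computed from `ideal_gains`, which is assumed to hold the judged relevance scores of all documents for the query.

```python
import math


def average_precision(ranking, num_relevant):
    """Average Precision for one information need (inner sum of Eq. 8.8).

    ranking: booleans in rank order, True meaning relevant.
    num_relevant: m_j, the number of relevant documents in the collection;
    relevant documents that are never retrieved contribute a precision of 0.
    """
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision of the top `rank` results
    return total / num_relevant if num_relevant else 0.0


def mean_average_precision(runs):
    """MAP over (ranking, num_relevant) pairs, one per information need."""
    return sum(average_precision(rk, m) for rk, m in runs) / len(runs)


def r_precision(ranking, num_relevant):
    """Precision of the top |Rel| results."""
    return sum(ranking[:num_relevant]) / num_relevant


def ndcg_at_k(gains, k, ideal_gains=None):
    """NDCG at k for one query (inner part of Eq. 8.9).

    gains: graded relevance scores R(j, m) in rank order.
    ideal_gains: scores of all judged documents (assumed; defaults to gains).
    """
    def dcg(scores):
        return sum((2 ** g - 1) / math.log2(1 + m)
                   for m, g in enumerate(scores[:k], start=1))

    ideal = dcg(sorted(gains if ideal_gains is None else ideal_gains,
                       reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical toy data: two information needs.
q1 = ([True, False, True, False, False], 3)  # one relevant doc never retrieved
q2 = ([False, True], 1)

ap1 = average_precision(*q1)       # (1/1 + 2/3 + 0) / 3
map_score = mean_average_precision([q1, q2])
rp1 = r_precision(*q1)             # 2 of the top 3 are relevant
ndcg = ndcg_at_k([3, 0, 2], k=3)   # below 1: the grade-2 doc is ranked too low

print(ap1, map_score, rp1, ndcg)
```

Passing the total number of relevant documents separately from the ranking is what lets unretrieved relevant documents count as zero-precision terms, matching the convention stated for Equation (8.8).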

Exercise 8.9 [**] The following list of Rs and Ns represents relevant (R) and nonrelevant (N) returned documents in a ranked list of 20 documents retrieved in response to a query from a collection of 10,000 documents. The top of the ranked list (the document the system thinks is most likely to be relevant) is on the left of the list. This list shows 6 relevant documents. Assume that there are 8 relevant documents in total in the collection.

R R N N N N N N R N R N N N R N N N N R

a. What is the precision of the system on the top 20?
b. What is the F1 on the top 20?
c. What is the uninterpolated precision of the system at 25% recall?
d. What is the interpolated precision at 33% recall?
e. Assume that these 20 documents are the complete result set of the system. What is the MAP for the query?

Assume, now, instead, that the system returned the entire 10,000 documents in a ranked list, and these are the first 20 results returned.

f. What is the largest possible MAP that this system could have?
g. What is the smallest possible MAP that this system could have?
h. In a set of experiments, only the top 20 results are evaluated by hand. The result in (e) is used to approximate the range (f)-(g). For this example, how large (in absolute terms) can the error for the MAP be by calculating (e) instead of (f) and (g) for this query?

8.5 Assessing relevance

To properly evaluate a system, your test information needs must be germane to the documents in the test document collection, and appropriate for predicted usage of the system. These information needs are best designed by domain experts. Using random combinations of query terms as an information need is generally not a good idea because typically they will not resemble the actual distribution of information needs.

Given information needs and documents, you need to collect relevance assessments. This is a time-consuming and expensive process involving human beings. For tiny collections like Cranfield, exhaustive judgments of relevance for each query and document pair were obtained. For large modern collections, it is usual for relevance to be assessed only for a subset of the documents for each query. The most standard approach is pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems (usually the ones to be evaluated), and perhaps other sources such as the results of Boolean keyword searches or documents found by expert searchers in an interactive process.

Table 8.2: Calculating the kappa statistic.

                          Judge 2 Relevance
                          Yes     No      Total
Judge 1       Yes         300     20      320
Relevance     No          10      70      80
              Total       310     90      400

Observed proportion of the times the judges agreed
    P(A) = (300 + 70)/400 = 370/400 = 0.925
Pooled marginals
    P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
    P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875
Probability that the two judges agreed by chance
    P(E) = P(nonrelevant)^2 + P(relevant)^2 = 0.2125^2 + 0.7875^2 = 0.665
Kappa statistic
    kappa = (P(A) - P(E))/(1 - P(E)) = (0.925 - 0.665)/(1 - 0.665) = 0.776

A human is not a device that reliably reports a gold standard judgment of relevance of a document to a query. Rather, humans and their relevance judgments are quite idiosyncratic and variable. But this is not a problem to be solved: in the final analysis, the success of an IR system depends on how good it is at satisfying the needs of these idiosyncratic humans, one information need at a time.

Nevertheless, it is interesting to consider and measure how much agreement between judges there is on relevance judgments. In the social sciences, a common measure for agreement between judges is the kappa statistic. It is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement.

$$\text{kappa} = \frac{P(A) - P(E)}{1 - P(E)} \qquad (8.10)$$

where P(A) is the proportion of the times the judges agreed, and P(E) is the proportion of the times they would be expected to agree by chance. There are choices in how the latter is estimated: if we simply say we are making a two-class decision and assume nothing more, then the expected chance agreement rate is 0.5. However, normally the class distribution assigned is skewed, and it is usual to use marginal statistics to calculate expected agreement.[2] [Footnote 2: For a contingency table, as in Table 8.2, a marginal statistic is formed by summing a row or column. The marginal $a_{i.k} = \sum_j a_{ijk}$.] There are still two ways to do it depending on whether one pools
the marginal distribution across judges or uses the marginals for each judge separately; both forms have been used, but we present the pooled version because it is more conservative in the presence of systematic differences in assessments across judges. The calculations are shown in Table 8.2. The kappa value will be 1 if two judges always agree, 0 if they agree only at the rate given by chance, and negative if they are worse than random. If there are more than two judges, it is normal to calculate an average pairwise kappa value. As a rule of thumb, a kappa value above 0.8 is taken as good agreement, a kappa value between 0.67 and 0.8 is taken as fair agreement, and agreement below 0.67 is seen as data providing a dubious basis for an evaluation, though the precise cutoffs depend on the purposes for which the data will be used.

Interjudge agreement of relevance has been measured within the TREC evaluations and for medical IR collections. Using the above rules of thumb, the level of agreement normally falls in the range of "fair" (0.67-0.8). The fact that human agreement on a binary relevance judgment is quite modest is one reason for not requiring more fine-grained relevance labeling from the test set creator. To answer the question of whether IR evaluation results are valid despite the variation of individual assessors' judgments, people have experimented with evaluations taking one or the other of two judges' opinions as the gold standard. The choice can make a considerable absolute difference to reported scores, but has in general been found to have little impact on the relative effectiveness ranking of either different systems or variants of a single system which are being compared for effectiveness.

8.5.1 Critiques and justifications of the concept of relevance

The advantage of system evaluation, as enabled by the standard model of relevant and nonrelevant documents, is that we have a fixed setting in which we can vary IR systems and system parameters to carry out comparative experiments. Such formal testing is much less expensive and allows clearer diagnosis of the effect of changing system parameters than doing user studies of retrieval effectiveness. Indeed, once we have a formal measure that we have confidence in, we can proceed to optimize effectiveness by machine learning methods, rather than tuning parameters by hand. Of course, if the formal measure poorly describes what users actually want, doing this will not be effective in improving user satisfaction. Our perspective is that, in practice, the standard formal measures for IR evaluation, although a simplification, are good enough, and recent work in optimizing formal evaluation measures in IR has succeeded brilliantly. There are numerous examples of techniques developed in formal evaluation settings, which improve effectiveness in operational settings, such as the development of document length normalization methods within the context of TREC (Sections 6.4.4 and 11.4.3) and machine learning methods for adjusting parameter weights in scoring (Section 6.1.2).

That is not to say that there are not problems latent within the abstractions used. The relevance of one document is treated as independent of the relevance of other documents in the collection. (This assumption is actually built into most retrieval systems, where documents are scored against queries, not against each other, as well as being assumed in the evaluation methods.) Assessments are binary: there aren't any nuanced assessments of relevance. Relevance of a document to an information need is treated as an absolute, objective decision. But judgments of relevance are subjective, varying across people, as we discussed above. In practice, human assessors are also imperfect measuring instruments, susceptible to failures of understanding and attention. We also have to assume that users' information needs do not change as they start looking at retrieval results. Any results based on one collection are heavily skewed by the choice of collection, queries, and relevance judgment set: the results may not translate from one domain to another or to a different user population.

Some of these problems may be fixable. A number of recent evaluations, including INEX, some TREC tracks, and NTCIR have adopted an ordinal notion of relevance with documents divided into 3 or 4 classes, distinguishing slightly relevant documents from highly relevant documents. See Section 10.4 (page 210
) for a detailed discussion of how this is implemented in the INEX evaluations.

One clear problem with the relevance-based assessment that we have presented is the distinction between relevance and marginal relevance: whether a document still has distinctive usefulness after the user has looked at certain other documents (Carbonell and Goldstein 1998). Even if a document is highly relevant, its information can be completely redundant with other documents which have already been examined. The most extreme case of this is documents that are duplicates, a phenomenon that is actually very common on the World Wide Web, but it can also easily occur when several documents provide a similar precis of an event. In such circumstances, marginal relevance is clearly a better measure of utility to the user. Maximizing marginal relevance requires returning documents that exhibit diversity and novelty. One way to approach measuring this is by using distinct facts or entities as evaluation units. This perhaps more directly measures true utility to the user but doing this makes it harder to create a test collection.

Exercise 8.10 [⋆⋆]
Below is a table showing how two human judges rated the relevance of a set of 12 documents to a particular information need (0 = nonrelevant, 1 = relevant). Let us assume that you've written an IR system that for this query returns the set of documents {4, 5, 6, 7, 8}.

docID   Judge 1   Judge 2
1       0         0
2       0         0
3       1         1
4       1         1
5       1         0
6       1         0
7       1         0
8       1         0
9       0         1
10      0         1
11      0         1
12      0         1

a. Calculate the kappa measure between the two judges.
b. Calculate precision, recall, and F1 of your system if a document is considered relevant only if the two judges agree.
c.
Calculate precision, recall, and F1 of your system if a document is considered relevant if either judge thinks it is relevant.

8.6 A broader perspective: System quality and user utility

Formal evaluation measures are at some distance from our ultimate interest in measures of human utility: how satisfied is each user with the results the system gives for each information need that they pose? The standard way to measure human satisfaction is by various kinds of user studies. These might include quantitative measures, both objective, such as time to complete a task, as well as subjective, such as a score for satisfaction with the search engine, and qualitative measures, such as user comments on the search interface. In this section we will touch on other system aspects that allow quantitative evaluation and the issue of user utility.

8.6.1 System issues

There are many practical benchmarks on which to rate an information retrieval system beyond its retrieval quality. These include:

• How fast does it index, that is, how many documents per hour does it index for a certain distribution over document lengths? (cf. Chapter 4)
• How fast does it search, that is, what is its latency as a function of index size?
• How expressive is its query language? How fast is it on complex queries?
• How large is its document collection, in terms of the number of documents or the collection having information distributed across a broad range of topics?

All these criteria apart from query language expressiveness are straightforwardly measurable: we can quantify the speed or size. Various kinds of feature checklists can make query language expressiveness semi-precise.

8.6.2 User utility

What we would really like is a way of quantifying aggregate user happiness, based on the relevance, speed, and user interface of a system. One part of this is understanding the distribution of people we wish to make happy, and this depends entirely on the setting. For a web search engine, happy search users are those who find what they want. One indirect measure of such users is that they tend to return to the same engine. Measuring the rate of return of users is thus an effective metric, which would of course be more effective if you could also measure how much these users used other search engines. But advertisers are also users of modern web search engines. They are happy if customers click through to their sites and then make purchases. On an eCommerce web site, a user is likely to be wanting to purchase something. Thus, we can measure the time to purchase, or the fraction of searchers who become buyers. On a shopfront web site, perhaps both the user's and the store owner's needs are satisfied if a purchase is made. Nevertheless, in general, we need to decide whether it is the end user's or the eCommerce site owner's happiness that we are trying to optimize. Usually, it is the store owner who is paying us.

For an enterprise (company, government, or academic) intranet search engine, the relevant metric is more likely to be user productivity: how much time do users spend looking for information that they need. There are also many other practical criteria concerning such
matters as information security, which we mentioned in Section 4.6 (page 80).

User happiness is elusive to measure, and this is part of why the standard methodology uses the proxy of relevance of search results. The standard direct way to get at user satisfaction is to run user studies, where people engage in tasks, and usually various metrics are measured, the participants are observed, and ethnographic interview techniques are used to get qualitative information on satisfaction. User studies are very useful in system design, but they are time consuming and expensive to do. They are also difficult to do well, and expertise is required to design the studies and to interpret the results. We will not discuss the details of human usability testing here.

8.6.3 Refining a deployed system

If an IR system has been built and is being used by a large number of users, the system's builders can evaluate possible changes by deploying variant versions of the system and recording measures that are indicative of user satisfaction with one variant vs. others as they are being used. This method is frequently used by web search engines.

The most common version of this is A/B testing, a term borrowed from the advertising industry. For such a test, precisely one thing is changed between the current system and a proposed system, and a small proportion of traffic (say, 1–10% of users) is randomly directed to the variant system, while most users use the current system. For example, if we wish to investigate a change to the ranking algorithm, we redirect a random sample of users to a variant system and evaluate measures such as the frequency with which people click on the top result, or any result on the first page. (This particular analysis method is referred to as clickthrough log analysis or clickstream mining. It is further discussed as a method of implicit feedback in Section 9.1.7 (page 187
).) The basis of A/B testing is running a bunch of single variable tests (either in sequence or in parallel): for each test only one parameter is varied from the control (the current live system). It is therefore easy to see whether varying each parameter has a positive or negative effect.

Such testing of a live system can easily and cheaply gauge the effect of a change on users, and, with a large enough user base, it is practical to measure even very small positive and negative effects. In principle, more analytic power can be achieved by varying multiple things at once in an uncorrelated (random) way, and doing standard multivariate statistical analysis, such as multiple linear regression. In practice, though, A/B testing is widely used, because A/B tests are easy to deploy, easy to understand, and easy to explain to management.

8.7 Results snippets

Having chosen or ranked the documents matching a query, we wish to present a results list that will be informative to the user. In many cases the user will not want to examine all the returned documents and so we want to make the results list informative enough that the user can do a final ranking of the documents for themselves based on relevance to their information need. [Footnote 3: There are exceptions, in domains where recall is emphasized. For instance, in many legal disclosure cases, a legal associate will review every document that matches a keyword search.] The standard way of doing this is to provide a snippet, a short summary of the document, which is designed so as to allow the user to decide its relevance. Typically, the snippet consists of the document title and a short
summary, which is automatically extracted. The question is how to design the summary so as to maximize its usefulness to the user.

The two basic kinds of summaries are static, which are always the same regardless of the query, and dynamic (or query-dependent), which are customized according to the user's information need as deduced from a query. Dynamic summaries attempt to explain why a particular document was retrieved for the query at hand.

A static summary generally comprises either or both of a subset of the document and metadata associated with the document. The simplest form of summary takes the first two sentences or 50 words of a document, or extracts particular zones of a document, such as the title and author. Instead of zones of a document, the summary can use metadata associated with the document. This may be an alternative way to provide an author or date, or may include elements which are designed to give a summary, such as the description metadata which can appear in the meta element of a web HTML page. This summary is typically extracted and cached at indexing time, in such a way that it can be retrieved and presented quickly when displaying search results, whereas having to access the actual document content might be a relatively expensive operation.

There has been extensive work within natural language processing (NLP) on better ways to do text summarization. Most such work still aims only to choose sentences from the original document to present and concentrates on how to select good sentences. The models typically combine positional factors, favoring the first and last paragraphs of documents and the first and last sentences of paragraphs, with content factors, emphasizing sentences with key terms, which have low document frequency in the collection as a whole, but high frequency and good distribution across the particular document being returned. In sophisticated NLP approaches, the system synthesizes sentences for a summary, either by doing full text generation or by editing and perhaps combining sentences used in the document. For example, it might delete a relative clause or replace a pronoun with the noun phrase that it refers to. This last class of methods remains in the realm of research and is seldom used for search results: it is easier, safer, and often even better to just use sentences from the original document.

Dynamic summaries display one or more "windows" on the document, aiming to present the pieces that have the most utility to the user in evaluating the document with respect to their information need. Usually these windows contain one or several of the query terms, and so are often referred to as keyword-in-context (KWIC) snippets, though sometimes they may still be pieces of the text such as the title that are selected for their query-independent information value just as in the case of static summarization. Dynamic summaries are generated in conjunction with scoring. If the query is found as a phrase, occurrences of the phrase in the document will be
shown as the summary. If not, windows within the document that contain multiple query terms will be selected. Commonly these windows may just stretch some number of words to the left and right of the query terms. This is a place where NLP techniques can usefully be employed: users prefer snippets that read well because they contain complete phrases.

... In recent years, Papua New Guinea has faced severe economic difficulties and economic growth has slowed, partly as a result of weak governance and civil war, and partly as a result of external factors such as the Bougainville civil war which led to the closure in 1989 of the Panguna mine (at that time the most important foreign exchange earner and contributor to Government finances), the Asian financial crisis, a decline in the prices of gold and copper, and a fall in the production of oil. PNG's economic development record over the past few years is evidence that governance issues underly many of the country's problems. Good governance, which may be defined as the transparent and accountable management of human, natural, economic and financial resources for the purposes of equitable and sustainable development, flows from proper public sector management, efficient fiscal and accounting mechanisms, and a willingness to make service delivery a priority in practice. ...

Figure 8.5 An example of selecting text for a dynamic snippet. This snippet was generated for a document in response to the query new guinea economic development. The figure shows in bold italic where the selected snippet text occurred in the original document.

Dynamic summaries are generally regarded as greatly improving the usability of IR systems, but they present a complication for IR system design. A dynamic summary cannot be precomputed, but, on the other hand, if a system has only a positional index, then it cannot easily reconstruct the context surrounding search engine hits in order to generate such a dynamic summary. This is one reason for using static summaries. The standard solution to this in a world of large and cheap disk drives is to locally cache all the documents at index time (notwithstanding that this approach raises various legal, information security and control issues that are far from resolved) as shown in Figure 7.5 (page 147). Then, a system can simply scan a document which is about to appear in a displayed results list to find snippets containing the query words. Beyond simply access to the text, producing a good KWIC snippet requires some care. Given a variety of keyword occurrences in a document, the goal is to choose fragments which are: (i) maximally informative about the discussion of those terms in the document, (ii) self-contained enough to be easy to read, and (iii) short enough to fit within the normally strict constraints on the space available for summaries.

Generating snippets must be fast since the system is typically generating many snippets for each query that it handles. Rather than caching an entire document, it is common to cache only a generous but fixed size prefix of the document, such as perhaps 10,000 characters. For most common, short documents, the entire document is thus cached, but huge amounts of local storage will not be wasted on potentially vast documents. Summaries of documents whose length exceeds the prefix size will be based on material in the prefix only, which is in general a useful zone in which to look for a document summary anyway.

If a document has been updated since it was last processed by a crawler and indexer, these changes will be neither in the cache nor in the index. In these circumstances, neither the index nor the summary will accurately reflect the current contents of the document, but it is the differences between the summary and the actual document content that will be more glaringly obvious to the end user.

8.8 References and further reading

Definition and implementation of the notion of relevance to a query got off to a rocky start in 1953. Swanson (1988) reports that in an evaluation in that year between two teams, they agreed that 1390 documents were variously relevant to a set of 98 questions, but disagreed on a further 1577 documents, and the disagreements were never resolved.

Rigorous formal testing of IR systems was first completed in the Cranfield experiments, beginning in the late 1950s. A retrospective discussion of the Cranfield test collection and experimentation with it can be found in (Cleverdon 1991). The other seminal series of early IR experiments were those on the SMART system by Gerard Salton and colleagues (Salton 1971b; 1991). The TREC evaluations are described in detail by Voorhees and Harman (2005). Online information is available at http://trec.nist.gov/. Initially, few researchers computed the statistical significance of their experimental results, but the IR community increasingly demands this (Hull 1993). User studies of IR system effectiveness began more recently (Saracevic and Kantor 1988; 1996).

The notions of recall and precision were first used by Kent et al. (1955), although the term precision did not appear until later. The F measure (or, rather, its complement E = 1 − F) was introduced by van Rijsbergen (1979). He provides an extensive theoretical discussion, which shows how adopting a principle of decreasing marginal relevance (at some point a user will be unwilling to sacrifice a unit of precision for an added unit of recall) leads to the harmonic mean being the appropriate method for combining precision and recall (and hence to its adoption rather than the minimum or geometric mean).

Buckley and Voorhees (2000) compare several evaluation measures, including precision at k, MAP, and R-precision, and evaluate the error rate of each measure. R-precision was adopted as the official evaluation metric in the TREC HARD track (Allan 2005). Aslam and Yilmaz (2005) examine its surprisingly close correlation to MAP, which had been noted in earlier studies (Tague-Sutcliffe and Blustein 1995, Buckley and Voorhees 2000). A standard program for evaluating IR systems which computes many measures of ranked retrieval effectiveness is Chris Buckley's trec_eval program used in the TREC evaluations. It can be downloaded from: http://trec.nist.gov/trec_eval/.

Kekäläinen and Järvelin (2002) argue for the superiority of graded relevance judgments when dealing with very large document collections, and Järvelin and Kekäläinen (2002) introduce cumulated gain-based methods for IR system evaluation in this context. Sakai (2007) does a study of the stability and sensitivity of evaluation measures based on graded relevance judgments from NTCIR tasks, and concludes that NDCG is best for evaluating document ranking.

Schamber et al. (1990) examine the concept of relevance, stressing its multidimensional and context-specific nature, but also arguing that it can be measured effectively. (Voorhees 2000) is the standard article for examining variation in relevance judgments and their effects on retrieval system scores and ranking for the TREC Ad Hoc task.
Voorhees concludes that although the numbers change, the rankings are quite stable. Hersh et al. (1994) present similar analysis for a medical IR collection. In contrast, Kekäläinen (2005) analyzes some of the later TRECs, exploring a 4-way relevance judgment and the notion of cumulative gain, arguing that the relevance measure used does substantially affect system rankings. See also Harter (1998). Zobel (1998) studies whether the pooling method used by TREC to collect a subset of documents that will be evaluated for relevance is reliable and fair, and concludes that it is.

The kappa statistic and its use for language-related purposes are discussed by Carletta (1996). Many standard sources (e.g., Siegel and Castellan 1988) present pooled calculation of the expected agreement, but Di Eugenio and Glass (2004) argue for preferring the unpooled agreement (though perhaps presenting multiple measures). For further discussion of alternative measures of agreement, which may in fact be better, see Lombard et al. (2002) and Krippendorff (2003).

Text summarization has been actively explored for many years. Modern work on sentence selection was initiated by Kupiec et al. (1995). More recent work includes (Barzilay and Elhadad 1997) and (Jing 2000), together with a broad selection of work appearing at the yearly DUC conferences and at other NLP venues. Tombros and Sanderson (1998) demonstrate the advantages of dynamic summaries in the IR context. Turpin et al. (2007) address how to generate snippets efficiently.
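
As a rough illustration of the keyword-in-context snippets described in Section 8.7, the following Python sketch slides a fixed-width window over a cached document prefix and keeps the window that covers the most query terms. It is only a sketch under stated assumptions, not the method of Turpin et al. or of any particular search engine: the function name kwic_snippet, the width parameter, the whitespace tokenization, and the scoring rule (most distinct query terms, then most occurrences, then earliest position) are all hypothetical choices made here. Only the 10,000-character prefix figure comes from the text.

```python
import re

# Cached prefix size; the text suggests "perhaps 10,000 characters".
PREFIX_CHARS = 10_000


def kwic_snippet(text, query, width=12, prefix_chars=PREFIX_CHARS):
    """Return a keyword-in-context window of at most `width` words.

    Hypothetical helper (not from the source): the window with the most
    distinct query terms wins; ties are broken by the total number of
    query-term occurrences, then by the earliest position.
    """
    # Only the cached prefix of the document is scanned.
    words = text[:prefix_chars].split()
    terms = {t.lower() for t in query.split()}

    # Short documents are returned whole.
    if len(words) <= width:
        return " ".join(words)

    # Normalised copy of each word (punctuation stripped, lower-cased),
    # used only for matching against the query terms.
    norm = [re.sub(r"\W+", "", w).lower() for w in words]

    best_start, best_key = 0, (-1, -1)
    for start in range(len(words) - width + 1):
        hits = [w for w in norm[start:start + width] if w in terms]
        key = (len(set(hits)), len(hits))
        if key > best_key:  # strict ">" keeps the earliest window on ties
            best_start, best_key = start, key

    # If no query term occurs, best_start stays 0 and the snippet falls
    # back to the opening words, much like a static summary.
    snippet = " ".join(words[best_start:best_start + width])
    prefix = "... " if best_start > 0 else ""
    suffix = " ..." if best_start + width < len(words) else ""
    return prefix + snippet + suffix


if __name__ == "__main__":
    doc = ("In recent years, Papua New Guinea has faced severe economic "
           "difficulties and economic growth has slowed. PNG's economic "
           "development record over the past few years is evidence that "
           "governance issues underly many of the country's problems.")
    print(kwic_snippet(doc, "new guinea economic development", width=15))
```

A fixed word window ignores phrase and sentence boundaries, so it does not address the point that users prefer snippets that read well. A more careful version would snap window edges to sentence or clause boundaries and would combine several short windows when the query terms are far apart.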
Clickthrough log analysis is studied in (Joachims 2002b, Joachims et al. 2005).

In a series of papers, Hersh, Turpin and colleagues show how improvements in formal retrieval effectiveness, as evaluated in batch experiments, do not always translate into an improved system for users (Hersh et al. 2000a; b; 2001, Turpin and Hersh 2001; 2002).

User interfaces for IR and human factors such as models of human information seeking and usability testing are outside the scope of what we cover in this book. More information on these topics can be found in other textbooks, including (Baeza-Yates and Ribeiro-Neto 1999, ch. 10) and (Korfhage 1997), and collections focused on cognitive aspects (Sp