Terrier: A High Performance and Scalable Information Retrieval Platform

Iadh Ounis, Gianni Amati (*), Vassilis Plachouras, Ben He, Craig Macdonald, Christina Lioma
Department of Computing Science
University of Glasgow
Scotland, UK
{ounis,gianni,vassilis,ben,craigm,xristina}@dcs.gla.ac.uk

(*) Also affiliated to Fondazione Ugo Bordoni, Rome

Copyright is held by the author/owner(s).
SIGIR Open Source Workshop '06, Seattle, Washington, USA.

ABSTRACT

In this paper, we describe Terrier, a high performance and scalable search engine that allows the rapid development of large-scale retrieval applications. We focus on the open-source version of the software, which provides a comprehensive, flexible, robust, and transparent test-bed platform for research and experimentation in Information Retrieval (IR).

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Design, Experimentation, Performance, Theory

Keywords
Terrier, Information Retrieval platform, Open Source

1. INTRODUCTION

Terrier [12], Terabyte Retriever, is a project that was initiated at the University of Glasgow in 2000, with the aim of providing a flexible platform for the rapid development of large-scale Information Retrieval applications, as well as a state-of-the-art test-bed for research and experimentation in the wider field of IR.

The Terrier project explored novel, efficient and effective search methods for large-scale document collections, combining new and cutting-edge ideas from probabilistic theory, statistical analysis, and data compression techniques. This led to the development of various retrieval approaches using a new and highly effective probabilistic framework for IR, with the practical aim of combining various sources of evidence efficiently and effectively to enhance retrieval performance.

In particular, we have developed several new hyperlink structure analysis methods [15], various selective combination of evidence approaches [13, 16], new document length normalisation methods [1, 7, 8], many automatic query expansion and re-formulation techniques [1, 11], as well as a comprehensive set of query performance predictors [6, 9], to support a wide range of retrieval tasks and applications.

In addition, an effective in-house, highly configurable and scalable crawler called Labrador (http://ir.dcs.gla.ac.uk/labrador/) has been developed. The crawler was used for the deployment of Terrier in various industrial applications. Various powerful compression techniques have been explored and implemented, making Terrier particularly suitable for large-scale collections. Finally, the project investigated distributed architectures and retrieval [4], which allowed the system to be used in both centralised and distributed settings.

This paper focuses on a core version of Terrier, an open source product, which has been available to the general public since November 2004 under the MPL license (http://www.mozilla.org/MPL/). The open source version of Terrier is written in Java, allowing the system to run on various operating system platforms and a wide variety of hardware. Terrier, the latest version of which is 1.0.2, is available to download from http://ir.dcs.gla.ac.uk/terrier/.

2. MOTIVATIONS & AIMS

In any experimental science field, as is the case in IR, it is crucial to have a system allowing large-scale experimentation to be conducted in a flexible, robust, transparent and reproducible way. Terrier addresses this important issue by providing a test-bed framework for driving research and facilitating experimentation in IR.

The major leading IR groups in the world have always had an experimental platform to support their research. Most of the existing IR platforms are designed towards supporting a particular IR method/approach, making them difficult to customise and tailor to new methods and approaches.

Terrier is the first serious answer in Europe to the dominance of the United States on research and technological solutions in IR. Our objective was to bring the state-of-the-art IR methods and techniques, and the most efficient and effective IR technologies, into transparent, easily extensible and modular open source software. The open source product of Terrier was therefore designed as a tool to evaluate, test and compare models and ideas, and to build systems for large-scale IR.

The system includes efficient and effective state-of-the-art retrieval models based on a new Divergence From Randomness (DFR) framework for deriving parameter-free probabilistic models for IR [1]. The DFR models are information-theoretic models, for both query expansion and document ranking, a versatile feature that is not possessed by any other IR model. In addition, for easy cross-comparison of different retrieval models, Terrier includes a wide variety of models, ranging from numerous forms of the classical TF-IDF weighting scheme to the recent language modelling approach, through the well-established Okapi BM25 probabilistic ranking formula.

Terrier provides various indexing and querying APIs, and allows rapid experimentation with new concepts/ideas on various collections and in different configuration settings. This important feature for a research platform is made possible by the modular architecture of the system, its various easy-to-use configuration options, as well as its use of the most recent software engineering methods. The system is very easy to work with, helped further by comprehensive documentation (http://ir.dcs.gla.ac.uk/terrier/docs.html), and is easy to extend and adapt to new applications. This is demonstrated by the use of the platform in such diverse search tasks as desktop search, web search, e-mail & expertise search, XML retrieval, multilingual and blog search.

Finally, Terrier includes various retrieval performance evaluation tools, which produce an extensive analysis of the retrieval effectiveness of the tested models/concepts on given test collections. Systems and their variants can be instantiated in parallel, an important feature for conducting numerous experiments at the same time, thus fostering extensive experimentation in a structured and systematic way.

The remainder of this paper is structured as follows. We describe the main indexing features of Terrier in Section 3, and their associated data structures in Section 4. Section 5 provides an overview of the retrieval features of Terrier, which make the system particularly suitable for research purposes. In Section 6, we describe a very interesting functionality of Terrier, namely query expansion, and outline how it can be used for other IR activities and applications. The retrieval performance evaluation package of Terrier is presented in Section 7, while some "out-of-the-box" applications of Terrier are described in Section 8. Finally, we conclude the paper with lessons learnt while building the Terrier platform.

3. INDEXING FEATURES

Figure 1 outlines the indexing process in Terrier. Terrier is designed to allow many different ways of indexing a corpus of documents. Indexing is a four stage process, and at each stage, plugins can be added to alter the indexing process. This modular architecture allows flexibility in the indexing process at several stages: in the handling of a corpus of documents, in handling and parsing each individual document, in the processing of terms from documents, and in writing the index data structures. In addition, Terrier allows the direct parsing and indexing of compressed collections.

Figure 1: Overview of the indexing architecture of Terrier. A corpus of documents is handled by a Collection plugin, which generates a stream of Document objects. Each Document generates a stream of terms, which are transformed by a series of Term Pipeline components, after which the Indexer writes to disk.

The open source version of Terrier comes with support for indexing test collections from the well-established TREC (http://trec.nist.gov) international evaluation forum. This permits researchers to quickly set up and run Terrier with a test collection, lessening the learning curve for the platform. The advantage is that any other storage method for a collection of documents can easily be supported by adding a new Collection plugin. For example, reading a stream of documents from an email server would only require implementing a new Collection plugin to connect to the email server.

Terrier comes with various document parsers embedded. These include the ability to index HTML documents, plain text documents, Microsoft Word, Excel and PowerPoint documents, as well as Adobe PDF files. To add support for another file format to Terrier, a developer need only add another Document plugin which is able to extract the terms from the document.

Each term extracted from a document has three fundamental properties: firstly, the actual String textual form of the term; secondly, the position at which the term occurs in the document; and thirdly, the fields in which the term occurs (fields can be arbitrarily defined by the Document plugin, but typically are used to define which HTML tags a term occurs in).

Terrier adds configurability at this stage of the indexing: terms are passed through a 'Term Pipeline', which allows the terms to be transformed in various ways. Terrier comes with some pre-defined Term Pipeline plugins, such as two variants of Porter's stemming algorithm, and a stopword removal component. Term Pipeline plugins introduce flexibility to the processing and transformation of terms. Examples include n-gram indexing, stemming and stopword removal in various languages, acronym expansion, or use of a thesaurus, to name but a few.

The last component in the Term Pipeline chain is always the Indexer. The Indexer is responsible for writing the index using the appropriate data structures. By using a Block Indexer, it is possible to create an index that stores the positions of terms in each document. Each document is divided into blocks; the size of a block defines the accuracy with which term positions are recorded. By default, the size of a block is 1 and, in that case, the exact positions of the term occurrences are recorded. This allows proximity and phrase operators to be used in queries during retrieval. However, a block could also be defined as a semantic entity, so as to allow structured retrieval. The following section describes the index data structures generated by Terrier when indexing a corpus of documents.

4. INDEX STRUCTURES

A Terrier index consists of 4 main data structures, in addition to some auxiliary files:

Lexicon: The lexicon stores the term and its term id (a unique number for each term), along with the global statistics of the term (global term frequency and document frequency of the term) and the offsets of the postings list in the Inverted Index.

Inverted Index: The inverted index stores the postings lists of a term. In particular, for each term, the inverted index stores: the document id of the matching document; and the term frequency of the term in that document. The fields in which the term occurred are encoded using a bit set. If the index has been made with positional information, the postings list will also contain the block ids in the document that the term occurs in. Positional information allows phrasal and proximity search to be performed. The postings list in Terrier is highly compressed. In particular, the document ids are encoded in the inverted index using Gamma encoding, the term frequencies are encoded using Unary encoding, and the block ids, if present, are encoded using Gamma encoding [20].

Document Index: The Document Index stores the document number (an external unique identifier of the document); the document id (an internal unique identifier of the document); the length of the document in terms of tokens; and the offset of the document in the Direct Index.

Direct Index: The Direct Index stores the terms and term frequencies of the terms present in each document. The aim of the Direct Index is to facilitate easy and efficient query expansion, as described in Section 6. However, the direct index is also extremely useful for applying clustering to a collection of documents. The direct index contents are compressed in an orthogonal way to the inverted index. Term ids are written using Gamma encoding, term frequencies use Unary encoding and the fields in which the terms occur are encoded using a bit set. Block ids, if present, are encoded using Gamma encoding.

Table 1 details the format of each file in the index structures. Moreover, Table 2 shows the sizes of Terrier's index data structures after indexing the WT2G collection. Fields are not recorded, while index sizes with and without term positions are stated. For comparison purposes, the index sizes of Terrier without compression, and of two other open source products, Indri (http://www.lemurproject.org/indri/) and Zettair (http://www.seg.rmit.edu.au/zettair/), for the same collection are also provided. Like Terrier, Indri and Zettair are deployed using their default settings.

Index Structure         Contents
----------------------  ------------------------------------
Lexicon                 Term (20)
                        Term id (4)
                        Document Frequency (4)
                        Term Frequency (4)
                        Byte offset in inverted file (8)
                        Bit offset in inverted file (1)
Inverted Index          Document id gap (gamma code)
                        Term Frequency (unary code)
                        Fields (# of fields bits)
                        Block Frequency (unary code)
                        [Block id gap (gamma code)]
Document Index          Document id (4)
                        Document Length (4)
                        Document Number (20)
                        Byte offset in direct file (8)
                        Bit offset in direct file (1)
Direct Index            Term id gap (gamma code)
                        Term frequency (unary code)
                        Fields (# of fields bits)
                        Block frequency (unary code)
                        [Block id gap (gamma code)]
Lexicon Index           Offset in lexicon (8)
Collection Statistics   # of documents
                        # of tokens
                        # of unique terms
                        # of pointers

Table 1: Details on the format and compression used for each index data structure. The numbers in parentheses are the size of each entry in bytes, unless otherwise denoted. Gamma and unary codes are variable length encodings.

Index Structure                      Index Size   % of WT2G
-----------------------------------  -----------  ---------
Terrier 1.0.2
  Direct Index                       87MB         4%
  Inverted Index                     72MB         3%
  Lexicon                            40MB         2%
  Document Index                     8.8MB        0.4%
Terrier if no compression is used
  Direct Index                       481MB        22%
  Inverted Index                     481MB        22%
  Lexicon                            40MB         2%
  Document Index                     8.8MB        0.4%
Terrier 1.0.2 with Term Positions
  Direct Index                       366MB        17%
  Inverted Index                     349MB        17%
  Lexicon                            40MB         2%
  Document Index                     8.8MB        0.4%
Indri 2.2
  Direct Index                       386MB        18%
  Inverted Index                     417MB        19%
Zettair 0.6.1
  Inverted Index                     430MB        20%

Table 2: Index sizes for Terrier (with and without term positions), Indri and Zettair for the WT2G collection (2.1GB). Terrier's uncompressed index size is provided for comparison purposes.

5. RETRIEVAL FEATURES

One of the main aims of Terrier is to facilitate research in the field of IR. Figure 2 provides an outline of the retrieval process in Terrier. Terrier's retrieval features have been selected in order to be useful for a wide range of IR research. Indeed, Terrier offers great flexibility in choosing a weighting model (Section 5.1), as well as in altering the score of the retrieved documents (Section 5.2). Moreover, Terrier offers an advanced query language (Section 5.3). Another very important retrieval feature of Terrier is the automatic query expansion, which is described in Section 6.

Figure 2: Overview of the retrieval architecture of Terrier. The application communicates with the Manager, which in turn runs the desired Matching module. Matching assigns scores to the documents using the combination of weighting model and score modifiers.

5.1 Weighting Models

The core functionality of matching documents to queries and ranking documents takes place in the Matching module. Matching employs a weighting model to assign a score to each of the query terms in a document. In order to facilitate the cross-comparison of weighting models, a range of weighting models is supplied, including BM25, TF-IDF, and document weighting models from the Divergence From Randomness (DFR) framework [1]. The DFR approach supplies parameter-free probabilistic models, based on a simple idea:

"The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the term t in the document d".

The open source 1.0.2 version of Terrier includes eight document weighting models from the DFR framework, all of which perform robustly on standard test collections. Ponte-Croft's language model [18] is also supported.

5.2 Altering Document Scores

The score of an individual term in a document can be altered by applying a TermScoreModifier. For example, a TermInFieldModifier can be applied in order to ensure that the query terms occur in a particular field of a document. If a query term does not appear in a particular field, then the TermInFieldModifier resets the score of the document. In addition, a FieldScoreModifier boosts the score of a document in which a query term appears in a particular field.

Similarly, changing the score of a retrieved document is achieved by applying a DocumentScoreModifier. One such modifier is the PhraseScoreModifier, which employs the position information saved in the index of Terrier, and resets the score of the retrieved documents in which the query terms do not appear as a phrase, or within a given number of blocks. Generally, a DocumentScoreModifier is ideal for applying query-independent evidence, such as evidence from hyperlink structure analysis, or from the URL of documents. Moreover, the selective application of different retrieval techniques based on evidence from the hyperlink structure [13] can be applied as a DocumentScoreModifier.

After the application of any TermScoreModifiers or DocumentScoreModifiers, the set of retrieved documents can be further altered by applying post processing or post filtering. Post processing is appropriate for implementing functionalities that require changing the original query. An example of post processing is Query Expansion (QE), which is described in detail in Section 6. The application of QE could be enabled on a per-query basis, before retrieval, depending on the output of a pre-retrieval performance predictor [6]. Another possible example of post processing could be the application of clustering.

Post filtering is the final step in Terrier's retrieval process, where a series of filters can remove already retrieved documents which do not satisfy a given condition. For example, in the context of a Web search engine, a post filter could be used to reduce the number of retrieved documents from the same Web site, in order to increase diversity in the results.

5.3 Query Language

Terrier includes a powerful query language that allows the user to specify additional operations on top of the normal probabilistic queries. Such operations may specify that a particular query term should or should not appear in the retrieved documents. Other available operations include requiring that terms appear in particular fields, phrase queries, or proximity queries. An overview of the available query language operations is given below.

t1 t2            retrieves documents with either t1 or t2.
t1^2.3           sets the weight of query term t1 to 2.3.
+t1 -t2          retrieves documents with t1 but not t2.
"t1 t2"          retrieves documents where the terms t1 and t2 occur next to each other.
"t1 t2"~n        retrieves documents where the terms t1, t2 occur within n blocks.
+(t1 t2)         specifies that both terms t1 and t2 are required.
field:t1         retrieves documents where t1 must appear in the specified field.
control:on/off   enables or disables a given control. For example, query expansion is enabled with qe:on.

The query language operations correspond to TermScoreModifier or DocumentScoreModifier modules, which are appropriately configured when the query is parsed. Retrieval parameters can also be specified by the user, when permitted, to change or enable retrieval functionalities. For example, a query for which a particular query term should appear in the title of the retrieved documents automatically enables a TermInFieldModifier.

6. QUERY EXPANSION

Terrier includes automatic pseudo-relevance feedback, in the form of Query Expansion (QE). The method works by taking the most informative terms from the top-ranked documents of the query, and adding these new related terms into the query. This operation is made possible by the presence of the direct index. The direct index allows the terms and their frequencies to be determined for each document in the index. The new query is reweighted and rerun, providing a richer set of retrieved documents. Terrier provides several term weighting models from the DFR framework which are useful for identifying informative terms from top-ranked documents. In addition, for easy cross-comparison, Terrier also includes some well-established pseudo-relevance feedback techniques, such as Rocchio's method [19].

Automatic query expansion is highly effective for many IR tasks. The decision to apply QE depends on the type of application and the difficulty of queries. Therefore, we have developed several tools for determining possible indicators that predict the query difficulty, and for the selective application of query expansion [2, 6, 9]. The vocabulary and size of the collections may differ considerably. As a consequence, automatic means to tune the inherent parameters of the query expansion are also necessary in a good IR system. For this purpose, we devised a parameter-free QE mechanism, called Bo1 [1, 11], as the default in Terrier. The system is also readily available to support several low-cost methods for a selective application of query expansion. Our method of QE guarantees the most effective and efficient mechanism for performing QE, even when using only 3 retrieved documents and a handful of additional query terms. In addition, QE can also be used to automatically generate realistic samples of queries [6].

The data structures (in particular the direct file), tools and methods adopted for performing QE in Terrier also constitute an advanced basis for other types of IR activities and applications, such as document and term clustering or question answering. Indeed, the query expansion methodology is based on a filter, activated on the information theoretic weights of the terms, which selects the highly informative terms that appear in at least X retrieved documents. As an illustration, the optimal value for X is 2 in adhoc retrieval. The clustering of the results would require a higher value for X, while question answering may require a different information theoretic definition for X. In conclusion, QE is a good source of inspiration for new and interesting IR applications and research directions.

7. EVALUATION PACKAGE

When performing IR research using a standard test collection, it is essential to evaluate the retrieval performance of the applied approaches. The open source version of Terrier was designed to allow users to rapidly design and test new retrieval functions and IR models. For this reason, we implemented the evaluation tools in the TREC style with a wide range of known IR measures. In particular, Terrier includes and integrates the main evaluation functionalities found in the trec_eval tool, including, among others, calculating Mean Average Precision, Precision at rank N, interpolated Precision, and R-Precision. It is also possible to show the evaluation measures on a per-query basis.

The evaluation tools of Terrier allow easy testing of new retrieval methodologies or approaches. Once the new approach has been integrated into Terrier, users can test it, together with as many models as possible, against a test collection. As soon as the system has completed all the runs, a file of evaluation results can be created for each model and a summary is displayed.

8. STANDARD DEPLOYMENTS & SAMPLE APPLICATIONS

Terrier 1.0.2 comes with two applications. Firstly, Terrier comprises a complete application suite for performing research using test collections. Terrier is easy to deploy on the standard TREC test collections, e.g. the TREC CDs, WT2G, WT10G, .GOV etc. This facilitates research on these standard test collections. Table 3 shows "out-of-the-box" retrieval performance scores of Terrier on standard test collections, using title-only queries. The default settings include the removal of English stopwords and the application of Porter's stemming algorithm. During retrieval, the InL2 DFR weighting model is applied as is.

Collection   Topics/Task                  Performance
-----------  ---------------------------  -----------
WT2G         Adhoc Topics 401-450         0.2663 MAP
WT10G        Adhoc Topics 451-550         0.1886 MAP
W3C          Known Item Topics 26-150     0.4960 MRR

Table 3: Performance of "out-of-the-box" Terrier on standard TREC collections and tasks. Performance is measured using the standard measure for each task: for Adhoc, Mean Average Precision (MAP); for Known Item, Mean Reciprocal Rank (MRR).

Secondly, Terrier comes with a proof-of-concept Desktop Search application, shown in Figure 3. This shows how the retrieval functionalities of Terrier can be easily deployed to an application suitable for day-to-day retrieval tasks. All the retrieval functionalities are easily accessible from the Desktop Search application, including Terrier's query language and query expansion.
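The two measures reported in Table 3, Mean Average Precision and Mean Reciprocal Rank, can be illustrated with a short sketch. This is not Terrier's evaluation code (Terrier is written in Java) and it does not reproduce trec_eval exactly. It is a minimal Python illustration that assumes a run is a ranked list of document ids per topic and that relevance judgements are binary. The function names, topic ids and document ids below are hypothetical.

```python
from typing import Callable, Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Sum precision at each rank holding a relevant document, divided by
    the total number of relevant documents for the topic."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def reciprocal_rank(ranking: List[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_topics(
    metric: Callable[[List[str], Set[str]], float],
    runs: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
) -> float:
    """Average a per-topic metric over every judged topic.
    A topic with no ranking in the run is scored as an empty ranking."""
    if not qrels:
        return 0.0
    scores = [metric(runs.get(topic, []), rel) for topic, rel in qrels.items()]
    return sum(scores) / len(scores)


# Hypothetical toy data: two topics with binary relevance judgements.
example_runs = {
    "t1": ["d1", "d3", "d7", "d2"],
    "t2": ["d5", "d9", "d4"],
}
example_qrels = {
    "t1": {"d1", "d2"},
    "t2": {"d4"},
}

example_map = mean_over_topics(average_precision, example_runs, example_qrels)
example_mrr = mean_over_topics(reciprocal_rank, example_runs, example_qrels)
print(f"MAP = {example_map:.4f}, MRR = {example_mrr:.4f}")
```

Averaging over the judged topics rather than the topics present in the run means that a topic for which the system returned nothing still counts as a zero. This is a design choice of the sketch, not a statement about how Terrier's evaluation package behaves.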
Figure 3: Screenshot of the proof-of-concept Terrier Desktop Search. It can index various types of common document formats, including Microsoft Office documents, HTML and PDF files.

9. RESEARCH FACILITATED & LESSONS LEARNED

The development of a test-bed framework allowing for rapid experimentation with new IR concepts/ideas has boosted our research, yielding some significant insights into the behaviour of IR techniques on diverse, multilingual, and large-scale collections. The system allowed us to participate successfully and significantly in various TREC tracks such as the Web tracks (2001-2004), the Robust tracks (2003, 2004), the Terabyte tracks (2004-2005), and recently the Enterprise track (2005). In addition, we participated in various tracks of CLEF (http://www.clef-campaign.org) in 2003-2005.

More importantly, in addition to being a vehicle for the evaluation of our new search methods, the Terrier platform allowed us to easily simulate, assess, and improve state-of-the-art IR methods and techniques. This led to a better understanding of the underlying issues related to tackling different search tasks on various and diverse large collections. In particular, we realised the need to build parameter-free models, which avoid the constraint of using costly parameter-tuning approaches based on relevance assessment. The proposed DFR framework is not only an efficient and effective IR model, but also a mechanism by which we can explain and complement other existing IR approaches/techniques [1].

One of the main lessons learnt from building the Terrier platform during the previous years is that building a sustainable, flexible and robust platform for information retrieval is as important as creating new IR models. Indeed, research in an experimental field such as IR cannot be conducted without a platform facilitating experimentation and evaluation.

In developing a cutting-edge technology from a laboratory setting to a product release, which was deployed in various industrial applications and settings, we gained a much better understanding of the challenges faced in tackling real large-scale collections, not necessarily as clean as the TREC or CLEF collections, and of how to provide practical and effective solutions under a high query workload.

Many benefits were achieved by releasing a core version of the project as open source software. By facilitating transparent and reproducible experimentation with various retrieval methods, we believe that we achieved a greater impact and visibility in the scientific community. We have also built a community of users/developers around Terrier, which helped improve the quality of the software and attracted more researchers to the project. Since its release in November 2004, the Terrier software has been downloaded thousands of times from across the world, including by major commercial search engine companies.

Finally, the Terrier project has attracted many undergraduate, postgraduate, and research students, and benefited from their contributions. Building an IR platform is both a research and an engineering process, requiring teamwork and a critical mass of developers. Terrier is currently the main development platform for both undergraduate and postgraduate students in our research group, allowing them to employ a state-of-the-art framework in their learning and research.

10. CONCLUSIONS

In this paper, we described the open source version of Terrier, a robust, transparent, and modular framework for research and experimentation in Information Retrieval. The platform was developed as part of an ongoing larger project, ultimately aiming at becoming the European Search Engine, with state-of-the-art search technology for various search tasks and applications.

Terrier is a mature and state-of-the-art platform, including highly effective retrieval models based on the DFR framework. The performance of these models is at least comparable to, if not better than, that of the most recent models, including on very large-scale test collections [14, 11]. In addition, Terrier's DFR models are information-theoretic models for both query expansion and document ranking, a versatile and original feature that is not possessed by any other IR model.

Terrier is continuously expanded with new models resulting from our research. Indeed, new research, such as models for document structure and combination of evidence [17], length normalisation methods [7], query performance prediction [6, 9], NLP techniques [10], and information extraction [3], is being readily fed into Terrier. Furthermore, Terrier is being used for research in enterprise and expert search [11]. Retrieval from large-scale test collections has also led us to study optimal distributed architectures for retrieval systems [4, 5].

From an IR perspective, technological challenges may render an IR model obsolete very quickly (as has happened with the Boolean or Vector Space model). We believe that a theoretical framework for Information Retrieval, when implemented within a mature, robust and extensible IR system, is destined to last. Our strong belief is that Terrier, as the accumulator of innovative ideas in IR, and with its design features and underlying principles, open source being one of them, will make technology transfers easier.

11. ACKNOWLEDGEMENTS

The development of Terrier required significant manpower and infrastructure. We would like to thank the many students and programmers who participated in developing Terrier. We would also like to thank the UK Engineering and Physical Sciences Research Council (EPSRC), project grant number GR/R90543/01, as well as the Leverhulme Trust, project grant number F/00179/S, who have
partially supported the development of Terrier. Finally, we thank the Department of Computing Science at the University of Glasgow, and Keith van Rijsbergen, for supporting the Terrier platform at various stages of its development.

12. REFERENCES

[1] G. Amati. Probabilistic Models for Information Retrieval based on Divergence from Randomness. PhD thesis, Department of Computing Science, University of Glasgow, 2003.
[2] G. Amati, C. Carpineto, and G. Romano. Query Difficulty, Robustness, and Selective Application of Query Expansion. In ECIR '04: Proceedings of the 26th European Conference on Information Retrieval (ECIR '04), pages 127-137, Sunderland, UK, 2004. Springer.
[3] G. Amati. Information Theoretic Approach to Information Extraction. In Proceedings of the 7th International Conference on Flexible Query Answering Systems (FQAS 2006), pages 519-529, Milan, Italy, 2006.
[4] F. Cacheda, V. Plachouras, and I. Ounis. A case study of distributed information retrieval architectures to index one Terabyte of text. Information Processing & Management, 41(5):1141-1161, 2005.
[5] F. Cacheda, V. Carneiro, V. Plachouras, and I. Ounis. Performance Analysis of Distributed Information Retrieval Architectures Using an Improved Network Simulation Model. Information Processing & Management (to appear).
[6] B. He and I. Ounis. Inferring Query Performance Using Pre-retrieval Predictors. In Proceedings of the 11th Symposium on String Processing and Information Retrieval (SPIRE 2004), Padova, Italy, 2004.
[7] B. He and I. Ounis. A study of the Dirichlet priors for term frequency normalisation. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and Development in information retrieval, pages 465-471, Salvador, Brazil, 2005.
[8] B. He and I. Ounis. Term Frequency Normalisation Tuning for BM25 and DFR Models. In ECIR '05: Proceedings of the 27th European Conference on Information Retrieval (ECIR '05), pages 200-214, Santiago de Compostela, Spain, 2005. Springer.
[9] B. He and I. Ounis. Query Performance Prediction. In Information Systems, Elsevier, 2006 (in press).
[10] C. Lioma and I. Ounis. Examining the Content Load of Part-of-Speech Blocks for Information Retrieval. In Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING/ACL 2006), Sydney, Australia, 2006.
[11] C. Macdonald, B. He, V. Plachouras, and I. Ounis. University of Glasgow at TREC 2005: Experiments in Terabyte and Enterprise Tracks with Terrier. In TREC '05: Proceedings of the 14th Text REtrieval Conference (TREC 2005), November 2005, NIST.
[12] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and D. Johnson. Terrier Information Retrieval Platform. In Proceedings of the 27th European Conference on IR Research (ECIR 2005), volume 3408 of Lecture Notes in Computer Science, pages 517-519. Springer, 2005.
[13] V. Plachouras and I. Ounis. Usefulness of hyperlink structure for query-biased topic distillation. In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and Development in information retrieval, pages 448-455, Sheffield, UK, 2004.
[14] V. Plachouras, B. He, and I. Ounis. University of Glasgow at TREC 2004: Experiments in Web, Robust and Terabyte tracks with Terrier. In TREC '04: Proceedings of the 13th Text REtrieval Conference (TREC 2004), November 2004, NIST.
[15] V. Plachouras, I. Ounis, and G. Amati. The Static Absorbing Model for the Web. Journal of Web Engineering, 4(2):165-186, 2005.
[16] V. Plachouras, F. Cacheda, and I. Ounis. A Decision Mechanism for the Selective Combination of Evidence in Topic Distillation. Information Retrieval, 9(2):139-163, 2006.
[17] V. Plachouras. Selective Web Information Retrieval. PhD thesis, Department of Computing Science, University of Glasgow, 2006.
[18] J. M. Ponte and W. B. Croft. A language modelling approach to information retrieval. In SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 275-281, Melbourne, Australia, 1998.
[19] J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The Smart Retrieval System: Experiments in Automatic Document Processing, pages 313-323. Prentice-Hall, Englewood Cliffs, NJ, 1971.
[20] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, 1999.