Terrier: A High Performance and Scalable Information Retrieval Platform

Iadh Ounis, Gianni Amati*, Vassilis Plachouras, Ben He, Craig Macdonald, Christina Lioma
Department of Computing Science
University of Glasgow
Scotland, UK
{ounis,gianni,vassilis,ben,craigm,xristina}@dcs.gla.ac.uk

* Also affiliated to Fondazione Ugo Bordoni, Rome

ABSTRACT

In this paper, we describe Terrier, a high performance and scalable search engine that allows the rapid development of large-scale retrieval applications. We focus on the open-source version of the software, which provides a comprehensive, flexible, robust, and transparent test-bed platform for research and experimentation in Information Retrieval (IR).

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Design, Experimentation, Performance, Theory

Keywords
Terrier, Information Retrieval platform, Open Source

Copyright is held by the author/owner(s).
SIGIR Open Source Workshop '06, Seattle, Washington, USA.

1. INTRODUCTION

Terrier [12], Terabyte Retriever, is a project that was initiated at the University of Glasgow in 2000, with the aim to provide a flexible platform for the rapid development of large-scale Information Retrieval applications, as well as a state-of-the-art test-bed for research and experimentation in the wider field of IR.

The Terrier project explored novel, efficient and effective search methods for large-scale document collections, combining new and cutting-edge ideas from probabilistic theory, statistical analysis, and data compression techniques. This led to the development of various retrieval approaches using a new and highly effective probabilistic framework for IR, with the practical aim of combining efficiently and effectively various sources of evidence to enhance the retrieval performance.

In particular, we have developed several new hyperlink structure analysis methods [15], various selective combination of evidence approaches [13, 16], new document length normalisation methods [1, 7, 8], many automatic query expansion and re-formulation techniques [1, 11], as well as a comprehensive set of query performance predictors [6, 9], to support a wide range of retrieval tasks and applications.

In addition, an effective in-house, highly configurable and scalable crawler called Labrador (http://ir.dcs.gla.ac.uk/labrador/) has been developed. The crawler was used for the deployment of Terrier in various industrial applications. Various powerful compression techniques have been explored and implemented, making Terrier particularly suitable for large-scale collections. Finally, the project investigated distributed architectures and retrieval [4], which allowed the system to be used both in a centralised and distributed setting.

This paper focuses on a core version of Terrier, an open source product, which has been available to the general public since November 2004 under the MPL license (http://www.mozilla.org/MPL/). The open source version of Terrier is written in Java, allowing the system to run on various operating systems platforms, and a wide variety of hardware. Terrier, the latest version of which is 1.0.2, is available to download from http://ir.dcs.gla.ac.uk/terrier/.

2. MOTIVATIONS & AIMS

In any experimental science field, as is the case in IR, it is crucial to have a system allowing large-scale experimentation to be conducted in a flexible, robust, transparent and reproducible way. Terrier addresses this important issue by providing a test-bed framework for driving research and facilitating experimentation in IR.

The major leading IR groups in the world have always had an experimental platform to support their research. Most of the existing IR platforms are designed towards supporting a particular IR method/approach, making them difficult to customise and tailor to new methods and approaches.

Terrier is the first serious answer in Europe to the dominance of the United States on research and technological solutions in IR. Our objective was to include the state-of-the-art IR methods and techniques, and offer the most efficient and effective IR technologies, into a transparent, easily extensible and modular open source software. The open source product of Terrier was therefore designed as a tool to evaluate, test and compare models and ideas, and to build systems for large-scale IR.

The system includes efficient and effective state-of-the-art retrieval models based on a new Divergence From Randomness (DFR) framework for deriving parameter-free probabilistic models for IR [1]. The DFR models are information-theoretic models, for both query expansion and document ranking, a versatile feature that is not possessed by any other IR model. In addition, for easy cross-comparison of different retrieval models, Terrier includes a wide variety of models, ranging from numerous forms of the classical TF-IDF weighting scheme to the recent language modelling approach, through the well-established Okapi's BM25 probabilistic ranking formula.

Terrier provides various indexing and querying APIs, and allows rapidly experimenting with new concepts/ideas on various collections, and in different configuration settings. This important feature for a research platform is made possible by the modular architecture of the system, its various easy-to-use configuration options, as well as its use of the most recent software engineering methods. The system is very easy to work with, further helped with a comprehensive documentation (http://ir.dcs.gla.ac.uk/terrier/docs.html), and is easy to extend and adapt to new applications. This is demonstrated by the use of the platform in such diverse search tasks as desktop search, web search, e-mail & expertise search, XML retrieval, multilingual and blogs search.

Finally, Terrier includes various retrieval performance evaluation tools, which produce an extensive analysis of the retrieval effectiveness of the tested models/concepts on given test collections. Systems and their variants can be instantiated in parallel, an important feature for conducting numerous experiments at the same time, thus fostering extensive experimentation in a structured and systematic way.

The remainder of this paper is structured as follows. We describe the main indexing features of Terrier in Section 3, and their associated data structures in Section 4. Section 5 provides an overview of the retrieval features of Terrier, which make the system particularly suitable for research purposes. In Section 6, we describe a very interesting functionality of Terrier, namely query expansion, and outline how it can be used for other IR activities and applications. The retrieval performance evaluation package of Terrier is presented in Section 7, while some "out-of-the-box" applications of Terrier are described in Section 8. Finally, we conclude the paper with lessons learnt while building the Terrier platform.

3. INDEXING FEATURES

Figure 1 outlines the indexing process in Terrier. Terrier is designed to allow many different ways of indexing a corpus of documents. Indexing is a four stage process, and at each stage, plugins can be added to alter the indexing process. This modular architecture allows flexibility in the indexing process at several stages: in the handling of a corpus of documents, in handling and parsing each individual document, in the processing of terms from documents, and in writing the index data structures. In addition, Terrier allows the direct parsing and indexing of compressed collections.

Figure 1: Overview of the indexing architecture of Terrier. A corpus of documents is handled by a Collection plugin, which generates a stream of Document objects. Each Document generates a stream of terms, which are transformed by a series of Term Pipeline components, after which the Indexer writes to disk.

The open source version of Terrier comes with support for indexing test collections from the well-established TREC (http://trec.nist.gov) international evaluation forum. This permits researchers to quickly set up and run Terrier with a test collection, lessening the learning curve for the platform. The advantage is that any other storage method for a collection of documents can easily be supported by adding a new Collection plugin. For example, reading a stream of documents from an email server would only require implementing a new Collection plugin to connect to the email server.

Terrier comes with various document parsers embedded. These include the ability to index HTML documents, Plain text documents, Microsoft Word, Excel and Powerpoint documents, as well as Adobe PDF files. To add support for another file format to Terrier, a developer must only add another Document plugin which is able to extract the terms from the document.

Each term extracted from a document has three fundamental properties: firstly, the actual String textual form of the term; secondly, the position at which the term occurs in the document, and thirdly, the fields in which the term occurs (fields can be arbitrarily defined by the Document plugin, but typically are used to define which HTML tags a term occurs in).

Terrier adds configurability at this stage of the indexing: terms are passed through a 'Term Pipeline', which allows the terms to be transformed in various ways. Terrier comes with some pre-defined Term Pipeline plugins, such as two variants of Porter's stemming algorithm, and a stopwords removal component. Term Pipeline plugins introduce flexibility to the processing and transformation of terms, which can be manifested for example by n-gram indexing, adding stemming and stopword removal in various languages, acronym expansion, or use of a thesaurus, to name but a few.

The last component in the Term Pipeline chain is always the Indexer. The Indexer is responsible for writing the index using the appropriate data structures. By using a Block Indexer, it is possible to create an index that stores the positions of terms in each document. Each document is divided into blocks - the size of a block defines the accuracy with which term positions are recorded. By default, the size of a block is 1 and, in that case, the exact positions of the term occurrences are recorded. This allows for proximity and phrase operators to be used in queries during retrieval. However, a block could also be defined as a semantic entity, so as to allow structured retrieval. The following section describes the index data structures generated by Terrier when indexing a corpus of documents.

4. INDEX STRUCTURES

A Terrier index consists of 4 main data structures, in addition to some auxiliary files:

Lexicon: The lexicon stores the term and its term id (a unique number for each term), along with the global statistics of the term (global term frequency and document frequency of the term) and the offsets of the postings list in the Inverted Index.

Inverted Index: The inverted index stores the postings lists of a term. In particular, for each term, the inverted index stores: the document id of the matching document; and the term frequency of the term in that document. The fields in which the term occurred are encoded using a bit set. If the index has been made with positional information, the postings list will also contain the block ids in the document that the term occurs in. Positional information allows phrasal and proximity search to be performed. The postings list in Terrier is highly compressed. In particular, the document ids are encoded in the inverted index using Gamma encoding, the term frequencies are encoded using Unary encoding, and the Block ids, if present, are encoded using Gamma encoding [20].

Document Index: The Document Index stores the document number (an external unique identifier of the document), the document id (internal unique identifier of the document); the length of the document in terms of tokens; and the offset of the document in the Direct Index.

Direct Index: The Direct Index stores the terms and term frequencies of the terms present in each document. The aim of the Direct Index is to facilitate easy and efficient query expansion, as described in Section 6. However, the direct index is also extremely useful for applying clustering to a collection of documents. The direct index contents are compressed in an orthogonal way to the inverted index. Term ids are written using Gamma encoding, term frequencies use Unary encoding and the fields in which the terms occur are encoded using a bit set. Block ids, if present, are encoded using Gamma encoding.

Table 1 details the format of each file in the index structures. Moreover, Table 2 shows the sizes of Terrier's index data structures after indexing the WT2G collection. Fields are not recorded, while index sizes with and without term positions are stated. For comparison purposes, the index sizes of Terrier without compression, and two other open source products, Indri (http://www.lemurproject.org/indri/) and Zettair (http://www.seg.rmit.edu.au/zettair/) for the same collection are also provided. Like Terrier, Indri and Zettair are deployed using their default settings.

Index Structure          Contents
-----------------------  ----------------------------------
Lexicon                  Term (20)
                         Term id (4)
                         Document Frequency (4)
                         Term Frequency (4)
                         Byte offset in inverted file (8)
                         Bit offset in inverted file (1)
Inverted Index           Document id gap (gamma code)
                         Term Frequency (unary code)
                         Fields (# of fields bits)
                         Block Frequency (unary code)
                         [Block id gap (gamma code)]
Document Index           Document id (4)
                         Document Length (4)
                         Document Number (20)
                         Byte offset in direct file (8)
                         Bit offset in direct file (1)
Direct Index             Term id gap (gamma code)
                         Term frequency (unary code)
                         Fields (# of fields bits)
                         Block frequency (unary code)
                         [Block id gap (gamma code)]
Lexicon Index            Offset in lexicon (8)
Collection Statistics    # of documents
                         # of tokens
                         # of unique terms
                         # of pointers

Table 1: Details on the format and compression used for each index data structure. The numbers in parenthesis are the size of each entry in bytes, unless otherwise denoted. Gamma and unary codes are variable length encodings.

Index Structure                      Index Size   % of WT2G
-----------------------------------  -----------  ---------
Terrier 1.0.2
  Direct Index                       87MB         4%
  Inverted Index                     72MB         3%
  Lexicon                            40MB         2%
  Document Index                     8.8MB        0.4%
Terrier if no compression is used
  Direct Index                       481MB        22%
  Inverted Index                     481MB        22%
  Lexicon                            40MB         2%
  Document Index                     8.8MB        0.4%
Terrier 1.0.2 with Term Positions
  Direct Index                       366MB        17%
  Inverted Index                     349MB        17%
  Lexicon                            40MB         2%
  Document Index                     8.8MB        0.4%
Indri 2.2
  Direct Index                       386MB        18%
  Inverted Index                     417MB        19%
Zettair 0.6.1
  Inverted Index                     430MB        20%

Table 2: Index sizes for Terrier (with and without term positions), Indri and Zettair for the WT2G collection (2.1GB). Terrier's uncompressed index size is provided for comparison purposes.

5. RETRIEVAL FEATURES

One of the main aims of Terrier is to facilitate research in the field of IR. Figure 2 provides an outline of the retrieval process in Terrier. Terrier's retrieval features have been selected in order to be useful for a wide range of IR research. Indeed, Terrier offers great flexibility in choosing a weighting model (Section 5.1), as well as in altering the score of the retrieved documents (Section 5.2). Moreover, Terrier offers an advanced query language (Section 5.3). Another very important retrieval feature of Terrier is the automatic query expansion, which is described in Section 6.

Figure 2: Overview of the retrieval architecture of Terrier. The application communicates with the Manager, which in turn runs the desired Matching module. Matching assigns scores to the documents using the combination of weighting model and score modifiers.

5.1 Weighting Models

The core functionality of matching documents to queries and ranking documents takes place in the Matching module. Matching employs a weighting model to assign a score to each of the query terms in a document. In order to facilitate the cross-comparison of weighting models, a range of weighting models is supplied, including BM25, TF-IDF, and document weighting models from the Divergence From Randomness (DFR) framework [1]. The DFR approach supplies parameter-free probabilistic models, based on a simple idea:

    "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the term t in the document d".

The open source 1.0.2 version of Terrier includes eight document weighting models from the DFR framework, all of which perform robustly on standard test collections. Ponte-Croft's language model [18] is also supported.

5.2 Altering Document Scores

The score of an individual term in a document can be altered by applying a TermScoreModifier. For example, a TermInFieldModifier can be applied in order to ensure that the query terms occur in a particular field of a document. If a query term does not appear in a particular field, then the TermInFieldModifier resets the score of the document. In addition, a FieldScoreModifier boosts the score of a document in which a query term appears in a particular field.

Similarly, changing the score of a retrieved document is achieved by applying a DocumentScoreModifier. One such modifier is the PhraseScoreModifier, which employs the position information saved in the index of Terrier, and resets the score of the retrieved documents in which the query terms do not appear as a phrase, or within a given number of blocks. Generally, a DocumentScoreModifier is ideal for applying query-independent evidence, such as evidence from hyperlink structure analysis, or from the URL of documents. Moreover, the selective application of different retrieval techniques based on evidence from the hyperlink structure [13] can be applied as a DocumentScoreModifier.

After the application of any TermScoreModifiers or DocumentScoreModifiers, the set of retrieved documents can be further altered by applying post processing or post filtering. Post processing is appropriate to implement functionalities that require changing the original query. An example of post processing is Query Expansion (QE), which is described in detail in Section 6. The application of QE could be enabled on a per-query basis, before retrieval, depending on the output of a pre-retrieval performance predictor [6]. Another possible example of post processing could be the application of clustering.

Post filtering is the final step in Terrier's retrieval process, where a series of filters can remove already retrieved documents, which do not satisfy a given condition. For example, in the context of a Web search engine, a post filter could be used to reduce the number of retrieved documents from the same Web site, in order to increase diversity in the results.

5.3 Query Language

Terrier includes a powerful query language that allows the user to specify additional operations on top of the normal probabilistic queries. Such operations may specify that a particular query term should or should not appear in the retrieved documents. Other available operations include requiring that terms appear in particular fields, phrase queries, or proximity queries. An overview of the available query language operations is given below.

    t1 t2            retrieves documents with either t1 or t2.
    t1^2.3           sets the weight of query term t1 to 2.3.
    +t1 -t2          retrieves documents with t1 but not t2.
    "t1 t2"          retrieves documents where the terms t1 and t2 occur next to each other.
    "t1 t2"~n        retrieves documents where the terms t1, t2 occur within n blocks.
    +(t1 t2)         specifies that both terms t1 and t2 are required.
    field:t1         retrieves docs where t1 must appear in the specified field.
    control:on/off   enables or disables a given control. For example, query expansion is enabled with qe:on.

The query language operations correspond to TermScoreModifier or DocumentScoreModifier modules, which are appropriately configured when the query is parsed. Retrieval parameters can also be specified by the user, when permitted, to change or enable retrieval functionalities. For example, a query for which a particular query term should appear in the title of the retrieved documents, automatically enables a TermInFieldModifier.

6. QUERY EXPANSION

Terrier includes automatic pseudo-relevance feedback, in the form of Query Expansion (QE). The method works by taking the top most informative terms from the top-ranked documents of the query, and adding these new related terms into the query. This operation is made possible by the presence of the direct index. The direct index allows the terms and their frequencies to be determined for each document in the index. The new query is reweighted and rerun - providing a richer set of retrieved documents. Terrier provides several term weighting models from the DFR framework which are useful for identifying informative terms from top-ranked documents. In addition, for easy cross-comparison, Terrier also includes some well-established pseudo-relevance feedback techniques, such as Rocchio's method [19].

Automatic query expansion is highly effective for many IR tasks. The decision of applying QE depends on the type of application, and the difficulty of queries. Therefore, we have developed several tools for determining possible indicators that predict the query difficulty, and the selective application of query expansion [2, 6, 9]. The vocabulary and size of the collections may be largely different. As a consequence, automatic means to tune the inherent parameters of the query expansion are also necessary in a good IR system. For this purpose, we devised a parameter-free QE mechanism, called Bo1 [1, 11], as a default in Terrier. The system is also readily available to support several low-cost methods for a selective application of query expansion. Our method of QE guarantees the most effective and efficient mechanism for performing QE, even with only using 3 retrieved documents and a handful of additional query terms. In addition, QE can be also used to automatically generate realistic samples of queries [6].

Data structures, in particular the direct file, tools and methods adopted for performing QE in Terrier, also constitute an advanced basis for other types of IR activities and applications, such as document and term clustering or question answering. Indeed, the query expansion methodology is based on a filter, activated on the information theoretic weights of the terms, which selects the highly informative terms that appear in at least X retrieved documents. As an illustration, the optimal value for X is 2 in adhoc retrieval. The clustering of the results would require a higher value for X, while question answering may require a different information theoretic definition for X. In conclusion, QE is a good source of inspiration for new and interesting IR applications and research directions.

7. EVALUATION PACKAGE

When performing IR research using a standard test collection, it is essential to evaluate the retrieval performance of the applied approaches. The open source version of Terrier was designed to allow users to rapidly design and test new retrieval functions and IR models. For this reason, we implemented the evaluation tools in the TREC style with a wide range of known IR measures. In particular, Terrier includes and integrates the main evaluation functionalities found in the trec_eval tool, including, among others, calculating Mean Average Precision, Precision @ rank N, interpolated Precision, and R-Precision. It is also possible to show the evaluation measures on a per-query basis.

The evaluation tools of Terrier allow easy testing of new retrieval methodologies or approaches. Once the new approach has been integrated into Terrier, users can test it, together with as many models as possible, against a test collection. As soon as the system has terminated all the runs, for each model, the file of the evaluation results can be created and a summary is displayed.

8. STANDARD DEPLOYMENTS & SAMPLE APPLICATIONS

Terrier 1.0.2 comes with two applications. Firstly, Terrier comprises a complete application suite for performing research using test collections. Terrier is easy to deploy on the standard TREC test collections, e.g. the TREC CDs, WT2G, WT10G, .GOV etc. This facilitates research on these standard test collections. Table 3 shows "out-of-the-box" retrieval performance scores of Terrier on standard test collections, using title-only queries. The default settings include the removal of English stopwords and the application of Porter's stemming algorithm. During retrieval, the InL2 DFR weighting model is applied as is.

Secondly, Terrier comes with a proof-of-concept Desktop Search application, shown in Figure 3. This shows how the retrieval functionalities of Terrier can be easily deployed to an application suitable for day-to-day retrieval tasks. All the retrieval functionalities are easily accessible from the Desktop Search application, including Terrier's query language and query expansion.

Collection   Topics / Task               Performance
-----------  --------------------------  -----------
WT2G         Adhoc Topics 401-450        0.2663 MAP
WT10G        Adhoc Topics 451-550        0.1886 MAP
W3C          Known Item Topics 26-150    0.4960 MRR

Table 3: Performance of "out-of-the-box" Terrier on standard TREC collections and tasks. Performance is measured using the standard measure for each task: for Adhoc, Mean Average Precision (MAP); for Known Item, Mean Reciprocal Rank (MRR).
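The two measures reported in Table 3, Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR), can be illustrated with a minimal sketch. This is not Terrier's evaluation code nor the trec_eval implementation: the function names, the data layout (a ranked list of document ids per query, a set of relevant ids per query) and the toy document ids are all assumptions made for illustration. Average precision is normalised here by the number of relevant documents for the query, which is the usual convention.

```python
from typing import Callable, Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average of precision values at each rank where a relevant doc appears,
    divided by the total number of relevant docs for the query."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def reciprocal_rank(ranking: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(
    metric: Callable[[List[str], Set[str]], float],
    runs: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
) -> float:
    """Mean of a per-query metric over all judged queries.
    A query with no retrieved documents contributes a score of 0."""
    scores = [metric(runs.get(qid, []), rel) for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical document ids, for illustration only).
runs = {"q1": ["d3", "d1", "d7", "d2"], "q2": ["d5", "d9", "d4"]}
qrels = {"q1": {"d3", "d2"}, "q2": {"d4"}}

map_score = mean_over_queries(average_precision, runs, qrels)
mrr_score = mean_over_queries(reciprocal_rank, runs, qrels)
print(f"MAP = {map_score:.4f}")  # prints MAP = 0.5417
print(f"MRR = {mrr_score:.4f}")  # prints MRR = 0.6667
```

In the toy data, q1 has relevant documents at ranks 1 and 4, giving an average precision of (1/1 + 2/4) / 2 = 0.75, and q2 has its only relevant document at rank 3, giving 1/3. Keeping the per-query scores before averaging also supports the per-query reporting mentioned in Section 7.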
Figure 3: Screenshot of the proof-of-concept Terrier Desktop Search. It can index various types of common document formats, including Microsoft Office documents, HTML and PDF files.

9. RESEARCH FACILITATED & LESSONS LEARNED

The development of a test-bed framework allowing for the rapid experimentation of new IR concepts/ideas has boosted our research, yielding some significant insights into the behaviour of IR techniques on diverse, multilingual, and large-scale collections. The system allowed us to successfully and significantly participate in various TREC tracks such as the Web tracks (2001-2004), the Robust tracks (2003, 2004), the Terabyte tracks (2004-2005), and recently the Enterprise track (2005). In addition, we participated in various tracks of CLEF (http://www.clef-campaign.org) in 2003-2005.

More importantly, in addition to being a vehicle for the evaluation of our new search methods, the Terrier platform allowed us to easily simulate, assess, and improve state-of-the-art IR methods and techniques. This led to a better understanding of the underlying issues related to the tackling of different search tasks on various and diverse large collections. In particular, we realised the need to build parameter-free models, which avoid the constraint of using costly parameter-tuning approaches based on relevance assessment. The proposed DFR framework is not only an efficient and effective IR model, but also a mechanism by which we can explain and complement other existing IR approaches/techniques [1].

One of the main lessons learnt from building the Terrier platform during the previous years is the fact that building a sustainable, flexible and robust platform for information retrieval is equally important as creating new IR models. Indeed, research in an experimental field such as IR cannot be conducted without a platform facilitating experimentation and evaluation.

In developing a cutting-edge technology from a laboratory setting to a product release, which was deployed in various industrial applications and settings, we gained a much better understanding of the challenges faced in tackling real large-scale collections, not necessarily as clean as the TREC or CLEF collections, and how to provide practical and effective solutions under a high query workload.

Many benefits were achieved by releasing a core version of the project as an open source software. By facilitating a transparent and reproducible experimentation of various retrieval methods, we believe that we achieved a greater impact and visibility in the scientific community. We have also built a community of users/developers around Terrier, which helped improving the quality of the software and attracted more researchers to the project. Since its release in November 2004, the Terrier software has been downloaded thousands of times from across the world, including major commercial search engine companies.

Finally, the Terrier project has attracted many undergraduate, postgraduate, and research students, and benefited from their contributions. Building an IR platform is both a research and engineering process, requiring team work and a critical mass of developers. Terrier is currently the main development platform for both undergraduate and postgraduate students in our research group, allowing them to employ a state-of-the-art framework in their learning and research.

10. CONCLUSIONS

In this paper, we described the open source version of Terrier, a robust, transparent, and modular framework for research and experimentation in Information Retrieval. The platform was developed as part of an ongoing larger project, ultimately aiming at becoming the European Search Engine, with the state-of-the-art search technology for various search tasks and applications.

Terrier is a mature and state-of-the-art platform, including highly effective retrieval models, based on the DFR framework. The performance of these models is at least comparable to, if not better than, that of the most recent models, including on very large-scale test collections [14, 11]. In addition, Terrier's DFR models are information-theoretic models for both query expansion and document ranking, a versatile and original feature that is not possessed by any other IR model.

Terrier is continuously expanded with new models resulting from our research. Indeed, new research, such as models for document structure and combination of evidence [17], length normalisation methods [7], query performance prediction [6, 9], NLP techniques [10], information extraction [3], is being readily fed to Terrier. Furthermore, Terrier is being used for research in enterprise and expert search [11]. Retrieval from large-scale test collections has also led us to study optimal distributed architectures for retrieval systems [4, 5].

From an IR perspective, technological challenges may render an IR model obsolete very quickly (as has happened with the Boolean or Vector Space model). We believe that a theoretical framework for Information Retrieval, when implemented within a mature, robust and extensible IR system, is destined to last. Our strong belief is that Terrier, as the accumulator of innovative ideas in IR, and with its design features and underlying principles, open source being one of them, will make easier technology transfers.

11. ACKNOWLEDGEMENTS

The development of Terrier required a significant manpower and infrastructure. We would like to thank those many students and programmers who participated in developing Terrier. We would also like to thank the UK Engineering and Physical Sciences Research Council (EPSRC), project grant number GR/R90543/01, as well as the Leverhulme Trust, project grant number F/00179/S, who have partially supported the development of Terrier. Finally, we thank the Department of Computing Science at the University of Glasgow, and Keith van Rijsbergen, for supporting the Terrier platform at various stages of its development.

12. REFERENCES

[1] G. Amati. Probabilistic Models for Information Retrieval based on Divergence from Randomness. PhD thesis, Department of Computing Science, University of Glasgow, 2003.

[2] G. Amati, C. Carpineto, and G. Romano. Query Difficulty, Robustness, and Selective Application of Query Expansion. In ECIR '04: Proceedings of the 26th European Conference on Information Retrieval (ECIR'04), pages 127-137, Sunderland, UK, 2004. Springer.

[3] G. Amati. Information Theoretic Approach to Information Extraction. In Proceedings of the 7th international conference on Flexible Query Answering Systems (FQAS 2006), pages 519-529, Milan, Italy, 2006.

[4] F. Cacheda, V. Plachouras, and I. Ounis. A case study of distributed information retrieval architectures to index one Terabyte of text. Information Processing & Management, 41(5):1141-1161, 2005.

[5] F. Cacheda, V. Carneiro, V. Plachouras, and I. Ounis. Performance Analysis of Distributed Information Retrieval Architectures Using an Improved Network Simulation Model. Information Processing & Management (to appear).

[6] B. He and I. Ounis. Inferring Query Performance Using Pre-retrieval Predictors. In Proceedings of the 11th Symposium on String Processing and Information Retrieval (SPIRE 2004), Padova, Italy, 2004.

[7] B. He and I. Ounis. A study of the Dirichlet priors for term frequency normalisation. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and Development in information retrieval, pages 465-471, Salvador, Brazil, 2005.

[8] B. He and I. Ounis. Term Frequency Normalisation Tuning for BM25 and DFR Models. In ECIR '05: Proceedings of the 27th European Conference on Information Retrieval (ECIR'05), pages 200-214, Santiago de Compostela, Spain, 2005. Springer.

[9] B. He and I. Ounis. Query Performance Prediction. In Information Systems, Elsevier, 2006 (In press).

[10] C. Lioma and I. Ounis. Examining the Content Load of Part-of-Speech Blocks for Information Retrieval. In Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING/ACL 2006), Sydney, Australia, 2006.

[11] C. Macdonald, B. He, V. Plachouras, and I. Ounis. University of Glasgow at TREC 2005: Experiments in Terabyte and Enterprise Tracks with Terrier. In TREC '05: Proceedings of the 14th Text REtrieval Conference (TREC 2005), November 2005, NIST.

[12] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and D. Johnson. Terrier Information Retrieval Platform. In Proceedings of the 27th European Conference on IR Research (ECIR 2005), volume 3408 of Lecture Notes in Computer Science, pages 517-519. Springer, 2005.

[13] V. Plachouras and I. Ounis. Usefulness of hyperlink structure for query-biased topic distillation. In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and Development in information retrieval, pages 48-455, Sheffield, UK, 2004.

[14] V. Plachouras, B. He, and I. Ounis. University of Glasgow at TREC 2004: Experiments in Web, Robust and Terabyte tracks with Terrier. In TREC '04: Proceedings of the 13th Text REtrieval Conference (TREC 2004), November 2004, NIST.

[15] V. Plachouras, I. Ounis, and G. Amati. The Static Absorbing Model for the Web. Journal of Web Engineering, 4(2):165-186, 2005.

[16] V. Plachouras, F. Cacheda, and I. Ounis. A Decision Mechanism for the Selective Combination of Evidence in Topic Distillation. Information Retrieval, 9(2):139-163, 2006.

[17] V. Plachouras. Selective Web Information Retrieval. PhD thesis, Department of Computing Science, University of Glasgow, 2006.

[18] J. M. Ponte and W. B. Croft. A language modelling approach to information retrieval. In SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 275-281, Melbourne, Australia, 1998.

[19] J. Rocchio. Relevance feedback in information retrieval. In G. Salton, Ed., The Smart Retrieval system - Experiments in Automatic Document Processing, pages 313-323. Prentice-Hall, Englewood Cliffs, NJ, 1971.

[20] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, 1999.