Online edition (c) 2009 Cambridge UP

DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.

1 Boolean retrieval

The meaning of the term information retrieval can be very broad. Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval. However, as an academic field of study, information retrieval might be defined thus:

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

As defined in this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email. [1] Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching (the sort that is going on when a clerk says to you: "I'm sorry, I can only look up your order if you can give me your Order ID").

1. In modern parlance, the word "search" has tended to replace "(information) retrieval"; the term "search" is quite ambiguous, but in context we use the two synonymously.

IR can also cover other kinds of data and information problems beyond that specified in the core definition above. The term "unstructured data" refers to data which does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records. In reality, almost no data are truly "unstructured". This is definitely true of all text data if you count the latent linguistic structure of human languages. But even accepting that the intended notion of structure is overt structure, most text has structure, such as headings and paragraphs and footnotes, which is commonly represented in documents by explicit markup (such as the coding underlying web pages).
IR is also used to facilitate "semistructured" search such as finding a document where the title contains Java and the body contains threading.

The field of information retrieval also covers supporting users in browsing or filtering document collections or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. It is similar to arranging books on a bookshelf according to their topic. Given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), classification is the task of deciding which class(es), if any, each of a set of documents belongs to. It is often approached by first manually classifying some documents and then hoping to be able to classify new documents automatically.

Information retrieval systems can also be distinguished by the scale at which they operate, and it is useful to distinguish three prominent scales. In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues are needing to gather documents for indexing, being able to build systems that work efficiently at this enormous scale, and handling particular aspects of the web, such as the exploitation of hypertext and not being fooled by site providers manipulating page content in an attempt to boost their search engine rankings, given the commercial importance of the web. We focus on all these issues in Chapters 19-21. At the other extreme is personal information retrieval. In the last few years, consumer operating systems have integrated information retrieval (such as Apple's Mac OS X Spotlight or Windows Vista's Instant Search). Email programs usually not only provide search but also text classification: they at least provide a spam (junk mail) filter, and commonly also provide either manual or automatic means for classifying mail so that it can be placed directly into particular folders. Distinctive issues here include handling the broad range of document types on a typical personal computer, and making the search system maintenance free and sufficiently lightweight in terms of startup, processing, and disk space usage that it can run on one machine without annoying its owner.
In between is the space of enterprise, institutional, and domain-specific search, where retrieval might be provided for collections such as a corporation's internal documents, a database of patents, or research articles on biochemistry. In this case, the documents will typically be stored on centralized file systems and one or a handful of dedicated machines will provide search over the collection. This book contains techniques of value over this whole spectrum, but our coverage of some aspects of parallel and distributed search in web-scale search systems is comparatively light owing to the relatively small published literature on the details of such systems. However, outside of a handful of web search companies, a software developer is most likely to encounter the personal search and enterprise scenarios.

In this chapter we begin with a very simple example of an information retrieval problem, and introduce the idea of a term-document matrix (Section 1.1) and the central inverted index data structure (Section 1.2). We will then examine the Boolean retrieval model and how Boolean queries are processed (Sections 1.3 and 1.4).

1.1 An example information retrieval problem

A fat book which many people own is Shakespeare's Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia. One way to do that is to start at the beginning and to read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it from consideration if it contains Calpurnia. The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly referred to as grepping through text, after the Unix command grep, which performs this process. Grepping through text can be a very effective process, especially given the speed of modern computers, and often allows useful possibilities for wildcard pattern matching through the use of regular expressions. With modern computers, for simple querying of modest collections (the size of Shakespeare's Collected Works is a bit under one million words of text in total), you really need nothing more.
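To make the linear scan concrete, here is a minimal Python sketch. It assumes the documents are already held in memory as a mapping from title to text. The toy `plays` dictionary and the names `words` and `linear_scan` are illustrative and do not come from the text, and the tokenizer (lowercased runs of letters and apostrophes) is a deliberate simplification.

```python
import re

def words(text):
    """Return the set of lowercased words in text (a crude tokenizer)."""
    return set(re.findall(r"[a-z']+", text.lower()))

def linear_scan(plays, must_have, must_not_have):
    """Read every document in full and keep those that satisfy the query."""
    hits = []
    for title, text in plays.items():
        vocab = words(text)
        has_all = all(w in vocab for w in must_have)
        has_banned = any(w in vocab for w in must_not_have)
        if has_all and not has_banned:
            hits.append(title)
    return hits

# Hypothetical toy collection; the real plays are far longer.
plays = {
    "Hamlet": "I did enact Julius Caesar: Brutus killed me.",
    "Julius Caesar": "Calpurnia! Caesar and Brutus speak.",
    "The Tempest": "Full fathom five thy father lies.",
}

# Brutus AND Caesar AND NOT Calpurnia
print(linear_scan(plays, ["brutus", "caesar"], ["calpurnia"]))
```

Every query rereads every document, so the cost grows with the size of the whole collection rather than with the number of matches.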
But for many purposes, you do need more:

1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words.

2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as "within 5 words" or "within the same sentence".

3. To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words.

The way to avoid linearly scanning the texts for each query is to index the documents in advance. Let us stick with Shakespeare's Collected Works, and use it to introduce the basics of the Boolean retrieval model. Suppose we record for each document (here a play of Shakespeare's) whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary term-document incidence matrix, as in Figure 1.1. Terms are the indexed units (further discussed in Section 2.2); they are usually words, and for the moment you can think of them as words, but the information retrieval literature normally speaks of terms because some of them, such as perhaps I-9 or Hong Kong, are not usually thought of as words. Now, depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it. [2]

            Antony and  Julius  The      Hamlet  Othello  Macbeth  ...
            Cleopatra   Caesar  Tempest
Antony      1           1       0        0       0        1
Brutus      1           1       0        1       0        0
Caesar      1           1       0        1       1        1
Calpurnia   0           1       0        0       0        0
Cleopatra   1           0       0        0       0        0
mercy       1           0       1        1       1        1
worser      1           0       1        1       1        0
...

Figure 1.1 A term-document incidence matrix. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.

To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND:

110100 AND 110111 AND 101111 = 100100

The answers for this query are thus Antony and Cleopatra and Hamlet (Figure 1.2).
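The bitwise computation above can be reproduced with a short Python sketch that stores each row of Figure 1.1 as an integer bit vector. The names `PLAYS`, `INCIDENCE`, `vector`, and `matching_plays` are illustrative assumptions, and only the three rows the query needs are included.

```python
# Rows of Figure 1.1 as bit strings; columns follow the figure's play order.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
INCIDENCE = {
    "Brutus":    "110100",
    "Caesar":    "110111",
    "Calpurnia": "010000",
}

def vector(term):
    """Return the incidence row for term as an integer bit vector."""
    return int(INCIDENCE[term], 2)

def matching_plays(bits):
    """Map the 1 bits of a result vector back to play titles."""
    as_string = format(bits, f"0{len(PLAYS)}b")
    return [play for play, bit in zip(PLAYS, as_string) if bit == "1"]

# Python integers are unbounded, so mask the complement down to six bits.
mask = (1 << len(PLAYS)) - 1
answer = vector("Brutus") & vector("Caesar") & (~vector("Calpurnia") & mask)

print(format(answer, "06b"))    # 100100
print(matching_plays(answer))   # ['Antony and Cleopatra', 'Hamlet']
```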
The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words.

2. Formally, we take the transpose of the matrix to be able to get the terms as column vectors.

Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
    When Antony found Julius Caesar dead,
    He cried almost to roaring; and he wept
    When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the
    Capitol; Brutus killed me.

Figure 1.2 Results from Shakespeare for the query Brutus AND Caesar AND NOT Calpurnia.

Let us now consider a more realistic scenario, simultaneously using the opportunity to introduce some terminology and notation. Suppose we have N = 1 million documents. By documents we mean whatever units we have decided to build a retrieval system over. They might be individual memos or chapters of a book (see Section 2.1.2 (page 20) for further discussion). We will refer to the group of documents over which we perform retrieval as the (document) collection. It is sometimes also referred to as a corpus (a body of texts). Suppose each document is about 1000 words long (2-3 book pages). If we assume an average of 6 bytes per word including spaces and punctuation, then this is a document collection about 6 GB in size. Typically, there might be about M = 500,000 distinct terms in these documents. There is nothing special about the numbers we have chosen, and they might vary by an order of magnitude or more, but they give us some idea of the dimensions of the kinds of problems we need to handle. We will discuss and model these size assumptions in Section 5.1 (page 86).

Our goal is to develop a system to address the ad hoc retrieval task. This is the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query.
An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need. Our example above was rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like "pipeline leaks" and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as pipeline rupture. To assess the effectiveness of an IR system (i.e., the quality of its search results), a user will usually want to know two key statistics about the system's returned results for a query:

Precision: What fraction of the returned results are relevant to the information need?

Recall: What fraction of the relevant documents in the collection were returned by the system?

Detailed discussion of relevance and evaluation measures including precision and recall is found in Chapter 8.

We now cannot build a term-document matrix in a naive way. A 500K × 1M matrix has half-a-trillion 0's and 1's, too many to fit in a computer's memory. But the crucial observation is that the matrix is extremely sparse, that is, it has few non-zero entries. Because each document is 1000 words long, the matrix has no more than one billion 1's, so a minimum of 99.8% of the cells are zero. A much better representation is to record only the things that do occur, that is, the 1 positions.

This idea is central to the first major concept in information retrieval, the inverted index. The name is actually redundant: an index always maps back from terms to the parts of a document where they occur. Nevertheless, inverted index, or sometimes inverted file, has become the standard term in information retrieval. [3] The basic idea of an inverted index is shown in Figure 1.3. We keep a dictionary of terms (sometimes also referred to as a vocabulary or lexicon; in this book, we use dictionary for the data structure and vocabulary for the set of terms). Then for each term, we have a list that records which documents the term occurs in.
Each item in the list, which records that a term appeared in a document (and, later, often, the positions in the document), is conventionally called a posting. [4] The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings. The dictionary in Figure 1.3 has been sorted alphabetically and each postings list is sorted by document ID. We will see why this is useful in Section 1.3, below, but later we will also consider alternatives to doing this (Section 7.1.5).

3. Some information retrieval researchers prefer the term inverted file, but expressions like index construction and index compression are much more common than inverted file construction and inverted file compression. For consistency, we use (inverted) index throughout this book.

4. In a (non-positional) inverted index, a posting is just a document ID, but it is inherently associated with a term, via the postings list it is placed on; sometimes we will also talk of a (term, docID) pair as a posting.

Brutus     →  1  2  4  11  31  45  173  174
Caesar     →  1  2  4  5  6  16  57  132  ...
Calpurnia  →  2  31  54  101
...
Dictionary    Postings

Figure 1.3 The two parts of an inverted index. The dictionary is commonly kept in memory, with pointers to each postings list, which is stored on disk.

1.2 A first take at building an inverted index

To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The major steps in this are:

1. Collect the documents to be indexed:
   [Friends, Romans, countrymen.] [So let it be with Caesar] ...

2. Tokenize the text, turning each document into a list of tokens:
   [Friends] [Romans] [countrymen] [So] ...

3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms:
   [friend] [roman] [countryman] [so] ...

4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

We will define and discuss the earlier stages of processing, that is, steps 1-3, in Section 2.2 (page 22). Until then you can think of tokens and normalized tokens as also loosely equivalent to words.
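As a rough illustration of steps 1 to 3, the following Python sketch tokenizes two toy documents and normalizes the tokens. It is only a stand-in under stated assumptions: the function names `tokenize` and `normalize` are hypothetical, the tokenizer is a simple regular expression, and normalization is plain lowercasing. It therefore yields friends rather than the friend shown in step 3, which would need the fuller linguistic preprocessing deferred to Section 2.2.

```python
import re

def tokenize(text):
    """Step 2: split text into word tokens, dropping punctuation (crude rule)."""
    return re.findall(r"[A-Za-z']+", text)

def normalize(token):
    """Step 3 (simplified): lowercase only, with no stemming or lemmatization."""
    return token.lower()

# Step 1: the collected documents, numbered from 1.
docs = ["Friends, Romans, countrymen.", "So let it be with Caesar"]

for doc_id, text in enumerate(docs, start=1):
    tokens = tokenize(text)
    terms = [normalize(t) for t in tokens]
    print(doc_id, tokens, terms)
```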
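Here, we assume that the first 3 steps have already been done, and we examine building a basic inverted index by sort-based indexing.

Within a document collection, we assume that each document has a unique serial number, known as the document identifier (docID). During index construction, we can simply assign successive integers to each new document when it is first encountered. The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of pairs of term and docID, as in Figure 1.4. The core indexing step is sorting this list so that the terms are alphabetical, giving us the representation in the middle column of Figure 1.4. Multiple occurrences of the same term from the same document are then merged. [5] Instances of the same term are then grouped, and the result is split into a dictionary and postings, as shown in the right column of Figure 1.4. Since a term generally occurs in a number of documents, this data organization already reduces the storage requirements of the index. The dictionary also records some statistics, such as the number of documents which contain each term (the document frequency, which is here also the length of each postings list). This information is not vital for a basic Boolean search engine, but it allows us to improve the efficiency of the search engine at query time, and it is a statistic later used in many ranked retrieval models.

5. Unix users can note that these steps are similar to use of the sort and then uniq commands.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:

Input pairs           Sorted pairs          Dictionary and postings
term       docID      term       docID      term       doc. freq.  →  postings list
I          1          ambitious  2          ambitious  1           →  2
did        1          be         2          be         1           →  2
enact      1          brutus     1          brutus     2           →  1 → 2
julius     1          brutus     2          caesar     2           →  1 → 2
caesar     1          caesar     1          capitol    1           →  1
I          1          caesar     2          did        1           →  1
was        1          caesar     2          enact      1           →  1
killed     1          capitol    1          hath       1           →  2
i'         1          did        1          I          1           →  1
the        1          enact      1          i'         1           →  1
capitol    1          hath       2          it         1           →  2
brutus     1          I          1          julius     1           →  1
killed     1          I          1          killed     1           →  1
me         1          i'         1          let        1           →  2
so         2          it         2          me         1           →  1
let        2          julius     1          noble      1           →  2
it         2          killed     1          so         1           →  2
be         2          killed     1          the        2           →  1 → 2
with       2          let        2          told       1           →  2
caesar     2          me         1          was        2           →  1 → 2
the        2          noble      2          with       1           →  2
noble      2          so         2          you        1           →  2
brutus     2          the        1
hath       2          the        2
told       2          told       2
you        2          was        1
caesar     2          was        2
was        2          with       2
ambitious  2          you        2

Figure 1.4 Building an index by sorting and grouping. The sequence of terms in each document, tagged by their document ID (left) is sorted alphabetically (middle). Instances of the same term are then grouped by word and then by document ID. The terms and document IDs are then separated out (right). The dictionary stores the terms, and has a pointer to the postings list for each term. It commonly also stores other summary information such as, here, the document frequency of each term. We use this information for improving query time efficiency and, later, for weighting in ranked retrieval models. Each postings list stores the list of documents in which a term occurs, and may store other information such as the term frequency (the frequency of each term in each document) or the position(s) of the term in each document.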
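The sort-and-group procedure of Figure 1.4 can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a scalable indexer. The name `build_index` and the choice to store each dictionary entry as a (document frequency, postings list) tuple are assumptions made for the example, and all tokens are lowercased for simplicity (so the figure's I appears as i).

```python
from itertools import groupby

def build_index(docs):
    """docs maps docID -> list of normalized tokens.

    Returns a dict mapping term -> (document frequency, sorted postings list).
    """
    # Tag every token with its docID, as in the left column of Figure 1.4.
    pairs = [(term, doc_id) for doc_id, tokens in docs.items() for term in tokens]
    # set() merges repeated (term, docID) pairs; sorted() orders by term, then docID.
    pairs = sorted(set(pairs))
    index = {}
    # Group the sorted pairs by term to split them into dictionary and postings.
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = [doc_id for _, doc_id in group]
        index[term] = (len(postings), postings)
    return index

docs = {
    1: "i did enact julius caesar i was killed i' the capitol brutus killed me".split(),
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious".split(),
}
index = build_index(docs)

for term in ("brutus", "caesar", "hath", "killed"):
    print(term, index[term])
```

Sorting first means all pairs for one term are adjacent, so a single pass is enough to collect each postings list already ordered by docID.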
The postings are secondarily sorted by docID. This provides the basis for efficient query processing. This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc text search.

In the resulting index, we pay for storage of both the dictionary and the postings lists. The latter are much larger, but the dictionary is commonly kept in memory, while postings lists are normally kept on disk, so the size of each is important, and in Chapter 5 we will examine how each can be optimized for storage and access efficiency. What data structure should be used for a postings list? A fixed length array would be wasteful as some words occur in many documents, and others in very few. For an in-memory postings list, two good alternatives are singly linked lists or variable length arrays. Singly linked lists allow cheap insertion of documents into postings lists (following updates, such as when recrawling the web for updated documents), and naturally extend to more advanced indexing strategies such as skip lists (Section 2.3), which require additional pointers. Variable length arrays win in space requirements by avoiding the overhead for pointers and in time requirements because their use of contiguous memory increases speed on modern processors with memory caches. Extra pointers can in practice be encoded into the lists as offsets.
If updates are relatively infrequent, variable length arrays will be more compact and faster to traverse. We can also use a hybrid scheme with a linked list of fixed length arrays for each term. When postings lists are stored on disk, they are stored (perhaps compressed) as a contiguous run of postings without explicit pointers (as in Figure 1.3), so as to minimize the size of the postings list and the number of disk seeks to read a postings list into memory.

Exercise 1.1 [*] Draw the inverted index that would be built for the following document collection. (See Figure 1.3 for an example.)

Doc 1  new home sales top forecasts
Doc 2  home sales rise in july
Doc 3  increase in home sales in july
Doc 4  july new home sales rise

Exercise 1.2 [*] Consider these documents:

Doc 1  breakthrough drug for schizophrenia
Doc 2  new schizophrenia drug
Doc 3  new approach for treatment of schizophrenia
Doc 4  new hopes for schizophrenia patients

a. Draw the term-document incidence matrix for this document collection.
b. Draw the inverted index representation for this collection, as in Figure 1.3 (page 7).

Exercise 1.3 [*] For the document collection shown in Exercise 1.2, what are the returned results for these queries:

a. schizophrenia AND drug
b. for AND NOT (drug OR approach)

1.3 Processing Boolean queries

How do we process a query using an inverted index and the basic Boolean retrieval model? Consider processing the simple conjunctive query:

(1.1) Brutus AND Calpurnia

over the inverted index partially shown in Figure 1.3 (page 7). We:

1. Locate Brutus in the Dictionary
2. Retrieve its postings
3. Locate Calpurnia in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists, as shown in Figure 1.5.

Brutus        →  1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia     →  2 → 31 → 54 → 101
Intersection  ⇒  2 → 31

Figure 1.5 Intersecting the postings lists for Brutus and Calpurnia from Figure 1.3.

The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms.
(This operation is sometimes referred to as merging postings lists: this slightly counterintuitive name reflects using the term merge algorithm for a general family of algorithms that combine multiple sorted lists by interleaved advancing of pointers through each; here we are merging the lists with a logical AND operation.)

There is a simple and effective method of intersecting postings lists using the merge algorithm (see Figure 1.6): we maintain pointers into both lists and walk through the two postings lists simultaneously, in time linear in the total number of postings entries. At each step, we compare the docID pointed to by both pointers. If they are the same, we put that docID in the results list, and advance both pointers. Otherwise we advance the pointer pointing to the smaller docID. If the lengths of the postings lists are x and y, the intersection takes O(x + y) operations. Formally, the complexity of querying is Θ(N), where N is the number of documents in the collection. [6] Our indexing methods gain us just a constant, not a difference in Θ time complexity compared to a linear scan, but in practice the constant is huge. To use this algorithm, it is crucial that postings be sorted by a single global ordering. Using a numeric sort by docID is one simple way to achieve this.

INTERSECT(p1, p2)
  answer ← ⟨ ⟩
  while p1 ≠ NIL and p2 ≠ NIL
  do if docID(p1) = docID(p2)
       then ADD(answer, docID(p1))
            p1 ← next(p1)
            p2 ← next(p2)
       else if docID(p1) < docID(p2)
              then p1 ← next(p1)
              else p2 ← next(p2)
  return answer

Figure 1.6 Algorithm for the intersection of two postings lists p1 and p2.

6. The notation Θ(·) is used to express an asymptotically tight bound on the complexity of an algorithm. Informally, this is often written as O(·), but this notation really expresses an asymptotic upper bound, which need not be tight (Cormen et al. 1990).

We can extend the intersection operation to process more complicated queries like:

(1.2) (Brutus OR Caesar) AND NOT Calpurnia

Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system. A major element of this for Boolean queries is the order in which postings lists are accessed. What is the best order for query processing? Consider a query that is an AND of t terms, for instance:

(1.3) Brutus AND Caesar AND Calpurnia

For each of the t terms, we need to get its postings, then AND them together.
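As a concrete counterpart to the pseudocode of Figure 1.6, the two-pointer intersection can be written in Python as below. This sketch assumes postings lists are Python lists of integer docIDs sorted in increasing order, and it uses list indices in place of the linked-list pointers of the figure. The function name `intersect` is our own.

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists in O(len(p1) + len(p2)) steps."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID in both lists: record it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Advance whichever pointer is at the smaller docID.
            i += 1
        else:
            j += 1
    return answer

# Postings lists for Brutus and Calpurnia as shown in Figure 1.3.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]

print(intersect(brutus, calpurnia))  # [2, 31]
```

Each loop iteration advances at least one index, which is why the running time is linear in the combined length of the two lists.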
The standard heuristic is to process terms in order of increasing document frequency: if we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work. So, for the postings lists in Figure 1.3 (page 7), we execute the above query as:

(1.4) (Calpurnia AND Brutus) AND Caesar

This is a first justification for keeping the frequency of terms in the dictionary: it allows us to make this ordering decision based on in-memory data before accessing any postings list.

Consider now the optimization of more general queries, such as:

(1.5) (madding OR crowd) AND (ignoble OR strife) AND (killed OR slain)

As before, we will get the frequencies for all terms, and we can then (conservatively) estimate the size of each OR by the sum of the frequencies of its disjuncts. We can then process the query in increasing order of the size of each disjunctive term.

For arbitrary Boolean queries, we have to evaluate and temporarily store the answers for intermediate expressions in a complex expression. However, in many circumstances, either because of the nature of the query language, or just because this is the most common type of query that users submit, a query is purely conjunctive. In this case, rather than viewing merging postings lists as a function with two inputs and a distinct output, it is more efficient to intersect each retrieved postings list with the current intermediate result in memory, where we initialize the intermediate result by loading the postings list of the least frequent term. This algorithm is shown in Figure 1.7.

INTERSECT(⟨t1, ..., tn⟩)
  terms ← SORTBYINCREASINGFREQUENCY(⟨t1, ..., tn⟩)
  result ← postings(first(terms))
  terms ← rest(terms)
  while terms ≠ NIL and result ≠ NIL
  do result ← INTERSECT(result, postings(first(terms)))
     terms ← rest(terms)
  return result

Figure 1.7 Algorithm for conjunctive queries that returns the set of documents containing each term in the input list of terms.
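The conjunctive algorithm of Figure 1.7 can be sketched in Python as follows. The sketch makes several assumptions. The index is a plain dict from term to a docID-sorted list, and the length of that list stands in for the document frequency kept in the dictionary. A term missing from the index is treated as having an empty postings list. The example postings are only the entries visible in Figure 1.3, where Caesar's list is truncated. The names `intersect` and `intersect_terms` are illustrative.

```python
def intersect(p1, p2):
    """Two-pointer intersection of docID-sorted lists, as in Figure 1.6."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_terms(terms, index):
    """AND together the postings of all terms, processing the rarest term first."""
    if not terms:
        return []
    # Sort by increasing document frequency (here: postings list length).
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result can never grow again
        result = intersect(result, index.get(term, []))
    return result

# Visible portion of the postings lists in Figure 1.3.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}

print(intersect_terms(["brutus", "caesar", "calpurnia"], index))  # [2]
```

With these lists the query is evaluated as (Calpurnia AND Brutus) AND Caesar, so the intermediate result never exceeds the four postings of Calpurnia.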
The intersection operation is then asymmetric: the intermediate results list is in memory while the list it is being intersected with is being read from disk. Moreover the intermediate results list is always at least as short as the other list, and in many cases it is orders of magnitude shorter. The postings intersection can still be done by the algorithm in Figure 1.6, but when the difference between the list lengths is very large, opportunities to use alternative techniques open up. The intersection can be calculated in place by destructively modifying or marking invalid items in the intermediate results list. Or the intersection can be done as a sequence of binary searches in the long postings lists for each posting in the intermediate results list. Another possibility is to store the long postings list as a hashtable, so that membership of an intermediate result item can be calculated in constant rather than linear or log time. However, such alternative techniques are difficult to combine with postings list compression of the sort discussed in Chapter 5. Moreover, standard postings list intersection operations remain necessary when both terms of a query are very common.

Exercise 1.4 [*] For the queries below, can we still run through the intersection in time O(x + y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve?

a. Brutus AND NOT Caesar
b. Brutus OR NOT Caesar

Exercise 1.5 [*] Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time complexity? For instance, consider:

c. (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)

Can we always merge in linear time? Linear in what? Can we do better than this?

Exercise 1.6 [**] We can use distributive laws for AND and OR to rewrite queries.

a. Show how to rewrite the query in Exercise 1.5 into disjunctive normal form using the distributive laws.
b. Would the resulting query be more or less efficiently evaluated than the original form of this query?
c. Is this result true in general or does it depend on the words and the contents of the document collection?

Exercise 1.7 [*] Recommend a query processing order for

d. (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

given the following postings list sizes:
Term          Postings size
eyes          213312
kaleidoscope  87009
marmalade     107913
skies         271658
tangerine     46653
trees         316812

Exercise 1.8 [*] If the query is:

e. friends AND romans AND (NOT countrymen)

how could we use the frequency of countrymen in evaluating the best query evaluation order? In particular, propose a way of handling negation in determining the order of query processing.

Exercise 1.9 [**] For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? Explain why it is, or give an example where it isn't.

Exercise 1.10 [**] Write out a postings merge algorithm, in the style of Figure 1.6 (page 11), for an x OR y query.

Exercise 1.11 [**] How should the Boolean query x AND NOT y be handled? Why is naive evaluation of this query normally very expensive? Write out a postings merge algorithm that evaluates this query efficiently.

1.4 The extended Boolean model versus ranked retrieval

The Boolean retrieval model contrasts with ranked retrieval models such as the vector space model (Section 6.3), in which users largely use free text queries, that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides which documents best satisfy the query. Despite decades of academic research on the advantages of ranked retrieval, systems implementing the Boolean retrieval model were the main or only search option provided by large commercial information providers for three decades until the early 1990s (approximately the date of arrival of the World Wide Web). However, these systems did not have just the basic Boolean operations (AND, OR, and NOT) which we have presented so far. A strict Boolean expression over terms with an unordered results set is too limited for many of the information needs that people have, and these systems implemented extended Boolean retrieval models by incorporating additional operators such as term proximity operators.
A proximity operator is a way of specifying that two terms in a query must occur close to each other in a document, where closeness may be measured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.

Example 1.1: Commercial Boolean searching: Westlaw. Westlaw (http://www.westlaw.com/) is the largest commercial legal search service (in terms of the number of paying subscribers), with over half a million subscribers performing millions of searches a day over tens of terabytes of text data. The service was started in 1975. In 2005, Boolean search (called "Terms and Connectors" by Westlaw) was still the default, and used by a large percentage of users, although ranked free text querying (called "Natural Language" by Westlaw) was added in 1992. Here are some example Boolean queries on Westlaw:

Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company.
Query: "trade secret" /s disclos! /s prevent /s employe!

Information need: Requirements for disabled people to be able to access a workplace.
Query: disab! /p access! /s work-site work-place (employment /3 place)

Information need: Cases about a host's responsibility for drunk guests.
Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest

Note the long, precise queries and the use of proximity operators, both uncommon in web search. Submitted queries average about ten words in length. Unlike web search conventions, a space between words represents disjunction (the tightest binding operator), & is AND and /s, /p, and /k ask for matches in the same sentence, same paragraph or within k words respectively. Double quotes give a phrase search (consecutive words); see Section 2.4 (page 39). The exclamation mark (!) gives a trailing wildcard query (see Section 3.2, page 51); thus liab! matches all words starting with liab. Additionally work-site matches any of worksite, work-site or work site; see Section 2.2.1 (page 22). Typical expert queries are usually carefully defined and incrementally developed until they obtain what look to be good results to the user.

Many users, particularly professionals, prefer Boolean query models. Boolean queries are precise: a document either matches the query or it does not. This offers the user greater control and transparency over what is retrieved.
And some domains, such as legal materials, allow an effective means of document ranking within a Boolean model: Westlaw returns documents in reverse chronological order, which is in practice quite effective. In 2007, the majority of law librarians still seem to recommend terms and connectors for high recall searches, and the majority of legal users think they are getting greater control by using them. However, this does not mean that Boolean queries are more effective for professional searchers. Indeed, experimenting on a Westlaw subcollection, Turtle (1994) found that free text queries produced better results than Boolean queries prepared by Westlaw's own reference librarians for the majority of the information needs in his experiments. A general problem with Boolean search is that using AND operators tends to produce high precision but low recall searches, while using OR operators gives low precision but high recall searches, and it is difficult or impossible to find a satisfactory middle ground.

In this chapter, we have looked at the structure and construction of a basic inverted index, comprising a dictionary and postings lists. We introduced the Boolean retrieval model, and examined how to do efficient retrieval via linear time merges and simple query optimization. In Chapters 2-7 we will consider in detail richer query models and the sort of augmented index structures that are needed to handle them efficiently. Here we just mention a few of the main additional things we would like to be able to do:

1. We would like to better determine the set of terms in the dictionary and to provide retrieval that is tolerant to spelling mistakes and inconsistent choice of words.

2. It is often useful to search for compounds or phrases that denote a concept such as "operating system". As the Westlaw examples show, we might also wish to do proximity queries such as Gates NEAR Microsoft. To answer such queries, the index has to be augmented to capture the proximities of terms in documents.

3. A Boolean model only records term presence or absence, but often we would like to accumulate evidence, giving more weight to documents that have a term several times as opposed to ones that contain it only once. To be able to do this we need term frequency information (the number of times a term occurs in a document) in postings lists.
17 ermoccursinadocument)inpostingslists. 4.
ermoccursinadocument)inpostingslists. 4. Booleanqueriesjustretrieveasetofmatchingdocuments,butcommonlywewishtohaveaneffectivemethodtoorder(orrank)thereturnedresults.Thisrequireshavingamechanismfordeterminingadocumentscorewhichencapsulateshowgoodamatchadocumentisforaquery.Withtheseadditionalideas,wewillhaveseenmostofthebasictechnol-ogythatsupportsadhocsearchingoverunstructuredinformation.Adhocsearchingoverdocumentshasrecentlyconqueredtheworld,poweringnotonlywebsearchenginesbutthekindofunstructuredsearchthatliesbehindthelargeeCommercewebsites.Althoughthemainwebsearchenginesdifferbyemphasizingfreetextquerying,mostofthebasicissuesandtechnologiesofindexingandqueryingremainthesame,aswewillseeinlaterchapters.Moreover,overtime,websearchengineshaveaddedatleastpartialimple-mentationsofsomeofthemostpopularoperatorsfromextendedBooleanmodels:phrasesearchisespeciallypopularandmosthaveaverypartialimplementationofBooleanoperators.Nevertheless,whiletheseoptionsarelikedbyexpertsearchers,theyarelittleusedbymostpeopleandarenotthemainfocusinworkontryingtoimprovewebsearchengineperformance. ?Exercise1.12 [?]WriteaqueryusingWestlawsyntaxwhichwouldndanyofthewordsprofessor,teacher,orlecturerinthesamesentenceasaformoftheverbexplain. Online edition (c)\n2009 Cambridge UP 1.5Referencesandfurtherreading17Exercise1.13 [?]TryusingtheBooleansearchfeaturesonacoupleofmajorwebsearchengines.Forinstance,chooseaword,suchasburglar,andsubmitthequeries(i)burglar,(ii)burglarANDburglar,and(iii)burglarORburglar.Lookattheestimatednumberofresultsandtophits.DotheymakesenseintermsofBooleanlogic?Oftentheyhaven'tformajorsearchengines.Canyoumakesenseofwhatisgoingon?Whataboutifyoutrydifferentwords?Forexample,queryfor(i)knight,(ii)conquer,andthen(iii)knightORconquer.Whatboundshouldthenumberofresultsfromthersttwoqueriesplaceonthethirdquery?Isthisboundobserved?1.5ReferencesandfurtherreadingThepracticalpursuitofcomputerizedinformationretrievalbeganinthelate1940s( Cleverdon1991 , Liddy2005 ).
A great increase in the production of scientific literature, much in the form of less formal technical reports rather than traditional journal articles, coupled with the availability of computers, led to interest in automatic document retrieval. However, in those days, document retrieval was always based on author, title, and keywords; full-text search came much later.

The article of Bush (1945) provided lasting inspiration for the new field:

  Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, 'memex' will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.

The term Information Retrieval was coined by Calvin Mooers in 1948/1950 (Mooers 1950).

In 1958, much newspaper attention was paid to demonstrations at a conference (see Taube and Wooster 1958) of IBM auto-indexing machines, based primarily on the work of H. P. Luhn. Commercial interest quickly gravitated towards Boolean retrieval systems, but the early years saw a heady debate over various disparate technologies for retrieval systems. For example, Mooers (1961) dissented:

  It is a common fallacy, underwritten at this date by the investment of several million dollars in a variety of retrieval hardware, that the algebra of George Boole (1847) is the appropriate formalism for retrieval system design. This view is as widely and uncritically accepted as it is wrong.

The observation of AND vs. OR giving you opposite extremes in a precision/recall tradeoff, but not the middle ground, comes from (Lee and Fox 1988).

The book (Witten et al. 1999) is the standard reference for an in-depth comparison of the space and time efficiency of the inverted index versus other possible data structures; a more succinct and up-to-date presentation appears in Zobel and Moffat (2006). We further discuss several approaches in Chapter 5.

Friedl (2006) covers the practical usage of regular expressions for searching. The underlying computer science appears in (Hopcroft et a