Online edition (c) 2009 Cambridge UP

DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.

1 Boolean retrieval

The meaning of the term information retrieval can be very broad. Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval. However, as an academic field of study, information retrieval might be defined thus:

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

As defined in this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email. [1] Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching (the sort that is going on when a clerk says to you: “I'm sorry, I can only look up your order if you can give me your Order ID”).

[1] In modern parlance, the word “search” has tended to replace “(information) retrieval”; the term “search” is quite ambiguous, but in context we use the two synonymously.

IR can also cover other kinds of data and information problems beyond that specified in the core definition above. The term “unstructured data” refers to data which does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records. In reality, almost no data are truly “unstructured”. This is definitely true of all text data if you count the latent linguistic structure of human languages. But even accepting that the intended notion of structure is overt structure, most text has structure, such as headings and paragraphs and footnotes, which is commonly represented in documents by explicit markup (such as the coding underlying web pages). IR is also used to facilitate “semistructured” search such as finding a document where the title contains Java and the body contains threading.

The field of information retrieval also covers supporting users in browsing or filtering document collections or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. It is similar to arranging books on a bookshelf according to their topic. Given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), classification is the task of deciding which class(es), if any, each of a set of documents belongs to. It is often approached by first manually classifying some documents and then hoping to be able to classify new documents automatically.

Information retrieval systems can also be distinguished by the scale at which they operate, and it is useful to distinguish three prominent scales. In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues are needing to gather documents for indexing, being able to build systems that work efficiently at this enormous scale, and handling particular aspects of the web, such as the exploitation of hypertext and not being fooled by site providers manipulating page content in an attempt to boost their search engine rankings, given the commercial importance of the web. We focus on all these issues in Chapters 19–21. At the other extreme is personal information retrieval. In the last few years, consumer operating systems have integrated information retrieval (such as Apple's Mac OS X Spotlight or Windows Vista's Instant Search). Email programs usually not only provide search but also text classification: they at least provide a spam (junk mail) filter, and commonly also provide either manual or automatic means for classifying mail so that it can be placed directly into particular folders. Distinctive issues here include handling the broad range of document types on a typical personal computer, and making the search system maintenance free and sufficiently lightweight in terms of startup, processing, and disk space usage that it can run on one machine without annoying its owner. In between is the space of enterprise, institutional, and domain-specific search, where retrieval might be provided for collections such as a corporation's internal documents, a database of patents, or research articles on biochemistry. In this case, the documents will typically be stored on centralized file systems and one or a handful of dedicated machines will provide search over the collection. This book contains techniques of value over this whole spectrum, but our coverage of some aspects of parallel and distributed search in web-scale search systems is comparatively light owing to the relatively small published literature on the details of such systems. However, outside of a handful of web search companies, a software developer is most likely to encounter the personal search and enterprise scenarios.

In this chapter we begin with a very simple example of an information retrieval problem, and introduce the idea of a term-document matrix (Section 1.1) and the central inverted index data structure (Section 1.2). We will then examine the Boolean retrieval model and how Boolean queries are processed (Sections 1.3 and 1.4).

1.1 An example information retrieval problem

A fat book which many people own is Shakespeare's Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia. One way to do that is to start at the beginning and to read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it from consideration if it contains Calpurnia. The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly referred to as grepping through text, after the Unix command grep, which performs this process. Grepping through text can be a very effective process, especially given the speed of modern computers, and often allows useful possibilities for wildcard pattern matching through the use of regular expressions. With modern computers, for simple querying of modest collections (the size of Shakespeare's Collected Works is a bit under one million words of text in total), you really need nothing more. But for many purposes, you do need more:

1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words.

2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as “within 5 words” or “within the same sentence”.

3. To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words.

The way to avoid linearly scanning the texts for each query is to index the documents in advance. Let us stick with Shakespeare's Collected Works, and use it to introduce the basics of the Boolean retrieval model. Suppose we record for each document (here a play of Shakespeare's) whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary term-document incidence matrix, as in Figure 1.1. Terms are the indexed units (further discussed in Section 2.2); they are usually words, and for the moment you can think of them as words, but the information retrieval literature normally speaks of terms because some of them, such as perhaps I-9 or Hong Kong, are not usually thought of as words. Now, depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it. [2]

            Antony and   Julius   The       Hamlet   Othello   Macbeth   ...
            Cleopatra    Caesar   Tempest
Antony      1            1        0         0        0         1
Brutus      1            1        0         1        0         0
Caesar      1            1        0         1        1         1
Calpurnia   0            1        0         0        0         0
Cleopatra   1            0        0         0        0         0
mercy       1            0        1         1        1         1
worser      1            0        1         1        1         0
...

Figure 1.1 A term-document incidence matrix. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.

To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND:

    110100 AND 110111 AND 101111 = 100100

The answers for this query are thus Antony and Cleopatra and Hamlet (Figure 1.2).

The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in
the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words.

[2] Formally, we take the transpose of the matrix to be able to get the terms as column vectors.

    Antony and Cleopatra, Act III, Scene ii
    Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
        When Antony found Julius Caesar dead,
        He cried almost to roaring; and he wept
        When at Philippi he found Brutus slain.

    Hamlet, Act III, Scene ii
    Lord Polonius: I did enact Julius Caesar: I was killed i' the
        Capitol; Brutus killed me.

Figure 1.2 Results from Shakespeare for the query Brutus AND Caesar AND NOT Calpurnia.

Let us now consider a more realistic scenario, simultaneously using the opportunity to introduce some terminology and notation. Suppose we have N = 1 million documents. By documents we mean whatever units we have decided to build a retrieval system over. They might be individual memos or chapters of a book (see Section 2.1.2 (page 20) for further discussion). We will refer to the group of documents over which we perform retrieval as the (document) collection. It is sometimes also referred to as a corpus (a body of texts). Suppose each document is about 1000 words long (2–3 book pages). If we assume an average of 6 bytes per word including spaces and punctuation, then this is a document collection about 6 GB in size. Typically, there might be about M = 500,000 distinct terms in these documents. There is nothing special about the numbers we have chosen, and they might vary by an order of magnitude or more, but they give us some idea of the dimensions of the kinds of problems we need to handle. We will discuss and model these size assumptions in Section 5.1 (page 86).

Our goal is to develop a system to address the ad hoc retrieval task. This is the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query. An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need. Our example above was rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like “pipeline leaks” and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as pipeline rupture. To assess the effectiveness of an IR system (i.e., the quality of its search results), a user will usually want to know two key statistics about the system's returned results for a query:

Precision: What fraction of the returned results are relevant to the information need?

Recall: What fraction of the relevant documents in the collection were returned by the system?

Detailed discussion of relevance and evaluation measures including precision and recall is found in Chapter 8.

We now cannot build a term-document matrix in a naive way. A 500K × 1M matrix has half-a-trillion 0's and 1's, too many to fit in a computer's memory. But the crucial observation is that the matrix is extremely sparse, that is, it has few non-zero entries. Because each document is 1000 words long, the matrix has no more than one billion 1's, so a minimum of 99.8% of the cells are zero. A much better representation is to record only the things that do occur, that is, the 1 positions.

This idea is central to the first major concept in information retrieval, the inverted index. The name is actually redundant: an index always maps back from terms to the parts of a document where they occur. Nevertheless, inverted index, or sometimes inverted file, has become the standard term in information retrieval. [3] The basic idea of an inverted index is shown in Figure 1.3. We keep a dictionary of terms (sometimes also referred to as a vocabulary or lexicon; in this book, we use dictionary for the data structure and vocabulary for the set of terms). Then for each term, we have a list that records which documents the term occurs in. Each item in the list, which records that a term appeared in a document (and, later, often, the positions in the document), is conventionally called a posting. [4] The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings. The dictionary in Figure 1.3 has been sorted alphabetically and each postings list is sorted by document ID. We will see why this is useful in Section 1.3, below, but later we will also consider alternatives to doing this (Section 7.1.5).

[3] Some information retrieval researchers prefer the term inverted file, but expressions like index construction and index compression are much more common than inverted file construction and inverted file compression. For consistency, we use (inverted) index throughout this book.

[4] In a (non-positional) inverted index, a posting is just a document ID, but it is inherently associated with a term, via the postings list it is placed on; sometimes we will also talk of a (term, docID) pair as a posting.

    Dictionary     Postings
    Brutus     →   1  2  4  11  31  45  173  174
    Caesar     →   1  2  4  5  6  16  57  132  ...
    Calpurnia  →   2  31  54  101
    ...

Figure 1.3 The two parts of an inverted index. The dictionary is commonly kept in memory, with pointers to each postings list, which is stored on disk.

1.2 A first take at building an inverted index

To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The major steps in this are:

1. Collect the documents to be indexed: [Friends, Romans, countrymen.] [So let it be with Caesar] ...

2. Tokenize the text, turning each document into a list of tokens: [Friends] [Romans] [countrymen] [So] ...

3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms: [friend] [roman] [countryman] [so] ...

4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

We will define and discuss the earlier stages of processing, that is, steps 1–3, in Section 2.2 (page 22). Until then you can think of tokens and normalized tokens as also loosely equivalent to words. Here, we assume that the first 3 steps have already been done, and we examine building a basic inverted
index by sort-based indexing.

Within a document collection, we assume that each document has a unique serial number, known as the document identifier (docID). During index construction, we can simply assign successive integers to each new document when it is first encountered. The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of pairs of term and docID, as in Figure 1.4. The core indexing step is sorting this list so that the terms are alphabetical, giving us the representation in the middle column of Figure 1.4. Multiple occurrences of the same term from the same document are then merged. [5] Instances of the same term are then grouped, and the result is split into a dictionary and postings, as shown in the right column of Figure 1.4. Since a term generally occurs in a number of documents, this data organization already reduces the storage requirements of the index. The dictionary also records some statistics, such as the number of documents which contain each term (the document frequency, which is here also the length of each postings list). This information is not vital for a basic Boolean search engine, but it allows us to improve the efficiency of the search engine at query time, and it is a statistic later used in many ranked retrieval models. The postings are secondarily sorted by docID. This provides the basis for efficient query processing. This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc text search.

[5] Unix users can note that these steps are similar to use of the sort and then uniq commands.

    Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
    Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:

    Left column (term docID, in order of occurrence):
    I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

    Middle column (term docID, sorted):
    ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2

    Right column (term, doc. freq. → postings list):
    ambitious  1  →  2
    be         1  →  2
    brutus     2  →  1 → 2
    capitol    1  →  1
    caesar     2  →  1 → 2
    did        1  →  1
    enact      1  →  1
    hath       1  →  2
    I          1  →  1
    i'         1  →  1
    it         1  →  2
    julius     1  →  1
    killed     1  →  1
    let        1  →  2
    me         1  →  1
    noble      1  →  2
    so         1  →  2
    the        2  →  1 → 2
    told       1  →  2
    you        1  →  2
    was        2  →  1 → 2
    with       1  →  2

Figure 1.4 Building an index by sorting and grouping. The sequence of terms in each document, tagged by their documentID (left) is sorted alphabetically (middle). Instances of the same term are then grouped by word and then by documentID. The terms and documentIDs are then separated out (right). The dictionary stores the terms, and has a pointer to the postings list for each term. It commonly also stores other summary information such as, here, the document frequency of each term. We use this information for improving query time efficiency and, later, for weighting in ranked retrieval models. Each postings list stores the list of documents in which a term occurs, and may store other information such as the term frequency (the frequency of each term in each document) or the position(s) of the term in each document.

In the resulting index, we pay for storage of both the dictionary and the postings lists. The latter are much larger, but the dictionary is commonly kept in memory, while postings lists are normally kept on disk, so the size of each is important, and in Chapter 5 we will examine how each can be optimized for storage and access efficiency. What data structure should be used for a postings list? A fixed length array would be wasteful as some words occur in many documents, and others in very few. For an in-memory postings list, two good alternatives are singly linked lists or variable length arrays. Singly linked lists allow cheap insertion of documents into postings lists (following updates, such as when recrawling the web for updated documents), and naturally extend to more advanced indexing strategies such as skip lists (Section 2.3), which require additional pointers. Variable length arrays win in space requirements by avoiding the overhead for pointers and in time requirements because their use of contiguous memory increases speed on modern processors with memory caches. Extra pointers can in practice be encoded into the lists as offsets. If updates are relatively infrequent, variable length arrays will be more compact and faster to traverse. We can also use a hybrid scheme with a linked list of fixed length arrays for each term. When postings lists are stored on disk, they are stored (perhaps compressed) as a contiguous run of postings without explicit pointers (as in Figure 1.3), so as to minimize the size of the postings list and the number of disk seeks to read a postings list into memory.

Exercise 1.1 [*] Draw the inverted index that would be built for the following document collection. (See Figure 1.3 for an example.)

    Doc 1  new home sales top forecasts
    Doc 2  home sales rise in july
    Doc 3  increase in home sales in july
    Doc 4  july new home sales rise

Exercise 1.2 [*] Consider these documents:

    Doc 1  breakthrough drug for schizophrenia
    Doc 2  new schizophrenia drug
    Doc 3  new approach for treatment of schizophrenia
    Doc 4  new hopes for schizophrenia patients

a. Draw the term-document incidence matrix for this document collection.
b. Draw the inverted index representation for this collection, as in Figure 1.3 (page 7).

Exercise 1.3 [*] For the document collection shown in Exercise 1.2, what are the returned results for these queries:

a. schizophrenia AND drug
b. for AND NOT (drug OR approach)

1.3 Processing Boolean queries

How do we process a query using an inverted index and the basic Boolean retrieval model? Consider processing the simple conjunctive query:

(1.1) Brutus AND Calpurnia

over the inverted index partially shown in Figure 1.3 (page 7). We:

1. Locate Brutus in the Dictionary
2. Retrieve its postings
3. Locate Calpurnia in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists, as shown in Figure 1.5.

    Brutus        →  1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
    Calpurnia     →  2 → 31 → 54 → 101
    Intersection  ⟹  2 → 31

Figure 1.5 Intersecting the postings lists for Brutus and Calpurnia from Figure 1.3.

The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms. (This operation is sometimes referred to as merging postings lists: this slightly counterintuitive name reflects using the term merge algorithm for a general family of algorithms that combine multiple sorted lists by interleaved advancing of pointers through each; here we are merging the lists with a logical AND operation.)

There is a simple and effective method of intersecting postings lists using the merge algorithm (see Figure 1.6): we maintain pointers into both lists and walk through the two postings lists simultaneously, in time linear in the total number of postings entries. At each step, we compare the docID pointed to by both pointers. If they are the same, we put that docID in the results list, and advance both pointers. Otherwise we advance the pointer pointing to the smaller docID. If the lengths of the postings lists are x and y, the intersection takes O(x + y) operations. Formally, the complexity of querying is Θ(N), where N is the number of documents in the collection. [6] Our indexing methods gain us just a constant, not a difference in Θ time complexity compared to a linear scan, but in practice the constant is huge. To use this algorithm, it is crucial that postings be sorted by a single global ordering. Using a numeric sort by docID is one simple way to achieve this.

[6] The notation Θ(·) is used to express an asymptotically tight bound on the complexity of an algorithm. Informally, this is often written as O(·), but this notation really expresses an asymptotic upper bound, which need not be tight (Cormen et al. 1990).

    INTERSECT(p1, p2)
      answer ← ⟨ ⟩
      while p1 ≠ NIL and p2 ≠ NIL
      do if docID(p1) = docID(p2)
           then ADD(answer, docID(p1))
                p1 ← next(p1)
                p2 ← next(p2)
           else if docID(p1) < docID(p2)
                  then p1 ← next(p1)
                  else p2 ← next(p2)
      return answer

Figure 1.6 Algorithm for the intersection of two postings lists p1 and p2.

We can extend the intersection operation to process more complicated queries like:

(1.2) (Brutus OR Caesar) AND NOT Calpurnia

Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system. A major element of this for Boolean queries is the order in which postings lists are accessed. What is the best order for query processing? Consider a query that is an AND of t terms, for instance:

(1.3) Brutus AND Caesar AND Calpurnia

For each of the t terms, we need to get its postings, then AND them together. The standard heuristic is to process terms in order of increasing document frequency: if we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work. So, for the postings lists in Figure 1.3 (page 7), we execute the above query as:

(1.4) (Calpurnia AND Brutus) AND Caesar

This is a first justification for keeping the frequency of terms in the dictionary: it allows us to make this ordering decision based on in-memory data before accessing any postings list.

Consider now the optimization of more general queries, such as:

(1.5) (madding OR crowd) AND (ignoble OR strife) AND (killed OR slain)

As before, we will get the frequencies for all terms, and we can then (conservatively) estimate the size of each OR by the sum of the frequencies of its disjuncts. We can then process the query in increasing order of the size of each disjunctive term.

For arbitrary Boolean queries, we have to evaluate and temporarily store the answers for intermediate expressions in a complex expression. However, in many circumstances, either because of the nature of the query language, or just because this is the most common type of query that users submit, a query is purely conjunctive. In this case, rather than viewing merging postings lists as a function with two inputs and a distinct output, it is more efficient to intersect each retrieved postings list with the current intermediate result in memory, where we initialize the intermediate result by loading the postings list of the least frequent term. This algorithm is shown in Figure 1.7. The intersection operation is then asymmetric: the intermediate results list is in memory while the list it is being intersected with is being read from disk. Moreover the intermediate results list is always at least as short as the other list, and in many cases it is orders of magnitude shorter. The postings intersection can still be done by the algorithm in Figure 1.6, but when the difference between the list lengths is very large, opportunities to use alternative techniques open up. The intersection can be calculated in place by destructively modifying or marking invalid items in the intermediate results list. Or the intersection can be done as a sequence of binary searches in the long postings lists for each posting in the intermediate results list. Another possibility is to store the long postings list as a hashtable, so that membership of an intermediate result item can be calculated in constant rather than linear or log time. However, such alternative techniques are difficult to combine with postings list compression of the sort discussed in Chapter 5. Moreover, standard postings list intersection operations remain necessary when both terms of a query are very common.

    INTERSECT(⟨t1, ..., tn⟩)
      terms ← SORTBYINCREASINGFREQUENCY(⟨t1, ..., tn⟩)
      result ← postings(first(terms))
      terms ← rest(terms)
      while terms ≠ NIL and result ≠ NIL
      do result ← INTERSECT(result, postings(first(terms)))
         terms ← rest(terms)
      return result

Figure 1.7 Algorithm for conjunctive queries that returns the set of documents containing each term in the input list of terms.

Exercise 1.4 [*] For the queries below, can we still run through the intersection in time O(x + y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve?

a. Brutus AND NOT Caesar
b. Brutus OR NOT Caesar

Exercise 1.5 [*] Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time complexity? For instance, consider:

c. (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)

Can we always merge in linear time? Linear in what? Can we do better than this?

Exercise 1.6 [**] We can use distributive laws for AND and OR to rewrite queries.

a. Show how to rewrite the query in Exercise 1.5 into disjunctive normal form using the distributive laws.
b. Would the resulting query be more or less efficiently evaluated than the original form of this query?
c. Is this result true in general or does it depend on the words and the contents of the document collection?

Exercise 1.7 [*] Recommend a query processing order for

d. (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

given the following postings list sizes:
    Term          Postings size
    eyes          213312
    kaleidoscope  87009
    marmalade     107913
    skies         271658
    tangerine     46653
    trees         316812

Exercise 1.8 [*] If the query is:

e. friends AND romans AND (NOT countrymen)

how could we use the frequency of countrymen in evaluating the best query evaluation order? In particular, propose a way of handling negation in determining the order of query processing.

Exercise 1.9 [**] For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? Explain why it is, or give an example where it isn't.

Exercise 1.10 [**] Write out a postings merge algorithm, in the style of Figure 1.6 (page 11), for an x OR y query.

Exercise 1.11 [**] How should the Boolean query x AND NOT y be handled? Why is naive evaluation of this query normally very expensive? Write out a postings merge algorithm that evaluates this query efficiently.

1.4 The extended Boolean model versus ranked retrieval

The Boolean retrieval model contrasts with ranked retrieval models such as the vector space model (Section 6.3), in which users largely use free text queries, that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides which documents best satisfy the query. Despite decades of academic research on the advantages of ranked retrieval, systems implementing the Boolean retrieval model were the main or only search option provided by large commercial information providers for three decades until the early 1990s (approximately the date of arrival of the World Wide Web). However, these systems did not have just the basic Boolean operations (AND, OR, and NOT) which we have presented so far. A strict Boolean expression over terms with an unordered results set is too limited for many of the information needs that people have, and these systems implemented extended Boolean retrieval models by incorporating additional operators such as term proximity operators. A proximity operator is a way of specifying that two terms in a query must occur close to each other in a document, where closeness may be measured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.

Example 1.1: Commercial Boolean searching: Westlaw. Westlaw (http://www.westlaw.com/) is the largest commercial legal search service (in terms of the number of paying subscribers), with over half a million subscribers performing millions of searches a day over tens of terabytes of text data. The service was started in 1975. In 2005, Boolean search (called “Terms and Connectors” by Westlaw) was still the default, and used by a large percentage of users, although ranked free text querying (called “Natural Language” by Westlaw) was added in 1992. Here are some example Boolean queries on Westlaw:

Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company.
Query: "trade secret" /s disclos! /s prevent /s employe!

Information need: Requirements for disabled people to be able to access a workplace.
Query: disab! /p access! /s work-site work-place (employment /3 place)

Information need: Cases about a host's responsibility for drunk guests.
Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest

Note the long, precise queries and the use of proximity operators, both uncommon in web search. Submitted queries average about ten words in length. Unlike web search conventions, a space between words represents disjunction (the tightest binding operator), & is AND and /s, /p, and /k ask for matches in the same sentence, same paragraph or within k words respectively. Double quotes give a phrase search (consecutive words); see Section 2.4 (page 39). The exclamation mark (!) gives a trailing wildcard query (see Section 3.2, page 51); thus liab! matches all words starting with liab. Additionally work-site matches any of worksite, work-site or work site; see Section 2.2.1 (page 22). Typical expert queries are usually carefully defined and incrementally developed until they obtain what look to be good results to the user.

Many users, particularly professionals, prefer Boolean query models. Boolean queries are precise: a document either matches the query or it does not. This offers the user greater control and transparency over what is retrieved. And some domains, such as legal materials, allow
16 aneffectivemeansofdocumentrankingwithina
an effective means of document ranking within a Boolean model: Westlaw returns documents in reverse chronological order, which is in practice quite effective. In 2007, the majority of law librarians still seem to recommend terms and connectors for high recall searches, and the majority of legal users think they are getting greater control by using them. However, this does not mean that Boolean queries are more effective for professional searchers. Indeed, experimenting on a Westlaw subcollection, Turtle (1994) found that free text queries produced better results than Boolean queries prepared by Westlaw's own reference librarians for the majority of the information needs in his experiments. A general problem with Boolean search is that using AND operators tends to produce high precision but low recall searches, while using OR operators gives low precision but high recall searches, and it is difficult or impossible to find a satisfactory middle ground.

In this chapter, we have looked at the structure and construction of a basic inverted index, comprising a dictionary and postings lists. We introduced the Boolean retrieval model, and examined how to do efficient retrieval via linear time merges and simple query optimization. In Chapters 2–7 we will consider in detail richer query models and the sort of augmented index structures that are needed to handle them efficiently. Here we just mention a few of the main additional things we would like to be able to do:

1. We would like to better determine the set of terms in the dictionary and to provide retrieval that is tolerant to spelling mistakes and inconsistent choice of words.
2. It is often useful to search for compounds or phrases that denote a concept such as “operating system”. As the Westlaw examples show, we might also wish to do proximity queries such as Gates NEAR Microsoft. To answer such queries, the index has to be augmented to capture the proximities of terms in documents.
3. A Boolean model only records term presence or absence, but often we would like to accumulate evidence, giving more weight to documents that have a term several times as opposed to ones that contain it only once. To be able to do this we need term frequency information (the number of times a term occurs in a document) in postings lists.
4. Boolean queries just retrieve a set of matching documents, but commonly we wish to have an effective method to order (or “rank”) the returned results. This requires having a mechanism for determining a document score which encapsulates how good a match a document is for a query.

With these additional ideas, we will have seen most of the basic technology that supports ad hoc searching over unstructured information. Ad hoc searching over documents has recently conquered the world, powering not only web search engines but the kind of unstructured search that lies behind the large eCommerce websites. Although the main web search engines differ by emphasizing free text querying, most of the basic issues and technologies of indexing and querying remain the same, as we will see in later chapters. Moreover, over time, web search engines have added at least partial implementations of some of the most popular operators from extended Boolean models: phrase search is especially popular and most have a very partial implementation of Boolean operators. Nevertheless, while these options are liked by expert searchers, they are little used by most people and are not the main focus in work on trying to improve web search engine performance.

Exercise 1.12 [⋆] Write a query using Westlaw syntax which would find any of the words professor, teacher, or lecturer in the same sentence as a form of the verb explain.

Exercise 1.13 [⋆] Try using the Boolean search features on a couple of major web search engines. For instance, choose a word, such as burglar, and submit the queries (i) burglar, (ii) burglar AND burglar, and (iii) burglar OR burglar. Look at the estimated number of results and top hits. Do they make sense in terms of Boolean logic? Often they haven't for major search engines. Can you make sense of what is going on? What about if you try different words? For example, query for (i) knight, (ii) conquer, and then (iii) knight OR conquer. What bound should the number of results from the first two queries place on the third query? Is this bound observed?

1.5 References and further reading

The practical pursuit of computerized information retrieval began in the late 1940s (Cleverdon 1991, Liddy 2005).
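As a brief aside on the chapter summary, the following minimal Python sketch illustrates what it might mean to store term frequencies in postings lists and to use them for a crude ranking. Everything here is an assumption made for illustration: the function names (`build_index`, `boolean_and`, `boolean_or`, `rank_by_tf`), the whitespace tokenizer, the toy documents, and above all the score, which simply sums raw term frequencies and is not the weighting scheme developed in later chapters.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to a postings dict {doc_id: term frequency}.

    Tokenization is a naive lowercase whitespace split (an assumption).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def boolean_and(index, terms):
    """Documents containing every query term (tends toward high precision)."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, terms):
    """Documents containing at least one query term (tends toward high recall)."""
    return set().union(*(index.get(t, {}) for t in terms))


def rank_by_tf(index, terms):
    """Order documents by the sum of raw term frequencies of the query terms.

    This is a placeholder score for illustration only; ties are broken
    by ascending doc_id.
    """
    scores = Counter()
    for t in terms:
        for doc_id, tf in index.get(t, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy collection.
docs = {
    1: "operating system kernel and operating system shell",
    2: "the system was down",
    3: "operating theatre schedule",
}
index = build_index(docs)

query = ["operating", "system"]
print(boolean_and(index, query))  # only documents with both terms
print(boolean_or(index, query))   # documents with either term
print(rank_by_tf(index, query))   # OR-style matches, ordered by score
```

On this toy collection the AND query returns a single document and the OR query returns all three, while the ranked list keeps the broad OR result set but places the document that mentions both terms repeatedly first. That is one way of seeing why a document score offers a middle ground that pure Boolean retrieval lacks.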

A great increase in the production of scientific literature, much in the form of less formal technical reports rather than traditional journal articles, coupled with the availability of computers, led to interest in automatic document retrieval. However, in those days, document retrieval was always based on author, title, and keywords; full-text search came much later.

The article of Bush (1945) provided lasting inspiration for the new field:

“Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, ‘memex’ will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.”

The term Information Retrieval was coined by Calvin Mooers in 1948/1950 (Mooers 1950).

In 1958, much newspaper attention was paid to demonstrations at a conference (see Taube and Wooster 1958) of IBM “auto-indexing” machines, based primarily on the work of H. P. Luhn. Commercial interest quickly gravitated towards Boolean retrieval systems, but the early years saw a heady debate over various disparate technologies for retrieval systems. For example, Mooers (1961) dissented:

“It is a common fallacy, underwritten at this date by the investment of several million dollars in a variety of retrieval hardware, that the algebra of George Boole (1847) is the appropriate formalism for retrieval system design. This view is as widely and uncritically accepted as it is wrong.”

The observation of AND vs. OR giving you opposite extremes in a precision/recall tradeoff, but not the middle ground, comes from (Lee and Fox 1988).

The book (Witten et al. 1999) is the standard reference for an in-depth comparison of the space and time efficiency of the inverted index versus other possible data structures; a more succinct and up-to-date presentation appears in Zobel and Moffat (2006). We further discuss several approaches in Chapter 5.

Friedl (2006) covers the practical usage of regular expressions for searching. The underlying computer science appears in (Hopcroft et a