Slide1
LIS618 lecture 4: before searching + introduction to DIALOG
Thomas Krichel
2011-11-01
Slide2
structure of talk
- some generalities about searching
- Working with DIALOG
  - Overview
  - Search command
Slide3
online information retrieval
- This subject can be thought of as a subset of information retrieval (IR).
- Most IR is online or digital.
- IR concentrates on textual data.
- We can think of online IR as falling into two categories
  - database IR
  - web IR
Slide4
database / web IR
- Database IR looks at systems that have
  - a controlled set of records
  - low heterogeneity
  - use that requires authentication
  - advanced search features
- Web IR has the opposite characteristics.
Slide5
traditional social model
- User goes to a library.
- Describes the problem to the librarian.
- Librarian does the search
  - without the user present
  - with the user present
- Hands over the result to the user.
- User fetches the full text or asks a librarian to fetch the full text.
Slide6
economic rationale for the traditional model
- In olden days the cost of telecommunication was high.
- Database use costs
  - cost of communication
  - cost of access time to the database
- The traditional model puts an upper limit on the costs.
Slide7
disintermediation
- With the cost of access time gone, the traditional model is under threat.
- There is disintermediation, where the librarian loses her role of doing the search.
- But that may not be good news for information retrieval results
  - the user knows the subject matter best
  - the librarian knows searching best
Slide8
Web searching
- IR has received a lot of impetus through the web, which poses unprecedented search challenges.
- With more and more data appearing on the web, DS may be a subject in decline
  - It is primarily concerned with non-web databases.
  - There are more and more web-based methods of searching.
Slide9
Public access vs quality
- Now the public at large is able to do online searching.
- At the same time, the need for quality answers has grown.
- Quality-filtered services will become more important.
- In the current databases, there is a lot that would already be available for free, mixed with quality-controlled stuff.
- Publishers have direct offerings, and intermediated vending is in decline.
Slide10
components of the IR process
- provider
  - define data that is available
  - documents that can be used
  - document operations
  - document structure
  - index
- user
  - user need
  - IR system familiarity
Slide11
the IR process
- Query expresses the user need in a query language.
- Processing of the query yields retrieved documents.
- Calculation of relevance ranking.
- Examination of retrieved documents.
- Possible return to the start, with another query.
Slide12
main problem
- The user is not an expert at the formulation of a query.
- Garbage in, garbage out: the retrieval yields poor results.
- Ways around that problem
  - design a very intuitive interface for the query
  - give expert guidance
Slide13
before a search I
- What is the purpose of the query?
  - brief overview
  - comprehensive search
- What perspective on the topic is required?
  - scholarly
  - technical
  - business
  - popular
Slide14
before a search II
- What type of information does the patron want?
  - full text
  - bibliographic
  - directory
  - numeric
- Are there any known sources?
  - authors
  - journals
  - papers
  - conferences
Slide15
before a search III
- What are the language restrictions?
- What, if any, are the cost restrictions?
- How current does the data need to be?
- How much of each record is required?
Slide16
concept analysis
- This is the art/science of taking the topic to search for and developing facets.
- Example: “Internet filtering in Libraries”
  - Internet filter
  - Libraries
  - Controversy, not technical issues
- We may also need to think about the aim of the search.
Slide17
search aims
- a known needle in a known haystack
- a known needle in an unknown haystack
- an unknown needle in an unknown haystack
- any needle in a haystack
- the sharpest needle in a haystack
- most of the sharpest needles in a haystack
Slide18
search aims
- all the needles in a haystack
- affirmation of no needles in a haystack
- things like needles in a haystack
- is there a new needle in the haystack
- where are the haystacks
- needles, haystacks, anything
Slide19
types of searches
- known-item searches
- negative searches
- selective dissemination of information
- topical or subject searches
- passage searching, where the user is only interested in part of the item
Slide20
search strategies I
- Building block approach
  - Do a number of elementary searches.
  - Combine the resulting sets with Boolean operators.
- This is what I did in the example in the previous lecture.
- Works only with the Boolean model.
Slide21
search strategies II
- Snowballing approach
  - Start with a very specific query.
  - Think of other terms that can be added to get more results.
  - Stop when a reasonable number of results is reached.
- Not sure this really works well in practice.
Slide22
search strategies III
- The successive fraction approach is the opposite of the snowballing approach.
  - First search for a broad concept.
  - Then repeat the query, adding various limiting factors.
- Can work well if the IR system allows you to repeat and edit queries.
- But queries can become unwieldy.
Slide23
search strategies IV
- Most specific facet first
  - Conduct concept analysis.
  - Look for the most specific facet.
  - Search that first, add others later.
- Presupposes that you have done a decent concept analysis.
Slide24
taxonomy of classic IR models
- Boolean, or set-theoretic
  - fuzzy set models
  - extended Boolean
- Vector, or algebraic
  - generalized vector model
  - latent semantic indexing
  - neural network model
- Probabilistic
  - inference network
  - belief network
Slide25
summary
- There are three basic types of models in classic information retrieval.
- Extensions of these types are a matter of research concern and require good mathematical skills.
- All classic models treat documents as individual pieces.
Slide26
key aid: index
- An index is a list of terms, each with a list of locations where the term is to be found.
- The way to express locations usually depends on the form that the indexed data takes.
  - For a book, it is usually the page number, e.g. "shmoo 34, 75".
  - For computer files, it is usually the name of the file plus the number of the byte where the indexed term starts, e.g. "krichel index.html 34, cv.html 890 1209".
- There is usually more than one location of the term.
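To make the file case concrete, here is a minimal Python sketch of such an index. The file names and their contents are made up for illustration, and the offsets are character positions, which coincide with byte positions only for plain ASCII text.

```python
import re

def build_index(files):
    """Map each term to {file name: [offsets where the term starts]}."""
    index = {}
    for name, text in files.items():
        # Every run of word characters counts as a term; terms are lower-cased.
        for match in re.finditer(r"\w+", text.lower()):
            index.setdefault(match.group(), {}).setdefault(name, []).append(match.start())
    return index

# Hypothetical files, not taken from the lecture.
files = {
    "index.html": "Thomas Krichel home page",
    "cv.html": "CV of Thomas Krichel. Krichel CV.",
}
index = build_index(files)
print(index["krichel"])  # {'index.html': [7], 'cv.html': [13, 22]}
```

As on the slide, one term ends up with several locations, spread over more than one file.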
Slide27
key aid: index terms
- The index term is a part of the document that has a meaning on its own.
- It is usually a noun.
- Retrieval based on index terms raises questions
  - semantics in the query or document is lost
  - matching is done in the imprecise space of index terms
- One way out is to specify several terms and require that they be close to each other.
Slide28
basic concept: weight of index term
- Given all nouns, not all appear to have the same relevance to the text.
- Sometimes, we can have a simple measure of the importance of a term. Example?
- More generally, for each index term and each document, we can associate a weight with the term and the document.
- Usually, if the document does not contain the term, its weight is zero.
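The slide leaves the example open. One candidate, offered here purely as an assumption and not as the measure the lecture has in mind, is relative term frequency: the share of the document's words taken up by the term. A term that is absent gets weight zero, as the last point requires.

```python
from collections import Counter

def term_weights(text, index_terms):
    """Weight each index term by its share of the words in the document."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    # Counter returns 0 for absent terms, so their weight is zero.
    return {term: (counts[term] / total if total else 0.0) for term in index_terms}

# Hypothetical one-sentence document.
doc = "mate is a herbal beverage and mate is popular"
weights = term_weights(doc, ["mate", "beverage", "library"])
print(weights)
```

Here "mate" receives a higher weight than "beverage" because it occurs twice, and "library" receives zero.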
Slide29
Boolean model
- In the Boolean model, the weight of an index term for a document is 1 if the term appears in the document. It is 0 otherwise.
- This allows query terms to be combined with the Boolean operators AND, OR, and NOT.
- Thus powerful queries can be written.
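A small Python sketch of the idea follows. The three documents and the term list are invented for illustration; the point is only that every weight is 0 or 1, so a query reduces to Boolean logic on those weights.

```python
# Hypothetical mini-collection.
docs = {
    1: "current awareness services in digital libraries",
    2: "digital library software",
    3: "current events",
}
terms = ["current", "awareness", "digital"]

def weight(term, text):
    """Boolean weight: 1 if the term occurs as a word in the text, else 0."""
    return 1 if term in text.split() else 0

# One row per document, one 0/1 weight per index term.
matrix = {doc_id: {t: weight(t, text) for t in terms} for doc_id, text in docs.items()}

hits_and = [d for d, w in matrix.items() if w["current"] and w["digital"]]
hits_or = [d for d, w in matrix.items() if w["current"] or w["digital"]]
hits_not = [d for d, w in matrix.items() if w["current"] and not w["digital"]]

print(hits_and, hits_or, hits_not)  # [1] [1, 2, 3] [3]
```

Note that there is no ranking: a document either matches the query or it does not.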
Slide30
Dialog
- Dialog is a databank of over 500 databases.
- These are also known as files and cover
  - references and abstracts for published literature, business information and financial data
  - complete text of articles and news stories
  - statistical tables
  - directories
- DIALOG uses the Boolean model.
Slide31
DIALOG interface
- It is still rooted in "traditional" database systems.
- It has been dismissed as "dial-a-dog".
- It uses a command-driven interface.
- It is very complicated to learn fully.
- It is not suitable for the end-user.
- It therefore offers a valuable skill to the information professional.
Slide32
Accessing DIALOG
- On the web, go to http://www.dialogclassic.com/
- Enter username and password.
- Forget about subaccount.
- Then click on logon.
- You are in the classic interface. Let’s hear three cheers for being old-fashioned.
Slide33
two steps in DIALOG
- Step one: select databases (aka files) to look at.
- Step two: perform searches on the selected databases.
- You may wonder why one does not have one single step, like in a search engine. Discuss.
Slide34
sample search
- We want to know something about “current awareness in digital libraries”.
- Let us assume that some material on this is in the ERIC database, and that we know ERIC is database number 1.
- We issue the command "b 1" to begin working with ERIC.
Slide35
Boolean search
- Do a number of searches
  - s current(N)awareness
  - s digital(N)library
  - s digital(N)libraries
- Each search retrieves a set of documents.
- The sets can be combined
  - s s1 and (s2 or s3)
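The bookkeeping behind these commands can be imitated in Python. This is only a sketch under simplifying assumptions: the four records are invented, and a plain substring test stands in for the (N) connector, which in DIALOG also accepts the reverse word order.

```python
# Hypothetical records, keyed by a made-up record number.
records = {
    101: "current awareness services for digital libraries",
    102: "building a digital library",
    103: "current awareness in special libraries",
    104: "digital libraries and metadata",
}

result_sets = {}

def select(phrase):
    """Save the ids of records containing the phrase under the next set label."""
    label = f"s{len(result_sets) + 1}"
    result_sets[label] = {rid for rid, text in records.items() if phrase in text}
    return label

select("current awareness")  # becomes s1
select("digital library")    # becomes s2
select("digital libraries")  # becomes s3

# Counterpart of "s s1 and (s2 or s3)".
combined = result_sets["s1"] & (result_sets["s2"] | result_sets["s3"])
print(combined)  # {101}
```

Each select leaves a numbered set behind, and the final step works on those sets rather than on the records themselves.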
Slide36
What is the deal?
- There are two stages.
- At stage two we make Boolean queries. Each query splits the records into matching and non-matching records.
- The set of matching records is returned. It can be further searched or combined with other sets using Boolean operators.
- Try this at home.
Slide37
two steps in DIALOG
- Step one: select databases (aka files) to look at.
- Step two: perform searches on the selected databases.
- You may wonder why one does not have one single step, like in a search engine. Discuss.
- Today we concentrate on the second step.
Slide38
working on selected files
- We assume that we have selected a database that we know, and we look at the search interface on the selected database.
- The database selection process is a bit more complicated. It is covered next week.
- First, let us log in and look at the command prompt.
- Then we select the first database (file) with the begin command.
Slide39
the ‘begin’ command
- As its name suggests, it is usually the first command.
- "begin number, number, …" selects the files with the given numbers.
- Once they are selected, they can be searched.
- Now select ERIC with "begin 1".
- "begin 1" can be abbreviated as "b 1".
Slide40
substeps in the second step
- Identify search terms.
- Use Dialog basic commands to conduct a search.
- View records online or print the results.
Slide41
the 's' (select) command
- Once we have issued the "begin" command to select a database, we issue the "s" command on the database.
- The form is "s query_expression", where query_expression is a query expression.
- This searches the index of the selected database, in full-text view, for the query issued.
- It will not find any of the following: "an and by for from of the to with". They are stop words.
Slide42
query expression
- A query expression contains search terms expressed in special ways.
- You can truncate search terms.
- You can build an elementary expression by putting several keywords together. This is achieved by DIALOG's connectors.
- You can combine several expressions with the use of Boolean operators.
- We will cover these in turn now.
Slide43
truncation of terms I
- Open truncation
  - "select path?" retrieves all words that begin with path: paths, pathos, pathway, pathology
- Controlled-length truncation
  - "select path??" retrieves the root and up to two additional characters: paths, pathos
Slide44
truncation of terms II
- Embedded character truncation can be used for variant spellings:
  - "select organi?ation" -> organization organisation
  - "select fib??board" -> fiberboard fibreboard
- This truncation feature is also useful for searching for unusual plural forms:
  - "select wom?n" -> woman women
- Apparently you can also do prefixes by putting the ? at the beginning.
  - "?mobile" -> automobile metamobile
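The truncation rules can be mimicked with regular expressions. The translation below follows only the behaviour described on these two slides and is an approximation, not DIALOG's actual implementation: a single ? at either end is read as open truncation, a run of two or more ? at an end as "up to that many characters", and each embedded ? as exactly one character. The function names are hypothetical.

```python
import re

def truncation_to_regex(term):
    """Translate a truncated search term into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(term):
        if term[i] != "?":
            parts.append(re.escape(term[i]))
            i += 1
            continue
        # Measure the run of consecutive question marks.
        j = i
        while j < len(term) and term[j] == "?":
            j += 1
        run = j - i
        at_edge = i == 0 or j == len(term)
        if at_edge and run == 1:
            parts.append(r"\w*")              # open truncation
        elif at_edge:
            parts.append(r"\w{0,%d}" % run)   # controlled-length truncation
        else:
            parts.append(r"\w{%d}" % run)     # embedded: one character per ?
        i = j
    return re.compile("".join(parts), re.IGNORECASE)

def matches(term, word):
    """True if the whole word matches the truncated term."""
    return truncation_to_regex(term).fullmatch(word) is not None

print(matches("path?", "pathology"))   # True
print(matches("path??", "pathway"))    # False
print(matches("wom?n", "women"))       # True
```

Trying the slide examples against this sketch is a quick way to check that you have understood which words each pattern should retrieve.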
Slide45
use of connectors
- Connectors are used to put several words together.
- One instance where this is useful is when you have words that on their own mean different things.
- For example, "mate" is a herbal beverage consumed in South America. Looking for mate on the Internet retrieves a lot of singles' pages.
Slide46
example: terms related to "mate"
- What other terms could be used?
  - matear (drink mate)
  - matero (mate drinker)
  - cebar (prepare mate)
  - cebador (mate preparer)
  - yerba (mate herb)
  - bombilla (mate straw)
Slide47
connectors I
- '(W)' requires the terms to appear next to each other, in the order given, e.g. 'yerba(W)mate?' matches "yerba mate".
- '(iW)', where i is an integer, means followed within at most i words, e.g. 'ceba?(3W)mate?' matches "cebar un maravilloso mate" but not "cebador guapo mirando un buen mate".
Slide48
connectors II
- '(N)' requires the terms to be next to each other, in either order, e.g. 'yerba(N)mate?' matches "yerba mate" or "mate yerba".
- '(iN)', where i is an integer, means proximity within at most i words, e.g. 'ceba?(3N)mate?' matches "cebar mate" or "matear con la cebadora".
- '(S)' searches for the occurrence of the connected terms in the same paragraph.
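The (W) and (N) connectors can be sketched as a word-distance test. The Python below is a toy model under stated assumptions: "at most i words" is read as at most i words in between, only trailing open truncation is supported, and the function names are made up. It reproduces the slide examples but makes no claim about how DIALOG evaluates connectors internally.

```python
def term_matches(pattern, word):
    """Handle only trailing open truncation: 'ceba?' matches words starting with 'ceba'."""
    if pattern.endswith("?"):
        return word.startswith(pattern[:-1])
    return word == pattern

def proximity(text, first, second, max_between=0, ordered=True):
    """True if both terms occur with at most max_between words separating them.

    ordered=True models (W): first must come before second.
    ordered=False models (N): either order is accepted.
    """
    words = text.lower().split()
    for i, w in enumerate(words):
        if not term_matches(first, w):
            continue
        for j, v in enumerate(words):
            if i == j or not term_matches(second, v):
                continue
            if abs(j - i) - 1 <= max_between and (j > i or not ordered):
                return True
    return False

# ceba?(3W)mate?
print(proximity("cebar un maravilloso mate", "ceba?", "mate?", 3))           # True
print(proximity("cebador guapo mirando un buen mate", "ceba?", "mate?", 3))  # False
# ceba?(3N)mate?
print(proximity("matear con la cebadora", "ceba?", "mate?", 3, ordered=False))  # True
```

The only difference between the two connector families in this model is the ordered flag.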
Slide49
using Boolean operators
- In your query, you can combine several expressions with Boolean operators.
- Example: "S LIBRARY(W)SCHOOL? AND DISTANCE(W)EDUCATION"
- But I usually do not issue such fancy queries.
Slide50
executing several searches
- Several searches can be done sequentially, and the result sets are saved by the system.
- Each time, the system assigns a set number, Si.
- These can be combined in Boolean expressions, e.g. 's S1 or S2 and S3'.
- Remember that Boolean operations are set-theoretic!
Slide51
Boolean operators on sets
- When using Booleans, be aware that "and" has higher precedence than "or". Thus
  - a or b and c
  - is not the same as (a or b) and c
  - but it is the same as a or (b and c)
- Use parentheses when in doubt.
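Python's set operators happen to follow the same convention, with & (intersection) binding more tightly than | (union), so the difference can be checked directly. The three sets are arbitrary illustrations.

```python
a = {1, 2}
b = {2, 3}
c = {3, 4}

unparenthesized = a | b & c   # read as a | (b & c)
or_first = (a | b) & c
and_first = a | (b & c)

print(unparenthesized)  # {1, 2, 3}
print(or_first)         # {3}
print(and_first)        # {1, 2, 3}
```

The unparenthesized form agrees with the "and first" grouping and retrieves far more than the "or first" grouping.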
Slide52
DS (display sets)
- This command can be executed at any time to review the sets that have been formed since the last B (begin) command.
- This can be useful to review your search history.
Slide53
the target command
- "target set", where set is a search result set, creates a subset of the "statistically most relevant results" in the original set.
- I have not seen details about how this subset is computed.
- A new result set is formed.
Slide54
display: the type command
- type set/format/range
  - set is a result set
  - format is a format
  - range can be
    - start-end, where start is the record number to start at and end is the record number to end at
    - all
Slide55
standard delivery formats
- 2 -- full record except abstract
- 3 or medium -- citation
- 5 or long -- full except full text
- 6 or free -- title and dialog number
- 8 or short -- title plus indexing terms
  - useful to find other indexing terms
- 9 or full -- everything
- KWIC or K -- keywords in context
Slide56
options for delivery
- I once tried to email results to myself, to no avail.
- You can save the HTML of the search results in the browser.
- You can print the results within the browser.
Slide57
http://openlib.org/home/krichel
Thank you for your attention!