UMass/Boston
Dept. of CS, UMass/Boston
Boston, MA 02125-3393
819-376-3691
drinf@cs.umb.edu

UMass/Boston & Microsoft Research
Dept. of CS, UMass/Boston
Boston, MA 02125-3393
617-354-6460
poneil@cs.umb.edu

Research for this paper was supported by NSF Grant 97-11374 at UMass/Boston and by Microsoft.

1. Introduction

The bit-sliced index was originally defined in [OQ97]. A large body of term-matching algorithms exists in the Information Retrieval (IR) field, also called the Text Retrieval field. The algorithm we introduce, called the Bit-Sliced Term Matching (BSTM) algorithm, uses basically the same approach as an efficient IR algorithm, the Perry-Willet Term Matching algorithm [PW83], and has the advantage that it depends only on what we propose as native operations for a DBMS. Some term-matching algorithms in IR use more complex metrics than ours (see [MZ96, KZS99] for example), weighting term matches higher for terms that are relatively infrequent. In the concluding section we explain how our approach can be generalized to such more complex metrics. Searching documents for terms in this way is of great interest, of course, as evidenced by database vendor products such as the Oracle ConText Cartridge and the DB2 Extender product named Text Extender.

To illustrate, and to provide a statement that retrieves the top k documents in terms of a count of valid equality restrictions, we modify Query (1.1) to the following form (1.2), valid in Microsoft SQL Server but not in SQL-99:

    SELECT TOP 10 PRID, COUNT(*) AS CT                      (1.2)
    FROM (SELECT PRID FROM T WHERE COL_1 = const_1
          UNION ALL
          SELECT PRID FROM T WHERE COL_2 = const_2
          UNION ALL . . .
          SELECT PRID FROM T WHERE COL_M = const_M) AS X
    GROUP BY PRID
    ORDER BY CT DESC

Note that the final clause, "ORDER BY CT DESC," together with TOP 10 guarantees that the top 10 documents with maximum counts of matches are retrieved.

We note in passing that classic SQL syntax is unable to find the rows with the largest number of matches from a given set of conditions; Gerard Salton pointed out this shortcoming some years ago [SALT89]. The problem is that there is no Boolean predicate that deals with a count of matches over subsets of conditions. If a query provides a list of M keyword terms, and we wish to find the k rows with the largest number of matches using classical SQL, we would need to perform a query to look for rows matching all M keywords, then multiple queries to look for rows matching any subset of M-1 of the keywords, and so on, possibly ranging down to subsets whose size is a large fraction F of M, before we find k rows with the maximum number of keyword matches. The number of distinct queries required grows combinatorially: for F = M/2 it is already on the order of the binomial coefficient C(M, M/2). With the addition of the newer UNION ALL and TOP k clauses, Query (1.2) provides the appropriate syntax to perform maximal term matching. The TOP k clause is not in the SQL-99 document, but it is implemented in a number of commercial products.

In Section 2 we present previously published fundamental concepts of bitmaps and bit-sliced indexes. In Section 3 we present our algorithms for BSI arithmetic, which apply to multiset operations such as UNION ALL, and Section 4 introduces our algorithm to retrieve the k rows with the largest number of terms matching a list of terms given by a query Q. In Section 5 we explain the comparable algorithm used in IR and discuss aspects of its performance relative to ours. Section 6 provides experimental results comparing the two algorithms, both of which we have implemented. Finally, Section 7 presents our conclusions and plans for future work, including a description of how our approach extends to weighted term matching.

2. Bitmaps and Bit-Sliced Indexes

We review a number of previously published concepts below [ON87, OQ97]. To create a bitmap index, all N rows of the table T must be assigned ordinal numbers: 1, 2, . . ., N, called Ordinal row-positions. For an index value X, the list of rows in T that have the value X can be represented by a list of Ordinal positions such as: 4, 7, 11, 15, 17, . . ., or equivalently by a verbatim bitmap 00010010001000101 . . .. Bitmaps that are sparse (having a small number of 1's relative to 0's) will be compressed rather than stored verbatim. Variations on this bitmap index definition are studied in [CHI98, CHI99, WUB98, WU99].
Ordinal row-positions 1, . . ., N can be assigned to table pages in fixed-size blocks of size J: 1 through J on the first page, J+1 through 2J on the second page, etc., where J is the maximum number of rows of T that will fit on a page (the maximum occurs for the shortest rows). This makes it possible to determine the zero-based page number pn for a row with Ordinal position n by the formula pn = (n-1)/J (integer division). A known page number can then be accessed efficiently, especially when long extents of the table are mapped to contiguous physical disk pages. Since longer rows might lead to fewer rows on a page, some pages will have no rows for the larger Ordinal numbers assigned to them. For this reason, an Existence Bitmap (EBM) is maintained for the table, containing 1 bits in Ordinal positions where rows exist, and 0 bits otherwise. The EBM can also be useful when rows are deleted.

It is a common misconception that every row-list in a bitmap index must be carried in verbatim bitmap form. In reality, some form of compression is always used for sparse bitmaps (although verbatim bitmaps are preferred down to a relatively sparse ratio of 1's to 0's such as 1/50, because operations on verbatim bitmaps are more CPU efficient than operations on compressed forms). In the architecture implemented for this paper, bitmap compression simply involves converting a sparse bitmap into an ordered list of Segment-Relative Ordinal RIDs, or ORDRIDs (defined below).

We first describe Segmentation, which was introduced in [ON87]. We break the rows of table T into equal-size blocks, called Segments following MODEL 204 nomenclature, so that the bitmap fragment for the set of rows in each block will fit on a single disk page. Our architecture uses 4-KByte disk pages, so Segments contain S = 8*4000 = 32,000 rows. (We use S = 32,000 as a round estimate; the true number is larger, but not quite 2^15 = 32,768.)

A B-tree index entry for value X in the architecture has the format shown in Figure 1.

    X | Seginfo | Seginfo | . . . | Seginfo

    Figure 1. B-tree index entry for value X.

The entry in Figure 1 can grow to the length available on the B-tree leaf page where it resides, and another entry with the same index value X can follow on a later leaf page if more Segments make it necessary. Each Seginfo block in Figure 1 is shown in Figure 2 to contain a Segment number (Seg_no) for the Segment of rows it represents, and a disk pointer (DKPTR) to the Bitmap or ORDRID-list for that Segment (see the next paragraph for a description of an ORDRID-list). The Seginfo blocks appear in order by Seg_no, and if a Segment contains no row for value X, then the Seginfo block for that Segment will be missing from the entry of Figure 1. (This fact can be used at an early stage of queries involving conjunctions, to skip Segments that have no Seginfo block.)

Since the S bits of a bitmap must fit on a 4-KByte page, S < 2^15, and a Segment-Relative ORDRID will fit in two bytes. In what follows we refer to a Segment-Relative ORDRID simply as an ORDRID. This short length provides a significant advantage in space and I/O speed during a range search. An ORDRID value k in Segment m can be translated into a Table-Relative position t by the formula t = m * S + k. The ORDRID-lists for the Segments of an index entry (pointed to by DKPTR in Figure 2) are also stored in order on disk, and fit on a page. If the dividing line between sparse bitmap and ORDRID-list occurs at a bit density of 1/50, then an ORDRID-list will take up at most 16/50 of a page, and a B-tree leaf page holds at least three entries. ORDRID-lists use a continuum of pages (not intermixed with index B-tree pages or Bitmaps) for fastest ordered access, ordered by index value and Segment number, that is: X || Seg_no.
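The two address computations above are simple integer arithmetic. A minimal C sketch follows; J is a made-up value for illustration, while S is the Segment size from the text:

    /* Sketch of the address computations described above. Ordinal
       positions are 1-based; page numbers are zero-based. */
    #define J 100      /* hypothetical: max rows of T fitting on a page */
    #define S 32000    /* Segment size used by the architecture         */

    long page_of(long n)           { return (n - 1) / J; }  /* pn = (n-1)/J */
    long table_pos(long m, long k) { return m * S + k;   }  /* t = m*S + k  */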
The DKPTR used to address ORDRID-lists has the same format used in row addressing, consisting of (Disk Page #, Slot #), where the Slot # addresses an offset directory entry that locates the ORDRID-list on its page. Note that when we refer to a Bitmap index, this is a generic name meaning that Bitmaps are one form of representation used; it does not mean that every row set is a Bitmap: it may be a verbatim Bitmap or an ORDRID-list, or a segment-by-segment combination of the two forms, whichever is most appropriate based on the density of rows for that value in the given segment. Similarly, when we speak of a Bitmap in a Bitmap index, an ORDRID-list might be the actual representation; we do not distinguish between bitmap and ORDRID-list when the difference is immaterial.

Operations on Bitmaps. Pseudo-code for AND, OR, NOT, and COUNT on bitmaps was provided in [OQ97]. Given two verbatim bitmaps B1 and B2, we can create the bitmap B3 = B1 AND B2 by treating each memory-resident Segment of these bitmaps as an array of long ints in C, and looping through the fragments, setting B3[I] = B1[I] & B2[I]. The logic runs through successive Segment fragments of B1 and B2 and assigns the results to B3, until the operation is complete. The bitmap B3 = B1 OR B2 is computed in the same way, and B3 = NOT B1 is computed by setting B3[I] = ~B1[I] & EBM[I] in the loop. Note that the efficiency of bitmap operations comes from a type of parallelism: many bits are operated on at once, in the simplest possible loop. To find the number of rows represented in a bitmap, a trick is used: the bitmap fragment to be counted is overlaid with a short int array, and the loop then uses the short ints as indexes into another array containing the number of 1 bits in each possible short int value, aggregating these counts into a total.

We perform logical AND and OR on two ORDRID-lists B1 and B2 by passing through the two lists in order to perform a merge-intersect or merge-union into an ORDRID-list B3; in the case of a union dense enough to require conversion to a bitmap, an easy case to recognize, this is easily done by initializing a zero Bitmap for the Segment and turning on the bits found in the union. The NOT operation on a Segment list B1 is performed by copying the EBM Segment and turning off the bits corresponding to ORDRIDs found in B1. To combine a verbatim bitmap B1 in one Segment representation with an ORDRID-list B2 in another, the ORDRID-list is assumed to have fewer elements and efficiently drives the loop to access individual bits in the bitmap and perform the AND, filling in a result list; the OR is performed by copying the verbatim bitmap B1 to B3 and turning on bits from the ORDRID-list B2.

Bit-Sliced Indexes. A bit-sliced index, referred to below as a BSI, is an ordered list of bitmaps used to represent the values (normally integers) of some column C, although the column C might be a calculated value associated with the rows of T and have no permanent existence. The bitmaps B^0, B^1, . . ., B^S are called bit-slices, and provide binary representations of C values: B^0 holds the 1's bits, B^1 holds the 2's bits, B^2 holds the 4's bits, and so on. More precisely, if we represent the C value of row j (Ordinal position j) by C[j], and the bit for row j in bit-slice B^i by B^i[j], then the bit-slices are chosen so that

    C[j] = SUM(i = 0 to S) B^i[j] * 2^i.

Note that we determine S in advance so that the highest-order bit-slice B^S is non-empty, i.e., it has at least one bit on.

In [OQ97], a bit-sliced index was also defined to contain a bitmap B^nn representing the set of rows with non-null values in column C, and a Bitmap B^n representing the set of rows with null values (the redundancy, B^n = NOT(B^nn) AND EBM, provided extra efficiency). We will not be using these in the BSI's of the current paper, however, since we will only be dealing with calculated values defined for all rows in T.

3. BSI Arithmetic

In this section we demonstrate how we can perform arithmetic on bit-sliced indexes, using operations on each of the bit-slices; until now, such bit-slice operations have been used for purposes such as range predicate evaluation. Consider Figure 3, where each of the bitmaps B1, B2, and B3 represents the found set of one of three subqueries combined with UNION ALL clauses in a SQL Query.
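To make the Segment-fragment loops and the COUNT trick concrete, here is a small C sketch; the fragment size WORDS and the type names are our assumptions, not code from the paper:

    /* A sketch of verbatim-bitmap operations on one memory-resident
       Segment fragment, treated as an array of long ints as in the text. */
    #include <stddef.h>

    #define WORDS 500                  /* 32,000 bits / 64 bits per long */

    typedef unsigned long Seg[WORDS];

    void seg_and(Seg b3, const Seg b1, const Seg b2) {
        for (int i = 0; i < WORDS; i++) b3[i] = b1[i] & b2[i];
    }
    void seg_or(Seg b3, const Seg b1, const Seg b2) {
        for (int i = 0; i < WORDS; i++) b3[i] = b1[i] | b2[i];
    }
    void seg_not(Seg b3, const Seg b1, const Seg ebm) {
        for (int i = 0; i < WORDS; i++) b3[i] = ~b1[i] & ebm[i];
    }

    /* COUNT via the short-int lookup trick: overlay the fragment with a
       short int array and look up the 1-bit count of each short value. */
    static unsigned char ones[1 << 16];    /* ones[v] = # of 1 bits in v  */

    void init_ones(void) {                 /* call once before seg_count  */
        for (unsigned v = 1; v < (1u << 16); v++)
            ones[v] = ones[v >> 1] + (v & 1);
    }

    long seg_count(const Seg b) {
        const unsigned short *s = (const unsigned short *)b;
        long n = 0;
        for (size_t i = 0; i < WORDS * sizeof(long) / 2; i++) n += ones[s[i]];
        return n;
    }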
How are we to calculate and then represent the multiset of rows that results? If we could somehow add the bitmaps in the three rows to generate the SUM of the bottom row, we would have what we need.

    (Figure 3. Three bitmaps B1, B2, and B3, and their row-by-row SUM, whose entries range from 0 to 3.)

Of course the SUM on the bottom row of Figure 3 cannot be represented as a bitmap, since it has values other than 0 and 1. It can, however, be represented as a bit-sliced index! We need to ask ourselves how we might be able to add three bitmaps to arrive at the BSI SUM of Figure 3. A few observations put the problem in perspective. First, it is easy to add two bitmaps to arrive at a BSI sum. Second, a bitmap is just a BSI with a single bit-slice. This leads us to ask if we can find an algorithm to add any two BSI's, and indeed this turns out to be possible. Consider adding the bitmaps B1 and B2 of Figure 3 to arrive at a BSI named BS. Clearly BS must have two bit-slices, BS^1 and BS^0, since we need to represent the values 0, 1, and 2. We point out that BS can be generated quite simply with two Boolean operations: BS^1 = B1 AND B2; BS^0 = B1 XOR B2. This calculation, along with the SUM of B1 and B2 for comparison, is illustrated in Figure 4.

    (Figure 4. The bitmaps B1 and B2, the bit-slices BS^1 and BS^0, and the row-by-row SUM, whose entries range from 0 to 2.)

We note in Figure 4 that interpreting the two bit-slices of BS as binary numbers gives exactly the values represented in the SUM row. The reason is clear. The operation BS^1 = B1 AND B2 sets BS^1 to 1 in precisely those positions where both B1 and B2 contain 1, and where therefore SUM = 2. BS^0 = B1 XOR B2 sets BS^0 to 1 in precisely those positions where one or the other of B1 and B2 contains 1, but not both, and thus where SUM = 1. All other positions in BS are 0. Generalizing from adding bitmaps to adding arbitrary BSI's leads to Algorithm 3.1, which performs bit-slice addition in close analogy to binary integer addition by hand.

Algorithm 3.1. Addition of BSI's. Given two BSI's, A = A^S . . . A^1 A^0 and B = B^P . . . B^1 B^0, construct a new sum BSI, S = A + B, using the following pseudo-code. We must allow the highest-order slice of S to be S^(MAX(S,P)+1), so that a carry out of the highest bit-slice in A or B will have a place to go.

    S^0 = A^0 XOR B^0      -- bit on iff exactly one bit on in A^0 or B^0
    C = A^0 AND B^0        -- C is "Carry" bit-slice; bit on iff bits on in A^0 and B^0
    for (i = 1; i <= MIN(S, P); i++) {    -- while there are further bit-slices in both A and B
        S^i = A^i XOR B^i XOR C           -- one or three bits on gives bit on in S^i
        C = (A^i AND B^i) OR (C AND (A^i OR B^i))    -- two or three bits on gives carry
    }
    if (S > P)             -- if A has more bit-slices than B
        for (i = P+1; i <= S; i++) {
            S^i = A^i XOR C      -- note C might be zero!
            C = A^i AND C
        }
    else                   -- P >= S
        for (i = S+1; i <= P; i++) {
            S^i = B^i XOR C      -- one bit on gives bit on in S^i; note that C might be zero!
            C = B^i AND C
        }
    if (C is non-zero)     -- if still non-zero Carry after A and B bit-slices end
        S^(MAX(S,P)+1) = C -- put Carry into final bit-slice of S

A "Carry" bit-slice C can arise in Algorithm 3.1 whenever two or three bit-slices are added to form S^i and a non-zero C results. Note that if C is empty (no bits on), the Boolean operations still give the expected results, but a flag for a zero C can speed up the operation. Once the bit-slices in either A or B run out, calculations of C are likely to result in an empty carry after a few more slices.
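Algorithm 3.1 can be rendered compactly at word granularity. The following C sketch is ours, not the paper's: each bit-slice is shrunk to a single 64-bit word, so the per-Segment loop is elided, and main() runs a Figure-4-style example with made-up bitmaps:

    /* Add BSI b (slices b[0..p]) to BSI a (slices a[0..s]); sum needs room
       for max(s,p)+2 slices. Returns the number of slices in the sum. */
    #include <stdio.h>

    int bsi_add(const unsigned long *a, int s, const unsigned long *b, int p,
                unsigned long *sum) {
        unsigned long carry = 0;
        int hi = (s > p ? s : p), i;
        for (i = 0; i <= hi; i++) {
            unsigned long ai = (i <= s) ? a[i] : 0;   /* missing slice = 0 */
            unsigned long bi = (i <= p) ? b[i] : 0;
            sum[i] = ai ^ bi ^ carry;                 /* one or three bits on */
            carry  = (ai & bi) | (carry & (ai | bi)); /* two or three bits on */
        }
        if (carry) sum[i++] = carry;    /* leftover Carry gets a new slice */
        return i;
    }

    int main(void) {
        unsigned long b1[] = {0xB};     /* rows 0,1,3 */
        unsigned long b2[] = {0x9};     /* rows 0,3   */
        unsigned long s[3];
        int n = bsi_add(b1, 0, b2, 0, s);
        printf("slices=%d s1=%lx s0=%lx\n", n, s[1], s[0]); /* s1=9, s0=2 */
        return 0;
    }

Rows 0 and 3 appear in both bitmaps and get sum 2 (bit on in s1 only); row 1 appears once and gets sum 1 (bit on in s0 only), matching the BS^1 = AND, BS^0 = XOR observation.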
Negative Numbers in a BSI. We provide an algorithm below for subtracting one BSI from another, but first we explain how to represent negative numbers in a BSI, using two's complement notation. If we're working with a collection of BSI's containing only non-negative numbers (the only kind we've discussed up to now), and the largest positive number that can be represented is 7, then only three bit-slices are needed. But negative numbers can still arise during subtraction, and these must be distinguished in two's complement arithmetic by a leading 1 bit, compared to a leading 0 bit for non-negative numbers, so that four bit-slices are needed to represent the binary values involved. The two's complement of 7 is determined by flipping the bits of 0111 and adding 1: thus -7 is 1001. Now to perform the subtraction "-7 - (+7)", we will require yet another high-level bit-slice (5 bit-slices in all). We sign-extend both quantities as we add high-order bits; then we flip the bits of the right-hand term and add 1, to get 11001 + 11001 = 10010 (that is, -14) once the carry out of the high position is discarded.

Whenever two different BSI's are to be subtracted, any BSI representing only positive numbers must first have a bit-slice of all zeros adjoined. (This will afterward be used in sign-extension.) Then a high-order bit-slice must be adjoined to the BSI with the maximum number of bit-slices, to handle overflow during subtraction. Following this, the BSI with the minimum number of bit-slices must be brought up to the maximum number. All BSI's must be sign-extended as bit-slices are adjoined: this is done by copying the most significant bit-slice into the new high-order bit-slices.

We explained in the Introduction how algorithms to add bit-sliced indexes can be used to determine the row multiplicities arising from UNION ALL clauses in current native SQL. Similarly, we can create queries where multiplicities are subtracted (EXCEPT ALL), or where the minimum multiplicity of duplicates is determined (INTERSECT ALL), using Algorithms 3.2 and 3.3 below. Note that for EXCEPT ALL, any negative numbers in the result BSI D must be replaced with zeros, since rows do not appear with negative multiplicities; the algorithm to do this finds all rows with the high-order (sign) bit on in D, and masks this set out of all bit-slices. Similarly, MIN(A, B) will not need to be concerned with rows that have negative multiplicities.

Algorithm 3.2. Subtraction of BSI's. Given two BSI's A = A^S . . . A^0 and B = B^P . . . B^0, we create a new difference BSI D = A - B by taking the two's complement of B and adding it to A using Algorithm 3.1. We adjoin bit-slices as specified in the paragraph above, and for simplicity we assume that A, B, and D all end up with MAX(S,P)+2 bit-slices.

    Add needed bit-slices to A and B,       -- for 2's complement subtraction . . .
    sign-extending A and B if necessary     -- . . . allow for MAX(S,P)+2 bit-slices in D
    for (i = 0; i <= MAX(S,P)+1; i++)
        B^i = NOT(B^i) AND EBM              -- one's complement complete
    D = A + (B + (all 1's bitmap))          -- use Algorithm 3.1; B + all 1's bitmap is the 2's complement

Algorithm 3.3. Min of BSI's. Given two BSI's A = A^S . . . A^0 and B = B^P . . . B^0, create a new "min" BSI M = MIN(A, B). The following pseudo-code handles only non-negative values. To handle negative as well as positive numbers, we would first partition the rows according to their sign slices; we would then use the pseudo-code below to find MIN(A, B) for the bitmap set of rows where both values are non-negative, corresponding code for the set where both are negative, and handle the remaining mixed-sign rows directly. M needs only MIN(S, P)+1 bit-slices, since the minimum of the two numbers represented in row r of A and B cannot have more binary digits than the shorter of the two. We assume in the loop below that S >= P (if not, we reverse the roles of A and B).

    K = empty set      -- bitmap K of rows for which we know the min
    K_A = empty set    -- bitmap of rows for which A has the lesser value
    K_B = empty set    -- bitmap of rows for which B has the lesser value
    for (i = S; i > P; i--) {           -- recall that S >= P; loop is empty if S == P
        K_B = K_B OR (A^i AND NOT(K))   -- a 1-bit in a slice B lacks means B is lesser
        K = K OR A^i
    }
    for (i = P; i >= 0; i--) {          -- loop down to zero
        X = (A^i XOR B^i) AND NOT(K)    -- undecided rows where A and B differ in bit i
        K_B = K_B OR (A^i AND X)        -- if A has the 1-bit, the new min must be in B
        K_A = K_A OR (B^i AND X)        -- if B has the 1-bit, the new min must be in A
        K = K OR X                      -- new min rows found in this pass
    }
    -- any rows still not in K are equal in A and B
    K_A = K_A OR (EBM AND NOT(K))       -- ties: take the value from A
    for (i = 0; i <= P; i++)            -- loop to set BSI M using the known K_A and K_B
        M^i = (A^i AND K_A) OR (B^i AND K_B)   -- values from A for rows in K_A; from B for the (disjoint) K_B
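A compact word-granularity sketch of Algorithm 3.2's subtraction (ours, under the same single-word-per-slice simplification as the earlier addition sketch): the "+1" of the two's complement can be folded in by seeding the carry slice with the EBM, which adds 1 to every existing row.

    /* D = A - B in n-slice two's complement, one 64-bit word per slice.
       A and B are assumed already sign-extended to n slices, with n large
       enough to hold the result; ebm is the Existence Bitmap fragment.
       Carries past slice n-1 are discarded, as usual in fixed-width
       two's-complement arithmetic. */
    void bsi_subtract(const unsigned long a[], const unsigned long b[], int n,
                      unsigned long ebm, unsigned long d[]) {
        unsigned long carry = ebm;            /* the "+1", for every row    */
        for (int i = 0; i < n; i++) {
            unsigned long nb = ~b[i] & ebm;   /* one's complement, slice i  */
            d[i]  = a[i] ^ nb ^ carry;        /* full-adder sum             */
            carry = (a[i] & nb) | (carry & (a[i] | nb));  /* full-adder carry */
        }
    }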
4. The Bit-Sliced Term-Matching (BSTM) Algorithm

We now introduce the BSTM Algorithm. We are given a query Q with a list of keyword values, Q = keyword-1, keyword-2, . . ., keyword-|Q|, to match against the keyword column K of an object-relational table T. We wish to find the set of k rows that have the largest number of matching keywords with the query list Q. Denote by Bi the bitmap of rows of table T that contain keyword-i in their column K; these bitmaps will occur as entries of an index KX on K. It is our task to find the k rows which have the largest number of matching 1's among all the bitmaps B1, B2, . . ., Bm. We use Algorithm 3.1 to ADD these bitmaps, resulting in a BSI SUM. All that remains is to find the Ordinal positions of the rows with the k largest SUM values. We recall that an algorithm was provided in Section 4 of [OQ97] by which the set of rows in T with C >= c1, C being a value column having a BSI, can be found quite efficiently. In Algorithm 4.1 below we provide a variation of this approach.

Algorithm 4.1. Find the rows with the k largest values in a BSI. Given a BSI S = S^P . . . S^0 over a table T and a positive integer k, determine the bitmap F (the "found set") of the rows with the k largest S-values S(r) in T.

    if (k > COUNT(EBM) or k <= 0)
        Error("k is invalid")        -- if so, exit; otherwise, the kth largest S-value exists
    G = empty set; E = EBM           -- G starts with no rows; E starts with all rows
    for (i = P; i >= 0; i--) {       -- i is a descending loop index for bit-slice number
        X = G OR (E AND S^i)         -- X is the trial set: G OR {rows in E with 1-bit in position i}
        if ((n = COUNT(X)) > k)      -- if there are more than k rows in X
            E = E AND S^i            -- E in the next pass holds only rows r with bit i on in S(r)
        else if (n < k) {            -- if there are fewer than k rows in X
            G = X                    -- G in the next pass gets all rows in X
            E = E AND NOT(S^i)       -- E in the next pass has no rows r with bit i on in S(r)
        }
        else {                       -- n == k
            E = E AND S^i; break     -- exactly k rows; done with the loop
        }
    }                                -- at this point we know that COUNT(G) <= k
    F = G OR E                       -- there might be too many rows in F; check below
    if ((n = COUNT(F) - k) > 0)      -- if n > 0
        turn off n bits of E in F    -- throw out some ties to return exactly k rows

Algorithm 4.1 accomplishes its task in a subtle way, explained in the proof of Theorem 4.2 below.
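Before turning to the proof, here is a direct C rendering of Algorithm 4.1 (our sketch, under the same single-word-per-slice simplification as before, so at most 64 rows); __builtin_popcountl, a GCC/Clang builtin, stands in for the table-driven COUNT, and the tie-breaking step arbitrarily removes the lowest-numbered tied rows:

    /* slice[i] holds bit i of every row's S-value; P is the highest slice
       number; ebm is the existence bitmap; returns the found set F. */
    unsigned long topk(const unsigned long slice[], int P,
                       unsigned long ebm, int k) {
        unsigned long G = 0, E = ebm, X;
        int n;
        if (k <= 0 || k > __builtin_popcountl(ebm)) return 0;  /* invalid k */
        for (int i = P; i >= 0; i--) {
            X = G | (E & slice[i]);        /* trial set                     */
            n = __builtin_popcountl(X);
            if (n > k) {
                E = E & slice[i];          /* bit i of m is on              */
            } else if (n < k) {
                G = X;                     /* bit i of m is off             */
                E = E & ~slice[i];
            } else {
                E = E & slice[i];
                break;                     /* exactly k rows found          */
            }
        }
        unsigned long F = G | E;
        n = __builtin_popcountl(F) - k;    /* ties: drop n rows of E from F */
        while (n-- > 0) {
            unsigned long low = E & (~E + 1);   /* lowest 1 bit of E        */
            F &= ~low; E &= ~low;
        }
        return F;
    }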
Theorem 4.2. Algorithm 4.1 correctly computes F, the bitmap of the rows with the k largest S-values in T.

Proof. Denote by m the minimum S-value of any row that lies in F, and assume m has the binary representation m_P m_(P-1) . . . m_0. Thus m is the kth largest S-value S(r) of any row r in T (with ties, m might also be the k+1st largest, and so on). We do not know m in advance, but we determine successive bits m_i of its binary representation as we progress through passes of the loop in Algorithm 4.1. The variables that persist from one pass to the next are the bitmaps G and E; the bitmap X and the integer n are only temporary, used to hold results within a loop pass.

We claim the following defining properties of G (rows r with S-values Greater than m, as far as the bits examined can tell) and E (rows r with S-values Equal to m in a specific initial sequence of bits), stated as an induction hypothesis on the values G_i and E_i of G and E on entry to the pass for bit position i. For an arbitrary row r in T with binary representation S(r) = r_P r_(P-1) . . . r_0: (1) r is in E_i if and only if r_P . . . r_(i+1) does not differ from m_P . . . m_(i+1) in any bit; (2) r is in G_i if and only if r_P . . . r_(i+1) is greater than m_P . . . m_(i+1), which is equivalent to saying that for some bit position j in the range i+1 <= j <= P, bit r_j is on with bit m_j off, and r_P . . . r_(j+1) are all equal to m_P . . . m_(j+1). (G_i can contain no rows until a zero bit shows up in m_P . . . m_(i+1).)

The initial test of Algorithm 4.1 guarantees that k <= COUNT(EBM), so a row with S(r) = m exists. We enter the first pass of the loop with i = P. G_P is initialized to the empty set and obeys the hypothesis, since i+1 > P and thus there is no value j with i+1 <= j <= P satisfying (2). E_P is initialized to EBM and obeys the hypothesis, since there are no bit positions above P in which a row can differ from m. So the induction hypothesis holds on entry to the first pass.

Note first that every row in G_i (if there are any) has an S-value larger than that of every row in E_i: a row r' in G_i agrees with m (and hence with every row of E_i) in all bits above some position j, and has a 1-bit in position j where the rows of E_i have a 0-bit. At the beginning of the loop pass we set X = G OR (E AND S^i), and we claim that X consists exactly of the rows with the n largest S-values in T, where n = COUNT(X). Indeed, rows r in (E AND S^i) have larger S-values than the other rows of E (those in (E AND NOT(S^i))), since all rows of E have identical bits above position i and the former have bit i on. And any row in neither G_i nor E_i must, at its first point of disagreement with m, have a bit off that is on in m, so its S-value falls below that of every row of X. Thus the rows of X dominate all other rows, as claimed.

Case n > k. Then bit m_i must be on: if m_i were off, then X, the set of the n > k largest S-values in T, would not include the row with the kth largest value m, which is impossible. Since m_i is on, hypothesis (1) requires us to restrict E to rows r with r_i on; the assignment E = E AND S^i establishes (1) for pass i-1. G is unchanged and remains valid under (2), since i is not an appropriate value for j (bit m_i is on, so no new rows can exceed m first at position i).

Case n < k. Then bit m_i must be off: if m_i were on, the row with value m would lie in X, so X would contain at least k rows. Since m_i is off, hypothesis (2) requires us to add to G the rows r with r_P . . . r_(i+1) equal to m_P . . . m_(i+1) and r_i on; the assignment G = X accomplishes exactly this, with j = i. Next, the assignment E = E AND NOT(S^i) restricts E so that all remaining rows r have r_P . . . r_i equal to m_P . . . m_i, establishing (1) for pass i-1.

Case n = k. Then X consists of exactly the k rows with the largest S-values in T, which is what we are seeking. We set E = E AND S^i and break from the loop; on exit, F = G OR E = X, and we will find that COUNT(F) - k = 0. Otherwise we continue through i = 0, and on exit from the loop (with i = -1) we have F = G OR E with COUNT(G) < k and COUNT(F) >= k. But all the S-values of rows in E are now identical (they have the same bit representation as m), so we simply remove COUNT(F) - k rows of E from F to return exactly k rows. This completes the proof.

The BSTM algorithm. We are given an object-relational table T with a keyword column K having a Bitmap index KX. Given a query list Q of keywords from K, and the task of finding the k rows with the largest number of keyword values in Q, we proceed as follows.

Algorithm 4.2. Bit-Sliced Term Matching. (1) Find the bitmaps for all of Q's keyword values in KX, and add them together using Algorithm 3.1 to create a bit-sliced index KS. (2) Apply Algorithm 4.1 to KS to find the set F of the k rows with the largest KS-values.

5. The Perry-Willet Algorithm

Murtagh published a term-matching algorithm in [MURT82] that was cited as best in [SALT89], but in [MURT99] Murtagh cites the Perry-Willet Algorithm [PW83] as an improvement on earlier term-matching algorithms used in IR, including his own. The algorithm in [PW83] is straightforward, and we modify its description slightly to use more modern nomenclature from [MZ96, KZS99]. In the Perry-Willet algorithm, an index I is understood to exist on the keyword terms of all documents. For each keyword term t in I, there is a sequence of document identifiers, possibly having associated weights for the term in the referenced document. (In our nomenclature, the documents are rows in an object-relational table, the index I is an index on the set-oriented keyword column K, and the document identifiers are ORDRIDs. We will ignore weights for the present, assuming below that each term match counts as 1.) The Perry-Willet algorithm uses an accumulator variable A[d] to accrue weighted matches for each term value found in the document d.
A good deal of discussion appears in [MZ96, KZS99] on how accumulators are to be assigned: whether they are allocated dynamically as documents have their first term match, or exist in advance as an array. We assume the latter. ([KZS99] also studies an approach that allocates only enough accumulators for about 2% of the documents; we do not need that refinement here.)

Algorithm 5.1. The Perry-Willet Algorithm. Q is a list of terms to match, I is the term index, and the array A represents the Accumulators: A[d] is the accumulator for document d, with a cell for each of the N documents.

    int A[N]                    -- Array with N cells, initially zero
    for each term t in Q
        for each document d in the term-list for t in I
            A[d] = A[d] + 1     -- unweighted: each term match counts as 1
    find the k largest values A[d]

It should be clear that Algorithm 5.1 is close to being I/O optimal: index information is read from disk exactly once and accumulated efficiently. If the number of terms in Q is |Q|, then assuming that the index I is not cached in memory but disk resident, we will need to perform at least |Q| disk reads (some of them multi-page ones) to retrieve all the document lists of the query terms. It is possible that more than one I/O per term document list will be required, depending on the size of the list, and this argues in favor of long disk reads; the value of multi-page disk reads is well understood in the IR field, and we will assume this consideration gives no advantage to either approach. Data compression is also discussed at length in [KZS99], and we assume that data compression likewise gives no advantage to either approach.

As successive index lists are accessed, the memory behavior of the Accumulators becomes an important consideration. Is it possible to keep the set of accumulators so densely packed that memory cache hits improve performance? Or, at the opposite end of the spectrum, will it be impossible to hold all of them in memory at once, so that some need to reside on disk part of the time? We need to look at document collections of this kind to evaluate these questions.

In [MZ96] the document collection used was a large text collection whose documents vary from around 100 bytes to 2 MB; the authors broke the documents into records of around 1000 bytes each. This resulted in 2054.5 MB of data and an average of 191.4 words per record, with a large set of distinct keyword terms remaining after translation to lowercase. Note that all document words other than a short list of stop words are indexed as keyword terms in index I, with 195,935,531 stored pairs of doc ID and weight (in the weighted case) appearing in term lists. This means that I was over 1 GB in size and was not memory resident, a situation considered typical by [MZ96]. To give a second example, in [KZS99] about 530,000 documents were used in testing, and these were broken into smaller documents of 50-500 bytes each.

From these two examples, we see that cache hits will not be an important performance consideration in accesses to the Accumulators in Algorithm 5.1. On the other hand, it is reasonable to assume that the array of Accumulators will fit in memory for most document collections. In [KZS99], the hardware used for the experiments was a Sparc20 with 385 MB of memory, which easily contained the Accumulator array for the million small documents. The point was also made that memory was sufficient to materialize any term-list of document ID's in full. Since Segmentation is not used, this can be an important simplifying factor that avoids special-case logic.
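A minimal C sketch of Algorithm 5.1 with the preassigned accumulator array assumed above; the postings-list layout (term_docs, term_len) is our own scaffolding, not an IR system's API:

    #include <stdlib.h>

    /* Returns the filled accumulator array, or NULL on allocation failure.
       Weights are ignored, so each match counts 1 as in the text. */
    int *pw_accumulate(int N, int nterms,
                       const int *const term_docs[],  /* doc-ID list per term */
                       const int term_len[]) {        /* its length per term  */
        int *A = calloc(N, sizeof *A);   /* A[d]: accumulator for document d */
        if (A == NULL) return NULL;
        for (int t = 0; t < nterms; t++)          /* one pass per term list  */
            for (int j = 0; j < term_len[t]; j++)
                A[term_docs[t][j]] += 1;          /* accrue one match for d  */
        return A;   /* caller extracts the k largest A[d], e.g. with a heap */
    }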
Comparison to the BSTM Algorithm. We now restate our term-matching approach in the form of Algorithm 5.2, the BSTM Algorithm (a restatement of Algorithm 4.2). As in the Perry-Willet Algorithm, Q is a list of terms to match and I is the index on the terms appearing in column K.

Algorithm 5.2. The BSTM Algorithm, restated. Given a query Q with |Q| terms in its list, we can calculate in advance the maximum number of bit-slice bitmaps, bcount, that can arise in the sum BSI A: bcount = FLOOR(log2(|Q|)) + 1. We initialize the BSI by creating structures known as Segment Anchors for each of the bcount bitmaps that might be accessed. The Segment Anchors look like the index entries of Figure 1, except that the index value X is not needed, and the Seginfo blocks are filled in as the computation proceeds. The outer loop of Algorithm 5.2, on Segments 1 through M, is standard, and the loop on all terms within a Segment performs all the bitmap or ORDRID-list additions implied by the terms of Q. As outlined in Algorithm 3.1, each addition of a term bitmap B into A involves an operation on the low-order bit-slice and a sequence of operations to generate Carries, which are then combined into the upper-level bit-slices A^i.

Whether the operations involve ORDRID-lists or bitmaps is material only for performance considerations. Operating on two ORDRID-lists may gain cache hits over Algorithm 5.1, since the representations are relatively compact. ORDRID-list vs. bitmap operations are in fact identical in kind to the operations used in Algorithm 5.1, where individual entries are acted on one at a time to modify the result. Bitmap vs. bitmap operations, on the other hand, are likely to be more efficient than the corresponding steps in Algorithm 5.1, at least on a per-bit-slice basis, since operating on multiple bits at once is an advantage. Working against this, the possibility of repeating these operations once per bit-slice mitigates against the efficiency of Algorithm 5.2 in comparison to Algorithm 5.1, where all carries are handled in one integer addition. As the number of terms in Q rises, the probability of Carries into higher bit-slices increases in Algorithm 5.2 for later terms, so we would expect the efficiency of Algorithm 5.2 to drop off compared to Algorithm 5.1. Thus our analysis seems to show that Algorithm 5.2 should have an advantage in performance for a small number of terms in Q, but may be at a disadvantage for a large number of terms in Q. The final step of this restatement of Algorithm 5.2 is to find the k rows with the largest A-values, a relatively quick task: in BSTM using Algorithm 4.1, and in PWTM using a heapsort to extract the k largest entries from the Accumulator array.

In the next section, we present our experimental results comparing Algorithms 5.1 and 5.2. One consideration is worth mentioning first, however. Algorithm 5.1 is a special-purpose program that has been implemented by IR practitioners in prototype form. But such a program cannot be expected to perform well as an application running on top of a database, given that massive numbers of statements must be used to retrieve RIDs, a rather heavyweight call interface that detracts from efficiency. Database storage of documents is the approach of Cartridges such as ConText and Extenders such as the DB2 Text Extender [PCW97]. Algorithm 5.2, which is based on BSI arithmetic useful for a large number of query types other than text retrieval, would be natural to implement in native database form. Because of this, we claim that our algorithm has an important practical advantage.

6. Experimental Results

To compare the performance of the Perry-Willet Algorithm 5.1 and the BSTM Algorithm 5.2, we created benchmark tables and queries for our experiments, to measure the two algorithms under varying conditions. The experiments were performed on a 333 MHz Sun Ultra-Sparc-IIi with 128 MB of memory. The design of our benchmark tables is based on some of the collections in [PW83]; these are rather small collections by today's standards, but appropriate for our system configuration. In Table 1, we provide a list of the parameters used in our experiments, along with their values.

    Notational symbol (meaning)     Values used
    N (# rows = # docs)             50,000, 100,000, 200,000, 300,000
    T (# terms/doc)                 40
    (# distinct terms)              10,000
    |Q| (# terms/Query)             5, 10, 20, 30, 40
    (# docs/query term)             500, 1000, 2000, 3000

    Table 1. Benchmark parameters.

Concentrating for the moment on the minimal configuration of Table 1, we have N = 50,000 documents in our smallest table, with T = 40 terms for each document (terms are represented by integers because of limitations in our implementation). This means that the number of term-document pairs contained in index entries is N*T = 2,000,000. Since there are 10,000 distinct terms, the average number of documents per term is 200. The number of documents per term grows linearly with the number of documents: for N = 100,000 we have 400, 800 for N = 200,000, etc.
We generated the terms in each document at random, using a Zipfian 70-30 skew (a common assumption), and then created queries whose terms tended to use the more popular terms, behavior we modeled after [PW83]. We tuned the Zipfian skew of the query-term distribution so that query terms appear in 2.5 times more documents than average document terms: when the average number of documents per term is 200, the average number of documents per query term is 500; for N = 100,000, when the average number of documents per term is 400, the average per query term is 1000.

The number of rows (or documents) N in the tables and the number of terms per Query |Q| are the ranging parameters of Table 1. We ran experiments with all of |Q| = 5, 10, 20, 30, and 40 terms, against implementations of Algorithms 5.1 and 5.2 on the same architecture, for tables of N = 50,000, 100,000, 200,000 and 300,000 rows. Three runs of each case were performed, flushing buffers prior to each run; because the flushing was not perfect, the one outlying elapsed time out of the three was dropped in each case. We gave the Perry-Willet Algorithm 5.1 the same access to our term index that BSTM had, including the advantage of bitmap compression for extremely popular terms. We graph the results in Figures 5 and 6.

    (Figure 5. CPU Times Per Query for the BSTM (Solid Lines) and PWTM (Dashed Lines) Algorithms, for |Q| = 5 to 40.)

    (Figure 6. Elapsed Times Per Query for the BSTM (Solid Lines) and PWTM (Dashed Lines) Algorithms.)

The CPU times of the two algorithms are very similar. Elapsed time varies more than in Figure 5: BSTM has a slight advantage over PWTM for smaller numbers of terms and is at a disadvantage for larger numbers.

We take two lessons from these results. First, the BSTM algorithm is comparable in performance with the PWTM algorithm. Second, the performance of the BSTM algorithm degrades as the number of Query Terms increases, which we take to be an artifact of the increasing number of Carries required; even so, its performance for 30 or more terms is still comparable to PWTM.

7. Conclusions and Future Work

In the current paper we have shown, for any BSI's X and Y on a table T, how to efficiently generate new BSI's Z, V, and W, such that Z = X + Y, V = X - Y, and W = MIN(X, Y). We claim that BSI arithmetic of this kind is the most straightforward way to determine the multisets of rows (with duplicates) resulting from UNION ALL (addition), EXCEPT ALL (subtraction), and INTERSECT ALL (min). Another contribution of the paper has been to demonstrate how to determine the top k BSI-valued rows, for any meaningful value of k between one and the total number of rows in T. Together with BSI addition, this has allowed us to solve a common problem of text retrieval: an efficient algorithm to find the k documents that match the largest number of terms in a query.

One avenue of future work is suggested by the fact that our approach can be extended by automatically pipelining intermediate results in expressions on multiple BSI's. For example, when we add the three bitmaps of Figure 3 as B1 + B2 + B3, instead of computing S = B1 + B2, writing S out to disk, and later reading it back in to compute S + B3, we can maintain S in memory to be consumed by the immediate addition of B3. Pipelining was implemented in our architecture for the special case of bitmap addition used in Term Matching, but a more general facility would be valuable. It is also worth investigating how BSI arithmetic can be parallelized: multi-Segment bitmaps can have their Segments partitioned out to be dealt with by different processes, and database practice has long provided pipelining and partitioned parallelism of this kind.

Concurrency control to support bitmap indexing and BSI arithmetic presents special problems unless the update frequency is reasonably limited. Our architecture has used a form of concurrency control based on locking for a number of years. But a valuable future task would be to provide a new form of multi-version concurrency (see the Snapshot Isolation discussion in Section 4.2 of [BBG+95]). An avenue which seems feasible would be to implement Snapshot Isolation so as to run queries as efficiently as possible, trading efficient queries against less efficient update transactions when necessary.
Weighted Term Matching. Up to now we have dealt only with unweighted matches; we now describe how our approach could be extended to weighted term matching. The IR field has an extremely large number of approaches to evaluating document queries. In [ZM98] eight different formulations are listed (in Table 1 of that paper) of similarity measures between a query and a document, among them two forms of "inner product" and the "cosine measure". Each of these measures depends on the weights assigned to matching terms, and in Table 2 of [ZM98], nine different weight functions are listed, including "binary match", "normalized", and four based on noise and entropy. The simple non-weighted approach we have been dealing with up to now uses the simpler form of the "inner product" measure, with the "binary match" weight function.

To illustrate how our approach can deal with weighted matches, we consider the cosine measure between a document d and a query q that was used exclusively in [KZS99]. We use the following notation:

    f_d,t      The number of occurrences (frequency) of the term t in document d.

    w_d,t = log_e(1 + f_d,t)
               Weight of term t in document d. Note that if t doesn't appear in d
               then the weight is zero; the more times t appears in d, the more the
               weight grows.

    f_q,t      The frequency of term t in the query q; f_t denotes the number of
               documents that contain t.

    w_q,t = log_e(1 + f_q,t) * log_e(N/f_t)
               Weight of term t in query q. As before, a higher "frequency" of the
               term in the query increases the weight. Since f_t is a count of the
               documents that contain the term t, and N is the total number of
               documents, terms appearing in a smaller fraction of the documents
               are weighted more heavily.

    C(q,d)     The Cosine measure of similarity of a query q and a document d.

To evaluate a query, we will need to calculate this measure for a specific q and all documents d, then choose the k documents with the largest measures. The formula for C(q,d) is:

    C(q,d) = (1/W_d) * SUM over t of (w_q,t * w_d,t)                  (7.1)

(Note that W_d is short for (SUM over t of w_d,t^2)^(1/2).)

While formula (7.1) might seem complex, the approach to calculating it using a BSI is relatively straightforward. First, we note that [MZ96] says we can use low-precision weights of about six bits without hurting retrieval effectiveness. Prior to knowing what terms exist in the query q, we can precalculate the frequencies f_d,t and the normalizing factor W_d of each document d, then calculate w_d,t = log_e(1 + f_d,t). Finally, we can calculate the quotients w_d,t / W_d and represent these values as 6-bit integers. We construct for each term t a BSI to represent these document values w_d,t / W_d for all documents in T; thus a small BSI is associated with each term in our index I (as currently a bitmap is associated with each term). To calculate the measure C(q,d) by formula (7.1), we begin by deriving the values of w_q,t for each term t of the query. From the query q we know f_q,t, the frequency of each query term, so we can calculate w_q,t, and from this we construct 6-bit integer approximations. C(q,d) is then derived by multiplying the BSI of each query term by the constant w_q,t and adding the results: a sum of BSI's arrived at by multiplying BSI's taken from the index by small integer constants. The method of multiplying is simply a shift-and-add over bit-slice positions in the BSI.

Most of the calculations above must be performed by any algorithm that solves the given problem; the calculations that create the per-term BSI's are performed while creating the index and in any event do not occur at runtime. The runtime calculation of the weighted BSI sum is the only one that is peculiar to our approach, and the work involved is comparable to that needed by the algorithm that accumulates product terms into an array. We leave the details and performance tests of this algorithm for future work.
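The shift-and-add multiplication just mentioned can be sketched as follows (our code, one 64-bit word per slice as in the earlier sketches): multiplying a BSI by a constant c adds a copy of the BSI, re-addressed upward by j slice positions, for each set bit j of c.

    #define MAX_SLICES 64   /* assumed ample for j + nslices in this sketch */

    /* Multiply BSI a (nslices slices) by a small non-negative constant c,
       e.g. a 6-bit query weight w_q,t; returns the slice count of prod. */
    int bsi_mul_const(const unsigned long a[], int nslices, unsigned c,
                      unsigned long prod[]) {
        unsigned long acc[MAX_SLICES] = {0};
        int top = 0;                         /* slices currently used in acc */
        for (int j = 0; c != 0; j++, c >>= 1) {
            if ((c & 1) == 0) continue;      /* bit j of c off: no addend    */
            unsigned long carry = 0;         /* ripple-add (a << j) into acc */
            for (int i = 0; i < nslices || carry; i++) {
                unsigned long ai = (i < nslices) ? a[i] : 0;
                unsigned long s  = acc[j + i] ^ ai ^ carry;
                carry = (acc[j + i] & ai) | (carry & (acc[j + i] | ai));
                acc[j + i] = s;
                if (j + i + 1 > top) top = j + i + 1;
            }
        }
        for (int i = 0; i < top; i++) prod[i] = acc[i];
        return top;
    }

For a 6-bit weight such as c = 37 (binary 100101), this adds the BSI shifted upward by 0, 2, and 5 slice positions, so each row's value is multiplied by 37.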
References

[BBG+95] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O'Neil, and P. O'Neil. A Critique of ANSI SQL Isolation Levels. SIGMOD Conference 1995.

[BMWZ95] T. C. Bell, A. Moffat, I. H. Witten, and J. Zobel.

Ann O'Leary. Managing Mission-Critical Text [with ORACLE ConText Cartridge]. Byte Magazine. http://www.byte.com/art/9709/sec4/art1.htm

[CHI98] Chee Yong Chan, Yannis E. Ioannidis. Bitmap Index Design and Evaluation. SIGMOD Conference 1998.

[CHI99] Chee Yong Chan, Yannis E. Ioannidis. An Efficient Bitmap Encoding Scheme for Selection Queries. SIGMOD Conference 1999.

http://www.csa.ru/dblab/DB2/db2s0/fullslt.htm

[HARM92] D. K. Harmon, Ed. Proceedings of TREC for Text Retrieval.

K. Haas, M. Holt, F. Putzolu, B. Quigley. Concurrency Control.

[KZS99] Marcin Kaszkiel, Justin Zobel, and Ron Sacks-Davis. Efficient Passage Ranking for Document Databases. ACM Trans. on Info. Sys., Vol. 17, No. 4, October 1999.

[MANO95] M.

[MURT82] Fionn Murtagh. A Very Fast, Exact Nearest Neighbour Algorithm for use in Information Retrieval. Vol. 1, Pages 275-283.

[MURT99] Fionn Murtagh. Clustering in Massive Data Sets. P. Pardalos and M.G.C. Reisende, Eds. Preprint, August 22, 1999.

[MZ96] Alistair Moffat and Justin Zobel. Self-Indexing Inverted Files for Fast Text Retrieval. ACM Trans. on Info. Sys., Vol. 14, No. 4, October 1996, Pages 349-379.

[ON87] Patrick O'Neil. MODEL 204 Architecture and Performance. HPTS Workshop, September 1987, Springer-Verlag.

[OQ97] Patrick O'Neil and Dallan Quass. Improved Query Performance with Variant Indexes. SIGMOD Conference 1997.

[OO00] Patrick O'Neil and Elizabeth O'Neil. Database: Principles, Programming, and Performance, 2nd Edition. Morgan Kaufmann, 2000.

[PCW97] Timothy Dyck. ConText Gets Faster and Friendlier. PCWEEK ONLINE. [Also discusses DB2 Text Extender.]

[PW83] Shirley A. Perry and Peter Willet. A Review of the Use of Inverted Files for Best Match Searching in Information Retrieval Systems. J. of Information Science 6 (1983) 59-66.

[SALT89] Gerard Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Publishing, Reading, MA, 1989.

[VH96] E. Voorhees and D. Harmon. Overview of the Fifth Text REtrieval Conference (TREC-5).

[WMB94] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. 1994.

[WUB98] Ming-Chuan Wu, Alejandro P. Buchmann. Encoded Bitmap Indexing for Data Warehouses. ICDE 1998.

[WU99] Ming-Chuan Wu. Query Optimization for Selections Using Bitmaps. SIGMOD Conference 1999: 227-238.

[ZM98] Justin Zobel and Alistair Moffat. Exploring the Similarity Space. SIGIR Forum 32(1), 1998.