Presentation Transcript

Hard Instances of Compressed Text Indexing
Rahul Shah, Louisiana State University
* Supported by NSF Grant CCF-1527435. This talk does not represent views of the NSF.
Based on joint work with Sharma Thankachan (Univ. of Central Florida) and Arnab Ganguly (Univ. of Wisconsin, Whitewater)

String Data
Fundamental in Computer Science: a finite sequence of characters drawn from an alphabet set Σ.
Applications:
- Genomes, e.g. sequence alignment
- Biometrics, images
- Time-series, e.g. finance
- Text (for phrase search): Google, Microsoft
- Music
- Network security, e.g. online malware detection

Touches Many Fields
- Uncertain and probabilistic matching [SIGMOD'14, EDBT'16]
- Big data matching [ESA'13, PODS'14, ISAAC'15]
- Ranked pattern matching [JACM'14]
- Software plagiarism and version control [SODA'17]
- RNA structural matching [ISAAC'17]

Agenda for Today
Introduction to text/succinct data structures:
- Suffix trees, suffix arrays
- Bit vectors, wavelet trees
- BWT, FM-index, LF-mapping
- Tree and RMQ encodings
- Compressed suffix trees
Easy cases: augment compressed suffix trees
- Property matching
- Top-k retrieval: sparsification
Hard cases: pattern matching problems with suffix-tree variants
- Parameterized matching (pBWT)
- Order-preserving matching (LF-successor encoding)
- RNA structural matching
Even harder: 2-D matching
Technical talk alert!!!

String Searching Indexes
Suffix trees, suffix arrays, CSA, FM-index, etc.
T = mississippi$, with suffixes mississippi$, ississippi$, ssissippi$, sissippi$, issippi$, ssippi$, sippi$, ippi$, ppi$, pi$, i$, $.
For P = ssi: the locus of P in the suffix tree is found in O(p) time, and the occurrences are then reported in O(occ) time.
Suffix tree: O(n) words of space and optimal O(p + occ) query time.
[Figure: suffix tree of mississippi$ with leaves 12 9 10 11 8 7 6 5 3 4 2 1; LF(9) = 11, LF(10) = 12]

Suffix Array
T = M I S S I S S I P P I (positions 1..11)
Space: O(n) words = O(n log n) bits

SUFFIX ARRAY   SORTED SUFFIXES
11             I
8              IPPI
5              ISSIPPI
2              ISSISSIPPI
1              MISSISSIPPI
10             PI
9              PPI
7              SIPPI
4              SISSIPPI
6              SSIPPI
3              SSISSIPPI
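A minimal Python sketch (added for illustration, not from the talk) that builds this suffix array by sorting suffixes and finds a pattern's occurrences by binary search; the O(n^2 log n) construction is for clarity only, and the "\xff" sentinel assumes no such character occurs in the text.

```python
# Minimal sketch: suffix array by sorting suffixes, pattern search by binary search.
from bisect import bisect_left, bisect_right

def suffix_array(text):
    # Returns 1-based starting positions of suffixes in sorted order.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def pattern_occurrences(text, sa, pattern):
    # Binary search over the sorted suffixes for the block matching `pattern`.
    suffixes = [text[i - 1:] for i in sa]
    lo = bisect_left(suffixes, pattern)
    hi = bisect_right(suffixes, pattern + "\xff")   # assumes "\xff" is larger than any text symbol
    return sa[lo:hi]                                # starting positions of all occurrences

T = "MISSISSIPPI"
sa = suffix_array(T)
print(sa)                                       # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sorted(pattern_occurrences(T, sa, "SSI")))  # [3, 6]
```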

Space Bloat Incurred by ST and SA
- In practice, suffix trees are about 15-50 times the size of the original text; suffix arrays take about 5-15 times.
- The comparison is against the minimum size required to store the text.
- Complexity-wise: n log Σ bits for the text vs. n log n bits for the index.
- Σ = 4 for DNA and Σ = 256 for ASCII text; log n is often the word size of 32-64 bits, whereas a DNA symbol needs only 2 bits.
- Human genome: 3 billion base pairs = 0.8 GB of memory, but the suffix tree takes about 35-45 GB, and even more memory during construction.
- Tools: bowtie, bzip.

Pattern Matching with BWT
T = a b a a b a b b $
BWT(T) = b $ a b a a a b b
Sorted suffixes (with $ treated as the largest symbol, as drawn on the slide): aababb$, abaababb$, ababb$, abb$, baababb$, babb$, bb$, b$, $
P = a b a
Count statistics on the BWT can be answered using a data structure called a wavelet tree.
Backward-search step, extending the suffix range [sp, ep] by a preceding character c (here C[a] = 0, C[b] = 4):
  sp = rank_c(BWT, sp - 1) + C[c] + 1
  ep = rank_c(BWT, ep) + C[c]
LF-mapping: jumping from the i-th suffix in the suffix tree (sorted order) to its text-previous suffix, e.g. LF(6) = 3 and LF(3) = 1.
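A compact Python sketch (added for illustration) of BWT-based counting via backward search. It follows the standard FM-index convention where '$' sorts smallest, which differs from the slide's figure (there '$' is drawn as the largest symbol); the helper names are my own.

```python
# Minimal FM-index counting sketch: backward search over the BWT.
def bwt(text):
    # text must end with a unique '$' sentinel.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa)

def backward_search(bwt_str, pattern):
    # C[c] = number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)
    def rank(c, i):                     # occurrences of c in bwt_str[0:i]
        return bwt_str[:i].count(c)
    sp, ep = 0, len(bwt_str)            # half-open suffix range [sp, ep)
    for c in reversed(pattern):
        sp = C[c] + rank(c, sp)
        ep = C[c] + rank(c, ep)
        if sp >= ep:
            return 0
    return ep - sp                      # number of occurrences

B = bwt("abaababb$")
print(backward_search(B, "aba"))        # 2
```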

Goals
Succinct data structures:
- Information-theoretically optimal space, plus lower-order o(...) terms
- Compact: O(optimum) space
- Query times: as fast as possible within the space limitation
Compressed text indexing:
- Not O(n) words, i.e., not O(n log n) bits; rather n log Σ + o(...) bits or nH_k bits
- poly(|P|, log n) or |P| * poly(log n) query times

Two Building Blocks
Text dictionary: given a character vector T over Σ, for any c in Σ:
- rank_c(p): counts the number of c's in T[1..p]
- select_c(i): finds the (minimum) position p such that T[1..p] contains i occurrences of c
- char(i): gives the character at T[i]
Bit dictionary: given a bit vector B of length n with t 1s (this can also be seen as a subset of t items from an ordered universe of size n):
- rank(p): returns the number of 1s in B[1..p]
- select(i): returns the position p such that B[p] is the i-th 1
Minimum space: nH_0(T), t log(ne/t), or n log Σ bits.
[Figure: example bit vector 0100 1011 1010 0011 0111 0010 1101 1000 with sample rank/select answers]
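A naive Python sketch (added here) of the bit dictionary's rank/select interface with 1-based positions; the example bit vector is an assumption, and real succinct structures answer both queries in O(1) time with only o(n) extra bits.

```python
# Naive bit dictionary with rank/select (1-based positions, as on the slide).
class BitDict:
    def __init__(self, bits):          # bits: string like "0100101110100011"
        self.bits = bits
        # prefix[p] = number of 1s in bits[1..p]
        self.prefix = [0]
        for b in bits:
            self.prefix.append(self.prefix[-1] + (b == "1"))

    def rank(self, p):                 # number of 1s in B[1..p]
        return self.prefix[p]

    def select(self, i):               # position of the i-th 1
        for p in range(1, len(self.bits) + 1):
            if self.bits[p - 1] == "1" and self.prefix[p] == i:
                return p
        return -1

B = BitDict("0100101110100011")
print(B.rank(8))    # 4 ones among the first 8 bits
print(B.select(3))  # the 3rd one is at position 7
```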

Wavelet Tree
T = c a b f b e g c g a g e f e a b e g, with Σ = {a, b, c, d, e, f, g}.
The root bit vector marks each position 0 if its character lies in the left half {a, b, c, d} and 1 if it lies in {e, f, g}; each child node recursively partitions its sub-alphabet ({a, b} vs {c, d}, {e, f} vs {g}) with its own bit vector.
Query: count the number of e's in T[5, 15].
[Figure: wavelet tree with the bit vector stored at each node]

Wavelet Tree: Query Example
Count the number of e's in T[5, 15] by walking from the root toward the leaf of e, translating the range with rank queries at each level:
- Root (0 = {a, b, c, d}, 1 = {e, f, g}): rank_1(5-1) = 1 and rank_1(15) = 7, so the range becomes [2, 7] in the right child.
- Node {e, f, g} (0 = {e, f}, 1 = {g}): rank_0(1) = 1 and rank_0(7) = 5, so the range becomes [2, 5] in the {e, f} child.
- Node {e, f} (0 = e, 1 = f): rank_0(1) = 0 and rank_0(5) = 3.
Count of e's in T[5, 15] = 3 - 0 = 3.
[Figure: wavelet tree bit vectors with the visited nodes highlighted]
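A small Python sketch (added, not from the talk) of a wavelet tree supporting rank_c, which is exactly what the range-count in the example reduces to; the alphabet is split into halves at each node as on the slide.

```python
# Wavelet tree supporting rank(c, i) = occurrences of c in seq[1..i] (1-based).
class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) == 1:                  # leaf: all symbols identical
            self.bits, self.left, self.right = None, None, None
            self.length = len(seq)
            return
        mid = (len(self.alphabet) + 1) // 2
        left_set = set(self.alphabet[:mid])
        self.bits = [0 if c in left_set else 1 for c in seq]
        self.left = WaveletTree([c for c in seq if c in left_set], self.alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in left_set], self.alphabet[mid:])

    def rank(self, c, i):
        if i <= 0:
            return 0
        if self.bits is None:                        # leaf: every symbol equals c
            return min(i, self.length)
        ones = sum(self.bits[:i])
        mid = (len(self.alphabet) + 1) // 2
        if c in self.alphabet[:mid]:
            return self.left.rank(c, i - ones)       # zeros map into the left child
        return self.right.rank(c, ones)              # ones map into the right child

T = list("cabfbegcgagefeabeg")
wt = WaveletTree(T)
# e's in T[5..15] = rank(e, 15) - rank(e, 4)
print(wt.rank("e", 15) - wt.rank("e", 4))            # 3
```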

Encoding Tree Structure
An 11-node example tree takes ~22 bits; counting arguments (Catalan numbers) show 2n - Θ(log n) bits are necessary.
BP (balanced parentheses): ( ( ( ( ) ( ) ( ) ) ( ) ) ( ) ( ( ) ( ) ) )
DFUDS: ((() (() ((() ) ) ) ) ) (() ) )
With rank, select, and find_match_close we get k-th child, parent, leftmost leaf, etc. in O(1) time, as well as LCA, level ancestor (LA), etc.
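A minimal Python sketch (added for illustration, not from the talk) of the BP encoding and of answering a subtree-size query from the parentheses alone; find_close here is a linear scan, whereas succinct BP structures answer it in O(1) time with o(n) extra bits.

```python
# Balanced-parenthesis (BP) encoding of an ordered tree; subtree size comes
# from the matching-parenthesis positions alone, without storing the tree.
def bp_encode(node):
    # node is a nested tuple of children; every node contributes "(" ... ")".
    return "(" + "".join(bp_encode(child) for child in node) + ")"

def find_close(bp, i):
    # Match the '(' at 0-based position i by scanning.
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")

def subtree_size(bp, i):
    # Number of nodes in the subtree whose '(' is at position i.
    return (find_close(bp, i) - i + 1) // 2

# Example: root with three children; the first child has two leaf children.
tree = (((), ()), (), ())
bp = bp_encode(tree)
print(bp)                   # ((()())()())
print(subtree_size(bp, 0))  # 6  (whole tree)
print(subtree_size(bp, 1))  # 3  (first child's subtree)
```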

Range Minimum Query
[Figure: example array (the slide shows the sequences 6 2 3 8 7 5 1 9 4 11 and 1 7 4 9 8 5 3 6 2 11) with its Cartesian tree]
- Cartesian tree: using a balanced-parenthesis encoding, it takes ~2n bits.
- No need to store the original array: queries can be answered purely position-based.
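A small Python sketch (added for illustration) of the reduction behind the slide: RMQ(i, j) is the LCA of positions i and j in the array's Cartesian tree. It takes one of the slide's number sequences as the example array (an assumption) and stores the tree explicitly, whereas the succinct version keeps only the ~2n-bit BP encoding of this tree.

```python
# RMQ via the Cartesian tree: the answer is the LCA of the two query positions.
def build_cartesian(A, lo, hi):
    if lo > hi:
        return None
    m = min(range(lo, hi + 1), key=lambda i: A[i])     # position of the minimum
    return {"pos": m,
            "left": build_cartesian(A, lo, m - 1),
            "right": build_cartesian(A, m + 1, hi)}

def rmq(node, i, j):
    # LCA of positions i and j in the Cartesian tree = position of min(A[i..j]).
    p = node["pos"]
    if j < p:
        return rmq(node["left"], i, j)
    if i > p:
        return rmq(node["right"], i, j)
    return p

A = [6, 2, 3, 8, 7, 5, 1, 9, 4, 11]                    # one of the slide's sequences
root = build_cartesian(A, 0, len(A) - 1)
print(rmq(root, 2, 5))                                 # 2  (A[2] = 3 is the min of A[2..5])
```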

Property Matching
Text (T): A G T C A T A T T G A C A T A G G C C T A C A T G A A A A C C G C A T T A G
Properties (π) = {(3, 7), (10, 13), (18, 27), (22, 30), (37, 38)}
Pattern (P): C A T
Number of occurrences (occ) = 4; number of occurrences that satisfy a property (occ_π) = 2.
Applications: tandem repeats, SINEs, LINEs, probabilistic matching, etc.

Compressing the Augmenting Structure
All property-matching indexes consist of a suffix tree augmented with some additional structures. Can we compress them?
Compressed Suffix Trees (CST) bring the tree down to O(n) bits; similarly, the Compressed Property Suffix Tree (CPST) = CSA + O(n) bits of additional structures.
Space: nH_k + O(n) + o(n log σ) bits
Query time: O(|P| + occ_π log^{1+ε} n)

Compressed Property Suffix Trees
Definitions:
- end(i) = max{f_k, 0} over all properties (s_k, f_k) in π with s_k <= i
- length(i) = end(i) - i + 1
Text (T): A G T C A T A T T G A C A T A G G C C T A C A T G A A A A C C G C A T T A G, with π = {(3, 7), (10, 13), (18, 27), (22, 30), (37, 38)}
i:         1  2  3  4  5  6 ... 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ...
end(i):    0  0  7  7  7  7 ... 13 27 27 27 27 30 30 30 30 30 30 30 30 30 ...
length(i): 0 -1  5  4  3  2 ... -3 10  9  8  7  9  8  7  6  5  4  3  2  1 ...
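A short Python sketch (added for illustration) that computes end(i) and length(i) directly from the definitions above; positions are 1-based.

```python
# Compute end(i) and length(i) from the property set pi (1-based positions;
# end(i) = 0 when no property starts at or before i).
def end_and_length(n, properties):
    ends, lengths = [], []
    for i in range(1, n + 1):
        e = max([f for (s, f) in properties if s <= i], default=0)
        ends.append(e)
        lengths.append(e - i + 1)
    return ends, lengths

pi = [(3, 7), (10, 13), (18, 27), (22, 30), (37, 38)]
ends, lengths = end_and_length(38, pi)
print(ends[:6])     # [0, 0, 7, 7, 7, 7]
print(lengths[:6])  # [0, -1, 5, 4, 3, 2]
```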

Compressed Property Suffix Trees
Suffix range of P = C A T (|P| = 3) in the suffix array:
i:              ... 13 14 15 16 ...
SA[i]:          ... 12  4 22 33 ...
length(SA[i]):  ...  2  4  6 -2 ...
SA[i] is an output iff length(SA[i]) >= |P|.
Text (T): A G T C A T A T T G A C A T A G G C C T A C A T G A A A A C C G C A T T A G, with π as before.

3-Sided Range Searching Using Range Maximum Queries (RMQ)
length(SA[i]) over the suffix range of P: ... 2 4 6 -2 ...
- RMQ over [2, 4, 6, -2] returns 6 > |P| = 3: report and recurse on both sides.
- RMQ over [2, 4] returns 4 > |P| = 3: report; RMQ over [-2] returns -2 < |P| = 3: stop.
All outputs can be retrieved with O(1 + occ_π) RMQ queries.
The RMQ structure over length(SA[i]) takes 2n + o(n) bits, but storing length(SA[i]) itself would take O(n log n) bits.
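A Python sketch (added here) of the reporting recursion the slide describes: using only range-maximum queries, every position with length(SA[i]) >= |P| is reported with O(1 + occ_π) queries. The RMQ itself is naive here; the index replaces it with the 2n + o(n)-bit structure.

```python
# Report every index i in [lo, hi] with values[i] >= threshold using only
# range-maximum queries.
def range_max_pos(values, lo, hi):
    # Position of the maximum in values[lo..hi] (stand-in for a succinct RMQ).
    return max(range(lo, hi + 1), key=lambda i: values[i])

def report_at_least(values, lo, hi, threshold, out):
    if lo > hi:
        return
    m = range_max_pos(values, lo, hi)
    if values[m] < threshold:          # maximum too small: nothing to report here
        return
    out.append(m)
    report_at_least(values, lo, m - 1, threshold, out)
    report_at_least(values, m + 1, hi, threshold, out)

lengths = [2, 4, 6, -2]                # length(SA[i]) over the suffix range of P
hits = []
report_at_least(lengths, 0, len(lengths) - 1, 3, hits)
print(hits)                            # [2, 1]  (positions with length >= |P| = 3)
```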

Compressed Property Suffix Trees
(Same i, end(i), length(i) table as on the previous slides.)
Observation: end(i) is a non-decreasing function, so it can be encoded with a bit vector B of length 2n (2n + o(n) bits including rank/select structures):
  B[j] = 1 for j = end(i) + i, and 0 otherwise.
Then end(i) = select_B(i) - i and length(i) = select_B(i) - 2i + 1.
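A Python sketch (added, not from the talk) of this bit-vector encoding of end(i) and its select-based decoding; select is a naive scan here, standing in for an o(n)-bit select structure.

```python
# Encode end(i) in a bit vector B of length 2n: since end(i) + i is strictly
# increasing, a single select query recovers end(i) and length(i).
def encode_end(ends):                       # ends[i-1] = end(i), 1-based i
    n = len(ends)
    B = [0] * (2 * n + 1)                   # B[1..2n]; index 0 unused
    for i, e in enumerate(ends, start=1):
        B[e + i] = 1
    return B

def select1(B, i):                          # position of the i-th 1 in B
    seen = 0
    for p in range(1, len(B)):
        seen += B[p]
        if seen == i:
            return p
    raise ValueError("fewer than i ones")

ends = [0, 0, 7, 7, 7, 7, 7]                # end(i) for i = 1..7 in the slide's example
B = encode_end(ends)
i = 3
print(select1(B, i) - i)                    # end(3) = 7
print(select1(B, i) - 2 * i + 1)            # length(3) = 5
```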

Compressed Property Suffix Trees
Hence we do not store length(SA[i]) explicitly; instead we store only an RMQ structure (2n + o(n) bits) over it. For any given i, length(SA[i]) can be computed by first finding SA[i] in O(log^{1+ε} n) time and then obtaining length(SA[i]) from the end(.) array in constant time.
Our index therefore consists of the following components:
- CSA (nH_k + o(n log σ) bits)
- end(i) array (2n + o(n) bits)
- RMQ over length(SA[i]) (2n + o(n) bits)
Total space: nH_k + O(n) + o(n log σ) bits
Query time: O(|P| + occ_π log^{1+ε} n)

Top-k Most Frequent Document Retrieval
Instead of listing all documents (strings) in which the pattern occurs, list only highly "frequent" documents.
Top-k: retrieve only the k most frequent documents.
Approaches:
- Inverted indexes: popular in the IR community, but they do not efficiently answer arbitrary pattern queries and are not efficient for string search.
- Suffix-tree-based solutions.

Suffix Tree-Based Solutions
d1: banana, d2: urban. Build a generalized suffix tree over the suffixes a$, an$, ana$, anana$, ban$, banana$, n$, na$, nana$, rban$, urban$.
For the pattern "an", we look at its subtree: d1 appears twice and d2 appears once in this subtree.
[Figure: generalized suffix tree with leaves labeled by document ids d1/d2]

Framework
First assume k (or K) is fixed; let the group size be g = k log k log^{1+ε} n.
- Take consecutive runs of g leaves (in the suffix tree) from left to right and form groups.
- Mark the least common ancestor of each group, and also mark each node that is the LCA of two marked nodes.
- Store an explicit list of the top-k highest-scoring documents in Subtree(v) at each marked node v.

Example
Group size g = 4. We build a CSA on the n/g bottom-level marked nodes.
At each marked node, the top-k list is stored; the LCA of two marked nodes is also marked.
[Figure: suffix tree with nodes labeled a through p, marked nodes highlighted]

Framework
Because of the sampling structure, the space consumption for a fixed k is
  O((n/g) * k log n) = O(n / (k log k log^{1+ε} n) * k log n) = O(n / (log k log^ε n)) bits.
Repeat this for k = 1, 2, 4, 8, 16, .... The total space used is
  O((n / log^ε n) * Σ_k 1/log k) = O((n / log^ε n) * (1 + 1/2 + 1/4 + 1/8 + ...)) = o(n) bits.

Query Answering
[Figure: locus node v with a marked descendant u; the explicit top-k list is stored at u; at most 2g fringe leaves lie in subtree(v) but outside subtree(u)]
Key idea: any document corresponding to the top-k in the subtree of v is either in the top-k list of the marked node u, or corresponds to one of the at most 2g fringe leaves.

Our Approach
Choose a smaller grouping factor h = k log k log^ε n.
- The number of fringe leaves is smaller, so the on-the-fly frequency computation time is O(h log^{1+ε} n) = O(k log k log^{2+ε} n).
CHALLENGE: we cannot afford to store the top-k answers explicitly (in log n bits each) at the marked nodes, because (n/h) * k log n can be very large.
SOLUTION: encode each top-k document of a marked node in O(log log n) bits instead of O(log n) bits; this bounds the total space for pre-computed answers by o(n) bits.

Encoding an Answer in O(log log n) Bits
Encoding idea: node u is marked with respect to the coarser grouping factor g = k log k log^{1+ε} n, and its answers are maintained explicitly. The top-k for node v then come either from the top-k of u or from the at most 2g fringe leaves. So instead of maintaining document ids explicitly, we refer to k elements among these 2g + k candidates.
[Figure: node v, its marked descendant u, and the at most 2g fringe leaves]

Encoding an Answer in O(log log n) Bits
Hence the task reduces to encoding k numbers from a universe of size O(g + k).
N numbers from a universe of size U can be encoded in N log(U/N) + O(N) bits, with O(1)-time decoding of any number, using indexable dictionaries [RRR, SODA 2002].
For U = g + k and N = k, the space is O(k log log n) bits.
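A quick check of the last claim (added here; it assumes g = k log k log^{1+ε} n as on the earlier slides):

```latex
k \log\frac{U}{N} + O(k)
  = k \log\frac{g+k}{k} + O(k)
  = k \log\!\big(\log k \cdot \log^{1+\epsilon} n + 1\big) + O(k)
  = O(k \log\log n)\ \text{bits}.
```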

Summary: Succinct and Semi-Succinct Results

Paper                           Space                            Report time per item
HSV [FOCS 09]                   2|CSA| + o(n)                    log^{4+ε} n
Culpepper et al. [SIGIR 12]     70%                              1 ms
Gagie et al. [SPIRE 10]         |CSA| + o(n log D)               log^{3+ε} n
Belazzougui et al. [SPIRE 11]   |CSA| + n log D + o(n log D)     log k log^{1+ε} n
HST [CPM 12]                    |CSA| + n log D + o(n log D)     (log Σ log log n)^{1+ε}
HST [CPM 12]                    |CSA| + 2n log D + o(n log D)    log log n
Tsur [IPL 12]                   |CSA| + o(n)                     log k log^{2+ε} n
Shah and Thankachan             2|CSA| + o(n)                    log k log^{1+ε} n

Parameterized Pattern Matching
The alphabet Σ consists of two disjoint sets: static characters Σ_s and parameterized characters Σ_p.
A parameterized string (p-string) is a string in (Σ_s ∪ Σ_p)*.
Two p-strings S = s_1 s_2 ... s_m and S' = s'_1 s'_2 ... s'_m p-match iff
- s_i = s'_i whenever s_i is in Σ_s, and
- there exists a bijection f_S that renames s_i to s'_i for every s_i in Σ_p.
Example: Σ_s = {A, B} and Σ_p = {w, x, y, z}. AxBy and AwBz p-match; AxBy and AwBw do not p-match.
Going forward, without loss of generality, we focus on texts/suffixes consisting entirely of parameterized characters.

Canonical Encodings
Convert a p-string S into a string over numbers starting from 0.
Encoding 1: every time a new character is encountered, it receives a new numeric symbol.
  aabcacb -> 0012021
Encoding 2 (Baker's): encode a character as the distance from its previous occurrence, prev(S):
  prev(S)[i] = 0 if i is the first occurrence of the p-character S[i]
  prev(S)[i] = i - j if j < i is the rightmost previous occurrence of the p-character S[i]
  wwxyywz -> 0100140
Encoding 1 needs n log Σ bits, but Encoding 2 needs n log n bits.
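A Python sketch (added for illustration) of the two canonical encodings, reproducing the slide's examples.

```python
# Encoding 1: increasing ids on first occurrence.
def encoding1(s):
    ids, out = {}, []
    for c in s:
        if c not in ids:
            ids[c] = len(ids)           # next unused numeric symbol
        out.append(ids[c])
    return out

# Encoding 2 (Baker's prev): distance to the previous occurrence, 0 if none.
def prev_encoding(s):
    last, out = {}, []
    for i, c in enumerate(s, start=1):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out

print(encoding1("aabcacb"))      # [0, 0, 1, 2, 0, 2, 1]
print(prev_encoding("wwxyywz"))  # [0, 1, 0, 0, 1, 4, 0]
```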

Parameterized Suffix Tree
Two p-strings S and S' p-match iff prev(S) = prev(S').
p-suffix tree:
- Encode every suffix according to prev(.)
- Construct a suffix tree over the encoded suffixes.
Construction time: Baker O(n|Σ_p| + n log |Σ|); Kosaraju O(n log |Σ|).
Searching in the p-suffix tree:
- Encode P using prev(.)
- Search the p-suffix tree for prev(P).
Time: O(|P| log |Σ| + occ).

Parameterized BWT
Consider the suffix T_i = T[i..n] and its preceding character T[i-1].
- Define the zero of suffix T_i, z(T_i) = the first position in T_i where the character T[i-1] appears.
- Define the zero depth of T_i, zd(T_i) = the number of distinct characters in T_i up to the first occurrence of T[i-1].
Example: T = abcabbadcb; T_4 = abbadcb; T[3] = c; the first c in T_4 is at T_4[6], so z(T_4) = 6 and zd(T_4) = 4.
Encode all suffixes by prev(.) and sort the encoded suffixes (keeping their corresponding zd(.) values).
The vector of zd(.) values thus obtained is called the pBWT.
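A Python sketch (added here) of z(T_i) and zd(T_i) exactly as defined above, checked against the slide's example.

```python
# zero z(T_i) and zero depth zd(T_i), 1-based positions.
def zero_and_zdepth(T, i):
    prev_char = T[i - 2]                 # T[i-1] in 1-based indexing
    suffix = T[i - 1:]                   # T_i = T[i..n]
    z = suffix.index(prev_char) + 1      # first position of T[i-1] inside T_i
    zd = len(set(suffix[:z]))            # distinct characters up to that position
    return z, zd

T = "abcabbadcb"
print(zero_and_zdepth(T, 4))             # (6, 4), matching the slide's example
```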

pBWT: Example
T = abcabbadcb$
[Figure: the prev(.)-encoded suffixes in sorted order, each listed with its suffix T_i and its zd value]
pBWT = 5 5 5 5 1 3 3 5 4 2
SA[i] = 10 9 8 7 6 3 2 1 4 5
LF(9) = 6

pST
[Figure: p-suffix tree of T = abcabbadcb with leaves T_1 through T_10; the figure marks LF(6) = 7 and LF(5) = 10, and shows T_3 = cabbadcb encoded as 00013064 and T_6 = badcb encoded as 00004]
Good thing: only a constant number of changes occur as a suffix transforms during LF.

(The earlier String Searching Indexes slide, with the mississippi$ suffix tree and LF(9) = 11, LF(10) = 12, is shown again here as a recap.)

Data Structures
Wavelet tree over the pBWT:
- Operations: pBWT[i]; rangeCount(i, j, x, y) = number of k in [i, j] with x <= pBWT[k] <= y
- Space: n log |Σ| bits; time: O(log |Σ|)
Succinct representation of the p-suffix tree:
- Node operations: leftMostLeaf, rightMostLeaf, q-th child, parent, lca
- Space: 4n + o(n) bits; time: O(1)
Additional O(n) + o(n log |Σ|)-bit structures.
Total space: n log |Σ| + O(n) + o(n log |Σ|) bits.

Compute LF(i)
Let z = the node just below the zero of leaf l_i, and v = parent(z).
z is computed in O(log |Σ|) time using the wavelet tree and an additional O(n log log |Σ|)-bit structure.

Compute LF(i): Computing N1
LF(i) = N1 + N2 + N3, where N_t is the number of suffixes j in S_t such that LF(j) <= LF(i).
Write, in unary on every edge, the number of zeros falling on it; arrange the edges in post-order and form a bit vector.
N1 = number of zeros from the left falling on path(v) = (number of zeros coming from the right) - (number of zeros counted up to v in post-order).

Compute LF(i): Computing N2
N2 = number of j's such that L[j] is a p-character and either f_j > f_i, or f_j = f_i and j <= i
   = rangeCount(L_z, R_z, c+1, |Σ_p|) + rangeCount(L_z, i, c, c), where c = pBWT[i].
Computed using the wavelet tree in O(log |Σ|) time.

Compute LF(i) when L[i] is Parameterized
N3 = 0.
N4 = number of j's such that L[j] is a p-character, f_j > f_i, and the leading character on the path from v to leaf j is parameterized
   = rangeCount(R_z + 1, R_u, c+1, |Σ_p|), where c = pBWT[i].
Computed using the wavelet tree in O(log |Σ|) time.

Summarizing LF(i)
- LF(i) is computed in O(log |Σ|) time.
- Space: n log |Σ| + O(n) + o(n log |Σ|) bits.
- pSA[.] and pSA^{-1}[.] can be computed in O(log^{1+ε} n) time using an additional O(n) bits (sampled suffix array and inverse suffix array).

Backward Search
Compute the suffix range of P as follows: given the suffix range [sp, ep] of Q = a proper suffix of P, and c = the character preceding Q in P, compute the suffix range [sp', ep'] of cQ.
Preprocess P in O(|P| log |Σ|) time so that for any p-character P[i] we can find
- the number of distinct p-characters in P[i+1, |P|], and
- the number of distinct p-characters in P[i+1, c_i], where c_i is the first occurrence of c in P[i+1, |P|].
Case: c is static.
  sp' = 1 + rangeCount(1, n, 1, c-1) + rangeCount(1, sp-1, c, c)
  ep' = rangeCount(1, n, 1, c-1) + rangeCount(1, ep, c, c)
Time: O(log |Σ|).

Backward Search (c is a parameterized character)
Case: c does not appear in Q.
- d = number of distinct p-characters in Q
- (ep' - sp' + 1) = rangeCount(sp, ep, d+1, |Σ_p|), computed in O(log |Σ|) time
- sp' = 1 + fSum(lca(leaf_sp, leaf_ep)), computed in O(1) time
Case: c appears in Q.
- d = number of distinct p-characters in Q until the first occurrence of c
- (ep' - sp' + 1) = rangeCount(sp, ep, d, d), computed in O(log |Σ|) time
- sp' = LF(i_min), where i_min = min{ i | sp <= i <= ep and pBWT[i] = d }, computed in O(log |Σ|) time

Summarizing
- The suffix range of P is found in O(|P| log |Σ|) time.
- Each text position is located in O(log^{1+ε} n) time.
Final result:
- Space: n log |Σ| + O(n) + o(n log |Σ|) bits
- Time: O(|P| log |Σ| + occ log^{1+ε} n)

Order Preserving Pattern Matching

Order-Preserving Pattern Matching
T = 5 4 1 3 8 6 2 9 4 5 1 2
P = 3 2 1 matches at T[1] and T[5]; P = 3 1 2 matches at T[2] and T[10] but not at T[6].
Predecessor encoding: each position points to the number on its left that is just smaller than (or equal to) itself.
T    = 1 8 6 9 5 2 4 3 7 2
Pred = n 1 1 8 1 1 2 2 6 2   (predecessor values)
Prev = n 1 2 2 4 5 1 2 6 4   (distance back to the predecessor)
What happens if we prepend this with a previous character, say "3"?
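A Python sketch (added, not from the talk) of the predecessor encoding; the tie rule (a previous equal value counts as the predecessor) is inferred from the slide's example, and Prev records the distance back to that element.

```python
# Pred[i]: value of the predecessor of T[i] among the elements to its left
# (largest value <= T[i], rightmost on ties); Prev[i]: distance back to it.
def order_preserving_encoding(T):
    pred, prev = [], []
    for i, x in enumerate(T):
        best = None                                # index of the predecessor of T[i]
        for j in range(i - 1, -1, -1):
            if T[j] <= x and (best is None or T[j] > T[best]):
                best = j
        if best is None:
            pred.append("n"); prev.append("n")
        else:
            pred.append(T[best]); prev.append(i - best)
    return pred, prev

T = [1, 8, 6, 9, 5, 2, 4, 3, 7, 2]
pred, prev = order_preserving_encoding(T)
print(pred)   # ['n', 1, 1, 8, 1, 1, 2, 2, 6, 2]
print(prev)   # ['n', 1, 2, 2, 4, 5, 1, 2, 6, 4]
```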

Pictorially ...
[Figure: three-slide animation plotting the sequence 1 2 7 3 4 2 5 9 6 8 by value and showing the effect of prepending the new character 3]

How to Compute the LF-Mapping
LF-successor idea: philosophically speaking, LF can simulate SA (the suffix array) by repeated application. We go one level beyond: LF-successor can simulate LF by repeated application.

Computing LF-succ
Store with leaf i (whose LF-successor is j):
- the number of distinct characters up to lca(i, j), in the case where LF(i) and LF(j) first disagree after lca(i, j);
- otherwise, the number of distinct characters up to the point of disagreement.
The information stored per leaf is only log Σ bits. Based on this stored information and a wavelet tree, we can compute the LF-successor in O(log Σ) time.
Thus we get an index with O((p + occ) polylog n) query time and O(n log Σ) bits of space.

Even Harder: 2-D Pattern Matching
- Given a text as an n x n matrix, we want to match square patterns of various sizes in it.
- A suffix tree exists: suffixes are read in elbow (L-shaped) fashion, e.g. the suffix starting at (1, 2) is BBDDCCAC ...
[Figure: 5 x 5 character matrix
  A B B C A
  B D D C A
  D B C A A
  C B A A B
  A D C C A ]

Overall Philosophy
Suffix Tree -> Suffix Array -> BWT (LF-mapping) -> modified LF based on structure (parameterized) -> LF-successor (order-preserving) -> ???

Questions?
Thank you!