
Fast Compressed Tries - PowerPoint Presentation

Uploaded On 2016-02-29


Fast Compressed Tries through Path Decompositions

Roberto Grossi, Giuseppe Ottaviano*
Università di Pisa

* Part of the work done while at Microsoft Research Cambridge

Compacted tries

[Figure: compacted trie for the strings three, trial, triangle, trie, triple, triply. Each node carries a node label (possibly empty, ε), and each child is reached through a branching character.]

Applications

- String dictionaries
  - With prefix lookup, predecessor, …
  - Exploit prefix compression
- Monotone perfect hash functions
  - "Hollow" or "Blind" tries [ALENEX 09]
  - Binary tree (no need to store branching chars)
  - No need to store node labels, just lengths (skips)

Height vs. performance

- Tries can be deep: no guarantee on height
- Bad with pointer-based trees
  - ~1 cache miss per child operation
- Worse with succinct tree encodings
  - Need to access several directories
  - Many cache misses per child operation
- Large constants hidden in the O(1)

Path decomposition

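[Figure: the compacted trie for three, trial, triangle, trie, triple, triply, shown next to its path decomposition. The root-to-leaf path spelling "triangle" becomes the root node of the decomposed trie; the subtries hanging off that path become its children, reached through the branching characters h, e, p, l.]

Query: triple. The query leaves the root path after "tri" through the branching character p; recurse in that child with the suffix "le".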

Centroid path decomposition

- Decompose along the heavy paths
  - Choose the edge that has the most descendants
- Height of the decomposed tree: O(log n)
  - Usually lower

Average height

                        Web Queries     URLs   Synthetic
Compacted trie                 11.0     18.1       504.4
Centroid trie                   5.2      6.2         2.8
Hollow trie                    50.8     67.3      1005.3
Centroid hollow trie            8.0      9.2         2.8
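The decomposition and the lookup it enables can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the input must be prefix-free, the "most descendants" rule is approximated by the number of strings below an edge, ties are broken toward the larger total string length (an arbitrary choice), and the names `Node`, `build`, `contains` and `height` are hypothetical.

```python
from collections import defaultdict


class Node:
    """One node of the path-decomposed trie: a whole root-to-leaf path."""

    def __init__(self, path, children):
        self.path = path          # characters spelled along the heavy path
        self.children = children  # {(offset in path, branching char): Node}


def build(strings):
    """Decompose a non-empty, prefix-free list of distinct strings."""
    path, children = [], {}
    group, depth = list(strings), 0
    while len(group) > 1:
        by_char = defaultdict(list)
        for s in group:
            by_char[s[depth]].append(s)
        # Heavy edge: most strings below it (assumption: a proxy for
        # "most descendants"); ties go to the larger total length.
        heavy = max(by_char, key=lambda c: (len(by_char[c]),
                                            sum(map(len, by_char[c]))))
        for c, subset in by_char.items():
            if c != heavy:
                # Off-path subtrie: drop the prefix and the branching char.
                children[(depth, c)] = build([s[depth + 1:] for s in subset])
        path.append(heavy)
        group, depth = by_char[heavy], depth + 1
    path.append(group[0][depth:])  # remaining tail of the last string
    return Node("".join(path), children)


def contains(node, s):
    """Exact-match lookup: one decomposed-trie node per mismatch."""
    while True:
        p, k = node.path, 0
        while k < len(p) and k < len(s) and p[k] == s[k]:
            k += 1
        if k == len(p) and k == len(s):
            return True
        if k == len(s):
            return False  # s is a proper prefix of the path
        child = node.children.get((k, s[k]))
        if child is None:
            return False
        node, s = child, s[k + 1:]  # recurse with the remaining suffix


def height(node):
    """Number of nodes on the longest root-to-leaf chain."""
    return 1 + max((height(c) for c in node.children.values()), default=0)


WORDS = ["three", "trial", "triangle", "trie", "triple", "triply"]
root = build(WORDS)
print(root.path, sorted(root.children), height(root))
```

On the six example strings the root path is "triangle", and looking up "triple" visits two nodes: the root, then the child reached through p at offset 3, where the suffix "le" matches the whole path.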

Succinct encoding

- [PODS 08] presents a succinct data structure for centroid path-decomposed tries
  - Not practical: needs complex operations on succinct trees
- We introduce a simpler and practical encoding
- This encoding also enables simple compression of the labels

Succinct encoding

[Figure: the root node of the path-decomposed trie, with the path "triangle" and the branching characters h, e, p, l hanging off it.]

L:  t 1 ri 2 a 1 ngle
BP: ( ((( )
B:  h e p l
(spaces added for clarity)

- Node label written literally, interleaved with the number of other branching characters at that point, in array L
- Corresponding branching characters in array B
- Tree encoded with DFUDS in bitvector BP
- Variant of Range Min-Max tree [ALENEX 10] to support Find{Close,Open}, more space-efficient (Range Min tree)
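As a rough illustration of how one node contributes to the three arrays, the sketch below rebuilds the example strings for the root node. It is a simplification under assumptions: `encode_node` is a hypothetical helper, branch counts are written as decimal digits (so labels must not contain digits themselves), and the DFUDS contribution of a node with d children is taken to be d open parentheses followed by one close parenthesis.

```python
def encode_node(path, branches):
    """Return the (L, B, BP) fragments for a single node.

    path     -- the characters along the node's path
    branches -- (offset, branching char) pairs leaving the path
    """
    label, chars = [], []
    for i in range(len(path) + 1):
        here = sorted(ch for off, ch in branches if off == i)
        if here:
            label.append(str(len(here)))  # "special character": branch count
            chars.extend(here)            # matching branching characters
        if i < len(path):
            label.append(path[i])         # path character, written literally
    bp = "(" * len(chars) + ")"           # DFUDS: degree in unary, then ")"
    return "".join(label), "".join(chars), bp


L, B, BP = encode_node("triangle", [(1, "h"), (3, "e"), (3, "p"), (4, "l")])
print(L, B, BP)
```

For the root node this gives `t1ri2a1ngle`, `hepl` and `(((()`, which agree with the L, B and BP strings shown for that node once the spaces are removed.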

Compression of L

[Figure: the concatenated labels, separated by $ (for example "…$…index.html$…html$…"), are rewritten as sequences of codeword ids ("…$…3 5$…5$…") using a dictionary in which codeword 3 stands for "index" and codeword 5 for ".html".]

- Dictionary codewords shared among labels
- Codewords do not cross label boundaries ($)
- Use vbyte to compress the codeword ids
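A minimal vbyte sketch follows. Byte-level conventions vary between implementations; this version assumes 7 payload bits per byte, least-significant group first, with the high bit set on every byte except the last. The function names are illustrative.

```python
def vbyte_encode(n):
    """Encode a non-negative integer in as few 7-bit groups as possible."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # more bytes follow
        n >>= 7
    out.append(n)                      # last byte: high bit clear
    return bytes(out)


def vbyte_decode(data, pos=0):
    """Decode one integer starting at pos; return (value, next position)."""
    n, shift = 0, 0
    while True:
        b = data[pos]
        pos += 1
        n |= (b & 0x7F) << shift
        if b < 0x80:
            return n, pos
        shift += 7


ids = [3, 5, 300]
stream = b"".join(vbyte_encode(i) for i in ids)
print(stream.hex())
```

Small ids, which a frequency-ordered dictionary would assign to the most common codewords, take a single byte; only ids of 128 and above need a second one.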

Compression of L

- Node labels (t1ri2a1ngle, l1e, …):
  - each label is a suffix of a string in the set
  - interleaved with a few "special characters" 1, 2, 3, …
- Compressible if the strings are compressible
- Dictionary and parsing computed with modified Re-Pair
- Domain-specific compression can be used instead
- Decompression overhead negligible
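The idea of a dictionary whose codewords never cross a label boundary can be illustrated with a naive Re-Pair loop. This is plain, quadratic-time Re-Pair with one added constraint (pairs touching the separator are never merged); it is not the authors' modified variant, and `repair`, `expand` and `max_rules` are names made up for this sketch.

```python
from collections import Counter

SEP = "$"  # label boundary; assumed not to occur inside any label


def repair(labels, max_rules=256):
    """Replace the most frequent adjacent pair until none repeats."""
    seq = []
    for label in labels:
        seq.extend(label)
        seq.append(SEP)
    rules = {}  # new symbol (int) -> (left symbol, right symbol)
    while len(rules) < max_rules:
        pairs = Counter((a, b) for a, b in zip(seq, seq[1:])
                        if a != SEP and b != SEP)
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        sym = len(rules)  # integers cannot clash with characters
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq, rules[sym] = out, pair
    return seq, rules


def expand(sym, rules):
    """Recursively expand a symbol back into the characters it stands for."""
    if sym in rules:
        left, right = rules[sym]
        return expand(left, rules) + expand(right, rules)
    return sym


labels = ["index.html", "a.html", "b.html", "index.html"]
seq, rules = repair(labels)
print(len(seq), "symbols instead of", sum(len(l) + 1 for l in labels))
```

Overlapping occurrences such as "aaa" are counted twice by this simple counter, so a rule may occasionally be created that is used only once; a careful implementation would count non-overlapping occurrences.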

Experimental results (time)

- Experiments show gains in time comparable to the gains in height
- Confirm that the bottleneck is traversal operations

                             Web Queries     URLs   Synthetic
Trie                                 3.5      7.0       119.8
Centroid trie                        2.4      4.3         5.1
Hollow trie [ALENEX 09]             16.6     22.4       462.7
Hollow trie                          7.2     13.9       137.1
Centroid hollow trie                 2.8      4.4        11.1

(microseconds, lower is better)

Code available at https://github.com/ot/path_decomposed_tries

Experimental results (space)

- For strings with many common prefixes, even a non-compressed trie is space-efficient
- Label compression considerably increases space-efficiency
- Decompression time overhead: ~10%

                                Web Queries     URLs   Synthetic
Hu-Tucker Front Coding                40.9%    24.4%       19.1%
Centroid trie                         55.6%    22.4%       17.9%
Centroid trie + compression           31.5%    13.6%        0.4%

(compression ratio, lower is better)

Thanks for your attention!

Questions?

References

[ALENEX 10] D. Arroyuelo, R. Cánovas, G. Navarro, and K. Sadakane. Succinct trees in practice. In ALENEX, pages 84–97, 2010.

[ALENEX 09] D. Belazzougui, P. Boldi, R. Pagh, and S. Vigna. Monotone minimal perfect hashing: searching a sorted table with O(1) accesses. In SODA, pages 785–794, 2009.

[PODS 08] P. Ferragina, R. Grossi, A. Gupta, R. Shah, and J. S. Vitter. On searching compressed string collections cache-obliviously. In PODS, pages 181–190, 2008.