Slide1
Fast Compressed Tries through Path Decompositions
Roberto Grossi
Giuseppe Ottaviano*
Università di Pisa
* Part of the work done while at Microsoft Research Cambridge
Slide2
Compacted tries
[Figure: compacted trie for the strings three, trial, triangle, trie, triple, triply. Legend: node label, branching character.]
Slide3
Applications
String dictionaries
  With prefix lookup, predecessor, …
  Exploit prefix compression
Monotone perfect hash functions
  “Hollow” or “Blind” tries [ALENEX 09]
  Binary tree (no need to store branching chars)
  No need to store node labels, just lengths (skips)
Slide4
Height vs. performance
Tries can be deep: no guarantee on height
Bad with pointer-based trees
  ~1 cache miss per child operation
Worse with succinct tree encodings
  Need to access several directories
  Many cache misses per child operation
  Large constants hidden in the O(1)
Slide5
Path decomposition
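[Figure: the compacted trie from the previous slide next to its path decomposition. The root of the decomposed trie holds the path t-r-i-an-gle, with branching characters h, e, p, l hanging off it. Query: triple. Recurse here with suffix le.]
Slide6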
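As a minimal sketch of the lookup illustrated on the path decomposition slide: each node of the decomposed trie stores a whole root-to-leaf path of the compacted trie, and a query is matched against that path until the first mismatch, where the search branches and continues with the remaining suffix. The names `PathNode` and `contains`, the dictionary keyed by (offset, branching character), and the assumption that the string set is prefix-free are illustrative choices, not taken from the slides or the released code.

```python
from dataclasses import dataclass, field


@dataclass
class PathNode:
    # Concatenation of node labels and branching characters along one path.
    path: str
    # (offset in path, branching character) -> child PathNode (hypothetical layout).
    children: dict = field(default_factory=dict)


def contains(node, query):
    """Return True if `query` is one of the stored strings (set assumed prefix-free)."""
    while node is not None:
        # Match the query against this node's path as far as possible.
        i = 0
        while i < len(node.path) and i < len(query) and node.path[i] == query[i]:
            i += 1
        if i == len(query):
            # Query consumed: it is stored only if the path is consumed too.
            return i == len(node.path)
        # Mismatch (or path exhausted): branch on the next query character
        # and continue with the suffix that follows it.
        node = node.children.get((i, query[i]))
        query = query[i + 1:]
    return False


# Decomposition drawn on the slide: the root path spells "triangle".
root = PathNode("triangle", {
    (1, "h"): PathNode("ree"),                        # three
    (3, "e"): PathNode(""),                           # trie
    (3, "p"): PathNode("le", {(1, "y"): PathNode("")}),  # triple, triply
    (4, "l"): PathNode(""),                           # trial
})
```

For the query triple, the loop matches tri against triangle, follows the branching character p at offset 3, and continues with the suffix le, as in the figure.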
Centroid path decomposition
Decompose along the heavy paths
  choose the edge that has most descendants
Height of the decomposed tree: O(log n)
  Usually lower

Average height
                       Web Queries   URLs   Synthetic
Compacted trie                11.0   18.1       504.4
Centroid trie                  5.2    6.2         2.8
Hollow trie                   50.8   67.3      1005.3
Centroid hollow trie           8.0    9.2         2.8
Slide7
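The heavy-path rule from the centroid path decomposition slide can be sketched as below. The sketch assumes that "most descendants" is measured by subtree node count and that ties go to the smaller branching character; both are assumptions, as are the names `TrieNode`, `PathNode`, `decompose`, and `height`. Because of the tie-breaking, the root path picked for the example strings (trial) differs from the one drawn in the earlier figure (triangle); the two subtrees tie at each step, so either is a valid heavy path here.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    label: str
    children: dict = field(default_factory=dict)  # branching char -> TrieNode


@dataclass
class PathNode:
    path: str
    children: dict = field(default_factory=dict)  # (offset, branching char) -> PathNode


def size(node):
    # Number of nodes in the subtree; recomputed on demand for brevity.
    return 1 + sum(size(c) for c in node.children.values())


def decompose(node):
    """Collapse the heavy path starting at `node` into a single PathNode."""
    path, hanging = "", {}
    while True:
        path += node.label
        if not node.children:
            break
        # Heavy edge: child with the largest subtree (first maximum in sorted order).
        heavy = max(sorted(node.children), key=lambda ch: size(node.children[ch]))
        for ch in sorted(node.children):
            if ch != heavy:
                # Light children hang off the path at the current offset.
                hanging[(len(path), ch)] = decompose(node.children[ch])
        path += heavy
        node = node.children[heavy]
    return PathNode(path, hanging)


def height(node):
    # Longest root-to-leaf distance in edges; works for both node types.
    return 1 + max((height(c) for c in node.children.values()), default=-1)


# Compacted trie for three, trial, triangle, trie, triple, triply.
trie = TrieNode("t", {
    "h": TrieNode("ree"),
    "r": TrieNode("i", {
        "a": TrieNode("", {"l": TrieNode(""), "n": TrieNode("gle")}),
        "e": TrieNode(""),
        "p": TrieNode("l", {"e": TrieNode(""), "y": TrieNode("")}),
    }),
})
root = decompose(trie)
```

On this small example the height drops from 3 (compacted trie) to 2 (decomposed trie); the gap reported in the table above comes from much larger string sets.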
Succinct encoding
[PODS 08] presents a succinct data structure for centroid path-decomposed tries
  Not practical: needs complex operations on succinct trees
We introduce a simpler and practical encoding
  This encoding also enables simple compression of the labels
Slide8
Succinct encoding
[Figure: a decomposed-trie node holding the path t-r-i-a-ngle, with branching characters h, e, p, l.]
L : t 1 ri 2 a 1 ngle
BP: ( ((( )
B : h epl
(spaces added for clarity)
Node label written literally, interleaved with number of other branching characters at that point, in array L
Corresponding branching characters in array B
Tree encoded with DFUDS in bitvector BP
Variant of Range Min-Max tree [ALENEX 10] to support Find{Close,Open}, more space-efficient (Range Min tree)
Slide9
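The per-node layout on the succinct encoding slide can be reproduced with a small sketch. It assumes the node's path string and the branching characters leaving each offset are already known. The function `encode_node` and its argument format are hypothetical; L is kept as a Python list mixing strings and integer counts rather than a packed byte array, and BP shows only the node's own parentheses (one open per child, then a close), not a full DFUDS sequence with Find{Close,Open} support. The order of characters in B simply follows the slide.

```python
def encode_node(path, hanging):
    """Build the L, BP and B pieces for one decomposed-trie node.

    `hanging` maps an offset in `path` to the list of branching characters
    that leave the path at that offset (hypothetical input format).
    """
    L, B, prev = [], [], 0
    for offset in sorted(hanging):
        chars = hanging[offset]
        if offset > prev:
            L.append(path[prev:offset])   # literal piece of the path
        L.append(len(chars))              # count of other branching characters here
        B.extend(chars)                   # the branching characters themselves
        prev = offset
    if prev < len(path):
        L.append(path[prev:])             # remainder of the path
    BP = "(" * len(B) + ")"               # node of degree len(B)
    return L, BP, "".join(B)


# Root node from the slide: path "triangle", with h at offset 1,
# e and p at offset 3, and l at offset 4.
L, BP, B = encode_node("triangle", {1: ["h"], 3: ["e", "p"], 4: ["l"]})
```

Here `L` comes out as t, 1, ri, 2, a, 1, ngle and `B` as hepl, matching the arrays shown on the slide.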
Compression of L
[Figure: $-separated node labels containing the substrings "index.html" and ".html" are rewritten as sequences of codeword ids ("3 5" and "5"); the dictionary maps 3 to "index" and 5 to ".html".]
Dictionary codewords shared among labels
Codewords do not cross label boundaries ($)
Use vbyte to compress the codeword ids
Slide10
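Variable-byte (vbyte) coding stores an integer in 7-bit groups, one group per byte, with the remaining bit marking whether more bytes follow. The sketch below uses the high bit as a continuation flag and emits the low-order group first; implementations differ on both conventions, so this is not necessarily the layout used in the authors' code. The function names are illustrative.

```python
def vbyte_encode(n):
    """Encode a non-negative integer as a variable-length byte string."""
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, high bit set: more bytes follow
        n >>= 7
    out.append(n)                      # last byte has the high bit clear
    return bytes(out)


def vbyte_decode(data, pos=0):
    """Decode one integer starting at `pos`; return (value, next position)."""
    n = shift = 0
    while True:
        b = data[pos]
        pos += 1
        n |= (b & 0x7F) << shift
        if b < 128:                    # high bit clear: this was the last byte
            return n, pos
        shift += 7


# Small codeword ids such as 3 and 5 take a single byte each.
encoded = b"".join(vbyte_encode(i) for i in [3, 5, 300])
```

Frequent codewords can be given small ids so that most of them fit in one byte.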
Compression of L
Node labels (t1ri2a1ngle, l1e, …):
  each label is a suffix of a string in the set
  interleaved with few “special characters” 1, 2, 3, …
Compressible if strings are compressible
Dictionary and parsing computed with modified Re-Pair
Domain-specific compression can be used instead
Decompression overhead negligible
Slide11
Experimental results (time)
Experiments show gains in time comparable to the gains in height
Confirm that bottleneck is traversal operations

                           Web Queries   URLs   Synthetic
Trie                               3.5    7.0       119.8
Centroid trie                      2.4    4.3         5.1
Hollow trie [ALENEX 09]           16.6   22.4       462.7
Hollow trie                        7.2   13.9       137.1
Centroid hollow trie               2.8    4.4        11.1
(microseconds, lower is better)

Code available at https://github.com/ot/path_decomposed_tries
Slide12
Experimental results (space)
For strings with many common prefixes, even non-compressed trie is space-efficient
Labels compression considerably increases space-efficiency
Decompression time overhead: ~10%

                              Web Queries    URLs   Synthetic
Hu-Tucker Front Coding              40.9%   24.4%       19.1%
Centroid trie                       55.6%   22.4%       17.9%
Centroid trie + compression         31.5%   13.6%        0.4%
(compression ratio, lower is better)

Code available at https://github.com/ot/path_decomposed_tries
Slide13
Thanks for your attention!
Questions?
Slide14
References
[ALENEX 10] D. Arroyuelo, R. Cánovas, G. Navarro, and K. Sadakane. Succinct trees in practice. In ALENEX, pages 84–97, 2010.
[ALENEX 09] D. Belazzougui, P. Boldi, R. Pagh, and S. Vigna. Monotone minimal perfect hashing: searching a sorted table with O(1) accesses. In SODA, pages 785–794, 2009.
[PODS 08] P. Ferragina, R. Grossi, A. Gupta, R. Shah, and J. S. Vitter. On searching compressed string collections cache-obliviously. In PODS, pages 181–190, 2008.