Space-Efficient Data Structures for Top-k Completion


Giuseppe Ottaviano (Università di Pisa), Bo-June (Paul) Hsu (Microsoft Research). WWW 2013. String auto-completion over scored string sets. Top-k completion query: given a prefix p, return the k strings prefixed by p with the highest scores.



Presentation Transcript

Slide1

Space-Efficient Data Structures for Top-k Completion

Giuseppe Ottaviano, Università di Pisa

Bo-June (Paul) Hsu, Microsoft Research

WWW 2013

Slide2

String auto-completion

Slide3

Scored string sets

Top-k completion query: given a prefix p, return the k strings prefixed by p with the highest scores.
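To make the query definition concrete, here is a minimal brute-force sketch in Python. It is not one of the data structures from the talk: it scans every string, and the names `SCORED` and `top_k_completions` are made up for illustration. The data is the six-string example set used throughout the slides.

```python
import heapq

# Example scored string set used in the slides.
SCORED = {"three": 2, "trial": 1, "triangle": 9,
          "trie": 5, "triple": 4, "triply": 3}


def top_k_completions(scored, prefix, k):
    """Return the k (string, score) pairs with the highest scores
    among the strings that start with `prefix` (brute force)."""
    matches = ((s, score) for s, score in scored.items()
               if s.startswith(prefix))
    # nlargest returns the results sorted by decreasing score.
    return heapq.nlargest(k, matches, key=lambda pair: pair[1])


print(top_k_completions(SCORED, "tr", 2))  # [('triangle', 9), ('trie', 5)]
```

This takes time linear in the size of the whole set for every query, which is what the structures in the following slides are designed to avoid.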

Example: p = “tr”, k = 2 returns (triangle, 9), (trie, 5)

String     Score
three      2
trial      1
triangle   9
trie       5
triple     4
triply     3

Slide4

Space-Efficiency

Scored string sets can be very large
Hundreds of millions of queries for web search auto-suggest
Must fit in RAM for fast access

Need space-efficient solutions!
We compare three solutions:
RMQ Trie, based on Range Minimum Queries
Completion Trie, based on a modified trie with variable-sized pointers
Score-Decomposed Trie, based on succinct data structures

Slide5

RMQ Trie (RT)

Slide6

RMQ Trie

Lexicographic order → strings starting with a given prefix form a contiguous range
If we can find the max in a range, it is the top-1

The range is then split into two subranges; we can proceed recursively, using a priority queue, to retrieve the top-k

String     Score
three      2
trial      1
triangle   9
trie       5
triple     4
triply     3

Slide7

RMQ Trie

To store the strings we can use any data structure that keeps the strings sorted
We use a compressed trie
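The range-splitting search described for the RMQ Trie can be sketched on a plain sorted list. This is only an illustration under stated assumptions: a Python list with `bisect` stands in for the compressed trie, a linear scan (`argmax`) stands in for the range query, and all function names are hypothetical.

```python
import bisect
import heapq


def prefix_range(strings, prefix):
    """Half-open range [lo, hi) of sorted `strings` starting with `prefix`.
    Assumes no string contains U+10FFFF, which is used as a sentinel."""
    lo = bisect.bisect_left(strings, prefix)
    hi = bisect.bisect_left(strings, prefix + "\U0010ffff")
    return lo, hi


def argmax(scores, lo, hi):
    """Index of the maximum in scores[lo:hi] (linear-scan stand-in)."""
    return max(range(lo, hi), key=scores.__getitem__)


def top_k(strings, scores, prefix, k):
    lo, hi = prefix_range(strings, prefix)
    heap = []  # entries: (-score, index of range maximum, range lo, range hi)

    def push(a, b):
        if a < b:
            m = argmax(scores, a, b)
            heapq.heappush(heap, (-scores[m], m, a, b))

    push(lo, hi)
    results = []
    while heap and len(results) < k:
        neg_score, m, a, b = heapq.heappop(heap)
        results.append((strings[m], -neg_score))
        # Split the range around the maximum and keep searching both halves.
        push(a, m)
        push(m + 1, b)
    return results


STRINGS = ["three", "trial", "triangle", "trie", "triple", "triply"]
SCORES = [2, 1, 9, 5, 4, 3]
print(top_k(STRINGS, SCORES, "tr", 2))  # [('triangle', 9), ('trie', 5)]
```

Each returned completion costs one pop and at most two range-maximum queries, so the speed of the range query dominates the query time.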

To find the max score in a range we use a succinct Range Minimum Query (RMQ) data structure
Needs only 2.6 additional bits per score, answers queries in O(log n) time
This is a standard technique, but not very fast. We use it as a baseline.

Slide8

Completion Trie (CT)

Slide9

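(Scored) compacted tries

[Figure: compacted trie of the example set (three 2, trial 1, triangle 9, trie 5, triple 4, triply 3). Each node has a node label (t, ree, i, l, gle, or ε) and each edge a branching character (h, r, a, e, p, l, n, e, y); the leaves carry the scores.]

Slide10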

Completion Trie

[Figure: the same scored compacted trie, with four additional scores (4, 9, 9, 9) on the internal nodes, consistent with each internal node holding the maximum score in its subtree.]

Slide11

Completion Trie

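[Figure: the Completion Trie, with the branching character merged into each node label and a score on every node (t 9, hree 2, ri 9, a 9, e 5, pl 4, e 4, y 3, ngle 9, l 1), shown next to the scored compacted trie from the previous slide.]

Slide12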

Completion Trie

Scores encoded differentially (either from parent or previous sibling)
Pointers and score deltas encoded with variable bytes
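As a rough illustration of differential scores stored in variable bytes, the sketch below uses a generic 7-bits-per-byte scheme. The slides do not give the exact byte layout, so the format, the function names, and the assumption that sibling scores are listed in non-increasing order are all illustrative choices, not the authors' implementation.

```python
def vbyte_encode(n):
    """Encode a non-negative integer with 7 payload bits per byte;
    the high bit is set on every byte except the last."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def vbyte_decode(buf, pos=0):
    """Decode one integer starting at buf[pos]; return (value, next pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def encode_child_scores(parent_score, child_scores):
    """First child is stored as a delta from the parent, each later child
    as a delta from its previous sibling (scores assumed non-increasing)."""
    out = bytearray()
    prev = parent_score
    for score in child_scores:
        out += vbyte_encode(prev - score)
        prev = score
    return bytes(out)


def decode_child_scores(parent_score, buf, count):
    scores, pos, prev = [], 0, parent_score
    for _ in range(count):
        delta, pos = vbyte_decode(buf, pos)
        prev -= delta
        scores.append(prev)
    return scores


# Node "ri" (score 9) in the example has children with scores 9, 5, 4.
encoded = encode_child_scores(9, [9, 5, 4])
print(list(encoded))                        # [0, 4, 1]
print(decode_child_scores(9, encoded, 3))   # [9, 5, 4]
```

Small deltas fit in a single byte each, which is the point of combining the two techniques.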

All node information in the same stream, favoring cache-efficiency

[Figure: the node stream t 9, hree 2, ri 9, a 9, e 5, pl 4, e 4, y 3, ngle 9, l 1]

Slide13

Score-Decomposed Trie (SDT)

Slide14

Can we save more space?

Completion Trie representation consists of:
Tree structure (pointers)
Node labels (strings)

Slide15

Trees as balanced parentheses

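[Figure: a tree and its parentheses encoding. Each leaf is (), a node with three leaf children is (()()()), and the whole tree is (()(()()())).]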

2n bits are sufficient (and necessary) to represent a tree
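A small sketch of the encoding, assuming a tree is given as nested Python lists (a node is the list of its children). The helper names are hypothetical, and the matching-parenthesis search is a plain linear scan, kept simple for clarity.

```python
def to_bp(children):
    """Encode a tree as balanced parentheses: each node contributes one
    '(' before its children and one ')' after them, 2 symbols per node."""
    return "(" + "".join(to_bp(child) for child in children) + ")"


def find_close(bp, i):
    """Position of the ')' matching the '(' at position i (linear scan)."""
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced parentheses")


def subtree_size(bp, i):
    """Number of nodes in the subtree whose '(' is at position i."""
    return (find_close(bp, i) - i + 1) // 2


# Root with two children: a leaf, and a node with three leaf children.
tree = [[], [[], [], []]]
bp = to_bp(tree)
print(bp)                    # (()(()()()))
print(len(bp))               # 12 symbols for 6 nodes, i.e. 2n bits
print(subtree_size(bp, 3))   # 4
```

Writing '(' as 1 and ')' as 0 turns the string into a bit vector of exactly 2n bits for a tree with n nodes.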

Can support O(1) operations with 2n + o(n) bits

Slide16

Score-decomposed trie

Builds on compressed path-decomposed tries [Grossi-Ottaviano ALENEX 2012]
Parentheses-based representation of trees
Dictionary-compression of node labels

Slide17

Score-decomposed trie

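[Figure: the scored compacted trie from the Completion Trie slides, next to the root node of the score-decomposed trie: the path t r i an gle with score 9 and branches (h, 2), (e, 5), (p, 4), (l, 1).]

L : t 1 ri 2 a 1 ngle
BP: ( ((( )
B : h epl
R : 2 541

Slide18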

String     Score
three      2
trial      1
triangle   9
trie       5
triple     4
triply     3

Slide19

Score compression

... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ...

3 bits/value, 11 bits/value, 16 bits/value (one width per block)

Data structure to store scores in RT and SDT

Packed-blocks array

“Folklore” data structure, similar to many existing packed arrays, Frame-Of-Reference, PFORDelta, …

Divide the array into fixed-size blocks

Encode the values of each block with the same number of bits

Store separately the block offsets

Slide20
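The three steps of the packed-blocks array (fixed-size blocks, one bit width per block, separately stored block offsets) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the class name, the use of a single Python integer as the bit stream, and the block size of 6 are assumptions made for illustration, and values are assumed to be non-negative.

```python
class PackedBlocks:
    """Array of non-negative integers; every value in a block is stored
    with the same number of bits, chosen from the block's largest value."""

    def __init__(self, values, block_size=6):
        self.block_size = block_size
        self.length = len(values)
        self.widths = []    # bits per value, one entry per block
        self.offsets = []   # starting bit position of each block
        self.bits = 0       # the packed bit stream, held in one Python int
        pos = 0
        for start in range(0, len(values), block_size):
            block = values[start:start + block_size]
            width = max(v.bit_length() for v in block)
            self.widths.append(width)
            self.offsets.append(pos)
            for v in block:
                self.bits |= v << pos
                pos += width
        self.total_bits = pos

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        block, slot = divmod(i, self.block_size)
        width = self.widths[block]
        start = self.offsets[block] + slot * width
        return (self.bits >> start) & ((1 << width) - 1)


# The visible part of the example sequence from the slide.
values = [3, 5, 1, 2, 3, 0, 0, 1, 2, 4, 1900, 1, 1, 2, 3, 2, 1, 10000]
packed = PackedBlocks(values)
print(packed.widths)   # [3, 11, 14]
print(packed[10])      # 1900
```

The sketch uses the exact bit length of each block's largest value, so the last block comes out at 14 bits; it does not try to reproduce the 16 bits/value shown on the slide, whose sequence is elided on both sides. Random access needs only the block's offset and width, which is why the offsets are stored separately.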

Score compression

... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ...

3 bits/value, 11 bits/value, 16 bits/value (one width per block)

Can be unlucky

Each block may contain a large value

But scores are power-law distributed

Also, tree-wise monotone sorting

On average, 4 bits per score

Slide21

Space

Dataset    gzip   CT    SDT   RT
QueriesA   27%    57%   30%   31%
QueriesB   25%    48%   26%   27%
URLs       24%    57%   26%   27%
Unigrams   39%    43%   35%   37%

Compression ratio with respect to raw data

Slide22

Space

Slide23

Time

Time per returned completion on a top-10 query

Slide24

Thanks for your attention!

Questions?