Space-Efficient Data Structures for Top-k Completion


Giuseppe Ottaviano (Università di Pisa) and Bo-June (Paul) Hsu (Microsoft Research), WWW 2013. Top-k completion query: given a prefix p, return the k strings prefixed by p with the highest scores.


Slide1

Space-Efficient Data Structures for Top-k Completion

Giuseppe Ottaviano, Università di Pisa

Bo-June (Paul) Hsu, Microsoft Research

WWW 2013

Slide2

String auto-completion

Slide3

Scored string sets

Top-k Completion query: given a prefix p, return the k strings prefixed by p with the highest scores

Example: p = “tr”, k = 2 returns (triangle, 9), (trie, 5)

String    Score
three     2
trial     1
triangle  9
trie      5
triple    4
triply    3
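
As a reference point for the query definition above, here is a minimal brute-force sketch in Python. The function name `top_k_naive` and the dictionary layout are assumptions made for illustration. It scans every string, which is what the data structures in the following slides avoid.

```python
import heapq

# Scored string set from the slide.
SCORED = {
    "three": 2, "trial": 1, "triangle": 9,
    "trie": 5, "triple": 4, "triply": 3,
}

def top_k_naive(scored, prefix, k):
    """Return the k highest-scoring (string, score) pairs that start with prefix."""
    matches = ((s, score) for s, score in scored.items() if s.startswith(prefix))
    return heapq.nlargest(k, matches, key=lambda pair: pair[1])

print(top_k_naive(SCORED, "tr", 2))  # [('triangle', 9), ('trie', 5)]
```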

Slide4

Space-Efficiency

Scored string sets can be very large

Hundreds of millions of queries for web search auto-suggest

Must fit in RAM for fast access

Need space-efficient solutions!

We compare three solutions

RMQ Trie, based on Range Minimum Queries

Completion Trie, based on a modified trie with variable-sized pointers

Score-Decomposed Trie, based on succinct data structures

Slide5

RMQ Trie (RT)

Slide6

RMQ Trie

Lexicographic order → strings starting with a given prefix form a contiguous range

If we can find the max in a range, it is the top-1

The range is then split into two subranges; we can proceed recursively, using a priority queue to retrieve the top-k

String    Score
three     2
trial     1
triangle  9
trie      5
triple    4
triply    3
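
The range-splitting idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a sorted Python list stands in for the trie, a plain sparse table (O(n log n) words) stands in for the succinct RMQ structure, and the prefix range is found with `bisect` using U+10FFFF as an upper sentinel, which assumes no stored string contains that character. All function names are hypothetical.

```python
import bisect
import heapq

STRINGS = ["three", "trial", "triangle", "trie", "triple", "triply"]  # sorted
SCORES = [2, 1, 9, 5, 4, 3]

def build_sparse_table(scores):
    """table[j][i] = index of the max score in scores[i : i + 2**j]."""
    n = len(scores)
    table = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev, half = table[-1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if scores[a] >= scores[b] else b)
        table.append(row)
        j += 1
    return table

def range_argmax(scores, table, lo, hi):
    """Index of the max score in the inclusive range [lo, hi]."""
    j = (hi - lo + 1).bit_length() - 1
    a, b = table[j][lo], table[j][hi - (1 << j) + 1]
    return a if scores[a] >= scores[b] else b

def top_k_rmq(strings, scores, table, prefix, k):
    # Strings sharing the prefix occupy the contiguous range [lo, hi].
    lo = bisect.bisect_left(strings, prefix)
    hi = bisect.bisect_left(strings, prefix + "\U0010ffff") - 1
    heap = []

    def push(a, b):
        # Queue a subrange keyed by its maximum score (negated for a min-heap).
        if a <= b:
            m = range_argmax(scores, table, a, b)
            heapq.heappush(heap, (-scores[m], m, a, b))

    push(lo, hi)
    out = []
    while heap and len(out) < k:
        neg, m, a, b = heapq.heappop(heap)
        out.append((strings[m], -neg))
        push(a, m - 1)   # left of the reported maximum
        push(m + 1, b)   # right of the reported maximum
    return out

TABLE = build_sparse_table(SCORES)
print(top_k_rmq(STRINGS, SCORES, TABLE, "tr", 2))  # [('triangle', 9), ('trie', 5)]
```

Each reported completion costs one pop and at most two pushes, so the heap never holds more than k + 1 subranges.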

Slide7

RMQ Trie

To store the strings we can use any data structure that keeps the strings sorted

We use a compressed trie

To find the max score in a range we use a succinct Range Minimum Query (RMQ) data structure (a maximum query is a minimum query on negated scores)

Needs only 2.6 additional bits per score, answers queries in O(log n) time

This is a standard technique, but not very fast. We use it as a baseline.

Slide8

Completion trie (CT)

Slide9

(Scored) compacted tries

[Figure: compacted trie over the scored strings below. Legend: node label, branching character. The leaves carry the scores 2, 1, 4, 3, 5, 9.]

String    Score
three     2
trial     1
triangle  9
trie      5
triple    4
triply    3

Slide10

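Completion Trie

[Figure: the same compacted trie. The leaves carry the scores 2, 1, 4, 3, 5, 9, and the internal nodes are additionally labeled with the scores 4, 9, 9, 9.]

String    Score
three     2
trial     1
triangle  9
trie      5
triple    4
triply    3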

Slide11

Completion Trie

[Figure: the trie redrawn with each node shown as a label and a score: t 9; hree 2; ri 9; a 9; e 5; pl 4; e 4; y 3; ngle 9; l 1.]
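
The node scores in the figure (t 9, ri 9, a 9, pl 4) match the highest score found below each node. Assuming that is what each node stores, one way to answer top-k queries is a best-first search driven by a priority queue. The sketch below is an illustration only: it uses an uncompacted, pointer-based trie with one character per edge rather than the compact layout of the slides, it assumes non-negative scores, and the names `Node`, `insert`, and `top_k` are hypothetical.

```python
import heapq
from itertools import count

class Node:
    def __init__(self):
        self.children = {}    # branching character -> Node
        self.score = None     # score of the string ending here, if any
        self.max_score = -1   # highest score in this subtree (scores assumed >= 0)

def insert(root, s, score):
    node = root
    node.max_score = max(node.max_score, score)
    for ch in s:
        node = node.children.setdefault(ch, Node())
        node.max_score = max(node.max_score, score)
    node.score = score

def top_k(root, prefix, k):
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    tie = count()  # unique tie-breaker, so the heap never compares Node objects
    # Entries: (-priority, tie, string so far, node), with node=None for a finished string.
    heap = [(-node.max_score, next(tie), prefix, node)]
    out = []
    while heap and len(out) < k:
        neg, _, s, n = heapq.heappop(heap)
        if n is None:
            out.append((s, -neg))  # nothing left in the heap can score higher
            continue
        if n.score is not None:
            heapq.heappush(heap, (-n.score, next(tie), s, None))
        for ch, child in n.children.items():
            heapq.heappush(heap, (-child.max_score, next(tie), s + ch, child))
    return out

root = Node()
for s, score in [("three", 2), ("trial", 1), ("triangle", 9),
                 ("trie", 5), ("triple", 4), ("triply", 3)]:
    insert(root, s, score)

print(top_k(root, "tr", 2))  # [('triangle', 9), ('trie', 5)]
```

Because every heap entry is keyed by an upper bound on the scores reachable through it, the search only expands nodes that can still contribute to the answer.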

Slide12

Completion Trie

Scores encoded differentially (either from parent or previous sibling)

Pointers and score deltas encoded with variable bytes

All node information in the same stream, favoring cache-efficiency

[Figure: the completion trie nodes with labels and scores, as on the previous slide: t 9; hree 2; ri 9; a 9; e 5; pl 4; e 4; y 3; ngle 9; l 1.]
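
To make the differential, variable-byte encoding concrete, here is a small sketch. The slides do not give the byte layout, so this uses a common scheme as an assumption: 7 payload bits per byte, low-order group first, with the high bit as a continuation flag. It also assumes that sibling scores are stored in decreasing order, with the first delta taken from the parent and each later delta taken from the previous sibling. All function names are hypothetical.

```python
def varbyte_encode(n):
    """Encode a non-negative integer, 7 bits per byte, low-order group first."""
    if n < 0:
        raise ValueError("varbyte_encode expects a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def varbyte_decode(buf, pos=0):
    """Decode one integer starting at buf[pos]; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos

def encode_child_scores(parent_score, child_scores):
    """Delta-encode sibling scores (assumed sorted in decreasing order)."""
    stream = bytearray()
    prev = parent_score
    for s in child_scores:
        stream += varbyte_encode(prev - s)
        prev = s
    return bytes(stream)

def decode_child_scores(parent_score, stream, n):
    scores, prev, pos = [], parent_score, 0
    for _ in range(n):
        delta, pos = varbyte_decode(stream, pos)
        prev -= delta
        scores.append(prev)
    return scores

# Children of the "ri" node (score 9) in the example: a 9, e 5, pl 4.
encoded = encode_child_scores(9, [9, 5, 4])
print(list(encoded))                        # [0, 4, 1]
print(decode_child_scores(9, encoded, 3))   # [9, 5, 4]
```

Small deltas fit in a single byte each, which is where the space saving over fixed-width scores comes from.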

Slide13

Score-Decomposed Trie (SDT)

Slide14

Can we save more space?

Completion Trie representation consists of

Tree structure (pointers)

Node labels (strings)

Slide15

Trees as balanced parentheses

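[Figure: a tree built up from its parentheses encoding. Each leaf is (), a node with three leaf children is (()()()), and the whole tree is (()(()()())).]

2n bits are sufficient (and, up to lower-order terms, necessary) to represent a tree with n nodes

Navigation operations can be supported in O(1) time with 2n + o(n) bits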
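
A short sketch of the encoding may help. It represents an ordered tree as nested Python lists (a node is the list of its children), writes one `(` on entering a node and one `)` on leaving it, and navigates by matching parentheses. The matching here is a linear scan for clarity; the O(1) bound on the slide relies on an auxiliary o(n)-bit index that this sketch does not implement. Function names are hypothetical.

```python
def to_bp(tree):
    """Encode an ordered tree (a node is the list of its children) as parentheses."""
    return "(" + "".join(to_bp(child) for child in tree) + ")"

def find_close(bp, i):
    """Position of the ')' matching the '(' at position i (linear scan)."""
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced parentheses")

def children(bp, i):
    """Positions of the opening parentheses of the children of the node at i."""
    out = []
    j = i + 1
    while bp[j] == "(":
        out.append(j)
        j = find_close(bp, j) + 1
    return out

# The tree from the slide: a root with a leaf child and a child that has three leaves.
tree = [[], [[], [], []]]
bp = to_bp(tree)
print(bp)               # (()(()()()))
print(len(bp))          # 12, i.e. 2 bits per node for 6 nodes
print(children(bp, 0))  # [1, 3]
```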

Slide16

Score-decomposed trie

Builds on compressed path-decomposed tries [Grossi-Ottaviano ALENEX 2012]

Parentheses-based representation of trees

Dictionary-compression of node labels

Slide17

Score-decomposed trie

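[Figure: the compacted trie with leaf scores 2, 1, 4, 3, 5, 9, next to its score-decomposed form: the path t, r, i, an, gle with score 9, and the branches (h, 2), (e, 5), (p, 4), (l, 1).]

L: t 1 ri 2 a 1 ngle

BP: ( ((( )

B: h epl

R: 2 541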

Slide18

String    Score
three     2
trial     1
triangle  9
trie      5
triple    4
triply    3

Slide19

Score compression

... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ...

[Figure: the sequence above split into blocks, labeled 3 bits/value, 11 bits/value, and 16 bits/value.]

Data structure to store scores in RT and SDT

Packed-blocks array

“Folklore” data structure, similar to many existing packed arrays, Frame-Of-Reference, PFORDelta, …

Divide the array into fixed-size blocks

Encode the values of each block with the same number of bits

Store separately the block offsets
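
The three steps above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: each block's width is taken as the smallest number of bits that fits its largest value (an implementation may round widths differently), values are assumed non-negative, and a Python integer is used as the bit buffer. The class name `PackedBlocks` and the sample values are made up for the example.

```python
class PackedBlocks:
    """Fixed-size blocks, one bit width per block, block offsets stored separately."""

    def __init__(self, values, block_size=4):
        self.block_size = block_size
        self.n = len(values)
        self.widths = []    # bits per value in each block
        self.offsets = []   # starting bit position of each block
        self.bits = 0       # all blocks concatenated into one big integer
        pos = 0
        for start in range(0, self.n, block_size):
            block = values[start:start + block_size]
            width = max(v.bit_length() for v in block)
            self.widths.append(width)
            self.offsets.append(pos)
            for v in block:
                self.bits |= v << pos
                pos += width
        self.total_bits = pos

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        block, rank = divmod(i, self.block_size)
        width = self.widths[block]
        pos = self.offsets[block] + rank * width
        return (self.bits >> pos) & ((1 << width) - 1)

values = [3, 5, 1, 2, 0, 7, 900, 2]
packed = PackedBlocks(values, block_size=4)
print(packed.widths)      # [3, 10]
print(packed.total_bits)  # 52
print(packed[6])          # 900
```

Random access needs only the block's offset and width, so a lookup is a constant number of arithmetic operations.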

Slide20

Score compression

... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ...

[Figure: the sequence above split into blocks, labeled 3 bits/value, 11 bits/value, and 16 bits/value.]

Can be unlucky

Each block may contain a large value

But scores are power-law distributed

Also, tree-wise monotone sorting

On average, 4 bits per score

Slide21

Space

Compression ratio with respect to raw data

Dataset   gzip  CT   SDT  RT
QueriesA  27%   57%  30%  31%
QueriesB  25%   48%  26%  27%
URLs      24%   57%  26%  27%
Unigrams  39%   43%  35%  37%

Slide22

Space

Slide23

Time

Time per returned completion on a top-10 query

Slide24

Thanks for your attention!

Questions?

