UPI: A Primary Index for Uncertain Databases

Uploaded On 2017-06-30


VLDB 10. Hideaki Kimura (BrownU), Samuel Madden (MIT), Stanley B. Zdonik (BrownU). Speaker: Yinuo Zhang. Supervisor: Dr. Reynold Cheng.

Presentation Transcript

Slide1

UPI: A Primary Index for Uncertain Databases (VLDB 10)

Hideaki Kimura (BrownU)

Samuel Madden (MIT)

Stanley B. Zdonik (BrownU)

Speaker: Yinuo Zhang

Supervisor: Dr. Reynold Cheng

Slide2

Outline

Introduction

Uncertain Primary Index (UPI)

Secondary Index on UPI

Experiments

Conclusion and Future Work

Slide3

Introduction

Table Author

Name  | Institution^p             | Existence
Alice | Brown: 80%, MIT: 20%      | 90%
Bob   | MIT: 95%, UCB: 5%         | 100%
Carol | Brown: 60%, U. Tokyo: 40% | 80%

A Possible World

Name  | Institution
Alice | Brown
Bob   | MIT

The probability of such a world: 90%*80% * 100%*95% * 20% = 13.7% (Alice exists and is at Brown, Bob exists and is at MIT, Carol does not exist)

Query answering over Possible World Semantics

SELECT * FROM Author WHERE Institution=MIT

Threshold: confidence QT
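The arithmetic on this slide can be reproduced with a short Python sketch. It assumes tuples are independent and that a tuple's confidence for a value is its existence probability times the probability of that value; the names `AUTHORS`, `world_probability` and `threshold_query` are illustrative and do not come from the talk.

```python
# Author table from the slide: name -> (institution alternatives, existence probability).
AUTHORS = {
    "Alice": ({"Brown": 0.80, "MIT": 0.20}, 0.90),
    "Bob": ({"MIT": 0.95, "UCB": 0.05}, 1.00),
    "Carol": ({"Brown": 0.60, "U. Tokyo": 0.40}, 0.80),
}


def world_probability(world):
    """Probability of one possible world, given as name -> institution.

    Tuples missing from `world` are taken to be absent from that world.
    Assumption: tuples are independent of each other.
    """
    prob = 1.0
    for name, (alternatives, existence) in AUTHORS.items():
        if name in world:
            prob *= existence * alternatives[world[name]]
        else:
            prob *= 1.0 - existence
    return prob


def threshold_query(institution, qt):
    """Tuples whose confidence of matching `institution` is at least `qt`."""
    result = {}
    for name, (alternatives, existence) in AUTHORS.items():
        confidence = existence * alternatives.get(institution, 0.0)
        if confidence >= qt:
            result[name] = confidence
    return result


# The possible world shown on the slide: Alice at Brown, Bob at MIT, no Carol.
print(round(world_probability({"Alice": "Brown", "Bob": "MIT"}), 4))
# SELECT * FROM Author WHERE Institution=MIT with QT = 50%.
print(threshold_query("MIT", 0.5))
```

Lowering QT to 10% would also return Alice, whose confidence for MIT is 90% * 20% = 18%.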

Slide4

Ex) DBLP with Uncertain Affiliation

DBLP: 1.3M Papers and 0.7M Authors

Complemented Author Affiliation

GoogleAPI

Rank | URL
1    | Wisc.edu/…
2    | Microsoft.com/…
3    | Columbia.edu/…

Name         | Institution^p                             | Country^p
David DeWitt | Wisconsin: 40%, MS: 20%, Columbia: 13%, … | US: 100%

Zipfian Distribution

Name         | Inst.
David DeWitt | ?

q = “David DeWitt”

Slide5

Introduction

Achieving an efficient implementation using possible world semantics is difficult.

Probabilistic Inverted Index [Singh07]: a secondary index

Institution | Pointer
Brown       | [Alice] 0.3, [Carol] 0.2, [Bob] 0.4
MIT         | [Bob] 0.3, [Alice] 0.8

Heap

Disk Seeking

Slide6

Introduction

Build A Primary Index Over Uncertain Attributes

Goal

Primary Index

Seq. Read

Slide7

Challenges on Building PI over Uncertain Data

Cluster on most probable possible value?

Cluster | Tuples
Brown   | Alice, Carol
MIT     | Bob

SELECT … WHERE Inst.=MIT

Alice?

Replicate tuples into inverted index?

Cluster | Tuples
Brown   | Alice, Carol, …
MIT     | Alice, Bob, …

Too Large for Long-tail distribution (e.g., 100 values with 0.1%)

Name  | Institution^p             | Existence
Alice | Brown: 80%, MIT: 20%      | 90%
Bob   | MIT: 95%, UCB: 5%         | 100%
Carol | Brown: 60%, U. Tokyo: 40% | 80%

Slide8

UPI: Heap + Cutoff Index

Heap: Sorted by (Inst., Prob)

Institution    | Tuple
Brown (72%)    | Alice
Brown (48%)    | Carol
MIT (95%)      | Bob
MIT (18%)      | Alice
UCB (5%)       | Bob
U. Tokyo (32%) | Carol

Cutoff Index: Sorted by (Inst., Prob)

Institution | TupleID | Pointer
UCB (5%)    | Bob     | MIT

Cutoff: Entries with Less than C probability (Cutoff Threshold)
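A minimal Python sketch of how the heap and cutoff index on this slide could be derived from the Author table. It rests on several assumptions: each entry's probability is existence times value probability, entries below C are kept only in the cutoff index, a tuple's most probable value always stays in the heap, and the cutoff pointer names that most probable value. The function `build_upi` and its layout are hypothetical and are not the authors' implementation.

```python
AUTHORS = {
    "Alice": ({"Brown": 0.80, "MIT": 0.20}, 0.90),
    "Bob": ({"MIT": 0.95, "UCB": 0.05}, 1.00),
    "Carol": ({"Brown": 0.60, "U. Tokyo": 0.40}, 0.80),
}


def build_upi(authors, cutoff):
    """Split (institution, probability, tuple) entries into heap and cutoff index."""
    heap = []          # entries stored in the clustered heap
    cutoff_index = []  # low-probability entries, each with a pointer into the heap
    for name, (alternatives, existence) in authors.items():
        entries = [(inst, round(existence * p, 4)) for inst, p in alternatives.items()]
        # Assumption: the pointer targets the tuple's most probable institution,
        # which is always kept in the heap.
        first_inst = max(entries, key=lambda e: e[1])[0]
        for inst, prob in entries:
            if prob < cutoff and inst != first_inst:
                cutoff_index.append((inst, prob, name, first_inst))
            else:
                heap.append((inst, prob, name))

    # Sorted by (Inst., Prob): institution ascending, probability descending.
    def by_inst_then_prob(entry):
        return (entry[0], -entry[1])

    heap.sort(key=by_inst_then_prob)
    cutoff_index.sort(key=by_inst_then_prob)
    return heap, cutoff_index


heap, cutoff_index = build_upi(AUTHORS, cutoff=0.10)
for entry in heap:
    print("heap  ", entry)
for entry in cutoff_index:
    print("cutoff", entry)
```

With C = 10%, only Bob's UCB (5%) entry falls below the threshold, so it is the single cutoff index entry and points at MIT.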

Name  | Institution^p             | Existence
Alice | Brown: 80%, MIT: 20%      | 90%
Bob   | MIT: 95%, UCB: 5%         | 100%
Carol | Brown: 60%, U. Tokyo: 40% | 80%

Slide9

Answering Queries with UPI

Probabilistic Threshold Query (PTQ)

Institution | Tuple
MIT (95%)   | Bob
…
UCB (90%)   | Dan
UCB (20%)   | Emily

Institution | TupleID | Pointer
UCB (5%)    | Bob     | MIT

C=10%

If QT<C (e.g., QT=5%), follow Cutoff pointers

If QT≥C (e.g., QT=20%), Sequentially Read

Seek
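The two cases above can be sketched in Python using the entries shown on this slide. The linear filter stands in for a sequential read of one clustered heap range, and counting one seek per followed cutoff pointer is a simplifying assumption; `probabilistic_threshold_query` is a hypothetical name.

```python
C = 0.10  # cutoff threshold from the slide

# Heap entries (institution, probability, tuple), clustered by institution.
HEAP = [("MIT", 0.95, "Bob"), ("UCB", 0.90, "Dan"), ("UCB", 0.20, "Emily")]

# Cutoff index entries (institution, probability, tuple, pointer into the heap).
CUTOFF_INDEX = [("UCB", 0.05, "Bob", "MIT")]


def probabilistic_threshold_query(institution, qt):
    """Return (matches, seeks) for Institution = `institution` with probability >= qt."""
    # Stand-in for one sequential read over the heap range of `institution`.
    matches = [(name, prob) for inst, prob, name in HEAP
               if inst == institution and prob >= qt]
    seeks = 0
    if qt < C:
        # Entries below C live only in the cutoff index. Each hit costs one
        # extra seek to fetch the tuple from the heap region its pointer names.
        for inst, prob, name, pointer in CUTOFF_INDEX:
            if inst == institution and prob >= qt:
                assert any(h[0] == pointer and h[2] == name for h in HEAP)
                matches.append((name, prob))
                seeks += 1
    return matches, seeks


print(probabilistic_threshold_query("UCB", 0.20))  # QT >= C: sequential read only
print(probabilistic_threshold_query("UCB", 0.05))  # QT < C: also follows cutoff pointers
```

When QT is at least C, the cutoff index can be skipped entirely, because every entry it holds has probability below C and therefore below QT.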

SELECT * FROM Author WHERE Inst.=UCB

With: Probability ≥ QT (Query Threshold)

Slide10

Choosing Cutoff Threshold

Faster, but Larger

Slower, but Smaller

Cutoff Threshold C

SELECT * FROM Author WHERE Institution=Ishikawa U

Threshold: confidence QT (QT is given at runtime)

Slide11

Determining C

Based on Value/Probability Histograms

Histograms (Inst.)

Value | #Keys
…     | …
Br*   | 30,000
Bs*   | 31,000
Bt*   | 30,500
…     | …

#Pointers

Available Disk Capability -> UPI Size

Query cost = Cost_fullscan * Selectivity + Cost_seek * #Pointers
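As a rough illustration, the cost expression on this slide can be written as a small Python function, together with its inversion for a runtime budget. The function names and all numeric inputs below are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the talk.

```python
import math


def estimated_query_cost(cost_fullscan, selectivity, cost_seek, num_pointers):
    """Cost_fullscan * Selectivity + Cost_seek * #Pointers, in the unit of the inputs."""
    return cost_fullscan * selectivity + cost_seek * num_pointers


def max_pointers_within(budget, cost_fullscan, selectivity, cost_seek):
    """Largest #Pointers whose estimated cost stays within `budget`.

    Obtained by solving the cost expression for #Pointers. Returns 0 when the
    sequential part alone already exceeds the budget.
    """
    remaining = budget - cost_fullscan * selectivity
    if remaining <= 0:
        return 0
    return math.floor(remaining / cost_seek)


# Hypothetical inputs: 8,000 ms full scan, 25% selectivity, 10 ms per seek.
print(estimated_query_cost(8000.0, 0.25, 10.0, 300))
print(max_pointers_within(2400.0, 8000.0, 0.25, 10.0))
```

Given a tolerable runtime, the second function bounds how many cutoff pointers a query may follow, which in turn constrains how large C can be for a given probability histogram.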

Prob.   | #Keys
…       | …
10%-15% | 15,000
15%-25% | 28,000
25%-40% | 33,000

C

Tolerable average query runtime

C ?

Slide12

#Pointers and Query Cost

Cost Model (replace Ishikawa U with Stanford)

Saturation

Logistic function

Slide13

Secondary Index on UPI

Store Multiple Pointers

Tailored Access

Name  | Institution^p             | Country^p            | Existence
Alice | Brown: 80%, MIT: 20%      | US: 100%             | 90%
Bob   | MIT: 95%, UCB: 5%         | US: 100%             | 100%
Carol | Brown: 60%, U. Tokyo: 40% | US: 60%, Japan: 40%  | 80%

Institution | Tuple
Brown (72%) | Alice
…
MIT (95%)   | Bob
MIT (18%)   | Alice

Country   | TupleID | Pointers
US (100%) | Bob     | MIT
US (90%)  | Alice   | Brown, MIT
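The slide states only that the secondary index stores multiple pointers per tuple and that access is tailored. One way this could reduce seeks is sketched below in Python: prefer a pointer into a heap region the query already has to visit. This greedy rule, and the names `SECONDARY_INDEX` and `tailored_access`, are assumptions made for illustration and should not be read as the authors' algorithm.

```python
# Secondary index on Country from the slide:
# country -> list of (probability, tuple id, heap regions holding a copy of the tuple).
SECONDARY_INDEX = {
    "US": [(1.00, "Bob", ["MIT"]), (0.90, "Alice", ["Brown", "MIT"])],
}


def tailored_access(country):
    """Pick one heap region per matching tuple, reusing regions where possible."""
    entries = SECONDARY_INDEX.get(country, [])
    # Regions that must be read anyway because some tuple has only one copy.
    chosen_regions = {ptrs[0] for _, _, ptrs in entries if len(ptrs) == 1}
    plan = {}
    for _, tuple_id, ptrs in entries:
        # Prefer a region already in the plan; otherwise fall back to the first pointer.
        target = next((p for p in ptrs if p in chosen_regions), ptrs[0])
        chosen_regions.add(target)
        plan[tuple_id] = target
    return plan


print(tailored_access("US"))
```

For Country=US, Bob can only be fetched from the MIT region, so under this rule Alice is fetched from her MIT copy as well and the Brown region is never touched.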

SELECT * FROM Author WHERE Country=US

Secondary Index on (Country)

Slide14

Experiments

Environments

C++ & BerkeleyDB 4.7 on Fedora Core 11

Quad-Core, 4GB RAM, 10k RPM SATA HDD

Dataset: DBLP w/ Uncertain Affiliation

700k authors and 1.3M publications

SwetoDblp, Google API (up to ten institutions per author)

Compared With PII

Slide15

Query Runtime: PII vs. UPI

Q1: SELECT * FROM Author WHERE Institution=x

UPI Causes Far Fewer Disk Seeks

Operation    | Elapsed
Read PII     | 5 [ms]
Sort Pointer | 30 [µs]
Read Heap    | 5,200 [ms]

Operation | Elapsed
Read UPI  | 47 [ms]

Slide16

Secondary Index Access

Q2: SELECT Journal, COUNT(*) FROM Publication WHERE Country=x GROUP BY Journal

Operation | Elapsed
Read PII  | 110 [ms]
Read UPI  | 3,200 [ms]

Operation | Elapsed
Read PII  | 110 [ms]
Tailor    | 33 [ms]
Read UPI  | 500 [ms]

Slide17

Conclusion and Future Work

UPI

Heap + Cutoff Index

Tailored Secondary Index Access

Fractured UPI (not presented here)

Applying to other types of queries

Top-k Query: UPI as Tuple Access Layer

Slide18

Thanks!

Slide19

Fractured UPI

UPI

Heap File

Cutoff Index

2ndary Index

2ndary Index

Delete Set

Main Fracture

Fracture 1

New Fracture

Delete Set

Insert Buffer (On RAM)

INSERT

DELETE

SELECT

Dump

Query Independently

Slide20

Fractured UPI

                 | Insert 10% | Delete 1%
Unclustered Heap | 8 sec      | 75 sec
UPI              | 650 sec    | 212 sec
Fractured UPI    | 4 sec      | 0.03 sec

Fragmentation

More Fractures

Slide21

Cutoff Index Cost Model (1)

Selective Case (Q1, #Pointers=300)

Real Runtime

Estimated Runtime

Slide22

Cutoff Index Cost Model (2)

Non-Selective Case (Q1, #Pointers=37000)

Real Runtime

Estimated Runtime