Slide1
UPI: A Primary Index for Uncertain Databases (VLDB 10)
Hideaki Kimura (BrownU)
Samuel Madden (MIT)
Stanley B. Zdonik (BrownU)
Speaker: Yinuo Zhang
Supervisor: Dr. Reynold Cheng
Slide2
Outline
Introduction
Uncertain Primary Index (UPI)
Secondary Index on UPI
Experiments
Conclusion and Future Work
Slide3
Introduction
Table Author
Name | Institution_p | Existence
Alice | Brown: 80%, MIT: 20% | 90%
Bob | MIT: 95%, UCB: 5% | 100%
Carol | Brown: 60%, U. Tokyo: 40% | 80%
A Possible World
Name | Institution
Alice | Brown
Bob | MIT
The probability of such world: (90% * 80%) * (100% * 95%) * 20% = 13.7%, where the final 20% is the probability that Carol does not exist
Query answering over Possible World Semantics
SELECT * FROM Author WHERE Institution=MIT
Threshold: confidence QT
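The possible-world arithmetic on this slide can be checked with a small sketch. It enumerates every possible world of Table Author, assuming tuples are independent (which the product on the slide implies). The function names `possible_worlds`, `confidence`, and `threshold_query` are illustrative and not from the talk, and a real system would not enumerate worlds explicitly.

```python
from itertools import product

# Table Author from the slide: name -> (existence, {institution: probability})
AUTHORS = {
    "Alice": (0.90, {"Brown": 0.80, "MIT": 0.20}),
    "Bob": (1.00, {"MIT": 0.95, "UCB": 0.05}),
    "Carol": (0.80, {"Brown": 0.60, "U. Tokyo": 0.40}),
}


def tuple_outcomes(existence, alternatives):
    """All outcomes for one tuple: each institution, or None if the tuple is absent."""
    outcomes = [(inst, existence * p) for inst, p in alternatives.items()]
    if existence < 1.0:
        outcomes.append((None, 1.0 - existence))
    return outcomes


def possible_worlds(authors):
    """Yield (world, probability); a world maps each existing author to one institution."""
    names = list(authors)
    per_tuple = [tuple_outcomes(*authors[name]) for name in names]
    for combo in product(*per_tuple):
        world = {n: inst for n, (inst, _) in zip(names, combo) if inst is not None}
        prob = 1.0
        for _, p in combo:
            prob *= p  # assumption: tuples are independent
        yield world, prob


def confidence(authors, name, institution):
    """Total probability of the worlds in which `name` is at `institution`."""
    return sum(p for w, p in possible_worlds(authors) if w.get(name) == institution)


def threshold_query(authors, institution, qt):
    """SELECT * FROM Author WHERE Institution=<institution>, with confidence >= QT."""
    result = {}
    for name in authors:
        c = confidence(authors, name, institution)
        if c >= qt:
            result[name] = c
    return result
```

For example, `threshold_query(AUTHORS, "MIT", 0.5)` keeps only Bob, while lowering QT to 0.1 also admits Alice, whose confidence for MIT is 90% * 20% = 18%.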
Slide4
Ex) DBLP with Uncertain Affiliation
DBLP: 1.3M Papers and 0.7M Authors
Complemented Author Affiliation using Google API
Input:
Name | Inst.
David DeWitt | ?
Google API, q = "David DeWitt":
Rank | URL
1 | Wisc.edu/…
2 | Microsoft.com/…
3 | Columbia.edu/…
Result (Zipfian Distribution):
Name | Institution_p | Country_p
David DeWitt | Wisconsin: 40%, MS: 20%, Columbia: 13%, … | US: 100%
Slide5
Introduction
Achieving an efficient implementation using possible world semantics is difficult.
Probabilistic Inverted Index [Singh07] - a secondary index
Institution | Pointer
Brown | [Alice] 0.3, [Carol] 0.2, [Bob] 0.4
MIT | [Bob] 0.3, [Alice] 0.8
Heap: Disk Seeking
Slide6
Introduction
Goal: Build A Primary Index Over Uncertain Attributes
Primary Index: Sequential Read
Slide7
Challenges on Building PI over Uncertain Data
Cluster on most probable possible value?
Cluster | Tuples
Brown | Alice, Carol
MIT | Bob
SELECT … WHERE Inst.=MIT: Alice?
Replicate tuples into inverted index?
Cluster | Tuples
Brown | Alice, Carol, …
MIT | Alice, Bob, …
Too Large for Long-tail distribution (e.g., 100 values with 0.1%)
Source table (Author):
Name | Institution_p | Existence
Alice | Brown: 80%, MIT: 20% | 90%
Bob | MIT: 95%, UCB: 5% | 100%
Carol | Brown: 60%, U. Tokyo: 40% | 80%
Slide8
UPI: Heap + Cutoff Index
Heap: Sorted by (Inst., Prob)
Institution | Tuple
Brown (72%) | Alice
Brown (48%) | Carol
MIT (95%) | Bob
MIT (18%) | Alice
UCB (5%) | Bob (cut off)
U. Tokyo (32%) | Carol
Cutoff Index: Sorted by (Inst., Prob)
Institution | TupleID | Pointer
UCB (5%) | Bob | MIT
Cutoff: Entries with less than C probability (C: Cutoff Threshold)
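A minimal sketch of how the heap and cutoff index on this slide could be derived from the Author table. It assumes C = 10% and assumes that the most probable alternative of each tuple always stays in the heap, so that a cutoff pointer (such as Bob's pointer to MIT) has a heap copy to point to. The function name `build_upi` and the tuple layouts are illustrative, not the authors' implementation.

```python
# Author table from the slides: name -> (existence, {institution: probability})
AUTHORS = {
    "Alice": (0.90, {"Brown": 0.80, "MIT": 0.20}),
    "Bob": (1.00, {"MIT": 0.95, "UCB": 0.05}),
    "Carol": (0.80, {"Brown": 0.60, "U. Tokyo": 0.40}),
}


def build_upi(authors, cutoff):
    """Return (heap, cutoff_index) for the given cutoff threshold C.

    heap entries:         (institution, confidence, tuple)
    cutoff_index entries: (institution, confidence, tuple_id, pointer)
    """
    heap, cutoff_index = [], []
    for name, (existence, alternatives) in authors.items():
        # Confidence of an alternative = existence probability * value probability.
        entries = sorted(
            ((inst, round(existence * p, 4)) for inst, p in alternatives.items()),
            key=lambda e: -e[1],
        )
        first_inst = entries[0][0]  # most probable alternative
        for rank, (inst, conf) in enumerate(entries):
            # Assumption: the most probable alternative is never cut off.
            if rank == 0 or conf >= cutoff:
                heap.append((inst, conf, name))
            else:
                cutoff_index.append((inst, conf, name, first_inst))

    def by_inst_then_prob(entry):
        return (entry[0], -entry[1])

    # Both structures are sorted by (Inst., Prob), highest probability first.
    heap.sort(key=by_inst_then_prob)
    cutoff_index.sort(key=by_inst_then_prob)
    return heap, cutoff_index


heap, cutoff_index = build_upi(AUTHORS, cutoff=0.10)
```

With these inputs, only Bob's UCB (5%) alternative falls below C, so it is the single entry in `cutoff_index`, and the heap holds the remaining five entries.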
Source table (Author):
Name | Institution_p | Existence
Alice | Brown: 80%, MIT: 20% | 90%
Bob | MIT: 95%, UCB: 5% | 100%
Carol | Brown: 60%, U. Tokyo: 40% | 80%
Slide9
Answering Queries with UPI
Probabilistic Threshold Query (PTQ)
Heap:
Institution | Tuple
MIT (95%) | Bob
… | …
UCB (90%) | Dan
UCB (20%) | Emily
Cutoff Index (C=10%):
Institution | TupleID | Pointer
UCB (5%) | Bob | MIT
If QT≥C (e.g., QT=20%), Sequentially Read
If QT<C (e.g., QT=5%), follow Cutoff pointers (Seek)
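The two access paths can be sketched as follows, using only the rows visible on this slide (the elided "…" rows are left out). Python lists stand in for the on-disk heap and cutoff index, and `ptq` is an illustrative name; this is a sketch of the rule above, not the authors' code.

```python
# Heap entries: (institution, confidence, tuple), sorted by (Inst., Prob).
HEAP = [
    ("MIT", 0.95, "Bob"),
    ("UCB", 0.90, "Dan"),
    ("UCB", 0.20, "Emily"),
]
# Cutoff index entries: (institution, confidence, tuple_id, pointer).
CUTOFF_INDEX = [("UCB", 0.05, "Bob", "MIT")]
C = 0.10  # cutoff threshold from the slide


def fetch_via_pointer(heap, tuple_id, pointer_inst):
    """Follow a cutoff pointer: find the tuple's copy stored under pointer_inst."""
    for inst, _, name in heap:
        if inst == pointer_inst and name == tuple_id:
            return name
    raise KeyError((tuple_id, pointer_inst))


def ptq(heap, cutoff_index, inst, qt, c):
    """Return (results, seeks) for WHERE Inst.=<inst> with probability >= QT."""
    # Sequential read of the heap region for `inst`.
    results = [(name, prob) for h_inst, prob, name in heap
               if h_inst == inst and prob >= qt]
    seeks = 0
    # Entries below C live only in the cutoff index, so they matter only if QT < C.
    if qt < c:
        for c_inst, prob, tuple_id, pointer in cutoff_index:
            if c_inst == inst and prob >= qt:
                name = fetch_via_pointer(heap, tuple_id, pointer)
                results.append((name, prob))
                seeks += 1  # one extra seek per followed pointer
    return results, seeks
```

With QT=20% the query on UCB is answered by the sequential read alone (Dan, Emily); with QT=5% it additionally follows the cutoff pointer to fetch Bob from the MIT region, at the cost of one seek.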
SELECT * FROM Author WHERE Inst.=UCB
With: Probability ≥ QT (Query Threshold)
Slide10
Choosing Cutoff Threshold
Cutoff Threshold C: a lower C is Faster, but Larger; a higher C is Slower, but Smaller
SELECT * FROM Author WHERE Institution=Ishikawa U
Threshold: confidence QT (QT is given at runtime)
Slide11
Determining C
Based on Value/Probability Histograms
Histogram (Inst.):
Value | #Keys
… | …
Br* | 30,000
Bs* | 31,000
Bt* | 30,500
… | …
Histogram (Prob.):
Prob. | #Keys
… | …
10%-15% | 15,000
15%-25% | 28,000
25%-40% | 33,000
… | …
#Pointers
Query runtime = Cost_fullscan * Selectivity + Cost_seek * #Pointers
Available Disk Capacity -> UPI Size -> C
Tolerable average query runtime -> C?
Slide12
#Pointers and Query Cost
Cost Model (replace Ishikawa U with Stanford)
Saturation: Logistic function
Slide13
Secondary Index on UPI
Store Multiple Pointers
Tailored Access
Source table (Author):
Name | Institution_p | Country_p | Existence
Alice | Brown: 80%, MIT: 20% | US: 100% | 90%
Bob | MIT: 95%, UCB: 5% | US: 100% | 100%
Carol | Brown: 60%, U. Tokyo: 40% | US: 60%, Japan: 40% | 80%
UPI Heap:
Institution | Tuple
Brown (72%) | Alice
… | …
MIT (95%) | Bob
MIT (18%) | Alice
Secondary index entries:
Country | TupleID | Pointers
US (100%) | Bob | MIT
US (90%) | Alice | Brown, MIT
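Because each secondary index entry stores several pointers, a query can choose which heap copy of a tuple to read. The sketch below illustrates the idea for the two US entries on this slide with a simple greedy heuristic that repeatedly picks the heap region covering the most remaining tuples. This heuristic is an assumption made for illustration and is not necessarily the tailoring algorithm used by the authors; `tailor` and `naive_plan` are hypothetical names.

```python
# Secondary index entries for Country=US: tuple_id -> heap regions holding a copy.
SECONDARY_INDEX_US = {
    "Bob": ["MIT"],
    "Alice": ["Brown", "MIT"],
}


def naive_plan(entries):
    """Baseline: always follow the first pointer of each entry."""
    return {tuple_id: pointers[0] for tuple_id, pointers in entries.items()}


def tailor(entries):
    """Greedy sketch: choose one pointer per tuple so that few heap regions are read."""
    remaining = dict(entries)
    plan = {}
    while remaining:
        # Count how many remaining tuples each heap region could serve.
        counts = {}
        for pointers in remaining.values():
            for region in pointers:
                counts[region] = counts.get(region, 0) + 1
        # Pick the region serving the most tuples (ties broken alphabetically).
        best = max(sorted(counts), key=lambda region: counts[region])
        served = [t for t, pointers in remaining.items() if best in pointers]
        for tuple_id in served:
            plan[tuple_id] = best
            del remaining[tuple_id]
    return plan


tailored = tailor(SECONDARY_INDEX_US)
baseline = naive_plan(SECONDARY_INDEX_US)
```

Here the tailored plan reads both Bob and Alice from the MIT region (one region), whereas following the first pointer of each entry touches both MIT and Brown.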
Secondary Index on (Country), queried with:
SELECT * FROM Author WHERE Country=US
Slide14
Experiments
Environments
C++ & BerkeleyDB 4.7 on Fedora Core 11
Quad-Core, 4GB RAM, 10k RPM SATA HDD
Dataset: DBLP w/ Uncertain Affiliation
700k authors and 1.3M publications
SwetoDblp, Google API (institutions up to ten per author)
Compared With PII (Probabilistic Inverted Index)
Slide15
Query Runtime: PII vs. UPI
Q1: SELECT * FROM Author WHERE Institution=x
UPI Causes Far Fewer Disk Seeks
PII:
Step | Elapsed
Read PII | 5 [ms]
Sort Pointer | 30 [µs]
Read Heap | 5,200 [ms]
UPI:
Step | Elapsed
Read UPI | 47 [ms]
Slide16
Secondary Index Access
Q2: SELECT Journal, COUNT(*) FROM Publication WHERE Country=x GROUP BY Journal
Without the Tailor step:
Step | Elapsed
Read PII | 110 [ms]
Read UPI | 3,200 [ms]
With the Tailor step:
Step | Elapsed
Read PII | 110 [ms]
Tailor | 33 [ms]
Read UPI | 500 [ms]
Slide17
Conclusion and Future Work
UPI
Heap + Cutoff Index
Tailored Secondary Index Access
Fractured UPI (not presented here)
Applying to other types of queries
Top-k Query: UPI as Tuple Access Layer
Slide18
Thanks!
Slide19
Fractured UPI
[Diagram] Main Fracture: UPI (Heap File, Cutoff Index), 2ndary Index, 2ndary Index, Delete Set
[Diagram] Fracture 1, New Fracture; Delete Set; Insert Buffer (On RAM)
[Diagram] Operations: INSERT, DELETE, SELECT; labels: Dump, Query Independently
Slide20
Fractured UPI
Method | Insert 10% | Delete 1%
Unclustered Heap | 8 sec | 75 sec
UPI | 650 sec | 212 sec
Fractured UPI | 4 sec | 0.03 sec
Fragmentation
More Fractures
Slide21
Cutoff Index Cost Model (1)
Selective Case (Q1, #Pointers=300)
[Plot: Real Runtime vs. Estimated Runtime]
Slide22
Cutoff Index Cost Model (2)
Non-Selective Case (Q1, #Pointers=37000)
[Plot: Real Runtime vs. Estimated Runtime]
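As a rough illustration of how an estimated runtime could be computed, the sketch below evaluates the linear formula from the "Determining C" slide (Cost_fullscan * Selectivity + Cost_seek * #Pointers) for the two pointer counts above. All numeric parameters are hypothetical placeholders chosen for the example; they are not measurements from the talk. The sketch also does not model the saturation mentioned on the "#Pointers and Query Cost" slide.

```python
def estimate_cost(cost_fullscan, selectivity, cost_seek, num_pointers):
    """Linear estimate: sequential part plus one seek per cutoff pointer."""
    return cost_fullscan * selectivity + cost_seek * num_pointers


# Hypothetical parameters (assumptions, NOT values reported in the talk).
COST_FULLSCAN_SEC = 10.0   # time to scan the whole heap sequentially
COST_SEEK_SEC = 0.005      # time for one random disk seek
SELECTIVITY = 0.01         # fraction of the heap read sequentially

# The two pointer counts shown on the cost model slides.
selective = estimate_cost(COST_FULLSCAN_SEC, SELECTIVITY, COST_SEEK_SEC, 300)
non_selective = estimate_cost(COST_FULLSCAN_SEC, SELECTIVITY, COST_SEEK_SEC, 37000)
```

Under these placeholder values the seek term dominates as the pointer count grows, and the non-selective estimate even exceeds the assumed cost of a full scan, which shows where a purely linear model stops being realistic.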