Vivek Narasayya Manoj Syamala Microsoft Research Brown University Hideaki Kimura hkimuracsbrownedu viveknarmanojsy microsoftcom Graduates soon On Job Market Every Major DBMS ID: 930803
Download Presentation The PPT/PDF document "Compression Aware Physical Database Desi..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
Slide1
Compression Aware Physical Database Design
Vivek NarasayyaManoj Syamala
Microsoft Research
Brown University
Hideaki Kimura*
hkimura@cs.brown.edu
{viveknar,manojsy}@microsoft.com
(*) Graduates soon. On Job Market.
Slide2Every Major DBMS Supports
Saves Storage ConsumptionSaves I/O BandwidthBackground: Compression in DB
Tables,
Indexes
SELECT
Compressed
Data
INSERT
Decompress
C
ompress
Query Process
Engine
DBMS A:
4x
!
DBMS B:
10x
!
DBMS C:
12x
!
Slide3Compression Schemes in DB
City
Seattle
San JoseSeattle
..
Dictionary Encoding
1
2
1
..
Dict.
1:Seattle
2:San
Jose
+
Local dict. (Oracle, SQL Server)
Global dict. (DB2)
NULL Suppression
LZO
, RLE…
Price000321
000054000015
..
@321
@54
@15
..
Prefix Suppression,
Slide4Two Types of Compression in DB
Order
Independent
Order
Dependent
A
000AA
000AA
000AA
000AA
00BBB
00BBB
00BBB
00BBB
B
X
X
Y
Y
X
X
Y
Y
I
AB
A
000AA
000AA
00BBB
00BBB
000AA
000AA
00BBB
00BBB
B
X
X
X
X
Y
Y
Y
Y
I
BA
A
@AA
@AA
@AA@AA@BBB@BBB@BBB@BBB
A@AA@AA@BBB@BBB@AA@AA@BBB@BBB
IAB
IBA
A000AA↑↑↑00BBB↑↑↑
A000AA↑00BBB↑000AA↑00BBB↑
IAB
IBA
page
=
≠
fragmented
NULL-Supp.
Global dict.
…
Run Length Enc.
Local dict.
…
Slide5Saves Storage Space, I/O
CPU Overhead to Compress & DecompressDifferent Compression Scheme= Different Saving
↔ Overhead
Benefits and Overheads
How Do We Use It?
DBA
Slide6Depends on Workload
SELECTs/INSERTs FrequencyCPU bottleneck? IO bottleneck?
Issue 1: To Compress or not..
Depends on
Data
9GB
10GB
-90%
10GB
1GB
-10%
High Compression Ratio
Low Compression Ratio
Slide7Issue 2: What Index to Create
I
1
I
2
I
3
I
4
Q1
Q2
I
1
I
3
I
5
I
5
Syntactically
Relevant
Indexes
Select
Candidate Configurations
Enumerate
Best
Configuration
Configuration
I
1
I
5
Physical DB Design Tool
DBMS
Query
Optimizer
Hypothetical
Indexes
Estimate Runtime
What-if
A
nalysis
Prune
Slide8Run Design Tool to Select I
ndexesCompress them, then Repeat.Naïve Solution: Staged Design
Stage 1
Stage 2
Compress!
100 MB
Budget
Idx
MV
100 MB
Idx
MV
5
0 MB
Idx
MV
100 MB
Workload
Slide9Misses an index that makes sense only with compression
Problem in tight space budget
Shipdate
State
PriceDiscount
Feb 21CA
$12310%Jan 9
RI
$222
0%
Jul
5
TX
$213
5%
Sales
SELECT SUM(Price*Discount) FROM Sales
WHERE State='CA' and Jul 01 < Shipdate < Sep 01
I1
(State, Shipdate):
95 MB → 50 MB
I
2 (State, Shipdate)Include (Price, Discount):
170 MB
→
9
0 MB
Choice for 100 MB?
Slide10Example: Tight Space Budget
?
Good design:
175MB
CREATE COMPRESSED INDEX (L_PARTKEY,L_ORDERKEY,L_SUPPKEY) INCLUDE (L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT)Staged:
155MBCREATE INDEX (L_ORDERKEY) INCLUDE(L_SUPPKEY,L_COMMITDATE,L_RECEIPTDATE)
Slide11Result in too high CPU overheads for compression/decompression.
Problem in plenty space budget
I
2
(State, Shipdate)Include (Price, Discount):
170 MB
→ 90 MB
Choice for 200 MB?
UPDATE Sales SET Price=..
I
1
(State, Shipdate):
95 MB
→
50 MB
INSERT INTO Sales …
CPU Overheads
Slide12Example: Plenty Space Budget
Worse with More Budget!
Slide13How to Estimate Index-size
after compression?How to Evaluate benefits/overheads of compression?How Compression affects Candidate Selection/Enumeration?
Integrated Solution N
eeded!
Slide14Essential Metric of Indexes
To Fit Space BudgetTo Estimate I/O costNeed Compression FractionSize Estimation
Col-A
Width=8
Col-B
Width=4
Col-C
Width=10
S
tats
Table
#tuple=
1M
Size (
I
ABC) = (8 + 4 + 10 + 4) * 1M = 26 MB
Clust. Key
Width=4Comp. Size (
IABC) = 26 MB * CF (IABC
)
Slide15SampleCF
[Idreos et al. ICDE'10]
Prior work
Sample Size: Cost
↔
Accuracy
Still Expensive for
1,000s
of indexes
1GB
Table
10MB
Sample
CREATE
COMPRESSED
INDEX
SampleCF
Overheads
Slide16Solution Overview
Slide17Index Size Deduction
I
b
I
a
I
a,b
SampleCF
Col-Ext Deduction
I
a
I
b
NULL supp. (
ORD-IND
)
I
a,b
I
b,a
Col-Set
Deduction
A
000AA
/
/
/
00BBB
/
/
/
A
000AA
/
00BBB
/
000AA
/
00BBB
/
I
AB
I
BA
Local dict. (
ORD-DEP
)
Estimate From Run-Length
Sum-up
Savings
More Details in paper
Slide18Size-Estimation Strategy
Sample Size?Deduction Path?Expected Errors?Formulate as Graph Problem
Greedy algorithm to solve(details in the paper)
Optimize Accuracy-Cost Trade-off
Slide19Query Cost model to consider (De)Compression CPU costCandidate Selection/Enumeration
Issues in Design Tool
Key Challenge:
Space-Performance
Trade-off
Slide20Candidate Selection:Space-Performance Trade-off
I
A
I
B
I
C
I
D
Q1
Q2
Select
Fastest
I
A
I
C
I
A
I
B
I
C
I
D
Compressed Versions
Add Compressed
Indexes
Compressed Indexes
are often
Slower-but-Smaller
Most of them
are Ignored!
(exception: very high
compression ratio)
Slide21Skyline Candidate Selection
Construct
Skyline
of ConfigurationsPick Both Fast-Indexes
and Small-Indexes
Slide22Greedy picks un-compressed indexes too early
Enumeration: Problem
I
A
I
B
I
C
B
I
C
10MB
5MB
10MB
Comp.
Seed
I
A
I
C
15MB Room
I
A
I
B
I
A
I
C
B
I
A
I
B
I
C
B
I
A
I
B
I
C
I
A
I
C
I
C
B
Optimal Design
Slide23Recover oversized configurations
Compress indexes in the config.Local Backtrack in Enumeartion
I
A
I
A
I
B
I
A
I
B
I
C
I
A
I
C
I
C
B
Recover
If Oversized
I
A
I
B
I
C
C
…
Slide24Implemented on SQL Server 2008Modified Database Tuning Advisor (DTA) "
DTAc"Modified Query Cost Model
TPC-H Scale-1 (more results in paper)SELECT-intensive/UPDATE-intensiveCompared Estimated Runtime
Experimental Results
Slide25Both Skyline & Backtrack are required esp. for tight budget
Candidate Selection/Enumeration
Clustered/
2ndary
Indexes
Slide26Especially better in tight budgetChoose lightly compressed designs in UPDATE-intensive
DTAc vs. DTA
Clustered/
2ndary
/MV Indexes
Slide27Reduce Size Estimation Overheads for a factor of 3
Mostly <10% Estimation ErrorOverhead in DTA
Slide28Opportunities and ChallengesIntegrated Approach to exploit
compression in physical designSpace-Performance TradeoffSize EstimationOpen IssuesColumn-Store
Conclusion