Slide 1
Optimizing Database Algorithms for Random-Access Block Devices
Risi Thonangi
PhD Defense Talk
Advisor: Jun Yang
Slide 2: Background: Hard-Disk Drives
Hard Disk Drives (HDDs)
Magnetic platters for storage
Mechanical moving parts for data access
I/O characteristics
- Fast sequential access & slow random access
- Read/write symmetry
Disadvantages
- Slow random access
- High energy costs
- Bad shock absorption
Slide 3: Background: Newer Storage Technologies
Flash memory, phase-change memory, ferroelectric RAM, and MRAM
- Solve HDD's disadvantages of high energy usage, bad shock absorption, etc.
Representative: flash memory
- Floating-gate transistors for storage
- Electronic circuits for data access
- I/O characteristics: fast random access & expensive writes
[Image: USB flash drive]
Other technologies (phase-change memory, etc.) exhibit similar advantages and I/O characteristics
Slide 4: Random-Access Block Devices
Devices that are
- Block based
- Support fast random access, but have costlier writes
Popular example: Solid-State Drives (SSDs)
- Use flash memory for storage
- Have the advantages of flash memory
- Quickly replacing HDDs in consumer & enterprise storage
Other examples
- Cloud storage & key-value-store-based storage
- Phase-change memory
Slide 5: Motivation & Problem Statement
Can't we use a random-access block device as a drop-in replacement for an HDD?
- They do lead to better performance
- But they can be suboptimal, because existing algorithms are not optimized for them
Problem statement: optimize database algorithms for the I/O characteristics of random-access block devices, namely fast random access and read/write asymmetry
Slide 6: Related Work
For SSDs
- Indexes: BFTL, FlashDB, LA-tree, FD-tree, etc.
- Query processing techniques: PAX-style storage & FlashJoin, B-File for maintaining samples
- Transaction processing: In-Page Logging, log-structured storage & optimistic concurrency control, etc.
We propose new techniques & algorithms that systematically exploit both random access and read/write asymmetry
Slide 7: Outline
- Permutation problem
- Merge policies for the Log-Structured Merge (LSM) tree
- Concurrency control in indexes
- Conclusion & future work
References:
- [VLDB 2013] "Permuting Data on Random-Access Block Storage". Risi Thonangi, Jun Yang
- [CIKM 2012] "A Practical Concurrent Index for Solid-State Drives". Risi Thonangi, Shivnath Babu, Jun Yang
- [Tech report 2015] "Optimizing LSM-tree for Random-Access Block Storage". Risi Thonangi, Jun Yang
Slide 8: Permutation Problem
Permutation: reorder records such that the output address of a record is a function of its input address
- Examples: converting between matrix layouts, resorting multi-dimensional aggregates, etc.
For HDDs, external merge sort is the popular approach to permutation
- Augment each record with its output address, and sort by it
- # passes: logarithmic in the input size
- # block I/Os: O((N/B) log_{M/B}(N/B)), where N, M, B are the input, memory, and block sizes (all in # of records)
External merge sort's shortcomings
- Augmentation creates larger records
- It does not exploit the structure of the permutation
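The augment-and-sort idea can be sketched in a few lines. This is an in-memory stand-in for external merge sort (my own illustration, not the thesis code); `out_addr` is a hypothetical callback that gives each record's output position:

```python
def permute_by_sorting(records, out_addr):
    # Augment each record with its output address, then sort by it.
    # out_addr(i) returns the output position of the record at input
    # position i; it must be a bijection on 0..len(records)-1.
    augmented = [(out_addr(i), rec) for i, rec in enumerate(records)]
    augmented.sort(key=lambda pair: pair[0])
    return [rec for _, rec in augmented]
```

For example, transposing a 2x3 row-major matrix stored as a flat list is the permutation `i -> (i % 3) * 2 + i // 3`.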
Slide 9: A Naïve Algorithm
Just write out the records in output order, reading input blocks as needed
- Needs only two blocks of memory: one for writing, one for reading
- Too many block reads, because the records are scattered
- Much worse than sorting
- In fact, general permutation is as hard as sorting [Vitter 2001]
[Figure: memory state, input, and output of the naïve algorithm]
Slide 10: Address Space
A record address is encoded by an n-digit number d_{n-1} … d_1 d_0 with radices r_{n-1}, …, r_1, r_0
The (0-based) sequence # of the record within the file is the mixed-radix value: sum over i of d_i × (r_{i-1} ⋯ r_1 r_0)
Example: an address space with radices 2, 4, 3 encodes 24 records with 3-digit addresses
[Figure: the 3-digit addresses of all 24 records, listed digit by digit]
Slide 11: ADP (Address-Digit Permutation)
A permutation of records is defined by a permutation of the address digits
Example: permuting the digits of the input address space yields the output address space; a permutation in the address space induces a permutation in the record space
[Figure: input and output address spaces side by side, showing the permutation of address digits and the induced permutation of records]
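The digit-level definition above can be made concrete with a small sketch. `decode` and `adp` are illustrative names of my own, not from the thesis; `perm[j]` names the input-digit position that output digit j (most significant first) is taken from:

```python
def decode(seq, radices):
    # Decode a sequence number into its digits, most significant first,
    # under the given mixed radices (also most significant first).
    digits = []
    for r in reversed(radices):
        digits.append(seq % r)
        seq //= r
    digits.reverse()
    return digits

def adp(seq, radices, perm):
    # Output address: digit j comes from input digit position perm[j].
    # Re-encode under the permuted radices to get the output sequence #.
    digits = decode(seq, radices)
    out = 0
    for j in perm:
        out = out * radices[j] + digits[j]
    return out
```

With radices [2, 4] (a 2x4 row-major matrix) and perm [1, 0], `adp` maps row-major positions to column-major positions, and any digit permutation is a bijection on the record space.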
Slide 12: Single-Radix ADP
For simplicity of this presentation, assume:
- All digits have the same radix, say 2, so N = 2^n
- Memory size M = 2^m (# of records)
- Block size B = 2^b (# of records)
The lowest b digits of an address fall within the block boundary
Presented in the thesis: the general case, with mixed radices and sizes that are non-powers of the radix
Slide 13: A Really Simple ADP
Suppose no digits cross the block boundary
- I.e., no record needs to move outside its own block
- One pass, with one memory block, will do!
Generalize?
- Carefully select a memory-full of "action records" at a time, so that they are clustered in blocks in both the input and the output
- In this simple ADP, the records in each block form a group of action records
[Figure: input and output blocks of the simple ADP]
Slide 14: Basic One-Pass ADP: Selecting "Action Digits"
Choose the "action digits":
- In-digits: those within the block boundary in the input address space, plus
- Entering digits: out-digits in the input that become in-digits in the output
A group of action records = those whose input addresses share the same setting of the non-action digits
[Figure: an input address with its in-digits and entering digits highlighted]
Slide 15: Basic One-Pass ADP: Algorithm
For each possible setting s of the non-action digits:
- read the group of action records sharing s (they are clustered in input blocks!);
- permute them in memory;
- write the action records to output blocks (they are clustered in output blocks!)
Memory requirement: B × 2^(# entering digits) records
[Figure: the input and output block addresses of one group, with the entering digits highlighted]
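A toy version of the one-pass algorithm above, for radix 2 with addresses treated as bit strings (bit 0 = least significant). This is my own sketch of the grouping idea, not the thesis implementation; `one_pass_adp` and `adp_address` are illustrative names:

```python
def adp_address(x, perm):
    # Bit j of the output address is bit perm[j] of the input address.
    return sum(((x >> perm[j]) & 1) << j for j in range(len(perm)))

def one_pass_adp(data, perm, b, m):
    # b = # of in-digits (block size 2^b records), m = log2 of memory size.
    n = len(perm)
    assert len(data) == 1 << n
    in_digits = set(range(b))
    # Entering digits: input out-digits that become output in-digits.
    entering = {perm[j] for j in range(b)} - in_digits
    action = sorted(in_digits | entering)
    assert len(action) <= m, "too many entering digits for one pass"
    non_action = [i for i in range(n) if i not in in_digits | entering]
    out = [None] * len(data)
    # One group of action records per setting of the non-action digits.
    for s in range(1 << len(non_action)):
        base = 0
        for k, pos in enumerate(non_action):
            base |= ((s >> k) & 1) << pos
        # Read the group (whole input blocks), permute it in memory,
        # and write it out (whole output blocks).
        for t in range(1 << len(action)):
            addr = base
            for k, pos in enumerate(action):
                addr |= ((t >> k) & 1) << pos
            out[adp_address(addr, perm)] = data[addr]
    return out
```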
Slide 16: More Entering Digits Than We Can Handle?
Exploit the read/write asymmetry
- Filtered reads: allow a block to be read multiple times in one "pass"
Use multiple passes
- Perform a series of simpler permutations whose composition is the desired permutation
A cost-based optimizer determines the passes
- Problem: given input data and a permutation, find a "plan", possibly involving multiple passes and/or filtered reads, that minimizes cost
- Presented in the thesis: a provably optimal algorithm, which "balances" the aggressiveness of filtered reads across passes
Slide 17: Permuting vs. Sorting for ADP
Sorting vs. our algorithm (without using filtered reads)
- In either case, each pass reads and writes N/B blocks
Sorting takes O(log_{M/B}(N/M)) passes
- Depends on N: a bigger input means more passes
We take ⌈(# entering digits) / (m − b)⌉ passes
- Depends on the permutation, but does not depend on N
- Note (# entering digits) ≤ b, so with practical configurations we can complete any ADP in fewer than two passes, no matter how big the input is!
(Filtered reads further exploit r/w asymmetry to lower cost)
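Assuming the pass-count expressions read off this slide (the exact formulas were garbled in extraction, so treat both as reconstructions), the comparison can be computed directly:

```python
import math

def sort_passes(n_records, mem, blk):
    # External merge sort: one run-formation pass plus the merge passes
    # needed to reduce ceil(N/M) runs with fan-in mem // blk
    # (all sizes in # of records).
    runs = math.ceil(n_records / mem)
    if runs <= 1:
        return 1
    return 1 + math.ceil(math.log(runs, mem // blk))

def adp_passes(entering, m, b):
    # One memory-load holds the b in-digits plus up to (m - b) entering
    # digits, so each pass absorbs (m - b) entering digits.
    return max(1, math.ceil(entering / (m - b)))
```

With, say, N = 2^30, M = 2^20, B = 2^7 records, sorting needs two passes and grows with N, while any ADP (entering digits ≤ b = 7) finishes in one.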
Slide 18: Experiment: # of Passes
- 5-attribute dataset adapted from TPC-W
- Consider all possible ADPs of the 5-digit address space, i.e., all ways to resort the data
- Show the distribution of # passes needed as we increase the data size
- Sorting passes increase with data size
- We never take more than two passes!
[Figure: % of permutations vs. # of passes, for ADP and SORTING, at increasing data sizes]
Slide 19: Conclusion
Introduced address-digit permutations (ADPs)
- Capturing many useful data reorganization tasks
Designed algorithms for ADPs on random-access block storage
- Exploiting fast random accesses and read/write asymmetry
- Beating sort!
Results not covered in this talk
- Optimizations that read/write larger runs of blocks
- Mixed radices; memory/block sizes that are non-powers of the radices
- More experiments, including permuting data stored on SSDs & Amazon S3
Slide 20: Outline
- Permutation problem
- Merge policies for the Log-Structured Merge (LSM) tree
- Concurrency control in indexes
- Conclusion & future work
Slide 21: Background: LSM-tree
Index structure
- Levels L0, L1, …, L(n−1) with geometrically increasing sizes
- L0: always in memory
- L1, …, L(n−1): tightly packed B-trees
Processing updates to the LSM-tree
- Store them in L0
- If there is no space in L_i, invoke a merge between L_i and L_{i+1}: data moves from smaller to larger levels
LSM-tree features
- High update throughput
- Fast access to recent data
[Figure: levels L0, L1, L2, with merges moving data from smaller to larger levels]
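A minimal sketch of the cascading-merge behavior just described, using Python dicts in place of sorted runs / packed B-trees and a full merge at every overflow; the class name and capacities are illustrative, not from the thesis:

```python
class LSMTree:
    """Toy LSM-tree: dicts stand in for sorted runs."""

    def __init__(self, l0_cap=2, fanout=2, n_levels=3):
        # Geometrically increasing level capacities.
        self.caps = [l0_cap * fanout ** i for i in range(n_levels)]
        self.levels = [dict() for _ in range(n_levels)]

    def insert(self, key, value):
        self.levels[0][key] = value
        i = 0
        # Cascade: push an overflowing level into the next one.
        while i + 1 < len(self.levels) and len(self.levels[i]) > self.caps[i]:
            self.levels[i + 1].update(self.levels[i])  # newer entries win
            self.levels[i].clear()
            i += 1

    def lookup(self, key):
        # Search newest (L0) to oldest; the first hit is the freshest value.
        for level in self.levels:
            if key in level:
                return level[key]
        return None
```

This also shows why recent data is fast to reach: fresh keys sit in the small upper levels that are searched first.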
Slide 22: Motivation
Existing LSM-tree implementations optimize for HDDs
- Minimize random access
Variants are also popular for SSDs, since the LSM-tree avoids in-place updates
Can we do better?
- Optimize the LSM-tree for random-access block devices
- Minimize writes
- Understand and improve merge policies
Slide 23: Step 1: Modify the LSM-tree to Preserve Blocks
Structure
- L0 remains the same
- Persistent level L_i: a list of data blocks
- Pointers to the blocks are stored in a B-tree
- Data blocks are not required to be sequentially collocated, and are not necessarily full
Operations
- Block-Preserving Merge (BPM): preserves untouched blocks from the input
[Figure: merging L_i into L_{i+1}; untouched blocks are saved, overlapping blocks are merged]
Slide 24: Step 2: Understand Merge Policies
Full policy
- Merge all of L_i's blocks into L_{i+1}
Round-Robin (RR) policy
- Merge a δ fraction of L_i's blocks into L_{i+1}
- Select the blocks to merge in round-robin fashion
It seems RR simply spreads the work of Full over time (a δ fraction at a time); would it be better?
[Figure: Full merges everything from L_i into L_{i+1}; RR merges a δ fraction at a time]
Slide 25: Full vs. RR: Performance
Sample experiment comparing Full and RR
- 3 levels, with L0 = 1MB & fanout = 10
- Uniform workload, steady-state case
RR is surprisingly better than Full
- It always selects from a high-density region
- The merge process sustains this behavior for uniform workloads
- RR's worst-case behavior could be bad if the distribution is not friendly
Slide 26: Merge Policies: Partial Policy
Merge a portion of L_i's blocks into L_{i+1}
- Select the best portion, the one that minimizes I/O cost
- The selection can be done efficiently
[Figure: L_i and L_{i+1}; the range with the fewest block overlaps is chosen, starting from the round-robin selection point]
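The Partial policy's "best portion" search can be sketched as a scan over candidate windows of consecutive L_i blocks. The window scan, the key-range representation, and the names `overlap_count`/`pick_partial` are my simplification for illustration, not the thesis algorithm:

```python
def overlap_count(lo, hi, blocks):
    # Number of blocks whose key range intersects [lo, hi].
    return sum(1 for (l, h) in blocks if not (h < lo or hi < l))

def pick_partial(li_blocks, li1_blocks, delta):
    # li_blocks / li1_blocks: (first_key, last_key) per block, in key
    # order. Try every window of `delta` consecutive L_i blocks and
    # return the start index whose key range overlaps the fewest
    # L_{i+1} blocks, i.e. the cheapest portion to merge.
    best_start, best_cost = 0, float("inf")
    for s in range(len(li_blocks) - delta + 1):
        lo = li_blocks[s][0]
        hi = li_blocks[s + delta - 1][1]
        cost = overlap_count(lo, hi, li1_blocks)
        if cost < best_cost:
            best_start, best_cost = s, cost
    return best_start, best_cost
```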
Slide 27: Partial's Performance Guarantee
In the worst case:
- RR: each merge can write all blocks in L_{i+1}
- Partial: each merge writes a bounded number of blocks in L_{i+1}, a fraction of S_{i+1}, where S_{i+1} is the maximum allowed size of L_{i+1}
Slide 28: Step 3: An Even Better Merge Policy?
- Observation: internal levels are almost always full under the RR & Partial policies
- High-occupancy levels have more "resistance": they are costlier to merge into, leading to more overlapping blocks during a merge
- The RR and Partial policies are too greedy
- Idea: gain long-term savings by applying Full now and then?
Slide 29: Towards the Mixed Policy
Feasibility study
- LSM-tree with 3 levels, L0 = 1MB & fanout = 10; uniform distribution; index size = 20MB
- Merge policy 'Test': L0-to-L1 merges are always Partial; L1-to-L2 merges are always Full
The Test policy beats Partial in some cases
- Example: costs per level for index size = 20MB
- Costs at L2: Test is slightly worse than Partial
- Costs at L1: Test outperforms Partial by a large margin
- The total savings are much better than under the Partial policy
Takeaway: apply the Full policy when L_{i+1} is small; the extra cost of the Full merge will be offset by future savings at upper levels
Slide 30: Towards the Mixed Policy (contd.)
Threshold for choosing between the Full and Partial policies
- Depends on the workload
[Figures: thresholds under uniform and normal distributions]
Slide 31: Mixed Policy
For the steady-state case only
- Uppermost merge: Partial, because L0 is always in memory
- Lowermost merge: Full if C_F < C_P, Partial otherwise, where C_P and C_F are the costs of index maintenance under the Partial and Full policies, respectively
- Internal-level merges: Full if L_{i+1} is smaller than a threshold τ_{i+1}, Partial otherwise
Learning the parameters
- Experiment-based learning algorithm
[Figure: levels L0 through L3, annotated with the policy choice at each merge, e.g. "Full if L2 < τ2, Partial otherwise"]
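The decision rule above can be sketched as one function; the function name, parameter names, and the exact form of the cost comparison are my reconstruction of the slide's rules, not the thesis code:

```python
def mixed_policy(level, next_size=None, threshold=None, is_bottom=False,
                 cost_full=None, cost_partial=None):
    # Uppermost merge (out of L0): always Partial, since L0 is in memory.
    if level == 0:
        return "partial"
    # Lowermost merge: pick whichever policy has the lower estimated
    # index-maintenance cost (C_F vs. C_P on the slide).
    if is_bottom:
        return "full" if cost_full < cost_partial else "partial"
    # Internal merges: Full while L_{i+1} is still below its threshold,
    # so the level offers less "resistance" to future merges.
    return "full" if next_size < threshold else "partial"
```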
Slide 32: Results
Settings
- L0 = 1MB, fanout = 10, uniform distribution
Experiment: vary the index size
- Mixed is the overall winner
- RR is close to Partial
- Sharp improvement for the Full & Mixed policies when the number of levels increases (at 100MB, from 3 to 4 levels)
Experiment: vary the record size
- Block-Preserving Merge can fetch considerable savings
Slide 33: Conclusion
- Optimized the LSM-tree structure for random-access block devices
- Block-Preserving Merge saves blocks during merges
- Studied the performance of merge policies: Full, RR, and Partial; RR is surprisingly good
- Introduced the Mixed policy for more I/O savings
Slide 34: Outline
- Permutation problem
- Merge policies for the Log-Structured Merge (LSM) tree
- Concurrency control in indexes
- Conclusion & future work
Slide 35: Indexing Schemes for SSDs
The B+tree is suboptimal for SSDs
- Insertions/deletions require in-place updates
Rule of thumb for index design on SSDs: avoid small in-place updates
- Buffer insertions and deletions
- Batch-reorganize the index when the buffer overflows
Indexing schemes proposed for SSDs
- BFTL, LA-tree, FD-tree, …
- They have bad response times during index reorganizations
Slide 36: FD-tree
Index structure
- Logarithmic method: levels with geometrically increasing sizes
- Top level cached in memory
- Fractional cascading: pointers between levels
Processing updates to the FD-tree
- Store them in L0
- If L0 is full, merge L0 into L1
- Continue merging into lower levels until all levels are within their size limits (e.g., stop once L2 is within its size limit)
Not designed for efficient concurrent access during a merge!
[Figure: levels of insert (I) and delete (D) entries being merged downward]
Slide 37: FD+tree
A modified and improved version of the FD-tree
- Allows designing an efficient concurrency scheme
- Other advantages: deletion support, level skipping, and tighter performance guarantees
FD+tree's merge
- Calculate, in advance, the number of levels to merge
- Merge all those levels in a single shot
FD+tree's advantages
- Fewer write I/Os than the FD-tree
- Maintains a valid index structure during the merge: useful for concurrent access
[Figure: an FD+tree merge across levels of insert (I) and delete (D) entries]
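The "calculate in advance" step can be sketched under one natural reading of the slide: the merge extends down to the first level that can absorb everything above it without overflowing. The function name and this stopping rule are my assumptions, not the thesis algorithm:

```python
def levels_to_merge(sizes, caps):
    # sizes[i] / caps[i]: current size and capacity of level L_i.
    # Find the smallest k such that merging L0..Lk into L_{k+1} leaves
    # L_{k+1} within its capacity; the merge then runs in a single shot.
    spill = 0
    for k in range(len(sizes) - 1):
        spill += sizes[k]
        if spill + sizes[k + 1] <= caps[k + 1]:
            return k
    return len(sizes) - 1  # everything merges into the bottom level
```

Knowing the merge's full extent up front is what lets the merge run as one pass instead of a cascade of separate level-by-level merges.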
Slide 38: FD+FC: Full Concurrency for the FD+tree
Goals
- Don't use extra space
- No coarse-grained locking of the tree
Idea
- Maintain a wavefront to track the progress of the merge
- Delete blocks in the old levels that have already been merged
- Lookups check the wavefront to determine which levels to search
[Figure: a merge in progress, with the wavefront separating merged from unmerged blocks]
Slide 39: FD+FC (contd.)
Technical challenges
- Merge implementation: "record at a time" vs. "block at a time"
- Reclamation of blocks from the old levels: a block may be reclaimed only after the next block has started being processed; this ensures that all children of the block are done being processed by the merge
[Figure: example with an upper-level block [11 13 18 25 29 39] over lower-level blocks [20 22 27 28] and [30 42]; the first lower block cannot be reclaimed yet (28 would be orphaned), but can be reclaimed once the next block starts]
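The reclamation rule stated above can be sketched as follows; `reclaimable_blocks`, the key-range representation, and the strict comparison are assumptions for illustration, not the thesis protocol:

```python
def reclaimable_blocks(level_blocks, wavefront):
    # level_blocks: (first_key, last_key) per block of an old level, in
    # key order. wavefront: the largest key the merge has consumed.
    # Per the rule above, block i is reclaimed only once the merge has
    # started on block i+1 (the wavefront has passed its first key), so
    # no child of block i can still be awaiting processing.
    done = []
    for i in range(len(level_blocks) - 1):
        next_first = level_blocks[i + 1][0]
        if wavefront > next_first:
            done.append(i)
    return done
```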
Slide 40: FD+tree & FD+FC: More Details
- Level skipping: utilizes main memory more efficiently
- Proper deletion support + stronger performance guarantees
- Merges do not read-lock blocks
Slide 41: Experimental Results
FD+FC vs.
- FD+XM: global lock
- FD+DS: space doubling
Synthetic workloads
- FD+FC is better for both inserts and lookups
- FD+DS fares poorly for inserts
Slide 42: Conclusion
Concurrency control is important for SSD indexes
Good concurrency control requires both
- carefully rethinking index operations (FD+tree), and
- designing fine-grained but low-overhead protocols (FD+FC)
Slide 43: Outline
- Permutation problem
- Merge policies for the Log-Structured Merge (LSM) tree
- Concurrency control in indexes
- Conclusion & future work
Slide 44: Conclusion & Future Work
Studied optimizations to database algorithms for random-access block devices
- Using random-access block devices as drop-in replacements is good, but suboptimal
- Optimizing database algorithms for random access and read/write asymmetry can fetch considerably more savings, in both cost & performance
Future work
- Optimizing for multi-channel parallelism in random-access block devices
- Utilizing on-disk computing resources for data-processing tasks
- Optimizing for phase-change RAM: where should we fit PC-RAM in the storage architecture?
- Exploring how high to push the specializations in the system architecture: we've shown the benefit of specializing access methods and query-processing algorithms; what about query optimization?
Slide 45: Thank You
Slide 46: Concurrency Schemes: Some Proposals
Exclusive merge (FD+XM)
- Lock the complete tree for every operation (including a long merge)
- Simple, but little concurrency benefit
Doubling space (FD+DS)
- Don't delete the old levels until the merge completes
- Incurs twice the space cost; inefficient main-memory utilization
- Some concurrency control is still needed
[Figure: old and new levels of insert (I) and delete (D) entries coexisting during a merge]
Slide 47: Experimental Results
Synthetic workloads
- FD+FC is better for both inserts and lookups
- FD+DS fares poorly for inserts
TPC-C-like workloads
- FD+FC's worst-case response time (R_p) is less than a second
- FD+DS's worst-case R_p increases as the size of the database grows
Slide 48: Results
As the index size grows
- The Mixed policy is at least as good as Partial
- An extra level with lower occupancy is better, even when the index size is larger
[Figures: results under uniform and normal distributions]
Block-Preserving Merge is very efficient as the index record size increases
[Figure: results under the uniform distribution]
Slide 49: Modified LSM-tree (Summary)
- Modified the LSM-tree to save blocks during merges
- Merge policies: Full, Round-Robin, Partial
- Mixed merge policy: combines Full and Partial for increased I/O savings
- Results & conclusion
Slide 50: Indexing Schemes for SSDs
The B+tree is suboptimal for SSDs
- Insertions/deletions require in-place updates
[Figure: a B+tree over keys 1, 3, 4, 6, 8, 9; inserting 7 forces in-place updates]
Rule of thumb for index design on SSDs: avoid small in-place updates
- Buffer insertions and deletions
- Batch-reorganize the index
Indexing schemes proposed for SSDs
- BFTL, LA-tree, FD-tree, PIO B-tree, …