
Presentation Transcript

Slide1

Optimizing Database Algorithms for Random-Access Block Devices

Risi Thonangi

PhD Defense Talk

Advisor: Jun Yang

Slide2

Background: Hard-Disk Drives

Hard Disk Drives (HDDs)

Magnetic platters for storage

Mechanical moving parts for data access
I/O characteristics: fast sequential access & slow random access; read/write symmetry
Disadvantages: slow random access, high energy costs, bad shock absorption


Slide3

Background: Newer Storage Technologies

Flash memory, phase-change memory, ferroelectric RAM, and MRAM
Solve HDD's disadvantages of high energy usage, bad shock absorption, etc.
Representative: flash memory
Floating-gate transistors for storage
Electronic circuits for data access
I/O characteristics: fast random access & expensive writes

USB Flash Drive

Other technologies (phase-change memory, etc.) exhibit similar advantages and I/O characteristics

Slide4

Random-Access Block Devices

Devices that are block based and support fast random access but have costlier writes
Popular example: Solid-State Drives (SSDs)
Use flash memory for storage
Have the advantages of flash memory
Quickly replacing HDDs in consumer & enterprise storage
Other examples: cloud storage, key-value-store-based storage, phase-change memory

Slide5

Motivation & Problem Statement

Can't we use a random-access block device as a drop-in replacement for an HDD?

They do lead to better performance

But they can be suboptimal, because existing algorithms are not optimized for them
Problem statement: optimize database algorithms for the I/O characteristics of fast random access & read/write asymmetry

Slide6

Related work

For SSDs

Indexes

BFTL, FlashDB, LA-tree, FD-tree, etc.
Query processing techniques
PAX-style storage & FlashJoin; B-File for maintaining samples
Transaction processing
In-Page Logging; log-structured storage & optimistic concurrency control; etc.
We propose new techniques & algorithms that systematically exploit both random access and read/write asymmetry

Slide7

Outline

Permutation problem

Merge policies for Log-Structured Merge (LSM) tree

Concurrency control in indexes
Conclusion & future work

[VLDB 2013] "Permuting Data on Random-Access Block Storage". Risi Thonangi, Jun Yang
[CIKM 2012] "A Practical Concurrent Index for Solid-State Drives". Risi Thonangi, Shivnath Babu, Jun Yang
[Tech report 2015] "Optimizing LSM-tree for Random-Access Block Storage". Risi Thonangi, Jun Yang

Slide8

Permutation problem

Permutation: reorder records such that the output address of a record is a function of its input address
Examples: converting between matrix layouts, resorting multi-dimensional aggregates, etc.
For HDDs, external merge sort is popular for permutation
Augment each record with its output address, and sort by it
# passes: logarithmic in the input size
# block I/Os: O((N/B) log_{M/B}(N/B)), where N, M, B are the input, memory, and block sizes (all in # records)

External merge sort: shortcomings
Augmentation creates larger records
Does not exploit the structure of the permutation

Slide9

A naïve algorithm

Just write out records in output order, and read as needed

Need two blocks of memory: one for write, one for read

Too many block reads (up to one per record) because records are scattered
Much worse than sorting
In fact, general permutation is as hard as sorting [Vitter 2001]

(Figure: memory state, input file, and output file under the naïve algorithm)

Slide10

Address space

A record address is encoded by a multi-digit number with given radices
The (0-based) sequence # of the record within the file is the value of its digits in that mixed-radix system
Example: an address space with radices 2, 4, 3

(Figure: the 3-digit addresses, listed digit by digit, above the corresponding records)
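To make the mixed-radix encoding concrete, here is a minimal Python sketch (not from the thesis; the helper names and the example radices 2, 4, 3 are illustrative):

```python
from typing import List

def digits_to_seq(digits: List[int], radices: List[int]) -> int:
    """Mixed-radix evaluation: address digits (most significant first) -> 0-based sequence #."""
    seq = 0
    for d, r in zip(digits, radices):
        assert 0 <= d < r
        seq = seq * r + d
    return seq

def seq_to_digits(seq: int, radices: List[int]) -> List[int]:
    """Inverse mapping: 0-based sequence # -> address digits (most significant first)."""
    digits = [0] * len(radices)
    for i in range(len(radices) - 1, -1, -1):
        seq, digits[i] = divmod(seq, radices[i])
    return digits

# Example with radices <2, 4, 3>: 24 records, addresses 000 .. 123.
assert digits_to_seq([1, 2, 0], [2, 4, 3]) == 18
assert seq_to_digits(18, [2, 4, 3]) == [1, 2, 0]
```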

Slide11

ADP (address-digit permutation)

A permutation of records is defined by a permutation of the address digits
Example: a permutation of the address digits maps the input address space to the output address space

(Figure: the records listed under the input address space and under the output address space, showing the induced permutation in record space)
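A sketch of how a digit permutation induces a permutation of record positions, reusing the helpers above (illustrative Python; the direction convention of digit_perm is an assumption):

```python
from typing import List

def adp_output_seq(in_seq: int, radices: List[int], digit_perm: List[int]) -> int:
    """Map a record's input sequence # to its output sequence # under an ADP.

    Assumed convention: digit_perm[j] is the position in the *input* address that
    supplies digit j of the *output* address (0 = most significant).
    """
    in_digits = seq_to_digits(in_seq, radices)           # helper from the earlier sketch
    out_digits = [in_digits[p] for p in digit_perm]      # permute the digits
    out_radices = [radices[p] for p in digit_perm]       # radices move with their digits
    return digits_to_seq(out_digits, out_radices)

# Example: cyclically shift the three digits of the <2, 4, 3> address space.
perm = [adp_output_seq(s, [2, 4, 3], [1, 2, 0]) for s in range(24)]
assert sorted(perm) == list(range(24))                   # a valid permutation of all 24 records
```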

 

Slide12

Single-radix ADP

For simplicity of this presentation, assume:
All digits have the same radix
Memory size (# of records) and block size (# of records) are powers of that radix
Presented in the thesis: the general case, with mixed radices and sizes that are non-powers

(Figure: the block boundary within an address)

Slide13

A really simple ADP

Suppose no digits cross the block boundary

I.e., no record needs to move outside its own block

One pass, with one memory block, will do! (see the sketch below)
Generalize?
Carefully select a memory-full of "action records" at a time, so they are clustered in blocks in both input and output
In this simple ADP above, the records in each block form a group of action records

(Figure: input and output files; each block is permuted internally)
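A record-level sketch of this really simple case (illustrative Python, reusing adp_output_seq from the sketch above; it assumes the given digit permutation really does keep every record inside its block):

```python
def one_pass_within_block_adp(infile, radices, digit_perm, block_size):
    """One-pass ADP when every in-digit stays an in-digit: read a block,
    permute its records in memory, write the block out."""
    out = []
    for start in range(0, len(infile), block_size):        # one block at a time
        block = infile[start:start + block_size]            # "read" the block
        permuted = [None] * block_size
        for offset, rec in enumerate(block):
            src = start + offset
            dst = adp_output_seq(src, radices, digit_perm)   # stays within this block by assumption
            permuted[dst - start] = rec
        out.extend(permuted)                                 # "write" the block
    return out

# Example: swap the two in-digits of a radix-2 address space with block size 4.
data = list(range(8))
assert one_pass_within_block_adp(data, [2, 2, 2], [0, 2, 1], block_size=4) == [0, 2, 1, 3, 4, 6, 5, 7]
```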

Slide14

Basic one-pass ADP: selecting “action digits”

Choose the "action digits":
In-digits: those within the block boundary in the input address space, plus
Entering digits: out-digits in the input that become in-digits in the output
A group of action records = those whose input addresses share the same setting for all the non-action digits

(Figure: an example address with its in-digits and entering digits marked)

Slide15

Basic one-pass ADP: algorithm

For each possible setting of the non-action digits:
read the group of action records sharing that setting (clustered in input blocks!);
permute them in memory;
write the action records to output blocks (clustered in output blocks!)
Memory requirement: grows exponentially with the # of entering digits
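A record-level simulation of the basic one-pass algorithm (illustrative Python, reusing the earlier helpers; real I/O would read and write whole blocks, and the caller supplies the split into action and non-action digit positions):

```python
from itertools import product

def one_pass_adp(infile, radices, digit_perm, action_positions):
    """For each setting of the non-action digits, load the corresponding group of
    action records, permute them in memory, and place them at their output positions."""
    n = len(radices)
    non_action = [i for i in range(n) if i not in action_positions]
    out = [None] * len(infile)
    for setting in product(*(range(radices[i]) for i in non_action)):
        # enumerate every input address that shares this non-action-digit setting
        for action in product(*(range(radices[i]) for i in action_positions)):
            digits = [0] * n
            for pos, val in zip(non_action, setting):
                digits[pos] = val
            for pos, val in zip(action_positions, action):
                digits[pos] = val
            src = digits_to_seq(digits, radices)                  # helper from earlier
            out[adp_output_seq(src, radices, digit_perm)] = infile[src]
    return out

# Example: all 24 records of the <2, 4, 3> space, with digits 1 and 2 as the action digits.
data = list(range(24))
out = one_pass_adp(data, [2, 4, 3], [0, 2, 1], action_positions=[1, 2])
assert sorted(out) == data
```

Each inner loop touches one group of action records, which in the real algorithm is clustered into whole blocks in both the input and the output.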

(Figure: binary input and output addresses with the entering digits highlighted; each group of action records occupies whole blocks in both spaces)

Slide16

More entering digits than we can handle?

Exploit the read/write asymmetry

Filtered reads: allow a block to be read multiple times in one "pass"
Use multiple passes: perform a series of simpler permutations whose composition is the desired permutation
A cost-based optimizer to determine the passes:
Presented in the thesis: a provably optimal algorithm, which "balances" the aggressiveness of filtered reads across passes
Problem: given input data and a permutation, find a "plan" (possibly involving multiple passes and/or filtered reads) that minimizes cost

Slide17

Permuting vs. sorting for ADP

Sorting vs. our algorithm (without using filtered reads)
In either case, each pass reads and writes the entire input once
Sorting takes a number of passes that is logarithmic in the input size: bigger input means more passes
We take a number of passes that depends on the permutation (on its # of entering digits), but does not depend on the input size
The # of entering digits is small in practice, so with practical configurations we can complete any ADP in fewer than two passes, no matter how big the input is!
(Filtered reads further exploit r/w asymmetry to lower the cost)

Slide18

Experiment: # of passes

5-attribute dataset adapted from TPC-W

Consider all possible ADPs of the 5-digit address space, i.e., all ways to resort the data
Show the distribution of # passes needed as we increase the data size
Sorting passes increase with data size
The # of passes we take never increases with data size!

(Charts: % of permutations vs. # of passes, comparing ADP and SORTING at increasing data sizes)

Slide19

Conclusion

Introduced address-digit permutations (ADPs)

Capturing many useful data reorganization tasks

Designed algorithms for ADPs on random-access block storage
Exploiting fast random accesses and read/write asymmetry
Beating sort!
Results not covered in this talk:
Optimizations that read/write larger runs of blocks
Mixed radices, memory/block sizes that are non-powers of the radices
More experiments, including permuting data stored on SSDs & Amazon S3

Slide20

Outline

Permutation problem

Merge policies for Log-Structured Merge (LSM) tree
Concurrency control in indexes
Conclusion & future work

[VLDB 2013] "Permuting Data on Random-Access Block Storage". Risi Thonangi, Jun Yang
[CIKM 2012] "A Practical Concurrent Index for Solid-State Drives". Risi Thonangi, Shivnath Babu, Jun Yang
[Tech report 2015] "Optimizing LSM-tree for Random-Access Block Storage". Risi Thonangi, Jun Yang

Slide21

Background: LSM-tree

Index structure

Levels L0, L1, ..., Ln-1 with geometrically increasing sizes
L0: always in memory
L1, ..., Ln-1: tightly packed B-trees
Processing updates to the LSM-tree (sketched below)
Store them in L0
If there is no space in Li, invoke a merge between Li and Li+1

(Figure: levels L0, L1, L2; merges move data from smaller to larger levels)

LSM-tree – features

High update throughput

Fast access to recent data
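A toy sketch of the update path just described (illustrative Python; levels are plain in-memory dicts rather than on-device B-trees, and the capacity rule is an assumption):

```python
class SimpleLSMTree:
    """Toy model of the LSM-tree update path: L0 in memory; when a level
    overflows, its contents are merged into the next level."""

    def __init__(self, l0_capacity=4, fanout=10, num_levels=4):
        self.capacities = [l0_capacity * fanout ** i for i in range(num_levels)]
        self.levels = [dict() for _ in range(num_levels)]      # level i: key -> value

    def put(self, key, value):
        self.levels[0][key] = value
        i = 0
        # If no space in Li, invoke a merge between Li and Li+1 (cascading downward).
        while i + 1 < len(self.levels) and len(self.levels[i]) > self.capacities[i]:
            self.levels[i + 1].update(self.levels[i])           # newer entries overwrite older
            self.levels[i] = {}
            i += 1

    def get(self, key):
        for level in self.levels:                               # most recent data first
            if key in level:
                return level[key]
        return None

# Example: inserts cascade from L0 down as levels fill up.
t = SimpleLSMTree()
for k in range(60):
    t.put(k, str(k))
assert t.get(7) == "7" and t.get(999) is None
```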

Slide22

Motivation

Existing LSM-tree implementations optimize for HDDs

Minimize random access

Variants are also popular for SSDs, as the LSM-tree avoids in-place updates
Can we do better?
Optimize the LSM-tree for random-access block devices
Minimize writes
Understand and improve merge policies

Slide23

Step 1: modify LSM-tree to preserve blocks

Structure

L0 remains the same
Persistent level Li: a list of data blocks
Pointers to blocks stored in a B-tree
Data blocks: not required to be sequentially collocated, and not necessarily full

Operations

Block Preserving Merge (BPM)

Preserves untouched blocks from input

(Figure: merging Li into Li+1; Li+1 blocks outside the merge range are saved, the rest are merged)
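One way the block-preserving idea can be sketched (illustrative Python; levels are modeled as lists of sorted, non-overlapping key blocks, and the overlap test by key range is an assumption, not the thesis's exact algorithm):

```python
def block_preserving_merge(li_keys, li1_blocks, block_size=4):
    """Merge the sorted keys coming from Li into Li+1, kept as a list of sorted,
    non-overlapping blocks.  Li+1 blocks whose key ranges do not overlap the
    incoming keys are preserved untouched; only the overlapping range is rewritten."""
    if not li_keys:
        return li1_blocks
    lo, hi = li_keys[0], li_keys[-1]
    preserved_low = [b for b in li1_blocks if b[-1] < lo]        # saved as-is
    preserved_high = [b for b in li1_blocks if b[0] > hi]        # saved as-is
    overlapping = [b for b in li1_blocks if b[-1] >= lo and b[0] <= hi]

    merged = sorted(li_keys + [k for b in overlapping for k in b])
    rewritten = [merged[i:i + block_size] for i in range(0, len(merged), block_size)]
    return preserved_low + rewritten + preserved_high

# Example: only the middle block of Li+1 is rewritten.
li1 = [[1, 2], [5, 6], [9, 10]]
assert block_preserving_merge([4, 7], li1) == [[1, 2], [4, 5, 6, 7], [9, 10]]
```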

Slide24

Step 2: understand merge policies

Full policy

Merge all of Li's blocks into Li+1
Round-Robin (RR) policy
Merge a δ fraction of Li's blocks into Li+1
Select the blocks to merge in round-robin fashion (sketched below)

(Figure: Full merges all of Li; RR merges a δ fraction of Li per merge)

It seems RR is simply spreading the work of Full over time (a δ fraction at a time); would it be better?
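A sketch of how RR might select its δ fraction of blocks on each merge (illustrative Python; the cursor bookkeeping and the class name are assumptions):

```python
class RoundRobinPolicy:
    """Each merge pushes a delta fraction of Li's blocks into Li+1, cycling
    through Li in round-robin order across successive merges."""

    def __init__(self, delta=0.25):
        self.delta = delta
        self.cursor = {}                        # level index -> next position to merge from

    def pick_blocks(self, level_index, li_blocks):
        if not li_blocks:
            return []
        count = max(1, int(len(li_blocks) * self.delta))
        start = self.cursor.get(level_index, 0) % len(li_blocks)
        picked = [li_blocks[(start + k) % len(li_blocks)] for k in range(count)]
        self.cursor[level_index] = (start + count) % len(li_blocks)
        return picked

# Example: successive merges walk around Li's blocks.
rr = RoundRobinPolicy(delta=0.5)
blocks = ["b0", "b1", "b2", "b3"]
assert rr.pick_blocks(1, blocks) == ["b0", "b1"]
assert rr.pick_blocks(1, blocks) == ["b2", "b3"]
```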

Slide25

Full vs RR – performance

Sample experiment to compare Full and RR

3 levels, with L0 = 1 MB & fanout = 10
Uniform workload, steady-state case
RR is surprisingly better than Full
It always selects from a high-density region
The merge process sustains this behavior for uniform workloads
Worst-case behavior for RR could be bad if the distribution is not friendly

Slide26

Merge policies – Partial policy

Merge a portion of Li's blocks into Li+1
Select the best portion, the one that minimizes I/O cost (sketched below)
Can be done in a small number of steps

(Figure: Partial picks the range of Li with the fewest overlapping Li+1 blocks, in contrast to RR's round-robin selection)
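A sketch of Partial's selection step (illustrative Python; it scans contiguous runs of Li blocks and counts overlapping Li+1 blocks, which is one plausible reading of "minimize I/O cost"; the thesis's exact cost model and window size may differ):

```python
def count_overlaps(window_blocks, li1_blocks):
    """# of Li+1 blocks whose key range intersects the window's key range."""
    lo, hi = window_blocks[0][0], window_blocks[-1][-1]
    return sum(1 for b in li1_blocks if not (b[-1] < lo or b[0] > hi))

def pick_partial_range(li_blocks, li1_blocks, window=2):
    """Choose the contiguous run of Li blocks with the fewest overlapping Li+1 blocks."""
    best_start, best_cost = 0, float("inf")
    for start in range(0, len(li_blocks) - window + 1):
        cost = count_overlaps(li_blocks[start:start + window], li1_blocks)
        if cost < best_cost:
            best_start, best_cost = start, cost
    return li_blocks[best_start:best_start + window]

# Example: the run starting at key 12 overlaps only one Li+1 block.
li = [[5], [12], [40], [50]]
li1 = [[4, 6], [10, 11], [13, 15], [45, 46]]
assert pick_partial_range(li, li1, window=2) == [[12], [40]]
```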

Slide27

Partial’s performance guarantee

In the worst case:
RR: each merge can write all blocks in Li+1
Partial: each merge writes a bounded number of Li+1 blocks, where the bound is expressed in terms of the maximum allowed size of Li+1

Slide28

Step 3: an even better merge policy?

Observation – internal levels are almost always full in RR & Partial policies

High occupancy levels have more “resistance”: costlier to merge into

Leads to more overlapping blocks during merge

RR and Partial policies are too greedy

Idea: gain long-term savings by applying Full now and then?

Slide29

Towards Mixed policy

Feasibility study

LSM-tree with 3 levels, L0 = 1 MB & fanout = 10
Merge policy `Test' s.t. the L0-to-L1 merge is always Partial and the L1-to-L2 merge is always Full
Test policy beats Partial in some cases
Example: costs per level for index size = 20 MB (uniform distribution)
Costs at L2: Test is slightly worse than Partial
Costs at L1: Test outperforms Partial by a large margin
Total savings are much better than under the Partial policy
Takeaway: apply the Full policy when Li+1 is small; the extra cost of the Full merge will be offset by future savings at upper levels

Slide30

Towards Mixed policy

Threshold for Full vs Partial policies

Depends on the workload

(Charts: the threshold under a uniform distribution and under a normal distribution)

Slide31

Mixed policy

For the steady-state case only
Uppermost merge: Partial, because L0 is always in memory
Lowermost merge: Full if the index-maintenance cost under Full is lower than under Partial; Partial otherwise
Internal-level merges: Full policy if Li+1 is smaller than a threshold for level i+1; Partial otherwise (sketched below)
Learning the parameters: an experiment-based learning algorithm

(Figure: a four-level tree L0..L3 annotated with the policy chosen for each merge)
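A sketch of the per-merge decision under the Mixed policy (illustrative Python; the threshold values and the cost estimates for the lowermost merge are parameters to be learned, and the function name is an assumption):

```python
def choose_policy(level, num_levels, li1_size, thresholds,
                  est_cost_partial=None, est_cost_full=None):
    """Return "Partial" or "Full" for the merge from level `level` into `level + 1`.

    thresholds[i+1]: size below which level i+1 should take a Full merge.
    est_cost_*: steady-state maintenance cost estimates, used only for the lowermost merge.
    """
    if level == 0:
        return "Partial"                       # L0 is always in memory
    if level == num_levels - 2:                # lowermost merge: compare estimated costs
        return "Full" if est_cost_full < est_cost_partial else "Partial"
    # internal-level merge: Full only while the destination level is still small
    return "Full" if li1_size < thresholds[level + 1] else "Partial"

# Example: an internal merge into a level that is still small takes the Full policy.
assert choose_policy(1, 4, li1_size=30, thresholds={2: 50}) == "Full"
```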

Slide32

Results

Settings

L0 = 1 MB, fanout = 10, uniform distribution
Experiment: vary index size
Mixed is the overall winner
RR is close to Partial
Sharp improvement for the Full & Mixed policies when the number of levels increases (around 100 MB, where the tree goes from 3 to 4 levels)
Experiment: vary record size
Block-Preserving Merge can fetch considerable savings

Slide33

Conclusion

Optimized the LSM-tree structure for random-access block devices
Block-Preserving Merge for saving blocks during merge
Studied the performance of merge policies: Full, RR, and Partial
RR is surprisingly good
Introduced the Mixed policy for more I/O savings

Slide34

Outline

Permutation problem

Merge policies for Log-Structured Merge (LSM) tree
Concurrency control in indexes
Conclusion & future work

[VLDB 2013] "Permuting Data on Random-Access Block Storage". Risi Thonangi, Jun Yang
[CIKM 2012] "A Practical Concurrent Index for Solid-State Drives". Risi Thonangi, Shivnath Babu, Jun Yang
[Tech report 2015] "Optimizing LSM-tree for Random-Access Block Storage". Risi Thonangi, Jun Yang

Slide35

Indexing schemes for SSDs

B+tree is sub-optimal for SSDs

Insertions/deletions require in-place updates

Rule of thumb for index design on SSDs: avoid small in-place updates
Buffer insertions and deletions
Batch-reorganize the index when the buffer overflows
Indexing schemes proposed for SSDs: BFTL, LA-tree, FD-tree, ...
They have bad response times during index reorganizations

Slide36

FD-tree

Index structure

Logarithmic method: levels with geometrically increasing sizes
Top level cached in memory
Fractional cascading: pointers between levels
Processing updates to the FD-tree
Store them in L0
If L0 is full, merge L0 into L1
Continue merges to lower levels until all levels are within their size limits
Example: when L2 is within its size limit, the merge stops there

(Figure: levels of insert (I) and delete (D) entries being merged down the tree)

Not designed for efficient concurrent access during a merge!

Slide37

FD+tree

Modified and improved version of FD-tree

Allows designing an efficient concurrency scheme
Other advantages: deletion support, level skipping, and tighter performance guarantees
FD+tree's merge
Calculate, in advance, the number of levels to merge
Merge all those levels in a single shot
FD+tree's advantages
Fewer write I/Os than the FD-tree
Maintains a valid index structure during the merge: useful for concurrent access

(Figure: an FD+tree merge rewriting several levels of insert (I) and delete (D) entries in one shot)
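A sketch of calculating, in advance, how far a merge must cascade (illustrative Python; it assumes entries simply accumulate downward and ignores deletions canceling insertions):

```python
def levels_to_merge(level_sizes, level_capacities):
    """Return k such that merging levels L0..Lk in a single shot leaves every
    level within its size limit (assuming all merged entries land in Lk)."""
    carried = 0
    for k, (size, cap) in enumerate(zip(level_sizes, level_capacities)):
        carried += size
        if carried <= cap:           # Lk can absorb everything above it: stop here
            return k
    return len(level_sizes) - 1      # otherwise merge all the way down

# Example: L0 overflows; L1 can absorb it, so the merge stops after rewriting L1.
assert levels_to_merge([5, 20, 300], [4, 40, 400]) == 1
```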

Slide38

FD+FC: Full Concurrency for the FD+tree
Goals:
Don't use extra space
No coarse-grained locking of the tree
Idea:
Maintain a wavefront to track the progress of the merge
Delete blocks in the old levels that have already been merged
Lookups check the wavefront to determine which levels to search

(Figure: old and new levels during a merge, with the wavefront marking how far the merge has progressed)
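A sketch of a wavefront-aware lookup (illustrative Python; the convention that keys below the wavefront have already moved to the new, merged levels is an assumption about how the wavefront splits the key space):

```python
def lookup_during_merge(key, wavefront_key, old_levels, new_levels):
    """Search the index while a merge is in progress.

    Assumption: the merge proceeds in key order, so keys below the wavefront are
    already in the new (merged) levels, and keys at or above it must still be
    looked up in the old levels.  Each level is modeled as a dict.
    """
    levels = new_levels if key < wavefront_key else old_levels
    for level in levels:               # smaller (more recent) levels first
        if key in level:
            return level[key]
    return None

# Example: a key below the wavefront is served from the new levels.
old = [{"k2": "old"}]
new = [{"k1": "new"}]
assert lookup_during_merge("k1", "k5", old, new) == "new"
assert lookup_during_merge("k7", "k5", old, new) is None
```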

Slide39

FD+FC (contd.)

Technical challenges

Merge implementation

"Record at a time" vs. "block at a time"
Reclamation of blocks from old levels
A block is reclaimed only after the next block has started being processed
This ensures all children of the block are done being processed by the merge

(Figure: a block with keys 11 13 18 25 29 39 and child blocks with keys 20 22 27 28 30 42; one block cannot be reclaimed yet because entry 28 would be orphaned, while another can be reclaimed now)

Slide40

FD+tree & FD+FC – more details

Level skipping

Utilizes main memory more efficiently

Proper deletion support + stronger performance guarantees
Merges do not read-lock blocks

Slide41

Experimental results

FD+FC vs. FD+XM (global lock) and FD+DS (space doubling)

Synthetic workloads

FD+FC is better for both inserts and lookups

FD+DS fares poorly for inserts


Slide42

Conclusion

Concurrency control is important for SSD indexes
Good concurrency control requires both:
carefully rethinking index operations (FD+tree), and
designing fine-grained but low-overhead protocols (FD+FC)

Slide43

Outline

Permutation problem

Merge policies for Log-Structured Merge (LSM) tree
Concurrency control in indexes
Conclusion & future work

[VLDB 2013] "Permuting Data on Random-Access Block Storage". Risi Thonangi, Jun Yang
[CIKM 2012] "A Practical Concurrent Index for Solid-State Drives". Risi Thonangi, Shivnath Babu, Jun Yang
[Tech report 2015] "Optimizing LSM-tree for Random-Access Block Storage". Risi Thonangi, Jun Yang

Slide44

Conclusion & Future Work

Studied optimizations to database algorithms for Random Access Block Devices

Using random-access block devices as drop-in replacements is good but sub-optimal
Optimizing database algorithms for random access and read/write asymmetry can fetch considerably more savings, in both cost & performance
Future work
Optimizing for multi-channel parallelism in random-access block devices
Utilizing on-disk computing resources for data processing tasks
Optimizing for phase-change RAM: where should we fit PC-RAM in the storage architecture?
Exploring how high to push the specializations in the system architecture
We've shown the benefit of specializing access methods and query processing algorithms
What about query optimization?

Slide45

Thank you


Slide46

Concurrency schemes – some proposals

Exclusive merge (FD+XM)
Lock the complete tree for every op (incl. a long merge)
Simple, but little concurrency benefit
Doubling space (FD+DS)
Don't delete old levels until the merge completes
Incurs twice the space cost and inefficient main-memory utilization

Some concurrency control is still needed


Slide47

Experimental results

Synthetic workloads

FD+FC is better for both inserts and lookups

FD+DS fares poorly for inserts

TPC-C like workloads

FD+FC's worst-case Rp is less than a second
FD+DS's worst-case Rp increases as the size of the database grows

Slide48

Results

As index size grows

Mixed policy is at least as good as Partial

An extra level with lower occupancy is better
Even when the index size is larger
(Charts: uniform and normal distributions)
Block-Preserving Merge is very efficient as the index record size increases
(Chart: uniform distribution)

Slide49

Modified LSM-tree

To save blocks during merge

Merge policies – Full, Round Robin, Partial

Mixed merge policy: combine Full and Partial for increased I/O savings
Results
Conclusion

Slide50

Indexing schemes for SSDs

B+tree is sub-optimal for SSDs

Insertions/deletions require in-place updates

(Figure: a small B+tree with root keys 4, 8 and leaves 1 3, 4 6, 8 9; inserting key 7 forces an in-place update)

Rule of thumb for index design on SSDs: avoid small in-place updates
Buffer insertions and deletions
Batch-reorganize the index
Indexing schemes proposed for SSDs: BFTL, LA-tree, FD-tree, PIO B-tree, ...