
Presentation Transcript

Slide1

Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage

Jeremy C. W. Chan*, Qian Ding*, Patrick P. C. Lee, Helen H. W. Chan
The Chinese University of Hong Kong
FAST'14

*The first two authors contributed equally to this work.

Slide2

Motivation

- Clustered storage systems provide scalable storage by striping data across multiple nodes, e.g., GFS, HDFS, Azure, Ceph, Panasas, Lustre, etc.
- They maintain data availability with redundancy:
  - Replication
  - Erasure coding

Slide3

Motivation

- With explosive data growth, enterprises move to erasure-coded storage to save storage footprint and cost; e.g., 3-way replication has 200% storage overhead, while erasure coding can reduce the overhead to 33% [Huang, ATC'12]
- Erasure coding recap:
  - Encodes data chunks to create parity chunks
  - Any k of the n data/parity chunks suffice to recover the original data chunks
- Erasure coding introduces two challenges: (1) updates and (2) recovery/degraded reads
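As a quick check on those overhead figures, here is a minimal Python sketch of the arithmetic; the (16, 12) code is an illustrative choice of parameters, not something stated on the slides.

```python
# Storage overhead = redundant bytes / original data bytes.
# The (n, k) = (16, 12) code below is illustrative, not from the slides.

def replication_overhead(copies: int) -> float:
    return float(copies - 1)            # extra full copies per original byte

def erasure_overhead(n: int, k: int) -> float:
    return (n - k) / k                  # n - k parity chunks per k data chunks

print(f"3-way replication:    {replication_overhead(3):.0%}")    # 200%
print(f"(16,12) erasure code: {erasure_overhead(16, 12):.0%}")   # ~33%
```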

Slide4

Challenges

1. Updates are expensive
- When a data chunk is updated, its encoded parity chunks need to be updated
- Two update approaches:
  - In-place updates: overwrite existing chunks
  - Log-based updates: append changes

Slide5

Challenges

2. Recovery/degraded reads are expensive
- Failures are common
  - Data may be permanently lost due to crashes
  - 90% of failures are transient (e.g., reboots, power loss, network connectivity loss, stragglers) [Ford, OSDI'10]
- Recovery/degraded read approach:
  - Read enough data and parity chunks
  - Reconstruct lost/unavailable chunks

Slide6

Challenges

How to achieve both efficient updates and fast recovery in clustered storage systems?
- Target scenario:
  - Server workloads with frequent updates
  - Commodity configurations with frequent failures
  - Disk-based storage
- Potential bottlenecks in clustered storage systems:
  - Network I/O
  - Disk I/O

Slide7

Our Contributions

- Propose parity logging with reserved space (PLR)
  - Uses hybrid in-place and log-based updates
  - Puts parity deltas in a reserved space next to parity chunks to mitigate disk seeks
  - Predicts and reclaims the reserved space in a workload-aware manner
  - Achieves both efficient updates and fast recovery
- Build a clustered storage system prototype, CodFS
  - Incorporates different erasure coding and update schemes
  - Released as open-source software
- Conduct extensive trace-driven testbed experiments

Slide8

Background: Trace Analysis

- MSR Cambridge traces
  - Block-level I/O traces captured by Microsoft Research Cambridge in 2007
  - 36 volumes (179 disks) on 13 servers
  - Workloads including home directories and project directories
- Harvard NFS traces (DEAS03)
  - NFS requests/responses of a NetApp file server in 2003
  - Mixed workloads including email, research and development

Slide9

MSR Trace Analysis

[Figure: distribution of update size in 10 volumes of the MSR Cambridge traces]

Updates are small
- All updates are smaller than 512KB
- 8 in 10 volumes show more than 60% of tiny updates (<4KB)

Slide10

MSR Trace Analysis

Updates are intensive
- 9 in 10 volumes show more than 90% update writes over all writes

Update coverage varies
- Measured by the fraction of the working set size (WSS) that is updated at least once throughout the trace period
- Large variation among different workloads, so a dynamic algorithm is needed for handling updates

Similar observations hold for the Harvard traces

Slide11

Objective #1: Efficient handling of small, intensive updates in erasure-coded clustered storage

Slide12

Saving Network Traffic in Parity Updates

- Make use of the linearity of erasure coding
- CodFS reduces network traffic by sending only the parity delta
- Question: how to save the data update (A') and the parity delta (ΔA) on disk?

[Figure: data chunks A, B, C are encoded into parity chunk P by a linear combination with some encoding coefficients. Updating A to A' turns P into P', which is equivalent to applying the same encoding coefficient to the parity delta (ΔA) and adding the result to P.]
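To make the linearity concrete, here is a minimal Python sketch of the parity-delta trick. It is not CodFS code: it assumes a single XOR parity so that every encoding coefficient is 1, whereas a Reed-Solomon-style code would multiply the delta by the appropriate Galois-field coefficient before adding it to P.

```python
# Parity-delta update with a single XOR parity (all coefficients = 1).
# Illustrative only; real erasure codes use Galois-field arithmetic.

def xor_bytes(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))

# Original data chunks and their parity P = A ^ B ^ C.
A, B, C = b"\x11" * 4, b"\x22" * 4, b"\x33" * 4
P = xor_bytes(xor_bytes(A, B), C)

# The client updates A to A' and sends only the parity delta dA = A ^ A'
# to the parity node, instead of the whole new chunk plus B and C.
A_new = b"\x55" * 4
delta_A = xor_bytes(A, A_new)

# The parity node applies the delta without reading B or C:
# P' = P ^ dA (with general coefficients, P' = P + coeff * dA).
P_new = xor_bytes(P, delta_A)

assert P_new == xor_bytes(xor_bytes(A_new, B), C)
print("parity delta update verified")
```

Only ΔA crosses the network, which is why CodFS can cut parity-update traffic; the remaining question on the slide is how A' and ΔA are then laid out on disk.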

Slide13

Update Approach #1: in-place updates (overwrite)

- Used in host-based file systems (e.g., NTFS and ext4)
- Also used for parity updates in RAID systems

[Figure: chunks A, B, C laid out contiguously on disk; updating A overwrites A in place.]

Problem: significant I/O to read and update parities

Slide14

Update Approach #2: log-based updates (logging)

- Used by most clustered storage systems (e.g., GFS, Azure)
- Original concept from the log-structured file system (LFS)
- Converts random writes into sequential writes

[Figure: chunks A, B, C laid out on disk; updating A appends the new version of A to the end of the log instead of overwriting it.]

Problem: fragmentation of chunk A
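Here is a minimal Python sketch of the two update paths applied to a single chunk file; the file names and the (offset, length, data) log record format are invented for illustration and are not how CodFS stores chunks. It only shows where the random write and the later fragmentation come from.

```python
import os

CHUNK = "chunk_A.bin"   # hypothetical file holding data chunk A
LOG = "chunk_A.log"     # hypothetical append-only log for chunk A

def write_chunk(data: bytes) -> None:
    with open(CHUNK, "wb") as f:
        f.write(data)

def update_in_place(offset: int, data: bytes) -> None:
    """Overwrite: seek into the existing chunk and replace the old bytes (a random write)."""
    with open(CHUNK, "r+b") as f:
        f.seek(offset)
        f.write(data)

def update_logged(offset: int, data: bytes) -> None:
    """Logging: append an (offset, length, data) record; the chunk file is untouched,
    so the chunk becomes fragmented across the base copy and the log."""
    with open(LOG, "ab") as f:
        f.write(offset.to_bytes(8, "little") + len(data).to_bytes(8, "little") + data)

def read_chunk_with_log() -> bytes:
    """A large read (e.g., during recovery) must replay the log over the base chunk."""
    with open(CHUNK, "rb") as f:
        buf = bytearray(f.read())
    if os.path.exists(LOG):
        with open(LOG, "rb") as f:
            while header := f.read(16):
                off = int.from_bytes(header[:8], "little")
                length = int.from_bytes(header[8:], "little")
                buf[off:off + length] = f.read(length)
    return bytes(buf)

if os.path.exists(LOG):
    os.remove(LOG)
write_chunk(b"\x00" * 4096)
update_in_place(100, b"new bytes")   # one seek + write, no fragmentation
update_logged(200, b"new bytes")     # pure append, but later reads pay for the merge
assert read_chunk_with_log()[200:209] == b"new bytes"
```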

Slide15

Objective #2: Preserve sequentiality in large reads (e.g., recovery) for both data and parity chunks

Slide16

Parity Update Schemes

Scheme                                                   Data Update   Parity Delta
Full-overwrite (FO)                                      O             O
Full-logging (FL)                                        L             L
Parity-logging (PL)                                      O             L
Parity-logging with reserved space (PLR, our proposal)   O             L

O: Overwrite   L: Logging

Slide17

Slide17 to Slide22

Parity Update Schemes

[Figure walkthrough: a stream of data chunks a, b, c, d, followed by updates to a, b, and c, is written under FO, FL, PL, and PLR across Storage Node 1, Storage Node 2, and Storage Node 3. Each step shows where the scheme places the new data chunks, the parity chunks (a+b, c+d), and the parity deltas (Δa, Δb, Δc) on disk.]

Slide23

Parity Update Schemes

[Figure: final on-disk layout of the four schemes across Storage Node 1, Storage Node 2, and Storage Node 3 after the data stream a, b, c, d and the updates to a, b, and c.]

- FO: extra read for merging parity
- FL: disk seek for chunk b
- FL & PL: disk seek for parity chunk b
- PLR: no seeks for both data and parity
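A minimal sketch of the PLR layout idea illustrated above, assuming a made-up PLRParity class, an XOR-based merge, and illustrative chunk and reserved-space sizes; it is not the CodFS implementation.

```python
# PLR sketch: each parity chunk is followed by a reserved region on disk,
# so parity deltas are appended right next to the parity they modify and
# can later be merged in place. Sizes and the merge policy are illustrative.

CHUNK_SIZE = 4096
RESERVED_SIZE = 4096  # reserved space kept contiguous with the parity chunk


class PLRParity:
    def __init__(self, parity: bytes):
        assert len(parity) == CHUNK_SIZE
        self.parity = bytearray(parity)
        self.deltas = []          # (offset, delta bytes) logged in the reserved space
        self.reserved_used = 0

    def append_delta(self, offset: int, delta: bytes) -> None:
        """Log a parity delta into the reserved space next to the parity chunk."""
        if self.reserved_used + len(delta) > RESERVED_SIZE:
            self.merge()          # reserved space full: fold deltas into the parity
        self.deltas.append((offset, delta))
        self.reserved_used += len(delta)

    def merge(self) -> None:
        """Apply all logged deltas to the parity chunk and free the reserved space."""
        for offset, delta in self.deltas:
            for i, b in enumerate(delta):
                self.parity[offset + i] ^= b
        self.deltas.clear()
        self.reserved_used = 0


p = PLRParity(parity=bytes(CHUNK_SIZE))
p.append_delta(offset=128, delta=b"\x0f" * 64)  # lands next to the parity, no seek away
p.merge()                                       # fold logged deltas into the parity chunk
```

Because the reserved space sits physically next to the parity chunk, appending a delta and later reading the parity sequentially during recovery both avoid the extra seeks that FL and PL pay; the cost is the reserved capacity, which the following slides address.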

Slide24

Implementation: CodFS

- CodFS architecture
  - Exploits parallelization across nodes and within each node
  - Provides a file system interface based on FUSE
- OSD: modular design

Slide25

Experiments

- Testbed: 22 nodes with commodity hardware
  - 12-node storage cluster
  - 10 client nodes sending requests
  - Connected via a Gigabit switch
- Experiments
  - Baseline tests: show that CodFS can achieve the theoretical throughput
  - Synthetic workload evaluation (focus of this talk)
  - Real-world workload evaluation (focus of this talk)

Slide26

Synthetic Workload Evaluation

Random write
[Figure: random-write IOPS under FO, FL, PL, and PLR; IOzone record length 128KB, RDP (6,4) coding]
- Logging the parity (FL, PL, PLR) helps random writes by saving disk seeks and parity read overhead
- FO has about 20% lower IOPS than the others

Slide27

Synthetic Workload Evaluation

Sequential read and recovery
[Figure: sequential-read and recovery throughput under FO, FL, PL, and PLR]
- No seeks in recovery for FO and PLR
- Only FL needs disk seeks when reading data chunks
- FL and PL pay a merge overhead for their logged parity deltas

Slide28

Fixing Storage Overhead

[Figure: chunk layouts compared under a fixed storage overhead: PLR (6,4) with reserved space vs. FO/FL/PL (8,6) and FO/FL/PL (8,4), showing data chunks, parity chunks, and reserved space, with random-write and recovery results]
- Random write: FO (8,6) is still 20% slower than PLR (6,4)
- Recovery: PLR and FO are still much faster than FL and PL

Slide29

Dynamic Resizing of Reserved Space

- Remaining problem: what is the appropriate reserved space size?
  - Too small: frequent merges
  - Too large: waste of space
  - Can we shrink the reserved space if it is not used?
- Baseline approach
  - Fixed reserved space size
- Workload-aware management approach
  - Predict: use an exponential moving average to estimate the reserved space size
  - Shrink: release unused space back to the system
  - Merge: merge all parity deltas back into the parity chunk

Slide30

Dynamic Resizing of Reserved Space

- Step 1: Compute the predicted utility from the past workload pattern with an exponential moving average: predicted usage = (smoothing factor) × (current usage) + (1 - smoothing factor) × (previous usage)
- Step 2: Compute the number of chunks to shrink from the predicted usage
- Step 3: Perform the shrink
  [Figure: the unused tail of the reserved space is released and new data chunks are written into the freed region]
- Shrinking the reserved space in multiples of the chunk size avoids creating unusable "holes"
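A minimal sketch of the prediction and shrink arithmetic, assuming an illustrative smoothing factor of 0.5 and 16MB chunks; the function names and the rounding policy are made up for the example and are not taken from CodFS.

```python
# Workload-aware resizing sketch: predict reserved-space usage with an
# exponential moving average and shrink only in whole-chunk multiples.
# alpha and CHUNK_SIZE are illustrative.

CHUNK_SIZE = 16 * 1024 * 1024  # 16MB chunks


def predict_usage(current_usage: int, previous_prediction: float, alpha: float = 0.5) -> float:
    """Step 1: exponential moving average of reserved-space usage (bytes)."""
    return alpha * current_usage + (1 - alpha) * previous_prediction


def chunks_to_shrink(reserved_size: int, predicted_usage: float) -> int:
    """Step 2: whole chunks that can be released; rounding to chunk multiples
    avoids leaving unusable 'holes' in the on-disk layout."""
    unused = reserved_size - predicted_usage
    return max(0, int(unused // CHUNK_SIZE))


# Example: 64MB reserved, ~10MB of deltas in the current interval,
# 20MB predicted in the previous interval.
predicted = predict_usage(10 * 1024 * 1024, 20 * 1024 * 1024)
print(chunks_to_shrink(64 * 1024 * 1024, predicted))  # releases 3 of the 4 reserved chunks
```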

Slide31

Dynamic Resizing of Reserved Space

[Figure: reserved space overhead under different shrink strategies in the Harvard trace; (10,8) Cauchy RS coding with 16MB segments]
- 16MB baseline: fixed 16MB reserved space
- Shrink only: performs shrinking at 00:00 and 12:00 each day
- Shrink + merge: additionally performs a merge after the daily shrinking

Slide32

Penalty of Over-shrinking

[Figure: average number of merges per 1000 writes under different shrink strategies in the Harvard trace; (10,8) Cauchy RS coding with 16MB segments]
- The penalty of inaccurate prediction is small: less than 1% of writes are stalled by a merge operation

Slide33

Open Issues

- Latency analysis
- Metadata management
- Consistency / locking
- Applicability to different workloads

Slide34

Conclusions

- Key idea: parity logging with reserved space
  - Keep parity updates next to parity chunks to reduce disk seeks
  - Workload-aware scheme to predict and adjust the reserved space size
- Built the CodFS prototype that achieves efficient updates and fast recovery
- Source code: http://ansrlab.cse.cuhk.edu.hk/software/codfs

Slide35

Backup

Slide36

MSR Cambridge Traces Replay

Update throughput: PLR ~ PL ~ FL >> FO

Slide37

MSR Cambridge Traces Replay

Recovery throughput: PLR ~ FO >> FL ~ PL