Slide 1: Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage
Jeremy C. W. Chan*, Qian Ding*, Patrick P. C. Lee, Helen H. W. Chan
The Chinese University of Hong Kong
FAST'14
*The first two authors contributed equally to this work.
Slide 2: Motivation
Clustered storage systems provide scalable storage by striping data across multiple nodes, e.g., GFS, HDFS, Azure, Ceph, Panasas, Lustre, etc.
They maintain data availability with redundancy: replication or erasure coding.
Slide 3: Motivation
With explosive data growth, enterprises move to erasure-coded storage to save storage footprint and cost; e.g., 3-way replication has 200% overhead, while erasure coding can reduce the overhead to 33% [Huang, ATC'12] (a worked example follows below).
Erasure coding recap:
- Encodes data chunks to create parity chunks
- Any sufficiently large subset of data/parity chunks can recover the original data chunks
Erasure coding introduces two challenges: (1) updates and (2) recovery/degraded reads.
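A minimal sketch (not from the paper) of the overhead arithmetic, comparing 3-way replication with an (n, k) erasure code; the (16, 12) parameters below are an illustrative assumption chosen to roughly match the 33% figure.

```python
def replication_overhead(copies: int) -> float:
    """Extra storage beyond the original data, as a fraction."""
    return copies - 1.0

def erasure_code_overhead(n: int, k: int) -> float:
    """n total chunks store k data chunks; overhead is the parity fraction."""
    return (n - k) / k

print(replication_overhead(3))        # 2.0  -> 200% overhead
print(erasure_code_overhead(16, 12))  # ~0.33 -> about 33% overhead
```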
Slide 4: Challenges
1. Updates are expensive
When a data chunk is updated, its encoded parity chunks also need to be updated.
Two update approaches:
- In-place updates: overwrite existing chunks
- Log-based updates: append changes
Slide 5: Challenges
2. Recovery/degraded reads are expensive
Failures are common: data may be permanently lost due to crashes, and 90% of failures are transient (e.g., reboots, power loss, network connectivity loss, stragglers) [Ford, OSDI'10].
Recovery/degraded read approach (sketched below):
- Read enough data and parity chunks
- Reconstruct the lost/unavailable chunks
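A minimal sketch of the reconstruction idea using a single-parity (RAID-5-like) code, where the parity is the XOR of the data chunks; the chunk contents and the lost-chunk scenario are illustrative assumptions, not CodFS code.

```python
from functools import reduce

def xor_chunks(chunks):
    """XOR a list of equal-length byte strings together."""
    return reduce(lambda acc, c: bytes(a ^ b for a, b in zip(acc, c)), chunks)

A, B, C = b"\x11" * 4, b"\x22" * 4, b"\x37" * 4
P = xor_chunks([A, B, C])           # encode: single parity chunk
A_rebuilt = xor_chunks([B, C, P])   # degraded read: rebuild the lost chunk A
assert A_rebuilt == A
```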
Slide 6: Challenges
How to achieve both efficient updates and fast recovery in clustered storage systems?
Target scenario:
- Server workloads with frequent updates
- Commodity configurations with frequent failures
- Disk-based storage
Potential bottlenecks in clustered storage systems: network I/O and disk I/O.
Slide 7: Our Contributions
Propose parity logging with reserved space:
- Uses hybrid in-place and log-based updates
- Puts parity deltas in a reserved space next to parity chunks to mitigate disk seeks
- Predicts and reclaims reserved space in a workload-aware manner
- Achieves both efficient updates and fast recovery
Build a clustered storage system prototype, CodFS:
- Incorporates different erasure coding and update schemes
- Released as open-source software
Conduct extensive trace-driven testbed experiments.
Slide 8: Background: Trace Analysis
MSR Cambridge traces:
- Block-level I/O traces captured by Microsoft Research Cambridge in 2007
- 36 volumes (179 disks) on 13 servers
- Workloads include home directories and project directories
Harvard NFS traces (DEAS03):
- NFS requests/responses of a NetApp file server in 2003
- Mixed workloads including email, research, and development
Slide 9: MSR Trace Analysis
[Figure: distribution of update size in 10 volumes of the MSR Cambridge traces]
Updates are small:
- All updates are smaller than 512KB
- 8 in 10 volumes show more than 60% of tiny updates (<4KB)
Slide 10: MSR Trace Analysis
Updates are intensive:
- 9 in 10 volumes show more than 90% update writes over all writes
Update coverage varies:
- Measured by the fraction of the working set size (WSS) that is updated at least once throughout the trace period
- Large variation among different workloads, motivating a dynamic algorithm for handling updates
Similar observations hold for the Harvard traces.
Slide 11: Objective #1
Efficient handling of small, intensive updates in erasure-coded clustered storage
Slide 12: Saving Network Traffic in Parity Updates
Make use of the linearity of erasure coding: CodFS reduces network traffic by sending only the parity delta (see the sketch after this slide).
[Figure: data chunks A, B, C encode into parity P via a linear combination with some encoding coefficient; when A is updated to A', the new parity P' is obtained by applying the same encoding coefficient to the parity delta (ΔA).]
Question: How to save the data update (A') and the parity delta (ΔA) on disk?
Slide 13: Update Approach #1: In-place Updates (Overwrite)
Used in host-based file systems (e.g., NTFS and ext4); also used for parity updates in RAID systems.
[Figure: chunks A, B, C on disk; updating A overwrites it in place.]
Problem: significant I/O to read and update parities.
Slide 14: Update Approach #2: Log-based Updates (Logging)
Used by most clustered storage systems (e.g., GFS, Azure).
Original concept from the log-structured file system (LFS): convert random writes into sequential writes.
[Figure: chunks A, B, C on disk; updating A appends the new data at the end of the log instead of overwriting A.]
Problem: fragmentation of chunk A.
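A minimal sketch (illustrative, not CodFS code) contrasting the two update approaches on a flat chunk file; the chunk size and function names are assumptions.

```python
import os

CHUNK = 4096  # assumed chunk size in bytes

def inplace_update(f, chunk_idx: int, data: bytes) -> None:
    """Overwrite the chunk at its original location (may trigger a seek)."""
    f.seek(chunk_idx * CHUNK)
    f.write(data)

def log_update(f, data: bytes) -> int:
    """Append the new version at the log tail and return its offset.
    Writes stay sequential, but later reads of the chunk become fragmented."""
    f.seek(0, os.SEEK_END)
    offset = f.tell()
    f.write(data)
    return offset
```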
Slide 15: Objective #2
Preserve sequentiality in large reads (e.g., recovery) for both data and parity chunks
Slide 16: Parity Update Schemes
Scheme                                      Data Update   Parity Delta
Full-overwrite (FO)                         O             O
Full-logging (FL)                           L             L
Parity-logging (PL)                         O             L
Parity-logging with reserved space (PLR)    O             L     <- our proposal
(O: overwrite, L: logging)
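A toy, in-memory sketch of where each scheme places a data update (O/L) and its parity delta (O/L); the data structures and names are assumptions for illustration, not CodFS internals.

```python
data_chunks = {"a": bytes(4)}            # in-place data chunks
parity_chunks = {"a+b": bytes(4)}        # in-place parity chunks
data_log, parity_log = [], []            # append-only logs
reserved = {"a+b": []}                   # reserved space next to each parity chunk

def xor4(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))

def apply_update(scheme: str, cid: str, pid: str, new_data: bytes, delta: bytes):
    if scheme == "FL":
        data_log.append((cid, new_data))                          # L: log the data update
    else:                                                         # FO, PL, PLR
        data_chunks[cid] = new_data                               # O: overwrite the data chunk
    if scheme == "FO":
        parity_chunks[pid] = xor4(parity_chunks[pid], delta)      # O: merge the delta now
    elif scheme == "PLR":
        reserved[pid].append(delta)                               # L: append into reserved space
    else:                                                         # FL, PL
        parity_log.append((pid, delta))                           # L: append to a separate log

apply_update("PLR", "a", "a+b", b"\x01\x02\x03\x04", b"\x05\x05\x05\x05")
```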
Slides 17-22: Parity Update Schemes (illustration)
[Figure, built up across these slides: a data stream of chunks a, b, c, d followed by updates a', b', c' is striped across three storage nodes; the parity chunks a+b and c+d and the parity deltas Δa, Δb, Δc are placed differently by each scheme: FO overwrites data and parity in place; FL appends both data updates and parity deltas to logs; PL overwrites data but logs parity deltas; PLR overwrites data and appends parity deltas into the reserved space next to each parity chunk.]
Slide 23: Parity Update Schemes (comparison)
[Figure: final on-disk layout of the data stream under FO, FL, PL, and PLR across three storage nodes.]
- FO: extra read for merging parity
- FL: disk seek for chunk b
- FL & PL: disk seek for parity chunk b
- PLR: no seeks for either data or parity
Slide 24: Implementation - CodFS
CodFS architecture:
- Exploits parallelization across nodes and within each node
- Provides a file system interface based on FUSE
- OSD: modular design
Slide 25: Experiments
Testbed: 22 nodes with commodity hardware
- 12-node storage cluster
- 10 client nodes sending requests
- Connected via a Gigabit switch
Experiments:
- Baseline tests: show that CodFS can achieve the theoretical throughput
- Synthetic workload evaluation
- Real-world workload evaluation
Focus of this talk: the synthetic and real-world workload evaluations.
Slide 26: Synthetic Workload Evaluation - Random Write
[Figure: random-write IOPS under FO, FL, PL, and PLR; IOzone record length 128KB, RDP coding (6,4).]
Logging parity (FL, PL, PLR) helps random writes by saving disk seeks and parity read overhead.
FO has 20% fewer IOPS than the others.
Slide 27: Synthetic Workload Evaluation - Sequential Read and Recovery
[Figure: sequential read and recovery throughput under FO, FL, PL, and PLR; the figure also marks the merge overhead.]
No seeks in recovery for FO and PLR.
Only FL needs disk seeks when reading data chunks.
Slide 28: Fixing Storage Overhead
[Figure: random write and recovery throughput with the storage overhead fixed, comparing PLR (6,4) against FO/FL/PL (8,6) and FO/FL/PL (8,4); legend: data chunk, parity chunk, reserved space.]
FO (8,6) is still 20% slower than PLR (6,4) in random writes.
PLR and FO are still much faster than FL and PL.
Slide 29: Dynamic Resizing of Reserved Space
Remaining problem: what is the appropriate reserved space size?
- Too small: frequent merges
- Too large: waste of space
Can we shrink the reserved space if it is not used?
Baseline approach: fixed reserved space size.
Workload-aware management approach:
- Predict: exponential moving average to estimate the reserved space size
- Shrink: release unused space back to the system
- Merge: merge all parity deltas back into the parity chunk
Slide 30: Dynamic Resizing of Reserved Space
Step 1: Compute the utility from the past workload pattern with an exponential moving average: utility = smoothing factor × current usage + (1 - smoothing factor) × previous usage.
Step 2: Compute the number of chunks to shrink.
Step 3: Perform the shrink; new data chunks are written into the released space.
[Figure: disk layout before and after the shrink.]
Shrinking the reserved space as a multiple of the chunk size avoids creating unusable “holes”.
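A minimal sketch of the workload-aware resizing steps described on this slide; the smoothing factor, chunk size, and function names are assumptions for illustration, not CodFS internals.

```python
CHUNK_SIZE = 16 * 2**20   # assumed 16MB chunks
ALPHA = 0.5               # assumed smoothing factor

def predict_usage(current_usage: int, prev_usage: int, alpha: float = ALPHA) -> int:
    """Step 1: exponential moving average of reserved-space usage (bytes)."""
    return int(alpha * current_usage + (1 - alpha) * prev_usage)

def chunks_to_shrink(reserved: int, predicted_usage: int) -> int:
    """Step 2: release unused reserved space, but only in whole chunks
    so that no unusable 'holes' smaller than a chunk are created."""
    unused = max(reserved - predicted_usage, 0)
    return unused // CHUNK_SIZE

# Example: 2 reserved chunks, usage dropping over time -> shrink by 1 chunk.
pred = predict_usage(current_usage=6 * 2**20, prev_usage=20 * 2**20)
print(chunks_to_shrink(reserved=2 * CHUNK_SIZE, predicted_usage=pred))  # -> 1
```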
Slide 31: Dynamic Resizing of Reserved Space
[Figure: reserved space overhead under different shrink strategies in the Harvard trace; (10,8) Cauchy RS coding with 16MB segments.]
- 16MB baseline: fixed reserved space size
- Shrink only: shrinks at 00:00 and 12:00 each day
- Shrink + merge: additionally performs a merge after the daily shrinking
Slide 32: Penalty of Over-shrinking
[Figure: average number of merges per 1000 writes under different shrink strategies in the Harvard trace; (10,8) Cauchy RS coding with 16MB segments.]
Penalty of inaccurate prediction: less than 1% of writes are stalled by a merge operation.
Slide 33: Open Issues
- Latency analysis
- Metadata management
- Consistency / locking
- Applicability to different workloads
Slide 34: Conclusions
Key idea: parity logging with reserved space
- Keep parity updates next to parity chunks to reduce disk seeks
- Workload-aware scheme to predict and adjust the reserved space size
Built the CodFS prototype that achieves efficient updates and fast recovery.
Source code: http://ansrlab.cse.cuhk.edu.hk/software/codfs
Slide 35: Backup
Slide 36: MSR Cambridge Traces Replay - Update Throughput
PLR ~ PL ~ FL >> FO
Slide 37: MSR Cambridge Traces Replay - Recovery Throughput
PLR ~ FO >> FL ~ PL