Slide 1
Coerced Cache Eviction and Discreet-Mode Journaling: Dealing with Misbehaving Disks
Abhishek Rajimwale*, Vijay Chidambaram, Deepak Ramamurthi, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
*Data Domain Inc.
University of Wisconsin-Madison
Slide 2: Disks are not perfect
DSN 11, 7/7/11
Expanding disk fault model:
  Latent sector errors [Bairavasundaram SIGMETRICS 07]: countered with RAID-6
  Block corruption [Bairavasundaram FAST 08]: countered with checksums
  The disk cache: always trusted so far
[Figure: disk surface and disk cache]
Slide 3: Disk Caches
Disk cache improves performance, but at the risk of data loss
  Order of writes issued by file system: A, B, C
  Disks reorder writes during destaging: B, A, C
File systems flush the disk cache to ensure correct ordering of writes
  A, flush, B, flush, C
[Figure: a write to disk passes through the disk cache before reaching the disk surface]
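The "A, flush, B, flush, C" pattern can be sketched from user space as follows. This is a minimal illustration, assuming that os.fsync() is the flush request; the helper names write_in_order and demo_ordered_writes are hypothetical and not part of the talk. The sketch only shows where the ordering points sit. It cannot guarantee that the drive honors them, which is the problem the rest of the talk addresses.

```python
import os
import tempfile

def write_in_order(path, blocks):
    """Write each block and call fsync() after it, so the next block is
    not issued until the OS has asked the device to persist this one."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for block in blocks:
            os.write(fd, block)
            os.fsync(fd)  # ordering point; a misbehaving disk may ignore it
    finally:
        os.close(fd)

def demo_ordered_writes():
    """Write A, B, C with a flush after each and return the file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ordered.bin")
        write_in_order(path, [b"A", b"B", b"C"])
        with open(path, "rb") as f:
            return f.read()
```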
Slide 4: Problem: Flushing doesn't work
Disks can fail to flush data upon request
One reason: bugs
  Errors in the storage stack [Bairavasundaram FAST 08]
  Improper propagation of error codes [Bairavasundaram FAST 08]
  Inadequate failure policies [Prabhakaran SOSP 05]
  Bugs in the firmware [Ghemawat SOSP 03]
Slide 5: Disks can lie!
Misbehaving disks ignore or delay flush requests
  Increases the risk of data loss
  File systems are usually blamed for such loss
Slide 6: Disks can lie!
F_FULLFSYNC, from the fcntl man page in Mac OS X:
  "Does the same thing as fsync(2) then asks the drive to flush all buffered data to the permanent storage device (arg is ignored). This is currently implemented on HFS, MS-DOS (FAT), and Universal Disk Format (UDF) file systems. The operation may take quite a while to complete. Certain FireWire drives have also been known to ignore the request to flush their buffered data."
Evidence from industry experts:
  Microsoft
  Seagate
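As a rough illustration of how an application might request the stronger flush quoted above, the sketch below uses F_FULLFSYNC where Python's fcntl module defines it (macOS) and falls back to plain fsync() elsewhere. The helper names full_sync and demo_full_sync are hypothetical. As the man page warns, even this request can be ignored by some drives.

```python
import fcntl
import os
import tempfile

def full_sync(fd):
    """Ask for the strongest flush available and return the method used."""
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)  # defined on macOS only
    if full_fsync is not None:
        try:
            fcntl.fcntl(fd, full_fsync)
            return "F_FULLFSYNC"
        except OSError:
            pass  # file system does not support it; fall back below
    os.fsync(fd)
    return "fsync"

def demo_full_sync():
    """Write a small record and flush it with the best available call."""
    with tempfile.TemporaryFile() as f:
        f.write(b"commit block")
        f.flush()  # push Python's userspace buffer to the OS first
        return full_sync(f.fileno())
```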
Slide 7: Ordering points are essential
All modern file systems depend on ordering points
  Journaling file systems (ext3, ext4): data before the commit block
  Copy-on-write file systems (ZFS): data before the uber-block
If ordering points are not enforced:
  Data corruption
  Inconsistent file system
Slide 8: Summary
We present Coerced Cache Eviction (CCE)
  Write extra data into the cache to evict target blocks
We show how to characterize the caches of 9 SATA disk drives
  Examine the wide range of caching policies
We implement CCE in ext3
  Well-known journaling file system
CCE provides stronger enforcement for ordering points
  At acceptable overheads
Slide 9: Outline
Motivation
Background
Coerced Cache Eviction
Cache Fingerprinting
Discreet Mode Journaling
Evaluation
Conclusion
Slide 10: File System Background
Consider deleting a file
  Removing its directory entry
  Freeing the space occupied by the file and its metadata
Journaling file system
  Makes sure all changes get to disk or none do
  Groups writes into transactions
  Writes everything to a log first
  Checkpoints to disk later
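The all-or-nothing behavior of a journal can be sketched with a toy in-memory model. This is only an illustration: the ToyJournal class, its method names, and the block numbers are hypothetical, and it is not how ext3 is implemented. Writes are grouped into a transaction, logged first, and copied to their fixed locations at checkpoint time only if the transaction's commit record is present.

```python
class ToyJournal:
    """In-memory stand-in for a journaling file system (hypothetical)."""

    def __init__(self):
        self.log = []   # journal records, appended in order
        self.disk = {}  # fixed locations: block number -> contents

    def log_transaction(self, txn_id, writes, commit=True):
        """Append a transaction to the log; omit the commit to mimic a crash."""
        self.log.append(("begin", txn_id))
        for block, data in writes.items():
            self.log.append(("write", txn_id, block, data))
        if commit:
            self.log.append(("commit", txn_id))

    def checkpoint(self):
        """Copy only committed transactions to their fixed locations."""
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        for rec in self.log:
            if rec[0] == "write" and rec[1] in committed:
                self.disk[rec[2]] = rec[3]

def demo_journal():
    """Deleting a file touches two blocks; both are applied or neither is."""
    j = ToyJournal()
    j.log_transaction(1, {10: "dir entry removed", 11: "inode freed"})
    j.log_transaction(2, {12: "half-written"}, commit=False)  # crash before commit
    j.checkpoint()
    return j.disk
```

The model relies on the commit record reaching the log only after the writes it covers, which is exactly the ordering that a disk cache can break if it ignores flushes.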
Slide 11: File System Background
Ext3 file system
  Semi-modern journaling file system
  Well known, well understood
Variants of journaling
  Data journaling mode: everything (data, metadata) goes to the log first
  Ordered journaling mode: only metadata is logged
Slide 12: Data Journaling
[Figure: memory holding blocks labeled B, D, D, D, M, M, C; disk surface with a journal and fixed locations]
Slide 13: Data Journaling
[Figure: memory holding blocks labeled B, D, D, D, M, M, C; disk cache; disk surface with a journal and fixed locations]
Slide 14: Outline
Motivation
Background
Coerced Cache Eviction
Cache Fingerprinting
Discreet Mode Journaling
Evaluation
Conclusion
Slide 15: Coerced Cache Eviction
Ensures that the cache has been truly flushed
Key idea: extra writes to flush the disk cache
Desired order of writes: A, B, C
With CCE:
  Write A
  Write to flush zone
  Write B
  Write to flush zone
  Write C
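The write sequence above can be sketched against a toy disk model. Everything in this example is an assumption made for illustration: the LyingDisk class, its FIFO replacement policy, and the choice of one flush-zone write per cache slot are hypothetical, not the drives or workloads measured in the talk. The model ignores flush requests, so the only way a block becomes durable is by being pushed out of the cache by later writes.

```python
from collections import OrderedDict

class LyingDisk:
    """Toy disk whose cache ignores flush requests (hypothetical model).

    The cache holds `capacity` blocks and destages the oldest block to the
    surface only when it needs room, i.e. FIFO replacement.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.surface = []  # blocks in the order they became durable

    def write(self, name):
        self.cache[name] = True
        while len(self.cache) > self.capacity:
            oldest, _ = self.cache.popitem(last=False)
            self.surface.append(oldest)

    def flush(self):
        pass  # the misbehavior: the flush request is silently ignored

def write_with_cce(disk, blocks):
    """Follow each target write with enough flush-zone writes to evict it."""
    counter = 0
    for name in blocks:
        disk.write(name)
        for _ in range(disk.capacity):
            disk.write(f"flush-zone-{counter}")
            counter += 1

def durable_targets(disk, targets):
    """Target blocks on the surface, in the order they got there."""
    return [b for b in disk.surface if b in targets]

def demo_cce():
    targets = {"A", "B", "C"}
    plain = LyingDisk(capacity=4)
    for name in ["A", "B", "C"]:
        plain.write(name)
        plain.flush()  # has no effect on this disk
    coerced = LyingDisk(capacity=4)
    write_with_cce(coerced, ["A", "B", "C"])
    return durable_targets(plain, targets), durable_targets(coerced, targets)
```

With plain flushes, none of A, B, C leaves the cache in this model. With the flush-zone writes, each target block reaches the surface before the next one is issued.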
Slide 16: Coerced Cache Eviction
[Figure: memory holding blocks labeled B, D, D, D, M, M, C and flush blocks labeled F; disk cache; disk surface with a flush zone, a journal, and fixed locations]
Slide 17: Coerced Cache Eviction
Desired properties:
  High probability of flushing target blocks
  Low performance overhead
Need to understand the disk cache to design the flush workload
Slide 18: Outline
Motivation
Background
Coerced Cache Eviction
Cache Fingerprinting
Discreet Mode Journaling
Evaluation
Conclusion
Slide 19: Cache Fingerprinting
Manufacturers don't expose details about disk caches
Disk caches can vary in:
  Read/write partition size
  Number of segments
  Replacement policy
Poorly characterized in the literature
[Figure: disk cache]
Slide 20: Cache Fingerprinting
Flush micro-benchmark:
  Write target block
  Write varied flush workload (measure cost)
  fsync()
  Read target (infer eviction)
Micro-benchmark is repeated
  Probability of eviction is calculated
Vary in each workload:
  Number of writes
  Amount of data in each write
  Sequential/random writes
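The repeat-and-count structure of the micro-benchmark can be sketched with a simulated cache in place of a real drive. This is a sketch under stated assumptions: the cache is full and replaces a uniformly random slot on every write (a hypothetical policy chosen for simplicity), and the function names are invented for this example. A real fingerprinting run would issue actual writes and reads against the drive.

```python
import random

def target_evicted(cache_slots, num_flush_writes, rng):
    """One trial: write the target, then the flush workload, then check
    whether the target was replaced. Assumes a full cache that replaces
    a uniformly random slot on every write (hypothetical policy)."""
    target_slot = rng.randrange(cache_slots)
    for _ in range(num_flush_writes):
        if rng.randrange(cache_slots) == target_slot:
            return True
    return False

def eviction_probability(cache_slots, num_flush_writes, trials=2000, seed=0):
    """Repeat the trial and report the fraction in which the target was evicted."""
    rng = random.Random(seed)
    evicted = sum(target_evicted(cache_slots, num_flush_writes, rng)
                  for _ in range(trials))
    return evicted / trials

def fingerprint(cache_slots, write_counts):
    """Eviction probability for each flush-workload size."""
    return {n: eviction_probability(cache_slots, n) for n in write_counts}
```

Calling fingerprint(16, [0, 16, 256]) gives one row of a simulated eviction fingerprint; varying the write size and the sequential/random pattern would add the other dimensions.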
Slide 21: Cache Fingerprinting
Eviction fingerprint
  Probability of eviction is shown visually
  Darker region indicates higher probability
[Figure: eviction fingerprint. Legend, eviction probability: 90-100%, 70-90%, 50-70%, 30-50%, 10-30%, 0-10%]
Slide 22: Cache Fingerprinting
Performance fingerprint
  Time taken to write flush workload
  Darker region indicates more time
[Figure: performance fingerprint. Legend, flush latency (ms): 500+, 100-500, 50-100, 10-50, 0-10]
Slide 23: Cache Fingerprinting
Selecting a flush workload:
  Combine information from both fingerprints
  High probability of eviction: dark region in eviction fingerprint
  Low performance cost: light region in performance fingerprint
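One way to combine the two fingerprints is to keep only the workloads whose eviction probability clears a threshold and then take the cheapest of those. The sketch below is an assumption about how such a selection could be coded. The function name, the 0.9 default threshold, and the candidate values are hypothetical and are not measurements from the talk.

```python
def select_flush_workload(candidates, min_eviction=0.9):
    """Return the cheapest candidate whose eviction probability is high
    enough, or None if no candidate qualifies."""
    good = [c for c in candidates if c["eviction"] >= min_eviction]
    return min(good, key=lambda c: c["latency_ms"], default=None)

# Hypothetical fingerprint samples, not measurements from the talk.
CANDIDATES = [
    {"writes": 1,   "size_kb": 64,   "eviction": 0.05, "latency_ms": 2},
    {"writes": 64,  "size_kb": 64,   "eviction": 0.95, "latency_ms": 40},
    {"writes": 256, "size_kb": 64,   "eviction": 1.00, "latency_ms": 180},
    {"writes": 256, "size_kb": 1024, "eviction": 1.00, "latency_ms": 900},
]
```

Raising min_eviction trades performance for safety: with the sample data, a threshold of 1.0 skips the 40 ms workload and selects the 180 ms one.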
Slide 24: Cache Fingerprinting
Manufacturer      Cache (MB)   Capacity (GB)
Hitachi           8            80
Hitachi           32           1024
Samsung           8            250
Samsung           16           250
Western Digital   16           320
Western Digital   64           800
Seagate           8            250
Seagate           16           320
Seagate           32           750
Slide 25: Cache Fingerprinting
Sequential writes may be ineffective at flushing
  Regardless of the size of the write
A number of random writes are required
[Figure: eviction fingerprint, shaded by eviction probability from 0-10% to 90-100%]
Slide 26: Cache Fingerprinting
Vertical stripes indicate that the cache is segmented
  Each write, regardless of size, is sent to one segment
[Figure: eviction fingerprint, shaded by eviction probability from 0-10% to 90-100%]
Slide 27: Cache Fingerprinting
Cache behavior of disks from the same manufacturer is qualitatively similar across their different models
[Figure: eviction fingerprints, shaded by eviction probability from 0-10% to 90-100%]
Slide 28: Cache Fingerprinting
It's not all good news, however:
  Some caches appear to use random replacement policies
  For such caches, we cannot evict blocks with 100% certainty
  A large number of random writes are required to get high eviction probability
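A short calculation shows why certainty is out of reach under random replacement. It rests on a simplifying assumption that is not stated in the talk: the cache has $k$ equally likely slots, and each of the $n$ flush writes independently replaces one slot chosen uniformly at random. The symbols $k$, $n$, and $p$ are introduced here for illustration only.

```latex
% Probability that the target block has been evicted after n random writes
P_{\text{evict}}(n) = 1 - \left(1 - \frac{1}{k}\right)^{n}

% Number of writes needed to reach eviction probability p
n \ge \frac{\ln(1 - p)}{\ln\left(1 - \frac{1}{k}\right)} \approx k \ln\frac{1}{1 - p}

% Example: p = 0.9 gives n \approx 2.3\,k, and P_{\text{evict}}(n) < 1 for every finite n
```

Under this model, reaching about 90% eviction probability takes roughly 2.3 writes per cache slot, and no finite number of writes reaches 100%.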
Slide 29: Cache Fingerprinting - Results
Drive                    Number of writes   Total data (MB)   Eviction probability (%)   Time (s)
Hitachi 8 MB             1                  2.38              100                        0.05
Hitachi 32 MB            1                  11                100                        0.087
Seagate 8 MB             256                31                100                        0.87
Seagate 16 MB            128                17                100                        0.342
Seagate 64 MB            128                37                100                        0.396
Samsung 8 MB             128                49                ~90                        1.328
Samsung 16 MB            256                128               ~90                        2.872
Western Digital 16 MB    1792               19                ~90                        5.107
Western Digital 64 MB    256                1                 100                        7.705
Slide 30: Outline
Motivation
Background
Coerced Cache Eviction
Cache Fingerprinting
Discreet Mode Journaling
Evaluation
Conclusion
Slide 31: Discreet Mode Journaling
Incorporating CCE into ext3
  Fingerprint the disk to find the optimal flush workload
  Create a flush zone of suitable size
  Modify ext3 to issue flush zone writes: one at each ordering point
  # of CCE operations = # of ordering points
Can be used with any disk, as long as the disk is fingerprinted first
Slide 32: Outline
Motivation
Background
Coerced Cache Eviction
Cache Fingerprinting
Discreet Mode Journaling
Evaluation
Conclusion
Slide 33: Evaluation
Goal:
  CCE provides higher reliability
  At what cost? Is it practical to use?
Experimental setup:
  File system: ext3
  Disk: Hitachi 8 MB
  Journaling mode: data journaling (see paper for ordered journaling results)
  Operating system: Linux 2.6.13, Linux 2.6.23
Slide 34: Evaluation
What we compare:
  Regular journaling with disk cache turned off
    "Safe" but slow
    Disk might not obey command to turn off cache!
  Regular journaling with disk cache turned on
    Unsafe but fast
  Discreet mode journaling
    Midway option: safe but with cost
Slide 35: Evaluation
Benchmarks:
  OpenSSH: copy, untar, configure, make
  Postmark
    Simulates a mail server
    Single threaded
  Filebench Webserver
    I/O intensive
  Filebench Varmail
    Multithreaded postmark
Slide 36: Evaluation: OpenSSH
[Figure: results in data journaling mode]
Slide 37: Evaluation: Postmark
[Figure: results in data journaling mode]
Slide 38: Evaluation: Filebench Webserver
[Figure: results in data journaling mode]
Slide 39: Evaluation: Filebench Varmail
[Figure: results in data journaling mode]
Slide 40: Evaluation: Filebench Varmail
Workload writes a small amount of data and calls fsync() repeatedly
  Each fsync() causes 3 CCEs
A number of optimizations:
  Incorporate group commit in varmail
    Improves throughput for all modes
  We use a few other techniques as well (see paper)
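A back-of-the-envelope bound shows why this workload is hard for CCE. The 3 CCEs per fsync() comes from the slide above, and the 0.05 s flush time is the value listed for the Hitachi 8 MB drive in the results table. Treating flush-zone writes as the only cost is a simplifying assumption made here, and the function name is hypothetical.

```python
def max_fsync_rate(cces_per_fsync, flush_seconds):
    """Upper bound on fsync() calls per second if flush-zone writes
    were the only cost (a simplifying assumption)."""
    return 1.0 / (cces_per_fsync * flush_seconds)

# 3 CCEs per fsync() at 0.05 s each is 0.15 s of flushing per fsync().
rate = max_fsync_rate(cces_per_fsync=3, flush_seconds=0.05)
```

Under these assumptions a single thread could complete fewer than 7 fsync() calls per second, which is consistent with the motivation for batching several fsync() calls into one commit.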
Slide 41: Evaluation: Filebench Varmail
[Figure: original performance compared with performance with optimizations]
Coerced Cache Eviction (CCE): Run file systems reliably on top of misbehaving disksCharacterization of 9 SATA disk caches through fingerprintsDiscreet Mode Journaling:
Implementation of CCE for ext3 filesystemAcceptable performance on 3 workloadsOnly if the cache doesn’t use random replacementHigh overhead for apps which call fsync() frequently
DSN 1142
7/7/11
Slide 43: Conclusion
Trust in disks is weakening:
  Latent sector errors
  Block corruption
  Cache flushing
Cloud computing systems:
  Virtualized hardware
  Large software stack
Can such hardware be trusted? Will coercion be more widely used?
Slide 44
Thank you!
Advanced Systems Lab (ADSL)
University of Wisconsin-Madison
http://www.cs.wisc.edu/adsl