Slide 1: Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems
Thanh Do*, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi
Slide 2: Hardware Fails
The growing complexity of technology scaling, manufacturing, design logic, usage, and operating environment makes hardware fail in different ways:
- Complete fail-stop, fail-partial, corruption: rich literature
- Performance degradation: ?
Slide 3: The First Anecdote
"… 1Gb NIC card on a machine that suddenly starts transmitting at 1 kbps, this slow machine caused a chain reaction upstream in such a way that the performance of entire workload for a 100 node cluster was crawling at a snail's pace, effectively making the system unavailable for all practical purposes." – Borthakur of Facebook
A degraded NIC (1,000,000x slower), with cascading impact!
Slide 4: Cases of Degraded Hardware
"In 2011, one of the DDN 9900 units had 4 servers having high wait times on I/O for a certain set of disk LUNs. The maximum wait time was 103 seconds. This was left uncorrected for 50 days." – Kasick of CMU, Harms of Argonne
"The disk attempts to re-read each block multiple times before responding." – Baptist of Cleversafe
"On Intrepid, we had a bad batch of optical transceivers with an extremely high error rate. That results in an effective throughput of 1-2 Kbps." – Harms of Argonne
Many others: "Yes, we've seen that in production."
Slide 5: Limpware
- Does hardware degrade? Yes. Limpware: hardware whose performance degrades significantly compared to its specification.
- Is this a destructive failure mode? Yes: cascading failures, and no "fail in place".
- There has been no systematic analysis of its impact.
Slide 6: Study Summary
- 56 experiments benchmarking 5 systems: Hadoop, HDFS, ZooKeeper, Cassandra, and HBase
- 22 protocols; 8 hours under normal scenarios and 207 hours under limpware scenarios
- Unearthed many limpware-intolerant designs
- Key finding: a single piece of limpware (e.g., one NIC) causes severe impact on a whole cluster
Slide 7: Outline
- Introduction
- System Analysis
- Limplock
- Limpware-Tolerant Systems
- Conclusion
Slide 8: Anecdotal Impacts
"The performance of a 100 node cluster was crawling at a snail's pace" – Facebook
But… why?
Slide 9: System Analysis
Goals:
- Measure system-level impacts
- Find design flaws
Methodology:
- Target cloud systems (e.g., HDFS, Hadoop, ZooKeeper)
- Inject load + limpware (e.g., slow a NIC to 1 Mbps, 0.1 Mbps, etc.)
- White-box analysis (internal probes)
Slide 10: Example
- Run a distributed protocol, e.g., a 3-node write in HDFS
- Measure slowdowns under: no failure, a crash, and a degraded NIC
Execution slowdown under a degraded NIC:
  10 Mbps NIC  -> ~10x slower
  1 Mbps NIC   -> ~100x slower
  0.1 Mbps NIC -> ~1000x slower
Slide 11: Outline
- Introduction
- System Analysis: Hadoop case study
- Limplock
- Limpware-Tolerant Cloud Systems
- Conclusion
Slide 12: Hadoop Speculative Execution
Is Hadoop tail-tolerant? Why is speculative execution not triggered? Consider a degraded NIC on a map node running WordCount on Hadoop:
- Map task M2's speed equals M1's and M3's, because its input data is local
- But all reducers are slow, since every reducer fetches M2's output through the degraded NIC
- A straggler is a task that is slow versus others of the same job, so no straggler is detected!
Flaws: task-level straggler detection, and the degraded node is a single point of failure! (A sketch of the peer-relative check follows.)
[Figure: WordCount on Hadoop; mappers M1-M3 feed all reducers; slowdown axis 1-10x]
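This is not Hadoop's actual detector; the sketch below is a minimal illustration of the peer-relative idea, with an assumed flag-if-below-half-the-mean threshold and illustrative names (StragglerSketch, findStragglers). It shows why comparing a task only against other tasks of the same job stays silent when one degraded NIC slows every reducer equally:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of peer-relative straggler detection (not Hadoop's
 * real implementation; the 50%-of-mean threshold is an assumption).
 * A task is flagged only if it is slow *relative to other tasks of
 * the same job*.
 */
public class StragglerSketch {
    static List<Integer> findStragglers(double[] rates) {
        double mean = 0;
        for (double r : rates) mean += r;
        mean /= rates.length;
        List<Integer> stragglers = new ArrayList<>();
        for (int i = 0; i < rates.length; i++) {
            if (rates[i] < 0.5 * mean) stragglers.add(i); // peer-relative test
        }
        return stragglers;
    }

    public static void main(String[] args) {
        // One reducer behind a degraded NIC: clearly flagged.
        System.out.println(findStragglers(new double[]{10, 10, 0.01}));    // [2]

        // Degraded NIC on the *map* side: every reducer fetches from it,
        // so all reducers limp together and none looks slow vs. its peers.
        System.out.println(findStragglers(new double[]{0.01, 0.01, 0.01})); // []
    }
}
```

The within-job comparison is exactly what makes the map-side NIC a blind spot: the slow component sits outside the set of tasks being compared.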
Slide 13: Cascading Failures
- A degraded NIC -> degraded tasks (degraded tasks are slower by orders of magnitude)
- Slow tasks use up slots -> degraded node (default: 2 mappers and 2 reducers per node; if all slots are used, the node is "unavailable")
- All nodes in limp mode -> degraded cluster
[Figure: a node with a slow NIC pushes healthy nodes into limp mode as their map/reduce slots fill]
(A back-of-the-envelope sketch of the slot math follows.)
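The 30 nodes and 4 slots per node (2 mappers + 2 reducers) come from these slides; the 1-minute normal task time and the 1000x limp factor are illustrative assumptions:

```java
/**
 * Back-of-the-envelope sketch of the slot-exhaustion cascade.
 * Little's law: occupied slots ~ arrival rate x task duration.
 * Tasks that shuffle through the limping node run ~1000x longer,
 * so they pin their slots ~1000x longer; with a fixed slot pool,
 * sustainable throughput drops by the same factor.
 */
public class CascadeSketch {
    public static void main(String[] args) {
        int nodes = 30, slotsPerNode = 4; // 2 mappers + 2 reducers (default)
        double normalTaskMin = 1.0;       // assumed normal task duration (minutes)
        double limpFactor = 1000;         // assumed slowdown of tasks touching the bad NIC

        double totalSlots = nodes * slotsPerNode;
        double normalThroughput = totalSlots / normalTaskMin;             // tasks/min
        double limpThroughput = totalSlots / (normalTaskMin * limpFactor);
        System.out.printf("Sustainable throughput: %.0f -> %.2f tasks/min%n",
                normalThroughput, limpThroughput);
    }
}
```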
Slide 14: Cluster Collapse
- Macrobenchmark: Facebook workload on a 30-node cluster
- One node with a degraded NIC (0.1 Mbps)
- Cluster collapse: throughput drops from 172 jobs/hour to 1 job/hour. Why?
Slide 15
Fail-stop tolerant, but not limpware tolerant (no failover recovery).
Slide 16: Outline
- Introduction
- System Analysis
- Formalizing the problem: Limplock (definitions and causes)
- Limpware-Tolerant Systems
- Conclusion
Slide 17: Limplock
Definition: the system progresses slowly due to limpware and is not capable of failing over to healthy components (i.e., the system is "locked" in limping mode).
Three levels of limplock: operation, node, and cluster.
Slide 18: Limplock Levels
- Operation limplock: an operation involving limpware is "locked" in limping mode, with no failover.
- Node limplock: operations that must be served by a given node experience limplock, even though those operations do not involve the limpware itself.
- Cluster limplock: the whole cluster is in limplock due to limpware.
Slide 19: Causes of Limplock
Operation limplock:
- Single point of failure, e.g., a slow Hadoop map task, or the HBase "gateway"
[Figure: mappers M1-M3 feeding reducers; the slow mapper is a single point of failure for every reducer]
Slide 20: Causes of Limplock
Operation limplock:
- Single point of failure
- Coarse-grained timeout (more in the paper)
Example: a 512 MB write to HDFS sees up to a 100x slowdown, yet no timeout is triggered. HDFS's coarse-grained timeout arms a 60-second timer on every 64 KB, so a transfer can limp at almost 1 KB/s without ever tripping it (worked out in the sketch below).
[Figure: slowdown of the 512 MB HDFS write on a 1-100x axis]
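A small worked example of the numbers on this slide; the 64 KB unit, 60-second timer, and 512 MB write come from the slide, while the class and variable names are illustrative:

```java
/**
 * Back-of-the-envelope sketch of why a coarse-grained timeout lets a
 * transfer limp: if a 60 s timeout is re-armed for every 64 KB unit,
 * any link faster than 64 KB / 60 s never trips it.
 */
public class TimeoutSketch {
    public static void main(String[] args) {
        double unitBytes = 64 * 1024;  // timeout re-armed per 64 KB
        double timeoutSec = 60;        // 60-second timeout
        double evadeBps = unitBytes / timeoutSec;
        System.out.printf("Slowest rate that evades the timeout: %.2f KB/s%n",
                evadeBps / 1024);      // ~1.07 KB/s

        double writeBytes = 512.0 * 1024 * 1024;  // the 512 MB HDFS write
        double hours = (writeBytes / evadeBps) / 3600;
        System.out.printf("A 512 MB write at that rate takes %.1f hours%n",
                hours);                // ~136.5 hours, with no timeout fired
    }
}
```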
Slide 21: Causes of Limplock
Operation limplock: single point of failure, coarse-grained timeout, …
Node limplock:
- Bounded multi-purpose thread pool: the master serves both meta writes and in-memory meta reads from the same pool. When a limplocked operation exhausts the pool's resources, in-memory metadata reads are blocked and run more than 100x slower than normal (a minimal sketch follows).
[Figure: the master's shared thread pool; in-memory read slowdown on a 1-100x axis]
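A minimal, self-contained sketch of this failure mode, not the HDFS master's real code; the pool size of 4 and the 60-second stall are illustrative assumptions:

```java
import java.util.concurrent.*;

/**
 * Node limplock via a bounded multi-purpose pool: one pool serves
 * both slow meta writes and fast in-memory meta reads, so once
 * limping writes hold every thread, the reads queue behind them.
 */
public class BoundedPoolSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // bounded, shared

        // Four meta writes hit limpware and each stalls for a long time.
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                try { Thread.sleep(60_000); }  // stand-in for a limping I/O
                catch (InterruptedException ignored) {}
            });
        }

        // A pure in-memory read now waits behind the limplocked writes.
        Future<String> read = pool.submit(() -> "meta"); // should take microseconds
        try {
            read.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            System.out.println("In-memory read blocked: pool exhausted by limplocked ops");
        }
        pool.shutdownNow();
    }
}
```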
Slide 22: Causes of Limplock
Operation limplock: single point of failure, coarse-grained timeout, …
Node limplock:
- Bounded multi-purpose thread pool
- Bounded multi-purpose queue: messages of every type share one bounded queue
[Figure: messages of different types backing up in a single shared queue]
Slide 23: Causes of Limplock
Operation limplock: single point of failure, coarse-grained timeout, …
Node limplock:
- Bounded multi-purpose thread pool
- Bounded multi-purpose queue
- Unbounded thread pool/queue. Example: a backlogged queue at the ZooKeeper leader; the leader enters node limplock because garbage collection works hard, and a client quorum write slows down 10x (a sketch of the backlog follows).
[Figure: ZooKeeper leader and followers; quorum-write slowdown on a 1-10x axis under stress loads of 20 and 600 seconds]
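A minimal sketch of the backlog mechanism, not ZooKeeper's actual leader code; the 1 ms arrival rate and 10 ms drain rate are illustrative assumptions. The queue's unbounded growth is what drives the heap pressure and the garbage-collection work mentioned above:

```java
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Unbounded-queue failure mode: requests arrive faster than a limping
 * link drains them, so the queue grows without bound and the resulting
 * heap/GC pressure slows the whole node.
 */
public class UnboundedQueueSketch {
    public static void main(String[] args) throws Exception {
        // Default LinkedBlockingQueue capacity is Integer.MAX_VALUE:
        // effectively unbounded.
        LinkedBlockingQueue<byte[]> sendQueue = new LinkedBlockingQueue<>();

        // Consumer: drains at the limping link's pace (1 message / 10 ms).
        Thread sender = new Thread(() -> {
            try {
                while (true) { sendQueue.take(); Thread.sleep(10); }
            } catch (InterruptedException ignored) {}
        });
        sender.setDaemon(true);
        sender.start();

        // Producer: a stress load enqueues ~1 message / ms, so the backlog
        // grows roughly 10x faster than it drains.
        for (int i = 0; i < 5_000; i++) {
            sendQueue.put(new byte[1024]);
            Thread.sleep(1);
        }
        System.out.println("Backlog after the stress load: "
                + sendQueue.size() + " messages");
    }
}
```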
Slide 24: Causes of Limplock
Operation limplock: single point of failure, coarse-grained timeout, …
Node limplock: bounded multi-purpose thread pool, bounded multi-purpose queue, unbounded thread pool/queue
Cluster limplock:
- All nodes in limplock (e.g., resource exhaustion in Hadoop, HDFS regeneration)
- Master limplock in a master-slave architecture (e.g., cases in ZooKeeper and HDFS)
Slide 25: Analysis Results
We found 15 protocols that exhibit limplock: 8 in HDFS, 1 in Hadoop, 2 in ZooKeeper, and 4 in HBase. Limplock happens in almost all of the systems we analyzed.
Slide 26: Outline
- Introduction
- System Analysis
- Limplock
- Limpware-Tolerant Cloud Systems
- Conclusion
Slide 27: Principles of Limpware-Tolerant Systems
- Anticipation: limpware-tolerant design patterns, limpware static analysis, and limpware statistics (existing work covers memory failures, disk failures, etc.)
- Detection: performance degradation is implicit (no hard errors); study the explicit causes (e.g., block remapping, error correction)
- Recovery: how do we "fail in place"? Is it better to fail-stop than to fail slow? Quarantine?
- Utilization: fail-stop hardware either fails or works; limpware can degrade anywhere from 1-100%
Slide 28: Conclusion
- New failure modes transform systems
- Limpware is a "new", destructive failure mode: orders-of-magnitude slowdowns, cascading failures, and no "fail in place" in current systems
- There is a need for limpware-tolerant systems
Slide 29: Thank you! Questions?