Slide1
Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffrey Adityatama, and Kurnia J. Eliazar
Why does the Cloud stop computing?
Lessons from hundreds of service outages
Slide2
Oct '16
COS @ SoCC '16
Slide3
2 years ago @ SoCC ’14
Study of bugs in datacenter distributed systems (Hadoop, HBase, etc.)
Bugs
Outages
Slide4
Public reports!
Headline news and post-mortem reports
Providers’ transparency
Untapped information
Pros/cons
+ Detailed root causes
+ Detailed chain of failures
+ Downtime durations
+ Zero false positives
-- (Very) incomplete
-- (High) variance
Slide5
COS: Cloud Outage Study
32 services
597 outages between 2009-2015
~70% report downtimes
~60% report root causes
Slide6
Slide7
Downtime/year
On average
6% of services do not reach 99% availability (>88 hours)
78% do not reach 99.9% (>8.8 hours)
Worst year
31% do not reach 99%
81% do not reach 99.9%
5-nine availability? It’s just a dream?
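The hour thresholds above follow from simple arithmetic on a year of 8,760 hours. A minimal sketch, assuming a non-leap year; the function name `max_downtime_hours` is made up for this illustration:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def max_downtime_hours(availability_percent: float) -> float:
    """Yearly downtime budget (in hours) for a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Compare the two thresholds from the slide with five nines.
for target in (99.0, 99.9, 99.999):
    hours = max_downtime_hours(target)
    print(f"{target}% -> {hours:.3f} hours/year ({hours * 60:.1f} minutes)")
```

Under this assumption 99% allows about 87.6 hours per year, 99.9% about 8.76 hours, and five nines only a little over five minutes, which is why a single multi-hour outage already rules it out for that year.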
(chart: downtime per year, in hours)
Slide8
Root causes (sorted by count)
Slide9
Interesting Root Causes
Upgrade
Involves multiple layers
“a code push behaved differently in widespread use than it had during testing”
To understand/reproduce, need full ecosystem
Slide10
Interesting Root Causes
Human mistakes
Rare now (vs. 10 years ago)
Config/Upgrade software bugs
Bugs in automation process
Similar issues?
But root cause origins are different
Slide11
Config vs. Upgrade Research
Upgrade #1, need more research?
Paper count in last few years
Challenges:
Multi-layer
Full ecosystem needed
Multi-year?
Reproducible bugs from industry (benchmarks)?

Conference   Config papers   Upgrade papers
ASPLOS       0               1
ATC          6               2
DSN          8               2
EuroSys      3               2
NSDI         3               0
OSDI         4               0
SOSP         3               1
…
Total        27              8
Slide12
Interesting Root Causes
Bugs
What types of bugs lead to outages?
Why are they not masked? (pls. see paper)
“Cascading” bugs
Slide13
Storage servers
Metadata service
“DynamoDB Storage servers query the metadata service for their membership”
“But, on Sunday morning, the metadata service responses exceeded the retrieval time allowed by storage servers [busy timeout]”
“As a result, the storage servers were unable to obtain their membership data, and removed themselves from taking requests”
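The chain in these quotes (busy metadata service, timeout, self-removal) can be sketched in a few lines. This is an illustrative toy model, not DynamoDB's actual code: the class `StorageServer`, the `timeout_s` field, and the latency values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class StorageServer:
    name: str
    timeout_s: float        # hypothetical limit on metadata retrieval time
    in_service: bool = True

    def refresh_membership(self, metadata_latency_s: float) -> None:
        # If the metadata service answers too slowly, the server cannot
        # confirm its membership and removes itself from taking requests.
        if metadata_latency_s > self.timeout_s:
            self.in_service = False


def serving_count(servers, metadata_latency_s: float) -> int:
    """Refresh every server's membership and count those still serving."""
    for server in servers:
        server.refresh_membership(metadata_latency_s)
    return sum(server.in_service for server in servers)


fleet = [StorageServer(f"storage-{i}", timeout_s=1.0) for i in range(5)]
print(serving_count(fleet, metadata_latency_s=0.2))  # normal latency
print(serving_count(fleet, metadata_latency_s=3.0))  # busy metadata service
```

Because every server applies the same timeout to the same shared dependency, one slow service takes the whole fleet out at once, even though no storage server has itself failed.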
Busy
Timeout
Remove self
Slide14
EBS storage servers
Data collection servers
“Each EBS storage server contacts data collection servers and reports information that is used for fleet maintenance”
“data collection servers … had a failure”
“this inability to contact a data collection server triggered a latent memory leak bug in the storage servers …”
“EBS servers continued trying in a way that slowly consumed system memory”
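The quotes do not say how the EBS leak worked internally, so the following is only a generic illustration of how retrying against a dead dependency can consume memory. The classes `UnboundedReporter` and `BoundedReporter` and the cap of 100 are hypothetical.

```python
from collections import deque


class UnboundedReporter:
    """Hypothetical reporter that queues every report it cannot deliver."""

    def __init__(self):
        self.pending = []

    def report(self, payload: str, collector_up: bool) -> None:
        self.pending.append(payload)
        if collector_up:
            self.pending.clear()  # delivered, memory released


class BoundedReporter:
    """Same idea, but the oldest undelivered reports are dropped past a cap."""

    def __init__(self, max_pending: int = 100):
        self.pending = deque(maxlen=max_pending)

    def report(self, payload: str, collector_up: bool) -> None:
        self.pending.append(payload)
        if collector_up:
            self.pending.clear()


def simulate_outage(reporter, attempts: int) -> int:
    """Send `attempts` reports while the collector is down; return queue size."""
    for i in range(attempts):
        reporter.report(f"stats-{i}", collector_up=False)
    return len(reporter.pending)


print(simulate_outage(UnboundedReporter(), 10_000))
print(simulate_outage(BoundedReporter(), 10_000))
```

The unbounded version grows with the length of the outage, while the bounded one stays at its cap. The point matches the slide: the failure path of a non-critical dependency is rarely exercised, so a bug like this can stay latent until that dependency goes down.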
Failure
Memory leak
Slide15
(more in the paper)
Slide16
Where is the SPOF?
Redundancies, redundancies, redundancies!
Yes, we did that
So, why do outages still happen?
Slide17
Failure recovery chain
Failure Detection
Failover
Backups
Slide18
Imperfect failure recovery chain
Incomplete Failure Detection
Failover that Fails
Backups that also Fail
Slide19
Imperfect failure recovery chain
Incomplete error/failure detection
Undetected (specific type of) memory leaks
Load spikes of authentication requests
“an unexpected hardware behavior”
Incomplete Failure Detection
Failover that Fails
Backups that also Fail
Slide20
Imperfect failure recovery chain
Failover/recovery that fails
Bad PLC fails to activate backup power generators
Failed network switch failover
DC failover fails due to cold cache problems
Recovery/re-mirroring storm
Incomplete Failure Detection
Failover that Fails
Backups that also Fail
Slide21
Imperfect failure recovery chain
Multiple failures!
Double failures of power, network, storage or server components
Diverse failures: network+server; storage+fibre cut
Cascading bugs …
… that caused many/all redundancies to fail
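One way to see why redundancy alone does not rule out outages is a simple common-cause model. This is an assumption made for illustration, not a model from the talk: replicas fail independently with probability `p_single`, and a shared event such as a cascading bug takes out all of them with probability `p_common`. All numbers are made up.

```python
def p_all_replicas_fail(p_single: float, replicas: int, p_common: float = 0.0) -> float:
    """Probability that every replica is down at the same time.

    p_single: chance that one replica fails on its own (assumed independent)
    p_common: chance of a common-cause event that fails all replicas together
    """
    independent = p_single ** replicas
    return p_common + (1 - p_common) * independent


independent_only = p_all_replicas_fail(0.01, replicas=3)
with_common_cause = p_all_replicas_fail(0.01, replicas=3, p_common=0.001)
print(independent_only)
print(with_common_cause)
print(with_common_cause / independent_only)
```

With these illustrative numbers, a one-in-a-thousand common-cause event raises the chance of losing all three replicas by roughly three orders of magnitude, so adding more replicas barely helps once the shared failure mode dominates.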
Incomplete Failure Detection
Failover that Fails
Backups that also Fail
Slide22
COS Database:
Email us / Check our website
More correlations between …
Root cause & downtime
Service maturity & downtime
Root cause & impacts
Root cause & fixes
Etc.
Slide23
Conclusion
Features and failures are racing with each other
“Biggest/worst cloud outages of 20YY” – a new year’s tradition
Hope COS tells the cause
Many more examples/details in the papers
Slide24
Thank you!
Questions?
ucare.cs.uchicago.edu
ceres.cs.uchicago.edu
Slide25
EXTRA
Slide26
Manually extract outage “metadata”
Classifications:
Slide27
A service outage implies an unplanned unavailability of partial or full features of the service that affects all or a significant number of users, in such a way that the outage is reported publicly. Data loss, staleness, and late deliveries that lead to loss of productivity are also considered an outage.
Slide28
#Outages/year
On average
1/3 of the services, at least 3 unplanned outages per year
Worst year (between ’09-’14)
½ of the services, at least 4 unplanned outages per year
Slide29
Downtime by root cause
(sorted by median downtime)
Slide30
Maturity helps?
Does service maturity help?
Based on outage count:
In 2014, 24 outages occurred from 9-yr old services
Slide31
Maturity helps?
Based on downtime:
In 2014, 267 hours of downtime from 17-yr old services
More mature → more popular → more users → more complex
Slide32
Interesting Root Causes
Load
Spikes of non-monitored requests
User requests (monitored)
Database index accesses
Authentication requests (cryptographic consumption)
Misconfiguration
Ex: traffic redirection
Take-away: be careful with traffic-related code/configs
Recovery feedback loop
Slide33
Interesting Root Causes
Cross (dependencies)
Amazon Web Services
Airbnb, Bitbucket, Dropbox, Foursquare, Github, Heroku, Instagram, Minecraft, Netflix, Pinterest, Quora, Reddit, Vine
Azure
Xbox Live and “52 other services”
Google DC (co-location)
Google Gmail, Search, Drive, Youtube (40% drop of internet traffic for 5 mins)
Slide34
Studies of failures, enough?
Slide35
Studies of failures, enough?
Most study only a few services (data behind company walls)
Not all report downtimes