Presentation Transcript

Slide1

Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffrey Adityatama, and Kurnia J. Eliazar

Why does the Cloud stop computing?
Lessons from hundreds of service outages

Slide2

Oct '16

COS @ SoCC '16

Slide3

2 years ago @ SoCC ’14: Study of bugs in datacenter distributed systems (Hadoop, HBase, etc.)

Bugs → Outages

Slide4

Public reports!

Headline news and post-mortem reports
Providers’ transparency
Untapped information

Pros/cons
+ Detailed root causes
+ Detailed chain of failures
+ Downtime durations
+ Zero false positives
-- (Very) incomplete
-- (High) variance

Slide5

COS: Cloud Outage Study

32 services
597 outages between 2009-2015
~70% report downtimes
~60% report root causes

Slide6

Slide7

Downtime/year

On average
6% of services do not reach 99% availability (>88 hours of downtime per year)
78% do not reach 99.9% (>8.8 hours)

Worst year
31% do not reach 99%
81% do not reach 99.9%

5-nine availability? It’s just a dream?
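The hour figures attached to each availability level follow from simple arithmetic on the number of hours in a year. The sketch below reproduces them; the function name and the 365-day year are assumptions made for illustration and do not come from the talk.

```python
HOURS_PER_YEAR = 365 * 24  # 8760; assumes a non-leap year


def downtime_budget_hours(availability_percent: float) -> float:
    """Maximum yearly downtime (in hours) that still meets the target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Two-nine, three-nine, and five-nine availability targets.
for target in (99.0, 99.9, 99.999):
    hours = downtime_budget_hours(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} min/year)")
```

Under these assumptions, 99% allows about 87.6 hours per year, 99.9% about 8.76 hours, and five nines only about 5.3 minutes, which is why the slide treats five-nine availability as a distant goal.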

(Chart: downtime per year, in hours)

Slide8

Root causes (sorted by count)

Slide9

Interesting Root Causes

Upgrade
Involves multiple layers
“a code push behaved differently in widespread use than it had during testing”
To understand/reproduce, need the full ecosystem

Slide10

Interesting Root Causes

Human mistakes
Rare now (vs. 10 years ago)
Config/Upgrade software bugs
Bugs in automation process
Similar issues?
But root cause origins are different

Slide11

Config vs. Upgrade Research

Upgrade #1, need more research?
Paper count in last few years

Conference   Config papers   Upgrade papers
ASPLOS       0               1
ATC          6               2
DSN          8               2
EuroSys      3               2
NSDI         3               0
OSDI         4               0
SOSP         3               1
Total        27              8

Challenges:
Multi-layer
Full ecosystem needed
Multi-year?
Reproducible bugs from industry (benchmarks)?

Slide12

Interesting Root Causes

Bugs
What types of bugs lead to outages? Why are they not masked? (pls. see paper)
“Cascading” bugs

Slide13

Storage servers

Metadata service

“DynamoDB Storage servers query the metadata service for their membership”
“But, on Sunday morning, the metadata service responses exceeded the retrieval time allowed by storage servers [busy timeout]”
“As a result, the storage servers were unable to obtain their membership data, and removed themselves from taking requests”
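The cascade described in these quotes can be sketched as a small simulation. This is an illustration only, not DynamoDB's implementation: the class, the function, the server names, and the 100 ms timeout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StorageServer:
    name: str
    timeout_ms: float  # hypothetical retrieval limit for membership queries
    in_service: bool = True

    def refresh_membership(self, metadata_latency_ms: float) -> None:
        # A busy but healthy metadata service looks the same as a dead one:
        # any response slower than the timeout is treated as a failure.
        if metadata_latency_ms > self.timeout_ms:
            self.in_service = False  # the server removes itself


def serving_after_refresh(servers, metadata_latency_ms):
    """Refresh every server's membership and return the names still serving."""
    for server in servers:
        server.refresh_membership(metadata_latency_ms)
    return [server.name for server in servers if server.in_service]


fleet = [StorageServer(f"storage-{i}", timeout_ms=100.0) for i in range(3)]
print(serving_after_refresh(fleet, metadata_latency_ms=40.0))   # responsive metadata service
print(serving_after_refresh(fleet, metadata_latency_ms=250.0))  # busy metadata service
```

With a responsive metadata service all three servers keep serving. Once the latency passes the timeout, every server removes itself at once, even though no storage server has actually failed.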

Busy → Timeout → Remove self

Slide14

EBS storage servers

Data collection servers

“Each EBS storage server contacts data collection servers and reports information that is used for fleet maintenance”
“data collection servers … had a failure”
“this inability to contact a data collection server triggered a latent memory leak bug in the storage servers …”
“EBS servers continued trying in a way that slowly consumed system memory”
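The quotes do not describe the exact leak mechanism, so the following is only a hypothetical analogue: a reporter that keeps every undelivered report in memory while its collector is down, compared with one that caps its buffer. The function name and parameters are assumptions for illustration.

```python
from collections import deque


def buffered_reports_after_outage(ticks: int, max_pending=None) -> int:
    """Count reports held in memory after `ticks` failed delivery attempts.

    max_pending=None keeps every undelivered report (leak-like growth);
    an integer caps the buffer, and the oldest reports are dropped.
    """
    pending = deque(maxlen=max_pending)
    for tick in range(ticks):
        # The collector is unreachable, so nothing ever drains the buffer.
        pending.append(f"report-{tick}")
    return len(pending)


print(buffered_reports_after_outage(10_000))                   # grows with outage length
print(buffered_reports_after_outage(10_000, max_pending=100))  # stays bounded
```

In the unbounded case memory use grows with the length of the dependency outage, so a failure in a non-critical service slowly degrades the storage servers that depend on it.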

Failure → Memory leak

Slide15

(more in the paper)

Slide16

Where is the SPOF?

Redundancies, redundancies, redundancies!
Yes, we did that
So, why do outages still happen?

Slide17

Failure recovery chain

Failure detection → Failover → Backups

Slide18

Imperfect failure recovery chain

Incomplete failure detection → Failover that fails → Backups that also fail

Slide19

Imperfect failure recovery chain

Incomplete error/failure detection
Undetected (specific type of) memory leaks
Load spikes of authentication requests
“an unexpected hardware behavior”

Slide20

Imperfect failure recovery chain

Failover/recovery that fails
Bad PLC fails to activate backup power generators
Failed network switch failover
DC failover fails due to cold cache problems
Recovery/re-mirroring storm

Slide21

Imperfect failure recovery chain

Multiple failures!
Double failures of power, network, storage or server components
Diverse failures: network+server; storage+fibre cut
Cascading bugs …
… that caused many/all redundancies to fail
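A toy calculation illustrates why redundancy alone does not rule out outages when the recovery chain (detection, failover, backups) is imperfect. The model and every probability below are assumptions for illustration and are not measurements from the study. It treats a failure as masked only if all three stages work, and it assumes the stages succeed independently, which the cascading and multiple-failure cases above explicitly violate.

```python
def outage_probability(p_detect: float, p_failover: float, p_backup: float) -> float:
    """Chance that one component failure becomes a visible outage.

    Toy model: the failure is masked only if it is detected, the failover
    succeeds, and the backup works. Stages are assumed to be independent.
    """
    return 1 - p_detect * p_failover * p_backup


# Hypothetical stage reliabilities, chosen only for illustration.
print(f"{outage_probability(0.99, 0.99, 0.99):.4f}")  # three good stages still leak failures
print(f"{outage_probability(0.99, 0.99, 0.50):.4f}")  # one weak stage dominates the result
```

Even with each stage at a hypothetical 99%, about 3% of failures would still surface as outages in this model, and correlated failures would make the real figure worse than the independent estimate.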

Slide22

COS Database:

Email us / Check our website
More correlations between …
Root cause & downtime
Service maturity & downtime
Root cause & impacts
Root cause & fixes
Etc.

Slide23

Conclusion

Features and failures are racing with each other
“Biggest/worst cloud outages of 20YY” – a new year’s tradition
Hope COS tells the cause
Many more examples/details in the papers

Slide24

Thank you!

Questions?

ucare.cs.uchicago.edu
ceres.cs.uchicago.edu

Slide25

EXTRA

Slide26

Manually extract outage “metadata”
Classifications:

Slide27

A service outage implies an unplanned unavailability of partial or full features of the service that affects all or a significant number of users, in such a way that the outage is reported publicly. Data loss, staleness, and late deliveries that lead to loss of productivity are also considered an outage.

Slide28

#Outages/year

On average
1/3 of the services, at least 3 unplanned outages per year

Worst year (between ’09-’14)
½ of the services, at least 4 unplanned outages per year

Slide29

Downtime by root cause

(sorted by median downtime)

Slide30

Maturity helps?

Does service maturity help?
Based on outage count:
In 2014, 24 outages occurred from 9-yr old services

Slide31

Maturity helps?

Based on downtime:
In 2014, 267 hours of downtime from 17-yr old services
More mature → more popular → more users → more complex

Slide32

Interesting Root Causes

Load
Spikes of non-monitored requests
User requests (monitored)
Database index accesses
Authentication requests (cryptographic consumption)
Misconfiguration
Ex: traffic redirection
Take-away: be careful with traffic-related code/configs
Recovery feedback loop

Slide33

Interesting Root Causes

Cross (dependencies)
Amazon Web Services
Airbnb, Bitbucket, Dropbox, Foursquare, Github, Heroku, Instagram, Minecraft, Netflix, Pinterest, Quora, Reddit, Vine
Azure
Xbox Live and “52 other services”
Google DC (co-location)
Google Gmail, Search, Drive, Youtube (40% drop of internet traffic for 5 mins)

Slide34

Studies of failures, enough?

Slide35

Studies of failures, enough?

Most study only a few services (data behind company walls)
Not all report “d”owntimes
