
Resilient - PowerPoint Presentation

giovanna-bartolotta
Uploaded On 2016-05-16



Presentation Transcript

Slide1

Resilient Cloud Applications

Mark Simms (@mabsimms)

Principal Program Manager

Windows Azure Customer Advisory Team

Slide2

Session Objectives

Designing resilient large-scale services requires careful design and architecture choices

This session will explore key patterns & practices for highly available cloud services, illustrated with customer examples

Interactivity rocks -> please ask questions throughout!

Slide3

Setting the Stage

Slide4

Setting the stage

Scalability

Availability

Insight

Slide5

Setting the stage

Maximize service availability for consumers

Ensure customers (and client devices) can access and use the service

Minimize impact of failure on consumers

Degrade gracefully, isolate faults, fall back to alternate delivery paths

Maximize performance and capacity

Services that are “live” but cannot handle desired/required demand are not available

Slide6

Musings on application design

Traditional web service design (N-tier)

Make “everything stateless”

Slide7

Musings on application design

Traditional web service design (N-tier)

Make “everything stateless”

Separate logic from data (state)

Leverage specialized external state services

Cache, load balancer, relational database, document database, key/value store, etc.

Slide8

Musings on application design

No service is an island

Dependencies on other internal and external services

Trading time-to-market and agility for control

Slide9

What’s in a workload?

#1: without the relational database the application cannot fulfill any workloads

#2: the relational database is an external service, subject to partial availability

Slide10

Designing for Failure

Slide11

Decompose by Workload

Applications are composed of one or more workloads

Products like SharePoint and Windows Server are designed with this principle in mind

Each workload has different profiles, requirements and boundaries

Management, Availability, Operational, Cost, Health, Security, Capacity, etc.

Decomposition allows for workload-specific optimization

Technology selections, scalability and availability approaches, etc.

Slide12

What are the “9”s

Availability %            Downtime per year   Downtime per month*   Downtime per week
90% ("one nine")          36.5 days           72 hours              16.8 hours
99% ("two nines")         3.65 days           7.20 hours            1.68 hours
99.9% ("three nines")     8.76 hours          43.2 minutes          10.1 minutes
99.99% ("four nines")     52.56 minutes       4.32 minutes          1.01 minutes
99.999% ("five nines")    5.26 minutes        25.9 seconds          6.05 seconds
99.9999% ("six nines")    31.5 seconds        2.59 seconds          0.605 seconds

Study Windows Azure Platform SLAs:

Compute External Connectivity: 99.95% (2 or more instances)

Compute Instance Availability: 99.9% (2 or more instances)

Storage Availability: 99.9%

SQL Azure Availability: 99.9%

Slide13
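The downtime figures in the table of nines follow from simple arithmetic: the unavailable fraction multiplied by the length of the period. A minimal Python sketch of that calculation follows. The helper name `allowed_downtime_seconds` and the 30-day month are assumptions for illustration; the 30-day month is chosen because it reproduces the monthly column.

```python
# Allowed downtime for a given availability target (illustrative helper).
SECONDS_PER_DAY = 24 * 60 * 60

# Period lengths in days; the 30-day month is an assumption that matches the table.
PERIOD_DAYS = {"year": 365, "month": 30, "week": 7}


def allowed_downtime_seconds(availability_percent: float, period: str) -> float:
    """Return the seconds of downtime permitted per period at the given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * PERIOD_DAYS[period] * SECONDS_PER_DAY


if __name__ == "__main__":
    for target in (90.0, 99.0, 99.9, 99.95, 99.99, 99.999):
        per_year_hours = allowed_downtime_seconds(target, "year") / 3600
        per_month_minutes = allowed_downtime_seconds(target, "month") / 60
        print(f"{target}%: {per_year_hours:.2f} h/year, {per_month_minutes:.2f} min/month")
```

Running the same helper against a platform SLA such as 99.95% gives a quick feel for how much outage a single dependency is allowed before the SLA is breached.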

The Truth About 9s

Contoso API: 99.99% SLA

Fabrikam API: 99.99% SLA

Duwamish API: 99.99% SLA

TailSpin API: 99.99% SLA

Northwind API: 99.99% SLA

Composite SLA: 99.95%, not 99.99%

Slide14
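The composite figure in "The Truth About 9s" can be checked by multiplying the individual availabilities, since a request that needs all five APIs succeeds only when every one of them is up. The sketch below assumes the dependencies fail independently, which is a simplification; the function name is illustrative.

```python
from math import prod


def composite_availability(dependency_availabilities):
    """Availability of a request that needs every dependency to succeed.

    Assumes dependencies fail independently, which real systems only approximate.
    """
    return prod(dependency_availabilities)


# Five dependencies at 99.99% each, as on the slide.
apis = [0.9999] * 5
composite = composite_availability(apis)
print(f"composite availability: {composite:.4%}")  # roughly 99.95%
```

Each additional serial dependency lowers the ceiling further, which is one argument for decomposing by workload so that not every workload depends on every service.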

Define Your SLAs

[Diagram labels: Sports API; Live Scores + Commentary; Team, Player, League Stats. SLA levels shown: 99.99% all the time; 100% during games; 0% when no game; 99% all the time]

Slide15

Design for Failure

Given enough scale, time and pressure, all components or services will fail

Your application will experience 1..N failures

How will your application behave?

Gracefully handle failure modes, continue to deliver value

Not so gracefully …

Fault types:

Transient. Temporary service interruptions, self-healing

Enduring. Require intervention.

Slide16

Failure Scope

Region

Service

Node

Individual Nodes May Fail

Connectivity issues (transient failures), hardware failures

Entire Services May Fail

Service dependencies (internal and external), configuration and code issues

Regions may become unavailable

Connectivity issues, acts of nature

Slide17

Handling Transient and Enduring Failures

Use fault-handling frameworks that recognize transient errors

Make it part of the background “noise”

Appropriate retry and backoff policies

Slide18

Handling Transient and Enduring Failures

Slide19

Slide20

Handling Transient and Enduring Failures

At some point, your request is blocking the line

Fail gracefully, and get out of the queue!

Anti-patterns:

Too much trust in downstream services and client proxies

Not bounding non-deterministic calls

Blocking synchronous operations

Slide21
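One way to avoid the "not bounding non-deterministic calls" anti-pattern is to put a deadline on every call to a downstream service and to cap how many calls can be in flight. The Python sketch below is an illustration under stated assumptions: the pool size of 4 and the name `call_with_deadline` are hypothetical, and the worker thread is abandoned rather than killed when the deadline passes.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared, bounded pool: at most 4 in-flight calls to this dependency (assumed limit).
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_deadline(operation, timeout_seconds, fallback=None):
    """Run operation on the pool and stop waiting once the deadline passes.

    The worker thread is not killed; the caller simply stops blocking on it
    and returns the fallback so the request can leave the queue.
    """
    future = _pool.submit(operation)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        future.cancel()  # only has an effect if the task has not started yet
        return fallback
```

A caller rendering a web page might use `call_with_deadline(load_recommendations, 0.2, fallback=[])`, where `load_recommendations` is a hypothetical function, so that a slow dependency degrades the page instead of blocking it.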

Sample Retry Policies

Platform       Context                                   Sample target e2e latency max   “Fast First”   Retry count   Delay    Backoff
SQL Database   Synchronous (e.g. render web page)        200 ms                          Yes            3             50 ms    Linear
SQL Database   Asynchronous (e.g. process queue item)    60 seconds                      No             4             5 s      Exponential
Azure Cache    Synchronous (e.g. render web page)        100 ms                          Yes            3             10 ms    Linear
Azure Cache    Asynchronous (e.g. process queue item)    500 ms                          Yes            3             100 ms   Exponential

Slide22
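The columns of the sample retry policy table map naturally onto a small policy object. The sketch below is one possible reading, not the fault-handling framework the deck refers to: `RetryPolicy`, `TransientError` and `run_with_retry` are hypothetical names, "fast first" is interpreted as one immediate retry before any delay, and the exact linear and exponential formulas are assumptions.

```python
import time
from dataclasses import dataclass


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. timeouts, throttling)."""


@dataclass(frozen=True)
class RetryPolicy:
    retry_count: int          # retries after the initial attempt
    delay_seconds: float      # base delay between attempts
    backoff: str              # "linear" or "exponential"
    fast_first: bool = False  # retry once immediately before backing off

    def delays(self):
        """Return the wait before each retry, in order."""
        waits = []
        for attempt in range(self.retry_count):
            if self.fast_first and attempt == 0:
                waits.append(0.0)
                continue
            step = attempt if self.fast_first else attempt + 1
            if self.backoff == "exponential":
                waits.append(self.delay_seconds * 2 ** (step - 1))
            else:
                waits.append(self.delay_seconds * step)
        return waits


def run_with_retry(operation, policy, sleep=time.sleep):
    """Call operation, retrying on TransientError according to policy."""
    waits = policy.delays()
    for attempt in range(policy.retry_count + 1):
        try:
            return operation()
        except TransientError:
            if attempt == policy.retry_count:
                raise  # retries exhausted: treat as enduring and surface it
            sleep(waits[attempt])


# Values taken from two rows of the sample retry policy table.
SQL_SYNC = RetryPolicy(retry_count=3, delay_seconds=0.05, backoff="linear", fast_first=True)
CACHE_ASYNC = RetryPolicy(retry_count=3, delay_seconds=0.1, backoff="exponential", fast_first=True)
```

Under these assumptions the waits for `SQL_SYNC` add up to 150 ms, which stays inside the 200 ms end-to-end target for a synchronous page render. The `sleep` parameter is injectable so the policy can be exercised in tests without real delays.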

Circuit Breaker at Netflix

Criteria

A request to a remote service times out

Thread pool and bounded task queue used to interact with a service dependency are at 100%

Client library used to interact with a service dependency throws an exception

Error Rate Threshold

On / Off

Slide23

Circuit Breaker at Netflix - Fallbacks

Custom fallback

Client library can provide an invokable callback method. Can also use locally available data on API server (cookie or cache) to generate a fallback response

Fail Silent

Return a null value. Useful if the data is optional

Fail Fast

When data is required and there’s no good fallback. Negative UX impact, but keeps API healthy

Slide24
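The circuit breaker criteria and the three fallback styles (custom fallback, fail silent, fail fast) can be illustrated with a small error-rate breaker. This is a simplified sketch and not Netflix's implementation: the class, its thresholds and the `clock` parameter are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was supplied (fail fast)."""


class CircuitBreaker:
    """Minimal error-rate breaker; thresholds and names are illustrative assumptions."""

    def __init__(self, error_threshold=0.5, min_calls=10, reset_after=30.0, clock=time.monotonic):
        self.error_threshold = error_threshold  # trip when failures / calls reaches this
        self.min_calls = min_calls              # do not trip on tiny samples
        self.reset_after = reset_after          # seconds to stay open before a trial call
        self.clock = clock
        self.calls = 0
        self.failures = 0
        self.opened_at = None

    def _is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: close and start a fresh measurement window.
            self.opened_at, self.calls, self.failures = None, 0, 0
            return False
        return True

    def call(self, operation, fallback=None):
        """Run operation through the breaker.

        fallback is a zero-argument callable; pass `lambda: None` to fail silent,
        or omit it to fail fast with CircuitOpenError or the original exception.
        """
        if self._is_open():
            if fallback is not None:
                return fallback()
            raise CircuitOpenError("dependency circuit is open")
        self.calls += 1
        try:
            return operation()
        except Exception:
            self.failures += 1
            if self.calls >= self.min_calls and self.failures / self.calls >= self.error_threshold:
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()
            raise
```

While the circuit is open the dependency is not called at all, which is what keeps the calling API healthy: requests return a cached or empty response immediately instead of queuing behind a failing service.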

Deployment Redundancy

Within a Datacenter

Traffic Management

Across Cloud Providers

Across On Premise and Cloud

Across Data Centers

Slide25

Failure Points

Focus on identifying design elements that are subject to external change. For example:

Database connection

Website connection

Configuration file

Registry key

Categories of common Failure Points:

ACLs, Database access, External web site/service access, Transactions, Configuration, Capacity, Network

Definition: design elements that can cause an outage.

Slide26

Failure Modes

Examples of failure modes:

Configuration file is not in correct location

Too much traffic overusing resources

Database reaches maximum capacity

The following would not be considered a failure mode:

Product bugs

Symptoms of problems

Informational occurrences

Definition: a predictable root cause of the outage that occurs at a Failure Point.

Slide27

Failure Mode Example

    public int GetBusinessData(string[] parameters)
    {
        try
        {
            var config = Config.Open(_configPath);
            var conn = ConnectToDB(config.ConnectString);
            var data = conn.GetData(_sproc, parameters);
            return data;
        }
        catch (Exception e)
        {
            WriteEventLogEvent(100, E_ExceptionInDal);
            throw;
        }
    }

Potential Failure Points:

Database Server

Database

Table

Configuration File

Potential Failure Modes:

DB Server not responding

DB offline

DB access denied

Sproc execute denied

DB doesn’t exist

DB timeout on connect

Index corrupt

Database corrupt

Table doesn’t exist

Table corrupt

Config file missing or invalid

Slide28

Design for operations

Slide29

Running a Live Site Service

Slide30

Running without Insight / Telemetry

Slide31

Capturing Insight

Log all internal/external “transactions” (database, web services, etc.)

Application context (module/component)

Host context (server/role/instance/process)

Timing information (start/stop/duration)

Activity identifier

Consolidate logs to central system / dashboard for health monitoring and troubleshooting

Slide32

Capturing Insight

Capture timing and context information through helper delegates (background noise)

Capture contextual errors (inner exceptions, etc.) on error

Logging library is asynchronous (fire-and-forget) to avoid blocking

Slide33
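The "Capturing Insight" points (application and host context, timing, an activity identifier, inner exceptions on error) can be wrapped in a single helper so that instrumentation becomes background noise at each call site. The sketch below uses only the Python standard library; the helper name `tracked`, the logger name and the log line layout are assumptions for illustration.

```python
import logging
import os
import socket
import time
import uuid
from contextlib import contextmanager

logger = logging.getLogger("telemetry")  # logger name is an arbitrary choice
HOST = socket.gethostname()


@contextmanager
def tracked(operation, component, activity_id=None, log=logger):
    """Log context, outcome and duration for one internal or external call."""
    activity_id = activity_id or uuid.uuid4().hex
    started = time.perf_counter()
    outcome = "ok"
    try:
        yield activity_id
    except Exception as exc:
        # Record the error type and any chained (inner) exception, then re-raise.
        outcome = f"error:{type(exc).__name__}"
        if exc.__cause__ is not None:
            outcome += f" caused_by:{type(exc.__cause__).__name__}"
        raise
    finally:
        duration_ms = (time.perf_counter() - started) * 1000
        log.info(
            "activity=%s host=%s pid=%d component=%s operation=%s outcome=%s duration_ms=%.1f",
            activity_id, HOST, os.getpid(), component, operation, outcome, duration_ms,
        )
```

A call site would read `with tracked("GetData", "dal") as activity_id: ...`, passing the same `activity_id` to downstream calls so that logs can be correlated. To keep the write off the request path, the standard library's `logging.handlers.QueueHandler` and `QueueListener` are one option for making delivery asynchronous.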

Many Options

Windows Azure Diagnostics

Slide34

Designing for Insight

Instrument for production logging

If you didn’t capture it, it didn’t happen

Implement inter-service monitoring and alerting

Capture and quantify inter-service behavior and activity

Run-time configurable logging

Enable activation (capture or delivery) of additional channels at run-time

Slide35

Define ALM

Slide36

Updating Configuration

For a production service, configuration == code

Need a rigorous ALM process for rolling out (and rolling back) updates to both.

Slide37

Updating Services

“We want global, simultaneous production rollouts of our new code”

Are you sure about that?

Production rollouts:

Running N, N+1 concurrently

Rolling load over to N+1, ability to fall back

Slide38

What is a health model?

Managed Entity

Logical piece of an application

A component that makes sense to an operator

Each entity has a health state

Entities can be external or internal

Multiple instances of an entity may exist

Aspect

Break down health state by functional team

Must be mutually exclusive

Group by organizational responsibility, e.g. security, performance, backup

May be specific or non-technology, e.g. orders shipped

Operational Condition

Defines level of operation currently available

Normal state is fully functional

Well-designed applications may support partial operation, e.g. read only

Slide39

Troubleshooting Workflow

Detection

Is there a problem?

Classification

What’s not working, how bad is it?

Diagnosis

Why is there a problem?

Recovery

What needs to be done to fix it?

Verification

Is the problem really gone?

Slide40

Resources

Failsafe: Guidance for Resilient Cloud Architectures (http://msdn.microsoft.com/en-us/library/jj853352.aspx)

Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services (http://msdn.microsoft.com/en-us/library/windowsazure/jj717232.aspx)

Designing and Deploying Internet Scale Services (https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf)

Slide41

Design for Scale

Slide42

Scale

Unit of Scale

Resources*

4 x Web Servers (8 CPU)

100 GB Database

10 GB Blob Storage

Demands

10K Active Users

1K Concurrent Users

<2 second response time

Workloads

Messaging

Collaboration

Productivity

(*) Other details such as operational demand, resources and workloads omitted for simplicity

Slide43

Scale by Units

[Chart: Demand & Resources over Time; axis marks at 100K and 400K]

Slide44

Example

[Chart: Workload 1 and Workload 2 by month, January to December; phases labeled Bottom, Ramp, Peak]

Slide45

Data Partitioning

Decomposition and Partitioning

Hybrid Partitioning

Vertical Partitioning

Horizontal Partitioning

Understanding the 3Vs

Slide46

Understanding the 3Vs

Volume

How large is the data today?

Velocity

How fast is it growing?

Variety

What type(s) of data are involved?

Slide47

Understanding Queryability

What?

What types of queries are done and what data set(s) and transformations are required to deliver them?

When?

How often must the data be queried? In real time or once a day, month, quarter, or year?

Slide48

Horizontal Partitioning

Slide49

Vertical Partitioning

Slide50

Hybrid Partitioning

Slide51
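The three partitioning slides carry only their titles in this transcript. As a rough illustration of what the terms usually mean (not a reconstruction of the slide diagrams), the Python sketch below shards rows by key for horizontal partitioning, splits columns for vertical partitioning, and combines the two for hybrid partitioning. All function names and the hash-modulo scheme are assumptions.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Horizontal partitioning: pick a shard from a stable hash of the row key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def split_columns(row: dict, hot_columns: set) -> tuple:
    """Vertical partitioning: separate frequently read columns from the rest."""
    hot = {name: value for name, value in row.items() if name in hot_columns}
    cold = {name: value for name, value in row.items() if name not in hot_columns}
    return hot, cold


def place(row: dict, key_column: str, hot_columns: set, shard_count: int) -> dict:
    """Hybrid partitioning: split columns first, then shard both parts by key."""
    hot, cold = split_columns(row, hot_columns | {key_column})
    cold[key_column] = row[key_column]  # keep the key on both sides for joins
    return {"shard": shard_for(str(row[key_column]), shard_count), "hot": hot, "cold": cold}
```

Simple hash-modulo placement forces most keys to move when `shard_count` changes, so production systems often use range maps or consistent hashing instead; the modulo form is used here only because it is short.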

Data – to cache or not to cache…

Slide52

Push vs. Pull

Load Balanced Push

Sync and good for sequential processing

Dependent on downstream services

Throttling vs. Performance

Managed Pull/Throughput

Asynchronous and event driven processing

Easy Parallelisation and Pipelining

Extending logic is easy

Slide53
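The managed pull model from "Push vs. Pull" can be sketched with a shared queue and a fixed set of workers that take items at their own pace, so throughput is governed by the consumers rather than by whoever is pushing. This is a minimal in-process illustration using Python threads; in a cloud service the queue would typically be a durable queue service, and the function name `run_pull_workers` is hypothetical.

```python
import queue
import threading


def run_pull_workers(items, handler, worker_count=3):
    """Workers pull from a shared queue at their own pace and collect results."""
    work = queue.Queue()
    for item in items:
        work.put(item)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # queue drained: this worker is done
            outcome = handler(item)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Adding capacity means raising `worker_count`, and adding a processing stage means putting another queue after the handler, which is the sense in which parallelisation and pipelining come easily with pull.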

Data on the inside – Data on the outside

http://msdn.microsoft.com/en-us/library/ms954587.aspx

Slide54

“Query Ready” Cache

Query patterns

Push the data close to where it is queried

Example: BING Maps

Process, structure, produce, format etc. data and cache “query ready” data

Light/cheap data production is OK

Pure and Idempotent operations are usually good candidates

Duplication is OK

Same data in a different format

Same data in multiple places

This requires processing data before it is queried - NOT at the query time

All data can be cached

Some data can be cached:

Frequently used

Process Heavy, Expensive data

Build as you Go

Slide55
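The "query ready" idea is to do the processing, structuring and formatting when data changes, so that a query is a plain lookup. The Python sketch below shows this with an in-memory dictionary standing in for a cache; the class, the leaderboard data and the view names are invented for the example.

```python
import json


class QueryReadyCache:
    """Precompute query results when data changes, so reads are plain lookups."""

    def __init__(self):
        self._scores = {}  # source data: team -> points
        self._ready = {}   # cache: query name -> preformatted payload

    def record_score(self, team: str, points: int) -> None:
        """Write path: update the source data, then rebuild the cached views."""
        self._scores[team] = points
        self._rebuild()

    def _rebuild(self) -> None:
        ranked = sorted(self._scores.items(), key=lambda item: (-item[1], item[0]))
        # Same data cached in two formats: duplication is acceptable here.
        self._ready["leaderboard.json"] = json.dumps(ranked)
        self._ready["leaderboard.txt"] = "\n".join(f"{team}: {points}" for team, points in ranked)

    def query(self, name: str) -> str:
        """Read path: no sorting or formatting at query time."""
        return self._ready[name]
```

Rebuilding on every write suits data that is read far more often than it changes. For expensive views with frequent writes, the rebuild could instead be batched or triggered from a queue.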

Distributed Caching

Simple to administer

No need to manage and host a distributed cache yourself.

Integrates easily into existing applications

ASP.NET session state and output cache providers enable no-code integration.

Same managed interfaces as Windows Server AppFabric Cache

[Diagram: On-Premises App (Core Logic, AppFabric Cache APIs, Windows Server AppFabric Cache) alongside Windows Azure App (Core Logic, AppFabric Cache APIs, Windows Azure AppFabric Caching)]

Slide56

Data Resiliency

Remediation

Backups

Content Delivery Networks

Slide57

Backup and Restore

Types of Backups

Total backup, point in time, synchronized

Blob

NoSQL

Sharded Relational Databases

Relational Databases

Slide58

Backing Up Table and Blob Storage

[Diagram: Source, Replica, Log, Log Replica]

Slide59

Managing Backed Up Data

Archive

Archiving unneeded data to a secondary storage area can reduce costs and increase performance

Purge

Storage in the cloud is inexpensive, but it’s not free. Purge unnecessary archives as appropriate

Slide60

CDN

[Diagram: Blob Service, Content Delivery Network and three Edge Locations, with pic1.jpg shown at several points]
