Slide1
Resilient Cloud Applications
Mark Simms (@mabsimms)
Principal Program Manager
Windows Azure Customer Advisory Team
Slide2
Session Objectives
Designing resilient large-scale services requires careful design and architecture choices
This session will explore key patterns & practices for highly available cloud services, illustrated with customer examples
Interactivity rocks -> please ask questions throughout!
Slide3
Setting the Stage
Slide4
Setting the stage
Scalability
Availability
Insight
Slide5
Setting the stage
Maximize service availability for consumers
Ensure customers (and client devices) can access and use the service
Minimize impact of failure on consumers
Degrade gracefully, isolate faults, fall back to alternate delivery paths
Maximize performance and capacity
Services that are “live” but cannot handle desired/required demand are not available
Slide6
Musings on application design
Traditional web service design (N-tier)
Make “everything stateless”
Slide7
Musings on application design
Traditional web service design (N-tier)
Make “everything stateless”
Separate logic from data (state)
Leverage specialized external state services
Cache, load balancer, relational database, document database, key/value store, etc.
Slide8
Musings on application design
No service is an island
Dependencies on other internal and external services
Trading time-to-market and agility for control
Slide9
What’s in a workload?
#1: without the relational database the application cannot fulfill any workloads
#2: the relational database is an external service, subject to partial availability
Slide10
Designing for Failure
Slide11
Decompose by Workload
Applications are composed of one or more workloads
Products like SharePoint and Windows Server are designed with this principle in mind
Each with different profiles, requirements and boundaries
Management, Availability, Operational, Cost, Health, Security, Capacity, etc.
Decomposition allows for workload specific optimization
Technology selections, scalability and availability approaches, etc.
Slide12
What are the “9”s
Availability % | Downtime per year | Downtime per month* | Downtime per week
90% ("one nine") | 36.5 days | 72 hours | 16.8 hours
99% ("two nines") | 3.65 days | 7.20 hours | 1.68 hours
99.9% ("three nines") | 8.76 hours | 43.2 minutes | 10.1 minutes
99.99% ("four nines") | 52.56 minutes | 4.32 minutes | 1.01 minutes
99.999% ("five nines") | 5.26 minutes | 25.9 seconds | 6.05 seconds
99.9999% ("six nines") | 31.5 seconds | 2.59 seconds | 0.605 seconds
Study Windows Azure Platform SLAs:
Compute External Connectivity: 99.95% (2 or more instances)
Compute Instance Availability: 99.9% (2 or more instances)
Storage Availability: 99.9%
SQL Azure Availability: 99.9%
Slide13
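The downtime figures in the table of nines can be reproduced with a little arithmetic. The sketch below assumes a 365-day year and a 30-day month, which is what the table's numbers appear to use; the function name is illustrative.

```python
# Downtime allowed by an availability target over a given period.
SECONDS_PER_DAY = 24 * 60 * 60

def downtime_seconds(availability_percent: float, period_days: float) -> float:
    """Return the seconds of downtime permitted over period_days."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * SECONDS_PER_DAY

for target in (99.0, 99.9, 99.99):
    per_year_hours = downtime_seconds(target, 365) / 3600
    per_month_minutes = downtime_seconds(target, 30) / 60
    print(f"{target}%: {per_year_hours:.2f} h/year, {per_month_minutes:.1f} min/month")
```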
The Truth About 9s
Contoso API: 99.99% SLA
Fabrikam API: 99.99% SLA
Duwamish API: 99.99% SLA
TailSpin API: 99.99% SLA
Northwind API: 99.99% SLA
Composite SLA = 99.99%
Composite SLA = 99.95% *
Slide14
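Why five dependencies at 99.99% each give a composite of roughly 99.95% can be checked directly. This is a minimal sketch that assumes a request needs all five APIs and that their failures are independent, which real dependencies rarely are.

```python
from math import prod

def composite_availability(availabilities_percent):
    """Availability of a request that needs every listed dependency to be up.

    Assumes independent failures; correlated failures change the result.
    """
    return 100.0 * prod(a / 100.0 for a in availabilities_percent)

five_apis = [99.99] * 5
print(f"composite: {composite_availability(five_apis):.2f}%")
```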
Define Your SLAs
Sports API: 99.99%, all the time
Live Scores + Commentary: 100% during games, 0% when no game
Team, Player, League Stats: 99%, all the time
Slide15
Design for Failure
Given enough scale, time and pressure all components or services will fail
Your application will experience 1..N failures
How will your application behave?
Gracefully handle failure modes, continue to deliver value
Not so gracefully …
Fault types:
Transient. Temporary service interruptions, self-healing
Enduring. Require intervention.
Slide16
Failure Scope
Region
Service
Node
Individual Nodes May Fail
Connectivity Issues (transient failures), hardware failures
Entire Services May Fail
Service dependencies (internal and external), configuration and code issues
Regions may become unavailable
Connectivity Issues, acts of nature
Slide17
Handling Transient and Enduring Failures
Use fault-handling frameworks that recognize transient errors
Make it part of the background “noise”
Appropriate retry and backoff policies
Slide18
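One way to make transient errors part of the background noise is a small retry wrapper. The sketch below is not any particular framework's API: `TransientError`, `call_with_retry` and the jitter range are hypothetical names and choices made for illustration.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for errors that are expected to clear on their own."""

def call_with_retry(operation, max_attempts=4, base_delay=0.05, sleep=time.sleep):
    """Run operation, retrying only TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: treat it as an enduring failure
            # Exponential backoff plus jitter so clients do not retry in lockstep.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))

attempts = {"count": 0}

def flaky_query():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("connection reset")
    return "rows"

print(call_with_retry(flaky_query, base_delay=0.01))
```

Any other exception type passes straight through, so enduring faults are not retried.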
Handling Transient and Enduring Failures
Slide19
Slide20
Handling Transient and Enduring Failures
At some point, your request is blocking the line
Fail gracefully, and get out of the queue!
Anti-patterns:
Too much trust in downstream services and client proxies
Not bounding non-deterministic calls
Blocking synchronous operations
Slide21
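The second anti-pattern, not bounding non-deterministic calls, can be avoided by giving every downstream call a deadline. A minimal sketch using the standard library thread pool follows; `call_bounded`, `slow_dependency` and the timeout values are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

# Fixed number of worker threads; note the executor's internal queue is unbounded.
_pool = ThreadPoolExecutor(max_workers=4)

def call_bounded(operation, timeout_seconds, fallback):
    """Wait at most timeout_seconds for operation; otherwise return fallback."""
    future = _pool.submit(operation)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # cancel() only helps if the call has not started yet;
        # a call that is already running keeps its worker thread until it returns.
        future.cancel()
        return fallback

def slow_dependency():
    time.sleep(0.5)  # stands in for a downstream service that is not answering
    return "fresh data"

print(call_bounded(slow_dependency, timeout_seconds=0.05, fallback="cached data"))
```

The caller gets out of the queue after 50 ms, but the stuck call still occupies a worker, so the underlying client should also carry its own timeout.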
Sample Retry Policies
Platform | Context | Sample Target e2e latency max | “Fast First” | Retry Count | Delay | Backoff
SQL Database | Synchronous (e.g. render web page) | 200 ms | Yes | 3 | 50 ms | Linear
SQL Database | Asynchronous (e.g. process queue item) | 60 seconds | No | 4 | 5 s | Exponential
Azure Cache | Synchronous (e.g. render web page) | 100 ms | Yes | 3 | 10 ms | Linear
Azure Cache | Asynchronous (e.g. process queue item) | 500 ms | Yes | 3 | 100 ms | Exponential
Slide22
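A retry policy of this shape can be expressed as data and turned into a delay schedule. The sketch below reads the SQL Database synchronous row as 3 retries with a 50 ms delay and the Azure Cache asynchronous row as 3 retries with a 100 ms delay. It also assumes that "Fast First" means the first retry is immediate, that linear backoff multiplies the delay by the step number, and that exponential backoff doubles it each step; the slide does not define these terms.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetryPolicy:
    retry_count: int
    delay_ms: float
    backoff: str       # "linear" or "exponential"
    fast_first: bool   # assumed meaning: first retry happens immediately

    def delays_ms(self):
        """Delay before each retry, in order."""
        delays = []
        for retry in range(self.retry_count):
            if self.fast_first and retry == 0:
                delays.append(0.0)
                continue
            # 1-based position among the retries that actually wait.
            step = retry if self.fast_first else retry + 1
            if self.backoff == "linear":
                delays.append(self.delay_ms * step)
            elif self.backoff == "exponential":
                delays.append(self.delay_ms * 2 ** (step - 1))
            else:
                raise ValueError(f"unknown backoff: {self.backoff}")
        return delays

sql_sync = RetryPolicy(retry_count=3, delay_ms=50, backoff="linear", fast_first=True)
cache_async = RetryPolicy(retry_count=3, delay_ms=100, backoff="exponential", fast_first=True)

for name, policy in (("SQL sync", sql_sync), ("Cache async", cache_async)):
    schedule = policy.delays_ms()
    print(name, schedule, "total wait:", sum(schedule), "ms")
```

Summing the schedule is a quick check that the waits alone fit inside the end-to-end latency target for that context.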
Circuit Breaker at Netflix
Criteria:
A request to a remote service times out
Thread pool and bounded task queue used to interact with a service dependency are at 100%
Client library used to interact with a service dependency throws an exception
Error Rate Threshold: On / Off
Slide23
Circuit Breaker at Netflix - Fallbacks
Custom fallback
Client library can provide an invokable callback method. Can also use locally available data on API server (cookie or cache) to generate a fallback response
Fail Silent
Return a null value. Useful if the data is optional
Fail Fast
When data is required and there’s no good fallback. Negative UX impact, but keeps API healthy
Slide24
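The breaker and its fallbacks can be sketched in a few lines. This is not Netflix's implementation: it trips on a count of consecutive failures rather than an error-rate threshold, to keep the example short, and the class name, parameters and injectable clock are assumptions.

```python
import time

class CircuitBreaker:
    """Opens after failure_threshold consecutive failures, retries after reset_seconds."""

    def __init__(self, failure_threshold=3, reset_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                return fallback()      # circuit open: skip the dependency entirely
            self.opened_at = None      # half-open: let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result

breaker = CircuitBreaker(failure_threshold=2)

def ratings_service():
    raise TimeoutError("remote service timed out")

# Custom fallback: answer from locally available data.
print(breaker.call(ratings_service, fallback=lambda: {"rating": "cached"}))
# Fail silent: the data is optional, so return None.
print(breaker.call(ratings_service, fallback=lambda: None))
```

Fail fast fits the same shape: pass a fallback that raises, so the caller sees the error immediately while the dependency is left alone.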
Deployment Redundancy
Within a Datacenter
Traffic Management
Across Cloud Providers
Across On Premise and Cloud
Across Data Centers
Slide25
Failure Points
Focus on identifying design elements that are subject to external change. For example:
Database connection
Website connection
Configuration file
Registry key
Categories of common Failure Points:
ACLs, Database access, External web site/service access, Transactions, Configuration, Capacity, Network
Definition: design elements that can cause an outage.
Slide26
Failure Modes
Examples of failure modes:
Configuration file is not in correct location
Too much traffic overusing resources
Database reaches maximum capacity
The following would not be considered a failure mode:
Product bugs
Symptoms of problems
Informational occurrences
Definition: a predictable root cause of the outage that occurs at a Failure Point.
Slide27
Failure Mode Example
    public int GetBusinessData(string[] parameters)
    {
        try
        {
            // Failure point: configuration file
            var config = Config.Open(_configPath);
            // Failure points: database server, database
            var conn = ConnectToDB(config.ConnectString);
            // Failure points: stored procedure, table
            var data = conn.GetData(_sproc, parameters);
            return data;
        }
        catch (Exception e)
        {
            WriteEventLogEvent(100, E_ExceptionInDal);
            throw;
        }
    }
Potential Failure Points:
Database Server
Database
Table
Configuration File
Potential Failure Modes:
DB Server not responding
DB offline
DB access denied
Sproc execute denied
DB doesn’t exist
DB timeout on connect
Index corrupt
Database corrupt
Table doesn’t exist
Table corrupt
Config file missing or invalid
Slide28
Design for operations
Slide29
Running a Live Site Service
Slide30
Running without Insight / Telemetry
Slide31
Capturing Insight
Log all internal/external “transactions” (database, web services, etc.)
Application context (module/component)
Host context (server/role/instance/process)
Timing information (start/stop/duration)
Activity identifier
Consolidate logs to central system / dashboard for health monitoring and troubleshooting
Slide32
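A log record carrying the fields listed for each transaction (application context, host context, timing, activity identifier) might look like the sketch below. The field names, the JSON encoding and the `logged_transaction` helper are illustrative assumptions, not a prescribed schema.

```python
import json
import logging
import os
import socket
import time
import uuid
from contextlib import contextmanager

log = logging.getLogger("telemetry")

@contextmanager
def logged_transaction(component, target, activity_id=None):
    """Log one internal/external call with context, timing and an activity id."""
    record = {
        "activity_id": activity_id or str(uuid.uuid4()),
        "component": component,         # application context
        "target": target,               # e.g. "sql", "cache", "payments-api"
        "host": socket.gethostname(),   # host context
        "pid": os.getpid(),
        "start": time.time(),
    }
    started = time.perf_counter()
    try:
        yield record
        record["outcome"] = "ok"
    except Exception as exc:
        record["outcome"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - started) * 1000.0
        log.info(json.dumps(record))

with logged_transaction("checkout", "sql", activity_id="req-123"):
    time.sleep(0.01)  # stands in for the database call
```

Passing the same activity identifier through every service a request touches is what lets a central system stitch the records back together.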
Capturing Insight
Capture timing and context information through helper delegates (background noise)
Capture contextual errors (inner exceptions, etc.) on error
Logging library is asynchronous (fire-and-forget) to avoid blocking
Slide33
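Fire-and-forget logging can be approximated in Python with the standard library's `QueueHandler` and `QueueListener`: the request thread only enqueues the record, and a background thread delivers it. The subclass that drops records when the queue is full is an assumption about what "avoid blocking" should mean under load.

```python
import logging
import logging.handlers
import queue

class DroppingQueueHandler(logging.handlers.QueueHandler):
    """Enqueue without blocking; drop the record if the queue is full."""

    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            pass  # losing a log line is preferable to stalling a request

log_queue = queue.Queue(maxsize=10_000)
sink = logging.StreamHandler()  # stands in for a slow central log sink
listener = logging.handlers.QueueListener(log_queue, sink)
listener.start()                # background thread drains the queue

log = logging.getLogger("app")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(DroppingQueueHandler(log_queue))

log.info("order placed")        # returns as soon as the record is queued
listener.stop()                 # flush remaining records on shutdown
```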
Many Options
Windows Azure Diagnostics
Slide34
Designing for Insight
Instrument for production logging
If you didn’t capture it, it didn’t happen
Implement inter-service monitoring and alerting
Capture and quantify inter-service behavior and activity
Run-time configurable logging
Enable activation (capture or delivery) of additional channels at run-time
Slide35
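Run-time configurable logging can be as simple as re-applying level settings when a configuration value changes. A minimal sketch follows; the logger names and the idea of an operator-supplied mapping are assumptions.

```python
import logging

def apply_log_config(config):
    """Apply {'logger name': 'LEVEL'} settings without restarting the process."""
    for logger_name, level_name in config.items():
        # setLevel accepts level names and raises ValueError for unknown ones.
        logging.getLogger(logger_name).setLevel(level_name.upper())

apply_log_config({"orders.db": "WARNING"})   # normal operation
apply_log_config({"orders.db": "DEBUG"})     # e.g. switched on during an incident
```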
Define ALM
Slide36
Updating Configuration
For a production service, configuration == code
Need rigorous ALM process for rolling out (and rolling back) updates to both.
Slide37
Updating Services
“We want global, simultaneous production rollouts of our new code”
Are you sure about that?
Production rollouts:
Running N, N+1 concurrently
Rolling load over to N+1, ability to fall back
Slide38
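Running N and N+1 side by side and rolling load across can be sketched as weighted routing plus a step schedule. The step sizes and the single health flag below are hypothetical; a real rollout would gate each step on measured error rates and latency.

```python
import random

ROLLOUT_STEPS = [0.01, 0.10, 0.50, 1.00]  # hypothetical share of traffic sent to N+1

def choose_version(new_version_share, rng=random.random):
    """Route one request: 'N+1' with probability new_version_share, else 'N'."""
    return "N+1" if rng() < new_version_share else "N"

def next_share(current_share, healthy):
    """Advance one step while N+1 is healthy; fall back to N otherwise."""
    if not healthy:
        return 0.0
    later = [step for step in ROLLOUT_STEPS if step > current_share]
    return later[0] if later else current_share

share = 0.0
for healthy in (True, True, False):
    share = next_share(share, healthy)
    print(f"share of traffic on N+1: {share:.0%}")
```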
What is a health model?
Managed Entity
Logical piece of an application
A component that makes sense to an operator
Each entity has a health state
Entities can be external or internal
Multiple instances of an entity may exist
Aspect
Break down health state by functional team
Must be mutually exclusive
Group by organizational responsibility e.g. security, performance, backup
May be specific or non-technology e.g. orders shipped.
Operational Condition
Defines level of operation currently available
Normal state is fully functional
Well designed applications may support partial operation e.g. read only
Slide39
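The three parts of the health model (managed entity, aspect, operational condition) map naturally onto a small data structure. In the sketch below the three condition levels and the "worst aspect wins" roll-up rule are assumptions; the slide only names the concepts.

```python
from dataclasses import dataclass, field
from enum import IntEnum

class Condition(IntEnum):
    """Operational condition, ordered from worst to best."""
    DOWN = 0
    READ_ONLY = 1   # partial operation
    NORMAL = 2      # fully functional

@dataclass
class ManagedEntity:
    name: str
    external: bool = False
    aspects: dict = field(default_factory=dict)  # aspect name -> Condition

    def condition(self):
        """Assumed roll-up: an entity is only as healthy as its worst aspect."""
        return min(self.aspects.values(), default=Condition.NORMAL)

orders_db = ManagedEntity(
    "orders-db",
    external=True,
    aspects={"security": Condition.NORMAL, "performance": Condition.READ_ONLY},
)
print(orders_db.name, orders_db.condition().name)
```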
Troubleshooting Workflow
Detection
Is there a problem?
Classification
What’s not working, how bad is it?
Diagnosis
Why is there a problem?
Recovery
What needs to be done to fix it?
Verification
Is the problem really gone?
Slide40
Resources
Failsafe: Guidance for Resilient Cloud Architectures (http://msdn.microsoft.com/en-us/library/jj853352.aspx)
Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services (http://msdn.microsoft.com/en-us/library/windowsazure/jj717232.aspx)
Designing and Deploying Internet Scale Services (https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf)
Slide41
Design for Scale
Slide42
Scale
Unit of Scale
Resources*
4 x Web Servers (8 CPU)
100 GB Database
10 GB Blob Storage
Demands
10K Active Users
1K Concurrent Users
<2 second response time
Workloads
Messaging
Collaboration
Productivity
(*) Other details such as operational demand, resources and workloads omitted for simplicity
Slide43
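Once a unit of scale is defined, capacity planning becomes a matter of counting units. The sketch below takes the unit from the slide (4 web servers, 100 GB database, 10 GB blob storage for 10K active and 1K concurrent users); the demand figures passed in are made up for illustration.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class ScaleUnit:
    active_users: int = 10_000
    concurrent_users: int = 1_000
    web_servers: int = 4
    database_gb: int = 100
    blob_storage_gb: int = 10

def units_needed(unit, active_users, concurrent_users):
    """Whole units required to cover both demand figures (at least one)."""
    return max(
        math.ceil(active_users / unit.active_users),
        math.ceil(concurrent_users / unit.concurrent_users),
        1,
    )

unit = ScaleUnit()
n = units_needed(unit, active_users=100_000, concurrent_users=8_000)
print(n, "units:", n * unit.web_servers, "web servers,", n * unit.database_gb, "GB database")
```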
Scale by Units
[Chart: Demand & Resources over Time, with marked levels at 100K and 400K]
Slide44
Example
[Chart: Workload 1 and Workload 2 by month (J through D), with Bottom, Ramp and Peak phases]
Slide45
Data Partitioning
Decomposition and Partitioning
Hybrid Partitioning
Vertical Partitioning
Horizontal Partitioning
Understanding the 3Vs
Slide46
Understanding the 3Vs
Volume
How large is the data today?
Velocity
How fast is it growing?
Variety
What type(s) of data are involved?
Slide47
Understanding Queryability
What?
What types of queries are done and what data set(s) and transformations are required to deliver them?
When?
How often must the data be queried? In real time or once a day, month, quarter, or year?
Slide48
Horizontal Partitioning
Slide49
Vertical Partitioning
Slide50
Hybrid Partitioning
Slide51
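The three partitioning styles named on these slides can be contrasted in a few lines. This is a sketch under assumptions: the shard count, the column layout and the store names are all hypothetical, and hashing the key is only one of several ways to partition horizontally.

```python
import hashlib

SHARD_COUNT = 4  # hypothetical number of horizontal partitions

def shard_for(key: str) -> int:
    """Horizontal partitioning: route a whole row to a shard by hashing its key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT

# Vertical partitioning: split one record's columns across stores by access pattern.
VERTICAL_LAYOUT = {
    "relational": ["user_id", "email", "plan"],  # small, frequently queried
    "blob": ["user_id", "avatar_image"],         # large, rarely read
}

def split_vertically(record: dict) -> dict:
    return {
        store: {col: record[col] for col in cols if col in record}
        for store, cols in VERTICAL_LAYOUT.items()
    }

def place(record: dict) -> dict:
    """Hybrid partitioning: split by column, then shard each part by key."""
    shard = shard_for(record["user_id"])
    return {f"{store}-{shard}": part for store, part in split_vertically(record).items()}

user = {"user_id": "u-1001", "email": "a@example.com", "plan": "pro", "avatar_image": b"\x89PNG"}
print(sorted(place(user)))
```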
Data – to cache or not to cache…
Slide52
Push vs. Pull
Load Balanced Push
Sync and good for sequential processing
Dependent on downstream services
Throttling vs. Performance
Managed Pull/Throughput
Asynchronous and event driven processing
Easy Parallelisation and Pipelining
Extending logic is easy
Slide53
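The managed pull model can be illustrated with workers that take items from a shared queue at their own pace, so throughput is set by the consumers rather than by the producer. The sketch uses an in-process queue and threads as a stand-in for a durable cloud queue and worker roles; the function name and worker count are assumptions.

```python
import queue
import threading

def run_pull_workers(items, handler, worker_count=3):
    """Process items with workers that pull from a shared queue at their own pace."""
    work = queue.Queue()
    for item in items:
        work.put(item)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # queue drained: this worker exits
            outcome = handler(item)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

print(sorted(run_pull_workers(range(10), lambda n: n * n)))
```

Adding capacity means raising `worker_count`; the producer side does not change, which is the parallelisation benefit the slide points at.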
Data on the inside – Data on the outside
http://msdn.microsoft.com/en-us/library/ms954587.aspx
Slide54
“Query Ready” Cache
Query patterns
Push the data close to where it is queried
Example: BING Maps
Process, structure, produce, format etc. data and cache “query ready” data
Light/cheap data production is OK
Pure and Idempotent operations are usually good candidates
Duplication is OK
Same data in a different format
Same data in multiple places
This requires processing data before it is queried - NOT at the query time
All data can be cached
Some data can be cached:
Frequently used
Process Heavy, Expensive data
Build as you Go
Slide55
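A "query ready" cache moves the processing to write time: data is aggregated and shaped before anyone asks for it, so the query itself is a lookup. The order data and function names below are invented for illustration.

```python
from collections import defaultdict

orders = [
    {"customer": "contoso", "total": 120.0},
    {"customer": "fabrikam", "total": 80.0},
    {"customer": "contoso", "total": 30.0},
]

def build_query_ready_cache(rows):
    """Aggregate once, ahead of time, into the exact shape the query returns."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["total"]
    return {
        customer: {"customer": customer, "lifetime_total": total}
        for customer, total in totals.items()
    }

# Rebuilt when the data changes, not per query. The build is pure and idempotent:
# running it again on the same rows yields the same cache.
cache = build_query_ready_cache(orders)

def customer_summary(customer):
    """Query time is a single lookup; no aggregation happens here."""
    return cache.get(customer)

print(customer_summary("contoso"))
```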
Distributed Caching
Simple to administer
No need to manage and host a distributed cache yourself.
Integrates easily into existing applications
ASP.NET session state and output cache providers enable no-code integration.
Same managed interfaces as Windows Server AppFabric Cache
On-Premises App: Core Logic, AppFabric Cache APIs, Windows Server AppFabric Cache
Windows Azure App: Core Logic, AppFabric Cache APIs, Windows Azure AppFabric Caching
Slide56
Data Resiliency
Remediation
Backups
Content Delivery Networks
Slide57
Backup and Restore
Types of Backups
Total backup, point in time, synchronized
Blob
NoSQL
Sharded Relational Databases
Relational Databases
Slide58
Backing Up Table and Blob Storage
Source
Replica
Log
Log Replica
01100100 01100001 01110100 01100001
Slide59
Managing Backed Up Data
Archive
Archiving unneeded data to a secondary storage area can reduce costs and increase performance
Purge
Storage in the cloud is inexpensive, but it’s not free. Purge unnecessary archives as appropriate.
Slide60
CDN
[Diagram: pic1.jpg in the Blob Service, delivered through the Content Delivery Network to three Edge Locations]