Windows Azure Internals
Mark Russinovich, Technical Fellow, Windows Azure
AZR302
Agenda
Windows Azure Datacenter Architecture
Deploying Services
Inside IaaS VMs
Maintaining Service Health
The Leap Day Outage and Lessons Learned
Windows Azure Datacenter Architecture
Windows Azure Datacenters
Windows Azure currently has 8 regions
At least two per geo-political region
100,000s of servers
Building out many more
The Fabric Controller (FC)
The “kernel” of the cloud operating system
Manages datacenter hardware
Manages Windows Azure services
Four main responsibilities:
Datacenter resource allocation
Datacenter resource provisioning
Service lifecycle management
Service health management
Inputs:
Description of the hardware and network resources it will control
Service model and binaries for cloud applications
[Diagram: analogy between a server and the datacenter. On a server, the Windows kernel manages processes such as Word and SQL Server; in the datacenter, the Fabric Controller manages services such as Exchange Online and SQL Azure]
Datacenter Clusters
Datacenters are divided into “clusters”
Approximately 1,000 rack-mounted servers each (we call them “nodes”)
Provides a unit of fault isolation
Each cluster is managed by a Fabric Controller (FC)
FC is responsible for:
Blade provisioning
Blade management
Service deployment and lifecycle
[Diagram: Cluster 1 through Cluster n hang off the datacenter network, each managed by its own FC]
Inside a Cluster
FC is a distributed, stateful application running on nodes (servers) spread across fault domains
Top blades are reserved for FC
One FC instance is the primary; all others keep their view of the world in sync
Supports rolling upgrade; services continue to run even if the FC fails entirely
[Diagram: FC instances FC1 through FC5 run on blades in separate racks, each rack behind its own top-of-rack (TOR) switch, all connected through the spine]
Datacenter Network Architecture
DLA Architecture (Old)
Quantum10 Architecture (New)
[Diagram: the old DLA architecture aggregates pods of 20 racks (40 nodes per rack, each rack with a TOR switch, Digi console server, and APC power unit) through aggregation switches with load balancers and then access routers up to the DC routers, at roughly 120 Gbps; the new Quantum10 architecture is a flatter TOR-to-spine mesh with BL devices into the DCRs, at roughly 30,000 Gbps]
Tip: Load Balancer Overhead
Going through the load balancer adds about 0.5 ms of latency
When possible, connect to systems via their DIP (dynamic IP address)
Instances in the same Cloud Service can access each other by DIP
You can use Virtual Network to make the DIPs of different cloud services visible to each other
[Diagram: two instances behind a load balancer with VIP 65.123.44.22 talk to each other directly via DIPs 10.2.3.4 and 10.2.3.5, skipping the 0.5 ms load balancer hop]
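As an illustrative sketch (not from the deck): with the .NET ServiceRuntime API, a role instance can enumerate a peer role's DIP-based endpoints directly. The role name "MiddleTier" and endpoint name "Internal" below are hypothetical and would be declared in your service definition.

    using System.Linq;
    using System.Net;
    using Microsoft.WindowsAzure.ServiceRuntime;

    class DipLookup
    {
        // Enumerate peer instances' internal endpoints, which carry the
        // instances' DIPs (e.g. 10.2.3.5) rather than the public VIP.
        // "MiddleTier" and "Internal" are hypothetical names.
        static IPEndPoint[] PeerEndpoints()
        {
            return RoleEnvironment.Roles["MiddleTier"].Instances
                .Select(i => i.InstanceEndpoints["Internal"].IPEndpoint)
                .ToArray();
        }
    }

Connections made to these IPEndPoints go node to node and avoid the 0.5 ms load balancer hop.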
Deploying Services
Provisioning a Node
Power on node
PXE-boot the Maintenance OS
Agent formats the disk and downloads the Host OS via Windows Deployment Services (WDS)
Host OS boots, runs Sysprep /specialize, and reboots
FC connects with the “Host Agent”
[Diagram: the Fabric Controller pulls the Maintenance OS, parent OS, and role images from its image repository; the node PXE-boots from the Windows Deployment Server, then runs the Windows Azure OS with the FC host agent on the Windows Azure hypervisor]
Deploying a Service to the Cloud: The 10,000-Foot View
Package upload to portal
System Center App Controller provides the IT pro upload experience
PowerShell provides a scripting interface
The Windows Azure portal provides the developer upload experience
Service package passed to RDFE
RDFE sends the service to a Fabric Controller (FC) based on target region and affinity group
FC stores the image in its repository and deploys the service
[Diagram: the portal, System Center App Controller, and REST APIs deliver the service package to RDFE, which forwards it to a Fabric Controller in the target datacenter (e.g. US-North Central)]
RDFE
RDFE serves as the front end for all Windows Azure services:
Subscription management
Billing
User access
Service management
RDFE is responsible for picking the clusters where services and storage accounts are deployed:
First, the datacenter region
Then affinity group or cluster load
Normalized VIP and core utilization: A(h, g) = C(h, g) / …
FC Service Deployment Steps
Process service model files
Determine resource requirements
Create role images
Allocate compute and network resources
Prepare nodes
Place role images on nodes
Create virtual machines
Start virtual machines and roles
Configure networking
Dynamic IP addresses (DIPs) assigned to blades
Virtual IP addresses (VIPs) + ports allocated and mapped to sets of DIPs
Configure packet filter for VM-to-VM traffic
Program load balancers to allow traffic
Service Resource Allocation
Goal: allocate service components to available resources while satisfying all hard constraints (a toy allocator sketch follows this slide)
HW requirements: CPU, memory, storage, network
Fault domains
Secondary goal: satisfy soft constraints
Prefer allocations that will simplify servicing the host OS/hypervisor
Optimize network proximity: pack nodes
Service allocation produces the goal state for the resources assigned to the service components:
Node and VM configuration (OS, hosting environment)
Images and configuration files to deploy
Processes to start
Assign and configure network resources such as LBs and VIPs
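To make the hard/soft constraint split concrete, here is a toy greedy allocator. This is my sketch, not the FC's actual algorithm: capacity is treated as a hard constraint, instances are spread across fault domains, and fuller nodes are preferred to keep packing tight.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Node
    {
        public int FaultDomain;
        public int FreeCores;
    }

    static class Allocator
    {
        // Pick one node per requested instance. Throws if the hard
        // constraint (enough free cores) cannot be met.
        public static List<Node> Place(List<Node> nodes, int instances, int coresEach)
        {
            var picks = new List<Node>();
            for (int i = 0; i < instances; i++)
            {
                Node candidate = nodes
                    .Where(n => n.FreeCores >= coresEach)        // hard: capacity
                    .OrderBy(n => picks.Count(p => p.FaultDomain == n.FaultDomain)) // spread fault domains
                    .ThenBy(n => n.FreeCores)                    // soft: pack fuller nodes first
                    .FirstOrDefault();
                if (candidate == null)
                    throw new InvalidOperationException("Hard constraints unsatisfiable");
                candidate.FreeCores -= coresEach;
                picks.Add(candidate);
            }
            return picks;
        }
    }

For example, Place(nodes, 3, 4) returns three nodes spanning as many fault domains as the inventory allows, each with four cores reserved.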
Deploying a Service
[Diagram: a deployment of www.mycloudapp.net with Role A, a web role (front end), count 3, update domains 3, size Large, and Role B, a worker role, count 2, update domains 2, size Medium; the load balancer maps www.mycloudapp.net to the front-end DIPs 10.100.0.36, 10.100.0.122, and 10.100.0.185]
Deploying a Role Instance
FC pushes role files and configuration information to the target node’s host agent
Host agent creates VHDs
Host agent creates the VM, attaches the VHDs, and starts the VM
Guest agent starts the role host, which calls the role entry point (see the sketch below)
Starts a health heartbeat to, and gets commands from, the host agent
The load balancer only routes to an external endpoint once it responds to a simple HTTP GET (the LB probe)
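For reference, the role host drives a .NET role through the RoleEntryPoint contract; a minimal worker role looks roughly like this (the class name and timing are illustrative):

    using System.Threading;
    using Microsoft.WindowsAzure.ServiceRuntime;

    public class WorkerRole : RoleEntryPoint
    {
        public override bool OnStart()
        {
            // One-time initialization. The guest agent reports the
            // instance Ready only after OnStart returns true; only then
            // can the LB probe start routing traffic here.
            return base.OnStart();
        }

        public override void Run()
        {
            // Main work loop; returning from Run recycles the role.
            while (true)
            {
                Thread.Sleep(10000);
            }
        }
    }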
Inside a Deployed Node
[Diagram: a physical node runs the FC host agent in the host partition alongside guest partitions, each with a guest agent and a role instance; a trust boundary separates the node from the Fabric Controller primary and replicas, which deploy from the image repository (OS VHDs, role ZIP files)]
PaaS Role Instance VHDs
Differencing VHD for the OS image (D:\)
Host agent injects the FC guest agent into the VHD for web/worker roles
Resource VHD for temporary files (C:\)
Role VHD for role files (first available drive letter, e.g. E:\ or F:\)
[Diagram: the role VM mounts C:\ as a dynamic resource VHD, D:\ as a Windows differencing disk over the Windows VHD, and E:\ or F:\ as a role-image differencing disk over the role VHD]
Inside a Role VM
[Diagram: the guest agent launches the role host, which calls the role entry point; the VM carries the resource, OS, and role volumes]
Tip: Keep It Small
Role files get copied up to four times in a deployment (portal, then RDFE, then FC, then server)
Instead, put artifacts in blob storage (a sketch follows below)
Break them into small pieces
Pull them on demand from your roles
[Diagram: the core package makes four hops (portal, RDFE, FC, server), while data and auxiliary files go straight from blob storage to the role]
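A minimal sketch of the pull-from-blob-storage pattern using the 2012-era StorageClient library; the setting name "DataConnectionString", local resource "Scratch", container "artifacts", and blob "tools.zip" are all hypothetical:

    using System.IO;
    using Microsoft.WindowsAzure;
    using Microsoft.WindowsAzure.ServiceRuntime;
    using Microsoft.WindowsAzure.StorageClient;

    static class ArtifactFetcher
    {
        // Download a large auxiliary artifact at start-up instead of
        // shipping it inside the service package.
        public static void FetchArtifacts()
        {
            var account = CloudStorageAccount.Parse(
                RoleEnvironment.GetConfigurationSettingValue("DataConnectionString"));
            var blob = account.CreateCloudBlobClient()
                .GetContainerReference("artifacts")
                .GetBlobReference("tools.zip");

            // Land it on the resource disk, which is sized for scratch data.
            string root = RoleEnvironment.GetLocalResource("Scratch").RootPath;
            blob.DownloadToFile(Path.Combine(root, "tools.zip"));
        }
    }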
Inside IaaS VMs
Virtual Machine (IaaS) Operation
No standard cached images for IaaS
The OS is faulted in from blob storage during boot
Sysprep /specialize on first boot
Default cache policy:
OS disk: read+write cache
Data disks: no cache
[Diagram: inside the VM, the virtual disk driver layers a local RAM cache and a local on-disk cache over the disk blob]
IaaS Role Instance VHDs
[Diagram: the role VM mounts C:\ as the OS disk, backed by a blob through the RAM and local disk caches; D:\ as the dynamic resource VHD; and E:\, F:\, etc. as data disks backed directly by blobs]
Tip: Optimize Disk Performance
Each IaaS disk type has different performance characteristics by default:
OS disk: local read+write cache, optimized for small-working-set I/O
Temporary disk: local disk spindles that can be shared
Data disk: great at random writes and large working sets
Striped data disks: even better
Unless it’s small, put your application’s data (e.g. a SQL database) on striped data disks
Updating Services and the Host OS
In-Place Update
Purpose: ensure the service stays up while service updates and Windows Azure OS updates roll out
The system considers update domains when upgrading a service
1/(number of update domains) = the fraction of the service that will be offline at a time
The default is 5 and the max is 20; override with the upgradeDomainCount service definition property
The Windows Azure SLA is based on at least two update domains and two role instances in each role (an instance can query its own update domain, as sketched below)
[Diagram: front-end and middle-tier instances are split across update domains 1–3; each domain is taken offline and updated in turn]
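For what it's worth, an instance can discover its own update domain (and fault domain) through the ServiceRuntime API, which is handy for logging during a rolling upgrade. A small illustrative snippet:

    using System;
    using Microsoft.WindowsAzure.ServiceRuntime;

    class DomainInfo
    {
        // During a rolling upgrade, exactly one update domain's worth of
        // instances is offline at a time; log where this instance landed.
        static void Report()
        {
            RoleInstance me = RoleEnvironment.CurrentRoleInstance;
            Console.WriteLine("UD={0} FD={1}", me.UpdateDomain, me.FaultDomain);
        }
    }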
Tip: Config Updates vs. Code Updates
Code updates:
Deploy a new role image
Create a new VHD
Shut down the old code and start the new code
Config updates:
Notification sent to the role via RoleEnvironmentChanging
Graceful role shutdown/restart if there is no response
For fast updates:
Deploy settings as configuration
Respond to configuration updates (a sketch follows below)
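A minimal sketch of responding to configuration updates in a .NET role: hooking the RoleEnvironment.Changing event (typically in OnStart) lets a settings-only change apply in place instead of recycling the instance.

    using System.Linq;
    using Microsoft.WindowsAzure.ServiceRuntime;

    static class ConfigHandling
    {
        // If e.Cancel stays false, the new settings are applied in place
        // and RoleEnvironment.Changed fires afterwards; setting
        // e.Cancel = true requests a graceful restart instead.
        public static void Hook()
        {
            RoleEnvironment.Changing += (sender, e) =>
            {
                bool settingsOnly = e.Changes
                    .All(c => c is RoleEnvironmentConfigurationSettingChange);
                e.Cancel = !settingsOnly; // restart only for topology changes
            };
        }
    }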
Maintaining Service Health
Node and Role Health Maintenance
FC maintains service availability by monitoring software and hardware health
Based primarily on heartbeats
Automatically “heals” affected roles/VMs

Problem: role instance crashes. Detection: FC guest agent monitors role termination. Response: FC restarts the role.
Problem: guest VM or agent crashes. Detection: FC host agent notices missing guest-agent heartbeats. Response: FC restarts the VM and hosted role.
Problem: host OS or agent crashes. Detection: FC notices missing host-agent heartbeats. Response: tries to recover the node; FC reallocates roles to other nodes.
Problem: detected node hardware issue. Detection: host agent informs FC. Response: FC migrates roles to other nodes and marks the node “out for repair”.
Guest Agent and Role Instance Heartbeats and Timeouts
[Timeline diagram, values between the guest agent and role instance:]
Guest agent connect timeout: 25 min
Guest agent heartbeat: every 5 s; heartbeat timeout: 10 min
Role instance launch: indefinite
Role instance start-to-ready (for updates only): 15 min
Role instance heartbeat: every 15 s; “unresponsive” timeout: 30 s
Load balancer heartbeat: every 15 s; timeout: 30 s
Fault Domains and Availability Sets
Avoid single points of physical failure
Unit of failure based on the datacenter topology
E.g. a top-of-rack switch on a rack of machines
Windows Azure considers fault domains when allocating service roles
At least 2 fault domains per service
Will try to spread roles out across more
Availability SLA: 99.95%
[Diagram: front-end and middle-tier instances spread across fault domains 1–3]
Moving a Role Instance (Service Healing)
Moving a role instance is similar to a service update
On the source node:
Role instances stopped
VMs stopped
Node reprovisioned
On the destination node:
Same steps as initial role instance deployment
Warning: the resource VHD is not moved
This includes Persistent VM roles
Service Healing
[Diagram: the same www.mycloudapp.net deployment after healing, with Role A V2, a VM role (front end), count 3, update domains 3, size Large, and Role B, a worker role, count 2, update domains 2, size Medium; one front-end instance is recreated on a new node with DIP 10.100.0.191 and the load balancer is reprogrammed alongside 10.100.0.36, 10.100.0.122, and 10.100.0.185]
Allocation Constraints
Initiated by the Windows Azure team
Typically no more than once per month
Goal: update all machines as quickly as possible
Constraint: honor UDs
Allocation algorithm: prefer nodes hosting the same UD as the role instance’s UD
[Diagram: two example allocations of Service A and Service B role instances across nodes; aligning each node’s instances on the same UD lets a host OS update proceed one UD at a time]
Tip: Three is Better than Two
Your availability is reduced when:
You are updating a role instance’s code
An instance is being service-healed
The host OS is being serviced
The guest OS is being serviced
To avoid a complete outage when two of these are concurrent, deploy at least three instances
[Diagram: three front-end and three middle-tier instances spread across fault domains 1–3]
The Leap Day Outage: Cause and Lessons Learned
Tying it all Together: Leap Day
Outage on February 29 caused by this line of code:

    expiredate.year = currentdate.year + 1;
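Why that line fails on leap day: adding 1 to the year of February 29, 2012 asks for February 29, 2013, a date that does not exist. A small C# illustration of the bug and the safe pattern (the reconstruction is mine; the original code ran in the guest agent's certificate-creation path):

    using System;

    class LeapDayBug
    {
        static void Main()
        {
            DateTime current = new DateTime(2012, 2, 29);

            // Buggy pattern: bumping only the year asks for 2013-02-29,
            // which does not exist, so the constructor throws.
            // DateTime expire = new DateTime(current.Year + 1, current.Month, current.Day);

            // Safe pattern: AddYears clamps February 29 to February 28.
            DateTime expire = current.AddYears(1);
            Console.WriteLine(expire.ToString("yyyy-MM-dd")); // 2013-02-28
        }
    }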
The problem and its resolution highlight:
Network operations and monitoring
The DevOps “on call” model
Cluster fault isolation
Lessons we learned
Windows Azure Network Operations Center
On-Call
All developers take turns at third-tier support for live-site operations
[Table: a sample week’s on-call rotation (January 13–19, 2012) listing the primary, secondary, and two backup engineers for each 24-hour shift running 11:00 AM to 10:59 AM]
Leap Day Outage Timeline
Initiating event (2/28/2012 16:00 PST): leap year bug begins
Detection (2/28 17:15): 3 x 25-minute retries for the first batch hit; nodes start going to HI (Human Investigate), a cascading failure
Phase 1 (2/28 16:00 to 2/29 05:23): new deployments fail initially, then service management is marked offline globally to protect clusters
Phase 2 (2/29 02:57 to 2/29 23:00): service management offline for 7 clusters (staggered recovery)
Phase 1: Starting a Healthy VM
[Diagram: the host agent, running on the host OS over the hypervisor, starts an application VM; the guest agent creates a “transport cert” with a public/private key pair to secure its channel to the host agent]
Phase 1: The Leap Day Bug
[Diagram: the guest agent in each new application VM fails; the host agent retries after 25 minutes, and gives up after 3 attempts]
All new VMs fail to start (service management)
Existing healthy VMs continue to run (until migrated)
Phase 1: Cascading Impact
Leap day starts…
Deploying an infrastructure update or a customer VM, or a “normal” hardware failure: the failing VMs cause nodes to fail
Normal “service healing” migrates the VMs
The cascade is viral…
Cascade protection threshold hit (60 nodes): all healing and infrastructure deployment stop!
Phase 1: Tenant Availability
[Diagram: Customer 1 suffers complete availability loss, Customer 2 partial capacity loss, Customer 3 no availability loss]
Overview of Phase 1
Service management started failing immediately in all regions
New VM creation, infrastructure deployments, and standard hardware recovery created a viral cascade
The service-healing threshold tripped, leaving customers in different states of availability and capacity
Service management was deliberately deactivated everywhere
Recovery
Build and deploy a hotfix to the GA and the HA
Clusters were in two different states:
Fully (or mostly) updated clusters (119 GA, 119 HA, 119 OS, …)
Mostly non-updated clusters (118 GA, 118 HA, 118 OS, …)
For updated clusters, we pushed the fix on the new version
For non-updated clusters, we reverted back and pushed the fix on the old version
Fixing the updated clusters…
[Diagram: nodes running the 119 OS with GA v1, HA v1, and the 119 networking plugin are rolled forward to the fixed 119 package with GA v2 and HA v2]
Attempted fix for partially updated clusters… Phase 2 begins
[Diagram: nodes mixing 118 and 119 components receive a fixed 118 package (HA v2), but some end up with the 119 networking plugin and 119 OS alongside 118 agents, an inconsistent state]
Overview of Phase 2
Most clusters were repaired completely in Phase 1
7 clusters were moved into an inconsistent state (119 plugin/config with 118 agent)
Machines moved into a completely disconnected state
Recovery of Phase 2, Step 1
[Diagram: nodes on the 118 OS with 118 HA v2 and the 119 networking plugin, hosting VMs on 118 GA v1, are pushed the fixed 119 package (119 OS, 119 HA v2, 119 networking plugin)]
Phase 2: Recovery Step 1
On the seven remaining clusters, we forced an update to 119 (119 GA, 119 HA, 119 OS, …)
This resulted in cluster-wide reboots, because the OS needed to be updated
Because the VM GAs were mostly unpatched, most machines quickly moved into “Human Investigate”
This required additional effort
Recovery of Phase 2, Step 2: Automated Update Script
[Diagram: hosts are now on the fixed 119 stack, but the VMs still run 118 GA v1; the automatic update to the fixed 119 GA cannot proceed because not enough instances are healthy]
Recovery of Phase 2, Step 2: Manual Update Script
[Diagram: a manual update script moves each VM from 118 GA v1 to the fixed 119 GA v2; the manual update succeeds, but takes a long time]
Major Learning
Time can be a single point of failure
Cascading failures can arise as a side effect of recovery
Partitioning and built-in brakes contained the failure
Need a “safe mode” for all services, e.g. read-only
Recovery should be done through the normal path
People need sleep
Customers must know what is going on
Conclusion
Platform as a Service is all about reducing management and operations overhead
The Windows Azure Fabric Controller is the foundation for Windows Azure compute:
Provisions machines
Deploys services
Configures hardware for services
Monitors service and hardware health
The Fabric Controller continues to evolve and improve
Track Resources
meetwindowsazure.com
@WindowsAzure
@teched_europe
DOWNLOAD Windows Azure: windowsazure.com/teched
Hands-On Labs
Resources
Connect. Share. Discuss. http://europe.msteched.com
Learning: Microsoft Certification & Training Resources, www.microsoft.com/learning
TechNet: Resources for IT Professionals, http://microsoft.com/technet
MSDN: Resources for Developers, http://microsoft.com/msdn

Evaluations
http://europe.msteched.com/sessions
Submit your evals online
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.