

Presentation Transcript

Slide1

Windows Azure Internals

Mark Russinovich, Technical Fellow, Windows Azure

AZR302

Slide2

Agenda

Windows Azure Datacenter Architecture

Deploying Services

Inside IaaS VMs

Maintaining Service Health

The Leap Day Outage and Lessons Learned

Slide3

Windows Azure Datacenter Architecture

Slide4

Windows Azure Datacenters

Windows Azure currently has 8 regions

At least two per geo-political region

100,000s of servers

Building out many more

Slide5

The Fabric Controller (FC)

The “kernel” of the cloud operating system

Manages datacenter hardware

Manages Windows Azure services

Four main responsibilities:

Datacenter resource allocation

Datacenter resource provisioning

Service lifecycle management

Service health management

Inputs:

Description of the hardware and network resources it will control

Service model and binaries for cloud applications

[Diagram: the analogy - the Windows kernel manages a server and its processes (e.g., Word, SQL Server), while the Fabric Controller manages a datacenter and its services (e.g., Exchange Online, SQL Azure)]

Slide6

Datacenter Clusters

Datacenters are divided into “clusters”

Approximately 1,000 rack-mounted servers (we call them “nodes”)

Provides a unit of fault isolation

Each cluster is managed by a Fabric Controller (FC)

FC is responsible for:

Blade provisioning

Blade management

Service deployment and lifecycle

[Diagram: the datacenter network connecting Cluster 1, Cluster 2, ... Cluster n, each managed by its own FC]

Slide7

Inside a Cluster

FC is a distributed, stateful application running on nodes (servers) spread across fault domains

Top blades are reserved for FC

One FC instance is the primary; all others keep their view of the world in sync

Supports rolling upgrade, and services continue to run even if the FC fails entirely

[Diagram: racks of nodes, each with a top-of-rack (TOR) switch connected to a spine; the FC instances (FC1-FC5) occupy the reserved top blades, spread across racks]

Slide8

Datacenter Network Architecture

DLA Architecture (Old)

Quantum10 Architecture (New)

[Diagram: the old DLA architecture - DC routers, access routers, aggregation switches with load balancers (AGG + LB), and 20-rack groups of 40-node racks (each with a TOR, Digi, and APC unit) - versus the new Quantum10 architecture, a flatter mesh of TORs, spine switches, border leaves (BL), and DC routers (DCR); the diagram labels bandwidths of 120 Gbps and 30,000 Gbps]

Slide9

Tip: Load Balancer Overhead

Going through the load balancer adds about 0.5 ms of latency

When possible, connect to systems via their DIP (dynamic IP address)

Instances in the same Cloud Service can access each other by DIP

You can use Virtual Network to make the DIPs of different cloud services visible to each other
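A rough sketch of the tip, assuming a .NET role using the ServiceRuntime library: the other instances of the cloud service and their internal endpoints (DIP and port) are exposed to the role, so it can talk to them directly instead of looping back through the VIP. The role name "MiddleTier" and endpoint name "InternalHttp" are illustrative placeholders, not names from the deck.

    using System.Net;
    using System.Net.Sockets;
    using Microsoft.WindowsAzure.ServiceRuntime;

    static class DipConnector
    {
        // Connect to each instance of another role in the same cloud service over
        // its internal endpoint (DIP:port), bypassing the load balancer entirely.
        public static void TouchMiddleTier()
        {
            foreach (RoleInstance instance in RoleEnvironment.Roles["MiddleTier"].Instances)
            {
                IPEndPoint endpoint = instance.InstanceEndpoints["InternalHttp"].IPEndpoint;

                using (var client = new TcpClient())
                {
                    client.Connect(endpoint);   // direct instance-to-instance traffic
                }
            }
        }
    }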

[Diagram: clients reach the service through the load balancer at VIP 65.123.44.22 (~0.5 ms added); Instance 0 (DIP 10.2.3.4) and Instance 1 (DIP 10.2.3.5) can reach each other directly by DIP]

Slide10

Deploying Services

Slide11

Provisioning a Node

Power on node

PXE-boot the Maintenance OS

Agent formats the disk and downloads the Host OS via Windows Deployment Services (WDS)

Host OS boots, runs Sysprep /specialize, reboots

FC connects with the “Host Agent”

[Diagram: the Fabric Controller's image repository (role images, Maintenance OS, parent OS) feeds the PXE server and Windows Deployment Server, which provision the node; the node ends up running the Windows Azure OS with the Windows Azure hypervisor and the FC Host Agent]

Slide12

Deploying a Service to the Cloud: The 10,000-Foot View

Package upload to portal

System Center App Controller provides IT Pro upload experience

PowerShell provides a scripting interface

Windows Azure portal provides developer upload experience

Service package passed to RDFE

RDFE sends the service to a Fabric Controller (FC) based on target region and affinity group

FC stores image in repository and deploys service
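The REST APIs in the diagram below are the Service Management API that RDFE fronts. A minimal, hedged sketch of calling it (the subscription ID, service name, and slot are placeholders, and the request body is omitted; a real request carries a CreateDeployment XML document):

    using System.Net;
    using System.Security.Cryptography.X509Certificates;

    static class ServiceManagementClient
    {
        // Sketch of a certificate-authenticated Service Management REST call.
        public static HttpWebResponse CreateDeployment(X509Certificate2 managementCert)
        {
            // Placeholders: substitute the subscription ID, hosted service name, and slot.
            string uri = "https://management.core.windows.net/" +
                         "11111111-2222-3333-4444-555555555555" +
                         "/services/hostedservices/mycloudapp/deploymentslots/production";

            var request = (HttpWebRequest)WebRequest.Create(uri);
            request.Method = "POST";
            request.ClientCertificates.Add(managementCert);    // management certificate auth
            request.Headers.Add("x-ms-version", "2012-03-01"); // API version header
            request.ContentType = "application/xml";

            // A real call writes the CreateDeployment XML body (deployment name, package
            // blob URL, base64 label and configuration) to the request stream; elided here.
            request.ContentLength = 0;

            return (HttpWebResponse)request.GetResponse();
        }
    }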

[Diagram: the Windows Azure portal, System Center App Controller, and PowerShell all drive the Service Management REST APIs exposed by RDFE; RDFE hands the service package to a Fabric Controller in the target region (e.g., the US-North Central datacenter), which deploys the service]

Slide13

RDFE

RDFE serves as the front end for all Windows Azure services

Subscription management

Billing

User access

Service management

RDFE is responsible for picking clusters to deploy services and storage accounts

First datacenter region

Then affinity group or cluster load

Normalized VIP and core utilization: A(h, g) = C(h, g) / …

Slide14

FC Service Deployment Steps

Process service model files

Determine resource requirements

Create role images

Allocate compute and network resources

Prepare nodes

Place role images on nodes

Create virtual machines

Start virtual machines and roles

Configure networking

Dynamic IP addresses (DIPs) assigned to blades

Virtual IP addresses (VIPs) + ports allocated and mapped to sets of DIPs

Configure packet filter for VM to VM traffic

Program load balancers to allow traffic

Slide15

Service Resource Allocation

Goal: allocate service components to available resources while satisfying all hard constraints

HW requirements: CPU, memory, storage, network

Fault domains

Secondary goal: Satisfy soft constraints

Prefer allocations which will simplify servicing the host OS/hypervisor

Optimize network proximity: pack nodes

Service allocation produces the goal state for the resources assigned to the service components

Node and VM configuration (OS, hosting environment)

Images and configuration files to deploy

Processes to start

Assign and configure network resources such as LBs and VIPs

Slide16

Deploying a Service

[Diagram: an example deployment - Role A, a web role (front end), count 3, 3 update domains, size Large; Role B, a worker role, count 2, 2 update domains, size Medium; the load balancer maps www.mycloudapp.net to the Role A instances at DIPs 10.100.0.36, 10.100.0.122, and 10.100.0.185]

Slide17

Deploying a Role Instance

FC pushes role files and configuration information to the target node's host agent

Host agent creates VHDs

Host agent creates the VM, attaches the VHDs, and starts the VM

Guest agent starts the role host, which calls the role entry point

Starts a health heartbeat to, and gets commands from, the host agent

Load balancer only routes to an external endpoint once the instance responds to a simple HTTP GET (the LB probe)
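A minimal sketch of the role-entry-point side of that flow, assuming a .NET worker role built on the ServiceRuntime library (the class and the sleep interval are illustrative):

    using System.Threading;
    using Microsoft.WindowsAzure.ServiceRuntime;

    public class WorkerRole : RoleEntryPoint
    {
        public override bool OnStart()
        {
            // Called by the role host that the guest agent launched.
            // Returning true lets the instance move toward the Ready state.
            return base.OnStart();
        }

        public override void Run()
        {
            // The role's long-running work; meanwhile the guest agent keeps
            // heart-beating the instance's health back to the FC host agent.
            while (true)
            {
                Thread.Sleep(10000);
            }
        }

        public override void OnStop()
        {
            // Invoked when the instance is shut down (update, service healing, etc.).
            base.OnStop();
        }
    }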

Slide18

Inside a Deployed Node

[Diagram: a physical node runs the FC Host Agent in the host partition and several guest partitions, each with a guest agent and a role instance; the Fabric Controller (one primary plus replicas) and its image repository (OS VHDs, role ZIP files) sit outside the node, across a trust boundary from the guests]

Slide19

PaaS Role Instance VHDs

Differencing VHD for the OS image (D:\)

Host agent injects the FC guest agent into the VHD for Web/Worker roles

Resource VHD for temporary files (C:\)

Role VHD for role files (first available drive letter, e.g. E:\ or F:\)

[Diagram: the role VM mounts C:\ (resource disk, dynamic VHD), D:\ (Windows differencing disk backed by the Windows VHD), and E:\ or F:\ (role image differencing disk backed by the role VHD)]

Slide20

Inside a Role VM

[Diagram: inside the role VM, the guest agent launches the role host, which calls the role entry point; the VM sees the OS volume, the resource volume, and the role volume]

Slide21

Tip: Keep It Small

Role files get copied up to four times in a deployment

Instead, put artifacts in blob storage

Break them into small pieces

Pull them on-demand from your roles
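A minimal sketch of that pattern, assuming the 1.x Microsoft.WindowsAzure.StorageClient library; the "StorageConnectionString" setting, "artifacts" container, "tools.zip" blob, and "Scratch" local resource are placeholder names:

    using System.IO;
    using Microsoft.WindowsAzure;
    using Microsoft.WindowsAzure.ServiceRuntime;
    using Microsoft.WindowsAzure.StorageClient;

    static class ArtifactLoader
    {
        // Pull a large auxiliary file from blob storage on demand instead of
        // shipping it inside the service package.
        public static string Download()
        {
            var account = CloudStorageAccount.Parse(
                RoleEnvironment.GetConfigurationSettingValue("StorageConnectionString"));

            var blob = account.CreateCloudBlobClient()
                              .GetContainerReference("artifacts")
                              .GetBlockBlobReference("tools.zip");

            // Land it on the instance's local (resource) disk.
            string localPath = Path.Combine(
                RoleEnvironment.GetLocalResource("Scratch").RootPath, "tools.zip");

            blob.DownloadToFile(localPath);
            return localPath;
        }
    }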

[Diagram: the core package is copied up to four times on its way from the portal through RDFE and the FC to the server, while data and auxiliary files are pulled from blob storage by the role on demand]

Slide22

Inside IaaS VMs

Slide23

Virtual Machine (IaaS) Operation

No standard cached images for IaaS

OS is faulted in from blob storage during boot

Sysprep /specialize on first boot

Default cache policy:

OS disk: read+write cache

Data disks: no cache

[Diagram: inside the VM, the virtual disk driver is backed by a local RAM cache and a local on-disk cache on the node, with the disk blob in storage behind them]

Slide24

IaaS Role Instance VHDs

[Diagram: an IaaS role VM with C:\ (OS disk), D:\ (resource disk, dynamic VHD), and E:\, F:\, etc. (data disks); the OS disk goes through the RAM cache and local disk cache to its blob, while data disks map directly to blobs]

Slide25

Tip: Optimize Disk Performance

Each IaaS disk type has different performance characteristics by default

OS: local read+write cache optimized for small working set I/O

Temporary disk: local disk spindles that can be shared

Data disk: great at random writes and large working sets

Striped data disk: even better

Unless it's small, put your application's data (e.g. a SQL database) on striped data disks

Slide26

Updating Services and the Host OS

Slide27

In-Place Update

Purpose: ensure the service stays up during service updates and Windows Azure OS updates

System considers update domains when upgrading a service

1 / (number of update domains) = the fraction of the service that will be offline at a time

Default is 5 and max is 20; override with the upgradeDomainCount service definition property

The Windows Azure SLA is based on at least two update domains and two role instances in each role
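For example, with the default of five update domains at most 20% of a role's instances are offline at any moment during an in-place update. A role can see which update domain (and fault domain) its instance landed in; a minimal ServiceRuntime sketch:

    using Microsoft.WindowsAzure.ServiceRuntime;

    static class PlacementInfo
    {
        // Report where the fabric placed this instance.
        public static string Describe()
        {
            RoleInstance me = RoleEnvironment.CurrentRoleInstance;
            return string.Format("Instance {0}: update domain {1}, fault domain {2}",
                                 me.Id, me.UpdateDomain, me.FaultDomain);
        }
    }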

[Diagram: Front-End-1 and Middle Tier-1 in Update Domain 1, Front-End-2 and Middle Tier-2 in Update Domain 2, Middle Tier-3 in Update Domain 3; the update walks one update domain at a time]

Slide28

Tip: Config Updates vs Code Updates

Code updates:

Deploys new role image

Creates new VHD

Shut down old code and start new code

Config updates:

Notification sent to the role via RoleEnvironmentChanging

Graceful role shutdown/restart if no response
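A minimal sketch of handling that notification in a .NET role, assuming the ServiceRuntime library (which changes a role accepts without recycling is up to the role):

    using System.Linq;
    using Microsoft.WindowsAzure.ServiceRuntime;

    static class ConfigChangeHandler
    {
        // Call once, e.g. from RoleEntryPoint.OnStart.
        public static void Register()
        {
            RoleEnvironment.Changing += (sender, e) =>
            {
                // If only configuration settings changed, apply them in place...
                if (e.Changes.All(c => c is RoleEnvironmentConfigurationSettingChange))
                {
                    e.Cancel = false;   // keep running; RoleEnvironment.Changed fires next
                }
                else
                {
                    e.Cancel = true;    // anything else: ask for a graceful recycle
                }
            };
        }
    }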

For fast updates:

Deploy settings as configuration

Respond to configuration updates

Slide29

Maintaining Service Health

Slide30

Node and Role Health Maintenance

FC maintains service availability by monitoring the software and hardware health

Based primarily on heartbeats

Automatically “heals” affected roles/VMs

Problem → Fabric detection → Fabric response

Role instance crashes → FC guest agent monitors role termination → FC restarts role

Guest VM or agent crashes → FC host agent notices missing guest agent heartbeats → FC restarts VM and hosted role

Host OS or agent crashes → FC notices missing host agent heartbeat → Tries to recover node; FC reallocates roles to other nodes

Detected node hardware issue → Host agent informs FC → FC migrates roles to other nodes; marks node “out for repair”

Slide31

Guest Agent and Role Instance Heartbeats and Timeouts

Guest Agent Connect Timeout: 25 min

Guest Agent Heartbeat: 5 s

Guest Agent Heartbeat Timeout: 10 min

Role Instance Launch: indefinite

Role Instance Start → Ready (for updates only): 15 min

Role Instance Heartbeat: 15 s

Role Instance “Unresponsive” Timeout: 30 s

Load Balancer Heartbeat: 15 s

Load Balancer Timeout: 30 s

Slide32

Fault Domains and Availability Sets

Avoid single points of physical failure

Unit of failure based on data center topology

E.g. top-of-rack switch on a rack of machines

Windows Azure considers fault domains when allocating service roles

At least 2 fault domains per service

Will try to spread roles out across more

Availability SLA: 99.95%

[Diagram: Front-End-1 and Middle Tier-1 in Fault Domain 1, Front-End-2 and Middle Tier-2 in Fault Domain 2, Middle Tier-3 in Fault Domain 3]

Slide33

Moving a Role Instance (Service Healing)

Moving a role instance is similar to a service update

On the source node:

Role instances stopped

VMs stopped

Node reprovisioned

On destination node:

Same steps as initial role instance deployment

Warning: Resource VHD is not moved

Including for the Persistent VM Role

Slide34

Service Healing

[Diagram: the earlier example service - Role A (front end, count 3, size Large) and Role B (worker role, count 2, size Medium) behind the load balancer for www.mycloudapp.net at DIPs 10.100.0.36, 10.100.0.122, and 10.100.0.185; after healing, a replacement instance comes up on a new node with DIP 10.100.0.191 and the load balancer is reprogrammed]

Slide35

Allocation Constraints

Initiated by the Windows Azure team

Typically no more than once per month

Goal: update all machines as quickly as possible

Constraint: honor UDs

Allocation algorithm:

Prefer nodes hosting same UD as role instance’s UD
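A toy sketch of that preference, purely for illustration (this is not the Fabric Controller's actual allocator):

    using System.Collections.Generic;
    using System.Linq;

    static class UpdateDomainPreference
    {
        // Among candidate nodes, pick the one that already hosts the most
        // instances from the same update domain as the instance being placed.
        public static int PickNode(int instanceUd, IDictionary<int, List<int>> udsByNode)
        {
            return udsByNode
                .OrderByDescending(node => node.Value.Count(ud => ud == instanceUd))
                .First()
                .Key;
        }
    }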

[Diagram: two example allocations of Service A and Service B role instances across nodes, showing instances grouped so that each node hosts instances from the same update domain (UD 1 or UD 2)]

Slide36

Tip: Three is Better than Two

Your availability is reduced when:

You are updating a role instance's code

An instance is being service healed

The host OS is being serviced

The guest OS is being serviced

To avoid a complete outage when two of these are concurrent: deploy at least three instances

[Diagram: Front-End and Middle Tier instances spread across Fault Domains 1-3; with three instances per tier, losing any one still leaves two running]

Slide37

The Leap Day Outage: Cause and Lessons Learned

Slide38

Tying it all Together: Leap Day

Outage on February 29 caused by this line of code:

expiredate.year = currentdate.year + 1;
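The line above is the deck's pseudocode, not the actual agent source; a hedged C# illustration of the same failure mode and the calendar-aware alternative:

    using System;

    static class LeapDayBug
    {
        public static void Demonstrate()
        {
            DateTime current = new DateTime(2012, 2, 29);

            // Naive year arithmetic: 2013-02-29 does not exist, so constructing the
            // "one year later" date throws ArgumentOutOfRangeException.
            // DateTime broken = new DateTime(current.Year + 1, current.Month, current.Day);

            // Calendar-aware arithmetic clamps to the last valid day instead.
            DateTime expiry = current.AddYears(1);   // 2013-02-28
            Console.WriteLine(expiry);
        }
    }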

The problem and its resolution highlight:

Network Operations and monitoring

DevOps “on call” model

Cluster fault isolation

Lessons we learned

Slide39

Windows Azure Network Operations Center

Slide40

On-Call

All developers take turns at third-tier support for live-site operations

[Table: a sample week of the on-call rotation (January 13-19, 2012), listing the primary, secondary, and two backup engineers for each 24-hour shift running from 11:00 AM to 10:59 AM the next day]

Slide41

Leap Day Outage Timeline

Event → Date (PST) → Response and recovery timeline

Initiating event → 2/28/2012 16:00 → Leap year bug begins

Detection → 2/28 17:15 → 3 x 25-minute retries for the first batch hit; nodes start going to HI (cascading failure)

Phase 1 → 2/28 16:00 – 2/29 05:23 → New deployments fail initially and are then marked offline globally to protect clusters

Phase 2 → 2/29 02:57 – 2/29 23:00 → Service management offline for 7 clusters (staggered recovery)

Slide42

Phase 1: Starting a Healthy VM

[Diagram: on the host OS (hypervisor + host agent), an application VM starts; its guest agent creates a “transport cert” (public/private key pair) used to communicate with the host agent]

Slide43

Phase 1: The Leap Day Bug

[Diagram: on leap day the guest agent's transport-cert creation fails, so the VM does not start and is retried]

After 25 minutes… after 3 attempts… all new VMs fail to start (Service Management)

Existing healthy VMs continue to run (until migrated)

Slide44

Phase 1: Cascading Impact…

Leap day starts…

Deploying an infra update or a customer VM, or a “normal” hardware failure: the new VMs cause nodes to fail

Normal “service healing” migrates VMs, and the cascade is viral…

Cascade protection threshold hit (60 nodes): all healing and infra deployment stop!

Slide45

Phase 1: Tenant Availability


Customer 1: complete availability loss

Customer 2: partial capacity loss

Customer 3: no availability loss

Slide46

Overview of Phase 1

Service Management started failing immediately in all regions

New VM creation, infrastructure deployments, and standard hardware recovery created a viral cascade

Service healing threshold tripped, with customers in different states of availability and capacity

Service Management deliberately de-activated everywhere

Slide47

Recovery

Build and deploy a hotfix to the GA and the HA

Clusters were in two different states:

Fully (or mostly) updated clusters (119 GA, 119 HA, 119 OS…)

Mostly non-updated clusters (118 GA, 118 HA, 118 OS…)

For updated clusters, we pushed the fix on the new version

For non-updated clusters, we reverted back and pushed the fix on the old version

Slide48

Fixing the updated clusters…

[Diagram: nodes already on 119 components (119 OS, 119 HA v1, 119 networking plugin, 119 GA v1) receive the fixed 119 package (119 GA v2, 119 HA v2) while the VMs keep running]

Slide49

Attempted fix for partially updated clusters… Phase 2 begins

[Diagram: nodes on a mix of 118 and 119 components (118 GA v1, 118/119 HA, 118/119 networking plugin, 118/119 OS) receive the fixed 118 package, leaving some nodes with the 119 networking plugin alongside 118 agents]

Slide50

Overview of Phase 2

Most clusters were repaired completely in Phase 1

7 clusters were moved into an inconsistent state (119 plugin/config with 118 agent)

Machines moved into a completely disconnected state

Slide51

Recovery of Phase 2, Step 1

[Diagram: on the seven inconsistent clusters, nodes running 118 OS, 118 HA v2, and the 119 networking plugin are force-updated with the fixed 119 package (119 OS, 119 HA v2, 119 networking plugin), while the VMs still run the 118 GA v1]

Slide52

Phase 2: Recovery Step 1

On the seven remaining clusters, we forced an update to 119 (119 GA, 119 HA, 119 OS…)

This resulted in cluster-wide reboots, because the OS needed to be updated

Because the VM GAs were mostly unpatched, most machines moved quickly into “Human Investigate”

Required additional effort

Slide53

Recovery of Phase 2, Step 2 – Automated Update Script

[Diagram: nodes now on the fixed 119 OS, HA, and networking plugin, with VMs still running 118 GA v1, receive the fixed 119 GA v2; the automatic update cannot proceed because not enough instances are healthy…]

Slide54

Recovery of Phase 2, Step 2 – Manual Update Script

[Diagram: the same nodes, updated manually instance by instance to the fixed 119 GA v2; the manual update succeeds, but takes a long time…]

Slide55

Major Learning

Time can be a single point of failure

Cascading failures as a side-effect of recovery

Partitioning and built-in brakes contained the failure

Need a “safe mode” for all services, e.g., read-only

Recovery should be done through the normal path

People need sleep

Customers must know what is going on

Slide56

Slide57

Conclusion

Platform as a Service is all about reducing management and operations overhead

The Windows Azure Fabric Controller is the foundation of Windows Azure compute

Provisions machines

Deploys services

Configures hardware for services

Monitors service and hardware health

The Fabric Controller continues to evolve and improve

Slide58

Track Resources

Meetwindowsazure.com

@WindowsAzure

@teched_europe

DOWNLOAD Windows Azure

Windowsazure.com/teched

Hands-On Labs

Slide59

Resources

Connect. Share. Discuss.

http://europe.msteched.com

Learning

Microsoft Certification & Training Resources

www.microsoft.com/learning

TechNet

Resources for IT Professionals

http://microsoft.com/technet

Resources for Developers

http://microsoft.com/msdn

Slide60

Evaluations

http://europe.msteched.com/sessions

Submit your evals online

Slide61

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.

MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Slide62