Exchange Server 2013 - PowerPoint Presentation

tatiana-dople
Uploaded On 2018-01-18

Presentation Transcript

Slide1
Slide2

Exchange Server 2013: High Availability | Site Resilience

Scott Schnoll

OFC-B318

Slide3

New in Exchange Server 2013

Exchange can automatically recover from:
Disk failures
Network failures
Server failures
Datacenter failures

Failover time decreased by 50% over Exchange 2010

58% faster reseeds when using multiple databases per volume

Slide4
Slide5

What have we been doing since RTM?

620+ check-ins

DAGs without cluster admin access points

DAG Management service

Loose Truncation

New Monitors

Max Preferred Actives

Database Activation Suspended and Move Now

Slide6

Responding to Failures

Slide7

What should we do about failures?

Find and fix the root cause code

Recover the client experience

Repair the symptom

Remove complexity

Slide8

DAGs without CAAPs

Easier deployment and management

Fewer things that can fail

Slide9

DAG Management service

Runs non-critical aspects of maintaining high availability:
Checking for sufficient redundancy and availability
Loose Truncation monitoring
Lagged copy manager

Isolates failure modes by separating log replication and HA decision-making from non-core functions

Slide10

Lagged Copy Management

Failed backups are worse than no backups

Lagged database copies will play forward beyond their configured value when:
A database has a bad page and needs a patch
There isn’t enough space to keep all the logs
There is a risk of losing all available copies of a database

Slide11

AutoReseed overview

Restoring redundancy, so you don’t have to

Configured by setting mount points for volumes

[Diagram: In-Use Storage and Spares, with one disk marked failed (X)]

Slide12

AutoReseed – why?

Forget about replacing disks as they fail

Probability you’ll need to replace more than monthly:

=(1-BINOM.DIST(spares + 1, disks per server, AFR/12, TRUE))*servers

Slide13
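The spreadsheet formula quoted for AutoReseed can also be evaluated outside Excel. The sketch below is a literal Python translation. It assumes that AFR is the annualized disk failure rate expressed as a fraction and that disk failures are independent. The function names and the sample fleet are illustrative and do not come from the presentation.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), like Excel's BINOM.DIST(k, n, p, TRUE)."""
    k = min(k, n)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def monthly_replacement_estimate(spares: int, disks_per_server: int,
                                 afr: float, servers: int) -> float:
    """Literal translation of the slide's formula."""
    p_month = afr / 12  # per-disk failure probability within one month
    # Chance that one server loses more than (spares + 1) disks in a month.
    per_server = 1 - binom_cdf(spares + 1, disks_per_server, p_month)
    return per_server * servers


if __name__ == "__main__":
    # Hypothetical fleet: 100 servers, 12 disks each, 2 spares, 5% AFR.
    print(monthly_replacement_estimate(2, 12, 0.05, 100))
```

Multiplying the per-server probability by the number of servers gives an expected count of affected servers per month, which approximates a fleet-wide probability only while the value is small.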

Recovering from storage failures

Exchange Server 2010:
ESE database hung IO (4 min)
Crimson channel heartbeat (30 s)
System disk heartbeat (2 min)

Exchange Server 2013:
System bad state (5 min)
Long I/O times (0.6 min)
MSExchangeRepl.exe memory threshold (4 GB)
Replication service won’t restart (65 min)
Store timeout (1 min)

Exchange Server 2013 SP1:
Cluster service repeated crashes (60 min)

Slide14

World’s largest Exchange deployment

26 locations worldwide

125,000+ databases at 99.98% availability

15 second average database failover time

Site switchovers/month: 100s planned, 10s unplanned

Slide15

Monitoring and Server Maintenance

Slide16

Managed Availability

Primary purpose: Detect service degradation affecting users

Attempt to recover from failure

If recovery fails – escalate to Exchange administrators

Slide17

Managed Availability

[Diagram: XYZ_Probe \ Resource1, XYZ_Probe \ Resource2, and XYZ_Probe \ Resource3 feed XYZ_Monitor; responders shown are XYZ_ResetAppPool, XYZ_Restart, XYZ_Restart, XYZ_Failover, XYZ_Reboot, and XYZ_Escalate]

Probe engine: data collection and notifications mechanism, feeding into the Monitor engine

Monitor engine: contains business logic to evaluate health of customer-impacting features

Responder engine: set of recovery actions that can be taken to recover degraded state of the monitored resource

Slide18
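The probe, monitor, and responder split described for Managed Availability can be pictured with a small model. The Python sketch below is illustrative only. It assumes a monitor that turns unhealthy when any probe result fails and then tries responders in order until one reports recovery. The real service sequences responders with timing rules that this transcript does not describe, and every name in the code is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Responder:
    name: str
    recover: Callable[[], bool]  # returns True when the action restored health


def evaluate(probe_results: Dict[str, bool], responders: List[Responder]) -> List[str]:
    """Monitor step: if any probe failed, run responders in order until one recovers."""
    if all(probe_results.values()):
        return []  # healthy: no responder runs
    invoked = []
    for responder in responders:
        invoked.append(responder.name)
        if responder.recover():
            break  # recovery succeeded, stop before the more drastic actions
    return invoked


if __name__ == "__main__":
    chain = [
        Responder("XYZ_Restart", lambda: False),
        Responder("XYZ_Failover", lambda: True),
        Responder("XYZ_Escalate", lambda: True),
    ]
    print(evaluate({"Resource1": True, "Resource2": False}, chain))
```

Ordering the list from least to most disruptive action mirrors the restart, failover, reboot, escalate progression named in the diagram.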

Managed Availability

Get-ServerHealth provides status of all monitors tracking a particular server

Get-HealthReport provides a rollup of health sets for a server or for a group of servers

Complete set of monitors, probes and responders can be found in Windows crimson log channel

Slide19

HA Managed Availability

HA uses MA to monitor data redundancy, cluster health, physical storage health and database logical corruption

HA Probes, Monitors and Responders are grouped into DataProtection and Clustering Health Sets

Slide20

HA monitors

ClusterEndpointMonitor

ClusterGroupMonitor

ClusterHangMonitor

ClusterNetworkMonitor

ClusterServiceCrashMonitor

ServerOneCopyMonitor

ServerOneCopyInternalMonitorMonitor

ServerWideOfflineMonitor

ServiceHealthActiveManagerCheckMonitor

ServiceHealthMSExchangeReplCrashMonitor

ServiceHealthMSExchangeReplEndpointMonitor

DatabaseHealthLogGenerationRateMonitor

DatabaseHealthUnMonitoredDatabaseMonitor

DatabaseHealthCircLoggingMonitor

DatabaseHealthDbCopyFailedAndSuspendedMonitor

DatabaseHealthDbCopyStalledMonitor

DatabaseHealthDbCopySuspendedMonitor

DatabaseHealthLogCopyQueueMonitor

DatabaseHealthLogReplayQueueMonitor

EseDbTimeTooNewMonitor

EseDbTimeTooOldMonitor

EseInconsistentDataMonitor

EseLostFlushMonitor

StorageDbIoHardFailureItemMonitor

LowLogVolumeSpaceMonitor

Slide21

ServerOneCopyMonitor

ServerOneCopyMonitor: HA’s most important redundancy protection

Once a minute each database on a server is checked:

Copy is (Healthy || Mounted) &&
ServerComponentState is NOT Offline &&
Copy is NOT Activation Blocked &&
Server is NOT exceeding MaxActive &&
Copy Queue Length < MountDial &&
Server is NOT Activation Disabled

Slide22
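The ServerOneCopyMonitor condition list reads as a single boolean predicate. A minimal Python sketch of that predicate follows. The field names are hypothetical stand-ins for whatever the monitor actually inspects, and the sketch only restates the six conditions as written.

```python
from dataclasses import dataclass


@dataclass
class CopyStatus:
    copy_state: str                # e.g. "Healthy", "Mounted", "Failed"
    component_state_offline: bool  # ServerComponentState is Offline
    activation_blocked: bool       # copy is Activation Blocked
    exceeding_max_active: bool     # server is exceeding MaxActive
    copy_queue_length: int
    mount_dial: int                # threshold the slide calls MountDial
    activation_disabled: bool      # server is Activation Disabled


def passes_one_copy_check(c: CopyStatus) -> bool:
    """True when the copy satisfies all six conditions listed on the slide."""
    return (
        c.copy_state in ("Healthy", "Mounted")
        and not c.component_state_offline
        and not c.activation_blocked
        and not c.exceeding_max_active
        and c.copy_queue_length < c.mount_dial
        and not c.activation_disabled
    )
```

Because the conditions are joined with &&, a single failing condition is enough for the once-a-minute check to count as a failure.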

HA Monitors – ServerOneCopyMonitor

30 consecutive failures are considered an Escalating condition

Immediately after that OneCopyMonitor is notified and becomes Unhealthy

[Diagram: consecutive failures 1, 2, 3 ... 30 while OneCopyMonitor is Healthy; after the 30th, a OneCopy notification marks OneCopyMonitor UNHEALTHY and OneCopyEscalate is shown]

Slide23
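The "30 consecutive failures" rule for ServerOneCopyMonitor amounts to a counter that resets on any passing check. The Python sketch below illustrates that reading. The class name is hypothetical, and resetting the counter on the first successful check is an assumption drawn from the word "consecutive".

```python
class ConsecutiveFailureTracker:
    """Counts back-to-back failed checks and flags the escalation threshold."""

    def __init__(self, threshold: int = 30):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return True once the threshold is reached."""
        if check_passed:
            self.consecutive_failures = 0  # any success breaks the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

With one check per minute, the threshold corresponds to roughly 30 minutes of uninterrupted failure before the monitor turns unhealthy.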

ServiceHealthMSExchangeReplEndpointMonitor

Three probes and five responders

Probes:
ReplEndpointProbe\RPC
ReplEndpointProbe\TCP
ReplEndpointProbe\ServerLocator

Monitor:
ReplEndpointMonitor

Responders:
RestartResponder
RestartResponder2
FailoverResponder
RebootResponder
EscalateResponder

Slide24

DAG Server Maintenance

In Exchange 2013, the story is a little bit more complicated than in Exchange 2010

Mailbox Server has multiple roles installed

To prevent outages, we need to make sure the server is not serving any client protocols

Slide25

Exchange 2013 server maintenance

Put server into maintenance mode:
Set Transport and UM to drain their queues
Set messaging redirection to (preferably) another server in the DAG
Suspend cluster node
Set server to be Activation Disabled
Set server to be Activation Blocked
Set all ServerComponentStates Offline

Confirm:
All ServerComponentStates are offline
Server is activation blocked and activation disabled
Cluster node is “Paused”
Transport queues are empty

Slide26

Best Copy and Server Selection

Slide27

Best Copy and Server Selection

What’s the same?
Still an Active Manager algorithm
Performed at *over time
Uses extracted system health
Same replication criteria and phases

What’s new?
Cap replay queue to limit mount time
New max actives soft limit
BCS criteria include protocol stack health
Protocol health prioritized to control impact
Tuned replication health criteria thresholds
MA failover responder targets not worse server

Slide28

Activation controls

Load management limits
Controls server max load

Server-level activation controls
Controls server usage

Database-level activation control
Prevent copy activation – questionable database copy?

Slide29

Load management limits

Maximum Preferred Actives
Optimized for load
Still allows mount
Example: 19

Designed optimum
Result of Redistribute-ActiveDatabases.ps1
Example: 14

Slide30

Load management limits

MaximumActiveDatabases
Hard limit for activation – i.e. worst case
Enforced by BCS
Dismount databases over limit
Control “exceptional failure” load
Set to most databases you want per server
Follow role requirements calculator guidance

MaximumPreferredActiveDatabases
Soft limit for activation – added in SP1
Copies deprioritized in BCS
Catalog and copy queue health
Failovers can exceed limit
Load balancing optimizes to this limit

Move-ActiveMailboxDatabase -SkipMaximumActiveDatabaseChecks skips both

Slide31
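The difference between the hard limit (MaximumActiveDatabases) and the soft limit (MaximumPreferredActiveDatabases) can be illustrated with a toy target-ordering function. The Python sketch below is a simplified model under stated assumptions, not Active Manager's actual algorithm. A server at its hard limit is excluded, a server at or over its soft limit is kept but moved behind the others, and a skip flag ignores both limits. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Target:
    name: str
    active_count: int
    max_active: Optional[int] = None     # hard limit; None means unlimited
    max_preferred: Optional[int] = None  # soft limit; None means no preference


def order_targets(targets: List[Target], skip_max_active_checks: bool = False) -> List[str]:
    """Return candidate server names in the order they would be tried."""
    if skip_max_active_checks:
        return [t.name for t in targets]  # both limits ignored

    # Hard limit: a server already at its limit cannot take another active copy.
    eligible = [t for t in targets
                if t.max_active is None or t.active_count < t.max_active]

    def over_soft_limit(t: Target) -> bool:
        return t.max_preferred is not None and t.active_count >= t.max_preferred

    # Soft limit: stable sort keeps original order, pushing over-limit servers last.
    return [t.name for t in sorted(eligible, key=over_soft_limit)]
```

The soft limit changes only the ordering, which matches the statement that failovers can exceed it while the hard limit cannot be exceeded.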

Best Copy and Protocol Health

Normal *over behavior
All health sets healthy
All medium priority health sets and above are healthy
All health sets on target are better than source
All health sets on target are the same as source
Server health not considered

MA failover behavior
Skip target if not better than source
All health sets healthy
All medium priority and above are healthy
All health sets better than source server

Slide32
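One way to read the protocol health criteria is as an ordered series of progressively looser checks, where the first check that some candidate satisfies decides the target. The Python sketch below models that reading, and several parts of it are assumptions. The transcript does not state the ordering explicitly. The numeric priority scale is invented. "Better than source" is approximated as the target's unhealthy health sets being a strict subset of the source's. The MA failover list simply reuses the first three checks.

```python
LOW, MEDIUM, HIGH = 1, 2, 3  # hypothetical priority scale


def all_healthy(source, target):
    return not target["unhealthy"]


def medium_and_above_healthy(source, target):
    # Only low-priority health sets may be unhealthy on the target.
    return all(priority < MEDIUM for priority in target["unhealthy"].values())


def better_than_source(source, target):
    # Assumption: "better" means a strict subset of the source's unhealthy sets.
    return set(target["unhealthy"]) < set(source["unhealthy"])


def same_as_source(source, target):
    return set(target["unhealthy"]) == set(source["unhealthy"])


def health_not_considered(source, target):
    return True


NORMAL_TIERS = [all_healthy, medium_and_above_healthy, better_than_source,
                same_as_source, health_not_considered]
MA_FAILOVER_TIERS = [all_healthy, medium_and_above_healthy, better_than_source]


def pick_target(source, candidates, tiers):
    """Return (server name, tier name) for the first tier any candidate passes."""
    for tier in tiers:
        for candidate in candidates:
            if tier(source, candidate):
                return candidate["name"], tier.__name__
    return None  # no acceptable target
```

Under this model a normal *over always finds a target through the last tier, while an MA-initiated failover returns nothing when no candidate is in better shape than the source.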

Site Resilience

Slide33

DatabaseDisabledAndMoveNow

New server setting to improve site resilience

Get all active databases off server – FAST!

Last resort to not move an active!

Proactively continues database move attempts

Server can still be in service

Databases mounted and mail delivery!

Slide34

Activation Block Comparison

Tool: Suspend-MailboxDatabaseCopy
Parameter: ActivationOnly
Value: N/A
Instance: Per database copy
Usage: Keep active off a working but questionable drive

Tool: Set-MailboxServer
Parameter: DatabaseCopyAutoActivationPolicy
Value: “Blocked” or “Unrestricted”
Instance: Per server
Usage: Used to control active/passive SR configurations and maintenance; can force admin move

Tool: Set-MailboxServer
Parameter: DatabaseCopyActivationDisabledAndMoveNow
Value: $true or $false
Instance: Per server
Usage: Used to do faster site failovers and maintain database availability; databases are not blocked from failing back; continuous move-off operation

Slide35

Dynamic Quorum and DAGs

Slide36

Dynamic Quorum

In Windows Server 2008 R2, quorum majority is fixed, based on the initial cluster configuration

In Windows Server 2012 (and later), cluster quorum majority is determined by the set of nodes that are active members of the cluster at a given time

This new feature is called Dynamic Quorum, and it is enabled for all clusters by default

Slide37

Dynamic Quorum

Cluster dynamically manages vote assignment to nodes, based on the state of each node

When a node shuts down or crashes, the node loses its quorum vote

When a node rejoins the cluster, it regains its quorum vote

By adjusting the assignment of quorum votes, the cluster can dynamically increase or decrease the number of quorum votes required to keep running

Slide38

Dynamic Quorum

By dynamically adjusting the quorum majority requirement, a cluster can sustain sequential node shutdowns to a single node

This is referred to as a “Last Man Standing” scenario

Slide39

Dynamic Quorum

Does not allow a cluster to sustain a simultaneous failure of a majority of voting members

To continue running, the cluster must always maintain quorum after a node shutdown or failure

If you manually remove a node’s vote, the cluster does not dynamically add the vote back

Slide40
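The Dynamic Quorum rules (votes recalculated after each membership change, sequential shutdowns survivable, simultaneous loss of a majority not survivable) can be explored with a toy simulation. The Python sketch below is a simplified model and not the Windows cluster algorithm. It has no witness. It assumes that when an even number of nodes is active, one node's dynamic vote is set to 0 so the vote total stays odd. The choice of the last node in the list as the one that gives up its vote is arbitrary.

```python
from typing import Dict, Iterable, List, Set, Tuple


def dynamic_weights(active: List[str]) -> Dict[str, int]:
    """One vote per active node; with an even count, zero one vote (assumption)."""
    weights = {node: 1 for node in active}
    if active and len(active) % 2 == 0:
        weights[active[-1]] = 0  # arbitrary choice of which node loses its vote
    return weights


def simulate(nodes: List[str], failure_batches: Iterable[Set[str]]) -> Tuple[bool, List[str]]:
    """Apply batches of simultaneous failures; return (cluster still up, active nodes)."""
    active = list(nodes)
    for batch in failure_batches:
        weights = dynamic_weights(active)
        total_votes = sum(weights.values())
        survivors = [n for n in active if n not in batch]
        surviving_votes = sum(weights[n] for n in survivors)
        if surviving_votes * 2 <= total_votes:
            return False, []  # survivors do not hold a majority: quorum lost
        active = survivors    # votes are recalculated before the next batch
    return True, active


if __name__ == "__main__":
    nodes = [f"n{i}" for i in range(1, 8)]
    print(simulate(nodes, [{n} for n in reversed(nodes[1:])]))  # one node at a time
    print(simulate(nodes, [set(nodes[:4])]))                    # four nodes at once
```

In this model the one-at-a-time sequence can run down to a single node, while losing four of seven nodes in one batch stops the cluster, which is the contrast the slides draw between sequential and simultaneous failures.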

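Dynamic Quorum

[Diagram: no nodes marked failed]

Majority of 7 required

Slide41

Dynamic Quorum

[Diagram: 3 nodes marked failed (X)]

Majority of 7 required, then majority of 4 required

Slide42

Dynamic Quorum

[Diagram: 4 nodes marked failed (X)]

Majority of 3 required

Slide43

Dynamic Quorum

[Diagram: 5 nodes marked failed (X)]

Majority of 2 required

Slide44

Dynamic Quorum

[Diagram: 5 nodes marked failed (X)]

Majority of 2 required

Slide45

Dynamic Quorum

[Diagram: 5 nodes marked failed (X); the two remaining nodes show votes 1 and 0]

Majority of 2 required

Slide46

Dynamic Quorum

[Diagram: 5 nodes marked failed (X); the two remaining nodes show votes 0 and 1]

Majority of 2 required

Slide47

Dynamic Quorum

[Diagram: 6 nodes marked failed (X); node votes shown as 0 and 1]

Majority of 2 required

Slide48

Dynamic Quorum

[Diagram: 7 nodes marked failed (X); node votes shown as 0 and 1]

Majority of 2 required

Slide49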

Dynamic Quorum

Use Get-ClusterNode to verify votes

0 = does not have quorum vote

1 = has quorum vote

Get-ClusterNode <Name> | ft name, *weight, state

Name DynamicWeight NodeWeight State
---- ------------- ---------- -----
EX1              1          1 Up

Slide50

Dynamic Quorum

Works with most DAGs

Third-party replication DAGs not tested

All internal testing has it enabled

Office 365 servers use it

Exchange is not dynamic quorum-aware

Does not change quorum requirements

Slide51

Dynamic Quorum

Cluster team guidance:
Generally increases the availability of the cluster
Enabled by default, strongly recommended to leave enabled
Allows the cluster to continue running in failure scenarios that are not possible when this option is disabled

Exchange team guidance:
Leave it enabled for majority of DAG members
In some cases where a Windows 2008 R2 DAG would have lost quorum, a Windows 2012 DAG can maintain quorum
Don’t factor it into availability plans

Slide52

Witness Server Placement

New Witness Server placement options available

Choose based on business needs and available options

Third location DAG witness server improves DAG recovery behaviors

Automatic recovery on datacenter loss

Third location network infrastructure must have independent failure modes

Deployment scenario: DAG(s) deployed in a single datacenter
Recommendation: Locate witness server in the same datacenter as DAG members; can share one server across DAGs

Deployment scenario: DAG(s) deployed across two datacenters; no additional locations available
Recommendation: Locate witness server in primary datacenter; can share one server across DAGs

Deployment scenario: DAG(s) deployed across two+ datacenters
Recommendation: Locate witness server in third location; can share one server across DAGs

Slide53

Dynamic Witness Scenarios

Witness Offline: Witness vote gets removed by the cluster

Witness Online: If necessary, Witness vote is added back by the cluster

Witness Failure: Witness vote gets removed by the cluster

Windows Server 2012 R2 and later

Slide54

Site Resilience

Frontend/Backend recovery are independent

Most protocol access in Exchange Server 2013 is HTTP

DNS resolves to multiple IP addresses

HTTP clients have built-in IP failover capabilities

Clients skip past IPs that produce hard TCP failures

Namespace no longer a single point of failure

Single or multiple namespace options

Admins can switchover by removing VIP from DNS or disabling

No dealing with DNS latency

Slide55

Site Resilience - CAS

[Diagram: primary datacenter Redmond and alternate datacenter Portland; CAS servers cas1, cas2, cas3, cas4; VIP 192.168.1.50 (marked failed, X) and VIP 10.0.1.50]

mail.contoso.com: 192.168.1.50, 10.0.1.50

Removing failing IP from DNS puts you in control of in service time of VIP

With multiple VIP endpoints sharing the same namespace, if one VIP fails, clients automatically fail over to alternate VIP and just work!

mail.contoso.com: 10.0.1.50

Slide56
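The client behavior described for the CAS namespace (skip addresses that produce hard TCP failures and use the next one) can be sketched in a few lines. The Python example below illustrates the general pattern and is not how any particular HTTP client implements it. The function name, port, and timeout are assumptions, and the `connector` parameter exists only so the logic can be exercised without a network.

```python
import socket
from typing import Callable, Iterable, Tuple


def connect_first_reachable(addresses: Iterable[str], port: int = 443,
                            timeout: float = 3.0,
                            connector: Callable = socket.create_connection
                            ) -> Tuple[str, object]:
    """Try each resolved address in order; return the first that accepts a connection."""
    last_error = None
    for ip in addresses:
        try:
            return ip, connector((ip, port), timeout=timeout)
        except OSError as exc:  # refused, unreachable, or timed out
            last_error = exc    # remember the failure and move to the next address
    raise ConnectionError("no address accepted a connection") from last_error
```

Called with the two addresses shown for mail.contoso.com, the function would return the second VIP once the first refuses connections, which is the "clients just work" case from the slide.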

Site Resilience - Mailbox

[Diagram: primary datacenter Redmond, alternate datacenter Portland, third datacenter Paris; DAG members mbx1, mbx2, mbx3, mbx4; witness; one failure mark (X)]

Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log file, automatic failover should occur

Slide57

Site Resilience - Mailbox

[Diagram: primary datacenter Redmond and alternate datacenter Portland; DAG members mbx1, mbx2, mbx3, mbx4; witness; three failure marks (X)]

Mark the failed servers/site as down:
Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond

Stop the Cluster service on remaining DAG members:
Stop-Service clussvc

Activate DAG members in 2nd datacenter:
Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland

Slide58

Site Resilience - Mailbox

[Diagram: primary datacenter Redmond and alternate datacenter Portland; DAG members mbx1, mbx2, mbx3, mbx4; witness and alternate witness; one failure mark (X)]

Mark the failed servers/site as down:
Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond

Stop the Cluster service on remaining DAG members:
Stop-Service clussvc

Activate DAG members in 2nd datacenter:
Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland

Slide59

Best Practices

Slide60

Best Practices

Automate your recovery logic; make it reliable

Think of it as rack/site maintenance

Exercise it regularly

Recovery times directly dependent on detection & decision times!

Flip the bit! Don’t ask repair times, “if outage go…”

Humans are the biggest threat to recovery times

Slide61

Solutions Advisory Board

Slide62

Solutions Advisory Board

Microsoft provides lab-tested, cross-product, end-to-end solutions

SAB members hear our solution ideas, and influence them by providing feedback

SAB Session

Presenters from Microsoft Azure, Office, Cloud and Datacenter, and Microsoft Consulting Services

SAB Table @ Ask the Experts

Tues 6:30 – 8:30pm

Ask the Experts

Meet the SAB team and ask us questions

Experts from Microsoft Azure, Office, Cloud and Datacenter teams

Hilton Americas, Room 335A

Wed 4:00 – 5:30pm

Slide63

Related content

OFC-B244 Microsoft Exchange Server 2013 SP1 Tips and Tricks

OFC-B248 Publishing Microsoft Exchange Server: Which TLA Should You Choose?

OFC-B321 Monitoring and Tuning Microsoft Exchange Server 2013 Performance

Slide64
Slide65

Resources

Learning: Microsoft Certification & Training Resources
www.microsoft.com/learning

MSDN: Resources for Developers
http://microsoft.com/msdn

TechNet: Resources for IT Professionals
http://microsoft.com/technet

Sessions on Demand
http://channel9.msdn.com/Events/TechEd

Slide66

Complete an evaluation and enter to win!

Slide67

Evaluate this session

Scan this QR code to evaluate this session.

Slide68

© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.