WHEA System Design And Implementation

Uploaded On 2016-04-07




Presentation Transcript

Slide1

WHEA System Design And Implementation

John Strange

Software Design Engineer

Microsoft Corporation

Slide2

Key Takeaways

Understand what it takes to implement a WHEA-enabled platform

Improve server reliability by implementing required WHEA features

Differentiate server products by extending WHEA capabilities

Slide3

Agenda

WHEA Overview

Description of a WHEA-enabled platform

Key requirements of WHEA-enabled platform

Dell WHEA system implementation

Dell demo of WHEA-enabled platform

Slide4

WHEA Objective

To make Windows-based platforms more resilient in the face of hardware errors

Better root cause analysis

Better support for hardware error recovery

Error avoidance with health monitoring

Slide5

WHEA Overview

Operating System Support (OS)

Windows is much more agile with respect to hardware error sources

Platform describes error sources to Windows

Standardized hardware error record format

UEFI 2.1 Common Platform Error Record
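To make the standardized record format concrete, here is a minimal C sketch of the fixed header that starts a Common Platform Error Record. The field order and sizes follow the record header defined in the UEFI specification's Common Platform Error Record appendix as best understood here; the struct, enum, and function names are hypothetical and should be checked against the specification before use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of the CPER record header (128 bytes, byte-packed). */
#pragma pack(push, 1)
struct cper_record_header {
    char     signature_start[4];    /* "CPER" */
    uint16_t revision;
    uint32_t signature_end;         /* 0xFFFFFFFF */
    uint16_t section_count;
    uint32_t error_severity;        /* see enum cper_severity */
    uint32_t validation_bits;
    uint32_t record_length;         /* whole record, header included */
    uint64_t timestamp;
    uint8_t  platform_id[16];       /* GUID */
    uint8_t  partition_id[16];      /* GUID */
    uint8_t  creator_id[16];        /* GUID */
    uint8_t  notification_type[16]; /* GUID */
    uint64_t record_id;
    uint32_t flags;
    uint64_t persistence_info;
    uint8_t  reserved[12];
};
#pragma pack(pop)

/* Severity encoding used in the record header. */
enum cper_severity {
    CPER_SEV_RECOVERABLE   = 0,
    CPER_SEV_FATAL         = 1,
    CPER_SEV_CORRECTED     = 2,
    CPER_SEV_INFORMATIONAL = 3
};

/* Hypothetical sanity check: signature markers present and the stated
 * record length at least covers the header itself. */
static int cper_header_looks_valid(const struct cper_record_header *h)
{
    return memcmp(h->signature_start, "CPER", 4) == 0 &&
           h->signature_end == 0xFFFFFFFFu &&
           h->record_length >= sizeof *h;
}
```

A consumer would typically run a check like `cper_header_looks_valid` before walking the section descriptors that follow the header.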

Hardware error recovery support

Hardware error events

Event Tracing For Windows (ETW)

Slide6

WHEA Overview

Platform integration

Platforms retain their existing investment in error handling features

Platforms control the level of integration with the OS

Platforms leverage existing error handling and reporting features

Slide7

WHEA Overview

Platform implementation

Satisfying Windows Server 2008 logo requirements

Error record persistence

Error injection

WHEA _OSC method

BOOT Error Source (X86/X64 platforms only)

Extending WHEA feature set to add value

Add richer error data content (e.g., FRU info)

Participate in error recovery

Slide8

WHEA Components

[Diagram: WHEA components, color-coded by provider]

Provided by: Microsoft, ISV, OEM, IHV

Layers shown: Kernel, User

Components shown: WHEA ACPI Tables, Platform Hardware Error Handlers, Platform Specific Hardware Error Driver (PSHED), PSHED Plug-ins, OS Hardware Error Handlers, WHEA-Enabled Applications, WHEA-Enabled Management Applications

Slide9

WHEA-Enabled Platform

Requirements by feature and processor architecture

Error Source Enumeration
x86: Optional: HEST or PSHED plug-in
x64: Optional: HEST or PSHED plug-in
Itanium: Optional: HEST or PSHED plug-in

Error Record Persistence
x86: Required: ERST or PSHED plug-in
x64: Required: ERST, PSHED plug-in, or UEFI 2.1 variable services
Itanium: Required: ERST, PSHED plug-in, or UEFI 2.1 variable services

BOOT Error Source
x86: Required
x64: Required
Itanium: Optional

Error Injection
x86: Required: EINJ or PSHED plug-in
x64: Required: EINJ or PSHED plug-in
Itanium: Optional if PAL-based or MSR-based error injection is supported

Error Information Retrieval
x86: Optional: PSHED plug-in
x64: Optional: PSHED plug-in
Itanium: Optional: PSHED plug-in

Error Source Control
x86: Optional: PSHED plug-in
x64: Optional: PSHED plug-in
Itanium: Optional: PSHED plug-in

Error Recovery
x86: Optional: PSHED plug-in
x64: Optional: PSHED plug-in
Itanium: Optional: PSHED plug-in

_OSC
x86: Required
x64: Required
Itanium: Required

Slide10

Reporting Error Sources

The platform must report error sources to Windows only in the following cases

To override the default error source configuration

To report error sources Windows does not support by default

To take firmware-first control of one or more error sources

To inject errors through a generic error source

Slide11

Error Source Defaults

x86/x64 Machine check

x86/x64 Machine Check Settings

IA32_MCG_CTL: 0xFFFFFFFFFFFFFFFF

IA32_MCi_CTL: 0xFFFFFFFFFFFFFFFF

OS respects settings in IA32_MC0_CTL
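The defaults above are written to the architectural machine-check MSRs. The C sketch below shows how the per-bank register addresses are derived and what the all-ones default looks like in code. The MSR addresses follow the Intel SDM layout (IA32_MCG_CAP at 0x179, IA32_MCG_CTL at 0x17B, IA32_MC0_CTL at 0x400, four MSRs per bank). The helper names are hypothetical, and treating "OS respects settings in IA32_MC0_CTL" as "bank 0 is left as the platform configured it" is an assumption. Nothing here touches hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural machine-check MSR addresses (Intel SDM layout). */
#define IA32_MCG_CAP 0x179u
#define IA32_MCG_CTL 0x17Bu
#define IA32_MC0_CTL 0x400u

/* Default for IA32_MCG_CTL and each IA32_MCi_CTL, per the slide. */
#define WHEA_MC_CTL_DEFAULT 0xFFFFFFFFFFFFFFFFull

/* Hypothetical helper: address of IA32_MCi_CTL for a given bank.
 * Each bank owns four consecutive MSRs: CTL, STATUS, ADDR, MISC. */
static uint32_t mci_ctl_msr(unsigned bank)
{
    return IA32_MC0_CTL + 4u * bank;
}

/* Hypothetical helper: number of banks, taken from bits 7:0 of
 * the IA32_MCG_CAP value. */
static unsigned mcg_bank_count(uint64_t mcg_cap)
{
    return (unsigned)(mcg_cap & 0xFFu);
}

/* Hypothetical helper: value to program into a bank's CTL register.
 * Bank 0 keeps its current (platform-configured) value; this reading
 * of the slide is an assumption. */
static uint64_t mci_ctl_value(unsigned bank, uint64_t current)
{
    return bank == 0 ? current : WHEA_MC_CTL_DEFAULT;
}
```

For example, bank 3's control register sits at `mci_ctl_msr(3)`, which is 0x40C.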

x86/x64 Corrected Machine Checks

Polling interval is 60 seconds

Slide12

Error Source Defaults

PCI Express AER

Endpoint Devices (default values)

Device Control: 0x0007

Uncorrectable Error Mask: 0x00100000

Uncorrectable Error Severity: 0x00062011

Correctable Error Mask: 0x00002000

Capabilities and Control: 0x00000000

Root Ports (default values)

Root Error Command: 0x0007

Bridges (default values)

Secondary Uncorrectable Error Mask: 0x000017A8

Secondary Uncorrectable Error Severity: 0x00001340

Secondary Capabilities and Control: 0x00000000

Slide13

Error Record Persistence

In Windows Server 2008

Windows writes an error record only when the system is about to be bugchecked

Windows only requires space for one error record

Platform must implement persistence interface to get logo
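Because Windows Server 2008 needs room for only one record, the persistence interface can be pictured as a single-slot store with write, read, and clear operations. The C sketch below models that idea in memory. The type and function names are hypothetical, the buffer stands in for platform non-volatile storage, and the 1 KB capacity is taken from the x86/x64 minimum stated on this slide. It is not the ERST or UEFI interface itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STORE_CAPACITY 1024u   /* 1 KB: the x86/x64 minimum on this slide */

/* Hypothetical single-slot store standing in for platform NVRAM. */
struct record_store {
    uint8_t data[STORE_CAPACITY];
    size_t  length;            /* 0 means the slot is empty */
};

/* Write a record into the slot, replacing any previous one.
 * Returns 0 on success, -1 if the record is empty or does not fit. */
static int store_write(struct record_store *s, const void *rec, size_t len)
{
    if (len == 0 || len > STORE_CAPACITY)
        return -1;
    memcpy(s->data, rec, len);
    s->length = len;
    return 0;
}

/* Copy the stored record out, as the OS would on the next boot.
 * Returns the record length, or 0 if the slot is empty or buf is too small. */
static size_t store_read(const struct record_store *s, void *buf, size_t cap)
{
    if (s->length == 0 || cap < s->length)
        return 0;
    memcpy(buf, s->data, s->length);
    return s->length;
}

/* Clear the slot once the record has been consumed. */
static void store_clear(struct record_store *s)
{
    s->length = 0;
}
```

The intended sequence is write just before the bugcheck, read on the following boot, then clear so the slot is free for the next fatal error.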

Storage requirements

x64/x86 platforms require a minimum of 1 KB

Itanium platforms require a minimum of 128 KB

Slide14

Error Record Persistence

Platform implementation

ACPI ERST Table

UEFI 2.1 Variable Services Error Record Extensions for EFI-based platforms

PSHED plug-in

This solution is generally discouraged

Slide15

Error Injection

Error injection interface allows hardware errors to be injected on a platform for the following purposes

Validation of OS/platform error handling flows

Validation of platform logo support for WHEA

Exercising hardware/firmware error flows for diagnostic purposes

Slide16

Error Injection

Platform implementation

Prefer true hardware error injection if possible

Enables system/component diagnostic

In cases where no true hardware injection is possible, generic error source can be used to simulate errors

Enables feature validation

Slide17

WHEA _OSC Method

New \_SB._OSC method

GUID

{ed855e0c-6c90-47bf-a62a-26de0fc5ad5c}
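An _OSC method receives its UUID as a 16-byte buffer, not as text, so firmware and test code often need to convert the string form. The C sketch below converts the UUID above into the byte layout that ASL's ToUUID produces (first three fields little-endian, remaining eight bytes in textual order). The function names are hypothetical and the sketch is illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The WHEA _OSC UUID as given on the slide. */
static const char WHEA_OSC_UUID[] = "ed855e0c-6c90-47bf-a62a-26de0fc5ad5c";

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: parse "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" into
 * the 16-byte ToUUID buffer layout.
 * Returns 0 on success, -1 on malformed input. */
static int uuid_to_acpi_buffer(const char *text, uint8_t out[16])
{
    /* Output index for each of the 16 textual bytes, in reading order. */
    static const uint8_t order[16] = {
        3, 2, 1, 0, 5, 4, 7, 6, 8, 9, 10, 11, 12, 13, 14, 15
    };
    size_t pos = 0;

    if (strlen(text) != 36)
        return -1;
    for (int i = 0; i < 16; i++) {
        /* Dashes sit before textual bytes 4, 6, 8 and 10. */
        if (i == 4 || i == 6 || i == 8 || i == 10) {
            if (text[pos++] != '-')
                return -1;
        }
        int hi = hex_val(text[pos]);
        int lo = hex_val(text[pos + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out[order[i]] = (uint8_t)((hi << 4) | lo);
        pos += 2;
    }
    return 0;
}
```

For the WHEA UUID the buffer begins 0x0C 0x5E 0x85 0xED, which is the byte-swapped form of the first field `ed855e0c`.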

Notifies platform that Windows implements WHEA so platform can perform any necessary configuration

If the platform does not implement \_SB._OSC, or if it returns “Unrecognized UUID”, Windows does not configure WHEA support for the platform

Slide18

Boot Error Source

For fatal errors that cannot be processed by the OS

Firmware-initiated reset

BMC-initiated reset

Sync-flood reset

Platform describes the error to Windows using the BOOT error source

The ACPI BERT table describes the platform’s BOOT error source to Windows

Slide19

Firmware First

Platform can indicate that error sources should be handled first by firmware

Via error source enumeration interface

Some error sources cannot do firmware-first (e.g., machine check exception)

Generally, an error source reported as firmware-first is configured by the platform to generate an SMI

Slide20

Firmware First

Enumerating error sources

The error source for which platform wants firmware-first control is marked as FIRMWARE_FIRST

A paired generic error source must be enumerated

This error source is how the platform will signal errors from the firmware-first source to the OS

Slide21

Firmware First

Error handling flow

Platform gains control when error occurs (SMI)

Platform processes and possibly logs the error

Platform may void errors in some cases

Platform fills in error status block with information describing the error

Platform is responsible for clearing HW error status

Slide22

Firmware First

Error handling flow

Platform signals the error to Windows using the notification mechanism it reported when it enumerated the error source

This means platform generates an NMI, interrupt, or allows Windows to poll, etc

Signaling mechanism depends on type of error (e.g., corrected/uncorrected)
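The firmware-first hand-off can be sketched in C as a shared status block that firmware fills in and the OS consumes, including the final step in which Windows clears the block status. The two block status bits follow the ACPI generic error status block definition. Everything else is a simplifying assumption: the struct is reduced to three fields, the names are hypothetical, and the rule "uncorrected errors use NMI, corrected errors are polled" is only one possible policy.

```c
#include <assert.h>
#include <stdint.h>

/* Block status bits from the ACPI generic error status block. */
#define GES_UNCORRECTABLE_VALID 0x1u
#define GES_CORRECTABLE_VALID   0x2u

/* Simplified stand-in for the block shared by firmware and OS.
 * A real block also carries raw data offsets and error data entries. */
struct error_status_block {
    uint32_t block_status;
    uint32_t error_severity;   /* illustrative encoding, see enum severity */
    uint32_t data_length;
};

enum severity { SEV_CORRECTED, SEV_UNCORRECTED };   /* hypothetical */
enum notify   { NOTIFY_POLLED, NOTIFY_NMI };        /* hypothetical */

/* Firmware side (the SMI handler in the flow above): describe the error
 * in the status block and choose how to signal the OS. */
static enum notify firmware_report(struct error_status_block *b,
                                   enum severity sev, uint32_t data_length)
{
    b->error_severity = (uint32_t)sev;
    b->data_length = data_length;
    b->block_status |= (sev == SEV_UNCORRECTED) ? GES_UNCORRECTABLE_VALID
                                                : GES_CORRECTABLE_VALID;
    return (sev == SEV_UNCORRECTED) ? NOTIFY_NMI : NOTIFY_POLLED;
}

/* OS side: returns 1 if an error was pending. Clearing the block status
 * tells firmware the error has been processed and the block is free. */
static int os_consume(struct error_status_block *b)
{
    if ((b->block_status &
         (GES_UNCORRECTABLE_VALID | GES_CORRECTABLE_VALID)) == 0)
        return 0;
    /* ... build and report the error record here ... */
    b->block_status = 0;
    return 1;
}
```

The single shared block is why the clear step matters: until the OS zeroes the status, firmware has no way to know the previous error was picked up.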

Windows clears bits in block status to signal that it has processed the error

Slide23

Dell WHEA System Design

Mukund Khatri

Server Strategist

Dell Inc.

Slide24

Overview

Close collaboration between Dell and Microsoft on WHEA feature design over last couple of years

Design enhancements

Prototype efforts

WHEA architecture holds great promise for future server designs

Enables OS participation in error handling flows

Flexibility to retain full value in existing error handling infrastructure

Slide25

WHEA: Dell Implementation

Dell Implementation incorporates support for

Error Enumeration and Control

Error record persistence

WHEA _OSC method

BOOT Error Source

Error injection

Implementation uses Firmware First Mode

Complementary to OS-first mode

Slide26

Firmware First Mode

Implementation considerations

Ability at platform firmware level to override defaults in OS without PSHED plug-ins

Silicon errata management

Updates to interface specifications

Control over level of integration with OS

Extend WHEA feature set to add value

Add richer error data content (e.g., FRU info)

Retain existing investments in error handling infrastructures

Slide27

Platform Hardware Event Flow

Firmware First Mode

[Diagram: platform error flow. Legend: error flow with no WHEA; new error flow with WHEA]

Platform errors: handled by platform firmware, service processor, and management consoles

New with WHEA: errors handled by OS, with data and ETW available for ecosystem consumption

Existing error management paradigm still retained

Richer error records and ETW available for consumption

Slide28

Dell PowerEdge™ System

Mukund Khatri

Server Strategist

Dell Inc.

demo

Slide29

Demo: Dell PowerEdge™ Server

Injection of PCI-Express uncorrectable error

Error captured and processed by platform firmware

Firmware creates and uploads GES data packet and triggers NMI to OS

WHEA error record stored in persistent storage

System bug-checks and subsequently reboots

OS retrieves WHEA error record on next boot

Event viewer reports the event along with error record

Slide30

PCI-Express Error Record Output

Slide31

Dell: Key Takeaways

New Dell servers will include full support for WHEA

We intend to build on WHEA architecture to add end customer value in future Dell servers

Dell and Microsoft partnering on WHEA architecture and implementation

Slide32

Call To Action

WHEA-enable your server platforms now

Work with Microsoft to get BIOS reference implementations

Validate WHEA support

Run Logo Tests to validate WHEA implementation

Fully implement and validate Advanced Error Reporting capability in PCI-express devices

Slide33

Additional Resources

Related Sessions

SVR-T464 WHEA Platform Implementation

SVR-C460 WHEA PSHED Plug-in

SVR-T325 Dynamic Partition: Windows Server

WHEA Feedback:

wheafb@microsoft.com

WHEA introduction:

http://www.microsoft.com/whdc/system/pnppwr/WHEA/wheaintro.mspx

Specifications

WHEA Platform Design Guide

UEFI 2.1 Specification

Slide34

© 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.

MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.