ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

Uploaded On 2022-06-07

Presentation Transcript

Slide1

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

Milos Prvulovic, Zheng Zhang, Josep Torrellas
University of Illinois at Urbana-Champaign
Hewlett-Packard Laboratories

Isaac Liu

Slide2

Introduction

Targeting large-scale applications that provide services (need high availability)
Improvements in silicon technology make modern integrated circuits prone to transient and permanent faults
FER vs. BER (forward vs. backward error recovery)
Hardware redundancy vs. recovery

Slide3

ReVive design

Goal: Cost-effective general-purpose rollback recovery
Modest amount of hardware (cost-effective)
Recovery from a wide class of errors (general-purpose)
Short system downtime due to error (high availability)
Low overhead when error-free (high performance)

Slide4

Hardware Modifications

Slide5

Design Choices

Checkpoint Storage:
  Safe internal storage with distributed parity
  Safe external storage
  Specialized fault class
Checkpoint Separation:
  Partial separation with logging
  Full separation
  Partial separation with buffering (renaming)
Checkpoint Consistency:
  Global
  (Un)coordinated local

Slide6

Overview

Periodically establish a checkpoint.
Between checkpoints, whenever main memory is written, log the old data so that the checkpoint state is preserved.
If an error is detected, use the logs to roll back to the checkpoint state.
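The checkpoint, log, and rollback cycle above can be illustrated with a small software model. This is only a sketch under simplifying assumptions: the class and method names are hypothetical, memory is a flat list of integers, a single checkpoint is kept, and the `logged` set stands in for whatever hardware mechanism tracks which locations have already been logged in the current interval.

```python
class LoggedMemory:
    """Toy memory with an undo log (illustrative names, not from the slides)."""

    def __init__(self, size):
        self.data = [0] * size
        self.log = []        # (address, old value) pairs since the last checkpoint
        self.logged = set()  # addresses already logged in this interval

    def checkpoint(self):
        # The current contents become the new recovery point,
        # so the old log entries are no longer needed.
        self.log.clear()
        self.logged.clear()

    def write(self, addr, value):
        # Save the checkpoint-time value only on the first write to an address.
        if addr not in self.logged:
            self.log.append((addr, self.data[addr]))
            self.logged.add(addr)
        self.data[addr] = value

    def rollback(self):
        # Undo logged writes in reverse order to restore the checkpoint state.
        for addr, old in reversed(self.log):
            self.data[addr] = old
        self.checkpoint()


mem = LoggedMemory(4)
mem.write(0, 7)
mem.checkpoint()
mem.write(0, 9)
mem.write(1, 5)
mem.write(0, 11)   # second write to address 0: not logged again
mem.rollback()
print(mem.data)    # [7, 0, 0, 0]
```

Logging only the first write to each location keeps the log small: however many times a location is overwritten between checkpoints, one saved value is enough to restore it.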

Slide7

Design Choices

Checkpoint Storage: Safe internal storage with distributed parity
Checkpoint Separation: Partial separation with logging
Checkpoint Consistency: Global

Slide8

Distributed Parity
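The slide itself is a figure, so the following is a hedged illustration of the general idea rather than the exact ReVive layout. It assumes RAID-5-style XOR parity, with three data pages on three nodes and the parity page on a fourth; the group size and the helper name `xor_pages` are assumptions made for this sketch.

```python
from functools import reduce


def xor_pages(pages):
    """Bytewise XOR of equally sized pages."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))


# Three data pages held by three different nodes; a fourth node holds the parity.
data = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]
parity = xor_pages(data)

# A write changes one page. The parity is patched with (old XOR new),
# so the other data pages do not have to be read.
old, new = data[1], bytes([40, 50, 60])
parity = xor_pages([parity, old, new])
data[1] = new

# If the node holding data[2] is lost, its page is the XOR of the survivors.
rebuilt = xor_pages([data[0], data[1], parity])
print(rebuilt == data[2])  # True
```

Compared with mirroring, which keeps a full second copy of every page, parity stores one extra page per group at the cost of a parity update on each memory write.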

Slide9

Design Choices

Checkpoint Storage: Safe internal storage with distributed parity
Checkpoint Separation: Partial separation with logging
Checkpoint Consistency: Global

Slide10

Logging

Slide11

Design Choices

Checkpoint Storage: Safe internal storage with distributed parity
Checkpoint Separation: Partial separation with logging
Checkpoint Consistency: Global checkpoint

Slide12

Global checkpoint

Commit all work and state to main memory.
Two-phase commit protocol: the first synchronization is a tentative commit; a second synchronization fully commits the checkpoint.
Keeps the two most recent checkpoints.
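A rough software analogy of the two-phase commit: each node flushes its dirty data, all nodes synchronize (tentative commit), each node then marks the checkpoint as committed, and all nodes synchronize again. The sketch below uses Python threads and a barrier to stand in for processors and hardware synchronization; `NUM_NODES`, `memory`, and `committed` are hypothetical names chosen for the demo.

```python
import threading

NUM_NODES = 4                       # hypothetical node count for the demo
barrier = threading.Barrier(NUM_NODES)
lock = threading.Lock()
memory = {}                         # stands in for main memory
committed = []                      # nodes that have fully committed


def node(node_id, dirty):
    # Write this node's dirty cached data back to main memory.
    with lock:
        memory.update(dirty)
    barrier.wait()                  # first sync: everyone has flushed (tentative commit)
    # Every node's data is now in memory, so the checkpoint can be committed.
    with lock:
        committed.append(node_id)
    barrier.wait()                  # second sync: everyone has committed


threads = [
    threading.Thread(target=node, args=(i, {f"line{i}": i}))
    for i in range(NUM_NODES)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(committed), len(memory))  # [0, 1, 2, 3] 4
```

No node passes the first barrier until every node has finished flushing, so no node can commit a checkpoint that is missing another node's data.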

Slide13

Global Checkpoint

Slide14

Implementation issues

Extra L bit for each directory entry
New states in directory protocol, new messages (parity update/ack)
Race conditions:
  Log-data update race
  Atomic log update race
  Log-parity update race
  Data-parity update race
  Checkpoint commit race

Slide15

Rollback

Slide16

Overhead

Logging and parity maintenance:
  Depends on application
Global checkpoint:
  Cross-processor interrupt
  Write dirty data to memory
Rollback:
  Recovery + lost work + rebuild lost memory pages

Slide17

Evaluation environment

CC-NUMA multiprocessor with 16 nodes
Non-blocking, write-back caches
Full-map directory and cache coherence protocol similar to DASH
Cache size: 16 KB for L1, 128 KB for L2*
* Applications run with smaller problem sizes and for shorter periods

Slide18

Evaluation Results

Cp10ms: Parity and checkpoint every 10 ms
CpInf: Parity and checkpoint with infinite interval
Cp10msM: Mirror and checkpoint every 10 ms
CpInfM: Mirror and checkpoint with infinite interval

Slide19

Traffic

Par: parity updates
Ckp: checkpoint
WB: writeback
RD/RDX: cache miss
LOG: writing to logs

Slide20

Overhead

Slide21

ReVive vs. SafetyNet

Both use log-based rollback mechanisms.
ReVive enables recovery from the permanent loss of a node.
ReVive does not need to change the processor's caches.
ReVive is more general, so it may incur a larger performance overhead.

Slide22

Conclusion

ReVive provides:
Modest amount of hardware (cost-effective)
Recovery from a wide class of errors (general-purpose)
Short system downtime due to error (high availability)
Low overhead when error-free (high performance)