Slide1
ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors
Milos Prvulovic, Zheng Zhang, Josep Torrellas
University of Illinois at Urbana-Champaign
Hewlett-Packard Laboratories
Isaac Liu
Slide2
Introduction
Targets large-scale applications that provide services (need high availability)
Improvements in silicon technology make modern integrated circuits prone to transient and permanent faults
Forward error recovery (FER) vs. backward error recovery (BER)
Hardware redundancy vs. recovery
Slide3
ReVive design
Goal: Cost-effective general-purpose rollback recovery
Modest amount of hardware (cost-effective)
Recovery from a wide class of errors (general-purpose)
Short system downtime due to error (high availability)
Low overhead when error-free (high performance)
Slide4
Hardware Modifications
Slide5
Design Choices
Checkpoint Storage:
  Safe internal storage with distributed parity
  Safe external storage
  Specialized fault class
Checkpoint Separation:
  Partial separation with logging
  Full separation
  Partial separation with buffering (renaming)
Checkpoint Consistency:
  Global
  Coordinated local
  Uncoordinated local
Slide6
Overview
Periodically establish a checkpoint.
Between checkpoints, whenever main memory is written, log the overwritten data so that the checkpoint state is preserved.
If an error is detected, use the logs to roll memory back to the checkpoint state.
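The checkpoint, log, and rollback cycle above can be sketched in software. This is a minimal illustration only, not ReVive's hardware mechanism: the class `LoggedMemory`, its method names, and the choice to log a line only on its first write in each interval are assumptions made for this example.

```python
class LoggedMemory:
    """Toy main memory with an undo log (hypothetical names, illustration only)."""

    def __init__(self, num_lines):
        self.lines = [0] * num_lines  # current memory contents
        self.undo_log = []            # (address, old value) pairs since the last checkpoint
        self.logged = set()           # addresses already logged in this interval (assumption)

    def checkpoint(self):
        """Establish a checkpoint: the current contents become the recovery point."""
        self.undo_log.clear()
        self.logged.clear()

    def write(self, addr, value):
        """Save the checkpoint-time value on the first write, then update memory."""
        if addr not in self.logged:
            self.undo_log.append((addr, self.lines[addr]))
            self.logged.add(addr)
        self.lines[addr] = value

    def rollback(self):
        """Restore the checkpoint state by undoing logged writes, newest first."""
        for addr, old_value in reversed(self.undo_log):
            self.lines[addr] = old_value
        self.checkpoint()


mem = LoggedMemory(4)
mem.write(0, 10)
mem.checkpoint()   # recovery point: line 0 holds 10
mem.write(0, 11)
mem.write(0, 12)   # second write to line 0 is not logged again
mem.write(2, 7)
mem.rollback()     # memory returns to the checkpoint contents
```

Logging only the first write per line keeps the log proportional to the number of distinct lines modified in an interval rather than to the number of writes.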
Slide7
Design Choices
Checkpoint Storage: Safe internal storage with distributed parity
Checkpoint Separation: Partial separation with logging
Checkpoint Consistency: Global
Slide8
Distributed Parity
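The figure for this slide did not survive extraction. As a generic illustration of how XOR parity protects memory contents, the sketch below treats each page in a parity group as a byte string held by a different node. The function names and the group layout are assumptions for this example and are not taken from the slides; the actual ReVive organization may differ.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def compute_parity(pages):
    """Parity page of a group: XOR of all data pages."""
    return reduce(xor_bytes, pages)


def update_parity(parity, old_data, new_data):
    """Incremental update on a memory write: P' = P xor old xor new."""
    return xor_bytes(parity, xor_bytes(old_data, new_data))


def reconstruct(parity, surviving_pages):
    """Rebuild the one missing page from the parity and the surviving pages."""
    return reduce(xor_bytes, surviving_pages, parity)


# Hypothetical parity group: three data pages, each on a different node.
group = [bytes([1, 2, 3, 4]), bytes([9, 9, 9, 9]), bytes([0, 255, 16, 32])]
parity = compute_parity(group)

# A write to page 1 needs only the old and new contents to update the parity.
new_page = bytes([5, 6, 7, 8])
parity = update_parity(parity, group[1], new_page)
group[1] = new_page

# If the node holding page 1 is lost, its page can be rebuilt from the rest.
recovered = reconstruct(parity, [group[0], group[2]])
```

The incremental update is why a parity scheme adds a parity-update message per memory write instead of re-reading the whole group.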
Slide9
Design Choices
Checkpoint Storage: Safe internal storage with distributed parity
Checkpoint Separation: Partial separation with logging
Checkpoint Consistency: Global
Slide10
Logging
Slide11
Design Choices
Checkpoint Storage: Safe internal storage with distributed parity
Checkpoint Separation: Partial separation with logging
Checkpoint Consistency: Global
Slide12
Global checkpoint
Commit all work and state to main memory.
Two-phase commit protocol: the first synchronization is a tentative commit; a second synchronization fully commits the checkpoint.
Keeps the two most recent checkpoints.
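The two-phase structure can be sketched as a sequential simulation. This is an assumption-laden toy, not the ReVive protocol itself: `Node`, `global_checkpoint`, and the dictionary-based cache and memory are hypothetical, and the two loops stand in for the two synchronization points.

```python
from collections import deque


class Node:
    """Toy node with dirty cached data and local memory (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.dirty = {}                     # cached writes not yet in main memory
        self.memory = {}                    # main memory contents
        self.tentative = None               # checkpoint id awaiting full commit
        self.checkpoints = deque(maxlen=2)  # ids of the two most recent checkpoints

    def flush(self):
        """Write all dirty cached data back to main memory."""
        self.memory.update(self.dirty)
        self.dirty.clear()


def global_checkpoint(nodes, ckpt_id):
    # Phase 1: every node writes back dirty data and tentatively commits.
    for node in nodes:
        node.flush()
        node.tentative = ckpt_id
    # First synchronization point: all nodes have tentatively committed.

    # Phase 2: every node makes the commit final.
    for node in nodes:
        node.checkpoints.append(node.tentative)
        node.tentative = None
    # Second synchronization point: the checkpoint is fully committed.


nodes = [Node(i) for i in range(4)]
nodes[0].dirty[0x40] = 1
for ckpt_id in (1, 2, 3):
    global_checkpoint(nodes, ckpt_id)
```

Keeping the previous checkpoint until the new one is fully committed means an error that strikes between the two synchronizations still leaves a complete older checkpoint to roll back to.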
Slide13
Global Checkpoint
Slide14
Implementation issues
Extra L bit for each directory entry
New states in directory protocol, new messages (parity update/ack)
Race conditions:
  Log-data update race
  Atomic log update race
  Log-parity update race
  Data-parity update race
  Checkpoint commit race
Slide15
Rollback
Slide16
Overhead
Logging and parity maintenance
  Depends on application
Global checkpoint
  Cross-processor interrupt
  Write dirty data to memory
Rollback
  Recovery + lost work + rebuilding lost memory pages
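The rollback cost is a sum of three terms, which a small calculation makes concrete. The function names, the steady-state availability formula, and every number below are hypothetical inputs chosen for illustration; none are measurements from the paper.

```python
def rollback_downtime(recovery_s, lost_work_s, rebuild_s=0.0):
    """Time lost to one error: the three rollback terms, summed.

    rebuild_s defaults to 0 on the assumption that pages only need
    rebuilding when memory contents were actually lost.
    """
    return recovery_s + lost_work_s + rebuild_s


def availability(mean_time_between_errors_s, downtime_s):
    """Fraction of time spent doing useful work (simple steady-state model)."""
    return mean_time_between_errors_s / (mean_time_between_errors_s + downtime_s)


# Hypothetical inputs, not measurements.
transient = rollback_downtime(recovery_s=0.5, lost_work_s=0.1)
node_loss = rollback_downtime(recovery_s=0.5, lost_work_s=0.1, rebuild_s=30.0)

print(availability(86_400.0, transient))
print(availability(86_400.0, node_loss))
```

Under this model the lost-work term shrinks with shorter checkpoint intervals, while more frequent checkpoints raise the error-free overhead listed above.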
Slide17
Evaluation environment
CC-NUMA multiprocessor with 16 nodes
Non-blocking, write-back caches
Full-map directory and cache coherence protocol similar to DASH
Cache size: 16 KB for L1, 128 KB for L2*
*Applications run on smaller problem sizes and for shorter periods
Slide18
Evaluation Results
Cp10ms: parity, checkpoint every 10 ms
CpInf: parity, infinite checkpoint interval
Cp10msM: mirroring, checkpoint every 10 ms
CpInfM: mirroring, infinite checkpoint interval
Slide19
Traffic
PAR: parity updates
CKP: checkpoint
WB: writebacks
RD/RDX: cache misses
LOG: writes to the logs
Slide20
Overhead
Slide21
ReVive vs. SafetyNet
Both use log-based rollback mechanisms.
ReVive enables recovery from the permanent loss of a node.
ReVive does not require changes to the processor's caches.
ReVive is more general, so it may incur a larger performance overhead.
Slide22
Conclusion
ReVive provides:
Modest amount of hardware (cost-effective)
Recovery from a wide class of errors (general-purpose)
Short system downtime due to error (high availability)
Low overhead when error-free (high performance)