Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems
Jinsuk Chung, Ikhwan Lee, Michael Sullivan, Jee Ho Ryoo, Dong Wan Kim, Doe Hyun Yoon+, Larry Kaplan*, and Mattan Erez
UT Austin, + now at HP Labs, * Cray

Motivation and goals
Resilience bounds performance
  Resilience is a major obstacle to exascale
Containment domains: scalable efficient resilience
  Hierarchical
    Preserve data where most efficient and effective
  Proportional
    Tunable redundancy and recovery
    Different errors/faults handled differently
  Abstract
    Portable
    Amenable to auto-tuning and analysis
CDs elevate resilience to a first-order application concern
Containment Domain [SC'12] (c) Jinsuk Chung

Containment domains
Single consistent abstraction
  Encapsulates resilience techniques
  Spans levels: programming, system, and analysis
Components
  Preserve data on domain start
  Compute (domain body)
  Detect faults before domain commits
  Recover from detected errors
Semantics
  Erroneous data never communicated
  Each CD provides recovery mechanism
Hierarchy
  Escalation
  Match CD and machine hierarchies
[Figure: a root CD containing child CDs]
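
The four components and the escalation rule can be illustrated with a small Python sketch. This is not the authors' runtime: `run_cd`, `DetectedError`, `make_flaky_body`, and the retry policy are hypothetical names and choices made only for illustration. It assumes state is a mutable dict, recovery means restoring the preserved copy and re-executing, and a CD that runs out of retries escalates to its parent by raising.

```python
import copy


class DetectedError(Exception):
    """Hypothetical signal: a CD could not recover and escalates to its parent."""


def run_cd(state, body, detect, max_retries=1):
    """Run `body` as one containment domain over the mutable dict `state`."""
    preserved = copy.deepcopy(state)            # preserve on domain start
    for _ in range(max_retries + 1):
        try:
            body(state)                         # compute (domain body)
            ok = detect(state)                  # detect before the domain commits
        except DetectedError:
            ok = False                          # a child escalated to this CD
        if ok:
            return state                        # commit: only checked data leaves
        state.clear()                           # recover: restore preserved inputs
        state.update(copy.deepcopy(preserved))  # and re-execute on the next pass
    raise DetectedError("retries exhausted; escalating to parent CD")


def make_flaky_body(failures):
    """Body that computes a sum but injects a fault on its first `failures` calls."""
    calls = {"n": 0}

    def body(state):
        calls["n"] += 1
        state["sum"] = sum(state["xs"])
        if calls["n"] <= failures:
            state["sum"] += 1                   # injected fault
    return body


def sum_is_correct(state):
    """Toy detector: recompute the sum and compare."""
    return state["sum"] == sum(state["xs"])


if __name__ == "__main__":
    child_body = make_flaky_body(1)

    def parent_body(state):
        # The child gets no retries, so its first failure escalates.
        run_cd(state, child_body, sum_is_correct, max_retries=0)

    result = run_cd({"xs": [4, 5], "sum": None}, parent_body, lambda s: True)
    print(result["sum"])
```

In the usage at the bottom, the child fails once with no retries left, so the parent restores its own preserved state and re-executes the child, which then succeeds.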

Mapping example: SpMV
  void task<inner> SpMV(in M, in Vi, out Ri) {
    forall(…) reduce(…)
      SpMV(M[…], Vi[…], Ri[…]);
  }

  void task<leaf> SpMV(…) {
    for r=0..N
      for c=rowS[r]..rowS[r+1] {
        resi[r] += data[c]*Vi[cIdx[c]];
        prevC = c;
      }
  }
[Figure: matrix M and vector V]
[Figure: matrix M and vector V distributed to 4 nodes]
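
A runnable Python sketch of the inner/leaf split in the mapping example follows. The leaf mirrors the slide's loop over `rowS`, `cIdx`, and `data` (compressed sparse row storage). The inner task here partitions rows into contiguous blocks, which is an assumption: the slide's actual decomposition across the 4 nodes is elided (`…`) and may differ. The names `spmv_leaf` and `spmv_inner` are hypothetical.

```python
def spmv_leaf(row_start, col_idx, data, v, rows):
    """Leaf task: CSR sparse matrix-vector product for the rows in `rows`."""
    res = []
    for r in rows:
        acc = 0.0
        for c in range(row_start[r], row_start[r + 1]):
            acc += data[c] * v[col_idx[c]]
        res.append(acc)
    return res


def spmv_inner(row_start, col_idx, data, v, parts=4):
    """Inner task: split the rows into `parts` contiguous blocks (assumed
    decomposition) and concatenate the leaf results in row order."""
    n = len(row_start) - 1
    bounds = [n * p // parts for p in range(parts + 1)]
    result = []
    for p in range(parts):                      # stands in for forall(...)
        block = range(bounds[p], bounds[p + 1])
        result.extend(spmv_leaf(row_start, col_idx, data, v, block))
    return result


if __name__ == "__main__":
    # 4x4 matrix [[2,0,0,1],[0,3,0,0],[0,0,4,0],[5,0,0,6]] in CSR form.
    row_start = [0, 2, 3, 4, 6]
    col_idx = [0, 3, 1, 2, 0, 3]
    data = [2.0, 1.0, 3.0, 4.0, 5.0, 6.0]
    print(spmv_inner(row_start, col_idx, data, [1.0, 2.0, 3.0, 4.0]))
```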

Mapping example: SpMV
[Figure: CD tree over M and V. A parent CD with Preserve, Detect, and Recover steps contains four child CDs, each with its own Preserve, Detect, and Recover; in the second view each child keeps its own Detect and Recover while the parent carries Preserve (Parent), Detect (Parent), and Recover (Parent).]

Initial CD preservation API and prototype
  void task<inner> SpMV(in M, in Vi, out Ri) {
    cd = create_CD(parentCD);
    preserve_via_copy(cd, matrix, …);
    forall(…) reduce(…)
      SpMV(M[…], Vi[…], Ri[…]);
    commit_CD(cd);
  }

  void task<leaf> SpMV(…) {
    cd = create_CD(parentCD);
    preserve_via_copy(cd, matrix, …);
    preserve_via_parent(cd, veci, …);
    for r=0..N
      for c=rowS[r]..rowS[r+1] {
        resi[r] += data[c]*Vi[cIdx[c]];
        check { fault<fail>(c > prevC); }
        prevC = c;
      }
    commit_CD(cd);
  }
Preservation components prototype on Cray XK7
  http://lph.ece.utexas.edu/public/CDs
API
  create_CD
  preserve_via_copy
  preserve_via_parent
  check
  commit_CD

Containment domains long-term design
  int main(int argc, char **argv)
  {
    main_task here = phalanx::initialize(argc, argv);
    … Create test arrays here …
    // Launch kernel on default CPU (“host”)
    openmp_event e1 = async(here, here.processor(), n)
                           (host_saxpy, 2.0f, host_x, host_y);
    // Launch kernel on default GPU (“device”)
    cuda_event e2 = async(here, here.cuda_gpu(0), n)
                         (device_saxpy, 2.0f, device_x, device_y);
    wait(here, e1&e2);
    return 0;
  }
[Figure: design stack. Labels: efficiency-oriented programming model; CD annotations (resilience model); CD API (resilience interface); Runtime Library Interface; Hardware Abstraction Layer; Machine; Error Reporting Architecture (ECC, status); CD control and persistence. Side labels: language integration, compiler support, runtime components, hardware aspects.]
Research prototype by Cray for XK7 (Titan)

Outline
Motivation and Goals
Semantics of Containment Domains
  What do CDs do? When and why are they good?
Differentiated error handling
Analyzability
Evaluation

Differentiated Error Handling

State preservation and restoration
Abstract
  Optimized preservation and restoration
  Analyzed, auto-tuned
  Allows explicit application control
Hierarchical
  Match storage hierarchy
  Maximize locality and minimize overhead
Partial
  Preserve only when worth it
  Exploit natural redundancy
  Exploit hierarchy
  Enable regeneration

SpMV partial preservation tuning
  void task<leaf> SpMV(…) {
    cd = create_CD(parentCD);
    preserve_via_copy(cd, matrix, …);
    preserve_via_parent(cd, veci, …);
    for r=0..N
      for c=rowS[r]..rowS[r+1] {
        resi[r] += data[c]*Vi[cIdx[c]];
        check { fault<fail>(c > prevC); }
        prevC = c;
      }
    commit_CD(cd);
  }
[Figure: matrix M and vector V, with callouts "Natural redundancy" and "Hierarchy"]

Concise abstraction for complex behavior
[Figure: the same leaf SpMV task, with callouts naming where preserved data can come from: local copy or regeneration, sibling, parent (unchanged)]

Detection
Abstract
  Utilize most efficient detection mechanism
  Low overhead detection: e.g., algorithm-specific detection
Customized
  Replicate in time, replicate in space, algorithm specific
Heterogeneous
  Per-CD routines
  E.g., selective multi-granularity DMR
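
"Replicate in time" can be illustrated with a minimal dual modular redundancy (DMR) sketch in Python: the same computation runs twice and a mismatch is treated as a detected error. `dmr_in_time` is a hypothetical helper, and the sketch assumes the computation is deterministic, so any disagreement between the two runs indicates a fault.

```python
def dmr_in_time(fn, *args):
    """Run `fn` twice on the same inputs.

    Returns (first_result, agreed). `agreed` is False when the two runs
    disagree, which a CD would treat as a detected error.
    """
    first = fn(*args)
    second = fn(*args)
    return first, first == second


if __name__ == "__main__":
    def square(x):
        return x * x

    calls = {"n": 0}

    def flaky_square(x):
        # Injects a fault on the second call only, to show a mismatch.
        calls["n"] += 1
        return x * x + (1 if calls["n"] == 2 else 0)

    print(dmr_in_time(square, 7))
    print(dmr_in_time(flaky_square, 7))
```

Replicating in time doubles the compute cost of the protected region, which is one reason the slide pairs it with per-CD, selective use.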

Recovery
Abstract
  Utilize most efficient recovery mechanism
  Maximize local recovery
  Low overhead recovery, e.g., re-materialization or regeneration
Customized
  Re-execute, ignore, re-materialize, DMR, TMR
Heterogeneous
  Per-CD routines
  E.g., selective multi-granularity DMR
  App/system specific
[Figure: timeline showing Preserve, Compute, and Detect phases and the re-execution overhead along the time axis]

Analyzability

Analytical Model
Leverage hierarchy and CD semantics
  Uncoordinated “local” actions
  Solve in out
Application abstracted to CDs
  CD tree
  Volumes of preservation, computation, and communication
  Preservation and recovery options per CD
Machine model
  Storage hierarchy
  Communication hierarchy
  Bandwidths and capacities
  Error processes and rates
[Figure: execution time]
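
To show how a CD tree can be solved bottom-up, here is a deliberately simplified Python sketch. It is not the paper's analytical model, whose details are not given on the slide. The assumptions are: each attempt of a CD fails independently with a fixed probability, a failure is detected at the end of the attempt, the whole CD (including its children) is then re-executed, children run one after another, and `preserve_time` is paid on every attempt. Under these assumptions the expected number of attempts is 1 / (1 - p). `CDNode` and `expected_time` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class CDNode:
    preserve_time: float      # per-attempt preservation/restoration cost
    compute_time: float       # work done directly in this CD's body
    fail_prob: float          # probability that one attempt of this CD fails
    children: list = field(default_factory=list)


def expected_time(node):
    """Expected execution time of `node`, solved from the leaves outward."""
    inner = sum(expected_time(child) for child in node.children)
    one_attempt = node.preserve_time + node.compute_time + inner
    # Geometric number of attempts with success probability (1 - fail_prob).
    return one_attempt / (1.0 - node.fail_prob)


if __name__ == "__main__":
    leaf = CDNode(preserve_time=1.0, compute_time=9.0, fail_prob=0.5)
    root = CDNode(preserve_time=2.0, compute_time=0.0, fail_prob=0.0,
                  children=[leaf, leaf])
    print(expected_time(leaf), expected_time(root))
```

Because errors in a leaf are absorbed by the leaf's own re-execution, the root's time grows by the leaf's local overhead only, which is the benefit of uncoordinated local recovery that this toy model is meant to make visible.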

Power model
CDs that are not re-executing may remain idle
Actively executing a CD has a relative power of 1
A node that is idling consumes a relative power of […]
In our experiments […]
[Figure: parallel domains over time. After the execution phase, one domain re-executes while all other domains stay idle for the re-execution time.]
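
The energy accounting implied by the power model can be sketched in Python. The idle power value is missing from the slide text, so `idle_power` is left as a parameter, and the number used in the usage example is a placeholder rather than the authors' setting. The function name and the single-re-execution scenario are assumptions for illustration.

```python
def relative_energy(num_domains, exec_time, reexec_time, idle_power):
    """Energy in units of (active power x time) for `num_domains` parallel CDs
    when exactly one of them re-executes and the others wait idle."""
    active = num_domains * exec_time * 1.0            # all domains execute
    reexec = reexec_time * 1.0                        # one domain re-executes
    idle = (num_domains - 1) * reexec_time * idle_power
    return active + reexec + idle


if __name__ == "__main__":
    # Placeholder numbers, chosen only to exercise the formula.
    with_idle = relative_energy(4, 10.0, 2.0, idle_power=0.25)
    all_active = relative_energy(4, 10.0, 2.0, idle_power=1.0)
    print(with_idle, all_active)
```

Setting `idle_power` to 1.0 models the case where every domain burns full power while one re-executes, so the gap between the two results is the energy saved by idling.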

Evaluation
What we evaluated
  Performance efficiency
  Energy overhead
Baseline resiliency approaches
  g-CPR: global checkpoint restart
  h-CPR: hierarchical checkpoint restart (e.g., SCR)
  Optimum interval used for each
CD advantages
  Preserve only what is needed
  Hierarchical uncoordinated
Assumptions
  Detection overhead is assumed to be zero
  Capacity of storage for preservation is infinite
  Infinite spares (quick repair)
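
The slides do not say how the optimum checkpoint interval for the baselines was computed. As one common first-order approximation, Young's formula gives the interval as sqrt(2 x checkpoint cost x mean time between failures). The Python sketch below uses that formula as an assumption, not as the authors' method; the function names and the numbers in the usage example are placeholders.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order optimum checkpoint interval (same time unit as inputs)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_overhead(interval, checkpoint_cost, mtbf):
    """Approximate fraction of time lost: checkpointing plus expected rework."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    tau = young_interval(checkpoint_cost=2.0, mtbf=100.0)
    print(tau, first_order_overhead(tau, 2.0, 100.0))
```

The approximation is only reasonable when the checkpoint cost is small relative to the mean time between failures.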

Machine and error models
Component | “Performance”        | Error                   | Error scaling
Core      | 10 GFLOP/core        | Soft error              | ∝ #cores
Memory    | 1 GB/core            | ECC fail                | ∝ #DRAM chips
Socket    | 200 GB/s /socket     | Hard/OS crash           | ∝ #sockets
System    | Hierarchical network | Power module or network | ∝ #modules and #cabinets
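
The "error scaling" column says each error class grows in proportion to a component count. A minimal Python sketch of that bookkeeping follows. It assumes independent error sources whose rates add, and the per-unit rates in the usage example are invented placeholders, since the slides give no rate values. `system_error_rate` is a hypothetical name.

```python
def system_error_rate(counts, per_unit_rates):
    """Total error rate when each class scales linearly with its component count."""
    return sum(counts[name] * per_unit_rates[name] for name in counts)


if __name__ == "__main__":
    # Placeholder counts and per-unit rates (errors per hour); not from the slides.
    counts = {"cores": 1000, "sockets": 10}
    rates = {"cores": 1e-6, "sockets": 1e-4}
    total = system_error_rate(counts, rates)
    print(total, 1.0 / total)   # total rate and mean time between errors
```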

Workloads
Monte Carlo NT
  Embarrassingly parallel
  Infrequent communication
  Small fraction of read/write data
Iterative hierarchical SpMV
  Recursive decomposition
  Natural redundancy
  Frequent global communication
Mantevo HPCCG
  Requires little storage
  Conjugate-gradient based linear system solver
  Frequent global communication

Evaluation tools
Simulator
  Executes at granularity of containment domains
  Re-executes when error is detected
  Used to validate the analytical model
Analytical model
  Simulation is too slow for evaluating exascale systems
  Inputs to the model: extracted from each application
    Volume of preservation, restoration, computation and communication
    Error rates
    Shape of CD structure
Validation
  Simulator and analytical model
  Prototype of preservation/restoration on Cray XK7

Autotuned CDs perform well
[Figure: panels for NT, SpMV, and HPCCG; axis label: Peak System Performance]

SpMV, HPCCG: local recovery and partial preservation
Partial preservation via sibling or parent where appropriate
[Figure: storage levels shown: Disk, Remote NVM, Local NVM, DRAM]

NT: hierarchical local recovery and partial preservation
Partial preservation via sibling, parent, or regeneration where appropriate
[Figure: storage levels shown: Disk, Remote NVM, Local NVM, DRAM]

CDs improve energy efficiency at scale
[Figure: panels for NT, SpMV, and HPCCG; axis label: Peak System Performance]

10X failure rate emphasizes CD benefits
[Figure: panels: Peak Performance, Energy Overhead]

More in the paper
  Strict vs. relaxed containment domains
  Analytical model details
  Error and machine model details
  Additional sensitivity studies
  Related work discussion

Conclusion
Containment domains
  Abstract constructs for resilience concerns & techniques
  Proportional and application/machine tuned resilience
  Hierarchical & distributed preservation, restoration, and recovery
  Analyzable and amenable to automatic optimization
  Scalable to large systems with high relative energy efficiency
  Heterogeneous to match emerging architecture
Good start and exciting work ahead
  Preservation concept prototyped on Cray XK7
  Fine-grained CDs for high error rates
  Compiler optimizations and support
  Application-specific detection/elision
  PGAS support and interactions with system
  Interaction with other models (tasking, DSLs, …)
http://lph.ece.utexas.edu/public/CDs

Questions?
Thank you