Slide1
Toward Exascale Resilience
Part 6: System protection with checkpoint/restart
Mattan Erez
The University of Texas at Austin
July 2015
Slide2
Hardware reliability never perfect
  Not worth the cost
  Though can get close with high cost
Software also has errors
  Can never fully debug
(c) Mattan Erez
Slide3
Application and system must go on
Slide4
Detect, contain, repair, recover
Slide5
Detection is the most critical piece
We’re in luck: silent data corruption is very rare
  Strong ECC
  Logic is not very vulnerable
The paranoid among us aren’t convinced, though
  More when we talk about cross-layer approaches
Slide6
What to do on detected errors?
Simplest idea: fail-stop!
  Contain the error and don’t let it propagate
Slide7
If the system stopped, can the application continue?
Yes, if prepared
  Checkpoint/restart
Slide8
Periodically, take a checkpoint
  Stop the system
  Copy all state somewhere safe
  Keep going
Rollback
  Recover saved state
  Restart
  Recompute
  and keep going
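The checkpoint and rollback steps above can be sketched as a toy loop. This is only an illustration under simple assumptions: the "application" is a running sum, the checkpoint is an in-memory deep copy, and the names `run_with_checkpoints`, `interval`, and `failures` are hypothetical, not from the slides.

```python
import copy

def run_with_checkpoints(total_steps, interval, failures):
    """Run a toy computation, rolling back whenever an injected fail-stop hits."""
    state = {"step": 0, "value": 0}
    saved = copy.deepcopy(state)          # initial checkpoint
    pending = set(failures)               # steps at which a fail-stop occurs (once each)
    executed = 0                          # work done, including recomputation
    while state["step"] < total_steps:
        if state["step"] in pending:
            pending.discard(state["step"])
            state = copy.deepcopy(saved)  # rollback: recover saved state and restart
            continue
        state["value"] += state["step"]   # one unit of useful work
        state["step"] += 1
        executed += 1
        if state["step"] % interval == 0:
            saved = copy.deepcopy(state)  # stop and copy all state somewhere safe
    return state["value"], executed

print(run_with_checkpoints(10, 5, failures=[]))
print(run_with_checkpoints(10, 5, failures=[7]))
```

With a failure injected at step 7 and a checkpoint every 5 steps, the final value is unchanged but two steps are executed twice, which is the re-execution cost discussed later.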
Slide9
Things to think about (outline)
How frequently to checkpoint?
How important is checkpoint time?
  And what can we do to improve it?
Who decides to take a checkpoint?
  System or user
Do we really have to stop everyone to take a checkpoint?
  Coordinated vs. uncoordinated checkpointing
  Always (for now) coordinated restart
Great survey paper:
  ElNozahy et al., “A Survey of Rollback-Recovery Protocols in Message-Passing Systems” (http://www.cs.utexas.edu/~lorenzo/papers/SurveyFinal.pdf)
Slide10
How often should we take a checkpoint?
Want to maximize efficiency
  Too frequent: time wasted on checkpointing
  Too infrequent: time wasted on re-execution
Can we optimally balance the two?
  Sure, let’s see how
Slide11
What determines CP/RS effectiveness?
How long between failures (fail-stops)
How long to repair and recover
How long to take a checkpoint
Slide12
Time between failures (AMTTI)
Not up to the CP/RS scheme: it is a system parameter
Currently ~5 hours and trending down
Slide13
Time to repair
If spare nodes: repair takes seconds (at most)
If no spare nodes: repair hidden by other jobs
  Deallocate and reallocate failed job
From the system perspective, repair time is very short
Slide14
Time to recover
Read state back in
  Usually similar to taking a checkpoint
Re-execute all lost work
Slide15
Time to take the checkpoint
Stop the system
  Depends on technique used, but can be fast
Copy state somewhere safe
  ~25MB/s per node typical global file system BW
  Worse when many nodes write together
  ~100GB per node
Continue execution
Kind of long, let’s see what that means
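To see what "kind of long" means, the two figures on this slide can be combined directly. A minimal sketch, assuming decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and that a node gets the full ~25MB/s for the whole copy; `checkpoint_seconds` is a hypothetical helper name.

```python
GB = 10**9  # assumption: decimal gigabytes
MB = 10**6  # assumption: decimal megabytes

def checkpoint_seconds(state_bytes, bandwidth_bytes_per_s):
    """Time to copy one node's state at a fixed per-node bandwidth."""
    return state_bytes / bandwidth_bytes_per_s

t = checkpoint_seconds(100 * GB, 25 * MB)
print(f"{t:.0f} s = {t / 60:.1f} min")
```

Under these assumptions a single checkpoint takes 4000 s, a little over an hour, which is a large fraction of a ~5 hour MTTI.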
Slide16
Estimating execution time with CP/RS
Excellent paper by J. Daly, Future Generation Computer Systems 22 (2006)
  Took figures and equations from that paper
Total = solve + checkpoint + reexec + repair
Slide17
First order model
No interrupts during checkpoint or recovery
Interrupts occur in the middle of an interval (expected)
Poisson process for interrupts
[Equation image not recovered; term labels: Total, Ideal, # checkpoints, recover time, # interrupts, repair]
Slide18
When interrupts are Poisson and the checkpoint interval << MTTI, then:
[Equation image not recovered]
Slide19
First-order equation:
Minimize time (control interval)
[Equation image not recovered]
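The equation images for the first-order model did not survive extraction. The following is a hedged reconstruction of the standard first-order argument, not necessarily the exact form used on the slides or in Daly's paper. The symbols are assumptions introduced here: $\tau$ is the compute time between checkpoints, $\delta$ the time to take a checkpoint, $R$ the recover-plus-repair time, and $M$ the MTTI. It also assumes $\delta \ll \tau \ll M$, so that the number of interrupts is roughly $T_{\text{solve}}/M$ and each interrupt loses half an interval on average.

```latex
\begin{align}
T_{\text{total}} &\approx T_{\text{solve}}
  + \underbrace{\frac{T_{\text{solve}}}{\tau}}_{\#\text{ checkpoints}} \delta
  + \underbrace{\frac{T_{\text{solve}}}{M}}_{\#\text{ interrupts}}
    \left(\frac{\tau}{2} + R\right) \\
\frac{dT_{\text{total}}}{d\tau} &= T_{\text{solve}}
  \left(-\frac{\delta}{\tau^{2}} + \frac{1}{2M}\right) = 0 \\
\tau_{\text{opt}} &\approx \sqrt{2\,\delta\,M}
\end{align}
```

This is the classic square-root rule; Daly's paper refines it with higher-order terms for the case where $\delta$ is not small compared with $M$.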
Slide20
How good is the first order model?
Great yesterday, w/ 5min checkpoints and 25h MTTI
Slide21
How good is the first order model?
OK today, w/ 5min checkpoints and 5h MTTI
Slide22
How good is the first order model?
Bad in the future (MTTI = 15min, checkpoint = 5min)
  Ignored interrupts during checkpoint and restart
  Ignored greater likelihood of failing early in a period
Slide23
So what is the rough efficiency?
Good today, but tomorrow...
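A rough efficiency estimate for the three scenarios (25h, 5h, and 15min MTTI with 5min checkpoints) can be sketched as follows. This is an assumed simplification, not the slides' exact model: the overhead per unit of useful work is taken as `ckpt/tau + (tau/2 + restart)/mtti`, the interval is the square-root rule `tau = sqrt(2 * ckpt * mtti)`, and the restart time is assumed equal to the checkpoint time. The function name is hypothetical.

```python
import math

def first_order_efficiency(mtti_min, ckpt_min, restart_min=None):
    """Return (interval, efficiency) under a simple first-order overhead model."""
    if restart_min is None:
        restart_min = ckpt_min                 # assumption: restart costs about one checkpoint
    tau = math.sqrt(2 * ckpt_min * mtti_min)   # square-root rule for the interval
    overhead = ckpt_min / tau + (tau / 2 + restart_min) / mtti_min
    return tau, 1 / (1 + overhead)

for label, mtti in [("yesterday", 25 * 60), ("today", 5 * 60), ("future", 15)]:
    tau, eff = first_order_efficiency(mtti, ckpt_min=5)
    print(f"{label:9s} MTTI={mtti:5d} min  interval={tau:6.1f} min  efficiency={eff:.2f}")
```

Under these assumptions the estimate is roughly 92% for a 25h MTTI and 83% for 5h, and it falls below 50% at 15min, where the model itself is no longer trustworthy because it ignores interrupts during checkpoint and restart.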
Slide24
What can we do?
Improve reliability
  Expensive
Reduce checkpoint overhead?
Slide25
How can we reduce the checkpoint overhead?
Copy less state
Overlap copy and compute
Copy faster
Slide26
Copy less state?
Only preserve critical live state
  Ask programmer
  Analyze
Incremental checkpointing
  Only save the delta from prior checkpoint
  Need to track what changed (can use OS protection mechanisms)
  Garbage collection a big issue: when do we no longer need old state?
  Recovery much more complicated
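Incremental checkpointing can be illustrated with a small sketch. Note the swap: instead of the OS protection mechanisms mentioned on the slide, this version detects changed blocks by comparing per-block hashes, which is simpler to show in user code. The state is assumed to have a fixed size, and `BLOCK`, `block_hashes`, `incremental_checkpoint`, and `restore` are hypothetical names.

```python
import hashlib

BLOCK = 4  # bytes per block; tiny so the example is easy to follow

def block_hashes(data):
    """Hash every fixed-size block of the state."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

def incremental_checkpoint(data, prev_hashes):
    """Return (delta, hashes); delta maps block index -> bytes of changed blocks."""
    hashes = block_hashes(data)
    delta = {i: data[i * BLOCK:(i + 1) * BLOCK]
             for i, (new, old) in enumerate(zip(hashes, prev_hashes))
             if new != old}
    return delta, hashes

def restore(full, deltas):
    """Rebuild state from a full checkpoint plus the ordered chain of deltas."""
    blocks = [full[i:i + BLOCK] for i in range(0, len(full), BLOCK)]
    for delta in deltas:                 # recovery must replay every delta in order
        for i, chunk in delta.items():
            blocks[i] = chunk
    return b"".join(blocks)

full = b"aaaabbbbcccc"
d1, h1 = incremental_checkpoint(b"aaaaBBBBcccc", block_hashes(full))
d2, h2 = incremental_checkpoint(b"aaaaBBBBcccC", h1)
print(sorted(d1), sorted(d2), restore(full, [d1, d2]))
```

The `restore` function shows why recovery is more complicated and why garbage collection matters: the full checkpoint and every later delta must be kept until a new full checkpoint replaces them.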
Slide27
Overlap copy and compute?
Take quick checkpoint locally (e.g., to SSD)
  BW can be 2GB/s per node, close to a 100x improvement (80x over ~25MB/s)
Slowly copy out to global file system during computation
  What happens on interrupt? May have to roll back to previous full checkpoint
  “Burst buffers”
Can also use memprotect to copy the checkpoint in the background
  Similar to VM live migration
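The idea of a fast local checkpoint that is drained to the global file system in the background can be sketched with a thread. This is an illustration only: two ordinary directories stand in for the local SSD and the global file system, and `checkpoint_async` and the `.partial` suffix are hypothetical choices.

```python
import os
import shutil
import threading

def checkpoint_async(state_bytes, local_dir, global_dir, name):
    """Write the checkpoint locally (blocking), then drain it in the background."""
    local_path = os.path.join(local_dir, name)
    with open(local_path, "wb") as f:          # fast local write; compute waits only for this
        f.write(state_bytes)

    def drain():
        tmp = os.path.join(global_dir, name + ".partial")
        shutil.copyfile(local_path, tmp)       # slow copy overlaps with computation
        os.replace(tmp, os.path.join(global_dir, name))  # publish only once complete

    worker = threading.Thread(target=drain)
    worker.start()
    return worker                              # caller may join() before the next checkpoint
```

Publishing under the final name only after the copy finishes means an interrupt during the drain leaves the previous complete global checkpoint as the one to roll back to.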
Slide28
Copy faster
Build more BW to global file system
Partition and replicate file system
Use hierarchy?
Slide29
“Scalable checkpoint restart”
Take frequent checkpoints to memory
Less frequent to SSD
Less frequent to global file system
Can adjust number of levels
Moody et al., SC10
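A multi-level schedule of this kind can be sketched as a function that picks a storage level for each checkpoint. The ratios below (every 4th checkpoint to SSD, every 16th to the global file system) are made-up illustration values, not figures from Moody et al., and `checkpoint_level` is a hypothetical name.

```python
from collections import Counter

def checkpoint_level(n, ssd_every=4, global_every=16):
    """Pick the storage level for the n-th checkpoint (n starts at 1)."""
    if n % global_every == 0:
        return "global file system"   # rarest, slowest, survives the most failures
    if n % ssd_every == 0:
        return "ssd"
    return "memory"                   # most frequent, cheapest

print(Counter(checkpoint_level(n) for n in range(1, 17)))
```

Over 16 checkpoints this schedule writes 12 to memory, 3 to SSD, and 1 to the global file system, so the expensive copy is paid rarely while most failures can still be recovered from a recent cheap checkpoint.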
Slide30
Saving locally not very effective for node failure
Copy to a “buddy” instead
Coding can reduce memory requirements
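One simple form of coding is XOR parity across a group of nodes, sketched below as an assumed example (the slides do not say which code is meant). Instead of every node holding a full copy of a buddy's checkpoint, the group stores one parity block and can rebuild any single lost checkpoint. All checkpoints are assumed to have equal length, and the function names are hypothetical.

```python
from functools import reduce

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity(checkpoints):
    """Parity block protecting a group of equal-length checkpoints."""
    return reduce(xor_bytes, checkpoints)

def rebuild(surviving, parity_block):
    """Recover the single missing checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity_block)

ckpts = [b"node0data", b"node1data", b"node2data"]
p = parity(ckpts)
print(rebuild([ckpts[0], ckpts[2]], p))   # reconstructs node 1's checkpoint
```

With a group of k nodes the extra storage is one checkpoint per group rather than one per node, at the price of tolerating only one failure per group.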
Slide35
Who decides when to checkpoint?
User or system?
Requirement: consistent checkpoint
  Implication: no in-flight messages (for coordinated)
What gets checkpointed?
Slide36
Identifying consistent points
User knows
  Annotate
Runtime may know
  Certain runtime calls imply consistency
  MPI collectives and barriers
System doesn’t
  Must quiesce network to take checkpoint
  Often impractical, but SDN might help?
Slide37
What data should be checkpointed?
Programmer can identify minimal set
  But what if programmer missed something?
  Compiler analysis may help
System can use OS page table and protection mechanisms to checkpoint
Slide38
Both are in use today
Application-level libraries (more common)
  User can ask if a checkpoint is needed and then decide to call the checkpoint routine
  User can ask for a checkpoint periodically and the library can skip it
System-level
  Integrated with kernel
  Dumps entire application state (or incremental one)
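The two application-level usage styles above can be sketched with a small helper class. This is a hypothetical interface written for illustration, not the API of any real checkpoint library: `Checkpointer`, `need_checkpoint`, `checkpoint`, and the time-based policy are all assumptions.

```python
import pickle
import time

class Checkpointer:
    """Hypothetical application-level checkpoint helper (not a real library's API)."""

    def __init__(self, path, min_interval_s, clock=time.monotonic):
        self.path = path
        self.min_interval_s = min_interval_s
        self.clock = clock
        self.last = clock()

    def need_checkpoint(self):
        """Style 1: the application asks, then decides whether to call checkpoint()."""
        return self.clock() - self.last >= self.min_interval_s

    def checkpoint(self, state, force=False):
        """Style 2: the application calls periodically; early requests are skipped."""
        if not force and not self.need_checkpoint():
            return False
        with open(self.path, "wb") as f:
            pickle.dump(state, f)         # the application chooses what goes in `state`
        self.last = self.clock()
        return True
```

An application would typically call `checkpoint(state)` at the end of each outer iteration, which is also a natural consistent point, and let the helper decide whether enough time has passed.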
Slide40
Coordinated checkpointing is straightforward
  Find consistent time
  Checkpoint
Uncoordinated is not
  How to deal with in-flight messages?
  How to deal with in-flight remote loads/stores?
Slide41
Side note: message passing vs. global address space
Two-sided or one-sided communication
  Send/recv or put/get
Slide42
Uncoordinated checkpointing w/ messages
Distributed systems theory (and practice)
Checkpoint each process mostly independently
Goal is to enable rollback to a consistent state
  Everyone re-executes and agrees on all results
Figure from ElNozahy et al., ACM Comp Surveys, 06/2002
Slide43
The Domino Effect (Randell, 1975)
Figure from ElNozahy et al., ACM Comp Surveys, 06/2002
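The domino effect can be reproduced with a small recovery-line search. This is a simplified sketch under stated assumptions: every process restarts from a checkpoint, all times are on one global clock, each process has an initial checkpoint at time 0, and a set of checkpoints is inconsistent when some message was received before the receiver's checkpoint but sent after the sender's (an orphan message). The function name and the message tuple layout are hypothetical.

```python
from bisect import bisect_left

def recovery_line(checkpoints, messages):
    """Find the latest consistent set of checkpoints.

    checkpoints: {process: sorted checkpoint times, starting with 0}
    messages: list of (sender, send_time, receiver, recv_time)
    """
    cut = {p: times[-1] for p, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            # Orphan: the receiver's checkpoint reflects a message that the
            # sender, restarted from its own checkpoint, has not yet sent.
            if sent > cut[sender] and received < cut[receiver]:
                times = checkpoints[receiver]
                cut[receiver] = times[bisect_left(times, received) - 1]
                changed = True
    return cut

ckpts = {0: [0, 30, 70], 1: [0, 50, 90]}
msgs = [(1, 10, 0, 20), (0, 40, 1, 45), (1, 60, 0, 65), (0, 80, 1, 85)]
print(recovery_line(ckpts, msgs))        # every checkpoint pair has an orphan
print(recovery_line(ckpts, msgs[:2] + msgs[3:]))
```

In the first call each rollback exposes a new orphan message, so both processes fall all the way back to their initial state; dropping one message from the pattern stops the cascade early.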
Slide44
Coordinated non-blocking
Figure from ElNozahy et al., ACM Comp Surveys, 06/2002
Slide45
Communication-induced
Figure (label: “Useless”) from ElNozahy et al., ACM Comp Surveys, 06/2002
Slide46
Uncoordinated message logging
Pessimistic vs. optimistic
Figure from ElNozahy et al., ACM Comp Surveys, 06/2002
Slide47
Example protocol
Slide48
Figure from A. Guermouche et al., IPDPS 2012
Slide49
Sender or receiver logs?
Receiver is easier, but complicated if optimistic and interrupted
Sender is easier when logging, but recovery can be complicated
  As is possibly garbage collection
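Sender-side logging, with its recovery and garbage-collection steps, can be sketched for a single sender and receiver. This is an illustrative toy rather than any published protocol: messages carry sequence numbers, the receiver's checkpoint records the last sequence number it contains, and the class and method names are hypothetical.

```python
class LoggingSender:
    """Sender-side message log for one receiver (illustrative only)."""

    def __init__(self):
        self.log = {}        # sequence number -> payload
        self.next_seq = 0

    def send(self, payload, deliver):
        seq = self.next_seq
        self.log[seq] = payload            # logging is just a local copy: the easy part
        self.next_seq += 1
        deliver(seq, payload)

    def receiver_checkpointed(self, last_seq_in_checkpoint):
        """Garbage collection: drop messages already covered by the receiver's checkpoint."""
        for seq in [s for s in self.log if s <= last_seq_in_checkpoint]:
            del self.log[seq]

    def replay(self, last_seq_in_checkpoint, deliver):
        """Recovery: resend, in order, everything the restarted receiver has lost."""
        for seq in sorted(s for s in self.log if s > last_seq_in_checkpoint):
            deliver(seq, self.log[seq])
```

Even in this toy the sender must learn when the receiver checkpointed before it can free log entries, and a restarted receiver has to contact every sender to be replayed, which hints at why recovery and garbage collection are the complicated parts.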
Slide50
Non-blocking with message logging
Coordinate a consistent point to checkpoint, but don’t block
Start logging messages to eliminate domino effect
Stop logging when checkpoint consistent and done
Slide51
Hierarchical coordinated/uncoordinated groups
Log messages hierarchically between groups of grouped processes
More on this when discussing containment domains
Figure from A. Guermouche et al., IPDPS 2012
Slide52
Uncoordinated recovery?
Possible, but challenging
Figure from ElNozahy et al., ACM Comp Surveys, 06/2002
Slide53
Global address space way more complicated
One-sided communication happens without the remote side being aware of it
Fine-grained sync and comm
Very relaxed memory consistency models
Consistent points hard to find
  Quiescing the network too expensive
Active area of research (with few/no proofs yet)
Slide58
Redundant MPI
An alternative to checkpoint/restart