Toward Exascale Resilience - PowerPoint Presentation

Uploaded On 2022-06-07





Presentation Transcript

Slide1

Toward Exascale Resilience
Part 6: System protection with checkpoint/restart

Mattan Erez

The University of Texas at Austin

July 2015

Slide2

Hardware reliability never perfect
  Not worth the cost
  Though can get close with high cost
Software also has errors
  Can never fully debug


(c) Mattan Erez

Slide3

Application and system must go on

Slide4

Detectcontainrepairrecover4

(c) Mattan Erez

Slide5

Detection is most critical piece
We're in luck: silent data corruption is very rare
  Strong ECC
  Not very vulnerable logic
The paranoid among us aren't convinced though
  More when we talk about cross-layer approaches

Slide6

What to do on detected errors?
Simplest idea: failstop!
  Contain the error and don't let it propagate

Slide7

If system stopped, can the application continue?
Yes, if prepared
Checkpoint/restart

Slide8

Periodically, take checkpoint
  Stop the system
  Copy all state somewhere safe
  Keep going
Rollback
  Recover saved state
  Restart
  Recompute
  and keep going
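The checkpoint and rollback steps above can be sketched as a toy loop. This is only an illustration under stated assumptions: the function name `run_with_checkpoints`, the dictionary used as "state", and the random failure injection are all hypothetical, and an in-memory deep copy stands in for "somewhere safe".

```python
import copy
import random


def run_with_checkpoints(total_steps, interval, fail_prob, seed=0):
    """Toy computation with periodic checkpoints and rollback on failstop errors."""
    rng = random.Random(seed)
    state = {"step": 0, "value": 0}
    saved = copy.deepcopy(state)  # initial checkpoint (hypothetical "safe" copy)
    rollbacks = 0
    while state["step"] < total_steps:
        if rng.random() < fail_prob:
            # Failstop: discard live state, recover the saved state, then recompute.
            state = copy.deepcopy(saved)
            rollbacks += 1
            continue
        # One unit of useful (deterministic) work.
        state["value"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            # Periodic checkpoint: copy all state somewhere safe, then keep going.
            saved = copy.deepcopy(state)
    return state["value"], rollbacks


result, rollbacks = run_with_checkpoints(total_steps=100, interval=10, fail_prob=0.05)
print(result, rollbacks)
```

Because the work is deterministic, the final value is the same with or without failures; only the amount of recomputation changes.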

Slide9

Things to think about (outline)
How frequently to checkpoint?
How important is checkpoint time?
  And what can we do to improve it
Who decides to take a checkpoint?
  System or user
Do we really have to stop everyone to take a checkpoint?
  Coordinated vs. uncoordinated checkpointing
  Always (for now) coordinated restart
Great survey paper: ElNozahy et al., “A Survey of Rollback-Recovery Protocols in Message-Passing Systems” (http://www.cs.utexas.edu/~lorenzo/papers/SurveyFinal.pdf)

Slide10

How often should we take a checkpoint?
Want to maximize efficiency
  Too frequent: time wasted on checkpointing
  Too infrequent: time wasted on re-execution
Can we optimally balance the two?
  Sure, let's see how

Slide11

What determines CP/RS effectiveness?
How long between failures (failstops)
How long to repair and recover
How long to take checkpoint

Slide12

Time between failures (AMTTI)
Not up to CP/RS scheme: system parameter
Currently ~5 hours and trending down

Slide13

Time to repair
If spare nodes: repair takes seconds (at most)
If no spare nodes: repair hidden by other jobs
  Deallocate and reallocate failed job
From system perspective, repair time very short

Slide14

Time to recover
Read state back in
  Usually similar to taking checkpoint
Re-execute all lost work

Slide15

Time to take the checkpoint
Stop the system
  Depends on technique used, but can be fast
Copy state somewhere safe
  ~25MB/s per node typical global file system BW
  Worse when many nodes write together
  ~100GB per node
Continue execution
Kind of long, let's see what that means
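To see why this is "kind of long", the slide's round numbers can be plugged into a one-line estimate. This is a back-of-the-envelope sketch only; the helper name `checkpoint_seconds` is hypothetical, and decimal units (1 GB = 1000 MB) are assumed.

```python
def checkpoint_seconds(state_gb, bandwidth_mb_per_s):
    """Time to write state_gb gigabytes at bandwidth_mb_per_s (assumes 1 GB = 1000 MB)."""
    return state_gb * 1000 / bandwidth_mb_per_s


# ~100 GB of state per node written at ~25 MB/s per node to the global file system.
global_fs = checkpoint_seconds(100, 25)
print(f"{global_fs:.0f} s = {global_fs / 60:.1f} min")  # prints "4000 s = 66.7 min"
```

At these rates a single full checkpoint takes over an hour per node, before accounting for contention when many nodes write together.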

Slide16

Estimating execution time with CP/RS
Excellent paper by J. Daly, Future Generation Computer Systems 22 (2006)
Took figures and equations from that paper
Total = solve + checkpoint + reexec + repair

Slide17

First order model
No interrupts during checkpoint or recovery
Interrupts occur in the middle of an interval (expected)
Poisson process for interrupts
[Equation not preserved in transcript; labeled terms: total, ideal, # checkpoints, recover time, # interrupts, repair]

Slide18

When Poisson and when checkpoint interval << MTTI then:
[Equation not preserved in transcript]

Slide19

First-order equation:
Minimize time (control interval)
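The slide's equations did not survive transcription, so the following is not a reconstruction of them. It is a minimal sketch of the classic first-order result (Young's approximation, which Daly's paper refines), under the assumptions listed on the previous slides: interrupts are Poisson, each interrupt loses half an interval on average, and the interval is much shorter than the MTTI. The names `waste_fraction`, `optimal_interval`, `delta` (checkpoint time), and `mtti` are chosen here for illustration.

```python
import math


def waste_fraction(tau, delta, mtti):
    """First-order overhead per unit of work: checkpoint cost each interval
    plus the expected rework of half an interval per interrupt."""
    return delta / tau + tau / (2 * mtti)


def optimal_interval(delta, mtti):
    """Interval that minimizes waste_fraction: tau = sqrt(2 * delta * mtti)."""
    return math.sqrt(2 * delta * mtti)


delta, mtti = 5.0, 300.0  # minutes: 5 min checkpoints, 5 h MTTI
tau = optimal_interval(delta, mtti)

# Numerical sanity check: scan intervals from 1.0 to 299.9 minutes in 0.1 steps.
grid = [t / 10 for t in range(10, 3000)]
best = min(grid, key=lambda t: waste_fraction(t, delta, mtti))
print(f"analytic {tau:.1f} min, grid search {best:.1f} min")
```

Setting the derivative of `delta / tau + tau / (2 * mtti)` to zero gives the square-root form directly, and the grid search should land within one grid step of it.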

Slide20

How good is the first order model?
Great yesterday, w/ 5min checkpoints and 25h MTTI

Slide21

How good is the first order model?
OK today, w/ 5min checkpoints and 5h MTTI

Slide22

How good is the first order model?
Bad in the future (MTTI = 15min, checkpoint = 5min)
  Ignored interrupts during checkpoint and restart
  Ignored greater likelihood of failing early in a period

Slide23

So what is the rough efficiency?
Good today, but tomorrow
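A rough feel for "good today, but tomorrow" can be had by evaluating a first-order waste estimate for the three scenarios mentioned earlier (5 min checkpoints with 25 h, 5 h, and 15 min MTTI). This is a sketch under assumptions, not the slide's own figure: it uses the Young-style waste term `delta / tau + tau / (2 * mtti)` at `tau = sqrt(2 * delta * mtti)`, ignores repair time, and is known to be unreliable when the MTTI approaches the checkpoint time. The function name `first_order_efficiency` is hypothetical.

```python
import math


def first_order_efficiency(delta, mtti):
    """Fraction of time spent on useful work at the first-order optimal interval."""
    tau = math.sqrt(2 * delta * mtti)
    waste = delta / tau + tau / (2 * mtti)
    return max(0.0, 1.0 - waste)


scenarios = [("yesterday", 25 * 60), ("today", 5 * 60), ("future", 15)]  # MTTI in minutes
for label, mtti_min in scenarios:
    print(label, round(first_order_efficiency(5, mtti_min), 2))
```

Under these assumptions the estimate drops from above 90% to roughly 80% and then below 20%, which is the trend the slide points at; the last number should be read as "the model has broken down" rather than as a prediction.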

Slide24

What can we do?
Improve reliability
  Expensive
Reduce checkpoint overhead?

Slide25

How can we reduce the checkpoint overhead?
Copy less state
Overlap copy and compute
Copy faster

Slide26

Copy less state?
Only preserve critical live state
  Ask programmer
  Analyze
Incremental checkpointing
  Only save the delta from prior checkpoint
  Need to track what changed (can use OS protection mechanisms)
  Garbage collection a big issue: when do we no longer need old state?
  Recovery much more complicated
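Incremental checkpointing can be illustrated with a small sketch. Note the swapped-in technique: instead of the OS page-protection tracking mentioned on the slide, this example detects changed blocks by comparing hashes, which is simpler to show in a few lines. The block size, the function names, and the fixed-size byte-string "state" are all assumptions made for illustration.

```python
import hashlib

BLOCK = 4  # bytes per block (tiny, for illustration only)


def block_hashes(data):
    """Hash each fixed-size block of the state."""
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]


def incremental_checkpoint(data, prev_hashes):
    """Return ({block_index: bytes} for blocks that changed, new hash list)."""
    hashes = block_hashes(data)
    delta = {i: data[i * BLOCK:(i + 1) * BLOCK]
             for i, h in enumerate(hashes)
             if i >= len(prev_hashes) or h != prev_hashes[i]}
    return delta, hashes


def restore(base, deltas):
    """Rebuild state from a full checkpoint plus every later delta, in order."""
    buf = bytearray(base)
    for d in deltas:
        for i, blk in d.items():
            buf[i * BLOCK:i * BLOCK + len(blk)] = blk
    return bytes(buf)


state0 = b"aaaabbbbccccdddd"
hashes = block_hashes(state0)          # full checkpoint of state0
state1 = b"aaaaBBBBccccdddd"
d1, hashes = incremental_checkpoint(state1, hashes)
state2 = b"aaaaBBBBccccDDDD"
d2, hashes = incremental_checkpoint(state2, hashes)
recovered = restore(state0, [d1, d2])
```

The `restore` call shows why recovery and garbage collection get harder: the base checkpoint and every delta since it must be kept until a new full checkpoint replaces them.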

Slide27

Overlap copy and compute?
Take quick checkpoint locally (e.g., to SSD)
  BW can be 2GB/s per node: 100x improvement
Slowly copy out to global file system during computation
  What happens on interrupt? May have to rollback to previous full checkpoint
  “Burst buffers”
Can also use memprotect to copy the checkpoint in the background
  Similar to VM live migration

Slide28

Copy faster
Build more BW to global file system
  Partition and replicate file system
Use hierarchy?

Slide29

“Scalable checkpoint restart”
Take frequent checkpoints to memory
Less frequent to SSD
Less frequent to global file system
Can adjust number of levels
Moody et al., SC10
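The multi-level idea can be sketched as a simple schedule. This is not the API of any real checkpoint library; the function name `checkpoint_level` and the ratios (every 4th checkpoint to SSD, every 16th to the global file system) are hypothetical values picked for illustration.

```python
from collections import Counter


def checkpoint_level(n, ssd_every=4, gfs_every=16):
    """Return the storage level used for checkpoint number n (1-based).

    The slowest, most resilient level is checked first so that it wins when
    a checkpoint number is a multiple of both periods.
    """
    if n % gfs_every == 0:
        return "global_fs"
    if n % ssd_every == 0:
        return "ssd"
    return "memory"


schedule = [checkpoint_level(n) for n in range(1, 17)]
counts = Counter(schedule)
print(counts)
```

With these assumed ratios, most checkpoints go to the fastest level and only one in sixteen pays the global file system cost; adding or removing a level means adding or removing one branch.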

Slide30

Saving locally not very effective for node failure
Copy to a “buddy” instead
Coding can reduce memory requirements
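One way coding can reduce memory compared with a full buddy copy is XOR parity across a group of nodes. The sketch below is a generic illustration, not the specific scheme used by any system named in the slides; it assumes equal-length checkpoints and a single node failure per group, and the names `xor_bytes` and `parity` are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(blocks):
    """XOR of all blocks; stored on a node outside the group (assumption)."""
    return reduce(xor_bytes, blocks)


checkpoints = [b"node0 st", b"node1 st", b"node2 st"]  # equal length by construction
p = parity(checkpoints)

# Node 1 fails: rebuild its checkpoint from the survivors plus the parity block.
recovered = parity([checkpoints[0], checkpoints[2], p])
print(recovered == checkpoints[1])
```

A buddy copy stores one extra full checkpoint per node, while parity over a group of N nodes stores one extra checkpoint per group, at the price of needing all survivors to rebuild the lost one.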

Slide31


Slide32


Slide33


Slide34

Things to think about (outline)
How frequently to checkpoint?
How important is checkpoint time?
  And what can we do to improve it
Who decides to take a checkpoint?
  System or user
Do we really have to stop everyone to take a checkpoint?
  Coordinated vs. uncoordinated checkpointing
  Always (for now) coordinated restart
Great survey paper: ElNozahy et al., “A Survey of Rollback-Recovery Protocols in Message-Passing Systems” (http://www.cs.utexas.edu/~lorenzo/papers/SurveyFinal.pdf)

Slide35

Who decides when to checkpoint
  User or system?
Requirement: consistent checkpoint
  Implication: no inflight messages (for coordinated)
What gets checkpointed?

Slide36

Identifying consistent points
User knows
  Annotate
Runtime may know
  Certain runtime calls imply consistency
  MPI collectives and barriers
System doesn't
  Must quiesce network to take checkpoint
  Often impractical, but SDN might help?

Slide37

What data should be checkpointed?
Programmer can identify minimal set
  But what if programmer missed something?
  Compiler analysis may help
System can use OS page table and protection mechanisms to checkpoint

Slide38

Both are in use today
Application-level libraries (more common)
  User can ask if a checkpoint is needed and then decide to call the checkpoint routine
  User can ask for a checkpoint periodically and the library can skip it
System-level
  Integrated with kernel
  Dumps entire application state (or incremental one)

Slide39

Things to think about (outline)
How frequently to checkpoint?
How important is checkpoint time?
  And what can we do to improve it
Who decides to take a checkpoint?
  System or user
Do we really have to stop everyone to take a checkpoint?
  Coordinated vs. uncoordinated checkpointing
  Always (for now) coordinated restart
Great survey paper: ElNozahy et al., “A Survey of Rollback-Recovery Protocols in Message-Passing Systems” (http://www.cs.utexas.edu/~lorenzo/papers/SurveyFinal.pdf)

Slide40

Coordinated checkpointing is straightforward
  Find consistent time
  Checkpoint
Uncoordinated is not
  How to deal with in-flight messages?
  How to deal with in-flight remote loads/stores?

Slide41

Side note: message passing vs. global address space
Two sided or one sided communication
Send/recv or put/get

Slide42

Uncoordinated checkpointing w/ messages
Distributed systems theory (and practice)
Checkpoint each process mostly independently
Goal is to enable rollback to a consistent state
  Everyone re-executes and agrees on all results
From ElNozahy et al., ACM Comp Surveys, 06/2002

Slide43

The Domino Effect (Randell, 1975)
From ElNozahy et al., ACM Comp Surveys, 06/2002
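The domino effect can be reproduced with a small sketch. The original slide showed a figure that did not survive transcription, so the example below is a hypothetical two-process scenario, not the figure's. It assumes each message is described by per-process local times, treats a set of checkpoints as inconsistent when some message was received before the receiver's checkpoint but sent after the sender's (an orphan message), and rolls receivers back until no orphan remains. The names `is_consistent` and `recovery_line` are chosen here for illustration.

```python
def is_consistent(cut, messages):
    """cut: {process: checkpoint_time}; messages: (sender, send_t, receiver, recv_t).

    Inconsistent if any message is recorded as received but not as sent.
    """
    return not any(send_t > cut[s] and recv_t <= cut[r]
                   for s, send_t, r, recv_t in messages)


def recovery_line(checkpoints, messages):
    """Start from the latest checkpoints and roll back until the cut is consistent.

    checkpoints: {process: sorted checkpoint times, starting with 0 (initial state)}.
    """
    cut = {p: times[-1] for p, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for s, send_t, r, recv_t in messages:
            if send_t > cut[s] and recv_t <= cut[r]:
                # Orphan message: the receiver must roll back to before the receive.
                cut[r] = max(t for t in checkpoints[r] if t < recv_t)
                changed = True
    return cut


checkpoints = {"P0": [0, 40, 80], "P1": [0, 60, 100]}
messages = [
    ("P1", 10, "P0", 20),
    ("P0", 45, "P1", 55),
    ("P1", 65, "P0", 75),
    ("P0", 85, "P1", 95),
]
line = recovery_line(checkpoints, messages)
print(line)
```

In this scenario each rollback orphans an earlier message, so both processes cascade all the way back to their initial state. Messages that are sent before the cut and received after it are not orphans, but they are in flight and would need to be logged or resent.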

Slide44

Coordinated non-blocking
From ElNozahy et al., ACM Comp Surveys, 06/2002

Slide45

Communication-induced
(figure label: Useless)
From ElNozahy et al., ACM Comp Surveys, 06/2002

Slide46

Uncoordinated message logging
Pessimistic vs. optimistic
From ElNozahy et al., ACM Comp Surveys, 06/2002
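A minimal sketch of the pessimistic flavor, under stated assumptions: the receiver logs each message before applying it, a Python list stands in for stable storage, and message processing is deterministic so replay reproduces the lost state. The class `LoggedProcess` and its methods are hypothetical names; an optimistic variant would instead log asynchronously and accept the risk of losing recent log entries.

```python
class LoggedProcess:
    """Receiver-based pessimistic logging: log each message, then apply it."""

    def __init__(self):
        self.state = 0
        self.checkpoint = 0
        self.log = []  # stands in for stable storage (assumption)

    def deliver(self, msg):
        self.log.append(msg)  # log first (pessimistic)
        self.state += msg     # then apply; processing is deterministic

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()      # messages before the checkpoint are no longer needed

    def recover(self):
        # Restore the local checkpoint and replay logged messages in order.
        self.state = self.checkpoint
        for msg in self.log:
            self.state += msg


p = LoggedProcess()
p.deliver(1)
p.deliver(2)
p.take_checkpoint()   # independent, uncoordinated checkpoint
p.deliver(3)
p.deliver(4)
before_failure = p.state
p.state = -1          # simulate losing the live state
p.recover()
print(p.state == before_failure)
```

Because the failed process rebuilds its pre-failure state from its own checkpoint and log, no other process has to roll back, which is how logging avoids the domino effect.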

Slide47

Example protocol

Slide48

From A. Guermouche et al., IPDPS 2012

Slide49

Sender or receiver logs?
Receiver is easier, but complicated if optimistic and interrupted
Sender is easier when logging, but recovery can be complicated
  As is possibly garbage collection

Slide50

Non-blocking with message logging
Coordinate a consistent point to checkpoint, but don't block
Start logging messages to eliminate domino effect
Stop logging when checkpoint consistent and done

Slide51

Hierarchical coordinated/uncoordinated groups
Log messages hierarchically between groups of grouped processes
More on this when discussing containment domains
From A. Guermouche et al., IPDPS 2012

Slide52

Uncoordinated recovery?
Possible, but challenging
From ElNozahy et al., ACM Comp Surveys, 06/2002

Slide53

Global address space way more complicated
One sided communication happens without the remote side being aware of the communication
Fine-grained sync and comm
Very relaxed memory consistency models
Consistent points hard to find
  Quiescing the network too expensive
Active area of research (with few/no proofs yet)

Slide54


Slide55

Detectcontainrepairrecover55

(c) Mattan Erez

Slide56

Periodically, take checkpoint
  Stop the system
  Copy all state somewhere safe
  Keep going
Rollback
  Recover saved state
  Restart
  Recompute
  and keep going

Slide57

Things to think about (outline)
How frequently to checkpoint?
How important is checkpoint time?
  And what can we do to improve it
Who decides to take a checkpoint?
  System or user
Do we really have to stop everyone to take a checkpoint?
  Coordinated vs. uncoordinated checkpointing
  Always (for now) coordinated restart
Great survey paper: ElNozahy et al., “A Survey of Rollback-Recovery Protocols in Message-Passing Systems” (http://www.cs.utexas.edu/~lorenzo/papers/SurveyFinal.pdf)

Slide58

Redundant MPI
An alternative to checkpoint/restart